I am using Python's multiprocessing module to scrape a website that has over 100,000 pages. I want to put every 500 pages I retrieve into a separate folder. The problem is that although each new folder is created successfully, my script only populates the previous folder. Here is the code:

    import os
    import time
    from multiprocessing import Pool

    # Folder bounds; the main block below updates them between batches.
    a = 1
    b = 500

    def fetchAfter(y):
        global a
        global b
        strfile = "E:\\A\\B\\" + str(a) + "-" + str(b) + "\\" + str(y) + ".html"
        if not os.path.exists(strfile):
            f = open(strfile, "w")

    if __name__ == '__main__':
        start = time.time()
        for i in range(1,3):
            os.makedirs("E:\\Results\\Class 9\\" + str(a) + "-" + str(b))
            pool = Pool(processes=12)
            pool.map(fetchAfter, range(a,b))
            pool.close()
            pool.join()
            a = b
            b = b + 500
        print(time.time()-start)

About the global keyword: as far as I can tell, removing it from your script won't change a thing.

The overhead of multiprocessing.Pool is not very high if you're scraping 100000 pages; by default it creates the processes only once at the beginning of the run and executes the worker function repeatedly in each of the children.

Asynchronous I/O (a select() loop or gevent greenlets) is far superior to either threads or multiprocessing. Not to mention using another Python interpreter.
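
On the folder problem itself, here is a minimal sketch of one way around it, not a drop-in fix. The paths suggest Windows, where each worker process re-imports the script, so inside fetchAfter the names a and b hold the module-level values and never see the updates made in the main block. The sketch sends the target location to the worker as part of each task and creates the pool once. The names folder_name, save_page and scrape are made up for illustration, the folders are labelled 1-500, 501-1000 and so on (an assumption; the loop in the question would produce 1-500, 500-1000), and the worker writes a placeholder file where the real download would go.

```python
import os
import tempfile
from multiprocessing import Pool

BATCH_SIZE = 500  # pages per folder, as in the question


def folder_name(page, batch_size=BATCH_SIZE):
    """Return the 'first-last' label of the folder that holds this page."""
    first = ((page - 1) // batch_size) * batch_size + 1
    return "%d-%d" % (first, first + batch_size - 1)


def save_page(task):
    """Worker: everything it needs arrives in `task`, not in module globals."""
    root, page = task
    path = os.path.join(root, folder_name(page), "%d.html" % page)
    if not os.path.exists(path):
        with open(path, "w") as f:
            # Placeholder content; the real download and write would go here.
            f.write("placeholder for page %d\n" % page)
    return path


def scrape(root, first_page, last_page, processes=4):
    """Create the folders in the parent, then map all pages over one pool."""
    pages = list(range(first_page, last_page + 1))
    for name in sorted({folder_name(p) for p in pages}):
        target = os.path.join(root, name)
        if not os.path.isdir(target):
            os.makedirs(target)
    pool = Pool(processes=processes)  # created once for the whole run
    try:
        return pool.map(save_page, [(root, p) for p in pages])
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    demo_root = tempfile.mkdtemp()  # throwaway directory for the demo
    scrape(demo_root, 498, 503)     # a few pages on both sides of a boundary
    print(sorted(os.listdir(demo_root)))
```

Because every task carries its own root and page number, the result no longer depends on which process handles it or when that process was started.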