Python MultiProcessing and Directory Creation

Question

I am using Python's multiprocessing module to scrape a website that has over 100,000 pages. I want to put every 500 pages I retrieve into a separate folder. The problem is that although each new folder is created successfully, my script only populates the previous folder. Here is the code:

import os
import time
from multiprocessing import Pool

a = 1
b = 500

def fetchAfter(y):
    global a
    global b
    strfile = "E:\\A\\B\\" + str(a) + "-" + str(b) + "\\" + str(y) + ".html"
    if not os.path.exists(strfile):
        f = open(strfile, "w")

if __name__ == '__main__':
    start = time.time()
    for i in range(1,3):
        os.makedirs("E:\\Results\\Class 9\\" + str(a) + "-" + str(b))
        pool = Pool(processes=12)
        pool.map(fetchAfter, range(a,b))
        pool.close()
        pool.join()
        a = b
        b = b + 500

    print(time.time()-start)

off topic: There is really no need to use the global keyword. As far as I can tell, removing it from your script won't change a thing.
– mgilson, Commented Aug 23, 2012 at 19:13
@user1343318: The owner of the website might not appreciate it if you start scraping his site at full speed...
– Roland Smith, Commented Aug 23, 2012 at 20:37
@JoelCornett: The process creation overhead of multiprocessing.Pool is not very high if you're scraping 100000 pages; by default it only creates the processes once at the beginning of the run, and executes the worker function repeatedly in each of the children.
– Roland Smith, Commented Aug 23, 2012 at 20:42
@JoelCornett: It seems to me that the usual advice when people ask how to speed up their program is to use threads. Depending on what one is trying to accomplish, an event-driven architecture (select() loop or gevent greenlets) could very well be far superior to either threads or multiprocessing. Not to mention using another Python interpreter.
– Roland Smith, Commented Aug 23, 2012 at 21:17

Roland Smith · Accepted Answer · 2012-08-23 21:08:30Z

It is best for the worker function to rely only on the single argument it receives to determine what to do, because that is the only information it gets from the parent process each time it is called. This argument can be almost any Python object (including a tuple, dict, or list), so you're not really limited in the amount of information you can pass to a worker.

So make a list of 2-tuples. Each 2-tuple should consist of (1) the file to get and (2) the directory where to stash it. Feed that list of tuples to map(), and let it rip.
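
One way to build such a list for the 500-pages-per-folder layout is sketched below. This is only an illustration: the function name build_worklist, its parameters, the "pages" output root and the URL pattern are all assumptions, not something taken from the question or from any library.

```python
import os

def build_worklist(first_page, last_page, pages_per_dir, root, url_template):
    """Return a list of (url, directory) 2-tuples.

    Pages are grouped into directories of at most `pages_per_dir` pages,
    named "<first>-<last>" after the page numbers they hold.
    """
    worklist = []
    for start in range(first_page, last_page + 1, pages_per_dir):
        end = min(start + pages_per_dir - 1, last_page)
        directory = os.path.join(root, "%d-%d" % (start, end))
        for page in range(start, end + 1):
            worklist.append((url_template % page, directory))
    return worklist

if __name__ == "__main__":
    # Hypothetical URL pattern and output root, chosen only for illustration.
    work = build_worklist(1, 1200, 500, "pages", "http://www.foo.org/page%d.html")
    directories = sorted(set(d for _, d in work))
    print(len(work), directories)
```

The set of distinct directories can be created once in the parent process before the whole list is handed to a single map() call.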

I'm not sure if it is useful to specify the number of processes you want to use. Pool generally uses as many processes as your CPU has cores. That is usually enough to max out all the cores. :-)

BTW, you should only call map() once. And since map() blocks until everything is done, there is no need to call join().

Edit: Added example code below.

import multiprocessing
import requests
import os

def processfile(arg):
    """Worker function to scrape the pages and write them to a file.

    Keyword arguments:
    arg -- 2-tuple containing the URL of the page and the directory
           where to save it.
    """
    # Unpack the arguments
    url, savedir = arg

    # It might be a good idea to put a random delay of a few seconds here, 
    # so we don't hammer the webserver!

    # Scrape the page. Requests rules ;-)
    r = requests.get(url)
    # Write it, keep the original HTML file name.
    fname = url.split('/')[-1]
    with open(savedir + '/' + fname, 'w+') as outfile:
        outfile.write(r.text)

def main():
    """Main program.
    """
    # This list of tuples should hold all the pages... 
    # Up to you how to generate it, this is just an example.
    worklist = [('http://www.foo.org/page1.html', 'dir1'), 
                ('http://www.foo.org/page2.html', 'dir1'), 
                ('http://www.foo.org/page3.html', 'dir2'), 
                ('http://www.foo.org/page4.html', 'dir2')]
    # Create output directories
    dirlist = ['dir1', 'dir2']
    for d in dirlist:
        if not os.path.isdir(d):  # os.makedirs() raises if the directory exists
            os.makedirs(d)
    p = multiprocessing.Pool()
    # Let'er rip!
    p.map(processfile, worklist)
    p.close()

if __name__ == '__main__':
    main()
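
The random delay suggested in the worker's comment could look roughly like the sketch below. The helper name polite_delay and its default bounds are assumptions made for illustration; pick values that suit the site you are scraping.

```python
import random
import time

def polite_delay(min_seconds=1.0, max_seconds=3.0):
    """Sleep for a random interval and return the number of seconds slept."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay

if __name__ == "__main__":
    # Tiny bounds so the demonstration finishes quickly.
    print("slept for %.3f seconds" % polite_delay(0.0, 0.05))
```

Calling polite_delay() at the top of processfile, just before requests.get(url), would space out the requests of each worker. Every worker sleeps independently, so the overall request rate still grows with the number of processes in the pool.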

Roland, can you please write some code to go along with the explanation? I will be deeply grateful.

Community · Accepted Answer · 2017-05-23 11:54:12Z

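Multiprocessing, as the name implies, uses separate processes. The processes you create with your Pool do not have access to the values of a and b that you are adding 500 to in the main program. On Windows, each worker process re-imports your script without running the `if __name__ == '__main__'` block, so inside fetchAfter a and b always have their initial values of 1 and 500. See this previous question.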

The easiest solution is to just refactor your code so that you pass a and b to fetchAfter (in addition to passing y).
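
A minimal sketch of that refactoring follows, assuming the values are packed into one tuple per page so that Pool.map() can still pass a single argument. The names fetch_after and make_tasks, the temporary output directory and the placeholder file contents are illustrative assumptions; the real worker would download the page instead.

```python
import os
import tempfile
from multiprocessing import Pool

def fetch_after(task):
    """Worker: everything it needs arrives in the single `task` argument."""
    root, a, b, y = task
    folder = os.path.join(root, "%d-%d" % (a, b))
    path = os.path.join(folder, "%d.html" % y)
    if not os.path.exists(path):
        with open(path, "w") as f:
            # Placeholder content; real code would fetch and write the page here.
            f.write("placeholder for page %d\n" % y)
    return path

def make_tasks(root, a, b):
    """One (root, a, b, y) tuple per page number y in range(a, b)."""
    return [(root, a, b, y) for y in range(a, b)]

if __name__ == "__main__":
    root = tempfile.mkdtemp()  # stand-in for the real output directory
    a, b = 1, 5                # small batch for the demo; the question uses 500
    with Pool(processes=2) as pool:
        for _ in range(2):
            os.makedirs(os.path.join(root, "%d-%d" % (a, b)))
            pool.map(fetch_after, make_tasks(root, a, b))
            a, b = b, b + 4
    print(sorted(os.listdir(root)))
```

Because a and b travel with every task, the workers no longer depend on module-level globals, and the pool only has to be created once.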

answered Aug 23, 2012 at 19:25 by BrenBarn; edited May 23, 2017 at 11:54 by CommunityBot

2 Comments

user1343318 Over a year ago

Okay, I tried pool.map(fetchAfter, range(a,b), a, b), but now it says map takes at most 4 arguments, 5 given. P.S.: That is the reason I resorted to globals in the first place.

BrenBarn Over a year ago

The simplest way is to do as @Roland Smith suggests, and pass a single tuple containing the range, a, and b.

Community · Accepted Answer · 2017-05-23 12:32:32Z

Here's one way to implement it:

#!/usr/bin/env python
import logging
import multiprocessing as mp
import os
import urllib.parse
import urllib.request

def download_page(url_path):
    # url_path is a (url, path) pair; log failures instead of raising
    try:
        urllib.request.urlretrieve(*url_path)
        mp.get_logger().info('done %s' % (url_path,))
    except Exception as e:
        mp.get_logger().error('failed %s: %s' % (url_path, e))

def generate_url_path(rootdir, urls_per_dir=500):
    for i in range(100*1000):
        if i % urls_per_dir == 0: # make new dir
            dirpath = os.path.join(rootdir, '%d-%d' % (i, i+urls_per_dir))
            if not os.path.isdir(dirpath):
                os.makedirs(dirpath) # stop if it fails
        url = 'http://example.com/page?' + urllib.parse.urlencode(dict(number=i))
        path = os.path.join(dirpath, '%d.html' % (i,))
        yield url, path

def main():
    mp.log_to_stderr().setLevel(logging.INFO)

    pool = mp.Pool(4) # the task is IO-bound, so the number of processes
                      # need not match the number of CPUs
    for _ in pool.imap_unordered(download_page, generate_url_path(r'E:\A\B')):
        pass

if __name__ == '__main__':
    main()

See also "Python multiprocessing pool.map for multiple arguments" and the code "Brute force basic http authorization using httplib and multiprocessing" from "how to make HTTP in Python faster?".
