Python: Download multiple files quickly

Question

In Python how can I download a bunch of files quickly? urllib.urlretrieve() is very slow, and I'm not very sure how to go about this.

I have a list of 15-20 files to download, and it takes forever just to download one. Each file is about 2-4 mb.

I have never done this before, and I'm not really sure where I should start. Should I use threading and download a few at a time? Or should I use threading to download pieces of each file, but one file at a time, or should I even be using threading?

Jon Clements · Accepted Answer · 2012-12-23 20:35:19Z

1

urllib.urlretrieve() is very slow

Really? If you've got 15-20 files of 2-4mb each, then I'd just line 'em up and download 'em. The bottle neck is going to be the bandwith for your server and yourself. So IMHO, hardly worth threading or trying anything clever in this case...

answered Dec 23, 2012 at 20:35

Jon Clements

143k34 gold badges254 silver badges288 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

jackcogdill Over a year ago

Well I tried using pyaxel and it was very fast, but after about 20 files it says there's too many open files.

Jon Clements Over a year ago

@yentup might have been worth mentioning that in your question - but I stand by my opinion that a simple loop over the urls to be retrieved and just wait is better... if we're talking 10s of thousands of files (with gigs of data), across multiple pipelines then start looking to worry about it... but until then, I don't feel you have a "problem" as such

Omid Kamangar · Accepted Answer · 2012-12-23 20:43:53Z

1

One solution (which is not Python specific) is to save the download URLs in another file and download them using a download manager program such as wget or aria2. You can invoke the download manager from your Python program.

But as @Jon mentioned, this is not really necessary for your case. urllib.urlretrieve() is enough for it!

Another option is to use Mechanize to download the files.

answered Dec 23, 2012 at 20:43

Omid Kamangar

5,7889 gold badges44 silver badges70 bronze badges

Comments

Gaurav Shrivastava · Accepted Answer · 2020-10-19 04:20:15Z

1

Try using wget module of python. Here is a code snippet.

import wget
wget.download(url, out = path)

answered Oct 19, 2020 at 4:20

Gaurav Shrivastava

94312 silver badges20 bronze badges

Comments

miku · Accepted Answer · 2012-12-23 20:39:35Z

0

stream.py is an somewhat experimental, yet cute UI for parallel python (via threads or processes) based on ideas from data flow programming: An URL-retriever is provided in the examples:

https://github.com/aht/stream.py/blob/master/example/retrieve_urls.py

Since it's short:

#!/usr/bin/env python

"""
Demonstrate the use of a ThreadPool to simultaneously retrieve web pages.
"""

import urllib2
from stream import ThreadPool

URLs = [
    'http://www.cnn.com/',
    'http://www.bbc.co.uk/',
    'http://www.economist.com/',
    'http://nonexistant.website.at.baddomain/',
    'http://slashdot.org/',
    'http://reddit.com/',
    'http://news.ycombinator.com/',
]

def retrieve(urls, timeout=30):
    for url in urls:
        yield url, urllib2.urlopen(url, timeout=timeout).read()

if __name__ == '__main__':
    retrieved = URLs >> ThreadPool(retrieve, poolsize=4)
    for url, content in retrieved:
        print '%r is %d bytes' % (url, len(content))
    for url, exception in retrieved.failure:
        print '%r failed: %s' % (url, exception)

You would just need to replace urllib2.urlopen(url, timeout=timeout).read() with urlretrieve....

answered Dec 23, 2012 at 20:39

miku

189k47 gold badges314 silver badges317 bronze badges

1 Comment

blackmamba Over a year ago

Throws an error 'economist.com' failed: 'tuple' object has no attribute 'read'

Collectives™ on Stack Overflow

Python: Download multiple files quickly

4 Answers 4

2 Comments

Comments

Comments

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

2 Comments

Comments

Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related