
I have large data files stored in S3 that I need to analyze. Each batch consists of ~50 files, each of which can be analyzed independently.

I'd like to set up parallel downloads of the S3 data onto the EC2 instance, and set up triggers that start the analysis process on each file as soon as its download completes.

Are there any libraries that handle an async download, trigger-on-complete model?

If not, I'm thinking of setting up multiple download processes with pyprocessing, each of which will download and analyze a single file. Does that sound reasonable, or are there better alternatives?
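Roughly what I have in mind, as a minimal sketch: a pool of worker processes, each of which downloads one S3 object and analyzes it as soon as its download finishes. The bucket name, key names, and analyze() are placeholders, and boto3 is assumed for the S3 calls.

    # Sketch of the multiprocessing (pyprocessing) approach: each worker
    # downloads one key and starts the analysis the moment its file lands.
    import multiprocessing

    import boto3  # assumed S3 client library; names below are placeholders

    BUCKET = "my-data-bucket"                               # hypothetical bucket
    KEYS = ["batch/file-%02d.dat" % i for i in range(50)]   # hypothetical keys

    def analyze(path):
        # Placeholder for the real per-file analysis step.
        print("analyzing", path)

    def download_and_analyze(key):
        s3 = boto3.client("s3")      # one client per worker process
        local_path = "/tmp/" + key.replace("/", "_")
        s3.download_file(BUCKET, key, local_path)
        analyze(local_path)          # triggered as soon as this file is down
        return local_path

    if __name__ == "__main__":
        with multiprocessing.Pool(processes=8) as pool:
            pool.map(download_and_analyze, KEYS)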

3 Answers


Answering my own question: I ended up writing a simple modification to the Amazon S3 Python library that lets you download the file in chunks or read it line by line. Available here.
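The modification itself is linked above; purely as an illustration of the same idea, here is a minimal sketch of chunked and line-by-line reads using boto3's streaming response body (boto3, the bucket/key names, and the process_* handlers are assumptions, not part of the linked library):

    # Sketch: read an S3 object in chunks or line by line via boto3's
    # StreamingBody, without pulling the whole file into memory at once.
    import boto3

    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="my-data-bucket", Key="batch/file-00.dat")  # placeholders

    # Fixed-size chunks.
    for chunk in obj["Body"].iter_chunks(chunk_size=1024 * 1024):
        process_chunk(chunk)   # hypothetical handler

    # Or line by line (re-fetch, since the stream above has been consumed).
    obj = s3.get_object(Bucket="my-data-bucket", Key="batch/file-00.dat")
    for line in obj["Body"].iter_lines():
        process_line(line)     # hypothetical handler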




It sounds like you're looking for Twisted:

"Twisted is an event-driven networking engine written in Python and licensed under the MIT license."

http://twistedmatrix.com/trac/

I've used Twisted for quite a few asynchronous Python projects, involving both communication over the Internet and with subprocesses.
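As a rough illustration of the event-driven, download-then-trigger pattern (not your existing setup: the URLs and analyze() are placeholders, and fetching https URLs assumes Twisted's TLS extras are installed):

    # Sketch: fire off all downloads, and run analyze() in a callback as
    # each response body finishes arriving.
    from twisted.internet import reactor, defer
    from twisted.web.client import Agent, readBody

    URLS = [b"https://my-bucket.s3.amazonaws.com/batch/file-%02d.dat" % i
            for i in range(50)]   # hypothetical object URLs

    agent = Agent(reactor)

    def analyze(body, url):
        # Placeholder for the real per-file analysis step.
        print("analyzing %s (%d bytes)" % (url, len(body)))

    def download(url):
        d = agent.request(b"GET", url)
        d.addCallback(readBody)        # fires once the full body has arrived
        d.addCallback(analyze, url)    # trigger analysis on completion
        return d

    done = defer.gatherResults([download(u) for u in URLS])
    done.addBoth(lambda _: reactor.stop())
    reactor.run()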



I don't know of anything that already exists that does exactly what you're looking for, but it should be reasonably easy to put together with Python. For a threaded approach, you might take a look at this Python recipe that does multi-threaded HTTP downloads for testing download mirrors.
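The recipe isn't reproduced here, but the same threaded download-then-trigger pattern can be sketched with the standard library's concurrent.futures (the bucket, keys, and analyze() are placeholders, and boto3 is assumed for the S3 calls):

    # Sketch: download files on a thread pool and kick off analysis for
    # each one as soon as its download completes.
    from concurrent.futures import ThreadPoolExecutor, as_completed

    import boto3

    BUCKET = "my-data-bucket"                               # hypothetical bucket
    KEYS = ["batch/file-%02d.dat" % i for i in range(50)]   # hypothetical keys

    s3 = boto3.client("s3")

    def download(key):
        local_path = "/tmp/" + key.replace("/", "_")
        s3.download_file(BUCKET, key, local_path)
        return local_path

    def analyze(path):
        # Placeholder for the real per-file analysis step.
        print("analyzing", path)

    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(download, k) for k in KEYS]
        for future in as_completed(futures):   # fires as each download ends
            analyze(future.result())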

EDIT: A few packages I found that might do the majority of the work for you and be what you're looking for:

