
I'm about to start working with data that is ~500 GB in size. I'd like to be able to access small components of the data at any given time with Python. I'm considering using PyTables or MongoDB with PyMongo (or Hadoop, thanks Drahkar). Are there other file structures/DBs that I should consider?

Some of the operations I'll be doing include computing distances from one point to another and extracting subsets of the data based on indices from boolean tests. The results may eventually go online for a website, but at the moment the data is intended to be used only on a desktop for analysis.
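For concreteness, here is a minimal NumPy sketch of those two operations (the column layout and threshold are hypothetical, and it assumes the relevant chunk of the 500 GB dataset has already been loaded into memory):

```python
import numpy as np

# Toy stand-in for one loadable chunk of the dataset;
# each row is a 2-D point (hypothetical layout).
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
target = np.array([0.0, 0.0])

# Euclidean distance from each point to the target.
dist = np.sqrt(((points - target) ** 2).sum(axis=1))

# Extract the rows that pass a boolean test.
nearby = points[dist < 5.0]
print(dist)    # [ 0.  5. 10.]
print(nearby)  # only the rows with distance < 5
```

The same boolean-mask pattern works per-chunk, so the full array never has to be resident in memory at once.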

Cheers

    There should be a requirement to leave a comment if you downvote. Why was this downvoted twice? I'm the first one to downvote a question if it sucks but this question doesn't seem unreasonable... Commented Oct 8, 2012 at 12:06
    You may also wish to consider HDF5. Commented Oct 8, 2012 at 12:19
  • unutbu - That's a good idea. PyTables is based on that. I'm a co-developer for an astronomy data read/write package called ATpy (atpy.github.com) and we make use of HDF5, but accessing subsets of the data would require some significant rewriting of the code. It may be the best solution in the end, but I'm waiting to hear what others may suggest before making the commitment. Commented Oct 8, 2012 at 12:26
    I'm surprised that this question has been closed. After doing some R&D over the last few days, I have a summary report that I'd like to provide here. Is that only possible once the question has been reopened? Commented Oct 18, 2012 at 12:15
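To make the HDF5/PyTables suggestion above concrete, here is a small sketch of querying a subset of a table without reading the whole file (the file name, column names, and condition are all hypothetical; it assumes PyTables is installed):

```python
import numpy as np
import tables as tb  # PyTables

fname = "points_demo.h5"  # hypothetical file name

# Build a toy table so the example is self-contained.
class Point(tb.IsDescription):
    x = tb.Float64Col()
    y = tb.Float64Col()

with tb.open_file(fname, mode="w") as h5:
    table = h5.create_table("/", "points", Point)
    table.append([(float(i), 2.0 * float(i)) for i in range(1000)])
    table.flush()

with tb.open_file(fname, mode="r") as h5:
    table = h5.root.points
    # Only the matching rows are materialized in memory;
    # the file is scanned, not loaded wholesale.
    subset = table.read_where("(x > 10) & (x < 20)")
    print(len(subset))  # 9
```

On a real 500 GB file, `read_where` (optionally combined with PyTables indexes on the queried columns) is what makes "access small components at any given time" practical.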

1 Answer


If you are seriously looking at Big Data processing, I would highly suggest looking into Hadoop; one provider is Cloudera ( http://www.cloudera.com/ ). It is a very powerful platform with many tools for data processing. Many languages, including Python, have modules for accessing the data, and a Hadoop cluster can do a significant amount of the processing for you once you have built the various MapReduce, Hive, and HBase jobs for it.
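A typical way to use Python with Hadoop is via Hadoop Streaming, where the mapper and reducer are ordinary scripts reading from stdin. This is a minimal sketch (the tab-separated key/value record format is an assumption; the aggregation is just a per-key sum):

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming-style mapper/reducer sketch.
# Assumes input lines of the form "key<TAB>numeric_value".
import sys


def mapper(lines):
    # Emit (key, value) pairs; in a real streaming job these would
    # be written to stdout as tab-separated lines.
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        yield key, float(value)


def reducer(pairs):
    # Hadoop delivers mapper output grouped and sorted by key;
    # here we simply sum the values for each key.
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0.0) + value
    return totals


if __name__ == "__main__":
    # Submitted via: hadoop jar hadoop-streaming.jar -mapper ... -reducer ...
    # Locally, pipe a sample file through for a quick sanity check.
    print(reducer(mapper(sys.stdin)))
```

The point is that the cluster handles the distribution, sorting, and fault tolerance; the Python code only expresses the per-record and per-key logic.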


3 Comments

Thanks for the suggestion. I have looked at Hadoop as well. Let me edit my question to include it. I'm curious what the consensus will be. Is the Python support for Hadoop comparable to, or better than, MongoDB's?
Someone suggested Riak for Python: github.com/basho/riak-python-client. Getting closer to closure on this. If I find something, I'll post it here in case anyone has similar questions.
The purposes of Hadoop versus MongoDB, CouchDB, Couchbase, etc. differ significantly. MongoDB, CouchDB, and Couchbase are all NoSQL solutions, whereas Hadoop is a storage and analysis cluster. So what you need depends heavily on what you plan to use it for specifically.
