
I have a scientific application that reads a potentially huge data file from disk and transforms it into various Python data structures such as a map of maps, a list of lists, and so on. NumPy is called in for numerical analysis. The problem is that memory usage can grow rapidly, and once the system starts swapping it slows down significantly. The general strategies I have seen:

  1. lazy initialization: this doesn't seem to help, in the sense that many operations require in-memory data anyway.
  2. shelving: this Python standard library module seems to support writing data objects into a data file (backed by some db). My understanding is that it dumps the data to a file, but if you need it, you still have to load all of it back into memory, so it doesn't exactly help. Please correct me if this is a misunderstanding (a minimal sketch of how I imagine using it follows this list).
  3. database: leverage a database, and offload as much of the data processing to it as possible.
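To make option 2 concrete, here is a minimal sketch of how I imagine using shelve; the file name and keys are just placeholders, not my real data layout. My question is whether an access pattern like this really avoids pulling the whole data set back into memory.

    import shelve

    # Placeholder file name and keys, just to illustrate the access pattern.
    with shelve.open("events.shelf") as db:
        db["run_001"] = {"coords": (1, 2), "events": [0.1, 0.5, 0.9]}  # pickled to disk

    with shelve.open("events.shelf") as db:
        run = db["run_001"]   # is only this value unpickled, or the whole file?
        print(run["events"])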

As an example: a scientific experiment runs for several days and generates a huge (terabytes of data) sequence of records of the form:

coordinate (x, y), observed event E at time t.

We need to compute a histogram over t for each (x, y) and output a 3-dimensional array.
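Roughly like the following toy sketch, where the grid size, time bins, and random records are all made-up stand-ins for the real data:

    import numpy as np

    # Made-up sizes: a 64x64 detector grid, 100 time bins, random (x, y, t) records.
    rng = np.random.default_rng(0)
    x = rng.integers(0, 64, size=100_000)
    y = rng.integers(0, 64, size=100_000)
    t = rng.uniform(0.0, 3 * 86400.0, size=100_000)   # ~3 days, in seconds

    counts, edges = np.histogramdd(
        (x, y, t),
        bins=(64, 64, 100),
        range=((0, 64), (0, 64), (0.0, 3 * 86400.0)),
    )
    print(counts.shape)   # (64, 64, 100)

The real input is far too large to build x, y, and t as single in-memory arrays, which is exactly the problem.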

Any other suggestions? I guess my ideal case would be for the in-memory data structures to be paged out to disk based on a soft memory limit, with the process being as transparent as possible. Can any of the caching frameworks help?

Edit:

I appreciate all the suggested points and directions. Among those, I found user488551's comments to be the most relevant. As much as I like Map/Reduce, for many scientific apps the setup and effort of parallelizing the code is an even bigger problem to tackle than my original question, IMHO. It is difficult to pick an answer as my question itself is so open ... but Bill's answer is closer to what we can do in the real world, hence the choice. Thank you all.

3 Comments
  • Have you checked with a profiler what it says? Could be something totally different from what you think, like unnecessary allocation in some loop. Commented Jan 30, 2012 at 21:31
  • Your workload shouldn't really grow out of bounds. Is there no way to place theoretical limits on how much data is necessary for any given quantum of computation? Commented Jan 30, 2012 at 21:33
  • "many operations require in memory data anyway"? You'll have to be much, much more precise on this issue. To reduce memory footprint, you have to divide the problem into smaller pieces that can run slower, but use less memory. Commented Jan 30, 2012 at 21:56

2 Answers


Have you considered divide and conquer? Maybe your problem lends itself to that. One framework you could use for that is Map/Reduce.

Does your problem have multiple phases, such that Phase I requires some data as input and generates an output which can be fed to Phase II? In that case you can have one process do Phase I and generate the data for Phase II. Maybe this will reduce the amount of data you need in memory at any one time.

Can you divide your problem into many small problems and recombine the solutions? In that case you can spawn multiple processes that each handle a small sub-problem, and have one or more processes combine the results at the end (see the sketch below).
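For instance, here is a rough sketch of that pattern using Python's multiprocessing; the chunk files, the worker function, and the way partial results are combined are all assumptions, not your actual pipeline:

    from multiprocessing import Pool

    import numpy as np

    def solve_subproblem(chunk_path):
        """Worker: load one small piece of the data and return a partial result."""
        data = np.load(chunk_path)      # assumes the data was pre-split into .npy chunks
        return data.sum(axis=0)         # placeholder for the real per-chunk analysis

    if __name__ == "__main__":
        chunk_paths = [f"chunk_{i:04d}.npy" for i in range(100)]
        with Pool(processes=8) as pool:
            partial_results = pool.map(solve_subproblem, chunk_paths)
        combined = np.sum(partial_results, axis=0)   # recombine the sub-solutions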

If Map/Reduce works for you, look at the Hadoop framework.




Well, if you need the whole dataset in RAM, there's not much to do but get more RAM. Sounds like you aren't sure if you really need to, but keeping all the data resident requires the smallest amount of thinking :)

If your data comes in a stream over a long period of time, and all you are doing is creating a histogram, you don't need to keep it all resident. Just create your histogram as you go along, write the raw data out to a file if you want to have it available later, and let Python garbage collect the data as soon as you have bumped your histogram counters. All you have to keep resident is the histogram itself, which should be relatively small.
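A rough sketch of that streaming approach; the binary record layout, chunk size, grid dimensions, and time range below are assumptions standing in for the real file format:

    import numpy as np

    NX, NY, NT = 64, 64, 100                       # assumed grid and time-bin counts
    T_MAX = 3 * 86400.0                            # assumed 3-day run, in seconds
    CHUNK = 1_000_000                              # records read per iteration

    hist = np.zeros((NX, NY, NT), dtype=np.int64)  # the only structure kept resident

    # Assume a flat binary file of (x, y, t) float64 triples; adapt to the real format.
    record = np.dtype([("x", "f8"), ("y", "f8"), ("t", "f8")])

    with open("events.bin", "rb") as f:
        while True:
            chunk = np.fromfile(f, dtype=record, count=CHUNK)
            if chunk.size == 0:
                break
            counts, _ = np.histogramdd(
                (chunk["x"], chunk["y"], chunk["t"]),
                bins=(NX, NY, NT),
                range=((0, NX), (0, NY), (0.0, T_MAX)),
            )
            hist += counts.astype(np.int64)        # the chunk itself is freed next pass

Each pass holds one chunk plus the histogram, so peak memory stays bounded no matter how large the file on disk is.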

3 Comments

There's a caveat here in that 2D (or 3D) histograms aren't necessarily all that small - particularly if the data is relatively sparse. I've run into situations where they are considerably larger than the original data points.
Last comment timed out ... If your histograms are too big, you've still got a clear way of separating the data/processing (2D histograms on each time bin as it comes in), but I'd suggest using pytables (www.pytables.org) to store the resulting histograms. This gives you cached, appendable disk storage that looks like an ndarray, but only loads the data on read. I've used this a lot for image streams. (A rough sketch follows these comments.)
Good point about sparse histograms; you definitely want to match the representation to the expected characteristics of the data.
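A minimal sketch of the pytables pattern mentioned above; the file name, node name, shapes, and blosc compression settings are illustrative choices, not a prescription:

    import numpy as np
    import tables

    NX, NY = 64, 64                                 # illustrative 2-D histogram shape

    # Create an extendable on-disk array; the first axis grows with each append.
    with tables.open_file("histograms.h5", mode="w") as h5:
        hists = h5.create_earray(
            h5.root, "hists",
            atom=tables.Int64Atom(),
            shape=(0, NX, NY),
            filters=tables.Filters(complevel=5, complib="blosc"),
        )
        for _ in range(10):                         # e.g. one 2-D histogram per time bin
            frame = np.random.randint(0, 100, size=(1, NX, NY)).astype(np.int64)
            hists.append(frame)

    # Later, reading a single frame only pulls that slice from disk.
    with tables.open_file("histograms.h5", mode="r") as h5:
        one_frame = h5.root.hists[3]
        print(one_frame.shape)                      # (64, 64)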
