
I am working on a long-running Python program (part of it is a Flask API, and the other part is a realtime data fetcher).

Both of my long-running processes iterate, quite often (the API one might even do so hundreds of times a second), over large data sets (second-by-second observations of certain economic series, for example 1-5MB worth of data or even more). They also interpolate, compare, and do calculations between series, etc.

What techniques can I practice, for the sake of keeping my processes alive, when iterating over, passing as parameters, and processing these large data sets? For instance, should I use the gc module and collect manually?

UPDATE

I am originally a C/C++ developer and would have NO problem (and would even enjoy) writing parts in C++. I simply have 0 experience doing so. How do I get started?

Any advice would be appreciated. Thanks!


2 Answers


Working with large datasets isn't necessarily going to cause memory complications. As long as you use sound approaches when you view and manipulate your data, you can typically make frugal use of memory.

There are two concepts you need to consider as you're building the models that process your data.

  1. What is the smallest element of your data that you need access to in order to perform a given calculation? For example, you might have a 300GB text file filled with numbers. If you're looking to calculate the average of the numbers, read one number at a time to calculate a running average. In this example, the smallest element is a single number in the file, since that is the only element of our data set that we need to consider at any point in time.

  2. How can you model your application such that you access these elements iteratively, one at a time, during that calculation? In our example, instead of reading the entire file at once, we'll read one number from the file at a time. With this approach, we use a tiny amount of memory but can process an arbitrarily large data set. Instead of passing a reference to your dataset around in memory, pass a view of your dataset, which knows how to load specific elements from it on demand (and which can be freed once worked with). This is similar in principle to buffering and is the approach many iterators take (e.g., xrange, open's file object, etc.). A sketch of the running-average example appears right after this list.
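To make the one-element-at-a-time idea concrete, here is a minimal sketch of the running-average example. It assumes a plain text file with one number per line; the file name is a placeholder, not anything from your setup.

    def running_average(path):
        # Stream a running average: only one line is held in memory at a time.
        total = 0.0
        count = 0
        with open(path) as f:      # the file object is itself a lazy iterator
            for line in f:         # reads one line at a time, never the whole file
                total += float(line)
                count += 1
        return total / count if count else 0.0

    print(running_average("numbers.txt"))   # "numbers.txt" is a made-up name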

In general, the trick is understanding how to break your problem down into tiny, constant-sized pieces, and then stitching those pieces together one by one to calculate a result. You'll find these tenets of data processing go hand-in-hand with building applications that support massive parallelism, as well.

Looking toward gc is jumping the gun. You've provided only a high-level description of what you are working on, but from what you've said, there is no reason to complicate things by poking around in memory management yet. Depending on the type of analytics you are doing, consider investigating numpy, which aims to lighten the burden of heavy statistical analysis.
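As a rough illustration of what numpy buys you here (the series sizes and operations below are invented for the example, not taken from your data), it keeps a whole series in one contiguous block and lets you compare and interpolate without Python-level loops:

    import numpy as np

    # Two second-by-second series of roughly 2.4MB each; the values are made up.
    a = np.random.random(300_000)
    b = np.random.random(300_000)

    spread = a - b                # elementwise difference, no Python loop
    mean_spread = spread.mean()   # the reduction runs in C over the whole array

    # Resample b onto a half-second grid without building Python lists.
    finer = np.interp(np.arange(0, 300_000, 0.5), np.arange(300_000), b)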


2 Comments

The data that I get is pulled from SQL. It is non-linear, has to be passed between functions, etc. Therefore, I don't really see how to break it down...
@user1094786 Without knowing exactly what you are doing with some of this data, it is hard to guide you. Can you update your question with a simple example of a type of calculation you might do?

It's hard to say without a real look at your data/algorithm, but the following approaches seem to be universal:

  1. Make sure you have no memory leaks; otherwise they will kill your program sooner or later. Use objgraph for this - it's a great tool! Read its docs - they contain good examples of the types of memory leaks you can face in a Python program. (A small sketch follows this list.)

  2. Avoid copying data whenever possible. For example, if you need to work with part of a string or do string transformations, don't create a temporary substring - use indexes and stay read-only as long as possible (see the memoryview sketch after this list). It can make your code more complex and less "pythonic", but that is the cost of optimization.

  3. Use gc carefully - it can make your process unresponsive for a while and at the same time add no value. Read the docs. Briefly: you should use gc directly only when there is a real reason to do so, e.g. the Python interpreter not freeing memory after allocating a big temporary list of integers (a sketch of that case also follows the list).

  4. Seriously consider rewriting critical parts in C++. Start thinking about this unpleasant idea now, so that you're ready to do it when your data becomes bigger. Seriously, it usually ends this way. You can also give Cython a try; it could speed up the iteration itself.
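For the objgraph point above, a minimal sketch of the kind of check it makes easy; show_growth and show_most_common_types are real objgraph functions, while the leaky cache is invented for the example:

    import objgraph

    _cache = []                      # hypothetical module-level list that only grows

    def handle_tick(payload):
        _cache.append(payload)       # the "leak": nothing ever removes entries

    for i in range(10_000):
        handle_tick({"tick": i})

    objgraph.show_growth(limit=5)             # which object types grew since the last call
    objgraph.show_most_common_types(limit=5)  # overall object counts right now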
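For the "use indexes, don't copy" point, one concrete tool for bytes-like data is memoryview; the buffer size and offsets below are made up for illustration:

    # A 5MB buffer standing in for one raw series pulled from somewhere.
    raw = bytearray(5 * 1024 * 1024)

    view = memoryview(raw)

    # Slicing the view gives a window onto the same memory - no copy is made,
    # whereas raw[1024:2048] would allocate a brand-new bytearray.
    window = view[1024:2048]
    checksum = sum(window)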
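And for the gc point, a sketch of the one situation described above - forcing a collection right after dropping a big temporary structure. Whether this actually helps depends on your workload, so treat it as something to measure, not a recipe:

    import gc

    def rebuild_index():
        tmp = [(i, i * i) for i in range(5_000_000)]   # hypothetical big temporary
        result = len(tmp)
        del tmp          # drop the only reference to it
        gc.collect()     # explicitly run a full collection at a convenient moment
        return result

    rebuild_index()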

