Your problem description is kind of vague and can be read in several different ways.
One way in which I read this is that you have some kind of ASCII representation of a data structure on disk. You read this representation into memory, and then grep through it one or more times looking for things that match a given regular expression.
Speeding this up depends a LOT on the data structure in question.
If you are simply splitting the data into lines, then maybe you should just read the whole thing into a byte array with a single read() call. Then you can grep over that byte array directly, using a pattern that never spans multiple lines. If you fiddle the expression to always match a whole line by putting ^.*? at the beginning and .*?$ at the end (the ? forces a minimal instead of maximal munch), then the length of each match tells you how many bytes to advance before searching again.
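A minimal sketch of that idea, assuming a hypothetical file table.txt and a placeholder pattern needle (both are stand-ins, not anything from your question):

    import re

    # One read() call, no line splitting; everything stays as raw bytes.
    with open("table.txt", "rb") as f:      # placeholder path
        data = f.read()

    # ^.*? and .*?$ pad the pattern out to a whole line; with re.MULTILINE,
    # ^ and $ anchor at line boundaries, and . never crosses a newline.
    pattern = re.compile(rb"^.*?needle.*?$", re.MULTILINE)

    pos = 0
    while True:
        m = pattern.search(data, pos)
        if m is None:
            break
        print(m.group(0))
        # The match covers the whole line, so its end (+1 for the newline)
        # is exactly how far forward to go before the next search.
        pos = m.end() + 1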
Alternatively, you could use the mmap module to achieve much the same thing without an explicit read and its copy overhead.
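Roughly the same sketch with mmap, again with placeholder names; the OS pages the file in lazily as the regex scans it, since a bytes regex can search a mmap object directly:

    import mmap
    import re

    with open("table.txt", "rb") as f:      # placeholder path
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # The mmap object is bytes-like, so re can scan it in place.
            for m in re.finditer(rb"^.*?needle.*?$", mm, re.MULTILINE):
                print(m.group(0))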
If there is a lot of processing going on to create your data structure and you can't think of a way to use the data in the file in a very raw way as a simple byte array, then you're left with various other solutions depending on your constraints, though of these it sounds like creating a daemon is the best option.
Since your basic operation seems to be 'tell me which table entries match a regexp', you could use the xmlrpc.server and xmlrpc.client libraries to wrap up a call that takes the regular expression as a string and returns the result in whatever form is natural. The library takes care of turning what look like ordinary function calls into messages over a socket.
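Here is a rough sketch of what that daemon could look like; the file name, the port, the grep method name, and the "data structure" (reduced to a plain list of lines) are all placeholders I made up for illustration:

    # server.py -- build the expensive structure once, then serve queries.
    import re
    from xmlrpc.server import SimpleXMLRPCServer

    with open("table.txt", "r", encoding="utf-8") as f:   # placeholder path
        TABLE = f.read().splitlines()

    def grep(pattern):
        """Return every table entry matching the given regular expression."""
        rx = re.compile(pattern)
        return [line for line in TABLE if rx.search(line)]

    server = SimpleXMLRPCServer(("localhost", 8000), allow_none=True)
    server.register_function(grep)
    server.serve_forever()

The client side is then just as short:

    # client.py -- looks like a local function call, runs against the daemon.
    from xmlrpc.client import ServerProxy

    proxy = ServerProxy("http://localhost:8000/")
    print(proxy.grep(r"foo.*bar"))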
Now, your idea of actually keeping it in memory is a bit of a red herring. I don't think it takes 30 minutes to read 2 GB of information from disk these days. It likely takes at most 5 minutes, and probably less than 1. So you might want to look at how you're building the data structure to see if you could optimize that instead.
What pickle and/or marshal will buy you is highly optimized code for rebuilding the data structure from a serialized form, so that creating it may end up bounded by disk read speed rather than by parsing. That suggests the real problem you're addressing is not reading the file off disk each time, but rebuilding the data structure in your own address space.
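A hedged sketch of that caching idea; build_table() here is a trivial stand-in for whatever expensive parsing you actually do, and the file names are again placeholders:

    import pickle

    def build_table(path):
        # Placeholder for the real, slow parsing step: pretend the structure
        # is just "one entry per line".
        with open(path, "r", encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f]

    def load_table(raw_path="table.txt", cache_path="table.pickle"):
        try:
            with open(cache_path, "rb") as f:
                return pickle.load(f)          # fast path: prebuilt structure
        except (FileNotFoundError, EOFError):
            table = build_table(raw_path)      # slow path: parse the raw file
            with open(cache_path, "wb") as f:
                pickle.dump(table, f, protocol=pickle.HIGHEST_PROTOCOL)
            return table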
And holding it in memory via a daemon isn't a guarantee that it will stay in memory. It just guarantees that it stays built up as the data structure you want within the address space of a Python process. The OS may decide to swap that memory out to disk at any time.
Again, this means that focusing on the time to read it from disk is likely not the right focus. Instead, focus on how to efficiently re-create (or preserve) the data structure in the address space of a Python process.
Anyway, that's my long-winded ramble on the topic. Given the vagueness of your question, there is no definite answer, so I just gave a smorgasbord of possible techniques and some guiding ideas.