0

I have some data which is about 1.5GB. Now I want to store those information into a big dict in python. However, it costs much bigger than 1.5GB, maybe 10 times. The machine doesn't have so much memory. Is there any way to use less memory to put those data into a dict structure? The key and value are all integer.

Best Regards,

2
  • What are you going to do with the dictionary after loading it from file? Commented Nov 22, 2012 at 14:08
  • maybe try some key-value db such as leveldb Commented Nov 22, 2012 at 14:08

5 Answers 5

1

Use a fast database that stores the key-value pairs to disk and allows intelligent retrieval and indexing, such as sqlite.

Sign up to request clarification or add additional context in comments.

Comments

1

You should try using a database so you do not have to store all the data in memory.

A Berkeley Database is ideal for your use as it only stores key-value pairs. It is a "dict" in database form!

The code would look something like:

from bsddb3 import db
dbdict = DB()
dbdict.open("your database", None, db.DB_HASH, db.DB_CREATE)
dbdict[3]=2 #works just like a dict!

Here are bindings: Python "bindings" for Oracle Berkeley DB

Comments

0

Use pickle object to store data in dictionary. Refer this link to use pickle http://wiki.python.org/moin/UsingPickle

Comments

0

If the keys are integers, then, depending on the range of the keys, you can use an array http://docs.python.org/2/library/array.html instead of dictionary. Your keys become indices in the array, that's all. This will be more memory efficient than creating a dictionary.

If you don't have enough RAM to fit all your data into an array, then use something like sqlite or Berkeley DB to, effectively, have a dictionary on file. Of course, it will be much slower.

Comments

0

Since your indexes and data are integers, you can save your data in a file and access it as if it was an array, but only the pages you are working on will be in the RAM, the others will stay on the disk.

See http://docs.python.org/2/library/mmap.html

mmap is byte-based which means that an index in that will be like index*sizeof(int) on your architecture, and you will need to read sizeof(int) bytes instead of just one byte, and use the struct module (http://docs.python.org/2/library/struct.html) to convert that into a python integer.

This solution is a bit slower than using an array if all the data fits in the RAM, if your system starts paging out then this solution will be faster than using normal arrays.

4 Comments

Note that the poster has keys and values, which indicates a mapping and not an array. It would of course be possible to implement a hash-table using only the mmap module, but it's much better (and faster) to use an existing on-disk database such as sqlite or Berkley DB.
Yes but both keys and values seems to be "int", and given the size of the whole thing i guess keys aren't so sparse. So it's pretty much an array, wich is waay faster than using a database. Because databases are made to handle a lot of things which are not needed here.
The keys are integers, which could easily refer to 64-bit ints of unknown sparseness. The poster speaks of "big dict", so it's clear that a dict-like (as opposed to array-like) interface is desired. Without knowing the intended usage pattern it's hard to predict whether a database or a hand-rolled solution based on mmap/struct would be faster.
Yes i agree. I posted the solution as one of the possible options, not as the answer of life universe and everything.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.