What is the best way to store between 450,000 and a million Boolean values in a dictionary-like collection indexed by a long number? I need to use as little memory as possible. True and Int both take up more than 22 bytes per entry. Is a lower memory cost per Boolean possible?
- How will this be "dictionary like"? What will be the keys, what will be the values? – Marcin, Jul 12, 2011 at 11:30
- He probably meant "array like". – Aaron Digulla, Jul 12, 2011 at 11:35
- If it really is a dict, just the keys will take a significant amount of memory. – John La Rooy, Jul 12, 2011 at 12:12
- Could you clarify in your question what this collection looks like? Are there really 200,000,000,000 bools and the same number of ints as keys? That many ints will take up over 740GB on their own even if stored in just 4 bytes each. And that many bools will take up another 23GB at one bit each... – Scott Griffiths, Jul 12, 2011 at 15:39
3 Answers
Check this question. Bitarray seems to be the preferred choice.
The two main modules for this are bitarray and bitstring (I wrote the latter). Each will do what you need, but some plus and minus points for each:
bitarray
- Written as a C extension so very quick.
- Python 2 only at the time of writing (later releases also support Python 3).
bitstring
- Pure Python.
- Python 2.6+ and Python 3.x
- Richer array of methods for reading and interpreting data.
So it depends on what you need to do with your data. If it's just storage and retrieval then both will be fine, but for performance-critical work it's better to use bitarray if you can. Take a look at the docs for bitstring and bitarray to see which you prefer.
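To make the memory saving concrete, here is a minimal standard-library sketch of the idea both modules build on: packing one Boolean into each bit of a byte buffer. The class name `BitStore` and its layout are assumptions made for illustration; this is not the API of bitarray or bitstring, which offer the same kind of indexed access with far more features.

```python
class BitStore:
    """Fixed-size collection of Booleans stored one bit each (hypothetical helper)."""

    def __init__(self, size):
        self.size = size
        # Round up to whole bytes; every bit starts out False.
        self._bytes = bytearray((size + 7) // 8)

    def __getitem__(self, index):
        if not 0 <= index < self.size:
            raise IndexError(index)
        # index >> 3 selects the byte, index & 7 selects the bit inside it.
        return bool(self._bytes[index >> 3] & (1 << (index & 7)))

    def __setitem__(self, index, value):
        if not 0 <= index < self.size:
            raise IndexError(index)
        mask = 1 << (index & 7)
        if value:
            self._bytes[index >> 3] |= mask
        else:
            self._bytes[index >> 3] &= ~mask & 0xFF


flags = BitStore(1_000_000)   # one million Booleans
flags[123_456] = True
print(flags[123_456], len(flags._bytes))
```

With this layout a million flags occupy 125,000 bytes of payload, compared with 22 or more bytes per entry in the question. The keys cost nothing extra because the index itself is the key, which only works if the keys are reasonably dense.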
Have you thought about using a hybrid list/bitstring?
Use your list to store one dimension of your bits. Each list item would hold a bitstring of fixed length. You would use your list to focus your search to the bitstring of interest, then use the bitstring to find/modify your bit of interest.
The list should allow the most efficient recall of the bitstrings, the bitstrings should allow you to pack all your data as efficiently as possible, and the hybrid list/bitstring should allow a compromise between speed (slightly slower access to the bitstring in the list) and storage (bit-packed data plus list overhead).
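A rough sketch of the hybrid idea, under a few assumptions: the outer level here is a dict keyed by chunk number rather than a list, so that chunks are allocated only when a bit in them is first set, and each chunk is a plain `bytearray` standing in for a fixed-length bitstring. The names `ChunkedBits` and `CHUNK_BITS` and the chunk size are hypothetical.

```python
CHUNK_BITS = 4096  # bits per chunk; an arbitrary illustrative value


class ChunkedBits:
    """Two-level Boolean store: chunk number -> fixed-length packed bits."""

    def __init__(self, chunk_bits=CHUNK_BITS):
        self.chunk_bits = chunk_bits
        self._chunks = {}  # chunk number -> bytearray of packed bits

    def get(self, key):
        # The outer level narrows the search to one chunk,
        # the offset then picks the bit inside that chunk.
        chunk_id, offset = divmod(key, self.chunk_bits)
        chunk = self._chunks.get(chunk_id)
        if chunk is None:
            return False  # untouched chunks read as all False
        return bool(chunk[offset >> 3] & (1 << (offset & 7)))

    def set(self, key, value):
        chunk_id, offset = divmod(key, self.chunk_bits)
        chunk = self._chunks.get(chunk_id)
        if chunk is None:
            if not value:
                return  # nothing to clear, so do not allocate
            chunk = self._chunks[chunk_id] = bytearray((self.chunk_bits + 7) // 8)
        mask = 1 << (offset & 7)
        if value:
            chunk[offset >> 3] |= mask
        else:
            chunk[offset >> 3] &= ~mask & 0xFF


store = ChunkedBits()
store.set(10**12, True)  # a large "long" key
print(store.get(10**12), store.get(10**12 + 1))
```

The trade-off matches the one described above: each lookup does one extra step to find the chunk, while storage is the packed bits plus per-chunk overhead. If the keys are dense, a plain list of chunks would avoid the dict overhead.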