
Assume I have millions of strings, and each string has an associated int value. I want to retrieve this value by input string, but I don't want to store all these strings because they take a lot of space. I can't use a hash table because it would need to keep all, or at least many, of the strings in memory. So what is a good data structure for my case? (I don't need to add or delete any strings; the data is already prepared and read is the only allowed operation.)

  • What programming language? Also, are there many identical strings? Commented Mar 29, 2013 at 15:21
  • @jdv-Jan de Vaan No, all strings are unique. I don't think my question is language specific, but I prefer C#. Commented Mar 29, 2013 at 15:23
  • It's unclear what you need to do. Do you just need to extract those numbers and save them to another file? Or do you need to perform some calculations with them? Is it OK if the input order isn't preserved? Commented Mar 29, 2013 at 15:29
  • @Alexey Frunze I need to extract these values and do some calculations with them. "input order isn't preserved" - what do you mean? Commented Mar 29, 2013 at 15:34
  • I mean, say, your input is 4,3,2,1 but you output it as 1*2,2*2,3*2,4*2 (multiplied by 2 but in reverse order). If you need to add all numbers up or multiply them together or XOR, the order won't matter either. Depending on what you need to do, you may or may not be able to reorder the input to your advantage. Commented Mar 29, 2013 at 15:37

4 Answers

4

Use a trie to avoid storing common prefixes more than once.
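A minimal sketch of that idea in C# (the asker's preferred language); the class and member names below are made up for illustration. One caveat worth hedging: with millions of short, mostly unique keys, a naive node-per-character trie can end up using more memory than the strings themselves, so in practice a compressed variant (radix tree or DAWG) is what you would actually deploy.

```csharp
using System.Collections.Generic;

// Illustrative sketch only: a plain pointer-based trie mapping strings to int
// values. Strings that share a prefix share the nodes along that prefix.
class Trie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public int? Value;            // set only on nodes that terminate a key
    }

    private readonly Node root = new Node();

    public void Add(string key, int value)
    {
        var node = root;
        foreach (char c in key)
        {
            if (!node.Children.TryGetValue(c, out var next))
                node.Children[c] = next = new Node();
            node = next;
        }
        node.Value = value;
    }

    public bool TryGet(string key, out int value)
    {
        var node = root;
        foreach (char c in key)
        {
            if (!node.Children.TryGetValue(c, out node))
            {
                value = 0;
                return false;
            }
        }
        value = node.Value ?? 0;
        return node.Value.HasValue;
    }
}
```

Usage would be `trie.Add("example", 42)` at build time and `trie.TryGet("example", out var v)` at query time.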


4 Comments

@larsmans Heh! I've thought about something like this for maximizing the efficiency of a very large regex pattern, though now I wonder if this is done automatically when a regex string is parsed. Nice to know what it's called.
a hashtable is not a memory efficient way of storing strings, though
@Neir0 It's going to be speed or size; pick one.
@Nolo: Tries cannot represent all regular expressions, you need NFAs or DFAs for that. Most RE toolkits will actually compile REs to FAs under the hood.
3

If you can pre-process the word list, take a look at perfect hashes, like CMPH. (gperf is another, but it seems optimized for smaller data sets.)

From the CMPH docs:

A perfect hash function maps a static set of n keys into a set of m integer numbers without collisions, where m is greater than or equal to n. If m is equal to n, the function is called minimal.

...

The CMPH Library encapsulates the newest and more efficient algorithms in an easy-to-use, production-quality, fast API. The library was designed to work with big entries that cannot fit in the main memory. It has been used successfully for constructing minimal perfect hash functions for sets with more than 100 million of keys, ...
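CMPH itself is a C library, so from C# it would typically be wrapped via P/Invoke or run offline as a preprocessing step. Purely to illustrate the technique, below is a rough "hash and displace" sketch of minimal perfect hashing in C#; this is not the CMPH API, all names are hypothetical, and real libraries compress the seed array and bound the seed search far better than this does. The point is that, after construction, lookups touch only the seeds and values arrays, so the strings themselves never need to be held in memory (though an unknown key silently maps to some slot unless you also store a small fingerprint per slot).

```csharp
using System.Collections.Generic;
using System.Linq;

// Illustrative "hash and displace" minimal perfect hash: n keys are mapped to
// n slots with no collisions by searching, per bucket, for a seed that sends
// the bucket's keys to distinct free slots.
class MinimalPerfectHash
{
    private readonly int[] seeds;    // one displacement seed per first-level bucket
    private readonly int[] values;   // the int payload, indexed by final slot
    private readonly int size;

    // FNV-1a mixed with a seed; any well-distributed seeded hash would do.
    private static uint Hash(string key, uint seed)
    {
        uint h = 2166136261u ^ (seed * 16777619u);
        foreach (char c in key)
        {
            h ^= c;
            h *= 16777619u;
        }
        return h;
    }

    public MinimalPerfectHash(IReadOnlyDictionary<string, int> data)
    {
        size = data.Count;
        seeds = new int[size];
        values = new int[size];
        var occupied = new bool[size];

        // 1. Distribute keys into buckets using the seed-0 hash.
        var buckets = new List<string>[size];
        for (int i = 0; i < size; i++) buckets[i] = new List<string>();
        foreach (var key in data.Keys)
            buckets[(int)(Hash(key, 0) % (uint)size)].Add(key);

        // 2. Largest buckets first: find a seed placing every key in the
        //    bucket into a distinct, still-free slot. (Unbounded search here;
        //    production code caps and restructures instead.)
        foreach (var bucket in buckets.Where(b => b.Count > 0)
                                      .OrderByDescending(b => b.Count))
        {
            int bucketIndex = (int)(Hash(bucket[0], 0) % (uint)size);
            for (uint seed = 1; ; seed++)
            {
                var slots = bucket.Select(k => (int)(Hash(k, seed) % (uint)size)).ToList();
                if (slots.Distinct().Count() == bucket.Count && slots.All(s => !occupied[s]))
                {
                    seeds[bucketIndex] = (int)seed;
                    for (int i = 0; i < bucket.Count; i++)
                    {
                        occupied[slots[i]] = true;
                        values[slots[i]] = data[bucket[i]];
                    }
                    break;
                }
            }
        }
    }

    // Lookup uses only seeds[] and values[]; the original strings are not kept.
    public int Lookup(string key)
    {
        int bucket = (int)(Hash(key, 0) % (uint)size);
        return values[(int)(Hash(key, (uint)seeds[bucket]) % (uint)size)];
    }
}
```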


1

You may want to look at the Judy array, which is designed to be both fast and compact and has a variant designed for string keys. Its implementation is available on SourceForge.


0

Your reason for not using a hash table doesn't sound valid given the limited information in your question. A well-implemented hash table is fairly memory efficient. It can also avoid wasting memory on duplicate strings, if deduplication is acceptable for your needs, which further reduces memory consumption when duplicates are possible.

You could conceivably also store a compressed form of each string in the hash table if you were creative about how you do lookups. How long are the strings typically?
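One way to be "creative about lookups", sketched below with made-up names and under the assumption that a vanishingly rare false positive is tolerable: key the dictionary by a 64-bit fingerprint of each string rather than the string itself, so roughly 8 bytes per key (plus the int value and dictionary overhead) are stored instead of the full string.

```csharp
using System.Collections.Generic;

// Illustrative sketch: store values keyed by a 64-bit fingerprint of each
// string instead of the string itself. With millions of keys the chance of a
// 64-bit collision is tiny but nonzero, so an unknown or colliding key can
// yield a wrong answer; add verification if that is unacceptable.
static class FingerprintTable
{
    // FNV-1a, 64-bit; any well-mixed 64-bit hash would work as well.
    static ulong Fingerprint(string s)
    {
        ulong h = 14695981039346656037UL;
        foreach (char c in s)
        {
            h ^= c;
            h *= 1099511628211UL;
        }
        return h;
    }

    public static Dictionary<ulong, int> Build(IEnumerable<KeyValuePair<string, int>> data)
    {
        var table = new Dictionary<ulong, int>();
        foreach (var pair in data)
            table.Add(Fingerprint(pair.Key), pair.Value);   // throws if two keys collide
        return table;
    }

    public static bool TryLookup(Dictionary<ulong, int> table, string key, out int value)
        => table.TryGetValue(Fingerprint(key), out value);
}
```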

1 Comment

Average length is 10 letters. At the least, I can avoid storing the strings whose hash buckets contain only one item. So I think there are ways to improve this approach.
