
I'm looking for a data structure/algorithm to store an unordered set S of binary strings of fixed length n (i.e., all matching the regular expression [01]{n}).

Insertion and lookup ("Is element x in S?") should take at most time polynomial in |S|.

The space complexity should be logarithmic in |S| for large sets. In other words, the space should not be exponential in n when, for example, 2^n / 2 random, unique strings are inserted, but polynomial in n.

Is such a thing known to exist?

  • Do you allow errors in your lookup queries? Otherwise, it is easy to prove that what you are asking for is impossible. – Commented Oct 29, 2014 at 21:01
  • No, errors are not allowed. – Commented Oct 29, 2014 at 21:31

1 Answer

Suppose $S$ consists of $m$ strings in $\{0,1\}^n$ and we don't allow query errors. Any two different sets $S_1,S_2 \subseteq \{0,1\}^n$ of size $m$ respond differently to some lookup query: there must be some $x \in S_1 \setminus S_2$, and the lookup of $x$ succeeds in $S_1$ but fails in $S_2$. For this reason, the contents of your data structure must differ for $S_1$ and $S_2$. Since there are $\binom{2^n}{m}$ different choices of $S$, any data structure supporting all of them must have at least $\binom{2^n}{m}$ distinct states. In particular, if you use $M$ bits to store it, then $2^M \geq \binom{2^n}{m}$, i.e. $M \geq \log_2 \binom{2^n}{m}$. When $m = 2^n/2$, for example, this forces $M \geq 2^n - O(n) = 2m - O(\log m)$: the space must be essentially linear in $|S|$, not logarithmic.
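The bound is easy to evaluate numerically. Here is a minimal Python sketch of the counting argument's lower bound $\log_2 \binom{2^n}{m}$ (my own illustration; `min_bits` is a made-up name, not something from the answer):

```python
import math

def min_bits(n: int, m: int) -> float:
    """Counting-argument lower bound: an exact membership structure
    for an m-element subset of {0,1}^n must distinguish C(2^n, m)
    different sets, so it needs at least log2(C(2^n, m)) bits."""
    return math.log2(math.comb(2 ** n, m))

# The answer's example, m = 2^n / 2: the bound is 2^n - O(n),
# i.e. nearly one bit per possible string, linear in |S|.
for n in (4, 8, 12):
    m = 2 ** n // 2
    print(f"n={n}: need >= {min_bits(n, m):.1f} bits for m={m} strings")
```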

  • Great explanation, thank you very much. However, I have one more question: since your proof considers the worst case, would it be possible to do better on average for random strings with some special compression? Perhaps with something like the following (this wildcard idea is sketched in code after these comments)? n = 3; S1 = [001,100,101,110,111] = [001,1..]; S2 = [001,100,101,110] = [001,10.,110] – Commented Oct 29, 2014 at 22:23
  • The same argument shows that if, for example, you only want your data structure to work for an $\alpha$ fraction of sets $S$ (of your choice!), then you only save $-\log_2 \alpha$ bits. For example, you can save a mere $\log_2 n$ bits by restricting to a $1/n$ fraction of sets. So the average case doesn't seem to help. – Commented Oct 29, 2014 at 22:25
  • Of course, if the set of strings is compressible (in the sense that the $k$-th order empirical entropy is less than the $0$-order entropy), then there are techniques that can help you there and give you sublinear query time. But actually, since the question only asked for a query time polynomial in $|S|$, an off-the-shelf compression program will do the job (see the second sketch below). – Commented Oct 29, 2014 at 22:58
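The wildcard compression in the first comment can be made concrete as a binary trie that collapses complete subtrees into a single "don't care" leaf, so that e.g. S1 becomes [001, 1..]. This is only an illustrative sketch under that reading of the comment; all names are made up:

```python
def compress(strings: set[str], n: int):
    """Build a collapsed binary trie: True marks a complete subtree
    (every remaining suffix present, the '..' wildcard), False an
    empty one, and a (zero_branch, one_branch) pair anything else."""
    if not strings:
        return False
    if len(strings) == 2 ** n:  # all length-n suffixes present: collapse
        return True
    zeros = {s[1:] for s in strings if s[0] == "0"}
    ones = {s[1:] for s in strings if s[0] == "1"}
    return (compress(zeros, n - 1), compress(ones, n - 1))

def member(node, x: str) -> bool:
    """Follow the bits of x until a collapsed leaf decides the answer."""
    if isinstance(node, bool):
        return node
    zero, one = node
    return member(zero if x[0] == "0" else one, x[1:])

t = compress({"001", "100", "101", "110", "111"}, 3)  # the comment's S1
print(member(t, "110"), member(t, "010"))  # True False
```

As the accepted argument shows, this helps exactly when the set happens to contain large complete subtrees; for most sets of size $2^n/2$ it cannot beat the $2^n - O(n)$ bit bound.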
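And taking the last comment at its word: since the question only requires query time polynomial in $|S|$, even decompress-and-scan is admissible. A minimal sketch using Python's standard zlib (again my own illustration, not code from the thread):

```python
import zlib

def build(strings: set[str]) -> bytes:
    """Store the set exactly as a zlib-compressed, newline-joined,
    sorted list; small whenever the contents happen to compress well."""
    return zlib.compress("\n".join(sorted(strings)).encode())

def lookup(blob: bytes, x: str) -> bool:
    """Decompress and scan: O(|S| * n) time, i.e. polynomial in |S|,
    which is all the question asked for."""
    return x in zlib.decompress(blob).decode().split("\n")

blob = build({"001", "100", "101", "110"})  # the comment's S2
print(lookup(blob, "100"), lookup(blob, "111"))  # True False
```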
