
I'm looking for a data structure/algorithm to store an unordered set S of binary strings of fixed length n (i.e., all matching the regular expression [01]{n}).

Insertion and lookup ("Is element x in S?") should take at most time polynomial in |S|.

The space complexity should be logarithmic in |S| for large sets. In other words, the space should not be exponential in n when, for example, 2^n / 2 random, unique strings are inserted, but polynomial in n.

Is such a thing known to exist?

  • Do you allow errors in your lookup queries? Otherwise, it is easy to prove that what you are asking for is impossible. – Commented Oct 29, 2014 at 21:01
  • No, errors are not allowed. – Commented Oct 29, 2014 at 21:31

1 Answer

Suppose $S$ consists of $m$ strings in $\{0,1\}^n$ and we don't allow query errors. Any two different sets $S_1,S_2 \subseteq \{0,1\}^n$ of size $m$ respond differently to some lookup query: there must be some $x \in S_1 \setminus S_2$, and the lookup of $x$ succeeds in $S_1$ but fails in $S_2$. For this reason, the contents of your data structure must differ for $S_1$ and $S_2$. Since there are $\binom{2^n}{m}$ different choices of $S$, any data structure supporting all of them must have at least $\binom{2^n}{m}$ distinct states. In particular, if you use $M$ bits to store it, then $2^M \geq \binom{2^n}{m}$, i.e. $M \geq \log_2 \binom{2^n}{m}$. When $m = 2^n/2$, for example, this forces $M \geq 2^n - O(n) = 2m - O(\log m)$: the space must be essentially linear in $|S|$, not logarithmic.
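The bound is easy to evaluate numerically. Here is a minimal Python sketch of the counting argument's lower bound $\log_2 \binom{2^n}{m}$ (my own illustration; `min_bits` is a made-up name, not something from the answer):

```python
import math

def min_bits(n: int, m: int) -> float:
    """Counting-argument lower bound: an exact membership structure
    for an m-element subset of {0,1}^n must distinguish C(2^n, m)
    different sets, so it needs at least log2(C(2^n, m)) bits."""
    return math.log2(math.comb(2 ** n, m))

# The answer's example, m = 2^n / 2: the bound is 2^n - O(n),
# i.e. nearly one bit per possible string, linear in |S|.
for n in (4, 8, 12):
    m = 2 ** n // 2
    print(f"n={n}: need >= {min_bits(n, m):.1f} bits for m={m} strings")
```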

  • Great explanation, thank you very much. However, I have one more question: since your proof considers the worst case, would it be possible to do better on average for random strings with some special compression? Perhaps with something like the following (this wildcard idea is sketched in code after these comments)? n = 3; S1 = [001,100,101,110,111] = [001,1..]; S2 = [001,100,101,110] = [001,10.,110] – Commented Oct 29, 2014 at 22:23
  • The same argument shows that if, for example, you only want your data structure to work for an $\alpha$ fraction of sets $S$ (of your choice!), then you only save $-\log_2 \alpha$ bits. For example, you can save a mere $\log_2 n$ bits by restricting to a $1/n$ fraction of sets. So the average case doesn't seem to help. – Commented Oct 29, 2014 at 22:25
  • Of course, if the set of strings is compressible (in the sense that the $k$-th order empirical entropy is less than the $0$-order entropy), then there are techniques that can help you there and give you sublinear query time. But actually, since the question only asked for a query time polynomial in $|S|$, an off-the-shelf compression program will do the job (see the second sketch below). – Commented Oct 29, 2014 at 22:58
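The wildcard compression in the first comment can be made concrete as a binary trie that collapses complete subtrees into a single "don't care" leaf, so that e.g. S1 becomes [001, 1..]. This is only an illustrative sketch under that reading of the comment; all names are made up:

```python
def compress(strings: set[str], n: int):
    """Build a collapsed binary trie: True marks a complete subtree
    (every remaining suffix present, the '..' wildcard), False an
    empty one, and a (zero_branch, one_branch) pair anything else."""
    if not strings:
        return False
    if len(strings) == 2 ** n:  # all length-n suffixes present: collapse
        return True
    zeros = {s[1:] for s in strings if s[0] == "0"}
    ones = {s[1:] for s in strings if s[0] == "1"}
    return (compress(zeros, n - 1), compress(ones, n - 1))

def member(node, x: str) -> bool:
    """Follow the bits of x until a collapsed leaf decides the answer."""
    if isinstance(node, bool):
        return node
    zero, one = node
    return member(zero if x[0] == "0" else one, x[1:])

t = compress({"001", "100", "101", "110", "111"}, 3)  # the comment's S1
print(member(t, "110"), member(t, "010"))  # True False
```

As the accepted argument shows, this helps exactly when the set happens to contain large complete subtrees; for most sets of size $2^n/2$ it cannot beat the $2^n - O(n)$ bit bound.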
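And taking the last comment at its word: since the question only requires query time polynomial in $|S|$, even decompress-and-scan is admissible. A minimal sketch using Python's standard zlib (again my own illustration, not code from the thread):

```python
import zlib

def build(strings: set[str]) -> bytes:
    """Store the set exactly as a zlib-compressed, newline-joined,
    sorted list; small whenever the contents happen to compress well."""
    return zlib.compress("\n".join(sorted(strings)).encode())

def lookup(blob: bytes, x: str) -> bool:
    """Decompress and scan: O(|S| * n) time, i.e. polynomial in |S|,
    which is all the question asked for."""
    return x in zlib.decompress(blob).decode().split("\n")

blob = build({"001", "100", "101", "110"})  # the comment's S2
print(lookup(blob, "100"), lookup(blob, "111"))  # True False
```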
