3

When doing a union between two compatible HyperLogLog objects, you can just take the maximum bucket to do a lossless union that doesn't introduce any new error:

Union.Bucket[i] = Max(A.Bucket[i], B.Bucket[i])

When doing an intersection though, you have to use the inclusion-exclusion principle:

IntersectionCountEstimate = A.CountEstimate() + B.CountEstimate() - Union.CountEstimate()

Why is it that using the minimum bucket value doesn't work as an effective intersection?

Intersection.Bucket[i] = Min(A.Bucket[i], B.Bucket[i])

1 Answer 1

3

The cause is that the relationship between two instances of the HyperLogLog statistic is not very intuitive:

Consider two corresponding buckets A[i] and B[i] from separate HyperLogLog structures A and B (which have the same number of buckets and use the same hash function), and for simplicity's sake assume the data in A and in B are independently drawn from the same distribution. Let's assume we first draw all the elements for A, and only then draw elements for B.

For every element we observe reaching B[i], what is the probability that is it in the intersection of A and B, i.e. what is the probability that it is already in A[i]? Well that depends - how "full" is A[i]? If A[i] is completely "full" (i.e., A[i] "contains" ALL the elements from the distribution which can reach A[i]), then the probability is 1. In that case, the cardinality of the intersection of A[i] and B[i] would indeed be the cardinality of B[i]. However, it is almost NEVER the case that A[i] is "full" - so the intersection is MUCH SMALLER than the cardinality of B[i].

Sign up to request clarification or add additional context in comments.

1 Comment

Is there a statistical manner to compute how much smaller would we expect the intersection to be? Possibly we could experimentally evaluate that, and create a table for correcting the estimate derived by the intersection.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.