Data structure that supports range based most frequently occuring element query

Question

I'm looking for a data structure with which I can find the most frequently occuring number (among an array of numbers) in a given, variable range.

Let's consider the following 1 based array:

1 2 3 1 1 3 3 3 3 1 1 1 1

If I query the range (1,4), the data structure must retun 1, which occurs twice. Several other examples:

(1,13) = 1

(4,9) = 3

(2,2) = 2

(1,3) = 1 (all of 1,2,3 occur once, so return the first/smallest one. not so important at the moment)

I have searched, but could not find anything similar. I'm looking (ideally) a data structure with minimal space requirement, fast preprocessing, and/or query complexities.

Thanks in advance!

Is there a limit on the maximal value that can be in the array ? — Loïc Février
– Loïc Février, Commented Oct 6, 2010 at 16:33
Good question. It is better to think that it can either be at most K, or unbounded. So, we can have a better complexity at the more restricted former case(I suppose). — kolistivra
– kolistivra, Commented Oct 6, 2010 at 19:11

Loïc Février · Accepted Answer · 2010-10-06 16:59:16Z

2

Let N be the size of the array and M the number of different values in that array.

I'm considering two complexities : pre-processing and querying an interval of size n, each must be spacial and temporal.

Solution 1 :

Spacial : O(1) and O(M)
Temporal : O(1) and O(n + M)

No pre-processing, we look at all values of the interval and find the most frequent one.

Solution 2 :

Spacial : O(M*N) and O(1)
Temporal : O(M*N) and O(min(n,M))

For each position of the array, we have an accumulative array that gives us for each value x, how many times x is in the array before that position.

Given an interval we just need for each x to subtract 2 values to find the number of x in that interval. We iterate over each x and find the maximum value. If n < M we iterate over each value of the interval, otherwise we iterate over all possible values for x.

Solution 3 :

Spacial : O(N) and O(1)
Temporal : O(N) and O(min(n,M)*log(n))

For each value x build a binary heap of all the position in the array where x is present. The key in your heap is the position but you also store the total number of x between this position and the begin of the array.

Given an interval we just need for each x to subtract 2 values to find the number of x in that interval : in O(log(N)) we can ask the x's heap to find the two positions just before the start/end of the interval and substract the numbers. Basically it needs less space than a histogram but the query in now in O(log(N)).

edited Oct 6, 2010 at 16:59

answered Oct 6, 2010 at 16:47

Loïc Février

7,7909 gold badges44 silver badges52 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

kolistivra Over a year ago

Thanks for different solutions, but I'm looking for something with better worst case complexities. +1 for good organization of diverse ideas.

Loïc Février Over a year ago

What are your constraints in term of space/time ? I admin that O(min(n,M)*log(n)) can still be big, what is the maximum you can do ?

kolistivra Over a year ago

Think of this problem as more of theoretical interest of me, rather than a practical thing. I don't know how best we can get, but I keep on thinking =)

Loïc Février Over a year ago

Alright ;) I think we can improve time complexity but we'll need more space. If you have a solution with better complexity than mine, I'm interested.

mikera · Accepted Answer · 2010-10-11 11:30:26Z

0

You could create a binary partition tree where each node represents a histogram map of {value -> frequency} for a given range, and has two child nodes which represent the upper half and lower half of the range.

Querying is then just a case of recursively adding together a small number of these histograms to cover the range required, and scanning the resulting histogram once to find the highest occurrence count.

Useful optimizations include:

Using a histogram with mutable frequency counts as an "accumulator" while you add histograms together
Stop using precomputed histograms once you get down to a certain size (maybe a range less than the total number of possible values M) and just counting the numbers directly. It's a time/space trade-off that I think will pay off a lot of the time.
If you have a fixed small number of possible values, use an array rather than a map to store the frequency counts at each node

UPDATE: my thinking on algorithmic complexity assuming a bounded small number of possible values M and a total of N values in the complete range:

Preprocessing is O(N log N) - basically you need to traverse the complete list and build a binary tree, building one node for every M elements in order to amortise the overhead of each node
Querying is O(M log N) - basically adding up O(log N) histograms each of size M, plus counting O(M) values on either side of the range
Space requirement is O(N) - approx. 2N/M histograms each of size M. The 2 factor is the sum from having N/M histograms at the bottom level, 0.5N/M histograms at the next level, 0.25N/M at the third level etc...

edited Oct 11, 2010 at 11:30

answered Oct 6, 2010 at 16:47

mikera

107k28 gold badges265 silver badges427 bronze badges

4 Comments

kolistivra Over a year ago

Can you provide me with preprocessing and query complexities of your idea? Especially preprocessing seemed high for me...

mikera Over a year ago

Depends on your data distribution . With M possibles values and N total elements in the array, I think it's M log N for preprocessing and M log N for queries. Space is also M log N. Hence this approach is best (and I believe asymptotically better than all the other solutions posted) in the case that N is very large compared to M.

mikera Over a year ago

Sorry meant to say O(N log N) for preprocessing. That's assuming M is small and bounded compared to N of course.

mikera Over a year ago

Ahh actually it's harder than I thought because of the amortisation. Updated my latest thinking in the answer.

Collectives™ on Stack Overflow

Data structure that supports range based most frequently occuring element query

2 Answers 2

4 Comments

4 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

4 Comments

4 Comments

Your Answer

Sign up or log in

Post as a guest

Related