5

I'm looking for a data structure with which I can find the most frequently occuring number (among an array of numbers) in a given, variable range.

Let's consider the following 1 based array:

1 2 3 1 1 3 3 3 3 1 1 1 1

If I query the range (1,4), the data structure must retun 1, which occurs twice. Several other examples:

(1,13) = 1

(4,9) = 3

(2,2) = 2

(1,3) = 1 (all of 1,2,3 occur once, so return the first/smallest one. not so important at the moment)

I have searched, but could not find anything similar. I'm looking (ideally) a data structure with minimal space requirement, fast preprocessing, and/or query complexities.

Thanks in advance!

2
  • Is there a limit on the maximal value that can be in the array ? Commented Oct 6, 2010 at 16:33
  • Good question. It is better to think that it can either be at most K, or unbounded. So, we can have a better complexity at the more restricted former case(I suppose). Commented Oct 6, 2010 at 19:11

2 Answers 2

2

Let N be the size of the array and M the number of different values in that array.

I'm considering two complexities : pre-processing and querying an interval of size n, each must be spacial and temporal.


Solution 1 :

  • Spacial : O(1) and O(M)
  • Temporal : O(1) and O(n + M)

No pre-processing, we look at all values of the interval and find the most frequent one.


Solution 2 :

  • Spacial : O(M*N) and O(1)
  • Temporal : O(M*N) and O(min(n,M))

For each position of the array, we have an accumulative array that gives us for each value x, how many times x is in the array before that position.

Given an interval we just need for each x to subtract 2 values to find the number of x in that interval. We iterate over each x and find the maximum value. If n < M we iterate over each value of the interval, otherwise we iterate over all possible values for x.


Solution 3 :

  • Spacial : O(N) and O(1)
  • Temporal : O(N) and O(min(n,M)*log(n))

For each value x build a binary heap of all the position in the array where x is present. The key in your heap is the position but you also store the total number of x between this position and the begin of the array.

Given an interval we just need for each x to subtract 2 values to find the number of x in that interval : in O(log(N)) we can ask the x's heap to find the two positions just before the start/end of the interval and substract the numbers. Basically it needs less space than a histogram but the query in now in O(log(N)).

Sign up to request clarification or add additional context in comments.

4 Comments

Thanks for different solutions, but I'm looking for something with better worst case complexities. +1 for good organization of diverse ideas.
What are your constraints in term of space/time ? I admin that O(min(n,M)*log(n)) can still be big, what is the maximum you can do ?
Think of this problem as more of theoretical interest of me, rather than a practical thing. I don't know how best we can get, but I keep on thinking =)
Alright ;) I think we can improve time complexity but we'll need more space. If you have a solution with better complexity than mine, I'm interested.
0

You could create a binary partition tree where each node represents a histogram map of {value -> frequency} for a given range, and has two child nodes which represent the upper half and lower half of the range.

Querying is then just a case of recursively adding together a small number of these histograms to cover the range required, and scanning the resulting histogram once to find the highest occurrence count.

Useful optimizations include:

  • Using a histogram with mutable frequency counts as an "accumulator" while you add histograms together
  • Stop using precomputed histograms once you get down to a certain size (maybe a range less than the total number of possible values M) and just counting the numbers directly. It's a time/space trade-off that I think will pay off a lot of the time.
  • If you have a fixed small number of possible values, use an array rather than a map to store the frequency counts at each node

UPDATE: my thinking on algorithmic complexity assuming a bounded small number of possible values M and a total of N values in the complete range:

  • Preprocessing is O(N log N) - basically you need to traverse the complete list and build a binary tree, building one node for every M elements in order to amortise the overhead of each node
  • Querying is O(M log N) - basically adding up O(log N) histograms each of size M, plus counting O(M) values on either side of the range
  • Space requirement is O(N) - approx. 2N/M histograms each of size M. The 2 factor is the sum from having N/M histograms at the bottom level, 0.5N/M histograms at the next level, 0.25N/M at the third level etc...

4 Comments

Can you provide me with preprocessing and query complexities of your idea? Especially preprocessing seemed high for me...
Depends on your data distribution . With M possibles values and N total elements in the array, I think it's M log N for preprocessing and M log N for queries. Space is also M log N. Hence this approach is best (and I believe asymptotically better than all the other solutions posted) in the case that N is very large compared to M.
Sorry meant to say O(N log N) for preprocessing. That's assuming M is small and bounded compared to N of course.
Ahh actually it's harder than I thought because of the amortisation. Updated my latest thinking in the answer.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.