
I have a sorted array of 20 numbers (64-bit integers), something like 10, 25, 36, 43, ..., 118, 121.

Now I have to process millions of numbers as input (say 17, 30).

For each input number, the output I have to give is:

for Input 17:

17 is < 25 and > 10, so the output will be index 0.

for Input 30:

30 is < 36 and > 25, so the output will be index 1.

Now, I can do it using linear search or binary search. Is there any way to do it faster? The input numbers are random (Gaussian).

  • 1
    Binary search is probably the fastest you'll get. Commented Feb 9, 2013 at 8:47
  • actually, this problem is quite interesting... there is a sequence of bins in which you should check in that order. the first thing you want to do is derive a decision tree that tells you how to go about searching to get minimal expected run time. then use that tree to do the search. depending on the distribution of the 20 numbers and the distribution of the incoming numbers, you can build this tree appropriately. Commented Feb 9, 2013 at 11:16
  • by the way, where did you get this problem? this is an entropy compression algorithm... information theory? Commented Feb 9, 2013 at 11:51
  • @thang, No. Actually, I am a master's student and I found it during my project work on a text processing tool. Commented Feb 9, 2013 at 11:58

5 Answers


If you know the distribution, you can direct your search in a smarter way.

Here is the rough idea of this variant of binary search:

Assume your data is expected to be distributed uniformly on 0 to 100.

If you observe the value 0, you start at the beginning. If your value is 37, you start 37% of the way into the array. This is the key difference from binary search: you don't always start at 50%, but try to start at the expected "optimal" position.

This also works for Gaussian distributed data, if you know the parameters (If you don't know them, you can still estimate them easily from the observed data). You would compute the Gaussian CDF, and this yields the place to start your search.

Now for the next step, you need to refine your search. The value at the position you probed will usually differ from your query, and you can use it to re-estimate the position at which to continue searching.

Now even if you don't know the distribution, this can work very well. Say you started with a binary search and have already looked at the objects at 50% and 25%. Instead of going to 37.5% next, you can make a better guess if, for example, your query value was very close to the 50% entry. Unless your data set is very "clumpy" (and your queries are not correlated to the data), this should still outperform "naive" binary search that always splits in the middle.

http://en.wikipedia.org/wiki/Interpolation_search

The expected average runtime is apparently O(log(log(n))), according to Wikipedia.
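To make this concrete, here is a rough sketch of such an interpolation search in C++ (my illustration, not code from the original answer). It returns the index i with a[i] <= key < a[i+1], and it assumes the key already lies between the first and the last element; the two outer bins have to be handled by the caller.

#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: interpolation search over a sorted array of 64-bit ints.
// Precondition: a is sorted and a.front() <= key < a.back().
std::size_t interpolation_bin(const std::vector<std::int64_t>& a, std::int64_t key) {
    std::size_t lo = 0, hi = a.size() - 1;     // invariant: a[lo] <= key < a[hi]
    while (lo + 1 < hi) {
        // Probe where the key "should" be under a uniform distribution,
        // instead of always probing the midpoint as binary search does.
        double frac = double(key - a[lo]) / double(a[hi] - a[lo]);
        std::size_t mid = lo + static_cast<std::size_t>(frac * double(hi - lo));
        if (mid <= lo) mid = lo + 1;           // keep the probe strictly inside (lo, hi)
        if (mid >= hi) mid = hi - 1;
        if (a[mid] <= key) lo = mid;
        else               hi = mid;
    }
    return lo;                                 // a[lo] <= key < a[lo + 1]
}

For Gaussian data, as described above, frac would come from the Gaussian CDF rather than from this linear interpolation.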

Update: since someone complained that with just 20 numbers things are different: yes, they are. With 20 numbers a linear search may be best, because of CPU caching: linearly scanning a small amount of memory that fits into the CPU cache can be really fast, in particular with an unrolled loop. But that case is quite pathetic and uninteresting IMHO.


8 Comments

+1 In my experience, interpolation search beats binary search hands-down.
-1, this is an incorrect use of interpolation search. Interp search works well if your search space is big. This search space is only 20. The randomness that you are given (Gaussian) is of the input number, NOT of the search space. The method that you have outlined to pick the starting point is incorrect. The starting point should be the bin that is most probable based on whatever statistical property you know of the incoming data. For example, if your 20 numbers were 0 80 ... 99, then assuming uniform distribution of incoming data, you should ALWAYS start by checking the first bin!
in fact, even the Wikipedia article says: "This method will only work if calculations on the size of differences between key values are sensible."
@thang with 20 numbers, even a linear search may be faster because of CPU caching and pipelining. So that not really is an argument. He probably wants to scale up at some point. The Wikipedia statement refers to the fact that e.g. on a list of (textual) Strings, interpolation search will not work very well; although you could interpret strings as variable length numbers and try to do this kind of search, because this kind of difference is not sensible for text. Plus, note that I said "assuming uniform on 0...100" for the simple example of choosing the starting bin!
@ErichSchubert, i don't think that is it. that is the whole point of the problem -- you have a small search space and you need to do a lot of searches. in fact, linear searching is not faster even with 20 numbers. with 20 number, the entire search space is in cache, so regardless of linear or binary, you're operating on cache. additionally, linear search space has just as bad a chance of screwing up branch prediction (if not worse). with binary search, you are making fewer comparisons...

I believe the best option for you is to use upper_bound - it will find the first value in the array bigger than the one you are searching for.

Still, depending on the problem you are trying to solve, lower_bound or binary_search may be what you need.

All of these algorithms have logarithmic complexity.
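For example (my sketch, using a shortened version of the asker's array), the interval index is one less than the distance of the upper_bound iterator from the beginning:

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    std::vector<std::int64_t> a = {10, 25, 36, 43, 118, 121};  // sorted, as in the question
    std::int64_t key = 30;
    auto it = std::upper_bound(a.begin(), a.end(), key);       // first element > 30, i.e. 36
    std::cout << (it - a.begin()) - 1 << '\n';                  // prints 1: 25 < 30 < 36
    return 0;
}

Keys below the first element (result -1) or at/above the last one would need a small special case, depending on how those outer bins are supposed to be numbered.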

1 Comment

Given the description of the problem, std::binary_search is not suitable.

Nothing will be better than binary search, since your array is sorted.

Linear search is O(n) while binary search is O(log n).

Edit:

Interpolation search makes an extra assumption (the elements have to be uniformly distributed) and does more work per iteration.

You can try both and empirically measure which is better for your case.
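A quick-and-dirty way to measure, as suggested (my sketch; serious benchmarking needs more care about warm-up and optimizer effects, and the Gaussian parameters below are made up):

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

int main() {
    std::vector<std::int64_t> a = {10, 25, 36, 43, 118, 121};       // the sorted array
    std::mt19937_64 rng(42);
    std::normal_distribution<double> dist(65.0, 20.0);              // made-up Gaussian query distribution

    std::vector<std::int64_t> queries(1000000);
    for (auto& q : queries) q = static_cast<std::int64_t>(dist(rng));

    auto t0 = std::chrono::steady_clock::now();
    std::int64_t checksum = 0;                                      // keeps the loop from being optimized away
    for (auto q : queries)
        checksum += std::upper_bound(a.begin(), a.end(), q) - a.begin();
    auto t1 = std::chrono::steady_clock::now();

    std::cout << "binary search (upper_bound): "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
              << " ms (checksum " << checksum << ")\n";
    // ...repeat the same timing loop with an interpolation-search routine and compare.
    return 0;
}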

11 Comments

Incorrect. See my reply. If you have some idea of the distribution, you can do interpolation search in log(log(n))
It is still a variant of binary search
Yes, but it is not the binary search algorithm, that is available as binarySearch or std::binary_search in various class libraries, which always starts at 50%. Just like A* is not Dijkstra, albeit it is a variant of it.
I see your point, but knowing the distribution of the numbers to be searched might not be that helpful if the array that we search was generated purely random i.e. not Gaussian
@iTech, the performance can be improved if you tune the comparison indices as I have described to optimize for the expected # of compares. i guess, technically, that is still binary search since at each step it divides the space into 2 sections, albeit not equal sections. i think that worst and average cases are still O(log n), but the constant for the average case can be smaller depending on how different the distribution of the 20 numbers is from the distribution of the incoming numbers. if you are doing the search 1 million times, it can matter a lot...

In fact, this problem is quite interesting because it is a re-cast of an information theoretic framework.

Given 20 numbers, you will end up with 21 bins (including < first one and > last one).

For each incoming number, you are to map to one of these 21 bins. This mapping is done by comparison. Each comparison gives you 1 bit of information (< or >= -- two states).

So if an incoming number requires 5 comparisons to figure out which bin it belongs to, that is equivalent to using 5 bits to represent that number.

Our goal is to minimize the number of comparisons! We have 1 million numbers, each belonging to one of 21 ordered code words. How do we do that?

This is exactly an entropy compression problem.

Let a[1],.. a[20], be your 20 numbers.

Let p(n) = pr { incoming number is < n }.

Build the decision tree as follows.

Step 1.

   let i = argmin |p(a[i]) - 0.5|

   define p0(n) = p(n) / (sum(p(j), j=0...a[i-1])), and p0(n)=0 for n >= a[i].
   define p1(n) = p(n) / (sum(p(j), j=a[i]...a[20])), and p1(n)=0 for n < a[i].

Step 2.

   let i0 = argmin |p0(a[i0]) - 0.5|
   let i1 = argmin |p1(a[i1]) - 0.5|

and so on...

and by the time we're done, we end up with:

i, i0, i1, i00, i01, i10, i11, etc.

each one of these i gives us the comparison position.

so now our algorithm is as follows:

let u = input number.

if (u < a[i]) {
   if (u < a[i0]) {
      if (u < a[i00]) {
      } else {
      }
   } else {
      if (u < a[i01]) {
      } else {
      }
   }
} else {
   similarly...
}

so the i's define a tree, and the if statements are walking the tree. we can just as well put it into a loop, but it's easier to illustrate with a bunch of if.

so for example, if you knew that your data were uniformly distributed between 0 and 2^63, and your 20 numbers were

0,1,2,3,...19

then

i      = 20  (notice that there is no i1)
i0     = 10
i00    = 5
i01    = 15
i000   = 3
i001   = 7
i010   = 13
i011   = 17
i0000  = 2     
i0001  = 4     
i0010  = 6     
i0011  = 9
i00110 = 8
i0100  = 12
i01000 = 11
i0110  = 16
i0111  = 19
i01110 = 18

ok so basically, the comparison would be as follows:

if (u < a[20]) {
  if (u < a[10]) {
     if (u < a[5]) {
     } else {
         ...
     }
  } else {
     ...
  }
} else {
  return 21
}

so note here that I am not doing binary search! I am first checking the end point. why?

there is a 100*((2^63)-20)/(2^63) percent chance that it will be greater than a[20]. that is basically a 99.999999999999999783159565502899% chance!

so this algorithm, as it is, has an expected number of comparisons of essentially 1 for a dataset with the properties specified above! (this is better than log log :p)

notice that what I have done here is use fewer compares to find numbers that are more probable and more compares to find numbers that are less probable. for example, the number 18 requires 6 comparisons (1 more than needed with binary search); however, the numbers 20 to 2^63 require only 1 comparison. this same principle is used for lossless (entropy) data compression -- use fewer bits to encode code words that appear often.

building the tree is a one time process and you can use the tree 1 million times later.
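to illustrate the one-time construction, here is a rough C++ sketch of the idea (my illustration, not the answerer's code; cdf() is a made-up stand-in for the assumed distribution of the incoming numbers, here uniform on [0, 2^63)):

#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Node {
    std::size_t split;              // compare u < a[split] at this node
    int left, right;                // child node ids, -1 means the bin is decided
};

double cdf(std::int64_t x) {        // assumed Pr{ incoming number < x }, uniform on [0, 2^63)
    return double(x) / 9223372036854775808.0;
}

// Build the subtree for split indices [lo, hi), where the incoming number is known
// to have CDF value between plo and phi. Pick the split whose CDF is closest to the
// middle of that probability mass (the argmin step above), then recurse on both sides.
int build(const std::vector<std::int64_t>& a, std::size_t lo, std::size_t hi,
          double plo, double phi, std::vector<Node>& nodes) {
    if (lo >= hi) return -1;                        // no split left: leaf
    std::size_t best = lo;
    double target = (plo + phi) / 2.0, bestDiff = 2.0;
    for (std::size_t j = lo; j < hi; ++j) {
        double d = std::fabs(cdf(a[j]) - target);
        if (d < bestDiff) { bestDiff = d; best = j; }
    }
    int id = int(nodes.size());
    nodes.push_back({best, -1, -1});
    int l = build(a, lo, best, plo, cdf(a[best]), nodes);
    int r = build(a, best + 1, hi, cdf(a[best]), phi, nodes);
    nodes[id].left = l;
    nodes[id].right = r;
    return id;
}

// Walk the tree once per query. Returns k such that a[k-1] <= u < a[k]
// (0 = below everything, a.size() = at or above the last element),
// i.e. the question's index is k - 1.
std::size_t lookup(const std::vector<Node>& nodes, const std::vector<std::int64_t>& a,
                   int root, std::int64_t u) {
    std::size_t bin = 0;
    int id = root;
    while (id != -1) {
        const Node& n = nodes[id];
        if (u < a[n.split]) { id = n.left; }
        else                { bin = n.split + 1; id = n.right; }
    }
    return bin;
}

build(a, 0, a.size(), 0.0, 1.0, nodes) runs once up front; lookup() then does the cheap per-query walk down the tree.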

the question is... when does this decision tree become binary search? homework exercise! :p the answer is simple. it's similar to when you can't compress a file any more.

ok, so I didn't pull this out of my behind... the basis is here:

http://en.wikipedia.org/wiki/Arithmetic_coding

Comments


You could perform binary search using std::lower_bound and std::upper_bound. These give you back iterators, so you can use std::distance to get an index.
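A minimal sketch of that (mine): upper_bound gives the first element greater than the key, lower_bound the first not less than it, and std::distance turns the iterator into an index.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

std::ptrdiff_t bin_of(const std::vector<std::int64_t>& a, std::int64_t key) {
    auto it = std::upper_bound(a.begin(), a.end(), key);   // first element > key
    return std::distance(a.begin(), it) - 1;               // -1 means key is below a.front()
}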

5 Comments

There's.. std::binary_search too.
@Rapptz that doesn't work if the searched element is not present in the array.
@Rapptz: std::binary_search is next to useless. When you do a binary search(or any search), you almost always want to know the position of the element. All std::binary_search does is tell you true or false.
@BenjaminLindley yeah I forgot, I thought it would return an iterator to the element like the other find algorithms.
@Rapptz even if it did, it would only be useful iff the element was in the array. OP is looking for bounds.
