3

I have many arrays of sorted data. I need to perform binary search in these arrays. If the key ranges of the arrays were disjoint, it would be possible to sort the arrays by range and then binary search as if they were a single array. But in my case the key ranges of the arrays can overlap, so the best I can do is filter out the arrays that cannot contain the key and then search the remaining ones. Most of my arrays don't overlap, so the filtering will usually return only one array, but bad data can still ruin performance.

Is it possible to use a better algorithm in this case? I can modify the arrays slightly, or add some metadata or links between arrays.

Update: These arrays are data pages backed by disk storage; I use memory-mapped files for this. I can sort the data inside a page very fast, because no copying is involved in the process, but merging two pages requires copying a large amount of data between them. I have a very large amount of data, terabytes! But each page is only 8 MB, so a single page can be searched quickly. New pages are added to the storage from time to time. The pages contain time-series data, so the data is already partially sorted, and most of the time new arrays don't overlap with old data.

6 Comments
  • You forgot to add a question. If you add one we can try to answer it. Commented Oct 1, 2013 at 13:44
  • Thanks, I really did :) Commented Oct 1, 2013 at 13:48
  • I might be missing something, but why can't you just run a binary search on each array separately? Commented Oct 1, 2013 at 13:57
  • Do you want us to improve your filter algorithm (selecting the [overlapping?] arrays to search) or the overall structure? In the latter case please explain what you need many small arrays for. Commented Oct 1, 2013 at 14:03
  • I can, but I know the key range of each array, so I don't need to search them all. Commented Oct 1, 2013 at 14:10

5 Answers

4
+100

If the key ranges of the arrays were disjoint, it would be possible to sort the arrays by range and then binary search as if they were a single array. But in my case the key ranges of the arrays can overlap.

You can still sort them. Instead of naively filtering all arrays by their boundaries, you can store them in an interval tree and retrieve the arrays to be searched in logarithmic time. Since you have a lot of arrays and they only seldom overlap each other, this should give a noticeable performance boost.
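For illustration, here is a minimal C++ sketch of such a stabbing query: an implicit interval tree over the array ranges sorted by lower bound, where each node is augmented with the maximum upper bound of its subtree so that whole subtrees that cannot contain the key are pruned. Range, the page field, and the long key type are placeholders for your own types:

    #include <algorithm>
    #include <climits>
    #include <vector>

    // One entry per array: its key range and an identifier (hypothetical).
    struct Range { long lo, hi; int page; };

    // Interval tree laid out implicitly over a vector sorted by 'lo'.
    struct IntervalIndex {
        std::vector<Range> r;     // ranges, sorted by lower bound
        std::vector<long> maxHi;  // max 'hi' in the subtree rooted at each node

        explicit IntervalIndex(std::vector<Range> ranges) : r(std::move(ranges)) {
            std::sort(r.begin(), r.end(),
                      [](const Range& a, const Range& b) { return a.lo < b.lo; });
            maxHi.assign(r.size(), LONG_MIN);
            build(0, r.size());
        }

        // Returns the arrays whose range contains 'key'.
        std::vector<int> stab(long key) const {
            std::vector<int> out;
            stab(0, r.size(), key, out);
            return out;
        }

    private:
        long build(size_t b, size_t e) {  // subtree over [b, e), root at midpoint
            if (b >= e) return LONG_MIN;
            size_t m = b + (e - b) / 2;
            maxHi[m] = std::max(r[m].hi, std::max(build(b, m), build(m + 1, e)));
            return maxHi[m];
        }

        void stab(size_t b, size_t e, long key, std::vector<int>& out) const {
            if (b >= e) return;
            size_t m = b + (e - b) / 2;
            if (maxHi[m] < key) return;             // nothing here reaches 'key'
            stab(b, m, key, out);                   // left subtree may still hit
            if (r[m].lo <= key && key <= r[m].hi) out.push_back(r[m].page);
            if (r[m].lo <= key)                     // right subtree starts at
                stab(m + 1, e, key, out);           // lo >= r[m].lo, else skip it
        }
    };

With rare overlaps a query follows essentially one root-to-leaf path, i.e. it stays logarithmic in the number of arrays.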



2

If you only plan to perform a few queries, I don't think you can improve your algorithm; it is already quite good. If you expect to perform a lot of queries, I would advise you to merge the arrays into a single one and binary search over that. The merge is the same linear-time procedure used in merge sort. So as long as the number of queries makes up for the linear merge, it is worth it.
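If it helps, this is what the merge looks like as a k-way merge with a min-heap over the current head of each array (a sketch; the long element type and the in-memory layout are assumptions, and the heap adds a log k factor on top of the linear scan):

    #include <functional>
    #include <queue>
    #include <tuple>
    #include <vector>

    // Merge k sorted arrays into one sorted array.
    std::vector<long> mergeAll(const std::vector<std::vector<long>>& arrays) {
        // (value, array index, position within that array)
        using Head = std::tuple<long, size_t, size_t>;
        std::priority_queue<Head, std::vector<Head>, std::greater<Head>> heap;

        size_t total = 0;
        for (size_t i = 0; i < arrays.size(); ++i) {
            total += arrays[i].size();
            if (!arrays[i].empty()) heap.emplace(arrays[i][0], i, 0);
        }

        std::vector<long> out;
        out.reserve(total);
        while (!heap.empty()) {           // repeatedly take the smallest
            auto [v, i, j] = heap.top();  // remaining head element
            heap.pop();
            out.push_back(v);
            if (j + 1 < arrays[i].size()) heap.emplace(arrays[i][j + 1], i, j + 1);
        }
        return out;
    }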

2 Comments

I think about this the same way, but there are some problems with that. These arrays are memory pages backed by persistent storage, and merging all of them is a lot of I/O! I have terabytes of these arrays (and one array is just 8 MB). Sorting and merging data within one page is extremely cheap (no data copying needed), but sorting and merging all the data is hard (data must be copied between pages). The other problem is that new arrays can be added and can overlap with the already-merged data, so I would need to perform a partial sort periodically.
And yes, the data is searched many times. The problem is that it is also updated from time to time.
2

Terabytes in 8 MB pages means you have to handle a few million pages. Each page is sorted internally, and the key ranges of the pages can (rarely, but they can) overlap each other.

I would expect the cost of finding the right page to be higher than the cost of finding the right entry within a page.

Therefore I recommend the following approach:

  • Maintain an array with the lowest and highest key per page (lowestPageKey, highestPageKey).
  • Do a binary search over this metadata to find the fitting pages, then a second binary search within each page.
  • To find the fitting pages for searchKey, do a range-fitting binary search in the metadata:
    • Use the condition lowestPageKey <= searchKey <= highestPageKey to find the right page.
    • If lowestPageKey > searchKey, continue with the lower half of the array.
    • If highestPageKey < searchKey, continue with the higher half of the array.

This way you'll find the right page(s) and can issue a second binary search within the found pages.
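A minimal sketch of that metadata search in C++, using the field names from the list above (PageMeta, findPage and the long key type are illustrative):

    #include <vector>

    struct PageMeta { long lowestPageKey, highestPageKey; int pageId; };

    // Binary search over page metadata sorted by lowestPageKey. Returns the
    // page whose range contains searchKey, or -1 if the key falls in a gap.
    int findPage(const std::vector<PageMeta>& meta, long searchKey) {
        size_t lo = 0, hi = meta.size();
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (meta[mid].lowestPageKey > searchKey)       hi = mid;      // lower half
            else if (meta[mid].highestPageKey < searchKey) lo = mid + 1;  // higher half
            else return meta[mid].pageId;  // lowestPageKey <= searchKey <= highestPageKey
        }
        return -1;
    }

Note that this returns a single page; when page ranges overlap, the neighbours of the hit would have to be probed as well, which is exactly the ambiguity raised below.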

One more question from my side: if values in pages overlap, you can find more than one entry (or more than one page) containing the search key. What do you expect back in such a case? One page/entry at random, all pages/entries, the first/last page/entry, or an error message?

2 Comments

I need two kinds of query: a point query, which returns the item with a specific key (or the closest one), and a range query, which returns a range of keys that can potentially span several pages.
OK, then the first binary search will always be the range version. And for the second (within the candidate pages) you can choose between the two.
2

You imply that you have many queries on mostly static data, so I'll assume that. You're on the right track; only don't exclude overlapping arrays. Track the overlaps. Here is how. Begin by compiling an index of ranges. If the arrays were disjoint, they would be the blocks. When two arrays overlap:

|     A    |
     |       B       |

Split into three ranges:

| A  | AB  |   B     |

As the diagram implies, the index of ranges just records low and high bounds and a list of arrays that cover the range.

Now search the index (in memory) to determine which array or arrays to search, then search those. As a further optimization, you can use the block boundaries to limit the search within each array. In other words, if you land in block AB above, you can rule out part of A and part of B as you search them.
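A sketch of compiling such a block index with a boundary sweep and answering a point query against it (ArrayRange, Block and the function names are illustrative; ranges are treated as closed intervals over long keys):

    #include <algorithm>
    #include <vector>

    struct ArrayRange { long lo, hi; int id; };
    struct Block { long lo, hi; std::vector<int> arrays; };  // arrays covering [lo, hi]

    // Split possibly overlapping array ranges into disjoint blocks, each
    // annotated with the arrays that cover it (|A|AB|B| in the diagram above).
    std::vector<Block> buildBlocks(const std::vector<ArrayRange>& ranges) {
        std::vector<long> bounds;
        for (const auto& r : ranges) {
            bounds.push_back(r.lo);
            bounds.push_back(r.hi + 1);
        }
        std::sort(bounds.begin(), bounds.end());
        bounds.erase(std::unique(bounds.begin(), bounds.end()), bounds.end());

        std::vector<Block> blocks;
        for (size_t i = 0; i + 1 < bounds.size(); ++i) {
            Block b{bounds[i], bounds[i + 1] - 1, {}};
            for (const auto& r : ranges)           // quadratic compile step: fine
                if (r.lo <= b.lo && b.hi <= r.hi)  // for modest counts, otherwise
                    b.arrays.push_back(r.id);      // replace with a sweep line
            if (!b.arrays.empty()) blocks.push_back(b);
        }
        return blocks;
    }

    // Point query: binary search the blocks by lower bound, then check coverage.
    const Block* findBlock(const std::vector<Block>& blocks, long key) {
        auto it = std::upper_bound(blocks.begin(), blocks.end(), key,
            [](long k, const Block& b) { return k < b.lo; });
        if (it == blocks.begin()) return nullptr;
        --it;
        return key <= it->hi ? &*it : nullptr;
    }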

How to compile and update the index efficiently? I suggest an interval tree. This page provides pseudocode. If you are programming in C++, you can use the relevant Boost library to good advantage.

With interval trees, each array is an interval. When you query the tree with a point, you get back all the relevant intervals. These are the arrays that must be searched.


1

Maintain multiple groups of arrays which have disjoint ranges.

When performing the binary search, do it in parallel over these groups, or try the groups in order of size, smallest first.

For every group, maintain the set of ranges, and whenever a new page arrives, attach it to the largest group whose ranges are all disjoint from the new page's range. If the page does not fit into any of the groups, create a new one.

As you said that most of the time the ranges do not overlap, the chance of needing these extra groups is quite small, and yet the algorithm can adapt when such an anomaly occurs.
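As a sketch of the bookkeeping (all names are illustrative), each group can keep its pages' ranges in a std::map keyed by lower bound, which makes the disjointness test for a new page logarithmic:

    #include <cstddef>
    #include <map>
    #include <vector>

    struct Group {
        std::map<long, long> ranges;  // lo -> hi; disjoint by invariant

        // Does [lo, hi] overlap any range already in the group?
        bool overlaps(long lo, long hi) const {
            auto it = ranges.upper_bound(hi);  // first range starting after hi
            if (it == ranges.begin()) return false;
            --it;                              // last range starting at <= hi
            return it->second >= lo;           // does it reach into [lo, hi]?
        }
        void add(long lo, long hi) { ranges.emplace(lo, hi); }
        std::size_t size() const { return ranges.size(); }
    };

    // Attach a new page to the largest group it does not overlap,
    // or start a new group if it overlaps all of them.
    void addPage(std::vector<Group>& groups, long lo, long hi) {
        Group* best = nullptr;
        for (auto& g : groups)
            if (!g.overlaps(lo, hi) && (!best || g.size() > best->size()))
                best = &g;
        if (!best) { groups.emplace_back(); best = &groups.back(); }
        best->add(lo, hi);
    }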

