
I have an algorithm where I need to compute all pair-wise distances between large objects (where the distance is a true metric, so dist(a,b) == dist(b,a), among all the other metric requirements). I have around a thousand objects, so I need to calculate about 500,000 distances. There are a few issues:

  1. All of these 1000 objects are serialized, sitting in individual files on the disk.
  2. Reading them from disk is a huge operation compared to the relatively trivial distance calculation.
  3. I can't have all of them in memory at the same time without swapping, and then thrashing. I can fit about 500 in memory at a time.
  4. This means that at some point I will need to re-read an object into memory, after having read and released the memory at some point previously.

So given that reading from disk is the bottleneck, and that I can't read in more than half at a time, can anyone think of an algorithm for reading and releasing these objects which minimizes the total number of reads?

I considered reading in the first half, doing all of those pair-wise distances, releasing all of that memory and reading the second half, and then doing all of those pair-wise distances. At this point, I still need the distances between the objects in the first half and the objects in the second, and I'm not sure what to do. I also considered having a cache that, when full, randomly chooses a present object to evict and reads the next one, but I feel that there has to be a better option. I considered something like LRU, but it can lead to pathological behavior where the object I need next is exactly the one that was just evicted.

All-in-all, I'm kinda stuck. Any advice?

5 Comments
  • You could try to make them fit in memory :-) Or buy double the memory! The first option is often feasible by cheating a little. For example by keeping them "compressed" in memory, or deduplicating some data. Commented Aug 26, 2013 at 14:55
  • Re: your idea of "pair-wise", you would divide the total in four parts. Check A->B, A->C, A->D, then B->C, B->D, and finally C->D. That should cover all possibilities. (No idea of its actual efficiency.) Commented Aug 26, 2013 at 14:56
  • Are all the object attributes needed for the distance calculation? Commented Aug 26, 2013 at 14:57
  • @AdamBurry No, they're actually not. I'll look into loading only the relevant bits and pieces, thanks. Commented Aug 26, 2013 at 16:52
  • Out of curiosity, what part of the question was wrong? Commented Aug 26, 2013 at 16:53

3 Answers


Load the points in the first half. Compute pairwise distances. Keeping the first half in memory, load the points in the second half one by one, computing distances. Load the entire second half and compute pairwise distances. This is about 1.5 reads per point, which is about optimal in the following model.

The model is that a syntactically correct program for this task is a sequence of instructions of the form LOAD(i, j), whose meaning is to load point i into memory location j. The program is correct if, for every pair of points, there exists an instruction immediately after which both points are in memory.

It is clear that, in a correct program, every point is loaded at least once. Assuming exactly 1000 points and 500 memory locations, it follows that there are at least 500 evictions in favor of points being loaded for the first time. Assume without loss of generality that no point is ever duplicated in memory. After a point p is evicted in favor of a point q being loaded for the first time, p must be reloaded in order for the distance between p and q to be computed. Therefore, there are at least 500 reloads and hence at least 1500 loads.
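The schedule above can be sketched as follows. This is a minimal sketch, not a drop-in implementation: `load` and `dist` are stand-ins for whatever deserialization routine and metric you actually have.

```python
import itertools

def pairwise_distances(paths, load, dist):
    """Compute all pairwise distances among objects stored one per file.

    `paths` is a list of file paths, `load` deserializes one file, and
    `dist` is the (symmetric) metric. Assumes half of the objects fit in
    memory at once. Returns a dict keyed by index pairs (i, j) with i < j.
    """
    n = len(paths)
    half = n // 2
    out = {}

    # Phase 1: load the first half and do its internal pairwise distances.
    first = [load(p) for p in paths[:half]]            # `half` reads
    for i, j in itertools.combinations(range(half), 2):
        out[(i, j)] = dist(first[i], first[j])

    # Phase 2: stream the second half one object at a time, computing
    # each object's distances to everything in the resident first half.
    for j in range(half, n):
        obj = load(paths[j])                           # n - half reads
        for i in range(half):
            out[(i, j)] = dist(first[i], obj)

    # Phase 3: drop the first half, load the whole second half, and do
    # its internal pairwise distances.
    del first
    second = [load(p) for p in paths[half:]]           # n - half re-reads
    for a, b in itertools.combinations(range(half, n), 2):
        out[(a, b)] = dist(second[a - half], second[b - half])

    return out
```

With n = 1000 that is 500 + 500 + 500 = 1500 loads, i.e. the 1.5 reads per point claimed above.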




My first thought (so it might not be the best one):

For each i:

  • Read i-th group of 250. Calculate distances between these.

  • For each remaining group j of 250:

    • Read the j-th group.

    • Calculate distances between the points in the i-th and the j-th groups.

    • Discard group j.

  • Discard group i.

So you'd read the following groups:

i  j
1  2
   3
   4
2  3
   4
3  4
4

You'll notice that reading group 4 by itself is rather pointless - you can just calculate the distances when reading the 3rd (or some other) group.

So you end up with an average of (1 + 2 + 3 + 3)/4 = 2.25 reads per point.
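This schedule can be sketched in code, including the optimization of folding the last group's internal pairs into the previous pass. The `load` and `dist` callables are hypothetical stand-ins; `paths_by_group` would be the four groups of 250 paths (at least two groups assumed).

```python
import itertools

def grouped_distances(paths_by_group, load, dist):
    """Blocked schedule: for each group i, read it, compute its internal
    pairwise distances, then read each later group j for the cross
    distances, discarding j before the next read. Distances are keyed
    by path pairs. Assumes len(paths_by_group) >= 2."""
    out = {}
    n_groups = len(paths_by_group)
    for i in range(n_groups - 1):      # last group is never loaded alone
        gi = {p: load(p) for p in paths_by_group[i]}
        for pa, pb in itertools.combinations(gi, 2):
            out[(pa, pb)] = dist(gi[pa], gi[pb])
        for j in range(i + 1, n_groups):
            gj = {p: load(p) for p in paths_by_group[j]}   # re-read later groups
            if j == n_groups - 1 and i == n_groups - 2:
                # Last group's internal pairs, done while it happens to
                # be in memory anyway (the optimization noted above).
                for pa, pb in itertools.combinations(gj, 2):
                    out[(pa, pb)] = dist(gj[pa], gj[pb])
            for pa in gi:
                for pb in gj:
                    out[(pa, pb)] = dist(gi[pa], gj[pb])
            del gj                     # discard group j before the next read
    return out
```

For four groups this performs 1 + 2 + 3 + 3 group reads, matching the 2.25 reads per point above.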



I'll try to slightly improve on @Dukeling's answer. If I understand correctly, this will be better:

i  j
1  2
   3
   4
2
3
   2

And now you have (1 + 3 + 2 + 1)/4 = 1.75 reads per point.
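Following the elaboration in the comments, one concrete realization of this schedule can be checked with a small simulation. The pair list and the two-group memory budget below are my reading of the answer, not code from it: group 1 streams past 2, 3, 4; then 4 stays resident while 2 and 3 are re-read; then 3 stays resident for a final re-read of 2.

```python
from itertools import combinations

# Each (keep, load) step means the `keep` group stays in memory while
# the `load` group is (re)read from disk next to it.
schedule = [(1, 2), (1, 3), (1, 4), (4, 2), (4, 3), (3, 2)]

memory, reads, covered = set(), 0, set()
for keep, load in schedule:
    if keep not in memory:
        memory = {keep}          # very first step: initial read of the kept group
        reads += 1
    memory = {keep, load}        # evict everything but `keep`, read `load`
    reads += 1
    covered.add(frozenset((keep, load)))
    assert len(memory) <= 2      # never more than two groups (500 objects)

# Every pair of groups meets in memory at some point, in only 7 reads.
assert covered == {frozenset(p) for p in combinations((1, 2, 3, 4), 2)}
assert reads == 7                # 7 group reads / 4 groups = 1.75 per point
```

The per-group read counts come out to 1, 3, 2, 1, matching the (1 + 3 + 2 + 1)/4 = 1.75 figure.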

5 Comments

  • How does this improve the calculation?
  • @Phpdna It does not improve the calculation, but there are fewer loads into memory, and the question said that "Reading them from disk is a huge operation compared to the relatively trivial distance calculation". Please read the question carefully before posting such comments.
  • But how would he compare group i with group j when, after comparing 1 with all the other groups, you just load group 2?
  • Do you mean the last loading?
  • You do understand correctly and this will be better. You may want to elaborate - at the end of i=1, keep 4 in memory and read 2 and 3. Then keep 3 in memory and read 2. But David's is still better.
