
I have an algorithm where I need to compute all pair-wise distances between large objects (where the distance is a true metric, so dist(a,b) == dist(b,a), among all the other metric requirements). I have around a thousand objects, so I need to calculate about 500,000 distances. There are a few issues:

  1. All of these 1000 objects are serialized, sitting in individual files on the disk.
  2. Reading them from disk is a huge operation compared to the relatively trivial distance calculation.
  3. I can't have all of them in memory at the same time without swapping, and then thrashing. I can fit about 500 in memory at a time.
  4. This means that at some point I will need to re-read an object into memory, after having read and released the memory at some point previously.

So given that reading from disk is the bottleneck, and that I can't read in more than half at a time, can anyone think of an algorithm for reading and releasing these objects which minimizes the total number of reads?

I considered reading in the first half, doing all of those pair-wise distances, releasing all of that memory and reading the second half, and then doing all of those pair-wise distances. At this point, I still need the distances between the objects in the first half and the objects in the second, and I'm not sure what to do. I also considered having a cache that, when full, randomly chooses a present object to evict and reads the next one, but I feel that there has to be a better option. I considered something like LRU, but it can lead to pathological behavior where the object I need next is exactly the one that was just evicted.

All-in-all, I'm kinda stuck. Any advice?

5 Comments
  • You could try to make them fit in memory :-) Or buy double the memory! The first option is often feasible by cheating a little. For example by keeping them "compressed" in memory, or deduplicating some data. Commented Aug 26, 2013 at 14:55
  • Re: your idea of "pair-wise", you would divide the total in four parts. Check A->B, A->C, A->D, then B->C, B->D, and finally C->D. That should cover all possibilities. (No idea of its actual efficiency.) Commented Aug 26, 2013 at 14:56
  • Are all the object attributes needed for the distance calculation? Commented Aug 26, 2013 at 14:57
  • @AdamBurry No, they're actually not. I'll look into loading only the relevant bits and pieces, thanks. Commented Aug 26, 2013 at 16:52
  • Out of curiosity, what part of the question was wrong? Commented Aug 26, 2013 at 16:53

3 Answers


Load the points in the first half. Compute pairwise distances. Keeping the first half in memory, load the points in the second half one by one, computing distances. Load the entire second half and compute pairwise distances. This is about 1.5 reads per point, which is about optimal in the following model.

The model is that a syntactically correct program for this task is a sequence of instructions of the form LOAD(i, j), whose meaning is to load point i into memory location j. The program is correct if, for every pair of points, there exists an instruction immediately after which both points are in memory.

It is clear that, in a correct program, every point is loaded at least once. Assuming exactly 1000 points and 500 memory locations, it follows that there are at least 500 evictions in favor of points being loaded for the first time. Assume without loss of generality that no point is ever duplicated in memory. After a point p is evicted in favor of a point q being loaded for the first time, p must be reloaded in order for the distance between p and q to be computed. Therefore, there are at least 500 reloads and hence at least 1500 loads.
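The schedule above can be sketched as follows. This is a minimal sketch, not a drop-in implementation: `load` and `dist` are stand-ins for whatever deserialization routine and metric you actually have.

```python
import itertools

def pairwise_distances(paths, load, dist):
    """Compute all pairwise distances among objects stored one per file.

    `paths` is a list of file paths, `load` deserializes one file, and
    `dist` is the (symmetric) metric. Assumes half of the objects fit in
    memory at once. Returns a dict keyed by index pairs (i, j) with i < j.
    """
    n = len(paths)
    half = n // 2
    out = {}

    # Phase 1: load the first half and do its internal pairwise distances.
    first = [load(p) for p in paths[:half]]            # `half` reads
    for i, j in itertools.combinations(range(half), 2):
        out[(i, j)] = dist(first[i], first[j])

    # Phase 2: stream the second half one object at a time, computing
    # each object's distances to everything in the resident first half.
    for j in range(half, n):
        obj = load(paths[j])                           # n - half reads
        for i in range(half):
            out[(i, j)] = dist(first[i], obj)

    # Phase 3: drop the first half, load the whole second half, and do
    # its internal pairwise distances.
    del first
    second = [load(p) for p in paths[half:]]           # n - half re-reads
    for a, b in itertools.combinations(range(half, n), 2):
        out[(a, b)] = dist(second[a - half], second[b - half])

    return out
```

With n = 1000 that is 500 + 500 + 500 = 1500 loads, i.e. the 1.5 reads per point claimed above.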




My first thought (so it might not be the best one):

For each i:

  • Read i-th group of 250. Calculate distances between these.

  • For each remaining group j of 250:

    • Read the j-th group.

    • Calculate distances between the points in the i-th and the j-th groups.

    • Discard group j.

  • Discard group i.

So you'd read the following groups:

i  j
1  2
   3
   4
2  3
   4
3  4
4

You'll notice that reading group 4 by itself is rather pointless - you can just calculate the distances when reading the 3rd (or some other) group.

So you end up with an average of (1 + 2 + 3 + 3)/4 = 2.25 reads per point.
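This schedule can be sketched in code, including the optimization of folding the last group's internal pairs into the previous pass. The `load` and `dist` callables are hypothetical stand-ins; `paths_by_group` would be the four groups of 250 paths (at least two groups assumed).

```python
import itertools

def grouped_distances(paths_by_group, load, dist):
    """Blocked schedule: for each group i, read it, compute its internal
    pairwise distances, then read each later group j for the cross
    distances, discarding j before the next read. Distances are keyed
    by path pairs. Assumes len(paths_by_group) >= 2."""
    out = {}
    n_groups = len(paths_by_group)
    for i in range(n_groups - 1):      # last group is never loaded alone
        gi = {p: load(p) for p in paths_by_group[i]}
        for pa, pb in itertools.combinations(gi, 2):
            out[(pa, pb)] = dist(gi[pa], gi[pb])
        for j in range(i + 1, n_groups):
            gj = {p: load(p) for p in paths_by_group[j]}   # re-read later groups
            if j == n_groups - 1 and i == n_groups - 2:
                # Last group's internal pairs, done while it happens to
                # be in memory anyway (the optimization noted above).
                for pa, pb in itertools.combinations(gj, 2):
                    out[(pa, pb)] = dist(gj[pa], gj[pb])
            for pa in gi:
                for pb in gj:
                    out[(pa, pb)] = dist(gi[pa], gj[pb])
            del gj                     # discard group j before the next read
    return out
```

For four groups this performs 1 + 2 + 3 + 3 group reads, matching the 2.25 reads per point above.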



I'll try to slightly improve on @Dukeling's answer. If I understand correctly, this will be better:

i  j
1  2
   3
   4
2
3
   2

And now you have (1 + 3 + 2 + 1)/4 = 1.75 reads per point.
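Following the elaboration in the comments, one concrete realization of this schedule can be checked with a small simulation. The pair list and the two-group memory budget below are my reading of the answer, not code from it: group 1 streams past 2, 3, 4; then 4 stays resident while 2 and 3 are re-read; then 3 stays resident for a final re-read of 2.

```python
from itertools import combinations

# Each (keep, load) step means the `keep` group stays in memory while
# the `load` group is (re)read from disk next to it.
schedule = [(1, 2), (1, 3), (1, 4), (4, 2), (4, 3), (3, 2)]

memory, reads, covered = set(), 0, set()
for keep, load in schedule:
    if keep not in memory:
        memory = {keep}          # very first step: initial read of the kept group
        reads += 1
    memory = {keep, load}        # evict everything but `keep`, read `load`
    reads += 1
    covered.add(frozenset((keep, load)))
    assert len(memory) <= 2      # never more than two groups (500 objects)

# Every pair of groups meets in memory at some point, in only 7 reads.
assert covered == {frozenset(p) for p in combinations((1, 2, 3, 4), 2)}
assert reads == 7                # 7 group reads / 4 groups = 1.75 per point
```

The per-group read counts come out to 1, 3, 2, 1, matching the (1 + 3 + 2 + 1)/4 = 1.75 figure.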

5 Comments

  • How does this improve the calculation?
  • @Phpdna It does not improve the calculation, but there are fewer loads into memory, and the question said that "Reading them from disk is a huge operation compared to the relatively trivial distance calculation". Please read the question carefully before posting such comments.
  • But how would he compare group i with group j when, after comparing 1 with all the other groups, you just load group 2?
  • Do you mean the last loading?
  • You do understand correctly and this will be better. You may want to elaborate - at the end of i=1, keep 4 in memory and read 2 and 3. Then keep 3 in memory and read 2. But David's is still better.
