I have an algorithm where I need to compute all pair-wise distances between large objects (where the distance is a true metric, so dist(a,b) == dist(b,a), among all the other metric requirements). I have around a thousand objects, so I need to calculate about 500,000 distances. There are a few issues:
- All of these 1000 objects are serialized, sitting in individual files on the disk.
- Reading them from disk is a huge operation compared to the relatively trivial distance calculation.
- I can't hold all of them in memory at once without swapping and then thrashing; I can fit about 500 in memory at a time.
- This means that at some point I will need to re-read an object that I had already read and then released earlier.
So given that reading from disk is the bottleneck, and that I can't read in more than half of the objects at a time, can anyone think of an algorithm for reading and releasing these objects that minimizes the total number of reads?
I considered reading in the first half, computing all of its pair-wise distances, releasing that memory, then reading the second half and computing all of its pair-wise distances. At that point I still need the distances between the objects in the first half and the objects in the second, and I'm not sure what to do; there's a rough sketch of this blocked approach below. I also considered having a cache that, when full, evicts a randomly chosen object to make room for the next read, but I feel there has to be a better option. I considered something like LRU as well, but with an access pattern like this it can keep evicting exactly the object I need for the very next calculation.
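To make the blocked idea concrete, here's a rough sketch in Python. `load(path)` and `dist(a, b)` are stand-ins for my actual deserialization and metric code, and the read counts in the comments assume the ~1000 objects are split in half:

```python
import itertools

def intra_block_distances(block, load, dist, out):
    """Load one block that fits in memory and compute all pairs inside it."""
    objs = {p: load(p) for p in block}            # one read per object
    for p1, p2 in itertools.combinations(block, 2):
        out[(p1, p2)] = dist(objs[p1], objs[p2])
    # objs goes out of scope here, so the block's memory is released

def blocked_all_pairs(paths, load, dist):
    out = {}
    half = len(paths) // 2
    first, second = paths[:half], paths[half:]

    intra_block_distances(first, load, dist, out)   # ~500 reads
    intra_block_distances(second, load, dist, out)  # ~500 reads

    # The part I'm stuck on: first-half x second-half pairs. The naive
    # version below re-reads the whole first half (~500 reads) and then
    # streams the second half one object at a time (~500 more reads),
    # holding roughly 501 objects at the peak -- right at the edge of
    # my memory budget.
    firsts = {p: load(p) for p in first}
    for p2 in second:
        o2 = load(p2)
        for p1 in first:
            out[(p1, p2)] = dist(firsts[p1], o2)
    return out
```

As written, that's about 2,000 reads total (every object gets read twice), and I don't know whether that's the best I can do.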
All in all, I'm kinda stuck. Any advice?