Smart caching of expensive objects in Python

Question

I have a directory of images in order. Typically my code will be using data from a sequential subset of images (e.g. images 5-10), and the naive options for accessing these are:

1. Create a wrapper object with a method that loads the image when needed and reads my data (e.g. a pixel value). This has little memory overhead but will be slow, as it needs to load each image every time.
2. Store all the images in memory. This will be fast, but obviously there's a limit to how many images we can store.

I would like to find:

Some way to define how to read the image corresponding to an index or a path, and then access, say, magic_image_collection[index] without having to worry about whether it returns the object from memory or reads it afresh. Ideally this would keep the appropriate images, or the n most recently accessed images, in memory.

tito · Accepted Answer · 2012-01-04 16:53:45Z

6

You can subclass the built-in dict and use the __missing__ method to call a loading function when a key is missing:

class ImageDict(dict):
    max_items = 100  # maximum number of images kept in memory

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # keys in load order, oldest first
        self._queue = list(self)

    def __missing__(self, key):
        # called by dict.__getitem__ only when the key is absent
        self[key] = img = self.load(key)
        return img

    def load(self, key):
        # pop the oldest entry from the list and the dict
        if len(self._queue) >= self.max_items:
            self.pop(self._queue.pop(0), None)
        # append this key as the newest entry in the queue
        self._queue.append(key)
        # implement image loading here and return the image instance
        print('loading', key)
        return 'Image for %s' % key

And the output (loading happens only when the key doesn't exist yet):

>>> d = ImageDict()
>>> d[3]
loading 3
'Image for 3'
>>> d[3]
'Image for 3'
>>> d['bleh']
loading bleh
'Image for bleh'
>>> d['bleh']
'Image for bleh'

The load method above also stores only the last N entries in the dict (here N = 100) and purges the oldest ones, by keeping a list of keys for ordering. Note that this evicts by load order: reading a cached image again does not make it "newer".
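If you want the "n most recently accessed" behaviour from the question rather than eviction by load order, one possible variation is a small LRU cache built on collections.OrderedDict. This is only a sketch: LRUImageCache, the loader callable and fake_loader are made-up names for illustration, and the loader stands in for whatever actually reads your image.

```python
from collections import OrderedDict


class LRUImageCache:
    """Keeps at most `maxsize` loaded images, evicting the least recently used."""

    def __init__(self, loader, maxsize=100):
        self._loader = loader          # hypothetical callable: key -> image
        self._maxsize = maxsize
        self._items = OrderedDict()    # oldest access first, newest last

    def __getitem__(self, key):
        if key in self._items:
            self._items.move_to_end(key)      # mark as most recently used
            return self._items[key]
        value = self._loader(key)             # cache miss: load the image
        self._items[key] = value
        if len(self._items) > self._maxsize:
            self._items.popitem(last=False)   # drop the least recently used
        return value

    def __contains__(self, key):
        return key in self._items

    def __len__(self):
        return len(self._items)


loads = []  # records which keys were actually loaded


def fake_loader(key):
    # placeholder for real image loading
    loads.append(key)
    return 'Image for %s' % key


cache = LRUImageCache(fake_loader, maxsize=2)
cache[1]
cache[2]
cache[1]   # hit: 1 becomes the most recently used
cache[3]   # miss: evicts 2, the least recently used
```

When the loader is a plain function of hashable arguments, functools.lru_cache(maxsize=n) from the standard library applies the same policy without a custom class.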

edited Jan 4, 2012 at 16:53

answered Jan 3, 2012 at 15:33

tito


6 Comments

PaulMcG Over a year ago

Get rid of __getitem__ and rename load to __missing__ and you should be okay.

PaulMcG Over a year ago

All you have to do in __missing__ is to either return the appropriate value for the key, or raise an exception. The dict code that called __missing__ will take care of updating the dict (which your class inherited from). To add support for 'last n elements', add a list of keys as a member, and add the key to the end of the list in __missing__. When the list exceeds n, pop the oldest (0'th) key from the list and from self.

tito Over a year ago

Thanks Paul, didn't know __missing__, very nice !

PaulMcG Over a year ago

@tito - if I read the docs correctly, you don't even have to assign to self[key] in __missing__, just return it (or raise an exception if the key is invalid somehow). The doc code will take care of the rest.

YXD Over a year ago

Paul's right there. I'll mark this as correct. @tito mind if I tidy it up a bit though and add the 'last n elements' functionality as Paul suggested?
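A note on the comments above about not assigning to self[key]: according to the Python documentation for dict, d[key] returns (or raises) whatever __missing__ returns (or raises), but the plain dict does not store that value. Only collections.defaultdict inserts it for you. So for caching, the assignment in the answer's __missing__ seems to be needed. A small check, with Plain and Caching as made-up class names:

```python
class Plain(dict):
    def __missing__(self, key):
        # value is returned to the caller but never stored in the dict
        return 'Image for %s' % key


class Caching(dict):
    def __missing__(self, key):
        # value is stored explicitly, so later lookups skip __missing__
        self[key] = value = 'Image for %s' % key
        return value


plain = Plain()
plain_value = plain[3]
plain_stored = 3 in plain        # key was not inserted

caching = Caching()
caching_value = caching[3]
caching_stored = 3 in caching    # key was inserted by the explicit assignment
```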


dcrosta · Answer · 2012-01-03 15:25:36Z

2

Weakrefs aren't what you want -- weakrefs are a way to reference an item that allows the garbage collector to collect (i.e. destroy) the referent if only weakrefs to it exist. In other words, if you create and store only weakrefs to some object, it is likely to be garbage collected quickly, and you won't have benefitted from it.
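To illustrate the point, here is a rough sketch using weakref.WeakValueDictionary as the cache. The Image class is a hypothetical stand-in for an expensive object. Once the last ordinary (strong) reference goes away, the cache entry goes with it, so the cache alone keeps nothing alive.

```python
import gc
import weakref


class Image:
    """Hypothetical stand-in for an expensive image object."""

    def __init__(self, index):
        self.index = index


cache = weakref.WeakValueDictionary()

img = Image(5)
cache[5] = img
present_while_referenced = 5 in cache   # a strong reference (img) still exists

del img        # drop the only strong reference
gc.collect()   # CPython frees the object on del; this covers other implementations
present_after_release = 5 in cache      # the entry has disappeared from the cache
```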

I'd go with your option #1 above. On modern operating systems, the OS maintains an in-memory cache of recently accessed files (or parts thereof). You bear the cost of loading the file from disk once, but subsequent reads of the file will be as fast (or nearly so) as if it were in memory in your application. Note that this caches the file's bytes, not the decoded image, so any decoding cost is still paid on each load. The FS cache is usually an LRU-style cache, so frequently accessed items tend to stay in memory, while infrequently accessed items tend to be evicted (and are loaded from disk again if needed). In most cases it is sufficient to rely on the operating system's implementation of this sort of logic rather than writing your own, especially since you don't have to write and maintain code to do it.
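Option #1 could look roughly like the sketch below. LazyImageCollection is a made-up name, and returning the raw bytes is only a placeholder for real image decoding. Nothing is kept in the application's memory between accesses, so repeated reads rely entirely on the OS file cache.

```python
import os
import tempfile


class LazyImageCollection:
    """Reads the file for an index on every access and keeps nothing in memory."""

    def __init__(self, paths):
        self.paths = list(paths)

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        # placeholder for real decoding: return the raw bytes of the file
        with open(self.paths[index], 'rb') as f:
            return f.read()


# Demo with throwaway files standing in for images.
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(3):
    path = os.path.join(tmpdir, 'img_%d.bin' % i)
    with open(path, 'wb') as f:
        f.write(b'pixels-%d' % i)
    paths.append(path)

images = LazyImageCollection(paths)
first = images[1]    # read from disk (or from the OS cache)
second = images[1]   # read again; the application itself caches nothing
```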

answered Jan 3, 2012 at 15:25

dcrosta


1 Comment

YXD Over a year ago

Thanks for the clarification with weakrefs. I'll try both option #1 and @tito's idea.
