
I'm trying to use mmap to load a dictionary from a file. I'll explain my problem with a simplified example. In reality I have 10 files which have to be loaded within milliseconds (or at least act as if they were loaded).

So let's say I have a 50 MB dictionary. My program should find a value by key in under 1 second. Searching in this dictionary is not the problem; that can be done far under 1 second. The problem is that when somebody types something into the text field and presses enter, the program starts loading the dictionary into memory so it can look up the key. This loading can take several seconds, but I have to return a result in under 1 second (the dictionary can't be loaded before enter is pressed). So I was recommended the mmap module, which should be far faster.

I can't find a good example via Google. I've tried this (I know it's an incorrect use):

import mmap
import cPickle

def loadDict():
    with open('dict', 'r+b') as f: # used pickle to save
        fmap = mmap.mmap(f.fileno(), 0)
        dictionary = cPickle.load(fmap)  # still deserialises the whole pickle
    return dictionary

def search(pattern):
    dictionary = loadDict()
    return dictionary[pattern]

search('apple')  # <- it still takes many seconds

Could you give me a good example of proper mmap use?

  • Why do you assume that mmap is faster than normal file IO functions? Commented Oct 19, 2014 at 10:50
  • @Kay Because I was told (at my university) that mmap loads only the part of a file that is needed at a given moment, so it doesn't need to load the whole file into memory, which takes many seconds. That's the reason why I should use mmap. Commented Oct 19, 2014 at 10:52
  • The purpose of mmap is to map a file into memory and implement demand paging. This means a particular segment will only be read from disk into memory the first time you access it (but then stays in memory). That means that repeatedly accessing the same chunks of a file and seeking back and forth in the file will be very fast. But since for your purpose you basically need random access to the entire file, using mmap is obviously not going to help here; it may instead make things worse. Commented Oct 19, 2014 at 10:54
  • "dictionary can't be loaded before pressing enter" - why? Commented Oct 19, 2014 at 10:55
  • Maybe a sqlite database instead of pickling would be an option? Commented Oct 19, 2014 at 10:56
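
To illustrate the sqlite suggestion in the last comment: instead of unpickling a 50 MB dict on every search, the pairs could be written once into an indexed table, after which single lookups need no loading step at all. A rough sketch (the file name dict.db and table name kv are made up for the example; dictionary is assumed to be the already-unpickled dict):

import sqlite3

# one-time conversion: dump the unpickled dictionary into an indexed table
conn = sqlite3.connect("dict.db")
conn.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)")
conn.executemany("INSERT OR REPLACE INTO kv VALUES (?, ?)", dictionary.items())
conn.commit()

# at query time there is nothing to load; the PRIMARY KEY index makes lookups near-instant
def search(pattern):
    row = conn.execute("SELECT value FROM kv WHERE key = ?", (pattern,)).fetchone()
    return row[0] if row else None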

1 Answer


Using an example file of 2,400,000 key/value pairs (52.7 megabytes), such as:

key1,value1
key2,value2
etc., etc.

Creating the example file:

with open("stacktest.txt", "a") as f: 
    contents = ["key" + str(i) + ",value" + str(i) for i in range(2400000)]
    f.write("\n".join(contents) + "\n")

What is actually slow is having to construct the dictionary, not reading the file. Reading a 50 MB file is fast enough, and finding a value in a wall of text of this size is also fast. Using plain string search, you will be able to find a single value in under 1 second.
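
A quick way to see that split (a hypothetical timing harness, assuming the pickled 'dict' file from the question):

import time
import cPickle

t0 = time.time()
with open("dict", "rb") as f:
    raw = f.read()                   # raw read of ~50 MB: typically a fraction of a second
t1 = time.time()
dictionary = cPickle.loads(raw)      # rebuilding the dict object is the slow part
t2 = time.time()
print "read: %.2fs  unpickle: %.2fs" % (t1 - t0, t2 - t1)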

Since I know the structure of my file, I can use this shortcut; it should be tuned to your exact file structure, though: read in the whole file and search it manually for the known pattern (find the unique key string in the text, then use the comma and newline delimiters to cut out the value).

with open("stacktest.txt") as f: 
    bigfile = f.read()
    my_key = "key2399999"
    start = bigfile.find(my_key)
    comma = bigfile[start:start+1000].find(",") + 1
    end = bigfile[start:start+1000].find("\n")
    print bigfile[start+comma:start+end]
    # value2399999

Timing for it all: 0.43s on average

Mission accomplished?
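
For completeness, since the question asked for a proper mmap example: as the comments explain, mmap pages data in on demand, so the same substring search can run against a mapping instead of an explicit f.read(). A minimal sketch against the example file above (Python 2; whether this actually beats a plain read depends on the OS page cache):

import mmap

with open("stacktest.txt", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    needle = "\nkey2399999,"                 # anchor on newline+key+comma to avoid partial matches
    start = mm.find(needle) + len(needle)    # mmap objects support find() directly
    end = mm.find("\n", start)
    print mm[start:end]                      # value2399999 (the file's first line would need special-casing)
    mm.close()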
