Python in-memory files for caching large files

Question

I am doing very large data processing (16GB) and I would like to try and speed up the operations through storing the entire file in RAM in order to deal with disk latency.

I looked into the existing libraries but couldn't find anything that would give me the flexibility of interface.

Ideally, I would like to use something with integrated read_line() method so that the behavior is similar to the standard file reading interface.

You're looking for io.StringIO. Like this: stream = io.StringIO(open("file").read()) — Homer512
– Homer512, Commented Jul 19, 2022 at 20:59
Alternatively, you could use the mmap module. That saves you the trouble of prefetching the whole file if you just want quick cached reads or random access. docs.python.org/3.8/library/mmap.html — Homer512
– Homer512, Commented Jul 19, 2022 at 21:08
@Homer512 I am already using mmap, but it simply maps the file in the memory address space without actually pre-loading the entire thing into RAM. I looked at io.StringIO before, but I didn't realize you could use the file handler as an input. I will give it a shot, thanks. — AdmiralFishHead
– AdmiralFishHead, Commented Jul 19, 2022 at 21:30
@Homer512 - The main issue with StringIO is that it's oriented towards using strings. My files are read in binary mode with subsequent conversion into strings (I use indexed entries and look up tables). Basically I need to use the raw file as a binary file for quick parsing of the starting point of text data, but I also need to have a way of itterating through data to find the end sequence of the data. StringIO and BytesIO don't offer that option. — AdmiralFishHead
– AdmiralFishHead, Commented Jul 20, 2022 at 2:41
Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. — Community
– Community Bot, Commented Jul 20, 2022 at 5:56

Mostafa Farzán · Accepted Answer · 2022-07-24 12:47:17Z

2

Just read the file once. The OS will cache it for the next read operations. It's called the page cache.

For example, run this code and see what happens (replace the path with one pointing to a file on your system):

import time

filepath = '/home/mostafa/Videos/00053.MTS'
n_bytes = 200_000_000  # read 200 MB

t0 = time.perf_counter()
with open(filepath, 'rb') as f:
    f.read(n_bytes)
t1 = time.perf_counter()
with open(filepath, 'rb') as f:
    f.read(n_bytes)
t2 = time.perf_counter()

print(f'duration 1 = {t1 - t0}')
print(f'duration 2 = {t2 - t1}')

On my Linux system, I get:

duration 1 = 2.1419905399670824
duration 2 = 0.10992361599346623

Note how faster the second read operation is, even though the file was closed and reopened. This is because, in the second read operation, the file is being read from RAM, thanks to the Linux kernel. Also note that if you run this program again, you'll see that both operations finish quickly, because the file was cached from the previous run:

duration 1 = 0.10143039299873635
duration 2 = 0.08972924604313448

So in short, just use this code to preload the file into RAM:

with open(filepath, 'rb') as f:
    f.read()

Finally, it's worth mentioning that page cache is practically non-deterministic, i.e. your file might not get fully cached, especially if the file size is big compared to the RAM. But in cases like that, caching the file might not be practical after all. In short, default OS page caching is not the most sophisticated way, but it is good enough for many use cases.

edited Jul 24, 2022 at 12:47

answered Jul 19, 2022 at 21:35

Mostafa Farzán

1,0132 gold badges14 silver badges32 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

AdmiralFishHead Over a year ago

Thanks, and nevermind my previous comment, it appears that in Python mmap does indeed cache the entire thing into RAM. I am not sure how mmap will behave with JIT libraries tho. Also simply doing read() on the whole file will the same trick, albeit with 25% larger memory footprint than mmap.

Homer512 Over a year ago

@AdmiralFishHead it's not the mmap module that does the caching. It's the operating system via its page-cache. So this works the same across all implementations. You can also use mmap.madvise(mmap.MADV_WILLNEED, ...) as a prefetch hint.

AdmiralFishHead · Accepted Answer · 2022-07-21 23:43:29Z

0

Homer512 answered my question somewhat, mmap is a nice option that results in caching.

Alternatively, Numpy has some interfaces for parsing byte streams from a file directly into memory and it also comes with some string-related operations.

answered Jul 21, 2022 at 23:43

AdmiralFishHead

11 silver badge2 bronze badges

1 Comment

Community Over a year ago

Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.

Collectives™ on Stack Overflow

Python in-memory files for caching large files

2 Answers 2

2 Comments

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related