I have recently been working on a script that takes a file, chunks it, and analyzes each piece. Because the chunking positions depend on the content, I need to read it one byte at a time. I do not need random access; I just read it linearly from beginning to end, selecting certain positions as I go and yielding the content of the chunk from the previously selected position to the current one.
It was very convenient to use a memory-mapped file wrapped in a bytearray. Instead of yielding the chunk, I yield the offset and size of the chunk, leaving the outer function to slice it.
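The approach described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual script: `is_boundary`, `chunk_spans` and `analyze_file` are hypothetical names, and the newline rule is only a stand-in for the real content-dependent test. It is written for Python 3, where indexing a `bytearray` returns an integer.

```python
import mmap


def is_boundary(byte):
    # Hypothetical stand-in for the real content-dependent rule:
    # cut right after every newline byte.
    return byte == 0x0A


def chunk_spans(data):
    """Yield (offset, size) for each chunk found in one linear pass."""
    start = 0
    for pos in range(len(data)):
        if is_boundary(data[pos]):
            yield start, pos + 1 - start
            start = pos + 1
    if start < len(data):
        # Trailing bytes after the last selected position.
        yield start, len(data) - start


def analyze_file(path):
    """Outer function: slice each chunk using the yielded offset and size."""
    with open(path, "rb") as f:
        # Note: mmap raises ValueError for an empty file.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            data = bytearray(mm)  # the wrapping described above
            return [bytes(data[off:off + size])
                    for off, size in chunk_spans(data)]
        finally:
            mm.close()
```

The generator only tracks two integers, so nothing is accumulated while scanning; the slicing cost is paid once per chunk in the outer function.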
It was also faster than accumulating the current chunk in a bytearray (and much faster than accumulating in bytes!). But I have certain concerns that I would like to address:
- Is bytearray copying the data?
- I open the file as `rb` and the `mmap` with `access=mmap.ACCESS_READ`. But `bytearray` is, in principle, a mutable container. Is this a performance problem? Is there a read-only container that I should use?
- Because I do not accumulate in the buffer, I am random accessing the `bytearray` (and therefore the underlying file). Even though it might be buffered, I am afraid that there will be problems depending on the file size and system memory. Is this really a problem?
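One way to probe the first question empirically is to change the mapping after wrapping it and see which wrapper notices. The following is a minimal sketch, assuming Python 3 (where `mmap` can be wrapped in a `memoryview`); `copy_vs_view` is a hypothetical helper name, and `ACCESS_COPY` is used here instead of `ACCESS_READ` only so that the mapping can be modified without touching the file.

```python
import mmap


def copy_vs_view(path):
    """Return the first byte seen through a bytearray and through a
    memoryview after the mapping itself has been modified."""
    with open(path, "rb") as f:
        # ACCESS_COPY: writes affect memory only, never the file on disk.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)
        try:
            copied = bytearray(mm)        # built from the mapping's contents
            with memoryview(mm) as view:  # refers to the mapping's memory
                mm[0:1] = b"X"            # modify the mapping afterwards
                return bytes(copied[0:1]), bytes(view[0:1])
        finally:
            mm.close()
```

If the `bytearray` still shows the original byte while the `memoryview` shows `X`, the `bytearray` holds its own copy of the data rather than a window onto the mapped file.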
`mmap` in your `bytearray`?
`memoryview` but `mmap` doesn't support the `memoryview` protocol on 2.x.