I have recently been working on a script that takes a file, chunks it, and analyzes each piece. Because the chunk boundaries depend on the content, I need to read it one byte at a time. I do not need random access; I just read it linearly from beginning to end, selecting certain positions as I go and yielding the content of the chunk from the previously selected position to the current one.

It was very convenient to use a memory-mapped file wrapped in a bytearray. Instead of yielding the chunk itself, I yield the offset and size of the chunk, leaving the outer function to slice it.
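
A minimal sketch of this approach might look like the following. The boundary rule `is_boundary` (cut after every newline) and the names `iter_chunks` and `chunk_file` are hypothetical stand-ins for illustration; the real selection logic is not shown here. It assumes Python 3 and a non-empty file, since mapping an empty file raises an error.

```python
import mmap


def is_boundary(byte_value):
    # Hypothetical content-dependent rule: cut after every newline byte.
    return byte_value == 0x0A


def iter_chunks(buf):
    """Yield (offset, size) pairs for a buffer whose items are ints."""
    start = 0
    for pos in range(len(buf)):
        if is_boundary(buf[pos]):
            yield start, pos + 1 - start
            start = pos + 1
    if start < len(buf):
        # Trailing data after the last selected position.
        yield start, len(buf) - start


def chunk_file(path):
    """Map the file read-only and return its chunks as bytes objects."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            data = bytearray(mm)  # the wrapping described above
            # The outer function does the slicing from (offset, size).
            return [bytes(data[off:off + size])
                    for off, size in iter_chunks(data)]
        finally:
            mm.close()
```

The generator never accumulates bytes itself; it only tracks the start of the current chunk, which is what makes yielding `(offset, size)` cheap.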

It was also faster than accumulating the current chunk in a bytearray (and much faster than accumulating in bytes!). But I have certain concerns that I would like to address:

  1. Is bytearray copying the data?
  2. I open the file in rb mode and the mmap with access=mmap.ACCESS_READ. But bytearray is, in principle, a mutable container. Is this a performance problem? Is there a read-only container that I should use instead?
  3. Because I do not accumulate in a buffer, I access the bytearray (and therefore the underlying file) by arbitrary offsets when slicing. Even though it might be buffered, I am afraid that there will be problems depending on the file size and system memory. Is this really a problem?
  • 1
    Are you able to read the sources? See python.org/downloads/source; it is in the Objects folder. Commented Nov 2, 2014 at 6:26
  • 1
    What Python version? Also, how are you wrapping mmap in your bytearray? Commented Nov 2, 2014 at 16:46
  • 1
    @Veedrac I am targeting 2.7 and 3.4. Right now, I am just doing bytearray(mmap(<etc>)) Commented Nov 2, 2014 at 22:46
  • 1
    On Python 3 you can use memoryview but mmap doesn't support the memoryview protocol on 2.x. Commented Nov 3, 2014 at 2:48
  • 2
    Although your explanation is pretty detailed, a piece of code would be very helpful. Commented Nov 15, 2014 at 18:15

2 Answers

1
  1. Yes. Constructing a bytearray from another object (here, the mmap) copies the data. You can instead read the file directly into a bytearray with:

    import os

    with open(FILENAME, 'rb') as f:
        # Preallocate a buffer of the file's size and fill it in place.
        data = bytearray(os.path.getsize(FILENAME))
        f.readinto(data)
    

from http://eli.thegreenplace.net/2011/11/28/less-copies-in-python-with-the-buffer-protocol-and-memoryviews#id12

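  2. There is a string-to-bytearray conversion, so there is a potential performance issue.

  3. bytearray is an array, so its size is limited. The PY_SSIZE_T_MAX/sizeof(PyObject*) limit applies to lists of object pointers; a bytearray stores raw bytes, so its limit is PY_SSIZE_T_MAX bytes, and the copy also has to fit in memory. For more info, you can visit How Big can a Python Array Get?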
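
To illustrate the copy, and the memoryview alternative mentioned in the comments for Python 3, here is a small sketch. The scratch file and the variable names are assumptions made for the demonstration, and the first mapping is writable only so that the difference between a copy and a view becomes visible.

```python
import mmap
import os
import tempfile

# Create a small scratch file so the example is self-contained.
fd, path = tempfile.mkstemp()
os.write(fd, b"hello world")
os.close(fd)

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)   # writable mapping, for demonstration only
    copied = bytearray(mm)          # independent copy of the mapped bytes
    with memoryview(mm) as view:    # Python 3: view onto the mapping, no copy
        mm[0:1] = b"J"              # change the mapping after both were created
        copy_first = copied[0]      # unchanged: the bytearray did not follow
        view_first = view[0]        # changed: the view shares the mapped memory
    mm.close()                      # the view must be released before closing

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    with memoryview(mm) as view:
        is_readonly = view.readonly  # a view of an ACCESS_READ mapping is read-only
    mm.close()

os.remove(path)
```

On Python 3, a memoryview over an ACCESS_READ mapping therefore acts as a read-only container that can be indexed and sliced without copying the file into memory.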


0

You could do this little hack.

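    import mmap

    class memmap(mmap.mmap):
        def read_byte(self):
            # Python 2 returns a 1-character str; Python 3 already returns an int.
            b = super(memmap, self).read_byte()
            return b if isinstance(b, int) else ord(b)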

This creates a class that inherits from mmap.mmap and overrides the default read_byte, which returns a string of length 1 on Python 2, with one that returns an int. (On Python 3, read_byte already returns an int.) You can then use this class like any other mmap object.

I hope this helps.
