Slicing a file in Python

Question

I have recently been working on a script that takes a file, chunks it, and analyzes each piece. Because the chunking positions depend on the content, I need to read it one byte at a time. I do not need random access; I just read it linearly from beginning to end, selecting certain positions as I go and yielding the content of the chunk from the previously selected position to the current one.

It was very convenient to use a memory-mapped file wrapped in a bytearray. Instead of yielding the chunk, I yield the offset and size of the chunk, leaving the outer function to slice it.
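For illustration, the pattern described above might look roughly like the sketch below. The names `chunk_spans`, `iter_chunks` and the `is_boundary` predicate are hypothetical stand-ins; the real boundary rule depends on the content and is not shown in the question.

```python
import mmap


def chunk_spans(buf, is_boundary):
    """Yield (offset, size) pairs; a chunk ends at each byte where is_boundary is true."""
    start = 0
    for pos in range(len(buf)):
        if is_boundary(buf[pos]):  # indexing a bytearray yields an int
            yield start, pos + 1 - start
            start = pos + 1
    if start < len(buf):
        # Trailing bytes after the last boundary form the final chunk.
        yield start, len(buf) - start


def iter_chunks(path, is_boundary):
    """Outer function: map the file read-only, wrap it, and slice out each chunk."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            data = bytearray(mm)  # same wrapping as bytearray(mmap(<etc>))
        finally:
            mm.close()
    for offset, size in chunk_spans(data, is_boundary):
        yield data[offset:offset + size]
```

For example, `iter_chunks(path, lambda b: b == 10)` would split a file after every newline byte. Note that mapping an empty file raises `ValueError`, so a real script would need to handle that case.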

It was also faster than accumulating the current chunk in a bytearray (and much faster than accumulating in bytes!). But I have certain concerns that I would like to address:

Is bytearray copying the data?
I open the file as rb and the mmap with access=mmap.ACCESS_READ. But bytearray is, in principle, a mutable container. Is this a performance problem? Is there a read only container that I should use?
Because I do not accumulate in the buffer, I access the bytearray (and therefore the underlying file) at arbitrary offsets. Even though it might be buffered, I am afraid that there will be problems depending on the file size and system memory. Is this really a problem?

Are you able to read the sources? They are at python.org/downloads/source, in the Objects folder.
– User, Commented Nov 2, 2014 at 6:26
What Python version? Also, how are you wrapping mmap in your bytearray?
– Veedrac, Commented Nov 2, 2014 at 16:46
@Veedrac I am targeting 2.7 and 3.4. Right now, I am just doing bytearray(mmap(<etc>))
– Hernan, Commented Nov 2, 2014 at 22:46
On Python 3 you can use memoryview, but mmap doesn't support the memoryview protocol on 2.x.
– Veedrac, Commented Nov 3, 2014 at 2:48
Although your explanation is pretty detailed, a piece of code would be very helpful.
– Alex, Commented Nov 15, 2014 at 18:15

Community · Accepted Answer · 2017-05-23 12:09:45Z

1

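Converting an object to a mutable one does incur data copying, so bytearray(mmap(...)) copies the mapped data. You can instead read the file directly into a bytearray:
```
import os

with open(FILENAME, 'rb') as f:
    # Preallocate a buffer the size of the file, then fill it in place.
    data = bytearray(os.path.getsize(FILENAME))
    f.readinto(data)
```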

from http://eli.thegreenplace.net/2011/11/28/less-copies-in-python-with-the-buffer-protocol-and-memoryviews#id12

There is a string-to-bytearray conversion, so there is a potential performance issue.
bytearray is an array, so its size is bounded. Each element is one byte, so the cap is PY_SSIZE_T_MAX (the PY_SSIZE_T_MAX/sizeof(PyObject*) figure applies to lists, which store pointers). For more info, see "How Big can a Python Array Get?"
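As Veedrac's comment on the question suggests, on Python 3 a memoryview over the mmap can serve as a read-only container that does not copy the file. The following is a minimal sketch under that assumption (Python 3 only); the helper name `read_spans` and its `(offset, size)` input are hypothetical, chosen to match the offsets and sizes the question describes.

```python
import mmap


def read_spans(path, spans):
    """Return the bytes of each (offset, size) span without copying the whole file."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm, \
            memoryview(mm) as view:
        # The view is read-only because the mapping uses ACCESS_READ.
        # Slicing a memoryview copies nothing; tobytes() copies only that slice.
        return [view[offset:offset + size].tobytes() for offset, size in spans]
```

Indexing the view (`view[i]`) yields an int, so the byte-by-byte scan can run on it directly. The slices are converted to bytes before the `with` block exits, because the mmap cannot be closed while views of it are still alive.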

edited May 23, 2017 at 12:09

CommunityBot

answered Jan 25, 2015 at 3:59

snowblade

Alex · Accepted Answer · 2014-11-15 18:34:29Z

0

You could do this little hack.

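```
import mmap

class memmap(mmap.mmap):
    def read_byte(self):
        # Python 2 returns a length-1 str; Python 3 already returns an int.
        byte = super(memmap, self).read_byte()
        return byte if isinstance(byte, int) else ord(byte)
```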

This creates a class that inherits from mmap.mmap and overrides the default read_byte, which on Python 2 returns a string of length 1, with one that returns an int. (On Python 3, mmap.read_byte already returns an int.) You can then use this class like any other mmap object.
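A usage sketch of such a subclass for a linear scan, written so that it runs on both Python 2.7 and 3.x. The helper `boundary_offsets` and its `marker` parameter are hypothetical and only illustrate reading one byte at a time.

```python
import mmap


class memmap(mmap.mmap):
    def read_byte(self):
        byte = super(memmap, self).read_byte()
        # Python 3 already returns an int; Python 2 returns a length-1 str.
        return byte if isinstance(byte, int) else ord(byte)


def boundary_offsets(path, marker):
    """Return every offset at which the byte value `marker` occurs."""
    offsets = []
    with open(path, "rb") as f:
        mm = memmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            # read_byte() advances the file position by one on each call.
            for pos in range(len(mm)):
                if mm.read_byte() == marker:
                    offsets.append(pos)
        finally:
            mm.close()
    return offsets
```

Calling `read_byte()` once per byte is simple, but each call goes through a Python-level method, so it may be slower than indexing a buffer directly.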

I hope this helps.

answered Nov 15, 2014 at 18:34

Alex
