When creating a Python bytearray from a NumPy array, where does the extra data come from?

Question

Consider two ways of naively making the same bytearray (using Python 2.7.11, but confirmed same behavior in 3.4.3 as well):

In [80]: from array import array

In [81]: import numpy as np    

In [82]: a1 = array('L',  [1, 3, 2, 5, 4])

In [83]: a2 = np.asarray([1,3,2,5,4], dtype=int)

In [84]: b1 = bytearray(a1)

In [85]: b2 = bytearray(a2)

Since both array.array and numpy.ndarray support the buffer protocol, I would expect both to export the same underlying data on conversion to bytearray.

But the data from above:

In [86]: b1
Out[86]: bytearray(b'\x01\x03\x02\x05\x04')

In [87]: b2
Out[87]: bytearray(b'\x01\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00')

At first I thought maybe a naive call to bytearray on a NumPy array will inadvertently get some extra bytes due to data type, contiguity, or some other overhead data.

But even when looking at the NumPy buffer data handle directly, it still says size is 40 and gives the same data:

In [90]: a2.data
Out[90]: <read-write buffer for 0x7fb85d60fee0, size 40, offset 0 at 0x7fb85d668fb0>

In [91]: bytearray(a2.data)
Out[91]: bytearray(b'\x01\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00')

The same thing happens with a2.view():

In [93]: bytearray(a2.view())
Out[93]: bytearray(b'\x01\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00')

I noted that if I gave dtype=np.int32 then the length of bytearray(a2) is 20 instead of 40, suggesting that the extra bytes have to do with type information -- it's just not clear why or how:

In [20]: a2 = np.asarray([1,3,2,5,4], dtype=int)

In [21]: len(bytearray(a2.data))
Out[21]: 40

In [22]: a2 = np.asarray([1,3,2,5,4], dtype=np.int32)

In [23]: len(bytearray(a2.data))
Out[23]: 20

AFAICT, np.int32 ought to correspond to the array 'L' typecode, but any explanations about why not would be massively helpful.

How can one reliably extract only the part of the data that "should" be exported via the buffer protocol, that is, the same bytes that the plain array produces in this case?

What do you mean by "should" be exported? The buffer protocol just specifies how to get the data; it doesn't say anything about what the data should be.
– BrenBarn, Commented Mar 6, 2016 at 22:39
Is this because numpy is 64 bit by default (16 nibbles)? Try to change the byte order (big endian, little endian) and see what happens. See docs.scipy.org/doc/numpy/reference/arrays.dtypes.html
– ralf htp, Commented Mar 6, 2016 at 22:45

BrenBarn · Accepted Answer · 2016-03-06 22:59:03Z

6

When you create your bytearray from the array.array, it is treating it as an iterable of ints, not as a buffer. You can see this because:

>>> bytearray(a1)
bytearray(b'\x01\x03\x02\x05\x04')
>>> bytearray(buffer(a1))
bytearray(b'\x01\x00\x00\x00\x03\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x04\x00\x00\x00')

That is, creating a bytearray directly from the array gives you "plain" ints, but creating a bytearray from a buffer of the array gives you the actual byte representations of those ints. Also, you cannot create a bytearray from an array that has ints that won't fit into a single byte:

>>> bytearray(array.array(b'L', [256]))
Traceback (most recent call last):
  File "<pyshell#38>", line 1, in <module>
    bytearray(array.array(b'L', [256]))
ValueError: byte must be in range(0, 256)
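For readers on Python 3, where bytearray(a1) takes the buffer path, the two interpretations described above can still be reproduced explicitly. This is only a minimal sketch, not code from the answer: list(a1) forces the iterable-of-ints path, and memoryview(a1) forces the buffer path (memoryview stands in here for the Python 2 buffer() builtin used above).

```python
from array import array

a1 = array('L', [1, 3, 2, 5, 4])

# Iterable path: every element must be an int in range(256)
# and ends up as exactly one byte.
as_values = bytearray(list(a1))

# Buffer path: a copy of the array's raw memory,
# a1.itemsize bytes per element (platform dependent for 'L').
as_buffer = bytearray(memoryview(a1))

print(as_values)                        # bytearray(b'\x01\x03\x02\x05\x04')
print(len(as_buffer), 5 * a1.itemsize)  # the two numbers are equal
```

The buffer-path result is the same data that a1.tobytes() returns.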

The behavior is still puzzling, though, because both array.array and np.ndarray support both the buffer protocol and iteration, yet somehow creating a bytearray from an array.array gets the data via iteration, while creating a bytearray from a numpy.ndarray gets the data via the buffer protocol. There is presumably some arcane explanation for this switched priority in the C internals of these two types, but I have no idea what it is.

In any case, it's not really correct to say that what you're seeing with your a1 is what "should" happen; as I showed above, the data '\x01\x03\x02\x05\x04' is not actually what array.array exposes via the buffer protocol. If anything, the behavior with the numpy array is what you "should" get from the buffer protocol; it is the array.array behavior that is not consistent with the buffer protocol.

edited Mar 6, 2016 at 22:59

answered Mar 6, 2016 at 22:53

BrenBarn


7 Comments

Stop harming Monica Over a year ago

Iterating over a numpy.ndarray yields scalars of the array's dtype while an array.array of any integer typecode yields int values, hence the different behaviour.

ely Over a year ago

I more meant that one behavior or other should be "the" expected way. In a least astonishment sense, it's not good that the simple call to bytearray differs for these two types. I would be happy if either one of the two behaviors was the expected default though. In my application, this just means I need to be wary of sticking in a bunch of buffer calls when dealing with array.array.

ely Over a year ago

@Goyo What about the case when I used int as the numpy dtype, and why are the total byte lengths 20 and 40 in the two numpy cases? It seems like more is going on than just a simple story of the array.array ints giving as few bytes as needed (namely 1) while numpy gives 4 bytes always ... that doesn't quite seem to be happening.

BrenBarn Over a year ago

@Goyo: That still doesn't explain it, since bytearray([np.int32(x) for x in 1, 2, 3]) still returns a bytearray with "plain" int values, unlike bytearray(np.array([1, 2, 3], dtype=np.int32)). So it's not just a matter of the individual values.

BrenBarn Over a year ago

@Mr.F: When you specify int as the dtype, numpy just picks a platform default numpy integer type, which is apparently int64 in your case. You can check the dtype of the resulting array to see what its dtype actually is.
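A quick way to do that check, sketched under the assumption that NumPy is installed: the length of the bytearray is simply the number of elements times the itemsize of whatever dtype NumPy picked. The variable names below are illustrative only.

```python
import numpy as np

a_default = np.asarray([1, 3, 2, 5, 4], dtype=int)       # platform default integer
a_int32 = np.asarray([1, 3, 2, 5, 4], dtype=np.int32)    # always 4 bytes per element

for a in (a_default, a_int32):
    # nbytes == size * itemsize, and bytearray(a) copies exactly that many bytes
    print(a.dtype, a.itemsize, a.nbytes, len(bytearray(a)))

print(len(bytearray(a_int32)))   # 20
```

On a platform where the default integer is int64, the first line of the loop reports 8 bytes per element and 40 bytes in total, which matches the output in the question.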


hpaulj · Accepted Answer · answered 2016-03-06 23:40, last edited by Community 2020-06-20 09:12:55Z

5

I get the same bytearray with both cases:

In [1032]: sys.version
Out[1032]: '3.4.3 (default, Mar 26 2015, 22:07:01) \n[GCC 4.9.2]'
In [1033]: from array import array

In [1034]: a1=array('L',[1,3,2,5,4])
In [1035]: a2=np.array([1,3,2,5,4],dtype=np.int32)

In [1036]: bytearray(a1)
Out[1036]: bytearray(b'\x01\x00\x00\x00\x03\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x04\x00\x00\x00')
In [1037]: bytearray(a2)
Out[1037]: bytearray(b'\x01\x00\x00\x00\x03\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x04\x00\x00\x00')

In both cases I have 5 numbers, which occupy 4 bytes each (as 32 bit integers) - 20 bytes.

bytearray is probably asking for the following methods (or something equivalent):

In [1038]: a1.tobytes()
Out[1038]: b'\x01\x00\x00\x00\x03\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x04\x00\x00\x00'
In [1039]: a2.tostring()
Out[1039]: b'\x01\x00\x00\x00\x03\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x04\x00\x00\x00'

I can remove the extra bytes by changing dtype:

In [1059]: a2.astype('i1').tostring()
Out[1059]: b'\x01\x03\x02\x05\x04'
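Note that astype('i1') silently wraps values that do not fit into one byte. A slightly more defensive variant might look like the sketch below; one_byte_per_value is a hypothetical helper name, not a NumPy function, and tobytes() is used because tostring() was deprecated in later NumPy releases.

```python
import numpy as np

def one_byte_per_value(arr):
    """Hypothetical helper: encode each element as one unsigned byte."""
    arr = np.asarray(arr)
    # Refuse values that would be wrapped or truncated by the cast.
    if arr.size and (arr.min() < 0 or arr.max() > 255):
        raise ValueError("values must be in range(0, 256)")
    return arr.astype(np.uint8).tobytes()

a2 = np.array([1, 3, 2, 5, 4], dtype=np.int32)
print(one_byte_per_value(a2))   # b'\x01\x03\x02\x05\x04'
```

This mirrors the ValueError that bytearray raises on the iterable path when an element is outside range(0, 256).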

https://docs.python.org/2.6/c-api/buffer.html

Starting from version 1.6, Python has been providing Python-level buffer objects and a C-level buffer API so that any built-in or user-defined type can expose its characteristics. Both, however, have been deprecated because of various shortcomings, and have been officially removed in Python 3.0 in favour of a new C-level buffer API and a new Python-level object named memoryview.

The new buffer API has been backported to Python 2.6, and the memoryview object has been backported to Python 2.7. It is strongly advised to use them rather than the old APIs, unless you are blocked from doing so for compatibility reasons.

Given these changes in the buffer interface, it's not surprising that the older array module was left unchanged in 2.6 and 2.7 but was updated in 3.0+.

edited Jun 20, 2020 at 9:12

CommunityBot


answered Mar 6, 2016 at 23:40

hpaulj


3 Comments

ely Over a year ago

You are correct. I did not re-check the array.array example in Python 3, only the NumPy example. This has to do with the way that array.array was reimplemented in Python 3 to directly support the buffer protocol. So it seems that this explains why bytearray treats it as an iterable instead in Python 2. bytearray must check first if the passed data supports direct buffer access (array.array does not in Python 2; you must use an indirect idiom). If it does, it gets the data as you show. If it doesn't, as in Python 2, then it falls back to treating it like an iterable of ints.

ely Over a year ago

It is odd, though, that bytearray wouldn't check first if the passed data supported the old-style buffer protocol, and only after both buffer checks would it resort to iterating over the values. This whole topic turns out to be quite important if you are building a library that deals with buffers and also seeks to have compatibility with both Python 2 and Python 3.
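One way to avoid depending on which path bytearray happens to choose is to state the intent explicitly. The following is a rough Python 3 sketch; raw_bytes and byte_values are hypothetical helper names, and on Python 2 the memoryview call would not accept array.array, so the buffer() idiom from the first answer would be needed there instead.

```python
from array import array

def raw_bytes(obj):
    """Copy the raw memory of a buffer-exporting object."""
    # memoryview() raises TypeError for objects without buffer support,
    # so there is no silent fallback to iterating over values.
    return bytearray(memoryview(obj).tobytes())

def byte_values(iterable):
    """Treat each element as one byte; elements must be in range(256)."""
    return bytearray(list(iterable))

a = array('B', [1, 3, 2])
print(raw_bytes(a))     # bytearray(b'\x01\x03\x02')
print(byte_values(a))   # bytearray(b'\x01\x03\x02')
```

For a one-byte typecode such as 'B' the two helpers agree; for wider typecodes raw_bytes returns itemsize bytes per element while byte_values still returns one byte per element.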

hpaulj Over a year ago

It may be a matter of development history. array module has been around forever, but with the growth of numpy it is something of a development backwater. bytearray is new in 2.6. And I think the concept of a buffer protocol belongs in Python3, with a certain amount of backporting to Py2. In Py3, default strings are unicode, bytestrings are the special case.
