
As the numpy docs describe, arrays created with the object dtype simply hold references to Python objects stored elsewhere, much like a Python list does. The tobytes() method on such an array returns the raw pointer values, not the referenced data.

I was wondering if it's possible to create an ndarray object from a python list without creating a copy on creation.

For example, passing copy=False to np.asarray when creating an ndarray from a list raises an exception:

import numpy as np

l = ['spam', 'eggs']
arr = np.asarray(l, dtype='object', copy=False) # raises ValueError

I don't know how numpy is storing the underlying data, but it seems like it should be very similar (if not identical) to a python list.

  • list objects don't expose their buffers, so this would only be possible in hacky ways that get around that (maybe with ctypes), and it would have to rely on implementation details. It would also be very brittle (what happens if the buffer gets resized?) Commented Aug 4 at 22:53
  • But of course, I cannot count the number of assumptions I've made here: endianness, size of pointers, plus the obvious: the order, nature and size of the internal structure that holds the internal cpython list container. Also, from there, I cannot easily create an object array (for any other dtype it is quite easy to create an ndarray whose data lives at a given, already existing buffer, but not for object; I suppose you could hack it similarly). Commented Aug 4 at 23:36
  • Bottom line is: yes, with cpython the addresses of the objects in a list are, in practice, stored in a contiguous array of pointers, and so are those of numpy. So you could certainly hack your way into creating a numpy array that lies on the buffer where the addresses of a Python list are stored. But that is clearly UB and implementation dependent. And then, as juanpa.arrivillaga said, what if for some reason Python decides to change the buffer? Commented Aug 4 at 23:41
  • What do you mean by the 'underlying data'? The strings referenced by l won't be copied even when arr has its own data buffer. Both will have some sort of C array of pointers. In effect arr will be a shallow copy of l, with separate methods (like append for the list, and reshape for the array). Commented Aug 5 at 0:26
  • "Why do this?" If it's for performance, where does the list come from, and can you start with a Numpy array to begin with? If not, why not? This is fully X/Y. Commented Aug 5 at 2:00
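The hack the comments describe can be sketched as follows. This is CPython-specific and implementation-dependent (do not rely on it): tobytes() on an object array yields the raw pointer bytes, and in CPython id() happens to be the object's address, so the two can be compared directly.

```python
import numpy as np

# CPython-specific sketch: an object array's data buffer is a C array of
# PyObject* pointers, and id() in CPython returns an object's address, so
# the raw bytes from tobytes() decode to the ids of the elements.
l = ['spam', 'eggs']
arr = np.asarray(l, dtype=object)

# reinterpret the raw buffer as pointer-sized unsigned ints
ptrs = np.frombuffer(arr.tobytes(), dtype=np.uintp)
print(list(ptrs) == [id(x) for x in l])
```

This only shows that both containers hold the same addresses; it does not make the array share the list's buffer, which is the part CPython gives you no supported way to do.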

1 Answer


Make a list of strings:

In [1]: import numpy as np
In [2]: alist = ['one', 'two', 'three']

And an array from that:

In [3]: arr = np.asarray(alist); arr
Out[3]: array(['one', 'two', 'three'], dtype='<U5')

Without an explicit dtype, numpy infers a fixed-width unicode string dtype, '<U5' (occupying 3 strings * 5 chars * 4 bytes/char = 60 bytes).

But with object dtype:

In [4]: arr = np.asarray(alist, dtype=object); arr
Out[4]: array(['one', 'two', 'three'], dtype=object)

This is a shallow copy; the third element of the array is the very same Python string object as the third element of the list:

In [5]: id(alist[2])
Out[5]: 2008637825040

In [6]: id(arr[2])
Out[6]: 2008637825040

If the list contains a mutable object, such as a list of strings:

In [7]: blist = ['one', 'two', 'three', ['a','b']]; barr=np.array(blist,object)

In [8]: blist[3]
Out[8]: ['a', 'b']

In [9]: barr[3]
Out[9]: ['a', 'b']

Modifying that object through one modifies it in the other:

In [10]: barr[3].append('c');barr
Out[10]: array(['one', 'two', 'three', list(['a', 'b', 'c'])], dtype=object)

In [11]: blist
Out[11]: ['one', 'two', 'three', ['a', 'b', 'c']]

But replacing an element of the list with a new value does not change the array:

In [13]: blist[1]=12.3; blist, barr
Out[13]: 
(['one', 12.3, 'three', ['a', 'b', 'c']],
 array(['one', 'two', 'three', list(['a', 'b', 'c'])], dtype=object))

In many ways an object dtype array behaves like a list (e.g. both support a .copy() method), but the methods differ: the list can append, the array can reshape, etc. In general you don't gain much by making an object dtype array. Some operations may be simpler to write, but they are rarely faster.
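A small sketch of that method split: the list grows, the array reshapes, and both keep referencing the same Python objects throughout.

```python
import numpy as np

alist = ['one', 'two', 'three', 'four']
arr = np.asarray(alist, dtype=object)

alist.append('five')       # lists can grow in place
grid = arr.reshape(2, 2)   # arrays can reshape (a view, no element copy)

# the reshaped view still holds the very same Python string objects
print(grid[0, 0] is alist[0])
```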

P.S.

The `copy=False` error message:

In [28]: np.asarray(alist, dtype=object, copy=False)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[28], line 1
----> 1 np.asarray(alist, dtype=object, copy=False)

ValueError: Unable to avoid copy while creating an array as requested.
If using `np.array(obj, copy=False)` replace it with `np.asarray(obj)` to allow a copy when needed (no behavior change in NumPy 1.x).
For more details, see https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword.

is telling us that the default, copy=None, is usually what you want: it copies only when needed. copy=True forces a copy (but it is still a shallow copy). To get a deep copy with object dtype, I think we have to use something like copy.deepcopy - but I haven't fiddled with that in a long time.
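A quick check of that last point. Note the list is deliberately ragged ('x' next to a sublist) so that numpy stores the sublist as a single object element rather than turning it into a second dimension:

```python
import copy
import numpy as np

blist = ['x', ['a', 'b']]
shallow = np.array(blist, dtype=object)  # copy=True by default, but shallow
deep = copy.deepcopy(shallow)            # recursively copies the elements too

blist[1].append('c')                     # mutate through the original list

print(shallow[1])  # shares the inner list, so it sees the append
print(deep[1])     # has its own copy, so it does not
```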


1 Comment

Thanks. I didn't realize that it was a shallow copy. And I completely forgot about the id function to test it. I was originally thinking of a "copy" in numpy as completely duplicating the data in memory, but a shallow copy doesn't do that.
