Numpy view contiguous part of non-contiguous array as dtype of bigger size

Question

I was trying to generate an array of trigrams (i.e. contiguous three-letter combinations) from a very long char array:

# data is actually loaded from a source file
a = np.random.randint(0, 256, 2**28, 'B').view('c')

Since making a copy is inefficient (and causes problems such as cache misses), I generated the trigrams directly using stride tricks:

tri = np.lib.stride_tricks.as_strided(a, (len(a) - 2, 3), a.strides * 2)

This generates a trigram array of shape (2**28 - 2, 3) where each row is a trigram. Now I want to convert each trigram to a string (i.e. dtype S3) so that numpy displays it more "reasonably" (instead of as individual chars).
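To make the layout concrete, here is a minimal sketch of the same stride trick on a six-byte stand-in for the large buffer (the sample data `b"abcdef"` is an assumption for illustration, not part of the original post):

```python
import numpy as np

# Small stand-in for the large buffer in the question.
a = np.frombuffer(b"abcdef", dtype="S1").copy()

# Each row starts one byte after the previous one and spans three bytes.
tri = np.lib.stride_tricks.as_strided(a, (len(a) - 2, 3), a.strides * 2)

print(tri.shape)                 # (4, 3)
print(tri.strides)               # (1, 1)
print(np.shares_memory(a, tri))  # True: the rows are windows into a, not copies
print(tri[1])                    # [b'b' b'c' b'd']
```

Because the rows overlap in memory, writing through `tri` would change several rows at once, so it is probably safest to treat it as read-only.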

tri = tri.view('S3')

It gives the exception:

ValueError: To change to a dtype of a different size, the array must be C-contiguous

I understand that, in general, data must be contiguous to create a meaningful view, but this data is contiguous "where it should be": the three elements of each row are contiguous.
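The distinction can be checked from the strides and flags. A small sketch, again using a six-byte sample array as an assumed stand-in:

```python
import numpy as np

a = np.frombuffer(b"abcdef", dtype="S1").copy()
tri = np.lib.stride_tricks.as_strided(a, (len(a) - 2, 3), a.strides * 2)

# A C-contiguous (4, 3) array of 1-byte items would have strides (3, 1).
print(tri.strides)                      # (1, 1): consecutive rows overlap
print(tri.flags["C_CONTIGUOUS"])        # False

# The last axis on its own is packed: its stride equals the item size.
print(tri.strides[-1] == tri.itemsize)  # True
```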

So I'm wondering how to view a contiguous part of a non-contiguous np.ndarray as a dtype of bigger size. A more "standard" way would be better, though hackish ways are also welcome. It seems that I can set shape and strides freely with np.lib.stride_tricks.as_strided, but I can't force the dtype, which is the problem here.

EDIT

A non-contiguous array can be made by simple slicing. For example:

np.empty((8, 4), 'uint32')[:, :2].view('uint64')

raises the same exception as above on NumPy versions that require C-contiguity (although from a memory point of view this should be possible). This case is much more common than my example above.
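For reference, a short sketch of why the slice is rejected and what the copy-based workaround costs. `np.arange` replaces `np.empty` here only so the values are deterministic:

```python
import numpy as np

base = np.arange(32, dtype=np.uint32).reshape(8, 4)
X = base[:, :2]

# Each row holds two packed uint32 values, then skips the other two columns.
print(X.strides)                  # (16, 4)
print(X.flags["C_CONTIGUOUS"])    # False

# Copying makes the data contiguous, so the view succeeds,
# but the result no longer shares memory with the original buffer.
Y = np.ascontiguousarray(X).view(np.uint64)
print(Y.shape)                    # (8, 1)
print(np.shares_memory(base, Y))  # False
```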

@AndyK I believe OP wants to avoid the copy that this forces.
– Paul Panzer, Commented Nov 14, 2018 at 9:55
The databuffer for any array is contiguous - one long low level array of bytes. But a view of that buffer might not be 'C' contiguous. In the [:,:2] case there are 2 elements, then a gap, 2 more elements, etc. Look at the flags. Evidently view isn't going the extra step of verifying that the 8 bytes it needs for each uint64 are contiguous.
– hpaulj, Commented Nov 14, 2018 at 17:43

Paul Panzer · Accepted Answer · 2018-11-14 10:02:22Z

4

If you have access to a contiguous array from which your non-contiguous one is derived, it should typically be possible to work around this limitation.

For example your trigrams can be obtained like so:

>>> a = np.random.randint(0, 256, 2**28, 'B').view('c')
>>> a
array([b')', b'\xf2', b'\xf7', ..., b'\xf4', b'\xf1', b'z'], dtype='|S1')
>>> np.lib.stride_tricks.as_strided(a[:0].view('S3'), ((2**28)-2,), (1,))
array([b')\xf2\xf7', b'\xf2\xf7\x14', b'\xf7\x14\x1b', ...,
       b'\xc9\x14\xf4', b'\x14\xf4\xf1', b'\xf4\xf1z'], dtype='|S3')

In fact, this example demonstrates that all we need for view casting is a contiguous "stub" at the memory buffer's base. Afterwards, because as_strided does not do many checks, we are essentially free to do whatever we like.
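A minimal sketch of the stub trick on a tiny array, where the result is easy to check by eye (the six-byte sample data is an assumption for illustration):

```python
import numpy as np

a = np.frombuffer(b"abcdef", dtype="S1").copy()

# a[:0] is empty and therefore counts as contiguous, so the dtype change is allowed.
stub = a[:0].view("S3")

# Re-stride the stub over the original buffer: one 3-byte item per byte offset.
tri = np.lib.stride_tricks.as_strided(stub, (len(a) - 2,), (1,))

print(tri)                       # [b'abc' b'bcd' b'cde' b'def']
print(np.shares_memory(a, tri))  # True
```

Note that as_strided does not verify that the requested shape and strides stay inside the buffer, so the bounds (here `len(a) - 2` items of 3 bytes each) have to be worked out by hand.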

It seems we can always get such a stub by slicing to a size 0 array. For your second example:

>>> X = np.empty((8, 4), 'uint32')[:, :2]
>>> np.lib.stride_tricks.as_strided(X[:0].view(np.uint64), (8, 1), X.strides)
array([[140133325248280],
       [             32],
       [       32083728],
       [       31978800],
       [              0],
       [       29686448],
       [             32],
       [       32362720]], dtype=uint64)
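Both examples follow the same pattern, so they could be wrapped in a small helper. The function below is a hypothetical sketch (`view_last_axis_as` is not a NumPy API), assuming the last axis is packed and its byte length is a multiple of the new item size:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided


def view_last_axis_as(x, dtype):
    """Hypothetical helper: reinterpret the packed last axis of x as a wider dtype, without copying."""
    dtype = np.dtype(dtype)
    if x.ndim == 0 or x.strides[-1] != x.itemsize:
        raise ValueError("last axis must be packed (stride == itemsize)")
    nbytes = x.shape[-1] * x.itemsize
    if nbytes % dtype.itemsize:
        raise ValueError("last axis byte length must be a multiple of the new itemsize")
    # A zero-size slice counts as contiguous, so view() accepts the new dtype.
    stub = x[:0].view(dtype)
    # Keep the outer strides; the last axis now steps by the new item size.
    shape = x.shape[:-1] + (nbytes // dtype.itemsize,)
    strides = x.strides[:-1] + (dtype.itemsize,)
    return as_strided(stub, shape, strides)


base = np.arange(32, dtype=np.uint32).reshape(8, 4)
X = base[:, :2]
wide = view_last_axis_as(X, np.uint64)
print(wide.shape)                    # (8, 1)
print(np.shares_memory(base, wide))  # True
```

The two checks at the top are the only safety net: as_strided itself will not complain about a shape or stride that runs past the end of the buffer.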

edited Nov 14, 2018 at 10:02

answered Nov 14, 2018 at 9:45

Paul Panzer


4 Comments

kuzand Over a year ago

That's interesting, although quite difficult to understand why it works. +1

ZisIsNotZis Over a year ago

Viewing a size-zero array is interesting! I was thinking about somehow creating a correct-dtype array (like a size-one array from viewing bytes), but a size-zero view is definitely more useful!

Mad Physicist Over a year ago

As part of a PR to add some string indexing to numpy, I came up with basically this method. One interesting side effect is that you can always get a contiguous buffer with enough motivation. See here for an example: github.com/numpy/numpy/pull/20694

Mad Physicist Over a year ago

Once #20722 passes, this will no longer be necessary, but what a neat hack.

Mad Physicist · Accepted Answer · 2022-01-06 01:38:43Z

As of numpy 1.23.0, you will be able to do exactly what you want without jumping through extra hoops. I've added PR#20722 to numpy to address pretty much this exact issue. The idea is that if your new dtype is smaller than the current, you can clearly expand a unit or contiguous axis without any problems. If the new dtype is larger, you can shrink a contiguous axis.

With the update, your code runs out of the box:

>>> a = np.random.randint(0, 256, 2**28, 'B').view('c')
>>> a
array([b'\x19', b'\xf9', b'\r', ..., b'\xc3', b'\xa3', b'{'], dtype='|S1')
>>> tri = np.lib.stride_tricks.as_strided(a, (len(a)-2,3), a.strides*2)
>>> tri.view('S3')
array([[b'\x9dB\xeb'],
       [b'B\xebU'],
       [b'\xebU\xa4'],
       ...,
       [b'-\xcbM'],
       [b'\xcbM\x97'],
       [b'M\x97o']], dtype='|S3')

The array has to have a unit dimension or be contiguous in the last axis, which is true in your case.
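If the code has to run on both older and newer NumPy releases, one possible approach is to try the direct view first and fall back to a copy when it is rejected. A minimal sketch on a six-byte sample array (the fallback choice is an assumption, not part of the PR):

```python
import numpy as np

a = np.frombuffer(b"abcdef", dtype="S1").copy()
tri = np.lib.stride_tricks.as_strided(a, (len(a) - 2, 3), a.strides * 2)

try:
    # Accepted by NumPy releases that include the change described above.
    s3 = tri.view("S3")
except ValueError:
    # Older releases reject the view; fall back to a contiguous copy.
    s3 = np.ascontiguousarray(tri).view("S3")

print(s3.shape)    # (4, 1)
print(s3.ravel())  # [b'abc' b'bcd' b'cde' b'def']
```

Either way the last axis shrinks from 3 one-byte items to a single 3-byte item, which is why the result has shape (4, 1) and not (4,).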

I've also added PR#20694 to introduce slicing to the np.char module. If that PR gets accepted as-is, you will be able to do:

>>> np.char.slice_(a.view(f'U{len(a)}'), step=1, chunksize=3)
