3

I was trying to generate an array of trigrams (i.e. contiguous three-letter combinations) from a very long char array:

# data is actually loaded from a source file
a = np.random.randint(0, 256, 2**28, 'B').view('c')

Since making a copy is inefficient (and causes problems like cache misses), I generated the trigrams directly using stride tricks:

tri = np.lib.stride_tricks.as_strided(a, (len(a) - 2, 3), a.strides * 2)

This generates a trigram array with shape (2**28 - 2, 3), where each row is a trigram. Now I want to convert the trigrams to an array of strings (i.e. S3) so that numpy displays them more "reasonably" (instead of as individual chars).

tri = tri.view('S3')

It gives the exception:

ValueError: To change to a dtype of a different size, the array must be C-contiguous

I understand that data generally has to be contiguous for a view to be meaningful, but this data is contiguous "where it should be": the three elements of each trigram are contiguous.

So I'm wondering how to view the contiguous parts of a non-contiguous np.ndarray as a dtype of bigger size. A more "standard" way would be better, but hackish ways are also welcome. It seems that I can set shape and strides freely with np.lib.stride_tricks.as_strided, but I can't force the dtype, which is the problem here.

EDIT

A non-contiguous array can be made by simple slicing. For example:

np.empty((8, 4), 'uint32')[:, :2].view('uint64')

will throw the same exception as above (although from a memory point of view I should be able to do this). This case is much more common than my example above.
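To make the memory layout of this sliced case concrete, here is a minimal sketch that only inspects flags and strides. The variable name x is just for illustration; nothing here attempts the failing view itself.

```python
import numpy as np

# Keep the first two uint32 columns of an (8, 4) array.
x = np.empty((8, 4), dtype="uint32")[:, :2]

# The slice as a whole is not C-contiguous...
print(x.flags["C_CONTIGUOUS"])   # False

# ...but within each row the two uint32 values sit next to each other:
# the last-axis stride equals the item size (4 bytes), while consecutive
# rows are 16 bytes apart, leaving an 8-byte gap after each pair.
print(x.strides, x.itemsize)     # (16, 4) 4
```

So each row holds 8 adjacent bytes, which is exactly the size of one uint64.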

3
  • What about np.ascontiguousarray(tri).view('S3') ? Commented Nov 14, 2018 at 9:44
  • @AndyK I believe OP wants to avoid the copy that this forces. Commented Nov 14, 2018 at 9:55
  • The databuffer for any array is contiguous - one long low level array of bytes. But a view of that buffer might not be 'C' contiguous. In the [:,:2] case there are 2 elements, then a gap, 2 more elements, etc. Look at the flags. Evidently view isn't going the extra step of verifying that the 8 bytes it needs for each uint64 are contiguous. Commented Nov 14, 2018 at 17:43

2 Answers

4

If you have access to a contiguous array from which your non-contiguous one is derived, it should typically be possible to work around this limitation.

For example, your trigrams can be obtained like so:

>>> a = np.random.randint(0, 256, 2**28, 'B').view('c')
>>> a
array([b')', b'\xf2', b'\xf7', ..., b'\xf4', b'\xf1', b'z'], dtype='|S1')
>>> np.lib.stride_tricks.as_strided(a[:0].view('S3'), ((2**28)-2,), (1,))
array([b')\xf2\xf7', b'\xf2\xf7\x14', b'\xf7\x14\x1b', ...,
       b'\xc9\x14\xf4', b'\x14\xf4\xf1', b'\xf4\xf1z'], dtype='|S3')

In fact, this example demonstrates that all we need for view casting is a contiguous "stub" at the base of the memory buffer; afterwards, because as_strided does not do many checks, we are essentially free to do whatever we like.
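The same stub trick can be checked on a buffer small enough to verify by eye. This is a minimal sketch assuming an 8-byte input built with np.frombuffer; the names stub and tri are illustrative only.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

# A tiny stand-in for the large char array (copied so that it is writable).
a = np.frombuffer(b"abcdefgh", dtype="S1").copy()

# Size-zero slice: counts as contiguous, and its data pointer is a's base.
stub = a[:0].view("S3")

# Re-stride the stub: len(a) - 2 items of 3 bytes each, advancing 1 byte per item.
# The last item covers bytes 5..7, so every read stays inside a's buffer.
tri = as_strided(stub, shape=(len(a) - 2,), strides=(1,))

print(tri)  # [b'abc' b'bcd' b'cde' b'def' b'efg' b'fgh']
```

No bytes are copied, so tri aliases a. Note that as_strided does no bounds checking: a shape larger than len(a) - 2 would read past the end of the buffer.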

It seems we can always get such a stub by slicing to a size 0 array. For your second example:

>>> X = np.empty((8, 4), 'uint32')[:, :2]
>>> np.lib.stride_tricks.as_strided(X[:0].view(np.uint64), (8, 1), X.strides)
array([[140133325248280],
       [             32],
       [       32083728],
       [       31978800],
       [              0],
       [       29686448],
       [             32],
       [       32362720]], dtype=uint64)
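Because np.empty returns uninitialised memory, the numbers above are arbitrary. As a sanity check, here is a sketch with deterministic data (np.arange is an assumption made for illustration) that compares the zero-copy result against the copying approach:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

base = np.arange(32, dtype=np.uint32).reshape(8, 4)
X = base[:, :2]                      # non-contiguous, strides (16, 4)

# Zero-copy: view-cast the size-zero stub, then restore X's strides.
wide = as_strided(X[:0].view(np.uint64), shape=(8, 1), strides=X.strides)

# Reference result obtained by copying first.
expected = np.ascontiguousarray(X).view(np.uint64)

print(np.array_equal(wide, expected))  # True
```

Both arrays interpret the same 8 bytes per row, so the comparison holds regardless of byte order.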

4 Comments

That's interesting, although quite difficult to understand why it works. +1
Viewing a size-zero array is interesting! I was thinking about somehow creating a correct-dtype array (like a size-one array from viewing bytes), but a size-zero view is definitely more useful!
As part of a PR to add some string indexing to numpy, I came up with basically this method. One interesting side effect is that you can always get a contiguous buffer with enough motivation. See here for an example: github.com/numpy/numpy/pull/20694
Once #20722 passes, this will no longer be necessary, but what a neat hack.
3

As of numpy 1.23.0, you will be able to do exactly what you want without jumping through extra hoops. I've added PR #20722 to numpy to address pretty much this exact issue. The idea is that if your new dtype is smaller than the current one, you can clearly expand a unit or contiguous axis without any problems. If the new dtype is larger, you can shrink a contiguous axis.

With the update, your code runs out of the box:

>>> a = np.random.randint(0, 256, 2**28, 'B').view('c')
>>> a
array([b'\x19', b'\xf9', b'\r', ..., b'\xc3', b'\xa3', b'{'], dtype='|S1')
>>> tri = np.lib.stride_tricks.as_strided(a, (len(a)-2,3), a.strides*2)
>>> tri.view('S3')
array([[b'\x9dB\xeb'],
       [b'B\xebU'],
       [b'\xebU\xa4'],
       ...,
       [b'-\xcbM'],
       [b'\xcbM\x97'],
       [b'M\x97o']], dtype='|S3')

The array has to have a unit dimension or be contiguous in the last axis, which is true in your case.
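For a small input the behaviour is easy to check. The sketch below assumes a toy 8-byte buffer; the try/except fallback to the size-zero stub trick is only there so that the snippet also runs on older numpy versions, where the direct view raises ValueError.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

a = np.frombuffer(b"abcdefgh", dtype="S1").copy()
tri = as_strided(a, shape=(len(a) - 2, 3), strides=a.strides * 2)

# The last axis is contiguous: its stride equals the item size (1 byte).
print(tri.strides[-1] == tri.itemsize)  # True

try:
    # Direct view, relying on the relaxed contiguity check.
    tri_s3 = tri.view("S3")
except ValueError:
    # Fallback for older numpy: view-cast a size-zero stub, then re-stride.
    tri_s3 = as_strided(a[:0].view("S3"), shape=(len(a) - 2, 1), strides=(1, 3))

print(tri_s3[:, 0])  # [b'abc' b'bcd' b'cde' b'def' b'efg' b'fgh']
```

Either way the result has shape (6, 1): the last axis of length 3 shrinks to length 1.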


I've also added PR#20694 to introduce slicing to the np.char module. If that PR gets accepted as-is, you will be able to do:

>>> np.char.slice_(a.view(f'U{len(a)}'), step=1, chunksize=3)
