How to decode a numpy array of dtype=numpy.string_?

Question

I need to decode, with Python 3, a string that was encoded the following way:

>>> s = numpy.asarray(numpy.string_("hello\nworld"))
>>> s
array(b'hello\nworld', 
      dtype='|S11')

I tried:

>>> str(s)
"b'hello\\nworld'"

>>> s.decode()
AttributeError                            Traceback (most recent call last)
<ipython-input-31-7f8dd6e0676b> in <module>()
----> 1 s.decode()

AttributeError: 'numpy.ndarray' object has no attribute 'decode'

>>> s[0].decode()
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-34-fae1dad6938f> in <module>()
----> 1 s[0].decode()

IndexError: 0-d arrays can't be indexed

hpaulj · Accepted Answer · 2016-10-03 16:18:01Z

3

Another option is the np.char collection of string operations.

In [255]: np.char.decode(s)
Out[255]: 
array('hello\nworld', 
      dtype='<U11')

It accepts an encoding keyword if needed. If you don't need to specify an encoding, .astype is probably the better choice.
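As a rough sketch of the encoding keyword in use, assuming the bytes were written as UTF-8 (the sample text here is made up for illustration and is not from the question):

```python
import numpy as np

# Hypothetical sample: UTF-8 bytes that contain non-ASCII characters.
# np.bytes_ is the type that the older name numpy.string_ referred to.
raw = np.asarray(np.bytes_("héllo\nwörld".encode("utf-8")))

# Decode element-wise, stating the encoding explicitly.
decoded = np.char.decode(raw, encoding="utf-8")

# The result is still a 0-d array; .item() pulls out the Python str.
text = decoded.item()
print(repr(text))
```

Naming the encoding explicitly avoids relying on whatever default the decoding step would otherwise pick.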

This s is 0-d (shape ()), so it has to be indexed with s[()].

In [268]: s[()]
Out[268]: b'hello\nworld'
In [269]: s[()].decode()
Out[269]: 'hello\nworld'

s.item() also works.
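Putting the two 0-d approaches side by side, here is a minimal sketch. The helper name decode_scalar and its utf-8 default are my own assumptions, not part of numpy:

```python
import numpy as np

# Same kind of array as in the question; np.bytes_ is the type that
# numpy.string_ was an alias for.
s = np.asarray(np.bytes_(b"hello\nworld"))

# A 0-d array has an empty shape, so s[0] fails but s[()] works.
via_index = s[()].decode()

# .item() returns the plain Python bytes object.
via_item = s.item().decode()


def decode_scalar(arr, encoding="utf-8"):
    """Hypothetical helper: decode a 0-d bytes array to str."""
    return arr.item().decode(encoding)


print(repr(decode_scalar(s)))
```

Both routes end in an ordinary bytes.decode() call, so the usual encoding and errors arguments are available.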


Kasravnd · Accepted Answer · 2016-10-03 12:27:17Z

1

In Python 3, there are two types that represent sequences of characters: bytes and str (which contains Unicode characters). When you use string_ as your type, numpy stores bytes. If you want a regular str, use the unicode_ type in numpy:

>>> s = numpy.asarray(numpy.unicode_("hello\nworld"))
>>> s
array('hello\nworld', 
      dtype='<U11')

>>> str(s)
'hello\nworld'

But note that if you don't specify a type for your string (string_ or unicode_), numpy uses the default str type, which in Python 3.x holds Unicode characters.

>>> s = numpy.asarray("hello\nworld")
>>> str(s)
'hello\nworld'
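A quick way to check which of the two types an array holds is its dtype kind. This is only a sketch, and it uses np.bytes_ rather than string_, since the string_ and unicode_ aliases are no longer available in NumPy 2.0 and later:

```python
import numpy as np

# Bytes data: dtype kind 'S'.
as_bytes = np.asarray(np.bytes_(b"hello\nworld"))

# A plain Python str: dtype kind 'U'.
as_text = np.asarray("hello\nworld")

bytes_kind = as_bytes.dtype.kind
text_kind = as_text.dtype.kind
print(bytes_kind, text_kind)

# .item() shows the matching Python types.
print(type(as_bytes.item()).__name__, type(as_text.item()).__name__)
```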

edited Oct 3, 2016 at 12:27

answered Oct 3, 2016 at 12:22


3 Comments

PiRK Over a year ago

The reason why I encode with numpy.string_ data is for compatibility. My data goes to a data format called HDF5, and can be potentially read back by other software than just python.

Kasravnd Over a year ago

@PiRK If you want an approach that is compatible across Python versions, you should just use numpy.asarray(); otherwise it has nothing to do with Python.

PiRK Over a year ago

Unfortunately I also need my output HDF5 files to be compatible with old Fortran libraries, various versions of the Octave software, Matlab... etc

Dimitris Fasarakis Hilliard · Accepted Answer · 2016-10-03 12:37:01Z

1

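If my understanding is correct, you can do this with astype, which returns an array with the contents converted to the given type. Note that copy=False only avoids a copy when no conversion is needed, so converting bytes to str still produces a new array: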

>>> s = numpy.asarray(numpy.string_("hello\nworld"))
>>> r = s.astype(str, copy=False)
>>> r 
array('hello\nworld', 
      dtype='<U11')
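Here is a small sketch of what copy=False does and does not do, plus the same conversion on a 1-d array. The names list is invented for illustration. I believe astype(str) decodes the bytes as ASCII, so non-ASCII data may need an explicit decode instead; that case is not exercised here.

```python
import numpy as np

s = np.asarray(np.bytes_(b"hello\nworld"))

# Converting 'S' to 'U' changes the dtype, so a new array is created
# even with copy=False.
r = s.astype(str, copy=False)
converted_is_new = r is not s

# With an unchanged dtype, copy=False hands back the input array itself.
same = s.astype(s.dtype, copy=False)
reused = same is s

# The conversion works the same way on arrays with more dimensions.
names = np.array([b"alpha", b"beta"])  # hypothetical sample data
names_text = names.astype(str)

print(r.item() == "hello\nworld", converted_is_new, reused)
print(names_text.tolist())
```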

edited Oct 3, 2016 at 12:37

answered Oct 3, 2016 at 12:13


3 Comments

PiRK Over a year ago

Thanks! This helps a lot. Now I can recover my string this way: s = str(s.astype(str))

Kasravnd Over a year ago

You don't need to convert the type when you can get the regular str directly with unicode_.

PiRK Over a year ago

I don't control the encoding stage. In my real-world problem, I don't create s myself. I just happen to know that it was written to a file after this encoding stage.
