The numpy module is an excellent tool for memory-efficient storage of Python objects, strings among them. For ASCII strings stored in NumPy arrays, only 1 byte per character is used.
However, there is one inconvenience: the stored objects are no longer of type str but bytes, which in most cases means they have to be decoded before further use, which in turn leads to quite bulky code:
>>> import numpy
>>> my_array = numpy.array(['apple', 'pear'], dtype='S5')
>>> print("Mary has an {} and a {}".format(my_array[0], my_array[1]))
Mary has an b'apple' and a b'pear'
>>> print("Mary has an {} and a {}".format(my_array[0].decode('utf-8'),
... my_array[1].decode('utf-8')))
Mary has an apple and a pear
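The 1-byte-per-character claim is easy to verify on the array above (a quick check; the values shown are what NumPy reports for a two-element 'S5' array):
>>> my_array.itemsize   # 5 characters, 1 byte each
5
>>> my_array.nbytes     # 2 elements * 5 bytes
10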
This inconvenience can be eliminated by using another data type, e.g.:
>>> my_array = numpy.array(['apple', 'pear'], dtype='U5')
>>> print("Mary has an {} and a {}".format(my_array[0], my_array[1]))
Mary has an apple and a pear
However, this is achieved only at the cost of a four-fold increase in memory usage:
>>> numpy.info(my_array)
class: ndarray
shape: (2,)
strides: (20,)
itemsize: 20
aligned: True
contiguous: True
fortran: True
data pointer: 0x1a5b020
byteorder: little
byteswap: False
type: <U5
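To make the four-fold difference explicit, the total footprint of the two dtypes can be compared directly (a minimal sketch; the numbers follow from 1 byte vs. 4 bytes per character for two 5-character items):
>>> numpy.array(['apple', 'pear'], dtype='S5').nbytes
10
>>> numpy.array(['apple', 'pear'], dtype='U5').nbytes
40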
Is there a solution that combines the advantages of both: memory-efficient allocation and convenient usage of ASCII strings?