
Using the NumPy loadtxt and savetxt functions fails whenever non-ASCII characters are involved. These functions are primarily meant for numeric data, but alphanumeric headers/footers are also supported.

Both loadtxt and savetxt appear to apply the latin-1 encoding, which seems at odds with the rest of Python 3, which is thoroughly Unicode-aware and generally uses utf-8 as the default encoding.

Given that NumPy hasn't moved to utf-8 as the default encoding, can I at least change the encoding away from latin-1, either through some documented function/attribute or via a known hack, whether just for loadtxt/savetxt or for NumPy in its entirety?

That this is not possible with Python 2 is forgivable, but it really should not be a problem when using Python 3. I've run into the problem with every combination of Python 3.x and several recent versions of NumPy.

Example code

Consider the file data.txt with the content

# This is π
3.14159265359

Trying to load this with

import numpy as np
pi = np.loadtxt('data.txt')
print(pi)

fails with a UnicodeEncodeError exception, stating that the latin-1 codec can't encode the character '\u03c0' (the π character).

This is frustrating because π is only present in a comment/header line, so there is no reason for loadtxt to even attempt to encode this character.

I can successfully read in the file by explicitly skipping the first row, using pi = np.loadtxt('data.txt', skiprows=1), but it is inconvenient to have to know the exact number of header lines.
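One way around having to count header lines (a sketch, assuming a recent NumPy that accepts an iterable of lines as the first argument to loadtxt) is to decode the file as utf-8 ourselves and drop comment lines before they ever reach NumPy:

```python
import numpy as np

# Recreate the data.txt from the question.
with open('data.txt', 'w', encoding='utf-8') as f:
    f.write('# This is π\n3.14159265359\n')

# Decode the file as utf-8 ourselves and skip comment lines, so loadtxt
# never has to deal with the non-ASCII header.
with open('data.txt', encoding='utf-8') as f:
    data_lines = (line for line in f if not line.lstrip().startswith('#'))
    pi = np.loadtxt(data_lines)

print(pi)  # 3.14159265359
```

This works for any number of header lines, at the cost of filtering comments by hand.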

The same exception is thrown if I try to write a Unicode character using savetxt:

np.savetxt('data.txt', [3.14159265359], header='# This is π')

To accomplish this task successfully, I first have to write the header by some other means, and then save the data to a file object opened with the 'a+b' mode, e.g.

with open('data.txt', 'w') as f:
    f.write('# This is π\n')
with open('data.txt', 'a+b') as f:
    np.savetxt(f, [3.14159265359])

which needless to say is both ugly and inconvenient.

Solution

I settled on the solution by hpaulj, which I thought would be nice to spell out fully. Near the top of my program I now do

import numpy as np

# Replace NumPy's internal latin-1-based string/bytes converters with
# utf-8 equivalents, both where they are defined (compat.py3k) and where
# npyio has already imported them.
asbytes = lambda s: s if isinstance(s, bytes) else str(s).encode('utf-8')
asstr = lambda s: s.decode('utf-8') if isinstance(s, bytes) else str(s)
np.compat.py3k.asbytes = asbytes
np.compat.py3k.asstr = asstr
np.compat.py3k.asunicode = asstr
np.lib.npyio.asbytes = asbytes
np.lib.npyio.asstr = asstr
np.lib.npyio.asunicode = asstr

after which np.loadtxt and np.savetxt handle Unicode correctly.

Note that for newer versions of NumPy (I can confirm 1.14.3, but probably somewhat older versions as well) this trick is not needed, as Unicode now seems to be handled properly by default.
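For those newer versions there is also an explicit route: loadtxt and savetxt gained an `encoding` keyword in NumPy 1.14, so the desired encoding can simply be passed in (a sketch assuming NumPy ≥ 1.14):

```python
import numpy as np

# Recreate the data.txt from the question.
with open('data.txt', 'w', encoding='utf-8') as f:
    f.write('# This is π\n3.14159265359\n')

# NumPy >= 1.14: both functions accept an explicit encoding argument.
pi = np.loadtxt('data.txt', encoding='utf-8')
np.savetxt('data_roundtrip.txt', [pi], header='This is π', encoding='utf-8')
```

Note that savetxt prepends its `comments` prefix (`'# '` by default) to the header, so the `'# '` should not be included in the header string itself.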

  • I've banged my head against this before, but don't recall the details. savetxt insists on writing bytestrings, i.e. the file is opened in 'wb'. But it should be easy to replicate its action with your own file write. It just iterates on rows of your array, and writes a formatted line to the file. If you can format a row of your array satisfactorily, you can write your own csv. Commented Jan 8, 2017 at 0:48
  • Similar issues were previously discussed here: write-numpy-unicode-array-to-a-text-file, reading-unicode-elements-into-numpy-array Commented Jan 8, 2017 at 0:54

2 Answers


At least for savetxt the encodings are handled in

Signature: np.lib.npyio.asbytes(s)
Source:   
    def asbytes(s):
        if isinstance(s, bytes):
            return s
        return str(s).encode('latin1')
File:      /usr/local/lib/python3.5/dist-packages/numpy/compat/py3k.py
Type:      function

Signature: np.lib.npyio.asstr(s)
Source:   
    def asstr(s):
        if isinstance(s, bytes):
            return s.decode('latin1')
        return str(s)
File:      /usr/local/lib/python3.5/dist-packages/numpy/compat/py3k.py
Type:      function

The header is written to the wb file with

        header = header.replace('\n', '\n' + comments)
        fh.write(asbytes(comments + header + newline))

Write numpy unicode array to a text file has some of my previous explorations. There I was focusing on characters in the data, not the header.
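That latin-1 encode of the header is exactly where the exception in the question comes from; a minimal reproduction of just that step, independent of NumPy:

```python
# 'π' (U+03C0) has no latin-1 code point, so the encode raises -- the same
# UnicodeEncodeError that savetxt surfaces when writing the header.
try:
    '# This is π'.encode('latin-1')
except UnicodeEncodeError as exc:
    print(exc.reason)
```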


1 Comment

When is one function used over the other? I find that both loadtxt and savetxt work for my examples just by overwriting np.lib.npyio.asbytes (which is really nice!).

A couple of hacks:

  • Open the file in binary mode, and pass the open file object to loadtxt:

    In [12]: cat data.txt
    # This is π
    3.14159265359
    
    In [13]: with open('data.txt', 'rb') as f:
        ...:     result = np.loadtxt(f)
        ...:     
    
    In [14]: result
    Out[14]: array(3.14159265359)
    
  • Open the file using latin1 encoding, and pass the open file object to loadtxt:

    In [15]: with open('data.txt', encoding='latin1') as f:
        ...:     result = np.loadtxt(f)
        ...:     
    
    In [16]: result
    Out[16]: array(3.14159265359)
    

5 Comments

Both work! Can you explain why the last hack works? Does core Python take care of decoding the file content before it reaches NumPy? Even so, why doesn't it crash when it encounters the π, which is outside of latin-1?
It doesn't encounter π; it sees the two bytes that in hex are 0xCF and 0x80 (the UTF-8 encoding of π), and interprets them as two distinct Latin-1 characters. That's not a problem, though, because they are in a comment so ultimately they are ignored.
Can any utf-8 character be interpreted as several latin-1 characters, or will the above method break on some characters that don't behave so nicely? My choice of π was arbitrary.
According to en.wikipedia.org/wiki/ISO/IEC_8859-1, there are undefined characters in the "Latin 1" encoding. But I just created a test file with 256 commented lines of the form # *\n, where * ranges from 0 to 255, and I was able to read it using the above code. So as long as all the UTF-8 characters are in comments, it should work.
Thanks for the effort, but I don't get why that test is sufficient.
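A more direct argument than the 256-line test file (a sketch of the reasoning, not from the thread): Python's latin-1 codec maps every byte value 0–255 to the Unicode code point with the same number, so decoding arbitrary bytes as latin-1 can never fail, and re-encoding restores the original bytes exactly:

```python
raw = '# This is π\n'.encode('utf-8')  # π becomes the two bytes 0xCF 0x80
text = raw.decode('latin-1')           # never raises: every byte maps to a char
assert text.encode('latin-1') == raw   # and the round trip is lossless

# So any utf-8 text survives the latin-1 detour byte-for-byte, provided the
# decoded characters are only ever ignored (e.g. inside comment lines).
```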
