Bug with Python UTF-16 output and Windows line endings?

Question

With this code:

test.py

import sys
import codecs

sys.stdout = codecs.getwriter('utf-16')(sys.stdout)

print "test1"
print "test2"

Then I run it as:

test.py > test.txt

In Python 2.6 on Windows 2000, I'm finding that the newline characters are being output as the byte sequence \x0D\x0A\x00, which of course is wrong for UTF-16 (a little-endian UTF-16 CR-LF would be \x0D\x00\x0A\x00).

Am I missing something, or is this a bug?

Under Mac OS X it works fine: "fe ff 00" are the first three bytes.
– lutz, Commented Jul 23, 2009 at 5:57
Interesting information but I don't see how it's relevant to the question. I imagine that this issue is only significant for platforms with Windows-style (CR-LF) line endings.
– Craig McQueen, Commented Jul 23, 2009 at 6:12

Glenn Maynard · Accepted Answer · 2009-07-23 08:44:36Z

3

Try this:

import sys
import codecs

if sys.platform == "win32":
    import os, msvcrt
    msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)

class CRLFWrapper(object):
    def __init__(self, output):
        self.output = output

    def write(self, s):
        self.output.write(s.replace("\n", "\r\n"))

    def __getattr__(self, key):
        return getattr(self.output, key)

sys.stdout = CRLFWrapper(codecs.getwriter('utf-16')(sys.stdout))
print "test1"
print "test2"
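To check the byte layout without a Windows console, here is a minimal sketch of the same wrapper writing into an in-memory buffer. The `io.BytesIO` object is only a stand-in I'm assuming for a binary-mode stdout, and the `u""` literals are there so the snippet behaves the same on Python 2 and Python 3:

```python
import codecs
import io


class CRLFWrapper(object):
    """Expand "\n" to "\r\n" in the text *before* it is encoded."""

    def __init__(self, output):
        self.output = output

    def write(self, s):
        self.output.write(s.replace(u"\n", u"\r\n"))

    def __getattr__(self, key):
        # Delegate everything else (flush, fileno, ...) to the wrapped writer.
        return getattr(self.output, key)


# Stand-in for a stdout that has been switched to binary mode.
raw = io.BytesIO()
out = CRLFWrapper(codecs.getwriter("utf-16")(raw))
out.write(u"test1\n")
out.write(u"test2\n")

data = raw.getvalue()
# Decoding with "utf-16" consumes the byte-order mark written by the codec.
decoded = data.decode("utf-16")
```

Because the line-ending expansion happens on the text and the encoding happens afterwards, every CR and LF ends up as a full two-byte UTF-16 code unit.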

answered Jul 23, 2009 at 8:44

Glenn Maynard


2 Comments

Craig McQueen Over a year ago

Looks great. Which leads to two points: 1) can Python support chaining of codecs? 2) if yes, then I suggest Python should provide a codec to deal with line ending conversion and deprecate the use of low-level "binary vs text" file I/O.

Glenn Maynard Over a year ago

Well, CRLF line endings are supposed to be transparent when you open a file in non-binary mode. The reason it's breaking here is that the translation is designed only for byte-stream files, not for word streams like UTF-16. To handle that, it'd need to define some way to tell the file which type it is. I think that just doesn't fit the file design, and I suspect the use case of outputting UTF-16 is too uncommon to jump through those hoops to accommodate it automatically.

Glenn Maynard · Accepted Answer · 2009-07-23 06:15:35Z

3

The newline translation is happening inside the stdout file. You're writing "test1\n" to sys.stdout (a StreamWriter). StreamWriter encodes this as "t\x00e\x00s\x00t\x001\x00\n\x00", and sends it to the real file, the original sys.stdout.

That file doesn't know that you've converted the data to UTF-16; all it knows is that any \n values in the output stream need to be converted to \x0D\x0A, which results in the output you're seeing.
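The effect can be reproduced on any platform with a small sketch. `simulate_text_mode` is a hypothetical helper of mine that mimics what the Windows text-mode layer does to a byte stream; it is not a real library function. I've used "utf-16-le" so there is no byte-order mark cluttering the bytes:

```python
# What the StreamWriter hands to the underlying file for one printed line.
encoded = u"test1\n".encode("utf-16-le")


def simulate_text_mode(data):
    """Hypothetical stand-in for text-mode output on Windows.

    The file layer sees only bytes, so it expands every 0x0A byte
    to 0x0D 0x0A, regardless of the encoding in use.
    """
    return data.replace(b"\n", b"\r\n")


corrupted = simulate_text_mode(encoded)

# What a correctly encoded UTF-16 CR-LF line would look like instead.
expected = u"test1\r\n".encode("utf-16-le")
```

The corrupted stream ends in the three bytes from the question and has an odd length, so every 16-bit unit after the first newline is misaligned.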

answered Jul 23, 2009 at 6:15

Glenn Maynard


2 Comments

Craig McQueen Over a year ago

Thanks, that's insightful and points me in the right direction.

Craig McQueen Over a year ago

I found this Python documentation for codecs.open(): docs.python.org/library/codecs.html#codecs.open which says "Note: Files are always opened in binary mode, even if no binary mode was specified. This is done to avoid data loss due to encodings using 8-bit values. This means that no automatic conversion of '\n' is done on reading and writing." I guess that means this combination of encoding and line-ending translation is hard to deal with.
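A quick way to see that documented behaviour is to write one line through codecs.open() and read the raw bytes back. This is only a sketch; the temporary file is my own scaffolding and not part of the question:

```python
import codecs
import os
import tempfile

# Temporary file used purely for the demonstration.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # codecs.open() forces binary mode underneath, even with mode "w".
    f = codecs.open(path, "w", encoding="utf-16")
    f.write(u"test1\n")
    f.close()

    # Read the raw bytes back to inspect what actually reached the disk.
    with open(path, "rb") as raw:
        data = raw.read()
finally:
    os.remove(path)

text = data.decode("utf-16")
```

The file holds a byte-order mark followed by "test1\n" in UTF-16, with no carriage return added, which matches the "no automatic conversion" note in the docs.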

Craig McQueen · Accepted Answer · 2009-07-23 08:07:41Z

0

I've found two solutions so far, but not one that gives output of UTF-16 with Windows-style line endings.

First, to redirect Python print statements to a file with UTF-16 encoding (the output has Unix-style line endings):

import sys
import codecs

sys.stdout = codecs.open("outputfile.txt", "w", encoding="utf16")

print "test1"
print "test2"

Second, to redirect to stdout with UTF-16 encoding, without the line-ending translation corrupting the output (again with Unix-style line endings; thanks to this ActiveState recipe):

import sys
import codecs

sys.stdout = codecs.getwriter('utf-16')(sys.stdout)

if sys.platform == "win32":
    import os, msvcrt
    msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)

print "test1"
print "test2"
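For the file case, one possible way to get UTF-16 with Windows-style line endings is the io module, which as far as I know has been available since Python 2.6. This is a sketch rather than a drop-in replacement for print: io text files expect unicode strings, so I write u"" literals directly, and the temporary file is just scaffolding for the example:

```python
import io
import os
import tempfile

# Temporary file used purely for the demonstration.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # newline="\r\n" makes the text layer expand "\n" to "\r\n"
    # before the text is encoded to UTF-16.
    with io.open(path, "w", encoding="utf-16", newline="\r\n") as f:
        f.write(u"test1\n")
        f.write(u"test2\n")

    # Read the raw bytes back to inspect the result.
    with io.open(path, "rb") as raw:
        data = raw.read()
finally:
    os.remove(path)

text = data.decode("utf-16")
```

Here the newline translation is done on characters rather than bytes, so each CR and LF is encoded as its own two-byte unit.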

answered Jul 23, 2009 at 8:07

Craig McQueen

