Searching a Unicode file using Python

Question

Setup

I'm writing a script to process and annotate build logs from Visual Studio. The build logs are HTML, and from what I can tell, Unicode (UTF-16?) as well. Here's a snippet from one of the files:

c:\anonyfolder\anonyfile.c(17169) : warning C4701: potentially uninitialized local variable 'object_adrs2' used
c:\anonyfolder\anonyfile.c(17409) : warning C4701: potentially uninitialized local variable 'pclcrd_ptr' used
c:\anonyfolder\anonyfile.c(17440) : warning C4701: potentially uninitialized local variable 'object_adrs2' used

The first 16 bytes of the file look like this:

feff 003c 0068 0074 006d 006c 003e 000d

The rest of the file is littered with null bytes as well.

I'd like to be able to perform string and regular expression searches/matches on these files. However, when I try the following code I get an error message.

buildLog = open(sys.argv[1]).readlines()

for line in buildLog:
    match = u'warning'
    if line.find(match) >= 0:
        print line

The error message:

Traceback (most recent call last):
  File "proclogs.py", line 60, in <module>
    if line.find(match) >= 0:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

Apparently it's choking on the 0xff byte in 0xfeff at the beginning of the file. If I skip the first line, I get no matches:

buildLog = open(sys.argv[1]).readlines()

for line in buildLog[1:]: # Skip the first line.
    match = u'warning'
    if line.find(match) >= 0:
        print line

Likewise, using the non-Unicode match = 'warning' produces no results.

Question

How can I portably search a Unicode file using strings and regular expressions in Python? Additionally, how can I do so such that I can reconstruct the original file? (The goal is to be able to write annotations on the warning lines without mangling the file.)

Have you tried adding a call to decode() as suggested below?
– hughdbrown, Commented Aug 5, 2009 at 21:37
And are you using Python 3.x or some 2.x variant? If the former, you will get strings as unicode.
– hughdbrown, Commented Aug 5, 2009 at 21:38
I withdraw my answer. I tried Alexander Ljungberg's answer and it works perfectly.
– hughdbrown, Commented Aug 5, 2009 at 21:45
Just so you know, feff is a byte-order mark (BOM). en.wikipedia.org/wiki/Byte-order_mark
– Alan Moore, Commented Aug 6, 2009 at 4:29

Alexander Ljungberg · Accepted Answer · 2009-08-05 20:48:48Z

7

Try using the codecs package:

import codecs
import sys
# Decode the file as UTF-16 so that every line is a unicode string.
buildLog = codecs.open(sys.argv[1], "r", "utf-16").readlines()

You may also run into trouble with your print statement, since it may try to encode the strings to your console encoding. If you're only printing for your own review, you could use:

print repr(line)
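To cover the second half of the question (annotating warning lines without mangling the file), here is a minimal sketch of the full round trip. It assumes Python 3, where the built-in open() accepts an encoding. The names annotate_warnings and write_text, the regular expression, and the marker text are all made up for illustration. Passing newline="" turns off newline translation, so the original \r\n line endings survive. Note that writing with "utf-16" emits a BOM in the machine's native byte order, which matches the original file only on a little-endian machine.

```python
import re

# Hypothetical pattern: matches Visual Studio warning codes such as "warning C4701".
WARNING_RE = re.compile(r"warning C\d+")

# Hypothetical marker text; replace with whatever annotation is needed.
MARKER = " <!-- annotated -->"


def annotate_warnings(path, encoding="utf-16"):
    """Return the decoded file text with MARKER appended to each warning line."""
    # newline="" keeps the original line endings attached to each line.
    with open(path, "r", encoding=encoding, newline="") as f:
        lines = f.readlines()

    out = []
    for line in lines:
        if WARNING_RE.search(line):
            body = line.rstrip("\r\n")
            ending = line[len(body):]   # whatever line ending the file used
            line = body + MARKER + ending
        out.append(line)
    return "".join(out)


def write_text(path, text, encoding="utf-16"):
    """Write text back out; the utf-16 codec adds the BOM again."""
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(text)
```

Typical use would be write_text(out_path, annotate_warnings(in_path)). Because the search runs on decoded text, ordinary string methods and the re module work as they would on any other string.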

edited Aug 5, 2009 at 20:48

answered Aug 5, 2009 at 20:42

Alexander Ljungberg


2 Comments

Andrew Keeton Over a year ago

Thanks, this is exactly what I needed. Say I run across a UTF-8 or an ASCII file, will this break?

John Machin Over a year ago

@Andrew Keeton: Of course it will break if you don't change the encoding from "utf-16" to "utf-8" or "ascii" (or "cp1252") as appropriate. See http://www.amk.ca/python/howto/unicode and http://www.joelonsoftware.com/articles/Unicode.html
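One way to pick the encoding per file is to look at the byte-order mark before decoding. This is only a sketch, assuming Python 3: guess_encoding is a hypothetical helper, and the fallback to "utf-8" is an assumption that also covers plain ASCII but will not detect cp1252 or any other encoding that has no BOM. The codecs.BOM_* constants are part of the standard library.

```python
import codecs

# Longer BOMs are checked first, because the UTF-32-LE BOM
# starts with the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(raw, default="utf-8"):
    """Return a codec name based on the BOM at the start of raw (bytes)."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    # No BOM found: fall back to the caller's default (an assumption).
    return default
```

The returned name can be passed straight to bytes.decode() or open(); the "utf-16", "utf-32" and "utf-8-sig" codecs all consume the BOM while decoding.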

Sean · Answer · 2009-08-05 20:55:11Z

0

Have you tried this? When I saved a parsing script containing non-ASCII characters, the interpreter suggested adding an encoding declaration at the top of the file:

Non-ASCII found, yet no encoding declared.  Add a line like:
# -*- coding: cp1252 -*-

Adding that as the first line of the script fixed the problem for me. Not sure if this is what's causing your error, though.

answered Aug 5, 2009 at 20:55

Sean


1 Comment

John Machin Over a year ago

That's certainly not causing his error. Yours is a COMPILE-time problem -- a source file encoded in cp1252 is not so declared. His is a RUN-time problem caused by trying to read a utf16-encoded file as though it were ascii.
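The run-time failure can be reproduced without any file at all. The following is a small illustration, assuming Python 3, with variable names chosen for the example. It shows why decoding UTF-16 bytes as ASCII raises an error, and why a plain byte search for 'warning' finds nothing until the data is decoded.

```python
# "utf-16" encodes with a BOM followed by two bytes per character.
raw = "warning".encode("utf-16")

# Decoding as ASCII fails on the BOM, whose bytes are above 0x7f.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# A byte-level search fails because a NUL byte sits next to every letter.
found_in_bytes = b"warning" in raw

# After decoding with the right codec, the search succeeds.
found_in_text = "warning" in raw.decode("utf-16")
```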
