How to read large file with unicode in Python 3

Question

Hello, I have a large file that contains Unicode characters, and when I try to open it in Python 3 I get this error:

File "addRNC.py", line 47, in add_rnc()

File "addRNC.py", line 13, in init for value in rawDoc.readline():

File "/usr/local/lib/python3.1/codecs.py", line 300, in decode (result, consumed) = self._buffer_decode(data, self.errors, final)

UnicodeDecodeError: 'utf8' codec can't decode byte 0xd3 in position 158: invalid continuation byte

I have tried everything and nothing worked. Here is the code:

    rawDoc = io.open("/root/potential/rnc_lst.txt", 'r', encoding='utf8')
    result = []
    for value in rawDoc.readline():

        if len(value.split('|')[9]) > 0 and len(value.split('|')[10]) > 0: 
            if value.split('|')[9] == 'ACTIVO' and value.split('|')[10] == 'NORMAL':
                address = ''
                for piece in value.split('|')[4:7]:
                    address += piece
                if value.split('|')[8] != '':
                    rawdate = value.split('|')[8].split('/')
                    _date = rawdate[2]+"-"+rawdate[1]+"-"+rawdate[0]
                else:
                    _date = 'NULL'

                id = db.prepare("SELECT id FROM potentials_reg WHERE(rnc = '%s')"%(value.split('|')[0]))()

                if len(id) == 0:
                    if _date == 'NULL':
                        db.prepare("INSERT INTO potentials_reg (rnc, _name, _owner, work_type, address, telephone, constitution, active)"+ 
                                "VALUES('%s', '%s', '%s', '%s', '%s', '%s', NULL, '%s')"%(value.split('|')[0], value.split('|')[1], 
                                                                        value.split('|')[2],value.split('|')[3],address, 
                                                                        value.split('|')[7], 'true'))()
                    else:
                        db.prepare("INSERT INTO potentials_reg (rnc, _name, _owner, work_type, address, telephone, constitution, active)"+ 
                                "VALUES('%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s')"%(value.split('|')[0], value.split('|')[1], 
                                                                        value.split('|')[2],value.split('|')[3],address, 
                                                                        value.split('|')[7],_date, 'true'))()
                else:
                    pass

    db.close()

What makes you think that the file is a Unicode file that's encoded in UTF-8? Byte 0xD3 is U+201D RIGHT DOUBLE QUOTATION MARK in the MacRoman encoding, for example. Does the file validate as UTF-8?
– tchrist, Commented Feb 1, 2012 at 4:49
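
One way to answer the "does the file validate as UTF-8?" question for a large file is to decode it in chunks and report where decoding first fails. This is only a sketch: the function name `first_invalid_utf8_offset` and the chunk size are made up for illustration, and the offset arithmetic assumes CPython's buffered incremental UTF-8 decoder, whose `getstate()` returns the bytes it is still holding back.

```python
import codecs


def first_invalid_utf8_offset(path, chunk_size=1 << 20):
    """Return None if the file is valid UTF-8, else the byte offset
    at which the first undecodable sequence starts.

    Hypothetical helper: reads the file in binary chunks so the whole
    file never has to fit in memory.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    bytes_read = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            # Bytes of an incomplete sequence carried over from the
            # previous chunk; exc.start below is relative to them.
            pending = len(decoder.getstate()[0])
            try:
                # An empty chunk means end of file, so flush with final=True.
                decoder.decode(chunk, final=not chunk)
            except UnicodeDecodeError as exc:
                return bytes_read - pending + exc.start
            if not chunk:
                return None
            bytes_read += len(chunk)
```

If this returns an offset rather than `None`, the file is not UTF-8 (or is damaged), and the bytes around that offset are the ones worth looking at.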

Borealid · Accepted Answer · 2012-02-01 04:51:03Z

5

Your file actually contains invalid UTF-8.

When you say "contains unicode characters", you should be aware that Unicode doesn't specify how the characters are represented. So even if the file represents Unicode data, it could be in UTF-8, UTF-16 (UTF-16BE or UTF-16LE, each with or without a BOM), the deprecated UCS-2, or perhaps even one of the more esoteric forms...
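
To make that concrete, here is a small sketch that encodes the same single character with a few of Python's standard codecs. The choice of 'Ó' and of these particular encodings is just for illustration.

```python
# One Unicode character (U+00D3), several different byte representations.
text = "\u00d3"  # 'Ó'

encoded = {}
for name in ("utf-8", "utf-16-le", "utf-16-be", "latin-1", "cp1252"):
    encoded[name] = text.encode(name)
    print(name, encoded[name])
```

The character is the same in every case; only the bytes on disk differ, which is why the reader has to know which encoding was used to write the file.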

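Double-check that the file is valid. I'd bet that you do have a byte 0xD3 (11010011) at that position. In UTF-8 that byte can only be the leading byte of a two-byte character, so it must be followed by a continuation byte (binary 10xxxxxx, that is, 0x80 through 0xBF). The "invalid continuation byte" message means this rule is broken at that point: most likely the byte after 0xD3 is not a continuation byte, which is what happens when a character from a single-byte encoding is followed by an ordinary ASCII character.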
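
You can reproduce this kind of error on a tiny byte string and inspect it. The sketch below assumes, purely for illustration, that the text was written as Latin-1; the sample word is made up. In current CPython, `exc.start` is the index of the byte where the bad sequence begins.

```python
# Bytes as a Latin-1 writer would produce them: 'Ó' becomes the single byte 0xD3.
data = "CONSTRUCCI\u00d3N".encode("latin-1")

try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    bad_index = exc.start        # where the undecodable sequence starts
    bad_byte = data[bad_index]   # the byte named in the error message
    reason = exc.reason
    print(hex(bad_byte), "at index", bad_index, "->", reason)
    # The byte after 0xD3 is plain ASCII, not a 10xxxxxx continuation byte.
    next_byte = data[bad_index + 1]
    print("next byte:", hex(next_byte), format(next_byte, "08b"))
```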

The most likely reason for this is that your file contains non-ASCII characters, but isn't in UTF-8.
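
If that turns out to be the case, the fix is to pass the file's real encoding to `io.open`. The sketch below assumes Latin-1, which is only a guess that would have to be confirmed against the actual data; `iter_records` is a hypothetical helper name. It also iterates over the file object itself, which yields one line at a time, whereas iterating over the result of `readline()` walks through the characters of a single line.

```python
import io


def iter_records(path, encoding="latin-1"):
    """Yield each line of a pipe-delimited file as a list of fields.

    The default encoding is an assumption, not something known about
    the file in the question.
    """
    with io.open(path, "r", encoding=encoding) as raw_doc:
        # Iterating the file object reads lazily, one line at a time,
        # so a large file is never loaded into memory all at once.
        for line in raw_doc:
            yield line.rstrip("\n").split("|")
```

For example, `for fields in iter_records("/root/potential/rnc_lst.txt"):` would give one list of fields per line, so each field can be looked up by index without calling `split('|')` repeatedly.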


11 Comments

hidura Over a year ago

I think it is a Unicode character because there is an 'Ó' at position 158.

Lennart Regebro Over a year ago

@hidura: Unicode and UTF-8 are not the same thing. Yes, your file contains Unicode characters. That does not mean it is encoded in UTF-8. http://regebro.wordpress.com/2011/03/23/… There is no character at all at position 158; there is a NUMBER. That number is 211 (0xD3). In Latin-1, that's an Ó, correct. In MacRoman, it's a quotation mark. Does the Ó make sense? What is at positions 157 and 159?
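
A quick way to check the neighbouring bytes is to read a small window around the reported position and decode it with a few candidate encodings. This is a sketch only: the helper names `bytes_around` and `decode_candidates` are invented here, the list of candidate encodings is arbitrary, and it assumes the reported position is a byte offset from the start of the file.

```python
def bytes_around(path, position, radius=8):
    """Return the raw bytes from position - radius to position + radius."""
    start = max(0, position - radius)
    with open(path, "rb") as f:
        f.seek(start)  # jump straight there; no need to read the whole file
        return f.read(position - start + radius + 1)


def decode_candidates(window, encodings=("latin-1", "cp1252", "mac_roman", "cp437")):
    """Decode the same bytes under several encodings for side-by-side comparison."""
    return {name: window.decode(name, errors="replace") for name in encodings}
```

Whichever decoding produces a sensible word around the byte in question is the most plausible encoding for the file.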

Borealid Over a year ago

@hidura Not all non-English characters are Unicode. Many legacy documents use what are called codepages ( en.wikipedia.org/wiki/Code_page ).

Lennart Regebro Over a year ago

@Borealid: Which contains characters that are all part of Unicode, and hence are Unicode characters. :-)

Borealid Over a year ago

@LennartRegebro Not exactly true; Unicode consolidated multiple previously-distinct glyphs into single characters in some cases, and the reverse in some cases. See the notes below the table on en.wikipedia.org/wiki/Code_page_437 ; the characters in a code page are tied to an intended visual representation, not a semantic meaning!
