
Hello, I have a large file that contains Unicode characters, and when I try to open it in Python 3 I get this error:

File "addRNC.py", line 47, in add_rnc()

File "addRNC.py", line 13, in init for value in rawDoc.readline():

File "/usr/local/lib/python3.1/codecs.py", line 300, in decode (result, consumed) = self._buffer_decode(data, self.errors, final)

UnicodeDecodeError: 'utf8' codec can't decode byte 0xd3 in position 158: invalid continuation byte

I have tried everything and nothing worked. Here is the code:

    rawDoc = io.open("/root/potential/rnc_lst.txt", 'r', encoding='utf8')
    result = []
    for value in rawDoc.readline():

        if len(value.split('|')[9]) > 0 and len(value.split('|')[10]) > 0: 
            if value.split('|')[9] == 'ACTIVO' and value.split('|')[10] == 'NORMAL':
                address = ''
                for piece in value.split('|')[4:7]:
                    address += piece
                if value.split('|')[8] != '':
                    rawdate = value.split('|')[8].split('/')
                    _date = rawdate[2]+"-"+rawdate[1]+"-"+rawdate[0]
                else:
                    _date = 'NULL'

                id = db.prepare("SELECT id FROM potentials_reg WHERE(rnc = '%s')"%(value.split('|')[0]))()

                if len(id) == 0:
                    if _date == 'NULL':
                        db.prepare("INSERT INTO potentials_reg (rnc, _name, _owner, work_type, address, telephone, constitution, active)"+ 
                                "VALUES('%s', '%s', '%s', '%s', '%s', '%s', NULL, '%s')"%(value.split('|')[0], value.split('|')[1], 
                                                                        value.split('|')[2],value.split('|')[3],address, 
                                                                        value.split('|')[7], 'true'))()
                    else:
                        db.prepare("INSERT INTO potentials_reg (rnc, _name, _owner, work_type, address, telephone, constitution, active)"+ 
                                "VALUES('%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s')"%(value.split('|')[0], value.split('|')[1], 
                                                                        value.split('|')[2],value.split('|')[3],address, 
                                                                        value.split('|')[7],_date, 'true'))()
                else:
                    pass

    db.close()
    What makes you think that the file is a Unicode file that’s encoded in UTF-8? Byte 0xD3 is a U+201D ʀɪɢʜᴛ ᴅᴏᴜʙʟᴇ Qᴜᴏᴛᴀᴛɪᴏɴ ᴍᴀʀᴋ in the MacRoman encoding, for example. Does the file validate as UTF-8? Commented Feb 1, 2012 at 4:49

1 Answer


Your file actually contains invalid UTF-8.

When you say "contains unicode characters", you should be aware that Unicode doesn't specify how the characters are represented. So even if the file represents Unicode data, it could be in UTF-8, UTF-16 (UTF-16BE or UTF-16LE, each with or without a BOM), the deprecated UCS-2, or perhaps even one of the more esoteric forms...
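
To make the distinction concrete, here is a minimal sketch showing how one Unicode character turns into different bytes depending on the encoding. The character 'Ó' and the three encodings are chosen only for illustration; nothing here says which encoding your file actually uses.

```python
# One Unicode character, three different byte representations.
char = "\u00d3"  # 'Ó', LATIN CAPITAL LETTER O WITH ACUTE

encoded = {
    "utf-8": char.encode("utf-8"),          # two bytes: lead byte + continuation byte
    "utf-16-le": char.encode("utf-16-le"),  # two bytes, little-endian, no BOM
    "latin-1": char.encode("latin-1"),      # a single byte
}

for name, raw in encoded.items():
    print(name, raw)
```

Note that only the Latin-1 form is the lone byte 0xD3; in UTF-8 the same character is the pair 0xC3 0x93.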

Double-check that the file is valid. I'd bet that you indeed have a byte 0xD3 (11010011) at position 158. In UTF-8 such a byte must be the leading byte of a two-byte character, so it has to be followed by a continuation byte (one whose binary representation begins with 10, that is, in the range 0x80 to 0xBF). The error message says that the byte that follows it is not one.
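
One way to double-check is to read the data as raw bytes and look at what surrounds the offending position. The following is only a sketch: `find_invalid_utf8` is a hypothetical helper name, and the sample bytes are made up to stand in for a line of your file.

```python
def find_invalid_utf8(data, context=5):
    """Return (position, nearby bytes) for the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start and exc.end delimit the bytes the decoder rejected.
        start = max(exc.start - context, 0)
        return exc.start, data[start:exc.end + context]
    return None


# Made-up sample: a lone 0xD3 followed by an ASCII letter, not a continuation byte.
sample = b"CONSTRUCCI\xd3N|ACTIVO"
print(find_invalid_utf8(sample))
```

For the real file you would pass in the result of opening it in binary mode (`'rb'`) and calling `read()`, then compare the reported bytes with what you expect to see at that spot.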

The most likely reason for this is that your file contains non-ASCII characters, but isn't in UTF-8.
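
If that is the case, opening the file with its real encoding should make the error go away. Below is a minimal sketch that assumes, purely for illustration, that the data is Latin-1 (ISO-8859-1); the temporary file and its single sample line are invented so the example runs on its own.

```python
import io
import os
import tempfile

# Write a small stand-in file encoded as Latin-1 (an assumption, not a diagnosis).
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write("101|CONSTRUCCI\u00d3N|ACTIVO\n".encode("latin-1"))
    path = tmp.name

# Pass the encoding the file was actually written in, instead of 'utf8'.
with io.open(path, "r", encoding="latin-1") as raw_doc:
    lines = [line.rstrip("\n") for line in raw_doc]

os.remove(path)
print(lines)
```

Be aware that Latin-1 assigns a character to every possible byte, so decoding with it never raises an error. A successful read therefore does not prove the guess is right; check that the decoded text actually makes sense.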


11 Comments

I think it is a Unicode character because at position 158 there is an 'Ó'.
@hidura: Unicode and UTF-8 are not the same thing. Yes, your file contains Unicode characters. That does not mean it is encoded in UTF-8. HTTP://regebro.wordpress.com/2011/03/23/… There is no character at all at position 158; there is a NUMBER. That number is 211 (0xD3). In Latin-1, that's an Ó, correct. In MacRoman, it's a quotation mark. Does the Ó make sense? What is at positions 157 and 159?
@hidura Not all non-English characters are Unicode. Many legacy documents use what are called codepages ( en.wikipedia.org/wiki/Code_page ).
@Borealid: Which contains characters that are all part of Unicode, and hence are Unicode characters. :-)
@LennartRegebro Not exactly true; Unicode consolidated multiple previously-distinct glyphs into single characters in some cases, and the reverse in some cases. See the notes below the table on en.wikipedia.org/wiki/Code_page_437 ; the characters in a code page are tied to an intended visual representation, not a semantic meaning!