1

I am trying to read a dot file containing:

graph {
    KZJLCHYE -- DJTGWUZZ;
    PNLWKOXF -- BFSIOMPY;
    ...
}

But when I try to read the dot file, I get "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte". Is there a way to read the contents of a dot file in Python using only the standard library?

2
  • Could you add the first few (raw) bytes of the file to your question? Something like hexdump -C -n 16 yourfile.dot Commented Dec 2, 2015 at 7:05
  • @JeremyKerr here are the raw bytes: 00000000 d0 cf 11 e0 a1 b1 1a e1 00 00 00 00 00 00 00 00 |................| 00000010 Commented Dec 2, 2015 at 14:16
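
Those eight bytes match the signature of the OLE2 compound file container used by legacy Microsoft Office documents (Word templates also use the .dot extension), which would suggest the file is not Graphviz text at all. A minimal sketch for checking this with the standard library; the helper name looks_like_ole2 is made up for illustration:

```python
# Signature of the OLE2 / Compound File Binary container used by legacy
# Microsoft Office files (.doc, .xls, and Word .dot templates).
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def looks_like_ole2(path):
    """Return True if the file at `path` starts with the OLE2 signature."""
    with open(path, "rb") as f:
        return f.read(len(OLE2_MAGIC)) == OLE2_MAGIC
```

If looks_like_ole2("yourfile.dot") returns True, no choice of text encoding is likely to produce the graph description, because the file is a binary container rather than text.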

2 Answers

1

Encoding of text files is a murky subject that will never be completely resolved. You either need to guess the encoding or you have a corrupted (or binary) file on your hands:

  1. To guess the encoding, try opening the file in any advanced text editor and see whether it guesses the encoding for you and/or highlights problematic characters.

  2. If you don't care about the bad character at position 0, you can instruct Python to ignore it. See the Python 3 manual: open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None). Just set errors='ignore'. Python 3 handles encodings better than Python 2, so it would help if you mentioned which version you are using.

  3. Read the file as a binary stream and deal with bad characters when converting it to str: open(file, 'rb'). Again, your options for decoding depend on the Python version, so I cannot elaborate further.
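
For Python 3, options 2 and 3 might look roughly like the sketch below. The function names are hypothetical, and UTF-8 is only an assumed default, not something known about the file:

```python
def read_text_lenient(path, encoding="utf-8"):
    """Option 2: decode as text, silently dropping undecodable bytes."""
    with open(path, encoding=encoding, errors="ignore") as f:
        return f.read()


def read_bytes_then_decode(path, encoding="utf-8"):
    """Option 3: read raw bytes, then decode, marking bad bytes with U+FFFD."""
    with open(path, "rb") as f:
        raw = f.read()
    # Keeping `raw` around lets you inspect the bytes that failed to decode.
    return raw, raw.decode(encoding, errors="replace")
```

Using errors='replace' instead of errors='ignore' in the second function keeps a visible marker wherever bytes could not be decoded, which makes it easier to see how much of the file is actually text.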


4 Comments

I am using Python3. I ignored the errors, but nothing from the file is read. I'm assuming this is because the entire file raises errors. I also tried reading the binary stream and converting back to a string, but I get illegible output ().
@sikez It is possible that you are reading the file with the wrong encoding. For example, if Python tries to read a 3-byte UTF-8 character but the text is really encoded as 4-byte UTF-32 characters, every character boundary will be wrong. See my step 1: open it in a good editor (I use emacs) and it will guess the encoding name. You can also use the third-party Python library chardet to guess the encoding, or try some mainstream ones: open(file, encoding='latin-1'), open(file, encoding='utf-16'), open(file, encoding='utf-32')...
Thanks a lot! I used chardet, and found out the encoding was actually ISO-8859-2.
You might have a horrible mix of text and binary data generated by "we don't need no standards" software. Since ISO-8859-2 is a single-byte codec, its failure suggests that the text is encoded in one of the multi-byte Unicode encodings, which makes automated decoding impossible, since the mixed-in binary data will always corrupt character boundary detection. Your best bet is to read the file in binary and try to manually isolate the binary part before decoding. P.S.: If ANY text editor can read your file correctly, then my reasoning above is invalid: automated decoding is possible and you made a mistake.
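
The suggestion in the comments above to try a few mainstream encodings can be sketched with the standard library alone. The function name and the candidate list are assumptions for illustration. Note that a decode that succeeds does not prove the guess is right: latin-1 accepts every byte sequence, so it goes last.

```python
def first_working_encoding(raw, candidates=("utf-8", "utf-16", "utf-32", "latin-1")):
    """Return (encoding, text) for the first candidate that decodes `raw` without error."""
    for name in candidates:
        try:
            return name, raw.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings worked")
```

Typical use would be to read the file with open(path, 'rb'), pass the bytes to first_working_encoding, and then check by eye whether the returned text looks like a graph description.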
0

To ignore bytes that cannot be decoded, you can do the following in Python 2 (the unicode built-in does not exist in Python 3):

var = unicode(var, errors='ignore')

1 Comment

I think @sikez means "dot", as in the text-based graph format, not "dot" as in files prefixed with . in a user's home directory.
