Read Dot Files in Python

Question

I am trying to read a dot file containing:

graph {
    KZJLCHYE -- DJTGWUZZ;
    PNLWKOXF -- BFSIOMPY;
    ...
}

But when I try to read the dot file, I get "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte". Is there a way I can read the contents of a dot file in Python only using the standard library?

Could you add the first few (raw) bytes of the file to your question? Something like hexdump -C -n 16 yourfile.dot
– Jeremy Kerr, Commented Dec 2, 2015 at 7:05
@JeremyKerr here are the raw bytes:

00000000  d0 cf 11 e0 a1 b1 1a e1  00 00 00 00 00 00 00 00  |................|
00000010
– sikez, Commented Dec 2, 2015 at 14:16
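
If hexdump is not available, the same check can be sketched with the Python standard library alone. The names sniff_header, looks_like_ole2, hex_line and OLE2_MAGIC are illustrative and do not come from the thread. The first eight bytes in the dump above match the signature of the OLE2 Compound File Binary container used by legacy Microsoft Office files (Word .dot templates among them), so the file may not be Graphviz text at all; treat a match as a hint rather than proof.

```python
# Signature of the OLE2 Compound File Binary format (legacy MS Office files).
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def sniff_header(path, count=16):
    """Return the first `count` raw bytes of the file at `path`."""
    with open(path, "rb") as fh:
        return fh.read(count)


def looks_like_ole2(header):
    """True if the header starts with the OLE2 compound-file signature."""
    return header.startswith(OLE2_MAGIC)


def hex_line(header):
    """Format bytes as space-separated two-digit hex, similar to hexdump."""
    return " ".join(f"{b:02x}" for b in header)
```

Calling hex_line(sniff_header("yourfile.dot")) gives roughly what hexdump -C -n 16 prints, and looks_like_ole2 on the same header tells you whether decoding the file as text is worth attempting.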

Muposat · Accepted Answer · 2015-12-02 07:03:41Z

1

Encoding of text files is a murky subject that will never be completely resolved. Either you need to guess the encoding, or you have a corrupted (or binary) file on your hands:

1. To guess the encoding, open the file in any advanced text editor and see whether it guesses the encoding for you and/or highlights problematic characters.
2. If you don't care about the bad character at position 0, you can instruct Python to ignore it. See the Python 3 manual: open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None). Just set errors='ignore'. Python 3 handles encodings better than Python 2, so it would help if you mentioned which version you are using.
3. Read the file as a binary stream and deal with bad characters when converting it to str: open(file, 'rb'). Again, your options for decoding depend on the Python version, so I cannot elaborate further.
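
For Python 3, the last two options might look like the following minimal sketch. The function names and the default encoding of utf-8 are assumptions made for illustration; only open(), errors='ignore' and the 'rb' mode come from the answer itself.

```python
def read_text_lenient(path, encoding="utf-8"):
    """Open in text mode and silently drop bytes that cannot be decoded."""
    with open(path, encoding=encoding, errors="ignore") as fh:
        return fh.read()


def read_bytes_then_decode(path, encoding="utf-8"):
    """Read raw bytes, then decode, marking each bad byte with U+FFFD."""
    with open(path, "rb") as fh:
        raw = fh.read()
    return raw.decode(encoding, errors="replace")
```

Using errors='replace' in the second function keeps a visible marker wherever a byte was rejected, which makes it easier to see how much of the file is not valid text. If almost everything is dropped or replaced, the encoding guess is probably wrong or the file is binary.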


4 Comments

sikez Over a year ago

I am using Python3. I ignored the errors, but nothing from the file is read. I'm assuming this is because the entire file raises errors. I also tried reading the binary stream and converting back to a string, but I get illegible output ().

Muposat Over a year ago

@sikez it is possible that you are reading the file with the wrong encoding. For example, if Python tries to read a 3-byte UTF-8 character but the text is actually encoded as 4-byte UTF-32 characters, every character boundary will be wrong. See my step 1: open the file in a good editor (I use Emacs) and it will guess the encoding name. You can also use the Python library chardet to guess the encoding, or try some mainstream ones: open(file, encoding='latin-1'), open(file, encoding='utf-16'), open(file, encoding='utf-32')...
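
Trying a handful of mainstream encodings can be automated with the standard library. This is a rough sketch, not a substitute for chardet: the function name and the candidate order are assumptions. Note that latin-1 accepts every byte sequence, so it is placed last, and a successful decode does not prove the guess is correct.

```python
def first_strict_decoding(raw, candidates=("utf-8", "utf-16", "utf-32", "latin-1")):
    """Return (encoding, text) for the first candidate that decodes cleanly.

    Returns (None, None) if every candidate raises UnicodeDecodeError.
    """
    for name in candidates:
        try:
            return name, raw.decode(name)
        except UnicodeDecodeError:
            continue
    return None, None
```

Read the file with open(file, 'rb') first and pass the resulting bytes in, then inspect the returned text by eye before trusting the encoding name.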

sikez Over a year ago

Thanks a lot! I used chardet, and found out the encoding was actually ISO-8859-2.

Muposat Over a year ago

You might have a horrible mix of text and binary data generated by "we don't need no standards" software. Since ISO-8859-2 is a 1-byte codec, its failure suggests that the text is encoded in one of the multi-byte Unicode encodings, which makes automated decoding impossible, because the mixed-in binary data will always corrupt character boundary detection. Your best bet is to read the file in binary and try to manually isolate the binary part before decoding. P.S.: If ANY text editor can read your file correctly, then my reasoning above is invalid: automated decoding is possible and you made a mistake.
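
One crude way to start isolating the text part is to pull out runs of printable ASCII from the raw bytes, much like the Unix strings tool. This is only a sketch under a strong assumption: that the text is stored in a single-byte, ASCII-compatible form. It will miss text stored as UTF-16 or UTF-32, and the function name and the min_len threshold are illustrative.

```python
import re


def printable_runs(raw, min_len=4):
    """Return runs of at least `min_len` printable ASCII bytes as str."""
    # Bytes 0x20 through 0x7e are the printable ASCII range.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [run.decode("ascii") for run in pattern.findall(raw)]
```

If the runs that come back look like the expected graph edges, their positions in the file show where the text portion begins and ends.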

ashishmohite · Accepted Answer · 2015-12-02 07:00:21Z

0

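To ignore undecodable bytes in the file, you can do the following (Python 2 only; unicode() does not exist in Python 3, where a similar call for a bytes object is var.decode('utf-8', errors='ignore')):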

var = unicode(var, errors='ignore')

edited Dec 2, 2015 at 7:00

answered Dec 2, 2015 at 6:47


1 Comment

Jeremy Kerr Over a year ago

I think @sikez means "dot", as in the text-based graph format, not "dot" as in files prefixed with . in a user's home directory.
