63

I needed to parse files generated by other tool, which unconditionally outputs json file with UTF-8 BOM header (EFBBBF). I soon found that this was the problem, as Python 2.7 module can't seem to parse it:

>>> import json
>>> data = json.load(open('sample.json'))

ValueError: No JSON object could be decoded

Removing BOM, solves it, but I wonder if there is another way of parsing json file with BOM header?

1

7 Answers 7

101

You can open with codecs:

import json
import codecs

json.load(codecs.open('sample.json', 'r', 'utf-8-sig'))

or decode with utf-8-sig yourself and pass to loads:

json.loads(open('sample.json').read().decode('utf-8-sig'))
Sign up to request clarification or add additional context in comments.

2 Comments

I strongly recommend using io.open() over codecs.open(): json.load(io.open('sample.json', 'r', encoding='utf-8-sig')). The io module is more robust and faster.
@MartijnPieters: Thanks for that comment, good to know. I found this discussion of the differences that might be useful: groups.google.com/forum/#!topic/comp.lang.python/s_eIyt3KoLE
46

Simple! You don't even need to import codecs.

with open('sample.json', encoding='utf-8-sig') as f:
    data = json.load(f)

Comments

7

you can also do it with keyword with

import codecs
with codecs.open('samples.json', 'r', 'utf-8-sig') as json_file:  
    data = json.load(json_file)

or better:

import io
with io.open('samples.json', 'r', encoding='utf-8-sig') as json_file:  
    data = json.load(json_file)

Comments

6

Since json.load(stream) uses json.loads(stream.read()) under the hood, it won't be that bad to write a small hepler function that lstrips the BOM:

from codecs import BOM_UTF8

def lstrip_bom(str_, bom=BOM_UTF8):
    if str_.startswith(bom):
        return str_[len(bom):]
    else:
        return str_

json.loads(lstrip_bom(open('sample.json').read()))    

In other situations where you need to wrap a stream and fix it somehow you may look at inheriting from codecs.StreamReader.

4 Comments

Why not use the string strip function?
@SamStoelinga, since strip function receives not a prefix but a set of characters to remove. That it you need to either decode the byte-string into the unicode or use the approach above to be sure you left-strip only the UTF-8 BOM.
I'm getting an error that says TypeError: expected str,bytes or os.Pathlike object, not _io.TextIOWrapper
@Zypps987, the snippet assumes python2 where read() returns bytes. To make the snippet work in python3 you will need to encode BOM_UTF8 to 'utf-8'. But you don't need this, when you have utf-8-sig encoding.
0

If this is a one-off, a very simple super high-tech solution that worked for me...

  • Open the JSON file in your favorite text editor.
  • Select-all
  • Create a new file
  • Paste
  • Save.

BOOM, BOM header gone!

Comments

0

I removed the BOM manually with Linux command.

First I check if there are efbb bf bytes for the file, with head i_have_BOM | xxd.

Then I run dd bs=1 skip=3 if=i_have_BOM.json of=I_dont_have_BOM.json.

bs=1 process 1 byte each time, skip=3, skip the first 3 bytes.

Comments

0

I'm using utf-8-sig just with import json

with open('estados.json', encoding='utf-8-sig') as json_file:
data = json.load(json_file)
print(data)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.