I am trying to decode and then parse a JSON file of about 9 MB. But when I try to decode the file to turn it into a Python dictionary, I get this error:

'utf8' codec can't decode bytes in position 3161744-3161747: invalid data

I think this might be because of encoding/decoding issues, but I'm not entirely certain. I don't know how the file is encoded because I am getting it from a third party, and unfortunately I can't show the file because it contains sensitive information.

Also, the people who supplied the file said it is valid JSON and passes JSONLint. Here is my code:

import json

""" JSON Parser """
class parser:
    json_file = None

    """ The JSON File name"""
    def json_object(self, file):
        self.json_file = file

    """ Open up file and parse it """
    def json_encode(self):
        try:
            json_data = open(self.json_file)
            data = json_data.read().decode('utf8')
            result = json.loads(data)
        except Exception as e:
            result = e
        return result

""" Instantiate parser and begin parsing the file"""
p = parser()
p.json_object('file.js')
print p.json_encode()
  • Although the file may be formatted properly from a JSON point of view, it may still be invalid from a UTF-8 encoding point of view. You should be able to elicit the same error by reading the file as a UTF-8 text file, which would eliminate JSON from the problem. Are you certain the file is UTF-8 encoded and not something like ISO 8859-1? Commented Jan 30, 2012 at 21:07
  • @GregHewgill I am not certain, and that's the problem. I can open it in a text editor as UTF-8 and save it again, but when I then run the parser only the last part of the JSON file gets decoded. The same problem occurs when I try in PHP. This is a very odd problem; however, I still think it has something to do with the way the file was encoded in the first place. Commented Jan 30, 2012 at 21:27
  • Can you read the file, then try data[3161730:3161760] to see what's causing the error? Commented Jan 30, 2012 at 21:57
  • @ThomasK thanks for that tip, don't know why I didn't try that, it seems to have weird characters like ê. I got rid of them, but I still only get the very last part decoded out of a large file. Commented Jan 30, 2012 at 22:18
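
The byte-slicing tip in the comments above can be sketched roughly as follows. This is a minimal sketch in Python 3 (the question's code is Python 2), and `show_bad_bytes` and its `context` parameter are hypothetical names used only for illustration. It reads the data as raw bytes and relies on the `start` and `end` attributes of `UnicodeDecodeError` to locate the first undecodable run.

```python
def show_bad_bytes(raw, encoding="utf-8", context=15):
    """Return (start, end, surrounding bytes) for the first undecodable
    run in `raw`, or None if the whole input decodes cleanly.

    `show_bad_bytes` is a hypothetical helper, not a library function.
    """
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start / err.end are byte offsets into `raw`.
        lo = max(err.start - context, 0)
        hi = err.end + context
        return err.start, err.end, raw[lo:hi]
    return None
```

Assuming the file from the question, it could be called as `show_bad_bytes(open('file.js', 'rb').read())`. Printing the returned slice shows the offending bytes as escapes such as `\xea`, which may hint at a single-byte encoding like ISO 8859-1.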

1 Answer

I don't think you should be decoding the UTF-8 yourself before handing the data to the JSON parser. JSON parsing should be transparent to the encoding, as you might have some strings in the JSON that are UTF-8 and others that are Latin-9, etc. Try:

json.load(open(self.json_file))

1 Comment

I tried this already. The error I get is "'utf8' codec can't decode bytes in position 45-48: invalid data". I don't know how to check what characters are in those byte positions
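
If the bytes turn out not to be UTF-8, one possible approach is to try a short list of candidate encodings before parsing. The following is a minimal Python 3 sketch under that assumption. `load_json_bytes` and the candidate list are hypothetical choices, and nothing in the question confirms which encoding the third party actually used. Note that `latin-1` accepts any byte sequence, so it has to come last, and a successful decode with it does not prove the guess is right.

```python
import json

def load_json_bytes(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Decode `raw` with the first candidate encoding that works, then parse it.

    Returns (parsed_object, encoding_used). The candidate list is an
    assumption for illustration, not something known about the file.
    """
    for enc in encodings:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # try the next candidate
        return json.loads(text), enc
    raise ValueError("none of the candidate encodings could decode the data")
```

With the file from the question this might be used as `data, enc = load_json_bytes(open('file.js', 'rb').read())`. Checking a few decoded strings by eye afterwards is still advisable, since a wrong single-byte guess produces mojibake rather than an error.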
