
I have a document containing newline-delimited JSON objects, to which I apply some functions. Everything works until it reaches this line, which looks exactly like this:

{"_id": "5f114", "type": ["Type1", "Type2"], "company": ["5e84734"], "answers": [{"title": " answer 1", "value": false}, {"title": "answer 2
", "value": true}, {"title": "This is a title.", "value": true}, {"title": "This is another title", "value": true}], "audios": [null], "text": {}, "lastUpdate": "2020-07-17T06:24:50.562Z", "title": "This is a question?", "description": "1000 €.", "image": "image.jpg", "__v": 0}

The entire code:

import json  

def unimportant_function(d):
    d.pop('audios', None)
    return {k:v for k,v in d.items() if v != {}}


def parse_ndjson(data):
    return [json.loads(l) for l in data.splitlines()]

with open('C:\\path\\the_above_example.json', 'r', encoding="utf8") as handle:
    data = handle.read()
    dicts = parse_ndjson(data)

for d in dicts:
    new_d = unimportant_function(d)
    json_string=json.dumps(new_d, ensure_ascii=False)
    print(json_string)

The error JSONDecodeError: Unterminated string starting at: line 1 column 260 (char 259) is raised at dicts = parse_ndjson(data). Why? I also have no idea what the symbol after "answer 2" is; it wasn't visible in the data, but it appeared when I copy-pasted it.

What is the problem with the data?

  • Can you try changing true to True and false to False? Same for null to None, as those are the keywords Python expects. Commented Jul 7, 2021 at 10:05
  • @JulesCivel the string is JSON, so it should be true and not True for it to be valid JSON. Commented Jul 7, 2021 at 10:07
  • I tried doing json.loads('<the_JSON>') and it worked for me. I do not know exactly how to reproduce the error. Commented Jul 7, 2021 at 10:09
  • "I also have no idea what that symbol after "answer 2" is" I don't see any unusual symbol in what you pasted. Commented Jul 7, 2021 at 10:09
  • @KarlKnechtel when using json.loads() the dictionary contains this: {'title': 'answer 2\u2029', 'value': True}. The \u2029 character is the one I think he is talking about. Commented Jul 7, 2021 at 10:15

1 Answer


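The invisible character embedded in the "answer 2" string is U+2029 (PARAGRAPH SEPARATOR). To .splitlines() it is not ordinary whitespace but a line boundary, so your record is cut in two right after "answer 2" and json.loads then sees a string that never closes. You can see the split here: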

>>> 'foo\u2029bar'.splitlines()
['foo', 'bar']
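
To connect this to the reported error, here is a minimal reproduction. The record below is a shortened, made-up stand-in for the line in the question, not the original data:

```python
import json

# A single JSON record whose string value contains a raw U+2029.
record = '{"title": "answer 2\u2029", "value": true}'

# splitlines() breaks the record in the middle of the string value.
pieces = record.splitlines()
print(len(pieces))  # 2

# Parsing the first fragment fails because the string is never closed.
error_message = None
try:
    json.loads(pieces[0])
except json.JSONDecodeError as exc:
    error_message = exc.msg
print(error_message)

# Parsing the unsplit record works: a raw U+2029 is legal inside a JSON string.
whole = json.loads(record)
print(whole)
```

This also matches the comment that json.loads on the whole line works: the data is valid JSON, and only the splitting step breaks it.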

(Speculation: the ndjson file might be exploiting this to represent "this string should have a newline in it", working around the file format. If so, it should probably be using a \n escape instead.)
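
If you also control the code that writes the file, one possible way to keep such characters out of it is to escape them after serializing. This is only a sketch: dumps_ndjson_line and RAW_LINE_BREAKS are hypothetical names, and it assumes you want to keep ensure_ascii=False (as in the question) so that characters like € stay readable.

```python
import json

# Characters that str.splitlines() breaks on but that
# json.dumps(..., ensure_ascii=False) leaves unescaped.
# (Control characters below U+0020 are always escaped by json.dumps.)
RAW_LINE_BREAKS = ('\x85', '\u2028', '\u2029')

def dumps_ndjson_line(obj):
    """Hypothetical helper: serialize obj so that splitlines() sees one line."""
    text = json.dumps(obj, ensure_ascii=False)
    for ch in RAW_LINE_BREAKS:
        # In json.dumps output these characters can only occur inside
        # string values, so replacing them with \uXXXX escapes keeps
        # the JSON valid and equivalent.
        text = text.replace(ch, '\\u%04x' % ord(ch))
    return text

line = dumps_ndjson_line({"title": "answer 2\u2029", "description": "1000 €."})
print(line)
```

Leaving ensure_ascii at its default of True would also escape the separator, at the cost of escaping every other non-ASCII character as well.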

The character is not treated specially, however, when you iterate over the lines of a file opened in text mode, which breaks lines only at \n, \r and \r\n:

>>> # For demonstration purposes, I create a `StringIO`
>>> # from a hard-coded string. A file object reading
>>> # from disk will behave similarly.
>>> import io
>>> for line in io.StringIO('foo\u2029bar'):
...     print(repr(line))
...
'foo\u2029bar'
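
If you would rather check this against a real file than a StringIO, a small sketch using a temporary file (the file name demo.txt is arbitrary) could look like this:

```python
import os
import tempfile

text = 'foo\u2029bar\nbaz\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.txt')
    # newline='' writes the text exactly as given, on any platform.
    with open(path, 'w', encoding='utf8', newline='') as out:
        out.write(text)
    # Iterating over the handle yields one item per real line.
    with open(path, 'r', encoding='utf8') as handle:
        lines_from_file = list(handle)

lines_from_split = text.splitlines()

print(lines_from_file)   # two lines, U+2029 kept inside the first one
print(lines_from_split)  # three pieces, split at U+2029 as well
```

Note that lines read from a file keep their trailing '\n'. json.loads ignores trailing whitespace, so they can be passed to it as they are.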

So the simple fix is to have parse_ndjson accept an iterable of lines: don't call .splitlines, and adjust the calling code accordingly. You can either pass the open handle directly:

def parse_ndjson(data):
    return [json.loads(l) for l in data]

with open('C:\\path\\the_above_example.json', 'r', encoding="utf8") as handle:
    dicts = parse_ndjson(handle)

or pass it to list to create a list explicitly:

def parse_ndjson(data):
    return [json.loads(l) for l in data]

with open('C:\\path\\the_above_example.json', 'r', encoding="utf8") as handle:
    dicts = parse_ndjson(list(handle))

or create the list using the provided .readlines() method:

def parse_ndjson(data):
    return [json.loads(l) for l in data]

with open('C:\\path\\the_above_example.json', 'r', encoding="utf8") as handle:
    dicts = parse_ndjson(handle.readlines())
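
If the data has to stay a single string (for example because it does not come from a file), another option, not part of the fixes above, is to split on '\n' only. A minimal sketch, with parse_ndjson_text as a hypothetical name and a made-up two-record sample:

```python
import json

def parse_ndjson_text(data):
    """Hypothetical variant: parse NDJSON from one string.

    Splits on '\n' only, so U+2029 and similar characters stay inside
    their records, and skips blank lines such as the empty piece after
    a trailing newline.
    """
    return [json.loads(line) for line in data.split('\n') if line.strip()]

sample = (
    '{"title": "answer 2\u2029", "value": true}\n'
    '{"title": "This is a title.", "value": false}\n'
)
records = parse_ndjson_text(sample)
print(records)
```

This assumes records are separated by '\n', which is what you get when the text was read from a file opened in text mode with default newline handling.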

3 Comments

Thank you for the explanation! I now understand where the problem lies in the code. I know it’s a silly question but by passing the open handle directly do you mean I simply remove the parse_ndjson function?
This is correct, with the caveat that if one copies the code 1:1 there won't be any defined "data" name, because the open handle is passed directly. I wish I could edit, but I can't :/
Typo (each time in the with block, data should be handle). Fixed.
