I have a very large JSON-like file, but it is not using proper JSON syntax: the object keys are not quoted. I'd like to write a script to fix the file, so that I can load it with json.loads.

I need to match all words followed by a colon and replace them with the quoted word. I think the regex is \w+\s*: and that I should use re.sub, but I'm not exactly sure how to do it.

How can I take the following input and get the given output?

# In
{abc : "xyz", cde : {}, fgh : ["hfz"]}
# Out
{"abc" : "xyz", "cde" : {}, "fgh" : ["hfz"]}

# In
{
    a: "b",
    b: {
        c: "d",
        d: []
    },
    e: "f"
}
# Out
{
    "a": "b",
    "b": {
        "c": "d",
        "d": []
    },
    "e": "f"
}
3 Answers

11

Rather than a potentially fragile regex solution, you can take advantage of the fact that while your log file isn't valid JSON, it is valid YAML. Using the PyYAML library, you can load it into a Python data structure and then write it back out as valid JSON:

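import json
import yaml  # provided by the PyYAML package

# safe_load builds only plain Python objects; a bare yaml.load(f) needs an explicit Loader in PyYAML 6+
with open("original.log") as f:
    data = yaml.safe_load(f)

with open("jsonified.log", "w") as f:
    json.dump(data, f)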

2 Comments

Stumbled across this thread in search of a solution for the same problem occurring in R. This solution works just as well with the yaml package there. Thanks!
Thank you for this excellent solution! Based on documentation and articles I've come across, I'd like to note that using yaml.safe_load() is the more secure way of reading yaml. See: arp242.net/yaml-config.html
2

I suggest matching whole words that are not enclosed in double quotation marks and adding quotation marks around them:

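import re

# Match whole words that have no double quote immediately before or after them
p = re.compile(r'(?<!")\b\w+\b(?!")')
test_str = "{abc : \"xyz\", cde : {}, fgh : [\"hfz\"]}"
# \g<0> inserts the whole match, so each bare word is wrapped in quotes
print(re.sub(p, r'"\g<0>"', test_str))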

See IDEONE demo, output:

{"abc" : "xyz", "cde" : {}, "fgh" : ["hfz"]}
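
Since the goal in the question is to load the result with json.loads, a quick check is worth running. The sketch below reuses the pattern from this answer; the helper name quote_bare_words and the second sample input (borrowed from another answer in this thread) are illustrative assumptions. It suggests the pattern works for the question's example but also quotes bare numbers, so it may not suit files with numeric values.

```python
import json
import re

# Pattern from the answer: whole words with no double quote on either side.
p = re.compile(r'(?<!")\b\w+\b(?!")')

def quote_bare_words(text):
    """Wrap every unquoted word in double quotes (hypothetical helper name)."""
    return p.sub(r'"\g<0>"', text)

# The question's first example becomes loadable JSON.
ok = quote_bare_words('{abc : "xyz", cde : {}, fgh : ["hfz"]}')
print(json.loads(ok))

# Bare numbers are "words" too, so each run of digits gets quoted separately.
broken = quote_bare_words('{lat: 8.5, lon: -80.0}')
print(broken)
try:
    json.loads(broken)
except json.JSONDecodeError as exc:
    print("not valid JSON:", exc)
```

The same applies to any other bare token, such as true, false or null, because the pattern does not distinguish keys from values.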

7 Comments

\g<0> is an unambiguous back-reference to the whole matched text. It avoids the overhead of wrapping the entire pattern in a capturing group.
This is exactly what I wanted to find out with my original question: how back-references work in regular expressions.
@akshitBhatia If you wanted to learn regex back references, why didn't you ask about that?
True. That is an answer tailored to this use case. And since a moderator has edited my original question, I should select that answer as the correct one. But this is the general concept I wanted answered in my original question, which was put on hold for being too broad. I am upvoting your answer but selecting the other one, because it is the correct answer to the edited question.
@mariotomo See the comment above. It was not meant to answer the question in its current form, it was heavily edited. Besides, I would have never answered it now, this post is 3 years old.
1

I came across this old question while looking for ways to parse sloppy JSON shorthand in Python.

My input looks like this:

'{lat: 8.5, lon: -80.0}'

And, as I said, the parsing has to tolerate sloppy spacing; the input could just as well be:

'{lat:8.5,lon:-80.0}'

I like the YAML hint, but it doesn't cope well with sloppy spacing, and I don't want to add one more dependency to my already longish list. So I tried the regex solution, but it wasn't good enough for my case.

My solution looks like this:

re.sub(r'(\w+)[ ]*(?=:)', r'"\g<1>"', input_string)

It defines one group holding word characters (letters, digits, underscore), allows spaces to follow, and uses a lookahead to anchor on a colon. The match is replaced with group one enclosed in double quotes; any spaces between the key and the colon are consumed and dropped, and everything else is left alone. The pattern does not match a key that is already quoted.

In particular:

>>> re.sub(r'(\w+)[ ]*(?=:)', r'"\g<1>"', 
... '{abc : "xyz", cde : {a:"b", c: 0}, fgh : ["hfz"], 123: 123}')
'{"abc": "xyz", "cde": {"a":"b", "c": 0}, "fgh": ["hfz"], "123": 123}'
>>> re.sub(r'(\w+)[ ]*(?=:)', r'"\g<1>"', _)
'{"abc": "xyz", "cde": {"a":"b", "c": 0}, "fgh": ["hfz"], "123": 123}'
>>> 
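
One caveat worth checking before running this on a large file: the lookahead also fires on a word followed by a colon inside a string value, such as a URL or a time. A possible workaround, sketched here as an assumption and not part of the answer above, is to let the regex consume complete double-quoted strings first and only quote what is matched outside them. The names TOKEN and quote_keys and the sample input are hypothetical.

```python
import json
import re

# Alternative 1 matches a complete double-quoted string (escapes included) so it
# can be copied through unchanged; alternative 2 captures a bare word that is
# followed by optional whitespace and a colon.
TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|(\w+)(?=\s*:)')

def quote_keys(text):
    """Quote bare keys while leaving existing string literals untouched."""
    def fix(match):
        key = match.group(1)
        # group(1) is None when the match was a string literal
        return match.group(0) if key is None else '"%s"' % key
    return TOKEN.sub(fix, text)

sample = '{url: "http://example.com", t : "12:30", n: 1}'

# The plain lookahead pattern also rewrites "http" and "12" inside the strings.
naive = re.sub(r'(\w+)[ ]*(?=:)', r'"\g<1>"', sample)
print(naive)

# The string-aware variant leaves the values alone and yields loadable JSON.
fixed = quote_keys(sample)
print(json.loads(fixed))
```

Because already-quoted keys are consumed as string literals, running quote_keys twice gives the same result as running it once. It still assumes that keys consist only of word characters and that all strings use double quotes.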
