2

I would like to parse JSON-like strings. Their only difference from standard JSON is the presence of contiguous commas in arrays. Two such commas in a row implicitly mean that null should be inserted between them. Example:

       JSON-like:  ["foo",,,"bar",[1,,3,4]]
      Javascript:  ["foo",null,null,"bar",[1,null,3,4]]
Decoded (Python):  ["foo", None, None, "bar", [1, None, 3, 4]]

The native json.JSONDecoder class doesn't allow me to change the behavior of the array parsing. I can only customize how objects (dicts), ints, floats and constants are decoded (by passing hook functions as keyword arguments to JSONDecoder(); please see the doc).

So, does it mean I have to write a JSON parser from scratch? The Python code of json is available but it's quite a mess. I would prefer to use its internals instead of duplicating its code!

6 Answers

5

Since what you're trying to parse isn't JSON per se, but rather a different language that's very much like JSON, you may need your own parser.

Fortunately, this isn't as hard as it sounds. You can use a Python parser generator like pyparsing. JSON can be fully specified with a fairly simple context-free grammar (I found one here), so you should be able to modify it to fit your needs.
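
If a parser generator feels like too much machinery, a small hand-written recursive-descent parser is another option. The following is only a sketch and is not derived from the grammar linked above: the names parse_lenient, _value, _array and _object are made up for illustration, scalars are delegated to json.loads, and it assumes JavaScript-style array semantics (an empty slot becomes None, and one trailing comma is ignored).

```python
import json
import re

# One JSON scalar: string, number, or literal (validated again by json.loads).
SCALAR = re.compile(
    r'"(?:[^"\\]|\\.)*"|-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?|true|false|null')
WS = re.compile(r'\s*')


def _skip(text, pos):
    """Return the index of the first non-whitespace character at or after pos."""
    return WS.match(text, pos).end()


def _value(text, pos):
    """Parse one value starting at pos; return (value, index after it)."""
    ch = text[pos:pos + 1]
    if ch == '[':
        return _array(text, pos + 1)
    if ch == '{':
        return _object(text, pos + 1)
    m = SCALAR.match(text, pos)
    if not m:
        raise ValueError(f"unexpected input at position {pos}")
    return json.loads(m.group()), m.end()


def _array(text, pos):
    """Parse array contents after '['; a comma with no value yields None."""
    items = []
    pos = _skip(text, pos)
    while text[pos:pos + 1] != ']':
        if text[pos:pos + 1] == ',':
            items.append(None)              # empty slot before this comma
            pos = _skip(text, pos + 1)
            continue
        value, pos = _value(text, pos)
        items.append(value)
        pos = _skip(text, pos)
        if text[pos:pos + 1] == ',':        # separator (or trailing comma)
            pos = _skip(text, pos + 1)
        elif text[pos:pos + 1] != ']':
            raise ValueError(f"expected ',' or ']' at position {pos}")
    return items, pos + 1


def _object(text, pos):
    """Parse object contents after '{' into a dict."""
    result = {}
    pos = _skip(text, pos)
    while text[pos:pos + 1] != '}':
        key, pos = _value(text, pos)
        if not isinstance(key, str):
            raise ValueError("object keys must be strings")
        pos = _skip(text, pos)
        if text[pos:pos + 1] != ':':
            raise ValueError(f"expected ':' at position {pos}")
        value, pos = _value(text, _skip(text, pos + 1))
        result[key] = value
        pos = _skip(text, pos)
        if text[pos:pos + 1] == ',':
            pos = _skip(text, pos + 1)
        elif text[pos:pos + 1] != '}':
            raise ValueError(f"expected ',' or '}}' at position {pos}")
    return result, pos + 1


def parse_lenient(text):
    """Parse a JSON-like document that may contain empty array slots."""
    value, pos = _value(text, _skip(text, 0))
    if _skip(text, pos) != len(text):
        raise ValueError(f"trailing data at position {pos}")
    return value


print(parse_lenient('["foo",,,"bar",[1,,3,4]]'))
```

Because strings are consumed as whole tokens, commas inside string literals are left alone. Note that this sketch also tolerates a trailing comma inside objects, which may or may not be what you want.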

4 Comments

Maybe it's overkill but someone else may need it! So +1 from me!
+1. Note that while my answer does give you something that should work (as far as I can think of) this is a better solution. It may be more work, but it will be a lot more resilient. If you are making something small, then by all means, use my hack, but if you are doing something more important, do it properly.
@Lattyware This is definitely true. Our approach will mess up if there are consecutive commas in a string object, for example.
Please see my new answer. I've got almost what I want, but it still fails in one situation.
3

Small & simple workaround to try out:

  1. Convert JSON-like data to strings.
  2. Replace ",," with ",null,".
  3. Convert it to whatever is your representation.
  4. Let JSONDecoder() do the heavy lifting.

    1. & 3. can be omitted if you already deal with strings.

(And if converting to string is impractical, update your question with this info!)

3 Comments

Thanks. I've already tried to do this though. I've used a simple re.sub but sometimes, some ,null,, remain. This workaround is a bit too dirty!
Unfortunately, it's not quite as simple as this, as when you replace ,, with ,null, you add in a comma, so you go from ,,, to ,null,, which still fails.
@Lattyware But if you use a lookbehind, as in my answer, all is well. It works on your example, anyway. :)
2

You can do the comma replacement of Lattyware's/przemo_li's answers in one pass by using a lookbehind expression, i.e. "replace all commas that are preceded by just a comma":

>>> import re
>>> s = '["foo",,,"bar",[1,,3,4]]'
>>> re.sub(r'(?<=,)\s*,', ' null,', s)
'["foo", null, null,"bar",[1, null,3,4]]'

Note that this will work for small things where you can assume there aren't consecutive commas in string literals, for example. In general, regular expressions aren't enough to handle this problem, and Taymon's approach of using a real parser is the only fully correct solution.
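
For the string-literal case specifically, one possible middle ground is to let the pattern consume whole string literals first, so that commas inside them are never candidates for replacement. This is a sketch rather than a full parser: TOKEN and fill_empty_slots are illustrative names, and it assumes strings are double-quoted with backslash escapes, as in JSON.

```python
import json
import re

# Either a complete string literal, or a comma directly preceded by a comma
# (optionally with whitespace in between), i.e. an empty array slot.
TOKEN = re.compile(r'(?P<string>"(?:[^"\\]|\\.)*")|(?P<gap>(?<=,)\s*,)')


def fill_empty_slots(text):
    """Insert null into empty slots, leaving string literals untouched."""
    def repl(match):
        if match.group('string') is not None:
            return match.group(0)   # keep the string exactly as written
        return ' null,'
    return TOKEN.sub(repl, text)


s = '["a,,b",,,"bar",[1,,3,4]]'
print(json.loads(fill_empty_slots(s)))
```

The commas inside "a,,b" survive because the string alternative matches first and swallows the whole literal; only the commas between elements are rewritten.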

2 Comments

I knew there had to be a way to do it entirely with regexes, but alas, it was beyond me. +1, this is a neater solution.
Your code however, needs a fix - re.sub(r'(?<=,)\s*,', ' null,') should be re.sub(r'(?<=,),', ' null,', s).
1

It's a hackish way of doing it, but one solution is to simply do some string modification on the JSON-ish data to get it in line before parsing it.

import re
import json

not_quite_json = '["foo",,,"bar",[1,,3,4]]'
not_json = True
while not_json:
    not_quite_json, not_json = re.subn(r',\s*,', ', null, ', not_quite_json)

Which leaves us with:

'["foo", null, null, "bar",[1, null, 3,4]]'

We can then do:

json.loads(not_quite_json)

Giving us:

['foo', None, None, 'bar', [1, None, 3, 4]]

Note that it's not as simple as a replace, as the replacement also inserts commas that can need replacing. Given this, you have to loop through until no more replacements can be made. Here I have used a simple regex to do the job.

1

I've had a look at Taymon's recommendation, pyparsing, and I successfully hacked the example provided here to suit my needs. It works well at simulating Javascript eval() but fails in one situation: trailing commas. There should be an optional trailing comma (see the tests below), but I can't find any proper way to implement this.

from pyparsing import *

TRUE = Keyword("true").setParseAction(replaceWith(True))
FALSE = Keyword("false").setParseAction(replaceWith(False))
NULL = Keyword("null").setParseAction(replaceWith(None))

jsonString = dblQuotedString.setParseAction(removeQuotes)
jsonNumber = Combine(Optional('-') + ('0' | Word('123456789', nums)) +
                    Optional('.' + Word(nums)) +
                    Optional(Word('eE', exact=1) + Word(nums + '+-', nums)))

jsonObject = Forward()
jsonValue = Forward()
# black magic begins
commaToNull = Word(',,', exact=1).setParseAction(replaceWith(None))
jsonElements = ZeroOrMore(commaToNull) + Optional(jsonValue) + ZeroOrMore((Suppress(',') + jsonValue) | commaToNull)
# black magic ends
jsonArray = Group(Suppress('[') + Optional(jsonElements) + Suppress(']'))
jsonValue << (jsonString | jsonNumber | Group(jsonObject) | jsonArray | TRUE | FALSE | NULL)
memberDef = Group(jsonString + Suppress(':') + jsonValue)
jsonMembers = delimitedList(memberDef)
jsonObject << Dict(Suppress('{') + Optional(jsonMembers) + Suppress('}'))

jsonComment = cppStyleComment
jsonObject.ignore(jsonComment)

def convertNumbers(s, l, toks):
    n = toks[0]
    try:
        return int(n)
    except ValueError:
        return float(n)

jsonNumber.setParseAction(convertNumbers)

def test():
    tests = (
        '[1,2]',       # ok
        '[,]',         # ok
        '[,,]',        # ok
        '[  , ,  , ]', # ok
        '[,1]',        # ok
        '[,,1]',       # ok
        '[1,,2]',      # ok
        '[1,]',        # failure, I got [1, None], I should have [1]
        '[1,,]',       # failure, I got [1, None, None], I should have [1, None]
    )
    for test in tests:
        results = jsonArray.parseString(test)
        print(results.asList())

4 Comments

Instead of black magic with double-comma strings, can you instead simply make the empty string a valid element in a list? That seems cleaner to me and would work properly at the end of a list.
Wait, my apologies, I misunderstood the problem. So you want to chop off one trailing implicit null from the end of every list that has one?
I think your options are to do exactly that (after parsing the string, remove a trailing null from the list if there is one; you'd also have to be sure it's implicit and not literal), or explicitly allow a trailing comma at the end of a list and then use precedence to prevent ambiguity.
Well, you are probably right, but I've got some difficulties understanding how this pyparsing stuff works...
0

For those looking for something quick and dirty to convert general JS objects to dicts: part of a page on a real site gives me an object I'd like to parse. It contains 'new' constructs for dates and sits on one line with no spaces in between, so two lines suffice:

from re import sub
data = sub(r'new Date\(([^)]*)\)', r'\1', data)  # unwrap new Date(...), keeping its whole argument
data = sub(r'([,{])(\w*):', r'\1"\2":', data)    # quote bare object keys

Then json.loads() worked fine. Your mileage may vary:)
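
To make the effect of the two substitutions concrete, here is a small worked example. The sample string and the helper name js_object_to_dict are invented for illustration, and the capture group is written as ([^)]*) so that it covers the whole Date argument.

```python
import json
import re


def js_object_to_dict(data):
    """Turn a one-line JS object literal into a dict (quick and dirty)."""
    # Replace new Date(<arg>) with just <arg>.
    data = re.sub(r'new Date\(([^)]*)\)', r'\1', data)
    # Wrap bare keys that follow '{' or ',' in double quotes.
    data = re.sub(r'([,{])(\w*):', r'\1"\2":', data)
    return json.loads(data)


sample = '{id:1,when:new Date(1325376000000),name:"x"}'
print(js_object_to_dict(sample))
# → {'id': 1, 'when': 1325376000000, 'name': 'x'}
```

As with the other regex answers, this can misfire when a string value itself contains something that looks like `,key:`, so it is only suitable for input whose shape you know.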
