I have a string that lists the properties of a request event.

My string looks like:

requestBody: {
    propertyA = 1
    propertyB = 2
    propertyC = {
        propertyC1 = 1
        propertyC2 = 2
    }
    propertyD = [
        { propertyD1 = { propertyD11 = 1}},
        { propertyD2 = [ {propertyD21 = 1, propertyD22 = 2},
                         {propertyD21 = 3, propertyD22 = 4}]}
    ]
}

I tried replacing each "=" with ":" so that I could feed the string to a JSON reader in Python, but JSON also requires keys to be double-quoted strings and a "," between key-value pairs, so this became complicated to implement. What are some better approaches to parsing this into a Python dictionary with exactly the same structure (e.g. with nested dictionaries preserved)?

Question: If I were to write a full parser, what is the main pattern I should tackle? Pushing opening brackets onto a stack until each one is closed?

  • It appears consistently defined. Write a full parser from scratch? Commented Apr 17, 2016 at 21:56
  • I understand defining multiple properties inside {}'s as some sort of nested object (such as propertyC), and I understand multiple objects inside []'s as an array of objects (as in propertyD2). But what is intended when you have multiple properties inside []'s (as in propertyD)? Should this really be an object in {}'s, with properties propertyD1 and propertyD2? Commented Apr 17, 2016 at 22:44
  • Also, it appears that sometimes list elements are comma-delimited, and sometimes newline-delimited. For instance, is there supposed to be a comma after the definition of propertyA? Commented Apr 17, 2016 at 22:45
  • @PaulMcGuire I revised my string format a little. For your first question: objects are always enclosed in {}, and arrays should only contain objects enclosed in {}. For your second question: yes, records within an object are inconsistently comma-delimited or newline-delimited (arrays, however, are always comma-delimited). Commented Apr 17, 2016 at 22:47
  • What software is generating these strings? There may be a Python module for working with it already. Commented Apr 17, 2016 at 22:54

1 Answer

This is a nice case for using pyparsing, especially since the format is recursively structured.

The short answer is the following parser (it processes everything after the leading "requestBody:"):

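from pyparsing import (Suppress, LineEnd, Word, nums, alphas, alphanums,
                       quotedString, Forward, Group, Dict, Optional,
                       delimitedList)

LBRACE,RBRACE,LBRACK,RBRACK,EQ = map(Suppress, "{}[]=")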
NL = LineEnd().setName("NL")

# define special delimiter for lists and objects, since they can be
# comma-separated or just newline-separated
list_delim = NL | ','
list_delim.leaveWhitespace()

# use a parse action to convert numeric values to ints or floats at parse time
def convert_number(t):
    try:
        return int(t[0])
    except ValueError:
        return float(t[0])
number = Word(nums, nums+'.').addParseAction(convert_number)

qs = quotedString

# forward-declare value, since it will be defined recursively
obj_value = Forward()

ident = Word(alphas, alphanums+'_')
obj_property = Group(ident + EQ + obj_value)

# use Dict wrapper to auto-define nested properties as key-values
obj = Group(LBRACE + Dict(Optional(delimitedList(obj_property, delim=list_delim))) + RBRACE)

obj_array = Group(LBRACK + Optional(delimitedList(obj, delim=list_delim)) + RBRACK)

# now assign to previously-declared obj_value, using '<<=' operator
obj_value <<= obj_array | obj | number | qs

# parse the data; `sample` holds the input text after the leading "requestBody:"
res = obj.parseString(sample)[0]

# convert the result to a dict
import pprint
pprint.pprint(res.asDict())

gives

{'propertyA': 1,
 'propertyB': 2,
 'propertyC': {'propertyC1': 1, 'propertyC2': 2},
 'propertyD': {'propertyD1': {'propertyD11': 1},
               'propertyD2': {'propertyD21': 3, 'propertyD22': 4}}}
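
On the question of writing a full parser by hand: a common pattern is recursive descent, where one function handles each construct (object, array, scalar) and the Python call stack plays the role of the bracket stack. The following is only a minimal standard-library sketch, not part of the pyparsing solution above. The names `tokenize`, `Parser`, and `parse_request` are hypothetical, and it assumes that commas are optional separators everywhere, that quoted strings contain no escaped quotes, and that the text starts with a `requestBody:` label. The sample is a shortened version of the question's input.

```python
import re

# Token kinds: punctuation, numbers, double-quoted strings, identifiers.
TOKEN_RE = re.compile(r'''
    (?P<punct>[{}\[\]=,])
  | (?P<number>\d+(?:\.\d+)?)
  | (?P<string>"[^"]*")
  | (?P<ident>[A-Za-z_]\w*)
''', re.VERBOSE)


def tokenize(text):
    """Split text into (kind, value) pairs, ignoring all whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.i = 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else (None, None)

    def expect(self, value):
        tok = self.peek()[1]
        if tok != value:
            raise ValueError(f"expected {value!r}, got {tok!r}")
        self.i += 1

    def skip_commas(self):
        # Assumption: commas are optional, since newlines also separate items.
        while self.peek()[1] == ",":
            self.i += 1

    def parse_value(self):
        kind, tok = self.peek()
        if tok == "{":
            return self.parse_object()   # recursion replaces an explicit stack
        if tok == "[":
            return self.parse_array()
        if kind == "number":
            self.i += 1
            return float(tok) if "." in tok else int(tok)
        if kind == "string":
            self.i += 1
            return tok[1:-1]             # drop the surrounding quotes
        raise ValueError(f"unexpected token {tok!r}")

    def parse_object(self):
        self.expect("{")
        result = {}
        self.skip_commas()
        while self.peek()[1] != "}":
            kind, name = self.peek()
            if kind != "ident":
                raise ValueError(f"expected a property name, got {name!r}")
            self.i += 1
            self.expect("=")
            result[name] = self.parse_value()
            self.skip_commas()
        self.expect("}")
        return result

    def parse_array(self):
        self.expect("[")
        items = []
        self.skip_commas()
        while self.peek()[1] != "]":
            items.append(self.parse_value())
            self.skip_commas()
        self.expect("]")
        return items


def parse_request(text):
    """Parse 'requestBody: { ... }' into nested dicts and lists."""
    _, _, body = text.partition(":")     # drop the leading label
    parser = Parser(tokenize(body))
    result = parser.parse_object()
    if parser.i != len(parser.tokens):
        raise ValueError("trailing input after the closing brace")
    return result


SAMPLE = """requestBody: {
    propertyA = 1
    propertyC = {
        propertyC1 = 1
        propertyC2 = 2
    }
    propertyD = [
        { propertyD1 = { propertyD11 = 1}},
        { propertyD2 = [ {propertyD21 = 1, propertyD22 = 2},
                         {propertyD21 = 3, propertyD22 = 4}]}
    ]
}"""

result = parse_request(SAMPLE)
print(result["propertyD"][1]["propertyD2"])
```

In this sketch arrays come back as Python lists and objects as dicts, so every level of nesting in the input is kept in the result.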
2 Comments

can you briefly explain what obj_value <<= obj_array | obj | number | qs does?
We have to use the <<= operator to assign expressions to a Forward expression. You will see this in most recursive pyparsing programs. In your case, obj_value was defined as a Forward as a sort of placeholder, as if to say, "I'll define this later, but I need a reference for this expression." Then obj_value gets used in obj_property, which is part of obj. Ultimately, we want to define just what goes into obj_value, which is any one of an obj_array, an obj, a number, or a quoted string. Since obj_value is already defined as a Forward, we assign this using the <<= operator.
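
To see the Forward pattern in isolation, here is a smaller sketch than the parser in the answer. It parses nested lists of integers, and the names `item` and `nested_list` are made up for this illustration. It assumes pyparsing is installed and uses the same camelCase method names as the answer.

```python
from pyparsing import Forward, Group, Suppress, Word, ZeroOrMore, nums

LBRACK, RBRACK = map(Suppress, "[]")
integer = Word(nums).setParseAction(lambda t: int(t[0]))

# Placeholder: 'item' is referenced below before it has a real definition.
item = Forward()
nested_list = Group(LBRACK + ZeroOrMore(item) + RBRACK)

# Fill in the placeholder; every earlier reference to 'item' now sees this.
item <<= integer | nested_list

result = nested_list.parseString("[1 [2 3] [[4]]]")[0].asList()
print(result)
```

Using plain `=` on the last definition would rebind the name `item` to a new expression and leave the placeholder inside `nested_list` empty, which is why `<<=` is needed.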
