Python: How do I parse a string into a recursive dictionary?

Question

Coming from a file I have something like the following string:

var1 : data1
var2 : data2
dict1 {  
     var3 : data3  
     dict2 {  
         var4 : data4  
     }
     var5 : data5
}
dict3 {
     var6 : data6
     var7 : data7
}

and so on. (Line endings are \n, and each indent level is one \t.)
I am trying to convert it into something like this:

Dictionary={"var1":"data1","var2":"data2", "dict1" : 
    {"var3":"data3", "dict2" : {
        "var4":"data4" }, "var5":"data5"}
    , "dict3":{"var6":"data6","var7":"data7"}}

(The indents are only there to keep it somewhat human-readable.)
To solve it, all I can think of is to split the string into a list of lines, walk down the list until I find a line containing "}", delete it (so I won't run into it again later), then walk back up until I find a line containing "{", strip the whitespace before the name and the " {" after it (right now I use temp=re.split ('(\S+) \{',out[z]); in this example the first temp[1] would be 'dict2'), add everything in between, and finally move on to the next "}".

But that's neither fast nor elegant. I am definitely missing something.
My code is currently:

def procvar(strinG):
    x=y=z=temp1=temp2=0
    back = False
    out=re.split ('\n',strinG) #left over from some other tries
    while z < len(out):
        print "z=",z," out[z]= ", out[z]
        if "{" in out[z]:
            if back == True:
                back = False
                xtemp=re.split ('(\S+) \{',out[z])
                out[z]=xtemp[1]
                ytemp=xtemp[1]
                temp2=z+1
                print "Temp: ",temp1," - ",out[temp1]
                out[z]={out[z]:[]}
                while temp2 <= temp1:
                    out[z][xtemp[1]].append(out[temp2]) # not finished here, for the time being I insert the strings as they are
                    del out[temp2]
                    temp1-=1
                print out[z]
        if "}" in out[z]:
            back = True
            del out[z]
            temp1 = z-1
        if back == True:
            z-=1
        else:
            z+=1
    return out

thierrybm · Accepted Answer · 2013-08-10 22:19:02Z

2

Your format is close enough to YAML that PyYAML can read it after a small tweak (pip install pyyaml): http://pyyaml.org/wiki/PyYAML

x = """var1 : data1
var2 : data2
dict1 {  
     var3 : data3  
     dict2 {  
         var4 : data4  
     }
     var5 : data5
}
dict3 {
     var6 : data6
     var7 : data7
}"""

import yaml  # PyYAML

x2 = x.replace('{', ':').replace('}', '')
yaml.safe_load(x2)  # safe_load avoids constructing arbitrary Python objects

{'dict1': {'dict2': {'var4': 'data4'}, 'var3': 'data3', 'var5': 'data5'},
 'dict3': {'var6': 'data6', 'var7': 'data7'},
 'var1': 'data1',
 'var2': 'data2'}

answered Aug 10, 2013 at 22:19

thierrybm


3 Comments

Viktor Kerkez Over a year ago

The only problem with this solution is that it will modify the keys and values if they contain either { or } characters.

thierrybm Over a year ago

True, it is a hack. The proper solution would be to adapt YAML's language; it seems feasible, but I don't know enough about it: pyyaml.org/wiki/…

Viktor Kerkez Over a year ago

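Maybe just fix the replacements: x = re.sub(r'^\s*}\s*$', '', x, flags=re.M) and x = re.sub(r'^(\s*\S+\s+){(\s*)$', r'\1:\2', x, flags=re.M). This will be a little bit safer. (The re.M flag is needed so that ^ and $ match at every line, not just at the ends of the whole string.)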
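
To illustrate the point of these comments, here is a minimal sketch comparing the plain str.replace hack with line-anchored substitutions. The helper names naive_convert and anchored_convert and the sample text are made up for this illustration, and the patterns assume that every opening brace ends its line and every closing brace sits on a line of its own.

```python
import re


def naive_convert(text):
    # Replaces every brace, including braces inside values.
    return text.replace('{', ':').replace('}', '')


def anchored_convert(text):
    # Blank out lines that consist solely of a closing brace.
    text = re.sub(r'^[ \t]*\}[ \t]*$', '', text, flags=re.M)
    # Turn "name {" at the end of a line into "name:".
    return re.sub(r'^([ \t]*\S+)[ \t]+\{[ \t]*$', r'\1:', text, flags=re.M)


# Hypothetical input whose first value contains braces.
sample = "var1 : a{b}\ndict1 {\n    var2 : data2\n}"

print(repr(naive_convert(sample)))     # the value a{b} is altered
print(repr(anchored_convert(sample)))  # the value a{b} is left alone
```

Only the structural braces are touched in the second version, so the result can be handed to the YAML loader with less risk of mangling the data.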

Viktor Kerkez · Accepted Answer · 2013-08-10 22:57:46Z

0

import re

# key : value regexp
KV_RE = re.compile(r'^\s*(?P<key>[^\s]+)\s+:\s+(?P<value>.+?)\s*$')
# dict start regexp
DS_RE = re.compile(r'^\s*(?P<key>[^\s]+)\s+{\s*$')
# dict end regexp
DE_RE = re.compile(r'^\s*}\s*$')


def parse(s):
    current = {}
    stack = []
    for line in s.strip().splitlines():
        match = KV_RE.match(line)
        if match:
            gd = match.groupdict()
            current[gd['key']] = gd['value']
            continue
        match = DS_RE.match(line)
        if match:
            stack.append(current)
            current = current.setdefault(match.groupdict()['key'], {})
            continue
        match = DE_RE.match(line)
        if match:
            current = stack.pop()
            continue
        # Error occurred
        print('Error: %s' % line)
        return {}
    return current
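
For input that is less regular, a variant of the same stack-based idea might look like the sketch below. It assumes that blank lines should be ignored and that malformed input should raise an exception instead of printing and returning an empty dict. The name parse_strict and the error messages are illustrative and not part of the code above.

```python
import re

KV_RE = re.compile(r'^\s*(?P<key>\S+)\s+:\s+(?P<value>.+?)\s*$')
DS_RE = re.compile(r'^\s*(?P<key>\S+)\s+\{\s*$')
DE_RE = re.compile(r'^\s*\}\s*$')


def parse_strict(s):
    root = current = {}
    stack = []  # enclosing dicts, innermost last
    for lineno, line in enumerate(s.splitlines(), 1):
        if not line.strip():
            continue  # ignore blank lines
        match = KV_RE.match(line)
        if match:
            current[match.group('key')] = match.group('value')
            continue
        match = DS_RE.match(line)
        if match:
            stack.append(current)
            current = current.setdefault(match.group('key'), {})
            continue
        if DE_RE.match(line):
            if not stack:
                raise ValueError('line %d: unmatched closing brace' % lineno)
            current = stack.pop()
            continue
        raise ValueError('line %d: cannot parse %r' % (lineno, line))
    if stack:
        raise ValueError('%d block(s) left unclosed' % len(stack))
    return root


text = ("var1 : data1\ndict1 {\n\tvar3 : data3\n\tdict2 {\n"
        "\t\tvar4 : data4\n\t}\n\tvar5 : data5\n}\n")
print(parse_strict(text))
```

Checking the stack before popping turns an extra closing brace into a ValueError that names the offending line, which makes irregular data easier to track down.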

edited Aug 10, 2013 at 22:57

answered Aug 10, 2013 at 22:22

Viktor Kerkez


2 Comments

user2671189 Over a year ago

Thanks a bunch. I guess handling the regexps correctly is everything, eh? In fact I had to make some slight modifications, like KV_RE = re.compile('^\s*(?P<key>[\w\d]+) : (?P<value>.*)$'), due to the fact that the data string may or may not contain data, and may or may not contain whitespace and the like. Also, for some reason the function runs into an empty stack at some point. To keep the function from crashing I wrapped that spot in a try: block.

Viktor Kerkez Over a year ago

It was just a rough approximation, because I don't know exactly what your data looks like. If it is really regular with no special characters, the yaml and literal_eval solutions are also good choices; this one is more general and extensible. But if you are getting empty-stack errors, that probably means your data is not that regular after all? Or that not every closing brace is on its own line?

dawg · Accepted Answer · 2013-08-10 22:50:03Z

0

If your text is in the same regular pattern as the example, you can use ast.literal_eval to parse the string.

First, let's modify the string to be legal Python dict text:

import re

st='''\
var1 : data1
var2 : data2
dict1 {  
     var3 : data3  
     dict2 {  
         var4 : data4  
     }
     var5 : data5
}
'''

# add commas after key, val pairs
st=re.sub(r'^(\s*\w+\s*:\s*\w+)\s*$',r'\1,',st,flags=re.M)

# insert colon after name and before opening brace 
st=re.sub(r'^\s*(\w+\s*){\s*$',r'\1:{',st,flags=re.M)

# add comma after closing brace
st=re.sub(r'^(\s*})\s*$',r'\1,',st,flags=re.M)

# put names into quotes
st=''.join(['"{}"'.format(s.group(0)) if re.search(r'\w+',s.group(0)) else s.group(0) 
                for s in re.finditer(r'\w+|\W+',st)])

# add opening and closing braces
st='{'+st+'}'
print(st)

prints the modified string:

{"var1" : "data1",
"var2" : "data2",
"dict1" :{
     "var3" : "data3",
"dict2" :{
         "var4" : "data4",
     },
     "var5" : "data5",
},}

Now use ast to turn the string into a data structure:

import ast
print(ast.literal_eval(st))

prints

{'dict1': {'var5': 'data5', 'var3': 'data3', 'dict2': {'var4': 'data4'}}, 'var1': 'data1', 'var2': 'data2'}

answered Aug 10, 2013 at 22:50

dawg


2 Comments

user2671189 Over a year ago

Hmm, also not bad. Maybe I'll give that a shot as well, since the string format isn't set in stone: in fact, I generate it myself at some point. It also occurred to me that I may have to find some way to make the data strings handle multiple lines.

dawg Over a year ago

If you have control of the program creating the files, there are better options for persistent data. Look at pickle and json.
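
As a rough illustration of the json suggestion, the sketch below round-trips a nested dictionary through a JSON string. The sample data is made up, and it includes a multi-line value to show that embedded newlines survive the round trip because JSON escapes them.

```python
import json

# Hypothetical nested data, including a value that spans two lines.
data = {
    "var1": "data1",
    "dict1": {
        "var3": "line one\nline two",
        "dict2": {"var4": "data4"},
    },
}

text = json.dumps(data, indent=4)  # serialize to a human-readable string
restored = json.loads(text)        # parse it back into nested dicts

print(text)
print(restored == data)
```

json.dump and json.load do the same with file objects. pickle handles more Python types, but its output is not human-readable and it should not be used to load data from untrusted sources.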
