Consider the following data structure:

[HEADER1]
{
   key value
   key value
   ...
   [HEADER2]
   {
      key value
      ...
   }
   key value
   [HEADER3]
   {
      key value
      [HEADER4]
      {
         key value
         ...
      }
   }
   key value
}

There are no indents in the raw data; I added them here for clarity. The number of key-value pairs is unknown: '...' indicates there could be many more within each [HEADER] block. The number of [HEADER] blocks is also unknown.

Note that the structure is nested, so in this example header 2 and 3 are inside header 1 and header 4 is inside header 3.

There can be many more (nested) headers, but I kept the example short.

How do I go about parsing this into a nested dictionary structure? Each [HEADER] should be the key to whatever follows inside the curly brackets.

The final result should be something like:

dict = {'HEADER1': 'contents of 1'}
contents of 1 = {'key': 'value', 'key': 'value', 'HEADER2': 'contents of 2', etc}

I'm guessing I need some sort of recursive function, but I am pretty new to Python and have no idea where to start.

For starters, I can pull out all the [HEADER] keys as follows:

path = 'mydatafile.txt'
keys = []

# Collect every line that starts a [HEADER] block.
with open(path, 'rt') as file:
    for line in file:
        if line.startswith('['):
            keys.append(line.rstrip('\n'))

for key in keys:
    print(key)

But then what? Maybe this is not even needed?

Any suggestions?

  • So are there really headers without closing }s and also double }s and values outside of {}s? Commented Oct 21, 2017 at 18:06
  • No, each header is followed by {...}, but since they can be nested, there could be two closing brackets on adjacent lines. Commented Oct 21, 2017 at 18:07
  • What's going on with stuff under header2 and 4 then? Commented Oct 21, 2017 at 18:08
  • Oh wait, is header 2 within header1 ? Might be an idea to show how you'd expect the output dict to actually look Commented Oct 21, 2017 at 18:08
  • Correct, and 4 is inside 3, and 3 is inside 1. Commented Oct 21, 2017 at 18:09

1 Answer

You can do it by pre-formatting the file content with a few regex substitutions and then passing the result to json.loads.

Apply substitutions like these one by one, each to the output of the previous one:

#1 \[(\w*)\]\n -> "$1":

#2 \}\n(\w) -> },$1

#3 (\w*)\s(\w*)\n([^}]) -> $1:$2,$3

#4 (\w*)\s(\w*)\n\} -> $1:$2}

and then finally pass the final string to json.loads:

import json
d = json.loads(s)

which will parse it to a dict format.
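
As a rough sketch of how this can look in Python: the code below follows the same idea as the substitutions above but does not use the exact same patterns. It quotes keys and values (JSON requires double-quoted strings), inserts commas in a separate pass, and wraps the result in one top-level object. The name blocks_to_dict and the sample data are made up for illustration, and the code assumes every key and value is a single word-character token with no trailing whitespace on the line.

```python
import json
import re


def blocks_to_dict(text):
    """Rewrite the block format as JSON text, then parse it (hypothetical helper)."""
    # [HEADER] on its own line -> "HEADER":
    s = re.sub(r'^\[(\w+)\]$', r'"\1":', text, flags=re.MULTILINE)
    # key value -> "key": "value" (assumes both are single \w+ tokens)
    s = re.sub(r'^(\w+)[ \t]+(\w+)$', r'"\1": "\2"', s, flags=re.MULTILINE)
    # Add a comma after a value or a closing brace when another entry follows.
    s = re.sub(r'(["}])\n(?=")', r'\1,\n', s)
    # JSON needs a single top-level object around everything.
    return json.loads('{' + s + '}')


# Made-up sample in the same shape as the question's data (no indentation).
sample = """[HEADER1]
{
a 1
b 2
[HEADER2]
{
c 3
}
d 4
[HEADER3]
{
e 5
[HEADER4]
{
f 6
}
}
g 7
}
"""

data = blocks_to_dict(sample)
print(data['HEADER1']['HEADER3']['HEADER4'])  # {'f': '6'}
```

All values come back as strings; converting them to numbers would be a separate step.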

Explanation:

1. \[(\w*)\]\n : replace [HEADER]\n with "HEADER":

2. \}\n(\w) : replace any closing brace } that is followed by another entry with },

3. (\w*)\s(\w*)\n([^}]) : replace key value\n with key:value, on lines that are followed by another entry

4. (\w*)\s(\w*)\n\} : replace key value\n} with key:value} on lines that are the last entry in a block

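So, with minor modifications to these regexes, you will be able to parse the data into a dict. Note that json.loads also requires keys and string values to be wrapped in double quotes, and the whole text to be enclosed in one top-level {...}. The basic concept is to reformat the file contents into a format that can be parsed easily.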
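
If reformatting into JSON turns out to be fragile for your data, a different technique (not the regex approach described above) is to walk the lines once and keep a stack of the dictionaries that are currently open. The sketch below assumes each [HEADER] sits on its own line and is followed by a line containing only {, and that key and value are separated by whitespace. parse_blocks and the sample data are hypothetical names for illustration.

```python
def parse_blocks(lines):
    """Parse '[HEADER]', '{', 'key value' and '}' lines into nested dicts."""
    root = {}
    stack = [root]   # stack[-1] is the dict currently being filled
    pending = None   # header name waiting for its opening brace
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith('[') and line.endswith(']'):
            pending = line[1:-1]
        elif line == '{':
            if pending is None:
                raise ValueError("'{' without a preceding [HEADER]")
            child = {}
            stack[-1][pending] = child   # attach the new block to its parent
            stack.append(child)          # and descend into it
            pending = None
        elif line == '}':
            if len(stack) == 1:
                raise ValueError("unbalanced '}'")
            stack.pop()                  # back to the enclosing block
        else:
            parts = line.split(None, 1)  # split on the first run of whitespace
            stack[-1][parts[0]] = parts[1] if len(parts) > 1 else ''
    if len(stack) != 1:
        raise ValueError("missing closing '}'")
    return root


# Made-up sample in the same shape as the question's data.
sample_lines = """[HEADER1]
{
a 1
[HEADER2]
{
b 2
}
c 3
}
""".splitlines()

result = parse_blocks(sample_lines)
print(result)  # {'HEADER1': {'a': '1', 'HEADER2': {'b': '2'}, 'c': '3'}}
```

Because a file object yields lines when iterated, the same function can be called as parse_blocks(file) inside a with open(path) block. An explicit stack is used here instead of recursion, so deeply nested data does not run into Python's recursion limit.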


8 Comments

So how do I iterate over the lines and accomplish this? for line in file: line = re.sub("\[(\w*)\]\n", "", line) is not changing anything?
Don't iterate over lines. Read the whole file, apply the first regex to the file content, and then pass the resulting string to the next regex.
I see, I tried that: s = open(path, 'rt').read() s1 = re.sub("\[(\w*)\]\n", "", s), but no changes.
Also, don't replace with an empty string; check the answer for what to substitute with each regex. See this for how to use captured groups: stackoverflow.com/questions/6711567/…
Check out the above link, that'll help a lot with substitution. Basically, you need to use \1 instead of $1 in Python.
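
To make the last two points concrete, here is a small sketch of substitution #1 applied to the whole text with a \1 backreference. The text value is a made-up stand-in for the result of reading the file in one go (for example with open(path).read()). re.sub returns a new string, so the result has to be stored and passed on to the next step.

```python
import re

# Stand-in for the whole file content read at once (made-up data).
text = "[HEADER1]\n{\nkey value\n}\n"

# Pattern from substitution #1; Python's replacement syntax uses \1, not $1.
step1 = re.sub(r'\[(\w*)\]\n', r'"\1":', text)

print(step1)
```

Replacing with an empty string, as in the earlier attempts, would delete the headers rather than convert them.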