
I have a text string similar to the one below:

statistics:
    time-started: Tue Feb  5 15:33:35 2013
    time-sampled: Thu Feb  7 12:25:39 2013
    statistic:
        active: 0
        interactive: 0
    count: 0
    up:
        packets: 0
        bytes: 0
    down:
        packets: 0
        bytes: 0

I need to parse strings such as the one above (the real strings are much larger and more deeply nested; this is just an example). I think the easiest way to extract individual elements would be to convert the string to XML and then use xml.etree.ElementTree to select the element I am looking for.

So I would like to convert the string above into an XML string like the one below:

<statistics>
    <time-started>Tue Feb  5 15:33:35 2013</time-started>
    <time-sampled>Thu Feb  7 12:25:39 2013</time-sampled>
    <statistic>
        <active>0</active>
        <interactive>0</interactive>
    </statistic>
    <count>0</count>
    <up>
        <packets>0</packets>
        <bytes>0</bytes>
    </up>
    <down>
        <packets>0</packets>
        <bytes>0</bytes>
    </down>
</statistics>

As you can see, all of the information needed to convert the string into XML is already there. I don't want to reinvent the wheel if there is a simple way or a module that can do this.
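For reference, this is roughly the kind of lookup I have in mind once the XML exists. It is only a sketch on a shortened copy of the XML above; the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the target XML from the question.
xml_text = """<statistics>
    <time-started>Tue Feb  5 15:33:35 2013</time-started>
    <count>0</count>
    <up>
        <packets>0</packets>
        <bytes>0</bytes>
    </up>
</statistics>"""

root = ET.fromstring(xml_text)

# Direct child of the root element.
started = root.find("time-started").text

# Nested element, addressed with a simple path.
up_packets = root.find("up/packets").text

print(started, up_packets)
```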

  • The work you'll be doing to convert to XML will probably be enough to interpret the string directly: you need to parse the string to convert to XML and then you want to parse the XML to access the values, but you'll have the values once you've parsed the string :) Commented Feb 7, 2013 at 12:34
  • @isedev, that is why I was hoping there would be a module available that can do this! Commented Feb 7, 2013 at 12:37

2 Answers


You are basically trying to convert YAML to XML. You can use PyYAML to parse your input string into a Python dict and then use an XML generator to convert the dict to XML.
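A minimal sketch of the second step, using only xml.etree.ElementTree from the standard library. The nested dict is written out by hand here as a stand-in for what a YAML loader (for example PyYAML's yaml.safe_load) would be expected to return for a shortened version of the input; dict_to_element is a hypothetical helper name, not part of any library.

```python
import xml.etree.ElementTree as ET

def dict_to_element(tag, value):
    """Recursively turn a nested dict into an ElementTree element."""
    element = ET.Element(tag)
    if isinstance(value, dict):
        for child_tag, child_value in value.items():
            element.append(dict_to_element(child_tag, child_value))
    else:
        # Leaf: store the scalar as text (numbers become strings).
        element.text = str(value)
    return element

# Assumption: hand-written stand-in for the parsed YAML document.
data = {
    "statistics": {
        "time-started": "Tue Feb  5 15:33:35 2013",
        "count": 0,
        "up": {"packets": 0, "bytes": 0},
    }
}

# The document has exactly one top-level key, which becomes the root tag.
(root_tag, root_value), = data.items()
root = dict_to_element(root_tag, root_value)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Because the intermediate structure is a dict, a key that appears twice under the same parent would already have been collapsed to a single entry before this conversion runs.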


4 Comments

There are duplicate tags in the inner part of the tree (like the packets tags above). Will PyYAML still work?
The packets tag only repeats in separate branches. It is only a problem if you have the same tag twice under the same branch, e.g. up: packets: 0 packets: 1
@user2050283, can you provide an example or a link to a Python tutorial on how to use PyYAML?
@user2050283, I used PyYAML and it worked like a charm in my unit test, but running it for real I noticed that there are in fact some tags with the same name on the same level. Since PyYAML seems to use a dict, only the latest value is stored. What can I do if there are elements with the same name at the same level?

user2050283 is definitely right: it is YAML, and that makes parsing easy. Mainly for educational reasons, I tried to parse it myself. I'm looking forward to some feedback.

The structure of your data is hierarchical, i.e. tree-like. So let's define a tree in Python, as simply as possible (reference):

from collections import defaultdict

def tree(): return defaultdict(tree)
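To illustrate how this tree behaves, here is a small sketch. The keys are taken from the example data; the round trip through json is just one convenient way to get plain dicts for printing and is not part of the parser.

```python
import json
from collections import defaultdict

def tree():
    return defaultdict(tree)

t = tree()

# Intermediate levels are created on first access, so nested keys
# can be assigned without preparing the parent dicts first.
t["statistics"]["up"]["packets"] = "0"
t["statistics"]["count"] = "0"

# defaultdict is a dict subclass, so json can serialise it directly.
plain = json.loads(json.dumps(t))
print(plain)

# Caveat: merely reading a missing key also creates it.
before = "down" in t["statistics"]
t["statistics"]["down"]
after = "down" in t["statistics"]
```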

Next, let's use this tree in a parsing function. It iterates over the lines, looks at the indentation, keeps a record of it and of the current path (aka breadcrumbs), tries to split each line into a key and a value (if one exists), and fills our tree. Where appropriate, I extracted logical chunks into separate functions, which follow below. If an indentation doesn't match any previous indentation, it raises an error, basically like Python does for its source code.

def load_data(f):
    doc = tree()
    previous_indents = [""]
    path = [""]

    for line in map(lambda x: x.rstrip("\n"), 
                    filter( is_valid_line, f)
                ):
        line_wo_indent = line.lstrip(" ")
        indent = line[:(len(line) - len(line_wo_indent))]

        k, v = read_key_and_value(line_wo_indent)

        if len(indent) > len(previous_indents[-1]):
            previous_indents.append(indent)
            path.append(k)

        elif len(indent) == len(previous_indents[-1]):    
            path[-1] = k

        else: # indent is shorter
            try:
                while previous_indents[-1] != indent:
                    previous_indents.pop()
                    path.pop()            
            except IndexError:
                raise IndentationError("Indent doesn't match any previous indent.")
            path[-1] = k

        if v is not None:
            set_leaf_value_from_path(doc, path, v)
    return doc

The helper functions I created are:

  • set_leaf_value_from_path: takes a tree, a path (list of keys) and a value. It uses recursion to descend into the tree and sets the value of the leaf defined by path.
  • read_key_and_value: splits a line into key and value at the first ":"
  • is_valid_line: returns False for lines that are empty or start with a number sign, so that they are skipped

Here is the full script

from collections import defaultdict

def tree(): return defaultdict(tree)

def dicts(t): 
    if isinstance(t, dict):
        return {k: dicts(t[k]) for k in t}
    else:
        return t

def load_data(f):
    doc = tree()
    previous_indents = [""]
    path = [""]

    for line in map(lambda x: x.rstrip("\n"), 
                    filter( is_valid_line, f)
                ):
        line_wo_indent = line.lstrip(" ")
        indent = line[:(len(line) - len(line_wo_indent))]

        k, v = read_key_and_value(line_wo_indent)

        if len(indent) > len(previous_indents[-1]):
            previous_indents.append(indent)
            path.append(k)

        elif len(indent) == len(previous_indents[-1]):    
            path[-1] = k

        else: # indent is shorter
            try:
                while previous_indents[-1] != indent:
                    previous_indents.pop()
                    path.pop()            
            except IndexError:
                raise IndentationError("Indent doesn't match any previous indent.")
            path[-1] = k

        if v is not None:
            set_leaf_value_from_path(doc, path, v)
    return doc

def set_leaf_value_from_path(tree_, path, value):
    if len(path)==1:
        tree_[path[0]] = value
    else:
        set_leaf_value_from_path(tree_[path[0]], path[1:], value)

def read_key_and_value(line):
    pos_of_first_colon = line.index(":")
    k = line[:pos_of_first_colon].strip()
    v = line[pos_of_first_colon+1:].strip()
    return k, v if len(v) > 0 else None

def is_valid_line(line):
    if line.strip() == "":
        return False
    if line.lstrip().startswith("#"):
        return False
    return True


if __name__ == "__main__":
    import io

    document_str = """
statistics:
    time-started: Tue Feb  5 15:33:35 2013
    time-sampled: Thu Feb  7 12:25:39 2013
    statistic:
        active: 0
        interactive: 0
    count: 1
    up:
        packets: 2
        bytes: 2
    down:
        packets: 3
        bytes: 3
"""
    f = io.StringIO(document_str)
    doc = load_data(f)

    from pprint import pprint
    pprint(dicts(doc))

Known restrictions:

  • Only scalars are supported as values
  • All values are kept as strings; numbers are not converted
  • Multi-line scalars are not supported
  • Comments are not implemented as in the specification, i.e., they may not start anywhere in a line; only lines starting with a number sign are treated as comments

These are only the known restrictions. I'm sure other parts of YAML aren't supported either. But it seems to be enough for your data.
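One more restriction, raised in the comments on the other answer, is that a dict-based tree keeps only the last value when a key is repeated at the same level. A possible way around it, sketched here under assumptions, is to build xml.etree.ElementTree elements directly with the same indentation-stack idea, so repeated names simply become sibling elements. text_to_xml and the holder element are hypothetical names; unlike load_data above, this sketch does not check that a dedent matches an earlier indentation, and it does not validate that keys are legal XML tag names.

```python
import xml.etree.ElementTree as ET

def text_to_xml(text):
    """Build elements from 'key: value' lines nested by indentation."""
    # Stack of (indent_width, element). The sentinel at -1 is never
    # popped and collects the top-level elements.
    holder = ET.Element("holder")
    stack = [(-1, holder)]
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comment lines
        indent = len(line) - len(line.lstrip(" "))
        key, _, value = stripped.partition(":")
        # Close every open element that is not an ancestor of this line.
        while stack[-1][0] >= indent:
            stack.pop()
        element = ET.SubElement(stack[-1][1], key.strip())
        if value.strip():
            element.text = value.strip()
        stack.append((indent, element))
    return holder

sample = """\
statistics:
    up:
        packets: 1
        packets: 2
    count: 0
"""

root = text_to_xml(sample)[0]
values = [e.text for e in root.findall("up/packets")]
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Both packets entries survive as separate elements, and the result can be queried with find and findall as usual.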
