Obtain hierarchical structure from python string

Question

I am trying to obtain a hierarchical structure of sections, sub-sections, sub-sub-sections in a Wikipedia page.

I have a string like this:

mystr = 'a = b = = c = == d == == e == === f === === g === ==== h ==== === i === == j == == k == = l ='

In this case the page name is 'a' and the structure is as follows:

= b =
= c =
  == d ==
  == e ==
     === f ===
     === g ===
         ==== h ====
     === i ===
  == j ==
  == k ==
= l =

The equals signs indicate the nesting level: section, sub-section and so on. I need to obtain a Python list containing the full hierarchical path of every entry, like this:

mylist = ['a', 'a/b', 'a/c', 'a/c/d', 'a/c/e', 'a/c/e/f', 'a/c/e/g', 
          'a/c/e/g/h', 'a/c/e/i', 'a/c/j', 'a/c/k', 'a/l']

So far I have been able to find the sections, sub-sections and so on by doing this:

sections = re.findall(r' = (.*?)\ =', mystr)
subsections = re.findall(r' == (.*?)\ ==', mystr)
...

But I don't know how to proceed from here to get the desired mylist.

Welcome to SO. To improve your question, please describe how the hierarchical structure should be determined from the string (i.e. the meaning of the equals signs) and post code demonstrating what you have already attempted.
– David Scarlett, Commented Jun 2, 2017 at 5:41
Basically I am trying to extract text from Wikipedia. The string contains the content names of a particular Wikipedia page (sections, sub-sections, sub-sub-sections, etc). In my example a is the page name; b, c, l are sections (so they have only one equals sign around them); d, e, j, k are sub-sections under c (so they have two equals signs around them) and so on.
– user8101320, Commented Jun 2, 2017 at 5:50

Thierry Lathuille · Accepted Answer · 2017-06-02 07:56:22Z

You can do it like this:
- the first function parses your string, and yields tokens (level, name) like (0, 'a'), (1, 'b')
- the second one builds the tree from there.

import re

def tokens(string):
    # The root name doesn't respect the '= name =' convention,
    # so we cut the string on the first " = " and yield the root name
    root_end = string.index(' = ') 
    root, rest = string[:root_end], string[root_end:]
    yield 0, root

    # We use a regex for the remaining tokens, which consist of the following groups:
    # - one or more "=" followed by an optional space,
    # - the name, not containing any "=",
    # - and again the exact text of the first group (the \1 backreference).

    tokens_re = re.compile(r'(=+ ?)([^=]+)\1')
    # findall will return a list:
    # [('= ', 'b '), ('= ', 'c '), ('== ', 'd '), ('== ', 'e '), ('=== ', 'f '), ...]
    for token in tokens_re.findall(rest):
        level = token[0].count('=')
        name = token[1].strip()
        yield level, name


def tree(token_list):    
    out = []
    # We keep track of the current position in the hierarchy:
    hierarchy = []
    for token in token_list:
        level, name = token
        # We cut the hierarchy below the level of our token
        hierarchy = hierarchy[:level]
        # and append the current one
        hierarchy.append(name)
        out.append('/'.join(hierarchy))
    return out


mystr = 'a = b = = c = == d == == e == === f === === g === ==== h ==== === i === == j == == k == = l ='
out = tree(tokens(mystr))
# Check that this is your expected output
assert out == ['a', 'a/b', 'a/c', 'a/c/d', 'a/c/e', 'a/c/e/f', 'a/c/e/g', 
          'a/c/e/g/h', 'a/c/e/i', 'a/c/j', 'a/c/k', 'a/l']

print(out)
# ['a', 'a/b', 'a/c', 'a/c/d', 'a/c/e', 'a/c/e/f', 'a/c/e/g', 'a/c/e/g/h', 'a/c/e/i', 'a/c/j', 'a/c/k', 'a/l']
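
To see why the slice `hierarchy[:level]` is enough to keep track of the current position, here is a minimal sketch that records the stack after each token. The name `trace_paths` and the hand-written `sample_tokens` list are only for illustration; they are not part of the code above, but the tokens have the same `(level, name)` shape that `tokens()` yields.

```python
# Hand-written (level, name) pairs, in the same shape tokens() yields.
sample_tokens = [(0, 'a'), (1, 'c'), (2, 'e'), (3, 'g'), (4, 'h'), (1, 'l')]


def trace_paths(token_list):
    """Return (level, stack after update, joined path) for each token."""
    stack = []
    steps = []
    for level, name in token_list:
        # Keep only the ancestors above this level, then push the new name.
        stack = stack[:level] + [name]
        steps.append((level, list(stack), '/'.join(stack)))
    return steps


steps = trace_paths(sample_tokens)
for level, stack, path in steps:
    print(level, stack, path)
# The last token (1, 'l') cuts the stack back to ['a'] before appending,
# so its path is 'a/l' even though the previous entry was four levels deep.
```

Note that the slice never pads the stack: if a heading skipped a level (say a level 3 directly under a level 1), the resulting path would simply be one component shorter than the level suggests. That does not happen in the string from the question.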
