I am trying to obtain a hierarchical structure of sections, sub-sections, sub-sub-sections in a Wikipedia page.

I have a string like this:

mystr = 'a = b = = c = == d == == e == === f === === g === ==== h ==== === i === == j == == k == = l ='

In this case the page name is 'a' and the structure is as follows:

= b =
= c =
  == d ==
  == e ==
     === f ===
     === g ===
         ==== h ====
     === i ===
  == j ==
  == k ==
= l =

The number of equals signs around a name indicates its depth: one for a section, two for a sub-section, and so on. I need to obtain a Python list containing the full path of every entry in the hierarchy, like this:

mylist = ['a', 'a/b', 'a/c', 'a/c/d', 'a/c/e', 'a/c/e/f', 'a/c/e/g', 
          'a/c/e/g/h', 'a/c/e/i', 'a/c/j', 'a/c/k', 'a/l']

So far I have been able to find the sections, sub-sections and so on by doing this:

import re

sections = re.findall(r' = (.*?) =', mystr)
subsections = re.findall(r' == (.*?) ==', mystr)
...

But I don't know how to proceed from here to get the desired mylist.

  • Welcome to SO. To improve your question, please describe how the hierarchical structure should be determined from the string (i.e. the meaning of the equals signs) and post code demonstrating what you have already attempted. Commented Jun 2, 2017 at 5:41
  • Basically I am trying to extract text from Wikipedia. The string contains the content names of a particular Wikipedia page (sections, sub-sections, sub-sub-sections, etc.). In my example, a is the page name; b, c, l are sections (so they have only one equals sign on each side); d, e, j, k are sub-sections under c (so they have two equals signs on each side), and so on. Commented Jun 2, 2017 at 5:50

1 Answer

You can do it in two steps:
- the first function parses your string and yields (level, name) tokens such as (0, 'a') and (1, 'b');
- the second one builds the list of paths from those tokens.

import re

def tokens(string):
    # The root name doesn't respect the '= name =' convention,
    # so we cut the string on the first " = " and yield the root name
    root_end = string.index(' = ') 
    root, rest = string[:root_end], string[root_end:]
    yield 0, root

    # We use a regex for the remaining tokens, which consist of the following groups:
    # - one or more "=" followed by an optional space,
    # - the name, not containing any "=",
    # - and a backreference (\1) matching the first group again

    tokens_re = re.compile(r'(=+ ?)([^=]+)\1')
    # findall will return a list:
    # [('= ', 'b '), ('= ', 'c '), ('== ', 'd '), ('== ', 'e '), ('=== ', 'f '), ...]
    for token in tokens_re.findall(rest):
        level = token[0].count('=')
        name = token[1].strip()
        yield level, name


def tree(token_list):    
    out = []
    # We keep track of the current position in the hierarchy:
    hierarchy = []
    for token in token_list:
        level, name = token
        # We cut the hierarchy below the level of our token
        hierarchy = hierarchy[:level]
        # and append the current one
        hierarchy.append(name)
        out.append('/'.join(hierarchy))
    return out


mystr = 'a = b = = c = == d == == e == === f === === g === ==== h ==== === i === == j == == k == = l ='
out = tree(tokens(mystr))
# Check that this is your expected output
assert out == ['a', 'a/b', 'a/c', 'a/c/d', 'a/c/e', 'a/c/e/f', 'a/c/e/g', 
          'a/c/e/g/h', 'a/c/e/i', 'a/c/j', 'a/c/k', 'a/l']

print(out)
# ['a', 'a/b', 'a/c', 'a/c/d', 'a/c/e', 'a/c/e/f', 'a/c/e/g', 'a/c/e/g/h', 'a/c/e/i', 'a/c/j', 'a/c/k', 'a/l']
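
If a nested structure is more convenient than a flat list of paths, the output above can be folded into nested dictionaries. The following is only a minimal sketch: the helper name `nest` is hypothetical (it is not part of the answer's code), and it assumes that no section name contains a '/' character, since the paths are split on it.

```python
def nest(paths):
    """Fold 'a/b/c'-style paths into nested dicts (hypothetical helper)."""
    root = {}
    for path in paths:
        node = root
        # Walk down the tree, creating an empty dict for each missing part.
        # Assumption: section names never contain '/'.
        for part in path.split('/'):
            node = node.setdefault(part, {})
    return root


paths = ['a', 'a/b', 'a/c', 'a/c/d', 'a/c/e', 'a/c/e/f']
print(nest(paths))
# {'a': {'b': {}, 'c': {'d': {}, 'e': {'f': {}}}}}
```

Leaf sections end up as empty dictionaries, so the same function works whatever the depth of the hierarchy.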