Parsing nested HTML Lists using Python

Question

My HTML code contains nested lists like this:

<ul>
  <li>Apple</li>
  <li>Pear</li>
  <ul>
     <li>Cherry</li>
     <li>Orange</li>
     <ul>
        <li>Pineapple</li>
     </ul>
  </ul>
  <li>Banana</li>
</ul>

I need to parse them so they look like this:

+ Apple
+ Pear
++ Cherry
++ Orange
+++ Pineapple
+ Banana

I tried using BeautifulSoup, but I am stuck on how to account for the nesting depth in my code.

My attempt so far, where x contains the HTML listed above:

import bs4

soup = bs4.BeautifulSoup(x, "html.parser")
for ul in soup.find_all("ul"):
    for li in ul.find_all("li"):
        li.replace_with("+ {}\n".format(li.text))

Jack Fleeting · Accepted Answer · 2021-11-11 15:19:03Z

3

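It's somewhat of a hack, but you can do it with lxml instead: get the XPath of each li element and count how many times 'ul' appears in it.

import lxml.html as lh
from lxml import etree

uls = """[your html above]"""
doc = lh.fromstring(uls)
tree = etree.ElementTree(doc)
for e in doc.iter('li'):
    # The path of a nested item looks like .../ul/ul/li[1]:
    # one 'ul' step per nesting level
    path = tree.getpath(e)
    print('+' * path.count('ul'), e.text)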

Output:

+ Apple
+ Pear
++ Cherry
++ Orange
+++ Pineapple
+ Banana
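
If you would rather stay with BeautifulSoup, the same ancestor-counting idea can be sketched there as well. This is only an illustration, not part of the answer above: the names `render_list` and `HTML` are made up here, and the sketch assumes that the text of each li sits directly inside that li.

```python
import bs4

HTML = """
<ul>
  <li>Apple</li>
  <li>Pear</li>
  <ul>
     <li>Cherry</li>
     <li>Orange</li>
     <ul>
        <li>Pineapple</li>
     </ul>
  </ul>
  <li>Banana</li>
</ul>
"""


def render_list(html):
    """Return one line per <li>, prefixed with one '+' per enclosing <ul>."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    lines = []
    for li in soup.find_all("li"):
        # Nesting depth = number of <ul> ancestors of this item
        depth = len(li.find_parents("ul"))
        # Only the strings that are direct children of the <li>,
        # so text of any list nested inside it is not repeated
        own_text = "".join(li.find_all(string=True, recursive=False)).strip()
        lines.append("{} {}".format("+" * depth, own_text))
    return "\n".join(lines)


print(render_list(HTML))
```

Because the depth is read from each item's ancestors, the loop stays flat and needs no recursion.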

Ajax1234 · Accepted Answer · 2021-11-11 16:57:47Z

2

You can use recursion, passing the current depth down and increasing it by one for each tag you descend into:

import bs4
from bs4 import BeautifulSoup as soup

s = """
<ul>
  <li>Apple</li>
  <li>Pear</li>
  <ul>
     <li>Cherry</li>
     <li>Orange</li>
     <ul>
        <li>Pineapple</li>
     </ul>
  </ul>
  <li>Banana</li>
</ul>
"""

def indent(d, c=0):
    # Emit this tag's own non-blank text, prefixed with one '+' per level
    if (text := ''.join(i for i in d.contents if isinstance(i, bs4.NavigableString) and i.strip())):
        yield f'{"+" * c} {text}'
    # Recurse into child tags one level deeper
    for i in d.contents:
        if not isinstance(i, bs4.NavigableString):
            yield from indent(i, c + 1)

print('\n'.join(indent(soup(s, 'html.parser').ul)))

Output:

+ Apple
+ Pear
++ Cherry
++ Orange
+++ Pineapple
+ Banana
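
The depth bookkeeping that the recursion does can also be done in a single streaming pass with the standard library's `html.parser`, if a third-party dependency is not wanted. This is a rough sketch rather than anything from the answer above: `ListFlattener` and `flatten` are hypothetical names, and it assumes each li contains a single run of plain text.

```python
from html.parser import HTMLParser


class ListFlattener(HTMLParser):
    """Collect <li> text, prefixed with one '+' per currently open <ul>."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of <ul> tags currently open
        self.in_li = False  # True while directly inside an <li>
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self.depth += 1
            self.in_li = False
        elif tag == "li":
            self.in_li = True

    def handle_endtag(self, tag):
        if tag == "ul":
            self.depth -= 1
        elif tag == "li":
            self.in_li = False

    def handle_data(self, data):
        # Skip the whitespace between tags; keep only item text
        if self.in_li and data.strip():
            self.lines.append("{} {}".format("+" * self.depth, data.strip()))


def flatten(html):
    parser = ListFlattener()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


print(flatten("<ul><li>Apple</li><ul><li>Cherry</li></ul><li>Banana</li></ul>"))
```

The counter goes up on every opening ul and down on every closing one, so it plays the role of the `c` argument in the recursive version.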

RJ Adriaansen · Accepted Answer · 2021-11-11 15:39:51Z

1

I think it would be easier to convert the HTML string to Markdown with custom bullets, one bullet string per nesting level. This can be done with markdownify:

import markdownify

# strip takes a list of tag names; x is the HTML string from the question
formatted_html = markdownify.markdownify(x, bullets=['+', '++', '+++'], strip=["ul"])

Result:

+ Apple
+ Pear
++ Cherry
++ Orange
+++ Pineapple
+ Banana
