I'm trying to use BeautifulSoup to parse the following string into a nested list (with each <p> tag converted into its own list):
message = '<p>part one <a href="/links/link1">part two</a>part three</p><p>part four <a href="/links/link2">part five</a>part six</p><p>part seven <a href="/links/link3">part eight</a></p>'
I want the output to look like:
[['part one','/links/link1','part two','part three'],['part four','/links/link2','part five','part six'],['part seven','/links/link3','part eight']]
I only want the <p> tags to be converted into nested lists. Everything else should come out as strings in the main list.
My script is:
import bs4

def get_data(d):
    if d.name == 'p':
        # each <p> becomes a nested list of its direct children
        yield list(d)
    else:
        if isinstance(d, bs4.element.NavigableString):
            yield d
        if d.name == 'a':
            yield d['href']
    # recurse into the children of the current node
    yield from [i for b in getattr(d, 'contents', []) for i in get_data(b)]

def messageParser(message):
    return list(get_data(bs4.BeautifulSoup(message, 'html.parser')))
But what I get is:
[['part one ', <a href="/links/link1">part two</a>, 'part three'], 'part one ', '/links/link1', 'part two', 'part three', ['part four ', <a href="/links/link2">part five</a>, 'part six'], 'part four ', '/links/link2', 'part five', 'part six', ['part seven ', <a href="/links/link3">part eight</a>], 'part seven ', '/links/link3', 'part eight']
Why is the <p> tag content also parsed outside the nested list (as a duplicate)? What am I missing here?
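One likely explanation, judging from the output: the final `yield from` in `get_data` still runs after the `p` branch, so every <p> is first emitted as `list(d)` (which also keeps the raw <a> tag objects) and then flattened a second time into the main list. Below is a minimal sketch of one way to avoid that, not a definitive fix. The names `walk` and `message_parser` are hypothetical, and stripping whitespace and skipping empty strings are assumptions made to match the desired output.

```python
import bs4
from bs4.element import NavigableString

def walk(node):
    """Yield text and link targets under node; each <p> becomes a nested list."""
    for child in getattr(node, 'contents', []):
        if isinstance(child, NavigableString):
            text = child.strip()  # assumption: surrounding whitespace is unwanted
            if text:
                yield text
        elif child.name == 'p':
            # Build the nested list here and do not descend into the <p> again,
            # so its content is never repeated in the outer list.
            yield list(walk(child))
        else:
            if child.name == 'a' and child.has_attr('href'):
                yield child['href']
            yield from walk(child)

def message_parser(message):
    return list(walk(bs4.BeautifulSoup(message, 'html.parser')))

message = ('<p>part one <a href="/links/link1">part two</a>part three</p>'
           '<p>part four <a href="/links/link2">part five</a>part six</p>'
           '<p>part seven <a href="/links/link3">part eight</a></p>')
result = message_parser(message)
print(result)
```

The key design choice is that recursion happens in exactly one place per node: a <p> is handled by a single recursive call whose results are collected into a list, while any other tag passes its results straight through to the enclosing list.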