Parsing String with HTML tags into a Nested List with Beautifulsoup

Question

I'm trying to use BeautifulSoup to parse the following string into a nested list (with each <p> tag converted into a list):

message = '<p>part one <a href="/links/link1">part two</a>part three</p><p>part four <a href="/links/link2">part five</a>part six</p><p>part seven <a href="/links/link3">part eight</a></p>'

I want the output to look like:

[['part one','/links/link1','part two','part three'],['part four','/links/link2','part five','part six'],['part seven','/links/link3','part eight']]

I only want the <p> tags to be converted into nested lists. Everything else should come out as strings in the main list.

My script is:

def get_data(d):
  if d.name == 'p':
    yield list(d)
  else:
    if isinstance(d, bs4.element.NavigableString):
      yield d
    if d.name == 'a':
      yield d['href']
  yield from [i for b in getattr(d, 'contents', []) for i in get_data(b)]


def messageParser(message):
  return list(get_data(bs4.BeautifulSoup(message, 'html.parser')))

But what I get is:

[['part one ', <a href="/links/link1">part two</a>, 'part three'], 'part one ', '/links/link1', 'part two', 'part three', ['part four ', <a href="/links/link2">part five</a>, 'part six'], 'part four ', '/links/link2', 'part five', 'part six', ['part seven ', <a href="/links/link3">part eight</a>], 'part seven ', '/links/link3', 'part eight']

Why is the <p> tag content also parsed outside the nested list (as a duplicate)? What am I missing here?

KunduK · Accepted Answer · 2019-09-17 09:49:42Z


Try the code below. You need to iterate over the p tags and parse each one separately.

import bs4

message = '<p>part one <a href="/links/link1">part two</a>part three</p><p>part four <a href="/links/link2">part five</a>part six</p><p>part seven <a href="/links/link3">part eight</a></p>'

def get_data(d):
    # Text nodes are yielded as-is.
    if isinstance(d, bs4.element.NavigableString):
        yield d
    # For links, yield the href before descending into the link text.
    if d.name == 'a':
        yield d['href']
    # Recurse into the children, if there are any.
    yield from [i for b in getattr(d, 'contents', []) for i in get_data(b)]


def messageParser(message):
    return list(get_data(bs4.BeautifulSoup(message, 'html.parser')))

# Parse each <p> separately so that each one becomes its own nested list.
print([messageParser(str(item)) for item in bs4.BeautifulSoup(message, 'html.parser').select('p')])

Output:

[['part one ', '/links/link1', 'part two', 'part three'], ['part four ', '/links/link2', 'part five', 'part six'], ['part seven ', '/links/link3', 'part eight']]
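
The duplication in the question's version appears to come from the final yield from line, which still runs after the p branch has already yielded list(d). Each paragraph's children are therefore emitted once inside the nested list and once more by the recursion. Below is a minimal sketch of a variant that parses the string only once and strips surrounding whitespace so the result matches the desired output. It assumes that empty text nodes can be dropped; the names flatten and parse_paragraphs are made up for illustration and are not part of the answer above.

```python
import bs4
from bs4.element import NavigableString


def flatten(node):
    # Text nodes are leaves: yield the stripped text (if any) and stop.
    if isinstance(node, NavigableString):
        text = node.strip()
        if text:
            yield text
        return
    # For links, emit the href before the link text.
    if node.name == 'a' and node.has_attr('href'):
        yield node['href']
    # Descend into the children of the tag in document order.
    for child in node.contents:
        yield from flatten(child)


def parse_paragraphs(message):
    soup = bs4.BeautifulSoup(message, 'html.parser')
    # One nested list per <p>, without re-parsing str(item).
    return [list(flatten(p)) for p in soup.find_all('p')]
```

Calling parse_paragraphs(message) on the sample string should give the nested list from the question, without the trailing spaces seen in the output above.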


4 Comments

Alijy Over a year ago

I tried your solution. It only works if every part of the message is within <p> tags. When the message string contains something outside a <p> tag (e.g. message = 'part one part two <li>part three</li>'), it doesn't parse it properly, since it only selects the p tags to work with.

KunduK Over a year ago

I provided the solution based on the requirements you posted. If you have any other requirements, you should post them as a separate question. Thanks.

Alijy Over a year ago

Yes you did, but the string in the question was just a sample of the set. I did mention in my question that "Everything else should come out as strings in the main list."

Alijy Over a year ago

I figured it out. Just needed to remove .select('p') from the print line.
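
For the mixed case raised in these comments, one possible sketch follows. It assumes that only top-level <p> elements should become nested lists, and that text and links outside them should go straight into the main list as strings. The names flatten and parse_message are hypothetical and do not come from the answer or the comments.

```python
import bs4
from bs4.element import NavigableString, Tag


def flatten(node):
    # Text nodes are leaves: yield the stripped text (if any) and stop.
    if isinstance(node, NavigableString):
        text = node.strip()
        if text:
            yield text
        return
    # For links, emit the href before the link text.
    if node.name == 'a' and node.has_attr('href'):
        yield node['href']
    for child in node.contents:
        yield from flatten(child)


def parse_message(message):
    soup = bs4.BeautifulSoup(message, 'html.parser')
    result = []
    for child in soup.contents:
        if isinstance(child, Tag) and child.name == 'p':
            # A top-level <p> becomes its own nested list.
            result.append(list(flatten(child)))
        else:
            # Anything else is flattened into the main list.
            result.extend(flatten(child))
    return result
```

With this sketch, parse_message('part one part two <li>part three</li>') should return ['part one part two', 'part three'], while a string made only of <p> elements still yields one nested list per paragraph.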
