I'm trying to use BeautifulSoup to parse the following string into a nested list (with each <p> tag converted into a list):

message = '<p>part one <a href="/links/link1">part two</a>part three</p><p>part four <a href="/links/link2">part five</a>part six</p><p>part seven <a href="/links/link3">part eight</a></p>'

I want the output to look like:

[['part one','/links/link1','part two','part three'],['part four','/links/link2','part five','part six'],['part seven','/links/link3','part eight']]

I only want the <p> tags to be converted into nested lists. Everything else should come out as strings in the main list.

My script is:

import bs4

def get_data(d):
  if d.name == 'p':
    yield list(d)
  else:
    if isinstance(d, bs4.element.NavigableString):
      yield d
    if d.name == 'a':
      yield d['href']
  yield from [i for b in getattr(d, 'contents', []) for i in get_data(b)]


def messageParser(message):
  return list(get_data(bs4.BeautifulSoup(message, 'html.parser')))

But what I get is:

[['part one ', <a href="/links/link1">part two</a>, 'part three'], 'part one ', '/links/link1', 'part two', 'part three', ['part four ', <a href="/links/link2">part five</a>, 'part six'], 'part four ', '/links/link2', 'part five', 'part six', ['part seven ', <a href="/links/link3">part eight</a>], 'part seven ', '/links/link3', 'part eight']

Why is the content of each <p> tag also parsed outside the nested list (as a duplicate)? What am I missing here?

1 Answer

Try the code below. You need to iterate over the <p> tags.

import bs4

message = '<p>part one <a href="/links/link1">part two</a>part three</p><p>part four <a href="/links/link2">part five</a>part six</p><p>part seven <a href="/links/link3">part eight</a></p>'

def get_data(d):
  # Text nodes are yielded as-is.
  if isinstance(d, bs4.element.NavigableString):
    yield d
  # For links, yield the href before descending into the link text.
  if d.name == 'a':
    yield d['href']
  # Recurse into child nodes (text nodes have no 'contents').
  yield from [i for b in getattr(d, 'contents', []) for i in get_data(b)]


def messageParser(message):
  return list(get_data(bs4.BeautifulSoup(message, 'html.parser')))

# Build one flat list per <p> tag.
print([messageParser(str(item)) for item in bs4.BeautifulSoup(message, 'html.parser').select('p')])

Output:

[['part one ', '/links/link1', 'part two', 'part three'], ['part four ', '/links/link2', 'part five', 'part six'], ['part seven ', '/links/link3', 'part eight']]
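As for why the duplicates appear in the question's version: after the 'p' branch runs yield list(d), execution still reaches the final yield from line, which recurses into the same <p> tag's children and yields them a second time into the main list. The sketch below is one possible way to avoid that while also keeping content outside <p> tags as plain strings. The names flatten and parse_message are made up for illustration, and it assumes that only top-level <p> tags should become nested lists and that surrounding whitespace should be stripped (as in the desired output in the question).

```python
import bs4
from bs4.element import NavigableString, Tag


def flatten(node):
    """Yield stripped text pieces and link hrefs under node, in document order."""
    if isinstance(node, NavigableString):
        text = node.strip()
        if text:  # skip whitespace-only text nodes
            yield text
        return
    if node.name == 'a' and node.has_attr('href'):
        yield node['href']
    for child in node.children:
        yield from flatten(child)


def parse_message(message):
    """Nest each top-level <p>; everything else goes into the main list."""
    soup = bs4.BeautifulSoup(message, 'html.parser')
    result = []
    for child in soup.children:
        if isinstance(child, Tag) and child.name == 'p':
            # A <p> tag becomes its own nested list, yielded exactly once.
            result.append(list(flatten(child)))
        else:
            # Bare text and other tags are flattened into the main list.
            result.extend(flatten(child))
    return result


print(parse_message('<p>part one</p> part two <li>part three</li>'))
```

Because each <p> is handled in exactly one place, its children are never visited twice. If <p> tags can be nested inside other tags, this sketch would flatten them rather than nest them, so it would need adjusting for that case.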

4 Comments

I tried your solution. It only works if every part of the message is inside <p> tags. If the message string contains something outside a <p> tag (e.g. message = '<p>part one</p> part two <li>part three</li>'), it isn't parsed properly, since only the <p> tags are selected.
I provided a solution based on the requirements you posted. If you have other requirements, you should post them as a separate question. Thanks.
Yes, you did, but the string in the question was just a sample of the set. I did mention in my question that everything else should come out as strings in the main list.
I figured it out. Just needed to remove .select('p') from the print line.
