I am trying to parse with BeautifulSoup's html.parser, and I am having trouble with the <link> tag: it is processed differently from other tags.

On the <title> tag, it works as expected:

>>> BeautifulSoup("<title>Somalia’s Electoral Crisis in Extremis</title>", features='html.parser')
<title>Somalia’s Electoral Crisis in Extremis</title>

However, when processing the <link> tag, it adds a slash to the opening tag and drops the closing tag:

>>> BeautifulSoup("<link>https://warontherocks.com/2021/04/somalias-electoral-crisis-in-extremis/</link>", features='html.parser')
<link/>https://warontherocks.com/2021/04/somalias-electoral-crisis-in-extremis/

Why is it doing this?

Now if I use the 'lxml' or 'xml' parsers instead, it works fine:

>>> BeautifulSoup("<link>https://warontherocks.com/2021/04/somalias-electoral-crisis-in-extremis/</link>", features='lxml')
<html><head><link/></head><body><p>https://warontherocks.com/2021/04/somalias-electoral-crisis-in-extremis/</p></body></html>

I am using html.parser because I also encounter nested elements (tags like <something:tag>) and CDATA strings. If it is possible to parse CDATA with lxml (which did not work for me), that would also be a solution.

Am I going to have to write some logic to decide which library to parse each site with, or is there a way to do this with BeautifulSoup as is?
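
A minimal sketch of what html.parser appears to be doing here, assuming bs4 is installed: in HTML, <link> is a void element (it cannot have content or an end tag), so the tag is closed immediately and the URL ends up as the text node right after it. The variable names below are illustrative only.

```python
from bs4 import BeautifulSoup

url = "https://warontherocks.com/2021/04/somalias-electoral-crisis-in-extremis/"
soup = BeautifulSoup(f"<link>{url}</link>", features="html.parser")

link = soup.find("link")

# The tag is treated as a void (empty) element, so it has no children ...
print(link.is_empty_element)
print(link.contents)

# ... and the URL is kept as the sibling text node that follows the tag.
link_text = str(link.next_sibling)
print(link_text)
```

If the input has to stay on html.parser, reading `next_sibling` is one way to recover the URL without switching parsers.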

  • lxml is faster and more forgiving, so I generally use that. There is an existing post comparing the various parsers, and the documentation compares them as well. <link> should have no end tag: html.spec.whatwg.org/#the-link-element. I assume lxml is attempting a repair on this HTML? Commented Apr 3, 2021 at 1:19
  • Thanks, but is there a way to handle CDATA in lxml? That is why I am not using it. Commented Apr 3, 2021 at 1:19
  • stackoverflow.com/questions/13694143/…. <![CDATA[ ]]> is an instruction that the content should not be interpreted as XML. There are probably more answers on how to work with CDATA. I will have a quick look. This is just the one that came to mind. I note you wanted to use lxml. Commented Apr 3, 2021 at 1:21
  • Or is there a way to specify custom tags for cases where I encounter ones that are malformed in a specific way? Commented Apr 3, 2021 at 1:21
  • stackoverflow.com/questions/37661822/python-lxml-modify-cdata, stackoverflow.com/questions/25813756/… and possibly some of these: stackoverflow.com/search?q=lxml+cdata Commented Apr 3, 2021 at 1:23
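
For the CDATA question raised in the comments, one option that may work is the 'xml' feature rather than 'lxml'. The sketch below assumes lxml is installed and that the input is well-formed XML such as an RSS item. The <description> element and its text are made up for illustration; only the title and link come from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical feed fragment: <description> and its CDATA text are invented
# for this example; the title and link are taken from the question.
feed = """<item>
<title>Somalia’s Electoral Crisis in Extremis</title>
<link>https://warontherocks.com/2021/04/somalias-electoral-crisis-in-extremis/</link>
<description><![CDATA[Summary with <b>markup</b> inside CDATA]]></description>
</item>"""

# features="xml" uses lxml's XML parser, which has no notion of HTML void elements.
soup = BeautifulSoup(feed, features="xml")

link_url = soup.find("link").get_text()
description = soup.find("description").get_text()

print(link_url)
print(description)
```

In XML mode the <link> element keeps its text, and the CDATA section comes back as plain text with its inner markup left unparsed.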
