I am trying to parse with BeautifulSoup's html.parser, and I am having trouble with the <link> tag: it is processed differently from other tags.
On the <title> tag, it works as expected:
>>> BeautifulSoup("<title>Somalia’s Electoral Crisis in Extremis</title>", features='html.parser')
<title>Somalia’s Electoral Crisis in Extremis</title>
However, when processing the <link> tag, it inserts a slash into the opening tag and drops the closing tag:
>>> BeautifulSoup("<link>https://warontherocks.com/2021/04/somalias-electoral-crisis-in-extremis/</link>", features='html.parser')
<link/>https://warontherocks.com/2021/04/somalias-electoral-crisis-in-extremis/
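The difference between the two tags can also be checked programmatically. The following is only a minimal sketch, assuming bs4 is installed; the helper name parse_fragment and the example.com URL are placeholders for illustration, not part of my real code. It suggests that html.parser treats <link> as an empty (void) HTML element, so the URL ends up as a sibling of the tag rather than as its content:

```python
from bs4 import BeautifulSoup


def parse_fragment(markup):
    # Hypothetical helper: parse a fragment with the built-in html.parser backend.
    return BeautifulSoup(markup, features="html.parser")


title_soup = parse_fragment("<title>Some title</title>")
link_soup = parse_fragment("<link>https://example.com/post/</link>")  # placeholder URL

title_tag = title_soup.find("title")
link_tag = link_soup.find("link")

# <title> keeps its text as a child node.
print(title_tag.is_empty_element, repr(title_tag.string))

# <link> is closed immediately, so it has no children;
# the URL text becomes the next sibling of the tag instead.
print(link_tag.is_empty_element, repr(link_tag.string))
print(repr(link_tag.next_sibling))
```

If that reading is right, the text is not lost, but it is no longer reachable through the tag itself, only through next_sibling.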
Why is it doing this?
Now, if I use the 'lxml' or 'xml' parsers instead, it works fine:
>>> BeautifulSoup("<link>https://warontherocks.com/2021/04/somalias-electoral-crisis-in-extremis/</link>", features='lxml')
<html><head><link/></head><body><p>https://warontherocks.com/2021/04/somalias-electoral-crisis-in-extremis/</p></body></html>
I am using html.parser because I also encounter namespace-prefixed elements (tags like <something:tag>) and CDATA strings. A way to parse CDATA with lxml (which did not work for me) would therefore also be a solution, if one exists.
Am I going to have to write some logic to decide which library to parse each site with, or is there a way to do this with BeautifulSoup as is?
Alternatively, is there a way to parse CDATA with lxml? Its CDATA handling is the reason I am not using it.