BeautifulSoup html.parser and <link> tag in xml (vs CDATA with 'lxml' parser). Must I use both?


Asked 4 years, 8 months ago

Modified 4 years, 8 months ago

Viewed 200 times

I am trying to parse XML with BeautifulSoup's html.parser, and I am having trouble with the <link> tag, which is processed differently from other tags:

On the <title> tag, it works as expected:

>>> BeautifulSoup("<title>Somalia’s Electoral Crisis in Extremis</title>", features='html.parser')
<title>Somalia’s Electoral Crisis in Extremis</title>

However when processing the <link> tag, it introduces a slash in the opening tag and drops the closing tag:

>>> BeautifulSoup("<link>https://warontherocks.com/2021/04/somalias-electoral-crisis-in-extremis/</link>", features='html.parser')
<link/>https://warontherocks.com/2021/04/somalias-electoral-crisis-in-extremis/

Why is it doing this?
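
A likely explanation, sketched here rather than stated as a definitive diagnosis: HTML defines `<link>` as a void element, and Beautiful Soup's HTML tree builders keep a set of such tag names. The builder therefore closes `<link>` as soon as it opens, and the URL ends up as the text node that follows the tag. The snippet assumes `bs4` is installed and reuses the URL from the question.

```python
from bs4 import BeautifulSoup
from bs4.builder import HTMLTreeBuilder

URL = "https://warontherocks.com/2021/04/somalias-electoral-crisis-in-extremis/"

# The HTML tree builders list the tags that can never hold content.
is_void = "link" in HTMLTreeBuilder.empty_element_tags

soup = BeautifulSoup(f"<link>{URL}</link>", features="html.parser")

# The <link> tag itself is empty; the URL is parsed as its next sibling.
link_tag = soup.find("link")
recovered = str(link_tag.next_sibling)

print(is_void)
print(recovered)
```

Reading `next_sibling` is only a workaround for input that must go through an HTML parser; it relies on the URL immediately following the tag.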

Now if I use the 'lxml' or 'xml' parser, it works fine:

>>> BeautifulSoup("<link>https://warontherocks.com/2021/04/somalias-electoral-crisis-in-extremis/</link>", features='lxml')
<html><head><link/></head><body><p>https://warontherocks.com/2021/04/somalias-electoral-crisis-in-extremis/</p></body></html>

I am using html.parser because I also encounter nested elements (tags like <something:tag>) and CDATA strings. So parsing CDATA with lxml (which did not work for me) would also be a solution if it is possible.

Am I going to have to write some logic to decide which library to parse each site with, or is there a way to do this with BeautifulSoup as is?
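
One possible way to avoid choosing a parser per site, assuming the input is an RSS-style XML feed and `lxml` is installed (Beautiful Soup's `xml` feature depends on it), is to parse the document as XML. In XML, `<link>` is an ordinary element with content, and CDATA sections come back as plain text. The feed fragment below is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical feed fragment combining both problem cases:
# a CDATA section and a <link> element with text content.
FEED = """<rss><channel><item>
<title><![CDATA[Somalia’s Electoral Crisis in Extremis]]></title>
<link>https://warontherocks.com/2021/04/somalias-electoral-crisis-in-extremis/</link>
</item></channel></rss>"""

# features="xml" selects lxml's XML parser instead of an HTML parser.
soup = BeautifulSoup(FEED, features="xml")
item = soup.find("item")

title = item.find("title").get_text()
link = item.find("link").get_text()

print(title)
print(link)
```

This only helps when the source really is XML; a page that is actually HTML would still need an HTML parser.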

asked Apr 3, 2021 at 1:02

Stonecraft


lxml is faster and more forgiving, so I generally use that. There is an existing post comparing the various parsers, and the documentation covers the differences as well. A <link> element should have no end tag: html.spec.whatwg.org/#the-link-element. I assume lxml is attempting to repair this HTML?

– QHarr, commented Apr 3, 2021 at 1:19

Thanks, but is there a way to handle CDATA in lxml? That is why I am not using it.

– Stonecraft, commented Apr 3, 2021 at 1:19

stackoverflow.com/questions/13694143/…. <![CDATA[ ]]> is an instruction that the enclosed content should not be interpreted as XML. There are probably more answers on how to work with CDATA. I will have a quick look; this is just the one that came to mind. I note you wanted to use lxml.

– QHarr, commented Apr 3, 2021 at 1:21

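To illustrate the point about CDATA in the comment above, here is a small sketch. It uses Python's standard-library `xml.etree.ElementTree` instead of Beautiful Soup or lxml, a substitution made only so the example needs no third-party packages. The sample document is invented.

```python
import xml.etree.ElementTree as ET

# Markup inside a CDATA section is character data, not child elements.
DOC = "<description><![CDATA[<p>Tom & Jerry</p>]]></description>"

elem = ET.fromstring(DOC)
text = elem.text        # the raw CDATA content, returned as plain text
children = list(elem)   # no <p> element was created

print(text)
print(children)
```

An XML parser hands the CDATA content back as ordinary text, so any markup inside it has to be parsed in a second step if it is needed as a tree.
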
Or is there a way to specify custom tags for cases where I encounter ones that are malformed in a specific way?

– Stonecraft, commented Apr 3, 2021 at 1:21

stackoverflow.com/questions/37661822/python-lxml-modify-cdata, stackoverflow.com/questions/25813756/… and possibly some of these: stackoverflow.com/search?q=lxml+cdata

– QHarr, commented Apr 3, 2021 at 1:23
