1

I am trying to parse HTML with BeautifulSoup to extract the webpage title. Sometimes this fails because the page is badly written (for example, a bad end tag). When that happens, I fall back to a manual regex.

I have the text

<html xmlns="http://www.w3.org/1999/xhtml"\n      xmlns:og="http://ogp.me/ns#"\n      xmlns:fb="https://www.facebook.com/2008/fbml">\n<head>\n    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>\n    <title>\n                    .@wolfblitzercnn prepping questions for the Cheney intvw. @CNNSitRoom today. 5p. \n            </title>\n    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />...

I am trying to grab the text between the <title> and </title> tags. It should be fairly simple, but it is not working. Here is my Python code for it.

result = re.search('\<title\>(.+?)\</title\>', html)
if result is not None:
    title = result.group(0)

This does not work on this text for whatever reason. Either re.search returns None, or I get an AttributeError: 'NoneType' object has no attribute 'groups'

I've copied and pasted this text into online Python regex testers and tried all the options (re.match, re.findall, re.search). They work there, but in my script nothing is found between these tags, even when I try other regexes such as

<title>(.*?)</title>

etc

2 Answers

5

You should use the dotall flag to make the . match newline characters as well.

result = re.search(r'<title>(.+?)</title>', html, re.DOTALL)

As the documentation says:

...without this flag, '.' will match anything except a newline
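As a minimal sketch of the difference, the snippet below runs the same pattern with and without the flag. The html string here is a shortened, made-up stand-in for the question's markup, not the original page. Note also that group(1) returns only the captured text, whereas group(0) returns the whole match including the tags.

```python
import re

# Hypothetical sample mirroring the question's markup: the title spans several lines.
html = "<head>\n    <title>\n        Example page title \n    </title>\n</head>"

# Without re.DOTALL, '.' cannot cross the newline right after <title>, so there is no match.
no_flag = re.search(r"<title>(.+?)</title>", html)

# With re.DOTALL, '.' also matches newline characters.
with_flag = re.search(r"<title>(.+?)</title>", html, re.DOTALL)

# group(1) is the captured text only; strip() removes the surrounding whitespace.
title = with_flag.group(1).strip() if with_flag else None

print(no_flag)  # None
print(title)    # Example page title
```

Checking the result against None before calling group() also avoids the AttributeError mentioned in the question.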


2

If you want to grab the text between the <title> and </title> tags you should use this regexp:

pattern = "<title>([^<]+)</title>"

re.findall(pattern, html_string) 
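A short sketch of why this works without any flag: the negated character class [^<] matches every character except '<', and that includes newlines. The html_string value below is a made-up sample, and this approach assumes the title contains no literal '<' character.

```python
import re

# Hypothetical sample: a title that spans several lines.
html_string = "<head>\n<title>\n   First title \n</title>\n</head>"

# [^<] matches any character except '<', newlines included, so re.DOTALL is not needed.
pattern = r"<title>([^<]+)</title>"

# findall returns the captured groups; strip() trims the surrounding whitespace.
titles = [t.strip() for t in re.findall(pattern, html_string)]

print(titles)  # ['First title']
```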

1 Comment

Why the re.DOTALL flag? You don't even use a ..
