Python Regex can't find substring but it should

Question

I am trying to parse HTML using BeautifulSoup to extract the webpage title. Sometimes this fails because the page is badly written (for example, a bad end tag). When that happens I fall back to a manual regex.

I have the text

<html xmlns="http://www.w3.org/1999/xhtml"\n      xmlns:og="http://ogp.me/ns#"\n      xmlns:fb="https://www.facebook.com/2008/fbml">\n<head>\n    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>\n    <title>\n                    .@wolfblitzercnn prepping questions for the Cheney intvw. @CNNSitRoom today. 5p. \n            </title>\n    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />...

I am trying to grab the text between the <title> and </title> tags. It should be fairly simple, but it is not working. Here's my Python code for it.

result = re.search('\<title\>(.+?)\</title\>', html)
if result is not None:
    title = result.group(0)

This does not work on this text for whatever reason. Either re.search returns None, or, when I use the result without checking it, I get AttributeError: 'NoneType' object has no attribute 'groups'

I've copied and pasted this text into online Python regex testers and tried all the options (re.match, re.findall, re.search). They work there, but in my script nothing is found between these tags. The same happens with other regexes, such as

<title>(.*?)</title>

etc

Junuxx · Accepted Answer · 2012-06-22 22:28:27Z

5

You should use the re.DOTALL flag so that . matches newline characters as well.

result = re.search(r'<title>(.+?)</title>', html, re.DOTALL)

As the documentation says:

...without this flag, '.' will match anything except a newline
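
A minimal sketch of the difference, using a shortened stand-in for the page in the question (the html string below is an illustrative assumption, not the full document). It also reads the title from group(1), since group(0) is the whole match including the tags.

```python
import re

# Shortened stand-in for the page in the question (illustrative only).
html = (
    '<head>\n'
    '    <title>\n'
    '        .@wolfblitzercnn prepping questions for the Cheney intvw.\n'
    '    </title>\n'
    '</head>'
)

pattern = r'<title>(.+?)</title>'

# Without re.DOTALL, '.' cannot match the newline right after <title>,
# so the search finds nothing and returns None.
no_flag = re.search(pattern, html)

# With re.DOTALL, '.' also matches newlines and the title is found.
with_flag = re.search(pattern, html, re.DOTALL)

# group(1) is the captured text; strip() drops the surrounding whitespace.
title = with_flag.group(1).strip() if with_flag else None

print(no_flag)
print(title)
```

Checking the match object before calling group() also avoids the AttributeError mentioned in the question.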


user278064 · Accepted Answer · 2012-06-22 22:37:08Z

2

If you want to grab the text between the <title> and </title> tags you should use this regexp:

pattern = "<title>([^<]+)</title>"

re.findall(pattern, html_string)
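
As a rough illustration of why this works without any flag: the class [^<] matches every character except '<', newlines included. The sample strings below are assumptions made up for the sketch. One caveat is that the pattern finds nothing when the title itself contains a '<'.

```python
import re

pattern = r'<title>([^<]+)</title>'

# Hypothetical sample with the title spread over several lines.
html_string = '<title>\n    Example page title\n</title>'

# [^<] matches newlines too, so no re.DOTALL is needed here.
titles = [t.strip() for t in re.findall(pattern, html_string)]

# A title that contains markup defeats the negated class.
nested = re.findall(pattern, '<title>a <b>b</b></title>')

print(titles)
print(nested)
```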

edited Jun 22, 2012 at 22:37

answered Jun 22, 2012 at 22:28


1 Comment

ohaal Over a year ago

Why the re.DOTALL flag? You don't even use a '.'.
