I am trying to parse HTML with BeautifulSoup to extract the webpage title. Sometimes this fails because the page is malformed (for example, a bad end tag). When that happens, I fall back to a manual regex.
I have the following text:
<html xmlns="http://www.w3.org/1999/xhtml"\n xmlns:og="http://ogp.me/ns#"\n xmlns:fb="https://www.facebook.com/2008/fbml">\n<head>\n <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>\n <title>\n .@wolfblitzercnn prepping questions for the Cheney intvw. @CNNSitRoom today. 5p. \n </title>\n <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />...
I am trying to grab the text between the <title> and </title> tags. It should be fairly simple, but it is not working. Here's my Python code for it:
result = re.search('\<title\>(.+?)\</title\>', html)
if result is not None:
    title = result.group(0)
This does not work on this text, and I can't see why. Either the result comes back as None, or I get an AttributeError: 'NoneType' object has no attribute 'groups'
I've copied and pasted this text into online Python regex testers and tried all the options (re.match, re.findall, re.search). They work there, but in my script nothing between these tags is found. The same happens with other patterns, such as
<title>(.*?)</title>
and so on.
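For reference, here is a minimal sketch that reproduces the behavior. It assumes the html string in my script holds real newline characters around the title text (the sample above is shown in escaped form, so that is an assumption). By default `.` does not match a newline, so the pattern finds nothing. Passing re.DOTALL is one way to change that. The names sample_html and extract_title are made up for illustration.

```python
import re

# Hypothetical sample: assumes the "\n" sequences in the page are real newlines.
sample_html = (
    '<html xmlns="http://www.w3.org/1999/xhtml">\n<head>\n'
    ' <title>\n .@wolfblitzercnn prepping questions for the Cheney intvw. '
    '@CNNSitRoom today. 5p. \n </title>\n</head>\n</html>'
)

def extract_title(html, flags=0):
    """Return the stripped text between <title> and </title>, or None."""
    result = re.search(r'<title>(.+?)</title>', html, flags)
    if result is None:
        return None
    # group(1) is the captured text; group(0) would include the tags themselves.
    return result.group(1).strip()

print(extract_title(sample_html))             # no flag: '.' stops at newlines
print(extract_title(sample_html, re.DOTALL))  # '.' also matches newlines
```

If the online testers were given the text on a single line, or had a "dot matches newline" option switched on, that could explain why the same pattern matched there.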