I'm working on a program that scrapes some information from an HTML file using several regular expressions. I've run into a problem with the following code.
My HTMLParser subclass:
class MyHtmlParser(HTMLParser):
    def __init__(self):
        self.reset()
        self.title = []

    def handle_data(self, d):
        Result = re.search(r'ANMELDELSE .*(?=</b>)', d)
        if Result:
            self.title.append(Result.group(0))

    def return_data(self):
        return self.title
Running the code:
parser = MyHtmlParser()
with open(r'....', "r") as f:  # correct path to local test.html
    page = f.read()
parser.feed(page)
parser.return_data()
Now, the HTML file is really messy and in Norwegian, but here is a subset that should trigger this:
<p style="margin: 0cm 0cm 0pt;"><span style="text-decoration: underline;">Sak 428/18-123, 03.09.2018 </span></p>
<p style="margin: 0cm 0cm 0pt;"><b> </b></p>
<p style="margin: 0cm 0cm 0pt;"><b>ANMELDELSE FOR TRAKASSERING</b></p>
This should select "ANMELDELSE FOR TRAKASSERING" and it does in both https://regex101.com/ and in https://regexr.com/, but when executing the code, all I get printed is an empty list. The code has worked with previous regex calls, so I'm a bit lost.
Hope someone can help!
Try r'ANMELDELSE[^<>]*'. Are you sure the space there is not a non-breaking space?
What is d when you call handle_data(self, d)?
A non-breaking space is the \u00A0 char, very similar to a regular space.
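To make the two suggestions above concrete, here is a minimal sketch, assuming Python 3 and the standard-library html.parser module. The class name DataLogger and its chunks and titles attributes are made up for illustration and are not part of the original code. It records what handle_data actually receives for the sample line: HTMLParser passes only the text between tags, so the closing </b> is never part of d and the lookahead (?=</b>) has nothing to match. The last lines also show how a non-breaking space would defeat a literal space in the pattern.

```python
import re
from html.parser import HTMLParser


class DataLogger(HTMLParser):
    """Hypothetical helper: records every string passed to handle_data."""

    def __init__(self):
        super().__init__()  # let HTMLParser set up its own state
        self.chunks = []    # raw text chunks seen by handle_data
        self.titles = []    # matches found with the tag-free pattern

    def handle_data(self, d):
        self.chunks.append(d)
        # No lookahead: d never contains "</b>", only the text inside the tag.
        match = re.search(r'ANMELDELSE[^<>]*', d)
        if match:
            self.titles.append(match.group(0))


sample = '<p style="margin: 0cm 0cm 0pt;"><b>ANMELDELSE FOR TRAKASSERING</b></p>'
parser = DataLogger()
parser.feed(sample)
parser.close()

# The original pattern, applied to the same chunks handle_data received.
lookahead_hits = [c for c in parser.chunks
                  if re.search(r'ANMELDELSE .*(?=</b>)', c)]

print(parser.chunks)    # text only, no tags
print(lookahead_hits)   # empty: "</b>" is never in the data
print(parser.titles)

# Non-breaking space check: a literal space does not match \u00A0,
# while \s does for str patterns in Python 3.
nbsp_text = 'ANMELDELSE\u00a0FOR TRAKASSERING'
literal_space = re.search(r'ANMELDELSE .*', nbsp_text)
any_whitespace = re.search(r'ANMELDELSE\s.*', nbsp_text)
```

The online testers match because they run the pattern against the raw HTML string, where </b> is present; inside handle_data the tags have already been stripped off.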