Custom HTMLParser with regex not returning correctly

Question

I'm working on a program that scrapes information from an HTML file using several regular expressions. I've run into a problem with the following code.

My HTMLParser subclass:

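import re
from HTMLParser import HTMLParser  # Python 2.7; on Python 3: from html.parser import HTMLParser

class MyHtmlParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)  # base initializer also calls reset()
        self.title = []
    def handle_data(self, d):
        # d is the text content found between tags
        result = re.search(r'ANMELDELSE .*(?=</b>)', d)
        if result:
            self.title.append(result.group(0))
    def return_data(self):
        return self.title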

Running the code:

parser = MyHtmlParser()
with open(r'....', "r") as f:  # correct path to local test.html
    page = f.read()
parser.feed(page)
print(parser.return_data())

The HTML file is really messy and in Norwegian, but here is a subset that should trigger the problem:

<p style="margin: 0cm 0cm 0pt;"><span style="text-decoration: underline;">Sak 428/18-123, 03.09.2018 </span></p>
<p style="margin: 0cm 0cm 0pt;"><b>&nbsp;</b></p>
<p style="margin: 0cm 0cm 0pt;"><b>ANMELDELSE FOR TRAKASSERING</b></p>

This should select "ANMELDELSE FOR TRAKASSERING" and it does in both https://regex101.com/ and in https://regexr.com/, but when executing the code, all I get printed is an empty list. The code has worked with previous regex calls, so I'm a bit lost.

Hope someone can help!

If I were to use a regex here, I'd use something close to r'ANMELDELSE[^<>]*'. Are you sure the space there is not a non-breaking space?
– Wiktor Stribiżew, Commented Sep 18, 2018 at 19:43
What kind of object is passed in as d when you call handle_data(self, d)?
– Ruzihm, Commented Sep 18, 2018 at 19:48
This really helped @WiktorStribiżew! Do you mind clarifying what you mean by non-breaking space? I was very confused when both websites I tried it on gave me the correct answer.
– BenMyr, Commented Sep 18, 2018 at 19:53
That was probably it, it seems to work now! Thanks @WiktorStribiżew!
– BenMyr, Commented Sep 18, 2018 at 20:06

Wiktor Stribiżew · Accepted Answer · 2018-09-18 20:14:44Z

Provided ANMELDELSE only appears inside a text node, you can grab it with:

r'ANMELDELSE[^<>]*'
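
A minimal sketch of a parser using this pattern, assuming Python 3 (where the module is html.parser rather than 2.7's HTMLParser). The class name TitleParser and the inline sample are made up for illustration. Since handle_data only receives the text between tags, never the tags themselves, the pattern does not need to look for </b>:

```python
import re
from html.parser import HTMLParser  # Python 3 name; Python 2.7 uses the HTMLParser module


class TitleParser(HTMLParser):  # hypothetical name, not from the question
    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_data(self, data):
        # data is plain text between tags, so no "<" or ">" can appear in it
        match = re.search(r"ANMELDELSE[^<>]*", data)
        if match:
            self.titles.append(match.group(0))


# Two lines taken from the HTML subset in the question
sample = (
    '<p style="margin: 0cm 0cm 0pt;"><b>&nbsp;</b></p>\n'
    '<p style="margin: 0cm 0cm 0pt;"><b>ANMELDELSE FOR TRAKASSERING</b></p>\n'
)

parser = TitleParser()
parser.feed(sample)
parser.close()
print(parser.titles)
```

On the real file the result depends on what the text nodes actually contain, so treat this as a starting point rather than a drop-in fix.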

Your original pattern contains a literal regular space (\x20). In HTML, a non-breaking space (\xa0) is often used in place of a regular space to keep the next word on the same line in text editors/viewers, and a literal space in the pattern does not match it.

To match it, you could use \s and pass the re.U flag (it is required as you are using Python 2.7) to your re.search call, but since you want to match up to the end of the tag, just use a negated character class, [^<>]*, which matches zero or more characters other than < and >.
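
To illustrate the difference, here is a small check, assuming Python 3, where str patterns are Unicode-aware by default so \s matches a non-breaking space without re.U. The sample string is hypothetical and uses \xa0 right after ANMELDELSE:

```python
import re

# Hypothetical text node: the first separator is a non-breaking space (U+00A0)
text = "ANMELDELSE\xa0FOR TRAKASSERING"

# A literal space in the pattern does not match U+00A0
literal_space = re.search(r"ANMELDELSE .*", text)

# \s matches U+00A0 (Python 2.7 needs a unicode string and re.U for this)
whitespace_class = re.search(r"ANMELDELSE\s.*", text)

# The negated class accepts anything except "<" and ">"
negated_class = re.search(r"ANMELDELSE[^<>]*", text)

print(literal_space)  # None
print(whitespace_class.group(0) == text)
print(negated_class.group(0) == text)
```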
