I'm working on a program that scrapes information from an HTML file using several regular expressions. I've run into a problem with the following code.

My HTMLParser subclass:

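import re
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7

class MyHtmlParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)  # the base initializer also calls reset()
        self.title = []
    def handle_data(self, d):
        Result = re.search(r'ANMELDELSE .*(?=</b>)',d)
        if Result:
            self.title.append(Result.group(0))
    def return_data(self):
        return self.title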

Running the code:

parser = MyHtmlParser()
with open(r'....', "r") as f:  # correct path to local test.html
    page = f.read()
parser.feed(page)
print(parser.return_data())

The HTML file is really messy and in Norwegian, but here is a subset that should reproduce the problem:

<p style="margin: 0cm 0cm 0pt;"><span style="text-decoration: underline;">Sak 428/18-123, 03.09.2018 </span></p>
<p style="margin: 0cm 0cm 0pt;"><b>&nbsp;</b></p>
<p style="margin: 0cm 0cm 0pt;"><b>ANMELDELSE FOR TRAKASSERING</b></p>

This should select "ANMELDELSE FOR TRAKASSERING" and it does in both https://regex101.com/ and in https://regexr.com/, but when executing the code, all I get printed is an empty list. The code has worked with previous regex calls, so I'm a bit lost.

Hope someone can help!

  • If I were to use a regex here, I'd use something close to r'ANMELDELSE[^<>]*'. Are you sure the space there is not a non-breaking space? Commented Sep 18, 2018 at 19:43
  • What kind of object is passed in as d when you call handle_data(self, d)? Commented Sep 18, 2018 at 19:48
  • This really helped @WiktorStribiżew! Do you mind clarifying what you mean by non-breaking space? I was very confused when both websites I tried it on gave me the correct answer.. Commented Sep 18, 2018 at 19:53
  • It is a \u00A0 char, very similar to a regular space. Commented Sep 18, 2018 at 19:55
  • That was probably it, it seems to work now! Thanks @WiktorStribiżew! Commented Sep 18, 2018 at 20:06

1 Answer

Provided ANMELDELSE only occurs inside a text node, you can grab it using

r'ANMELDELSE[^<>]*'

Your original pattern contains a literal regular space (\x20), which does not match a non-breaking space (\u00A0, written &nbsp; in HTML). A non-breaking space is often used instead of a regular space to make sure the next word stays on the same line in text editors/viewers.
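If you want to check which kind of space a string actually contains, a minimal sketch like the following may help. The two sample strings and the helper name `space_names` are made up for illustration; they are not taken from your file.

```python
import unicodedata

# Hypothetical samples: they look identical on screen but differ in one code point.
regular = u"ANMELDELSE FOR"        # U+0020 between the words
nbsp = u"ANMELDELSE\u00a0FOR"      # U+00A0 between the words

def space_names(text):
    """Return the Unicode names of all non-alphanumeric characters in text."""
    return [unicodedata.name(ch) for ch in text if not ch.isalnum()]

print(space_names(regular))
print(space_names(nbsp))
print(regular == nbsp)  # the strings are not equal, even though they render alike
```

Printing `repr()` of the string passed to `handle_data` is another quick way to spot a `\xa0` hiding in the data.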

To match it, you could use \s and pass the re.U flag to re.search (the flag is required because you are using Python 2.7, where \s does not match a non-breaking space without it). However, since you want to match up to the end of the tag, just use the negated character class [^<>]*, which matches zero or more characters other than < and >.
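As a rough comparison of the three approaches, here is a small sketch. The sample strings are assumptions standing in for what `handle_data` might receive, that is, the text of the node without the surrounding tags (which is why no lookahead for `</b>` appears here). The helper `first_match` is hypothetical.

```python
import re

# Hypothetical text nodes, one with a regular space and one with a non-breaking space.
samples = [
    u"ANMELDELSE FOR TRAKASSERING",
    u"ANMELDELSE\u00a0FOR TRAKASSERING",
]

literal_space = re.compile(u"ANMELDELSE .*")         # only matches U+0020
unicode_ws = re.compile(u"ANMELDELSE\\s.*", re.U)    # \s with re.U also matches U+00A0
negated_class = re.compile(u"ANMELDELSE[^<>]*")      # any characters except < and >

def first_match(pattern, text):
    """Return the matched text, or None when the pattern does not match."""
    m = pattern.search(text)
    return m.group(0) if m else None

for text in samples:
    print(repr(text))
    print("  literal space:", first_match(literal_space, text) is not None)
    print("  \\s with re.U: ", first_match(unicode_ws, text) is not None)
    print("  negated class:", first_match(negated_class, text) is not None)
```

Only the literal-space pattern fails on the second sample; the other two match both strings in full.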
