
I am trying to extract this line from a page:

                                            $ 55 326

I have made this regex to get the numbers:

    player_info['salary'] = re.compile(r'\$ \d{0,3} \d{1,3}')

I get the text with bs4, and it is of type 'unicode':

    for a in soup_ntr.find_all('div', id='playerbox'):
        player_box_text = a.get_text()
        print(type(player_box_text))

I can't seem to get a match. I have also tried regexes like these:

    player_info['salary'] = re.compile(ur'\$ \d{0,3} \d{1,3}')
    player_info['salary'] = re.compile(ur'\$ \d{0,3} \d{1,3}', re.UNICODE)

But I still can't figure out how to get the data. The page I am reading has this header:

    Content-Type: text/html; charset=utf-8

I hope someone can help me figure it out.

2 Answers


re.compile doesn't match anything. It just creates a compiled version of the regex.

You want something like this:

    # re.search scans the whole text; re.match would only look at its very start
    matchObj = re.search(r'\$ (\d{0,3}) (\d{1,3})', player_box_text)
    if matchObj:
        player_info['salary'] = matchObj.group(1) + matchObj.group(2)
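
To make the distinction concrete, here is a minimal sketch. The sample string is hypothetical and merely stands in for the real output of `a.get_text()`. It shows that `re.compile` only builds a pattern object, that `match` looks only at the start of the string, and that `search` scans the whole text.

```python
import re

# Hypothetical sample text standing in for the output of a.get_text()
player_box_text = u'Name: Example Player\nSalary: $ 55 326\n'

# re.compile only builds a reusable pattern object; nothing is matched yet
salary_re = re.compile(r'\$ (\d{0,3}) (\d{1,3})')

# match() only succeeds at the very start of the string, so it returns None here
at_start = salary_re.match(player_box_text)

# search() scans the whole string and finds the salary
found = salary_re.search(player_box_text)
salary = found.group(1) + found.group(2) if found else None

print(at_start)
print(salary)
```

Checking the result for `None` before calling `group` avoids an `AttributeError` on pages where the salary line is missing.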

2 Comments

Sorry about the use of compile; I actually use re.search later on with the compiled version. My trouble is that I can find some of the data, but other data fails because I can't figure out how to get it in the right encoding.
I see your point. Actually, I am using re.search: I first build the expression and then call re.search with it.

This is a good site for getting to grips with regex. http://txt2re.com/

    #!/usr/bin/python
    # URL that generated this code:
    # http://txt2re.com/index-python.php3?s=$%2055%20326&2&1

    import re

    txt = '$ 55 326'
    re1 = '.*?'     # Non-greedy match on filler
    re2 = '(\\d+)'  # Integer Number 1
    re3 = '.*?'     # Non-greedy match on filler
    re4 = '(\\d+)'  # Integer Number 2

    rg = re.compile(re1 + re2 + re3 + re4, re.IGNORECASE | re.DOTALL)
    m = rg.search(txt)
    if m:
        int1 = m.group(1)
        int2 = m.group(2)
        # Parenthesized single-argument print works on both Python 2 and 3
        print("(" + int1 + ")" + "(" + int2 + ")" + "\n")

4 Comments

I have tried the expression, but I think I am failing to handle the utf-8 / unicode. My expression finds the data if I change the whitespace. I don't really know how to get it.
This works fine, but it also catches some other things, like 0$cp in a word like 00$cphCon.
You can make the regex as complex as you need to. If you know the input format of the data is reliable, the regex can be as simple as whatever will always reliably extract your string. Here you know that you only want the $ symbol and numbers to appear in the string. That is more than possible with a little further regex.
I figured it out from your example and got the data I wanted. Good link too, thank you.
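
For later readers, a hedged sketch of one way to tighten the pattern along the lines discussed in these comments. It rests on two assumptions that the thread does not confirm: that the separators on the page may be non-breaking spaces (U+00A0) rather than plain spaces, and that a real salary is never glued to a preceding word character. The names `SALARY_RE` and `find_salary` are made up for illustration.

```python
import re

# Assumption: separators may be any Unicode whitespace (e.g. U+00A0), so use
# \s instead of a literal space. The lookbehind rejects a '$' that directly
# follows a word character, as in '00$cphCon'.
SALARY_RE = re.compile(r'(?<!\w)\$\s*(\d{1,3}(?:\s\d{3})*)(?!\d)', re.UNICODE)


def find_salary(text):
    """Return the first salary found in text as an int, or None if absent."""
    m = SALARY_RE.search(text)
    if m is None:
        return None
    # Drop every whitespace character between the digit groups
    return int(re.sub(r'\s', '', m.group(1), flags=re.UNICODE))


print(find_salary(u'Salary: $\xa055\xa0326'))
print(find_salary(u'00$cphCon'))
```

If the page turns out to use ordinary spaces after all, the same pattern still matches, since `\s` covers both cases.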
