python encoding regex issue

Question

I am trying to extract this line from a page:

                                            $ 55 326

I have made this regex to get the numbers:

    player_info['salary'] = re.compile(r'\$ \d{0,3} \d{1,3}')

I get the text with bs4, and the text is of type 'unicode':

    for a in soup_ntr.find_all('div', id='playerbox'):
        player_box_text = a.get_text()
        print(type(player_box_text))

I can't seem to get a match. I have also tried regexes like these:

    player_info['salary'] = re.compile(ur'\$ \d{0,3} \d{1,3}')
    player_info['salary'] = re.compile(ur'\$ \d{0,3} \d{1,3}', re.UNICODE)

But I still can't work out how to get the data. The page I am reading has this header:

    Content-Type: text/html; charset=utf-8

I hope someone can help me figure it out.

Cfreak · Accepted Answer · 2012-10-05 22:03:31Z

3

re.compile doesn't match anything. It just creates a compiled version of the regex.

You want something like this:

    matchObj = re.search(r'\$ (\d{0,3}) (\d{1,3})', player_box_text)
    if matchObj:  # search() returns None when nothing matches
        player_info['salary'] = matchObj.group(1) + matchObj.group(2)
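To illustrate the point about re.compile, here is a minimal sketch. The sample text is an assumption (the salary line with leading whitespace, as in the question), and the name salary_re is made up for the example. It shows that the compiled pattern does nothing until one of its methods is called, and that match() only tries the start of the string while search() scans all of it.

```python
import re

# Assumed sample: the salary line from the question, with leading whitespace.
sample_text = u'          $ 55 326'

# compile() only builds a reusable pattern object; no matching happens yet.
salary_re = re.compile(r'\$ (\d{0,3}) (\d{1,3})')

# match() is anchored at position 0, so the leading whitespace defeats it.
at_start = salary_re.match(sample_text)

# search() scans the whole string for the first occurrence.
found = salary_re.search(sample_text)
if found:
    salary = found.group(1) + found.group(2)
    print(salary)
```

If the text always starts with the dollar sign, match() would work as well; search() is simply the safer choice when the position is unknown.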

2 Comments

jantzen05 Over a year ago

Sorry about the use of compile; I actually use re.search later on, where I pass it the compiled version. My trouble is that I can find some of the data, while other data fails because I can't figure out how to get it in the right encoding.

jantzen05 Over a year ago

I see your point. Actually, I am using re.search. I first build the expression and then call re.search with it.

Paul Collingwood · Accepted Answer · 2012-10-05 22:02:40Z

1

This is a good site for getting to grips with regex. http://txt2re.com/

    #!/usr/bin/python
    # URL that generated this code:
    # http://txt2re.com/index-python.php3?s=$%2055%20326&2&1

    import re

    txt = '$ 55 326'
    re1 = '.*?'      # Non-greedy match on filler
    re2 = '(\\d+)'   # Integer Number 1
    re3 = '.*?'      # Non-greedy match on filler
    re4 = '(\\d+)'   # Integer Number 2

    rg = re.compile(re1 + re2 + re3 + re4, re.IGNORECASE | re.DOTALL)
    m = rg.search(txt)
    if m:
        int1 = m.group(1)
        int2 = m.group(2)
        # Parenthesized form works under both Python 2 and Python 3
        print("(" + int1 + ")" + "(" + int2 + ")" + "\n")

4 Comments

jantzen05 Over a year ago

I have tried the expression, but I think I am failing to handle the UTF-8 / unicode text correctly. My expression finds the data if I change the whitespace. I don't really know how to get it.

jantzen05 Over a year ago

This works fine, but it also catches some other things like 0$cp in a word like 00$cphCon.

Paul Collingwood Over a year ago

You can make the regex as complex as you need to. If you know the input format of the data is reliable, the regex can be as simple as whatever will always extract your string. So here you know that you only want the $ symbol and numbers to appear in the string. That is more than possible with a little further regex.

jantzen05 Over a year ago

I figured it out from your example and got the data I wanted. Good link too, thank you.
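The thread never shows the final expression, so the following is only a sketch of one way the two problems raised in the comments could be handled. It rests on an assumption the thread does not confirm: that the separators in the page are non-breaking spaces (U+00A0) rather than plain spaces, which would explain why the pattern matched once the whitespace was changed. The sample string and the name salary_re are made up for the example.

```python
import re

# Assumed input: the salary written with U+00A0 (non-breaking space) separators,
# followed by the false positive mentioned in the comments.
page_text = u'Salary:\u00a0$\u00a055\u00a0326 and 00$cphCon'

# A literal ' ' in the pattern would not match U+00A0. \s does, for unicode text
# (re.UNICODE is needed on Python 2 and is the default for str on Python 3).
salary_re = re.compile(r'\$\s(\d{0,3})\s(\d{1,3})', re.UNICODE)

found = salary_re.search(page_text)
if found:
    salary = int(found.group(1) + found.group(2))
    print(salary)

# Requiring whitespace right after the dollar sign rejects '00$cphCon'.
false_positive = salary_re.search(u'00$cphCon')
```

Printing repr(player_box_text) is a quick way to check which whitespace characters the page really contains before settling on a pattern.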
