Python regular expression with UTF-8 issue

Question

I have a file containing many lines of plain UTF-8 text, like the one below (it's Chinese, by the way).

PROCESS：类型：关爱积分[NOTIFY]   交易号：2012022900000109   订单号：W12022910079166    交易金额：0.01元    交易状态：true 2012-2-29 10:13:08

The file itself was saved in UTF-8; its name is xx.txt.

Here is my Python code; the environment is Python 2.7.

#coding: utf-8
import re
pattern = re.compile(r'交易金额：(\d+)元')
for line in open('xx.txt'):
    match = pattern.match(line.decode('utf-8'))
    if match:
        print match.group()

The problem is that I get no results.

I want to extract the decimal string from 交易金额：0.01元 (transaction amount: 0.01 yuan), which here is 0.01.

Why doesn't this code work? Can anyone explain it to me? I have no clue whatsoever.

uhz · Accepted Answer · 2012-05-11 06:45:59Z

19

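There are several issues with your code. First, you should compile the pattern from a unicode literal: re.compile(ur'<unicode string>'). It is also good practice to add the re.UNICODE flag, although it is probably not needed here because the digits are plain ASCII. Next, you still will not get a match, because \d+ matches only a run of digits and does not handle the decimal point; use \d+\.?\d+ instead (digits, an optional dot, then more digits). Note also that the example uses .search() rather than .match(), because the amount is not at the start of the line. Example code: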

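#coding: utf-8
import re

# Unicode text and a unicode pattern (ur'' is Python 2 syntax).
text = u"PROCESS：类型：关爱积分[NOTIFY]   交易号：2012022900000109   订单号：W12022910079166    交易金额：0.01元    交易状态：true 2012-2-29 10:13:08"
pattern = re.compile(ur'交易金额：(\d+\.?\d+)元', re.UNICODE)

# search() scans the whole string; group(1) is the captured amount.
print pattern.search(text).group(1)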
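
One caveat with \d+\.?\d+ is that it needs at least two digits, so a single-digit amount would be missed. The sketch below, written for Python 3 (where plain string literals are already unicode), compares it with a variant that makes the fractional part optional. The names relaxed, samples, and extract are made up for illustration and are not part of the answer above.

```python
import re

# Pattern from the answer: needs at least two digits in total.
original = re.compile(r'交易金额：(\d+\.?\d+)元')
# Hypothetical alternative: digits, then an optional ".digits" part.
relaxed = re.compile(r'交易金额：(\d+(?:\.\d+)?)元')

samples = ['交易金额：0.01元', '交易金额：12元', '交易金额：5元']


def extract(pattern, text):
    """Return the captured amount, or None when the pattern does not match."""
    found = pattern.search(text)
    return found.group(1) if found else None


original_results = [extract(original, s) for s in samples]
relaxed_results = [extract(relaxed, s) for s in samples]
print(original_results)  # ['0.01', '12', None]
print(relaxed_results)   # ['0.01', '12', '5']
```

Whether single-digit amounts can occur in the real file is not stated in the question, so treat the relaxed variant as an option rather than a required fix.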


ThiefMaster · Accepted Answer · 2012-05-11 06:27:42Z

5

You need to use .search() since .match() is like starting your regex with ^, i.e. it only checks at the beginning of the string.
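
A minimal Python 3 sketch of that difference, using a shortened version of the sample line from the question (the variable names are illustrative only):

```python
import re

line = 'PROCESS：类型：关爱积分[NOTIFY]   交易金额：0.01元    交易状态：true'
pattern = re.compile(r'交易金额：([\d.]+)元')

at_start = pattern.match(line)   # only tries to match at position 0
anywhere = pattern.search(line)  # scans the whole string for a match

print(at_start)           # None
print(anywhere.group(1))  # 0.01
```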


1 Comment

castiel Over a year ago

Still not working. Can you provide your code to accomplish this little task? Much appreciated.

Cathy Lin · Accepted Answer · 2016-10-31 10:22:06Z

1

If your text is UTF-8, you can match the raw byte strings directly and pass flags=re.LOCALE:

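#coding: utf-8
import re
# Byte-string pattern matched against undecoded lines (Python 2).
pattern = re.compile(r'交易金额：(\d+\.?\d+)元', flags=re.LOCALE)
for line in open('xx.txt'):
    # search() scans the whole line; match() only tests the start.
    match = pattern.search(line)
    if match:
        print match.group(1)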

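For more details, see the documentation for re.LOCALE. There is no need to decode the UTF-8 bytes to unicode: in Python 2 the pattern and each line are both byte strings, so the literal text matches byte for byte. (re.LOCALE itself only changes locale-dependent classes such as \w and \b, so the literal match does not depend on it.)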
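
As a rough illustration of byte-level matching without any flag, here is a Python 3 sketch. Python 3 bytes literals must be ASCII, so the pattern is built by encoding a str; byte_pattern and raw_line are illustrative names, and the sample line is a shortened version of the one in the question.

```python
import re

# Encode the pattern text to get a bytes pattern. UTF-8 multi-byte
# sequences never contain ASCII bytes, so no regex metacharacters appear
# inside the encoded Chinese text.
byte_pattern = re.compile('交易金额：([0-9.]+)元'.encode('utf-8'))

# Stand-in for one undecoded line read from the file in binary mode.
raw_line = '订单号：W12022910079166    交易金额：0.01元'.encode('utf-8')

found = byte_pattern.search(raw_line)
amount = found.group(1).decode('ascii') if found else None
print(amount)  # 0.01
```

This works for literal text and ASCII digits; classes such as \w would not treat the encoded Chinese characters as word characters in a bytes pattern.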


personal_cloud · Accepted Answer · 2024-02-27 04:37:00Z

0

Your code has two small mistakes:

Using a bytes regex on a unicode string.
Missing decimal point in the regex.

When we fix the above mistakes (and use search() instead of match(), since the amount is not at the start of the line), we get the following code.

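#coding: utf-8
import re
# [\.\d]+ accepts digits and the decimal point.
pattern = re.compile(r'交易金额：([\.\d]+)元')
for line in open('xx.txt'):
    # No decoding: pattern and line are the same string type.
    match = pattern.search(line)
    if match:
        print(match.groups())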

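In Python 2, it works because the regex and the lines read from the file are both byte strings. In Python 3, it works because both are unicode strings, provided open() decodes the file as UTF-8, which depends on the platform's default encoding unless you pass encoding='utf-8'. This is a good example of how explicitly decoding the UTF-8 is often unnecessary (yet some languages insist on suggesting it).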
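
For Python 3, one way to avoid depending on the platform default is to pass the encoding explicitly. The sketch below assumes Python 3; amounts_in_file is a hypothetical helper, and a temporary file stands in for xx.txt so the example runs on its own.

```python
import os
import re
import tempfile

pattern = re.compile(r'交易金额：([\d.]+)元')


def amounts_in_file(path):
    """Collect every captured amount, decoding the file as UTF-8."""
    results = []
    with open(path, encoding='utf-8') as handle:
        for line in handle:
            found = pattern.search(line)
            if found:
                results.append(found.group(1))
    return results


# Build a small stand-in for xx.txt so the sketch is self-contained.
sample = ('交易号：2012022900000109   交易金额：0.01元    交易状态：true\n'
          'no amount on this line\n')
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.txt',
                                 delete=False) as tmp:
    tmp.write(sample)
    tmp_path = tmp.name

amounts = amounts_in_file(tmp_path)
os.remove(tmp_path)
print(amounts)  # ['0.01']
```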

edited Feb 27, 2024 at 4:37

answered Feb 27, 2024 at 4:20
