
I have a file containing many lines of plain UTF-8 text, such as the one below (the text is Chinese, by the way).

PROCESS:类型:关爱积分[NOTIFY]   交易号:2012022900000109   订单号:W12022910079166    交易金额:0.01元    交易状态:true 2012-2-29 10:13:08

The file itself is saved as UTF-8, and its name is xx.txt.

Here is my Python code; the environment is Python 2.7.

#coding: utf-8
import re
pattern = re.compile(r'交易金额:(\d+)元')
for line in open('xx.txt'):
    match = pattern.match(line.decode('utf-8'))
    if match:
        print match.group()

The problem is that I get no results.

I want to extract the decimal string from 交易金额:0.01元, which here is 0.01.

Why doesn't this code work? Can anyone explain it to me? I have no idea what is wrong.

4 Answers

19

There are several issues with your code. First, in Python 2 the pattern should be a unicode literal: re.compile(ur'<unicode string>'). You can also add the re.UNICODE flag, although it is not required here because the digits are ASCII. Second, even then you will not get a match, because \d+ only matches a run of digits and does not handle the decimal point; use \d+\.?\d+ instead (digits, an optional dot, then more digits). Note that this needs at least two digits, so a single-digit amount such as 5 would need \d+(?:\.\d+)? instead. Example code:

#coding: utf-8
import re

text = u"PROCESS:类型:关爱积分[NOTIFY]   交易号:2012022900000109   订单号:W12022910079166    交易金额:0.01元    交易状态:true 2012-2-29 10:13:08"
pattern = re.compile(ur'交易金额:(\d+\.?\d+)元', re.UNICODE)

# search() scans the whole string; group(1) is the captured amount
print pattern.search(text).group(1)
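To make the difference between the candidate patterns concrete, here is a small sketch that runs three of them against a few amounts. It assumes Python 3 (where string literals are unicode by default), and the amounts 12 and 5 are made-up samples, not values from the original file; the helper name extract is also just for illustration.

```python
import re

# Three candidate patterns for the amount field.
candidates = {
    "digits only": r"交易金额:(\d+)元",
    "dot optional": r"交易金额:(\d+\.?\d+)元",
    "optional fraction": r"交易金额:(\d+(?:\.\d+)?)元",
}

# The first sample is from the question; the other two are hypothetical.
samples = ["交易金额:0.01元", "交易金额:12元", "交易金额:5元"]


def extract(pattern, text):
    """Return the captured amount, or None if the pattern does not match."""
    found = re.search(pattern, text)
    return found.group(1) if found else None


results = {
    name: [extract(pattern, sample) for sample in samples]
    for name, pattern in candidates.items()
}

for name, found in results.items():
    print(name, found)
```

In this sketch only the "optional fraction" pattern captures all three samples: "digits only" misses 0.01, and "dot optional" misses the single-digit 5.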

5

You need to use .search() since .match() is like starting your regex with ^, i.e. it only checks at the beginning of the string.
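A minimal sketch of that difference, using the sample line from the question. It assumes Python 3, so the plain string literals are already unicode; the variable names anchored and scanned are only for illustration.

```python
import re

line = "PROCESS:类型:关爱积分[NOTIFY]   交易号:2012022900000109   订单号:W12022910079166    交易金额:0.01元    交易状态:true 2012-2-29 10:13:08"
pattern = re.compile(r"交易金额:([\d.]+)元")

# match() only tries position 0, where the line starts with "PROCESS".
anchored = pattern.match(line)

# search() scans forward through the line until the pattern fits.
scanned = pattern.search(line)

print(anchored)          # None
print(scanned.group(1))  # 0.01
```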

1 Comment

Still not working. Could you provide your code for accomplishing this little task? Much appreciated.
1

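If you use UTF-8 throughout, you can match the byte strings directly; there is no need to convert UTF-8 to unicode. You can pass flags=re.LOCALE, although it only affects locale-dependent classes such as \w and \b, so it is not strictly required for this pattern. Use search() rather than match(), since the amount is not at the start of the line.

#coding: utf-8
import re
pattern = re.compile(r'交易金额:(\d+\.?\d+)元', flags=re.LOCALE)
for line in open('xx.txt'):
    match = pattern.search(line)
    if match:
        print match.group(1)

For more details, see the documentation for re.LOCALE.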

Comments

0

Your code has three small mistakes:

  1. Using a bytes regex on a unicode string.
  2. Missing decimal point in the regex.
  3. Using match(), which only matches at the start of the line, instead of search().

With these fixed, we get the following code.

#coding: utf-8
import re
pattern = re.compile(r'交易金额:([\.\d]+)元')
for line in open('xx.txt'):
    match = pattern.search(line)
    if match:
        print (match.groups())

In Python 2, it works because the regex and the file contents are both byte strings. In Python 3, it works because both are unicode strings, provided open() decodes the file as UTF-8 (the default encoding is platform-dependent unless you pass encoding='utf-8'). This is a good example of how explicitly decoding the UTF-8 is often unnecessary for simple matching like this.
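If you would rather not depend on the platform's default encoding, the following is one possible sketch for Python 3 that passes the encoding explicitly. The function names extract_amounts and amounts_in_file are hypothetical, chosen only for this example.

```python
import io
import re

AMOUNT = re.compile(r"交易金额:([\d.]+)元")


def extract_amounts(lines):
    """Yield the amount string from each line that contains one."""
    for line in lines:
        found = AMOUNT.search(line)
        if found:
            yield found.group(1)


def amounts_in_file(path):
    """Read a UTF-8 text file and return every amount found in it."""
    # Explicit encoding, so the result does not depend on the platform default.
    with io.open(path, encoding="utf-8") as handle:
        return list(extract_amounts(handle))
```

Calling amounts_in_file('xx.txt') on the file from the question should then return the captured amounts as a list of strings. Splitting out extract_amounts keeps the matching logic usable on any iterable of lines, not only on files.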
