Input String

<msgCode>1111</msgCode>asdasdad<errorId>2222</errorId>

What I want

(1111,2222)

If I use findall, this is what I get:

>>> import re
>>> print(re.findall(r"<(msgCode|errorId)>([0-9]+)</(msgCode|errorId)>", "<msgCode>1111</msgCode>asdasdad<errorId>2222</errorId>"))
[('msgCode', '1111', 'msgCode'), ('errorId', '2222', 'errorId')]

What I hope for is

[('1111','2222')]

Is there an easy way to do it using re instead of post-processing the output?

  • You should really parse XML with an XML parser. Commented Jan 31, 2014 at 3:06
  • Yes, let's all pontificate using the same thread over and over again, even though the OP might be certain that his XML/HTML will never contain tags nested within themselves. Commented Jan 31, 2014 at 3:25

2 Answers

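Consider using XPath instead. Note that lxml's HTML parser lowercases tag names, which is why the expression tests for msgcode and errorid: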

>>> from lxml import html
>>> root = html.fromstring('<msgCode>1111</msgCode>asdasdad<errorId>2222</errorId>')
>>> root.xpath('//*[self::msgcode or self::errorid]/text()')
['1111', '2222']
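
If installing lxml is not an option, roughly the same idea can be sketched with the standard library's html.parser (Python 3). This swaps XPath for a small event-driven parser; the names TagTextCollector and extract_values are made up for illustration and are not part of the answer above. Like lxml's HTML parser, HTMLParser reports tag names in lowercase, so the wanted names are lowercased before comparison.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text found directly inside a set of wanted tags."""

    def __init__(self, wanted):
        super().__init__()
        # HTMLParser lowercases tag names, so compare in lowercase.
        self.wanted = {name.lower() for name in wanted}
        self.current = None   # wanted tag we are currently inside, if any
        self.values = []      # collected text, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        # Text outside the wanted tags (e.g. "asdasdad") is ignored.
        if self.current is not None:
            self.values.append(data)


def extract_values(markup, wanted=("msgCode", "errorId")):
    collector = TagTextCollector(wanted)
    collector.feed(markup)
    collector.close()
    return collector.values


print(extract_values("<msgCode>1111</msgCode>asdasdad<errorId>2222</errorId>"))
```

This sketch assumes the wanted tags are not nested inside one another; for anything more involved, lxml with XPath is likely the more comfortable tool.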

1 Comment

This is the reason why I post SO questions, even when I have a crude workaround such as post-processing the regex matches. :)

Use a non-capturing group for the tag names, (?:msgCode|errorId), so that findall returns only the digits:

>>> import re
>>> subject = "<msgCode>1111</msgCode>asdasdad<errorId>2222</errorId>"
>>> result = re.findall(r"<(?:msgCode|errorId)>([0-9]+)</(?:msgCode|errorId)>", subject)
>>> print(result)
['1111', '2222']
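
One caveat with the non-capturing pattern is that it also accepts a mismatched closing tag, such as <msgCode>1</errorId>. A possible variant, sketched here under the assumption that opening and closing tags must agree, keeps the tag name in a capturing group, refers back to it with \1, and uses finditer to pick out only the digits. The names TAG_VALUE and extract_codes are illustrative.

```python
import re

# Group 1 is the tag name, group 2 the digits; \1 forces the closing
# tag to repeat whatever name the opening tag matched.
TAG_VALUE = re.compile(r"<(msgCode|errorId)>([0-9]+)</\1>")


def extract_codes(text):
    """Return the numeric values of all well-matched tags as a tuple."""
    return tuple(match.group(2) for match in TAG_VALUE.finditer(text))


subject = "<msgCode>1111</msgCode>asdasdad<errorId>2222</errorId>"
print(extract_codes(subject))
```

Because finditer yields match objects rather than tuples of groups, the extra capturing group does not leak into the result, and the values come back as a single tuple, which is close to the shape asked for in the question.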

1 Comment

In my case the input happened to be HTML, so I accepted the other answer. But thank you for your answer, Vasil. :)
