I have a text in which only <b> and </b> tags are used, for example <b>abcd efg-123</b>. How can I extract the string between these tags? I also need to extract the 3 words before and after the <b>abcd efg-123</b> chunk.
How can I do that? What would be a suitable regular expression for this?
obligatory: stackoverflow.com/questions/1732348/… – Lie Ryan, Oct 20, 2010 at 13:46
4 Answers
Handles tags inside the <b>, unless they are nested <b> tags, of course.
import re

sometext = 'blah blah 1 2 3<b>abcd efg-123</b>word word2 word3 blah blah'
result = re.findall(
    r'(((?:(?:^|\s)+\w+){3}\s*)'                # Match 3 words before
    r'<b>((?:[^<]|<[^/]|</[^b]|</b[^>])*)</b>'  # Match <b>...</b>; the repeated group lets other tags appear inside
    r'(\s*(?:\w+(?:\s+|$)){3}))', sometext)     # Match 3 words after
# Each match is (whole match, 3 words before, text inside <b>, 3 words after)
assert result == [(' 1 2 3<b>abcd efg-123</b>word word2 word3 ',
                   ' 1 2 3',
                   'abcd efg-123',
                   'word word2 word3 ')]
This should work and perform well, but if your requirements get any more advanced than this, you should consider using an HTML parser.
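For comparison, here is a minimal sketch of the parser route using Python's standard-library html.parser module. The class name BoldContextParser and the helper bold_with_context are made up for this illustration, and the sketch assumes the input contains simple, well-formed tags; it treats the boundary of a <b> element as a word boundary, much like the regex above.

```python
from html.parser import HTMLParser


class BoldContextParser(HTMLParser):
    """Split the text into chunks, flagging those inside <b>...</b>.

    Hypothetical helper class, written only for this illustration.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0    # current <b> nesting depth
        self.chunks = []  # list of [is_bold, text] pairs, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            if self.depth == 0:
                # An outermost <b> starts a new bold chunk.
                self.chunks.append([True, ""])
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "b" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Outside <b>, start a plain chunk unless one is already open.
        if self.depth == 0 and (not self.chunks or self.chunks[-1][0]):
            self.chunks.append([False, ""])
        self.chunks[-1][1] += data


def bold_with_context(text, n=3):
    """Return (words_before, bold_text, words_after) for each <b> element."""
    parser = BoldContextParser()
    parser.feed(text)
    parser.close()  # flush any text still buffered after the last tag
    results = []
    for i, (is_bold, data) in enumerate(parser.chunks):
        if not is_bold:
            continue
        before = " ".join(d for _, d in parser.chunks[:i]).split()[-n:]
        after = " ".join(d for _, d in parser.chunks[i + 1:]).split()[:n]
        results.append((before, data, after))
    return results


sometext = 'blah blah 1 2 3<b>abcd efg-123</b>word word2 word3 blah blah'
print(bold_with_context(sometext))
```

Because the parser tracks nesting depth, tags inside the <b> element (including nested <b> tags) are folded into the same bold chunk instead of breaking the match.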
This is a deliberately simple version: it doesn't allow nested tags, and it requires whitespace between the <b>...</b> chunk and the neighboring words.
re.search(r"(\w+)\s+(\w+)\s+(\w+)\s+<b>([^<]+)</b>\s+(\w+)\s+(\w+)\s+(\w+)", text)
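A short sketch of how the groups from that one-liner could be unpacked. The sample strings and the function name extract are invented for this illustration; the pattern itself is unchanged.

```python
import re

# Same pattern as the one-liner above, compiled once for reuse.
PATTERN = re.compile(
    r"(\w+)\s+(\w+)\s+(\w+)\s+<b>([^<]+)</b>\s+(\w+)\s+(\w+)\s+(\w+)"
)


def extract(text):
    """Return (words_before, bold_text, words_after), or None if no match."""
    match = PATTERN.search(text)
    if match is None:
        return None
    groups = match.groups()
    return groups[:3], groups[3], groups[4:]


# Hypothetical input with whitespace on both sides of the tags.
print(extract("one two three four <b>abcd efg-123</b> five six seven eight"))

# No whitespace around the tags, so this pattern finds nothing.
print(extract("blah blah 1 2 3<b>abcd efg-123</b>word word2 word3"))
```

If the tags may touch the neighboring words, relaxing the two \s+ next to the tags to \s* is one possible adjustment.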
You should not use regexes for HTML parsing. That way madness lies.
The above-linked article actually provides a regex for your problem -- but don't use it.