I have a text in which only <b> and </b> tags are used, for example <b>abcd efg-123</b>. How can I extract the string between these tags? I also need to extract the 3 words before and after the <b>abcd efg-123</b> chunk. How can I do that? What would be a suitable regular expression for this?

4 Answers

3

This will get what's between the tags:

>>> s="1 2 3<b>abcd efg-123</b>one two three"
>>> for i in s.split("</b>"):
...   if "<b>" in i:
...      print(i.split("<b>")[-1])
...
abcd efg-123
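
As a hedged extension of the split approach above, plain string methods can also collect up to three words on either side of each tag. This is only a sketch: the helper name `bold_with_context` and its `n` parameter are made up for illustration, and it assumes the <b> tags are never nested.

```python
def bold_with_context(s, n=3):
    """Return (words_before, bold_text, words_after) for each <b>...</b> in s.

    Hypothetical helper; assumes <b> tags are not nested.
    """
    results = []
    pos = 0
    while True:
        start = s.find("<b>", pos)
        if start == -1:
            break
        end = s.find("</b>", start)
        if end == -1:
            break
        bold = s[start + len("<b>"):end]
        words_before = s[:start].split()
        words_after = s[end + len("</b>"):].split()
        # Keep at most n words on each side of the tag.
        before = words_before[max(len(words_before) - n, 0):]
        after = words_after[:n]
        results.append((before, bold, after))
        pos = end + len("</b>")
    return results


s = "1 2 3<b>abcd efg-123</b>one two three"
print(bold_with_context(s))
# [(['1', '2', '3'], 'abcd efg-123', ['one', 'two', 'three'])]
```

If the text contains several <b> chunks close together, the surrounding words may still contain raw tags from the neighbouring chunks, so this is only suitable for simple input.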

1

Handles tags inside the <b>, unless they are themselves <b> tags, of course.

import re
sometext = 'blah blah 1 2 3<b>abcd efg-123</b>word word2 word3 blah blah'
result = re.findall(
      r'(((?:(?:^|\s)+\w+){3}\s*)'                  # Match 3 words before
      r'<b>((?:[^<]|<[^/]|</[^b]|</b[^>])*)</b>'    # Match <b>...</b>, allowing other tags inside
      r'(\s*(?:\w+(?:\s+|$)){3}))', sometext)       # Match 3 words after

result == [(' 1 2 3<b>abcd efg-123</b>word word2 word3 ',
    ' 1 2 3',
    'abcd efg-123',
    'word word2 word3 ')]

This should work and perform well, but if your needs get any more advanced than this, you should consider using an HTML parser.

2 Comments

This doesn't work if there are no words before or after, or fewer than 3 words, right?
@Hossein That's correct, but it is a simple change: replace {3} with {,3}.
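
To illustrate the change suggested in the comment above, here is a sketch with both quantifiers relaxed to `{0,3}` (which Python's `re` treats the same as `{,3}`), so that fewer than three surrounding words still match. The function name `find_bold` and the sample strings are made up, and the tag part is reduced to `[^<]*` for brevity, so nested tags are not handled here.

```python
import re

PATTERN = re.compile(
    r'((?:(?:^|\s)+\w+){0,3}\s*)'   # up to 3 words before
    r'<b>([^<]*)</b>'               # text between the tags (no nested tags)
    r'(\s*(?:\w+(?:\s+|$)){0,3})'   # up to 3 words after
)


def find_bold(text):
    """Return (before, bold, after) tuples with surrounding whitespace stripped."""
    return [
        (m.group(1).strip(), m.group(2), m.group(3).strip())
        for m in PATTERN.finditer(text)
    ]


print(find_bold('one<b>abcd efg-123</b>two three'))
# [('one', 'abcd efg-123', 'two three')]
print(find_bold('<b>x</b>'))
# [('', 'x', '')]
```

Note that with several <b> chunks in one string, the words consumed as "after" context of one match are not available as "before" context of the next.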
1

This is actually a very dumb version and doesn't allow nested tags.

re.search(r"(\w+)\s+(\w+)\s+(\w+)\s+<b>([^<]+)</b>\s+(\w+)\s+(\w+)\s+(\w+)", text)

See Python documentation.
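
As a usage sketch for the one-liner above, the seven groups can be unpacked as follows. The sample `text` and the `PATTERN` name are made up for illustration. Note that this pattern requires whitespace on both sides of the tags, so it does not match the question's own example.

```python
import re

# Same pattern as above; it requires whitespace on both sides of the tags.
PATTERN = r"(\w+)\s+(\w+)\s+(\w+)\s+<b>([^<]+)</b>\s+(\w+)\s+(\w+)\s+(\w+)"

text = "blah 1 2 3 <b>abcd efg-123</b> one two three blah"  # made-up sample
m = re.search(PATTERN, text)
if m:
    groups = m.groups()
    before, bold, after = groups[:3], groups[3], groups[4:]
    print(before, bold, after)
    # ('1', '2', '3') abcd efg-123 ('one', 'two', 'three')

# The question's example has no space around the tags, so nothing matches.
no_match = re.search(PATTERN, "1 2 3<b>abcd efg-123</b>one two three")
print(no_match)  # None
```

Since `re.search` returns `None` when there is no match, the result should always be checked before calling `.groups()`.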

0

You should not use regexes for HTML parsing. That way madness lies.

The above-linked article actually provides a regex for your problem -- but don't use it.
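
For comparison, here is a minimal sketch of a parser-based approach using the standard library's `html.parser` module. The class name `BoldCollector`, the function `bold_chunks_with_context`, and the `n` parameter are made up for illustration. It treats tags as word boundaries and keeps the text of other tags nested inside <b>.

```python
from html.parser import HTMLParser


class BoldCollector(HTMLParser):
    """Split the input into [is_bold, text] chunks."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of <b> tags currently open
        self.chunks = []    # list of [is_bold, text] pairs

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            if self.depth == 0:
                self.chunks.append([True, ""])  # start a new bold chunk
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "b" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks[-1][1] += data          # text inside <b>...</b>
        else:
            self.chunks.append([False, data])   # text outside any <b>


def bold_chunks_with_context(text, n=3):
    """Return (words_before, bold_text, words_after) for each bold chunk."""
    parser = BoldCollector()
    parser.feed(text)
    parser.close()
    chunks = parser.chunks
    results = []
    for i, (is_bold, data) in enumerate(chunks):
        if not is_bold:
            continue
        before = " ".join(t for _, t in chunks[:i]).split()
        after = " ".join(t for _, t in chunks[i + 1:]).split()
        results.append((before[max(len(before) - n, 0):], data, after[:n]))
    return results


print(bold_chunks_with_context("1 2 3<b>abcd efg-123</b>one two three"))
# [(['1', '2', '3'], 'abcd efg-123', ['one', 'two', 'three'])]
```

Unlike the regex versions, the context words here never contain raw tags, because the parser only passes text to `handle_data`.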
