Finding a string in a text using regular expressions with Python

Question

I have a text in which only the <b> and </b> tags have been used, for example <b>abcd efg-123</b>. How can I extract the string between these tags? I also need to extract the 3 words before and after the abcd efg-123 chunk. How can I do that? What would be a suitable regular expression for this?

obligatory: stackoverflow.com/questions/1732348/…

– Lie Ryan, Commented Oct 20, 2010 at 13:46

ghostdog74 · Accepted Answer · 2010-10-20 13:49:04Z

3

This will get what's in between the tags:

>>> s = "1 2 3<b>abcd efg-123</b>one two three"
>>> for i in s.split("</b>"):      # each piece ends where a </b> was
...     if "<b>" in i:
...         print(i.split("<b>")[-1])  # keep only the text after <b>
...
abcd efg-123
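
The same string-method idea can be stretched to cover the surrounding words too. This is only a minimal sketch, assuming a single <b>...</b> pair and whitespace-separated words; the helper name context_around_bold is hypothetical and not part of the answer above.

```python
def context_around_bold(s, n=3):
    """Return (words_before, tagged_text, words_after) for the first <b>...</b> pair."""
    before, open_tag, rest = s.partition("<b>")
    inside, close_tag, after = rest.partition("</b>")
    if not open_tag or not close_tag:
        return None  # no complete <b>...</b> pair found
    # Keep the last n words before the tag and the first n words after it.
    return before.split()[-n:], inside, after.split()[:n]


print(context_around_bold("blah 1 2 3<b>abcd efg-123</b>one two three blah"))
# (['1', '2', '3'], 'abcd efg-123', ['one', 'two', 'three'])
```

If fewer than n words are available on either side, the slices simply return whatever is there.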

answered Oct 20, 2010 at 13:49

ghostdog74


driax · Accepted Answer · 2010-10-20 14:17:17Z

1

Handles tags inside the <b>, unless they are <b> tags themselves, of course.

import re
sometext = 'blah blah 1 2 3<b>abcd efg-123</b>word word2 word3 blah blah'
result = re.findall(
      r'(((?:(?:^|\s)+\w+){3}\s*)'                # Match 3 words before
      r'<b>((?:[^<]|<[^/]|</[^b]|</b[^>])*)</b>'  # Match <b>...</b>; the group repeats so other tags may appear inside
      r'(\s*(?:\w+(?:\s+|$)){3}))', sometext)     # Match 3 words after

result == [(' 1 2 3<b>abcd efg-123</b>word word2 word3 ',
    ' 1 2 3',
    'abcd efg-123',
    'word word2 word3 ')]

This should work and perform well, but if it gets any more advanced than this, you should consider using an HTML parser.

edited Oct 20, 2010 at 14:17

answered Oct 20, 2010 at 14:10

driax


2 Comments

Hossein Over a year ago

This doesn't work if there are no words before or after, or fewer than 3 words, right?

driax Over a year ago

@Hossein That's correct. However, it is a simple change: change {3} to {,3}.
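
In Python's re module, omitting the lower bound in {,3} means {0,3}, so the word groups become optional. The following is only a sketch of that change, assuming plain <b>...</b> markup with no nested tags; the names PATTERN and find_with_context are hypothetical, and the inner group is simplified to [^<]* to keep the example short.

```python
import re

PATTERN = re.compile(
    r'((?:(?:^|\s)+\w+){0,3}\s*)'   # up to 3 words before the tag
    r'<b>([^<]*)</b>'               # text between the tags (no nested tags)
    r'(\s*(?:\w+(?:\s+|$)){0,3})'   # up to 3 words after the tag
)


def find_with_context(text):
    """Return a list of (before, tagged_text, after) tuples."""
    return PATTERN.findall(text)


print(find_with_context('2 3<b>abcd efg-123</b>word'))
# [('2 3', 'abcd efg-123', 'word')]
```

With no surrounding words at all, the before and after groups come back as empty strings.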

eric_arthur_blair · Accepted Answer · 2010-10-20 14:02:40Z

1

This is actually a very dumb version and doesn't allow nested tags.

re.search(r"(\w+)\s+(\w+)\s+(\w+)\s+<b>([^<]+)</b>\s+(\w+)\s+(\w+)\s+(\w+)", text)

See Python documentation.
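
Note that the pattern above requires whitespace on both sides of the tags, so it would not match input such as "3<b>abcd". As a rough sketch, assuming the tags may touch the neighbouring words, the \s+ next to the tags can be relaxed to \s* and the pieces read from the match groups; the sample text and variable names here are illustrative only.

```python
import re

text = "blah 1 2 3<b>abcd efg-123</b>one two three blah"

# Same shape as the pattern above, but with \s* next to the tags (an assumption
# made here so that "3<b>" and "</b>one" match without intervening whitespace).
match = re.search(
    r"(\w+)\s+(\w+)\s+(\w+)\s*<b>([^<]+)</b>\s*(\w+)\s+(\w+)\s+(\w+)", text)

if match:
    before = match.groups()[:3]   # the three words before the tag
    tagged = match.group(4)       # the text between <b> and </b>
    after = match.groups()[4:]    # the three words after the tag
    print(before, tagged, after)
    # ('1', '2', '3') abcd efg-123 ('one', 'two', 'three')
```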

edited Oct 20, 2010 at 14:02

answered Oct 20, 2010 at 13:50

eric_arthur_blair


Comments

Cœur · Accepted Answer · 2018-10-21 10:41:36Z

0

You should not use regexes for HTML parsing. That way madness lies.

The above-linked article actually provides a regex for your problem -- but don't use it.
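
For comparison, here is a minimal sketch of the parser route using html.parser from the Python standard library. It assumes the goal is just the text inside <b> elements; the class and function names (BoldTextCollector, bold_texts) are hypothetical, and collecting the surrounding words is left out.

```python
from html.parser import HTMLParser


class BoldTextCollector(HTMLParser):
    """Collect the text found inside <b>...</b> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # how many <b> elements are currently open
        self.chunks = []   # one string per outermost <b> element

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            if self.depth == 0:
                self.chunks.append("")
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "b" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text inside other tags nested in <b> is included as well.
        if self.depth > 0:
            self.chunks[-1] += data


def bold_texts(html):
    collector = BoldTextCollector()
    collector.feed(html)
    collector.close()
    return collector.chunks


print(bold_texts("1 2 3<b>abcd <i>efg</i>-123</b>one two three"))
# ['abcd efg-123']
```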

edited Oct 21, 2018 at 10:41

Cœur


answered Oct 20, 2010 at 13:48

Joshua Fox

