4

How can I use a regular expression to get the src of an image from the following HTML string in Python?

<td width="80" align="center" valign="top"><font style="font-size:85%;font-family:arial,sans-serif"><a href="http://news.google.com/news/url?sa=t&amp;fd=R&amp;usg=AFQjCNFqz8ZCIf6NjgPPiTd2LIrByKYLWA&amp;url=http://www.news.com.au/business/spain-victory-faces-market-test/story-fn7mjon9-1226390697278"><img src="//nt3.ggpht.com/news/tbn/380jt5xHH6l_FM/6.jpg" alt="" border="1" width="80" height="80" /><br /><font size="-2">NEWS.com.au</font></a></font></td>

I tried to use

matches = re.search('@src="([^"]+)"',text)
print(matches[0])

But I got nothing.

3 Answers

9

Instead of regex, you could consider using BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(junk, 'html.parser')  # junk is the HTML string from the question
>>> soup.findAll('img')
[<img src="//nt3.ggpht.com/news/tbn/380jt5xHH6l_FM/6.jpg" alt="" border="1" width="80" height="80" />]
>>> soup.findAll('img')[0]['src']
u'//nt3.ggpht.com/news/tbn/380jt5xHH6l_FM/6.jpg'
1 Comment

Wouldn't Beautiful Soup add a lot of overhead to the solution? img tags are relatively easy to parse (and since they don't enclose other text, they are usually formatted correctly).

6

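Just lose the @ in the regex and it will work. The @ prefix is XPath attribute syntax; in a regular expression it matches a literal @ character, and there is none before src in your HTML.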
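
A minimal sketch of that fix, assuming the HTML is held in a string named text (shortened here to the part around the img tag). Note that re.search returns None when nothing matches, and that index 0 of a match object is the whole match, so the URL itself is in group 1:

```python
import re

# HTML fragment shortened from the question; only the <img> tag matters here.
text = ('<a href="http://news.google.com/news/url?sa=t">'
        '<img src="//nt3.ggpht.com/news/tbn/380jt5xHH6l_FM/6.jpg" '
        'alt="" border="1" width="80" height="80" /></a>')

# Same pattern as in the question, minus the leading "@".
match = re.search(r'src="([^"]+)"', text)

# re.search returns None on failure, so guard before reading groups.
src = match.group(1) if match else None  # group(1) is the captured URL only
print(src)
```

With the original '@src="([^"]+)"' pattern, re.search finds no match on this input, which is why nothing useful came back.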

-1

You could simplify your re a little:

match = re.search(r'src="(.*?)"', text)

1 Comment

It gets JavaScript files too, since script tags also carry a src attribute.
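
One way to avoid picking up script sources is to anchor the pattern to the img tag. The sketch below is only an illustration: the html sample and the IMG_SRC name are made up for this example, and the pattern assumes double-quoted attribute values and well-formed tags. For arbitrary real-world HTML a parser is still the safer choice.

```python
import re

# Hypothetical sample containing both a script tag and an img tag.
html = ('<script src="/static/app.js"></script>'
        '<img src="//nt3.ggpht.com/news/tbn/380jt5xHH6l_FM/6.jpg" alt="" />')

# Only match a src attribute that sits inside an <img ...> tag.
IMG_SRC = re.compile(r'<img\b[^>]*?\bsrc="([^"]+)"', re.IGNORECASE)

img_srcs = IMG_SRC.findall(html)             # img sources only
all_srcs = re.findall(r'src="(.*?)"', html)  # unanchored: scripts included

print(img_srcs)
print(all_srcs)
```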
