I'm having trouble with the re.search() method. I am trying to extract an image link from the following string:

<div class="beitragstext">\n\t\t\t\t<p>Es gibt derzeit keine Gründe mehr NICHT auf 1.1.3 zu springen!</p>\n<p><a href="http://www.flickr.com/photos/factoryjoe/372948722/"><img src="https://www.iphoneblog.de/wp-content/uploads/2008/02/372948722-6ec4028a80.jpg" alt="372948722_6ec4028a80.jpg" border="0" width="430" height="466" /></a></p>\n<p>Photo: <a href="http://www.flickr.com/photos/factoryjoe">factoryjoe</a>

I want to extract the URL of the first image, and only the URL.

This is my code: imageURLObject = re.search(r'http(?!.*http).*?\.(jpg|png|JPG|PNG)', match)

The result should be https://www.iphoneblog.de/wp-content/uploads/2008/02/372948722-6ec4028a80.jpg

Instead, the method returns None. But if I use the regex re.search(r'http.*?\.(jpg|png|JPG|PNG)', match), without the (?!.*http) lookahead, the match starts at the first http and runs until .(jpg|png|JPG|PNG), so this is returned:

http://www.flickr.com/photos/factoryjoe/372948722/"><img src="https://www.iphoneblog.de/wp-content/uploads/2008/02/372948722-6ec4028a80.jpg

Can someone help me, please? :-)

  • Will the image link always come after 'src='? Commented Apr 28, 2020 at 14:53
  • Yes, it does; I didn't notice that before. I added it to my regex and now it works. Thank you!! Commented Apr 28, 2020 at 15:02
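
The fix mentioned in the comments can be sketched roughly as follows. The asker's final regex was not posted, so the pattern and the helper name first_image_url below are assumptions, and the sample string is a shortened, single-line version of the one in the question. The sketch also shows why the original pattern probably returned None: assuming there are no real line breaks between the links, (?!.*http) rejects every http that is followed by another http, so only the last link qualifies, and that one has no image extension after it.

```python
import re

# Shortened, single-line version of the string from the question.
html = (
    '<p><a href="http://www.flickr.com/photos/factoryjoe/372948722/">'
    '<img src="https://www.iphoneblog.de/wp-content/uploads/2008/02/'
    '372948722-6ec4028a80.jpg" alt="372948722_6ec4028a80.jpg" /></a></p>'
    '<p>Photo: <a href="http://www.flickr.com/photos/factoryjoe">factoryjoe</a>'
)

# Original pattern: only the last "http" passes the lookahead, and nothing
# after it ends in .jpg or .png, so there is no match.
original = re.search(r'http(?!.*http).*?\.(jpg|png|JPG|PNG)', html)

# Assumed fix: anchor on src=" and stop at the closing quote, so the match
# cannot start at an earlier href or run across attribute boundaries.
IMG_SRC = re.compile(r'src="(https?://[^"]+?\.(?:jpg|png))"', re.IGNORECASE)


def first_image_url(text):
    """Return the first image URL found in a src attribute, or None."""
    m = IMG_SRC.search(text)
    return m.group(1) if m else None


print(original)               # None
print(first_image_url(html))  # the iphoneblog.de image URL
```

This works for markup shaped like the sample; attributes written with single quotes or without quotes would need a different pattern.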

1 Answer

Use Beautiful Soup for HTML parsing:

https://beautiful-soup-4.readthedocs.io/en/latest/

from bs4 import BeautifulSoup

html = """
<div class="beitragstext">\n\t\t\t\t<p>Es gibt derzeit keine Gründe mehr NICHT auf 1.1.3 zu springen!</p>\n<p><a href="http://www.flickr.com/photos/factoryjoe/372948722/"><img src="https://www.iphoneblog.de/wp-content/uploads/2008/02/372948722-6ec4028a80.jpg" alt="372948722_6ec4028a80.jpg" border="0" width="430" height="466" /></a></p>\n<p>Photo: <a href="http://www.flickr.com/photos/factoryjoe">factoryjoe</a>
"""

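# Parse the markup; the 'lxml' parser requires the lxml package to be installed.
soup = BeautifulSoup(html, 'lxml')
# Collect every <div class="beitragstext"> container.
divs = soup.find_all('div', {'class': 'beitragstext'})

for div in divs:
    # find() returns the first <img> in the div, or None if there is none.
    img = div.find('img')
    if img is not None:
        print(img['src'])

# Output:
# https://www.iphoneblog.de/wp-content/uploads/2008/02/372948722-6ec4028a80.jpg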
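
If installing Beautiful Soup is not an option, a similar result can be sketched with the standard library's html.parser module. This is an alternative to the approach above, not part of it. The class name FirstImageFinder and the function first_image_src are illustrative, and the sample markup is shortened from the question.

```python
from html.parser import HTMLParser


class FirstImageFinder(HTMLParser):
    """Remember the src of the first <img> tag that has one."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img" and self.src is None:
            self.src = dict(attrs).get("src")


def first_image_src(html):
    """Return the src of the first <img> in the markup, or None."""
    finder = FirstImageFinder()
    finder.feed(html)
    return finder.src


sample = (
    '<div class="beitragstext"><p><a href="http://www.flickr.com/photos/factoryjoe/372948722/">'
    '<img src="https://www.iphoneblog.de/wp-content/uploads/2008/02/372948722-6ec4028a80.jpg" '
    'alt="372948722_6ec4028a80.jpg" border="0" width="430" height="466" /></a></p>'
)
print(first_image_src(sample))
```

Unlike the Beautiful Soup version, this sketch does not restrict the search to a particular div; it simply takes the first image in the document.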
1 Comment

You can also use requests to grab the HTML directly from a URL. Read the docs; it's very straightforward!
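
A minimal sketch of that idea, assuming the requests and beautifulsoup4 packages are installed. The function names and the default div class are illustrative, and any URL passed in is up to the caller; nothing here comes from the original post beyond the class name beitragstext.

```python
import requests
from bs4 import BeautifulSoup


def first_image_src(html, div_class="beitragstext"):
    """Return the src of the first <img> inside the first matching div, or None."""
    # 'html.parser' ships with Python; 'lxml' can be used instead if installed.
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", class_=div_class)
    img = div.find("img") if div else None
    return img.get("src") if img else None


def first_image_src_from_url(url):
    """Download a page and extract the first image URL from it."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    return first_image_src(response.text)
```

Calling first_image_src_from_url with the address of a post page would return the image URL, or None if the page has no matching div or image.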
