Python regex: re.search() does not find string

Question

I am having trouble with re.search(). I am trying to extract an image link from the following string:

<div class="beitragstext">\n\t\t\t\tEs gibt derzeit keine Gründe mehr NICHT auf 1.1.3 zu springen!\n<a href="http://www.flickr.com/photos/factoryjoe/372948722/"><img src="https://www.iphoneblog.de/wp-content/uploads/2008/02/372948722-6ec4028a80.jpg" alt="372948722_6ec4028a80.jpg" border="0" width="430" height="466" /></a>\nPhoto: <a href="http://www.flickr.com/photos/factoryjoe">factoryjoe</a>

I want to extract the URL of the first image, and only the URL.

This is my code: imageURLObject = re.search(r'http(?!.*http).*?\.(jpg|png|JPG|PNG)', match)

The result should be https://www.iphoneblog.de/wp-content/uploads/2008/02/372948722-6ec4028a80.jpg

Instead, re.search() returns None. If I use re.search(r'http.*?\.(jpg|png|JPG|PNG)', match) instead, without the (?!.*http) lookahead, the match starts at the first http and runs up to the first .(jpg|png|JPG|PNG), so this is what I get back:

http://www.flickr.com/photos/factoryjoe/372948722/"><img src="https://www.iphoneblog.de/wp-content/uploads/2008/02/372948722-6ec4028a80.jpg

Can someone help me, please? :-)

Yes it does, I didn't notice before. I added it to my regex and now it works. Thank you!!
– anonymousStudent, Commented Apr 28, 2020 at 15:02

johnashu · Accepted Answer · 2020-04-28 14:56:59Z

Use Beautiful Soup for HTML parsing instead of a regex:

https://beautiful-soup-4.readthedocs.io/en/latest/

from bs4 import BeautifulSoup

html = """
<div class="beitragstext">\n\t\t\t\t<p>Es gibt derzeit keine Gründe mehr NICHT auf 1.1.3 zu springen!</p>\n<p><a href="http://www.flickr.com/photos/factoryjoe/372948722/"><img src="https://www.iphoneblog.de/wp-content/uploads/2008/02/372948722-6ec4028a80.jpg" alt="372948722_6ec4028a80.jpg" border="0" width="430" height="466" /></a></p>\n<p>Photo: <a href="http://www.flickr.com/photos/factoryjoe">factoryjoe</a>
"""

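# 'lxml' requires the lxml package; the built-in 'html.parser' also works here
soup = BeautifulSoup(html, 'lxml')
divs = soup.find_all('div', {'class': 'beitragstext'})

for div in divs:
    # print the src attribute of the first <img> inside each matching div
    img = div.find('img')
    if img is not None:
        print(img['src'])

# Output:
# https://www.iphoneblog.de/wp-content/uploads/2008/02/372948722-6ec4028a80.jpg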
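
If you would rather stay with a regex, here is a minimal sketch of why the original pattern returns None and one possible alternative. It assumes the input sits on a single line, as in the question, and that URLs contain no quotes, whitespace, or angle brackets. The sample string is abbreviated, and the names IMAGE_URL, original, and fixed are only illustrative.

```python
import re

# Abbreviated copy of the string from the question, on one line (assumption).
html = (
    '<div class="beitragstext">Es gibt derzeit keine Gründe mehr ...'
    '<a href="http://www.flickr.com/photos/factoryjoe/372948722/">'
    '<img src="https://www.iphoneblog.de/wp-content/uploads/2008/02/372948722-6ec4028a80.jpg" '
    'alt="372948722_6ec4028a80.jpg" border="0" width="430" height="466" /></a>'
    'Photo: <a href="http://www.flickr.com/photos/factoryjoe">factoryjoe</a>'
)

# Original pattern: (?!.*http) rejects every "http" that has another "http"
# somewhere after it on the same line. Only the last "http" (the second flickr
# link) passes, and no ".jpg"/".png" follows it, so the search fails.
original = re.search(r'http(?!.*http).*?\.(jpg|png|JPG|PNG)', html)
print(original)

# Alternative: keep the match inside a single attribute value by excluding
# quotes, whitespace and angle brackets from the URL body.
IMAGE_URL = re.compile(r'''https?://[^\s"'<>]+?\.(?:jpg|png)''', re.IGNORECASE)
fixed = IMAGE_URL.search(html)
print(fixed.group(0) if fixed else None)
```

The character class stops the first flickr link at its closing quote, so the search moves on to the img src. This still treats HTML as plain text, so a parser remains the more robust choice.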

You can also use requests to fetch the HTML directly from a URL. Read the docs, it is very straightforward!
– johnashu, Commented over a year ago
