Python Regex to extract content of src of an html tag?

Question

I tried something like this, but it failed. I don't know regex; can anyone help me with this?

import re

html = """
<body>
<h1>dummy heading</h1>
<img src="/pic/earth.jpg" alt="planet" width="200">
<img src="/pic/redrose.jpg" alt="flower" width="200">
</body>
"""
x = re.search('^src=".*jpg$', html)
print(x)

I'm expecting output like this: ['/pic/earth.jpg', '/pic/redrose.jpg']

A better approach is to use an HTML parser, such as BeautifulSoup. See stackoverflow.com/questions/1732348/…
– Andrej Kesely, commented Jun 4, 2020 at 10:25
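
To illustrate the parser-based approach suggested in the comment, here is a minimal sketch. It swaps in the standard library's html.parser module instead of BeautifulSoup so that it runs without third-party packages; the class name ImgSrcCollector is made up for this example.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Hypothetical helper: collects the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


html = """
<body>
<h1>dummy heading</h1>
<img src="/pic/earth.jpg" alt="planet" width="200">
<img src="/pic/redrose.jpg" alt="flower" width="200">
</body>
"""

collector = ImgSrcCollector()
collector.feed(html)
collector.close()
print(collector.sources)
```

Because the parser handles quoting and attribute order itself, this version does not depend on how the src attribute is written in the markup.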

James McGuigan · Accepted Answer · 2020-06-04 10:43:45Z

Good start, but there are several minor issues with your code:

^ and $ refer to the start and end of the string
- or the start and end of each line with the re.MULTILINE flag enabled
.search() returns None or a Match object rather than the matched strings, and it stops at the first match
you probably want the .findall() method
if you have backslashes in your regex (which you don't yet), then you may want to use raw r"string" strings for your regex code
also think of all the possible variations in your input data, such as HTML allowing both ' and " for quotes, and that there could be a src= attribute in something that is not an image
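
A short sketch of the first points, using the HTML from the question. The simplified pattern src="([^"]+)" is only an assumption for demonstration, not the final regex:

```python
import re

html = """
<body>
<h1>dummy heading</h1>
<img src="/pic/earth.jpg" alt="planet" width="200">
<img src="/pic/redrose.jpg" alt="flower" width="200">
</body>
"""

# The original pattern: ^ only matches at the very start of the string,
# and the string does not start with src=, so there is no match.
original = re.search('^src=".*jpg$', html)
print(original)  # None

# With re.MULTILINE, ^ and $ match at each line boundary, but every
# line here starts with "<", not "src=", so there is still no match.
multiline = re.search('^src=".*jpg$', html, re.MULTILINE)
print(multiline)  # None

# .search() returns a single Match object for the first hit only.
first = re.search(r'src="([^"]+)"', html)
print(first.group(1))  # /pic/earth.jpg

# .findall() returns the captured group of every hit as a list.
all_srcs = re.findall(r'src="([^"]+)"', html)
print(all_srcs)  # ['/pic/earth.jpg', '/pic/redrose.jpg']
```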

Here are the docs: https://docs.python.org/3/library/re.html#re.findall

Try this as a regex:

image_urls = re.findall(r'<img[^<>]+src=["\']([^"\'<>]+\.(?:gif|png|jpe?g))["\']', html, re.I)
print(image_urls)
# ['/pic/earth.jpg', '/pic/redrose.jpg']

To break this down a little:

re.findall() returns a list of strings (here, the contents of the single capture group)
<img we are looking for the start of an image tag
[^<>]+ 1 or more chars that don't open/close the html tag
- this keeps the match inside the current <img>, which might not have a src="" attribute at all
["\'] the HTML could use either type of quote
[^"\'<>]+ keep reading 1+ chars whilst the string and the tag are not closed
\. literal dots need to be escaped, else they mean the "match any character" special char
(?:gif|png|jpe?g) a range of possible file extensions, in a non-capturing group (a capturing group would return these in your list as well)
([^"\'<>]+\.(?:gif|png|jpe?g)) this is the capture group for what will actually get returned for each match
["\'] search for the closing quote that follows the capture group
re.I make the regex case insensitive
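
The breakdown above can also be written into the pattern itself with re.VERBOSE, which ignores whitespace and allows # comments outside character classes. This is a sketch of the same regex; the name IMG_SRC and the sample markup are invented to exercise the edge cases mentioned earlier (single quotes, uppercase, a src= on a non-image tag):

```python
import re

IMG_SRC = re.compile(
    r"""
    <img                      # start of an image tag
    [^<>]+                    # 1+ chars that stay inside this tag
    src=["']                  # the src attribute and its opening quote
    (                         # start of the captured URL
        [^"'<>]+              # 1+ chars before the closing quote or tag end
        \.(?:gif|png|jpe?g)   # a literal dot plus one of the extensions
    )                         # end of the captured URL
    ["']                      # closing quote
    """,
    re.IGNORECASE | re.VERBOSE,
)

# Hypothetical sample input, not from the question.
sample = """
<IMG SRC='/pic/Moon.PNG' alt="moon">
<script src="/js/app.js"></script>
<img alt="sun" src="/pic/sun.jpeg">
"""

print(IMG_SRC.findall(sample))  # ['/pic/Moon.PNG', '/pic/sun.jpeg']
```

The script tag is skipped because the pattern must start at <img, and the uppercase tag matches only because of the case-insensitive flag.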

Shinbeom Choi · Accepted Answer · 2020-06-04 10:35:42Z


I'm not good at regex, so my answer may not be the best.

Try this:

x = re.findall(r'(?=src)src=\"(?P<src>[^\"]+)', html)

Then x will look like this:

['/pic/earth.jpg', '/pic/redrose.jpg']

Regex explanation:

(?=src) : positive lookahead --> only matches at a position where the text src follows (redundant here, since the pattern continues with src= anyway)

src=\" : the literal text src=" must appear

(?P<src>something) : named group --> captures something under the name src

[^\"]+ : one or more of any character except "
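
As a quick check of the pieces explained above, the following sketch compares the pattern with and without the lookahead and reads the named group through re.finditer. The variable names are made up for illustration, and the HTML is a shortened copy of the question's input:

```python
import re

html = """
<img src="/pic/earth.jpg" alt="planet" width="200">
<img src="/pic/redrose.jpg" alt="flower" width="200">
"""

# The pattern from this answer, with the lookahead.
with_lookahead = re.findall(r'(?=src)src="(?P<src>[^"]+)', html)

# The same pattern without the lookahead gives the same result here.
without_lookahead = re.findall(r'src="(?P<src>[^"]+)', html)

# The group name is useful with Match objects, e.g. from re.finditer.
by_name = [m.group("src") for m in re.finditer(r'src="(?P<src>[^"]+)', html)]

print(with_lookahead)     # ['/pic/earth.jpg', '/pic/redrose.jpg']
print(without_lookahead)  # ['/pic/earth.jpg', '/pic/redrose.jpg']
print(by_name)            # ['/pic/earth.jpg', '/pic/redrose.jpg']
```

Note that this pattern matches any src="..." attribute, so it would also return the src of a script or iframe tag if the input contained one.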
