I tried something like this, but it failed. I don't know regex; can anyone help me with this?

import re

html = """
<body>
<h1>dummy heading</h1>
<img src="/pic/earth.jpg" alt="planet" width="200">
<img src="/pic/redrose.jpg" alt="flower" width="200">
</body>
"""
x = re.search('^src=".*jpg$', html)
print(x)

I'm expecting output like this ['/pic/earth.jpg','/pic/redrose.jpg']

  • Better approach is to use HTML parser, such as BeautifulSoup. See stackoverflow.com/questions/1732348/… Commented Jun 4, 2020 at 10:25
  • Yeah, I know but I want this using Regex! Commented Jun 4, 2020 at 10:29

2 Answers

Good first start, but you have several minor issues with your code:

  • ^ and $ refer to the start and end of the string
    • or to the start and end of each line, with the re.MULTILINE flag enabled
  • .search() returns None or a Match object (for the first match only) rather than the matched strings
  • you probably want the .findall() method
  • if you have backslashes in your regex (which you don't yet), then you may want to use raw r"string" strings for your regex code
  • also think of all the possible permutations of what could be in your input data, such as HTML allowing both ' and " for quotes, and that there could be a src= attribute in something that is not an image
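
As a rough illustration of the first few points, here is a minimal sketch that reuses the html string from the question. The simplified pattern src="([^"]+)" is only an assumption for the demo and is not the final regex:

```python
import re

html = """
<body>
<h1>dummy heading</h1>
<img src="/pic/earth.jpg" alt="planet" width="200">
<img src="/pic/redrose.jpg" alt="flower" width="200">
</body>
"""

# The anchored pattern from the question never matches, because the
# string does not start with 'src=' (it starts with a newline).
anchored = re.search('^src=".*jpg$', html)
print(anchored)  # None

# Without the anchors, search() returns a Match object for the first hit only.
first = re.search(r'src="([^"]+)"', html)
print(first.group(1))  # /pic/earth.jpg

# findall() returns every match; with one capture group it returns that group.
all_srcs = re.findall(r'src="([^"]+)"', html)
print(all_srcs)  # ['/pic/earth.jpg', '/pic/redrose.jpg']
```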

Here are the docs: https://docs.python.org/3/library/re.html#re.findall

Try this as a regex:

image_urls = re.findall(r'<img[^<>]+src=["\']([^"\'<>]+\.(?:gif|png|jpe?g))["\']', html, re.I)
print(image_urls)
# prints: ['/pic/earth.jpg', '/pic/redrose.jpg']

To break this down a little:

  • re.findall() returns a list of strings
  • <img the match has to start at an image tag
  • [^<>]+ 1 or more chars that don't open/close the html tag
    • this keeps the match inside the current <img>, which might not have a src attribute at all
  • ["\'] the HTML could use either type of quote
  • [^"\'<>]+ keep reading 1+ chars whilst the string and the tag are not closed
  • \. literal dots need to be escaped, else they mean the "match anything" special char
  • (?:gif|png|jpe?g) a range of possible file extensions, but don't create a capture bracket for them (which would return these in your array)
  • ([^"\'<>]+\.(?:gif|png|jpe?g)) this is the capture bracket for what will actually get returned for each match
  • ["\'] search for the closing quote to end the capture bracket
  • re.I make the regex case insensitive
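
To see how those pieces behave on less tidy input, here is a small sketch. The sample markup is hypothetical and made up for this illustration; the pattern is the one above, only precompiled under the assumed name IMG_SRC:

```python
import re

# Same pattern as in the answer, compiled once for reuse.
IMG_SRC = re.compile(
    r'<img[^<>]+src=["\']([^"\'<>]+\.(?:gif|png|jpe?g))["\']', re.I
)

# Hypothetical input: mixed quotes, mixed case, a non-image src,
# an <img> without src, and an <img> pointing at a non-image file.
sample = """
<script src="/js/app.js"></script>
<IMG alt='logo' SRC='/pic/Logo.PNG'>
<img class="hero" src="/pic/moon.jpeg" width="200">
<img alt="no source here">
<img src="/pic/notes.txt">
"""

image_urls = IMG_SRC.findall(sample)
print(image_urls)  # ['/pic/Logo.PNG', '/pic/moon.jpeg']
```

Note that [^<>]+ before src= would also accept an attribute whose name merely ends in src (data-src, for example), so treat this as a sketch rather than a complete HTML-aware solution.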

I'm not good at regex, so my answer may not be the best.

Try this.

x = re.findall(r'(?=src)src=\"(?P<src>[^\"]+)', html)

Then x looks like this:

['/pic/earth.jpg', '/pic/redrose.jpg']

RegEx explanation :

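(?=src) : positive lookahead --> only matches at positions where the word src follows (redundant here, because src=\" comes right after it)

src=\" : matches the literal text src="

(?P<src>something) : a named group; the text matched by something is captured under the name src

[^\"]+ : one or more characters, none of which is "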
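
A short sketch of the named group in use, again with the html string from the question. The variable names here are assumptions for illustration. It also checks that dropping the lookahead gives the same result on this input:

```python
import re

html = """
<body>
<h1>dummy heading</h1>
<img src="/pic/earth.jpg" alt="planet" width="200">
<img src="/pic/redrose.jpg" alt="flower" width="200">
</body>
"""

# The pattern from this answer, and the same pattern without the lookahead.
with_lookahead = re.findall(r'(?=src)src=\"(?P<src>[^\"]+)', html)
without_lookahead = re.findall(r'src="(?P<src>[^"]+)', html)
print(with_lookahead == without_lookahead)  # True

# With finditer(), the named group can be read by name from each Match.
named = [m.group("src") for m in re.finditer(r'src="(?P<src>[^"]+)', html)]
print(named)  # ['/pic/earth.jpg', '/pic/redrose.jpg']
```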
