I have this Python code to extract the image src attributes from an HTML page:

listingid=[img['src'] for img in soup.select('[src]')]

Now I would like to extract the values from the following output and store them in a dictionary:

Is there an approach I can take to achieve this?

I'm wondering whether Python has any syntax to take the 14 characters before a specific suffix (like .jpg).

  • s[-18:-4]? Python slices can take negative indices to mean start from the end. Commented Sep 21, 2022 at 2:59

2 Answers

1

You can use negative indices in Python slices to count from the end. Since the question asks for the 14 characters before a 4-character suffix, a simple s[-18:-4] would do.

With your code:

listingid = [img['src'] for img in soup.select('[src]')]
listingid = [s[-18:-4] for s in listingid]

or, in one statement:

listingid = [img['src'][-18:-4] for img in soup.select('[src]')]
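
The fixed slice assumes the suffix is always exactly 4 characters long (such as .jpg). As a minimal sketch of a variant that does not depend on the suffix length, the extension can be stripped with os.path.splitext before slicing. The helper name last_chars_before_ext and the hard-coded srcs list are assumptions made for illustration; the sample values are the src strings used in the example elsewhere in this thread.

```python
import os.path

# Sample src values from the example markup in this thread.
srcs = [
    "img/katalog/honda-crv-4x2-2.0-at-2001_30082022103745.jpg",
    "img/katalog/mitsubishi-xpander-1.5-exceed-manual-2018_08072022134628.jpg",
]


def last_chars_before_ext(src, n=14):
    """Return the last n characters before the file extension.

    Hypothetical helper: splitext drops the extension whatever its
    length, so .jpg and .jpeg are handled the same way.
    """
    stem, _ext = os.path.splitext(src)
    return stem[-n:]


listingid = [last_chars_before_ext(s) for s in srcs]
print(listingid)  # ['30082022103745', '08072022134628']
```

For a 4-character suffix this gives the same result as s[-18:-4].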

1

If the number of characters is always exactly the same, slicing is a convenient shorthand. If it differs, I would recommend using split() on a pattern:

[i.get('src').split('_')[-1].split('.')[0] for i in soup.select('[src]')]

or using regex:

import re
[re.search(r'.*?([0-9]+)\.[a-zA-Z]+$', i.get('src')).group(1) for i in soup.select('[src]')]

Example

from bs4 import BeautifulSoup

html = '''
<img src="img/katalog/honda-crv-4x2-2.0-at-2001_30082022103745.jpg">
<img src="img/katalog/mitsubishi-xpander-1.5-exceed-manual-2018_08072022134628.jpg">
'''
soup = BeautifulSoup(html, 'html.parser')

[i.get('src').split('_')[-1].split('.')[0] for i in soup.select('[src]')]

Output

['30082022103745', '08072022134628']
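
The question also asks about storing the values in a dictionary, but does not say what the keys and values should be. The following is only a sketch under the assumption that each extracted ID should map to the src it came from. The names extract_id, ID_PATTERN and listings are hypothetical, and the src strings are hard-coded here instead of being read from soup so that the snippet runs on its own.

```python
import re

# Sample src values from the example above.
srcs = [
    "img/katalog/honda-crv-4x2-2.0-at-2001_30082022103745.jpg",
    "img/katalog/mitsubishi-xpander-1.5-exceed-manual-2018_08072022134628.jpg",
]

# Run of digits immediately before the file extension.
ID_PATTERN = re.compile(r"([0-9]+)\.[a-zA-Z]+$")


def extract_id(src):
    """Return the digits before the extension, or None if there are none."""
    match = ID_PATTERN.search(src)
    return match.group(1) if match else None


# Assumed layout: key = extracted ID, value = original src.
listings = {}
for src in srcs + ["img/logo.png"]:  # the last entry has no ID and is skipped
    listing_id = extract_id(src)
    if listing_id is not None:
        listings[listing_id] = src

print(listings)
```

Checking for None before using the match avoids the AttributeError that calling .group(1) directly would raise on a src without trailing digits.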
