I am trying to scrape specific text from specific table elements on an Amazon product page.

URL_1 has all elements: https://www.amazon.com/dp/B008Q5LXIE/
URL_2 has only 'Sales Rank': https://www.amazon.com/dp/B001V9X26S

URL_1: The "Product Details" table has 9 items, and I am only interested in 'Product Dimensions', 'Shipping Weight', 'Item model number', and all of the 'Best Sellers Rank' entries.

I am not able to parse out the text for these items because some of them sit in a single block of HTML while others do not.

I am using BeautifulSoup. Running text.strip() on the whole table returns everything, but in a very messy form. I have tried soup.find('li') with text.strip() to pick out individual elements, but for the seller rank it returns all the ranks jumbled together in one string. I have also tried a regex to clean the text, but it does not work for the 4 different seller ranks. I have had success with a try/except/pass pattern when scraping, and I would like each of these fields in that format.

Here is a (non-working) example of the code I used. I was trying to get the sales rank text that comes after the </b> element in the HTML:

    # Sales Rank
    sales_rank = 'NOT'
    try:
        sr = soup.find('li', attrs={'id': 'SalesRank'})
        sales_rank = sr.find('/b').text.strip()
    except:
        pass

I expect to be able to scrape the listed elements into a dictionary. I would like to see the results as

dimensions = 6x4x4
weight = 4.8 ounces
Item_No = IT-DER0-IQDU
R1_NO = 2,036
R1_CAT = Health & Household
R2_NO = 5
R2_CAT = Joint & Muscle Pain Relief Medications
R3_NO = 3
R3_CAT = Naproxen Sodium
R4_NO = 6
R4_CAT = Migraine Relief

my_dict = {'dimensions': dimensions, 'weight': weight, 'Item_No': Item_No, 'R1_NO': R1_NO, 'R1_CAT': R1_CAT, 'R2_NO': R2_NO, 'R2_CAT': R2_CAT, 'R3_NO': R3_NO, 'R3_CAT': R3_CAT, 'R4_NO': R4_NO, 'R4_CAT': R4_CAT}

URL_2: The only element of interest on this page is 'Sales Rank'. 'Product Dimensions', 'Shipping Weight' and 'Item model number' are not present. I would still like the same result shape as for URL_1, with a value of 'NA' for every element that is missing. So far I have achieved this by assigning a default before the try/except statement, e.g. setting the shipping weight to 'NA' and then running try/except: pass, so that I get 'NA' and my dictionary is never left without a key.

1 Answer

You could use stripped_strings together with the :contains pseudo-class, which is available with bs4 4.7.1+. It takes a fair amount of jiggery-pokery to get the desired output format, and I'm sure someone with more Python experience could shorten this and make it more efficient. The dict-merging syntax is taken from @aaronhall.
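To show what stripped_strings actually yields, here is a minimal offline sketch. The HTML fragment is made up to resemble a "Product Details" list and may not match Amazon's real markup; the helper name field_value and the 'NA' default are my own assumptions, and it uses the built-in html.parser so that lxml is not needed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on a "Product Details" list;
# the real Amazon markup may differ.
html = """
<ul>
  <li><b>Shipping Weight:</b> 4.8 ounces (<a href="#">View shipping rates</a>)</li>
  <li><b>Item model number:</b> IT-DER0-IQDU</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")


def field_value(soup, label):
    """Return the text that follows the bold label, or 'NA' if the label is absent."""
    for li in soup.find_all("li"):
        # stripped_strings yields each text node with surrounding whitespace removed
        strings = list(li.stripped_strings)
        if len(strings) > 1 and strings[0].startswith(label):
            # Drop the dangling "(" that precedes the "View shipping rates" link
            return strings[1].rstrip("(").strip()
    return "NA"


weight = field_value(soup, "Shipping Weight")
model = field_value(soup, "Item model number")
missing = field_value(soup, "Product Dimensions")
print(weight, model, missing)
```

The first string of each list item is the bold label and the second is the value, which is why the script below indexes the stripped strings with [1].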

import requests
from bs4 import BeautifulSoup as bs
import re

links = ['https://www.amazon.com/Professional-Dental-Guard-Remoldable-Customizable/dp/B07L4YHBQ4', 'https://www.amazon.com/dp/B0040ODFK4/?tag=stackoverfl08-20']

for link in links:

    # Use a forward slash: '\5' is an octal escape and would put a control character in the header
    r = requests.get(link, headers = {'User-Agent': 'Mozilla/5.0'})
    soup = bs(r.content, 'lxml')
    fields = ['Product Dimensions', 'Shipping Weight', 'Item model number', 'Amazon Best Sellers Rank']

    temp_dict = {}

    for field in fields:
        # Newer Soup Sieve releases deprecate :contains in favour of :-soup-contains
        element = soup.select_one('li:contains("' + field + '")')
        if element is None:
            temp_dict[field] = 'N/A'
        else:
            if field == 'Amazon Best Sellers Rank':
                # Reuse the element found above; the raw string avoids an invalid-escape warning for \(
                item = [re.sub(r'#|\(', '', string).strip() for string in element.stripped_strings][1].split(' in ')
                temp_dict[field] = item
            else:
                item = [string for string in element.stripped_strings][1]
                temp_dict[field] = item.replace('(', '').strip()

    ranks = soup.select('.zg_hrsr_rank')
    ladders = soup.select('.zg_hrsr_ladder')

    if ranks:
        cat_nos = [item.text.split('#')[1] for item in ranks]
    else:
        cat_nos = ['N/A']

    if ladders:
        cats = [item.text.split('\xa0')[1] for item in ladders]
    else:
        cats = ['N/A']

    rankings = dict(zip(cat_nos, cats))

    map_dict = {
        'Product Dimensions': 'dimensions',
        'Shipping Weight': 'weight', 
        'Item model number': 'Item_No',
        'Amazon Best Sellers Rank': ['R1_NO','R1_CAT']
    }

    final_dict = {}

    for k,v in temp_dict.items():
        if k == 'Amazon Best Sellers Rank' and v!= 'N/A':
            item = dict(zip(map_dict[k],v))
            final_dict = {**final_dict, **item}
        elif k == 'Amazon Best Sellers Rank' and v == 'N/A':
            item = dict(zip(map_dict[k], [v, v]))
            final_dict = {**final_dict, **item}
        else:
            final_dict[map_dict[k]] = v

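    # Walk the (rank, category) pairs directly: a dict keyed by rank number
    # would silently merge two categories that share the same rank
    for k, (number, category) in enumerate(zip(cat_nos, cats)):
        prefix = 'R' + str(k + 2) + '_'
        final_dict[prefix + 'NO'] = number
        final_dict[prefix + 'CAT'] = category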

    print(final_dict)
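
If you want every key from the question to be present with 'NA' as the default, something along these lines might help. It is only a sketch that works on plain text rather than a live page: the names KEYS, RANK_RE and build_record are hypothetical, the sample rank string is invented from the values in the question, and the regex assumes each rank looks like "#number in category".

```python
import re

# Keys requested in the question; each one starts out as 'NA'
KEYS = ["dimensions", "weight", "Item_No"] + [
    f"R{i}_{suffix}" for i in range(1, 5) for suffix in ("NO", "CAT")
]

# Assumed rank format: "#2,036 in Health & Household"
RANK_RE = re.compile(r"#([\d,]+)\s+in\s+([^#(]+)")


def build_record(rank_text, **fields):
    """Build a dict with all expected keys, filling the gaps with 'NA'."""
    record = dict.fromkeys(KEYS, "NA")
    # Only accept known keys that carry a non-empty value
    record.update({k: v for k, v in fields.items() if k in record and v})
    # Keep at most four ranks, numbered from R1
    for i, (number, category) in enumerate(RANK_RE.findall(rank_text)[:4], start=1):
        record[f"R{i}_NO"] = number
        record[f"R{i}_CAT"] = category.strip()
    return record


# Invented sample text, based on the values listed in the question
sample = (
    "#2,036 in Health & Household (See Top 100 in Health & Household) "
    "#5 in Joint & Muscle Pain Relief Medications "
    "#3 in Naproxen Sodium #6 in Migraine Relief"
)
record = build_record(sample, weight="4.8 ounces")
empty = build_record("")
print(record)
```

Starting from dict.fromkeys(KEYS, "NA") plays the same role as assigning a default before each try/except, but in one place, so a page that only has a sales rank still produces a complete dictionary.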

11 Comments

Oops.... I should have scrolled further down the question as I didn't see that bit! I will have another look.
Can you provide an example url where this is the case and the expected output?
Please try bottom version with a few urls.
Not a problem. Thank you for feeding back :-)
can you provide an example url where this occurs?