I am working on a large-scale web scraping project where the HTML structure differs from page to page. I want to scrape the product description from each page, and I am using the BeautifulSoup package.

For example, the product description that I am trying to scrape appears in HTML structures like these:

<div class="product-description">
  <p> "Title" </p>
  <p> "Some content" </p>
  <p> "Product description" </p>
</div>


<div class="product-description">
  <p> "Title" </p>
  <p> "Product description" </p>
</div>

<div class="product-description">
  <p> "Title" </p>
  <p> "Some content" </p>
  <p> "Some content" </p>
  <p> "Product description" </p>
</div>


<div class="product-description">
  <p> "Title" </p>
  <p> "Some-content" </p>
  <p> "Some-content" </p>
  <p> "Some-content" </p>
  <p> "Product description" </p>
</div>

I have written a for loop that gets the data from the div with class "product-description", depending on the page structure. My sample code snippet:

import grequests
from bs4 import BeautifulSoup

requests = (grequests.get(url) for url in urls)
responses = grequests.imap(requests, grequests.Pool(1000))

for response in responses:
    html_soup = BeautifulSoup(response.text, 'html.parser')

    if html_soup.find('div', class_='product_description').next_element.next_sibling.next_sibling.next_sibling.next_sibling:
        product_description = html_soup.find(
            'div', class_='product_description').next_element.next_sibling.next_sibling.next_sibling.next_sibling.text

    elif html_soup.find('div', class_='product-description').next_element.next_sibling.next_sibling.next_sibling:
        product_description = html_soup.find(
            'div', class_='product_description').next_element.next_sibling.next_sibling.next_sibling.text

    elif html_soup.find('div', class_='product-description').next_element.next_sibling.next_sibling:
        product_description = html_soup.find(
            'div', class_='product_description').next_element.next_sibling.next_sibling.text

    else:
        product_description = html_soup.find(
            'div', class_='product_description').next_element.next_sibling.text

I expected each if condition to check whether there are siblings at the current level of the HTML and, if not, to fall through to the next condition. However, after 3000 iterations, I get an AttributeError saying 'NoneType' object has no attribute 'next_sibling'. Screenshot attached below:

[Screenshot of the AttributeError traceback]

I know there must be some other easier way to handle this dynamic page structure. Any help would be much appreciated. Thanks in advance!

  • Add all to an array, pop the first element (Title), then last element (Descr), and what's left is the content. Commented Apr 17, 2020 at 3:29
  • [i.text for i in soup.select('.product-description p:last-child')] if there is always a product description and it is the last p. Will write up as answer if applies but need confirmation on assumptions. Commented Apr 17, 2020 at 5:40
  • Does my answer help you? Commented Apr 18, 2020 at 6:22
  • Thanks @Joshua Varghese for your comments and help. All these suggestions work, but I just realised that the product description is not always in the last <p> on every web page. So, I am thinking of removing all HTML tags within the <div> and using regex to extract the required data. I will post a new question if any help is required. Once again, thank you! Commented Apr 19, 2020 at 2:27
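
The first two suggestions in the comments above could be sketched roughly as follows. This is only an illustration under the assumptions stated in those comments (the first <p> is the title and the last <p> is the description). The helper name `split_description` and the sample markup are made up for the example and are not part of the original post.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the structures in the question (illustrative only).
SAMPLE_HTML = """
<div class="product-description">
  <p> "Title" </p>
  <p> "Some content" </p>
  <p> "Product description" </p>
</div>
<div class="product-description">
  <p> "Title" </p>
  <p> "Product description" </p>
</div>
"""


def split_description(div):
    """Hypothetical helper: split a div's <p> texts into (title, content, description).

    Assumes the first <p> is the title and the last <p> is the description.
    """
    texts = [p.get_text(strip=True) for p in div.find_all('p')]
    if len(texts) < 2:
        return None  # not enough paragraphs to apply the assumption
    title = texts.pop(0)       # first element is the title
    description = texts.pop()  # last element is the description
    return title, texts, description  # whatever is left is the content


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Suggestion 1: pop the title and the description, keep the rest as content.
parts = [split_description(div) for div in soup.select('div.product-description')]

# Suggestion 2: CSS selector for the last <p> in each matching div.
last_paragraphs = [p.get_text(strip=True)
                   for p in soup.select('.product-description p:last-child')]
```

Both variants depend on the description being the last <p>, which the final comment reports is not true for every page.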

1 Answer

Try this:

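for i in soup.find_all('div', class_="product-description"):
    try:
        # The last <p> inside the div is assumed to hold the description.
        print(i.find_all('p')[-1].text)
    except IndexError:
        # The div contains no <p> tags; skip it.
        pass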

Here, soup is parsed from the following HTML:

<div class="product-description">
  <p> "Title" </p>
  <p> "Some content" </p>
  <p> "Product description" </p>
</div>


<div class="product-description">
  <p> "Title" </p>
  <p> "Product description" </p>
</div>

<div class="product-description">
  <p> "Title" </p>
  <p> "Some content" </p>
  <p> "Some content" </p>
  <p> "Product description" </p>
</div>


<div class="product-description">
  <p> "Title" </p>
  <p> "Some-content" </p>
  <p> "Some-content" </p>
  <p> "Some-content" </p>
  <p> "Product description" </p>
</div>
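
Some context on the error in the question: `find` returns `None` when no matching div exists (the question's snippet mixes the class names `product_description` and `product-description`), and a chain such as `.next_sibling.next_sibling` raises `AttributeError` as soon as one link in the chain is `None`, before the `if` gets a value to test. Below is a minimal sketch that guards against both cases, assuming the description is the last <p> in the div (the asker notes in the comments that this does not hold on every page). The function name `extract_description` and the sample strings are hypothetical.

```python
from bs4 import BeautifulSoup


def extract_description(html, css_class='product-description'):
    """Return the text of the last <p> inside the first matching div, or None.

    Assumption: the description is the last <p>; adjust if a page differs.
    """
    soup = BeautifulSoup(html, 'html.parser')
    div = soup.find('div', class_=css_class)
    if div is None:       # no such div on this page
        return None
    paragraphs = div.find_all('p')
    if not paragraphs:    # the div exists but holds no <p> tags
        return None
    return paragraphs[-1].get_text(strip=True)


# Illustrative inputs, not taken from the original post.
page_with_description = '<div class="product-description"><p>Title</p><p>Details</p></div>'
page_without_div = '<div class="other"><p>Nothing here</p></div>'

result_found = extract_description(page_with_description)
result_missing = extract_description(page_without_div)
```

In the question's loop this could be called as `extract_description(response.text)`, and pages that return `None` could be collected for manual inspection instead of stopping the run.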