I am trying to parse this HTML to get the item title (e.g. Big Boss Air Fryer - Healthy 1300-Watt Super Sized 16-Quart, Fryer 5 Colors -NEW):

<div style="" class="">
    <h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>Big Boss Air Fryer - Healthy 1300-Watt Super Sized 16-Quart, Fryer 5 Colors -NEW</h1>
            <h2 id="subTitle" class="it-sttl">
            Brand New + Free Shipping, Satisfaction Guaranteed! </h2>
    <!-- DO NOT change linkToTagId="rwid" as the catalog response has this ID set  -->
    <div class="vi-hdops-three-clmn-fix">           
        <div style="" class="vi-notify-new-bg-wrapper">
                <div class="vi-notify-new-bg-dTop" style=""> </div>
                <div id="vi_notification_new" class="vi-notify-new-bg-dBtm" style="top: -28px;"> 
                    <img src="https://ir.ebaystatic.com/rs/v/tnj4p1myre1mpff12w4j1llndmc.png" width="11" height="12" class="vi-notify-new-img" alt="Popular">
                    <span style="font-weight:bold;">5 sold in last 24 hours</span>
                </div>
            </div>
        </div>      
    </div>

I am using the following code to parse the page

import requests
from bs4 import BeautifulSoup

url1 = "https://www.ebay.com/itm/Big-Boss-Air-Fryer-Healthy-1300-Watt-Super-Sized-16-Quart-Fryer-5-Colors-NEW/122454150244?    epid=2254405949&hash=item1c82d60c64:m:mqfT2XbgveSevmN5MV1iysg"

def get_single_item_data(item_url):
    # Download the page and parse it with the built-in HTML parser
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')

    for item in soup.findAll('h1', {'class':'it-ttl'}):
        print(item.string)  # prints None for this <h1>

get_single_item_data(url1)

When I do this, BeautifulSoup returns None.

One solution I found is to use print(item.text) instead, but then I get 'Details about  Big Boss Air Fryer - Healthy 1300-Watt Super Sized 16-Quart, Fryer 5 Colors -NEW' (I do not want the 'Details about ' prefix).

Is there an efficient way to get the item title without having to get the full text and then strip off 'Details about '?

2 Answers

This is because of this caveat of the .string attribute:

If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None

Since the h1 element contains more than one child (the hidden span and the title text that follows it), .string is ambiguous for it and returns None.
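
To make the caveat concrete, here is a minimal sketch using a trimmed-down, hypothetical version of the markup from the question (the shortened title and the variable names are mine). It only illustrates how .string behaves with one child versus several:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup modelled on the <h1> from the question.
single = BeautifulSoup("<h1>Big Boss Air Fryer</h1>", "html.parser").h1
mixed = BeautifulSoup(
    '<h1><span class="g-hdn">Details about </span>Big Boss Air Fryer</h1>',
    "html.parser",
).h1

# Exactly one child (a text node): .string is that text.
single_string = single.string

# Two children (a <span> and a text node): .string is None.
mixed_string = mixed.string

# .contents lists the direct children, which shows why .string is ambiguous.
child_count = len(mixed.contents)

print(single_string)
print(mixed_string)
print(child_count)
```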

To avoid having to cut off the "Details about" part afterwards, you can get the first text node that is a direct child of the tag by searching non-recursively:

soup.find('h1', {'class':'it-ttl'}).find(text=True, recursive=False)

Demo:

In [3]: soup = BeautifulSoup(data, "html.parser")

In [4]: print(soup.find('h1', {'class':'it-ttl'}).find(text=True, recursive=False))
Big Boss Air Fryer - Healthy 1300-Watt Super Sized 16-Quart, Fryer 5 Colors -NEW
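
As a rough illustration of what the non-recursive search changes, the sketch below compares the default recursive search with recursive=False on a shortened, hypothetical copy of the markup. It uses string=True, which as far as I know is the newer name (Beautiful Soup 4.4.0 and later) for the text=True argument shown above; the variable names are only for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened copy of the <h1> from the question.
html = (
    '<h1 class="it-ttl" id="itemTitle">'
    '<span class="g-hdn">Details about </span>'
    "Big Boss Air Fryer"
    "</h1>"
)
h1 = BeautifulSoup(html, "html.parser").find("h1", {"class": "it-ttl"})

# Default (recursive=True): text nodes at any depth, including inside <span>.
all_text = h1.find_all(string=True)

# recursive=False: only text nodes that are direct children of the <h1>.
direct_text = h1.find_all(string=True, recursive=False)

# find() returns just the first match, which here is the title itself.
title = h1.find(string=True, recursive=False)

print(all_text)
print(direct_text)
print(title)
```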

2 Comments

What does setting (recursive=False) do in the code?
@JoeChan it limits the search to the direct children of the tag, so the text nested inside the span is skipped and only the top-level text node is returned. See the Beautiful Soup documentation for more.

You could use .text instead of .string:

from bs4 import BeautifulSoup
import requests


url1 = "https://www.ebay.com/itm/Big-Boss-Air-Fryer-Healthy-1300-Watt-Super-Sized-16-Quart-Fryer-5-Colors-NEW/122454150244?    epid=2254405949&hash=item1c82d60c64:m:mqfT2XbgveSevmN5MV1iysg"

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')

    for item in soup.findAll('h1', {'class':'it-ttl'}):
        print(item.text) # Use item.text

get_single_item_data(url1)

output:

Details about   Big Boss Air Fryer - Healthy 1300-Watt Super Sized 16-Quart, Fryer 5 Colors -NEW
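
Note that .text joins the text of every descendant, so the hidden "Details about" prefix is still there. One possible way around that, sketched here as an assumption rather than as part of either answer, is to drop the hidden span before reading the text. The markup is a shortened, hypothetical copy of the question's, and decompose() modifies the parsed tree in place.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened copy of the <h1> from the question.
html = (
    '<h1 class="it-ttl" id="itemTitle">'
    '<span class="g-hdn">Details about </span>'
    "Big Boss Air Fryer - Healthy 1300-Watt"
    "</h1>"
)
soup = BeautifulSoup(html, "html.parser")
h1 = soup.find("h1", {"class": "it-ttl"})

# .text includes the hidden prefix because it covers all descendants.
full_text = h1.text

# Remove the hidden <span> from the tree (this changes the soup object).
hidden = h1.find("span", {"class": "g-hdn"})
if hidden is not None:
    hidden.decompose()

# With the span gone, only the title text is left.
title = h1.get_text(strip=True)

print(full_text)
print(title)
```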
