I would like to extract data from a website and find out which pieces of equipment each item has. In the example below, A has CD but does not have CDA.

HTML:

<div class="ABC">
    <h3>A</h3>
    <ul>
        <li class="specChecked"><p>CD</p></li>
        <li class="specChecked"><p>VCD</p></li>
        <li class=""><p>CDA</p></li>                       
    </ul>
    <h3>B</h3>
    <div class="buyCarDetailContentSpecContent ">
        <ul>
        <li>
            <p>b1<span>1</span></p>
        </li>
        <li>
            <p>b2<span>2</span></p>
        </li>
        </ul>
    </div>
</div>

My code:

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.acd.com/carinfo-4434.php')
soup = BeautifulSoup(res.text, 'lxml')
for item in soup.find_all(attrs={'class': 'ABC'}):
    for link in item.find_all('li'):
        print(link)

This code extracts every li element from the HTML, like this:

<li class="specChecked"><p>CD</p></li>
<li class="specChecked"><p>VCD</p></li>
<li class=""><p>CDA</p></li> 
<li>
    <p>b1<span>1</span></p>
</li>
<li>
    <p>b2<span>2</span></p>
</li>

But that's not what I want. I want to extract the class and the text of each li, and I hope the result will look like this:

specChecked, CD
specChecked, VCD
, CDA

(Or maybe I could just replace specChecked with 1 and the empty class with 0.)

2 Answers

3

You can do something like the following to get the li elements with the desired class, along with the one whose class is empty.

from bs4 import BeautifulSoup

content = """
<div class="ABC">
    <h3>A</h3>
    <ul>
        <li class="specChecked"><p>CD</p></li>
        <li class="specChecked"><p>VCD</p></li>
        <li class=""><p>CDA</p></li>                       
    </ul>
    <h3>B</h3>
    <div class="buyCarDetailContentSpecContent ">
        <ul>
        <li>
            <p>b1<span>1</span></p>
        </li>
        <li>
            <p>b2<span>2</span></p>
        </li>
        </ul>
    </div>
</div>
"""
soup = BeautifulSoup(content, "html.parser")
for item in soup.find_all('li', class_=["specChecked", ""]):
    print("{}, {}".format(' '.join(item['class']), item.text))

Output:

specChecked, CD
specChecked, VCD
, CDA
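
The question also suggests replacing specChecked with 1 and an empty class with 0. A minimal sketch of that idea follows, assuming a trimmed copy of the sample HTML; equipment_flags is a hypothetical helper name that does not appear in the original posts, and list items without any class attribute are skipped:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample HTML from the question.
html = """
<div class="ABC">
    <ul>
        <li class="specChecked"><p>CD</p></li>
        <li class="specChecked"><p>VCD</p></li>
        <li class=""><p>CDA</p></li>
    </ul>
    <div class="buyCarDetailContentSpecContent ">
        <ul>
        <li><p>b1<span>1</span></p></li>
        </ul>
    </div>
</div>
"""


def equipment_flags(markup):
    """Map each equipment name to 1 (specChecked) or 0 (anything else)."""
    soup = BeautifulSoup(markup, "html.parser")
    flags = {}
    for li in soup.find_all("li"):
        if not li.has_attr("class"):
            continue  # ignore list items that carry no class attribute at all
        classes = li.get("class") or []
        flags[li.get_text(strip=True)] = int("specChecked" in classes)
    return flags


print(equipment_flags(html))  # {'CD': 1, 'VCD': 1, 'CDA': 0}
```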

2
s = """<div class="ABC">
    <h3>A</h3>
    <ul>
        <li class="specChecked"><p>CD</p></li>
        <li class="specChecked"><p>VCD</p></li>
        <li class=""><p>CDA</p></li>                       
    </ul>
    <h3>B</h3>
    <div class="buyCarDetailContentSpecContent ">
        <ul>
        <li>
            <p>b1<span>1</span></p>
        </li>
        <li>
            <p>b2<span>2</span></p>
        </li>
        </ul>
    </div>
</div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(s, "html.parser")
for link in soup.find_all('li'):
    if link.has_attr("class"):
        print(link.get("class", ""), link.text)

Output (from Python 2, hence the u prefixes):

[u'specChecked'], u'CD'
[u'specChecked'], u'VCD'
[u''], u'CDA'

  • Use has_attr to check whether the li has a class attribute.
  • Use link.get to read the class value.
  • Use link.text to extract the text.
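
On Python 3, the three steps listed above can be sketched as follows to produce the "class, text" lines the question asks for. This is only an illustration on a trimmed copy of the sample HTML; class_text_lines is a hypothetical helper name, and get_text(strip=True) is used in place of link.text to drop any surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample HTML from the question.
html = """
<div class="ABC">
    <ul>
        <li class="specChecked"><p>CD</p></li>
        <li class="specChecked"><p>VCD</p></li>
        <li class=""><p>CDA</p></li>
    </ul>
    <div class="buyCarDetailContentSpecContent ">
        <ul>
        <li><p>b1<span>1</span></p></li>
        </ul>
    </div>
</div>
"""


def class_text_lines(markup):
    """Return one 'class, text' string per <li> that has a class attribute."""
    soup = BeautifulSoup(markup, "html.parser")
    lines = []
    for li in soup.find_all("li"):
        if li.has_attr("class"):
            # With html.parser, class is a list of names; join it back into one string.
            class_value = " ".join(li.get("class", []))
            lines.append("{}, {}".format(class_value, li.get_text(strip=True)))
    return lines


for line in class_text_lines(html):
    print(line)
```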

1 Comment

Instead of checking whether each li has a class attribute, you can use soup.find_all('li', class_=True).
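
A sketch of what this comment proposes, on a trimmed copy of the sample HTML. The helper name items_with_class is made up for illustration. I have not checked every BeautifulSoup release, so it is worth confirming on your installed version that an li with class="" is still returned by class_=True.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample HTML from the question.
html = """
<div class="ABC">
    <ul>
        <li class="specChecked"><p>CD</p></li>
        <li class="specChecked"><p>VCD</p></li>
        <li class=""><p>CDA</p></li>
    </ul>
    <div class="buyCarDetailContentSpecContent ">
        <ul>
        <li><p>b1<span>1</span></p></li>
        </ul>
    </div>
</div>
"""


def items_with_class(markup):
    """Return the text of each <li> that find_all matches with class_=True."""
    soup = BeautifulSoup(markup, "html.parser")
    # class_=True filters on the presence of the class attribute,
    # so the has_attr check inside the loop is no longer needed.
    return [li.get_text(strip=True) for li in soup.find_all("li", class_=True)]


print(items_with_class(html))
```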
