Python: extract class and text

Question

I would like to extract data from a website and find out whether an entry includes certain equipment. In the example below, A has CD but does not have CDA.

HTML:

<div class="ABC">
    <h3>A</h3>
    <ul>
        <li class="specChecked"><p>CD</p></li>
        <li class="specChecked"><p>VCD</p></li>
        <li class=""><p>CDA</p></li>                       
    </ul>
    <h3>B</h3>
    <div class="buyCarDetailContentSpecContent ">
        <ul>
        <li>
            <p>b1<span>1</span></p>
        </li>
        <li>
            <p>b2<span>2</span></p>
        </li>
        </ul>
    </div>
</div>

My code:

import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.acd.com/carinfo-4434.php')
soup = BeautifulSoup(res.text, 'lxml')
for item in soup.find_all(attrs={'class': 'ABC'}):
    for link in item.find_all('li'):
        print(link)

My code extracts every li element from the HTML, like this:

<li class="specChecked"><p>CD</p></li>
<li class="specChecked"><p>VCD</p></li>
<li class=""><p>CDA</p></li> 
<li>
    <p>b1<span>1</span></p>
</li>
<li>
    <p>b2<span>2</span></p>
</li>

But that's not what I want. What I want is to extract the class and the text of each li, so that the result looks like this:

specChecked, CD
specChecked, VCD
, CDA

(Or maybe I could just replace specChecked with 1 and the empty class with 0.)

SIM · Accepted Answer · 2018-05-11 09:58:05Z

3

You can do something like the following to get the li elements with the desired class, along with the one whose class is empty.

from bs4 import BeautifulSoup

content = """
<div class="ABC">
    <h3>A</h3>
    <ul>
        <li class="specChecked"><p>CD</p></li>
        <li class="specChecked"><p>VCD</p></li>
        <li class=""><p>CDA</p></li>                       
    </ul>
    <h3>B</h3>
    <div class="buyCarDetailContentSpecContent ">
        <ul>
        <li>
            <p>b1<span>1</span></p>
        </li>
        <li>
            <p>b2<span>2</span></p>
        </li>
        </ul>
    </div>
</div>
"""
soup = BeautifulSoup(content, "html.parser")
for item in soup.find_all('li',class_=["specChecked",""]):
    print("{}, {}".format(' '.join(item['class']),item.text))

Output:

specChecked, CD
specChecked, VCD
, CDA
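
For the 1/0 variant mentioned in the question, a minimal sketch might look like the following. It assumes the equipment list is the first ul inside the div with class ABC, and the function name equipment_flags is made up for illustration.

```python
from bs4 import BeautifulSoup

# Same sample markup as above, slightly condensed.
content = """
<div class="ABC">
    <h3>A</h3>
    <ul>
        <li class="specChecked"><p>CD</p></li>
        <li class="specChecked"><p>VCD</p></li>
        <li class=""><p>CDA</p></li>
    </ul>
    <h3>B</h3>
    <div class="buyCarDetailContentSpecContent ">
        <ul>
        <li><p>b1<span>1</span></p></li>
        <li><p>b2<span>2</span></p></li>
        </ul>
    </div>
</div>
"""


def equipment_flags(markup):
    """Map each item in the first <ul> of div.ABC to 1 (checked) or 0."""
    soup = BeautifulSoup(markup, "html.parser")
    container = soup.find("div", class_="ABC")
    first_list = container.find("ul")  # assumption: the equipment list comes first
    flags = {}
    for li in first_list.find_all("li"):
        # A missing or empty class attribute both count as "not checked".
        classes = li.get("class") or []
        flags[li.get_text(strip=True)] = int("specChecked" in classes)
    return flags


flags = equipment_flags(content)
for name, flag in flags.items():
    print("{}, {}".format(flag, name))
```

Restricting the search to that one ul also keeps the b1 and b2 items out of the result without relying on how an empty class attribute is matched.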

Rakesh · Accepted Answer · 2018-05-11 08:32:20Z

2

s = """<div class="ABC">
    <h3>A</h3>
    <ul>
        <li class="specChecked"><p>CD</p></li>
        <li class="specChecked"><p>VCD</p></li>
        <li class=""><p>CDA</p></li>                       
    </ul>
    <h3>B</h3>
    <div class="buyCarDetailContentSpecContent ">
        <ul>
        <li>
            <p>b1<span>1</span></p>
        </li>
        <li>
            <p>b2<span>2</span></p>
        </li>
        </ul>
    </div>
</div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(s, "html.parser")
for link in soup.find_all('li'):
    if link.has_attr("class"):
        print(link.get("class", ""), link.text)

Output (the u prefixes indicate this was run under Python 2):

[u'specChecked'], u'CD'
[u'specChecked'], u'VCD'
[u''], u'CDA'

Use has_attr to check whether an li has a class attribute, link.get to read the class value, and link.text to extract the text.

1 Comment

Keyur Potdar Over a year ago

Instead of checking whether each li has a class attribute, you can use soup.find_all('li', class_=True).
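
A minimal sketch of that suggestion, using a condensed copy of the sample markup. Joining the class list into a string is an addition here, so that an empty class prints as an empty string rather than as a list.

```python
from bs4 import BeautifulSoup

content = """
<div class="ABC">
    <h3>A</h3>
    <ul>
        <li class="specChecked"><p>CD</p></li>
        <li class="specChecked"><p>VCD</p></li>
        <li class=""><p>CDA</p></li>
    </ul>
    <h3>B</h3>
    <div class="buyCarDetailContentSpecContent ">
        <ul>
        <li><p>b1<span>1</span></p></li>
        <li><p>b2<span>2</span></p></li>
        </ul>
    </div>
</div>
"""

soup = BeautifulSoup(content, "html.parser")

rows = []
# class_=True keeps only <li> tags that carry a class attribute,
# so the plain <li> items under B are skipped.
for li in soup.find_all("li", class_=True):
    rows.append((" ".join(li["class"]), li.get_text(strip=True)))

for css_class, text in rows:
    print("{}, {}".format(css_class, text))
```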
