BeautifulSoup - extracting texts within one class

Question

I'm trying to extract text from the webpage source below:

<div class="MYCLASS">Category1: <a id=category1 href="SomeURL" >
Text1 I want</a> &gt; Category2: <a href="SomeURL" >Text2 I want</a></div>

I tried:

for div in soup.find_all('div', class_='MYCLASS'):
    for url in soup.find_all('a', id='category1'):
        print(url)

And it returned:

    <a href="someURL" id="category1">Text1 I want</a>

So I split the text...

    for div in soup.find_all('div', class_='MYCLASS'):
        for url in soup.find_all('a', id='category1'):
            category1 = str(url).split('category1">')[1].split('</a>')[0]
            print(category1)

That extracts "Text1 I want", but I'm still missing "Text2 I want". Any ideas? Thank you.

EDIT:

There are other <a>...</a> elements in the source code, so if I remove id= from my code, it returns all the other text that I don't need. For example:

<div class="MYClass"><span class="Class">RandomText.<br>RandomText.<br>
<a href=someURL>RandomTextExtracted.</a><br>

Also,

</div><div class=MYClass>
<a href="SomeURL>RandomTextExtracted</a>

You're specifically getting all links that have the id category1, but "Text2 I want" doesn't have that id.
– Jack Ryan, Commented Mar 8, 2018 at 4:17
@JackRyan, I realized that; that's why I couldn't extract "Text2 I want". What do you think might help here? Thanks.
– Karma, Commented Mar 8, 2018 at 4:20
Thanks @KeyurPotdar, but there are other <a> tags in the source, so it would extract all the other text that I don't want if I only use [find_all('a')]. Any idea? Thanks.
– Karma, Commented Mar 8, 2018 at 5:03

Keyur Potdar · Accepted Answer · 2018-03-08 06:18:42Z

1

Since the id of an element is unique, you can find the first <a> tag using id="category1". To find the next <a> tag, you can use the find_next() method, which returns the first matching element that appears after it in the document.

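from bs4 import BeautifulSoup

html = '''<div class="MYCLASS">Category1: <a id=category1 href="SomeURL" >Text1 I want</a> &gt; Category2: <a href="SomeURL" >Text2 I want</a></div>'''
soup = BeautifulSoup(html, 'lxml')

# The id is unique, so this finds the first link directly
a_tag1 = soup.find('a', id='category1')
print(a_tag1)    # or use `a_tag1.text` to get the text
# find_next() returns the first <a> that appears after a_tag1 in the document
a_tag2 = a_tag1.find_next('a')
print(a_tag2)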

Output:

<a href="SomeURL" id="category1">Text1 I want</a>
<a href="SomeURL">Text2 I want</a>

(I've tested it for the link you've provided, and it works there too.)
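For the case described in the question's EDIT, where several divs share the same class, here is a minimal sketch of one alternative: find the link by its id, climb to the div that contains it with find_parent(), and then collect every <a> inside that div only. The sample HTML is made up for illustration (it is not the real page), and 'html.parser' is used here only so the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical page: three divs share the class, only one holds the category links.
html = '''
<div class="MYCLASS"><span class="Class">RandomText.
<a href="SomeURL">RandomTextExtracted.</a></span></div>
<div class="MYCLASS">Category1: <a id="category1" href="SomeURL">
Text1 I want</a> &gt; Category2: <a href="SomeURL">Text2 I want</a></div>
<div class="MYCLASS"><a href="SomeURL">RandomTextExtracted</a></div>
'''

soup = BeautifulSoup(html, 'html.parser')

# Locate the link by its id, then climb to the div that contains it.
first_link = soup.find('a', id='category1')
target_div = first_link.find_parent('div', class_='MYCLASS')

# Search only inside that div, so links in the other divs are ignored.
wanted = [a.get_text(strip=True) for a in target_div.find_all('a')]
print(wanted)  # ['Text1 I want', 'Text2 I want']
```

This assumes the id is present on the page; if soup.find() returns None, the find_parent() call will raise an AttributeError, so a None check may be worth adding for real pages.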

edited Mar 8, 2018 at 6:18

answered Mar 8, 2018 at 5:41

Keyur Potdar


bigbounty · Answer · 2018-03-08 05:26:29Z

0

You need to change your code a little.

from bs4 import BeautifulSoup

html = ('<div class="MYCLASS">Category1: <a id=category1 href="SomeURL" > '
        'Text1 I want</a> &gt; Category2: <a href="SomeURL" >Text2 I want</a></div>')
soup = BeautifulSoup(html, "lxml")
for div in soup.find_all('div', class_='MYCLASS'):
    # Search inside the current div; soup.find_all would scan the whole page each time
    for url in div.find_all('a'):
        print(url.text.strip())

Remove the id filter from the 'a' tag search and run the same code.

If you need the text of specific ids, you need to know the ids.

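ids = ['id1', 'id2']  # placeholder ids; replace with the ids you want
for div in soup.find_all('div', class_='MYCLASS'):
    for wanted_id in ids:
        # Look only inside the current div for links with this id
        for url in div.find_all('a', id=wanted_id):
            print(url.text.strip())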
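A point worth keeping in mind with nested loops like these: find_all() called on the soup searches the whole document, while find_all() called on a tag searches only that tag's descendants. The sketch below shows the difference on a made-up two-div page (the class name OTHER and the link texts are invented for illustration), using 'html.parser' so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical page with one link inside the target div and one outside it.
html = '''
<div class="MYCLASS"><a href="SomeURL">Inside</a></div>
<div class="OTHER"><a href="SomeURL">Outside</a></div>
'''
soup = BeautifulSoup(html, 'html.parser')

div = soup.find('div', class_='MYCLASS')

# Called on the soup: every <a> in the document is returned.
whole_document = [a.get_text(strip=True) for a in soup.find_all('a')]

# Called on the div: only <a> tags nested inside that div are returned.
inside_div = [a.get_text(strip=True) for a in div.find_all('a')]

print(whole_document)  # ['Inside', 'Outside']
print(inside_div)      # ['Inside']
```

So inside a `for div in ...` loop, calling find_all() on `div` rather than on `soup` keeps each pass limited to the current div.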

edited Mar 8, 2018 at 5:26

answered Mar 8, 2018 at 4:23

bigbounty


5 Comments

Karma Over a year ago

Thanks @bigbounty. What if there are other classes with the same name "MYCLASS"? When I run your code, it returns Text1 and Text2 that I want, but also other texts that I don't want. How would you extract that? Thank you.

bigbounty Over a year ago

You say that you don't want it, and you want to extract it!?

Karma Over a year ago

Sorry for the confusion, I just added more info in the question. So, there are some other classes that share the same name MYCLASS as the targeted one. But the only text that I want to extract is from Category1 and Category2.

bigbounty Over a year ago

Make a list of the wanted ids. Loop through them, pass each one as the id argument in your code, and run it.

Karma Over a year ago

There are 2 texts that I want to extract from the same id, id=category1, but it returns only Text1, not Text2. I'm trying to extract Text1 (after Category1) and Text2 (after Category2).
