I'm trying to extract text from the HTML below:

<div class="MYCLASS">Category1: <a id=category1 href="SomeURL" >
Text1 I want</a> &gt; Category2: <a href="SomeURL" >Text2 I want</a></div>

I tried:

for div in soup.find_all('div', class_='MYCLASS'):
    for url in soup.find_all('a', id='category1'):
        print(url)

And it returned:

    <a href="someURL" id="category1">Text1 I want</a>

So I split the text...

    for div in soup.find_all('div', class_='MYCLASS'):
        for url in soup.find_all('a', id='category1'):
            category1 = str(url).split('category1">')[1].split('</a>')[0]
            print(category1)

and extracted "Text1 I want", but I still can't get "Text2 I want". Any ideas? Thank you.

EDIT:

There are other <a> </a> tags in the source code, so if I remove id= from my code, it returns all the other texts that I don't need. For example,

<div class="MYClass"><span class="Class">RandomText.<br>RandomText.<br>
<a href=someURL>RandomTextExtracted.</a><br>

Also,

</div><div class=MYClass>
<a href="SomeURL>RandomTextExtracted</a>
  • You're specifically getting all links that have id category1 but Text2 I want doesn't have that id. Commented Mar 8, 2018 at 4:17
  • @JackRyan, I realized that, and that's why I couldn't extract Text2 I want. What do you think might help with this? Thanks. Commented Mar 8, 2018 at 4:20
  • Just remove the id='category1' from the find_all(). Commented Mar 8, 2018 at 4:31
  • Thanks @KeyurPotdar, but there are other <a> tags in the source, so using only find_all('a') would also extract all the other texts that I don't want. Any idea? Thanks. Commented Mar 8, 2018 at 5:03

2 Answers

Since the id of an element is unique, you can find the first <a> tag using id="category1". To find the next <a> tag, you can use the find_next() method.

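from bs4 import BeautifulSoup

html = '''<div class="MYCLASS">Category1: <a id=category1 href="SomeURL" >Text1 I want</a> &gt; Category2: <a href="SomeURL" >Text2 I want</a></div>'''
soup = BeautifulSoup(html, 'lxml')

# The id is unique, so find() locates the first link directly
a_tag1 = soup.find('a', id='category1')
print(a_tag1)    # or use `a_tag1.text` to get the text
# find_next() returns the next <a> tag after a_tag1 in document order
a_tag2 = a_tag1.find_next('a')
print(a_tag2)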

Output:

<a href="SomeURL" id="category1">Text1 I want</a>
<a href="SomeURL">Text2 I want</a>

(I've tested it for the link you've provided, and it works there too.)
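If the page has several divs with the same class, another option is to use the id'd link only as an anchor point and then collect every link inside its parent div. This is a minimal sketch under assumptions: the sample HTML (including the extra unwanted div) is made up for illustration, and it assumes both wanted links sit in the same div, as in the question's snippet. It uses the built-in 'html.parser' so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one unwanted div plus the target div from the question
html = '''
<div class="MYCLASS"><a href="SomeURL">RandomTextExtracted</a></div>
<div class="MYCLASS">Category1: <a id="category1" href="SomeURL">
Text1 I want</a> &gt; Category2: <a href="SomeURL">Text2 I want</a></div>
'''
soup = BeautifulSoup(html, 'html.parser')

texts = []
anchor = soup.find('a', id='category1')
if anchor is not None:
    # Climb to the enclosing div, then search only inside that div
    container = anchor.find_parent('div', class_='MYCLASS')
    if container is not None:
        texts = [a.get_text(strip=True) for a in container.find_all('a')]

print(texts)
```

Because the search is scoped to one div, links in other divs with the same class are ignored.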

You need to change your code a little.

from bs4 import BeautifulSoup
soup = BeautifulSoup("<div class=\"MYCLASS\">Category1: <a id=category1 href=\"SomeURL\" > \
Text1 I want</a> &gt; Category2: <a href=\"SomeURL\" >Text2 I want</a></div>", "lxml")
for div in soup.find_all('div', class_='MYCLASS'):
    # Search inside each div rather than the whole soup,
    # otherwise every link is printed once per matching div
    for url in div.find_all('a'):
        print(url.text.strip())

Remove the id filter for the 'a' tag and run the same code.

If you need the text for specific ids, you need to know those ids.

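ids = ['id1', 'id2']  # placeholders: replace with the ids you actually want
for div in soup.find_all('div', class_='MYCLASS'):
    for wanted_id in ids:
        # Search inside the current div for links with this id
        for url in div.find_all('a', id=wanted_id):
            print(url.text.strip())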

5 Comments

Thanks @bigbounty. What if there are other classes with the same name "MYCLASS"? When I run your code, it returns Text1 and Text2 that I want, but also other texts that I don't want. How would you extract that? Thank you.
You say that you don't want it, and yet you want to extract it!?
Sorry for the confusion, I just added more info in the question. So, there are some other classes that share the same name MYCLASS as the targeted one. But the only text that I want to extract is from Category1 and Category2.
Make a list of the ids you want. Loop through that list, pass each id as the id argument in your code, and run it.
There are 2 texts that I want to extract, but only the first link has id=category1, so it returns only Text1, not Text2. I'm trying to extract Text1 (after Category1) and Text2 (after Category2).
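For the case in the last comment, where only the first link has an id, a CSS selector is one possible way to get both links at once. This is a sketch, not a tested solution for the real page: it assumes the second link is a later sibling of the id'd link inside the same parent, as in the question's snippet, and that the installed Beautiful Soup version supports select() with the `~` sibling combinator. The sample HTML with the extra div is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: an unwanted div followed by the target div
html = '''
<div class="MYCLASS"><a href="SomeURL">RandomTextExtracted</a></div>
<div class="MYCLASS">Category1: <a id="category1" href="SomeURL">
Text1 I want</a> &gt; Category2: <a href="SomeURL">Text2 I want</a></div>
'''
soup = BeautifulSoup(html, 'html.parser')

# 'a#category1' matches the link with the id;
# 'a#category1 ~ a' matches any <a> that follows it under the same parent
links = soup.select('a#category1, a#category1 ~ a')
wanted = [link.get_text(strip=True) for link in links]

print(wanted)
```

Links in other divs are not siblings of the id'd link, so they are left out even though the divs share a class name.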
