added 364 characters in body

edited Mar 8, 2018 at 5:17

Karma


I'm trying to extract text from the HTML below:

<div class="MYCLASS">Category1: <a id=category1 href="SomeURL" >
Text1 I want</a> &gt; Category2: <a href="SomeURL" >Text2 I want</a></div>

I tried:

for div in soup.find_all('div', class_='MYCLASS'):
    for url in soup.find_all('a', id='category1'):
        print(url)

And it returned:

    <a href="someURL" id="category1">Text1 I want</a>

So I split the text...

    for div in soup.find_all('div', class_='MYCLASS'):
        for url in soup.find_all('a', id='category1'):
            category1 = str(url).split('category1">')[1].split('</a>')[0]
            print(category1)

and extracted "Text1 I want", but I'm still missing "Text2 I want". Any ideas? Thank you.
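For reference, a minimal sketch of reading the link text directly instead of splitting the string form of the tag. It assumes the markup is exactly the sample above and uses Python's built-in "html.parser" backend; the variable names are only illustrative. Note that the inner loop in the attempt above calls soup.find_all, so it searches the whole document rather than the current div.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; "SomeURL" is a placeholder.
html = '''<div class="MYCLASS">Category1: <a id=category1 href="SomeURL" >
Text1 I want</a> &gt; Category2: <a href="SomeURL" >Text2 I want</a></div>'''

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag (or None if nothing matches).
link = soup.find("a", id="category1")

# get_text(strip=True) returns the text inside the tag with surrounding
# whitespace removed, so no string splitting is needed.
category1 = link.get_text(strip=True)
print(category1)
```

This only covers the first link; the second one has no id, so it has to be located some other way.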

EDIT:

There are other <a>...</a> elements in the source code, so if I remove id= from my code, it also returns text that I don't need. For example,

<div class="MYCLASS"><span class="Class">RandomText.<br>RandomText.<br>
<a href=someURL>RandomTextExtracted.</a><br>

Also,

</div><div class="MYCLASS">
<a href="SomeURL">RandomTextExtracted</a>
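One possible way to get both texts without picking up the unrelated links, sketched under some assumptions: the wanted div is the one that contains the a tag with id="category1", every div uses the same class spelling (MYCLASS), and the sample markup below is a trimmed, well-formed version of the snippets above. The helper name category_texts is hypothetical.

```python
from bs4 import BeautifulSoup

# Assumed sample: one unrelated MYCLASS div followed by the div of interest.
html = '''
<div class="MYCLASS"><span class="Class">RandomText.
<a href="SomeURL">RandomTextExtracted.</a></span></div>
<div class="MYCLASS">Category1: <a id="category1" href="SomeURL">
Text1 I want</a> &gt; Category2: <a href="SomeURL">Text2 I want</a></div>
'''

soup = BeautifulSoup(html, "html.parser")


def category_texts(soup, anchor_id="category1"):
    """Return the text of every <a> in the div that holds the given id."""
    anchor = soup.find("a", id=anchor_id)
    if anchor is None:
        return []
    # Walk up from the known link to its enclosing MYCLASS div.
    container = anchor.find_parent("div", class_="MYCLASS")
    if container is None:
        return []
    # Search only inside that div, not the whole document.
    return [a.get_text(strip=True) for a in container.find_all("a")]


texts = category_texts(soup)
print(texts)
```

The idea is to use the one link that does have an id as a landmark, then restrict the search to its parent div so links in other MYCLASS divs are never visited.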

deleted 2 characters in body

edited Mar 8, 2018 at 4:17

Karma


I'm trying to extract texts from this webpage below:

<div class="MYCLASS">Category1: <a id=category1 href="SomeURL" >
Text1 I want</a> &gt; Category2: <a href="SomeURL" >Text2 I want</a></div>

I tried:

for div in soup.find_all('div', class_='MYCLASS'):
    for url in soup.find_all('a', id='category1'):
        print(url)

And it returned:

    <a href="someURL" id="category1">Text1 I want</a>

So I split the text...

    for div in soup.find_all('div', class_='MYCLASS'):
        for url in soup.find_all('a', id='category1'):
            category1 = str(url).split('category1">')[1].split('</a>')[0]
            print(category1)

and extracted "Text1 I want", but still miss "Text2 I want". Any idea? Thank you.

asked Mar 8, 2018 at 4:14

Karma


BeautifulSoup - extracting texts within one class

I'm trying to extract texts from this webpage below:

<div class="MYCLASS">Category1: <a id=category1 href="SomeURL" >
Text1 I want</a> &gt; Category2: <a href="SomeURL" >Text2 I want</a>

I tried:

for div in soup.find_all('div', class_='MYCLASS'):
    for url in soup.find_all('a', id='category1'):
        print(url)

And it returned:

    <a href="someURL" id="category1">Text1 I want</a>

So I did:

    for div in soup.find_all('div', class_='MYCLASS'):
        for url in soup.find_all('a', id='category1'):
            category1 = str(url).split('category1">')[1].split('</a>')[0]
            print(category1)

And I extracted "Text1 I want" but still miss "Text2 I want". Any idea? Thank you.
