Extract data from HTML using BeautifulSoup

Question

I'm trying to extract the data under the EXPERIENCE heading using BeautifulSoup. Below is my HTML:

<div><span>EXPERIENCE

<br/></span></div><div><span>

<br/></span></div><div><span>

<br/></span></div><div><span></span><span> </span><span>I worked in XYZ company from 2016 - 2018

<br/></span></div><div><span> I worked on JAVA platform

<br/></span></div><div><span>From then i worked in ABC company

</br>2018- Till date

</br></span></div><div><span>I got handson on Python Language

</br></span></div><div><span>PROJECTS

</br></span></div><div><span>Developed and optimized many application, etc...

My attempt so far:

with open('E:/cvparser/test.html','rb') as h:

    dh = h.read().splitlines()

    out = str(dh)

    soup = BeautifulSoup(out,'html.parser')

    for tag in soup.select('div:has(span:contains("EXPERIENCE"))'):

        final = (tag.get_text(strip = True, separator = '\n'))

    print(final)

Expected Output:

I worked in XYZ company from 2016 - 2018

I worked on JAVA platform

From then i worked in ABC company

2018- Till date

I got handson on Python Language

My code returns nothing (null). Can someone help me out here?

Just to clarify, EXPERIENCE is not a tag. The tag you're interested in is the <span> tag, so you are looking for the data under the <span> tag that contains the text EXPERIENCE.
– chitown88, Commented Aug 28, 2019 at 13:55
This is almost certainly a duplicate. I have seen this same question three times recently in only slightly different forms.
– QHarr, Commented Aug 28, 2019 at 15:04
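
A minimal sketch of the loading step, for illustration only. The attempt in the question calls str() on the list returned by splitlines(), so BeautifulSoup is handed the printed form of a list of bytes objects rather than the markup itself. The `raw` bytes below are a shortened stand-in for the contents of the file, not the real document.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the bytes read from the file (an assumption for this sketch).
raw = (
    b"<div><span>EXPERIENCE\n<br/></span></div>"
    b"<div><span>I worked on JAVA platform\n<br/></span></div>"
)

# What the attempt in the question parses: the repr of a list of bytes objects.
as_list_repr = str(raw.splitlines())
print(as_list_repr[:30])

# Passing the bytes straight to BeautifulSoup keeps list brackets and b'' markers out of the text.
soup = BeautifulSoup(raw, "html.parser")
texts = [div.get_text(strip=True) for div in soup.find_all("div")]
print(texts)
```

BeautifulSoup also accepts an open file handle, so the file could be passed in directly instead of being read and split first.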

Maaz · Accepted Answer · 2019-08-28 12:53:44Z

As I understand it, you want the text of the spans between EXPERIENCE and PROJECTS.

Here is what you need:

from bs4 import BeautifulSoup as soup

html = """<div><span>EXPERIENCE

<br/></span></div><div><span>

<br/></span></div><div><span>

<br/></span></div><div><span></span><span> </span><span>I worked in XYZ company from 2016 - 2018

<br/></span></div><div><span> I worked on JAVA platform

<br/></span></div><div><span>From then i worked in ABC company

</br>2018- Till date

</br></span></div><div><span>I got handson on Python Language

</br></span></div><div><span>PROJECTS
</br></span></div><div><span>Developed and optimized many application, etc...</span></div>"""

page = soup(html, "html.parser")

save = False
final = ''
for div in page.find_all('div'):
    text = div.get_text()

    if text and text.strip().replace('\n','') == 'PROJECTS':
        save = False

    if save and text and text.strip().replace('\n', ''):
        # the check above skips divs that contain only whitespace, avoiding blank lines in the result
        final = '{0}\n{1}'.format(final,text.replace('\n',''))
    else:
        if text and 'EXPERIENCE' in text:
            save = True

print(final)

OUTPUT:

 I worked in XYZ company from 2016 - 2018
 I worked on JAVA platform
From then i worked in ABC company
I got handson on Python Language
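
The loop above removes every newline from a div's text, so two pieces of text that share one div end up on a single line. A possible variant, sketched here under the assumption that the markup has the same shape as in the question, splits each div's text on newlines instead. The helper name `section_lines` and its `start`/`stop` parameters are made up for this illustration.

```python
from bs4 import BeautifulSoup

# Copy of the markup from the question.
html = """<div><span>EXPERIENCE

<br/></span></div><div><span>

<br/></span></div><div><span>

<br/></span></div><div><span></span><span> </span><span>I worked in XYZ company from 2016 - 2018

<br/></span></div><div><span> I worked on JAVA platform

<br/></span></div><div><span>From then i worked in ABC company

</br>2018- Till date

</br></span></div><div><span>I got handson on Python Language

</br></span></div><div><span>PROJECTS
</br></span></div><div><span>Developed and optimized many application, etc...</span></div>"""


def section_lines(markup, start="EXPERIENCE", stop="PROJECTS"):
    """Return the non-empty text lines found between two heading divs."""
    page = BeautifulSoup(markup, "html.parser")
    lines = []
    saving = False
    for div in page.find_all("div"):
        text = div.get_text()
        heading = text.strip()
        if heading == stop:
            break
        if saving:
            # Split on newlines so text on either side of a line break stays separate.
            lines.extend(part.strip() for part in text.splitlines() if part.strip())
        elif heading == start:
            saving = True
    return lines


print("\n".join(section_lines(html)))
```

Returning a list rather than a pre-joined string leaves it to the caller to decide how the lines are joined or processed further.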

Lemon · Accepted Answer · 2019-08-28 12:48:38Z

I am not sure about your HTML example, but try this:

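import requests
from bs4 import BeautifulSoup

result2 = requests.get("") # your url here
src2 = result2.content
soup = BeautifulSoup(src2, 'lxml')

# The attrs dict filters on tag attributes, so this matches <div span="Experience">
for item in soup.find_all('div', {'span': 'Experience'}):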
    print(item.text)
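
If the goal is to match a div by the text of a span inside it rather than by an attribute, a function filter is one option. This is a rough sketch: the markup is a shortened stand-in for the question's HTML, `has_experience_span` is a hypothetical helper, and html.parser is used instead of lxml only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's markup (an assumption for this sketch).
html = (
    "<div><span>EXPERIENCE<br/></span></div>"
    "<div><span>I worked on JAVA platform<br/></span></div>"
    "<div><span>PROJECTS<br/></span></div>"
)
soup = BeautifulSoup(html, "html.parser")


def has_experience_span(tag):
    """Hypothetical filter: a <div> with a <span> whose text mentions EXPERIENCE."""
    return tag.name == "div" and any(
        "EXPERIENCE" in span.get_text() for span in tag.find_all("span")
    )


heading = soup.find(has_experience_span)
# The heading div holds only the label; the entries are its following sibling divs.
following = [div.get_text(strip=True) for div in heading.find_next_siblings("div")]
print(heading.get_text(strip=True))
print(following)
```

Because the siblings include the PROJECTS div and everything after it, the caller would still need to stop at the next heading.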


Ajax1234 · Accepted Answer · 2019-08-28 13:55:07Z

You can use itertools.groupby to match each run of content to its header:

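import itertools, re
from bs4 import BeautifulSoup as soup
# html is assumed to hold the markup string from the question
# d flattens the parsed tree into a list of its text nodes
d = lambda x:[i for b in x.contents for i in ([b] if b.name is None else d(b))]
# drop newlines and leading whitespace, then discard empty strings
data = list(filter(None, map(lambda x:re.sub(r'\n+|^\s+', '', x), d(soup(html, 'html.parser')))))
# group consecutive strings by whether they are all-uppercase headers
new_d = [list(b) for _, b in itertools.groupby(data, key=lambda x:x.isupper())]
# pair each header group with the group of content that follows it
result = {new_d[i][0]:new_d[i+1] for i in range(0, len(new_d), 2)}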

Output:

{'EXPERIENCE': ['\uf0b7', 'I worked in XYZ company from 2016 - 2018', 'I worked on JAVA platform', 'From then i worked in ABC company', 'I got handson on Python Language'], 'PROJECTS': ['Developed and optimized many application, etc...']}

To get your desired output:

print('\n'.join(result['EXPERIENCE']))

Output:


I worked in XYZ company from 2016 - 2018
I worked on JAVA platform
From then i worked in ABC company
2018- Till date
I got handson on Python Language

It gave me "list index out of range". @Ajax1234
– Anil Kumar, Commented over a year ago
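
One way a "list index out of range" error can arise with this pairing scheme is an odd number of groups, for example when some text precedes the first header or a header has nothing after it. Whether that is what happened with the commenter's file is not known. The sketch below is a more defensive variant under that assumption; `group_by_heading` and the stand-in markup are hypothetical.

```python
import itertools
from bs4 import BeautifulSoup

# Stand-in markup (an assumption): text before the first heading,
# and a final heading with no entries after it.
html = (
    "<div><span>Resume</span></div>"
    "<div><span>EXPERIENCE<br/></span></div>"
    "<div><span>I worked on JAVA platform<br/></span></div>"
    "<div><span>I got handson on Python Language<br/></span></div>"
    "<div><span>PROJECTS<br/></span></div>"
)


def group_by_heading(markup):
    """Map each all-uppercase heading to the list of text lines that follow it."""
    page = BeautifulSoup(markup, "html.parser")
    lines = [
        part.strip()
        for text in page.stripped_strings
        for part in text.splitlines()
        if part.strip()
    ]
    result = {}
    current = None
    for is_heading, group in itertools.groupby(lines, key=str.isupper):
        group = list(group)
        if is_heading:
            # Every heading gets an entry, even if no content follows it.
            for heading in group:
                result[heading] = []
            current = group[-1]
        elif current is not None:
            result[current].extend(group)
        # Lines that appear before the first heading are skipped.
    return result


result = group_by_heading(html)
print(result)
```

Tracking the current heading while iterating avoids indexing into the list of groups, so the number of groups no longer matters.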
