4

I'm trying to extract the data that follows the EXPERIENCE heading, using BeautifulSoup. Below is my HTML:

<div><span>EXPERIENCE

<br/></span></div><div><span>

<br/></span></div><div><span>

<br/></span></div><div><span></span><span> </span><span>I worked in XYZ company from 2016 - 2018

<br/></span></div><div><span> I worked on JAVA platform

<br/></span></div><div><span>From then i worked in ABC company

</br>2018- Till date

</br></span></div><div><span>I got handson on Python Language

</br></span></div><div><span>PROJECTS

</br></span></div><div><span>Developed and optimized many application, etc...

My work till now:

with open('E:/cvparser/test.html','rb') as h:

    dh = h.read().splitlines()

    out = str(dh)

    soup = BeautifulSoup(out,'html.parser')

    for tag in soup.select('div:has(span:contains("EXPERIENCE"))'):

        final = (tag.get_text(strip = True, separator = '\n'))

    print(final)

Expected Output:

I worked in XYZ company from 2016 - 2018

I worked on JAVA platform

From then i worked in ABC company

2018- Till date

I got handson on Python Language

My code returns nothing. Can someone help me out here?

2
  • Just to clarify, EXPERIENCE is not a tag. The tag you're interested in is the <span> tag, so you are looking for the data under the span tag that contains the text EXPERIENCE. Commented Aug 28, 2019 at 13:55
  • This is almost certainly a duplicate. I have seen this same question three times recently in only slightly different forms. Commented Aug 28, 2019 at 15:04

3 Answers

2

As I understand it, you want the text in the spans between EXPERIENCE and PROJECTS.

Here is what you need:

from bs4 import BeautifulSoup as soup

html = """<div><span>EXPERIENCE

<br/></span></div><div><span>

<br/></span></div><div><span>

<br/></span></div><div><span></span><span> </span><span>I worked in XYZ company from 2016 - 2018

<br/></span></div><div><span> I worked on JAVA platform

<br/></span></div><div><span>From then i worked in ABC company

</br>2018- Till date

</br></span></div><div><span>I got handson on Python Language

</br></span></div><div><span>PROJECTS
</br></span></div><div><span>Developed and optimized many application, etc...</span></div>"""

page = soup(html, "html.parser")

save = False
final = ''
for div in page.find_all('div'):
    text = div.get_text()

    if text and text.strip().replace('\n','') == 'PROJECTS':
        save = False

    if save and text and text.strip().replace('\n', ''):
        # last if is to avoid new line in final result
        final = '{0}\n{1}'.format(final,text.replace('\n',''))
    else:
        if text and 'EXPERIENCE' in text:
            save = True

print(final)

OUTPUT:

 I worked in XYZ company from 2016 - 2018
 I worked on JAVA platform
From then i worked in ABC company
I got handson on Python Language
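
For what it's worth, the expected output in the question also lists "2018- Till date", which sits in the same span as the ABC company line. One way to keep it as its own line is to walk the document's text nodes with stripped_strings and collect everything between the two headings. This is only a minimal sketch: the function name section_lines and its start/stop parameters are made up for illustration, and it assumes each heading is a text node that equals EXPERIENCE or PROJECTS exactly once stripped.

```python
from bs4 import BeautifulSoup

html = """<div><span>EXPERIENCE
<br/></span></div><div><span>
<br/></span></div><div><span></span><span> </span><span>I worked in XYZ company from 2016 - 2018
<br/></span></div><div><span> I worked on JAVA platform
<br/></span></div><div><span>From then i worked in ABC company
</br>2018- Till date
</br></span></div><div><span>I got handson on Python Language
</br></span></div><div><span>PROJECTS
</br></span></div><div><span>Developed and optimized many application, etc...</span></div>"""


def section_lines(markup, start="EXPERIENCE", stop="PROJECTS"):
    """Return the stripped text nodes found between two heading strings.

    Hypothetical helper: headings must match `start` / `stop` exactly.
    """
    soup = BeautifulSoup(markup, "html.parser")
    lines = []
    collecting = False
    # stripped_strings yields every non-empty text node, already stripped
    for text in soup.stripped_strings:
        if text == stop:
            break
        if collecting:
            lines.append(text)
        elif text == start:
            collecting = True
    return lines


print("\n".join(section_lines(html)))
```

Because each text node is handled separately, text on either side of a stray </br> stays on its own line instead of being glued together.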

0

I am not sure about your html example, but try this:

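import requests
from bs4 import BeautifulSoup

result2 = requests.get("")  # your url here
src2 = result2.content
soup = BeautifulSoup(src2, 'lxml')

# A dict passed to find_all filters on tag attributes, so {'span': 'Experience'}
# would look for <div span="Experience">. Check the text of each div instead.
for item in soup.find_all('div'):
    if 'EXPERIENCE' in item.get_text():
        print(item.text)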
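
Keep in mind that a dict passed as the second argument to find_all is matched against tag attributes, not child tags. Another way to go is to locate the heading span by its text and then walk the div siblings that follow it. The following is a rough sketch, assuming the section divs are siblings of the heading div as in the question's HTML; lines_after_heading and sample are hypothetical names, and sample is a shortened stand-in for the real page.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's markup (assumption, not the real file)
sample = (
    "<div><span>EXPERIENCE<br/></span></div>"
    "<div><span> <br/></span></div>"
    "<div><span>I worked on JAVA platform<br/></span></div>"
    "<div><span>From then i worked in ABC company</br>2018- Till date</br></span></div>"
    "<div><span>PROJECTS</br></span></div>"
    "<div><span>Developed and optimized many application, etc...</span></div>"
)


def lines_after_heading(markup, heading="EXPERIENCE", stop="PROJECTS"):
    """Collect text from the divs that follow the heading div, up to `stop`."""
    soup = BeautifulSoup(markup, "html.parser")
    # First span whose text contains the heading
    span = soup.find(lambda tag: tag.name == "span" and heading in tag.get_text())
    if span is None:
        return []
    start_div = span.find_parent("div")
    if start_div is None:
        return []
    lines = []
    for div in start_div.find_next_siblings("div"):
        strings = list(div.stripped_strings)
        if stop in strings:
            break  # reached the next section heading
        lines.extend(strings)
    return lines


print("\n".join(lines_after_heading(sample)))
```

If the real page nests the divs differently, the sibling walk would need adjusting.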


0

You can use itertools.groupby to match all relevant sub contents to their appropriate header:

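import re
from itertools import groupby
from bs4 import BeautifulSoup as soup

# html is assumed to hold the markup string from the question
# flatten the tree into its text nodes, in document order
d = lambda x:[i for b in x.contents for i in ([b] if b.name is None else d(b))]
# drop newlines and leading whitespace, then discard empty strings
data = list(filter(None, map(lambda x:re.sub(r'\n+|^\s+', '', x), d(soup(html, 'html.parser')))))
# alternating runs of upper-case headers and their content
new_d = [list(b) for _, b in groupby(data, key=lambda x:x.isupper())]
result = {new_d[i][0]:new_d[i+1] for i in range(0, len(new_d), 2)}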

Output:

{'EXPERIENCE': ['\uf0b7', 'I worked in XYZ company from 2016 - 2018', 'I worked on JAVA platform', 'From then i worked in ABC company', 'I got handson on Python Language'], 'PROJECTS': ['Developed and optimized many application, etc...']}

To get your desired output:

print('\n'.join(result['EXPERIENCE']))

Output:


I worked in XYZ company from 2016 - 2018
I worked on JAVA platform
From then i worked in ABC company
2018- Till date
I got handson on Python Language

1 Comment

@Ajax1234 It gave me "list index out of range".
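
One plausible cause of that error: pairing groups as new_d[i] and new_d[i+1] only works when headers and content strictly alternate. If the real file starts with non-header text or ends on a header, the number of groups is odd and new_d[i+1] runs past the end. Below is a sketch that swaps groupby for a plain loop so it tolerates those cases. The name group_by_heading and the sample string are hypothetical, and like the answer above it treats any all-upper-case text node as a header.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's markup, with stray text before the
# first header and a trailing header that has no content (assumption)
sample = (
    "<div><span>some text before any header</span></div>"
    "<div><span>EXPERIENCE<br/></span></div>"
    "<div><span>I worked on JAVA platform<br/></span></div>"
    "<div><span>From then i worked in ABC company</br>2018- Till date</br></span></div>"
    "<div><span>PROJECTS</br></span></div>"
    "<div><span>Developed and optimized many application, etc...</span></div>"
    "<div><span>SKILLS</span></div>"
)


def group_by_heading(markup):
    """Map each upper-case header to the list of text nodes that follow it."""
    soup = BeautifulSoup(markup, "html.parser")
    sections = {}
    current = None
    for text in soup.stripped_strings:
        if text.isupper():
            # New header: later text belongs to it
            current = text
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(text)
        # Text before the first header is ignored
    return sections


result = group_by_heading(sample)
print("\n".join(result["EXPERIENCE"]))
```

A header with nothing under it simply maps to an empty list rather than raising.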
