Web data scraping: split HTML content

Question

I'm scraping a website and have managed to reduce a variable called "gender" to this:

[<span style="text-decoration: none;">
                        Lass Christian, du Danemark, à Yverdon-les-Bains, avec 200 parts de CHF 100
                    </span>, <span style="text-decoration: none;">associé gérant </span>]

Now I'd like to keep only "associé" in the variable, but I can't find a way to split this HTML.

The reason is that I want to know whether it's "associé" (male) or "associée" (female).

Does anyone have any ideas?

Cheers

Edit: here is my code, which produces the HTML output above:

import requests
from bs4 import BeautifulSoup

url = "http://www.rc2.vd.ch/registres/hrcintapp-pub/companyReport.action?rcentId=5947621600000055031025&lang=FR&showHeader=false"

r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
# select_one returns only the first tag that matches the selector
table = soup.select_one("#adm").find_next("table")
table2 = soup.select_one("#adm").find_all_next("table")

# The attribute value is quoted because it contains a colon
output = table.select('td span[style^="text-decoration:"]', limit=2)  # .text.split(",", 1)[0].strip()

print(output)

Could you show the code that produced this output? Thanks.

– alecxe, Commented Sep 15, 2016 at 20:25
Yes, sure, I'll edit it right now.

– jjyoh, Commented Sep 15, 2016 at 21:06

Padraic Cunningham · Accepted Answer · 2016-09-15 21:15:04Z

1

Whatever the parent of the two elements is, you can use the selector span:nth-of-type(2) to get the second span, then just check its text:

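from bs4 import BeautifulSoup

html = """<span style="text-decoration: none;">
                        Lass Christian, du Danemark, à Yverdon-les-Bains, avec 200 parts de CHF 100
                    </span>
           <span style="text-decoration: none;">associé gérant </span>"""

# Naming the parser explicitly avoids the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, "html.parser")

# Second <span> among its siblings
text = soup.select_one("span:nth-of-type(2)").text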
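The "check the text" step is not spelled out, so here is a minimal sketch of one way to do it. The wrapping `<div>`, the `html.parser` choice, and the names `role`, `first_word` and `is_female` are assumptions made for illustration, not part of the original page or code.

```python
from bs4 import BeautifulSoup

# The wrapping <div> is an assumption; any common parent works the same way.
html = """<div>
<span style="text-decoration: none;">
    Lass Christian, du Danemark, à Yverdon-les-Bains, avec 200 parts de CHF 100
</span>
<span style="text-decoration: none;">associé gérant </span>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Second <span> among its siblings, with surrounding whitespace removed.
role = soup.select_one("span:nth-of-type(2)").get_text(strip=True)

# The first word carries the gender: "associé" (male) or "associée" (female).
first_word = role.split()[0]
is_female = first_word == "associée"
print(first_word, is_female)
```

Comparing the whole first word, rather than testing for a substring, avoids treating "associée" as a match for "associé".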

Or, if it is not always the second span, you can search for the span by the partial text associé:

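import re
# u"..." works on Python 2 and Python 3; the ur"..." prefix is a syntax error on Python 3.
# string= is the current name of the older text= argument (Beautiful Soup 4.4.0+).
text = soup.find("span", string=re.compile(u"associé")).text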
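One thing to keep in mind is that the partial pattern associé also matches associée, so finding the span does not by itself tell the two forms apart. A small sketch of one way to distinguish them with the standard `re` module follows; it assumes Python 3 strings, and the names `ROLE_PATTERN` and `associate_form` are hypothetical.

```python
import re

# Capture the optional trailing "e"; \b keeps longer words such as "associés" out.
ROLE_PATTERN = re.compile(r"\bassocié(e?)\b")


def associate_form(text):
    """Return 'associé', 'associée', or None if neither word occurs in text."""
    match = ROLE_PATTERN.search(text)
    if match is None:
        return None
    return "associée" if match.group(1) else "associé"


print(associate_form("associé gérant "))
print(associate_form("associée gérante"))
```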

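For your edit, all you need to do is take the text of the last element and split it on whitespace with .split(None, 1). Index [0] is the first word, which is the one the question asks for, and index [1] is the rest of the role:

text = table.select('td span[style^="text-decoration:"]', limit=2)[-1].text
gender = text.split(None, 1)[0]  # > associé; index [1] would give the rest, "gérant "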
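To make the indices concrete, here is a short standard-library sketch of what `.split(None, 1)` returns for this kind of string. The helper `split_role` and the mapping `GENDER_BY_WORD` are hypothetical names; the male/female mapping is taken from the question.

```python
def split_role(text):
    """Return (first_word, rest) for a role string such as 'associé gérant '."""
    # split(None, 1) splits on the first run of whitespace only, so the
    # result has at most two items and the remainder keeps trailing spaces.
    parts = text.split(None, 1)
    first = parts[0] if parts else ""
    rest = parts[1].strip() if len(parts) > 1 else ""
    return first, rest


# Mapping stated in the question: "associé" is male, "associée" is female.
GENDER_BY_WORD = {"associé": "male", "associée": "female"}

first, rest = split_role("associé gérant ")
gender = GENDER_BY_WORD.get(first)  # None for any other first word
print(first, rest, gender)
```

Guarding the length of `parts` matters because a one-word role would make a bare `[1]` raise `IndexError`.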

edited Sep 15, 2016 at 21:15

answered Sep 15, 2016 at 20:47

Padraic Cunningham


2 Comments

jjyoh Over a year ago

It gives me an error: TypeError: expected string or buffer

Padraic Cunningham Over a year ago

@J.jaques, nothing in my code would do that if you used it correctly. What exactly are you passing?
