Extracting text using selenium

Question

Below is an excerpt of the html code:

<div class="Class1">Category1</div>
<div class="Class2">"Text1 I want"</div>
<div class="Class1">Category2</div>
<div class="Class2">"Text2 I want"</div>

I know I can extract Text1 and Text2 by using:

find_element = browser.find_elements_by_xpath("//div[@class='Class2']")
element = [x.text for x in find_element]
text1 = element[0]
text2 = element[1]

But if the structure of the HTML changes, the positions of the elements in that list will change accordingly. Is there any way for me to extract Text1 and Text2 by referring to Category1 and Category2, respectively?

Thank you.

Please give some examples of structure changes. We need to know which part stays the same across those changes so we can figure out how to use it to achieve your goal. Or you can tell us the unchanging parts directly, such as the class name of the Categoryx div or the class name of the "Text I want" div.
– yong, Commented Mar 17, 2018 at 0:34
@yong I don't know whether the structure is going to change. I'm just trying to avoid potential errors by extracting the texts with reference to their categories.
– Karma, Commented Mar 17, 2018 at 0:41

yong · Accepted Answer · 2018-03-17 00:58:32Z

1

If the text you want is always inside the next sibling div of the category div, you can try the following:

Case 1

<div class="Class1">Category1</div>
<div class="Class2">"Text1 I want"</div>

//div[.='Category1']/following-sibling::div[1]

Case 2

<div class="Class1">Category1</div>
<div class="Class2">
  <div class="xxx">
    <span>"Text1 I want"</span>
  </div>
</div>

//div[.='Category1']/following-sibling::div[1]//span

There can be many possible structures; the key part of the XPath is //div[.='Category1']/following-sibling::div[1]
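
As a rough sketch of how that XPath could be used from Python, the helpers below build the expression for a given category and read the text of the matched element. The names `value_xpath` and `text_for_category` are hypothetical, and the code assumes the Selenium 4 `find_element(by, value)` call rather than the older `find_elements_by_xpath` used in the question. It also assumes the category text contains no single quote.

```python
def value_xpath(category: str) -> str:
    """Build the XPath for the first div that follows the div whose text is `category`."""
    if "'" in category:
        # XPath 1.0 has no escape for a quote inside a string literal,
        # so this sketch simply refuses such input.
        raise ValueError("category must not contain a single quote")
    return f"//div[.='{category}']/following-sibling::div[1]"


def text_for_category(browser, category: str) -> str:
    """Return the visible text of the div that follows the given category div."""
    # "xpath" is the value of selenium.webdriver.common.by.By.XPATH.
    element = browser.find_element("xpath", value_xpath(category))
    return element.text
```

With a live driver this would be called as `text1 = text_for_category(browser, "Category1")`. For Case 2, append `//span` to the string returned by `value_xpath`.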

answered Mar 17, 2018 at 0:58


1 Comment

Karma Over a year ago

Thank you. That's all I need.

DYZ · Accepted Answer · 2018-03-17 00:16:40Z

0

I suggest using BeautifulSoup. Find the Category1 tag, then its next_sibling:

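import bs4

# Parse the page that Selenium has already loaded
your_html = browser.page_source
soup = bs4.BeautifulSoup(your_html, 'lxml')

# Locate the div whose text is exactly 'Category1'
class1tag = soup.find('div', text='Category1')
# With the HTML above, the first next_sibling is the newline between the
# two divs, so the second next_sibling is the div we want
tag = class1tag.next_sibling.next_sibling
print(tag)
# <div class="Class2">"Text1 I want"</div>
print(tag.text)
# "Text1 I want"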

edited Mar 17, 2018 at 0:16

answered Mar 17, 2018 at 0:13


9 Comments

Karma Over a year ago

I tried, but the website blocked me from extracting info with BeautifulSoup.

DYZ Over a year ago

If you can open it with selenium, you can save the document as HTML: html = browser.page_source.

Karma Over a year ago

That would be too much, though, as I'd be saving at least 3,000 HTML files. I'll try it out, but is there a more efficient way to do it?

DYZ Over a year ago

I do not understand your concern. You still have to parse all the HTML documents to get the data that you want.

Karma Over a year ago

I was just hoping to avoid downloading too many files (HTML documents in this case), because extracting those few texts is all I need.


Arnon Axelrod · Accepted Answer · 2018-03-17 21:41:14Z

I guess that your concern about changes to the HTML structure comes from the fact that the data is semantically a set of key-value pairs (the categories are the keys and the texts are the values), while the structure is simply a flat list of divs in which the odd ones are the keys and the following even ones are their corresponding values. The problem, though, isn't with your Selenium locators but with the structure of the HTML itself (which in turn limits your ability to use more robust locators). I suggest that you ask the developers to improve the structure of the HTML so that it reflects its semantics. Discuss together the best structure that fits all the needs, including those of the test automation.
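
Until the markup changes, one way to recover the key-value view on the automation side is to build a dictionary from the page. This is only a sketch under stated assumptions: `category_map` is a hypothetical helper, `Class1` is the category class from the question, every category div is directly followed by its value div, and the calls use the Selenium 4 `find_elements(by, value)` form.

```python
def category_map(browser, category_class: str = "Class1") -> dict:
    """Map each category's text to the text of the div that directly follows it."""
    result = {}
    # "xpath" is the value of selenium.webdriver.common.by.By.XPATH.
    categories = browser.find_elements("xpath", f"//div[@class='{category_class}']")
    for category in categories:
        # A relative XPath, evaluated from the category element itself.
        value = category.find_element("xpath", "./following-sibling::div[1]")
        result[category.text] = value.text
    return result
```

Looking up `category_map(browser)["Category1"]` then reads like the key-value access the data implies, although it still depends on the sibling layout staying as it is.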
