
Below is an excerpt of the HTML:

<div class="Class1">Category1</div>
<div class="Class2">"Text1 I want"</div>
<div class="Class1">Category2</div>
<div class="Class2">"Text2 I want"</div>

I know I can extract Text1 and Text2 by using:

from selenium.webdriver.common.by import By

# Collect every Class2 div, then pick the texts by position
elements = browser.find_elements(By.XPATH, "//div[@class='Class2']")
texts = [x.text for x in elements]
text1 = texts[0]
text2 = texts[1]

But if the structure of the HTML changes, the positions in that list will change with it. Is there a way to extract Text1 and Text2 by referring to Category1 and Category2, respectively?

Thank you.

  • Please give some examples of structure changes. We need to know which parts stay the same across those changes to figure out how to use them to achieve your goal. Or you can tell us the immutable parts directly, such as the class name of the category divs or the class name of the text divs. Commented Mar 17, 2018 at 0:34
  • @yong I don't know whether the structure is going to change. I'm just trying to avoid potential errors by extracting the texts through their categories. Commented Mar 17, 2018 at 0:41

3 Answers


If the text you want is always inside the next sibling div of the category div, you can try the following:

Case 1

<div class="Class1">Category1</div>
<div class="Class2">"Text1 I want"</div>

//div[.='Category1']/following-sibling::div[1]

Case 2

<div class="Class1">Category1</div>
<div class="Class2">
  <div class="xxx">
    <span>"Text1 I want"</span>
  </div>
</div>

//div[.='Category1']/following-sibling::div[1]//span

Many other structures are possible. The key part of the XPath is //div[.='Category1']/following-sibling::div[1]
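As a minimal sketch of how this XPath might be used from Selenium in Python: the helper names `category_xpath` and `text_for_category` are hypothetical, and `browser` is assumed to be an already opened WebDriver. The sketch also assumes the category name contains no single quote, since it is pasted into the XPath string literal.

```python
def category_xpath(category):
    """Build the XPath for the first div following the div whose text is `category`.

    Assumption: `category` contains no single quote.
    """
    return f"//div[.='{category}']/following-sibling::div[1]"


def text_for_category(browser, category):
    """Return the text of the div that follows the given category div.

    `browser` is assumed to be a Selenium WebDriver. The string "xpath" is
    the value of selenium.webdriver.common.by.By.XPATH.
    """
    return browser.find_element("xpath", category_xpath(category)).text
```

With the HTML from the question, `text_for_category(browser, "Category1")` would look up the div right after the Category1 div, so the lookup no longer depends on the position of the element in a list.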


1 Comment

Thank you. That's all I need.

I suggest using BeautifulSoup. Find the Category1 tag, then its next sibling div:

import bs4
your_html = browser.page_source
soup = bs4.BeautifulSoup(your_html, 'lxml')

# Locate the category div by its exact text
class1tag = soup.find('div', string='Category1')
# find_next_sibling skips the whitespace text node between the two divs
tag = class1tag.find_next_sibling('div')
print(tag)
#<div class="Class2">"Text1 I want"</div>
print(tag.text)
#"Text1 I want"

9 Comments

I tried, but the website blocked me from extracting info with BeautifulSoup.
If you can open it with selenium, you can save the document as HTML: html = browser.page_source.
That would be too much though, as I'd be saving at least 3,000 HTML documents. I'll try it out, but is there a more efficient way to do this?
I do not understand your concern. You still have to parse all the HTML documents to get the data that you want.
I was just trying to avoid downloading too many files (HTML documents in this case), because extracting those few texts is all I need.

I guess your concern about changes to the structure of the HTML comes from the fact that the data is semantically a set of key-value pairs (the keys being the categories and the values being the texts), while the structure is simply a list of divs in which the odd ones are the keys and the even ones that follow are their corresponding values. The problem isn't with your Selenium locators, but with the structure of the HTML itself, which in turn limits your ability to use more robust locators. I suggest asking the developers to improve the structure of the HTML so that it reflects its actual semantics. Discuss together the structure that best fits all the needs, including those of the test automation.
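Until the markup changes, the key-value reading described above can be approximated on the client side. The following is only a sketch using Python's standard-library `html.parser`. The class `CategoryPairs` and the function `extract_pairs` are hypothetical names, and the sketch assumes the flat, alternating Class1/Class2 layout from the question with no nested divs. On a live page the input would presumably be `browser.page_source`.

```python
from html.parser import HTMLParser


class CategoryPairs(HTMLParser):
    """Collect {category: text} from alternating Class1/Class2 divs.

    Assumes a flat layout: each div holds plain text and no nested divs.
    """

    def __init__(self):
        super().__init__()
        self.pairs = {}
        self._current_class = None   # class of the div currently being read
        self._last_category = None   # most recent Class1 text seen

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self._current_class = dict(attrs).get("class")

    def handle_endtag(self, tag):
        if tag == "div":
            self._current_class = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # ignore whitespace between divs
        if self._current_class == "Class1":
            self._last_category = text
        elif self._current_class == "Class2" and self._last_category is not None:
            self.pairs[self._last_category] = text


def extract_pairs(html):
    """Parse `html` and return a dict mapping each category to its text."""
    parser = CategoryPairs()
    parser.feed(html)
    parser.close()
    return parser.pairs
```

Looking up `extract_pairs(html)["Category1"]` then retrieves the text by its category rather than by its index, which is the behaviour the question asks for.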
