I need to analyse the contents of a web page. The page uses JavaScript. Can you advise a better way than using Selenium?

If not: when loaded in a browser, the page has these elements:

<div class="js-container">
    <table class="zebra" style="width: 100%;">
        <tbody><tr>
            <th>A</th>
            <th>B</th>
            <th>C</th>
        </tr>
            <tr>
                <td>A1</td>
                <td>A2</td>
                <td>
                    <a href="http://X" style="color: black">T1</a>
                </td>
            </tr>
            <tr>
                ....
            </tr>
....

I need to read the table, element by element. For example, I run:

myList = myDriver.find_elements_by_class_name("js-container")

How do I then get the inner elements of the "js-container" object?

The resulting myList has only one element. print(myList[0]) gives:

<selenium.webdriver.remote.webelement.WebElement (session="61238", element="{71293}")>

2 Answers

Maybe you need BeautifulSoup: feed it Selenium's driver.page_source. It is a Python library that builds a parse tree from the web page. See the BeautifulSoup documentation.
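
A minimal sketch of that approach, assuming the bs4 package is installed. The page_source string here is a hard-coded stand-in built from the table in the question; in real use it would come from the Selenium driver's page_source. The read_table helper is a hypothetical name used only for illustration.

```python
from bs4 import BeautifulSoup

# Stand-in for myDriver.page_source (assumption: the rendered page looks like the question's snippet).
page_source = """
<div class="js-container"><table class="zebra" style="width: 100%;">
<tbody>
<tr><th>A</th><th>B</th><th>C</th></tr>
<tr><td>A1</td><td>A2</td><td><a href="http://X" style="color: black">T1</a></td></tr>
</tbody></table></div>
"""

def read_table(html):
    """Return the table as a list of rows; each cell is a (text, href-or-None) tuple."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one("div.js-container table.zebra")
    rows = []
    for tr in table.find_all("tr"):
        row = []
        for cell in tr.find_all(["th", "td"]):
            link = cell.find("a")  # None when the cell holds plain text
            row.append((cell.get_text(strip=True), link["href"] if link else None))
        rows.append(row)
    return rows

rows = read_table(page_source)
for row in rows:
    print(row)
```

Each row comes back as plain Python data, so the table can be walked cell by cell without further calls to the browser.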


3 Comments

Do you mean that the page changes while it loads, and you need the resulting page?
While trying Selenium I found this answer: stackoverflow.com/a/30103931/5359105. Try browser.page_source to get the page and pass it to BS.
@Ben Lee, looks like browser.page_source does it. Thank you. I wonder why it's not documented as one of the main features of Selenium.

Selenium can do this just fine.

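# All descendants of the table with class "zebra"
tableDescendants = myDriver.find_elements_by_css_selector("table.zebra *")
for tableDescendant in tableDescendants:
    outer = tableDescendant.get_attribute("outerHTML")
    inner = tableDescendant.get_attribute("innerHTML")
    # Keep only the part of outerHTML before innerHTML, i.e. the element's own opening tag
    print(outer[:outer.find(inner)])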

This code grabs all descendants of the TABLE tag and, for each one, cuts its outerHTML at the point where the innerHTML string starts and prints what is left. outerHTML contains the element itself plus all of its descendants, while innerHTML contains only the descendants. So, to get only the HTML of the element itself (its opening tag), we remove innerHTML and everything after it from outerHTML.
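
The slicing step can be tried without a browser. The sketch below is only an illustration on hard-coded strings taken from the question's table; opening_tag is a hypothetical helper name, and the branch for an empty innerHTML is an added assumption (str.find("") returns 0, so the plain slice would give an empty string for elements with no content).

```python
def opening_tag(outer_html, inner_html):
    """Return the part of outer_html that precedes inner_html (the element's own opening tag)."""
    if not inner_html:
        # Empty element: cut after the first ">" instead.
        # Assumes no ">" appears inside an attribute value.
        return outer_html[:outer_html.find(">") + 1]
    # Note: find() returns the first match, so this assumes the inner text
    # does not also occur inside the opening tag's attributes.
    return outer_html[:outer_html.find(inner_html)]

outer = '<a href="http://X" style="color: black">T1</a>'
inner = "T1"
print(opening_tag(outer, inner))    # <a href="http://X" style="color: black">
print(opening_tag("<td></td>", ""))  # <td>
```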

4 Comments

Thank you. How do I specify a table name with spaces?
It sounds like you are asking a new question. If not, please clarify what you are asking.
I mean if it is not class="zebra" but, e.g., class="zebra ver2".
CSS selectors are the way to go there. The basic format is <tagname>.<classname1>.<classname2>, e.g. table.zebra.ver2. Check out these resources: CSS Selector Reference and CSS Selector Tips.
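
As a small illustration of that format: a class attribute with spaces is a list of separate class names, and each one becomes a ".name" part of the selector. The class_selector helper below is a hypothetical name, not part of Selenium.

```python
def class_selector(tag, class_attr):
    """Build a CSS selector from a tag name and a space-separated class attribute value."""
    return tag + "".join("." + name for name in class_attr.split())

selector = class_selector("table", "zebra ver2")
print(selector)  # table.zebra.ver2
```

The resulting string can be used wherever the answer uses "table.zebra", for example with " *" appended to select all descendants.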
