
I am trying to scrape a table consisting of 45 columns and 7 rows. The table is loaded via AJAX and I can't access the API, so I need to use Selenium in Python. I am close to getting what I want, but I don't know how to turn my Selenium `find_elements` results into a pandas DataFrame. So far, my code looks like this:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time

driver = webdriver.Chrome()
url = "http://www.hctiming.com/myphp/resources/login/browse_results.php?live_action=yes&smartphone_action=no" #a redirect to a login page occurs
driver.get(url)
driver.find_element(By.ID, "open").click()

user = driver.find_element(By.NAME, "username")
password = driver.find_element(By.NAME, "password")
user.clear()
user.send_keys("MyUserNameWhichIWillNotShare")
password.clear()
password.send_keys("MyPasswordWhichIWillNotShare")
driver.find_element(By.NAME, "submit").click()

try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.LINK_TEXT, "Results Services")) # I must first click in this line
    )
    element.click()

    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.LINK_TEXT, "View Live")) # Then I must click in this link. Now I have access to the result database
    )
    element.click()

except Exception:
    driver.quit()
    raise

time.sleep(5) # I have set a sleep of 5 seconds. There must be a better way to accomplish this; I just want to make sure that the table is loaded before I try to scrape it

columns = len(driver.find_elements(By.XPATH, "/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/thead/tr[2]/th"))
rows = len(driver.find_elements(By.XPATH, "/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/tbody/tr"))
print(columns, rows)

The last line prints 45 and 7, so this seems to work. However, I don't understand how I can make a DataFrame out of it. Thank you.

1 Answer
It's hard to tell without seeing the data structure, but if the table is simple, you can try parsing it directly with pandas `read_html`:

df = pd.read_html(driver.page_source)[0]
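To show what this call returns without a live browser, here is a sketch with a made-up HTML snippet standing in for `driver.page_source`. `read_html` returns one DataFrame per `<table>` in the markup; newer pandas versions also expect literal HTML to be wrapped in a `StringIO`:

```python
from io import StringIO

import pandas as pd

# Hypothetical markup standing in for driver.page_source
html = """
<table>
  <thead><tr><th>Name</th><th>Time</th></tr></thead>
  <tbody>
    <tr><td>Skier A</td><td>1:23.4</td></tr>
    <tr><td>Skier B</td><td>1:25.0</td></tr>
  </tbody>
</table>
"""

# read_html parses every <table> it finds and returns a list of DataFrames;
# [0] picks the first one
df = pd.read_html(StringIO(html))[0]
```

The `<th>` cells become the column labels, so `df` here has columns `Name` and `Time` and two data rows.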

You can also create the DataFrame by iterating through the table cells, adjusting the XPath row and column indices as you go:

data = []
for i in range(rows):
    cells = []
    for c in range(columns):
        # find_element returns a single element; .text extracts the cell value
        cell = driver.find_element(By.XPATH, f"/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/tbody/tr[{i+1}]/td[{c+1}]")
        cells.append(cell.text)
    data.append(cells)
df = pd.DataFrame(data)
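The `find_element` calls need a running browser, but the surrounding list-of-lists pattern can be sketched on its own with dummy cell values in place of the scraped text:

```python
import pandas as pd

rows, columns = 7, 45

# Dummy stand-ins for the cell texts that driver.find_element(...).text
# would return for each (row, column) position
data = [[f"r{i}c{c}" for c in range(columns)] for i in range(rows)]

# One inner list per table row; pandas turns this directly into a 7x45 frame
df = pd.DataFrame(data)
```

Building the full list first and calling `pd.DataFrame` once is also the recommended replacement for the row-by-row `df.append`, which was removed in pandas 2.0.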

2 Comments

Interesting. So if I understand correctly, df = pd.read_html(driver.page_source)[0] somehow browses through all the tables on a page? Because when I changed the index to [1], [2], and so forth, it gave me different tables on the page.
Exactly. It just looks for <table> elements and parses all of them. To minimize the computing effort, you can pass the attrs that the parser should look for. Just read the docs :)
