
I want to scrape a high school summary table after entering all the necessary search information. However, I can't figure out how to do that, since the URL doesn't change after getting onto the school's page. I didn't find anything relevant to what I'm trying to do. Any idea how I can scrape a table after going through the search process? Thank you.

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys
import time

driver = webdriver.Chrome("drivers/chromedriver")

driver.get("https://web3.ncaa.org/hsportal/exec/hsAction")

state_drop = driver.find_element_by_id("state")
state = Select(state_drop)
state.select_by_visible_text("New Jersey")

driver.find_element_by_id("city").send_keys("Galloway")
driver.find_element_by_id("name").send_keys("Absegami High School")
driver.find_element_by_class_name("forms_input_button").send_keys(Keys.RETURN)
driver.find_element_by_id("hsSelectRadio_1").click()

url = driver.current_url
print(url)
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
school_info = soup.find('table', class_="border=")
print(school_info)
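Since the search results only exist in the live browser session, requests.get(driver.current_url) fetches a fresh copy of the page with none of the search state, so the table isn't there. One fix is to parse the HTML Selenium already rendered instead of making a second request. A minimal sketch — the HTML string below is a stand-in for driver.page_source, and the cell class name is the one the results page actually uses:

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source after the search completes; on the real
# page the summary cells share the class "tdTinyFontForWsrDetail".
page_source = """
<table>
  <tr><td class="tdTinyFontForWsrDetail">NCAA High School Code</td>
      <td class="tdTinyFontForWsrDetail">310759</td></tr>
</table>
"""

# Parse the rendered HTML directly instead of re-fetching the URL
soup = BeautifulSoup(page_source, "html.parser")
cells = [td.get_text(strip=True)
         for td in soup.find_all("td", class_="tdTinyFontForWsrDetail")]
print(cells)
```

In the real script you would pass driver.page_source in place of the sample string, after the radio-button click has loaded the results.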
  • Which table do you want to scrape? As there are multiple tables available on the page. Commented Jul 25, 2020 at 9:04
  • As I mentioned in the post, the high school summary table. Commented Jul 25, 2020 at 16:41

1 Answer


Try this:

from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()

driver.get("https://web3.ncaa.org/hsportal/exec/hsAction")

state_drop = driver.find_element_by_id("state")
state = Select(state_drop)
state.select_by_visible_text("New Jersey")

driver.find_element_by_id("city").send_keys("Galloway")
driver.find_element_by_id("name").send_keys("Absegami High School")
driver.find_element_by_class_name("forms_input_button").send_keys(Keys.RETURN)
driver.find_element_by_id("hsSelectRadio_1").click()

#scraping the captions of the tables
all_sub_head = driver.find_elements_by_class_name("tableSubHeaderForWsrDetail")

#scraping all the headers of the tables
all_headers = driver.find_elements_by_class_name("tableHeaderForWsrDetail")

#filtering the desired headers
required_headers = all_headers[5:]

#scraping all the table data
all_contents = driver.find_elements_by_class_name("tdTinyFontForWsrDetail")

#filtering the desired table data
required_contents = all_contents[45:]
    
print("                ",all_sub_head[1].text,"                ")
for i in range(15):
    print(required_headers[i].text, "              >     ", required_contents[i].text )
    
print("execution completed")

OUTPUT

                 High School Summary                 
NCAA High School Code               >      310759
CEEB Code               >      310759
High School Name               >      ABSEGAMI HIGH SCHOOL
Address               >      201 S WRANGLEBORO RD
GALLOWAY
NJ - 08205
Primary Contact Name               >      BONNIE WADE
Primary Contact Phone               >      609-652-1485
Primary Contact Fax               >      609-404-9683
Primary Contact Email               >      [email protected]
Secondary Contact Name               >      MR. DANIEL KERN
Secondary Contact Phone               >      6096521372
Secondary Contact Fax               >      6094049683
Secondary Contact Email               >      [email protected]
School Website               >      http://www.gehrhsd.net/
Link to Online Course Catalog/Program of Studies               >      Not Available
Last Update of List of NCAA Courses               >      12-Feb-20
execution completed

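The index-paired loop above can also collect the fields into a dict, which makes individual summary values easy to look up by name. A sketch, with plain strings standing in for the .text values of the Selenium element lists:

```python
# Stand-ins for required_headers[i].text and required_contents[i].text
required_headers = ["NCAA High School Code", "CEEB Code", "High School Name"]
required_contents = ["310759", "310759", "ABSEGAMI HIGH SCHOOL"]

# Pair each header with its cell value by position
summary = dict(zip(required_headers, required_contents))
print(summary["High School Name"])  # -> ABSEGAMI HIGH SCHOOL
```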


4 Comments

use this driver = webdriver.Chrome("drivers/chromedriver") instead of driver = webdriver.Chrome()
can you explain required_contents = all_contents[45:] a little more?
there are three tables on the page: High School Account Status, High School Summary, and High School Information. The common thing between them is that the captions (the blue text) are stored under a common class tableSubHeaderForWsrDetail, all the text with a yellow background is stored under another common class tableHeaderForWsrDetail, and all the table data is stored under a common class tdTinyFontForWsrDetail
so required_contents = all_contents[45:] simply slices off the table data of High School Account Status, i.e. 5×9 = 45 table-data blocks, and stores the remaining table-data blocks of the High School Summary in the required_contents list
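The arithmetic behind that slice can be checked in isolation. A sketch with placeholder strings standing in for the Selenium elements (60 total cells is an arbitrary example figure, not the real count):

```python
# The first table ("High School Account Status") renders as a 5-row by
# 9-column grid, so its cells occupy indices 0..44 of the flat list of
# elements that share the common td class.
rows, cols = 5, 9
offset = rows * cols  # 45

# Placeholder for driver.find_elements_by_class_name("tdTinyFontForWsrDetail")
all_contents = [f"cell_{i}" for i in range(60)]

# Everything from index 45 onward belongs to the later tables
required_contents = all_contents[offset:]
print(len(required_contents))  # -> 15
```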
