
I want to scrape a high school summary table after entering all the necessary search information. However, I can't figure out how to do that, since the URL doesn't change after getting onto the school's page. I didn't find anything relevant to what I'm trying to do. Any idea how I can scrape a table after going through the search process? Thank you.

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys
import time

driver = webdriver.Chrome("drivers/chromedriver")

driver.get("https://web3.ncaa.org/hsportal/exec/hsAction")

state_drop = driver.find_element_by_id("state")
state = Select(state_drop)
state.select_by_visible_text("New Jersey")

driver.find_element_by_id("city").send_keys("Galloway")
driver.find_element_by_id("name").send_keys("Absegami High School")
driver.find_element_by_class_name("forms_input_button").send_keys(Keys.RETURN)
driver.find_element_by_id("hsSelectRadio_1").click()

url = driver.current_url
print(url)
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
school_info = soup.find('table', class_="border=")
print(school_info)
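Since the search results only exist in the live browser session, requests.get(driver.current_url) fetches a fresh copy of the page with none of the search state, so the table isn't there. One fix is to parse the HTML Selenium already rendered instead of making a second request. A minimal sketch — the HTML string below is a stand-in for driver.page_source, and the cell class name is the one the results page actually uses:

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source after the search completes; on the real
# page the summary cells share the class "tdTinyFontForWsrDetail".
page_source = """
<table>
  <tr><td class="tdTinyFontForWsrDetail">NCAA High School Code</td>
      <td class="tdTinyFontForWsrDetail">310759</td></tr>
</table>
"""

# Parse the rendered HTML directly instead of re-fetching the URL
soup = BeautifulSoup(page_source, "html.parser")
cells = [td.get_text(strip=True)
         for td in soup.find_all("td", class_="tdTinyFontForWsrDetail")]
print(cells)
```

In the real script you would pass driver.page_source in place of the sample string, after the radio-button click has loaded the results.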
  • Which table do you want to scrape? As there are multiple tables available on the page. Commented Jul 25, 2020 at 9:04
  • As I mentioned in the post, the high school summary table. Commented Jul 25, 2020 at 16:41

1 Answer


Try this:

from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()

driver.get("https://web3.ncaa.org/hsportal/exec/hsAction")

state_drop = driver.find_element_by_id("state")
state = Select(state_drop)
state.select_by_visible_text("New Jersey")

driver.find_element_by_id("city").send_keys("Galloway")
driver.find_element_by_id("name").send_keys("Absegami High School")
driver.find_element_by_class_name("forms_input_button").send_keys(Keys.RETURN)
driver.find_element_by_id("hsSelectRadio_1").click()

#scraping the captions of the tables
all_sub_head = driver.find_elements_by_class_name("tableSubHeaderForWsrDetail")

#scraping all the headers of the tables
all_headers = driver.find_elements_by_class_name("tableHeaderForWsrDetail")

#filtering the desired headers
required_headers = all_headers[5:]

#scraping all the table data
all_contents = driver.find_elements_by_class_name("tdTinyFontForWsrDetail")

#filtering the desired table data
required_contents = all_contents[45:]
    
print("                ",all_sub_head[1].text,"                ")
for i in range(15):
    print(required_headers[i].text, "              >     ", required_contents[i].text )
    
print("execution completed")

OUTPUT

                 High School Summary                 
NCAA High School Code               >      310759
CEEB Code               >      310759
High School Name               >      ABSEGAMI HIGH SCHOOL
Address               >      201 S WRANGLEBORO RD
GALLOWAY
NJ - 08205
Primary Contact Name               >      BONNIE WADE
Primary Contact Phone               >      609-652-1485
Primary Contact Fax               >      609-404-9683
Primary Contact Email               >      [email protected]
Secondary Contact Name               >      MR. DANIEL KERN
Secondary Contact Phone               >      6096521372
Secondary Contact Fax               >      6094049683
Secondary Contact Email               >      [email protected]
School Website               >      http://www.gehrhsd.net/
Link to Online Course Catalog/Program of Studies               >      Not Available
Last Update of List of NCAA Courses               >      12-Feb-20
execution completed

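The index-paired loop above can also collect the fields into a dict, which makes individual summary values easy to look up by name. A sketch, with plain strings standing in for the .text values of the Selenium element lists:

```python
# Stand-ins for required_headers[i].text and required_contents[i].text
required_headers = ["NCAA High School Code", "CEEB Code", "High School Name"]
required_contents = ["310759", "310759", "ABSEGAMI HIGH SCHOOL"]

# Pair each header with its cell value by position
summary = dict(zip(required_headers, required_contents))
print(summary["High School Name"])  # -> ABSEGAMI HIGH SCHOOL
```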


4 Comments

use this driver = webdriver.Chrome("drivers/chromedriver") instead of driver = webdriver.Chrome()
can you explain required_contents = all_contents[45:] a little more?
there are three tables on the page: High School Account Status, High School Summary, and High School Information. The common thing between them is that the captions (the blue text) are stored under a common class tableSubHeaderForWsrDetail, all the text with a yellow background is stored under another common class tableHeaderForWsrDetail, and all the table data is stored under a common class tdTinyFontForWsrDetail
so required_contents = all_contents[45:] simply slices off the table data of High School Account Status, i.e. 5×9 = 45 table-data blocks, and stores the remaining table-data blocks of the High School Summary in the required_contents list
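The arithmetic behind that slice can be checked in isolation. A sketch with placeholder strings standing in for the Selenium elements (60 total cells is an arbitrary example figure, not the real count):

```python
# The first table ("High School Account Status") renders as a 5-row by
# 9-column grid, so its cells occupy indices 0..44 of the flat list of
# elements that share the common td class.
rows, cols = 5, 9
offset = rows * cols  # 45

# Placeholder for driver.find_elements_by_class_name("tdTinyFontForWsrDetail")
all_contents = [f"cell_{i}" for i in range(60)]

# Everything from index 45 onward belongs to the later tables
required_contents = all_contents[offset:]
print(len(required_contents))  # -> 15
```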
