I am trying to scrape the CDC website for the number of COVID-19 cases reported in the last 7 days: https://covid.cdc.gov/covid-data-tracker/#cases_casesinlast7days I've tried to find the table by name, id, and class, and the lookup always returns None. When I print the scraped HTML, I can't locate the table manually either. Not sure what I'm doing wrong here. Once the data is imported, I need to load it into a pandas DataFrame for graphing later, and export the table as a CSV.
For extra information: it appears that the table is generated in JavaScript, so Selenium will need to be used to get this data. – Taylor Killen, Oct 17, 2020 at 19:33
What Taylor says is right. Additionally, I see that there is a "Download" button on the page, so you might just try that (with Selenium). – qmeeus, Oct 17, 2020 at 19:38
1 Answer
You might as well request the data from the API directly (check the Network tab in your browser while refreshing the page):
import requests
import pandas as pd

# Endpoint found in the browser's Network tab
endpoint = "https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData"
data = requests.get(endpoint, params={"id": "US_MAP_DATA"}).json()
# The records sit under the "US_MAP_DATA" key of the response
df = pd.DataFrame(data["US_MAP_DATA"])
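The question also asks for a CSV export. Here is a minimal sketch of that step, assuming the response keeps the shape used above (a JSON object whose "US_MAP_DATA" key holds a list of records). The function names and the output filename are hypothetical, and the timeout value is an arbitrary choice:

```python
import pandas as pd
import requests

ENDPOINT = "https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData"


def fetch_payload(endpoint=ENDPOINT, dataset_id="US_MAP_DATA"):
    """Request the JSON payload; raises on HTTP errors or timeouts."""
    response = requests.get(endpoint, params={"id": dataset_id}, timeout=30)
    response.raise_for_status()
    return response.json()


def records_to_frame(payload, key="US_MAP_DATA"):
    """Build a DataFrame from the list of records stored under `key`."""
    return pd.DataFrame(payload[key])


def export_cases(path="cdc_cases.csv"):
    """Fetch the data, write it to `path` as CSV, and return the DataFrame."""
    df = records_to_frame(fetch_payload())
    # index=False keeps the row index out of the CSV file
    df.to_csv(path, index=False)
    return df
```

Calling export_cases() should write the file and hand back the DataFrame for plotting. Keeping the parsing in records_to_frame, separate from the network call, makes it possible to check the parsing against a saved response without hitting the site.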
EDIT: Trying to make this answer more general and useful.
How did you discern that this was how to parse the data?
First, open the browser's developer tools (Ctrl + Shift + I) and go to the Network tab.
Second, refresh the page to record network activity.
Where to look?
Filter by XHR to limit the number of records (1);
Click through the records (2) and check their preview responses (3) to see whether they contain the data you need.
It doesn't always work, but when it does, requesting data from the API directly is much easier than writing scrapers with requests / bs4 / Selenium, and it should be the first choice.
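Once a candidate request turns up in the Network tab, it can help to replay it outside the browser and look at the top-level shape of the response before building a DataFrame. This is a rough sketch, not part of the original answer: describe_payload and fetch_json are hypothetical helpers, and it assumes the response is a JSON object.

```python
import requests


def fetch_json(url, params=None):
    """Replay a request seen in the Network tab and decode its JSON body."""
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    return response.json()


def describe_payload(payload):
    """Map each top-level key of a JSON object to a short description."""
    summary = {}
    for key, value in payload.items():
        if isinstance(value, list):
            summary[key] = f"list of {len(value)} items"
        elif isinstance(value, dict):
            summary[key] = f"dict with {len(value)} keys"
        else:
            summary[key] = type(value).__name__
    return summary
```

For the endpoint above, describe_payload(fetch_json(endpoint, {"id": "US_MAP_DATA"})) should show which key holds the list of records, which is the one to pass to pd.DataFrame.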


