
I am new to coding and need some assistance. I am trying to make a web scraper for a project that involves scraping NFL roster data from 2000 to 2023, but am getting an error when requesting the HTML. I am using JupyterLab (Python-Pyodide) to write my code, and this is the only code I have:

import requests
from bs4 import BeautifulSoup
import pandas as pd
from io import StringIO

years = list(range(2000, 2024))
url = 'https://www.footballdb.com/teams/nfl/arizona-cardinals/roster/2023'
data = requests.get(url)

This is the error I'm getting:

(JsException: NetworkError: Failed to execute 'send' on 'XMLHttpRequest': Failed to load 'https://www.footballdb.com/teams/nfl/arizona-cardinals/roster/2023'.)

Can you explain why I am getting this error and how to fix it?

2 Answers


You didn't specify the request headers. Also, this page doesn't use table tags, so you can't use pd.read_html; you have to walk the div-based table yourself:

import requests
from bs4 import BeautifulSoup
import pandas as pd


url = "https://www.footballdb.com/teams/nfl/arizona-cardinals/roster/2023"
headers = {
  'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
  'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36'
}
result = []
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'lxml')
# The roster is a div-based table, not a <table>, so locate it by class
table = soup.find('div', class_='divtable divtable-striped divtable-mobile')
# Column names live in the 'thead' div
table_head = [head.get_text() for head in table.find('div', class_='thead')]
# Drop the mobile-only labels so they don't pollute the cell text
for s in table.find_all('span', class_='visible-xs-inline'):
    s.extract()
# Each row is a div.tr and each cell a div.td; zip cells with the header
for row in table.find_all('div', class_='tr'):
    result.append(dict(zip(table_head, [cell.get_text() for cell in row.find_all('div', class_='td')])))
df = pd.DataFrame(result)
print(df)
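Since the question defines a `years` list that goes unused, here is a sketch of how the same scrape could be extended to all seasons. The URL pattern for earlier years is an assumption based on the single 2023 URL in the question; verify it holds before scraping them all.

```python
# Sketch: build the roster URL for each season. The URL template is an
# assumption generalized from the one 2023 URL in the question.
BASE = "https://www.footballdb.com/teams/nfl/{team}/roster/{year}"

def roster_url(team_slug: str, year: int) -> str:
    """Return the roster URL for one team and one season."""
    return BASE.format(team=team_slug, year=year)

years = range(2000, 2024)
urls = [roster_url("arizona-cardinals", y) for y in years]

# Each URL can then be fetched (with the headers above and a polite delay)
# and parsed exactly as in the answer, tagging each frame with its year.
```

Concatenating the per-year frames with `pd.concat` afterwards would give one DataFrame covering all seasons.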

OUTPUT:

     #            Player Pos   G  GS Age            College
0   82   Andre Baccellia  WR   5   0  26         Washington
1    3       Budda Baker  DB  12  12  27         Washington
2   96        Eric Banks  DE   2   0  25  Texas-San Antonio
3   51       Krys Barnes  LB  16   6  25               UCLA
4   66    Jackson Barton  OT   1   0  28               Utah
..  ..               ...  ..  ..  ..  ..                ...
73  21  Garrett Williams  DB   9   6  22           Syracuse
74  27     Divaad Wilson  DB   2   1  23    Central Florida
75  20      Marco Wilson  DB  15  11  24            Florida
76  14    Michael Wilson  WR  13  12  23           Stanford
77  10        Josh Woods  LB  11   7  27           Maryland

6 Comments

I just tried this on my end but it still did not work. I still got errors such as JSException, _RequestError, HTTPException, ProtocolError, and ConnectionError. Do I have to change the 'accept': ...' part on my end? Or is there some other reason I am getting these errors?
@RaulOjeda Then why did you mark it as accepted? @Sergey - And I am seeing AttributeError: 'NoneType' object has no attribute 'find' on the line table_head = [head.get_text() for head in table.find('div', class_='thead')]. Importantly, at present the code given in this answer won't work where the OP specified: "I am using Jupyter labs (Python-Pyodide) to write my code". The network ability of JupyterLite is restricted by security in the browser. You cannot directly translate what works for an ipykernel to a pyodide-based kernel at this time, without accommodations.
@Wayne my bad, I used a plain Linux terminal for the test; I didn't see that the OP asked about Jupyter labs.
Understandable. It's not just typical JupyterLab, either: they specifically meant JupyterLite, which has a JupyterLab flavor.
@Wayne How do you make those accommodations in the browser? Or is it easier to just do it from the desktop version?

You need to send headers with your GET request, specifically User-Agent. Sending this value makes the request look as if it comes from a browser (i.e. a real person) rather than a bot/scraper. You can find your value easily by Googling "what is my user agent". Copy that entire string; you will need it in a minute.

Declare a dict using the value you copied:

my_headers = {
    "User-Agent": "<YOUR_VALUE>"
}

Pass headers as an argument in the get method:

my_url = "https://www.footballdb.com/teams/nfl/arizona-cardinals/roster/2023"
data = requests.get(url=my_url, headers=my_headers)
print(data.content) # just to confirm you got the response back

Here is the scenic route to get your User-Agent and see what values are/could be there in "headers", if you're interested:

  1. Hit F12 on your keyboard when viewing this page. The developer tools will open up.
  2. Navigate to the "Network" tab
  3. Choose "All"
  4. If you don't see anything, no worries; just refresh the page
  5. Click on an item, you will see another section pop up
  6. Click on "Headers" and scroll down until you find "User-Agent"
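The same idea can be checked offline with the standard library's urllib, shown here as a sketch with a placeholder User-Agent string (substitute the value you copied from your browser):

```python
import urllib.request

# Placeholder User-Agent; substitute your real browser value here.
my_headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

req = urllib.request.Request(
    "https://www.footballdb.com/teams/nfl/arizona-cardinals/roster/2023",
    headers=my_headers,
)
# The header is attached to the Request object before any network call,
# so you can inspect it without actually hitting the site:
print(req.get_header("User-agent"))
```

Passing `req` to `urllib.request.urlopen` would then send the request with that header, just as `requests.get(url, headers=...)` does.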

Comments
