This page uses JavaScript to detect scrapers and to display the second table.
You may need to use Selenium to control a real web browser, which can run JavaScript.
You can get the full HTML using driver.page_source (and search both tables), or you can use Selenium to get the HTML of only the second table.
This table has id="stats_standard", which helps to find it.
To get the HTML of the table element you may need .get_attribute('outerHTML').
Later you need to wrap the HTML in io.StringIO(html) so pd.read_html() can read the data from the table(s).
Because you want to run it in Colab, you may need to install Chrome and use it in headless mode.
When you get the DataFrame you may also need to clean the headers, because read_html() creates multi-level headers, and remove extra rows with repeated headers.
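As a minimal illustration of the io.StringIO() + read_html() step, here is a toy HTML string standing in for what the browser would return (not FBref data):

```python
import io
import pandas as pd

# Toy HTML standing in for the table HTML taken from the browser.
html = """
<table id="stats_standard">
  <tr><th>Rk</th><th>Player</th></tr>
  <tr><td>1</td><td>Alice</td></tr>
  <tr><td>2</td><td>Bob</td></tr>
</table>
"""

# read_html() expects a file-like object (or URL), so wrap the string in StringIO.
dfs = pd.read_html(io.StringIO(html))
print(len(dfs))   # number of tables found
print(dfs[0])
```

read_html() always returns a list of DataFrames, one per table found in the HTML, which is why the real code below indexes dfs[0].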
import pandas as pd
import io
from selenium import webdriver
from selenium.webdriver.common.by import By
# ---
import selenium
print('Selenium:', selenium.__version__)
# ---
url = 'https://fbref.com/en/comps/9/2023-2024/stats/2023-2024-Premier-League-Stats'
options = webdriver.ChromeOptions()
#options.add_argument("--headless") # classic headless mode
options.add_argument("--headless=new") # new headless mode with more capabilities, added in 2022
# https://www.selenium.dev/blog/2023/headless-is-going-away/
driver = webdriver.Chrome(options=options) # the newest Selenium automatically downloads the driver - so it doesn't need `service=`
driver.get(url)
#driver.implicitly_wait(5) # let find_element() wait up to 5s for JavaScript to create the second table
#html = driver.page_source # get the full HTML
#print(html)
#dfs = pd.read_html(io.StringIO(html))
#print(f"{len(dfs) = }")
#for df in dfs:
# print(df.head())
# --- get HTML only with second table ---
#table = driver.find_element(By.XPATH, '//table[@id="stats_standard"]')
table = driver.find_element(By.ID, 'stats_standard')
html = table.get_attribute('outerHTML')
#print(html)
driver.quit()
# --- DataFrame ---
dfs = pd.read_html(io.StringIO(html))
#print(f'{len(dfs) = }')
#print(dfs[0])
#print('--')
df = dfs[0]
# - clean headers: drop the auto-generated 'Unnamed: ...' top level, otherwise join both levels
vals = [b if a.startswith('Un') else f'{a}: {b}' for a, b in df.columns]
df.columns = vals
print(df.columns)
#print(df[['Rk', 'Player']])
# - remove extra rows with repeated headers
df = df[ df['Rk'] != 'Rk' ].reset_index(drop=True) # drop=True so the old index isn't kept as a column
print(df[['Rk', 'Player']])
Result:
Selenium: 4.35.0
Index(['Rk', 'Player', 'Nation', 'Pos', 'Squad', 'Age', 'Born',
'Playing Time: MP', 'Playing Time: Starts', 'Playing Time: Min',
'Playing Time: 90s', 'Performance: Gls', 'Performance: Ast',
'Performance: G+A', 'Performance: G-PK', 'Performance: PK',
'Performance: PKatt', 'Performance: CrdY', 'Performance: CrdR',
'Expected: xG', 'Expected: npxG', 'Expected: xAG', 'Expected: npxG+xAG',
'Progression: PrgC', 'Progression: PrgP', 'Progression: PrgR',
'Per 90 Minutes: Gls', 'Per 90 Minutes: Ast', 'Per 90 Minutes: G+A',
'Per 90 Minutes: G-PK', 'Per 90 Minutes: G+A-PK', 'Per 90 Minutes: xG',
'Per 90 Minutes: xAG', 'Per 90 Minutes: xG+xAG', 'Per 90 Minutes: npxG',
'Per 90 Minutes: npxG+xAG', 'Matches'],
dtype='object')
Rk Player
0 1 Max Aarons
1 2 Joshua Acheampong
2 3 Tyler Adams
3 4 Tosin Adarabioyo
4 5 Elijah Adebayo
.. ... ...
575 576 Nicolò Zaniolo
576 577 Anass Zaroury
577 578 Oleksandr Zinchenko
578 579 Kurt Zouma
579 580 Martin Ødegaard
[580 rows x 2 columns]
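The header-cleaning steps above can be shown on a small synthetic DataFrame with the same layout that read_html() produces (toy values, not FBref's):

```python
import pandas as pd

# Synthetic frame mimicking read_html() output: multi-level columns and a
# repeated header row mixed into the data.
df = pd.DataFrame(
    [["1", "Alice", "38"], ["Rk", "Player", "MP"], ["2", "Bob", "25"]],
    columns=pd.MultiIndex.from_tuples(
        [("Unnamed: 0", "Rk"), ("Unnamed: 1", "Player"), ("Playing Time", "MP")]
    ),
)

# Flatten: keep only the bottom level when pandas auto-named the top one
# "Unnamed: ...", otherwise join both levels - same rule as in the code above.
df.columns = [b if a.startswith("Un") else f"{a}: {b}" for a, b in df.columns]

# Remove the repeated header rows and renumber the index.
df = df[df["Rk"] != "Rk"].reset_index(drop=True)
print(df)
```

After cleaning, the columns become flat strings like 'Playing Time: MP' and the rows containing repeated headers are gone, matching the Result shown above.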
read_html() can't run JavaScript. You would need to use Selenium to control a real web browser which can run JavaScript, get the HTML from the browser (as text) and send it to read_html() using io.StringIO(html). You could also use DevTools in the browser to see if the page reads this data from some other URL - then read it with requests, get the part of the HTML with the table, and use it with io.StringIO(html) in read_html().
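A minimal sketch of the DevTools + requests route (the URL below is hypothetical - you would find the real data URL in the Network tab of DevTools):

```python
import io

import pandas as pd
import requests


def fetch_tables(url: str) -> list[pd.DataFrame]:
    """GET a page or endpoint and parse all HTML tables it contains."""
    # Some sites block the default requests User-Agent, so send a browser-like one.
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    response.raise_for_status()
    return pd.read_html(io.StringIO(response.text))


# Hypothetical usage - replace with the URL found in DevTools:
# dfs = fetch_tables('https://example.com/some-data-endpoint')
```

This only works when the data is delivered as ready-made HTML (or can be converted to it); if the endpoint returns JSON, use response.json() with pd.DataFrame() instead of read_html().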