
I would like to scrape the 2nd table on the page at https://fbref.com/en/comps/9/2023-2024/stats/2023-2024-Premier-League-Stats in Google Colab, but pd.read_html only gives me the first table.

Thank you for your help.

I tried the code below in Google Colab, but it only prints the first table even when I change the index to 1.


  • The other tables are commented out in the raw HTML: stackoverflow.com/a/76911245 Commented Sep 6 at 11:56
  • Always put code as text, not as an image. Commented Sep 6 at 19:06
  • First, turn off JavaScript in your browser and load the page to see what you get without JavaScript. This shows that the second table is not created when JavaScript doesn't run, and read_html() can't run JavaScript. You would need to use Selenium to control a real web browser which can run JavaScript, get the HTML from the browser (as text), and send it to read_html() via io.StringIO(html). You could also use DevTools in the browser to check whether the page loads this data from some other URL, and read that URL with requests. Commented Sep 6 at 20:41
  • And if the table is in the original HTML but hidden, then you may need to fetch the page with requests, extract that part of the HTML, and pass it to read_html() via io.StringIO(html). Commented Sep 6 at 20:46
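The first and last comments above describe the requests-based route: fbref ships its secondary tables inside HTML comments, so stripping the comment markers makes them visible to the parser. A minimal sketch of that idea, using a stand-in HTML string instead of the live page (the live site may block plain requests):

```python
import io
import pandas as pd

# Stand-in for the fbref page source: the second table is wrapped in an
# HTML comment, as fbref serves its secondary tables.
html = """
<table id="league"><tr><th>Squad</th></tr><tr><td>Arsenal</td></tr></table>
<!--
<table id="stats_standard"><tr><th>Player</th></tr><tr><td>Max Aarons</td></tr></table>
-->
"""

# pd.read_html skips commented-out tables
print(len(pd.read_html(io.StringIO(html))))  # 1

# Stripping the comment markers exposes the second table to the parser
dfs = pd.read_html(io.StringIO(html.replace("<!--", "").replace("-->", "")))
print(len(dfs))  # 2
print(dfs[1])
```

For the real page you would fetch `html` with requests first; the un-commenting trick is the same.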

3 Answers


This page uses JavaScript to detect scrapers and to display the second table.

You may need to use Selenium to control a real web browser which can run JavaScript.

You can get the full HTML using driver.page_source (and search for both tables), or you can use Selenium to get the HTML of the second table only.

This table has id="stats_standard", which helps to find it. To get the table's HTML, use .get_attribute('outerHTML').

Then pass io.StringIO(html) to pd.read_html() to get the data from the table(s).

Because you want to run this in Colab, you may need to install Chrome and use it in headless mode.

Once you have the DataFrame, you may also need to clean the headers, because pandas creates multi-level headers. You may also need to remove the rows with repeated headers.

import pandas as pd
import io

from selenium import webdriver
from selenium.webdriver.common.by import By

# ---

import selenium
print('Selenium:', selenium.__version__)

# ---

url = 'https://fbref.com/en/comps/9/2023-2024/stats/2023-2024-Premier-League-Stats'

options = webdriver.ChromeOptions()
#options.add_argument("--headless=chrome")  # standard
options.add_argument("--headless=new")     # with more capabilities added in 2022
                                            # https://www.selenium.dev/blog/2023/headless-is-going-away/

driver = webdriver.Chrome(options=options)  # the newest Selenium will automatically download driver - so it doesn't need `service=`

driver.get(url)

#driver.implicitly_wait(5)  # time for JavaScript for creating second table

#html = driver.page_source  # get Full HTML
#print(html)

#dfs = pd.read_html(io.StringIO(html))

#print(f"{len(dfs) = }")
#for df in dfs:
#    print(df.head())

# --- get HTML only with second table ---

#table = driver.find_element(By.XPATH, '//table[@id="stats_standard"]')
table = driver.find_element(By.ID, 'stats_standard')

html = table.get_attribute('outerHTML')
#print(html)

driver.quit()

# --- DataFrame ---

dfs = pd.read_html(io.StringIO(html))

#print(f'{len(dfs) = }')
#print(dfs[0])
#print('--')

df = dfs[0]  

# - clean headers

vals = [b if a.startswith('Un') else f'{a}: {b}' for a, b in df.columns]
df.columns = vals
print(df.columns)

#print(df[['Rk', 'Player']])

# - remove extra rows with headers

df = df[df['Rk'] != 'Rk'].reset_index(drop=True)  # drop=True avoids keeping the old index as a column

print(df[['Rk', 'Player']])

Result:

Selenium: 4.35.0

Index(['Rk', 'Player', 'Nation', 'Pos', 'Squad', 'Age', 'Born',
       'Playing Time: MP', 'Playing Time: Starts', 'Playing Time: Min',
       'Playing Time: 90s', 'Performance: Gls', 'Performance: Ast',
       'Performance: G+A', 'Performance: G-PK', 'Performance: PK',
       'Performance: PKatt', 'Performance: CrdY', 'Performance: CrdR',
       'Expected: xG', 'Expected: npxG', 'Expected: xAG', 'Expected: npxG+xAG',
       'Progression: PrgC', 'Progression: PrgP', 'Progression: PrgR',
       'Per 90 Minutes: Gls', 'Per 90 Minutes: Ast', 'Per 90 Minutes: G+A',
       'Per 90 Minutes: G-PK', 'Per 90 Minutes: G+A-PK', 'Per 90 Minutes: xG',
       'Per 90 Minutes: xAG', 'Per 90 Minutes: xG+xAG', 'Per 90 Minutes: npxG',
       'Per 90 Minutes: npxG+xAG', 'Matches'],
      dtype='object')

      Rk               Player
0      1           Max Aarons
1      2    Joshua Acheampong
2      3          Tyler Adams
3      4     Tosin Adarabioyo
4      5       Elijah Adebayo
..   ...                  ...
575  576       Nicolò Zaniolo
576  577        Anass Zaroury
577  578  Oleksandr Zinchenko
578  579           Kurt Zouma
579  580      Martin Ødegaard

[580 rows x 2 columns]



If Selenium is an option for you, you can identify the 2nd table by its id (stats_standard), and then it's very straightforward:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver import Chrome
import pandas

# pylint: disable=possibly-used-before-assignment

URL = "https://fbref.com/en/comps/9/2023-2024/stats/2023-2024-Premier-League-Stats"
TIMEOUT = 5  # wait timeout
REJECT = True  # reject non-essential cookies


def click_through() -> None:
    if REJECT:
        wait = WebDriverWait(DRIVER, TIMEOUT)
        ec = EC.element_to_be_clickable
        loc = (By.CSS_SELECTOR, "button.osano-cm-button--type_denyAll")
        wait.until(ec(loc)).click()


def get_columns() -> list[str]:
    wait = WebDriverWait(DRIVER, TIMEOUT)
    ec = EC.visibility_of_all_elements_located
    loc = (By.CSS_SELECTOR, "#stats_standard thead tr")
    trs = wait.until(ec(loc))
    assert len(trs) == 2
    return [e.text for e in trs[1].find_elements(By.CSS_SELECTOR, "th")]


def get_stats_table_trs():
    wait = WebDriverWait(DRIVER, TIMEOUT)
    ec = EC.visibility_of_all_elements_located
    loc = (By.CSS_SELECTOR, "#stats_standard tbody tr")
    return wait.until(ec(loc))


def get_content() -> list[list[str]]:
    content: list[list[str]] = []
    for tr in get_stats_table_trs():
        if "thead" not in tr.get_attribute("class"):
            row = [e.text for e in tr.find_elements(By.CSS_SELECTOR, "th,td")]
            if __debug__:
                print(row)
            content.append(row)
    return content


if __name__ == "__main__":
    with Chrome() as DRIVER:
        DRIVER.get(URL)
        click_through()
        df = pandas.DataFrame(get_content(), columns=get_columns())
        print(df)



The site you are trying to scrape is protected by Cloudflare, which means you cannot just request the page contents and then scrape them with BeautifulSoup or pandas. What you could try is writing a JavaScript snippet that reads the data once you have opened the site in your browser, for example:

stats = [];
rows = Array.from(document.querySelectorAll("#all_stats_standard tr"));
for (const tr of rows) {
  const cells = tr.querySelectorAll("th, td");
  if (!cells.length) continue;

  const fields = {};

  cells.forEach(td => {
    const key = td.dataset.stat || "unknown";
    let val = td.textContent.trim();

    const a = td.querySelector("a");
    if (a) fields[`${key}_url`] = a.getAttribute("href");

    const plain = val.replace(/,/g, "");
    const num = Number(plain);
    if (!Number.isNaN(num) && plain !== "") val = num;

    fields[key] = val;
  });

  stats.push(fields);
}
stats.shift(); // Remove the first 2 headers
stats.shift();
console.log(stats);

Or you can paste the contents of the loaded HTML into Colab and then try your code again.
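For that last option, a minimal sketch: once the rendered HTML is available as a string in Colab (pasted in, or read from a file you saved from the browser), io.StringIO makes it usable with pd.read_html. The `html` string below is a hypothetical stand-in for the real page source:

```python
import io
import pandas as pd

# Hypothetical stand-in for the rendered page source copied out of the
# browser (e.g. document.documentElement.outerHTML in DevTools)
html = """
<table id="stats_standard">
  <tr><th>Rk</th><th>Player</th></tr>
  <tr><td>1</td><td>Max Aarons</td></tr>
</table>
"""

# pd.read_html parses every <table> element it finds in the string
dfs = pd.read_html(io.StringIO(html))
df = dfs[0]
print(df)
```

If you saved the page to a file instead, read it with `open("page.html").read()` and pass the result the same way.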

