Web scraping a table into a pandas DataFrame from fbref.com [duplicate]

Question

I am attempting to web scrape all of the player stats for each team in the Argentinian soccer league: https://fbref.com/en/comps/21/stats/Primera-Division-Stats. My issue is that I am only getting the team data and cannot figure out how to get the individual player data.

I am pretty new to web scraping and pandas. I am not sure if I am just missing something or approaching the problem incorrectly. Any help would be appreciated.

Here is my block of code:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url_argentina_standard = 'https://fbref.com/en/comps/21/stats/Primera-Division-Stats'

data = requests.get(url_argentina_standard)
standard_stats = soup.find_all(class_='stats_table')
temp = pd.read_html(str(standard_stats))[0]
argentina_standard_df = pd.DataFrame(temp)

This code returns the team data, but it shows that only 2 tables have class_='stats_table', and neither of them is the player stats table.

standard_stats = soup.find_all(class_='stats_table')[0]
standard_stats = soup.find_all(class_='stats_table')[1]

I have also tried changing the find_all() call to something like find_all(id='stats_squads_standards_for'), which is the table id of the one I am trying to get.

Ideally, I would like to be able to switch the link to another league, such as https://fbref.com/en/comps/56/stats/Austrian-Bundesliga-Stats, which has the same formatting.

You did not indicate what the problem is.

– Itération 122442, Commented Aug 16, 2023 at 6:49

HedgeHog · Accepted Answer · 2023-08-16 07:38:36Z

The main problem, in my opinion, is that the table you are looking for is wrapped in an HTML comment (BeautifulSoup treats it as a Comment object rather than as tags) and is thus basically "invisible" to find_all().

Regardless, no BeautifulSoup object has been created in the example code that could be accessed via soup.find_all(class_='stats_table').
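To illustrate why a commented-out table does not show up, here is a minimal sketch on a made-up page. The HTML, the ids "squads" and "players", and the cell values are all hypothetical and only mimic the situation described above; the real fbref.com markup may differ in detail. It shows find_all() seeing only the visible table, and then re-parsing each Comment node to reach the hidden one.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical miniature page: one visible table, one wrapped in a comment
html = """
<div id="visible">
  <table class="stats_table" id="squads">
    <tr><th>Squad</th></tr><tr><td>Team A</td></tr>
  </table>
</div>
<div id="hidden">
  <!--
  <table class="stats_table" id="players">
    <tr><th>Player</th></tr><tr><td>Player A</td></tr>
  </table>
  -->
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Only the visible table is part of the parsed tag tree
visible_ids = [t["id"] for t in soup.find_all("table", class_="stats_table")]

# Comments are strings of type Comment; re-parse each one as HTML
hidden_ids = []
for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
    inner = BeautifulSoup(comment, "html.parser")
    hidden_ids += [t["id"] for t in inner.find_all("table", class_="stats_table")]

print(visible_ids)
print(hidden_ids)
```

Re-parsing the comments keeps the rest of the page untouched, at the cost of a little more code than simply stripping the comment markers from the raw text.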

The easiest way to solve the problem without going into depth with regard to BeautifulSoup is to replace/remove the comment characters that lead in and out:

requests.get(url).text.replace('<!--','').replace('-->','')

Then you could use pandas.read_html() with attribute parameters:

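import requests
import pandas as pd
from io import StringIO

url = 'https://fbref.com/en/comps/21/stats/Primera-Division-Stats'

# Strip the comment delimiters so the hidden table becomes regular markup
html = requests.get(url).text.replace('<!--', '').replace('-->', '')

# Recent pandas versions deprecate passing a raw HTML string, so wrap it in StringIO
df = pd.read_html(StringIO(html), attrs={'id': 'stats_standard'})[0]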
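Since the question also asks about switching to other leagues, the same idea could be wrapped in a small helper. This is only a sketch: the names uncomment and read_table_by_id are hypothetical, the sample HTML and its values are made up for demonstration, and it assumes the other league pages really do use the same table id, which should be checked per page.

```python
from io import StringIO

import pandas as pd


def uncomment(html: str) -> str:
    """Remove HTML comment delimiters so commented-out markup gets parsed."""
    return html.replace("<!--", "").replace("-->", "")


def read_table_by_id(html: str, table_id: str) -> pd.DataFrame:
    """Return the first table whose id attribute equals table_id."""
    # pd.read_html raises ValueError if no matching table is found
    tables = pd.read_html(StringIO(uncomment(html)), attrs={"id": table_id})
    return tables[0]


# Made-up stand-in for a downloaded page; the values are not real stats
sample_html = """
<html><body>
<!--
<table id="stats_standard">
  <thead><tr><th>Player</th><th>Gls</th></tr></thead>
  <tbody>
    <tr><td>Player A</td><td>3</td></tr>
    <tr><td>Player B</td><td>1</td></tr>
  </tbody>
</table>
-->
</body></html>
"""

df = read_table_by_id(sample_html, "stats_standard")
print(df)
```

For a live page, the html argument would be requests.get(url).text for whichever league URL is needed. Keeping the download separate from the parsing makes the helper easy to try on saved pages without hitting the site repeatedly.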

edited Aug 16, 2023 at 7:38

answered Aug 16, 2023 at 7:06

1 Comment

Nathan Rogers Over a year ago

That worked awesome, thank you. I was wondering if you actually could go deeper and explain it. I might be able to update my question so that it answers the issue more broadly, and not just specifically for fbref.com.
