I am attempting to web scrape all of the player stats for each team in the Argentinian soccer league: https://fbref.com/en/comps/21/stats/Primera-Division-Stats. My issue is that my code only picks up the team-level data, and I cannot figure out how to get just the individual player data.

I am pretty new to web scraping and pandas. I am not sure if I am just missing something or approaching the problem incorrectly. Any help would be appreciated.

Here is my block of code:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url_argentina_standard = 'https://fbref.com/en/comps/21/stats/Primera-Division-Stats'

data = requests.get(url_argentina_standard)
standard_stats = soup.find_all(class_='stats_table')
temp = pd.read_html(str(standard_stats))[0]
argentina_standard_df = pd.DataFrame(temp)

This code returns the team data, but it shows only 2 tables with class_='stats_table', neither of which is the player stats table.

standard_stats = soup.find_all(class_='stats_table')[0]
standard_stats = soup.find_all(class_='stats_table')[1]

I have also tried changing the find_all() call to something like find_all(id='stats_squads_standards_for'), which is the table id of the one I am trying to get.

Ideally, I would like to be able to switch the link to another league's page, such as https://fbref.com/en/comps/56/stats/Austrian-Bundesliga-Stats, which has the same formatting.

1
  • You did not indicate what the problem is. Commented Aug 16, 2023 at 6:49

1 Answer

1

The main problem, in my opinion, is that the table you are looking for is wrapped in an HTML comment. BeautifulSoup parses it as a Comment object rather than as tags, so it is effectively invisible to find_all().

Separately, the example code never creates a BeautifulSoup object, so soup.find_all(class_='stats_table') would raise a NameError as written.
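
To make the "commented out" point concrete, here is a minimal sketch that runs on a small inline HTML sample rather than the live page. The sample markup and both table ids are stand-ins chosen for illustration (assumptions, not copied from fbref.com). It shows that find_all() only reaches tables outside comments, and that a table inside a comment can still be recovered by re-parsing the text of the Comment node.

```python
from bs4 import BeautifulSoup, Comment

# Made-up sample: one regular table and one table wrapped in an HTML comment.
SAMPLE_HTML = """
<html><body>
<table class="stats_table" id="stats_squads_standard_for">
  <tr><th>Squad</th><th>Gls</th></tr>
  <tr><td>Team A</td><td>10</td></tr>
</table>
<!--
<table class="stats_table" id="stats_standard">
  <tr><th>Player</th><th>Gls</th></tr>
  <tr><td>Player One</td><td>3</td></tr>
</table>
-->
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Tables that exist as real tags in the parsed tree.
visible_ids = [t.get('id') for t in soup.find_all(class_='stats_table')]

# Comments are stored as text nodes of type Comment, so their content
# has to be parsed a second time before any tags inside can be searched.
hidden_ids = []
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    fragment = BeautifulSoup(comment, 'html.parser')
    hidden_ids.extend(t.get('id') for t in fragment.find_all('table'))

print('visible:', visible_ids)
print('inside comments:', hidden_ids)
```

If the real page behaves like this sample, that would explain why find_all(class_='stats_table') in the question only returned the team tables.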

The easiest way to solve the problem, without going into depth on BeautifulSoup, is to strip the comment delimiters (<!-- and -->) from the HTML so that the table becomes ordinary markup:

requests.get(url).text.replace('<!--','').replace('-->','')

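Then you can use pandas.read_html() with its attrs parameter to select the table by id:

import requests
import pandas as pd
from io import StringIO

url = 'https://fbref.com/en/comps/21/stats/Primera-Division-Stats'

# Strip the comment delimiters so the hidden table becomes regular HTML
html = requests.get(url).text.replace('<!--', '').replace('-->', '')

# read_html returns a list of DataFrames; StringIO is used because newer
# pandas versions deprecate passing a raw HTML string directly
df = pd.read_html(StringIO(html), attrs={'id': 'stats_standard'})[0]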
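
Since the question also asks about switching to other leagues, the same approach could be wrapped in a small helper. This is a sketch rather than a tested scraper: the function names uncomment, read_table and fetch_player_stats are hypothetical, the sample HTML is made up so the parsing step can be tried offline, and it assumes other league pages use the same 'stats_standard' id, which the question suggests but which is not verified here.

```python
from io import StringIO

import pandas as pd

# Made-up sample standing in for a downloaded page (not real fbref markup).
SAMPLE_HTML = """
<html><body>
<!--
<table id="stats_standard">
  <thead><tr><th>Player</th><th>Gls</th></tr></thead>
  <tbody>
    <tr><td>Player One</td><td>3</td></tr>
    <tr><td>Player Two</td><td>1</td></tr>
  </tbody>
</table>
-->
</body></html>
"""


def uncomment(html: str) -> str:
    """Remove HTML comment delimiters so commented-out markup gets parsed."""
    return html.replace('<!--', '').replace('-->', '')


def read_table(html: str, table_id: str) -> pd.DataFrame:
    """Return the first table whose id attribute equals table_id."""
    tables = pd.read_html(StringIO(uncomment(html)), attrs={'id': table_id})
    return tables[0]


def fetch_player_stats(url: str, table_id: str = 'stats_standard') -> pd.DataFrame:
    """Download a league page and return its player table (needs network)."""
    # Imported here so the offline demo below runs without requests installed.
    import requests

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail early on HTTP errors
    return read_table(response.text, table_id)


# Offline demo on the sample; no request is made.
players = read_table(SAMPLE_HTML, 'stats_standard')
print(players)
```

With this in place, calling fetch_player_stats('https://fbref.com/en/comps/56/stats/Austrian-Bundesliga-Stats') would be the only change needed for another league, assuming that page uses the same table id. If no table matches, pd.read_html should raise a ValueError, which is a quick signal that the id differs.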

1 Comment

That worked awesome, thank you. I was wondering if you actually could go deeper and explain it. I might be able to update my question so that it answers the issue more broadly, and not just specifically for fbref.com.
