I'm trying to scrape http://www.basketball-reference.com/awards/all_league.html for some analysis. My goal is output like the following:

0 1st Marc Gasol 2014-2015
1 1st Anthony Davis 2014-2015
2 1st LeBron James 2014-2015
3 1st James Harden 2014-2015
4 1st Stephen Curry 2014-2015
5 2nd Pau Gasol 2014-2015
and so on.

This is the code I have so far. Is there any way to do this? Any suggestions or help would be much appreciated.

import pandas as pd
import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.basketball-reference.com/awards/all_league.html')
# Clean up the &nbsp; and &gt; HTML entities before parsing
html = r.text.replace('&nbsp;', ' ').replace('&gt;', '').encode('ascii', 'ignore')
soup = BeautifulSoup(html, "html.parser")
all_league_data = pd.DataFrame(columns=['year', 'team', 'player'])

stw_list = soup.findAll('div', attrs={'class': 'stw'})  # Find all 'stw' divs
for stw in stw_list:
    table = stw.find('table', attrs={'class': 'no_highlight stats_table'})
    for row in table.findAll('tr'):
        col = row.findAll('td')
        if col:  # Header rows have no <td> cells
            year = col[0].find(text=True)
            team = col[2].find(text=True)
            player = col[3].find(text=True)
            # Keep the values in the same order as the DataFrame columns
            all_league_data.loc[len(all_league_data)] = [year, team, player]

all_league_data

2 Answers

1

Looks like your code should work fine, but here's a working version without pandas:

import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.basketball-reference.com/awards/all_league.html')
# Clean up the &nbsp; and &gt; HTML entities before parsing
html = r.text.replace('&nbsp;', ' ').replace('&gt;', '').encode('ascii', 'ignore')
soup = BeautifulSoup(html, "html.parser")
all_league_data = []

stw_list = soup.findAll('div', attrs={'class': 'stw'})  # Find all 'stw' divs
for stw in stw_list:
    table = stw.find('table', attrs={'class': 'no_highlight stats_table'})
    for row in table.findAll('tr'):
        col = row.findAll('td')
        if col:  # Header rows have no <td> cells
            year = col[0].find(text=True)
            team = col[2].find(text=True)
            player = col[3].find(text=True)
            all_league_data.append([team, player, year])

# Print one numbered line per record (Python 3 syntax)
for i, line in enumerate(all_league_data):
    print(i, *line)
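
Note that the loop above only reads col[3], so it keeps the first player of each row. If you want one record per player, as in the question's target output, you can loop over the remaining cells too. A minimal sketch, assuming each data row holds season, league, team and then one player per cell; SAMPLE_HTML is a simplified stand-in I made up for the real page (so this runs without a network call) and rows_per_player is a hypothetical helper name:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for one table on the page.
SAMPLE_HTML = """
<table>
  <tr><th>Season</th><th>Lg</th><th>Tm</th><th colspan="5">Players</th></tr>
  <tr><td>2014-15</td><td>NBA</td><td>1st</td>
      <td>Marc Gasol C</td><td>Anthony Davis F</td><td>LeBron James F</td>
      <td>James Harden G</td><td>Stephen Curry G</td></tr>
  <tr><td>2014-15</td><td>NBA</td><td>2nd</td>
      <td>Pau Gasol C</td><td>DeMarcus Cousins C</td><td>LaMarcus Aldridge F</td>
      <td>Chris Paul G</td><td>Russell Westbrook G</td></tr>
</table>
"""


def rows_per_player(html):
    """Return one [team, player, year] record per player cell."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) < 4:
            continue  # header or spacer row: no player cells
        year, team = cells[0], cells[2]
        for player in cells[3:]:
            if player:  # skip empty cells
                records.append([team, player, year])
    return records


records = rows_per_player(SAMPLE_HTML)
for i, line in enumerate(records):
    print(i, *line)
```

The player strings still carry the trailing position letter (C, F or G); strip it if you don't need it.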

1

You are already using pandas, so use read_html:

import pandas as pd

all_league_data = pd.read_html('http://www.basketball-reference.com/awards/all_league.html')
print(all_league_data)

read_html returns a list of DataFrames, one per table on the page, so all the table data ends up in the first element:

  In [7]:  print(all_league_data[0].dropna().head(5))
         0    1    2                 3                   4  \
0  2014-15  NBA  1st      Marc Gasol C     Anthony Davis F   
1  2014-15  NBA  2nd       Pau Gasol C  DeMarcus Cousins C   
2  2014-15  NBA  3rd  DeAndre Jordan C        Tim Duncan F   
4  2013-14  NBA  1st     Joakim Noah C      LeBron James F   
5  2013-14  NBA  2nd   Dwight Howard C     Blake Griffin F   

                     5                6                    7  
0       LeBron James F   James Harden G      Stephen Curry G  
1  LaMarcus Aldridge F     Chris Paul G  Russell Westbrook G  
2      Blake Griffin F   Kyrie Irving G      Klay Thompson G  
4       Kevin Durant F   James Harden G         Chris Paul G  
5         Kevin Love F  Stephen Curry G        Tony Parker G  

It will be trivial to rearrange the data however you like or to drop certain columns. read_html also takes a few arguments, such as attrs, that you can apply; they are all covered in the read_html documentation.
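
For example, to get from that wide table to the one-row-per-player layout asked for in the question, melt is one option. A minimal sketch, assuming the column layout shown in the output above (season in column 0, team in column 2, players in columns 3 to 7). The sample rows are copied from that output so it runs without a network call; to_long and the column names are my own, not something pandas provides:

```python
import pandas as pd

# Two rows copied from the output above, standing in for all_league_data[0].
sample = pd.DataFrame([
    ["2014-15", "NBA", "1st", "Marc Gasol C", "Anthony Davis F",
     "LeBron James F", "James Harden G", "Stephen Curry G"],
    ["2014-15", "NBA", "2nd", "Pau Gasol C", "DeMarcus Cousins C",
     "LaMarcus Aldridge F", "Chris Paul G", "Russell Westbrook G"],
])


def to_long(wide):
    """Reshape one row per team into one row per player (hypothetical helper)."""
    df = wide.copy()
    player_cols = ["p1", "p2", "p3", "p4", "p5"]
    df.columns = ["year", "league", "team"] + player_cols

    # melt stacks the player columns one after another; keeping the original
    # row index lets a stable sort put each team's players back together.
    long = df.melt(id_vars=["year", "team"], value_vars=player_cols,
                   value_name="player", ignore_index=False)
    long = long.sort_index(kind="mergesort")

    # Drop empty cells and strip the trailing position letter (C, F or G).
    long = long.dropna(subset=["player"]).assign(
        player=lambda d: d["player"].str.rsplit(" ", n=1).str[0]
    )
    return long[["team", "player", "year"]].reset_index(drop=True)


long_df = to_long(sample)
print(long_df)
```

On the real page you would pass all_league_data[0].dropna() instead of sample, provided the table still has those eight columns.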
