Scraping multiple data types into same dataframe

Question

I am trying to scrape this site: https://www.basketball-reference.com/players/a/

My end goal is to build a dataframe of that table, along with a new column that contains each player's index. For example, for the top player this would be abdelal01.

My current attempt:

from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.basketball-reference.com/players/a"
# this is the HTML from the given URL
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser")

# column names come from the <th> cells of the first table row
headers = [th.getText() for th in soup.findAll('tr')[0].findAll('th')]

rows = soup.findAll('tr')

# player names sit in the <th> cell of each row
player_names = [[th.getText() for th in row.findAll('th')] for row in rows]

names = pd.DataFrame(player_names, columns=headers)
names.head(10)

# the remaining stats sit in the <td> cells of each row
player_stats = [[td.getText() for td in row.findAll('td')] for row in rows]

stats = pd.DataFrame(player_stats, columns=headers[1:])
stats['Player'] = names['Player']

Essentially this completely rebuilds the table, but without the URL to the player. Is there an easier way to do this instead of building two dataframes given that in html they have different reference points?

And what is the best way to collect the index to the player?

AaronS · Accepted Answer · 2020-08-12 14:47:21Z

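The simplest way to extract table data is with the pandas package, which also makes the result easy to manipulate.

The read_html() function parses the tables it finds on a page and returns them as a list of DataFrames, so indexing with [0] selects the first table.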

import pandas as pd
df = pd.read_html('https://www.basketball-reference.com/players/a/')[0]
df

Output

     Player                From  To    Pos  Ht    Wt   Birth Date         Colleges
0    Alaa Abdelnaby        1991  1995  F-C  6-10  240  June 24, 1968      Duke
1    Zaid Abdul-Aziz       1969  1978  C-F  6-9   235  April 7, 1946      Iowa State
2    Kareem Abdul-Jabbar*  1970  1989  C    7-2   225  April 16, 1947     UCLA
3    Mahmoud Abdul-Rauf    1991  2001  G    6-1   162  March 9, 1969      LSU
4    Tariq Abdul-Wahad     1998  2003  F    6-6   223  November 3, 1974   Michigan, San Jose State
...  ...                   ...   ...   ...  ...   ...  ...                ...
161  Dennis Awtrey         1971  1982  C    6-10  235  February 22, 1948  Santa Clara
162  Gustavo Ayón          2012  2014  C    6-10  250  April 1, 1985      NaN
163  Jeff Ayres            2010  2016  F    6-9   240  April 29, 1987     Arizona State
164  Deandre Ayton         2019  2020  C    6-11  250  July 23, 1998      Arizona
165  Kelenna Azubuike      2007  2012  G    6-5   220  December 16, 1983  Kentucky

Player column

df['Player']

Output

0            Alaa Abdelnaby
1           Zaid Abdul-Aziz
2      Kareem Abdul-Jabbar*
3        Mahmoud Abdul-Rauf
4         Tariq Abdul-Wahad
               ...         
161           Dennis Awtrey
162            Gustavo Ayón
163              Jeff Ayres
164           Deandre Ayton
165        Kelenna Azubuike
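
The table text alone does not carry the player index (abdelal01 in the question), since that lives in each player's link rather than in the visible cell text. One possible way to collect it is to walk the rows with BeautifulSoup and read the link's href. The sketch below runs on a small hypothetical HTML fragment instead of the live page, and it assumes that each name cell holds a link of the form /players/a/<id>.html. That link shape, the function name parse_players, the column name player_id, and every sample ID other than abdelal01 are assumptions made for illustration.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical fragment shaped like the players table: the name cell is a
# <th> holding a link, and the remaining cells are <td> elements.
SAMPLE_HTML = """
<table id="players">
  <thead>
    <tr><th>Player</th><th>From</th><th>To</th></tr>
  </thead>
  <tbody>
    <tr><th><a href="/players/a/abdelal01.html">Alaa Abdelnaby</a></th><td>1991</td><td>1995</td></tr>
    <tr><th><a href="/players/a/abdulza01.html">Zaid Abdul-Aziz</a></th><td>1969</td><td>1978</td></tr>
    <tr><th><strong><a href="/players/a/aytonde01.html">Deandre Ayton</a></strong></th><td>2019</td><td>2020</td></tr>
  </tbody>
</table>
"""


def parse_players(html):
    """Return one DataFrame with the table cells plus a player_id column."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    headers = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]

    records = []
    for row in table.find("tbody").find_all("tr"):
        link = row.find("a", href=True)
        if link is None:
            continue  # skip rows that carry no player link
        # read <th> and <td> cells together so one pass builds the whole row
        cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        record = dict(zip(headers, cells))
        # assumed link shape: "/players/a/abdelal01.html" -> "abdelal01"
        record["player_id"] = link["href"].rsplit("/", 1)[-1].split(".")[0]
        records.append(record)
    return pd.DataFrame(records)


players = parse_players(SAMPLE_HTML)
print(players)
```

Because the th and td cells are read in the same pass, there is no need to build two dataframes and join them afterwards. For the real page, the same function could be given the HTML returned by urlopen(url) from the question, after checking that the live markup matches the shape assumed here.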
