I am trying to scrape this site: https://www.basketball-reference.com/players/a/
My end goal is to build a dataframe of that table, along with a new column that contains each player's index. For example, for the top player this would be abdelal01.
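To make the index concrete: as far as I can tell, each player's name links to a page such as /players/a/abdelal01.html, so the index is the file name without its extension. Here is a minimal sketch of that mapping. The helper name `player_index_from_href` is hypothetical, and the href format is my assumption from looking at the page.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def player_index_from_href(href):
    """Return the player index from a player link.

    Assumes the link ends in '<index>.html', e.g. '/players/a/abdelal01.html'.
    """
    # urlparse drops any scheme/host, so relative and absolute links both work
    path = urlparse(href).path
    # .stem is the final path component without its extension
    return PurePosixPath(path).stem


print(player_index_from_href("/players/a/abdelal01.html"))
```

If the assumption about the link format holds, this should print the abdelal01 value from my example above.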
My current attempt:
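from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.basketball-reference.com/players/a"
# Fetch and parse the HTML from the given URL
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser")
# Column names come from the <th> cells of the first (header) row
headers = [th.getText() for th in soup.findAll('tr')[0].findAll('th')]
rows = soup.findAll('tr')
# Player names are stored in <th> cells
player_names = [[th.getText() for th in rows[i].findAll('th')]
                for i in range(len(rows))]
names = pd.DataFrame(player_names, columns=headers)
names.head(10)
# The remaining columns are stored in <td> cells
player_stats = [[td.getText() for td in rows[i].findAll('td')]
                for i in range(len(rows))]
stats = pd.DataFrame(player_stats, columns=headers[1:])
# Copy the names across so everything ends up in one dataframe
stats['Player'] = names['Player']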
This rebuilds the table, but it does not capture the URL for each player. Is there an easier way to do this than building two dataframes, given that the player name sits in a th cell while the other columns sit in td cells?
And what is the best way to collect each player's index?
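For reference, the kind of single-pass approach I am hoping for looks roughly like the sketch below. It runs on a small hand-written HTML snippet rather than the live page, and it assumes the real table has the same layout (name in a th cell containing a link, other columns in td cells). The `SAMPLE_HTML` string and the `player_index` column name are my own placeholders, and I am not sure this is the best way to do it.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hand-written stand-in for the real page; the layout is an assumption.
SAMPLE_HTML = """
<table>
  <thead><tr><th>Player</th><th>From</th><th>To</th></tr></thead>
  <tbody>
    <tr><th><a href="/players/a/abdelal01.html">Alaa Abdelnaby</a></th>
        <td>1991</td><td>1995</td></tr>
    <tr><th><a href="/players/a/abdulza01.html">Zaid Abdul-Aziz</a></th>
        <td>1969</td><td>1978</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
headers = [th.get_text() for th in soup.find("thead").find_all("th")]

records = []
for tr in soup.find("tbody").find_all("tr"):
    # Collect th and td cells together, in document order
    cells = tr.find_all(["th", "td"])
    if len(cells) != len(headers):
        continue  # skip rows that do not look like data rows
    row = dict(zip(headers, (cell.get_text() for cell in cells)))
    # The index is taken from the link's file name, minus the extension
    link = tr.find("a", href=True)
    if link is not None:
        row["player_index"] = link["href"].rsplit("/", 1)[-1].rsplit(".", 1)[0]
    else:
        row["player_index"] = None
    records.append(row)

players = pd.DataFrame(records)
print(players)
```

This avoids building a names dataframe and a stats dataframe separately, because each row is read once with its th and td cells together.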