I am trying to parse the table located here. I am using the code below, but it returns a multilevel index.
url1='aboveurl.htm'
df1=pd.read_html(url1)
df1=df1[0]
df = pd.read_html(url1)[0]
df.columns = df.columns.droplevel()
This will remove the multilevel index as you wanted.
print(df.head())
Rk School Conf W L Pct W L Pct Off \
0 1 Clemson ACC (Atlantic) 12 2 .857 7 1 .875 33.3
1 2 North Carolina State ACC (Atlantic) 9 4 .692 6 2 .750 32.2
2 3 Louisville ACC (Atlantic) 8 5 .615 4 4 .500 38.1
3 4 Wake Forest ACC (Atlantic) 8 5 .615 4 4 .500 35.3
4 5 Boston College ACC (Atlantic) 7 6 .538 4 4 .500 25.7
Def SRS SOS AP Pre AP High AP Rank Notes
0 13.6 20.62 6.84 5 1 4 NaN
1 25.2 12.17 5.55 NaN 14 23 NaN
2 27.4 9.67 3.75 16 14 NaN NaN
3 28.3 11.42 6.03 NaN NaN NaN NaN
4 22.8 9.39 7.08 NaN NaN NaN NaN
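To see what droplevel does in isolation, here is a small self-contained sketch with a toy two-level header standing in for the scraped table (the column names are made up):

```python
import pandas as pd

# toy frame with a two-level header, like the one read_html returns here
cols = pd.MultiIndex.from_tuples([('Overall', 'W'), ('Overall', 'L')])
df = pd.DataFrame([[12, 2]], columns=cols)
print(df.columns.tolist())   # [('Overall', 'W'), ('Overall', 'L')]

df.columns = df.columns.droplevel()  # drop the top ('Overall') level
print(df.columns.tolist())   # ['W', 'L']
```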
Note that after df.columns.droplevel(), if you print(df.columns.tolist()) you will see stray spaces in some column names. To parse only specific columns, read_html cannot be used (as far as I know); one possible alternative approach is to use BeautifulSoup and scrape the table row by row and column by column.
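If those stray spaces are a problem, one quick fix (a sketch, assuming the columns are plain strings after droplevel) is to strip them:

```python
import pandas as pd

# toy columns with the kind of stray spaces described above
df = pd.DataFrame([[1, 'Clemson']], columns=['Rk ', ' School'])
df.columns = df.columns.str.strip()  # remove leading/trailing whitespace
print(df.columns.tolist())  # ['Rk', 'School']
```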
Here are the details.
Imports
import requests
import pandas as pd
from bs4 import BeautifulSoup
Get the soup for the page
url = "https://www.sports-reference.com/cfb/years/2017-standings.html"
source = requests.get(url)
soup = BeautifulSoup(source.text, 'html.parser')
Extract all rows from the table with id=standings
table = soup.find('table', attrs={'id':'standings'})
table_rows = table.find_all('tr')
Get the first 6 column names from the th cells (taken from the second header row, to avoid the row with Overall)
heads = table_rows[1].find_all('th')[:6]
col_names = [h.text for h in heads]
Loop over all rows and extract the fields (using td) School, Conf, W, L and Pct (column indexes 0 to 4; Rk is skipped because it sits in a th, not a td). Place them in a nested list, manually inserting the Rk value at the front of each sublist on every iteration of the loop.
standing_rows = []
rank = 1
for tr in table_rows:
    cols = tr.find_all('td')
    if cols:
        # td cells start at School; Rk sits in a th, so prepend it manually
        row = [td.text for td in cols[:5]]
        standing_rows.append([rank] + row)
        rank += 1
Finally, put nested list into a Pandas DataFrame
df = pd.DataFrame(standing_rows, columns=col_names)
Here are the first 5 rows
df.head()
Rk School Conf W L Pct
0 1 Clemson ACC (Atlantic) 12 2 .857
1 2 North Carolina State ACC (Atlantic) 9 4 .692
2 3 Louisville ACC (Atlantic) 8 5 .615
3 4 Wake Forest ACC (Atlantic) 8 5 .615
4 5 Boston College ACC (Atlantic) 7 6 .538
and the last 5 rows (not showing the column names)
df.tail()
125 126 Louisiana Sun Belt 5 7 .417
126 127 Louisiana-Monroe Sun Belt 4 8 .333
127 128 Idaho Sun Belt 4 8 .333
128 129 South Alabama Sun Belt 4 8 .333
129 130 Texas State Sun Belt 2 10 .167
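One caveat about the scraped frame: every value pulled via .text is a string, so W, L and Pct need a numeric conversion before doing any arithmetic. A sketch with one made-up row:

```python
import pandas as pd

# one hypothetical scraped row; every value is a string, as .text returns
df = pd.DataFrame([[1, 'Clemson', 'ACC (Atlantic)', '12', '2', '.857']],
                  columns=['Rk', 'School', 'Conf', 'W', 'L', 'Pct'])

# convert the win/loss/percentage columns from strings to numbers
df[['W', 'L', 'Pct']] = df[['W', 'L', 'Pct']].apply(pd.to_numeric)
print(df[['W', 'L', 'Pct']].dtypes)
```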