
I am trying to parse the table located here. I am using the code below, but it returns a multi-level column index.

import pandas as pd

url1 = 'aboveurl.htm'

df1 = pd.read_html(url1)
df1 = df1[0]
  • Which level is the preferred column name? Commented Jan 26, 2019 at 3:14
  • The 2nd level, which includes Rank, W, L. Commented Jan 26, 2019 at 3:16

2 Answers

df = df[0]
df.columns = df.columns.droplevel()

This removes the top level of the MultiIndex and leaves the second level (Rk, School, W, L, ...) as the column names, as you preferred.

print(df.head())
  Rk                School            Conf   W  L   Pct  W  L   Pct   Off  \
0  1               Clemson  ACC (Atlantic)  12  2  .857  7  1  .875  33.3   
1  2  North Carolina State  ACC (Atlantic)   9  4  .692  6  2  .750  32.2   
2  3            Louisville  ACC (Atlantic)   8  5  .615  4  4  .500  38.1   
3  4           Wake Forest  ACC (Atlantic)   8  5  .615  4  4  .500  35.3   
4  5        Boston College  ACC (Atlantic)   7  6  .538  4  4  .500  25.7   

    Def    SRS   SOS AP Pre AP High AP Rank Notes  
0  13.6  20.62  6.84      5       1       4   NaN  
1  25.2  12.17  5.55    NaN      14      23   NaN  
2  27.4   9.67  3.75     16      14     NaN   NaN  
3  28.3  11.42  6.03    NaN     NaN     NaN   NaN  
4  22.8   9.39  7.08    NaN     NaN     NaN   NaN 
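Note that droplevel() leaves duplicate names (the two W/L/Pct pairs from Overall and Conference both survive under the same labels). If that is a concern, the two levels can instead be joined into unique names. A minimal sketch on a hypothetical MultiIndex shaped like this table's header (the tuples below are illustrative, not the exact names read_html produces):

```python
import pandas as pd

# Hypothetical two-level columns shaped like the standings header
cols = pd.MultiIndex.from_tuples([
    ('Unnamed: 0_level_0', 'Rk'),
    ('Unnamed: 1_level_0', 'School'),
    ('Overall', 'W'), ('Overall', 'L'),
    ('Conference', 'W'), ('Conference', 'L'),
])
df = pd.DataFrame([[1, 'Clemson', 12, 2, 7, 1]], columns=cols)

# Join both levels instead of dropping one, so W/L stay distinguishable;
# columns whose top level is an auto-generated 'Unnamed: ...' keep only
# the second level
df.columns = [lvl2 if lvl1.startswith('Unnamed') else f'{lvl1} {lvl2}'
              for lvl1, lvl2 in df.columns]
print(df.columns.tolist())
# ['Rk', 'School', 'Overall W', 'Overall L', 'Conference W', 'Conference L']
```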

5 Comments

Thank you, exactly what I was looking for.
No problem. :) Cheers!
Is there a way to parse only the first W, L and Pct, which are under the Overall column?
@Data_is_Power I'm not sure, since the formatting in the MultiIndex is bad. :( If you print(df.columns.tolist()) before df.columns.droplevel(), you will see stray spaces.
No worries. I renamed the columns and took care of it.
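For the follow-up in the comments, the MultiIndex can actually be used to select only the Overall block before it is dropped, provided the stray spaces mentioned above are stripped from the level names first. A hedged sketch on a hypothetical frame with the same shape (the trailing space in 'Overall ' stands in for whatever whitespace the real page produces):

```python
import pandas as pd

# Hypothetical columns mimicking the table; the real top-level names may
# carry stray spaces, so they are stripped before selecting
cols = pd.MultiIndex.from_tuples([
    ('Overall ', 'W'), ('Overall ', 'L'), ('Overall ', 'Pct'),
    ('Conference', 'W'), ('Conference', 'L'), ('Conference', 'Pct'),
])
df = pd.DataFrame([[12, 2, .857, 7, 1, .875]], columns=cols)

# Strip whitespace from the top level, then select the Overall block
df.columns = pd.MultiIndex.from_tuples(
    [(a.strip(), b) for a, b in df.columns])
overall = df['Overall']
print(overall.columns.tolist())  # ['W', 'L', 'Pct']
```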

As far as I know, read_html cannot parse only specific columns, so one possible alternative is to use BeautifulSoup and scrape the table row by row and column by column.

Here are the details.

Imports

import requests
import pandas as pd
from bs4 import BeautifulSoup

Get the soup for the page

url = "https://www.sports-reference.com/cfb/years/2017-standings.html"
source = requests.get(url)
soup = BeautifulSoup(source.text, 'html.parser')

Extract all rows from the table with id=standings

table = soup.find('table', attrs={'id':'standings'})
table_rows = table.find_all('tr')

Get the first 6 column names from the th elements of the second header row (skipping the top row, which contains Overall)

heads = table_rows[1].find_all('th')[:6]
col_names = [h.text for h in heads]

Loop over all rows and extract the td fields School, Conf, W, L and Pct, collecting them in a nested list (Rk is skipped because it sits in a th, not a td, and is instead reconstructed by inserting a running rank counter at the front of each sublist)

standing_rows = []
rank = 1
for tr in table_rows:
    cols = tr.find_all('td')
    if cols:
        row = [td.text for td in cols[1:6]]
        standing_rows.append([rank] + row)
        rank += 1

Finally, put the nested list into a pandas DataFrame

df = pd.DataFrame(standing_rows, columns=col_names)

Here are the first 5 rows

df.head()
      Rk                School             Conf   W   L    Pct
0      1               Clemson   ACC (Atlantic)  12   2   .857
1      2  North Carolina State   ACC (Atlantic)   9   4   .692
2      3            Louisville   ACC (Atlantic)   8   5   .615
3      4           Wake Forest   ACC (Atlantic)   8   5   .615
4      5        Boston College   ACC (Atlantic)   7   6   .538

and the last 5 rows (not showing the column names)

df.tail()
125  126             Louisiana         Sun Belt   5   7   .417
126  127      Louisiana-Monroe         Sun Belt   4   8   .333
127  128                 Idaho         Sun Belt   4   8   .333
128  129         South Alabama         Sun Belt   4   8   .333
129  130           Texas State         Sun Belt   2  10   .167
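One caveat with the BeautifulSoup route: everything extracted via .text is a string, so the numeric columns need converting before any arithmetic. A small sketch with made-up values (column names taken from the table above):

```python
import pandas as pd

# Scraped cells come back as strings; convert W, L and Pct to numbers
df = pd.DataFrame({'W': ['12', '9'], 'L': ['2', '4'],
                   'Pct': ['.857', '.692']})
for col in ['W', 'L', 'Pct']:
    df[col] = pd.to_numeric(df[col])
print(df['W'].sum())  # 21
```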

