I am trying to parse the table located here. I am using the code below, but it returns a multilevel index.
url1='aboveurl.htm'
df1=pd.read_html(url1)
df1=df1[0]
df = pd.read_html(url1)[0]
df.columns = df.columns.droplevel()
This will remove the multilevel index as you wanted.
print(df.head())
Rk School Conf W L Pct W L Pct Off \
0 1 Clemson ACC (Atlantic) 12 2 .857 7 1 .875 33.3
1 2 North Carolina State ACC (Atlantic) 9 4 .692 6 2 .750 32.2
2 3 Louisville ACC (Atlantic) 8 5 .615 4 4 .500 38.1
3 4 Wake Forest ACC (Atlantic) 8 5 .615 4 4 .500 35.3
4 5 Boston College ACC (Atlantic) 7 6 .538 4 4 .500 25.7
Def SRS SOS AP Pre AP High AP Rank Notes
0 13.6 20.62 6.84 5 1 4 NaN
1 25.2 12.17 5.55 NaN 14 23 NaN
2 27.4 9.67 3.75 16 14 NaN NaN
3 28.3 11.42 6.03 NaN NaN NaN NaN
4 22.8 9.39 7.08 NaN NaN NaN NaN
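To see what droplevel does in isolation, here is a small self-contained sketch with a toy two-level header standing in for the scraped table (the column names are made up):

```python
import pandas as pd

# toy frame with a two-level header, like the one read_html returns here
cols = pd.MultiIndex.from_tuples([('Overall', 'W'), ('Overall', 'L')])
df = pd.DataFrame([[12, 2]], columns=cols)
print(df.columns.tolist())   # [('Overall', 'W'), ('Overall', 'L')]

df.columns = df.columns.droplevel()  # drop the top ('Overall') level
print(df.columns.tolist())   # ['W', 'L']
```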
Note that after df.columns.droplevel(), if you print(df.columns.tolist()) you will see stray spaces in some column names. To parse only specific columns, read_html cannot be used (as far as I know); one possible alternative approach is to use BeautifulSoup and scrape the table row by row and column by column.
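If those stray spaces are a problem, one quick fix (a sketch, assuming the columns are plain strings after droplevel) is to strip them:

```python
import pandas as pd

# toy columns with the kind of stray spaces described above
df = pd.DataFrame([[1, 'Clemson']], columns=['Rk ', ' School'])
df.columns = df.columns.str.strip()  # remove leading/trailing whitespace
print(df.columns.tolist())  # ['Rk', 'School']
```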
Here are the details.
Imports
import requests
import pandas as pd
from bs4 import BeautifulSoup
Get the soup for the page
url = "https://www.sports-reference.com/cfb/years/2017-standings.html"
source = requests.get(url)
soup = BeautifulSoup(source.text, 'html.parser')
Extract all rows from the table with id=standings
table = soup.find('table', attrs={'id':'standings'})
table_rows = table.find_all('tr')
Get the first 6 column names from the th cells (taken from the second header row, to avoid the row with Overall)
heads = table_rows[1].find_all('th')[:6]
col_names = [h.text for h in heads]
Loop over all rows and extract the fields (using td) School, Conf, W, L and Pct (column indexes 0 to 4; Rk is skipped because it sits in a th, not a td). Place them in a nested list, manually inserting the Rk value at the front of each sublist on every iteration of the loop.
standing_rows = []
rank = 1
for tr in table_rows:
    cols = tr.find_all('td')
    if cols:
        # td cells start at School; Rk sits in a th, so prepend it manually
        row = [td.text for td in cols[:5]]
        standing_rows.append([rank] + row)
        rank += 1
Finally, put nested list into a Pandas DataFrame
df = pd.DataFrame(standing_rows, columns=col_names)
Here are the first 5 rows
df.head()
Rk School Conf W L Pct
0 1 Clemson ACC (Atlantic) 12 2 .857
1 2 North Carolina State ACC (Atlantic) 9 4 .692
2 3 Louisville ACC (Atlantic) 8 5 .615
3 4 Wake Forest ACC (Atlantic) 8 5 .615
4 5 Boston College ACC (Atlantic) 7 6 .538
and the last 5 rows (not showing the column names)
df.tail()
125 126 Louisiana Sun Belt 5 7 .417
126 127 Louisiana-Monroe Sun Belt 4 8 .333
127 128 Idaho Sun Belt 4 8 .333
128 129 South Alabama Sun Belt 4 8 .333
129 130 Texas State Sun Belt 2 10 .167
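One caveat about the scraped frame: every value pulled via .text is a string, so W, L and Pct need a numeric conversion before doing any arithmetic. A sketch with one made-up row:

```python
import pandas as pd

# one hypothetical scraped row; every value is a string, as .text returns
df = pd.DataFrame([[1, 'Clemson', 'ACC (Atlantic)', '12', '2', '.857']],
                  columns=['Rk', 'School', 'Conf', 'W', 'L', 'Pct'])

# convert the win/loss/percentage columns from strings to numbers
df[['W', 'L', 'Pct']] = df[['W', 'L', 'Pct']].apply(pd.to_numeric)
print(df[['W', 'L', 'Pct']].dtypes)
```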