Pandas read_html returned column with NaN values in Python

Question

I am trying to parse the table located here using the pandas read_html function. The table parses, but the Capacity column comes back as NaN, and I am not sure why. I would like to parse the entire table and use it for further research, so any help is appreciated. Below is my code so far.

import pandas as pd

wiki_url = 'Above url'
df1 = pd.read_html(wiki_url, index_col=0)

anky · Accepted Answer · 2019-01-22 02:17:02Z

Try something like this (pass flavor='bs4' so that BeautifulSoup does the parsing):

df = pd.read_html(r'https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FBS_football_stadiums',header=[0],flavor='bs4')

df = df[0]
print(df.head())

   Image                                 Stadium         City State  \
0    NaN                  Aggie Memorial Stadium   Las Cruces    NM   
1    NaN                               Alamodome  San Antonio    TX   
2    NaN  Alaska Airlines Field at Husky Stadium      Seattle    WA   
3    NaN                      Albertsons Stadium        Boise    ID   
4    NaN                Allen E. Paulson Stadium   Statesboro    GA   

               Team     Conference   Capacity  \
0  New Mexico State    Independent  30,343[1]   
1              UTSA          C-USA      65000   
2        Washington         Pac-12  70,500[2]   
3       Boise State  Mountain West  36,387[3]   
4  Georgia Southern       Sun Belt      25000   
.............................
.............................

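To remove anything in square brackets, use the following (regex=True is needed on pandas 2.0 and later, where str.replace treats the pattern as a literal string by default):

df.Capacity = df.Capacity.str.replace(r"\[.*\]", "", regex=True)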
print(df.Capacity.head())

0    30,343
1     65000
2    70,500
3    36,387
4     25000
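
Since the goal is further analysis, the cleaned strings probably need to become numbers as well. The following is a minimal sketch of that step, not part of the original answer: it assumes the column looks like the sample output above, and it uses a hand-built Series (the name capacity and its values are copied from that output for illustration) instead of fetching the page.

```python
import pandas as pd

# Hand-built stand-in for the Capacity column shown above
# (illustrative data, not fetched from Wikipedia).
capacity = pd.Series(
    ["30,343[1]", "65000", "70,500[2]", "36,387[3]", "25000"],
    name="Capacity",
)

cleaned = (
    capacity
    .str.replace(r"\[.*?\]", "", regex=True)  # drop footnote markers such as [1]
    .str.replace(",", "", regex=False)        # drop thousands separators
)

# Convert the cleaned strings to integers.
capacity_int = pd.to_numeric(cleaned).astype("int64")
print(capacity_int.tolist())
```

The non-greedy pattern \[.*?\] is a small variation on the one above, so that a cell with two footnotes would lose each marker separately instead of everything between the first "[" and the last "]".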

Hope this helps.

Aditya Diwakar · Accepted Answer · 2019-01-21 22:47:16Z

Pandas is only picking up the superscript footnote marker (for whatever reason) rather than the actual value. If you print all of df1 and check the Capacity column, you will see that some of the values are [1], [2], etc. (where the cell has a footnote) and NaN otherwise.

You may want to look into other ways of fetching the data, or scrape it yourself with BeautifulSoup, since pandas is reading the wrong part of the cell and therefore returning the wrong data.
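
As a rough sketch of the "scrape it yourself" route, the snippet below removes footnote superscripts before reading the cell text. It is an illustration under stated assumptions, not a tested scraper for the live page: SAMPLE_HTML is a hypothetical, simplified stand-in for two rows of the table, parse_table is a made-up helper name, and the real Wikipedia markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for part of the table (not the real markup).
SAMPLE_HTML = """
<table>
  <tr><th>Stadium</th><th>Capacity</th></tr>
  <tr><td>Aggie Memorial Stadium</td><td>30,343<sup>[1]</sup></td></tr>
  <tr><td>Alamodome</td><td>65000</td></tr>
</table>
"""


def parse_table(html):
    """Return the first table in html as a list of dicts keyed by header text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")

    # Remove footnote markers so they do not end up in the cell text.
    for sup in table.find_all("sup"):
        sup.decompose()

    rows = []
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)

    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


records = parse_table(SAMPLE_HTML)
print(records)
```

The resulting list of dicts can be passed to pd.DataFrame(records) if a DataFrame is wanted.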


Data_is_Power · Accepted Answer · 2019-01-23 01:30:29Z

The answer posted by @anky_91 was correct. I wanted to try another approach that does not use a regex. My solution is below.

    df4=pd.read_html('https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FBS_football_stadiums',header=[0],flavor='bs4')
    df4 = df4[0]

The solution was to take out the "r" that @anky_91 used in line 1 and line 4.

    print(df4.Capacity.head())

    0    30,343
    1     65000
    2    70,500
    3    36,387
    4     25000
    Name: Capacity, dtype: object
