1

I am trying to parse the table located here using the pandas read_html function. The table parses, but the Capacity column comes back as NaN, and I am not sure why. I would like to parse the entire table and use it for further research, so any help is appreciated. Below is my code so far.

import pandas as pd
wiki_url = 'Above url'
df1 = pd.read_html(wiki_url, index_col=0)  # returns a list of DataFrames, one per table

3 Answers

2

Try something like this (include flavor as bs4):

import pandas as pd
# flavor='bs4' needs the beautifulsoup4 and html5lib packages installed
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FBS_football_stadiums', header=[0], flavor='bs4')

df = df[0]
print(df.head())

   Image                                 Stadium         City State  \
0    NaN                  Aggie Memorial Stadium   Las Cruces    NM   
1    NaN                               Alamodome  San Antonio    TX   
2    NaN  Alaska Airlines Field at Husky Stadium      Seattle    WA   
3    NaN                      Albertsons Stadium        Boise    ID   
4    NaN                Allen E. Paulson Stadium   Statesboro    GA   

               Team     Conference   Capacity  \
0  New Mexico State    Independent  30,343[1]   
1              UTSA          C-USA      65000   
2        Washington         Pac-12  70,500[2]   
3       Boise State  Mountain West  36,387[3]   
4  Georgia Southern       Sun Belt      25000   
.............................
.............................

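To remove anything in square brackets (the footnote markers), use:

# regex=True is required on pandas 2.0+, where str.replace matches literally by default
df.Capacity = df.Capacity.str.replace(r"\[.*?\]", "", regex=True)
print(df.Capacity.head())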

0    30,343
1     65000
2    70,500
3    36,387
4     25000
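
Since the cleaned values are still strings, here is a minimal sketch of turning them into integers for further analysis. The helper name clean_capacity is hypothetical, and the sample values are copied from the output above rather than fetched from the live page.

```python
import pandas as pd


def clean_capacity(series: pd.Series) -> pd.Series:
    """Strip footnote markers and thousands separators, then convert to integers."""
    cleaned = (
        series.astype(str)
        .str.replace(r"\[.*?\]", "", regex=True)  # drop footnote markers such as [1]
        .str.replace(",", "", regex=False)        # drop thousands separators
        .str.strip()
    )
    # errors="coerce" turns anything unparseable into a missing value;
    # the nullable Int64 dtype keeps the column integer-typed even then.
    return pd.to_numeric(cleaned, errors="coerce").astype("Int64")


# Sample values taken from the printed output above (assumption: not the full table).
sample = pd.Series(["30,343[1]", "65000", "70,500[2]", "36,387[3]", "25000"], name="Capacity")
capacity_int = clean_capacity(sample)
print(capacity_int)
```

With the real table this would be applied as df["Capacity"] = clean_capacity(df["Capacity"]), assuming the column is named Capacity as in the output above.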

Hope this helps.


0

Pandas is only picking up the superscript (for whatever reason) rather than the actual value. If you print all of df1 and check the Capacity column, you will see that the values are [1], [2], etc. where a cell has a footnote, and NaN otherwise.

You may want to look into other ways of fetching the data, or scrape it yourself using BeautifulSoup, since pandas is looking at, and therefore returning, the wrong data.
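
A rough sketch of the scrape-it-yourself route with BeautifulSoup might look like the following. It runs on a tiny hand-written table, not the real page, and it assumes the footnote markers sit inside sup tags; the function name parse_capacity_table and the two-column layout are illustrative assumptions only.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the table; the real page's markup may differ.
SAMPLE_HTML = """
<table>
  <tr><th>Stadium</th><th>Capacity</th></tr>
  <tr><td>Aggie Memorial Stadium</td><td>30,343<sup>[1]</sup></td></tr>
  <tr><td>Alamodome</td><td>65000</td></tr>
</table>
"""


def parse_capacity_table(html: str) -> list:
    """Return (stadium, capacity) pairs with footnote markers removed."""
    soup = BeautifulSoup(html, "html.parser")

    # Remove every <sup> element so footnotes never reach the cell text.
    for sup in soup.find_all("sup"):
        sup.decompose()

    rows = []
    for tr in soup.find_all("tr")[1:]:  # skip the header row
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) == 2:
            stadium, capacity = cells
            rows.append((stadium, int(capacity.replace(",", ""))))
    return rows


rows = parse_capacity_table(SAMPLE_HTML)
print(rows)
```

For the live page you would first download the HTML (for example with requests) and pick out the right table, which this sketch leaves out.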


0

The answer posted by @anky_91 was correct. I wanted to try another approach that does not use a regex; below is my solution.

    df4=pd.read_html('https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FBS_football_stadiums',header=[0],flavor='bs4')
    df4 = df4[0]

The solution was to take out the "r" that @anky_91 used in line 1 and line 4.

    print(df4.Capacity.head())

    0    30,343
    1     65000
    2    70,500
    3    36,387
    4     25000
    Name: Capacity, dtype: object
