Pandas read_html() returns 'nan' on a specific column

Question

I am using pandas to scrape a website, but read_html() returns a whole column of NaN values instead of the actual ones. I have tried changing several read_html() parameters, such as flavor, converters, and na_values, without success. I noticed that the HTML of the problem column differs from the others: the other cells are of the form 'td class=', while the cells that are not read properly are 'td data-behavior='. When I simply copy and paste the table into Excel, everything comes through correctly. I would appreciate any help.

I have also tried to get the table using lxml/XPath, which didn't work either.

import pandas as pd

week_data = pd.read_html('https://www.espn.co.uk/nfl/fixtures/_/week/2/seasontype/1',
                         converters={'time': str})

The column should have strings containing the time of the match.

If the page uses JavaScript to add the data, you can't get it with pandas, requests/urllib, or lxml/BeautifulSoup, because none of them can run JavaScript. You may need Selenium to control a web browser, which will run the JavaScript; you can then read the resulting HTML. See Selenium-Python.
– furas, Commented Jul 7, 2019 at 0:45
Thanks! I have never used Selenium, but I will look into it :)
– serra, Commented Jul 7, 2019 at 1:11
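
Before reaching for a browser, it can help to check whether the value is already present somewhere in the static HTML. The following is a minimal sketch using only the standard library. The sample row and its timestamp are hypothetical, modelled on the 'td data-behavior=' markup described in the question, not copied from the real page.

```python
from html.parser import HTMLParser

# Hypothetical row: the time cell carries no text, only data-* attributes.
SAMPLE = """
<table><tr>
  <td class="team">Home v Away</td>
  <td data-behavior="date_time" data-date="2019-09-15T17:00Z"></td>
</tr></table>
"""


class CellCollector(HTMLParser):
    """Record the attributes and the inner text of every <td>."""

    def __init__(self):
        super().__init__()
        self.cells = []       # one [attrs dict, text] pair per <td>
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append([dict(attrs), ""])

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        # Only text that appears inside a <td> is attached to a cell.
        if self._in_td:
            self.cells[-1][1] += data.strip()


collector = CellCollector()
collector.feed(SAMPLE)
for attrs, text in collector.cells:
    print(repr(text), attrs)
```

A cell whose text is empty but which has a populated data attribute would explain a column of NaN: a table parser that reads only cell text has nothing to read there.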

clockwatcher · Accepted Answer · 2019-07-08 19:19:24Z

2

They're embedding the date and time in the data-date attribute, so another option, rather than resorting to Selenium, is to pull that attribute out and write it into the td element's text using BeautifulSoup.

from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup
from dateutil import parser, tz

espn_page = requests.get('https://www.espn.co.uk/nfl/fixtures/_/week/2/seasontype/1')
soup = BeautifulSoup(espn_page.content, 'html.parser')
espn_schedule = soup.find('div', {'class': 'main-content'})

# The kick-off time is stored (in UTC) in the data-date attribute of each
# date_time cell; copy it into the cell text so read_html() can see it.
for td in espn_schedule.find_all('td', {'data-behavior': 'date_time'}):
    utc = parser.parse(td.get('data-date'))
    localtime = utc.astimezone(tz.gettz())  # convert to this machine's time zone
    td.string = localtime.strftime("%I:%M")  # add " %p" for an AM/PM suffix

# Recent pandas versions expect a file-like object rather than a raw HTML string.
df = pd.read_html(StringIO(str(espn_schedule)))
print(df[0].columns)
print(df[0][df[0].columns[2]])
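
The same idea can be tried offline on a small snippet, which makes it easier to see what the loop changes. This is only a sketch: the markup, the timestamp, and the helper name fill_cells_from_attribute are hypothetical, and it copies the raw attribute value without converting the time zone.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question: the time cell is empty
# and the value lives in the data-date attribute.
SAMPLE = """
<table>
  <tr><th>match</th><th>time</th></tr>
  <tr>
    <td class="team">Home v Away</td>
    <td data-behavior="date_time" data-date="2019-09-15T17:00Z"></td>
  </tr>
</table>
"""


def fill_cells_from_attribute(html: str, attr: str = "data-date") -> str:
    """Return html with every empty <td> that has `attr` filled with its value."""
    soup = BeautifulSoup(html, "html.parser")
    # attrs={attr: True} matches any <td> on which the attribute is present.
    for td in soup.find_all("td", attrs={attr: True}):
        if not td.get_text(strip=True):
            td.string = td[attr]
    return str(soup)


patched = fill_cells_from_attribute(SAMPLE)
print(patched)
```

The patched string can then be handed to pd.read_html() (wrapped in io.StringIO on recent pandas versions), at which point the time column should contain text rather than NaN.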

edited Jul 8, 2019 at 19:19

answered Jul 7, 2019 at 3:01


5 Comments

serra Over a year ago

Thanks a lot! I highly appreciate your sharing of knowledge.

serra Over a year ago

It works great, but I need further help, if you can. I posted it below.

serra Over a year ago

Clockwatcher, is there a good manual or reference text so that I can study in more detail how BS works? Thanks again.

clockwatcher Over a year ago

@gosci - I'm sure you've already found it, but the Beautiful Soup documentation covers pretty much anything you'd ever want to know about it: crummy.com/software/BeautifulSoup/bs4/doc

serra Over a year ago

Yes, I found everything I needed. Thank you very much!

serra · Accepted Answer · 2019-07-08 20:01:38Z

0

Your code works perfectly, but I need the text that comes after the 'href' attribute, which is '6:00 PM':

So I modified your code like this:

for td in espn_schedule.find_all('a', {'data-dateformat': 'time1'}):
    td.string = td.get('href')

With this I successfully reach the element I want, but I don't know how to extract the text after it (which is '6:00 PM'). How can I do that?

edited Jul 8, 2019 at 20:01

answered Jul 8, 2019 at 17:39


1 Comment

clockwatcher Over a year ago

The data-date is given in UTC, so your 6:00 PM may be my 4:00 PM. To get what you're after, you just need to convert the given UTC time to your local time. The easiest way to do that is with the python-dateutil pip package. I modified my post above to format it with just the local time.
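
For illustration, the UTC-to-local conversion described in this comment can also be sketched with the standard library alone. The timestamp format ('2019-09-15T17:00Z') and the helper name utc_attr_to_local are assumptions; the real attribute value may be formatted differently, in which case python-dateutil's parser is the more forgiving choice.

```python
from datetime import datetime, timedelta, timezone, tzinfo
from typing import Optional


def utc_attr_to_local(value: str, tz: Optional[tzinfo] = None) -> str:
    """Convert a UTC timestamp such as '2019-09-15T17:00Z' to 'HH:MM AM/PM'."""
    # Assumed format: ISO date, 'T', hours:minutes, and a trailing 'Z' for UTC.
    utc = datetime.strptime(value, "%Y-%m-%dT%H:%MZ").replace(tzinfo=timezone.utc)
    # With tz=None, astimezone() converts to this machine's local time zone.
    local = utc.astimezone(tz)
    return local.strftime("%I:%M %p")


# A fixed UTC+1 offset keeps the example reproducible on any machine.
utc_plus_one = timezone(timedelta(hours=1))
print(utc_attr_to_local("2019-09-15T17:00Z", utc_plus_one))
```

Calling utc_attr_to_local(value) without a time zone uses the local zone of the machine running the script, which matches the behaviour of tz.gettz() in the answer.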
