I am using pandas to scrape a website, but one whole column comes back as NaN instead of the actual values. I have tried changing several read_html() parameters, such as flavor, converters, and na_values, without success. I noticed that the HTML of the problem column differs from the rest: the other cells are of the form 'td class=', while the one that is not read properly is 'td data-behavior='. When I simply copy and paste the table into Excel, everything pastes correctly. I would appreciate any help.

I have also tried to get the table using lxml/XPath, and that did not work either.

import pandas as pd

week_data = pd.read_html('https://www.espn.co.uk/nfl/fixtures/_/week/2/seasontype/1',
                          converters={'time': str})

The column should have strings containing the time of the match.

  • If the page uses JavaScript to add the data, you can't get it with pandas, requests/urllib, or lxml/BeautifulSoup, because they can't run JavaScript. You may need Selenium to control a web browser, which will run the JavaScript so that you can then get the rendered HTML. Selenium-Python Commented Jul 7, 2019 at 0:45
  • Thanks! I have never used selenium but I will look into it :) Commented Jul 7, 2019 at 1:11

2 Answers

They embed the date and time in the data-date attribute, so rather than resorting to Selenium, another option is simply to pull that attribute out and write it into the td element using BeautifulSoup.

from io import StringIO

from bs4 import BeautifulSoup
import requests
import pandas as pd
from dateutil import parser, tz

espn_page = requests.get('https://www.espn.co.uk/nfl/fixtures/_/week/2/seasontype/1')
soup = BeautifulSoup(espn_page.content, 'html.parser')
espn_schedule = soup.find('div', {'class': 'main-content'})
# The time cells are empty in the static HTML; the value lives in data-date (UTC).
for td in espn_schedule.find_all('td', {'data-behavior': 'date_time'}):
    utc = parser.parse(td.get('data-date'))
    localtime = utc.astimezone(tz.gettz())  # convert to this machine's time zone
    td.string = localtime.strftime("%I:%M")

# Wrap the HTML in StringIO: newer pandas versions deprecate passing literal HTML strings.
df = pd.read_html(StringIO(str(espn_schedule)))
print(df[0].columns)
print(df[0][df[0].columns[2]])
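
To see why read_html() gives NaN for that column, here is a minimal sketch that uses only the standard library, so it runs without fetching the page. The HTML row is a made-up stand-in for the real markup, and the data-date value format is an assumption; the point is only that the cell carries its value in an attribute and has no text for a table parser to read.

```python
from html.parser import HTMLParser

# Hypothetical snippet imitating the page's static HTML: the time cell has no
# text, only a data-date attribute (the visible text is filled in by JavaScript).
SAMPLE_ROW = (
    '<tr><td class="team">Team A</td>'
    '<td data-behavior="date_time" data-date="2019-09-15T17:00Z"></td></tr>'
)


class CellCollector(HTMLParser):
    """Collect the attributes and text of every <td> in the fed HTML."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append({"attrs": dict(attrs), "text": ""})

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        # Only text that sits inside a <td> is recorded.
        if self._in_td:
            self.cells[-1]["text"] += data


collector = CellCollector()
collector.feed(SAMPLE_ROW)
collector.close()
cells = collector.cells

# Print each cell's text next to its data-date attribute (None if absent).
for cell in cells:
    print(repr(cell["text"]), cell["attrs"].get("data-date"))
```

The second cell has empty text but a populated data-date, which is why copying the attribute into td.string before calling read_html() fills the column.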

5 Comments

Thanks a lot! I highly appreciate your sharing of knowledge.
It works great, but I need further help, if you can. I posted it below.
Clockwatcher, is there a good manual or reference text so that I can study in more detail how BS works? Thanks again.
@gosci - I'm sure you've already found it but the beautiful soup documentation covers pretty much anything you'd ever want to know about it -- crummy.com/software/BeautifulSoup/bs4/doc
Yes, I found everything I needed. Thank you very much!
Your code works perfectly, but what I actually need is the text shown after the 'href' attribute, which is '6:00 PM'.

So I modified your code like this:

for td in espn_schedule.find_all('a', {'data-dateformat': 'time1'}):
    td.string = td.get('href')

And I successfully reach the element I want, but I don't know how to extract the text after it (which is '6:00 PM'). How can I do that?

1 Comment

The data-date is given in UTC, so your 6:00 PM may be my 4:00 PM. To get what you're after, you just need to convert the given UTC time to your local time. The easiest way to do that is with the python-dateutil pip package. I modified my post above to format it with just the local time.
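
For reference, the same UTC-to-local conversion can be sketched with the standard library alone. This is only an illustration under assumptions: the data-date format shown ("2019-09-15T17:00Z") is assumed, the helper name utc_attr_to_local is hypothetical, and a fixed UTC+1 offset stands in for UK summer time instead of a real time zone lookup. It also adds AM/PM so the result looks like the '6:00 PM' shown on the page.

```python
from datetime import datetime, timedelta, timezone

# Assumed format of the data-date attribute, e.g. "2019-09-15T17:00Z".
ASSUMED_FORMAT = "%Y-%m-%dT%H:%MZ"


def utc_attr_to_local(data_date, local_tz):
    """Parse a UTC data-date string and format it in local_tz, e.g. '6:00 PM'."""
    # strptime yields a naive datetime, so mark it explicitly as UTC first.
    utc = datetime.strptime(data_date, ASSUMED_FORMAT).replace(tzinfo=timezone.utc)
    local = utc.astimezone(local_tz)
    # %I is zero-padded ("06:00 PM"); drop the leading zero for display.
    return local.strftime("%I:%M %p").lstrip("0")


# Hypothetical fixed offset standing in for UK summer time (UTC+1).
uk_summer = timezone(timedelta(hours=1))
kickoff = utc_attr_to_local("2019-09-15T17:00Z", uk_summer)
print(kickoff)
```

In the loop from the answer, the returned string would be assigned to td.string. A fixed offset ignores daylight-saving changes, so dateutil's tz.gettz() remains the more general choice.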
