
Running Python 3.6.1 |Anaconda 4.4.0 (64-bit) on a Windows device.

Using selenium I gather the following html source:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

url = "https://nextgenstats.nfl.com/stats/receiving#yards"
driver = webdriver.Chrome(executable_path=r"C:/Program Files (x86)/Google/Chrome/chromedriver.exe")
driver.get(url)
htmlSource = driver.page_source

If you visit the URL, you will see a nice table that is loaded dynamically. I am unsure how this table can be extracted from htmlSource so that a pandas DataFrame can be constructed from it.

  • pandas has read_html() which can find all <table> in a file. Commented Dec 15, 2017 at 8:45
  • @furas read_html() without BeautifulSoup returned an error saying no tables were found. The answer from COLDSPEED works. Commented Dec 15, 2017 at 8:49
  • It was not an answer, only a comment suggesting what to use. Commented Dec 15, 2017 at 8:57

2 Answers


You're pretty close; you just need to help pandas a bit. In a nutshell, here's what you need to do:

  1. Load the source into BeautifulSoup
  2. Find the table in question. Use soup.find
  3. Call pd.read_html
from bs4 import BeautifulSoup
import pandas as pd

soup = BeautifulSoup(htmlSource, 'html.parser')
# The stats grid lives inside this div; hand just that chunk to pandas
table = soup.find('div', class_='ngs-data-table')

df_list = pd.read_html(table.prettify())

Now, df_list contains a list of every table pandas found inside that div -

df_list[1].head()

                0    1   2    3    4     5      6   7    8      9     10  11
0    Antonio Brown  PIT  WR  4.3  2.6  13.7  45.32  99  160  61.88  1509   9
1  DeAndre Hopkins  HOU  WR  4.6  2.1  13.1  42.19  88  155  56.77  1232  11
2     Adam Thielen  MIN  WR  5.8  2.6  11.0  37.38  80  124  64.52  1161   4
3      Julio Jones  ATL  WR  5.2  2.4  14.2  43.34  73  118  61.86  1161   3
4     Keenan Allen  LAC  WR  5.4  2.6   9.5  31.30  83  129  64.34  1143   5
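Note that the scraped table comes back with integer column labels (0 through 11), since the header row isn't part of the parsed markup. A minimal sketch of assigning readable names afterwards; the toy frame and column names below are illustrative, not taken from the NGS page:

```python
import pandas as pd

# Toy frame standing in for df_list[1]; the real column names are not
# present in the scraped HTML, so these names are an assumption.
df = pd.DataFrame([["Antonio Brown", "PIT", "WR", 1509],
                   ["DeAndre Hopkins", "HOU", "WR", 1232]])
df.columns = ["player", "team", "position", "yards"]
print(df["yards"].max())  # prints 1509
```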

4 Comments

Awesome, BeautifulSoup was the missing link.
@sunspots There may be another way to do this, but as far as I know, this is by far the easiest way. All it takes is examining the data, pinpointing the table, and the rest, as they say, is history.
In a day or two I might put a bounty on this question just to see if anyone has any other tricks they'd like to share, using any of read_html's multitudinous arguments.
@COLDSPEED dryscape might be another way, but the support for Windows seems complicated: pypi.python.org/pypi/dryscrape/1.0
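To illustrate a couple of those read_html arguments on a self-contained snippet (not the NGS page itself, whose markup may differ): attrs restricts parsing to <table> elements whose attributes match, and the <th> row is picked up as the header automatically.

```python
import io
import pandas as pd

# Two tables, but attrs={"class": "stats"} matches only the first one
html = """
<table class="stats"><tr><th>player</th><th>yards</th></tr>
<tr><td>Antonio Brown</td><td>1509</td></tr></table>
<table class="other"><tr><td>ignored</td></tr></table>
"""

tables = pd.read_html(io.StringIO(html), attrs={"class": "stats"})
print(len(tables))           # 1 - only the "stats" table is returned
print(tables[0].iloc[0, 1])  # 1509
```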

As a Scrapy user, I'm used to looking at XHR requests. If you change the year on the site, you'll see the API call to https://appapi.ngs.nfl.com/statboard/receiving?season=2017&seasonType=REG

The API returns JSON, so it makes sense to use a JSON parser like read_json for the data.


Here's how you can use this in the Scrapy shell:

$ scrapy shell

In [1]: fetch("https://appapi.ngs.nfl.com/statboard/receiving?season=2017&seasonType=REG")
2017-12-15 13:11:30 [scrapy.core.engine] INFO: Spider opened
2017-12-15 13:11:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://appapi.ngs.nfl.com/statboard/receiving?season=2017&seasonType=REG> (referer: None)

In [2]: import pandas as pd

In [3]: data = pd.read_json(response.body)

In [4]: data.keys()
Out[4]: Index([u'season', u'seasonType', u'stats', u'threshold'], dtype='object')

In [5]: pd.DataFrame(list(data['stats']))

If you don't have Scrapy, you can use requests instead:

import requests
import pandas as pd

url = "https://appapi.ngs.nfl.com/statboard/receiving?season=2017&seasonType=REG"

response = requests.get(url)
# The top-level JSON has season/seasonType/stats/threshold keys;
# the per-player records live under 'stats'
data = pd.read_json(response.text)
df = pd.DataFrame(list(data['stats']))
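If the per-player records nest further dicts inside them, pandas can flatten those in one step with json_normalize (pd.json_normalize in pandas >= 1.0). The payload shape below is an assumption for illustration, not the documented API schema:

```python
import pandas as pd

# Toy payload mimicking the /statboard/receiving response; the field
# names here are assumptions, not the real API schema.
payload = {
    "season": 2017,
    "stats": [
        {"player": {"displayName": "Antonio Brown", "team": "PIT"}, "yards": 1509},
        {"player": {"displayName": "DeAndre Hopkins", "team": "HOU"}, "yards": 1232},
    ],
}

# Nested "player" dicts become dotted columns: player.displayName, player.team
df = pd.json_normalize(payload["stats"])
print(sorted(df.columns))
```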

