
I am very new to scraping, and am trying to pull data from a section of this website - https://projects.fivethirtyeight.com/soccer-predictions/premier-league/. The data I'm trying to get is in the second tab, "Matches," and is the section titled "Upcoming Matches."

I have attempted to do this with SelectorGadget and rvest, as follows:

library(rvest)
url <- "https://projects.fivethirtyeight.com/soccer-predictions/premier-league/"
# read_html() downloads and parses the page; html_nodes() needs a parsed document, not a URL string
read_html(url) %>%
   html_nodes(".prob, .name") %>%
   html_text()

This returns values, but they correspond to the first tab on the page, "Standings." How can I select the section that I am actually trying to pull?

  • This page uses JavaScript to load data when you click Matches, and rvest probably can't run JavaScript. You may need RSelenium to control a real web browser, which can run JavaScript. Or you can use DevTools in Firefox/Chrome to find the URL that the JavaScript uses to download the data; usually it fetches JSON. Commented Mar 17, 2022 at 17:24
  • It loads some values from projects.fivethirtyeight.com/soccer-predictions/forecasts/… Commented Mar 17, 2022 at 17:28

1 Answer


First: I don't know R, only Python.

When you click Matches, the page uses JavaScript to generate the matches, and it loads JSON data from:

https://projects.fivethirtyeight.com/soccer-predictions/forecasts/2021_premier-league_forecast.json

https://projects.fivethirtyeight.com/soccer-predictions/forecasts/2021_premier-league_matches.json

https://projects.fivethirtyeight.com/soccer-predictions/forecasts/2021_premier-league_clinches.json

I checked only one of them, 2021_premier-league_matches.json, and I see it has the data for Completed Matches.
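Since only one of the three files was checked, a quick way to see what the others hold is to summarize each decoded payload instead of reading it by eye. This is a minimal sketch: `summarize` is a hypothetical helper name, and it assumes nothing about the forecast or clinches files beyond their being valid JSON.

```python
def summarize(payload):
    """Describe a decoded JSON payload: its container type, size, and keys."""
    if isinstance(payload, list):
        # For a list of records, report the keys of the first record.
        first = payload[0] if payload else None
        keys = sorted(first) if isinstance(first, dict) else []
        return {'type': 'list', 'length': len(payload), 'keys': keys}
    if isinstance(payload, dict):
        return {'type': 'dict', 'length': len(payload), 'keys': sorted(payload)}
    # Scalars (numbers, strings, null) have no length or keys to report.
    return {'type': type(payload).__name__, 'length': None, 'keys': []}
```

It can be called on any decoded response, for example `summarize(requests.get(url).json())`, to decide which file is worth a closer look.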


I made an example in Python:

import requests

url = 'https://projects.fivethirtyeight.com/soccer-predictions/forecasts/2021_premier-league_matches.json'

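response = requests.get(url)
response.raise_for_status()  # stop early on an HTTP error
data = response.json()  # list of dicts, one per match

for item in data:
    # keep only matches on the given date
    if item['datetime'].startswith('2022-03-16'):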

        print('team1:', item['team1_code'], '|', item['team1'])
        print('prob1:', item['prob1'])
        print('score1:', item['score1'])
        print('adj_score1:', item['adj_score1'])
        print('chances1:', item['chances1'])
        print('moves1:', item['moves1'])
        print('---')

        print('team2:', item['team2_code'], '|', item['team2'])
        print('prob2:', item['prob2'])
        print('score2:', item['score2'])
        print('adj_score2:', item['adj_score2'])
        print('chances2:', item['chances2'])
        print('moves2:', item['moves2'])

        print('----------------------------------------')

Result:

team1: BHA | Brighton and Hove Albion
prob1: 0.30435
score1: 0
adj_score1: 0.0
chances1: 1.244
moves1: 1.682
---
team2: TOT | Tottenham Hotspur
prob2: 0.43627
score2: 2
adj_score2: 2.1
chances2: 1.924
moves2: 1.056
----------------------------------------
team1: ARS | Arsenal
prob1: 0.22114
score1: 0
adj_score1: 0.0
chances1: 0.569
moves1: 0.514
---
team2: LIV | Liverpool
prob2: 0.55306
score2: 2
adj_score2: 2.1
chances2: 1.243
moves2: 0.813
----------------------------------------
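The question asks for upcoming matches rather than one fixed date, so the same data could be filtered by date instead. This is only a sketch under stated assumptions: `fetch_matches` and `upcoming_matches` are hypothetical helper names, it uses the standard library's `urllib` in place of `requests`, and it assumes every `datetime` value begins with an ISO date (`YYYY-MM-DD`), as the `startswith('2022-03-16')` check above suggests.

```python
import json
from datetime import date
from urllib.request import urlopen

MATCHES_URL = ('https://projects.fivethirtyeight.com/soccer-predictions/'
               'forecasts/2021_premier-league_matches.json')


def fetch_matches(url=MATCHES_URL):
    """Download and decode the matches JSON (a list of dicts)."""
    with urlopen(url) as response:
        return json.load(response)


def upcoming_matches(matches, today=None):
    """Return the matches dated today or later.

    Assumption: item['datetime'] starts with YYYY-MM-DD, so comparing
    the first 10 characters as strings orders them by calendar date.
    """
    cutoff = (today or date.today()).isoformat()
    return [m for m in matches if m['datetime'][:10] >= cutoff]


# Example usage (requires network access):
# for m in upcoming_matches(fetch_matches()):
#     print(m['datetime'], m['team1'], m['prob1'], m['team2'], m['prob2'])
```

If the season in the file has already finished, the list will be empty; passing an explicit `today`, such as `date(2022, 3, 17)`, shows what the filter returns for a given day.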