BeautifulSoup output to dataframe in Python

Question

I am having trouble with web scraping. I need to scrape data from a betting site and store it in a dataframe.

My code:

import numpy as np
import pandas as pd
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
import time

DRIVER_PATH = 'C:\\executables\\chromedriver.exe'

options = Options()
# run Chrome without a visible window
options.add_argument("--headless")
options.add_argument("--window-size=1920,1200")

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(DRIVER_PATH), options=options)

driver.get("https://www.nike.sk/live-stavky/futbal")

time.sleep(10)


soup = BeautifulSoup(driver.page_source, 'html.parser')

# match time
out_1 = soup.find_all(class_='ellipsis flex fs-10 c-black-50 justify-between pr-5')
# home and away teams
out_2 = soup.find_all(class_='ellipsis f-condensed c-black-100 text-extra-bold match-opponents pr-10')
# match status
out_3 = soup.find_all(class_='flex justify-center text-right flex-col match-score-col fs-12 c-orange text-extra-bold')
# match status 2
out_4 = soup.find_all(class_='flex justify-center text-right flex-col match-score-col fs-12 text-extra-bold c-default-light')

My output (out_1, ..., out_4) is messy blocks of text. How can I put it into a complete dataframe? Can I convert it to a dataframe without regex?

Andrej Kesely · Accepted Answer · 2023-01-10 09:48:14Z


You can use their Ajax API to download the data in JSON format, then build a dataframe from that data:

import json
import re

import pandas as pd
import requests

# Ajax endpoint the page itself calls to fetch the live football overview
url = "https://push.nike.sk/snapshot?path=%2Fn1%2Foverview%2Ffutbal%2Ftournaments%2F"

# the security token is embedded in the HTML of the regular page
html_doc = requests.get("https://www.nike.sk/live-stavky/futbal").text
token = re.search(r'"securityToken":"([^"]+)"', html_doc).group(1)

# the last item of the first element in the response is itself a JSON string
data = json.loads(requests.get(url, headers={"x-security-token": token}).json()[0][-1])

# collect team names and total scores for each match
all_data = []
for m in data["matches"]:
    s1 = m["score"]["scores"]["TOTAL"]["home"]
    s2 = m["score"]["scores"]["TOTAL"]["away"]
    all_data.append((m["home"]["en"], m["away"]["en"], s1, s2))

df = pd.DataFrame(all_data, columns=["Team 1", "Team 2", "Score 1", "Score 2"])
print(df)

Prints:

                        Team 1             Team 2 Score 1 Score 2
0                Barito Putera           Makassar       1       1
1               Rahmatgonj MFS       Sheikh Jamal       2       0
2  Stredoafrická republika SRL        Etiópia SRL       2       1
3                   Kosovo SRL       Arménsko SRL       0       0
4             Mohammedan Dhaka  Azampur FC Uttara       3       0
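
As an alternative to the manual loop, pandas can flatten the nested match dictionaries itself with `pd.json_normalize`. The sketch below is only an illustration: `sample_matches` is hypothetical data that mimics the keys read in the code above (`home.en`, `away.en`, `score.scores.TOTAL`), and the score values are assumed to be integers. The real response may contain more fields or different types.

```python
import pandas as pd

# Hypothetical sample mimicking the nested shape read from data["matches"] above.
sample_matches = [
    {
        "home": {"en": "Barito Putera"},
        "away": {"en": "Makassar"},
        "score": {"scores": {"TOTAL": {"home": 1, "away": 1}}},
    },
    {
        "home": {"en": "Rahmatgonj MFS"},
        "away": {"en": "Sheikh Jamal"},
        "score": {"scores": {"TOTAL": {"home": 2, "away": 0}}},
    },
]

# json_normalize flattens nested dicts into dotted column names,
# e.g. "score.scores.TOTAL.home".
flat = pd.json_normalize(sample_matches)

# Keep only the columns of interest and give them readable names.
columns = {
    "home.en": "Team 1",
    "away.en": "Team 2",
    "score.scores.TOTAL.home": "Score 1",
    "score.scores.TOTAL.away": "Score 2",
}
df_flat = flat[list(columns)].rename(columns=columns)
print(df_flat)
```

With the live data, passing `data["matches"]` instead of `sample_matches` should give the same four columns, provided every match carries those keys.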


2 Comments

314mip Over a year ago

Thank you, your code works great. But for me it's a bit like a magic formula... :D Please, how did you get the URL (push.nike.sk...)?

Andrej Kesely Over a year ago

@314mip Open the web developer tools -> Network tab in Chrome or Firefox and reload the page. You should see this URL there along with the data. (The page uses JavaScript to fetch and render the data.)
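
If you prefer to stay with the rendered HTML from Selenium, as in the question, the tag lists can be turned into a dataframe without regex by calling `get_text()` on each tag and building one row per match. This is a minimal sketch on hypothetical, simplified markup: the `match`, `match-time` and `match-opponents` wrapper structure is an assumption for illustration, and the selectors would need adapting to the real page.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page uses much longer class lists.
html = """
<div class="match">
  <span class="match-time">45'</span>
  <span class="match-opponents">Team A - Team B</span>
</div>
<div class="match">
  <span class="match-time">12'</span>
  <span class="match-opponents">Team C - Team D</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
# Iterate per match container so the fields of one match stay in one row.
for match in soup.select("div.match"):
    time_tag = match.select_one(".match-time")
    teams_tag = match.select_one(".match-opponents")
    rows.append(
        {
            # get_text(strip=True) returns the tag's text without surrounding whitespace
            "Time": time_tag.get_text(strip=True) if time_tag else None,
            "Teams": teams_tag.get_text(strip=True) if teams_tag else None,
        }
    )

df_html = pd.DataFrame(rows)
print(df_html)
```

Looping over a per-match container, rather than over four separate `find_all` lists, keeps the columns aligned when a match lacks one of the fields.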
