
I'm having trouble scraping the player hyperlinks from the webpage below: my code only prints players from the menu at the bottom of the page, rather than the players listed in the box score for the game. What needs to change so that I can get the players for the Minnesota Twins and the Angels as listed?

import requests
from bs4 import BeautifulSoup

# URL of the webpage
url = "https://www.baseball-reference.com/boxes/ANA/ANA202305210.shtml"

# Send a GET request to the webpage
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content of the webpage using BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Find all hyperlink elements on the page with "/players/" in the href attribute
    links = soup.find_all('a', href=lambda href: href and '/players/' in href)
    
    # Extract and print the href attribute of each matching hyperlink
    for link in links:
        href = link.get('href')
        print(href)
else:
    print("Failed to fetch the webpage.")

2 Answers


If you inspect the source code of the page (Ctrl + U) in your browser, you will see that the tables are stored inside HTML comments (<!-- ... -->), so BeautifulSoup doesn't see them.

You can load the page, find all relevant comment sections, and parse them into a new BeautifulSoup object. Then the player links become visible:

import requests
from bs4 import BeautifulSoup, Comment

url = "https://www.baseball-reference.com/boxes/ANA/ANA202305210.shtml"

response = requests.get(url)
response.raise_for_status()

soup = BeautifulSoup(response.content, "html.parser")

# the key part:
# convert the HTML comment section <!-- ... --> to new BeautifulSoup object
new_soup = ""
for c in soup.find_all(string=Comment):
    new_soup += c if c.strip().startswith("<") else ""

new_soup = BeautifulSoup(new_soup, "html.parser")
links = new_soup.find_all("a", href=lambda href: href and "/players/" in href)

for link in links:
    href = link.get("href")
    print(f"{link.text:<30} {href}")

Prints:

Joey Gallo                     /players/g/gallojo01.shtml
Carlos Correa                  /players/c/correca01.shtml
Alex Kirilloff                 /players/k/kirilal01.shtml
Edouard Julien                 /players/j/julieed01.shtml
Kyle Farmer                    /players/f/farmeky01.shtml
Trevor Larnach                 /players/l/larnatr01.shtml
Willi Castro                   /players/c/castrwi01.shtml
Donovan Solano                 /players/s/solando01.shtml
Ryan Jeffers                   /players/j/jeffery01.shtml
Pablo López                    /players/l/lopezpa01.shtml
Jorge López                    /players/l/lopezjo02.shtml
José De León                   /players/d/deleojo03.shtml

...and so on.



The reason why your code does not work is that the links that you are trying to scrape are not present in the page HTML, but are added upon page load using JavaScript. You can verify this by disabling JavaScript in your browser and loading the webpage.

To scrape a website after it has finished processing JavaScript code, you can use Selenium. This works by actually launching a web browser, loading the page inside it, and then inspecting the loaded result.

Here is an example of how your task can be solved using Selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By

URL = "https://www.baseball-reference.com/boxes/ANA/ANA202305210.shtml"

# Use Google Chrome / Chromium, but in headless mode so that its window does not open
options = webdriver.ChromeOptions()
options.add_argument("--headless")

# Initialize the browser driver
driver = webdriver.Chrome(options=options)
try:
    # Open the webpage
    driver.get(URL)

    # Locate the links
    link_objects = driver.find_elements(By.XPATH, '//a[contains(@href, "/players/")]')

    # Extract the data
    link_urls = [link_object.get_attribute("href") for link_object in link_objects]
    link_texts = [link_object.text for link_object in link_objects]
finally:
    # Close the browser
    driver.quit()

# Print the extracted data
for text, link_url in zip(link_texts, link_urls):
    print(f"{text:<30} {link_url}")

In this specific case, it might be possible to extract the data from the page comments as described in the answer by @Andrej Kesely. However, this will only work under the assumption that the comment always contains the same data that is displayed on the actual page, and not just some kind of placeholder or an out-of-date version. To verify this, it might be necessary to experiment or inspect the site's JavaScript code. On the other hand, when using Selenium, we can be reasonably confident that we are looking at the same data that actually shows up in a real browser.
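If you do want to cross-check the two approaches, one lightweight way (a hypothetical helper; the two lists would come from the comment-parsing and Selenium scripts above) is to diff the href sets and report anything that appears in only one of them:

```python
def diff_links(comment_hrefs, selenium_hrefs):
    """Compare hrefs scraped from HTML comments with those from Selenium.

    Returns a pair of sets: links found only in the comment data, and links
    found only in the rendered page. Both empty means the sources agree.
    """
    a, b = set(comment_hrefs), set(selenium_hrefs)
    return a - b, b - a

# Toy illustration with made-up relative paths:
only_comments, only_rendered = diff_links(
    ["/players/g/gallojo01.shtml", "/players/c/correca01.shtml"],
    ["/players/c/correca01.shtml"],
)
print(only_comments)  # {'/players/g/gallojo01.shtml'}
print(only_rendered)  # set()
```

Note that Selenium's `get_attribute("href")` returns absolute URLs while the comment-parsed soup yields relative paths, so you would need to strip the domain from one side before comparing.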

