
I'm having trouble scraping the player hyperlinks from the webpage below: my code only prints players from the menu at the bottom of the page, rather than the players listed in the box score for the game. What needs to change so that I can get the players for the Minnesota Twins and the Angels as listed?

import requests
from bs4 import BeautifulSoup

# URL of the webpage
url = "https://www.baseball-reference.com/boxes/ANA/ANA202305210.shtml"

# Send a GET request to the webpage
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content of the webpage using BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Find all hyperlink elements on the page with "/players/" in the href attribute
    links = soup.find_all('a', href=lambda href: href and '/players/' in href)
    
    # Extract and print the href attribute of each matching hyperlink
    for link in links:
        href = link.get('href')
        print(href)
else:
    print("Failed to fetch the webpage.")

2 Answers


If you inspect the source code of the page (Ctrl + U) in your browser, you will see that the tables are stored inside HTML comments (<!-- ... -->), so BeautifulSoup doesn't see them.

You can load the page, find all relevant comment sections, and parse them into a new BeautifulSoup object. Then the player links become visible:

import requests
from bs4 import BeautifulSoup, Comment

url = "https://www.baseball-reference.com/boxes/ANA/ANA202305210.shtml"

response = requests.get(url)
response.raise_for_status()

soup = BeautifulSoup(response.content, "html.parser")

# the key part:
# convert the HTML comment section <!-- ... --> to new BeautifulSoup object
new_soup = ""
for c in soup.find_all(string=Comment):
    new_soup += c if c.strip().startswith("<") else ""

new_soup = BeautifulSoup(new_soup, "html.parser")
links = new_soup.find_all("a", href=lambda href: href and "/players/" in href)

for link in links:
    href = link.get("href")
    print(f"{link.text:<30} {href}")

Prints:

Joey Gallo                     /players/g/gallojo01.shtml
Carlos Correa                  /players/c/correca01.shtml
Alex Kirilloff                 /players/k/kirilal01.shtml
Edouard Julien                 /players/j/julieed01.shtml
Kyle Farmer                    /players/f/farmeky01.shtml
Trevor Larnach                 /players/l/larnatr01.shtml
Willi Castro                   /players/c/castrwi01.shtml
Donovan Solano                 /players/s/solando01.shtml
Ryan Jeffers                   /players/j/jeffery01.shtml
Pablo López                    /players/l/lopezpa01.shtml
Jorge López                    /players/l/lopezjo02.shtml
José De León                   /players/d/deleojo03.shtml

...and so on.



The reason why your code does not work is that the links that you are trying to scrape are not present in the page HTML, but are added upon page load using JavaScript. You can verify this by disabling JavaScript in your browser and loading the webpage.

To scrape a website after it has finished processing JavaScript code, you can use Selenium. This works by actually launching a web browser, loading the page inside it, and then inspecting the loaded result.

Here is an example of how your task can be solved using Selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By

URL = "https://www.baseball-reference.com/boxes/ANA/ANA202305210.shtml"

# Use Google Chrome / Chromium, but in headless mode so that its window does not open
options = webdriver.ChromeOptions()
options.add_argument("--headless")

# Initialize the browser driver
driver = webdriver.Chrome(options=options)
try:
    # Open the webpage
    driver.get(URL)

    # Locate the links
    link_objects = driver.find_elements(By.XPATH, '//a[contains(@href, "/players/")]')

    # Extract the data
    link_urls = [link_object.get_attribute("href") for link_object in link_objects]
    link_texts = [link_object.text for link_object in link_objects]
finally:
    # Close the browser
    driver.quit()

# Print the extracted data
for text, link_url in zip(link_texts, link_urls):
    print(f"{text:<30} {link_url}")

In this specific case, it might be possible to extract the data from the page comments as described in the answer by @Andrej Kesely. However, this will only work under the assumption that the comment always contains the same data that is displayed on the actual page, and not just some kind of placeholder or an out-of-date version. To verify this, it might be necessary to experiment or inspect the site's JavaScript code. On the other hand, when using Selenium, we can be reasonably confident that we are looking at the same data that actually shows up in a real browser.
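If you do want to cross-check the two approaches, one lightweight way (a hypothetical helper; the two lists would come from the comment-parsing and Selenium scripts above) is to diff the href sets and report anything that appears in only one of them:

```python
def diff_links(comment_hrefs, selenium_hrefs):
    """Compare hrefs scraped from HTML comments with those from Selenium.

    Returns a pair of sets: links found only in the comment data, and links
    found only in the rendered page. Both empty means the sources agree.
    """
    a, b = set(comment_hrefs), set(selenium_hrefs)
    return a - b, b - a

# Toy illustration with made-up relative paths:
only_comments, only_rendered = diff_links(
    ["/players/g/gallojo01.shtml", "/players/c/correca01.shtml"],
    ["/players/c/correca01.shtml"],
)
print(only_comments)  # {'/players/g/gallojo01.shtml'}
print(only_rendered)  # set()
```

Note that Selenium's `get_attribute("href")` returns absolute URLs while the comment-parsed soup yields relative paths, so you would need to strip the domain from one side before comparing.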

