
I need to scrape the telephone numbers and the email addresses from the following URL using Python:

import requests
from bs4 import BeautifulSoup

url = 'https://rma.cultura.gob.ar/#/app/museos/resultados?provincias=Buenos%20Aires'

source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')
print(soup)

The problem is that what I get back from requests.get is not the HTML that I need. I suppose the site uses JavaScript to show those results, but I'm not familiar with that since I'm just starting with Python. I worked around it by copying the source of each results page into a single text file and then extracting the emails with a regex, but I'm curious whether there is a simple way to access the data directly.
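
Roughly, my workaround looked like this (the file name is just an example, and the regex is a simple pattern that may miss unusual addresses):

import re

# read the page source that was copied into a text file by hand
with open('results_page.txt', encoding='utf-8') as f:  # example file name
    text = f.read()

# simple email pattern; good enough for this page, not a general matcher
emails = set(re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', text))
print(emails)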

  • Use something that is more like a real browser, such as Selenium. Commented Nov 18, 2022 at 22:11

2 Answers


The data you see on the page is loaded from an external URL via JavaScript. To get the data you can use the requests and json modules, for example:

import json
import requests

api_url = "https://rmabackend.cultura.gob.ar/api/museos"

params = {
    "estado": "Publicado",
    "grupo": "Museo",
    "o": "p",
    "ordenar": "nombre_oficial_institucion",
    "page": 1,
    "page_size": "12",
    "provincias": "Buenos Aires",
}

while True:
    # request one page of results as JSON
    data = requests.get(api_url, params=params).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    for d in data["data"]:
        print(d["attributes"]["nombre-oficial-institucion"])

    # stop once the last page has been fetched
    if params["page"] == data["meta"]["pagination"]["pages"]:
        break

    params["page"] += 1

Prints:

2 Museos, Bellas Artes y MAC
Archivo Histórico y Museo "Astillero Río Santiago" (ARS)
Archivo Histórico y Museo del Servicio Penitenciario Bonaerense
Archivo y Museo Historico Municipal Roberto T. Barili "Villa Mitre"
Asociación Casa Bruzzone
Biblioteca Popular y Museo "José Manuel Estrada"
Casa Museo "Haroldo Conti"
Casa Museo "Xul Solar" -  Tigre
Complejo Histórico y Museográfico "Dr. Alfredo Antonio Sabaté"


...and so on.
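
Since the goal is contact details rather than institution names, the same attributes dictionary should also contain those fields. The key names used below ("email", "telefono") are assumptions on my part; uncomment the json.dumps() line above to see the real keys before relying on them. Inside the loop you could then do something like:

for d in data["data"]:
    attrs = d["attributes"]
    # "email" and "telefono" are assumed key names; verify them against
    # the JSON the API actually returns
    print(
        attrs["nombre-oficial-institucion"],
        attrs.get("email"),
        attrs.get("telefono"),
    )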

The page uses AJAX to load its content. Using something like Selenium to simulate the browser allows all of the JavaScript to run, and then you can extract the rendered source:

from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.Chrome()
url = 'https://rma.cultura.gob.ar/#/app/museos/resultados?provincias=Buenos%20Aires'

# navigate to the page
driver.get(url)
# wait until a link with text 'ficha' has loaded
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.PARTIAL_LINK_TEXT, 'ficha')))
# grab the rendered HTML and parse it
source = driver.page_source
soup = BeautifulSoup(source, features='lxml')
driver.quit()
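
From there you can pull the contact details out of the rendered page. A minimal sketch using the same kind of regex approach the question mentions (both patterns are rough approximations, not exhaustive matchers):

import re

text = soup.get_text(' ')

# simple approximations; tighten the patterns as needed
emails = set(re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', text))
phones = set(re.findall(r'\(?\+?\d[\d\s().-]{6,}\d', text))

print(emails)
print(phones)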
