
I need to scrape the telephone numbers and the email addresses from the following URL using Python:

import requests
from bs4 import BeautifulSoup

url = 'https://rma.cultura.gob.ar/#/app/museos/resultados?provincias=Buenos%20Aires'

source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')
print(soup)

The problem is that what I get back from requests.get is not the HTML that I need. I suppose the site uses JavaScript to show those results, but I'm not familiar with that since I'm just starting with Python. I worked around it by copying the source of each results page into a single text file and then extracting the emails with a regex, but I'm curious whether there is a simple way to access the data directly.
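
Roughly, my workaround looked like this (the file name is just an example, and the regex is a simple pattern that may miss unusual addresses):

import re

# read the page source that was copied into a text file by hand
with open('results_page.txt', encoding='utf-8') as f:  # example file name
    text = f.read()

# simple email pattern; good enough for this page, not a general matcher
emails = set(re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', text))
print(emails)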

  • Use something that is more like a real browser, such as Selenium. Commented Nov 18, 2022 at 22:11

2 Answers


The data you see on the page is loaded from an external URL via JavaScript. To get the data you can use the requests and json modules, for example:

import json
import requests

api_url = "https://rmabackend.cultura.gob.ar/api/museos"

params = {
    "estado": "Publicado",
    "grupo": "Museo",
    "o": "p",
    "ordenar": "nombre_oficial_institucion",
    "page": 1,
    "page_size": "12",
    "provincias": "Buenos Aires",
}

while True:
    # request one page of results as JSON
    data = requests.get(api_url, params=params).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    for d in data["data"]:
        print(d["attributes"]["nombre-oficial-institucion"])

    # stop once the last page has been fetched
    if params["page"] == data["meta"]["pagination"]["pages"]:
        break

    params["page"] += 1

Prints:

2 Museos, Bellas Artes y MAC
Archivo Histórico y Museo "Astillero Río Santiago" (ARS)
Archivo Histórico y Museo del Servicio Penitenciario Bonaerense
Archivo y Museo Historico Municipal Roberto T. Barili "Villa Mitre"
Asociación Casa Bruzzone
Biblioteca Popular y Museo "José Manuel Estrada"
Casa Museo "Haroldo Conti"
Casa Museo "Xul Solar" -  Tigre
Complejo Histórico y Museográfico "Dr. Alfredo Antonio Sabaté"


...and so on.
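
Since the goal is contact details rather than institution names, the same attributes dictionary should also contain those fields. The key names used below ("email", "telefono") are assumptions on my part; uncomment the json.dumps() line above to see the real keys before relying on them. Inside the loop you could then do something like:

for d in data["data"]:
    attrs = d["attributes"]
    # "email" and "telefono" are assumed key names; verify them against
    # the JSON the API actually returns
    print(
        attrs["nombre-oficial-institucion"],
        attrs.get("email"),
        attrs.get("telefono"),
    )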

The page uses AJAX to load its content. Using something like Selenium to simulate the browser allows all of the JavaScript to run, and then you can extract the rendered source:

from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.Chrome()
url = 'https://rma.cultura.gob.ar/#/app/museos/resultados?provincias=Buenos%20Aires'

# navigate to the page
driver.get(url)
# wait until a link with text 'ficha' has loaded
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.PARTIAL_LINK_TEXT, 'ficha')))
# grab the rendered HTML and parse it
source = driver.page_source
soup = BeautifulSoup(source, features='lxml')
driver.quit()
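
From there you can pull the contact details out of the rendered page. A minimal sketch using the same kind of regex approach the question mentions (both patterns are rough approximations, not exhaustive matchers):

import re

text = soup.get_text(' ')

# simple approximations; tighten the patterns as needed
emails = set(re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', text))
phones = set(re.findall(r'\(?\+?\d[\d\s().-]{6,}\d', text))

print(emails)
print(phones)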
