I'm trying to scrape data from Mexico's Central Bank website but have hit a wall. In terms of actions, I need to first access a link within an initial URL. Once the link has been accessed, I need to select two dropdown values and then activate a submit button. If all goes well, I will be taken to a new URL where a set of links to PDFs is available.

The original URL is:

"http://www.banxico.org.mx/mercados/valores-gubernamentales-secto.html"

The nested URL (the one with the dropdowns) is: "http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces?BMXC_claseIns=GUB&BMXC_lang=es_MX"

The inputs (arbitrary) are, say: '07/03/2019' and '14/03/2019'.

Using BeautifulSoup and requests, I feel like I got as far as filling in the dropdown values, but I failed to click the button and reach the final URL with the list of links.

My code follows below:

from bs4 import BeautifulSoup
import requests

# Fetch the landing page and take the last link on it, which points to the form page
pagem = requests.get("http://www.banxico.org.mx/mercados/valores-gubernamentales-secto.html")
soupm = BeautifulSoup(pagem.content, "lxml")
lst = soupm.find_all('a', href=True)
url = lst[-1]['href']

# Fetch the form page and locate the two date dropdowns
page = requests.get(url)
soup = BeautifulSoup(page.content, "lxml")
xin = soup.find("select", {"id": "_id0:selectOneFechaIni"})
xfn = soup.find("select", {"id": "_id0:selectOneFechaFin"})
ino = list(xin.stripped_strings)   # available start dates
fino = list(xfn.stripped_strings)  # available end dates

# POST the chosen dates plus the submit button's name, imitating the form submit
headers = {'Referer': url}
data = {'_id0:selectOneFechaIni': '07/03/2019',
        '_id0:selectOneFechaFin': '14/03/2019',
        '_id0:accion': '_id0:accion'}
respo = requests.post(url, data, headers=headers)
print(respo.url)

In the code, respo.url ends up equal to url, i.e. the POST never takes me to the results page, so the code fails. Can anybody please help me identify where the problem is? I'm a newbie to scraping, so the mistake might be obvious - apologies in advance for that. I'd appreciate any help. Thanks!

3 Answers


Last time I checked, you cannot submit a form by clicking buttons with BeautifulSoup and Python: BeautifulSoup only parses HTML, it doesn't execute anything. There are two approaches I typically see:

  1. Reverse engineer the form

If the form makes AJAX calls (i.e. sends a request behind the scenes, common for SPAs written in React or Angular), then the best approach is to use the network requests tab in Chrome or another browser to find out what the endpoint is and what the payload is. Once you have those answers, you can make a POST request with the requests library to that endpoint with data=your_payload_dictionary (i.e. manually do what the form is doing behind the scenes; see the first sketch after this list). Read this post for a more elaborate tutorial.

  2. Use a headless browser

If the website is written in something like ASP.NET or a similar MVC framework, then the best approach is to use a headless browser to fill out the form and click submit. A popular framework for this is Selenium, which simulates a normal browser; see the second sketch below. Read this post for a more elaborate tutorial.
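
For approach #1, here is a minimal sketch of what replaying a form usually looks like once you have found the real endpoint and payload in the network tab. The endpoint URL and field names below are hypothetical placeholders, not the actual Banxico ones:

import requests

# Hypothetical endpoint and payload, copied from the request recorded
# in the browser's network tab (placeholders, not the real Banxico values)
endpoint = "http://example.com/some/form/endpoint"
payload = {
    "fieldA": "value1",
    "fieldB": "value2",
}
headers = {"Referer": "http://example.com/the/form/page"}

# Replay the form submission directly, bypassing any button click
response = requests.post(endpoint, data=payload, headers=headers)
print(response.status_code)
print(response.text[:500])  # inspect the beginning of the returned HTML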

Judging by a cursory look at the page you're working on, I recommend approach #2.
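
For approach #2, here is a minimal Selenium sketch. It assumes the element ids from your form data (_id0:selectOneFechaIni, _id0:selectOneFechaFin, _id0:accion) are the ones rendered on the page, that the dropdown option text matches the dd/mm/yyyy strings you used, and that the result links end in .pdf - none of which I have verified:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome()  # Selenium 4.6+ can fetch a matching chromedriver itself
driver.get("http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces?BMXC_claseIns=GUB&BMXC_lang=es_MX")

# Pick the two dates (ids taken from the question's form data)
Select(driver.find_element(By.ID, "_id0:selectOneFechaIni")).select_by_visible_text("07/03/2019")
Select(driver.find_element(By.ID, "_id0:selectOneFechaFin")).select_by_visible_text("14/03/2019")

# Click submit; assumes the button's name from the form data also works as a locator
driver.find_element(By.NAME, "_id0:accion").click()

# Collect the PDF links from the results page (assumes hrefs ending in .pdf)
links = [a.get_attribute("href")
         for a in driver.find_elements(By.CSS_SELECTOR, "a[href$='.pdf']")]
print(links)
driver.quit()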


1 Comment

Thanks v much - will give Selenium a try. Rgrds

The page you have to scrape is:

http://www.banxico.org.mx/valores/PresentaDetalleSectorizacionGubHist.faces

Add the date to consult and the JSESSIONID from the cookies to the payload, and put Referer, User-Agent and all the old good stuff in the request headers.

Example:

import requests
import pandas as pd

cl = requests.Session()
url = "http://www.banxico.org.mx/valores/PresentaDetalleSectorizacionGubHist.faces"

# The date to consult and the JSESSIONID from the cookies go into the form payload
payload = {
    "JSESSIONID": "cWQD8qxoNJy_fecAaN2k8N0IQ6bkQ7f3AtzPx4bWL6wcAmO0T809!-1120047000",
    "fechaAConsultar": "21/03/2019"
}

# Referer, User-Agent and content type, as a normal browser would send them
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36",
    "Content-Type": "application/x-www-form-urlencoded",
    "Referer": "http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces;jsessionid=cWQD8qxoNJy_fecAaN2k8N0IQ6bkQ7f3AtzPx4bWL6wcAmO0T809!-1120047000"
}

# POST the form and let pandas parse the HTML tables out of the response
response = cl.post(url, data=payload, headers=headers)
tables = pd.read_html(response.text)
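
The hard-coded JSESSIONID above will expire, though. A possible refinement (an untested sketch; it assumes the form page sets the JSESSIONID cookie on a plain GET) is to let the same session pick up a fresh one first:

# Untested sketch: hit the form page first so the session receives a JSESSIONID cookie
cl.get("http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces?BMXC_claseIns=GUB&BMXC_lang=es_MX")
jsessionid = cl.cookies.get("JSESSIONID")  # None if the server didn't set it

if jsessionid:
    payload["JSESSIONID"] = jsessionid
    headers["Referer"] = ("http://www.banxico.org.mx/valores/"
                          "LeePeriodoSectorizacionValores.faces;jsessionid=" + jsessionid)

response = cl.post(url, data=payload, headers=headers)
tables = pd.read_html(response.text)
print(tables[0].head())  # first parsed table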

1 Comment

Thanks much, Lante. But correct me if I'm wrong - that doesn't solve the issue with "clicking" the button, right?

When just clicking through the pages, it looks like there's some sort of cookie/session handling going on that might be difficult to take into account when using requests.

(Example: http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces;jsessionid=8AkD5D0IDxiiwQzX6KqkB2WIYRjIQb2TIERO1lbP35ClUgzmBNkc!-1120047000)

It might be easier to code this up using Selenium, since that automates a real browser (and takes care of all the headers, cookies and whatnot). You'll still have access to the HTML to be able to scrape what you need, and you can probably reuse a lot of what you're doing already.
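
If you do want to stay with requests, a requests.Session at least persists cookies (including that jsessionid) across calls automatically, which bare requests.get/requests.post calls do not. A minimal sketch, assuming the server sets the cookie on the first GET:

import requests

with requests.Session() as s:
    # The first GET establishes the session; its cookies are reused on later calls
    s.get("http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces?BMXC_claseIns=GUB&BMXC_lang=es_MX")
    print(s.cookies.get_dict())  # inspect what the server set, e.g. JSESSIONID

That said, Selenium handles all of this for you, which is why it is probably the easier route here.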
