I'm trying to scrape data from Mexico's Central Bank website but have hit a wall. In terms of actions, I need to first access a link within an initial URL. Once the link has been accessed, I need to select two dropdown values and then activate a submit button. If all goes well, I will be taken to a new URL where a set of links to PDFs is available.

The original URL is:

"http://www.banxico.org.mx/mercados/valores-gubernamentales-secto.html"

The nested URL (the one with the dropdowns) is: "http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces?BMXC_claseIns=GUB&BMXC_lang=es_MX"

The inputs (arbitrary) are, say: '07/03/2019' and '14/03/2019'.

Using BeautifulSoup and requests, I feel like I got as far as filling in the dropdown values, but I failed to click the button and reach the final URL with the list of links.

My code follows below:

from bs4 import BeautifulSoup
import requests

# Fetch the landing page and take the last link on it, which points to the form page
pagem = requests.get("http://www.banxico.org.mx/mercados/valores-gubernamentales-secto.html")
soupm = BeautifulSoup(pagem.content, "lxml")
lst = soupm.find_all('a', href=True)
url = lst[-1]['href']

# Fetch the form page and locate the two date dropdowns
page = requests.get(url)
soup = BeautifulSoup(page.content, "lxml")
xin = soup.find("select", {"id": "_id0:selectOneFechaIni"})
xfn = soup.find("select", {"id": "_id0:selectOneFechaFin"})
ino = list(xin.stripped_strings)   # available start dates
fino = list(xfn.stripped_strings)  # available end dates

# POST the chosen dates plus the submit button's name, imitating the form submit
headers = {'Referer': url}
data = {'_id0:selectOneFechaIni': '07/03/2019',
        '_id0:selectOneFechaFin': '14/03/2019',
        '_id0:accion': '_id0:accion'}
respo = requests.post(url, data, headers=headers)
print(respo.url)

In the code, respo.url ends up equal to url, i.e. the POST never takes me to the results page, so the code fails. Can anybody please help me identify where the problem is? I'm a newbie to scraping, so the mistake might be obvious - apologies in advance for that. I'd appreciate any help. Thanks!

3 Answers


Last time I checked, you cannot submit a form by clicking buttons with BeautifulSoup and Python: BeautifulSoup only parses HTML, it doesn't execute anything. There are two approaches I typically see:

  1. Reverse engineer the form

If the form makes AJAX calls (i.e. sends a request behind the scenes, common for SPAs written in React or Angular), then the best approach is to use the network requests tab in Chrome or another browser to find out what the endpoint is and what the payload is. Once you have those answers, you can make a POST request with the requests library to that endpoint with data=your_payload_dictionary (i.e. manually do what the form is doing behind the scenes; see the first sketch after this list). Read this post for a more elaborate tutorial.

  2. Use a headless browser

If the website is written in something like ASP.NET or a similar MVC framework, then the best approach is to use a headless browser to fill out the form and click submit. A popular framework for this is Selenium, which simulates a normal browser; see the second sketch below. Read this post for a more elaborate tutorial.
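
For approach #1, here is a minimal sketch of what replaying a form usually looks like once you have found the real endpoint and payload in the network tab. The endpoint URL and field names below are hypothetical placeholders, not the actual Banxico ones:

import requests

# Hypothetical endpoint and payload, copied from the request recorded
# in the browser's network tab (placeholders, not the real Banxico values)
endpoint = "http://example.com/some/form/endpoint"
payload = {
    "fieldA": "value1",
    "fieldB": "value2",
}
headers = {"Referer": "http://example.com/the/form/page"}

# Replay the form submission directly, bypassing any button click
response = requests.post(endpoint, data=payload, headers=headers)
print(response.status_code)
print(response.text[:500])  # inspect the beginning of the returned HTML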

Judging by a cursory look at the page you're working on, I recommend approach #2.
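
For approach #2, here is a minimal Selenium sketch. It assumes the element ids from your form data (_id0:selectOneFechaIni, _id0:selectOneFechaFin, _id0:accion) are the ones rendered on the page, that the dropdown option text matches the dd/mm/yyyy strings you used, and that the result links end in .pdf - none of which I have verified:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome()  # Selenium 4.6+ can fetch a matching chromedriver itself
driver.get("http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces?BMXC_claseIns=GUB&BMXC_lang=es_MX")

# Pick the two dates (ids taken from the question's form data)
Select(driver.find_element(By.ID, "_id0:selectOneFechaIni")).select_by_visible_text("07/03/2019")
Select(driver.find_element(By.ID, "_id0:selectOneFechaFin")).select_by_visible_text("14/03/2019")

# Click submit; assumes the button's name from the form data also works as a locator
driver.find_element(By.NAME, "_id0:accion").click()

# Collect the PDF links from the results page (assumes hrefs ending in .pdf)
links = [a.get_attribute("href")
         for a in driver.find_elements(By.CSS_SELECTOR, "a[href$='.pdf']")]
print(links)
driver.quit()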


1 Comment

Thanks v much - will give Selenium a try. Rgrds

The page you have to scrape is:

http://www.banxico.org.mx/valores/PresentaDetalleSectorizacionGubHist.faces

Add the date to consult and the JSESSIONID from the cookies to the payload, and put Referer, User-Agent and all the old good stuff in the request headers.

Example:

import requests
import pandas as pd

cl = requests.Session()
url = "http://www.banxico.org.mx/valores/PresentaDetalleSectorizacionGubHist.faces"

# The date to consult and the JSESSIONID from the cookies go into the form payload
payload = {
    "JSESSIONID": "cWQD8qxoNJy_fecAaN2k8N0IQ6bkQ7f3AtzPx4bWL6wcAmO0T809!-1120047000",
    "fechaAConsultar": "21/03/2019"
}

# Referer, User-Agent and content type, as a normal browser would send them
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36",
    "Content-Type": "application/x-www-form-urlencoded",
    "Referer": "http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces;jsessionid=cWQD8qxoNJy_fecAaN2k8N0IQ6bkQ7f3AtzPx4bWL6wcAmO0T809!-1120047000"
}

# POST the form and let pandas parse the HTML tables out of the response
response = cl.post(url, data=payload, headers=headers)
tables = pd.read_html(response.text)
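
The hard-coded JSESSIONID above will expire, though. A possible refinement (an untested sketch; it assumes the form page sets the JSESSIONID cookie on a plain GET) is to let the same session pick up a fresh one first:

# Untested sketch: hit the form page first so the session receives a JSESSIONID cookie
cl.get("http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces?BMXC_claseIns=GUB&BMXC_lang=es_MX")
jsessionid = cl.cookies.get("JSESSIONID")  # None if the server didn't set it

if jsessionid:
    payload["JSESSIONID"] = jsessionid
    headers["Referer"] = ("http://www.banxico.org.mx/valores/"
                          "LeePeriodoSectorizacionValores.faces;jsessionid=" + jsessionid)

response = cl.post(url, data=payload, headers=headers)
tables = pd.read_html(response.text)
print(tables[0].head())  # first parsed table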

1 Comment

Thanks much, Lante. But correct me if I'm wrong - that doesn't solve the issue with "clicking" the button, right?

When just clicking through the pages, it looks like there's some sort of cookie/session handling going on that might be difficult to take into account when using requests.

(Example: http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces;jsessionid=8AkD5D0IDxiiwQzX6KqkB2WIYRjIQb2TIERO1lbP35ClUgzmBNkc!-1120047000)

It might be easier to code this up using Selenium, since that automates a real browser (and takes care of all the headers, cookies and whatnot). You'll still have access to the HTML to be able to scrape what you need, and you can probably reuse a lot of what you're doing already.
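
If you do want to stay with requests, a requests.Session at least persists cookies (including that jsessionid) across calls automatically, which bare requests.get/requests.post calls do not. A minimal sketch, assuming the server sets the cookie on the first GET:

import requests

with requests.Session() as s:
    # The first GET establishes the session; its cookies are reused on later calls
    s.get("http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces?BMXC_claseIns=GUB&BMXC_lang=es_MX")
    print(s.cookies.get_dict())  # inspect what the server set, e.g. JSESSIONID

That said, Selenium handles all of this for you, which is why it is probably the easier route here.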
