
I want to scrape the dynamic form on [this][1] page. I'm using Selenium for that right now and getting some results.

My questions:

  1. Is it possible to replace the Selenium + WebDriver code with a POST request? (I have worked with Requests before, but only when an API was available; I can't figure out how to reverse-engineer this form.)

  2. Is there a better way to clean up the result page so I get only the table? (In my example, the resulting "data" variable is a mess, but I did obtain the last value, which was the main purpose of the script.)

  3. Any recommendations?

My code:

from selenium import webdriver
import pandas as pd

from bs4 import BeautifulSoup

def get_tables(htmldoc):
    soup = BeautifulSoup(htmldoc, "html.parser")
    return soup.find_all('table')

driver = webdriver.Chrome()
driver.get("http://dgasatel.mop.cl/visita_new.asp")
estacion1 = driver.find_element_by_name("estacion1")
estacion1.send_keys("08370007-6")
driver.find_element_by_xpath("//input[@name='chk_estacion1a' and @value='08370007-6_29']").click()
driver.find_element_by_xpath("//input[@name='period' and @value='1d']").click()
driver.find_element_by_xpath("//input[@name='tiporep' and @value='I']").click()
driver.find_element_by_name("button22").click()

data = pd.read_html(driver.page_source)

print(data[4].tail(1).iloc[0][2])

Thanks in advance.

  [1]: http://dgasatel.mop.cl/visita_new.asp

  • I have used Postman to find the API calls... try it! Commented Dec 11, 2018 at 12:27
  • Never heard of Postman, but I'll need to check that out. The requests-html package might be a possibility; it has the option to let the page render before pulling the source HTML. Commented Dec 11, 2018 at 19:25

1 Answer


The short answer to your question is yes, you can use the requests library to make the POST request. To get a working example, open your browser's inspector, copy the request as cURL, and convert it with the following site:

https://curl.trillworks.com/

Then you can feed the response.text into BeautifulSoup to parse out the tables you want.
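For instance, here is a minimal sketch of pulling one specific table out of a page instead of all of them. The HTML string and the `class="data"` attribute are made up for illustration; the real page's table markup may differ:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Stand-in for response.text; the real page wraps the data table
# in several layout tables, so we select just the one we want.
html = """
<table><tr><td>layout junk</td></tr></table>
<table class="data">
  <tr><th>Fecha</th><th>Valor</th></tr>
  <tr><td>10/12/2018</td><td>1.23</td></tr>
  <tr><td>11/12/2018</td><td>4.56</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
target = soup.find("table", class_="data")

# Build a DataFrame from just that table's rows.
rows = [[cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in target.find_all("tr")]
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df.iloc[-1, 1])  # prints 4.56, the last value
```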

When I do this with the site in your example I get the following:

import requests

cookies = {
    'ASPSESSIONIDCQTTBCRB': 'BFDPGLCCEJMKPFKGJJFHKHFC',
}

headers = {
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'Origin': 'http://dgasatel.mop.cl',
    'Upgrade-Insecure-Requests': '1',
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Referer': 'http://dgasatel.mop.cl/filtro_paramxestac_new.asp',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.9',
}

data = {
  'estacion1': '-1',
  'estacion2': '-1',
  'estacion3': '-1',
  'accion': 'refresca',
  'tipo': 'ANO',
  'fecha_fin': '11/12/2018',
  'hora_fin': '0',
  'period': '1d',
  'fecha_ini': '11/12/2018',
  'fecha_finP': '11/12/2018',
  'UserID': 'nobody',
  'EsDL1': '0',
  'EsDL2': '0',
  'EsDL3': '0'
}

response = requests.post(
    'http://dgasatel.mop.cl/filtro_paramxestac_new.asp',
    headers=headers, cookies=cookies, data=data)
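If you want to sanity-check the form encoding without hitting the server, requests can build the request without sending it. This sketch uses a trimmed-down field set; the full `data` dict above works the same way:

```python
import requests

# Prepare (but do not send) the POST so we can inspect the
# url-encoded body that would go over the wire.
req = requests.Request(
    'POST',
    'http://dgasatel.mop.cl/filtro_paramxestac_new.asp',
    data={'estacion1': '-1', 'period': '1d', 'accion': 'refresca'},
).prepare()

print(req.body)  # estacion1=-1&period=1d&accion=refresca
```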

For cleaning up the data, I recommend mapping the data points you want onto a dictionary, or writing them to a CSV, with loops:

import pandas as pd

for table in pd.read_html(response.text):
    if not table.empty and table.shape[1] > 2:
        print(table.iloc[-1, 2])
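To go one step further than printing, the extracted rows can be written out with the standard csv module. This is just a sketch; the field names and values here are placeholders for whatever columns the real table has:

```python
import csv
import io

# Placeholder rows standing in for the values pulled from the table.
rows = [
    {'fecha': '10/12/2018', 'valor': '1.23'},
    {'fecha': '11/12/2018', 'valor': '4.56'},
]

buf = io.StringIO()  # swap for open('salida.csv', 'w', newline='') to write a file
writer = csv.DictWriter(buf, fieldnames=['fecha', 'valor'])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```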

1 Comment

Perfect, thank you! The only change I had to make was pointing the POST request to the results page (dgasatel.mop.cl/cons_det_instan.asp in my case), and it worked.
