How to navigate a website and extract data with Python

Question

I am not much of a programmer. Just learning. I want to extract (public) electoral data from my country's electoral Authority using Python. This is for academic purposes but I also want to develop my programming skills. All of the data I store will be posted publicly, of course.

I need to know which Python modules allow me to open websites and read the HTML so I can recognize the data I need to collect. I am just hoping for some guidelines on how to do this, or any additional suggestions anyone has.

I wish to extract votes for each party and additional data presented completely disaggregated: State/Municipality/County/Center/Table. Finally, I hope to store it in a csv or xlsx (I guess I'd use openpyxl or xlsxwriter).

My idea is to make a program that:

1) Takes the link input (e.g.);

2) It identifies the links for every State on the left of the HTML (Amazonas, Anzoategui, and so on);

3) Loops through each state and finds the URL (it's HTML, so I guess it'll search for and extract the <a> tag, right?) for each State;

4) Repeats with municipalities;

5) Repeats with "Parroquia" (county);

6) Repeats for every voting center;

7) Finally, repeats for every voting table in each center (1, 2, 3... whatever);

8) Next, it stores the result for every party (e.g., manually I'd click the name of every candidate, recognize the logo of the party and store its votes (30 in the example)). It should also store the data from the "technical table" at the end.

The final result should be to store all the data: State, Municipality, County, Center, Table, and the result for each party.

Also, there are a lot of other libraries in Python, like BeautifulSoup, lxml, Selenium, PhantomJS...
– Sunit, Commented Dec 6, 2015 at 17:33

Adriaan · Accepted Answer · 2017-08-22 10:55:34Z

The following will help:

from selenium import webdriver - For setting up a new webdriver to go to websites. (The one for Chrome works quite well)

from selenium.webdriver.common.by import By - For selecting html elements by css selector, tag name, id, etc.

from selenium.webdriver.support.ui import WebDriverWait - For waiting, up to a maximum timeout, for a page or element to finish loading

from selenium.webdriver.support import expected_conditions as EC - To set up the expected conditions under which to take action while waiting for a page to load. For example, a condition could be waiting until all <a> tags have been loaded.

from selenium.webdriver.common.keys import Keys - For simulating keypresses or sending text to an HTML element

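from bs4 import BeautifulSoup - For parsing a downloaded HTML document (this is the import for BeautifulSoup 4, the version listed below; from BeautifulSoup import BeautifulSoup only works with the older BeautifulSoup 3)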

import re - To enable the use of regular expressions

import xlwt - For writing to Microsoft Excel workbooks

from xlutils.copy import copy - For creating copies of Microsoft Excel workbooks

import time - For setting up pausing times while Python code is executing

import xlrd - For reading from Microsoft Excel workbooks
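
As a rough sketch of the link-extraction step from the question (finding the <a> tag for each State), the snippet below collects the text and URL of every link in a piece of HTML. To keep it runnable without installing anything, it uses the standard library's html.parser rather than BeautifulSoup; with BeautifulSoup 4, soup.find_all("a") would do the same job. The sample HTML, the example.com base URL, and the names LinkCollector and extract_links are all made up for illustration; the real page's markup will differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (text, absolute URL) pairs for every <a href=...> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished (text, url) pairs
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they came from.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(html, base_url):
    """Return a list of (link text, absolute URL) found in the HTML."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical markup standing in for the list of states on the real page.
SAMPLE_HTML = """
<ul>
  <li><a href="estado_01.html">Amazonas</a></li>
  <li><a href="estado_02.html">Anzoategui</a></li>
</ul>
"""

state_links = extract_links(SAMPLE_HTML, "http://example.com/resultados/")
for name, url in state_links:
    print(name, url)
```

The same function could then be called again on each state's page to get municipalities, and so on down to the voting tables. If the pages build their links with JavaScript, the HTML would have to come from Selenium's driver.page_source instead of a plain download.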

Packages to download:

xlrd 0.9.4
xlutils 1.7.1
xlwt 1.0.0
BeautifulSoup 4.4.1
selenium 2.48.0

Most of the above can be downloaded from the Python Package Index (PyPI). Note that BeautifulSoup 4 is published there under the name beautifulsoup4.
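
Since the question also mentions CSV as a storage format, here is a minimal sketch of that last step using the csv module that ships with Python, so none of the Excel packages above is needed for it. The column names, the write_results helper, and the placeholder rows are assumptions for illustration only; just the vote count of 30 comes from the example in the question.

```python
import csv
import io

# One row per (table, party) pair keeps the data fully disaggregated.
FIELDS = ["state", "municipality", "county", "center", "table", "party", "votes"]

# Placeholder rows, not real results.
rows = [
    {"state": "Amazonas", "municipality": "Municipality A", "county": "County A",
     "center": "Center 1", "table": 1, "party": "Party X", "votes": 30},
    {"state": "Amazonas", "municipality": "Municipality A", "county": "County A",
     "center": "Center 1", "table": 1, "party": "Party Y", "votes": 0},
]


def write_results(rows, out_file):
    """Write the header and all result rows to an open text file object."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer is used here so the sketch runs without touching disk.
buffer = io.StringIO()
write_results(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a real file, for example open("results.csv", "w", newline="", encoding="utf-8"); the newline="" argument is what the csv documentation recommends so that line endings are not doubled on Windows.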
