I am not much of a programmer; I am just learning. I want to extract (public) electoral data from my country's electoral authority using Python. This is for academic purposes, but I also want to develop my programming skills. All of the data I store will be posted publicly, of course.

I need to know which Python modules let me open web pages and read their HTML to pick out the data I need to collect. I am hoping for some guidelines on how to do this, or any other suggestions anyone has.

I wish to extract the votes for each party and the additional data, presented fully disaggregated: State/Municipality/County/Center/Table. Finally, I hope to store it in a CSV or XLSX file (I guess I'd use openpyxl or xlsxwriter).

My idea is to make a program that:

1) Takes the link input (e.g.);

2) Identifies the links for every State on the left of the HTML page (Amazonas, Anzoategui, and so on);

3) Loops through each State and finds its URL (it's an HTML page, so I guess it'll search for and extract the <a> tag, right?);

4) Repeats with municipalities;

5) Repeats with "Parroquia" (county);

6) Repeats for every voting center;

7) Finally, repeats for every voting table in each center (1, 2, 3... whatever);

8) Next, it stores the result for every party (e.g. manually I'd click the name of every candidate, recognize the logo of the party and store its votes (30 in the example)). It should also store the data from the "technical table" at the end.

The final result should be to store all the data: State, Municipality, County, Center, Table, and the result for each party.

  • Also, there are a lot of other libraries in Python, like BeautifulSoup, lxml, Selenium, PhantomJS.... Commented Dec 6, 2015 at 17:33
  • Thanks @SunitRana! Scrapy seems to be an excellent idea. Commented Dec 6, 2015 at 17:39

1 Answer

The following will help:

from selenium import webdriver - For setting up a new webdriver to go to websites. (The one for Chrome works quite well)

from selenium.webdriver.common.by import By - For selecting html elements by css selector, tag name, id, etc.

from selenium.webdriver.support.ui import WebDriverWait - For waiting, up to a maximum timeout, until a condition on the loaded page is met

from selenium.webdriver.support import expected_conditions as EC - To set up expected conditions under which to take action when waiting for a page to load. For example, a condition could be waiting until all <a> tags have been loaded.

from selenium.webdriver.common.keys import Keys - For simulating keypresses or sending text to an HTML element

from bs4 import BeautifulSoup - For parsing through a downloaded HTML document (bs4 is the import name for BeautifulSoup 4)

import re - To enable the use of regular expressions

import xlwt - For writing to Microsoft Excel workbooks (the legacy .xls format, not .xlsx)

from xlutils.copy import copy - For creating copies of Microsoft Excel workbooks

import time - For setting up pausing times while Python code is executing

import xlrd - For reading from Microsoft Excel workbooks
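As a rough illustration of the parsing step, here is a minimal sketch that pulls every <a> link out of a page. It deliberately uses only the standard library's html.parser instead of BeautifulSoup or selenium, so it runs with nothing installed. LinkCollector, extract_links, the sample HTML and the example.org URL are all made up for illustration; the real site's markup will differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (link text, absolute URL) pairs for every <a href=...> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished (text, url) pairs
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(html, base_url):
    """Return a list of (text, absolute URL) pairs found in `html`."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up page fragment standing in for the list of States.
sample = (
    '<ul>'
    '<li><a href="estado/01.html">Amazonas</a></li>'
    '<li><a href="estado/02.html">Anzoategui</a></li>'
    '</ul>'
)
links = extract_links(sample, "http://example.org/resultados/index.html")
for name, url in links:
    print(name, url)
```

With BeautifulSoup the same idea is usually shorter (find the <a> tags, read their href attributes), and selenium becomes necessary only if the links are generated by JavaScript after the page loads.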

Packages to download:

  1. xlrd 0.9.4

  2. xlutils 1.7.1

  3. xlwt 1.0.0

  4. BeautifulSoup 4.4.1 (published on the package index as beautifulsoup4)

  5. selenium 2.48.0

Most of the above can be downloaded from the Python Package Index (PyPI).
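The State/Municipality/County/Center/Table drill-down described in the question could be sketched roughly as below. This is only one possible structure, not a tested scraper for the actual site: crawl, write_csv, get_links and get_results are hypothetical names, and the "site" here is a pair of in-memory dictionaries with made-up data so the sketch runs without network access. In real use, get_links would download a page and return its links, and get_results would read the party votes from a table page. The output goes to CSV with the standard csv module rather than to an Excel workbook.

```python
import csv
import io

LEVELS = ["State", "Municipality", "County", "Center", "Table"]


def crawl(url, get_links, get_results, path=()):
    """Yield one row (a dict) per party for every voting table below `url`.

    get_links(url)   -> list of (name, child_url) pairs found on that page
    get_results(url) -> list of (party, votes) pairs on a table page
    Both are hypothetical callables supplied by the caller.
    """
    if len(path) == len(LEVELS):
        # All five levels have been followed, so `url` is a voting table page.
        for party, votes in get_results(url):
            row = dict(zip(LEVELS, path))
            row["Party"] = party
            row["Votes"] = votes
            yield row
        return
    # Otherwise follow every link on this page one level deeper.
    for name, child_url in get_links(url):
        yield from crawl(child_url, get_links, get_results, path + (name,))


def write_csv(rows, fileobj):
    """Write the rows with one column per level plus Party and Votes."""
    writer = csv.DictWriter(fileobj, fieldnames=LEVELS + ["Party", "Votes"])
    writer.writeheader()
    writer.writerows(rows)


# Made-up in-memory "site": page id -> links on that page.
fake_links = {
    "root": [("Amazonas", "s1")],
    "s1": [("Municipality A", "m1")],
    "m1": [("Parroquia A", "p1")],
    "p1": [("Center 1", "c1")],
    "c1": [("1", "t1"), ("2", "t2")],
}
# Made-up results: table page id -> (party, votes) pairs.
fake_results = {
    "t1": [("Party X", 30)],
    "t2": [("Party X", 12), ("Party Y", 7)],
}

rows = list(crawl("root", fake_links.get, fake_results.get))
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

When writing to a real file instead of a StringIO, open it with newline="" as the csv module documentation recommends, and consider pausing between requests (for example with time.sleep) so the crawl does not overload the server.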
