Web scraping table with hidden part using python

Question

I'm trying to get the information from this table:

<table class="table4 table4-1 table4-1-1"><thead><tr><th class="estilo1">No</th><th class="estilo2">Si</th><!--                                                        <th><div class="contenedor-vinculos6"><a title="Ver más " class="vinculo-interrogacion" href="#">Más información</a></div></th>--></tr></thead><tbody><tr><td class="estilo1"><span class="estilo3">100%<span class="numero-voto">(15)</span></span><div class="grafica1 grafica1-desacuerdo"><div class="item-grafica" style="width: 100%;"/></div></div></td><td class="estilo2"><span class="estilo3">0%<span class="numero-voto">(0)</span></span><div class="grafica1 grafica1-deacuerdo"><div class="item-grafica" style="width: 0%;"/></div></div></td><td><span class="display-none">Más información</span></td></tr></tbody></table>

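I'm doing the following in Python 3:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# headers: dict of request headers, defined earlier (not shown in the question)
req = Request('http://www.congresovisible.org/votaciones/10918/', headers=headers)
web_page = urlopen(req)
soup = BeautifulSoup(web_page.read(), 'html.parser')
table = soup.find_all('table', attrs={'class': 'table4 table4-1 table4-1-1'})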

This works, but it only returns part of the table. It excludes everything after:

<td class="estilo2"><span class="estilo3...)

This is the output:

[<table class="table4 table4-1 table4-1-1"><thead><tr><th class="estilo1">No</th><th class="estilo2">Si</th><!--                                                        <th><div class="contenedor-vinculos6"><a title="Ver más " class="vinculo-interrogacion" href="#">Más información</a></div></th>--></tr></thead><tbody><tr><td class="estilo1"><span class="estilo3">100%<span class="numero-voto">(15)</span></span><div class="grafica1 grafica1-desacuerdo"><div class="item-grafica" style="width: 100%;"></div></div></td></tr></tbody></table>]

How could I extract the whole table?

alecxe · Accepted Answer · 2016-10-19 00:04:23Z

It is actually quite easy to solve. The built-in html.parser does not handle this kind of malformed HTML well. Use the more lenient html5lib parser instead. This works for me:

import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.congresovisible.org/votaciones/10918/')
soup = BeautifulSoup(response.content, 'html5lib')
table = soup.find_all('table', attrs={'class':'table4 table4-1 table4-1-1'})
print(table)

Note that this requires the html5lib package to be installed:

pip install --upgrade html5lib
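
If BeautifulSoup still reports that it cannot find a tree builder after the install, the package may have gone into a different Python environment than the one running the script. The sketch below is one way to check that; it uses only the standard library, and the helper name `parser_status` is made up for illustration.

```python
import importlib.util
import sys


def parser_status(packages=("bs4", "html5lib", "lxml")):
    """Map each package name to True if this interpreter can import it."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in packages
    }


if __name__ == "__main__":
    # The interpreter that is actually running this script.
    print("Interpreter:", sys.executable)
    for name, available in parser_status().items():
        print(f"{name}: {'found' if available else 'missing'}")
```

Running `python -m pip install html5lib` with the same `python` that executes the script is a common way to make sure pip targets that interpreter.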

By the way, the lxml parser works as well:

soup = BeautifulSoup(response.content, 'lxml')
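
To compare the parsers without hitting the site, a rough sketch like the one below can be run on a trimmed copy of the table from the question. The outer `<div id="wrapper">` is an assumption standing in for the rest of the page, and `count_cells` is a hypothetical helper. One plausible reading of the problem is that html.parser treats `<div ... />` as already closed, so the extra `</div>` closes an enclosing element early, while html5lib and lxml ignore the trailing slash. The script only reports what each installed parser produces.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Trimmed copy of the table from the question. The outer wrapper <div> is an
# assumption that stands in for the surrounding page markup.
FRAGMENT = (
    '<div id="wrapper">'
    '<table class="table4 table4-1 table4-1-1"><tbody><tr>'
    '<td class="estilo1"><div class="grafica1">'
    '<div class="item-grafica" style="width: 100%;"/></div></div></td>'
    '<td class="estilo2"><div class="grafica1">'
    '<div class="item-grafica" style="width: 0%;"/></div></div></td>'
    '<td><span class="display-none">Más información</span></td>'
    '</tr></tbody></table>'
    '</div>'
)


def count_cells(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser: number of <td> cells in the table, or None if the parser is not installed}."""
    counts = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            # BeautifulSoup raises this when the requested parser is missing.
            counts[parser] = None
            continue
        table = soup.find("table", class_="table4")
        counts[parser] = len(table.find_all("td")) if table else 0
    return counts


if __name__ == "__main__":
    for parser, n in count_cells(FRAGMENT).items():
        print(parser, "not installed" if n is None else f"{n} <td> cells")
```

A parser that keeps the table intact should report all three cells; a lower count means part of the row ended up outside the table.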

5 Comments

user2246905 Over a year ago

I'm getting this error: "Couldn't find a tree builder with the features you requested: html5lib. Do you need to install a parser library?" I get the same error with lxml. Do you know what the reason might be?

alecxe Over a year ago

@user2246905 that's exactly why I've added a note to the answer - you need to install either html5lib or lxml, whatever you choose to stick to. Hope that helps.

user2246905 Over a year ago

I already installed both. import html5lib raises no error, but it seems it was not installed correctly or something.

alecxe Over a year ago

@user2246905 make sure you've installed them into the same python environment you are running your script in.

user2246905 Over a year ago

I checked and they are in the right environment. When I run html5lib.__version__ it shows '0.999'. Do you think it is not installing the latest version, as described in stackoverflow.com/questions/39086278/… ?
