I'm trying to get the information from this table:

<table class="table4 table4-1 table4-1-1"><thead><tr><th class="estilo1">No</th><th class="estilo2">Si</th><!--                                                        <th><div class="contenedor-vinculos6"><a title="Ver más " class="vinculo-interrogacion" href="#">Más información</a></div></th>--></tr></thead><tbody><tr><td class="estilo1"><span class="estilo3">100%<span class="numero-voto">(15)</span></span><div class="grafica1 grafica1-desacuerdo"><div class="item-grafica" style="width: 100%;"/></div></div></td><td class="estilo2"><span class="estilo3">0%<span class="numero-voto">(0)</span></span><div class="grafica1 grafica1-deacuerdo"><div class="item-grafica" style="width: 0%;"/></div></div></td><td><span class="display-none">Más información</span></td></tr></tbody></table>

I'm doing the following in Python 3:

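from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder; the question does not show the real headers
req = Request('http://www.congresovisible.org/votaciones/10918/', headers=headers)
web_page = urlopen(req)
soup = BeautifulSoup(web_page.read(), 'html.parser')
table = soup.find_all('table', attrs={'class': 'table4 table4-1 table4-1-1'})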

This works, but it returns only part of the table. Everything from this cell onward is missing:

<td class="estilo2"><span class="estilo3...)

This is the output:

[<table class="table4 table4-1 table4-1-1"><thead><tr><th class="estilo1">No</th><th class="estilo2">Si</th><!--                                                        <th><div class="contenedor-vinculos6"><a title="Ver más " class="vinculo-interrogacion" href="#">Más información</a></div></th>--></tr></thead><tbody><tr><td class="estilo1"><span class="estilo3">100%<span class="numero-voto">(15)</span></span><div class="grafica1 grafica1-desacuerdo"><div class="item-grafica" style="width: 100%;"></div></div></td></tr></tbody></table>]

How can I extract the whole table?

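1 Answer

This is easy to fix. The page's HTML is not well formed (for example, each <div class="item-grafica" .../> is written as self-closing but is still followed by its own </div>), and html.parser does not recover from that well. Use the more lenient html5lib parser instead. This works for me: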

import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.congresovisible.org/votaciones/10918/')
soup = BeautifulSoup(response.content, 'html5lib')
table = soup.find_all('table', attrs={'class':'table4 table4-1 table4-1-1'})
print(table)

Note that this requires the html5lib package to be installed:

pip install --upgrade html5lib

By the way, the lxml parser works as well (it also has to be installed):

soup = BeautifulSoup(response.content, 'lxml')
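
To see the kind of malformation involved, here is a minimal sketch that uses only the standard library's html.parser.HTMLParser. The DivBalance class and the cell string are illustrative names (the string is one cell copied from the table in the question). It only counts <div> opening and closing tags; it is not how BeautifulSoup builds its tree.

```python
from html.parser import HTMLParser


class DivBalance(HTMLParser):
    """Count <div> start and end tags as Python's html.parser reports them."""

    def __init__(self):
        super().__init__()
        self.opened = 0
        self.closed = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.opened += 1

    def handle_endtag(self, tag):
        if tag == "div":
            self.closed += 1


# One cell from the table in the question (illustrative excerpt).
cell = (
    '<td class="estilo1"><div class="grafica1 grafica1-desacuerdo">'
    '<div class="item-grafica" style="width: 100%;"/></div></div></td>'
)

checker = DivBalance()
checker.feed(cell)
checker.close()
print("opened:", checker.opened, "closed:", checker.closed)
```

Python's html.parser treats the "/>" on the inner <div> as an opening tag followed immediately by a closing tag, so it sees one more closing tag than opening tags in this cell. Under the HTML5 rules that html5lib follows, the trailing slash on a non-void element such as <div> is ignored, so the same markup balances. It seems likely, though not confirmed here, that on the full page the stray </div> closes an enclosing element early, which would explain the truncated table.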

5 Comments

I'm getting this error: "Couldn't find a tree builder with the features you requested: html5lib. Do you need to install a parser library?" I get the same error with lxml. Do you know what the reason might be?
@user2246905 that's exactly why I added a note to the answer: you need to install either html5lib or lxml, whichever you choose to stick with. Hope that helps.
I already installed both. import html5lib raises no error, but it seems it was not installed correctly or something.
@user2246905 make sure you've installed them into the same Python environment you are running your script in.
I checked and they are in the right environment. html5lib.__version__ shows '0.999'. Do you think it is not installing the latest version, as described in stackoverflow.com/questions/39086278/… ?
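
One way to double-check the environment from inside the script itself is sketched below. It uses only the standard library; parser_status is a hypothetical helper name, not part of BeautifulSoup. It reports which interpreter is running and whether that interpreter can locate the html5lib and lxml packages.

```python
import importlib.util
import sys


def parser_status(names=("html5lib", "lxml")):
    """Map each package name to True if this interpreter can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


print("Interpreter:", sys.executable)
for name, available in parser_status().items():
    print(f"{name}: {'found' if available else 'missing'}")
```

If a package shows as missing, installing it with "python -m pip install html5lib", using the same interpreter path printed above, puts it in the environment the script actually runs in.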
