Parsing HTML tables with BeautifulSoup in Python

Question

I'm trying to parse tables from many HTML pages. Each target table has the following structure:

<table width="100%%" border="2" bordercolor="navy">
  <tr bordercolor="#0000FF">
    <td width="20%%" height="22" bgcolor="navy"><font color="#FFFFFF"><b>Field1</b></font></td>
    <td width="20%%" height="22" bgcolor="navy"><font color="#FFFFFF"><b>Field2</b></font></td>
     <td width="60%%" height="22" bgcolor="navy"><font color="#FFFFFF"><b>Field3</b></font></td>
  </tr>
    <tr>
    <td width="12%">A1</td>
    <td width="32%"><a href="../">A2</a></td>
    <td width="56%">A3</td>
  </tr>
  <tr>
    <td width="12%">B1</td>
    <td width="32%"><a href="../">B2</a></td>
    <td width="56%">B3
</td>
  </tr>
  <tr>
    <td width="12%">C1</td>
    <td width="32%"><a href="../">C2</a></td>
    <td width="56%">C3</td>
  </tr>
  <tr>
    <td width="12%">D1</td>
    <td width="32%"><a href="../">D2</a></td>
    <td width="56%">D3</td>
  </tr>

</table>

The number of rows varies from page to page, so the parser should work for any number of rows. I would like to collect the data from each HTML page in this form:

A1 A2 A3
B1 B2 B3
C1 C2 C3
D1 D2 D3

How can I do that?

t.m.adam · Accepted Answer · 2019-03-31 10:16:26Z

You can use find_all() and get_text() to gather the table data. The find_all() method returns a list of all descendants of a tag that match the given filter, and get_text() returns a string with the tag's text content. First select all tables, then for each table select all rows, for each row select all cells, and finally extract the text. This collects the table data in the same order and structure in which it appears in the HTML document.

from bs4 import BeautifulSoup

html = 'my html document'  # placeholder: replace with the page source
soup = BeautifulSoup(html, 'html.parser')

# One list per table, one list per row, one stripped string per cell
tables = [
    [
        [td.get_text(strip=True) for td in tr.find_all('td')]
        for tr in table.find_all('tr')
    ]
    for table in soup.find_all('table')
]

The tables variable holds every table in the document as a nested list with the following structure:

tables -> rows -> columns
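
As a minimal sketch of how this applies to the table in the question: the HTML below is a shortened copy of the sample, and treating the first row as a header to skip is an assumption based on that sample, not something every page is guaranteed to follow.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's table: one header row, two data rows.
html = """
<table width="100%" border="2">
  <tr>
    <td><font color="#FFFFFF"><b>Field1</b></font></td>
    <td><font color="#FFFFFF"><b>Field2</b></font></td>
    <td><font color="#FFFFFF"><b>Field3</b></font></td>
  </tr>
  <tr>
    <td width="12%">A1</td>
    <td width="32%"><a href="../">A2</a></td>
    <td width="56%">A3</td>
  </tr>
  <tr>
    <td width="12%">B1</td>
    <td width="32%"><a href="../">B2</a></td>
    <td width="56%">B3
</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table')

# Assumption: the first <tr> is the header row, so skip it with [1:].
rows = [
    [td.get_text(strip=True) for td in tr.find_all('td')]
    for tr in table.find_all('tr')[1:]
]

for row in rows:
    print(' '.join(row))
# A1 A2 A3
# B1 B2 B3
```

Because the comprehension iterates over whatever find_all('tr') returns, the same code works for any number of rows.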

If the structure is not important and you only want to collect text from all tables in one big list, use:

table_data = [i.text for i in soup.find_all('td')]

Or if you prefer CSS selectors:

table_data = [i.text for i in soup.select('td')]
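
One thing to keep in mind with the flat variants: .text keeps surrounding whitespace, such as the newline inside the B3 cell of the question's sample. A small sketch of the difference, using a one-row table made up for illustration:

```python
from bs4 import BeautifulSoup

# Illustrative one-row table; the last cell ends with a newline, as in the question.
html = '<table><tr><td>B1</td><td><a href="../">B2</a></td><td>B3\n</td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')

# .text returns the cell text unchanged, including the trailing newline.
raw = [td.text for td in soup.select('td')]

# get_text(strip=True) strips whitespace from each text fragment.
clean = [td.get_text(strip=True) for td in soup.select('td')]

print(raw)    # ['B1', 'B2', 'B3\n']
print(clean)  # ['B1', 'B2', 'B3']
```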

If the goal is to gather table data regardless of HTML attributes or other parameters, then it may be best to use pandas. The pandas.read_html() method reads HTML from URLs, files or strings, parses it and returns a list of dataframes that contain the table data.

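from io import StringIO

import pandas as pd

html = 'my html document'  # placeholder: replace with the page source
# Wrap the string in StringIO; recent pandas versions deprecate passing raw HTML text directly
tables = pd.read_html(StringIO(html))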

Note that pandas.read_html() is more fragile than BeautifulSoup: it raises a ValueError if it fails to parse the HTML or if the document contains no tables.

edited Mar 31, 2019 at 10:16

answered Aug 23, 2017 at 15:08

3 Comments

tima Over a year ago

I had problems with empty cells in tables. Using td.text for td in tr.find_all('td') instead of td.string.strip() fixed the issue.

t.m.adam Over a year ago

I used strip() because it removes leading and trailing whitespace (spaces, tabs, etc.) and produces clean text. Also, I didn't collect columns from the first row ([1:]) as it seems to be a heading. Of course, my code is a generic example based on the HTML in your post; you can modify it to fit your needs.

wei ren Over a year ago

I think that td.text or td.get_text() is a better way of retrieving the text content in the table. For differences between .text and .string, please refer to stackoverflow.com/questions/25327693/…
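
The difference these comments describe can be sketched as follows. The three cells are invented for illustration: one plain, one empty, and one with mixed content.

```python
from bs4 import BeautifulSoup

# Illustrative row: a plain cell, an empty cell, and a cell with two children.
html = '<table><tr><td>A1</td><td></td><td>A3 <b>bold</b></td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')
cells = soup.find_all('td')

# .string is None when a tag has no children or more than one child,
# so td.string.strip() would raise AttributeError for the last two cells.
strings = [td.string for td in cells]

# get_text() always returns a str; here fragments are stripped and joined with a space.
texts = [td.get_text(' ', strip=True) for td in cells]

print(strings)  # ['A1', None, None]
print(texts)    # ['A1', '', 'A3 bold']
```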
