
I'm trying to parse tables from a large number of HTML pages. Each target table has the following structure:

<table width="100%%" border="2" bordercolor="navy">
  <tr bordercolor="#0000FF">
    <td width="20%%" height="22" bgcolor="navy"><font color="#FFFFFF"><b>Field1</b></font></td>
    <td width="20%%" height="22" bgcolor="navy"><font color="#FFFFFF"><b>Field2</b></font></td>
     <td width="60%%" height="22" bgcolor="navy"><font color="#FFFFFF"><b>Field3</b></font></td>
  </tr>
    <tr>
    <td width="12%">A1</td>
    <td width="32%"><a href="../">A2</a></td>
    <td width="56%">A3</td>
  </tr>
  <tr>
    <td width="12%">B1</td>
    <td width="32%"><a href="../">B2</a></td>
    <td width="56%">B3
</td>
  </tr>
  <tr>
    <td width="12%">C1</td>
    <td width="32%"><a href="../">C2</a></td>
    <td width="56%">C3</td>
  </tr>
  <tr>
    <td width="12%">D1</td>
    <td width="32%"><a href="../">D2</a></td>
    <td width="56%">D3</td>
  </tr>

</table>

The number of rows varies from page to page, so the parser should work for any number of rows. I would like to collect the data from each HTML page like this:

A1 A2 A3
B1 B2 B3
C1 C2 C3
D1 D2 D3

How can I do that?

1 Answer

You can use find_all() and get_text() to gather the table data. The find_all() method returns a list of all descendants of a tag that match the given filter, and get_text() returns a string containing a tag's text content. First select all tables, then for each table select all rows, for each row select all cells, and finally extract the text. This collects all table data in the same order and structure in which it appears in the HTML document.

from bs4 import BeautifulSoup

html = 'my html document'
soup = BeautifulSoup(html, 'html.parser')
tables = [
    [
        [td.get_text(strip=True) for td in tr.find_all('td')] 
        for tr in table.find_all('tr')
    ] 
    for table in soup.find_all('table')
]

The tables variable holds every table in the document as a nested list with the following structure:

tables -> rows -> columns
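
As a worked example, the same comprehension can be applied to a shortened copy of the table from the question. This is only a sketch: it assumes the first row of the table is the header and skips it with [1:], which the code above does not do.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question (header plus two data rows).
html = """
<table width="100%" border="2" bordercolor="navy">
  <tr bordercolor="#0000FF">
    <td bgcolor="navy"><font color="#FFFFFF"><b>Field1</b></font></td>
    <td bgcolor="navy"><font color="#FFFFFF"><b>Field2</b></font></td>
    <td bgcolor="navy"><font color="#FFFFFF"><b>Field3</b></font></td>
  </tr>
  <tr>
    <td width="12%">A1</td>
    <td width="32%"><a href="../">A2</a></td>
    <td width="56%">A3</td>
  </tr>
  <tr>
    <td width="12%">B1</td>
    <td width="32%"><a href="../">B2</a></td>
    <td width="56%">B3
</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Assumption: the first <tr> is the header row, so it is skipped with [1:].
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find_all("tr")[1:]
]

# Print one line per table row, cells separated by a space.
for row in rows:
    print(" ".join(row))
```

Because the comprehension iterates over whatever find_all('tr') returns, it works for any number of rows. Note that get_text(strip=True) also removes the stray newline inside the B3 cell.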

If the structure is not important and you only want to collect text from all tables in one big list, use:

table_data = [i.text for i in soup.find_all('td')]

Or if you prefer CSS selectors:

table_data = [i.text for i in soup.select('td')]

If the goal is to gather table data regardless of HTML attributes or other parameters, then it may be best to use pandas. The pandas.read_html() method reads HTML from URLs, files or strings, parses it and returns a list of dataframes that contain the table data.

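from io import StringIO

import pandas as pd

html = 'my html document'
# Recent pandas versions deprecate passing a literal HTML string,
# so the string is wrapped in StringIO.
tables = pd.read_html(StringIO(html))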

Note that pandas.read_html() is more fragile than BeautifulSoup: it raises a ValueError if it fails to parse the HTML or if the document doesn't contain any tables. It also needs a parser backend such as lxml (or bs4 together with html5lib) to be installed.
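
Since the ValueError matters when looping over many pages, one way to guard against it is a small wrapper like the sketch below. The function name read_tables is hypothetical, and returning an empty list for pages without tables is an assumption about what is convenient here, not something pandas does itself.

```python
from io import StringIO

import pandas as pd


def read_tables(html, header=None):
    """Return the tables in `html` as a list of DataFrames.

    Hypothetical helper: returns an empty list when pandas cannot
    find any table instead of propagating the ValueError.
    """
    try:
        # header=0 makes pandas use the first row as column names.
        return pd.read_html(StringIO(html), header=header)
    except ValueError:
        return []


sample = """
<table>
  <tr><td>Field1</td><td>Field2</td><td>Field3</td></tr>
  <tr><td>A1</td><td><a href="../">A2</a></td><td>A3</td></tr>
  <tr><td>B1</td><td><a href="../">B2</a></td><td>B3</td></tr>
</table>
"""

with_tables = read_tables(sample, header=0)
without_tables = read_tables("<p>no tables here</p>")
```

Passing header=0 is optional; without it, the header row of a table that uses only td cells is treated as ordinary data.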


3 Comments

I ran into problems with empty cells in the tables. Using td.text for td in tr.find_all('td') instead of td.string.strip() solved that issue.
I used strip() because it removes leading and trailing whitespace (spaces, tabs, newlines) and produces clean text. Also, I didn't collect the columns from the first row ( [1:] ) as it seems to be a heading. Of course, my code is a generic example based on the HTML in your post; you can modify it to fit your needs.
I think that td.text or td.get_text() is a better way of retrieving the text content in the table. For differences between .text and .string, please refer to stackoverflow.com/questions/25327693/…
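
The difference discussed in these comments can be illustrated with a small sketch. The HTML snippet is made up for illustration: it has one empty cell, one cell that mixes a link with plain text, and one plain cell.

```python
from bs4 import BeautifulSoup

# Made-up row: an empty cell, a cell with a link plus text, a plain cell.
html = (
    "<table><tr>"
    "<td></td>"
    "<td><a href='../'>A2</a> (link)</td>"
    "<td> A3 </td>"
    "</tr></table>"
)
soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td")

# .string is None unless the tag has exactly one child,
# so calling .strip() on it fails for the first two cells.
via_string = [td.string for td in cells]

# get_text() always returns a str, even for an empty cell.
via_get_text = [td.get_text(" ", strip=True) for td in cells]

print(via_string)
print(via_get_text)
```

This is why td.string.strip() raises an AttributeError on empty cells, while td.text and td.get_text() do not.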
