Extract HTML Table Based on Specific Column Headers - Python

Question

I am trying to extract HTML tables from the URL used in the code below.

For example, the 2019 Director Compensation Table on page 44. I believe the table doesn't have a specific id, such as 'Compensation Table'. The only approach I can think of is to match column names or keywords such as "Stock Awards" or "All Other Compensation" and then grab the associated table.

Is there an easy way to extract these tables based on column names? Or maybe an easier way?

Thanks!

I am relatively new to scraping HTML tables. My code is as follows:

import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/66740/000120677420000907/mmm3661701-def14a.htm'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
# Collects every row in the document, without grouping rows by table
rows = soup.find_all('tr')

What is the expected output?

– Bitto, Commented Apr 1, 2020 at 20:08
@BittoBennichan the entire table

– Patriots_25, Commented Apr 1, 2020 at 21:27

αԋɱҽԃ αмєяιcαη · Accepted Answer · 2020-04-02 02:30:00Z


Sure, you can do that with the pandas read_html function, using its match and attrs parameters as described in the documentation.

import pandas as pd

url = "https://www.sec.gov/Archives/edgar/data/66740/000120677420000907/mmm3661701-def14a.htm"

# read_html returns a list of DataFrames, one per table that matches
df = pd.read_html(
    url,
    attrs={'style': 'border-collapse: collapse; width: 100%; font: 9pt Arial, Helvetica, Sans-Serif'},
    match="Non-Employee Directors",
)

print(df)

# Save the first matching table without the index or header row
df[0].to_csv("data.csv", index=False, header=False)

Output: View-Online
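To try the match parameter without hitting the network, here is a minimal sketch on a small inline document. The HTML below is invented for illustration (it is not taken from the filing), and it assumes pandas has an HTML parser backend such as lxml installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample markup: two tables, only one mentions "Stock Awards".
HTML = """
<table>
  <tr><th>Name</th><th>Fees Earned</th><th>Stock Awards</th></tr>
  <tr><td>Director A</td><td>100</td><td>200</td></tr>
</table>
<table>
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Widget</td><td>5</td></tr>
</table>
"""

# match keeps only the tables containing text that matches the pattern
tables = pd.read_html(StringIO(HTML), match="Stock Awards")
compensation = tables[0]
print(compensation)
```

If no table matches, read_html raises a ValueError, so wrapping the call in try/except is probably worthwhile when looping over many files.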

edited Apr 2, 2020 at 2:30

answered Apr 2, 2020 at 2:23


6 Comments

Patriots_25 Over a year ago

Thank you very much, this works. Taking it a step further, do you think it's possible to run this over a number of different HTML files? Issues would arise if multiple tables include 'Non-Employee Directors' or if formatting isn't uniform across filers. For example, 3M (as above) may use 'Non-Employee Directors' while Apple may use 'External Directors'. Any thoughts?

αԋɱҽԃ αмєяιcαη Over a year ago

@Patriots_25 you can combine match with attrs and a position index [], but only if the table is always in the same position!
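For the differing labels mentioned above, one option might be a regular expression with alternatives, since match is documented to accept a string or compiled regular expression. A minimal sketch, where LABEL_PATTERN and the helper name are hypothetical and the label list is only an assumption about what filers use:

```python
import re

# Hypothetical label variants; add whatever wording each filer uses.
LABEL_PATTERN = "Non-Employee Directors|External Directors"


def mentions_director_table(text):
    """Return True if the text contains any of the label variants."""
    return re.search(LABEL_PATTERN, text) is not None


print(mentions_director_table("2019 Non-Employee Directors Compensation"))
print(mentions_director_table("Executive Officers"))
```

The same pattern string could then be passed as match=LABEL_PATTERN to pd.read_html. This does not solve the position problem: if several tables match, you still have to decide which one you want.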

Patriots_25 Over a year ago

Understood. If we look at Apple's filing, sec.gov/Archives/edgar/data/320193/000119312520001450/… the same table is included, except it has very different column headers. Can you think of any way to extract these tables universally?

αԋɱҽԃ αмєяιcαη Over a year ago

@Patriots_25 which table exactly? Please share a screenshot.

Patriots_25 Over a year ago

Does this link work for screenshots? imgur.com/a/OihTSZR If it works, see how the data is the same but there isn't uniformity between filers. For example, AAPL has all the data in one table, while MMM has it broken up into 2 tables.
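When neither the label nor the position is stable across filers, the idea from the question (selecting tables by column keywords such as "Stock Awards" and "All Other Compensation") can be sketched with the standard library alone. TableCollector and tables_with_keywords are hypothetical names, the sample markup is invented, and the sketch assumes tables are not nested, which may not hold for real filings.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect every <table> as a list of rows, each a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and collapse runs of whitespace
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def tables_with_keywords(html, keywords):
    """Return tables whose cell text contains every keyword (case-insensitive)."""
    parser = TableCollector()
    parser.feed(html)
    wanted = [k.lower() for k in keywords]
    hits = []
    for table in parser.tables:
        text = " ".join(cell for row in table for cell in row).lower()
        if all(k in text for k in wanted):
            hits.append(table)
    return hits


# Hypothetical sample markup, not taken from either filing.
SAMPLE = """
<table><tr><th>Name</th><th>Stock Awards ($)</th><th>All Other Compensation ($)</th></tr>
<tr><td>Director A</td><td>100</td><td>5</td></tr></table>
<table><tr><th>Item</th><th>Price</th></tr><tr><td>Widget</td><td>5</td></tr></table>
"""

matches = tables_with_keywords(SAMPLE, ["Stock Awards", "All Other Compensation"])
print(matches)
```

Requiring several keywords at once reduces false positives compared with a single label, but a filer that splits the data across two tables (as MMM does) would still need its own keyword set per table.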
