I am trying to extract HTML tables from the URL shown in the code below.

For example, the 2019 Director Compensation Table on page 44. I believe the table doesn't have a specific id, such as 'Compensation Table'. The only approach I can think of is matching column names or keywords such as "Stock Awards" or "All Other Compensation" and then grabbing the associated table.

Is there an easy way to extract these tables based on column names? Or maybe an easier way?

Thanks!

I am relatively new to scraping HTML tables. My code is as follows:

from bs4 import BeautifulSoup
import requests

url = 'https://www.sec.gov/Archives/edgar/data/66740/000120677420000907/mmm3661701-def14a.htm'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
# collects every <tr> in the document, not just the rows of the target table
rows = soup.find_all('tr')
  • What is the expected output? Commented Apr 1, 2020 at 20:08
  • @BittoBennichan the entire table Commented Apr 1, 2020 at 21:27

1 Answer

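Sure, you can do that with the pandas read_html function, using its match and attrs parameters as described in the documentation. Note that read_html returns a list of DataFrames, one per matching table.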

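import pandas as pd

# read_html returns a list of DataFrames, one per table that matches
tables = pd.read_html(
    "https://www.sec.gov/Archives/edgar/data/66740/000120677420000907/mmm3661701-def14a.htm",
    # keep only tables carrying exactly this style attribute
    attrs={'style': 'border-collapse: collapse; width: 100%; font: 9pt Arial, Helvetica, Sans-Serif'},
    # keep only tables whose text matches this string (or regex)
    match="Non-Employee Directors")

print(tables)

# save the first matching table
tables[0].to_csv("data.csv", index=False, header=False)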
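
For the keyword idea from the question (keep only tables that mention "Stock Awards" and "All Other Compensation"), here is a minimal sketch. It uses only the standard library so it runs without extra installs; the `TableCollector` class, the `find_tables` helper, and the `SAMPLE` markup are made up for illustration and are not taken from the filing. It does not handle nested tables, which real filings may contain.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect each <table> as a list of rows; each row is a list of cell strings.

    Illustrative only: nested tables are not handled.
    """

    def __init__(self):
        super().__init__()
        self.tables = []    # finished tables
        self._table = None  # table currently being read
        self._row = None    # row currently being read
        self._cell = None   # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table = []
        elif tag == "tr" and self._table is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # collapse runs of whitespace inside the cell
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._table.append(self._row)
            self._row = None
            self._cell = None
        elif tag == "table" and self._table is not None:
            self.tables.append(self._table)
            self._table = None


def find_tables(html, keywords):
    """Return the tables whose text contains every keyword (case-insensitive)."""
    collector = TableCollector()
    collector.feed(html)
    wanted = [k.lower() for k in keywords]
    matches = []
    for table in collector.tables:
        text = " ".join(cell for row in table for cell in row).lower()
        if all(k in text for k in wanted):
            matches.append(table)
    return matches


# Made-up markup standing in for a filing with several tables.
SAMPLE = """
<table><tr><td>Item</td><td>Page</td></tr><tr><td>Proxy summary</td><td>3</td></tr></table>
<table>
  <tr><th>Name</th><th>Stock Awards ($)</th><th>All Other Compensation ($)</th></tr>
  <tr><td>Director A</td><td>100</td><td>5</td></tr>
</table>
"""

result = find_tables(SAMPLE, ["Stock Awards", "All Other Compensation"])
print(result[0])
```

The same "all keywords present" filter could be applied to the text of each element returned by `soup.find_all('table')` in BeautifulSoup. Requiring several column names at once is stricter than a single `match` string, which helps when one phrase appears in more than one table.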

6 Comments

Thank you very much, this works. Taking it a step further, do you think it's possible to run this over a number of different HTML files? The issue would be multiple tables that include 'Non-Employee Directors', or a lack of uniform formatting between filers. For example, 3M (as above) may use 'Non-Employee Directors' while Apple may use 'External Directors'. Any thoughts?
@Patriots_25 You can match with attrs and pick a table by position (index []) only if the table is always in the same position.
Understood. If we look at Apple's filing, sec.gov/Archives/edgar/data/320193/000119312520001450/…, the same table is included, but it has very different column headers. Can you think of any way to extract these tables universally?
@Patriots_25 Which table exactly? Please share a screenshot.
Does this link work for screenshots? imgur.com/a/OihTSZR If it works, you can see that the data is the same but the layout is not uniform between filers. For example, AAPL has all the data in one table, while MMM has it split across two tables.
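
One possible way to cope with headers that differ between filers is to score each table by how many expected column concepts appear in its header rows, rather than relying on one exact phrase. The sketch below assumes the tables are already parsed into rows of cell strings (for example via `DataFrame.values.tolist()` or any HTML parser). The `CONCEPTS` patterns, the `score_table` and `best_table` helpers, and the sample tables are hypothetical and have not been checked against real filings.

```python
import re

# Hypothetical patterns: one entry per column concept, each covering
# phrasings that different filers might use. Adjust after inspecting filings.
CONCEPTS = {
    "stock_awards": re.compile(r"stock\s+awards?", re.IGNORECASE),
    "other_comp": re.compile(r"(all\s+)?other\s+compensation", re.IGNORECASE),
    "total": re.compile(r"\btotal\b", re.IGNORECASE),
}


def score_table(table, header_rows=2):
    """Count how many concepts appear in the first `header_rows` rows."""
    text = " ".join(str(cell) for row in table[:header_rows] for cell in row)
    return sum(1 for pattern in CONCEPTS.values() if pattern.search(text))


def best_table(tables, min_score=2):
    """Return the highest-scoring table, or None if none reaches min_score.

    On ties, the earliest table wins.
    """
    best, best_score = None, 0
    for table in tables:
        score = score_table(table)
        if score > best_score:
            best, best_score = table, score
    return best if best_score >= min_score else None


# Made-up tables standing in for two filers with different headers.
filer_a = [
    [["Section", "Page"], ["Proxy summary", "3"]],
    [["Name", "Stock Awards ($)", "All Other Compensation ($)", "Total ($)"],
     ["Director A", "100", "5", "105"]],
]
filer_b = [
    [["Director", "Stock Award", "Other Compensation", "Total"],
     ["Director B", "200", "0", "200"]],
]

for tables in (filer_a, filer_b):
    print(best_table(tables))
```

If a single phrase is enough, `pd.read_html` also accepts a compiled regular expression for `match`, so an alternation of known phrasings can be passed directly. Scoring on several columns is more tolerant when a filer splits the data across tables or renames one header, but the threshold and patterns would need tuning on real documents.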