I'm trying to automate scraping all of the data from every table on a website and output each table to a separate tab in an Excel workbook.

I've been working from the code in resources such as https://www.thepythoncode.com/article/convert-html-tables-into-csv-files-in-python, https://towardsdatascience.com/web-scraping-html-tables-with-python-c9baba21059, and Python - Web Scraping HTML table and printing to CSV.

When using this URL, I'm struggling to pull both the underlying data and the table headers. The HTML format is very dense, making it difficult for me to extract the tables in the correct structure.

My current code:

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
# US english
LANGUAGE = "en-US,en;q=0.5"

def get_soup(url):
    """Constructs and returns a soup using the HTML content of `url` passed"""
    # initialize a session
    session = requests.Session()
    # set the User-Agent as a regular browser
    session.headers['User-Agent'] = USER_AGENT
    # request for english content (optional)
    session.headers['Accept-Language'] = LANGUAGE
    session.headers['Content-Language'] = LANGUAGE
    # make the request
    html = session.get(url)
    # return the soup
    return bs(html.content, "html.parser")

def get_all_tables(soup):
    """Extracts and returns all tables in a soup object"""
    return soup.find_all("table")

def get_table_headers(table):
    """Given a table soup, returns all the headers"""
    headers = []
    first_row = table.find("tr")
    if first_row is None:
        # some tables have no rows at all, so guard against a None crash
        return headers
    for th in first_row.find_all("th"):
        headers.append(th.text.strip())
    return headers

def get_table_rows(table):
    """Given a table, returns all its rows"""
    rows = []
    for tr in table.find_all("tr")[1:]:
        cells = []
        # grab all td tags in this table row
        tds = tr.find_all("td")
        if len(tds) == 0:
            # if no td tags, search for th tags
            # can be found especially in wikipedia tables below the table
            ths = tr.find_all("th")
            for th in ths:
                cells.append(th.text.strip())
        else:
            # use regular td tags
            for td in tds:
                cells.append(td.text.strip())
        rows.append(cells)
    return rows

def save_as_csv(table_name, headers, rows):
    pd.DataFrame(rows, columns=headers).to_csv(f"{table_name}.csv")

def main(url):
    # get the soup
    soup = get_soup(url)
    # extract all the tables from the web page
    tables = get_all_tables(soup)
    print(f"[+] Found a total of {len(tables)} tables.")
    # iterate over all tables
    for i, table in enumerate(tables, start=1):
        # get the table headers
        headers = get_table_headers(table)
        # get all the rows of the table
        rows = get_table_rows(table)
        # save table as csv file
        table_name = f"table-{i}"
        print(f"[+] Saving {table_name}")
        save_as_csv(table_name, headers, rows)

main("https://www.sec.gov/Archives/edgar/data/1701605/000170160519000089/bkr-2019093010xq.htm")

For example, I would need the code to identify a table in the filing and place all of its information into an Excel-format table.

Code from questions such as Extract HTML Tables With Similar Data from Different Sources with Different Formatting - Python and Extract HTML Table Based on Specific Column Headers - Python can search through the URL, but it targets too specific a criterion, since I need all of the tables at the URL.

Any help would be appreciated! I'm sure there's an elegant solution that I'm not seeing.

  • The URL you posted doesn't seem to be valid. Can you update your post? Commented Apr 15, 2020 at 17:50
  • Just changed it! Should have been to: https://www.sec.gov/ix?doc=/Archives/edgar/data/1701605/000170160519000089/bkr-2019093010xq.htm Commented Apr 15, 2020 at 17:53
  • Does this answer your question? Extract HTML Table Based on Specific Column Headers - Python Commented Apr 15, 2020 at 19:11
  • That answer is for HTML specified in the base code; unfortunately, I'm looking to automate pulling tables from multiple sites, so I can't specify what the headers are beforehand. I just need to pull the tables in their HTML format and put them straight into Excel. Cleaning the output can be done later. Commented Apr 15, 2020 at 20:12
  • @AlwaysInTheDark Have a look into that Commented Apr 15, 2020 at 20:13

1 Answer


I took a look. The URL in your post relies heavily on JavaScript to populate the page with its elements, which is why BeautifulSoup can't see them. The template HTML has twelve tables, all of which initially look like this:

<table class="table table-striped table-sm">
    <tbody id="form-information-modal-carousel-page-1">
        <!-- Below is populated dynamically VIA JS -->
            <tr>
                <td class="text-center">
                    <i class="fas fa-spinner fa-spin"></i>
                </td>
            </tr>
    </tbody>
</table>
</div>
<div class="carousel-item table-responsive">
    <table class="table table-striped table-bordered table-sm">
        <tbody id="form-information-modal-carousel-page-2">
            <!-- Below is populated dynamically VIA JS -->
            ...

Notice the comments <!-- Below is populated dynamically VIA JS -->. Basically, none of the interesting data is baked into this HTML. I logged my network traffic, and the page makes two XHR requests. One looked promising, namely MetaLinks.json. It's huge, but unfortunately the table data isn't in there (it's still pretty interesting, though, and maybe useful for other things).

The other XHR resource is an actual HTML document which contains the baked-in table data. JSON would have been nicer, since we wouldn't need BeautifulSoup to parse it, but so be it. This HTML document is the one we actually want to scrape - not the URL you provided (the interactive inline XBRL viewer), which uses that XHR resource to populate itself. It's the same document you get when you click the viewer's hamburger menu in the top left and select "Open as HTML". In case you're having trouble finding it, the URL is: https://www.sec.gov/Archives/edgar/data/1701605/000170160519000089/bkr-2019093010xq.htm
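
As a quick sanity check that the tables really are baked into that document, pandas.read_html will parse every <table> it finds into a DataFrame (it needs lxml or html5lib installed, and I'm fetching with requests first because sec.gov tends to reject clients that don't send a User-Agent):

import requests
import pandas as pd
from io import StringIO

url = "https://www.sec.gov/Archives/edgar/data/1701605/000170160519000089/bkr-2019093010xq.htm"
# sec.gov tends to 403 anonymous clients, so declare a User-Agent
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
response.raise_for_status()

# read_html parses every <table> element it can find into a DataFrame
dfs = pd.read_html(StringIO(response.text))
print(f"Parsed {len(dfs)} tables")

The frames come out messy because the filing uses tables for layout, but it confirms the data is all there in the static HTML.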

EDIT - Here's a little example. I'm just pulling some of the numbers from the tables:

def main():

    import requests
    from bs4 import BeautifulSoup

    url = "https://www.sec.gov/Archives/edgar/data/1701605/000170160519000089/bkr-2019093010xq.htm"

    # sec.gov tends to reject requests without a User-Agent header
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    response.raise_for_status()

    soup = BeautifulSoup(response.content, "html.parser")

    for table in soup.find_all("table"):
        for row in table.find_all("tr"):
            for data in row.find_all("ix:nonfraction"):
                print(data.text, end=" ")
            print()
        print()


    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

Output:

3,339 3,142 9,886 9,421 
2,543 2,523 7,604 7,191 
5,882 5,665 17,490 16,612 


2,901 2,819 8,647 8,371 
1,880 1,873 5,705 5,491 
679 608 2,083 1,944 
71 66 183 374 
54 17 128 113 
5,585 5,383 16,746 16,293 
297 282 744 319 
14 6 124 51 
59 55 174 164 
224 233 446 206 
— 85 — 139 
107 110 269 86 
117 38 177 19 
60 25 97 83 
57 13 80 64 
...

The output is actually much longer than I've shown, but you get the idea. Also, I'm not pulling all of the relevant numbers from the tables, since I'm only looking at ix:nonfraction tags, but there are other kinds (decimal numbers, for example). The HTML is REALLY dense - you'll have to figure out how to get all the other fields from each row, take care of non-ix:nonfraction tags, handle empty columns, etc.
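
If you don't want to special-case the ix: tags at all, a cruder approach is to take the text of every cell per row and drop the rows that are completely empty - a rough sketch:

import requests
from bs4 import BeautifulSoup

url = "https://www.sec.gov/Archives/edgar/data/1701605/000170160519000089/bkr-2019093010xq.htm"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})  # sec.gov wants a User-Agent
soup = BeautifulSoup(response.content, "html.parser")

for i, table in enumerate(soup.find_all("table"), start=1):
    rows = []
    for tr in table.find_all("tr"):
        # get_text flattens any nested ix:* tags into plain cell text
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if any(cells):  # skip purely cosmetic spacer rows
            rows.append(cells)
    if rows:
        print(f"table {i}: {len(rows)} non-empty rows")

From there, each table's rows could be fed straight into a DataFrame and written to Excel; the cleanup you mentioned can happen later.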


5 Comments

Thank you! That makes a lot of sense. Do you have any ideas as to how to extract these tables in the non-HTML XHR webpage? I've run it through my code, but I think the merged column headers might be messing with the output
@AlwaysInTheDark If you really want to scrape the data directly from the inline XBRL viewer page, you'll need something like Selenium to simulate a real browsing session - which takes care of the JavaScript stuff. I'd really advise against doing that, however. Selenium is overkill in my opinion, and once that page is initialized (post-JavaScript), the DOM structure is identical to the XHR HTML. You're better off just scraping the HTML from the URL I posted.
@AlwaysInTheDark I should also mention that just because the template HTML contains 12 tables, doesn't mean these are the actual tables you want to scrape later on. I wouldn't worry about anything that's in the XBRL viewer and focus solely on the XHR resource. I'll update my answer with a little (not great) example snippet.
Don't worry, I intend to. However, even with the base HTML that you posted, I'm still getting errors from my code and I'm not sure why :( I'm looking to grab as much as possible from the tables. At the moment, I just want to get the data, and I'll try to extract it from Excel when I need it later.
@AlwaysInTheDark Take a look at my updated answer. I know it's not much, but maybe it's still helpful.
