
I want to scrape and parse the table from a nested URL link, create a pandas DataFrame, and export it. I have figured out how to scrape a table from an HTML page when the page has a single table, but now I need to scrape and parse a table from a sub-link that sits on the mother link. I think I need to loop through all the sub-links and parse the table I am interested in from each one. Is there an efficient way to do this with BeautifulSoup? Can anyone point me in the right direction?

my attempt

Here is my current attempt at scraping and parsing a single table from an HTML page, but I don't know how to scrape and parse a table, identified by its unique table name, from a nested HTML page and create a pandas DataFrame at the end.

import requests
import pandas as pd
from bs4 import BeautifulSoup

def scrape_table(url):
    response = requests.get(url, timeout=10)
    bs = BeautifulSoup(response.content, 'html.parser')
    table = bs.find('table')  # the first (and only) table on the page
    list_of_rows = []
    for row in table.find_all('tr'):
        # collect the text of every data cell in this row
        list_of_cells = [cell.text for cell in row.find_all('td')]
        list_of_rows.append(list_of_cells)
    df = pd.DataFrame(list_of_rows[1:])  # skip the header row
    df.to_csv("output.csv", index=False)

But here is what I want to do:

main_entry_html = "http://www.bom.gov.au/climate/current/statement_archives.shtml"
child_url_1= "http://www.bom.gov.au/climate/current/month/aus/archive/202001.summary.shtml"
child_url_2 = "http://www.bom.gov.au/climate/current/month/aus/archive/202002.summary.shtml"
child_url_3 = "http://www.bom.gov.au/climate/current/month/aus/archive/202003.summary.shtml"
...

and so on. I need to access every monthly summary link from 2015-01 through 2020-07, scrape and parse the table titled Area-average rainfall, and finally create a DataFrame like my desired output shown below.

I think I can use a for loop to iterate over each child URL (i.e., each monthly summary link) and then parse the table I want by looking at its table name, but I am not sure how to achieve this in Python. Can anyone point me in the right direction?
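For example, here is an untested sketch of what I have in mind (find_table_by_title is just a helper name I made up, and I am assuming each table has a <caption> containing its title, which I have not verified for these pages):

import requests
import pandas as pd
from bs4 import BeautifulSoup

def find_table_by_title(soup, title):
    # Return the first <table> whose <caption> mentions the given title.
    for table in soup.find_all('table'):
        caption = table.find('caption')
        if caption and title.lower() in caption.get_text().lower():
            return table
    return None

frames = []
for date in ['201501', '201502']:  # would continue through '202007'
    url = f"http://www.bom.gov.au/climate/current/month/aus/archive/{date}.summary.shtml"
    soup = BeautifulSoup(requests.get(url, timeout=10).content, 'html.parser')
    table = find_table_by_title(soup, 'Area-average rainfall')
    if table is not None:
        rows = [[td.get_text(strip=True) for td in tr.find_all('td')]
                for tr in table.find_all('tr')]
        frames.append(pd.DataFrame(rows[1:]))  # skip the header row

df = pd.concat(frames, ignore_index=True)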

desired output:

here is the desired DataFrame that I want to get after scraping and parsing all the tables from all the child URLs. Here is an example DataFrame with dummy values:

[image: desired dataframe]

Is there any way I can get my desired DataFrame above? How can I scrape and parse the table from the nested URL links? Any ideas on how to achieve my expected output would be appreciated. Thanks

1 Answer


You can use pandas' built-in function pandas.read_html() to find the table directly and convert it into a DataFrame. After this you can collect all the DataFrames in a list and concatenate them. I would also recommend using string formatting to build the URLs, because then you can loop through all your dates, which also gives you the opportunity to save the date as a column in your DataFrame.

import pandas as pd
from bs4 import BeautifulSoup
import requests
import time

years = ['2019', '2020']
dates = []

# Build one 'YYYYMM' string per month, stopping after July 2020.
for year in years:
    for month in range(1, 13):
        if year == '2020' and month == 8:
            break
        dates.append(f'{year}{month:02d}')

frames = []

for date in dates:
    r = requests.get(f'http://www.bom.gov.au/climate/current/month/aus/archive/{date}.summary.shtml', timeout=10)
    soup = BeautifulSoup(r.content, 'html.parser')

    # The third table on the page is the "Area-average rainfall" table.
    result = pd.read_html(str(soup))[2]

    # Insert the month as a 'YYYY-MM' column and name the columns inside the
    # loop so every frame lines up when concatenated.
    result.insert(0, 'dates', f'{date[:4]}-{date[4:]}')
    result.columns = ['dates', 'region', 'rank', 'average', 'departure from mean', 'comment']

    frames.append(result)
    time.sleep(1)  # be polite to the server between requests

full_df = pd.concat(frames, ignore_index=True)
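If you would rather pass start_date and end_date as parameters instead of hard-coding the years, here is a sketch of one possibility (assuming you are fine leaning on pandas for the date arithmetic; month_codes is just a helper name I picked):

import pandas as pd

def month_codes(start_date, end_date):
    # One 'YYYYMM' string per month from start_date to end_date, inclusive;
    # freq='MS' makes date_range emit one timestamp per month start.
    return pd.date_range(start_date, end_date, freq='MS').strftime('%Y%m').tolist()

dates = month_codes('2015-01', '2020-07')  # '201501', '201502', ..., '202007'

The request loop above then works unchanged.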

4 Comments

Thanks for your answer. Can we handle the dates in a parametric way instead of hard-coding them? I mean, is there any way we can pass start_date and end_date as arguments, so we can scrape the data accordingly? Any thoughts? Thanks
Yes, I think there are several possibilities. I edited one above.
When we rename the columns of full_df, a ValueError pops up, I think because of the Unnamed: 0_level_1 level in the scraped header. How do we fix this? Also, why do we use the break line for 2020 — isn't that hard-coded again? Thanks for your beautiful answer :)
Ah yes, good point. It's better to name the columns within the loop and then concatenate them. I edited my solution above.
