
I want to scrape and parse the table from a nested URL link, create a pandas DataFrame, and export it. I have figured out how to scrape a table from an HTML page when the page has a single table, but now I need to scrape and parse a table from a sub-link that sits on the mother link. I think I need to loop through all the sub-links and parse the table I am interested in from each one. Is there an efficient way to do this with BeautifulSoup? Can anyone point me in the right direction?

my attempt

Here is my current attempt at scraping and parsing a single table from an HTML page, but I don't know how to scrape and parse a table, identified by its unique table name, from a nested HTML page and create a pandas DataFrame at the end.

import requests
import pandas as pd
from bs4 import BeautifulSoup

def scrape_table(url):
    response = requests.get(url, timeout=10)
    bs = BeautifulSoup(response.content, 'html.parser')
    table = bs.find('table')  # the first (and only) table on the page
    list_of_rows = []
    for row in table.find_all('tr'):
        # collect the text of every data cell in this row
        list_of_cells = [cell.text for cell in row.find_all('td')]
        list_of_rows.append(list_of_cells)
    df = pd.DataFrame(list_of_rows[1:])  # skip the header row
    df.to_csv("output.csv", index=False)

But here is what I want to do:

main_entry_html = "http://www.bom.gov.au/climate/current/statement_archives.shtml"
child_url_1= "http://www.bom.gov.au/climate/current/month/aus/archive/202001.summary.shtml"
child_url_2 = "http://www.bom.gov.au/climate/current/month/aus/archive/202002.summary.shtml"
child_url_3 = "http://www.bom.gov.au/climate/current/month/aus/archive/202003.summary.shtml"
...

and so on. I need to access every monthly summary link from 2015-01 through 2020-07, scrape and parse the table titled Area-average rainfall, and finally create a DataFrame like my desired output shown below.

I think I can use a for loop to iterate over each child URL (i.e., each monthly summary link) and then parse the table I want by looking at its table name, but I am not sure how to achieve this in Python. Can anyone point me in the right direction?
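For example, here is an untested sketch of what I have in mind (find_table_by_title is just a helper name I made up, and I am assuming each table has a <caption> containing its title, which I have not verified for these pages):

import requests
import pandas as pd
from bs4 import BeautifulSoup

def find_table_by_title(soup, title):
    # Return the first <table> whose <caption> mentions the given title.
    for table in soup.find_all('table'):
        caption = table.find('caption')
        if caption and title.lower() in caption.get_text().lower():
            return table
    return None

frames = []
for date in ['201501', '201502']:  # would continue through '202007'
    url = f"http://www.bom.gov.au/climate/current/month/aus/archive/{date}.summary.shtml"
    soup = BeautifulSoup(requests.get(url, timeout=10).content, 'html.parser')
    table = find_table_by_title(soup, 'Area-average rainfall')
    if table is not None:
        rows = [[td.get_text(strip=True) for td in tr.find_all('td')]
                for tr in table.find_all('tr')]
        frames.append(pd.DataFrame(rows[1:]))  # skip the header row

df = pd.concat(frames, ignore_index=True)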

desired output:

here is the desired DataFrame that I want to get after scraping and parsing all the tables from all the child URLs. Here is an example DataFrame with dummy values:

[image: desired dataframe]

Is there any way I can get my desired DataFrame above? How can I scrape and parse the table from the nested URL links? Any ideas on how to achieve my expected output would be appreciated. Thanks

1 Answer


You can use pandas' built-in function pandas.read_html() to find the table directly and convert it into a DataFrame. After this you can collect all the DataFrames in a list and concatenate them. I would also recommend using string formatting to build the URLs, because then you can loop through all your dates, which also gives you the opportunity to save the date as a column in your DataFrame.

import pandas as pd
from bs4 import BeautifulSoup
import requests
import time

years = ['2019', '2020']
dates = []

# Build one 'YYYYMM' string per month, stopping after July 2020.
for year in years:
    for month in range(1, 13):
        if year == '2020' and month == 8:
            break
        dates.append(f'{year}{month:02d}')

frames = []

for date in dates:
    r = requests.get(f'http://www.bom.gov.au/climate/current/month/aus/archive/{date}.summary.shtml', timeout=10)
    soup = BeautifulSoup(r.content, 'html.parser')

    # The third table on the page is the "Area-average rainfall" table.
    result = pd.read_html(str(soup))[2]

    # Insert the month as a 'YYYY-MM' column and name the columns inside the
    # loop so every frame lines up when concatenated.
    result.insert(0, 'dates', f'{date[:4]}-{date[4:]}')
    result.columns = ['dates', 'region', 'rank', 'average', 'departure from mean', 'comment']

    frames.append(result)
    time.sleep(1)  # be polite to the server between requests

full_df = pd.concat(frames, ignore_index=True)
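If you would rather pass start_date and end_date as parameters instead of hard-coding the years, here is a sketch of one possibility (assuming you are fine leaning on pandas for the date arithmetic; month_codes is just a helper name I picked):

import pandas as pd

def month_codes(start_date, end_date):
    # One 'YYYYMM' string per month from start_date to end_date, inclusive;
    # freq='MS' makes date_range emit one timestamp per month start.
    return pd.date_range(start_date, end_date, freq='MS').strftime('%Y%m').tolist()

dates = month_codes('2015-01', '2020-07')  # '201501', '201502', ..., '202007'

The request loop above then works unchanged.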

4 Comments

Thanks for your answer. Can we handle the dates in a parametric way instead of hard-coding them? I mean, is there any way we can pass start_date and end_date as arguments, so we can scrape the data accordingly? Any thoughts? Thanks
Yes, I think there are several possibilities. I edited one above.
When we rename the columns of full_df, a ValueError pops up, I think because of the Unnamed: 0_level_1 level in the scraped header. How do we fix this? Also, why do we use the break line for 2020 — isn't that hard-coded again? Thanks for your beautiful answer :)
Ah yes, good point. It's better to name the columns within the loop and then concatenate them. I edited my solution above.
