Using BeautifulSoup to scrape various tables and combine them into a .csv file

A page contains a table of links, and each link leads to a table about one subject. The script builds a list of these links and passes each one to a function called scrapeTable, which scrapes the table and stores it in a CSV file. This produces a directory with one file per subject, and those files are then merged into one master file.

I'm looking for some feedback/criticism/improvements on a piece of code I've written; the code is below.

import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd
import glob
import os

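#scrapes one subject page: writes its table to a per-subject CSV, then reshapes it with pandas
#(relies on the session s created in the with-block below, so the login cookies are reused)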
def scrapeTable(url):
    r = s.get(url)

    soup = BeautifulSoup(r.text, "lxml")
    
    #get page header
    title = soup.find('h4', 'otherTablesSubTitle')
    subject_name = title.contents[0]

    #get table with 'tablesorter' as its class
    table = soup.find('table', {'class': 'tablesorter'})
    
    #open file using page header
    with open('C:/' + subject_name + '.csv', 'ab') as f:
        csvwriter = csv.writer(f)

        #collect the column headers once, from this table only
        headers = []
        for item in table.find_all('th'):
            headers.append(item.contents[0])

        #because some pages don't follow the exact format, rename any instance of Institution to University
        for idx, h in enumerate(headers):
            if 'Institution' in h:
                headers[idx] = 'University'

        csvwriter.writerow(headers)

        #write each data row; the header row has no <td> cells, so skip the empty list it yields
        for row in table.findAll('tr'):
            cells = [c.text.encode('utf-8') for c in row.findAll('td')]
            if cells:
                csvwriter.writerow(cells)

    #use the third column header as the id for pd.melt
    header_id = headers[2]
    #remove it so the remaining headers become the value columns for pd.melt
    headers.pop(2)

    #denormalise the table and insert subject name at beginning
    df = pd.read_csv('C:/' + subject_name + '.csv')
    a = pd.melt(df, id_vars=header_id, value_vars=headers, var_name='Measure', value_name='Value')
    a.insert(0, 'Subject', subject_name)

    a.to_csv('C:/' + subject_name + '.csv', sep=',', index=False)

#details to post to login form
payload = {
    'username': 'username',
    'password': 'password'
}

#use with to close session after finished
with requests.Session() as s:
    p = s.post('websitelogin', data=payload)
    r = s.get('website')

    soup = BeautifulSoup(r.text, "lxml")

    #get list of links (subjects)
    links = []
    for anchor in soup.findAll('a', href=True):
        if 'imported' in anchor['href']:
            links.append('link' + anchor['href'])

    #for each link, call scrapeTable and pass link through          
    for item in links:
        scrapeTable(item)

#this merges all the files together into one file called final
path = 'C:/'
allCSV = glob.glob(path + '*.csv')
CSVList = []
for file in allCSV:
    df = pd.read_csv(file, index_col=None, header=0)
    CSVList.append(df)

frame = pd.concat(CSVList)
frame.to_csv('C:/final.csv', sep=',', index=False)
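
To make the pd.melt step clearer, here's a toy example of the reshape it performs (the column names are made up; in the real code the id column is whatever headers[2] happens to be):

import pandas as pd

#toy table in the same shape as a scraped subject table
df = pd.DataFrame({
    'University': ['A', 'B'],
    'Score': [88, 75],
    'Rank': [1, 2],
})

#the id_vars column stays as an identifier; every other column becomes
#a (Measure, Value) pair, one output row per original cell
melted = pd.melt(df, id_vars='University', value_vars=['Score', 'Rank'],
                 var_name='Measure', value_name='Value')
print(melted)
#  University Measure  Value
#0          A   Score     88
#1          B   Score     75
#2          A    Rank      1
#3          B    Rank      2

This long format is what each per-subject file looks like just before the Subject column is inserted and everything is concatenated into final.csv.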