I am trying to scrape the following URL, and so far I have been able to use the following code to extract the ul elements.

from bs4 import BeautifulSoup
import urllib
import csv
import requests
page_link = 'https://repo.vse.gmu.edu/ait/AIT580/580books.html'
page_response = requests.get(page_link, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")
print(page_content.prettify())
page_content.ul

However, my goal is to extract the information contained within the table into a csv file. How can I go about doing this, starting from my current code?

3 Answers

You can use the Python pandas library to export the table to csv, which is the easiest way to do this.

import pandas as pd

tables = pd.read_html("https://repo.vse.gmu.edu/ait/AIT580/580books.html")
tables[0].to_csv("output.csv", index=False)

To install pandas, just use

pip install pandas
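Note that read_html returns a list with one DataFrame per <table> element found on the page, so it is worth checking what was found before exporting. A minimal sketch using an inline HTML string (hypothetical data, so no network access is needed):

```python
import pandas as pd
from io import StringIO

html = """
<table>
  <tr><th>Title</th><th>Author</th></tr>
  <tr><td>Book A</td><td>Smith</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> found;
# <th> cells in the first row become the column headers
tables = pd.read_html(StringIO(html))
print(len(tables))                    # how many tables were found
print(tables[0].columns.tolist())     # the parsed header row
tables[0].to_csv("output.csv", index=False)
```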

1 Comment

This answer was extremely helpful, as it exported the html table into a csv file, but I was looking to perform this action using BeautifulSoup.

Although I think that KunduK's answer provides an elegant solution using pandas, I would like to give you another approach, since you explicitly asked how to go on from your current code (which uses the csv module and BeautifulSoup).

from bs4 import BeautifulSoup
import csv
import requests

new_file = '/path/to/new/file.csv'
page_link = 'https://repo.vse.gmu.edu/ait/AIT580/580books.html'
page_response = requests.get(page_link, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")
table = page_content.find('table')

for i, tr in enumerate(table.findAll('tr')):
    row = []
    for cell in tr.findAll(['th', 'td']):  # header cells are <th>, data cells are <td>
        row.append(cell.text)
    if i == 0:  # write header
        with open(new_file, 'w', newline='') as f:  # newline='' avoids blank rows on Windows
            writer = csv.DictWriter(f, row)
            writer.writeheader()
    else:
        with open(new_file, 'a', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(row)

As you can see, we first fetch the whole table and then iterate over the tr elements, collecting each row's cells (th for the header row, td for the data rows). In the first round of the iteration (tr), we use the information as the header of our csv file. Subsequently, we write all information as rows to the csv file.

4 Comments

This answer is awesome and exactly what I was looking for! My only issue now is that when the csv file is exported, there are additional blank rows added between rows of the table. Any idea how I can remedy that situation?
Never mind, I was able to fix the above issue by adding newline='' to the open() statement.
My only issue now is wanting to also export the column headers into the csv file, as those are still missing.
I just tested the snippet and it works fine, including the header. Did you check the indentation?

Slightly cleaner approach using list comprehensions:

import csv
import requests
from bs4 import BeautifulSoup

page_link = 'https://repo.vse.gmu.edu/ait/AIT580/580books.html'

page_response = requests.get(page_link, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for items in page_content.find('table').find_all('tr'):
        data = [item.get_text(strip=True) for item in items.find_all(['th','td'])]
        print(data)
        writer.writerow(data)
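The same pattern works on any table markup. A self-contained sketch with an inline HTML snippet and an in-memory buffer (hypothetical data, no network or file access needed), to show what the th/td extraction produces:

```python
import csv
from io import StringIO
from bs4 import BeautifulSoup

html = ("<table>"
        "<tr><th>Title</th><th>Year</th></tr>"
        "<tr><td>Book A</td><td>1999</td></tr>"
        "</table>")
soup = BeautifulSoup(html, "html.parser")

buf = StringIO()
writer = csv.writer(buf)
for tr in soup.find('table').find_all('tr'):
    # find_all(['th', 'td']) picks up header and data cells alike,
    # so the header row is written just like any other row
    writer.writerow([cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])])

print(buf.getvalue())
```

Writing to a StringIO here just makes the output easy to inspect; with a real file you would open it with newline='' as above.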

