I am trying to scrape the following URL, and so far I have been able to use the following code to extract the ul elements.

from bs4 import BeautifulSoup
import urllib
import csv
import requests
page_link = 'https://repo.vse.gmu.edu/ait/AIT580/580books.html'
page_response = requests.get(page_link, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")
print(page_content.prettify())
page_content.ul

However, my goal is to extract the information contained within the table into a csv file. How can I go about doing this, starting from my current code?

3 Answers

You can use the Python pandas library to export the table to csv, which is the easiest way to do this.

import pandas as pd

tables = pd.read_html("https://repo.vse.gmu.edu/ait/AIT580/580books.html")
tables[0].to_csv("output.csv", index=False)

To install pandas, just use

pip install pandas
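Note that read_html returns a list with one DataFrame per <table> element found on the page, so it is worth checking what was found before exporting. A minimal sketch using an inline HTML string (hypothetical data, so no network access is needed):

```python
import pandas as pd
from io import StringIO

html = """
<table>
  <tr><th>Title</th><th>Author</th></tr>
  <tr><td>Book A</td><td>Smith</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> found;
# <th> cells in the first row become the column headers
tables = pd.read_html(StringIO(html))
print(len(tables))                    # how many tables were found
print(tables[0].columns.tolist())     # the parsed header row
tables[0].to_csv("output.csv", index=False)
```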

1 Comment

This answer was extremely helpful, as it exported the html table into a csv file, but I was looking to perform this action using BeautifulSoup.

Although I think that KunduK's answer provides an elegant solution using pandas, I would like to give you another approach, since you explicitly asked how to go on from your current code (which uses the csv module and BeautifulSoup).

from bs4 import BeautifulSoup
import csv
import requests

new_file = '/path/to/new/file.csv'
page_link = 'https://repo.vse.gmu.edu/ait/AIT580/580books.html'
page_response = requests.get(page_link, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")
table = page_content.find('table')

for i, tr in enumerate(table.findAll('tr')):
    row = []
    for cell in tr.findAll(['th', 'td']):  # header cells are <th>, data cells are <td>
        row.append(cell.text)
    if i == 0:  # write header
        with open(new_file, 'w', newline='') as f:  # newline='' avoids blank rows on Windows
            writer = csv.DictWriter(f, row)
            writer.writeheader()
    else:
        with open(new_file, 'a', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(row)

As you can see, we first fetch the whole table and then iterate over the tr elements, collecting each row's cells (th for the header row, td for the data rows). In the first round of the iteration (tr), we use the information as the header of our csv file. Subsequently, we write all information as rows to the csv file.

4 Comments

This answer is awesome and exactly what I was looking for! My only issue now is that when the csv file is exported, there are additional blank rows added between rows of the table. Any idea how I can remedy that situation?
Never mind, I was able to fix the above issue by adding newline='' to the open() statement.
My only issue now is wanting to also export the column headers into the csv file, as those are still missing.
I just tested the snippet and it works fine, including the header. Did you check the indentation?

Slightly cleaner approach using list comprehensions:

import csv
import requests
from bs4 import BeautifulSoup

page_link = 'https://repo.vse.gmu.edu/ait/AIT580/580books.html'

page_response = requests.get(page_link, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for items in page_content.find('table').find_all('tr'):
        data = [item.get_text(strip=True) for item in items.find_all(['th','td'])]
        print(data)
        writer.writerow(data)
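The same pattern works on any table markup. A self-contained sketch with an inline HTML snippet and an in-memory buffer (hypothetical data, no network or file access needed), to show what the th/td extraction produces:

```python
import csv
from io import StringIO
from bs4 import BeautifulSoup

html = ("<table>"
        "<tr><th>Title</th><th>Year</th></tr>"
        "<tr><td>Book A</td><td>1999</td></tr>"
        "</table>")
soup = BeautifulSoup(html, "html.parser")

buf = StringIO()
writer = csv.writer(buf)
for tr in soup.find('table').find_all('tr'):
    # find_all(['th', 'td']) picks up header and data cells alike,
    # so the header row is written just like any other row
    writer.writerow([cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])])

print(buf.getvalue())
```

Writing to a StringIO here just makes the output easy to inspect; with a real file you would open it with newline='' as above.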

