
How do I modify this code to use the url list from the csv, go to those pages, and then execute the last section of the code to retrieve the correct data?

I have a feeling the section that visits the csv-stored links and retrieves data from them is way off. The csv lists the target urls one per row, and the last section of the code, which targets the contact details etc., already works correctly on its own.

import requests
import re
from bs4 import BeautifulSoup
import csv

#Read csv
csvfile = open("gymsfinal.csv")
csvfilelist = csvfile.read()

#Get data from each url
def get_page_data():
    for page_data in csvfilelist:
        r = requests.get(page_data.strip())
        soup = BeautifulSoup(r.text, 'html.parser')
        return soup

pages = get_page_data()
'''print pages'''

#The work performed on scraped data
print soup.find("span",{"class":"wlt_shortcode_TITLE"}).text
print soup.find("span",{"class":"wlt_shortcode_map_location"}).text
print soup.find("span",{"class":"wlt_shortcode_phoneNum"}).text
print soup.find("span",{"class":"wlt_shortcode_EMAIL"}).text

th = soup.find('b',text="Category")
td = th.findNext()
for link in td.findAll('a',href=True):
    match = re.search(r'http://(\w+).(\w+).(\w+)', link.text)
    if match:
        print link.text

gyms = [name,address,phoneNum,email]
gym_data_list.append(gyms)

#Saving specific listing data to csv
with open ("xgyms.csv", "wb") as file:
    writer = csv.writer(file)
    for row in gym_data_list:
        writer.writerow(row)

Snippet of gymsfinal.csv:

http://www.gym-directory.com/listing/green-apple-wellness-centre/
http://www.gym-directory.com/listing/train-247-fitness-prahran/
http://www.gym-directory.com/listing/body-club/
http://www.gym-directory.com/listing/training-glen/

Edit: I changed this to writer.writerow([row]) in order to have the csv data saved without commas between each letter.
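
(For reference: writerow expects a sequence of fields, and a bare string is a sequence of one-character fields, which is why each letter came out comma-separated. Wrapping the string in a list writes it as a single field. A quick illustration, using a hypothetical demo.csv:)

import csv

# writerow treats a bare string as a sequence of single-character fields
with open("demo.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerow("abc")    # writes the row: a,b,c
    writer.writerow(["abc"])  # writes the row: abc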

Comments:
  • The 9th line should be r = requests.get(page_data.strip()). This assumes your file has one url per line. If it's actually a csv, better to use the csv module. Commented Sep 28, 2015 at 23:44
  • Could you post a snippet of gymsfinal.csv, please? Commented Sep 29, 2015 at 0:09
  • I made changes to the way the csv is written, but I'm still getting this: requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h? Commented Sep 29, 2015 at 0:41

1 Answer

There are a couple of issues here. First of all, you never close your first file object, which is a big no-no. You should use the with syntax that you already use towards the bottom of your snippet for reading the csv as well.

You're getting the error requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h? because you read the whole csv in as one big string, newlines included. So when you iterate over it with for page_data in csvfilelist:, you're iterating over each character of that string (strings are iterable in Python). Obviously a single character isn't a valid url, so requests throws an exception. Reading the file should look something like this:

with open('gymsfinal.csv') as f:
    reader = csv.reader(f)
    csvfilelist = [row[0] for row in reader]

You should also change how you return your results from get_page_data(). Currently, you're only going to return the first soup. In order to make it return a generator of all the soups, all you need to do is change that return into a yield. Good resource on yield and generators.
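
For example, the only change inside the function is the last line (a sketch; passing the url list in as a parameter is my addition, the rest keeps your existing requests/BeautifulSoup calls):

def get_page_data(urls):
    for url in urls:
        r = requests.get(url.strip())
        soup = BeautifulSoup(r.text, 'html.parser')
        yield soup  # yield each soup in turn instead of returning after the first one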

You're also going to have a problem with your print statements. They should either go inside a for loop that looks like for soup in pages:, or they should go inside get_page_data(). There is no variable soup defined in the context of those prints.
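
Putting those pieces together, the calling code might look something like this (a sketch that assumes every listing page has the same wlt_shortcode_* spans as the working section of your script):

pages = get_page_data(csvfilelist)

for soup in pages:
    print soup.find("span", {"class": "wlt_shortcode_TITLE"}).text
    print soup.find("span", {"class": "wlt_shortcode_map_location"}).text
    print soup.find("span", {"class": "wlt_shortcode_phoneNum"}).text
    print soup.find("span", {"class": "wlt_shortcode_EMAIL"}).text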
