
How do I modify this code to use the url list from the csv, go to those pages, and then execute the last section of the code to retrieve the correct data?

I have a feeling the section that visits the csv-stored links and retrieves data from them is way off. The csv lists the target urls one per row, and the last section of the code, which targets the contact details etc., already works correctly on its own.

import requests
import re
from bs4 import BeautifulSoup
import csv

#Read csv
csvfile = open("gymsfinal.csv")
csvfilelist = csvfile.read()

#Get data from each url
def get_page_data():
    for page_data in csvfilelist:
        r = requests.get(page_data.strip())
        soup = BeautifulSoup(r.text, 'html.parser')
        return soup

pages = get_page_data()
'''print pages'''

#The work performed on scraped data
print soup.find("span",{"class":"wlt_shortcode_TITLE"}).text
print soup.find("span",{"class":"wlt_shortcode_map_location"}).text
print soup.find("span",{"class":"wlt_shortcode_phoneNum"}).text
print soup.find("span",{"class":"wlt_shortcode_EMAIL"}).text

th = soup.find('b',text="Category")
td = th.findNext()
for link in td.findAll('a',href=True):
    match = re.search(r'http://(\w+).(\w+).(\w+)', link.text)
    if match:
        print link.text

gyms = [name,address,phoneNum,email]
gym_data_list.append(gyms)

#Saving specific listing data to csv
with open ("xgyms.csv", "wb") as file:
    writer = csv.writer(file)
    for row in gym_data_list:
        writer.writerow(row)

Snippet of gymsfinal.csv:

http://www.gym-directory.com/listing/green-apple-wellness-centre/
http://www.gym-directory.com/listing/train-247-fitness-prahran/
http://www.gym-directory.com/listing/body-club/
http://www.gym-directory.com/listing/training-glen/

Edit: I changed this to writer.writerow([row]) in order to have the csv data saved without commas between each letter.
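
(For reference: writerow expects a sequence of fields, and a bare string is a sequence of one-character fields, which is why each letter came out comma-separated. Wrapping the string in a list writes it as a single field. A quick illustration, using a hypothetical demo.csv:)

import csv

# writerow treats a bare string as a sequence of single-character fields
with open("demo.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerow("abc")    # writes the row: a,b,c
    writer.writerow(["abc"])  # writes the row: abc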

Comments:
  • The 9th line should be r = requests.get(page_data.strip()). This assumes your file has one url per line. If it's actually a csv, better to use the csv module. Commented Sep 28, 2015 at 23:44
  • Could you post a snippet of gymsfinal.csv, please? Commented Sep 29, 2015 at 0:09
  • I made changes to the way the csv is written, but I'm still getting this: requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h? Commented Sep 29, 2015 at 0:41

1 Answer

There are a couple of issues here. First of all, you never close your first file object, which is a big no-no. You should use the with syntax that you already use towards the bottom of your snippet for reading the csv as well.

You're getting the error requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h? because you read the whole csv in as one big string, newlines included. So when you iterate over it with for page_data in csvfilelist:, you're iterating over each character of that string (strings are iterable in Python). Obviously a single character isn't a valid url, so requests throws an exception. Reading the file should look something like this:

with open('gymsfinal.csv') as f:
    reader = csv.reader(f)
    csvfilelist = [row[0] for row in reader]

You should also change how you return your results from get_page_data(). Currently, you're only going to return the first soup. In order to make it return a generator of all the soups, all you need to do is change that return into a yield. Good resource on yield and generators.
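
For example, the only change inside the function is the last line (a sketch; passing the url list in as a parameter is my addition, the rest keeps your existing requests/BeautifulSoup calls):

def get_page_data(urls):
    for url in urls:
        r = requests.get(url.strip())
        soup = BeautifulSoup(r.text, 'html.parser')
        yield soup  # yield each soup in turn instead of returning after the first one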

You're also going to have a problem with your print statements. They should either go inside a for loop that looks like for soup in pages:, or they should go inside get_page_data(). There is no variable soup defined in the context of those prints.
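
Putting those pieces together, the calling code might look something like this (a sketch that assumes every listing page has the same wlt_shortcode_* spans as the working section of your script):

pages = get_page_data(csvfilelist)

for soup in pages:
    print soup.find("span", {"class": "wlt_shortcode_TITLE"}).text
    print soup.find("span", {"class": "wlt_shortcode_map_location"}).text
    print soup.find("span", {"class": "wlt_shortcode_phoneNum"}).text
    print soup.find("span", {"class": "wlt_shortcode_EMAIL"}).text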
