How do I modify this code to use the url list from the csv, go to those pages, and then execute the last section of the code to retrieve the correct data?
I suspect the section that reads the links from the csv and fetches each page is way off. I do have a csv with the target urls listed one per row, and the last section of the code, which extracts the contact details etc., works correctly on its own.
import requests
import re
from bs4 import BeautifulSoup
import csv
#Read csv
csvfile = open("gymsfinal.csv")
csvfilelist = csvfile.read()
#Get data from each url
def get_page_data():
    for page_data in csvfilelist:
        r = requests.get(page_data.strip())
        soup = BeautifulSoup(r.text, 'html.parser')
        return soup
pages = get_page_data()
'''print pages'''
#The work performed on scraped data
print soup.find("span",{"class":"wlt_shortcode_TITLE"}).text
print soup.find("span",{"class":"wlt_shortcode_map_location"}).text
print soup.find("span",{"class":"wlt_shortcode_phoneNum"}).text
print soup.find("span",{"class":"wlt_shortcode_EMAIL"}).text
th = soup.find('b',text="Category")
td = th.findNext()
for link in td.findAll('a',href=True):
    match = re.search(r'http://(\w+).(\w+).(\w+)', link.text)
    if match:
        print link.text
gyms = [name,address,phoneNum,email]
gym_data_list.append(gyms)
#Saving specific listing data to csv
with open ("xgyms.csv", "wb") as file:
    writer = csv.writer(file)
    for row in gym_data_list:
        writer.writerow(row)
Snippet of gymsfinal.csv:
http://www.gym-directory.com/listing/green-apple-wellness-centre/
http://www.gym-directory.com/listing/train-247-fitness-prahran/
http://www.gym-directory.com/listing/body-club/
http://www.gym-directory.com/listing/training-glen/
I changed the last line to writer.writerow([row]) so that the csv data is saved without commas between each letter.
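For anyone wondering why wrapping the value in a list helps: writerow treats its argument as a sequence of fields, so a bare string gets split into one field per character. Here is a minimal sketch of that behaviour, written for Python 3 and using an in-memory buffer instead of xgyms.csv (the buffer and the sample value "abc" are assumptions for illustration only):

```python
import csv
import io

# In-memory stand-in for the output file (assumption for illustration).
buf = io.StringIO()
writer = csv.writer(buf)

# A bare string is iterated character by character: one field per letter.
writer.writerow("abc")

# A one-element list is written as a single field.
writer.writerow(["abc"])

lines = buf.getvalue().splitlines()
print(lines[0])  # a,b,c
print(lines[1])  # abc
```

If each row is already a list such as [name, address, phoneNum, email], writer.writerow(row) should give one column per value without the extra brackets. Note that on Python 3 the file would be opened with open("xgyms.csv", "w", newline="") rather than "wb".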
r = requests.get(page_data.strip()). This assumes your file has one url per line. If it's actually a csv, better use the csv module.
requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h?
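The 'h' in that error is most likely the first character of the file: csvfile.read() returns one long string, and looping over a string yields single characters rather than lines. Below is a rough sketch of one way to restructure the loop, written for Python 3. The helper names read_urls, scrape_all and fake_scrape are hypothetical, the in-memory sample stands in for gymsfinal.csv, and fake_scrape is only a placeholder for the real requests and BeautifulSoup work so that the example runs without network access:

```python
import csv
import io


def read_urls(csv_file):
    """Return the first column of each non-empty row, stripped of whitespace."""
    return [row[0].strip() for row in csv.reader(csv_file)
            if row and row[0].strip()]


def scrape_all(urls, scrape_one):
    """Call scrape_one(url) for every URL and collect one result row per page.

    Building a list, rather than returning inside the loop, means every
    URL is visited instead of only the first one.
    """
    return [scrape_one(url) for url in urls]


# Stand-in for open("gymsfinal.csv"), using two rows from the question.
sample = io.StringIO(
    "http://www.gym-directory.com/listing/body-club/\n"
    "http://www.gym-directory.com/listing/training-glen/\n"
)

# Iterating over the raw text yields characters, hence Invalid URL 'h'.
first_item = next(iter(sample.getvalue()))

urls = read_urls(sample)


def fake_scrape(url):
    """Hypothetical placeholder for the per-page scraping step.

    The real version would call requests.get(url), build the soup, run
    the soup.find(...) lookups and return [name, address, phoneNum, email].
    """
    return [url.rstrip("/").split("/")[-1]]


rows = scrape_all(urls, fake_scrape)
print(first_item)  # h
print(rows)        # [['body-club'], ['training-glen']]
```

The point of the design is that the soup.find(...) lookups happen once per url, inside the function that is called in the loop, and each page contributes one row to the list that is later written to the output csv.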