
I have a Python scraper that can currently only scrape one website at a time.

I have a list of 600-700 websites a day, all with identical layouts, held in an Excel list. I'm trying to find a way to change from scraping a single website to scraping multiple websites held in a single column within a .xlsm file.

I have previously written code to manually open 50 tabs at a time (see Example 1) and would like to incorporate that code, or a version of it, into my scraper if possible.

(Example 1)

import webbrowser
import xlrd

# raw string so the backslashes in the path are not read as escape sequences
file_location = r"C:\Python27\REAScraper\ScrapeFile.xlsm"
workbook = xlrd.open_workbook(file_location)
sheet = workbook.sheet_by_name("Sheet1")

url_column = 3  # column D holds the URLs
for row in range(1, 1000):
    if row % 50 == 0:  # pause after each batch of 50 tabs
        raw_input("Paused. Press Enter to continue")
    url = sheet.cell_value(row, url_column)
    webbrowser.open_new_tab(url)

Below is the Python scraper:

import urllib2
import csv
from bs4 import BeautifulSoup

# single hardcoded page for now; this is the URL I want to pull from Excel
quote_page = 'http://www.example.com/listing'  # placeholder URL
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'lxml')

titleTag = soup.html.head.title.text.strip()
p_class = soup.find('p').text.strip()
d_class = soup.find('div', class_="property-value__price").string.strip()
e_class = soup.find('p', class_="property-value__agent").string.strip()

print titleTag, p_class, d_class, e_class
with open('index2.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow([titleTag, p_class, d_class, e_class])

As stated, I can get a single website to work, but not a range of them or URLs pulled from an Excel sheet. I've tried Automate the Boring Stuff, Learn Python the Hard Way, and hundreds of Reddit and Google searches... just looking for some assistance if possible.

Cheers :)

1 Answer

Your question is quite broad and could easily turn into an entire programming tutorial, so here are a few points to get you started.

  1. Your first file looks OK. You are right to open the Excel file and read the rows in a loop. What you are missing is that instead of opening a new web browser tab, you should call your scraper function.

  2. You can simply paste the entire scraper code in place of the webbrowser.open_new_tab(url) call in the first file. Better yet, put it in a function and call that function from the first file. Better still, keep the scraper in a separate file and make it an importable module (see the sketch after this list). Creating Python modules can be a daunting task, so you may want to defer that until you feel more comfortable with the language.

  3. The target CSV file is opened in append mode, which means no existing data will be overwritten - that is good. Depending on how much data you gather from one site, you might want to use a separate file for each loop iteration. This requires storing the file name in a variable instead of hardcoding it. You might look into the os module to learn how to check whether a file exists, how to create a directory for all these CSV files, and so on - the sketch after this list shows one way.

  4. Instead of hardcoding the number of rows to read, you should read until you find an empty cell or an out-of-bounds exception is raised.

  5. Questions like this are probably better suited to Reddit or a similar programming-learning community.
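
To make points 1-4 concrete, here is a minimal sketch of how the two files could fit together. It keeps the Python 2 style of your question (urllib2, xlrd); the file names scraper.py and run_scraper.py, the results directory, and the per-row CSV naming are illustrative assumptions, not something from your original code.

# scraper.py - the scraping logic from the question wrapped in a
# function, so the Excel loop can call it once per URL (point 2)
import urllib2
import csv
from bs4 import BeautifulSoup

def scrape_page(url, csv_path):
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page, 'lxml')
    title = soup.html.head.title.text.strip()
    p_text = soup.find('p').text.strip()
    price = soup.find('div', class_="property-value__price").string.strip()
    agent = soup.find('p', class_="property-value__agent").string.strip()
    with open(csv_path, 'a') as csv_file:
        csv.writer(csv_file).writerow([title, p_text, price, agent])

# run_scraper.py - reads URLs from column D of the workbook and calls
# scrape_page() for each one (points 1, 3 and 4)
import os
import xlrd
from scraper import scrape_page

file_location = r"C:\Python27\REAScraper\ScrapeFile.xlsm"
url_column = 3          # column D, as in the question
out_dir = "results"     # assumed directory for the per-row CSV files

if not os.path.isdir(out_dir):   # point 3: use os to manage output files
    os.makedirs(out_dir)

workbook = xlrd.open_workbook(file_location)
sheet = workbook.sheet_by_name("Sheet1")

for row in range(1, sheet.nrows):        # sheet.nrows avoids out-of-bounds
    url = sheet.cell_value(row, url_column)
    if not url:                          # point 4: stop at first empty cell
        break
    scrape_page(url, os.path.join(out_dir, 'row_%d.csv' % row))

If you would rather keep everything in one CSV, pass a fixed path instead of the per-row name; since the file is opened in append mode, each call simply adds a row.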
