I've been trying to run Scrapy from a Python script because I need to fetch the data and save it into my database. When I run it with the scrapy command

scrapy crawl argos

the spider runs fine, but when I try to run it from a script, following this link

http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script

I get this error:

$ python pricewatch/pricewatch.py update
Traceback (most recent call last):
  File "pricewatch/pricewatch.py", line 39, in <module>
    main()
  File "pricewatch/pricewatch.py", line 31, in main
    update()
  File "pricewatch/pricewatch.py", line 24, in update
    setup_crawler("argos.co.uk")
  File "pricewatch/pricewatch.py", line 13, in setup_crawler
    settings = get_project_settings()
  File "/Library/Python/2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/utils/project.py", line 58, in get_project_settings
    settings_module = import_module(settings_module_path)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
ImportError: No module named settings

I don't understand why get_project_settings() cannot find the settings module here, when the spider runs fine with the scrapy command in the terminal.

Here is a screenshot of my project layout (image not included).

Here is the pricewatch.py code:

import commands
import sys
from database import DBInstance
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log
from spiders.argosspider import ArgosSpider
from scrapy.utils.project import get_project_settings
import settings

def setup_crawler(domain):
    spider = ArgosSpider(domain=domain)
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()

def update():
    #print "Enter a product to update:"
    #product = raw_input()
    #print product
    #db = DBInstance()
    setup_crawler("argos.co.uk")
    log.start()
    reactor.run()

def main():
    try:
        if sys.argv[1] == "update":
            update()
        elif sys.argv[1] == "database":
            #db = DBInstance()
            pass  # placeholder so the empty branch is valid Python
    except IndexError:
        print "You must select a command from Update, Search, History"


if  __name__ =='__main__':
    main()

2 Answers

I fixed it: moving pricewatch.py to the project's top-level directory and running it from there solved the problem.
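
A likely reason this works (my assumption, based on how Scrapy locates a project, not something confirmed in this thread) is that Scrapy searches for scrapy.cfg starting in the current directory and moving up through its parents, and uses that directory to resolve the settings module. Here is a minimal sketch of a similar upward search that a script could run before importing project code, so it depends less on where it is launched from. The names find_project_root and add_project_root_to_path are hypothetical helpers, not part of Scrapy.

```python
import os
import sys


def find_project_root(start_dir, marker="scrapy.cfg"):
    """Walk upward from start_dir until a directory containing marker is found.

    Returns that directory, or None if the filesystem root is reached first.
    """
    current = os.path.abspath(start_dir)
    while True:
        if os.path.isfile(os.path.join(current, marker)):
            return current
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            return None
        current = parent


def add_project_root_to_path(start_dir):
    """Put the project root first on sys.path so project imports can resolve."""
    root = find_project_root(start_dir)
    if root is not None and root not in sys.path:
        sys.path.insert(0, root)
    return root
```

A script could call add_project_root_to_path(os.path.dirname(os.path.abspath(__file__))) near the top, before importing the spiders or calling get_project_settings().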

This answer borrows heavily from another answer, which I believe answers your question and also provides a decent example.

Consider a project with the following structure.

my_project/
    main.py                 # Where we are running scrapy from
    scraper/
        run_scraper.py               #Call from main goes here
        scrapy.cfg                   # deploy configuration file
        scraper/                     # project's Python module, you'll import your code from here
            __init__.py
            items.py                 # project items definition file
            pipelines.py             # project pipelines file
            settings.py              # project settings file
            spiders/                 # a directory where you'll later put your spiders
                __init__.py
                quotes_spider.py     # Contains the QuotesSpider class

Basically, I ran scrapy startproject scraper in the my_project folder, then added a run_scraper.py file to the outer scraper folder, a main.py file to the root folder, and quotes_spider.py to the spiders folder.

My main file:

from scraper.run_scraper import Scraper
scraper = Scraper()
scraper.run_spiders()

My run_scraper.py file:

from scraper.scraper.spiders.quotes_spider import QuotesSpider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import os


class Scraper:
    def __init__(self):
        settings_file_path = 'scraper.scraper.settings' # The path seen from root, ie. from main.py
        os.environ.setdefault('SCRAPY_SETTINGS_MODULE', settings_file_path)
        self.process = CrawlerProcess(get_project_settings())
        self.spider = QuotesSpider  # The spider you want to crawl

    def run_spiders(self):
        self.process.crawl(self.spider)
        self.process.start()  # the script will block here until the crawling is finished
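
One detail of the code above is easy to miss: os.environ.setdefault only writes the variable when it is not already set, so a SCRAPY_SETTINGS_MODULE exported in the shell would take precedence over the path given in __init__. A small sketch of that behaviour, using only the standard library; ensure_settings_module is a hypothetical helper name.

```python
import os

ENV_VAR = "SCRAPY_SETTINGS_MODULE"


def ensure_settings_module(module_path):
    """Set the settings module only if the variable is not already defined.

    Returns the value that is in effect afterwards.
    """
    return os.environ.setdefault(ENV_VAR, module_path)
```

If the script should always win over the environment, assigning os.environ[ENV_VAR] directly would be the alternative; which of the two is preferable depends on how the project is deployed.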

Also, note that the settings may need a review, since module paths must be written relative to the root folder (my_project, not scraper). So in my case:

SPIDER_MODULES = ['scraper.scraper.spiders']
NEWSPIDER_MODULE = 'scraper.scraper.spiders'

etc...
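
Because these dotted paths are resolved from the root folder, a quick check before starting a crawl can catch a wrong path early. This is a rough sketch using only the standard library; unresolvable_modules is a hypothetical helper name, and it only checks that each module can be located, not that its contents are valid.

```python
import importlib.util


def unresolvable_modules(dotted_paths):
    """Return the dotted paths that cannot be located from the current sys.path.

    Note: locating a submodule imports its parent packages as a side effect.
    """
    missing = []
    for path in dotted_paths:
        try:
            spec = importlib.util.find_spec(path)
        except (ImportError, ValueError):
            # Raised when a parent package is missing or is not a package.
            spec = None
        if spec is None:
            missing.append(path)
    return missing
```

Run from my_project, a call such as unresolvable_modules(['scraper.scraper.settings', 'scraper.scraper.spiders']) should return an empty list if the layout matches the one shown above.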

1 Comment

Good point, taken. I've added the essential parts of the answer. Thanks for the feedback.
