
I'm writing a Python program that needs data from the internet, so I wrote some Scrapy spiders that crawl several pages and scrape the data. Afterwards they store the data in an Excel file, which serves as my database. For that I wrote my own class that handles the data inside the Excel file the way I need it. That part works. Now to my question:

I want to start the spiders from another Python script. I found some code that makes this possible, but I also need to import all the settings from the Scrapy project, as well as the pipelines, items, etc. I can't use

    get_project_settings()

because the script is in another directory (the Scrapy project folder sits in the same directory as the script I want to start it from). This is what I have so far:

    from scrapy.crawler import CrawlerProcess
    from desktop.Project.bots.question.spider import spider_test

    process = CrawlerProcess(settings={'Here I need to import the settings file from the spiders Project' })
    process.crawl(spider_test)
    process.start()

The spider runs, but I need my settings. It works completely fine when I put the script in the same project folder as my settings and use the following code:

    from scrapy.crawler import CrawlerProcess
    from desktop.question.spider import spider_test

    process = CrawlerProcess(get_project_settings())
    process.crawl(spider_test)
    process.start()

I also don't want to rewrite all the settings from the settings file as a dict and pass them in manually like this:

    process = CrawlerProcess(settings={
        "FEEDS": {
            "items.json": {"format": "json"},
        },
    })

The last snippet is just an example from the Scrapy docs; obviously I don't need the exporter. I already tried to import the settings file I need and pass it as the settings parameter, but that parameter expects a Python dictionary.

    process = CrawlerProcess(settings={})

I really hope somebody can explain how to solve this problem.
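One way to make `get_project_settings()` usable from outside the project directory, sketched under an assumed layout (a Scrapy project folder named `question` next to the launcher script; both names are hypothetical): the function honours the `SCRAPY_SETTINGS_MODULE` environment variable, so pointing it at the project's settings module, and making the project importable, before the call avoids rewriting the settings as a dict.

```python
import os
import sys

# Hypothetical layout: the Scrapy project folder "question" sits next to
# this launcher script; adjust the path and module name to your project.
project_dir = os.path.join(os.getcwd(), "question")

# get_project_settings() reads the SCRAPY_SETTINGS_MODULE environment
# variable, so set it to the project's settings module and make the
# project importable before calling it.
sys.path.insert(0, project_dir)
os.environ.setdefault("SCRAPY_SETTINGS_MODULE", "question.settings")

print(os.environ["SCRAPY_SETTINGS_MODULE"])
```

After this, `from scrapy.utils.project import get_project_settings` followed by `CrawlerProcess(get_project_settings())` should pick up the project settings as usual.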

  • Here is a similar question: stackoverflow.com/questions/31662797/… Commented Apr 25, 2020 at 14:51
  • This will only run one spider; if you have different settings for each spider, you can use custom_settings inside each spider. Commented Apr 25, 2020 at 15:26
  • Thanks, that really helped me a lot, but now I have the problem that I want to run that function multiple times, and the reactor raises an error that it can't be restarted. I've already read some posts about this, but I can't figure out how to solve it. Commented Apr 26, 2020 at 16:03
  • If you need to restart the process, you can use subprocess. Commented Apr 27, 2020 at 19:45
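The subprocess suggestion from the last comment can be sketched as follows: each crawl runs in a fresh Python process, so the Twisted reactor starts clean every time and never has to be restarted in the same interpreter. The `-c` stub here stands in for a real launcher script (a hypothetical `run_spider.py`):

```python
import subprocess
import sys

# Run the crawl three times, each in its own interpreter. In a real setup
# you would replace the -c stub with your launcher script, e.g.:
#   subprocess.run([sys.executable, "run_spider.py"], check=True)
for _ in range(3):
    result = subprocess.run(
        [sys.executable, "-c", "print('crawl finished')"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout.strip())
```

Because `check=True` raises on a non-zero exit code, a failing crawl stops the loop instead of silently repeating.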

1 Answer


Add a new file (example.py) in your project:

    import os

    # Each scrapy command runs in its own process, so the reactor
    # starts fresh on every iteration.
    while True:
        os.system('scrapy crawl verbos')

then run it:

    python example.py