
I'm currently following the official docs, as well as a YouTube video, to scrape JavaScript pages with Scrapy and the Splash JS rendering service.

https://splash.readthedocs.io/en/stable/install.html

https://www.youtube.com/watch?v=VvFC93vAB7U

I have Docker installed on my Mac and run Splash as per the official doc instructions:

docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash
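
As a quick sanity check (my own addition, assuming the default port mapping above), hitting the Splash endpoint from Python should return HTTP 200, since Splash serves its UI at the root URL:

import requests

# Reachability check: a 200 here means the Splash container is up
# and listening on the mapped port (8050 by default).
resp = requests.get("http://localhost:8050")
print(resp.status_code)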

I then have this demo code taken from the Youtube video:

import scrapy
from scrapy_splash import SplashRequest

class Demo_js_pider(scrapy.Spider):
    name = 'jsdemo'

    def start_request(self):
        yield SplashRequest(
            url = 'http://quotes.toscrape.com/js',
            callback = self.parse,
        )

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                'text': quote.css("span.text::text").extract.first(),
                'author': quote.css("small.author::text").extract_first(),
                'tags': quote.css("div.tags > a.tag::text").extract(),
            }

This is run with 'scrapy crawl jsdemo' (I already have Scrapy installed in a local virtualenv (Python 3.6.4), along with all the required modules, including scrapy-splash).
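
For completeness, the scrapy-splash README also calls for additions to settings.py along these lines (a sketch; the SPLASH_URL assumes the Docker port mapping above):

# settings.py: scrapy-splash wiring per its README
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'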

However, when it runs, nothing is returned apart from the output below, and there are no error messages:

2018-05-11 12:42:27 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-05-11 12:42:27 [scrapy.core.engine] INFO: Spider opened
2018-05-11 12:42:27 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-05-11 12:42:27 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-05-11 12:42:27 [scrapy.core.engine] INFO: Closing spider (finished)
2018-05-11 12:42:27 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 5, 11, 5, 42, 27, 552500),
 'log_count/DEBUG': 1,
 'log_count/INFO': 7,
 'memusage/max': 49602560,
 'memusage/startup': 49602560,
 'start_time': datetime.datetime(2018, 5, 11, 5, 42, 27, 513940)}
2018-05-11 12:42:27 [scrapy.core.engine] INFO: Spider closed (finished)

The above is truncated; this is a link to the full output: https://pastebin.com/yQVp3n6z

I have tried this several times now. I have also tried running a basic HTML scraping spider from the main Scrapy tutorial, and that ran just fine, so I'd guess the error is with Splash somewhere?

I noticed this in the output too:

DEBUG: Telnet console listening on 127.0.0.1:6023

Is this correct? The docker command runs Splash with telnet on port 5023. I tried changing that to 6023 and it didn't change anything. I have also tried setting TELNETCONSOLE_PORT in the settings to both 5023 and 6023, and this just throws these errors when I try to run scrapy crawl:

Traceback (most recent call last):
  File "/Users/david/Documents/projects/cryptoinfluencers/env/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/Users/david/Documents/projects/cryptoinfluencers/env/lib/python3.6/site-packages/scrapy/cmdline.py", line 150, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/Users/david/Documents/projects/cryptoinfluencers/env/lib/python3.6/site-packages/scrapy/cmdline.py", line 90, in _run_print_help
    func(*a, **kw)
  File "/Users/david/Documents/projects/cryptoinfluencers/env/lib/python3.6/site-packages/scrapy/cmdline.py", line 157, in _run_command
    cmd.run(args, opts)
  File "/Users/david/Documents/projects/cryptoinfluencers/env/lib/python3.6/site-packages/scrapy/commands/crawl.py", line 57, in run
    self.crawler_process.crawl(spname, **opts.spargs)
  File "/Users/david/Documents/projects/cryptoinfluencers/env/lib/python3.6/site-packages/scrapy/crawler.py", line 170, in crawl
    crawler = self.create_crawler(crawler_or_spidercls)
  File "/Users/david/Documents/projects/cryptoinfluencers/env/lib/python3.6/site-packages/scrapy/crawler.py", line 198, in create_crawler
    return self._create_crawler(crawler_or_spidercls)
  File "/Users/david/Documents/projects/cryptoinfluencers/env/lib/python3.6/site-packages/scrapy/crawler.py", line 203, in _create_crawler
    return Crawler(spidercls, self.settings)
  File "/Users/david/Documents/projects/cryptoinfluencers/env/lib/python3.6/site-packages/scrapy/crawler.py", line 55, in __init__
    self.extensions = ExtensionManager.from_crawler(self)
  File "/Users/david/Documents/projects/cryptoinfluencers/env/lib/python3.6/site-packages/scrapy/middleware.py", line 58, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/Users/david/Documents/projects/cryptoinfluencers/env/lib/python3.6/site-packages/scrapy/middleware.py", line 36, in from_settings
    mw = mwcls.from_crawler(crawler)
  File "/Users/david/Documents/projects/cryptoinfluencers/env/lib/python3.6/site-packages/scrapy/extensions/telnet.py", line 53, in from_crawler
    return cls(crawler)
  File "/Users/david/Documents/projects/cryptoinfluencers/env/lib/python3.6/site-packages/scrapy/extensions/telnet.py", line 46, in __init__
    self.portrange = [int(x) for x in crawler.settings.getlist('TELNETCONSOLE_PORT')]
  File "/Users/david/Documents/projects/cryptoinfluencers/env/lib/python3.6/site-packages/scrapy/settings/__init__.py", line 182, in getlist
    return list(value)
TypeError: 'int' object is not iterable
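
Judging from the last frames of the traceback, getlist() tries to iterate the setting's value, so if TELNETCONSOLE_PORT is overridden at all it presumably needs to be a list rather than a bare int, e.g.:

# settings.py: guessed fix for the TypeError above, giving Scrapy's
# getlist() an iterable port range instead of a single int
TELNETCONSOLE_PORT = [6023, 6073]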

At this point I'm not sure what else I need to change...

1 Answer


You have a simple typo: start_request() vs start_requests(). Scrapy only calls a method named exactly start_requests(), so the misspelled version is never invoked and the spider closes immediately without crawling anything, which matches your log output.

You also have another typo: extract.first() should be extract_first().

Here is the working code:

import scrapy
from scrapy_splash import SplashRequest

class Demo_js_pider(scrapy.Spider):
    name = 'jsdemo'

    def start_requests(self):
        yield SplashRequest(
            url = 'http://quotes.toscrape.com/js',
            callback = self.parse,
        )

    def parse(self, response):
        print("Parsing...\n")
        for quote in response.css("div.quote"):
            yield {
                'text': quote.css("span.text::text").extract_first(),
                'author': quote.css("small.author::text").extract_first(),
                'tags': quote.css("div.tags > a.tag::text").extract(),
            }
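
With both typos fixed, you can dump the items straight to a file with Scrapy's built-in feed export:

scrapy crawl jsdemo -o quotes.json

(As a side note, in newer Scrapy versions .get() and .getall() are the preferred equivalents of .extract_first() and .extract().)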

1 Comment

oh! I spent hours looking at that. Thank you!
