
I am trying to scrape a website with Scrapy and Splash. I want to extract a specific piece of HTML from the response shown in the image. Here is the response with its headers: [image]

Here is the response (the HTML I want to scrape): [image]

I can find that HTML with the browser's Inspect tool. What my code returns is the HTML I see with "View page source". So this means that JavaScript modifies the page before embedding that HTML. But Splash's role is to run JavaScript and return the rendered HTML, isn't it? response.body contains the page source without the HTML I need from the response mentioned above.

import scrapy
from scrapy_splash import SplashRequest
from bs4 import BeautifulSoup

class NetherSplashSpider(scrapy.Spider):
    name = 'nether_splash'
    download_delay = 10

    custom_settings = {
        'SPLASH_URL': 'http://localhost:8050',
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        'SPIDER_MIDDLEWARES': {
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
        },
        'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
    }

    def start_requests(self):
        yield SplashRequest(
            url='https://www.gaslicht.com/stroom-vergelijken?partial=true&aanbieders=eneco&skip=0&take=10&_=1559207102962',
            callback=self.parse,
        )
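
    def parse(self, response):
        # Save whatever Splash returned so it can be compared with the
        # HTML shown by the browser's Inspect tool.
        filename = 'splash.html'
        with open(filename, 'wb') as f:
            f.write(response.body)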
  • Can you find that HTML in your browser using the Inspect tool (not the network tool)? Maybe JavaScript modifies the code slightly before embedding it. Also try using a longer wait time in your Splash configuration. Commented May 30, 2019 at 14:38
  • Yes, I can find that HTML with the Inspect tool. What my code returns is the HTML I see with "View page source". JavaScript certainly modifies the code before embedding it. But Splash's role is to run JavaScript and return the rendered HTML, isn't it? Commented Jun 5, 2019 at 7:54

1 Answer

To load the full page, you need to pass the "wait" argument so Splash gives the page's JavaScript time to run before returning the HTML. Try adding args={'wait': 1.0} to your SplashRequest:

yield SplashRequest(
    url='https://www.gaslicht.com/stroom-vergelijken?partial=true&aanbieders=eneco&skip=0&take=10&_=1559207102962',
    callback=self.parse,
    args={'wait': 1.0},
)

2 Comments

I tried, but still nothing. It doesn't return the HTML I can see with the Inspect tool.
Is Splash working properly? Please try to run the url using Splash on localhost:8050
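
One way to follow that suggestion without going through Scrapy is to build a request for Splash's render.html HTTP endpoint and open it in a browser or with curl. The snippet below is only a minimal sketch: it assumes Splash is listening on http://localhost:8050 (the SPLASH_URL from the question), and the helper name splash_render_url and the 5-second wait are illustrative choices, not something from the original posts.

```python
from urllib.parse import urlencode

# Assumption: same Splash instance as SPLASH_URL in the spider's settings.
SPLASH_URL = "http://localhost:8050"


def splash_render_url(target_url, wait=1.0, splash_url=SPLASH_URL):
    """Build a render.html URL asking Splash to load target_url.

    `wait` is the number of seconds Splash waits after the page loads
    before returning the rendered HTML. (Hypothetical helper name.)
    """
    # urlencode percent-encodes the target URL so its own query string
    # is not mistaken for parameters meant for Splash.
    query = urlencode({"url": target_url, "wait": wait})
    return f"{splash_url}/render.html?{query}"


if __name__ == "__main__":
    target = (
        "https://www.gaslicht.com/stroom-vergelijken"
        "?partial=true&aanbieders=eneco&skip=0&take=10&_=1559207102962"
    )
    # Open the printed URL in a browser to see what Splash returns.
    print(splash_render_url(target, wait=5.0))
```

If the HTML you want shows up at the printed URL, Splash itself is rendering the page and the problem is more likely in the spider settings; if it does not show up there either, a longer wait is probably worth trying before anything else.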
