How to scrape the html code from the response received?

Question

I am trying to crawl and scrape a website with Scrapy and Splash. I want to scrape a specific piece of HTML from a response, shown in the images below. Here is the response with its headers: [image]

Here is the response body (the HTML I want to scrape): [image]

I can find that HTML with the browser's Inspect tool. What my code returns is the HTML I see with "View page source". So this means that JavaScript modifies the code before embedding it. But Splash's role is to run JavaScript and return the resulting HTML, isn't it? response.body returns the page source without the HTML I need from the response mentioned above.

import scrapy
from scrapy_splash import SplashRequest
from bs4 import BeautifulSoup

class NetherSplashSpider(scrapy.Spider):
    name = 'nether_splash'
    download_delay = 10

    custom_settings = {
        'SPLASH_URL': 'http://localhost:8050',
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        'SPIDER_MIDDLEWARES': {
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
        },
        'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
    }

    def start_requests(self):
        yield SplashRequest(
            url='https://www.gaslicht.com/stroom-vergelijken?partial=true&aanbieders=eneco&skip=0&take=10&_=1559207102962',
            callback=self.parse,
        )


    def parse(self, response):
        # Save the raw body returned by Splash so it can be inspected by hand.
        filename = 'splash.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

Can you find that HTML in your browser using the Inspect tool (not the Network tool)? Maybe JavaScript modifies the code slightly before embedding it. Also try a longer wait time in your Splash configuration.
– Gallaecio, Commented May 30, 2019 at 14:38
Yes, I can find that HTML with the Inspect tool. What my code returns is the HTML I see with "View page source". JavaScript certainly modifies the code before embedding it. But Splash's role is to run JavaScript and return the resulting HTML, isn't it?
– pap, Commented Jun 5, 2019 at 7:54

mrhaanraadts · Accepted Answer · 2019-05-30 20:40:15Z

2

To load the full page, you need to add the "wait" parameter, which tells Splash how long (in seconds) to keep waiting for updates after the page has loaded. Try adding args={'wait': 1.0} to your SplashRequest:

yield SplashRequest(
    url='https://www.gaslicht.com/stroom-vergelijken?partial=true&aanbieders=eneco&skip=0&take=10&_=1559207102962',
    callback=self.parse,
    args={'wait': 1.0},
)
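If one second turns out to be too short, the wait can be made adjustable instead of hard-coded. The following is only a sketch: splash_request_kwargs is a hypothetical helper (not part of scrapy-splash) that builds the keyword arguments for SplashRequest. The 'wait' and 'timeout' keys are Splash render arguments and 'render.html' is the endpoint that returns the rendered HTML. The check that wait stays below timeout reflects my understanding of Splash's limits, so treat it as an assumption.

```python
def splash_request_kwargs(url, callback, wait=1.0, timeout=30.0):
    """Build keyword arguments for scrapy_splash.SplashRequest.

    Hypothetical helper for illustration; not part of scrapy-splash.
    """
    # Assumption: Splash rejects a wait that is not shorter than the
    # overall render timeout, so fail early with a clear message.
    if wait >= timeout:
        raise ValueError("wait must be smaller than timeout")
    return {
        "url": url,
        "callback": callback,
        "endpoint": "render.html",  # return the HTML after JavaScript has run
        "args": {"wait": wait, "timeout": timeout},
    }


# Usage inside the spider's start_requests (requires scrapy-splash):
#     from scrapy_splash import SplashRequest
#     yield SplashRequest(**splash_request_kwargs(url, self.parse, wait=5.0))
```

A longer wait is not guaranteed to help. If the missing HTML is only loaded after a user action or a later XHR call, the saved splash.html will still lack it.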


2 Comments

pap Over a year ago

I tried, but still nothing. It doesn't return the HTML I can see with the Inspect tool.

mrhaanraadts Over a year ago

Is Splash working properly? Please try rendering the URL directly in Splash at localhost:8050.
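The check suggested in the comment above can also be scripted, bypassing Scrapy entirely. This is a rough sketch that assumes Splash is listening on http://localhost:8050 (the SPLASH_URL from the question) and uses its render.html HTTP endpoint. The function names are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

SPLASH_URL = "http://localhost:8050"  # assumed to match SPLASH_URL in the spider


def render_html_url(target_url, wait=1.0, splash_url=SPLASH_URL):
    """Return the Splash render.html URL for target_url (hypothetical helper)."""
    # urlencode percent-escapes the target URL so that its own query string
    # is not mistaken for Splash arguments.
    query = urlencode({"url": target_url, "wait": wait})
    return f"{splash_url}/render.html?{query}"


def fetch_rendered_html(target_url, wait=1.0, splash_url=SPLASH_URL):
    """Ask a running Splash instance to render target_url and return the HTML."""
    with urlopen(render_html_url(target_url, wait, splash_url), timeout=60) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Example (needs a running Splash instance):
#     html = fetch_rendered_html(
#         "https://www.gaslicht.com/stroom-vergelijken?partial=true"
#         "&aanbieders=eneco&skip=0&take=10&_=1559207102962",
#         wait=2.0,
#     )
#     print(len(html))
```

If the HTML returned here already contains the wanted markup, the problem is on the Scrapy side (middleware settings, request arguments). If it does not, the issue lies with how Splash renders the page.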
