Why does Splash+Scrapy add an HTML header to a JSON response?

Question

What am I missing?

I'm trying to scrape some JSON with Scrapy, but I keep receiving this HTML header along with the JSON response.

response.data['html'] returns:

2021-02-18 10:35:57 [bcb] DEBUG: b'<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{"TotalRows":132,"RowCount":15,"Rows":[{"tit`....

Here is the code:

    yield scrapy.Request(address_pesquisa, self.parse, meta={
            'splash': {
                'args': {
                    # set rendering arguments here
                    'html': 1,
                    'png': 0,

                },

                # optional parameters
                'endpoint': 'render.json',  # optional; default is render.json
                'splash_url': 'http://192.168.15.100:8050',  # optional; overrides SPLASH_URL
                'slot_policy': scrapy_splash.SlotPolicy.PER_DOMAIN,
                'splash_headers': {},  # optional; a dict with headers sent to Splash
                'dont_process_response': False,  # optional, default is False
                'dont_send_headers': True,  # optional, default is False
                'magic_response': True,  # optional, default is True
            }
        })

Do I have to remove this header myself with a regex or something similar? Or is my Scrapy misconfigured?

Toivo Mattila · Accepted Answer · 2021-02-19 07:35:20Z

A straightforward option for extracting the JSON inside the HTML is to use XPath (or CSS selectors); see the Scrapy Selectors documentation.

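Something like this in the scrapy.Request callback function (self.parse):

    import json  # at the top of the spider module

    # Take the text of the <pre> element that wraps the JSON, then parse it
    json_response = response.xpath('html/body/pre/text()').get()
    json_response = json.loads(json_response)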

Note that I didn't test the code so you might need to change it a little bit (if I typo'd the XPath or something).
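To check the unwrapping step offline, without Scrapy or Splash running, here is a minimal sketch that uses only the Python standard library. It is not the XPath approach itself but an equivalent under one assumption: the JSON sits inside the first `<pre>` element. The names `PreTextExtractor` and `json_from_pre` and the sample row are hypothetical, made up for illustration.

```python
import json
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collects the text found inside the first <pre> element."""

    def __init__(self):
        # convert_charrefs defaults to True, so entities such as &amp;
        # are decoded before handle_data is called.
        super().__init__()
        self._in_pre = False
        self._done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre" and not self._done:
            self._in_pre = True

    def handle_endtag(self, tag):
        if tag == "pre" and self._in_pre:
            self._in_pre = False
            self._done = True

    def handle_data(self, data):
        if self._in_pre:
            self.chunks.append(data)


def json_from_pre(html_text):
    """Parse the JSON document wrapped in <pre>...</pre>."""
    parser = PreTextExtractor()
    parser.feed(html_text)
    parser.close()
    return json.loads("".join(parser.chunks))


# Sample shaped like the log line in the question; the row content is invented.
sample = (
    '<html><head></head><body>'
    '<pre style="word-wrap: break-word; white-space: pre-wrap;">'
    '{"TotalRows":132,"RowCount":15,"Rows":[{"title":"A &amp; B"}]}'
    '</pre></body></html>'
)
data = json_from_pre(sample)
```

Going through an HTML parser rather than a regex means that entity-escaped characters inside the JSON are decoded before `json.loads` sees them.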

Also, you might want to try downloading the page with e.g. curl or the Scrapy shell and check whether the HTML part is still in the response. If it isn't, then using Splash is somehow causing the response to come back wrapped in HTML.
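One rough way to run that check on a saved body, whatever tool fetched it, is to see whether it parses as JSON directly. This is only a sketch; the helper name `is_bare_json` is hypothetical and not part of Scrapy or Splash.

```python
import json


def is_bare_json(body):
    """Return True if body parses as JSON as-is, False if it is wrapped or invalid."""
    try:
        text = body.decode("utf-8") if isinstance(body, bytes) else body
        json.loads(text)
    except ValueError:  # covers json.JSONDecodeError and UnicodeDecodeError
        return False
    return True


plain = b'{"TotalRows":132,"RowCount":15}'
wrapped = '<html><head></head><body><pre>{"TotalRows":132}</pre></body></html>'
results = (is_bare_json(plain), is_bare_json(wrapped))
```

If the body from curl passes and the one from Splash does not, the wrapper is being introduced somewhere on the Splash side of the request.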

Update on why the HTML is not in the response when using curl:

One possibility is that the web server returns a different response when using a browser than when using curl. One reason for doing this is to make the JSON more readable for the user using the browser. I mean, trying to read through JSON is easier when it's properly formatted and not just everything on a single line :D

So, if this is the case, my guess would be that Splash passes some data to the server (e.g. the User-Agent, or being able to render JavaScript) that makes the server return a response with the HTML.

Skipping Splash and using just Scrapy Request for making the request could help (and also make the crawler a little bit faster).
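A sketch of what the callback could look like without Splash, assuming the endpoint returns bare JSON to a plain request (which the curl check suggests but does not guarantee). The function name `parse_rows` is hypothetical; the `Rows` key comes from the log output in the question.

```python
import json


def parse_rows(response):
    """Callback for a plain Scrapy request: yield each entry under "Rows".

    Intended wiring inside the spider (no 'splash' key in meta), e.g.:
        yield scrapy.Request(address_pesquisa, callback=parse_rows)
    Only response.text is used, which Scrapy text responses provide.
    """
    data = json.loads(response.text)
    yield from data.get("Rows", [])
```

Because the callback only reads `response.text`, it can be tried out with any stand-in object that has a `text` attribute before it is plugged into a spider.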

Anyway, if the XPath works (and the small, merely possible speed increase doesn't matter), go with the XPath.

edited Feb 19, 2021 at 7:35

answered Feb 18, 2021 at 15:09

3 Comments

Rafael Pinheiro Over a year ago

I appreciate that you answered. And in fact, Splash is somehow adding this HTML to the response, while curl is not. I am going to take your XPath suggestion. Thanks =)

Toivo Mattila Over a year ago

No problem, glad if it helped! If it solved your problem, I would appreciate it if you marked it as the accepted answer :)

Toivo Mattila Over a year ago

@RafaelPinheiro I updated the answer and added some speculation for why there's a difference between using curl and Splash