
What am I missing?

I'm trying to scrape some JSON with Scrapy, but the JSON response keeps coming back wrapped in this HTML header:

response.data['html'] returns:

2021-02-18 10:35:57 [bcb] DEBUG: b'<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{"TotalRows":132,"RowCount":15,"Rows":[{"tit`....

Here is the code:

    yield scrapy.Request(address_pesquisa, self.parse, meta={
        'splash': {
            'args': {
                # set rendering arguments here
                'html': 1,
                'png': 0,
            },

            # optional parameters
            'endpoint': 'render.json',  # optional; default is render.json
            'splash_url': 'http://192.168.15.100:8050',  # optional; overrides SPLASH_URL
            'slot_policy': scrapy_splash.SlotPolicy.PER_DOMAIN,
            'splash_headers': {},  # optional; a dict with headers sent to Splash
            'dont_process_response': False,  # optional, default is False
            'dont_send_headers': True,  # optional, default is False
            'magic_response': True,  # optional, default is True
        }
    })

Do I have to remove this header myself with a regex or something similar? Or is my Scrapy setup misconfigured?

1 Answer


A straightforward way to extract the JSON from the HTML is to use XPath (or CSS selectors); see the Scrapy Selectors documentation.

Something like this in the scrapy.Request callback (self.parse):

    import json  # at the top of the spider module

    json_response = response.xpath('html/body/pre/text()').get()
    json_response = json.loads(json_response)

Note that I didn't test the code, so you might need to adjust it a little (in case I mistyped the XPath or something).
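For checking the idea outside of a Scrapy callback, here is a minimal standard-library sketch of the same extraction: it tries to parse the body as plain JSON first and only falls back to pulling the text out of the `<pre>` element. The names `PreTextExtractor` and `load_json_body` are made up for this example, and the sample body only imitates the shape of the logged response (the contents of `Rows` are invented).

```python
import json
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collects the text found inside <pre> elements."""

    def __init__(self):
        # convert_charrefs defaults to True, so entities such as &amp;
        # arrive in handle_data() already decoded.
        super().__init__()
        self._depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.chunks.append(data)


def load_json_body(body):
    """Parse JSON that may or may not be wrapped in <html><body><pre>."""
    text = body.decode("utf-8") if isinstance(body, bytes) else body
    try:
        return json.loads(text)  # body was plain JSON already
    except ValueError:
        pass  # not plain JSON: assume the HTML-wrapped form
    parser = PreTextExtractor()
    parser.feed(text)
    parser.close()
    return json.loads("".join(parser.chunks))


# Hypothetical sample shaped like the logged response (Rows content invented).
wrapped = (
    b'<html><head></head><body>'
    b'<pre style="word-wrap: break-word; white-space: pre-wrap;">'
    b'{"TotalRows":132,"RowCount":15,"Rows":[{"name":"a &amp; b"}]}'
    b'</pre></body></html>'
)
data = load_json_body(wrapped)
```

Because the function accepts both forms, the same callback code keeps working whether or not the HTML wrapper is present. Inside a spider, the XPath one-liner above is still the shorter route.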

Also, you might want to try downloading the page with e.g. curl or the Scrapy shell and check whether the HTML part is still in the response. If it isn't, then something about using Splash is making the response come back with the HTML.


Update on why the HTML is not in the response when using curl:

One possibility is that the web server returns a different response when using a browser than when using curl. One reason for doing this is to make the JSON more readable for the user using the browser. I mean, trying to read through JSON is easier when it's properly formatted and not just everything on a single line :D

So, if this is the case, my guess would be that Splash passes some data to the server (e.g. a User-Agent, or being able to render JavaScript) that makes the server return a response with the HTML.

Skipping Splash and using just Scrapy Request for making the request could help (and also make the crawler a little bit faster).

Anyway, if the XPath works (and the small potential speed increase doesn't matter), go with the XPath.


3 Comments

I appreciate that you answered. And in fact, Splash is somehow adding this HTML to the response, while curl is not. I am going to take your XPath suggestion. Thanks =)
No problem, glad if it helped! If it solved your problem, I would appreciate it if you marked it as the accepted answer :)
@RafaelPinheiro I updated the answer and added some speculation for why there's a difference between using curl and Splash
