I am trying to use the Scrapy framework to scrape the https://www.sreality.cz/en/search/for-sale/apartments website.

Part of the site's content is rendered with JavaScript, so I am trying to use a Splash Docker container to get rendered HTML that I can easily parse.

I downloaded the scrapinghub/splash Docker image and started a container on port 8050 from the terminal.

% docker pull scrapinghub/splash

% docker run -p 8050:8050 scrapinghub/splash

In the settings.py file in my Scrapy project directory, I added these lines as instructed at https://github.com/scrapy-plugins/scrapy-splash.

SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

I created a new spider in my project directory.

import scrapy
from scrapy_splash import SplashRequest

class FlatSpider(scrapy.Spider):
    name = "flat"
    def start_requests(self):
        # sreality url
        url = 'https://www.sreality.cz/en/search/for-sale/apartments'

        # beer test url
        # url = 'https://www.beerwulf.com/en-gb/c/mixedbeercases'

        yield SplashRequest(url=url, callback=self.parse, args={'wait': 0.5})

    def parse(self, response):

        # sreality variable
        foo = response.css('span.name.ng-binding::text').get()

        # beer test variable
        # foo = response.css('h4.product-name::text').get()

        print(foo)

If I run this spider with % scrapy crawl flat in the terminal, it prints None, even though it should return text (which I can see in the Chrome inspector). Otherwise everything seems to work: if I uncomment the two 'beer test' lines in place of the sreality ones, Splash renders HTML I can parse and the spider prints the expected text.

Also, when I open the Splash UI at http://localhost:8050 and try to render https://www.sreality.cz/en/search/for-sale/apartments, the page does not seem to render correctly. Other websites render fine.
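
The same check can be made without Scrapy by calling Splash's render.html HTTP endpoint directly. This is only a minimal sketch using the standard library, assuming the container started above is listening on localhost:8050. The helper names and the wait and timeout values are illustrative assumptions, not something I have tuned for this site.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

SPLASH_URL = "http://localhost:8050"  # same value as SPLASH_URL in settings.py
TARGET_URL = "https://www.sreality.cz/en/search/for-sale/apartments"


def build_render_url(target, wait=5.0, timeout=60.0, splash_url=SPLASH_URL):
    """Build a URL for Splash's render.html endpoint (hypothetical helper).

    The wait and timeout defaults are guesses; wait is deliberately longer
    than the 0.5 s used in the spider.
    """
    query = urlencode({"url": target, "wait": wait, "timeout": timeout})
    return f"{splash_url}/render.html?{query}"


def fetch_rendered_html(target, **kwargs):
    """Return the HTML Splash produced for target (needs a running Splash)."""
    with urlopen(build_render_url(target, **kwargs)) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    html = fetch_rendered_html(TARGET_URL)
    # If the class used in the CSS selector never appears in the output,
    # Splash did not render the listing and the spider cannot match it either.
    print("ng-binding" in html)
```

If this prints False even with a longer wait, the problem is in how Splash renders the page rather than in the spider or its selector.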

For some reason this setup does not work for the one website I am interested in. I am trying to figure out why, and how to get a response from this site that I can parse with response.css.

I am running this on macOS 13.0.1 on Apple silicon, if that matters.

1 Answer

I have tried Splash before, but its community is no longer active. A better plugin for scraping interactive, JavaScript-driven websites is scrapy-playwright.
