
I want to make a web scraper in Scrapy that extracts 10,000 news links from https://hamariweb.com/news/newscategory.aspx?cat=7. The page is dynamic: more links load as I scroll down.

I tried it with Selenium, but it isn't working:

import time

import scrapy
from selenium import webdriver
from scrapy.http import HtmlResponse


class WebnewsSpider(scrapy.Spider):
    name = 'webnews'
    allowed_domains = ['hamariweb.com']
    start_urls = ['https://hamariweb.com/news/newscategory.aspx?cat=7']

    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--start-maximized")
        # options.add_argument('--blink-settings=imagesEnabled=false')
        options.add_argument('--ignore-certificate-errors')
        options.add_argument('--incognito')
        self.driver = webdriver.Chrome("C://Users//hammad//Downloads//chrome driver",
                                       options=options)

    def parse(self, response):
        self.driver.get(response.url)
        pause_time = 1
        last_height = self.driver.execute_script("return document.body.scrollHeight")

        while True:
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight + 400);")
            time.sleep(pause_time)  # give the page time to load the next batch of links
            new_height = self.driver.execute_script("return document.body.scrollHeight")
            if new_height == last_height:
                break  # page height stopped growing, no more content
            last_height = new_height
The code above opens a browser in incognito mode and keeps scrolling down. I also want to extract the 10,000 news links and stop the browser once that limit is reached.


1 Answer


You can add URL-gathering logic to your parse() method by collecting the href attributes with a CSS selector. Note that you need to re-parse the page source rendered by Selenium on each pass; the original response Scrapy downloaded is static and never contains the links loaded by scrolling:

def parse(self, response):
    self.driver.get(response.url)
    pause_time = 1
    last_height = self.driver.execute_script("return document.body.scrollHeight")
    urls = set()  # a set, so links that reappear on every pass are only counted once
    while len(urls) < 10000:
        # Re-parse the rendered page source so links loaded by scrolling are picked up
        rendered = HtmlResponse(self.driver.current_url,
                                body=self.driver.page_source, encoding='utf-8')
        for href in rendered.css('a::attr(href)').getall():
            urls.add(response.urljoin(href))
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight + 400);")
        time.sleep(pause_time)
        new_height = self.driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # page stopped growing before the limit was reached
        last_height = new_height

There's a lot of information on handling links in the "Following links" section of the Scrapy tutorial. You can use it to learn what else you can do with links in Scrapy.

I haven't tested this with the infinite scroll, so you may need to make some changes, but this should get you going in the right direction.
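The capping-and-deduplication step can also be pulled out into a small helper so it is easy to test in isolation. This is only a sketch: `collect_links` and the hard-coded base URL are illustrative names, not part of Scrapy or Selenium.

```python
from urllib.parse import urljoin

# Assumed base URL for resolving relative hrefs (illustrative).
BASE = "https://hamariweb.com/news/newscategory.aspx?cat=7"

def collect_links(hrefs, seen, limit=10000):
    """Add absolutised, deduplicated hrefs to `seen`.

    Returns True once `limit` unique links have been gathered,
    which the caller can use to break out of the scroll loop.
    """
    for href in hrefs:
        if len(seen) >= limit:
            return True
        seen.add(urljoin(BASE, href))
    return len(seen) >= limit
```

Using a set means links that reappear on every scroll pass are only counted once, which is what makes the 10,000 cutoff reliable.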


3 Comments

Thanks for your kind reply, that solved it :) ... but the data is not exported to CSV, the CSV file is always blank :(
What are you using to write the file to csv? Have you checked the python docs for writing to csv?
Cool -- can you upvote and accept the answer please?
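Regarding the blank CSV: Scrapy's feed export (`scrapy crawl webnews -o links.csv`) only writes items that the spider actually yields, so a list kept in a local variable never reaches the file. As a minimal sketch with placeholder URLs, the collected links can also be written out directly with Python's csv module:

```python
import csv

# Placeholder values standing in for the links gathered by the spider.
urls = ["https://hamariweb.com/news/a.aspx", "https://hamariweb.com/news/b.aspx"]

with open("links.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["url"])  # header row
    for url in urls:
        writer.writerow([url])
```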
