
I want to make a web scraper in Scrapy that extracts 10,000 news links from https://hamariweb.com/news/newscategory.aspx?cat=7. The page is dynamic: more links load as I scroll down.

I tried it with Selenium, but it isn't working:

import time

import scrapy
from selenium import webdriver
from scrapy.http import HtmlResponse


class WebnewsSpider(scrapy.Spider):
    name = 'webnews'
    allowed_domains = ['hamariweb.com']
    start_urls = ['https://hamariweb.com/news/newscategory.aspx?cat=7']

    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--start-maximized")
        # options.add_argument('--blink-settings=imagesEnabled=false')
        options.add_argument('--ignore-certificate-errors')
        options.add_argument('--incognito')
        self.driver = webdriver.Chrome("C://Users//hammad//Downloads//chrome driver",
                                       options=options)

    def parse(self, response):
        self.driver.get(response.url)
        pause_time = 1
        last_height = self.driver.execute_script("return document.body.scrollHeight")

        while True:
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight + 400);")
            time.sleep(pause_time)  # give the page time to load the next batch of links
            new_height = self.driver.execute_script("return document.body.scrollHeight")
            if new_height == last_height:
                break  # page height stopped growing, no more content
            last_height = new_height
The code above opens a browser in incognito mode and keeps scrolling down. I also want to extract the 10,000 news links and stop the browser once that limit is reached.


1 Answer


You can add URL-gathering logic to your parse() method by collecting the href attributes with a CSS selector. Note that you need to re-parse the page source rendered by Selenium on each pass; the original response Scrapy downloaded is static and never contains the links loaded by scrolling:

def parse(self, response):
    self.driver.get(response.url)
    pause_time = 1
    last_height = self.driver.execute_script("return document.body.scrollHeight")
    urls = set()  # a set, so links that reappear on every pass are only counted once
    while len(urls) < 10000:
        # Re-parse the rendered page source so links loaded by scrolling are picked up
        rendered = HtmlResponse(self.driver.current_url,
                                body=self.driver.page_source, encoding='utf-8')
        for href in rendered.css('a::attr(href)').getall():
            urls.add(response.urljoin(href))
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight + 400);")
        time.sleep(pause_time)
        new_height = self.driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # page stopped growing before the limit was reached
        last_height = new_height

There's a lot of information on handling links in the "Following links" section of the Scrapy tutorial. You can use it to learn what else you can do with links in Scrapy.

I haven't tested this with the infinite scroll, so you may need to make some changes, but this should get you going in the right direction.
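The capping-and-deduplication step can also be pulled out into a small helper so it is easy to test in isolation. This is only a sketch: `collect_links` and the hard-coded base URL are illustrative names, not part of Scrapy or Selenium.

```python
from urllib.parse import urljoin

# Assumed base URL for resolving relative hrefs (illustrative).
BASE = "https://hamariweb.com/news/newscategory.aspx?cat=7"

def collect_links(hrefs, seen, limit=10000):
    """Add absolutised, deduplicated hrefs to `seen`.

    Returns True once `limit` unique links have been gathered,
    which the caller can use to break out of the scroll loop.
    """
    for href in hrefs:
        if len(seen) >= limit:
            return True
        seen.add(urljoin(BASE, href))
    return len(seen) >= limit
```

Using a set means links that reappear on every scroll pass are only counted once, which is what makes the 10,000 cutoff reliable.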


3 Comments

Thanks for your kind reply, that solved it :) ... but the data is not exported to CSV, the CSV file is always blank :(
What are you using to write the file to csv? Have you checked the python docs for writing to csv?
Cool -- can you upvote and accept the answer please?
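Regarding the blank CSV: Scrapy's feed export (`scrapy crawl webnews -o links.csv`) only writes items that the spider actually yields, so a list kept in a local variable never reaches the file. As a minimal sketch with placeholder URLs, the collected links can also be written out directly with Python's csv module:

```python
import csv

# Placeholder values standing in for the links gathered by the spider.
urls = ["https://hamariweb.com/news/a.aspx", "https://hamariweb.com/news/b.aspx"]

with open("links.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["url"])  # header row
    for url in urls:
        writer.writerow([url])
```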
