
I want to scrape a website and its sub-pages, but it is taking too long. How can I optimize the request or use an alternative solution?

Below is the code I am using. It takes 10 seconds just to load the Google home page, so it clearly won't scale if I give it 280 links.

from selenium import webdriver
import time

# prepare the options for the Chrome driver
options = webdriver.ChromeOptions()
options.add_argument('--headless')

# start the Chrome browser
browser = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver", chrome_options=options)

start = time.time()
browser.get('http://www.google.com/xhtml')
print(time.time() - start)
browser.quit()

  • Have you tried using Scrapy? Could you provide the URL that you are actually scraping? The problem might be server-related. Commented Jan 3, 2020 at 11:17
  • tajinequiparle.com/dictionnaire-francais-arabe-marocain is the URL; I will go through all the letters and then through all the words. Commented Jan 3, 2020 at 11:27

3 Answers


Use the Python requests and Beautiful Soup modules.

import requests
from bs4 import BeautifulSoup

url = "https://tajinequiparle.com/dictionnaire-francais-arabe-marocain/"
url1 = "https://tajinequiparle.com/dictionnaire-francais-arabe-marocain/{}/"

# the letter A is listed on the main dictionary page
req = requests.get(url, verify=False)
soup = BeautifulSoup(req.text, 'html.parser')
print("Letters : A")
print([item['href'] for item in soup.select('.columns-list a[href]')])

# every other letter has its own sub-page
letters = ['B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
           'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']

for letter in letters:
    req = requests.get(url1.format(letter), verify=False)
    soup = BeautifulSoup(req.text, 'html.parser')
    print('Letters : ' + letter)
    print([item['href'] for item in soup.select('.columns-list a[href]')])
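
If you fetch many letter pages, reusing a single requests.Session keeps the connection alive between requests and is usually faster. A small sketch of that variant (same CSS selector and verify=False assumption as above; this is a suggested tweak, not part of the original answer):

import requests
from bs4 import BeautifulSoup

base = "https://tajinequiparle.com/dictionnaire-francais-arabe-marocain/{}/"
letters = ['B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
           'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']

# reuse one HTTP connection for all letter pages
with requests.Session() as session:
    session.verify = False  # assumption carried over from the answer; drop if the certificate is fine
    for letter in letters:
        resp = session.get(base.format(letter))
        soup = BeautifulSoup(resp.text, 'html.parser')
        links = [a['href'] for a in soup.select('.columns-list a[href]')]
        print('Letters : ' + letter, len(links), 'links')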


You can use the script below for speed; a multithreaded crawler is faster than anything else:

https://edmundmartin.com/multi-threaded-crawler-in-python/
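
For context, the run_scraper method shown below plugs into the crawler class from that article, which is built around a ThreadPoolExecutor and a queue. A rough sketch of that structure (the attribute names to_crawl, scraped_pages, pool, scrape_page and post_scrape_callback come from the snippet; everything else here is an assumption rather than the article's exact code):

from concurrent.futures import ThreadPoolExecutor
from queue import Queue, Empty
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

class MultiThreadScraper:
    def __init__(self, base_url, workers=20):
        self.base_url = base_url
        parts = urlparse(base_url)
        self.root = '{}://{}'.format(parts.scheme, parts.netloc)
        self.pool = ThreadPoolExecutor(max_workers=workers)  # shared worker pool
        self.scraped_pages = set()   # URLs already submitted for scraping
        self.to_crawl = Queue()      # frontier of URLs still to visit
        self.to_crawl.put(base_url)

    def scrape_page(self, url):
        # download one page; the result is handled in the callback
        return requests.get(url, timeout=30)

    def post_scrape_callback(self, job):
        # runs when a worker finishes: harvest new links from the fetched page
        result = job.result()
        if result and result.status_code == 200:
            soup = BeautifulSoup(result.text, 'html.parser')
            for anchor in soup.select('a[href]'):
                link = urljoin(self.root, anchor['href'])
                if link not in self.scraped_pages:
                    self.to_crawl.put(link)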

After that, you need to change the run_scraper method like this:

def run_scraper(self):
    with open("francais-arabe-marocain.csv", 'a') as file:
        file.write("url\n")  # CSV header
        for i in range(50000):
            try:
                target_url = self.to_crawl.get(timeout=600)
                # only follow pages of the French-Moroccan Arabic dictionary
                if target_url not in self.scraped_pages and "francais-arabe-marocain" in target_url:
                    self.scraped_pages.add(target_url)
                    job = self.pool.submit(self.scrape_page, target_url)
                    job.add_done_callback(self.post_scrape_callback)
                    # append the URL to the CSV as we go
                    df = pd.DataFrame([{'url': target_url}])
                    df.to_csv(file, index=False, header=False)
                    print(target_url)
            except Empty:
                return
            except Exception as e:
                print(e)
                continue

If a URL includes "francais-arabe-marocain", it is saved to the CSV file.

After that, you can scrape those URLs in a single for loop, reading the CSV line by line in the same way; see the sketch below.
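
A rough sketch of that second pass, assuming the same file name and a plain url column (the parsing of each word page is left out):

import csv
import requests
from bs4 import BeautifulSoup

# second pass: read the URLs collected by the crawler and fetch each one
with open("francais-arabe-marocain.csv", newline='') as f:
    for row in csv.reader(f):
        if not row or row[0] == "url":  # skip blank rows and the header line
            continue
        page = requests.get(row[0], timeout=30)
        soup = BeautifulSoup(page.text, 'html.parser')
        print(row[0], '->', soup.title.string if soup.title else 'no title')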



Try using urllib, like this:

import time
import urllib.request

start = time.time()
page = urllib.request.urlopen("https://google.com/xhtml")
print(time.time() - start)

It took only 2 seconds. However, it also depends on the quality of your connection.
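
For the 280 links mentioned in the question, urllib can also be combined with a thread pool so the downloads overlap. A sketch with a placeholder URL list (swap in your own links; the worker count is an arbitrary choice):

import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

urls = ["https://google.com/xhtml"] * 10  # placeholder; put your 280 links here

def fetch(url):
    # urlopen returns a file-like response; read() forces the full download
    with urllib.request.urlopen(url, timeout=30) as resp:
        return url, len(resp.read())

start = time.time()
with ThreadPoolExecutor(max_workers=10) as pool:
    for url, size in pool.map(fetch, urls):
        print(url, size, "bytes")
print("total:", round(time.time() - start, 2), "s")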
