
My thinking is that if I added something that splits the URL range into 5 and gives each of 5 chromedriver instances its own part of the range to handle, scraping would be much faster. That's my biggest question. But maybe it would then be better for each chromedriver to have its own CSV file, or would I need to add something that pools all the scraped data into one file? I'm really at a loss here and I'm already pushing my skill level. I would be eternally grateful for any concrete help with at least getting multithreading working. Thank you!

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
import csv
path_to_file = "test1.csv"
csvFile = open(path_to_file, 'a', encoding="utf-8", newline='')
csvWriter = csv.writer(csvFile)
options = webdriver.ChromeOptions() 
driver = webdriver.Chrome(options=options)
header_added = False
time.sleep(3)
for i in range(1,153512):
    print(f"https://www.ancestry.com/discoveryui-content/view/{i}:61965")
    driver.get(f"https://www.ancestry.com/discoveryui-content/view/{i}:61965")
    try:
        Name = driver.find_element(By.XPATH,"//table[@id='recordServiceData']//tr[contains(.,'Name:')]").text.replace("Name:", "")
    except:
        Name =''
    csvWriter.writerow([i, Name])
    print(Name)
1 Answer

Try this. Each thread starts its own Chrome instance and works through its own chunk of the range:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
import csv
path_to_file = "test1.csv"
csvFile = open(path_to_file, 'a', encoding="utf-8", newline='')
csvWriter = csv.writer(csvFile)

header_added = False
time.sleep(3)


    
from threading import Lock
csv_lock = Lock()  # csvWriter is shared by all threads, so writes are serialized with a lock

def init_driver_worker(_range_task):  # create a new instance of Chrome, then make it do its job
    ##### init driver
    options = webdriver.ChromeOptions()
    # you can't run multiple instances of Chrome
    #  with the same profile being used,
    #  so either create a new profile for each instance or use incognito mode
    options.add_argument("--incognito")
    options.add_argument("--headless")  # use headless browser (no GUI) to be faster
    driver = webdriver.Chrome(options=options)
    ##### do the task
    try:
        for i in _range_task:
            url = f"https://www.ancestry.com/discoveryui-content/view/{i}:61965"
            print(url)
            driver.get(url)
            try:
                Name = driver.find_element(By.XPATH,"//table[@id='recordServiceData']//tr[contains(.,'Name:')]").text.replace("Name:", "")
            except Exception:
                Name = ''
            with csv_lock:
                csvWriter.writerow([i, Name])
            print(Name)
    finally:
        driver.quit()  # close this thread's browser; the thread ends when the function returns
    
    
    
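def split_range(_range, parts):  # split a range into at most `parts` chunks
    # round up so a remainder doesn't spill into an extra chunk (and an extra Chrome instance)
    chunk_size = max(1, -(-len(_range) // parts))
    chunks = [_range[x:x+chunk_size] for x in range(0, len(_range), chunk_size)]
    return chunks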

my_range = range(1,153512)
chunks = split_range(my_range, 10) # split the task to 10 instances of chrome

from threading import Thread
thread_workers = []
for chunk in chunks:
    t = Thread(target=init_driver_worker, args=([chunk]))
    thread_workers.append(t)
    t.start()
    
# wait for the thread_workers to finish
for t in thread_workers:
    t.join()

csvFile.close()  # flush buffered rows to disk once every worker is done
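
On the question of one CSV file per chromedriver versus pooling everything into one file: a third option is to let each worker return its rows and have only the main thread write them. The following is a minimal sketch of that idea, not a tested scraper. `scrape_chunk`, `scrape_all` and the `fetch_name` parameter are hypothetical names introduced here for illustration; in the real script, `scrape_chunk` would create its own `webdriver.Chrome`, loop over its chunk with `driver.get(...)`, and call `driver.quit()` when done.

```python
import csv
from concurrent.futures import ThreadPoolExecutor


def split_range(_range, parts):
    """Split a range into at most `parts` contiguous chunks."""
    chunk_size = max(1, -(-len(_range) // parts))  # ceiling division
    return [_range[x:x + chunk_size] for x in range(0, len(_range), chunk_size)]


def scrape_chunk(chunk, fetch_name):
    """Return [(i, name), ...] for one chunk.

    fetch_name is a hypothetical stand-in for the Selenium lookup.
    """
    return [(i, fetch_name(i)) for i in chunk]


def scrape_all(_range, parts, fetch_name, out_file):
    """Scrape every chunk in its own thread; only this thread writes the CSV."""
    chunks = split_range(_range, parts)
    writer = csv.writer(out_file)
    with ThreadPoolExecutor(max_workers=max(1, len(chunks))) as pool:
        # map() yields results in chunk order, so rows come out sorted by i
        for rows in pool.map(lambda c: scrape_chunk(c, fetch_name), chunks):
            writer.writerows(rows)
```

Because no thread other than the main one touches the writer, no lock is needed and the rows end up in order. The trade-off is that a chunk's rows stay in memory until that chunk is finished, so a crash loses the chunks still in progress.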

3 Comments

Hello Ibrahem! Thank you for such a quick answer. I just tried it but I get an error: Exception in thread Thread-1 (init_driver_worker): Traceback (most recent call last): File "C:\Python310\lib\threading.py", line 1009, in _bootstrap_inner Exception in thread Thread-2 (init_driver_worker): For more: pastebin.com/Zkp13wdu
I found the answer to the first comment. You had just one mistake in the code, a missing comma: args=(chunk)) should be args=(chunk,))
Sorry, I should have tested the code first. I fixed it using args=([chunk]). I also added options.add_argument("--headless") to use a headless browser (no GUI) so the code is faster and lighter. (Thanks for your upvote!)
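
For anyone hitting the same error, here is a small sketch of why the comma matters. `Thread` unpacks `args` and passes each element as a separate argument, and `(chunk)` is just `chunk` itself, so the range would be unpacked into many arguments. The names `worker` and `received` are made up for this illustration.

```python
from threading import Thread

received = []


def worker(task):
    # Record whatever single argument the thread was started with.
    received.append(task)


chunk = range(3)

# Parentheses alone do not make a tuple: (chunk) is the range itself,
# so args=(chunk) would make Thread call worker(0, 1, 2).
assert (chunk) is chunk

# (chunk,) and [chunk] are both one-element sequences holding the range,
# so the worker receives the whole range as its single argument.
for args in ((chunk,), [chunk]):
    t = Thread(target=worker, args=args)
    t.start()
    t.join()
```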
