Selenium / asyncio - using Executor without respawning webdriver

Question

I have an asyncio-based crawler that occasionally offloads crawling that requires the browser to a ThreadPoolExecutor, as follows:

import concurrent.futures

from selenium import webdriver

def browserfetch(url):
    browser = webdriver.Chrome()
    browser.get(url)
    # Some explicit wait stuff that can take up to 20 seconds.
    return browser.page_source

async def fetch(url, loop):
    with concurrent.futures.ThreadPoolExecutor() as pool:
        result = await loop.run_in_executor(pool, browserfetch, url)
    return result

My issue is that I believe this respawns the headless browser each time I call fetch, which incurs browser startup time on each call to webdriver.Chrome. Is there a way for me to refactor browserfetch or fetch so that the same headless driver can be used on multiple fetch calls?

What have I tried?

I've considered making more explicit use of threads/pools: starting the Chrome instance in a separate thread or process and communicating with it from within the fetch call via queues, pipes, etc. (all run in Executors to keep the calls from blocking). I'm not sure how to make this work, though.

alex_noname · Accepted Answer · 2020-07-02 17:58:36Z

2

I believe that starting the browsers in separate processes and communicating with them via queues is a good approach (and a more scalable one). The pseudo-code might look like this:

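# worker.py
def entrypoint(in_queue, out_queue):  # runs in the child process
    crawler = Crawler()  # placeholder for your crawling logic
    browser = Browser()  # placeholder, e.g. webdriver.Chrome(); created once
    while True:
        command = in_queue.get()  # blocks until a task arrives
        if command is None:  # sentinel: shut the worker down
            break
        result = crawler.process(command, browser)
        out_queue.put(result)

# main.py
from multiprocessing import Process, Queue

import worker

if __name__ == "__main__":
    in_queue, out_queue = Queue(), Queue()
    Process(target=worker.entrypoint, args=(in_queue, out_queue)).start()
    while not stop:  # stop and new_task are placeholders
        in_queue.put(new_task)
        result = out_queue.get()
    in_queue.put(None)  # ask the worker to exit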
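
For comparison, a lighter-weight variant of the same idea keeps the browser in one long-lived worker thread instead of a separate process. This is a minimal sketch and not part of the answer above: it assumes a module-level single-worker ThreadPoolExecutor that outlives individual fetch calls, and FakeBrowser is a hypothetical stand-in for webdriver.Chrome() so that the snippet runs without Selenium installed.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeBrowser:
    """Hypothetical stand-in for webdriver.Chrome(); counts how often it is created."""

    instances = 0

    def __init__(self):
        FakeBrowser.instances += 1
        self.page_source = ""

    def get(self, url):
        self.page_source = f"<html>{url}</html>"


_local = threading.local()
# One worker thread means one browser; the pool is created once, not per fetch.
_pool = ThreadPoolExecutor(max_workers=1)


def get_browser():
    # Create the browser the first time this thread needs it, then reuse it.
    if not hasattr(_local, "browser"):
        _local.browser = FakeBrowser()  # assumption: webdriver.Chrome() in real use
    return _local.browser


def browserfetch(url):
    browser = get_browser()
    browser.get(url)
    return browser.page_source


async def fetch(url):
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_pool, browserfetch, url)


async def main():
    return [await fetch(url) for url in ("a", "b", "c")]


pages = asyncio.run(main())
```

With a single worker, browser-bound fetches are serialized. Raising max_workers would give each worker thread its own browser, at the cost of one startup per thread.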

2 Comments

MikeRand Over a year ago

Regarding the calls to put and get in main.py: assuming the loop is sitting in a coroutine, should these be executed in a pool to avoid blocking (e.g. await loop.run_in_executor(None, in_queue.put, new_task) and result = await loop.run_in_executor(None, out_queue.get))?
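
The approach asked about in this comment could be sketched roughly as follows. To keep the snippet self-contained, queue.Queue and a plain thread stand in for multiprocessing.Queue and the browser process (an assumption; both queue types offer blocking put and get), and the worker only upper-cases its input instead of crawling. The names worker and submit are illustrative.

```python
import asyncio
import queue
import threading


def worker(in_queue, out_queue):
    # Stand-in for the browser process: upper-cases each task instead of crawling.
    while True:
        task = in_queue.get()
        if task is None:  # sentinel: stop the worker
            break
        out_queue.put(task.upper())


async def submit(in_queue, out_queue, task):
    loop = asyncio.get_running_loop()
    # The blocking put/get run in the default executor, so the event loop stays free.
    await loop.run_in_executor(None, in_queue.put, task)
    return await loop.run_in_executor(None, out_queue.get)


async def main():
    in_queue, out_queue = queue.Queue(), queue.Queue()
    thread = threading.Thread(target=worker, args=(in_queue, out_queue))
    thread.start()
    results = [await submit(in_queue, out_queue, task) for task in ("a", "b")]
    in_queue.put(None)  # ask the worker to exit
    thread.join()
    return results


results = asyncio.run(main())
```

Results are matched to requests purely by order here, so this sketch is only safe when one task is in flight at a time. Concurrent callers would need something like a task id attached to each result.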

alex_noname Over a year ago

You can simply poll with repeated put_nowait/get_nowait calls, or use a ready-made class such as the one in this answer: stackoverflow.com/a/24704950/13782669
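
The polling variant might look something like the sketch below. As an assumption made for the sake of a runnable example, queue.Queue and a thread again stand in for multiprocessing.Queue and the worker process (both raise queue.Empty from get_nowait), the worker just reverses strings, and get_polling and its interval parameter are illustrative names.

```python
import asyncio
import queue
import threading


def worker(in_queue, out_queue):
    # Stand-in for the browser process: reverses each task instead of crawling.
    while True:
        task = in_queue.get()
        if task is None:  # sentinel: stop the worker
            break
        out_queue.put(task[::-1])


async def get_polling(q, interval=0.01):
    # Try a non-blocking get; yield to the event loop between attempts.
    while True:
        try:
            return q.get_nowait()
        except queue.Empty:
            await asyncio.sleep(interval)


async def main():
    in_queue, out_queue = queue.Queue(), queue.Queue()
    thread = threading.Thread(target=worker, args=(in_queue, out_queue))
    thread.start()
    in_queue.put_nowait("abc")
    result = await get_polling(out_queue)
    in_queue.put_nowait(None)  # ask the worker to exit
    thread.join()
    return result


result = asyncio.run(main())
```

Polling avoids tying up executor threads, but the interval trades latency against wasted wake-ups, which matters little next to page loads that can take seconds.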
