I have an asyncio-based crawler that occasionally offloads crawling that requires the browser to a ThreadPoolExecutor, as follows:

import concurrent.futures

from selenium import webdriver

def browserfetch(url):
    browser = webdriver.Chrome()
    browser.get(url)
    # Some explicit wait stuff that can take up to 20 seconds.
    return browser.page_source

async def fetch(url, loop):
    with concurrent.futures.ThreadPoolExecutor() as pool:
        result = await loop.run_in_executor(pool, browserfetch, url)
    return result

My issue is that I believe this respawns the headless browser each time I call fetch, which incurs browser startup time on each call to webdriver.Chrome. Is there a way for me to refactor browserfetch or fetch so that the same headless driver can be used on multiple fetch calls?

What have I tried?

I've considered using threads/pools more explicitly: starting the Chrome instance in a separate thread or process and communicating with it from the fetch call via queues, pipes, etc. (all run in executors to keep the calls from blocking). I'm not sure how to make this work, though.

1 Answer

I believe that starting the browser in a separate process and communicating with it via queues is a good approach (and more scalable). A sketch might look like this:

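# worker.py
from selenium import webdriver


def entrypoint(in_queue, out_queue):  # runs in a separate process
    browser = webdriver.Chrome()  # started once, reused for every command
    try:
        # A command of None tells the worker to stop.
        for command in iter(in_queue.get, None):
            browser.get(command)  # here each command is a URL
            out_queue.put(browser.page_source)
    finally:
        browser.quit()


# main.py
from multiprocessing import Process, Queue

import worker

if __name__ == "__main__":
    in_queue, out_queue = Queue(), Queue()
    process = Process(target=worker.entrypoint, args=(in_queue, out_queue))
    process.start()
    for new_task in ["https://example.com"]:  # placeholder URLs
        in_queue.put(new_task)
        result = out_queue.get()
    in_queue.put(None)  # stop the worker
    process.join()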
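
If threads are enough, a lighter variant of the same idea may also work: keep one long-lived ThreadPoolExecutor and give each pool thread its own browser through the executor's initializer argument. This is only a sketch under assumptions, not part of the approach above. FakeBrowser, start_browser and the module-level pool are hypothetical names, and FakeBrowser stands in for webdriver.Chrome so the example runs without Selenium installed.

```python
import asyncio
import concurrent.futures
import threading

_local = threading.local()  # holds one browser per pool thread


class FakeBrowser:
    """Hypothetical stand-in for webdriver.Chrome so the sketch runs anywhere."""

    started = 0  # counts how many browsers were created

    def __init__(self):
        FakeBrowser.started += 1

    def get(self, url):
        self.page_source = f"<html>{url}</html>"


def start_browser():
    # Runs once in each pool thread; swap in webdriver.Chrome() for real use.
    _local.browser = FakeBrowser()


def browserfetch(url):
    browser = _local.browser  # reuse this thread's browser
    browser.get(url)
    return browser.page_source


# One pool for the whole program, instead of a new pool per fetch() call.
pool = concurrent.futures.ThreadPoolExecutor(
    max_workers=2, initializer=start_browser
)


async def fetch(url):
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(pool, browserfetch, url)


async def main(urls):
    return await asyncio.gather(*(fetch(url) for url in urls))
```

Calling asyncio.run(main(urls)) with four URLs creates at most two browsers here, one per pool thread. With real drivers, each one would still need an explicit quit() at shutdown, which this sketch leaves out.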

2 Comments

Regarding the put and get calls in main.py: should these be executed in a pool to avoid blocking (e.g. await loop.run_in_executor(None, in_queue.put, new_task) and result = await loop.run_in_executor(None, out_queue.get)), assuming the loop is running inside a coroutine?
You can just call put_nowait/get_nowait repeatedly, or use a ready-made class such as the one in stackoverflow.com/a/24704950/13782669
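
The executor option raised in the first comment could look roughly like the sketch below. To stay runnable anywhere it makes several assumptions: the worker is a thread using queue.Queue rather than a process (multiprocessing.Queue offers the same blocking put and get), FakeBrowser is a hypothetical stand-in for webdriver.Chrome, and an asyncio.Lock is added so that each result is paired with the URL that requested it.

```python
import asyncio
import queue
import threading


class FakeBrowser:
    """Hypothetical stand-in for webdriver.Chrome so the sketch runs anywhere."""

    def get(self, url):
        self.page_source = f"<html>{url}</html>"


def entrypoint(in_queue, out_queue):
    browser = FakeBrowser()  # created once, reused for every task
    # A None task tells the worker to stop.
    for url in iter(in_queue.get, None):
        browser.get(url)
        out_queue.put(browser.page_source)


async def fetch(url, in_queue, out_queue, lock):
    loop = asyncio.get_running_loop()
    # One request at a time, so each result matches its own URL.
    async with lock:
        in_queue.put(url)  # unbounded queue: put() does not wait for space
        # get() blocks, so run it in the default executor to keep the loop free.
        return await loop.run_in_executor(None, out_queue.get)


async def main(urls):
    in_queue, out_queue = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=entrypoint, args=(in_queue, out_queue))
    worker.start()
    lock = asyncio.Lock()
    try:
        return await asyncio.gather(
            *(fetch(url, in_queue, out_queue, lock) for url in urls)
        )
    finally:
        in_queue.put(None)  # stop the worker
        worker.join()
```

Running asyncio.run(main(urls)) returns the pages in the same order as urls. Only the blocking get() goes through the executor; with a bounded queue, put() would need the same treatment.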
