I am trying to write a simple scraper using the Playwright Python library.

Here is a basic example of how I use it:

from contextlib import contextmanager

from playwright.sync_api import sync_playwright


import asyncio

def is_async():
    return asyncio.get_event_loop().is_running()


class BaseScrapper(object):
    
    @property
    @contextmanager
    def playwright(self):
        print(f'BaseScrapper.playwright - is_async={is_async()}')
        with sync_playwright() as p:
            print(f'BaseScrapper.playwright with - is_async={is_async()}')
            yield p
    
    @property
    @contextmanager
    def browser(self):
        print(f'BaseScrapper.browser - is_async={is_async()}')
        with self.playwright as p:
            print(f'BaseScrapper.browser with - is_async={is_async()}')
            yield p.chromium.launch(headless=True)
    
    @contextmanager
    def open_page(self, url):
        print(f'BaseScrapper.open_page - is_async={is_async()}')
        with self.browser as browser:
            print(f'BaseScrapper.open_page.with - is_async={is_async()}')
            new_page = browser.new_page()
            
            # attach response listener
            new_page.on("response", self.intercept_response)
            
            new_page.goto(url, wait_until="domcontentloaded")
            yield new_page
    
    def intercept_response(self, response):
        pass



class ScrapeTest(BaseScrapper):
    @contextmanager
    def run(self):
        print(f'ScrapeTest.run - is_async={is_async()}')
        with self.open_page(url='www.google.com') as page:
            print(f'ScrapeTest.run - is_async={is_async()}')
            yield page


def run():
    print(f'running ... is_async={is_async()}')
    s = ScrapeTest()
    with s.run() as p:
        print(f'run.with is_async={is_async()}')

I would expect this to run in a sync context, but it appears to switch to async. Running the run function produces this output:

>>> run()
running ... is_async=False
ScrapeTest.run - is_async=False
BaseScrapper.open_page - is_async=False
BaseScrapper.browser - is_async=False
BaseScrapper.playwright - is_async=False
BaseScrapper.playwright with - is_async=True
BaseScrapper.browser with - is_async=True
BaseScrapper.open_page.with - is_async=True

Why does the "playwright" property switch to an async context inside the context manager block? I am trying to run some Django ORM calls in there, and they fail with:

django.core.exceptions.SynchronousOnlyOperation: You cannot call this from an async context - use a thread or sync_to_async.

But I am trying to run it in a sync context.

What did I miss?

Many thanks!

An even simpler example:

from playwright.sync_api import sync_playwright

import asyncio


class BaseScrapper(object):
    @property
    def playwright(self):
        print('basescrapper.playwright')
        return sync_playwright().start()
    
    @property
    def browser(self):
        print('basescrapper.browser')
        return self.playwright.chromium.launch(headless=True)
    
    def open_page(self, url):
        print('basescrapper.open page')
        page = self.browser.new_page()
        page.goto(url)
        return page



class SampleScrapper(BaseScrapper):
    def run(self):
        print(f'before - {"async" if asyncio.get_event_loop().is_running() else "sync"}')
        page = self.open_page(url='https://www.google.com')
        print(f'after - {"async" if asyncio.get_event_loop().is_running() else "sync"}')


def run():
    print(f'run 1 - {"async" if asyncio.get_event_loop().is_running() else "sync"}')
    s = SampleScrapper()
    print(f'run 2 - {"async" if asyncio.get_event_loop().is_running() else "sync"}')
    s.run()
    print(f'run 3 - {"async" if asyncio.get_event_loop().is_running() else "sync"}')

This results in:

>>> run()
run 1 - sync
run 2 - sync
before - sync
basescrapper.open page
basescrapper.browser
basescrapper.playwright
after - async
run 3 - async

2 Comments

  • Your code shows that an async loop is running, not that your own script is async. IO is async by definition, especially talking to another application like the browser Commented Nov 13, 2024 at 11:55
  • Is there an actual problem? It doesn't matter that Playwright itself is asynchronous and doesn't block. The API your application calls does block. In fact, why not use the async API? Commented Nov 13, 2024 at 12:57

1 Answer

It seems that Playwright for Python uses asyncio internally to perform its tasks, and the sync API is a wrapper around that.

Note: I'm not proficient in Python. Correct me if I'm wrong.
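
If that is right, one possible workaround is to run the sync-only work in a separate thread, as the Django error message itself suggests ("use a thread or sync_to_async"). Below is a minimal sketch using only the standard library. It assumes that the check behind the error boils down to "is an event loop running in the current thread?", and it simulates Playwright's running loop with asyncio.run instead of launching a browser. The names loop_is_running, run_in_plain_thread and simulate_loop_context are hypothetical helpers made up for this illustration; they are not part of Playwright or Django.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def loop_is_running() -> bool:
    """Return True if an event loop is running in the *current* thread."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


def run_in_plain_thread(func, *args, **kwargs):
    """Run func in a worker thread (which has no running loop) and wait for it.

    Hypothetical helper: this is where sync-only code, such as ORM calls,
    would go in the real scraper.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(func, *args, **kwargs).result()


async def simulate_loop_context():
    # Stand-in for "inside the sync_playwright() block": a loop is running
    # in this thread, so a running-loop check would report an async context.
    seen_here = loop_is_running()
    # The same check executed in a worker thread sees no running loop.
    seen_in_worker = run_in_plain_thread(loop_is_running)
    return seen_here, seen_in_worker


result = asyncio.run(simulate_loop_context())
print(result)
```

In the scraper, the function passed to run_in_plain_thread would be the one doing the ORM work, called from inside the Playwright block. Keep in mind that, as far as I know, Django database connections are per thread, so the worker thread would use its own connection.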

1 Comment

You're actually right. Playwright itself is async, and its bindings in all languages (except Java, which only just got something like async) use an async API. Its base platform is Node.js, which requires async for IO. Both Node.js and .NET have async-only APIs.
