
I am trying to scrape this site https://franchisedisclosure.gov.au/Register with Playwright, and the URL doesn't change after you click the next button. How do I solve this pagination problem? Here's my code:

from bs4 import BeautifulSoup as bs
from playwright.sync_api import sync_playwright

url = 'https://franchisedisclosure.gov.au/Register'

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False, slow_mo=50)
    page = browser.new_page()
    page.goto(url)
    page.locator("text=I agree to the terms of use").click()
    page.locator("text=Continue").click()
    page.wait_for_load_state('domcontentloaded')
    page.is_visible('tbody')
    html = page.inner_html('table.table.table-hover')
    soup = bs(html, 'html.parser')
    table = soup.find('tbody')
    rows = table.findAll('tr')
    names = []
    industry = []
    Locations = []
    for row in rows:
        info = row.findAll('td')
        name = info[0].text.strip()
        industry = info[1].text.strip()
        Locations = info[2].text.strip()


I've checked online and every solution I see involves the URL changing. And for some reason, you can't make requests to the site's API directly; Postman said something about the parameters not being sent.

2 Answers


With some small adjustments you can get it; let's try this:

from bs4 import BeautifulSoup as bs
from playwright.sync_api import sync_playwright
import time

url = 'https://franchisedisclosure.gov.au/Register'

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False, slow_mo=100)
    page = browser.new_page()
    page.goto(url)
    page.locator("text=I agree to the terms of use").click()
    page.locator("text=Continue").click()
    page.wait_for_load_state('domcontentloaded')
    names = []
    industries = []
    locations = []
    # When you click to go to the next page, an element with the text "Loading..." appears on the screen, so we save that selector
    loading_icon = "//strong[text()='Loading...']"
    # This is the "next page" button
    next_page_locator = "//ul[@class='pagination']/li[3]"
    # We select the option of 50 elements per page
    page.select_option("#perPageCount", value="50")
    # We wait for the loading icon to become visible and then hidden, which means the new list is fully loaded
    page.wait_for_selector(loading_icon, state="visible")
    page.wait_for_selector(loading_icon, state="hidden")
    time.sleep(1)
    # We loop until the "next page" button is disabled, which means there are no more pages to paginate
    while True:
        # We scrape the info you wanted from the current page
        page.wait_for_selector('tbody')
        html = page.inner_html('table.table.table-hover')
        soup = bs(html, 'html.parser')
        table = soup.find('tbody')
        rows = table.find_all('tr')
        for row in rows:
            info = row.find_all('td')
            names.append(info[0].text.strip())
            industries.append(info[1].text.strip())
            locations.append(info[2].text.strip())
        # If the "next page" button is disabled we have just scraped the last page, so we stop
        if "disabled" in page.get_attribute(selector=next_page_locator, name="class"):
            break
        # Otherwise we click "next page" and wait for the loading element to appear and then disappear
        page.click(next_page_locator)
        page.wait_for_selector(loading_icon, state="visible")
        page.wait_for_selector(loading_icon, state="hidden")
        time.sleep(1)
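Once the loop finishes, the three lists can be written out however you like; here's a minimal sketch using Python's standard csv module (the filename is just an illustration):

import csv

# Sketch: dump the scraped rows to a CSV file (hypothetical filename).
with open("franchise_register.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "Industry", "Locations"])
    writer.writerows(zip(names, industries, locations))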

2 Comments

Thanks, it worked. How did you see the loading text? I didn't spot it at all.
Actually, when you click the next-page button there is a loading spinner, and next to the spinner the page has that "Loading..." element.
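If you prefer Playwright's locator API over raw XPath strings, the same visible-then-hidden wait on that spinner text can be written roughly like this (a small sketch reusing the "Loading..." element and next_page_locator from the answer above):

# Sketch: wait for the "Loading..." indicator using the locator API,
# reusing the same <strong>Loading...</strong> element described above.
loading = page.locator("strong", has_text="Loading...")
page.click(next_page_locator)
loading.wait_for(state="visible")   # spinner appears while the next page loads
loading.wait_for(state="hidden")    # spinner gone: the table has been refreshed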

Thanks for the great question... and answer. As an alternative (or in addition) to waiting on the loading_icon, you could also wait for a "networkidle" state, expanding on @Jaky Ruby's answer by adding page.wait_for_load_state(state="networkidle"). I often use the networkidle option to check that the next page has finished loading; I've read somewhere it's not necessarily best practice, but it works quite often.

from bs4 import BeautifulSoup as bs
from playwright.sync_api import sync_playwright
import time

url = 'https://franchisedisclosure.gov.au/Register'

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False, slow_mo=100)
    page = browser.new_page()
    page.goto(url)
    page.locator("text=I agree to the terms of use").click()
    page.locator("text=Continue").click()
    page.wait_for_load_state('domcontentloaded')
    names = []
    industries = []
    locations = []
    # When you click to go to the next page, an element with the text "Loading..." appears on the screen, so we save that selector
    loading_icon = "//strong[text()='Loading...']"
    # This is the "next page" button
    next_page_locator = "//ul[@class='pagination']/li[3]"
    # We select the option of 50 elements per page
    page.select_option("#perPageCount", value="50")
    # We wait for the loading icon to become visible and then hidden, which means the new list is fully loaded
    page.wait_for_selector(loading_icon, state="visible")
    page.wait_for_selector(loading_icon, state="hidden")
    # Additionally wait until there are no network requests in flight
    page.wait_for_load_state(state="networkidle")
    time.sleep(1)
    # We loop until the "next page" button is disabled, which means there are no more pages to paginate
    while True:
        # We scrape the info you wanted from the current page
        page.wait_for_selector('tbody')
        html = page.inner_html('table.table.table-hover')
        soup = bs(html, 'html.parser')
        table = soup.find('tbody')
        rows = table.find_all('tr')
        for row in rows:
            info = row.find_all('td')
            names.append(info[0].text.strip())
            industries.append(info[1].text.strip())
            locations.append(info[2].text.strip())
        # If the "next page" button is disabled we have just scraped the last page, so we stop
        if "disabled" in page.get_attribute(selector=next_page_locator, name="class"):
            break
        # Otherwise we click "next page" and wait for the loading element to appear and then disappear
        page.click(next_page_locator)
        page.wait_for_selector(loading_icon, state="visible")
        page.wait_for_selector(loading_icon, state="hidden")
        time.sleep(1)
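If networkidle turns out to be flaky, another option is to wait for the specific response that refills the table, e.g. with page.expect_response. This is only a sketch: the "Register" substring in the predicate is an assumption about what the paging request URL looks like, so check the actual request in your browser's network tab and adjust it.

# Sketch: wait for the paging request itself instead of networkidle.
# The "Register" substring below is an assumed URL fragment; verify it
# against the real XHR in the browser's network tab.
with page.expect_response(lambda r: "Register" in r.url and r.status == 200):
    page.click(next_page_locator)
# The response has arrived; wait for the table to finish re-rendering.
page.wait_for_selector(loading_icon, state="hidden")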

