The code below doesn't return the value I expect.

from playwright.sync_api import sync_playwright
import time
import random

def main():
    with sync_playwright() as p:
        browser = p.firefox.launch(headless=False)
        page = browser.new_page()
        url = "https://www.useragentlist.net/"
        page.goto(url)
        time.sleep(random.uniform(2,4))

        test = page.locator('xpath=//span[1][@class="copy-the-code-wrap copy-the-code-style-button copy-the-code-inside-wrap"]/pre/code/strong').inner_text()
        print(test)

        count = page.locator('xpath=//span["copy-the-code-wrap copy-the-code-style-button copy-the-code-inside-wrap"]/pre/code/strong').count()
        print(count)


        browser.close()


if __name__ == '__main__':
    main()

page.locator(...).count() returns 0, even though I have no trouble getting the text with the locator on the lines above it. I need to access all of the matching elements. What is wrong with my use of locator and count?

  • Why the sleep and XPath? What info are you trying to get? Commented Aug 7, 2024 at 21:09
  • The random sleep is almost always necessary to avoid being flagged as a bot, so I always add it as a preventative measure. It's easier to do that first than to have to add it later. Commented Aug 7, 2024 at 22:12
  • True, sleeps can sometimes help with that, although I usually only attempt that after I've been flagged; otherwise it's a big performance hit. Adding a user agent is probably a better initial preventative step that doesn't incur a performance penalty, since Playwright's default user agent essentially says "I am a bot". Also, in this case, you're not actually interacting with the page, just visiting it and then leaving. Commented Aug 7, 2024 at 22:16
  • There are two reasons I have it set up like this. First, I'm going to use this code as a template while I migrate my projects from Selenium to Playwright. Second, I'll use the data I get from this website to generate a random user agent each time I run the other web scrapers. Commented Aug 7, 2024 at 22:22
  • You might consider using a library that generates a random user agent which is a bit faster and more reliable (the user agent site might have downtime, causing a disruption). Commented Aug 7, 2024 at 22:24

1 Answer

Your second locator's XPath has no @class= in its predicate, so it isn't the same expression as the first one, which works. Store the string in a variable so you don't have to type it twice and risk copy-paste or stale-data errors.
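Here's a minimal sketch of that variable idea, assuming the page structure from your question. SPAN_CLASS, AGENT_XPATH and read_agents are names I made up for illustration; locator(), first, inner_text() and count() are regular Playwright calls:

```python
# Build the XPath once so both calls are guaranteed to use the same expression.
SPAN_CLASS = "copy-the-code-wrap copy-the-code-style-button copy-the-code-inside-wrap"
AGENT_XPATH = f'xpath=//span[@class="{SPAN_CLASS}"]/pre/code/strong'


def read_agents(page):
    """Return (text of the first match, number of matches) for AGENT_XPATH.

    `page` is expected to be a Playwright Page (hypothetical helper).
    """
    agents = page.locator(AGENT_XPATH)
    # inner_text() waits for the first match, so count() runs after it exists.
    first_text = agents.first.inner_text()
    return first_text, agents.count()


def main():
    # Imported here so the helper above can be reused without a browser.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.firefox.launch()
        page = browser.new_page()
        page.goto("https://www.useragentlist.net/")
        first_text, count = read_agents(page)
        print(first_text)
        print(count)
        browser.close()
```

Call main() to run it against the live site. Because both calls share one locator, a typo can no longer affect only one of them.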

In any case, your approach seems overcomplicated. Each user agent is in a <code> tag, so just scrape that:

from playwright.sync_api import sync_playwright # 1.44.0


def main():
    with sync_playwright() as p:
        browser = p.firefox.launch()
        page = browser.new_page()
        url = "https://www.useragentlist.net/"
        page.goto(url, wait_until="domcontentloaded")
        agents = page.locator("code").all_text_contents()
        print(agents)
        browser.close()


if __name__ == "__main__":
    main()

Locator actions such as inner_text() auto-wait for the element, so there's no need to sleep. Avoid XPaths 99% of the time: they're brittle and difficult to read and maintain. Just use CSS selectors or user-visible locators. The goal is to choose the simplest selector necessary to disambiguate the elements you want, and nothing more. span/pre/code/strong is a rigid hierarchy; if any one of these changes, your code breaks unnecessarily.
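One caveat on auto-waiting: as far as I know, it applies to operations on a single element (click(), inner_text() and so on), while count() and all_text_contents() report whatever is in the DOM at the moment they run. On a page that renders its content late, an explicit wait on the first match is a cheaper substitute for a sleep. A hedged sketch; collect_texts and its default arguments are names I made up, while first, wait_for() and all_text_contents() are Playwright's own:

```python
def collect_texts(page, selector="code", timeout_ms=10_000):
    """Wait until at least one element matches, then return every match's text.

    `page` is expected to be a Playwright Page (hypothetical helper).
    """
    matches = page.locator(selector)
    # Block until the first match is attached to the DOM (or time out).
    matches.first.wait_for(state="attached", timeout=timeout_ms)
    # Only now read all matches; this call does not wait on its own.
    return matches.all_text_contents()
```

With the script above, agents = collect_texts(page) would replace the all_text_contents() line. For this particular site the wait should return immediately, since the user agents are already in the HTML.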

By the way, the user agents are in the static HTML, so unless you're trying to circumvent a block, you can do this faster with requests and Beautiful Soup:

from requests import get  # 2.31.0
from bs4 import BeautifulSoup  # 4.10.0; the "lxml" parser used below also needs the lxml package

response = get("https://www.useragentlist.net")
response.raise_for_status()
print([x.text for x in BeautifulSoup(response.text, "lxml").select("code")])

Better still (possibly), use a library like fake_useragent to generate your random user agent.
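Whichever source the strings come from, they still have to reach the browser. A minimal sketch, assuming you already have a list of user agent strings (for example, the agents list printed by the Playwright script above). pick_user_agent and new_page_with_agent are hypothetical helper names; user_agent is an option that Playwright's new_context() accepts:

```python
import random


def pick_user_agent(agents):
    """Return one random, non-empty user agent string from `agents`."""
    candidates = [a.strip() for a in agents if a.strip()]
    if not candidates:
        raise ValueError("no user agent strings to choose from")
    return random.choice(candidates)


def new_page_with_agent(browser, agents):
    """Open a page in a fresh context that sends a randomly chosen user agent.

    `browser` is expected to be a Playwright Browser.
    """
    context = browser.new_context(user_agent=pick_user_agent(agents))
    return context.new_page()
```

In your other scrapers, page = new_page_with_agent(browser, agents) would then take the place of browser.new_page().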

1 Comment

Yeah, that fixed it. Almost every time I use XPaths, it's a typo or other minor error that causes a catastrophic failure.
