I got into automating web tasks with Python.
I have tried requests/urllib3/requests-html, but they don't give me the right elements, because they fetch only the raw HTML (not the version updated by JavaScript).
Some people recommended Selenium, but it opens a browser through the webdriver.
I need a way to get elements after they get updated, and possibly after they get updated a second time.
The reason I don't want it to open a browser is that I'm running my script on a script-hosting service.
Can you please share a minimal reproducible example, i.e. some code and a test URL/HTML? – QHarr, Nov 25, 2018
A little late, but I found woob.tech interesting: a full browser without webdriver or a local browser install. – nicolaus-hee, Sep 17, 2021
2 Answers
Here is my solution to your problem.
Beautiful Soup doesn't mimic a client. JavaScript is code that runs on the client: with Python we simply make a request to the server and get the server's response, which includes the JavaScript source, but it is the browser that reads and executes that JavaScript. So we need something that does the same. There are many ways to do this. If you're on Mac or Linux, you can set up dryscrape, or we can do essentially what dryscrape does ourselves with PyQt4.
import sys
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebPage
import bs4 as bs

class Client(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self.on_page_load)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def on_page_load(self):
        self.app.quit()

url = 'https://pythonprogramming.net/parsememcparseface/'
client_response = Client(url)
# toHtml() returns the DOM *after* WebKit has executed the page's JavaScript
source = client_response.mainFrame().toHtml()
soup = bs.BeautifulSoup(source, 'lxml')
js_test = soup.find('p', class_='jstest')
print(js_test.text)
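To see concretely why a plain HTTP fetch plus Beautiful Soup wasn't enough, here is a minimal sketch. The HTML string is a made-up stand-in modeled on the test page above: the server sends a placeholder that a script would rewrite in a browser, but a parser alone never runs that script.

```python
import bs4 as bs

# What the server actually sends: placeholder text, plus a script
# that would only run inside a browser's JavaScript engine.
raw_html = """
<p class="jstest" id="yesnojs">y u bad tho?</p>
<script>
  document.getElementById('yesnojs').innerHTML = 'Look at you shinin!';
</script>
"""

soup = bs.BeautifulSoup(raw_html, 'html.parser')
# BeautifulSoup only parses markup; the <script> is never executed,
# so we still see the pre-JavaScript placeholder.
print(soup.find('p', class_='jstest').text)  # prints: y u bad tho?
```

That is exactly the gap the PyQt4 (or dryscrape) approach fills: it runs the page in a real rendering engine first, then hands the post-JavaScript HTML to Beautiful Soup.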
Just in case you wanted to make use of dryscrape:
import dryscrape
import bs4 as bs

sess = dryscrape.Session()
# visit() loads the page and executes its JavaScript headlessly
sess.visit('https://pythonprogramming.net/parsememcparseface/')
source = sess.body()

soup = bs.BeautifulSoup(source, 'lxml')
js_test = soup.find('p', class_='jstest')
print(js_test.text)
3 Comments
No module named PyQt4 error.

I would recommend that you look into the --headless option in webdriver, but that will probably not work for you: it still requires the browser to be installed so webdriver can make use of the browser's rendering engine ("headless" only means it does not start the UI). Since your hosting service will probably not have the browser executables installed, this will not work.
Without a rendering engine you will not get the rendered (and JS-enhanced) web page; that simply cannot be done in pure Python.
One option would be a service like Sauce Labs (I am not affiliated, just a happy user), which runs browsers on its own infrastructure and lets you control them via its API. You can run Selenium scripts that fetch the HTML/JS content via RemoteWebDriver and process the results on your own server.
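A minimal sketch of that remote setup, assuming a Selenium 4 install; the hub URL is a placeholder, not a real endpoint — a cloud provider would give you its own URL and credentials:

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
# Placeholder hub URL: point this at your provider's grid endpoint.
driver = webdriver.Remote(
    command_executor='http://hub.example.com:4444/wd/hub',
    options=options,
)
try:
    driver.get('https://pythonprogramming.net/parsememcparseface/')
    # page_source is the DOM *after* JavaScript ran in the remote browser,
    # so it can be fed straight into Beautiful Soup on your own server.
    html = driver.page_source
finally:
    driver.quit()
```

The browser runs entirely on the remote infrastructure; your hosting service only needs the selenium package and network access to the hub.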