I am trying to get the content of a section of a web page. The data in that section is loaded dynamically by JavaScript. I found some code on here and edited it, but when I run the script the result is None.

Here's the code

import bs4 as bs
import sys
import urllib.request
from PyQt5.QtWebEngineWidgets import QWebEnginePage
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl
from pprint import pprint

class Page(QWebEnginePage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)
        self.html = ''
        self.loadFinished.connect(self._on_load_finished)
        self.load(QUrl(url))
        self.app.exec_()
        

    def _on_load_finished(self):
        self.html = self.toHtml(self.Callable)
        print('Load finished')

    def Callable(self, html_str):
        self.html = html_str
        self.app.quit()

def main():
    page = Page('https://www.ibm.com/support/fixcentral/swg/selectFixes?parent=IBM%20Security&product=ibm/Information+Management/InfoSphere+Guardium&release=10.0&platform=Linux&function=all')
    soup = bs.BeautifulSoup(page.html, 'html.parser')
    section = soup.find('table', {'id' : 'DataTables_Table_0'})
    pprint (section)

if __name__ == '__main__': main()

Here's the output

Load finished
None
  • I don't see what would make your code wait for the URL to load. It seems that the Page creation in main returns immediately, in which case no HTML has been loaded yet when you try to parse it. Your pasted output suggests otherwise, though. Commented Sep 17, 2020 at 20:18
  • Try printing the self.html. When you do this, you will see that the DataTables_Table_0 element is missing in the output. @antont There is no problem in loading the HTML, as far as I can see. Commented Sep 17, 2020 at 20:33
  • I think it loads the HTML after it has tried to parse it. Docs seem to indicate that load() returns immediately there. Seems that the soup parsing should be called from _on_load_finished. The HTML on the web does seem to have that element. doc.qt.io/qt-5/qwebenginepage.html#load Commented Sep 17, 2020 at 20:37
  • @antont If you try with 'body', {'id': 'ibm-com'}, you will see that you will get successful results. (I chose this myself after printing the self.html.) Even if you get the html with urllib, the result does not change. So I don't think the problem is in the code. Commented Sep 17, 2020 at 20:45

1 Answer

The loadFinished signal only indicates that the page has been loaded; after that, scripts on the page can still create more DOM elements. That is the case for the element with id "DataTables_Table_0", which is created moments after the page is loaded.
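
One way to confirm this is to check whether the id occurs at all in the HTML captured at that moment. The following is a minimal sketch using only the standard library; the helper name `has_element_with_id` and the two sample strings are hypothetical and stand in for an early and a later snapshot of the page.

```python
from html.parser import HTMLParser


class _IdFinder(HTMLParser):
    """Records whether any start tag carries the wanted id attribute."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if dict(attrs).get("id") == self.wanted_id:
            self.found = True


def has_element_with_id(html, wanted_id):
    """Return True if some element in `html` has id == wanted_id."""
    finder = _IdFinder(wanted_id)
    finder.feed(html)
    return finder.found


# Hypothetical snapshots: before and after the page's scripts have run.
early = '<body id="ibm-com"><div id="content"></div></body>'
late = '<body id="ibm-com"><table id="DataTables_Table_0"></table></body>'

print(has_element_with_id(early, "DataTables_Table_0"))
print(has_element_with_id(late, "DataTables_Table_0"))
```

If the check fails on the HTML captured in the loadFinished handler, the parser is not at fault: the element simply did not exist yet when the snapshot was taken.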

A possible solution is to inject a script that checks whether the element exists and, once it does, notifies the Python side so that the HTML can be retrieved.

import sys
from functools import cached_property  # requires Python 3.8+

from PyQt5 import QtCore, QtWidgets, QtWebEngineWidgets, QtWebChannel

from pprint import pprint
import bs4 as bs


def get_webchannel_source():
    file = QtCore.QFile(":/qtwebchannel/qwebchannel.js")
    if not file.open(QtCore.QIODevice.ReadOnly):
        return ""
    content = file.readAll()
    file.close()
    return content.data().decode()


class Manager(QtCore.QObject):
    def __init__(self, *, offline=True, visible=False, parent=None):
        super().__init__(parent)
        self._html = ""
        self._is_finished = False
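        # Accessing the cached property creates the QApplication, which must
        # exist before the view is created.
        self.app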
        self._profile = (
            QtWebEngineWidgets.QWebEngineProfile()
            if offline
            else QtWebEngineWidgets.QWebEngineProfile.defaultProfile()
        )
        self.view.resize(640, 480)
        if not visible:
            self.view.setAttribute(QtCore.Qt.WA_DontShowOnScreen, True)
        self.view.show()
        self.webchannel.registerObject("manager", self)
        self.view.page().setWebChannel(self.webchannel)

    @cached_property
    def app(self):
        return QtWidgets.QApplication(sys.argv)

    @property
    def profile(self):
        return self._profile

    @cached_property
    def view(self):
        view = QtWebEngineWidgets.QWebEngineView()
        page = QtWebEngineWidgets.QWebEnginePage(self.profile, self)
        view.setPage(page)
        return view

    @cached_property
    def webchannel(self):
        return QtWebChannel.QWebChannel(self)

    @property
    def html(self):
        return self._html

    def set_script(self, script):
        qscript = QtWebEngineWidgets.QWebEngineScript()
        qscript.setName("qscript")
        qscript.setSourceCode(get_webchannel_source() + "\n" + script)
        qscript.setInjectionPoint(QtWebEngineWidgets.QWebEngineScript.DocumentReady)
        qscript.setWorldId(QtWebEngineWidgets.QWebEngineScript.MainWorld)
        self.profile.scripts().insert(qscript)

    def start(self, url):
        self.view.load(QtCore.QUrl.fromUserInput(url))
        self.app.exec_()

    @QtCore.pyqtSlot()
    def save_html(self):
        if not self._is_finished:
            self.view.page().toHtml(self.html_callable)
            self._is_finished = True

    def html_callable(self, html):
        self._html = html
        self.app.quit()


JS = """
var manager = null;

function find_element() {
  var e = document.getElementById('DataTables_Table_0');
  console.log("try verify", e, manager);
  if (e != null && manager != null) {
    console.log(e)
    manager.save_html()
  } else {
    setTimeout(find_element, 100);
  }
}

(function wait_qt() {
  if (typeof qt != 'undefined') {
    console.log("Qt loaded");
    new QWebChannel(qt.webChannelTransport, function (channel) {
      manager = channel.objects.manager;
      find_element();
    });
  } else {
    setTimeout(wait_qt, 100);
  }
})();
"""


def main():
    manager = Manager()
    manager.set_script(JS)
    manager.start(
        "https://www.ibm.com/support/fixcentral/swg/selectFixes?parent=IBM%20Security&product=ibm/Information+Management/InfoSphere+Guardium&release=10.0&platform=Linux&function=all"
    )
    soup = bs.BeautifulSoup(manager.html, "html.parser")
    section = soup.find("table", {"id": "DataTables_Table_0"})
    pprint(section)


if __name__ == "__main__":
    main()
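
Note that the injected script polls every 100 ms and never gives up, so the program keeps running if the element never appears. Below is a minimal sketch of the same polling idea with a deadline, written in plain Python purely for illustration; `wait_for`, its parameters, and the simulated `element_exists` check are hypothetical names, not part of Qt or of the code above.

```python
import time


def wait_for(predicate, interval=0.1, timeout=10.0):
    """Call predicate() every `interval` seconds until it returns True.

    Returns True on success, or False once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Simulated check: the "element" only exists from the third look onwards.
checks = {"count": 0}


def element_exists():
    checks["count"] += 1
    return checks["count"] >= 3


print(wait_for(element_exists, interval=0.01, timeout=1.0))
print(wait_for(lambda: False, interval=0.01, timeout=0.05))
```

The same guard could be added to the JavaScript by counting attempts in find_element and calling save_html anyway after a limit, so that the application always quits.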