Dynamic Text Scraping

Question

I followed many tutorials about scraping JavaScript-rendered pages, but I cannot manage to extract the numbers from this table:

http://www.wsj.com/mdc/public/npage/2_3023_creditdervs.html

Most recently I tried a Sentdex tutorial, using this code:

import bs4 as bs
import sys
import urllib.request
from PyQt5.QtWebEngineWidgets import QWebEnginePage
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl

class Page(QWebEnginePage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)
        self.html = ''
        self.loadFinished.connect(self._on_load_finished)
        self.load(QUrl(url))
        self.app.exec_()

    def _on_load_finished(self):
        # toHtml is asynchronous and returns nothing; the HTML arrives in Callable
        self.toHtml(self.Callable)
        print('Load finished')

    def Callable(self, html_str):
        self.html = html_str
        self.app.quit()


def main():
    page = Page('http://www.wsj.com/mdc/public/npage/2_3023_creditdervs.html')
    soup = bs.BeautifulSoup(page.html, 'html.parser')
    tableSup = soup.find_all("td",{"class": "col2 yellowBack"})
    print(tableSup)

if __name__ == '__main__': main()

It looks like I am off target. Everyone always speaks of a script associated with text that appears in the web page source but then disappears from the Beautiful Soup tag text, but I can't find the script associated with the values in the main table of the page above.

Any suggestions on where I should direct my research?

Vinícius Figueiredo · Accepted Answer · 2017-08-02 23:44:06Z

2

Notice that the table you want to scrape is inside an iframe, so you should request the iframe itself and then scrape the table from its HTML. I found the iframe URL by simply inspecting the element in the browser. Example code using requests is shown below:

from bs4 import BeautifulSoup
import requests

# URL of the iframe that holds the table, found by inspecting the element
iframe = "https://web.apps.markit.com/WMXAXLP?YYY2220_zJkhPN/sWPxwhzYw8K4DcqW07HfIQykbYMaXf8fTzWQEqN6Sq2pe6I0o/TehV5qd"
html = requests.get(iframe).text
soup = BeautifulSoup(html, 'html.parser')

# Collect every cell of the target column and keep only its text
column = soup.find_all("td", {"class": "col2 yellowBack"})
values = [row.string for row in column]

It looks like you are interested in the values from that column, so values is the desired output:

>>> values
['56.37', '107.75', 'n.a.', '95.99', 'n.a.', '56.00', '52.32', '234.85', '81.21', '40.72', '76.29', '19.90', 'n.a.', '92.41', '12.83', '62.19', '78.28', '60.51', '4995.58', '92.99', '67.56', '175.24', '58.71', '82.14', '57.75', '46.86', '22.95', '70.06', '150.16', '6793.46', '31.07', '34.31', '50.39']
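Note that the scraped values are strings, and some cells contain 'n.a.' instead of a number. If you need real numbers, a minimal sketch of a conversion could look like the following. The helper name `to_float` is my own, and mapping 'n.a.' to `None` is just one possible choice:

```python
def to_float(text):
    """Convert a scraped cell such as '56.37' to a float.

    Cells that are not numeric (for example 'n.a.') or empty (None)
    are mapped to None instead of raising an error.
    """
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


# First few entries of the output shown above
values = ['56.37', '107.75', 'n.a.', '95.99']
numbers = [to_float(v) for v in values]
```

Afterwards you can drop the missing entries with `[n for n in numbers if n is not None]` if you only want the numeric ones.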


2 Comments

user3755529 Over a year ago

Fantastic, thanks a lot! I noticed that the link (src) of the <iframe> changes all the time, although your old one still works the same. Would you say it is good practice to first scrape the "src" from the page and then use that to fetch the iframe? Links seen within minutes of each other: web.apps.markit.com/WMXAXLP?YYY2220_zJkhPN/… web.apps.markit.com/WMXAXLP?YYY2220_zJkhPN/… web.apps.markit.com/WMXAXLP?YYY2220_zJkhPN/…

Vinícius Figueiredo Over a year ago

@user3755529 Glad to help! You could find all the iframes and request them one at a time, checking whether each is the one containing ("td",{"class": "col2 yellowBack"}); if it is not, continue to the next one.
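The idea from this comment could be sketched roughly as follows. This is only a sketch under some assumptions: the function names `fetch_with_requests` and `find_column_values` are made up for illustration, and it assumes the <iframe> tag is present in the parent page's static HTML (if the page injects it with JavaScript, a plain request will not see it).

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def fetch_with_requests(url):
    """Default fetcher: download a URL and return its HTML text."""
    import requests  # imported here so the parsing logic also works offline
    return requests.get(url, timeout=10).text


def find_column_values(page_url, fetch=fetch_with_requests):
    """Return the 'col2 yellowBack' cell texts from the first iframe that has them."""
    page_soup = BeautifulSoup(fetch(page_url), "html.parser")
    for frame in page_soup.find_all("iframe", src=True):
        # src may be relative, so resolve it against the parent page URL
        frame_url = urljoin(page_url, frame["src"])
        frame_soup = BeautifulSoup(fetch(frame_url), "html.parser")
        cells = frame_soup.find_all("td", {"class": "col2 yellowBack"})
        if cells:
            return [cell.get_text(strip=True) for cell in cells]
    return []  # no iframe contained the target cells
```

Calling `find_column_values('http://www.wsj.com/mdc/public/npage/2_3023_creditdervs.html')` would then read the current src on every run instead of relying on a hard-coded iframe URL. The `fetch` parameter is there only so the function can be tried against saved HTML without hitting the network.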
