I have followed many tutorials on scraping JavaScript-rendered pages, but I cannot manage to extract the numbers from this table:

http://www.wsj.com/mdc/public/npage/2_3023_creditdervs.html

My latest attempt follows a Sentdex tutorial and uses this code:

import bs4 as bs
import sys
import urllib.request
from PyQt5.QtWebEngineWidgets import QWebEnginePage
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl

class Page(QWebEnginePage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)
        self.html = ''
        self.loadFinished.connect(self._on_load_finished)
        self.load(QUrl(url))
        self.app.exec_()

    def _on_load_finished(self):
        self.toHtml(self.Callable)  # asynchronous: returns nothing, the HTML is passed to Callable
        print('Load finished')

    def Callable(self, html_str):
        self.html = html_str
        self.app.quit()


def main():
    page = Page('http://www.wsj.com/mdc/public/npage/2_3023_creditdervs.html')
    soup = bs.BeautifulSoup(page.html, 'html.parser')
    tableSup = soup.find_all("td",{"class": "col2 yellowBack"})
    print(tableSup)

if __name__ == '__main__': main()

It looks like I am off target. Everyone talks about a script associated with text that appears in the web page source but then disappears from the Beautiful Soup tag text, yet I cannot find any script associated with the values in the main table of the page above.

Any suggestions on where I should direct my research?

1 Answer

Notice that the table you want to scrape is inside an iframe. You should request the iframe's URL directly and then scrape the table from that response. I found the iframe URL by inspecting the element in the browser. An example using requests is shown below:

from bs4 import BeautifulSoup
import requests

# URL of the iframe that holds the table (the src attribute of the <iframe> element)
iframe = "https://web.apps.markit.com/WMXAXLP?YYY2220_zJkhPN/sWPxwhzYw8K4DcqW07HfIQykbYMaXf8fTzWQEqN6Sq2pe6I0o/TehV5qd"
html = requests.get(iframe).text
soup = BeautifulSoup(html, 'html.parser')

# Every cell of the target column carries the class "col2 yellowBack"
column = soup.find_all("td", {"class": "col2 yellowBack"})
values = [cell.string for cell in column]

It looks like you are interested in the values from that column, so values is the desired output:

>>> values
['56.37', '107.75', 'n.a.', '95.99', 'n.a.', '56.00', '52.32', '234.85', '81.21', '40.72', '76.29', '19.90', 'n.a.', '92.41', '12.83', '62.19', '78.28', '60.51', '4995.58', '92.99', '67.56', '175.24', '58.71', '82.14', '57.75', '46.86', '22.95', '70.06', '150.16', '6793.46', '31.07', '34.31', '50.39']
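
Since the question asks for numbers and the scraped cells are strings (some of them 'n.a.'), a conversion step may be useful. The following is only a sketch: the helper name to_number is hypothetical, the sample list is a short excerpt of the output above, and stripping commas assumes that any thousands separators would be commas, which the output above does not confirm.

```python
def to_number(text):
    """Convert a scraped cell to float, or None when it is not numeric (e.g. 'n.a.')."""
    try:
        # Assumption: thousands separators, if any, are commas
        return float(text.replace(",", ""))
    except (ValueError, AttributeError):
        # ValueError: non-numeric text such as 'n.a.'
        # AttributeError: text is None (cell.string can be None for nested tags)
        return None


# Short excerpt of the scraped values shown above
values = ['56.37', '107.75', 'n.a.', '4995.58']
numbers = [to_number(v) for v in values]
print(numbers)  # [56.37, 107.75, None, 4995.58]
```

Mapping 'n.a.' to None keeps the list aligned with the table rows, so positions still correspond to the original entries.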

2 Comments

Fantastic, thanks a lot! I noticed that the link (src) of the <iframe> changes all the time, although your old one still works just the same. Would you say it is better to first scrape the "src" from the page and then use it to fetch the iframe? Within minutes: web.apps.markit.com/WMXAXLP?YYY2220_zJkhPN/… web.apps.markit.com/WMXAXLP?YYY2220_zJkhPN/… web.apps.markit.com/WMXAXLP?YYY2220_zJkhPN/…
@user3755529 I'm glad to help! You could find all the iframes and request them one at a time, checking whether each is the one containing ("td",{"class": "col2 yellowBack"}); if it is not, continue to the next one.
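
The approach suggested in the last comment could be sketched roughly as follows. This is a minimal illustration, not tested against the live page: the names download and find_column_values are hypothetical, the 10-second timeout is an arbitrary choice, and it assumes the iframe src is present in the static HTML of the outer page (if the iframe is injected by JavaScript, a rendered copy of the page would be needed instead).

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Cell class taken from the answer above
CELL_ATTRS = {"class": "col2 yellowBack"}


def download(url):
    """Fetch a URL and return its HTML text (timeout value is an assumption)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def find_column_values(page_html, base_url, fetch=download):
    """Return the target cell texts from the first iframe that contains them."""
    soup = BeautifulSoup(page_html, "html.parser")
    for frame in soup.find_all("iframe", src=True):
        # The src may be relative, so resolve it against the outer page URL
        frame_url = urljoin(base_url, frame["src"])
        frame_soup = BeautifulSoup(fetch(frame_url), "html.parser")
        cells = frame_soup.find_all("td", CELL_ATTRS)
        if cells:
            return [cell.get_text(strip=True) for cell in cells]
    return []  # no iframe contained the expected cells


# Example call (performs real network requests, so it is left commented out):
# page_url = "http://www.wsj.com/mdc/public/npage/2_3023_creditdervs.html"
# print(find_column_values(download(page_url), page_url))
```

Reading the src on every run avoids hard-coding an iframe URL that keeps changing, and passing the fetch function as a parameter lets the lookup logic be tried on saved HTML without touching the network.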
