
I'm attempting to extract the raw HTML from a page that requires JavaScript (with the ultimate goal of extracting just the plain text). Unfortunately, simple GET requests return HTML specifying the need for a browser running JS. Example:

> html = open('https://www.medicare.gov/Publications/').read
"<!DOCTYPE HTML>\n<!--[if lt IE 8]><html class=\"no-js oldIE\" lang=\"en-US\"><![endif]-->\n<!--[if IE 8]><html class=\"no-js lt-ie9\" lang=\"en-US\"><![endif]-->\n<!--[if IE 9]><html class=\"no-js ie9\" lang=\"en-US\"><![endif]-->\n<!--[if IE 11]>\n<style>\nbody{\ndisplay:none;\n}\n</style>\n<![endif]-->\n<!--[if (gt IE )|!(IE)]><!-->\n<html lang=\"en-US\" class=\"no-js not-ie\">\n    <!--<![endif]-->\n    <head>\n        <meta charset=\"utf-8\">\n        <!--<meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge\">-->\n        <title>Publications</title>...
<body>\n        <div class=\"wrapper sl-translate\">\n            <div class=\"needCSS hidden\">\n                This application is not fully accessible to users whose browsers do not support or have Cascading Style Sheets (CSS) disabled. For a more optimal experience viewing this application, please enable CSS in your browser and refresh the page.\n            </div>\n            <!--<p class=\"browsehappy\">\n                Your browser is out of date  or not supported. Please visit <a href=\"http://browsehappy.com/\">browse happy</a> to upgrade to a better supported, modern browser.\n            </p>-->\n            <div class=\"js-off-message\">\n                <noscript>\n                The page could not be loaded. This application currently does not support browsers with \"JavaScript\" disabled. Please enable JavaScript and refresh the page to view this application.\n                </noscript>"
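As an aside, this failure mode can be spotted programmatically before reaching for a browser: pages that need JavaScript typically return a `<noscript>` fallback mentioning JavaScript and little real content. A minimal heuristic sketch (`js_required?` is a hypothetical helper of my own, not part of any library):

```ruby
# Heuristic check: server responses that demand JavaScript usually ship a
# <noscript> fallback telling the user to enable JavaScript. This is an
# illustrative helper, not a library API.
def js_required?(html)
  html.include?('<noscript>') && !!(html =~ /enable JavaScript/i)
end

sample = '<noscript>Please enable JavaScript and refresh the page.</noscript>'
puts js_required?(sample) # prints "true"
```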

I've tried tricking the page into thinking I'm using a browser with JS, but (shocker), that didn't work either:

html = HTTP.headers(user_agent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36").follow.get(url).to_s

Is there any trick to extracting the static HTML from a JS-driven page like https://www.medicare.gov/hospicecompare/?

3 Comments

  • You'll have to use a headless browser. Commented Jan 2, 2018 at 19:51
  • The only way to get JavaScript to run is to have a JavaScript runtime involved. Tools like Selenium or PhantomJS can do this. Commented Jan 2, 2018 at 19:52
  • Thanks, I think these comments have pointed me in the right direction. I'm trying the flow described in readysteadycode.com/…. Commented Jan 2, 2018 at 19:54

1 Answer


Based on the comments suggesting I use a headless browser, and the approach described in https://readysteadycode.com/howto-scrape-websites-with-ruby-and-headless-chrome, I was able to extract the HTML and plain text using the following:

Install ChromeDriver

$ brew install chromedriver

Install selenium-webdriver gem

$ gem install selenium-webdriver

Get webpage

> require 'selenium-webdriver'
> options = Selenium::WebDriver::Chrome::Options.new(args: ['headless'])
> driver = Selenium::WebDriver.for(:chrome, options: options)
> driver.get 'https://www.medicare.gov/hospicecompare/'

Extract HTML

> driver.find_element(css: 'html').attribute('innerHTML')

Extract Plain Text

> driver.find_element(css: 'html').text

References

https://readysteadycode.com/howto-scrape-websites-with-ruby-and-headless-chrome
https://github.com/SeleniumHQ/selenium/wiki/Ruby-Bindings
