
I'm attempting to extract the raw HTML from a page that requires JavaScript (with the ultimate goal of extracting just the plain text). Unfortunately, simple GET requests return HTML specifying the need for a browser running JS. Example:

> html = open('https://www.medicare.gov/Publications/').read
"<!DOCTYPE HTML>\n<!--[if lt IE 8]><html class=\"no-js oldIE\" lang=\"en-US\"><![endif]-->\n<!--[if IE 8]><html class=\"no-js lt-ie9\" lang=\"en-US\"><![endif]-->\n<!--[if IE 9]><html class=\"no-js ie9\" lang=\"en-US\"><![endif]-->\n<!--[if IE 11]>\n<style>\nbody{\ndisplay:none;\n}\n</style>\n<![endif]-->\n<!--[if (gt IE )|!(IE)]><!-->\n<html lang=\"en-US\" class=\"no-js not-ie\">\n    <!--<![endif]-->\n    <head>\n        <meta charset=\"utf-8\">\n        <!--<meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge\">-->\n        <title>Publications</title>...
<body>\n        <div class=\"wrapper sl-translate\">\n            <div class=\"needCSS hidden\">\n                This application is not fully accessible to users whose browsers do not support or have Cascading Style Sheets (CSS) disabled. For a more optimal experience viewing this application, please enable CSS in your browser and refresh the page.\n            </div>\n            <!--<p class=\"browsehappy\">\n                Your browser is out of date  or not supported. Please visit <a href=\"http://browsehappy.com/\">browse happy</a> to upgrade to a better supported, modern browser.\n            </p>-->\n            <div class=\"js-off-message\">\n                <noscript>\n                The page could not be loaded. This application currently does not support browsers with \"JavaScript\" disabled. Please enable JavaScript and refresh the page to view this application.\n                </noscript>"
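As an aside, this failure mode can be spotted programmatically before reaching for a browser: pages that need JavaScript typically return a `<noscript>` fallback mentioning JavaScript and little real content. A minimal heuristic sketch (`js_required?` is a hypothetical helper of my own, not part of any library):

```ruby
# Heuristic check: server responses that demand JavaScript usually ship a
# <noscript> fallback telling the user to enable JavaScript. This is an
# illustrative helper, not a library API.
def js_required?(html)
  html.include?('<noscript>') && !!(html =~ /enable JavaScript/i)
end

sample = '<noscript>Please enable JavaScript and refresh the page.</noscript>'
puts js_required?(sample) # prints "true"
```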

I've tried tricking the page into thinking I'm using a browser with JS, but (shocker), that didn't work either:

html = HTTP.headers(user_agent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36").follow.get(url).to_s

Is there any trick to extracting the static HTML from a JS-driven page like https://www.medicare.gov/hospicecompare/?

3 Comments

  • You'll have to use a headless browser. Commented Jan 2, 2018 at 19:51
  • The only way to get JavaScript to run is to have a JavaScript runtime involved. Tools like Selenium or PhantomJS can do this. Commented Jan 2, 2018 at 19:52
  • Thanks, I think these comments have pointed me in the right direction. I'm trying the flow described in readysteadycode.com/…. Commented Jan 2, 2018 at 19:54

1 Answer


Based on the comments suggesting I use a headless browser, and the approach described in https://readysteadycode.com/howto-scrape-websites-with-ruby-and-headless-chrome, I was able to extract the HTML and plain text using the following:

Install ChromeDriver

$ brew install chromedriver

Install selenium-webdriver gem

$ gem install selenium-webdriver

Get webpage

> require 'selenium-webdriver'
> options = Selenium::WebDriver::Chrome::Options.new(args: ['headless'])
> driver = Selenium::WebDriver.for(:chrome, options: options)
> driver.get 'https://www.medicare.gov/hospicecompare/'

Extract HTML

> driver.find_element(css: 'html').attribute('innerHTML')

Extract Plain Text

> driver.find_element(css: 'html').text

References

https://readysteadycode.com/howto-scrape-websites-with-ruby-and-headless-chrome
https://github.com/SeleniumHQ/selenium/wiki/Ruby-Bindings
