Parse data from JavaScript of retrieved page

Question

I'm retrieving a web page with OpenURI:

require 'open-uri'
page = URI.open('http://www.example.com').read.scrub

Now I'd like to parse the values of the attributes playerurl, playerdata and pageurl of the retrieved page. They appear in a <script> tag:

<script>
..
..
  PlayerWatchdog.init({
      'playerurl': 'http://cdn.static.de/now/player.swf?ts=2011354353',
      'playerdata': 'http://www.example.com/player',
      'pageurl': 'http://www.example.com?test=2',
      });
..
..
</script>

What's the smartest way to accomplish this?

I'm not sure what you mean by "the three JS attributes". Is there an embedded script that you want to parse? Or are these inside HTML elements?
– Max, Commented Nov 3, 2014 at 16:51
It's a script tag inside the HTML page itself. By attributes I mean the values of 'playerurl', 'playerdata' and 'pageurl'.
– Hedge, Commented Nov 3, 2014 at 17:07

the Tin Man · Accepted Answer · 2014-11-03 18:29:30Z

You can use an HTML parser, such as Nokogiri, to take apart the HTML document, and quickly find the <script> tag you're after. The content inside a <script> tag is text, so Nokogiri's text method will return that. Then it's a matter of selectively retrieving the lines you want, which can be done by a simple regular expression:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
  <head>
    <script>
      PlayerWatchdog.init({
          'playerurl': 'http://cdn.static.de/now/player.swf?ts=2011354353',
          'playerdata': 'http://www.example.com/player',
          'pageurl': 'http://www.example.com?test=2',
          });
    </script>
  </head>
</html>
EOT

script_text = doc.at('script').text
playerurl, playerdata, pageurl = %w[
  playerurl
  playerdata
  pageurl
].map{ |i| script_text[/'#{ i }': '([^']+)'/, 1] }

playerurl # => "http://cdn.static.de/now/player.swf?ts=2011354353"
playerdata # => "http://www.example.com/player"
pageurl # => "http://www.example.com?test=2"

at returns the first matching <script> Node instance. Depending on the HTML, the first <script> might not be the one you want. In that case you can use search, which returns a NodeSet (similar to an array of Nodes), and grab a particular element from it. Alternatively, instead of a CSS selector you can use XPath, which lets you easily specify a particular occurrence of the tag.

Once the tag is found, text returns its contents, and the task moves from Nokogiri to using a pattern to find what is desired. /'#{ i }': '([^']+)'/ is a simple pattern that looks for the word passed in as i, wrapped in single quotes and followed by : ', and then captures everything up to the next '. That pattern is passed to String's [] method.
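
If the list of keys isn't known in advance, the same kind of pattern can be handed to String#scan to collect every pair at once. The following is only a sketch: extract_pairs is a hypothetical helper name, the script_text literal stands in for the text Nokogiri would return, and it assumes single-quoted keys and values with no escaped quotes inside them.

```ruby
# Stand-in for the string returned by doc.at('script').text.
script_text = <<~JS
  PlayerWatchdog.init({
      'playerurl': 'http://cdn.static.de/now/player.swf?ts=2011354353',
      'playerdata': 'http://www.example.com/player',
      'pageurl': 'http://www.example.com?test=2',
      });
JS

# Hypothetical helper: collect every 'key': 'value' pair into a Hash.
# Assumes single-quoted keys and values without escaped quotes.
def extract_pairs(text)
  # scan returns one [key, value] array per match; to_h turns them into a Hash.
  text.scan(/'(\w+)':\s*'([^']*)'/).to_h
end

settings = extract_pairs(script_text)
settings['playerurl'] # => "http://cdn.static.de/now/player.swf?ts=2011354353"
settings['pageurl']   # => "http://www.example.com?test=2"
```

Returning a Hash keeps the lookup by name explicit and avoids relying on the order of the keys in the script.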

Max · Accepted Answer · 2014-11-03 17:31:27Z

Ruby has no built-in JavaScript parsing capabilities. You can use a regular expression, though it will be rather sensitive to the formatting of the page (for example, it will break if the page starts using double quotes for strings):

playerurl = page[/'playerurl':\s*'([^']*)'/, 1]
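
One way to soften the double-quote problem is to accept either quote character and use a backreference so the closing quote has to match the opening one. This is a sketch rather than a real JavaScript parser: js_string_value is a hypothetical helper name, the page literal is made-up sample input, and it assumes quoted keys and values that contain no escaped quotes or newlines.

```ruby
# Hypothetical helper: look up one key, accepting single or double quotes.
# \1 and \2 are backreferences, so each closing quote must match its opening quote.
def js_string_value(source, key)
  source[/(['"])#{Regexp.escape(key)}\1\s*:\s*(['"])(.*?)\2/, 3]
end

# Made-up sample input that mixes both quote styles.
page = <<~JS
  PlayerWatchdog.init({
      "playerurl": "http://www.example.com/player.swf",
      'pageurl': 'http://www.example.com?test=2',
      });
JS

js_string_value(page, 'playerurl') # => "http://www.example.com/player.swf"
js_string_value(page, 'pageurl')   # => "http://www.example.com?test=2"
js_string_value(page, 'missing')   # => nil
```

It will still miss unquoted keys and values built by concatenation, so it only narrows the formatting sensitivity rather than removing it.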
