
I'm retrieving a web page with OpenURI:

require 'open-uri'
page = URI.open('http://www.example.com').read.scrub

Now I'd like to parse the values of the attributes playerurl, playerdata and pageurl of the retrieved page. They appear in a <script> tag:

<script>
..
..
  PlayerWatchdog.init({
      'playerurl': 'http://cdn.static.de/now/player.swf?ts=2011354353',
      'playerdata': 'http://www.example.com/player',
      'pageurl': 'http://www.example.com?test=2',
      });
..
..
</script>

What's the smartest way to accomplish this?

2
  • I'm not sure what you mean by "the three JS attributes". Is there an embedded script that you want to parse? Or are these inside HTML elements? Commented Nov 3, 2014 at 16:51
  • It's a script tag inside the HTML page itself. By attributes I mean the values of 'playerurl', 'playerdata' and 'pageurl'. Commented Nov 3, 2014 at 17:07

2 Answers

3

You can use an HTML parser, such as Nokogiri, to take apart the HTML document, and quickly find the <script> tag you're after. The content inside a <script> tag is text, so Nokogiri's text method will return that. Then it's a matter of selectively retrieving the lines you want, which can be done by a simple regular expression:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
  <head>
    <script>
      PlayerWatchdog.init({
          'playerurl': 'http://cdn.static.de/now/player.swf?ts=2011354353',
          'playerdata': 'http://www.example.com/player',
          'pageurl': 'http://www.example.com?test=2',
          });
    </script>
  </head>
</html>
EOT

script_text = doc.at('script').text
playerurl, playerdata, pageurl = %w[
  playerurl
  playerdata
  pageurl
].map { |i| script_text[/'#{ i }': '([^']+)'/, 1] }

playerurl # => "http://cdn.static.de/now/player.swf?ts=2011354353"
playerdata # => "http://www.example.com/player"
pageurl # => "http://www.example.com?test=2"

at returns the first matching <script> Node. Depending on the HTML, that might not be the one you want. You can use search instead, which returns a NodeSet (similar to an array of Nodes), and pick a particular element from it. Alternatively, use XPath rather than a CSS selector to specify exactly which occurrence of the tag you want.

Once the tag is found, text returns its contents, and the task moves from Nokogiri to a regular expression. /'#{ i }': '([^']+)'/ is a simple pattern that looks for the quoted key interpolated from i, followed by : ', and then captures everything up to, but not including, the next '. That pattern is passed to String's [] method, and the second argument, 1, makes it return the first capture group.

1

Ruby has no built-in JavaScript parsing capabilities. You can use a regexp, though this will be rather sensitive to the formatting of the page (for example, it will break if the page starts using double quotes for strings):

playerurl = page[/'playerurl':\s*'([^']*)'/, 1]
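
If the quoting style is the main worry, one possible variation is to accept either quote character and use backreferences so that the closing quote must match the opening one. This is only a sketch: js_string_value is a hypothetical helper name, the sample text is made up, and the pattern still assumes quoted keys and values without escaped quotes inside them.

```ruby
# Hypothetical helper (not part of any library): returns the string value
# assigned to `key` in a JS object literal, or nil if it is not found.
def js_string_value(source, key)
  # (['"]) captures the opening quote of the key and \1 requires the same
  # closing quote; the value is handled the same way with group 2 and \2.
  # Group 3 is the value itself, matched lazily up to the closing quote.
  pattern = /(['"])#{Regexp.escape(key)}\1\s*:\s*(['"])(.*?)\2/
  source[pattern, 3]
end

# Sample input (an assumption) mixing both quoting styles.
sample = <<~JS
  PlayerWatchdog.init({
      "playerurl": "http://cdn.static.de/now/player.swf?ts=2011354353",
      'playerdata': 'http://www.example.com/player',
      'pageurl': 'http://www.example.com?test=2',
      });
JS

playerurl  = js_string_value(sample, 'playerurl')
playerdata = js_string_value(sample, 'playerdata')
missing    = js_string_value(sample, 'nosuchkey')
```

Here playerurl and playerdata come back without their surrounding quotes, and missing is nil. It remains a text match rather than real JavaScript parsing, so unquoted keys or values built by concatenation would still slip through.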
