I use R to parse HTML, and I would like to know the most efficient way to parse the following code:

<script type="text/javascript">
var utag_data = {
  environnement : "prod",
  device : getDevice(),
  displaytype : getDisplay($(window).innerWidth()),
  pagename : "adview",
  pagetype : "annonce"}</script>

I started to do this:

infos = unlist(xpathApply(page,
                          '//script[@type="text/javascript"]',
                          xmlValue))
infos=gsub('\n|  ','',infos)
infos=gsub("var utag_data = ","",infos)
fromJSON(infos)

And the code above returns something really weird:

$nvironnemen
[1] "prod"

$evic
NULL

$isplaytyp
NULL

$agenam
[1] "adview" etc.

I would like to know how to do this efficiently: how can I parse the data list in the JavaScript directly? Thank you.

  • The code is doing its work. Nothing wrong with it. Or do you mean not getting NULL for the keys device and displaytype? Commented May 29, 2016 at 22:14
  • OK, in fact, I was surprised that in the output environnement is transformed to "$nvironnemen"; I thought it was a bug. How would you explain this? Commented May 30, 2016 at 9:51

1 Answer

I didn't try your code, but I think your gsub() regexes might be overaggressive (which is probably causing the name munging).

It's possible to run JavaScript code using the V8 package, but it won't be able to execute the DOM-based getDevice() and getDisplay() functions since they don't exist in the V8 engine:

library(V8)
library(rvest)

pg <- read_html('<script type="text/javascript">
var utag_data = {
  environnement : "prod",
  device : getDevice(),
  displaytype : getDisplay($(window).innerWidth()),
  pagename : "adview",
  pagetype : "annonce"}</script>')


script <- html_text(html_nodes(pg, xpath='//script[@type="text/javascript"]'))

ctx <- v8()

ctx$eval(script)
## Error: ReferenceError: getDevice is not defined

However, you can compensate for that:

# we need to remove the function calls and replace them with empty strings
# since both begin with 'getD' this is pretty easy:
script <- gsub("getD[[:alpha:]\\(\\)\\$\\.]+,", "'',", script)

ctx$eval(script)
ctx$get("utag_data")

## $environnement
## [1] "prod"
## 
## $device
## [1] ""
## 
## $displaytype
## [1] ""
## 
## $pagename
## [1] "adview"
## 
## $pagetype
## [1] "annonce"
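
To see exactly what that gsub() pattern removes, it can help to extract the matches instead of replacing them. The following is a small base-R sketch (the variable names `pattern` and `matches` are only illustrative): the pattern reads as the literal text `getD`, followed by one or more characters that are letters, parentheses, `$` or `.`, followed by a comma. Because a space is not in that character set, a match cannot run past the end of a function call.

```r
# Same JavaScript snippet as above, without the surrounding <script> tag
script <- 'var utag_data = {
  environnement : "prod",
  device : getDevice(),
  displaytype : getDisplay($(window).innerWidth()),
  pagename : "adview",
  pagetype : "annonce"}'

# "getD", then 1+ characters from the set {letters ( ) $ .}, then a comma
pattern <- "getD[[:alpha:]\\(\\)\\$\\.]+,"

# Extract every substring the pattern matches (rather than replacing it)
matches <- regmatches(script, gregexpr(pattern, script))[[1]]
print(matches)

# After the replacement, no call to a getD* function should remain
cleaned <- gsub(pattern, "'',", script)
print(grepl("getD", cleaned, fixed = TRUE))
```

The two extracted strings should be the `getDevice(),` and `getDisplay($(window).innerWidth()),` calls, which is why each is swapped for `'',` in the cleaned script.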

2 Comments

Thank you @hrbrmstr for your help. I wasn't trying to evaluate the functions, so this is perfect! And in fact, do you know why, with JSON, "environnement" becomes "$nvironnemen"?
Could you also please explain what this regex "getD[[:alpha:]\\(\\)\\$\\.]+," means? It seems to delete all the characters before "()" and inside "()". How do you "read" it? Thank you very much.