How to parse a JavaScript data list with R

Question

I use R to parse HTML, and I would like to know the most efficient way to parse the following code:

<script type="text/javascript">
var utag_data = {
  environnement : "prod",
  device : getDevice(),
  displaytype : getDisplay($(window).innerWidth()),
  pagename : "adview",
  pagetype : "annonce"}</script>

I started to do this:

infos = unlist(xpathApply(page,
                          '//script[@type="text/javascript"]',
                          xmlValue))
infos=gsub('\n|  ','',infos)
infos=gsub("var utag_data = ","",infos)
fromJSON(infos)

The code above returns something really weird:

$nvironnemen
[1] "prod"

$evic
NULL

$isplaytyp
NULL

$agenam
[1] "adview" etc.

I would like to know how to do this efficiently: how can I parse the data list in the JavaScript directly? Thank you.

The code is doing its work. Nothing wrong with it. Or do you mean not getting NULL for the keys device and displaytype?
– agustin, Commented May 29, 2016 at 22:14
OK. In fact, I was surprised that in the output environnement is transformed to "$nvironnemen"; I thought it was a bug. How would you explain this?
– John Smith, Commented May 30, 2016 at 9:51

hrbrmstr · Accepted Answer · 2016-05-30 12:33:11Z

I didn't try your code, but I think your gsub() regexes might be overaggressive (which is probably causing the name munging).
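
To check whether the gsub() calls are responsible, here is a minimal base-R sketch that replays them on a shortened copy of the snippet from the question. The variable names js, step1, step2, and keys_intact are illustrative only, and the fromJSON() step is left out because the question does not say which package it comes from.

```r
# Shortened copy of the <script> body from the question.
js <- 'var utag_data = {\n  environnement : "prod",\n  device : getDevice(),\n  pagename : "adview"}'

# The two gsub() calls from the question, unchanged.
step1 <- gsub('\n|  ', '', js)
step2 <- gsub("var utag_data = ", "", step1)
print(step2)

# Check that each key is still spelled in full after the substitutions.
keys_intact <- vapply(c("environnement", "device", "pagename"),
                      function(k) grepl(k, step2, fixed = TRUE),
                      logical(1))
print(keys_intact)
```

In this sketch the keys survive the substitutions intact, so the truncated names more plausibly come from handing fromJSON() a JavaScript object literal. JSON requires double-quoted keys, so this input is not valid JSON, and how a given parser treats it is not something to rely on.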

It's possible to run JavaScript code using the V8 package, but it won't be able to execute the DOM-based getDevice() and getDisplay() functions since they don't exist in the V8 engine:

library(V8)
library(rvest)

pg <- read_html('<script type="text/javascript">
var utag_data = {
  environnement : "prod",
  device : getDevice(),
  displaytype : getDisplay($(window).innerWidth()),
  pagename : "adview",
  pagetype : "annonce"}</script>')


script <- html_text(html_nodes(pg, xpath='//script[@type="text/javascript"]'))

ctx <- v8()

ctx$eval(script)
## Error: ReferenceError: getDevice is not defined

However, you can compensate for that:

# we need to remove the function calls and replace them with blanks
# since both begin with 'getD' this is pretty easy:
script <- gsub("getD[[:alpha:]\\(\\)\\$\\.]+,", "'',", script)  

ctx$eval(script)
ctx$get("utag_data")

## $environnement
## [1] "prod"
## 
## $device
## [1] ""
## 
## $displaytype
## [1] ""
## 
## $pagename
## [1] "adview"
## 
## $pagetype
## [1] "annonce"
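
As for how to read the regex used above: "getD" is matched literally, the bracket expression then allows one or more ("+") letters, parentheses, dollar signs, or dots, and the final "," anchors the match at the comma that ends the property. A small base-R sketch, using fragments copied from the script (the names pattern, fragments, matched, and cleaned are illustrative), shows what gets consumed:

```r
# The pattern from the answer, written exactly as it appears there.
pattern <- "getD[[:alpha:]\\(\\)\\$\\.]+,"

# Three fragments copied from the <script> body; only the first two call getD...().
fragments <- c(
  "device : getDevice(),",
  "displaytype : getDisplay($(window).innerWidth()),",
  "pagename : \"adview\","
)

# regexpr() locates the first match in each fragment; regmatches() extracts it
# and silently drops fragments that do not match at all.
matched <- regmatches(fragments, regexpr(pattern, fragments))
print(matched)

# gsub() swaps each match for an empty JavaScript string plus the comma.
cleaned <- gsub(pattern, "'',", fragments)
print(cleaned)
```

The whole function call, including its nested parentheses, is swallowed up to the comma, while the pagename fragment is left untouched because it contains no "getD". Note that the pattern depends on a trailing comma, so it would not match a function call in the last property of the object.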

edited May 30, 2016 at 12:33

answered May 30, 2016 at 0:20

2 Comments

John Smith Over a year ago

Thank you @hrbrmstr for your help. I didn't try to evaluate the functions, so it is perfect! Also, do you know why, with JSON, "environnement" becomes "$nvironnemen"?

John Smith Over a year ago

Could you also please explain what this regex "getD[[:alpha:]\\$\\.]+," means? It seems to delete all the characters before "()" and inside "()". How do you "read" it? Thank you very much.
