Tidy nested json tree

Question

This comes up a lot when dealing with APIs.

Most of the time, to do real analysis, I'd like to get my dataset tidy, but this typically requires a separate solution for each type of tree rather than something more general.

I figured it would be nice to have one function that generates tidy data (albeit with a ton of NAs in deeply nested trees with many different factor levels).

I have a hackish solution, which follows, using unlist(..., recursive = FALSE) plus a naming convention, but I'd like to see if someone here has a better way to tidy these kinds of list structures.

#####################
# Some Test Data
aNestedTree = 
  list(a = 1, 
       b = 2, 
       c = list(
         a = list(1:5), 
         b = 2, 
         c = list(
           a = 1, 
           d = 3,
           e = list())),
       d = list(
         y = 3,
         z = 2
       ))

############################################################
# Run through the list and rename all list elements.
# We unlist one level at a time, appending "__" to the names
# at each step, until the object is no longer a list.

renameVars <- function(lst, sep = '__') {
  if(is.list(lst)) {
    names(lst) <- paste0(names(lst),sep)
    renameVars(unlist(lst, recursive = FALSE),sep = sep)
  } else {
    lst
  }
}

res <- renameVars(aNestedTree)

Checking the output, we see a strangely named object, but there's a method to this madness.

> res
    a________     b________  c__.a____1__  c__.a____2__  c__.a____3__ 
            1             2             1             2             3 
 c__.a____4__  c__.a____5__   c__.b______ c__.c__.a____ c__.c__.d____ 
            4             5             2             1             3 
  d__.y______   d__.z______ 
            3             2

Now I put this in a data.table, so I can shape it.

library(data.table)
dt <- data.table(values = res, name = names(res))

# Use some regex to split that name up, along with data.table's tstrsplit
# function to separate them into as many columns as there are nests

> dt[,paste0('V',seq_along(s <- tstrsplit(dt$name,'[__]+(\\.|)'))) := s]
> dt
    values          name V1 V2 V3
 1:      1     a________  a NA NA
 2:      2     b________  b NA NA
 3:      1  c__.a____1__  c  a  1
 4:      2  c__.a____2__  c  a  2
 5:      3  c__.a____3__  c  a  3
 6:      4  c__.a____4__  c  a  4
 7:      5  c__.a____5__  c  a  5
 8:      2   c__.b______  c  b NA
 9:      1 c__.c__.a____  c  c  a
10:      3 c__.c__.d____  c  c  d
11:      3   d__.y______  d  y NA
12:      2   d__.z______  d  z NA

I can then filter for the factor combinations that I want (or dcast/spread), though I'm effectively breaking apart tables at the lowest level, if they exist.
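As a rough illustration of that filter and dcast step, here is a small sketch. It rebuilds the long table from the output above by hand so that it runs on its own; the column name index and the objects c_a and d_wide are made up for this example and are not part of the original code.

```r
library(data.table)

# Rebuild the long table shown above by hand (values plus the V1..V3 keys)
dt <- data.table(
  values = c(1, 2, 1:5, 2, 1, 3, 3, 2),
  V1 = c("a", "b", rep("c", 8), "d", "d"),
  V2 = c(NA, NA, rep("a", 5), "b", "c", "c", "y", "z"),
  V3 = c(NA, NA, as.character(1:5), NA, "a", "d", NA, NA)
)

# Filter: keep only the values stored under c -> a
c_a <- dt[V1 == "c" & V2 == "a", .(index = V3, values)]

# Reshape: spread the leaves under d into one wide row (columns y and z)
d_wide <- dcast(dt[V1 == "d"], V1 ~ V2, value.var = "values")
```

Each filtered subset is a small table in its own right, which is what breaking apart tables at the lowest level amounts to.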

I thought about going through bind.c and pulling out do_unlist to make a function with a flexible naming convention via Rcpp, but my C++ is rusty, so I figured I'd post here before I do anything drastic.
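If a flexible naming convention is the goal, one alternative to patching do_unlist might be to skip the name mangling entirely and carry the path along during the recursion. The following is a minimal base R sketch of that idea rather than a tested general solution. The names collect_leaves and tidy_tree are hypothetical, and it assumes a list input whose leaves are atomic vectors of compatible types.

```r
# Hypothetical helper: walk a nested list and return one record per leaf
# value, each holding the path of keys that leads to it.
collect_leaves <- function(x, path = character(0)) {
  if (is.list(x)) {
    keys <- names(x)
    if (is.null(keys)) keys <- rep("", length(x))
    blank <- is.na(keys) | keys == ""
    out <- list()
    for (i in seq_along(x)) {
      key <- keys[i]
      # Unnamed children: use the position, or nothing for a lone child
      if (blank[i]) key <- if (length(x) > 1) as.character(i) else character(0)
      out <- c(out, collect_leaves(x[[i]], c(path, key)))
    }
    return(out)  # empty lists contribute no records
  }
  # Atomic vector: one record per element, indexed when length > 1
  lapply(seq_along(x), function(i) {
    list(path = if (length(x) > 1) c(path, as.character(i)) else path,
         value = x[[i]])
  })
}

# Hypothetical wrapper: pad every path with NA to the maximum depth and
# bind the records into a long data.frame with columns value, V1, V2, ...
tidy_tree <- function(x) {
  leaves <- collect_leaves(x)
  paths  <- lapply(leaves, `[[`, "path")
  depth  <- max(lengths(paths))
  keys <- do.call(rbind, lapply(paths, function(p) {
    c(p, rep(NA_character_, depth - length(p)))
  }))
  colnames(keys) <- paste0("V", seq_len(depth))
  data.frame(value = unlist(lapply(leaves, `[[`, "value")),
             keys, stringsAsFactors = FALSE)
}

aNestedTree <- list(a = 1, b = 2,
                    c = list(a = list(1:5), b = 2,
                             c = list(a = 1, d = 3, e = list())),
                    d = list(y = 3, z = 2))

res <- tidy_tree(aNestedTree)
```

On the test tree this should give the same twelve rows and three key columns as the data.table above, without any regex step. The trade-off is that the recursion runs in R rather than in C, so it may be slow on very large trees.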

Did you look at data.tree? See the data.tree intro, the data.tree application, and this question.
– dracodoc, Commented Sep 18, 2016 at 16:11

cole · Accepted Answer · 2017-06-10 02:19:01Z

I tend to lean towards tidyjson as well. In the tidyverse, the behavior you are looking for seems to be in the gather family.

I think the gather family of functions in tidyjson could do with a bit of improvement that would make these helpers unnecessary. Right now, they are very "type-sensitive" and error or throw out types that do not match. In any case, the workaround is not too challenging, although it definitely lacks elegance. Note that the bind_rows variant is presently from my development version and is not mainstream yet. Hopefully this illustrates the idea, though.

Notes on approach:

I assume that all values are numeric (I cast them to character afterwards).
The helpers gather elements of the varying types, and bind_rows stacks the datasets together.
The level is tracked by the depth of recursion.

First define the helpers:

recurse_gather <- function(.x,.level) {
  .x <- tidyjson::bind_rows(
    gobj(.x,.level)
    , garr(.x,.level)
    , gpersist(.x,.level)
  )

  if (any(as.character(json_types(.x,'type')$type) %in% c('object','array'))) {
    .x <- recurse_gather(.x,.level+1)
  }

  return(.x)
}

gobj <- function(.x,.level) {
  .x %>% json_types('type') %>%
    filter(type=='object') %>%
    gather_object(paste0('v',.level)) %>%
    select(-type)
}

gpersist <- function(.x,.level) {
  .x %>% json_types('type') %>%
    filter(! type %in% c('object','array')) %>%
    mutate_(.dots=setNames(
      paste0('as.character(NA)')
      ,paste0('v',.level)
    )) %>%
    select(-type)
}

garr <- function(.x,.level) {
  .x %>% json_types('type') %>%
    filter(type=='array') %>%
    gather_array('arridx') %>%
    append_values_number(paste0('v',.level)) %>%
    mutate_(.dots=setNames(
      paste0('as.character(v',.level,')')
      ,paste0('v',.level)
    )) %>%
    select(-arridx,-type)
}

Then using the helpers is pretty straightforward.

library(dplyr)
library(tidyjson)

j <- "{\"a\":[1],\"b\":[2],\"c\":{\"a\":[1,2,3,4,5],\"b\":[2],\"c\":{\"a\":[1],\"d\":[3],\"e\":[]}},\"d\":{\"y\":[3],\"z\":[2]}}"
recurse_gather(j, 1) %>% arrange(v1, v2, v3, v4) %>% tbl_df()
#> # A tibble: 12 x 5
#>    document.id    v1    v2    v3    v4
#>  *       <int> <chr> <chr> <chr> <chr>
#>  1           1     a     1  <NA>  <NA>
#>  2           1     b     2  <NA>  <NA>
#>  3           1     c     a     1  <NA>
#>  4           1     c     a     2  <NA>
#>  5           1     c     a     3  <NA>
#>  6           1     c     a     4  <NA>
#>  7           1     c     a     5  <NA>
#>  8           1     c     b     2  <NA>
#>  9           1     c     c     a     1
#> 10           1     c     c     d     3
#> 11           1     d     y     3  <NA>
#> 12           1     d     z     2  <NA>

Hopefully, future development of the tidyjson package will make this an easier problem to tackle!

David Ranzolin · Accepted Answer · 2016-09-18 02:54:23Z

I struggled in similar situations, but the tidyjson package has bailed me out time after time when dealing with nested JSON. There's a fair amount of typing required, but the tidyjson functions return a tidy object. Documentation here: https://github.com/sailthru/tidyjson

3 Comments

Shape Over a year ago

Can you show me an example with tidyjson that is tree-agnostic (i.e., you don't need to know the names of the deeply nested levels)? I've played with it before, but I thought it generally requires the kind of knowledge about the tree's structure that I'm hoping to gain by making it tidier.

David Ranzolin Over a year ago

Hmm, apologies, but I'm not sure such an example exists. After I get the JSON, I call jsonlite::prettify() to see the entire structure, then begin parsing with tidyjson::enter_object(), tidyjson::spread_values(), etc.

Shape Over a year ago

With the size of the JSON I'm working with, that's not always an easy option. I'm looking for a function that, like the one I posted above, will work on any nested tree and move it into a tidy structure.

Christoph Glur · Accepted Answer · 2016-09-22 11:54:16Z

As dracodoc pointed out, data.tree might help. E.g. like this:

library(data.tree)
aNestedTree = 
  list(a = 1, 
       b = 2, 
       c = list(
         a = list(1:5), 
         b = 2, 
         c = list(
           a = 1, 
           d = 3,
           e = list())),
       d = list(
         y = 3,
         z = 2
       ))

tree <- FromListSimple(aNestedTree)
print(tree)

This will give:

      levelName  z
1 Root          NA
2  ¦--c         NA
3  ¦   ¦--a     NA
4  ¦   °--c     NA
5  ¦       °--e NA
6  °--d          2

And:

tree$fieldsAll
[1] "a" "b" "1" "d" "y" "z"

Side note: typically, you could do something like this:

do.call("print", c(tree, tree$fieldsAll))

However, here, this doesn't work because some node names are the same as field names. I consider this a bug and will fix it soon.
