3

This comes up a lot when dealing with API's.

Most of the time, to do real analysis, I'd like to get my dataset tidy, but typically, this requires a solution for each type of tree, rather than something more general.

I figured it would be nice to have one function that generates tidy data (albeit with a ton of NA's in deeply nested trees with many different factor levels.

I have a hackish solution which follows, using unlist(..., recursive = FALSE) + a naming convention,

But I'd like to see if someone here might have a better solution to tidy these kinds of list structures.

#####################
# Some Test Data
aNestedTree = 
  list(a = 1, 
       b = 2, 
       c = list(
         a = list(1:5), 
         b = 2, 
         c = list(
           a = 1, 
           d = 3,
           e = list())),
       d = list(
         y = 3,
         z = 2
       ))

############################################################
# Run through the list and rename all list elements,
# We unlist once at  time, adding "__" at each unlist step
# until the object is no longer a list

renameVars <- function(lst, sep = '__') {
  if(is.list(lst)) {
    names(lst) <- paste0(names(lst),sep)
    renameVars(unlist(lst, recursive = FALSE),sep = sep)
  } else {
    lst
  }
}

res <- renameVars(aNestedTree)

We can check the output and see that we have a strangely named object, But there's a method to this madness.

> res
    a________     b________  c__.a____1__  c__.a____2__  c__.a____3__ 
            1             2             1             2             3 
 c__.a____4__  c__.a____5__   c__.b______ c__.c__.a____ c__.c__.d____ 
            4             5             2             1             3 
  d__.y______   d__.z______ 
            3             2 

Now I put this in a data.table, so I can shape it.

library(data.table)
dt <- data.table(values = res, name = names(res))

# Use some regex to split that name up, along with data.table's tstrsplit
# function to separate them into as many columns as there are nests

> dt[,paste0('V',seq_along(s <- tstrsplit(dt$name,'[__]+(\\.|)'))) := s]
> dt
    values          name V1 V2 V3
 1:      1     a________  a NA NA
 2:      2     b________  b NA NA
 3:      1  c__.a____1__  c  a  1
 4:      2  c__.a____2__  c  a  2
 5:      3  c__.a____3__  c  a  3
 6:      4  c__.a____4__  c  a  4
 7:      5  c__.a____5__  c  a  5
 8:      2   c__.b______  c  b NA
 9:      1 c__.c__.a____  c  c  a
10:      3 c__.c__.d____  c  c  d
11:      3   d__.y______  d  y NA
12:      2   d__.z______  d  z NA

I can then filter for the factor combinations that I want (Or dcast/spread). (Though I'm effectively breaking apart tables at the lowest level if they exist)

I thought about going through bind.c and pulling out the do_unlistto make a function with a flexible naming convention via Rcpp, but my C++ is rusty, so I figured I'd post here before I do anything drastic.

2

3 Answers 3

2

I tend to lean towards tidyjson as well. In the tidyverse, the behavior you are looking for seems to be in the gather family.

I think the gather family of functions in tidyjson could do with a bit of improvement that would make these helpers unnecessary. Right now, they are very "type-sensitive" and error or throw out types that do not match. In any case, the workaround is not too challenging, although it definitely lacks elegance. Note that the bind_rows variant is presently from my development version and is not mainstream yet. Hopefully this illustrates the idea, though.

Notes on approach:

  • That all values would be numeric (I cast them to character afterwards)
  • Helpers gather elements of the varying types, and bind_rows stacks the datasets together.
  • level is kept track of by level of recursion

First define the helpers:

recurse_gather <- function(.x,.level) {
  .x <- tidyjson::bind_rows(
    gobj(.x,.level)
    , garr(.x,.level)
    , gpersist(.x,.level)
  )

  if (any(as.character(json_types(.x,'type')$type) %in% c('object','array'))) {
    .x <- recurse_gather(.x,.level+1)
  }

  return(.x)
}
gobj <- function(.x,.level) {
  .x %>% json_types('type') %>%
    filter(type=='object') %>%
    gather_object(paste0('v',.level)) %>%
    select(-type)
}

gpersist <- function(.x,.level) {
  .x %>% json_types('type') %>%
    filter(! type %in% c('object','array')) %>%
    mutate_(.dots=setNames(
      paste0('as.character(NA)')
      ,paste0('v',.level)
    )) %>%
    select(-type)
}

garr <- function(.x,.level) {
  .x %>% json_types('type') %>%
    filter(type=='array') %>%
    gather_array('arridx') %>%
    append_values_number(paste0('v',.level)) %>%
    mutate_(.dots=setNames(
      paste0('as.character(v',.level,')')
      ,paste0('v',.level)
    )) %>%
    select(-arridx,-type)
}

Then using the helpers is pretty straight-forward.

library(dplyr)
library(tidyjson)

j <- "{\"a\":[1],\"b\":[2],\"c\":{\"a\":[1,2,3,4,5],\"b\":[2],\"c\":{\"a\":[1],\"d\":[3],\"e\":[]}},\"d\":{\"y\":[3],\"z\":[2]}}"
recurse_gather(j, 1) %>% arrange(v1, v2, v3, v4) %>% tbl_df()
#> # A tibble: 12 x 5
#>    document.id    v1    v2    v3    v4
#>  *       <int> <chr> <chr> <chr> <chr>
#>  1           1     a     1  <NA>  <NA>
#>  2           1     b     2  <NA>  <NA>
#>  3           1     c     a     1  <NA>
#>  4           1     c     a     2  <NA>
#>  5           1     c     a     3  <NA>
#>  6           1     c     a     4  <NA>
#>  7           1     c     a     5  <NA>
#>  8           1     c     b     2  <NA>
#>  9           1     c     c     a     1
#> 10           1     c     c     d     3
#> 11           1     d     y     3  <NA>
#> 12           1     d     z     2  <NA>

Hopeful that future development on the tidyjson package will make this an easier problem to tackle!

Sign up to request clarification or add additional context in comments.

Comments

0

I struggled in similar situations, but the tidyjson package has bailed me out time after time when dealing with nested JSON. There's a fair amount of typing required, but the tidyjson functions return a tidy object. Documentation here: https://github.com/sailthru/tidyjson

3 Comments

Can you show me an example with tidyjson that is tree-agnostic? (I.e. you don't need to know the names of the deeply nested levels? I've played with it before, but I thought it generally requires the kind of knowledge about the structure of the tree that I'd like to get by making it tidier.
Hmm, apologies, but I'm not sure such an example exists. After I get the JSON, I call jsonlite::prettify() to see the entire structure, then begin parsing with tidyjson::enter_object(), tidyjson::spread_values(), etc.
With the size of the json I'm working with, that's not always an easy option. I'm looking for a function that, like the one I posted above, will work on any nested tree, and move it into a tidy structure.
0

As dracodoc pointed out, data.tree might help. E.g. like this:

library(data.tree)
aNestedTree = 
  list(a = 1, 
       b = 2, 
       c = list(
         a = list(1:5), 
         b = 2, 
         c = list(
           a = 1, 
           d = 3,
           e = list())),
       d = list(
         y = 3,
         z = 2
       ))

tree <- FromListSimple(aNestedTree)
print(tree)

This will give:

      levelName  z
1 Root          NA
2  ¦--c         NA
3  ¦   ¦--a     NA
4  ¦   °--c     NA
5  ¦       °--e NA
6  °--d          2

And:

tree$fieldsAll
[1] "a" "b" "1" "d" "y" "z"

Side note: typically, you could do something like this:

do.call("print", c(tree, tree$fieldsAll))

However, here, this doesn't work because some node names are the same as field names. I consider this a bug and will fix it soon.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.