2

I have a data frame which consists of a single column of some very messy JSON data. I would like to convert the JSON entries in that column to additional columns in the same data frame, I have a messy solution, but it will be tedious and long to apply it to my actual dataset.

Here is my sample data frame:

sample.df <- data.frame(id = c(101, 102, 103, 104),
                    json_col = c('[{"foo_a":"bar"}]',
                                 '[{"foo_a":"bar","foo_b":"bar"}]',
                                 '[{"foo_a":"bar","foo_c":2}]',
                                 '[{"foo_a":"bar","foo_b":"bar","foo_c":2,"nested_col":{"foo_d":"bar","foo_e":3}}]'),
                    startdate = as.Date(c('2010-11-1','2008-3-25','2007-3-14','2006-2-21')))

in reality my data frame has over 100000 entries and consists of multiple JSON columns where I need to apply the solution to this question, there are also several orders of nested lists (i.e. nested lists within nested lists).

Here is my solution:

j.col <- sample.df[2]
library(jsonlite)
j.l <- apply(j.col, 1, jsonlite::fromJSON, flatten = T)
library(dplyr)
l.as.df <- bind_rows(lapply(j.l,data.frame))
new.df <- cbind(sample.df$id, l.as.df, sample.df$startdate)

My solution is a roundabout method were I extract the column from the data frame with the JSON stuff, and then convert the JSON into a second dataframe, and then I combine the two dataframes into a third datadframe. This will be long and tedious to do with my actual data, not to mention that it is inelegant. How can I do this without having to create the additional data frames?

Thanks in advance for any help!

1
  • Given the messiness of that data I would actually say your approach is pretty good! I'm sure one of the wizards on this site will provide you with something more elegant, though :). Commented Apr 3, 2018 at 15:51

3 Answers 3

3

Here's another approach that will spare you the intermediate dataframes:

library(dplyr)
library(jsonlite)

new.df <- sample.df %>% 
          rowwise() %>%
          do(data.frame(fromJSON(.$json_col, flatten = T))) %>%
          ungroup() %>%
          bind_cols(sample.df %>% select(-json_col))

print(new.df)
# # A tibble: 4 x 7
#   foo_a foo_b foo_c nested_col.foo_d nested_col.foo_e    id  startdate
#   <chr> <chr> <int>            <chr>            <int> <dbl>     <date>
# 1     _  <NA>    NA             <NA>               NA   101 2010-11-01
# 2     _     _    NA             <NA>               NA   102 2008-03-25
# 3     _  <NA>     2             <NA>               NA   103 2007-03-14
# 4     _     _     2                _                3   104 2006-02-21
Sign up to request clarification or add additional context in comments.

2 Comments

Thanks for the solution, it looks great. can I ask why you need to ungroup() when there is no group_by() preceding?
The reason is that rowwise() essentially groups the data frame by the row number- it behaves similarly to if you had created a row number column and then grouped the data by that column. The use of ungroup() removes the rowwise grouping.
3
library(dplyr)
library(tidyr)
library(purrr)
library(jsonlite)

sample.df %>%
  mutate(
    json_parsed = map(json_col, ~ fromJSON(., flatten=TRUE))
  ) %>%
  unnest(json_parsed)

#    id
# 1 101
# 2 102
# 3 103
# 4 104
#                                                                           json_col
# 1                                                                [{"foo_a":"bar"}]
# 2                                                  [{"foo_a":"bar","foo_b":"bar"}]
# 3                                                      [{"foo_a":"bar","foo_c":2}]
# 4 [{"foo_a":"bar","foo_b":"bar","foo_c":2,"nested_col":{"foo_d":"bar","foo_e":3}}]
#    startdate foo_a foo_b foo_c nested_col.foo_d nested_col.foo_e
# 1 2010-11-01   bar  <NA>    NA             <NA>               NA
# 2 2008-03-25   bar   bar    NA             <NA>               NA
# 3 2007-03-14   bar  <NA>     2             <NA>               NA
# 4 2006-02-21   bar   bar     2              bar                3

If you are reducing libraries, you can remove purrr and instead use:

...
    json_parsed = lapply(.$json_col, fromJSON, flatten=TRUE)
...

1 Comment

great answer, thank you. How would you do this if there were also whole rows that are NA?
1

I think this will work. The main idea is that we take json_col and turn it into a character string, that we can then pass into the fromJSON function that takes care of the rest.

library(stringi)
library(jsonlite)

sample.df$json_col<- as.character(sample.df$json_col)
json_obj<- paste(sample.df$json_col, collapse = "")
json_obj<- stri_replace_all_fixed(json_obj, "][", ",")
new.df<- cbind(sample.df$id, fromJSON(json_obj), sample.df$startdate)

> new.df
#  sample.df$id foo_a foo_b foo_c nested_col.foo_d nested_col.foo_e
#1          101     _  <NA>    NA             <NA>               NA
#2          102     _     _    NA             <NA>               NA
#3          103     _  <NA>     2             <NA>               NA
#4          104     _     _     2                _                3
#  sample.df$startdate
#1          2010-11-01
#2          2008-03-25
#3          2007-03-14
#4          2006-02-21

Make sure that the cbind part work correctly! In this case it did, but make sure that in your overall manipulations, you don't change the order of things.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.