JSON to dataframe in R

Question

I have a text/html file with 158 rows and 25 columns of data in JSON format, and I have been trying to convert it into a data frame so that I can write it to .csv. I have tried the "rjson" and "jsonlite" packages to read the data and then convert it into a data table, using two approaches:

1. Using jsonlite

 library(jsonlite) 
 json_file = "projectslocations.html"
 json_datan <- fromJSON(json_file)

The resulting data structure has only one row with 158 variables.

2. Using jsonlite and data.table

      library(jsonlite)
      library(data.table)
      json_dat <- fromJSON(json_file)
      class(json_dat)
      lst= rbindlist(json_dat, fill=TRUE)

This shows a data.frame with 158 rows and 25 variables. However, I can't write this data frame to CSV or even view it.

error :

 Error in FUN(X[[i]], ...) : 
 Invalid column: it has dimensions. Can't format it. If it's the result of data.table(table()), use as.data.table(table()) instead.

The original data is available here

It's a heavily nested structure. You're going to have to roll up your sleeves and do some data munging.
– hrbrmstr, Commented Jul 23, 2016 at 15:23
The biggest problem is that each document has been named, e.g. P108941: {...}. If the file were just an array of unnamed docs, e.g. [{...}, {...}], your life would be easier (I think).
– Alex Ioannides, Commented Jul 23, 2016 at 17:00
"P108941" is the name of the 'project', which should go in the 'projects' column. I have updated the original structure in the hope that it might help answer the question.
– Arihant, Commented Jul 23, 2016 at 17:10
How do you intend to represent the locations field, which is often an array of docs containing location information, as a single column in a CSV file?
– Alex Ioannides, Commented Jul 23, 2016 at 17:54
Alex, one project might have many locations, so I was thinking of having locations in multiple rows, with the rest of the information staying the same for the same project.
– Arihant, Commented Jul 23, 2016 at 19:02

Alex Ioannides · Accepted Answer · 2016-07-25 05:28:22Z

Here's how I would munge your data using a bit of functional programming with the purrr package and the data-munging awesomeness of the dplyr package:

library(jsonlite) 
library(purrr)
library(dplyr)

# load JSON data and parse to list in R
json_file = file("projects.txt")
json_data <- fromJSON(json_file, simplifyDataFrame = FALSE)[[1]]

# extract location data separately and create a data.frame with a project id column
# note: at_depth() was later renamed modify_depth(); newer purrr releases need that name
locations <- 
  json_data %>% 
  at_depth(1, "locations") %>% 
  at_depth(2, ~data.frame(.x, stringsAsFactors = FALSE)) %>% 
  map(~bind_rows(.x)) %>% 
  bind_rows(.id = "id")

# prefix 'location_' to all location fields
colnames(locations) <- paste0("location_", colnames(locations))

# extract all project data excluding location data and create a data.frame
projects <- 
  json_data %>% 
  map(function(x) {x$locations <- NULL; x}) %>% 
  map(~data.frame(as.list(unlist(.x)), stringsAsFactors = FALSE)) %>% 
  bind_rows()

# join project and location data to yield a final denormalised data structure
projects_and_locations <- 
  projects %>% 
  inner_join(locations, by = c('id' = 'location_id'))

# details of single row of final denormalised data.frame
str(projects_and_locations[1,]) 

# 'data.frame': 1 obs. of  32 variables:
# $ id                    : chr "P130343"
# $ project_name          : chr "MENA- Desert Ecosystems and Livelihoods Knowledge Sharing an"
# $ pl                    : chr "Global Environment Project"
# $ fy                    : chr "2013"
# $ ca                    : chr "$1.00M"
# $ gpname                : chr "Environment & Natural Resources"
# $ s                     : chr "Environment"
# $ ttl                   : chr "Taoufiq Bennouna"
# $ ttlupi                : chr "000314228"
# $ sbc                   : chr "ENV"
# $ sbn                   : chr "Environment"
# $ boardapprovaldate     : chr "23-May-2013"
# $ crd                   : chr "16-Feb-2012"
# $ dmd                   : chr ""
# $ ed                    : chr "10-Jun-2013"
# $ fdd                   : chr "04-Dec-2013"
# $ rcd                   : chr "31-Dec-2017"
# $ fc                    : chr "false"
# $ totalamt              : chr "$1.00M"
# $ url                   : chr "http://www.worldbank.org/projects/P130343?lang=en"
# $ project_abstract.cdata: chr ""
# $ sector.Name           : chr "Agriculture, fishing, and forestry"
# $ sector.code           : chr "AX"
# $ countrycode           : chr "5M"
# $ countryname           : chr "Middle East and North Africa"
# $ location_geoLocId     : chr "0002464470"
# $ location_url          : chr "javascript:projectPopupInfo('P130343', '0002464470')"
# $ location_geoLocName   : chr "Tunis"
# $ location_latitude     : chr "36.8190"
# $ location_longitude    : chr "10.1660"
# $ location_country      : chr "TN"
# $ location_countryName  : chr "Tunisia"
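
The final join step can be sketched on toy data. The two small data frames below are made up for illustration and only stand in for `projects` and `locations`; the column naming (`id`, `location_id`) follows the code above, and the values are assumptions.

```r
library(dplyr)

# Toy project table: one row per project (values are made up)
projects <- data.frame(
  id = c("P1", "P2"),
  fy = c("2013", "2014"),
  stringsAsFactors = FALSE
)

# Toy location table: one row per location, keyed by the prefixed project id
locations <- data.frame(
  location_id = c("P1", "P1", "P2"),
  location_geoLocName = c("Tunis", "Sfax", "Cairo"),
  stringsAsFactors = FALSE
)

# Join on the project id; project fields repeat once per location
joined <- inner_join(projects, locations, by = c("id" = "location_id"))
print(joined)
```

Note that inner_join() drops any project with no matching location row. If such projects should be kept, with NA in the location fields, left_join() takes the same arguments.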

Jeroen Ooms · Accepted Answer · 2016-07-27 11:32:51Z

The first problem is that the data cannot be simplified because the JSON is untidy: it has data in its keys (project names). A workaround is to remove the key names before simplifying:

library(jsonlite)
mydata <- fromJSON('http://pastebin.com/raw/HS3YEQxZ', simplifyVector = FALSE)
project_names <- names(mydata$projects) 
names(mydata$projects) = NULL
out <- jsonlite:::simplify(mydata, flatten = TRUE)
projects <- out$projects
projects$name <- project_names
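
The same idea can be sketched with exported jsonlite functions only, in case relying on the internal `jsonlite:::simplify` is a concern. This is a variation rather than the code above: it drops the key names and then round-trips the unnamed list through `toJSON()` and `fromJSON()`. The JSON string and its field names are a made-up miniature of the real file.

```r
library(jsonlite)

# Made-up miniature of the input: project ids are used as keys
txt <- '{"projects": {
  "P1": {"fy": "2013", "locations": [{"geoLocName": "Tunis"}, {"geoLocName": "Sfax"}]},
  "P2": {"fy": "2014", "locations": [{"geoLocName": "Cairo"}]}
}}'

# Parse without simplification so the nested list structure is preserved
raw <- fromJSON(txt, simplifyVector = FALSE)
project_names <- names(raw$projects)

# Drop the keys, serialise to a JSON array of records, then parse with simplification
projects <- fromJSON(toJSON(unname(raw$projects), auto_unbox = TRUE))
projects$name <- project_names

str(projects, max.level = 1)
```

The round trip costs an extra serialisation pass, which should be negligible for a file of this size.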

This gets the projects data into the proper data frame shape. However, if you look at the structure, it turns out you have a one-to-many dataset: the sector and locations columns each hold a nested data frame with multiple rows.

str(projects[1,])
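
As a rough sketch of how to spot such columns programmatically, the snippet below flags every column that is a list or carries dimensions. The toy data frame and the helper `is_nested` are hypothetical and only mimic the shape described here.

```r
# Toy stand-in for the simplified 'projects' data frame (values are made up)
projects <- data.frame(name = c("P1", "P2"), fy = c("2013", "2014"),
                       stringsAsFactors = FALSE)
projects$locations <- list(
  data.frame(geoLocName = c("Tunis", "Sfax"), stringsAsFactors = FALSE),
  data.frame(geoLocName = "Cairo", stringsAsFactors = FALSE)
)

# Hypothetical helper: a column counts as nested if it is a list or has dimensions
is_nested <- function(col) is.list(col) || !is.null(dim(col))
nested_cols <- names(projects)[vapply(projects, is_nested, logical(1))]
print(nested_cols)

# Rows held by each nested data frame; more than one means one-to-many
rows_per_project <- vapply(projects$locations, nrow, integer(1))
print(rows_per_project)
```

Columns flagged this way are the ones that have to be unnested or dropped before the data frame can be written to CSV.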

Hence you will need to do a left join operation to merge this into a simple 2D data frame. This is a problem on its own, unrelated to JSON.

Because you have more than one nested column, it is unclear what you expect your output to look like. Use tidyr::unnest to left-join against one of the nested columns:

# Unnest 'locations' column
out <- tidyr::unnest(projects, locations)
names(out)

Note that tidyr automatically drops the sector column in this case because it is incompatible with left-joining locations with projects.
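
A minimal sketch of the unnest step on toy data, followed by the CSV export the question asks for. The data frame is made up, and the output file name is an assumption; it is written to a temporary directory.

```r
library(tidyr)

# Toy data frame with one nested 'locations' data frame per project (values are made up)
projects <- data.frame(name = c("P1", "P2"), fy = c("2013", "2014"),
                       stringsAsFactors = FALSE)
projects$locations <- list(
  data.frame(geoLocName = c("Tunis", "Sfax"), stringsAsFactors = FALSE),
  data.frame(geoLocName = "Cairo", stringsAsFactors = FALSE)
)

# One output row per location; project fields are repeated for each location
flat <- unnest(projects, locations)
print(flat)

# The flattened result holds only plain columns, so it can be written to CSV
csv_path <- file.path(tempdir(), "projects_locations.csv")
write.csv(flat, csv_path, row.names = FALSE)
```

This matches the layout described in the comments on the question: a project with several locations appears in several rows, with the rest of its fields duplicated.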

rcorty · Accepted Answer · 2016-07-23 18:18:31Z

This question is really hard to answer as-is. You can see from my little code block that, when flattened (aka unlisted), each project has a different number of elements (different length).

You will need to decide how to deal with this based on what you want out of the data. An R data.frame must be rectangular (every column has the same length).

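library(rjson)
fn <- 'path.to.file....'
# parse the JSON file into a nested R list
json_data <- fromJSON(file = fn)
# count the flattened elements of each project; the counts differ between projects
sapply(X = json_data$projects,
       FUN = function(l) length(unlist(l)))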
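
To make the point concrete without the original file, here is the same count on a made-up nested list in base R. The project ids and fields are assumptions for illustration.

```r
# Made-up miniature of the parsed JSON: two projects with different numbers of locations
projects <- list(
  P1 = list(fy = "2013",
            locations = list(list(name = "Tunis"), list(name = "Sfax"))),
  P2 = list(fy = "2014",
            locations = list(list(name = "Cairo")))
)

# Number of leaf values per project after flattening
n_fields <- sapply(projects, function(p) length(unlist(p)))
print(n_fields)

# More than one distinct count means the projects cannot be row-bound as-is
is_ragged <- length(unique(n_fields)) > 1
print(is_ragged)
```

Unequal counts like these are why the flattened projects cannot simply be stacked into one data frame.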

Thanks rcorty, I was planning to have multiple rows for the same project. For projects with multiple locations, all the information is duplicated across multiple rows to match the number of locations.
– Arihant