I have a text/html file containing JSON data with 158 rows and 25 columns, and I have been trying to convert it into a data frame so that I can write it to a .csv file. I have tried the 'rjson' and 'jsonlite' packages to read the data and then convert it into a data table, using two approaches:

  1. Use

     library(jsonlite) 
     json_file = "projectslocations.html"
     json_datan <- fromJSON(json_file)
    

The data structure has only one row with 158 variables

  2. Using jsonlite and data.table

      library(jsonlite)
      library(data.table)
      json_dat <- fromJSON(json_file)
      class(json_dat)
      lst= rbindlist(json_dat, fill=TRUE)

This shows a data.frame with 158 rows and 25 variables. However, I can't write this data frame to CSV or even view it.

Error:

 Error in FUN(X[[i]], ...) : 
 Invalid column: it has dimensions. Can't format it. If it's the result of data.table(table()), use as.data.table(table()) instead.

The original data is available here

  • It's a heavily nested structure. You're going to have to roll up your sleeves and do some data munging. Commented Jul 23, 2016 at 15:23
  • The biggest problem is that each document has been named - e.g. P108941: {...}. If the file was just an array of unnamed docs - e.g. [{...}, {...}] your life would be easier (I think). Commented Jul 23, 2016 at 17:00
  • "e.g. P108941" is the name of the 'project', which should go in the 'projects' column. I have updated the original structure in the hope that it might help answer the question. Commented Jul 23, 2016 at 17:10
  • How do you intend to represent the locations field, which is often an array of docs containing location information, as a single column in a CSV file? Commented Jul 23, 2016 at 17:54
  • Alex, one project might have many locations, so I was thinking of putting the locations in multiple rows, with the rest of the information staying the same for the same project. Commented Jul 23, 2016 at 19:02

3 Answers

Here's how I would munge your data using a bit of functional programming with the purrr package and the data-munging awesomeness of the dplyr package:

library(jsonlite) 
library(purrr)
library(dplyr)

# load JSON data and parse to list in R
json_file = file("projects.txt")
json_data <- fromJSON(json_file, simplifyDataFrame = FALSE)[[1]]

# extract location data separately and create a data.frame with a project id column
locations <- 
  json_data %>% 
  at_depth(1, "locations") %>% 
  at_depth(2, ~data.frame(.x, stringsAsFactors = FALSE)) %>% 
  map(~bind_rows(.x)) %>% 
  bind_rows(.id = "id")

# prefix 'location_' to all location fields
colnames(locations) <- paste0("location_", colnames(locations))

# extract all project data excluding location data and create a data.frame
projects <- 
  json_data %>% 
  map(function(x) {x$locations <- NULL; x}) %>% 
  map(~data.frame(as.list(unlist(.x)), stringsAsFactors = FALSE)) %>% 
  bind_rows()

# join project and location data to yield a final denormalised data structure
projects_and_locations <- 
  projects %>% 
  inner_join(locations, by = c('id' = 'location_id'))

# details of single row of final denormalised data.frame
str(projects_and_locations[1,]) 

# 'data.frame': 1 obs. of  32 variables:
# $ id                    : chr "P130343"
# $ project_name          : chr "MENA- Desert Ecosystems and Livelihoods Knowledge Sharing an"
# $ pl                    : chr "Global Environment Project"
# $ fy                    : chr "2013"
# $ ca                    : chr "$1.00M"
# $ gpname                : chr "Environment & Natural Resources"
# $ s                     : chr "Environment"
# $ ttl                   : chr "Taoufiq Bennouna"
# $ ttlupi                : chr "000314228"
# $ sbc                   : chr "ENV"
# $ sbn                   : chr "Environment"
# $ boardapprovaldate     : chr "23-May-2013"
# $ crd                   : chr "16-Feb-2012"
# $ dmd                   : chr ""
# $ ed                    : chr "10-Jun-2013"
# $ fdd                   : chr "04-Dec-2013"
# $ rcd                   : chr "31-Dec-2017"
# $ fc                    : chr "false"
# $ totalamt              : chr "$1.00M"
# $ url                   : chr "http://www.worldbank.org/projects/P130343?lang=en"
# $ project_abstract.cdata: chr ""
# $ sector.Name           : chr "Agriculture, fishing, and forestry"
# $ sector.code           : chr "AX"
# $ countrycode           : chr "5M"
# $ countryname           : chr "Middle East and North Africa"
# $ location_geoLocId     : chr "0002464470"
# $ location_url          : chr "javascript:projectPopupInfo('P130343', '0002464470')"
# $ location_geoLocName   : chr "Tunis"
# $ location_latitude     : chr "36.8190"
# $ location_longitude    : chr "10.1660"
# $ location_country      : chr "TN"
# $ location_countryName  : chr "Tunisia" 
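
at_depth() was later deprecated in purrr, so the code above may not run on a current installation. The same split-and-join idea can be sketched with plain map() calls. This is a minimal sketch, not the original author's code: the two-project JSON string, its field values, and the temporary CSV path are made up for illustration and only mimic the shape of the real data.

```r
library(jsonlite)
library(purrr)
library(dplyr)

# Hypothetical sample: two projects keyed by project id, as in the real file
txt <- '{"projects": {
  "P1": {"id": "P1", "project_name": "Alpha",
         "locations": [{"geoLocName": "Tunis", "country": "TN"},
                       {"geoLocName": "Sfax",  "country": "TN"}]},
  "P2": {"id": "P2", "project_name": "Beta",
         "locations": [{"geoLocName": "Cairo", "country": "EG"}]}
}}'
json_data <- fromJSON(txt, simplifyVector = FALSE)$projects

# one row per location, tagged with the key of the project it belongs to
locations <- json_data %>%
  map(function(p) bind_rows(map(p$locations, as.data.frame, stringsAsFactors = FALSE))) %>%
  bind_rows(.id = "id")

# prefix every location field except the join key
names(locations)[-1] <- paste0("location_", names(locations)[-1])

# one row per project, with the nested locations removed
projects <- json_data %>%
  map(function(p) {
    p$locations <- NULL
    as.data.frame(p, stringsAsFactors = FALSE)
  }) %>%
  bind_rows()

# project fields are repeated once per location
projects_and_locations <- inner_join(projects, locations, by = "id")

# every column is now a plain vector, so writing to CSV works
write.csv(projects_and_locations, tempfile(fileext = ".csv"), row.names = FALSE)
```

Keeping the join key unprefixed means the join can use by = "id" directly instead of mapping 'id' to 'location_id'.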

The first problem is that the data cannot be simplified because the JSON is untidy: it has data in its keys (the project names). A workaround is to remove the key names before simplifying:

library(jsonlite)
mydata <- fromJSON('http://pastebin.com/raw/HS3YEQxZ', simplifyVector = FALSE)
project_names <- names(mydata$projects) 
names(mydata$projects) = NULL
out <- jsonlite:::simplify(mydata, flatten = TRUE)
projects <- out$projects
projects$name <- project_names  

This gets the projects data into the proper data frame shape. However, if you look at the structure, it turns out you have a one-to-many dataset: the sector and locations columns each hold a nested data frame with multiple rows.

str(projects[1,])

Hence you will need to do a left-join operation to merge this into a simple 2D data frame. This is a problem of its own, unrelated to JSON.
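
To see the one-to-many shape on something small, here is a sketch on a made-up two-project JSON string (the ids and place names are hypothetical). It drops the project keys as described above, but round-trips through toJSON()/fromJSON() so that only exported jsonlite functions are used; that round trip is my substitution for the internal simplify call, not part of the original workaround.

```r
library(jsonlite)

# Hypothetical sample with the same "data in the keys" problem
txt <- '{"projects": {
  "P1": {"id": "P1", "locations": [{"geoLocName": "Tunis"}, {"geoLocName": "Sfax"}]},
  "P2": {"id": "P2", "locations": [{"geoLocName": "Cairo"}]}
}}'

mydata <- fromJSON(txt, simplifyVector = FALSE)
project_names <- names(mydata$projects)

# drop the keys so the projects become an array of records, then re-parse
projects <- fromJSON(toJSON(unname(mydata$projects), auto_unbox = TRUE))
projects$name <- project_names

# columns that still hold nested data rather than plain vectors
list_cols <- names(projects)[vapply(projects, is.list, logical(1))]
list_cols

# number of location rows hidden inside each project
vapply(projects$locations, nrow, integer(1))
```

Any column reported in list_cols has to be unnested or dropped before the data frame can be written to CSV.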

Because you have more than one nested column, it is unclear what you expect your output to look like. Use tidyr::unnest to left-join against one of the nested columns:

# Unnest 'locations' column
out <- tidyr::unnest(projects, locations)
names(out)

Note that tidyr automatically drops the sector column in this case because it is incompatible with left-joining locations with projects.
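
As a small illustration of what unnesting does, here is a sketch on a hand-built data frame with a single nested column. The project ids, names and locations are invented, and the CSV goes to a temporary file; the real data has more columns and a second nested field that you would have to handle separately.

```r
library(tidyr)

# Hypothetical projects table: one row per project
projects <- data.frame(
  id = c("P1", "P2"),
  project_name = c("Alpha", "Beta"),
  stringsAsFactors = FALSE
)

# each cell of 'locations' is itself a data frame (one-to-many)
projects$locations <- list(
  data.frame(geoLocName = c("Tunis", "Sfax"), country = c("TN", "TN"),
             stringsAsFactors = FALSE),
  data.frame(geoLocName = "Cairo", country = "EG", stringsAsFactors = FALSE)
)

# one output row per location; project fields are repeated as needed
flat <- unnest(projects, cols = locations)

# no nested columns remain, so the result can be written out
write.csv(flat, tempfile(fileext = ".csv"), row.names = FALSE)
```

This matches the layout asked for in the comments: a project with several locations appears on several rows, with the project-level fields duplicated.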


This question is really hard to answer as-is. You can see from my little code block that, when flattened (aka unlisted), each project has a different number of elements (different length).

You will need to decide how to deal with this based on what you want out of the data. An R data.frame must be rectangular (every column has the same length).

library(rjson)
fn <- 'path.to.file....'
json_data <- fromJSON(file = fn)
# number of leaf values in each project once it is fully flattened
sapply(json_data$projects, function(l) length(unlist(l)))
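
To make the length mismatch concrete, and to show one way of making the data rectangular, here is a base R sketch on a hand-written list. The two projects and their fields are hypothetical, and the "one row per location" layout is just one possible choice.

```r
# Hypothetical parsed structure: P1 has two locations, P2 has one
projects <- list(
  P1 = list(id = "P1",
            locations = list(list(geoLocName = "Tunis"), list(geoLocName = "Sfax"))),
  P2 = list(id = "P2",
            locations = list(list(geoLocName = "Cairo")))
)

# flattened lengths differ, so the projects cannot be stacked as-is
n_fields <- sapply(projects, function(p) length(unlist(p)))
n_fields

# make it rectangular: one row per location, project id recycled across rows
rows <- lapply(projects, function(p) {
  data.frame(
    id = p$id,
    geoLocName = vapply(p$locations, function(loc) loc$geoLocName, character(1)),
    stringsAsFactors = FALSE
  )
})
flat <- do.call(rbind, rows)
flat
```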

1 Comment

Thanks rcorty, I was planning on having multiple rows for the same project. For projects with multiple locations, all the other information is duplicated across multiple rows to match the number of locations.
