
I have a list of more than 100,000 JSON files from which I want to build a data.table containing only a few variables. Unfortunately, the files are complex. The parsed content of each JSON file looks like this:

Sample 1

$id
[1] "10.1"
$title
$title$value
[1] "Why this item"
$itemsource
$itemsource$id
[1] "AA"
$date
[1] "1992-01-01"
$itemType
[1] "art"
$creators
list()

Sample 2

$id
[1] "10.2"
$title
$title$value
[1] "We need this item"
$itemsource
$itemsource$id
[1] "AY"
$date
[1] "1999-01-01"
$itemType
[1] "art"
$creators
    type              name firstname  surname affiliationIds
1 Person Frank W. Cornell.  Frank W. Cornell.             a1
2 Person    David A. Chen.  David A.    Chen.             a1

$affiliations
  id                                          name
1 a1 Foreign Affairs Desk, New York Times

What I need from this set of files is a table with creator names, item ids and dates. For the two sample files above:

id      date          name               firstname  lastname  creatortype
"10.1"  "1992-01-01"  NA                 NA         NA        NA
"10.2"  "1999-01-01"  Frank W. Cornell.  Frank W.   Cornell.  Person
"10.2"  "1999-01-01"  David A. Chen.     David A.   Chen.     Person

What I have done so far:

library(parallel)
library(data.table)
library(jsonlite)
library(dplyr)

filelist = list.files(pattern="*.json",recursive=TRUE,include.dirs =TRUE)
parsed = mclapply(filelist, function(x) fromJSON(x),mc.cores=24)
data = rbindlist(mclapply(1:length(parsed), function(x) { 
  a = data.table(item = parsed[[x]]$id, date = list(list(parsed[[x]]$date)), name = list(list(parsed[[x]]$name)), creatortype = list(list(parsed[[x]]$creatortype))) #ignoring the firstname/lastname fields here for convenience
  b = data.table(id = a$item, date = unlist(a$date), name=unlist(a$name), creatortype=unlist(a$creatortype))
  return(b)
},mc.cores=24))

However, on the last step, I get this error:

"Error in rbindlist(mclapply(1:length(parsed), function(x){:
Item 1 of list is not a data.frame, data.table or list"

Thanks in advance for your suggestions. Related questions include:

Extract data from list of lists [R]
R convert json to list to data.table
I want to convert JSON file into data.table in r
How can read files from directory using R?
Convert R data table column from JSON to data table

1 Answer

From the error message, I suppose this means that one of the results from mclapply() is not a usable table: it may be empty (either NULL or a data.table with 0 rows), or the function may simply have hit an error inside the parallel processing.
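To check which of these cases applies, it helps to inspect the list before binding it. The sketch below is only an illustration on made-up data: it builds by hand the kind of list mclapply() hands back when one call fails (per the parallel documentation, the failing element comes back as a "try-error" object) and then looks for elements that are not tables. The names results, bad and msg are placeholders, not part of the question's code.

```r
library(data.table)

# Simulated mclapply() output (assumption: item 1 failed inside a worker,
# so it is a "try-error" object, i.e. a classed character vector)
results <- list(
  try(stop("could not build the table"), silent = TRUE),
  data.table(id = "10.2", date = "1999-01-01")
)

# Which elements are not data.frames / data.tables?
bad <- !vapply(results, is.data.frame, logical(1))
which(bad)            # positions of the problem items
class(results[[1]])   # shows what the first item actually is

# Binding the unfiltered list fails; capture the message instead of stopping
msg <- tryCatch(
  { rbindlist(results); NA_character_ },
  error = function(e) conditionMessage(e)
)
msg
```

Running the same vapply() check on the real mclapply() output should show whether item 1 is an error object or something else.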

What you could do is:

  1. Add more checks inside the mclapply() call, such as wrapping the body in try() and testing for class "try-error", or checking the class and nrow of b to see whether b is empty.

  2. When you call rbindlist(), add the argument fill = TRUE.
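The two suggestions above could be combined roughly as follows. This is a minimal sketch under assumptions, not a tested solution for the real files: extract_creators is a hypothetical helper name, the two items in parsed are hand-built to mimic what fromJSON() returned for the samples in the question, and lapply() stands in for mclapply() so the example runs on any platform.

```r
library(data.table)

# Hypothetical helper: one parsed item in, one data.table out.
# Items without creators still yield a single row filled with NA.
extract_creators <- function(item) {
  creators <- item$creators
  if (is.data.frame(creators) && nrow(creators) > 0) {
    data.table(
      id          = item$id,
      date        = item$date,
      name        = creators$name,
      firstname   = creators$firstname,
      lastname    = creators$surname,
      creatortype = creators$type
    )
  } else {
    # creators is an empty list(): keep the item with NA creator fields
    data.table(
      id = item$id, date = item$date,
      name = NA_character_, firstname = NA_character_,
      lastname = NA_character_, creatortype = NA_character_
    )
  }
}

# Hand-built stand-ins for the two parsed sample files
parsed <- list(
  list(id = "10.1", date = "1992-01-01", creators = list()),
  list(id = "10.2", date = "1999-01-01",
       creators = data.frame(
         type      = "Person",
         name      = c("Frank W. Cornell.", "David A. Chen."),
         firstname = c("Frank W.", "David A."),
         surname   = c("Cornell.", "Chen."),
         stringsAsFactors = FALSE
       ))
)

# Check 1: wrap each call in try() so one bad file cannot break the whole run
results <- lapply(parsed, function(item) try(extract_creators(item), silent = TRUE))

# Keep only real tables and report the positions of anything else
ok <- vapply(results, is.data.table, logical(1))
if (any(!ok)) warning("Items that failed: ", paste(which(!ok), collapse = ", "))

# Check 2: fill = TRUE tolerates tables with differing columns
result <- rbindlist(results[ok], fill = TRUE)
print(result)
```

On a Unix-like system the lapply() call can presumably be swapped back for mclapply(..., mc.cores = 24) once the helper works on a small subset of the files.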

Hope this solves your problem.
