
I currently have the following problem: I extracted some data via the Crunchbase API, which gave me a large nested list with the following structure (it contains many more nested lists at several levels; I show only the part that is currently relevant to me):

> str(x[[1]])
$ uuid         : chr "5f9957b0841251e6e439d757XXXXXX"
$ relationships: List of 27
..$ websites: List of 3
.. ..$ cardinality: chr "OneToMany"
.. ..$ items      :'data.frame':    4 obs. of  7 variables:
.. .. ..$ properties.website_type: chr [1:4] "homepage" "facebook" "twitter" "linkedin"
.. .. ..$ properties.url         : chr [1:4] "http://www.example.com" "https://www.facebook.com/example" "http://twitter.com/example" "http://www.linkedin.com/company/example"

Consider the following minimal example:

x <- list()
x[[1]] <- list(uuid = "123", 
           relationships = list(websites = list(items =  list(
                                                properties.website_type = c("homepage", "facebook", "twitter", "linkedin"), 
                                                properties.url = c("www.example1.com", "www.fbex1.com", "www.twitterex1.com", "www.linkedinex1.com") ) )  ) )
x[[2]] <- list(uuid = "987", 
           relationships = list(websites = list(items =  list(
             properties.website_type = c("homepage", "facebook", "twitter" ), 
             properties.url = c("www.example2.com", "www.fbex2.com", "www.twitterex2.com") ) )  ) )

Now, I would like to create a dataframe with the following column structure:

> x.df
uuid          web.url  web.facebook        web.twitter        web.linkedin
1  123 www.example1.com www.fbex1.com www.twitterex1.com www.linkedinex1.com
2  987 www.example2.com www.fbex2.com www.twitterex2.com                <NA>

Meaning: I would like to have every uuid (a unique firm identifier) in a single column, followed by the URLs of the different platforms (homepage, Facebook, Twitter, ...). I tried many combinations of lapply(), spread(), and bind_rows(), but could not get anything to work. Any help would be appreciated.

  • Please provide a sample of your data using dput. Commented Jun 25, 2018 at 10:11
  • Done. I added a downloadable link for a few data points. Commented Jun 25, 2018 at 15:27
  • Please make a minimal example instead of linking to a 1000-line file behind a link that may break at any time. See how to make a reproducible example. Commented Jun 25, 2018 at 21:40
  • Done. I hope it is clear now. Commented Jun 26, 2018 at 7:02

2 Answers


A dplyr/tidyr approach could be:

library(dplyr)
library(tidyr)

#convert list to dataframe in long format
df <- do.call(rbind, lapply(x, data.frame, stringsAsFactors = FALSE))

#final result
df1 <- df %>%
  spread(relationships.websites.items.properties.website_type, relationships.websites.items.properties.url)

which gives

  uuid      facebook         homepage            linkedin            twitter
1  123 www.fbex1.com www.example1.com www.linkedinex1.com www.twitterex1.com
2  987 www.fbex2.com www.example2.com                <NA> www.twitterex2.com
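
Note that spread() has been superseded by pivot_wider() in recent tidyr releases. As a minimal sketch, assuming tidyr 1.0.0 or later, the same reshaping can be written as below; the names_prefix argument also yields the web.* column names requested in the question. The long data frame and its short column names (website_type, url) are written out by hand purely for illustration and stand in for df above.

```r
library(tidyr)

# Hand-written long-format data with the same shape as `df` above.
# The short column names are illustrative, not what data.frame() generates.
long <- data.frame(
  uuid = c(rep("123", 4), rep("987", 3)),
  website_type = c("homepage", "facebook", "twitter", "linkedin",
                   "homepage", "facebook", "twitter"),
  url = c("www.example1.com", "www.fbex1.com", "www.twitterex1.com",
          "www.linkedinex1.com",
          "www.example2.com", "www.fbex2.com", "www.twitterex2.com"),
  stringsAsFactors = FALSE
)

# One row per uuid, one column per website type, prefixed with "web."
wide <- pivot_wider(long,
                    names_from = website_type,
                    values_from = url,
                    names_prefix = "web.")

# The question calls the homepage column "web.url"
names(wide)[names(wide) == "web.homepage"] <- "web.url"
```

Types that a firm lacks (linkedin for uuid 987 here) come out as NA, as with spread().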


Sample data:

x <- list(structure(list(uuid = "123", relationships = structure(list(
    websites = structure(list(items = structure(list(properties.website_type = c("homepage", 
    "facebook", "twitter", "linkedin"), properties.url = c("www.example1.com", 
    "www.fbex1.com", "www.twitterex1.com", "www.linkedinex1.com"
    )), .Names = c("properties.website_type", "properties.url"
    ))), .Names = "items")), .Names = "websites")), .Names = c("uuid", 
"relationships")), structure(list(uuid = "987", relationships = structure(list(
    websites = structure(list(items = structure(list(properties.website_type = c("homepage", 
    "facebook", "twitter"), properties.url = c("www.example2.com", 
    "www.fbex2.com", "www.twitterex2.com")), .Names = c("properties.website_type", 
    "properties.url"))), .Names = "items")), .Names = "websites")), .Names = c("uuid", 
"relationships")))


Update: To fix the following error

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 1, 0

you need to remove the malformed elements from the input data, i.e. those where website_type has a value but properties.url is NULL. Run this chunk of code as a pre-processing step before executing the main solution:

# TRUE for elements whose url is NULL. Logical indexing is used because
# x[-idx] with an empty idx would drop every element instead of none.
bad <- vapply(x, function(k) is.null(k$relationships$websites$items$properties.url), logical(1))
x <- x[!bad]

Sample data to test this pre-processing step:

x <- list(structure(list(uuid = "123", relationships = structure(list(
    websites = structure(list(items = structure(list(properties.website_type = c("homepage", 
    "facebook", "twitter", "linkedin"), properties.url = c("www.example1.com", 
    "www.fbex1.com", "www.twitterex1.com", "www.linkedinex1.com"
    )), .Names = c("properties.website_type", "properties.url"
    ))), .Names = "items")), .Names = "websites")), .Names = c("uuid", 
"relationships")), structure(list(uuid = "987", relationships = structure(list(
    websites = structure(list(items = structure(list(properties.website_type = "homepage", 
        properties.url = NULL), .Names = c("properties.website_type", 
    "properties.url"))), .Names = "items")), .Names = "websites")), .Names = c("uuid", 
"relationships")), structure(list(uuid = "345", relationships = structure(list(
    websites = structure(list(items = structure(list(properties.website_type = "homepage", 
        properties.url = NULL), .Names = c("properties.website_type", 
    "properties.url"))), .Names = "items")), .Names = "websites")), .Names = c("uuid", 
"relationships")))
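
If you would rather keep firms with missing URLs and fill them with NA instead of dropping them, a base R alternative is sketched below. It assumes that only the four website types from the question matter, and that an element whose type and url lengths disagree should get NA for every URL. firm_row() and the types and cols vectors are hypothetical names introduced for this example, not part of any package.

```r
# Small self-contained input: a complete firm, one without linkedin,
# and one whose url is NULL (the case that breaks data.frame()).
x <- list(
  list(uuid = "123", relationships = list(websites = list(items = list(
    properties.website_type = c("homepage", "facebook", "twitter", "linkedin"),
    properties.url = c("www.example1.com", "www.fbex1.com",
                       "www.twitterex1.com", "www.linkedinex1.com"))))),
  list(uuid = "987", relationships = list(websites = list(items = list(
    properties.website_type = c("homepage", "facebook", "twitter"),
    properties.url = c("www.example2.com", "www.fbex2.com",
                       "www.twitterex2.com"))))),
  list(uuid = "345", relationships = list(websites = list(items = list(
    properties.website_type = "homepage",
    properties.url = NULL))))
)

# Website types to keep (assumption) and the column names they map to
types <- c("homepage", "facebook", "twitter", "linkedin")
cols  <- c("web.url", "web.facebook", "web.twitter", "web.linkedin")

# Hypothetical helper: turn one list element into a one-row data frame
firm_row <- function(k) {
  items <- k$relationships$websites$items
  type  <- items$properties.website_type
  url   <- items$properties.url
  if (is.null(url) || length(type) != length(url)) {
    # Types and urls cannot be paired reliably, so every URL becomes NA
    urls <- rep(NA_character_, length(types))
  } else {
    # match() yields NA for a type the firm does not have
    urls <- url[match(types, type)]
  }
  row <- c(list(uuid = k$uuid), as.list(setNames(urls, cols)))
  as.data.frame(row, stringsAsFactors = FALSE)
}

x.df <- do.call(rbind, lapply(x, firm_row))
```

Because every row has the same fixed set of columns, rbind() never sees mismatched lengths, and website types outside the four of interest are ignored.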

10 Comments

Great, that generally seems to be what I need. Runs perfectly with the example. However, when I try it with my full dataset, I always get an error message: "Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 1, 0" Any idea what the problem could be?
You probably have an element in your data where the number of values in $relationships$websites$items$properties.website_type and $relationships$websites$items$properties.url does not match, which is why data.frame throws this error. So first you need to decide how you want to handle such cases, i.e. a website_type is present but its url is missing.
Indeed, you are probably right! I didn't consider that case. If the url is missing, it should ideally be NA.
I think you are missing a point here. Consider this example and let me know the desired output - x <- structure(list(uuid = "123", relationships = structure(list(websites = structure(list( items = structure(list(properties.website_type = c("homepage", "facebook", "twitter", "linkedin"), properties.url = c("www.example1.com", "www.fbex1.com", "www.linkedinex1.com")), .Names = c("properties.website_type", "properties.url"))), .Names = "items")), .Names = "websites")), .Names = c("uuid", "relationships")) Here twitter has no url in this example and gives the same error.
Hello again. Sorry, I was in transit for some time and am now catching up. I think I understand the problem correctly. To keep it simple, I only want to consider the website, facebook, twitter and linkedin entries, even though that part of the list may contain more types of URLs. The desired output structure would then be exactly the one I posted in the original question. As for what happens when the number of values in url and website_type does not match, I have no strong preference. I think these cases are very rare, and they could also be deleted altogether.

I know this is a clunkier solution, but it helped me see the process step by step (run str(x_df) after each step to inspect the intermediate result):

library(tidyverse)

# Using your example
x <- list()
x[[1]] <- list(uuid = "123",
                    relationships = list(websites = list(items =  list(
                        properties.website_type = c("homepage", "facebook", "twitter", "linkedin"),
                        properties.url = c("www.example1.com", "www.fbex1.com", "www.twitterex1.com", "www.linkedinex1.com") ) )  ) )
x[[2]] <- list(uuid = "987",
                    relationships = list(websites = list(items =  list(
                        properties.website_type = c("homepage", "facebook", "twitter" ),
                        properties.url = c("www.example2.com", "www.fbex2.com", "www.twitterex2.com") ) )  ) )

 

# --- Iterations of unnest:
x_df <- x %>% tibble::as_tibble_col( .) %>%  
    tidyr::unnest_wider(col = "value")  %>% 
    tidyr::unnest_longer(col = "relationships")   %>%  
    tidyr::unnest_wider(col = "relationships")  %>%  
    tidyr::unnest_wider(col =  "items")  %>%  
    tidyr::unnest_longer(col = c("properties.website_type", "properties.url")) %>% 
# --- Lastly, group by id: 
    group_by(uuid) %>% 
    tidyr::pivot_wider(data = ., 
                             names_from = properties.website_type, 
                             values_from = c("properties.url"))
