
I have a dataset made up of 2 columns. The "WEBDATA" column contains a list in each cell. This is the first time I have had to deal with a dataset that contains lists, and I'm stuck...

My dataset looks like this:

WORD  |   WEBDATA
Home  |   list(Domain = c(77, 25, 7, 97, 71, 1, 42, 35, 37, 58, 9
Baby  |   list(Domain = c(77, 25, 7, 97, 71, 1, 42, 35, 37, 58, 9
Dog   |   list(Domain = c(77, 25, 7, 97, 71, 1, 42, 35, 37, 58, 9
Food  |   list(Domain = c(77, 25, 7, 97, 71, 1, 42, 35, 37, 58, 9

When I check the content of a single cell of the WEBDATA column, it returns this:

> dataset$WEBDATA[[1]]

   Domain
1  website1.com
2  mysuperwebsite.com
3  bestwebsite.uk

   Url
1  https://www.website1.com/product2/
2  https://www.mysuperwebsite.com/productB/
3  https://www.bestwebsite.uk/product67/

To confirm that the column really holds lists, and to see what one element looks like, I tried this:

class(dataset$WEBDATA)
[1] "list"

testdataset <- data.frame(dataset$WEBDATA[[2]])
    Domain              |  Url
1   website1.com        |  https://www.website1.com/product2/
2   mysuperwebsite.com  |  https://www.mysuperwebsite.com/productB/
3   bestwebsite.uk      |  https://www.bestwebsite.uk/product67/

My goal is to split the WEBDATA lists into several rows.

The final dataset should look like this:

WORD  |  Number |  Domain             |  Url
Home  |   1     |  website1.com       |  https://www.website1.com/product2/
Home  |   2     |  mysuperwebsite.com |  https://www.mysuperwebsite.com/productB/
Home  |   3     |  bestwebsite.uk     |  https://www.bestwebsite.uk/product67/
Baby  |   1     |  websitezz.uk       |  https://www.websitezz.uk/page/
Baby  |   2     |  websiteabc.com     |  https://www.websiteabc.com/post/
Baby  |   3     |  thewebsite.com     |  https://www.thewebsite.com/post75/

I thought of the strsplit() function, but I don't really know how to use it with lists. Can you please help?

Here is a sample dataset, you can paste it in R:

theDataReconstituted <- structure(list(
    WORD = structure(c(8L, 7L, 6L, 10L, 9L), .Label = c("dog dood", "dog foo", "dog food uk", "dog foof", "dogfood", "burns dog food", "canagan dog food", "dog food", "skinners dog food", "wainwrights dog food" ), class = "factor"), 
    WEBDATA = list(
        structure(list(
            Domain = structure(c(1L, 2L, 2L), .Label = c("pet-supermarket.co.uk", "petsathome.com" ), class = "factor"), 
            Url = structure(c(3L, 1L, 2L), .Label = c("petsathome.com/shop/en/pets/dog/dog-food-and-treats", "petsathome.com/shop/en/pets/dog/dog-food-and-treats/dry-dog-food", "pet-supermarket.co.uk/Dog/Dog-Food-Treats/Dog-Food/c/PSGB00070" ), class = "factor")), 
            .Names = c("Domain", "Url"), class = "data.frame", row.names = c(NA, -3L)), 
        structure(list(
            Domain = structure(c(1L, 1L, 1L), .Label = "canagan.co.uk", class = "factor"), 
            Url = structure(c(1L, 3L, 2L), .Label = c("canagan.co.uk/", "canagan.co.uk/products-cat.html", "canagan.co.uk/products.html" ), class = "factor")), 
            .Names = c("Domain", "Url"), class = "data.frame", row.names = c(NA, -3L)), 
        structure(list(
            Domain = structure(c(1L, 1L, 2L), .Label = c("burnspet.co.uk", "petsathome.com"), class = "factor"), 
            Url = structure(1:3, .Label = c("burnspet.co.uk/", "burnspet.co.uk/burns-dog-food-products/", "petsathome.com/shop/en/pets/merch-groups/burns" ), class = "factor")), 
            .Names = c("Domain", "Url"), class = "data.frame", row.names = c(NA, -3L)), 
        structure(list(
            Domain = structure(c(1L, 1L, 1L), .Label = "petsathome.com", class = "factor"), 
            Url = structure(c(2L, 3L, 1L), .Label = c("petsathome.com/shop/en/pets/merch-groups/feature/wainwrights-dog-food", "petsathome.com/shop/en/pets/merch-groups/mg-004", "petsathome.com/shop/en/pets/merch-groups/wainwrights-dog-" ), class = "factor")), 
            .Names = c("Domain", "Url"), class = "data.frame", row.names = c(NA, -3L)), 
        structure(list(
            Domain = structure(c(1L, 1L, 1L), .Label = "skinnerspetfoods.co.uk", class = "factor"), 
            Url = structure(c(1L, 3L, 2L), .Label = c("skinnerspetfoods.co.uk/", "skinnerspetfoods.co.uk/our-range/", "skinnerspetfoods.co.uk/product-category/field-trial-range/" ), class = "factor")), 
            .Names = c("Domain", "Url"), class = "data.frame", row.names = c(NA, -3L)))), 
    row.names = c(NA, -5L), 
    class = c("tbl_df", "tbl", "data.frame" ), 
    .Names = c("WORD", "WEBDATA"))
  • Can you edit with the results of calling dput on a representative sample of your data? You've got nested list columns such that it's not really possible for anyone to reproduce the precise situation otherwise. Commented Dec 16, 2017 at 17:22
  • How are you relating Home with Website1.com etc? It seems those sites belong to 2nd item that is associated with Baby. Commented Dec 16, 2017 at 17:23
  • Website1.com is contained in the list on the same row as Home. Thanks for noticing it, there was a mistake in the code above, I edited it. Commented Dec 16, 2017 at 17:26
  • @Remi As @alistaire requested, please put a subset of the output of dput(dataset) (perhaps the subset in your post). Commented Dec 16, 2017 at 17:35
  • Just library(tidyverse); theDataReconstituted %>% unnest() %>% group_by(WORD) %>% mutate(Number = row_number()) will work. You'll get some errors about coercing factors to character, but it's not causing any problems. Commented Dec 16, 2017 at 18:05

1 Answer

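As @alistaire said in a comment, the answer is:

library(tidyverse)
theDataReconstituted %>% 
  # expand each nested data frame into rows; tidyr >= 1.0.0 expects `cols`,
  # older versions also accept a bare unnest()
  unnest(cols = c(WEBDATA)) %>% 
  group_by(WORD) %>% 
  # number the rows within each WORD
  mutate(Number = row_number()) 

Depending on your tidyr and dplyr versions, you may get some warnings (not errors) about coercing factors to character, but they don't cause any problems.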
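
If you would rather not depend on the tidyverse, the same reshaping can be sketched in base R. This is only an illustration, not the approach from the comment above: the small `dataset` below is a made-up stand-in that reuses the column names from the question, and `pieces` and `flat` are hypothetical names chosen for this example.

```r
# Made-up stand-in for the question's data: a data frame whose WEBDATA
# column is a list of data frames (column names mirror the question).
dataset <- data.frame(WORD = c("Home", "Baby"), stringsAsFactors = FALSE)
dataset$WEBDATA <- list(
  data.frame(Domain = c("website1.com", "bestwebsite.uk"),
             Url = c("https://www.website1.com/product2/",
                     "https://www.bestwebsite.uk/product67/"),
             stringsAsFactors = FALSE),
  data.frame(Domain = "websitezz.uk",
             Url = "https://www.websitezz.uk/page/",
             stringsAsFactors = FALSE)
)

# One block per original row: repeat WORD, number the nested rows,
# then attach the nested data frame's own columns.
pieces <- Map(
  function(word, web) {
    data.frame(WORD = rep(word, nrow(web)),
               Number = seq_len(nrow(web)),
               web,
               stringsAsFactors = FALSE)
  },
  as.character(dataset$WORD), dataset$WEBDATA
)

# Stack the blocks into a single data frame and reset the row names.
flat <- do.call(rbind, unname(pieces))
rownames(flat) <- NULL
print(flat)
```

The `as.character()` call is there so the sketch should also work when WORD is a factor, as it is in `theDataReconstituted`.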
