This is a follow-up to my previous question, "Merging vectors of strings in a list in R".

I have tried an alternate approach using data.table.

I have a data.table G, built as follows:

d <- list( c("SD1:LUSH", "SD44:CANCEL", "SD384:FR563", "SD32:TRUMPET"), c("SD23:SWITCH", "SD1:LUSH", "SD567:TREK"), c("SD42:CRAYON", "SD345:FOX", "SD183:WIRE"), c("SD345:HOLE", "SD340:DUST", "SD387:ROLL"), c("SD455:TOMATO", "SD39:MATURE"), c("SD12:PAINTING", "SD315:MONEY31", "SD387:SPRING"),  c("SD32:TRUMPET", "SD1:FIELD"))
# Keep only the SD identifier (the part before ":") of each element
d2 <- lapply(d, function(x) sapply(strsplit(x, ":"), "[", 1))

# Collapse each vector into one comma-separated string per row
d <- data.frame(sdw = sapply(d, paste0, collapse = ", "),
                sd  = sapply(d2, paste0, collapse = ", "),
                stringsAsFactors = FALSE)

library(data.table)
G <- data.table(d, key = "sd")
G
                                                sdw                     sd
1: SD1:LUSH, SD44:CANCEL, SD384:FR563, SD32:TRUMPET SD1, SD44, SD384, SD32
2:       SD12:PAINTING, SD315:MONEY31, SD387:SPRING     SD12, SD315, SD387
3:                SD23:SWITCH, SD1:LUSH, SD567:TREK       SD23, SD1, SD567
4:                          SD32:TRUMPET, SD1:FIELD              SD32, SD1
5:               SD345:HOLE, SD340:DUST, SD387:ROLL    SD345, SD340, SD387
6:               SD42:CRAYON, SD345:FOX, SD183:WIRE     SD42, SD345, SD183
7:                        SD455:TOMATO, SD39:MATURE            SD455, SD39

I am trying to aggregate elements in column sdw based on elements in column sd.

The bracketed numbers below refer to positions in the original list d. [1], [2] and [7] have SD1 in common, so their corresponding sdw elements should be merged. [1] and [7] additionally share SD32.

[4] has SD345 in common with [3] and SD387 in common with [6], so the sdw elements of [3], [4] and [6] should be merged.

[5] has no SD__ term in common with any other vector, so it should remain as is.

In short, I want to aggregate the G$sdw elements based on overlapping SD__ terms in G$sd.

The output I am looking for is as follows, with just three rows:

[1] "SD1:LUSH, SD1:FIELD, SD23:SWITCH, SD32:TRUMPET, SD44:CANCEL, SD384:FR563, SD567:TREK"
[2] "SD12:PAINTING, SD42:CRAYON, SD183:WIRE, SD340:DUST, SD345:FOX, SD345:HOLE, SD387:SPRING, SD387:ROLL"
[3] "SD455:TOMATO, SD39:MATURE"
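
Put differently, the rule is transitive: two rows belong together if they share an SD__ term directly or through a chain of other rows, which is the "connected components" idea from graph theory. Below is a minimal base-R sketch of that rule, intended only to pin down the expected grouping on the toy data. The helper name group_by_overlap is hypothetical, and the loop is not tuned for large inputs.

```r
d <- list(c("SD1:LUSH", "SD44:CANCEL", "SD384:FR563", "SD32:TRUMPET"),
          c("SD23:SWITCH", "SD1:LUSH", "SD567:TREK"),
          c("SD42:CRAYON", "SD345:FOX", "SD183:WIRE"),
          c("SD345:HOLE", "SD340:DUST", "SD387:ROLL"),
          c("SD455:TOMATO", "SD39:MATURE"),
          c("SD12:PAINTING", "SD315:MONEY31", "SD387:SPRING"),
          c("SD32:TRUMPET", "SD1:FIELD"))

# SD identifiers only (everything before the first ":")
ids <- lapply(d, function(x) sub(":.*", "", x))

# Hypothetical helper: give the same group number to rows that share an
# SD term, directly or through a chain of other rows.
group_by_overlap <- function(ids) {
  grp <- seq_along(ids)                    # every row starts in its own group
  for (term in unique(unlist(ids))) {
    # rows that contain this SD term
    rows <- which(vapply(ids, function(x) term %in% x, logical(1)))
    # merge every group touched by these rows into the lowest-numbered one
    grp[grp %in% grp[rows]] <- min(grp[rows])
  }
  match(grp, unique(grp))                  # renumber groups as 1, 2, 3, ...
}

grp <- group_by_overlap(ids)
grp
# 1 1 2 2 3 2 1

# Collapse the sdw elements of each group into one string
merged <- vapply(split(d, grp),
                 function(x) paste(unique(unlist(x)), collapse = ", "),
                 character(1))
merged
```

Here rows 1, 2 and 7 of d form one group, rows 3, 4 and 6 a second, and row 5 stays on its own. The merged strings keep the elements in order of first appearance; wrap unique(unlist(x)) in sort() if alphabetical order is preferred.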

I have tried the data.table package as follows:

# Extract the individual SD terms from G$sd (one row per term)
G <- G[ , list( ID = unlist( strsplit( sd , "," ) ) ) , by = list(sdw, sd) ]
G$ID <- gsub(" ", "", G$ID)
G <- data.table( G , key = "ID" )

# Merge according to common IDs
G2 <- G[, list(Gp1 = paste0(sort(unique(unlist(strsplit(sdw, split=", ")))), collapse=", "),
               Gp2 = paste0(sort(unique(unlist(strsplit(sd, split=", ")))), collapse=", ")), by = "ID"]

# Keep one row per distinct Gp2
G2 <- data.table( G2, key="Gp2")
G2 <- unique(G2, by = "Gp2")
G2

ID                                                                                  Gp1                                 Gp2
1:   SD1 SD1:FIELD, SD1:LUSH, SD23:SWITCH, SD32:TRUMPET, SD384:FR563, SD44:CANCEL, SD567:TREK SD1, SD23, SD32, SD384, SD44, SD567
2:  SD23                                                    SD1:LUSH, SD23:SWITCH, SD567:TREK                    SD1, SD23, SD567
3:  SD32                          SD1:FIELD, SD1:LUSH, SD32:TRUMPET, SD384:FR563, SD44:CANCEL              SD1, SD32, SD384, SD44
4: SD387       SD12:PAINTING, SD315:MONEY31, SD340:DUST, SD345:HOLE, SD387:ROLL, SD387:SPRING    SD12, SD315, SD340, SD345, SD387
5:  SD12                                           SD12:PAINTING, SD315:MONEY31, SD387:SPRING                  SD12, SD315, SD387
6: SD345               SD183:WIRE, SD340:DUST, SD345:FOX, SD345:HOLE, SD387:ROLL, SD42:CRAYON    SD183, SD340, SD345, SD387, SD42
7: SD183                                                   SD183:WIRE, SD345:FOX, SD42:CRAYON                  SD183, SD345, SD42
8: SD340                                                   SD340:DUST, SD345:HOLE, SD387:ROLL                 SD340, SD345, SD387
9:  SD39                                                            SD39:MATURE, SD455:TOMATO                         SD39, SD455

This only merges rows that share one particular SD__ term in G$sd. It does not account for rows that have several terms in common, nor for a row that shares different terms with different rows, in which case all of those rows should end up in a single group.

Is there any way to achieve the desired output in R? My full dataset has thousands of such rows.
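
For a dataset of that size, one direction that might be worth exploring is to hand the grouping to a graph library. The sketch below assumes the igraph package is installed. It links each row to the SD__ terms it contains and reads the groups off the connected components. The "row" prefix and the names edges, row_grp and merged are illustrative choices, not part of my existing code.

```r
library(igraph)  # assumption: igraph is installed

d <- list(c("SD1:LUSH", "SD44:CANCEL", "SD384:FR563", "SD32:TRUMPET"),
          c("SD23:SWITCH", "SD1:LUSH", "SD567:TREK"),
          c("SD42:CRAYON", "SD345:FOX", "SD183:WIRE"),
          c("SD345:HOLE", "SD340:DUST", "SD387:ROLL"),
          c("SD455:TOMATO", "SD39:MATURE"),
          c("SD12:PAINTING", "SD315:MONEY31", "SD387:SPRING"),
          c("SD32:TRUMPET", "SD1:FIELD"))

# One edge per (row, SD term) pair; the "row" prefix keeps row labels
# from clashing with SD identifiers
edges <- data.frame(
  row  = paste0("row", rep(seq_along(d), lengths(d))),
  term = sub(":.*", "", unlist(d)),
  stringsAsFactors = FALSE
)

g <- graph_from_data_frame(edges, directed = FALSE)

# Component membership is named by vertex; keep only the row vertices
memb <- components(g)$membership
row_grp <- memb[paste0("row", seq_along(d))]

# Collapse the sdw elements of each component into one string
merged <- vapply(split(d, row_grp),
                 function(x) paste(unique(unlist(x)), collapse = ", "),
                 character(1))
merged
```

On the toy data this yields three strings, one per component. Whether it scales acceptably to the full dataset is something I have not verified.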
