This is a follow-up to my previous question, "Merging vectors of strings in a list in R".
I have tried an alternative approach using data.table.
I have a data.table G, built as follows:
d <- list( c("SD1:LUSH", "SD44:CANCEL", "SD384:FR563", "SD32:TRUMPET"), c("SD23:SWITCH", "SD1:LUSH", "SD567:TREK"), c("SD42:CRAYON", "SD345:FOX", "SD183:WIRE"), c("SD345:HOLE", "SD340:DUST", "SD387:ROLL"), c("SD455:TOMATO", "SD39:MATURE"), c("SD12:PAINTING", "SD315:MONEY31", "SD387:SPRING"), c("SD32:TRUMPET", "SD1:FIELD"))
# Keep only the SD identifier (the part before ":") of each element
d2 <- lapply(d, function(x) sapply(strsplit(x, ":"), "[", 1))
# Collapse each vector into one comma-separated string per row
d <- data.frame(sdw = sapply(d, paste0, collapse = ", "),
                sd = sapply(d2, paste0, collapse = ", "),
                stringsAsFactors = FALSE)
library(data.table)
G <- data.table(d, key = "sd")
sdw sd
1: SD1:LUSH, SD44:CANCEL, SD384:FR563, SD32:TRUMPET SD1, SD44, SD384, SD32
2: SD12:PAINTING, SD315:MONEY31, SD387:SPRING SD12, SD315, SD387
3: SD23:SWITCH, SD1:LUSH, SD567:TREK SD23, SD1, SD567
4: SD32:TRUMPET, SD1:FIELD SD32, SD1
5: SD345:HOLE, SD340:DUST, SD387:ROLL SD345, SD340, SD387
6: SD42:CRAYON, SD345:FOX, SD183:WIRE SD42, SD345, SD183
7: SD455:TOMATO, SD39:MATURE SD455, SD39
I am trying to aggregate elements in column sdw based on elements in column sd.
The bracketed numbers below refer to positions in the original list d, not to the keyed row order of G.
[1], [2] and [7] have SD1 in common, so their corresponding sdw elements should be merged together. [1] and [7] additionally share SD32.
[4] has SD345 in common with [3] and SD387 in common with [6], so the sdw elements of [3], [4] and [6] should be merged together.
[5] has no SD__ term in common with any other vector, so it should remain as it is.
In short, I want to aggregate the G$sdw elements based on overlapping SD__ terms in G$sd.
The output I am looking for has just three rows:
[1] "SD1:LUSH, SD1:FIELD, SD23:SWITCH, SD32:TRUMPET, SD44:CANCEL, SD384:FR563, SD567:TREK"
[2] "SD12:PAINTING, SD42:CRAYON, SD183:WIRE, SD315:MONEY31, SD340:DUST, SD345:FOX, SD345:HOLE, SD387:SPRING, SD387:ROLL"
[3] "SD455:TOMATO, SD39:MATURE"
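To make the rule concrete, here is a minimal base R sketch of the grouping I have in mind, written against the original list order rather than the keyed order of G. The helper name group_by_overlap is made up for illustration. It relabels whole groups one SD term at a time, which is simple but probably too slow for my full dataset, so please treat it as a statement of the problem rather than a solution.

```r
# Hypothetical helper: rows that share any SD term, directly or through a
# chain of other rows, end up with the same group number.
group_by_overlap <- function(sd) {
  terms <- strsplit(sd, ", ", fixed = TRUE)  # SD terms of each row
  grp <- seq_along(terms)                    # every row starts in its own group
  for (term in unique(unlist(terms))) {
    # rows that contain this term
    rows <- which(vapply(terms, function(x) term %in% x, logical(1)))
    # every group touched by those rows collapses into the smallest label
    touched <- unique(grp[rows])
    if (length(touched) > 1) grp[grp %in% touched] <- min(touched)
  }
  match(grp, unique(grp))                    # renumber groups as 1, 2, 3, ...
}

sd <- c("SD1, SD44, SD384, SD32", "SD23, SD1, SD567", "SD42, SD345, SD183",
        "SD345, SD340, SD387", "SD455, SD39", "SD12, SD315, SD387", "SD32, SD1")
sdw <- c("SD1:LUSH, SD44:CANCEL, SD384:FR563, SD32:TRUMPET",
         "SD23:SWITCH, SD1:LUSH, SD567:TREK",
         "SD42:CRAYON, SD345:FOX, SD183:WIRE",
         "SD345:HOLE, SD340:DUST, SD387:ROLL",
         "SD455:TOMATO, SD39:MATURE",
         "SD12:PAINTING, SD315:MONEY31, SD387:SPRING",
         "SD32:TRUMPET, SD1:FIELD")

grp <- group_by_overlap(sd)
grp
# 1 1 2 2 3 2 1

# Merge the sdw strings within each group, dropping duplicated elements
merged <- vapply(split(sdw, grp), function(x) {
  paste(unique(unlist(strsplit(x, ", ", fixed = TRUE))), collapse = ", ")
}, character(1))
```

Here merged has three elements, one per group; the elements within each string appear in order of first occurrence rather than sorted as in the desired output above.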
I have tried the data.table package as follows:
# Extract SDs from G$sd
G <- G[ , list( ID = unlist( strsplit( sd , "," ) ) ) , by = list(sdw, sd) ]
G$ID <- gsub(" ", "", G$ID)
G <- data.table( G , key = "ID" )
# Merge according to common IDs
G2 <- G[, list(Gp1 = paste0(sort(unique(unlist(strsplit(sdw, split=", ")))), collapse=", "),
Gp2 = paste0(sort(unique(unlist(strsplit(sd, split=", ")))), collapse=", ")) , by = "ID"]
G2 <- data.table( G2, key="Gp2")
G2 <- unique(G2, by = "Gp2")
G2
ID Gp1 Gp2
1: SD1 SD1:FIELD, SD1:LUSH, SD23:SWITCH, SD32:TRUMPET, SD384:FR563, SD44:CANCEL, SD567:TREK SD1, SD23, SD32, SD384, SD44, SD567
2: SD23 SD1:LUSH, SD23:SWITCH, SD567:TREK SD1, SD23, SD567
3: SD32 SD1:FIELD, SD1:LUSH, SD32:TRUMPET, SD384:FR563, SD44:CANCEL SD1, SD32, SD384, SD44
4: SD387 SD12:PAINTING, SD315:MONEY31, SD340:DUST, SD345:HOLE, SD387:ROLL, SD387:SPRING SD12, SD315, SD340, SD345, SD387
5: SD12 SD12:PAINTING, SD315:MONEY31, SD387:SPRING SD12, SD315, SD387
6: SD345 SD183:WIRE, SD340:DUST, SD345:FOX, SD345:HOLE, SD387:ROLL, SD42:CRAYON SD183, SD340, SD345, SD387, SD42
7: SD183 SD183:WIRE, SD345:FOX, SD42:CRAYON SD183, SD345, SD42
8: SD340 SD340:DUST, SD345:HOLE, SD387:ROLL SD340, SD345, SD387
9: SD39 SD39:MATURE, SD455:TOMATO SD39, SD455
This only merges rows that share a single given SD__ term in G$sd. It does not account for rows that have several terms in common, nor for a row that shares different terms with different rows, so overlapping groups are never combined.
Is there any way to achieve the desired output in R? My full dataset has thousands of such rows.