Merging/Collapsing data.table rows based on common strings

Asked 11 years, 7 months ago

Modified 10 years, 11 months ago

Viewed 587 times

This is a follow-up to my previous question, "Merging vectors of strings in a list in R".

This time I have tried an alternative approach using data.table.

I have a data.table G, built as follows:

library(data.table)

d <- list(c("SD1:LUSH", "SD44:CANCEL", "SD384:FR563", "SD32:TRUMPET"),
          c("SD23:SWITCH", "SD1:LUSH", "SD567:TREK"),
          c("SD42:CRAYON", "SD345:FOX", "SD183:WIRE"),
          c("SD345:HOLE", "SD340:DUST", "SD387:ROLL"),
          c("SD455:TOMATO", "SD39:MATURE"),
          c("SD12:PAINTING", "SD315:MONEY31", "SD387:SPRING"),
          c("SD32:TRUMPET", "SD1:FIELD"))

# Keep only the SD identifier (the part before ":") of each string
d2 <- lapply(d, function(x) sapply(strsplit(x, ":"), "[", 1))

# Collapse each vector into a single comma-separated string, one row per vector
G <- data.table(sdw = sapply(d, paste0, collapse = ", "),
                sd  = sapply(d2, paste0, collapse = ", "),
                key = "sd")
G
                                                sdw                     sd
1: SD1:LUSH, SD44:CANCEL, SD384:FR563, SD32:TRUMPET SD1, SD44, SD384, SD32
2:       SD12:PAINTING, SD315:MONEY31, SD387:SPRING     SD12, SD315, SD387
3:                SD23:SWITCH, SD1:LUSH, SD567:TREK       SD23, SD1, SD567
4:                          SD32:TRUMPET, SD1:FIELD              SD32, SD1
5:               SD345:HOLE, SD340:DUST, SD387:ROLL    SD345, SD340, SD387
6:               SD42:CRAYON, SD345:FOX, SD183:WIRE     SD42, SD345, SD183
7:                        SD455:TOMATO, SD39:MATURE            SD455, SD39

I am trying to aggregate the elements of column sdw based on the elements of column sd. The indices below refer to the elements of the original list d, not to the sorted rows of G.

[1], [2] and [7] have SD1 in common, so their sdw elements should be merged. [1] and [7] also share SD32.

[4] has SD345 in common with [3] and SD387 in common with [6], so the sdw elements of [3], [4] and [6] should be merged.

[5] has no SD__ term in common with any other vector, so it should remain as it is.

In short, I want to aggregate the G$sdw elements based on overlapping SD__ terms in G$sd.

The output I am looking for has just three rows:

[1] "SD1:LUSH, SD1:FIELD, SD23:SWITCH, SD32:TRUMPET, SD44:CANCEL, SD384:FR563, SD567:TREK"
[2] "SD12:PAINTING, SD42:CRAYON, SD183:WIRE, SD315:MONEY31, SD340:DUST, SD345:FOX, SD345:HOLE, SD387:SPRING, SD387:ROLL"
[3] "SD455:TOMATO, SD39:MATURE"
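The grouping rule described above (merge vectors that share an SD__ term, directly or through a chain of other vectors) can be read as finding connected components. The following is only a minimal base R sketch of that reading, not a tested solution for the full dataset; the function names merge_by_shared_id and find_root are hypothetical, and the strings inside each merged group are ordered by a plain sort() rather than in the order shown in the desired output.

```r
# Sample data from the question.
d <- list(c("SD1:LUSH", "SD44:CANCEL", "SD384:FR563", "SD32:TRUMPET"),
          c("SD23:SWITCH", "SD1:LUSH", "SD567:TREK"),
          c("SD42:CRAYON", "SD345:FOX", "SD183:WIRE"),
          c("SD345:HOLE", "SD340:DUST", "SD387:ROLL"),
          c("SD455:TOMATO", "SD39:MATURE"),
          c("SD12:PAINTING", "SD315:MONEY31", "SD387:SPRING"),
          c("SD32:TRUMPET", "SD1:FIELD"))

# Hypothetical helper: merge list elements that share any SD identifier,
# using a simple union-find over the list indices.
merge_by_shared_id <- function(d) {
  ids <- lapply(d, function(x) sub(":.*$", "", x))  # "SD1:LUSH" -> "SD1"
  parent <- seq_along(d)                            # each element starts as its own group

  find_root <- function(i) {
    while (parent[i] != i) i <- parent[i]
    i
  }

  # For every identifier, collect the list elements that contain it ...
  element_index <- rep(seq_along(d), lengths(ids))
  members_by_id <- split(element_index, unlist(ids))

  # ... and join the groups of all those elements.
  for (members in members_by_id) {
    roots <- vapply(members, find_root, integer(1))
    parent[roots] <- min(roots)
  }

  # Collapse the original strings of each group into one string.
  group <- vapply(seq_along(d), find_root, integer(1))
  vapply(split(seq_along(d), group),
         function(i) paste(sort(unique(unlist(d[i]))), collapse = ", "),
         character(1), USE.NAMES = FALSE)
}

merged <- merge_by_shared_id(d)
length(merged)  # 3 groups for the sample data
```

Each identifier is visited once, so the amount of work grows roughly with the total number of strings, which should be manageable for a few thousand rows.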

I have tried the data.table package as follows:

# Extract the individual SD identifiers from G$sd (one row per identifier)
G <- G[ , list( ID = unlist( strsplit( sd , "," ) ) ) , by = list(sdw, sd) ]
G$ID <- gsub(" ", "", G$ID)
G <- data.table( G , key = "ID" )

# Merge according to common IDs
G2 <- G[, list(Gp1 = paste0(sort(unique(unlist(strsplit(sdw, split=", ")))), collapse=", "),
               Gp2 = paste0(sort(unique(unlist(strsplit(sd, split=", ")))), collapse=", "))  , by = "ID"]

# Key the merged table (G2, not G) by Gp2 and keep one row per distinct Gp2;
# recent data.table versions compare all columns unless `by` is given
G2 <- data.table( G2, key = "Gp2" )
G2 <- unique(G2, by = "Gp2")
G2

      ID                                                                                  Gp1                                 Gp2
1:   SD1 SD1:FIELD, SD1:LUSH, SD23:SWITCH, SD32:TRUMPET, SD384:FR563, SD44:CANCEL, SD567:TREK SD1, SD23, SD32, SD384, SD44, SD567
2:  SD23                                                    SD1:LUSH, SD23:SWITCH, SD567:TREK                    SD1, SD23, SD567
3:  SD32                          SD1:FIELD, SD1:LUSH, SD32:TRUMPET, SD384:FR563, SD44:CANCEL              SD1, SD32, SD384, SD44
4: SD387       SD12:PAINTING, SD315:MONEY31, SD340:DUST, SD345:HOLE, SD387:ROLL, SD387:SPRING    SD12, SD315, SD340, SD345, SD387
5:  SD12                                           SD12:PAINTING, SD315:MONEY31, SD387:SPRING                  SD12, SD315, SD387
6: SD345               SD183:WIRE, SD340:DUST, SD345:FOX, SD345:HOLE, SD387:ROLL, SD42:CRAYON    SD183, SD340, SD345, SD387, SD42
7: SD183                                                   SD183:WIRE, SD345:FOX, SD42:CRAYON                  SD183, SD345, SD42
8: SD340                                                   SD340:DUST, SD345:HOLE, SD387:ROLL                 SD340, SD345, SD387
9:  SD39                                                            SD39:MATURE, SD455:TOMATO                         SD39, SD455

This only merges rows that directly share a given SD__ term in G$sd. It does not account for rows that share several terms, nor for a row that shares different terms with different rows, so chains of overlapping rows are not collapsed into a single group.

Is there any way to achieve the desired output in R? My full dataset has thousands of such rows.
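One possible way to get past a single grouping pass is to repeat it until the group labels stop changing. The sketch below assumes data.table is installed and uses a four-row stand-in for G built from the question's values; the column names rid, grp and new are made up for illustration. It is an untuned illustration of the idea, not a benchmarked solution.

```r
library(data.table)

# Small stand-in for G (values taken from the question's data).
G <- data.table(
  sdw = c("SD42:CRAYON, SD345:FOX, SD183:WIRE",
          "SD345:HOLE, SD340:DUST, SD387:ROLL",
          "SD12:PAINTING, SD315:MONEY31, SD387:SPRING",
          "SD455:TOMATO, SD39:MATURE")
)
G[, sd := gsub(":[^,]*", "", sdw)]  # strip ":WORD", leaving "SD42, SD345, SD183"
G[, rid := .I]                      # row identifier

# One row per (row, identifier) pair; every row starts in its own group.
long <- G[, .(ID = unlist(strsplit(sd, ", ", fixed = TRUE))), by = rid]
long[, grp := rid]

repeat {
  long[, new := min(grp), by = ID]   # smallest label among rows sharing this ID
  long[, new := min(new), by = rid]  # smallest label across all IDs of this row
  if (all(long$new == long$grp)) break
  long[, grp := new]
}

# Copy the final labels back to G and collapse sdw within each group.
G[unique(long[, .(rid, grp)]), on = "rid", grp := i.grp]
merged <- G[, .(sdw = paste(sort(unique(unlist(strsplit(sdw, ", ", fixed = TRUE)))),
                            collapse = ", ")),
            by = grp]
nrow(merged)  # 2 groups for this stand-in
```

In this stand-in the first and third rows are linked only through the second row, so the first pass leaves them in different groups and a second pass is needed before they share a label.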

edited May 23, 2017 at 10:32 by Community Bot

asked May 6, 2014 at 8:07 by Crops

Comment: possible duplicate of "Merge two dataframes containing duplicate elements" (Paul Sweatte, Oct 9, 2014 at 16:33)
