Concatenate two strings with common elements

Question

I am working on a simple problem in R (but I have not figured it out yet ;p):

Given a vector vect1 <- c("Andy+Pete", "Mary + Pete", "Pete+ Amada", ..., "Amada + Steven", "Steven + Henry"), I want to create a new vector vect2 that contains all the elements of vect1 plus new elements built by the following rule: for every two strings "A+B" and "B+C", we concatenate them into "A+C" and add this new element to vect2. Can anyone please help me do this?

Also, I want to get all the elements standing in front of + in each string. Is the following code correct?

for (i in length(vect1)){ vect3[i] <- regexpr(".*+", vect1[i]) }

3rd question: if I have a dataframe d with a Date column in the format %d-%b (for example, 01-Apr), how do I sort this dataframe in increasing order of Date? Let's just say d <- c(01-Apr,01-Mar,02-Jan,31-June,30-May).

Are the elements of vect1 always two people, or can it be 1 or 3+? This sounds like combinatorial "fun".
– r2evans, Commented Mar 1, 2018 at 18:57
Pretty sure you're going to need to split vect1 into separate columns for the pairs.
– Mako212, Commented Mar 1, 2018 at 18:57
@r2evans: it is ALWAYS two people, fortunately. How is my for loop code? @ManishSaraswat Saraswat: Yes, an example would be "Mary + Pete" & "Pete + Amada" (column 2 and 3) = "Mary + Amada". So the new vector would have the size of vect1 + all the new concatenated elements like this.
– user177196, Commented Mar 1, 2018 at 19:12

ClementWalter · Accepted Answer · 2018-03-01 19:45:29Z

1

I think you could (should) avoid both for loops and external libraries when they are not required.

So this might be a solution:

# create data
vect1 <- c("Andy+Pete", "Mary + Pete", "Pete+ Amada", "Amada + Steven", "Steven + Henry")

# build a matrix of pairs, with white space removed
pairsMatrix <- do.call(rbind, sapply(vect1, function(v) strsplit(gsub(pattern = " ", replacement = "", x = v), "\\+")))

# remove dimnames (not strictly necessary)
dimnames(pairsMatrix) <- NULL

# for each row of pairsMatrix, check whether its second element appears as the first element of another row; bind the results to the original pairs
allPairs <- do.call(rbind, c(list(pairsMatrix), apply(pairsMatrix, 1, function(names) c(names[1], pairsMatrix[names[2]==pairsMatrix[,1], 2]))))

# filter out self-relationships (a row without a match is recycled by rbind into a pair such as "Steven", "Steven")
allPairs[allPairs[,1]!=allPairs[,2],]

      [,1]     [,2]    
 [1,] "Andy"   "Pete"  
 [2,] "Mary"   "Pete"  
 [3,] "Pete"   "Amada" 
 [4,] "Amada"  "Steven"
 [5,] "Steven" "Henry" 
 [6,] "Andy"   "Amada" 
 [7,] "Mary"   "Amada" 
 [8,] "Pete"   "Steven"
 [9,] "Amada"  "Henry"
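The apply() step above relies on each name having at most one follow-up pair, and on at least one row having none, so that apply() returns a list. As a hedged alternative, the same "second element equals another row's first element" rule can be sketched as a self-join with base R's merge(). This is not the answer's original method; the names left, right, joined and new_pairs are made up for illustration.

```r
vect1 <- c("Andy+Pete", "Mary + Pete", "Pete+ Amada", "Amada + Steven", "Steven + Henry")

# Split on a literal "+" and trim the surrounding white space
parts  <- strsplit(vect1, "+", fixed = TRUE)
first  <- trimws(vapply(parts, `[`, character(1), 1))
second <- trimws(vapply(parts, `[`, character(1), 2))

# Two views of the same pairs: "A+B" as (a, b) and "B+C" as (b, c)
left  <- data.frame(a = first, b = second, stringsAsFactors = FALSE)
right <- data.frame(b = first, c = second, stringsAsFactors = FALSE)

# Self-join on the shared name b, then drop self-relationships
joined <- merge(left, right, by = "b")
joined <- joined[joined$a != joined$c, ]

# Build the new "A+C" strings and append them to the original vector
new_pairs <- unique(paste(joined$a, joined$c, sep = "+"))
vect2 <- c(vect1, new_pairs)
vect2
```

Because merge() returns every matching combination, a name with several follow-up pairs yields one new element per combination. The order of the new elements follows the join key, so it may differ from the order in the matrix above.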

Concerning your last point, I think a simple sort on a proper Date object will do it.

answered Mar 1, 2018 at 19:45

ClementWalter


1 Comment

Balter Over a year ago

Probably a better solution. I was too lazy to avoid the loops :). You got my upvote.

Balter · Accepted Answer · 2018-03-01 19:32:47Z

1

I think this should do it but I did things I probably shouldn't do... like growing objects and nesting for loops. If you want to access all elements in front of the '+', just use name.matrix[,1].

vect1 <- c("Andy+Pete", "Mary + Pete", "Pete+ Amada","Amada + Steven", "Steven + Henry")

library(stringr)

name.matrix <- matrix(do.call('rbind',str_split(vect1, pattern = "\\s?[+]\\s?")), ncol = 2)

new.stuff <- c()

for(x in unique(name.matrix[,2])){
  sub.mat.1 <- matrix(name.matrix[name.matrix[,2] == x,], ncol = 2)
  sub.mat.2 <- matrix(name.matrix[name.matrix[,1] == x,], ncol = 2)
  if(length(sub.mat.1) && length(sub.mat.2)){
    for(y in seq_along(sub.mat.1[,2])){
      new.add <- paste0(sub.mat.1[y,1],'+', sub.mat.2[,2])
      new.stuff <- c(new.stuff, new.add)
    }
  }
}

vect2 <- c(vect1, new.stuff)
vect2
#[1] "Andy+Pete"      "Mary + Pete"    "Pete+ Amada"    "Amada + Steven" "Steven + Henry" "Andy+Amada"    
#[7] "Mary+Amada"     "Pete+Steven"    "Amada+Henry"
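On the loop from the question: for (i in length(vect1)) visits only the last index (seq_along(vect1) would visit all of them), regexpr() returns match positions rather than the matched text, and the unescaped + in ".*+" is a regex operator, not a literal plus sign. A minimal base R sketch that avoids the loop altogether, assuming every string contains exactly one "+":

```r
vect1 <- c("Andy+Pete", "Mary + Pete", "Pete+ Amada", "Amada + Steven", "Steven + Henry")

# Delete everything from the literal "+" to the end of the string,
# then trim the white space left in front of the "+"
vect3 <- trimws(sub("\\+.*$", "", vect1))
vect3
```

sub() is vectorized over vect1, so no index bookkeeping or preallocation of vect3 is needed.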

Update:

Third question. Well, there are only 30 days in June, so you're going to get an NA there. Since the strings carry no year, as.Date fills in the current year, hence the 2018 in the output below. If it's a data.frame that you're trying to sort by date, you'll need the form df[order(df$Date),]. The lubridate package might also be helpful when working with dates.

d <- c('01-Apr','01-Mar','02-Jan','31-June','30-May')

d.new <- as.Date(d, format = '%d-%b')
d.new <- d.new[order(d.new)]
d.new
#[1] "2018-01-02" "2018-03-01" "2018-04-01" "2018-05-30" NA
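For the data.frame case, here is a hedged sketch that orders the rows without converting to Date at all, which sidesteps both the implicit year and the locale dependence of %b. The value column is hypothetical, "31-June" has been replaced by the valid "30-Jun", and the code assumes two-digit days followed by English month names.

```r
# Hypothetical data frame; only the Date column comes from the question
d <- data.frame(Date  = c("01-Apr", "01-Mar", "02-Jan", "30-Jun", "30-May"),
                value = 1:5,
                stringsAsFactors = FALSE)

# Characters 1-2 are the day; characters 4-6 are looked up in the
# built-in constant month.abb to get a month number from 1 to 12
day   <- as.integer(substr(d$Date, 1, 2))
month <- match(substr(d$Date, 4, 6), month.abb)

# Order by month first, then by day within the month
d_sorted <- d[order(month, day), ]
d_sorted$Date
```

Note the comma in d[order(month, day), ]: it selects whole rows of the data.frame, whereas a plain vector is indexed without it.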

edited Mar 1, 2018 at 19:32

answered Mar 1, 2018 at 19:16

Balter


7 Comments

user177196 Over a year ago

Thank you so much for your help. What a solution!! Could you also please help with the 3rd question too?

Balter Over a year ago

No problem. Ok, I wrote a response.

user177196 Over a year ago

I tried that, but R just froze (my dataset has 85.56M+ rows). I wonder if this is because the entries are in double quotation marks?

Balter Over a year ago

Your dataset has over 85 million rows? No, I suspect it froze because of the sheer cost of sorting that many rows. Double quotation marks shouldn't matter. Try breaking off a piece of the huge dataset and sorting that, just to see if it's working.

Balter Over a year ago

Sorry I've been traveling. It should be as you wrote it, if it was a data.frame. But the example I gave is just a vector and therefore, you don't need the comma.
