
I am working on what should be a simple problem in R, but I have not figured it out yet:

Given a vector vect1 <- c("Andy+Pete", "Mary + Pete", "Pete+ Amada", ..., "Amada + Steven", "Steven + Henry"), I want to create a new vector vect2 that contains all the elements of vect1 plus new elements built as follows: for every two strings "A+B" and "B+C", combine them into "A+C" and add this new element to vect2. Can anyone please help me do this?

Also, I want to get the element standing in front of the + in each string. Is the following code correct?

for (i in length(vect1)){ vect3[i] <- regexpr(".*+", vect1[i]) }

Third question: if I have a data frame d with a Date column in the format %d-%b (for example, 01-Apr), how do I sort this data frame in increasing order of Date? Let's just say d <- c(01-Apr,01-Mar,02-Jan,31-June,30-May).

  • Are the elements of vect1 always two people, or can it be 1 or 3+? This sounds like combinatorial "fun". Commented Mar 1, 2018 at 18:57
  • Pretty sure you're going to need to split vect1 into separate columns for the pairs. Commented Mar 1, 2018 at 18:57
  • Can you provide an example? Commented Mar 1, 2018 at 18:59
  • @r2evans: it is ALWAYS two people, fortunately. How is my for loop code? @ManishSaraswat: Yes, an example would be "Mary + Pete" & "Pete + Amada" (column 2 and 3) = "Mary + Amada". So the new vector would have the size of vect1 plus all the new concatenated elements like this. Commented Mar 1, 2018 at 19:12

2 Answers


I think you can (and should) avoid both for loops and external libraries when they are not required.

So this might be a solution:

# create data
vect1 <- c("Andy+Pete", "Mary + Pete", "Pete+ Amada", "Amada + Steven", "Steven + Henry")

# create a matrix of pairs, with white space removed
pairsMatrix <- do.call(rbind, sapply(vect1, function(v) strsplit(gsub(pattern = " ", replacement = "", x = v), "\\+")))

# remove dimnames (not necessary though)
dimnames(pairsMatrix) <- NULL

# for each row of pairsMatrix, check whether its second element appears as the first element of another row; bind the resulting pairs to the original ones
allPairs <- do.call(rbind, c(list(pairsMatrix), apply(pairsMatrix, 1, function(names) c(names[1], pairsMatrix[names[2]==pairsMatrix[,1], 2]))))

# filter out self-relationships (a row with no match gets recycled by rbind into a pair like "Steven", "Steven")
allPairs[allPairs[,1]!=allPairs[,2],]

      [,1]     [,2]    
 [1,] "Andy"   "Pete"  
 [2,] "Mary"   "Pete"  
 [3,] "Pete"   "Amada" 
 [4,] "Amada"  "Steven"
 [5,] "Steven" "Henry" 
 [6,] "Andy"   "Amada" 
 [7,] "Mary"   "Amada" 
 [8,] "Pete"   "Steven"
 [9,] "Amada"  "Henry" 
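
The same one-step chaining can also be written as a self-join with base R's merge(). This is only a sketch under a few assumptions: the helper names left, right and joined and the column names a, b, c are made up for illustration, and like the code above it only chains pairs once ("A+B" with "B+C"), it does not keep going to longer chains.

```r
vect1 <- c("Andy+Pete", "Mary + Pete", "Pete+ Amada", "Amada + Steven", "Steven + Henry")

# Split on "+" together with any surrounding white space
parts <- do.call(rbind, strsplit(vect1, "\\s*\\+\\s*"))

# Two views of the same pairs: (a, b) and (b, c), so that merging on b
# lines up every "A+B" with every "B+C"
left  <- data.frame(a = parts[, 1], b = parts[, 2], stringsAsFactors = FALSE)
right <- data.frame(b = parts[, 1], c = parts[, 2], stringsAsFactors = FALSE)
joined <- merge(left, right, by = "b")

# Drop self-relationships, then build the new "A+C" strings
joined <- joined[joined$a != joined$c, ]
new_pairs <- paste0(joined$a, "+", joined$c)

vect2 <- c(vect1, new_pairs)
vect2
```

Because merge() matches on values, a name that starts several pairs simply produces several joined rows, so there is nothing to recycle or truncate. The order of the new elements may differ from the output above, since merge() sorts on the key column by default.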

Concerning your last point, I think a simple sort on a proper Date object will do it.


1 Comment

Probably a better solution. I was too lazy to avoid the loops :). You got my upvote.

I think this should do it, although I did things I probably shouldn't, like growing objects and nesting for loops. If you want to access all the elements in front of the '+', just use name.matrix[,1].

vect1 <- c("Andy+Pete", "Mary + Pete", "Pete+ Amada","Amada + Steven", "Steven + Henry")

library(stringr)

name.matrix <- matrix(do.call('rbind',str_split(vect1, pattern = "\\s?[+]\\s?")), ncol = 2)

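# collects the new "A+C" strings (grown inside the loop)
new.stuff <- c()

# x is a candidate "middle" name B
for(x in unique(name.matrix[,2])){
  # pairs ending in x ("A+B") and pairs starting with x ("B+C")
  sub.mat.1 <- matrix(name.matrix[name.matrix[,2] == x,], ncol = 2)
  sub.mat.2 <- matrix(name.matrix[name.matrix[,1] == x,], ncol = 2)
  if(length(sub.mat.1) && length(sub.mat.2)){
    for(y in seq_along(sub.mat.1[,2])){
      # combine each A with every C that follows x
      new.add <- paste0(sub.mat.1[y,1],'+', sub.mat.2[,2])
      new.stuff <- c(new.stuff, new.add)
    }
  }
}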

vect2 <- c(vect1, new.stuff)
vect2
#[1] "Andy+Pete"      "Mary + Pete"    "Pete+ Amada"    "Amada + Steven" "Steven + Henry" "Andy+Amada"    
#[7] "Mary+Amada"     "Pete+Steven"    "Amada+Henry" 
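
On the loop from the question: as far as I can tell it would not give the names. for (i in length(vect1)) only visits the single value length(vect1) rather than every index, regexpr() returns the position of a match rather than the matched text, and an unescaped + in a pattern is a regex quantifier, not a literal plus sign. A minimal sketch without stringr, where first_names and first_names_alt are just illustrative variable names:

```r
vect1 <- c("Andy+Pete", "Mary + Pete", "Pete+ Amada", "Amada + Steven", "Steven + Henry")

# Vectorised: delete everything from the (optional) white space before "+"
# through the end of each string
first_names <- sub("\\s*\\+.*$", "", vect1)
first_names

# Same idea with regexpr(): it only locates the match, so regmatches() is
# needed to pull out the text. Inside [ ] the "+" is a literal character.
first_names_alt <- regmatches(vect1, regexpr("^[^+ ]+", vect1))
first_names_alt
```

Both calls work on the whole vector at once, so no loop is needed. The second pattern assumes the names themselves contain no spaces.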

Update:

Third question: there are only 30 days in June, so you're going to get an NA there. If it's a data.frame that you're trying to sort by date, you'll need to use the form df[order(df$Date),]. The lubridate package might also be helpful when working with dates.

d <- c('01-Apr','01-Mar','02-Jan','31-June','30-May')

d.new <- as.Date(d, format = '%d-%b')
d.new <- d.new[order(d.new)]
d.new
#[1] "2018-01-02" "2018-03-01" "2018-04-01" "2018-05-30" NA  
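
For the data.frame case, here is a rough sketch of the df[order(df$Date),] idea. Several things in it are assumptions for illustration only: the value column, the ParsedDate column, the year 2018 (the data carries no year), and "30-Jun" standing in for the invalid "31-June". Instead of %b, whose month names depend on the locale, it looks the month up in R's built-in month.abb constant.

```r
# Hypothetical data frame; "30-Jun" replaces the invalid "31-June" from the question
d <- data.frame(Date = c("01-Apr", "01-Mar", "02-Jan", "30-Jun", "30-May"),
                value = 1:5,
                stringsAsFactors = FALSE)

day   <- as.integer(sub("-.*$", "", d$Date))        # text before the hyphen
month <- match(sub("^.*-", "", d$Date), month.abb)  # "Jan" -> 1, ..., "Dec" -> 12

# The data has no year, so one is assumed here purely to build valid dates
d$ParsedDate <- as.Date(sprintf("2018-%02d-%02d", month, day))

# Reorder whole rows by the parsed date; note the comma after the row index
d_sorted <- d[order(d$ParsedDate), ]
d_sorted$Date
```

Entries that spell the month out in full, such as "June", are not in month.abb and would need to be normalised before this lookup.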

7 Comments

Thank you so much for your help. What a solution!! Could you also please help with the 3rd question too?
No problem. Ok, I wrote a response.
I tried that, but R just froze (my dataset has 85.56M+ rows). I wonder if this is because the entries are in double quotation marks?
Your dataset has over 85 million rows? No, I suspect it froze because of the sheer size of sorting that many rows. Double quotation marks shouldn't matter. Try breaking off a piece of the huge dataset and sorting that, just to see if it's working.
Sorry I've been traveling. It should be as you wrote it, if it was a data.frame. But the example I gave is just a vector and therefore, you don't need the comma.