I have a dataset that looks like this:

ColA; ColB; ColC;
PAR; BKK; Y;
BKK; SYD; Y;  
NYC; LAX; Y;
LAX; SFO; Y;

Among the rows where ColC == "Y", I want to find pairs in which ColB of one row equals ColA of another row, and create a new row from ColA of the first row and ColB of the second. For the example above, the new rows would be:

ColA; ColB; ColC;
PAR; SYD; Y; 
NYC; SFO; Y;

And these rows would be added to the main dataset.

I tried a nested "for" loop that builds a temporary data frame and rbinds it to the main one, but it doesn't work:

for (i in 1:nrow(maindataset)){
    for (j in (i+1):nrow(maindataset)-1){
        if (maindataset$colB[i]==maindataset$colA[j] & maindataset$colC[i]==maindataset$colC[j]) {
            newDF<-data.frame(ColA=maindataset$colA[i],ColB=maindataset$colA[j],ColC=maindataset$colA[j],stringsAsFactors = FALSE)
            maindataset<-rbind(maindataset,newDF)
        }
    }
}

I'm not sure a loop is the best solution. Do you have any idea how I could solve this?

Thanks!

2 Answers

In general, whenever you need to match values between two columns of a dataset, think of a join. In this case, first keep the rows whose colB value also appears in colA, then left join the result with the original dataset, matching colB against colA. That gives you the rows you need. After dropping and renaming columns as needed, you can rbind the result to the original dataset:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
set.seed(123)
df <- tibble(colA=sample(state.abb,20,replace = T),
             colB=sample(state.abb,20,replace = T)) %>% 
  filter(colA!=colB)
df # Sample dataframe with two columns having data from same superset
#> # A tibble: 19 x 2
#>    colA  colB 
#>    <chr> <chr>
#>  1 NM    CT   
#>  2 IA    TN   
#>  3 IN    FL   
#>  4 AZ    ME   
#>  5 TN    OK   
#>  6 WY    IN   
#>  7 TX    KY   
#>  8 OR    TX   
#>  9 IN    RI   
#> 10 MO    ID   
#> 11 MT    IA   
#> 12 NE    NY   
#> 13 CA    TN   
#> 14 NE    VT   
#> 15 NV    CT   
#> 16 NH    SD   
#> 17 OH    GA   
#> 18 DE    MN   
#> 19 MT    NE

df1 <- df %>% 
  filter(colB %in% colA) %>% # filter the rows where `colB` is present in `colA`
  left_join(df, by=c('colB'='colA')) # Left join with original dataset
df1
#> # A tibble: 8 x 3
#>   colA  colB  colB.y
#>   <chr> <chr> <chr> 
#> 1 IA    TN    OK    
#> 2 WY    IN    FL    
#> 3 WY    IN    RI    
#> 4 OR    TX    KY    
#> 5 MT    IA    TN    
#> 6 CA    TN    OK    
#> 7 MT    NE    NY    
#> 8 MT    NE    VT

df1 %>% 
  select(-colB) %>% 
  rename(colB=colB.y) # colB.y is the column you need, so rename it and drop the other one
#> # A tibble: 8 x 2
#>   colA  colB 
#>   <chr> <chr>
#> 1 IA    OK   
#> 2 WY    FL   
#> 3 WY    RI   
#> 4 OR    KY   
#> 5 MT    TN   
#> 6 CA    OK   
#> 7 MT    NY   
#> 8 MT    VT

Created on 2020-02-10 by the reprex package (v0.3.0)
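
Applied to the data in the question, the same join idea might look like the sketch below. It assumes the column names ColA, ColB and ColC exactly as shown in the question's table, and it restricts to ColC == "Y" before joining. The names legs, next_leg, via, dest, connections and result are made up for illustration. The second copy of the data is renamed before the join so that no .x/.y suffixes have to be cleaned up afterwards.

```r
library(dplyr)

# The question's data (column names assumed to be ColA, ColB, ColC)
maindataset <- tibble(
  ColA = c("PAR", "BKK", "NYC", "LAX"),
  ColB = c("BKK", "SYD", "LAX", "SFO"),
  ColC = c("Y", "Y", "Y", "Y")
)

# Only rows flagged "Y" take part in the matching
legs <- maindataset %>% filter(ColC == "Y")

# Second copy of the same rows, renamed to avoid name clashes in the join
next_leg <- legs %>% select(via = ColA, dest = ColB)

# Match ColB of one row against ColA (here: via) of another row,
# then keep the first row's ColA and the second row's ColB (here: dest)
connections <- legs %>%
  inner_join(next_leg, by = c("ColB" = "via")) %>%
  mutate(ColB = dest) %>%
  select(ColA, ColB, ColC)

# Append the new rows to the original dataset
result <- bind_rows(maindataset, connections)
result
```

An inner join is used here instead of filter + left join because it drops unmatched rows in one step; if one ColB value matches several ColA values, each match produces its own row.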

It seems you were trying to build newDF but ran into difficulty. Here is a base R solution that may help:

newDF <- with(subset(df,ColC == "Y"), 
              data.frame(ColA = ColA[na.omit(p <-match(ColA,ColB))],
                         ColB = ColB[which(!is.na(p))],
                         ColC = "Y"))

such that

> newDF
  ColA ColB ColC
1  PAR  SYD    Y
2  NYC  SFO    Y

DATA

df <- structure(list(ColA = c("PAR", "BKK", "NYC", "LAX"), ColB = c("BKK", 
"SYD", "LAX", "SFO"), ColC = c("Y", "Y", "Y", "Y")), class = "data.frame", row.names = c(NA, 
-4L))
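
For readers who find the one-liner dense, here is the same match() idea unpacked into separate steps, as a sketch on the data above. The intermediate names legs, p, has_prev and result are introduced only for illustration. Note that match() returns the first match only, so this assumes each ColA value appears at most once in ColB among the "Y" rows.

```r
df <- data.frame(
  ColA = c("PAR", "BKK", "NYC", "LAX"),
  ColB = c("BKK", "SYD", "LAX", "SFO"),
  ColC = c("Y", "Y", "Y", "Y"),
  stringsAsFactors = FALSE
)

# Only rows flagged "Y" take part in the matching
legs <- subset(df, ColC == "Y")

# For each row, position of the first row whose ColB equals this row's ColA
# (NA when no such row exists)
p <- match(legs$ColA, legs$ColB)
has_prev <- !is.na(p)

newDF <- data.frame(
  ColA = legs$ColA[p[has_prev]],  # ColA of the earlier row
  ColB = legs$ColB[has_prev],     # ColB of the later row
  ColC = "Y",
  stringsAsFactors = FALSE
)

# Append the new rows to the original data
result <- rbind(df, newDF)
result
```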
