I have a dataframe (df1) with a column "Partcipant_ID". Some of the participant IDs are wrong and should be replaced with the correct ones. I have another dataframe (df2) in which all participant IDs appear in the columns Goal_ID to T4. The IDs in column "Goal_ID" are the correct IDs.
Now I want to replace every ID in df1 that appears in df2 with the Goal_ID from the same row of df2.

This is my original dataframe (df1):

structure(list(Partcipant_ID = c("AA_SH_RA_91", "AA_SH_RA_91", 
"AB_BA_PR_93", "AB_BH_VI_90", "AB_BH_VI_90", "AB_SA_TA_91", "AJ_BO_RA_92", 
"AJ_BO_RA_92", "AK_SH_HA_91", "AL_EN_RA_95", "AL_MA_RA_95", "AL_SH_BA_99", 
"AM_BO_AB_49", "AM_BO_AB_94", "AM_BO_AB_94", "AM_BO_AB_94", "AN_JA_AN_91", 
"AN_KL_GE_11", "AN_KL_WO_91", "AN_MA_DI_95", "AN_MA_DI_95", "AN_SE_RA_95", 
"AN_SE_RA_95", "AN_SI_RA_97", "AN_SO_PU_94", "AN_SU_RA_91", "AR_BO_RA_92", 
"AR_KA_VI_94", "AR_KA_VI_94", "AS_AR_SO_90", "AS_AR_SU_95", "AS_KU_SO_90", 
"AS_MO_AS_97", "AW_SI_OJ_97", "AW_SI_OJ_97", "AY_CH_SU_97", "BH_BE_LD_84", 
"BH_BE_LI_83", "BH_BE_LI_83", "BH_BE_LI_84", "BH_KO_SA_87", "BH_PE_AB_89", 
"BH_YA_SA_87", "BI_CH_PR_94", "BI_CH_PR_94"), Start_T2 = structure(c(NA, 
NA, NA, NA, 1579514871, 1576658745, NA, 1579098225, NA, NA, 1576663067, 
1576844759, NA, 1577330639, NA, NA, 1576693930, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, 1577718380, 1577718380, 1577454467, NA, 
NA, 1576352237, NA, NA, NA, NA, 1576420656, 1576420656, NA, NA, 
1578031772, 1576872938, NA, NA), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), End_T2 = structure(c(NA, NA, NA, NA, 1579515709, 
1576660469, NA, 1579098989, NA, NA, 1576693776, 1576845312, NA, 
1577331721, NA, NA, 1576694799, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, 1577719049, 1577719049, 1577455167, NA, NA, 1576352397, 
NA, NA, NA, NA, 1576421607, 1576421607, NA, NA, 1578032408, 1576873875, 
NA, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA, 
45L), class = "data.frame")

And this is the reference data frame (df2):

structure(list(Goal_ID = c("AJ_BO_RA_92", "AL_EN_RA_95", "AM_BO_AB_49", 
"AS_KU_SO_90", "BH_BE_LI_84", "BH_YA_SA_87", "BI_CH_PR_94", "BI_CH_PR_94"
), T2 = c("AJ_BO_RA_92", "AL_MA_RA_95", "AM_BO_AB_94", "AS_AR_SO_90", 
"BH_BE_LI_83", "BH_YA_SA_87", "BI_NA_PR_94", "BI_NA_PR_94"), 
    T3 = c("AR_BO_RA_92", "AL_MA_RA_95", "AM_BO_AB_94", NA, "BH_BE_LI_83", 
    NA, "BI_CH_PR_94", "BI_CH_PR_94"), T4 = c("AJ_BO_RA_92", 
    "AL_MA_RA_95", "AM_BO_AB_94", NA, "BH_BE_LI_83", "BH_KO_SA_87", 
    "BI_CH_PR_94", "BI_CH_PR_94")), row.names = c(NA, -8L), class = c("tbl_df", 
"tbl", "data.frame"))

For example, in my df1, I want

"AR_BO_RA_92" to be replaced by "AJ_BO_RA_92";
"AL_MA_RA_95" to be replaced by "AL_EN_RA_95";
"AM_BO_AB_94" to be replaced by "AM_BO_AB_49"

and so on...

I thought about using str_replace (from stringr) and started with this:

df1$Partcipant_ID <- str_replace(df1$Partcipant_ID, "AR_BO_RA_92", "AJ_BO_RA_92")

But that is of course very inefficient because I have so many replacements, and it would be nice to make use of my reference data frame. I just cannot figure it out myself.
I hope this is understandable. Please ask if you need additional information.

Thank you so much already!

2
  • So participantIDs in df1 that appear in T2,T3,T4 in df2 should be replaced by the ID in T1 in df2? Could you clarify the structure of df2 a little more? Commented Mar 12, 2020 at 15:25
  • Exactly. The "goal dataframe" is df1 (so here I want the incorrect IDs to be replaced by the correct IDs). df2 is the reference data frame. The IDs in df2 in all columns also appear in df1 and should be replaced by the IDs in column Goal_ID of df2. I edited the structure of df2 in my question. Maybe it's a bit clearer now. Commented Mar 12, 2020 at 15:31

2 Answers

1

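You can use match to find where each ID occurs in columns T2 to T4 of df2, and then exchange the IDs that were found (those where the match is not NA):

# Position of each ID in the flattened T2..T4 columns, converted to a row of df2.
# (pos - 1) %% n + 1 keeps a hit in the last row at n; a plain pos %% n would give 0.
i <- (match(df1$Partcipant_ID, unlist(df2[-1])) - 1) %% nrow(df2) + 1
j <- !is.na(i)  # TRUE where a replacement exists
df1$Partcipant_ID[j] <- df2$Goal_ID[i[j]]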
df1$Partcipant_ID
# [1] "AA_SH_RA_91" "AA_SH_RA_91" "AB_BA_PR_93" "AB_BH_VI_90" "AB_BH_VI_90"
# [6] "AB_SA_TA_91" "AJ_BO_RA_92" "AJ_BO_RA_92" "AK_SH_HA_91" "AL_EN_RA_95"
#[11] "AL_EN_RA_95" "AL_SH_BA_99" "AM_BO_AB_49" "AM_BO_AB_49" "AM_BO_AB_49"
#[16] "AM_BO_AB_49" "AN_JA_AN_91" "AN_KL_GE_11" "AN_KL_WO_91" "AN_MA_DI_95"
#[21] "AN_MA_DI_95" "AN_SE_RA_95" "AN_SE_RA_95" "AN_SI_RA_97" "AN_SO_PU_94"
#[26] "AN_SU_RA_91" "AJ_BO_RA_92" "AR_KA_VI_94" "AR_KA_VI_94" "AS_KU_SO_90"
#[31] "AS_AR_SU_95" "AS_KU_SO_90" "AS_MO_AS_97" "AW_SI_OJ_97" "AW_SI_OJ_97"
#[36] "AY_CH_SU_97" "BH_BE_LD_84" "BH_BE_LI_84" "BH_BE_LI_84" "BH_BE_LI_84"
#[41] "BH_YA_SA_87" "BH_PE_AB_89" "BH_YA_SA_87" "BI_CH_PR_94" "BI_CH_PR_94"
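
To see how the position returned by match maps back to a row of the reference table, here is a minimal sketch on a toy table. The names ref, ids, flat, pos, row, naive and fixed are made up for illustration and are not part of the question's data. unlist flattens the lookup columns one after another, so a position has to be folded back into a row number; a hit in the last row leaves remainder 0 under a plain %%, which is why the sketch uses (pos - 1) %% n + 1.

```r
# Toy reference table (hypothetical data): 3 rows, 2 lookup columns.
ref <- data.frame(
  Goal_ID = c("A1", "B1", "C1"),
  T2      = c("A2", "B2", "C2"),
  T3      = c("A3", NA,   "C3"),
  stringsAsFactors = FALSE
)

# Flatten the lookup columns: T2 fills positions 1-3, T3 fills positions 4-6.
flat <- unlist(ref[-1], use.names = FALSE)

ids <- c("C3", "B2", "A3", "ZZ")       # "ZZ" is not in the reference table
pos <- match(ids, flat)                # position in the flattened vector, NA if absent
row <- (pos - 1) %% nrow(ref) + 1      # fold the position back into a row number
naive <- pos %% nrow(ref)              # plain modulo: a last-row hit becomes 0

# Replace only the IDs that were found; keep the others unchanged.
fixed <- ifelse(is.na(row), ids, ref$Goal_ID[row])
print(fixed)
```

Indexing with 0 silently drops an element in R, so the naive version can fail with a length mismatch as soon as an ID is first found in the last row of the reference table.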

0

I think this might work. Create a true lookup table with one column of correct codes and one column of incorrect codes, i.e. stack the columns. Then join the resulting df3 to df1 and use coalesce to create the corrected ID column (here called Goal_ID). You spelt participant wrong, which made me feel more human; I always do that.

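library(dplyr)

# Stack T2, T3 and T4 under one column name (T2) next to Goal_ID,
# then drop duplicate pairs so the join cannot multiply rows.
df3 <- df2[1:2] %>% 
  bind_rows(df2[c(1,3)] %>% rename(T2 = T3), 
            df2[c(1,4)] %>% rename(T2 = T4)) %>% 
  distinct()

# Look up each ID; where no match exists, Goal_ID is NA and
# coalesce falls back to the original Partcipant_ID.
df1 %>% 
  left_join(df3, by = c("Partcipant_ID" = "T2")) %>% 
  mutate(Goal_ID = coalesce(Goal_ID, Partcipant_ID)) %>% 
  select(Goal_ID, Partcipant_ID, Start_T2, End_T2)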
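
As a possible variation on the stacking step, the lookup table could also be built with tidyr::pivot_longer instead of repeated rename and bind_rows calls. This is only a sketch on toy data; it assumes tidyr is installed, and ref, dat, lookup, wave, wrong and id_fixed are hypothetical names, not part of the question's data.

```r
library(dplyr)
library(tidyr)

# Toy reference table (hypothetical data, same shape as df2).
ref <- data.frame(
  Goal_ID = c("A1", "B1"),
  T2      = c("A1", "B2"),
  T3      = c("A3", NA),
  stringsAsFactors = FALSE
)

# Toy data with corrected, uncorrected and unlisted IDs.
dat <- data.frame(id = c("A3", "B2", "B1", "C9"), stringsAsFactors = FALSE)

# Stack all lookup columns into one "wrong" column, dropping NAs,
# and keep each (Goal_ID, wrong) pair only once.
lookup <- ref %>%
  pivot_longer(-Goal_ID, names_to = "wave", values_to = "wrong",
               values_drop_na = TRUE) %>%
  distinct(Goal_ID, wrong)

# Join the lookup and fall back to the original ID where nothing matched.
fixed <- dat %>%
  left_join(lookup, by = c("id" = "wrong")) %>%
  mutate(id_fixed = coalesce(Goal_ID, id)) %>%
  select(id, id_fixed)

print(fixed)
```

Dropping the NA entries while stacking keeps them out of the join keys, which matters if the data being corrected can itself contain missing IDs.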
