r Replace multiple strings in a data frame column with multiple strings from a column of another data frame

Question

I have a data frame (df1) with a column "Partcipant_ID". Some participant IDs are wrong and should be replaced with the correct ID. I have another data frame (df2) in which all participant IDs appear in the columns Goal_ID to T4. The IDs in column "Goal_ID" are the correct ones.
Now I want to replace every ID in df1 that appears in columns T2 to T4 of df2 with the corresponding Goal_ID from df2.

This is my original dataframe (df1):

structure(list(Partcipant_ID = c("AA_SH_RA_91", "AA_SH_RA_91", 
"AB_BA_PR_93", "AB_BH_VI_90", "AB_BH_VI_90", "AB_SA_TA_91", "AJ_BO_RA_92", 
"AJ_BO_RA_92", "AK_SH_HA_91", "AL_EN_RA_95", "AL_MA_RA_95", "AL_SH_BA_99", 
"AM_BO_AB_49", "AM_BO_AB_94", "AM_BO_AB_94", "AM_BO_AB_94", "AN_JA_AN_91", 
"AN_KL_GE_11", "AN_KL_WO_91", "AN_MA_DI_95", "AN_MA_DI_95", "AN_SE_RA_95", 
"AN_SE_RA_95", "AN_SI_RA_97", "AN_SO_PU_94", "AN_SU_RA_91", "AR_BO_RA_92", 
"AR_KA_VI_94", "AR_KA_VI_94", "AS_AR_SO_90", "AS_AR_SU_95", "AS_KU_SO_90", 
"AS_MO_AS_97", "AW_SI_OJ_97", "AW_SI_OJ_97", "AY_CH_SU_97", "BH_BE_LD_84", 
"BH_BE_LI_83", "BH_BE_LI_83", "BH_BE_LI_84", "BH_KO_SA_87", "BH_PE_AB_89", 
"BH_YA_SA_87", "BI_CH_PR_94", "BI_CH_PR_94"), Start_T2 = structure(c(NA, 
NA, NA, NA, 1579514871, 1576658745, NA, 1579098225, NA, NA, 1576663067, 
1576844759, NA, 1577330639, NA, NA, 1576693930, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, 1577718380, 1577718380, 1577454467, NA, 
NA, 1576352237, NA, NA, NA, NA, 1576420656, 1576420656, NA, NA, 
1578031772, 1576872938, NA, NA), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), End_T2 = structure(c(NA, NA, NA, NA, 1579515709, 
1576660469, NA, 1579098989, NA, NA, 1576693776, 1576845312, NA, 
1577331721, NA, NA, 1576694799, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, 1577719049, 1577719049, 1577455167, NA, NA, 1576352397, 
NA, NA, NA, NA, 1576421607, 1576421607, NA, NA, 1578032408, 1576873875, 
NA, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA, 
45L), class = "data.frame")

And this is the reference data frame (df2):

structure(list(Goal_ID = c("AJ_BO_RA_92", "AL_EN_RA_95", "AM_BO_AB_49", 
"AS_KU_SO_90", "BH_BE_LI_84", "BH_YA_SA_87", "BI_CH_PR_94", "BI_CH_PR_94"
), T2 = c("AJ_BO_RA_92", "AL_MA_RA_95", "AM_BO_AB_94", "AS_AR_SO_90", 
"BH_BE_LI_83", "BH_YA_SA_87", "BI_NA_PR_94", "BI_NA_PR_94"), 
    T3 = c("AR_BO_RA_92", "AL_MA_RA_95", "AM_BO_AB_94", NA, "BH_BE_LI_83", 
    NA, "BI_CH_PR_94", "BI_CH_PR_94"), T4 = c("AJ_BO_RA_92", 
    "AL_MA_RA_95", "AM_BO_AB_94", NA, "BH_BE_LI_83", "BH_KO_SA_87", 
    "BI_CH_PR_94", "BI_CH_PR_94")), row.names = c(NA, -8L), class = c("tbl_df", 
"tbl", "data.frame"))

For example, in my df1, I want

"AR_BO_RA_92" to be replaced by "AJ_BO_RA_92";
"AL_MA_RA_95" to be replaced by "AL_EN_RA_95";
"AM_BO_AB_94" to be replaced by "AM_BO_AB_49"

and so on...

I thought about using str_replace (from stringr) and I started with this:

df1$Partcipant_ID <- str_replace(df1$Partcipant_ID, "AR_BO_RA_92", "AJ_BO_RA_92")

But that is of course very inefficient, because I have so many replacements, and it would be nice to make use of my reference data frame. I just cannot figure it out myself.
I hope this is understandable. Please ask if you need additional information.

Thank you so much already!

So participant IDs in df1 that appear in T2, T3, T4 in df2 should be replaced by the ID in T1 in df2? Could you clarify the structure of df2 a little more?
– hammoire, Commented Mar 12, 2020 at 15:25
Exactly. The "goal dataframe" is df1 (so here I want the incorrect IDs to be replaced by the correct IDs). df2 is the reference data frame. The IDs in all columns of df2 also appear in df1 and should be replaced by the IDs in column Goal_ID of df2. I edited the structure of df2 in my question. Maybe it's a bit clearer now.
– Ane, Commented Mar 12, 2020 at 15:31

GKi · Accepted Answer · 2020-03-12 15:35:08Z

You can use match to find where each string is located in the T2 to T4 columns and exchange those that have been found (i.e. are not NA), like:

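# Position of each ID in the flattened T2..T4 columns (column-major order),
# mapped back to a row of df2; (pos - 1) %% n + 1 keeps the last row at n, not 0
i <- (match(df1$Partcipant_ID, unlist(df2[-1])) - 1L) %% nrow(df2) + 1L
j <- !is.na(i)  # TRUE where the ID occurs somewhere in T2..T4
df1$Partcipant_ID[j] <- df2$Goal_ID[i[j]]
df1$Partcipant_ID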
# [1] "AA_SH_RA_91" "AA_SH_RA_91" "AB_BA_PR_93" "AB_BH_VI_90" "AB_BH_VI_90"
# [6] "AB_SA_TA_91" "AJ_BO_RA_92" "AJ_BO_RA_92" "AK_SH_HA_91" "AL_EN_RA_95"
#[11] "AL_EN_RA_95" "AL_SH_BA_99" "AM_BO_AB_49" "AM_BO_AB_49" "AM_BO_AB_49"
#[16] "AM_BO_AB_49" "AN_JA_AN_91" "AN_KL_GE_11" "AN_KL_WO_91" "AN_MA_DI_95"
#[21] "AN_MA_DI_95" "AN_SE_RA_95" "AN_SE_RA_95" "AN_SI_RA_97" "AN_SO_PU_94"
#[26] "AN_SU_RA_91" "AJ_BO_RA_92" "AR_KA_VI_94" "AR_KA_VI_94" "AS_KU_SO_90"
#[31] "AS_AR_SU_95" "AS_KU_SO_90" "AS_MO_AS_97" "AW_SI_OJ_97" "AW_SI_OJ_97"
#[36] "AY_CH_SU_97" "BH_BE_LD_84" "BH_BE_LI_84" "BH_BE_LI_84" "BH_BE_LI_84"
#[41] "BH_YA_SA_87" "BH_PE_AB_89" "BH_YA_SA_87" "BI_CH_PR_94" "BI_CH_PR_94"
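
The index arithmetic behind this approach can be sketched on a toy example. unlist() flattens the reference columns column by column, so a match position has to be mapped back to a row number. The names ref and ids and their values are made up for illustration and are not the asker's data. The sketch also shows that a plain pos %% nrow(ref) would give 0 for anything found in the last row, which is why the mapping below subtracts 1 before the modulo and adds 1 afterwards.

```r
# Hypothetical toy data, not the data from the question
ref <- data.frame(
  Goal_ID = c("a", "b", "c"),
  T2 = c("a1", "b1", "c1"),
  T3 = c("a2", "b2", "c2"),
  stringsAsFactors = FALSE
)
ids <- c("a1", "c2", "zz", "c1", "b2")

# Column-major flattening: a1 b1 c1 a2 b2 c2
flat <- unlist(ref[-1], use.names = FALSE)

pos <- match(ids, flat)                  # position in the flattened vector, NA if absent
row_idx <- (pos - 1L) %% nrow(ref) + 1L  # row of ref that the position belongs to
naive <- pos %% nrow(ref)                # gives 0 for the last row, which is not a valid row

fixed <- ids
found <- !is.na(row_idx)
fixed[found] <- ref$Goal_ID[row_idx[found]]
fixed
```

IDs that do not occur in any reference column ("zz" here) are left untouched, because only the positions where found is TRUE are overwritten.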

hammoire · Accepted Answer · 2020-03-12 15:58:48Z

I think this might work. Create a true lookup table with one column of correct codes and one column of incorrect codes, i.e. stack the columns T2 to T4. Then join the resulting df3 to df1 and use coalesce to create the new participant ID. You spelt participant wrong, which made me feel more human; I always do that.

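library(dplyr)

# Stack T2, T3 and T4 under one column (named T2) next to Goal_ID,
# then drop duplicated (Goal_ID, T2) pairs
df3 <- df2[1:2] %>% 
  bind_rows(df2[c(1,3)] %>% rename(T2 = T3), 
            df2[c(1,4)] %>% rename(T2 = T4)) %>% 
  distinct()

# Attach the correct ID where one exists; otherwise keep the original ID
df1 %>% 
  left_join(df3, by = c("Partcipant_ID" = "T2")) %>% 
  mutate(Goal_ID = coalesce(Goal_ID, Partcipant_ID)) %>% 
  select(Goal_ID, Partcipant_ID, Start_T2, End_T2)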
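
As a hedged alternative sketch of the same lookup-table idea, the stacking step could also be written with tidyr::pivot_longer() instead of repeated bind_rows() and rename() calls. This assumes tidyr (1.0.0 or later) is installed. The toy tables ref and dat and the column names id, wave and value are hypothetical and only stand in for df2 and df1.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for df2 (ref) and df1 (dat)
ref <- tibble(
  Goal_ID = c("a", "b"),
  T2 = c("a1", "b1"),
  T3 = c("a2", NA)
)
dat <- tibble(id = c("a1", "x", "a2", "b1", "b"), value = 1:5)

# Long lookup table: one row per (wrong id, correct id) pair, NAs dropped
lookup <- ref %>%
  pivot_longer(-Goal_ID, names_to = "wave", values_to = "id",
               values_drop_na = TRUE) %>%
  distinct(id, Goal_ID)

# A wrong id that maps to two different Goal_IDs would duplicate rows in the join
stopifnot(anyDuplicated(lookup$id) == 0)

fixed <- dat %>%
  left_join(lookup, by = "id") %>%
  mutate(id = coalesce(Goal_ID, id)) %>%  # keep the original id when there is no match
  select(-Goal_ID)
fixed$id
```

The duplicate check is a design choice for safety: left_join() returns one row per match, so an ambiguous lookup key would silently increase the number of rows.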

edited Mar 12, 2020 at 15:58

answered Mar 12, 2020 at 15:50
