I have a set of patient data df I am trying to de-identify in R.

structure(list(name = structure(c(2L, 5L, 1L, 6L, 4L, 3L), .Label = c("Andrew", 
"Jim", "Kurt", "Lester", "Mickey", "Taylor"), class = "factor"), 
    heart_rate = c(78L, 82L, 67L, 105L, 85L, 94L), age = c(35L, 
    23L, 43L, 52L, 33L, 45L), partner = structure(c(5L, 2L, 6L, 
    1L, 3L, 4L), .Label = c("Andrew", "Jim ", "Kurt ", "Lester ", 
    "Mickey ", "Taylor "), class = "factor")), class = "data.frame", row.names = c(NA, 
-6L))

I want to replace the names in both the name and partner columns with the matching values from the id column of this object, called key:

structure(list(name = structure(c(2L, 5L, 1L, 6L, 4L, 3L), .Label = c("Andrew", 
"Jim", "Kurt", "Lester", "Mickey", "Taylor"), class = "factor"), 
    id = structure(c(2L, 5L, 1L, 6L, 4L, 3L), .Label = c("A3", 
    "J9", "K5", "L4", "M4", "T7"), class = "factor")), class = "data.frame", row.names = c(NA, 
-6L))

I can de-identify the name column with this code:

df[["name"]] <- key[ match(df[['name']], key[['name']] ) , 'id']

But when I try to de-identify the partner column with this code:

df[["partner"]] <- key[ match(df[['partner']], key[['name']] ) , 'id']

almost every value in partner becomes NA, and my data frame looks like this:

structure(list(name = structure(c(2L, 5L, 1L, 6L, 4L, 3L), .Label = c("A3", 
"J9", "K5", "L4", "M4", "T7"), class = "factor"), heart_rate = c(78L, 
82L, 67L, 105L, 85L, 94L), age = c(35L, 23L, 43L, 52L, 33L, 45L
), partner = structure(c(NA, NA, NA, 1L, NA, NA), .Label = c("A3", 
"J9", "K5", "L4", "M4", "T7"), class = "factor")), row.names = c(NA, 
-6L), class = "data.frame")

Does anyone have any suggestions? Bonus points for methods that could be applied over all columns in a dataset in one line. Explanations of the code are greatly appreciated.

1 Answer

The issue is that in the partner column of df, every name except "Andrew" has a trailing space:

.Label = c("Andrew", "Jim ", "Kurt ", "Lester ", "Mickey ", "Taylor ")

This means that match() won't find an exact match except for the name "Andrew", for which it correctly returns that index.
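
To see this behaviour in isolation, here is a minimal sketch with made-up vectors (the names lookup and padded are illustrative and do not come from the question):

```r
# Hypothetical example vectors; "Jim " and "Kurt " carry a trailing space.
lookup <- c("Andrew", "Jim", "Kurt")
padded <- c("Jim ", "Andrew", "Kurt ")

# match() compares strings exactly, so the padded values come back as NA.
match(padded, lookup)
# [1] NA  1 NA

# A regular expression for trailing whitespace flags the affected entries.
grepl("\\s$", padded)
# [1]  TRUE FALSE  TRUE
```

Running the same grepl() check on levels(df$partner) should point out which labels in the real data are affected.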

The way to fix this is to strip the leading and trailing whitespace from your partner column with trimws():

df$partner <- trimws(df$partner)

Then your code works fine:

> df[["partner"]] <- key[ match(df[['partner']], key[['name']] ) , 'id']
> df
  name heart_rate age partner
1   J9         78  35      M4
2   M4         82  23      J9
3   A3         67  43      T7
4   T7        105  52      A3
5   L4         85  33      K5
6   K5         94  45      L4
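
For the bonus part of the question, one possible approach is to run the same lookup over several columns with lapply() and assign the result back in a single statement. The sketch below rebuilds cut-down versions of df and key with plain character columns (stringsAsFactors = FALSE is an assumption made here; the question's data uses factors), and name_cols is a hypothetical helper vector listing the columns to de-identify:

```r
# Cut-down copies of the question's objects (character columns assumed).
key <- data.frame(
  name = c("Jim", "Mickey", "Andrew", "Taylor", "Lester", "Kurt"),
  id   = c("J9", "M4", "A3", "T7", "L4", "K5"),
  stringsAsFactors = FALSE
)
df <- data.frame(
  name    = c("Jim", "Mickey", "Andrew", "Taylor", "Lester", "Kurt"),
  partner = c("Mickey ", "Jim ", "Taylor ", "Andrew", "Kurt ", "Lester "),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the columns that contain names.
name_cols <- c("name", "partner")

# For each selected column: trim whitespace, find each value's position
# in key$name, and take the id stored at that position.
df[name_cols] <- lapply(df[name_cols], function(x) key$id[match(trimws(x), key$name)])

df
```

Trimming inside the function means columns that were already clean, such as name, pass through unchanged, so the same line can be pointed at any set of name columns. Values with no entry in key$name still come back as NA, which is worth checking for before relying on the result.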