I am looking for a more efficient method of re-coding column entries in a dataframe, where the recoding is conditional on the entries in other columns.

Take this simple example, which demonstrates my current procedure of creating a new column for the recoded data, converting it to character, and then using the subset square brackets to recode the data (is there an official name for this procedure?).

## example data frame
df = data.frame( id = seq( 1 , 100 , by=1 ) ,
                 x = rep( c("W", "Z") , each=50),
                 y = c( rep( c("A","B","C","D") , 25 ) ) )

# add a new column based on column y; convert to character 
df$newY = as.character( df$y ) 

# change newY entries to numbers based on conditions in other columns
df$newY[ df$x == "W" & df$newY == "B" ] <- 1
df$newY[ df$x == "Z" & df$newY == "D" ] <- 3

This procedure is fine for recoding variables with a small number of conditions, but it becomes cumbersome with a larger number of conditions or when there are many distinct variables to recode.

Could anyone help me with finding a more efficient method of doing this?

Thanks!

  • Is there some kind of logic or pattern in the recoding? By efficiency, do you mean a method that requires less typing or do you mean faster performance / memory efficiency? Commented Feb 24, 2016 at 10:17
  • how many conditions are we talking? Commented Feb 24, 2016 at 10:18
  • Would something like this solve your problem: df$newY = as.factor( paste0(df$y, df$x) ) ; as.numeric(df$newY) Commented Feb 24, 2016 at 10:21
  • @MaxPD In my data, recoding is conditional on one other column in the dataframe (as in the example) but there are up to four multiples of the same variable needing to be converted to the same new coding (e.g. imagine if there was A1, A2, A3 etc. in the above example needing to be recoded to 1, conditional on "W"). There are also 8 distinct variables in the conditional column (e.g. "x" column above), and up to 11 different variables in the y column, meaning 8 blocks of ~ 11 lines of recoding. I hope that is clear. Commented Feb 24, 2016 at 10:29
  • @docendodiscimus By efficiency, I do mean just less typing, sorry. For patterning, it's difficult. For instance, a variable, e.g. A, conditional on Z, may have to be recoded as 1, but A conditional on W needs recoding as 2. Perhaps a better approach would be to reshape the data frame from long to wide format, and recode each variable as a separate column... Commented Feb 24, 2016 at 10:31

1 Answer

Some approaches to this:

df <- data.frame(id = seq( 1 , 100 , by=1 ) ,
                 x = rep( c("W", "Z") , each=50),
                 y = c( rep( c("A","B","C","D") , 25)))

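# Take the product (my preference). Wrap each column in factor() first:
# since R 4.0.0, data.frame() keeps strings as character, and as.numeric("W") is NA.
# Note that products can collide (W/B and Z/A both give 2 here).
as.numeric(factor(df$x)) * as.numeric(factor(df$y))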

# Create new factor based on x and y and convert to numeric
as.numeric(as.factor(paste0(df$x, df$y)))
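
If the codes follow no pattern (as the comments on the question suggest), a lookup table may mean less typing than one line per condition. Below is a minimal sketch, not a definitive solution: the `lookup` data frame and its `code` column are hypothetical names, and the codes simply mirror the two rules from the question.

```r
# Example data, as in the question
df <- data.frame(id = 1:100,
                 x  = rep(c("W", "Z"), each = 50),
                 y  = rep(c("A", "B", "C", "D"), 25))

# Hypothetical lookup table: one row per (x, y) combination to recode.
# Add a row here for each new rule instead of a new line of indexing code.
lookup <- data.frame(x    = c("W", "Z"),
                     y    = c("B", "D"),
                     code = c("1", "3"))

# Find each row's (x, y) pair in the lookup table; NA means no rule applies
idx <- match(paste(df$x, df$y), paste(lookup$x, lookup$y))

# Use the code where a rule matched, otherwise keep the original y value
df$newY <- ifelse(is.na(idx), as.character(df$y), lookup$code[idx])
```

With 8 values of x and up to 11 values of y, the rules would then live in one table (which could also be read from a CSV file) rather than in ~88 separate assignment lines.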
1 Comment

Thanks! This helps a lot. I missed your second example in the comment to the OP.
