I have a dataset with many columns and I'd like to locate the columns that have fewer than n unique responses and change just those columns into factors.

Here is one way I was able to do that:

# create sample data frame
df <- data.frame("number" = c(1, 2.7, 8, 5), "binary1" = c(1, 0, 1, 1),
                 "answer" = c("Yes", "No", "Yes", "No"), "binary2" = c(0, 0, 1, 0))
n <- 3

# for each column
for (col in colnames(df)) {
  # check if the first entry is numeric
  if (is.numeric(df[col][1, 1])) {
    # check that there are fewer than n unique values
    if (length(unique(df[col])[, 1]) < n) {
      df[[col]] <- factor(df[[col]])
    }
  }
}

What is another, hopefully more succinct, way of accomplishing this?

3 Answers

Here is a way using tidyverse.

We can use where inside across to select columns with a short-circuiting logical expression (&&), which checks the following in order:

  1. whether the column is numeric (is.numeric)
  2. if 1 is TRUE, whether the number of distinct elements is less than the user-defined n
  3. if 2 is TRUE, whether all the unique elements in the column are 0 or 1
  4. finally, across loops over the selected columns and converts each one to factor class

library(dplyr)
df1 <- df %>% 
     mutate(across(where(~is.numeric(.) && 
                           n_distinct(.) < n && 
                           all(unique(.) %in% c(0, 1))),  factor))

Checking the result:

str(df1)
'data.frame':   4 obs. of  4 variables:
 $ number : num  1 2.7 8 5
 $ binary1: Factor w/ 2 levels "0","1": 2 1 2 2
 $ answer : chr  "Yes" "No" "Yes" "No"
 $ binary2: Factor w/ 2 levels "0","1": 1 1 2 1
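
If the 0/1 restriction in step 3 is not wanted (the question itself only asks for numeric columns with fewer than n unique values), the predicate can be shortened. This is a minimal sketch that rebuilds the question's sample data; df2 is a name introduced here for illustration only:

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  number  = c(1, 2.7, 8, 5),
  binary1 = c(1, 0, 1, 1),
  answer  = c("Yes", "No", "Yes", "No"),
  binary2 = c(0, 0, 1, 0)
)
n <- 3

# Convert every numeric column with fewer than n distinct values,
# whatever those values are (no 0/1 restriction)
df2 <- df %>%
  mutate(across(where(~ is.numeric(.x) && n_distinct(.x) < n), factor))

str(df2)
```

On this sample data the result should match the 0/1 version, because the only numeric columns with fewer than 3 distinct values happen to be the binary ones.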

18 Comments

@GregorThomas Suppose the column has only 0 or 1 alone; I am not sure whether the OP wanted to convert those to factors. Also, the number of unique values could be 2, 3, 4, or 5. I guess the OP is specifically looking for the binary columns.
Despite their use of "binary" a couple times, the only check OP is attempting in the question is the n_distinct. The "binary" columns may just be an example.
Though you've explained your steps well enough that they should be able to adapt as needed.
@Mark The ~ is a compact lambda syntax in the tidyverse, similar to function(x). The default argument name is . or .x, i.e. the column values. In base R 4.1.0 and later, you can use \(x) as a compact form of function(x).
@Mark You may have seen where(is.numeric) used in select. Here I use a lambda expression because there are multiple conditions joined by &&, so either ~ is.numeric(.) && ... or function(x) is.numeric(x) && ... works. Anonymous functions are also commonly used with lapply/sapply/apply in base R, where x is the value of the column currently being looped over.
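
To make the equivalence of the three lambda forms concrete, here is a small sketch. The list x and the result names r1, r2, r3 are made up for illustration; note that the ~ form is understood by tidyverse functions such as map_lgl, not by base sapply:

```r
library(purrr)

# Hypothetical toy input: one numeric and one character element
x <- list(a = c(1, 0, 1), b = c("Yes", "No", "Yes"))

# Classic anonymous function
r1 <- sapply(x, function(col) is.numeric(col) && length(unique(col)) < 3)

# Backslash shorthand (requires R >= 4.1.0)
r2 <- sapply(x, \(col) is.numeric(col) && length(unique(col)) < 3)

# purrr-style formula lambda, where .x is the current element
r3 <- map_lgl(x, ~ is.numeric(.x) && length(unique(.x)) < 3)

print(r1)
print(r2)
print(r3)
```

All three apply the same predicate to each element, so they should agree element by element.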

You can also use the imap function from purrr to great advantage in this case. A thousand thanks to my dear friend @akrun, who never ceases to inspire us:

library(dplyr)
library(purrr)

n <- 3

df %>% 
  imap_dfc(~ if (is.numeric(.x) && length(unique(.x)) < n &&
                 all(unique(.x) %in% c(0, 1))) {
    factor(df[[.y]])
  } else {
    df[[.y]]
  }
)

# A tibble: 4 x 4
  number binary1 answer binary2
   <dbl> <fct>   <chr>  <fct>  
1    1   1       Yes    0      
2    2.7 0       No     0      
3    8   1       Yes    1      
4    5   1       No     0  
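
As a possible variation within purrr, modify_if applies a function only to the elements that satisfy a predicate and keeps the input type, so a data frame goes in and a data frame comes out. This is a sketch rather than part of the answer above; it drops the 0/1 check to match the question's wording, and out is just an illustrative name:

```r
library(purrr)

# Sample data from the question
df <- data.frame(
  number  = c(1, 2.7, 8, 5),
  binary1 = c(1, 0, 1, 1),
  answer  = c("Yes", "No", "Yes", "No"),
  binary2 = c(0, 0, 1, 0)
)
n <- 3

# Apply factor() only to numeric columns with fewer than n unique values;
# all other columns are returned unchanged
out <- modify_if(
  df,
  ~ is.numeric(.x) && length(unique(.x)) < n,
  factor
)

str(out)
```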

A base R option

out <- list2DF(
    lapply(
        df,
        function(x) {
            if (length(unique(x)) < n && all(x %in% c(0, 1))) as.factor(x) else x
        }
    )
)

gives

> str(out)
'data.frame':   4 obs. of  4 variables:
 $ number : num  1 2.7 8 5
 $ binary1: Factor w/ 2 levels "0","1": 2 1 2 2
 $ answer : chr  "Yes" "No" "Yes" "No"
 $ binary2: Factor w/ 2 levels "0","1": 1 1 2 1
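
Another base R sketch, closer to the literal wording of the question (numeric columns with fewer than n unique values, no 0/1 check), first builds a logical index of qualifying columns and then converts only those. The name few_unique is introduced here for illustration:

```r
# Sample data from the question
df <- data.frame(
  number  = c(1, 2.7, 8, 5),
  binary1 = c(1, 0, 1, 1),
  answer  = c("Yes", "No", "Yes", "No"),
  binary2 = c(0, 0, 1, 0)
)
n <- 3

# Logical vector flagging numeric columns with fewer than n unique values
few_unique <- vapply(
  df,
  function(x) is.numeric(x) && length(unique(x)) < n,
  logical(1)
)

# Convert only the flagged columns in place; the rest are left untouched
df[few_unique] <- lapply(df[few_unique], factor)

str(df)
```

Keeping the index in its own variable makes it easy to inspect which columns were selected before anything is converted.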
