My data frame, my.data, contains both numeric and factor variables. I want to standardise just the numeric variables in this data frame.

> mydata2=data.frame(scale(my.data, center=T, scale=T))
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric

Would standardising work like this? I want to standardise columns 8, 9, 10, 11 and 12, but I think my code is wrong.

mydata=data.frame(scale(flowdis3[,c(8,9,10,11,12)], center=T, scale=T,))

Thanks in advance

Maybe: mydata[ sapply(mydata, is.numeric) ] <- lapply(mydata[ sapply(mydata, is.numeric) ], scale, center=TRUE, scale=TRUE) (Commented Apr 18, 2016 at 14:53)

3 Answers

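Here is one option that standardises only the numeric columns and leaves all other columns unchanged:

mydata[] <- lapply(mydata, function(x) {
  # scale numeric columns, return everything else as-is
  if (is.numeric(x)) scale(x, center = TRUE, scale = TRUE) else x
})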
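
If only particular columns should be standardised, as with columns 8 to 12 in the question, the same idea can be applied to a subset selected by position. This is a minimal sketch, assuming a small made-up data frame `toy` (hypothetical, not the asker's data). It also wraps `scale()` in `as.vector()`, which is an addition here, so that each column stays a plain numeric vector:

```r
# Hypothetical example data; the column positions are illustrative only
toy <- data.frame(
  id    = factor(c("a", "b", "c", "d")),
  x     = c(1, 2, 3, 4),
  y     = c(10, 20, 30, 40),
  group = factor(c("g1", "g2", "g1", "g2"))
)

cols <- c(2, 3)  # positions of the columns to standardise

# scale() returns a one-column matrix; as.vector() drops that structure
toy[cols] <- lapply(toy[cols], function(x) as.vector(scale(x)))

str(toy)
```

Assigning into `toy[cols]` keeps the remaining columns, including the factors, in their original positions.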

You can use the dplyr package to do this:

library(dplyr)
mydata2 %>% mutate_if(is.numeric, scale)

2 Comments

Could you give a little more explanation?
library(dplyr) provides the function mutate_if, which applies an operation only to columns that satisfy a condition. Here a variable is scaled if and only if it is numeric.

Here are some options to consider, although this answer comes late:

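# Working environment and memory management (optional; not needed for the scaling itself)
rm(list = ls(all.names = TRUE))  # clear the workspace
gc()                             # trigger garbage collection
memory.limit(size = 64935)       # Windows only; has no effect in recent R versions

# Set working directory (replace "path" with an existing directory)
setwd("path")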

# Example data frame
df <- data.frame("Age" = c(21, 19, 25, 34, 45, 63, 39, 28, 50, 39), 
                 "Name" = c("Christine", "Kim", "Kevin", "Aishwarya", "Rafel", "Bettina", "Joshua", "Afreen", "Wang", "Kerubo"),
                 "Salary in $" = c(2137.52, 1515.79, 2212.81, 2500.28, 2660, 4567.45, 2733, 3314, 5757.11, 4435.99),
                 "Gender" = c("Female", "Female", "Male", "Female", "Male", "Female", "Male", "Female", "Male", "Male"),
                 "Height in cm" = c(172, 166, 191, 169, 179, 177, 181, 155, 154, 183),
                 "Weight in kg" = c(60, 70, 88, 48, 71, 51, 65, 44, 53, 91),
                 stringsAsFactors = TRUE)  # needed from R 4.0.0 on to get the factor columns shown below

Let us check the structure of df:

str(df)
'data.frame':   10 obs. of  6 variables:
$ Age         : num  21 19 25 34 45 63 39 28 50 39
$ Name        : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num  2138 1516 2213 2500 2660 ...
$ Gender      : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num  172 166 191 169 179 177 181 155 154 183
$ Weight.in.kg: num  60 70 88 48 71 51 65 44 53 91

We see that Age, Salary, Height and Weight are numeric and Name and Gender are categorical (factor variables).

Let us scale just the numeric variables using only base R:

1) Option: (slight modification of what akrun has proposed here)

start_time1 <- Sys.time()
df1 <- as.data.frame(lapply(df, function(x) if(is.numeric(x)){
  (x-mean(x))/sd(x)
} else x))
end_time1 <- Sys.time()
end_time1 - start_time1

Time difference of 0.02717805 secs
str(df1)
'data.frame':   10 obs. of  6 variables:
$ Age         : num  -1.105 -1.249 -0.816 -0.166 0.628 ...
$ Name        : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num  -0.787 -1.255 -0.731 -0.514 -0.394 ...
$ Gender      : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num  -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
$ Weight.in.kg: num  -0.254 0.365 1.478 -0.996 0.427 ...

2) Option: (akrun's approach)

start_time2 <- Sys.time()
df2 <- as.data.frame(lapply(df, function(x) if(is.numeric(x)){
  scale(x, center=TRUE, scale=TRUE)
} else x))
end_time2 <- Sys.time()
end_time2 - start_time2

Time difference of 0.02599907 secs
str(df2)
'data.frame':   10 obs. of  6 variables:
$ Age         : num  -1.105 -1.249 -0.816 -0.166 0.628 ...
$ Name        : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num  -0.787 -1.255 -0.731 -0.514 -0.394 ...
$ Gender      : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num  -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
$ Weight.in.kg: num  -0.254 0.365 1.478 -0.996 0.427 ...
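
Options 1 and 2 produce the same numbers because `scale()` with its default arguments subtracts the column mean and divides by the sample standard deviation. A quick check of that equivalence, sketched on a short illustrative vector `v` (the first five Age values from the example):

```r
v <- c(21, 19, 25, 34, 45)  # illustrative values

manual  <- (v - mean(v)) / sd(v)                             # Option 1 style
builtin <- as.vector(scale(v, center = TRUE, scale = TRUE))  # Option 2 style

all.equal(manual, builtin)
```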

3) Option:

start_time3 <- Sys.time()
indices <- sapply(df, is.numeric)
df3 <- df
df3[indices] <- lapply(df3[indices], scale)
end_time3 <- Sys.time()
end_time3 - start_time3

str(df3)
'data.frame':   10 obs. of  6 variables:
$ Age         : num [1:10, 1] -1.105 -1.249 -0.816 -0.166 0.628 ...
..- attr(*, "scaled:center")= num 36.3
..- attr(*, "scaled:scale")= num 13.8
$ Name        : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num [1:10, 1] -0.787 -1.255 -0.731 -0.514 -0.394 ...
..- attr(*, "scaled:center")= num 3183
..- attr(*, "scaled:scale")= num 1329
$ Gender      : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num [1:10, 1] -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
..- attr(*, "scaled:center")= num 173
..- attr(*, "scaled:scale")= num 12
$ Weight.in.kg: num [1:10, 1] -0.254 0.365 1.478 -0.996 0.427 ...
..- attr(*, "scaled:center")= num 64.1
..- attr(*, "scaled:scale")= num 16.2
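
The `scaled:center` and `scaled:scale` attributes in this output record the mean and standard deviation that `scale()` used. That makes it possible to undo the scaling, or to apply the same transformation to new values. A small sketch using the Age values from the example; `new_age` is a hypothetical new observation introduced only for illustration:

```r
age <- c(21, 19, 25, 34, 45, 63, 39, 28, 50, 39)

z   <- scale(age)                # one-column matrix carrying the attributes
ctr <- attr(z, "scaled:center")  # mean used for centring
scl <- attr(z, "scaled:scale")   # standard deviation used for scaling

# Undo the scaling to recover the original values
restored <- as.vector(z) * scl + ctr
all.equal(restored, age)

# Apply the same centre and scale to a new value (hypothetical)
new_age   <- 30
new_age_z <- (new_age - ctr) / scl
```

Note that these attributes are lost once the column is converted with `as.vector()`, so they need to be saved beforehand if they are required later.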

4) Option (using tidyverse and invoking dplyr):

library(tidyverse)
start_time4 <- Sys.time()
df4 <- df %>% dplyr::mutate_if(is.numeric, scale)
end_time4 <- Sys.time()
end_time4 - start_time4

Time difference of 0.012043 secs
str(df4)
'data.frame':   10 obs. of  6 variables:
$ Age         : num [1:10, 1] -1.105 -1.249 -0.816 -0.166 0.628 ...
..- attr(*, "scaled:center")= num 36.3
..- attr(*, "scaled:scale")= num 13.8
$ Name        : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
$ Salary.in.. : num [1:10, 1] -0.787 -1.255 -0.731 -0.514 -0.394 ...
..- attr(*, "scaled:center")= num 3183
..- attr(*, "scaled:scale")= num 1329
$ Gender      : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
$ Height.in.cm: num [1:10, 1] -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
..- attr(*, "scaled:center")= num 173
..- attr(*, "scaled:scale")= num 12
$ Weight.in.kg: num [1:10, 1] -0.254 0.365 1.478 -0.996 0.427 ...
..- attr(*, "scaled:center")= num 64.1
..- attr(*, "scaled:scale")= num 16.2
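
A single `Sys.time()` difference on a 10-row data frame is dominated by noise, so the timings above are only a rough indication. One way to get a steadier figure is to repeat the call many times inside `system.time()`. This is a sketch under assumptions: `df_demo` and `scale_numeric()` are made-up names for illustration, and the repetition count is arbitrary:

```r
set.seed(1)

# Hypothetical larger data frame for timing purposes
df_demo <- data.frame(
  a = rnorm(1000),
  b = runif(1000),
  g = factor(sample(c("x", "y"), 1000, replace = TRUE))
)

# Same approach as Option 3, wrapped in a function
scale_numeric <- function(d) {
  idx <- sapply(d, is.numeric)
  d[idx] <- lapply(d[idx], scale)
  d
}

# Repeat the call so that the elapsed time becomes measurable
timing <- system.time(for (i in 1:200) scale_numeric(df_demo))
timing[["elapsed"]]
```

The other options can be wrapped in the same way and compared on identical data.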

Which option to choose depends on the output structure you need and on speed. Suppose your data is imbalanced and you want to balance it and then run a classifier on the scaled variables. In that case the one-column matrix structure of the scaled numeric variables (Age, Salary, Height and Weight) will cause problems. For example:

str(df4$Age)
 num [1:10, 1] -1.105 -1.249 -0.816 -0.166 0.628 ...
 - attr(*, "scaled:center")= num 36.3
 - attr(*, "scaled:scale")= num 13.8

The ROSE package (which balances data), for example, does not accept data structures other than int, factor and num, so it will throw an error on these matrix columns.

To avoid this issue, the numeric variables after scaling can be saved as vectors instead of a column matrix by:

library(tidyverse)
start_time4 <- Sys.time()
df4 <- df %>% dplyr::mutate_if(is.numeric, ~ scale(.) %>% as.vector)
end_time4 <- Sys.time()
end_time4 - start_time4

Time difference of 0.01400399 secs
str(df4)
'data.frame':   10 obs. of  6 variables:
 $ Age         : num  -1.105 -1.249 -0.816 -0.166 0.628 ...
 $ Name        : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
 $ Salary.in.. : num  -0.787 -1.255 -0.731 -0.514 -0.394 ...
 $ Gender      : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
 $ Height.in.cm: num  -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
 $ Weight.in.kg: num  -0.254 0.365 1.478 -0.996 0.427 ...
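
`mutate_if()` still works, but current dplyr releases mark it as superseded in favour of `across()`. A sketch of the equivalent call, assuming dplyr 1.0.0 or later is installed and using a small made-up data frame `people` (hypothetical, for illustration only):

```r
library(dplyr)

# Hypothetical example data
people <- data.frame(
  Age    = c(21, 19, 25, 34),
  Gender = factor(c("Female", "Female", "Male", "Female")),
  Weight = c(60, 70, 88, 48)
)

# Scale every numeric column and keep the result as a plain vector
people_scaled <- people %>%
  mutate(across(where(is.numeric), ~ as.vector(scale(.x))))

str(people_scaled)
```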
