I have a data frame with customers in rows and periods (months) in columns. I use this format for clustering. I want to scale the values within each row, that is, per customer across all periods. I can do it with the following code, but there are some problems:
- The code is too complex for something that should be a simple operation.
- The "scale" function returns "NaN" in some cases.
- Entering explicit customer names (vars = c("A","B",...)) will not work, since the real data has thousands of customers.
Here is my sample data and code:
mydata
  cust P1  P2 P3  P4 P5  P6 P7  P8 P9 P10 P11 P12 P13 P14 P15 P16 P17 P18 P19 P20
1    A  1 1.0  1 1.0  1 1.0  1 1.0  1 1.0   1 1.0   1 1.0   1 1.0   1 1.0   1 1.0
2    B  5 5.0  5 5.0  5 5.0  5 5.0  5 5.0   5 5.0   5 5.0   5 5.0   5 5.0   5 5.0
3    C  9 9.0  9 9.0  9 9.0  9 9.0  9 9.0   9 9.0   9 9.0   9 9.0   9 9.0   9 9.0
4    D  0 1.0  2 1.0  0 1.0  2 1.0  0 1.0   2 1.0   0 1.0   2 1.0   0 1.0   2 1.0
5    E  4 5.0  6 5.0  4 5.0  6 5.0  4 5.0   6 5.0   4 5.0   6 5.0   4 5.0   6 5.0
6    F  8 9.0 10 9.0  8 9.0 10 9.0  8 9.0  10 9.0   8 9.0  10 9.0   8 9.0  10 9.0
7    G  2 1.5  1 0.5  0 0.5  1 1.5  2 1.5   1 0.5   0 0.5   1 1.5   2 1.5   1 0.5
8    H  6 5.5  5 4.5  4 4.5  5 5.5  6 5.5   5 4.5   4 4.5   5 5.5   6 5.5   5 4.5
9    I 10 9.5  9 8.5  8 8.5  9 9.5 10 9.5   9 8.5   8 8.5   9 9.5  10 9.5   9 8.5
The code that I am using:
library(dplyr)
library(tidyr)
# first transpose the data
g_mydata = mydata %>% gather(period,value,-cust)
spr_mydata = g_mydata %>% spread(cust,value)
# then scale each customer's values across periods (customers are now columns)
sc_mydata = spr_mydata %>%
  mutate_each_(funs(scale), vars = c("A","B","C","D","E","F","G","H","I"))
# then transpose again back to original format
g_scdata = sc_mydata %>% gather(cust,value,-period)
scaled_data = g_scdata %>% spread(period,value)
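For comparison, here is a minimal base R sketch of the row-wise scaling described above. It is an illustration under stated assumptions, not a definitive answer. It relies on scale() working column-wise, so the numeric part is transposed, scaled, and transposed back, and the period columns are picked up by name rather than listing customers. The helper name scale_rows and its id_col argument are hypothetical. The sketch uses only three customers and four periods from the sample for brevity. Rows with constant values (such as A, B and C) have a standard deviation of zero, so scale() computes 0/0 and returns NaN. Replacing those NaN values with 0 is an assumption about the desired behaviour.

```r
# Small excerpt of the sample data: customers A, D, G and periods P1..P4
mydata <- data.frame(
  cust = c("A", "D", "G"),
  P1 = c(1, 0, 2),
  P2 = c(1, 1, 1.5),
  P3 = c(1, 2, 1),
  P4 = c(1, 1, 0.5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: z-score each row over all columns except id_col
scale_rows <- function(df, id_col = "cust") {
  period_cols <- setdiff(names(df), id_col)
  m <- as.matrix(df[period_cols])
  # scale() standardises columns, so transpose, scale, transpose back
  scaled <- t(scale(t(m)))
  # Constant rows have sd = 0, which gives 0/0 = NaN;
  # setting them to 0 is an assumption, adjust as needed
  scaled[is.nan(scaled)] <- 0
  out <- df
  out[period_cols] <- scaled
  out
}

scaled_data <- scale_rows(mydata)
print(scaled_data)
```

Because the period columns are selected with setdiff(), the same call should work unchanged when there are thousands of customers, and no reshaping with gather/spread is needed.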
Thanks for any help or suggestions.