I have a data frame with customer information in rows and periods (months) in columns. I use this format for clustering purposes. I want to scale the values in the rows. I can do it with the following code, but there are some problems:

  1. The code is too complex for something that should be a simple operation.
  2. The "scale" function returns "NaN" in some cases.
  3. Entering explicit customer names (vars=c("A","B",...)) will not work since the real data has thousands of customers.

Here is my sample data and code:

mydata 
  cust P1  P2 P3  P4 P5  P6 P7  P8 P9 P10 P11 P12 P13 P14 P15 P16 P17 P18 P19 P20
1    A  1 1.0  1 1.0  1 1.0  1 1.0  1 1.0   1 1.0   1 1.0   1 1.0   1 1.0   1 1.0
2    B  5 5.0  5 5.0  5 5.0  5 5.0  5 5.0   5 5.0   5 5.0   5 5.0   5 5.0   5 5.0
3    C  9 9.0  9 9.0  9 9.0  9 9.0  9 9.0   9 9.0   9 9.0   9 9.0   9 9.0   9 9.0
4    D  0 1.0  2 1.0  0 1.0  2 1.0  0 1.0   2 1.0   0 1.0   2 1.0   0 1.0   2 1.0
5    E  4 5.0  6 5.0  4 5.0  6 5.0  4 5.0   6 5.0   4 5.0   6 5.0   4 5.0   6 5.0
6    F  8 9.0 10 9.0  8 9.0 10 9.0  8 9.0  10 9.0   8 9.0  10 9.0   8 9.0  10 9.0
7    G  2 1.5  1 0.5  0 0.5  1 1.5  2 1.5   1 0.5   0 0.5   1 1.5   2 1.5   1 0.5
8    H  6 5.5  5 4.5  4 4.5  5 5.5  6 5.5   5 4.5   4 4.5   5 5.5   6 5.5   5 4.5
9    I 10 9.5  9 8.5  8 8.5  9 9.5 10 9.5   9 8.5   8 8.5   9 9.5  10 9.5   9 8.5

Code that I am using:

library(dplyr)
library(tidyr)
# first transpose the data
g_mydata = mydata %>% gather(period,value,-cust)
spr_mydata = g_mydata %>% spread(cust,value)
# then scale each customer's values across the periods
sc_mydata = spr_mydata %>% 
      mutate_each_(funs(scale),vars = c("A","B","C","D","E","F","G","H","I") )   
# then transpose again back to original format
g_scdata = sc_mydata %>% gather(cust,value,-period)
scaled_data = g_scdata %>% spread(period,value)

Thanks for any help or suggestions.

  • I don't really understand what you want to achieve. Do you want to scale each row individually? Scale the data by customer ID? Commented Nov 11, 2015 at 1:36
  • I want to scale each individual customer so that I can match patterns. There are three different patterns in the sample data, customers {ABC}, {DEF}, and {GHI}. Each group has the same pattern, but at different scales. Commented Nov 11, 2015 at 1:45

2 Answers

You could always try apply():

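# scale each row (customer) of the original data; apply() returns one column per row, so transpose back
sc_mydata = t(apply(mydata[, -1], 1, scale))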

Since scale() already works column by column, you could also skip apply() and run it directly on the transposed spr_mydata (customers in columns), dropping the period column first:

scale(spr_mydata[, -1])
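
On the NaNs mentioned in the question: scale() centers the values and divides by their standard deviation, so a customer whose values never change (A, B and C in the sample) has a standard deviation of 0 and ends up as 0/0. A minimal sketch on made-up vectors, not the asker's data:

```r
# Hypothetical toy vectors for illustration only
constant <- c(5, 5, 5, 5)   # a customer with no variation
varying  <- c(4, 5, 6, 5)   # a customer with some variation

sd(constant)                # standard deviation of a constant vector is 0

# scale() returns a one-column matrix; as.numeric() flattens it to a vector
scaled_constant <- as.numeric(scale(constant))  # 0/0 in every position
scaled_varying  <- as.numeric(scale(varying))   # centered, unit sd

is.nan(scaled_constant)
```

So either approach above will probably still need the constant customers handled separately, for example by setting them to 0.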

2 Comments

Thanks for the suggestion. The "apply" family hurts my head, I am starting to use the dplyr / tidyr packages (but I am still learning).
That apply command would just run 'scale' on each row of your data (and I removed the first column with all the customer names so it would work). Then it packages up the results into a matrix for you. This is exactly the kind of problem it's designed to handle.
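
To make the row-wise behaviour a bit more concrete, here is a small sketch on a made-up two-customer matrix (the names `m`, `by_row` and `scaled` are only for illustration). One thing that tends to surprise people is that apply() over rows hands back one column per input row, so a transpose is needed to get customers back into rows:

```r
# Hypothetical 2-customer, 4-period matrix (rows are customers)
m <- matrix(c(0, 1, 2, 1,
              4, 5, 6, 5),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("D", "E"), paste0("P", 1:4)))

# Run scale() on each row; each result becomes a column of the output
by_row <- apply(m, 1, function(x) as.numeric(scale(x)))
dim(by_row)

# Transpose so that customers are in rows again, and restore the labels
scaled <- t(by_row)
dimnames(scaled) <- dimnames(m)
scaled
```

Both rows follow the same pattern at different levels, so after scaling they should come out the same.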

Here is a dplyr way of doing it.

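library(dplyr)
library(tidyr)

# reshape to long format: one row per customer and period
long_data = 
  mydata %>% 
  gather(period, value, -cust)

# customers whose values vary (sd != 0) and can therefore be scaled
to_scale = 
  long_data %>%
  group_by(cust) %>%
  summarize(sd = sd(value)) %>%
  filter(sd != 0) %>%
  select(-sd)

# constant customers: scale() would return NaN, so set them to 0
flat = 
  long_data %>%
  anti_join(to_scale, by = "cust") %>%
  mutate(value = 0)

# scale each remaining customer, then go back to wide format
wide_scale = 
  long_data %>%
  right_join(to_scale, by = "cust") %>%
  group_by(cust) %>%
  mutate(value = 
           value %>%
           scale %>%
           as.numeric %>%   # scale() returns a one-column matrix
           signif(7)) %>%   # round so identical patterns compare equal
  ungroup %>%               # otherwise select(-cust) below keeps cust
  bind_rows(flat) %>%
  spread(period, value)

# one row per distinct scaled pattern
type = 
  wide_scale %>%
  select(-cust) %>%
  distinct %>%
  mutate(type_ID = 1:n())

# map each customer to its pattern (joins on all period columns)
customer__type = 
  type %>%
  left_join(wide_scale) %>%
  select(type_ID, cust)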
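
If the split into scalable and constant customers feels heavy, a shorter variant of the same idea is to guard against zero variance inside a small helper and apply it per customer. This is only a sketch under assumptions: `safe_scale` is a hypothetical helper name, the toy data frame is made up, and setting constant customers to 0 is a choice rather than the only sensible one.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: z-score a vector, returning zeros when it is constant
safe_scale <- function(x) {
  s <- sd(x)
  if (is.na(s) || s == 0) return(rep(0, length(x)))
  (x - mean(x)) / s
}

# Made-up toy data in the same layout as the question (customers in rows)
toy <- data.frame(cust = c("A", "D", "E"),
                  P1 = c(1, 0, 4), P2 = c(1, 1, 5),
                  P3 = c(1, 2, 6), P4 = c(1, 1, 5),
                  stringsAsFactors = FALSE)

scaled_toy <- toy %>%
  gather(period, value, -cust) %>%               # long: one row per cust/period
  group_by(cust) %>%
  mutate(value = signif(safe_scale(value), 7)) %>%  # scale within each customer
  ungroup() %>%
  spread(period, value)                          # back to customers in rows

scaled_toy
```

Customer A is constant and comes out as all zeros, while D and E share a pattern and should end up with the same scaled row.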

1 Comment

Thanks! This works, and I love that it is with dplyr, it makes it easy to follow what is going on.
