Scale rows of data

Question

I have a data frame with customer information in rows and periods (months) in columns. I use this format for clustering purposes. I want to scale the values in the rows. I can do it with the following code, but there are some problems:

The code is too complex for something that should be a simple operation.
The "scale" function returns "NaN" in some cases.
Entering explicit customer names (vars = c("A","B",...)) will not work, since the real data has thousands of customers.

Here is my sample data and code:

mydata 
  cust P1  P2 P3  P4 P5  P6 P7  P8 P9 P10 P11 P12 P13 P14 P15 P16 P17 P18 P19 P20
1    A  1 1.0  1 1.0  1 1.0  1 1.0  1 1.0   1 1.0   1 1.0   1 1.0   1 1.0   1 1.0
2    B  5 5.0  5 5.0  5 5.0  5 5.0  5 5.0   5 5.0   5 5.0   5 5.0   5 5.0   5 5.0
3    C  9 9.0  9 9.0  9 9.0  9 9.0  9 9.0   9 9.0   9 9.0   9 9.0   9 9.0   9 9.0
4    D  0 1.0  2 1.0  0 1.0  2 1.0  0 1.0   2 1.0   0 1.0   2 1.0   0 1.0   2 1.0
5    E  4 5.0  6 5.0  4 5.0  6 5.0  4 5.0   6 5.0   4 5.0   6 5.0   4 5.0   6 5.0
6    F  8 9.0 10 9.0  8 9.0 10 9.0  8 9.0  10 9.0   8 9.0  10 9.0   8 9.0  10 9.0
7    G  2 1.5  1 0.5  0 0.5  1 1.5  2 1.5   1 0.5   0 0.5   1 1.5   2 1.5   1 0.5
8    H  6 5.5  5 4.5  4 4.5  5 5.5  6 5.5   5 4.5   4 4.5   5 5.5   6 5.5   5 4.5
9    I 10 9.5  9 8.5  8 8.5  9 9.5 10 9.5   9 8.5   8 8.5   9 9.5  10 9.5   9 8.5

The code that I am using:

library(dplyr)
library(tidyr)
# first transpose the data
g_mydata = mydata %>% gather(period,value,-cust)
spr_mydata = g_mydata %>% spread(cust,value)
# then scale the values for each customer (now one column per customer)
sc_mydata = spr_mydata %>% 
      mutate_each_(funs(scale),vars = c("A","B","C","D","E","F","G","H","I") )   
# then transpose again back to original format
g_scdata = sc_mydata %>% gather(cust,value,-period)
scaled_data = g_scdata %>% spread(period,value)

Thanks for any help or suggestions.

I don't really understand what you want to achieve. Do you want to scale each row individually? Scale the data by customer ID?
– user3710546, Commented Nov 11, 2015 at 1:36
I want to scale each individual customer so that I can match patterns. There are three different patterns in the sample data: customers {ABC}, {DEF}, and {GHI}. Each group has the same pattern, but at a different scale.
– Paul, Commented Nov 11, 2015 at 1:45

wmay · Accepted Answer · 2015-11-11 01:49:00Z

7

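You could always try apply():

# scale each row (customer); apply() returns each row's result as a column, so transpose back
sc_mydata = t(apply(mydata[, -1], 1, scale))

If the NaNs are messing that up, you could take the transposed data, spr_mydata (customers in columns), and try running scale() directly on everything except the period column:

scale(spr_mydata[, -1])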
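
The NaN values appear for customers whose values never change: their standard deviation is 0, so scale() ends up dividing 0 by 0. A minimal base R sketch of row-wise scaling that guards against this is shown here. It is an illustration rather than part of the original answer: the small `mydata` is a cut-down stand-in for the question's data, `scale_row` is a hypothetical helper name, and mapping constant rows to 0 is an assumption about what is wanted.

```r
# Cut-down stand-in for the question's data (three customers, four periods).
mydata <- data.frame(
  cust = c("A", "D", "G"),
  P1 = c(1, 0, 2),
  P2 = c(1, 1, 1.5),
  P3 = c(1, 2, 1),
  P4 = c(1, 1, 0.5)
)

# Hypothetical helper: z-score one row, returning zeros when the row is
# constant (sd == 0), where scale() would otherwise produce NaN.
scale_row <- function(x) {
  s <- sd(x)
  if (s == 0) rep(0, length(x)) else (x - mean(x)) / s
}

vals <- as.matrix(mydata[, -1])           # numeric columns only
scaled <- t(apply(vals, 1, scale_row))    # apply() puts each row's result in a column
dimnames(scaled) <- list(as.character(mydata$cust), colnames(vals))

# Back to the original layout: one row per customer, one column per period.
scaled_data <- data.frame(cust = mydata$cust, scaled, row.names = NULL)
```

Because the customer names are never typed out, the same call works unchanged for thousands of rows.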

answered Nov 11, 2015 at 1:49

2 Comments

Paul Over a year ago

Thanks for the suggestion. The "apply" family hurts my head; I am starting to use the dplyr / tidyr packages (but I am still learning).

wmay Over a year ago

That apply command would just run 'scale' on each row of your data (and I removed the first column with all the customer names so it would work). Then it packages up the results into a matrix for you. This is exactly the kind of problem it's designed to handle.

bramtayl · Accepted Answer · 2015-11-11 02:03:18Z

2

Here is a dplyr way of doing it.

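# reshape to long format: one row per customer and period
long_data = 
  mydata %>% 
  gather(period, value, -cust)

# customers whose values vary (sd != 0) and can therefore be scaled
to_scale = 
  long_data %>%
  group_by(cust) %>%
  summarize(sd = sd(value)) %>%
  filter(sd != 0) %>%
  select(-sd)

# constant customers: scale() would return NaN for them, so set them to 0
flat = 
  long_data %>%
  anti_join(to_scale, by = "cust") %>%
  mutate(value = 0)

# scale the remaining customers, add the flat ones back, return to wide format
wide_scale = 
  long_data %>%
  right_join(to_scale, by = "cust") %>%
  group_by(cust) %>%
  mutate(value = 
           value %>%
           scale %>%
           as.numeric %>%   # scale() returns a one-column matrix
           signif(7)) %>%   # round so identical patterns compare equal
  ungroup %>%
  bind_rows(flat) %>%
  spread(period, value)

# one row per distinct scaled pattern, each with its own ID
type = 
  wide_scale %>%
  select(-cust) %>%
  distinct %>%
  mutate(type_ID = 1:n())

# map each customer to its pattern ID (joins on all period columns)
customer__type = 
  type %>%
  left_join(wide_scale) %>%
  select(type_ID, cust)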
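
For readers on newer package versions, the same idea can be sketched more compactly. This is not part of the original answer. It assumes tidyr >= 1.0 and dplyr >= 1.0 (for pivot_longer(), pivot_wider(), across() and cur_group_id()), it uses a cut-down stand-in for `mydata`, and `safe_scale` is a hypothetical helper name.

```r
library(dplyr)
library(tidyr)

# Cut-down stand-in for the question's data: two flat and two varying customers.
mydata <- data.frame(
  cust = c("A", "B", "D", "E"),
  P1 = c(1, 5, 0, 4),
  P2 = c(1, 5, 1, 5),
  P3 = c(1, 5, 2, 6),
  P4 = c(1, 5, 1, 5)
)

# Hypothetical helper: z-score a vector, returning zeros when it is constant
# (sd == 0) instead of the NaN that scale() would give.
safe_scale <- function(x) {
  s <- sd(x)
  if (s == 0) rep(0, length(x)) else (x - mean(x)) / s
}

# Long format, scale within each customer, then back to wide format.
scaled_data <- mydata %>%
  pivot_longer(-cust, names_to = "period", values_to = "value") %>%
  group_by(cust) %>%
  mutate(value = signif(safe_scale(value), 7)) %>%
  ungroup() %>%
  pivot_wider(names_from = period, values_from = value)

# Customers with identical scaled rows share a pattern ID.
patterns <- scaled_data %>%
  group_by(across(-cust)) %>%
  mutate(type_ID = cur_group_id()) %>%
  ungroup() %>%
  select(type_ID, cust)
```

Here the flat customers are handled inside the helper, so the separate anti_join() and bind_rows() steps are not needed.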

edited Nov 11, 2015 at 2:03

answered Nov 11, 2015 at 1:52

1 Comment

Paul Over a year ago

Thanks! This works, and I love that it uses dplyr; it makes it easy to follow what is going on.
