Creating multiple new summarizing columns in data.table

Question

I am trying to create a summary data table from an existing data.table in R. For this purpose rather than looping across all variables, i am trying to leverage data.table ":=" function in R

raw_df <- cbind.data.frame(1:4,7:10,15:18)
names(raw_df) <- c('A','B','C')
raw_df$sample <- 1

What i want is the mean and quantile of each columns in a new data.table. below is the code i am using but this only updates the existing table. How can i create a new data.table without writing each of the variable names in the code.

setDT(raw_df)[, c("Unique_Count","Mean",paste("Quantile_",seq(0,1,0.05),sep = "")) 
    := lapply(.SD, function(x) c(length(unique(x)),mean(x,na.rm = T),quantile(x,probs = seq(0,1,0.05),na.rm = T))), 
    by = sample]

Thanks in anticipation!

Jaap · Accepted Answer · 2018-02-23 09:58:53Z

3

A possible solution:

melt(setDT(raw_df), id = 'sample')[, {q <- quantile(value, probs = seq(0,1,0.05), na.rm = TRUE);
                                      c(unique_count = uniqueN(value), means = mean(value, na.rm = TRUE), as.list(q))}
                                   , by = .(sample, variable)]

which gives:

   sample variable unique_count means 0%    5%  10%   15%  20%   25%  30%   35%  40%   45%  50%   55%  60%   65%  70%   75%  80%   85%  90%   95% 100%
1:      1        A            4   2.5  1  1.15  1.3  1.45  1.6  1.75  1.9  2.05  2.2  2.35  2.5  2.65  2.8  2.95  3.1  3.25  3.4  3.55  3.7  3.85    4
2:      1        B            4   8.5  7  7.15  7.3  7.45  7.6  7.75  7.9  8.05  8.2  8.35  8.5  8.65  8.8  8.95  9.1  9.25  9.4  9.55  9.7  9.85   10
3:      1        C            4  16.5 15 15.15 15.3 15.45 15.6 15.75 15.9 16.05 16.2 16.35 16.5 16.65 16.8 16.95 17.1 17.25 17.4 17.55 17.7 17.85   18

NOTE: it is better not to use the %-character in column names. To prevent that, you can use:

melt(setDT(raw_df), id = 'sample')[, {q <- quantile(value, probs = seq(0,1,0.05), na.rm = TRUE);
                                      names(q) <- paste0('p_', gsub('%','',names(q)));
                                      c(unique_count = uniqueN(value), means = mean(value, na.rm = TRUE), as.list(q))}
                                   , by = .(sample, variable)]

which gives:

   sample variable unique_count means p_0   p_5 p_10  p_15 p_20  p_25 p_30  p_35 p_40  p_45 p_50  p_55 p_60  p_65 p_70  p_75 p_80  p_85 p_90  p_95 p_100
1:      1        A            4   2.5   1  1.15  1.3  1.45  1.6  1.75  1.9  2.05  2.2  2.35  2.5  2.65  2.8  2.95  3.1  3.25  3.4  3.55  3.7  3.85     4
2:      1        B            4   8.5   7  7.15  7.3  7.45  7.6  7.75  7.9  8.05  8.2  8.35  8.5  8.65  8.8  8.95  9.1  9.25  9.4  9.55  9.7  9.85    10
3:      1        C            4  16.5  15 15.15 15.3 15.45 15.6 15.75 15.9 16.05 16.2 16.35 16.5 16.65 16.8 16.95 17.1 17.25 17.4 17.55 17.7 17.85    18

edited Feb 23, 2018 at 9:58

answered Feb 23, 2018 at 8:11

Jaap

83.7k36 gold badges190 silver badges203 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Vivek Singh Over a year ago

thanks! works really well! Only problem is melt(setDT(raw_df), measure.vars = 1:3) will create a much larger data records in cases of large files.

Jaap Over a year ago

@VivekSingh that shouldn't be a problem for data.table imo (unless the size of the dataset exceeds your RAM)

Collectives™ on Stack Overflow

Creating multiple new summarizing columns in data.table

1 Answer 1

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related