How can one, having a data.table with mostly numeric values, transform just a subset of columns and put them back to the original data table? Generally, I don't want to add any summary statistic as a separate column, just exchange the transformed ones.

Assume we have a DT. It has 1 column with names and 10 columns with numeric values. I am interested in using "scale" function of base R for each row of that data table, but only applied to those 10 numeric columns.

And to expand on this. What if I have a data table with more columns and I need to use column names to tell the scale function on which datapoints to apply the function?

With regular data.frame I would just do:

df[,grep("keyword",colnames(df))] <- t(apply(df[,grep("keyword",colnames(df))],1,scale))

I know this looks cumbersome, but it has always worked for me. However, I can't figure out a simple way to do it with a data.table.

I would imagine something like this to work for a data.table:

dt[,grep("keyword",colnames(dt)) := scale(grep("keyword",colnames(dt)),center=F)]

But it doesn't.

EDIT:

Here is another way to update the columns with their per-row-scaled versions, where dt is a data.table object:

dt[,grep("keyword",colnames(dt),value=T) := as.data.table(t(apply(dt[,grep("keyword",colnames(dt)),with=F],1,scale)))]

Too bad it needs the "as.data.table" part inside, as the transposed value from apply function is a matrix. Maybe data.table should automatically coerce matrices into data.tables upon updating of columns?
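To make the edit above concrete, here is a minimal sketch on toy data. The table and its column names (name, keyword_1, keyword_2, keyword_3) are made up for illustration; the approach is the same apply/transpose/as.data.table pattern as in the one-liner.

```r
library(data.table)

# Toy data (hypothetical): one name column plus three numeric "keyword" columns
dt <- data.table(name      = c("a", "b", "c"),
                 keyword_1 = c(1, 2, 3),
                 keyword_2 = c(4, 6, 8),
                 keyword_3 = c(7, 10, 13))

# Column names (not positions) of the columns to transform
cols <- grep("keyword", colnames(dt), value = TRUE)

# apply() over rows returns one column per input row, so transpose back
scaled <- t(apply(as.matrix(dt[, cols, with = FALSE]), 1, scale))

# := expects a list of columns, so convert the matrix before assigning
dt[, (cols) := as.data.table(scaled)]

print(dt)
```

Each row here has evenly spaced values, so every row should come out centered on zero; the name column is left untouched.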

2 Answers

PART 1: The one line solution you requested:

# First, let's take a look at the data in the columns:
DT[, .SD, .SDcols = grep("keyword", colnames(DT))]

One-line Solution Version 1: Use magrittr and the pipe operator:

DT[, (grep("keyword", colnames(DT))) := (lapply(.SD, . %>% scale(., center = FALSE))),
    .SDcols = grep("keyword", colnames(DT))]

One-line Solution Version 2: Explicitly define the function for the lapply:

DT[, (grep("keyword", colnames(DT))) := 
     (lapply(.SD, function(x){scale(x, center = FALSE)})), 
     .SDcols = grep("keyword", colnames(DT))]

Modification: if you want to do it by group, just add the by argument:

DT[  , (grep("keyword", colnames(DT))) := 
              (lapply(.SD, function(x){scale(x, center = FALSE)}))
     , .SDcols = grep("keyword", colnames(DT))
     , by = Grouping.Variable]

You can verify:

# Verify that the columns have updated values:
DT[, .SD, .SDcols = grep("keyword", colnames(DT))]
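For anyone who wants to try Part 1 end to end, here is a small sketch on made-up data (the column names keyword_1, keyword_2 and other are assumptions). Note that lapply(.SD, ...) works on each column, so this scales per column rather than per row. scale() returns a one-column matrix, so the sketch wraps it in as.vector() to store plain numeric vectors; that wrapper is my addition, not part of the one-liners above.

```r
library(data.table)

# Toy data (hypothetical column names)
DT <- data.table(name      = c("a", "b", "c"),
                 keyword_1 = c(1, 2, 3),
                 keyword_2 = c(10, 20, 30),
                 other     = c(5, 5, 5))

# Select target columns by name pattern
cols <- grep("keyword", colnames(DT), value = TRUE)

# Scale each selected column without centering and overwrite it in place;
# as.vector() drops the matrix dimensions and attributes that scale() adds
DT[, (cols) := lapply(.SD, function(x) as.vector(scale(x, center = FALSE))),
   .SDcols = cols]

print(DT)
```

With center = FALSE, scale() divides each column by its root mean square, sqrt(sum(x^2) / (n - 1)), so keyword_1 becomes c(1, 2, 3) / sqrt(7) while the other column is unchanged.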

PART 2: A Step-by-Step Solution: (more general and easier to follow)

The above solution works for the narrow example given.

As a public service, I am posting this for anyone who is still searching for a way that

  • feels a bit less condensed;
  • is easier to understand;
  • is more general, in the sense that you can apply any function you wish without having to compute the values into a separate data table first (which, n.b., does work perfectly here).

Here's the step-by-step way of doing the same:

Get the data into data.table format:

# You get a data.table called DT
DT <- as.data.table(df)

Then, handle the column names:

# Get the names of the columns to transform
# (value = TRUE makes grep return names rather than positions)
Reference.Cols <- grep("keyword", colnames(DT), value = TRUE)

# For people who want to store both transformed and untransformed values:
# create new column names
Reference.Cols.normalized <- paste0(Reference.Cols, ".normalized")

Define the function you want to apply:

# normalize centers a vector on its mean and divides by its standard deviation,
# much like scale() in the question with its default arguments:

normalize <- function(X, 
                      X.mean = mean(X, na.rm = TRUE), 
                      X.sd = sd(X, na.rm = TRUE))
                      {
                          X <- (X - X.mean) / X.sd
                          return(X)
                      }

After that, it is trivial in data.table syntax:

# Voila, a newly created set of columns that contain the transformed values
DT[, (Reference.Cols.normalized) := lapply(.SD, normalize), .SDcols = Reference.Cols]

Verify:

The new values are stored in the columns named in Reference.Cols.normalized:

DT[, .SD, .SDcols = Reference.Cols.normalized]

The untransformed values are left unharmed:

DT[, .SD, .SDcols = Reference.Cols]
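Put together, the steps of Part 2 can be sketched as one runnable script. The data frame and its column names are invented for illustration, and normalize is shortened to a one-line version without the optional mean and sd arguments.

```r
library(data.table)

# Toy data frame (hypothetical column names)
df <- data.frame(name      = c("a", "b", "c"),
                 keyword_1 = c(1, 2, 3),
                 keyword_2 = c(10, 20, 30))
DT <- as.data.table(df)

# Names of the source columns and of the new columns that will hold the results
Reference.Cols <- grep("keyword", colnames(DT), value = TRUE)
Reference.Cols.normalized <- paste0(Reference.Cols, ".normalized")

# Center on the mean and divide by the standard deviation
normalize <- function(X) (X - mean(X, na.rm = TRUE)) / sd(X, na.rm = TRUE)

# Add the transformed columns next to the originals
DT[, (Reference.Cols.normalized) := lapply(.SD, normalize), .SDcols = Reference.Cols]

print(DT)
```

To overwrite the original columns instead of adding new ones, use (Reference.Cols) on the left-hand side of :=.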

Hopefully, for those of you who return to look at code after some interval, this more step-by-step / general approach can be helpful.

If what you need is really to scale by row, you can try doing it in 2 steps:

# compute mean/sd:
mean_sd <- DT[, .(mean(unlist(.SD)), sd(unlist(.SD))), by=1:nrow(DT), .SDcols=grep("keyword",colnames(DT))]

# scale
DT[, grep("keyword",colnames(DT), value=TRUE) := lapply(.SD, function(x) (x-mean_sd$V1)/mean_sd$V2), .SDcols=grep("keyword",colnames(DT))]
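Here is a small self-contained sketch of the two steps above on toy data. The table and its column names are made up, and the summary columns are given the names row_mean and row_sd instead of the default V1 and V2 purely for readability.

```r
library(data.table)

# Toy data (hypothetical): one name column plus three numeric "keyword" columns
DT <- data.table(name      = c("a", "b", "c"),
                 keyword_1 = c(1, 2, 3),
                 keyword_2 = c(4, 6, 8),
                 keyword_3 = c(7, 10, 13))

cols <- grep("keyword", colnames(DT), value = TRUE)

# Step 1: one group per row, so mean and sd are taken across the selected columns
mean_sd <- DT[, .(row_mean = mean(unlist(.SD)), row_sd = sd(unlist(.SD))),
              by = 1:nrow(DT), .SDcols = cols]

# Step 2: each column is a vector with one element per row, so subtracting and
# dividing by the per-row vectors scales every row by its own mean and sd
DT[, (cols) := lapply(.SD, function(x) (x - mean_sd$row_mean) / mean_sd$row_sd),
   .SDcols = cols]

print(DT)
```

On this data every row is evenly spaced, so each row should end up as -1, 0, 1, which matches what t(apply(..., 1, scale)) gives for the same input.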

3 Comments

Thanks for the help! Do you think there is a shorter (or one-liner) way to do it? Seems a bit elaborate given the simple nature of the task.
@PiotrGrabowski, glad if it helped you, I don't understand why you deleted your Q though. It may be helpful to others. It probably could be a one-liner but I think it would make the code less clear.
I deleted it because I was given a downvote, so I thought that the question is of low "quality" so to speak. However, I also found out an additional way of doing what I was looking for, which I will append to my original post. Thanks!
