fast replacement of data.table values by labels stored in another data.table

Question

This is related to this question and this other one, although on a larger scale. I have two data.tables:

The first one with market research data, containing answers stored as integers;
The second one being what can be called a dictionary, with category labels associated to the integers mentioned above.

See the reproducible example:

EDIT: Addition of a new variable to include the '0' case.

EDIT 2: Modification of 'age_group' variable to include cases where all unique levels of a factor do not appear in data.

library(data.table)
library(magrittr)

# Table with survey data :
# - each observation contains the answers of a person
# - variables describe the sample population characteristics (gender, age...)
# - numeric variables (like age) are also stored as character vectors
repex_DT <- data.table (
  country = as.character(c(1,3,4,2,NA,1,2,2,2,4,NA,2,1,1,3,4,4,4,NA,1)),
  gender = as.character(c(NA,2,2,NA,1,1,1,2,2,1,NA,2,1,1,1,2,2,1,2,NA)),
  age = as.character(c(18,40,50,NA,NA,22,30,52,64,24,NA,38,16,20,30,40,41,33,59,NA)),
  age_group = as.character(c(2,2,2,NA,NA,2,2,2,2,2,NA,2,2,2,2,2,2,2,2,NA)),
  status = as.character(c(1,NA,2,9,2,1,9,2,2,1,9,2,1,1,NA,2,2,1,2,9)),
  children = as.character(c(0,2,3,1,6,1,4,2,4,NA,NA,2,1,1,NA,NA,3,5,2,1))
)

# Table of the labels associated to categorical variables, plus 'label_id' to match the values
labels_DT <- data.table (
  label_id = as.character(c(1:9)),
  country = as.character(c("COUNTRY 1","COUNTRY 2","COUNTRY 3","COUNTRY 4",NA,NA,NA,NA,NA)),
  gender = as.character(c("Male","Female",NA,NA,NA,NA,NA,NA,NA)),
  age_group = as.character(c("Less than 35","35 and more",NA,NA,NA,NA,NA,NA,NA)),
  status = as.character(c("Employed","Unemployed",NA,NA,NA,NA,NA,NA,"Do not want to say")),
  children = as.character(c("0","1","2","3","4","5 and more",NA,NA,NA))
)

# Identification of the variable nature (numeric or character)
var_type <- c("character","character","numeric","character","character","character")

# Identification of the categorical variable names
categorical_var <- names(repex_DT)[which(var_type == "character")]

You can see that the dictionary table is smaller than the survey data table; this is expected. Also, although all variables are stored as character, some are true numeric variables, like age, and consequently do not appear in the dictionary table. My objective is to replace the values of every variable in the first data.table that has a matching name in the dictionary table with the corresponding labels.

I have actually achieved it using a loop, like the one below:

result_DT1 <- copy(repex_DT) 
for (x in categorical_var){
  if(length(which(repex_DT[[x]]=="0"))==0){
    values_vector <- labels_DT$label_id
    labels_vector <- labels_DT[[x]]
  }else{
    values_vector <- c("0",labels_DT$label_id)
    labels_vector <- c(labels_DT[[x]][1:(length(labels_DT[[x]])-1)], NA, labels_DT[[x]][length(labels_DT[[x]])])}
  result_DT1[, (c(x)) := plyr::mapvalues(x=get(x), from=values_vector, to=labels_vector, warn_missing = F)]
}

What I want is a faster method (the fastest if one exists), since I have thousands of variables to recode for tens of thousands of records. Any performance improvement would be more than welcome. I battled with stringi but could not get the function to run without errors unless I hard-coded the variable names. See example:

test_stringi <- copy(repex_DT) %>% 
  .[, (c("country")) := lapply(.SD, function(x) stringi::stri_replace_all_fixed(
    str=x, pattern=unique(labels_DT$label_id)[!is.na(labels_DT[["country"]])],
    replacement=unique(na.omit(labels_DT[["country"]])), vectorize_all=FALSE)),
    .SDcols = c("country")]

Would switch() as a built-in function work for you (instead of a second table, where you need to spend about the same CPU time to separate the code/name part of the string), possibly vectorized in case of many variables? See e.g. stackoverflow.com/a/51562194/3414968
– Martin, Commented Mar 12, 2022 at 12:51
I am not familiar with switch, but it seems the ... argument cannot be updated on the fly via a function, or at least I cannot see how to do it as easily as with the from and to vectors of plyr::mapvalues. Considering that each variable has different labels, that is problematic.
– Maxence Dum., Commented Mar 12, 2022 at 17:24

det · Accepted Answer · 2022-03-14 06:53:37Z

2

The columns of your second data.table are just lookup vectors:

same_cols <- intersect(names(repex_DT), names(labels_DT))

repex_DT[
  , 
  (same_cols) := mapply(
    function(x, y) y[as.integer(x)], 
    repex_DT[, same_cols, with = FALSE], 
    labels_DT[, same_cols, with = FALSE],
    SIMPLIFY = FALSE
  )
]
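As a minimal base-R sketch of why a label column works as a lookup vector (the small vectors below are made up for illustration, not taken from the full tables): indexing by the integer code returns the label stored at that position. It also shows the limitation behind the question's '0' case, since R silently drops a zero index.

```r
# Labels for one variable, positioned so that label i sits at index i.
gender_labels <- c("Male", "Female")

# Codes as stored in the survey table (character, with missing values).
codes <- c(NA, "2", "2", "1")

# Integer indexing turns each code into its label; NA stays NA.
labelled <- gender_labels[as.integer(codes)]
print(labelled)

# A code of 0 is silently dropped by R's indexing rules, so the result
# ends up shorter than the input vector.
with_zero <- c("0", "2", "1")
dropped <- gender_labels[as.integer(with_zero)]
print(length(dropped))
```

Because the result for a column containing 0 is shorter than the column itself, assigning it back with := would fail or recycle, which is why variables coded from 0 need separate handling.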

edit

You can add NA in the first position of each column of labels_DT (similar to what you did for the other missing values) or, better yet, keep the labels in a list:

labels_list <- list(
  country = c("COUNTRY 1","COUNTRY 2","COUNTRY 3","COUNTRY 4"),
  gender = c("Male","Female"),
  age_group = c("Less than 35","35 and more"),
  status = c("Employed","Unemployed","Do not want to say"),
  children = c("0","1","2","3","4","5 and more")
)

same_cols <- names(labels_list)

repex_DT[
  , 
  (same_cols) := mapply(
    function(x, y) y[factor(as.integer(x))], 
    repex_DT[, same_cols, with = FALSE], 
    labels_list,
    SIMPLIFY = FALSE
  )
]

Note that with this approach it is necessary to convert to factor first, because the values in repex_DT are not necessarily the sequence 1, 2, 3...
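The factor conversion relies on every code actually occurring in the data. Here is a small sketch of what happens when one does not, together with one possible alternative based on base match(). The vector status_ids is an assumption introduced for illustration: it spells out which code belongs to each label, information that labels_list alone does not carry.

```r
age_group_labels <- c("Less than 35", "35 and more")
codes <- c("2", "2", NA, "2")  # code "1" never occurs in this sample

# factor() only numbers the levels that are present, so "2" becomes
# integer code 1 and picks up the wrong label.
via_factor <- age_group_labels[factor(as.integer(codes))]
print(via_factor)

# Possible alternative: match each code against an explicit id vector.
status_labels <- c("Employed", "Unemployed", "Do not want to say")
status_ids    <- c(1L, 2L, 9L)  # assumed: the code behind each label
status_codes  <- c("9", "2", NA, "9")
via_match <- status_labels[match(as.integer(status_codes), status_ids)]
print(via_match)
```

With explicit ids the recode no longer depends on which values happen to be present, at the cost of maintaining one id vector per variable.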

edited Mar 14, 2022 at 6:53

answered Mar 12, 2022 at 13:09


3 Comments

Maxence Dum. Over a year ago

Thank you @det for your answer, which seems much faster than my approach from what I have tested so far. May I ask you to update your answer to consider my question edit? Your code does not work if the integer to match is 0, which I have plenty in my dataset.

Maxence Dum. Over a year ago

Nice suggestion, works like a charm on this sample data! I have tried it on my own dataset and found that I forgot another specific case I frequently encounter: not every unique level is always present in my tables. When this happens, since values are coerced to factor, the missing value logically does not appear as a possible factor level. This results in an incorrect recoding of values, as you can see in my updated reproducible example. Do you have a workaround for this too? (e.g. force the factor levels to take all unique values of labels_list for a given variable)

det Over a year ago

I'm not exactly sure how to solve it, because with your labeling scheme there is no way to know which 'number' is missing. Is it 1, 3, 8? In my opinion the best option would be to relabel repex_DT so that the numbers go in sequence (1, 2, 3...), but that will add time, and I can't do it for the same reason. You can adapt the first suggestion and add the labels for 0 at the first positions in your labels_DT (you would then need to subset using as.integer(x)+1).
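The +1 offset mentioned in the comment above can be sketched as follows, in base R and with a made-up code vector. It assumes the label for code 0 sits in the first position, which is already the case for the children column of the question's labels_DT.

```r
# Position 1 holds the label for code 0, position 2 the label for code 1, ...
children_labels <- c("0", "1", "2", "3", "4", "5 and more")

codes <- c("0", "2", "3", NA, "6", "5")

# Shift every code by one so that code 0 reads position 1.
shifted <- children_labels[as.integer(codes) + 1L]
print(shifted)
```

Codes without a label, such as 6 here, index past the end of the vector and come back as NA, and the output keeps the same length as the input.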

V. Lou · Accepted Answer · 2022-03-12 15:34:55Z

0

A very computationally efficient way would be to melt your tables first, match them, and cast again:

repex_DT[, idx:= .I] # Create an index used for melting
# Melt
repex_melt <- melt(repex_DT, id.vars = "idx")
labels_melt <- melt(labels_DT, id.vars = "label_id")
# Match variables and value/label_id
repex_melt[labels_melt, value2:= i.value, on= c("variable", "value==label_id")]
# Put the data back into its original shape
result <- dcast(repex_melt, idx~variable, value.var = "value2")

answered Mar 12, 2022 at 15:34


2 Comments

Maxence Dum. Over a year ago

Thank you @V. Lou for your answer, it leads to the expected result faster than my loop-based approach. However, dcast takes quite a long time to compute and other contributions (mapply suggestion from @det at this time) are faster. Also, I tried it on my own dataset which contains 0, and they are converted to NA, which is a problem. I have edited my question to tackle that issue for future answers.

V. Lou Over a year ago

0 gets converted to NA because it has no corresponding entry in your label_id column.
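To illustrate the point about 0 in the melt-and-join approach, here is a small sketch with hypothetical tables (survey and labels are made up, not the question's data). It assumes the long-format dictionary stores the real code of each label, starting at 0 for this variable, so that the join finds a match.

```r
library(data.table)

survey <- data.table(idx = 1:4, children = c("0", "2", NA, "5"))

# Assumed long-format dictionary: one row per (variable, code, label).
labels <- data.table(
  variable = "children",
  label_id = as.character(0:5),  # ids rewritten to start at 0
  label    = c("0", "1", "2", "3", "4", "5 and more")
)

# Melt the survey data, then update-join the labels onto it.
long <- melt(survey, id.vars = "idx", variable.factor = FALSE)
long[labels, value2 := i.label, on = c("variable", "value==label_id")]
print(long)
```

Rows whose value has no entry in label_id, including NA, keep value2 as NA, which is the behaviour described in the comment above.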

Maxence Dum. · Accepted Answer · 2022-03-23 11:13:02Z

I finally found time to work on an answer to this matter. I changed my approach and used fastmatch::fmatch to identify the labels to update. As pointed out by @det, it is not possible to handle variables with a starting '0' label in the same loop as the other standard categorical variables, so the instruction is basically repeated twice. Still, this is much faster than my initial for loop approach.

The answer below:

library(data.table)
library(magrittr)
library(stringi)
library(fastmatch)

#Selection of variable names depending on the presence of '0' labels
same_cols_with0 <- intersect(names(repex_DT), names(labels_DT))[
  which(intersect(names(repex_DT), names(labels_DT)) %fin% 
          names(repex_DT)[which(unlist(lapply(repex_DT, function(x) 
            sum(stri_detect_regex(x, pattern="^0$", negate=FALSE), na.rm=TRUE)),
 use.names=FALSE)>=1)])]

same_cols_standard <- intersect(names(repex_DT), names(labels_DT))[
  which(!(intersect(names(repex_DT), names(labels_DT)) %fin% same_cols_with0))]

labels_std <- labels_DT[, same_cols_standard, with=FALSE]
labels_0   <- labels_DT[, same_cols_with0, with=FALSE]
levels_id  <- as.integer(labels_DT$label_id)

#Update joins via matching IDs (credit to @det for mapply syntax).
result_DT <- data.table::copy(repex_DT) %>% 
  .[, (same_cols_standard) := mapply(
    function(x, y) y[fastmatch::fmatch(x=as.integer(x), table=levels_id, nomatch=NA)],
    repex_DT[, same_cols_standard, with=FALSE], labels_std, SIMPLIFY=FALSE)] %>% 
  .[, (same_cols_with0) := mapply(
    function(x, y) y[fastmatch::fmatch(x=as.integer(x), table=(levels_id - 1), nomatch=NA)],
    repex_DT[, same_cols_with0, with=FALSE], labels_0, SIMPLIFY=FALSE)]
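As a possible variation, the two passes could be folded into one by computing the offset per column. The sketch below uses base match() and plain lists so that it runs without extra packages; fastmatch::fmatch is documented as a drop-in replacement for match(). The helper recode_column and the small survey and labels lists are hypothetical, and this variant has not been benchmarked against the code above.

```r
survey <- list(
  status   = c("1", NA, "9", "2"),
  children = c("0", "2", "6", NA)
)
labels <- list(
  status   = c("Employed", "Unemployed", NA, NA, NA, NA, NA, NA, "Do not want to say"),
  children = c("0", "1", "2", "3", "4", "5 and more", NA, NA, NA)
)
label_id <- 1:9

# Hypothetical helper: shift the ids by one for columns containing a "0" code,
# then look up each code's position and return the label stored there.
recode_column <- function(x, y) {
  offset <- if (any(x == "0", na.rm = TRUE)) 1L else 0L
  y[match(as.integer(x), label_id - offset)]
}

# Map() pairs each survey column with the label column of the same position.
recoded <- Map(recode_column, survey, labels)
print(recoded)
```

Since a data.table is a list of columns, the same Map() call could in principle be used on the right-hand side of := with the matching column subsets.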
