
SO #24833247 covers nearly all the use cases for passing column names dynamically to a data.table within a function. However, it misses one I'm currently trying to address: passing variables to the i expression.

I'm trying to refactor some data cleansing code into a function that converts certain values to NA after I've pulled the data into a data.table.

For example, given the following:

dt <- data.table(colA = c('A', 'b', '~', 'd', ''), colB = c('', '?', 'a1', 'a2', 'z4'))
dt[colA %in% c('~', ''), colA := NA]
dt[colB %in% c('~', ''), colB := NA]

I want a generic function that replaces the '~', '?' and '' values with NA, instead of having to explicitly code each transformation.

dt <- data.table(colA = c('A', 'b', '~', 'd', ''), colB = c('', '?', 'a1', 'a2', 'z4'))
clearCol(dt, colA)
clearCol(dt, colB)

The j expression is straightforward:

clearCol <- function(dt, f) {
  f = substitute(f)
  dt[,(f) := NA]
}
clearCol(data.table(colA = c('A', 'b', '~', 'd', '')), colA)[]
   colA
1:   NA
2:   NA
3:   NA
4:   NA
5:   NA

However, extending it to add the variable to the i expression fails:

clearCol <- function(dt, f) {
  f = substitute(f)
  dt[(f) %in% c('~', ''),(f) := NA]
}
clearCol(data.table(colA = c('A', 'b', '~', 'd', '')), colA)[]
 Error in match(x, table, nomatch = 0L) : 'match' requires vector arguments
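The error makes sense once you inspect what substitute() hands back: a name (an unevaluated symbol), not a character vector, so `(f) %in% c('~', '')` ends up calling match() on a language object. A minimal sketch (using quote(), which produces the same symbol that substitute(f) does inside the function):

```r
f <- quote(colA)
class(f)       # "name" -- an unevaluated symbol, not a vector
is.vector(f)   # FALSE, which is why match()/%in% rejects it
```

In j, data.table special-cases `(f) :=` to mean "the column named by f", but in i the symbol is used as-is, so it must be evaluated (or deparsed and looked up with get()) first.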

Swapping to this seems to work, but the lack of output with verbose = TRUE (compared to the hard-coded method at the top) leaves me concerned that it will not scale well to the large data sets I'm working with:

clearCol <- function(dt, f) {
  f = deparse(substitute(f))
  dt[get(f) %in% c('~', ''),(f) := NA]
}
clearCol(data.table(colA = c('A', 'b', '~', 'd', '')), colA)[]
   colA
1:    A
2:    b
3:   NA
4:    d
5:   NA

Is there another way of doing what I want?


1 Answer


You can follow FAQ 1.6 to get the verbose output:

cc = function(d, col, vs = c("~", ""), verb = FALSE){
  col = substitute(col)
  ix  = substitute(col %in% vs)
  d[eval(ix), as.character(col) := NA, verbose = verb ][]
}

dt <- data.table(colA = c('A', 'b', '~', 'd', ''), colB = c('', '?', 'a1', 'a2', 'z4'))
cc(dt, colA, verb = TRUE)

which gives

Creating new index 'colA'
Starting bmerge ...done in 0 secs
Detected that j uses these columns: <none> 
Assigning to 2 row subset of 5 rows
Dropping index 'colA' due to update on 'colA' (column 1)
   colA colB
1:    A     
2:    b    ?
3:   NA   a1
4:    d   a2
5:   NA   z4

However, notice what the verbose output is saying here. It's creating an index (assuming you didn't do something to create it already, which seems likely since the data was only just read in)... and then it's removing that index (since it is invalidated by the edit to the column). That hardly sounds like something that would do much to contribute to efficiency.

If you really want to do this efficiently, there are a couple options:

  1. Use na.strings when reading the data in
  2. Use set if you have a ton of columns and somehow can't do #1
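Option 2 might look something like the sketch below (clearAll is a hypothetical helper name, not from the answer above); set() skips the overhead of [.data.table, which adds up when looping over many columns:

```r
library(data.table)

# Replace the given values with NA in every character column, by reference.
clearAll <- function(d, vs = c("~", "?", "")) {
  for (col in names(d)) {
    if (is.character(d[[col]])) {
      w <- which(d[[col]] %in% vs)       # row positions to blank out
      if (length(w)) set(d, i = w, j = col, value = NA_character_)
    }
  }
  d[]
}

dt <- data.table(colA = c('A', 'b', '~', 'd', ''),
                 colB = c('', '?', 'a1', 'a2', 'z4'))
clearAll(dt)
```

Because set() modifies the table by reference, no copy of dt is made, and non-character columns are left untouched.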

3 Comments

In the actual use case, the data is coming through a JDBC database connection, so na.strings isn't an option. However, the substitute/eval combination definitely gets me what I want. Thanks
Ok. Btw, data.table, R and other things are discussed over here, mostly on weekdays: chat.stackoverflow.com/rooms/75819/gmts I don't want to spam github emails, but that's an interesting example in your last comment there. Odd that you again managed to have typos since it looks like copy-paste.
The lesson is run your code, don't just write your code
