Using dplyr n_distinct in function with quoted variable

Question

I'm trying to use dplyr within a function, passing in a column name as a variable to then be used with n_distinct in the summarize function.

I understand that programming with dplyr has become easier with the summarize_, arrange_, etc. functions, as described in vignette("nse"). I've tried various combinations of interp from lazyeval as well. n_distinct responds with "Input to n_distinct() must be a single variable name from the data set" (which makes sense; it's just that I have the variable name in a string ...)

This works fine outside a function (mention is a column name in the data.frame):

summarize(data, count=n_distinct(mention))

This was my first effort:

getProportions <- function(datain, id_column) {
    overall_total <- summarize(datain, count=n_distinct(id_column))[1,1]
}

getProportions(measures, "mention")

And after reading the NSE documentation and some threads on here about programming with dplyr I tried:

overall_total <- summarize_(datain, count=interp(~n_distinct(var),var=as.name(id_column)))[1,1]

but to no avail. Any ideas? Almost seems like n_distinct_() is needed?

Edit: My apologies and thanks. You are right, the interp version does work; it seems that I never quite hit that full combination. I looked over my old versions: when I had the var part right I was using plain summarize(), and when I used summarize_() I left off the var= part of the interp call. Sigh. My fault for not producing a full working example with both versions.

Not sure where the problem is, since this works for me: f <- function(data, col) summarise_(data, count = interp(~n_distinct(var), var = as.name(col))) and then f(mtcars, "cyl") returns the correct output. Can you clarify what exactly doesn't work?
– talat, Commented Jan 14, 2015 at 18:02
Thanks again (I edited the answer). This was a non-question; should I delete it?
– jameshowison, Commented Jan 15, 2015 at 15:09
You can delete it, or answer it yourself and accept it, since others might find it useful in the future.
– talat, Commented Jan 15, 2015 at 16:00

jameshowison · Accepted Answer · 2015-01-15 19:59:33Z

As indicated in the comments, the right way to do this was my second option, which apparently I had never quite tested (I'd left off the var = part of the interp call):

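library(dplyr)
library(lazyeval)  # provides interp()

f <- function(data, col) {
    # interp() substitutes the column named by the string col into the formula
    summarise_(data, count = interp(~n_distinct(var), var = as.name(col)))
}
f(mtcars, "cyl")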
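
For readers on a newer dplyr (0.7 or later; that is an assumption about your setup, not something this answer originally covered), the underscore verbs and lazyeval are deprecated. A minimal sketch of the same idea using the `.data` pronoun, with `count_distinct` as a hypothetical name standing in for `f`:

```r
library(dplyr)

# Hypothetical helper: count the distinct values in the column whose name
# is given as the string `col`.
# Assumes dplyr >= 0.7, where the .data pronoun works inside summarise().
count_distinct <- function(data, col) {
  summarise(data, count = n_distinct(.data[[col]]))
}

result <- count_distinct(mtcars, "cyl")
print(result)
```

Since mtcars$cyl takes only the values 4, 6 and 8, `result$count` is 3. Indexing `.data` with a string avoids building an expression by hand, which is the job interp() and as.name() do in the version above.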

2 Comments

mthornal Over a year ago

I needed to qualify the namespace of the interp function to get this to work i.e. lazyeval::interp
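
A sketch of what that qualification might look like, assuming lazyeval is installed but not attached with library(); `f_qualified` is a hypothetical name, and summarise_() may emit a deprecation warning on recent dplyr versions:

```r
library(dplyr)

# lazyeval is deliberately not attached; interp() is reached through
# its namespace with the :: operator instead.
f_qualified <- function(data, col) {
  summarise_(data,
             count = lazyeval::interp(~n_distinct(var), var = as.name(col)))
}

qualified_result <- f_qualified(mtcars, "cyl")
print(qualified_result)
```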

MS Berends Over a year ago

Indeed, worked for me too. But I found that length(unique(col)) is WAY faster than n_distinct(col) in any dplyr calculation. No idea why.
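
The speed difference is this commenter's observation and is not verified here. A minimal sketch for checking it on your own machine, assuming a vector of one million integers is representative enough (the names `x`, `big`, `a` and `b` are illustrative only):

```r
library(dplyr)

x <- mtcars$cyl

# Both expressions count the distinct values in x.
a <- n_distinct(x)
b <- length(unique(x))
stopifnot(a == b)

# Rough timing on a larger vector; the numbers depend on the dplyr
# version, the data type, and the machine, so none are quoted here.
set.seed(1)
big <- sample(1:1000, 1e6, replace = TRUE)
print(system.time(n_distinct(big)))
print(system.time(length(unique(big))))
```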
