I'm trying to use dplyr within a function, passing in a column name as a variable to then be used with n_distinct in the summarize function.

I understand that programming with dplyr has become easier with the summarize_, arrange_, etc. functions, as described in vignette("nse"). I've tried various combinations of interp from lazyeval as well. n_distinct responds with "Input to n_distinct() must be a single variable name from the data set" (which makes sense; it's just that I have the variable name in a string ...)

This works fine outside a function (mention is a column name in the data.frame):

summarize(data, count=n_distinct(mention))

This was my first effort:

getProportions <- function(datain, id_column) {
    overall_total <- summarize(datain, count=n_distinct(id_column))[1,1]
}

getProportions(measures, "mention")
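
To see why the first attempt cannot work, it may help to strip dplyr away. Inside the function, id_column is just the string "mention", not the column it names. dplyr's n_distinct rejected the string outright (the error quoted above), but the underlying mismatch is the same one this rough base-R sketch shows. The toy data frame is made up for illustration and is not the asker's data:

```r
# Made-up data, purely for illustration
toy <- data.frame(mention = c("a", "b", "a", "c"))

id_column <- "mention"

# Distinct values of the string itself: a single value, "mention"
n_from_string <- length(unique(id_column))

# Distinct values of the column that the string names
n_from_column <- length(unique(toy[[id_column]]))

n_from_string  # 1
n_from_column  # 3
```

Base R's [[ ]] accepts a string directly, which is why no extra machinery is needed there; dplyr's verbs need the string turned into a column reference first.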

And after reading the NSE documentation and some threads on here about programming with dplyr I tried:

overall_total <- summarize_(datain, count=interp(~n_distinct(var),var=as.name(id_column)))[1,1]

but to no avail. Any ideas? Almost seems like n_distinct_() is needed?

Edit: My apologies, and thanks. You are right, the interp version does work; it seems I never quite hit that full combination. Looking over my old versions, when I had the var part right I was using plain summarize(), and when I used summarize_() I left off the var= part of the interp call. Sigh. My fault for not producing a full working example with both versions.

  • Not sure where the problem is, since this works for me: f <- function(data, col) summarise_(data, count = interp(~n_distinct(var), var = as.name(col))) and then f(mtcars, "cyl") returns the correct output. Can you clarify what exactly doesn't work? Commented Jan 14, 2015 at 18:02
  • Thanks again (I edited the answer). This was a non-question; should I delete it? Commented Jan 15, 2015 at 15:09
  • You can delete it, or answer it yourself and accept it, since others might find it useful in the future. Commented Jan 15, 2015 at 16:00

1 Answer

As indicated in the comments, the right way to do this was my second option, which apparently I had never quite tested (I'd left off the var = part of the interp call):

library(dplyr)
library(lazyeval)  # provides interp()

f <- function(data, col) {
    summarise_(data, count = interp(~n_distinct(var), var = as.name(col)))
}
f(mtcars, "cyl")
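
For anyone finding this later: the underscore verbs (summarise_ and friends) and the lazyeval approach were deprecated in later dplyr releases in favor of tidy evaluation. Below is a minimal sketch of the same function, assuming a recent dplyr where the .data pronoun is available. The name f_tidy is only illustrative and is not part of the original answer:

```r
library(dplyr)

# Hypothetical modern variant of f(): `col` is still a string,
# and .data[[col]] looks that column up inside summarise().
f_tidy <- function(data, col) {
    summarise(data, count = n_distinct(.data[[col]]))
}

result <- f_tidy(mtcars, "cyl")
result
```

mtcars has three distinct cyl values (4, 6 and 8), so count should come back as 3, matching what the interp version returns.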

2 Comments

I needed to qualify the namespace of the interp function to get this to work i.e. lazyeval::interp
Indeed, worked for me too. But I found that length(unique(col)) is WAY faster than n_distinct(col) in any dplyr calculation. No idea why.
