What is the proper way to do this?

I have a function that works great on its own given a series of inputs, and I'd like to use it on a large dataset rather than on single values by looping through the data by row. I have tried to update the function to use data.frame columns rather than vector values, but have been unsuccessful.

A simple example of this is:

Let's say I have a data.frame with 4 columns: data$id, data$height, data$weight, and data$gender. I want to write a function that will loop over each row (using apply) and calculate BMI (kg/m^2). I know this would be easy to do with dplyr, but I would like to learn how to do it without resorting to external packages, and I can't find a clear answer on how to properly reference the columns within the function.

Apologies in advance if this is a duplicate. I've been searching Stack Overflow pretty thoroughly in hopes of finding an existing example.

  • Basic arithmetic functions are vectorized. You don't need dplyr or lapply to add a BMI column, you can just do data$BMI = data$weight / data$height^2. Commented May 11, 2015 at 21:13
  • If you want to write a function that takes a data frame, adds a BMI column, and then returns the modified data frame, you can refer to the columns by column number, data[, 3] / data[, 2]^2 (weight is column 3 and height is column 2 in your example), or by quoted name, data[, "weight"] / data[, "height"]^2. For both of these methods you could give the function optional arguments so the user can specify either the column index or the quoted name of the columns to use. Commented May 11, 2015 at 21:15
  • @Gregor But don't do that, right? Seems kind of wasteful to pass around a data.frame. Just write a function myfun for construction of the column and use it with data$mynewcol <- with(data,myfun(weight,height,other_col)) Commented May 11, 2015 at 21:18
  • @Frank well yes, but I'm trying to answer the general question rather than the specific case. The OP seems to want to know how to work with data and columns inside a function, but chose an example where that's not what one should do. Looking past the example, the answer is string column names, indices, or NSE. Commented May 11, 2015 at 21:20
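
To make the comments above concrete, here is a minimal sketch using a made-up data frame (the name people and its values are assumptions for illustration, not from the question). It compares the vectorized column arithmetic with a row-wise apply call. Note that apply converts the data frame to a matrix first, so the character gender column is left out to keep the values numeric.

```r
# Hypothetical example data: heights in metres, weights in kilograms
people <- data.frame(
  id     = 1:3,
  height = c(1.60, 1.75, 1.80),
  weight = c(60, 70, 90),
  gender = c("F", "M", "M")
)

# Vectorized: one expression handles every row at once
bmi_vectorized <- people$weight / people$height^2

# Row-wise with apply(): restrict to the numeric columns first, because
# apply() coerces its input to a matrix (all character if gender is included)
bmi_rowwise <- apply(
  people[, c("height", "weight")], 1,
  function(row) row[["weight"]] / row[["height"]]^2
)

# Both approaches should give the same numbers
all.equal(bmi_vectorized, unname(bmi_rowwise))
```

The row-wise version does the same work with more code and an extra conversion step, which is why the comments recommend the vectorized form here.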

3 Answers

4

I think this is what you're looking for. The easiest way to refer to columns of a data frame inside a function is to use quoted column names. In principle, what you're doing is this:

data[, "weight"] / data[, "height"]^2

but inside a function you might want to let the user specify that the height or weight column is named differently, so you can write your function

add_bmi = function(data, height_col = "height", weight_col = "weight") {
    # BMI is weight divided by height squared
    data$bmi = data[, weight_col] / data[, height_col]^2
    return(data)
}

This function will assume that the columns to use are named "height" and "weight" by default, but the user can specify other names if necessary. You could do a similar solution using column indices instead, but using names tends to be easier to debug.
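
As a usage sketch, here is the same idea applied to a data frame whose columns have different names. The function is restated so the snippet runs on its own, with [[ swapped in for [, ] as a matter of taste; the data frame patients and its column names ht_m and wt_kg are hypothetical.

```r
# Sketch of the quoted-name approach; `[[` extracts a single column by name
add_bmi <- function(data, height_col = "height", weight_col = "weight") {
  data[["bmi"]] <- data[[weight_col]] / data[[height_col]]^2
  data
}

# Hypothetical data frame with differently named columns
patients <- data.frame(ht_m = c(1.60, 1.80), wt_kg = c(64, 81))

# Point the function at the right columns by passing their names as strings
patients <- add_bmi(patients, height_col = "ht_m", weight_col = "wt_kg")
patients$bmi
```

Because the column names are ordinary strings, they can also come from a variable, a config file, or a loop over several column pairs.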

Functions this simple are rarely useful. If you're calculating BMI for a lot of datasets maybe it is worth keeping this function around, but since it is a one-liner in base R you probably don't need it.

my_data$BMI = with(my_data, weight / height^2)

One note is that using column names stored in variables means you can't use $. This is the price we pay by making things more programmatic, and it's a good habit to form for such applications. See fortunes::fortune(343):

Sooner or later most R beginners are bitten by this all too convenient shortcut. As an R newbie, think of R as your bank account: overuse of $-extraction can lead to undesirable consequences. It's best to acquire the '[[' and '[' habit early.

-- Peter Ehlers (about the use of $-extraction) R-help (March 2013)
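
A small sketch of why $ does not work with a column name held in a variable (the data frame and the variable name col are made up for illustration):

```r
df <- data.frame(height = c(1.6, 1.8), weight = c(64, 81))
col <- "height"  # column name stored in a variable

by_dollar  <- df$col     # looks for a column literally named "col"
by_double  <- df[[col]]  # uses the value of `col`, i.e. the height column
by_bracket <- df[, col]  # also uses the value of `col`

is.null(by_dollar)                # TRUE: there is no column called "col"
identical(by_double, by_bracket)  # TRUE: both return the height values
```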

For fancier usage like dplyr does where you don't have to quote column names and such (and can evaluate expressions), the lazyeval package makes things relatively painless and has very nice vignettes.

The base function with can be used to do some lazy evaluating, e.g.,

# with() lets you refer to columns by bare name
with(mtcars, plot(disp, mpg))
# the equivalent call without with()
plot(mtcars$disp, mtcars$mpg)

but with is best used interactively and in straightforward scripts. If you get into writing programmatic production code (e.g., your own R package), it's safer to avoid non-standard evaluation. See, for example, the warning in ?subset, another base R function that uses non-standard evaluation.
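
One way this can go wrong is sketched below, under the assumption that an unrelated variable named height happens to exist in the workspace. The function names bmi_nse and bmi_se and the data frame incomplete are hypothetical. When the expected column is missing, the with() version quietly picks up the stray variable, while the quoted-name version fails loudly.

```r
height <- 100  # an unrelated variable that happens to exist in the workspace

# Non-standard evaluation: names are looked up in the data first,
# then in the enclosing environments
bmi_nse <- function(d) with(d, weight / height^2)

# Standard evaluation: columns are requested explicitly by name
bmi_se <- function(d) d[, "weight"] / d[, "height"]^2

# Hypothetical data frame that is missing the height column
incomplete <- data.frame(weight = c(60, 80), gender = c("F", "M"))

# No error: silently falls back to the stray `height` variable
nse_result <- bmi_nse(incomplete)

# Stops with an error about the missing column instead of a wrong answer
se_result <- tryCatch(bmi_se(incomplete), error = function(e) conditionMessage(e))
```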

0

Speaking generally, functions should not know about more than they need to know about. If you write a function that requires a data.frame, when it is not essential that the input data be provided in a data.frame, then you are making your function more restrictive than it needs to be.

The correct way to write this function is as follows:

bmi <- function(height, weight) weight / height^2

This will allow you to compute a vector of BMI values from a vector of height values and a vector of weight values, since both / and ^ are vectorized operations. So, for example, if you had two loose vectors of height and weight, then you could call it as follows:

set.seed(1)
N <- 5
height <- rnorm(N, 1.7, 0.2)
weight <- rnorm(N, 65, 4)
BMI <- bmi(height, weight)
height
weight
BMI
## [1] 1.574709 1.736729 1.532874 2.019056 1.765902
## [1] 61.71813 66.94972 67.95330 67.30313 63.77845
## [1] 24.88926 22.19652 28.91995 16.50967 20.45224

And if you had your inputs contained in a data.frame, you would be able to do this:

set.seed(2)
N <- 5
df <- data.frame(
  id = 1:N,
  height = rnorm(N, 1.7, 0.2),
  weight = rnorm(N, 65, 4),
  gender = sample(c('M', 'F'), N, replace = TRUE)
)
df$BMI <- bmi(df$height, df$weight)
df
##   id   height   weight gender      BMI
## 1  1 1.520617 65.52968      F 28.33990
## 2  2 1.736970 67.83182      M 22.48272
## 3  3 2.017569 64.04121      F 15.73268
## 4  4 1.473925 72.93790      M 33.57396
## 5  5 1.683950 64.44485      M 22.72637
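
Because the function only knows about vectors, it can be combined with whichever base R idiom is convenient. The following is a sketch with made-up data (the variable names via_dollar, via_with, and via_transform are just labels for illustration):

```r
bmi <- function(height, weight) weight / height^2

# Hypothetical data; the function itself never sees the data frame
df <- data.frame(height = c(1.60, 1.80), weight = c(64, 81))

# Three equivalent base R ways to feed columns to a vector function
via_dollar    <- bmi(df$height, df$weight)
via_with      <- with(df, bmi(height, weight))
via_transform <- transform(df, BMI = bmi(height, weight))$BMI

# The same function also works on a single person's measurements
single <- bmi(1.75, 70)
```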

0

Providing this answer as I was not able to find it on SO and banged my head against the wall trying to figure out why my function within my R package was assuming my new column was an object and not a data.frame column.

If a function takes in a data.frame and within the function you are adding and transforming the additional column(s), the way to do so is as follows:

example_func <- function(df) {
  # `value` and `i` are placeholders for your own value and row index

  # To add a new column
  df[["New.Column"]] <- value

  # To get the ith value of that column
  df[[i, "New.Column"]]

  # To subset the df using some conditional logic on that column
  # (the trailing comma selects rows rather than columns)
  df[df[["New.Column"]] == value, ]

  # To sort on that column in descending order
  setorderv(df, "New.Column", -1)
}

Note this requires library(data.table), which is the package that provides setorderv.
