
I'm writing a function to clean some CEX data (the details don't matter), and I cannot figure out why I can't use %in% to subset a data frame with several values when the analogous operation with == works for a single value. What I am trying to do is shown in f_fails() below. Unless I'm mistaken, I need to be able to pass the column name as a string, but that is exactly the case that fails.

Is there something distinct about %in% in items 6 and 8 below that does not apply for ==? How can I perform 6 and 8 in another way?

# Test Data
set.seed(123)
df <- data.frame(
  NEWID = rep(1:10, 1, each = 10),
  COST = rnorm(100, 1000, 10),
  UCC = round(runif(100, 3995, 4005))
)

# All of these work except the 6th one
# 1.
df[df$UCC == 4000,]
# 2. 
df[df$"UCC" == 4000,]
# 3. 
df[df["UCC"] == 4000,]

# 4. 
df[df$UCC %in% c(4000,4001),]
# 5. 
df[df$"UCC" %in% c(4000,4001),]
# 6.  The one I need does not work
df[df["UCC"] %in% c(4000,4001),]

# 7. This works fine
f_works <- function(data, filter_var, one_val){
  # I can feed values with == and filter
  d <- data[data[filter_var] == one_val,]
  d
}
# 8. This (what I want) returns an empty data frame.
f_fails <- function(data = df, filter_var, many_vals){
  # I cannot feed 2+ values with %in% and filter
  d <- data[data[filter_var] %in% many_vals,]
  d
}

f_works(df, "UCC", 4000)
f_fails(df, "UCC", c(4000,4001))

2 Answers


In this case, %in% expects a vector on each side, but data[filter_var] returns a one-column data frame on the left. Use [[ ]] instead, which extracts the column itself as a vector:

f <- function(data = df, filter_var, many_vals){
  # [[ ]] extracts the column as a plain vector, which is what %in% needs
  data[data[[filter_var]] %in% many_vals,]
}

head(f(df, "UCC", c(4000, 4001)))
#    NEWID     COST  UCC
# 3      1 1015.587 4001
# 4      1 1000.705 4000
# 11     2 1012.241 4000
# 27     3 1008.378 4000
# 28     3 1001.534 4001
# 31     4 1004.265 4001
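
A slightly more defensive variant of the same idea, as a minimal sketch. The name filter_in() and the input checks are assumptions added for illustration and are not part of the answer above. It also uses the fact that %in% returns FALSE rather than NA for missing values, so no NA rows appear in the result.

```r
# Same test data as in the question
set.seed(123)
df <- data.frame(
  NEWID = rep(1:10, 1, each = 10),
  COST = rnorm(100, 1000, 10),
  UCC = round(runif(100, 3995, 4005))
)

# filter_in() is a hypothetical helper name
filter_in <- function(data, filter_var, many_vals) {
  # Fail early if the column name is not a single string naming a real column
  stopifnot(
    is.character(filter_var),
    length(filter_var) == 1,
    filter_var %in% names(data)
  )
  # [[ ]] returns the column as a vector, so %in% yields one TRUE/FALSE per row
  keep <- data[[filter_var]] %in% many_vals
  data[keep, , drop = FALSE]
}

out <- filter_in(df, "UCC", c(4000, 4001))
unique(out$UCC)
```

Passing a misspelled column name, e.g. filter_in(df, "UC", 4000), stops with an error instead of silently returning an empty data frame.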

6 Comments

Nice, this works. I suppose a general takeaway is that == can test a string on a vector OR a df$column but %in% can only test vectors on vectors. I can't think of a reason this is necessary since df[["UCC"]] %in% c(4000,4001) != c(4000,4001) %in% df[["UCC"]]. The left side of %in% could take a vector or a df$column like ==. Anyone know of a reason?
@dcoy The %in% operator does take a vector. Note that df$column is a vector, but df[column] is not: it is a list/data.frame of length 1, which behaves differently from df[[column]], which is a vector.
@Onyambu, I think I was unclear but can't edit now. I agree with all you said. I meant to ask something like "is there any reason to limit %in% by not allowing it to be more inclusive and take either a vector or a non-vector on the LHS, as we allow with ==?" I.e., why not allow either df["UCC"] or df[["UCC"]] with %in%, as we allow when using ==? Is there any reason aside from "that's the way it is"?
@dcoy Yes, there is a reason to limit %in%. Note that data frames have a == method, while lists do not. On the other hand, %in% does work on lists, which enables checks like list(1,1:2) %in% list(1:2,3:4,5:6). But you cannot do list(1,1:2) == list(1:2,3:4,5:6) and get FALSE back. Without the limitation, how would you know whether you are comparing for equality or for element inclusion? The two operators are different: %in% checks membership, not equality.
@Onyambu, thanks for the replies. I actually did not know you cannot use == on two lists with the same dimensions. I think for my broader question, I'd need more space to articulate and maybe simulate scenarios more. I might be misunderstanding something, but to me the answer to the question in your penultimate sentence is your final sentence. It's intuitive that == could never work for inclusion (lists/vectors/anything with differing dimensions). I simply do not understand why my question example #3 works, but example #6 does not. Not really an issue of equality vs inclusion, imo.

If you use the class() or str() functions, you will see that df$UCC is a numeric vector:

class(df$UCC)
## [1] "numeric"

At the same time, df["UCC"] is a data frame:

class(df["UCC"])
## [1] "data.frame"

You can compare a numeric vector with a value or use the %in% operator:

df$UCC == 4000
##  [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE 
## etc.

df$UCC %in% c(4000, 4001)
##  [1] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE 
## etc.

If you compare a data frame with a value (of the same "numeric" type), you get a matrix as a result:

class( df["UCC"] == 4000)
## [1] "matrix" "array" 
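
As a small sketch of why row-indexing with that matrix still works (example 3 in the question): the comparison produces a one-column logical matrix with one entry per row, so it carries the same information as the plain logical vector from df$UCC == 4000. The variable names below are illustrative only.

```r
# Same test data as in the question
set.seed(123)
df <- data.frame(
  NEWID = rep(1:10, 1, each = 10),
  COST = rnorm(100, 1000, 10),
  UCC = round(runif(100, 3995, 4005))
)

m <- df["UCC"] == 4000   # element-wise comparison on a data frame
dim(m)                   # 100 rows, 1 column

# Dropping the dimensions recovers the plain logical vector
same_mask <- identical(as.vector(m), df$UCC == 4000)

# Both masks pick out the same rows
rows_from_matrix <- df[m, ]
rows_from_vector <- df[df$UCC == 4000, ]
nrow(rows_from_matrix) == nrow(rows_from_vector)
```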

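When you use the %in% operator, you ask whether each element on the left matches one of the elements on the right. A data frame is a list of columns, so df["UCC"] has a single element (the whole column), and that column as a whole is not one of the values 4000 or 4001. The result is therefore a single logical value rather than one value per row: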

class( df["UCC"]  %in% c(4000, 4001))
## [1] "logical"
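
To make the consequence concrete, here is a short sketch using the question's test data. It checks the length of the %in% result and shows that indexing rows with it selects nothing, which matches the empty data frame the question reports for f_fails().

```r
# Same test data as in the question
set.seed(123)
df <- data.frame(
  NEWID = rep(1:10, 1, each = 10),
  COST = rnorm(100, 1000, 10),
  UCC = round(runif(100, 3995, 4005))
)

res <- df["UCC"] %in% c(4000, 4001)
length(res)        # 1: one result for the one column, not one per row

# A single FALSE recycled over all rows selects no rows
empty <- df[res, ]
nrow(empty)

# With [[ ]] the left side is a vector, so there is one result per row
per_row <- df[["UCC"]] %in% c(4000, 4001)
length(per_row)    # 100
```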

If you use the numeric vector df$UCC instead, it works, since both sides of the %in% operator are numeric vectors:

df$UCC  %in% c(4000, 4001)
##  [1] FALSE FALSE  TRUE  TRUE FALSE FALSE

The easiest way to implement your function is to use the dplyr package:

library(dplyr)
d <- filter(data, get({{filter_var}}) %in% many_vals)
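
For comparison, a sketch of the same filter wrapped in a complete function, using the .data pronoun that dplyr provides for looking up a column by a string name. This is an alternative to the get() line above, not the answer's own method, and the function name filter_in_dplyr() is made up for illustration.

```r
library(dplyr)

# Same test data as in the question
set.seed(123)
df <- data.frame(
  NEWID = rep(1:10, 1, each = 10),
  COST = rnorm(100, 1000, 10),
  UCC = round(runif(100, 3995, 4005))
)

# filter_in_dplyr() is a hypothetical helper name
filter_in_dplyr <- function(data, filter_var, many_vals) {
  # .data[[filter_var]] pulls the column named by the string as a vector
  filter(data, .data[[filter_var]] %in% many_vals)
}

d <- filter_in_dplyr(df, "UCC", c(4000, 4001))
unique(d$UCC)
```

Unlike base subsetting, filter() does not keep the original row names, so the rows of d are renumbered.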

2 Comments

Your answer is more thorough, and I appreciate the class() component. I'm sorry for accepting the other answer instead; it was posted first and the [[filter_var]] fix is simpler. I appreciate your response.
Why dplyr + get?
