I have a data frame with three columns and thousands of rows. The first two columns (x and y) contain character strings, and the third (z) contains numeric data. I need to subset the data frame based on matching values in both of the first two columns.

    x <- c("a", "b", "c", "d", "f", "g", "h", "i", "j", "k")
    y <- c("h", "b", "k", "a", "g", "d", "i", "c", "f", "j")
    z <- c(1:10)
    df <- data.frame(x, y, z)

       x y  z
    1  a h  1
    2  b b  2
    3  c k  3
    4  d a  4
    5  f g  5
    6  g d  6
    7  h i  7
    8  i c  8
    9  j f  9
    10 k j 10

Say this is my table, and the values I am interested in are "a", "c", "f", "h" and "k". I only want to return the rows in which both x and y contain one of the five, so in this case rows 1 and 3.

I've tried:

    df2 <- filter(df, 
             x == ("a" | "c" | "f" | "h" | "k") & 
             y == ("a" | "c" | "f" | "h" | "k"))

but this doesn't work for factors or character strings. Is there an equivalent or another way around this?

Thanks in advance.


3 Answers

I think this returns what you are looking for:

    # build vector of necessary elements
    mustHaves <- c("a", "c", "f", "h", "k")
    # perform subsetting
    df[with(df, x %in% mustHaves & y %in% mustHaves), ]

      x y z
    1 a h 1
    3 c k 3

Data:

    df <- data.frame(x, y, z, stringsAsFactors = FALSE)
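
For context on why `==` is the wrong tool here: `==` compares two vectors position by position (recycling the shorter one), whereas `%in%` tests each element on the left for membership in the vector on the right. A minimal sketch using the `x` column from the question; `eq_result` and `in_result` are names introduced only for illustration:

```r
x <- c("a", "b", "c", "d", "f", "g", "h", "i", "j", "k")
mustHaves <- c("a", "c", "f", "h", "k")

# `==` recycles mustHaves along x and compares element by element,
# so it only finds values that happen to sit at the same position
eq_result <- x == mustHaves
which(eq_result)   # 1 10

# `%in%` asks, for each element of x, whether it occurs anywhere in mustHaves
in_result <- x %in% mustHaves
which(in_result)   # 1 3 5 7 10
```

The same membership test applied to `y` and combined with `&` gives the row filter used above.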

1 Comment

A perfect answer, and so quick! Thanks a lot.

With dplyr:

    library(dplyr)

    df2 <- filter(df,
                  x %in% c("a", "c", "f", "h", "k") &
                  y %in% c("a", "c", "f", "h", "k"))
    df2

      x y z
    1 a h 1
    2 c k 3
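
If the same set has to be checked against several columns, the repetition can be avoided. The sketch below is one possible variant, not part of the original answer: it assumes dplyr 1.0.4 or later (where `if_all()` is available) and stores the values in a vector named `wanted`, a name chosen here for illustration.

```r
library(dplyr)

# sample data from the question
df <- data.frame(
  x = c("a", "b", "c", "d", "f", "g", "h", "i", "j", "k"),
  y = c("h", "b", "k", "a", "g", "d", "i", "c", "f", "j"),
  z = 1:10,
  stringsAsFactors = FALSE
)

# values of interest, defined once
wanted <- c("a", "c", "f", "h", "k")

# keep rows where every selected column passes the membership test
df2 <- filter(df, if_all(c(x, y), ~ .x %in% wanted))
df2
```

This should return the same two rows as the explicit `x %in% ... & y %in% ...` form.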

2 Comments

Thank you, this also works perfectly. So my mistake was using == instead of %in%. I am quite new to R, so I hadn't come across %in% before.
No problem. Yes, %in% checks whether each value on its left-hand side appears anywhere in the vector on its right-hand side. Consider accepting one of the answers above so the question is marked as resolved.

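What about this, using the dplyr package:

    df2 <- filter(df, grepl("^[acfhk]$", x) & grepl("^[acfhk]$", y))

The ^ and $ anchors restrict each match to values that are exactly one of the listed letters.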
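
One caveat with the regular-expression approach: an unanchored pattern such as `"[acfhk]"` matches any string that merely contains one of the letters. That is harmless for the single-character values in the question but not for longer strings. A small sketch with made-up values (`vals` is hypothetical, not from the question):

```r
vals <- c("a", "ab", "b", "kk")
wanted <- c("a", "c", "f", "h", "k")

# unanchored: TRUE for anything containing one of the letters
grepl("[acfhk]", vals)     # TRUE TRUE FALSE TRUE

# anchored: TRUE only when the whole string is one of the letters
grepl("^[acfhk]$", vals)   # TRUE FALSE FALSE FALSE

# for multi-character values, an anchored alternation can be built from the vector
pattern <- paste0("^(", paste(wanted, collapse = "|"), ")$")
grepl(pattern, vals)       # TRUE FALSE FALSE FALSE
```

For plain exact matching against a known set, `%in%` sidesteps these pattern details entirely.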

2 Comments

Thank you, this also works perfectly. Three different ways to achieve the same thing. I have a lot to learn.
You are welcome. It is probably worth checking which is the fastest if you expect to process really big datasets.
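
A rough way to compare speed is base R's `system.time()`. The sketch below is only an illustration under assumptions: the data frame `big` is randomly generated (not the asker's data), and timings will vary by machine and data, so it makes no claim about which approach wins.

```r
set.seed(1)
n <- 1e5

# hypothetical larger data set with single-letter values
big <- data.frame(
  x = sample(letters, n, replace = TRUE),
  y = sample(letters, n, replace = TRUE),
  z = seq_len(n),
  stringsAsFactors = FALSE
)
wanted <- c("a", "c", "f", "h", "k")

# membership test with %in%
t_in <- system.time(
  res_in <- big[big$x %in% wanted & big$y %in% wanted, ]
)

# anchored regular expression with grepl
t_re <- system.time(
  res_re <- big[grepl("^[acfhk]$", big$x) & grepl("^[acfhk]$", big$y), ]
)

# both approaches should select the same rows
identical(res_in, res_re)

# elapsed seconds for each approach
t_in["elapsed"]
t_re["elapsed"]
```

For more reliable numbers, repeat each timing several times rather than relying on a single run.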
