Subset rows in a data frame based on a vector of values

Question

I have two data sets that are supposed to be the same size but aren't. I need to trim the values from A that are not in B and vice versa in order to eliminate noise from a graph that's going into a report. (Don't worry, this data isn't being permanently deleted!)

I have read the following:

But I'm still not able to get this to work right. Here's my code:

bg2011missingFromBeg <- setdiff(x=eg2011$ID, y=bg2011$ID)
#attempt 1
eg2011cleaned <- subset(eg2011, ID != bg2011missingFromBeg)
#attempt 2
eg2011cleaned <- eg2011[!eg2011$ID %in% bg2011missingFromBeg]

The first try just eliminates the first value in the resulting setdiff vector. The second try yields and unwieldy error:

Error in `[.data.frame`(eg2012, !eg2012$ID %in% bg2012missingFromBeg) 
:  undefined columns selected

I don't think merge is appropriate here. I do not want the datasets to be combined. — Zelbinian
– Zelbinian, Commented Mar 5, 2013 at 15:40
No, I think merge is exactly appropriate. An inner join would give you only rows that are in both A and B. You can then subset the columns of the result if the merge added any extraneous ones. — joran
– joran, Commented Mar 5, 2013 at 15:43
This gets me the appropriate number of rows but, again, I'm left with only one data set. Moreover, from a certain perspective this data set is dirtier than the one I started with - I have one set with some rows from "End" and some rows from "Beg" and I don't know which are which. — Zelbinian
– Zelbinian, Commented Mar 5, 2013 at 15:49
Your first concern is mistaken. Calling merge does nothing to the two original data frames. Hence, you could do something like A <- merge(A,B). And getting the correct columns is no harder than something like merge(A,B)[,colnames(A)] assuming none are duplicated. But if you really only are matching on one column, then adibender's solution is probably simpler for your purposes. — joran
– joran, Commented Mar 5, 2013 at 16:04
Look, I agree that in your case, with only one index column, merge isn't the simplest option. But if you spend any time using it, you'll see that it is quite possible to use for this purpose. Duplicate columns have .x or .y` appended to their names so you can tell which of the original data frames they came from. It would require 1-2 extra lines of code, but using merge would work just fine. — joran
– joran, Commented Mar 5, 2013 at 16:13

adibender · Accepted Answer · 2017-06-06 09:32:23Z

88

This will give you what you want:

eg2011cleaned <- eg2011[!eg2011$ID %in% bg2011missingFromBeg, ]

The error in your second attempt is because you forgot the ,

In general, for convenience, the specification object[index] subsets columns for a 2d object. If you want to subset rows and keep all columns you have to use the specification object[index_rows, index_columns], while index_cols can be left blank, which will use all columns by default.

However, you still need to include the , to indicate that you want to get a subset of rows instead of a subset of columns.

edited Jun 6, 2017 at 9:32

answered Mar 5, 2013 at 15:46

adibender

7,5883 gold badges39 silver badges44 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

Zelbinian Over a year ago

Yes, this solves the problem. I'll wait for other answers, though, just in case someone comes up with something really clever. :)

Señor O Over a year ago

subset will work if you change the logical statement to subset(eg2011, !ID %in% bg2011missingFromBeg)

Dinre Over a year ago

This is really the same answer as the one I wrote up. Choose adibender's answer, since it directly solves your problem. I only added mine for others to be able to use as a more robust reference if they stumble upon this page later.

JaKu Over a year ago

This is also true for a faster 'data.table'

Dinre · Accepted Answer · 2013-03-05 15:48:52Z

17

If you really just want to subset each data frame by an index that exists in both data frames, you can do this with the 'match' function, like so:

data_A[match(data_B$index, data_A$index, nomatch=0),]
data_B[match(data_A$index, data_B$index, nomatch=0),]

This is, though, the same as:

data_A[data_A$index %in% data_B$index,]
data_B[data_B$index %in% data_A$index,]

Here is a demo:

# Set seed for reproducibility.
set.seed(1)

# Create two sample data sets.
data_A <- data.frame(index=sample(1:200, 90, rep=FALSE), value=runif(90))
data_B <- data.frame(index=sample(1:200, 120, rep=FALSE), value=runif(120))

# Subset data of each data frame by the index in the other.
t_A <- data_A[match(data_B$index, data_A$index, nomatch=0),]
t_B <- data_B[match(data_A$index, data_B$index, nomatch=0),]

# Make sure they match.
data.frame(t_A[order(t_A$index),], t_B[order(t_B$index),])[1:20,]

#    index     value index.1    value.1
# 27     3 0.7155661       3 0.65887761
# 10    12 0.6049333      12 0.14362694
# 88    14 0.7410786      14 0.42021589
# 56    15 0.4525708      15 0.78101754
# 38    18 0.2075451      18 0.70277874
# 24    23 0.4314737      23 0.78218212
# 34    32 0.1734423      32 0.85508236
# 22    38 0.7317925      38 0.56426384
# 84    39 0.3913593      39 0.09485786
# 5     40 0.7789147      40 0.31248966
# 74    43 0.7799849      43 0.10910096
# 71    45 0.2847905      45 0.26787813
# 57    46 0.1751268      46 0.17719454
# 25    48 0.1482116      48 0.99607737
# 81    53 0.6304141      53 0.26721208
# 60    58 0.8645449      58 0.96920881
# 30    59 0.6401010      59 0.67371223
# 75    61 0.8806190      61 0.69882454
# 63    64 0.3287773      64 0.36918946
# 19    70 0.9240745      70 0.11350771

answered Mar 5, 2013 at 15:48

Dinre

4,22619 silver badges26 bronze badges

3 Comments

Zelbinian Over a year ago

Let me see if I can repeat data_A[data_A$index %in% data_B$index,] in words. Give me all of the rows in data_A that are in both data_A and data_B, based on their index. Do I have it right?

Dinre Over a year ago

@Zelbinian Specifically, it says: Give me all the rows in data_A where the index value of the rows in data_A can be found in the index values in data_B. It's a slightly different statement, but you have it basically correct. The only difference is that we are specifically wanting the row numbers from data_A. If you reversed the statement, it wouldn't work right.

Sam Firke Over a year ago

Nice reproducible example + explanations.

maycca · Accepted Answer · 2016-04-13 20:56:31Z

Really human comprehensible example (as this is the first time I am using %in%), how to compare two data frames and keep only rows containing the equal values in specific column:

# Set seed for reproducibility.
set.seed(1)

# Create two sample data frames.
data_A <- data.frame(id=c(1,2,3), value=c(1,2,3))
data_B <- data.frame(id=c(1,2,3,4), value=c(5,6,7,8))

# compare data frames by specific columns and keep only 
# the rows with equal values 
data_A[data_A$id %in% data_B$id,]   # will keep data in data_A
data_B[data_B$id %in% data_A$id,]   # will keep data in data_b

Results:

> data_A[data_A$id %in% data_B$id,]
  id value
1  1     1
2  2     2
3  3     3

> data_B[data_B$id %in% data_A$id,]
  id value
1  1     5
2  2     6
3  3     7

Sam Firke · Accepted Answer · 2015-04-02 13:40:57Z

3

Per the comments to the original post, merges / joins are well-suited for this problem. In particular, an inner join will return only values that are present in both dataframes, making thesetdiff statement unnecessary.

Using the data from Dinre's example:

In base R:

cleanedA <- merge(data_A, data_B[, "index"], by = 1, sort = FALSE)
cleanedB <- merge(data_B, data_A[, "index"], by = 1, sort = FALSE)

Using the dplyr package:

library(dplyr)
cleanedA <- inner_join(data_A, data_B %>% select(index))
cleanedB <- inner_join(data_B, data_A %>% select(index))

To keep the data as two separate tables, each containing only its own variables, this subsets the unwanted table to only its index variable before joining. Then no new variables are added to the resulting table.

edited Apr 2, 2015 at 13:40

answered Apr 1, 2015 at 13:14

Sam Firke

23.4k11 gold badges100 silver badges117 bronze badges

Collectives™ on Stack Overflow

Subset rows in a data frame based on a vector of values

4 Answers 4

4 Comments

3 Comments

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

4 Comments

3 Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related