I'm using the dplyr package in R and have grouped my data by 3 variables (Year, Site, Brood).

I want to get rid of groups made up of fewer than 3 rows. For example, in the following sample I would like to remove the rows for brood '2'. I have a lot of data to process, so while I could painstakingly do it by hand, it would be so helpful to automate it in R.

Year Site Brood Parents
1996 A    1     1  
1996 A    1     1  
1996 A    1     0  
1996 A    1     0  
1996 A    2     1      
1996 A    2     0  
1996 A    3     1  
1996 A    3     1  
1996 A    3     1  
1996 A    3     0  
1996 A    3     1  

I hope this makes sense and thank you very much in advance for your help! I'm new to R and stackoverflow so apologies if the way I've worded this question isn't very good! Let me know if I need to provide any other information.

  • Is dplyr necessary? Or are solutions in base-R or data.table also appropriate? Commented Feb 8, 2016 at 14:32
  • @Heroka data %>% group_by(Year, Site, Brood) %>% filter(n() >= 3) why wouldn't you use dplyr? ;) Commented Feb 8, 2016 at 14:40
  • @Mullefa because there are other options, and I'm personally more comfortable with data.table and base-R. But I understand that preferences can vary between persons :P Commented Feb 8, 2016 at 14:45
  • @Heroka I'm sure those other methods are completely appropriate! Like I say I'm a total rookie and I've simply used dplyr more than base-R/data.table :P Commented Feb 8, 2016 at 15:10

3 Answers

One way to do it is to use the magic n() function within filter:

library(dplyr)

my_data <- data.frame(Year=1996, Site="A", Brood=c(1,1,2,2,2))

my_data %>% 
  group_by(Year, Site, Brood) %>% 
  filter(n() >= 3)

The n() function gives the number of rows in the current group (or the number of rows total if there is no grouping).
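
As a quick check against the sample in the question, here is a minimal sketch that rebuilds that table and applies the same filter. The object names `broods` and `kept` are only illustrative and do not come from the question.

```r
library(dplyr)

# Sample data from the question (the name `broods` is hypothetical)
broods <- data.frame(
  Year    = 1996,
  Site    = "A",
  Brood   = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3),
  Parents = c(1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1)
)

kept <- broods %>%
  group_by(Year, Site, Brood) %>%
  filter(n() >= 3) %>%   # keep only groups with at least 3 rows
  ungroup()              # drop the grouping once the filter is done

# Row counts per remaining brood
count(kept, Brood)
```

Brood 2 has only two rows, so it is dropped, while broods 1 and 3 (four and five rows) are kept.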

1 Comment

This awesome answer helped me. The official dplyr documentation for n() is at dplyr.tidyverse.org/reference/n.html, in case anyone else needs it (I had a bit of a hard time finding it). The main takeaway is that n() can be used with summarise(), mutate() and filter().

Throwing the data.table approach here to join the party:

library(data.table)
setDT(my_data)
my_data[ , if (.N >= 3L) .SD, by = .(Year, Site, Brood)]
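
For readers less familiar with data.table: `.N` is the number of rows in the current group and `.SD` is that group's subset of the data, so the `if` returns rows only for groups that are large enough. A rough sketch on the question's sample, where the names `broods_dt` and `kept_dt` are illustrative assumptions:

```r
library(data.table)

# Sample data from the question (the name `broods_dt` is hypothetical)
broods_dt <- data.table(
  Year    = 1996,
  Site    = "A",
  Brood   = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3),
  Parents = c(1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1)
)

# Groups where the `if` condition is FALSE return NULL and contribute no rows
kept_dt <- broods_dt[, if (.N >= 3L) .SD, by = .(Year, Site, Brood)]

# Rows left per brood
kept_dt[, .N, by = Brood]
```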

You can also do this using base R:

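# `folder` is assumed to hold the path of the directory containing test.csv
temp <- read.csv(paste(folder, "test.csv", sep = ""), header = TRUE, sep = ",")
# Count the rows in each Year/Site/Brood group
matches <- aggregate(Parents ~ Year + Site + Brood, temp, FUN = length)
# Attach the counts; merge() names the clashing columns Parents.x (values) and Parents.y (counts)
temp <- merge(temp, matches, by = c("Year", "Site", "Brood"))
# Keep rows from groups with at least 3 rows and drop the count column
temp <- temp[temp$Parents.y >= 3, c("Year", "Site", "Brood", "Parents.x")]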
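
A different base R route, not the one used in this answer, is `ave()`, which avoids the merge and leaves the original columns and row order untouched. This is only a sketch on the question's sample; the names `broods`, `group_size` and `kept_base` are illustrative.

```r
# Sample data from the question (the name `broods` is hypothetical)
broods <- data.frame(
  Year    = 1996,
  Site    = "A",
  Brood   = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3),
  Parents = c(1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1)
)

# ave() applies FUN within each group and returns one value per row;
# with FUN = length, every row gets the size of the group it belongs to
group_size <- ave(
  seq_len(nrow(broods)),
  broods$Year, broods$Site, broods$Brood,
  FUN = length
)

# Keep only the rows whose group has at least 3 rows
kept_base <- broods[group_size >= 3, ]
```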

1 Comment

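Or in a very ugly one-liner (note that this relies on the rows of dat already being sorted in the same order as the groups that split() produces): dat[unlist(sapply(split(dat,list(dat$Year,dat$Site,dat$Brood)),function(x){rep(nrow(x),nrow(x))}))>=3,]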