How to delete groups containing less than 3 rows of data in R? [duplicate]

Question

I'm using the dplyr package in R and have grouped my data by 3 variables (Year, Site, Brood).

I want to remove groups made up of fewer than 3 rows. For example, in the following sample I would like to remove the rows for brood '2'. I have a lot of data, so while I could painstakingly do this by hand, it would be very helpful to automate it in R.

Year Site Brood Parents
1996 A    1     1  
1996 A    1     1  
1996 A    1     0  
1996 A    1     0  
1996 A    2     1      
1996 A    2     0  
1996 A    3     1  
1996 A    3     1  
1996 A    3     1  
1996 A    3     0  
1996 A    3     1

I hope this makes sense and thank you very much in advance for your help! I'm new to R and stackoverflow so apologies if the way I've worded this question isn't very good! Let me know if I need to provide any other information.

Is dplyr necessary? Or are solutions in base-R or data.table also appropriate?
– Heroka, Commented Feb 8, 2016 at 14:32
@Heroka data %>% group_by(Year, Site, Brood) %>% filter(n() >= 3) why wouldn't you use dplyr? ;)
– Mullefa, Commented Feb 8, 2016 at 14:40
@Mullefa because there are other options, and I'm personally more comfortable with data.table and base-R. But I understand that preferences can vary between persons :P
– Heroka, Commented Feb 8, 2016 at 14:45
@Heroka I'm sure those other methods are completely appropriate! Like I say I'm a total rookie and I've simply used dplyr more than base-R/data.table :P
– Keeley Seymour, Commented Feb 8, 2016 at 15:10

drhagen · Accepted Answer · 2016-02-08 14:46:43Z

38

One way to do it is to use the magic n() function within filter:

library(dplyr)

my_data <- data.frame(Year=1996, Site="A", Brood=c(1,1,2,2,2))

my_data %>% 
  group_by(Year, Site, Brood) %>% 
  filter(n() >= 3)

The n() function gives the number of rows in the current group (or the number of rows total if there is no grouping).
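
As a rough check against the sample in the question, the same filter can be sketched on a hand-typed copy of that data. The name `broods` is made up for illustration, and `ungroup()` is added only on the assumption that later steps should not inherit the grouping.

```r
library(dplyr)

# Hand-typed copy of the sample from the question (the name `broods` is hypothetical)
broods <- data.frame(
  Year    = 1996,
  Site    = "A",
  Brood   = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3),
  Parents = c(1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1)
)

kept <- broods %>%
  group_by(Year, Site, Brood) %>%
  filter(n() >= 3) %>%   # keep only groups with at least 3 rows
  ungroup()              # drop the grouping so later steps see a plain tibble

# Brood 2 has only 2 rows, so it is removed; broods 1 and 3 remain
unique(kept$Brood)
```

Without the final `ungroup()`, the result stays grouped by Year, Site and Brood, which may or may not be what the next step expects.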

edited Feb 8, 2016 at 14:46

answered Feb 8, 2016 at 14:34

drhagen


1 Comment

Ricky Over a year ago

This awesome answer helped me. I had a bit of a hard time finding the official dplyr documentation for n(), so here it is in case anyone else needs it: dplyr.tidyverse.org/reference/n.html The main takeaway is that n() can be used within summarise(), mutate() and filter().
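
To illustrate that takeaway, here is a small sketch that uses n() in each of the three verbs. The data frame `toy` and the column name `group_size` are invented for the example.

```r
library(dplyr)

# Toy data: brood 1 has 2 rows, brood 2 has 3 rows (names are hypothetical)
toy <- data.frame(Brood = c(1, 1, 2, 2, 2))

# summarise(): one row per group, holding the group's size
sizes <- toy %>% group_by(Brood) %>% summarise(group_size = n())

# mutate(): keep every row and attach the size of its group
tagged <- toy %>% group_by(Brood) %>% mutate(group_size = n()) %>% ungroup()

# filter(): keep only rows that belong to groups of at least 3 rows
big <- toy %>% group_by(Brood) %>% filter(n() >= 3) %>% ungroup()
```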

MichaelChirico · Answer · 2019-05-10 09:44:54Z

9

Throwing the data.table approach here to join the party:

library(data.table)
setDT(my_data)
my_data[ , if (.N >= 3L) .SD, by = .(Year, Site, Brood)]
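
For readers new to data.table, a self-contained sketch of the same idea follows. Within each `by` group, `.N` is the number of rows and `.SD` is the subset of data for that group, so the `if` returns the group's rows only when the group is large enough. The table `dt` is a hand-typed stand-in for the question's sample, not the asker's actual data.

```r
library(data.table)

# Hand-typed stand-in for the sample in the question (the name `dt` is hypothetical)
dt <- data.table(
  Year    = 1996,
  Site    = "A",
  Brood   = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3),
  Parents = c(1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1)
)

# For each group: return its rows (.SD) if it has at least 3 of them (.N),
# otherwise return NULL, which contributes no rows to the result
kept <- dt[, if (.N >= 3L) .SD, by = .(Year, Site, Brood)]
```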

edited May 10, 2019 at 9:44

answered Feb 8, 2016 at 14:51

MichaelChirico


pluke · Answer · 2016-02-08 14:40:27Z

3

You can also do this using base R:

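# `folder` is assumed to hold the path of the directory containing test.csv (with a trailing separator)
temp <- read.csv(paste(folder, "test.csv", sep=""), header=TRUE, sep=",")
# Count the rows in each Year/Site/Brood group
matches <- aggregate(Parents ~ Year + Site + Brood, temp, FUN="length")
# Attach the counts to every row; merge() names the two Parents columns Parents.x (data) and Parents.y (count)
temp <- merge(temp, matches, by=c("Year","Site","Brood"))
# Keep rows from groups with at least 3 rows, and only the first four columns
temp <- temp[temp$Parents.y >= 3, c(1,2,3,4)]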

answered Feb 8, 2016 at 14:40

pluke


1 Comment

Heroka Over a year ago

Or in a very ugly one-liner: dat[unlist(sapply(split(dat,list(dat$Year,dat$Site,dat$Brood)),function(x){rep(nrow(x),nrow(x))}))>=3,]
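
A more readable base R variant is sketched below using ave(), which returns one value per row in the original row order, so the result does not depend on how the data is sorted. The data frame `dat` is a hand-typed copy of the question's sample, and `group_size` is a name chosen for this example.

```r
# Hand-typed copy of the sample from the question
dat <- data.frame(
  Year    = 1996,
  Site    = "A",
  Brood   = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3),
  Parents = c(1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1)
)

# ave() applies length() within each Year/Site/Brood group and returns
# the group size once per row, aligned with the rows of dat
group_size <- ave(seq_len(nrow(dat)), dat$Year, dat$Site, dat$Brood, FUN = length)

# Keep rows whose group has at least 3 rows
kept <- dat[group_size >= 3, ]
```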
