R Subset data frame and perform function based on columns

Question

Sample data. I'm not sure how to use the code block system on SO yet.

df <- data.frame(id    = c(1,1,1,2,2,2,3,3,3),
                 year  = c(1990,1991,1992,1990,1991,1992,1990,1991,1992),
                 value = c(1,2,3,3,2,1,2,1,3))

That generates a simple data frame:

id year value
 1 1990     1
 1 1991     2
 1 1992     3
 2 1990     3
 2 1991     2
 2 1992     1
 3 1990     2
 3 1991     1
 3 1992     3

I was sorting through the R subsetting questions and couldn't figure out the second step in a ddply call ({plyr}) applied to this data.

Logic: for each id subgroup, find the row with the highest value (which is 3 here); if several rows share that value, take the one at the earliest time point.

I'm confused as to what syntax to use here. From searching SO, I think ddply is the best choice, but I can't figure out how to use it. Ideally, my output should have one row per unique ID (only one row is selected per ID, with the entire row taken along with it). The following isn't working in R for me, but it's the best "logic" I could come up with.

ddply( (ddply(df,id)), year, which.min(value) )

E.g.

id year value
 1 1992     3
 2 1990     3
 3 1992     3

If 3 is not available, the next highest (2, or 1) should be taken. Any ideas?

Roland · Accepted Answer · 2013-07-17 19:43:45Z


You need to understand that ddply splits your original data.frame into data.frames according to the splitting variable(s). Thus, it needs a function with a data.frame as argument and return value.

library(plyr)
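ddply(df, .(id), function(DF) {
  # row with the largest value (which.max() returns only the first match)
  res <- DF[which.max(DF$value), ]
  # earliest year among the selected rows
  res[which.min(res$year), ]
})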

#   id year value
# 1  1 1992     3
# 2  2 1990     3
# 3  3 1992     3
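
The split, apply, and combine steps that ddply performs can be sketched in base R. This is only an illustration, not a replacement for the answer's code: the data frame `df2` and the helper name `top_earliest` are made up for the example. Because `which.max()` returns only the first maximum, the helper here keeps all rows tied for the maximum before picking the earliest year, so it does not rely on the rows being sorted by year.

```r
# Hypothetical data: rows are NOT sorted by year and the maximum is tied
df2 <- data.frame(id    = c(1, 1, 1, 2, 2, 2),
                  year  = c(1992, 1990, 1991, 1991, 1992, 1990),
                  value = c(3, 3, 1, 2, 1, 2))

# Illustrative helper: keep every row tied for the maximum value,
# then pick the earliest year among those rows
top_earliest <- function(DF) {
  top <- DF[DF$value == max(DF$value), , drop = FALSE]
  top[which.min(top$year), , drop = FALSE]
}

pieces   <- split(df2, df2$id)            # split: one data.frame per id
selected <- lapply(pieces, top_earliest)  # apply: one row per piece
res      <- do.call(rbind, selected)      # combine: stack the rows
rownames(res) <- NULL
res
#   id year value
# 1  1 1990     3
# 2  2 1990     2
```

With plyr loaded, the same helper could be passed straight to ddply as `ddply(df2, .(id), top_earliest)`.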


3 Comments

ashah57 Over a year ago

Thank you! I haven't written many functions yet in R, so that is where I got caught up. If I understand it, ddply generates a data frame which can be manipulated by a function statement (as you placed above).

Roland Over a year ago

Exactly, ddply splits into data.frames, which get passed to the function. The function returns data.frames to ddply which in turn combines them.

ashah57 Over a year ago

I've returned to this function you wrote 4 times now - this was perfect, Roland.

eddi · Accepted Answer · 2013-07-17 19:49:20Z


I believe data.table is the best tool for you (both for speed and syntactic reasons):

library(data.table)
dt = data.table(df)

# order by year, and then take the first row for each id that has max 'value'
dt[order(year), .SD[which.max(value)], by = id]
#   id year value
#1:  1 1992     3
#2:  2 1990     3
#3:  3 1992     3

# if you're after speed, this slightly worse syntax is the current way of achieving it
dt[dt[order(year), .I[which.max(value)], by = id]$V1]
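
The same "sort first, then take the first row per id" idea can also be sketched without any packages. This is a base R illustration under the assumption that `value` is numeric (so it can be negated inside `order()`); it is not part of the data.table API.

```r
# Sample data from the question
df <- data.frame(id    = c(1,1,1,2,2,2,3,3,3),
                 year  = c(1990,1991,1992,1990,1991,1992,1990,1991,1992),
                 value = c(1,2,3,3,2,1,2,1,3))

# Sort by id, then by value (descending), then by year (ascending),
# so the wanted row comes first within each id
sorted <- df[order(df$id, -df$value, df$year), ]

# Keep only the first row of each id
res <- sorted[!duplicated(sorted$id), ]
rownames(res) <- NULL
res
#   id year value
# 1  1 1992     3
# 2  2 1990     3
# 3  3 1992     3
```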


1 Comment

ashah57 Over a year ago

This appears to have worked as well - it matched with what Roland posted in terms of results. Thanks so much eddi.
