Sample data. I'm not sure how to use the code block system on SO yet.

df <- data.frame(c(1,1,1,2,2,2,3,3,3),c(1990,1991,1992,1990,1991,1992,1990,1991,1992),c(1,2,3,3,2,1,2,1,3))
colnames(df) <- c("id", "year", "value")

That generates a simple data frame.

id year value
1 1990 1
1 1991 2
1 1992 3
2 1990 3
2 1991 2
2 1992 1
3 1990 2
3 1991 1
3 1992 3

I was going through the R subsetting questions and couldn't figure out the second step of a ddply call (from the plyr package) applied to this data.

Logic: For each ID subgroup, find the row with the highest value (3 here), taking the earliest year if that value occurs more than once.

I'm confused as to what syntax to use here. From searching SO, I think ddply is the best choice, but I can't figure out how to use it. Ideally, my output should have one row per unique ID (as only one row is selected per ID), with the entire row kept. The following isn't working in R for me, but it's the best "logic" I could come up with.

ddply( (ddply(df,id)), year, which.min(value) )

E.g.

id year value
1 1992 3
2 1990 3
3 1992 3

If 3 is not available, the next highest (2, or 1) should be taken. Any ideas?

2 Answers

You need to understand that ddply splits your original data.frame into data.frames according to the splitting variable(s). Thus, it needs a function with a data.frame as argument and return value.

library(plyr)
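ddply(df, .(id), function(DF) {
  # keep every row that ties for the highest value within this id
  # (which.max alone would return only the first such row)
  res <- DF[DF$value == max(DF$value), ]
  # of those, return the row with the earliest year
  res[which.min(res$year), ]
})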

#   id year value
# 1  1 1992     3
# 2  2 1990     3
# 3  3 1992     3
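
To make the split/apply/combine idea concrete, here is a minimal base R sketch of roughly what ddply does with such a function. It is an illustration, not plyr's actual implementation; the helper name pick_row is hypothetical, and the sample data is repeated from the question so the block runs on its own.

```r
# Sample data from the question
df <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  year  = c(1990, 1991, 1992, 1990, 1991, 1992, 1990, 1991, 1992),
  value = c(1, 2, 3, 3, 2, 1, 2, 1, 3)
)

# Hypothetical helper: takes the data.frame for one id and returns one row,
# the earliest year among the rows holding the highest value
pick_row <- function(DF) {
  top <- DF[DF$value == max(DF$value), ]
  top[which.min(top$year), ]
}

pieces <- split(df, df$id)          # split: one data.frame per id
picked <- lapply(pieces, pick_row)  # apply: one-row data.frame per piece
result <- do.call(rbind, picked)    # combine: stack the rows back together
rownames(result) <- NULL
result
```

Each step maps onto one part of the ddply call: the splitting variable, the function, and the combined return value.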

3 Comments

Thank you! I haven't written many functions yet in R, so that is where I got caught up. If I understand it, ddply generates a data frame which can be manipulated by a function statement (as you placed above).
Exactly, ddply splits into data.frames, which get passed to the function. The function returns data.frames to ddply which in turn combines them.
I've returned to this function you wrote 4 times now - this was perfect, Roland.

I believe data.table is the best tool for you (both for speed and syntactic reasons):

library(data.table)
dt = data.table(df)

# order by year, and then take the first row for each id that has max 'value'
dt[order(year), .SD[which.max(value)], by = id]
#   id year value
#1:  1 1992     3
#2:  2 1990     3
#3:  3 1992     3

# if you're after speed, this slightly worse syntax is the current way of achieving it
dt[dt[order(year), .I[which.max(value)], by = id]$V1]
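
The same "sort first, then take the first row per id" idea can also be sketched in base R, without any package. This is only an alternative illustration under the assumption that value and year are numeric; the sample data is repeated from the question so the block is self-contained.

```r
# Sample data from the question
df <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  year  = c(1990, 1991, 1992, 1990, 1991, 1992, 1990, 1991, 1992),
  value = c(1, 2, 3, 3, 2, 1, 2, 1, 3)
)

# Sort by id, then highest value first, then earliest year first,
# so the wanted row is the first one within each id
sorted <- df[order(df$id, -df$value, df$year), ]

# duplicated() flags every repeat of an id, so negating it keeps
# only the first row of each id
result <- sorted[!duplicated(sorted$id), ]
rownames(result) <- NULL
result
```

Because ties on value are broken by year in the sort itself, this does not depend on the original row order.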

1 Comment

This appears to have worked as well - it matched with what Roland posted in terms of results. Thanks so much eddi.
