I am working on a raw dataset that looks something like this:
df <- data.frame("ID" = c("Alpha", "Alpha", "Alpha", "Alpha",
"Beta","Beta", "Beta","Beta" ),
"treatment"= LETTERS[seq(from = 1, to = 8)],
"Year" = c(1970, 1970, 1980, 1990, 1970, 1980,
1980,1990),
"Val" = c(0,0,0,1,0,1,0,1),
"Val2" = c(0,2.34,1.3,0,0,2.34,3.2,1.3))
The data is a bit dirty, as I have multiple observations for some ID/Year combinations: e.g. there are two different rows for Alpha in 1970. The same holds for Beta in 1980.
The issue is that the variables of interest, Val and Val2, have different values in the rows that are duplicated in terms of ID/Year.
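To make the duplication concrete, here is a minimal base R sketch that lists the ID/Year pairs occurring more than once. The helper column `n` and the objects `counts` and `dups` are names introduced only for this illustration.

```r
df <- data.frame(
  ID = c("Alpha", "Alpha", "Alpha", "Alpha", "Beta", "Beta", "Beta", "Beta"),
  treatment = LETTERS[1:8],
  Year = c(1970, 1970, 1980, 1990, 1970, 1980, 1980, 1990),
  Val = c(0, 0, 0, 1, 0, 1, 0, 1),
  Val2 = c(0, 2.34, 1.3, 0, 0, 2.34, 3.2, 1.3)
)

# Add a helper column of ones (hypothetical name `n`) and sum it per ID/Year
counts <- aggregate(n ~ ID + Year, data = transform(df, n = 1), FUN = sum)

# Keep only the ID/Year pairs that appear more than once
dups <- counts[counts$n > 1, ]
print(dups)
```

On the sample data this should flag exactly the two pairs mentioned above, Alpha/1970 and Beta/1980.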
I would like to find a concise way to produce the following final dataframe:
final <- data.frame("ID" = c("Alpha", "Alpha", "Alpha",
"Beta", "Beta","Beta" ),
"treatment"= c("B","C","D","E","G","H"),
"Year" = c(1970, 1980, 1990, 1970,
1980,1990),
"Val" = c(0,0,1,0,0,1),
"Val2" = c(2.34,1.3,0,0,3.2,1.3),
"del_treat" = c("A",NA,NA,NA,"F",NA),
"del_Val"=c(0,NA,NA,NA,1,NA),
"del_Val2"=c(0,NA,NA,NA,2.34,NA))
The logic is the following:
1) I want to have only one observation for every ID/Year.
2) I want to retain only the observation with the higher value of Val2.
3) I would like to store the values of the deleted rows in separate columns (del_treat, del_Val and del_Val2) to keep track of what I am deleting.
To illustrate: in df there is a duplicated observation for Alpha/1970. I want to reduce this to a single row. Val2 takes the values 0 and 2.34, and in the final data frame only the row with 2.34 is retained. However, the values of the deleted row (treatment A) are stored in the newly created columns del_treat, del_Val and del_Val2.
I am able to select rows based on the Val2 value with `setDT(df)[order(-Val2)][, .SD[1, ], by = .(ID, Year)]`, but I would like to find a concise way to also 'store' the deleted values in the new columns.
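For reference, the three rules can be sketched in data.table roughly as follows. This is only a starting point and not necessarily the most concise form: it assumes there are at most two rows per ID/Year (any third or later row would be dropped without being recorded), and it relies on out-of-range indexing such as `Val[2]` returning NA for groups with a single row. The object name `dt` is introduced here for illustration.

```r
library(data.table)

df <- data.frame(
  ID = c("Alpha", "Alpha", "Alpha", "Alpha", "Beta", "Beta", "Beta", "Beta"),
  treatment = LETTERS[1:8],
  Year = c(1970, 1970, 1980, 1990, 1970, 1980, 1980, 1990),
  Val = c(0, 0, 0, 1, 0, 1, 0, 1),
  Val2 = c(0, 2.34, 1.3, 0, 0, 2.34, 3.2, 1.3)
)

dt <- as.data.table(df)

# Sort so that, within each ID/Year, the row with the highest Val2 comes first
setorder(dt, ID, Year, -Val2)

# Row 1 of each group is kept; row 2 (if any) goes into the del_* columns.
# Assumption: no ID/Year group has more than two rows.
final <- dt[, .(
  treatment = treatment[1],
  Val       = Val[1],
  Val2      = Val2[1],
  del_treat = treatment[2],
  del_Val   = Val[2],
  del_Val2  = Val2[2]
), by = .(ID, Year)]

print(final)
```

The grouping columns come first in the result, so the column order differs from the desired `final` above; `setcolorder()` could be used to rearrange them if the exact order matters.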