Here is my data file, df:
Gene CHR Start End Window
AKT3 chr1 243651534 244006553 355019
AKT3 chr1 243666483 244006553 340070
CBL chr11 119076989 119178858 101869
CLCF1 chr11 67131640 67141206 9566
CLCF1 chr11 67131640 67141648 10008
I want to remove the rows that share a value in the Gene column, keeping only the row with the largest Window for each gene.
The result should look like this:
Gene CHR Start End Window
AKT3 chr1 243651534 244006553 355019
CBL chr11 119076989 119178858 101869
CLCF1 chr11 67131640 67141648 10008
I can do this in R with the code below:
data = split(df, df$Gene)
data = lapply(data, function(x) x[which.max(x$Window), , drop=FALSE])
data = do.call("rbind", data)
But could anyone tell me how to do this using awk or sed?
Thanks.
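
For reference, the per-gene maximum described above can be sketched in awk. This is only a minimal sketch, assuming the file is whitespace-delimited with a single header line, Gene in column 1 and Window in column 5. The function name keep_largest_window and the file name df.txt are hypothetical, not from the original data. Genes are printed in order of first appearance, and on a tie the first row seen is kept.

```shell
# Hypothetical helper: print the header, then one row per gene (largest Window).
# Assumes column 1 is Gene and column 5 is Window, separated by whitespace.
keep_largest_window() {
    awk '
        # Pass the header line through unchanged.
        NR == 1 { print; next }

        # First time this gene is seen: remember its order, window and row.
        !($1 in best) {
            order[++n] = $1
            best[$1] = $5 + 0
            row[$1] = $0
            next
        }

        # Gene seen before: replace the stored row only if Window is strictly larger.
        $5 + 0 > best[$1] {
            best[$1] = $5 + 0
            row[$1] = $0
        }

        # Print the kept rows in order of first appearance.
        END { for (i = 1; i <= n; i++) print row[order[i]] }
    ' "$1"
}
```

It could be called as `keep_largest_window df.txt > df_dedup.txt`. awk fits this better than sed because the task needs a numeric comparison across rows, which sed has no direct way to express.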