Here is my data file, df:

Gene    CHR Start   End Window
AKT3    chr1    243651534   244006553   355019
AKT3    chr1    243666483   244006553   340070
CBL chr11   119076989   119178858   101869
CLCF1   chr11   67131640    67141206    9566
CLCF1   chr11   67131640    67141648    10008

I want to remove rows that share a value in the Gene column, keeping only the row with the largest Window for each gene.

The result should look like this:

Gene    CHR Start   End Window
AKT3    chr1    243651534   244006553   355019
CBL chr11   119076989   119178858   101869
CLCF1   chr11   67131640    67141648    10008

I can do that in R using the code below:

data = split(df, df$Gene)
data = lapply(data, function(x) x[which.max(x$Window), , drop=FALSE])
data = do.call("rbind", data)

Could anyone tell me how to do this using awk or sed?

Thanks.

2 Answers

Using awk you can do:

awk '!seen[$1] || $5 > max[$1]{seen[$1]=$0; max[$1]=$5}
     END { for (i in seen) print seen[i]}' file
CLCF1   chr11   67131640    67141648    10008
AKT3    chr1    243651534   244006553   355019
CBL chr11   119076989   119178858   101869

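This awk command uses an array, seen, keyed on the gene name in $1, to hold one row per gene, and a second array, max, to hold the largest value of column 5 found so far for each $1. A row is stored in seen the first time its gene appears, or whenever its $5 is greater than the corresponding entry in max. Because for (i in seen) visits keys in an unspecified order, the output rows do not follow the input order.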
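If the file still contains its header line, the command above treats it as just another row keyed on Gene, and the output order is left to awk. Here is a minimal sketch of a variant that passes the header through and prints genes in first-seen order. It assumes whitespace-separated columns with a header on line 1; dedup_max_window, best, and order are names made up for illustration.

```shell
# Hypothetical helper: for each gene (column 1), keep the row with the
# largest Window (column 5). Reads the named file(s), or stdin if none.
dedup_max_window() {
  awk '
    NR == 1 { print; next }              # pass the header through unchanged
    !($1 in best) { order[++n] = $1 }    # remember genes in first-seen order
    !($1 in best) || $5 + 0 > max[$1] {  # first row for this gene, or a larger window
      best[$1] = $0
      max[$1]  = $5 + 0
    }
    END { for (i = 1; i <= n; i++) print best[order[i]] }
  ' "$@"
}

# Example call on the sample rows from the question (tab-separated here)
printf 'Gene\tCHR\tStart\tEnd\tWindow\n%s\n%s\n%s\n%s\n%s\n' \
  'AKT3	chr1	243651534	244006553	355019' \
  'AKT3	chr1	243666483	244006553	340070' \
  'CBL	chr11	119076989	119178858	101869' \
  'CLCF1	chr11	67131640	67141206	9566' \
  'CLCF1	chr11	67131640	67141648	10008' |
  dedup_max_window
```

The $5 + 0 forces a numeric comparison, and the order array costs one extra entry per gene in exchange for stable output.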

To align the columns in the output, pipe it through column -t:

awk '!seen[$1] || $5 > max[$1]{seen[$1]=$0; max[$1]=$5}
   END { for (i in seen) print seen[i]}' file | column -t
CLCF1  chr11  67131640   67141648   10008
AKT3   chr1   243651534  244006553  355019
CBL    chr11  119076989  119178858  101869

1 Comment

I added some explanation in my answer.

Assuming your file is tab-delimited, you can use the command below:
sort -t$'\t' -k5nr File | awk -F'\t' '!a[$1]++'
How it works: sort orders the rows numerically by the Window column, largest first, and awk then prints only the first row it sees for each gene, which is the one with the largest window.
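One caveat with the sort approach is that a header line gets sorted along with the data. Here is a minimal sketch that sets the header aside first, assuming a tab-delimited file with the header on line 1; dedup_by_sort is a hypothetical name, and genes.tsv in the usage comment is a placeholder.

```shell
# Hypothetical helper: print the header, then the largest-Window row per gene.
# Assumes "$1" is a tab-delimited file whose first line is the header.
dedup_by_sort() {
  tab=$(printf '\t')                  # portable stand-in for bash's $'\t'
  head -n 1 "$1"                      # emit the header untouched
  tail -n +2 "$1" |                   # data rows only
    sort -t "$tab" -k5,5nr |          # numeric sort on column 5, descending
    awk -F'\t' '!seen[$1]++'          # keep the first row seen for each gene
}

# Usage: dedup_by_sort genes.tsv
```

The rows come out ordered by descending Window rather than by gene or input position, so pipe the data rows through another sort if a particular order matters.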