Remove duplicated rows based on column values using awk or sed

Question

Here is the data file df:

Gene    CHR Start   End Window
AKT3    chr1    243651534   244006553   355019
AKT3    chr1    243666483   244006553   340070
CBL chr11   119076989   119178858   101869
CLCF1   chr11   67131640    67141206    9566
CLCF1   chr11   67131640    67141648    10008

I want to remove rows that share the same value in the Gene column, keeping only the row with the largest Window for each gene.

The result should look like this:

Gene    CHR Start   End Window
AKT3    chr1    243651534   244006553   355019
CBL chr11   119076989   119178858   101869
CLCF1   chr11   67131640    67141648    10008

I can do that in R using the code below:

data = split(df, df$Gene)
data = lapply(data, function(x) x[which.max(x$Window), , drop=FALSE])
data = do.call("rbind", data)

But could anyone tell me how to do that using awk or sed?

Thanks.

anubhava · Accepted Answer · 2015-05-25 12:26:22Z

2

Using awk you can do:

awk '!seen[$1] || $5 > max[$1]{seen[$1]=$0; max[$1]=$5}
     END { for (i in seen) print seen[i]}' file
CLCF1   chr11   67131640    67141648    10008
AKT3    chr1    243651534   244006553   355019
CBL chr11   119076989   119178858   101869

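This awk command uses the array seen to hold one row per gene ($1) and the array max to hold the largest column-5 value found so far for that gene. An entry in seen is written the first time a gene appears and is overwritten whenever the current record's $5 is greater than the corresponding entry in max. Because for (i in seen) iterates in an unspecified order, the output rows may not follow the input order.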
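The command above treats the header line like any other row, so on a file that still has its header the Gene line is stored in seen and printed somewhere among the results. A minimal sketch of a variant that prints the header first and keeps genes in first-seen order is shown below. It assumes the header is on line 1 and the fields are whitespace-separated; the names best and order and the temp file are illustrative, not part of the original answer.

```shell
# Sample data copied from the question; the temp file is only for illustration.
df=$(mktemp)
cat > "$df" <<'EOF'
Gene    CHR Start   End Window
AKT3    chr1    243651534   244006553   355019
AKT3    chr1    243666483   244006553   340070
CBL chr11   119076989   119178858   101869
CLCF1   chr11   67131640    67141206    9566
CLCF1   chr11   67131640    67141648    10008
EOF

result=$(awk '
    NR == 1 { print; next }                 # pass the header through untouched
    !($1 in best) { order[++n] = $1 }       # remember the first-seen order of genes
    !($1 in best) || $5 + 0 > max[$1] {     # first row for this gene, or a larger window
        best[$1] = $0
        max[$1]  = $5 + 0                   # "+ 0" forces a numeric comparison
    }
    END { for (i = 1; i <= n; i++) print best[order[i]] }
' "$df")

printf '%s\n' "$result"
rm -f "$df"
```

Indexing the output through order avoids relying on the iteration order of for (i in ...), at the cost of one extra array.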

To get formatted output:

awk '!seen[$1] || $5 > max[$1]{seen[$1]=$0; max[$1]=$5}
   END { for (i in seen) print seen[i]}' file | column -t
CLCF1  chr11  67131640   67141648   10008
AKT3   chr1   243651534  244006553  355019
CBL    chr11  119076989  119178858  101869

edited May 25, 2015 at 12:26

answered May 25, 2015 at 9:54

anubhava


1 Comment

anubhava Over a year ago

I added some explanation in my answer.

Dipak · Accepted Answer · 2015-05-25 12:38:24Z

0

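Assuming your file is tab-delimited, you can use the following:

sort -t$'\t' -k5nr File | awk -F'\t' '!a[$1]++'

This is how it works: sort orders the rows numerically by the Window column (column 5) in descending order, and awk then keeps only the first record it sees for each gene, which is the one with the largest window.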
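One thing to watch with the sort approach is that the header line gets sorted along with the data. A rough sketch that keeps the header on top is to split it off with head and sort only the remaining lines. It assumes a tab-delimited file with the header on line 1; the sample file is built with printf so the tabs are explicit, and the tab variable is only there so the example does not depend on `$'\t'` support in the shell.

```shell
# Build a tab-delimited copy of the question's data (printf reuses the format per row).
df=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\n' \
    Gene  CHR   Start     End       Window \
    AKT3  chr1  243651534 244006553 355019 \
    AKT3  chr1  243666483 244006553 340070 \
    CBL   chr11 119076989 119178858 101869 \
    CLCF1 chr11 67131640  67141206  9566 \
    CLCF1 chr11 67131640  67141648  10008 > "$df"

tab=$(printf '\t')   # literal tab character for sort -t

result=$(
    head -n 1 "$df"                        # header goes out first, unsorted
    tail -n +2 "$df" |
        sort -t "$tab" -k5,5nr |           # largest window first
        awk -F'\t' '!seen[$1]++'           # keep the first row per gene
)

printf '%s\n' "$result"
rm -f "$df"
```

With this approach the rows come out ordered by window size, largest first, rather than in their original order.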

answered May 25, 2015 at 12:38

Dipak
