0

I would like to remove duplicates from a dataset that has 3 columns:

A       0   3238
B       0   3367
C       0   3130
D       1   3130

I need to remove lines that contain duplicate values in the third column, preferentially keeping those with the value '1' in the second column. I know how to remove duplicates using awk, but I can't work out how to add the conditional statement.

Thanks

2 Answers

3

Give this one-liner a try:

awk '{if($3 in a)a[$3]=$2==1?$0:a[$3];else a[$3]=$0}END{for(i in a)print a[i]}' file

3 Comments

+1 for a neat way to solve it. I did not realize at first that $2==1?$0:a[$3] is evaluated before the =, which was a bit confusing. I guess a[$3]=($2==1?$0:a[$3]) would work as well.
@Qben Yes, it does, and with brackets it is easier to read.
The syntax without brackets is non-portable, e.g. it would fail syntactically on MacOS awks (or so I hear...).
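Following up on the comments above, here is a sketch of the same logic written out long-hand with the parenthesized ternary. The function name dedup_prefer_one and the single-space sample rows are only for illustration; the awk body is meant to behave the same as the one-liner in the answer.

```shell
# Hypothetical helper name; reads whitespace-separated rows on stdin.
dedup_prefer_one() {
    awk '
    {
        if ($3 in a) {
            # Column 3 already seen: replace the stored line only when
            # this line has 1 in column 2, otherwise keep what we have.
            a[$3] = ($2 == 1 ? $0 : a[$3])
        } else {
            # First line with this column 3 value: store it.
            a[$3] = $0
        }
    }
    END { for (i in a) print a[i] }
    '
}

# Sample rows from the question.
printf 'A 0 3238\nB 0 3367\nC 0 3130\nD 1 3130\n' | dedup_prefer_one
```

Note that for (i in a) visits keys in an unspecified order, so the output lines may not come out in input order; pipe through sort if the order matters.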
3
$ sort -k2nr file | awk '!seen[$3]++'
D       1   3130
A       0   3238
B       0   3367

2 Comments

Interesting bits of awk. Can you please explain the !seen[$3]++ part?
It's the common awk idiom to output only the first value in a series of potential duplicates. Every time a value is used as an index in the array, the array's entry for that value is post-incremented. The first time a value is seen its array entry is zero, so the ! operator makes the overall result true. After that first time the array entry is non-zero, so the ! makes the result false. It's like uniq, but it doesn't require the values to be sorted and it lets you operate on fields rather than the whole input line/record.
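To make that explanation concrete, here is a sketch of the idiom spelled out as an explicit test and increment. The function name first_by_col3 is just an illustrative choice, and the input is the question's data already sorted so that the row with 1 in column 2 comes first, as sort -k2nr would arrange it.

```shell
# Hypothetical helper name; long-hand equivalent of: awk '!seen[$3]++'
first_by_col3() {
    awk '{
        # An unset array entry compares equal to 0, so this is true
        # only the first time a given column 3 value appears.
        if (seen[$3] == 0) print
        # Count the occurrence so later duplicates are skipped.
        seen[$3]++
    }'
}

# Rows in the order sort -k2nr would produce them.
printf 'D 1 3130\nA 0 3238\nB 0 3367\nC 0 3130\n' | first_by_col3
```

This prints the D, A and B rows and drops C, because 3130 has already been counted by the time C is read.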
