
Given this input:

#       133        15
KBL  40.385  26.385 1.0000 S
KBL  23.846   9.289 1.0000 P
KBL  40.234  26.385 1.0000 S
#       133         4
KBL  40.234  28.566 1.0000 S
KBL  40.385  28.566 1.0000 S
KBL  23.846  12.032 1.0000 P

I wish to remove the duplicate rows, specifically where a value in either column 2 or column 3 is repeated. In other words, I wish to get this output:

#       133        15
KBL  40.385  26.385 1.0000 S
KBL  23.846   9.289 1.0000 P
#       133         4
KBL  40.234  28.566 1.0000 S
KBL  23.846  12.032 1.0000 P

I have tried awk '!a[$0]++' file.xy. However, that only removes lines that are fully identical. I'm looking to remove only the lines that have a repeated value in either column two or column three.
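For reference, here is that attempt again with comments; the array is keyed on the whole record, which is why only byte-for-byte identical lines are collapsed:

# keeps only the first occurrence of each complete line ($0);
# lines that differ in any field at all are kept
awk '!a[$0]++' file.xy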

Using Awk, what would be the best way to remove these duplicate rows? Thanks.

  • Can you share what you have tried so far? Commented Sep 10, 2018 at 0:51
  • I have tried awk '!a[$0]++' file.xy; however, that only removes the lines that are fully identical. I'm looking to remove only the lines that have repeated values in either column two or three. Commented Sep 10, 2018 at 1:45
  • Your last row has a duplicate $2; not sure your spec is consistent with the posted data. Commented Sep 10, 2018 at 1:52

2 Answers


Assuming you want the lines that start with # printed, do not want their $2 or $3 values considered in the tests for duplicate values, and only want to eliminate duplicates within each of the separate #-line delimited blocks:

$ awk '/^#/{print; delete seen; next} !(seen[$2]++ || seen[$3]++)' file
#       133        15
KBL  40.385  26.385 1.0000 S
KBL  23.846   9.289 1.0000 P
#       133         4
KBL  40.234  28.566 1.0000 S
KBL  23.846  12.032 1.0000 P
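If a column 2 value could ever coincide with a column 3 value, the single seen array would treat that as a duplicate across columns. In case that matters, here is a minimal sketch of the same idea with a separate array per column (the posted sample never collides, so on that input the output is identical):

$ awk '/^#/{print; delete s2; delete s3; next} !(s2[$2]++ || s3[$3]++)' file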



This will give you the output you require. We need two files for this method:

awk '!a[$2]++ || !a[$3]++' file1.txt > file2.txt && awk '!a[$3]++' file2.txt

The output produced is:

#       133        15
KBL  40.385  26.385 1.0000 S
KBL  23.846   9.289 1.0000 P
#       133         4
KBL  40.234  28.566 1.0000 S
KBL  23.846  12.032 1.0000 P
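As one of the comments below suggests, the intermediate file can be avoided by piping the first awk into the second; this is only a rearrangement of the same two passes, not a change in logic:

awk '!a[$2]++ || !a[$3]++' file1.txt | awk '!a[$3]++'

Each awk runs as its own process, so the two a arrays are independent, exactly as in the two-file version.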

4 Comments

Not sure this is needed here, but you can pipe one awk into another instead of going through a bash variable and printing it into a file. This itself is redundant since the output of the first command can be written to a file directly as well.
@karakfa Thanks for the info, I think you are right. I have updated my answer.
I don't follow the logic that got you to it, but I'm pretty sure your code is equivalent to awk '(!a[$2]++ || !a[$3]++) && !a[$3]++' file1.txt, and my boolean algebra is a bit rusty, but I'm again pretty sure that (!a[$2]++ || !a[$3]++) && !a[$3]++ is equivalent to just !a[$3]++. It also mixes the numeric values on the lines that start with # with the other lines, which I don't think is what the OP wants, and it does the uniqueness test across the whole file instead of separately within each of the #-line delimited blocks, which again I think isn't what the OP wants.
@EdMorton You always teach me something new. Yes that is so silly of me. You are 100% right. My code is not equivalent to awk '(!a[$2]++ || !a[$3]++) && !a[$3]++' file1.txt. :)
