
Given this input:

#       133        15
KBL  40.385  26.385 1.0000 S
KBL  23.846   9.289 1.0000 P
KBL  40.234  26.385 1.0000 S
#       133         4
KBL  40.234  28.566 1.0000 S
KBL  40.385  28.566 1.0000 S
KBL  23.846  12.032 1.0000 P

I wish to remove the duplicate rows, specifically where a value in either column 2 or column 3 is repeated. In other words, I wish to get this output:

#       133        15
KBL  40.385  26.385 1.0000 S
KBL  23.846   9.289 1.0000 P
#       133         4
KBL  40.234  28.566 1.0000 S
KBL  23.846  12.032 1.0000 P

I have tried awk '!a[$0]++' file.xy. However, that only removes lines that are fully identical. I'm looking to remove only the lines that have a repeated value in either column two or column three.
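For reference, here is that attempt again with comments; the array is keyed on the whole record, which is why only byte-for-byte identical lines are collapsed:

# keeps only the first occurrence of each complete line ($0);
# lines that differ in any field at all are kept
awk '!a[$0]++' file.xy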

Using Awk, what would be the best way to remove these duplicate rows? Thanks.

  • Can you share what you have tried so far? Commented Sep 10, 2018 at 0:51
  • I have tried awk '!a[$0]++' file.xy; however, that only removes the lines that are fully identical. I'm looking to remove only the lines that have repeated values in either column two or three. Commented Sep 10, 2018 at 1:45
  • Your last row has a duplicate $2; not sure your spec is consistent with the posted data. Commented Sep 10, 2018 at 1:52

2 Answers


Assuming you want the lines that start with # printed, do not want their $2 or $3 values considered in the tests for duplicate values, and only want to eliminate duplicates within each of the separate #-line delimited blocks:

$ awk '/^#/{print; delete seen; next} !(seen[$2]++ || seen[$3]++)' file
#       133        15
KBL  40.385  26.385 1.0000 S
KBL  23.846   9.289 1.0000 P
#       133         4
KBL  40.234  28.566 1.0000 S
KBL  23.846  12.032 1.0000 P
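If a column 2 value could ever coincide with a column 3 value, the single seen array would treat that as a duplicate across columns. In case that matters, here is a minimal sketch of the same idea with a separate array per column (the posted sample never collides, so on that input the output is identical):

$ awk '/^#/{print; delete s2; delete s3; next} !(s2[$2]++ || s3[$3]++)' file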



This will give you the output you require. We need two files for this method:

awk '!a[$2]++ || !a[$3]++' file1.txt > file2.txt && awk '!a[$3]++' file2.txt

The output produced is:

#       133        15
KBL  40.385  26.385 1.0000 S
KBL  23.846   9.289 1.0000 P
#       133         4
KBL  40.234  28.566 1.0000 S
KBL  23.846  12.032 1.0000 P
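As one of the comments below suggests, the intermediate file can be avoided by piping the first awk into the second; this is only a rearrangement of the same two passes, not a change in logic:

awk '!a[$2]++ || !a[$3]++' file1.txt | awk '!a[$3]++'

Each awk runs as its own process, so the two a arrays are independent, exactly as in the two-file version.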

4 Comments

Not sure this is needed here, but you can pipe one awk into another instead of going through a bash variable and printing it into a file. This itself is redundant since the output of the first command can be written to a file directly as well.
@karakfa Thanks for the info, I think you are right. I have updated my answer.
I don't follow the logic that got you to it, but I'm pretty sure your code is equivalent to awk '(!a[$2]++ || !a[$3]++) && !a[$3]++' file1.txt, and my boolean algebra is a bit rusty, but I'm again pretty sure that (!a[$2]++ || !a[$3]++) && !a[$3]++ is equivalent to just !a[$3]++. It also mixes the numeric values on the lines that start with # with the other lines, which I don't think is what the OP wants, and it does the uniqueness test across the whole file instead of separately within each of the #-line delimited blocks, which again I think isn't what the OP wants.
@EdMorton You always teach me something new. Yes that is so silly of me. You are 100% right. My code is not equivalent to awk '(!a[$2]++ || !a[$3]++) && !a[$3]++' file1.txt. :)
