
I have a CSV file separated with `;`. I need to remove every line whose combined content of the 2nd and 3rd columns is not unique, and write the result to standard output.

Example input:

irrelevant;data1;data2;irrelevant;irrelevant  
irrelevant;data3;data4;irrelevant;irrelevant  
irrelevant;data5;data6;irrelevant;irrelevant  
irrelevant;data7;data8;irrelevant;irrelevant  
irrelevant;data1;data2;irrelevant;irrelevant  
irrelevant;data9;data0;irrelevant;irrelevant  
irrelevant;data1;data2;irrelevant;irrelevant  
irrelevant;data3;data4;irrelevant;irrelevant  

Desired output:

irrelevant;data5;data6;irrelevant;irrelevant  
irrelevant;data7;data8;irrelevant;irrelevant  
irrelevant;data9;data0;irrelevant;irrelevant  

I have found solutions that keep only the first occurrence of each duplicate:

sort -u -t ';' -k2,3 file  

but this is not enough: lines with a duplicated key should be removed entirely, not collapsed into one.

I have tried to use uniq -u, but I can't find a way to make it check only specific columns.

  • In all the lines there isn't a unique value in the 2nd and 3rd columns. Commented Aug 22, 2014 at 15:28
  • I agree with @jaypal, that question is about finding unique records only. Commented Aug 22, 2014 at 15:32
  • @AvinashRaj: OP wants to list those records where col2, col3 appear only once in whole file. Commented Aug 22, 2014 at 15:38
  • 1
    Yes, @anubhava is right. Storing the material in some temporary template seems to be the only way. It seems both awk and perl solutions are very similar. Commented Aug 23, 2014 at 0:19

3 Answers


Using awk:

awk -F';' '!seen[$2,$3]++{data[$2,$3]=$0}
      END{for (i in seen) if (seen[i]==1) print data[i]}' file
irrelevant;data5;data6;irrelevant;irrelevant
irrelevant;data7;data8;irrelevant;irrelevant
irrelevant;data9;data0;irrelevant;irrelevant

Explanation: If the $2,$3 combination doesn't already exist in the seen array, the whole record is stored in the data array under the key $2,$3. Every time a $2,$3 combination is found, its counter in seen is incremented. At the end, only the entries whose counter equals 1 are printed.
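One caveat: `for (i in seen)` iterates in no guaranteed order, so the output may not follow the input order. If order matters, the same counting idea can be applied in two passes over the file (a sketch, writing the question's sample input to a temporary file for demonstration):

```shell
# Build the question's sample input for demonstration.
cat > input.csv <<'EOF'
irrelevant;data1;data2;irrelevant;irrelevant
irrelevant;data3;data4;irrelevant;irrelevant
irrelevant;data5;data6;irrelevant;irrelevant
irrelevant;data7;data8;irrelevant;irrelevant
irrelevant;data1;data2;irrelevant;irrelevant
irrelevant;data9;data0;irrelevant;irrelevant
irrelevant;data1;data2;irrelevant;irrelevant
irrelevant;data3;data4;irrelevant;irrelevant
EOF

# Pass 1 (NR==FNR) counts each column-2/column-3 key;
# pass 2 prints only the lines whose key occurred exactly once.
awk -F';' 'NR==FNR{count[$2,$3]++; next} count[$2,$3]==1' input.csv input.csv
```

Reading the file twice keeps the surviving lines in their original order, at the cost of a second pass.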




If order is important and you can use perl:

perl -F";" -lane '
    $key = @F[1,2]; 
    $uniq{$key}++ or push @rec, [$key, $_] 
}{ 
    print $_->[1] for grep { $uniq{$_->[0]} == 1 } @rec' file
irrelevant;data5;data6;irrelevant;irrelevant  
irrelevant;data7;data8;irrelevant;irrelevant  
irrelevant;data9;data0;irrelevant;irrelevant  

We use columns 2 and 3 to create a composite key (joined with `;` so that distinct pairs cannot collide). For the first occurrence of each key, we push a `[key, line]` pair onto the array @rec.

In the (implicit) END block, we check whether that key occurred exactly once in the file. If so, we print the line.
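An aside on composite keys (an illustrative sketch, not part of either answer): concatenating the two fields with no separator can merge distinct pairs, e.g. `ab`+`c` and `a`+`bc` both become `abc`. awk's `seen[$2,$3]` form avoids this via the built-in SUBSEP; joining with `;` in the perl version plays the same role:

```shell
# Demonstrate the collision: keys built by plain concatenation clash,
# while keys built with awk's comma form (SUBSEP) stay distinct.
awk 'BEGIN {
    bad["ab" "c"]++;   bad["a" "bc"]++     # both become bad["abc"]
    good["ab", "c"]++; good["a", "bc"]++   # SUBSEP keeps two keys
    nb = 0; for (k in bad)  nb++
    ng = 0; for (k in good) ng++
    print nb, ng                           # prints: 1 2
}'
```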


awk '!a[$0]++' file_input > file_output

This worked for me, but note that it compares whole lines and keeps the first occurrence of each duplicate, rather than removing every duplicated line.
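A quick check on the question's sample makes the difference visible: because this keeps the first copy of each duplicate, the `data1;data2` and `data3;data4` records the question wants removed still appear (a demonstration sketch, writing the sample to a temporary file):

```shell
# The question's sample input.
cat > input.csv <<'EOF'
irrelevant;data1;data2;irrelevant;irrelevant
irrelevant;data3;data4;irrelevant;irrelevant
irrelevant;data5;data6;irrelevant;irrelevant
irrelevant;data7;data8;irrelevant;irrelevant
irrelevant;data1;data2;irrelevant;irrelevant
irrelevant;data9;data0;irrelevant;irrelevant
irrelevant;data1;data2;irrelevant;irrelevant
irrelevant;data3;data4;irrelevant;irrelevant
EOF

# De-duplicates whole lines, keeping the FIRST occurrence of each,
# so data1;data2 and data3;data4 survive, unlike the desired output.
awk '!a[$0]++' input.csv
```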

