Unix Delete Duplicate rows from csv based on 2 columns

Question

I have a CSV file with multiple columns. Some rows may share the same value in the 4th column (col4).

When duplicates occur, I need to delete the whole duplicate rows and keep only one row per col4 value. The row to keep is the one with the highest value in col1.

Below is an example:

col1,col2,col3,col4

1,x,a,123

2,y,b,123

3,y,b,123

1,z ,c,999

Rows 1, 2 and 3 are duplicates on col4. Only the third row should be kept, because col1(row3) > col1(row2) > col1(row1).

For now, this code deletes rows with a duplicate col4 without looking at col1:

awk -F',' '!seen[$4]++' myfile.csv

I would like to add a condition that checks col1 within each group of duplicates, deletes the rows with the lower values in col1, and keeps the row with the highest value in col1.

Output should be:

col1,col2,col3,col4

3,y,b,123

1,z,c,999

Thank you!

No, this is not clear. Could you please add more information, a sample Input_file, and the expected output to the post so that everyone can help?
– RavinderSingh13, Commented Jan 10, 2017 at 12:27
There is an input and output example; please read it carefully.
– Mr Smith, Commented Jan 10, 2017 at 13:19

RavinderSingh13 · Accepted Answer · 2017-01-12 03:26:54Z

1

@Mr Smith: Could you please try the following and let me know if it helps?

awk -F"[[:space:]]+,[[:space:]]+"  'FNR==NR{A[$NF]=$1>A[$NF]?$1:A[$NF];next} (($NF) in A) && $1 == A[$NF] && A[$NF]{print}'   Input_file  Input_file

EDIT: Try:

awk -F","  'FNR==NR{A[$NF]=$1>A[$NF]?$1:A[$NF];next} (($NF) in A) && $1 == A[$NF] && A[$NF]{print}' Input_file   Input_file

EDIT2: Following is an explanation, as per the OP's request:
awk -F","                               ##### Start awk and set the field delimiter to a comma (,).
'FNR==NR{                               ##### FNR==NR is TRUE only while Input_file is being read for the first time.
                                              In this pass we store, in array A, the highest $1 seen for each value of the last field.
                                              FNR and NR are awk built-in variables. Both count the lines read, but FNR is
                                              reset each time a new input file is opened, whereas NR keeps increasing
                                              until all input files have been read. So this condition is TRUE only
                                              during the first pass over Input_file.
A[$NF]=                                 ##### Assign to array A, indexed by $NF (the last field of the current line), the result of the expression that follows.
$1>A[$NF]                               ##### The condition: is the current line's $1 greater than the value stored in A[$NF]? (Only lines with the same
                                              last field share an index, so only those lines are compared with each other.)
?                                       ##### ? is the conditional (ternary) operator: if the condition is TRUE, use the value that follows it,
$1                                      ##### which sets A[$NF] to $1 (because, as per your requirement, we need the HIGHEST value).
:                                       ##### If the condition is FALSE, use the value that follows the : instead,
A[$NF];                                 ##### which keeps the value of A[$NF] unchanged.
next}                                   ##### next is an awk built-in keyword: it skips all remaining statements and starts again from the
                                              first statement with the next line, so the statements below are not run during the first pass.
(($NF) in A) && $1 == A[$NF] && A[$NF]{ ##### These conditions are evaluated only during the second pass over Input_file. They check that
                                              $NF (last field of the current line) is an index in array A, that A's value equals the first field, and that A's value is not empty or zero.
print                                   ##### If all of the above conditions are TRUE, print the current line of Input_file.
}' Input_file   Input_file              ##### The input files: the same file is named twice so that it is read twice.
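
As a quick check of the explanation above, here is a small sketch that runs the edited command against the sample data from the question. The temporary file and the `input` and `result` variable names are only for illustration, and the stray space in the question's last data row is assumed to be a typo and left out.

```shell
# Recreate the sample data from the question in a temporary file
# (assumption: the space after "z" in the original post is a typo).
input=$(mktemp)
printf '%s\n' 'col1,col2,col3,col4' '1,x,a,123' '2,y,b,123' '3,y,b,123' '1,z,c,999' > "$input"

# Pass 1 (FNR==NR) stores the highest col1 per last-field value in A;
# pass 2 prints the rows whose col1 equals that stored value.
result=$(awk -F"," 'FNR==NR{A[$NF]=$1>A[$NF]?$1:A[$NF];next} (($NF) in A) && $1 == A[$NF] && A[$NF]{print}' "$input" "$input")

printf '%s\n' "$result"
rm -f "$input"
```

The header line is printed as well, because its first field is compared as a string and ends up stored as the "highest" value for its own key. Note that rows tied on the highest col1 within a group would all be printed.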

edited Jan 12, 2017 at 3:26

answered Jan 10, 2017 at 13:32

RavinderSingh13


4 Comments

Mr Smith Over a year ago

I tried it; the result was the same. The duplicates are still there.

RavinderSingh13 Over a year ago

Of course they will still be there. When you posted, I guess you didn't use code tags, so spaces appeared between the fields, and I wrote my solution accordingly. Could you please try my edited solution?

Mr Smith Over a year ago

Why is Input_file given twice (Input_file Input_file) in the code? Could you explain, please?

RavinderSingh13 Over a year ago

One of the most important reasons is to keep the output lines in the same order as the lines in Input_file. I have edited my solution to add an explanation; let me know if you have any questions about it.
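
For comparison, here is a rough single-pass sketch that reads the input only once. It is an alternative technique, not the method from the answer: it assumes the first line is a header and that the key is in field 4, and the array names `best`, `row` and `order` are made up for illustration. Groups are printed in the order their col4 value first appears, which can differ from the position of the kept row in the original file, and on a tie only the first row is kept.

```shell
# Sample data from the question (assumption: the stray space is a typo).
data='col1,col2,col3,col4
1,x,a,123
2,y,b,123
3,y,b,123
1,z,c,999'

result=$(printf '%s\n' "$data" | awk -F',' '
NR == 1 { print; next }                      # pass the header through unchanged
{
    key = $4
    if (!(key in best) || $1 + 0 > best[key]) {
        if (!(key in best)) order[++n] = key # remember first-seen order of keys
        best[key] = $1 + 0                   # highest col1 so far, forced to a number
        row[key] = $0                        # the full row holding that value
    }
}
END { for (i = 1; i <= n; i++) print row[order[i]] }
')

printf '%s\n' "$result"
```

Reading the file twice, as the answer does, avoids holding whole rows in memory and keeps every surviving line at its original position.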
