My previous question (here) just got a bit more complicated.
I need to identify duplicate entries in a comma-separated text file, based on the combination of columns 4 and 5, where column 4 is an identifier and column 5 is a date. Those two columns can simply be concatenated.
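For illustration, a minimal sketch of how that key could be built (assuming the file is called file.txt):

    awk -F, '{print $4 $5}' file.txt

On the example data below, both rows yield the key X08ABC20200804.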
If there are duplicates, I need to perform a check based on a substring of another column, which represents an int value. The duplicate with the higher value should be kept. Example for creating the substring:

    cut -d"," -f14 file.txt | grep -Eo "N0[0-9]{3}" | sed 's/N0//g'

(Note the -E: without it, plain grep treats {3} as literal characters rather than a repetition count.) If the int values are the same, I need to perform yet another check based on column 9, which represents a date. The row with the later date should be kept.
Example data:
xyz,xyz,xyz,X08ABC,20200804,xyz,xyz,xyz,2020-08-17T20:08:27.000000Z,xyz,xyz,xyz,xyz,long_string_N0809_bla
xyz,xyz,xyz,X08ABC,20200804,xyz,xyz,xyz,2020-09-25T16:18:43.000000Z,xyz,xyz,xyz,xyz,long_string_N0809_bla
In the example, the second row would be kept and the first discarded. The two rows are duplicates (col4+col5 are identical), the int value in col 14 is the same in both (809), and the date in col 9 is later in row 2 than in row 1.
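Running the substring pipeline from above on these two rows prints 809 for both, so the int check alone cannot decide between them and the col 9 date has to break the tie:

    cut -d"," -f14 file.txt | grep -Eo "N0[0-9]{3}" | sed 's/N0//g'
    809
    809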
If I can keep the order of rows in the original file, that is a plus - but not a must.
Edit: An awk solution would be great, if this can be accomplished in a somewhat readable fashion. I assume that's possible, but I get very confused with multiple operations within awk.
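For what it's worth, here is a minimal, untested sketch of one way the whole logic could look in awk. It assumes col 14 always contains exactly one N0### pattern, and relies on the fact that the fixed-format ISO 8601 timestamps in col 9 sort chronologically as plain strings:

    awk -F, '
    {
        key = $4 $5                            # duplicate key: identifier + date
        n = 0
        if (match($14, /N0[0-9][0-9][0-9]/))   # locate the N0### pattern in col 14
            n = substr($14, RSTART + 2, 3) + 0 # keep the 3 digits after "N0", as a number
        # keep this row if its key is new, its int value is higher,
        # or the int value ties and its col 9 timestamp is later
        if (!(key in val) || n > val[key] || (n == val[key] && $9 > ts[key])) {
            val[key] = n
            ts[key]  = $9
            row[key] = $0
        }
    }
    END { for (k in row) print row[k] }        # note: for-in order is unspecified
    ' file.txt

This does not preserve the original row order; if that matters, one could additionally store NR for each kept row and sort the output by it afterwards.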