
My previous question (here) just got a bit more complicated.

  1. I need to identify duplicate entries in a comma-separated text file, based on the combination of columns 4 and 5 (4: an identifier, 5: a date). Those two columns can simply be concatenated.

  2. If there are duplicates, I need to perform a check based on a substring of another column, which represents an int value. The duplicate with the higher value should be kept. Example for creating the substring: cut -d"," -f14 file.txt | grep -oE "N0[0-9]{3}" | sed 's/N0//g'

  3. If the int values are the same, I need to perform yet another check based on column 9, which represents a date. The row with the later date should be kept.

Example data:

xyz,xyz,xyz,X08ABC,20200804,xyz,xyz,xyz,2020-08-11T09:15:42.000000Z,xyz,xyz,xyz,xyz,long_string_N0808_bla
xyz,xyz,xyz,X08ABC,20200804,xyz,xyz,xyz,2020-08-17T20:08:27.000000Z,xyz,xyz,xyz,xyz,long_string_N0809_bla
xyz,xyz,xyz,X08ABC,20200804,xyz,xyz,xyz,2020-09-25T16:18:43.000000Z,xyz,xyz,xyz,xyz,long_string_N0809_bla

In the example, the third row would be kept while the first two are discarded. All three rows are duplicates (col4+col5 are the same), the int value in col 14 is highest in rows 2 and 3, and the date in col 9 is later in row 3 than in row 2.
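
For reference, here is one way all three comparison keys could be pulled out in a single awk pass (a minimal sketch against the sample rows above; the character class is spelled out instead of using {3} so it also runs on awks without interval support):

awk -F, '{
    key  = $4 "_" $5                      # duplicate key: col 4 + col 5
    match($14, /N0[0-9][0-9][0-9]/)       # locate the N0NNN token in col 14
    num  = substr($14, RSTART + 2, 3)     # the int value, e.g. 809
    date = $9                             # ISO 8601 timestamp in col 9
    print key, num, date
}' file.txt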

If I can keep the order of rows in the original file, that is a plus - but not a must.

Edit: Having an awk solution would be great, if this can be accomplished in a somewhat readable fashion. I assume that's possible, but I get very confused by multiple operations within awk.

1 Answer


It took me quite some time reading through other questions, manuals, etc., but I think this should be a possible solution. It does need sorting though, so the order of rows unfortunately doesn't stay the same. It also relies on the column entries always having the same length, which has a lot of potential for trouble.

sort -t, -k14.15,14.17r -k9.1,9.4r -k9.6,9.7r -k9.9,9.10r file.txt | awk -F, 'BEGIN{OFS=","} {print $0, $4"_"$5}' | awk -F, '!a[$15]++'

This sorts the rows in descending order by the int value extracted from column 14 (the 809 part), then by the date in column 9 (year, month, day). Then a new column that combines col 4 and col 5 is appended, to identify duplicates. The last awk command keeps only the first line per duplicate key - which is also the reason why everything needed to be sorted first.
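
If keeping the original row order matters, a two-pass awk script might do the whole job without sort (a sketch only, assuming the N0NNN token in col 14 as in the sample data): the first pass remembers the winning line number per col4/col5 key, the second pass prints only those lines, in file order.

awk -F, '
function intval(s,    n) {                # pull the 3-digit value after "N0"
    if (match(s, /N0[0-9][0-9][0-9]/))
        n = substr(s, RSTART + 2, 3) + 0
    return n
}
NR == FNR {                               # pass 1: pick a winner per key
    key = $4 SUBSEP $5
    v = intval($14)
    if (!(key in best) || v > bestv[key] || (v == bestv[key] && $9 > bestd[key])) {
        best[key] = FNR; bestv[key] = v; bestd[key] = $9
    }
    next
}
FNR == best[$4 SUBSEP $5]                 # pass 2: print winners in file order
' file.txt file.txt

The string comparison $9 > bestd[key] is enough for the tie-break, because ISO 8601 timestamps sort chronologically as plain strings.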

Improvements are welcome!


3 Comments

1) Numeric reversed for the one column and reversed for the dates are enough; these are your secondary sort keys. 2) For the duplicates, you can use a combination of fields as the hash key; no need to concat them as an additional field: sort -t, -rnk14 -rk9 | awk -F, '!seen[$4,$5]++'
Thanks! Two questions: 1) I'm not 100% sure about point 1). I still need to split the date (col 9) into year, month, and day, as it's separated by -. Or am I missing something? 2) Simply sorting column 14 numerically does not work either, unless I subset the positions containing the numeric value of interest.
man sort for the kinds of sorting you can use per field. For 9 it is just alphabetical sorting, reversed or not, according to your needs. For 14, I see only two sample lines, but it seems to mix numbers and letters; try -V, -h and/or search for this, it's well documented.
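
Putting the comment's hash-key idea together with the character positions from the answer gives a shorter variant (a sketch; it still assumes the fixed-width N0NNN layout, and like the answer it does not preserve the original row order):

sort -t, -k14.15,14.17nr -k9,9r file.txt | awk -F, '!seen[$4,$5]++'

A single reversed key on the whole of column 9 is enough here, since the ISO 8601 timestamp sorts chronologically as a plain string.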
