
My previous question (here) just got a bit more complicated.

  1. I need to identify duplicate entries in a comma-separated text file, based on the combination of columns 4 and 5 (4: an identifier, 5: a date). Those two columns can simply be concatenated.

  2. If there are duplicates, I need to perform a check based on a substring of another column, which represents an int value. The duplicate with the higher value should be kept. Example for creating the substring: cut -d"," -f14 file.txt | grep -oE "N0[0-9]{3}" | sed 's/N0//g'

  3. If the int values are the same, I need to perform yet another check based on column 9, which represents a date. The row with the later date should be kept.

Example data:

xyz,xyz,xyz,X08ABC,20200804,xyz,xyz,xyz,2020-08-11T09:15:42.000000Z,xyz,xyz,xyz,xyz,long_string_N0808_bla
xyz,xyz,xyz,X08ABC,20200804,xyz,xyz,xyz,2020-08-17T20:08:27.000000Z,xyz,xyz,xyz,xyz,long_string_N0809_bla
xyz,xyz,xyz,X08ABC,20200804,xyz,xyz,xyz,2020-09-25T16:18:43.000000Z,xyz,xyz,xyz,xyz,long_string_N0809_bla

In the example, the third row would be kept while the first two are discarded. All three rows are duplicates (col4+col5 are the same), the int value in col 14 is highest in rows 2 and 3, and the date in col 9 is later in row 3 than in row 2.
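
For reference, here is one way all three comparison keys could be pulled out in a single awk pass (a minimal sketch against the sample rows above; the character class is spelled out instead of using {3} so it also runs on awks without interval support):

awk -F, '{
    key  = $4 "_" $5                      # duplicate key: col 4 + col 5
    match($14, /N0[0-9][0-9][0-9]/)       # locate the N0NNN token in col 14
    num  = substr($14, RSTART + 2, 3)     # the int value, e.g. 809
    date = $9                             # ISO 8601 timestamp in col 9
    print key, num, date
}' file.txt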

If I can keep the order of rows in the original file, that is a plus - but not a must.

Edit: Having an awk solution would be great, if this can be accomplished in a somewhat readable fashion. I assume that's possible, but I get very confused by multiple operations within awk.

1 Answer


It took me quite some time reading through other questions, manuals, etc., but I think this should be a possible solution. It does need sorting though, so the order of rows unfortunately doesn't stay the same. It also relies on the column entries always having the same length, which has a lot of potential for trouble.

sort -t, -k14.15,14.17r -k9.1,9.4r -k9.6,9.7r -k9.9,9.10r file.txt | awk -F, 'BEGIN{OFS=","} {print $0, $4"_"$5}' | awk -F, '!a[$15]++'

This sorts the rows in descending order by the int value extracted from column 14 (the 809 part), then by the date in column 9 (year, month, day). Then a new column that combines col 4 and col 5 is appended, to identify duplicates. The last awk command keeps only the first line per duplicate key - which is also the reason why everything needed to be sorted first.
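
If keeping the original row order matters, a two-pass awk script might do the whole job without sort (a sketch only, assuming the N0NNN token in col 14 as in the sample data): the first pass remembers the winning line number per col4/col5 key, the second pass prints only those lines, in file order.

awk -F, '
function intval(s,    n) {                # pull the 3-digit value after "N0"
    if (match(s, /N0[0-9][0-9][0-9]/))
        n = substr(s, RSTART + 2, 3) + 0
    return n
}
NR == FNR {                               # pass 1: pick a winner per key
    key = $4 SUBSEP $5
    v = intval($14)
    if (!(key in best) || v > bestv[key] || (v == bestv[key] && $9 > bestd[key])) {
        best[key] = FNR; bestv[key] = v; bestd[key] = $9
    }
    next
}
FNR == best[$4 SUBSEP $5]                 # pass 2: print winners in file order
' file.txt file.txt

The string comparison $9 > bestd[key] is enough for the tie-break, because ISO 8601 timestamps sort chronologically as plain strings.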

Improvements are welcome!


3 Comments

1) Numeric reversed for the one column and reversed for the dates are enough; these are your secondary sort keys. 2) For the duplicates, you can use a combination of fields as the hash key; no need to concat them as an additional field: sort -t, -rnk14 -rk9 | awk -F, '!seen[$4,$5]++'
Thanks! Two questions: 1) I'm not 100% sure about point 1). I still need to split the date (col 9) into year, month, and day, as it's separated by -. Or am I missing something? 2) Simply sorting column 14 numerically does not work either, unless I subset the positions containing the numeric value of interest.
man sort for the kinds of sorting you can use per field. For 9 it is just alphabetical sorting, reversed or not, according to your needs. For 14, I see only two sample lines, but it seems to mix numbers and letters; try -V, -h and/or search for this, it's well documented.
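
Putting the comment's hash-key idea together with the character positions from the answer gives a shorter variant (a sketch; it still assumes the fixed-width N0NNN layout, and like the answer it does not preserve the original row order):

sort -t, -k14.15,14.17nr -k9,9r file.txt | awk -F, '!seen[$4,$5]++'

A single reversed key on the whole of column 9 is enough here, since the ISO 8601 timestamp sorts chronologically as a plain string.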
