I was trying to apply the method proposed here {Removing duplicates on a variable without sorting} to remove duplicates in a string using awk when I noticed it was not working as expected.
For example, suppose we have:
s="apple apple tree appleapple tree"
After removing duplicates, we expect the following output:
apple tree appleapple
which should be obtained by applying the following command to the string (complete explanation in the link):
awk 'BEGIN{RS=" "; ORS=" "}{ if(a[$0] == 0){a[$0]+=1; print $0}}' <<< $s
It uses an associative array, so we do not expect the same record to be printed twice. However, following this method I get this:
apple tree appleapple tree
The duplicate apple was removed as desired, but the duplicate tree at the end was not.
In fact, if we print the length of each record, we see that the last record is not tree but tree plus a return character (I suppose so).
$ awk 'BEGIN{RS=" "; ORS=" "}{ print length($0); print $0}' <<< $s
5 apple 5 apple 4 tree 10 appleapple 5 tree
Notice that the last tree is indeed 5 characters long, not 4, which breaks the associative array method.
I do not understand why this extra character is there or where it comes from. How can I solve this issue and still remove duplicates using this method?
Thank you very much for any suggestions.
od -c scriptfile to see if your file has CR+LF line endings, and dos2unix to fix.
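A minimal sketch of one possible workaround for the question above, assuming the extra character is the newline that a here-string (<<<) appends to its input rather than a CR from the script file. With RS=" " that newline is no longer a record separator, so it stays attached to the last record. The sketch feeds the string through printf '%s' instead, which adds no trailing newline. The variable name dedup is only illustrative.

```shell
s="apple apple tree appleapple tree"

# Inspect the bytes awk receives when a newline is appended, as a here-string does:
# the dump ends with "t r e e \n".
printf '%s\n' "$s" | od -c

# Feed the string WITHOUT a trailing newline, so the last record is exactly "tree".
# The awk program is the one from the question, unchanged.
dedup=$(printf '%s' "$s" | awk 'BEGIN{RS=" "; ORS=" "}{ if(a[$0] == 0){a[$0]+=1; print $0}}')

# ORS=" " leaves one trailing space after the last record; strip it.
dedup=${dedup% }

echo "$dedup"
```

If the input really must come from a here-string, another option in awk implementations that treat a multi-character RS as a regular expression (gawk does; POSIX does not require it) is to set RS="[ \n]+" so that the trailing newline is consumed as a separator.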