0

I was trying to apply the method proposed in "Removing duplicates on a variable without sorting" to remove duplicate words from a string using awk, when I noticed it was not working as expected.

For example, suppose we have:

s="apple apple tree appleapple tree"

After removing duplicates, we expect the following output:

apple tree appleapple

which should be obtained by applying the following command to the string (complete explanation in the link):

awk 'BEGIN{RS=" "; ORS=" "}{ if(a[$0] == 0){a[$0]+=1; print $0}}' <<< $s

It uses an associative array, so we do not expect the same record to be printed twice. However, following this method I get this:

 apple tree appleapple tree

The first duplicate (apple) was removed as desired, but not the last one (tree). In fact, if we print the length of each record, we see that the last record is not tree but tree plus a newline character (I suppose).

$ awk 'BEGIN{RS=" "; ORS=" "}{ print length($0); print $0}' <<< $s
5 apple 5 apple 4 tree 10 appleapple 5 tree

Notice that the last tree is indeed 5 characters long and not 4, which breaks the associative array method because it is stored under a different key than the first tree.

I do not understand why this character is there or where it comes from. How can I solve this issue and remove duplicates using this method?

Thank you very much for any suggestions.

2
  • 1
    use od -c scriptfile to see if your file has CR+LF line endings, and dos2unix to fix. Commented Sep 12, 2017 at 21:22
  • Just for once it's not a CR+LF issue, it's simple pilot error. Commented Sep 12, 2017 at 21:37

4 Answers

3

If you don't need to maintain the word order:

$ ( set -f; printf "%s\n" $s | sort -u | paste -sd" " - )
apple appleapple tree

If you do want to keep the order:

$ awk '
    {
        delete seen
        sep=""
        for (i=1; i<=NF; i++) {
            if (!seen[$i]++) {
                printf "%s%s", sep, $i
            }
            sep=OFS
        }
        print ""
    }
' <<<"$s"
apple tree appleapple

3 Comments

Thank you. I will go for the second approach because I am not yet familiar with sort and paste. Using fields, as you and @MarcLambrichs suggest in another answer, seems to avoid this problem. Nevertheless, I still do not understand what is going wrong when using records.
The problem with using records is that once you set RS=" ", the \n at the end of your input is no longer a separator, so it becomes part of the final record, and tree is not the same as tree\n. If you added a blank to the end of your input string and quoted it properly (<<< "$s "), or set RS="[[:space:]]+" instead of RS=" ", it would have worked, though the latter is gawk-specific because it relies on a multi-character RS.
Right, I do understand now. Indeed, I tried adding an extra blank at the end and it worked, but I was not satisfied with that 'solution'. The issue is clear now.
3

As already discussed, setting RS to " " means that \n is no longer the separator between records, so it becomes part of the last record on your input line: "tree\n".
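A minimal sketch of that effect, assuming only POSIX printf and awk. Here printf '%s\n' stands in for the here-string (both append a newline), and the variable names are illustrative only:

```shell
s="apple apple tree appleapple tree"

# printf '%s\n' mimics what <<< "$s" sends to awk: the string plus a newline.
# With RS=" " the last record is then "tree\n", which is 5 characters long.
len_with_nl=$(printf '%s\n' "$s" | awk 'BEGIN{RS=" "} {n=length($0)} END{print n}')

# printf '%s' sends no trailing newline, so the last record is plain "tree".
len_without_nl=$(printf '%s' "$s" | awk 'BEGIN{RS=" "} {n=length($0)} END{print n}')

echo "$len_with_nl $len_without_nl"

# On the newline-free input the record-based idea removes both duplicates.
deduped=$(printf '%s' "$s" |
    awk 'BEGIN{RS=" "} !seen[$0]++ {printf "%s%s", sep, $0; sep=" "} END{print ""}')
echo "$deduped"
```

Feeding the string through printf '%s' is one way to keep RS=" " without appending an artificial blank to the input.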

FWIW if you have GNU awk for multi-char RS you could just do:

awk -v RS='\\s+' '!seen[$0]++{printf "%s%s", (NR>1?OFS:""), $0} END{print ""}'
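If GNU awk is not available, a similar result can probably be had by splitting the words before awk sees them. This is a sketch under the assumption that words are separated by spaces only; it swaps the multi-character RS for tr and paste and is not the command from this answer:

```shell
s="apple apple tree appleapple tree"

# tr turns each run of spaces into a single newline, so awk reads one word
# per line with its default RS; paste then joins the surviving words.
deduped=$(printf '%s\n' "$s" | tr -s ' ' '\n' | awk '!seen[$0]++' | paste -sd ' ' -)
echo "$deduped"
```

Because awk prints each word the first time it is seen, the original word order is kept.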

1 Comment

Perfectly clear after your explanation. No mystery left. Multi-char delimiter for records is needed to use that method.
2

This example shows your suspicion is correct:

$ echo "apple apple tree appleapple tree" | awk 'BEGIN{RS=" "; ORS=" "}
{ printf("%s |%s| ", length($0), $0)}'
5 |apple| 5 |apple| 4 |tree| 10 |appleapple| 5 |tree
|
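The stray character can also be inspected byte by byte with od -c, the tool mentioned in the comments. A small sketch, using the single word tree as hypothetical input:

```shell
# echo appends a newline to its output; od -c displays it as \n.
echo "tree" | od -c

# printf '%s' writes only the four letters, so no \n appears.
printf '%s' "tree" | od -c
```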

I would use FS to get all different values, like this:

$ echo "apple apple tree appleapple tree" | awk '{for (i=1; i<=NF; i++) 
printf "%s %s\n", length($i), $i}'
5 apple
5 apple
4 tree
10 appleapple
4 tree

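And to get rid of duplicates (note that for (i in a) visits the keys in an unspecified order, so the original word order may not be preserved):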

echo "apple apple tree appleapple tree" | awk 'BEGIN{ORS=" "}{for (i=1; 
i<=NF; i++)a[$i]++} END {for (i in a) print i }'
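If the order of first appearance matters, one possible variant of the same field loop records each new word in a second, numerically indexed array. The names seen, order and n are illustrative and not part of the answer above:

```shell
deduped=$(echo "apple apple tree appleapple tree" | awk '{
    for (i = 1; i <= NF; i++)
        if (!($i in seen)) {      # first time this word appears
            seen[$i] = 1
            order[++n] = $i       # remember its position
        }
}
END {
    # Walk the numeric index so words come out in first-seen order.
    for (j = 1; j <= n; j++)
        printf "%s%s", order[j], (j < n ? " " : "\n")
}')
echo "$deduped"
```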

1 Comment

Thanks, yes, using fields instead of records seems to be a better way to achieve this
0

This is what I did for duplicate records:

awk '{if(arr[$1]!="true") print $1; arr[$1]="true"}' file.txt
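That command keeps the first field of each line the first time it appears. A shorter sketch of the same idea, run on hypothetical two-column input in place of file.txt:

```shell
# Hypothetical input standing in for file.txt: a key followed by a value.
firsts=$(printf 'apple 1\ntree 2\napple 3\n' | awk '!seen[$1]++ {print $1}')
echo "$firsts"
```

Since $1 is split on the default FS, which treats blanks and newlines alike, the key never carries a trailing newline.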
