Removing duplicates in bash string using awk

Question

I was trying to apply the method proposed here {Removing duplicates on a variable without sorting} to remove duplicates in a string using awk when I noticed it was not working as expected.

For example, suppose we have:

s="apple apple tree appleapple tree"

After removing duplicates, we expect the following output:

apple tree appleapple

which should be obtained by applying the following command to the string (complete explanation in the link):

awk 'BEGIN{RS=" "; ORS=" "}{ if(a[$0] == 0){a[$0]+=1; print $0}}' <<< $s

It uses an associative array, so we do not expect the same record to be printed twice. However, following this method I get this:

 apple tree appleapple tree

The duplicate apple was removed as desired, but the duplicate tree was not. In fact, if we print the length of each record, we see that the last record is not tree but tree plus a return character (I suppose).

$ awk 'BEGIN{RS=" "; ORS=" "}{ print length($0); print $0}' <<< $s
5 apple 5 apple 4 tree 10 appleapple 5 tree

Notice that the last tree is indeed 5 characters and not 4, which breaks the associative array method.

I do not understand why this character is there or where it comes from. How can I solve this issue and remove duplicates using this method?

Thank you very much for any suggestions.

Use od -c scriptfile to see if your file has CR+LF line endings, and dos2unix to fix.
– glenn jackman, Commented Sep 12, 2017 at 21:22
Just for once it's not a CR+LF issue, it's simple pilot error.
– Ed Morton, Commented Sep 12, 2017 at 21:37

glenn jackman · Accepted Answer · 2017-09-12 21:20:06Z

3

If you don't need to maintain the word order:

$ ( set -f; printf "%s\n" $s | sort -u | paste -sd" " )
apple appleapple tree

If you do want to keep the order:

$ awk '
    {
        delete seen
        sep=""
        for (i=1; i<=NF; i++) {
            if (!seen[$i]++) {
                printf "%s%s", sep, $i
            }
            sep=OFS
        }
        print ""
    }
' <<<"$s"
apple tree appleapple
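
If this is needed in more than one place, the field-based loop above could be wrapped in a small shell function. The following is only a sketch: the function name dedupe_words is made up for illustration, and unlike the command above it does not reset the seen array per line, so it assumes a single-line string.

```shell
# dedupe_words is a hypothetical helper name, not part of any tool.
# It prints the words of its first argument with duplicates removed,
# keeping the first occurrence of each word in its original position.
dedupe_words() {
    printf '%s\n' "$1" | awk '{
        out = ""
        for (i = 1; i <= NF; i++)
            if (!seen[$i]++)                      # true only the first time a word is seen
                out = out (out == "" ? "" : OFS) $i
        print out
    }'
}

s="apple apple tree appleapple tree"
result=$(dedupe_words "$s")
echo "$result"
```

Because awk splits fields on runs of whitespace by default, the trailing newline never ends up inside a word here.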

answered Sep 12, 2017 at 21:20

3 Comments

websealevel Over a year ago

Thank you. I will go for the second answer because I am not yet familiar with sort and paste. Using fields, as you and @MarcLambrichs suggest in another answer, seems to avoid this problem. Nevertheless, I still do not understand what is going wrong when using records.

Ed Morton Over a year ago

The problem with your approach to using records is that when you set RS=" " the \n at the end of your line became part of the final record, and tree is not the same as tree\n. If you added a blank char to the end of your input string and quoted it properly (<<< "$s ") or set RS="[[:space:]]+" instead of RS=" ", it would've worked, though the latter is gawk-specific due to multi-char RS.

websealevel Over a year ago

Right, I do understand now. Indeed I tried adding an extra blank at the end and it was working, but was not satisfied with that 'solution'. The issue is clear now.

Ed Morton · Accepted Answer · 2017-09-13 04:06:43Z

3

As already discussed, setting RS to " " means that \n is no longer the separator between records, so it becomes part of the last record of your input, "tree\n".
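
A quick way to check this, as a sketch: feed the same string to awk once with a trailing newline (which is what a bash here-string delivers) and once without, and compare the length of the last record. The variable names are just for illustration.

```shell
s="apple apple tree appleapple tree"

# printf '%s\n' mimics the here-string: the string followed by a newline.
# With RS=" " the newline stays attached to the last record.
len_with_newline=$(printf '%s\n' "$s" |
    awk 'BEGIN { RS = " " } { n = length($0) } END { print n }')

# Without the trailing newline the last record is just "tree".
len_without_newline=$(printf '%s' "$s" |
    awk 'BEGIN { RS = " " } { n = length($0) } END { print n }')

echo "$len_with_newline $len_without_newline"
```

The first number is one larger than the second, which is the extra character seen in the question.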

FWIW if you have GNU awk for multi-char RS you could just do:

awk -v RS='\\s+' '!seen[$0]++{printf "%s%s", (NR>1?OFS:""), $0} END{print ""}'
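
If GNU awk is not available, one possible workaround (a different technique from the RS approach, shown only as a sketch) is to turn the spaces into newlines first so that awk's default RS applies, then join the result back together:

```shell
s="apple apple tree appleapple tree"

# tr -s turns each run of spaces into a single newline, so every word
# becomes its own line and the default record separator does the rest.
# paste -s joins the surviving lines back with single spaces.
deduped=$(printf '%s\n' "$s" |
    tr -s ' ' '\n' |
    awk '!seen[$0]++' |
    paste -sd ' ' -)

echo "$deduped"
```

This assumes the string has no leading spaces; a leading space would produce an empty first line that is treated as a record of its own.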

edited Sep 13, 2017 at 4:06

answered Sep 12, 2017 at 21:44

1 Comment

websealevel Over a year ago

Perfectly clear after your explanation. No mystery left. Multi-char delimiter for records is needed to use that method.

Marc Lambrichs · Accepted Answer · 2017-09-12 21:27:29Z

2

This example shows your suspicion is correct:

$ echo "apple apple tree appleapple tree" | awk 'BEGIN{RS=" "; ORS=" "}
{ printf("%s |%s| ", length($0), $0)}'
5 |apple| 5 |apple| 4 |tree| 10 |appleapple| 5 |tree
|

I would use FS to get all different values, like this:

$ echo "apple apple tree appleapple tree" | awk '{for (i=1; i<=NF; i++) 
printf "%s %s\n", length($i), $i}'
5 apple
5 apple
4 tree
10 appleapple
4 tree

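And to get rid of duplicates (note that for (i in a) visits the keys in an unspecified order, so the original word order is not guaranteed to be preserved):

$ echo "apple apple tree appleapple tree" | awk 'BEGIN{ORS=" "}{for (i=1; i<=NF; i++) a[$i]++}
END {for (i in a) print i }'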

edited Sep 12, 2017 at 21:27

answered Sep 12, 2017 at 21:18

1 Comment

websealevel Over a year ago

Thanks, yes, using fields instead of records seems to be a better way to achieve this

Prince Bansal · Accepted Answer · 2019-02-08 11:13:23Z

0

This is what I did for duplicate records:

awk '{if(arr[$1]!="true") print $1; arr[$1]="true"}' file.txt
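
The same idea can be written more compactly with the !seen[...]++ idiom used in the other answers. This is a sketch with made-up sample input standing in for file.txt. Note that it deduplicates on the first field of each line, so it suits one-word-per-line input rather than the single-line string from the question.

```shell
# Made-up sample input, one word per line (stands in for file.txt).
unique_words=$(printf '%s\n' apple apple tree appleapple tree |
    awk '!seen[$1]++ { print $1 }')   # print $1 only the first time it appears

echo "$unique_words"
```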

answered Feb 8, 2019 at 11:13
