using awk to remove duplicates

Question

I have been playing with awk and sed. I have a file with the following format

0000098236|Q1.1|one|Q2.1|one|Q3.1|one
0000027965|Q1.5|five|Q1.1|one|Q2.1|one
0000083783|Q1.1|one|Q1.5|five|Q2.1|one
0000027965|Q1.1|one|Q1.1|one|Q1.5|five
0000083983|Q1.1|one|Q1.5|five|Q2.1|one
0000083993|Q1.3|three|Q1.4|four|Q1.2|two

I want to transform each QX.X code into a specific numerical value. I accomplished that with sed:

sed -e "s/\<Q1.1\>/88/g" |
sed -e "s/Q1.2/89/g" |
sed -e "s/Q1.3/90/g" |
sed -e "s/Q1.4/91/g" |
sed -e "s/Q1.5/92/g" |

etc, etc. So far so good. After I do this I get

0000098236|88|one|88|one|88|one
0000027965|92|five|88|one|88|one
0000083783|88|one|92|five|88|one
0000027965|88|one|88|one|92|five
0000083983|88|one|92|five|88|one
0000083993|90|three|91|four|89|two

The delimiter is the pipe. Now I need to remove the duplicate pairs:

I want to always keep the first value
I want to group the rest in pairs, so in the first line above, 88|one is one pair
I want to create a file in which the duplicate pairs have been removed from each line

So the file above should look something like the following after running the transformation:

0000098236|88|one
0000027965|95|five|88|one
0000083783|88|one|92|five
0000027965|88|one|88|one
0000083983|88|one|92|five
0000083993|90|three|91|four|89|two

I tried to use awk and arrays but cannot get it to work.

Can you post your current code? Simpler than starting from scratch.
– yamen, Commented Apr 19, 2012 at 4:15
Possible duplicate of How to remove duplicates entries from a file using shell. If this is a duplicate of your other question, you should, maybe, answer the questions there and improve your question by editing, instead of opening a new one.
– user unknown, Commented Apr 19, 2012 at 4:42
Is the example in line 4 correct? Shouldn't the '92|five' be preserved?
– user unknown, Commented Apr 19, 2012 at 4:45
The example data is screwy. In line 4, a 92|five is removed even though it occurs once, but two occurrences of 88|one are retained. Line 2 has a 95 in the original, but 92 in the filtered.
– Kaz, Commented Apr 19, 2012 at 4:49
You are right, there is an error in the target format; 92|five should be preserved. It should look like this:
0000098236|88|one
0000027965|92|five|88|one
0000083783|88|one|92|five
0000027965|88|one|92|five
0000083983|88|one|92|five
0000083993|90|three|91|four|89|two
– user1339980, Commented Apr 19, 2012 at 13:53

yazu · Accepted Answer · 2012-04-19 07:47:51Z

sed -r ':a s#([0-9]+\|[a-z]+)(.*)\1#\1\2#; ta; s#\|\|+#|#g; s#\|$##' FILE
0000098236|88|one
0000027965|92|five|88|one
0000083783|88|one|92|five
0000027965|88|one|92|five
0000083983|88|one|92|five
0000083993|90|three|91|four|89|two
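
For readers unfamiliar with sed loops, the command above can be read as three steps: a loop (`:a` ... `ta`) that deletes the later copy of any pair occurring twice, a substitution that squeezes the runs of pipes left behind, and one that drops a trailing pipe. The sketch below spells out the same expressions, one per `-e`, assuming a sed that accepts `-E` with backreferences (GNU sed does; `-E` is its synonym for `-r`). The function name `dedupe_with_sed` is made up for illustration.

```shell
# Hypothetical wrapper around the three steps; reads already-converted lines on stdin.
# Step 1 (:a ... ta): while some pair occurs twice, delete the later copy.
# Step 2: squeeze the "||" runs that step 1 leaves behind into a single "|".
# Step 3: drop a trailing "|" left when the deleted copy was the last pair.
dedupe_with_sed() {
  sed -E \
    -e ':a' \
    -e 's#([0-9]+\|[a-z]+)(.*)\1#\1\2#' \
    -e 'ta' \
    -e 's#\|\|+#|#g' \
    -e 's#\|$##'
}

printf '%s\n' '0000027965|88|one|88|one|92|five' | dedupe_with_sed
```

Running only step 1 on that sample line should leave `0000027965|88|one||92|five`, which is why the two clean-up substitutions are needed.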

2 Comments

user1339980 Over a year ago

I am embarrassed to post this code, given that it does not work and I am sure it is not elegant, but here it goes:

awk ' {
    delete p;
    n = split($0, a, "|");
    printf("%s", a[1]);
    for (i=2;i<=n;i++) {
        if (a[i+2]==a[i+4]) {
            printf("|%s",a[i]); printf("|%s",a[i+1]); printf("|%s",a[i+2]); printf("|%s",a[i+3]);
            i=i+5;
            break;
        }
        if (a[i]==a[i+2] && a[i]==a[i+4]) {
            printf("|%s", a[i]); printf("|%s",a[i+1]);
            i=i+5;
            break;
        }

user1339980 Over a year ago

        if (a[i]==a[i+4]) {
            printf("|%s",a[i]); printf("|%s",a[i+1]); printf("|%s",a[i+2]); printf("|%s",a[i+3]);
            i=i+5;
            break;
        }
    }
    printf "\n";
} ' $TMP_FILE

Dennis Williamson · Accepted Answer · 2012-04-25 04:47:48Z

This eliminates the need for preprocessing. It assumes that the digit after the decimal point is what is significant for selecting the replacement.

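awk '
BEGIN {
    # rep[n] is the replacement for a code ending in ".n": rep[1] = 88 ... rep[5] = 92
    r = "88 89 90 91 92";
    split(r, rep);
    FS = OFS = "|"
}
{
    delete seen;                # forget what was seen on the previous line
    cf = i = 2;                 # cf: field being read, i: field being written
    while (cf < NF) {
        split($cf, a, ".");     # "Q1.5" gives a[1] = "Q1", a[2] = "5"
        newval = rep[a[2]];
        if (!seen[newval]) {
            # first occurrence on this line: copy the pair to the write position
            $i = newval;
            $(i + 1) = $(cf + 1);
            seen[newval] = 1;
            nf = i + 1;         # index of the last field kept so far
            i += 2;
        };
        cf += 2
    };
    NF = nf;                    # truncate the record to the fields that were kept
    print
}' inputfile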
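
If the file has already been through the sed preprocessing from the question, the same seen-array idea can be applied to the pairs directly. What follows is a minimal sketch, assuming every line is an ID followed by complete code|label pairs. The function name `dedupe_pairs` is hypothetical, and unlike the script above it keys on the whole pair rather than on the code alone.

```shell
# Hypothetical helper: reads already-converted lines on stdin, prints deduplicated lines.
dedupe_pairs() {
  awk '
    BEGIN { FS = OFS = "|" }
    {
      split("", seen)             # portable way to empty the array for each line
      out = $1                    # the leading ID is always kept
      for (i = 2; i < NF; i += 2) {
        key = $i FS $(i + 1)      # a pair is "code|label"
        if (!(key in seen)) {     # keep only the first occurrence of each pair
          seen[key] = 1
          out = out OFS key
        }
      }
      print out
    }'
}

printf '%s\n' \
  '0000098236|88|one|88|one|88|one' \
  '0000027965|88|one|88|one|92|five' |
  dedupe_pairs
```

Comparing whole fields also sidesteps a partial-match risk of backreference-based sed approaches, where a pair such as 2|one can match inside 12|one.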

Kaz · Accepted Answer · 2012-04-19 06:59:18Z

TXR:

@(do (defun rem-dupes (pairs : recur)
       (if (null pairs) 
         nil
         (let ((front (first pairs))
               (tail (rem-dupes (rest pairs) t)))
           (if (memqual front tail)
             (if recur
               (remqual front tail)
               (cons front (remqual front tail)))
             (cons (first pairs) tail))))))
@(collect :vars nil)
@(freeform 1)
@id|@(coll)@left|@right@/[|\n]/@(end)
@(bind pairs @(rem-dupes [mapcar list left right]))
@(set left @[mapcar first pairs])
@(set right @[mapcar second pairs])
@(output)
@id@(rep)|@left|@right@(end)
@(end)
@(end)

Run:

$ txr data.txr data.txt
0000098236|88|one
0000027965|92|five
0000083783|88|one|92|five
0000027965|88|one|92|five
0000083983|88|one|92|five
0000083993|90|three|91|four|89|two

potong · Accepted Answer · 2012-04-19 12:15:56Z

This might work for you:

sed ':a;s/\(\([0-9]*|[^|]*\).*\)|\2/\1/;ta' file
0000098236|88|one
0000027965|92|five|88|one
0000083783|88|one|92|five
0000027965|88|one|92|five
0000083983|88|one|92|five
0000083993|90|three|91|four|89|two

In fact all the file processing can be achieved using one instance of sed:

cat <<\! >file.sed
> 1{x;s/$/.1|88.2|89.3|90.4|91.5|92/;x}  # stuff lookup into hold space .key|value
> s/|Q[^.]*/|/g                          # guessing here - remove Q and number prefix
> :a;s/\(\(\.[^|]*|[^|]*\).*\)|\2/\1/;ta # remove duplicate fields
> G                                      # append newline and lookup table
> :b;s/\(\.[^|]*\)\(.*\n.*\)\1|\([^.]*\)/\3\2/;tb # replace key with value from lookup
> s/\n.*//                               # remove lookup table
> !
sed -f file.sed original_file
0000098236|88|one
0000027965|92|five|88|one
0000083783|88|one|92|five
0000027965|88|one|92|five
0000083983|88|one|92|five
0000083993|90|three|91|four|89|two
