I have been playing with awk and sed. I have a file with the following format:

0000098236|Q1.1|one|Q2.1|one|Q3.1|one
0000027965|Q1.5|five|Q1.1|one|Q2.1|one
0000083783|Q1.1|one|Q1.5|five|Q2.1|one
0000027965|Q1.1|one|Q1.1|one|Q1.5|five
0000083983|Q1.1|one|Q1.5|five|Q2.1|one
0000083993|Q1.3|three|Q1.4|four|Q1.2|two

I want to transform each QX.X code into a specific numerical value. I accomplished that with sed:

sed -e "s/\<Q1\.1\>/88/g" |
sed -e "s/Q1\.2/89/g" |
sed -e "s/Q1\.3/90/g" |
sed -e "s/Q1\.4/91/g" |
sed -e "s/Q1\.5/92/g" |

and so on. So far so good. After I do this I get:

0000098236|88|one|88|one|88|one
0000027965|92|five|88|one|88|one
0000083783|88|one|92|five|88|one
0000027965|88|one|88|one|92|five
0000083983|88|one|92|five|88|one
0000083993|90|three|91|four|89|two
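
The chain of sed processes above could probably be collapsed into a single invocation. The sketch below assumes, as the sample output suggests, that only the digit after the dot selects the replacement (Q1.1, Q2.1 and Q3.1 all become 88). The name `map_codes` is hypothetical, and the real mapping may have more entries than the five shown.

```shell
# Hypothetical helper: convert every QX.N code in one sed process.
# Assumptions: codes and text values alternate, a code is always followed
# by a pipe, and only the digit after the dot matters.
map_codes() {
    sed -E \
        -e 's/\|Q[0-9]+\.1\|/|88|/g' \
        -e 's/\|Q[0-9]+\.2\|/|89|/g' \
        -e 's/\|Q[0-9]+\.3\|/|90|/g' \
        -e 's/\|Q[0-9]+\.4\|/|91|/g' \
        -e 's/\|Q[0-9]+\.5\|/|92|/g' \
        "$@"
}

# Usage: map_codes original_file > transformed_file
```

Matching the surrounding pipes keeps a code such as Q1.10 from being mistaken for Q1.1, and the escaped dot no longer matches an arbitrary character.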

The delimiter is the pipe. Now I need to remove the duplicate pairs:

  1. I always want to keep the first value.
  2. I want to group the rest in pairs, so in the first line above, 88|one is one pair.
  3. I want to create a file in which the duplicate pairs have been removed from each line.

So the file above should look something like the following after running the transformation:

0000098236|88|one
0000027965|95|five|88|one
0000083783|88|one|92|five
0000027965|88|one|88|one
0000083983|88|one|92|five
0000083993|90|three|91|four|89|two

I tried to use awk and arrays but cannot get it to work.

  • Can you post your current code? Simpler than starting from scratch. Commented Apr 19, 2012 at 4:15
  • possible duplicate of How to remove duplicates entries from a file using shell - If this is a duplicate of your other question: You should, maybe, answer the questions there and improve your question by editing, instead of opening a new one. Commented Apr 19, 2012 at 4:42
  • Is the example in line 4 correct? Shouldn't the '92|five' be preserved? Commented Apr 19, 2012 at 4:45
  • The example data is screwy. In line 4, a 92|five is removed even though it occurs once, but two occurrences of 88|one are retained. Line 2 has a 95 in the original, but 92 in the filtered. Commented Apr 19, 2012 at 4:49
  • You are right, there is an error in the target format: 92|five should be preserved. It should look like this:
    0000098236|88|one
    0000027965|92|five|88|one
    0000083783|88|one|92|five
    0000027965|88|one|92|five
    0000083983|88|one|92|five
    0000083993|90|three|91|four|89|two
    Commented Apr 19, 2012 at 13:53

4 Answers

sed -r ':a s#([0-9]+\|[a-z]+)(.*)\1#\1\2#; ta; s#\|\|+#|#g; s#\|$##' FILE
0000098236|88|one
0000027965|92|five|88|one
0000083783|88|one|92|five
0000027965|88|one|92|five
0000083983|88|one|92|five
0000083993|90|three|91|four|89|two
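
One thing to watch with the pattern above: the pair is not anchored to a field boundary, so a line such as 1|188|one|88|one would lose its last pair, because 88|one also matches inside 188|one. A hedged variant that anchors each pair on its leading pipe might look like the sketch below. The name `strip_repeated_pairs` is hypothetical, and it assumes a sed that accepts -E with back-references (GNU sed does).

```shell
# Hypothetical helper: remove later repeats of a |number|word pair,
# matching whole fields only.
#   \1      the pair, including its leading pipe
#   \2      whatever lies between the two copies (empty, or starting with a pipe)
#   \4      the pipe or end of line that follows the repeated copy
strip_repeated_pairs() {
    sed -E \
        -e ':a' \
        -e 's#(\|[0-9]+\|[a-z]+)((\|.*)?)\1(\||$)#\1\2\4#' \
        -e 'ta' \
        "$@"
}

# Usage: strip_repeated_pairs FILE
```

Because the repeated copy is removed together with its leading pipe, no doubled or trailing pipes are left behind, so the two cleanup substitutions are not needed in this variant.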

2 Comments

I am embarrassed to post this code, given that it does not work and I am sure it is not elegant, but here it goes:

awk '
{
    delete p;
    n = split($0, a, "|");
    printf("%s", a[1]);
    for (i=2;i<=n;i++) {
        if (a[i+2]==a[i+4]) { printf("|%s",a[i]); printf("|%s",a[i+1]); printf("|%s",a[i+2]); printf("|%s",a[i+3]); i=i+5; break; }
        if (a[i]==a[i+2] && a[i]==a[i+4]) { printf("|%s", a[i]); printf("|%s",a[i+1]); i=i+5; break; }
        if (a[i]==a[i+4]) { printf("|%s",a[i]); printf("|%s",a[i+1]); printf("|%s",a[i+2]); printf("|%s",a[i+3]); i=i+5; break; }
    }
    printf "\n";
}
' $TMP_FILE

This eliminates the need for preprocessing. It assumes that the digit after the decimal point is what is significant for selecting the replacement.

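awk '
BEGIN {
    # rep[N] is the replacement for a code ending in .N (rep[1] = 88 ... rep[5] = 92)
    r = "88 89 90 91 92";
    split(r, rep);
    FS = OFS = "|"
}
{
    delete seen;            # forget the values seen on the previous line
    nf = 1;                 # keep at least the first field if the line has no pairs
    cf = i = 2;             # cf: field being read, i: field being written
    while (cf < NF) {
        split($cf, a, ".");         # "Q1.5" gives a[2] = "5"
        newval = rep[a[2]];
        if (!seen[newval]) {
            # first occurrence: copy the pair to the write position
            $i = newval;
            $(i + 1) = $(cf + 1)
            seen[newval] = 1;
            nf = i + 1;
            i += 2;
        };
        cf += 2
    };
    NF = nf;                # truncate the record to the pairs that were kept
    print
}' inputfile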
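
If the codes have already been converted, as in the transformed file shown in the question, the same seen-array idea can be keyed on the whole pair instead of on the number alone. This is only a sketch under that assumption; `dedupe_pairs` is a hypothetical name.

```shell
# Hypothetical helper: keep the first field, then keep each number|text
# pair only the first time it appears on a line.
dedupe_pairs() {
    awk '
    BEGIN { FS = OFS = "|" }
    {
        split("", seen)                 # portable way to empty the array
        out = $1                        # the first value is always kept
        for (i = 2; i < NF; i += 2) {
            key = $i FS $(i + 1)        # the pair, e.g. "88|one"
            if (!(key in seen)) {
                seen[key] = 1
                out = out OFS key
            }
        }
        print out
    }' "$@"
}

# Usage: dedupe_pairs transformed_file
```

Keying on the pair rather than on the number means that two pairs sharing a number but carrying different text would both be kept, which seems closer to the wording of the question.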

TXR:

@(do (defun rem-dupes (pairs : recur)
       (if (null pairs) 
         nil
         (let ((front (first pairs))
               (tail (rem-dupes (rest pairs) t)))
           (if (memqual front tail)
             (if recur
               (remqual front tail)
               (cons front (remqual front tail)))
             (cons (first pairs) tail))))))
@(collect :vars nil)
@(freeform 1)
@id|@(coll)@left|@right@/[|\n]/@(end)
@(bind pairs @(rem-dupes [mapcar list left right]))
@(set left @[mapcar first pairs])
@(set right @[mapcar second pairs])
@(output)
@id@(rep)|@left|@right@(end)
@(end)
@(end)

Run:

$ txr data.txr data.txt
0000098236|88|one
0000027965|92|five
0000083783|88|one|92|five
0000027965|88|one|92|five
0000083983|88|one|92|five
0000083993|90|three|91|four|89|two

This might work for you:

sed ':a;s/\(\([0-9]*|[^|]*\).*\)|\2/\1/;ta' file
0000098236|88|one
0000027965|92|five|88|one
0000083783|88|one|92|five
0000027965|88|one|92|five
0000083983|88|one|92|five
0000083993|90|three|91|four|89|two

In fact all the file processing can be achieved using one instance of sed:

cat <<\! >file.sed
> 1{x;s/$/.1|88.2|89.3|90.4|91.5|92/;x}  # stuff lookup into hold space .key|value
> s/|Q[^.]*/|/g                          # guessing here - remove Q and number prefix
> :a;s/\(\(\.[^|]*|[^|]*\).*\)|\2/\1/;ta # remove duplicate fields
> G                                      # append newline and lookup table
> :b;s/\(\.[^|]*\)\(.*\n.*\)\1|\([^.]*\)/\3\2/;tb # replace key with value from lookup
> s/\n.*//                               # remove lookup table
> !
sed -f file.sed original_file
0000098236|88|one
0000027965|92|five|88|one
0000083783|88|one|92|five
0000027965|88|one|92|five
0000083983|88|one|92|five
0000083993|90|three|91|four|89|two
