2

I want to remove duplicate rows from a CSV and concatenate the values of specific column (in this case, column2).

Input

ID column2 column3 column4, etc....
1  a       test3   test4
1  r       test3   test4
1  c       test3   test4
2  r       test3   test4
2  o       test3   test4
3  a       test3   test4
4  b       test3   test4
4  c       test3   test4
4  e       test3   test4

Expected result

ID column2 column3 column4, etc....
1  a|r|c   test3   test4
2  r|o     test3   test4
3  a       test3   test4
4  b|c|e   test3   test4

Is it possible with awk?

2
  • 1
    Are column 3 and 4 constant? Commented Nov 29, 2017 at 23:34
  • @glennjackman There is more than 4 columns. And yes, they are constant. Commented Nov 29, 2017 at 23:44

2 Answers 2

1

With awk, for variable column, for the general case where all the other columns may change.

awk -v col=2 -v OFS="\t" '{
    temp=$col
    $col=""
    a[$0]=a[$0]? a[$0] "|" temp: temp
}
END {for (i in a) {
        split(i, b)
        for (j=1; j<=length(b); j++) {
            if (j==col) printf a[i] OFS
            printf b[j] OFS
        }
        printf ORS
    }
}' file |sort -n |column -t

This uses an associative array with the line excluding $col as index and append to it the value of $col.

At the END we take care to put $col back to its place while printing, by splitting fields to another array.

The order of the output is undetermined and you can pipe it to sort for any sorting type per any field. And to column -t if you need so.

Sign up to request clarification or add additional context in comments.

2 Comments

the values in other columns can be text (more than one word).
Then you have to use a delimiter other than tab, when you get your csv data and then set the awk field separator accordingly. This FS should be also used inside the split function above. I don't see why you get tab delimited file if you have plain text as fields.
0

This might work for you (GNU sed & column):

sed -r '1b;:a;$!N;s/^(\s*\S+\s)(\S+)\s*(\S+\s*\S+\s*)(.*)n\1(\S+)\s*\3/\1\2|\5 \3\4/;ta;P;D' file | column -t

Pattern match on all lines except the first and then format the expected result using back references and the column command.

N.B. The first field is stripped of its white space.

1 Comment

There is more than 4 columns. And no white space in the first one. Thanks.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.