How to remove duplicate rows and concatenate values from one column in CSV

Question

I want to remove duplicate rows from a CSV and concatenate the values of a specific column (in this case, column2).

Input

ID column2 column3 column4, etc....
1  a       test3   test4
1  r       test3   test4
1  c       test3   test4
2  r       test3   test4
2  o       test3   test4
3  a       test3   test4
4  b       test3   test4
4  c       test3   test4
4  e       test3   test4

Expected result

ID column2 column3 column4, etc....
1  a|r|c   test3   test4
2  r|o     test3   test4
3  a       test3   test4
4  b|c|e   test3   test4

Is it possible with awk?

@glennjackman There are more than 4 columns. And yes, they are constant.
– Kiwop, Commented Nov 29, 2017 at 23:44

thanasisp · Accepted Answer · 2017-11-30 00:57:44Z

1

With awk, with the column number passed in as a variable (col), for the general case where any of the other columns may vary:

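awk -v col=2 -v OFS="\t" '{
    temp = $col
    $col = ""                       # blank the target column; $0 is rebuilt with OFS
    # compare against "" so that a first value of 0 is not lost
    a[$0] = (a[$0] == "" ? temp : a[$0] "|" temp)
}
END {
    for (i in a) {
        n = split(i, b)             # default FS: the blanked column collapses away
        for (j = 1; j <= n; j++) {
            if (j == col) printf "%s%s", a[i], OFS
            printf "%s%s", b[j], OFS
        }
        if (col > n) printf "%s", a[i]   # col was the last column
        printf "%s", ORS
    }
}' file | sort -n | column -t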

This uses an associative array indexed by the line with $col blanked out, and appends each value of $col to the entry for that index.

In the END block, $col is put back in its place while printing, by splitting the index into fields in another array.

The order of the output is undetermined, so pipe it to sort to order it by whichever field you need, and to column -t if you want aligned columns.
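If the original row order matters, one alternative to sorting afterwards is to record each key the first time it is seen. The following is only a sketch under some assumptions that are not in the answer above: the first line is a header, fields are whitespace-separated single words, and merge_col is a hypothetical helper name chosen for illustration.

```shell
# Sample input mirroring the question (whitespace-separated, header on line 1).
input='ID column2 column3 column4
1 a test3 test4
1 r test3 test4
1 c test3 test4
2 r test3 test4
2 o test3 test4
3 a test3 test4
4 b test3 test4
4 c test3 test4
4 e test3 test4'

# merge_col is a hypothetical helper: merge_col COLUMN_NUMBER < input
merge_col() {
    awk -v col="$1" -v OFS='\t' '
    NR == 1 { $1 = $1; print; next }   # header: re-join with OFS and pass through
    {
        val = $col
        $col = ""                      # key = the row without the merged column
        key = $0
        if (key in merged) {
            merged[key] = merged[key] "|" val
        } else {
            order[++n] = key           # remember first-seen order
            merged[key] = val
        }
    }
    END {
        for (k = 1; k <= n; k++) {
            nf = split(order[k], f, OFS)   # splitting on tab keeps the empty slot
            f[col] = merged[order[k]]      # put the merged values back in place
            line = f[1]
            for (j = 2; j <= nf; j++) line = line OFS f[j]
            print line
        }
    }'
}

result=$(printf '%s\n' "$input" | merge_col 2)
printf '%s\n' "$result"
```

Because the key is split on the tab that awk inserted as OFS, the blanked column keeps its position, so this variant does not need a special case when col is the last column. The output is tab-separated; pipe it to column -t if aligned columns are wanted.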


2 Comments

Kiwop Over a year ago

The values in other columns can be text (more than one word).

thanasisp Over a year ago

Then you have to use a delimiter other than tab when you get your CSV data, and set the awk field separator accordingly. The same FS should also be used inside the split function above. I don't see why you would get a tab-delimited file if you have plain text as fields.
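As a rough illustration of that suggestion, the sketch below assumes the data is comma-separated, has a header line, and contains no quoted fields with embedded commas (plain awk does not parse those). The sample rows are made up for the example. The same separator is used for FS, for OFS, and inside split.

```shell
# Hypothetical sample: comma-separated, multi-word fields, no quoted commas.
csv='ID,column2,column3,column4
1,a,some text,more text
1,r,some text,more text
2,o,other text,more text'

csv_result=$(printf '%s\n' "$csv" | awk -v col=2 -F',' -v OFS=',' '
    NR == 1 { print; next }            # assumed header line, passed through
    {
        val = $col
        $col = ""                      # $0 is rebuilt with "," as OFS
        if ($0 in merged) {
            merged[$0] = merged[$0] "|" val
        } else {
            order[++n] = $0            # keep first-seen order
            merged[$0] = val
        }
    }
    END {
        for (k = 1; k <= n; k++) {
            nf = split(order[k], f, FS)    # same separator as the input
            f[col] = merged[order[k]]
            line = f[1]
            for (j = 2; j <= nf; j++) line = line OFS f[j]
            print line
        }
    }')
printf '%s\n' "$csv_result"
```

With a single-character separator such as the comma, multi-word values like "some text" stay inside one field, which is what the whitespace-splitting version cannot guarantee.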

potong · Accepted Answer · 2017-11-30 11:05:47Z

0

This might work for you (GNU sed & column):

sed -r '1b;:a;$!N;s/^(\s*\S+\s)(\S+)\s*(\S+\s*\S+\s*)(.*)\n\1(\S+)\s*\3/\1\2|\5 \3\4/;ta;P;D' file | column -t

Pattern match on all lines except the first and then format the expected result using back references and the column command.

N.B. The first field is stripped of its white space.

edited Nov 30, 2017 at 11:05

answered Nov 29, 2017 at 23:10


1 Comment

Kiwop Over a year ago

There are more than 4 columns. And no white space in the first one. Thanks.
