Merge only specific columns in CSV if first column matches using bash

Question

I have the following delimited content in a text file:

col1|col2|col3|col4|col5|col6
id1|2314|jack|nov-12|water|3294 
id2|8322|john|dec-01|sand|2334
id1|2314|jill|nov-12|oil|3294
id1|2314|jim|nov-12|ether|3294
id3|6775|mike|jan-13|dust|9348

I would like to merge the contents of the 3rd and 5th columns where the 1st column matches. The output should look like:

col1|col2|col3-1|col3-2|col3-3|col4|col5-1|col5-2|col5-3|col6
id1|2314|jack|jill|jim|nov-12|water|oil|ether|3294 
id2|8322|john|||dec-01|sand|||2334
id3|6775|mike|||jan-13|dust|||9348

It doesn't matter if the order of rows and columns in the output is different from the input.

Edit 1: There can be a maximum of 5 merges, anything thereafter should be added to the 5th column with a comma, e.g.

col1|col2|col3-1|col3-2|col3-3|col3-4|col3-5|col4|col5-1|col5-2|col5-3|col5-4|col5-5|col6
id1|2314|jack|jill|jim|val3-4|val3-5,val3-6|nov12|water|oil|ether|val5-4|val5-5,val5-6|3294

Edit 2: As a side note, the actual file contains 14 columns and the merge is required for columns 9 and 13. I was able to adapt @Allan's answer below to do the needful. Also, as I mentioned in a comment to @RavinderSingh13's answer, the output is automatically processed by a cron job, so the number of columns post merge needs to be fixed at 5 each.

Can you merge maximum 3 elements with the same col1 id? or can you imagine something like col1|col2|col3-1|col3-2|col3-3|col3-4|col3-5|col4|col5-1|col5-2|col5-3|col5-4|col5-5|col6 — Allan
– Allan, Commented Dec 14, 2018 at 3:31
and for the other lines how do you want us to display them when there are not 5 merge operations taking place? Should something like that be ok? id3|6775|mike|||,|jan-13|dust|||,|9348 — Allan
– Allan, Commented Dec 14, 2018 at 4:09
That would be great. If we can avoid the commas where there are no values that would be preferred, but it's not a deal breaker. It should be trivial to replace the |,| with || using sed. — Jim Walker
– Jim Walker, Commented Dec 14, 2018 at 4:31

RavinderSingh13 · Accepted Answer · 2018-12-14 04:15:04Z

2

Could you please try following, this solution will of fields line by line(NOT 3 number of columns hard coded), also it will enter maximum number of column headings in col3 and col5.

awk -F'|' '
FNR==NR{
  b[$1]=$1 in a?b[$1] FS $3:$3
  c[$1]=$1 in a?c[$1] FS $5:$5
  num1=split(b[$1],array1,"|")
  num2=split(c[$1],array2,"|")
  a[$1]=$1
  num=num1>num2?num1:num2>prev?num2:prev
  prev=num
  next
}
FNR==1{
  $3=$5=""
  while(++count<=num){
    $3=$3 OFS "col3-"count
  }
  $5=$3
  gsub("col3","col5",$5)
  print
  count=""
  next
}
!d[$1]++ && FNR>1{
  num1=split(b[$1],array1,"|")
  num2=split(c[$1],array2,"|")
  while(num1++<=num){
    b[$1]=b[$1] OFS
  }
  while(num2++<=num){
    c[$1]=c[$1] OFS
  }
  $3=b[$1]
  $5=c[$1]
  print
}'  Input_file OFS="|"  Input_file

answered Dec 14, 2018 at 4:15

RavinderSingh13

135k14 gold badges61 silver badges100 bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

Allan Over a year ago

great!!! ;-) ;-) way better than mine hardcoded solutions that I want to adapt after lunch xD

Allan Over a year ago

But the questions has been updated and we have to add some ',' after the 3rd column...

RavinderSingh13 Over a year ago

Sorry I think OP has changed it later, riding bike as of now will check on it.

Jim Walker Over a year ago

I'm going to go with @Allan's answer since I need the number of columns to be fixed (these files are received in a pipeline and are pushed for automatic processing post cleanup). But your answer is excellent work, and I'm bookmarking it for future reference. I received a bunch of syntax errors which I plan to look into shortly. Thanks.

Allan Over a year ago

And now enjoy the beauty of awk with my new answer ;-)

Allan · Accepted Answer · 2018-12-14 07:56:06Z

First element of answer (when sized was fixed to 3 and ugly solution):

awk 'BEGIN{FS=OFS="|"; print "col1|col2|col3-1|col3-2|col3-3|col4|col5-1|col5-2|col5-3|col6"}NR>1{col2[$1]=$2;col4[$1]=$4;col6[$1]=$6;if(length(col3[$1])==0){col3[$1]=$3}else{col3[$1]=col3[$1]"|"$3}if(length(col5[$1])==0){col5[$1]=$5}else{col5[$1]=col5[$1]"|"$5}}END{n=asorti(col3,oArray);for(i=1; i<=n;i++){if(index(col3[oArray[i]],"|")==0){col3[oArray[i]]=col3[oArray[i]]"||";col5[oArray[i]]=col5[oArray[i]]"||";};print oArray[i],col2[oArray[i]],col3[oArray[i]],col4[oArray[i]],col5[oArray[i]],col6[oArray[i]]}}' csvToMerge.in 
col1|col2|col3-1|col3-2|col3-3|col4|col5-1|col5-2|col5-3|col6
id1|2314|jack|jill|jim|nov-12|water|oil|ether|3294
id2|8322|john|||dec-01|sand|||2334
id3|6775|mike|||jan-13|dust|||9348

more readable:

    $ cat awkprof.out
    # gawk profile, created Fri Dec 14 13:12:34 2018

    # BEGIN rule(s)

    BEGIN {
 1          FS = OFS = "|"
 1          print "col1|col2|col3-1|col3-2|col3-3|col4|col5-1|col5-2|col5-3|col6"
    }

    # Rule(s)

 6  NR > 1 { # 5
 5          col2[$1] = $2
 5          col4[$1] = $4
 5          col6[$1] = $6
 5          if (length(col3[$1]) == 0) { # 3
 3                  col3[$1] = $3
 2          } else {
 2                  col3[$1] = col3[$1] "|" $3
            }
 5          if (length(col5[$1]) == 0) { # 3
 3                  col5[$1] = $5
 2          } else {
 2                  col5[$1] = col5[$1] "|" $5
            }
    }

    # END rule(s)

    END {
 1          n = asorti(col3, oArray)
 3          for (i = 1; i <= n; i++) {
 3                  if (index(col3[oArray[i]], "|") == 0) { # 2
 2                          col3[oArray[i]] = col3[oArray[i]] "||"
 2                          col5[oArray[i]] = col5[oArray[i]] "||"
                    }
 3                  print oArray[i], col2[oArray[i]], col3[oArray[i]], col4[oArray[i]], col5[oArray[i]], col6[oArray[i]]
            }
    }

BEAUTIFULL SOLUTION

Dynamically construct col3 and col5 by computing the max number of identical occurrences of elements from col1

script csvmerge.awk

#function definitions

#function used to add the "|" at the end of col3, col5 when the element does not reach MAX number of occurences
function paddingfunction(MAX,input){
        output=input;
        gsub(/[^|]/,"",output);
        l=length(output);
        tmp=""
        for(u=l; u<MAX-1;u++)
        {
                tmp=tmp OFS;
        }
        return input""tmp;
}

#function used to generate nice header
function headerAppender(inputString){
        tmp=inputString;
        for(i=1;i<=MAX;i++){
                printf tmp""i OFS
        }
}

BEGIN{
        #Generate the header line
        FS=OFS="|";
        printf "col1" OFS "col2" OFS;
        headerAppender("col3-");
        printf "col4" OFS; headerAppender("col5-");
        print "col6"
}

NR>1{
        #save all the cells and concat the cells when col1 is the same
        col2[$1]=$2;
        col4[$1]=$4;
        col6[$1]=$6;
        if(length(col3[$1])==0){
                col3[$1]=$3
        }
        else{
                col3[$1]=col3[$1] OFS $3
        }
        if(length(col5[$1])==0){
                col5[$1]=$5
        }
        else{
                col5[$1]=col5[$1] OFS $5
        }
}

END{
        #sort the array
        n=asorti(col3,oArray);
        #print the cells
        for(i=1; i<=n;i++){
                print oArray[i],col2[oArray[i]],paddingfunction(MAX,col3[oArray[i]]),col4[oArray[i]],paddingfunction(MAX,col5[oArray[i]]),col6[oArray[i]];
        }
}

input1: (6 elements to group)

$ cat csvToMerge.in 
col1|col2|col3|col4|col5|col6
id1|2314|jack|nov-12|water|3294 
id2|8322|john|dec-01|sand|2334
id1|2314|jill|nov-12|oil|3294
id1|2314|jim|nov-12|ether|3294
id3|6775|mike|jan-13|dust|9348
id4|6776|mik1|jan-14|dast|9344
id4|6776|mik2|jan-14|dest|9344
id4|6776|mik3|jan-14|dist|9344
id4|6776|mik4|jan-14|dost|9344
id4|6776|mik5|jan-14|dst|9344
id4|6776|mik6|jan-14|dut|9344

input2: (5 elements to group)

$ cat csvToMerge2.in 
col1|col2|col3|col4|col5|col6
id1|2314|jack|nov-12|water|3294 
id2|8322|john|dec-01|sand|2334
id1|2314|jill|nov-12|oil|3294
id1|2314|jim|nov-12|ether|3294
id3|6775|mike|jan-13|dust|9348
id4|6776|mik1|jan-14|dast|9344
id4|6776|mik2|jan-14|dest|9344
id4|6776|mik3|jan-14|dist|9344
id4|6776|mik4|jan-14|dost|9344
id4|6776|mik5|jan-14|dst|9344

output 1:

$ awk -f csvmerge.awk -v MAX=`awk -F'|' ' {tot[$1]++}END{tmp=""; for (i in tot){if(tot[i]>tmp){tmp=tot[i]}}; print tmp; } ' csvToMerge.in` csvToMerge.in 
col1|col2|col3-1|col3-2|col3-3|col3-4|col3-5|col3-6|col4|col5-1|col5-2|col5-3|col5-4|col5-5|col5-6|col6
id1|2314|jack|jill|jim||||nov-12|water|oil|ether||||3294
id2|8322|john||||||dec-01|sand||||||2334
id3|6775|mike||||||jan-13|dust||||||9348
id4|6776|mik1|mik2|mik3|mik4|mik5|mik6|jan-14|dast|dest|dist|dost|dst|dut|9344

output 2:

$ awk -f csvmerge.awk -v MAX=`awk -F'|' ' {tot[$1]++}END{tmp=""; for (i in tot){if(tot[i]>tmp){tmp=tot[i]}}; print tmp; } ' csvToMerge2.in` csvToMerge2.in 
col1|col2|col3-1|col3-2|col3-3|col3-4|col3-5|col4|col5-1|col5-2|col5-3|col5-4|col5-5|col6
id1|2314|jack|jill|jim|||nov-12|water|oil|ether|||3294
id2|8322|john|||||dec-01|sand|||||2334
id3|6775|mike|||||jan-13|dust|||||9348
id4|6776|mik1|mik2|mik3|mik4|mik5|jan-14|dast|dest|dist|dost|dst|9344

Notes:

-v MAX=`awk -F'|' ' {tot[$1]++}END{tmp=""; for (i in tot){if(tot[i]>tmp){tmp=tot[i]}}; print tmp; } ' csvToMerge.in`

This will save in variable MAX the maximum number of occurrences to group , in your case max is 5 but you can imagine other situations where you need to group more elements.

hey buddy how about without hardcoding header one too :) stackoverflow.com/a/53773434/5866580
Throws "awk: line 2: function asorti never defined". I'm running mawk 1.3.3.
@Allan, how do I get it to always output 5 columns for each merge column regardless of how many were actually there? e.g. col1|col2|col3-1|col3-2|col3-3|col3-4|col3-5|col4|... val1|val2|val3-1|val3-2||||val4...
@JimWalker Simply replace: -v MAX=awk -F'|' ' {tot[$1]++}END{tmp=""; for (i in tot){if(tot[i]>tmp){tmp=tot[i]}}; print tmp; } ' by -v MAX=5
@Allan, trying to figure out how to adapt your code such that any occurrences beyond MAX are added to the MAXth column separated by commas. E.g. if MAX=5, and there are 7 occurrences in the group, then the output ends up something like this: col3-1|col3-2|col3-3|col3-4|col3-5,col3-6,col3-7... I'll update once I have it working. Appreciate all the effort.

Rafael · Accepted Answer · 2018-12-14 06:29:03Z

0

$ awk -f merge_fields.awk <(perl join.pl <(sort -t'|' -k1 data.txt))

id1|2314|jack|jill|jim|nov-12|water|oil|ether|3294 
id2|8322|john|dec-01|sand|2334
id3|6775|mike|jan-13|dust|9348

join.pl

use v5.14;

readline(<>);
my @queue = ();

while (<>) {
    chomp and my @fields = split /\|/;
    say join('|', @queue) and @queue = ()
    if (@queue and @queue[0] ne @fields[0]);
    push(@queue, @fields);
}
say join('|', @queue) if @queue;

merge_fields.awk

BEGIN { OFS=FS="|" }
NF > 6 {
    for (i = 6 + 1; i < NF; i++) {
        if ($i ~ $1) {
            $3 = $3 OFS $(i+2)
            $5 = $5 OFS $(i+4)
        }
    }
}
{ print $1,$2,$3,$4,$5,$6 }

edited Dec 14, 2018 at 6:29

answered Dec 14, 2018 at 6:11

Rafael

7,79613 gold badges36 silver badges49 bronze badges

Comments

stack0114106 · Accepted Answer · 2018-12-14 15:17:06Z

Here is one more solution using Perl. It prints max of 5 elements for col 3 and 5 and prints the rest of them after col6. Just added "xx" and "yy" as default values, so that it can be viewed in the output

Script:

/tmp> cat csv_35col.ksh
perl -F"/\|/" -ane '
chomp($F[5]);
$id=$F[0]; 
if($.>1) { 
        if( $id ~~ @names ) 
        {
          @t3=@{ $kv3{$id} }; @t5=@{ $kv5{$id} };
          push(@t3,$F[2]); push(@t5,$F[4]);
          $kv3{$id}=[ @t3 ]; $kv5{$id}=[ @t5 ];
        }
        else
        {
        push(@names,$id); $kv{$id}=[ @F[0,1,3,5] ];
        $kv3{$id}=[ @F[2] ]; $kv5{$id}=[ @F[4] ];
        }
}
END {
$d="|"; 
for (1..6) { if($_==5 ||  $_==3) { $x=$_; for (1..5) { printf("%s|","col$x-$_")}} else { printf("%s|","col$_")}  } 
for my $x (@names) { 
@n=@{$kv{$x}}; @n3=@{$kv3{$x}}; @n5=@{$kv5{$x}};
for (0..4) { $n3[$_]= $n3[$_] ? $n3[$_] : "xx"; $n5[$_]=$n5[$_]? $n5[$_] : "yy"; }
print "\n".join($d,@n[0,1],@n3[0..4],${n[2]},@n5[0..4],${n[3]},@n3[5..$#n3],@n5[5..$#n5]);  
 }
print "\n";
} 
' $1
/tmp>

Inputs:

/tmp> cat jimw.csv
col1|col2|col3|col4|col5|col6
id1|2314|jack|nov-12|water|3294 
id2|8322|john|dec-01|sand|2334
id1|2314|jill|nov-12|oil|3294
id1|2314|jim|nov-12|ether|3294
id3|6775|mike|jan-13|dust|9348
/tmp> cat jimw2.csv
col1|col2|col3|col4|col5|col6
id1|2314|jack|nov-12|water|3294 
id2|8322|john|dec-01|sand|2334
id1|2314|jill|nov-12|oil|3294
id1|2314|jim|nov-12|ether|3294
id3|6775|mike|jan-13|dust|9348
id4|6776|mik1|jan-14|dast|9344
id4|6776|mik2|jan-14|dest|9344
id4|6776|mik3|jan-14|dist|9344
id4|6776|mik4|jan-14|dost|9344
id4|6776|mik5|jan-14|dst|9344
id4|6776|mik6|jan-14|dut|9344
/tmp>

Results:

/tmp> csv_35col.ksh jimw.csv 
col1|col2|col3-1|col3-2|col3-3|col3-4|col3-5|col4|col5-1|col5-2|col5-3|col5-4|col5-5|col6|
id1|2314|jack|jill|jim|xx|xx|nov-12|water|oil|ether|yy|yy|3294 
id2|8322|john|xx|xx|xx|xx|dec-01|sand|yy|yy|yy|yy|2334
id3|6775|mike|xx|xx|xx|xx|jan-13|dust|yy|yy|yy|yy|9348
/tmp> 
/tmp> csv_35col.ksh jimw2.csv
col1|col2|col3-1|col3-2|col3-3|col3-4|col3-5|col4|col5-1|col5-2|col5-3|col5-4|col5-5|col6|
id1|2314|jack|jill|jim|xx|xx|nov-12|water|oil|ether|yy|yy|3294 
id2|8322|john|xx|xx|xx|xx|dec-01|sand|yy|yy|yy|yy|2334
id3|6775|mike|xx|xx|xx|xx|jan-13|dust|yy|yy|yy|yy|9348
id4|6776|mik1|mik2|mik3|mik4|mik5|jan-14|dast|dest|dist|dost|dst|9344|mik6|dut
/tmp>

Collectives™ on Stack Overflow

Merge only specific columns in CSV if first column matches using bash

4 Answers 4

5 Comments

5 Comments

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

5 Comments

5 Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related