1

I have the following delimited content in a text file:

col1|col2|col3|col4|col5|col6
id1|2314|jack|nov-12|water|3294 
id2|8322|john|dec-01|sand|2334
id1|2314|jill|nov-12|oil|3294
id1|2314|jim|nov-12|ether|3294
id3|6775|mike|jan-13|dust|9348

I would like to merge the contents of the 3rd and 5th columns where the 1st column matches. The output should look like:

col1|col2|col3-1|col3-2|col3-3|col4|col5-1|col5-2|col5-3|col6
id1|2314|jack|jill|jim|nov-12|water|oil|ether|3294 
id2|8322|john|||dec-01|sand|||2334
id3|6775|mike|||jan-13|dust|||9348

It doesn't matter if the order of rows and columns in the output is different from the input.

Edit 1: There can be a maximum of 5 merges, anything thereafter should be added to the 5th column with a comma, e.g.

col1|col2|col3-1|col3-2|col3-3|col3-4|col3-5|col4|col5-1|col5-2|col5-3|col5-4|col5-5|col6
id1|2314|jack|jill|jim|val3-4|val3-5,val3-6|nov12|water|oil|ether|val5-4|val5-5,val5-6|3294 

Edit 2: As a side note, the actual file contains 14 columns and the merge is required for columns 9 and 13. I was able to adapt @Allan's answer below to do the needful. Also, as I mentioned in a comment to @RavinderSingh13's answer, the output is automatically processed by a cron job, so the number of columns post merge needs to be fixed at 5 each.

5
  • Can you merge maximum 3 elements with the same col1 id? or can you imagine something like col1|col2|col3-1|col3-2|col3-3|col3-4|col3-5|col4|col5-1|col5-2|col5-3|col5-4|col5-5|col6 Commented Dec 14, 2018 at 3:31
  • @Allan, edited the post to answer your question. Thanks. Commented Dec 14, 2018 at 4:05
  • and for the other lines how do you want us to display them when there are not 5 merge operations taking place? Should something like that be ok? id3|6775|mike|||,|jan-13|dust|||,|9348 Commented Dec 14, 2018 at 4:09
  • That would be great. If we can avoid the commas where there are no values that would be preferred, but it's not a deal breaker. It should be trivial to replace the |,| with || using sed. Commented Dec 14, 2018 at 4:31
  • Not from me!!!! Commented Dec 14, 2018 at 6:21

4 Answers 4

2

Could you please try following, this solution will of fields line by line(NOT 3 number of columns hard coded), also it will enter maximum number of column headings in col3 and col5.

awk -F'|' '
FNR==NR{
  b[$1]=$1 in a?b[$1] FS $3:$3
  c[$1]=$1 in a?c[$1] FS $5:$5
  num1=split(b[$1],array1,"|")
  num2=split(c[$1],array2,"|")
  a[$1]=$1
  num=num1>num2?num1:num2>prev?num2:prev
  prev=num
  next
}
FNR==1{
  $3=$5=""
  while(++count<=num){
    $3=$3 OFS "col3-"count
  }
  $5=$3
  gsub("col3","col5",$5)
  print
  count=""
  next
}
!d[$1]++ && FNR>1{
  num1=split(b[$1],array1,"|")
  num2=split(c[$1],array2,"|")
  while(num1++<=num){
    b[$1]=b[$1] OFS
  }
  while(num2++<=num){
    c[$1]=c[$1] OFS
  }
  $3=b[$1]
  $5=c[$1]
  print
}'  Input_file OFS="|"  Input_file
Sign up to request clarification or add additional context in comments.

5 Comments

great!!! ;-) ;-) way better than mine hardcoded solutions that I want to adapt after lunch xD
But the questions has been updated and we have to add some ',' after the 3rd column...
Sorry I think OP has changed it later, riding bike as of now will check on it.
I'm going to go with @Allan's answer since I need the number of columns to be fixed (these files are received in a pipeline and are pushed for automatic processing post cleanup). But your answer is excellent work, and I'm bookmarking it for future reference. I received a bunch of syntax errors which I plan to look into shortly. Thanks.
And now enjoy the beauty of awk with my new answer ;-)
1

First element of answer (when sized was fixed to 3 and ugly solution):

awk 'BEGIN{FS=OFS="|"; print "col1|col2|col3-1|col3-2|col3-3|col4|col5-1|col5-2|col5-3|col6"}NR>1{col2[$1]=$2;col4[$1]=$4;col6[$1]=$6;if(length(col3[$1])==0){col3[$1]=$3}else{col3[$1]=col3[$1]"|"$3}if(length(col5[$1])==0){col5[$1]=$5}else{col5[$1]=col5[$1]"|"$5}}END{n=asorti(col3,oArray);for(i=1; i<=n;i++){if(index(col3[oArray[i]],"|")==0){col3[oArray[i]]=col3[oArray[i]]"||";col5[oArray[i]]=col5[oArray[i]]"||";};print oArray[i],col2[oArray[i]],col3[oArray[i]],col4[oArray[i]],col5[oArray[i]],col6[oArray[i]]}}' csvToMerge.in 
col1|col2|col3-1|col3-2|col3-3|col4|col5-1|col5-2|col5-3|col6
id1|2314|jack|jill|jim|nov-12|water|oil|ether|3294
id2|8322|john|||dec-01|sand|||2334
id3|6775|mike|||jan-13|dust|||9348

more readable:

    $ cat awkprof.out
    # gawk profile, created Fri Dec 14 13:12:34 2018

    # BEGIN rule(s)

    BEGIN {
 1          FS = OFS = "|"
 1          print "col1|col2|col3-1|col3-2|col3-3|col4|col5-1|col5-2|col5-3|col6"
    }

    # Rule(s)

 6  NR > 1 { # 5
 5          col2[$1] = $2
 5          col4[$1] = $4
 5          col6[$1] = $6
 5          if (length(col3[$1]) == 0) { # 3
 3                  col3[$1] = $3
 2          } else {
 2                  col3[$1] = col3[$1] "|" $3
            }
 5          if (length(col5[$1]) == 0) { # 3
 3                  col5[$1] = $5
 2          } else {
 2                  col5[$1] = col5[$1] "|" $5
            }
    }

    # END rule(s)

    END {
 1          n = asorti(col3, oArray)
 3          for (i = 1; i <= n; i++) {
 3                  if (index(col3[oArray[i]], "|") == 0) { # 2
 2                          col3[oArray[i]] = col3[oArray[i]] "||"
 2                          col5[oArray[i]] = col5[oArray[i]] "||"
                    }
 3                  print oArray[i], col2[oArray[i]], col3[oArray[i]], col4[oArray[i]], col5[oArray[i]], col6[oArray[i]]
            }
    }

BEAUTIFULL SOLUTION

Dynamically construct col3 and col5 by computing the max number of identical occurrences of elements from col1

script csvmerge.awk

#function definitions

#function used to add the "|" at the end of col3, col5 when the element does not reach MAX number of occurences
function paddingfunction(MAX,input){
        output=input;
        gsub(/[^|]/,"",output);
        l=length(output);
        tmp=""
        for(u=l; u<MAX-1;u++)
        {
                tmp=tmp OFS;
        }
        return input""tmp;
}

#function used to generate nice header
function headerAppender(inputString){
        tmp=inputString;
        for(i=1;i<=MAX;i++){
                printf tmp""i OFS
        }
}

BEGIN{
        #Generate the header line
        FS=OFS="|";
        printf "col1" OFS "col2" OFS;
        headerAppender("col3-");
        printf "col4" OFS; headerAppender("col5-");
        print "col6"
}

NR>1{
        #save all the cells and concat the cells when col1 is the same
        col2[$1]=$2;
        col4[$1]=$4;
        col6[$1]=$6;
        if(length(col3[$1])==0){
                col3[$1]=$3
        }
        else{
                col3[$1]=col3[$1] OFS $3
        }
        if(length(col5[$1])==0){
                col5[$1]=$5
        }
        else{
                col5[$1]=col5[$1] OFS $5
        }
}

END{
        #sort the array
        n=asorti(col3,oArray);
        #print the cells
        for(i=1; i<=n;i++){
                print oArray[i],col2[oArray[i]],paddingfunction(MAX,col3[oArray[i]]),col4[oArray[i]],paddingfunction(MAX,col5[oArray[i]]),col6[oArray[i]];
        }
}

input1: (6 elements to group)

$ cat csvToMerge.in 
col1|col2|col3|col4|col5|col6
id1|2314|jack|nov-12|water|3294 
id2|8322|john|dec-01|sand|2334
id1|2314|jill|nov-12|oil|3294
id1|2314|jim|nov-12|ether|3294
id3|6775|mike|jan-13|dust|9348
id4|6776|mik1|jan-14|dast|9344
id4|6776|mik2|jan-14|dest|9344
id4|6776|mik3|jan-14|dist|9344
id4|6776|mik4|jan-14|dost|9344
id4|6776|mik5|jan-14|dst|9344
id4|6776|mik6|jan-14|dut|9344

input2: (5 elements to group)

$ cat csvToMerge2.in 
col1|col2|col3|col4|col5|col6
id1|2314|jack|nov-12|water|3294 
id2|8322|john|dec-01|sand|2334
id1|2314|jill|nov-12|oil|3294
id1|2314|jim|nov-12|ether|3294
id3|6775|mike|jan-13|dust|9348
id4|6776|mik1|jan-14|dast|9344
id4|6776|mik2|jan-14|dest|9344
id4|6776|mik3|jan-14|dist|9344
id4|6776|mik4|jan-14|dost|9344
id4|6776|mik5|jan-14|dst|9344

output 1:

$ awk -f csvmerge.awk -v MAX=`awk -F'|' ' {tot[$1]++}END{tmp=""; for (i in tot){if(tot[i]>tmp){tmp=tot[i]}}; print tmp; } ' csvToMerge.in` csvToMerge.in 
col1|col2|col3-1|col3-2|col3-3|col3-4|col3-5|col3-6|col4|col5-1|col5-2|col5-3|col5-4|col5-5|col5-6|col6
id1|2314|jack|jill|jim||||nov-12|water|oil|ether||||3294
id2|8322|john||||||dec-01|sand||||||2334
id3|6775|mike||||||jan-13|dust||||||9348
id4|6776|mik1|mik2|mik3|mik4|mik5|mik6|jan-14|dast|dest|dist|dost|dst|dut|9344

output 2:

$ awk -f csvmerge.awk -v MAX=`awk -F'|' ' {tot[$1]++}END{tmp=""; for (i in tot){if(tot[i]>tmp){tmp=tot[i]}}; print tmp; } ' csvToMerge2.in` csvToMerge2.in 
col1|col2|col3-1|col3-2|col3-3|col3-4|col3-5|col4|col5-1|col5-2|col5-3|col5-4|col5-5|col6
id1|2314|jack|jill|jim|||nov-12|water|oil|ether|||3294
id2|8322|john|||||dec-01|sand|||||2334
id3|6775|mike|||||jan-13|dust|||||9348
id4|6776|mik1|mik2|mik3|mik4|mik5|jan-14|dast|dest|dist|dost|dst|9344

Notes:

-v MAX=`awk -F'|' ' {tot[$1]++}END{tmp=""; for (i in tot){if(tot[i]>tmp){tmp=tot[i]}}; print tmp; } ' csvToMerge.in`

This will save in variable MAX the maximum number of occurrences to group , in your case max is 5 but you can imagine other situations where you need to group more elements.

5 Comments

hey buddy how about without hardcoding header one too :) stackoverflow.com/a/53773434/5866580
Throws "awk: line 2: function asorti never defined". I'm running mawk 1.3.3.
@Allan, how do I get it to always output 5 columns for each merge column regardless of how many were actually there? e.g. col1|col2|col3-1|col3-2|col3-3|col3-4|col3-5|col4|... val1|val2|val3-1|val3-2||||val4...
@JimWalker Simply replace: -v MAX=awk -F'|' ' {tot[$1]++}END{tmp=""; for (i in tot){if(tot[i]>tmp){tmp=tot[i]}}; print tmp; } ' by -v MAX=5
@Allan, trying to figure out how to adapt your code such that any occurrences beyond MAX are added to the MAXth column separated by commas. E.g. if MAX=5, and there are 7 occurrences in the group, then the output ends up something like this: col3-1|col3-2|col3-3|col3-4|col3-5,col3-6,col3-7... I'll update once I have it working. Appreciate all the effort.
0

$ awk -f merge_fields.awk <(perl join.pl <(sort -t'|' -k1 data.txt))

id1|2314|jack|jill|jim|nov-12|water|oil|ether|3294 
id2|8322|john|dec-01|sand|2334
id3|6775|mike|jan-13|dust|9348

join.pl

use v5.14;

readline(<>);
my @queue = ();

while (<>) {
    chomp and my @fields = split /\|/;
    say join('|', @queue) and @queue = ()
    if (@queue and @queue[0] ne @fields[0]);
    push(@queue, @fields);
}
say join('|', @queue) if @queue;

merge_fields.awk

BEGIN { OFS=FS="|" }
NF > 6 {
    for (i = 6 + 1; i < NF; i++) {
        if ($i ~ $1) {
            $3 = $3 OFS $(i+2)
            $5 = $5 OFS $(i+4)
        }
    }
}
{ print $1,$2,$3,$4,$5,$6 }

Comments

0

Here is one more solution using Perl. It prints max of 5 elements for col 3 and 5 and prints the rest of them after col6. Just added "xx" and "yy" as default values, so that it can be viewed in the output

Script:

/tmp> cat csv_35col.ksh
perl -F"/\|/" -ane '
chomp($F[5]);
$id=$F[0]; 
if($.>1) { 
        if( $id ~~ @names ) 
        {
          @t3=@{ $kv3{$id} }; @t5=@{ $kv5{$id} };
          push(@t3,$F[2]); push(@t5,$F[4]);
          $kv3{$id}=[ @t3 ]; $kv5{$id}=[ @t5 ];
        }
        else
        {
        push(@names,$id); $kv{$id}=[ @F[0,1,3,5] ];
        $kv3{$id}=[ @F[2] ]; $kv5{$id}=[ @F[4] ];
        }
}
END {
$d="|"; 
for (1..6) { if($_==5 ||  $_==3) { $x=$_; for (1..5) { printf("%s|","col$x-$_")}} else { printf("%s|","col$_")}  } 
for my $x (@names) { 
@n=@{$kv{$x}}; @n3=@{$kv3{$x}}; @n5=@{$kv5{$x}};
for (0..4) { $n3[$_]= $n3[$_] ? $n3[$_] : "xx"; $n5[$_]=$n5[$_]? $n5[$_] : "yy"; }
print "\n".join($d,@n[0,1],@n3[0..4],${n[2]},@n5[0..4],${n[3]},@n3[5..$#n3],@n5[5..$#n5]);  
 }
print "\n";
} 
' $1
/tmp> 

Inputs:

/tmp> cat jimw.csv
col1|col2|col3|col4|col5|col6
id1|2314|jack|nov-12|water|3294 
id2|8322|john|dec-01|sand|2334
id1|2314|jill|nov-12|oil|3294
id1|2314|jim|nov-12|ether|3294
id3|6775|mike|jan-13|dust|9348
/tmp> cat jimw2.csv
col1|col2|col3|col4|col5|col6
id1|2314|jack|nov-12|water|3294 
id2|8322|john|dec-01|sand|2334
id1|2314|jill|nov-12|oil|3294
id1|2314|jim|nov-12|ether|3294
id3|6775|mike|jan-13|dust|9348
id4|6776|mik1|jan-14|dast|9344
id4|6776|mik2|jan-14|dest|9344
id4|6776|mik3|jan-14|dist|9344
id4|6776|mik4|jan-14|dost|9344
id4|6776|mik5|jan-14|dst|9344
id4|6776|mik6|jan-14|dut|9344
/tmp> 

Results:

/tmp> csv_35col.ksh jimw.csv 
col1|col2|col3-1|col3-2|col3-3|col3-4|col3-5|col4|col5-1|col5-2|col5-3|col5-4|col5-5|col6|
id1|2314|jack|jill|jim|xx|xx|nov-12|water|oil|ether|yy|yy|3294 
id2|8322|john|xx|xx|xx|xx|dec-01|sand|yy|yy|yy|yy|2334
id3|6775|mike|xx|xx|xx|xx|jan-13|dust|yy|yy|yy|yy|9348
/tmp> 
/tmp> csv_35col.ksh jimw2.csv
col1|col2|col3-1|col3-2|col3-3|col3-4|col3-5|col4|col5-1|col5-2|col5-3|col5-4|col5-5|col6|
id1|2314|jack|jill|jim|xx|xx|nov-12|water|oil|ether|yy|yy|3294 
id2|8322|john|xx|xx|xx|xx|dec-01|sand|yy|yy|yy|yy|2334
id3|6775|mike|xx|xx|xx|xx|jan-13|dust|yy|yy|yy|yy|9348
id4|6776|mik1|mik2|mik3|mik4|mik5|jan-14|dast|dest|dist|dost|dst|9344|mik6|dut
/tmp> 

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.