
I'm trying to merge columns based on the value in the first field. I've tried using awk, but to no avail. Please see example input and output:

Input:  
10013   97      1503384  
10013   196     1506234  
10013   61      1507385  
10013   1559    1508385  
10014   1726    1514507  
10014   960     1519162  
10015   1920    1545535  
10015   124     1548915  
10015   77      1550284  

Desired output:  
10013   97,196,61,1559  1503384,1506234,1507385,1508385  
10014   1726,960        1514507,1519162  
10015   1920,124,77     1545535,1548915,1550284  

Thanks in advance for any advice!


4 Answers


The shortest GNU datamash solution:

datamash -sW -g1 collapse 2 collapse 3 <file
  • -s - sort the input first (grouping requires the lines of each group to be contiguous)
  • -W - use whitespace (one or more spaces/tabs) as the field delimiter
  • -g1 - group by the 1st field
  • collapse N - produce a comma-separated list of all values of field N within each group

The output:

10013   97,196,61,1559  1503384,1506234,1507385,1508385
10014   1726,960    1514507,1519162
10015   1920,124,77 1545535,1548915,1550284
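
As a side note (my addition, not part of the original answer): the -s flag makes datamash sort the input first, because grouping requires the lines of each group to be contiguous. Since the sample input is already ordered by the first column, the sort could also be dropped or done explicitly:

sort -k1,1n file | datamash -W -g1 collapse 2 collapse 3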

3 Comments

This looks like a very useful tool for these kinds of tasks.
@karakfa, it's very convenient for simple grouping/aggregation and arithmetic operations. Recommended "stuff"
Just make sure that you have the most recent version - not all distribution repositories are up to date. See their download page.

An awk approach that prints each group as soon as the value in the 1st field changes (it relies on the input being grouped by that field, as in the sample):

$ cat tst.awk
$1 != f1 { if (NR>1) print f1, f2, f3; f1=f2=f3=s="" }   # new key: print the finished group, reset the accumulators
{ f1=$1; f2=f2 s $2; f3=f3 s $3; s="," }                 # append the 2nd/3rd fields, then set "," as the separator
END { print f1, f2, f3 }                                 # print the last group

$ awk -f tst.awk file | column -t
10013  97,196,61,1559  1503384,1506234,1507385,1508385
10014  1726,960        1514507,1519162
10015  1920,124,77     1545535,1548915,1550284

2 Comments

How do we tweak this for an unknown number of columns?
@Naveed Write a loop? Post a new question with sample input/output if you'd like more help.
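
Following up on the comment above, here is a minimal, untested sketch of such a loop (my own addition, not part of the answer). It assumes the input is still grouped by the 1st field and that every row has the same number of fields; deleting a whole array needs a reasonably modern awk (GNU awk, mawk, nawk), and tst_all.awk is just a placeholder name:

$ cat tst_all.awk
$1 != key { if (NR>1) prt(); key=$1; delete vals }   # new key: print the previous group, reset
{ nf=NF; for (i=2; i<=NF; i++) vals[i] = ((i in vals) ? vals[i] "," : "") $i }
END { prt() }
function prt(   i) { printf "%s", key; for (i=2; i<=nf; i++) printf "%s%s", OFS, vals[i]; print "" }

$ awk -f tst_all.awk file | column -t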

awk to the rescue!

$ awk '{f2[$1]=f2[$1] sep[$1] $2;                   # concatenate 2nd field 
        f3[$1]=f3[$1] sep[$1] $3;                   # concatenate 3rd field 
        sep[$1]=","}                                # lazy init separator to skip first
   END {for(k in f2) print k,f2[k],f3[k]}' file |   # iterate over keys and print
  column -t                                         # pretty print


10013  97,196,61,1559  1503384,1506234,1507385,1508385
10014  1726,960        1514507,1519162
10015  1920,124,77     1545535,1548915,1550284

Note that the output order is not guaranteed, because for(k in f2) visits the keys in an unspecified order; you can sort the result by the first field if needed.
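
For example (my addition, assuming GNU awk is available; the one-liner above works with any awk), gawk can be told to visit the array keys in ascending numeric order, which removes the need for an extra sort:

$ gawk 'BEGIN { PROCINFO["sorted_in"]="@ind_num_asc" }   # for(k in f2) now iterates keys in numeric order
        { f2[$1]=f2[$1] sep[$1] $2; f3[$1]=f3[$1] sep[$1] $3; sep[$1]="," }
    END { for(k in f2) print k,f2[k],f3[k] }' file | column -t

With any other awk, piping the output through sort -k1,1n before column -t gives the same ordering.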



Awk solution (assuming that the input lines are already sorted):

awk '!a[$1]++{ if ("f2" in b) { print f1, b["f2"], b["f3"]; delete b } }
     { 
         f1=$1; 
         b["f2"]=(b["f2"]!=""? b["f2"]",":"")$2; 
         b["f3"]=(b["f3"]!=""? b["f3"]",":"")$3 
     }
     END{ print f1, b["f2"], b["f3"] }' OFS='\t' file
  • delete b - clears the array b each time a new 1st-field value starts, so it never holds the values of all groups at once (saving memory)

The output:

10013   97,196,61,1559  1503384,1506234,1507385,1508385
10014   1726,960    1514507,1519162
10015   1920,124,77 1545535,1548915,1550284

