Find Maximum of all columns based on distinct first column

Question

I am using Ubuntu and I have an input file like this

ifile.dat
1   10  15
3   34  20
1   4   22
3   32  33
5   3   46
2   2   98
4   20  100
3   13  23
4   50  65
1   40  76
2   20  22

How do I achieve this?

ofile.dat
1   40  76
2   20  98
3   34  33
4   50  100
5   3   46

I mean the maximum of each column for each distinct value of the first column. Thanks.

Here is what I have tried (on a sample file with 13 columns), but the highest value does not come out this way.

cat input.txt | sort -k1,1 -k2,2nr -k3,3nr -k4,4nr -k5,5nr -k6,6nr -k7,7nr -k8,8nr -k9,9nr -k10,10nr -nrk11,11 -nrk12,12 -nrk13,13 | sort -k1,1 -u

It didn't work. A helpful person then suggested the command below, but whether on Mac or on Ubuntu with gawk, I couldn't run it; I get the errors shown below.

awk 'BEGIN{PROCINFO["sorted_in"] = "@val_num_asc"} {for(i=2;i<=NF;++i) if (a[$1][i]<$i){a[$1][i]=$i}} END{n=asorti(a, asorted); for(col1 in asorted){print col1, a[col1][2], a[col1][3]}}' input.txt

Error is

awk: syntax error at source line 1
 context is
    BEGIN{PROCINFO["sorted_in"] = "@val_num_asc"} {for(i=2;i<=NF;++i) if >>>  (a[$1][ <<< 
awk: illegal statement at source line 1
awk: illegal statement at source line 1

I did try removing the BEGIN statement and playing with the for loop, but had no luck. Thanks.

P.S.: I got this answer from Stack Overflow. I am posting it here because this is a forum dedicated to Unix/Linux.

Hang on, do you need this to work for an arbitrary number of columns?
– terdon ♦, Commented Jul 5, 2017 at 21:14
@terdon, yes, this is a reasonable question. In that case - all answers that relied on 2,3 columns should be reconsidered
– RomanPerekhrest, Commented Jul 5, 2017 at 21:20
The error you show seems to indicate an older version of AWK (or an implementation that does not support true multidimensional arrays). Which version(s) of AWK are you using?
– Fox, Commented Jul 5, 2017 at 21:27
@terdon I don't know if the Ubuntu box I have access to from where I am is fully up to date, but a quick test there indicates that arrays of arrays are unsupported
– Fox, Commented Jul 5, 2017 at 21:34

steeldriver · Accepted Answer · 2017-07-05 23:02:52Z

7

GNU datamash is nice for things like this:

$ datamash -sW groupby 1 max 2,3 < ifile.dat 
1   40  76
2   20  98
3   34  33
4   50  100
5   3   46

To handle a larger number of columns, you can specify a range e.g.

datamash -sW groupby 1 max 2-13 < ifile.dat
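
For readers without datamash installed, the same group-by-and-maximum operation can be sketched in Python. This is only an illustration under stated assumptions, not what datamash does internally: the function name `group_max` is hypothetical, every value after the first field is assumed to be an integer, and every row is assumed to have the same number of fields.

```python
from typing import Dict, Iterable, List


def group_max(lines: Iterable[str]) -> Dict[str, List[int]]:
    """For each distinct first field, return the per-column maxima of the other fields."""
    maxima: Dict[str, List[int]] = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        key, values = fields[0], [int(v) for v in fields[1:]]
        if key not in maxima:
            maxima[key] = values  # the first row for a key seeds its maxima
        else:
            maxima[key] = [max(old, new) for old, new in zip(maxima[key], values)]
    return maxima


# The rows of ifile.dat from the question.
sample = """\
1 10 15
3 34 20
1 4 22
3 32 33
5 3 46
2 2 98
4 20 100
3 13 23
4 50 65
1 40 76
2 20 22
""".splitlines()

# Print the groups in ascending numeric order of the first column.
for key, row in sorted(group_max(sample).items(), key=lambda kv: int(kv[0])):
    print(key, *row)
```

Because the maxima are seeded from the first row seen for each key, the sketch should also cope with zero and negative values.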

edited Jul 5, 2017 at 23:02

answered Jul 5, 2017 at 21:12

steeldriver


RomanPerekhrest · Accepted Answer · 2017-07-05 22:55:15Z

4

awk solution for any number of columns (you have mentioned a sample file with 13 columns):

Let's say we have the extended sample file:

1   10  15  10  99
3   34  20  20  111
1   4   22  22  33
3   32  33  12  5
5   3   46  44  9
2   2   98  55  55 
4   20  100 11  33
3   13  23  77  23
4   50  65  33  66
1   40  76  78  16
2   20  22  98  93

awk '{ for(i=2;i<=NF;i++) { if (!($1 in a) || $i > a[$1][i]) a[$1][i]=$i }}
     END{ r=""; for(i in a) { r=i; for(j in a[i]) r=r OFS a[i][j]; print r } 
     }' OFS='\t' file

The output:

1   40  76  78  99
2   20  98  98  93
3   34  33  77  111
4   50  100 33  66
5   3   46  44  9

edited Jul 5, 2017 at 22:55

answered Jul 5, 2017 at 21:14

RomanPerekhrest


You need !($1 in a) due to precedence and you need to check it before the first column not repeat for each column, or else make it !($1 in a && i in a[$1]). However, awk for(v in a) can enumerate in any order and usually not the 'natural' order; my gawk 4.0.1 (on Ubuntu 14.04) does the 5 values of $1 here as 4,5,1,2,3 (visible because they're tagged) and the 4 stored columns as 4,5,2,3 (less visible). But array-of-array requires gawk 4 which also has PROCINFO["sorted_in"]="@ind_num_asc" to fix this. Finally, you don't need the r="".

– dave_thompson_085, Commented Jul 5, 2017 at 23:41


terdon · Accepted Answer · 2017-07-08 14:17:59Z

4

Here's one way in awk:

$ awk '{ 
        if($2 > a[$1][2]){
            a[$1][2] = $2
        } 
        if($3 > a[$1][3]){
            a[$1][3] = $3
        }
       }
  END{
        for(i in a){
            printf "%s ", i; 
            for(c=2; c<=3; c++){
              if(c in a[i]){
                 printf "%s ",a[i][c]
              }
            }
            print ""
        }
     }' ifile.dat 
1 40 76
2 20 98
3 34 33
4 50 100
5 3 46

The script simply uses the two-dimensional array a to store the maximum value for each of the 2 columns. For each value i of the 1st column, a[i][2] will hold the maximum value found for i in the 2nd column and a[i][3] the maximum for the 3rd. Once the whole file has been processed, we print the maximum values for each value of i.

If you have more than 3 columns, you can use:

awk '{ 
        for(c=2; c<=NF; c++){
            if($c > a[$1][c]){
                a[$1][c] = $c; 
            }
        }
       } 
       END{
            for(i in a){
                printf "%s: ", i; 
                for(c in a[i]){
                    printf "%s ",a[i][c]
                }
                print ""
            }
        }' ifile.dat

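Note that the above solution will not work correctly when a column's maximum is zero or negative, because an unset array element compares as 0 and such values are never stored. It can also get the order of the fields wrong, since awk doesn't necessarily traverse arrays in order. A more robust approach (gawk 4 or later) is:

awk 'BEGIN{
        # gawk 4+: make for(... in ...) visit indices in ascending numeric order
        PROCINFO["sorted_in"] = "@ind_num_asc"
      }
      { 
        for(c=2; c<=NF; c++){
            # store the value if this key/column pair is new, or if it is larger
            if(!(($1 in a) && (c in a[$1])) || $c > a[$1][c]){
                a[$1][c] = $c; 
            }
        }
      } 
      END{
            for(i in a){
                printf "%s ", i; 
                for(c in a[i]){
                    printf "%s ",a[i][c]
                }
                print ""
            }
         }' ifile.dat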
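
The zero-or-negative pitfall can be reproduced outside awk. The following Python sketch is an illustration only: both function names are hypothetical, and the first one merely imitates comparing against an unset awk element (which behaves like 0 in a numeric comparison) rather than reproducing awk itself.

```python
def max_unset_as_zero(values):
    """Imitate the naive awk test: an unset maximum compares as 0."""
    best = None  # "unset", like an awk array element that was never assigned
    for v in values:
        if v > (0 if best is None else best):
            best = v
    return best


def max_seeded(values):
    """Store the first value unconditionally, then keep the larger one."""
    best = None
    for v in values:
        if best is None or v > best:
            best = v
    return best


all_negative = [-7, -3, -12]
print(max_unset_as_zero(all_negative))  # None: nothing was ever stored
print(max_seeded(all_negative))         # -3

zero_then_negative = [0, -5]
print(max_unset_as_zero(zero_then_negative))  # None
print(max_seeded(zero_then_negative))         # 0
```

The explicit "have I seen this key and column before?" test plays the same role as the `best is None` check here.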

edited Jul 8, 2017 at 14:17

answered Jul 5, 2017 at 21:11

terdon♦


thanks @terdon. I do have some NA values at times. But no negative values though.
– j smith, Commented Jul 5, 2017 at 22:18
You have NA? That changes everything. How should those be treated? Please always make sure that the example you show accurately represents your data. Otherwise, we will give you answers that don't work for you and waste both your time and ours.
– terdon ♦, Commented Jul 5, 2017 at 22:22
Your fix for negative doesn't work for zero followed by negative (in the same column on a later line); Roman's does after I tweaked it. You also have the same issue that for(v in a) is not necessarily in the correct order.
– dave_thompson_085, Commented Jul 5, 2017 at 23:45
@terdon - This task has no NA values fortunately. So we are good.
– j smith, Commented Jul 6, 2017 at 1:38
@dave_thompson_085 yes, it won't work if there's a leading 0, but why would a negative number ever have a leading 0? As for the order, I don't think it's relevant. The OP didn't state that the output needs to preserve order and, if it does, you can always sort by the 1st column.
– terdon ♦, Commented Jul 6, 2017 at 8:23


choroba · Accepted Answer · 2017-07-05 21:12:30Z

2

Using sort as the main tool:

sort             ifile.dat -k1,1 -k2,2nr | sort -uk1,1 | awk '{print $1,$2}' \
| paste - <(sort ifile.dat -k1,1 -k3,3nr | sort -uk1,1 | awk '{print $3}')

answered Jul 5, 2017 at 21:12

choroba


thanks. it works for the sample ifile.dat. But I have many columned data. appreciate your time.
– j smith, Commented Jul 5, 2017 at 22:18
You can easily generate the source code for any number of columns, but it will be too slow for higher numbers.
– choroba, Commented Jul 5, 2017 at 22:24


Sergiy Kolodyazhnyy · Accepted Answer · 2017-07-06 03:35:28Z

Python 3 Script

#!/usr/bin/env python3
import sys
from collections import OrderedDict as od

# read data in the file first, create data dictionary of column lists
data = od()
with open(sys.argv[1]) as f:
    for line in f:
        columns = line.strip().split()
        how_many = len(columns) - 1
        if columns[0] not in data:
            data[columns[0]] = [[] for i in range(how_many)]
        for index in range(how_many):
            data[columns[0]][index].append(int(columns[index + 1]))

# post process all the created lists of lists by applying max() on each
for item in sorted(data.keys()):
    print(item,end=" ") 
    for array in data[item]:
        print(max(array),end=" ")
    print("")

Test run

With input example provided by OP:

$ ./columns_max.py input.txt                                                                                                                         
1 40 76 
2 20 98 
3 34 33 
4 50 100 
5 3 46

With extended example in Roman Perekhrest's answer:

$ ./columns_max.py input.txt                                                                                                                         
1 40 76 78 99 
2 20 98 98 93 
3 34 33 77 111 
4 50 100 33 66 
5 3 46 44 9

How this works:

The essential idea is that we create a dictionary of first column items. So in the dictionary we'll have keys 1,2,3,4 and 5. Each corresponding value for dictionary item is a list of lists, where each sub-list corresponds with a column. So, for key 1 we would have a list with two lists, where first list is for all column 2 items, and second list is for all column 3 items. Basically, this:

('1', [ [10, 4, 40], [15, 22, 76] ])

Now, there is a very nice function called max(), which takes a list of numbers and returns the largest item. All we have to do is iterate over each key, take out its lists, and apply the max() function to each of them.
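
The per-column lists described above can also be obtained by collecting whole rows per key and transposing them with `zip()`. This is a variant sketch rather than the script from this answer; the helper names `rows_by_key` and `column_maxima` are hypothetical, and the values are assumed to be integers with the same number of fields in every row.

```python
from collections import defaultdict


def rows_by_key(lines):
    """Group the value rows (everything after the first field) by the first field."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        key, *values = line.split()
        groups[key].append([int(v) for v in values])
    return groups


def column_maxima(rows):
    """Transpose the rows into columns with zip(), then take max() of each column."""
    return [max(column) for column in zip(*rows)]


groups = rows_by_key(["1 10 15", "3 34 20", "1 4 22", "1 40 76"])
print(list(zip(*groups["1"])))     # [(10, 4, 40), (15, 22, 76)]
print(column_maxima(groups["1"]))  # [40, 76]
```

Like the script above, this keeps every value in memory until the end, which is simple but may matter for very large files.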

user218374 · Accepted Answer · 2017-07-06 05:45:28Z

2

perl -lane '
   $F[$_] > $A[$F[0]-1][$_] and $A[$F[0]-1][$_] = $F[$_] for 1 .. $#F}{
   print 1+$_, "@{$A[$_]}" for grep defined $A[$_], 0 .. $#A
' ifile.dat

Results

Working

The data structure involved is an `LoL` (list of lists) indexed by the column 1 value minus one, so this assumes that the column 1 data are positive integers.

@A = (
   [column_2_max_for_idx1, column_3_max_for_idx1, column_4_max_for_idx1, ...],
   [........],
);
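
The same list-of-lists layout can be sketched in Python to make the indexing explicit. This is not a translation of the Perl one-liner: the function name `lol_max` is hypothetical, all fields are assumed to be integers, and, unlike the Perl version, each slot is seeded from the first row seen for its key.

```python
def lol_max(lines):
    """Return a list where table[k - 1] holds the column maxima for key k (None if k is absent)."""
    table = []
    for line in lines:
        fields = [int(f) for f in line.split()]
        if not fields:
            continue  # skip blank lines
        key, values = fields[0], fields[1:]
        if key < 1:
            raise ValueError("column 1 must be a positive integer")
        while len(table) < key:
            table.append(None)  # grow the outer list up to the largest key seen
        slot = table[key - 1]
        if slot is None:
            table[key - 1] = values
        else:
            table[key - 1] = [max(a, b) for a, b in zip(slot, values)]
    return table


table = lol_max(["1 10 15", "3 34 20", "1 4 22", "3 13 23"])
for index, row in enumerate(table):
    if row is not None:  # keys that never appeared leave a gap
        print(index + 1, *row)
```

Indexing by key keeps the output in numeric key order for free, at the cost of allocating a slot for every integer up to the largest key.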

answered Jul 6, 2017 at 5:45

user218374
