Using AWK to select rows containing any field with a value greater than Y

Question

I have a CSV file with the following records:

DATE,TAG,ID,METRIC_1,METRIC_2,METRIC_3,METRIC_4,METRIC_5,METRIC_6,METRIC_7,METRIC_8,METRIC_9,METRIC_A,METRIC_B,METRIC_C,METRIC_D,METRIC_E,METRIC_F,METRIC_G
2000-01-29,3PXI1,37681,1.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,0,0.00,0.00,0.00,0.00,0.00,0.00
2000-01-29,3PXI2,37682,20.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,0,0.00,0.00,0.00,0.00,0.00,0.00
2000-01-29,3PXI3,37683,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,0,0.00,0.00,0.00,0.00,0.00,0.00
2000-01-29,3PXI4,37684,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,0,0.00,0.00,0.00,0.00,0.00,0.00
2000-01-29,3PXI5,37685,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,1,22.37,23.91,0.00,0.00,0.00,0.00
2000-01-29,3PXI6,37686,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,1,30.00,40.14,0.00,0.00,0.00,0.00
2000-01-29,3PXI7,37687,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,0,0.00,0.00,0.00,0.00,0.00,0.00
2000-01-29,3PXI8,37688,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,0,0.00,0.00,0.00,0.00,0.00,0.00
2000-01-29,3PXI9,37689,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,0,0.00,0.00,0.00,0.00,0.00,0.00
2000-01-29,3PXJ0,37690,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,0,0.00,0.00,0.00,0.00,0.00,0.00
2000-01-29,3PXJ1,37691,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,1,25.00,51.13,0.00,0.00,0.00,0.00

The goal is to use AWK to print only the rows that have at least one metric value greater than zero:

2000-01-29,3PXI1,37681,1.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,0,0.00,0.00,0.00,0.00,0.00,0.00
2000-01-29,3PXI2,37682,20.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,0,0.00,0.00,0.00,0.00,0.00,0.00
2000-01-29,3PXI5,37685,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,1,22.37,23.91,0.00,0.00,0.00,0.00
2000-01-29,3PXI6,37686,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,1,30.00,40.14,0.00,0.00,0.00,0.00
2000-01-29,3PXJ1,37691,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,1,25.00,51.13,0.00,0.00,0.00,0.00

What I tried to do

awk -v FS=, 'NR!=1 {for(i=4; i<NF; i++) if($i>0)print$0;next}' file.csv

The output:

2000-01-29,3PXI1,37681,1.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,0,0.00,0.00,0.00,0.00,0.00,0.00
2000-01-29,3PXI2,37682,20.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,0,0.00,0.00,0.00,0.00,0.00,0.00
2000-01-29,3PXI5,37685,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,1,22.37,23.91,0.00,0.00,0.00,0.00
2000-01-29,3PXI5,37685,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,1,22.37,23.91,0.00,0.00,0.00,0.00
2000-01-29,3PXI5,37685,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,1,22.37,23.91,0.00,0.00,0.00,0.00
2000-01-29,3PXI6,37686,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,1,30.00,40.14,0.00,0.00,0.00,0.00
2000-01-29,3PXI6,37686,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,1,30.00,40.14,0.00,0.00,0.00,0.00
2000-01-29,3PXI6,37686,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,1,30.00,40.14,0.00,0.00,0.00,0.00
2000-01-29,3PXJ1,37691,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,1,25.00,51.13,0.00,0.00,0.00,0.00
2000-01-29,3PXJ1,37691,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,1,25.00,51.13,0.00,0.00,0.00,0.00
2000-01-29,3PXJ1,37691,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,1,25.00,51.13,0.00,0.00,0.00,0.00

I know it fails because the loop checks every column and prints the line once for each column that meets the condition, hence the duplicate records.

How can this be corrected to print a matching line once and then skip to the next line?

EDIT: here is the above code formatted legibly by gawk -o-:

NR != 1 {
        for (i = 4; i < NF; i++) {
                if ($i > 0) {
                        print $0
                }
        }
        next
}

Daweo · Accepted Answer · 2022-09-07 07:05:10Z

First, observe that

NR!=1 {for(i=4; i<NF; i++) if($i>0)print$0;next}

means that next is outside the for loop body, so it is executed only after the loop has completely finished. As this is your only pattern-action pair, it acts as a no-operation. Add {...} to tell GNU AWK what you actually want, that is, replace the above part with

NR!=1 {for(i=4; i<NF; i++){if($i>0){print$0;next}}}

then for

DATE,TAG,ID,METRIC_1,METRIC_2,METRIC_3,METRIC_4,METRIC_5,METRIC_6,METRIC_7,METRIC_8,METRIC_9,METRIC_A,METRIC_B,METRIC_C,METRIC_D,METRIC_E,METRIC_F,METRIC_G
2000-01-29,3PXI1,37681,1.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,0,0.00,0.00,0.00,0.00,0.00,0.00
2000-01-29,3PXI2,37682,20.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,0,0.00,0.00,0.00,0.00,0.00,0.00
2000-01-29,3PXI3,37683,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,0,0.00,0.00,0.00,0.00,0.00,0.00
2000-01-29,3PXI4,37684,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,0,0.00,0.00,0.00,0.00,0.00,0.00
2000-01-29,3PXI5,37685,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,1,22.37,23.91,0.00,0.00,0.00,0.00
2000-01-29,3PXI6,37686,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,1,30.00,40.14,0.00,0.00,0.00,0.00
2000-01-29,3PXI7,37687,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,0,0.00,0.00,0.00,0.00,0.00,0.00
2000-01-29,3PXI8,37688,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,0,0.00,0.00,0.00,0.00,0.00,0.00
2000-01-29,3PXI9,37689,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,0,0.00,0.00,0.00,0.00,0.00,0.00
2000-01-29,3PXJ0,37690,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,0,0.00,0.00,0.00,0.00,0.00,0.00
2000-01-29,3PXJ1,37691,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,1,25.00,51.13,0.00,0.00,0.00,0.00

you will get output

2000-01-29,3PXI1,37681,1.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,0,0.00,0.00,0.00,0.00,0.00,0.00
2000-01-29,3PXI2,37682,20.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,0,0.00,0.00,0.00,0.00,0.00,0.00
2000-01-29,3PXI5,37685,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,1,22.37,23.91,0.00,0.00,0.00,0.00
2000-01-29,3PXI6,37686,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,1,30.00,40.14,0.00,0.00,0.00,0.00
2000-01-29,3PXJ1,37691,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,1,25.00,51.13,0.00,0.00,0.00,0.00
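
To see the effect of the braces in isolation, here is a minimal sketch that runs both variants on a small made-up sample. The three rows and the function names sample, without_braces and with_braces are illustrative assumptions, not part of the question's data.

```shell
# Hypothetical sample: a header, a row with two positive fields, a row with none.
sample() {
  printf '%s\n' \
    'DATE,TAG,ID,M1,M2,M3' \
    'd,t1,1,5.00,7.00,0.00' \
    'd,t2,2,0.00,0.00,0.00'
}

# Original form: next sits after the loop, so the row is printed
# once for every field that passes the test.
without_braces() {
  sample | awk -F, 'NR!=1 {for(i=4; i<NF; i++) if($i>0)print$0;next}'
}

# Braced form: the row is printed once, then next abandons the
# rest of the loop and moves on to the following record.
with_braces() {
  sample | awk -F, 'NR!=1 {for(i=4; i<NF; i++){if($i>0){print$0;next}}}'
}

without_braces   # row t1 appears twice
with_braces      # row t1 appears once
```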

Also be warned that your code ignores the last field. If that is intended, leave it as it is; if it is a bug, use i<=NF as the loop condition.

(tested in gawk 4.2.1)
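
The warning about the last field can be checked with a short sketch. The single data row below is a made-up example whose only positive value sits in the last column, and the function names are hypothetical.

```shell
# Hypothetical row whose only positive value is in the last field.
last_field_row() {
  printf '%s\n' 'DATE,TAG,ID,M1,M2,M3' 'd,t1,1,0.00,0.00,9.50'
}

# i<NF stops one field short, so the row is not selected.
skip_last() {
  last_field_row | awk -F, 'NR!=1 {for(i=4; i<NF; i++){if($i>0){print$0;next}}}'
}

# i<=NF includes the last field, so the row is printed.
include_last() {
  last_field_row | awk -F, 'NR!=1 {for(i=4; i<=NF; i++){if($i>0){print$0;next}}}'
}

skip_last      # prints nothing
include_last   # prints the data row
```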

karakfa · Accepted Answer · 2022-09-07 13:01:37Z

$ awk -F, 'NR>1{for(i=4;i<=NF;i++) if($i>0) {print; next}}' file.csv

1 Comment

karakfa Over a year ago

Posted this as a canonical solution. Please see @Daweo's answer for explanations.
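
Since the title asks about "a value greater than Y", the same loop can take the threshold as a variable. This is only a sketch: the helper name select_rows, its two arguments and the sample rows are assumptions made for illustration.

```shell
# select_rows THRESHOLD FIRST_COLUMN < input.csv
# Hypothetical helper: prints each data row once if any field from
# FIRST_COLUMN onward is numerically greater than THRESHOLD.
select_rows() {
  awk -F, -v y="$1" -v first="$2" '
    NR > 1 {
      for (i = first; i <= NF; i++)
        # Adding 0 forces a numeric comparison on both sides.
        if ($i + 0 > y + 0) { print; next }
    }'
}

# Made-up rows: only t1 has a metric above 5.
printf '%s\n' 'DATE,TAG,ID,M1,M2' 'd,t1,1,20.00,0.00' 'd,t2,2,1.00,0.00' |
  select_rows 5 4
```

With a threshold of 0 and a first column of 4 this should behave like the one-liner above.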

RARE Kpop Manifesto · Accepted Answer · 2022-09-09 07:59:16Z

Compared to checking fields one at a time, it's less hassle to simply save $0, use a regex to scan the input line at high speed, and restore it only when positive values have been located:

{m,g}awk 'BEGIN { _^= FS = OFS = "," (__="") } substr(__, 

(___=$(_=__)) * ($++_=$++_=$++_=__), gsub(",(-[^,]+|[+-]?0([.]0*)?)",
               FS))^!_ == NR || /^[,]*$/ ? NF = __ : ($!NF = ___)^__'

2000-01-29,3PXI1,37681,1.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,0,0.00,0.00,0.00,0.00,0.00,0.00
2000-01-29,3PXI2,37682,20.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,0,0.00,0.00,0.00,0.00,0.00,0.00
2000-01-29,3PXI5,37685,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,1,22.37,23.91,0.00,0.00,0.00,0.00
2000-01-29,3PXI6,37686,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,1,30.00,40.14,0.00,0.00,0.00,0.00
2000-01-29,3PXJ1,37691,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,1,25.00,51.13,0.00,0.00,0.00,0.00

Ed Morton · Accepted Answer · 2022-09-10 18:21:22Z

You already have the answer to what was wrong with your script but consider this alternative to looping through all of your fields:

$ awk '/^([^,]+,){3}.*[^0.,]/' file
DATE,TAG,ID,METRIC_1,METRIC_2,METRIC_3,METRIC_4,METRIC_5,METRIC_6,METRIC_7,METRIC_8,METRIC_9,METRIC_A,METRIC_B,METRIC_C,METRIC_D,METRIC_E,METRIC_F,METRIC_G
2000-01-29,3PXI1,37681,1.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,0,0.00,0.00,0.00,0.00,0.00,0.00
2000-01-29,3PXI2,37682,20.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,0,0.00,0.00,0.00,0.00,0.00,0.00
2000-01-29,3PXI5,37685,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,1,22.37,23.91,0.00,0.00,0.00,0.00
2000-01-29,3PXI6,37686,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,1,30.00,40.14,0.00,0.00,0.00,0.00
2000-01-29,3PXJ1,37691,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,1,25.00,51.13,0.00,0.00,0.00,0.00

Just add NR>1 && to the start of the condition if you really don't want to print the header line:

$ awk 'NR>1 && /^([^,]+,){3}.*[^0.,]/' file
2000-01-29,3PXI1,37681,1.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,0,0.00,0.00,0.00,0.00,0.00,0.00
2000-01-29,3PXI2,37682,20.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,0,0.00,0.00,0.00,0.00,0.00,0.00
2000-01-29,3PXI5,37685,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,1,22.37,23.91,0.00,0.00,0.00,0.00
2000-01-29,3PXI6,37686,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,1,30.00,40.14,0.00,0.00,0.00,0.00
2000-01-29,3PXJ1,37691,0.00,0.00,0.00,0.00,0.00,0,0.00,0.00,0.00,1,25.00,51.13,0.00,0.00,0.00,0.00
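
One thing to keep in mind with the regex approach is that it tests characters rather than numbers, so it selects any row with a character other than 0, "." or "," after the third field. The sketch below compares it with the numeric loop on made-up rows that include a negative metric. The rows and function names are hypothetical, and the {3} repetition is spelled out as an assumption so that the sketch does not depend on interval-expression support in the awk being used.

```shell
# Hypothetical rows: one with a negative metric, one with a positive metric.
rows() {
  printf '%s\n' 'DATE,TAG,ID,M1,M2' 'd,t1,1,-3.00,0.00' 'd,t2,2,4.00,0.00'
}

# Character test: the "-" and the "3" in row t1 both count as a match.
by_regex() {
  rows | awk 'NR>1 && /^[^,]+,[^,]+,[^,]+,.*[^0.,]/'
}

# Numeric test: only fields greater than zero select a row.
by_value() {
  rows | awk -F, 'NR>1{for(i=4;i<=NF;i++) if($i>0) {print; next}}'
}

by_regex   # selects both t1 and t2
by_value   # selects only t2
```

For the data in the question, which contains only non-negative numbers, the two tests select the same rows.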
