Using awk to select rows with a specific value in column greater than x

Question

I tried to use awk to select all rows with a value greater than 98 in the third column. In the output, only lines with values between 98 and 98.99... were selected; lines with a value above 98.99 were not.

I would like to extract all lines with a value greater than 98 including 99, 100 and so on.

Here is my code and my input format:

for i in *input.file; do awk '$3>98' $i >${i/input./output.}; done

A   chr11   98.80   83  1   0   2   84

B   chr7    95.45   22  1   0   40  61

C   chr7    88.89   27  0   1   46  72

D   chr6    100.00  20  0   0   1   20

Expected Output

A   chr11   98.80   83  1   0   2   84

D   chr6    100.00  20  0   0   1   20

No, no... simply awk '$3 > 98' *input.file (which will use the default print to output)
– David C. Rankin, Commented Jun 23, 2020 at 7:12
Yes, you do it all with awk. No shell loop. Just awk '$3 > 98' *input.file. Is it the redirecting to output.file where you are confused?
– David C. Rankin, Commented Jun 23, 2020 at 7:17
awk '$3 > 98' *input.file didn't work, I got the same output :/
– gnikixam, Commented Jun 23, 2020 at 7:19

David C. Rankin · Accepted Answer · 2020-06-23 07:56:23Z


Okay, if you have a series of files, *input.file, and you want to select the lines where $3 > 98 and write them to a file with the same prefix but with output.file as the rest of the filename, you can use:

awk '$3 > 98 {
    match (FILENAME,/input.file$/)
    print $0 > substr(FILENAME,1,RSTART-1) "output.file"
}' *input.file

This uses match to find the index where input.file begins, then uses substr to get the part of the filename before that index and appends "output.file" to that substring to form the final output filename.

match() sets RSTART to the index where input.file begins in the current filename, which substr then uses to truncate the filename at that index. See GNU awk String Functions for complete details.
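To see the match/substr step on its own, here is a minimal sketch that runs it on a fixed string instead of FILENAME. The name v1input.file is taken from the example below; the shell variable out_name and the escaped dot in the regex are my own additions for illustration.

```shell
# Sketch: derive the output name from one sample input name.
out_name=$(awk 'BEGIN {
    name = "v1input.file"
    # match() returns the 1-based start of the match and also stores it in RSTART
    match(name, /input\.file$/)
    # keep everything before the match, then append the new suffix
    print substr(name, 1, RSTART - 1) "output.file"
}')
echo "$out_name"
```

Here RSTART is 3, so substr keeps the first two characters ("v1") before the new suffix is appended.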

For example, if you had input files:

$ ls -1 *input.file
v1input.file
v2input.file

Both with your example content:

$ cat v1input.file
A chr11 98.80 83 1 0 2 84

B chr7 95.45 22 1 0 40 61

C chr7 88.89 27 0 1 46 72

D chr6 100.00 20 0 0 1 20

Running the awk command above would result in two output files:

$ ls -1 *output.file
v1output.file
v2output.file

Containing the records where the third field was greater than 98:

$ cat v1output.file
A chr11 98.80 83 1 0 2 84
D chr6 100.00 20 0 0 1 20

edited Jun 23, 2020 at 7:56

answered Jun 23, 2020 at 7:33

David C. Rankin


2 Comments

Ed Morton Over a year ago

match (FILENAME,/input.file$/) and substr(FILENAME,1,RSTART-1) "output.file" are being evaluated once per output line for every file; for efficiency they only need to be evaluated once per output file (or once per input file would be OK). Not closing the output files as you go means your awk will fail with a "too many open files" error after creating a dozen or so output files unless you're using GNU awk. Input/output redirection to/from an unparenthesized expression is undefined behavior per POSIX, so it'll only work in some awks.

Ed Morton Over a year ago

So do this instead:

awk 'FNR==1{ close(out); match(FILENAME,/input\.file$/); out=substr(FILENAME,1,RSTART-1) "output.file" } $3 > 98{ print > out }' *input.file

for efficiency and portability to all awks.
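For anyone who wants to try this end to end, here is a rough sketch that builds two sample files in a scratch directory and runs the same logic spread over several lines. The scratch directory, the workdir variable, and the guard around close() are my own additions; the file names and rows come from the example in the answer.

```shell
# Work in a scratch directory so nothing existing is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two sample inputs holding the rows from the question.
printf '%s\n' \
  'A chr11 98.80 83 1 0 2 84' \
  'B chr7 95.45 22 1 0 40 61' \
  'C chr7 88.89 27 0 1 46 72' \
  'D chr6 100.00 20 0 0 1 20' > v1input.file
cp v1input.file v2input.file

awk '
FNR == 1 {                      # first record of each input file
    if (out != "") close(out)   # release the previous output file
    match(FILENAME, /input\.file$/)
    out = substr(FILENAME, 1, RSTART - 1) "output.file"
}
$3 > 98 { print > out }         # plain variable as the redirection target
' *input.file

cat v1output.file
```

Because the output name is built once per input file and stored in a variable, the redirection target is never an unparenthesized expression. Note that an input file with no matching rows produces no output file at all, since the file is only created on the first print.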
