Awk adding variable number of missing values

Question

I have a set of tab-separated files with gene identifiers in the first column; each subsequent column represents an individual sample and holds that sample's value for the gene in column one. Here is a truncated example of one of my files with only a few samples:

DDR1 8.55578403700418 8.65526857898327 8.71701700266541 
MIR4640 8.55578403700418 8.65526857898327 8.71701700266541 
RFC2  5.47524925570941 5.88644077981836 5.77277342309348
HSPA6 4.12035662689116 4.01089068869244 3.82366440713502
PAX8  
GUCA1A

I got some ideas from Awk adding constant values, Bash Script Awk if statements, and AWK if length statement append. Since I have several thousand rows and possibly hundreds of columns depending on the input file, I tried writing my script like this:

cd ../path/to/file

inputFile=inputFile.in
outputFile=outputFile.out

columnCount= $(awk -F"\t" 'NR==1 {print NF}' $inputFile)

awk '{ for (i = 1; i <= $columnCount; i++)

    if (i<$columnCount) {print $0"\t?"}' $inputFile > $outputFile
}'

but I keep getting syntax errors.

$ awk -f missingValueAdder.awk 
awk: missingValueAdder.awk:3: cd ../path/to/file
awk: missingValueAdder.awk:3:    ^ syntax error
awk: missingValueAdder.awk:5: inputFile=inputFile.in
awk: missingValueAdder.awk:5:                    ^ syntax error
awk: missingValueAdder.awk:6: outputFile=outputFile.out
awk: missingValueAdder.awk:6                       ^ syntax error
awk: missingValueAdder.awk:8: columnCount= $(awk -F"\t" 'NR==1 {print NF}' $inputFile) 
awk: missingValueAdder.awk:8:                           ^ invalid char ''' in expression

So I tried this one-liner

 awk 'for (i=1;i<=NF;i++) BEGIN{FS=OFS="\t"} I<NF{print$0"\t?"}' inputFile.in > outputFile.out

but I got another syntax error starting at my for loop. Anyways, my output file should look like

DDR1 8.55578403700418 8.65526857898327 8.71701700266541 
MIR4640 8.55578403700418 8.65526857898327 8.71701700266541 
RFC2  5.47524925570941 5.88644077981836 5.77277342309348
HSPA6 4.12035662689116 4.01089068869244 3.82366440713502
PAX8    ?   ?   ? 
GUCA1A  ?   ?   ?

I want to print as many "?" as dictated by NF (In this case 3, but could be as many as 100). Any help would be most appreciated! Thanks

Read the books Shell Scripting Recipes by Chris Johnson, and Effective Awk Programming, 4th Edition, by Arnold Robbins. — Ed Morton
– Ed Morton, Commented May 16, 2016 at 17:03
@GreysonB you say your file is tab separated. Do the lines with PAX8 and GUCA1A also have the required number of tabs, e.g. in the example, three tabs after the gene name? — Lars Fischer
– Lars Fischer, Commented May 16, 2016 at 17:10
@LarsFischer good question. The lines such as PAX8 have no additional tabs after the first column. — Greyson B
– Greyson B, Commented May 16, 2016 at 17:19

glenn jackman · Accepted Answer · 2016-05-16 17:19:29Z

4

If you want to assume that the maximum number of fields in the file occurs on line 1, do this:

$ awk -v OFS="\t" 'NR==1 {cols=NF} {$1=$1; for (i=NF+1; i <= cols; i++) $i = "?"} 1' file
DDR1    8.55578403700418    8.65526857898327    8.71701700266541
MIR4640 8.55578403700418    8.65526857898327    8.71701700266541
RFC2    5.47524925570941    5.88644077981836    5.77277342309348
HSPA6   4.12035662689116    4.01089068869244    3.82366440713502
PAX8    ?   ?   ?
GUCA1A  ?   ?   ?

The strange $1=$1 bit forces awk to rewrite $0 using the new OFS for every line, even if no new fields are added by the for loop.
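To see the effect in isolation, here is a minimal sketch on a made-up, space-separated line. The sample values and the variable names `line`, `unchanged`, and `rebuilt` are illustrative assumptions, not part of the original data:

```shell
# Hypothetical sample line: fields separated by single spaces.
line='DDR1 8.5 8.6 8.7'

# No field is modified, so awk prints $0 exactly as read and OFS is ignored.
unchanged=$(printf '%s\n' "$line" | awk -v OFS='\t' '1')

# Assigning to a field (even to itself) makes awk rebuild $0 using OFS.
rebuilt=$(printf '%s\n' "$line" | awk -v OFS='\t' '{ $1 = $1 } 1')

printf '%s\n' "$unchanged" "$rebuilt"
```

The first result should keep its single spaces, while the second should come back tab-separated.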

If the maximum number of fields does not necessarily occur on line 1, then you can process the file twice: once to find the max num; once to add the field placeholders:

awk -v OFS="\t" '
    NR == 1 {cols = NF}
    NR == FNR {if (NF>cols) cols=NF; next} 
    {$1=$1; for (i=NF+1; i <= cols; i++) $i = "?"} 
    1
' file file
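As a rough worked example of the two-pass idea, the sketch below builds a small, made-up tab-separated file in which the widest row is the second line rather than the first. The file contents and the names `sample` and `padded` are assumptions for illustration; the awk program also sets `-F'\t'` explicitly and drops the `NR == 1` rule, since the `NR == FNR` rule already covers line 1:

```shell
# Hypothetical sample: the widest row (4 fields) is line 2, not line 1.
sample=$(mktemp)
printf 'PAX8\nDDR1\t8.5\t8.6\t8.7\nRFC2\t5.4\n' > "$sample"

# Pass 1 (NR == FNR) records the largest NF seen in the file.
# Pass 2 pads every shorter row with "?" up to that width.
padded=$(awk -F'\t' -v OFS='\t' '
    NR == FNR { if (NF > cols) cols = NF; next }
    { $1 = $1; for (i = NF + 1; i <= cols; i++) $i = "?" }
    1
' "$sample" "$sample")

printf '%s\n' "$padded"
rm -f "$sample"
```

Naming the file twice on the command line is what makes awk read it twice; `NR == FNR` is true only while the first copy is being read.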

answered May 16, 2016 at 17:19

glenn jackman


1 Comment

sjsam Over a year ago

nice touch pal :)

Lars Fischer · Accepted Answer · 2016-05-16 17:24:49Z

0

Here is my take:

script.awk

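# Build the placeholder template once, from the field count of line 1.
NR==1 { for(i=2;i<=NF;i++) tmp=tmp "\t?" }
# Concatenate name and template directly so that no extra space (the default OFS) is inserted.
{ if (NF==1) print $1 tmp
  else print }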

Use it like this: awk -f script.awk yourfile

The first action uses the field count of line 1 to build a template of placeholders for the lines that contain only the gene name.
The second action prints complete lines unchanged and prints name-only lines followed by the template.
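A small end-to-end sketch of this template approach, run on made-up tab-separated data, might look as follows. The file contents and the names `sample` and `filled` are assumptions for illustration, and `-F'\t'` is added here so that fields are split on tabs only:

```shell
# Hypothetical tab-separated sample: line 1 has 4 fields, so the template holds 3 placeholders.
sample=$(mktemp)
printf 'DDR1\t8.5\t8.6\t8.7\nPAX8\n' > "$sample"

filled=$(awk -F'\t' '
    NR == 1 { for (i = 2; i <= NF; i++) tmp = tmp "\t?" }   # build "\t?\t?\t?" once
    NF == 1 { print $1 tmp; next }                           # name-only row: append the template
    { print }                                                # any other row: pass through unchanged
' "$sample")

printf '%s\n' "$filled"
rm -f "$sample"
```

Note that this variant only fills rows that have no values at all; a row with some but not all values is passed through as is.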

answered May 16, 2016 at 17:24

Lars Fischer


sjsam · Accepted Answer · 2016-05-16 17:35:10Z

Input

DDR1 8.55578403700418 8.65526857898327 8.71701700266541
MIR4640 8.55578403700418 8.65526857898327 8.71701700266541
RFC2  5.47524925570941 5.88644077981836 5.77277342309348
HSPA6 4.12035662689116 4.01089068869244 3.82366440713502
PAX8
GUCA1A

AWK Script

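awk '{
    if (NF > 1) {
        # Row already has values: print it unchanged.
        print
    } else {
        # Name-only row: append three placeholders (the count is hard-coded).
        printf "%s\t?\t?\t?\n", $1
    }
}' yourfilename > temp && mv temp yourfilename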

Output

DDR1 8.55578403700418 8.65526857898327 8.71701700266541 
MIR4640 8.55578403700418 8.65526857898327 8.71701700266541 
RFC2  5.47524925570941 5.88644077981836 5.77277342309348
HSPA6 4.12035662689116 4.01089068869244 3.82366440713502
PAX8    ?   ?   ?   
GUCA1A  ?   ?   ?

GNU sed one-liner for the above

sed -i 's/^\([[:alnum:]]*\)$/\1\t?\t?\t?/' yourfilename
