
I have a set of tab-separated files with gene identifiers in the first column; each subsequent column represents an individual sample and holds the value for the gene named in column one. Here is a truncated example of one of my files with only a few samples:

DDR1 8.55578403700418 8.65526857898327 8.71701700266541 
MIR4640 8.55578403700418 8.65526857898327 8.71701700266541 
RFC2  5.47524925570941 5.88644077981836 5.77277342309348
HSPA6 4.12035662689116 4.01089068869244 3.82366440713502
PAX8  
GUCA1A   

I got some ideas from Awk adding constant values, Bash Script Awk if statements, and AWK if length statement append. Since I have several thousand rows and possibly hundreds of columns depending on the input file, I tried writing my script like this:

cd ../path/to/file

inputFile=inputFile.in
outputFile=outputFile.out

columnCount= $(awk -F"\t" 'NR==1 {print NF}' $inputFile)

awk '{ for (i = 1; i <= $columnCount; i++)

    if (i<$columnCount) {print $0"\t?"}' $inputFile > $outputFile
}'

but I keep getting syntax errors.

$ awk -f missingValueAdder.awk 
awk: missingValueAdder.awk:3: cd ../path/to/file
awk: missingValueAdder.awk:3:    ^ syntax error
awk: missingValueAdder.awk:5: inputFile=inputFile.in
awk: missingValueAdder.awk:5:                    ^ syntax error
awk: missingValueAdder.awk:6: outputFile=outputFile.out
awk: missingValueAdder.awk:6                       ^ syntax error
awk: missingValueAdder.awk:8: columnCount= $(awk -F"\t" 'NR==1 {print NF}' $inputFile) 
awk: missingValueAdder.awk:8:                           ^ invalid char ''' in expression

So I tried this one-liner

 awk 'for (i=1;i<=NF;i++) BEGIN{FS=OFS="\t"} I<NF{print$0"\t?"}' inputFile.in > outputFile.out

but I got another syntax error starting at my for loop. Anyway, my output file should look like this:

DDR1 8.55578403700418 8.65526857898327 8.71701700266541 
MIR4640 8.55578403700418 8.65526857898327 8.71701700266541 
RFC2  5.47524925570941 5.88644077981836 5.77277342309348
HSPA6 4.12035662689116 4.01089068869244 3.82366440713502
PAX8    ?   ?   ? 
GUCA1A  ?   ?   ?

I want to print as many "?" as there are sample columns, as dictated by NF on the complete rows (in this case 3, but it could be as many as 100). Any help would be most appreciated! Thanks

  • Your script is a shell script, not an awk script. Commented May 16, 2016 at 17:02
  • Read the books Shell Scripting Recipes by Chris Johnson, and Effective Awk Programming, 4th Edition, by Arnold Robbins. Commented May 16, 2016 at 17:03
  • @GreysonB you say your file is tab separated. Do the lines with PAX8 and GUCA1A also have the required number of tabs, e.g. in the example three tabs after the gene name? Commented May 16, 2016 at 17:10
  • @LarsFischer good question. Lines such as PAX8 have no additional tabs after the first column. Commented May 16, 2016 at 17:19
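
To illustrate the first comment: cd, inputFile=... and columnCount=$(...) are shell syntax, so the file has to be run by a shell rather than with awk -f. The following is only a sketch of how the original attempt might be restructured, assuming line 1 carries the full column count. The function name pad_missing is made up for illustration; the shell value is handed to awk with -v because awk does not expand $columnCount inside single quotes.

```shell
#!/bin/sh
# pad_missing FILE: hypothetical helper; pads short rows with "?" up to the
# number of tab-separated fields found on line 1 of FILE.
pad_missing() {
    # No space is allowed after "=" in a shell assignment.
    columnCount=$(awk -F'\t' 'NR==1 {print NF; exit}' "$1")

    # -v copies the shell value into the awk variable "cols".
    awk -F'\t' -v OFS='\t' -v cols="$columnCount" '
        { for (i = NF + 1; i <= cols; i++) $i = "?" }  # append missing fields
        1                                              # print every line
    ' "$1"
}

# Usage: pad_missing inputFile.in > outputFile.out
```

Because -F'\t' splits on tabs only, gene names containing spaces would stay intact in this variant.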

3 Answers

If you want to assume that the maximum number of fields in the file occurs on line 1, do this:

$ awk -v OFS="\t" 'NR==1 {cols=NF} {$1=$1; for (i=NF+1; i <= cols; i++) $i = "?"} 1' file
DDR1    8.55578403700418    8.65526857898327    8.71701700266541
MIR4640 8.55578403700418    8.65526857898327    8.71701700266541
RFC2    5.47524925570941    5.88644077981836    5.77277342309348
HSPA6   4.12035662689116    4.01089068869244    3.82366440713502
PAX8    ?   ?   ?
GUCA1A  ?   ?   ?

The strange $1=$1 bit forces awk to rewrite $0 using the new OFS for every line, even if no new fields are added by the for loop.
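
The effect of $1=$1 can be seen in isolation. This is a small sketch with made-up input and a visible "-" as OFS: without the assignment awk prints the record as it was read, and with it the record is rebuilt using OFS.

```shell
# Without an assignment to a field, $0 is printed as read; OFS is not applied.
unchanged=$(printf 'a b  c\n' | awk -v OFS='-' '1')

# Assigning to any field (even to itself) makes awk rebuild $0 with OFS.
rebuilt=$(printf 'a b  c\n' | awk -v OFS='-' '{$1=$1} 1')

printf '%s\n%s\n' "$unchanged" "$rebuilt"
# prints:
# a b  c
# a-b-c
```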

If the maximum number of fields does not necessarily occur on line 1, then you can process the file twice: once to find the maximum field count, and once to add the placeholder fields:

awk -v OFS="\t" '
    NR == 1 {cols = NF}
    NR == FNR {if (NF>cols) cols=NF; next} 
    {$1=$1; for (i=NF+1; i <= cols; i++) $i = "?"} 
    1
' file file
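
Naming the file twice does not work when the input arrives on a pipe. One possible alternative, sketched here under the assumption that the whole input fits in memory, keeps the lines in an array and pads them in an END block. The function name pad_stream is hypothetical.

```shell
# pad_stream: hypothetical helper; reads stdin, pads every row with "?" up to
# the widest row seen anywhere in the input.
pad_stream() {
    awk -v OFS='\t' '
        {
            $1 = $1                       # rebuild $0 with tab separators
            line[NR] = $0                 # remember the line ...
            nf[NR] = NF                   # ... and its field count
            if (NF > cols) cols = NF      # track the widest row
        }
        END {
            for (n = 1; n <= NR; n++) {
                out = line[n]
                for (i = nf[n] + 1; i <= cols; i++) out = out OFS "?"
                print out
            }
        }
    '
}

# Usage: some_command | pad_stream > outputFile.out
```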

1 Comment

nice touch pal :)

Here is my take:

script.awk

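# Line 1: build a template with one "\t?" per value column.
NR==1 { for(i=2;i<=NF;i++) tmp=tmp "\t?" }
# Name-only lines get the template appended; tmp already starts with a tab,
# so concatenate instead of using a comma (which would insert OFS, a space).
{ if (NF==1) print $1 tmp
  else print }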

Use it like this: awk -f script.awk yourfile

  • The first rule uses the field count of line 1 to build the template that is appended to lines containing only the name.
  • The second rule prints either the line unchanged or the name followed by the template.


Input

DDR1 8.55578403700418 8.65526857898327 8.71701700266541
MIR4640 8.55578403700418 8.65526857898327 8.71701700266541
RFC2  5.47524925570941 5.88644077981836 5.77277342309348
HSPA6 4.12035662689116 4.01089068869244 3.82366440713502
PAX8
GUCA1A

AWK Script

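awk '{
       if($0!=$1){
         # Line already has values: print it unchanged.
         printf "%s\n",$0
        }
        else{
         # Line holds only the gene name: append three placeholders
         # (hard-coded; adjust to the number of sample columns).
         printf "%s\t?\t?\t?\n",$1
        }
     }' yourfilename > temp && mv temp yourfilename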

Output

DDR1 8.55578403700418 8.65526857898327 8.71701700266541 
MIR4640 8.55578403700418 8.65526857898327 8.71701700266541 
RFC2  5.47524925570941 5.88644077981836 5.77277342309348
HSPA6 4.12035662689116 4.01089068869244 3.82366440713502
PAX8    ?   ?   ?   
GUCA1A  ?   ?   ?

GNU-Sed one liner for the above

sed -i 's/^\([[:alnum:]]*\)$/\1\t?\t?\t?/' yourfilename
