1

This seems to be a very trivial question but I don't have enough experience using grep and echo to answer it by myself. I have looked here and here with no success.

I have a file that starts like this (.gff file) with over 1,000,000 lines.

NW_007577731.1  RefSeq  region  1   3345205 .   +   .   ID=id0;Dbxref=taxon:144197;Name=Unknown;chromosome=Unknown;collection-date=16-Aug-2005;country=USA: Emerald Reef%2C Florida;gbkey=Src;genome=genomic;isolate=25-593;lat-lon=25.6748 N 80.0982 W;mol_type=genomic DNA;sex=male
NW_007577731.1  Gnomon  gene    7982    24854   .   -   .   ID=gene0;Dbxref=GeneID:103352799;Name=LOC103352799;gbkey=Gene;gene=LOC103352799;gene_biotype=protein_coding
NW_007577731.1  Gnomon  mRNA    7982    24854   .   -   .   ID=rna0;Parent=gene0;Dbxref=GeneID:103352799,Genbank:XM_008279367.1;Name=XM_008279367.1;gbkey=mRNA;gene=LOC103352799;model_evidence=Supporting evidence includes similarity to: 22 Proteins%2C and 73%25 coverage of the annotated genomic feature by RNAseq alignments;product=homer protein homolog 3-like;transcript_id=XM_008279367.1
NW_007577731.1  RefSeq  region  1   3345205 .   +   .   ID=id0;Dbxref=taxon:144197;Name=Unknown;chromosome=Unknown;collection-date=16-Aug-2005;country=USA: Emerald Reef%2C Florida;gbkey=Src;genome=genomic;isolate=25-593;lat-lon=25.6748 N 80.0982 W;mol_type=genomic DNA;sex=male
NW_007577731.1  Gnomon  gene    7982    24854   .   -   .   ID=gene0;Dbxref=GeneID:103352799;Name=LOC103352799;gbkey=Gene;gene=LOC103352799;gene_biotype=protein_coding
NW_007577731.1  Gnomon  mRNA    7982    24854   .   -   .   ID=rna0;Parent=gene0;Dbxref=GeneID:103352799,Genbank:XM_008279367.1;Name=XM_008279367.1;gbkey=mRNA;gene=LOC103352799;model_evidence=Supporting evidence includes similarity to: 22 Proteins%2C and 73%25 coverage of the annotated genomic feature by RNAseq alignments;product=homer protein homolog 3-like;transcript_id=XM_008279367.1

I want to grep on lines containing mRNA in the third column to get this tab-separated output (values in the fields gene=,product=,transcript_id=).

LOC103352799    homer protein homolog 3-like    XM_008279367.1
LOC103352799    homer protein homolog 3-like    XM_008279367.1

With an outrageous lack of elegancy, I can get the 3 columns separately using

grep "mRNA\t" myfile.gff|sed s/gene=/@/|cut -f2 -d"@" |cut -f1 -d";"
grep "mRNA\t" myfile.gff|sed s/product=/@/|cut -f2 -d"@" |cut -f1 -d";"
grep "mRNA\t" myfile.gff|sed s/transcript_id=/@/|cut -f2 -d"@" |cut -f1 -d";"

But how can append on the same line the outputs of these 3 commands? I have tried

echo -e "`grep "mRNA\t" myfile.gff|sed s/gene=/@/|cut -f2 -d"@" |cut -f1 -d";"`\t`grep "mRNA\t" myfile.gff|sed s/product=/@/|cut -f2 -d"@" |cut -f1 -d";"`\t`grep "mRNA\t" myfile.gff|sed s/transcript_id=/@/|cut -f2 -d"@" |cut -f1 -d";"`"

But here is the output:

LOC103352799
LOC103352799    homer protein homolog 3-like
homer protein homolog 3-like    XM_008279367.1
XM_008279367.1

Many thanks for your help!

2
  • Try "echo -n" (do not append a newline) Commented Aug 1, 2017 at 8:59
  • I get the same output ;) Commented Aug 1, 2017 at 9:02

2 Answers 2

2

Using awk:

$ awk 'BEGIN {
    FS=OFS="\t"                       # field separators to tab
    k="gene,product,transcript_id"    # keyword list
    split(k,a,",")                    # split keywords to a hash for matching
    for(i in a)                       # values to keys
        p[a[i]]
}
$3=="mRNA" {
    b=""                              # reset buffer b
    split($9,a,"[=;]")                # split the data to a hash
    for(i in a)                       # iterate and search
        if(a[i] in p)                 # ... for keywords, if match, 
            b=b (b==""?"":OFS) a[i+1] # ... value is the next, buffer
    print b                           # output buffer
}' file
LOC103352799    homer protein homolog 3-like    XM_008279367.1
LOC103352799    homer protein homolog 3-like    XM_008279367.1
Sign up to request clarification or add additional context in comments.

2 Comments

Awk one-liners... brrr!
@tlorin They say that the ones-who-control-the-comets control the comets with awk one-liners...
0

Talking about oneliners, here's one in sed:

sed -nE '/\tmRNA\t/ { s/.*gene=([^;]+).*product=([^;]+).*transcript_id=([^;]+)/\1\t\2\t\3/g;p }' file

The only assumption is a fixed order of gene, product, and transcript_id fields. This could be fixed with some alternations, but on the account of regex readability.

5 Comments

Here is what I get: sed: 1: "/\tmRNA\t/ { s/.*gene=( ...": bad flag in substitute command: '}'
Sorry, I wrote for GNU sed, fixed now. You can try again.
sed: 1: "/\tmRNA\t/ { s/.*gene=( ...": extra characters at the end of p command
Which sed are you using?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.