Multiple grep output on same line

Question

This seems to be a very trivial question but I don't have enough experience using grep and echo to answer it by myself. I have looked here and here with no success.

I have a file that starts like this (.gff file) with over 1,000,000 lines.

NW_007577731.1  RefSeq  region  1   3345205 .   +   .   ID=id0;Dbxref=taxon:144197;Name=Unknown;chromosome=Unknown;collection-date=16-Aug-2005;country=USA: Emerald Reef%2C Florida;gbkey=Src;genome=genomic;isolate=25-593;lat-lon=25.6748 N 80.0982 W;mol_type=genomic DNA;sex=male
NW_007577731.1  Gnomon  gene    7982    24854   .   -   .   ID=gene0;Dbxref=GeneID:103352799;Name=LOC103352799;gbkey=Gene;gene=LOC103352799;gene_biotype=protein_coding
NW_007577731.1  Gnomon  mRNA    7982    24854   .   -   .   ID=rna0;Parent=gene0;Dbxref=GeneID:103352799,Genbank:XM_008279367.1;Name=XM_008279367.1;gbkey=mRNA;gene=LOC103352799;model_evidence=Supporting evidence includes similarity to: 22 Proteins%2C and 73%25 coverage of the annotated genomic feature by RNAseq alignments;product=homer protein homolog 3-like;transcript_id=XM_008279367.1
NW_007577731.1  RefSeq  region  1   3345205 .   +   .   ID=id0;Dbxref=taxon:144197;Name=Unknown;chromosome=Unknown;collection-date=16-Aug-2005;country=USA: Emerald Reef%2C Florida;gbkey=Src;genome=genomic;isolate=25-593;lat-lon=25.6748 N 80.0982 W;mol_type=genomic DNA;sex=male
NW_007577731.1  Gnomon  gene    7982    24854   .   -   .   ID=gene0;Dbxref=GeneID:103352799;Name=LOC103352799;gbkey=Gene;gene=LOC103352799;gene_biotype=protein_coding
NW_007577731.1  Gnomon  mRNA    7982    24854   .   -   .   ID=rna0;Parent=gene0;Dbxref=GeneID:103352799,Genbank:XM_008279367.1;Name=XM_008279367.1;gbkey=mRNA;gene=LOC103352799;model_evidence=Supporting evidence includes similarity to: 22 Proteins%2C and 73%25 coverage of the annotated genomic feature by RNAseq alignments;product=homer protein homolog 3-like;transcript_id=XM_008279367.1

I want to grep on lines containing mRNA in the third column to get this tab-separated output (values in the fields gene=,product=,transcript_id=).

LOC103352799    homer protein homolog 3-like    XM_008279367.1
LOC103352799    homer protein homolog 3-like    XM_008279367.1

With an outrageous lack of elegancy, I can get the 3 columns separately using

grep "mRNA\t" myfile.gff|sed s/gene=/@/|cut -f2 -d"@" |cut -f1 -d";"
grep "mRNA\t" myfile.gff|sed s/product=/@/|cut -f2 -d"@" |cut -f1 -d";"
grep "mRNA\t" myfile.gff|sed s/transcript_id=/@/|cut -f2 -d"@" |cut -f1 -d";"

But how can append on the same line the outputs of these 3 commands? I have tried

echo -e "`grep "mRNA\t" myfile.gff|sed s/gene=/@/|cut -f2 -d"@" |cut -f1 -d";"`\t`grep "mRNA\t" myfile.gff|sed s/product=/@/|cut -f2 -d"@" |cut -f1 -d";"`\t`grep "mRNA\t" myfile.gff|sed s/transcript_id=/@/|cut -f2 -d"@" |cut -f1 -d";"`"

But here is the output:

LOC103352799
LOC103352799    homer protein homolog 3-like
homer protein homolog 3-like    XM_008279367.1
XM_008279367.1

Many thanks for your help!

Try "echo -n" (do not append a newline)

netizen
– netizen

2017-08-01 08:59:37 +00:00
Commented Aug 1, 2017 at 8:59 — netizen
– netizen, Commented Aug 1, 2017 at 8:59
I get the same output ;)

tlorin
– tlorin

2017-08-01 09:02:40 +00:00
Commented Aug 1, 2017 at 9:02 — tlorin
– tlorin, Commented Aug 1, 2017 at 9:02

James Brown · Accepted Answer · 2017-08-01 09:27:27Z

2

Using awk:

$ awk 'BEGIN {
    FS=OFS="\t"                       # field separators to tab
    k="gene,product,transcript_id"    # keyword list
    split(k,a,",")                    # split keywords to a hash for matching
    for(i in a)                       # values to keys
        p[a[i]]
}
$3=="mRNA" {
    b=""                              # reset buffer b
    split($9,a,"[=;]")                # split the data to a hash
    for(i in a)                       # iterate and search
        if(a[i] in p)                 # ... for keywords, if match, 
            b=b (b==""?"":OFS) a[i+1] # ... value is the next, buffer
    print b                           # output buffer
}' file
LOC103352799    homer protein homolog 3-like    XM_008279367.1
LOC103352799    homer protein homolog 3-like    XM_008279367.1

edited Aug 1, 2017 at 9:27

answered Aug 1, 2017 at 9:23

James Brown

37.7k8 gold badges52 silver badges64 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

tlorin Over a year ago

Awk one-liners... brrr!

James Brown Over a year ago

@tlorin They say that the ones-who-control-the-comets control the comets with awk one-liners...

randomir · Accepted Answer · 2017-08-01 14:57:38Z

0

Talking about oneliners, here's one in sed:

sed -nE '/\tmRNA\t/ { s/.*gene=([^;]+).*product=([^;]+).*transcript_id=([^;]+)/\1\t\2\t\3/g;p }' file

The only assumption is a fixed order of gene, product, and transcript_id fields. This could be fixed with some alternations, but on the account of regex readability.

edited Aug 1, 2017 at 14:57

answered Aug 1, 2017 at 14:44

randomir

18.8k1 gold badge46 silver badges60 bronze badges

5 Comments

tlorin Over a year ago

Here is what I get: sed: 1: "/\tmRNA\t/ { s/.*gene=( ...": bad flag in substitute command: '}'

randomir Over a year ago

Sorry, I wrote for GNU sed, fixed now. You can try again.

tlorin Over a year ago

sed: 1: "/\tmRNA\t/ { s/.*gene=( ...": extra characters at the end of p command

randomir Over a year ago

Which sed are you using?

tlorin Over a year ago

(I'm on a Mac: stackoverflow.com/questions/37639496/… )

Collectives™ on Stack Overflow

Multiple grep output on same line

2 Answers 2

2 Comments

5 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

5 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related