3

I have tried to work on a solution for the following: I have a .gff3 file for which I want to replace gene headers into a simplified name. Both the original gene headers and the new gene name are given in a separate file, with the original name in column 1 and the new name in column 2. How can I use sed (I think sed is most suitable here) to replace all occurences in the .gff3 file with the new shortened name in the second column?

Example lines .gff3 file:

tulip_contig_65_pilon_pilon .   contig  1   93354   .   .   .   ID=tulip_contig_65_pilon_pilon;Name=tulip_contig_65_pilon_pilon
tulip_contig_65_pilon_pilon maker   gene    19497   23038   .   +   .   ID=maker-tulip_contig_65_pilon_pilon-augustus-gene-0.4;Name=maker-tulip_contig_65_pilon_pilon-augustus-gene-0.4
tulip_contig_65_pilon_pilon maker   mRNA    19497   23038   .   +   .   ID=maker-tulip_contig_65_pilon_pilon-augustus-gene-0.4-mRNA-1;Parent=maker-tulip_contig_65_pilon_pilon-augustus-gene-0.4;Name=maker-tulip_contig_65_pilon_pilon-augustus-gene-0.4-mRNA-1;_AED=0.00;_eAED=0.00;_QI=418|1|1|1|0|0|3|2100|206

Example lines replacement file:

augustus_masked-tulip_contig_306_pilon_pilon-processed-gene-0.1   gene1
maker-tulip_contig_306_pilon_pilon-augustus-gene-0.12 gene2
maker-tulip_contig_65_pilon_pilon-augustus-gene-0.4   gene3

expected outcome:

tulip_contig_65_pilon_pilon   .   contig  1   93354   .   .   .   ID=tulip_contig_65_pilon_pilon;Name=tulip_contig_65_pilon_pilon
tulip_contig_65_pilon_pilon   maker   gene    19497   23038   .   +   .   ID=gene3;Name=gene3
tulip_contig_65_pilon_pilon   maker   mRNA    19497   23038   .   +   .   ID=gene3-mRNA-1;Parent=gene3;Name=gene3-mRNA-1;_AED=0.00;_eAED=0.00;_QI=418|1|1|1|0|0|3|2100|206

I have tried to use:

while read -r pattern replacement; do sed -i "s/$pattern/$replacement/" file.gff3 ; done < rename.txt

But based on the answer below I am using AWK now instead. I use the script (the exact same indentation as given by Ed Morton but copying it here changes it slightly):

NR==FNR {
 map[$1] = $2
 next 
} 
{
 for (old in map) {
    gsub(old,map[old])
 }
 print 
}

To run I use:

awk -f tst.awk rename.txt original.gff3 > new.gff3 

However, this script works with partial regexp matching, while it should be fully matching. How can I change this awk script so it becomes full matching?

The gff file is 7369803 lines long. The rename.txt file is 18477 lines long.

Any advice is welcome here!

10
  • 1
    What did you try? Post your efforts even if they failed Commented Jun 17, 2020 at 12:27
  • Ah yes forgot that! I tried to use a loop but the main problem is that sed should replace the gene name twice if it is given twice per line but it doesn't. I am using: while read -r pattern replacement; do sed -i "s/$pattern/$replacement/" file.gff3 ; done < rename.txt Commented Jun 17, 2020 at 14:44
  • The other issue is that this is extremely slow. I have >18000 gene names to replace and it takes forever.. Commented Jun 17, 2020 at 15:04
  • edit your question to show the expected output for your posted sample input. Yes, that approach would take forever (see why-is-using-a-shell-loop-to-process-text-considered-bad-practice) and you should be using awk for this, not sed, so you might want to tag your question with awk. Commented Jun 19, 2020 at 21:19
  • You're looping 18,477 times once per line of the 7,369,803 lines file - that's 136,171,850,031 calls to gsub() so yeah, that'll take a while! Is there any way to identify lines in the .gff3 file that you don't need to perform replacements on or reduce how many replacements might be necessary on a given line? If not then the awk script you have now is the fastest way to do what you want. Commented Jun 22, 2020 at 13:16

3 Answers 3

5
+50

This does a full string match from after = to the end of -gene=<number> on each line of .gff3 and should be orders of magnitude faster and more robust than what we had been doing previously since it only replaces the 1-3 strings actually found in each line of the original.gff3 file rather than trying to replace all 18,000+ strings that exist in the rename.txt file:

$ cat tst.awk
NR==FNR {
    map[$1] = $2
    next
}
{
    head = ""
    tail = $0
    while ( match(tail,/((ID|Parent|Name)=)([^;]+-gene-[0-9]+\.[0-9]+)(.*)/,a) ) {
        old = a[3]
        head = head substr(tail,1,RSTART-1) a[1] (old in map ? map[old] : old)
        tail = a[4]
    }
    print head tail
}

.

$ awk -f tst.awk rename.txt original.gff3
tulip_contig_65_pilon_pilon .   contig  1   93354   .   .   .   ID=tulip_contig_65_pilon_pilon;Name=tulip_contig_65_pilon_pilon
tulip_contig_65_pilon_pilon maker   gene    19497   23038   .   +   .   ID=gene3;Name=gene3
tulip_contig_65_pilon_pilon maker   mRNA    19497   23038   .   +   .   ID=gene3-mRNA-1;Parent=gene3;Name=gene3-mRNA-1;_AED=0.00;_eAED=0.00;_QI=418|1|1|1|0|0|3|2100|206

It uses GNU awk for the 3rd arg to match() - I assume you have GNU awk available (or can install it) since you were using GNU sed.

So, the match() is isolating a string (that then gets stored in old) from the current line of original.gff3 that might be in rename.txt (as stored in map[] in the first block) and then the old in map is testing if that string actually is in rename.txt or not and, if so, replacing old with the corresponding new value from map[]. That's all inside a while that loops as long as match() keeps finding new strings that are candidates to be replaced on the current line.

So instead of the original awk script below (and sed script in your question) that loops once for every one of the 18,000+ lines in rename.txt, the above only loops once for every string in the current line of original.gff3 that might need to be replaced which, per your posted sample input, is only up to 3 times.


Original answer based on just speeding up your shell loop calling sed:

Something like this is what you need:

$ cat tst.awk
NR==FNR {
    map[$1] = $2
    next
}
{
    for (old in map) {
        gsub(old,map[old])
    }
    print
}

.

$ awk -f tst.awk repl.txt foo.gff3
tulip_contig_65_pilon_pilon .   contig  1   93354   .   .   .   ID=tulip_contig_65_pilon_pilon;Name=tulip_contig_65_pilon_pilon
tulip_contig_65_pilon_pilon maker   gene    19497   23038   .   +   .   ID=gene3;Name=gene3
tulip_contig_65_pilon_pilon maker   mRNA    19497   23038   .   +   .   ID=gene3-mRNA-1;Parent=gene3;Name=gene3-mRNA-1;_AED=0.00;_eAED=0.00;_QI=418|1|1|1|0|0|3|2100|206

There are some decisions about string vs regexp matching and full vs partial matching that also applied to your shell+sed loop so think about your full requirements and provide sample input/output to test with and then we can massage this to suit if this doesn't do exactly what you want. Right now it's doing partial regexp matching just like the sed command in your question would do.

Sign up to request clarification or add additional context in comments.

Comments

1

With perl, quite similar in logic as the awk answer

perl -ne 'if(!$#ARGV){ @n = split; $h{$n[0]}=$n[1] } else{ 
          s/(ID|Name|Parent)=\K[^;]+-gene-\d+\.\d+/exists $h{$&} ? $h{$&} : $&/ge;
          print }' rename.txt original.gff3
  • if(!$#ARGV) to process only the first file
  • @n = split to get key and value for renaming map
    • $h{$n[0]}=$n[1] the map
  • (ID|Name|Parent)=\K to mark the starting location, won't be part of matching portion
  • [^;]+-gene-\d+\.\d+ please check if all your genes follow this format, it cannot have ; character and always ends with - followed by gene followed by digits followed by . followed by digits, otherwise this will not work
  • exists $h{$&} check if the matched portion exists as a key in the map and if so, use the value from the map else don't change the gene name


I did some speed test, but it may be differ substantially for your real use case. Edit - just realized that t2 created below is useless as it will only have 3 different mappings.

$ perl -0777 -ne 'print $_ x 1000000' original.gff3 > t1
$ perl -0777 -ne 'print $_ x 10000' rename.txt > t2

$ time perl -ne 'if(!$#ARGV){ @n = split; $h{$n[0]}=$n[1] } else{ 
            s/(ID|Name|Parent)=\K[^;]+-gene-\d+\.\d+/exists $h{$&} ? $h{$&} : $&/ge;
            print }' t2 t1 > m1 
real    0m17.473s

$ time awk -f tst.awk t2 t1 > m2    
real    2m53.598s

$ # assuming your input is ASCII
$ time LC_ALL=C awk -f tst.awk t2 t1 > m3
real    1m53.976s

$ du -h t1 t2 m1
591M    t1
1.9M    t2
371M    m1

$ diff -sq m1 m2
Files m1 and m2 are identical
$ diff -sq m1 m3
Files m1 and m3 are identical

Comments

1

Shell script attempts will tend to be O(n×m), which for 7369803×18477 is huge.

I'm not about to write the code, but for something this large, I'd write an ad hoc program in a language such as C or python:

  • Add a line number to the beginning of each line of the data file.
  • Sort the data file on the keyword field.
  • Sort the rename file on the keyword field.
  • Walk through the two sorted files in parallel (the "ad_hoc" program):
    • if the map key sorts before the data key, read the next map record.
    • else:
      • if the keys match, make the substitution.
      • write out the data record
      • read the next data record
  • Sort the result by line number.
  • Strip off the line numbers.

The computational complexity will be entirely from the sort operations, O(n×log(n)), which will be incredibly faster than anything with nested loops. The ad_hoc program itself is linear, O(n+m), simply reading the two files and processing each line once, without any loops.

#!/bin/bash
#Assumes "%" does not occur in either file

ad_hoc \
<( sed <map  -e 's/\([^ ]*\)  *\([^ ]*\)$/\1%\2/' \
 | sort -t '%' -k 1,1 ) \
<( sed  <input  '/./=' \
 | sed '/./N; s/\n/ /' \
 | sed -e 's/ID=\([^;]*\);/ID=%\1%;/' \
 | sort -t '%' -k 2,2 ) \
| sort -t ' ' -k 1,1n \
| sed -e 's/^[0-9]* *//'

All you need is the "ad_hoc" program.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.