I am trying to solve the following: I have a .gff3 file in which I want to replace the gene headers with simplified names. The original gene headers and the new gene names are given in a separate file, with the original name in column 1 and the new name in column 2. How can I use sed (I think sed is most suitable here) to replace all occurrences in the .gff3 file with the new, shortened name from the second column?
Example lines from the .gff3 file:
tulip_contig_65_pilon_pilon . contig 1 93354 . . . ID=tulip_contig_65_pilon_pilon;Name=tulip_contig_65_pilon_pilon
tulip_contig_65_pilon_pilon maker gene 19497 23038 . + . ID=maker-tulip_contig_65_pilon_pilon-augustus-gene-0.4;Name=maker-tulip_contig_65_pilon_pilon-augustus-gene-0.4
tulip_contig_65_pilon_pilon maker mRNA 19497 23038 . + . ID=maker-tulip_contig_65_pilon_pilon-augustus-gene-0.4-mRNA-1;Parent=maker-tulip_contig_65_pilon_pilon-augustus-gene-0.4;Name=maker-tulip_contig_65_pilon_pilon-augustus-gene-0.4-mRNA-1;_AED=0.00;_eAED=0.00;_QI=418|1|1|1|0|0|3|2100|206
Example lines from the replacement file:
augustus_masked-tulip_contig_306_pilon_pilon-processed-gene-0.1 gene1
maker-tulip_contig_306_pilon_pilon-augustus-gene-0.12 gene2
maker-tulip_contig_65_pilon_pilon-augustus-gene-0.4 gene3
Expected outcome:
tulip_contig_65_pilon_pilon . contig 1 93354 . . . ID=tulip_contig_65_pilon_pilon;Name=tulip_contig_65_pilon_pilon
tulip_contig_65_pilon_pilon maker gene 19497 23038 . + . ID=gene3;Name=gene3
tulip_contig_65_pilon_pilon maker mRNA 19497 23038 . + . ID=gene3-mRNA-1;Parent=gene3;Name=gene3-mRNA-1;_AED=0.00;_eAED=0.00;_QI=418|1|1|1|0|0|3|2100|206
I have tried to use:
while read -r pattern replacement; do sed -i "s/$pattern/$replacement/" file.gff3 ; done < rename.txt
Based on the answer below, I am now using awk instead. I use the following script (with the same indentation as given by Ed Morton, although copying it here changes it slightly):
NR==FNR {
    map[$1] = $2
    next
}
{
    for (old in map) {
        gsub(old,map[old])
    }
    print
}
To run it I use:
awk -f tst.awk rename.txt original.gff3 > new.gff3
However, this script does partial regexp matching, so an old name is also replaced when it occurs inside a longer name, whereas it should only replace names that match in full. How can I change this awk script so that it matches full names only?
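To make the partial matching concrete, here is a minimal illustration. The shortened names in it (maker-x-gene-0.1, maker-x-gene-0.12) are hypothetical and not taken from the real data. gsub() treats its first argument as an unanchored regular expression, so a name that is a prefix of a longer name also matches inside the longer one:

```shell
#!/bin/sh
# Hypothetical names: "maker-x-gene-0.1" is a prefix of "maker-x-gene-0.12".
# The intended mapping here would be maker-x-gene-0.1 -> gene1 only.
out=$(printf 'ID=maker-x-gene-0.12;Name=maker-x-gene-0.12\n' |
      awk '{ gsub("maker-x-gene-0.1", "gene1"); print }')

# The longer name is rewritten as well, leaving its trailing "2" behind.
printf '%s\n' "$out"    # prints: ID=gene12;Name=gene12
```

In addition, the "." in the old name is a regexp metacharacter that matches any character, so the pattern is looser than the literal name.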
The gff file is 7369803 lines long. The rename.txt file is 18477 lines long.
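With files of this size, one possible direction is to avoid regexps altogether and look each attribute value up in the map as a plain string. The following is only a sketch under several assumptions that are not confirmed above: the .gff3 file is tab-delimited with the attributes in column 9, rename.txt is whitespace-delimited, and a gene name appears inside an attribute value either alone or followed by a suffix that starts with "-mRNA-". The demo file names and the gene-0.41 entry are hypothetical and exist only to show that a longer name is not clobbered by a shorter one.

```shell
#!/bin/sh
# Demo input; the file names and the gene-0.41 entry are hypothetical.
printf '%s %s\n' \
    maker-tulip_contig_65_pilon_pilon-augustus-gene-0.4  gene3 \
    maker-tulip_contig_65_pilon_pilon-augustus-gene-0.41 gene4 > demo_rename.txt

g=maker-tulip_contig_65_pilon_pilon-augustus-gene-0.4
c=tulip_contig_65_pilon_pilon
{
    printf '%s\t.\tcontig\t1\t93354\t.\t.\t.\tID=%s;Name=%s\n' "$c" "$c" "$c"
    printf '%s\tmaker\tgene\t19497\t23038\t.\t+\t.\tID=%s;Name=%s\n' "$c" "$g" "$g"
    printf '%s\tmaker\tmRNA\t19497\t23038\t.\t+\t.\tID=%s-mRNA-1;Parent=%s;Name=%s-mRNA-1\n' "$c" "$g" "$g" "$g"
    printf '%s\tmaker\tgene\t30000\t31000\t.\t+\t.\tID=%s1;Name=%s1\n' "$c" "$g" "$g"
} > demo.gff3

awk '
NR == FNR { map[$1] = $2; next }          # first file: old name -> new name
/^#/ || NF < 9 { print; next }            # leave comments and short lines alone
{
    n = split($9, attr, ";")
    out = ""
    for (i = 1; i <= n; i++) {
        eq  = index(attr[i], "=")
        tag = substr(attr[i], 1, eq)      # e.g. "ID="
        val = substr(attr[i], eq + 1)     # the attribute value
        suffix = ""
        # Not a known name as a whole: try the part before "-mRNA-".
        if (!(val in map) && (p = index(val, "-mRNA-")) > 0) {
            suffix = substr(val, p)
            val    = substr(val, 1, p - 1)
        }
        if (val in map) val = map[val]    # exact string lookup, no regexp
        sep = (i > 1 ? ";" : "")
        out = out sep tag val suffix
    }
    $9 = out
    print
}' demo_rename.txt FS='\t' OFS='\t' demo.gff3 > demo_new.gff3

cat demo_new.gff3
```

On the real data, the same program saved as tst.awk would be run as awk -f tst.awk rename.txt FS='\t' OFS='\t' original.gff3 > new.gff3. Each attribute value costs at most two hash lookups instead of one gsub() call per entry in rename.txt. Attribute values that hold several comma-separated names, or suffixes other than "-mRNA-", are not handled by this sketch.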
Any advice is welcome here!
[...] gsub() so yeah, that'll take a while! Is there any way to identify lines in the .gff3 file that you don't need to perform replacements on or reduce how many replacements might be necessary on a given line? If not then the awk script you have now is the fastest way to do what you want.