I have two files (strings_to_match and files_with_rows_to_delete_based_in_strings_to_match), and I want to use a bash script or some other way to delete rows from one file based on patterns read from the second file (a kind of data cleaning).

strings_to_match (one name per line):

Escherichia coli
Campylobacter jejuni
Rhizobium

files_with_rows_to_delete_based_in_strings_to_match :

#Organism Name,Organism Groups,Strain,BioSample,Assembly,Level,Size(Mb),GC%,Replicons,WGS,CDS,GenBank FTP,RefSeq FTP,RefSeq category
Mesorhizobium sp. M6A.T.Cr.TU.016.01.1.1,Bacteria;Proteobacteria;Alphaproteobacteria,M6A.T.Cr.TU.016.01.1.1,SAMN09232784,GCA_003952585.1, Chromosome,6.78458,62.2,chromosome:NZ_CP034452.1/CP034452.1,,6239,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/952/585/GCA_003952585.1_ASM395258v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/952/585/GCF_003952585.1_ASM395258v1,
Rhizobium sp. S41,Bacteria;Proteobacteria;Alphaproteobacteria,S41,SAMN05323143,GCA_001691455.1,Complete,5.52437,59.3,chromosome 1:NZ_CP016320.1/CP016320.1; chromosome 2:NZ_CP016433.1/CP016433.1,,5159,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/691/455/GCA_001691455.1_ASM169145v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/691/455/GCF_001691455.1_ASM169145v1,
Bordetella holmesii,Bacteria;Proteobacteria;Betaproteobacteria,F592,SAMN12525325,GCA_009627835.1,Complete,3.69654,62.7,chromosome:NZ_CP043169.1/CP043169.1,,3204,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/627/835/GCA_009627835.1_ASM962783v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/627/835/GCF_009627835.1_ASM962783v1,
Pseudomonas chlororaphis subsp. piscium,Bacteria;Proteobacteria;Gammaproteobacteria,ChPhzS135,SAMN08359204,GCA_003850485.1,Complete,6.94002,62.8,chromosome:NZ_CP027738.1/CP027738.1,,6052,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/850/485/GCA_003850485.1_ASM385048v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/850/485/GCF_003850485.1_ASM385048v1,
Piscirickettsia salmonis,Bacteria;Proteobacteria;Gammaproteobacteria,PM22180B,SAMN04376232,GCA_001932895.1,Complete,3.50973,39.6192,chromosome 1:NZ_CP013801.1/CP013801.1; plasmid p1PS13:NZ_CP013802.1/CP013802.1; plasmid p2PS13:NZ_CP013803.1/CP013803.1; plasmid p3PS13:NZ_CP013804.1/CP013804.1; plasmid p4PS13:NZ_CP013805.1/CP013805.1,,3345,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/932/895/GCA_001932895.1_ASM193289v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/932/895/GCF_001932895.1_ASM193289v1,
Morganella morganii subsp. morganii,Bacteria;Proteobacteria;Gammaproteobacteria,81703,SAMN16623062,GCA_018802525.1,Complete,4.01864,51.1,chromosome:NZ_CP064830.1/CP064830.1,,3704,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/018/802/525/GCA_018802525.1_ASM1880252v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/018/802/525/GCF_018802525.1_ASM1880252v1,
Staphylococcus pseudintermedius,Bacteria;Terrabacteria group;Firmicutes,FDAARGOS_1073,SAMN16357242,GCA_016403325.1, Chromosome,2.83031,37.5,chromosome:NZ_CP066292.1/CP066292.1,,2571,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/016/403/325/GCA_016403325.1_ASM1640332v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/403/325/GCF_016403325.1_ASM1640332v1,
Klebsiella aerogenes,Bacteria;Proteobacteria;Gammaproteobacteria,G3_AM,SAMN18346029,GCA_017742775.1, Chromosome,5.27581,55.1,chromosome:NZ_CP072327.1/CP072327.1,,4374,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/017/742/775/GCA_017742775.1_ASM1774277v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/017/742/775/GCF_017742775.1_ASM1774277v1,
Enterobacter hormaechei,Bacteria;Proteobacteria;Gammaproteobacteria,Eho-E2,SAMN13747519,GCA_015910205.1,Complete,5.03481,54.7474,chromosome:NZ_CP047715.1/CP047715.1; plasmid pEclE2-1:NZ_CP047716.1/CP047716.1; plasmid pEclE2-2:NZ_CP047717.1/CP047717.1; plasmid pEclE2-3:NZ_CP047718.1/CP047718.1; plasmid pEclE2-4:NZ_CP047719.1/CP047719.1; plasmid pEclE2-5:NZ_CP047720.1/CP047720.1,,4628,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/015/910/205/GCA_015910205.1_ASM1591020v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/015/910/205/GCF_015910205.1_ASM1591020v1,
Citrobacter freundii,Bacteria;Proteobacteria;Gammaproteobacteria,AR_0116,SAMN04014957,GCA_003571565.1,Complete,5.77217,51.7267,chromosome:NZ_CP032184.1/CP032184.1; plasmid unnamed1:NZ_CP032179.1/CP032179.1; plasmid unnamed2:NZ_CP032180.1/CP032180.1; plasmid unnamed3:NZ_CP032181.1/CP032181.1; plasmid unnamed4:NZ_CP032182.1/CP032182.1; plasmid unnamed5:NZ_CP032183.1/CP032183.1,,5419,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/571/565/GCA_003571565.1_ASM357156v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/571/565/GCF_003571565.1_ASM357156v1,
Candidatus Sulcia muelleri CARI,Bacteria;FCB group;Bacteroidetes/Chlorobi group,CARI,SAMN02604226,GCA_000147035.1,Complete,0.276511,21.1,chromosome:CP002163.1,,246,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/147/035/GCA_000147035.1_ASM14703v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/147/035/GCF_000147035.1_ASM14703v1,
Corynebacterium pseudotuberculosis,Bacteria;Terrabacteria group;Actinobacteria,CS_10,SAMN02899788,GCA_000730405.1,Complete,2.33814,52.2,chromosome:NZ_CP008923.1/CP008923.1,,1992,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/730/405/GCA_000730405.1_ASM73040v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/730/405/GCF_000730405.1_ASM73040v1,
Klebsiella variicola,Bacteria;Proteobacteria;Gammaproteobacteria,M186-1-2,SAMN16560586,GCA_015288045.1,Complete,5.48434,57.5,chromosome:NZ_CP063915.1/CP063915.1,,5082,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/015/288/045/GCA_015288045.1_ASM1528804v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/015/288/045/GCF_015288045.1_ASM1528804v1,
Lacticaseibacillus paracasei subsp. tolerans,Bacteria;Terrabacteria group;Firmicutes,ZY-1,SAMN16861159,GCA_015693945.1,Complete,3.25423,46.389,chromosome:NZ_CP065154.1/CP065154.1; plasmid pLPZ1:NZ_CP065155.1/CP065155.1; plasmid pLPZ2:NZ_CP065156.1/CP065156.1; plasmid pLPZ3:NZ_CP065157.1/CP065157.1; plasmid pLPZ4:NZ_CP065158.1/CP065158.1,,2980,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/015/693/945/GCA_015693945.1_ASM1569394v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/015/693/945/GCF_015693945.1_ASM1569394v1,
Escherichia fergusonii,Bacteria;Proteobacteria;Gammaproteobacteria,FDAARGOS 1438,SAMN16357580,GCA_019047545.1,Complete,4.54316,49.9,chromosome:NZ_CP077242.1/CP077242.1,,4191,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/019/047/545/GCA_019047545.1_ASM1904754v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/019/047/545/GCF_019047545.1_ASM1904754v1,
Escherichia coli,Bacteria;Proteobacteria;Gammaproteobacteria,S-P-N-063.01,SAMN26095817,GCA_022488345.1, Chromosome,4.64118,0,chromosome:CP092699.1,,4208,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/022/488/345/GCA_022488345.1_ASM2248834v1,,
synthetic Escherichia coli C321.deltaA,Bacteria;Proteobacteria;Gammaproteobacteria,C321.deltaA substr. rEc.y.dC.46,SAMN03283144,GCA_000826905.1, Chromosome,4.65016,50.8,chromosome:CP010455.1,,0,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/826/905/GCA_000826905.1_ASM82690v1,,
synthetic Escherichia coli C321.deltaA,Bacteria;Proteobacteria;Gammaproteobacteria,C321.deltaA substr. rEc.b.dC.12,SAMN03283190,GCA_000826925.1, Chromosome,4.65015,50.8,chromosome:CP010456.1,,0,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/826/925/GCA_000826925.1_ASM82692v1,,
Escherichia coli,Bacteria;Proteobacteria;Gammaproteobacteria,S-P-N-065.01,SAMN26095831,GCA_022488325.1, Chromosome,4.62726,0,chromosome:CP092707.1,,3804,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/022/488/325/GCA_022488325.1_ASM2248832v1

I tried using this script, but I was unable to get the job done. Maybe I need to write to a temporary file and then rename it to the original name?

#! /usr/bin/env bash

while read line;
do
   echo $line
   gawk -i inplace '!/$line/' $2
done < $1

Both files are big; I have put just a toy example here.

I need to delete from the file every line that contains a full match, e.g. delete

Escherichia coli,Bacteria;Proteobacteria;Gammaproteobacteria,S-P-N-065.01,SAMN26095831,GCA_022488325.1, Chromosome,4.62726,0,chromosome:CP092707.1,,3804,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/022/488/325/GCA_022488325.1_ASM2248832v1

The original file has roughly 1500 lines with Escherichia coli in them. I want to keep none of them, because my downstream analysis does not need all 1500 Escherichia coli genomes; I just need one.

If you could help me, I would really appreciate it.

Thank you!

Paulo

5 Comments
  • Use grep -vFf patterns_to_delete files_with_rows_to_delete Commented Mar 28, 2022 at 11:42
  • I agree with @anubhava's solution, but it needs a little more work to make it match whole fields. Commented Mar 28, 2022 at 12:39
  • @Fravadona I tried it on a copy of the whole file, and it was reduced from 13 GB to 826 KB. I will check more carefully later, but it seems to work as written: grep -vFf patterns_to_delete files_with_rows_to_delete Commented Mar 28, 2022 at 13:01
  • Be aware that a line in strings_to_match of foo bar will delete antifoo barmatosis. Preventing that is complicated and will greatly slow down the process. Testing in a 13GB file is hard. It might be better to fix the original logic and re-generate the file. :( Commented Mar 28, 2022 at 14:34
  • @PauloSergioSchlogl It's extremely unlikely that you really want a partial string match across the whole line, as that grep command in your comment would do. At a minimum you probably want to match only in specific field(s) (field 1 maybe?), and then you probably want a full instead of partial string match within that field. Commented Mar 28, 2022 at 15:01

1 Answer

If you want an exact match on the first field, then:

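gawk 'BEGIN{FS=","}
    NR==FNR{a[$1]; next}    # first file: remember each pattern as an array key
    !($1 in a)              # second file: print rows whose first field is not a key
' strings_to_match files_with_rows_to_delete_based_in_strings_to_match > output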

The file output then contains:

#Organism Name,Organism Groups,Strain,BioSample,Assembly,Level,Size(Mb),GC%,Replicons,WGS,CDS,GenBank FTP,RefSeq FTP,RefSeq category
Mesorhizobium sp. M6A.T.Cr.TU.016.01.1.1,Bacteria;Proteobacteria;Alphaproteobacteria,M6A.T.Cr.TU.016.01.1.1,SAMN09232784,GCA_003952585.1, Chromosome,6.78458,62.2,chromosome:NZ_CP034452.1/CP034452.1,,6239,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/952/585/GCA_003952585.1_ASM395258v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/952/585/GCF_003952585.1_ASM395258v1,
Rhizobium sp. S41,Bacteria;Proteobacteria;Alphaproteobacteria,S41,SAMN05323143,GCA_001691455.1,Complete,5.52437,59.3,chromosome 1:NZ_CP016320.1/CP016320.1; chromosome 2:NZ_CP016433.1/CP016433.1,,5159,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/691/455/GCA_001691455.1_ASM169145v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/691/455/GCF_001691455.1_ASM169145v1,
Bordetella holmesii,Bacteria;Proteobacteria;Betaproteobacteria,F592,SAMN12525325,GCA_009627835.1,Complete,3.69654,62.7,chromosome:NZ_CP043169.1/CP043169.1,,3204,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/627/835/GCA_009627835.1_ASM962783v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/627/835/GCF_009627835.1_ASM962783v1,
Pseudomonas chlororaphis subsp. piscium,Bacteria;Proteobacteria;Gammaproteobacteria,ChPhzS135,SAMN08359204,GCA_003850485.1,Complete,6.94002,62.8,chromosome:NZ_CP027738.1/CP027738.1,,6052,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/850/485/GCA_003850485.1_ASM385048v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/850/485/GCF_003850485.1_ASM385048v1,
Piscirickettsia salmonis,Bacteria;Proteobacteria;Gammaproteobacteria,PM22180B,SAMN04376232,GCA_001932895.1,Complete,3.50973,39.6192,chromosome 1:NZ_CP013801.1/CP013801.1; plasmid p1PS13:NZ_CP013802.1/CP013802.1; plasmid p2PS13:NZ_CP013803.1/CP013803.1; plasmid p3PS13:NZ_CP013804.1/CP013804.1; plasmid p4PS13:NZ_CP013805.1/CP013805.1,,3345,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/932/895/GCA_001932895.1_ASM193289v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/932/895/GCF_001932895.1_ASM193289v1,
Morganella morganii subsp. morganii,Bacteria;Proteobacteria;Gammaproteobacteria,81703,SAMN16623062,GCA_018802525.1,Complete,4.01864,51.1,chromosome:NZ_CP064830.1/CP064830.1,,3704,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/018/802/525/GCA_018802525.1_ASM1880252v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/018/802/525/GCF_018802525.1_ASM1880252v1,
Staphylococcus pseudintermedius,Bacteria;Terrabacteria group;Firmicutes,FDAARGOS_1073,SAMN16357242,GCA_016403325.1, Chromosome,2.83031,37.5,chromosome:NZ_CP066292.1/CP066292.1,,2571,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/016/403/325/GCA_016403325.1_ASM1640332v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/403/325/GCF_016403325.1_ASM1640332v1,
Klebsiella aerogenes,Bacteria;Proteobacteria;Gammaproteobacteria,G3_AM,SAMN18346029,GCA_017742775.1, Chromosome,5.27581,55.1,chromosome:NZ_CP072327.1/CP072327.1,,4374,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/017/742/775/GCA_017742775.1_ASM1774277v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/017/742/775/GCF_017742775.1_ASM1774277v1,
Enterobacter hormaechei,Bacteria;Proteobacteria;Gammaproteobacteria,Eho-E2,SAMN13747519,GCA_015910205.1,Complete,5.03481,54.7474,chromosome:NZ_CP047715.1/CP047715.1; plasmid pEclE2-1:NZ_CP047716.1/CP047716.1; plasmid pEclE2-2:NZ_CP047717.1/CP047717.1; plasmid pEclE2-3:NZ_CP047718.1/CP047718.1; plasmid pEclE2-4:NZ_CP047719.1/CP047719.1; plasmid pEclE2-5:NZ_CP047720.1/CP047720.1,,4628,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/015/910/205/GCA_015910205.1_ASM1591020v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/015/910/205/GCF_015910205.1_ASM1591020v1,
Citrobacter freundii,Bacteria;Proteobacteria;Gammaproteobacteria,AR_0116,SAMN04014957,GCA_003571565.1,Complete,5.77217,51.7267,chromosome:NZ_CP032184.1/CP032184.1; plasmid unnamed1:NZ_CP032179.1/CP032179.1; plasmid unnamed2:NZ_CP032180.1/CP032180.1; plasmid unnamed3:NZ_CP032181.1/CP032181.1; plasmid unnamed4:NZ_CP032182.1/CP032182.1; plasmid unnamed5:NZ_CP032183.1/CP032183.1,,5419,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/571/565/GCA_003571565.1_ASM357156v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/571/565/GCF_003571565.1_ASM357156v1,
Candidatus Sulcia muelleri CARI,Bacteria;FCB group;Bacteroidetes/Chlorobi group,CARI,SAMN02604226,GCA_000147035.1,Complete,0.276511,21.1,chromosome:CP002163.1,,246,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/147/035/GCA_000147035.1_ASM14703v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/147/035/GCF_000147035.1_ASM14703v1,
Corynebacterium pseudotuberculosis,Bacteria;Terrabacteria group;Actinobacteria,CS_10,SAMN02899788,GCA_000730405.1,Complete,2.33814,52.2,chromosome:NZ_CP008923.1/CP008923.1,,1992,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/730/405/GCA_000730405.1_ASM73040v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/730/405/GCF_000730405.1_ASM73040v1,
Klebsiella variicola,Bacteria;Proteobacteria;Gammaproteobacteria,M186-1-2,SAMN16560586,GCA_015288045.1,Complete,5.48434,57.5,chromosome:NZ_CP063915.1/CP063915.1,,5082,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/015/288/045/GCA_015288045.1_ASM1528804v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/015/288/045/GCF_015288045.1_ASM1528804v1,
Lacticaseibacillus paracasei subsp. tolerans,Bacteria;Terrabacteria group;Firmicutes,ZY-1,SAMN16861159,GCA_015693945.1,Complete,3.25423,46.389,chromosome:NZ_CP065154.1/CP065154.1; plasmid pLPZ1:NZ_CP065155.1/CP065155.1; plasmid pLPZ2:NZ_CP065156.1/CP065156.1; plasmid pLPZ3:NZ_CP065157.1/CP065157.1; plasmid pLPZ4:NZ_CP065158.1/CP065158.1,,2980,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/015/693/945/GCA_015693945.1_ASM1569394v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/015/693/945/GCF_015693945.1_ASM1569394v1,
Escherichia fergusonii,Bacteria;Proteobacteria;Gammaproteobacteria,FDAARGOS 1438,SAMN16357580,GCA_019047545.1,Complete,4.54316,49.9,chromosome:NZ_CP077242.1/CP077242.1,,4191,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/019/047/545/GCA_019047545.1_ASM1904754v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/019/047/545/GCF_019047545.1_ASM1904754v1,
synthetic Escherichia coli C321.deltaA,Bacteria;Proteobacteria;Gammaproteobacteria,C321.deltaA substr. rEc.y.dC.46,SAMN03283144,GCA_000826905.1, Chromosome,4.65016,50.8,chromosome:CP010455.1,,0,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/826/905/GCA_000826905.1_ASM82690v1,,
synthetic Escherichia coli C321.deltaA,Bacteria;Proteobacteria;Gammaproteobacteria,C321.deltaA substr. rEc.b.dC.12,SAMN03283190,GCA_000826925.1, Chromosome,4.65015,50.8,chromosome:CP010456.1,,0,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/826/925/GCA_000826925.1_ASM82692v1,,
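
The question also asks about writing to a temporary file and then renaming it to the original name. A minimal sketch of that idea follows, assuming the same exact first-field match as above. The toy file names, the shortened two-column rows, and the variable names (workdir, patterns, data, tmp) are made up for illustration and are not the real data.

```shell
#!/usr/bin/env bash
# Sketch: exact match on field 1, then replace the original file via a temp file.
# All paths below are toy files created just for this example.
workdir=$(mktemp -d)
patterns="$workdir/strings_to_match"
data="$workdir/genomes.csv"

printf '%s\n' 'Escherichia coli' 'Rhizobium' > "$patterns"
printf '%s\n' \
    '#Organism Name,Strain' \
    'Rhizobium sp. S41,S41' \
    'Escherichia coli,S-P-N-063.01' \
    'synthetic Escherichia coli C321.deltaA,C321' \
    'Bordetella holmesii,F592' > "$data"

# Create the temporary file next to the original so that mv is a plain rename.
tmp=$(mktemp "$data.XXXXXX")

# The original is only overwritten if awk exits successfully.
awk -F',' 'NR==FNR { drop[$1]; next } !($1 in drop)' "$patterns" "$data" > "$tmp" &&
    mv "$tmp" "$data"

cat "$data"
```

With an exact match, only the plain "Escherichia coli" row is dropped here; "Rhizobium sp. S41" and the "synthetic Escherichia coli" row stay, because their first field is not identical to any pattern.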

If you want a regular-expression match on the first field, then:

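gawk 'BEGIN{FS=","}
    NR==FNR{
        if(search) search=search"|";    # join the patterns into one alternation: p1|p2|...
        search=search $1; next
    }
    !($1 ~ search)                      # print rows whose first field matches none of them
' strings_to_match files_with_rows_to_delete_based_in_strings_to_match > output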

The file output then contains:

#Organism Name,Organism Groups,Strain,BioSample,Assembly,Level,Size(Mb),GC%,Replicons,WGS,CDS,GenBank FTP,RefSeq FTP,RefSeq category
Mesorhizobium sp. M6A.T.Cr.TU.016.01.1.1,Bacteria;Proteobacteria;Alphaproteobacteria,M6A.T.Cr.TU.016.01.1.1,SAMN09232784,GCA_003952585.1, Chromosome,6.78458,62.2,chromosome:NZ_CP034452.1/CP034452.1,,6239,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/952/585/GCA_003952585.1_ASM395258v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/952/585/GCF_003952585.1_ASM395258v1,
Bordetella holmesii,Bacteria;Proteobacteria;Betaproteobacteria,F592,SAMN12525325,GCA_009627835.1,Complete,3.69654,62.7,chromosome:NZ_CP043169.1/CP043169.1,,3204,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/627/835/GCA_009627835.1_ASM962783v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/627/835/GCF_009627835.1_ASM962783v1,
Pseudomonas chlororaphis subsp. piscium,Bacteria;Proteobacteria;Gammaproteobacteria,ChPhzS135,SAMN08359204,GCA_003850485.1,Complete,6.94002,62.8,chromosome:NZ_CP027738.1/CP027738.1,,6052,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/850/485/GCA_003850485.1_ASM385048v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/850/485/GCF_003850485.1_ASM385048v1,
Piscirickettsia salmonis,Bacteria;Proteobacteria;Gammaproteobacteria,PM22180B,SAMN04376232,GCA_001932895.1,Complete,3.50973,39.6192,chromosome 1:NZ_CP013801.1/CP013801.1; plasmid p1PS13:NZ_CP013802.1/CP013802.1; plasmid p2PS13:NZ_CP013803.1/CP013803.1; plasmid p3PS13:NZ_CP013804.1/CP013804.1; plasmid p4PS13:NZ_CP013805.1/CP013805.1,,3345,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/932/895/GCA_001932895.1_ASM193289v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/932/895/GCF_001932895.1_ASM193289v1,
Morganella morganii subsp. morganii,Bacteria;Proteobacteria;Gammaproteobacteria,81703,SAMN16623062,GCA_018802525.1,Complete,4.01864,51.1,chromosome:NZ_CP064830.1/CP064830.1,,3704,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/018/802/525/GCA_018802525.1_ASM1880252v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/018/802/525/GCF_018802525.1_ASM1880252v1,
Staphylococcus pseudintermedius,Bacteria;Terrabacteria group;Firmicutes,FDAARGOS_1073,SAMN16357242,GCA_016403325.1, Chromosome,2.83031,37.5,chromosome:NZ_CP066292.1/CP066292.1,,2571,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/016/403/325/GCA_016403325.1_ASM1640332v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/403/325/GCF_016403325.1_ASM1640332v1,
Klebsiella aerogenes,Bacteria;Proteobacteria;Gammaproteobacteria,G3_AM,SAMN18346029,GCA_017742775.1, Chromosome,5.27581,55.1,chromosome:NZ_CP072327.1/CP072327.1,,4374,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/017/742/775/GCA_017742775.1_ASM1774277v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/017/742/775/GCF_017742775.1_ASM1774277v1,
Enterobacter hormaechei,Bacteria;Proteobacteria;Gammaproteobacteria,Eho-E2,SAMN13747519,GCA_015910205.1,Complete,5.03481,54.7474,chromosome:NZ_CP047715.1/CP047715.1; plasmid pEclE2-1:NZ_CP047716.1/CP047716.1; plasmid pEclE2-2:NZ_CP047717.1/CP047717.1; plasmid pEclE2-3:NZ_CP047718.1/CP047718.1; plasmid pEclE2-4:NZ_CP047719.1/CP047719.1; plasmid pEclE2-5:NZ_CP047720.1/CP047720.1,,4628,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/015/910/205/GCA_015910205.1_ASM1591020v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/015/910/205/GCF_015910205.1_ASM1591020v1,
Citrobacter freundii,Bacteria;Proteobacteria;Gammaproteobacteria,AR_0116,SAMN04014957,GCA_003571565.1,Complete,5.77217,51.7267,chromosome:NZ_CP032184.1/CP032184.1; plasmid unnamed1:NZ_CP032179.1/CP032179.1; plasmid unnamed2:NZ_CP032180.1/CP032180.1; plasmid unnamed3:NZ_CP032181.1/CP032181.1; plasmid unnamed4:NZ_CP032182.1/CP032182.1; plasmid unnamed5:NZ_CP032183.1/CP032183.1,,5419,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/571/565/GCA_003571565.1_ASM357156v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/571/565/GCF_003571565.1_ASM357156v1,
Candidatus Sulcia muelleri CARI,Bacteria;FCB group;Bacteroidetes/Chlorobi group,CARI,SAMN02604226,GCA_000147035.1,Complete,0.276511,21.1,chromosome:CP002163.1,,246,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/147/035/GCA_000147035.1_ASM14703v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/147/035/GCF_000147035.1_ASM14703v1,
Corynebacterium pseudotuberculosis,Bacteria;Terrabacteria group;Actinobacteria,CS_10,SAMN02899788,GCA_000730405.1,Complete,2.33814,52.2,chromosome:NZ_CP008923.1/CP008923.1,,1992,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/730/405/GCA_000730405.1_ASM73040v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/730/405/GCF_000730405.1_ASM73040v1,
Klebsiella variicola,Bacteria;Proteobacteria;Gammaproteobacteria,M186-1-2,SAMN16560586,GCA_015288045.1,Complete,5.48434,57.5,chromosome:NZ_CP063915.1/CP063915.1,,5082,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/015/288/045/GCA_015288045.1_ASM1528804v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/015/288/045/GCF_015288045.1_ASM1528804v1,
Lacticaseibacillus paracasei subsp. tolerans,Bacteria;Terrabacteria group;Firmicutes,ZY-1,SAMN16861159,GCA_015693945.1,Complete,3.25423,46.389,chromosome:NZ_CP065154.1/CP065154.1; plasmid pLPZ1:NZ_CP065155.1/CP065155.1; plasmid pLPZ2:NZ_CP065156.1/CP065156.1; plasmid pLPZ3:NZ_CP065157.1/CP065157.1; plasmid pLPZ4:NZ_CP065158.1/CP065158.1,,2980,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/015/693/945/GCA_015693945.1_ASM1569394v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/015/693/945/GCF_015693945.1_ASM1569394v1,
Escherichia fergusonii,Bacteria;Proteobacteria;Gammaproteobacteria,FDAARGOS 1438,SAMN16357580,GCA_019047545.1,Complete,4.54316,49.9,chromosome:NZ_CP077242.1/CP077242.1,,4191,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/019/047/545/GCA_019047545.1_ASM1904754v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/019/047/545/GCF_019047545.1_ASM1904754v1,
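
One caveat with the regular-expression variant: each pattern line is interpreted as a regex, so a name containing a character such as "." would not be matched literally, and an empty line in strings_to_match may produce an alternation that matches every row. As a hedged alternative, the sketch below tests for a literal, case-sensitive substring in the first field with awk's index() function. The toy files, the shortened rows, and the names workdir, patterns, data, and filtered are assumptions for illustration only.

```shell
#!/usr/bin/env bash
# Sketch: drop rows whose first field CONTAINS any pattern as a literal substring.
# All paths below are toy files created just for this example.
workdir=$(mktemp -d)
patterns="$workdir/strings_to_match"
data="$workdir/genomes.csv"

# The trailing '' writes an empty pattern line on purpose.
printf '%s\n' 'Escherichia coli' 'Rhizobium' '' > "$patterns"
printf '%s\n' \
    '#Organism Name,Strain' \
    'Mesorhizobium sp. M6A,M6A' \
    'Rhizobium sp. S41,S41' \
    'Escherichia coli,S-P-N-063.01' \
    'synthetic Escherichia coli C321.deltaA,C321' \
    'Escherichia fergusonii,FDAARGOS 1438' > "$data"

filtered=$(awk -F',' '
    NR==FNR { if ($0 != "") pat[$0]; next }   # first file: skip empty pattern lines
    {
        for (p in pat)
            if (index($1, p)) next            # literal substring found: drop this row
        print                                 # no pattern found in field 1: keep it
    }
' "$patterns" "$data")

printf '%s\n' "$filtered"
```

This loops over every pattern for every row, so it is slower than a single regex when strings_to_match is long. "Mesorhizobium" survives because the test is case-sensitive and the pattern is "Rhizobium" with a capital R.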

2 Comments

No need for gawk, that script would behave the same way with any awk.
@Jose Ricardo Bustos M. In the first case, some lines with "synthetic Escherichia coli" still remain in the file, but the second script seems to work best. Using grep -Ff as a test, it captures lines with synthetic Escherichia coli as well as Escherichia coli.
