I have two files (strings_to_match and files_with_rows_to_delete_based_in_strings_to_match), and I want to use a bash script, or some other way, to delete rows from one file by reading the patterns from the second file (a kind of data cleaning).
strings_to_match (one name per line):
Escherichia coli
Campylobacter jejuni
Rhizobium
files_with_rows_to_delete_based_in_strings_to_match :
#Organism Name,Organism Groups,Strain,BioSample,Assembly,Level,Size(Mb),GC%,Replicons,WGS,CDS,GenBank FTP,RefSeq FTP,RefSeq category
Mesorhizobium sp. M6A.T.Cr.TU.016.01.1.1,Bacteria;Proteobacteria;Alphaproteobacteria,M6A.T.Cr.TU.016.01.1.1,SAMN09232784,GCA_003952585.1, Chromosome,6.78458,62.2,chromosome:NZ_CP034452.1/CP034452.1,,6239,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/952/585/GCA_003952585.1_ASM395258v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/952/585/GCF_003952585.1_ASM395258v1,
Rhizobium sp. S41,Bacteria;Proteobacteria;Alphaproteobacteria,S41,SAMN05323143,GCA_001691455.1,Complete,5.52437,59.3,chromosome 1:NZ_CP016320.1/CP016320.1; chromosome 2:NZ_CP016433.1/CP016433.1,,5159,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/691/455/GCA_001691455.1_ASM169145v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/691/455/GCF_001691455.1_ASM169145v1,
Bordetella holmesii,Bacteria;Proteobacteria;Betaproteobacteria,F592,SAMN12525325,GCA_009627835.1,Complete,3.69654,62.7,chromosome:NZ_CP043169.1/CP043169.1,,3204,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/627/835/GCA_009627835.1_ASM962783v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/627/835/GCF_009627835.1_ASM962783v1,
Pseudomonas chlororaphis subsp. piscium,Bacteria;Proteobacteria;Gammaproteobacteria,ChPhzS135,SAMN08359204,GCA_003850485.1,Complete,6.94002,62.8,chromosome:NZ_CP027738.1/CP027738.1,,6052,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/850/485/GCA_003850485.1_ASM385048v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/850/485/GCF_003850485.1_ASM385048v1,
Piscirickettsia salmonis,Bacteria;Proteobacteria;Gammaproteobacteria,PM22180B,SAMN04376232,GCA_001932895.1,Complete,3.50973,39.6192,chromosome 1:NZ_CP013801.1/CP013801.1; plasmid p1PS13:NZ_CP013802.1/CP013802.1; plasmid p2PS13:NZ_CP013803.1/CP013803.1; plasmid p3PS13:NZ_CP013804.1/CP013804.1; plasmid p4PS13:NZ_CP013805.1/CP013805.1,,3345,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/932/895/GCA_001932895.1_ASM193289v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/932/895/GCF_001932895.1_ASM193289v1,
Morganella morganii subsp. morganii,Bacteria;Proteobacteria;Gammaproteobacteria,81703,SAMN16623062,GCA_018802525.1,Complete,4.01864,51.1,chromosome:NZ_CP064830.1/CP064830.1,,3704,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/018/802/525/GCA_018802525.1_ASM1880252v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/018/802/525/GCF_018802525.1_ASM1880252v1,
Staphylococcus pseudintermedius,Bacteria;Terrabacteria group;Firmicutes,FDAARGOS_1073,SAMN16357242,GCA_016403325.1, Chromosome,2.83031,37.5,chromosome:NZ_CP066292.1/CP066292.1,,2571,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/016/403/325/GCA_016403325.1_ASM1640332v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/403/325/GCF_016403325.1_ASM1640332v1,
Klebsiella aerogenes,Bacteria;Proteobacteria;Gammaproteobacteria,G3_AM,SAMN18346029,GCA_017742775.1, Chromosome,5.27581,55.1,chromosome:NZ_CP072327.1/CP072327.1,,4374,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/017/742/775/GCA_017742775.1_ASM1774277v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/017/742/775/GCF_017742775.1_ASM1774277v1,
Enterobacter hormaechei,Bacteria;Proteobacteria;Gammaproteobacteria,Eho-E2,SAMN13747519,GCA_015910205.1,Complete,5.03481,54.7474,chromosome:NZ_CP047715.1/CP047715.1; plasmid pEclE2-1:NZ_CP047716.1/CP047716.1; plasmid pEclE2-2:NZ_CP047717.1/CP047717.1; plasmid pEclE2-3:NZ_CP047718.1/CP047718.1; plasmid pEclE2-4:NZ_CP047719.1/CP047719.1; plasmid pEclE2-5:NZ_CP047720.1/CP047720.1,,4628,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/015/910/205/GCA_015910205.1_ASM1591020v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/015/910/205/GCF_015910205.1_ASM1591020v1,
Citrobacter freundii,Bacteria;Proteobacteria;Gammaproteobacteria,AR_0116,SAMN04014957,GCA_003571565.1,Complete,5.77217,51.7267,chromosome:NZ_CP032184.1/CP032184.1; plasmid unnamed1:NZ_CP032179.1/CP032179.1; plasmid unnamed2:NZ_CP032180.1/CP032180.1; plasmid unnamed3:NZ_CP032181.1/CP032181.1; plasmid unnamed4:NZ_CP032182.1/CP032182.1; plasmid unnamed5:NZ_CP032183.1/CP032183.1,,5419,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/571/565/GCA_003571565.1_ASM357156v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/571/565/GCF_003571565.1_ASM357156v1,
Candidatus Sulcia muelleri CARI,Bacteria;FCB group;Bacteroidetes/Chlorobi group,CARI,SAMN02604226,GCA_000147035.1,Complete,0.276511,21.1,chromosome:CP002163.1,,246,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/147/035/GCA_000147035.1_ASM14703v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/147/035/GCF_000147035.1_ASM14703v1,
Corynebacterium pseudotuberculosis,Bacteria;Terrabacteria group;Actinobacteria,CS_10,SAMN02899788,GCA_000730405.1,Complete,2.33814,52.2,chromosome:NZ_CP008923.1/CP008923.1,,1992,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/730/405/GCA_000730405.1_ASM73040v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/730/405/GCF_000730405.1_ASM73040v1,
Klebsiella variicola,Bacteria;Proteobacteria;Gammaproteobacteria,M186-1-2,SAMN16560586,GCA_015288045.1,Complete,5.48434,57.5,chromosome:NZ_CP063915.1/CP063915.1,,5082,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/015/288/045/GCA_015288045.1_ASM1528804v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/015/288/045/GCF_015288045.1_ASM1528804v1,
Lacticaseibacillus paracasei subsp. tolerans,Bacteria;Terrabacteria group;Firmicutes,ZY-1,SAMN16861159,GCA_015693945.1,Complete,3.25423,46.389,chromosome:NZ_CP065154.1/CP065154.1; plasmid pLPZ1:NZ_CP065155.1/CP065155.1; plasmid pLPZ2:NZ_CP065156.1/CP065156.1; plasmid pLPZ3:NZ_CP065157.1/CP065157.1; plasmid pLPZ4:NZ_CP065158.1/CP065158.1,,2980,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/015/693/945/GCA_015693945.1_ASM1569394v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/015/693/945/GCF_015693945.1_ASM1569394v1,
Escherichia fergusonii,Bacteria;Proteobacteria;Gammaproteobacteria,FDAARGOS 1438,SAMN16357580,GCA_019047545.1,Complete,4.54316,49.9,chromosome:NZ_CP077242.1/CP077242.1,,4191,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/019/047/545/GCA_019047545.1_ASM1904754v1,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/019/047/545/GCF_019047545.1_ASM1904754v1,
Escherichia coli,Bacteria;Proteobacteria;Gammaproteobacteria,S-P-N-063.01,SAMN26095817,GCA_022488345.1, Chromosome,4.64118,0,chromosome:CP092699.1,,4208,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/022/488/345/GCA_022488345.1_ASM2248834v1,,
synthetic Escherichia coli C321.deltaA,Bacteria;Proteobacteria;Gammaproteobacteria,C321.deltaA substr. rEc.y.dC.46,SAMN03283144,GCA_000826905.1, Chromosome,4.65016,50.8,chromosome:CP010455.1,,0,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/826/905/GCA_000826905.1_ASM82690v1,,
synthetic Escherichia coli C321.deltaA,Bacteria;Proteobacteria;Gammaproteobacteria,C321.deltaA substr. rEc.b.dC.12,SAMN03283190,GCA_000826925.1, Chromosome,4.65015,50.8,chromosome:CP010456.1,,0,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/826/925/GCA_000826925.1_ASM82692v1,,
Escherichia coli,Bacteria;Proteobacteria;Gammaproteobacteria,S-P-N-065.01,SAMN26095831,GCA_022488325.1, Chromosome,4.62726,0,chromosome:CP092707.1,,3804,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/022/488/325/GCA_022488325.1_ASM2248832v1
I tried using this script, but I was unable to get the job done. Maybe I need to write to a temporary file and then rename it to the original name?
#! /usr/bin/env bash
while read line;
do
echo $line
gawk -i inplace '!/$line/' $2
done < $1
Both files are big; what I put here is just a toy example.
What I need is to delete from the file every line that contains the full match, e.g. delete
Escherichia coli,Bacteria;Proteobacteria;Gammaproteobacteria,S-P-N-065.01,SAMN26095831,GCA_022488325.1, Chromosome,4.62726,0,chromosome:CP092707.1,,3804,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/022/488/325/GCA_022488325.1_ASM2248832v1
The original file has about 1500 lines that contain Escherichia coli, and I want to keep none of them: in my downstream analysis I don't need all 1500 Escherichia coli genomes, I just need one.
If you could help me, I would really appreciate it.
Thank you!
Paulo
grep -vFf patterns_to_delete files_with_rows_to_delete
A strings_to_match entry of "foo bar" will also delete "antifoo barmatosis". Preventing that is complicated and will greatly slow down the process.
Testing in a 13GB file is hard. It might be better to fix the original logic and re-generate the file. :(
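To make the grep approach and the substring caveat concrete, here is a minimal sketch on shortened copies of the rows from the question. The grep_demo directory, the rows.csv and rows_prefix.csv names, and the truncated rows are assumptions made up for this illustration. The awk variant is only one possible way to restrict matching to the start of the first column, not necessarily what you need; it loops over every pattern for every row, so it may be slow on very large inputs.

```bash
#!/usr/bin/env bash
set -u

# Hypothetical scratch directory holding toy copies of the two files.
work=grep_demo
mkdir -p "$work"

printf '%s\n' 'Escherichia coli' 'Campylobacter jejuni' 'Rhizobium' \
  > "$work/strings_to_match"

printf '%s\n' \
  '#Organism Name,Organism Groups,Strain' \
  'Mesorhizobium sp. M6A.T.Cr.TU.016.01.1.1,Bacteria;Proteobacteria;Alphaproteobacteria,M6A.T.Cr.TU.016.01.1.1' \
  'Rhizobium sp. S41,Bacteria;Proteobacteria;Alphaproteobacteria,S41' \
  'Escherichia fergusonii,Bacteria;Proteobacteria;Gammaproteobacteria,FDAARGOS 1438' \
  'Escherichia coli,Bacteria;Proteobacteria;Gammaproteobacteria,S-P-N-063.01' \
  'synthetic Escherichia coli C321.deltaA,Bacteria;Proteobacteria;Gammaproteobacteria,C321.deltaA' \
  > "$work/rows.csv"
cp "$work/rows.csv" "$work/rows_prefix.csv"

# 1) Substring removal: -F = fixed strings, -f = read patterns from a file,
#    -v = print only the lines that match none of them.
#    grep cannot edit in place, so write to a temporary file and rename it.
#    grep exits non-zero when no line survives, so the original is then left alone.
#    Caution: an empty line in the pattern file matches every row.
if grep -vFf "$work/strings_to_match" "$work/rows.csv" > "$work/rows.csv.tmp"; then
  mv "$work/rows.csv.tmp" "$work/rows.csv"
fi

# 2) Stricter variant: drop a row only when its first comma-separated field
#    starts with one of the patterns.
awk -F, '
  NR == FNR { if ($0 != "") pat[$0]; next }      # first file: remember each pattern
  {
    for (p in pat) if (index($1, p) == 1) next   # field 1 begins with a pattern: skip row
    print
  }
' "$work/strings_to_match" "$work/rows_prefix.csv" > "$work/rows_prefix.csv.tmp" &&
  mv "$work/rows_prefix.csv.tmp" "$work/rows_prefix.csv"

echo "== grep -vFf =="
cat "$work/rows.csv"
echo "== awk, first field prefix =="
cat "$work/rows_prefix.csv"
```

On this toy data the grep version keeps the header, the Mesorhizobium row (matching is case-sensitive, so "Rhizobium" does not hit "Mesorhizobium") and the Escherichia fergusonii row. The awk version additionally keeps the "synthetic Escherichia coli" row, because that name does not start with a pattern.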