I am trying to learn Python and wanted to write a text parser. I am trying to parse a large FASTA file full of DNA strings (it is 136275 lines long and 9.8 MB in size). My problem is that the program always stops working at exactly the same position (line 16076) and doesn't throw an error.

    def file_parser(filepath):
        data = []
        file_content = open(filepath, 'r')
        line = file_content.readline()
        i=0
        while line:
            if line == 0:
                break
            elif line[0] == ">":
                key, name = line.split('|')[-2:]
                dna = ''
                line = file_content.readline()
                i = i+1
                while not line.startswith('>'): #line[0] != ">": #
                    dna = dna + line
                    line = file_content.readline()
                dna = dna.rstrip('\n')
                name = name.rstrip('\n')
                row = {
                    key,
                    name,
                    dna
                }
                data.append(row)
                print(i)
            else:
                print("Your file is corrupted")
        return data

So my question is (as a beginner at writing Python): what's wrong with my code that makes it stop working?
I assume it could be the line.startswith('>'), which I switched to because I was getting string index out of range errors before, but to be honest I'm not really sure.
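A quick check of that suspicion can be sketched as follows. It assumes the file has been read to its end, at which point readline() returns an empty string. The io.StringIO object and the two sample lines are stand-ins for the real file, used here only for illustration.

```python
import io

# io.StringIO stands in for the real FASTA file (illustrative data only).
f = io.StringIO(">header\nACGT\n")
f.readline()               # the header line
f.readline()               # the single sequence line
eof_line = f.readline()    # nothing left to read

# An empty string never starts with '>', so a loop guarded by
# `while not line.startswith('>')` cannot end once the file is exhausted.
still_looping = not eof_line.startswith('>')

# Indexing the empty string is what produces "string index out of range".
try:
    eof_line[0]
    raised_index_error = False
except IndexError:
    raised_index_error = True

print(repr(eof_line), still_looping, raised_index_error)
```

If this matches what happens in the parser, the inner loop keeps running on the last record without raising anything, which would look like the program silently stopping.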
My test file comes from this source: ftp://ftp.ncbi.nih.gov/genomes/Acanthisitta_chloris/protein/ (it's the .fa.gz file). I use a slightly customized Ubuntu 18.10 and python3.
Thanks for your time.
Don't use readline() and a while loop. Just loop over the file object to get lines: for line in file_content:, and you can get additional lines in the loop with next(file_content).
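A minimal sketch of that suggestion, not a definitive implementation: the function name parse_fasta and the sample data are hypothetical. It assumes every header contains at least one '|' (as in the question's split), stores each record as a tuple rather than the set literal used in the question, and joins the sequence lines without newlines. It uses only the for loop, without next().

```python
import io

def parse_fasta(lines):
    """Collect (key, name, dna) tuples from an iterable of FASTA lines."""
    records = []
    header = None   # (key, name) of the record currently being read
    chunks = []     # sequence lines collected for that record
    for line in lines:              # iteration simply ends at end of file
        line = line.rstrip('\n')
        if line.startswith('>'):
            if header is not None:  # flush the previous record
                records.append((*header, ''.join(chunks)))
            key, name = line.split('|')[-2:]
            header = (key, name)
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:          # flush the last record
        records.append((*header, ''.join(chunks)))
    return records

# Illustrative input; with a real file one would write
#     with open(filepath) as fh:
#         records = parse_fasta(fh)
sample = ">gi|1|ref|XP_1|protein one\nACGT\nTTGA\n>gi|2|ref|XP_2|protein two\nGGCC\n"
records = parse_fasta(io.StringIO(sample))
print(records)
```

Because the for loop stops by itself when the file is exhausted, there is no empty-string case to test for, and the last record is written out explicitly after the loop.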