Problem with startswith() when parsing big text files in Python

I am trying to learn Python and wanted to write a text parser. I am trying to parse a large FASTA file full of DNA strings (136275 lines, 9.8 MB). My problem is that the program always stops working at exactly the same position (line 16076) and doesn't throw an error.

def file_parser(filepath):
  data = []
  file_content = open(filepath, 'r')
  line = file_content.readline()
  i=0
  while line:
    if line == 0:
      break
    elif line[0] == ">":
      key, name = line.split('|')[-2:]
      dna = ''
      line = file_content.readline()
      i = i+1
      while not line.startswith('>'): #line[0] != ">": #
        dna = dna + line
        line = file_content.readline()
      dna = dna.rstrip('\n')
      name = name.rstrip('\n')
      row = {
        key, 
        name, 
        dna
      }
      data.append(row)
      print(i)
    else:
      print("Your file is corrupted")
  return data

So my question, as a Python beginner, is: what is wrong with my code that makes it stop working? I suspect it could be line.startswith('>'). I switched to that because I was getting "string index out of range" errors before, but to be honest I'm not really sure.
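
For reference, a small sketch of how the two checks mentioned above behave on an empty string, which is what readline() returns once the input is exhausted. The two-line sample and the variable names are made up for illustration; they are not taken from the real file.

```python
import io

# Hypothetical two-line sample standing in for the real file.
handle = io.StringIO(">header\nACGT\n")
handle.readline()           # ">header\n"
handle.readline()           # "ACGT\n"
at_eof = handle.readline()  # "" once the input is exhausted

starts = at_eof.startswith(">")  # False, and no exception is raised

try:
    at_eof[0]
    index_failed = False
except IndexError:
    index_failed = True          # indexing an empty string raises IndexError

print(repr(at_eof), starts, index_failed)  # '' False True
```

So a loop of the form `while not line.startswith('>')` has no way to end after the last record: the empty string never starts with '>', and readline() keeps returning the empty string.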

My test file comes from this source: ftp://ftp.ncbi.nih.gov/genomes/Acanthisitta_chloris/protein/ (it's the .fa.gz file). I use a slightly customized Ubuntu 18.10 and python3.

Thanks for your time.

edited Nov 5, 2018 at 10:48

Chris_Rands


asked Nov 3, 2018 at 18:46

Luca R


Don't say a file is "large" without supporting that with a concrete number. I don't want to download a potentially very large file just to check.

– Jongware
Commented Nov 3, 2018 at 18:49

@usr2564301 Ohh, yeah, forgot that, thank you.

– Luca R
Commented Nov 3, 2018 at 18:52

Thanks! 9.8 MB is not large at all (I process multiples of that with ease), so it should not be stressing Python, or your system in general.

– Jongware
Commented Nov 3, 2018 at 18:56

Don't use readline() and a while loop. Just loop over the file object to get lines: for line in file_content:, and you can get additional lines in the loop with next(file_content).

– Martijn Pieters
Commented Nov 3, 2018 at 19:00

Are you running this on Windows, perhaps? What Python version are you using?

– Martijn Pieters
Commented Nov 3, 2018 at 19:00
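
A minimal sketch of the "loop over the file object" suggestion from the comments, not a definitive fix. It collects sequence lines in a list and emits a record whenever the next header (or the end of the input) is reached, rather than calling next(file_content) inside the loop. The helper name parse_fasta, the dict keys, and the sample headers are assumptions made for illustration. Sequence lines are joined without newlines here, which differs from the question's code, and headers are assumed to contain at least one '|'.

```python
import io


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
        elif line.strip():
            raise ValueError("sequence data found before the first header")
    # The for loop ends by itself at end of input; emit the last record.
    if header is not None:
        yield header, "".join(chunks)


def file_parser(filepath):
    """Return a list of dicts with the last two '|' fields and the sequence."""
    data = []
    with open(filepath, "r") as handle:
        for header, dna in parse_fasta(handle):
            key, name = header.split("|")[-2:]
            data.append({"key": key, "name": name, "dna": dna})
    return data


# Usage on a made-up in-memory sample instead of the real file.
sample = io.StringIO(">gi|1|ref|XP_1|protA\nACGT\nTTGA\n>gi|2|ref|XP_2|protB\nGGCC\n")
records = list(parse_fasta(sample))
print(records)
```

Because the for loop stops when the file is exhausted, no explicit end-of-file check is needed, and the with block closes the file even if parsing raises.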
