Trouble parsing FASTA files in Python

Question

dict = {}
tag = ""
with open('/storage/emulated/0/Download/sequence.fasta.txt','r') as sequence:
    seq = sequence.readlines()
    for line in seq:
        if line.startswith(">"):
            tag = line.replace("\n", "")
        else:
            seq = "".join(seq[1:])
            dict[tag] = seq.replace("\n", "")   
    print(dict)

Background for those who aren't familiar with FASTA files: this format contains one or more DNA, RNA, or protein sequences, each with a one-line descriptive tag that starts with ">", followed by the sequence on the following lines (e.g. for DNA it would be long runs of A, T, G, and C). It also comes with many unnecessary line breaks.

So far this code works when I only have one sequence per file, but it seems to ignore the if condition when there are multiple. It should add a new tag: sequence pair to the dictionary every time it encounters a ">", but instead it only runs once: it puts the first description in as the key, joins the rest of the file regardless of ">" characters, and uses that as the value. How can I get this loop to notice a new ">" after the first occurrence?

I am purposefully steering away from the biopython module.

Without running your code, it would appear the issue is here: seq = "".join(seq[1:]). You're modifying the object you're iterating over and that leads to issues.
– readyready15728, Commented Jun 23, 2020 at 12:35
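
To make the effect of that line concrete, here is a minimal sketch that replays the question's loop on a tiny in-memory sample instead of a file. The function name replay_question_loop and the two sample records are made up for illustration. The for loop keeps walking the original list, but the name seq is rebound to one long string that already contains every later header, so the stored values swallow the ">" lines.

```python
def replay_question_loop(lines):
    """Replay the loop from the question on a list of lines (hypothetical helper)."""
    result = {}
    tag = ""
    seq = lines
    for line in seq:                      # the iterator still walks the original list
        if line.startswith(">"):
            tag = line.replace("\n", "")
        else:
            # First pass: seq is a list, so this joins every remaining line,
            # headers included. Later passes: seq is already a string, so
            # seq[1:] only drops its first character.
            seq = "".join(seq[1:])
            result[tag] = seq.replace("\n", "")
    return result


sample = [">a\n", "AC\n", "GT\n", ">b\n", "TT\n"]
broken = replay_question_loop(sample)
# broken == {">a": "CGT>bTT", ">b": "GT>bTT"}
```

On this sample neither value is the real sequence, and the second header ends up inside both values, which is why the values need to be built up per tag rather than from the whole file.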

readyready15728 · Accepted Answer · 2020-06-23 14:07:03Z

3

UPDATE: the code below now works for multiple-line sequences.

The following code works fine for me:

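import re
from collections import defaultdict

# Missing keys start as "", so sequence lines can be appended with +=.
sequences = defaultdict(str)

with open('fasta.txt') as f:
    lines = f.readlines()

current_tag = None
for line in lines:
    # A header line starts with ">"; capture the text after it as the tag.
    m = re.match(r'^>(.+)', line)

    if m:
        current_tag = m.group(1)
    else:
        # Any other line continues the current record, so multi-line
        # sequences are concatenated under the same tag.
        sequences[current_tag] += line.strip()

for k, v in sequences.items():
    print(f"{k}: {v}")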

It uses a couple of features you may be unfamiliar with, such as regular expressions (which are probably very useful in bioinformatics) and f-string formatting. If anything confuses you, ask away. One thing I should add is that you don't want to name a variable dict, because that shadows Python's built-in dict type for the rest of that scope. I chose sequences, which doesn't do this and is more informative.
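
As a small illustration of the naming point, the sketch below shows what happens once dict has been rebound to an ordinary dictionary. The function name shadowing_demo is hypothetical and the shadowing is kept inside a function so the built-in stays usable elsewhere.

```python
def shadowing_demo():
    """Show that a variable named dict hides the built-in type in its scope."""
    dict = {}                  # from here on, dict is this instance, not the type
    try:
        dict(a=1)              # calling a plain dictionary raises TypeError
    except TypeError as exc:
        return str(exc)
    return None


error_message = shadowing_demo()

# Outside the function the built-in is untouched and still callable.
still_works = dict(a=1)
```

Here error_message holds the TypeError text saying a dict object is not callable, which is the kind of confusing failure a descriptive name such as sequences avoids.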

For reference, this is the content of the example FASTA file fasta.txt I used in this instance:

>seq0
FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
>seq1
KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME
LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq2
EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK
>seq3
MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK
>seq4
EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVVSYEMRLFGVQKDNFALEHSLL
>seq5
SWEEFAKAAEVLYLEDPMKCRMCTKYRHVDHKLVVKLTDNHTVLKYVTDMAQDVKKIEKLTTLLMR
>seq6
FTNWEEFAKAAERLHSANPEKCRFVTKYNHTKGELVLKLTDDVVCLQYSTNQLQDVKKLEKLSSTLLRSI
>seq7
SWEEFVERSVQLFRGDPNATRYVMKYRHCEGKLVLKVTDDRECLKFKTDQAQDAKKMEKLNNIFF
>seq8
SWDEFVDRSVQLFRADPESTRYVMKYRHCDGKLVLKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq9
KNWEDFEIAAENMYMANPQNCRYTMKYVHSKGHILLKMSDNVKCVQYRAENMPDLKK
>seq10
FDSWDEFVSKSVELFRNHPDTTRYVVKYRHCEGKLVLKVTDNHECLKFKTDQAQDAKKMEK

edited Jun 23, 2020 at 14:07

answered Jun 23, 2020 at 12:31

readyready15728

2 Comments

Ethan Hetrick Over a year ago

I am familiar with most of the code besides what is in the for statement. Also, it is not working properly for me. No errors but when the sequence occupies multiple lines it only seems to get the first one. Does it work with longer sequences for you?

readyready15728 Over a year ago

That is a limitation of the approach I have here. I am going to update this code such that multiple lines can be used.
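
For readers who want the multi-line handling without regular expressions, one possible variant is sketched below. It is not the answer's code: the function name parse_fasta is hypothetical, and it assumes that blank lines and any text before the first header can be ignored.

```python
def parse_fasta(lines):
    """Return {tag: sequence} from an iterable of FASTA lines (hypothetical helper)."""
    records = {}
    tag = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                          # skip blank lines
        if line.startswith(">"):
            if tag is not None:
                records[tag] = "".join(chunks)  # finish the previous record
            tag = line[1:]                    # drop the leading ">"
            chunks = []
        elif tag is not None:
            chunks.append(line)               # one more line of the current sequence
    if tag is not None:
        records[tag] = "".join(chunks)        # finish the last record
    return records


sample = [">seq0\n", "FQTW\n", "EEFS\n", "\n", ">seq1\n", "KYRT\n"]
parsed = parse_fasta(sample)
# parsed == {"seq0": "FQTWEEFS", "seq1": "KYRT"}
```

Because the function accepts any iterable of lines, an open file can be passed directly, for example with open('fasta.txt') as f: records = parse_fasta(f), without reading the whole file into a list first.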
