dict = {}
tag = ""
with open('/storage/emulated/0/Download/sequence.fasta.txt','r') as sequence:
    seq = sequence.readlines()
    for line in seq:
        if line.startswith(">"):
            tag = line.replace("\n", "")
        else:
            seq = "".join(seq[1:])
            dict[tag] = seq.replace("\n", "")   
    print(dict)

Background for those who aren't familiar with FASTA files: this format contains one or more DNA, RNA, or protein sequences, each preceded by a one-line descriptive tag that starts with ">", with the sequence itself on the following lines (e.g., for DNA it would be a long run of A, T, G, and C). It also comes with many unnecessary line breaks. So far this code works when I have only one sequence per file, but it seems to ignore the if condition when there are several. It should add a new tag: sequence pair to the dictionary every time it encounters a ">", but instead it only runs once: it uses the first description as the key, joins the rest of the file regardless of ">" characters, and uses that as the value. How can I get this loop to notice a new ">" after the first occurrence?

I am purposefully steering away from the biopython module.

  • Without running your code, it looks like the issue is here: seq = "".join(seq[1:]). You're reassigning seq, the very list you're looping over, inside the loop, and that leads to issues. Commented Jun 23, 2020 at 12:35
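
A minimal sketch of what that line does, using an in-memory list of lines instead of the file. The function name buggy_parse and the two sample records are assumptions made up for illustration. The for loop keeps walking the original list, but after the first sequence line the name seq refers to one long string, so the first tag receives the rest of the file and every later seq[1:] slices characters rather than lines:

```python
def buggy_parse(lines):
    """Reproduce the question's loop on an in-memory list of lines."""
    result = {}
    tag = ""
    seq = lines
    for line in seq:  # the loop iterator stays bound to the original list
        if line.startswith(">"):
            tag = line.replace("\n", "")
        else:
            # Rebinds the name seq to a single string; the loop itself
            # still steps through the original list.
            seq = "".join(seq[1:])
            result[tag] = seq.replace("\n", "")
    return result


# Hypothetical two-record input, not taken from the question's file.
sample = [">a\n", "AC\n", ">b\n", "GT\n"]
print(buggy_parse(sample))  # {'>a': 'AC>bGT', '>b': 'C>bGT'}
```

The first key swallows everything after the first header, later headers included, and the second key gets the same string minus its first character.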

1 Answer


UPDATE: the code below now works for multiple-line sequences.

The following code works fine for me:

import re
from collections import defaultdict

sequences = defaultdict(str)

with open('fasta.txt') as f:
    lines = f.readlines()

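current_tag = None
for line in lines:
    # A header line starts with ">"; capture the rest of it as the tag.
    m = re.match('^>(.+)', line)

    if m:
        current_tag = m.group(1)
    else:
        # Any other line is sequence data: append it to the current tag's entry.
        sequences[current_tag] += line.strip()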

for k, v in sequences.items():
    print(f"{k}: {v}")

It uses a number of features you may be unfamiliar with, such as regular expressions (which are probably very useful in bioinformatics) and f-string formatting. If anything confuses you, ask away. One thing I should add is that you don't want to name a variable dict, because that shadows Python's built-in dict type for the rest of the script. I chose sequences, which doesn't do this and is more informative.
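
If regular expressions feel like overkill, the same idea can be sketched with str.startswith alone. This is only a sketch under a few assumptions: parse_fasta is a hypothetical helper name, blank lines are skipped, and any sequence text that appears before the first ">" line is ignored. It also collects the fragments in lists and joins them once at the end instead of concatenating strings repeatedly.

```python
def parse_fasta(lines):
    """Map each header (without the leading '>') to its joined sequence."""
    chunks = {}          # header -> list of sequence fragments
    current_tag = None
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            current_tag = line[1:]
            chunks.setdefault(current_tag, [])
        elif current_tag is not None:     # ignore text before the first header
            chunks[current_tag].append(line)
    return {tag: "".join(parts) for tag, parts in chunks.items()}


# Hypothetical input with a sequence wrapped over two lines.
sample = [">seq0\n", "FQTWEEF\n", "SRAAEKL\n", "\n", ">seq1\n", "KYRTWEE\n"]
print(parse_fasta(sample))  # {'seq0': 'FQTWEEFSRAAEKL', 'seq1': 'KYRTWEE'}
```

Because the function only iterates over its argument, an open file object can be passed in place of the list, which avoids reading the whole file with readlines() first.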

For reference, this is the content of the example FASTA file fasta.txt I used in this instance:

>seq0
FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
>seq1
KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq2
EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK
>seq3
MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK
>seq4
EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVVSYEMRLFGVQKDNFALEHSLL
>seq5
SWEEFAKAAEVLYLEDPMKCRMCTKYRHVDHKLVVKLTDNHTVLKYVTDMAQDVKKIEKLTTLLMR
>seq6
FTNWEEFAKAAERLHSANPEKCRFVTKYNHTKGELVLKLTDDVVCLQYSTNQLQDVKKLEKLSSTLLRSI
>seq7
SWEEFVERSVQLFRGDPNATRYVMKYRHCEGKLVLKVTDDRECLKFKTDQAQDAKKMEKLNNIFF
>seq8
SWDEFVDRSVQLFRADPESTRYVMKYRHCDGKLVLKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq9
KNWEDFEIAAENMYMANPQNCRYTMKYVHSKGHILLKMSDNVKCVQYRAENMPDLKK
>seq10
FDSWDEFVSKSVELFRNHPDTTRYVVKYRHCEGKLVLKVTDNHECLKFKTDQAQDAKKMEK

2 Comments

I am familiar with most of the code besides what is in the for statement. Also, it is not working properly for me. No errors but when the sequence occupies multiple lines it only seems to get the first one. Does it work with longer sequences for you?
That is a limitation of the approach I have here. I am going to update this code such that multiple lines can be used.
