sp|P46531|NOTC1_HUMAN Neurogenic locus notch homolog protein 1 OS=Homo sapiens GN=NOTCH1 PE=1 SV=4 MPPLLAPLLCLALLP

I have a FASTA file and I would like to search it for the beginning of the amino acid sequence. It would be something like:

aminoacids = ['A','C','D','E','F','G','H','I','K','L','M','N','P','Q','R','S','T','V','W','Y']
for filename in file_list:
    with open(filename, 'r') as fh:
        while True:
            char = fh.read(1)
            if not char:
                break  # end of file reached
            if char.upper() in aminoacids:
                # look for the 4 characters directly after it
                pass

If a character is in the amino acid list and the four characters after it are also in the list, I want to build a string that starts at that character and runs to the end of the file. For example, if M is found, I would check the next four characters (PPLL). If those are all amino acids, the string would start with M and continue to the end of the file.
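The scan described above could be sketched roughly as follows, assuming the file has already been read into a string. The function name find_sequence and the run_length parameter are made up for illustration. Unlike the loop above, this sketch is case-sensitive: with char.upper(), an ordinary lowercase word such as "sapiens" in the header would also count as a run of amino acids.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def find_sequence(text, run_length=5):
    """Return text from the first run of run_length amino acid letters
    onward, or None if there is no such run (hypothetical helper)."""
    run = 0  # length of the current run of amino acid letters
    for i, ch in enumerate(text):
        if ch in AMINO_ACIDS:
            run += 1
            if run == run_length:
                # the run started run_length - 1 characters before i
                return text[i - run_length + 1:]
        else:
            run = 0  # any other character breaks the run
    return None

print(find_sequence("GN=NOTCH1 PE=1 SV=4 MPPLLAPLLCLALLP"))
```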

  • The 4 characters you are looking for, are they in the fasta file or you mean if 'A' then print 'A' 'C' 'D' 'E'? Commented Jul 8, 2015 at 15:36
  • @Andy Wong My bad, I'll fix that wording. I mean in the file. I am looking for characters in the file. Commented Jul 8, 2015 at 15:39
  • How large is the file? Is it plausible to read in the entire file into memory at the start? (The only reason you might not is if you expected the sequence of amino acids to be very near the end of a very large file) Commented Jul 8, 2015 at 15:44
  • @DavidRobinson The file is not too large at the moment. I am only dealing with relatively small files. I have converted it into a string later, but I wanted to search the file first. If I have to, I could convert it into a string first. Would that be better? Commented Jul 8, 2015 at 15:45
  • Just FYI, every letter is a valid one-letter amino acid code, because some of them mean "it's either this one or that one, but we're not sure which; they're chemically similar". B is Asn/Asp, Z is Glu/Gln, X is unknown, J is Leu/Ile, U is selenocysteine, and O is pyrrolysine. Commented Jul 8, 2015 at 15:45

1 Answer

You can read in the file as a single string, and then search for a regular expression:

import re

# build a character class from the aminoacids list defined in the question
regex = re.compile("[%s]{5}.*" % "".join(aminoacids))

with open(filename, 'r') as fh:
    s = fh.read()
    aa_sequence = regex.findall(s)
    if len(aa_sequence) > 0:
        # an amino acid sequence was found
        print(aa_sequence[0])

This works because the regular expression that is constructed is:

[ACDEFGHIKLMNPQRSTVWY]{5}.*

which means "5 of these characters, followed by anything up to the end of the line" (by default, the dot does not match a newline).
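As a quick check, here is the same pattern applied to the header line from the question. The variable names line and matches are only for illustration. None of the uppercase fragments in the header (NOTC1, HUMAN, NOTCH1, and so on) contains five consecutive letters from the list, so the first match starts at MPPLL.

```python
import re

aminoacids = "ACDEFGHIKLMNPQRSTVWY"
regex = re.compile("[%s]{5}.*" % aminoacids)

# the example line from the question, split only to keep the source readable
line = ("sp|P46531|NOTC1_HUMAN Neurogenic locus notch homolog protein 1 "
        "OS=Homo sapiens GN=NOTCH1 PE=1 SV=4 MPPLLAPLLCLALLP")

matches = regex.findall(line)
print(matches)  # ['MPPLLAPLLCLALLP']
```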

Note that if your amino acid string may span multiple lines, you'll need to remove the newlines first, with:

s = fh.read().replace('\n', '')
# or
s = "".join(line.rstrip('\n') for line in fh)
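Putting the two steps together, a minimal sketch might look like the following. The helper name extract_sequence is hypothetical, and it uses re.search instead of findall because only the first match is needed. It also strips carriage returns, on the assumption that the file may have Windows line endings.

```python
import re

AA_RUN = re.compile(r"[ACDEFGHIKLMNPQRSTVWY]{5}.*")

def extract_sequence(text):
    """Join all lines, then return everything from the first run of five
    amino acid letters onward, or None if no run exists (hypothetical helper)."""
    flat = text.replace("\r", "").replace("\n", "")
    match = AA_RUN.search(flat)
    return match.group(0) if match else None

# a sequence wrapped over two lines is returned as one string
print(extract_sequence("header 123\nMPPLL\nAPLLC\n"))
```

Note that joining the lines also glues the header to the sequence, so this only works when the header itself contains no run of five uppercase amino acid letters.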

6 Comments

aa_sequence = regex.findall(s)[0] for this I get a list index out of range error...? I had completely forgotten that python had regex like perl, so thanks for that.
@varda1316 Ah: that means the file had no amino acid sequence. See edit for a version that tests for this
I don't think that this works because A) my file was not empty to begin with - there was an amino acid sequence B) now it just says that there is only 1 character in the file when I print len(aa_sequence)...? I think perhaps I have to put this in a loop for it to work?
@varda1316 It is not saying there is only 1 character in the file, it is saying it found one match. Print aa_sequence[0] and you will see that it is the matching string.
Ohh that works. Now about the list index out of range error that I got before, I actually had a file with only lowercase letters which was the problem. So basically aa_sequence is a list and when I print the length, it tells me there is one element in the list?
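To illustrate the two points from this exchange: findall returns a list with one element per match, and the pattern is case-sensitive, so an all-lowercase input yields an empty list (which is where the index error came from). The variable names here are only for illustration.

```python
import re

regex = re.compile("[ACDEFGHIKLMNPQRSTVWY]{5}.*")

upper_hits = regex.findall("MPPLLAPLLCLALLP")
lower_hits = regex.findall("mppllapllclallp")

print(len(upper_hits))  # number of matches, not number of characters
print(upper_hits[0])    # the matched string itself
print(lower_hits)       # empty list, so lower_hits[0] would raise IndexError
```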