Searching a string for a substring of characters in a list

Question

sp|P46531|NOTC1_HUMAN Neurogenic locus notch homolog protein 1 OS=Homo sapiens GN=NOTCH1 PE=1 SV=4 MPPLLAPLLCLALLP

I have a fasta file and I would like to search the file for the beginning of the amino acid sequence. It would be something like

aminoacids = ['A','C','D','E','F','G','H','I','K','L','M','N','P','Q','R','S','T','V','W','Y']
for filename in file_list:  # file_list: the FASTA files to scan
    with open(filename, 'r') as fh:
        while True:
            char = fh.read(1)
            if not char:
                break  # end of file
            if char.upper() in aminoacids:
                pass  # look for the 4 characters directly after it

If a character is in the amino acid list and the four characters after it are also in the list, I want to build a string that starts at that character and runs to the end of the file. For example, I would like to iterate through the file looking for characters. If M is found, I would then check the next four characters (PPLL). If those four are all amino acids, I would create a string starting with M and continuing to the end of the file.

The 4 characters you are looking for: are they in the fasta file, or do you mean if 'A' then print 'A' 'C' 'D' 'E'?
– Andy Wong, Commented Jul 8, 2015 at 15:36
@Andy Wong My bad, I'll fix that wording. I mean in the file. I am looking for characters in the file.
– mkpappu, Commented Jul 8, 2015 at 15:39
How large is the file? Is it plausible to read the entire file into memory at the start? (The only reason you might not is if you expected the sequence of amino acids to be very near the end of a very large file.)
– David Robinson, Commented Jul 8, 2015 at 15:44
@DavidRobinson The file is not too large at the moment. I am only dealing with relatively small files. I have converted it into a string later, but I wanted to search the file first. If I have to, I could convert it into a string first. Would that be better?
– mkpappu, Commented Jul 8, 2015 at 15:45
Just FYI, but every letter is a valid amino acid one-letter code, because there are some for "uh, it's either this one or that one but we're not sure; they're chemically similar". B is asn/asp. Z is glu/gln. X is unknown. J is leu/ile. U is selenocysteine and O is pyrrolysine.
– NightShadeQueen, Commented Jul 8, 2015 at 15:45

David Robinson · Accepted Answer · 2015-07-08 16:31:59Z

You can read in the file as a single string, and then search for a regular expression:

import re

# aminoacids is the list of one-letter codes from the question
regex = re.compile("[%s]{5}.*" % "".join(aminoacids))

with open(filename, 'r') as fh:
    s = fh.read()
    aa_sequence = regex.findall(s)
    if len(aa_sequence) > 0:
        # an amino acid sequence was found
        print(aa_sequence[0])

This works because the regular expression that is constructed is:

[ACDEFGHIKLMNPQRSTVWY]{5}.*

which means "5 of these characters in a row, followed by anything up to the end of the line" (by default, . does not match a newline).
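As a quick check of the pattern described above, here is a minimal sketch that runs it against the sample line from the question. It uses re.search instead of findall simply to pick out the first match; the names sample and first_hit are introduced here for illustration and are not part of the original answer.

```python
import re

# One-letter codes for the 20 standard amino acids (from the question).
aminoacids = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L',
              'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']

# Builds the pattern [ACDEFGHIKLMNPQRSTVWY]{5}.*
regex = re.compile("[%s]{5}.*" % "".join(aminoacids))

# The sample line from the question (header and sequence on one line).
sample = ("sp|P46531|NOTC1_HUMAN Neurogenic locus notch homolog protein 1 "
          "OS=Homo sapiens GN=NOTCH1 PE=1 SV=4 MPPLLAPLLCLALLP")

# search() returns the first match object, or None if nothing matches.
match = regex.search(sample)
first_hit = match.group(0) if match else None
print(first_hit)  # MPPLLAPLLCLALLP
```

The header contains no run of five consecutive uppercase letters from the list (NOTC1 is broken by the O and the digit, HUMAN by the U), so the first match starts at the M of the sequence.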

Note that if your amino acid string may span multiple lines, you'll need to remove the newlines first, with:

s = fh.read().replace('\n', '')
# or
s = "".join(line.rstrip('\n') for line in fh)
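One thing to watch when removing newlines: the header line gets glued onto the sequence as well. The sketch below is one possible way around that, assuming the file follows the usual FASTA convention of header lines starting with ">" (the sample in the question does not show one, so check your own files). The helper join_sequence_lines and the in-memory fake_file are hypothetical names used only for this illustration.

```python
import io
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
regex = re.compile("[%s]{5}.*" % AMINO_ACIDS)


def join_sequence_lines(fh):
    """Hypothetical helper: skip '>' header lines and join the remaining
    lines into one string with surrounding whitespace removed."""
    return "".join(line.strip() for line in fh if not line.startswith(">"))


# Stand-in for an open file. The header text and the position of the
# line break are made up for illustration.
fake_file = io.StringIO(
    ">sp|P46531|NOTC1_HUMAN example header\n"
    "MPPLLAPL\n"
    "LCLALLP\n"
)

joined = join_sequence_lines(fake_file)
hits = regex.findall(joined)
print(hits[0] if hits else "no amino acid sequence found")
```

With a real file, the same helper can be called on the handle from open(filename, 'r') in place of fake_file.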

edited Jul 8, 2015 at 16:31

answered Jul 8, 2015 at 15:49

David Robinson


6 Comments

mkpappu Over a year ago

With aa_sequence = regex.findall(s)[0] I get a list index out of range error...? I had completely forgotten that Python had regex like Perl, so thanks for that.

David Robinson Over a year ago

@varda1316 Ah: that means the file had no amino acid sequence. See edit for a version that tests for this

mkpappu Over a year ago

I don't think that this works because A) my file was not empty to begin with - there was an amino acid sequence B) now it just says that there is only 1 character in the file when I print len(aa_sequence)...? I think perhaps I have to put this in a loop for it to work?

David Robinson Over a year ago

@varda1316 It is not saying there is only 1 character in the file, it is saying it found one match. Print aa_sequence[0] and you will see that it is the matching string.

mkpappu Over a year ago

Ohh that works. Now about the list index out of range error that I got before, I actually had a file with only lowercase letters which was the problem. So basically aa_sequence is a list and when I print the length, it tells me there is one element in the list?
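To illustrate the two points from this exchange: findall returns a list of matches (empty when nothing matches, which is why indexing [0] raised the error), and the pattern as written only matches uppercase letters. The sketch below is one way to handle lowercase input, using the re.IGNORECASE flag; the names strict, relaxed and lowercase_text are assumptions made for this example.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Same pattern as in the answer, with and without case sensitivity.
strict = re.compile("[%s]{5}.*" % AMINO_ACIDS)
relaxed = re.compile("[%s]{5}.*" % AMINO_ACIDS, re.IGNORECASE)

# A sequence written entirely in lowercase, as in the problem file.
lowercase_text = "mppllapllclallp"

strict_hits = strict.findall(lowercase_text)    # [] : no uppercase run found
relaxed_hits = relaxed.findall(lowercase_text)  # one match, case ignored

print(len(strict_hits))   # 0
print(len(relaxed_hits))  # 1
print(relaxed_hits[0])    # mppllapllclallp
```

Be aware that ignoring case also makes ordinary words eligible: for example, the letters "genic" in "Neurogenic" are all in the list, so on a file that still contains its header the match could start inside the header text. Stripping the header first, or calling .upper() on the sequence only, avoids that.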
