I have a text file (CA-Final.txt) full of amino acids as well as some other data. Here is a snippet of the file:

ATOM    109  CA ASER A  48      10.832  19.066  -2.324  0.50 61.96           C  
ATOM    121  CA AALA A  49      12.327  22.569  -2.163  0.50 60.22           C  
ATOM    131  CA AGLN A  50       8.976  24.342  -1.742  0.50 56.71           C  
ATOM    145  CA APRO A  51       7.689  25.565   1.689  0.50 51.89           C  
ATOM    158  CA  GLN A  52       5.174  23.336   3.467  1.00 43.45           C  
ATOM    167  CA  HIS A  53       2.339  24.135   5.889  1.00 38.39           C  
ATOM    177  CA  PHE A  54       0.900  22.203   8.827  1.00 33.79           C  
ATOM    188  CA  TYR A  55      -1.217  22.065  11.975  1.00 34.89           C  
ATOM    200  CA  ALA A  56       0.334  20.465  15.090  1.00 31.84           C  
ATOM    205  CA  VAL A  57       0.000  20.066  18.885  1.00 30.46           C  
ATOM    212  CA  VAL A  58       2.738  21.762  20.915  1.00 27.28           C 

Essentially, my problem is that a few of the amino acids have the letter A in front of them where it is not supposed to be. Amino acid abbreviations are supposed to be 3 letters long. I have attempted to use regular expressions to remove the A wherever it appears in front of an amino acid abbreviation. Here is my code so far:

def Trimmer(txtFileName):
    i = open('CA-final.txt', 'w')
    j = open(txtFileName, 'r')
    for record in j:
        with open(txtFileName, 'r') as j:
            content= j.read()
            content_new = re.sub('^ATOM\s+\d+\s+CA\s+A[ADTSEPGCVMILYFHKRWQN]', r'^ATOM\s+\d+\s+CA\s+[ADTSEPGCVMILYFHKRWQN]', content, flags = re.M)

When I run the function, it raises an error:

 File "C:\Users\UserName\AppData\Local\conda\conda\envs\biopython\lib\sre_parse.py", line 1024, in parse_template
    raise s.error('bad escape %s' % this, len(this)) 

error: bad escape \s

My idea is that this function will find every instance of an A in front of a string of 3 characters and replace it with just the 3 other characters. Why exactly am I getting this error?

  • Do not use a regex pattern in the replacement string. It is not supposed to work like this. Commented Nov 1, 2018 at 19:30
  • Try re.sub(r'^(ATOM\s+\d+\s+CA\s+)A', r'\1', content, flags = re.M) Commented Nov 1, 2018 at 19:32
  • Is the file tab delimited? Why not parse the file a bit instead of applying regex to each row? Also your "for record" and "with open" lines are redundant (they do the same thing). Commented Nov 2, 2018 at 15:01

2 Answers

As far as I know, the easiest way to achieve your goal is to parse the file with Biopython, since it is a PDB file.

Let's analyze the following script:

#!/usr/bin/env python3
import sys

import Bio
from Bio.PDB import PDBIO, PDBParser, Select

print("Biopython v" + Bio.__version__)

# Parse the structure and read basic header information
parser = PDBParser()
protein_1p49 = parser.get_structure('STS', '1p49.pdb')
protein_1p49_resolution = protein_1p49.header["resolution"]
protein_1p49_keywords = protein_1p49.header["keywords"]

print("Sample name: " + str(protein_1p49))
print("Resolution: " + str(protein_1p49_resolution))
print("Keywords: " + str(protein_1p49_keywords))
print("Model: " + str(protein_1p49[0]))

# Initialize IO
io = PDBIO()

# Custom selection: keep only CA atoms and check residue names
class CASelect(Select):
    def accept_model(self, model):
        return True
    def accept_chain(self, chain):
        return True
    def accept_residue(self, residue):
        resname = residue.get_resname()
        print("residue name: " + resname)
        # Abort if a residue name is longer than the expected 3 letters
        if len(resname) > 3:
            print("Alert! Abbreviation longer than 3 letters: " + resname)
            sys.exit(1)
        return True
    def accept_atom(self, atom):
        # Keep only alpha-carbon atoms
        return atom.get_name() == 'CA'

# Write the selected atoms to the output file
io.set_structure(protein_1p49)
io.save("1p49_out.pdb", CASelect())

It parses a PDB structure and uses the built-in Biopython class PDBIO to save selected parts of the protein structure. Notice that you can put custom logic within the Select subclass.

In this example, I used the accept_residue method to report abnormally named residues in my protein structure. You can easily extend this and perform simple string trimming inside this method.
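The same trimming can also be sketched on the raw lines, without Biopython. This assumes the snippet in the question follows the fixed-column PDB ATOM layout, in which column 17 is the alternate-location indicator and columns 18-20 hold the residue name; under that reading the extra A is an altLoc flag rather than part of the abbreviation. The function name `blank_altloc` is hypothetical, chosen only for this illustration.

```python
def blank_altloc(line):
    """Blank column 17 (altLoc) on CA ATOM records; leave other lines unchanged."""
    # PDB columns are 1-based, so column 17 is string index 16
    # and the atom name (columns 13-16) is the slice [12:16].
    if line.startswith("ATOM") and len(line) > 16 and line[12:16].strip() == "CA":
        return line[:16] + " " + line[17:]
    return line

# Two lines copied from the question: one with the extra A, one without.
lines = [
    "ATOM    109  CA ASER A  48      10.832  19.066  -2.324  0.50 61.96           C",
    "ATOM    158  CA  GLN A  52       5.174  23.336   3.467  1.00 43.45           C",
]
fixed = [blank_altloc(line) for line in lines]
for line in fixed:
    print(line)
```

Replacing the character with a space, rather than deleting it, keeps every later column in place. If the A really is an altLoc flag, blanking it also discards the information that these atoms are alternate conformations (note the 0.50 occupancy on those lines), so this may or may not be what you want.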

Your regex will fail if the first of the three letters is an 'A'. Try this instead:

(^ATOM\s+\d+\s+CA\s+)A(\w\w\w)

It creates two groups that capture what comes before and after the extra 'A'.

Then replace with the two groups:

\1\2
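As a minimal sketch, the pattern and replacement above could be wired into Python's `re.sub` as follows. The original error arises because the replacement argument is a template, not a pattern: it accepts group references such as `\1`, while `\s` is not a valid escape there (Python 3.7 and later raise an error for it). The helper name `strip_extra_a` and the sample text are assumptions made for this illustration.

```python
import re

# Pattern from the answer: group 1 is everything before the extra 'A',
# group 2 is the three-letter residue name after it.
EXTRA_A = re.compile(r'(^ATOM\s+\d+\s+CA\s+)A(\w\w\w)', flags=re.M)

def strip_extra_a(text):
    """Drop the extra 'A' in front of residue names on CA ATOM lines."""
    # The replacement only uses group references, never pattern syntax.
    return EXTRA_A.sub(r'\1\2', text)

sample = (
    "ATOM    109  CA ASER A  48      10.832  19.066  -2.324  0.50 61.96           C\n"
    "ATOM    121  CA AALA A  49      12.327  22.569  -2.163  0.50 60.22           C\n"
    "ATOM    158  CA  GLN A  52       5.174  23.336   3.467  1.00 43.45           C\n"
    "ATOM    200  CA  ALA A  56       0.334  20.465  15.090  1.00 31.84           C\n"
)
cleaned = strip_extra_a(sample)
print(cleaned)
```

Note that deleting the character shifts the rest of each matched line one column to the left. If the file has to remain a valid fixed-column PDB file, using `r'\1 \2'` as the replacement (a space in place of the A) would preserve the alignment.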

1 Comment

This generated a 16.7 MB text file
