Protein databank file chain, segment and residue number modifier

Question

My first non-trivial bash script after fully moving to Linux Mint last week is attached (modify_cccp_bundles.in). The script performs on protein databank (PDB) files that have been generated using an online CCCP (Coiled-coil Crick Parameterization) server. To use the generated PDB files for my own ends, I am using Rosetta (protein folding software) and phenix (protein modification software) that I have either set in my .bashrc file or have set a source file in my home to. The operations are:

add side chains and hydrogens to the CCCP templates using Rosetta
clear the segment IDs of the PDBs
change the residue sequences for all 4 chains so that they are continuous based on chain A (increment_resseq, there are 28 residues per chain)
add a segment ID of A to the files (first awk, segment ID is in column 75)
remove terminating lines from the file (there are 3 in-between the 4 chains that need removing)
change the chain IDs of all the chains to A (chain ID is in column 22)

Once finished, I have files that are continuously numbered, the same segment ID and chain ID with side chains and hydrogen added that look like this:

Loops are added by another process afterwards. This is my code:

#!/bin/bash

# --- bash script to modify CCCP server generated PDB bundle templates
#
# CCCP:    https://www.grigoryanlab.org/cccp/
# Phenix:  https://phenix-online.org/documentation/index.html
# Rosetta: https://www.rosettacommons.org/

# ------------------------------------------------------------------- FUNCTIONS

# --- extract first character from string and return capitalised
res_one_letter_cap () {
    abbrev=$(printf %.1s "$1")
    Abbrev="${abbrev^}"
    echo $Abbrev
}

# --- write Rosetta residue file for all Ala or Gly
write_resfiles () {
    res_abbrev=$(res_one_letter_cap $1)
    file_name="all_${1}_resfile.res"
    if [ -e $file_name ] # delete existing resfile so it is not being appended
    then
        rm $file_name
    fi
    echo -e $"PIKAA ${res_abbrev}\nstart" >> $file_name
}

# --- run rosetta fix backbone to add side chains and hydrogen
run_rosetta_fixbb () {
    for pdb in ../../../CCCP/poly-${1}/*; do
        fixbb.default.linuxgccrelease -s $pdb -resfile ../all_${1}_resfile.res \
        -nstruct 1 -ex1 -ex2 -database $ROSETTA3_DB
    done
}

# ------------------------------------------------------------------- CONSTANTS

residues='ala gly' # residue names
segID=A            # segment ID
chainID=A          # chain ID

# ------------------------------------------------------------------- SOURCE

#source ~/set_Phenix.sh # if path needs setting to phenix (set in ~/.bashrc)
source ~/set_Rosetta.sh # path to script containing rosetta_###/main/source/bin
                        # and the rosetta database path $ROSETTA3_DB

# ------------------------------------------------------------------- MAIN

# --- Rosetta fixed backbone: add side chains and hydrogen to PDB structures

mkdir ./rosetta && cd "$_"
for res in $residues; do
    write_resfiles $res
    mkdir ./poly_${res} && cd "$_"
    run_rosetta_fixbb $res
    rm scores.sc # scores.sc file not needed
    cd ../
    for pdb in ./poly_${res}/*; do # remove '_0001' Rosetta adds to filenames
        ext="${pdb##*.}"
        filename="${pdb%.*}"
        mv $pdb "${filename%?????}".$ext
    done
done

# --- phenix tools: change chain, segment ID's and renumber residues from
# --- constant offset defined in CONSTANTS

tmpfile="$(mktemp)"
phenixtmp=phenix.pdb

mkdir ../phenix/ && cd "$_"

for res in $residues; do
    mkdir -p ../templates/poly_${res}
    for file in ../rosetta/poly_${res}/*; do
        filename="${file##*/}"
        cp $file $filename
        phenix.pdbtools $filename clear_seg_id=true output.filename=$phenixtmp
        phenix.pdbtools $phenixtmp modify.selection="chain B" \
            increment_resseq=28 output.filename=$phenixtmp
        phenix.pdbtools $phenixtmp modify.selection="chain C" \
            increment_resseq=56 output.filename=$phenixtmp
        phenix.pdbtools $phenixtmp modify.selection="chain D" \
            increment_resseq=84 output.filename=$phenixtmp
        # to add 'A' as segment
        awk '{print substr($0,1,75) v substr($0,76)}' v=$segID $phenixtmp \
        > "$tmpfile" && mv "$tmpfile" $phenixtmp
        # remove TER lines
        awk '!/TERA/' $phenixtmp > "$tmpfile" && mv "$tmpfile" $phenixtmp
        # change chain to A
        awk '{print substr($0,1,21) v substr($0,23)}' v=$chainID $phenixtmp\
        > "$tmpfile" && mv "$tmpfile" $phenixtmp
        # copy to template dir
        mv $phenixtmp ../templates/poly_${res}/$filename
        rm $filename
    done
done

The folder structure looks like this afterwards (using just two PDBs, I have 64 but could in the future be more):

CCCP/
    poly-ala/
        ala_0001.pdb
        ala_0002.pdb
    poly-gly/
        gly_0001.pdb
        gly_0002.pdb
cccp_template_mods/
    modify_cccp_bundles.in
    rosetta/
        poly_ala/
            ala_0001.pdb
            ala_0001.pdb
        poly_gly/
            gly_0001.pdb
            gly_0001.pdb
        all_ala_resfile.res
        all_gly_resfile.res
    phenix/
    templates/
        poly_ala/
            ala_0001.pdb
            ala_0002.pdb
        poly_gly/
            gly_0001.pdb
            gly_0002.pdb

The ~/set_Rosetta.sh file contains:

export PATH=/usr/local/rosetta.source.release-340/main/source/bin:$PATH
export PYTHONPATH=/usr/local/rosetta.source.release-340/main/pyrosetta_scripts/apps/:$PYTHONPATH
export ROSETTA3=/usr/local/rosetta.source.release-340/main/
export ROSETTA3_DB=/usr/local/rosetta.source.release-340/main/database/

export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/rosetta.source.release-340/main/source/build/external/release/linux/5.15/64/x86/gcc/11/default:$LD_LIBRARY_PATH

The generated all_{ala/gly}_resfile.res contain (# is a for ala, g for gly):

PIKAA #
start

There is no extra ending blank line as this breaks Rosetta functions. Each numbered PDB is the same but with different residue names i.e. ala_0001.pdb is the alanine equivalent of the glycine backbone gly_0001.pdb.

The code could be improved by removing the non-templates directories but I am hesitant to include rm -rf in a bash script, I think it could all be done in one loop too. The phenix software will not accept a relative file in the input hence the need for phenixtmp. I considered using a case function for the phenix modifications but couldn't work out the best way to do so other than a loop through chains in a list and times the magic number 28 by an index whilst calling the phenix.pdbtools.

I am more used to TeX.SE etiquette so if I have missed something out of this question please let me know.

J_H · Accepted Answer · 2023-06-30 23:22:14Z

lint

You should run shellcheck on this, and follow its advice. Maybe your directory names lack SPACE characters ATM, but that might change in future, so it's worthwhile to habitually quote $file_name expressions.

In write_resfiles, it's enough to unconditionally rm -f "$file_name". No harm done if it doesn't exist, so no need to check beforehand.

portability

I found this a little hard to read:


    echo -e $"PIKAA ${res_abbrev}\nstart"

I think you were (roughly) writing echo $"\n", same as echo -e "\n". To use both left me scratching my head. Typical solution would be to prefer printf over echo, here.

You should document the versions of tools you tested against, including $BASH_VERSINFO and the MacOS or linux distro OS version.

conservative globbing

    for pdb in ../../../CCCP/poly-${1}/*; do

The trailing /* is perhaps a little more permissive than necessary. If those files share *.pdb or another common suffix, recommend you include that in the glob.

Not putting do on its own line is idosyncratic, but ok, whatever. Bash displays the parsed code a bit differently with: $ type run_rosetta_fixbb.

idempotency

mkdir ./rosetta

Sometimes things don't go smoothly and we need to edit / re-run several times. Consider using mkdir -p here, so it's not an error if the directory already exists.

use conservative settings

Consider putting "strict mode" near the top of this file:

set -u -e -o pipefail

That causes

unset / typo variable names to give error rather than empty string,
immediate bailout on error (any command's non-zero return status), and
same bailout if command within a pipeline fails

That would make it easier to reason about "what is $PWD ?" after several {mkdir, cd} combos.

directory changing

Consider writing a helper function: mkdir -p "$DIR"; cd "$DIR"

    cd ../

Certainly one can write cd foo; cmd1; cd ..; cmd2

You may find it more convenient to fork a subshell:

    (cd foo && cmd1)
    cmd2

$CWD changes in the subshell, but not in the parent, so cmd2 does not run down in the foo folder.

You have several stanzas of "do stuff in a subdir" and then next stanza will move on to another subdir to do more stuff. Consider breaking out each stanza into a named function, so your top-level logic is short and simple.

use conservative regex

        awk '!/TERA/' $phenixtmp

Maybe that's perfect the way it is. But if you can anchor with ^ caret, /^TERA/, or anchor with $ dollar, /TERA$/, then do so. It cuts down on surprises, and it makes your regex go faster.

Consider using grep -v for this simple task.

Consider using sed -i to filter in place, without needing a mv.

user feedback

Not sure how long this takes to run, nor whether 100% of all runs go successfully to the end.

You might want to set user expectations by using du to estimate the amount of input data to be processed. Logging "working on X" every minute or every few seconds might be helpful. Even if that's too chatty, printing out "Finished" at the end wouldn't hurt. Some scripts do one thing well, and compose nicely in pipelines. This script seems a bit bigger than that.

Consider doing this for a command cmd that takes a while to run:

(set -x && cmd some_input.pdb)

The ( ) subshell limits the effect of -x to just displaying the single command along with its arguments.

This code appears to achieve its design goals.

It relies on working versions of several tools to be installed. Adding the install procedures, with an example console session, would reduce the cost of onboarding a newly hired engineer.

It would be helpful to maintainers to freeze a set of example inputs and output, together with generated console output, perhaps in a .ZIP archive. Given that, I would be willing to delegate or accept maintenance tasks on this script.

Stack Exchange Network

Protein databank file chain, segment and residue number modifier

1 Answer 1

lint

portability

conservative globbing

idempotency

use conservative settings

directory changing

use conservative regex

user feedback

You must log in to answer this question.

Linked

Hot Network Questions

Protein databank file chain, segment and residue number modifier

1 Answer 1

lint

portability

conservative globbing

idempotency

use conservative settings

directory changing

use conservative regex

user feedback

You must log in to answer this question.

Linked

Related

Hot Network Questions