My first non-trivial bash script after fully moving to Linux Mint last week is attached (modify_cccp_bundles.in). The script performs on protein databank (PDB) files that have been generated using an online CCCP (Coiled-coil Crick Parameterization) server. To use the generated PDB files for my own ends, I am using Rosetta (protein folding software) and phenix (protein modification software) that I have either set in my .bashrc file or have set a source file in my home to. The operations are:
- add side chains and hydrogens to the CCCP templates using Rosetta
- clear the segment IDs of the PDBs
- change the residue sequences for all 4 chains so that they are continuous based on chain A (
increment_resseq, there are 28 residues per chain) - add a segment ID of A to the files (first
awk, segment ID is in column 75) - remove terminating lines from the file (there are 3 in-between the 4 chains that need removing)
- change the chain IDs of all the chains to A (chain ID is in column 22)
Once finished, I have files that are continuously numbered, the same segment ID and chain ID with side chains and hydrogen added that look like this:
Loops are added by another process afterwards. This is my code:
#!/bin/bash
# --- bash script to modify CCCP server generated PDB bundle templates
#
# CCCP: https://www.grigoryanlab.org/cccp/
# Phenix: https://phenix-online.org/documentation/index.html
# Rosetta: https://www.rosettacommons.org/
# ------------------------------------------------------------------- FUNCTIONS
# --- extract first character from string and return capitalised
res_one_letter_cap () {
abbrev=$(printf %.1s "$1")
Abbrev="${abbrev^}"
echo $Abbrev
}
# --- write Rosetta residue file for all Ala or Gly
write_resfiles () {
res_abbrev=$(res_one_letter_cap $1)
file_name="all_${1}_resfile.res"
if [ -e $file_name ] # delete existing resfile so it is not being appended
then
rm $file_name
fi
echo -e $"PIKAA ${res_abbrev}\nstart" >> $file_name
}
# --- run rosetta fix backbone to add side chains and hydrogen
run_rosetta_fixbb () {
for pdb in ../../../CCCP/poly-${1}/*; do
fixbb.default.linuxgccrelease -s $pdb -resfile ../all_${1}_resfile.res \
-nstruct 1 -ex1 -ex2 -database $ROSETTA3_DB
done
}
# ------------------------------------------------------------------- CONSTANTS
residues='ala gly' # residue names
segID=A # segment ID
chainID=A # chain ID
# ------------------------------------------------------------------- SOURCE
#source ~/set_Phenix.sh # if path needs setting to phenix (set in ~/.bashrc)
source ~/set_Rosetta.sh # path to script containing rosetta_###/main/source/bin
# and the rosetta database path $ROSETTA3_DB
# ------------------------------------------------------------------- MAIN
# --- Rosetta fixed backbone: add side chains and hydrogen to PDB structures
mkdir ./rosetta && cd "$_"
for res in $residues; do
write_resfiles $res
mkdir ./poly_${res} && cd "$_"
run_rosetta_fixbb $res
rm scores.sc # scores.sc file not needed
cd ../
for pdb in ./poly_${res}/*; do # remove '_0001' Rosetta adds to filenames
ext="${pdb##*.}"
filename="${pdb%.*}"
mv $pdb "${filename%?????}".$ext
done
done
# --- phenix tools: change chain, segment ID's and renumber residues from
# --- constant offset defined in CONSTANTS
tmpfile="$(mktemp)"
phenixtmp=phenix.pdb
mkdir ../phenix/ && cd "$_"
for res in $residues; do
mkdir -p ../templates/poly_${res}
for file in ../rosetta/poly_${res}/*; do
filename="${file##*/}"
cp $file $filename
phenix.pdbtools $filename clear_seg_id=true output.filename=$phenixtmp
phenix.pdbtools $phenixtmp modify.selection="chain B" \
increment_resseq=28 output.filename=$phenixtmp
phenix.pdbtools $phenixtmp modify.selection="chain C" \
increment_resseq=56 output.filename=$phenixtmp
phenix.pdbtools $phenixtmp modify.selection="chain D" \
increment_resseq=84 output.filename=$phenixtmp
# to add 'A' as segment
awk '{print substr($0,1,75) v substr($0,76)}' v=$segID $phenixtmp \
> "$tmpfile" && mv "$tmpfile" $phenixtmp
# remove TER lines
awk '!/TERA/' $phenixtmp > "$tmpfile" && mv "$tmpfile" $phenixtmp
# change chain to A
awk '{print substr($0,1,21) v substr($0,23)}' v=$chainID $phenixtmp\
> "$tmpfile" && mv "$tmpfile" $phenixtmp
# copy to template dir
mv $phenixtmp ../templates/poly_${res}/$filename
rm $filename
done
done
The folder structure looks like this afterwards (using just two PDBs, I have 64 but could in the future be more):
CCCP/
poly-ala/
ala_0001.pdb
ala_0002.pdb
poly-gly/
gly_0001.pdb
gly_0002.pdb
cccp_template_mods/
modify_cccp_bundles.in
rosetta/
poly_ala/
ala_0001.pdb
ala_0001.pdb
poly_gly/
gly_0001.pdb
gly_0001.pdb
all_ala_resfile.res
all_gly_resfile.res
phenix/
templates/
poly_ala/
ala_0001.pdb
ala_0002.pdb
poly_gly/
gly_0001.pdb
gly_0002.pdb
The ~/set_Rosetta.sh file contains:
export PATH=/usr/local/rosetta.source.release-340/main/source/bin:$PATH
export PYTHONPATH=/usr/local/rosetta.source.release-340/main/pyrosetta_scripts/apps/:$PYTHONPATH
export ROSETTA3=/usr/local/rosetta.source.release-340/main/
export ROSETTA3_DB=/usr/local/rosetta.source.release-340/main/database/
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/rosetta.source.release-340/main/source/build/external/release/linux/5.15/64/x86/gcc/11/default:$LD_LIBRARY_PATH
The generated all_{ala/gly}_resfile.res contain (# is a for ala, g for gly):
PIKAA #
start
There is no extra ending blank line as this breaks Rosetta functions. Each numbered PDB is the same but with different residue names i.e. ala_0001.pdb is the alanine equivalent of the glycine backbone gly_0001.pdb.
The code could be improved by removing the non-templates directories but I am hesitant to include rm -rf in a bash script, I think it could all be done in one loop too. The phenix software will not accept a relative file in the input hence the need for phenixtmp. I considered using a case function for the phenix modifications but couldn't work out the best way to do so other than a loop through chains in a list and times the magic number 28 by an index whilst calling the phenix.pdbtools.
I am more used to TeX.SE etiquette so if I have missed something out of this question please let me know.
