Quick Start: 1. Build Workflow
EnzyHTP is a python library which means it doesn’t have any direct executables but instead provides modular functions as building blocks of any customized workflow from users. In other word, user needs to writing a main script calling EnzyHTP functions to furnish their automated workflow.
1. Modify from a Template
As a quick-start, we will modify from a main script template
provided under the /template folder of EnzyHTP.
Explain functions used in the template
Talk about how you can be modify them to match your own need.
This will result in a main script. Running this main script will perform a HTP workflow addressing your research goal.
Note
Note
The code below are sections from template/template_main.py
1.1 Workflow Configurations
This part of the code configure the workflow with some settings, I will break it down into chunks and explain each of them. You can modify this part of the template to change the configuration of the workflow.
import pickle
from enzy_htp.preparation import protonate_stru, remove_hydrogens
from enzy_htp.mutation import assign_mutant, mutate_stru
from enzy_htp.geometry import equi_md_sampling
from enzy_htp.quantum import single_point
from enzy_htp.analysis import bond_dipole, ele_field_strength_at_along, ele_stab_energy_of_bond
from enzy_htp import interface
import enzy_htp.structure.structure_constraint as stru_cons
from enzy_htp.structure import PDBParser
from enzy_htp.chemical.level_of_theory import QMLevelOfTheory
from enzy_htp.core.clusters.accre import Accre
This import EnzyHTP functions/classes to this main script. We won’t change anything here in our quick-start.
# workflow config
# I/O path
wt_pdb_path = "KE_07_R7_2_S.pdb"
result_path = "ke_test_result.pickle"
ligand_chrg_spin_mapper = {"H5J" : (0,1)} # define the charge spin for ligands and modified AAs
This part configures input and output file path of the workflow.
Click to see code explanation
wt_pdb_paththe WT pdb path. This structure needs to be a structure that contains no missing parts (except for the hydrogens), no wrong parts, and no redundant parts (except water). (There will be less requirements for this input in the next version of EnzyHTP with the docking module)
result_paththe path of the result file in .pickle format (this file will be generated in this path after the workflow)
ligand_chrg_spin_mapperwe ask users to assign the charge and spin for ligands and modified amino acids in the protein. they are necessary for the model.
Note
Where to modify? (Examples)
change path of the PDB file
assign a different charge spin to model a radical cation
ligand_chrg_spin_mapper = {"H5J" : (1,2)}
# HPC job resources
md_hpc_job_config = {
"cluster" : Accre(),
"res_keywords" : {
"account" : "your_account",
"partition" : "your_partition"
}
}
qm_hpc_job_config = {
"cluster" : Accre(),
"res_keywords" : {
"account" : "your_account",
"partition" : "your_partition",
'walltime' : '1-00:00:00',
}
}
result_dict = {}
This part configures the resource on the HPC you want to use.
Click to see code explanation
There will be MD simulation and QM calculation in this workflow. We configure the resource and the HPC environment for them here.
md_hpc_job_configqm_hpc_job_configthe resource for MD and QM See here for more details about the config format
result_dicta place holder for collect the results of the workflow. no need to change this.
Note
Where to modify? (Examples)
Support your local cluster by changing
"cluster" : Name_of_your_cluster()(Check the Tutorial of supporting your local cluster.)For Accre user, use a real account by changing
'account':'your_real_account_name'Change the number of cores and memory
qm_hpc_job_config = { "cluster" : Accre(), "res_keywords" : { "account" : "your_account", "partition" : "your_partition", "walltime" : "1-00:00:00", "node_cores" : "24", "mem_per_core" : "3G", } }
1.2 Workflow Body
This following parts assemble EnzyHTP functions to a workflow, loops through mutants, and calculate their properties
# 1. create Structure()
wt_stru = PDBParser().get_structure(wt_pdb_path)
# 2. prepare
remove_hydrogens(wt_stru, polypeptide_only=True)
protonate_stru(wt_stru, protonate_ligand=False)
This 1st & 2nd part create the Structure() and prepares the enzyme.
We will not modify this part in this quick start.
Click to see code explanation
wt_stru- In EnzyHTP,
Structureserve as a center class that describes the structure of a protein. Most functions in EnzyHTP are operations of Structure (i.e.: change a Residue() to mutate or add Atom() to fixs missing hydrogens, etc.)AStructure()object is created by reading from a PDB file here usingPDBParser. See also other ways to obtain aStructure()here remove_hydrogensthis function remove hydrogens from the WT structure. See more details here
protonate_struthis function protonate the structure with the correct titration state.
# 3. create mutant library
mutant_pattern = "WT, r:2[resi 254 around 4 and not resi 101: all not self]*2"
mutants = assign_mutant(wt_stru, mutant_pattern)
for i, mut in enumerate(mutants):
mutant_result = []
mutant_dir = f"mutant_{i}"
# 4. mutate Structure()
mutant_stru = mutate_stru(wt_stru, mut, engine="pymol")
mutant_stru.assign_ncaa_chargespin(ligand_chrg_spin_mapper)
remove_hydrogens(mutant_stru, polypeptide_only=True)
protonate_stru(mutant_stru, protonate_ligand=False)
This 3rd & 4th part create a target mutant library.
Click to see code explanation
assign_mutantthis powerful function allows you to create mutant library in a flexible manner. You can apply mutation strageties such as site-saturation mutagenesis, mutating residues from large to small, random mutation, etc. by writing different mutant_pattern. (See syntax here)
After creating the library, we use a for loop in Python to study each mutant in the library in each loop.
For each loop, a mutant_result is initiated as a place holder for the result for this mutant. The mutant_dir
is determined as the sub-directory that contains MD files for this mutant.
For each mutant,
mutate_struthis function mutate the Struture() and create the corresponding mutant Structure() (See details here)
We then do another round of protonation to consider a perturbed protonation state of the protein mutant.
Note
Where to modify? (Examples)
Study a specific mutant by changing the
mutant_pattern(See syntax here)mutant_pattern = "R154W"
# 5. sampling with MD
param_method = interface.amber.build_md_parameterizer(
ncaa_param_lib_path=f"ncaa_lib",
force_fields=[
"leaprc.protein.ff14SB",
"leaprc.gaff2",
"leaprc.water.tip3p",
],
)
mut_constraints = [
stru_cons.create_distance_constraint("B.254.H2", "A.101.OE2", 2.4, mutant_stru),
stru_cons.create_angle_constraint("B.254.CAE", "B.254.H2", "A.101.OE2", 180.0, mutant_stru),
]
md_result = equi_md_sampling(
stru = mutant_stru,
param_method = param_method,
cluster_job_config = md_hpc_job_config,
job_check_period=10,
prod_constrain=mut_constraints,
prod_time= 0.1, #ns
work_dir=f"{mutant_dir}/MD/"
)
for replica_esm in md_result:
replica_result = []
This 5th part sample a geometrical ensemble for the enzyme. (still in the loop)
Click to see code explanation
param_method- this defines the force field and the MD engine you want to use.Here we used Amber with ff14SB, TIP3P, and GAFF2
mut_constraints- this defines the constraint added in the MD.here we add distance and angle constraint between the 2 atoms of the forming bond to create a TS-analog
equi_md_sampling- this function runs the MD simulation. It will submit a job to the HPC.
prod_timespecifies the length of the MD simulation. We use 0.1 ns here for demo purpose
The result of the MD simulation are replica trajectories from the production run. (by default 3 replica) We loop through it to calculate mutant properties for each replica.
Note
Where to modify? (Examples)
change
prod_time = 100.0to change the length of the MDchange the force field by changing the names in
force_fieldparam_method = interface.amber.build_md_parameterizer( ncaa_param_lib_path=f"ncaa_lib", force_fields=[ "leaprc.protein.ff19SB", "leaprc.gaff2", "leaprc.water.tip4p", ], )
apply customized ligand parameters by adding files under the ncaa_param_lib_path
See section 2.1 for example
apply a different constraint by changing the atom key and the constraint value
mut_constraints = [ stru_cons.create_distance_constraint("B.254.H2", "A.101.OE2", 1.8, mutant_stru), ] # format: chain_id.residue_idx.atom_name
run only 1 replica of the MD
by default 3 replicas are run. You can set it to only run 1 by set
parallel_runs = 1inequi_md_sampling
# 6. electronic structure
qm_results = single_point(
stru=replica_esm,
engine="gaussian",
method=QMLevelOfTheory( basis_set="3-21G", method="hf" ),
regions=["resi 101+254"],
cluster_job_config=qm_hpc_job_config,
job_check_period=60,
job_array_size=20,
work_dir=f"{mutant_dir}/QM_SPE/",
)
This 6th part calculate wavefunction for active site of the enzyme using QM cluster. (still in the loop)
Click to see code explanation
replica_resultthis is the place holder of each replica
single_pointthis function calculates wavefunction for each frame in a trajectory (as a
StructureEnsemble()) using QM. (See here for more details)
Note
Where to modify? (Examples)
change QM region by changing
regions = ['resi 123+456+789'](this follows PyMol selection syntax)change QM level of theory by changing
method=QMLevelOfTheory( basis_set="def2-svp", method="b3lyp-d3" ).You can also remove this whole section if you don’t want to do QM.
# 7. analysis
for ele_stru in qm_results:
this_frame_stru = ele_stru.geometry.topology
atom_1 = this_frame_stru.get("B.254.CAE")
atom_2 = this_frame_stru.get("B.254.H2")
# bond dipole
dipole = bond_dipole(
ele_stru, atom_1, atom_2,
work_dir=f"{mutant_dir}/bond_dipole/"
)
# EF
field_strength = ele_field_strength_at_along(
this_frame_stru, atom_1, atom_2, region_pattern="chain A and (not resi 101)"
)
# dGele
dg_ele = ele_stab_energy_of_bond(dipole[0], field_strength)
replica_result.append((dg_ele, dipole, field_strength))
mutant_result.append(replica_result)
result_dict[tuple(mut)] = mutant_result
This 7th part calculate properties for mutants. (still in the loop)
Click to see code explanation
With the wavefunctions generated by QM and trajectories generated by MM,
we calculate the reacting bond dipole moment (bond_dipole),
enzyme’s internal electric field strength (ele_field_strength_at_along),
and electrostatic stablization energy (ele_stab_energy_of_bond).
Most of the code are self-explaining in this part. I will explain for:
region_patternThis defines the region that EnzyHTP use to calculate the electric field strength.
Note
Where to modify? (Examples)
keep only functions that calculate the properties your want.
calcualte EF and dipole for a different bond by change
atom_1andatom_2.add functions to calculate other properties like
# MMPBSA # TODO update this ligand_mask = ":902" mmpbsa_result_dict = pdb_obj.get_mmpbsa_binding( ligand_mask, cluster=Accre(), res_setting = {'account':'yang_lab'})
# save the result
with open(result_path, "wb") as of:
pickle.dump(result_dict, of)
This last part save our results for each mutant to the output file.
2. Running the Workflow
Now we finished customizing the workflow. It is the time for launching it.
2.1 Configure the working dir
Here is what your working directory should look like before the launching:
.
├── template_main.py
├── template_hpc_submission.sh
├── your_target_wt_enzyme.pdb
└── ncaa_lib # (optional) add this when you customize ligand parameters
├── XYZ_AM1BCC-GAFF2.frcmod # XYZ is the ligand 3-letter code
└── XYZ_AM1BCC-GAFF2.prepin # the "AM1BCC-GAFF2" is used to identify the method (upper case)
(template/template_wk_dir give an example pdb file)
template_hpc_submission.shis the job submission script for our workflow main script (
template_main.py). This main script runs only requires 1 CPU and 6GB memory. It will submit computationally intensive jobs during the runtime to other computing nodes. (e.g.: MD and QM) The walltime for the main script should cover the maximum time span of your workflow.Do this if you are NOT in Vanderbilt…
You may also need to modify the
template_hpc_submission.shto match with your local cluster. Here are some instructions:In
template_hpc_submission.sh:Change
line 1-10(resource settings) to match your local cluster’s scheduler syntax. (checkout the submission script you normally use)Change
line 12-22(environment settings) to match your local environmental setting (e.g.: how you normally load Gaussian, AmberTool, and Multiwfn)
Do this if you are in Vanderbilt…
In
template_hpc_submission.sh:Change
xxxinline 3to a valid value. (e.g.: yang_lab)Change
EFdesMDinline 2to a customized name for your workflowChange the path of conda in
line 22to match your own paths
ncaa_libThis is the folder defined in
interface.amber.build_md_parameterizerabove. It is designed to reuse parameter files for the same ligand to be reused to save time during the workflow. It also allows user to supply custom parameter files for ligands.Click to see how to supply custom parameter files for ligands
You can do it by place files following this name scheme:
XYZ_AAA-BBB.frcmod
XYZ is the 3-letter name of the ligand/modified AA
AAA is the name of the atomic charge model in uppercase
BBB is the name of the force field in uppercase
allowed AAA-BBB combinations can be found in
enzy_htp/_interface/ncaa_library.py::PARM_METHOD_LIST
2.2 Submit!
Submit the main script under this working directory. Here is an example command for submission on ACCRE @Vanderbilt:
sbatch template_hpc_submission.sh
Now wait for results and enjoy the power of automation of EnzyHTP!
Author: QZ Shao <shaoqz@icloud.com>