BIOSC 1540: L17 (Docking and virtual screening)

Computational Biology

(BIOSC 1540)

Oct 31, 2024

Lecture 17:
Docking and virtual screening

Announcements

A06 is due tonight by 11:59 pm
- Reminder: There is a (soft) limit of 100 words for each question
A07 (final assignment) will be released tomorrow
No class on Tuesday (Nov 5) for election day
The next exam is on Nov 14
- We will have a review session on Nov 12
- Request DRS accommodations if needed
Project will be released Nov 21 and is due ???

After today, you should have a better understanding of

Need limitations of alchemical simulations

Alchemical simulations: Precise but expensive

Alchemical simulations compute free energy changes by gradually transforming one molecule into another

Importance in drug discovery: Highly precise, offering detailed insights into binding affinities essential for drug design

Why are alchemical simulations computationally expensive?

Atomistic forces: Computes forces for all atoms in proteins, ligands, cofactors, ions, solvents for millions of structures

Detailed sampling: Captures a wide range of conformations, which adds more dimensions to the calculation

Alchemical parameters: Simulations must be performed at various alchemical parameters

Okay, so how long is this really?

~10,000 CPU hours

~417 days on a 1 core
~42 days on 10 cores
~4 days on 100 cores
~10 hours on 1,000 cores

(For context, most supercomputers have ~24 cores per $30,000 node.)

Efficient methods are needed for virtual screening

Remember: Chemical space is unfathomably large and the role of computation is to virtually test as many compounds as possible

Value of data-driven approaches for drug discovery

After today, you should have a better understanding of

Accurate and efficient binding predictions are essential

Objective: Directly predict binding affinity from protein and ligand structures with high accuracy and minimal computational resources.

We can carefully simplify our methodology to improve speed with (hopefully) minimal impact to accuracy

What are some ideas?

Avoid sampling all microstates and determine one "optimal" protein-ligand structure

Using this bound structure, predict a "score" that is correlated to binding affinity

This is called docking

Docking simplifies the binding free energy prediction problem to enhance speed

Identifying relevant protein conformations

After today, you should have a better understanding of

Accurate, reproducible docking requires a relevant protein conformation

Significance of Protein Conformation in Docking

Protein-ligand interactions are highly dependent on the protein's 3D structure.
Using an inappropriate protein conformation can lead to inaccurate docking results.

Docking still considers the protein structure, but we only select one

Challenge: Proteins Are Dynamic Molecules

Conformational Flexibility: Proteins are not rigid structures; they exhibit movements ranging from side-chain rotations to large domain motions
Impact on Binding Sites: The shape and properties of the binding site can change, affecting ligand binding affinity and specificity.
Limited Experimental Structures: Crystallography and NMR provide snapshots of protein conformations but may not capture all relevant states.

Sources of Protein Conformational Data

Experimental Methods

Molecular Dynamics (MD) Simulations: Explore the conformational space over time.
Normal Mode Analysis (NMA): Identifies collective motions in proteins.
Ensemble Generation Methods: Generate multiple protein conformations for docking.

X-ray Crystallography: Provides high-resolution structures but may miss dynamic conformations.
NMR Spectroscopy: Captures ensembles of conformations but is limited to smaller proteins.

Computational Techniques

Experimental Structure Selection Criteria

Resolution and Quality
- Prefer structures with higher resolution (e.g., <2.5 Å).
- Assess reliability using R-factors and validation reports.
Ligand-Bound vs. Apo Structures
- Ligand-Bound (Holo) Structures
  - Provide direct insight into binding site conformation.
- Apo Structures
  - May reveal binding site flexibility in the absence of ligands.
Relevance to Target Ligand
- Choose structures co-crystallized with ligands similar to those of interest.

Molecular Dynamics Simulations for Conformational Sampling

Extract representative structures using clustering algorithms.

Identify conformations with open or closed binding sites.

Importance of Water Molecules

Role in Binding: Structured water molecules can mediate interactions between the protein and ligand.

Handling Water in Docking

Some docking programs allow explicit water molecules in the binding site.
Alternatively, consider their effect implicitly in scoring functions.

Inclusion Criteria: Retain water molecules that are conserved across multiple crystal structures.

Detecting binding pockets

After today, you should have a better understanding of

Accurate Binding Pocket Detection is Crucial for Docking

The binding pocket is the specific region where a ligand interacts with a protein

Accurate identification of binding pockets is essential for successful docking and virtual screening.

Understanding Protein Surface Topography

Terminology

Binding Pocket: A cavity that can accommodate a ligand.
Active Site: The functional region where biochemical reactions occur (often a binding pocket in enzymes).

Protein Surface Characteristics

Convex Regions: Typically inaccessible to ligands.
Concave Regions (Cavities): Potential binding sites.

Classification of Binding Pockets

Orthosteric Sites: The primary active site where endogenous ligands bind.
Allosteric Sites: Secondary sites that modulate protein function upon ligand binding.

Cryptic Sites: Binding pockets not apparent in the unbound protein structure but form upon ligand binding or conformational change.

Geometry-Based Pocket Detection Techniques

Alpha Shape Theory: Uses Delaunay triangulation and alpha complexes to define cavities.

Grid-Based Pocket Detection

Methodology
- Overlay a 3D grid on the protein structure.
- Classify grid points as inside, outside, or on the surface.
Pocket Identification
- Clusters of surface grid points forming concave regions indicate potential pockets.

Detecting Cryptic Binding Sites

Cryptic sites are hidden in the unbound structure and require conformational changes to become apparent.

Strategies

Use enhanced sampling MD methods like metadynamics.
Apply pocket detection to multiple conformations.

Case Study: Identification of allosteric sites in Hsp90

Ligand pose optimization

After today, you should have a better understanding of

Accurate Docking Depends on Optimized Ligand Poses

Precise ligand poses are crucial for reliable predictions of binding affinity and activity.
Incorrect poses can lead to false negatives or positives, misguiding drug development efforts.

Fundamentals of Ligand Pose Optimization

Definition of Ligand Pose
- The specific orientation and conformation of a ligand within the binding site of a target protein.
Optimization Goal
- Identify the energetically most favorable pose that closely represents the true binding mode.
Key Components
- Orientation: Position and alignment within the binding pocket.
- Conformation: Internal geometry, including bond angles, lengths, and torsions.

Docking needs to generate diverse conformations

Search strategies

Systematic
Stochastic
Empirical
Machine learning

Systematic searches numerically iterate over all possible conformations

Identify important degrees of freedom

Angles
Dihedrals

Scan along each angle with a step size of a N degrees

Remove structures with high strain

Systematic searches are only possible for very small molecules

How many different conformations would we have in this molecule if we scanned only dihedrals every 45 degrees?

Systematic searches are only possible for very small molecules

8 dihedrals

1

2

3

4

5

6

7

8

8 angles

8 × 8 × 8 × 8 × 8 × 8 × 8 × 8 = 16,777,216

That's a lot of structures, and many of them will clash!

We almost never do a systematic search in practice without some precautions to combinatorics

Stochastic algorithms provide better balance of sampling and cost

Monte Carlo

Steps:

Generate conformation
Compute energy change
If energy change less than a random sample: make move
Repeat

Allows us to sample efficiently

We also have conformer libraries

Use pre-generated libraries

Scoring functions as data-driven predictors

After today, you should have a better understanding of

Scoring function are parameterized models to estimate binding affinity after docking

Physics-based methods using force-field like methods

Recently, machine learning (e.g., graph neural networks) have been gaining traction

Interpretation of docking results

After today, you should have a better understanding of

Case study: MAP kinase ERK2

Overview of ERK2:

Extracellular signal-regulated kinase 2 (ERK2), encoded by the MAPK1 gene, is a crucial component of the mitogen-activated protein kinase (MAPK) signaling pathway.
ERK2 regulates vital cellular processes, including proliferation, differentiation, and survival.

Role in Disease:

Aberrant ERK2 activity is implicated in various cancers, often due to mutations in upstream components like RAS and RAF, leading to uncontrolled cell growth.
ERK2 is also involved in inflammatory diseases and neurodegenerative disorders.

AlphaFold 2 Prediction

MolModa

Evaluating and Scoring Binding Pockets

Volume and Depth: Larger, deeper pockets may accommodate larger ligands.
Hydrophobicity/Hydrophilicity: Influences ligand binding characteristics.
Amino Acid Composition: Presence of key residues for interactions.

Challenges and Limitations

Protein Flexibility
- Static structures may miss important conformations.
Water Molecules
- Role of water in the binding site can complicate detection.
Computational Resources
- High-resolution methods may be computationally intensive.
False Positives/Negatives
- Overprediction of non-druggable sites or missing true sites.

Accurate and efficient binding predictions are essential

Objective: Directly predict binding affinity from protein and ligand structures with high accuracy and minimal computational resources.

Why Empirical Data Matters

Experimental data provides a solid foundation for predictions.
Enhances reliability over purely theoretical approaches.

Thermal shift assays

High-Quality Datasets Enable Better Predictions

Utilizing Crystallographic Structures
- Real protein conformations guide accurate docking.
- Reveal detailed information about binding sites and interactions.
Incorporating Experimental Affinities
- Adjust predictions based on known binding strengths.
- Calibrate computational methods using empirical measurements.

Empirical Methods Improve Docking Accuracy

Utilizing Crystallographic Structures
- Real protein conformations guide accurate docking.
- Reveal detailed information about binding sites and interactions.
Incorporating Experimental Affinities
- Adjust predictions based on known binding strengths.
- Calibrate computational methods using empirical measurements.

Empirical Data Reduces Computational Effort

Efficiency Gains
- Focus on experimentally validated binding modes.
- Limit the search space to plausible interactions.
Resource Optimization
- Achieve accurate results without extensive computations.
- Save time and computational power for other tasks.

Empirical Data Increases Confidence in Results

Validation of Predictions
- Compare computational outcomes with experimental data.
- Confirm the accuracy of docking models.
Enhanced Decision-Making
- Reliable predictions support better drug discovery choices.
- Reduce the risk of pursuing ineffective compounds.

Introduction to Data-Driven Approaches

Data-driven approaches leverage large datasets and machine learning algorithms to make predictions, identify patterns, and generate models without relying solely on physical or chemical principles.

These methods harness the power of available biological data to accelerate research and discovery processes.

Advantages of Data-Driven Approaches

Speed and Efficiency: Capable of processing and analyzing large datasets rapidly, facilitating quicker decision-making.
Scalability: Easily scalable to handle growing amounts of data without a significant increase in computational resources.
Flexibility: Adaptable to various types of data and applications, from genomics to drug discovery.
Cost-Effectiveness: Reduces the need for expensive experiments by predicting outcomes computationally.