Computational Biology
(BIOSC 1540)
Mar 25, 2025
Lecture 11A
Protein structure prediction
Foundations
Proteins are molecular machines; their 3D shape determines how they interact with substrates, DNA, other proteins, etc.
As of 2021, >200 million protein sequences exist, but <200,000 structures are known
X-ray crystallography, NMR, and cryo-EM provide high-resolution data.
However, these methods are time-consuming, expensive, and often fail for specific proteins.
Structure prediction enables structural understanding of sequences with no experimental structure. This accelerates many research fields and democratizes access to atomistic insights.
Example: Our collaborators (Dr. Cahoon) crystallized Lm PrsA1 in 2016, but we need a structural model of PrsA2. Instead of waiting potentially years for an experimental structure, AlphaFold 3 gives us a decent prediction within minutes.
PrsA1 (X-ray)
PrsA2 (AF3)
Proteins fold in milliseconds—implying nature doesn't sample all conformations.
Levinthal’s Paradox: A protein can’t sample all conformations in a biologically reasonable time, yet it folds quickly
Example: A protein with 100 amino acids, each able to adopt about 3 torsion-angle states, has ~3^100 (about 5 × 10^47) possible conformations
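The arithmetic behind this example can be checked in a few lines. This is a rough sketch: the sampling rate of 10^13 conformations per second is an assumption chosen for illustration, not a value from the lecture.

```python
# Levinthal-style estimate: count conformations and the time to try them all.
n_residues = 100          # chain length from the example above
states_per_residue = 3    # conformational states available to each residue
sampling_rate = 1e13      # ASSUMPTION: conformations sampled per second

n_conformations = states_per_residue ** n_residues
seconds = n_conformations / sampling_rate
years = seconds / (3600 * 24 * 365)

print(f"conformations: {n_conformations:.2e}")
print(f"exhaustive search: {years:.2e} years")
```

Even at this optimistic rate, an exhaustive search would take on the order of 10^27 years, which is why folding cannot be a random search.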
Scoring functions attempt to model this with statistical or physics-based potentials
Proteins adopt conformations that minimize thermodynamic free energy
A potential energy surface (PES) represents the energy of a system as a function of the positions of its atoms
Allows us to understand how the system's energy changes upon reactions or movements
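A one-dimensional toy makes the idea of a PES concrete. The sketch below uses a Lennard-Jones pair potential in reduced units (epsilon = sigma = 1) as a stand-in for a real force field; the function name and the grid spacing are illustrative choices.

```python
def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """Potential energy of two atoms separated by distance r (reduced units)."""
    ratio6 = (sigma / r) ** 6
    return 4.0 * epsilon * (ratio6 ** 2 - ratio6)

# Scan the surface on a grid and locate the lowest-energy separation.
distances = [0.90 + 0.001 * i for i in range(1600)]
energies = [lennard_jones(r) for r in distances]
best_index = min(range(len(distances)), key=energies.__getitem__)
r_min, e_min = distances[best_index], energies[best_index]

print(f"minimum near r = {r_min:.3f}, E = {e_min:.3f}")
```

The scan recovers the analytical minimum at r = 2^(1/6) sigma with energy -epsilon. A protein's PES has the same meaning but thousands of coordinates instead of one.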
Many proteins exist in ensembles of structures or are natively disordered.
Their function may depend on transient interactions or induced folding.
Proteins fold differently in different environments
Predictions need to capture interactions with solvent molecules, ions, and cofactors
Example: Predicting transmembrane protein structures, where the lipid bilayer plays a key role in folding, is particularly complex.
AlphaFold 3
pH-gated K+ channel
PTMs such as phosphorylation, glycosylation, and methylation can alter protein folding and function
Example: eIF4E is a eukaryotic translation initiation factor involved in directing ribosomes to the cap structure of mRNAs
Ser209 is phosphorylated by MNK1
AlphaFold 3 accurately predicts these changes when they are already known
Hidden Markov Model alignments
Requires a template—a known structure with detectable sequence similarity.
Often the first modeling strategy attempted due to simplicity and reliability.
Based on the principle that proteins with similar sequences tend to adopt similar structures.
You begin with a query sequence
(the protein you want to model)
> PrsA2
CGGGGDVVKTDSGDVTKDELYDAMKDKYGSEFVQQLTFEKILGDKYKVSDE
DVDKKFNEYKSQYGDQFSAVLTQSGLTEKSFKSQLKYNLLVQKATEANTDT
SDKTLKKYYETWQPDITVSHILVADENKAKEVEQKLKDGEKFADLAKEYST
DTATKDNGGQLAPFGPGKMDPAFEKAAYALKNKGDISAPVKTQYGYHIIQM
DKPATKTTFEKDKKAVKASYLESQLTTENMQKTLKKEYKDANVKVEDKDLK
DAFKDFDGSSSSDSDSSK
This is an MSA of Listeria monocytogenes PrsA2 to related proteins
Methods like Smith-Waterman use direct pairwise alignment based on similarity scores (e.g., BLOSUM62).
They do not consider evolutionary variation or position-specific residue and gap probabilities.
We need methods that detect evolutionarily distant but structurally conserved relationships.
A profile captures how conserved each position is across an MSA
Instead of a single residue, each position becomes a probability distribution over all 20 amino acids.
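A minimal sketch of building such a profile from an MSA, assuming a toy alignment (not real PrsA2 homologs) and a small uniform pseudocount. Real tools also weight sequences and use more careful pseudocount schemes.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(msa, pseudocount=0.1):
    """Return one {amino acid: probability} dict per alignment column.

    Gaps ('-') are ignored; the pseudocount keeps unseen residues above zero.
    """
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*msa):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile

# Hypothetical toy alignment for illustration only.
toy_msa = ["KDEL", "KDEL", "RDEL", "K-DL"]
profile = build_profile(toy_msa)
print(round(profile[0]["K"], 3), round(profile[0]["R"], 3))
```

In the first column, K is the most probable residue, R is less probable, and residues never observed keep a small nonzero probability.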
Made with skylign
A profile HMM models the amino acid probabilities at each position, plus insertion and deletion likelihoods.
The result is a probabilistic model that captures both conservation and structural variability.
Sequence likelihood can be computed by walking along the profile HMM
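Walking along the model can be sketched for the simplest case: an ungapped sequence that visits only match states. The three-position model, its emission values, and the match-to-match probability below are all hypothetical; a real profile HMM also sums over insert and delete states.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical emission probabilities for match states M1..M3.
# Residues not listed share the remaining probability mass equally.
match_emissions = [
    {"K": 0.7, "R": 0.2},
    {"D": 0.8, "E": 0.1},
    {"L": 0.6, "I": 0.2, "V": 0.1},
]
P_MATCH_TO_MATCH = 0.9  # ASSUMPTION: probability of staying on the match path

def emission(position, residue):
    """Emission probability of `residue` at match state `position`."""
    listed = match_emissions[position]
    if residue in listed:
        return listed[residue]
    leftover = 1.0 - sum(listed.values())
    return leftover / (len(AMINO_ACIDS) - len(listed))

def log_likelihood(sequence):
    """Log-probability of an ungapped sequence along the match-state path."""
    if len(sequence) != len(match_emissions):
        raise ValueError("this sketch only handles ungapped sequences")
    total = 0.0
    for i, residue in enumerate(sequence):
        total += math.log(emission(i, residue))
        if i > 0:
            total += math.log(P_MATCH_TO_MATCH)
    return total

print(log_likelihood("KDL"), log_likelihood("WWW"))
```

A sequence that agrees with the conserved residues (KDL) scores much higher than one that does not (WWW).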
Full alignments are done using the Viterbi algorithm to find the best path through the HMM state space.
A maximum accuracy (MAC) alignment is also computed to optimize for correct residue–residue matches.
These alignments return E-values to estimate match confidence.
Only statistically significant hits (e.g., E < 1e-3) are retained for the next iteration.
The alignment is represented as a red path through both HMMs
>50% identity: high-accuracy models (~1 Å RMSD) are achievable
Between 30–50%: moderate accuracy; errors appear in loops, side chains
<30% identity: The "twilight zone" where structural similarity is uncertain
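These thresholds can be applied to a pairwise alignment as follows. This is a sketch: identity is computed over aligned, non-gap columns, which is one of several conventions in use, and both function names are illustrative.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identical residues over aligned (non-gap) column pairs."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [
        (a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"
    ]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)

def expected_accuracy(identity):
    """Map percent identity to the qualitative ranges listed above."""
    if identity > 50:
        return "high accuracy"
    if identity >= 30:
        return "moderate accuracy"
    return "twilight zone"

identity = percent_identity("ACDE-F", "ACDQWF")
print(identity, expected_accuracy(identity))
```

Here four of the five aligned pairs match, giving 80% identity, which falls in the high-accuracy range.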
Step 1: Converts HMM columns into a discretized profile alphabet of 219 letters to speed up comparison.
Step 2: Performs a fast local alignment of the query HMM to these compressed database representations.
These steps eliminate ~99.9% of irrelevant comparisons—reducing millions of alignments to thousands
Sequences from matched HMMs are added to the query MSA.
A new query HMM is built from this updated MSA.
Each iteration improves sensitivity by capturing more distant, diverse sequences.
This process continues for 1–4 rounds, depending on the desired depth and quality.
Template building
Conserved regions: Backbone atoms are copied from template directly.
Variable regions (loops, inserts): Built using fragment libraries or loop modeling algorithms.
Side chains are adjusted using rotamer libraries to fit target sequence.
After model construction, the structure often contains bad bond angles, clashes, or unrealistic torsions.
Refinement includes energy minimization using force fields or statistical potentials.
Some tools use molecular dynamics or Monte Carlo sampling.
Ramachandran plots visualize backbone torsion angles.
Statistical scores (e.g., DOPE, QMEAN, GA341) evaluate nativeness.
Residue-by-residue assessment helps identify weak regions (e.g., VERIFY3D, ERRAT).
Good models have:
Most residues in favored Ramachandran regions
Low-energy scores
No large clashes
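The backbone torsions shown on a Ramachandran plot are dihedral angles over four consecutive atoms: phi uses C(i-1), N(i), CA(i), C(i) and psi uses N(i), CA(i), C(i), N(i+1). A minimal sketch of the calculation in pure Python, using made-up coordinates rather than a real structure:

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )

def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees about the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    axis = tuple(x / norm for x in b1)
    # Components of the outer bonds perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, axis) * x for x in axis))
    w = _sub(b2, tuple(_dot(b2, axis) * x for x in axis))
    return math.degrees(math.atan2(_dot(_cross(axis, v), w), _dot(v, w)))

# Hypothetical points: the central bond lies along the z axis.
angle = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
print(round(angle, 1))
```

Computing phi and psi for every residue and plotting one against the other gives the Ramachandran plot used in validation.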
If a model fails validation, revisit earlier steps:
Try a different template
Refine the alignment
Adjust loop modeling parameters
Multiple models are often built and ranked—choose the one with the best validation metrics.
MTLSILVAHDLQRVIGFENQLPWHLPNDLKHVKKLSTGHTLVMGRKTFESIGKPLPNRRNVVLTSDTSFNVEGVDVIHSIEDIYQLPGHVFIFGGQTLFEEMIDKVDDMYITVIEGKFRGDTFFPPYTFEDWEVASSVEGKLDEKNTIPHTFLHLIRKKDHFR (UniProt)
In cases where sequence similarity to known structures is low (< 30%), homology modeling becomes unreliable
Phyre2, RaptorX, MUSTER, and I-TASSER are commonly used for threading, which takes much longer than homology modeling
Threading matches sequences to known structural folds based on structural rather than sequence similarity
A contact map is a 2D representation of which residues are in close proximity
Each point on the map corresponds to two residues that are close in 3D space
Contacts are determined by spatial proximity, typically within a certain distance threshold
Residues far apart in the sequence can still be close in the 3D structure, reflected in the contact map
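A contact map can be computed directly from coordinates. The sketch below assumes one representative atom per residue and an 8 Å cutoff, which is a common choice but not one stated in the lecture; the hairpin-like coordinates are invented for illustration.

```python
import math

def contact_map(coords, threshold=8.0):
    """Boolean matrix: True where two residues lie within `threshold` angstroms."""
    n = len(coords)
    return [
        [math.dist(coords[i], coords[j]) <= threshold for j in range(n)]
        for i in range(n)
    ]

# Hypothetical hairpin: residues 0-3 run out, residues 4-7 run back alongside.
coords = [
    (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0),
    (11.4, 5.0, 0.0), (7.6, 5.0, 0.0), (3.8, 5.0, 0.0), (0.0, 5.0, 0.0),
]
cmap = contact_map(coords)

# Residues 0 and 7 are far apart in sequence but close in space.
print(cmap[0][7], cmap[0][3])
```

The first and last residues are seven positions apart in sequence yet form a contact, while residues 0 and 3 on the same strand do not.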
Traditional methods like homology modeling and threading rely on templates and known structures
AlphaFold (DeepMind) and RoseTTAFold (Baker Lab) lead the charge in this area
ML methods predict 3D structures from sequence data alone
Developed by DeepMind, AlphaFold predicts protein structures with atomic accuracy by using deep learning models trained on large structural datasets
Breakthroughs
Mutations in one residue often result in compensatory mutations in its interacting partner
This is observed across species through analysis of homologous protein sequences
Correlated mutations indicate functionally significant residue pairs
Arg (positive)
Asp (negative)
Lys (positive)
Glu (negative)
Trp (hydrophobic)
Val (hydrophobic)
Evolution
Coevolution analysis helps predict which residues are close in the 3D structure
Residues showing correlated mutations are likely to be spatially close in the folded protein
This is particularly useful when no experimental structure is available
Coevolution is detected using large MSAs from homologous proteins
The more diverse the sequences in the MSA, the better the resolution of coevolving residues
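One simple way to quantify correlated mutations is the mutual information between two alignment columns. The sketch below uses a tiny invented MSA in which columns 0 and 2 swap charges together. Raw mutual information is only a first approximation, and practical methods add corrections for indirect correlations and sampling bias.

```python
import math
from collections import Counter

def mutual_information(msa, i, j):
    """Mutual information (bits) between alignment columns i and j."""
    pairs = [(seq[i], seq[j]) for seq in msa if seq[i] != "-" and seq[j] != "-"]
    n = len(pairs)
    if n == 0:
        return 0.0
    count_i = Counter(a for a, _ in pairs)
    count_j = Counter(b for _, b in pairs)
    count_ij = Counter(pairs)
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi

# Hypothetical MSA: column 0 and column 2 always carry opposite charges.
toy_msa = ["RAD", "RAD", "DAR", "DAR", "RGD", "DGR"]
print(mutual_information(toy_msa, 0, 2), mutual_information(toy_msa, 0, 1))
```

Columns 0 and 2 share one full bit of information, while column 1 varies independently of column 0 and scores approximately zero.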
Evolutionary information from MSAs guides predictions for residue-residue contacts
Residue pairs with a high coevolution score are likely to be near each other in the protein's structure (i.e., separated by a small distance)
Val14 and Gly120 coevolved
Models predict these residues are spatially close
Not all correlated mutations are due to direct physical interactions; some may be indirect
Noise in the data can come from random mutations or insufficient evolutionary diversity.
Large and diverse sequence data sets are needed for reliable coevolution predictions.
AlphaFold and RoseTTAFold utilize coevolutionary data from MSAs to predict residue interactions
These models incorporate evolutionary information along with structural features, leading to highly accurate predictions
Given the following data
MTLSILVAHDLQRVIGFENQLPWHLPNDLKHVKKLSTGHTLVMGRKTFESIGKPLPNRRNVVLTSDTSFNVEGVDVIHSIEDIYQLPGHVFIFGGQTLFEEMIDKVDDMYITVIEGKFRGDTFFPPYTFEDWEVASSVEGKLDEKNTIPHTFLHLIRKK
Input sequence
Multiple Sequence Alignment
Predict
Atomistic structure
ML models
Using MSAs and contact maps, DeepMind trained a model to predict protein structures
AF2 iterations
The biggest change in AlphaFold 3 is the use of a diffusion model
Diffusion models essentially learn to unscramble atoms into a structure
Proteins, DNA, RNA, ligands, PTMs, protein-protein complexes, etc.
MTLSILVAHDLQRVIGFENQLPWHLPNDLKHVKKLSTGHTLVMGRKTFESIGKPLPNRRNVVLTSDTSFNVEGVDVIHSIEDIYQLPGHVFIFGGQTLFEEMIDKVDDMYITVIEGKFRGDTFFPPYTFEDWEVASSVEGKLDEKNTIPHTFLHLIRKKDHFR (UniProt)
MGKKEVILLFLAVIFVALNTLVVAVYFRETADEQVVYGKNNINQKLIQLKDGTYGFEPALPHVGTFKVLDSNRVPQIAQEIIRNKVKRYLQEAVRIEGTYPIVDGLVNAKYTVANPNNLHGYEGFLFKDNVPLTYPQEFILSNLDGKVRSLQNYDYDLDVLFGEKEEVKSEILRGLYYNTYTRAFSPYKL
Novel protein (ChatGPT)
At least 40% of proteins have disordered regions
AlphaFold, like all other methods, struggles with disordered regions
Today: Lecture 11A, Protein structure prediction - Foundations
Thursday: Lecture 11B, Protein structure prediction - Methodology
Suppose I collect weather data in Pittsburgh for the past 30 days: Sunny, Cloudy, or Rain
I want to figure out how to predict tomorrow's weather based on today's
Today's weather
Tomorrow's weather
Transition probability
Example: If today is cloudy, there is a 57% chance it will be Sunny tomorrow
Each edge represents the probability of transitioning from one state to the next
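Transition probabilities like these can be estimated by counting how often each state follows another. The sketch below uses a short made-up record, not the lecture's Pittsburgh data, so its numbers will not match the 57% in the example.

```python
from collections import Counter, defaultdict

def transition_probabilities(observations):
    """Estimate P(tomorrow | today) from a sequence of daily states."""
    counts = defaultdict(Counter)
    for today, tomorrow in zip(observations, observations[1:]):
        counts[today][tomorrow] += 1
    return {
        state: {nxt: c / sum(following.values()) for nxt, c in following.items()}
        for state, following in counts.items()
    }

# Hypothetical ten-day record for illustration only.
weather = [
    "Sunny", "Sunny", "Cloudy", "Rain", "Cloudy",
    "Sunny", "Cloudy", "Sunny", "Rain", "Rain",
]
probs = transition_probabilities(weather)
print(probs["Cloudy"])
```

Each row of the resulting table sums to 1 and corresponds to the outgoing edges of one state in the diagram.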
Suppose my friend lives in a remote location where it is either Rainy or Sunny
I cannot look up the weather, but I have last year's weather reports
We know how weather patterns transition, but our friend does not report the weather directly
Observables
Hidden states
Note: If we had previous observable data, we could fit/learn transition probabilities of hidden states
My friend can only tell me
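Inferring the hidden weather from what is observed can be sketched with the Viterbi algorithm. Everything specific below is hypothetical: the observable (whether the friend mentions an umbrella) and all start, transition, and emission probabilities are invented for illustration.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most probable hidden-state path for a sequence of observations."""
    # best[t][s] = (log-probability of the best path ending in s, previous state)
    first = observations[0]
    best = [{
        s: (math.log(start_p[s]) + math.log(emit_p[s][first]), None)
        for s in states
    }]
    for obs in observations[1:]:
        column = {}
        for s in states:
            score, prev = max(
                (best[-1][p][0] + math.log(trans_p[p][s]), p) for p in states
            )
            column[s] = (score + math.log(emit_p[s][obs]), prev)
        best.append(column)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]

states = ("Rainy", "Sunny")
start_p = {"Rainy": 0.5, "Sunny": 0.5}
trans_p = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.3, "Sunny": 0.7},
}
emit_p = {
    "Rainy": {"umbrella": 0.9, "no umbrella": 0.1},
    "Sunny": {"umbrella": 0.2, "no umbrella": 0.8},
}
path = viterbi(["umbrella", "umbrella", "no umbrella"], states, start_p, trans_p, emit_p)
print(path)
```

For these parameters the most probable hidden sequence is Rainy, Rainy, Sunny. Profile HMM alignment uses the same dynamic programming idea, with match, insert, and delete states in place of weather.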
HMMs are statistical models representing sequences using probabilities for matches, insertions, and deletions
Essentially more robust alignments
Hidden states represent the underlying biological events that are not directly observable
Observables are the actual amino acids (residues) in the protein sequence that we can observe
MGKKEVILLFLAVIFVALNTLVVAVYFRETADEQVVYGKNNINQKLIQLKDGTYGFEPALPHVGTFKVLDSNRVPQIAQEIIRNKVKRYLQEAVRIEGTYPIVDGLVNAKYTVANPNNLHGYEGFLFKDNVPLTYPQEFILSNLDGKVRSLQNYDYDLDVLFGEKEEVKSEILRGLYYNTYTRAFSPYKL
Novel protein (ChatGPT)
Example: AlphaFold has made strides, but predicting de novo structures remains challenging, especially for proteins with no templates
Our predictions rely on similarity to known structures, but novel sequences or folds (for which no homologous structures exist) are difficult to predict accurately