Computational Biology
(BIOSC 1540)
Oct 10, 2024
Lecture 12:
Protein structure prediction
Protein structure dictates interactions, signaling, and biochemical roles
Experimental methods (X-ray, Cryo-EM) provide high-resolution structures but are resource-intensive and time-consuming
Advances in predictive accuracy are opening new frontiers in biology
Structure prediction complements genomics and transcriptomics to create a holistic understanding of biological function
Integrating predictive models with experimental data is the way forward
Proteins can adopt a large number of possible conformations
Levinthal’s Paradox: A protein can’t sample all conformations in a biologically reasonable time, yet it folds quickly
Example: A protein with 100 amino acids, each capable of adopting about 3 torsion angles, results in ~3^100 possible conformations
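To get a feel for the size of 3^100, the arithmetic can be sketched in a few lines. The sampling rate below (one conformation every 10^-13 s) is an assumption chosen for illustration, not a value from the lecture.

```python
# Levinthal-style estimate: 3 conformations per residue, 100 residues.
STATES_PER_RESIDUE = 3
N_RESIDUES = 100

# Assumption (not from the lecture): 1e13 conformations sampled per second.
SAMPLES_PER_SECOND = 1e13
SECONDS_PER_YEAR = 3600 * 24 * 365

n_conformations = STATES_PER_RESIDUE ** N_RESIDUES
years_to_sample_all = n_conformations / SAMPLES_PER_SECOND / SECONDS_PER_YEAR

print(f"conformations: {n_conformations:.2e}")
print(f"years to enumerate all of them: {years_to_sample_all:.2e}")
```

Even with this generous sampling rate, exhaustive search takes far longer than the age of the universe, which is the point of the paradox.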
Energy calculations are computationally intensive and depend on accurate force fields
Proteins fold to the lowest free-energy state, but this landscape is highly rugged
A potential energy surface (PES) represents the energy of a system as a function of the positions of its atoms
A PES lets us understand how the system's energy changes upon reactions or movements
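A minimal sketch of a one-dimensional PES: the energy of two atoms as a function of their separation. The Lennard-Jones form and the reduced units (epsilon = sigma = 1) are assumptions chosen for illustration; a protein PES has thousands of dimensions and many minima.

```python
def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """Potential energy of two atoms separated by distance r (reduced units)."""
    ratio6 = (sigma / r) ** 6
    return 4.0 * epsilon * (ratio6 ** 2 - ratio6)

# Scan the surface on a grid from r = 0.9 to r = 3.0 and locate the minimum.
distances = [0.9 + 0.001 * i for i in range(2101)]
energies = [lennard_jones(r) for r in distances]
r_min = distances[energies.index(min(energies))]

print(f"grid minimum near r = {r_min:.3f}, E = {min(energies):.3f}")
```

The grid search stands in for the much harder problem of finding low-energy conformations on a rugged, high-dimensional landscape.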
Proteins are not static; they adopt multiple conformations (flexibility) based on their environment or interactions with other molecules
Some proteins or regions do not adopt a fixed 3D structure but remain disordered or flexible under physiological conditions
Proteins fold differently in different environments
Predictions need to capture interactions with solvent molecules, ions, and cofactors
Example: Predicting transmembrane protein structures, where the lipid bilayer plays a key role in folding, is particularly complex.
AlphaFold 3
pH-gated K+ channel
PTMs such as phosphorylation, glycosylation, and methylation can alter protein folding and function
Example: eIF4E is a eukaryotic translation initiation factor involved in directing ribosomes to the cap structure of mRNAs
Ser209 is phosphorylated by MNK1
AlphaFold 3 accurately predicts these changes when they are already known
Example: AlphaFold has made strides, but predicting de novo structures remains challenging, especially for proteins with no templates
Our predictions rely on similarity to known structures, but novel sequences or folds (for which no homologous structures exist) are difficult to predict accurately
Homology modeling is most accurate when sequence identity to a protein of known structure is high (>30%)
Common tools for homology modeling include MODELLER, SWISS-MODEL, and Phyre2
The main principle is that proteins with similar sequences tend to fold into similar structures
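The 30% identity guideline can be checked on a pairwise alignment. Below is a minimal sketch that assumes the two sequences are already aligned, with gaps written as "-". The fragments are hypothetical, and other definitions of percent identity (for example, dividing by the full alignment length) are also in use.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of aligned (non-gap) columns where both residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        matches += a == b
    return 100.0 * matches / compared if compared else 0.0

# Hypothetical pre-aligned fragments (illustrative only).
target = "MTLSILVAHD-LQRV"
template = "MSLSLIVAHDKNQ-V"
identity = percent_identity(target, template)
print(f"{identity:.1f}% identity; above 30% threshold: {identity > 30}")
```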
HMMs are statistical models representing sequences using probabilities for matches, insertions, and deletions
Essentially more robust alignments
Suppose I collect weather data in Pittsburgh for the past 30 days: Sunny, Cloudy, or Rain
I want to figure out how to predict tomorrow's weather based on today's
Today's weather
Tomorrow's weather
Transition probability
Example: If today is cloudy, there is a 57% chance it will be Sunny tomorrow
Each edge represents the probability of transitioning from one state to the next
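Transition probabilities like these can be estimated by counting how often each state follows another. A minimal sketch is below; the week of weather is made up for illustration and is not the Pittsburgh data, so it will not reproduce the 57% figure.

```python
from collections import Counter, defaultdict

def transition_probabilities(observations):
    """Estimate P(tomorrow | today) from consecutive pairs of days."""
    pair_counts = defaultdict(Counter)
    for today, tomorrow in zip(observations, observations[1:]):
        pair_counts[today][tomorrow] += 1
    return {
        today: {state: n / sum(counts.values()) for state, n in counts.items()}
        for today, counts in pair_counts.items()
    }

# Made-up weather record, NOT the data from the lecture.
weather = ["Sunny", "Sunny", "Cloudy", "Rain", "Cloudy", "Sunny", "Cloudy", "Rain"]
probs = transition_probabilities(weather)
for today, row in probs.items():
    print(today, row)
```

Each row of the result corresponds to the outgoing edges of one state in the diagram, and each row sums to 1.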
Suppose my friend lives in a remote location where it is either Rainy or Sunny
I cannot look up the weather, but I have last year's weather reports
We know how weather patterns transition, but we don't have this information from our friend
Observables
Hidden states
Note: If we had previous observable data, we could fit/learn transition probabilities of hidden states
My friend can only tell me
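Given transition probabilities and the likelihood of each observation in each hidden state, the most probable sequence of hidden states can be recovered with the Viterbi algorithm. The sketch below is a minimal version. All probabilities and the observation labels ("walk", "shop", "clean") are hypothetical placeholders, not values from the lecture.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden-state sequence."""
    # best[state] = (probability of the best path ending in state, that path)
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        step = {}
        for s in states:
            prob, path = max(
                (best[prev][0] * trans_p[prev][s], best[prev][1]) for prev in states
            )
            step[s] = (prob * emit_p[s][obs], path + [s])
        best = step
    return max(best.values())

# Hypothetical model parameters (illustrative only).
states = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
emit_p = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}

prob, path = viterbi(["walk", "shop", "clean"], states, start_p, trans_p, emit_p)
print(path, prob)
```

With these placeholder numbers, the most likely hidden sequence is Sunny, Rainy, Rainy.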
Hidden states represent the underlying biological events that are not directly observable
Observables are the actual amino acids (residues) in the protein sequence that we can observe
HMMER is a tool that uses HMMs to search databases for sequences that match a given profile HMM
It is used to find homologous sequences, identifying evolutionary relationships across protein families
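The idea of scoring a sequence against a profile can be sketched with a heavily simplified model: per-column amino acid probabilities estimated from an ungapped alignment. This is an assumption-laden toy, not HMMER's algorithm. It keeps only match states and ignores the insertion and deletion states of a real profile HMM. The alignment is hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Per-column amino acid probabilities (match states only, no gaps)."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return profile

def log_odds_score(sequence, profile, background=1.0 / len(AMINO_ACIDS)):
    """Sum of log2(profile probability / uniform background) over positions."""
    return sum(math.log2(col[aa] / background) for aa, col in zip(sequence, profile))

# Hypothetical ungapped alignment of a short motif (illustrative only).
family = ["GDSL", "GDSV", "GESL", "GDTL"]
profile = build_profile(family)
score_member = log_odds_score("GDSL", profile)
score_unrelated = log_odds_score("WWWW", profile)
print(score_member, score_unrelated)
```

A family-like sequence scores above zero and an unrelated one below zero, which is the basic signal a database search ranks on.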
MTLSILVAHDLQRVIGFENQLPWHLPNDLKHVKKLSTGHTLVMGRKTFESIGKPLPNRRNVVLTSDTSFNVEGVDVIHSIEDIYQLPGHVFIFGGQTLFEEMIDKVDDMYITVIEGKFRGDTFFPPYTFEDWEVASSVEGKLDEKNTIPHTFLHLIRKK
DHFR (UniProt)
MGKKEVILLFLAVIFVALNTLVVAVYFRETADEQVVYGKNNINQKLIQLKDGTYGFEPALPHVGTFKVLDSNRVPQIAQEIIRNKVKRYLQEAVRIEGTYPIVDGLVNAKYTVANPNNLHGYEGFLFKDNVPLTYPQEFILSNLDGKVRSLQNYDYDLDVLFGEKEEVKSEILRGLYYNTYTRAFSPYKL
Novel protein (ChatGPT)
In cases where sequence similarity to known structures is low (< 30%), homology modeling becomes unreliable
Phyre2, RaptorX, MUSTER, and I-TASSER are commonly used for threading, which takes much longer than homology modeling
Threading matches sequences to known structural folds based on structural rather than sequence similarity
A contact map is a 2D representation of which residues are in close proximity
Each point on the map corresponds to two residues that are close in 3D space
Contacts are determined by spatial proximity, typically within a certain distance threshold
Residues far apart in the sequence can still be close in the 3D structure, reflected in the contact map
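A contact map can be computed directly from coordinates. The sketch below uses one point per residue and an 8 angstrom cutoff. Both the cutoff and the coordinates are assumptions for illustration, since the appropriate threshold depends on which atoms are compared.

```python
import math

def contact_map(coords, threshold=8.0):
    """Boolean matrix: True where two residues lie within `threshold` angstroms."""
    n = len(coords)
    return [
        [i != j and math.dist(coords[i], coords[j]) <= threshold for j in range(n)]
        for i in range(n)
    ]

# Made-up coordinates for a chain that runs out and folds back on itself.
coords = [
    (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0),
    (11.4, 3.8, 0.0), (7.6, 3.8, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0),
]
cmap = contact_map(coords)
for row in cmap:
    print("".join("#" if c else "." for c in row))
```

Residues 0 and 7 are at opposite ends of the sequence yet appear as a contact, while residues 0 and 3 are closer in sequence but not in contact.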
Traditional methods like homology modeling and threading rely on templates and known structures
AlphaFold (DeepMind) and RosettaFold (Baker Lab) lead the charge in this area
ML predicts 3D structures from sequence data alone
Developed by DeepMind, AlphaFold predicts protein structures with atomic accuracy by using deep learning models trained on large structural datasets
Breakthroughs
Mutations in one residue often result in compensatory mutations in its interacting partner
This is observed across species through analysis of homologous protein sequences
Correlated mutations indicate functionally significant residue pairs
Arg (positive)
Asp (negative)
Lys (positive)
Glu (negative)
Trp (hydrophobic)
Val (hydrophobic)
Evolution
Coevolution analysis helps predict which residues are close in the 3D structure
Residues showing correlated mutations are likely to be spatially close in the folded protein
This is particularly useful when no experimental structure is available
Coevolution is detected using large MSAs from homologous proteins
The more diverse the sequences in the MSA, the better the resolution of coevolving residues
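One simple way to quantify how strongly two MSA columns vary together is mutual information. The sketch below is a minimal version on a hypothetical alignment. Mutual information is used here as an assumed stand-in for the coevolution score, and practical methods add corrections that this toy omits.

```python
import math
from collections import Counter

def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_i)
    p_i, p_j = Counter(col_i), Counter(col_j)
    p_ij = Counter(zip(col_i, col_j))
    return sum(
        (c / n) * math.log2((c / n) / ((p_i[a] / n) * (p_j[b] / n)))
        for (a, b), c in p_ij.items()
    )

# Hypothetical MSA: columns 1 and 3 change together (Arg/Asp vs. Lys/Glu),
# while column 4 (Trp/Val) varies independently of column 1.
msa = ["ARGDW", "ARGDW", "AKGEW", "AKGEW", "ARGDV", "AKGEV"]
columns = list(zip(*msa))
mi_coupled = mutual_information(columns[1], columns[3])
mi_uncoupled = mutual_information(columns[1], columns[4])
print(mi_coupled, mi_uncoupled)
```

With only six sequences the estimate is crude, which illustrates why large and diverse alignments matter.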
Evolutionary information from MSAs guides predictions for residue-residue contacts
Residues with a high score (i.e., those that coevolve) tend to be near each other in the protein's structure (i.e., small distance)
Val14 and Gly120 coevolved
Models predict these residues are spatially close
Not all correlated mutations are due to direct physical interactions; some may be indirect
Noise in the data can come from random mutations or insufficient evolutionary diversity.
Large and diverse sequence data sets are needed for reliable coevolution predictions.
AlphaFold and RosettaFold utilize coevolutionary data from MSAs to predict residue interactions
These models incorporate evolutionary information along with structural features, leading to highly accurate predictions
Given the following data
MTLSILVAHDLQRVIGFENQLPWHLPNDLKHVKKLSTGHTLVMGRKTFESIGKPLPNRRNVVLTSDTSFNVEGVDVIHSIEDIYQLPGHVFIFGGQTLFEEMIDKVDDMYITVIEGKFRGDTFFPPYTFEDWEVASSVEGKLDEKNTIPHTFLHLIRKK
Input sequence
Multiple Sequence Alignment
Predict
Atomistic structure
ML models
Using MSAs and contact maps, DeepMind trained a model to predict protein structures
Biggest change is the use of a diffusion model
Diffusion models essentially learn to unscramble atoms into a structure
Proteins, DNA, RNA, ligands, PTMs, protein-protein complexes, etc.
MTLSILVAHDLQRVIGFENQLPWHLPNDLKHVKKLSTGHTLVMGRKTFESIGKPLPNRRNVVLTSDTSFNVEGVDVIHSIEDIYQLPGHVFIFGGQTLFEEMIDKVDDMYITVIEGKFRGDTFFPPYTFEDWEVASSVEGKLDEKNTIPHTFLHLIRKK
DHFR (UniProt)
MGKKEVILLFLAVIFVALNTLVVAVYFRETADEQVVYGKNNINQKLIQLKDGTYGFEPALPHVGTFKVLDSNRVPQIAQEIIRNKVKRYLQEAVRIEGTYPIVDGLVNAKYTVANPNNLHGYEGFLFKDNVPLTYPQEFILSNLDGKVRSLQNYDYDLDVLFGEKEEVKSEILRGLYYNTYTRAFSPYKL
Novel protein (ChatGPT)
At least 40% of proteins have disordered regions
AlphaFold (and all other methods) struggle with disordered regions
Lecture 12:
Protein structure prediction
Today
Thursday
Lecture 13:
Molecular simulation principles