Richard Chen, Jimmy Kim, Jason Wong, Stewart Slocum
Computational protein design (CPD) seeks to predict an amino acid sequence for a protein that folds to a given structure.
It's an NP-hard problem.
The number of possible amino acid sequences grows exponentially as chain size increases. Designing a protein of 100 residues has a solution space of \( 20^{100} \) sequences before considering side chain flexibility.
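As a quick check on the scale involved, the count below assumes the 20 standard amino acids are all allowed at every position and ignores side chain flexibility. The function names are illustrative only.

```python
import math

NUM_AMINO_ACIDS = 20  # assumption: the 20 standard amino acids at every position


def sequence_space_size(num_residues: int) -> int:
    """Exact number of distinct sequences of the given length."""
    return NUM_AMINO_ACIDS ** num_residues


def log10_sequence_space(num_residues: int) -> float:
    """Base-10 logarithm of the sequence count, easier to read for long chains."""
    return num_residues * math.log10(NUM_AMINO_ACIDS)


if __name__ == "__main__":
    n = 100
    print(f"{n} residues: about 10^{log10_sequence_space(n):.1f} sequences")
```

Each added residue multiplies the count by 20, which is why exhaustive enumeration is out of reach even for short chains.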
We use a DNN to learn to approximate the size of the subtree for a given branching variable and value. We will use these scores to pick the best variable-value pair to branch on, speeding up computation in Toulbar2, an exact solver for Weighted Constraint Satisfaction Problems that has been used in CPD.
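The selection step can be sketched as follows. This is a minimal illustration, not Toulbar2's API: `pick_branch` and `predict_subtree_size` are hypothetical names, the predictor stands in for the trained DNN, and it assumes that the "best" branch is the one with the smallest predicted subtree.

```python
from typing import Callable, Hashable, Iterable, Tuple

Branch = Tuple[Hashable, Hashable]  # (variable, value)


def pick_branch(
    candidates: Iterable[Branch],
    predict_subtree_size: Callable[[Hashable, Hashable], float],
) -> Branch:
    """Return the candidate whose predicted subtree is smallest.

    Assumption: a smaller predicted subtree means less search work.
    """
    candidates = list(candidates)
    if not candidates:
        raise ValueError("no candidate branches to choose from")
    return min(candidates, key=lambda vv: predict_subtree_size(*vv))


if __name__ == "__main__":
    # Toy stand-in for the learned model: a lookup table of made-up scores.
    toy_scores = {("x1", "GLY"): 120.0, ("x2", "ALA"): 35.0, ("x3", "SER"): 80.0}
    best = pick_branch(toy_scores, lambda var, val: toy_scores[(var, val)])
    print(best)
```

In a real solver the predictor would be queried with features of the current search state rather than a fixed table.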
The network is trained on static, dynamic, and dynamic optimization features taken from the current state of the program's execution.
Some examples (of many):
35 protein design problems of different orders of difficulty provide a consistent (and hopefully learnable) set of instances on which to train subtree size estimation.
These problems generate search trees ranging in size from ~10 to ~10,000 nodes. We probabilistically select nodes to include in the data set so that we choose an equal number of nodes from each level of the tree and bias the data set towards large problems.
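One way to draw an equal number of nodes from each tree level is stratified random sampling by depth. The sketch below is an assumption about how that could look, not the procedure actually used: `sample_nodes_by_level` and its `per_level` quota are hypothetical, nodes are represented as `(node_id, depth)` pairs, and the bias towards large problems is not modeled.

```python
import random
from collections import defaultdict
from typing import Hashable, Iterable, List, Tuple

Node = Tuple[Hashable, int]  # (node_id, depth in the search tree)


def sample_nodes_by_level(
    nodes: Iterable[Node], per_level: int, seed: int = 0
) -> List[Node]:
    """Randomly pick up to `per_level` nodes from every depth of one tree."""
    rng = random.Random(seed)  # seeded so the data set is reproducible
    by_depth = defaultdict(list)
    for node_id, depth in nodes:
        by_depth[depth].append(node_id)

    sample: List[Node] = []
    for depth in sorted(by_depth):
        level = by_depth[depth]
        # Shallow levels may hold fewer nodes than the quota; take them all.
        k = min(per_level, len(level))
        sample.extend((node_id, depth) for node_id in rng.sample(level, k))
    return sample


if __name__ == "__main__":
    # Toy complete binary tree with depths 0..3 (1, 2, 4, 8 nodes per level).
    tree = [(f"n{d}_{i}", d) for d in range(4) for i in range(2 ** d)]
    print(sample_nodes_by_level(tree, per_level=2))
```

Without a per-level quota, uniform sampling would be dominated by the deepest levels, where most nodes of a search tree sit.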