The goal of the project was to predict the angles that
define the main structure (backbone) of a protein,
given information about its amino acid sequence.
Predicting proteins’ tertiary structure from their
primary structure is one of the most important unsolved
problems of biochemistry. Current methods are inaccurate
and expensive. Machine learning offers a new toolset that
promises cheaper and much more efficient solutions.
One of the recent breakthroughs is Mohammed AlQuraishi’s
End-to-end differentiable learning of protein structure, which together with
, inspired this project.
The model was developed for a course but the results go beyond the regular scope of the class.
Python, TensorFlow and Jupyter Notebook
We use the ProteinNet dataset as introduced by AlQuraishi and write the entire data processing pipeline in TensorFlow.
The heart of the model is a Convolutional Neural Network, similar to the one introduced in RaptorX.
The model is trained using the Adam optimizer on MSE and MAE losses between predicted and true dihedral (torsional) angles.
The LSTM from the image below is thus replaced with a CNN, and we don't implement the pink part
that would convert the angles into 3-dimensional Euclidean coordinates.
Source: AlQuraishi, End-to-end differentiable learning of protein structure
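As an illustration of the MSE and MAE losses on angles mentioned above, here is a minimal NumPy sketch. It is not the project's TensorFlow code. Wrapping the difference into [-pi, pi) is an assumption added here so that angles on either side of the ±pi boundary count as close; the text does not say whether the project does this. The function names are hypothetical.

```python
import numpy as np

def wrap_to_pi(delta):
    """Map angle differences onto [-pi, pi) (assumed handling of periodicity)."""
    return (delta + np.pi) % (2.0 * np.pi) - np.pi

def angular_mse(pred, true):
    """Mean squared error between angles, computed on wrapped differences."""
    return np.mean(wrap_to_pi(pred - true) ** 2)

def angular_mae(pred, true):
    """Mean absolute error between angles, computed on wrapped differences."""
    return np.mean(np.abs(wrap_to_pi(pred - true)))

# Hypothetical predictions and targets for one residue (3 torsional angles).
pred = np.radians([[179.0, 10.0, -60.0]])
true = np.radians([[-179.0, 20.0, -60.0]])

mae = angular_mae(pred, true)
mse = angular_mse(pred, true)
```

Without the wrapping step, the first pair of angles (179° and -179°) would contribute an error of 358° instead of 2°.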
The main challenges in going from a sequence of letters representing amino acids to a 3-dimensional protein structure are
1) an efficient loss calculation and 2) the output of the network being angular.
- In AlQuraishi's paper, a protein’s tertiary structure is approximated by 3 torsional angles per amino acid,
which are then used to reconstruct the 3-dimensional structure so that the loss can be computed in that space.
That process is computationally very expensive, so we focus on a regression task that minimizes the loss between angles directly,
as is also done in the RaptorX paper.
- We need to angularize the output of the network to compare it with the true angles.
As one approach, we predict 3 values squeezed directly into the range [-pi, pi]
by a scaled tanh. Another approach is to predict 6 values split into 3 pairs,
where each pair represents a vector in 2-dimensional space that can then be converted into an angle using the atan2 function.
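The two angularization approaches described above can be sketched in NumPy as follows. This is an illustration, not the project's TensorFlow code. The function names, the example inputs, and the (x, y) ordering within each pair are assumptions.

```python
import numpy as np

def angles_from_tanh(raw):
    """Squash 3 unbounded outputs per residue into [-pi, pi] with a scaled tanh."""
    return np.pi * np.tanh(raw)

def angles_from_pairs(raw):
    """Treat 6 outputs per residue as 3 (x, y) vectors and return their atan2 angles."""
    pairs = raw.reshape(raw.shape[:-1] + (3, 2))  # (..., 6) -> (..., 3, 2)
    return np.arctan2(pairs[..., 1], pairs[..., 0])

# Hypothetical raw network outputs for a protein with 2 residues.
raw3 = np.array([[0.0, 10.0, -10.0],
                 [0.5, -0.5, 2.0]])
raw6 = np.array([[1.0, 0.0, 0.0, 1.0, -1.0, 0.0],
                 [1.0, 1.0, 0.0, -2.0, 3.0, 3.0]])

tanh_angles = angles_from_tanh(raw3)   # shape (2, 3)
pair_angles = angles_from_pairs(raw6)  # shape (2, 3)
```

The tanh variant saturates near ±pi, so angles close to the boundary need very large raw outputs. The atan2 variant depends only on the direction of each vector, not its length.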
The model developed in this project achieved results on par with those reported for RaptorX while
using a smaller feature space. The final report can be found here
I used TensorFlow to load the data from files saved in the tensor format,
prepared a queue-based pipeline and a fully differentiable graph that first converts the Euclidean coordinates
of the protein atoms into their corresponding dihedral angles and
then minimizes the loss between angles predicted by the core model and the true angles in the training dataset.
I experimented with many ways of angularizing the output of the model and developed clean, modular code
available on GitHub.
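To illustrate the conversion from Euclidean coordinates to dihedral angles mentioned above, here is a minimal NumPy sketch using a standard projection-based formula. It is not the project's differentiable TensorFlow graph. The function names and the toy coordinates are hypothetical, and the sign convention is whatever this particular formula yields.

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (radians) defined by four consecutive atom positions."""
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # Project b0 and b2 onto the plane perpendicular to the central bond b1.
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return np.arctan2(y, x)

def chain_dihedrals(coords):
    """Dihedrals for every window of four consecutive atoms in an (n_atoms, 3) array."""
    return np.array([dihedral(*coords[i:i + 4]) for i in range(len(coords) - 3)])

# Toy chain of 5 atoms (not real protein coordinates).
coords = np.array([[0.0, 1.0, 0.0],
                   [0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [1.0, -1.0, 0.0],
                   [1.0, -1.0, 1.0]])
angles = chain_dihedrals(coords)  # one angle per 4-atom window -> shape (2,)
```

For a backbone stored as repeating N, CA, C atoms, sliding this window along the chain yields the three torsional angles per residue.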