Project description
The goal of the project was to predict the dihedral angles that
define the main structure (backbone) of a protein,
given its amino acid sequence.
Overview
Predicting a protein's tertiary structure from its
primary structure is one of the most important unsolved
problems in biochemistry. Current methods are inaccurate
and expensive. Machine learning offers a new toolset that
promises cheaper and much more efficient solutions.
One of the recent breakthroughs is Mohammed AlQuraishi’s
paper,
End-to-end differentiable learning of protein structure, which, together with
RaptorX-Angle, inspired this project.
The model was developed for a course, but the results go beyond the regular scope of the class.
Technologies
Python, TensorFlow, and Jupyter Notebook
Technical Details
We use the ProteinNet dataset, as introduced by AlQuraishi, and wrote the entire data processing pipeline
in TensorFlow.
The heart of the model is a Convolutional Neural Network (CNN), similar to the one introduced in RaptorX.
The model is trained with the Adam optimizer on MSE and MAE losses between predicted and true
dihedral (torsion) angles.
Compared with the architecture in the image below, the LSTM is replaced with a CNN, and we do not
implement the pink part, which would convert the angles into 3-dimensional Euclidean coordinates.
Source: AlQuraishi, End-to-end differentiable learning of protein
structure
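As a minimal sketch of the angle-space losses mentioned above, the snippet below computes MSE and MAE between predicted and true angles in radians. It uses NumPy rather than TensorFlow to stay self-contained, and the function name `angle_losses` is hypothetical. The optional `wrap` flag, which treats angles as periodic, is an assumption added for illustration and is not described in the project itself.

```python
import numpy as np

def angle_losses(pred, true, wrap=False):
    """Return (mse, mae) between two arrays of angles given in radians."""
    diff = np.asarray(pred, dtype=float) - np.asarray(true, dtype=float)
    if wrap:
        # Assumption: map differences into [-pi, pi) so that angles just
        # below pi and just above -pi are treated as close to each other.
        diff = (diff + np.pi) % (2 * np.pi) - np.pi
    mse = float(np.mean(diff ** 2))
    mae = float(np.mean(np.abs(diff)))
    return mse, mae

pred = np.array([0.0, 3.1])
true = np.array([0.5, -3.1])
plain_mse, plain_mae = angle_losses(pred, true)
wrapped_mse, wrapped_mae = angle_losses(pred, true, wrap=True)
```

Without wrapping, the second pair is penalized as if it were 6.2 radians apart, although the two angles are almost identical on the circle. Whether that matters in practice depends on how often true angles sit near the ±pi boundary.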
The main challenges in going from a sequence of letters representing amino acids to a 3-dimensional
protein structure are
1) computing the loss efficiently and 2) handling the angular nature of the network's output.
- In AlQuraishi's paper, a protein’s tertiary structure is approximated by 3 torsion
angles per amino acid,
which are then used to reconstruct the 3-dimensional structure so that the loss can be computed in that space.
That process is computationally very expensive, so we focus on a regression
task that minimizes the loss between angles directly,
as is also done in the RaptorX paper.
- We need to angularize the output of the network to compare it with the true angles.
One approach is to predict 3 values directly and squeeze them into the range [-pi, pi]
with a scaled tanh. Another approach is to predict 6 values, split into 3 pairs,
where each pair represents a vector in 2-dimensional space that can then be converted
into an angle using the atan2 function.
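The two angularization approaches listed above can be sketched as follows. This is an illustration in NumPy, not the project's TensorFlow code: the function names are hypothetical, and the ordering of each pair as (x, y) is an assumption.

```python
import numpy as np

def angles_from_tanh(raw):
    """Squeeze unbounded outputs of shape (..., 3) into [-pi, pi] with a scaled tanh."""
    return np.pi * np.tanh(np.asarray(raw, dtype=float))

def angles_from_pairs(raw):
    """Turn outputs of shape (..., 6) into 3 angles via atan2.

    Assumption: the 6 values are laid out as (x1, y1, x2, y2, x3, y3).
    """
    raw = np.asarray(raw, dtype=float)
    pairs = raw.reshape(raw.shape[:-1] + (3, 2))
    # atan2(y, x) gives the angle of the 2-D vector (x, y) in [-pi, pi].
    return np.arctan2(pairs[..., 1], pairs[..., 0])

tanh_angles = angles_from_tanh([[0.0, 5.0, -5.0]])
pair_angles = angles_from_pairs([[1.0, 0.0, 0.0, 1.0, -1.0, 0.0]])
```

One design difference is visible here: the tanh variant needs very large activations to reach angles near ±pi, while the vector variant represents every angle equally well, because only the direction of each pair matters, not its length.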
Results
The model developed in this project achieved results on par with those reported in the RaptorX paper while
using a smaller feature space. The final report can be found
here.
My contribution
I used TensorFlow to load the data from files saved in the tensor format,
prepared a queue-based pipeline, and built a fully differentiable graph that first converts the Euclidean
coordinates
of the protein atoms into their corresponding dihedral angles and
then minimizes the loss between the angles predicted by the core model and the true angles in the
training dataset.
I experimented with many ways of angularizing the output of the model and developed clean, modular
code,
available on GitHub.
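The conversion from Euclidean coordinates to dihedral angles mentioned above can be sketched with the standard atan2 formula over three consecutive bond vectors. This is a NumPy illustration, not the project's TensorFlow graph. The function name `dihedrals` is hypothetical, and the input layout (one row per backbone atom, in chain order) and the sign convention are assumptions.

```python
import numpy as np

def dihedrals(coords):
    """Dihedral angle for every run of 4 consecutive atoms.

    coords: array of shape (M, 3), backbone atoms in chain order (assumed).
    Returns an array of shape (M - 3,) with angles in [-pi, pi].
    """
    coords = np.asarray(coords, dtype=float)
    u = coords[1:] - coords[:-1]            # (M - 1, 3) bond vectors
    u1, u2, u3 = u[:-2], u[1:-1], u[2:]     # three consecutive bonds
    n1 = np.cross(u1, u2)                   # normal of the first plane
    n2 = np.cross(u2, u3)                   # normal of the second plane
    y = np.linalg.norm(u2, axis=-1) * np.sum(u1 * n2, axis=-1)
    x = np.sum(n1 * n2, axis=-1)
    return np.arctan2(y, x)

# Four atoms around a central bond along the z axis.
cis = dihedrals([[1, 0, 0], [0, 0, 0], [0, 0, 1], [1, 0, 1]])
trans = dihedrals([[1, 0, 0], [0, 0, 0], [0, 0, 1], [-1, 0, 1]])
quarter = dihedrals([[1, 0, 0], [0, 0, 0], [0, 0, 1], [0, 1, 1]])
```

Every operation here (subtraction, cross product, dot product, norm, atan2) has a differentiable counterpart in TensorFlow, which is what would allow such a conversion to sit inside a differentiable graph.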