Protein tertiary structure prediction

IT University of Copenhagen (ITU), group work

Project description

The goal of the project was to predict the angles that define a protein's main chain (its backbone), given its amino acid sequence.

Overview

Predicting a protein's tertiary structure from its primary structure is one of the most important unsolved problems in biochemistry. Current methods are inaccurate and expensive. Machine learning offers a new toolset that promises cheaper and much more efficient solutions. One of the recent breakthroughs is Mohammed AlQuraishi’s paper, End-to-end differentiable learning of protein structure, which, together with RaptorX-Angle, inspired this project.

The model was developed for a course, but the results go beyond the regular scope of the class.

Technologies

Python, TensorFlow, and Jupyter Notebook

Technical Details

We use the ProteinNet dataset introduced by AlQuraishi and write the entire data processing pipeline in TensorFlow. The heart of the model is a Convolutional Neural Network (CNN), similar to the one introduced in RaptorX. The model is trained with the Adam optimizer on MSE and MAE losses between predicted and true dihedral (torsional) angles. Compared with the architecture in the image below, the LSTM is replaced with a CNN, and we do not implement the pink part that would convert the angles into 3-dimensional Euclidean space.

Source: AlQuraishi, End-to-end differentiable learning of protein structure
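
As a rough illustration of the angle-space losses mentioned above, the sketch below computes MSE and MAE between predicted and true angles in NumPy. It is not the project's TensorFlow code. The wrap-around step, which treats angles near -pi and pi as close to each other, is an assumption made for this example, and the function names are hypothetical.

```python
import numpy as np


def wrapped_difference(pred, true):
    """Smallest signed difference between angles in radians, in [-pi, pi).

    Assumption for this sketch: errors are measured on the circle, so
    3.1 rad and -3.1 rad count as close rather than 6.2 rad apart.
    """
    return (pred - true + np.pi) % (2.0 * np.pi) - np.pi


def angular_mse(pred, true):
    """Mean squared error over wrapped angle differences."""
    return float(np.mean(wrapped_difference(pred, true) ** 2))


def angular_mae(pred, true):
    """Mean absolute error over wrapped angle differences."""
    return float(np.mean(np.abs(wrapped_difference(pred, true))))
```

Without the wrap-around step, a prediction just past pi would be penalized as if it were almost a full turn away from a target just past -pi.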

The main challenges in going from a sequence of letters representing amino acids to a 3-dimensional protein structure are 1) calculating the loss efficiently and 2) handling the fact that the network's output is angular.
  1. In AlQuraishi's paper, a protein’s tertiary structure is approximated by 3 torsional angles per amino acid, which are then used to rebuild the 3-dimensional structure so that the loss can be computed in that space. That process is very computationally expensive, so we focus instead on a regression task that minimizes the loss between angles directly, as is also done in the RaptorX paper.
  2. We need to angularize the output of the network to compare it with the true angles. In one approach, we predict 3 values that a scaled tanh squeezes into the range [-pi, pi]. In another approach, we predict 6 values split into 3 pairs, where each pair represents a vector in 2-dimensional space that can then be converted into an angle using the atan2 function.
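
The two angularization approaches in point 2 can be sketched as follows. This is a minimal NumPy illustration, not the project's TensorFlow implementation. The function names are hypothetical, and the ordering of each pair as (x, y) is an assumption.

```python
import numpy as np


def angles_from_tanh(raw):
    """Approach 1: squeeze 3 unbounded outputs into [-pi, pi].

    raw: array of shape (..., 3) with raw network outputs.
    """
    return np.pi * np.tanh(raw)


def angles_from_vectors(raw):
    """Approach 2: read 6 outputs as 3 two-dimensional vectors.

    raw: array of shape (..., 6). Each consecutive pair is taken as
    (x, y) (an assumed ordering) and converted with atan2(y, x).
    """
    pairs = raw.reshape(raw.shape[:-1] + (3, 2))
    return np.arctan2(pairs[..., 1], pairs[..., 0])
```

One practical difference between the two: with the scaled tanh, angles near -pi and pi sit at opposite ends of the output range, whereas the vector form represents them with nearby points on the circle.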

Results

The model developed in this project achieved results on par with those reported in the RaptorX paper while using a smaller feature space. The final report can be found here.

My contribution

I used TensorFlow to load the data from files saved in the tensor format and prepared a queue-based input pipeline. I also built a fully differentiable graph that first converts the Euclidean coordinates of the protein atoms into the corresponding dihedral angles and then minimizes the loss between the angles predicted by the core model and the true angles in the training dataset. I experimented with many ways of angularizing the output of the model and developed clean, modular code available on GitHub.
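
For readers unfamiliar with the coordinate-to-angle step, here is a minimal NumPy sketch of how dihedral angles can be computed from consecutive atom positions. It uses a standard projection-based formula and is not the project's TensorFlow graph. The function name and the assumption that the atoms arrive as one ordered chain are for illustration only, and the sign convention may differ from the one used in the project.

```python
import numpy as np


def dihedrals(coords):
    """Dihedral angles (radians) for each run of 4 consecutive atoms.

    coords: array of shape (n, 3) with n >= 4 ordered atom positions.
    Returns an array of shape (n - 3,) with values in [-pi, pi].
    """
    p0, p1, p2, p3 = coords[:-3], coords[1:-2], coords[2:-1], coords[3:]
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    # Unit vector along the central bond, which is the rotation axis.
    b1 = b1 / np.linalg.norm(b1, axis=-1, keepdims=True)
    # Project the outer bonds onto the plane perpendicular to the axis.
    v = b0 - np.sum(b0 * b1, axis=-1, keepdims=True) * b1
    w = b2 - np.sum(b2 * b1, axis=-1, keepdims=True) * b1
    # The signed angle between the projections is the dihedral angle.
    x = np.sum(v * w, axis=-1)
    y = np.sum(np.cross(b1, v) * w, axis=-1)
    return np.arctan2(y, x)
```

Every operation here (subtraction, dot and cross products, atan2) has a differentiable counterpart in TensorFlow, which is what lets a step of this kind sit inside a differentiable graph.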