MMLigner User Tutorial

Developed at Volkamerlab, Charité/FU Berlin

by Dennis Köser

References

Collier,J.H., Allison,L., Lesk,A.M., Stuckey,P.J., Garcia de la Banda,M. and Konagurthu,A.S. (2017) Statistical inference of protein structural alignments using information and compression. Bioinformatics, 33(7), 1005–1013.

Wallace,C.S. and Boulton,D.M. (1968) An information measure for classification. Comput. J., 11, 185–194.

Wallace,C.S. (2005) Statistical and Inductive Inference by Minimum Message Length. Information Science and Statistics. Springer-Verlag, New York, NY.

Shannon,C.E. (1948) A mathematical theory of communication. Bell Syst. Tech. J., 27, 379–423.

Introduction

What are the chosen structures

Since this project was developed during the SARS-CoV-2 pandemic of 2020, we chose the main protease of SARS-CoV from 2006 (2GZ9) and the main protease of SARS-CoV-2 from 2020 (5R8T) as example structures for this tutorial. SARS-CoV and SARS-CoV-2 are strains of viruses that cause severe acute respiratory syndrome (SARS). The chosen proteases are required for the maturation of SARS-CoV and SARS-CoV-2, respectively, which makes them good targets for structure-based drug design of anti-SARS drugs.

Why they have been chosen

In addition to their relevance for the pandemic of 2020, these structures work well as an example for this tutorial because they are closely related, which results in quite similar but not completely identical structures. Both have a length of 306 residues and, if you align their sequences with Needleman-Wunsch, show a similarity of 98.69 % and an identity of 96.08 %.
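To make the identity figure more concrete, here is a minimal sketch of a Needleman-Wunsch global alignment followed by a percent-identity calculation. It is not the tool that produced the numbers above: the unit match/mismatch/gap scores and the function names are assumptions chosen for illustration, whereas real sequence aligners typically use a substitution matrix and affine gap penalties. Identity is computed here as identical columns divided by alignment length.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two sequences with simple unit scores (illustrative only)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = n, m
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Identical, non-gap columns as a percentage of the alignment length."""
    matches = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


if __name__ == "__main__":
    # Toy sequences, not the protease sequences from this tutorial.
    top, bottom = needleman_wunsch("GATTACA", "GATTCA")
    print(top)
    print(bottom)
    print(round(percent_identity(top, bottom), 2))
```

For the two proteases, an identity of 96.08 % over 306 aligned positions corresponds to 294 identical residues.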

Theory

About MMLigner

MMLigner (Collier et al., 2017) works by minimizing the ivalue of the alignment. The ivalue is based on the Minimum Message Length framework (Wallace and Boulton, 1968; Wallace, 2005), a Bayesian framework for statistical inductive inference. The ivalue represents the hypothetical minimum message length needed to transmit the computed alignment losslessly (Shannon, 1948). Using the ivalue measure, the algorithm rapidly creates crude-but-effective structural alignments to act as seeds. These seeds are iteratively refined with an Expectation-Maximization approach using the ivalue criterion. The statistical significance of the alignment is computed by subtracting the ivalue from the message length of the null model. If the difference is positive, the alignment is significant.
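The significance test described above can be sketched in a few lines. This is only an illustration of the arithmetic, not MMLigner's implementation: the function names are hypothetical, and the bit counts in the usage example are made-up toy numbers. The first function shows Shannon's relation between the probability of an event and the length of its optimal code in bits.

```python
import math


def message_length_bits(probability):
    """Shannon code length in bits for an event with the given probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


def compression_bits(null_bits, ivalue_bits):
    """Bits saved by explaining the structures with an alignment
    instead of the null model (hypothetical helper)."""
    return null_bits - ivalue_bits


def is_significant(null_bits, ivalue_bits):
    """An alignment counts as significant if it compresses better than the null model."""
    return compression_bits(null_bits, ivalue_bits) > 0


if __name__ == "__main__":
    # Toy numbers for illustration only, not MMLigner output.
    print(message_length_bits(0.25))
    print(compression_bits(1000.0, 900.0))
    print(is_significant(1000.0, 900.0))
```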

About the RCSB

The RCSB PDB (Research Collaboratory for Structural Bioinformatics Protein Data Bank) provides a global PDB archive and makes PDB data available at no charge to all data consumers without limitations on usage.

RMSD

The RMSD (root-mean-square deviation) is the square root of the mean squared distance between the matched atoms of the superposed structures, given in Ångström.
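As a minimal sketch of this definition, the RMSD of two already superposed and matched coordinate sets can be computed with NumPy as follows. The function name and the toy coordinates are assumptions for illustration. MMLigner computes its own RMSD internally, so this is not how the value reported later in the tutorial was obtained.

```python
import numpy as np


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched (N, 3) coordinate sets."""
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("coordinate sets must have the same shape")
    # Squared distance per atom pair, averaged over all pairs, then square root.
    squared_distances = np.sum((a - b) ** 2, axis=1)
    return float(np.sqrt(np.mean(squared_distances)))


if __name__ == "__main__":
    # Toy coordinates: one atom pair coincides, the other is 2 Å apart.
    a = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    b = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
    print(rmsd(a, b))
```

In the toy example the mean distance is 1 Å, while the RMSD is larger because squaring gives more weight to the larger deviation.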

Coverage

The coverage is the number of residues that are matched in the alignment of the two structures.
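A minimal sketch of how such a count could be derived, assuming the alignment is available as two equal-length gapped strings in which "-" marks a gap. This representation and the function name are hypothetical and chosen for illustration; the alignment object returned by opencadd may look different.

```python
def coverage(aligned_a, aligned_b, gap="-"):
    """Count alignment columns in which both structures contribute a residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(x != gap and y != gap for x, y in zip(aligned_a, aligned_b))


if __name__ == "__main__":
    # Toy alignment: columns 3 and 4 each contain a gap.
    print(coverage("AC-GT", "ACD-T"))
```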

ivalue

The minimum message length of the compressed alignment in bits.

Preparation

How to get the structure from the CLI

To fetch the structures directly from the RCSB and align them with MMLigner on the command line, the syntax looks like this:

!opencadd --method=mmligner 2GZ9 5R8T

Getting the structure in Python

The first thing you need to do is download the proteins and pass them to opencadd. We do that with Structure objects and the .from_pdbid() class method.

[1]:
from opencadd.structure.core import Structure

structure1 = Structure.from_pdbid("2GZ9")
structure2 = Structure.from_pdbid("5R8T")

Using the Python API

Using MMLigner in Python looks like this:

[2]:
%%time
from opencadd.structure.superposition.engines.mmligner import MMLignerAligner

mmligner = MMLignerAligner()
results = mmligner.calculate([structure1, structure2])
Current MMLigner wrappers produces accurate RMSD values but slightly shifted structures!
CPU times: user 1.03 s, sys: 0 ns, total: 1.03 s
Wall time: 3.37 s

You can access the RMSD value like this:

[3]:
results["scores"]["rmsd"]
[3]:
1.244

Or the rotation matrix that was used like this:

[4]:
results["metadata"]["rotation"]
[4]:
[[0.257, 0.015, 0.966], [-0.02, 1.0, -0.01], [-0.966, -0.017, 0.257]]
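As a quick sanity check, a rotation matrix should be orthonormal with a determinant of +1. The sketch below tests this for the matrix printed above (within the tolerance implied by its three-decimal rounding) and shows how a rotation and a translation could be applied to coordinates. The helper names are hypothetical, and the convention x' = R x + t is an assumption; the convention MMLigner actually uses may differ.

```python
import numpy as np

# Rotation matrix as printed in the cell output above (rounded to 3 decimals).
rotation = np.array(
    [[0.257, 0.015, 0.966], [-0.02, 1.0, -0.01], [-0.966, -0.017, 0.257]]
)


def is_rotation_matrix(matrix, atol=5e-3):
    """Check that a 3x3 matrix is orthonormal with determinant +1."""
    m = np.asarray(matrix, dtype=float)
    if m.shape != (3, 3):
        return False
    orthonormal = np.allclose(m @ m.T, np.eye(3), atol=atol)
    proper = abs(np.linalg.det(m) - 1.0) < atol
    return bool(orthonormal and proper)


def apply_transform(coords, rotation, translation):
    """Apply x' = R x + t to every row of an (N, 3) array (assumed convention)."""
    coords = np.asarray(coords, dtype=float)
    return coords @ np.asarray(rotation, dtype=float).T + np.asarray(translation, dtype=float)


if __name__ == "__main__":
    print(is_rotation_matrix(rotation))
    # Toy point and translation for illustration only.
    print(apply_transform([[1.0, 0.0, 0.0]], rotation, [0.0, 0.0, 0.0]))
```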

The results are returned as a dictionary containing the superposed structures, the scores (RMSD, ivalue and coverage) and the metadata (the alignment, the rotation matrix, the translation vector and the quaternion matrix, which is stored under the key "quarternion"). The dictionary looks like this:

{
    "superposed": superposed_models,
    "scores": {"rmsd": rmsd, "score": ivalue, "coverage": coverage},
    "metadata": {"alignment": alignment, "rotation": rotation, "translation": translation, "quarternion": quarternion},
}

In addition, you can compute the ivalue of an already computed alignment. This looks like:

[5]:
alignment = results["metadata"]["alignment"]
results_ivalue = mmligner.ivalue([structure1, structure2], alignment)

print("i-value for this alignment is", results_ivalue["scores"]["score"])
/home/jaime/.conda/envs/structuralalignment/lib/python3.8/site-packages/MDAnalysis/coordinates/PDB.py:914: UserWarning: Found no information for attr: 'tempfactors' Using default value of '0.0'
  warnings.warn("Found no information for attr: '{}'"
i-value for this alignment is 20323.5

Because ivalue only scores an already computed alignment, only the scores are returned. The dictionary looks like this:

{
    "scores": {"rmsd": rmsd, "score": ivalue, "coverage": coverage},
}

Analysis

Let's look at the computed alignment!

If you have trouble with NGLview, follow this troubleshooting guide.

[6]:
import nglview as nv

view = nv.show_mdanalysis(results["superposed"][0].atoms)
view.add_component(results["superposed"][1].atoms)

view

As expected, the structures are a pretty good match.

Report

Using MMLigner, we get an alignment with an RMSD of 1.244 Å. The alignment covers 303 of the 306 residues. The results show how closely related the two viruses are and may provide an opportunity to transfer research approaches from the development of anti-SARS drugs targeting the main protease of SARS-CoV to drugs targeting the main protease of SARS-CoV-2.