# MMLigner User Tutorial

Developed at Volkamerlab, Charité/FU Berlin

by Dennis Köser

## References

Collier, J.H., Allison, L., Lesk, A.M., Stuckey, P.J., Garcia de la Banda, M., Konagurthu, A.S. (2017) Statistical inference of protein structural alignments using information and compression. Bioinformatics, 33(7), 1005–1013.

Wallace, C.S. and Boulton, D.M. (1968) An information measure for classification. Comput. J., 11, 185–194.

Wallace, C.S. (2005) Statistical and Inductive Inference Using Minimum Message Length. Information Science and Statistics. Springer-Verlag, New York, NY.

Shannon, C.E. (1948) A mathematical theory of communication. Bell Syst. Tech. J., 27, 379–423.

## Introduction

### What are the chosen structures

Since this project was developed during the SARS-CoV-2 pandemic of 2020, we chose the main protease of SARS-CoV (PDB ID 2GZ9, a structure from 2006) and the main protease of SARS-CoV-2 (PDB ID 5R8T, a structure from 2020) as example structures for this tutorial. SARS-CoV and SARS-CoV-2 are virus strains that cause severe acute respiratory syndrome (SARS). The chosen proteases are required for the maturation of SARS-CoV and SARS-CoV-2 respectively, so they make good targets for structure-based design of anti-SARS drugs.

### Why they have been chosen

In addition to their relevance to the pandemic of 2020, these structures work well as an example for this tutorial because the two proteins are closely related, which results in very similar but not completely identical structures. Both are 306 residues long and, when aligned with Needleman-Wunsch, show a sequence similarity of 98.69 % and an identity of 96.08 %.
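
To make the identity figure concrete, here is a minimal sketch of how a percent identity can be computed from two aligned sequences. The function name `percent_identity` and the toy sequences are made up for illustration. The counts 294 and 302 are inferred from the reported percentages under the assumption of an ungapped alignment with 306 columns; they are not taken from the alignment itself.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of alignment columns in which both sequences carry the same residue."""
    if not aligned_a or len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must be non-empty and of equal length")
    # Count columns with identical residues; gap columns ("-") never count as matches.
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    return 100.0 * matches / len(aligned_a)


# Toy aligned sequences (illustrative only, not the real proteases): 7 of 8 columns match.
toy_identity = percent_identity("SGFRKMAF", "SGFRKMSF")

# Counts inferred from the reported percentages, assuming 306 ungapped columns.
reported_identity = round(100.0 * 294 / 306, 2)
reported_similarity = round(100.0 * 302 / 306, 2)

print(toy_identity, reported_identity, reported_similarity)
```

Under that assumption, 294 identical and 302 similar positions out of 306 reproduce the percentages quoted above.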

## Theory

MMLigner (Collier et al., 2017) works by minimizing the ivalue of the alignment. The ivalue is based on the Minimum Message Length framework (Wallace and Boulton, 1968; Wallace, 2005), a Bayesian framework for statistical inductive inference. It represents the hypothetical minimum message length needed to transmit the computed alignment losslessly (Shannon, 1948). Using the ivalue measure, the algorithm rapidly creates crude but effective structural alignments that act as seeds. These seeds are iteratively refined with an Expectation-Maximization approach using the ivalue criterion. Subtracting the ivalue of the alignment from the message length of the null model gives the statistical significance of the alignment: if the difference is positive, the alignment is significant.

The RCSB PDB (Research Collaboratory for Structural Bioinformatics Protein Data Bank) provides a global PDB archive and makes PDB data available at no charge to all data consumers without limitations on usage.
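
The significance test described above can be sketched in a few lines. This is not MMLigner's implementation: the function names are hypothetical, and the message lengths in the example are made-up numbers chosen only to show the arithmetic. The first function illustrates the link to Shannon's work, where an event with probability p costs -log2(p) bits to encode.

```python
import math


def message_length_bits(probability: float) -> float:
    """Shannon code length, in bits, of an event with the given probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


def compression_bits(null_bits: float, ivalue_bits: float) -> float:
    """Bits saved by describing the data with the alignment instead of the null model."""
    return null_bits - ivalue_bits


def is_significant(null_bits: float, ivalue_bits: float) -> bool:
    """An alignment counts as significant when it compresses better than the null model."""
    return compression_bits(null_bits, ivalue_bits) > 0


# An event with probability 1/8 costs 3 bits to transmit.
bits_for_one_eighth = message_length_bits(0.125)

# Made-up message lengths (in bits), purely to show the comparison.
saved = compression_bits(null_bits=1000.0, ivalue_bits=900.0)
print(bits_for_one_eighth, saved, is_significant(1000.0, 900.0))
```

A shorter message for the alignment than for the null model means the alignment explains the data better than assuming the structures are unrelated.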

### RMSD

The RMSD (root-mean-square deviation) is the square root of the mean squared distance between corresponding atoms of the superposed structures, given in ångströms (Å).
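
As a minimal sketch, the RMSD of two sets of paired coordinates is the square root of the mean squared distance between the pairs. The function name and the coordinates below are made up for illustration; the sketch assumes the structures are already superposed and the atoms already paired, and it is not how opencadd or MMLigner compute the value internally.

```python
import math


def rmsd(coords_a, coords_b) -> float:
    """Root-mean-square deviation between two equally long lists of (x, y, z) points."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    # Sum the squared Euclidean distance of every atom pair.
    squared_sum = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared_sum / len(coords_a))


# Two toy "structures" with two atoms each: one pair is 2 Å apart, the other coincides.
atoms_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
atoms_b = [(0.0, 0.0, 2.0), (1.0, 0.0, 0.0)]
example_rmsd = rmsd(atoms_a, atoms_b)
print(round(example_rmsd, 3))
```

In this toy case the plain average distance is 1 Å, while the RMSD is the square root of 2 (about 1.414 Å), because squaring gives large deviations more weight.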

### Coverage

The coverage of the aligned structures, that is, how much of the structures the alignment covers.

### ivalue

The minimum message length of the compressed alignment in bits.

## Preparation

### How to get the structure from the CLI

To get the structures directly from the RCSB, the syntax looks like this:

### Getting the structure in Python

The first thing you need to do is download the structures and pass them to opencadd. We do that with Structure objects and their .from_pdbid() class method.

[1]:

from opencadd.structure.core import Structure

structure1 = Structure.from_pdbid("2GZ9")
structure2 = Structure.from_pdbid("5R8T")


### Using the Python API

Using MMLigner in Python looks like this:

[2]:

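%%time

# Import the MMLigner wrapper (module path assumed; adjust it to your opencadd version)
from opencadd.structure.superposition.engines.mmligner import MMLignerAligner

# Superpose both structures; the returned dictionary is described further down
mmligner = MMLignerAligner()
results = mmligner.calculate([structure1, structure2])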

Current MMLigner wrappers produces accurate RMSD values but slightly shifted structures!
CPU times: user 1.03 s, sys: 0 ns, total: 1.03 s
Wall time: 3.37 s


You can access the RMSD value like this:

[3]:

results["scores"]["rmsd"]

[3]:

1.244


Or the rotation matrix used for the transformation like this:

[4]:

results["metadata"]["rotation"]

[4]:

[[0.257, 0.015, 0.966], [-0.02, 1.0, -0.01], [-0.966, -0.017, 0.257]]
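
As a quick sanity check, the sketch below verifies that the printed matrix is a proper rotation (orthonormal rows, determinant close to +1, within the rounding of the three printed decimals) and applies it to a point. The helper names are hypothetical, and the sketch assumes the matrix acts on column vectors; whether opencadd uses that convention is not stated in this tutorial.

```python
# Rotation matrix exactly as printed above.
rotation = [
    [0.257, 0.015, 0.966],
    [-0.02, 1.0, -0.01],
    [-0.966, -0.017, 0.257],
]


def rotate(matrix, point):
    """Multiply a 3x3 matrix with a 3D point (assumes the column-vector convention)."""
    return [sum(m * p for m, p in zip(row, point)) for row in matrix]


def is_rotation(matrix, tol=1e-2):
    """Check for orthonormal rows and a determinant of +1, within the tolerance tol."""
    for i in range(3):
        for j in range(3):
            dot = sum(a * b for a, b in zip(matrix[i], matrix[j]))
            expected = 1.0 if i == j else 0.0
            if abs(dot - expected) > tol:
                return False
    (a, b, c), (d, e, f), (g, h, k) = matrix
    det = a * (e * k - f * h) - b * (d * k - f * g) + c * (d * h - e * g)
    return abs(det - 1.0) <= tol


# Rotating the x unit vector returns the first column of the matrix.
rotated_x = rotate(rotation, [1.0, 0.0, 0.0])
print(is_rotation(rotation), rotated_x)
```

The loose tolerance accounts for the matrix being shown with only three decimals.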


The results are returned as a dictionary containing the superposed structures, the scores (RMSD, ivalue and coverage) and the metadata: the alignment, the rotation matrix, the translation vector and the quaternion (stored under the key "quarternion"). The dictionary looks like this:

{
"superposed": superposed_models,
"scores": {"rmsd": rmsd, "score": ivalue, "coverage": coverage},
"metadata": {"alignment": alignment, "rotation": rotation, "translation": translation, "quarternion": quarternion},
}
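
A short sketch of how such a dictionary can be unpacked is shown below. To keep it runnable without MMLigner, it uses a stand-in dictionary with the same keys. Only the RMSD and rotation values are taken from the outputs shown in this tutorial; everything set to None is a placeholder, and the helper `summarize_scores` is hypothetical.

```python
# Stand-in for the dictionary returned by mmligner.calculate().
# Values set to None are placeholders, not real results.
example_results = {
    "superposed": [],
    "scores": {"rmsd": 1.244, "score": None, "coverage": None},
    "metadata": {
        "alignment": None,
        "rotation": [[0.257, 0.015, 0.966], [-0.02, 1.0, -0.01], [-0.966, -0.017, 0.257]],
        "translation": None,
        "quarternion": None,  # key spelled as in the dictionary layout shown above
    },
}


def summarize_scores(results) -> str:
    """Format the three scores on one line, marking missing values as n/a."""
    scores = results["scores"]
    parts = []
    for key in ("rmsd", "score", "coverage"):
        value = scores.get(key)
        parts.append(f"{key}=" + ("n/a" if value is None else f"{value:.3f}"))
    return ", ".join(parts)


summary = summarize_scores(example_results)
print(summary)
```

With a real result from mmligner.calculate(), the same helper would report all three scores, provided they are numeric.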


In addition, you can compute the ivalue of an already computed alignment. This looks like this:

[5]:

alignment = results["metadata"]["alignment"]
results_ivalue = mmligner.ivalue([structure1, structure2], alignment)

print("i-value for this alignment is", results_ivalue["scores"]["score"])

/home/jaime/.conda/envs/structuralalignment/lib/python3.8/site-packages/MDAnalysis/coordinates/PDB.py:914: UserWarning: Found no information for attr: 'tempfactors' Using default value of '0.0'
warnings.warn("Found no information for attr: '{}'"

i-value for this alignment is 20323.5


Because ivalue only scores an already computed alignment, only the scores are returned. The dictionary looks like this:

{
"scores": {"rmsd": rmsd, "score": ivalue, "coverage": coverage},
}


## Analysis

Let's look at the computed alignment!

If you have trouble with NGLview, follow this troubleshooting guide.

[6]:

import nglview as nv

view = nv.show_mdanalysis(results["superposed"][0].atoms)