File: correlations.py

package info (click to toggle)
mdanalysis 2.10.0-1
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid
  • size: 116,696 kB
  • sloc: python: 92,135; ansic: 8,156; makefile: 215; sh: 138
file content (265 lines) | stat: -rw-r--r-- 11,510 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
# -*- Mode: python; tab-width: 4; indent-tabs-mode:nil; coding:utf-8 -*-
# vim: tabstop=4 expandtab shiftwidth=4 softtabstop=4
#
# MDAnalysis --- https://www.mdanalysis.org
# Copyright (c) 2006-2017 The MDAnalysis Development Team and contributors
# (see the file AUTHORS for the full list of names)
#
# Released under the Lesser GNU Public Licence, v2.1 or any higher version
#
# Please cite your use of MDAnalysis in published work:
#
# R. J. Gowers, M. Linke, J. Barnoud, T. J. E. Reddy, M. N. Melo, S. L. Seyler,
# D. L. Dotson, J. Domanski, S. Buchoux, I. M. Kenney, and O. Beckstein.
# MDAnalysis: A Python package for the rapid analysis of molecular dynamics
# simulations. In S. Benthall and S. Rostrup editors, Proceedings of the 15th
# Python in Science Conference, pages 102-109, Austin, TX, 2016. SciPy.
# doi: 10.25080/majora-629e541a-00e
#
# N. Michaud-Agrawal, E. J. Denning, T. B. Woolf, and O. Beckstein.
# MDAnalysis: A Toolkit for the Analysis of Molecular Dynamics Simulations.
# J. Comput. Chem. 32 (2011), 2319--2327, doi:10.1002/jcc.21787
#

"""Correlations utilities --- :mod:`MDAnalysis.lib.correlations`
=================================================================================


:Authors: Paul Smith & Mateusz Bieniek
:Year: 2020
:Copyright: Lesser GNU Public License v2.1+

.. versionadded:: 1.0.0

This module is primarily for internal use by other analysis modules. It
provides functionality for calculating the time autocorrelation function
of a binary variable (i.e one that is either true or false at each
frame for a given atom/molecule/set of molecules). This module includes
functions for calculating both the time continuous autocorrelation and
the intermittent autocorrelation. The function :func:`autocorrelation`
calculates the continuous autocorrelation only. The data may be
pre-processed using the function :func:`intermittency` in order to
acount for intermittency before passing the results to
:func:`autocorrelation`.

This module is inspired by seemingly disparate analyses that rely on the same
underlying calculation, including the survival probability of water around
proteins :footcite:p:`ArayaSecchi2014`, hydrogen bond lifetimes
:footcite:p:`Gowers2015,ArayaSecchi2014`, and the rate of cholesterol
flip-flop in lipid bilayers :footcite:p:`Gu2019`.

.. seeAlso::

    Analysis tools that make use of modules:

        * :class:`MDAnalysis.analysis.waterdynamics.SurvivalProbability`
            Calculates the continuous or intermittent survival probability
            of an atom group in a region of interest.

        * :class:`MDAnalysis.analysis.hbonds.hbond_analysis`
            Calculates the continuous or intermittent hydrogen bond
            lifetime.

.. rubric:: References

.. footbibliography::

"""

import numpy as np
from copy import deepcopy


def autocorrelation(list_of_sets, tau_max, window_step=1):
    r"""Implementation of a discrete autocorrelation function.

    The autocorrelation of a property :math:`x` from a time :math:`t=t_0` to :math:`t=t_0 + \tau`
    is given by:

    .. math::
        C(\tau) = \langle \frac{ x(t_0)x(t_0 +\tau) }{ x(t_0)x(t_0) } \rangle

    where :math:`x` may represent any property of a particle, such as velocity or
    potential energy.

    This function is an implementation of a special case of the time
    autocorrelation function in which the property under consideration can
    be encoded with indicator variables, :math:`0` and :math:`1`, to represent the binary
    state of said property. This special case is often referred to as the
    survival probability (:math:`S(\tau)`). As an example, in calculating the survival
    probability of water molecules within 5 Å of a protein, each water
    molecule will either be within this cutoff range (:math:`1`) or not (:math:`0`). The
    total number of water molecules within the cutoff at time :math:`t_0` will be
    given by :math:`N(t_0)`. Other cases include the Hydrogen Bond Lifetime as
    well as the translocation rate of cholesterol across a bilayer.

    The survival probability of a property of a set of particles is
    given by:

    .. math::
        S(\tau) =  \langle \frac{ N(t_0, t_0 + \tau )} { N(t_0) }\rangle

    where :math:`N(t0)` is the number of particles at time :math:`t_0` for which the feature
    is observed, and :math:`N(t0, t_0 + \tau)` is the number of particles for which
    this feature is present at every frame from :math:`t_0` to :math:`N(t0, t_0 + \tau)`.
    The angular brackets represent an average over all time origins, :math:`t_0`.

    See :footcite:`ArayaSecchi2014` for a description survival probability.

    Parameters
    ----------
    list_of_sets : list
      List of sets. Each set corresponds to data from a single frame. Each element in a set
      may be, for example, an atom id or a tuple of atoms ids. In the case of calculating the
      survival probability of water around a protein, these atom ids in a given set will be
      those of the atoms which are within a cutoff distance of the protein at a given frame.
    tau_max : int
      The last tau (lag time, inclusive) for which to calculate the autocorrelation. e.g if tau_max = 20,
      the survival probability will be calculated over 20 frames.
    window_step : int, optional
      The step size for t0 to perform autocorrelation. Ideally, window_step will be larger than
       tau_max to ensure independence of each window for which the calculation is performed.
       Default is 1.

    Returns
    --------
    tau_timeseries : list of int
        the values of tau for which the autocorrelation was calculated
    timeseries : list of int
        the autocorelation values for each of the tau values
    timeseries_data : list of list of int
        the raw data from which the autocorrelation is computed, i.e :math:`S(\tau)` at each window.
        This allows the time dependant evolution of :math:`S(\tau)` to be investigated.

    .. versionadded:: 0.19.2
    """

    # check types
    if (type(list_of_sets) != list and len(list_of_sets) != 0) or type(
        list_of_sets[0]
    ) != set:
        raise TypeError(
            "list_of_sets must be a one-dimensional list of sets"
        )  # pragma: no cover

    # Check dimensions of parameters
    if len(list_of_sets) < tau_max:
        raise ValueError(
            "tau_max cannot be greater than the length of list_of_sets"
        )  # pragma: no cover

    tau_timeseries = list(range(1, tau_max + 1))
    timeseries_data = [[] for _ in range(tau_max)]

    # calculate autocorrelation
    for t in range(0, len(list_of_sets), window_step):
        Nt = len(list_of_sets[t])

        if Nt == 0:
            continue

        for tau in tau_timeseries:
            if t + tau >= len(list_of_sets):
                break

            # continuous: IDs that survive from t to t + tau and at every frame in between
            Ntau = len(set.intersection(*list_of_sets[t : t + tau + 1]))
            timeseries_data[tau - 1].append(Ntau / float(Nt))

    timeseries = [np.mean(x) for x in timeseries_data]

    # at time 0 the value has to be one
    tau_timeseries.insert(0, 0)
    timeseries.insert(0, 1)

    return tau_timeseries, timeseries, timeseries_data


def correct_intermittency(list_of_sets, intermittency):
    r"""Preprocess data to allow intermittent behaviour prior to calling :func:`autocorrelation`.

    Survival probabilty may be calculated either with a strict continuous requirement or
    a less strict intermittency. If calculating the survival probability water around a
    protein for example, in the former case the water must be within a cutoff distance
    of the protein at every frame from :math:`t_0` to :math:`t_0 + \tau` in order for it to be considered
    present at :math:`t_0 + \tau`. In the intermittent case, the water molecule is allowed to
    leave the region of interest for up to a specified consecutive number of frames whilst still
    being considered present at :math:`t_0 + \tau`.

    This function pre-processes data, such as the atom ids of water molecules within a cutoff
    distance of a protein at each frame, in order to allow for intermittent behaviour, with a
    single pass over the data.

    For example, if an atom is absent for a number of frames equal or smaller than the parameter
    :attr:`intermittency`, then this absence will be removed and thus the atom is considered to have
    not left.
    e.g 7,A,A,7 with `intermittency=2` will be replaced by 7,7,7,7, where A=absence.

    The returned data can be used as input to the function :func:`autocorrelation` in order
    to calculate the survival probability with a given intermittency.

    See :footcite:p:`Gowers2015` for a description of intermittency in the
    calculation of hydrogen bond lifetimes.

    # TODO - is intermittency consitent with list of sets of sets? (hydrogen bonds)

    Parameters
    ----------
    list_of_sets: list
        In the simple case of e.g survival probability, a list of sets of atom ids present at each frame, where a
        single set contains atom ids at a given frame, e.g [{0, 1}, {0}, {0}, {0, 1}]
    intermittency : int
        The maximum gap allowed. The default `intermittency=0` means that if the datapoint is missing at any frame, no
        changes are made to the data. With the value of `intermittency=2`, all datapoints missing for up to two
        consecutive frames will be instead be considered present.

    Returns
    -------
    list_of_sets: list
        returns a new list with the IDs with added IDs which disappeared for <= :attr:`intermittency`.
        e.g If [{0, 1}, {0}, {0}, {0, 1}] is a list of sets of atom ids present at each frame and `intermittency=2`,
        both atoms will be considered present throughout and thus the returned list of sets will be
        [{0, 1}, {0, 1}, {0, 1}, {0, 1}].

    """
    if intermittency == 0:
        return list_of_sets

    list_of_sets = deepcopy(list_of_sets)

    # an element (a superset) takes the form of:
    # - an atom pair when computing hydrogen bonds lifetime,
    # - atom ID in the case of water survival probability,
    for i, elements in enumerate(list_of_sets):
        # initially update each frame as seen 0 ago (now)
        seen_frames_ago = {i: 0 for i in elements}
        for j in range(1, intermittency + 2):
            for element in seen_frames_ago.keys():
                # no more frames
                if i + j >= len(list_of_sets):
                    continue

                # if the element is absent now
                if element not in list_of_sets[i + j]:
                    # increase its absence counter
                    seen_frames_ago[element] += 1
                    continue

                # the element is present
                if seen_frames_ago[element] == 0:
                    # the element was present in the last frame
                    continue

                # element was absent more times than allowed
                if seen_frames_ago[element] > intermittency:
                    continue

                # The element was absent but returned, i.e.
                # within <= intermittency_value.
                # Add it to the frames where it was absent.
                # Introduce the corrections.
                for k in range(seen_frames_ago[element], 0, -1):
                    list_of_sets[i + j - k].add(element)

                seen_frames_ago[element] = 0
    return list_of_sets