File: period_estimation.rst

package info (click to toggle)
python-cogent 2024.5.7a1%2Bdfsg-3
  • links: PTS, VCS
  • area: main
  • in suites: sid
  • size: 74,600 kB
  • sloc: python: 92,479; makefile: 117; sh: 16
file content (260 lines) | stat: -rw-r--r-- 7,589 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
***************************
Estimating periodic signals
***************************

.. sectionauthor:: Gavin Huttley, Julien Epps, Hua Ying

We consider two different scenarios:

- estimating the periods in a signal
- estimating the power for a given period
- measuring statistical significance for the latter case

Estimating the periods in a signal
==================================

For numerical (continuous) data
-------------------------------

We first make some sample data. A periodic signal and some noise.

..
    We set a seed for the random number generator so that we can get
    consistent generation of the same series. This makes the document
    robust for doctesting.

.. jupyter-execute::
    :hide-code:

    import numpy

    numpy.random.seed(11)

.. jupyter-execute::

    import numpy

    t = numpy.arange(0, 10, 0.1)
    n = numpy.random.randn(len(t))
    nse = numpy.convolve(n, numpy.exp(-t / 0.05)) * 0.1
    nse = nse[: len(t)]
    sig = numpy.sin(2 * numpy.pi * t) + nse

Discrete Fourier transform
^^^^^^^^^^^^^^^^^^^^^^^^^^

We now use the discrete Fourier transform to estimate periodicity in this signal. Given we set the period to equal 10, we expect the maximum power for that index.

.. jupyter-execute::

    from cogent3.maths.period import dft

    pwr, period = dft(sig)
    print(period)
    print(pwr)

The power (``pwr``) is returned as an array of complex numbers, so we convert into real numbers using ``abs``. We then zip the power and corresponding periods and sort to identify the period with maximum signal.

.. jupyter-execute::

    pwr = abs(pwr)
    max_pwr, max_period = sorted(zip(pwr, period))[-1]
    print(max_pwr, max_period)

Auto-correlation
^^^^^^^^^^^^^^^^

We now use auto-correlation.

.. jupyter-execute::

    from cogent3.maths.period import auto_corr

    pwr, period = auto_corr(sig)
    print(period)
    print(pwr)

We then zip the power and corresponding periods and sort to identify the period with maximum signal.

.. jupyter-execute::

    max_pwr, max_period = sorted(zip(pwr, period))[-1]
    print(max_pwr, max_period)

For symbolic data
-----------------

We create a sequence as just a string

.. jupyter-execute::

    s = (
        "ATCGTTGGGACCGGTTCAAGTTTTGGAACTCGCAAGGGGTGAATGGTCTTCGTCTAACGCTGG"
        "GGAACCCTGAATCGTTGTAACGCTGGGGTCTTTAACCGTTCTAATTTAACGCTGGGGGGTTCT"
        "AATTTTTAACCGCGGAATTGCGTC"
    )

We then specify the motifs whose occurrences will be converted into 1, with all other motifs converted into 0. As we might want to do this in batches for many sequences we use a factory function.

.. jupyter-execute::

    from cogent3.maths.stats.period import SeqToSymbols

    seq_to_symbols = SeqToSymbols(["AA", "TT", "AT"])
    symbols = seq_to_symbols(s)
    len(symbols) == len(s)
    symbols

We then estimate the integer discrete Fourier transform for the full data. To do this, we need to pass in the symbols from full conversion of the sequence. The returned values are the powers and periods.

.. jupyter-execute::

    from cogent3.maths.period import ipdft

    powers, periods = ipdft(symbols)
    powers

.. jupyter-execute::

    periods

We can also compute the auto-correlation statistic, and the hybrid (which combines IPDFT and auto-correlation).

.. jupyter-execute::

    from cogent3.maths.period import auto_corr, hybrid

    powers, periods = auto_corr(symbols)
    powers

.. jupyter-execute::

    periods

.. jupyter-execute::

    powers, periods = hybrid(symbols)
    powers

.. jupyter-execute::

    periods

Estimating power for specified period
=====================================

For numerical (continuous) data
-------------------------------

We just use ``sig`` created above. The Goertzel algorithm gives the same result as the ``dft``.

.. jupyter-execute::

    from cogent3.maths.period import goertzel

    pwr = goertzel(sig, 10)
    print(pwr)

For symbolic data
-----------------

.. take example above and show how to compute it using autocorrelation

We use the symbols from the above example. For the ``ipdft``, ``auto_corr`` and ``hybrid`` functions we just need to identify the array index containing the period of interest and slice the corresponding value from the returned powers. The reported periods start at ``llim``, which defaults to 2, but indexes start at 0, the index for a period-5 is simply 5-``llim``.

.. jupyter-execute::

    powers, periods = auto_corr(symbols)
    llim = 2
    period5 = 5 - llim
    periods[period5]

.. jupyter-execute::

    powers[period5]

For Fourier techniques, we can compute the power for a specific period more efficiently using Goertzel algorithm.

.. jupyter-execute::

    from cogent3.maths.period import goertzel

    period = 4
    power = goertzel(symbols, period)
    ipdft_powers, periods = ipdft(symbols)
    ipdft_power = abs(ipdft_powers[period - llim])
    round(power, 6) == round(ipdft_power, 6)
    power

It's also possible to specify a period to the stand-alone functions. As per the ``goertzel`` function, just the power is returned.

.. jupyter-execute::

    power = hybrid(symbols, period=period)
    power

Measuring statistical significance of periodic signals
======================================================

For numerical (continuous data)
-------------------------------

We use the signal provided above. Because significance testing is being done using a resampling approach, we define a calculator which precomputes some values to improve compute performance. For a continuous signal, we'll use the Goertzel algorithm.

.. jupyter-execute::

    from cogent3.maths.period import Goertzel

    goertzel_calc = Goertzel(len(sig), period=10)

Having defined this, we then just pass this calculator to the ``blockwise_bootstrap`` function. The other critical settings are the ``block_size`` which specifies the size of segments of contiguous sequence positions to use for sampling and ``num_reps`` which is the number of permuted replicate sequences to generate.

.. jupyter-execute::

    from cogent3.maths.stats.period import blockwise_bootstrap

    obs_stat, p = blockwise_bootstrap(
        sig, calc=goertzel_calc, block_size=10, num_reps=1000
    )
    print(obs_stat)
    print(p)

For symbolic data
-----------------

Permutation testing
^^^^^^^^^^^^^^^^^^^

The very notion of permutation testing for periods, applied to a genome, requires the compute performance be as quick as possible. This means providing as much information up front as possible. We have made the implementation flexible by not assuming how the user will convert sequences to symbols. It's also the case that numerous windows of exactly the same size are being assessed. Accordingly, we use a class to construct a fixed signal length evaluator. We do this for the hybrid metric first.

.. jupyter-execute::

    from cogent3.maths.period import Hybrid

    hybrid_calculator = Hybrid(len(s), period=4)

.. note:: We defined the period length of interest in defining this calculator because we're interested in dinucleotide motifs.

We then construct a seq-to-symbol convertor.

.. jupyter-execute::

    from cogent3.maths.stats.period import SeqToSymbols

    seq_to_symbols = SeqToSymbols(["AA", "TT", "AT"], length=len(s))

The rest is as per the analysis using ``Goertzel`` above.

.. jupyter-execute::

    from cogent3.maths.stats.period import blockwise_bootstrap

    stat, p = blockwise_bootstrap(
        s,
        calc=hybrid_calculator,
        block_size=10,
        num_reps=1000,
        seq_to_symbols=seq_to_symbols,
    )
    print(stat)
    p < 0.1