.\" use 'man groff_man' to see the man page format macros
.\" ----------------------------------------------------------------------------
.TH THESEUS 1 "25 March 2015" "Brandeis University" "Likelihood (and Bayes) Rocks"
.\" ----------------------------------------------------------------------------
.SH NAME
.\" ----
.\"
.B theseus
\- Maximum likelihood, multiple simultaneous superpositions
with statistical analysis
.\" ----------------------------------------------------------------------------
.SH SYNOPSIS
.\" --------
.\"
.B theseus
[options]
.I pdbfile1 [pdbfile2 ...]
.P
and
.P
.B theseus_align
[options] \-f
.I pdbfile1 [pdbfile2 ...]

.\".P
.\"Default usage is equivalent to:
.\".P
.\".B theseus
.\"\-a0 \-e2 \-g1 \-i200 \-k-1 \-p1e-7 \-r theseus \-v \-P0
.\".I your.pdb
.\" ----------------------------------------------------------------------------
.SH DESCRIPTION
.\" -----------
.\"
.B Theseus
superposes a set of macromolecular structures simultaneously using the
method of maximum likelihood (ML), rather than the conventional least-squares
criterion.
.B Theseus
assumes that the structures are distributed according to a matrix Gaussian
distribution and that the eigenvalues of the atomic covariance matrix are
hierarchically distributed according to an inverse gamma distribution. 
This
ML superpositioning model produces much more accurate results by essentially
downweighting variable regions of the structures and by correcting for
correlations among atoms. 
.P
.B Theseus
operates in two main modes: (1) a mode for superimposing structures with identical
sequences and (2) a mode for structures with different sequences but similar
structures:
.IP
(1) A mode for superpositioning macromolecules with identical sequences and
numbers of residues, for instance, multiple models in an NMR family or
multiple structures from different crystal forms of the same protein. 
.IP
In this mode,
.B Theseus
will read every model in every file on the command line and superpose
them.
.IP
Example:
.IP
.B theseus
.I 1s40.pdb
.IP
In the above example, 
.I 1s40.pdb
is a pdb file of 10 NMR models.
.IP
(2) An ``alignment'' mode for superpositioning structures with different
sequences, for example, multiple structures of the cytochrome c protein from
different species or multiple mutated structures of hen egg white lysozyme.
.IP
This mode requires the user to supply a sequence alignment file of the
structures being superpositioned (see option
.B "\-A"
and ``FILE FORMATS'' below).
Additionally, it may be necessary to supply a mapfile that tells
.B theseus
which PDB structure files correspond to which sequences in the alignment (see
option
.B "\-M"
and ``FILE FORMATS'' below). 
The mapfile is unnecessary if the sequence names and corresponding pdb filenames are identical.
In this mode, if there are multiple structural models in a PDB file,
.B theseus
only reads the first model in each file on the command line. In other words,
.B theseus
treats the files on the command line as if there were only one structure
per file.
.IP
Example 1:
.IP
.B theseus
\-A cytc.aln \-M cytc.filemap d1cih__.pdb d1csu__.pdb d1kyow_.pdb
.IP
In the above example, 
d1cih__.pdb, d1csu__.pdb, and d1kyow_.pdb
are pdb files of cytochrome c domains from the SCOP database.
.IP
Example 2:
.IP
.B theseus_align
\-f d1cih__.pdb d1csu__.pdb d1kyow_.pdb
.IP
In this example, the
.B theseus_align
script is called to do the hard work for you.
It will calculate a sequence alignment and then superpose based on that
alignment.
The script 
.B theseus_align
takes the same options as the 
.B theseus
program.
Note, the first few lines of this script must be modified for your system, since
it calls an external multiple sequence alignment program to do the alignment.
See the
.B examples/
directory for more details, including example files.
.\" ----------------------------------------------------------------------------
.SH OPTIONS
.\" -------
.\"
.SS Algorithmic options, defaults in {brackets}:

.TP
.B "\-\-amber"
Do special processing for AMBER8 formatted PDB files
.IP
Most users will never need this long option; it is only useful when processing
MD traces from AMBER.
AMBER puts the atom names in the wrong column in the PDB file.

.TP
.PD
.BI "\-a [" selection ]
Atoms to include in the superposition.
This option takes two types of arguments, either (1) a number specifying a
preselected set of atom types, or (2) an explicit PDB-style, colon-delimited list
of the atoms to include.
.IP
For the preselected atom type subsets, the following integer options are 
available:
.IP
.PD 0
.RS 8
.IP \(bu 2
0, alpha carbons for proteins, C1\' atoms for nucleic acids
.IP \(bu 2
1, backbone
.IP \(bu 2
2, all
.IP \(bu 2
3, alpha and beta carbons
.IP \(bu 2
4, all heavy atoms (no hydrogens)
.RE
.PD
.IP
Note, only the
.B "\-a0"
option is available when superpositioning structures with
different sequences.
.IP
To select an explicit set of atom types, the atom types must be specified
exactly as given in the PDB file field, including spaces, and the atom types
must be enclosed in quotation marks.
Multiple atom types must be delimited by a colon.
For example,
.IP
.B "\-a \(aq N  : CA : C  : O  \(aq"
.IP
would specify the atom types in the peptide backbone. 
.\"
.\".TP
.\".B "\-c"
.\"Use ML atomic covariance weighting (fit correlations, much slower)
.\".IP
.\"Unless you have many different structures with few residues, fitting the
.\"correlation matrix is likely unwarranted statistically due to a plethora of
.\"parameters and a paucity of data.

.TP
.BI "\-f"
Only read the first model of a multi-model PDB file

.TP
.BI "\-i [" nnn ]
Maximum iterations, {200}

.TP
.BI "\-p [" precision ]
Requested relative precision for convergence, {1e\-7}

.TP
.BI "\-r [" "root name" ]
Root name to be used in naming the output files, {theseus}

.TP
.BI "\-s [" n\-n:... ]
Residue selection (e.g. \-s15\-45:50\-55), {all}

.TP
.BI "\-S [" n\-n:... ]
Residues to exclude (e.g. \-S15\-45:50\-55), {none}
.IP
The previous two options have the same format. Residue (or alignment column)
ranges are indicated by beginning and end separated by a dash.
Multiple ranges, in any arbitrary order, are separated by a colon.
Chains may also be selected by giving the chain ID immediately preceding the
residue range.
For example,
.B \-sA1\-20:A40\-71
will only include residues 1 through 20
and 40 through 71 in chain A. Chains cannot be specified when superposing
structures with different sequences.
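.IP
As an illustration of the range syntax only, the following Python sketch
parses a selection string into chain and residue bounds.
The names parse_selection and is_selected are hypothetical helpers, not part of
.B theseus,
and the sketch assumes single-letter chain IDs and non-negative residue numbers.
.IP
.nf
```python
import re

# One range: optional chain letter, first residue, dash, last residue.
RANGE = re.compile("([A-Za-z]?)([0-9]+)-([0-9]+)")

def parse_selection(text):
    """Return a list of (chain, first, last); chain is "" when absent."""
    ranges = []
    for part in text.split(":"):
        match = RANGE.fullmatch(part)
        if match is None:
            raise ValueError("bad residue range: " + part)
        chain, first, last = match.groups()
        ranges.append((chain, int(first), int(last)))
    return ranges

def is_selected(ranges, chain, resnum):
    """True if the residue lies in any range; an empty chain matches all."""
    return any(c in ("", chain) and lo <= resnum <= hi for c, lo, hi in ranges)

ranges = parse_selection("A1-20:A40-71")
print(ranges)
print(is_selected(ranges, "A", 45))
```
.fi
.IP
Both endpoints of a range are treated as inclusive here, matching the
``1 through 20'' wording above.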

.TP
.B "\-v"
Use ML variance weighting (no correlations) {default}

.PD
.SS Input/output options:

.TP
.BI "\-A [" "sequence alignment file" ]
Sequence alignment file to use as a guide (CLUSTAL or A2M format)
.IP
For use when superposing structures with different sequences.
See ``FILE FORMATS'' below.

.TP
.B "\-E"
Print expert options

.TP
.B "\-F"
Print FASTA files of the sequences in PDB files and quit
.IP
A useful option when superposing structures with different sequences.
The files output with this option can be aligned with a multiple sequence
alignment program such as CLUSTAL or MUSCLE, and the resulting output
alignment file used as
.B theseus
input with the
.B "\-A"
option.

.TP
.B "\-h"
Help/usage

.TP
.B "\-I"
Just calculate statistics for input file; don't superpose

.TP
.BI "\-M [" mapfile ]
File that maps PDB files to sequences in the alignment.
.IP
A simple two-column formatted file; see ``FILE FORMATS'' below. Used with mode 2.

.TP
.B "\-n"
Don't write transformed pdb file

.TP
.BI "\-o [" "reference structure" ]
Reference file to superpose on; all rotations are relative to the first
model in this file
.IP
For example, 'theseus \-o cytc1.pdb cytc1.pdb cytc2.pdb cytc3.pdb' will
superpose the structures and rotate the entire final superposition so that
the structure from cytc1.pdb is in the same orientation as the structure in the
original cytc1.pdb PDB file.

.\".TP
.\".B "\-O"
.\"Olve's segID file
.\".IP
.\"Useful output when superposing structures with different sequences (mode 2).
.\"In 'theseus_sup.pdb', the main output superposition PDB file, the segID field
.\"now holds the number of the sequence alignment column that it belongs to. 
.\"This number, divided by 100, is also echoed in the B-factor field.
.\"When using
.\".B O
.\"(or any other capable molecular visualization program), one can then color by
.\"B-factor ranges and immediately see in the superposition which regions of
.\"the structure are aligned in the sequence alignment file.
.\"An additional file is also output, called 'theseus_olve.pdb' which only contains
.\"the very atoms that were included in the ML superposition calculation.
.\"That is, it will only contain alpha carbons or phosphorous atoms, and it will
.\"only contain atoms from the columns selected with the
.\".B "\-s"
.\"or
.\".B
.\""\-S"
.\"options.
.\"Requested by Olve Peersen of Colorado State University.
.\"
.TP
.B "\-V"
Version

.PD
.SS Principal components analysis:

.TP
.B "\-C"
Use covariance matrix for PCA (correlation matrix is default)

.TP
.BI "\-P [" nnn ]
Number of principal components to calculate {0}

.IP
In both of the above, the corresponding principal component is written in the
B-factor field of the output PDB file. Usually only the first few PCs are of
any interest (maybe up to six).
.PD

.\" ----------------------------------------------------------------------------
.SH EXAMPLES
.\" --------
.\"
.B theseus
.I 2sdf.pdb

.P
.B theseus
\-l \-r new2sdf
.I 2sdf.pdb

.P
.B theseus
\-s15\-45 \-P3
.I 2sdf.pdb

.P
.B theseus
\-A
.I cytc.aln
\-M
.I cytc.mapfile
\-o
.I cytc1.pdb
\-s1\-40
.I cytc1.pdb cytc2.pdb cytc3.pdb cytc4.pdb
.\" ----------------------------------------------------------------------------
.SH ENVIRONMENT
.\" -----------
.\"
You can set the environment variable 'PDBDIR' to your PDB file directory and
.B theseus
will look there after the present working directory.
For example, in the C shell (tcsh or csh), you can put something akin to this
in your .cshrc file:

setenv PDBDIR '/usr/share/pdbs/'
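.P
The lookup order described above might be sketched in Python as follows.
The function find_pdb is a hypothetical helper written for illustration; how
.B theseus
itself resolves paths is not specified beyond the order given here.
.P
.nf
```python
import os

def find_pdb(name, environ=os.environ):
    """Look in the working directory first, then in the PDBDIR directory."""
    candidates = [name]
    pdbdir = environ.get("PDBDIR")
    if pdbdir:
        candidates.append(os.path.join(pdbdir, name))
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None  # not found in either place

print(find_pdb("1s40.pdb"))
```
.fi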

.\" ----------------------------------------------------------------------------
.SH FILE FORMATS
.\" ------------
.\"

.P
.B Theseus
will read standard PDB formatted files (see <http://www.rcsb.org/pdb/>).
Every effort has been made for the program to accept nonstandard CNS and
X-PLOR file formats also.
.P
Two other files deserve mention, a sequence alignment file and a mapfile.

.SS Sequence alignment file

When superposing structures with different residue identities (where the
lengths of the macromolecules in terms of residues are not necessarily
equal), a sequence alignment file must be included for
.B theseus
to use as a guide (specified by the
.B "\-A"
option).
.B Theseus
accepts both CLUSTAL and A2M (FASTA) formatted multiple sequence alignment
files.

.PD
.P
NOTE 1: The residue sequence in the alignment must match exactly the
residue sequence given in the coordinates of the PDB file. That is, there can
be no missing or extra residues that do not correspond to the sequence in the
PDB file. An easy way to ensure that your sequences exactly match the PDB
files is to generate the sequences using
.B theseus'
.B "\-F"
option, which writes out a FASTA formatted sequence file of the chain(s)
in the PDB files. The files output with this option can then be aligned with
a multiple sequence alignment program such as CLUSTAL or MUSCLE, and the
resulting output alignment file used as
.B theseus
input with the
.B "\-A"
option.

.PD
.P
NOTE 2: Every PDB file must have a corresponding sequence in the alignment.
However, not every sequence in the alignment needs to have a corresponding
PDB file. That is, there can be extra sequences in the alignment that are
not used for guiding the superposition.

.SS PDB \-> Sequence mapfile

If the names of the PDB files and the names of the corresponding sequences
in the alignment are identical, the mapfile may be omitted.  Otherwise,
.B Theseus
needs to know which sequences in the alignment file correspond to which
PDB structure files. This information is included in a mapfile with a very
simple format (specified with the
.B "\-M"
option). There are only two columns separated by whitespace: the first column
lists the names of the PDB structure files, while the second column lists the
corresponding sequence names exactly as given in the multiple sequence
alignment file. 
.P
An example of the mapfile:
.P
.PD 0
cytc1.pdb    seq1
.P
cytc2.pdb    seq2
.P
cytc3.pdb    seq3
.PD
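.P
The two-column layout can be read with a few lines of code.
The Python sketch below is only an illustration; parse_mapfile is a
hypothetical name, and skipping blank lines is an assumption rather than
documented
.B theseus
behavior.
.P
.nf
```python
def parse_mapfile(text):
    """Map PDB file names (column 1) to alignment sequence names (column 2)."""
    mapping = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # assumption: blank lines are ignored
        if len(fields) != 2:
            raise ValueError("line %d: expected two columns" % lineno)
        pdb_name, seq_name = fields
        mapping[pdb_name] = seq_name
    return mapping

example = """
cytc1.pdb    seq1
cytc2.pdb    seq2
cytc3.pdb    seq3
"""
print(parse_mapfile(example))
```
.fi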

.SH SCREEN OUTPUT

Theseus provides output describing both the progress of the superposing
and several statistics for the final result:

.TP
.B Classical LS pairwise <RMSD>:
The conventional RMSD for the superposition, the average RMSD for all
pairwise combinations of structures in the ensemble.
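.IP
The averaging over pairs can be sketched in Python as below.
This is an illustration under the assumption that the structures are already
superposed and have identical atom counts; the function names are hypothetical
and no superposition is performed.
.IP
.nf
```python
import itertools
import math

def rmsd(a, b):
    """RMSD between two equal-length lists of (x, y, z) atom coordinates."""
    squared = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(squared / len(a))

def mean_pairwise_rmsd(structures):
    """Average RMSD over all unordered pairs of structures."""
    pairs = list(itertools.combinations(structures, 2))
    return sum(rmsd(a, b) for a, b in pairs) / len(pairs)

# Three made-up, already superposed one-atom structures.
ensemble = [[(0.0, 0.0, 0.0)], [(3.0, 4.0, 0.0)], [(0.0, 0.0, 0.0)]]
print(mean_pairwise_rmsd(ensemble))
```
.fi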

.TP
.B Least-squares <sigma>:
The standard deviation for the superposition, based on the conventional
assumption of no correlation and equal variances. Basically equal to the
RMSD from the average structure.

.TP
.B Maximum Likelihood <sigma>:
The ML analog of the standard deviation for the superposition. When assuming
that the correlations are zero (a diagonal covariance matrix), this is equal
to the square root of the harmonic average of the variances for each atom. In
contrast, the ``Least-squares <sigma>'' given above reports the square root of
the arithmetic average of the variances.  The harmonic average is never greater
than the arithmetic average, and the harmonic average downweights large
values in proportion to their magnitude. This makes sense statistically,
because when combining values one should weight them by the reciprocal of
their variance (which is in fact what the ML superposing method does).
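.IP
The difference between the two averages is easy to see numerically.
The Python sketch below uses made-up per-atom variances and hypothetical
function names; it illustrates the two definitions given above for the
diagonal (no correlation) case and is not taken from the
.B theseus
source.
.IP
.nf
```python
import math

def ls_sigma(variances):
    """Square root of the arithmetic mean of the per-atom variances."""
    return math.sqrt(sum(variances) / len(variances))

def ml_sigma(variances):
    """Square root of the harmonic mean of the per-atom variances."""
    return math.sqrt(len(variances) / sum(1.0 / v for v in variances))

# Made-up values: three well-ordered atoms and one very mobile atom.
variances = [0.1, 0.2, 0.4, 25.0]
print(ls_sigma(variances), ml_sigma(variances))
```
.fi
.IP
The single large variance dominates the arithmetic mean but has little
effect on the harmonic mean.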

.TP
.B Marginal Log Likelihood:
The final marginal log likelihood of the superposition, assuming the matrix
Gaussian distribution of the structures and the hierarchical inverse gamma
distribution of the eigenvalues of the covariance matrix.
The marginal log likelihood is the likelihood with the covariance matrix
integrated out.

.TP
.B AIC:
The Akaike Information Criterion for the final superposition. This is an
important statistic in likelihood analysis and model selection theory. It
allows an objective comparison of multiple theoretical models with different
numbers of parameters. In this case, the higher the number the better. There
is a tradeoff between fit to the data and the number of parameters being fit.
Increasing the number of parameters in a model will always give a better fit
to the data, but it also increases the uncertainty of the estimated values.
The AIC criterion finds the best combination by (1) maximizing the fit to the
data while (2) minimizing the uncertainty due to the number of parameters. In
the superposition case, one can compare the least squares superposition to
the maximum likelihood superposition. The method (or model) with the higher
AIC is preferred. A difference in the AIC of 2 or more is considered strong
statistical evidence for the better model. 
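.IP
The tradeoff can be illustrated with a small Python sketch.
The conventional AIC is 2k - 2 ln L, where smaller is better; because higher
is better here, the sketch assumes a sign-flipped, rescaled form, ln L - k.
That form, the function name, and the numbers are assumptions for
illustration; the exact formula and scaling used by
.B theseus
are not stated here.
.IP
.nf
```python
def aic_score(log_likelihood, n_params):
    """Penalized log likelihood on a larger-is-better scale (an assumption)."""
    return log_likelihood - n_params

# Made-up numbers: the richer model fits better but pays for 50 more parameters.
simple = aic_score(log_likelihood=-1200.0, n_params=100)
richer = aic_score(log_likelihood=-1100.0, n_params=150)
print(simple, richer)
```
.fi
.IP
In this made-up case the gain in log likelihood outweighs the extra
parameters, so the richer model would be preferred.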

.TP
.B "BIC:"
The Bayesian Information Criterion. Similar to the AIC, but with a Bayesian
emphasis.

.TP
.B Omnibus chi\*{2\*}:
The overall reduced chi\*{2\*} statistic for the entire fit, including the
rotations, translations, covariances, and the inverse gamma parameters. This
is probably the most important statistic for the superposition. In some
cases, the inverse gamma fit may be poor, yet the overall fit is still very
good. Ideally it should be close to 1.0, which would indicate a
perfect fit. However, if you think it is too large, make sure to compare it
to the chi\*{2\*} for the least-squares fit; it's probably not that bad after all.
A large chi\*{2\*} often indicates a violation of the assumptions of the model.
The most common violation is when superposing two or more independent
domains that can rotate relative to each other. If this is the case, then
there will likely be not just one Gaussian distribution, but several mixed
Gaussians, one for each domain.  Then, it would be better to superpose
each domain independently.

.TP
.B Hierarchical var (alpha, gamma) chi\*{2\*}:
The reduced chi\*{2\*} for the inverse gamma fit of the covariance matrix
eigenvalues. As before, it should ideally be close to 1.0.  The two values in
the parentheses are the ML estimates of the scale and shape parameters,
respectively, for the inverse gamma distribution.

.TP
.B Rotational, translational, covar chi\*{2\*}:
The reduced chi\*{2\*} statistic for the fit of the structures to the model.
With a good fit it should be close to 1.0, which indicates a perfect fit of
the data to the statistical model.  In the case of least-squares, the assumed
model is a matrix Gaussian distribution of the structures with equal
variances and no correlations.  For the ML fits, the assumed model is unequal 
variances and no correlations, as calculated with the
.B "\-v"
option [default].  
This statistic is for the superposition only, and does
not include the fit of the covariance matrix eigenvalues to an inverse gamma
distribution.  See ``Omnibus chi\*{2\*}'' above.

.TP
.B Hierarchical minimum var:
The hierarchical fit of the inverse gamma distribution constrains the
variances of the atoms by making large ones smaller and small ones larger.
This statistic reports the minimum possible variance given the inferred
inverse gamma parameters.

.TP
.B skewness, skewness Z-value, kurtosis & kurtosis Z-value:
The skewness and kurtosis of the residuals. Both should be 0.0 if the
residuals fit a Gaussian distribution perfectly.  They are followed by the
P-value for the statistics. This is a very stringent test; residuals can be
very non-Gaussian and yet the estimated rotations, translations, and
covariance matrix may still be rather accurate. 
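.IP
For reference, both quantities can be computed from central moments, as in
the Python sketch below.
It assumes the simple moment estimators and reports excess kurtosis (kurtosis
minus 3), which is the form that equals 0.0 for a Gaussian; the estimators
actually used by
.B theseus
are not stated here, and the function name is hypothetical.
.IP
.nf
```python
def moments(residuals):
    """Return (skewness, excess kurtosis) from central moments."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

# Made-up, symmetric residuals: skewness is zero, tails are light.
skew, kurt = moments([-2.0, -1.0, 0.0, 1.0, 2.0])
print(skew, kurt)
```
.fi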

.TP
.B Data pts, Free params, D/P:
The total number of data points given all observed structures, the number of
parameters being fit in the model, and the data-to-parameter ratio.

.TP
.B Median structure:
The structure that is overall most similar to the average structure. This can
be considered to be the most ``typical'' structure in the ensemble.

.TP
.B Total rounds:
The number of iterations that the algorithm took to converge.

.TP
.B Fractional precision:
The actual precision that the algorithm converged to.

.SH OUTPUT FILES
Theseus writes out the following files:

.TP
.B "theseus_sup.pdb"
The final superposition, rotated to the principal axes of the mean structure.

.TP
.B "theseus_ave.pdb"
The estimate of the mean structure.

.TP
.B theseus_residuals.txt
The normalized residuals of the superposition. These can be analyzed for
deviations from normality (whether they fit a standard Gaussian
distribution). E.g., the chi\*{2\*}, skewness, and kurtosis statistics are based
on these values.

.TP
.B theseus_transf.txt
The final transformation rotation matrices and translation vectors.

.TP
.B theseus_variances.txt
The vector of estimated variances for each atom.

.PD
.P
When Principal Components are calculated (with the
.B "\-P"
option), the following
files are also produced:

.TP
.B theseus_pcvecs.txt
The principal component vectors.

.TP
.B theseus_pcstats.txt
Simple statistics for each principal component
(loadings, variance explained, etc.).

.TP
.B theseus_pcN_ave.pdb
The average structure with the Nth principal
component written in the temperature factor field.

.TP
.B theseus_pcN.pdb
The final superposition with the Nth principal
component written in the temperature factor field.
This file is omitted when superposing molecules
with different residue sequences (mode 2).

.TP
.B theseus_cor.mat, theseus_cov.mat
The atomic correlation and covariance matrices, based on the final
superposition. The format is suitable for input to GNU's
.B octave.
These are the matrices used in the Principal Components Analysis.

.\" ----------------------------------------------------------------------------
.SH BUGS
.\" ----
.\"
Please send me (DLT) reports of all problems.

.\" ----------------------------------------------------------------------------
.SH RESTRICTIONS
.\" ------------
.\"
.B Theseus
is
.I not
a structural alignment program.
The structure-based alignment problem is completely different from the
structural superposition problem.
In order to do a structural superposition, there must be a 1-to-1 mapping that
associates the atoms in one structure with the atoms in the other structures.
In the simplest case, this means that structures must have equivalent numbers of
atoms, such as the models in an NMR PDB file.
For structures with different numbers of residues/atoms, superposing is
only possible when the sequences have been aligned previously.
Finding the best sequence alignment based on only structural information is
a difficult problem, and one for which there is currently no maximum likelihood
approach.
Extending
.B theseus
to address the structural alignment problem is an ongoing research project.

.\" ----------------------------------------------------------------------------
.SH AUTHOR
.\" ------
.\"
Douglas L. Theobald
.br
dtheobald@brandeis.edu

.\" ----------------------------------------------------------------------------
.SH CITATION
.\" -----
.\"
When using
.B theseus
in publications please cite:

.P
Douglas L. Theobald and Phillip A. Steindel (2012)
.br
``Optimal simultaneous superpositioning of multiple structures with missing data.''
.br
Bioinformatics 28(15):1972-1979

The following papers also report 
.B theseus 
developments:

.P
Douglas L. Theobald and Deborah S. Wuttke (2008)
.br
``Accurate structural correlations from maximum likelihood superpositions.''
.br
PLoS Computational Biology 4(2):e43

.P
Douglas L. Theobald and Deborah S. Wuttke (2006)
.br
``THESEUS: Maximum likelihood superpositioning and analysis of macromolecular
structures.''
.br
Bioinformatics 22(17):2171-2172

.P
Douglas L. Theobald and Deborah S. Wuttke (2006)
.br
``Empirical Bayes models for regularizing maximum likelihood estimation in the 
matrix Gaussian Procrustes problem.''
.br
PNAS 103(49):18521-18527

.\" ---------------------------------------------------------------------------
.SH HISTORY
.\" -------
.\"
Long, tedious, and sordid.