THESEUS(1)		 Likelihood (and Bayes) Rocks		    THESEUS(1)



NAME
       theseus - Maximum likelihood, multiple simultaneous superpositions with
       statistical analysis

SYNOPSIS
       theseus [options] pdbfile1 [pdbfile2 ...]

       and

       theseus_align [options] -f pdbfile1 [pdbfile2 ...]


DESCRIPTION
       Theseus superposes a set of  macromolecular  structures	simultaneously
       using  the  method  of maximum likelihood (ML), rather than the conven-
       tional least-squares criterion.	Theseus assumes  that  the  structures
       are  distributed  according  to a matrix Gaussian distribution and that
       the eigenvalues of the atomic covariance matrix are hierarchically dis-
       tributed  according to an inverse gamma distribution.  This ML superpo-
       sitioning model produces much  more  accurate  results  by  essentially
       downweighting  variable regions of the structures and by correcting for
       correlations among atoms.

       Theseus operates in two main modes: (1) a mode for superimposing struc-
       tures  with identical sequences and (2) a mode for structures with dif-
       ferent sequences but similar structures:

	      (1) A mode for superpositioning  macromolecules  with  identical
	      sequences and numbers of residues, for instance, multiple models
	      in an NMR family or multiple structures from  different  crystal
	      forms of the same protein.

	      In this mode, Theseus will read every model in every file on the
	      command line and superpose them.

	      Example:

	      theseus 1s40.pdb

	      In the above example, 1s40.pdb is a pdb file of 10 NMR models.

	      (2) An ``alignment'' mode for superpositioning  structures  with
	      different  sequences,  for  example,  multiple structures of the
	      cytochrome c protein from different species or multiple  mutated
	      structures of hen egg white lysozyme.

	      This  mode requires the user to supply a sequence alignment file
	      of the structures  being	superpositioned  (see  option  -A  and
	      ``FILE  FORMATS''  below).  Additionally, it may be necessary to
	      supply a mapfile that tells theseus which  PDB  structure  files
	      correspond  to  which  sequences in the alignment (see option -M
	      and ``FILE FORMATS'' below).  The mapfile is unnecessary if  the
	      sequence	names  and  corresponding pdb filenames are identical.
	      In this mode, if there are multiple structural models in	a  PDB
	      file,  theseus  only  reads  the first model in each file on the
	      command line. In other words, theseus treats the	files  on  the
	      command line as if there were only one structure per file.

	      Example 1:

	      theseus  -A  cytc.aln  -M  cytc.filemap  d1cih__.pdb d1csu__.pdb
	      d1kyow_.pdb

	      In the above example, d1cih__.pdb, d1csu__.pdb, and  d1kyow_.pdb
	      are pdb files of cytochrome c domains from the SCOP database.

	      Example 2:

	      theseus_align -f d1cih__.pdb d1csu__.pdb d1kyow_.pdb

	      In  this	example,  the theseus_align script is called to do the
	      hard work for you.  It will calculate a sequence	alignment  and
	      then  superpose  based  on  that	alignment.   The  script  the-
	      seus_align takes the same options as the theseus program.  Note,
	      the  first  few  lines  of this script must be modified for your
	      system, since it calls an external multiple  sequence  alignment
	      program  to  do  the alignment.  See the examples/ directory for
	      more details, including example files.

OPTIONS
   Algorithmic options, defaults in {brackets}:
       --amber
	      Do special processing for AMBER8 formatted PDB files

              Most users will never need this long option; it is only useful
              when processing MD traces from AMBER, which puts the atom names
              in the wrong column of the PDB file.


       -a [selection]
	      Atoms to include in the superposition.  This  option  takes  two
	      types of arguments, either (1) a number specifying a preselected
              set of atom types, or (2) an explicit PDB-style, colon-delimited
	      list of the atoms to include.

	      For  the	preselected  atom  type subsets, the following integer
	      options are available:

	       o 0, alpha carbons for proteins, C1' atoms for nucleic acids
	       o 1, backbone
	       o 2, all
	       o 3, alpha and beta carbons
	       o 4, all heavy atoms (no hydrogens)

	      Note, only the -a0 option  is  available	when  superpositioning
	      structures with different sequences.

	      To  custom  select an explicit set of atom types, the atom types
	      must be specified exactly  as  given  in	the  PDB  file	field,
              including spaces, and the atom types must be enclosed in quotation
              marks.  Multiple atom types must be delimited by a colon.
	      For example,

              -a ' N  : CA : C  : O  '

	      would specify the atom types in the peptide backbone.
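
              As a minimal sketch of how such an argument string could be
              assembled, the Python helper below pads each atom name to the
              four-character PDB atom-name field and joins the results with
              colons.  The function names are hypothetical, and the padding
              rule (short names start in the second column of the field, as
              for one-letter elements such as C, N, and O) is an assumption
              of this sketch, not something theseus provides.

```python
def pdb_atom_field(name):
    """Pad an atom name to the 4-character PDB atom-name field.

    Assumption: names shorter than 4 characters start in the second
    column of the field (true for one-letter elements such as C, N, O).
    """
    if len(name) >= 4:
        return name[:4]
    return (" " + name).ljust(4)


def atom_selection(names):
    """Join the padded atom names with colons for use with -a."""
    return ":".join(pdb_atom_field(n) for n in names)


# Peptide backbone atoms, as in the example above.
selection = atom_selection(["N", "CA", "C", "O"])
print(repr(selection))
```

              The resulting string still has to be quoted on the command line
              so that the shell preserves the spaces.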


       -f     Only read the first model of a multi-model PDB file


       -h     Help/usage


       -i [nnn]
	      Maximum iterations, {200}


       -p [precision]
	      Requested relative precision for convergence, {1e-7}


       -r [root name]
	      Root name to be used in naming the output files, {theseus}


       -s [n-n:...]
	      Residue selection (e.g. -s15-45:50-55), {all}


       -S [n-n:...]
	      Residues to exclude (e.g. -S15-45:50-55) {none}

              The previous two options have the same format.  Residue (or
              alignment column) ranges are given as a beginning and an end
              separated by a dash.  Multiple ranges, in any order, are
              separated by a colon.  Chains may also be selected by giving the
              chain ID immediately before the residue range.  For example,
              -sA1-20:A40-71 will include only residues 1 through 20 and 40
              through 71 in chain A.  Chains cannot be specified when
              superposing structures with different sequences.
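
              The range format can be illustrated with a small Python sketch.
              It is one reading of the format described here, not the parser
              theseus itself uses; the function names are hypothetical, and
              the sketch assumes that a chain ID is a single letter and that
              residue numbers are non-negative.

```python
import re

# Optional one-letter chain ID, then "first-last".
_RANGE = re.compile(r"^([A-Za-z]?)(\d+)-(\d+)$")


def parse_selection(spec):
    """Parse e.g. 'A1-20:A40-71' into [(chain, first, last), ...]."""
    ranges = []
    for part in spec.split(":"):
        match = _RANGE.match(part)
        if match is None:
            raise ValueError(f"malformed range: {part!r}")
        chain = match.group(1) or None
        first, last = int(match.group(2)), int(match.group(3))
        if first > last:
            raise ValueError(f"range runs backwards: {part!r}")
        ranges.append((chain, first, last))
    return ranges


def is_selected(ranges, chain, resnum):
    """True if the residue falls inside any range (bounds inclusive)."""
    return any(
        (c is None or c == chain) and first <= resnum <= last
        for c, first, last in ranges
    )


ranges = parse_selection("A1-20:A40-71")
print(ranges)
```

              Both ends of each range are treated as inclusive, matching the
              ``1 through 20'' wording above.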


       -v     use ML variance weighting (no correlations) {default}


   Input/output options:
       -A [sequence alignment file]
	      Sequence alignment file to use as a guide (CLUSTAL or  A2M  for-
	      mat)

	      For  use	when  superposing structures with different sequences.
	      See ``FILE FORMATS'' below.


       -E     Print expert options


       -F     Print FASTA files of the sequences in PDB files and quit

	      A useful	option	when  superposing  structures  with  different
	      sequences.   The	files  output  with this option can be aligned
	      with a multiple sequence alignment program such  as  CLUSTAL  or
	      MUSCLE,  and the resulting output alignment file used as theseus
	      input with the -A option.


       -h     Help/usage


       -I     Just calculate statistics for input file; don't superpose


       -M [mapfile]
	      File that maps PDB files to sequences in the alignment.

	      A simple two-column formatted file; see ``FILE FORMATS''	below.
	      Used with mode 2.


       -n     Don't write transformed pdb file


       -o [reference structure]
	      Reference  file  to  superpose on, all rotations are relative to
	      the first model in this file

	      For  example,  'theseus	-o   cytc1.pdb	 cytc1.pdb   cytc2.pdb
	      cytc3.pdb'  will	superpose the structures and rotate the entire
	      final superposition so that the structure from cytc1.pdb	is  in
	      the  same orientation as the structure in the original cytc1.pdb
	      PDB file.


       -V     Version


   Principal components analysis:
       -C     Use covariance matrix for PCA (correlation matrix is default)


       -P [nnn]
	      Number of principal components to calculate {0}


	      In both of the above, the corresponding principal  component  is
	      written  in  the	B-factor field of the output PDB file. Usually
	      only the first few PCs are of any interest (maybe up to six).
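
              The two matrices selected between by -C are related by a
              standard rescaling: each covariance is divided by the product of
              the two corresponding standard deviations.  The Python sketch
              below shows that relation on a toy matrix; it is a generic
              illustration with hypothetical names and values, not the code
              theseus uses.

```python
import math


def correlation_from_covariance(cov):
    """Rescale a covariance matrix (list of lists) to a correlation matrix."""
    n = len(cov)
    sd = [math.sqrt(cov[i][i]) for i in range(n)]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(n)] for i in range(n)]


# Toy 2x2 covariance matrix: variances 4 and 1, covariance 1.
cov = [[4.0, 1.0], [1.0, 1.0]]
cor = correlation_from_covariance(cov)
print(cor)
```

              Because the correlation matrix has a unit diagonal, PCA on it
              gives every atom equal weight regardless of its variance, while
              PCA on the covariance matrix is dominated by the most variable
              atoms.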

EXAMPLES
       theseus 2sdf.pdb


       theseus -l -r new2sdf 2sdf.pdb


       theseus -s15-45 -P3 2sdf.pdb


       theseus -A cytc.aln  -M	cytc.mapfile  -o  cytc1.pdb  -s1-40  cytc1.pdb
       cytc2.pdb cytc3.pdb cytc4.pdb

ENVIRONMENT
       You  can  set the environment variable 'PDBDIR' to your PDB file direc-
       tory and theseus will look there after the present  working  directory.
       For  example,  in the C shell (tcsh or csh), you can put something akin
       to this in your .cshrc file:

       setenv PDBDIR '/usr/share/pdbs/'
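
       The lookup order described above (present working directory first,
       then PDBDIR) can be sketched in Python as follows.  The function name
       is hypothetical and the sketch only mirrors the documented order; it
       makes no claim about how theseus implements the search.

```python
import os


def find_pdb_file(name, environ=os.environ):
    """Return the path of `name`, or None if it is not found.

    Checks the present working directory first, then $PDBDIR if set.
    """
    candidates = [os.path.join(os.getcwd(), name)]
    pdbdir = environ.get("PDBDIR")
    if pdbdir:
        candidates.append(os.path.join(pdbdir, name))
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None
```

       Passing a plain dictionary as environ makes the function easy to try
       without changing the real environment.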


FILE FORMATS
       Theseus	 will	 read	 standard    PDB    formatted	 files	  (see
       <http://www.rcsb.org/pdb/>).   Every  effort has been made for the pro-
       gram to accept nonstandard CNS and X-PLOR file formats also.

       Two other files deserve mention, a sequence alignment file and  a  map-
       file.


   Sequence alignment file
       When superposing structures with different residue identities (where
       the lengths of the macromolecules in terms of residues are not
       necessarily equal), a sequence alignment file must be included for
       theseus to use as a guide (specified by the -A option).  Theseus
       accepts both CLUSTAL and A2M (FASTA) formatted multiple sequence
       alignment files.


       NOTE 1: The residue sequence in the alignment must  match  exactly  the
       residue	sequence  given  in  the coordinates of the PDB file. That is,
       there can be no missing or extra residues that do not correspond to the
       sequence  in  the  PDB  file. An easy way to ensure that your sequences
       exactly match the PDB files is to generate the sequences using theseus'
       -F  option,  which  writes  out	a FASTA formatted sequence file of the
       chain(s) in the PDB files. The files output with this option  can  then
       be  aligned  with a multiple sequence alignment program such as CLUSTAL
       or MUSCLE, and the resulting output  alignment  file  used  as  theseus
       input with the -A option.
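
       To check the requirement in NOTE 1 by hand, one could compare the
       residues found in the PDB coordinates with the ungapped alignment row.
       The Python sketch below does this using the fixed columns of standard
       PDB ATOM records.  The function names are hypothetical, and several
       choices are assumptions of the sketch rather than theseus behavior:
       only alpha carbons of the first model are read, unknown residues map
       to X, and '-' and '.' are treated as gap characters.  The -F option
       remains the recommended way to obtain matching sequences.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def pdb_sequence(pdb_text):
    """One-letter sequence from the CA atoms of the first model."""
    seq = []
    seen = set()
    for line in pdb_text.splitlines():
        if line.startswith("ENDMDL"):
            break  # stop after the first model
        if not line.startswith("ATOM") or len(line) < 27:
            continue
        if line[12:16] != " CA ":
            continue
        # Chain ID, residue number, and insertion code identify a residue.
        key = (line[21], line[22:27])
        if key in seen:
            continue  # skip alternate locations of the same residue
        seen.add(key)
        seq.append(THREE_TO_ONE.get(line[17:20], "X"))
    return "".join(seq)


def matches_alignment_row(pdb_text, aligned_row):
    """True if the ungapped alignment row equals the PDB sequence."""
    ungapped = aligned_row.replace("-", "").replace(".", "").upper()
    return pdb_sequence(pdb_text) == ungapped
```

       A mismatch reported by such a check usually points to residues that
       are present in the alignment but missing from the coordinates, or the
       reverse.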


       NOTE 2: Every PDB file must have a corresponding sequence in the align-
       ment.  However, not every sequence in the alignment  needs  to  have  a
       corresponding  PDB  file.  That is, there can be extra sequences in the
       alignment that are not used for guiding the superposition.


   PDB -> Sequence mapfile
       If the names of the PDB	files  and  the  names	of  the  corresponding
       sequences in the alignment are identical, the mapfile may be omitted.
       Otherwise, Theseus needs to know which sequences in the alignment  file
       correspond  to  which PDB structure files. This information is included
       in a mapfile with a very simple format (specified with the -M  option).
       There  are  only  two columns separated by whitespace: the first column
       lists the names of the PDB structure files,  while  the	second	column
       lists the corresponding sequence names exactly as given in the multiple
       sequence alignment file.

       An example of the mapfile:

       cytc1.pdb    seq1
       cytc2.pdb    seq2
       cytc3.pdb    seq3
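
       A mapfile in this two-column format is easy to read and sanity-check
       before running theseus.  The Python sketch below is an illustration
       under stated assumptions: the function names are hypothetical, and
       skipping blank lines is a choice of the sketch, not documented
       theseus behavior.

```python
def parse_mapfile(text):
    """Return {pdb_filename: sequence_name} from two-column text."""
    mapping = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # assumption: blank lines are ignored
        if len(fields) != 2:
            raise ValueError(f"line {lineno}: expected 2 columns")
        pdb_name, seq_name = fields
        mapping[pdb_name] = seq_name
    return mapping


def unmapped_files(mapping, alignment_names):
    """PDB files whose sequence name is absent from the alignment."""
    return sorted(p for p, s in mapping.items() if s not in alignment_names)


example = """cytc1.pdb    seq1
cytc2.pdb    seq2
cytc3.pdb    seq3
"""
mapping = parse_mapfile(example)
print(mapping)
```

       Calling unmapped_files with the set of sequence names taken from the
       alignment lists any PDB file that would have no matching sequence.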


SCREEN OUTPUT
       Theseus provides output describing both the progress of the superposing
       and several statistics for the final result:


       Classical LS pairwise <RMSD>:
	      The  conventional  RMSD  for the superposition, the average RMSD
	      for all pairwise combinations of structures in the ensemble.
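
              As a sketch of this statistic, the Python code below averages
              the RMSD over all pairs of structures.  It assumes the
              coordinates have already been superposed and that each
              structure is a list of (x, y, z) tuples in the same atom
              order; the function names and toy coordinates are
              hypothetical.

```python
import math
from itertools import combinations


def rmsd(a, b):
    """Root mean square deviation between two equal-length atom lists."""
    if len(a) != len(b):
        raise ValueError("structures must have the same number of atoms")
    total = sum(
        (p - q) ** 2
        for atom_a, atom_b in zip(a, b)
        for p, q in zip(atom_a, atom_b)
    )
    return math.sqrt(total / len(a))


def mean_pairwise_rmsd(structures):
    """Average RMSD over all pairwise combinations of structures."""
    pairs = list(combinations(structures, 2))
    return sum(rmsd(a, b) for a, b in pairs) / len(pairs)


s1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
s2 = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]  # s1 shifted by 1 along z
s3 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]  # identical to s1
average = mean_pairwise_rmsd([s1, s2, s3])
print(average)
```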


       Least-squares <sigma>:
	      The standard deviation for the superposition, based on the  con-
	      ventional  assumption  of  no  correlation  and equal variances.
	      Basically equal to the RMSD from the average structure.


       Maximum Likelihood <sigma>:
	      The ML analog of the standard deviation for  the	superposition.
	      When assuming that the correlations are zero (a diagonal covari-
	      ance matrix), this is equal to the square root of  the  harmonic
	      average  of  the	variances  for	each  atom.  In  contrast, the
	      ``Least-squares <sigma>'' given above reports the square root of
	      the  arithmetic  average of the variances.  The harmonic average
              is never greater than the arithmetic average, and the harmonic
	      average  downweights  large  values proportional to their magni-
	      tude. This makes sense  statistically,  because  when  combining
	      values  one  should weight them by the reciprocal of their vari-
	      ance (which is in fact what the ML superposing method does).
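
              The difference between the two <sigma> values can be seen in a
              short Python sketch that takes per-atom variances and returns
              the square root of their arithmetic and harmonic averages.  It
              covers only the diagonal (no correlations) case described
              above; the function names and the variances are hypothetical.

```python
import math


def ls_sigma(variances):
    """Square root of the arithmetic average of the variances."""
    return math.sqrt(sum(variances) / len(variances))


def ml_sigma(variances):
    """Square root of the harmonic average of the variances."""
    return math.sqrt(len(variances) / sum(1.0 / v for v in variances))


# Two well-ordered atoms and one highly variable atom.
variances = [0.1, 0.2, 5.0]
print(ls_sigma(variances), ml_sigma(variances))
```

              The single large variance dominates the arithmetic average but
              contributes little to the harmonic one, which is the
              downweighting described above.  The two values coincide only
              when all variances are equal.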


       Marginal Log Likelihood:
	      The final marginal log likelihood of the superposition, assuming
	      the matrix Gaussian distribution of the structures and the hier-
	      archical inverse gamma distribution of the  eigenvalues  of  the
	      covariance  matrix.   The marginal log likelihood is the likeli-
	      hood with the covariance matrix integrated out.


       AIC:   The Akaike Information Criterion for  the  final	superposition.
	      This  is an important statistic in likelihood analysis and model
	      selection theory. It allows an objective comparison of  multiple
	      theoretical models with different numbers of parameters. In this
	      case, the higher the number the  better.	There  is  a  tradeoff
	      between  fit to the data and the number of parameters being fit.
	      Increasing the number of parameters in a model will always  give
	      a  better fit to the data, but it also increases the uncertainty
	      of the estimated values.	The AIC criterion finds the best  com-
	      bination	by  (1) maximizing the fit to the data while (2) mini-
	      mizing the uncertainty due to the number of parameters.  In  the
	      superposition case, one can compare the least squares superposi-
	      tion to the maximum likelihood  superposition.  The  method  (or
	      model) with the higher AIC is preferred. A difference in the AIC
	      of 2 or more is considered strong statistical evidence  for  the
	      better model.
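
              The comparison rule stated here can be written as a small
              Python sketch.  It follows the sign convention of this page
              (higher AIC is better), which is the reverse of the common
              2k - 2 ln L form where lower is better.  The function name and
              the numbers are hypothetical.

```python
def compare_aic(aic_ls, aic_ml, threshold=2.0):
    """Return (preferred method, whether the evidence is strong).

    Uses the convention of this page: the higher AIC is preferred, and
    a difference of `threshold` or more counts as strong evidence.
    """
    diff = aic_ml - aic_ls
    if diff > 0:
        preferred = "ML"
    elif diff < 0:
        preferred = "LS"
    else:
        preferred = "tie"
    return preferred, abs(diff) >= threshold


# Hypothetical values read from two theseus runs on the same data.
result = compare_aic(aic_ls=-1200.0, aic_ml=-1100.0)
print(result)
```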


       BIC:   The Bayesian Information Criterion. Similar to the AIC, but with
	      a Bayesian emphasis.


       Omnibus chi2:
	      The overall reduced chi2 statistic for the entire fit, including
	      the  rotations, translations, covariances, and the inverse gamma
	      parameters. This is probably the most  important	statistic  for
	      the  superposition.  In some cases, the inverse gamma fit may be
              poor, yet the overall fit is still very good.  The omnibus chi2
              should ideally be close to 1.0, which would indicate a perfect fit.
	      However, if you think it is too large, make sure to  compare  it
	      to  the  chi2  for the least-squares fit; it's probably not that
	      bad after all.  A large chi2 often indicates a violation of  the
	      assumptions  of  the  model.   The most common violation is when
	      superposing two or more independent domains that can rotate rel-
	      ative to each other. If this is the case, then there will likely
	      be not just one Gaussian distribution, but several  mixed  Gaus-
	      sians,  one for each domain.  Then, it would be better to super-
	      pose each domain independently.


       Hierarchical var (alpha, gamma) chi2:
	      The reduced chi2 for the inverse gamma  fit  of  the  covariance
	      matrix  eigenvalues.  As	before,  it should ideally be close to
	      1.0.  The two values in the parentheses are the ML estimates  of
	      the  scale  and  shape parameters, respectively, for the inverse
              gamma distribution.


       Rotational, translational, covar chi2:
	      The reduced chi2 statistic for the fit of the structures to  the
	      model.   With  a good fit it should be close to 1.0, which indi-
	      cates a perfect fit of the data to the  statistical  model.   In
	      the  case  of least-squares, the assumed model is a matrix Gaus-
	      sian distribution of the structures with equal variances and  no
	      correlations.   For  the	ML  fits, the assumed model is unequal
	      variances and no correlations, as calculated with the -v	option
	      [default].   This  statistic  is for the superposition only, and
	      does not include the fit of the covariance matrix eigenvalues to
              an inverse gamma distribution.  See ``Omnibus chi2'' above.


       Hierarchical minimum var:
	      The  hierarchical  fit  of  the  inverse gamma distribution con-
	      strains the variances of the atoms by making large ones  smaller
	      and  small ones larger.  This statistic reports the minimum pos-
	      sible variance given the inferred inverse gamma parameters.


       skewness, skewness Z-value, kurtosis & kurtosis Z-value:
	      The skewness and kurtosis of the residuals. Both should  be  0.0
              if the residuals fit a Gaussian distribution perfectly.  Each
              is followed by the Z-value for the statistic.  This is a very
	      stringent  test;	residuals can be very non-Gaussian and yet the
	      estimated rotations, translations,  and  covariance  matrix  may
	      still be rather accurate.


       Data pts, Free params, D/P:
	      The  total  number of data points given all observed structures,
	      the number of parameters being fit in the model, and  the  data-
	      to-parameter ratio.


       Median structure:
	      The structure that is overall most similar to the average struc-
	      ture. This can be considered to be the most  ``typical''	struc-
	      ture in the ensemble.


       Total rounds:
	      The number of iterations that the algorithm took to converge.


       Fractional precision:
	      The actual precision that the algorithm converged to.


OUTPUT FILES
       Theseus writes out the following files:


       theseus_sup.pdb
              The final superposition, rotated to the principal axes of the
	      mean structure.


       theseus_ave.pdb
	      The estimate of the mean structure.


       theseus_residuals.txt
	      The normalized residuals of the superposition. These can be ana-
	      lyzed for deviations from normality (whether they fit a standard
	      Gaussian distribution). E.g., the chi2, skewness,  and  kurtosis
	      statistics are based on these values.


       theseus_transf.txt
	      The  final transformation rotation matrices and translation vec-
	      tors.


       theseus_variances.txt
	      The vector of estimated variances for each atom.


       When Principal Components are calculated (with the -P option), the fol-
       lowing files are also produced:


       theseus_pcvecs.txt
	      The principal component vectors.


       theseus_pcstats.txt
              Simple statistics for each principal component (loadings,
              variance explained, etc.).


       theseus_pcN_ave.pdb
	      The average structure with the Nth principal  component  written
	      in the temperature factor field.


       theseus_pcN.pdb
	      The final superposition with the Nth principal component written
	      in the temperature factor field.	 This  file  is  omitted  when
	      superposing molecules with different residue sequences (mode 2).


       theseus_cor.mat, theseus_cov.mat
              The atomic correlation and covariance matrices, based on the
              final superposition.  The format is suitable for input to GNU
              Octave.  These are the matrices used in the Principal
              Components Analysis.


BUGS
       Please send me (DLT) reports of all problems.


RESTRICTIONS
       Theseus	is  not  a  structural alignment program.  The structure-based
       alignment problem is completely different from the structural  superpo-
       sition  problem.  In order to do a structural superposition, there must
       be a 1-to-1 mapping that associates the atoms in one structure with the
       atoms  in  the other structures.  In the simplest case, this means that
       structures must have equivalent numbers of atoms, such as the models in
       an   NMR   PDB	file.	 For  structures  with	different  numbers  of
       residues/atoms, superposing is only possible when  the  sequences  have
       been  aligned previously.  Finding the best sequence alignment based on
       only structural information is a difficult problem, and one  for  which
       there  is  currently no maximum likelihood approach.  Extending theseus
       to address the structural alignment  problem  is  an  ongoing  research
       project.


AUTHOR
       Douglas L. Theobald
       dtheobald@brandeis.edu


CITATION
       When using theseus in publications please cite:


       Douglas L. Theobald and Phillip A. Steindel (2012)
       ``Optimal  simultaneous	superpositioning  of  multiple structures with
       missing data.''
       Bioinformatics 28(15):1972-1979

       The following papers also report theseus developments:


       Douglas L. Theobald and Deborah S. Wuttke (2008)
       ``Accurate structural correlations from maximum	likelihood  superposi-
       tions.''
       PLoS Computational Biology 4(2):e43


       Douglas L. Theobald and Deborah S. Wuttke (2006)
       ``THESEUS: Maximum likelihood superpositioning and analysis of
       macromolecular structures.''
       Bioinformatics 22(17):2171-2172


       Douglas L. Theobald and Deborah S. Wuttke (2006)
       ``Empirical Bayes models for regularizing maximum likelihood estimation
       in the matrix Gaussian Procrustes problem.''
       PNAS 103(49):18521-18527


HISTORY
       Long, tedious, and sordid.



Brandeis University		 25 March 2015			    THESEUS(1)