Provided by: theseus_3.3.0-14_amd64

NAME

       theseus - Maximum likelihood, multiple simultaneous superpositions with statistical analysis

SYNOPSIS

       theseus [options] pdbfile1 [pdbfile2 ...]

       and

       theseus_align [options] -f pdbfile1 [pdbfile2 ...]

DESCRIPTION

       Theseus  superposes  a  set  of  macromolecular  structures  simultaneously  using  the method of maximum
       likelihood (ML), rather  than  the  conventional  least-squares  criterion.   Theseus  assumes  that  the
       structures  are  distributed  according to a matrix Gaussian distribution and that the eigenvalues of the
       atomic covariance matrix are hierarchically distributed according to an inverse gamma distribution.  This
       ML superpositioning model produces much more  accurate  results  by  essentially  downweighting  variable
       regions of the structures and by correcting for correlations among atoms.

       Theseus  operates in two main modes: (1) a mode for superimposing structures with identical sequences and
       (2) a mode for structures with different sequences but similar structures:

              (1) A mode for superpositioning macromolecules with identical sequences and numbers  of  residues,
              for instance, multiple models in an NMR family or multiple structures from different crystal forms
              of the same protein.

              In this mode, Theseus will read every model in every file on the command line and superpose them.

              Example:

              theseus 1s40.pdb

              In the above example, 1s40.pdb is a pdb file of 10 NMR models.

              (2)  An  ``alignment'' mode for superpositioning structures with different sequences, for example,
              multiple structures of the cytochrome  c  protein  from  different  species  or  multiple  mutated
              structures of hen egg white lysozyme.

              This  mode  requires  the  user  to  supply  a  sequence  alignment  file  of the structures being
              superpositioned (see option -A and ``FILE FORMATS'' below).  Additionally, it may be necessary  to
              supply a mapfile that tells theseus which PDB structure files correspond to which sequences in the
              alignment  (see option -M and ``FILE FORMATS'' below).  The mapfile is unnecessary if the sequence
              names and corresponding pdb filenames  are  identical.   In  this  mode,  if  there  are  multiple
              structural  models  in  a PDB file, theseus only reads the first model in each file on the command
              line. In other words, theseus treats the files on the command line  as  if  there  were  only  one
              structure per file.

              Example 1:

              theseus -A cytc.aln -M cytc.filemap d1cih__.pdb d1csu__.pdb d1kyow_.pdb

              In  the  above  example,  d1cih__.pdb,  d1csu__.pdb, and d1kyow_.pdb are pdb files of cytochrome c
              domains from the SCOP database.

              Example 2:

              theseus_align -f d1cih__.pdb d1csu__.pdb d1kyow_.pdb

              In this example, the theseus_align script is called  to  do  the  hard  work  for  you.   It  will
              calculate  a  sequence  alignment  and  then  superpose  based  on  that  alignment.   The  script
              theseus_align takes the same options as the theseus program.  Note, the first few  lines  of  this
              script  must  be  modified for your system, since it calls an external multiple sequence alignment
              program to do the alignment.  See the examples/ directory  for  more  details,  including  example
              files.

OPTIONS

   Algorithmic options, defaults in {brackets}:
       --amber
              Do special processing for AMBER8 formatted PDB files

              Most users will never need this long option unless they are processing MD traces from AMBER.
              AMBER puts the atom names in the wrong column in the PDB file.

       -a [selection]
              Atoms to include in the superposition.  This option takes two types of arguments, either (1) a
              number specifying a preselected set of atom types, or (2) an explicit PDB-style, colon-delimited
              list of the atoms to include.

              For the preselected atom type subsets, the following integer options are available:

               • 0, alpha carbons for proteins, C1´ atoms for nucleic acids
               • 1, backbone
               • 2, all
               • 3, alpha and beta carbons
               • 4, all heavy atoms (no hydrogens)

              Note, only the -a0 option is available when superpositioning structures with different sequences.

              To select a custom set of atom types, specify each atom type exactly as given in the PDB file
              field, including spaces, and enclose the whole list in quotation marks.  Multiple atom types
              must be delimited by a colon.  For example,

              -a ' N  : CA : C  : O  '

              would specify the atom types in the peptide backbone.

       -f     Only read the first model of a multi-model PDB file

       -h     Help/usage

       -i [nnn]
              Maximum iterations, {200}

       -p [precision]
              Requested relative precision for convergence, {1e-7}

       -r [root name]
              Root name to be used in naming the output files, {theseus}

       -s [n-n:...]
              Residue selection (e.g. -s15-45:50-55), {all}

       -S [n-n:...]
              Residues to exclude (e.g. -S15-45:50-55) {none}

              The previous two options have the same format. Residue (or alignment column) ranges are  indicated
              by  beginning and end separated by a dash.  Multiple ranges, in any arbitrary order, are separated
              by a colon.  Chains may also be selected by giving the chain ID immediately preceding the  residue
              range.  For example, -sA1-20:A40-71 will only include residues 1 through 20 and 40 through 71 in
              chain A. Chains cannot be specified when superposing structures with different sequences.

       -v     use ML variance weighting (no correlations) {default}
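
       The two selection syntaxes used by -a and by -s/-S can be illustrated with a short Python sketch.  The
       helper names parse_residue_selection and parse_atom_selection are hypothetical and are not part of
       theseus.  The sketch assumes a single-letter chain ID, non-negative residue numbers, and the
       four-character PDB atom name field; theseus itself may accept more than this.

```python
import re

# Hypothetical helpers for illustration only; not part of theseus.
# Assumed range syntax: optional one-letter chain ID, then first-last.
_RANGE = re.compile(r"^([A-Za-z]?)(\d+)-(\d+)$")


def parse_residue_selection(spec):
    """Split a -s/-S style string such as 'A1-20:A40-71' into
    (chain, first, last) tuples. chain is '' when no chain ID is given."""
    ranges = []
    for part in spec.split(":"):
        match = _RANGE.match(part)
        if match is None:
            raise ValueError(f"malformed residue range: {part!r}")
        chain = match.group(1)
        first, last = int(match.group(2)), int(match.group(3))
        ranges.append((chain, first, last))
    return ranges


def parse_atom_selection(spec):
    """Split a -a style string on colons, keeping the spaces, and check
    that each name fills the four-character PDB atom name field."""
    names = spec.split(":")
    for name in names:
        if len(name) != 4:
            raise ValueError(f"atom name is not 4 characters wide: {name!r}")
    return names


if __name__ == "__main__":
    print(parse_residue_selection("A1-20:A40-71"))
    print(parse_atom_selection(" N  : CA : C  : O  "))
```

       The atom names keep their padding because the text above requires them to match the PDB field exactly,
       including spaces.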

   Input/output options:
       -A [sequence alignment file]
              Sequence alignment file to use as a guide (CLUSTAL or A2M format)

              For use when superposing structures with different sequences.  See ``FILE FORMATS'' below.

       -E     Print expert options

       -F     Print FASTA files of the sequences in PDB files and quit

              A useful option when superposing structures with different sequences.  The files output with  this
              option  can  be  aligned with a multiple sequence alignment program such as CLUSTAL or MUSCLE, and
              the resulting output alignment file used as theseus input with the -A option.

       -h     Help/usage

       -I     Just calculate statistics for input file; don't superpose

       -M [mapfile]
              File that maps PDB files to sequences in the alignment.

              A simple two-column formatted file; see ``FILE FORMATS'' below. Used with mode 2.

       -n     Don't write transformed pdb file

       -o [reference structure]
              Reference file to superpose on; all rotations are relative to the first model in this file

              For example, 'theseus -o cytc1.pdb cytc1.pdb cytc2.pdb cytc3.pdb' will  superpose  the  structures
              and  rotate  the  entire  final  superposition so that the structure from cytc1.pdb is in the same
              orientation as the structure in the original cytc1.pdb PDB file.

       -V     Version

   Principal components analysis:
       -C     Use covariance matrix for PCA (correlation matrix is default)

       -P [nnn]
              Number of principal components to calculate {0}

              In both of the above, the corresponding principal component is written in the  B-factor  field  of
              the output PDB file. Usually only the first few PCs are of any interest (maybe up to six).
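
       The difference between the two PCA inputs selected by -C can be seen in how a correlation matrix is
       derived from a covariance matrix: each entry is divided by the two corresponding standard deviations.
       The following Python sketch shows that standard relationship only.  The function name is hypothetical,
       and the sketch makes no claim about how theseus computes either matrix internally.

```python
import math


def covariance_to_correlation(cov):
    """Rescale a square covariance matrix (list of lists) so that every
    diagonal entry becomes 1: corr[i][j] = cov[i][j] / (sd[i] * sd[j])."""
    n = len(cov)
    sd = [math.sqrt(cov[i][i]) for i in range(n)]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # Made-up 2x2 covariance matrix for two atoms.
    cov = [[4.0, 2.0],
           [2.0, 9.0]]
    for row in covariance_to_correlation(cov):
        print(row)
```

       With the correlation matrix every atom contributes unit variance to the PCA, whereas with the
       covariance matrix highly variable atoms dominate the leading components.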

EXAMPLES

       theseus 2sdf.pdb

       theseus -l -r new2sdf 2sdf.pdb

       theseus -s15-45 -P3 2sdf.pdb

       theseus -A cytc.aln -M cytc.mapfile -o cytc1.pdb -s1-40 cytc1.pdb cytc2.pdb cytc3.pdb cytc4.pdb

ENVIRONMENT

       You  can  set  the  environment  variable 'PDBDIR' to your PDB file directory and theseus will look there
       after the present working directory.  For example, in the C shell (tcsh or csh), you  can  put  something
       akin to this in your .cshrc file:

       setenv PDBDIR '/usr/share/pdbs/'
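
       The lookup order described above (present working directory first, then PDBDIR) can be sketched in
       Python as follows.  The function find_pdb is hypothetical and only mirrors the described order; it is
       not how theseus is implemented.

```python
import os


def find_pdb(name, environ=os.environ):
    """Return the first existing path for `name`, looking in the current
    working directory and then in $PDBDIR, or None if it is in neither."""
    candidates = [os.path.join(os.getcwd(), name)]
    pdbdir = environ.get("PDBDIR")
    if pdbdir:
        candidates.append(os.path.join(pdbdir, name))
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


if __name__ == "__main__":
    print(find_pdb("1s40.pdb"))
```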

FILE FORMATS

       Theseus  will  read standard PDB formatted files (see <http://www.rcsb.org/pdb/>).  Every effort has been
       made for the program to accept nonstandard CNS and X-PLOR file formats also.

       Two other files deserve mention, a sequence alignment file and a mapfile.

   Sequence alignment file
       When superposing structures with different residue identities (where the lengths of the
       macromolecules in terms of residues are not necessarily equal), a sequence alignment file must be
       included for theseus to use as a guide (specified by the -A option).  Theseus accepts  both  CLUSTAL  and
       A2M (FASTA) formatted multiple sequence alignment files.

       NOTE  1:  The  residue  sequence  in  the  alignment must match exactly the residue sequence given in the
       coordinates of the PDB file. That is, there can be no missing or extra residues that do not correspond to
       the sequence in the PDB file. An easy way to ensure that your sequences exactly match the PDB files is to
       generate the sequences using theseus' -F option, which writes out a FASTA formatted sequence file of  the
       chain(s) in the PDB files. The files output with this option can then be aligned with a multiple sequence
       alignment  program  such  as  CLUSTAL  or MUSCLE, and the resulting output alignment file used as theseus
       input with the -A option.

       NOTE 2: Every PDB file must have a corresponding sequence in the alignment.  However, not every  sequence
       in  the  alignment  needs  to have a corresponding PDB file. That is, there can be extra sequences in the
       alignment that are not used for guiding the superposition.
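
       The two notes above amount to a simple consistency check, sketched here in Python.  The names ungapped
       and check_alignment are hypothetical.  The sketch assumes that sequences have already been read into
       dictionaries keyed by name, that '-' and '.' are the only gap characters, and that PDB files and
       alignment sequences share the same names (so no mapfile is needed).

```python
GAP_CHARS = "-."  # assumed gap characters for CLUSTAL and A2M style rows


def ungapped(aligned_row):
    """Strip gap characters from one alignment row and uppercase it."""
    return "".join(ch for ch in aligned_row if ch not in GAP_CHARS).upper()


def check_alignment(pdb_sequences, alignment):
    """Return a list of problems.

    pdb_sequences: name -> residue sequence taken from the PDB coordinates
    alignment:     name -> aligned row, possibly containing gaps
    Extra sequences in the alignment are allowed and are not reported.
    """
    problems = []
    for name, seq in pdb_sequences.items():
        if name not in alignment:
            problems.append(f"{name}: no sequence in alignment")
        elif ungapped(alignment[name]) != seq.upper():
            problems.append(f"{name}: alignment differs from PDB sequence")
    return problems


if __name__ == "__main__":
    pdb = {"cytc1.pdb": "GDVEKG", "cytc2.pdb": "GDAEKG"}
    aln = {"cytc1.pdb": "GDVE-KG", "cytc2.pdb": "GDA-EKG", "extra": "GD--EKG"}
    print(check_alignment(pdb, aln))
```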

   PDB -> Sequence mapfile
       If the names of the PDB files and the names of the corresponding sequences in the alignment are
       identical, the mapfile may be omitted.  Otherwise, Theseus needs to know which sequences in the alignment
       file  correspond  to  which  PDB  structure  files. This information is included in a mapfile with a very
       simple format (specified with the -M option). There are only two columns  separated  by  whitespace:  the
       first  column lists the names of the PDB structure files, while the second column lists the corresponding
       sequence names exactly as given in the multiple sequence alignment file.

       An example of the mapfile:

       cytc1.pdb    seq1
       cytc2.pdb    seq2
       cytc3.pdb    seq3
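
       Because the mapfile is just two whitespace-separated columns, it is easy to read or validate with a few
       lines of Python.  The function read_mapfile below is a hypothetical helper, not part of theseus; it
       assumes that blank lines may be skipped and that any other line must have exactly two fields.

```python
def read_mapfile(lines):
    """Parse mapfile lines into a dict: PDB file name -> sequence name."""
    mapping = {}
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines (an assumption of this sketch)
        if len(fields) != 2:
            raise ValueError(f"line {number}: expected 2 columns, got {len(fields)}")
        pdb_name, seq_name = fields
        mapping[pdb_name] = seq_name
    return mapping


if __name__ == "__main__":
    example = """\
cytc1.pdb    seq1
cytc2.pdb    seq2
cytc3.pdb    seq3
"""
    print(read_mapfile(example.splitlines()))
```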

SCREEN OUTPUT

       Theseus provides output describing both the progress of the superposing and several  statistics  for  the
       final result:

       Classical LS pairwise <RMSD>:
              The  conventional  RMSD  for  the superposition, the average RMSD for all pairwise combinations of
              structures in the ensemble.

       Least-squares <sigma>:
              The standard deviation  for  the  superposition,  based  on  the  conventional  assumption  of  no
              correlation and equal variances. Basically equal to the RMSD from the average structure.

       Maximum Likelihood <sigma>:
              The ML analog of the standard deviation for the superposition. When assuming that the correlations
              are  zero (a diagonal covariance matrix), this is equal to the square root of the harmonic average
              of the variances for each atom. In contrast, the ``Least-squares <sigma>'' given above reports the
              square root of the arithmetic average of the variances.  The harmonic average is never greater than
              the arithmetic average, and it downweights large values in proportion to their
              magnitude. This makes sense statistically, because when combining values one should weight them by
              the reciprocal of their variance (which is in fact what the ML superposing method does).

       Marginal Log Likelihood:
              The final marginal log likelihood of the superposition, assuming the matrix Gaussian  distribution
              of  the  structures  and  the  hierarchical  inverse  gamma distribution of the eigenvalues of the
              covariance matrix.  The marginal log likelihood is  the  likelihood  with  the  covariance  matrix
              integrated out.

       AIC:   The  Akaike  Information  Criterion for the final superposition. This is an important statistic in
              likelihood analysis and model selection theory. It allows  an  objective  comparison  of  multiple
              theoretical  models  with different numbers of parameters. In this case, the higher the number the
              better. There is a tradeoff between fit to the data  and  the  number  of  parameters  being  fit.
              Increasing  the  number of parameters in a model will always give a better fit to the data, but it
              also increases the uncertainty of  the  estimated  values.   The  AIC  criterion  finds  the  best
              combination  by (1) maximizing the fit to the data while (2) minimizing the uncertainty due to the
              number of parameters. In the superposition case, one can compare the least  squares  superposition
              to the maximum likelihood superposition. The method (or model) with the higher AIC is preferred. A
              difference in the AIC of 2 or more is considered strong statistical evidence for the better model.

       BIC:   The Bayesian Information Criterion. Similar to the AIC, but with a Bayesian emphasis.

       Omnibus chi2:
              The  overall  reduced  chi2  statistic  for the entire fit, including the rotations, translations,
              covariances, and the inverse gamma parameters. This is probably the most important  statistic  for
              the  superposition. In some cases, the inverse gamma fit may be poor, yet the overall fit is still
              very good. The omnibus chi2 should ideally be close to 1.0, which would indicate a perfect fit.  However,
              if  you think it is too large, make sure to compare it to the chi2 for the least-squares fit; it's
              probably not that bad after all.  A large chi2 often indicates a violation of the  assumptions  of
              the model.  The most common violation is when superposing two or more independent domains that can
              rotate  relative  to  each  other.  If  this  is  the case, then there will likely be not just one
              Gaussian distribution, but several mixed Gaussians, one for each domain.  Then, it would be better
              to superpose each domain independently.

       Hierarchical var (alpha, gamma) chi2:
              The reduced chi2 for the inverse gamma fit of the covariance matrix  eigenvalues.  As  before,  it
              should  ideally  be  close  to 1.0.  The two values in the parentheses are the ML estimates of the
              scale and shape parameters, respectively, for the inverse gamma distribution.

       Rotational, translational, covar chi2:
              The reduced chi2 statistic for the fit of the structures to the model.  With a good fit it  should
              be  close to 1.0, which indicates a perfect fit of the data to the statistical model.  In the case
              of least-squares, the assumed model is a matrix Gaussian distribution of the structures with equal
              variances and no correlations.  For the ML fits, the assumed model is  unequal  variances  and  no
              correlations, as calculated with the -v option [default].  This statistic is for the superposition
              only,  and  does  not  include  the  fit  of the covariance matrix eigenvalues to an inverse gamma
              distribution.  See ``Omnibus chi2'' above.

       Hierarchical minimum var:
              The hierarchical fit of the inverse gamma distribution constrains the variances of  the  atoms  by
              making  large  ones  smaller  and  small ones larger.  This statistic reports the minimum possible
              variance given the inferred inverse gamma parameters.

       skewness, skewness Z-value, kurtosis & kurtosis Z-value:
              The skewness and kurtosis of the residuals. Both should be 0.0 if the  residuals  fit  a  Gaussian
              distribution  perfectly.   They  are  followed  by  the P-value for the statistics. This is a very
              stringent test; residuals can be very non-Gaussian and yet the estimated rotations,  translations,
              and covariance matrix may still be rather accurate.

       Data pts, Free params, D/P:
              The  total number of data points given all observed structures, the number of parameters being fit
              in the model, and the data-to-parameter ratio.

       Median structure:
              The structure that is overall most similar to the average structure. This can be considered to  be
              the most ``typical'' structure in the ensemble.

       Total rounds:
              The number of iterations that the algorithm took to converge.

       Fractional precision:
              The actual precision that the algorithm converged to.
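
       The relationship between ``Least-squares <sigma>'' and ``Maximum Likelihood <sigma>'' described above
       can be checked numerically.  The Python sketch below covers only the diagonal covariance case (no
       correlations) and takes the per-atom variances as given; the function names and the example variances
       are made up for illustration.

```python
import math


def ls_sigma(variances):
    """Square root of the arithmetic average of the per-atom variances."""
    return math.sqrt(sum(variances) / len(variances))


def ml_sigma(variances):
    """Square root of the harmonic average of the per-atom variances
    (diagonal covariance matrix assumed)."""
    return math.sqrt(len(variances) / sum(1.0 / v for v in variances))


if __name__ == "__main__":
    # Three well-ordered atoms and one highly variable atom (made-up values).
    variances = [0.1, 0.1, 0.1, 10.0]
    print("LS sigma:", ls_sigma(variances))
    print("ML sigma:", ml_sigma(variances))
```

       The single large variance dominates the arithmetic average but barely moves the harmonic one, which is
       the downweighting of variable regions that the ML method relies on.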

OUTPUT FILES

       Theseus writes out the following files:

       theseus_sup.pdb
              The final superposition, rotated to the principal axes of the mean structure.

       theseus_ave.pdb
              The estimate of the mean structure.

       theseus_residuals.txt
              The normalized residuals of the superposition. These can be analyzed for deviations from normality
              (whether  they  fit  a  standard  Gaussian  distribution).  E.g., the chi2, skewness, and kurtosis
              statistics are based on these values.

       theseus_transf.txt
              The final transformation rotation matrices and translation vectors.

       theseus_variances.txt
              The vector of estimated variances for each atom.

       When Principal Components are calculated (with the -P option), the following files are also produced:

       theseus_pcvecs.txt
              The principal component vectors.

       theseus_pcstats.txt
              Simple statistics for each principal component (loadings, variance explained, etc.).

       theseus_pcN_ave.pdb
              The average structure with the Nth principal component written in the temperature factor field.

       theseus_pcN.pdb
              The final superposition with the Nth principal component written in the temperature factor  field.
              This file is omitted when superposing molecules with different residue sequences (mode 2).

       theseus_cor.mat, theseus_cov.mat
              The atomic correlation and covariance matrices, based on the final superposition. The format is
              suitable for input to GNU Octave.  These are the matrices used in the Principal Components
              Analysis.

BUGS

       Please send me (DLT) reports of all problems.

RESTRICTIONS

       Theseus  is  not  a  structural  alignment  program.  The structure-based alignment problem is completely
       different from the structural superposition problem.  In order to do a  structural  superposition,  there
       must  be  a  1-to-1  mapping  that  associates  the  atoms  in  one structure with the atoms in the other
       structures.  In the simplest case, this means that structures must have equivalent numbers of atoms, such
       as the models in an NMR PDB file.  For structures with different numbers of  residues/atoms,  superposing
       is  only  possible  when the sequences have been aligned previously.  Finding the best sequence alignment
       based on only structural information is a difficult problem, and one for  which  there  is  currently  no
       maximum likelihood approach.  Extending theseus to address the structural alignment problem is an ongoing
       research project.
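
       To make the 1-to-1 mapping concrete, the following Python sketch derives equivalent residue positions
       from a sequence alignment that has already been computed.  The function equivalent_positions is
       hypothetical.  It keeps only columns in which every sequence has a residue, which is the simplest
       possible rule; how theseus itself treats gapped columns is not described in this page.

```python
GAP_CHARS = "-."  # assumed gap characters


def equivalent_positions(aligned_rows):
    """For each alignment column with no gaps, return a tuple holding the
    0-based residue index of that column in every sequence."""
    counters = [0] * len(aligned_rows)  # residues consumed so far, per row
    mapping = []
    for column in zip(*aligned_rows):
        if all(ch not in GAP_CHARS for ch in column):
            mapping.append(tuple(counters))
        for k, ch in enumerate(column):
            if ch not in GAP_CHARS:
                counters[k] += 1
    return mapping


if __name__ == "__main__":
    rows = ["GDVE-KG",
            "GDA-EKG"]
    print(equivalent_positions(rows))
```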

AUTHOR

       Douglas L. Theobald
       dtheobald@brandeis.edu

CITATION

       When using theseus in publications please cite:

       Douglas L. Theobald and Phillip A. Steindel (2012)
       ``Optimal simultaneous superpositioning of multiple structures with missing data.''
       Bioinformatics 28(15):1972-1979

       The following papers also report theseus developments:

       Douglas L. Theobald and Deborah S. Wuttke (2008)
       ``Accurate structural correlations from maximum likelihood superpositions.''
       PLoS Computational Biology 4(2):e43

       Douglas L. Theobald and Deborah S. Wuttke (2006)
       ``THESEUS: Maximum likelihood superpositioning and analysis of macromolecular structures.''
       Bioinformatics 22(17):2171-2172

       Douglas L. Theobald and Deborah S. Wuttke (2006)
       ``Empirical Bayes models for regularizing maximum likelihood estimation in the matrix Gaussian Procrustes
       problem.''
       PNAS 103(49):18521-18527

HISTORY

       Long, tedious, and sordid.

Brandeis University                               25 March 2015                                       THESEUS(1)