Provided by: phast_1.6+dfsg-4_amd64

NAME

       phyloP - Compute conservation or acceleration p-values based on an alignment and a model of neutral
       evolution.  The phylogenetic model must be in the .mod format produced by the phyloFit program.  The
       alignment file can be in any of several file formats (see --msa-format).  No alignment is required with
       the --null option.

DESCRIPTION

       Compute conservation or acceleration p-values based on an alignment and a  model  of  neutral  evolution.
       Will  also  compute p-values of conservation/acceleration in a subtree and in its complementary supertree
       given the whole tree (see --subtree).   P-values  can  be  produced  for  entire  input  alignments  (the
       default),  pre-specified  intervals  within  an  alignment  (see  --features),  or  individual sites (see
       --wig-scores and --base-by-base).

       The default behavior is to compute a null distribution for the total number  of  substitutions  from  the
       tree  model,  an  estimate of the number of substitutions that have actually occurred, and the p-value of
       this estimate with respect to the null distribution.  These computations are performed as described by Siepel,
       Pollard,  and  Haussler  (2006).   In  addition  to  the  SPH  method,  phyloP  can  compute  p-values or
       conservation/acceleration scores using a  likelihood  ratio  test  (--method  LRT),  a  score-based  test
       (--method  SCORE),  or  a  procedure  similar to that used by GERP (Cooper et al., 2005) (--method GERP).
       These alternative methods are currently supported only with --base-by-base, --wig-scores, or --features.
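
       As a rough illustration of the default behavior described above, the sketch below takes a null
       distribution over substitution counts and an estimated count, and returns the two tail
       probabilities.  It is not phyloP's implementation: the function name, the made-up distribution, and
       the rounding of a non-integer estimate to the nearest count are all assumptions made for the example.

```python
def tail_p_values(null_pmf, observed):
    """Return (p_conservation, p_acceleration) for an estimated substitution count.

    null_pmf[k] is the null probability of exactly k substitutions.
    Rounding `observed` to the nearest integer is an assumption of this sketch.
    """
    k = int(round(observed))
    k = max(0, min(k, len(null_pmf) - 1))
    p_cons = sum(null_pmf[: k + 1])  # P(X <= k): as few or fewer substitutions
    p_acc = sum(null_pmf[k:])        # P(X >= k): as many or more substitutions
    return p_cons, p_acc


# Hypothetical null distribution over 0..4 substitutions (illustrative values only).
null_pmf = [0.1, 0.2, 0.4, 0.2, 0.1]
p_cons, p_acc = tail_p_values(null_pmf, 1)
print(f"conservation p = {p_cons:.3f}, acceleration p = {p_acc:.3f}")
```

       A small conservation p-value means that fewer substitutions were estimated than the neutral model
       would lead one to expect; a small acceleration p-value means the opposite.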

       The main advantage of the SPH method is  that  it  can  provide  a  complete  and  exact  description  of
       distributions  over  numbers  of substitutions.  However, simulation experiments suggest that the LRT and
       SCORE methods have somewhat better power than SPH for identifying selection, especially when the expected
       number of substitutions is small (e.g., with  short  branch  lengths  and/or  short  intervals/individual
       sites).   These  two methods are also faster.  They are generally similar to one another in power, but in
       many cases SCORE is considerably faster than LRT.  On the other hand, SCORE appears to have slightly less
       power than LRT at low false positive rates, i.e., for cases  of  extreme  selection.   Thus,  when  using
       --base-by-base,  --wig-scores,  or  --features, LRT is recommended for most purposes, but SCORE is a good
       alternative if speed is an issue.

       When computing p-values with the SPH method, the default is to use the posterior expected number of
       substitutions as an estimate of the actual number.  This is a conservative estimate, because it is
       biased toward the mean of the null distribution by the prior.  These p-values can be made less
       conservative with --fit-model and more conservative with --confidence-interval (see below).

EXAMPLE

       1.  Using  the  SPH  method,  compute  and  report  p-values of conservation and acceleration for a given
       alignment with respect to a neutral model of evolution.  Estimated  numbers  of  substitutions  are  also
       reported.

              phyloP neutral.mod alignment.fa > report.txt

       The  file  neutral.mod  could  be produced by running phyloFit on data from ancestral repeats or fourfold
       degenerate sites with an appropriate tree topology and substitution model.

       2. Compute and report p-values of conservation and acceleration for  a  particular  subtree  of  interest
       (using SPH).

              phyloP --subtree human-mouse_lemur neutral.mod alignment.fa > report.txt

       Here human-mouse_lemur denotes the most recent common ancestor of human and mouse_lemur, which is the node
       that  defines  the  primate  clade  in this phylogeny.  The tree_doctor program with the --name-ancestors
       option can be used to assign names to ancestral nodes of the tree.

       3. Describe the complete null distribution over the number of substitutions for a  10bp  alignment  given
       the specified neutral model (using SPH).

              phyloP --null 10 neutral.mod > null.txt

       A  two-column  table  is  produced  with  numbers  of  substitutions  and  their  probabilities, up to an
       appropriate upper limit.
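
       A table like this can be loaded for further analysis with a few lines of Python.  The sketch below
       assumes whitespace-separated columns (count, then probability) and skips comment or header lines;
       the exact layout of phyloP's output is not specified here, so the parsing rules, the function names,
       and the sample text are assumptions.

```python
def read_distribution(text):
    """Parse two-column 'count probability' lines into a list of (int, float)."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2 or fields[0].startswith("#"):
            continue  # skip blank lines, comments, and anything not two columns
        try:
            rows.append((int(fields[0]), float(fields[1])))
        except ValueError:
            continue  # skip a header or other non-numeric line
    return rows


def mean_substitutions(rows):
    """Expected number of substitutions, renormalizing in case tails were truncated."""
    total = sum(p for _, p in rows)
    return sum(k * p for k, p in rows) / total


# Made-up sample standing in for the contents of null.txt.
sample = "# nsubst prob\n0 0.5\n1 0.3\n2 0.2\n"
rows = read_distribution(sample)
print(rows, mean_substitutions(rows))
```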

       4. Describe the complete posterior distribution over the number of substitutions  in  a  given  alignment
       (using SPH).

              phyloP --posterior neutral.mod alignment.fa > posterior.txt

       5.  Compute  conservation  scores  (-log10 p-values) for each site in an alignment and output them in the
       fixed-step wig format (see http://genome.ucsc.edu/goldenPath/help/wiggle.html).  Use the likelihood ratio
       test (LRT) method.

              phyloP --wig-scores --method LRT neutral.mod alignment.fa > scores.wig

       The --mode option can be used instead to produce  acceleration  scores  (ACC),  scores  of  nonneutrality
       (NNEUT),  or scores that summarize conservation and acceleration (CONACC).  The --base-by-base option can
       be used to output additional statistics of interest (estimated scale factors,  log10  likelihood  ratios,
       etc.).  As discussed above, several arguments to --method are possible.
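
       The resulting scores.wig file can be read back with a small parser.  The sketch below handles the
       fixedStep declaration lines of the wig format (chrom, start, and an optional step) and yields one
       record per site; the function name and the sample values are illustrative only.

```python
def parse_fixed_step_wig(lines):
    """Yield (chrom, position, score) from fixedStep wig lines; positions are 1-based."""
    chrom, pos, step = None, None, 1
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        if line.startswith("fixedStep"):
            fields = dict(f.split("=", 1) for f in line.split()[1:])
            chrom = fields["chrom"]
            pos = int(fields["start"])
            step = int(fields.get("step", 1))
            continue
        if chrom is None:
            raise ValueError("data line before any fixedStep declaration")
        yield chrom, pos, float(line)
        pos += step


# Made-up scores; a score of 2 or more corresponds to p <= 0.01.
wig = ["fixedStep chrom=chr1 start=101 step=1", "0.52", "2.30", "1.10"]
strong = [(c, p, s) for c, p, s in parse_fixed_step_wig(wig) if s >= 2.0]
print(strong)
```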

       6. Similarly, compute scores describing lineage-specific conservation in primates.

              phyloP --wig-scores --method LRT --subtree human-mouse_lemur neutral.mod alignment.fa > scores.wig

       7. Compute conservation p-values and associated statistics for each element in a BED file.  This time use
       a  score test and allow for acceleration as well as conservation, flagging elements under acceleration by
       making their p-values negative (CONACC mode).

              phyloP --features elements.bed --method SCORE --mode CONACC neutral.mod alignment.fa > element-scores.txt

       This  option can also be used with --subtree.  The --gff-scores option can be used to output the original
       features in GFF format with scores equal to -log10 p.  Note that the input file can be in GFF instead  of
       BED format.
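
       When moving features between the two formats, keep in mind that BED intervals are 0-based and
       half-open, whereas GFF intervals are 1-based and closed.  The sketch below shows the coordinate
       conversion only; the helper names are hypothetical and are not part of PHAST.

```python
def bed_to_gff_interval(bed_start, bed_end):
    """Convert a 0-based half-open BED interval to a 1-based closed GFF interval."""
    if bed_end <= bed_start:
        raise ValueError("BED interval must have end > start")
    return bed_start + 1, bed_end


def gff_to_bed_interval(gff_start, gff_end):
    """Convert a 1-based closed GFF interval to a 0-based half-open BED interval."""
    if gff_end < gff_start:
        raise ValueError("GFF interval must have end >= start")
    return gff_start - 1, gff_end


# The first ten bases of a sequence: BED 0..10 is GFF 1..10.
print(bed_to_gff_interval(0, 10))
```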

OPTIONS


       --msa-format, -i FASTA|PHYLIP|MPM|MAF|SS

              Alignment format (default is to guess format from file contents).

       --method, -m SPH|LRT|SCORE|GERP

              Method used to compute p-values or conservation/acceleration scores (Default SPH).  The likelihood
              ratio test (LRT) and score test (SCORE) compare an alternative model having a free scale parameter
              with  the  given  neutral  model, or, if --subtree is used, an alternative model having free scale
              parameters for the supertree and subtree with a null model having a single free  scale  parameter.
              P-values  are computed by comparing test statistics with asymptotic chi-square null distributions.
              The GERP-like method (GERP) estimates the number of "rejected substitutions" per base by comparing
              the (per-site) maximum likelihood expected number of substitutions with the expected number  under
              the  neutral  model.   Currently  LRT,  SCORE,  and  GERP  can  be  used only with --base-by-base,
              --wig-scores, or --features.
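
              To illustrate the asymptotic comparison for the likelihood ratio test, the sketch below
              converts a pair of log likelihoods into a chi-square p-value.  It assumes one degree of
              freedom, since in both cases described above the alternative model has exactly one more free
              scale parameter than the null model.  How phyloP derives one-sided p-values for the CON and
              ACC modes is not covered here; the function name and the inputs are illustrative.

```python
import math


def lrt_p_value(lnl_null, lnl_alt):
    """Asymptotic p-value for a likelihood ratio test with one degree of freedom.

    The statistic is 2 * (lnL_alt - lnL_null). For a chi-square distribution
    with 1 degree of freedom, the upper tail is erfc(sqrt(stat / 2)).
    """
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical natural-log likelihoods for a single feature.
p = lrt_p_value(lnl_null=-1234.5, lnl_alt=-1230.0)
print(f"p = {p:.4g}, score = {-math.log10(p):.2f}")
```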

       --wig-scores, -w

              Compute separate p-values per site, and then  compute  site-specific  conservation  (acceleration)
              scores  as  -log10(p).   Output base-by-base scores in fixed-step wig format, using the coordinate
              system of the reference sequence (see --refidx).  In GERP mode, outputs rejected substitutions per
              site instead of -log10 p-values.

       --base-by-base, -b

              Like --wig-scores, but outputs multiple values per site, in a method-dependent way.   With  'SPH',
              output  includes  mean  and variance of posterior distribution, with LRT and SCORE it includes the
              estimated scale factor(s) and test statistics, and with GERP it includes the estimated numbers  of
              neutral,  observed, and rejected substitutions, along with the number of species available at each
              site.

       --refidx, -r <refseq_idx>

              (for use with --wig-scores or --base-by-base)  Use  coordinate  frame  of  specified  sequence  in
              output.   Default  value is 1, first sequence in alignment; 0 indicates coordinate frame of entire
              multiple alignment.
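
              The difference between the two coordinate frames can be sketched as follows: in the frame of
              a reference sequence, only non-gap characters of that sequence advance the position.  The
              helper below is illustrative and not part of PHAST; marking gap columns with None is an
              assumption of the sketch.

```python
def column_to_ref_coords(ref_row, gap_chars="-"):
    """Map each alignment column to a 1-based reference coordinate, or None at gaps."""
    coords = []
    ref_pos = 0
    for ch in ref_row:
        if ch in gap_chars:
            coords.append(None)  # column has no position in the reference frame
        else:
            ref_pos += 1
            coords.append(ref_pos)
    return coords


# Alignment columns 1..6 in the frame of the whole alignment (refidx 0)
# versus the frame of a reference row that has a two-column gap.
print(column_to_ref_coords("AC--GT"))  # [1, 2, None, None, 3, 4]
```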

       --mode, -o CON|ACC|NNEUT|CONACC

              (For use with --wig-scores, --base-by-base, or --features) Whether to compute  one-sided  p-values
              so  that  small  p  (large  -log10  p)  indicates  unexpected  conservation  (CON; the default) or
              acceleration (ACC); or two-sided p-values such that small p indicates an unexpected departure from
              neutrality (NNEUT).  The fourth option (CONACC) uses  positive  values  (p-values  or  scores)  to
              indicate  conservation and negative values to indicate acceleration.  In GERP mode, CON and CONACC
              both report the number of rejected substitutions R (which may be negative), while ACC reports  -R,
              and NNEUT reports abs(R).
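
              The sign conventions above can be summarized in a short sketch.  The first function
              reproduces the stated mapping from the number of rejected substitutions R to the value
              reported in each mode with --method GERP; the second interprets a signed CONACC score of the
              -log10 p kind.  Both function names are hypothetical.

```python
def gerp_reported_value(mode, rejected):
    """Value reported for R rejected substitutions under each --mode setting."""
    if mode in ("CON", "CONACC"):
        return rejected
    if mode == "ACC":
        return -rejected
    if mode == "NNEUT":
        return abs(rejected)
    raise ValueError(f"unknown mode: {mode}")


def decode_conacc_score(score):
    """Split a signed CONACC -log10 p score into (direction, p-value)."""
    direction = "acceleration" if score < 0 else "conservation"
    return direction, 10.0 ** (-abs(score))


print(gerp_reported_value("ACC", -1.5))
print(decode_conacc_score(-2.0))
```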

       --features, -f <file>

              Read  features  from  <file>  (GFF  or  BED  format)  and  output  a table of p-values and related
              statistics with one row per feature.  The features are assumed to use the coordinate frame of  the
              first  sequence in the alignment.  Not for use with --null or --posterior.  See also --gff-scores.

       --gff-scores, -g

              (For use with --features) Instead of a table, output a GFF and assign each feature a score equal  to
              its -log10 p-value.

       --subtree, -s <node-name>

              (Not  available  in  GERP mode) Partition the tree into the subtree beneath the node whose name is
              given and the complementary supertree, and consider conservation/acceleration in the subtree given
              the supertree.  The branch above the specified node is included with the subtree.  Thus, given the
              tree "((human,chimp)primate,(mouse,rat)rodent)", the option "--subtree primate"  will  create  one
              partition  consisting  of  human,  chimp,  and  the  branch leading to them, and another partition
              consisting of the rest of the tree; "--subtree human" will create one partition consisting only of
              human and the branch leading to it and another partition consisting of the rest of the  tree.   In
              'SPH' mode, a reversible substitution model is assumed.
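
              The partition in the example above can be reproduced with a small Newick reader.  This is a
              minimal sketch, not PHAST code: it handles labeled trees with optional branch lengths, and it
              summarizes each partition by its leaf species even though the partition is really over
              branches.  All function names are hypothetical.

```python
def parse_newick(text):
    """Parse a labeled Newick string into nested (name, children) tuples."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            children.append(parse_node())
            while pos < len(text) and text[pos] == ",":
                pos += 1
                children.append(parse_node())
            if pos >= len(text) or text[pos] != ")":
                raise ValueError("unbalanced parentheses in tree")
            pos += 1  # consume ")"
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name = text[start:pos].strip()
        if pos < len(text) and text[pos] == ":":  # skip a branch length
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
        return name, children

    return parse_node()


def leaves(node):
    name, children = node
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaves(child)]


def find(node, target):
    if node[0] == target:
        return node
    for child in node[1]:
        hit = find(child, target)
        if hit is not None:
            return hit
    return None


def partition_leaves(tree, node_name):
    """Return (leaves in the subtree, leaves in the rest of the tree)."""
    sub = find(tree, node_name)
    if sub is None:
        raise ValueError(f"no node named {node_name!r}")
    inside = leaves(sub)
    outside = [leaf for leaf in leaves(tree) if leaf not in inside]
    return inside, outside


tree = parse_newick("((human,chimp)primate,(mouse,rat)rodent)")
print(partition_leaves(tree, "primate"))
print(partition_leaves(tree, "human"))
```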

       --branch, -B <node-name(s)>

              (Not available in GERP or SPH mode)  Like --subtree, but partitions the tree into the set of named
              branches (each named by its child node) and all the remaining branches, then tests for
              conservation/acceleration in the set of named branches relative to the others.  The argument is a
              comma-delimited list of child nodes.

       --chrom, -N <name>

              (Optionally  use  with --wig-scores or --base-by-base) Chromosome name for wig output.  Default is
              root of multiple alignment filename.

       --log, -l <fname>

              Write log to  <fname>  describing  details  of  parameter  optimization.   Useful  for  debugging.
              (Warning: may produce large file.)

       --seed, -d <seed>

              Provide a random number seed, which should be an integer >= 1.  Random numbers are used in some
              cases to generate starting values for optimization.  If no seed is specified, one based on the
              current time is used.

       --no-prune, -P

              Do not prune from the tree species that are not in the alignment.  Rather, treat these species as
              having missing data in the alignment.  Missing data does have an effect on the results when
              --method SPH is used.

       --help, -h

              Produce this help message.

   Options for SPH mode only

       --null, -n <nsites>

              Compute just the null (prior) distribution of the number of substitutions, as defined by the tree
              model and the given number of sites, and output as a table.  The 'alignment' argument will be
              ignored.  If used with --subtree, the joint distribution over the number of substitutions in the
              specified supertree and subtree will be output instead.

       --posterior, -p

              Compute just the posterior distribution of the number of substitutions, given the alignment and
              the model, and output as a table.  If used with --subtree, the joint distribution over the number
              of substitutions in the specified supertree and subtree will be output instead.

       --fit-model, -F

              Fit  model  to  data before computing posterior distribution, by estimating a scale factor for the
              whole tree or (if --subtree) separate scale factors  for  the  specified  subtree  and  supertree.
              Makes  p-values  less conservative.  This option has no effect with --null and currently cannot be
              used with --features.  It can be used with --wig-scores and --base-by-base.

       --epsilon, -e <val>

              (Default 1e-10 or 1e-6 if --wig-scores or --base-by-base) Threshold used in  truncating  tails  of
              distributions;  tail  probabilities  less than this value are discarded.  To get accurate p-values
              smaller than 1e-10, this option will need to be used, at some cost in speed.  Note that truncation
              affects only *right* tails, not left tails, so it  should  be  an  issue  only  with  p-values  of
              acceleration.

       --confidence-interval, -c <val>

              Allow  for  uncertainty in the estimate of the actual number of substitutions by using a (central)
              confidence interval about the mean of the specified size (0 < val < 1).  To be  conservative,  the
              maximum of this interval is used when computing a p-value of conservation, and the minimum is used
              when  computing a p-value of acceleration.  The variance of the posterior is computed exactly, but
              the confidence interval is based  on  the  assumption  that  the  combined  distribution  will  be
              approximately normal (true for large numbers of sites by central limit theorem).
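
              Under the normal approximation described above, the interval can be sketched as the posterior
              mean plus or minus a multiple of the posterior standard deviation.  The code below is an
              illustration under that assumption only: the function name is hypothetical, and clamping the
              lower bound at zero is a choice made for the sketch, not documented phyloP behavior.

```python
from statistics import NormalDist


def substitution_bounds(mean, variance, level):
    """Central interval of size `level` about `mean`, assuming normality.

    Returns (lower, upper). Per the option description, the upper bound would be
    used for a conservation p-value and the lower bound for an acceleration p-value.
    """
    if not 0.0 < level < 1.0:
        raise ValueError("level must satisfy 0 < level < 1")
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * variance ** 0.5
    lower = max(0.0, mean - half_width)  # counts cannot be negative (sketch assumption)
    upper = mean + half_width
    return lower, upper


# Hypothetical posterior mean and variance of the number of substitutions.
print(substitution_bounds(mean=10.0, variance=4.0, level=0.95))
```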

       --quantiles, -q

              (For  use  with  --null  or  --posterior)  Report  quantiles  of  distribution  rather  than whole
              distribution.
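
              For reference, quantiles of a discrete distribution over substitution counts can be computed
              from the full table as sketched below.  Which quantiles phyloP reports, and in what format, is
              not specified here; the function name, the quantile definition (smallest count whose
              cumulative probability reaches q), and the made-up distribution are assumptions.

```python
def discrete_quantiles(pmf, probs):
    """For each q in probs, return the smallest count k with P(X <= k) >= q.

    pmf[k] is the probability of exactly k substitutions.
    """
    result = []
    for q in probs:
        cumulative = 0.0
        answer = len(pmf) - 1  # fall back to the largest count if the tail was truncated
        for k, p in enumerate(pmf):
            cumulative += p
            if cumulative >= q - 1e-12:  # small tolerance for floating-point rounding
                answer = k
                break
        result.append(answer)
    return result


# Hypothetical distribution over 0..4 substitutions.
pmf = [0.1, 0.2, 0.4, 0.2, 0.1]
print(discrete_quantiles(pmf, [0.05, 0.5, 0.95]))  # [0, 2, 4]
```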

SEE ALSO

       Cooper GM, Stone EA, Asimenos G, NISC Comparative Sequencing Program, Green ED,  Batzoglou  S,  Sidow  A.
       Distribution and intensity of constraint in mammalian genomic sequence.  Genome Res. 2005 15(7):901-13.

       Siepel  A,  Pollard  KS,  and  Haussler  D.  New  methods  for  detecting  lineage-specific selection. In
       Proceedings of the 10th International Conference on Research in Computational Molecular  Biology  (RECOMB
       2006), pp. 190-205.

phyloP 1.4                                          May 2016                                           PHYLOP(1)