Ubuntu Manpage: phyloP - Compute conservation or acceleration p-values based on an alignment and The phylogenetic model

NAME

       phyloP  -  Compute conservation or acceleration p-values based on an alignment and The phylogenetic model
       must be in the .mod format produced by the phyloFit program.  The alignment file can be in any of several
       file formats (see --msa-format).  No alignment is required with the --null option.

DESCRIPTION

Compute conservation or acceleration p-values based on an alignment and a model of neutral evolution.
Will also compute p-values of conservation/acceleration in a subtree and in its complementary supertree
given the whole tree (see --subtree). P-values can be produced for entire input alignments (the
default), pre-specified intervals within an alignment (see --features), or individual sites (see
--wig-scores and --base-by-base).

The default behavior is to compute a null distribution for the total number of substitutions from the
tree model, an estimate of the number of substitutions that have actually occurred, and the p-value of
this estimate wrt the null distribution. These computations are performed as described by Siepel,
Pollard, and Haussler (2006). In addition to the SPH method, phyloP can compute p-values or
conservation/acceleration scores using a likelihood ratio test (--method LRT), a score-based test
(--method SCORE), or a procedure similar to that used by GERP (Cooper et al., 2005) (--method GERP).
These alternative methods are currently supported only with --base-by-base, --wig-scores, or --features.

The main advantage of the SPH method is that it can provide a complete and exact description of
distributions over numbers of substitutions. However, simulation experiments suggest that the LRT and
SCORE methods have somewhat better power than SPH for identifying selection, especially when the expected
number of substitutions is small (e.g., with short branch lengths and/or short intervals/individual
sites). These two methods are also faster. They are generally similar to one another in power, but in
many cases SCORE is considerably faster than LRT. On the other hand, SCORE appears to have slightly less
power than LRT at low false positive rates, i.e., for cases of extreme selection. Thus, when using
--base-by-base, --wig-scores, or --features, LRT is recommended for most purposes, but SCORE is a good
alternative if speed is an issue. When computing p-values with the SPH method, the default is to use the
posterior expected number of substitutions as an estimate of the actual number. This is a conservative
estimate, because it is biased toward the mean of the null distribution by the prior. These p-values can
be made less conservative with --fit-model and more conservative with --confidence-interval (see below).

EXAMPLE

1. Using the SPH method, compute and report p-values of conservation and acceleration for a given
alignment with respect to a neutral model of evolution. Estimated numbers of substitutions are also
reported.

phyloP neutral.mod alignment.fa > report.txt

The file neutral.mod could be produced by running phyloFit on data from ancestral repeats or fourfold
degenerate sites with an appropriate tree topology and substitution model.

2. Compute and report p-values of conservation and acceleration for a particular subtree of interest
(using SPH).

phyloP --subtree human-mouse_lemur neutral.mod alignment.fa > report.txt

Here human-mouse_lemur denote the most recent common ancestor of human and mouse_lemur, which is the node
that defines the primate clade in this phylogeny. The tree_doctor program with the --name-ancestors
option can be used to assign names to ancestral nodes of the tree.

3. Describe the complete null distribution over the number of substitutions for a 10bp alignment given
the specified neutral model (using SPH).

phyloP --null 10 neutral.mod > null.txt

A two-column table is produced with numbers of substitutions and their probabilities, up to an
appropriate upper limit.

4. Describe the complete posterior distribution over the number of substitutions in a given alignment
(using SPH).

phyloP --posterior neutral.mod alignment.fa > posterior.txt

5. Compute conservation scores (-log10 p-values) for each site in an alignment and output them in the
fixed-step wig format (see http://genome.ucsc.edu/goldenPath/help/wiggle.html). Use the likelihood ratio
test (LRT) method.

phyloP --wig-scores --method LRT neutral.mod alignment.fa > scores.wig

The --mode option can be used instead to produce acceleration scores (ACC), scores of nonneutrality
(NNEUT), or scores that summarize conservation and acceleration (CONACC). The --base-by-base option can
be used to output additional statistics of interest (estimated scale factors, log10 likelihood ratios,
etc.). As discussed above, several arguments to --method are possible.

6. Similarly, compute scores describing lineage-specific conservation in primates.

phyloP --wig-scores --method LRT --subtree human-mouse_lemur neutral.mod alignment.fa > scores.wig

7. Compute conservation p-values and associated statistics for each element in a BED file. This time use
a score test and allow for acceleration as well as conservation, flagging elements under acceleration by
making their p-values negative (CONACC mode).

phyloP --features elements.bed --method SCORE --mode CONACC neutral.mod alignment.fa >
element-scores.txt

This option can also be used with --subtree. The --gff-scores option can be used to output the original
features in GFF format with scores equal to -log10 p. Note that the input file can be in GFF instead of
BED format.

OPTIONS

--msa-format, -i FASTA|PHYLIP|MPM|MAF|SS

Alignment format (default is to guess format from file contents).

--method, -m SPH|LRT|SCORE|GERP

Method used to compute p-values or conservation/acceleration scores (Default SPH). The likelihood
ratio test (LRT) and score test (SCORE) compare an alternative model having a free scale parameter
with the given neutral model, or, if --subtree is used, an alternative model having free scale
parameters for the supertree and subtree with a null model having a single free scale parameter.
P-values are computed by comparing test statistics with asymptotic chi-square null distributions.
The GERP-like method (GERP) estimates the number of "rejected substitutions" per base by comparing
the (per-site) maximum likelihood expected number of substitutions with the expected number under
the neutral model. Currently LRT, SCORE, and GERP can be used only with --base-by-base,
--wig-scores, or --features.

--wig-scores, -w

Compute separate p-values per site, and then compute site-specific conservation (acceleration)
scores as -log10(p). Output base-by-base scores in fixed-step wig format, using the coordinate
system of the reference sequence (see --refidx). In GERP mode, outputs rejected substitutions per
site instead of -log10 p-values.

--base-by-base, -b

Like --wig-scores, but outputs multiple values per site, in a method-dependent way. With 'SPH',
output includes mean and variance of posterior distribution, with LRT and SCORE it includes the
estimated scale factor(s) and test statistics, and with GERP it includes the estimated numbers of
neutral, observed, and rejected substitutions, along with the number of species available at each
site.

--refidx, -r <refseq_idx>

(for use with --wig-scores or --base-by-base) Use coordinate frame of specified sequence in
output. Default value is 1, first sequence in alignment; 0 indicates coordinate frame of entire
multiple alignment.

--mode, -o CON|ACC|NNEUT|CONACC

(For use with --wig-scores, --base-by-base, or --features) Whether to compute one-sided p-values
so that small p (large -log10 p) indicates unexpected conservation (CON; the default) or
acceleration (ACC); or two-sided p-values such that small p indicates an unexpected departure from
neutrality (NNEUT). The fourth option (CONACC) uses positive values (p-values or scores) to
indicate conservation and negative values to indicate acceleration. In GERP mode, CON and CONACC
both report the number of rejected substitutions R (which may be negative), while ACC reports -R,
and NNEUT reports abs(R).

--features, -f <file>

Read features from <file> (GFF or BED format) and output a table of p-values and related
statistics with one row per feature. The features are assumed to use the coordinate frame of the
first sequence in the alignment. Not for use with --null or --posterior. See also --gff-scores.

--gff-scores, -g

(For use with features) Instead of a table, output a GFF and assign each feature a score equal to
its -log10 p-value.

--subtree, -s <node-name>

(Not available in GERP mode) Partition the tree into the subtree beneath the node whose name is
given and the complementary supertree, and consider conservation/acceleration in the subtree given
the supertree. The branch above the specified node is included with the subtree. Thus, given the
tree "((human,chimp)primate,(mouse,rat)rodent)", the option "--subtree primate" will create one
partition consisting of human, chimp, and the branch leading to them, and another partition
consisting of the rest of the tree; "--subtree human" will create one partition consisting only of
human and the branch leading to it and another partition consisting of the rest of the tree. In
'SPH' mode, a reversible substitution model is assumed.

--branch, -B <node-name(s)>

(Not available in GERP or SPH mode). Like subtree, but partitions the tree into the set of named
branches (each named by its child node), and all the remaining branches. Then tests for
conservation/ acceleration in the set of named branches relative to the others. The argument is a
comma-delimited list of child nodes.

--chrom, -N <name>

(Optionally use with --wig-scores or --base-by-base) Chromosome name for wig output. Default is
root of multiple alignment filename.

--log, -l <fname>

Write log to <fname> describing details of parameter optimization. Useful for debugging.
(Warning: may produce large file.)

--seed, -d <seed>

Provide a random number seed, should be an integer >=1. Random numbers are used in some cases to
generate starting values for optimization. If not specified will use a seed based on the current
time.

--no-prune,-P

Do not prune species from tree which are not in alignment. Rather, treat these species as having
missing data in the alignment. Missing data does have an effect on the results when --method SPH
is used.

--help, -h

Produce this help message.

Options for SPH mode only

--null, -n <nsites> Compute just the null (prior) distribution of the number of substitutions, as defined
by the tree model and the given number of sites, and output as a table. The 'alignment' argument
will be ignored. If used with --subtree, the joint distribution over the number of substitutions
in the specified supertree and subtree will be output instead.

--posterior, -p Compute just the posterior distribution of the number of substitutions, given the
alignment and the model, and output as a table. If used with --subtree, the joint distribution
over the number of substitutions in the specified supertree and subtree will be output instead.

--fit-model, -F

Fit model to data before computing posterior distribution, by estimating a scale factor for the
whole tree or (if --subtree) separate scale factors for the specified subtree and supertree.
Makes p-values less conservative. This option has no effect with --null and currently cannot be
used with --features. It can be used with --wig-scores and --base-by-base.

--epsilon, -e <val>

(Default 1e-10 or 1e-6 if --wig-scores or --base-by-base) Threshold used in truncating tails of
distributions; tail probabilities less than this value are discarded. To get accurate p-values
smaller than 1e-10, this option will need to be used, at some cost in speed. Note that truncation
affects only *right* tails, not left tails, so it should be an issue only with p-values of
acceleration.

--confidence-interval, -c <val>

Allow for uncertainty in the estimate of the actual number of substitutions by using a (central)
confidence interval about the mean of the specified size (0 < val < 1). To be conservative, the
maximum of this interval is used when computing a p-value of conservation, and the minimum is used
when computing a p-value of acceleration. The variance of the posterior is computed exactly, but
the confidence interval is based on the assumption that the combined distribution will be
approximately normal (true for large numbers of sites by central limit theorem).

--quantiles, -q

(For use with --null or --posterior) Report quantiles of distribution rather than whole
distribution.

NAME

DESCRIPTION

EXAMPLE

OPTIONS

SEE ALSO