NAME

       cmalign - align sequences to a covariance model

SYNOPSIS

       cmalign
              [options] <cmfile> <seqfile>

DESCRIPTION

cmalign aligns the RNA sequences in <seqfile> to the covariance model (CM) in <cmfile>. The new
alignment is output to stdout in Stockholm format, but can be redirected to a file <f> with the -o <f>
option.

Either <cmfile> or <seqfile> (but not both) may be '-' (dash), which means reading this input from stdin
rather than a file.

The sequence file <seqfile> must be in FASTA or Genbank format.

cmalign uses an HMM banding technique to accelerate alignment by default as described below for the
--hbanded option. HMM banding can be turned off with the --nonbanded option.

By default, cmalign computes the alignment with maximum expected accuracy that is consistent with
constraints (bands) derived from an HMM, using a banded version of the Durbin/Holmes optimal accuracy
algorithm. This behavior can be changed with the --cyk or --sample options.

cmalign takes special care to correctly align truncated sequences, where some nucleotides from the
beginning (5') and/or end (3') of the actual full length biological sequence are not present in the input
sequence (see DL Kolbe and SR Eddy, Bioinformatics, 25:1236-1243, 2009). This behavior is on by default,
but can be turned off with --notrunc. In previous versions of cmalign the --sub option was required to
appropriately handle truncated sequences. The --sub option is still available in this version, but the
new default method for handling truncated sequences should be as good or superior to the sub method in
nearly all cases.

The --mapali <f> option allows inclusion of the fixed training alignment used to build the CM from file
<f> within the output alignment of cmalign.

It is possible to merge two or more alignments created by the same CM using the Easel miniapp
esl-alimerge (included in the easel/miniapps/ subdirectory of Infernal). Previous versions of cmalign
included options to merge alignments but they were deprecated upon development of esl-alimerge, which is
significantly more memory efficient.

By default, cmalign will output the alignment to stdout. The alignment can be redirected to an output
file <f> with the -o <f> option. With -o, information on each aligned sequence, including score and model
alignment boundaries will be printed to stdout (more on this below).
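
The invocation rules above (at most one of <cmfile> and <seqfile> may be '-', and -o <f> redirects the
alignment to a file) can be sketched as a small Python helper that assembles the argument list. This is
only an illustration: the function name build_cmalign_command and the file names my.cm, seqs.fa and
out.sto are hypothetical, and the helper does not run cmalign itself.

```python
def build_cmalign_command(cmfile, seqfile, alignment_out=None):
    """Assemble an argv list for cmalign (hypothetical helper, not part of Infernal).

    Enforces the documented rule that <cmfile> and <seqfile> cannot both be '-'.
    """
    if cmfile == "-" and seqfile == "-":
        raise ValueError("only one of <cmfile> and <seqfile> may be read from stdin")
    argv = ["cmalign"]
    if alignment_out is not None:
        # -o <f>: write the alignment to <f>; per-sequence scores then go to stdout
        argv += ["-o", alignment_out]
    argv += [cmfile, seqfile]
    return argv


# Hypothetical file names, for illustration only.
cmd = build_cmalign_command("my.cm", "seqs.fa", alignment_out="out.sto")
```

The resulting list could be passed to subprocess.run(cmd, check=True) on a machine where cmalign is
installed.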

The output alignment will be in Stockholm format by default. This can be changed to Pfam, aligned FASTA
(AFA), A2M, Clustal, or Phylip format using the --outformat <s> option, where <s> is the name of the
desired format. As a special case, if the output alignment is large (more than 10,000 sequences or more
than 10,000,000 total nucleotides) then the output format will be Pfam format, with each sequence
appearing on a single line, for reasons of memory efficiency. For alignments larger than this, using
--ileaved will force interleaved Stockholm format, but the user should be aware that this may require a
lot of memory. --ileaved will only work for alignments up to 100,000 sequences or 100,000,000 total
nucleotides.
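
The size rule described above can be restated as a short Python sketch. It assumes that no --outformat
was given, and the exact handling of the boundary values (for example exactly 10,000 sequences) is an
assumption based on the wording "more than" and "up to". The function name is hypothetical.

```python
def default_stockholm_variant(n_seqs, n_nucleotides, ileaved=False):
    """Return the output format implied by the size rule in the text.

    Assumes --outformat was not used. Boundary handling is an assumption.
    """
    if ileaved:
        # --ileaved only works up to 100,000 sequences or 100,000,000 nucleotides
        if n_seqs > 100_000 or n_nucleotides > 100_000_000:
            raise ValueError("alignment too large for --ileaved")
        return "interleaved Stockholm"
    # Large alignments are written as Pfam (one line per sequence) to save memory
    if n_seqs > 10_000 or n_nucleotides > 10_000_000:
        return "Pfam"
    return "Stockholm"
```

For example, an alignment of 50,000 sequences falls back to Pfam by default, but can still be written
as interleaved Stockholm with --ileaved because it is below the 100,000 sequence limit.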

If the output alignment format is Stockholm or Pfam, the output alignment will be annotated with
posterior probabilities which estimate the confidence level of each aligned nucleotide. This annotation
appears as lines beginning with "#=GR <seq name> PP", one per sequence, each immediately below the
corresponding aligned sequence "<seq name>". Characters in PP lines have 12 possible values: "0-9", "*",
or ".". If ".", the position corresponds to a gap in the sequence. A value of "0" indicates a posterior
probability of between 0.0 and 0.05, "1" indicates between 0.05 and 0.15, "2" indicates between 0.15 and
0.25 and so on up to "9" which indicates between 0.85 and 0.95. A value of "*" indicates a posterior
probability of between 0.95 and 1.0. Higher posterior probabilities correspond to greater confidence that
the aligned nucleotide belongs where it appears in the alignment. With --nonbanded, the calculation of
the posterior probabilities considers all possible alignments of the target sequence to the CM. Without
--nonbanded (i.e. in default mode), the calculation considers only possible alignments within the HMM
bands. Further, the posterior probabilities are conditional on the truncation mode of the alignment. For
example, if the sequence alignment is truncated 5', a PP value of "9" indicates that between 0.85 and
0.95 of all 5' truncated alignments include the given nucleotide at the given position. The posterior annotation
can be turned off with the --noprob option. If --small is enabled, posterior annotation must also be
turned off using --noprob.
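
The mapping from PP characters to probability ranges described above can be written as a small Python
lookup. This is a sketch for reading PP lines in downstream scripts, not code from Infernal; the
function name pp_range is hypothetical.

```python
def pp_range(ch):
    """Map one PP annotation character to its (low, high) posterior probability range.

    Returns None for '.', which marks a gap in the sequence.
    """
    if ch == ".":
        return None
    if ch == "*":
        return (0.95, 1.0)
    if ch == "0":
        return (0.0, 0.05)
    if len(ch) == 1 and ch in "123456789":
        d = int(ch)
        # "1" covers 0.05-0.15, "2" covers 0.15-0.25, ..., "9" covers 0.85-0.95
        return ((10 * d - 5) / 100, (10 * d + 5) / 100)
    raise ValueError(f"not a PP character: {ch!r}")


# Decode a short, made-up PP string character by character.
ranges = [pp_range(c) for c in "*9.5"]
```

Here ranges holds one entry per alignment column, with None in the gap column.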

The tabular output that is printed to stdout if the -o option is used includes one line per sequence and
twelve fields per line: "idx": the index of the sequence in the input file; "seq name": the sequence
name; "length": the length of the sequence; "cm from" and "cm to": the model start and end positions of
the alignment; "trunc": "no" if the sequence is not truncated, "5'" if the beginning of the sequence is
truncated, "3'" if the end of the sequence is truncated, and "5'&3'" if both the beginning and the end
are truncated; "bit sc": the bit score of the alignment; "avg pp": the average posterior probability of
all aligned nucleotides in the alignment; "band calc", "alignment" and "total": the time in seconds
required for calculating HMM bands, computing the alignment, and complete processing of the sequence,
respectively; "mem (Mb)": the size in Mb of all dynamic programming matrices required for aligning the
sequence. This tabular data can be saved to file <f> with the --sfile <f> option.
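
A minimal Python sketch for reading one line of this tabular output follows. It assumes that the
twelve fields are separated by whitespace, that sequence names contain no whitespace, and that header
or comment lines begin with "#"; none of these layout details are stated above, so check them against
real output. The sample line is made up for illustration, and --verbose output (which adds fields) is
not handled.

```python
from dataclasses import dataclass


@dataclass
class SeqRow:
    idx: int
    seq_name: str
    length: int
    cm_from: int
    cm_to: int
    trunc: str        # "no", "5'", "3'" or "5'&3'"
    bit_sc: float
    avg_pp: float
    band_calc: float  # seconds
    alignment: float  # seconds
    total: float      # seconds
    mem_mb: float


def parse_score_line(line):
    """Parse one per-sequence line; return None for blank or '#' lines (assumed layout)."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    tokens = stripped.split()
    if len(tokens) != 12:
        raise ValueError(f"expected 12 fields, found {len(tokens)}")
    return SeqRow(
        int(tokens[0]), tokens[1], int(tokens[2]), int(tokens[3]), int(tokens[4]),
        tokens[5], *(float(t) for t in tokens[6:]),
    )


# Made-up example line, not real cmalign output.
row = parse_score_line("1  example-seq  72  1  71  no  67.98  0.999  0.00  0.02  0.02  0.08")
```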

OPTIONS

       -h     Help; print a brief reminder of command line usage and available options.

       -o <f> Save  the  alignment  in  Stockholm  format to a file <f>.  The default is to write it to standard
              output.

       -g     Configure the model for global alignment of the query model to the target sequences. By default,
              the model is configured for local alignment. Local alignments can contain large insertions and
              deletions in the structure, called "local ends", which are penalized differently than normal
              indels. These are annotated as "~" columns in the RF line of the output alignment. The -g option
              can be used to disallow these local ends. The -g option is required if the --sub option is also
              used.

OPTIONS FOR CONTROLLING THE ALIGNMENT ALGORITHM

--optacc
Align sequences using the Durbin/Holmes optimal accuracy algorithm. This is the default. The
optimal accuracy alignment will be constrained by HMM bands for acceleration unless the
--nonbanded option is enabled. The optimal accuracy algorithm determines the alignment that
maximizes the posterior probabilities of the aligned nucleotides within it. The posterior
probabilities are determined using (possibly HMM banded) variants of the Inside and Outside
algorithms.

--cyk Do not use the Durbin/Holmes optimal accuracy algorithm to align the sequences. Instead, use the
CYK algorithm, which determines the optimally scoring (maximum likelihood) alignment of the
sequence to the model, given the HMM bands (unless --nonbanded is also enabled).

--sample
Sample an alignment from the posterior distribution of alignments. The posterior distribution is
determined using an HMM banded (unless --nonbanded) variant of the Inside algorithm.

--seed <n>
Seed the random number generator with <n>, an integer >= 0. This option can only be used in
combination with --sample. If <n> is nonzero, stochastic sampling of alignments will be
reproducible; the same command will give the same results. If <n> is 0, the random number
generator is seeded arbitrarily, and stochastic samplings may vary from run to run of the same
command. The default seed is 181.

--notrunc
Turn off truncated alignment algorithms. All sequences in the input file will be assumed to be
full length, unless --sub is also used, in which case the program can still handle truncated
sequences but will use an alternative strategy for their alignment.

--sub Turn on the sub model construction and alignment procedure. For each sequence, an HMM is first
used to predict the model start and end consensus columns, and a new sub CM is constructed that
only models consensus columns from start to end. The sequence is then aligned to this sub CM. Sub
alignment is an older method than the default one for aligning sequences that are possibly
truncated. By default, cmalign uses special DP algorithms to handle truncated sequences which
should be more accurate than the sub method in most cases. --sub is still included as an option
mainly for testing against this default truncated sequence handling. This "sub CM" procedure is
not the same as the "sub CMs" described by Weinberg and Ruzzo.

OPTIONS FOR CONTROLLING SPEED AND MEMORY REQUIREMENTS

--hbanded
This option is turned on by default. Accelerate alignment by pruning away regions of the CM DP
matrix that are deemed negligible by an HMM. First, each sequence is scored with a CM plan 9 HMM
derived from the CM using the Forward and Backward HMM algorithms to calculate posterior
probabilities that each nucleotide aligns to each state of the HMM. These posterior probabilities
are used to derive constraints (bands) on the CM DP matrix. Finally, the target sequence is
aligned to the CM using the banded DP matrix, during which cells outside the bands are ignored.
Usually most of the full DP matrix lies outside the bands (often more than 95%), making this
technique faster because fewer DP calculations are required, and more memory efficient because
only cells within the bands need be allocated.

Importantly, HMM banding sacrifices the guarantee of determining the optimally accurate or
optimally scoring alignment, which will be missed if it lies outside the bands. The tau parameter is the
amount of probability mass considered negligible during HMM band calculation; higher values of tau
yield greater speedups but also a greater chance of missing the optimal alignment. The default tau
is 1E-7, determined empirically as a good tradeoff between sensitivity and speed, though this
value can be changed with the --tau <x> option. The level of acceleration increases with both the
length and primary sequence conservation level of the family. For example, with the default tau of
1E-7, tRNA models (low primary sequence conservation with length of about 75 nucleotides) show
about 10X acceleration, and SSU bacterial rRNA models (high primary sequence conservation with
length of about 1500 nucleotides) show about 700X. HMM banding can be turned off with the
--nonbanded option.

--tau <x>
Set the tail loss probability used during HMM band calculation to <x>. This is the amount of
probability mass within the HMM posterior probabilities that is considered negligible. The default
value is 1E-7. In general, higher values will result in greater acceleration, but increase the
chance of missing the optimal alignment due to the HMM bands.

--mxsize <x>
Set the maximum allowable total DP matrix size to <x> megabytes. By default this size is 1024 Mb.
This should be large enough for the vast majority of alignments. If it is not, cmalign will
attempt to iteratively tighten the HMM bands it uses to constrain the alignment by raising the tau
parameter and recalculating the bands until the total matrix size needed falls below <x> megabytes
or the maximum allowable tau value (0.05 by default, but changeable with --maxtau) is reached. At
each iteration of band tightening, tau is multiplied by 2.0. The band tightening strategy can be
turned off with the --fixedtau option. If the maximum tau is reached and the required matrix size
still exceeds <x>, or if HMM banding is not being used and the required matrix size exceeds <x>,
then cmalign will exit prematurely and report an error message that the matrix exceeded its
maximum allowable size. In this case, the --mxsize option can be used to raise the size limit or the
maximum tau can be raised with --maxtau. The limit will commonly be exceeded when the --nonbanded
option is used without the --small option, but can still occur when --nonbanded is not used. Note
that if cmalign is being run with <n> threads on a multicore machine, then each thread may
have an allocated matrix of up to size <x> Mb at any given time.
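
The band tightening schedule described above can be illustrated with a few lines of Python that list
the tau values tried when starting from the default tau of 1E-7 and doubling up to the default maximum
of 0.05. This only models the arithmetic; what cmalign does at the final step (for example, whether it
also tries the maximum tau itself) is not stated here, so this sketch simply stops at the last value
that does not exceed the maximum. The function name is hypothetical.

```python
def tau_schedule(tau=1e-7, maxtau=0.05, factor=2.0):
    """List successive tau values: start at tau, multiply by factor, stop above maxtau."""
    taus = []
    while tau <= maxtau:
        taus.append(tau)
        tau *= factor
    return taus


schedule = tau_schedule()
n_doublings = len(schedule) - 1  # number of tightening steps available with the defaults
```

With the defaults, tau can be doubled 18 times before the next doubling would exceed 0.05, so a matrix
that is still too large after that many tightening steps leads to the error described above.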

--fixedtau
Turn off the HMM band tightening strategy described in the explanation of the --mxsize option
above.

--maxtau <x>
Set the maximum allowed value for tau during band tightening, described in the explanation of
--mxsize above, to <x>. By default this value is 0.05.

--nonbanded
Turns off HMM banding. The returned alignment is guaranteed to be the globally optimally accurate
one (by default) or the globally optimally scoring one (if --cyk is enabled). The --small option
is recommended in combination with this option, because standard alignment without HMM banding
requires a lot of memory (see --small).

--small
Use the divide and conquer CYK alignment algorithm described in SR Eddy, BMC Bioinformatics 3:18,
2002. The --nonbanded option must be used in combination with this option. Also, it is
recommended whenever --nonbanded is used that --small is also used because standard CM alignment
without HMM banding requires a lot of memory, especially for large RNAs. --small allows CM
alignment within practical memory limits, reducing the memory required for aligning LSU rRNA, the
largest known RNAs, from 150 Gb to less than 300 Mb. This option can only be used in combination
with --noprob, --nonbanded, --notrunc, and --cyk.
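
A wrapper script can check this requirement before launching a long run. The Python sketch below reads
the sentence above as meaning that all four companion options must be present whenever --small is
used; that reading, and the function name, are assumptions.

```python
REQUIRED_WITH_SMALL = ("--noprob", "--nonbanded", "--notrunc", "--cyk")


def missing_for_small(options):
    """Return the companion options that --small needs but that are absent from options."""
    if "--small" not in options:
        return []
    return [opt for opt in REQUIRED_WITH_SMALL if opt not in options]


# Example: --small given with only two of its four companions.
missing = missing_for_small(["--small", "--nonbanded", "--cyk"])
```

An empty result means the option list is consistent with the requirement.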

OPTIONAL OUTPUT FILES

       --sfile <f>
              Dump per-sequence alignment score and timing information to file <f>. The format of this file is
              described above (it's the same data in the same format as the tabular stdout output when the -o
              option is used).

       --tfile <f>
              Dump tabular sequence tracebacks for each individual sequence to a file <f>.  Primarily useful for
              debugging.

       --ifile <f>
              Dump per-sequence insert information to file  <f>.   The  format  of  the  file  is  described  by
              "#"-prefixed  comment  lines included at the top of the file <f>.  The insert information is valid
              even when the --matchonly option is used.

       --elfile <f>
              Dump per-sequence EL state (local end) insert information to file <f>.  The format of the file  is
              described  by  "#"-prefixed  comment  lines  included  at  the top of the file <f>.  The EL insert
              information is valid even when the --matchonly option is used.

OTHER OPTIONS

--mapali <f>
Read the alignment from file <f> that was used to build the model and align it as a single object to
the CM; i.e. the alignment in <f> is held fixed. This allows you to align sequences to a model with
cmalign and view them in the context of an existing trusted multiple alignment. <f> must be the
alignment file that the CM was built from. The program verifies that the checksum of the file
matches that of the file used to construct the CM. A similar option to this one was called
--withali in previous versions of cmalign.

--mapstr
Must be used in combination with --mapali <f>. Propagate structural information for any
pseudoknots that exist in <f> to the output alignment. A similar option to this one was called
--withstr in previous versions of cmalign.

--informat <s>
Assert that the input <seqfile> is in format <s>. Do not run Babelfish format autodetection. This
increases the reliability of the program somewhat, because the Babelfish can make mistakes;
particularly recommended for unattended, high-throughput runs of Infernal. Acceptable formats
are: FASTA, GENBANK, and DDBJ. <s> is case-insensitive.

--outformat <s>
Specify the output alignment format as <s>. Acceptable formats are: Pfam, AFA, A2M, Clustal, and
Phylip. AFA is aligned fasta. Only Pfam and Stockholm alignment formats will include consensus
structure annotation and posterior probability annotation of aligned residues.

--dnaout
Output the alignments as DNA sequence alignments, instead of RNA ones.

--noprob
Do not annotate the output alignment with posterior probabilities.

--matchonly
Only include match columns in the output alignment; do not include any insertions relative to the
consensus model. This option may be useful when creating very large alignments that require a lot
of memory and disk space, most of which is necessary only to deal with insert columns that are
gaps in most sequences.

--miss In the output alignment, use missing data characters ('~') before the first residue and after the
final residue of each sequence to indicate the sequence was aligned with a truncated alignment
algorithm. The aligned sequences would be considered fragments if the alignment was used
subsequently as input to cmbuild with the --fraggiven option. This option has no effect if
--notrunc is also used.

--ileaved
Output the alignment in interleaved Stockholm format of a fixed width that may be more convenient
for examination. This was the default output alignment format of previous versions of cmalign.
Note that cmalign requires more memory when this option is used. For this reason, --ileaved will
only work for alignments of up to 100,000 sequences or a total of 100,000,000 aligned nucleotides.

--flanktoins <x1>
Change the transition probabilities from the ROOT_S state to the ROOT_IL and ROOT_IR states, and
from the ROOT_IL to the ROOT_IR state to <x1>. This option is meant to be helpful only when
aligning sequences that include extra sequence at the 5' and/or 3' ends. Without using this option
cmalign tends to misalign the ends of such sequences, especially for models with zero basepairs. This
option should not be necessary when aligning sequences identified by cmsearch or cmscan because
they should not include extra sequence at the ends. This option must be used in combination with
the --flankselfins <x2> option. Recommended values to use are 0.1 for <x1> and 0.8 for <x2>, but
the best performing pair of values may vary for different models. <x1> must be greater than 0.0
and less than 0.4, and the sum of <x1> and <x2> must be less than 0.95.

--flankselfins <x2>
Change the self-transition probabilities for the ROOT_IL and ROOT_IR states to <x2>. This option
must be used in combination with the --flanktoins <x1> option. See the explanation of that option
above for more information.
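
The constraints on <x1> and <x2> given for --flanktoins can be checked up front with a short Python
sketch. Only the two stated constraints are tested; any further limits on <x2> alone are not described
here and are not checked. The function name is hypothetical.

```python
def check_flank_probs(x1, x2):
    """Validate --flanktoins <x1> and --flankselfins <x2> against the stated constraints."""
    if not (0.0 < x1 < 0.4):
        raise ValueError("<x1> must be greater than 0.0 and less than 0.4")
    if not (x1 + x2 < 0.95):
        raise ValueError("the sum of <x1> and <x2> must be less than 0.95")
    return True


# The recommended pair from the text satisfies both constraints.
ok = check_flank_probs(0.1, 0.8)
```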

--regress <s>
Save an additional copy of the output alignment with no author information to file <s>.

--verbose
Output additional information in the tabular scores output (output to stdout if -o is used, or to
<f> if --sfile <f> is used). These are mainly useful for testing and debugging.

--cpu <n>
Set the number of parallel worker threads to <n>. On multicore machines, the default is 4. You
can also control this number by setting an environment variable, INFERNAL_NCPU. There is also a
master thread, so the actual number of threads that Infernal spawns is <n>+1. This option is not
available if Infernal was compiled with POSIX threads support turned off.

--mpi Run as an MPI parallel program. This option will only be available if Infernal has been configured
and built with the "--enable-mpi" flag (see the Installation section of the user guide for more
information).

COPYRIGHT

       Copyright (C) 2023 Howard Hughes Medical Institute.
       Freely distributed under the BSD open source license.

       For  additional  information  on  copyright and licensing, see the file called COPYRIGHT in your Infernal
       source distribution, or see the Infernal web page (http://eddylab.org/infernal/).

AUTHOR

       http://eddylab.org

Infernal 1.1.5                                      Sep 2023                                          cmalign(1)