Provided by: samtools_1.19.2-1build2_amd64

NAME

       samtools-consensus - produces a consensus FASTA/FASTQ/PILEUP

SYNOPSIS

       samtools consensus [-saAMq] [-r region] [-f format] [-l line-len] [-d min-depth] [-C cutoff]
       [-c call-fract] [-H het-fract] in.bam

DESCRIPTION

       Generate consensus from a SAM, BAM or CRAM file based on the contents  of  the  alignment  records.   The
       consensus  is written either as FASTA, FASTQ, or a pileup oriented format.  This is selected using the -f
       FORMAT option.

       The default output for FASTA and FASTQ formats includes one base per non-gap consensus position.  Hence
       insertions with respect to the aligned reference will be included and deletions removed.  This behaviour
       can be controlled with the --show-ins and --show-del options.  This could be used to compute a new
       reference from sequence assemblies to realign against.

       The pileup-style format strictly adheres to one row per consensus location, differing from the one row
       per reference base used in the related "samtools mpileup" command.  This means the base quality values
       for inserted columns are reported.  The base quality value of a gap (either within an insertion or
       otherwise) is determined as the average of the surrounding non-gap bases.  The columns shown are the
       reference name, position, nth base at that position (zero if not an insertion), consensus call,
       consensus confidence, sequences and quality values.
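
       As an illustration of the gap quality rule, the following Python sketch averages the nearest non-gap
       quality on each side of a gap.  It is not the samtools implementation: the function name, the use of
       None to mark a gap, the choice of nearest neighbours and the handling of gaps with no neighbour are
       all assumptions made for this example.

```python
def gap_quality(quals, idx):
    """Estimate a quality for the gap at quals[idx] (gaps are marked as None)."""
    # Nearest non-gap quality to the left and to the right of the gap.
    left = next((q for q in reversed(quals[:idx]) if q is not None), None)
    right = next((q for q in quals[idx + 1:] if q is not None), None)
    flank = [q for q in (left, right) if q is not None]
    # Assumption: with no non-gap neighbour at all, report zero.
    return sum(flank) / len(flank) if flank else 0
```

       With qualities [30, None, None, 20], both gap columns would be given the mean of 30 and 20.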

       Two consensus calling algorithms are offered.  The default computes a heterozygous consensus in a
       Bayesian manner, derived from the "Gap5" consensus algorithm.  Quality values are also tweaked to take
       into account other nearby low quality values.  This adjustment can be disabled using the --no-adj-qual
       option.

       This  method  also  utilises  the  mapping  qualities,  unless  the  --no-use-MQ option is used.  Mapping
       qualities are also auto-scaled to take into account the local reference variation by processing the  MD:Z
       tag,  unless  --no-adj-MQ  is  used.   Mapping  qualities  can be capped between a minimum (--low-MQ) and
       maximum (--high-MQ), although the defaults are liberal and trust the data to be true.  Finally, an
       overall scale on the resulting mapping quality can be supplied (--scale-MQ, defaulting to 1.0).  Values
       greater than 1.0 favour making more calls at the cost of a higher false positive rate, while values
       less than 1.0 are more cautious, giving a higher false negative rate and a lower false positive rate.
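
       The capping and scaling steps can be pictured with a short Python sketch.  The function and parameter
       names are hypothetical, the MD:Z based adjustment is left out, and applying the caps before the scale
       is an assumption drawn from the wording above rather than from the samtools source.

```python
def adjust_mapq(mapq, low_mq=0, high_mq=60, scale_mq=1.0):
    """Clamp a mapping quality to [low_mq, high_mq], then apply the scale."""
    # The caps are not filters: out-of-range values are pulled to the limit.
    capped = max(low_mq, min(high_mq, mapq))
    # Assumption: the multiplicative scale is applied after capping.
    return capped * scale_mq
```

       For example, with --low-MQ 5 and --scale-MQ 1.5, a read with mapping quality 2 would be treated under
       this sketch as having quality 7.5 rather than being discarded.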

       The second method is a simple frequency counting algorithm, summing either +1 for each base type or +qual
       if the --use-qual option is specified.  This is enabled with the --mode simple option.

       The summed share of a specific base type is then compared against the total possible and if this is above
       the --call-fract fraction parameter then the most likely base type is called, or "N" otherwise (or absent
       if it is a gap).  The --ambig option permits generation of ambiguity codes instead of "N", provided the
       ratio of the second most common base type to the most common is above the --het-fract fraction.
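
       A minimal Python sketch of this frequency counting rule for a single column follows.  It is an
       illustration only, not the samtools code: the function name and signature are hypothetical, gaps and
       the lower-case "half-present" convention are not handled, and it assumes the heterozygous test is only
       attempted when ambiguity codes are enabled.

```python
from collections import Counter

# IUPAC codes for two-base ambiguities.
IUPAC = {
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
}


def simple_call(bases, *, het_fract, call_fract=0.75, ambig=False, quals=None):
    """Call one consensus column from the observed bases (A, C, G, T only)."""
    weights = Counter()
    for i, base in enumerate(bases):
        # +1 per observation, or +qual when quality values are supplied.
        weights[base] += quals[i] if quals is not None else 1
    total = sum(weights.values())
    if total == 0:
        return "N"
    ranked = weights.most_common()
    top, top_weight = ranked[0]
    if ambig and len(ranked) > 1:
        second, second_weight = ranked[1]
        # Heterozygous if the runner-up reaches het_fract of the leader.
        if second_weight >= het_fract * top_weight:
            return IUPAC[frozenset((top, second))]
    # Otherwise the leader must hold at least call_fract of the total.
    return top if top_weight / total >= call_fract else "N"
```

       Under these assumptions, a column of four A and four C observations gives "N" without ambig and the
       IUPAC code "M" with ambig=True and het_fract=0.5 (an illustrative value).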

OPTIONS

       General options that apply to both algorithms:

       -r REG, --region REG
                 Limit the query to region REG.  This requires an index.

       -f FMT, --format FMT
                 Produce format FMT, with "fastq", "fasta" and "pileup" as permitted options.

       -l N, --line-len N
                 Sets the maximum line length of line-wrapped fasta and fastq formats to N.

       -o FILE, --output FILE
                 Output consensus to FILE instead of stdout.

       -m STR, --mode STR
                 Select the consensus algorithm.  Valid modes are "simple" frequency counting and the "bayesian"
                 (Gap5) methods, with Bayesian being the default.  (Note case does not matter, so "Bayesian" is
                 accepted too.)  There are several Bayesian variants.  Plain "bayesian" selects the internal
                 parameter set best suited to the other parameters given, and that choice may change
                 depending on the "--P-indel" score.  This method distinguishes between substitution and indel
                 error rates.  The old Samtools consensus in version 1.16 did not distinguish types of errors,
                 but for compatibility the "bayesian_116" mode may be selected to replicate this.

       -a        Outputs  all  bases, from start to end of reference, even when the aligned data does not extend
                 to the ends.  This is most useful for construction of a full length reference sequence.

       -a -a, -aa
                 Output absolutely all positions, including references with no data aligned against them.

       --rf, --incl-flags STR|INT
                 Only include reads with at least one FLAG bit set.  Defaults to zero, which filters no reads.

       --ff, --excl-flags STR|INT
                 Exclude reads with any FLAG bit set.  Defaults to "UNMAP,SECONDARY,QCFAIL,DUP".

       --min-MQ INT
                 Filters out reads with a mapping quality below INT.  This defaults to zero.

       --min-BQ INT
                 Filters out bases with a base quality below INT.  This defaults to zero.

       --show-del yes/no
                 Whether to show deletions as "*" (yes) or to omit from the output (no).  Defaults to no.

       --show-ins yes/no
                 Whether to show insertions in the consensus.  Defaults to yes.

       --mark-ins
                 Insertions, when shown, are normally recorded in the consensus with plain 7-bit ASCII (ACGT, or
                 acgt if heterozygous).  However this makes it impossible to identify the mapping between
                 consensus coordinates and the original reference coordinates.  With this option, fasta output
                 has an underscore added before every inserted base, and fastq output also has a corresponding
                 character added to the quality string.  When used in conjunction with -a --show-del yes, this
                 permits an easy derivation of the consensus to reference coordinate mapping.

       -A, --ambig
                 Enables IUPAC ambiguity codes in the consensus output.  Without this the output will be limited
                 to A, C, G, T, N and *.
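
       The coordinate mapping mentioned under --mark-ins can be derived along the following lines.  This
       Python sketch assumes output produced with -a --show-del yes --mark-ins, so that the sequence starts
       at reference position 1, every reference position contributes one character (a base or "*"), and each
       inserted base is preceded by "_".  The function name is hypothetical and the input string in the note
       below is invented.

```python
def consensus_to_reference(marked):
    """Map each consensus base to a 1-based reference position.

    Inserted bases (prefixed by "_") map to None; "*" marks a deleted
    reference position that contributes no consensus base.
    """
    mapping = []
    ref_pos = 0
    i = 0
    while i < len(marked):
        ch = marked[i]
        if ch == "_":
            # The next character is an inserted base with no reference column.
            if i + 1 < len(marked):
                mapping.append(None)
            i += 2
            continue
        ref_pos += 1
        if ch != "*":
            mapping.append(ref_pos)
        i += 1
    return mapping
```

       For the invented string "AC_G_TT*A" this yields [1, 2, None, None, 3, 5]: two inserted bases after
       reference position 2, and no consensus base for the deleted position 4.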

       The following options apply only to the simple consensus mode:

       -q, --use-qual
                 For the simple consensus algorithm, this enables use of base quality values.  Instead of
                 summing 1 per base called, it sums the base quality.  These sums are also used for the
                 --call-fract and --het-fract parameters.  Quality values are always used for the "Gap5"
                 consensus method, where this option has no effect.  Note that quality values currently only
                 affect SNPs and not inserted sequences, which are still scored with a fixed +1 per base type
                 occurrence.

       -d D, --min-depth D
                 The  minimum  depth  required  to  make  a call.  Defaults to 1.  Failing this depth check will
                 produce consensus "N", or absent if it is an insertion.  Note this  check  is  performed  after
                 filtering by flags and mapping/base quality.

       -H H, --het-fract H
                 For  consensus  columns  containing multiple base types, if the second most frequent type is at
                 least H fraction of the most common type then a heterozygous base type will be reported in  the
                 consensus.   Otherwise  the  most  common  base  is  used,  provided  it meets the --call-fract
                 parameter (otherwise "N").  The fractions computed may be modified by the use of quality values
                 if the -q option is enabled.  Note although IUPAC has ambiguity codes for A,C,G,T vs any  other
                 A,C,G,T  it does not have codes for A,C,G,T vs gap (such as in a heterozygous deletion).  Given
                 the lack of any official code, a lower-case letter is used to symbolise a half-present base type.

       -c C, --call-fract C
                 Only used for the simple consensus algorithm.  Require at least C fraction  of  bases  agreeing
                 with  the  most  likely consensus call to emit that base type.  This defaults to 0.75.  Failing
                 this check will output "N".
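
       The note under --min-depth, that the depth check happens after filtering, can be sketched in Python as
       below.  The FLAG bit values are those of the SAM format; the function names, the tuple layout and the
       keyword names are hypothetical and chosen only for this illustration.

```python
# SAM FLAG bits named in the default --excl-flags value.
UNMAP, SECONDARY, QCFAIL, DUP = 0x4, 0x100, 0x200, 0x400
DEFAULT_EXCL = UNMAP | SECONDARY | QCFAIL | DUP


def passes_filters(flag, mapq, baseq, incl_flags=0, excl_flags=DEFAULT_EXCL,
                   min_mq=0, min_bq=0):
    """Return True if one observed base survives the read and base filters."""
    if incl_flags and not (flag & incl_flags):
        return False  # none of the required bits is set
    if flag & excl_flags:
        return False  # an excluded bit is set
    return mapq >= min_mq and baseq >= min_bq


def depth_ok(observations, min_depth=1, **filters):
    """Count (flag, mapq, baseq) tuples that pass, then apply the depth check."""
    kept = [obs for obs in observations if passes_filters(*obs, **filters)]
    return len(kept) >= min_depth
```

       A column covered by three reads, one of them a duplicate and one with mapping quality 5, would pass a
       minimum depth of 2 with the default filters but fail it once min_mq=10 is applied.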

       The following options apply only to the Bayesian consensus mode (the default):

       -C C, --cutoff C
                 Only used for the Gap5 consensus mode, which  produces  a  Phred  style  score  for  the  final
                 consensus quality.  If this is below C then the consensus is called as "N".

       --use-MQ, --no-use-MQ
                 Enable or disable the use of mapping qualities.  Defaults to on.

       --adj-MQ, --no-adj-MQ
                 If  mapping  qualities  are  used, this controls whether they are scaled by the local number of
                 mismatches to the reference.  The reference is unknown by this tool, so this data  is  obtained
                 from the MD:Z auxiliary tag (or ignored if not present).  Defaults to on.

       --NM-halo INT
                 Specifies  the  distance either side of the base call being considered for computing the number
                 of local mismatches.

       --low-MQ MIN, --high-MQ MAX
                 Specifies a minimum and maximum value of the  mapping  quality.   These  are  not  filters  and
                 instead simply put upper and lower caps on the values.  The defaults are 0 and 60.

       --scale-MQ FLOAT
                 This  is  a  general multiplicative  mapping quality scaling factor.  The effect is to globally
                 raise or lower the quality values used in the consensus  algorithm.   Defaults  to  1.0,  which
                 leaves the values unchanged.

       --P-het FLOAT
                 Controls the likelihood of any position being a heterozygous site.  This is used in the priors
                 for the Bayesian calculations, and makes little difference on deep data.  Defaults to 1e-3.
                 Smaller numbers make the algorithm more likely to call a pure base type.  Note the algorithm
                 will always compute the probability of the base being homozygous vs heterozygous, irrespective
                 of whether the output is reported as ambiguous (it will be "N" if deemed to be heterozygous
                 without --ambig mode enabled).

       --P-indel FLOAT
                 Controls the likelihood of small indels.  This is used in the priors for the Bayesian
                 calculations, and makes little difference on deep data.  Defaults to 2e-4.

       --het-scale FLOAT
                 This is a multiplicative correction applied per base quality before adding to the heterozygous
                 hypotheses.  Reducing it means fewer heterozygous calls are made.  This often leads to a
                 significant reduction in false positive het calls, for some increase in false negatives
                 (mislabelling real heterozygous sites as homozygous).  It is usually beneficial to reduce this
                 on instruments where a significant proportion of bases may be aligned in the wrong column due
                 to insertions and deletions leading to alignment errors and reference bias.  It can be
                 considered as a het sensitivity tuning parameter.  Defaults to 1.0 (no change).

       -p, --homopoly-fix
                 Some  technologies  that call runs of the same base type together always put the lowest quality
                 calls at one end.  This can cause problems when reverse complementing and comparing  alignments
                 with  indels.   This  option  averages  the qualities at both ends to avoid orientation biases.
                 Recommended for old 454 or PacBio HiFi data sets.

       --homopoly-score FLOAT
                 The -p option  also  reduces  confidence  values  within  homopolymers  due  to  an  additional
                 likelihood  of  sequence  specific  errors.   The quality values are multiplied by FLOAT.  This
                 defaults to 0.5, but is  not  used  if  -p  was  not  specified.   Adjusting  this  score  also
                 automatically enables -p.

       -t, --qual-calibration FILE
                 Loads  a quality calibration table from FILE.  The format of this is a series of lines with the
                 following fields, each starting with the literal text "QUAL":

                     QUAL value substitution undercall overcall

                 Lines starting with a "#" are ignored.  Each line maps a recorded quality value  to  the  Phred
                 equivalent  score for substitution, undercall and overcall errors.  Quality values are expected
                 to be sorted in increasing numerical order, but may skip values.   This  allows  the  consensus
                 algorithm  to know the most likely cause of an error, and whether the instrument is more likely
                 to have indel errors (more common in some long read technologies) or substitution errors  (more
                 common in clocked short-read instruments).

                 Some  pre-defined  calibration  tables  are built in.  These are specified with a fake filename
                 starting with a colon.  See the -X option for more details.

                 Note due to the additional heuristics applied by the consensus algorithm,  these  recalibration
                 tables are not a true reflection of the instrument error rates and are a work in progress.

       -X, --config STR
                 Specifies  predefined  sets of configuration parameters.  Acceptable values for STR are defined
                 below, along with the list of parameters they are equivalent to.

                 hiseq     --qual-calibration :hiseq

                 hifi      --qual-calibration :hifi --homopoly-fix 0.3 --low-MQ 5 --scale-MQ 1.5
                           --het-scale 0.37

                 r10.4_sup --qual-calibration :r10.4_sup --homopoly-fix 0.3 --low-MQ 5 --scale-MQ 1.5
                           --het-scale 0.37

                 r10.4_dup --qual-calibration :r10.4_dup --homopoly-fix 0.3 --low-MQ 5 --scale-MQ 1.5
                           --het-scale 0.37

                 ultima    --qual-calibration :ultima --homopoly-fix 0.3 --low-MQ 10 --scale-MQ 2
                           --het-scale 0.37
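
       The calibration table format described under --qual-calibration is simple enough to read with a few
       lines of Python.  The sketch below is not how samtools parses the file: the function name is
       hypothetical, blank lines are skipped as an assumption, numeric fields are assumed to be plain
       numbers, and skipped quality values are simply left absent rather than filled in.

```python
def parse_qual_calibration(text):
    """Parse 'QUAL value substitution undercall overcall' lines into a dict."""
    table = {}
    previous = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comments (and, by assumption, blank lines) carry no data
        fields = line.split()
        if fields[0] != "QUAL" or len(fields) != 5:
            raise ValueError(f"unexpected line: {line!r}")
        qual = int(fields[1])
        substitution, undercall, overcall = (float(f) for f in fields[2:])
        # The table is expected in increasing order of quality value.
        if previous is not None and qual <= previous:
            raise ValueError("quality values must be in increasing order")
        table[qual] = (substitution, undercall, overcall)
        previous = qual
    return table
```

       Given a table containing the invented line "QUAL 10 9 12 14", the returned dictionary would map
       quality 10 to Phred-equivalent scores of 9 for substitution, 12 for undercall and 14 for overcall
       errors.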

EXAMPLES

       -      Create a modified FASTA reference that has a  1:1  coordinate  correspondence  with  the  original
              reference used in alignment.

                samtools consensus -a --show-ins no in.bam -o ref.fa

       -      Create a FASTQ file for the contigs with aligned data, including insertions.

                samtools consensus -f fastq in.bam -o cons.fq

AUTHOR

       Written by James Bonfield from the Sanger Institute.

SEE ALSO

       samtools(1), samtools-mpileup(1)

       Samtools website: <http://www.htslib.org/>

samtools-1.19.2                                  24 January 2024                           samtools-consensus(1)