Provided by: samtools_1.19.2-1build2_amd64

NAME

       samtools-mpileup - produces "pileup" textual format from an alignment

SYNOPSIS

       samtools  mpileup  [-EB]  [-C  capQcoef]  [-r reg] [-f in.fa] [-l list] [-Q minBaseQ] [-q minMapQ] in.bam
       [in2.bam [...]]

DESCRIPTION

       Generate text pileup output for one or multiple BAM files.  Each input file produces a separate group  of
       pileup columns in the output.

       Note that there are two orthogonal ways to specify locations in the input file: via -r region and -l
       file.  The former uses (and requires) an index to do random access, while the latter streams through the
       file contents, keeping only the specified regions, and requires no index.  The two may be used in
       conjunction.  For example, a BED file containing locations of genes in chromosome 20 could be specified
       using -r 20 -l chr20.bed, meaning that the index is used to find chromosome 20 and then it is filtered
       for the regions listed in the BED file.

       Unmapped reads are not considered and are always discarded.  By default secondary alignments, QC failures
       and duplicate reads will be omitted, along with low quality bases and some reads in high  depth  regions.
       See the --ff, -Q and -d options for changing this.

   Pileup Format
       Pileup  format  consists  of  TAB-separated  lines,  with each line representing the pileup of reads at a
       single genomic position.

       Several columns contain numeric quality values encoded as individual ASCII  characters.   Each  character
       can  range from “!” to “~” and is decoded by taking its ASCII value and subtracting 33; e.g., “A” encodes
       the numeric value 32.
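
       As an illustration of the decoding rule above, the following minimal Python sketch converts a string of
       quality characters into integers.  The function name decode_qualities is hypothetical and is not part
       of samtools.

```python
def decode_qualities(encoded: str, offset: int = 33) -> list[int]:
    """Decode ASCII-encoded quality characters into integer quality values."""
    values = []
    for ch in encoded:
        # The format only allows characters from "!" (33) to "~" (126).
        if not "!" <= ch <= "~":
            raise ValueError(f"character {ch!r} is outside the '!'..'~' range")
        values.append(ord(ch) - offset)
    return values


print(decode_qualities("A"))    # [32]
print(decode_qualities("!I~"))  # [0, 40, 93]
```

       The same function can be applied to the base quality column and, when -s/--output-MQ is used, to the
       mapping quality column.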

       The first three columns give the position and reference:

       ○ Chromosome name.

       ○ 1-based position on the chromosome.

       ○ Reference base at this position (this will be “N” on all lines if -f/--fasta-ref has not been used).

       The remaining columns show the pileup data, and are repeated for each input BAM file specified:

       ○ Number of reads covering this position.

       ○ Read bases.  This encodes information on matches, mismatches,  indels,  strand,  mapping  quality,  and
         starts and ends of reads.

         For each read covering the position, this column contains:

         • If  this  is  the  first  position  covered  by the read, a “^” character followed by the alignment's
           mapping quality encoded as an ASCII character.

         • A single character indicating the read base and the strand to which the read has been mapped:
           Forward   Reverse                    Meaning
           ───────────────────────────────────────────────────────────────
            . dot    , comma   Base matches the reference base
            ACGTN     acgtn    Base is a mismatch to the reference base
              >         <      Reference skip (due to CIGAR “N”)
              *        */#     Deletion of the reference base (CIGAR “D”)

           Deleted bases are shown as “*” on both strands unless --reverse-del is used, in which case  they  are
           shown as “#” on the reverse strand.

         • If  there  is  an  insertion  after  this  read  base, text matching “\+[0-9]+[ACGTNacgtn*#]+”: a “+”
           character followed by an integer giving the length of the insertion and then the  inserted  sequence.
           Pads  are shown as “*” unless --reverse-del is used, in which case pads on the reverse strand will be
           shown as “#”.

         • If there is a deletion after this read base, text matching “-[0-9]+[ACGTNacgtn]+”: a “-” character
           followed by an integer giving the length of the deletion and then the deleted reference bases.
           (Subsequent pileup lines will contain “*” for this read indicating the deleted bases.)

         • If this is the last position covered by the read, a “$” character.

       ○ Base qualities, encoded as ASCII characters.

       ○ Alignment  mapping qualities, encoded as ASCII characters.  (Column only present when -s/--output-MQ is
         used.)

       ○ Comma-separated 1-based positions within the alignments, in the orientation shown in  the  input  file.
         E.g.,  5  indicates  that it is the fifth base of the corresponding read that is mapped to this genomic
         position.  (Column only present when -O/--output-BP is used.)

       ○ Additional comma-separated read field columns, as selected via  --output-extra.   The  fields  selected
         appear  in  the  same  order  as  in SAM: QNAME, FLAG, RNAME, POS, MAPQ (displayed numerically), RNEXT,
         PNEXT.

       ○ Comma-separated 1-based positions within the alignments, in 5' to 3' orientation.   E.g.,  5  indicates
         that  it  is the fifth base of the corresponding read as produced by the sequencing instrument, that is
         mapped to this genomic position. (Column only present when --output-BP-5 is used.)

       ○ Additional read tag field columns, as selected via --output-extra.   These  columns  are  formatted  as
         determined  by  --output-sep  and  --output-empty  (comma-separated by default), and appear in the same
         order as the tags are given in --output-extra.

         Any output column that would be empty, for example because a tag is not present or because the
         filtered sequence depth is zero, is reported as "*".  This ensures a consistent number of columns
         across all reported positions.
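
       The read bases column described above can be split into per-read records with a small hand-written
       scanner.  The following Python sketch is one possible approach, not the samtools implementation.  It
       assumes default output (no --output-mods, --no-output-ins or --no-output-del) and a depth greater than
       zero; the function name parse_read_bases and the dictionary keys are hypothetical.

```python
BASE_CHARS = set(".,ACGTNacgtn><*#")
DIGITS = "0123456789"


def parse_read_bases(column: str) -> list[dict]:
    """Split a read bases column into one record per read."""
    reads = []
    i, n = 0, len(column)
    while i < n:
        read = {"start_mapq": None, "base": None,
                "insertion": None, "deletion": None, "end": False}
        # "^" marks the first position of a read; the next character
        # is the mapping quality encoded as ASCII (value + 33).
        if column[i] == "^":
            if i + 1 >= n:
                raise ValueError("'^' without a mapping quality character")
            read["start_mapq"] = ord(column[i + 1]) - 33
            i += 2
        if i >= n or column[i] not in BASE_CHARS:
            raise ValueError(f"expected a base character at offset {i}")
        read["base"] = column[i]
        i += 1
        # Optional markup after the base: insertion, deletion, end of read.
        while i < n and column[i] in "+-$":
            if column[i] == "$":
                read["end"] = True
                i += 1
                continue
            kind = "insertion" if column[i] == "+" else "deletion"
            j = i + 1
            while j < n and column[j] in DIGITS:
                j += 1
            if j == i + 1:
                raise ValueError(f"missing indel length at offset {i}")
            length = int(column[i + 1:j])
            seq = column[j:j + length]
            if len(seq) != length:
                raise ValueError(f"truncated indel sequence at offset {i}")
            read[kind] = seq
            i = j + length
        reads.append(read)
    return reads


reads = parse_read_bases("^I.+2AG,-1n$T*")
print(len(reads))                  # 4
print([r["base"] for r in reads])  # ['.', ',', 'T', '*']
```

       Reading the indel length first is what lets the scanner tell an inserted base from the next read's
       base.  The number of records returned should equal the read count in the preceding column.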

OPTIONS

       -6, --illumina1.3+
                 Assume the quality is in the Illumina 1.3+ encoding.

       -A, --count-orphans
                 Do not skip anomalous read pairs in variant calling.  Anomalous read pairs are those marked  in
                 the FLAG field as paired in sequencing but without the properly-paired flag set.

       -b, --bam-list FILE
                 List of input BAM files, one file per line [null]

       -B, --no-BAQ
                 Disable base alignment quality (BAQ) computation.  See BAQ below.

       -C, --adjust-MQ INT
                 Coefficient  for downgrading mapping quality for reads containing excessive mismatches. Given a
                 read with a phred-scaled probability q of being generated from the  mapped  position,  the  new
                 mapping  quality  is  about sqrt((INT-q)/INT)*INT. A zero value disables this functionality; if
                 enabled, the recommended value for BWA is 50. [0]
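
                 The approximate formula above can be evaluated directly.  This Python sketch only
                 illustrates the stated expression; the function name approx_adjusted_mapq is hypothetical,
                 and how samtools derives q from the mismatches, rounds the result, or treats q greater than
                 INT is not described here, so the sketch rejects such inputs.

```python
import math


def approx_adjusted_mapq(q: float, coef: int) -> float:
    """Evaluate sqrt((coef - q) / coef) * coef, the formula given for -C."""
    if coef <= 0:
        # A zero coefficient disables the adjustment, so there is nothing to compute.
        raise ValueError("coef must be positive; zero disables the adjustment")
    if not 0 <= q <= coef:
        # Assumption of this sketch: the text does not define the result here.
        raise ValueError("q must lie between 0 and coef")
    return math.sqrt((coef - q) / coef) * coef


for q in (0, 32, 50):
    print(q, round(approx_adjusted_mapq(q, 50), 3))
```

                 With the recommended coefficient of 50, the result falls from 50 at q = 0 to 0 at q = 50.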

       -d, --max-depth INT
                 At a position, read at most INT reads per input file. Setting this limit reduces  the  amount
                 of  memory  and  time needed to process regions with very high coverage.  Passing zero for this
                 option sets it to the highest possible value, effectively removing the depth limit. [8000]

                 Note that up to release 1.8, samtools would enforce a minimum value for this option.   This  no
                 longer happens and the limit is set exactly as specified.

       -E, --redo-BAQ
                 Recalculate BAQ on the fly, ignore existing BQ tags.  See BAQ below.

       -f, --fasta-ref FILE
                 The  faidx-indexed reference file in the FASTA format. The file can be optionally compressed by
                 bgzip.  [null]

                 Supplying a reference file will enable base alignment quality calculation for all reads aligned
                 to a reference in the file.  See BAQ below.

       -G, --exclude-RG FILE
                 Exclude reads from read groups listed in FILE (one @RG-ID per line)

       -l, --positions FILE
                 BED or position list file containing a list of regions or sites where pileup or BCF  should  be
                 generated. Position list files contain two columns (chromosome and position) and start counting
                 from  1.   BED  files  contain  at least 3 columns (chromosome, start and end position) and are
                 0-based half-open.
                 While it is possible to mix both position-list and BED coordinates in the same file, this is
                 strongly discouraged due to the differing coordinate systems. [null]
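
                 The difference between the two coordinate systems can be made concrete with a short Python
                 sketch.  The function name bed_to_positions is hypothetical; it lists the 1-based positions
                 (as used by position list files and pileup output) covered by one BED interval.

```python
def bed_to_positions(start: int, end: int) -> list[int]:
    """Return the 1-based positions covered by a 0-based half-open BED interval."""
    if not 0 <= start < end:
        raise ValueError("expected 0 <= start < end")
    # BED start is 0-based and inclusive, BED end is exclusive,
    # so the interval covers 1-based positions start+1 .. end.
    return list(range(start + 1, end + 1))


# A BED line "chr20  99  102" covers 1-based positions 100, 101 and 102.
print(bed_to_positions(99, 102))  # [100, 101, 102]
```

                 A position list line "chr20 100" therefore refers to the same site as the BED line
                 "chr20 99 100".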

       -q, --min-MQ INT
                 Minimum mapping quality for an alignment to be used [0]

       -Q, --min-BQ INT
                 Minimum base quality for a base to be considered. [13]

                 Note  base-quality  0 is used as a filtering mechanism for overlap removal which marks bases as
                 having quality zero and lets the base quality filter remove them.  Hence using --min-BQ 0  will
                 make the overlapping bases reappear, albeit with quality zero.

       -r, --region STR
                 Only generate pileup in region STR.  Requires the BAM files to be indexed.  If used in
                 conjunction with -l then considers the intersection of the two requests. [all sites]

       -R, --ignore-RG
                 Ignore RG tags. Treat all reads in one BAM as one sample.

       --rf, --incl-flags STR|INT
                 Required flags: only include reads with any  of  the  mask  bits  set  [null].   Note  this  is
                 implemented  as  a filter-out rule, rejecting reads that have none of the mask bits set.  Hence
                 this does not override the --excl-flags option.

       --ff, --excl-flags STR|INT
                 Filter flags: skip reads with any of the mask bits set.  This defaults to SECONDARY,QCFAIL,DUP.
                 The option is not cumulative, so specifying e.g. --ff QCFAIL will re-enable output of
                 secondary and duplicate alignments.  Note this does not override the --incl-flags option.
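
                 The way --rf and --ff combine can be sketched in Python as two independent rejection rules.
                 This is an illustration of the behaviour described above, not samtools code; the names
                 flag_mask and passes_flag_filters are hypothetical, and only comma-separated flag names or
                 decimal integers are handled.

```python
# SAM FLAG bits, using the flag names accepted by samtools.
FLAGS = {
    "PAIRED": 0x1, "PROPER_PAIR": 0x2, "UNMAP": 0x4, "MUNMAP": 0x8,
    "REVERSE": 0x10, "MREVERSE": 0x20, "READ1": 0x40, "READ2": 0x80,
    "SECONDARY": 0x100, "QCFAIL": 0x200, "DUP": 0x400, "SUPPLEMENTARY": 0x800,
}


def flag_mask(spec) -> int:
    """Turn 'READ2,REVERSE', '144' or 144 into an integer bit mask."""
    if isinstance(spec, int):
        return spec
    if spec.isdigit():
        return int(spec)
    mask = 0
    for name in spec.split(","):
        mask |= FLAGS[name.strip()]
    return mask


DEFAULT_EXCL = flag_mask("SECONDARY,QCFAIL,DUP")


def passes_flag_filters(flag: int, incl: int = 0, excl: int = DEFAULT_EXCL) -> bool:
    """Apply the --rf (incl) and --ff (excl) rules to one read's FLAG value."""
    if flag & FLAGS["UNMAP"]:
        return False  # unmapped reads are always discarded
    if incl and not flag & incl:
        return False  # --rf: reject reads with none of the mask bits set
    if flag & excl:
        return False  # --ff: reject reads with any of the mask bits set
    return True


print(flag_mask("READ2,REVERSE"))            # 144
print(passes_flag_filters(147, incl=144))    # True
print(passes_flag_filters(99, incl=144))     # False
```

                 Because both rules only ever reject reads, neither option can rescue a read that the other
                 one has filtered out.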

       -x, --ignore-overlaps-removal, --disable-overlap-removal
                 Overlap detection and removal is enabled by default.  This option turns it off.

                 When enabled, where the ends of a read-pair overlap, one base is selected at each position of
                 the overlapping region and the duplicate base is nullified by setting its phred score to
                 zero.  It will then be discarded by the --min-BQ option unless this is zero.

                 The quality value of the retained base within an overlap will be the sum of the qualities of
                 the two bases if they agree, or 0.8 times the higher of the two qualities if they disagree,
                 with the retained nucleotide being the higher-confidence call.
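
                 The quality rule for overlapping bases can be written out as a short Python sketch.  It
                 follows the description above only: the function name resolve_overlap is hypothetical, and
                 the truncation to an integer and the handling of ties are assumptions of this sketch rather
                 than documented behaviour.

```python
def resolve_overlap(base1: str, qual1: int, base2: str, qual2: int) -> tuple[str, int]:
    """Return the retained base and its quality for one overlapping position."""
    if base1 == base2:
        # Both ends agree: keep the base and add the two qualities.
        return base1, qual1 + qual2
    # The ends disagree: keep the higher-quality call at 0.8 times its quality.
    # Assumption: ties go to the first base and the result is truncated.
    if qual1 >= qual2:
        return base1, int(0.8 * qual1)
    return base2, int(0.8 * qual2)


print(resolve_overlap("A", 30, "A", 20))  # ('A', 50)
print(resolve_overlap("A", 30, "C", 20))  # ('A', 24)
```

                 The other base of the pair is the one whose quality is set to zero.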

       -X        Include customized index files as part of the arguments.  See the EXAMPLES section for sample usage.

       Output Options:

       -o, --output FILE
                 Write pileup output to FILE, rather than the default of standard output.

       -O, --output-BP
                 Output base positions on reads in orientation listed in the SAM file (left to right).

       --output-BP-5
                 Output base positions on reads in their original 5' to 3' orientation.

       -s, --output-MQ
                 Output mapping qualities encoded as ASCII characters.

       --output-QNAME
                 Output  an  extra  column  containing comma-separated read names.  Equivalent to --output-extra
                 QNAME.

       --output-extra STR
                 Output extra columns containing comma-separated values of read fields or read tags.  The names
                 of the selected fields have to be provided as they are described in the SAM Specification (page
                 6) and will be output by the mpileup command in the same order as in the document (i.e. QNAME,
                 FLAG, RNAME, ...).  The names are case sensitive.  Currently, only the following fields are
                 supported:

                 QNAME, FLAG, RNAME, POS, MAPQ, RNEXT, PNEXT

                 Anything that is not on this list is treated as a potential tag, although only two-character
                 tags are accepted.  In the mpileup output, tag columns are displayed in the order they were
                 provided by the user on the command line.  Field and tag names have to be provided in a
                 comma-separated string to the mpileup command.  Tags of type B (byte array) are not supported.
                 An absent or unsupported tag will be listed as "*".  E.g.

                 samtools mpileup --output-extra FLAG,QNAME,RG,NM in.bam

                 will  display  four  extra  columns  in  the  mpileup  output, the first being a list of comma-
                 separated read names, followed by a list of flag values, a list of RG tag values and a list  of
                 NM tag values. Field values are always displayed before tag values.

       --output-sep CHAR
                 Specify  a  different  separator character for tag value lists, when those values might contain
                 one or more commas (,), which is the default list separator.  This option only affects  columns
                 for  two-letter  tags  like  NM; standard fields like FLAG or QNAME will always be separated by
                 commas.

       --output-empty CHAR
                 Specify a different 'no value' character for tag list entries corresponding to reads that don't
                 have a tag requested with the --output-extra option. The default is *.

                 This option only applies to rows that have at least one read in the pileup, and only to columns
                 for two-letter tags.  Columns for empty rows will always be printed as *.

       -M, --output-mods
                 Adds base modification markup into the sequence column.  This uses the Mm and Ml auxiliary tags
                 (or their uppercase equivalents).  Any base in the sequence output may be followed by a  series
                 of strand code quality strings enclosed within square brackets where strand is "+" or "-", code
                 is a single character (such as "m" or "h") or a ChEBI numeric in parentheses, and quality is an
                 optional  numeric  quality  value.   For  example  a  "C"  base with possible 5mC and 5hmC base
                 modification may be reported as "C[+m179+h40]".

                 Quality values are from 0 to 255 inclusive, representing a linear scale of probability 0.0 to
                 1.0 in increments of 1/256.  If quality values are absent (no Ml tag) these are omitted, giving
                 an example string of "C[+m+h]".

                 Note  the  base modifications may be identified on the reverse strand, either due to the native
                 ability for this detection by the sequencing instrument or by the sequence  subsequently  being
                 reverse  complemented.   This  can  lead  to modification codes, such as "m" meaning 5mC, being
                 shown for their complementary bases, such as "G[-m50]".

                 When --output-mods is selected base modifications can  appear  on  any  base  in  the  sequence
                 output,  including  during  insertions.  This may make parsing the string more complex, so also
                 see the --no-output-ins-mods and --no-output-ins options to simplify this process.
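
                 The bracketed markup can be decoded with a pair of regular expressions.  The Python sketch
                 below is one possible parser for a single base followed by its optional modification block.
                 The function name parse_modified_base is hypothetical, and the sketch assumes that
                 single-character codes are letters.

```python
import re

# One base character, optionally followed by a [...] modification block.
TOKEN = re.compile(r"(.)(?:\[([^\]]*)\])?")
# strand (+ or -), code (a letter or a parenthesised ChEBI number), optional quality.
MOD = re.compile(r"([+-])([A-Za-z]|\(\d+\))(\d*)")


def parse_modified_base(token: str):
    """Split e.g. 'C[+m179+h40]' into the base and a list of modifications."""
    whole = TOKEN.fullmatch(token)
    if whole is None:
        raise ValueError(f"unrecognised token {token!r}")
    base, markup = whole.group(1), whole.group(2) or ""
    mods, pos = [], 0
    while pos < len(markup):
        m = MOD.match(markup, pos)
        if m is None:
            raise ValueError(f"unrecognised modification markup {markup!r}")
        strand, code, qual = m.groups()
        # Quality is None when no Ml tag was present.
        mods.append((strand, code.strip("()"), int(qual) if qual else None))
        pos = m.end()
    return base, mods


print(parse_modified_base("C[+m179+h40]"))  # ('C', [('+', 'm', 179), ('+', 'h', 40)])
print(parse_modified_base("C[+m+h]"))       # ('C', [('+', 'm', None), ('+', 'h', None)])
print(parse_modified_base("G[-m50]"))       # ('G', [('-', 'm', 50)])
```

                 A full parser for the sequence column would need to apply this to every base, including
                 inserted bases, which is why the simplifying options mentioned above exist.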

       --no-output-ins
                 Do not output the inserted bases in the sequence column.  Usually this is reported as  "+length
                 sequence",  but  with this option it becomes simply "+length".  For example an insertion of AGT
                 in a pileup column changes from "CCC+3AGTGCC" to "CCC+3GCC".

                 Specifying this option twice also removes the "+length" portion, changing the example above  to
                 "CCCGCC".

                 The  purpose  of  this  change  is  to  simplify parsing using basic regular expressions, which
                 traditionally cannot perform counting operations.  It is particularly beneficial when  used  in
                 conjunction  with  --output-mods  as  the  syntax  of the inserted sequence is adjusted to also
                 report possible base modifications, but see also --no-output-ins-mods as an alternative.
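
                 The parsing problem can be reproduced with the example strings above.  In this Python
                 sketch, a plain regular expression applied to the default output swallows the bases that
                 follow the insertion, because it cannot use the length to decide where the inserted sequence
                 stops.  With --no-output-ins a simple pattern is enough.  The variable names are
                 illustrative only.

```python
import re

# Default output: the pattern from the format description is greedy and
# runs past the three inserted bases into the following read bases.
default_column = "CCC+3AGTGCC"
greedy = re.findall(r"\+[0-9]+[ACGTNacgtn*#]+", default_column)
print(greedy)  # ['+3AGTGCC']

# Output with --no-output-ins: only "+length" remains, so the insertion
# markup can be located or removed without any counting.
simplified_column = "CCC+3GCC"
lengths = [int(m) for m in re.findall(r"\+([0-9]+)", simplified_column)]
bases_only = re.sub(r"\+[0-9]+", "", simplified_column)
print(lengths)     # [3]
print(bases_only)  # CCCGCC
```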

       --no-output-ins-mods
                 Outputs the inserted bases in the sequence, but excluding any base  modifications.   This  only
                 affects output when --output-mods is also used.

       --no-output-del
                 Do not output deleted reference bases in the sequence column.  Normally this is reported as
                 "-length sequence", but with this option it becomes simply "-length".  For example a deletion
                 of 3 unknown bases (due to no reference being specified) would normally be seen in a column as
                 e.g. "CCC-3NNNGCC", but will be reported as "CCC-3GCC" with this option.

                 Specifying this option twice also removes the "-length" portion, changing the example above  to
                 "CCCGCC".

                 The  purpose  of  this  change  is  to  simplify parsing using basic regular expressions, which
                 traditionally cannot perform counting operations.  See also --no-output-ins.

       --no-output-ends
                 Removes the “^” (with mapping quality) and “$” markup from the sequence column.

       --reverse-del
                 Mark the deletions on the reverse strand with the character #, instead of the usual *.

       -a        Output all positions, including those with zero depth.

       -a -a, -aa
                 Output absolutely all positions, including unused reference sequences.  Note that when used  in
                 conjunction  with a BED file the -a option may sometimes operate as if -aa was specified if the
                 reference sequence has coverage outside of the region specified in the BED file.

       BAQ (Base Alignment Quality)

       BAQ is the Phred-scaled probability of a read base being misaligned.  It greatly helps  to  reduce  false
       SNPs  caused by misalignments.  BAQ is calculated using the probabilistic realignment method described in
       the paper “Improving SNP discovery by base alignment quality”, Heng Li, Bioinformatics, Volume 27,  Issue
       8 <https://doi.org/10.1093/bioinformatics/btr076>

       BAQ  is  turned  on  when  a  reference  file is supplied using the -f option.  To disable it, use the -B
       option.

       It is possible to store precalculated BAQ values in a SAM  BQ:Z  tag.   Samtools  mpileup  will  use  the
       precalculated  values  if it finds them.  The -E option can be used to make it ignore the contents of the
       BQ:Z tag and force it to recalculate the BAQ scores by making a new alignment.

EXAMPLES

       Using range: With implicit index files in1.bam.<ext> and in2.sam.gz.<ext>,

         samtools mpileup in1.bam in2.sam.gz -r chr10:100000-200000

       With explicit index files,

         samtools mpileup in1.bam in2.sam.gz idx/in1.csi idx/in2.csi -X -r chr10:100000-200000

       With fofn being a file of input file names, and implicit index files present with inputs,

         samtools mpileup -b fofn -r chr10:100000-200000

       Using flags: To get reads with flags READ2 or REVERSE and not having any of SECONDARY,QCFAIL,DUP,

         samtools mpileup --rf READ2,REVERSE in.sam

       or

         samtools mpileup --rf 144 in.sam

       To get reads with flag SECONDARY,

         samtools mpileup --rf SECONDARY --ff QCFAIL,DUP in.sam

       Using all possible alignments: To show all possible alignments, either of the two equivalent commands
       below may be used,

         samtools mpileup --count-orphans --no-BAQ --max-depth 0 --fasta-ref ref_file.fa \
         --min-BQ 0 --excl-flags 0 --disable-overlap-removal in.sam

         samtools mpileup -A -B -d 0 -f ref_file.fa -Q 0 --ff 0 -x in.sam

AUTHOR

       Written by Heng Li from the Sanger Institute.

SEE ALSO

       samtools(1), samtools-depth(1), samtools-sort(1), bcftools(1)

       Samtools website: <http://www.htslib.org/>

samtools-1.19.2                                  24 January 2024                             samtools-mpileup(1)