Ubuntu Manpage: bamsort - sort BAM files by coordinate or query name

Provided by: biobambam2_2.0.185+ds-2_amd64

NAME

       bamsort - sort BAM files by coordinate or query name

SYNOPSIS

       bamsort [options]

DESCRIPTION

       bamsort reads a BAM, SAM or CRAM file and sorts it by coordinate (lexicographically by reference sequence
       id and position on the reference sequence), by query name (possibly including the HI aux tag for ordering
       alignments featuring the same query name), by a hash value computed for the query name, or by an aux tag
       value. It writes the sorted file in BAM, SAM or CRAM format.

       Lexicographical  order  denotes  that  pairs  (a,b)  and (c,d) will be ordered such that (a,b) < (c,d) if
       either a < c or a = c and b < d. For coordinates this means that the  alignments  are  first  grouped  by
       reference  sequence  id (i.e. all alignments for one chromosome appear in one block) and within the block
       for each reference sequence the alignments are ordered by the start position on this sequence.
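
       The coordinate order described above can be sketched with Python tuples, which compare
       lexicographically. The record layout (reference sequence id, start position, read name) and the sample
       values are assumptions made for illustration; they are not bamsort's internal representation.

```python
# Hypothetical alignments as (reference sequence id, start position, read name).
alignments = [
    (1, 500, "r3"),
    (0, 900, "r1"),
    (1, 120, "r4"),
    (0, 150, "r2"),
]


def coordinate_key(alignment):
    """Return the (reference id, position) pair used for ordering."""
    ref_id, pos, _name = alignment
    return (ref_id, pos)


# Tuples compare lexicographically: (a, b) < (c, d) if a < c, or a == c and b < d.
by_coordinate = sorted(alignments, key=coordinate_key)
```

       In the result all alignments for reference id 0 form one block ahead of those for reference id 1, and
       each block is ordered by start position.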

       The order by query name decomposes each read name into alternating parts that contain only digits and
       parts that contain no digits. A read name A15_30_C50 will for instance be split into the components A,
       15, _, 30, _C and 50. The comparison of read names is performed lexicographically along this
       decomposition, where number fields are compared as numbers. As an example we have A15<B12 because A<B,
       and A9<A12 because A=A and 9<12 (where 9 and 12 are considered as numbers and not as the sequences of
       their digits).
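
       A minimal sketch of this decomposition and comparison follows. The function names are hypothetical, and
       the rule used when a number field meets a text field at the same position (numbers first) is an
       assumption, since this manual does not specify that case.

```python
import re


def split_name(name):
    """Split a read name into alternating digit and non-digit runs."""
    return re.findall(r"\d+|\D+", name)


def queryname_key(name):
    """Sort key comparing digit runs as integers and other runs as text.

    Assumption: a number field sorts before a text field at the same
    position. The manual does not describe this case.
    """
    key = []
    for part in split_name(name):
        if part.isdecimal():
            key.append((0, int(part), ""))
        else:
            key.append((1, 0, part))
    return tuple(key)


names = ["A12", "B12", "A9", "A15"]
natural = sorted(names, key=queryname_key)  # number fields compared as numbers
plain = sorted(names)                       # plain string comparison, for contrast
```

       Here natural places A9 before A12, whereas the plain string comparison places A12 before A9 because
       the character 1 sorts before the character 9.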

       The order by hash value computes a hash value (effectively a random number) for each read name and orders
       the alignments by this number in increasing order. Alignments assigned the same hash value are ordered by
       query name.
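
       The idea can be sketched as a sort on the pair (hash value, query name). The sketch below uses CRC32
       from the Python standard library purely as a stand-in hash, and it breaks ties by plain string
       comparison; both are simplifications and not what bamsort itself does.

```python
import zlib


def name_hash(name):
    """Stand-in hash function (CRC32); chosen only because it is in the stdlib."""
    return zlib.crc32(name.encode("utf-8"))


def hash_order(names):
    """Order names by increasing hash value, breaking ties by name."""
    return sorted(names, key=lambda n: (name_hash(n), n))


names = ["read1", "read2", "read3", "read4"]
shuffled = hash_order(names)
```

       The resulting order looks arbitrary but is reproducible: the same set of names always yields the same
       order, regardless of the order in which they were supplied.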

       The order by aux tag compares alignments by the value of a given aux field storing a string  value.  This
       string comparison follows the same order used for comparing query names stated above. Alignments with the
       same aux value are sorted by coordinate order.

       If the memory buffer given is not sufficiently large to process the input file, then the program writes
       intermediate results to a temporary file. This file can be large and, depending on the compression of the
       input file, larger than the input itself.
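
       The general technique of sorting with a bounded buffer and temporary files is known as external merge
       sort. The sketch below shows that generic technique on integers; it is not bamsort's implementation,
       and the function names and the text-based run format are assumptions made for illustration.

```python
import heapq
import tempfile


def read_run(run):
    """Yield the integers stored one per line in a sorted run file."""
    for line in run:
        yield int(line)


def external_sort(values, buffer_size):
    """Sort integers, spilling a sorted run to disk whenever the buffer is full."""
    runs = []
    buffer = []

    def spill():
        # Write the sorted buffer to a temporary file and empty the buffer.
        run = tempfile.TemporaryFile(mode="w+")
        run.writelines(f"{v}\n" for v in sorted(buffer))
        run.seek(0)
        runs.append(run)
        buffer.clear()

    for value in values:
        buffer.append(value)
        if len(buffer) >= buffer_size:
            spill()

    if not runs:
        return sorted(buffer)  # input fit in the buffer, no temporary file needed
    if buffer:
        spill()

    # Merge the sorted runs into one sorted output.
    merged = list(heapq.merge(*(read_run(run) for run in runs)))
    for run in runs:
        run.close()
    return merged
```

       With a buffer that holds the whole input no temporary file is created; with a smaller buffer every
       value passes through temporary storage once before the final merge.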

       The following key=value pairs can be given:

       SO=<coordinate|queryname|hash|tag|tagonly|queryname_HI|queryname_lexicographic>:   set  the  sort  order.
       Valid values are

       coordinate
              sort alignments by coordinate

       queryname
              sort alignments by query name

       hash   sort alignments by (Murmur3) hash of query name. This effectively puts them in a random order.

       tag    sort alignments by string aux field. The tag of the aux field needs to be provided using the
              sorttag key. Entries with identical tag are sorted by coordinate.

       tagonly
              sort alignments by string aux field. The tag of the aux field needs to be provided using the
              sorttag key. Entries with identical tag are left in the same order as they were in the input.

       queryname_HI
              sort alignments by query name. Alignments with identical query name are sorted  by  the  value  of
              their HI aux field.

       queryname_lexicographic
              sort  alignments  by  query  name  using  a  purely  lexicographic  comparison instead of the more
              sophisticated version described above.
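
       The difference between SO=tag and SO=tagonly can be sketched as follows. The record layout (aux tag
       value, reference id, position) and the sample values are hypothetical, and tag values are compared as
       plain strings here for brevity rather than with the query name comparison described earlier.

```python
# Hypothetical records in input order: (aux tag value, reference id, position).
records = [
    ("BC2", 0, 700),
    ("BC1", 1, 300),
    ("BC2", 0, 100),
    ("BC1", 0, 900),
]

# Like SO=tag: order by tag value, then by coordinate.
by_tag = sorted(records, key=lambda r: (r[0], r[1], r[2]))

# Like SO=tagonly: order by tag value only. sorted() is stable, so records
# with equal tags keep their input order.
by_tagonly = sorted(records, key=lambda r: r[0])
```

       In by_tag the two BC2 records appear in coordinate order (position 100, then 700); in by_tagonly they
       keep their input order (position 700, then 100).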

       level=<-1|0|1|9|11>: set compression level of the output BAM file. Valid values are

       -1:    zlib/gzip default compression level

       0:     uncompressed

       1:     zlib/gzip level 1 (fast) compression

       9:     zlib/gzip level 9 (best) compression

       If libmaus has been compiled with support for igzip (see
       https://software.intel.com/en-us/articles/igzip-a-high-performance-deflate-compressor-with-optimizations-for-genomic-data)
       then an additional valid value is

       11:    igzip compression
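
       The effect of the zlib levels -1, 0, 1 and 9 can be tried out with Python's zlib module. The payload
       below is a made-up, highly repetitive byte string, so the sizes it produces say nothing about real BAM
       files; level 11 (igzip) is not available through this module and is not covered.

```python
import zlib

# Hypothetical payload: repetitive, sequence-like bytes.
data = b"ACGTTGCA" * 5000

# Compressed size in bytes for each zlib level accepted by the level key.
sizes = {level: len(zlib.compress(data, level)) for level in (-1, 0, 1, 9)}

for level, size in sizes.items():
    print(f"level {level:>2}: {size} bytes")
```

       Level 0 stores the data without compression and therefore yields output slightly larger than the
       input, while the other levels trade processing time for smaller output.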

       verbose=<1>: Valid values are

       1:     print progress report on standard error

       0:     do not print progress report

       blockmb=<1024>: set size of the internal memory sorting buffer in megabytes. The default buffer  size  is
       one gigabyte.

       tmpfile=<filename>: set the prefix for temporary file names

       disablevalidation=<0|1>: sets whether input validation is performed. Valid values are

       0:     validation is enabled (default)

       1:     validation is disabled

       md5=<0|1>: md5 checksum creation for output file. This option can only be given if outputformat=bam. Then
       valid values are

       0:     do not compute checksum. This is the default.

       1:     compute  checksum.  If the md5filename key is set, then the checksum is written to the given file.
              If md5filename is unset, then no checksum will be computed.

       md5filename=<filename>: file name for md5 checksum if md5=1.

       index=<0|1>: compute BAM index for output file. This option can only be given if  outputformat=bam.  Then
       valid values are

       0:     do not compute BAM index. This is the default.

       1:     compute  BAM  index.  If  the indexfilename key is set, then the BAM index is written to the given
              file. If indexfilename is unset, then no BAM index will be computed.

       indexfilename=<filename>: file name for output BAM index if index=1.

       inputformat=<bam>: input file format.  All versions of bamsort  come  with  support  for  the  BAM  input
       format. If the program in addition is linked to the io_lib package, then the following options are valid:

       bam:   BAM (see http://samtools.sourceforge.net/SAM1.pdf)

       sam:   SAM (see http://samtools.sourceforge.net/SAM1.pdf)

       cram:  CRAM (see http://www.ebi.ac.uk/ena/about/cram_toolkit)

       outputformat=<bam>:  output  file  format.   All versions of bamsort come with support for the BAM output
       format. If the program in addition is linked to the io_lib package, then the following options are valid:

       bam:   BAM (see http://samtools.sourceforge.net/SAM1.pdf)

       sam:   SAM (see http://samtools.sourceforge.net/SAM1.pdf)

       cram:  CRAM (see http://www.ebi.ac.uk/ena/about/cram_toolkit). This format  is  not  advisable  for  data
              sorted by query name.

       I=<[stdin]>: input filename, standard input if unset.

       O=<[stdout]>: output filename, standard output if unset.

       inputthreads=<[1]>: input helper threads, only valid for inputformat=bam.

       sortthreads=<[1]>: number of threads used for sorting.

       outputthreads=<[1]>: output helper threads, only valid for outputformat=bam.

       reference=<[]>:  reference FastA file for inputformat=cram and outputformat=cram. An index file (.fai) is
       required.

       range=<>: input range to be processed. This option is only valid if the input is a coordinate-sorted and
       indexed BAM file.

       fixmates=<0|1>:  fix  mate information as bamfixmateinformation would do. Input is assumed to be collated
       by query name (no changes will be applied to mates which are  not  adjacent  in  the  input  stream).  By
       default this option is disabled.

       calmdnm=<0|1>: calculate the MD and NM fields as a side effect. By default the fields are not calculated.
       Calculation is only performed if sorting is performed by coordinate. If calmdnm=1, then the parameter
       calmdnmreference is required. The supported file formats can be found in the manual page for bammdnm.

       calmdnmreference=<[]>: name of reference sequence file if calmdnm=1.

       calmdnmrecompindetonly=<0|1>: compute MD/NM fields in the presence of indeterminate (N) bases only.  This
       option  is  only  relevant  if calmdnm=1. By default the fields are computed for all mapped alignments if
       calmdnm=1.

       calmdnmwarnchange=<0|1>: warn if a computed MD/NM field differs from a previously existing field. By
       default no warnings are produced.

       adddupmarksupport=<0|1>:  add  information  required for streaming duplicate marking in the aux fields MS
       and MC. Input is assumed to be collated by query name. This  option  is  ignored  unless  fixmates=1.  By
       default it is disabled.

       markduplicates=<[0]>:  mark  duplicate  read  pairs  and  reads. This option can only be used when a name
       collated file (all reads for a name are consecutive in the input) is sorted  into  coordinate  order.  In
       addition the input is required not to contain orphan reads (pair ends such that the other end of the pair
       is  not  contained  in  the  file). Setting markduplicates=1 implies adddupmarksupport=1. The temporarily
       added auxiliary fields are removed during output generation. The markduplicates  option  is  disabled  by
       default.

       rmdup=<[0]>:   remove   the   duplicates   marked   by   the  markduplicates  option.  As  this  requires
       markduplicates=1, the requirements stated for markduplicates also apply for rmdup.

       tag=<tag>: name of auxiliary field storing tag information for duplicate marking in string form. Read
       fragments or pairs with different tags will not be considered as duplicates, even if they would be
       according to their mapping coordinates. For pairs the tag field information of the first and second mate
       are concatenated to obtain the tag of the pair.

       nucltag=<tag>: this option works like the tag option but is restricted to sequences of nucleotides (A, C,
       G or T) as tags. The length of each tag sequence is not allowed to exceed 15 bases. All tags are required
       to have the same length. Each non-nucleotide symbol is mapped to A. In contrast to the tag option,
       nucltag uses less memory for processing and can be expected to be faster.
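
       One way such restricted tags could be stored compactly is a 2-bit code per base, so that a tag of up
       to 15 bases fits in 30 bits. This encoding is an assumption made for illustration only; this manual
       does not describe how bamsort stores nucleotide tags. The function and constant names are
       hypothetical.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
MAX_TAG_LENGTH = 15  # limit stated in the manual


def pack_nucl_tag(tag):
    """Pack a nucleotide tag into an integer using 2 bits per base (assumed encoding)."""
    if len(tag) > MAX_TAG_LENGTH:
        raise ValueError("tag sequence exceeds 15 bases")
    value = 0
    for base in tag:
        # Any symbol other than A, C, G or T is mapped to A (code 0).
        value = (value << 2) | CODES.get(base, 0)
    return value


packed = pack_nucl_tag("ACGT")
```

       Under this assumed encoding the tags A and AA would both map to 0, which illustrates why a
       fixed-width code of this kind only works if all tags have the same length.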

       M=<stderr>: name of the metrics file for duplicate marking (metrics are written to standard error if not
       set).

       streaming=<0|1>: do not open input file(s) multiple times if set to 1. When given  multiple  input  files
       bamsort  concatenates  the  files  on  the  fly  and  computes  a  merged header before starting the data
       processing. Computing the header of the output file requires opening each input file. If each input  file
       can  only be opened once (as it may take the form of a pipe or socket connection), then bamsort will keep
       all the files open at the same time. Otherwise the files will be opened only as needed to keep the number
       of open file descriptors lower.

       sorttag=<tag>: tag of aux field used for comparison when SO=tag or SO=tagonly.

       hash=<crc32prod>: hash used for producing bamseqchksum type header fields in sorted output.
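
       Several of the keys above depend on one another. The sketch below collects a few of the constraints
       stated in this manual into a small checker and assembles a key=value argument list. The helper names
       and the file names are hypothetical, the checks are not exhaustive, and outputformat is assumed to
       default to bam when it is not given.

```python
def is_set(opts, key):
    """True if the option is present with value 1 (given as int or string)."""
    return str(opts.get(key, 0)) == "1"


def check_options(opts):
    """Return a list of problems found in a dict of bamsort key=value options."""
    problems = []
    if opts.get("outputformat", "bam") != "bam":
        for key in ("md5", "index"):
            if key in opts:
                problems.append(f"{key} can only be given if outputformat=bam")
    if opts.get("SO") in ("tag", "tagonly") and "sorttag" not in opts:
        problems.append("SO=tag and SO=tagonly need the sorttag key")
    if is_set(opts, "calmdnm") and "calmdnmreference" not in opts:
        problems.append("calmdnm=1 requires calmdnmreference")
    if is_set(opts, "markduplicates") and "SO" in opts and opts["SO"] != "coordinate":
        problems.append("markduplicates is only usable when sorting into coordinate order")
    return problems


def build_command(opts):
    """Assemble an argument list such as ['bamsort', 'SO=coordinate', ...]."""
    return ["bamsort"] + [f"{key}={value}" for key, value in opts.items()]


# Hypothetical file names, used for illustration only.
example = {
    "I": "input.bam",
    "O": "sorted.bam",
    "SO": "coordinate",
    "index": 1,
    "indexfilename": "sorted.bam.bai",
}
problems = check_options(example)
command = build_command(example)
```

       If problems is empty, the list in command can be passed to subprocess.run to start the program.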

AUTHOR

       Written by German Tischler.

REPORTING BUGS

       Report bugs to <germant@miltenyibiotec.de>

COPYRIGHT

       Copyright © 2009-2016 German Tischler, © 2011-2014 Genome Research  Limited.   License  GPLv3+:  GNU  GPL
       version 3 <http://gnu.org/licenses/gpl.html>
       This  is  free software: you are free to change and redistribute it.  There is NO WARRANTY, to the extent
       permitted by law.

BIOBAMBAM                                        September 2017                                       BAMSORT(1)