Ubuntu Manpage: swarm — find clusters of nearly-identical nucleotide amplicons

NAME

       swarm — find clusters of nearly-identical nucleotide amplicons

SYNOPSIS

       swarm -h|v

       High-precision clustering:

       swarm [filename]

       swarm [-d 1] [-nrz] [-a int] [-i filename] [-j filename] [-l filename] [-o filename] [-s filename]
             [-t int] [-u filename] [-w filename] [filename]

       swarm [-d 1] -f [-nrz] [-a int] [-b int] [-c|y int] [-i filename] [-j filename] [-l filename]
             [-o filename] [-s filename] [-t int] [-u filename] [-w filename] [filename]

       Conservative clustering:

       swarm -d 2+ [-nrxz] [-a int] [-e int] [-g int] [-i filename] [-l filename] [-m int] [-o filename]
             [-p int] [-s filename] [-t int] [-u filename] [-w filename] [filename]

       Dereplication (merge strictly identical sequences):

       swarm -d 0 [-rz] [-a int] [-i filename] [-l filename] [-o filename] [-s filename] [-u filename]
             [-w filename] [filename]

DESCRIPTION

Environmental or clinical molecular studies generate large volumes of amplicons (e.g., 16S or 18S SSU-
rRNA sequences) that need to be grouped into clusters. Traditional clustering methods are based on
greedy, input-order dependent algorithms, with arbitrary selection of cluster centroids and cluster
limits (often 97%-similarity). To address that problem, we developed swarm, a fast and robust method that
recursively groups amplicons with d or less differences (i.e. substitutions, insertions or deletions).
swarm produces natural and stable clusters centered on local peaks of abundance, mostly free from input-
order dependency induced by centroid selection.

Exact clustering is impractical on large data sets when using a naïve all-vs-all approach (more precisely
a 2-combination without repetitions), as it implies unrealistic numbers of pairwise comparisons. swarm is
based on a maximum number of differences d between two amplicons, and focuses only on very close local
relationships. For d = 1, the default value, swarm uses an algorithm of linear complexity that generates
all possible single mutations and performs exact-string matching by comparing hash-values. For d = 2 or
greater, swarm uses an algorithm of quadratic complexity that performs pairwise string comparisons. An
efficient k-mer-based filtering and an astute use of comparisons results obtained during the clustering
process allows swarm to avoid most of the amplicon comparisons needed in a naïve approach. To speed up
the remaining amplicon comparisons, swarm implements an extremely fast Needleman-Wunsch algorithm making
use of the Streaming SIMD Extensions (SSE2) of x86-64 CPUs, NEON instructions of ARM64 CPUs, or
Altivec/VMX instructions of POWER8 CPUs. If SSE2 instructions are not available, swarm exits with an
error message.

swarm can read nucleotide amplicons in fasta format from a normal file or from the standard input (using
a pipe or a redirection). The amplicon header is defined as the string comprised between the '>' symbol
and the first space or the end of the line, whichever comes first. Each header must end with an abundance
annotation representing the amplicon copy number and defined as '_' followed by a positive integer. See
option -z for input data using usearch/vsearch's abundance annotation format (';size=integer[;]'). Once
stripped from the abundance annotation, the remaining part of the header is call the label. In summary,
using regular expression patterns:

>header[[:blank:]] and header = label_[1-9][0-9]*$

Abundance annotations play a crucial role in the clustering process, and swarm exits with an error
message if that information is not available. As swarm outputs lists of amplicon labels, amplicon labels
must be unique to avoid any ambiguity; swarm exits with an error message if labels are not unique. The
amplicon sequence is defined as a string of [ACGT] or [ACGU] symbols (case insensitive, 'U' is replaced
with 'T' internally), starting after the end of the header line and ending before the next header line or
the file end; swarm silently removes newline symbols ('\n' or '\r') and exits with an error message if
any other symbol is present. Accepted sequence lengths range from 1 nucleotide to 67 million nucleotides.
Please note that processing 67-Mb sequences requires at least 32 gigabytes of memory. Lastly, if
sequences are not all unique, i.e. were not properly dereplicated, swarm will exit with an error message.

Clusters are written to output files (specified with -i, -o, -s and -u) by decreasing abundance of their
seed sequences, and then by alphabetical order of seed sequence labels. An exception to that is the -w
(--seeds) output, which is sorted by decreasing cluster abundance (sum of abundances of all sequences in
the cluster), and then by alphabetical order of seed sequence labels. This is particularly useful for
post-clustering steps, such as de novo chimera detection, that require clusters to be sorted by
decreasing abundances.

General options
-h, --help
display this help and exit successfully.

-t, --threads positive integer
number of computation threads to use. Values between 1 and 512 are accepted, but we recommend to
use a number of threads lesser or equal to the number of available CPU cores. Default number of
threads is 1.

-v, --version
output version information and exit successfully.

-- delimit the option list. Later arguments, if any, are treated as operands even if they begin
with '-'. For example, 'swarm -- -file.fasta' reads from the file '-file.fasta'.

Clustering options
-d, --differences zero or positive integer
maximum number of differences allowed between two amplicons, meaning that two amplicons will be
grouped if they have integer (or less) differences. This is swarm's most important parameter.
The number of differences is calculated as the number of mismatches (substitutions, insertions
or deletions) between the two amplicons once the optimal pairwise global alignment has been
found (see 'pairwise alignment advanced options' to influence that step).

Any integer from 0 to 255 can be used, but high d values will decrease the taxonomical
resolution of swarm results. Commonly used d values are 1, 2 or 3, rarely higher. When using d =
0, swarm will output results corresponding to a strict dereplication of the dataset, i.e.
merging identical amplicons. Warning, whatever the d value, swarm requires fasta entries to
present abundance values. Default number of differences d is 1.

-n, --no-otu-breaking
when working with d = 1, deactivate the built-in cluster refinement (not recommended). Amplicon
abundance values are used to identify transitions among in-contact clusters and to separate
them, yielding higher-resolution clustering results. That option prevents that separation, and
in practice, allows the creation of a link between amplicons A and B, even if the abundance of B
is higher than the abundance of A.

Fastidious options
-b, --boundary positive integer
when using the option --fastidious (-f), define the minimum abundance of what should be
considered a large cluster. By default, a cluster with a total abundance of 3 or more is
considered large. Conversely, a cluster is small if it has a total abundance of 2 or less,
meaning that it is composed of either one amplicon of abundance 2, or two amplicons of abundance
1, or one amplicon of abundance 1. Any positive value greater than 1 can be specified. Using
higher boundary values can reduce the number of clusters (up to a point), and will reduce the
taxonomical resolution of swarm results. It will also slightly increase computation time.

-c, --ceiling positive integer
when using the option --fastidious (-f), define swarm's maximum memory footprint (in megabytes).
swarm will adjust the --bloom-bits (-y) value of the Bloom filter to fit within the specified
amount of memory. Values accepted range from 40 to 1,073,741,824 megabytes. See the --bloom-bits
(-y) option for an alternative way to control the memory footprint.

-f, --fastidious
when working with d = 1, perform a second clustering pass to reduce the number of small clusters
(recommended option). During the first clustering pass, an intermediate amplicon can be missing
for purely stochastic reasons, interrupting the aggregation process. The fastidious option will
create virtual amplicons, allowing to graft small clusters upon larger ones. By default, a
cluster is considered large if it has a total abundance of 3 or more (see the --boundary option
to modify that value).

To speed things up, swarm uses a Bloom filter to store intermediate results. Warning, the second
clustering pass can be 2 to 3 times slower than the first pass and requires much more memory to
store the virtual amplicons in Bloom filters. See the options --bloom-bits (-y) or --ceiling
(-c) to control the memory footprint of the Bloom filter.

The fastidious option modifies clustering results: the output files produced by the options
--log (-l), --output-file (-o), --mothur (-r), --uclust-file, and --seeds (-w) are updated to
reflect these modifications; the file --statistics-file (-s) is partially updated (columns 6 and
7 are not updated); the output file --internal-structure (-i) is partially updated (column 5 is
not updated for amplicons that belonged to the small cluster).

-y, --bloom-bits positive integer
when using the option --fastidious (-f), define the size (in bits) of each entry in the Bloom
filter. That option allows to balance the efficiency (i.e. speed) and the memory footprint of
the Bloom filter. Large values will make the Bloom filter more efficient but will require more
memory. Any value between 2 and 64 can be used. Default value is 16. See the --ceiling (-c)
option for an alternative way to control the memory footprint.

Input/output options
-a, --append-abundance positive integer
set abundance value to use when some or all amplicons in the input file lack abundance values
(_integer, or ;size=integer; when using -z). Warning, it is not recommended to use swarm on
datasets where abundance values are all identical. We provide that option as a courtesy to
advanced users, please use it carefully. swarm exits with an error message if abundance values
are missing and if this option is not used.

-i, --internal-structure filename
output all pairs of nearly-identical amplicons to filename using a five-column tab-delimited
format:

1. amplicon A label (header without abundance annotations).

2. amplicon B label (header without abundance annotations).

3. number of differences between amplicons A and B (positive integer).

4. cluster number (positive integer). Clusters are numbered in their order of
delineation, starting from 1. All pairs of amplicons belonging to the same cluster
will receive the same number.

5. cummulated number of steps from the cluster seed to amplicon B (positive integer).
When using the option --fastidious (-f), the actual number of steps between grafted
amplicons and the cluster seed cannot be re-computed efficiently and is always set to
2 for the amplicon pair linking the small cluster to the large cluster. Cummulated
number of steps in the small cluster (if any) are left unchanged.

-j, --network-file filename
(advanced users) when working with d = 1, dump raw amplicon network to filename using a two-
column tab-delimited table of headers with abundance annotations. Each line represents a
connection between two similar amplicons, from the most abundant to the lesser abundant. When
amplicons have the same abundance value, connections are bi-directional and are represented on
two lines: A to B, then B to A.

In order to delineate clusters and to compute the equivalent of a minimal spanning tree for each
cluster (see option --internal-structure), swarm first builds a network of similar amplicons.
This option is for advanced users who would like to explore this raw network.

-l, --log filename
output all messages to filename instead of standard error, with the exception of error messages
of course. That option is useful in situations where writing to standard error is problematic
(for example, with certain job schedulers).

-o, --output-file filename
output clustering results to filename. Results consist of a list of clusters, one cluster per
line. A cluster is a list of amplicon headers separated by spaces. That output format can be
modified by the option --mothur (-r). Default is to write to standard output.

-r, --mothur
output clustering results in a format compatible with Mothur. That option modifies swarm's
default output format.

-s, --statistics-file filename
output statistics to filename. The file is a tab-separated table with one cluster per row and
seven columns of information:

1. number of unique amplicons in the cluster,

2. total abundance of amplicons in the cluster,

3. label of the initial seed (header without abundance annotations),

4. abundance of the initial seed,

5. number of amplicons with an abundance of 1 in the cluster,

6. maximum number of iterations before the cluster reached its natural limit,

7. cummulated number of steps along the path joining the seed and the furthermost
amplicon in the cluster. Please note that the actual number of differences between
the seed and the furthermost amplicon is usually much smaller. When using the option
--fastidious (-f), grafted amplicons are not taken into account.

-u, --uclust-file filename
output clustering results in filename using a tab-separated uclust-like format with 10 columns
and 3 different type of entries (S, H or C). That option does not modify swarm's default output
format. Each fasta sequence in the input file can be either a cluster centroid (S) or a hit (H)
assigned to a cluster. Cluster records (C) summarize information for each cluster (number of
hits, centroid header). Column content varies with the type of entry (S, H or C):

1. Record type: S, H, or C.

2. Cluster number (zero-based).

3. Centroid length (S), query length (H), or number of hits (C).

4. Percentage of similarity with the centroid sequence (H), or set to '*' (S, C).

5. Match orientation + or - (H), or set to '*' (S, C).

6. Not used, always set to '*' (S, C) or to zero (H).

7. Not used, always set to '*' (S, C) or to zero (H).

8. set to '*' (S, C) or, for H, compact representation of the pairwise alignment using
the CIGAR format (Compact Idiosyncratic Gapped Alignment Report): M (match), D
(deletion) and I (insertion). The equal sign '=' indicates that the query is
identical to the centroid sequence.

9. Header of the query sequence (H), or of the centroid sequence (S, C).

10. Header of the centroid sequence (H), or set to '*' (S, C).

-w, --seeds filename
output cluster representative sequences to filename in fasta format. The abundance value of each
cluster representative is the sum of the abundances of all the amplicons in the cluster. Fasta
headers are formated as follows: '>label_integer', or '>label;size=integer;' if the -z option is
used, and sequences are uppercased. Sequences are sorted by decreasing abundance, and then by
alphabetical order of sequence labels.

-z, --usearch-abundance
accept amplicon abundance values in usearch/vsearch's style (>label;size=integer[;]). That
option influences the abundance annotation style used in swarm's standard output (-o), as well
as the output of options -r, -u and -w.

Pairwise alignment advanced options
when using d > 1, swarm recognizes advanced command-line options modifying the pairwise global alignment
scoring parameters:

-m, --match-reward positive integer
Default reward for a nucleotide match is 5.

-p, --mismatch-penalty positive integer
Default penalty for a nucleotide mismatch is 4.

-g, --gap-opening-penalty positive integer
Default gap opening penalty is 12.

-e, --gap-extension-penalty positive integer
Default gap extension penalty is 4.

-x, --disable-sse3
On the x86-64 CPU architecture, disable SSE3 and later instructions. This option is meant
for developers, not for regular users.

As swarm focuses on close relationships (e.g., d = 2 or 3), clustering results are resilient to pairwise
alignment model parameters modifications. When clustering using a higher d value, modifying model
parameters has a stronger impact.

EXAMPLES

       Clusterize the compressed data set myfile.fasta using the finest resolution  possible  (1  difference  by
       default,  built-in  breaking, fastidious option) using 4 computation threads. Clusters are written to the
       file myfile.swarms, and cluster representatives are written to myfile.representatives.fasta:
              zcat myfile.fasta.gz | \
                  swarm \
                      -t 4 \
                      -f \
                      -w myfile.representatives.fasta \
                      -o myfile.swarms

AUTHORS

       Concept by Frédéric Mahé, implementation by Torbjørn Rognes.

CITATION

       Mahé F, Rognes T, Quince C, de Vargas C, Dunthorn M. (2014) Swarm: robust and fast clustering method  for
       amplicon-based studies.  PeerJ 2:e593 https://doi.org/10.7717/peerj.593.

       Mahé F, Rognes T, Quince C, de Vargas C, Dunthorn M. (2015) Swarm v2: highly-scalable and high-resolution
       amplicon clustering.  PeerJ 3:e1420 https://doi.org/10.7717/peerj.1420.

       Mahé  F,  Czech  L,  Stamatakis  A, Quince C, de Vargas C, Dunthorn M, Rognes T. (2021) Swarm v3: towards
       tera-scale amplicon clustering.  Bioinformatics https://doi.org/10.1093/bioinformatics/btab493.

REPORTING BUGS

       Submit suggestions and bug-reports at https://github.com/torognes/swarm/issues, send a  pull  request  at
       https://github.com/torognes/swarm/pulls,  or  compose  a friendly or curmudgeonly e-mail to Frédéric Mahé
       and Torbjørn Rognes.

AVAILABILITY

       Source code and binaries available at https://github.com/torognes/swarm.

COPYRIGHT

       Copyright (C) 2012-2024 Frédéric Mahé & Torbjørn Rognes

       This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero
       General Public License as published by the Free Software Foundation, either version 3 of the License,  or
       any later version.

       This  program  is  distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even
       the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU  Affero  General
       Public License for more details.

       You  should  have  received  a copy of the GNU Affero General Public License along with this program.  If
       not, see https://www.gnu.org/licenses/.

VERSION HISTORY

New features and important modifications of swarm (short lived or minor bug releases are not mentioned):

v3.1.5 released March 31, 2024
Version 3.1.5 changes the minimal value for the ceiling option from 8 megabytes to 40
megabytes, and fixes four minor bugs. Warning, peak RSS memory increased by 5 to 10% when d
>= 2. Version 3.1.5 improves documentation (now covering option --network_file), adds more
compilation checks and eliminates 50 compilation warnings with GCC 13, GCC 14 and clang 19,
as well as 1,677 static analysis warnings.

v3.1.4 released September 20, 2023
Version 3.1.4 fixes a minor bug. It eliminates compilation warnings with GCC 13 and clang
18, as well as 1,040 static analysis warnings. The maximal number of threads swarm can run
is now 512, instead of 256. Compilation with runtime checks (`-DNDEBUG`) is now the
default. When d > 1, overall memory allocations remain unchanged, but peak RSS memory
increased by 6 to 10%, due to a change in the timing of memory deallocations. Peak RSS
memory is expected to regress to its prior levels as refactoring continues.

v3.1.3 released December 5, 2022
Version 3.1.3 fixes a regression introduced in version 3.1.1 (memory over-allocation when d
> 1). It also fixes a minor off-by-one error when allocating memory for a Bloom filter,
compilation warnings with GCC 12 and clang 13, as well as static analysis warnings.
Documentation was improved, as well as our test suite (swarm-tests).

v3.1.2 released November 10, 2022
Fix a bug with fastidious mode introduced in version 3.1.1, that could cause Swarm to
crash. Probably due to allocating too much memory.

v3.1.1 released September 29, 2022
Version 3.1.1 eliminates a risk of segmentation fault with extremely long sequence headers.
Documentation and error messages have been improved, and code cleaning continued.

v3.1.0 released March 1, 2021
Version 3.1.0 includes a fix for a bug in the 16-bit SIMD alignment code that was exposed
with a combination of d>1, long sequences, and very high gap penalties. The code has also
been been cleaned up, tested and improved substantially, and it is now fully C++11
compliant. Support for macOS on Apple Silicon (ARM64) has been added.

v3.0.0 released October 24, 2019
Version 3.0.0 introduces a faster algorithm for d = 1, and a reduced memory footprint.
Swarm has been ported to Windows x86-64, GNU/Linux ARM 64, and GNU/Linux POWER8. Internal
code has been modernized, hardened, and thoroughly tested. Strict dereplication of input
sequences is now mandatory. The --seeds option (-w) now outputs results sorted by
decreasing abundance, and then by alphabetical order of sequence labels.

v2.2.2 released December 12, 2017
Version 2.2.2 fixes a bug that would cause swarm to wait forever in very rare cases when
multiple threads were used.

v2.2.1 released October 27, 2017
Version 2.2.1 fixes a memory allocation bug for d = 1 and duplicated sequences.

v2.2.0 released October 17, 2017
Version 2.2.0 fixes several problems and improves usability. Corrected output to structure
and uclust files when using fastidious mode. Corrected abundance output in some cases.
Added check for duplicated sequences and fixed check for duplicated sequence IDs. Checks
for empty sequences. Sorts sequences by additional fields to improve stability. Improves
compatibility with compilers and operating systems. Outputs sequences in upper case.
Allows 64-bit abundances. Shows message when waiting for input from stdin. Improves error
messages and warnings. Improves checking of command line options. Fixes remaining errors
reported by test suite. Updates documentation.

v2.1.13 released March 8, 2017
Version 2.1.13 removes a bug with the progress bar when writing seeds.

v2.1.12 released January 16, 2017
Version 2.1.12 removes a debugging message.

v2.1.11 released January 16, 2017
Version 2.1.11 fixes two bugs related to the SIMD implementation of alignment that might
result in incorrect alignments and scores. The bug only applies when d > 1.

v2.1.10 released December 22, 2016
Version 2.1.10 fixes two bugs related to gap penalties of alignments. The first bug may
lead to wrong aligments and similarity percentages reported in UCLUST (.uc) files. The
second bug makes swarm use a slightly higher gap extension penalty than specified. The
default gap extension penalty used have actually been 4.5 instead of 4.

v2.1.9 released July 6, 2016
Version 2.1.9 fixes errors when compiling with GCC version 6.

v2.1.8 released March 11, 2016
Version 2.1.8 fixes a rare bug triggered when clustering extremely short undereplicated
sequences. Also, alignment parameters are not shown when d = 1.

v2.1.7 released February 24, 2016
Version 2.1.7 fixes a bug in the output of seeds with the -w option when d > 1 that was not
properly fixed in version 2.1.6. It also handles ascii character #13 (CR) in FASTA files
better. Swarm will now exit with status 0 if the -h or the -v option is specified. The help
text and some error messages have been improved.

v2.1.6 released December 14, 2015
Version 2.1.6 fixes problems with older compilers that do not have the x86intrin.h header
file. It also fixes a bug in the output of seeds with the -w option when d > 1.

v2.1.5 released September 8, 2015
Version 2.1.5 fixes minor bugs.

v2.1.4 released September 4, 2015
Version 2.1.4 fixes minor bugs in the swarm algorithm used for d = 1.

v2.1.3 released August 28, 2015
Version 2.1.3 adds checks of numeric option arguments.

v2.1.1 released March 31, 2015
Version 2.1.1 fixes a bug with the fastidious option that caused it to ignore some
connections between large and small clusters.

v2.1.0 released March 24, 2015
Version 2.1.0 marks the first official release of swarm v2.

v2.0.7 released March 18, 2015
Version 2.0.7 writes abundance information in usearch style when using options -w (--seeds)
in combination with -z (--usearch-abundance).

v2.0.6 released March 13, 2015
Version 2.0.6 fixes a minor bug.

v2.0.5 released March 13, 2015
Version 2.0.5 improves the implementation of the fastidious option and adds options to
control memory usage of the Bloom filter (-y and -c). In addition, an option (-w) allows
to output cluster representatives sequences with updated abundances (sum of all abundances
inside each cluster). This version also enables swarm to run with d = 0.

v2.0.4 released March 6, 2015
Version 2.0.4 includes a fully parallelised implementation of the fastidious option.

v2.0.3 released March 4, 2015
Version 2.0.3 includes a working implementation of the fastidious option, but only the
initial clustering is parallelized.

v2.0.2 released February 26, 2015
Version 2.0.2 fixes SSSE3 problems.

v2.0.1 released February 26, 2015
Version 2.0.1 is a development version that contains a partial implementation of the
fastidious option, but it is not usable yet.

v2.0.0 released December 3, 2014
Version 2.0.0 is faster and easier to use, providing new output options
(--internal-structure and --log), new control options (--boundary, --fastidious,
--no-otu-breaking), and built-in cluster refinement (no need to use the python script
anymore). When using default parameters, a novel and considerably faster algorithmic
approach is used, guaranteeing swarm's scalability.

v1.2.21 released February 26, 2015
Version 1.2.21 is supposed to fix some problems related to the use of the SSSE3 CPU
instructions which are not always available.

v1.2.20 released November 6, 2014
Version 1.2.20 presents a production-ready version of the alternative algorithm (option
-a), with optional built-in cluster breaking (option -n). That alternative algorithmic
approach (usable only with d = 1) is considerably faster than currently used clustering
algorithms, and can deal with datasets of 100 million unique amplicons or more in a few
hours. Of course, results are rigourously identical to the results previously produced with
swarm. That release also introduces new options to control swarm output (options -i and
-l).

v1.2.19 released October 3, 2014
Version 1.2.19 fixes a problem related to abundance information when the sequence label
includes multiple underscore characters.

v1.2.18 released September 29, 2014
Version 1.2.18 reenables the possibility of reading sequences from stdin if no file name is
specified on the command line. It also fixes a bug related to CPU features detection.

v1.2.17 released September 28, 2014
Version 1.2.17 fixes a memory allocation bug introduced in version 1.2.15.

v1.2.16 released September 27, 2014
Version 1.2.16 fixes a bug in the abundance sort introduced in version 1.2.15.

v1.2.15 released September 27, 2014
Version 1.2.15 sorts the input sequences in order of decreasing abundance unless they are
detected to be sorted already. When using the alternative algorithm for d = 1 it also sorts
all subseeds in order of decreasing abundance.

v1.2.14 released September 27, 2014
Version 1.2.14 fixes a bug in the output with the --swarm_breaker option (-b) when using
the alternative algorithm (-a).

v1.2.12 released August 18, 2014
Version 1.2.12 introduces an option --alternative-algorithm to use an extremely fast,
experimental clustering algorithm for the special case d = 1. Multithreading scalability of
the default algorithm has been noticeably improved.

v1.2.10 released August 8, 2014
Version 1.2.10 allows amplicon abundances to be specified using the usearch style in the
sequence header (e.g. '>id;size=1') when the -z option is chosen.

v1.2.8 released August 5, 2014
Version 1.2.8 fixes an error with the gap extension penalty. Previous versions used a gap
penalty twice as large as intended. That bug correction induces small changes in clustering
results.

v1.2.6 released May 23, 2014
Version 1.2.6 introduces an option --mothur to output clustering results in a format
compatible with the microbial ecology community analysis software suite Mothur (
https://www.mothur.org/).

v1.2.5 released April 11, 2014
Version 1.2.5 removes the need for a POPCNT hardware instruction to be present. swarm now
automatically checks whether POPCNT is available and uses a slightly slower software
implementation if not. Only basic SSE2 instructions are now required to run swarm.

v1.2.4 released January 30, 2014
Version 1.2.4 introduces an option --break-swarms to output all pairs of amplicons with d
differences to standard error. That option is used by the companion script
`swarm_breaker.py` to refine swarm results. The syntax of the inline assembly code is
changed for compatibility with more compilers.

v1.2 released May 16, 2013
Version 1.2 greatly improves speed by using alignment-free comparisons of amplicons based
on k-mer word content. For each amplicon, the presence-absence of all possible 5-mers is
computed and recorded in a 1024-bits vector. Vector comparisons are extremely fast and
drastically reduce the number of costly pairwise alignments performed by swarm. While
remaining exact, swarm 1.2 can be more than 100-times faster than swarm 1.1, when using a
single thread with a large set of sequences. The minor version 1.1.1, published just
before, adds compatibility with Apple computers, and corrects an issue in the pairwise
global alignment step that could lead to sub-optimal alignments.

v1.1 released February 26, 2013
Version 1.1 introduces two new important options: the possibility to output clustering
results using the uclust output format, and the possibility to output detailed statistics
on each cluster. swarm 1.1 is also faster: new filterings based on pairwise amplicon
sequence lengths and composition comparisons reduce the number of pairwise alignments
needed and speed up the clustering.

v1.0 released November 10, 2012
First public release.

version 3.1.5 March 31, 2024 swarm(1)