Skip to content
Commits on Source (6)
# swarm #
# swarm
A robust and fast clustering method for amplicon-based studies.
......@@ -16,8 +16,8 @@ To help users, we describe
starting from raw fastq files, clustering with **swarm** and producing
a filtered OTU table.
**swarm** 2.0 introduces several novelties and improvements over
swarm 1.0:
swarm 2.0 introduces several novelties and improvements over swarm
1.0:
* built-in breaking phase now performed automatically,
* possibility to output OTU representatives in fasta format (option
`-w`),
......@@ -26,30 +26,7 @@ a filtered OTU table.
* a new option called *fastidious* that refines *d* = 1 results and
reduces the number of small OTUs,
Table of Contents
=================
* [Common misconceptions](#common_misconceptions)
* [Quick start](#quick_start)
* [Install](#install)
* [Prepare amplicon fasta files](#prepare_amplicon)
* [Linearization](#linearization)
* [Dereplication](#dereplication)
* [Launch swarm](#launch)
* [Frequently asked questions](#FAQ)
* [Refine swarm OTUs](#refine_OTUs)
* [Count the number of amplicons per OTU](#OTU_sizes)
* [Get the seed sequence for each swarm](#extract_seeds)
* [Get fasta sequences for all amplicons in a swarm](#extract_all)
* [Troubleshooting](#troubleshooting)
* [Citation](#citation)
* [Contact](#contact)
* [Third-party pipelines](#pipelines)
* [Alternatives](#alternatives)
* [Version history](#history)
<a name="common_misconceptions"/>
## Common misconceptions ##
## Common misconceptions
**swarm** is a single-linkage clustering method, with some superficial
similarities with other clustering methods (e.g.,
......@@ -79,8 +56,7 @@ Table of Contents
that molecular markers have limitations too.
<a name="quick_start"/>
## Quick start ##
## Quick start
**swarm** most simple usage is:
......@@ -117,10 +93,9 @@ can increase significantly. See options `-c` and `-y` to control and
cap swarm's memory consumption.
<a name="install"/>
## Install ##
Get the latest binaries for GNU/Linux or MacOSX from
Get the latest binaries for GNU/Linux or macOS from
[the release page](https://github.com/torognes/swarm/releases "swarm
tagged releases"). Get the source code from
[GitHub](https://github.com/torognes/swarm "swarm public repository")
......@@ -149,7 +124,6 @@ Once installed, the man page can be accessed with the command `man
swarm`.
<a name="prepare_amplicon"/>
## Prepare amplicon fasta files ##
To facilitate the use of **swarm**, we provide examples of shell
......@@ -164,7 +138,6 @@ from adaptors and primers (with
converted to fasta.
<a name="linearization"/>
### Linearization ###
**swarm** accepts wrapped fasta files as well as linear fasta
......@@ -180,7 +153,6 @@ awk 'NR==1 {print ; next} {printf /^>/ ? "\n"$0"\n" : $1} END {printf "\n"}' amp
```
<a name="dereplication"/>
### Dereplication (mandatory) ###
In a sample, or collection of sample, a given sequence is likely to
......@@ -248,7 +220,7 @@ Alternatively, you may specify a default abundance value with
**swarm**'s `--append-abundance` (`-a`) option to be used when
abundance information is missing from a sequence.
<a name="launch"/>
### Launch swarm ###
Here is a typical way to use **swarm**:
......@@ -268,7 +240,6 @@ See the
for details on swarm's options and parameters.
<a name="FAQ"/>
## Frequently asked questions ##
To facilitate the use of **swarm**, we provide examples of options or
......@@ -277,7 +248,6 @@ that the amplicon fasta file was prepared as describe above
(linearization and dereplication).
<a name="refine_OTUs"/>
### Refine swarm OTUs ###
The chain-breaking, which used to be performed in a second step in
......@@ -292,7 +262,6 @@ is described in the figure below:
![](https://github.com/frederic-mahe/swarm/blob/master/figures/swarm_2.0_fastidious_reduced.png)
<a name="OTU_sizes"/>
### Count the number of amplicons per OTU ###
You might want to check the size distribution of OTUs (number of
......@@ -303,7 +272,6 @@ represents an OTU and provides different metrics. See the manual for a
complete description.
<a name="extract_seeds"/>
### Get the seed sequence for each OTU ###
It is frequent for subsequent analyses to keep only one representative
......@@ -312,7 +280,6 @@ burden. That operation is easily done with **swarm** by using the `-w
filename` option.
<a name="extract_all"/>
### Get fasta sequences for all amplicons in a OTU ###
For each OTU, get the fasta sequences for all amplicons. Warning, this
......@@ -338,7 +305,6 @@ rm "${AMPLICONS}"
```
<a name="troubleshooting"/>
## Troubleshooting ##
If **swarm** exits with an error message saying `This program
......@@ -348,7 +314,6 @@ on CPUs with the SSE2 instructions, i.e. most Intel and AMD CPUs
released since 2004.
<a name="citation"/>
## Citation ##
To cite **swarm**, please refer to:
......@@ -362,7 +327,6 @@ Swarm v2: highly-scalable and high-resolution amplicon clustering.
PeerJ 3:e1420 doi: [10.7717/peerj.1420](http://dx.doi.org/10.7717/peerj.1420)
<a name="contact"/>
## Contact ##
You are welcome to:
......@@ -372,7 +336,6 @@ You are welcome to:
* compose a friendly e-mail to: Frédéric Mahé <mahe@rhrk.uni-kl.de> and Torbjørn Rognes <torognes@ifi.uio.no>
<a name="pipelines"/>
## Third-party pipelines ##
**swarm** is available in third-party pipelines:
......@@ -388,35 +351,57 @@ You are welcome to:
performing microbiome analysis from raw DNA sequencing data.
<a name="alternatives"/>
## Alternatives ##
If you want to try alternative free and open-source clustering
methods, here are some links:
* [Vsearch](https://github.com/torognes/vsearch)
* [VSEARCH](https://github.com/torognes/vsearch)
* [Oligotyping](http://merenlab.org/projects/oligotyping/)
* [DNAclust](http://dnaclust.sourceforge.net/)
* [Sumaclust](http://metabarcoding.org/sumatra)
* [Crunchclust](https://code.google.com/p/crunchclust/)
<a name="history"/>
## Version history##
<a name="version2112"/>
### version 2.2.2 ###
**swarm** 2.2.2 fixes a bug causing Swarm to wait forever in very rare
cases when multiple threads were used.
### version 2.2.1 ###
**swarm** 2.2.1 fixes a memory allocation bug for d=1.
### version 2.2.0 ###
**swarm** 2.2.0 fixes several problems and improves usability.
Corrected output to structure and uclust files when using fastidious
mode. Corrected abundance output in some cases. Added check for
duplicated sequences and fixed check for duplicated sequence
IDs. Checks for empty sequences. Sorts sequences by additional fields
to improve stability. Improves compatibility with compilers and
operating systems. Outputs sequences in upper case. Allows 64-bit
abundances. Shows message when waiting for input from stdin. Improves
error messages and warnings. Improves checking of command line
options. Fixes remaining errors reported by test suite. Updates
documentation.
### version 2.1.13 ###
**swarm** 2.1.13 removes a bug in the progress bar when writing seeds.
### version 2.1.12 ###
**swarm** 2.1.12 removes a debugging message.
<a name="version2111"/>
### version 2.1.11 ###
**swarm** 2.1.11 fixes two bugs related to the SIMD implementation
of alignment that might result in incorrect alignments and scores.
The bug only applies when d>1.
<a name="version2110"/>
### version 2.1.10 ###
**swarm** 2.1.10 fixes two bugs related to gap penalties of
......@@ -426,47 +411,39 @@ use a slightly higher gap extension penalty than specified. The
default gap extension penalty used have actually been 4.5 instead of
4.
<a name="version219"/>
### version 2.1.9 ###
**swarm** 2.1.9 fixes a problem when compiling with GCC version 6.
<a name="version218"/>
### version 2.1.8 ###
**swarm** 2.1.8 fixes a rare bug triggered when clustering extremely
short undereplicated sequences. Also, alignment parameters are not
shown when d=1.
<a name="version217"/>
### version 2.1.7 ###
**swarm** 2.1.7 fixes more problems with seed output. Ignore CR
characters in FASTA files. Improved help and error messsages.
<a name="version216"/>
### version 2.1.6 ###
**swarm** 2.1.6 fixes problems with older compilers that do not have
the x86intrin.h header file. It also fixes a bug in the output of seeds
with the `-w` option when d>1.
<a name="version215"/>
### version 2.1.5 ###
**swarm** 2.1.5 fixes minor bugs.
<a name="version214"/>
### version 2.1.4 ###
**swarm** 2.1.4 fixes minor bugs in the swarm algorithm used for d=1.
<a name="version213"/>
### version 2.1.3 ###
**swarm** 2.1.3 adds checks of numeric option arguments.
<a name="version212"/>
### version 2.1.2 ###
**swarm** 2.1.2 adds the -a (--append-abundance) option to set a
......@@ -475,29 +452,24 @@ missing from the input file. If this option is not specified, missing
abundance information will result in a fatal error. The error message
in that case is improved.
<a name="version211"/>
### version 2.1.1 ###
**swarm** 2.1.1 fixes a bug with the fastidious option that caused it
to ignore some connections between heavy and light swarms.
<a name="version210"/>
### version 2.1.0 ###
**swarm** 2.1.0 marks the first official release of swarm 2.
<a name="version207"/>
### version 2.0.7 ###
**swarm** 2.0.7 writes abundance information in usearch style when using
options `-w` (`--seeds`) in combination with `-z` (`--usearch-abundance`).
<a name="version206"/>
### version 2.0.6 ###
**swarm** 2.0.6 fixes a minor bug.
<a name="version205"/>
### version 2.0.5 ###
**swarm** 2.0.5 improves the implementation of the fastidious option
......@@ -507,41 +479,34 @@ representatives sequences with updated abundances (sum of all
abundances inside each OTU). This version also enables dereplication
when `d = 0`.
<a name="version204"/>
### version 2.0.4 ###
**swarm** 2.0.4 includes a fully parallelized fastidious option.
<a name="version203"/>
### version 2.0.3 ###
**swarm** 2.0.3 includes a working fastidious option.
<a name="version202"/>
### version 2.0.2 ###
**swarm** 2.0.2 fixes SSSE3 problems.
<a name="version201"/>
### version 2.0.1 ###
**swarm** 2.0.1 is a development release that partially implements the
fastidious option.
<a name="version200"/>
### version 2.0.0 ###
**swarm** 2.0.0 simplifies the usage of swarm by using the fast
algorithm and the built-in OTU breaking by default. Some options are
changed and some new output options are introduced.
<a name="version1221"/>
### version 1.2.21 ###
**swarm** 1.2.21 is supposed to fix some problems related to the use of the
SSSE3 CPU instructions which are not always available.
<a name="version1220"/>
### version 1.2.20 ###
**swarm** 1.2.20 presents a production-ready version of the
......@@ -554,32 +519,27 @@ rigourously identical to the results previously produced with
swarm. That release also introduces new options to control swarm
output (options `-i` and `-l`).
<a name="version1219"/>
### version 1.2.19 ###
**swarm** 1.2.19 fixes a problem related to abundance information when
the sequence identifier includes multiple underscore characters.
<a name="version1218"/>
### version 1.2.18 ###
**swarm** 1.2.18 reenables the possibility of reading sequences from
`stdin` if no file name is specified on the command line. It also
fixes a bug related to CPU features detection.
<a name="version1217"/>
### version 1.2.17 ###
**swarm** 1.2.17 fixes a memory allocation bug introduced in
version 1.2.15.
<a name="version1216"/>
### version 1.2.16 ###
**swarm** 1.2.16 fixes a bug in the abundance sort introduced in
version 1.2.15.
<a name="version1215"/>
### version 1.2.15 ###
**swarm** 1.2.15 sorts the input sequences in order of decreasing
......@@ -587,29 +547,24 @@ output (options `-i` and `-l`).
the alternative algorithm for *d* = 1 it also sorts all subseeds in
order of decreasing abundance.
<a name="version1214"/>
### version 1.2.14 ###
**swarm** 1.2.14 fixes a bug in the output with the swarm breaker
option (`-b`) when using the alternative algorithm (`-a`).
<a name="version1213"/>
### version 1.2.13 ###
**swarm** 1.2.13 updates the citation.
<a name="version1212"/>
### version 1.2.12 ###
**swarm** 1.2.12 improves speed of new search strategy for *d* = 1.
<a name="version1211"/>
### version 1.2.11 ###
**swarm** 1.2.11 corrects the number of differences reported in the
break swarms output.
<a name="version1210"/>
### version 1.2.10 ###
**swarm** 1.2.10 allows amplicon abundances to be specified using the
......@@ -617,7 +572,6 @@ output (options `-i` and `-l`).
`-z` option is chosen. Also fixes the bad URL shown in the previous
version of swarm.
<a name="version129"/>
### version 1.2.9 ###
**swarm** 1.2.9 includes a parallelized variant of the new search
......@@ -630,7 +584,6 @@ output (options `-i` and `-l`).
differences between the seed and the amplicon is indicated in the
last column.
<a name="version128"/>
### version 1.2.8 ###
**swarm** 1.2.8 fixes an error with the gap extension
......@@ -641,13 +594,11 @@ output (options `-i` and `-l`).
sequences or more. The new strategy can be turned on with the `-a`
option.
<a name="version127"/>
### version 1.2.7 ###
**swarm** 1.2.7 incorporates a few small changes and improvements to
make it ready for integration into QIIME.
<a name="version126"/>
### version 1.2.6 ###
**swarm** 1.2.6 add an option (`-r` or `--mothur`) to format the
......@@ -656,7 +607,6 @@ output (options `-i` and `-l`).
input sequences it will now report the illegal character and the
line number.
<a name="version125"/>
### version 1.2.5 ###
**swarm** 1.2.5 can be run on CPUs without the POPCNT feature. It
......@@ -664,7 +614,6 @@ output (options `-i` and `-l`).
the appropriate code. The code that avoids POPCNT is just slightly
slower. Only basic SSE2 is now required.
<a name="version124"/>
### version 1.2.4 ###
**swarm** 1.2.4 changes the name of the new option from
......@@ -672,7 +621,6 @@ output (options `-i` and `-l`).
options, and also adds a companion script `swarm_breaker.py` to
refine swarm results (`scripts` folder).
<a name="version123"/>
### version 1.2.3 ###
**swarm** 1.2.3 adds an option (`-b` or `--break_swarms`) to output
......@@ -681,20 +629,18 @@ output (options `-i` and `-l`).
the inline assembly code is also changed for compatibility with more
compilers.
<a name="version122"/>
### version 1.2.2 ###
**swarm** 1.2.2 fixes an issue with incorrect values in the statistics
file (maximum generation and radius of swarms). This version is also
a bit faster.
<a name="version121"/>
### version 1.2.1 ###
**swarm** 1.2.1 removes the need for a SSE4.1 capable CPU and should
now be able to run on most servers, desktops and laptops.
<a name="version120"/>
### version 1.2.0 ###
**swarm** 1.2.0 introduces a pre-filtering of similar amplicons based
......@@ -705,7 +651,7 @@ output (options `-i` and `-l`).
a computational overhead, but becomes more and more efficient as the
size of the amplicon set increases.
<a name="version111"/>
### version 1.1.1 ###
**swarm** now works on Apple computers. This version also corrects an
......@@ -713,7 +659,7 @@ output (options `-i` and `-l`).
sub-optimal alignments. Slightly different alignments may result
relative to previous version, giving slightly different swarms.
<a name="version110"/>
### version 1.1.0 ###
**swarm** 1.1.0 introduces new optimizations and is 20% faster than
......
swarm-cluster (2.2.2+dfsg-1) unstable; urgency=medium
* New upstream version (no pdf versions of makefiles any more)
* Standards-Version: 4.1.3
* debhelper 11
-- Andreas Tille <tille@debian.org> Mon, 19 Feb 2018 11:13:33 +0100
swarm-cluster (2.1.12+dfsg-1) unstable; urgency=medium
* New upstream version
......
......@@ -4,10 +4,10 @@ Uploaders: Tim Booth <tbooth@ceh.ac.uk>,
Andreas Tille <tille@debian.org>
Section: science
Priority: optional
Build-Depends: debhelper (>= 10),
Build-Depends: debhelper (>= 11~),
man-db,
python-markdown
Standards-Version: 3.9.8
Standards-Version: 4.1.3
Vcs-Browser: https://anonscm.debian.org/cgit/debian-med/swarm-cluster.git
Vcs-Git: https://anonscm.debian.org/git/debian-med/swarm-cluster.git
Homepage: https://github.com/torognes/swarm
......
......@@ -2,14 +2,15 @@ Description: allow to override $CXX
Author: Sascha Steinbiss <satta@debian.org>
--- a/src/Makefile
+++ b/src/Makefile
@@ -30,8 +30,8 @@
@@ -28,9 +28,9 @@ COMMON=-g
LIBS=-lpthread
LINKFLAGS=$(COMMON) $(LDFLAGS)
-CXX=g++
-CXXFLAGS=$(COMPILEOPT) $(COMMON)
+CXX?=g++
+CXXFLAGS+=$(COMPILEOPT) $(COMMON)
WARNINGS=-Wall -Wsign-compare -Wextra -Wpedantic -Wno-long-long
-CXXFLAGS=$(COMMON) $(WARNINGS) -O3 -msse2 -mtune=core2 -Icityhash
+CXXFLAGS+=$(COMMON) $(WARNINGS) -O3 -msse2 -mtune=core2 -Icityhash
PROG=swarm
......@@ -2,12 +2,12 @@ Description: fix compilation with GCC 6
Author: Sascha Steinbiss <satta@debian.org>
--- a/src/Makefile
+++ b/src/Makefile
@@ -25,7 +25,7 @@
#COMMON=-pg -g
COMMON=-g
@@ -30,7 +30,7 @@ LINKFLAGS=$(COMMON) $(LDFLAGS)
-COMPILEOPT=-Wall -Wsign-compare -O3 -msse2 -mtune=core2 -Icityhash
+COMPILEOPT=-Wall -Wsign-compare -O3 -msse2 -mtune=core2 -Icityhash -std=gnu++98
CXX?=g++
WARNINGS=-Wall -Wsign-compare -Wextra -Wpedantic -Wno-long-long
-CXXFLAGS+=$(COMMON) $(WARNINGS) -O3 -msse2 -mtune=core2 -Icityhash
+CXXFLAGS+=$(COMMON) $(WARNINGS) -O3 -msse2 -mtune=core2 -Icityhash -std=gnu++98
PROG=swarm
LIBS=-lpthread
LINKFLAGS=$(COMMON) $(LDFLAGS)
......@@ -4,27 +4,27 @@ Description: Propagate hardening options
--- a/src/Makefile
+++ b/src/Makefile
@@ -28,7 +28,7 @@
COMPILEOPT=-Wall -Wsign-compare -O3 -msse2 -mtune=core2 -Icityhash
@@ -26,7 +26,7 @@
COMMON=-g
LIBS=-lpthread
-LINKFLAGS=$(COMMON)
+LINKFLAGS=$(COMMON) $(LDFLAGS)
CXX=g++
CXXFLAGS=$(COMPILEOPT) $(COMMON)
@@ -44,7 +44,7 @@
.SUFFIXES:.o .cc
%.o : %.cc $(DEPS)
- $(CXX) $(CXXFLAGS) -c -o $@ $<
+ $(CXX) $(CPPFLAGS) $(CXXFLAGS) -c -o $@ $<
WARNINGS=-Wall -Wsign-compare -Wextra -Wpedantic -Wno-long-long
@@ -43,7 +43,7 @@ DEPS=Makefile swarm.h bitmap.h bloom.h c
all : $(PROG)
@@ -57,4 +57,4 @@
swarm : $(OBJS)
- $(CXX) $(LINKFLAGS) -o $@ $(OBJS) $(LIBS)
+ $(CXX) $(CPPFLAGS) $(LINKFLAGS) -o $@ $(OBJS) $(LIBS)
mkdir -p ../bin
cp -a swarm ../bin
@@ -51,4 +51,4 @@ clean :
rm -rf swarm *.o *~ ../bin/ gmon.out cityhash/*.o ../man/*~ ../*~
ssse3.o : ssse3.cc $(DEPS)
- $(CXX) -mssse3 $(CXXFLAGS) -c -o $@ $<
+ $(CXX) -mssse3 $(CPPFLAGS) $(CXXFLAGS) -c -o $@ $<
- $(CXX) $(CXXFLAGS) -mssse3 -c -o $@ $<
+ $(CXX) $(CPPFLAGS) $(CXXFLAGS) -mssse3 -c -o $@ $<
.\" ============================================================================
.TH swarm 1 "January 16, 2017" "version 2.1.12" "USER COMMANDS"
.TH swarm 1 "December 12, 2017" "version 2.2.2" "USER COMMANDS"
.\" ============================================================================
.SH NAME
swarm \(em find clusters of nearly-identical nucleotide amplicons
......@@ -42,22 +42,24 @@ extremely fast Needleman-Wunsch algorithm making use of the Streaming
SIMD Extensions (SSE2) of modern x86-64 CPUs. If SSE2 instructions are
not available, \fBswarm\fR exits with an error message.
.PP
\fBswarm\fR reads the named input \fIfilename\fR, a fasta file of
nucleotide amplicons. The amplicon identifier is defined as the string
comprised between the ">" symbol and the first space or the end of the
\fBswarm\fR can read nucleotide amplicons in fasta format from a
normal file or from the standard input (using a pipe or a
redirection). The amplicon identifier is defined as the string
comprised between the '>' symbol and the first space or the end of the
line, whichever comes first. As \fBswarm\fR outputs lists of amplicon
identifiers, amplicon identifiers must be unique to avoid ambiguity;
swarm exits with an error message if identifiers are not
unique. Amplicon identifiers must end with a "_" followed by a
positive integer representing the amplicon copy number (or abundance
annotation; usearch/vsearch users can use the option \-z to change
that behavior). Abundance annotations play a crucial role in the
clustering process, and swarm exits with an error message if that
swarm exits with an error message if identifiers are not unique.
Amplicon identifiers must end with a '_' followed by a positive
integer representing the amplicon copy number (or abundance
annotation; usearch/vsearch users can use the option \-z to use
the ';size=' annotation). Abundance annotations play a crucial role in
the clustering process, and swarm exits with an error message if that
information is not available. The amplicon sequence is defined as a
string of [acgt] or [acgu] symbols (case insensitive), starting after
the end of the identifier line and ending before the next identifier
line or the file end; \fBswarm\fR exits with an error message if any
other symbol is present.
string of [ACGT] or [ACGU] symbols (case insensitive, 'U' is replaced
with 'T' internally), starting after the end of the identifier line
and ending before the next identifier line or the file end;
\fBswarm\fR silently removes newline symbols ('\\n' or '\\r') and
exits with an error message if any other symbol is present.
.\" ----------------------------------------------------------------------------
.SS General options
.TP 9
......@@ -74,8 +76,8 @@ output version information and exit successfully.
.TP
.B \-\-
delimit the option list. Later arguments, if any, are treated as
operands even if they begin with "\-". For example, "swarm \-\-
\-file.fasta" reads from the file "\-file.fasta".
operands even if they begin with '\-'. For example, 'swarm \-\-
\-file.fasta' reads from the file '\-file.fasta'.
.LP
.\" ----------------------------------------------------------------------------
.SS Clustering options
......@@ -86,9 +88,9 @@ that two amplicons will be grouped if they have \fIinteger\fR (or
less) differences. This is \fBswarm\fR's most important parameter. The
number of differences is calculated as the number of mismatches
(substitutions, insertions or deletions) between the two amplicons
once the optimal pairwise global alignment has been found (see
"pairwise alignment advanced options" to influence that step). Any
\fIinteger\fR between 0 and 256 can be used, but high \fId\fR values
once the optimal pairwise global alignment has been found
(see 'pairwise alignment advanced options' to influence that step).
Any \fIinteger\fR from 0 to 255 can be used, but high \fId\fR values
will decrease the taxonomical resolution of \fBswarm\fR
results. Commonly used \fId\fR values are 1, 2 or 3, rarely
higher. When using \fId\fR = 0, \fBswarm\fR will output results
......@@ -122,7 +124,7 @@ of \fBswarm\fR results. Default mass of a large OTU is 3.
when using the option \-\-fastidious (\-f), define \fBswarm\fR's
maximum memory footprint (in megabytes). \fBswarm\fR will adjust the
\-\-bloom\-bits (\-y) value of the Bloom filter to fit within the
specified amount of memory.
specified amount of memory. The value must be at least 3.
.TP
.B \-f\fP,\fB\ \-\-fastidious
when working with \fId\fR = 1, perform a second clustering pass to
......@@ -142,8 +144,9 @@ clustering results: the output files produced by the options \-\-log
(\-l), \-\-output\-file (\-o), \-\-mothur (\-r), \-\-uclust\-file, and
\-\-seeds (\-w) are updated to reflect these modifications; the file
\-\-statistics\-file (\-s) is partially updated (columns 6 and 7 are
not updated); the output file \-\-internal\-structure (\-i) is not
updated.
not updated); the output file \-\-internal\-structure (\-i) is
partially updated (column 5 is not updated for amplicons that belonged to
the small OTU).
.TP
.BI \-y\fP,\fB\ \-\-bloom\-bits\~ "positive integer"
when using the option \-\-fastidious (\-f), define the size (in bits)
......@@ -159,9 +162,10 @@ alternative way to control the memory footprint.
.TP 9
.BI \-a\fP,\fB\ \-\-append\-abundance\~ "positive integer"
set abundance value to use when some or all amplicons in the input
file lack abundance values. Warning, it is not recommended to use
\fBswarm\fR on datasets where abundance values are all identical. We
provide that option as a courtesy to advanced users, please use it
file lack abundance values (_\fIinteger\fR, or ;size=\fIinteger\fR;
when using \-z). Warning, it is not recommended to use \fBswarm\fR on
datasets where abundance values are all identical. We provide that
option as a courtesy to advanced users, please use it
carefully. \fBswarm\fR exits with an error message if abundance values
are missing and if this option is not used.
.TP
......@@ -183,8 +187,12 @@ OTU number (\fIpositive integer\fR). OTUs are numbered in their order
of delineation, starting from 1. All pairs of amplicons belonging to
the same OTU will receive the same number.
.IP \n+[step].
number of steps from the OTU seed to amplicon B (\fIpositive
integer\fR).
cummulated number of steps from the OTU seed to amplicon B
(\fIpositive integer\fR). When using the option \-\-fastidious (\-f),
the actual number of steps between grafted amplicons and the OTU seed
cannot be re-computed efficiently and is always set to 2 for the
amplicon pair linking the small OTU to the big OTU. Cummulated number
of steps in the small OTU (if any) are left unchanged.
.RE
.RE
.TP
......@@ -222,17 +230,54 @@ initial seed abundance,
number of amplicons with an abundance of 1 in the OTU,
.IP \n+[step].
maximum number of iterations before the OTU reached its natural
limit),
limit,
.IP \n+[step].
theoretical maximum radius of the OTU (i.e., number of cummulated
differences between the seed and the furthermost amplicon in the
OTU). The actual maximum radius of the OTU is often much smaller.
cummulated number of steps along the path joining the seed and the
furthermost amplicon in the OTU. Please note that the actual number of
differences between the seed and the furthermost amplicon is usually
much smaller. When using the option \-\-fastidious (\-f), grafted
amplicons are not taken into account.
.RE
.RE
.TP
.BI \-u\fP,\fB\ \-\-uclust\-file \0filename
output clustering results in uclust-like file format to the specified
file. That option does not modify \fBswarm\fR's default output format.
output clustering results in \fIfilename\fR using a tab-separated
uclust-like format with 10 columns and 3 different type of entries (S,
H or C). That option does not modify \fBswarm\fR's default output
format. Each fasta sequence in the input file can be either a cluster
centroid (S) or a hit (H) assigned to a cluster. Cluster records (C)
summarize information (size, centroid label) for each cluster. Column
content varies with the type of entry (S, H or C):
.RS
.RS
.nr step 1 1
.IP \n[step]. 4
Record type: S, H, or C.
.IP \n+[step].
Cluster number (zero-based).
.IP \n+[step].
Centroid length (S), query length (H), or cluster size (C).
.IP \n+[step].
Percentage of similarity with the centroid sequence (H), or set to '*'
(S, C).
.IP \n+[step].
Match orientation + or - (H), or set to '*' (S, C).
.IP \n+[step].
Not used, always set to '*' (S, C) or to zero (H).
.IP \n+[step].
Not used, always set to '*' (S, C) or to zero (H).
.IP \n+[step].
set to '*' (S, C) or, for H, compact representation of the pairwise
alignment using the CIGAR format (Compact Idiosyncratic Gapped
Alignment Report): M (match), D (deletion) and I (insertion). The
equal sign '=' indicates that the query is identical to the centroid
sequence.
.IP \n+[step].
Label of the query sequence (H), or of the centroid sequence (S, C).
.IP \n+[step].
Label of the centroid sequence (H), or set to '*' (S, C).
.RE
.RE
.TP
.BI \-w\fP,\fB\ \-\-seeds \0filename
output OTU representatives to \fIfilename\fR in fasta format. The
......@@ -344,13 +389,37 @@ New features and important modifications of \fBswarm\fR (short lived
or minor bug releases are not mentioned):
.RS
.TP
.BR v2.2.2\~ "released December 12, 2017"
Version 2.2.2 fixes a bug that would cause Swarm to wait forever in
very rare cases when multiple threads were used.
.TP
.BR v2.2.1\~ "released October 27, 2017"
Version 2.2.1 fixes a memory allocation bug for \fId\fR = 1 and
duplicated sequences.
.TP
.BR v2.2.0\~ "released October 17, 2017"
Version 2.2.0 fixes several problems and improves usability. Corrected
output to structure and uclust files when using fastidious
mode. Corrected abundance output in some cases. Added check for
duplicated sequences and fixed check for duplicated sequence
IDs. Checks for empty sequences. Sorts sequences by additional fields
to improve stability. Improves compatibility with compilers and
operating systems. Outputs sequences in upper case. Allows 64-bit
abundances. Shows message when waiting for input from stdin. Improves
error messages and warnings. Improves checking of command line
options. Fixes remaining errors reported by test suite. Updates
documentation.
.TP
.BR v2.1.13\~ "released March 8, 2017"
Version 2.1.13 removes a bug with the progress bar when writing seeds.
.TP
.BR v2.1.12\~ "released January 16, 2017"
Version 2.1.12 removes a debugging message.
.TP
.BR v2.1.11\~ "released January 16, 2017"
Version 2.1.11 fixes two bugs related to the SIMD implementation
of alignment that might result in incorrect alignments and scores.
The bug only applies when d>1.
The bug only applies when \fId\fR > 1.
.TP
.BR v2.1.10\~ "released December 22, 2016"
Version 2.1.10 fixes two bugs related to gap penalties of alignments.
......@@ -370,7 +439,7 @@ shown when \fId\fR = 1.
.BR v2.1.7\~ "released February 24, 2016"
Version 2.1.7 fixes a bug in the output of seeds with the \-w option
when \fId\fR > 1 that was not properly fixed in version 2.1.6. It also
handles ascii character 13 (CR) in FASTA files better. Swarm will
handles ascii character #13 (CR) in FASTA files better. Swarm will
now exit with status 0 if the \-h or the \-v option is specified. The
help text and some error messages have been improved.
.TP
......@@ -484,7 +553,7 @@ has been noticeably improved.
.TP
.BR v1.2.10\~ "released August 8, 2014"
Version 1.2.10 allows amplicon abundances to be specified using the
usearch style in the sequence header (e.g. ">id;size=1") when the \-z
usearch style in the sequence header (e.g. '>id;size=1') when the \-z
option is chosen.
.TP
.BR v1.2.8\~ "released August 5, 2014"
......
......@@ -25,13 +25,12 @@
#COMMON=-pg -g
COMMON=-g
COMPILEOPT=-Wall -Wsign-compare -O3 -msse2 -mtune=core2 -Icityhash
LIBS=-lpthread
LINKFLAGS=$(COMMON)
CXX=g++
CXXFLAGS=$(COMPILEOPT) $(COMMON)
WARNINGS=-Wall -Wsign-compare -Wextra -Wpedantic -Wno-long-long
CXXFLAGS=$(COMMON) $(WARNINGS) -O3 -msse2 -mtune=core2 -Icityhash
PROG=swarm
......@@ -41,11 +40,6 @@ OBJS=swarm.o db.o search8.o search16.o nw.o matrix.o util.o scan.o \
DEPS=Makefile swarm.h bitmap.h bloom.h cityhash/config.h cityhash/city.h \
threads.h
.SUFFIXES:.o .cc
%.o : %.cc $(DEPS)
$(CXX) $(CXXFLAGS) -c -o $@ $<
all : $(PROG)
swarm : $(OBJS)
......@@ -57,4 +51,4 @@ clean :
rm -rf swarm *.o *~ ../bin/ gmon.out cityhash/*.o ../man/*~ ../*~
ssse3.o : ssse3.cc $(DEPS)
$(CXX) -mssse3 $(CXXFLAGS) -c -o $@ $<
$(CXX) $(CXXFLAGS) -mssse3 -c -o $@ $<
......@@ -55,14 +55,12 @@ void algo_run()
count_comparisons_8 = 0;
count_comparisons_16 = 0;
unsigned long searches = 0;
#ifdef VERBOSE
unsigned long searches = 0;
unsigned long estimates = 0;
#endif
unsigned long largestswarm = 0;
unsigned long swarmsize = 0;
unsigned long maxgenerations = 0;
......@@ -113,7 +111,7 @@ void algo_run()
unsigned long bits;
if ((unsigned long)resolution <= diff_saturation)
if ((unsigned long)opt_differences <= diff_saturation)
bits = 8;
else
bits = 16;
......@@ -131,6 +129,7 @@ void algo_run()
swarmid++;
unsigned long swarmsize = 0;
unsigned long amplicons_copies = 0;
unsigned long singletons = 0;
unsigned long hitcount = 0;
......@@ -184,7 +183,7 @@ void algo_run()
unsigned poolampliconid = qgramamps[i];
long diff = qgramdiffs[i];
amps[swarmed+i].diffestimate = diff;
if (diff <= resolution)
if (diff <= opt_differences)
{
targetindices[targetcount] = swarmed+i;
targetampliconids[targetcount] = poolampliconid;
......@@ -196,7 +195,9 @@ void algo_run()
{
search_do(seedampliconid, targetcount, targetampliconids,
scores, diffs, alignlengths, bits);
#ifdef VERBOSE
searches++;
#endif
if (bits == 8)
count_comparisons_8 += targetcount;
......@@ -218,7 +219,7 @@ void algo_run()
unsigned diff = diffs[t];
if (diff <= (unsigned long) resolution)
if (diff <= (unsigned long) opt_differences)
{
unsigned i = targetindices[t];
......@@ -294,7 +295,7 @@ void algo_run()
for(unsigned long i=swarmed; i<amplicons; i++)
{
unsigned long targetampliconid = amps[i].ampliconid;
if ((amps[i].diffestimate <= subseedradius + resolution) &&
if ((amps[i].diffestimate <= subseedradius + opt_differences) &&
((opt_no_otu_breaking) ||
(db_getabundance(targetampliconid)
<= subseedabundance)))
......@@ -313,7 +314,7 @@ void algo_run()
#endif
for(unsigned long i=0; i < listlen; i++)
if ((long)qgramdiffs[i] <= resolution)
if ((long)qgramdiffs[i] <= opt_differences)
{
targetindices[targetcount] = qgramindices[i];
targetampliconids[targetcount] = qgramamps[i];
......@@ -324,7 +325,9 @@ void algo_run()
{
search_do(subseedampliconid, targetcount, targetampliconids,
scores, diffs, alignlengths, bits);
#ifdef VERBOSE
searches++;
#endif
if (bits == 8)
count_comparisons_8 += targetcount;
......@@ -335,7 +338,7 @@ void algo_run()
{
unsigned diff = diffs[t];
if (diff <= (unsigned long) resolution)
if (diff <= (unsigned long) opt_differences)
{
unsigned i = targetindices[t];
......@@ -486,12 +489,12 @@ void algo_run()
char sep_amplicons;
char sep_swarms;
if (mothur)
if (opt_mothur)
{
/* mothur list file output */
sep_amplicons = ',';
sep_swarms = '\t';
fprintf(outfile, "swarm_%ld\t%lu\t", resolution, swarmid);
fprintf(outfile, "swarm_%ld\t%lu\t", opt_differences, swarmid);
}
else
{
......
......@@ -42,7 +42,7 @@ static struct ampinfo_s
int parent;
int generation;
int next; /* amp id of next amplicon in swarm */
int graft_cand; /* amp id of potential grafting parent (fastitdious) */
int graft_cand; /* amp id of potential grafting parent (fastidious) */
} * ampinfo = 0;
/* Information about each swarm (OTU) */
......@@ -375,7 +375,7 @@ void * worker(void * vp)
while (tip->work >= 0)
{
/* wait for work available */
if (tip->work == 0)
while (tip->work == 0)
pthread_cond_wait(&tip->workcond, &tip->workmutex);
if (tip->work > 0)
{
......@@ -397,10 +397,10 @@ void threads_init()
/* allocate memory for thread info, incl the variant sequences */
unsigned long longestamplicon = db_getlongestsequence();
ti = (struct thread_info_s *)
xmalloc(threads * sizeof(struct thread_info_s));
xmalloc(opt_threads * sizeof(struct thread_info_s));
/* init and create worker threads */
for(unsigned long t=0; t<threads; t++)
for(long t=0; t<opt_threads; t++)
{
struct thread_info_s * tip = ti + t;
tip->varseq = (unsigned char*) xmalloc(longestamplicon+1);
......@@ -417,7 +417,7 @@ void threads_init()
void threads_done()
{
/* finish and clean up worker threads */
for(unsigned long t=0; t<threads; t++)
for(long t=0; t<opt_threads; t++)
{
struct thread_info_s * tip = ti + t;
......@@ -442,37 +442,18 @@ void threads_done()
pthread_attr_destroy(&attr);
}
void swarm_breaker_info(int amp)
{
/* output info for swarm_breaker script */
if (opt_internal_structure)
{
long seed = ampinfo[amp].parent;
fprint_id_noabundance(internal_structure_file, seed);
fprintf(internal_structure_file, "\t");
fprint_id_noabundance(internal_structure_file, amp);
fprintf(internal_structure_file, "\t%d", 1);
fprintf(internal_structure_file, "\t%d\t%d",
ampinfo[seed].swarmid + 1,
ampinfo[amp].generation);
fprintf(internal_structure_file, "\n");
}
}
void add_amp_to_swarm(int amp)
{
/* add to swarm */
ampinfo[current_swarm_tail].next = amp;
current_swarm_tail = amp;
swarm_breaker_info(amp);
}
void process_seed(int seed, int subseed)
void process_seed(int subseed)
{
unsigned long seqlen = db_getsequencelen(subseed);
threads_used = threads;
threads_used = opt_threads;
if (threads_used > seqlen + 1)
threads_used = seqlen+1;
......@@ -510,6 +491,7 @@ void process_seed(int seed, int subseed)
{
if (global_hits_count + ti[t].hits_count > global_hits_alloc)
{
while (global_hits_count + ti[t].hits_count > global_hits_alloc)
global_hits_alloc <<= 1;
global_hits_data = (int*)xrealloc(global_hits_data,
global_hits_alloc * sizeof(int));
......@@ -648,7 +630,12 @@ int attach_candidates(int amplicons)
int parent = graft_array[i].parent;
int child = graft_array[i].child;
if (!swarminfo[ampinfo[child].swarmid].attached)
if (swarminfo[ampinfo[child].swarmid].attached)
{
/* this light swarm is already attached */
ampinfo[child].graft_cand = NO_SWARM;
}
else
{
/* attach child to parent */
attach(parent, child);
......@@ -949,6 +936,8 @@ BloomFilter * bloomp;
void mark_light_thread(long t)
{
(void) t;
char * buffer1 = (char*) xmalloc(db_getlongestsequence() + 2);
pthread_mutex_lock(&light_mutex);
while (light_progress < light_amplicon_count)
......@@ -976,6 +965,8 @@ static long amplicons;
void check_heavy_thread(long t)
{
(void) t;
char * buffer1 = (char*) xmalloc(db_getlongestsequence() + 2);
char * buffer2 = (char*) xmalloc(db_getlongestsequence() + 3);
pthread_mutex_lock(&heavy_mutex);
......@@ -1071,7 +1062,7 @@ void algo_d1_run()
global_hits_count = 0;
/* find the first generation matches */
process_seed(seed, seed);
process_seed(seed);
/* sort hits */
qsort(global_hits_data, global_hits_count,
......@@ -1089,7 +1080,7 @@ void algo_d1_run()
global_hits_count = 0;
while(subseed != NO_SWARM)
{
process_seed(seed, subseed);
process_seed(subseed);
update_stats(subseed);
subseed = ampinfo[subseed].next;
}
......@@ -1227,7 +1218,7 @@ void algo_d1_run()
}
fprintf(logfile,
"Bloom filter: bits=%ld, m=%ld, k=%ld, size=%.1lfMB\n",
"Bloom filter: bits=%ld, m=%ld, k=%ld, size=%.1fMB\n",
bits, m, k, 1.0 * m / (8*1024*1024));
bloomp = new BloomFilter(m, k);
......@@ -1247,7 +1238,7 @@ void algo_d1_run()
light_progress = 0;
light_amplicon_count = amplicons_in_small_otus;
light_amplicon = amplicons - 1;
ThreadRunner * tr = new ThreadRunner(threads, mark_light_thread);
ThreadRunner * tr = new ThreadRunner(opt_threads, mark_light_thread);
tr->run();
delete tr;
pthread_mutex_destroy(&light_mutex);
......@@ -1286,7 +1277,7 @@ void algo_d1_run()
heavy_progress = 0;
heavy_amplicon_count = amplicons_in_large_otus;
heavy_amplicon = 0;
ThreadRunner * heavy_tr = new ThreadRunner(threads,
ThreadRunner * heavy_tr = new ThreadRunner(opt_threads,
check_heavy_thread);
heavy_tr->run();
delete heavy_tr;
......@@ -1331,8 +1322,8 @@ void algo_d1_run()
progress_init("Writing swarms: ", swarmcount);
if (mothur)
fprintf(outfile, "swarm_%ld\t%ld", resolution, swarmcount_adjusted);
if (opt_mothur)
fprintf(outfile, "swarm_%ld\t%ld", opt_differences, swarmcount_adjusted);
for(int i = 0; i < swarmcount; i++)
{
......@@ -1343,7 +1334,7 @@ void algo_d1_run()
a >= 0;
a = ampinfo[a].next)
{
if (mothur)
if (opt_mothur)
{
if (a == seed)
fputc('\t', outfile);
......@@ -1357,13 +1348,13 @@ void algo_d1_run()
}
fprint_id(outfile, a);
}
if (!mothur)
if (!opt_mothur)
fputc('\n', outfile);
}
progress_update(i+1);
}
if (mothur)
if (opt_mothur)
fputc('\n', outfile);
progress_done();
......@@ -1373,7 +1364,7 @@ void algo_d1_run()
if (opt_seeds)
{
progress_init("Writing seeds: ", swarmcount_adjusted);
progress_init("Writing seeds: ", swarmcount);
for(int i=0; i < swarmcount; i++)
{
if (!swarminfo[i].attached)
......@@ -1390,10 +1381,80 @@ void algo_d1_run()
}
/* output internal structure */
if (opt_internal_structure)
{
unsigned int cluster_no = 0;
progress_init("Writing structure:", swarmcount);
for(unsigned int swarmid = 0; swarmid < swarmcount ; swarmid++)
{
if (!swarminfo[swarmid].attached)
{
int seed = swarminfo[swarmid].seed;
struct ampinfo_s * bp = ampinfo + seed;
for (int a = bp->next;
a >= 0;
a = ampinfo[a].next)
{
long graft_parent = ampinfo[a].graft_cand;
if (graft_parent != NO_SWARM)
{
fprint_id_noabundance(internal_structure_file, graft_parent);
fprintf(internal_structure_file, "\t");
fprint_id_noabundance(internal_structure_file, a);
fprintf(internal_structure_file,
"\t%d\t%d\t%d\n",
2,
cluster_no + 1,
ampinfo[graft_parent].generation + 1);
}
long parent = ampinfo[a].parent;
if (parent != NO_SWARM)
{
int diff = 1;
if (duplicates_found)
{
unsigned long parentseqlen = db_getsequencelen(parent);
unsigned long ampseqlen = db_getsequencelen(a);
if (parentseqlen == ampseqlen)
{
unsigned char * parentseq = (unsigned char *) db_getsequence(parent);
unsigned char * ampseq = (unsigned char *) db_getsequence(a);
if (memcmp(parentseq, ampseq, parentseqlen) == 0)
diff = 0;
}
}
fprint_id_noabundance(internal_structure_file, parent);
fprintf(internal_structure_file, "\t");
fprint_id_noabundance(internal_structure_file, a);
fprintf(internal_structure_file,
"\t%d\t%d\t%d\n",
diff,
cluster_no + 1,
ampinfo[a].generation);
}
}
cluster_no++;
}
progress_update(swarmid);
}
progress_done();
}
/* output swarm in uclust format */
if (uclustfile)
{
unsigned int cluster_no = 0;
progress_init("Writing UCLUST: ", swarmcount);
for(unsigned int swarmid = 0; swarmid < swarmcount ; swarmid++)
......@@ -1405,13 +1466,13 @@ void algo_d1_run()
struct ampinfo_s * bp = ampinfo + seed;
fprintf(uclustfile, "C\t%u\t%d\t*\t*\t*\t*\t*\t",
swarmid,
cluster_no,
swarminfo[swarmid].size);
fprint_id(uclustfile, seed);
fprintf(uclustfile, "\t*\n");
fprintf(uclustfile, "S\t%u\t%lu\t*\t*\t*\t*\t*\t",
swarmid,
cluster_no,
db_getsequencelen(seed));
fprint_id(uclustfile, seed);
fprintf(uclustfile, "\t*\n");
......@@ -1440,7 +1501,7 @@ void algo_d1_run()
fprintf(uclustfile,
"H\t%d\t%lu\t%.1f\t+\t0\t0\t%s\t",
ampinfo[seed].swarmid,
cluster_no,
db_getsequencelen(a),
percentid,
nwdiff > 0 ? nwalignment : "=");
......@@ -1453,6 +1514,8 @@ void algo_d1_run()
if (nwalignment)
free(nwalignment);
}
cluster_no++;
}
progress_update(swarmid);
}
......
......@@ -29,7 +29,7 @@ class Bitmap
public:
Bitmap(size_t _size)
explicit Bitmap(size_t _size)
{
size = _size;
data = (unsigned char *) xmalloc((size+7)/8);
......
......@@ -90,21 +90,31 @@ void showseq(char * seq)
void fprint_id(FILE * stream, unsigned long x)
{
fprintf(stream, "%.*s", seqindex[x].headeridlen, seqindex[x].header);
seqinfo_t * sp = seqindex + x;
char * h = sp->header;
int hdrlen = sp->headerlen;
if (opt_append_abundance && (sp->abundance_start == sp->abundance_end))
if (opt_usearch_abundance)
fprintf(stream, "%.*s;size=%lu;", hdrlen, h, sp->abundance);
else
fprintf(stream, "%.*s_%lu", hdrlen, h, sp->abundance);
else
fprintf(stream, "%.*s", hdrlen, h);
}
void fprint_id_noabundance(FILE * stream, unsigned long x)
{
seqinfo_t * sp = seqindex + x;
char * h = sp->header;
int hdrlen = sp->headeridlen;
int hdrlen = sp->headerlen;
if (sp->abundance_start < sp->abundance_end)
{
/* print start of header */
fprintf(stream, "%.*s", sp->abundance_start, h);
if (usearch_abundance)
if (opt_usearch_abundance)
{
/* print semicolon if the abundance is not at either end */
if ((sp->abundance_start > 0) && (sp->abundance_end < hdrlen))
......@@ -115,9 +125,7 @@ void fprint_id_noabundance(FILE * stream, unsigned long x)
}
}
else
{
fprintf(stream, "%s", h);
}
fprintf(stream, "%.*s", hdrlen, h);
}
void fprint_id_with_new_abundance(FILE * stream,
......@@ -126,14 +134,14 @@ void fprint_id_with_new_abundance(FILE * stream,
{
seqinfo_t * sp = seqindex + seqno;
if (usearch_abundance)
if (opt_usearch_abundance)
fprintf(stream,
"%.*s%ssize=%lu;%.*s",
sp->abundance_start,
sp->header,
sp->abundance_start > 0 ? ";" : "",
abundance,
sp->headeridlen - sp->abundance_end,
sp->headerlen - sp->abundance_end,
sp->header + sp->abundance_end);
else
fprintf(stream,
......@@ -153,14 +161,7 @@ int db_compare_abundance(const void * a, const void * b)
else if (x->abundance < y->abundance)
return +1;
else
{
if (x < y)
return -1;
else if (x > y)
return +1;
else
return 0;
}
return strcmp(x->header, y->header);
}
void db_read(const char * filename)
......@@ -189,14 +190,15 @@ void db_read(const char * filename)
/* get file size */
long filesize = 0;
if (filename)
{
if (fseek(fp, 0, SEEK_END))
fatal("Error: Unable to seek in database file (%s)", filename);
filesize = ftell(fp);
rewind(fp);
}
struct stat fs;
if (fstat(fileno(fp), & fs))
fatal("Unable to fstat on input file (%s)", filename);
bool is_regular = S_ISREG(fs.st_mode);
long filesize = is_regular ? fs.st_size : 0;
if (! is_regular)
fprintf(logfile, "Waiting for data... (Hit Ctrl-C and run swarm -h if you meant to read data from a file.)\n");
char line[LINEALLOC];
line[0] = 0;
......@@ -263,10 +265,11 @@ void db_read(const char * filename)
while (line[0] && (line[0] != '>'))
{
char m;
char c;
char * p = line;
while((c = *p++))
{
char m;
if ((m = map_nt[(int)c]) >= 0)
{
while (datalen >= dataalloc)
......@@ -280,12 +283,18 @@ void db_read(const char * filename)
}
else if ((c != 10) && (c != 13))
{
char msg[100];
if ((c >= 32) && (c <= 126))
snprintf(msg, 100, "Illegal character '%c' in sequence on line %u", c, lineno);
fprintf(stderr,
"Illegal character '%c' in sequence on line %u",
c,
lineno);
else
snprintf(msg, 100, "Illegal character (ascii no %d) in sequence on line %u", c, lineno);
fatal(msg);
fprintf(stderr,
"Illegal character (ascii no %d) in sequence on line %u",
c,
lineno);
exit(1);
}
}
line[0] = 0;
if (!fgets(line, LINEALLOC, fp))
......@@ -301,6 +310,12 @@ void db_read(const char * filename)
long length = datalen - seqbegin;
if (length == 0)
{
fprintf(stderr, "\nError: Empty sequence found on line %u.\n\n", lineno-1);
exit(1);
}
nucleotides += length;
if (length > longest)
......@@ -311,7 +326,7 @@ void db_read(const char * filename)
sequences++;
if (filename)
if (is_regular)
progress_update(ftell(fp));
}
progress_done();
......@@ -328,6 +343,19 @@ void db_read(const char * filename)
unsigned long duplicatedidentifiers = 0;
/* set up hash to check for unique sequences */
unsigned long seqhashsize = 2 * sequences;
seqinfo_t * * seqhashtable = 0;
if (opt_differences > 0)
{
seqhashtable =
(seqinfo_t **) xmalloc(seqhashsize * sizeof(seqinfo_t *));
memset(seqhashtable, 0, seqhashsize * sizeof(seqinfo_t *));
}
/* create indices */
seqindex = (seqinfo_t *) xmalloc(sequences * sizeof(seqinfo_t));
......@@ -336,7 +364,7 @@ void db_read(const char * filename)
regex_t db_regexp;
regmatch_t pmatch[4];
if (usearch_abundance)
if (opt_usearch_abundance)
{
if (regcomp(&db_regexp, "(^|;)size=([0-9]+)(;|$)", REG_EXTENDED))
fatal("Regular expression compilation failed");
......@@ -347,11 +375,12 @@ void db_read(const char * filename)
fatal("Regular expression compilation failed");
}
long lastabundance = LONG_MAX;
seqinfo_t * lastseq = 0;
int presorted = 1;
int missingabundance = 0;
unsigned int missingabundance_lineno = 0;
char * missingabundance_header = 0;
char * p = datap;
progress_init("Indexing database:", sequences);
......@@ -364,26 +393,40 @@ void db_read(const char * filename)
/* get header */
seqindex_p->header = p;
seqindex_p->headerlen = strlen(seqindex_p->header);
seqindex_p->headeridlen = seqindex_p->headerlen;
p += seqindex_p->headerlen + 1;
/* and sequence */
seqindex_p->seq = p;
seqindex_p->seqlen = strlen(p);
p += seqindex_p->seqlen + 1;
/* get amplicon abundance */
seqindex_p->abundance = 0;
if (!regexec(&db_regexp, seqindex_p->header, 4, pmatch, 0))
{
seqindex_p->abundance = atol(seqindex_p->header + pmatch[2].rm_so);
seqindex_p->abundance_start = pmatch[0].rm_so;
seqindex_p->abundance_end = pmatch[0].rm_eo;
if (seqindex_p->abundance == 0)
{
fprintf(stderr,
"\nError: Illegal abundance value on line %u:\n%s\n"
"Abundance values should be positive integers.\n\n",
lineno,
seqindex_p->header);
exit(1);
}
}
else
{
seqindex_p->abundance_start = 0;
seqindex_p->abundance_end = 0;
seqindex_p->abundance_start = seqindex_p->headerlen;
seqindex_p->abundance_end = seqindex_p->headerlen;
seqindex_p->abundance = 0;
}
if (seqindex_p->abundance < 1)
{
if (opt_append_abundance > 0)
if (opt_append_abundance)
{
seqindex_p->abundance = opt_append_abundance;
}
......@@ -391,42 +434,85 @@ void db_read(const char * filename)
{
missingabundance++;
if (missingabundance == 1)
{
missingabundance_lineno = lineno;
missingabundance_header = seqindex_p->header;
}
}
}
if (seqindex_p->abundance > lastabundance)
if (seqindex_p->abundance_start == 0)
fatal("Empty sequence identifier");
/* check if the sequences are presorted by abundance and header */
if (presorted && lastseq)
{
if (lastseq->abundance < seqindex_p->abundance)
presorted = 0;
else if (lastseq->abundance == seqindex_p->abundance)
{
if (strcmp(lastseq->header, seqindex_p->header) > 0)
presorted = 0;
}
}
lastabundance = seqindex_p->abundance;
lastseq = seqindex_p;
/* check hash, fatal error if found, otherwize insert new */
unsigned long hdrhash = HASH((unsigned char*)seqindex_p->header, seqindex_p->headeridlen);
/* check for duplicated identifiers using hash table */
unsigned long hdrhash = HASH((unsigned char*)seqindex_p->header, seqindex_p->abundance_start);
seqindex_p->hdrhash = hdrhash;
unsigned long hashindex = hdrhash % hdrhashsize;
unsigned long hdrhashindex = hdrhash % hdrhashsize;
seqinfo_t * found;
seqinfo_t * hdrfound = 0;
while ((found = hdrhashtable[hashindex]))
while ((hdrfound = hdrhashtable[hdrhashindex]))
{
if ((found->hdrhash == hdrhash) &&
(found->headeridlen == seqindex_p->headeridlen) &&
(strncmp(found->header, seqindex_p->header, found->headeridlen) == 0))
if ((hdrfound->hdrhash == hdrhash) &&
(hdrfound->abundance_start == seqindex_p->abundance_start) &&
(strncmp(hdrfound->header, seqindex_p->header, hdrfound->abundance_start) == 0))
break;
hashindex = (hashindex + 1) % hdrhashsize;
hdrhashindex = (hdrhashindex + 1) % hdrhashsize;
}
if (found)
if (hdrfound)
{
duplicatedidentifiers++;
fprintf(logfile, "WARNING: Duplicated sequence identifier: %s\n", seqindex_p->header);
fprintf(stderr, "\nError: Duplicated sequence identifier: %.*s\n\n",
seqindex_p->abundance_start,
seqindex_p->header);
exit(1);
}
hdrhashtable[hashindex] = seqindex_p;
hdrhashtable[hdrhashindex] = seqindex_p;
seqindex_p->seq = p;
seqindex_p->seqlen = strlen(p);
p += seqindex_p->seqlen + 1;
if (opt_differences > 0)
{
/* check for duplicated sequences using hash table */
unsigned long seqhash = HASH((unsigned char*)seqindex_p->seq,
seqindex_p->seqlen);
seqindex_p->seqhash = seqhash;
unsigned long seqhashindex = seqhash % seqhashsize;
seqinfo_t * seqfound = 0;
while ((seqfound = seqhashtable[seqhashindex]))
{
if ((seqfound->seqhash == seqhash) &&
(seqfound->seqlen == seqindex_p->seqlen) &&
(memcmp(seqfound->seq, seqindex_p->seq, seqfound->seqlen) == 0))
break;
seqhashindex = (seqhashindex + 1) % seqhashsize;
}
if (seqfound)
duplicates_found++;
else
seqhashtable[seqhashindex] = seqindex_p;
}
seqindex_p++;
progress_update(i);
......@@ -435,17 +521,26 @@ void db_read(const char * filename)
if (missingabundance)
{
char * msg;
if (asprintf(&msg,
"Abundance annotations not found for %d sequences, starting on line %u.\n"
fprintf(stderr,
"\nError: Abundance annotations not found for %d sequences, starting on line %u.\n"
">%s\n"
"Fasta headers must end with abundance annotations (_INT or ;size=INT).\n"
"The -z option must be used if the abundance annotation is in the latter format.\n"
"Abundance annotations can be produced by dereplicating the sequences.\n",
"Abundance annotations can be produced by dereplicating the sequences.\n"
"The header is defined as the string comprised between the \">\" symbol\n"
"and the first space or the end of the line, whichever comes first.\n\n",
missingabundance,
missingabundance_lineno) == -1)
fatal("Out of memory");
else
fatal(msg);
missingabundance_lineno,
missingabundance_header);
exit(1);
}
if (duplicates_found)
{
fprintf(logfile,
"WARNING: %lu duplicated sequences detected.\n"
"Please consider dereplicating your data for optimal results.\n",
duplicates_found);
}
if (!presorted)
......@@ -459,8 +554,11 @@ void db_read(const char * filename)
free(hdrhashtable);
if (duplicatedidentifiers)
exit(1);
if (seqhashtable)
{
free(seqhashtable);
seqhashtable = 0;
}
}
void db_qgrams_init()
......
......@@ -213,20 +213,20 @@ void dereplicate()
progress_init("Writing swarms: ", swarmcount);
if (mothur)
fprintf(outfile, "swarm_%ld\t%ld", resolution, swarmcount);
if (opt_mothur)
fprintf(outfile, "swarm_%ld\t%ld", opt_differences, swarmcount);
for(int i = 0; i < swarmcount; i++)
{
int seed = hashtable[i].seqno_first;
if (mothur)
if (opt_mothur)
fputc('\t', outfile);
fprint_id(outfile, seed);
int a = nextseqtab[seed];
while (a)
{
if (mothur)
if (opt_mothur)
fputc(',', outfile);
else
fputc(SEPCHAR, outfile);
......@@ -234,13 +234,13 @@ void dereplicate()
a = nextseqtab[a];
}
if (!mothur)
if (!opt_mothur)
fputc('\n', outfile);
progress_update(i+1);
}
if (mothur)
if (opt_mothur)
fputc('\n', outfile);
progress_done();
......@@ -345,7 +345,7 @@ void dereplicate()
fprint_id_noabundance(statsfile, sp->seqno_first);
fprintf(statsfile, "\t%lu\t%u\t%u\t%u\n",
db_getabundance(sp->seqno_first),
sp->singletons, 0, 0);
sp->singletons, 0U, 0U);
progress_update(i);
}
progress_done();
......
......@@ -125,7 +125,6 @@ void nw(char * dseq,
hearray must point to at least 2*qlen longs of allocated memory (8*qlen bytes) */
long n, e;
long unsigned *hep;
long qlen = qend - qseq;
long dlen = dend - dseq;
......@@ -142,6 +141,7 @@ void nw(char * dseq,
for(j=0; j<dlen; j++)
{
long unsigned *hep;
hep = hearray;
long f = 2 * gapopen + (j+2) * gapextend;
long h = (j == 0) ? 0 : (gapopen + j * gapextend);
......
......@@ -210,6 +210,7 @@ void * qgram_worker(void * vp)
while (tip->work >= 0)
{
/* wait for work available */
while (tip->work == 0)
pthread_cond_wait(&tip->workcond, &tip->workmutex);
if (tip->work > 0)
{
......@@ -228,11 +229,11 @@ void qgram_diff_init()
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
/* allocate memory for thread info */
ti = (struct thread_info_s *) xmalloc(threads *
ti = (struct thread_info_s *) xmalloc(opt_threads *
sizeof(struct thread_info_s));
/* init and create worker threads */
for(unsigned long t=0; t<threads; t++)
for(long t=0; t<opt_threads; t++)
{
struct thread_info_s * tip = ti + t;
tip->work = 0;
......@@ -246,7 +247,7 @@ void qgram_diff_init()
void qgram_diff_done()
{
/* finish and clean up worker threads */
for(unsigned long t=0; t<threads; t++)
for(long t=0; t<opt_threads; t++)
{
struct thread_info_s * tip = ti + t;
......@@ -273,7 +274,7 @@ void qgram_diff_fast(unsigned long seed,
unsigned long * amplist,
unsigned long * difflist)
{
long thr = threads;
long thr = opt_threads;
const unsigned long m = 150;
......
......@@ -237,6 +237,7 @@ void * search_worker(void * vp)
while (tip->work >= 0)
{
/* wait for work available */
while (tip->work == 0)
pthread_cond_wait(&tip->workcond, &tip->workmutex);
if (tip->work > 0)
{
......@@ -268,7 +269,7 @@ void search_do(unsigned long query_no,
master_alignlengths = alignlengths;
master_bits = bits;
unsigned long thr = threads;
unsigned long thr = opt_threads;
if (bits == 8)
{
......@@ -315,9 +316,9 @@ void search_begin()
{
longestdbsequence = db_getlongestsequence();
sd = (struct search_data *) xmalloc(sizeof(search_data) * threads);
sd = (struct search_data *) xmalloc(sizeof(search_data) * opt_threads);
for(unsigned long t=0; t<threads; t++)
for(long t=0; t<opt_threads; t++)
search_alloc(sd+t);
/* start threads */
......@@ -326,11 +327,11 @@ void search_begin()
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
/* allocate memory for thread info */
ti = (struct thread_info_s *) xmalloc(threads *
ti = (struct thread_info_s *) xmalloc(opt_threads *
sizeof(struct thread_info_s));
/* init and create worker threads */
for(unsigned long t=0; t<threads; t++)
for(long t=0; t<opt_threads; t++)
{
struct thread_info_s * tip = ti + t;
tip->work = 0;
......@@ -346,7 +347,7 @@ void search_end()
{
/* finish and clean up worker threads */
for(unsigned long t=0; t<threads; t++)
for(long t=0; t<opt_threads; t++)
{
struct thread_info_s * tip = ti + t;
......@@ -367,7 +368,7 @@ void search_end()
free(ti);
pthread_attr_destroy(&attr);
for(unsigned long t=0; t<threads; t++)
for(long t=0; t<opt_threads; t++)
search_free(sd+t);
free(sd);
}
......@@ -570,7 +570,7 @@ void search16(WORD * * q_start,
unsigned long next_id = 0;
unsigned long done;
T0 = _mm_set_epi16(0, 0, 0, 0, 0, 0, 0, 0xffff);
T0 = _mm_set_epi16(0, 0, 0, 0, 0, 0, 0, -1);
Q = _mm_set1_epi16(gap_open_penalty+gap_extend_penalty);
R = _mm_set1_epi16(gap_extend_penalty);
......
......@@ -854,8 +854,7 @@ void search8(BYTE * * q_start,
unsigned long next_id = 0;
unsigned long done;
T0 = _mm_set_epi8(0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff);
T0 = _mm_set_epi8(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1);
Q = _mm_set1_epi8(gap_open_penalty+gap_extend_penalty);
R = _mm_set1_epi8(gap_extend_penalty);
......