Skip to content
Commits on Source (5)
......@@ -24,7 +24,7 @@ Most of the nucleotide based commands and options in USEARCH version 7 are suppo
## Getting Help
If you can't find an answer in the [VSEARCH documentation](https://github.com/torognes/vsearch/releases/download/v2.9.0/vsearch_manual.pdf), please visit the [VSEARCH Web Forum](https://groups.google.com/forum/#!forum/vsearch-forum) to post a question or start a discussion.
If you can't find an answer in the [VSEARCH documentation](https://github.com/torognes/vsearch/releases/download/v2.10.2/vsearch_manual.pdf), please visit the [VSEARCH Web Forum](https://groups.google.com/forum/#!forum/vsearch-forum) to post a question or start a discussion.
## Example
......@@ -37,9 +37,9 @@ In the example below, VSEARCH will identify sequences in the file database.fsa t
**Source distribution** To download the source distribution from a [release](https://github.com/torognes/vsearch/releases) and build the executable and the documentation, use the following commands:
```
wget https://github.com/torognes/vsearch/archive/v2.9.0.tar.gz
tar xzf v2.9.0.tar.gz
cd vsearch-2.9.0
wget https://github.com/torognes/vsearch/archive/v2.10.2.tar.gz
tar xzf v2.10.2.tar.gz
cd vsearch-2.10.2
./autogen.sh
./configure
make
......@@ -70,36 +70,36 @@ Binary distributions are provided for x86-64 systems running GNU/Linux, macOS (v
Download the appropriate executable for your system using the following commands if you are using a Linux x86_64 system:
```sh
wget https://github.com/torognes/vsearch/releases/download/v2.9.0/vsearch-2.9.0-linux-x86_64.tar.gz
tar xzf vsearch-2.9.0-linux-x86_64.tar.gz
wget https://github.com/torognes/vsearch/releases/download/v2.10.2/vsearch-2.10.2-linux-x86_64.tar.gz
tar xzf vsearch-2.10.2-linux-x86_64.tar.gz
```
Or these commands if you are using a Linux ppc64le system:
```sh
wget https://github.com/torognes/vsearch/releases/download/v2.9.0/vsearch-2.9.0-linux-ppc64le.tar.gz
tar xzf vsearch-2.9.0-linux-ppc64le.tar.gz
wget https://github.com/torognes/vsearch/releases/download/v2.10.2/vsearch-2.10.2-linux-ppc64le.tar.gz
tar xzf vsearch-2.10.2-linux-ppc64le.tar.gz
```
Or these commands if you are using a Mac:
```sh
wget https://github.com/torognes/vsearch/releases/download/v2.9.0/vsearch-2.9.0-macos-x86_64.tar.gz
tar xzf vsearch-2.9.0-macos-x86_64.tar.gz
wget https://github.com/torognes/vsearch/releases/download/v2.10.2/vsearch-2.10.2-macos-x86_64.tar.gz
tar xzf vsearch-2.10.2-macos-x86_64.tar.gz
```
Or if you are using Windows, download and extract (unzip) the contents of this file:
```
https://github.com/torognes/vsearch/releases/download/v2.9.0/vsearch-2.9.0-win-x86_64.zip
https://github.com/torognes/vsearch/releases/download/v2.10.2/vsearch-2.10.2-win-x86_64.zip
```
Linux and Mac: You will now have the binary distribution in a folder called `vsearch-2.9.0-linux-x86_64` or `vsearch-2.9.0-macos-x86_64` in which you will find three subfolders `bin`, `man` and `doc`. We recommend making a copy or a symbolic link to the vsearch binary `bin/vsearch` in a folder included in your `$PATH`, and a copy or a symbolic link to the vsearch man page `man/vsearch.1` in a folder included in your `$MANPATH`. The PDF version of the manual is available in `doc/vsearch_manual.pdf`.
Linux and Mac: You will now have the binary distribution in a folder called `vsearch-2.10.2-linux-x86_64` or `vsearch-2.10.2-macos-x86_64` in which you will find three subfolders `bin`, `man` and `doc`. We recommend making a copy or a symbolic link to the vsearch binary `bin/vsearch` in a folder included in your `$PATH`, and a copy or a symbolic link to the vsearch man page `man/vsearch.1` in a folder included in your `$MANPATH`. The PDF version of the manual is available in `doc/vsearch_manual.pdf`.
Windows: You will now have the binary distribution in a folder called `vsearch-2.9.0-win-x86_64`. The vsearch executable is called `vsearch.exe`. The manual in PDF format is called `vsearch_manual.pdf`.
Windows: You will now have the binary distribution in a folder called `vsearch-2.10.2-win-x86_64`. The vsearch executable is called `vsearch.exe`. The manual in PDF format is called `vsearch_manual.pdf`.
**Documentation** The VSEARCH user's manual is available in the `man` folder in the form of a [man page](https://github.com/torognes/vsearch/blob/master/man/vsearch.1). A pdf version ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.9.0/vsearch_manual.pdf)) will be generated by `make`. To install the manpage manually, copy the `vsearch.1` file or a create a symbolic link to `vsearch.1` in a folder included in your `$MANPATH`. The manual in both formats is also available with the binary distribution. The manual in PDF form ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.9.0/vsearch_manual.pdf)) is also attached to the latest [release](https://github.com/torognes/vsearch/releases).
**Documentation** The VSEARCH user's manual is available in the `man` folder in the form of a [man page](https://github.com/torognes/vsearch/blob/master/man/vsearch.1). A pdf version ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.10.2/vsearch_manual.pdf)) will be generated by `make`. To install the manpage manually, copy the `vsearch.1` file or a create a symbolic link to `vsearch.1` in a folder included in your `$MANPATH`. The manual in both formats is also available with the binary distribution. The manual in PDF form ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.10.2/vsearch_manual.pdf)) is also attached to the latest [release](https://github.com/torognes/vsearch/releases).
## Plugins, packages, and wrappers
......
......@@ -2,7 +2,7 @@
# Process this file with autoconf to produce a configure script.
AC_PREREQ([2.63])
AC_INIT([vsearch], [2.9.0], [torognes@ifi.uio.no])
AC_INIT([vsearch], [2.10.2], [torognes@ifi.uio.no])
AC_CANONICAL_TARGET
AM_INIT_AUTOMAKE([subdir-objects])
AC_LANG([C++])
......@@ -14,8 +14,8 @@ MACOSX_DEPLOYMENT_TARGET="10.7"
# Set default gcc and g++ options
CFLAGS='-g -march=native'
CXXFLAGS='-g -march=native'
CFLAGS='-g'
CXXFLAGS='-g'
# Checks for programs.
AC_PROG_CXX
......
vsearch (2.10.2-1) unstable; urgency=medium
* New upstream version
-- Andreas Tille <tille@debian.org> Tue, 18 Dec 2018 16:56:27 +0100
vsearch (2.9.0-1) unstable; urgency=medium
* Team upload.
......
......@@ -2,14 +2,13 @@ Description: do not override system compiler flags
Author: Sascha Steinbiss <satta@debian.org>
--- a/configure.ac
+++ b/configure.ac
@@ -14,8 +14,8 @@ MACOSX_DEPLOYMENT_TARGET="10.7"
@@ -14,9 +14,6 @@ MACOSX_DEPLOYMENT_TARGET="10.7"
# Set default gcc and g++ options
-CFLAGS='-g -march=native'
-CXXFLAGS='-g -march=native'
+#CFLAGS='-g -march=native'
+#CXXFLAGS='-g -march=native'
-CFLAGS='-g'
-CXXFLAGS='-g'
-
# Checks for programs.
AC_PROG_CXX
AC_PROG_RANLIB
.\" ============================================================================
.TH vsearch 1 "October 10, 2018" "version 2.9.0" "USER COMMANDS"
.TH vsearch 1 "December 10, 2018" "version 2.10.2" "USER COMMANDS"
.\" ============================================================================
.SH NAME
vsearch \(em chimera detection, clustering, dereplication and
......@@ -71,6 +71,9 @@ FASTA/FASTQ file processing:
\fBvsearch\fR \-\-fastx_revcomp \fIfastxfile\fR (\-\-fastaout |
\-\-fastqout) \fIoutputfile\fR [\fIoptions\fR]
.PP
\fBvsearch\fR \-\-sff_convert \fIsff-file\fR \-\-fastqout
\fIoutputfile\fR [\fIoptions\fR]
.PP
.RE
Masking:
.RS
......@@ -878,17 +881,18 @@ for details.
.TP
.BI \-\-rereplicate \0filename
Duplicate each sequence the number of times indicated by the abundance
of each sequence in the specified file. The sequence labels are
identical for the same sequence, unless \-\-relabel, \-\-relabel_sha1
or \-\-relabel_md5 is used to create unique labels. Output is written
to the file specified with the \-\-output option, in FASTA format. The
output file does not contain abundance information unless \-\-sizeout
is specified, in which case an abundance of 1 is used.
of each sequence in the specified file (option \-\-sizein is always
implied). The sequence labels are identical for the same sequence,
unless \-\-relabel, \-\-relabel_sha1 or \-\-relabel_md5 is used to
create unique labels. Output is written to the file specified with the
\-\-output option, in FASTA format. The output file does not contain
abundance information unless \-\-sizeout is specified, in which case
an abundance of 1 is used.
.TP
.B \-\-sizein
Take into account the abundance annotations present in the input fasta
file (search for the pattern '[>;]size=\fIinteger\fR[;]' in sequence
headers).
headers). That option is active by default when rereplicating.
.TP
.B \-\-sizeout
Add abundance annotations to the output fasta file (add the pattern
......@@ -961,8 +965,9 @@ length of the sequences in a FASTQ file may be performed with the
\-\-fastq_stats, \-\-fastq_eestats, and \-\-fastq_eestats2
commands. Sequences may be shortened, filtered and converted by the
\-\-fastq_filter or \-\-fastx_filter commands. Paired-end reads can be
merged using the \-\-fastq_mergepairs command. Finally, the
\-\-fastx_revcomp command reverse-complements sequences.
merged using the \-\-fastq_mergepairs command. The \-\-fastx_revcomp
command reverse-complements sequences. Finally, the \-\-sff_convert
command can be used to convert SFF files to FASTQ.
.PP
.TP 9
.B \-\-eeout
......@@ -1013,9 +1018,9 @@ Illumina 1.8+ FASTQ format (phred+33). The value 64 is used by the
Solexa, Illumina 1.3+ and Illumina 1.5+ formats (phred+64).
.TP
.BI \-\-fastq_asciiout\~ "positive integer"
When using \-\-fastq_convert, define the ASCII character number used
as the basis for the FASTQ quality score when writing FASTQ output
files. The default is 33.
When using \-\-fastq_convert or \-\-sff_convert, define the ASCII
character number used as the basis for the FASTQ quality score when
writing FASTQ output files. The default is 33.
.TP
.BI \-\-fastq_chars \0filename
Summarize the composition of sequence and quality strings contained in
......@@ -1039,7 +1044,7 @@ Convert between the different variants of the FASTQ file format. The
quality encoding of the input file must be specified with the
\-\-fastq_ascii option (either 33 or 64, the default is 33), and the
output quality encoding must be specified with the \-\-fastq_asciiout
option (default 33). The mimimum and maximum output quality scores may
option (default 33). The minimum and maximum output quality scores may
be limited using the \-\-fastq_qminout and \-\-fastq_qmaxout
options. The output file is specified with the \-\-fastqout option.
.TP
......@@ -1199,10 +1204,10 @@ files. The default is 41, which is usual for recent Sanger/Illumina
1.8+ files.
.TP
.BI \-\-fastq_qmaxout\~ "positive integer"
When using \-\-fastq_convert, specify the maximum quality score used
when writing FASTQ files. The default is 41, which is usual for recent
Sanger/Illumina 1.8+ files. Older formats may use a maximum quality
score of 40.
When using \-\-fastq_convert or \-\-sff_convert, specify the maximum
quality score used when writing FASTQ files. The default is 41, which
is usual for recent Sanger/Illumina 1.8+ files. Older formats may use
a maximum quality score of 40.
.TP
.BI \-\-fastq_qmin\~ "positive integer"
Specify the minimum quality score accepted for FASTQ files. The
......@@ -1210,10 +1215,10 @@ default is 0, which is usual for recent Sanger/Illumina 1.8+
files. Older formats may use scores between -5 and 2.
.TP
.BI \-\-fastq_qminout\~ "positive integer"
When using \-\-fastq_convert, specify the minimum quality score used
when writing FASTQ files. The default is 0, which is usual for
Sanger/Illumina 1.8+ files. Older versions of the format may use
scores between -5 and 2.
When using \-\-fastq_convert or \-\-sff_convert, specify the minimum
quality score used when writing FASTQ files. The default is 0, which
is usual for Sanger/Illumina 1.8+ files. Older versions of the format
may use scores between -5 and 2.
.TP
.BI \-\-fastq_stats \0filename
Analyze a FASTQ file and report the number of reads it contains. The
......@@ -1422,6 +1427,21 @@ for details.
When using \-\-fastq_mergepairs or \-\-fastq_join, specify the FASTQ
file containing containing the reverse reads.
.TP
.BI \-\-sff_convert \0filename
Convert the given SFF file to FASTQ. The FASTQ output file is
specified with the \-\-fastqout option. The sequence may be clipped as
specied in the SFF file if the option \-\-sff_clip is specified,
otherwise no clipping occurs. Bases that would have been clipped are
converted to lower case, while the rest is in upper case. The output
quality encoding may be specified with the \-\-fastq_asciiout option
(default 33). The minimum and maximum output quality scores may be
limited using the \-\-fastq_qminout and \-\-fastq_qmaxout options.
.TP
.BI \-\-sff_clip
Specifies that the sequences converted by the \-\-sff_convert command
should be clipped in both ends as indicated in the SFF file. By
default no clipping is performed.
.TP
.B \-\-xsize
Strip abundance information from the headers when writing the output
file.
......@@ -3456,6 +3476,26 @@ the first space (unless the notrunclabels option is in effect).
.TP
.BR v2.9.0\~ "released October 10th, 2018"
Added the fastq_join command.
.TP
.BR v2.9.1\~ "released October 29th, 2018"
Changed compiler options that select the target cpu and tuning to
allow the software to run on any 64-bit x86 system, while tuning for
more modern variants. Avoid illegal instruction error on some
architectures. Update documentation of rereplicate command.
.TP
.BR v2.10.0\~ "released December 6th, 2018"
Added the sff_convert commmand to convert SFF files to FASTQ. Added
some additional option argument checks. Fixed segmentation fault bug
after some fatal errors when a log file was specified.
.TP
.BR v2.10.1\~ "released December 7th, 2018"
Improved sff_convert command. It will now read several variants of the
SFF format. It is also able to read from a pipe. Warnings are given if
there are minor problems. Errors messages have been improved. Minor
speed and memory usage improvements.
.TP
.BR v2.10.2\~ "released December 10th, 2018"
Fixed bug in sintax with reversed order of domain and kingdom.
.RE
.LP
.\" ============================================================================
......
......@@ -3,7 +3,7 @@ bin_PROGRAMS = $(top_builddir)/bin/vsearch
if TARGET_PPC
AM_CXXFLAGS=-Wall -Wsign-compare -O3 -g -mcpu=power8
else
AM_CXXFLAGS=-Wall -Wsign-compare -O3 -g
AM_CXXFLAGS=-Wall -Wsign-compare -O3 -g -march=x86-64 -mtune=generic
endif
AM_CFLAGS=$(AM_CXXFLAGS)
......@@ -47,6 +47,7 @@ results.h \
search.h \
searchcore.h \
searchexact.h \
sffconvert.h \
showalign.h \
sha1.h \
shuffle.h \
......@@ -76,13 +77,13 @@ libcityhash_a_SOURCES = city.cc city.h
if TARGET_WIN
libcityhash_a_CXXFLAGS = -Wall -Wno-sign-compare -O3 -g -D_MSC_VER
libcityhash_a_CXXFLAGS = $(AM_CXXFLAGS) -Wno-sign-compare -D_MSC_VER
__top_builddir__bin_vsearch_LDFLAGS = -static
__top_builddir__bin_vsearch_LDADD = libregex.a libcityhash.a libcpu_ssse3.a libcpu_sse2.a
else
libcityhash_a_CXXFLAGS = -Wall -Wno-sign-compare -O3 -g
libcityhash_a_CXXFLAGS = $(AM_CXXFLAGS) -Wno-sign-compare
if TARGET_PPC
__top_builddir__bin_vsearch_LDADD = libcityhash.a libcpu.a
......@@ -126,6 +127,7 @@ results.cc \
search.cc \
searchcore.cc \
searchexact.cc \
sffconvert.cc \
sha1.c \
showalign.cc \
shuffle.cc \
......
/*
VSEARCH: a versatile open source tool for metagenomics
Copyright (C) 2014-2018, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
All rights reserved.
Contact: Torbjorn Rognes <torognes@ifi.uio.no>,
Department of Informatics, University of Oslo,
PO Box 1080 Blindern, NO-0316 Oslo, Norway
This software is dual-licensed and available under a choice
of one of two licenses, either under the terms of the GNU
General Public License version 3 or the BSD 2-Clause License.
GNU General Public License version 3
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
The BSD 2-Clause License
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
*/
#include "vsearch.h"
uint32_t sff_magic = 0x2e736666;
struct sff_header_s
{
uint32_t magic_number; /* .sff */
uint32_t version;
uint64_t index_offset;
uint32_t index_length;
uint32_t number_of_reads;
uint16_t header_length;
uint16_t key_length;
uint16_t flows_per_read;
uint8_t flowgram_format_code;
} sff_header;
struct sff_read_header_s
{
uint16_t read_header_length;
uint16_t name_length;
uint32_t number_of_bases;
uint16_t clip_qual_left;
uint16_t clip_qual_right;
uint16_t clip_adapter_left;
uint16_t clip_adapter_right;
} read_header;
uint64_t fskip(FILE * fp, uint64_t length)
{
/* read given amount of data from a stream and ignore it */
/* used instead of seeking in order to work with pipes */
#define BLOCKSIZE 4096
char buffer[BLOCKSIZE];
uint64_t skipped = 0;
uint64_t rest = length;
while (rest > 0)
{
uint64_t want = ((rest > BLOCKSIZE) ? BLOCKSIZE : rest);
uint64_t got = fread(buffer, 1, want, fp);
skipped += got;
rest -= got;
if (got < want)
break;
}
return skipped;
}
void sff_convert()
{
if (! opt_fastqout)
fatal("No output file for sff_convert specified with --fastqout.");
FILE * fp_fastqout = fopen_output(opt_fastqout);
if (!fp_fastqout)
fatal("Unable to open FASTQ output file for writing.");
FILE * fp_sff = fopen_input(opt_sff_convert);
if (!fp_sff)
fatal("Unable to open SFF input file for reading.");
/* read and check header */
uint64_t filepos = 0;
if (fread(&sff_header, 1, 31, fp_sff) < 31)
fatal("Unable to read from SFF file. File may be truncated.");
filepos += 31;
sff_header.magic_number = bswap_32(sff_header.magic_number);
sff_header.version = bswap_32(sff_header.version);
sff_header.index_offset = bswap_64(sff_header.index_offset);
sff_header.index_length = bswap_32(sff_header.index_length);
sff_header.number_of_reads = bswap_32(sff_header.number_of_reads);
sff_header.header_length = bswap_16(sff_header.header_length);
sff_header.key_length = bswap_16(sff_header.key_length);
sff_header.flows_per_read = bswap_16(sff_header.flows_per_read);
if (sff_header.magic_number != sff_magic)
fatal("Invalid SFF file. Incorrect magic number. Must be 0x2e736666 (.sff).");
if (sff_header.version != 1)
fatal("Invalid SFF file. Incorrect version. Must be 1.");
if (sff_header.flowgram_format_code != 1)
fatal("Invalid SFF file. Incorrect flowgram format code. Must be 1.");
if (sff_header.header_length != 8 * ((31 + sff_header.flows_per_read + sff_header.key_length + 7) / 8))
fatal("Invalid SFF file. Incorrect header length.");
if (sff_header.key_length != 4)
fatal("Invalid SFF file. Incorrect key length. Must be 4.");
if ((sff_header.index_length > 0) && (sff_header.index_length < 8))
fatal("Invalid SFF file. Incorrect index size. Must be at least 8.");
/* read and check flow chars, key and padding */
if (fskip(fp_sff, sff_header.flows_per_read) < sff_header.flows_per_read)
fatal("Invalid SFF file. Unable to read flow characters. File may be truncated.");
filepos += sff_header.flows_per_read;
char * key_sequence = (char *) xmalloc(sff_header.key_length + 1);
if (fread(key_sequence, 1, sff_header.key_length, fp_sff) < sff_header.key_length)
fatal("Invalid SFF file. Unable to read key sequence. File may be truncated.");
key_sequence[sff_header.key_length] = 0;
filepos += sff_header.key_length;
uint32_t padding_length = sff_header.header_length - sff_header.flows_per_read - sff_header.key_length - 31;
if (fskip(fp_sff, padding_length) < padding_length)
fatal("Invalid SFF file. Unable to read padding. File may be truncated.");
filepos += padding_length;
double totallength = 0.0;
uint32_t minimum = UINT_MAX;
uint32_t maximum = 0;
bool index_done = (sff_header.index_offset == 0) || (sff_header.index_length == 0);
bool index_odd = false;
char index_kind[9];
uint32_t index_padding = 0;
if ((sff_header.index_length & 7) > 0)
index_padding = 8 - (sff_header.index_length & 7);
if (! opt_quiet)
{
fprintf(stderr, "Number of reads: %d\n", sff_header.number_of_reads);
fprintf(stderr, "Flows per read: %d\n", sff_header.flows_per_read);
fprintf(stderr, "Key sequence: %s\n", key_sequence);
}
if (opt_log)
{
fprintf(fp_log, "Number of reads: %d\n", sff_header.number_of_reads);
fprintf(fp_log, "Flows per read: %d\n", sff_header.flows_per_read);
fprintf(fp_log, "Key sequence: %s\n", key_sequence);
}
progress_init("Converting SFF: ", sff_header.number_of_reads);
for(uint32_t read_no = 0; read_no < sff_header.number_of_reads; read_no++)
{
/* check if the index block is here */
if (! index_done)
{
if (filepos == sff_header.index_offset)
{
if (fread(index_kind, 1, 8, fp_sff) < 8)
fatal("Invalid SFF file. Unable to read index header. File may be truncated.");
filepos += 8;
index_kind[8] = 0;
uint64 index_size = sff_header.index_length - 8 + index_padding;
if (fskip(fp_sff, index_size) != index_size)
fatal("Invalid SFF file. Unable to read entire index. File may be truncated.");
filepos += index_size;
index_done = true;
index_odd = true;
}
}
/* read and check each read header */
if (fread(&read_header, 1, 16, fp_sff) < 16)
fatal("Invalid SFF file. Unable to read read header. File may be truncated.");
filepos += 16;
read_header.read_header_length = bswap_16(read_header.read_header_length);
read_header.name_length = bswap_16(read_header.name_length);
read_header.number_of_bases = bswap_32(read_header.number_of_bases);
read_header.clip_qual_left = bswap_16(read_header.clip_qual_left);
read_header.clip_qual_right = bswap_16(read_header.clip_qual_right);
read_header.clip_adapter_left = bswap_16(read_header.clip_adapter_left);
read_header.clip_adapter_right = bswap_16(read_header.clip_adapter_right);
if (read_header.read_header_length != 8 * ((16 + read_header.name_length + 7) / 8))
fatal("Invalid SFF file. Incorrect read header length.");
if (read_header.clip_qual_left > read_header.number_of_bases)
fatal("Invalid SFF file. Incorrect clip_qual_left value.");
if (read_header.clip_adapter_left > read_header.number_of_bases)
fatal("Invalid SFF file. Incorrect clip_adapter_left value.");
if (read_header.clip_qual_right > read_header.number_of_bases)
fatal("Invalid SFF file. Incorrect clip_qual_right value.");
if (read_header.clip_adapter_right > read_header.number_of_bases)
fatal("Invalid SFF file. Incorrect clip_adapter_right value.");
char * read_name = (char *) xmalloc(read_header.name_length + 1);
if (fread(read_name, 1, read_header.name_length, fp_sff) < read_header.name_length)
fatal("Invalid SFF file. Unable to read read name. File may be truncated.");
filepos += read_header.name_length;
read_name[read_header.name_length] = 0;
uint32_t read_header_padding_length = read_header.read_header_length - read_header.name_length - 16;
if (fskip(fp_sff, read_header_padding_length) < read_header_padding_length)
fatal("Invalid SFF file. Unable to read read header padding. File may be truncated.");
filepos += read_header_padding_length;
/* read and check the flowgram and sequence */
if (fskip(fp_sff, 2 * sff_header.flows_per_read) < sff_header.flows_per_read)
fatal("Invalid SFF file. Unable to read flowgram values. File may be truncated.");
filepos += 2 * sff_header.flows_per_read;
if (fskip(fp_sff, read_header.number_of_bases) < read_header.number_of_bases)
fatal("Invalid SFF file. Unable to read flow indices. File may be truncated.");
filepos += read_header.number_of_bases;
char * bases = (char *) xmalloc(read_header.number_of_bases + 1);
if (fread(bases, 1, read_header.number_of_bases, fp_sff) < read_header.number_of_bases)
fatal("Invalid SFF file. Unable to read read length. File may be truncated.");
bases[read_header.number_of_bases] = 0;
filepos += read_header.number_of_bases;
char * qual = (char *) xmalloc(read_header.number_of_bases + 1);
if (fread(qual, 1, read_header.number_of_bases, fp_sff) < read_header.number_of_bases)
fatal("Invalid SFF file. Unable to read quality scores. File may be truncated.");
filepos += read_header.number_of_bases;
/* convert quality scores to ascii characters */
for(uint32_t base_no = 0; base_no < read_header.number_of_bases; base_no++)
{
int q = qual[base_no];
if (q < opt_fastq_qminout)
q = opt_fastq_qminout;
if (q > opt_fastq_qmaxout)
q = opt_fastq_qmaxout;
qual[base_no] = opt_fastq_asciiout + q;
}
qual[read_header.number_of_bases] = 0;
uint32_t read_data_length = (2 * sff_header.flows_per_read + 3 * read_header.number_of_bases);
uint32_t read_data_padded_length = 8 * ((read_data_length + 7) / 8);
uint32_t read_data_padding_length = read_data_padded_length - read_data_length;
if (fskip(fp_sff, read_data_padding_length) < read_data_padding_length)
fatal("Invalid SFF file. Unable to read read data padding. File may be truncated.");
filepos += read_data_padding_length;
uint32_t clip_start = 0;
clip_start = MAX(1, MAX(read_header.clip_qual_left, read_header.clip_adapter_left)) - 1;
uint32_t clip_end = read_header.number_of_bases;
clip_end = MIN((read_header.clip_qual_right == 0 ? read_header.number_of_bases : read_header.clip_qual_right), (read_header.clip_adapter_right == 0 ? read_header.number_of_bases : read_header.clip_adapter_right));
/* make the clipped bases lowercase and the rest uppercase */
for (uint32_t i = 0; i < read_header.number_of_bases; i++)
{
if ((i < clip_start) || (i >= clip_end))
bases[i] = tolower(bases[i]);
else
bases[i] = toupper(bases[i]);
}
if (opt_sff_clip)
{
bases[clip_end] = 0;
qual[clip_end] = 0;
}
else
{
clip_start = 0;
clip_end = read_header.number_of_bases;
}
uint32_t length = clip_end - clip_start;
fastq_print_general(fp_fastqout,
bases + clip_start,
length,
read_name,
strlen(read_name),
qual + clip_start,
0, 0, 0, 0);
xfree(read_name);
xfree(bases);
xfree(qual);
totallength += length;
if (length < minimum)
minimum = length;
if (length > maximum)
maximum = length;
progress_update(read_no + 1);
}
progress_done();
/* check if the index block is here */
if (! index_done)
{
if (filepos == sff_header.index_offset)
{
if (fread(index_kind, 1, 8, fp_sff) < 8)
fatal("Invalid SFF file. Unable to read index header. File may be truncated.");
filepos += 8;
index_kind[8] = 0;
uint64 index_size = sff_header.index_length - 8;
if (fskip(fp_sff, index_size) != index_size)
fatal("Invalid SFF file. Unable to read entire index. File may be truncated.");
filepos += index_size;
index_done = true;
/* try to skip padding, if any */
if (index_padding > 0)
{
uint64_t got = fskip(fp_sff, index_padding);
if ((got < index_padding) && (got != 0))
fprintf(stderr, "WARNING: Additional data at end of SFF file ignored\n");
}
}
}
if (! index_done)
{
fprintf(stderr, "WARNING: SFF index missing\n");
if (opt_log)
fprintf(fp_log, "WARNING: SFF index missing\n");
}
if (index_odd)
{
fprintf(stderr, "WARNING: Index at unusual position in file\n");
if (opt_log)
fprintf(fp_log, "WARNING: Index at unusual position in file\n");
}
/* ignore the rest of file */
/* try reading just another byte */
if (fskip(fp_sff, 1) > 0)
{
fprintf(stderr, "WARNING: Additional data at end of SFF file ignored\n");
if (opt_log)
fprintf(fp_log, "WARNING: Additional data at end of SFF file ignored\n");
}
fclose(fp_sff);
fclose(fp_fastqout);
double average = totallength / sff_header.number_of_reads;
if (! opt_quiet)
{
if (sff_header.index_length > 0)
fprintf(stderr, "Index type: %s\n", index_kind);
fprintf(stderr, "\nSFF file read successfully.\n");
if (sff_header.number_of_reads > 0)
fprintf(stderr, "Sequence length: minimum %d, average %.1f, maximum %d\n",
minimum,
average,
maximum);
}
if (opt_log)
{
if (sff_header.index_length > 0)
fprintf(fp_log, "Index type: %s\n", index_kind);
fprintf(fp_log, "\nSFF file read successfully.\n");
if (sff_header.number_of_reads > 0)
fprintf(fp_log, "Sequence length: minimum %d, average %.1f, maximum %d\n",
minimum,
average,
maximum);
}
xfree(key_sequence);
}
/*
VSEARCH: a versatile open source tool for metagenomics
Copyright (C) 2014-2018, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
All rights reserved.
Contact: Torbjorn Rognes <torognes@ifi.uio.no>,
Department of Informatics, University of Oslo,
PO Box 1080 Blindern, NO-0316 Oslo, Norway
This software is dual-licensed and available under a choice
of one of two licenses, either under the terms of the GNU
General Public License version 3 or the BSD 2-Clause License.
GNU General Public License version 3
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
The BSD 2-Clause License
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
*/
void sff_convert();
......@@ -86,7 +86,7 @@ static pthread_attr_t attr;
static fastx_handle query_fastx_h;
const int tax_levels = 8;
const char * tax_letters = "kdpcofgs";
const char * tax_letters = "dkpcofgs";
const int subset_size = 32;
const int bootstrap_count = 100;
......@@ -150,8 +150,8 @@ bool sintax_parse_tax(const char * header,
void sintax_split(int seqno, int * level_start, int * level_len)
{
/* Parse taxonomy string into the following parts
k kingdom
d domain
k kingdom
p phylum
c class
o order
......
......@@ -114,7 +114,7 @@ void __attribute__((noreturn)) fatal(const char * msg)
fprintf(stderr, "\n\n");
fprintf(stderr, "Fatal error: %s\n", msg);
if (opt_log)
if (fp_log)
{
fprintf(fp_log, "\n\n");
fprintf(fp_log, "Fatal error: %s\n", msg);
......
......@@ -77,6 +77,7 @@ bool opt_relabel_keep;
bool opt_relabel_md5;
bool opt_relabel_sha1;
bool opt_samheader;
bool opt_sff_clip;
bool opt_sizeorder;
bool opt_xsize;
char * opt_allpairs_global;
......@@ -139,6 +140,7 @@ char * opt_rereplicate;
char * opt_reverse;
char * opt_samout;
char * opt_search_exact;
char * opt_sff_convert;
char * opt_shuffle;
char * opt_sintax;
char * opt_sortbylength;
......@@ -789,6 +791,8 @@ void args_init(int argc, char **argv)
opt_search_exact = 0;
opt_self = 0;
opt_selfid = 0;
opt_sff_convert = 0;
opt_sff_clip = 0;
opt_shuffle = 0;
opt_sintax = 0;
opt_sintax_cutoff = 0.0;
......@@ -1033,6 +1037,8 @@ void args_init(int argc, char **argv)
{"fastq_join", required_argument, 0, 0 },
{"join_padgap", required_argument, 0, 0 },
{"join_padgapq", required_argument, 0, 0 },
{"sff_convert", required_argument, 0, 0 },
{"sff_clip", no_argument, 0, 0 },
{ 0, 0, 0, 0 }
};
......@@ -1891,6 +1897,14 @@ void args_init(int argc, char **argv)
opt_join_padgapq = optarg;
break;
case 202:
opt_sff_convert = optarg;
break;
case 203:
opt_sff_clip = 1;
break;
default:
fatal("Internal error in option parsing");
}
......@@ -1980,6 +1994,8 @@ void args_init(int argc, char **argv)
commands++;
if (opt_fastq_join)
commands++;
if (opt_sff_convert)
commands++;
if (commands > 1)
......@@ -2061,9 +2077,21 @@ void args_init(int argc, char **argv)
if (opt_fastq_qmin > opt_fastq_qmax)
fatal("The argument to --fastq_qmin cannot be larger than to --fastq_qmax");
if (opt_fastq_ascii + opt_fastq_qmin < 33)
fatal("Sum of arguments to --fastq_ascii and --fastq_qmin must be no less than 33");
if (opt_fastq_ascii + opt_fastq_qmax > 126)
fatal("Sum of arguments to --fastq_ascii and --fastq_qmax must be no more than 126");
if (opt_fastq_qminout > opt_fastq_qmaxout)
fatal("The argument to --fastq_qminout cannot be larger than to --fastq_qmaxout");
if (opt_fastq_asciiout + opt_fastq_qminout < 33)
fatal("Sum of arguments to --fastq_asciiout and --fastq_qminout must be no less than 33");
if (opt_fastq_asciiout + opt_fastq_qmaxout > 126)
fatal("Sum of arguments to --fastq_asciiout and --fastq_qmaxout must be no more than 126");
if (opt_gzip_decompress && opt_bzip2_decompress)
fatal("Specify either --gzip_decompress or --bzip2_decompress, not both");
......@@ -2243,11 +2271,21 @@ void cmd_help()
" --relabel_keep keep the old label after the new when relabelling\n"
" --relabel_md5 relabel with md5 digest of normalized sequence\n"
" --relabel_sha1 relabel with sha1 digest of normalized sequence\n"
" --sizeorder sort accepted centroids by abundance (AGC)\n"
" --sizeorder sort accepted centroids by abundance, AGC\n"
" --sizeout write cluster abundances to centroid file\n"
" --uc FILENAME specify filename for UCLUST-like output\n"
" --xsize strip abundance information in output\n"
"\n"
"Convert SFF to FASTQ\n"
" --sff_convert FILENAME convert given SFF file to FASTQ format\n"
" Parameters\n"
" --sff_clip clip ends of sequences as indicated in file (no)\n"
" --fastq_asciiout INT FASTQ output quality score ASCII base char (33)\n"
" --fastq_qmaxout INT maximum base quality value for FASTQ output (41)\n"
" --fastq_qminout INT minimum base quality value for FASTQ output (0)\n"
" Output\n"
" --fastqout FILENAME output converted sequences to given FASTQ file\n"
"\n"
"Dereplication and rereplication\n"
" --derep_fulllength FILENAME dereplicate sequences in the given FASTA file\n"
" --derep_prefix FILENAME dereplicate sequences in file based on prefixes\n"
......@@ -2989,6 +3027,8 @@ int main(int argc, char** argv)
udb_stats();
else if (opt_sintax)
sintax();
else if (opt_sff_convert)
sff_convert();
else
cmd_none();
......
......@@ -125,6 +125,9 @@
#define PROG_OS "win"
#include <windows.h>
#include <psapi.h>
#define bswap_16(x) _byteswap_ushort(x)
#define bswap_32(x) _byteswap_ulong(x)
#define bswap_64(x) _byteswap_uint64(x)
#else
......@@ -132,6 +135,10 @@
#define PROG_OS "macos"
#include <sys/sysctl.h>
#include <libkern/OSByteOrder.h>
#define bswap_16(x) OSSwapInt16(x)
#define bswap_32(x) OSSwapInt32(x)
#define bswap_64(x) OSSwapInt64(x)
#else
......@@ -142,6 +149,7 @@
#endif
#include <sys/sysinfo.h>
#include <byteswap.h>
#endif
......@@ -212,6 +220,7 @@
#include "kmerhash.h"
#include "sintax.h"
#include "fastqjoin.h"
#include "sffconvert.h"
/* options */
......@@ -230,6 +239,7 @@ extern bool opt_relabel_keep;
extern bool opt_relabel_md5;
extern bool opt_relabel_sha1;
extern bool opt_samheader;
extern bool opt_sff_clip;
extern bool opt_sizeorder;
extern bool opt_xsize;
extern char * opt_allpairs_global;
......@@ -292,6 +302,7 @@ extern char * opt_rereplicate;
extern char * opt_reverse;
extern char * opt_samout;
extern char * opt_search_exact;
extern char * opt_sff_convert;
extern char * opt_shuffle;
extern char * opt_sintax;
extern char * opt_sortbylength;
......