Skip to content
Commits on Source (6)
......@@ -7,5 +7,6 @@ cmake_install.cmake
# misc
flexbar
.DS_Store
wget-log
include
local
......@@ -2,7 +2,7 @@ cmake_minimum_required( VERSION 2.8.2 )
project( FLEXBAR )
set( SEQAN_APP_VERSION "3.2.0" )
set( SEQAN_APP_VERSION "3.4.0" )
include_directories( ${FLEXBAR_SOURCE_DIR}/include )
# link_directories( ${FLEXBAR_SOURCE_DIR}/lib )
......
......@@ -4,48 +4,50 @@ The program Flexbar preprocesses high-throughput sequencing data efficiently. It
Refer to the [manual](https://github.com/seqan/flexbar/wiki) or contact [Johannes Roehr](https://github.com/jtroehr) for support with this application.
![Flexbar logo](https://github.com/seqan/flexbar/wiki/images/flexbar-logo.png)
### References
Johannes T. Roehr, Christoph Dieterich, Knut Reinert: Flexbar 3.0 – SIMD and multicore parallelization. Bioinformatics 2017.
Johannes T. Roehr, Christoph Dieterich, Knut Reinert:
Flexbar 3.0 – SIMD and multicore parallelization. Bioinformatics 2017.
See article on [PubMed](https://www.ncbi.nlm.nih.gov/pubmed/28541403)
Matthias Dodt, Johannes T. Roehr, Rina Ahmed, Christoph Dieterich: Flexbar – flexible barcode and adapter processing for next-generation sequencing platforms. Biology 2012.
Matthias Dodt, Johannes T. Roehr, Rina Ahmed, Christoph Dieterich:
Flexbar – flexible barcode and adapter processing for next-generation sequencing platforms. Biology 2012.
See article on [PubMed](https://www.ncbi.nlm.nih.gov/pubmed/24832523)
![Flexbar logo](https://github.com/seqan/flexbar/wiki/images/flexbar-logo.png)
### Download
Flexbar source code as well as binaries for Linux and Mac OS can be downloaded on the [release](https://github.com/seqan/flexbar/releases) page. Please follow instructions for building or setup of binaries below. Additionally, Flexbar is available via package manager on Debian systems. Versions before 2.4 can be found on the [old](https://sourceforge.net/projects/flexbar) page.
Flexbar source code as well as binaries for Linux and Mac OS can be downloaded on the [release](https://github.com/seqan/flexbar/releases) page. Please follow instructions for building or setup of binaries below. Additionally, Flexbar is available via package manager on Debian systems and in Conda. Versions before 2.4 can be found on the [old](https://sourceforge.net/projects/flexbar) page.
### Building from source
Make sure that `cmake` is available, as well as development and runtime files of the TBB library 4.0 or later (Intel Threading Building Blocks). For example on Debian systems, install the packages `libtbb-dev` and `libtbb2`. Furthermore, the SeqAn library and a compiler that supports C++14 is required:
* Get SeqAn library version 2.2.0 [here](https://github.com/seqan/seqan/releases/download/seqan-v2.2.0/seqan-library-2.2.0.tar.xz)
* Download Flexbar 3.2 source code [release](https://github.com/seqan/flexbar/releases)
* Get SeqAn library version 2.4.0 [here](https://github.com/seqan/seqan/releases/download/seqan-v2.4.0/seqan-library-2.4.0.tar.xz)
* Download Flexbar 3.4 source code [release](https://github.com/seqan/flexbar/releases)
Decompress both files:
tar xzf flexbar-3.2.0.tar.gz
tar xJf seqan-library-2.2.0.tar.xz
tar xzf flexbar-3.4.0.tar.gz
tar xJf seqan-library-2.4.0.tar.xz
Move SeqAn include folder to Flexbar:
mv seqan-library-2.2.0/include flexbar-3.2.0
mv seqan-library-2.4.0/include flexbar-3.4.0
Use these commands for building:
cd flexbar-3.2.0
cd flexbar-3.4.0
cmake .
make
Flexbar version 2.7 requires SeqAn 2.1.1 instead. Releases prior to 2.7 use the SeqAn 1.4.2 library.
Flexbar versions from 3.0 up to 3.2 require SeqAn 2.2.0 instead. Flexbar version 2.7 uses SeqAn 2.1.1 and releases prior to 2.7 use the SeqAn 1.4.2 library.
### Binaries
......@@ -69,17 +71,37 @@ Flexbar needs at least one file with sequencing reads in fasta or fastq format a
flexbar -r reads [-b barcodes] [-a adapters] [options]
Refer to the help screen `flexbar -h` or [manual](https://github.com/seqan/flexbar/wiki) for more information. Although default parameters of Flexbar are optimized to deliver good results in many scenarios, the adjustment of parameters might improve results, e.g. `--adapter-min-overlap`. To run tests, make sure `flexbar` is reachable via the path variable and run `flexbar_test.sh` within the test folder.
Refer to the help screen `flexbar -h` or [manual](https://github.com/seqan/flexbar/wiki) for more information. Although default parameters of Flexbar are optimized to deliver good results in many scenarios, the adjustment of parameters like `--adapter-min-overlap` might improve results. For tests, run `flexbar_test.sh` within the test folder if `flexbar` is reachable via the path variable.
#### Quality-based trimming
In this example, reads in fastq format are trimmed based on their quality scores in Illumina version 1.8 format. The TAIL method trims the right end of reads until a quality score equal or higher than the threshold is reached, default 20. Trimmed reads are written to `target.fastq` in same format as the input.
flexbar -r reads.fq -t target -q TAIL -qf i1.8
#### Demultiplexing with barcodes
Reads that are barcoded on the left end are demultiplexed by specifying a file with barcodes in fasta format. Reads that can be assigned are written to separate files using file names that are based on the names of barcodes in the fasta file.
flexbar -r reads.fq -b barcodes.fa -bt LTAIL
#### Adapter removal single-end
To remove adapter sequences from single-end reads, specify a file with adapters in fasta format. These are removed from the right side of reads per default, if they do not align before the read start. The left side of reads is kept if long enough. The overlap of an adapter and read must have at least length 3 with at most 10% errors in default settings.
flexbar -r reads.fq -a adapters.fa -ao 3 -ae 0.1
#### Adapter removal paired-end
#### Examples
For paired-end libraries, specify both files with paired reads and a fasta file with adapters for removal. Given adapters are trimmed in right mode per default. It is recommended to activate the pair overlap detection in case of standard paired reads. This increases the sensitivity by removing very short parts of adapters if an overlap is detected for a pair.
In this example, reads that are barcoded on left side are demultiplexed by specifying a file with barcodes in fasta format. After separation of reads, given adapters are removed from the right side if they do not align before read start. The left side of reads is kept if long enough. Remaining reads are written to the file `target.fastq` in same format as the input.
flexbar -r r1.fq -p r2.fq -a adapters.fa -ap ON
flexbar -r reads.fq -t target -b brc.fa -be LTAIL -a adp.fa
#### Adapter removal presets
The second example shows how to trim compressed reads based on their quality scores in illumina version 1.8 format. Afterwards, provided adapters are removed in right trim-end mode, only if the overlap of adapter and read has at least length 5 with at most 20% errors.
Several adapter presets for Illumina libraries are included in Flexbar. For example, select the `TruSeq` preset for standard TruSeq adapters and specify two read files for paired reads. If a preset is chosen, a separate file with adapters is not needed for removal. It is recommended to turn on the pair overlap detection for standard paired-end libraries.
flexbar -r reads.fq.gz -q TAIL -qf i1.8 -a adp.fa -ao 5 -at 0.2
flexbar -r r1.fq -p r2.fq -aa TruSeq -ap ON
For further examples visit the [manual](https://github.com/seqan/flexbar/wiki) page.
flexbar (1:3.4.0-1) unstable; urgency=medium
* New upstream version
* Standards-Version: 4.1.5
* Respekt DEB_BUILD_OPTIONS in override_dh_auto_test
-- Andreas Tille <tille@debian.org> Sun, 29 Jul 2018 09:46:44 +0200
flexbar (1:3.2.0-1) unstable; urgency=medium
* New upstream version
......
......@@ -10,7 +10,7 @@ Build-Depends: debhelper (>= 11~),
libseqan2-dev,
zlib1g-dev,
libbz2-dev
Standards-Version: 4.1.4
Standards-Version: 4.1.5
Vcs-Browser: https://salsa.debian.org/med-team/flexbar
Vcs-Git: https://salsa.debian.org/med-team/flexbar.git
Homepage: https://github.com/seqan/flexbar
......
......@@ -8,12 +8,14 @@ export DEB_BUILD_MAINT_OPTIONS = hardening=+all
dh $@
override_dh_auto_test:
ifeq (,$(filter nocheck,$(DEB_BUILD_OPTIONS)))
cd test ; \
export PATH="$(CURDIR)/obj-$(shell dpkg-architecture -qDEB_BUILD_GNU_TYPE):$$PATH" ; \
echo $$PATH ; \
export HOME=/tmp ; \
which flexbar ; \
./flexbar_test.sh
endif
override_dh_auto_clean:
dh_auto_clean
......
......@@ -2,11 +2,11 @@
Flexbar - flexible barcode and adapter removal
Version 3.2.0
Version 3.4.0
BSD 3-Clause License
uses SeqAn library release 2.2.0
uses SeqAn library release 2.4.0
and TBB library 4.0 or later
......@@ -29,8 +29,8 @@ int main(int argc, const char* argv[]){
using namespace std;
using namespace seqan;
const string version = "3.2";
const string date = "May 2018";
const string version = "3.4.0";
const string date = "June 2018";
ArgumentParser parser("flexbar");
......
......@@ -27,6 +27,7 @@
#include "Options.h"
#include "FlexbarIO.h"
#include "LoadFasta.h"
#include "LoadAdapters.h"
#include "SeqInput.h"
#include "PairedInput.h"
#include "PairedOutput.h"
......@@ -68,56 +69,69 @@ void loadBarcodes(Options &o, const bool secondSet){
template <typename TSeqStr, typename TString>
void loadAdapters(Options &o, const bool secondSet, const bool useAdapterFile){
using namespace std;
using namespace flexbar;
LoadFasta<TSeqStr, TString> lf(o, true);
if(useAdapterFile){
if(o.aPreset != APOFF){
LoadAdapters<TSeqStr, TString> la(o);
string adapFile = secondSet ? o.adapter2File : o.adapterFile;
la.loadSequences(secondSet);
lf.loadSequences(adapFile);
if(secondSet) o.adapters2 = la.getAdapters();
else o.adapters = la.getAdapters();
if(secondSet) la.printAdapters("Adapter2");
else la.printAdapters("Adapter");
}
else{
LoadFasta<TSeqStr, TString> lf(o, true);
if(secondSet){
o.adapters2 = lf.getBars();
if(useAdapterFile){
string adapFile = secondSet ? o.adapter2File : o.adapterFile;
lf.loadSequences(adapFile);
if(o.adapters2.size() == 0){
cerr << "\nERROR: No adapters found in file.\n" << endl;
exit(1);
if(secondSet){
o.adapters2 = lf.getBars();
if(o.adapters2.size() == 0){
cerr << "\nERROR: No adapters found in file.\n" << endl;
exit(1);
}
}
else{
o.adapters = lf.getBars();
if(o.adapters.size() == 0){
cerr << "\nERROR: No adapters found in file.\n" << endl;
exit(1);
}
}
}
else{
o.adapters = lf.getBars();
if(o.adapters.size() == 0){
cerr << "\nERROR: No adapters found in file.\n" << endl;
exit(1);
if(o.rcMode == RCOFF || o.rcMode == RCON){
TBar bar;
bar.id = "cmdline";
bar.seq = o.adapterSeq;
o.adapters.push_back(bar);
}
if(o.rcMode == RCON || o.rcMode == RCONLY){
TSeqStr adapterSeqRC = o.adapterSeq;
seqan::reverseComplement(adapterSeqRC);
TBar barRC;
barRC.id = "cmdline_rc";
barRC.seq = adapterSeqRC;
o.adapters.push_back(barRC);
}
lf.setBars(o.adapters);
}
if(secondSet) lf.printBars("Adapter2");
else lf.printBars("Adapter");
}
else{
if(o.rcMode == RCOFF || o.rcMode == RCON){
TBar bar;
bar.id = "cmdline";
bar.seq = o.adapterSeq;
o.adapters.push_back(bar);
}
if(o.rcMode == RCON || o.rcMode == RCONLY){
TSeqStr adapterSeqRC = o.adapterSeq;
seqan::reverseComplement(adapterSeqRC);
TBar barRC;
barRC.id = "cmdline_rc";
barRC.seq = adapterSeqRC;
o.adapters.push_back(barRC);
}
lf.setBars(o.adapters);
}
if(secondSet) lf.printBars("Adapter2");
else lf.printBars("Adapter");
}
......@@ -192,13 +206,13 @@ void printMessage(Options &o){
string s = "Flexbar completed ";
if(o.barDetect != BOFF) s += "barcode";
if(o.barDetect == WITHIN_READ_REMOVAL) s += " removal within reads";
if(o.barDetect == WITHIN_READ) s += " detection within reads";
if(o.barDetect == BARCODE_READ) s += " detection with separate reads";
if(o.barDetect != BOFF && o.adapRm != AOFF) s += " and ";
if(o.barDetect == BOFF && o.adapRm == AOFF) s += "basic processing";
if(o.adapRm != AOFF) s += "adapter removal";
if(o.barDetect != BOFF) s += "barcode";
if(o.barDetect == WITHIN_READ_REMOVAL) s += " removal within reads";
if(o.barDetect == WITHIN_READ) s += " detection within reads";
if(o.barDetect == BARCODE_READ) s += " detection with separate reads";
if(o.barDetect != BOFF && (o.adapRm != AOFF || o.poMode != POFF)) s += " and ";
if(o.barDetect == BOFF && o.adapRm == AOFF && o.poMode == POFF) s += "basic processing";
if(o.adapRm != AOFF || o.poMode != POFF) s += "adapter removal";
*o.out << s << ".\n" << endl;
......@@ -246,6 +260,8 @@ void startProcessing(Options &o){
if(o.writeLengthDist) outputFilter.writeLengthDist();
if(o.poMode != POFF) alignFilter.printPairOverlapStats();
if(o.adapRm != AOFF){
outputFilter.printAdapterRemovalStats();
alignFilter.printAdapterOverlapStats();
......@@ -296,7 +312,7 @@ void startProcessing(Options &o){
if(o.barDetect != BOFF && ! o.writeUnassigned)
*out << " skipped unassigned reads " << alignValue(len, alignFilter.getNrUnassignedReads()) << endl;
if(o.adapRm != AOFF)
if(o.adapRm != AOFF || o.poMode != POFF)
*out << " short prior to adapter removal " << alignValue(len, alignFilter.getNrPreShortReads()) << endl;
if(o.qTrim != QOFF && o.qtrimPostRm)
......
......@@ -9,6 +9,7 @@
#include <fstream>
#include <seqan/seq_io.h>
#include <seqan/stream.h>
#if SEQAN_HAS_ZLIB
#include <zlib.h>
......@@ -46,6 +47,176 @@ void closeFile(std::fstream &strm){
}
namespace seqan{
// Extension for input fasta file with dat ending
struct DatFastaAdaptor_;
using DatFastaAdaptor = Tag<DatFastaAdaptor_>;
// Specilaize sequence input file with custom tag
using DatFastaSeqFileIn = FormattedFile<Fastq, Input, DatFastaAdaptor>;
// Your custom format tag
struct DatFastaSeqFormat_;
using DatFastaSeqFormat = Tag<DatFastaSeqFormat_>;
// The extended TagList containing our custom format
using DatFastaSeqInFormats = TagList<DatFastaSeqFormat, SeqInFormats>;
// Overloaded file format metafunction
template <>
struct FileFormat<FormattedFile<Fastq, Input, DatFastaAdaptor> >{
using Type = TagSelector<DatFastaSeqInFormats>;
};
// Set magic header
template <typename T>
struct MagicHeader<DatFastaSeqFormat, T> : public MagicHeader<Fasta, T>{};
// Specify the valid ending for your fasta adaptor
template <typename T>
struct FileExtensions<DatFastaSeqFormat, T>{
static char const * VALUE[1];
};
template <typename T>
char const * FileExtensions<DatFastaSeqFormat, T>::VALUE[1] = { ".dat" };
// Overload inner readRecord function
template <typename TIdString, typename TSeqString, typename TSpec>
inline void
readRecord(TIdString & id, TSeqString & seq, FormattedFile<Fastq, Input, TSpec> & file, DatFastaSeqFormat){
readRecord(id, seq, file.iter, Fasta()); // Delegate to Fasta parser
}
// Extension for input fastq file with dat ending
struct DatFastqAdaptor_;
using DatFastqAdaptor = Tag<DatFastqAdaptor_>;
// Specilaize sequence input file with custom tag
using DatFastqSeqFileIn = FormattedFile<Fastq, Input, DatFastqAdaptor>;
// Your custom format tag
struct DatFastqSeqFormat_;
using DatFastqSeqFormat = Tag<DatFastqSeqFormat_>;
// The extended TagList containing our custom format
using DatFastqSeqInFormats = TagList<DatFastqSeqFormat, SeqInFormats>;
// Overloaded file format metafunction
template <>
struct FileFormat<FormattedFile<Fastq, Input, DatFastqAdaptor> >{
using Type = TagSelector<DatFastqSeqInFormats>;
};
// Set magic header
template <typename T>
struct MagicHeader<DatFastqSeqFormat, T> : public MagicHeader<Fastq, T>{};
// Specify the valid ending for your fastq adaptor
template <typename T>
struct FileExtensions<DatFastqSeqFormat, T>{
static char const * VALUE[1];
};
template <typename T>
char const * FileExtensions<DatFastqSeqFormat, T>::VALUE[1] = { ".dat" };
// Overload inner readRecord function
template <typename TIdString, typename TSeqString, typename TSpec>
inline void
readRecord(TIdString & id, TSeqString & seq, TIdString & qual, FormattedFile<Fastq, Input, TSpec> & file, DatFastqSeqFormat){
readRecord(id, seq, qual, file.iter, Fastq()); // Delegate to Fastq parser
}
template <typename TIdString, typename TSeqString, typename TSpec>
inline void
readRecord(TIdString & id, TSeqString & seq, FormattedFile<Fastq, Input, TSpec> & file, DatFastqSeqFormat){
readRecord(id, seq, file.iter, Fasta()); // Delegate to Fasta parser
}
// Extension for input fastq file with txt ending
struct FlexbarReadsAdaptor_;
using FlexbarReadsAdaptor = Tag<FlexbarReadsAdaptor_>;
// Specilaize sequence input file with custom tag
using FlexbarReadsSeqFileIn = FormattedFile<Fastq, Input, FlexbarReadsAdaptor>;
// Your custom format tag
struct FlexbarReadsSeqFormat_;
using FlexbarReadsSeqFormat = Tag<FlexbarReadsSeqFormat_>;
// The extended TagList containing our custom format
using FlexbarReadsSeqInFormats = TagList<FlexbarReadsSeqFormat, DatFastqSeqInFormats>;
// Overloaded file format metafunction
template <>
struct FileFormat<FormattedFile<Fastq, Input, FlexbarReadsAdaptor> >{
using Type = TagSelector<FlexbarReadsSeqInFormats>;
};
// Set magic header
template <typename T>
struct MagicHeader<FlexbarReadsSeqFormat, T> : public MagicHeader<Fastq, T>{};
// Specify the valid ending for your fastq adaptor
template <typename T>
struct FileExtensions<FlexbarReadsSeqFormat, T>{
static char const * VALUE[1];
};
template <typename T>
char const * FileExtensions<FlexbarReadsSeqFormat, T>::VALUE[1] = { ".txt" };
// Overload inner readRecord function
template <typename TIdString, typename TSeqString, typename TSpec>
inline void
readRecord(TIdString & id, TSeqString & seq, TIdString & qual, FormattedFile<Fastq, Input, TSpec> & file, FlexbarReadsSeqFormat){
readRecord(id, seq, qual, file.iter, Fastq()); // Delegate to Fastq parser
}
template <typename TIdString, typename TSeqString, typename TSpec>
inline void
readRecord(TIdString & id, TSeqString & seq, FormattedFile<Fastq, Input, TSpec> & file, FlexbarReadsSeqFormat){
readRecord(id, seq, file.iter, Fasta()); // Delegate to Fasta parser
}
// Extension for output reads file with dat ending
using FlexbarReadsSeqFileOut = FormattedFile<Fastq, Output, DatFastqAdaptor>;
using FlexbarReadsSeqOutFormats = TagList<DatFastqSeqFormat, SeqOutFormats>;
template <>
struct FileFormat<FormattedFile<Fastq, Output, DatFastqAdaptor> >{
using Type = TagSelector<FlexbarReadsSeqOutFormats>;
};
// Inner writeRecord function
template <typename TSpec, typename TIdString, typename TSeqString>
inline void
writeRecord(FormattedFile<Fastq, Output, TSpec> & file, TIdString & id, TSeqString & seq, TIdString & qual){
writeRecord(file.iter, id, seq, qual, Fastq()); // Delegate to Fastq parser
}
template <typename TSpec, typename TIdString, typename TSeqString>
inline void
writeRecord(FormattedFile<Fastq, Output, TSpec> & file, TIdString & id, TSeqString & seq){
writeRecord(file.iter, id, seq, Fasta()); // Delegate to Fasta parser
}
}
void checkFileCompression(const std::string path){
using namespace std;
......@@ -108,39 +279,38 @@ void checkInputType(const std::string path, flexbar::FileFormat &format, const b
if(c == '>') format = FASTA;
else if(c == '@') format = FASTQ;
else{
cerr << "\nERROR: Reads file type not conform.\n";
cerr << "Uncompressed fasta or fastq for stdin.\n" << endl;
cerr << "\nERROR: Format of reads from standard input not conform.\n";
cerr << "Use uncompressed fasta or fastq for stdin.\n" << endl;
exit(1);
}
}
else{
seqan::FlexbarReadsSeqFileIn seqFileIn;
seqan::SeqFileIn seqFileIn;
if(!open(seqFileIn, path.c_str())){
if(! open(seqFileIn, path.c_str())){
cerr << "\nERROR: Could not open file " << path << "\n" << endl;
exit(1);
}
if(! atEnd(seqFileIn)){
try{
FSeqStr rseq;
FString tag, qual;
try{
if(! atEnd(seqFileIn)){
FSeqStr seq;
FString id, qual;
readRecord(tag, rseq, qual, seqFileIn);
readRecord(id, seq, qual, seqFileIn);
if(qual == "") format = FASTA;
else format = FASTQ;
}
catch(seqan::Exception const &e){
cerr << "\nERROR: " << e.what() << "\nProgram execution aborted.\n" << endl;
else{
cerr << "\nReads file seems to be empty.\n\n" << endl;
close(seqFileIn);
exit(1);
}
}
else{
cerr << "\nReads file seems to be empty.\n\n" << endl;
catch(seqan::Exception const &e){
cerr << "\nERROR: " << e.what() << "\nProgram execution aborted.\n" << endl;
close(seqFileIn);
exit(1);
}
......
......@@ -11,13 +11,14 @@ class SeqRead {
TSeqStr seq;
TString id, qual, umi;
bool rmAdapter, rmAdapterRC;
bool rmAdapter, rmAdapterRC, pairOverlap;
SeqRead(TSeqStr& sequence, TString& seqID) :
seq(sequence),
id(seqID),
rmAdapter(false),
rmAdapterRC(false){
rmAdapterRC(false),
pairOverlap(false){
}
SeqRead(TSeqStr& sequence, TString& seqID, TString& quality) :
......@@ -25,7 +26,8 @@ class SeqRead {
id(seqID),
qual(quality),
rmAdapter(false),
rmAdapterRC(false){
rmAdapterRC(false),
pairOverlap(false){
}
};
......@@ -132,6 +134,30 @@ namespace flexbar{
};
struct Adapters {
FString id, info;
FSeqStr seq1, seq2, seqc;
};
enum AdapterPreset {
APOFF,
TRUSEQ,
SMALLRNA,
METHYL,
RIBO,
NEXTERA,
NEXTERAMP
};
enum PairOverlap {
POFF,
PON,
PSHORT,
PONLY
};
enum RevCompMode {
RCOFF,
RCON,
......
// LoadAdapters.h
#ifndef FLEXBAR_LOADADAPTERS_H
#define FLEXBAR_LOADADAPTERS_H
template <typename TSeqStr, typename TString>
class LoadAdapters {
private:
std::ostream *out;
tbb::concurrent_vector<flexbar::TBar> adapters;
flexbar::Adapters a;
const flexbar::AdapterPreset m_aPreset;
const flexbar::RevCompMode m_rcMode;
public:
LoadAdapters(const Options &o) :
out(o.out),
m_aPreset(o.aPreset),
m_rcMode(o.rcMode){
using namespace flexbar;
// Illumina sequencing adapters
// Oligonucleotide sequences © 2018 Illumina, Inc. All rights reserved.
// Obtained from https://support.illumina.com/bulletins/2016/12/what-sequences-do-i-use-for-adapter-trimming.html
if(m_aPreset == TRUSEQ){
a.id = "TruSeq";
a.seq1 = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA";
a.seq2 = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT";
a.info = "TruSeq LT and TruSeq HT-based kits";
}
else if(m_aPreset == METHYL){
a.id = "TrueSeq-Methyl";
a.seq1 = "AGATCGGAAGAGCACACGTCTGAAC";
a.seq2 = "AGATCGGAAGAGCGTCGTGTAGGGA";
a.info = "ScriptSeq and TruSeq DNA Methylation";
}
else if(m_aPreset == SMALLRNA){
a.id = "TrueSeq-smallRNA";
a.seq1 = "TGGAATTCTCGGGTGCCAAGG";
a.info = "TruSeq Small RNA";
}
else if(m_aPreset == RIBO){
a.id = "TrueSeq-Ribo";
a.seq1 = "AGATCGGAAGAGCACACGTCT";
a.info = "TruSeq Ribo Profile";
}
else if(m_aPreset == NEXTERA){
a.id = "Nextera-TruSight";
a.seq1 = "CTGTCTCTTATACACATCT";
a.info = "AmpliSeq, Nextera, Nextera DNA Flex, Nextera DNA, Nextera XT, Nextera Enrichment, Nextera Rapid Capture Enrichment, TruSight Enrichment, TruSight Rapid Capture Enrichment, TruSight HLA";
}
else if(m_aPreset == NEXTERAMP){
a.id = "Nextera-Matepair";
a.seq1 = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC";
a.seq2 = "GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT";
a.seqc = "CTGTCTCTTATACACATCT";
a.info = "Nextera Mate Pair";
}
// IonTorrent sequencing adapters
Adapters IonTorrent;
IonTorrent.id = "IonTorrent";
IonTorrent.seq1 = "ATCACCGACTGCCCATAGAGAGGCTGAGAC";
IonTorrent.seq1 = "CCATCTCATCCCTGCGTGTCTCCGACTCAG";
IonTorrent.seq2 = "CCTCTCTATGGGCAGTCGGTGAT";
IonTorrent.info = "IonTorrent";
};
virtual ~LoadAdapters(){};
void loadSequences(const bool secondSet){
using namespace std;
using namespace flexbar;
TString id = a.id;
TSeqStr seq;
if(! secondSet) seq = a.seq1;
else seq = a.seq2;
if(m_rcMode == RCOFF || m_rcMode == RCON){
TBar adapter;
adapter.id = id;
adapter.seq = seq;
adapters.push_back(adapter);
}
if(m_rcMode == RCON || m_rcMode == RCONLY){
TString idRC = id;
TSeqStr seqRC = seq;
append(idRC, "_rc");
seqan::reverseComplement(seqRC);
TBar adapterRC;
adapterRC.id = idRC;
adapterRC.seq = seqRC;
adapterRC.rcAdapter = true;
adapters.push_back(adapterRC);
}
if(m_aPreset == NEXTERAMP){
TString idc = id;
TSeqStr seqc = a.seqc;
append(idc, "_circ");
TBar adapter;
adapter.id = idc;
adapter.seq = seqc;
adapters.push_back(adapter);
append(idc, "_rc");
seqan::reverseComplement(seqc);
TBar adapterRC;
adapterRC.id = idc;
adapterRC.seq = seqc;
adapters.push_back(adapterRC);
}
};
tbb::concurrent_vector<flexbar::TBar> getAdapters(){
return adapters;
}
void printAdapters(std::string adapterName) const {
using namespace std;
const unsigned int maxSpaceLen = 23;
stringstream s; s << adapterName;
int len = s.str().length() + 1;
if(len + 2 > maxSpaceLen) len = maxSpaceLen - 2;
*out << adapterName << ":" << string(maxSpaceLen - len, ' ') << "Sequence:" << "\n";
for(unsigned int i=0; i < adapters.size(); ++i){
TString seqTag = adapters.at(i).id;
int whiteSpaceLen = maxSpaceLen - length(seqTag);
if(whiteSpaceLen < 2) whiteSpaceLen = 2;
string whiteSpace = string(whiteSpaceLen, ' ');
*out << seqTag << whiteSpace << adapters.at(i).seq << "\n";
}
*out << endl;
}
};
#endif
......@@ -33,19 +33,17 @@ public:
using namespace std;
using namespace flexbar;
seqan::SeqFileIn seqFileIn;
setFormat(seqFileIn, seqan::Fasta());
seqan::DatFastaSeqFileIn seqFileIn;
if(! open(seqFileIn, filePath.c_str())){
cerr << "\nERROR: Could not open file " << filePath << "\n" << endl;
exit(1);
}
TSeqStrs seqs;
TStrings ids;
try{
TSeqStrs seqs;
TStrings ids;
readRecords(ids, seqs, seqFileIn);
map<TString, short> idMap;
......
This diff is collapsed.
......@@ -4,6 +4,7 @@
#define FLEXBAR_PAIREDALIGN_H
#include "SeqAlign.h"
#include "SeqAlignPair.h"
#include "SeqAlignAlgo.h"
......@@ -17,17 +18,18 @@ private:
const std::string m_htrimLeft, m_htrimRight;
const int m_htrimMinLength, m_htrimMaxLength;
const float m_htrimErrorRate;
const unsigned int m_htrimMinLength, m_htrimMinLength2, m_htrimMaxLength;
const unsigned int m_arTimes;
const float m_htrimErrorRate;
const flexbar::FileFormat m_format;
const flexbar::LogAlign m_log;
const flexbar::RunType m_runType;
const flexbar::BarcodeDetect m_barType;
const flexbar::AdapterRemoval m_adapRem;
const flexbar::TrimEnd m_aTrimEnd, m_arcTrimEnd, m_bTrimEnd;
const flexbar::PairOverlap m_poMode;
tbb::atomic<unsigned long> m_unassigned;
tbb::concurrent_vector<flexbar::TBar> *m_adapters, *m_adapters2;
......@@ -36,6 +38,9 @@ private:
typedef SeqAlign<TSeqStr, TString, SeqAlignAlgo<TSeqStr> > TSeqAlign;
TSeqAlign *m_a1, *m_b1, *m_a2, *m_b2;
typedef SeqAlignPair<TSeqStr, TString, SeqAlignAlgo<TSeqStr> > TSeqAlignPair;
TSeqAlignPair *m_p;
std::ostream *out;
public:
......@@ -48,6 +53,7 @@ public:
m_runType(o.runType),
m_barType(o.barDetect),
m_adapRem(o.adapRm),
m_poMode(o.poMode),
m_aTrimEnd(o.a_end),
m_arcTrimEnd(o.arc_end),
m_bTrimEnd(o.b_end),
......@@ -58,6 +64,7 @@ public:
m_htrimLeft(o.htrimLeft),
m_htrimRight(o.htrimRight),
m_htrimMinLength(o.htrimMinLength),
m_htrimMinLength2(o.htrimMinLength2),
m_htrimMaxLength(o.htrimMaxLength),
m_htrimMaxFirstOnly(o.htrimMaxFirstOnly),
m_htrimErrorRate(o.h_errorRate),
......@@ -72,11 +79,13 @@ public:
m_barcodes2 = &o.barcodes2;
m_adapters2 = &o.adapters2;
m_b1 = new TSeqAlign(m_barcodes, o, o.b_min_overlap, o.b_errorRate, o.b_tail_len, o.b_match, o.b_mismatch, o.b_gapCost, true);
m_a1 = new TSeqAlign(m_adapters, o, o.a_min_overlap, o.a_errorRate, o.a_tail_len, o.match, o.mismatch, o.gapCost, false);
m_b1 = new TSeqAlign(m_barcodes, o, o.b_min_overlap, o.b_errorRate, o.b_tail_len, o.b_match, o.b_mismatch, o.b_gapCost, true);
m_b2 = new TSeqAlign(m_barcodes2, o, o.b_min_overlap, o.b_errorRate, o.b_tail_len, o.b_match, o.b_mismatch, o.b_gapCost, true);
m_a2 = new TSeqAlign(m_adapters2, o, o.a_min_overlap, o.a_errorRate, o.a_tail_len, o.match, o.mismatch, o.gapCost, false);
m_a1 = new TSeqAlign(m_adapters, o, o.a_min_overlap, o.a_errorRate, o.a_tail_len, o.a_match, o.a_mismatch, o.a_gapCost, false);
m_a2 = new TSeqAlign(m_adapters2, o, o.a_min_overlap, o.a_errorRate, o.a_tail_len, o.a_match, o.a_mismatch, o.a_gapCost, false);
m_p = new TSeqAlignPair(o, o.p_min_overlap, o.a_errorRate, o.a_match, o.a_mismatch, o.a_gapCost);
if(m_log == flexbar::TAB)
*out << "ReadTag\tQueryTag\tQueryStart\tQueryEnd\tOverlapLength\tMismatches\tIndels\tAllowedErrors" << std::endl;
......@@ -85,9 +94,10 @@ public:
virtual ~PairedAlign(){
delete m_b1;
delete m_a1;
delete m_b2;
delete m_a1;
delete m_a2;
delete m_p;
};
......@@ -95,21 +105,18 @@ public:
using namespace flexbar;
if(m_barType != BOFF){
switch(m_barType){
case BARCODE_READ: pRead->barID = m_b1->alignSeqRead(pRead->b, false, alBundle[0], cycle[0], idxAl[0], alMode, m_bTrimEnd); break;
case WITHIN_READ_REMOVAL2: pRead->barID2 = m_b2->alignSeqRead(pRead->r2, true, alBundle[2], cycle[2], idxAl[2], alMode, m_bTrimEnd);
case WITHIN_READ_REMOVAL: pRead->barID = m_b1->alignSeqRead(pRead->r1, true, alBundle[1], cycle[1], idxAl[1], alMode, m_bTrimEnd); break;
case WITHIN_READ2: pRead->barID2 = m_b2->alignSeqRead(pRead->r2, false, alBundle[2], cycle[2], idxAl[2], alMode, m_bTrimEnd);
case WITHIN_READ: pRead->barID = m_b1->alignSeqRead(pRead->r1, false, alBundle[1], cycle[1], idxAl[1], alMode, m_bTrimEnd); break;
case BOFF: break;
}
switch(m_barType){
case BARCODE_READ: pRead->barID = m_b1->alignSeqRead(pRead->b, false, alBundle[0], cycle[0], idxAl[0], alMode, m_bTrimEnd); break;
case WITHIN_READ_REMOVAL2: pRead->barID2 = m_b2->alignSeqRead(pRead->r2, true, alBundle[2], cycle[2], idxAl[2], alMode, m_bTrimEnd);
case WITHIN_READ_REMOVAL: pRead->barID = m_b1->alignSeqRead(pRead->r1, true, alBundle[1], cycle[1], idxAl[1], alMode, m_bTrimEnd); break;
case WITHIN_READ2: pRead->barID2 = m_b2->alignSeqRead(pRead->r2, false, alBundle[2], cycle[2], idxAl[2], alMode, m_bTrimEnd);
case WITHIN_READ: pRead->barID = m_b1->alignSeqRead(pRead->r1, false, alBundle[1], cycle[1], idxAl[1], alMode, m_bTrimEnd); break;
case BOFF: break;
}
if(pRead->barID == 0 || (m_twoBarcodes && pRead->barID2 == 0)){
if(pRead->barID == 0 || (m_twoBarcodes && pRead->barID2 == 0)){
if(cycle[0] != PRELOAD) m_unassigned++;
}
if(cycle[0] != PRELOAD) m_unassigned++;
}
}
......@@ -118,15 +125,12 @@ public:
using namespace flexbar;
if(m_adapRem != AOFF){
if(m_adapRem != ATWO)
m_a1->alignSeqRead(pRead->r1, true, alBundle[0], cycle[0], idxAl[0], alMode, trimEnd);
if(pRead->r2 != NULL && m_adapRem != AONE){
if(m_adapRem != NORMAL2) m_a1->alignSeqRead(pRead->r2, true, alBundle[1], cycle[1], idxAl[1], alMode, trimEnd);
else m_a2->alignSeqRead(pRead->r2, true, alBundle[1], cycle[1], idxAl[1], alMode, trimEnd);
}
if(m_adapRem != ATWO)
m_a1->alignSeqRead(pRead->r1, true, alBundle[0], cycle[0], idxAl[0], alMode, trimEnd);
if(pRead->r2 != NULL && m_adapRem != AONE){
if(m_adapRem != NORMAL2) m_a1->alignSeqRead(pRead->r2, true, alBundle[1], cycle[1], idxAl[1], alMode, trimEnd);
else m_a2->alignSeqRead(pRead->r2, true, alBundle[1], cycle[1], idxAl[1], alMode, trimEnd);
}
}
......@@ -166,7 +170,10 @@ public:
}
}
if(cutPos > 0 && cutPos >= m_htrimMinLength){
unsigned int htrimMinLength = m_htrimMinLength;
if(m_htrimMinLength2 > 0 && s > 0) htrimMinLength = m_htrimMinLength2;
if(cutPos > 0 && cutPos >= htrimMinLength){
erase(seqRead->seq, 0, cutPos);
if(m_format == FASTQ){
......@@ -214,7 +221,10 @@ public:
}
}
if(cutPos < seqLen && cutPos <= seqLen - m_htrimMinLength){
unsigned int htrimMinLength = m_htrimMinLength;
if(m_htrimMinLength2 > 0 && s > 0) htrimMinLength = m_htrimMinLength2;
if(cutPos < seqLen && cutPos <= seqLen - htrimMinLength){
erase(seqRead->seq, cutPos, length(seqRead->seq));
if(m_format == FASTQ){
......@@ -264,7 +274,6 @@ public:
idxAl.push_back(0);
cycle.push_back(PRELOAD);
}
for(unsigned int i = 0; i < prBundle->size(); ++i){
alignPairedReadToBarcodes(prBundle->at(i), alBundle, cycle, idxAl, alMode);
}
......@@ -273,7 +282,6 @@ public:
idxAl[i] = 0;
cycle[i] = COMPUTE;
}
for(unsigned int i = 0; i < prBundle->size(); ++i){
alignPairedReadToBarcodes(prBundle->at(i), alBundle, cycle, idxAl, alMode);
}
......@@ -281,12 +289,29 @@ public:
// adapter removal
if(m_poMode != POFF){
Alignments alignments;
unsigned int idxAl = 0;
ComputeCycle cycle = PRELOAD;
for(unsigned int i = 0; i < prBundle->size(); ++i){
m_p->alignSeqReadPair(prBundle->at(i)->r1, prBundle->at(i)->r2, alignments, cycle, idxAl);
}
idxAl = 0;
cycle = COMPUTE;
for(unsigned int i = 0; i < prBundle->size(); ++i){
m_p->alignSeqReadPair(prBundle->at(i)->r1, prBundle->at(i)->r2, alignments, cycle, idxAl);
}
}
if(m_adapRem != AOFF){
for(unsigned int c = 0; c < m_arTimes; ++c){
flexbar::TrimEnd trimEnd = m_aTrimEnd;
unsigned int rc = 1;
if(m_useRcTrimEnd){
......@@ -314,7 +339,6 @@ public:
idxAl.push_back(0);
cycle.push_back(PRELOAD);
}
for(unsigned int i = 0; i < prBundle->size(); ++i){
alignPairedReadToAdapters(prBundle->at(i), alBundle, cycle, idxAl, alMode, trimEnd);
}
......@@ -323,7 +347,6 @@ public:
idxAl[i] = 0;
cycle[i] = COMPUTE;
}
for(unsigned int i = 0; i < prBundle->size(); ++i){
alignPairedReadToAdapters(prBundle->at(i), alBundle, cycle, idxAl, alMode, trimEnd);
}
......@@ -333,12 +356,10 @@ public:
if(m_umiTags){
for(unsigned int i = 0; i < prBundle->size(); ++i){
append(prBundle->at(i)->r1->id, prBundle->at(i)->r1->umi);
if(prBundle->at(i)->r2 != NULL){
append(prBundle->at(i)->r1->id, prBundle->at(i)->r2->umi);
append(prBundle->at(i)->r2->id, prBundle->at(i)->r1->umi);
append(prBundle->at(i)->r2->id, prBundle->at(i)->r2->umi);
}
......@@ -347,7 +368,6 @@ public:
if(m_htrim){
if(m_htrimLeft != ""){
for(unsigned int i = 0; i < prBundle->size(); ++i){
trimLeftHPS(prBundle->at(i)->r1);
......@@ -356,7 +376,6 @@ public:
}
}
if(m_htrimRight != ""){
for(unsigned int i = 0; i < prBundle->size(); ++i){
trimRightHPS(prBundle->at(i)->r1);
......@@ -385,8 +404,20 @@ public:
using namespace flexbar;
if(m_adapRem != NORMAL2) return m_a1->getNrPreShortReads();
else return m_a1->getNrPreShortReads() + m_a2->getNrPreShortReads();
if (m_poMode != POFF) return m_p->getNrPreShortReads();
else if(m_adapRem != NORMAL2) return m_a1->getNrPreShortReads();
else return m_a1->getNrPreShortReads() + m_a2->getNrPreShortReads();
}
void printPairOverlapStats(){
using namespace flexbar;
if(m_p->getNrOverlappingReads() > 0)
*out << m_p->getOverlapStatsString() << "\n\n";
if(m_adapRem == AOFF) *out << std::endl;
}
......
......@@ -15,7 +15,7 @@ private:
const bool m_isPaired, m_useBarRead, m_useNumberTag, m_interleaved;
const unsigned int m_bundleSize;
tbb::atomic<unsigned long> m_uncalled, m_uncalledPairs, m_tagCounter;
tbb::atomic<unsigned long> m_uncalled, m_uncalledPairs, m_tagCounter, m_nBundles;
SeqInput<TSeqStr, TString> *m_f1, *m_f2, *m_b;
public:
......@@ -29,6 +29,7 @@ public:
m_isPaired(o.isPaired),
m_useBarRead(o.barDetect == flexbar::BARCODE_READ),
m_bundleSize(o.bundleSize),
m_nBundles(o.nBundles),
m_tagCounter(0),
m_uncalled(0),
m_uncalledPairs(0){
......@@ -43,6 +44,8 @@ public:
if(m_useBarRead)
m_b = new SeqInput<TSeqStr, TString>(o, o.barReadsFile, false, false);
if(m_nBundles > 0) ++m_nBundles;
}
virtual ~PairedInput(){
......@@ -62,6 +65,10 @@ public:
TStrings quals, quals2, qualsBR;
TBools uncalled, uncalled2, uncalledBR;
if(m_nBundles > 0){
if(m_nBundles-- == 1) return NULL;
}
unsigned int bundleSize = m_bundleSize;
if(m_interleaved) bundleSize = m_bundleSize * 2;
......
......@@ -102,10 +102,10 @@ public:
b2 << barcode2;
string s = m_target + "_barcode_" + b1.str();
TSeqOutput *of1 = new TSeqOutput(s, barcode1, false, o);
TSeqOutput *of1 = new TSeqOutput(s, barcode, false, o);
s = m_target + "_barcode_" + b2.str();
TSeqOutput *of2 = new TSeqOutput(s, barcode2, false, o);
TSeqOutput *of2 = new TSeqOutput(s, barcode, false, o);
TOutFiles& f = m_outMap[i + 1];
f.f1 = of1;
......@@ -125,10 +125,10 @@ public:
if(m_writeUnassigned){
string s = m_target + "_barcode_unassigned_1";
TSeqOutput *of1 = new TSeqOutput(s, "unassigned_1", false, o);
TSeqOutput *of1 = new TSeqOutput(s, "unassigned", false, o);
s = m_target + "_barcode_unassigned_2";
TSeqOutput *of2 = new TSeqOutput(s, "unassigned_2", false, o);
TSeqOutput *of2 = new TSeqOutput(s, "unassigned", false, o);
TOutFiles& f = m_outMap[0];
f.f1 = of1;
......@@ -154,10 +154,16 @@ public:
m_outMap = new TOutFiles[m_mapsize];
string s = m_target + "_1";
TSeqOutput *of1 = new TSeqOutput(s, "1", false, o);
if(o.outReadsFile != "") s = o.outReadsFile;
TSeqOutput *of1 = new TSeqOutput(s, "", false, o);
s = m_target + "_2";
TSeqOutput *of2 = new TSeqOutput(s, "2", false, o);
if(o.outReadsFile2 != "") s = o.outReadsFile2;
TSeqOutput *of2 = new TSeqOutput(s, "", false, o);
TOutFiles& f = m_outMap[0];
f.f1 = of1;
......@@ -182,6 +188,9 @@ public:
m_outMap = new TOutFiles[m_mapsize];
string s = m_target;
if(o.outReadsFile != "") s = o.outReadsFile;
TSeqOutput *of1 = new TSeqOutput(s, "", false, o);
TOutFiles& f = m_outMap[0];
......
// ==========================================================================
// QualTrimming.h
// ==========================================================================
// Copyright (c) 2006-2015, Knut Reinert, FU Berlin
// All rights reserved.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are met:
//
// * Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
// * Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
// * Neither the name of Knut Reinert or the FU Berlin nor the names of
// its contributors may be used to endorse or promote products derived
// from this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
// AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
// ARE DISCLAIMED. IN NO EVENT SHALL KNUT REINERT OR THE FU BERLIN BE LIABLE
// FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
// DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
// SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
// CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
// LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
// OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
// DAMAGE.
//
// ==========================================================================
// QualTrimming.h
// Authors: Sebastian Roskosch
// Benjamin Menkuec
// Johannes Roehr
// ==========================================================================
#ifndef FLEXBAR_QUALTRIMMING_H
#define FLEXBAR_QUALTRIMMING_H
// Tags for choosing quality-based trimming method
struct Tail {};
struct BWA {};
......@@ -50,13 +18,9 @@ struct Window {
};
// ============================================================================
// Functions
// ============================================================================
template <typename TString>
inline unsigned getQuality(const TString& qual, unsigned i){
return static_cast<int>(qual[i]);
}
......@@ -67,37 +31,35 @@ unsigned qualTrimming(const TString& qual, unsigned const cutoff, Tail const &){
for (int i = length(qual) - 1; i >= 0; --i){
if(getQuality(qual, i) >= cutoff){
return i + 1;
}
if(getQuality(qual, i) >= cutoff) return i + 1;
}
return 0;
}
// Trim by shifting a window over the seq and cut where avg qual in window turns bad first.
// Trim by shifting a window over the seq and cut where avg qual in window turns bad
template <typename TString>
unsigned qualTrimming(const TString& qual, unsigned const _cutoff, Window const & spec){
unsigned window = spec.size;
unsigned avg = 0, i = 0;
// Work with absolute cutoff in window to avoid divisions.
// Work with absolute cutoff in window to avoid divisions
unsigned cutoff = _cutoff * window;
// Calculate average quality of initial window.
// Calculate average quality of initial window
for (i = 0; i < window; ++i){
avg += getQuality(qual, i);
}
// Shift window over read and keep mean quality, update in constant time.
// Shift window over read and keep mean quality, update in constant time
for (i = 0; i < length(qual) && avg >= cutoff; ++i){
// Take care only not to go over the end of the sequence. Shorten window near the end.
// Take care only not to go over the end of the sequence. Shorten window near the end
avg -= getQuality(qual, i);
avg += i + window < length(qual) ? getQuality(qual, i + window) : 0;
}
return i; // i holds start of first window that turned bad.
return i; // i holds start of first window that turned bad
}
......@@ -114,7 +76,6 @@ unsigned qualTrimming(const TString& qual, unsigned const cutoff, BWA const &){
if(sum < 0){
break;
}
if(sum > max){
max = sum;
max_arg = i;
......@@ -127,6 +88,8 @@ unsigned qualTrimming(const TString& qual, unsigned const cutoff, BWA const &){
template <typename TSeqStr, typename TString>
bool qualTrim(TSeqStr &seq, TString &qual, const flexbar::QualTrimType qtrim, const int cutoff, const int wSize){
using namespace seqan;
unsigned cutPos;
if(qtrim == flexbar::TAIL){
......@@ -139,8 +102,6 @@ bool qualTrim(TSeqStr &seq, TString &qual, const flexbar::QualTrimType qtrim, co
cutPos = qualTrimming(qual, cutoff, BWA());
}
using namespace seqan;
if(cutPos < length(qual)){
seq = prefix(seq, cutPos);
......@@ -148,9 +109,7 @@ bool qualTrim(TSeqStr &seq, TString &qual, const flexbar::QualTrimType qtrim, co
return true;
}
else{
return false;
}
else return false;
}
......@@ -171,25 +130,4 @@ bool qualTrim(SeqRead<TSeqStr, TString> *seqRead, const flexbar::QualTrimType qt
}
// inline unsigned getQuality(const seqan::String<seqan::Dna5Q>& seq, unsigned i)
// {
// return seqan::getQualityValue(seq[i]);
// }
// template<bool tag>
// struct TagTrimming
// {
// static const bool value = tag;
// };
// template <typename TSeq, typename TSpec>
// unsigned trimRead(TSeq& seq, unsigned const cutoff, TSpec const & spec) noexcept
// {
// unsigned ret, cut_pos;
// cut_pos = _trimRead(seqan::Dna5QString(seq), cutoff, spec);
// ret = length(seq) - cut_pos;
// erase(seq, cut_pos, length(seq));
// return ret;
// }
#endif
......@@ -11,8 +11,9 @@ private:
typedef AlignResults<TSeqStr> TAlignResults;
const flexbar::LogAlign m_log;
const flexbar::FileFormat m_format;
const flexbar::LogAlign m_log;
const flexbar::FileFormat m_format;
const flexbar::PairOverlap m_poMode;
const bool m_isBarcoding, m_writeTag, m_umiTags, m_strictRegion;
const int m_minLength, m_minOverlap, m_tailLength;
......@@ -24,7 +25,7 @@ private:
tbb::concurrent_vector<unsigned long> m_rmOverlaps;
std::ostream *m_out;
TAlgorithm algo;
TAlgorithm m_algo;
public:
......@@ -36,6 +37,7 @@ public:
m_isBarcoding(isBarcoding),
m_umiTags(o.umiTags),
m_minLength(o.min_readLen),
m_poMode(o.poMode),
m_log(o.logAlign),
m_format(o.format),
m_writeTag(o.useRemovalTag),
......@@ -44,7 +46,7 @@ public:
m_out(o.out),
m_nPreShortReads(0),
m_modified(0),
algo(TAlgorithm(o, match, mismatch, gapCost)){
m_algo(TAlgorithm(o, match, mismatch, gapCost, ! isBarcoding)){
m_queries = queries;
m_rmOverlaps = tbb::concurrent_vector<unsigned long>(flexbar::MAX_READLENGTH + 1, 0);
......@@ -119,7 +121,7 @@ public:
TAlignResults a;
// global sequence alignment
algo.alignGlobal(a, alignments, cycle, idxAl++, trimEnd);
m_algo.alignGlobal(a, alignments, cycle, idxAl++, trimEnd);
a.queryLength = length(m_queries->at(i).seq);
a.tailLength = (m_tailLength > 0) ? m_tailLength : a.queryLength;
......@@ -130,6 +132,9 @@ public:
float madeErrors = static_cast<float>(a.mismatches + a.gapsR + a.gapsA);
int minOverlap = (m_isBarcoding && m_minOverlap == 0) ? a.queryLength : m_minOverlap;
if(! m_isBarcoding && m_poMode == PON && seqRead.pairOverlap &&
(trimEnd == RIGHT || trimEnd == RTAIL)) minOverlap = 1;
bool validAl = true;
if(((trimEnd == RTAIL || trimEnd == RIGHT) && a.startPosA < a.startPosS && m_strictRegion) ||
......
......@@ -21,13 +21,14 @@ private:
// TScoreSimple m_score;
TScoreMatrix m_scoreMatrix;
const bool m_umiTags;
const bool m_umiTags, m_isAdapterRm;
const flexbar::LogAlign m_log;
public:
SeqAlignAlgo(const Options &o, const int match, const int mismatch, const int gapCost):
SeqAlignAlgo(const Options &o, const int match, const int mismatch, const int gapCost, const bool isAdapterRm):
m_umiTags(o.umiTags),
m_isAdapterRm(isAdapterRm),
m_log(o.logAlign){
using namespace seqan;
......@@ -38,7 +39,7 @@ public:
for(unsigned i = 0; i < ValueSize<TChar>::VALUE; ++i){
for(unsigned j = 0; j < ValueSize<TChar>::VALUE; ++j){
if(i == j || TChar(j) == 'N')
if(i == j || TChar(j) == 'N' || (TChar(i) == 'N' && isAdapterRm))
setScore(m_scoreMatrix, TChar(i), TChar(j), match);
else setScore(m_scoreMatrix, TChar(i), TChar(j), mismatch);
}
......@@ -125,10 +126,10 @@ public:
for(; it1 != end(row1); ++it1){
if(a.startPos <= alPos && alPos < a.endPos){
if(isGap(it1)) ++a.gapsR;
else if(isGap(it2)) ++a.gapsA;
else if(*it1 != *it2 && *it2 != 'N') ++a.mismatches;
else if(m_umiTags && *it2 == 'N') append(a.umiTag, (TChar) *it1);
if(isGap(it1)) ++a.gapsR;
else if(isGap(it2)) ++a.gapsA;
else if(*it1 != *it2 && *it2 != 'N' && (*it1 != 'N' || ! m_isAdapterRm)) ++a.mismatches;
else if(m_umiTags && *it2 == 'N') append(a.umiTag, (TChar) *it1);
}
++alPos;
++it2;
......
// SeqAlignPair.h
#ifndef FLEXBAR_SEQALIGNPAIR_H
#define FLEXBAR_SEQALIGNPAIR_H
template <typename TSeqStr, typename TString, class TAlgorithm>
class SeqAlignPair {
private:
typedef AlignResults<TSeqStr> TAlignResults;
const flexbar::LogAlign m_log;
const flexbar::FileFormat m_format;
const flexbar::PairOverlap m_poMode;
const bool m_writeTag;
const int m_minLength, m_minOverlap, m_aMinOverlap;
const float m_errorRate;
const unsigned int m_bundleSize;
tbb::atomic<unsigned long> m_nPreShortReads, m_overlaps, m_modified;
tbb::concurrent_vector<unsigned long> m_overlapLengths;
std::ostream *m_out;
TAlgorithm m_algo;
public:
SeqAlignPair(const Options &o, const int minOverlap, const float errorRate, const int match, const int mismatch, const int gapCost):
m_minOverlap(minOverlap),
m_aMinOverlap(o.a_min_overlap),
m_errorRate(errorRate),
m_minLength(o.min_readLen),
m_poMode(o.poMode),
m_log(o.logAlign),
m_format(o.format),
m_writeTag(o.useRemovalTag),
m_bundleSize(o.bundleSize),
m_out(o.out),
m_nPreShortReads(0),
m_overlaps(0),
m_modified(0),
m_algo(TAlgorithm(o, match, mismatch, gapCost, true)){
m_overlapLengths = tbb::concurrent_vector<unsigned long>(flexbar::MAX_READLENGTH + 1, 0);
};
void alignSeqReadPair(flexbar::TSeqRead* sr, flexbar::TSeqRead* sr2, flexbar::Alignments &alignments, flexbar::ComputeCycle &cycle, unsigned int &idxAl){
using namespace std;
using namespace flexbar;
TSeqRead &seqRead = *sr;
TSeqRead &seqRead2 = *sr2;
int readLength = length(seqRead.seq);
int readLength2 = length(seqRead2.seq);
if(cycle != PRELOAD){
if(readLength < m_minLength) ++m_nPreShortReads;
if(readLength2 < m_minLength) ++m_nPreShortReads;
}
if(readLength < 1 || readLength2 < 1) return;
if(cycle == PRELOAD){
if(idxAl == 0) reserve(alignments.aset, m_bundleSize);
TSeqStr rcSeq2 = seqRead2.seq;
seqan::reverseComplement(rcSeq2);
TAlign align;
appendValue(alignments.aset, align);
resize(rows(alignments.aset[idxAl]), 2);
assignSource(row(alignments.aset[idxAl], 0), seqRead.seq);
assignSource(row(alignments.aset[idxAl], 1), rcSeq2);
++idxAl;
return;
}
TAlignResults a;
m_algo.alignGlobal(a, alignments, cycle, idxAl++, ANY);
a.overlapLength = a.endPos - a.startPos;
a.allowedErrors = m_errorRate * a.overlapLength;
float madeErrors = static_cast<float>(a.mismatches + a.gapsR + a.gapsA);
stringstream s;
// check if alignment is valid, number of errors and overlap length
if((a.startPosA < a.startPosS || a.endPosA < a.endPosS) && madeErrors <= a.allowedErrors && a.overlapLength >= m_minOverlap){
if(a.startPosA < a.startPosS){
seqRead2.pairOverlap = true;
if(m_poMode == PONLY || (m_poMode == PSHORT && a.startPosS < m_aMinOverlap)){
unsigned int rCutPos = readLength2 - a.startPosS;
erase(seqRead2.seq, rCutPos, readLength2);
if(m_format == FASTQ)
erase(seqRead2.qual, rCutPos, readLength2);
++m_modified;
if(m_writeTag) append(seqRead2.id, "_Flexbar_removal_PO");
}
}
if(a.endPosA < a.endPosS){
seqRead.pairOverlap = true;
if(m_poMode == PONLY || (m_poMode == PSHORT && (a.endPosS - a.endPosA) < m_aMinOverlap)){
unsigned int rCutPos = readLength - (a.endPosS - a.endPosA);
erase(seqRead.seq, rCutPos, readLength);
if(m_format == FASTQ)
erase(seqRead.qual, rCutPos, readLength);
++m_modified;
if(m_writeTag) append(seqRead.id, "_Flexbar_removal_PO");
}
}
++m_overlaps;
// store overlap occurrences
if(a.overlapLength <= MAX_READLENGTH) m_overlapLengths.at(a.overlapLength)++;
else cerr << "\nCompile Flexbar with larger max read length for correct overlap stats.\n" << endl;
// alignment stats
if(m_log == ALL || m_log == MOD){
s << "Sequence removal:\n";
s << " read id " << seqRead.id << "\n"
<< " read pos " << a.startPosS << "-" << a.endPosS << "\n"
<< " read2 id " << seqRead2.id << "\n"
<< " read2 pos " << a.startPosA << "-" << a.endPosA << "\n"
<< " score " << a.score << "\n"
<< " overlap " << a.overlapLength << "\n"
<< " errors " << a.gapsR + a.gapsA + a.mismatches << "\n"
<< " error threshold " << a.allowedErrors << "\n"
<< " remaining read " << seqRead.seq << "\n";
if(m_format == FASTQ)
s << " remaining qual " << seqRead.qual << "\n";
s << " remaining read2 " << seqRead2.seq << "\n";
if(m_format == FASTQ)
s << " remaining qual2 " << seqRead2.qual << "\n";
s << "\n Alignment:\n" << endl << a.alString;
}
else if(m_log == TAB){
s << seqRead.id << "\t" << seqRead2.id << "\t"
<< a.startPosA << "\t" << a.endPosA << "\t" << a.overlapLength << "\t"
<< a.mismatches << "\t" << a.gapsR + a.gapsA << "\t" << a.allowedErrors << endl;
}
}
else if(m_log == ALL){
s << "Unvalid alignment:" << "\n"
<< "read id " << seqRead.id << "\n"
<< "read2 id " << seqRead2.id << "\n\n" << endl;
}
*m_out << s.str();
return;
}
std::string getOverlapStatsString(){
using namespace std;
using namespace flexbar;
unsigned long nValues = 0, halfValues = 0, cumValues = 0, lenSum = 0;
unsigned int max = 0, median = 0, mean = 0;
unsigned int min = numeric_limits<unsigned int>::max();
for(unsigned int i = 0; i <= MAX_READLENGTH; ++i){
unsigned long lenCount = m_overlapLengths.at(i);
if(lenCount > 0 && i < min) min = i;
if(lenCount > 0 && i > max) max = i;
nValues += lenCount;
lenSum += lenCount * i;
}
halfValues = nValues / 2;
for(unsigned int i = 0; i <= MAX_READLENGTH; ++i){
cumValues += m_overlapLengths.at(i);
if(cumValues >= halfValues){
median = i;
break;
}
}
if(m_overlaps > 0) mean = lenSum / m_overlaps;
stringstream s;
if(m_modified > 0){
s << "Number of trimmed reads based on pair overlap: ";
s << m_modified << "\n";
}
s << "Min, max, mean and median overlap of paired reads: ";
s << min << " / " << max << " / " << mean << " / " << median;
return s.str();
}
unsigned long getNrPreShortReads() const {
return m_nPreShortReads;
}
unsigned long getNrOverlappingReads() const {
return m_overlaps;
}
};
#endif