Skip to content
Commits on Source (6)
......@@ -6,8 +6,8 @@ before_install:
- sudo apt-get update -qq
install:
- sudo apt-get install -qq g++-4.7
- export CXX="g++-4.7"
- sudo apt-get install -qq g++-4.9
- export CXX="g++-4.9"
git:
submodules: false
......
STAR 2.6.1c 2018/11/16
STAR 2.7.0a 2019/01/23
======================
* This release introduces STARsolo for: mapping, demultiplexing and gene quantification for single cell RNA-seq.
* Multiple solo\* options control STARsolo algorithm. See the RELEASEnotes and the manualfor more information.
* This release is compiled with gcc-5.3.0, and requires at least gcc-4.9.4
STAR 2.6.1d 2018/11/16
======================
* Fixed the problem causing BAM sorting error with large number of threads and small ulimit -n (github.com/alexdobin/STAR/issues/512).
......
STAR 2.6
STAR 2.7
========
Spliced Transcripts Alignment to a Reference
© Alexander Dobin, 2009-2018
© Alexander Dobin, 2009-2019
https://www.ncbi.nlm.nih.gov/pubmed/23104886
AUTHOR/SUPPORT
......@@ -35,9 +35,9 @@ Download the latest [release from](https://github.com/alexdobin/STAR/releases) a
```bash
# Get latest STAR source from releases
wget https://github.com/alexdobin/STAR/archive/2.6.0a.tar.gz
tar -xzf 2.6.0a.tar.gz
cd STAR-2.6.0a
wget https://github.com/alexdobin/STAR/archive/2.7.0a.tar.gz
tar -xzf 2.7.0a.tar.gz
cd STAR-2.7.0a
# Alternatively, get STAR source using git
git clone https://github.com/alexdobin/STAR.git
......@@ -56,18 +56,19 @@ Compile under Mac OS X
----------------------
```bash
# Compile
cd source
make STARforMacStatic
# 1. Install brew (http://brew.sh/)
# 2. Install gcc with brew:
$ brew install gcc --without-multilib
# 3. Build STAR:
# note that the path to c++ executable has to be adjusted to its current version
$make STARforMacStatic CXX=/usr/local/Cellar/gcc/8.2.0/bin/g++-8
```
All platforms
-------------
All platforms - non-standard gcc
--------------------------------
If g++ compiler (true g++, not Clang sym-link) is not on the path, you will need to tell `make` where to find it:
```bash
# Build STAR
cd source
make STARforMacStatic CXX=/path/to/gcc
```
......
STAR 2.7.0a 2019/01/23
======================
STARsolo: mapping, demultiplexing and gene quantification for single cell RNA-seq
---------------------------------------------------------------------------------
STARsolo is a turnkey solution for analyzing droplet single cell RNA sequencing data (e.g. 10X Genomics Chromium System) built directly into STAR code.
STARsolo inputs the raw FASTQ reads files, and performs the following operations
(i) error correction and demultiplexing of cell barcodes using user-input whitelist
(ii) mapping the reads to the reference genome using the standard STAR spliced read alignment algorithm
(ii) error correction and collapsing (deduplication) of Unique Molecular Identifiers (UMIa)
(iv) quantification of per-cell gene expression by counting the number of reads per gene
STARsolo output is designed to be a drop-in replacement for 10X CellRanger gene quantification output.
It follows CellRanger logic for cell barcode whitelisting and UMI deduplication, and produces nearly identical gene counts in the same format.
The STAR solo algorithm is turned on with:
```
--soloType Droplet
```
Presently, the cell barcode whitelist has to be provided with:
```
--soloCBwhitelist /path/to/cell/barcode/whitelist
```
The 10X Chromium whitelist file can be found inside the CellRanger distribution,
e.g. [10X-whitelist](https://kb.10xgenomics.com/hc/en-us/articles/115004506263-What-is-a-barcode-whitelist-).
Please make sure that the whitelist is compatible with the specific version of the 10X chemistry (V1,V2,V3 etc).
Importantly, in the --readFilesIn option, the 1st file has to be cDNA read, and the 2nd file has to be the barcode (cell+UMI) read, i.e.
```
--readFilesIn cDNAfragmentSequence.fastq.gz CellBarcodeUMIsequence.fastq.gz
```
Other parameters that control STARsolo output are listed below. Note that default parameters are compatible with 10X Chromium V2 protocol.
```
soloCBstart 1
int>0: cell barcode start base
soloCBlen 16
int>0: cell barcode length
soloUMIstart 17
int>0: UMI start base
soloUMIlen 10
int>0: UMI length
soloStrand Forward
string: strandedness of the solo libraries:
Unstranded ... no strand information
Forward ... read strand same as the original RNA molecule
Reverse ... read strand opposite to the original RNA molecule
soloFeatures Gene
string(s) genomic features for which the UMI counts per Cell Barcode are collected
Gene ... genes: reads match the gene transcript
SJ ... splice junctions: reported in SJ.out.tab
soloUMIdedup 1MM_All
string(s) type of UMI deduplication (collapsing) algorithm
1MM_All ... all UMIs with 1 mismatch distance to each other are collapsed (i.e. counted once)
1MM_Directional ... follows the "directional" method from the UMI-tools by Smith, Heger and Sudbery (Genome Research 2017).
1MM_NotCollapsed ... UMIs with 1 mismatch distance to others are not collapsed (i.e. all counted)
soloOutFileNames Solo.out/ genes.tsv barcodes.tsv matrix.mtx matrixSJ.mtx
string(s) file names for STARsolo output
1st word ... file name prefix
2nd word ... barcode sequences
3rd word ... gene IDs and names
4th word ... cell/gene counts matrix
5th word ... cell/splice junction counts matrix
```
STAR 2.6.0a 2018/04/23
======================
......
rna-star (2.7.0a+dfsg-1) unstable; urgency=medium
* New upstream release.
This release introduces STARsolo for: mapping, demultiplexing and
gene quantification for single cell RNA-seq.
* Removed xxd-generated source/parametersDefault.xdd from source tree
* Added xxd to build-dependencies
* Standards-Version: 4.3.0 - no changes required
-- Steffen Moeller <moeller@debian.org> Tue, 29 Jan 2019 22:58:57 +0100
rna-star (2.6.1d+dfsg-1) unstable; urgency=medium
* New upstream release.
......
......@@ -8,8 +8,9 @@ Priority: optional
Build-Depends: debhelper (>= 11),
libhts-dev,
vim-common,
xxd,
zlib1g-dev
Standards-Version: 4.2.1
Standards-Version: 4.3.0
Vcs-Browser: https://salsa.debian.org/med-team/rna-star
Vcs-Git: https://salsa.debian.org/med-team/rna-star.git
Homepage: https://github.com/alexdobin/STAR/
......
......@@ -4,10 +4,11 @@ Upstream-Contact: Alexander Dobin <dobin@cshl.edu>
Source: https://github.com/alexdobin/STAR/releases
Files-Excluded: bin/*
source/htslib
source/parametersDefault.xxd
.gitignore
Files: *
Copyright: 2009-2018 Alexander Dobin <dobin@cshl.edu>
Copyright: 2009-2019 Alexander Dobin <dobin@cshl.edu>
License: GPL-3+
Files: source/bam_cat.c
......@@ -35,8 +36,9 @@ License: MIT
THE SOFTWARE.
Files: debian/*
Copyright: 2015 Steffen Moeller <moeller@debian.org>
2015 Andreas Tille <tille@debian.org>
Copyright: 2015-2019 Steffen Moeller <moeller@debian.org>
2016-2019 Sascha Steinbiss <sascha@debian.org>
2015-2019 Andreas Tille <tille@debian.org>
License: GPL-3+
Files: debian/tests/*
......
......@@ -2,33 +2,30 @@ Author: Steffen Moeller <moeller@debian.org>,
Last-Changed: Thu, 29 Jan 2015 14:18:44 +0100
Description: Use Debian packaged htslib
--- a/source/Makefile
+++ b/source/Makefile
@@ -12,17 +12,17 @@
Index: rna-star/source/Makefile
===================================================================
--- rna-star.orig/source/Makefile
+++ rna-star/source/Makefile
@@ -12,15 +12,15 @@ CXXFLAGSextra ?=
CXX ?= g++
# pre-defined flags
-LDFLAGS_shared := -pthread -Lhtslib -Bstatic -lhts -Bdynamic -lz
-LDFLAGS_static := -static -static-libgcc -pthread -Lhtslib -lhts -lz
-LDFLAGS_static := -static -static-libgcc -pthread -Lhtslib -lhts -lz
+LDFLAGS_shared := -pthread -Bstatic -lhts -Bdynamic $(LDFLAGS_add)
+LDFLAGS_static := -static -static-libgcc -pthread -lhts -lz
LDFLAGS_Mac :=-pthread -lz htslib/libhts.a
LDFLAGS_Mac_static :=-pthread -lz -static-libgcc htslib/libhts.a
-LDFLAGS_gdb := $(LDFLAGS_shared)
+LDFLAGS_gdb += $(LDFLAGS_shared)
LDFLAGS_gdb := $(LDFLAGS_shared)
COMPTIMEPLACE := -D'COMPILATION_TIME_PLACE="$(shell echo `date` $(HOSTNAME):`pwd`)"'
-CXXFLAGS_common := -pipe -std=c++11 -Wall -Wextra -fopenmp $(COMPTIMEPLACE)
-CXXFLAGS_main := -O3 $(CXXFLAGS_common)
-CXXFLAGS_gdb := -O0 -g $(CXXFLAGS_common)
+CXXFLAGS_common := -pipe -std=c++11 -Wall -Wextra -fopenmp $(COMPTIMEPLACE) $(CCFLAGS_common_add)
+CXXFLAGS_main += -O3 $(CXXFLAGS_common)
+CXXFLAGS_gdb += -O0 -g $(CXXFLAGS_common)
CXXFLAGS_main := -O3 $(CXXFLAGS_common)
CXXFLAGS_gdb := -O0 -g $(CXXFLAGS_common)
CFLAGS := -O3 -pipe -Wall -Wextra $(CFLAGS)
@@ -60,10 +60,10 @@
@@ -63,10 +63,10 @@ SOURCES := $(wildcard *.cpp) $(wildcard
%.o : %.cpp
......@@ -41,7 +38,7 @@ Description: Use Debian packaged htslib
all: STAR
@@ -74,12 +74,10 @@
@@ -77,12 +77,10 @@ clean:
.PHONY: CLEAN
CLEAN:
rm -f *.o STAR Depend.list
......@@ -54,7 +51,7 @@ Description: Use Debian packaged htslib
.PHONY: install
install:
@@ -90,7 +88,7 @@
@@ -93,7 +91,7 @@ ifneq ($(MAKECMDGOALS),cleanRelease)
ifneq ($(MAKECMDGOALS),CLEAN)
ifneq ($(MAKECMDGOALS),STARforMac)
ifneq ($(MAKECMDGOALS),STARforMacGDB)
......@@ -63,7 +60,7 @@ Description: Use Debian packaged htslib
echo $(SOURCES)
'rm' -f ./Depend.list
$(CXX) $(CXXFLAGS_common) -MM $^ >> Depend.list
@@ -101,11 +99,6 @@
@@ -104,11 +102,6 @@ endif
endif
endif
......@@ -75,8 +72,10 @@ Description: Use Debian packaged htslib
parametersDefault.xxd: parametersDefault
xxd -i parametersDefault > parametersDefault.xxd
--- a/source/bamRemoveDuplicates.cpp
+++ b/source/bamRemoveDuplicates.cpp
Index: rna-star/source/bamRemoveDuplicates.cpp
===================================================================
--- rna-star.orig/source/bamRemoveDuplicates.cpp
+++ rna-star/source/bamRemoveDuplicates.cpp
@@ -1,7 +1,7 @@
#include <unordered_map>
#include "bamRemoveDuplicates.h"
......@@ -86,9 +85,11 @@ Description: Use Debian packaged htslib
#include "IncludeDefine.h"
#include SAMTOOLS_BGZF_H
#include "ErrorWarning.h"
--- a/source/bam_cat.c
+++ b/source/bam_cat.c
@@ -52,8 +52,8 @@
Index: rna-star/source/bam_cat.c
===================================================================
--- rna-star.orig/source/bam_cat.c
+++ rna-star/source/bam_cat.c
@@ -52,8 +52,8 @@ THE SOFTWARE.
#include <stdlib.h>
#include <unistd.h>
......@@ -99,8 +100,10 @@ Description: Use Debian packaged htslib
#include <cstring>
#define BUF_SIZE 0x10000
--- a/source/signalFromBAM.h
+++ b/source/signalFromBAM.h
Index: rna-star/source/signalFromBAM.h
===================================================================
--- rna-star.orig/source/signalFromBAM.h
+++ rna-star/source/signalFromBAM.h
@@ -1,6 +1,6 @@
#ifndef CODE_signalFromBAM
#define CODE_signalFromBAM
......@@ -109,8 +112,10 @@ Description: Use Debian packaged htslib
#include <fstream>
#include <string>
#include "Stats.h"
--- a/source/IncludeDefine.h
+++ b/source/IncludeDefine.h
Index: rna-star/source/IncludeDefine.h
===================================================================
--- rna-star.orig/source/IncludeDefine.h
+++ rna-star/source/IncludeDefine.h
@@ -28,8 +28,8 @@
#define ERROR_OUT string ( __FILE__ ) +":"+ to_string ( (uint) __LINE__ ) +":"+ string ( __FUNCTION__ )
......@@ -122,8 +127,10 @@ Description: Use Debian packaged htslib
using namespace std;
--- a/source/BAMfunctions.cpp
+++ b/source/BAMfunctions.cpp
Index: rna-star/source/BAMfunctions.cpp
===================================================================
--- rna-star.orig/source/BAMfunctions.cpp
+++ rna-star/source/BAMfunctions.cpp
@@ -1,5 +1,5 @@
#include "BAMfunctions.h"
-#include "htslib/htslib/kstring.h"
......@@ -131,19 +138,23 @@ Description: Use Debian packaged htslib
string bam_cigarString (bam1_t *b) {//output CIGAR string
// kstring_t strK;
--- a/source/STAR.cpp
+++ b/source/STAR.cpp
@@ -28,7 +28,7 @@
#include "Variation.h"
Index: rna-star/source/STAR.cpp
===================================================================
--- rna-star.orig/source/STAR.cpp
+++ rna-star/source/STAR.cpp
@@ -30,7 +30,7 @@
#include "bam_cat.h"
-#include "htslib/htslib/sam.h"
+#include <htslib/sam.h>
#include "parametersDefault.xxd"
--- a/source/bam_cat.h
+++ b/source/bam_cat.h
void usage(int usageType) {
Index: rna-star/source/bam_cat.h
===================================================================
--- rna-star.orig/source/bam_cat.h
+++ rna-star/source/bam_cat.h
@@ -1,7 +1,7 @@
#ifndef CODE_bam_cat
#define CODE_bam_cat
......
Description: make build reproducible
Stops build hostname and directory from being recorded in the artifact.
Author: Sascha Steinbiss <sascha@steinbiss.name>
--- a/source/Makefile
+++ b/source/Makefile
@@ -18,7 +18,7 @@
Index: rna-star/source/Makefile
===================================================================
--- rna-star.orig/source/Makefile
+++ rna-star/source/Makefile
@@ -18,7 +18,7 @@ LDFLAGS_Mac :=-pthread -lz htslib/libhts
LDFLAGS_Mac_static :=-pthread -lz -static-libgcc htslib/libhts.a
LDFLAGS_gdb += $(LDFLAGS_shared)
LDFLAGS_gdb := $(LDFLAGS_shared)
-COMPTIMEPLACE := -D'COMPILATION_TIME_PLACE="$(shell echo `date` $(HOSTNAME):`pwd`)"'
+COMPTIMEPLACE := -D'COMPILATION_TIME_PLACE="<not set in Debian>"'
CXXFLAGS_common := -pipe -std=c++11 -Wall -Wextra -fopenmp $(COMPTIMEPLACE) $(CCFLAGS_common_add)
CXXFLAGS_main += -O3 $(CXXFLAGS_common)
CXXFLAGS_main := -O3 $(CXXFLAGS_common)
No preview for this file type
......@@ -34,7 +34,7 @@
\newcommand{\sechyperref}[1]{\hyperref[#1]{Section \ref{#1}. \nameref{#1}}}
\title{STAR manual 2.6.1a}
\title{STAR manual 2.7.0a}
\author{Alexander Dobin\\
dobin@cshl.edu}
\maketitle
......@@ -131,13 +131,11 @@ Examples of acceptable genome sequence files:
\begin{itemize}
\item \textbf{ENSEMBL:} files marked with .dna.primary.assembly, such as:
\url{ftp://ftp.ensembl.org/pub/release-77/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz}
\item \textbf{NCBI:} "no alternative - analysis set": \url{ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz}
\item \textbf{GENCODE:} files marked with PRI (primary). Strongly recommended for mouse and human: \url{http://www.gencodegenes.org/}.
\end{itemize}
\subsubsection{Which annotations to use?}
The use of the most comprehensive annotations for a given species is strongly recommended. Very importantly, chromosome names in the annotations GTF file have to match chromosome names in the FASTA genome sequence files. For example, one can use ENSEMBL FASTA files with ENSEMBL GTF files, and UCSC FASTA files with UCSC FASTA files. However, since UCSC uses \code{chr1, chr2, ...} naming convention, and ENSEMBL uses \code{1, 2, ...} naming, the ENSEMBL and UCSC FASTA and GTF files cannot be mixed together, unless chromosomes are renamed to match between the FASTA anf GTF files.
For mouse and human, the Gencode annotations are recommended: \url{http://www.gencodegenes.org/}.
\subsubsection{Annotations in GFF format.}
In addition to the aforementioned options, for GFF3 formatted annotations you need to use \opt{sjdbGTFtagExonParentTranscript} \optv{Parent}. In general, for \opt{sjdbGTFfile} files STAR only processes lines which have \opt{sjdbGTFfeatureExon} (=\optv{exon} by default) in the 3rd field (column). The exons are assigned to the transcripts using parent-child relationship defined by the \opt{sjdbGTFtagExonParentTranscript} (=\optv{transcript\_id} by default) GTF/GFF attribute.
......@@ -492,6 +490,38 @@ This algorithm is activated with $>0$ value in \optv{chimMultimapNmax}, which de
The \optv{chimMultimapScoreRange} ($=1$ by default) parameter defines the score range for multi-mapping chimeras below the best chimeric score, similar to the \optv{outFilterMultimapScoreRange} parameter for normal alignments.
The \optv{chimNonchimScoreDropMin} ($=20$ by default) defines the threshold triggering chimeric detection: the drop in the best non-chimeric alignment score with respect to the read length has to be smaller than this value.
\section{STARsolo: mapping, demultiplexing and gene quantification for single cell RNA-seq}
STARsolo is a turnkey solution for analyzing droplet single cell RNA sequencing data (e.g. 10X Genomics Chromium System) built directly into STAR code.
STARsolo inputs the raw FASTQ reads files, and performs the following operations:
\begin{itemize}
\itemsep -0.5em
\item
error correction and demultiplexing of cell barcodes using user-input whitelist
\item
mapping the reads to the reference genome using the standard STAR spliced read alignment algorithm
\item
error correction and collapsing (deduplication) of Unique Molecular Identifiers (UMIa)
\item
quantification of per-cell gene expression by counting the number of reads per gene
\end{itemize}
STARsolo output is designed to be a drop-in replacement for 10X CellRanger gene quantification output.
It follows CellRanger logic for cell barcode whitelisting and UMI deduplication, and produces nearly identical gene counts in the same format. At the same time STARsolo is ~10 times faster than the CellRanger.
The STAR solo algorithm is turned on with: \opt{soloType} \optv{Droplet}.
Presently, the cell barcode whitelist has to be provided with:
\opt{soloCBwhitelist} \optvr{/path/to/cell/barcode/whitelist}
The 10X Chromium whitelist file can be found inside the CellRanger distribution, e.g. \url{https://kb.10xgenomics.com/hc/en-us/articles/115004506263-What-is-a-barcode-whitelist-}. Please make sure that the whitelist is compatible with the specific version of the 10X chemistry (V1,V2,V3 etc).
Importantly, in the --readFilesIn option, the 1st FASTQ file has to be cDNA read, and the 2nd FASTQ file has to be the barcode (cell+UMI) read, i.e.
\opt{readFilesIn} \optvr{cDNAfragmentSequence.fastq.gz CellBarcodeUMIsequence.fastq.gz}.
Other solo* options can be found in the Section \ref{STARsolo_(single_cell_RNA-seq)_parameters}.
\section{Description of all options.}\label{Description_of_all_options}
For each STAR version, the most up-to-date information about all STAR parameters can be found in the \code{parametersDefault} file in the STAR source directory. The parameters in the \code{parametersDefault}, as well as in the descriptions below, are grouped by function:
\begin{itemize}
......
......@@ -153,6 +153,14 @@
\optName{readNameSeparator}
\optValue{/}
\optLine{string(s): character(s) separating the part of the read names that will be trimmed in output (read name after space is always trimmed)}
\optName{readStrand}
\optValue{Unstranded}
\optLine{string: library strandedness type}
\begin{optOptTable}
\optOpt{Unstranded} \optOptLine{unstranded library}
\optOpt{Forward} \optOptLine{1st read strand same as RNA (i.e. 2nd cDNA synthesis strand)}
\optOpt{Reverse} \optOptLine{1st read opposite to RNA (i.e. 1st cDNA synthesis strand)}
\end{optOptTable}
\optName{clip3pNbases}
\optValue{0}
\optLine{int(s): number(s) of bases to clip from 3p of each mate. If one value is given, it will be assumed the same for both mates.}
......@@ -192,6 +200,9 @@
\optName{limitSjdbInsertNsj}
\optValue{1000000}
\optLine{int{\textgreater}=0: maximum number of junction to be inserted to the genome on the fly at the mapping stage, including those from annotations and those detected in the 1st step of the 2-pass run}
\optName{limitNreadsSoft}
\optValue{-1}
\optLine{int: soft limit on the number of reads}
\end{optTable}
\optSection{Output: general}\label{Output:_general}
\begin{optTable}
......@@ -278,7 +289,8 @@
\optOpt{All} \optOptLine{NH HI AS nM NM MD jM jI MC ch}
\optOpt{vA} \optOptLine{variant allele}
\optOpt{vG} \optOptLine{genomic coordiante of the variant overlapped by the read}
\optOpt{vW} \optOptLine{0/1 - alignment does not pass / passes WASP filtering. Requires --waspOutputMode SAMtag .}
\optOpt{vW} \optOptLine{0/1 - alignment does not pass / passes WASP filtering. Requires --waspOutputMode SAMtag}
\optOpt{CR,CY,UR,UY} \optOptLine{sequences and quality scores of cell barcodes and UMIs for the solo* demultiplexing}
\end{optOptTable}
\optLine{Unsupported/undocumented:}
\begin{optOptTable}
......@@ -715,7 +727,7 @@
\optLine{int{\textgreater}=0: the score range for multi-mapping chimeras below the best chimeric score. Only works with --chimMultimapNmax {\textgreater} 1}
\optName{chimNonchimScoreDropMin}
\optValue{20}
\optLine{int{\textgreater}=0: to trigger chimeric detection, the drop in the best non-chimeric alignment score with respect to the read length has to be smaller than this value}
\optLine{int{\textgreater}=0: to trigger chimeric detection, the drop in the best non-chimeric alignment score with respect to the read length has to be smaller than this value}
\optName{chimOutJunctionFormat}
\optValue{0}
\optLine{int: formatting type for the Chimeric.out.junction file}
......@@ -736,7 +748,13 @@
\end{optOptTable}
\optName{quantTranscriptomeBAMcompression}
\optValue{1 1}
\optLine{int: -1 to 10 transcriptome BAM compression level, -1=default compression (6?), 0=no compression, 10=maximum compression}
\optLine{int: -2 to 10 transcriptome BAM compression level}
\begin{optOptTable}
\optOpt{-2} \optOptLine{no BAM output}
\optOpt{-1} \optOptLine{default compression (6?)}
\optOpt{0} \optOptLine{no compression}
\optOpt{10} \optOptLine{maximum compression}
\end{optOptTable}
\optName{quantTranscriptomeBan}
\optValue{IndelSoftclipSingleend}
\optLine{string: prohibit various alignment type}
......@@ -767,3 +785,60 @@
\optOpt{SAMtag} \optOptLine{add WASP tags to the alignments that pass WASP filtering}
\end{optOptTable}
\end{optTable}
\optSection{STARsolo (single cell RNA-seq) parameters}\label{STARsolo_(single_cell_RNA-seq)_parameters}
\begin{optTable}
\optName{soloType}
\optValue{None}
\optLine{string(s): type of single-cell RNA-seq}
\begin{optOptTable}
\optOpt{Droplet} \optOptLine{one cell barcode and one UMI barcode in read2, e.g. Drop-seq and 10X Chromium}
\end{optOptTable}
\optName{soloCBwhitelist}
\optValue{-}
\optLine{string: file with whitelist of cell barcodes}
\optName{soloCBstart}
\optValue{1}
\optLine{int{\textgreater}0: cell barcode start base}
\optName{soloCBlen}
\optValue{16}
\optLine{int{\textgreater}0: cell barcode length}
\optName{soloUMIstart}
\optValue{17}
\optLine{int{\textgreater}0: UMI start base}
\optName{soloUMIlen}
\optValue{10}
\optLine{int{\textgreater}0: UMI length}
\optName{soloStrand}
\optValue{Forward}
\optLine{string: strandedness of the solo libraries:}
\begin{optOptTable}
\optOpt{Unstranded} \optOptLine{no strand information}
\optOpt{Forward} \optOptLine{read strand same as the original RNA molecule}
\optOpt{Reverse} \optOptLine{read strand opposite to the original RNA molecule}
\end{optOptTable}
\optName{soloFeatures}
\optValue{Gene}
\optLine{string(s): genomic features for which the UMI counts per Cell Barcode are collected}
\begin{optOptTable}
\optOpt{Gene} \optOptLine{genes: reads match the gene transcript}
\optOpt{SJ} \optOptLine{splice junctions: reported in SJ.out.tab}
\end{optOptTable}
\optName{soloUMIdedup}
\optValue{1MM{\textunderscore}All}
\optLine{string(s): type of UMI deduplication (collapsing) algorithm}
\begin{optOptTable}
\optOpt{1MM{\textunderscore}All} \optOptLine{all UMIs with 1 mismatch distance to each other are collapsed (i.e. counted once)}
\optOpt{1MM{\textunderscore}Directional} \optOptLine{follows the "directional" method from the UMI-tools by Smith, Heger and Sudbery (Genome Research 2017).}
\optOpt{1MM{\textunderscore}NotCollapsed} \optOptLine{UMIs with 1 mismatch distance to others are not collapsed (i.e. all counted)}
\end{optOptTable}
\optName{soloOutFileNames}
\optValue{Solo.out/ genes.tsv barcodes.tsv matrix.mtx matrixSJ.mtx}
\optLine{string(s) file names for STARsolo output}
\begin{optOptTable}
\optOpt{1st word} \optOptLine{file name prefix}
\optOpt{2nd word} \optOptLine{barcode sequences}
\optOpt{3rd word} \optOptLine{gene IDs and names}
\optOpt{4th word} \optOptLine{cell/gene counts matrix}
\optOpt{5th word} \optOptLine{cell/splice junction counts matrix}
\end{optOptTable}
\end{optTable}
FROM debian:stretch-slim
MAINTAINER dobin@cshl.edu
ENV STAR_VERSION 2.6.1d
ENV PACKAGES gcc g++ make wget zlib1g-dev
RUN set -ex
RUN apt-get update && \
apt-get install -y --no-install-recommends ${PACKAGES} && \
apt-get clean && \
cd /home && \
wget --no-check-certificate https://github.com/alexdobin/STAR/archive/${STAR_VERSION}.tar.gz && \
tar xzf ${STAR_VERSION}.tar.gz && \
cd STAR-${STAR_VERSION}/source && \
make STARstatic && \
mkdir /home/bin && \
cp STAR /home/bin && \
cd /home && \
'rm' -rf STAR-${STAR_VERSION} && \
apt-get --purge autoremove -y ${PACKAGES}
ENV PATH /home/bin:${PATH}
......@@ -245,101 +245,75 @@ void Genome::genomeLoad(){//allocate and load Genome
shmSize=SA.lengthByte + nGenome+L+L+SHM_startG+8;
shmSize+= SAi.lengthByte;
if (P.annotScoreScale>0) shmSize+=nGenome;
if ((pGe.gLoad=="LoadAndKeep" ||
pGe.gLoad=="LoadAndRemove" ||
pGe.gLoad=="LoadAndExit" ||
pGe.gLoad=="Remove") && sharedMemory == NULL)
{
pGe.gLoad=="Remove") && sharedMemory == NULL) {
bool unloadLast = pGe.gLoad=="LoadAndRemove";
try
{
try {
sharedMemory = new SharedMemory(shmKey, unloadLast);
sharedMemory->SetErrorStream(P.inOut->logStdOut);
if (!sharedMemory->NeedsAllocation())
P.inOut->logMain <<"Found genome in shared memory\n"<<flush;
if (pGe.gLoad=="Remove") {//kill the genome and exit
if (pGe.gLoad=="Remove") {//kill the genome and exit
if (sharedMemory->NeedsAllocation()) {//did not find genome in shared memory, nothing to kill
ostringstream errOut;
errOut << "EXITING: Did not find the genome in memory, did not remove any genomes from shared memory\n";
exitWithError(errOut.str(),std::cerr, P.inOut->logMain, EXIT_CODE_GENOME_FILES, P);
} else {
sharedMemory->Clean();
P.inOut->logMain <<"DONE: removed the genome from shared memory\n"<<flush;
return;
};
}
ostringstream errOut;
errOut << "EXITING: Did not find the genome in memory, did not remove any genomes from shared memory\n";
exitWithError(errOut.str(),std::cerr, P.inOut->logMain, EXIT_CODE_GENOME_FILES, P);
} else {
sharedMemory->Clean();
P.inOut->logMain <<"DONE: removed the genome from shared memory\n"<<flush;
return;
};
};
if (sharedMemory->NeedsAllocation()){
P.inOut->logMain <<"Allocating shared memory for genome\n"<<flush;
sharedMemory->Allocate(shmSize);
}
}
catch (const SharedMemoryException & exc)
{
};
} catch (const SharedMemoryException & exc){
HandleSharedMemoryException(exc, shmSize);
}
};
shmStart = (char*) sharedMemory->GetMapped();
shmNG= (uint*) (shmStart+SHM_sizeG);
shmNSA= (uint*) (shmStart+SHM_sizeSA);
if (!sharedMemory->IsAllocator())
{
if (!sharedMemory->IsAllocator()) {
// genome is in shared memory or being loaded
// wait for the process that will populate it
// and record the sizes
uint iwait=0;
uint iwait=0;
while (*shmNG != nGenome) {
iwait++;
P.inOut->logMain <<"Another job is still loading the genome, sleeping for 1 min\n" <<flush;
sleep(60);
if (iwait==100) {
ostringstream errOut;
errOut << "EXITING because of FATAL ERROR: waited too long for the other job to finish loading the genome" << strerror(errno) << "\n" <<flush;
iwait++;
P.inOut->logMain <<"Another job is still loading the genome, sleeping for 1 min\n" <<flush;
sleep(60);
if (iwait==100) {
ostringstream errOut;
errOut << "EXITING because of FATAL ERROR: waited too long for the other job to finish loading the genome" << strerror(errno) << "\n" <<flush;
errOut << "SOLUTION: remove the shared memory chunk by running STAR with --genomeLoad Remove, and restart STAR" <<flush;
exitWithError(errOut.str(),std::cerr, P.inOut->logMain, EXIT_CODE_GENOME_LOADING_WAITED_TOO_LONG, P);
};
};
};
if (nSAbyte!=*shmNSA)
{
if (nSAbyte!=*shmNSA){
ostringstream errOut;
errOut << "EXITING because of FATAL ERROR: the SA file size did not match what we found in shared memory" << "\n" << flush;
errOut << "SOLUTION: remove the shared memory chunk by running STAR with --genomeLoad Remove, and restart STAR" << flush;
exitWithError(errOut.str(),std::cerr, P.inOut->logMain, EXIT_CODE_INCONSISTENT_DATA, P);
}
};
P.inOut->logMain << "Using shared memory for genome. key=0x" <<hex<<shmKey<<dec<< "; shmid="<< sharedMemory->GetId() <<endl<<flush;
}
};
G1=shmStart+SHM_startG;
SA.pointArray(G1+nGenome+L+L);
char* shmNext=SA.charArray+nSAbyte;
SAi.pointArray(shmNext);
shmNext += SAi.lengthByte;
// if (twoPass.pass1readsN==0) {//not 2-pass
// shmStartG=SHM_startSHM;
// shmStartSA=0;
// } else {//2-pass
// ostringstream errOut;
// errOut << "EXITING because of FATAL ERROR: 2-pass procedure cannot be used with genome already loaded im memory' "\n" ;
// errOut << "SOLUTION: check shared memory settigns as explained in STAR manual, OR run STAR with --genomeLoad NoSharedMemory to avoid using shared memory\n" <<flush;
// exitWithError(errOut.str(),std::cerr, P.inOut->logMain, EXIT_CODE_SHM, P);
// };
if (P.annotScoreScale>0) {//optional allocation
shmNext += nGenome;
}
}
else if (pGe.gLoad=="NoSharedMemory") // simply allocate memory, do not use shared memory
{
} else if (pGe.gLoad=="NoSharedMemory") {// simply allocate memory, do not use shared memory
genomeInsertL=0;
genomeInsertChrIndFirst=nChrReal;
if (pGe.gFastaFiles.at(0)!="-")
......
......@@ -245,11 +245,9 @@ void Genome::genomeGenerate() {
GstrandMask = ~(1LLU<<GstrandBit);
SA.defineBits(GstrandBit+1,nSA);
if (P.sjdbInsert.yes)
{//reserve space for junction insertion
if (P.sjdbInsert.yes) {//reserve space for junction insertion
SApass1.defineBits(GstrandBit+1,nSA+2*P.limitSjdbInsertNsj*sjdbLength);//TODO: this allocation is wasteful, get a better estimate of the number of junctions
} else
{//same as SA
} else {//same as SA
SApass1.defineBits(GstrandBit+1,nSA);
};
......
......@@ -91,7 +91,10 @@ typedef uint8_t uint8;
#define ATTR_ch 14
#define ATTR_MC 15
#define ATTR_rB 16
#define ATTR_CR 17
#define ATTR_CY 18
#define ATTR_UR 19
#define ATTR_UY 20
//BAM definitions
#define BAM_CIGAR_MaxSize 10000
......
......@@ -13,7 +13,7 @@ CXX ?= g++
# pre-defined flags
LDFLAGS_shared := -pthread -Lhtslib -Bstatic -lhts -Bdynamic -lz
LDFLAGS_static := -static -static-libgcc -pthread -Lhtslib -lhts -lz
LDFLAGS_static := -static -static-libgcc -pthread -Lhtslib -lhts -lz
LDFLAGS_Mac :=-pthread -lz htslib/libhts.a
LDFLAGS_Mac_static :=-pthread -lz -static-libgcc htslib/libhts.a
LDFLAGS_gdb := $(LDFLAGS_shared)
......@@ -29,32 +29,35 @@ CFLAGS := -O3 -pipe -Wall -Wextra $(CFLAGS)
##########################################################################################################
OBJECTS = SharedMemory.o PackedArray.o SuffixArrayFuns.o STAR.o Parameters.o InOutStreams.o SequenceFuns.o Genome.o Stats.o \
Transcript.o Transcript_alignScore.o Transcript_generateCigarP.o Chain.o \
Transcript_variationAdjust.o Variation.o ReadAlign_waspMap.o \
ReadAlign.o ReadAlign_storeAligns.o ReadAlign_stitchPieces.o ReadAlign_multMapSelect.o ReadAlign_mapOneRead.o readLoad.o \
OBJECTS = ParametersSolo.o SoloRead.o SoloRead_record.o \
SoloReadBarcode.o SoloReadBarcode_getCBandUMI.o \
SoloReadFeature.o SoloReadFeature_record.o SoloReadFeature_inputRecords.o \
Solo.o SoloFeature.o SoloFeature_collapseUMI.o SoloFeature_outputResults.o SoloFeature_processRecords.o\
ReadAlign_outputAlignments.o \
ReadAlign.o STAR.o \
SharedMemory.o PackedArray.o SuffixArrayFuns.o Parameters.o InOutStreams.o SequenceFuns.o Genome.o Stats.o \
Transcript.o Transcript_alignScore.o Transcript_generateCigarP.o Chain.o \
Transcript_variationAdjust.o Variation.o ReadAlign_waspMap.o \
ReadAlign_storeAligns.o ReadAlign_stitchPieces.o ReadAlign_multMapSelect.o ReadAlign_mapOneRead.o readLoad.o \
ReadAlignChunk.o ReadAlignChunk_processChunks.o ReadAlignChunk_mapChunk.o \
OutSJ.o outputSJ.o blocksOverlap.o ThreadControl.o sysRemoveDir.o \
ReadAlign_maxMappableLength2strands.o binarySearch2.o\
ReadAlign_outputAlignments.o \
ReadAlign_maxMappableLength2strands.o binarySearch2.o\
ReadAlign_outputTranscriptSAM.o ReadAlign_outputTranscriptSJ.o ReadAlign_outputTranscriptCIGARp.o ReadAlign_calcCIGAR.cpp \
ReadAlign_createExtendWindowsWithAlign.o ReadAlign_assignAlignToWindow.o ReadAlign_oneRead.o \
ReadAlign_createExtendWindowsWithAlign.o ReadAlign_assignAlignToWindow.o ReadAlign_oneRead.o \
ReadAlign_stitchWindowSeeds.o \
ReadAlign_peOverlapMergeMap.o ReadAlign_mappedFilter.o \
ReadAlign_chimericDetection.o ReadAlign_chimericDetectionOld.o ReadAlign_chimericDetectionOldOutput.o\
ChimericDetection.o ChimericDetection_chimericDetectionMult.o ReadAlign_chimericDetectionPEmerged.o \
ChimericSegment.cpp ChimericAlign.cpp ChimericAlign_chimericJunctionOutput.o ChimericAlign_chimericStitching.o \
stitchWindowAligns.o extendAlign.o stitchAlignToTranscript.o alignSmithWaterman.o \
Genome_genomeGenerate.o genomeParametersWrite.o genomeScanFastaFiles.o genomeSAindex.o \
Genome_insertSequences.o insertSeqSA.o funCompareUintAndSuffixes.o funCompareUintAndSuffixesMemcmp.o \
ReadAlign_peOverlapMergeMap.o ReadAlign_mappedFilter.o \
ReadAlign_chimericDetection.o ReadAlign_chimericDetectionOld.o ReadAlign_chimericDetectionOldOutput.o\
ChimericDetection.o ChimericDetection_chimericDetectionMult.o ReadAlign_chimericDetectionPEmerged.o \
ChimericSegment.cpp ChimericAlign.cpp ChimericAlign_chimericJunctionOutput.o ChimericAlign_chimericStitching.o \
stitchWindowAligns.o extendAlign.o stitchAlignToTranscript.o alignSmithWaterman.o \
Genome_genomeGenerate.o genomeParametersWrite.o genomeScanFastaFiles.o genomeSAindex.o \
Genome_insertSequences.o insertSeqSA.o funCompareUintAndSuffixes.o funCompareUintAndSuffixesMemcmp.o \
TimeFunctions.o ErrorWarning.o loadGTF.o streamFuns.o stringSubstituteAll.o \
Transcriptome.o Transcriptome_quantAlign.o ReadAlign_quantTranscriptome.o Quantifications.o Transcriptome_geneCountsAddAlign.o \
sjdbLoadFromFiles.o sjdbLoadFromStream.o sjdbPrepare.o sjdbBuildIndex.o sjdbInsertJunctions.o mapThreadsSpawn.o \
Parameters_openReadsFiles.cpp Parameters_closeReadsFiles.cpp Parameters_readSAMheader.o \
BAMoutput.o BAMfunctions.o ReadAlign_alignBAM.o BAMbinSortByCoordinate.o signalFromBAM.o bamRemoveDuplicates.o BAMbinSortUnmapped.o \
bam_cat.o \
serviceFuns.o \
GlobalVariables.cpp
Transcriptome.o Transcriptome_quantAlign.o ReadAlign_quantTranscriptome.o Quantifications.o Transcriptome_geneCountsAddAlign.o \
sjdbLoadFromFiles.o sjdbLoadFromStream.o sjdbPrepare.o sjdbBuildIndex.o sjdbInsertJunctions.o mapThreadsSpawn.o \
Parameters_openReadsFiles.cpp Parameters_closeReadsFiles.cpp Parameters_readSAMheader.o \
bam_cat.o serviceFuns.o GlobalVariables.cpp \
BAMoutput.o BAMfunctions.o ReadAlign_alignBAM.o BAMbinSortByCoordinate.o signalFromBAM.o bamRemoveDuplicates.o BAMbinSortUnmapped.o
SOURCES := $(wildcard *.cpp) $(wildcard *.c)
......
......@@ -56,6 +56,7 @@ Parameters::Parameters() {//initalize parameters info
parArray.push_back(new ParameterInfoScalar <string> (-1, -1, "readMatesLengthsIn", &readMatesLengthsIn));
parArray.push_back(new ParameterInfoScalar <uint> (-1, -1, "readMapNumber", &readMapNumber));
parArray.push_back(new ParameterInfoVector <string> (-1, -1, "readNameSeparator", &readNameSeparator));
parArray.push_back(new ParameterInfoScalar <string> (-1, -1, "readStrand", &pReads.strandString));
//input from BAM
......@@ -73,7 +74,7 @@ Parameters::Parameters() {//initalize parameters info
parArray.push_back(new ParameterInfoScalar <uint> (-1, -1, "limitOutSJoneRead", &limitOutSJoneRead));
parArray.push_back(new ParameterInfoScalar <uint> (-1, -1, "limitBAMsortRAM", &limitBAMsortRAM));
parArray.push_back(new ParameterInfoScalar <uint> (-1, -1, "limitSjdbInsertNsj", &limitSjdbInsertNsj));
parArray.push_back(new ParameterInfoScalar <uint> (-1, -1, "limitNreadsSoft", &limitNreadsSoft));
//output
parArray.push_back(new ParameterInfoScalar <string> (-1, 2, "outFileNamePrefix", &outFileNamePrefix));
......@@ -236,7 +237,6 @@ Parameters::Parameters() {//initalize parameters info
//WASP
parArray.push_back(new ParameterInfoScalar <string> (-1, -1, "waspOutputMode", &wasp.outputMode));
//quant
parArray.push_back(new ParameterInfoVector <string> (-1, -1, "quantMode", &quant.mode));
......@@ -248,10 +248,17 @@ Parameters::Parameters() {//initalize parameters info
twoPass.pass1readsN_par=parArray.size()-1;
parArray.push_back(new ParameterInfoScalar <string> (-1, -1, "twopassMode", &twoPass.mode));
// //SW parameters
// parArray.push_back(new ParameterInfoScalar <uint> (-1, -1, "swMode", &swMode));
// parArray.push_back(new ParameterInfoScalar <uint> (-1, -1, "swWinCoverageMinP", &swWinCoverageMinP));
//solo
parArray.push_back(new ParameterInfoScalar <string> (-1, -1, "soloType", &pSolo.typeStr));
parArray.push_back(new ParameterInfoScalar <uint32> (-1, -1, "soloCBstart", &pSolo.cbS));
parArray.push_back(new ParameterInfoScalar <uint32> (-1, -1, "soloUMIstart", &pSolo.umiS));
parArray.push_back(new ParameterInfoScalar <uint32> (-1, -1, "soloCBlen", &pSolo.cbL));
parArray.push_back(new ParameterInfoScalar <uint32> (-1, -1, "soloUMIlen", &pSolo.umiL));
parArray.push_back(new ParameterInfoScalar <string> (-1, -1, "soloCBwhitelist", &pSolo.soloCBwhitelist));
parArray.push_back(new ParameterInfoScalar <string> (-1, -1, "soloStrand", &pSolo.strandStr));
parArray.push_back(new ParameterInfoVector <string> (-1, -1, "soloOutFileNames", &pSolo.outFileNames));
parArray.push_back(new ParameterInfoVector <string> (-1, -1, "soloFeatures", &pSolo.featureIn));
parArray.push_back(new ParameterInfoVector <string> (-1, -1, "soloUMIdedup", &pSolo.umiDedup));
parameterInputName.push_back("Default");
parameterInputName.push_back("Command-Line-Initial");
......@@ -418,12 +425,6 @@ void Parameters::inputParameters (int argInN, char* argIn[]) {//input parameters
///////////////////////////////////////// Old variables
outputBED[0]=0; outputBED[1]=0; //controls the format of starBED output. Will be replaced with HDF5 output
//annot scoring
annotScoreScale=0;
annotSignalFile="-";
//splitting
Qsplit=0;
maxNsplit=10;
......@@ -440,7 +441,6 @@ void Parameters::inputParameters (int argInN, char* argIn[]) {//input parameters
runDirPerm=S_IRWXU;
} else if (runDirPermIn=="All_RWX")
{
// umask(0); //this should be defined by the user!
runDirPerm= S_IRWXU | S_IRWXG | S_IRWXO;
} else
{
......@@ -785,7 +785,8 @@ void Parameters::inputParameters (int argInN, char* argIn[]) {//input parameters
exitWithError(errOut.str(), std::cerr, inOut->logMain, EXIT_CODE_PARAMETER, *this);
};
};
readNmatesIn=readNmates;
//two-pass
if (parArray.at(twoPass.pass1readsN_par)->inputLevel>0 && twoPass.mode=="None")
{
......@@ -950,7 +951,6 @@ void Parameters::inputParameters (int argInN, char* argIn[]) {//input parameters
errOut <<"SOLUTION: re-run STAR with --waspOutputMode ... and --varVCFfile /path/to/file.vcf\n";
exitWithError(errOut.str(), std::cerr, inOut->logMain, EXIT_CODE_PARAMETER, *this);
};
if (wasp.yes && outSAMtype.at(0)!="BAM") {
ostringstream errOut;
......@@ -958,6 +958,52 @@ void Parameters::inputParameters (int argInN, char* argIn[]) {//input parameters
errOut <<"SOLUTION: re-run STAR with --waspOutputMode ... and --outSAMtype BAM ... \n";
exitWithError(errOut.str(), std::cerr, inOut->logMain, EXIT_CODE_PARAMETER, *this);
};
//quantification parameters
quant.yes=false;
quant.geCount.yes=false;
quant.trSAM.yes=false;
quant.trSAM.bamYes=false;
if (quant.mode.at(0) != "-") {
quant.yes=true;
for (uint32 ii=0; ii<quant.mode.size(); ii++) {
if (quant.mode.at(ii)=="TranscriptomeSAM") {
quant.trSAM.yes=true;
if (quant.trSAM.bamCompression>-2)
quant.trSAM.bamYes=true;
if (quant.trSAM.bamYes) {
if (outStd=="BAM_Quant") {
outFileNamePrefix="-";
} else {
outQuantBAMfileName=outFileNamePrefix + "Aligned.toTranscriptome.out.bam";
};
inOut->outQuantBAMfile=bgzf_open(outQuantBAMfileName.c_str(),("w"+to_string((long long) quant.trSAM.bamCompression)).c_str());
};
if (quant.trSAM.ban=="IndelSoftclipSingleend") {
quant.trSAM.indel=false;
quant.trSAM.softClip=false;
quant.trSAM.singleEnd=false;
} else if (quant.trSAM.ban=="Singleend") {
quant.trSAM.indel=true;
quant.trSAM.softClip=true;
quant.trSAM.singleEnd=false;
};
} else if (quant.mode.at(ii)=="GeneCounts") {
quant.geCount.yes=true;
quant.geCount.outFile=outFileNamePrefix + "ReadsPerGene.out.tab";
} else {
ostringstream errOut;
errOut << "EXITING because of fatal INPUT error: unrecognized option in --quant.mode=" << quant.mode.at(ii) << "\n";
errOut << "SOLUTION: use one of the allowed values of --quant.mode : TranscriptomeSAM or - .\n";
exitWithError(errOut.str(),std::cerr, inOut->logMain, EXIT_CODE_PARAMETER, *this);
};
};
};
//solo
pSolo.initialize(this);
//outSAMattributes
outSAMattrPresent.NH=false;//TODO re-write as class with constructor?
......@@ -976,6 +1022,11 @@ void Parameters::inputParameters (int argInN, char* argIn[]) {//input parameters
outSAMattrPresent.vW=false;
outSAMattrPresent.ch=false;
outSAMattrPresent.rB=false;
outSAMattrPresent.CR=false;
outSAMattrPresent.CY=false;
outSAMattrPresent.UR=false;
outSAMattrPresent.UY=false;
//for quant SAM output only NH and HI flags
outSAMattrPresentQuant=outSAMattrPresent;
......@@ -1044,6 +1095,22 @@ void Parameters::inputParameters (int argInN, char* argIn[]) {//input parameters
outSAMattrOrder.push_back(ATTR_MC);
outSAMattrOrderQuant.push_back(ATTR_MC);
outSAMattrPresent.MC=true;
} else if (vAttr1.at(ii)== "CR") {
outSAMattrOrder.push_back(ATTR_CR);
outSAMattrOrderQuant.push_back(ATTR_CR);
outSAMattrPresent.CR=true;
} else if (vAttr1.at(ii)== "CY") {
outSAMattrOrder.push_back(ATTR_CY);
outSAMattrOrderQuant.push_back(ATTR_CY);
outSAMattrPresent.CY=true;
} else if (vAttr1.at(ii)== "UR") {
outSAMattrOrder.push_back(ATTR_UR);
outSAMattrOrderQuant.push_back(ATTR_UR);
outSAMattrPresent.UR=true;
} else if (vAttr1.at(ii)== "UY") {
outSAMattrOrder.push_back(ATTR_UY);
outSAMattrOrderQuant.push_back(ATTR_UY);
outSAMattrPresent.UY=true;
} else if (vAttr1.at(ii)== "XS") {
outSAMattrOrder.push_back(ATTR_XS);
outSAMattrPresent.XS=true;
......@@ -1097,6 +1164,13 @@ void Parameters::inputParameters (int argInN, char* argIn[]) {//input parameters
inOut->logMain << "WARNING --waspOutputMode is set, therefore STAR will output vW attribute" <<endl;
};
if (pSolo.type==0 && (outSAMattrPresent.CR || outSAMattrPresent.CY || outSAMattrPresent.UR || outSAMattrPresent.UY) ) {
ostringstream errOut;
errOut <<"EXITING because of FATAL INPUT ERROR: --outSAMattributes contains CR/CY/UR/UY tags, but --soloType is not set\n";
errOut <<"SOLUTION: re-run STAR without these attribures, or with --soloType set\n";
exitWithError(errOut.str(), std::cerr, inOut->logMain, EXIT_CODE_PARAMETER, *this);
};
//chimeric
pCh.out.bam=false;
pCh.out.junctions=false;
......@@ -1228,44 +1302,6 @@ void Parameters::inputParameters (int argInN, char* argIn[]) {//input parameters
exitWithError(errOut.str(),std::cerr, inOut->logMain, EXIT_CODE_PARAMETER, *this);
};
//quantification parameters
quant.yes=false;
quant.geCount.yes=false;
quant.trSAM.yes=false;
if (quant.mode.at(0) != "-") {
quant.yes=true;
for (uint32 ii=0; ii<quant.mode.size(); ii++) {
if (quant.mode.at(ii)=="TranscriptomeSAM") {
quant.trSAM.yes=true;
if (outStd=="BAM_Quant") {
outFileNamePrefix="-";
} else {
outQuantBAMfileName=outFileNamePrefix + "Aligned.toTranscriptome.out.bam";
};
inOut->outQuantBAMfile=bgzf_open(outQuantBAMfileName.c_str(),("w"+to_string((long long) quant.trSAM.bamCompression)).c_str());
if (quant.trSAM.ban=="IndelSoftclipSingleend")
{
quant.trSAM.indel=false;
quant.trSAM.softClip=false;
quant.trSAM.singleEnd=false;
} else if (quant.trSAM.ban=="Singleend")
{
quant.trSAM.indel=true;
quant.trSAM.softClip=true;
quant.trSAM.singleEnd=false;
};
} else if (quant.mode.at(ii)=="GeneCounts") {
quant.geCount.yes=true;
quant.geCount.outFile=outFileNamePrefix + "ReadsPerGene.out.tab";
} else {
ostringstream errOut;
errOut << "EXITING because of fatal INPUT error: unrecognized option in --quant.mode=" << quant.mode.at(ii) << "\n";
errOut << "SOLUTION: use one of the allowed values of --quant.mode : TranscriptomeSAM or - .\n";
exitWithError(errOut.str(),std::cerr, inOut->logMain, EXIT_CODE_PARAMETER, *this);
};
};
};
//sjdb insert on the fly
sjdbInsert.pass1=false;
......@@ -1397,6 +1433,26 @@ void Parameters::inputParameters (int argInN, char* argIn[]) {//input parameters
peOverlap.yes=false;
};
//read parameters
if (pReads.strandString=="Unstranded") {
pReads.strand=0;
} else if (pReads.strandString=="Forward") {
pReads.strand=1;
} else if (pReads.strandString=="Reverse") {
pReads.strand=2;
} else {
ostringstream errOut;
errOut << "EXITING because of fatal PARAMETERS error: unrecognized option in of --readStrand="<<pReads.strandString<<"\n";
errOut << "SOLUTION: use allowed option: Unstranded or Forward or Reverse";
exitWithError(errOut.str(),std::cerr, inOut->logMain, EXIT_CODE_PARAMETER, *this);
};
//
outSAMreadIDnumber=false;
if (outSAMreadID=="Number") {
outSAMreadIDnumber=true;
};
////////////////////////////////////////////////
inOut->logMain << "Finished loading and checking parameters\n" <<flush;
};
......
......@@ -9,7 +9,10 @@
#include <unistd.h>
#include <signal.h>
#include "ParametersChimeric.h"
#include "ParametersSolo.h"
#include "ParametersGenome.h"
#include <vector>
#include <array>
class Parameters {
......@@ -62,13 +65,14 @@ class Parameters {
uint readMapNumber;
uint iReadAll;
uint readNmates;
uint readNmates, readNmatesIn;
string readMatesLengthsIn;
vector <string> readNameSeparator;
vector <char> readNameSeparatorChar;
string outSAMreadID;
bool outSAMreadIDnumber;
vector <uint> clip5pNbases, clip3pNbases, clip3pAfterAdapterNbases;
vector <double> clip3pAdapterMMp;
......@@ -88,6 +92,11 @@ class Parameters {
string alignSoftClipAtReferenceEnds;
vector <int32> alignSJstitchMismatchNmax;
struct {
string strandString;
int32 strand;
} pReads;
struct {
string in;
bool ext[2][2];
......@@ -134,7 +143,7 @@ class Parameters {
int outSAMtlen;
struct {bool NH,HI,AS,NM,MD,nM,jM,jI,RG,XS,rB,vG,vA,vW,ch,MC;} outSAMattrPresent, outSAMattrPresentQuant;
struct {bool NH,HI,AS,NM,MD,nM,jM,jI,RG,XS,rB,vG,vA,vW,ch,MC,CR,CY,UR,UY;} outSAMattrPresent, outSAMattrPresentQuant;
vector <int> outSAMattrOrder, outSAMattrOrderQuant;
int outBAMcompression;
......@@ -146,8 +155,7 @@ class Parameters {
// string bamRemoveDuplicatesType;
// uint bamRemoveDuplicatesMate2basesN;
struct
{
struct {
string mode;
bool yes;
bool markMulti;
......@@ -158,30 +166,26 @@ class Parameters {
uint64 *outBAMsortingBinStart; //genomic starts for bins for sorting BAM files
uint16 outSAMflagOR, outSAMflagAND;
struct
{
struct {
vector <string> mode;
bool yes;
bool within;//output unmapped reads within SAM/BAM files
bool keepPairs;//keep mates together
} outSAMunmapped;
struct
{
struct {
vector <string> mode;
bool yes;
bool KeepOnlyAddedReferences;
bool KeepAllAddedReferences;
} outSAMfilter;
struct
{
struct {
string mode;
bool random;
} outMultimapperOrder;
struct
{
struct {
bool yes;
uint NbasesMin;
double MMp;
......@@ -248,27 +252,13 @@ class Parameters {
uint limitOutSJoneRead, limitOutSJcollapsed;
uint limitBAMsortRAM;
uint limitSjdbInsertNsj;
uint limitNreadsSoft;
// penalties
intScore scoreGap, scoreGapNoncan, scoreGapGCAG, scoreGapATAC, scoreDelBase, scoreDelOpen, scoreInsBase, scoreInsOpen;
intScore scoreStitchSJshift;//Max negative score when
double scoreGenomicLengthLog2scale;
//old variables: CLEAN-up needed
char outputBED[MAX_OUTPUT_FLAG]; //output flags
//SW search
uint swMode, swWinCoverageMinP;
//SW penalties
uint swPeoutFilterMatchNmin, swPenMismatch, swPenGapOpen, swPenGapExtend;
uint swHsize;
int annotScoreScale;//overall multiplication factor for the annotation
string annotSignalFile;//binary file with annotation signal
uint sjNovelN, *sjNovelStart, *sjNovelEnd; //novel junctions collapased and filtered
//quantification parameters
//input
......@@ -280,6 +270,7 @@ class Parameters {
struct
{
bool yes;
bool bamYes;
bool indel;
bool softClip;
bool singleEnd;
......@@ -302,23 +293,26 @@ class Parameters {
string vcfFile;
} var;
struct
struct
{
bool yes;
bool SAMtag;
string outputMode;
} wasp;
//solo
ParametersSolo pSolo;
//chimeric
ParametersChimeric pCh;
//splitting
char Qsplit;
uint maxNsplit, minLsplit, minLmap;
//limits
//not really parameters, but global variables:
array<vector<uint64>,2> sjAll;
uint64 sjNovelN, *sjNovelStart, *sjNovelEnd; //novel junctions collapased and filtered
////////////////////// CLEAN-UP needed
InOutStreams *inOut; //main input output streams
......
#include "ParametersSolo.h"
#include "Parameters.h"
#include "ErrorWarning.h"
#include "streamFuns.h"
#include "SequenceFuns.h"
#include "serviceFuns.cpp"
#include <stdlib.h>
const vector<string> ParametersSolo::featureNames={"Gene","SJ"};
void ParametersSolo::initialize(Parameters *pPin)
{
pP=pPin;
if (typeStr=="None") {
type=0;
return;
} else if (typeStr=="Droplet") {
type=1;
bL=cbL+umiL;
pP->readNmates=1; //output mates TODO: check that readNmatesIn==2
} else {
ostringstream errOut;
errOut << "EXITING because of fatal PARAMETERS error: unrecognized option in --soloType="<<typeStr<<"\n";
errOut << "SOLUTION: use allowed option: None or Droplet";
exitWithError(errOut.str(),std::cerr, pP->inOut->logMain, EXIT_CODE_PARAMETER, *pP);
};
if (strandStr=="Unstranded") {
strand=-1;
} else if (strandStr=="Forward") {
strand=0;
} else if (strandStr=="Reverse") {
strand=1;
} else {
ostringstream errOut;
errOut << "EXITING because of fatal PARAMETERS error: unrecognized option in --soloStrand="<<strandStr<<"\n";
errOut << "SOLUTION: use allowed option: Unstranded OR Forward OR Reverse";
exitWithError(errOut.str(),std::cerr, pP->inOut->logMain, EXIT_CODE_PARAMETER, *pP);
};
for (auto &fin : featureIn) {
bool finGood=false;
for (uint32 ii=0; ii<featureNames.size(); ii++) {
if (fin==featureNames[ii]) {
finGood=true;
featureYes[ii]=true;
features.push_back(ii);
break;
};
};
if (!finGood) {
ostringstream errOut;
errOut << "EXITING because of fatal PARAMETERS error: unrecognized option in --soloFeatures="<<fin<<"\n";
errOut << "SOLUTION: use allowed option: ";
errOut <<featureNames[0]<< " OR ";
for (auto &fname : featureNames)
errOut << fname <<" ";
exitWithError(errOut.str(),std::cerr, pP->inOut->logMain, EXIT_CODE_PARAMETER, *pP);
};
};
nFeatures=features.size();
umiDedupYes.resize(3,false);
umiDedupColumns.resize(umiDedup.size());
for (uint32 ii=0; ii<umiDedup.size(); ii++) {
if (umiDedup[ii]=="1MM_NotCollapsed") {
umiDedupYes[0]=true;
umiDedupColumns[ii]=0;
} else if (umiDedup[ii]=="1MM_All") {
umiDedupYes[1]=true;
umiDedupColumns[ii]=1;
} else if (umiDedup[ii]=="1MM_Directional") {
umiDedupYes[2]=true;
umiDedupColumns[ii]=2;
} else {
ostringstream errOut;
errOut << "EXITING because of fatal PARAMETERS error: unrecognized option in --soloUMIdedup="<<umiDedup[ii]<<"\n";
exitWithError(errOut.str(),std::cerr, pP->inOut->logMain, EXIT_CODE_PARAMETER, *pP);
};
};
///////////// finished parameters input
//make output directory if needed
if ( outFileNames[0].find_last_of("/") < outFileNames[0].size() ) {//need to create dir
string dir1=pP->outFileNamePrefix+outFileNames[0].substr(0,outFileNames[0].find_last_of("/"));
if (mkdir(dir1.c_str(),pP->runDirPerm)!=0 && errno!=EEXIST) {
ostringstream errOut;
errOut << "EXITING because of fatal OUTPUT FILE error: could not create Solo output directory"<<dir1<<"\n";
errOut << "SOLUTION: check the path and permisssions";
exitWithError(errOut.str(),std::cerr, pP->inOut->logMain, EXIT_CODE_PARAMETER, *pP);
};
};
QSbase=33;//TODO make these user-definable
QSmax=33;
cbMinP=0.975;
umiMaskLow=(uint32) ( (((uint64)1)<<umiL) - 1);
umiMaskHigh=~umiMaskLow;
//load the CB whitlist and create unordered_map
ifstream & cbWlStream = ifstrOpen(soloCBwhitelist, ERROR_OUT, "SOLUTION: check the path and permissions of the CB whitelist file: " + soloCBwhitelist, *pP);
string seq1;
while (cbWlStream >> seq1) {
if (seq1.size() != cbL) {
ostringstream errOut;
errOut << "EXITING because of FATAL ERROR in input CB whitelist file: "<< soloCBwhitelist <<" the total length of barcode sequence is " << seq1.size() << " not equal to expected " <<bL <<"\n" ;
errOut << "SOLUTION: make sure that the barcode read is the second in --readFilesIn and check that is has the correct formatting\n";
exitWithError(errOut.str(),std::cerr, pP->inOut->logMain, EXIT_CODE_INPUT_FILES, *pP);
};
uint32 cb1;
//convert to 2-bit format
if (convertNuclStrToInt32(seq1,cb1)) {
//cbWL.insert(cb1);
cbWL.push_back(cb1);
} else {
pP->inOut->logMain << "WARNING: CB whitelist sequence contains non-ACGT and is ignored: " << seq1 <<endl;
};
};
//int comp1 = [] (const void *a, const void *b) {uint32 aa=*(uint32*) a; uint32 bb=*(uint32*) b; if (a
qsort(cbWL.data(),cbWL.size(),sizeof(uint32),funCompareNumbers<uint32>);
if (!pP->quant.trSAM.yes) {
pP->quant.yes = true;
pP->quant.trSAM.yes = true;
pP->quant.trSAM.bamYes = false;
pP->quant.trSAM.bamCompression = -2;
pP->quant.trSAM.indel = true;
pP->quant.trSAM.softClip = true;
pP->inOut->logMain << "Turning on Genomic->Transcriptomic coordinate conversion for STARsolo\n";
};
time_t rawTime;
time(&rawTime);
pP->inOut->logMain << timeMonthDayTime(rawTime) << "Finished reading CB whitelist sequences: " << cbWL.size() <<endl;
};