Commits on Source (6)
STAR 2.7.3a 2019/10/08
======================
Major new features in STARsolo
------------------------------
* **Output enhancements:**
* Summary.csv statistics output for raw and filtered cells useful for quick run quality assessment.
* --soloCellFilter option for basic filtering of the cells, similar to the methods used by CellRanger 2.2.x.
* [**Better compatibility with CellRanger 3.x.x:**](docs/STARsolo.md#matching-cellranger-3xx-results)
* --soloUMIfiltering MultiGeneUMI option introduced in CellRanger 3.x.x for filtering UMI collisions between different genes.
* --soloCBmatchWLtype 1MM_multi_pseudocounts option, introduced in CellRanger 3.x.x, which slightly changes the posterior probability calculation for CB with 1 mismatch.
* [**Velocyto spliced/unspliced/ambiguous quantification:**](docs/STARsolo.md#velocyto-splicedunsplicedambiguous-quantification)
* --soloFeatures Velocyto option to produce Spliced, Unspliced, and Ambiguous counts similar to the [velocyto.py](http://velocyto.org/) tool developed by [LaManno et al](https://doi.org/10.1038/s41586-018-0414-6). This option is under active development and the results may change in the future versions.
* [**Support for complex barcodes, e.g. inDrop:**](docs/STARsolo.md#barcode-geometry)
* Complex barcodes in STARsolo with --soloType CB_UMI_Complex, --soloCBmatchWLtype, --soloAdapterSequence, --soloAdapterMismatchesNmax, --soloCBposition, --soloUMIposition
* [**BAM tags:**](#bam-tags)
* CB/UB for corrected CellBarcode/UMI
* GX/GN for gene ID/name
* STARsolo most up-to-date [documentation](docs/STARsolo.md).
STAR 2.7.2d 2019/10/04
======================
* Fixed the problem with no header in Chimeric.out.sam
STAR 2.7.2c 2019/10/02
======================
* Fixed the problem with no output to Chimeric.out.sam
STAR 2.7.2b 2019/08/29
======================
Bug fixes in chimeric detection, contributed by Meng Xiao He (@mengxiao)
* Fix memory leak in handling chimeric multimappers: #721
* Ensure chimeric alignment score requirements are consistently checked: #722,#723.
STAR 2.7.2a 2019/08/13
======================
* Chimeric read reporting now requires that the chimeric read alignment score be higher than that of the alternative non-chimeric alignment to the reference genome. The Chimeric.out.junction file now includes the scores of the chimeric alignments and non-chimeric alternative alignments, in addition to the PEmerged bool attribute. (bhaas, Aug 2019)
* Fixed the problem with ALT=* in STAR-WASP.
* Implemented extras/scripts/soloBasicCellFilter.awk script to perform basic filtering of the STARsolo count matrices.
* Fixed a bug causing rare seg-faults with --peOverlap* options and chimeric detection.
* Fixed a problem in STARsolo with unmapped reads counts in Solo.out/*.stats files.
* Fixed a bug in STARsolo with counting reads for splice junctions. Solo.out/matrixSJ.mtx output is slightly changed.
* Fixed the problem with ALT=* in VCF files for STAR-WASP.
STAR 2.7.1a 2019/05/15
======================
......@@ -27,7 +58,6 @@ STAR 2.7.1a 2019/05/15
STAR 2.7.0f 2019/03/28
======================
* Fixed a problem in STARsolo with empty Unmapped.out.mate2 file. Issue #593.
* Fixed a problem with CR CY UR UQ SAM tags in solo output. Issue #593.
* Fixed problems with STARsolo and 2-pass.
......
......@@ -35,9 +35,9 @@ Download the latest [release from](https://github.com/alexdobin/STAR/releases) a
```bash
# Get latest STAR source from releases
wget https://github.com/alexdobin/STAR/archive/2.7.2b.tar.gz
tar -xzf 2.7.2b.tar.gz
cd STAR-2.7.2b
wget https://github.com/alexdobin/STAR/archive/2.7.3a.tar.gz
tar -xzf 2.7.3a.tar.gz
cd STAR-2.7.3a
# Alternatively, get STAR source using git
git clone https://github.com/alexdobin/STAR.git
......
STAR 2.7.3a 2019/10/08
======================
Major new features in STARsolo
------------------------------
* **Output enhancements:**
* Summary.csv statistics output for raw and filtered cells useful for quick run quality assessment.
* --soloCellFilter option for basic filtering of the cells, similar to the methods used by CellRanger 2.2.x.
* [**Better compatibility with CellRanger 3.x.x:**](docs/STARsolo.md#matching-cellranger-3xx-results)
* --soloUMIfiltering MultiGeneUMI option introduced in CellRanger 3.x.x for filtering UMI collisions between different genes.
* --soloCBmatchWLtype 1MM_multi_pseudocounts option, introduced in CellRanger 3.x.x, which slightly changes the posterior probability calculation for CB with 1 mismatch.
* [**Velocyto spliced/unspliced/ambiguous quantification:**](docs/STARsolo.md#velocyto-splicedunsplicedambiguous-quantification)
* --soloFeatures Velocyto option to produce Spliced, Unspliced, and Ambiguous counts similar to the [velocyto.py](http://velocyto.org/) tool developed by [LaManno et al](https://doi.org/10.1038/s41586-018-0414-6). This option is under active development and the results may change in the future versions.
* [**Support for complex barcodes, e.g. inDrop:**](docs/STARsolo.md#barcode-geometry)
* Complex barcodes in STARsolo with --soloType CB_UMI_Complex, --soloCBmatchWLtype, --soloAdapterSequence, --soloAdapterMismatchesNmax, --soloCBposition, --soloUMIposition
* [**BAM tags:**](#bam-tags)
* CB/UB for corrected CellBarcode/UMI
* GX/GN for gene ID/name
* STARsolo most up-to-date [documentation](docs/STARsolo.md).
STAR 2.7.2a 2019/08/13
======================
......
rna-star (2.7.3a+dfsg-1) unstable; urgency=medium
* New upstream version.
* Bump Standards-Version.
* Use debhelper 12.
-- Sascha Steinbiss <satta@debian.org> Thu, 10 Oct 2019 00:26:31 +0200
rna-star (2.7.2b+dfsg-1) unstable; urgency=medium
* New upstream version.
......
......@@ -5,12 +5,12 @@ Uploaders: Steffen Moeller <moeller@debian.org>,
Sascha Steinbiss <satta@debian.org>
Section: science
Priority: optional
Build-Depends: debhelper (>= 11),
Build-Depends: debhelper (>= 12),
libhts-dev,
vim-common,
xxd,
zlib1g-dev
Standards-Version: 4.3.0
Standards-Version: 4.4.1
Vcs-Browser: https://salsa.debian.org/med-team/rna-star
Vcs-Git: https://salsa.debian.org/med-team/rna-star.git
Homepage: https://github.com/alexdobin/STAR/
......
......@@ -23,7 +23,7 @@ Description: Use Debian packaged htslib
CXXFLAGS_main := -O3 $(CXXFLAGS_common)
CXXFLAGS_gdb := -O0 -g $(CXXFLAGS_common)
@@ -64,10 +64,10 @@
@@ -66,10 +66,10 @@
%.o : %.cpp
......@@ -36,7 +36,7 @@ Description: Use Debian packaged htslib
all: STAR
@@ -78,19 +78,17 @@
@@ -80,19 +80,17 @@
.PHONY: CLEAN
CLEAN:
rm -f *.o STAR Depend.list
......@@ -57,7 +57,7 @@ Description: Use Debian packaged htslib
echo $(SOURCES)
'rm' -f ./Depend.list
$(CXX) $(CXXFLAGS_common) -MM $^ >> Depend.list
@@ -101,11 +99,6 @@
@@ -103,11 +101,6 @@
endif
endif
......@@ -127,9 +127,9 @@ Description: Use Debian packaged htslib
// kstring_t strK;
--- a/source/STAR.cpp
+++ b/source/STAR.cpp
@@ -30,7 +30,7 @@
#include "bam_cat.h"
@@ -27,7 +27,7 @@
#include "Variation.h"
#include "Solo.h"
-#include "htslib/htslib/sam.h"
+#include <htslib/sam.h>
......
STARsolo: mapping, demultiplexing and gene quantification for single cell RNA-seq
---------------------------------------------------------------------------------
STARsolo: mapping, demultiplexing and quantification for single cell RNA-seq
=================================================================================
First released in STAR 2.7.0a (Jan 23 2019)
Major updates in STAR 2.7.3a (Oct 8 2019)
-----------------------------------------
* **Output enhancements:**
* Summary.csv statistics output for raw and filtered cells useful for quick run quality assessment.
* --soloCellFilter option for basic filtering of the cells, similar to the methods used by CellRanger 2.2.x.
* [**Better compatibility with CellRanger 3.x.x:**](#matching-cellranger-3xx-results)
* --soloUMIfiltering MultiGeneUMI option introduced in CellRanger 3.x.x for filtering UMI collisions between different genes.
* --soloCBmatchWLtype 1MM_multi_pseudocounts option, introduced in CellRanger 3.x.x, which slightly changes the posterior probability calculation for CB with 1 mismatch.
* [**Velocyto spliced/unspliced/ambiguous quantification:**](#velocyto-splicedunsplicedambiguous-quantification)
* --soloFeatures Velocyto option to produce Spliced, Unspliced, and Ambiguous counts similar to the [velocyto.py](http://velocyto.org/) tool developed by [LaManno et al](https://doi.org/10.1038/s41586-018-0414-6). This option is under active development and the results may change in the future versions.
* [**Support for complex barcodes, e.g. inDrop:**](#barcode-geometry)
* Complex barcodes in STARsolo with --soloType CB_UMI_Complex, --soloCBmatchWLtype, --soloAdapterSequence, --soloAdapterMismatchesNmax, --soloCBposition, --soloUMIposition
* [**BAM tags:**](#bam-tags)
* CB/UB for corrected CellBarcode/UMI
* GX/GN for gene ID/name
STARsolo
-------------
STARsolo is a turnkey solution for analyzing droplet single cell RNA sequencing data (e.g. 10X Genomics Chromium System) built directly into STAR code.
STARsolo takes the raw FASTQ read files as input and performs the following operations:
* error correction and demultiplexing of cell barcodes using a user-provided whitelist
* mapping the reads to the reference genome using the standard STAR spliced read alignment algorithm
* error correction and collapsing (deduplication) of Unique Molecular Identifiers (UMIs)
* quantification of per-cell gene expression by counting the number of reads per gene
* quantification of other transcriptomic features: splice junctions; pre-mRNA; spliced/unspliced reads similar to Velocyto
STARsolo output is designed to be a drop-in replacement for 10X CellRanger gene quantification output.
It follows CellRanger logic for cell barcode whitelisting and UMI deduplication, and produces nearly identical gene counts in the same format.
At the same time, STARsolo is ~10 times faster than CellRanger.
The STAR solo algorithm is turned on with:
Running STARsolo for 10X Chromium scRNA-seq data
-------------------------------------
* STARsolo is run the same way as a normal STAR run, with the addition of several STARsolo parameters:
```
/path/to/STAR --genomeDir /path/to/genome/dir/ --readFilesIn ... [...other parameters...] --soloType ... --soloCBwhitelist ...
```
The genome index is the same as for normal STAR runs. </br>
The parameters required to run STARsolo on 10X Chromium data are described below:
* The STAR solo algorithm is turned on with:
```
--soloType Droplet
```
or, since 2.7.3a, with the more descriptive:
```
--soloType CB_UMI_Simple
```
* The CellBarcode whitelist has to be provided with:
Presently, the cell barcode whitelist has to be provided with:
```
--soloCBwhitelist /path/to/cell/barcode/whitelist
```
The 10X Chromium whitelist file can be found inside the CellRanger distribution,
e.g. [10X-whitelist](https://kb.10xgenomics.com/hc/en-us/articles/115004506263-What-is-a-barcode-whitelist-).
Please make sure that the whitelist is compatible with the specific version of the 10X chemistry (V1,V2,V3 etc).
Please make sure that the whitelist is compatible with the specific version of the 10X chemistry: V2 or V3. For instance, in CellRanger 3.1.0, the *V2 whitelist* is:
```
cellranger-cs/3.1.0/lib/python/cellranger/barcodes/737K-august-2016.txt
```
and *V3 whitelist* (gunzip it for STAR):
```
cellranger-cs/3.1.0/lib/python/cellranger/barcodes/3M-february-2018.txt.gz
```
* The default barcode lengths (CB=16b, UMI=10b) work for 10X Chromium V2. For V3, specify:
```
--soloUMIlen 12
```
Importantly, in the --readFilesIn option, the 1st file has to be cDNA read, and the 2nd file has to be the barcode (cell+UMI) read, i.e.
* Importantly, in the --readFilesIn option, the 1st file has to be cDNA read, and the 2nd file has to be the barcode (cell+UMI) read, i.e.
```
--readFilesIn cDNAfragmentSequence.fastq.gz CellBarcodeUMIsequence.fastq.gz
```
For instance, standard 10X runs have cDNA as Read2 and barcode as Read1:
```
--readFilesIn Read2.fastq.gz Read1.fastq.gz
```
For multiple lanes, use comma-separated lists for Read2 and Read1:
```
--readFilesIn Read2_Lane1.fastq.gz,Read2_Lane2.fastq.gz,Read2_Lane3.fastq.gz Read1_Lane1.fastq.gz,Read1_Lane2.fastq.gz,Read1_Lane3.fastq.gz
```
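The parameters above can be combined into a single command line. The following is a minimal sketch in Python that only assembles the argument list and does not run STAR; the helper name `star_solo_args` and the file names in the usage note are hypothetical, while the options themselves are the ones described in this section.

```python
def star_solo_args(genome_dir, cdna_fastqs, barcode_fastqs, whitelist, umi_len=10):
    """Assemble a STARsolo argument list for 10X Chromium data (nothing is executed)."""
    if len(cdna_fastqs) != len(barcode_fastqs):
        raise ValueError("need exactly one barcode FASTQ per cDNA FASTQ")
    args = [
        "STAR",
        "--genomeDir", genome_dir,
        # The cDNA read comes first and the barcode (cell+UMI) read second;
        # multiple lanes are joined with commas, in the same order for both.
        "--readFilesIn", ",".join(cdna_fastqs), ",".join(barcode_fastqs),
        "--soloType", "CB_UMI_Simple",
        "--soloCBwhitelist", whitelist,
    ]
    # The default UMI length (10) matches 10X Chromium V2; V3 needs 12.
    if umi_len != 10:
        args += ["--soloUMIlen", str(umi_len)]
    return args
```

For a two-lane V3 run, `star_solo_args("/path/to/genome/dir/", ["Read2_Lane1.fastq.gz", "Read2_Lane2.fastq.gz"], ["Read1_Lane1.fastq.gz", "Read1_Lane2.fastq.gz"], "/path/to/whitelist", umi_len=12)` returns a list that can be joined with spaces or passed to a process launcher.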
----------------------------------------
How to make STARsolo _raw_ gene counts (almost) identical to CellRanger's
----------------------------------------------------------------------
* CellRanger uses its own "filtered" version of annotations (GTF file), which is a subset of ENSEMBL annotations with several gene biotypes removed (mostly small non-coding RNA). Annotations affect the counts, so to match CellRanger counts, CellRanger annotations have to be used.
* 10X provides several versions of the CellRanger annotations:</br>
[https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest ](https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest)</br>
For the best match, the annotations in CellRanger run and STARsolo run should be exactly the same.
* The FASTA and GTF files
```
refdata-cellranger-GRCh38-3.0.0/genes/genes.gtf
refdata-cellranger-GRCh38-3.0.0/genes/genome.fa
```
have to be used in STAR genome index generation step before mapping:
```
STAR --runMode genomeGenerate --runThreadN ... --genomeDir ./ --genomeFastaFiles /path/to/genome.fa --sjdbGTFfile /path/to/genes.gtf
```
* If you want to use your own GTF (e.g. newer version of ENSEMBL or GENCODE), you can generate the "filtered" GTF file using 10X's mkref tool:</br>
[https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/references](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/references)
* To make the agreement between STARsolo and CellRanger even closer, you can add
```
--genomeSAsparseD 3
```
to the genome generation options; CellRanger uses this option to generate its STAR genomes. It generates a sparse suffix array, which has the additional benefit of fitting into 16GB of RAM. However, it also reduces speed by 30-50%.
* The considerations above are for *raw* counts, i.e. when cell filtering is not performed. To get *filtered* results, refer to [Basic cell filtering](#basic-cell-filtering) section.
#### Matching CellRanger 3.x.x results
* By default, cell barcode and UMI collapsing parameters are designed to give the best agreement with CellRanger 2.x.x. CellRanger 3.x.x introduced some minor changes to this algorithm. To get the best agreement between STARsolo and CellRanger 3.x.x, add these parameters:
```
--soloUMIfiltering MultiGeneUMI --soloCBmatchWLtype 1MM_multi_pseudocounts
```
-----------------
Barcode geometry
-------------------
* Simple barcode lengths and start positions on barcode reads are described with
```
--soloCBstart, --soloCBlen, --soloUMIstart, --soloUMIlen
```
which works with
```
--soloType CB_UMI_Simple (a.k.a Droplet)
```
* More complex barcodes are activated with ```--soloType CB_UMI_Complex``` and are described with the following parameters:
```
soloCBposition -
string(s): position of Cell Barcode(s) on the barcode read.
Presently only works with --soloType CB_UMI_Complex, and barcodes are assumed to be on Read2.
Format for each barcode: startAnchor_startDistance_endAnchor_endDistance
start(end)Anchor defines the anchor base for the CB: 0: read start; 1: read end; 2: adapter start; 3: adapter end
start(end)Distance is the distance from the CB start(end) to the Anchor base
Strings for different barcodes are separated by spaces.
Example: inDrop (Zilionis et al, Nat. Protocols, 2017):
--soloCBposition 0_0_2_-1 3_1_3_8
soloUMIposition -
string: position of the UMI on the barcode read, same format as soloCBposition
Example: inDrop (Zilionis et al, Nat. Protocols, 2017):
--soloUMIposition 3_9_3_14
soloAdapterSequence -
string: adapter sequence to anchor barcodes.
soloAdapterMismatchesNmax 1
int>0: maximum number of mismatches allowed in adapter sequence
```
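To make the `startAnchor_startDistance_endAnchor_endDistance` format concrete, here is a small Python sketch that parses a position string and cuts the corresponding bases out of a barcode read. It is an illustration under stated assumptions, not STAR's implementation: coordinates are taken as 0-based and inclusive, the adapter is located by exact match only (STAR allows mismatches via `--soloAdapterMismatchesNmax`), and the function names, read and adapter sequence are made up.

```python
from typing import NamedTuple, Optional

class BarcodePosition(NamedTuple):
    start_anchor: int
    start_distance: int
    end_anchor: int
    end_distance: int

def parse_position(spec: str) -> BarcodePosition:
    """Parse 'startAnchor_startDistance_endAnchor_endDistance'."""
    fields = spec.split("_")
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields separated by '_', got: {spec!r}")
    pos = BarcodePosition(*(int(f) for f in fields))
    if pos.start_anchor not in range(4) or pos.end_anchor not in range(4):
        raise ValueError("anchors must be 0, 1, 2 or 3")
    return pos

def extract(read: str, adapter: str, pos: BarcodePosition) -> Optional[str]:
    """Return the bases selected by pos, or None if they cannot be located."""
    adapter_start = read.find(adapter)  # assumption: exact adapter match only
    if adapter_start < 0:
        return None
    # Anchor bases: 0 read start, 1 read end, 2 adapter start, 3 adapter end.
    anchors = [0, len(read) - 1, adapter_start, adapter_start + len(adapter) - 1]
    start = anchors[pos.start_anchor] + pos.start_distance
    end = anchors[pos.end_anchor] + pos.end_distance
    if start < 0 or end >= len(read) or start > end:
        return None
    return read[start:end + 1]
```

With the inDrop positions from the example, `0_0_2_-1` selects everything from the first base of the read up to the base just before the adapter, `3_1_3_8` selects the 8 bases right after the adapter, and `3_9_3_14` selects the 6 bases after those.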
--------------------------------------
Basic cell filtering
--------------------
* Since 2.7.3a, in addition to raw, unfiltered output of gene/cell counts, STARsolo performs simple (knee-like) filtering of the cells, similar to the methods used by CellRanger 2.2.x. This is turned on by default and is controlled by:
```
soloCellFilter CellRanger2.2 3000 0.99 10
string(s): cell filtering type and parameters
CellRanger2.2 ... simple filtering of CellRanger 2.2, followed by three numbers: number of expected cells, robust maximum percentile for UMI count, maximum to minimum ratio for UMI count
TopCells ... only report top cells by UMI count, followed by the exact number of cells
None ... do not output filtered cells
```
* This filtering is used to produce summary statistics for filtered cells in the Summary.csv file, which is similar to CellRanger's summary and is useful for Quality Control.
* Recent versions of CellRanger switched to more advanced filtering done with the EmptyDrops tool developed by [Lun et al](https://doi.org/10.1186/s13059-019-1662-y). To obtain filtered counts similar to recent CellRanger versions, run this tool on the **raw** STARsolo output.
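The three numbers of the `CellRanger2.2` filter can be read as a simple knee-like threshold. The Python sketch below shows one plausible interpretation of those parameters and is not taken from the STAR source: the robust maximum is the UMI count at the top `(1 - percentile)` fraction of the expected cells, and a barcode is kept if its UMI count is at least the robust maximum divided by the maximum-to-minimum ratio. The function name and the exact rounding are assumptions.

```python
def cellranger22_filter(umi_counts, n_expected=3000, max_percentile=0.99, max_min_ratio=10):
    """Return (threshold, kept_counts) for a CellRanger 2.2-style knee filter."""
    ranked = sorted(umi_counts, reverse=True)
    if not ranked:
        return 0.0, []
    # Robust maximum: skip the top (1 - max_percentile) fraction of the
    # expected cells so that a few outlier barcodes do not set the scale.
    idx = min(len(ranked) - 1, int(round(n_expected * (1 - max_percentile))))
    robust_max = ranked[idx]
    # Barcodes within a factor of max_min_ratio of the robust maximum pass.
    threshold = robust_max / max_min_ratio
    kept = [count for count in ranked if count >= threshold]
    return threshold, kept
```

With the default `3000 0.99 10`, the robust maximum is the 31st largest UMI count, and all barcodes with at least one tenth of that count are reported as cells.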
Important: the genome index has to be re-generated with the latest 2.7.0x release.
Other parameters that control STARsolo output are listed below. Note that default parameters are compatible with 10X Chromium V2 protocol.
------------------
Quantification of different transcriptomic features
-----------------------
* In addition to the gene counts (default), STARsolo can calculate counts for other transcriptomic features:
* pre-mRNA counts, useful for single-nucleus RNA-seq. This counts all reads that overlap gene loci, i.e. it includes both exonic and intronic reads:
```
--soloFeatures GeneFull
```
* Counts for annotated and novel splice junctions:
```
--soloFeatures SJ
```
* #### Velocyto spliced/unspliced/ambiguous quantification
This option will calculate Spliced, Unspliced, and Ambiguous counts per cell per gene similar to the [velocyto.py](http://velocyto.org/) tool developed by [LaManno et al](https://doi.org/10.1038/s41586-018-0414-6). This option is under active development and the results may change in the future versions.
```
--soloFeatures Gene Velocyto
```
Note that Velocyto quantification requires the Gene feature to be quantified as well.
* All the features can be conveniently quantified in one run:
```
--soloFeatures Gene GeneFull SJ Velocyto
```
--------------------------------------
BAM tags
-----------------
* To output BAM tags into SAM/BAM file, add them to the list of standard tags in
```
--outSAMattributes NH HI nM AS CR UR CB UB GX GN sS sQ sM
```
Any combination of tags can be used.
* CR/UR: **raw (uncorrected)** CellBarcode/UMI
* CY/UY: quality score for CellBarcode/UMI
* GX/GN: for gene ID/names
* sS/sQ: for sequence/quality combined CellBarcode and UMI; sM for barcode match status.
* CB/UB: **corrected** CellBarcode/UMI. Note, that these tags require sorted BAM output, i.e. we need to add:
```
--outSAMtype BAM SortedByCoordinate
```
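Once the tags are written, they appear as optional `TAG:TYPE:VALUE` fields after the 11 mandatory SAM columns. The sketch below, in plain Python with no BAM library, pulls them out of a single SAM text line (for example one line of `samtools view` output); the function name and the record in the usage note are invented for illustration.

```python
def sam_tags(sam_line):
    """Return the optional fields of one SAM line as a {tag: value} dict."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    # Columns 1-11 are mandatory; optional fields follow as TAG:TYPE:VALUE.
    for field in fields[11:]:
        tag, value_type, value = field.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    return tags
```

For a made-up record ending in `NH:i:1`, `CB:Z:AAACCTGAGAAACCAT` and `UB:Z:ACGTACGTAC`, the function returns `{"NH": 1, "CB": "AAACCTGAGAAACCAT", "UB": "ACGTACGTAC"}`, which is enough to group alignments by corrected cell barcode and UMI.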
-------------------------------------------------------------
------------------------------------------------------
--------------------------------------------------
For completeness, all parameters that control STARsolo output are listed again below with defaults and short descriptions:
---------------------------------------
```
soloType None
string(s): type of single-cell RNA-seq
CB_UMI_Simple ... (a.k.a. Droplet) one UMI and one Cell Barcode of fixed length in read2, e.g. Drop-seq and 10X Chromium
CB_UMI_Complex ... one UMI of fixed length, but multiple Cell Barcodes of varying length, as well as adapters sequences are allowed in read2 only, e.g. inDrop.
soloCBwhitelist -
string(s): file(s) with whitelist(s) of cell barcodes. Only one file allowed with
soloCBstart 1
int>0: cell barcode start base
......@@ -50,6 +232,41 @@ soloUMIstart 17
soloUMIlen 10
int>0: UMI length
soloBarcodeReadLength 1
int: length of the barcode read
1 ... equal to sum of soloCBlen+soloUMIlen
0 ... not defined, do not check
soloCBposition -
string(s): position of Cell Barcode(s) on the barcode read.
Presently only works with --soloType CB_UMI_Complex, and barcodes are assumed to be on Read2.
Format for each barcode: startAnchor_startDistance_endAnchor_endDistance
start(end)Anchor defines the anchor base for the CB: 0: read start; 1: read end; 2: adapter start; 3: adapter end
start(end)Distance is the distance from the CB start(end) to the Anchor base
Strings for different barcodes are separated by spaces.
Example: inDrop (Zilionis et al, Nat. Protocols, 2017):
--soloCBposition 0_0_2_-1 3_1_3_8
soloUMIposition -
string: position of the UMI on the barcode read, same format as soloCBposition
Example: inDrop (Zilionis et al, Nat. Protocols, 2017):
--soloUMIposition 3_9_3_14
soloAdapterSequence -
string: adapter sequence to anchor barcodes.
soloAdapterMismatchesNmax 1
int>0: maximum number of mismatches allowed in adapter sequence.
soloCBmatchWLtype 1MM_multi
string: matching the Cell Barcodes to the WhiteList
Exact ... only exact matches allowed
1MM ... only one match in whitelist with 1 mismatched base allowed. Allowed CBs have to have at least one read with exact match.
1MM_multi ... multiple matches in whitelist with 1 mismatched base allowed, posterior probability calculation is used to choose one of the matches.
Allowed CBs have to have at least one read with exact match. Similar to CellRanger 2.2.0
1MM_multi_pseudocounts ... same as 1MM_multi, but pseudocounts of 1 are added to all whitelist barcodes.
Similar to CellRanger 3.x.x
soloStrand Forward
string: strandedness of the solo libraries:
Unstranded ... no strand information
......@@ -57,21 +274,31 @@ soloStrand Forward
Reverse ... read strand opposite to the original RNA molecule
soloFeatures Gene
string(s) genomic features for which the UMI counts per Cell Barcode are collected
string(s): genomic features for which the UMI counts per Cell Barcode are collected
Gene ... genes: reads match the gene transcript
SJ ... splice junctions: reported in SJ.out.tab
GeneFull ... full genes: count all reads overlapping genes' exons and introns
Transcript3p ... quantification of transcript for 3' protocols
soloUMIdedup 1MM_All
string(s) type of UMI deduplication (collapsing) algorithm
string(s): type of UMI deduplication (collapsing) algorithm
1MM_All ... all UMIs with 1 mismatch distance to each other are collapsed (i.e. counted once)
1MM_Directional ... follows the "directional" method from the UMI-tools by Smith, Heger and Sudbery (Genome Research 2017).
1MM_NotCollapsed ... UMIs with 1 mismatch distance to others are not collapsed (i.e. all counted)
soloOutFileNames Solo.out/ genes.tsv barcodes.tsv matrix.mtx matrixSJ.mtx
string(s) file names for STARsolo output
1st word ... file name prefix
2nd word ... barcode sequences
3rd word ... gene IDs and names
4th word ... cell/gene counts matrix
5th word ... cell/splice junction counts matrix
Exact ... only exactly matching UMIs are collapsed
soloUMIfiltering -
string(s) type of UMI filtering
- ... basic filtering: remove UMIs with N and homopolymers (similar to CellRanger 2.2.0)
MultiGeneUMI ... remove lower-count UMIs that map to more than one gene (introduced in CellRanger 3.x.x)
soloOutFileNames Solo.out/ features.tsv barcodes.tsv matrix.mtx
string(s) file names for STARsolo output:
file_name_prefix gene_names barcode_sequences cell_feature_count_matrix
soloCellFilter CellRanger2.2 3000 0.99 10
string(s): cell filtering type and parameters
CellRanger2.2 ... simple filtering of CellRanger 2.2, followed by three numbers: number of expected cells, robust maximum percentile for UMI count, maximum to minimum ratio for UMI count
TopCells ... only report top cells by UMI count, followed by the exact number of cells
None ... do not output filtered cells
```
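The difference between the `soloUMIdedup` options `1MM_All` and `Exact` listed above can be illustrated with a short Python sketch. It assumes that `1MM_All` collapses transitively, i.e. UMIs connected through a chain of 1-mismatch neighbours count as one molecule; this is one reading of the option description, not a statement about STAR's internal implementation, and the function names are hypothetical.

```python
from itertools import combinations

def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def count_umis(umis, method="1MM_All"):
    """Count molecules for one cell/gene pair from its observed UMI sequences."""
    unique = sorted(set(umis))
    if method == "Exact":
        return len(unique)  # only identical UMIs are collapsed
    if method != "1MM_All":
        raise ValueError(f"unsupported method: {method}")
    # Union-find over UMIs: link every pair within 1 mismatch of each other.
    parent = {u: u for u in unique}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for a, b in combinations(unique, 2):
        if len(a) == len(b) and hamming(a, b) <= 1:
            parent[find(a)] = find(b)
    # Each connected group of UMIs is counted once.
    return len({find(u) for u in unique})
```

For the UMIs `AAAA`, `AAAT`, `AATT` and `CCCC`, this sketch counts 2 molecules under `1MM_All` (the first three are chained by single mismatches) and 4 under `Exact`.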
......@@ -34,7 +34,7 @@
\newcommand{\sechyperref}[1]{\hyperref[#1]{Section \ref{#1}. \nameref{#1}}}
\title{STAR manual 2.7.2b}
\title{STAR manual 2.7.3a}
\author{Alexander Dobin\\
dobin@cshl.edu}
\maketitle
......
......@@ -159,6 +159,9 @@
\optName{readNameSeparator}
\optValue{/}
\optLine{string(s): character(s) separating the part of the read names that will be trimmed in output (read name after space is always trimmed)}
\optName{readQualityScoreBase}
\optValue{33}
\optLine{int{\textgreater}=0: number to be subtracted from the ASCII code to get Phred quality score}
\optName{clip3pNbases}
\optValue{0}
\optLine{int(s): number(s) of bases to clip from 3p of each mate. If one value is given, it will be assumed the same for both mates.}
......@@ -288,7 +291,16 @@
\optOpt{vA} \optOptLine{variant allele}
\optOpt{vG} \optOptLine{genomic coordinate of the variant overlapped by the read}
\optOpt{vW} \optOptLine{0/1 - alignment does not pass / passes WASP filtering. Requires --waspOutputMode SAMtag}
\end{optOptTable}
\optLine{STARsolo:}
\begin{optOptTable}
\optOpt{CR CY UR UY} \optOptLine{sequences and quality scores of cell barcodes and UMIs for the solo* demultiplexing}
\optOpt{CB UB} \optOptLine{error-corrected cell barcodes and UMIs for solo* demultiplexing. Requires --outSAMtype BAM SortedByCoordinate.}
\optOpt{sM} \optOptLine{assessment of CB and UMI}
\end{optOptTable}
\optLine{sS ... sequence of the entire barcode (CB,UMI,adapter...)}
\begin{optOptTable}
\optOpt{sQ} \optOptLine{quality of the entire barcode}
\end{optOptTable}
\optLine{Unsupported/undocumented:}
\begin{optOptTable}
......@@ -789,11 +801,12 @@
\optValue{None}
\optLine{string(s): type of single-cell RNA-seq}
\begin{optOptTable}
\optOpt{Droplet} \optOptLine{one cell barcode and one UMI barcode in read2, e.g. Drop-seq and 10X Chromium}
\optOpt{CB{\textunderscore}UMI{\textunderscore}Simple} \optOptLine{(a.k.a. Droplet) one UMI and one Cell Barcode of fixed length in read2, e.g. Drop-seq and 10X Chromium}
\optOpt{CB{\textunderscore}UMI{\textunderscore}Complex} \optOptLine{one UMI of fixed length, but multiple Cell Barcodes of varying length, as well as adapters sequences are allowed in read2 only, e.g. inDrop.}
\end{optOptTable}
\optName{soloCBwhitelist}
\optValue{-}
\optLine{string: file with whitelist of cell barcodes}
\optLine{string(s): file(s) with whitelist(s) of cell barcodes. Only one file allowed with }
\optName{soloCBstart}
\optValue{1}
\optLine{int{\textgreater}0: cell barcode start base}
......@@ -813,6 +826,40 @@
\optOpt{1} \optOptLine{equal to sum of soloCBlen+soloUMIlen}
\optOpt{0} \optOptLine{not defined, do not check}
\end{optOptTable}
\optName{soloCBposition}
\optValue{-}
\optLine{string(s): position of Cell Barcode(s) on the barcode read.}
\optLine{Presently only works with --soloType CB{\textunderscore}UMI{\textunderscore}Complex, and barcodes are assumed to be on Read2.}
\optLine{Format for each barcode: startAnchor{\textunderscore}startDistance{\textunderscore}endAnchor{\textunderscore}endDistance}
\optLine{start(end)Anchor defines the anchor base for the CB: 0: read start; 1: read end; 2: adapter start; 3: adapter end}
\optLine{start(end)Distance is the distance from the CB start(end) to the Anchor base}
\optLine{Strings for different barcodes are separated by spaces.}
\optLine{Example: inDrop (Zilionis et al, Nat. Protocols, 2017):}
\optLine{--soloCBposition 0{\textunderscore}0{\textunderscore}2{\textunderscore}-1 3{\textunderscore}1{\textunderscore}3{\textunderscore}8}
\optName{soloUMIposition}
\optValue{-}
\optLine{string: position of the UMI on the barcode read, same format as soloCBposition}
\optLine{Example: inDrop (Zilionis et al, Nat. Protocols, 2017):}
\optLine{--soloUMIposition 3{\textunderscore}9{\textunderscore}3{\textunderscore}14}
\optName{soloAdapterSequence}
\optValue{-}
\optLine{string: adapter sequence to anchor barcodes.}
\optName{soloAdapterMismatchesNmax}
\optValue{1}
\optLine{int{\textgreater}0: maximum number of mismatches allowed in adapter sequence.}
\optName{soloCBmatchWLtype}
\optValue{1MM{\textunderscore}multi}
\optLine{string: matching the Cell Barcodes to the WhiteList}
\begin{optOptTable}
\optOpt{Exact} \optOptLine{only exact matches allowed}
\optOpt{1MM} \optOptLine{only one match in whitelist with 1 mismatched base allowed. Allowed CBs have to have at least one read with exact match.}
\optOpt{1MM{\textunderscore}multi} \optOptLine{multiple matches in whitelist with 1 mismatched base allowed, posterior probability calculation is used to choose one of the matches.}
\end{optOptTable}
\optLine{Allowed CBs have to have at least one read with exact match. Similar to CellRanger 2.2.0}
\begin{optOptTable}
\optOpt{1MM{\textunderscore}multi{\textunderscore}pseudocounts} \optOptLine{same as 1MM{\textunderscore}multi, but pseudocounts of 1 are added to all whitelist barcodes.}
\end{optOptTable}
\optLine{Similar to CellRanger 3.x.x}
\optName{soloStrand}
\optValue{Forward}
\optLine{string: strandedness of the solo libraries:}
......@@ -828,6 +875,7 @@
\optOpt{Gene} \optOptLine{genes: reads match the gene transcript}
\optOpt{SJ} \optOptLine{splice junctions: reported in SJ.out.tab}
\optOpt{GeneFull} \optOptLine{full genes: count all reads overlapping genes' exons and introns}
\optOpt{Transcript3p} \optOptLine{quantification of transcript for 3' protocols}
\end{optOptTable}
\optName{soloUMIdedup}
\optValue{1MM{\textunderscore}All}
......@@ -835,17 +883,25 @@
\begin{optOptTable}
\optOpt{1MM{\textunderscore}All} \optOptLine{all UMIs with 1 mismatch distance to each other are collapsed (i.e. counted once)}
\optOpt{1MM{\textunderscore}Directional} \optOptLine{follows the "directional" method from the UMI-tools by Smith, Heger and Sudbery (Genome Research 2017).}
\optOpt{1MM{\textunderscore}NotCollapsed} \optOptLine{UMIs with 1 mismatch distance to others are not collapsed (i.e. all counted)}
\optOpt{Exact} \optOptLine{only exactly matching UMIs are collapsed}
\end{optOptTable}
\optName{soloUMIfiltering}
\optValue{-}
\optLine{string(s) type of UMI filtering}
\begin{optOptTable}
\optOpt{-} \optOptLine{basic filtering: remove UMIs with N and homopolymers (similar to CellRanger 2.2.0)}
\optOpt{MultiGeneUMI} \optOptLine{remove lower-count UMIs that map to more than one gene (introduced in CellRanger 3.x.x)}
\end{optOptTable}
\optName{soloOutFileNames}
\optValue{Solo.out/ genes.tsv barcodes.tsv matrix.mtx matrixSJ.mtx matrixGeneFull.mtx}
\optLine{string(s) file names for STARsolo output}
\begin{optOptTable}
\optOpt{1st word} \optOptLine{file name prefix}
\optOpt{2nd word} \optOptLine{gene IDs and names}
\optOpt{3rd word} \optOptLine{barcode sequences}
\optOpt{4th word} \optOptLine{cell/Gene counts matrix}
\optOpt{5th word} \optOptLine{cell/SJ counts matrix}
\optOpt{6th word} \optOptLine{cell/GeneFull counts matrix}
\optValue{Solo.out/ features.tsv barcodes.tsv matrix.mtx}
\optLine{string(s) file names for STARsolo output:}
\optLine{file{\textunderscore}name{\textunderscore}prefix gene{\textunderscore}names barcode{\textunderscore}sequences cell{\textunderscore}feature{\textunderscore}count{\textunderscore}matrix}
\optName{soloCellFilter}
\optValue{CellRanger2.2 3000 0.99 10}
\optLine{string(s): cell filtering type and parameters}
\begin{optOptTable}
\optOpt{CellRanger2.2} \optOptLine{simple filtering of CellRanger 2.2, followed by three numbers: number of expected cells, robust maximum percentile for UMI count, maximum to minimum ratio for UMI count}
\optOpt{TopCells} \optOptLine{only report top cells by UMI count, followed by the exact number of cells}
\optOpt{None} \optOptLine{do not output filtered cells}
\end{optOptTable}
\end{optTable}
......@@ -2,7 +2,7 @@ FROM debian:stretch-slim
MAINTAINER dobin@cshl.edu
ARG STAR_VERSION=2.7.2b
ARG STAR_VERSION=2.7.3a
ENV PACKAGES gcc g++ make wget zlib1g-dev unzip
......
#ifndef H_AlignVsTranscript
#define H_AlignVsTranscript
namespace AlignVsTranscript
{
enum {Intron=0, ExonIntron=1, ExonIntronSpan=2, Concordant=3, N=4};
};
#endif
\ No newline at end of file
......@@ -2,8 +2,9 @@
#include "ErrorWarning.h"
#include "serviceFuns.cpp"
#include "BAMfunctions.h"
#include "SequenceFuns.h"
void BAMbinSortByCoordinate(uint32 iBin, uint binN, uint binS, uint nThreads, string dirBAMsort, Parameters &P, Genome &mapGen) {
void BAMbinSortByCoordinate(uint32 iBin, uint binN, uint binS, uint nThreads, string dirBAMsort, Parameters &P, Genome &mapGen, Solo &solo) {
if (binS==0) return; //nothing to do for empty bins
//allocate arrays
......@@ -61,9 +62,15 @@ void BAMbinSortByCoordinate(uint32 iBin, uint binN, uint binS, uint nThreads, st
outBAMwriteHeader(bgzfBin,P.samHeaderSortedCoord,mapGen.chrNameAll,mapGen.chrLengthAll);
//send ordered aligns to bgzf one-by-one
char bam1[BAM_ATTR_MaxSize];//temp array
for (uint ia=0;ia<binN;ia++) {
char* ib=bamIn+startPos[ia*3+2];
bgzf_write(bgzfBin,ib, *((uint32*) ib)+sizeof(uint32) );
char* bam0=bamIn+startPos[ia*3+2];
uint32 size0=*((uint32*) bam0)+sizeof(uint32);
if (solo.pSolo.samAttrYes)
solo.soloFeat[solo.pSolo.samAttrFeature]->addBAMtags(bam0,size0,bam1);
bgzf_write(bgzfBin, bam0, size0);
};
bgzf_flush(bgzfBin);
......
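The loop above relies on each BAM record starting with a 32-bit `block_size` field that does not count itself, so the full record length is `block_size + sizeof(uint32)`. A minimal sketch of walking such a buffer follows. `recordOffsets` is a hypothetical helper, and it uses `memcpy` rather than the pointer cast in the diff to avoid unaligned reads. Like the original code, it assumes the host byte order matches the little-endian BAM layout.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: byte offsets at which each complete record starts.
std::vector<std::size_t> recordOffsets(const char* buf, std::size_t bufSize) {
    std::vector<std::size_t> offsets;
    std::size_t pos = 0;
    while (pos + sizeof(uint32_t) <= bufSize) {
        uint32_t blockSize;
        std::memcpy(&blockSize, buf + pos, sizeof(blockSize));      // alignment-safe read
        const std::size_t fullSize = blockSize + sizeof(uint32_t);  // payload + size field
        if (pos + fullSize > bufSize) break;                        // truncated record: stop
        offsets.push_back(pos);
        pos += fullSize;
    }
    return offsets;
}
```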
......@@ -3,9 +3,10 @@
#include "IncludeDefine.h"
#include "Parameters.h"
#include "Genome.h"
#include "Solo.h"
#include SAMTOOLS_BGZF_H
void BAMbinSortByCoordinate(uint32 iBin, uint binN, uint binS, uint nThreads, string dirBAMsort, Parameters &P, Genome &mapGen);
void BAMbinSortByCoordinate(uint32 iBin, uint binN, uint binS, uint nThreads, string dirBAMsort, Parameters &P, Genome &mapGen, Solo &solo);
#endif
\ No newline at end of file
......@@ -2,7 +2,7 @@
#include "ErrorWarning.h"
#include "BAMfunctions.h"
void BAMbinSortUnmapped(uint32 iBin, uint nThreads, string dirBAMsort, Parameters &P, Genome &mapGen) {
void BAMbinSortUnmapped(uint32 iBin, uint nThreads, string dirBAMsort, Parameters &P, Genome &mapGen, Solo &solo) {
BGZF *bgzfBin;
bgzfBin=bgzf_open((dirBAMsort+"/b"+to_string((uint) iBin)).c_str(),("w"+to_string((long long) P.outBAMcompression)).c_str());
......@@ -37,24 +37,35 @@ void BAMbinSortUnmapped(uint32 iBin, uint nThreads, string dirBAMsort, Parameter
bamInStream[it].read(bamIn[it],sizeof(int32));//read BAM record size
if (bamInStream[it].good()) {
bamSize[it]=((*(uint32*)bamIn[it])+sizeof(int32));//true record size +=4 (4 bytes for uint-iRead)
bamInStream[it].read(bamIn[it]+sizeof(int32),bamSize.at(it)-sizeof(int32)+sizeof(uint));//read the rest of the record, including last uint = iRead
startPos[*(uint*)(bamIn[it]+bamSize.at(it))]=it;//startPos[iRead]=it : record the order of the files to output
bamInStream[it].read(bamIn[it]+sizeof(int32),bamSize.at(it)-sizeof(int32)+sizeof(uint64));//read the rest of the record, including last uint = iRead
uint64 iRead=*(uint*)(bamIn[it]+bamSize.at(it));
iRead = iRead >> 32; //iRead is recorded in top 32bits
startPos[iRead]=it;//startPos[iRead]=it : record the order of the files to output
} else {//nothing to do here, file is empty, do not record it
};
};
//send ordered aligns to bgzf one-by-one
char bam1[BAM_ATTR_MaxSize];//temp array
while (startPos.size()>0) {
uint it=startPos.begin()->second;
uint startNext=startPos.size()>1 ? (++startPos.begin())->first : (uint) -1;
while (true) {
bgzf_write(bgzfBin, bamIn[it], bamSize.at(it));
//add extra tags to the BAM record
char* bam0=bamIn[it];
uint32 size0=bamSize.at(it);
if (solo.pSolo.samAttrYes)
solo.soloFeat[solo.pSolo.samAttrFeature]->addBAMtags(bam0,size0,bam1);
bgzf_write(bgzfBin, bam0, size0);
bamInStream[it].read(bamIn[it],sizeof(int32));//read record size
if (bamInStream[it].good()) {
bamSize[it]=((*(uint32*)bamIn[it])+sizeof(int32));
bamInStream[it].read(bamIn[it]+sizeof(int32),bamSize.at(it)-sizeof(int32)+sizeof(uint));//read the rest of the record, including la$
uint iRead=*(uint*)(bamIn[it]+bamSize.at(it));
bamInStream[it].read(bamIn[it]+sizeof(int32),bamSize.at(it)-sizeof(int32)+sizeof(uint));//read the rest of the record, including
uint64 iRead=*(uint*)(bamIn[it]+bamSize.at(it));
iRead = iRead >> 32; //iRead is recorded in top 32bits
if (iRead>startNext) {//this read from this chunk is > than a read from another chunk
startPos[iRead]=it;
break;
......
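The new code recovers `iRead` by shifting the trailing 64-bit word right by 32 bits, since the read index is stored in the top half. The sketch below shows that packing in isolation. The function names are hypothetical, and the diff does not say what the low 32 bits hold, so they are called `extra` here.

```cpp
#include <cstdint>

// Hypothetical layout: read index in the top 32 bits, unspecified payload below.
inline uint64_t packReadIndex(uint32_t iRead, uint32_t extra) {
    return (static_cast<uint64_t>(iRead) << 32) | extra;
}

// Mirrors "iRead = iRead >> 32" from the diff: drop the low 32 bits.
inline uint32_t unpackReadIndex(uint64_t packed) {
    return static_cast<uint32_t>(packed >> 32);
}
```

Because the read index occupies the most significant bits, comparing two packed words orders them by read index first, whatever the low bits contain.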
......@@ -3,9 +3,10 @@
#include "IncludeDefine.h"
#include "Parameters.h"
#include "Genome.h"
#include "Solo.h"
#include SAMTOOLS_BGZF_H
void BAMbinSortUnmapped(uint32 iBin, uint nThreads, string dirBAMsort, Parameters &P, Genome &mapGen);
void BAMbinSortUnmapped(uint32 iBin, uint nThreads, string dirBAMsort, Parameters &P, Genome &mapGen, Solo &solo);
#endif
......@@ -89,22 +89,100 @@ void outBAMwriteHeader (BGZF* fp, const string &samh, const vector <string> &chr
bgzf_flush(fp);
};
template <class TintType>
TintType bamAttributeInt(const char *bamAux, const char *attrName) {//not tested!!!
const char *attrStart=strstr(bamAux,attrName);
if (attrStart==NULL) return (TintType) -1;
switch (attrStart[2]) {
case ('c'):
return (TintType) *(int8_t*)(attrStart+3);
case ('s'):
return (TintType) *(int16_t*)(attrStart+3);
case ('i'):
return (TintType) *(int32_t*)(attrStart+3);
case ('C'):
return (TintType) *(uint8_t*)(attrStart+3);
case ('S'):
return (TintType) *(uint16_t*)(attrStart+3);
case ('I'):
return (TintType) *(uint32_t*)(attrStart+3);
// calculate bin given an alignment covering [beg,end) (zero-based, half-closed-half-open)
int reg2bin(int beg, int end)
{
--end;
if (beg>>14 == end>>14) return ((1<<15)-1)/7 + (beg>>14);
if (beg>>17 == end>>17) return ((1<<12)-1)/7 + (beg>>17);
if (beg>>20 == end>>20) return ((1<<9)-1)/7 + (beg>>20);
if (beg>>23 == end>>23) return ((1<<6)-1)/7 + (beg>>23);
if (beg>>26 == end>>26) return ((1<<3)-1)/7 + (beg>>26);
return 0;
};
int bamAttrArrayWrite(int32 attr, const char* tagName, char* attrArray ) {
attrArray[0]=tagName[0];attrArray[1]=tagName[1];
attrArray[2]='i';
*( (int32*) (attrArray+3))=attr;
return 3+sizeof(int32);
};
int bamAttrArrayWrite(float attr, const char* tagName, char* attrArray ) {
attrArray[0]=tagName[0];attrArray[1]=tagName[1];
attrArray[2]='f';
*( (float*) (attrArray+3))=attr;
return 3+sizeof(int32);
};
int bamAttrArrayWrite(char attr, const char* tagName, char* attrArray ) {
attrArray[0]=tagName[0];attrArray[1]=tagName[1];
attrArray[2]='A';
attrArray[3]=attr;
return 3+sizeof(char);
};
int bamAttrArrayWrite(string &attr, const char* tagName, char* attrArray ) {
attrArray[0]=tagName[0];attrArray[1]=tagName[1];
attrArray[2]='Z';
memcpy(attrArray+3,attr.c_str(),attr.size()+1);//copy string data including \0
return 3+attr.size()+1;
};
int bamAttrArrayWrite(const vector<char> &attr, const char* tagName, char* attrArray ) {
attrArray[0]=tagName[0];attrArray[1]=tagName[1];
attrArray[2]='B';
attrArray[3]='c';
*( (int32*) (attrArray+4))=attr.size();
memcpy(attrArray+4+sizeof(int32),attr.data(),attr.size());//copy array data
return 4+sizeof(int32)+attr.size();
};
int bamAttrArrayWrite(const vector<int32> &attr, const char* tagName, char* attrArray ) {
attrArray[0]=tagName[0];attrArray[1]=tagName[1];
attrArray[2]='B';
attrArray[3]='i';
*( (int32*) (attrArray+4))=attr.size();
memcpy(attrArray+4+sizeof(int32),attr.data(),sizeof(int32)*attr.size());//copy array data
return 4+sizeof(int32)+sizeof(int32)*attr.size();
};
int bamAttrArrayWriteSAMtags(string &attrStr, char *attrArray) {//write BAM attributes into attrArray from the tab-separated SAM tag string attrStr
size_t pos1=0, pos2=0;
int nattr=0;
do {//cycle over multiple tags separated by tab
pos2 = attrStr.find('\t',pos1);
string attr1 = attrStr.substr(pos1, pos2-pos1);
pos1=pos2+1;
if (attr1.empty())
continue; //extra tab at the beginning, or consecutive tabs
switch (attr1.at(3)) {
case 'i':
{
int32 a1=stol(attr1.substr(5));
nattr += bamAttrArrayWrite(a1,attr1.c_str(),attrArray+nattr);
break;
};
case 'A':
{
char a1=attr1.at(5);
nattr += bamAttrArrayWrite(a1,attr1.c_str(),attrArray+nattr);
break;
};
break;
case 'Z':
{
string a1=attr1.substr(5);
nattr += bamAttrArrayWrite(a1,attr1.c_str(),attrArray+nattr);
break;
};
case 'f':
{
float a1=stof(attr1.substr(5));
nattr += bamAttrArrayWrite(a1,attr1.c_str(),attrArray+nattr);
break;
};
};
} while (pos2!= string::npos);
return nattr;
};
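The `bamAttrArrayWrite` overloads above all emit the same binary layout: a two-character tag name, a one-character type code, then the value. As a rough illustration, the sketch below converts a single SAM text field such as `NH:i:5` or `CB:Z:ACGT` into that layout. The function names are hypothetical, only the `i` and `Z` types are handled, and `memcpy` replaces the pointer casts used in the diff.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Encode an int32 tag: 2-byte name, type 'i', 4-byte value. Returns bytes written.
int writeTagInt32(int32_t value, const char* tagName, char* out) {
    out[0] = tagName[0]; out[1] = tagName[1];
    out[2] = 'i';
    std::memcpy(out + 3, &value, sizeof(value));
    return 3 + static_cast<int>(sizeof(value));
}

// Encode a string tag: 2-byte name, type 'Z', characters, terminating NUL.
int writeTagString(const std::string& value, const char* tagName, char* out) {
    out[0] = tagName[0]; out[1] = tagName[1];
    out[2] = 'Z';
    std::memcpy(out + 3, value.c_str(), value.size() + 1);
    return 3 + static_cast<int>(value.size()) + 1;
}

// Encode one SAM text field "TG:T:value". Returns 0 for malformed or unsupported input.
// Note: std::stol throws if an 'i' field does not hold a number.
int writeSamField(const std::string& field, char* out) {
    if (field.size() < 5 || field[2] != ':' || field[4] != ':') return 0;
    const std::string value = field.substr(5);
    switch (field[3]) {
        case 'i': return writeTagInt32(static_cast<int32_t>(std::stol(value)), field.c_str(), out);
        case 'Z': return writeTagString(value, field.c_str(), out);
        default:  return 0;
    }
}
```

An `NH:i:5` field therefore occupies 7 bytes, and a `Z` field occupies its string length plus 4.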
......@@ -4,7 +4,76 @@
#include "IncludeDefine.h"
#include SAMTOOLS_BGZF_H
#include SAMTOOLS_SAM_H
#include "ErrorWarning.h"
void outBAMwriteHeader (BGZF* fp, const string &samh, const vector <string> &chrn, const vector <uint> &chrl);
int bam_read1_fromArray(char *bamChar, bam1_t *b);
string bam_cigarString (bam1_t *b);
int reg2bin(int beg, int end);
int bamAttrArrayWrite(int32 attr, const char* tagName, char* attrArray );
int bamAttrArrayWrite(float attr, const char* tagName, char* attrArray );
int bamAttrArrayWrite(char attr, const char* tagName, char* attrArray );
int bamAttrArrayWrite(string &attr, const char* tagName, char* attrArray );
int bamAttrArrayWrite(const vector<char> &attr, const char* tagName, char* attrArray );
int bamAttrArrayWrite(const vector<int32> &attr, const char* tagName, char* attrArray );
int bamAttrArrayWriteSAMtags(string &attrStr, char *attrArray);
template <class TintType>
TintType bamAttributeInt(const char *bamAux, const char *attrName) {//not tested!!!
const char *attrStart=strstr(bamAux,attrName);
if (attrStart==NULL) return (TintType) -1;
switch (attrStart[2]) {
case ('c'):
return (TintType) *(int8_t*)(attrStart+3);
case ('s'):
return (TintType) *(int16_t*)(attrStart+3);
case ('i'):
return (TintType) *(int32_t*)(attrStart+3);
case ('C'):
return (TintType) *(uint8_t*)(attrStart+3);
case ('S'):
return (TintType) *(uint16_t*)(attrStart+3);
case ('I'):
return (TintType) *(uint32_t*)(attrStart+3);
};
};
template <typename intType>
int bamAttrArrayWriteInt(intType xIn, const char* tagName, char* attrArray, Parameters &P) {//adapted from samtools
attrArray[0]=tagName[0];attrArray[1]=tagName[1];
#define ATTR_RECORD_INT(_intChar,_intType,_intValue) attrArray[2] = _intChar; *(_intType*)(attrArray+3) = (_intType) _intValue; return 3+sizeof(_intType)
int64 x = (int64) xIn;
if (x < 0) {
if (x >= -127) {
ATTR_RECORD_INT('c',int8_t,x);
} else if (x >= -32767) {
ATTR_RECORD_INT('s',int16_t,x);
} else {
if (!(x>=-2147483647)) {//check the range first: ATTR_RECORD_INT returns, so a check placed after it would never run
ostringstream errOut;
errOut <<"EXITING because of FATAL BUG: integer out of range for BAM conversion: "<< x <<"\n";
errOut <<"SOLUTION: contact Alex Dobin at dobin@cshl.edu\n";
exitWithError(errOut.str(), std::cerr, P.inOut->logMain, EXIT_CODE_BUG, P);
};
ATTR_RECORD_INT('i',int32_t,x);
};
} else {
if (x <= 255) {
ATTR_RECORD_INT('C',uint8_t,x);
} else if (x <= 65535) {
ATTR_RECORD_INT('S',uint16_t,x);
} else {
if (!(x<=4294967295)) {//check the range first: ATTR_RECORD_INT returns, so a check placed after it would never run
ostringstream errOut;
errOut <<"EXITING because of FATAL BUG: integer out of range for BAM conversion: "<< x <<"\n";
errOut <<"SOLUTION: contact Alex Dobin at dobin@cshl.edu\n";
exitWithError(errOut.str(), std::cerr, P.inOut->logMain, EXIT_CODE_BUG, P);
};
ATTR_RECORD_INT('I',uint32_t,x);
};
};
};
#endif
\ No newline at end of file
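`bamAttrArrayWriteInt` picks the narrowest BAM integer type that can hold the value. The selection logic can be isolated as in the sketch below. `bamIntTypeCode` is a hypothetical name, and unlike the diff (which uses -127 and -32767 as lower bounds) this version uses the full range of each type and reports out-of-range values with a `'\0'` return instead of exiting.

```cpp
#include <cstdint>

// Hypothetical helper: BAM type code of the narrowest integer type that holds x,
// or '\0' if x does not fit in 32 bits.
char bamIntTypeCode(int64_t x) {
    if (x < 0) {
        if (x >= INT8_MIN)  return 'c';  // int8_t
        if (x >= INT16_MIN) return 's';  // int16_t
        if (x >= INT32_MIN) return 'i';  // int32_t
        return '\0';
    }
    if (x <= UINT8_MAX)  return 'C';     // uint8_t
    if (x <= UINT16_MAX) return 'S';     // uint16_t
    if (x <= UINT32_MAX) return 'I';     // uint32_t
    return '\0';
}
```

Doing the range test before any bytes are written keeps the error path reachable, which matters when the write step itself returns.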
......@@ -93,12 +93,6 @@ void BAMoutput::coordOneAlign (char *bamIn, uint bamSize, uint iRead) {
};
};
// if ( alignG == (uint32) -1 ) {//unmapped alignment, last bin
// iBin=nBins-1;
// } else {
// iBin=(alignG + chrStart)/binGlen;
// };
//write buffer is filled
if (binBytes[iBin]+bamSize+sizeof(uint) > ( (iBin>0 || nBins>1) ? binSize : binSize1) ) {//write out this buffer
if ( nBins>1 || iBin==(P.outBAMcoordNbins-1) ) {//normal writing, bins have already been determined
......