**This version requires re-generation of the genome indexes**
* Implemented --soloFeatures GeneFull which counts reads overlapping full genes, i.e. includes reads that overlap introns. This can be combined with other features, e.g. --soloFeatures Gene SJ GeneFull.
* Implemented --soloCBwhitelist None option for solo* demultiplexing without CB whitelist. In this case error correction for CBs is not performed.
* Implemented Cell Barcodes longer than 16 bases (but shorter than 31 bases). Many thanks to Gert Hulselmans for implementing this feature (#588).
* Implemented collapsing of duplicate cell barcodes in the whitelist.
* Implemented --sjdbGTFtagExonParentGeneName and --sjdbGTFtagExonParentGeneType options to load gene name and biotype attributes from the GTF file.
* Fixed problems created by missing gene/transcript ID, name and biotype attributes in GTF files (issues #613, #628).
* Added warning for incorrectly scaled --genomeSAindexNbases parameter (issue #614).
* Added numbers of unmapped reads to the Log.final.out file (pull #622).
* Fixed a problem which may cause seg-faults for reads with many blocks (issue #342).
STAR 2.7.0f 2019/03/28
======================
* Fixed a problem in STARsolo with empty Unmapped.out.mate2 file. Issue #593.
* Fixed a problem with CR CY UR UQ SAM tags in solo output. Issue #593.
* Fixed problems with STARsolo and 2-pass.
STAR 2.7.0e 2019/02/25
======================
* Fixed problems with --quantMode GeneCounts and --parametersFiles options
STAR 2.7.0d 2019/02/19
======================
* Implemented --soloBarcodeReadLength option for barcode read length not equal to the UMI+CB length
* Enforced genome version rules for 2.7.0
STAR 2.7.0c 2019/02/08
======================
* This release is compiled with gcc-4.8.5, and requires at least gcc-4.8.5
* Fixed another problem in STARsolo genes.tsv output.
* Replaced tabs with spaces in STARsolo matrix.mtx output
* #559, #562 Fixed compilation problems.
* #550 (again, previous merge failed): Added correct header for the STARsolo matrix.mtx file, needed for python scipy mmread compatibility.
STAR 2.7.0b 2019/02/05
======================
* #550: Added correct header for the STARsolo matrix.mtx file, needed for python scipy mmread compatibility.
* #556: Fixed a problem with the STARsolo genes.tsv file, which may also cause problems with GTF file processing.
* Important: 2.7.0x releases require re-generation of the genome index.
STAR 2.7.0a 2019/01/23
======================
* This release introduces STARsolo for: mapping, demultiplexing and gene quantification for single cell RNA-seq.
@@ -74,7 +74,6 @@ make STARforMacStatic CXX=/path/to/gcc
```
If you run STAR only on a single machine or a homogeneously set up cluster, you can help the compiler optimize in a way that is tailored to your platform. The flags LDFLAGSextra and CXXFLAGSextra are appended to the default optimizations specified in source/Makefile.
```
# platform-specific optimization for gcc/g++
make CXXFLAGSextra=-march=native
...
...
@@ -82,6 +81,14 @@ make CXXFLAGSextra=-march=native
make LDFLAGSextra=-flto CXXFLAGSextra="-flto -march=native"
```
FreeBSD ports
=============
STAR can be installed on FreeBSD via the FreeBSD ports system.
STARsolo is a turnkey solution for analyzing droplet single cell RNA sequencing data (e.g. 10X Genomics Chromium System) built directly into STAR code.
STARsolo inputs the raw FASTQ reads files, and performs the following operations
(i) error correction and demultiplexing of cell barcodes using user-input whitelist
(ii) mapping the reads to the reference genome using the standard STAR spliced read alignment algorithm
(iii) error correction and collapsing (deduplication) of Unique Molecular Identifiers (UMIs)
(iv) quantification of per-cell gene expression by counting the number of reads per gene
* error correction and demultiplexing of cell barcodes using user-input whitelist
* mapping the reads to the reference genome using the standard STAR spliced read alignment algorithm
* error correction and collapsing (deduplication) of Unique Molecular Identifiers (UMIs)
* quantification of per-cell gene expression by counting the number of reads per gene
STARsolo output is designed to be a drop-in replacement for 10X CellRanger gene quantification output.
It follows CellRanger logic for cell barcode whitelisting and UMI deduplication, and produces nearly identical gene counts in the same format.
At the same time, STARsolo is ~10 times faster than CellRanger.
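
The whitelist-based barcode correction mentioned above can be illustrated with a minimal sketch. This is not STAR's implementation: the function name `correct_barcode` is hypothetical, and the sketch simply accepts an exact whitelist match or a unique whitelist entry at one mismatch, without the read-count and posterior-probability logic that CellRanger-style correction applies.

```python
from typing import Optional, Set


def correct_barcode(cb: str, whitelist: Set[str]) -> Optional[str]:
    """Map a raw cell barcode to a whitelist entry, allowing one mismatch.

    Hypothetical helper for illustration only; STAR's own logic is more
    involved (it also weighs how many reads support each candidate).
    """
    # Exact matches need no correction.
    if cb in whitelist:
        return cb

    # Enumerate every barcode that differs from cb at exactly one position.
    candidates = set()
    for i, base in enumerate(cb):
        for alt in "ACGT":
            if alt == base:
                continue
            neighbor = cb[:i] + alt + cb[i + 1:]
            if neighbor in whitelist:
                candidates.add(neighbor)

    # Accept the correction only if it is unambiguous.
    if len(candidates) == 1:
        return next(iter(candidates))
    return None


# Toy whitelist (made-up 4-base barcodes, far shorter than real ones).
toy_whitelist = {"AAAA", "AAAT", "CCCC"}
print(correct_barcode("CCCG", toy_whitelist))  # one mismatch from CCCC
print(correct_barcode("AAAC", toy_whitelist))  # ambiguous: AAAA or AAAT
```

Rejecting ambiguous barcodes outright is the simplest safe choice for a sketch; a production pipeline would instead rank the candidates.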
The STARsolo algorithm is turned on with:
```
...
...
@@ -32,6 +33,8 @@ Importantly, in the --readFilesIn option, the 1st file has to be cDNA read, and
@@ -253,7 +253,7 @@ STAR produces multiple output files. All files have standard name, however, you
\subsection{SAM.}
\ofilen{Aligned.out.sam} - alignments in standard SAM format.
\subsubsection{Multimappers.}
The number of loci \code{Nmap} a read maps to is given by the \code{NH:i:Nmap} field. A value of 1 corresponds to unique mappers, while values \textgreater1 correspond to multi-mappers. The \code{HI} attribute enumerates multiple alignments of a read starting with 1 (this can be changed with the \opt{outSAMattrIHstart} option; setting it to 0 may be required for compatibility with downstream software such as Cufflinks or StringTie).
The number of loci \code{Nmap} a read maps to is given by the \code{NH:i:Nmap} field. A value of 1 corresponds to unique mappers, while values \textgreater1 correspond to multi-mappers. The \code{HI} attribute enumerates multiple alignments of a read starting with 1 (this can be changed with the \opt{outSAMattrIHstart} option; setting it to 0 may be required for compatibility with downstream software such as Cufflinks).
The mapping quality MAPQ (column 5) is 255 for uniquely mapping reads, and int(-10*log10(1-1/Nmap)) for multi-mapping reads. This scheme is the same as the one used by TopHat and is compatible with Cufflinks. The default MAPQ=255 for the unique mappers may be changed with the \opt{outSAMmapqUnique} parameter (integer 0 to 255) to ensure compatibility with downstream tools such as GATK.
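
The MAPQ rule above can be written out as a short Python sketch. The function name `star_mapq` and its `mapq_unique` argument are hypothetical and only mirror the formula and the \opt{outSAMmapqUnique} default quoted in the text; this is not code from STAR itself.

```python
import math


def star_mapq(nmap: int, mapq_unique: int = 255) -> int:
    """MAPQ for a read mapping to nmap loci, following the formula in the text.

    mapq_unique stands in for the value set by --outSAMmapqUnique
    (hypothetical parameter name for this sketch).
    """
    if nmap < 1:
        raise ValueError("nmap must be at least 1")
    if nmap == 1:
        # Unique mappers get the fixed value; the formula is undefined here.
        return mapq_unique
    # int() truncates toward zero, as in the formula int(-10*log10(1-1/Nmap)).
    return int(-10 * math.log10(1 - 1 / nmap))


for n in (1, 2, 3, 4, 5):
    print(n, star_mapq(n))
```

Under this formula, reads mapping to 2 loci get MAPQ 3, reads mapping to 3 or 4 loci get 1, and reads mapping to 5 or more loci get 0.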
...
...
@@ -269,17 +269,35 @@ The \opt{outSAMmultNmax} parameter limits the number of output alignments (SAM l
The SAM attributes can be specified by the user using the \opt{outSAMattributes}\optvr{A1 A2 A3 ...} option, which accepts a list of 2-character SAM attributes. The implemented attributes are: \optv{NH HI NM MD AS nM jM jI XS}. By default, STAR outputs \optv{NH HI AS nM} attributes.
\begin{itemize}
\item[]
\optv{NH HI NM MD} have standard meaning as defined in the SAM format specifications.
\optv{NH HI NM MD}: standard meaning as defined in the SAM format specifications.
\item[]
\optv{AS} is the local alignment score (paired for paired-end reads).
\optv{AS}: the local alignment score (paired for paired-end reads).
\item[]
\optv{nM} is the number of mismatches per (paired) alignment, not to be confused with \optv{NM}, which is the number of mismatches in each mate.
\optv{nM}: the number of mismatches per (paired) alignment, not to be confused with \optv{NM}, which is the number of mismatches in each mate.
\item[]
\optv{jM:B:c,M1,M2,...} intron motifs for all junctions (i.e. N in CIGAR): 0: non-canonical; 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT. If splice junctions database is used, and a junction is annotated, 20 is added to its motif value.
\optv{jM:B:c,M1,M2,...}: intron motifs for all junctions (i.e. N in CIGAR): 0: non-canonical; 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT. If splice junctions database is used, and a junction is annotated, 20 is added to its motif value.
\item[]
\optv{jI:B:I,Start1,End1,Start2,End2,...} Start and End of introns for all junctions (1-based).
\optv{jI:B:I,Start1,End1,Start2,End2,...}: Start and End of introns for all junctions (1-based).
\item[]
\optv{jM jI} attributes require samtools 0.1.18 or later, and were reported to be incompatible with some downstream tools such as Cufflinks.
\optv{jM jI} : attributes require samtools 0.1.18 or later, and were reported to be incompatible with some downstream tools such as Cufflinks.
\item[]
\optv{vA} : variant allele
\item[]
\optv{vG} : genomic coordinate of the variant overlapped by the read
\item[]
\optv{vW} : 0/1 - alignment does not pass / passes WASP filtering. Requires --waspOutputMode SAMtag
\item[]
\optv{CR CY UR UY} : sequences and quality scores of cell barcodes and UMIs for the solo* demultiplexing, not error corrected
\item[]
\optv{uT} : for unmapped reads, reason for not mapping:
\begin{itemize}[noitemsep,topsep=-3pt]
\item[] 0 : no acceptable seed/windows, "Unmapped other" in the Log.final.out
\item[] 1 : best alignment shorter than min allowed mapped length, "Unmapped: too short" in the Log.final.out
\item[] 2 : best alignment has more mismatches than max allowed number of mismatches, "Unmapped: too many mismatches" in the Log.final.out
\item[] 3 : read maps to more loci than the max number of multimapping loci, "Multimapping: mapped to too many loci" in the Log.final.out
\item[] 4 : unmapped mate of a mapped paired-end read
\end{itemize}
\end{itemize}
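
As an illustration of the \optv{jM} encoding listed above, the following Python sketch decodes a \optv{jM} tag string into motif names and annotation flags. The function name `decode_jm` is hypothetical, and the sketch handles only the values documented in the list; any other value raises an error.

```python
from typing import List, Tuple

# Motif codes exactly as listed for the jM attribute.
MOTIFS = {
    0: "non-canonical",
    1: "GT/AG",
    2: "CT/AC",
    3: "GC/AG",
    4: "CT/GC",
    5: "AT/AC",
    6: "GT/AT",
}


def decode_jm(tag: str) -> List[Tuple[str, bool]]:
    """Decode 'jM:B:c,M1,M2,...' into (motif, is_annotated) pairs.

    Hypothetical helper: values outside the documented range are rejected
    rather than interpreted.
    """
    prefix = "jM:B:c,"
    if not tag.startswith(prefix):
        raise ValueError("not a jM tag: " + tag)

    decoded = []
    for field in tag[len(prefix):].split(","):
        value = int(field)
        # Annotated junctions have 20 added to the motif value.
        annotated = value >= 20
        code = value - 20 if annotated else value
        if code not in MOTIFS:
            raise ValueError("undocumented jM value: " + field)
        decoded.append((MOTIFS[code], annotated))
    return decoded


print(decode_jm("jM:B:c,1,22"))
```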
\subsubsection{Compatibility with Cufflinks/Cuffdiff.}
...
...
@@ -522,6 +540,44 @@ Importantly, in the --readFilesIn option, the 1st FASTQ file has to be cDNA read
Other solo* options can be found in the Section \ref{STARsolo_(single_cell_RNA-seq)_parameters}.
\subsection{Feature statistics summaries.}
Feature statistics summaries are recorded in the \optvr{Solo.out/} directory in files \optvr{<Feature>.stats} where features are those used in the \opt{soloFeatures} option, e.g. \optvr{Gene.stats}. The following metrics are recorded:
\begin{itemize}[leftmargin=1.5in]
\itemsep -0.3em
\item[\optv{nNinBarcode:}] number of reads with more than 2 Ns in cell barcode (CB)
\item[\optv{nUMIhomopolymer:}] number of reads with homopolymer in CB
\item[\optv{nTooMany:}] not used at the moment
\item[\optv{nNoMatch:}] number of reads with CBs that do not match whitelist even with one mismatch
\end{itemize}
All of the above reads are discarded from Solo output. Remaining reads are checked for overlap with features (e.g. genes):
\begin{itemize}[leftmargin=2in]
\itemsep -0.3em
\item[\optv{nUnmapped:}] number of reads unmapped to the genome
\item[\optv{nNoFeature:}] number of reads that map to the genome but do not belong to a feature
\item[\optv{nAmbigFeature:}] number of reads that belong to more than one feature
\item[\optv{nAmbigFeatureMultimap:}] number of reads that belong to more than one feature and are also multimapping to the genome (this is a subset of the nAmbigFeature)
\item[\optv{nTooMany:}] number of reads with ambiguous CB (i.e. CB matches whitelist with one mismatch but with posterior probability <0.95)
\item[\optv{nNoExactMatch:}] number of reads with CB that matches a whitelist barcode with 1 mismatch, but this whitelist barcode does not get any other reads with exact matches of CB
\end{itemize}
All of the reads above are output in feature (e.g. gene) / cell count matrices.
\begin{itemize}[leftmargin=1.5in]
\itemsep -0.3em
\item[\optv{nExactMatch:}] number of reads with CB that match the whitelist exactly
\item[\optv{nMatch:}] total number of reads that match CB with 0 or 1 mismatches (this is a superset of nExactMatch)
\item[\optv{nCellBarcodes:}] number of distinct CBs detected
\item[\optv{nUMIs:}] number of distinct UMIs detected
\end{itemize}
These metrics can be grouped into broader categories:
\begin{itemize}
\itemsep -0.3em
\item[]\optv{nNinBarcode+nUMIhomopolymer+nNoMatch+nTooMany+nNoExactMatch} = number of reads with CBs that do not match whitelist.
\item[]\optv{nUnmapped+nAmbigFeature} = number of reads without defined feature (gene)
\item[]\optv{nMatch} = number of reads that are output as solo counts
\end{itemize}
The three categories above, summed together, should be equal to the total number of reads.
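
The grouping above can be sketched in Python as follows. The function name `group_solo_stats` is hypothetical, the input is a plain dictionary rather than a parsed \optvr{<Feature>.stats} file (the file layout is not described here), and the numbers in the example are made up for illustration.

```python
from typing import Dict

# Metrics that make up the "CB does not match whitelist" category in the text.
NO_CB_KEYS = ("nNinBarcode", "nUMIhomopolymer", "nNoMatch",
              "nTooMany", "nNoExactMatch")


def group_solo_stats(stats: Dict[str, int]) -> Dict[str, int]:
    """Collapse per-metric read counts into the three broad categories."""
    no_cb = sum(stats[key] for key in NO_CB_KEYS)
    no_feature = stats["nUnmapped"] + stats["nAmbigFeature"]
    counted = stats["nMatch"]
    return {
        "no_whitelist_cb": no_cb,
        "no_defined_feature": no_feature,
        "solo_counts": counted,
        # Per the text, this sum should equal the total number of reads.
        "total": no_cb + no_feature + counted,
    }


# Made-up numbers, for illustration only.
example = {
    "nNinBarcode": 10, "nUMIhomopolymer": 5, "nNoMatch": 100,
    "nTooMany": 2, "nNoExactMatch": 3,
    "nUnmapped": 200, "nAmbigFeature": 80,
    "nMatch": 9600,
}
print(group_solo_stats(example))
```

Comparing the computed total against the number of input reads reported elsewhere in the run is a quick sanity check on the statistics.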
\section{Description of all options.}\label{Description_of_all_options}
For each STAR version, the most up-to-date information about all STAR parameters can be found in the \code{parametersDefault} file in the STAR source directory. The parameters in the \code{parametersDefault}, as well as in the descriptions below, are grouped by function: