New upstream version 1.5.2+dfsg

parent e3249b5c
......@@ -14,55 +14,56 @@ For FreeBSD OS, use command:
gmake -f Makefile.FreeBSD
For OpenSolaris/Oracle Solaris OS, use command:
gmake -f Makefile.SunOS
If the build is successful, a new directory called 'bin' will be created under the home directory of the package (ie. one level up from 'src' directory). The 'bin directory contains all the generated executables. To enable easy access to these executables, you may copy the executables to a system directory such as '/usr/bin' or add the path to the executables to your search path (add path to your environment variable `PATH').
Content
--------------
annotation Directory including NCBI RefSeq gene annotations for genomes 'hg19', 'mm10' and 'mm9'.
Each row is an exon. Entrez gene identifiers and chromosomal coordinates are provided for each exon.
bin Directory including executables after compilation (or directly available from a binary release).
doc Directory including the users manual.
LICENSE The license agreement for using this package.
README.txt This file.
src Directory including source code (binary releases do not have this directory).
test Directory including test data and scripts.
annotation Directory including NCBI RefSeq gene annotations for genomes 'hg19', 'hg38', 'mm10' and 'mm9'.
Each row is an exon. Entrez gene identifiers and chromosomal coordinates are provided for each exon.
bin Directory including executables after compilation (or directly available from a binary release).
doc Directory including the users manual.
LICENSE The license agreement for using this package.
README.txt This file.
src Directory including source code (binary releases do not have this directory).
test Directory including test data and scripts.
A Quick Start
--------------
An index should be built before carrying out read alignments:
Build index for a reference genome:
subread-buildindex -o my_index chr1.fa chr2.fa ...
(You may provide a single FASTA file including all chromosomal sequences).
subread-buildindex -o my_index chr1.fa chr2.fa ...
(You may provide a single FASTA file including all chromosomal sequences).
Align a single-end RNA-seq dataset to the reference genome:
With built index, you can now align reads to the reference genome. Align single-end reads:
subread-align -i my_index -r reads.txt -t 0 -o subread_results.bam
subread-align -i my_index -r reads.txt -o subread_results.sam
Align a paired-end genomic DNA-seq dataset to the reference genome:
Align paired-end reads:
subread-align -i my_index -r reads1.txt -R reads2.txt -t 1 -o subread_results_PE.bam
subread-align -i my_index -r reads1.txt -R reads2.txt -o subread_results_PE.sam
Detect exon-exon junctions from a paired-end RNA-seq dataset (read mapping results are also produced):
Detect exon-exon junctions from RNA-seq data (read mapping results are also generated):
subjunc -i my_index -r reads1.txt -R reads2.txt -o subjunc_results.bam
subjunc -i my_index -r reads1.txt -R reads2.txt -o subjunc_results.sam
Assign mapped RNA-seq reads to mm10 genes using inbuilt annotation:
Assign mapped reads to genomic features (eg. genes):
featureCounts -a ../annotation/mm10_RefSeq_exon.txt -F 'SAF' -o counts.txt subread_results.bam
featureCounts -a annotation.gtf -o counts.txt subread_results.sam
Assign mapped RNA-seq reads to hg38 genes using a public GTF annotation:
featureCounts -a hg38_annotation.gtf -o counts.txt subread_results.bam
Tutorials
-------------------
A short tutorial for Subread - http://bioinf.wehi.edu.au/subread
A short tutorial for Subjunc - http://bioinf.wehi.edu.au/subjunc
A short tutorial for featureCounts - http://bioinf.wehi.edu.au/featureCounts
A short tutorial for exactSNP - http://bioinf.wehi.edu.au/exactSNP
Users Guide
--------------
Users Guide can be found in the 'doc' subdirectory. It provides comprehensive descriptions to the programs included in this package.
Users Guide can be found in the 'doc' subdirectory of this software package or via URL (http://bioinf.wehi.edu.au/subread-package/SubreadUsersGuide.pdf).
Citation
--------------
......@@ -72,8 +73,9 @@ Liao Y, Smyth GK and Shi W. The Subread aligner: fast, accurate and scalable rea
If you use the featureCounts program, please cite:
Liao Y, Smyth GK and Shi W. featureCounts: an efficient general-purpose program for assigning sequence reads to genomic features. Bioinformatics, 2013. doi: 10.1093/bioinformatics/btt656
Liao Y, Smyth GK and Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics, 30(7):923-30, 2014
Get help
Mailing lists
--------------
You may subscribe to the SeqAnswer forum (http://www.seqanswers.com) or the Bioconductor mailing list (http://bioconductor.org/) to get help. Alternatively, you may directly contact Wei Shi (shi at wehi dot edu dot au) or Yang Liao (liao at wehi dot edu dot au) for help.
Please post your questions/suggestions to Bioconductor support site(https://support.bioconductor.org/) or Subread google group (https://groups.google.com/forum/#!forum/subread).
......@@ -35,9 +35,9 @@
\begin{center}
{\Huge\bf Subread/Rsubread Users Guide}\\
\vspace{1 cm}
{\centering\large Subread v1.5.1/Rsubread v1.23.4\\}
{\centering\large Subread v1.5.2/Rsubread v1.24.2\\}
\vspace{1 cm}
\centering 25 August 2016\\
\centering 15 March 2017\\
\vspace{5 cm}
\Large Wei Shi and Yang Liao\\
\vspace{1 cm}
......@@ -47,7 +47,7 @@ The Walter and Eliza Hall Institute of Medical Research\\
The University of Melbourne\\
Melbourne, Australia\\}
\vspace{7 cm}
\centering Copyright \small{\copyright} 2011 - 2016\\
\centering Copyright \small{\copyright} 2011 - 2017\\
\end{center}
\end{titlepage}
......@@ -94,17 +94,17 @@ These software programs support a variety of sequencing platforms including Illu
\section{Citation}
If you use {\Subread} or {\Subjunc} aligners, please cite:\\
If you use {\Subread} or {\Subjunc} aligners, please cite:
\begin{quote}
Liao Y, Smyth GK and Shi W. The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Research, 41(10):e108, 2013
Liao Y, Smyth GK and Shi W (2013). The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. \emph{Nucleic Acids Research}, 41(10):e108.
\\
{\color{blue}{\url{http://www.ncbi.nlm.nih.gov/pubmed/23558742}} }
\end{quote}
If you use \featureCounts, please cite:\\
{\noindent If you use \featureCounts, please cite:}
\begin{quote}
Liao Y, Smyth GK and Shi W. featureCounts: an efficient general-purpose program for assigning sequence reads to genomic features. Bioinformatics, 2013 Nov 30. [Epub ahead of print]
Liao Y, Smyth GK and Shi W (2014). featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. \emph{Bioinformatics}, 30(7):923-30.
\\
{\color{blue}{\url{http://www.ncbi.nlm.nih.gov/pubmed/24227677}}}
\end{quote}
......@@ -166,8 +166,8 @@ Alternatively, you may download the {\Rsubread} source package directly from {\c
\section{How to get help}
Bioconductor mailing list (\url{http://bioconductor.org/}) and SeqAnswer forum (\url{http://www.seqanswers.com}) are the best places to get help and to report bugs.
Alternatively, you may contact Wei Shi (shi at wehi dot edu dot au) directly.
Bioconductor support site (\url{https://support.bioconductor.org/}) or Google Subread group (\url{https://groups.google.com/forum/#!forum/subread}) are the best place to post questions or make suggestions.
\chapter{The seed-and-vote mapping paradigm}
......@@ -215,19 +215,24 @@ Since no mismatches are allowed in the mapping of the subreads, the indels can b
\section{Detection of exon-exon junctions}
\label{sec:junction}
The seed-and-vote paradigm is also very useful in detecting exon-exon junctions, because the short subreads extracted across the entire read can be used to detect short exons in a sensitive and accurate way.
The figure below shows the schematic of detecting exon-exon junctions and mapping RNA-seq reads by \code{Subjunc}, which uses this paradigm.
Figure below shows the schematic of exon-exon junction under seed-and-vote paradigm.
The first scan detects all possible exon-exon junctions using the mapping locations of the subreads extracted from each read.
Matched donor (`GT') and receptor (`AG') sites are required for calling junctions.
Exons as short as 16bp can be detected in this step.
The second scan verifies the putative exon-exon junctions discovered from the first scan by performing re-alignments for the junction reads.
The output from \code{Subjunc} includes the list of verified junctions and also the mapping results for all the reads.
Orientation of splicing sites is indicated by `XA' tag in section of optional fields in mapping output.
The second scan verifies the putative exon-exon junctions discovered from the first scan by read re-alignment.
By default, \code{Subjunc} only reports canonical exon-exon junctions it has discovered (ie. presence of donor (`GT') and receptor (`AG') sites is required).
However, users may turn on `--allJunctions' option to instruct \code{Subjunc} to report all junctions including both canonical and non-canonical ones.
This approach is implemented in the \code{Subjunc} program.
The output of \code{Subjunc} includes a list of discovered junctions, in addition to the mapping results.
By default, \code{Subjunc} only reports canonical exon-exon junctions that contain canonical donor and receptor sites (`GT' and `AG' respectively).
It was reported that such exon-exon junctions account for $>$98\% of all junctions.
Orientation of donor and receptor sites is indicated by `XA' tag in the SAM/BAM output.
\code{Subjunc} will report both canonical and non-canonical junctions when `--allJunctions' option is turned on.
Accuracy of junction detection generally improves when external gene annotation data is provided.
The annotation data should include chromosomal coordinates of known exons of each gene.
\code{Subjunc} infers exon-exon junctions from the provided annotation data by connecting each pair of neighboring exons from the same gene.
This should cover majority of known exon-exon junctions and the other junctions are expected to be discovered by the program.
Note that although \code{Subread} aligner does not report exon-exon junctions, providing this annotation is useful for it to map junction reads more accurately.
See `-a' parameter in Table 2 for more details.
\begin{center}
......@@ -362,12 +367,19 @@ output_file="rsubread.bam",minFragLength=50,maxFragLength=600)
\label{sec:index}
The \code{subread-buildindex} (\code{buildindex} function in \Rsubread) program builds an index for reference genome by creating a hash table in which keys are 16bp mers (subreads) extracted from the genome and values are their chromosomal locations.
By default, subreads are extracted from the genome at a 2bp interval.
The reference sequences should be in FASTA format (the header line for each chromosomal sequence starts with ``$>$'').\\
By default, subreads are extracted from the reference genome at a 2bp interval and a gapped index is built.
During mapping, three sets of subreads will be extracted from each read.
The first set starts from the first base of the read, the second set starts from the second base and the third set starts from the third base.
The reference sequences should be in FASTA format.
The \code{subread-buildindex} function divides each reference sequence name (which can be found in the header lines) into multiple substrings by using separators including `\code{|}', ` '(space) and `\code{<tab>}', and it uses the first substring as the name for the reference sequence during its index building.
The first substrings must be distinct for different reference sequences (otherwise the index cannot be built).
Note that the starting `\code{>}' character in the header line is not included in the first substrings.
Table 1 describes the arguments used by the \code{subread-buildindex} program.
\newpage
%\newpage
\begin{table}[h]
\raggedright{Table 1: Arguments used by the \code{subread-buildindex} program (\code{buildindex} function in \Rsubread) in alphabetical order.
......@@ -378,17 +390,17 @@ Arguments & Description \\
\hline
chr1.fa, chr2.fa, ... \newline (\code{reference}) & Give names of chromosome files. Note that in {\Rsubread}, only a single FASTA file including all reference sequences should be provided.\\
\hline
-B \newline (\code{indexSplit=FALSE}) & Create one block of index. The built index will not be split into multiple pieces. This makes the largest amount of memory be requested when running alignments, but it enables the maximum mapping speed to be achieved. This option overrides -M when it is provided as well.\\
-B \newline (\code{indexSplit=FALSE}) & Create one block of index. The built index will not be split into multiple pieces. The more blocks an index has, the slower the mapping speed. This option will override `-M' option when it is also provided.\\
\hline
-c \newline (\code{colorspace}) & Build a color-space index.\\
\hline
-f $<int>$ \newline (\code{TH\_subread}) & Specify the threshold for removing uninformative subreads (highly repetitive 16bp mers). Subreads will be excluded from the index if they occur more than threshold number of times in the reference genome. Default value is 100.\\
\hline
-F \newline (\code{gappedIndex=FALSE}) & Build a full index for the reference genome. 16bp mers (subreads) will be extracted from every position of the reference genome. Under default setting (`-F' is not specified), subreads are extracted in every three bases from the genome.\\
-F \newline (\code{gappedIndex=FALSE}) & Build a full index for the reference genome. 16bp mers (subreads) will be extracted from every position of a reference genome. Size of the full index built for mouse genome is 14GB.\\
\hline
-M $<int>$ \newline (\code{memory}) & Specify the Size of requested memory(RAM) in megabytes, 8000MB by default. With the default value, the index built for a mammalian genome (eg. human or mouse genome) will be saved into one block, enabling the fastest mapping speed to be achieved. The amount of memory used is $\sim$ 7600MB for mouse or human genome (other species have a much smaller memory footprint), when performing read mapping. Using less memory will increase read mapping time.\\
-M $<int>$ \newline (\code{memory}) & Specify the size of computer memory(RAM) in megabytes that will be used for alignment of reads, 8000MB by default. If the size of an index built for a reference genome is greater than the `-M' value, this index will be split into multiple blocks and then saved onto the disk. These blocks will be loaded into computer memory sequentially when performing read alignment. A gapped index generated for mouse genome has a size of 5300MB. Note that when generating a gapped index this function itself uses no more than 12GB of memory.\\
\hline
-o $<basename>$ \newline (\code{basename}) & Specify the base name of the index to be created.\\
-o $<string>$ \newline (\code{basename}) & Specify the base name of the index to be created.\\
\hline
-v & Output version of the program. \\
\hline
......@@ -417,6 +429,10 @@ Arguments used by \code{subjunc} only are marked with $^{**}$.
\hline
Arguments & Description \\
\hline
-a $<string>$\newline (\code{useAnnotation, annot.inbuilt, annot.ext}) & Name of a gene annotation file that includes chromosomal coordinates of exons from each gene. GTF/GFF format by default. See -F option for supported formats. Users may use the inbuilt annotations included in this package (SAF format) for human and mouse data. Exon-exon junctions are inferred by connecting each pair of neighboring exons from the same gene.\\
\hline
-A $<string>$\newline (\code{chrAliases}) & Name of a comma-delimited text file that includes aliases of chromosome names. This file should contain two columns. First column contains names of chromosomes included in the SAF or GTF annotation and second column contains corresponding names of chromosomes in the reference genome. No column headers should be provided. Also note that chromosome names are case sensitive. This file can be used to match chromosome names between the annotation and the reference genome.\\
\hline
-b \newline (\code{color2base=TRUE}) & Output base-space reads instead of color-space reads in mapping output for color space data (eg. LifTech SOLiD data). Note that the mapping itself will still be performed at color-space.\\
\hline
-B $<int>$ \newline (\code{nBestLocations}) & Specify the maximal number of equally-best mapping locations allowed to be reported for each read. 1 by default. `NH' tag is used to indicate how many alignments are reported for the read and `HI' tag is used for numbering the alignments reported for the same read, in the output. Note that \code{-u} option takes precedence over \code{-B}.\\
......@@ -425,7 +441,9 @@ Arguments & Description \\
\hline
-D $<int>$ \newline (\code{maxFragLength}) & Specify the maximum fragment/template length, 600 by default.\\
\hline
-i $<index> \newline (\code{index}) $ & Specify the base name of the index.\\
-F $<string>$ \newline (\code{isGTF}) & Specify format of the provided annotation file. Acceptable formats include `GTF' (or compatible GFF format) and `SAF'. Default format in SourceForge {\Subread} is `GTF'. Default format in {\Rsubread} is `SAF'. \\
\hline
-i $<string> \newline (\code{index}) $ & Specify the base name of the index.\\
\hline
-I $<int>$ \newline (\code{indels}) & Specify the number of INDEL bases allowed in the mapping. 5 by default. Indels of up to 200bp long can be detected.\\
\hline
......@@ -435,15 +453,15 @@ Arguments & Description \\
\hline
-n $<int>$ \newline (\code{nsubreads}) & Specify the number of subreads extracted from each read, 10 by default.\\
\hline
-o $<output>$ \newline (\code{output\_file}) & Give the name of output file. The default output format is BAM. All reads are included in mapping output, including both mapped and unmapped reads, and they are in the same order as in the input file.\\
-o $<string>$ \newline (\code{output\_file}) & Give the name of output file. The default output format is BAM. All reads are included in mapping output, including both mapped and unmapped reads, and they are in the same order as in the input file.\\
\hline
-p $<int>$ \newline (\code{TH2}) & Specify the minimum number of consensus subreads both reads from the same pair must have. This argument is only applicable for paired-end read data. The value of this argument should not be greater than that of `-m' option, so as to rescue those read pairs in which one read has a high mapping quality but the other does not. 1 by default.\\
\hline
-P $<3:6>$ \newline (\code{phredOffset}) & Specify the format of Phred scores used in the input data, '3' for phred+33 and '6' for phred+64. '3' by default. For \code{align} function in \Rsubread, the possible values are `33' (for phred+33) and `64' (for phred+64). `33' by default.\\
\hline
-r $<input>$ \newline (\code{readfile1}) & Give the name of input file(s) (multiple files are allowed to be provided to \code{align} and \code{subjunc} functions in {\Rsubread}). For paired-end read data, this gives the first read file and the other read file should be provided via the -R option. Supported input formats include FASTQ/FASTA (uncompressed or gzip compressed)(default), SAM and BAM.\\
-r $<string>$ \newline (\code{readfile1}) & Give the name of an input file (multiple files are allowed to be provided to \code{align} and \code{subjunc} functions in {\Rsubread}). For paired-end read data, this gives the first read file and the other read file should be provided via the -R option. Supported input formats include FASTQ/FASTA (uncompressed or gzip compressed)(default), SAM and BAM.\\
\hline
-R $<input>$ \newline (\code{readfile2}) & Provide name of the second read file from paired-end data. The program will switch to paired-end read mapping mode if this file is provided. (multiple files are allowed to be provided to \code{align} and \code{subjunc} functions in {\Rsubread}).\\
-R $<string>$ \newline (\code{readfile2}) & Provide name of the second read file from paired-end data. The program will switch to paired-end read mapping mode if this file is provided. (multiple files are allowed to be provided to \code{align} and \code{subjunc} functions in {\Rsubread}).\\
\hline
-S $<ff:fr:rf>$ \newline (\code{PE\_orientation}) & Specify the orientation of the two reads from the same pair. It has three possible values including `fr', `ff' and `'rf. Letter `f' denotes the forward strand and letter `r' the reverse strand. `fr' by default (ie. the first read in the pair is on the forward strand and the second read on the reverse strand).\\
\hline
......@@ -453,19 +471,23 @@ $^*$ -t $<int>$ \newline (\code{type}) & Specify the type of input sequencing da
\hline
-u \newline (\code{unique=TRUE}) & Output uniquely mapped reads only. Reads that were found to have more than one best mapping location will not be reported.\\
\hline
$^{**}$$--$allJunctions \newline (\code{reportAllJunctions=TRUE}) & This option should be used with \code{subjunc} for detecting canonical exon-exon junctions (with `GT/AG' donor/receptor sites), non-canonical exon-exon junctions and structural variants (SVs) in RNA-seq data. detected junctions will be saved to a file with suffix name ``.junction.bed". Detected SV breakpoints will be saved to a file with suffix name ``.breakpoints.txt", which includes chromosomal coordinates of detected SV breakpoints and also number of supporting reads. In the read mapping output, each breakpoint-containing read will contain the following extra fields for the description of its secondary alignment: CC(Chr), CP(Position),CG(CIGAR) and CT(strand). The primary alignment (described in the main field) and secondary alignment give respectively the mapping results for the two segments from the same read that were seperated by the breakpoint. Note that each breakpoint-containing read occupies only one row in mapping output. The mapping output includes mapping results for all the reads.\\
$^{**}$$--$allJunctions \newline (\code{reportAllJunctions} \newline \code{=TRUE}) & This option should be used with \code{subjunc} for detecting canonical exon-exon junctions (with `GT/AG' donor/receptor sites), non-canonical exon-exon junctions and structural variants (SVs) in RNA-seq data. detected junctions will be saved to a file with suffix name ``.junction.bed". Detected SV breakpoints will be saved to a file with suffix name ``.breakpoints.txt", which includes chromosomal coordinates of detected SV breakpoints and also number of supporting reads. In the read mapping output, each breakpoint-containing read will contain the following extra fields for the description of its secondary alignment: CC(Chr), CP(Position),CG(CIGAR) and CT(strand). The primary alignment (described in the main field) and secondary alignment give respectively the mapping results for the two segments from the same read that were seperated by the breakpoint. Note that each breakpoint-containing read occupies only one row in mapping output. The mapping output includes mapping results for all the reads.\\
\hline
$--$BAMinput \newline (\code{input\_format="BAM"}) & Specify that the input read data are in BAM format.\\
\hline
$--$complexIndels & Detect multiple short indels that occur concurrently in a small genomic region (these indels could be as close as 1bp apart).\\
\hline
$--$DPGapExt $<int>$ \newline (\code{DP\_GapExtPenalty}) & Specify the penalty for extending the gap when performing the Smith-Waterman dynamic programming. 0 by defaut.\\
$--$DPGapExt $<int>$ \newline (\code{DP\_GapExtPenalty}) & Specify the penalty for extending the gap when performing the Smith-Waterman dynamic programming. 0 by default.\\
\hline
$--$DPGapOpen $<int>$ \newline (\code{DP\_GapOpenPenalty}) & Specify the penalty for opening a gap when applying the Smith-Waterman dynamic programming to detecting indels. -2 by defaut.\\
$--$DPGapOpen $<int>$ \newline (\code{DP\_GapOpenPenalty}) & Specify the penalty for opening a gap when applying the Smith-Waterman dynamic programming to detecting indels. -2 by default.\\
\hline
$--$DPMismatch $<int>$ \newline (\code{DP\_MismatchPenalty}) & Specify the penalty for mismatches when performing the Smith-Waterman dynamic programming. 0 by defaut.\\
$--$DPMismatch $<int>$ \newline (\code{DP\_MismatchPenalty}) & Specify the penalty for mismatches when performing the Smith-Waterman dynamic programming. 0 by default.\\
\hline
$--$DPMatch $<int>$ \newline (\code{DP\_MatchScore}) & Specify the score for the matched base when performing the Smith-Waterman dynamic programming. 2 by defaut.\\
$--$DPMatch $<int>$ \newline (\code{DP\_MatchScore}) & Specify the score for the matched base when performing the Smith-Waterman dynamic programming. 2 by default.\\
\hline
$--$gtfFeature $<string>$ \newline (\code{GTF.featureType}) & Specify the type of features that will be extracted from a GTF annotation. `exon' by default. Feature types can be found in the 3rd column of a GTF annotation.\\
\hline
$--$gtfAttr $<string>$ \newline (\code{GTF.attrType}) & Specify the type of attributes in a GTF annotation that will be used to group features. `gene\_id' by default. Attributes can be found in the 9th column of a GTF annotation.\\
\hline
$--$rg $<string>$ \newline (\code{readGroup}) & Add a $<tag:value>$ to the read group (RG) header in the mapping output. \\
\hline
......@@ -489,7 +511,8 @@ $--$trim3 $<int>$ \newline (\code{nTrim3}) & Trim off $<int>$ number of bases fr
\section{Mapping quality scores}
{\Subread} and {\Subjunc} aligners assign a mapping quality score (MQS) to each mapped read to indicate the confidence of the mapping:\\
{\Subread} and {\Subjunc} aligners determine the final mapping location of each read by taking into account vote number, number of mis-matched bases, number of matched bases and mapping distance between two reads from the same pair (for paired-end reads only) .
They then assign a mapping quality score (MQS) to each mapped read to indicate the confidence of mapping using the following formula:\\
\[ MQS = \left\{
\begin{array}{l l}
......@@ -499,10 +522,8 @@ $--$trim3 $<int>$ \newline (\code{nTrim3}) & Trim off $<int>$ number of bases fr
0 & \quad \text{if $>1$ equally best locations were found}\\
\end{array} \right.\] \\
\noindent where $N_c$ is the number of candidate locations that are considered for final full alignment (re-alignment step).
Such locations must have a vote number within top three vote numbers counted from all locations considered in the subread-mapping step (first scan).
Up to three candiate locations are considered for each read in the realignment step.
$N_{mm}$ is the number of mismatches found in the alignment at each candidate location.
\noindent where $N_c$ is the number of candidate locations considered at the re-alignment step (note that no more than three candidate locations are considered at this step).
$N_{mm}$ is the number of mismatches present in the final reported alignment for the read.
......@@ -626,6 +647,7 @@ Reads may be re-aligned if required.
Output of {\Subjunc} aligner includes a list of discovered exon-exon junction locations and also the complete alignment results for the reads.
Table 2 describes the arguments used by the {\Subjunc} program.\\
\section{Mapping output}
Read mapping results for each library will be saved to a BAM/SAM file.
......@@ -731,6 +753,7 @@ The genomic features can be specified in either GTF/GFF or SAF format. The SAF f
The genomic features can be specified in either GTF/GFF or SAF format.
A definition of the GTF format can be found at UCSC website (\url{http://genome.ucsc.edu/FAQ/FAQformat.html#format4}).
The SAF format includes five required columns for each feature: feature identifier, chromosome name, start position, end position and strand.
Both start and end positions are inclusive.
These five columns provide the minimal sufficient information for read quantification purposes.
Extra annotation data are allowed to be added from the sixth column.
......@@ -760,41 +783,46 @@ In this case, {\featureCounts} can be instructed to count fragments rather than
{\featureCounts} automatically sorts reads by name if paired reads are not in consecutive positions in the SAM or BAM file, with minimal cost.
Users do not need to sort their paired reads before providing them to {\featureCounts}.
\subsection{Features and meta-features}
\subsection{Assign reads to features or meta-features}
{\featureCounts} is a general-purpose read summarization function, which assigns mapped reads (RNA-seq reads or genomic DNA-seq reads) to genomic features or meta-features.
Each feature is an interval (range of positions) on one of the reference sequences. We define a meta-feature to be a set of features representing a biological construct of interest. For example, features often correspond to exons and meta-features to genes. Features sharing the same feature identifier in the GTF or SAF annotation are taken to belong to the same meta-feature. {\featureCounts} can summarize reads at either the feature or meta-feature levels.
A feature is an interval (range of positions) on one of the reference sequences.
A meta-feature is a set of features that represents a biological construct of interest.
For example, features often correspond to exons and meta-features to genes. Features sharing the same feature identifier in the GTF or SAF annotation are taken to belong to the same meta-feature. {\featureCounts} can summarize reads at either the feature or meta-feature levels.
We recommend to use unique gene identifiers, such as NCBI Entrez gene identifiers, to cluster features into meta-features. Gene names are not recommended to use for this purpose because different genes may have the same names. Unique gene identifiers were often included in many publicly available GTF annotations which can be readily used for summarization. The Bioconductor {\Rsubread} package also includes NCBI RefSeq annotations for human and mice. Entrez gene identifiers are used in these annotations.
\subsection{Overlap of reads with features}
{\featureCounts} preforms precise read assignment by comparing mapping location of every base in the read or fragment with the genomic region spanned by each feature.
{\featureCounts} preforms precise read assignment by comparing mapping location of every base in the read with the genomic region spanned by each feature.
It takes account of any gaps (insertions, deletions, exon-exon junctions or structural variants) that are found in the read.
It calls a hit if any overlap is found between read and feature.
Users may use `--minOverlap (\code{minOverlap} in \R)' and `--fracOverlap (\code{fracOverlap} in \R)' options to specify the minimum number of overlapping bases and minimum fraction of overlapping bases requried for assigning a read to a feature, respectively.
The `--fracOverlap' option might be particularly useful for counting reads with variable lengths.
% A hit is called for a meta-feature if the read or fragment overlaps any component feature of the meta-feature.
\subsection{Multi-mapping reads}
When counting reads at meta-feature level, a hit is called for a meta-feature if the read overlaps any component feature of the meta-feature.
Note that if a read hits a meta-feature, it is always counted once no matter how many features in the meta-feature this read overalps with.
For instance, an exon-spanning read overlapping with more than one exon within the same gene only contributes 1 count to the gene.
A multi-mapping read is a read that can be equally best mapped to more than one location in the reference genome.
Due to the mapping amguity, it is recommended that multi-mapping reads should be excluded from read counting (default behavior of {\featureCounts} program) to produce as accurate counts as possible.
However we do provide users with different options to deal with the counting of such reads.
Users can choose to discard multi-mapping reads, or fully count every alignment reported for a multi-mapping read (ie. each alignment carries 1 count) or count each alignment fractionally (ie. each alignment carries $1/n$ count where $n$ is the total number of alignments reported for the read).
Relevent parameters for counting multi-mapping reads include `-M' (\code{countMultiMappingReads} in \R) and `--fraction' (\code{fraction} in \R).
\subsection{Count multi-mapping reads and multi-overlapping reads}
A multi-mapping read is a read that can be equally best mapped to more than one location in the reference genome.
Due to the mapping ambiguity, it is recommended that multi-mapping reads should be excluded from read counting (default behavior of {\featureCounts} program) to produce as accurate counts as possible.
However we do provide users with other counting options for such reads.
Users can specify the `-M' option (set \code{countMultiMappingReads} to \code{TRUE} in \R) to fully count every alignment reported for a multi-mapping read (each alignment carries 1 count), or specify both `-M' and `--fraction' options (set both \code{countMultiMappingReads} and \code{fraction} to \code{TRUE} in \R) to count each alignment fractionally (each alignment carries $1/x$ count where $x$ is the total number of alignments reported for the read).
Note that for multi-mapping reads the counting is performed at the level of individual alignments (not at read level).
\subsection{Multi-overlap reads}
A multi-overlapping read is a read that overlaps more than one meta-feature when counting reads at meta-feature level or overlaps more than one feature when counting reads at feature level.
The decision of whether or not to counting these reads is often determined by the experiment type. We recommend that reads or fragments overlapping more than one gene are not counted for RNA-seq experiments, because any single fragment must originate from only one of the target genes but the identity of the true target gene cannot be confidently determined.
On the other hand, we recommend that multi-overlapping reads or fragments are counted for ChIP-seq experiments because for example epigenetic modifications inferred from these reads may regulate the biological functions of all their overlapping genes.
A multi-overlap read or fragment is one that overlaps more than one feature, or more than one meta-feature when summarizing at the meta-feature level. {\featureCounts} provides users with the option to exclude multi-overlap reads, or fully count them for each overlapping feature (ie. each overlapping feature receives a count of 1 from the read) or assign a fractional count to each overlapping feature.
Relevent parameters for counting multi-overlap reads include `-O' (\code{allowMultiOverlap} in \R) and `--fraction' (\code{fraction} in \R).
By default, {\featureCounts} does not count multi-overlapping reads.
Users can specify the `-O' option (set \code{allowMultiOverlap} to \code{TRUE} in \R) to fully count them for each overlapping meta-feature/feature (each overlapping meta-feature/feature receives a count of 1 from a read), or specify both `-O' and `--fraction' options (set both \code{allowMultiOverlap} and \code{fraction} to \code{TRUE} in \R) to assign a fractional count to each overlapping meta-feature/feature (each overlapping meta-feature/feature receives a count of $1/y$ from a read where $y$ is the total number of meta-features/features overlapping with the read).
The decision whether or not to counting these reads is often determined by the experiment type. We recommend that reads or fragments overlapping more than one gene are not counted for RNA-seq experiments, because any single fragment must originate from only one of the target genes but the identity of the true target gene cannot be confidently determined. On the other hand, we recommend that multi-overlap reads or fragments are counted for most ChIP-seq experiments because epigenetic modifications inferred from these reads may regulate the biological functions of all their overlapping genes.
If a read is both multi-mapping and multi-overlapping, then each overlapping meta-feature/feature will receive a fractional count of $1/(x*y)$ when `-O', `-M', and `--fraction' are all specified.
Note that each alignment reported for a multi-mapping read is assessed separately for overlapping with multiple meta-features/features.
Note that, when counting at the meta-feature level, reads that overlap multiple features of the same meta-feature are always counted exactly once for that meta-feature, provided there is no overlap with any other meta-feature. For example, an exon-spanning read will be counted only once for the corresponding gene even if it overlaps with more than one exon.
\subsection{In-built annotations}
......@@ -849,9 +877,9 @@ Arguments included in parenthesis are the equivalent parameters used by {\featur
\hline
Arguments & Description \\
\hline
input\_files \newline (\code{files}) & Give the names of input read files that include the read mapping results. The program automatically detects the file format (SAM or BAM). Multiple files can be provided at the same time.\\
input\_files \newline (\code{files}) & Give the names of input read files that include the read mapping results. The program automatically detects the file format (SAM or BAM). Multiple files can be provided at the same time. Files are allowed to be provided via $<stdin>$. \\
\hline
-a $<input> \newline (\code{annot.ext, annot.inbuilt}) $ & Give the name of an annotation file. \\
-a $<string>$ \newline (\code{annot.ext, annot.inbuilt}) & Give the name of an annotation file. \\
\hline
-A \newline (\code{chrAliases}) & Provide a chromosome name alias file to match chr names in annotation with those in the reads. This should be a two-column comma-delimited text file. Its first column should include chr names in the annotation and its second column should include chr names in the reads. Chr names are case sensitive. No column header should be included in the file.\\
\hline
......@@ -865,20 +893,20 @@ input\_files \newline (\code{files}) & Give the names of input read files that i
\hline
-f \newline (\code{useMetaFeatures}) & If specified, read summarization will be performed at feature level (eg. exon level). Otherwise, it is performed at meta-feature level (eg. gene level).\\
\hline
-F \newline (\code{isGTFAnnotationFile}) & Specify the format of the annotation file. Acceptable formats include `GTF' and `SAF' (see Section~\ref{sec:annotation} for details). The {\C} version of {\featureCounts} program uses a GTF format annotation by default, but the R version uses a SAF format annotation by default. The R version also includes in-built annotations.\\
-F \newline (\code{isGTFAnnotationFile}) & Specify the format of the annotation file. Acceptable formats include `GTF' and `SAF' (see Section~\ref{sec:annotation} for details). By default, {\C} version of {\featureCounts} program accepts a GTF format annotation and R version accepts a SAF format annotation. In-built annotations in SAF format are provided.\\
\hline
-g $<input>$ \newline (\code{GTF.attrType}) & Specify the attribute type used to group features (eg. exons) into meta-features (eg. genes) when GTF annotation is provided. `gene\_id' by default. This attribute type is usually the gene identifier. This argument is useful for the meta-feature level summarization.\\
-g $<string>$ \newline (\code{GTF.attrType}) & Specify the attribute type used to group features (eg. exons) into meta-features (eg. genes) when GTF annotation is provided. `gene\_id' by default. This attribute type is usually the gene identifier. This argument is useful for the meta-feature level summarization.\\
\hline
-G $<input>$ \newline (\code{genome}) & Provide the name of a FASTA-format file that contains the reference sequences used in
-G $<string>$ \newline (\code{genome}) & Provide the name of a FASTA-format file that contains the reference sequences used in
read mapping that produced the provided SAM/BAM files. This optional argument can be used with '-J' option to improve read counting for junctions.\\
\hline
-J \newline (\code{juncCounts}) & Count the number of reads supporting each exon-exon junction. Junctions are identified from those exon-spanning reads (containing `N' in CIGAR string) in input data. For each junction, the reported data include number of supporting reads, genes that the junction belongs to, chromosomal coordinates of splice sites etc.\\
-J \newline (\code{juncCounts}) & Count the number of reads supporting each exon-exon junction. Junctions are identified from those exon-spanning reads (containing `N' in CIGAR string) in input data. The output result includes names of primary and secondary genes that overlap at least one of the two splice sites of a junction. Only one primary gene is reported, but there might be more than one secondary gene reported. Secondary genes do not overlap more splice sites than the primary gene. When the primary and secondary genes overlap same number of splice sites, the gene with the smallest leftmost base position is selected as the primary gene. Also included in the output result are the position information for the left splice site (`Site1') and the right splice site (`Site2') of a junction. These include chromosome name, coordinate and strand of the splice site. In the last columns of the output, number of supporting reads is provided for each junction for each library.\\
\hline
-M \newline (\code{countMultiMappingReads}) & If specified, multi-mapping reads/fragments will be counted. A multi-mapping read will be counted up to N times if it has N reported mapping locations. The program uses the `NH' tag to find multi-mapping reads.\\
-M \newline (\code{countMultiMappingReads}) & If specified, multi-mapping reads/fragments will be counted. The program uses the `NH' tag to find multi-mapping reads. Alignments reported for a multi-mapping read will be counted separately. Each alignment will have \code{1} count or a fractional count if \code{--fraction} is specified. See section ``Count multi-mapping reads and multi-overlapping reads'' for more details.\\
\hline
-o $<output>$ & Give the name of the output file. The output file contains the number of reads assigned to each meta-feature (or each feature if \code{-f} is specified). Note that the {\featureCounts} function in {\Rsubread} does not use this parameter. It returns a \code{list} object including read summarization results and other data. \\
-o $<string>$ & Give the name of the output file. The output file contains the number of reads assigned to each meta-feature (or each feature if \code{-f} is specified). Note that the {\featureCounts} function in {\Rsubread} does not use this parameter. It returns a \code{list} object including read summarization results and other data. \\
\hline
-O \newline (\code{allowMultiOverlap}) & If specified, reads (or fragments if \code{-p} is specified) will be allowed to be assigned to more than one matched meta-feature (or feature if \code{-f} is specified). Reads/fragments overlapping with more than one meta-feature/feature will be counted more than once. Note that when performing meta-feature level summarization, a read (or fragment) will still be counted once if it overlaps with multiple features belonging to the same meta-feature but does not overlap with other meta-features. \\
-O \newline (\code{allowMultiOverlap}) & If specified, reads (or fragments if \code{-p} is specified) will be allowed to be assigned to more than one matched meta-feature (or feature if \code{-f} is specified). Reads/fragments overlapping with more than one meta-feature/feature will be counted more than once. Note that when performing meta-feature level summarization, a read (or fragment) will still be counted once if it overlaps with multiple features within the same meta-feature (as long as it does not overlap with other meta-features). Also note that this parameter is applied to each individual alignment when there are more than one alignment reported for a read (ie. multi-mapping read). See section ``Count multi-mapping reads and multi-overlapping reads'' for more details.\\
\hline
-p \newline (\code{isPairedEnd}) & If specified, fragments (or templates) will be counted instead of reads. This option is only applicable for paired-end reads.\\
\hline
......@@ -888,9 +916,9 @@ read mapping that produced the provided SAM/BAM files. This optional argument ca
\hline
-R \newline (\code{reportReads}) & Output detailed read assignment results for each read (or fragment if paired end). They are saved to a tab-delimited file that contains four columns including read name, status(assigned or the reason if not assigned), name of target feature/meta-feature and total number of hits if the read/fragment is counted multiple times. Names of output files are the same as input file names except a suffix string `.featureCounts' is added.\\
\hline
-s $<int>$ \newline (\code{isStrandSpecific}) & Indicate if strand-specific read counting should be performed. Acceptable values: 0 (unstranded), 1 (stranded) and 2 (reversely stranded). 0 by default. For paired-end reads, strand of the first read is taken as the strand of the whole fragment and FLAG field of the current read is used to tell if it is the first read in the fragment.\\
-s $<int>$ \newline (\code{isStrandSpecific}) & Indicate if strand-specific read counting should be performed. Acceptable values: 0 (unstranded), 1 (stranded) and 2 (reversely stranded). 0 by default. For paired-end reads, strand of the first read is taken as the strand of the whole fragment. FLAG field is used to tell if a read is first or second read in a pair.\\
\hline
-t $<input>$ \newline (\code{GTF.featureType}) & Specify the feature type. Only rows which have the matched feature type in the provided GTF annotation file will be included for read counting. `exon' by default.\\
-t $<string>$ \newline (\code{GTF.featureType}) & Specify the feature type. Only rows which have the matched feature type in the provided GTF annotation file will be included for read counting. `exon' by default.\\
\hline
-T $<int>$ \newline (\code{nthreads}) & Number of the threads. The value should be between 1 and 32. 1 by default.\\
\hline
......@@ -902,9 +930,9 @@ $--$countNonSplit \newline AlignmentsOnly \newline (\code{nonSplitOnly}) & If s
\hline
$--$donotsort \newline (\code{autosort}) & If specified, paired end reads will not be re-ordered even if reads from the same pair were found not to be next to each other in the input.\\
\hline
$--$fraction \newline (\code{fraction}) & Assign fractional counts to features. This option must be used together with '-M' or '-O' or both. When '-M' is specified, each reported alignment from a multi-mapping read (identified via `NH' tag) will carry a fractional count of 1/x, instead of 1 (one), where x is the total number of alignments reported for the same read. When '-O' is specified, each overlapping feature will receive a fractional count of 1/y, where y is the total number of features overlapping with the read. When both '-M' and '-O' are specified, each alignment will carry a fraction count of 1/(x*y).\\
$--$fraction \newline (\code{fraction}) & Assign fractional counts to features. This option must be used together with `-M' or `-O' or both. When `-M' is specified, each reported alignment from a multi-mapping read (identified via `NH' tag) will carry a count of 1/x, instead of 1 (one), where x is the total number of alignments reported for the same read. When `-O' is specified, each overlapping feature will receive a count of 1/y, where y is the total number of features overlapping with the read. When both `-M' and `-O' are specified, each alignment will carry a count of 1/(x*y).\\
\hline
$--$fracOverlap $<value>$ \newline (\code{fracOverlap}) & Minimum fraction of overlapping bases in a read that is required for read assignment. Value should be within range [0,1]. 0 by default. If paired end, number of overlapping bases is counted from both reads. Soft-clipped bases are counted when calculating total read length (but ignored when counting overlapping bases). Both this option and `--minOverlap' option need to be satisfied for read assignment. \\
$--$fracOverlap $<float>$ \newline (\code{fracOverlap}) & Minimum fraction of overlapping bases in a read that is required for read assignment. Value should be a float number in the range [0,1]. 0 by default. If paired end, number of overlapping bases is counted from both reads. Soft-clipped bases are counted when calculating total read length (but ignored when counting overlapping bases). Both this option and `--minOverlap' option need to be satisfied for read assignment. \\
\hline
$--$ignoreDup \newline (\code{ignoreDup}) & If specified, reads that were marked as duplicates will be ignored. Bit Ox400 in FLAG field of SAM/BAM file is used for identifying duplicate reads. In paired end data, the entire read pair will be ignored if at least one end is found to be a duplicate read.\\
\hline
......
......@@ -44,6 +44,7 @@
#include "subread.h"
#include "input-files.h"
#include "gene-algorithms.h"
#include "HelperFunctions.h"
......@@ -979,3 +980,171 @@ double fast_fisher_test_one_side(unsigned int a, unsigned int b, unsigned int c,
}
int load_features_annotation(char * file_name, int file_type, char * gene_id_column, char * feature_name_column,
void * context, int do_add_feature(char * gene_name, char * chro_name, unsigned int start, unsigned int end, int is_negative_strand, void * context) ){
char * file_line = malloc(MAX_LINE_LENGTH+1);
int lineno = 0, is_GFF_warned = 0, loaded_features = 0;
FILE * fp = fopen(file_name, "r");
if(NULL == fp){
SUBREADprintf("Error: unable to open the annotation file : %s\n", file_name);
return -1;
}
while(1){
int is_gene_id_found = 0, is_negative_strand = -1;
char * token_temp = NULL, * feature_name, * chro_name = NULL;
char feature_name_tmp[FEATURE_NAME_LENGTH];
feature_name = feature_name_tmp;
unsigned int start = 0, end = 0;
char * getres = fgets(file_line, MAX_LINE_LENGTH, fp);
if(getres == NULL) break;
lineno++;
if(is_comment_line(file_line, file_type, lineno-1))continue;
if(file_type == FILE_TYPE_RSUBREAD)
{
feature_name = strtok_r(file_line,"\t",&token_temp);
int feature_name_len = strlen(feature_name);
if(feature_name_len > FEATURE_NAME_LENGTH) feature_name[FEATURE_NAME_LENGTH -1 ] = 0;
chro_name = strtok_r(NULL,"\t", &token_temp);
int chro_name_len = strlen(chro_name);
if(chro_name_len > MAX_CHROMOSOME_NAME_LEN) chro_name[MAX_CHROMOSOME_NAME_LEN -1 ] = 0;
char * start_ptr = strtok_r(NULL,"\t", &token_temp);
char * end_ptr = strtok_r(NULL,"\t", &token_temp);
if(start_ptr == NULL || end_ptr == NULL){
SUBREADprintf("\nWarning: the format on the %d-th line is wrong.\n", lineno);
}
long long int tv1 = atoll(start_ptr);
long long int tv2 = atoll(end_ptr);
if( isdigit(start_ptr[0]) && isdigit(end_ptr[0]) ){
if(strlen(start_ptr) > 10 || strlen(end_ptr) > 10 || tv1 > 0x7fffffff || tv2> 0x7fffffff){
SUBREADprintf("\nError: Line %d contains a coordinate greater than 2^31!\n", lineno);
return -2;
}
}else{
SUBREADprintf("\nError: Line %d contains a format error. The expected annotation format is SAF.\n", lineno);
return -2;
}
start = atoi(start_ptr);// start
end = atoi(end_ptr);//end
char * strand_str = strtok_r(NULL,"\t", &token_temp);
if(strand_str == NULL)
is_negative_strand = 0;
else
is_negative_strand = ('-' ==strand_str[0]);
is_gene_id_found = 1;
} else if(file_type == FILE_TYPE_GTF) {
chro_name = strtok_r(file_line,"\t",&token_temp);
strtok_r(NULL,"\t", &token_temp);// source
char * feature_type = strtok_r(NULL,"\t", &token_temp);// feature_type
if(strcmp(feature_type, feature_name_column)==0){
char * start_ptr = strtok_r(NULL,"\t", &token_temp);
char * end_ptr = strtok_r(NULL,"\t", &token_temp);
if(start_ptr == NULL || end_ptr == NULL){
SUBREADprintf("\nWarning: the format on the %d-th line is wrong.\n", lineno);
}
long long int tv1 = atoll(start_ptr);
long long int tv2 = atoll(end_ptr);
if( isdigit(start_ptr[0]) && isdigit(end_ptr[0]) ){
if(strlen(start_ptr) > 10 || strlen(end_ptr) > 10 || tv1 > 0x7fffffff || tv2> 0x7fffffff){
SUBREADprintf("\nError: Line %d contains a coordinate greater than 2^31!\n", lineno);
return -2;
}
}else{
SUBREADprintf("\nError: Line %d contains a format error. The expected annotation format is GTF/GFF.\n", lineno);
return -2;
}
start = atoi(start_ptr);// start
end = atoi(end_ptr);//end
if(start < 1 || end<1 || start > 0x7fffffff || end > 0x7fffffff || start > end)
SUBREADprintf("\nWarning: the feature on the %d-th line has zero coordinate or zero lengths\n\n", lineno);
strtok_r(NULL,"\t", &token_temp);// score
is_negative_strand = ('-' == (strtok_r(NULL,"\t", &token_temp)[0]));//strand
strtok_r(NULL,"\t",&token_temp); // "frame"
char * extra_attrs = strtok_r(NULL,"\t",&token_temp); // name_1 "val1"; name_2 "val2"; ...
if(extra_attrs && (strlen(extra_attrs)>2)){
int attr_val_len = GTF_extra_column_value(extra_attrs , gene_id_column , feature_name_tmp, FEATURE_NAME_LENGTH);
if(attr_val_len>0) is_gene_id_found=1;
}
if(!is_gene_id_found){
if(!is_GFF_warned)
{
int ext_att_len = strlen(extra_attrs);
if(extra_attrs[ext_att_len-1] == '\n') extra_attrs[ext_att_len-1] =0;
SUBREADprintf("\nWarning: failed to find the gene identifier attribute in the 9th column of the provided GTF file.\nThe specified gene identifier attribute is '%s' \nThe attributes included in your GTF annotation are '%s' \n\n", gene_id_column, extra_attrs);
}
is_GFF_warned++;
}
}
}
if(is_gene_id_found){
do_add_feature(feature_name, chro_name, start, end, is_negative_strand, context);
loaded_features++;
}
}
fclose(fp);
free(file_line);
return loaded_features;
}
HashTable * load_alias_table(char * fname) {
FILE * fp = f_subr_open(fname, "r");
if(!fp)
{
print_in_box(80,0,0,"WARNING unable to open alias file '%s'", fname);
return NULL;
}
char * fl = malloc(2000);
HashTable * ret = HashTableCreate(1013);
HashTableSetDeallocationFunctions(ret, free, free);
HashTableSetKeyComparisonFunction(ret, fc_strcmp_chro);
HashTableSetHashFunction(ret, fc_chro_hash);
while (1)
{
char *ret_fl = fgets(fl, 1999, fp);
if(!ret_fl) break;
if(fl[0]=='#') continue;
char * sam_chr = NULL;
char * anno_chr = strtok_r(fl, ",", &sam_chr);
if((!sam_chr)||(!anno_chr)) continue;
sam_chr[strlen(sam_chr)-1]=0;
char * anno_chr_buf = malloc(strlen(anno_chr)+1);
strcpy(anno_chr_buf, anno_chr);
char * sam_chr_buf = malloc(strlen(sam_chr)+1);
strcpy(sam_chr_buf, sam_chr);
HashTablePut(ret, sam_chr_buf, anno_chr_buf);
}
fclose(fp);
free(fl);
return ret;
}
......@@ -71,4 +71,8 @@ unsigned int find_left_end_cigar(unsigned int right_pos, char * cigar);
int mac_or_rand_str(char * char_14);
double fast_fisher_test_one_side(unsigned int a, unsigned int b, unsigned int c, unsigned int d, long double * frac_buffer, int buffer_size);
int load_features_annotation(char * file_name, int file_type, char * gene_id_column, char * feature_name_column,
void * context, int do_add_feature(char * gene_name, char * chro_name, unsigned int start, unsigned int end, int is_negative_strand, void * context) );
HashTable * load_alias_table(char * fname) ;
#endif
......@@ -2,10 +2,11 @@
include makefile.version
OPT_LEVEL = 9
OPT_LEVEL = 3
CCFLAGS = -mtune=core2 ${MACOS} -O${OPT_LEVEL} -Wall -DMAKE_FOR_EXON -D MAKE_STANDALONE -D SUBREAD_VERSION=\"${SUBREAD_VERSION}\" -D_FILE_OFFSET_BITS=64
LDFLAGS = ${STATIC_MAKE} -lpthread -lz -lm ${MACOS} -O${OPT_LEVEL} -DMAKE_FOR_EXON -D MAKE_STANDALONE # -DREPORT_ALL_THE_BEST
CC = gcc ${CCFLAGS} -ggdb -fomit-frame-pointer -ffast-math -funroll-loops -mmmx -msse -msse2 -msse3 -fmessage-length=0
CC_EXEC = gcc
CC = ${CC_EXEC} ${CCFLAGS} -fmessage-length=0 -ggdb # -fomit-frame-pointer -ffast-math -funroll-loops -mmmx -msse -msse2 -msse3 -fmessage-length=0
ALL_LIBS= core core-junction core-indel sambam-file sublog gene-algorithms hashtable input-files sorted-hashtable gene-value-index exon-algorithms HelperFunctions interval_merge long-hashtable core-bigtable seek-zlib
......
......@@ -79,6 +79,7 @@ struct SNP_Calling_Parameters{
char pile_file_name[300];
int delete_piles;
int disk_is_full;
char background_input_file[300];
char subread_index[300];
......@@ -294,7 +295,11 @@ int read_tmp_block(struct SNP_Calling_Parameters * parameters, FILE * tmp_fp, ch
fread(&read_rec, sizeof(read_rec), 1, tmp_fp);
fread(&read_len, sizeof(short), 1, tmp_fp);
fread(read, sizeof(char), read_len, tmp_fp);
fread(qual, sizeof(char), read_len, tmp_fp);
int rlen = fread(qual, sizeof(char), read_len, tmp_fp);
if(rlen < read_len){
SUBREADputs("ERROR: the temporary file is broken.");
return -1;
}
first_base_pos = read_rec.pos - block_no * BASE_BLOCK_LENGTH;
parameters->is_paired_end_data = read_rec.flags & 1;
......@@ -581,7 +586,7 @@ void fishers_test_on_block(struct SNP_Calling_Parameters * parameters, float * s
int process_snp_votes(FILE *out_fp, unsigned int offset , unsigned int reference_len, char * referenced_genome, char * chro_name , char * temp_prefix, struct SNP_Calling_Parameters * parameters)
{
int block_no = (offset -1) / BASE_BLOCK_LENGTH, i;
int block_no = (offset -1) / BASE_BLOCK_LENGTH, i, disk_is_full = 0;
char temp_file_name[300];
FILE *tmp_fp;
unsigned int * snp_voting_piles, *snp_BGC_piles = NULL; // offset * 4 + "A/C/G/T"[0,1,2,3]
......@@ -632,7 +637,7 @@ int process_snp_votes(FILE *out_fp, unsigned int offset , unsigned int reference
pcutoff_list[i]=-1.;
}
read_tmp_block(parameters, tmp_fp,&SNP_bitmap_recorder,snp_voting_piles,block_no, reference_len, referenced_genome);
int read_is_error = read_tmp_block(parameters, tmp_fp,&SNP_bitmap_recorder,snp_voting_piles,block_no, reference_len, referenced_genome);
fclose(tmp_fp);
if (parameters -> delete_piles)
......@@ -891,7 +896,12 @@ int process_snp_votes(FILE *out_fp, unsigned int offset , unsigned int reference
snprintf(sprint_line,999, "%s\t%u\t.\t%c\t%s\t%.4f\t.\tDP=%d;MM=%s;BGTOTAL=%d;BGMM=%d%s\n", chro_name, BASE_BLOCK_LENGTH*block_no +1 + i, true_value,base_list, Qvalue, all_reads, supporting_list , snp_filter_background_matched[i]+snp_filter_background_unmatched[i], snp_filter_background_unmatched[i], BGC_Qvalue_str);
if(parameters->output_fp_lock)