Skip to content
Commits on Source (5)
subread (1.6.5+dfsg-1) unstable; urgency=medium
* Team upload.
* New upstream version
* Standards-Version: 4.4.0
-- Steffen Moeller <moeller@debian.org> Fri, 09 Aug 2019 16:21:31 +0200
subread (1.6.4+dfsg-1) unstable; urgency=medium
* Update d/watch, add uversionmangle to remove -source suffix
......
......@@ -9,7 +9,7 @@ Build-Depends: bc,
help2man,
python,
zlib1g-dev
Standards-Version: 4.3.0
Standards-Version: 4.4.0
Vcs-Browser: https://salsa.debian.org/med-team/subread
Vcs-Git: https://salsa.debian.org/med-team/subread.git
Homepage: http://sourceforge.net/projects/subread/
......
......@@ -7,9 +7,11 @@ Description: Inject architecture specific flags:
- Append CFLAGS and LDFLAGS instead of hardcoding the values, this way the
hardening flags are set correctly
Author: Alex Mestiashvili <alex@biotec.tu-dresden.de>
Index: subread/src/Makefile.Linux
===================================================================
--- subread.orig/src/Makefile.Linux
+++ subread/src/Makefile.Linux
@@ -7,9 +7,9 @@
@@ -7,9 +7,9 @@ OPT_LEVEL = 3
include makefile.version
-include ~/.R/DBPZ_debug_makefile
......@@ -21,7 +23,9 @@ Author: Alex Mestiashvili <alex@biotec.tu-dresden.de>
+CC = ${CC_EXEC} ${CCFLAGS} ${CFLAGS} ${CPPFLAGS} -fmessage-length=0 -ggdb
ALL_LIBS= core core-junction core-indel sambam-file sublog gene-algorithms hashtable input-files sorted-hashtable gene-value-index exon-algorithms HelperFunctions interval_merge long-hashtable core-bigtable seek-zlib
ALL_LIBS= core core-junction core-indel sambam-file sublog gene-algorithms hashtable input-files sorted-hashtable gene-value-index exon-algorithms HelperFunctions interval_merge long-hashtable core-bigtable seek-zlib input-blc
Index: subread/src/longread-one/Makefile
===================================================================
--- subread.orig/src/longread-one/Makefile
+++ subread/src/longread-one/Makefile
@@ -1,10 +1,10 @@
......
......@@ -34,11 +34,11 @@
\begin{titlepage}
\begin{center}
{\Huge\bf Subread/Rsubread Users Guide}\\
{\Huge\bf Rsubread/Subread Users Guide}\\
\vspace{1 cm}
{\centering\large Subread v1.6.4/Rsubread v1.32.4\\}
{\centering\large Rsubread v1.34.6/Subread v1.6.5\\}
\vspace{1 cm}
\centering 11 March 2019\\
\centering 18 July 2019\\
\vspace{5 cm}
\Large Wei Shi and Yang Liao\\
\vspace{1 cm}
......@@ -390,6 +390,8 @@ A full index is larger than a gapped index.
However the full index enables faster mapping speed to be achieved.
When a one-block full index is used for mapping, the maximum mapping speed is achieved.
Size of one-block full index built for the human reference genome (GRCh38) is 17.8 GB.
The \code{subread-buildindex} function needs 15 GB of memory to build this index.
Size of a gapped index built for GRCh38 is less than 9 GB and \code{subread-buildindex} needs 5.7 GB of memory to build it.
Options are available to generate index of any size.
In \Rsubread, a one-block full index is built by default.
......@@ -544,6 +546,23 @@ $^{1,2,3}$ -v & Output version of the program. \\
\newpage
\section{Memory use and speed}
\label{sec:memoryspeed}
\code{subread-buildindex} (\code{buildindex} function in \Rsubread) needs 15GB of memory to build a full index for human/mouse genome.
With this index, \code{subread-align} (\code{align} in \Rsubread) require 17.8GB of memory for read mapping.
This enables fastest mapping speed, but it is recommended that the full index should be on a unix server due to relatively large memory use.
Mapping rate is $\sim$14 million reads per minute (10 CPU threads) when full index is used.
A gapped index is recommended for use on a personal computer, which typically has 16GB of memory or less.
\code{subread-buildindex} (\code{buildindex} function in \Rsubread) only needs 5.7GB of memory to build a gapped index for human/mouse genome.
\code{subread-align} (\code{align} in \Rsubread) needs 8.2GB of memory for mapping with the gapped index.
It takes \code{subread-buildindex} (\code{buildindex} function in \Rsubread) about 40 minutes to build a full index for human/mouse genome, and building a gapped index takes about 15 minutes.
Memory use for index building and read mapping can be further reduced by building a split index using the \code{-B} and \code{-M} options in \code{subread-buildindex} (\code{indexSplit} and \code{memory} options in \code{buildindex} function in \Rsubread).
\section{Mapping quality scores}
{\Subread} and {\Subjunc} aligners determine the final mapping location of each read by taking into account vote number, number of mis-matched bases, number of matched bases and mapping distance between two reads from the same pair (for paired-end reads only) .
......@@ -561,21 +580,6 @@ They then assign a mapping quality score (MQS) to each mapped read to indicate t
$N_{mm}$ is the number of mismatches present in the final reported alignment for the read.
% \[ MQS = \left\{
% \begin{array}{l l}
% (\sum_{i \in b_m} ( 1 - p_i) - \sum_{i \in b_{mm}} (1 - p_i)) \times 60 / L & \quad \text{if uniquely mapped}\\
% & \quad \text{\scriptsize{[MQS is reset to 0 if less than 0]}}\\
% & \\
% 0 & \quad \text{if $>1$ equally best locations found}\\
% \end{array} \right.\]
% where $L$ is the read length, $p_i$ is the base-calling $p$-value for the $i$th base in the read, $b_m$ is the set of locations of matched bases, and $b_{mm}$ is the set of locations of mismatched bases.
% Base-calling p values can be readily computed from the base quality scores.
% Read bases of high sequencing quality have low base-calling p values.
% Read bases that were found to be insertions are treated as matched bases in the MQS calculation.
% The MQS is a read-length normalized value and it is in the range [0, 60).
\section{Mapping output}
Read mapping results for each library will be saved to a BAM or SAM format file.
......@@ -671,6 +675,11 @@ subjunc(index="my_index",readfile1="rnaseq-reads.txt.gz",output_file="subjunc_re
\end{Rcode}
\section{Index building}
Please refer to Section~\ref{sec:index}.
Same index is used for the mapping of RNA and DNA sequencing reads.
\section{Local read alignment}
The \code{Subread} and \code{Subjunc} can both be used to map RNA-seq reads to the reference genome.
......@@ -691,6 +700,13 @@ Reads may be re-aligned if required.
Output of {\Subjunc} aligner includes a list of discovered exon-exon junction locations and also the complete alignment results for the reads.
Table 2 describes the arguments used by the {\Subjunc} program.\\
\section{Memory use and speed}
Memory use and running time of \code{subread-buildindex} and \code{subread-align} (\code{buildindex} and \code{align} in \Rsubread) are the same as their memory use and running time in the analysis of DNA sequencing data (see Section~\ref{sec:memoryspeed}).
Compared to \code{subread-align} (\code{align} in \Rsubread), \code{subjunc} uses the same amount of memory when a full index is used and it uses slightly more memory (8.8GB of memory for human/mouse data) when a gapped index is used.
\code{subjunc} is also slightly slower than \code{subread-align}.
\section{Mapping output}
......@@ -766,7 +782,7 @@ Here we describe the {\featureCounts} program, an efficient and accurate read qu
\item It can count reads at feature (eg. exon) or meta-feature (eg. gene) level.
\item Highly flexible in counting multi-mapping and multi-overlapping reads. Such reads can be excluded, fully counted or fractionally counted.
\item It gives users full control on the summarization of paired-end reads, including allowing them to check if both ends are mapped and/or if the fragment length falls within the specified range.
\item Reduce ambuiguity in assigning read pairs by searching features that overlap with both reads from the pair.
\item Reduce ambiguity in assigning read pairs by searching features that overlap with both reads from the pair.
\item It allows users to specify whether chimeric fragments should be counted.
\item Automatically detect input format (SAM or BAM).
\item Automatically sort paired-end reads. Users can provide either location-sorted or name-sorted bams files to featureCounts. Read sorting is implemented on the fly and it only incurs minimal time cost.
......@@ -863,7 +879,6 @@ Specialized transcript-level quantification tools are recommended for counting r
Such tools use model-based approaches to deconvolve reads overlapping with multiple transcripts.
\subsection{Count multi-mapping reads and multi-overlapping reads}
A multi-mapping read is a read that maps to more than one location in the reference genome.
......@@ -889,12 +904,14 @@ Note that all the alignments reported for a multi-mapping read are expected to h
\label{sec:read_filtering}
{\featureCounts} implements a variety of read filters to facilitate flexible read counting, which should satisfy the requirement of most downstream analyses.
The order of these filters being applied is as follows (from highest to lowest):
The order of these filters being applied is as following (from first to last):
unmapped
$>$ read type
$>$ singleton
$>$ mapping quality
$>$ chimeric fragment
$>$ fragment length
$>$ duplication
$>$ duplicate
$>$ multi-mapping
$>$ secondary alignment
$>$ split reads (or nonsplit reads)
......@@ -903,6 +920,9 @@ $>$ overlapping length
$>$ assignment ambiguity.
Number of reads that were excluded from counting by each filter is reported in the program output, in addition to the reported read counts (see Section~\ref{sec:program_output}).
The `read type' filter removes those reads that have an unexpected read type and also cannot be counted with confidence.
For example, if there are single end reads included in a paired end read dataset (such data can be produced from a read trimming program for instance) and reads are required to be counted in a strand-specific manner, then all the single end reads will be excluded from counting because their strandness cannot be determined.
However if such reads are to be counted in an unstranded manner then all the single end reads will be considered for counting.
\subsection{Read manipulation}
......@@ -932,6 +952,8 @@ Filters supported by {\featureCounts} can be found in the list below:
\begin{itemize}
\item Unassigned\_Unmapped: unmapped reads cannot be assigned.
\item Unassigned\_Read\_Type: reads that have an unexpected read type (eg. being a single end read included in a paired end dataset) and also cannot be counted with confidence (eg. due to stranded counting). Such reads are typically generated from a read trimming program.
\item Unassigned\_Singleton: read pairs that have only one end mapped.
\item Unassigned\_MappingQuality: alignments with a mapping quality score lower than the threshold.
\item Unassigned\_Chimera: two ends in a paired end alignment are located on different chromosomes or have unexpected orientation.
\item Unassigned\_FragementLength: fragment length inferred from paired end alignment does not meet the length criteria.
......@@ -945,8 +967,7 @@ Filters supported by {\featureCounts} can be found in the list below:
\end{itemize}
In the counting summary these filters are listed in the same order as they were applied in counting process (see Section~\ref{sec:read_filtering}).
All categories are exclusive to each other, ie no alignments are assigned to more than one category.
If an alignment can be filtered out by more than one filter, it is always assigned to the first filter it encounters.
An unassigned alignment might fall into more than one category as listed above, however it will only be allocated to one category which is the category corresponding to the first filter that filtered this alignment out.
\subsection{Program usage}
......@@ -1253,6 +1274,7 @@ It takes only about half a minute to re-order a location-sorted BAM file includi
\section{flattenGTF}
Flatten features (eg. exons) provided in a GTF annotation and output the modified annotation to a SAF format annotation.
If overlapping features are found in the GTF annotation, this function can combine them to form a single large feature encompassing all the original features, or chop them into non-overlapping bins.
\section{promoterRegions}
......
......@@ -22,7 +22,7 @@
#include <assert.h>
#include <stdarg.h>
#include <math.h>
//#include <zutil.h>
#include <unistd.h>
#ifdef MACOS
......@@ -35,6 +35,10 @@
#include <net/if_dl.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdlib.h>
#include <string.h>
#include <mach/mach.h>
#include <mach/vm_statistics.h>
#else
......@@ -43,7 +47,8 @@
#include <netinet/in.h>
#include <net/if.h>
#endif
#include <unistd.h>
#include <sys/types.h>
#include <sys/sysinfo.h>
#endif
......@@ -53,6 +58,39 @@
#include "gene-algorithms.h"
#include "HelperFunctions.h"
size_t get_sys_mem_info(char * keyword){
FILE * mfp = fopen("/proc/meminfo","r");
if(mfp==NULL) return -1;
char linebuf[1000];
size_t ret = -1;
while(1){
char * rret = fgets(linebuf, 999, mfp);
if(memcmp( keyword, linebuf, strlen(keyword) ) == 0 && strstr(linebuf," kB")) {
ret=0;
int ii ,state=0;
for(ii=strlen(keyword);; ii++){
//SUBREADprintf("CH[%d] = %d '%c' at state %d\n", ii, linebuf[ii], linebuf[ii], state);
if(state == 0 && linebuf[ii]==' ') state = 1;
if(state == 1 && linebuf[ii]!=' ') state = 2;
if(state == 2 && linebuf[ii]==' ') state = 9999;
if(state == 2 && !isdigit(linebuf[ii])){
SUBREADprintf("WRONG MEMORY INFO '%s'\n", linebuf);
ret = -1;
break;
}
if(state == 2) ret = ret*10 + ( linebuf[ii] - '0' );
if(state >= 9999) {
ret *=1024;
break;
}
}
}
if(!rret) break;
}
fclose(mfp);
return ret;
}
char * get_short_fname(char * lname){
char * ret = lname;
......@@ -2648,3 +2686,31 @@ void main(){
}
#endif
int get_free_total_mem(size_t * total, size_t * free_mem){
#ifdef FREEBSD
return -1;
#endif
#ifdef MACOS
mach_msg_type_number_t count = HOST_VM_INFO_COUNT;
vm_statistics_data_t vmstat;
int page_size = getpagesize();
if(KERN_SUCCESS != host_statistics(mach_host_self(), HOST_VM_INFO, (host_info_t)&vmstat, &count))
return -1;
//printf("PSIZE=%d\nACT=%u; INACT=%u; FREE=%u\n", page_size, vmstat.active_count, vmstat.inactive_count, vmstat.free_count);
size_t btlen = sizeof(*total);
if(sysctl( (int[]) { CTL_HW, HW_MEMSIZE }, 2, total, &btlen, NULL, 0)) return -1;
*free_mem = (vmstat.free_count + vmstat.inactive_count) * 1llu * page_size;
return 0;
#else
struct sysinfo sinf;
sysinfo(&sinf);
size_t cached_mem = get_sys_mem_info("Cached:");
if(cached_mem<0)cached_mem=0;
*free_mem = cached_mem + sinf.bufferram+sinf.freeram;
*total = sinf.totalram;
return 0;
#endif
}
......@@ -235,5 +235,6 @@ void TNbignum_dec(struct bn* n); /* Decrement: subtr
void TNbignum_pow(struct bn* a, struct bn* b, struct bn* c); /* Calculate a^b -- e.g. 2^10 => 1024 */
void TNbignum_isqrt(struct bn* a, struct bn* b); /* Integer square root -- e.g. isqrt(5) => 2*/
void TNbignum_assign(struct bn* dst, struct bn* src); /* Copy src into dst -- dst := src */
size_t get_sys_mem_info(char * keyword);
int get_free_total_mem(size_t * total, size_t * free_mem);
#endif
......@@ -12,7 +12,7 @@ LDFLAGS = ${STATIC_MAKE} -pthread -lz ${MACOS} -O${OPT_LEVEL} -DMAKE_FOR_EXON -D
CC = ${CC_EXEC} ${CCFLAGS} -fmessage-length=0 -ggdb
ALL_LIBS= core core-junction core-indel sambam-file sublog gene-algorithms hashtable input-files sorted-hashtable gene-value-index exon-algorithms HelperFunctions interval_merge long-hashtable core-bigtable seek-zlib
ALL_LIBS= core core-junction core-indel sambam-file sublog gene-algorithms hashtable input-files sorted-hashtable gene-value-index exon-algorithms HelperFunctions interval_merge long-hashtable core-bigtable seek-zlib input-blc
ALL_OBJECTS=$(addsuffix .o, ${ALL_LIBS})
ALL_H=$(addsuffix .h, ${ALL_LIBS})
ALL_C=$(addsuffix .c, ${ALL_LIBS})
......
......@@ -63,6 +63,7 @@ struct SNP_Calling_Parameters{
int is_BAM_file_input;
int is_paired_end_data;
int is_coverage_calculation;
int use_soft_clipping_bases;
int fisher_exact_testlen;
......@@ -77,13 +78,13 @@ struct SNP_Calling_Parameters{
pthread_spinlock_t * output_fp_lock;
char pile_file_name[300];
char pile_file_name[MAX_FILE_NAME_LENGTH];
int delete_piles;
int disk_is_full;
char background_input_file[300];
char subread_index[300];
char known_SNP_vcf[300];
char background_input_file[MAX_FILE_NAME_LENGTH];
char subread_index[MAX_FILE_NAME_LENGTH];
char known_SNP_vcf[MAX_FILE_NAME_LENGTH];
unsigned int known_SNPs_number;
gene_offset_t * subread_index_offsets;
gene_value_index_t * subread_index_array;
......@@ -281,9 +282,11 @@ int read_tmp_block(struct SNP_Calling_Parameters * parameters, FILE * tmp_fp, ch
}
if(!(*SNP_bitmap_recorder))
{
(*SNP_bitmap_recorder)=malloc((reference_len/8)+2);
memset((*SNP_bitmap_recorder), 0 , (reference_len/8)+2);
(*SNP_bitmap_recorder)=malloc((reference_len/8)+200);
memset((*SNP_bitmap_recorder), 0 , (reference_len/8)+200);
}
if(SNP_rec.pos > block_no * BASE_BLOCK_LENGTH && SNP_rec.pos <= block_no * BASE_BLOCK_LENGTH + reference_len )
mask_snp_bitmap((*SNP_bitmap_recorder), SNP_rec.pos - block_no * BASE_BLOCK_LENGTH - 1);
parameters -> known_SNPs_number ++;
//printf("SNPat: %u\n", SNP_rec.pos);
......@@ -478,22 +481,19 @@ void fishers_test_on_POI(struct SNP_Calling_Parameters * parameters, float * snp
}
}
#define KEEP_WINDOW_SAME_SIZE 0
void fishers_test_on_block(struct SNP_Calling_Parameters * parameters, float * snp_fisher_raw, unsigned int * snp_voting_piles, char * referenced_genome, unsigned int reference_len, double multiplex_base, char * SNP_bitmap_recorder, unsigned short * snp_filter_background_matched, unsigned short * snp_filter_background_unmatched, int all_result_needed, char * chro_name, unsigned int block_start) {
int i, j, is_fresh_jumppd, remove_old;
int a=0, b=0, c=0, d=0, go_ahead = 0, left_tail = 0;
int i, j; // all_SNPs_at_pois=0;
int POIbase_MM=0, ALLbase_MM=0, POIbase_MAT=0, ALLbase_MAT=0;
long long int reference_len_long = reference_len;
/** | POI | All_Window (inc. POI)
* ----------+------+-------
* #mismatch | a | b
* #matched | c | d
**/
remove_old = 1;
for(i= - parameters -> fisher_exact_testlen;i<reference_len_long; i++)
{
a=0; c=0; is_fresh_jumppd = 1;
POIbase_MM=0; POIbase_MAT=0;
if(i>=0) {
char true_value = referenced_genome[i];
......@@ -502,88 +502,81 @@ void fishers_test_on_block(struct SNP_Calling_Parameters * parameters, float * s
for(j=0;j<4;j++)
{
if(j == true_value_int)
c = snp_voting_piles[ i * 4 + j ] ;
POIbase_MAT = snp_voting_piles[ i * 4 + j ] ;
else
a += snp_voting_piles[ i * 4 + j ] ;
POIbase_MM += snp_voting_piles[ i * 4 + j ] ;
}
// if the POI reached in this step is not fresh (known SNPs with < 80% support)
if(KEEP_WINDOW_SAME_SIZE && SNP_bitmap_recorder && is_snp_bitmap(SNP_bitmap_recorder,i))
{
is_fresh_jumppd = 0;
go_ahead --;
}
}
// if the current POI is fresh, then add a new FRESH base into the right window
if(is_fresh_jumppd){
while(i+parameters -> fisher_exact_testlen+go_ahead < reference_len_long)
{
int mm=0, pm=0, is_added = 0;
char true_value = referenced_genome[i + parameters -> fisher_exact_testlen + go_ahead];
if(i+parameters -> fisher_exact_testlen < reference_len_long) {
int mm=0, pm=0;
char true_value = referenced_genome[i + parameters -> fisher_exact_testlen ];
int true_value_int = (true_value=='A'?0:(true_value=='C'?1:(true_value=='G'?2:3)));
for(j=0;j<4;j++)
{
if(j == true_value_int)
pm = snp_voting_piles[ (i+parameters -> fisher_exact_testlen + go_ahead) *4 + j ] ;
pm = snp_voting_piles[ (i+parameters -> fisher_exact_testlen ) *4 + j ] ;
else
mm += snp_voting_piles[ (i+parameters -> fisher_exact_testlen + go_ahead) *4 + j ] ;
mm += snp_voting_piles[ (i+parameters -> fisher_exact_testlen ) *4 + j ] ;
}
unsigned int chro_pos = block_start + i + parameters -> fisher_exact_testlen;
if(0 && chro_pos < 16084550+12 && chro_pos > 16084550-12){
SUBREADprintf("ADD-BASE-AT-RIGHT : %s : %u : KNOWN-SNP : %d a b c d : %d %d %d %d A 4 > C : %d\n" , chro_name, chro_pos, is_snp_bitmap(SNP_bitmap_recorder,i + parameters -> fisher_exact_testlen), a,b,c,d, (a *4 >= c));
}
// if this is not POI : add this matched/unmatched into "flanking+PoI" counts.
if((!SNP_bitmap_recorder)||(!is_snp_bitmap(SNP_bitmap_recorder, go_ahead + i + parameters -> fisher_exact_testlen)))
{
d += pm;
b += mm;
is_added = 1;
SUBREADprintf("ADD-BASE-AT-RIGHT : %s : %u : KNOWN-SNP : %d a b c d : %d %d %d %d A 4 > C : %d\n" , chro_name, chro_pos, is_snp_bitmap(SNP_bitmap_recorder,i + parameters -> fisher_exact_testlen), POIbase_MM , ALLbase_MM, POIbase_MAT, ALLbase_MAT, (POIbase_MM *4 >= POIbase_MAT));
}
if(is_added || (!KEEP_WINDOW_SAME_SIZE)) break;
else go_ahead++;
if((!SNP_bitmap_recorder)||(!is_snp_bitmap(SNP_bitmap_recorder, i + parameters -> fisher_exact_testlen))) {
ALLbase_MAT += pm;
ALLbase_MM += mm;
}
}
// test the middle base
// a : mismatched at POI ; c : matched at POI
if(i>=0 && a > 0){
if(i>=0 && POIbase_MM > 0){
int Known_SNP_here = (SNP_bitmap_recorder!=NULL) && is_snp_bitmap(SNP_bitmap_recorder,i);
//if(Known_SNP_here)
// SUBREADprintf("SNP known at %s:%d\n" , chro_name, block_start + i);
//all_SNPs_at_pois += (Known_SNP_here)?1:0;
float observed_coverage = (b+d) *1./(1. + 2 * parameters -> fisher_exact_testlen) ;
float observed_coverage = (ALLbase_MM + ALLbase_MAT) *1./(1. + 2 * parameters -> fisher_exact_testlen) ;
double p_cutoff = pow(10, -(observed_coverage/multiplex_base));
p_cutoff = min(parameters -> cutoff_upper_bound, p_cutoff);
p_cutoff = max(1E-320, p_cutoff);
p_cutoff = max(1E-323, p_cutoff);
int flanking_unmatched = b-a;
int flanking_matched = d-c;
int flanking_unmatched = ALLbase_MM;
int flanking_matched = ALLbase_MAT;
if(SNP_bitmap_recorder == NULL || ! Known_SNP_here){
flanking_unmatched -= POIbase_MM;
flanking_matched -= POIbase_MAT;
}
unsigned int chro_pos = block_start + i;
if(0 && chro_pos < 16084550+12 && chro_pos > 16084550-12){
SUBREADprintf("KNOWN-SNP-AT-PoI at %s : %u : KNOWN-SNP : %d a b c d : %d %d %d %d A 4 > C : %d\n" , chro_name, chro_pos, is_snp_bitmap(SNP_bitmap_recorder,i), a,b,c,d, (a *4 >= c));
}
if(0 && SNP_bitmap_recorder && is_snp_bitmap(SNP_bitmap_recorder,i) && (a *4 >= c)) // no reason to include the current POI into the flanking window, even if it is a SNP.
{
flanking_unmatched = b;
flanking_matched = d;
if(0 && strcmp("8", chro_name)==0 && chro_pos < 144762490 + 12 && chro_pos > 144762490-12){
SUBREADprintf("KNOWN-SNP-AT-PoI at %s : %u : KNOWN-SNP : %d a b c d : %d %d %d %d A 4 > C : %d\n" , chro_name, chro_pos, is_snp_bitmap(SNP_bitmap_recorder,i), POIbase_MM, ALLbase_MM , POIbase_MAT, ALLbase_MAT, (POIbase_MM *4 >= POIbase_MAT));
}
float p_middle = fisher_exact_test(a, flanking_unmatched, c, flanking_matched);
float p_middle = fisher_exact_test(POIbase_MM, flanking_unmatched, POIbase_MAT, flanking_matched);
if(0 && chro_pos < 16084550+12 && chro_pos > 16084550-12){
SUBREADprintf("TEST: %s : %u a,b,c,d=%d %d %d %d; FU=%d FM=%d; Goahead=%d; Tailleft=%d; p=%G; p-cut=%G\n KNOWN-SNP : %d A 4 > C : %d\n", chro_name, chro_pos ,a,b,c,d, flanking_unmatched, flanking_matched, go_ahead, left_tail, p_middle, p_cutoff, is_snp_bitmap(SNP_bitmap_recorder,i), (a *4 >= c));
SUBREADprintf("TEST: %s : %u a,b,c,d=%d %d %d %d; FU=%d FM=%d; Goahead=%d; Tailleft=%d; p=%G; p-cut=%G\n KNOWN-SNP : %d A 4 > C : %d\n", chro_name, chro_pos ,POIbase_MM, ALLbase_MM,POIbase_MAT, ALLbase_MAT , flanking_unmatched, flanking_matched, 0, 0, p_middle, p_cutoff, is_snp_bitmap(SNP_bitmap_recorder,i), (POIbase_MM *4 >= POIbase_MAT));
}
if(all_result_needed || ( p_middle < p_cutoff && flanking_matched*20>(flanking_matched+ flanking_unmatched )*16))
//#warning " ===================== If a known SNP is at POI, the Fisher's p-value is only compared with the 'upper bound' of the p-value. ====================="
//#warning " ===================== '1 &&' is the switch ========================"
if(all_result_needed || ( 1 && Known_SNP_here && p_middle < parameters -> cutoff_upper_bound ) || ( p_middle < p_cutoff && flanking_matched*20>(flanking_matched+ flanking_unmatched )*16))
snp_fisher_raw [i] = p_middle;
else snp_fisher_raw [i] = -999;
if(flanking_unmatched <0)
SUBREADprintf("ERROR_AB: A=%d, B=%d, C=%d, D=%d, flanking_unmatched=%d\n", POIbase_MM, ALLbase_MM , POIbase_MAT, ALLbase_MAT,flanking_unmatched);
assert(flanking_unmatched>=0);
assert(flanking_matched>=0);
//if(strcmp(chro_name, "chr20")==0 && block_no == 3 && i == 57566366-1-3*15000000)
if(snp_filter_background_unmatched){
......@@ -593,47 +586,30 @@ void fishers_test_on_block(struct SNP_Calling_Parameters * parameters, float * s
fisher_test_size ++;
}else if(i>=0 && all_result_needed) snp_fisher_raw[i]=1.1;
if(remove_old)
{
while(i >= parameters -> fisher_exact_testlen + left_tail)
{
int mm=0, pm=0, is_removed = 0;
char true_value = referenced_genome[i - parameters -> fisher_exact_testlen-left_tail];
if(i >= parameters -> fisher_exact_testlen) {
int mm=0, pm=0;
char true_value = referenced_genome[i - parameters -> fisher_exact_testlen];
int true_value_int = (true_value=='A'?0:(true_value=='C'?1:(true_value=='G'?2:3)));
for(j=0;j<4;j++)
{
if(j == true_value_int)
pm = snp_voting_piles[ (i-parameters -> fisher_exact_testlen-left_tail) *4 + j ] ;
pm = snp_voting_piles[ (i-parameters -> fisher_exact_testlen) *4 + j ] ;
else
mm += snp_voting_piles[ (i-parameters -> fisher_exact_testlen-left_tail) *4 + j ] ;
mm += snp_voting_piles[ (i-parameters -> fisher_exact_testlen) *4 + j ] ;
}
if((!SNP_bitmap_recorder)||!is_snp_bitmap(SNP_bitmap_recorder, i - parameters -> fisher_exact_testlen-left_tail))
{
d-=pm;
b-=mm;
is_removed =1;
if((!SNP_bitmap_recorder)||!is_snp_bitmap(SNP_bitmap_recorder, i - parameters -> fisher_exact_testlen)) {
ALLbase_MAT-=pm;
ALLbase_MM-=mm;
}
if(is_removed || !KEEP_WINDOW_SAME_SIZE) break;
else if(KEEP_WINDOW_SAME_SIZE) left_tail--;
}
}
// if a fresh base is at POI, the tail is removed in next step.
if(is_fresh_jumppd || !KEEP_WINDOW_SAME_SIZE)
remove_old = 1;
else{
left_tail ++;
remove_old = 0;
}
}
// SUBREADprintf("BLOCK_FINISH; SNP found = %d\n", all_SNPs_at_pois);
}
int process_snp_votes(FILE *out_fp, unsigned int offset , unsigned int reference_len, char * referenced_genome, char * chro_name , char * temp_prefix, struct SNP_Calling_Parameters * parameters)
{
int block_no = (offset -1) / BASE_BLOCK_LENGTH, i, disk_is_full = 0;
char temp_file_name[300];
char temp_file_name[MAX_FILE_NAME_LENGTH];
FILE *tmp_fp;
unsigned int * snp_voting_piles, *snp_BGC_piles = NULL; // offset * 4 + "A/C/G/T"[0,1,2,3]
char * SNP_bitmap_recorder = NULL;
......@@ -1337,7 +1313,7 @@ void EXSNP_SIGINT_hook(int param)
int xk1, last_slash = -1;
if(_EXSNP_SNP_delete_temp_prefix != NULL)
{
char del2[300], del_suffix[200], del_name[400];
char del2[MAX_FILE_NAME_LENGTH], del_suffix[MAX_FILE_NAME_LENGTH], del_name[MAX_FILE_NAME_LENGTH];
#ifdef MAKE_STANDALONE
if(param)
SUBREADprintf("\n\nReceived a terminal signal. The temporary files were removed.\n");
......@@ -1399,7 +1375,7 @@ void EXSNP_SIGINT_hook(int param)
int SNP_calling(char * in_SAM_file, char * out_BED_file, char * in_FASTA_file, char * temp_location, unsigned int known_read_count, int threads, struct SNP_Calling_Parameters* parameters)
{
char temp_file_prefix[300];
char temp_file_prefix[MAX_FILE_NAME_LENGTH];
chromosome_t * known_chromosomes;
unsigned int real_read_count=0;
......@@ -1431,7 +1407,7 @@ int SNP_calling(char * in_SAM_file, char * out_BED_file, char * in_FASTA_file, c
if(parameters->subread_index[0])
{
char table_fn[350];
char table_fn[MAX_FILE_NAME_LENGTH+80];
parameters -> subread_index_offsets = (gene_offset_t*)malloc(sizeof(gene_offset_t));
load_offsets (parameters -> subread_index_offsets, parameters->subread_index);
......@@ -1448,7 +1424,7 @@ int SNP_calling(char * in_SAM_file, char * out_BED_file, char * in_FASTA_file, c
while(1)
{
int fpos0 = fpos;
char one_fn [300];
char one_fn [MAX_FILE_NAME_LENGTH];
while(in_SAM_file[fpos]!=',' && in_SAM_file[fpos]!=0)
fpos++;
strncpy(one_fn, in_SAM_file+fpos0, fpos-fpos0);
......@@ -1481,21 +1457,21 @@ int SNP_calling(char * in_SAM_file, char * out_BED_file, char * in_FASTA_file, c
while(1)
{
int fpos0 = fpos;
char one_fn [300];
char one_fn [MAX_FILE_NAME_LENGTH];
while(in_SAM_file[fpos]!=',' && in_SAM_file[fpos]!=0)
fpos++;
strncpy(one_fn, in_SAM_file+fpos0, fpos-fpos0);
one_fn[fpos-fpos0]=0;
if(break_SAM_file(one_fn, parameters -> is_BAM_file_input, temp_file_prefix, &real_read_count, &parameters->all_blocks, known_chromosomes, 1, parameters -> bases_ignored_head_tail, parameters->subread_index_array, parameters->subread_index_offsets, &parameters -> all_mapped_bases, parameters-> cigar_event_table, parameters->known_SNP_vcf, NULL, 1,0)) return -1;
if(break_SAM_file(one_fn, parameters -> is_BAM_file_input, temp_file_prefix, &real_read_count, &parameters->all_blocks, known_chromosomes, 1, parameters -> bases_ignored_head_tail, parameters->subread_index_array, parameters->subread_index_offsets, &parameters -> all_mapped_bases, parameters-> cigar_event_table, parameters->known_SNP_vcf, NULL, 0,0, parameters -> use_soft_clipping_bases)) return -1;
if(!in_SAM_file[fpos]) break;
fpos++;
}
if(parameters -> background_input_file[0])
{
char temp_file_prefix2[350];
char temp_file_prefix2[MAX_FILE_NAME_LENGTH+80];
sprintf(temp_file_prefix2, "%sBGC-", temp_file_prefix);
if(break_SAM_file(parameters -> background_input_file, parameters -> is_BAM_file_input, temp_file_prefix2, NULL, NULL, known_chromosomes, 1, parameters -> bases_ignored_head_tail, parameters->subread_index_array, parameters->subread_index_offsets, NULL, NULL, NULL, NULL, 1,0)) return -1;
if(break_SAM_file(parameters -> background_input_file, parameters -> is_BAM_file_input, temp_file_prefix2, NULL, NULL, known_chromosomes, 1, parameters -> bases_ignored_head_tail, parameters->subread_index_array, parameters->subread_index_offsets, NULL, NULL, NULL, NULL, 0,0, parameters -> use_soft_clipping_bases)) return -1;
}
......@@ -1503,7 +1479,7 @@ int SNP_calling(char * in_SAM_file, char * out_BED_file, char * in_FASTA_file, c
parameters -> real_read_count = real_read_count;
char qfname[330];
char qfname[MAX_FILE_NAME_LENGTH+40];
sprintf(qfname, "%s.qStatic", temp_file_prefix);
parameters -> final_phred_score = 0;
if (parameters -> delete_piles)
......@@ -1619,9 +1595,9 @@ int main_snp_calling_test(int argc,char ** argv)
int c;
char in_SAM_file[5000];
char out_BED_file[300];
char temp_path[300];
char in_FASTA_file[300];
char out_BED_file[MAX_FILE_NAME_LENGTH];
char temp_path[MAX_FILE_NAME_LENGTH];
char in_FASTA_file[MAX_FILE_NAME_LENGTH];
int threads, optindex=0;
int t=0, k;
struct SNP_Calling_Parameters parameters;
......@@ -1685,10 +1661,13 @@ int main_snp_calling_test(int argc,char ** argv)
print_usage_snp(argv[0]);
return 0;
}
while ((c = getopt_long (argc, argv, "7:N:C:a:i:g:o:bQ:p:f:n:r:x:w:s:t:T:v4",snp_long_options, &optindex))!=-1)
while ((c = getopt_long (argc, argv, "S7:N:C:a:i:g:o:bQ:p:f:n:r:x:w:s:t:T:v4",snp_long_options, &optindex))!=-1)
{
switch (c)
{
case 'S':
parameters.use_soft_clipping_bases = 1;
break;
case 'N':
strcpy(parameters.background_input_file, optarg);
break;
......@@ -1740,15 +1719,15 @@ int main_snp_calling_test(int argc,char ** argv)
break;
case 'g':
strncpy(in_FASTA_file, optarg,299);
strncpy(in_FASTA_file, optarg,MAX_FILE_NAME_LENGTH-1);
break;
case 'i':
strncpy(in_SAM_file, optarg,299);
strncpy(in_SAM_file, optarg,MAX_FILE_NAME_LENGTH-1);
break;
case 'o':
strncpy(out_BED_file, optarg,299);
strncpy(out_BED_file, optarg,MAX_FILE_NAME_LENGTH-1);
break;
case 'T':
......@@ -1765,7 +1744,7 @@ int main_snp_calling_test(int argc,char ** argv)
break;
case 'C':
strncpy(temp_path, optarg,299);
strncpy(temp_path, optarg,MAX_FILE_NAME_LENGTH-1);
break;
case 'r':
......
......@@ -1929,12 +1929,20 @@ int find_new_indels(global_context_t * global_context, thread_context_t * thread
voting_position = current_result -> selected_position;
//SUBREADprintf("NR=%d\n", indel_recorder[0]);
if(!indel_recorder[0])
return 0;
gene_value_index_t * current_value_index = thread_context?thread_context->current_value_index:global_context->current_value_index;
if(0){ // test "shifting indels" -- not needed when "NEW" go_Q used
int last_offset = 0;
for(i=0; indel_recorder[i] && (i<MAX_INDEL_SECTIONS); i+=3){
last_offset = indel_recorder[i+2];
}
if(0==last_offset) return 0;
}
for(i=0; indel_recorder[i] && (i<MAX_INDEL_SECTIONS); i+=3)
{
int indels = indel_recorder[i+2] - last_indel;
......@@ -2035,8 +2043,8 @@ int find_new_indels(global_context_t * global_context, thread_context_t * thread
{
int old_event_id = -1;
if(0)if(indel_left_boundary >= 44438630 + 1210 - 200 && indel_left_boundary <= 44438630 + 1210 + 200 && current_indel_len == 6){
SUBREADprintf("ADD AN INDEL: %s : %u ; len = %u\n", read_name, indel_left_boundary, current_indel_len);
if(0)if(indel_left_boundary >= 46481340 + 1210 - 200 && indel_left_boundary <= 46481340 + 1210 + 200 && abs(current_indel_len )==4){
SUBREADprintf("ADD AN INDEL: %s : %u ; len = %d\n", read_name, indel_left_boundary, current_indel_len);
}
chromosome_event_t * new_event = local_add_indel_event(global_context, thread_context, event_table, read_text + cursor_on_read + min(0,current_indel_len), indel_left_boundary - 1, current_indel_len, 1, ambiguous_count, 0, &old_event_id);
mark_gapped_read(current_result);
......@@ -4329,7 +4337,7 @@ int finalise_long_insertions_by_hashtable(global_context_t * global_context)
int chro_i;
assert(global_context -> index_block_number == 1);
unsigned int chro_start_pos = 0;
char tmp_fname[350];
char tmp_fname[MAX_FILE_NAME_LENGTH+40];
sprintf(tmp_fname,"%s.reassembly.fa", global_context->config.output_prefix);
global_context->long_insertion_FASTA_fp = f_subr_open(tmp_fname ,"wb");
......@@ -4395,7 +4403,7 @@ void COREMAIN_SIGINT_hook(int param)
int xk1, last_slash = -1;
if(_COREMAIN_delete_temp_prefix != NULL)
{
char del2[300], del_suffix[200], del_name[400];
char del2[MAX_FILE_NAME_LENGTH], del_suffix[MAX_FILE_NAME_LENGTH], del_name[MAX_FILE_NAME_LENGTH];
SUBREADprintf("\n\nReceived a terminal signal. The temporary files were removed.\n");
for(xk1=0; _COREMAIN_delete_temp_prefix[xk1]; xk1++)
{
......
......@@ -294,23 +294,20 @@ void search_events_to_front(global_context_t * global_context, thread_context_t
explain_context -> tmp_min_support_as_complex = min((tested_event -> is_donor_found_or_annotation & 64)?0x7fffffff:tested_event -> supporting_reads,explain_context -> tmp_min_support_as_complex);
explain_context -> tmp_min_unsupport = min(tested_event -> anti_supporting_reads,explain_context -> tmp_min_unsupport);
explain_context -> tmp_is_pure_donor_found_explain = explain_context -> tmp_is_pure_donor_found_explain && tested_event -> is_donor_found_or_annotation;
explain_context -> tmp_indel_penalty += ( tested_event -> event_type == CHRO_EVENT_TYPE_INDEL );
if(tested_event -> event_type == CHRO_EVENT_TYPE_FUSION && tested_event -> is_strand_jumped)
explain_context -> current_is_strand_jumped = !explain_context -> current_is_strand_jumped;
explain_context -> tmp_search_junctions[explain_context -> tmp_search_sections + 1].is_strand_jumped = explain_context -> current_is_strand_jumped;
explain_context -> tmp_search_sections ++;
if(0 && FIXLENstrcmp("R000404427", explain_context -> read_name) == 0)
SUBREADprintf("FRONT_ADD_EVENT : %s , %u ~ %u , INDELLEN=%d, TEST_READ_POS=%u, RPED=%u, ABSSTART=%u\n", explain_context -> read_name, tested_event -> event_small_side, tested_event -> event_large_side, tested_event -> indel_length, tested_read_pos, explain_context -> tmp_search_junctions[explain_context -> tmp_search_sections + 1].read_pos_end, new_read_head_abs_offset);
explain_context -> total_tries ++;
search_events_to_front(global_context, thread_context, explain_context, read_text + tested_event -> indel_at_junction + tested_read_pos - min(0, tested_event->indel_length), qual_text + tested_read_pos - min(0, tested_event->indel_length), new_read_head_abs_offset, new_remainder_len, sofar_matched + matched_bases_to_site - jump_penalty, tested_event -> connected_next_event_distance, 0);
explain_context -> tmp_search_sections --;
explain_context -> current_is_strand_jumped = current_is_jumped;
explain_context -> tmp_indel_penalty -= ( tested_event -> event_type == CHRO_EVENT_TYPE_INDEL );
explain_context -> tmp_min_support_as_complex = current_sup_as_complex;
explain_context -> tmp_support_as_simple = current_sup_as_simple;
//explain_context -> tmp_min_unsupport = current_unsup_as_simple;
......@@ -335,8 +332,11 @@ void new_explain_try_replace(global_context_t* global_context, thread_context_t
{
int is_better_result = 0, is_same_best = 0;
if(0 && FIXLENstrcmp("simulated.11420793", explain_context->read_name)==0){
SUBREADprintf("TRY_REPLACE : %s has best=%d, b_evn=%d, tscore=%d, t_evn=%d\n", explain_context->read_name, explain_context -> best_matching_bases , explain_context -> best_is_complex , explain_context-> tmp_total_matched_bases, explain_context -> tmp_search_sections );
}
if(explain_context -> best_matching_bases < explain_context-> tmp_total_matched_bases)
if(explain_context -> best_matching_bases - explain_context -> best_indel_penalty < explain_context-> tmp_total_matched_bases - explain_context -> tmp_indel_penalty)
{
is_better_result = 1;
explain_context -> best_is_complex = explain_context -> tmp_search_sections ;
......@@ -347,13 +347,14 @@ void new_explain_try_replace(global_context_t* global_context, thread_context_t
explain_context -> best_is_pure_donor_found_explain = explain_context -> tmp_is_pure_donor_found_explain;
explain_context -> second_best_matching_bases = max(explain_context -> second_best_matching_bases, explain_context -> best_matching_bases);
explain_context -> best_matching_bases = explain_context-> tmp_total_matched_bases ;
explain_context -> best_indel_penalty = explain_context -> tmp_indel_penalty;
}
else if(explain_context -> best_matching_bases == explain_context-> tmp_total_matched_bases)
else if(explain_context -> best_matching_bases - explain_context -> best_indel_penalty == explain_context-> tmp_total_matched_bases - explain_context -> tmp_indel_penalty)
{
// only gapped explainations are complex counted.
explain_context -> best_is_complex += explain_context -> tmp_search_sections;
explain_context -> second_best_matching_bases = explain_context -> best_matching_bases;
explain_context -> best_indel_penalty = explain_context -> tmp_indel_penalty;
if(0 && FIXLENstrcmp("R010442852", explain_context -> read_name) == 0){
SUBREADprintf("complexity: curr=%d, new=%d ; sections=%d\n", explain_context->best_min_support_as_complex, explain_context -> tmp_min_support_as_complex, explain_context -> tmp_search_sections );
......@@ -744,32 +745,21 @@ void search_events_to_back(global_context_t * global_context, thread_context_t *
explain_context -> tmp_min_support_as_complex = min((tested_event -> is_donor_found_or_annotation & 64)?0x7fffffff:tested_event -> supporting_reads,explain_context -> tmp_min_support_as_complex);
explain_context -> tmp_min_unsupport = min(tested_event -> anti_supporting_reads,explain_context -> tmp_min_unsupport);
explain_context -> tmp_is_pure_donor_found_explain = explain_context -> tmp_is_pure_donor_found_explain && tested_event -> is_donor_found_or_annotation;
explain_context -> tmp_indel_penalty += ( tested_event -> event_type == CHRO_EVENT_TYPE_INDEL );
if(tested_event -> event_type == CHRO_EVENT_TYPE_FUSION && tested_event -> is_strand_jumped)
explain_context -> current_is_strand_jumped = !explain_context -> current_is_strand_jumped;
explain_context -> tmp_search_junctions[explain_context -> tmp_search_sections + 1].is_strand_jumped = explain_context -> current_is_strand_jumped;
//if(explain_context->pair_number == 999999)
// SUBREADprintf(" === %d ; js=%d ===>>>\n", explain_context -> tmp_search_sections, is_junction_scanned);
explain_context -> tmp_search_sections ++;
//printf("SUGGEST_PREV at %u = %d (! %d)\n", tested_event -> event_small_side, tested_event -> connected_previous_event_distance, tested_event -> connected_next_event_distance);
//#warning ">>>>>>>>>>>>>>>> REMOVE IT <<<<<<<<<<<<<<<<<<<<<<"
//printf("OCT27-STEPSB-JB-%s: %u IN -> %u; NEW_TAIL=%d; ENV_CONN=%d; LEV=%d\n", explain_context -> read_name, potential_event_pos, new_read_tail_abs_offset, new_read_tail_pos, tested_event -> connected_previous_event_distance, explain_context -> tmp_search_sections);
explain_context -> total_tries ++;
search_events_to_back(global_context, thread_context, explain_context, read_text , qual_text, new_read_tail_abs_offset , new_read_tail_pos, sofar_matched + matched_bases_to_site - jump_penalty, tested_event -> connected_previous_event_distance, 0);
//#warning ">>>>>>>>>>>>>>>> REMOVE IT <<<<<<<<<<<<<<<<<<<<<<"
//printf("OCT27-STEPSB-JB-%s: %u OUT <- %u; LEN=%d\n", explain_context -> read_name, potential_event_pos, new_read_tail_abs_offset, explain_context -> tmp_search_sections);
explain_context -> tmp_search_sections --;
//if(explain_context->pair_number == 999999)
// SUBREADprintf(" === %d ===<<<\n", explain_context -> tmp_search_sections);
search_events_to_back(global_context, thread_context, explain_context, read_text , qual_text, new_read_tail_abs_offset , new_read_tail_pos, sofar_matched + matched_bases_to_site - jump_penalty, tested_event -> connected_previous_event_distance, 0);
explain_context -> tmp_search_sections --;
explain_context -> tmp_indel_penalty -= ( tested_event -> event_type == CHRO_EVENT_TYPE_INDEL );
explain_context -> current_is_strand_jumped = current_is_jumped;
explain_context -> tmp_min_support_as_complex = current_sup_as_complex;
explain_context -> tmp_support_as_simple = current_sup_as_simple;
//explain_context -> tmp_min_unsupport = current_unsup_as_simple;
explain_context -> tmp_is_pure_donor_found_explain = current_pure_donor_found;
}
//if(global_context ->config.limited_tree_scan) break;
......@@ -2665,6 +2655,8 @@ unsigned int explain_read(global_context_t * global_context, thread_context_t *
explain_context.best_read_id = best_read_id;
explain_context.total_tries = 0;
if(0 && FIXLENstrcmp("simulated.24700032", explain_context.read_name)==0)SUBREADprintf("BBFINAL %s SEL_POS=%u COV=%d - %d\n", explain_context.read_name, current_result -> selected_position, current_result -> confident_coverage_start, current_result -> confident_coverage_end);
unsigned int back_search_tail_position,front_search_start_position;
unsigned short back_search_read_tail, front_search_read_start;
......@@ -2680,8 +2672,10 @@ unsigned int explain_read(global_context_t * global_context, thread_context_t *
explain_context.all_back_alignments = 0;
explain_context.tmp_search_sections = 0;
explain_context.best_indel_penalty =0;
explain_context.best_matching_bases = -9999;
explain_context.second_best_matching_bases = -9999;
explain_context.tmp_indel_penalty = 0;
explain_context.tmp_total_matched_bases = 0;
explain_context.is_currently_tie = 0;
explain_context.best_is_complex = 0;
......@@ -2705,6 +2699,7 @@ unsigned int explain_read(global_context_t * global_context, thread_context_t *
}
search_events_to_back(global_context, thread_context, &explain_context, read_text , qual_text, back_search_tail_position , back_search_read_tail, 0, 0, 1);
int back_penalty = explain_context.best_indel_penalty;
//int is_backsearch_tie = explain_context.is_currently_tie;
int back_search_matches_diff = -9999;
......@@ -2743,9 +2738,12 @@ unsigned int explain_read(global_context_t * global_context, thread_context_t *
*/
explain_context.all_front_alignments = 0;
explain_context.tmp_search_sections = 0;
explain_context.best_indel_penalty = 0;
explain_context.best_matching_bases = -9999;
explain_context.second_best_matching_bases = -9999;
explain_context.tmp_total_matched_bases = 0;
explain_context.tmp_indel_penalty = 0;
explain_context.is_currently_tie = 0;
explain_context.best_is_complex = 0;
explain_context.best_support_as_simple = 0;
......@@ -2774,6 +2772,7 @@ unsigned int explain_read(global_context_t * global_context, thread_context_t *
if(0 && FIXLENstrcmp("R_chr901_932716_91M1D9M",explain_context.read_name ) == 0)
SUBREADprintf("F_SEARCH has found %d result sets\n", explain_context.all_front_alignments);
explain_context.best_indel_penalty += back_penalty;
//int is_frontsearch_tie = explain_context.is_currently_tie;
//SUBREADprintf("DFI:%d - %d;\n", explain_context.best_matching_bases , explain_context.second_best_matching_bases);
......@@ -3209,6 +3208,7 @@ unsigned int finalise_explain_CIGAR(global_context_t * global_context, thread_co
for(xk1=0; xk1<explain_context -> result_back_junction_numbers[back_i]; xk1++)
{
int section_length = explain_context -> result_back_junctions[back_i][xk1].read_pos_end - explain_context -> result_back_junctions[back_i][xk1].read_pos_start;
if(0 && FIXLENstrcmp("simulated.11420793", explain_context->read_name)==0)SUBREADprintf("FINAL_EXPLAIN %s BACK_%d SEC_%d OLD_START=%d SEC_LENG=%d\n", explain_context->read_name, back_i, xk1, explain_context -> result_back_junctions[back_i][xk1].abs_offset_for_start, section_length);
unsigned int new_start_pos;
if(explain_context -> result_back_junctions[back_i][xk1].is_strand_jumped)
......@@ -3337,9 +3337,12 @@ unsigned int finalise_explain_CIGAR(global_context_t * global_context, thread_co
unsigned int final_position;
if( explain_context -> result_back_junction_numbers[back_i] + explain_context -> result_front_junction_numbers[front_i] <= 2) final_position = result -> selected_position;
// #warning "'0 &&' is because there could be indels in the high-confident region but this indel is finally disused."
if( 0 && explain_context -> result_back_junction_numbers[back_i] + explain_context -> result_front_junction_numbers[front_i] <= 2) final_position = result -> selected_position;
else final_position = explain_context -> result_back_junctions[back_i][0].abs_offset_for_start;
if(0 && FIXLENstrcmp("simulated.11420793", explain_context->read_name)==0)SUBREADprintf("FFFINAL %s : POS=%u, ABS=%u\n", explain_context->read_name, final_position, explain_context -> result_back_junctions[back_i][0].abs_offset_for_start);
int is_exonic_read_fraction_OK = 1;
if( global_context -> config.minimum_exonic_subread_fraction > 0.0000001 && (!is_junction_read) && result -> used_subreads_in_vote>0)
......@@ -3372,10 +3375,10 @@ unsigned int finalise_explain_CIGAR(global_context_t * global_context, thread_co
// ACDB PVDB TTTS
//#warning " ========== COMMENT THIS LINE !! ========="
if(0 && FIXLENstrcmp("R000404427", explain_context -> read_name) ==0){
if(0 && FIXLENstrcmp("simulated.11420793", explain_context -> read_name) ==0){
char outpos1[100];
absoffset_to_posstr(global_context, final_position, outpos1);
SUBREADprintf("FINALQUAL %s : FINAL_POS=%s ( %u )\tCIGAR=%s\tMM=%d / MAPLEN=%d > %d?\tVOTE=%d > %0.2f x %d ? MASK=%d\tQUAL=%d\tBRNO=%d\nKNOWN_JUNCS=%d\n\n", explain_context -> read_name, outpos1 , final_position , tmp_cigar, mismatch_bases, non_clipped_length, applied_mismatch, result -> selected_votes, global_context -> config.minimum_exonic_subread_fraction,result-> used_subreads_in_vote, result->result_flags, final_qual, explain_context -> best_read_id, known_junction_supp);
SUBREADprintf("FINALQUAL %s : FINAL_POS=%s ( %u )\tCIGAR=%s\tMM=%d / MAPLEN=%d > %d?\tVOTE=%d > %0.2f x %d ? MASK=%d\tQUAL=%d\tBRNO=%d\nKNOWN_JUNCS=%d PENALTY=%d\n\n", explain_context -> read_name, outpos1 , final_position , tmp_cigar, mismatch_bases, non_clipped_length, applied_mismatch, result -> selected_votes, global_context -> config.minimum_exonic_subread_fraction,result-> used_subreads_in_vote, result->result_flags, final_qual, explain_context -> best_read_id, known_junction_supp, explain_context -> best_indel_penalty);
}
......@@ -3388,6 +3391,7 @@ unsigned int finalise_explain_CIGAR(global_context_t * global_context, thread_co
realign_res -> mapping_result = result;
realign_res -> chromosomal_length = chromosomal_length;
realign_res -> known_junction_supp = known_junction_supp;
realign_res -> final_penalty = explain_context -> best_indel_penalty;
if(mismatch_bases > applied_mismatch ) realign_res -> realign_flags |= CORE_TOO_MANY_MISMATCHES;
else realign_res -> realign_flags &= ~CORE_TOO_MANY_MISMATCHES;
......
......@@ -79,7 +79,9 @@ typedef struct{
int best_matching_bases;
int best_second_match_diff;
int second_best_matching_bases;
int best_indel_penalty;
int tmp_total_matched_bases;
int tmp_indel_penalty;
int is_currently_tie;
int best_is_complex;
int best_support_as_simple;
......
This diff is collapsed.
......@@ -62,7 +62,7 @@
#define CORE_MAX_CIGAR_STR_LEN 110
#define CORE_ADDITIONAL_INFO_LENGTH 400
#define READPAIRS_FOR_CALC_EXPT_TLEN 1000
typedef struct{
subread_read_number_t fragments;
......@@ -384,6 +384,7 @@ typedef struct{
short chromosomal_length;
short MAPQ_adjustment;
int known_junction_supp;
int final_penalty;
} realignment_result_t;
#define BUCKETED_TABLE_INIT_ITEMS 3
......@@ -541,6 +542,8 @@ typedef struct{
double timecost_before_realign;
double timecost_for_realign;
unsigned long long expected_TLEN_sum;
unsigned int expected_TLEN_read_numbers;
unsigned long long all_processed_reads;
unsigned long long all_correct_PE_reads;
unsigned int all_junctions;
......
......@@ -20,8 +20,8 @@ int max_M = 10;
int paired_end = 0;
int ignore_bad_pair = 0;
int base_values = 0;
char input_file_name[300];
char output_file_name[300];
char input_file_name[MAX_FILE_NAME_LENGTH];
char output_file_name[MAX_FILE_NAME_LENGTH];
HashTable * cov_bin_table;
......
......@@ -48,12 +48,16 @@
#include "seek-zlib.h"
#include "HelperFunctions.h"
#define MERGING_MODE_CHOPPING 100 // keep all the edges : [100,199] + [150,249] = [100,149], [150,199], [200,249]
#define MERGING_MODE_MERGING 0 // just merge into one: [100,199] + [150,249] = [100, 249]
typedef struct{
char GTF_gene_id_column[MAX_READ_NAME_LEN];
char GTF_wanted_feature_type[MAX_READ_NAME_LEN];
char GTF_file_name[MAX_FILE_NAME_LENGTH];
char output_file_name[MAX_FILE_NAME_LENGTH];
FILE * output_FP;
int merging_mode;
HashTable * gene_to_chro_strand_table;
HashTable * gene_chro_strand_to_exons_table;
......@@ -89,6 +93,9 @@ void flatAnno_print_usage(){
SUBREADputs(" This attribute type is used to group features into meta-");
SUBREADputs(" features.");
SUBREADputs("");
SUBREADputs(" -C Merging overlapping exons into multiple non-overlapping exons but");
SUBREADputs(" all the edges are kept.");
SUBREADputs("");
}
int flatAnno_finalise(flatAnno_context_t * context){
......@@ -145,6 +152,77 @@ int flatAnno_do_anno_merge_one_array_compare(void * vL, void * vR){
return 0;
}
void flatAnno_do_anno_chop_one_array(void * key, void * hashed_obj, HashTable * tab){
ArrayList * this_list = hashed_obj; // each element is a pair of integers defining start and end of an exon.
ArrayList * edge_before_me_List = ArrayListCreate( this_list->numOfElements * 2 );
int i;
for(i=0; i<this_list -> numOfElements; i++){
int * curr_2i = this_list -> elementList[ i ];
int findi, endi;
for(endi = 0; endi < 2; endi++){
assert(curr_2i[endi]>0);
int search_tag = curr_2i[endi] + endi, found=0;
for(findi = 0; findi < edge_before_me_List->numOfElements; findi++){
if(ArrayListGet(edge_before_me_List,findi)-NULL == search_tag){
found = 1;
break;
}
}
if(0==found) ArrayListPush(edge_before_me_List, NULL+search_tag);
}
}
ArrayListSort(edge_before_me_List, NULL);
char * continue_after_an_edge = malloc(edge_before_me_List -> numOfElements -1);
memset(continue_after_an_edge,0, edge_before_me_List -> numOfElements -1);
int missed_gaps = edge_before_me_List->numOfElements -1;
for(i=0; i<edge_before_me_List->numOfElements -1; i++){
int an_edge_before_me = edge_before_me_List-> elementList[i]-NULL;
int j;
for(j=0; j<this_list->numOfElements; j++){
int * curr_2i = this_list -> elementList[ j ];
if( curr_2i[0] <= an_edge_before_me && curr_2i[1] >= an_edge_before_me ) {
continue_after_an_edge[i]=1;
missed_gaps --;
break;
}
}
}
long old_capacity = this_list -> capacityOfElements, old_items = this_list -> numOfElements;
if(edge_before_me_List->numOfElements -1 - missed_gaps > old_capacity){
this_list -> elementList = realloc( this_list -> elementList, sizeof(void*) * edge_before_me_List->numOfElements -1 -missed_gaps );
this_list -> capacityOfElements = edge_before_me_List->numOfElements -1 -missed_gaps ;
}
int written_i = 0;
for(i=0; i<edge_before_me_List->numOfElements -1; i++){
int * curr_2i = NULL;
//if(edge_before_me_List->elementList[i] - NULL == 108395906)SUBREADprintf("CV_I %d = %ld ~ %ld ; CTN=%d\n", i, edge_before_me_List->elementList[i] - NULL, edge_before_me_List -> elementList[i+1] -NULL-1, continue_after_an_edge[i]);
if(0==continue_after_an_edge[i]){
// SUBREADprintf("NC_I %d ~ [ 0 , %ld ]\n", i, edge_before_me_List->numOfElements -2);
assert(i>0 && i<edge_before_me_List->numOfElements -2);
assert(continue_after_an_edge[i+1] && continue_after_an_edge[i-1] ); // otherwise it is a singleton edge.
continue;
}
if(written_i < old_items){
curr_2i = this_list -> elementList[ written_i ];
}else{
curr_2i = malloc(sizeof(int)*2);
this_list -> elementList[ written_i ] = curr_2i;
}
curr_2i[0] = edge_before_me_List -> elementList[i] -NULL;
curr_2i[1] = edge_before_me_List -> elementList[i+1] -NULL-1;
written_i++;
}
assert(written_i == edge_before_me_List->numOfElements -1 - missed_gaps );
for(i=written_i; i<old_items; i++) free( this_list -> elementList[i]);
this_list -> numOfElements = written_i;
ArrayListDestroy(edge_before_me_List);
free(continue_after_an_edge);
}
void flatAnno_do_anno_merge_one_array(void * key, void * hashed_obj, HashTable * tab){
ArrayList * this_list = hashed_obj;
ArrayListSort(this_list, flatAnno_do_anno_merge_one_array_compare);
......@@ -171,7 +249,7 @@ void flatAnno_do_anno_merge_one_array(void * key, void * hashed_obj, HashTable *
int flatAnno_do_anno_merge_and_write(flatAnno_context_t * context){
context -> gene_chro_strand_to_exons_table -> appendix1 = context;
HashTableIteration(context -> gene_chro_strand_to_exons_table, flatAnno_do_anno_merge_one_array);
HashTableIteration(context -> gene_chro_strand_to_exons_table, context -> merging_mode == MERGING_MODE_CHOPPING? flatAnno_do_anno_chop_one_array :flatAnno_do_anno_merge_one_array);
ArrayList * all_chro_st_list = HashTableKeyArray(context -> gene_chro_strand_to_exons_table);
ArrayListSort(all_chro_st_list, (int(*)(void * , void*))strcmp);
......@@ -222,7 +300,7 @@ int flatAnno_start(flatAnno_context_t * context){
SUBREADprintf("Error: no output file is specified.\n");
return -1;
}
SUBREADprintf("Flatting GTF file: %s\n", context -> GTF_file_name);
SUBREADprintf("Flattening GTF file: %s\n", context -> GTF_file_name);
SUBREADprintf("Output SAF file: %s\n", context -> output_file_name);
context -> output_FP = fopen(context -> output_file_name, "w");
if(NULL == context -> output_FP ){
......@@ -253,8 +331,11 @@ int R_flattenAnnotations(int argc, char ** argv)
optind=0;
opterr=1;
optopt=63;
while ((c = getopt_long (argc, argv, "t:g:a:o:v?", long_options, &option_index)) != -1)
while ((c = getopt_long (argc, argv, "Ct:g:a:o:v?", long_options, &option_index)) != -1)
switch(c) {
case 'C':
context.merging_mode = MERGING_MODE_CHOPPING;
break;
case 'o':
strcpy(context.output_file_name, optarg);
break;
......
......@@ -155,7 +155,7 @@ void full_scan_read(char * index_name, char * read_str)
int main (int argc , char ** argv)
{
char index_name [1200];
char index_name [MAX_FILE_NAME_LENGTH];
char read_str [1208];
char c;
int i;
......@@ -166,7 +166,7 @@ int main (int argc , char ** argv)
switch(c)
{
case 'i':
strncpy(index_name, optarg, 1199);
strncpy(index_name, optarg, MAX_FILE_NAME_LENGTH-1);
break;
case 'm':
MIN_REPORTING_RATIO = atof(optarg);
......
......@@ -92,6 +92,7 @@ int print_usage_gen_reads(char * pgname) {
static struct option long_options[] =
{
{"quiet", no_argument, 0, 'Q'},
{"truthInReadNames", no_argument, 0, 'T'},
{"simpleTranscriptId", no_argument, 0, 'C'},
{"summarizeFasta", no_argument, 0, 'M'},
......@@ -129,6 +130,7 @@ typedef struct {
int insertion_length_max;
int insertion_length_min;
float insertion_length_sigma;
int quiet;
int read_length;
int no_actual_reads;
......@@ -500,6 +502,13 @@ int grc_summary_fasta(genRand_context_t * grc){
char clinebuf[TRANSCRIPT_FASTA_LINE_WIDTH];
int rlength = autozip_gets(&auto_FP, clinebuf, TRANSCRIPT_FASTA_LINE_WIDTH -1);
if(rlength < 1)break;
if(rlength > 1 && clinebuf[rlength-2] =='\r' ){
rlength--;
clinebuf[rlength-1]='\n';
clinebuf[rlength]=0;
}else if(clinebuf[rlength-1]=='\r'){
clinebuf[rlength-1]='\n';
}
if(rlength >= TRANSCRIPT_FASTA_LINE_WIDTH -1 || clinebuf[rlength]!='\0' || clinebuf[rlength-1]!='\n'){
SUBREADprintf("Error: The line width of the fasta file excessed %d bytes!\n", TRANSCRIPT_FASTA_LINE_WIDTH);
ret = 1;
......@@ -595,7 +604,7 @@ int grc_summary_fasta(genRand_context_t * grc){
}
HashTableDestroy(reprs_tab);
if(total_rep_seq)SUBREADprintf("Warning: %d duplicate sequences were found in the input. Please check the summary table.\n",total_rep_seq);
if(total_rep_seq && grc->quiet==0)SUBREADprintf("Warning: %d duplicate sequences were found in the input. Please check the summary table.\n",total_rep_seq);
ArrayListDestroy(seq_name_list);
HashTableDestroy(seq_length_tab);
......@@ -882,8 +891,11 @@ int gen_rnaseq_reads_main(int argc, char ** argv)
long long seed = -1;
while ((c = getopt_long (argc, argv, "O:TCxS:V:N:X:F:L:q:r:t:e:o:pM?", long_options, &option_index)) != -1) {
while ((c = getopt_long (argc, argv, "QO:TCxS:V:N:X:F:L:q:r:t:e:o:pM?", long_options, &option_index)) != -1) {
switch(c){
case 'Q':
grc.quiet=1;
break;
case 'x':
grc.no_actual_reads = 1;
break;
......
......@@ -402,7 +402,7 @@ int remove_repeated_reads(gehash_t * table, gehash_t * huge_table, int index_thr
else if(val_len[j]> index_threshold)
{
gehash_remove(table, vals[j]);
gehash_insert(huge_table, vals[j], 1);
gehash_insert(huge_table, vals[j], 1, NULL);
all_removed += val_len[j];
}
}
......@@ -1131,7 +1131,7 @@ void explain_indel_in_middle(gene_allvote_t* allvote, int qid , int pos, char *
int evaluate_piece(char * piece_str, int chron, int offset, int is_counterpart, int start_pos, int end_pos)
{
char fname[300];
char fname[MAX_FILE_NAME_LENGTH];
int inner_pos = 0, i;
FILE * fp;
char next_char=0;
......@@ -1474,7 +1474,7 @@ void destroy_offsets(gene_offset_t* offsets)
int load_offsets(gene_offset_t* offsets , const char index_prefix [])
{
char fn[300];
char fn[MAX_FILE_NAME_LENGTH];
FILE * fp;
int n=0;
int padding = 0;
......@@ -1509,7 +1509,7 @@ int load_offsets(gene_offset_t* offsets , const char index_prefix [])
{
int i=0, step = 0, j=0;
read_line(299,fp, fn, 0);
read_line(MAX_FILE_NAME_LENGTH-1,fp, fn, 0);
if (strlen(fn)<2)continue;
while (fn[i])
{
......
......@@ -26,11 +26,11 @@
#include "input-files.h"
int gvindex_init(gene_value_index_t * index, unsigned int start_point)
int gvindex_init(gene_value_index_t * index, unsigned int start_point, unsigned int total_bases_estimated)
{
index->start_point = start_point;
index->memory_block_size = 100000000;
index->values = malloc(index->memory_block_size);
index->memory_block_size = total_bases_estimated/4+10;
index->values = calloc(index->memory_block_size,1);
if(!index->values)
{
SUBREADputs(MESSAGE_OUT_OF_MEMORY);
......@@ -141,6 +141,7 @@ void gvindex_set(gene_value_index_t * index, gehash_data_t offset, gehash_key_t
unsigned int offset_byte_margin = offset_byte + (padding / 8 +1);
if(index -> memory_block_size <= offset_byte_margin + 2)
{
assert(index -> memory_block_size> offset_byte_margin+2);
index -> memory_block_size *= 1.5;
index->values = realloc(index->values, index -> memory_block_size);
}
......