Commit 6b5aa132 authored by Steffen Möller's avatar Steffen Möller

New upstream version 0.7.16

parent 9b4cfaa2
Release 0.7.16 (30 July 2017)
-----------------------------
This release added a couple of minor features and incorporated multiple pull
requests, including:
* Added option -5, which is useful to some Hi-C pipelines.
* Fixed an error with samtools sorting (#129). Updated download link for
GRCh38 (#123). Fixed README MarkDown formatting (#70). Addressed multiple
issues via a collected pull request #139 by @jmarshall. Avoid malformatted
SAM header when -R is used with TAB (#84). Output mate CIGAR (#138).
(0.7.16: 30 July 2017, r1180)
Release 0.7.15 (31 May 2016)
----------------------------
Fixed a long existing bug which potentially leads underestimated insert size
Fixed a long existing bug which potentially leads to underestimated insert size
upper bound. This bug should have little effect in practice.
(0.7.15: 31 May 2016, r1140)
......
[![Build Status](https://travis-ci.org/lh3/bwa.svg?branch=dev)](https://travis-ci.org/lh3/bwa)
[![Build Status](https://drone.io/github.com/lh3/bwa/status.png)](https://drone.io/github.com/lh3/bwa/latest)
##Getting started
## Getting started
git clone https://github.com/lh3/bwa.git
cd bwa; make
......@@ -8,7 +7,7 @@
./bwa mem ref.fa read-se.fq.gz | gzip -3 > aln-se.sam.gz
./bwa mem ref.fa read1.fq read2.fq | gzip -3 > aln-pe.sam.gz
##Introduction
## Introduction
BWA is a software package for mapping DNA sequences against a large reference
genome, such as the human genome. It consists of three algorithms:
......@@ -24,7 +23,7 @@ reference genome (the **index** command). Alignment algorithms are invoked with
different sub-commands: **aln/samse/sampe** for BWA-backtrack,
**bwasw** for BWA-SW and **mem** for the BWA-MEM algorithm.
##Availability
## Availability
BWA is released under [GPLv3][1]. The latest source code is [freely
available at github][2]. Released packages can [be downloaded][3] at
......@@ -37,7 +36,7 @@ In addition to BWA, this self-consistent package also comes with bwa-associated
and 3rd-party tools for proper BAM-to-FASTQ conversion, mapping to ALT contigs,
adapter triming, duplicate marking, HLA typing and associated data files.
##Seeking helps
## Seeking help
The detailed usage is described in the man page available together with the
source code. You can use `man ./bwa.1` to view the man page in a terminal. The
......@@ -46,7 +45,7 @@ have questions about BWA, you may [sign up the mailing list][6] and then send
the questions to [bio-bwa-help@sourceforge.net][7]. You may also ask questions
in forums such as [BioStar][8] and [SEQanswers][9].
##Citing BWA
## Citing BWA
* Li H. and Durbin R. (2009) Fast and accurate short read alignment with
Burrows-Wheeler transform. *Bioinformatics*, **25**, 1754-1760. [PMID:
......@@ -63,7 +62,7 @@ in forums such as [BioStar][8] and [SEQanswers][9].
Please note that the last reference is a preprint hosted at [arXiv.org][13]. I
do not have plan to submit it to a peer-reviewed journal in the near future.
##Frequently asked questions (FAQs)
## Frequently asked questions (FAQs)
1. [What types of data does BWA work with?](#type)
2. [Why does a read appear multiple times in the output SAM?](#multihit)
......@@ -73,7 +72,7 @@ do not have plan to submit it to a peer-reviewed journal in the near future.
6. [Does BWA work with ALT contigs in the GRCh38 release?](#altctg)
7. [Can I just run BWA-MEM against GRCh38+ALT without post-processing?](#postalt)
####<a name="type"></a>1. What types of data does BWA work with?
#### <a name="type"></a>1. What types of data does BWA work with?
BWA works with a variety types of DNA sequence data, though the optimal
algorithm and setting may vary. The following list gives the recommended
......@@ -108,7 +107,7 @@ errors given longer query sequences as the chance of missing all seeds is small.
As is shown above, with non-default settings, BWA-MEM works with Oxford Nanopore
reads with a sequencing error rate over 20%.
####<a name="multihit"></a>2. Why does a read appear multiple times in the output SAM?
#### <a name="multihit"></a>2. Why does a read appear multiple times in the output SAM?
BWA-SW and BWA-MEM perform local alignments. If there is a translocation, a gene
fusion or a long deletion, a read bridging the break point may have two hits,
......@@ -116,18 +115,18 @@ occupying two lines in the SAM output. With the default setting of BWA-MEM, one
and only one line is primary and is soft clipped; other lines are tagged with
0x800 SAM flag (supplementary alignment) and are hard clipped.
####<a name="4gb"></a>3. Does BWA work on reference sequences longer than 4GB in total?
#### <a name="4gb"></a>3. Does BWA work on reference sequences longer than 4GB in total?
Yes. Since 0.6.x, all BWA algorithms work with a genome with total length over
4GB. However, individual chromosome should not be longer than 2GB.
####<a name="pe0"></a>4. Why can one read in a pair has high mapping quality but the other has zero?
#### <a name="pe0"></a>4. Why can one read in a pair have a high mapping quality but the other has zero?
This is correct. Mapping quality is assigned for individual read, not for a read
pair. It is possible that one read can be mapped unambiguously, but its mate
falls in a tandem repeat and thus its accurate position cannot be determined.
####<a name="endref"></a>5. How can a BWA-backtrack alignment stands out of the end of a chromosome?
#### <a name="endref"></a>5. How can a BWA-backtrack alignment stand out of the end of a chromosome?
Internally BWA concatenates all reference sequences into one long sequence. A
read may be mapped to the junction of two adjacent reference sequences. In this
......@@ -135,7 +134,7 @@ case, BWA-backtrack will flag the read as unmapped (0x4), but you will see
position, CIGAR and all the tags. A similar issue may occur to BWA-SW alignment
as well. BWA-MEM does not have this problem.
####<a name="altctg"></a>6. Does BWA work with ALT contigs in the GRCh38 release?
#### <a name="altctg"></a>6. Does BWA work with ALT contigs in the GRCh38 release?
Yes, since 0.7.11, BWA-MEM officially supports mapping to GRCh38+ALT.
BWA-backtrack and BWA-SW don't properly support ALT mapping as of now. Please
......@@ -143,7 +142,7 @@ see [README-alt.md][18] for details. Briefly, it is recommended to use
[bwakit][17], the binary release of BWA, for generating the reference genome
and for mapping.
####<a name="postalt"></a>7. Can I just run BWA-MEM against GRCh38+ALT without post-processing?
#### <a name="postalt"></a>7. Can I just run BWA-MEM against GRCh38+ALT without post-processing?
If you are not interested in hits to ALT contigs, it is okay to run BWA-MEM
without post-processing. The alignments produced this way are very close to
......
......@@ -299,9 +299,9 @@ int64_t bns_fasta2bntseq(gzFile fp_fa, const char *prefix, int for_only)
// read sequences
while (kseq_read(seq) >= 0) pac = add1(seq, bns, pac, &m_pac, &m_seqs, &m_holes, &q);
if (!for_only) { // add the reverse complemented sequence
m_pac = (bns->l_pac * 2 + 3) / 4 * 4;
pac = realloc(pac, m_pac/4);
memset(pac + (bns->l_pac+3)/4, 0, (m_pac - (bns->l_pac+3)/4*4) / 4);
int64_t ll_pac = (bns->l_pac * 2 + 3) / 4 * 4;
if (ll_pac > m_pac) pac = realloc(pac, ll_pac/4);
memset(pac + (bns->l_pac+3)/4, 0, (ll_pac - (bns->l_pac+3)/4*4) / 4);
for (l = bns->l_pac - 1; l >= 0; --l, ++bns->l_pac)
_set_pac(pac, bns->l_pac, 3-_get_pac(pac, l));
}
......
.TH bwa 1 "31 May 2016" "bwa-0.7.15-r1140" "Bioinformatics tools"
.TH bwa 1 "30 July 2017" "bwa-0.7.16-r1180" "Bioinformatics tools"
.SH NAME
.PP
bwa - Burrows-Wheeler Alignment Tool
......@@ -107,6 +107,8 @@ appropriate algorithm will be chosen automatically.
.IR clipPen ]
.RB [ -U
.IR unpairPen ]
.RB [ -x
.IR readType ]
.RB [ -R
.IR RGline ]
.RB [ -H
......@@ -256,7 +258,23 @@ Penalty for an unpaired read pair. BWA-MEM scores an unpaired read pair as
and scores a paired as scoreRead1+scoreRead2-insertPenalty. It compares these
two scores to determine whether we should force pairing. A larger value leads to
more aggressive read pair. [17]
.TP
.BI -x \ STR
Read type. Changes multiple parameters unless overriden [null]
.RS
.TP 10
.BR pacbio :
.B -k17 -W40 -r10 -A1 -B1 -O1 -E1 -L0
(PacBio reads to ref)
.TP
.BR ont2d :
.B -k14 -W20 -r10 -A1 -B1 -O1 -E1 -L0
(Oxford Nanopore 2D-reads to ref)
.TP
.BR intractg :
.B -B9 -O16 -L5
(intra-species contigs to ref)
.RE
.TP
.B INPUT/OUTPUT OPTIONS:
.TP
......@@ -277,6 +295,29 @@ If ARG starts with @, it is interpreted as a string and gets inserted into the
output SAM header; otherwise, ARG is interpreted as a file with all lines
starting with @ in the file inserted into the SAM header. [null]
.TP
.BI -o \ FILE
Write the output SAM file to
.IR FILE .
For compatibility with other BWA commands, this option may also be given as
.B -f
.IR FILE .
[standard ouptut]
.TP
.B -5
For split alignment, mark the segment with the smallest coordinate as the
primary. This option may help some Hi-C pipelines. By default, BWA-MEM marks
highest scoring segment as primary.
.TP
.B -K \ INT
Process
.I INT
input bases in each batch regardless of the number of threads in use
.RI [10000000* nThreads ].
By default, the batch size is proportional to the number of threads in use.
Because the inferred insert size distribution slightly depends on the batch
size, using different number of threads may produce different output.
Specifying this option helps reproducibility.
.TP
.BI -T \ INT
Don't output alignment with score lower than
.IR INT .
......@@ -302,7 +343,7 @@ Output all found alignments for single-end or unpaired paired-end reads. These
alignments will be flagged as secondary alignments.
.TP
.B -C
Append append FASTA/Q comment to SAM output. This option can be used to
Append FASTA/Q comment to SAM output. This option can be used to
transfer read meta information (e.g. barcode) to the SAM output. Note that the
FASTA/Q comment (the string after a space in the header line) must conform the SAM
spec (e.g. BC:Z:CGTAC). Malformated comments lead to incorrect SAM output.
......@@ -316,7 +357,7 @@ supplementary alignments.
Mark shorter split hits as secondary (for Picard compatibility).
.TP
.BI -v \ INT
Control the verbose level of the output. This option has not been fully
Control the verbosity level of the output. This option has not been fully
supported throughout BWA. Ideally, a value 0 for disabling all the output to
stderr; 1 for outputting errors only; 2 for warnings and errors; 3 for
all normal messages; 4 or higher for debugging. When this option takes value
......
......@@ -30,13 +30,23 @@ static inline void trim_readno(kstring_t *s)
s->l -= 2, s->s[s->l] = 0;
}
static inline char *dupkstring(const kstring_t *str, int dupempty)
{
char *s = (str->l > 0 || dupempty)? malloc(str->l + 1) : NULL;
if (!s) return NULL;
memcpy(s, str->s, str->l);
s[str->l] = '\0';
return s;
}
static inline void kseq2bseq1(const kseq_t *ks, bseq1_t *s)
{ // TODO: it would be better to allocate one chunk of memory, but probably it does not matter in practice
s->name = strdup(ks->name.s);
s->comment = ks->comment.l? strdup(ks->comment.s) : 0;
s->seq = strdup(ks->seq.s);
s->qual = ks->qual.l? strdup(ks->qual.s) : 0;
s->l_seq = strlen(s->seq);
s->name = dupkstring(&ks->name, 1);
s->comment = dupkstring(&ks->comment, 0);
s->seq = dupkstring(&ks->seq, 1);
s->qual = dupkstring(&ks->qual, 0);
s->l_seq = ks->seq.l;
}
bseq1_t *bseq_read(int chunk_size, int *n_, void *ks1_, void *ks2_)
......@@ -144,9 +154,9 @@ uint32_t *bwa_gen_cigar2(const int8_t mat[25], int o_del, int e_del, int o_ins,
max_del = (int)((double)(((l_query+1)>>1) * mat[0] - o_del) / e_del + 1.);
max_gap = max_ins > max_del? max_ins : max_del;
max_gap = max_gap > 1? max_gap : 1;
w = (max_gap + abs(rlen - l_query) + 1) >> 1;
w = (max_gap + abs((int)rlen - l_query) + 1) >> 1;
w = w < w_? w : w_;
min_w = abs(rlen - l_query) + 3;
min_w = abs((int)rlen - l_query) + 3;
w = w > min_w? w : min_w;
// NW alignment
if (bwa_verbose >= 4) {
......@@ -414,10 +424,14 @@ char *bwa_set_rg(const char *s)
if (bwa_verbose >= 1) fprintf(stderr, "[E::%s] the read group line is not started with @RG\n", __func__);
goto err_set_rg;
}
if (strstr(s, "\t") != NULL) {
if (bwa_verbose >= 1) fprintf(stderr, "[E::%s] the read group line contained literal <tab> characters -- replace with escaped tabs: \\t\n", __func__);
goto err_set_rg;
}
rg_line = strdup(s);
bwa_escape(rg_line);
if ((p = strstr(rg_line, "\tID:")) == 0) {
if (bwa_verbose >= 1) fprintf(stderr, "[E::%s] no ID at the read group line\n", __func__);
if (bwa_verbose >= 1) fprintf(stderr, "[E::%s] no ID within the read group line\n", __func__);
goto err_set_rg;
}
p += 4;
......
......@@ -5,7 +5,7 @@ use warnings;
use Getopt::Std;
my %opts = (t=>1);
getopts("PSadskHo:R:x:t:", \%opts);
getopts("MPSadskHo:R:x:t:", \%opts);
die('
Usage: run-bwamem [options] <idxbase> <file1> [file2]
......@@ -24,6 +24,7 @@ Options: -o STR prefix for output files [inferred from
-S for BAM input, don\'t shuffle
-s sort the output alignment (via samtools; requring more RAM)
-k keep temporary files generated by typeHLA
-M mark shorter split hits as secondary
Examples:
......@@ -143,7 +144,7 @@ if ($is_bam) {
$cmd = "cat $ARGV[1] \\\n";
}
my $bwa_opts = "-p " . ($opts{t} > 1? "-t$opts{t} " : "") . (defined($opts{x})? "-x $opts{x} " : "") . (defined($opts{R})? "-R'$opts{R}' " : "");
my $bwa_opts = "-p " . ($opts{t} > 1? "-t$opts{t} " : "") . (defined($opts{x})? "-x $opts{x} " : "") . (defined($opts{R})? "-R'$opts{R}' " : "") . (defined($opts{M})? "-M " : "");
$bwa_opts .= join(" ", @RG_lines) . " -C " if @RG_lines > 0;
$cmd .= " | $root/trimadap 2> $prefix.log.trim \\\n" if defined($opts{a});
......@@ -163,7 +164,7 @@ if (-f "$ARGV[0].alt" && !defined($opts{P})) {
}
my $t_sort = $opts{t} < 4? $opts{t} : 4;
$cmd .= defined($opts{s})? " | $root/samtools sort -@ $t_sort -m1G - $prefix.aln;\n" : " | $root/samtools view -1 - > $prefix.aln.bam;\n";
$cmd .= defined($opts{s})? " | $root/samtools sort -@ $t_sort -m1G - -o $prefix.aln.bam;\n" : " | $root/samtools view -1 - > $prefix.aln.bam;\n";
if ($has_hla && defined($opts{H}) && (!defined($opts{x}) || $opts{x} eq 'intractg')) {
$cmd .= "$root/run-HLA ". (defined($opts{x}) && $opts{x} eq 'intractg'? "-A " : "") . "$prefix.hla > $prefix.hla.top 2> $prefix.log.hla;\n";
......
......@@ -2,7 +2,7 @@
root=`dirname $0`
url38="ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_full_analysis_set.fna.gz"
url38="ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_full_analysis_set.fna.gz"
url37d5="ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz"
if [ $# -eq 0 ]; then
......
......@@ -809,6 +809,19 @@ static inline int get_rlen(int n_cigar, const uint32_t *cigar)
return l;
}
static inline void add_cigar(const mem_opt_t *opt, mem_aln_t *p, kstring_t *str, int which)
{
int i;
if (p->n_cigar) { // aligned
for (i = 0; i < p->n_cigar; ++i) {
int c = p->cigar[i]&0xf;
if (!(opt->flag&MEM_F_SOFTCLIP) && !p->is_alt && (c == 3 || c == 4))
c = which? 4 : 3; // use hard clipping for supplementary alignments
kputw(p->cigar[i]>>4, str); kputc("MIDSH"[c], str);
}
} else kputc('*', str); // having a coordinate but unaligned (e.g. when copy_mate is true)
}
void mem_aln2sam(const mem_opt_t *opt, const bntseq_t *bns, kstring_t *str, bseq1_t *s, int n, const mem_aln_t *list, int which, const mem_aln_t *m_)
{
int i, l_name;
......@@ -835,14 +848,7 @@ void mem_aln2sam(const mem_opt_t *opt, const bntseq_t *bns, kstring_t *str, bseq
kputs(bns->anns[p->rid].name, str); kputc('\t', str); // RNAME
kputl(p->pos + 1, str); kputc('\t', str); // POS
kputw(p->mapq, str); kputc('\t', str); // MAPQ
if (p->n_cigar) { // aligned
for (i = 0; i < p->n_cigar; ++i) {
int c = p->cigar[i]&0xf;
if (!(opt->flag&MEM_F_SOFTCLIP) && !p->is_alt && (c == 3 || c == 4))
c = which? 4 : 3; // use hard clipping for supplementary alignments
kputw(p->cigar[i]>>4, str); kputc("MIDSH"[c], str);
}
} else kputc('*', str); // having a coordinate but unaligned (e.g. when copy_mate is true)
add_cigar(opt, p, str, which);
} else kputsn("*\t0\t0\t*", 7, str); // without coordinte
kputc('\t', str);
......@@ -899,6 +905,7 @@ void mem_aln2sam(const mem_opt_t *opt, const bntseq_t *bns, kstring_t *str, bseq
kputsn("\tNM:i:", 6, str); kputw(p->NM, str);
kputsn("\tMD:Z:", 6, str); kputs((char*)(p->cigar + p->n_cigar), str);
}
if (m && m->n_cigar) { kputsn("\tMC:Z:", 6, str); add_cigar(opt, m, str, which); }
if (p->score >= 0) { kputsn("\tAS:i:", 6, str); kputw(p->score, str); }
if (p->sub >= 0) { kputsn("\tXS:i:", 6, str); kputw(p->sub, str); }
if (bwa_rg_id[0]) { kputsn("\tRG:Z:", 6, str); kputs(bwa_rg_id, str); }
......@@ -968,6 +975,30 @@ int mem_approx_mapq_se(const mem_opt_t *opt, const mem_alnreg_t *a)
return mapq;
}
void mem_reorder_primary5(int T, mem_alnreg_v *a)
{
int k, n_pri = 0, left_st = INT_MAX, left_k = -1;
mem_alnreg_t t;
for (k = 0; k < a->n; ++k)
if (a->a[k].secondary < 0 && !a->a[k].is_alt && a->a[k].score >= T) ++n_pri;
if (n_pri <= 1) return; // only one alignment
for (k = 0; k < a->n; ++k) {
mem_alnreg_t *p = &a->a[k];
if (p->secondary >= 0 || p->is_alt || p->score < T) continue;
if (p->qb < left_st) left_st = p->qb, left_k = k;
}
assert(a->a[0].secondary < 0);
if (left_k == 0) return; // no need to reorder
t = a->a[0], a->a[0] = a->a[left_k], a->a[left_k] = t;
for (k = 1; k < a->n; ++k) { // update secondary and secondary_all
mem_alnreg_t *p = &a->a[k];
if (p->secondary == 0) p->secondary = left_k;
else if (p->secondary == left_k) p->secondary = 0;
if (p->secondary_all == 0) p->secondary_all = left_k;
else if (p->secondary_all == left_k) p->secondary_all = 0;
}
}
// TODO (future plan): group hits into a uint64_t[] array. This will be cleaner and more flexible
void mem_reg2sam(const mem_opt_t *opt, const bntseq_t *bns, const uint8_t *pac, bseq1_t *s, mem_alnreg_v *a, int extra_flag, const mem_aln_t *m)
{
......@@ -1160,6 +1191,7 @@ static void worker2(void *data, int i, int tid)
if (!(w->opt->flag&MEM_F_PE)) {
if (bwa_verbose >= 4) printf("=====> Finalizing read '%s' <=====\n", w->seqs[i].name);
mem_mark_primary_se(w->opt, w->regs[i].n, w->regs[i].a, w->n_processed + i);
if (w->opt->flag & MEM_F_PRIMARY5) mem_reorder_primary5(w->opt->T, &w->regs[i]);
mem_reg2sam(w->opt, w->bns, w->pac, &w->seqs[i], &w->regs[i], 0, 0);
free(w->regs[i].a);
} else {
......
......@@ -19,6 +19,7 @@ typedef struct __smem_i smem_i;
#define MEM_F_REF_HDR 0x100
#define MEM_F_SOFTCLIP 0x200
#define MEM_F_SMARTPE 0x400
#define MEM_F_PRIMARY5 0x800
typedef struct {
int a, b; // match score and mismatch penalty
......
......@@ -243,6 +243,7 @@ int mem_pair(const mem_opt_t *opt, const bntseq_t *bns, const uint8_t *pac, cons
}
void mem_aln2sam(const mem_opt_t *opt, const bntseq_t *bns, kstring_t *str, bseq1_t *s, int n, const mem_aln_t *list, int which, const mem_aln_t *m);
void mem_reorder_primary5(int T, mem_alnreg_v *a);
#define raw_mapq(diff, a) ((int)(6.02 * (diff) / (a) + .499))
......@@ -275,6 +276,10 @@ int mem_sam_pe(const mem_opt_t *opt, const bntseq_t *bns, const uint8_t *pac, co
}
n_pri[0] = mem_mark_primary_se(opt, a[0].n, a[0].a, id<<1|0);
n_pri[1] = mem_mark_primary_se(opt, a[1].n, a[1].a, id<<1|1);
if (opt->flag & MEM_F_PRIMARY5) {
mem_reorder_primary5(opt->T, &a[0]);
mem_reorder_primary5(opt->T, &a[1]);
}
if (opt->flag&MEM_F_NOPAIRING) goto no_pairing;
// pairing single-end hits
if (n_pri[0] && n_pri[1] && (o = mem_pair(opt, bns, pac, pes, s, a, id, &subo, &n_sub, z, n_pri)) > 0) {
......
......@@ -624,7 +624,8 @@ ubyte_t *bwa_paired_sw(const bntseq_t *bns, const ubyte_t *_pacseq, int n_seqs,
void bwa_sai2sam_pe_core(const char *prefix, char *const fn_sa[2], char *const fn_fa[2], pe_opt_t *popt, const char *rg_line)
{
extern bwa_seqio_t *bwa_open_reads(int mode, const char *fn_fa);
int i, j, n_seqs, tot_seqs = 0;
int i, j, n_seqs;
long long tot_seqs = 0;
bwa_seq_t *seqs[2];
bwa_seqio_t *ks[2];
clock_t t;
......@@ -711,7 +712,7 @@ void bwa_sai2sam_pe_core(const char *prefix, char *const fn_sa[2], char *const f
for (j = 0; j < 2; ++j)
bwa_free_read_seq(n_seqs, seqs[j]);
fprintf(stderr, "[bwa_sai2sam_pe_core] %d sequences have been processed.\n", tot_seqs);
fprintf(stderr, "[bwa_sai2sam_pe_core] %lld sequences have been processed.\n", tot_seqs);
last_ii = ii;
}
......
......@@ -173,13 +173,15 @@ bwa_cigar_t *bwa_refine_gapped_core(bwtint_t l_pac, const ubyte_t *pacseq, int l
ubyte_t *rseq;
int64_t k, rb, re, rlen;
int8_t mat[25];
int w;
bwa_fill_scmat(1, 3, mat);
rb = *_rb; re = rb + len + ref_shift;
assert(re <= l_pac);
rseq = bns_get_seq(l_pac, pacseq, rb, re, &rlen);
assert(re - rb == rlen);
ksw_global(len, seq, rlen, rseq, 5, mat, 5, 1, SW_BW > abs(rlen - len) * 1.5? SW_BW : abs(rlen - len) * 1.5, n_cigar, &cigar32);
w = abs((int)rlen - len) * 1.5;
ksw_global(len, seq, rlen, rseq, 5, mat, 5, 1, SW_BW > w? SW_BW : w, n_cigar, &cigar32);
assert(*n_cigar > 0);
if ((cigar32[*n_cigar - 1]&0xf) == 1) cigar32[*n_cigar - 1] = (cigar32[*n_cigar - 1]>>4<<4) | 3; // change endding ins to soft clipping
if ((cigar32[0]&0xf) == 1) cigar32[0] = (cigar32[0]>>4<<4) | 3; // change beginning ins to soft clipping
......@@ -505,7 +507,8 @@ void bwase_initialize()
void bwa_sai2sam_se_core(const char *prefix, const char *fn_sa, const char *fn_fa, int n_occ, const char *rg_line)
{
extern bwa_seqio_t *bwa_open_reads(int mode, const char *fn_fa);
int i, n_seqs, tot_seqs = 0, m_aln;
int i, n_seqs, m_aln;
long long tot_seqs = 0;
bwt_aln1_t *aln = 0;
bwa_seq_t *seqs;
bwa_seqio_t *ks;
......@@ -563,7 +566,7 @@ void bwa_sai2sam_se_core(const char *prefix, const char *fn_sa, const char *fn_f
fprintf(stderr, "%.2f sec\n", (float)(clock() - t) / CLOCKS_PER_SEC);
bwa_free_read_seq(n_seqs, seqs);
fprintf(stderr, "[bwa_aln_core] %d sequences have been processed.\n", tot_seqs);
fprintf(stderr, "[bwa_aln_core] %lld sequences have been processed.\n", tot_seqs);
}
// destroy
......
......@@ -338,6 +338,7 @@ BWT *BWTCreate(const bgint_t textLength, unsigned int *decodeTable)
bwt->decodeTable = (unsigned*)calloc(DNA_OCC_CNT_TABLE_SIZE_IN_WORD, sizeof(unsigned int));
GenerateDNAOccCountTable(bwt->decodeTable);
} else {
// FIXME Prevent BWTFree() from freeing decodeTable in this case
bwt->decodeTable = decodeTable;
}
......@@ -1538,25 +1539,21 @@ BWTInc *BWTIncConstructFromPacked(const char *inputFileName, bgint_t initialMaxB
(long)bwtInc->numberOfIterationDone, (long)processedTextLength);
}
}
return bwtInc;
}
void BWTFree(BWT *bwt)
{
if (bwt == 0) return;
free(bwt->cumulativeFreq);
free(bwt->bwtCode);
free(bwt->occValue);
free(bwt->occValueMajor);
free(bwt->decodeTable);
free(bwt);
fclose(packedFile);
return bwtInc;
}
void BWTIncFree(BWTInc *bwtInc)
{
if (bwtInc == 0) return;
free(bwtInc->bwt->cumulativeFreq);
free(bwtInc->bwt->occValueMajor);
free(bwtInc->bwt->decodeTable);
free(bwtInc->bwt);
free(bwtInc->workingMemory);
free(bwtInc->cumulativeCountInCurrentBuild);
free(bwtInc->packedShift);
free(bwtInc);
}
......@@ -1602,7 +1599,7 @@ void bwt_bwtgen2(const char *fn_pac, const char *fn_bwt, int block_size)
{
BWTInc *bwtInc;
bwtInc = BWTIncConstructFromPacked(fn_pac, block_size, block_size);
printf("[bwt_gen] Finished constructing BWT in %u iterations.\n", bwtInc->numberOfIterationDone);
fprintf(stderr, "[bwt_gen] Finished constructing BWT in %u iterations.\n", bwtInc->numberOfIterationDone);
BWTSaveBwtCodeAndOcc(bwtInc->bwt, fn_bwt, 0);
BWTIncFree(bwtInc);
}
......
......@@ -48,7 +48,7 @@ bwtl_t *bwtl_seq2bwtl(int len, const uint8_t *seq)
}
{ // generate cnt_table
for (i = 0; i != 256; ++i) {
u_int32_t j, x = 0;
uint32_t j, x = 0;
for (j = 0; j != 4; ++j)
x |= (((i&3) == j) + ((i>>2&3) == j) + ((i>>4&3) == j) + (i>>6 == j)) << (j<<3);
b->cnt_table[i] = x;
......
......@@ -158,7 +158,8 @@ bwa_seqio_t *bwa_open_reads(int mode, const char *fn_fa)
void bwa_aln_core(const char *prefix, const char *fn_fa, const gap_opt_t *opt)
{
int i, n_seqs, tot_seqs = 0;
int i, n_seqs;
long long tot_seqs = 0;
bwa_seq_t *seqs;
bwa_seqio_t *ks;
clock_t t;
......@@ -218,7 +219,7 @@ void bwa_aln_core(const char *prefix, const char *fn_fa, const gap_opt_t *opt)
fprintf(stderr, "%.2f sec\n", (float)(clock() - t) / CLOCKS_PER_SEC);
bwa_free_read_seq(n_seqs, seqs);
fprintf(stderr, "[bwa_aln_core] %d sequences have been processed.\n", tot_seqs);
fprintf(stderr, "[bwa_aln_core] %lld sequences have been processed.\n", tot_seqs);
}
// destroy
......
......@@ -58,7 +58,7 @@ static inline void gap_push(gap_stack_t *stack, int i, bwtint_t k, bwtint_t l, i
q->stack = (gap_entry_t*)realloc(q->stack, sizeof(gap_entry_t) * q->m_entries);
}
p = q->stack + q->n_entries;
p->info = (u_int32_t)score<<21 | i; p->k = k; p->l = l;
p->info = (uint32_t)score<<21 | i; p->k = k; p->l = l;
p->n_mm = n_mm; p->n_gapo = n_gapo; p->n_gape = n_gape;
p->n_ins = n_ins; p->n_del = n_del;
p->state = state;
......
......@@ -5,9 +5,9 @@
#include "bwtaln.h"
typedef struct { // recursion stack
u_int32_t info; // score<<21 | i
u_int32_t n_mm:8, n_gapo:8, n_gape:8, state:2, n_seed_mm:6;
u_int32_t n_ins:16, n_del:16;
uint32_t info; // score<<21 | i
uint32_t n_mm:8, n_gapo:8, n_gape:8, state:2, n_seed_mm:6;
uint32_t n_ins:16, n_del:16;
int last_diff_pos;
bwtint_t k, l; // (k,l) is the SA region of [i,n-1]
} gap_entry_t;
......
......@@ -119,7 +119,7 @@ bwt_t *bwt_pac2bwt(const char *fn_pac, int use_is)
}
rope_destroy(r);
}
bwt->bwt = (u_int32_t*)calloc(bwt->bwt_size, 4);
bwt->bwt = (uint32_t*)calloc(bwt->bwt_size, 4);
for (i = 0; i < bwt->seq_len; ++i)
bwt->bwt[i>>4] |= buf[i] << ((15 - (i&15)) << 1);
free(buf);
......@@ -175,7 +175,7 @@ void bwt_bwtupdate_core(bwt_t *bwt)
int bwa_bwtupdate(int argc, char *argv[]) // the "bwtupdate" command
{
bwt_t *bwt;
if (argc < 2) {
if (argc != 2) {
fprintf(stderr, "Usage: bwa bwtupdate <the.bwt>\n");
return 1;
}
......
......@@ -70,6 +70,12 @@ bsw2pestat_t bsw2_stat(int n, bwtsw2_t **buf, kstring_t *msg, int max_ins)
for (i = x = 0, r.avg = 0; i < k; ++i)
if (isize[i] >= r.low && isize[i] <= r.high)
r.avg += isize[i], ++x;
if (x == 0) {
ksprintf(msg, "[%s] fail to infer the insert size distribution: no pairs within boundaries.\n", __func__);
free(isize);
r.failed = 1;
return r;
}
r.avg /= x;
for (i = 0, r.std = 0; i < k; ++i)
if (isize[i] >= r.low && isize[i] <= r.high)
......@@ -236,7 +242,7 @@ void bsw2_pair(const bsw2opt_t *opt, int64_t l_pac, const uint8_t *pac, int n, b
double diff;
G[0] = hits[i]->hits[0].G + a[1].G;
G[1] = hits[i+1]->hits[0].G + a[0].G;
diff = fabs(G[0] - G[1]) / (opt->a + opt->b) / ((hits[i]->hits[0].len + a[1].len + hits[i+1]->hits[0].len + a[0].len) / 2.);
diff = fabs((double)(G[0] - G[1])) / (opt->a + opt->b) / ((hits[i]->hits[0].len + a[1].len + hits[i+1]->hits[0].len + a[0].len) / 2.);
if (diff > 0.05) a[G[0] > G[1]? 0 : 1].G = 0;
}
if (a[0].G == 0 || a[1].G == 0) { // one proper pair only
......
......@@ -130,7 +130,7 @@ int main_mem(int argc, char *argv[])
aux.opt = opt = mem_opt_init();
memset(&opt0, 0, sizeof(mem_opt_t));
while ((c = getopt(argc, argv, "1paMCSPVYjk:c:v:s:r:t:R:A:B:O:E:U:w:L:d:T:Q:D:m:I:N:W:x:G:h:y:K:X:H:")) >= 0) {
while ((c = getopt(argc, argv, "51paMCSPVYjk:c:v:s:r:t:R:A:B:O:E:U:w:L:d:T:Q:D:m:I:N:o:f:W:x:G:h:y:K:X:H:")) >= 0) {
if (c == 'k') opt->min_seed_len = atoi(optarg), opt0.min_seed_len = 1;
else if (c == '1') no_mt_io = 1;
else if (c == 'x') mode = optarg;
......@@ -147,6 +147,7 @@ int main_mem(int argc, char *argv[])
else if (c == 'S') opt->flag |= MEM_F_NO_RESCUE;
else if (c == 'Y') opt->flag |= MEM_F_SOFTCLIP;
else if (c == 'V') opt->flag |= MEM_F_REF_HDR;
else if (c == '5') opt->flag |= MEM_F_PRIMARY5;
else if (c == 'c') opt->max_occ = atoi(optarg), opt0.max_occ = 1;
else if (c == 'd') opt->zdrop = atoi(optarg), opt0.zdrop = 1;
else if (c == 'v') bwa_verbose = atoi(optarg);
......@@ -157,6 +158,7 @@ int main_mem(int argc, char *argv[])
else if (c == 's') opt->split_width = atoi(optarg), opt0.split_width = 1;
else if (c == 'G') opt->max_chain_gap = atoi(optarg), opt0.max_chain_gap = 1;
else if (c == 'N') opt->max_chain_extend = atoi(optarg), opt0.max_chain_extend = 1;
else if (c == 'o' || c == 'f') xreopen(optarg, "wb", stdout);
else if (c == 'W') opt->min_chain_weight = atoi(optarg), opt0.min_chain_weight = 1;
else if (c == 'y') opt->max_mem_intv = atol(optarg), opt0.max_mem_intv = 1;
else if (c == 'C') aux.copy_comment = 1;
......@@ -256,7 +258,7 @@ int main_mem(int argc, char *argv[])
fprintf(stderr, " -E INT[,INT] gap extension penalty; a gap of size k cost '{-O} + {-E}*k' [%d,%d]\n", opt->e_del, opt->e_ins);
fprintf(stderr, " -L INT[,INT] penalty for 5'- and 3'-end clipping [%d,%d]\n", opt->pen_clip5, opt->pen_clip3);
fprintf(stderr, " -U INT penalty for an unpaired read pair [%d]\n\n", opt->pen_unpaired);
fprintf(stderr, " -x STR read type. Setting -x changes multiple parameters unless overriden [null]\n");
fprintf(stderr, " -x STR read type. Setting -x changes multiple parameters unless overridden [null]\n");
fprintf(stderr, " pacbio: -k17 -W40 -r10 -A1 -B1 -O1 -E1 -L0 (PacBio reads to ref)\n");
fprintf(stderr, " ont2d: -k14 -W20 -r10 -A1 -B1 -O1 -E1 -L0 (Oxford Nanopore 2D-reads to ref)\n");
fprintf(stderr, " intractg: -B9 -O16 -L5 (intra-species contigs to ref)\n");
......@@ -264,9 +266,12 @@ int main_mem(int argc, char *argv[])
fprintf(stderr, " -p smart pairing (ignoring in2.fq)\n");
fprintf(stderr, " -R STR read group header line such as '@RG\\tID:foo\\tSM:bar' [null]\n");
fprintf(stderr, " -H STR/FILE insert STR to header if it starts with @; or insert lines in FILE [null]\n");
fprintf(stderr, " -o FILE sam file to output results to [stdout]\n");
fprintf(stderr, " -j treat ALT contigs as part of the primary assembly (i.e. ignore <idxbase>.alt file)\n");
fprintf(stderr, " -5 for split alignment, take the alignment with the smallest coordiate as primary\n");
fprintf(stderr, " -K INT process INT input bases in each batch regardless of nThreads (for reproducibility) []\n");
fprintf(stderr, "\n");
fprintf(stderr, " -v INT verbose level: 1=error, 2=warning, 3=message, 4+=debugging [%d]\n", bwa_verbose);
fprintf(stderr, " -v INT verbosity level: 1=error, 2=warning, 3=message, 4+=debugging [%d]\n", bwa_verbose);
fprintf(stderr, " -T INT minimum score to output [%d]\n", opt->T);
fprintf(stderr, " -h INT[,INT] if there are <INT hits with score >80%% of the max score, output all in XA [%d,%d]\n", opt->max_XA_hits, opt->max_XA_hits_alt);
fprintf(stderr, " -a output all alignments for SE or unpaired PE\n");
......
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <limits.h>
......
......@@ -4,7 +4,7 @@
#include "utils.h"
#ifndef PACKAGE_VERSION
#define PACKAGE_VERSION "0.7.15-r1140"
#define PACKAGE_VERSION "0.7.16a-r1181"
#endif
int bwa_fa2pac(int argc, char *argv[]);
......
......@@ -13,7 +13,7 @@ void *wrap_calloc(size_t nmemb, size_t size,
void *p = calloc(nmemb, size);
if (NULL == p) {
fprintf(stderr,
"[%s] Failed to allocate %zd bytes at %s line %u: %s\n",
"[%s] Failed to allocate %zu bytes at %s line %u: %s\n",
func, nmemb * size, file, line, strerror(errno));
exit(EXIT_FAILURE);
}
......@@ -25,7 +25,7 @@ void *wrap_malloc(size_t size,
void *p = malloc(size);
if (NULL == p) {
fprintf(stderr,
"[%s] Failed to allocate %zd bytes at %s line %u: %s\n",
"[%s] Failed to allocate %zu bytes at %s line %u: %s\n",
func, size, file, line, strerror(errno));
exit(EXIT_FAILURE);
}
...