Skip to content
Commits on Source (6)
2019-07-16 Martin C. Frith <Martin C. Frith>
* doc/last-split.txt, doc/last-tutorial.txt, scripts/last-train,
src/split/last-split-main.cc, test/last-split-test.out, test/last-
split-test.sh:
Change last-split -m default to 1
[2a722dbe737a] [tip]
2019-06-27 Martin C. Frith <Martin C. Frith>
* src/split/last-split.cc:
Fix rare last-split spliced alignment crash
[4f77b3fc0720]
2019-06-21 Martin C. Frith <Martin C. Frith>
* scripts/last-dotplot:
dotplot: change the undocumented input format
[0e184d90d5da]
2019-04-22 Martin C. Frith <Martin C. Frith>
* doc/last-dotplot.txt, scripts/last-dotplot:
Add last-dotplot --join option
[27c26371226b]
2019-04-01 Martin C. Frith <Martin C. Frith>
* src/LambdaCalculator.cc, src/LambdaCalculator.hh,
src/TantanMasker.cc, src/lastal.cc, src/makefile,
src/mcf_substitution_matrix_stats.cc,
src/mcf_substitution_matrix_stats.hh:
Make lastal startup even faster
[9b416e560dc3]
* src/LambdaCalculator.cc, src/LambdaCalculator.hh, src/makefile:
Make lastal startup faster
[d429589d040e]
* src/Alignment.cc, src/Centroid.cc, src/Centroid.hh:
Simplify some probability calculations a bit
[9eeaeea88f2c]
2019-03-29 Martin C. Frith <Martin C. Frith>
* src/Alignment.cc, src/Centroid.cc, src/Centroid.hh:
Refactor
[f3f69ccff967]
2019-03-28 Martin C. Frith <Martin C. Frith>
* src/Alignment.cc, src/Alignment.hh, src/Centroid.cc,
src/Centroid.hh, src/GeneralizedAffineGapCosts.cc,
src/GeneralizedAffineGapCosts.hh, src/LastalArguments.cc,
src/LastalArguments.hh, src/lastal.cc, src/makefile,
src/mcf_gap_costs.cc, src/mcf_gap_costs.hh:
Add piecewise linear gap costs (unfinished)
[3701b1e2019b]
* doc/last-train.txt, doc/lastal.txt, scripts/last-train,
src/LastalArguments.cc, src/LastalArguments.hh, src/ScoreMatrix.cc,
src/ScoreMatrix.hh, src/lastal.cc, test/dfam3-LTR22B1.fa,
test/dfam3-LTR22C.fa, test/last-test.out, test/last-test.sh:
Add lastal/last-train -X option for ambiguous N/X
[2b0e54aee679]
2019-02-27 Martin C. Frith <Martin C. Frith>
* src/lastal.cc:
Refactor
[1ed9db7f46ca]
* src/lastal.cc:
Refactor
[ae880ec00591]
2019-02-26 Martin C. Frith <Martin C. Frith>
* src/lastal.cc:
Refactor
[6189dce68b46]
* doc/lastal.txt, src/LastalArguments.cc, src/lastal.cc, src/makefile,
src/split/cbrc_split_aligner.cc:
Change the effect of setting lastal option -t
[2073177a7920]
* src/lastal.cc, test/last-train-test.out:
Fix fastq-aware last-train (lastal -j7)
[7dcde7432ed7]
2019-02-25 Martin C. Frith <Martin C. Frith>
* src/OneQualityScoreMatrix.cc, src/OneQualityScoreMatrix.hh,
src/TwoQualityScoreMatrix.cc, src/TwoQualityScoreMatrix.hh,
src/lastal.cc, src/qualityScoreUtil.hh:
Refactor
[1bbe50060a6f]
* src/lastal.cc, src/mcf_substitution_matrix_stats.cc,
src/mcf_substitution_matrix_stats.hh:
Make lastal code more robust for non-DNA
[a75e610c943c]
* src/lastal.cc, src/makefile, src/mcf_substitution_matrix_stats.cc,
src/mcf_substitution_matrix_stats.hh:
Refactor (prepare new score-matrix analysis)
[685f6de69cc8]
* src/LambdaCalculator.cc, src/LambdaCalculator.hh,
src/TantanMasker.cc, src/lastal.cc:
Refactor
[92c7b8f19291]
* scripts/last-train, src/LastEvaluer.hh, src/LastalArguments.cc,
src/LastalArguments.hh, src/cbrc_linalg.cc, src/cbrc_linalg.hh,
src/lastal.cc, src/makefile, src/split/cbrc_linalg.cc,
src/split/cbrc_linalg.hh:
Refactor
[a5c1e7e987ff]
2018-12-26 Martin C. Frith <Martin C. Frith>
* src/tantan.cc:
Make tantan repeat-finding faster
[44199bff9cef] [tip]
[44199bff9cef]
* doc/Makefile, doc/last-parallel.txt, doc/last-split.txt, doc/last-
tutorial.txt:
......
last-align (983-1) unstable; urgency=medium
* Team upload.
* New upstream version
* debhelper-compat 12
* Standards-Version: 4.4.0
* Updated with no manual changes required by Andreas'
routine-update script.
-- Steffen Moeller <moeller@debian.org> Sat, 27 Jul 2019 12:33:38 +0200
last-align (963-2) unstable; urgency=medium
* Example files are not compressed any more - adapt autopkgtest
......
......@@ -4,11 +4,11 @@ Uploaders: Charles Plessy <plessy@debian.org>,
Andreas Tille <tille@debian.org>
Section: science
Priority: optional
Build-Depends: debhelper (>= 12~),
Build-Depends: debhelper-compat (= 12),
help2man,
python-pil,
zlib1g-dev
Standards-Version: 4.3.0
Standards-Version: 4.4.0
Vcs-Browser: https://salsa.debian.org/med-team/last-align
Vcs-Git: https://salsa.debian.org/med-team/last-align.git
Homepage: http://last.cbrc.jp/
......
......@@ -422,6 +422,13 @@ for primary and secondary alignments.</td></tr>
<kbd><span class="option">--pad=<var>FRAC</var></span></kbd></td>
<td>Length of pad to leave when cutting unaligned gaps.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-j <var>N</var></span>, <span class="option">--join=<var>N</var></span></kbd></td>
<td>Draw horizontal or vertical lines joining adjacent alignments.
0 means don't join, 1 means draw vertical lines joining
alignments that are adjacent in the 1st (horizontal) genome, 2
means draw horizontal lines joining alignments that are adjacent
in the 2nd (vertical) genome.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">--border-pixels=<var>INT</var></span></kbd></td>
<td>Number of pixels between sequences.</td></tr>
<tr><td class="option-group">
......
......@@ -84,6 +84,12 @@ Options
Maximum unaligned gap in the 2nd genome.
--pad=FRAC
Length of pad to leave when cutting unaligned gaps.
-j N, --join=N
Draw horizontal or vertical lines joining adjacent alignments.
0 means don't join, 1 means draw vertical lines joining
alignments that are adjacent in the 1st (horizontal) genome, 2
means draw horizontal lines joining alignments that are adjacent
in the 2nd (vertical) genome.
--border-pixels=INT
Number of pixels between sequences.
--border-color=COLOR
......
......@@ -318,8 +318,8 @@ table.field-list { border: thin solid green }
<div class="document" id="last-split">
<h1 class="title">last-split</h1>
<p>This program estimates &quot;split alignments&quot; (typically for DNA) or
&quot;spliced alignments&quot; (typically for RNA).</p>
<p>This program finds &quot;split alignments&quot; (typically for DNA) or &quot;spliced
alignments&quot; (typically for RNA).</p>
<p>It reads candidate alignments of query sequences to a genome, and
looks for a unique best alignment for each part of each query. It
allows different parts of one query to match different parts of the
......@@ -334,7 +334,8 @@ format), and the genome is in &quot;genome.fasta&quot; (in fasta format). We
can do the alignment like this:</p>
<pre class="literal-block">
lastdb -uNEAR -R01 db genome.fasta
lastal -Q1 db q.fastq | last-split &gt; out.maf
last-train -Q1 db q.fastq &gt; train.out
lastal -p train.out db q.fastq | last-split &gt; out.maf
</pre>
</div>
<div class="section" id="spliced-alignment-of-rna-reads-to-a-genome">
......@@ -345,7 +346,8 @@ which causes it to do spliced instead of split alignment, and also
tells it where the splice signals are (GT, AG, etc):</p>
<pre class="literal-block">
lastdb -uNEAR -R01 db genome.fasta
lastal -Q1 -D10 db q.fastq | last-split -g db &gt; out.maf
last-train -Q1 db q.fastq &gt; train.out
lastal -p train.out -D10 db q.fastq | last-split -g db &gt; out.maf
</pre>
<p>This will favour splices starting at GT (and to a lesser extent GC and
AT), and ending at AG (and to a lesser extent AC). However, it allows
......@@ -362,17 +364,7 @@ last-split options.</p>
</div>
<div class="section" id="alignment-of-two-whole-genomes">
<h3>Alignment of two whole genomes</h3>
<p>We can align the cat and rat genomes like this:</p>
<pre class="literal-block">
lastdb -uMAM8 -cR11 catdb cat.fasta
lastal -m100 -E0.05 catdb rat.fasta | last-split -m1 &gt; out.maf
</pre>
<p>This will align each rat base-pair to at most one cat base-pair, but
not necessarily vice-versa. We can get 1-to-1 alignments by swapping
the sequences and running last-split again:</p>
<pre class="literal-block">
maf-swap out.maf | last-split -m1 &gt; out2.maf
</pre>
<p>See <a class="reference external" href="https://github.com/mcfrith/last-genome-alignments">here</a>.</p>
</div>
</div>
<div class="section" id="faq">
......@@ -393,15 +385,6 @@ matches.</p>
</tbody>
</table>
</div>
<div class="section" id="going-faster-by-parallelization">
<h2>Going faster by parallelization</h2>
<p>For example, split alignment of DNA reads to a genome:</p>
<pre class="literal-block">
parallel-fastq &quot;lastal -Q1 db | last-split&quot; &lt; q.fastq &gt; out.maf
</pre>
<p>This requires GNU parallel to be installed
(<a class="reference external" href="http://www.gnu.org/software/parallel/">http://www.gnu.org/software/parallel/</a>).</p>
</div>
<div class="section" id="output">
<h2>Output</h2>
<p>The output is in MAF(-like) format:</p>
......@@ -514,7 +497,7 @@ lowest possible. In general:</p>
Error probability &lt;= 10 ^ -((ASCII value - 33) / 10)
</pre>
<p>The &quot;mismap&quot; is simply the lowest probability from the &quot;p&quot; line. (If
you run last-split twice, as in the genome alignment example, the
you run last-split twice, as in the genome alignment recipes, the
mismap is the lowest combined error probability from both &quot;p&quot; lines.)</p>
</div>
<div class="section" id="split-versus-spliced-alignment">
......@@ -545,8 +528,7 @@ of the query.)</p>
<p>Spliced alignment can be slow. It can be sped up, at a small cost in
accuracy, by not favouring cis-splices:</p>
<pre class="literal-block">
lastdb -uNEAR -R01 db genome.fasta
lastal -Q1 -D10 db q.fastq | last-split -c0 -t0.004 -g db &gt; out.maf
lastal -p train.out -D10 db q.fastq | last-split -c0 -t0.004 -g db &gt; out.maf
</pre>
<p>The -c0 turns off cis-splicing, and the -t0.004 specifies a higher
probability of trans-splicing.</p>
......@@ -557,8 +539,7 @@ probability of trans-splicing.</p>
middle of the query, we can do &quot;spliced&quot; alignment without considering
splice signals or favouring cis-splices:</p>
<pre class="literal-block">
lastdb -uNEAR -R01 db genome.fasta
lastal -Q1 db q.fastq | last-split -c0 &gt; out.maf
lastal -p train.out db q.fastq | last-split -c0 &gt; out.maf
</pre>
</div>
</div>
......@@ -613,9 +594,7 @@ The default value fits human RNA.</td></tr>
ln(intron length). The default value fits human RNA.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-m</span>, <span class="option">--mismap=<var>PROB</var></span></kbd></td>
<td>Don't write alignments with mismap probability &gt; PROB.
Low-confidence alignments will be discarded unless you
increase this value!</td></tr>
<td>Don't write alignments with mismap probability &gt; PROB.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-s</span>, <span class="option">--score=<var>INT</var></span></kbd></td>
<td><p class="first">Don't write alignments with score &lt; INT.</p>
......
last-split
==========
This program estimates "split alignments" (typically for DNA) or
"spliced alignments" (typically for RNA).
This program finds "split alignments" (typically for DNA) or "spliced
alignments" (typically for RNA).
It reads candidate alignments of query sequences to a genome, and
looks for a unique best alignment for each part of each query. It
......@@ -21,7 +21,8 @@ format), and the genome is in "genome.fasta" (in fasta format). We
can do the alignment like this::
lastdb -uNEAR -R01 db genome.fasta
lastal -Q1 db q.fastq | last-split > out.maf
last-train -Q1 db q.fastq > train.out
lastal -p train.out db q.fastq | last-split > out.maf
Spliced alignment of RNA reads to a genome
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
......@@ -32,7 +33,8 @@ which causes it to do spliced instead of split alignment, and also
tells it where the splice signals are (GT, AG, etc)::
lastdb -uNEAR -R01 db genome.fasta
lastal -Q1 -D10 db q.fastq | last-split -g db > out.maf
last-train -Q1 db q.fastq > train.out
lastal -p train.out -D10 db q.fastq | last-split -g db > out.maf
This will favour splices starting at GT (and to a lesser extent GC and
AT), and ending at AG (and to a lesser extent AC). However, it allows
......@@ -52,16 +54,7 @@ last-split options.
Alignment of two whole genomes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We can align the cat and rat genomes like this::
lastdb -uMAM8 -cR11 catdb cat.fasta
lastal -m100 -E0.05 catdb rat.fasta | last-split -m1 > out.maf
This will align each rat base-pair to at most one cat base-pair, but
not necessarily vice-versa. We can get 1-to-1 alignments by swapping
the sequences and running last-split again::
maf-swap out.maf | last-split -m1 > out2.maf
See `here <https://github.com/mcfrith/last-genome-alignments>`_.
FAQ
---
......@@ -73,16 +66,6 @@ FAQ
can prevent lastal and last-split from wasting time on such
matches.
Going faster by parallelization
-------------------------------
For example, split alignment of DNA reads to a genome::
parallel-fastq "lastal -Q1 db | last-split" < q.fastq > out.maf
This requires GNU parallel to be installed
(http://www.gnu.org/software/parallel/).
Output
------
......@@ -125,7 +108,7 @@ lowest possible. In general::
Error probability <= 10 ^ -((ASCII value - 33) / 10)
The "mismap" is simply the lowest probability from the "p" line. (If
you run last-split twice, as in the genome alignment example, the
you run last-split twice, as in the genome alignment recipes, the
mismap is the lowest combined error probability from both "p" lines.)
Split versus spliced alignment
......@@ -159,8 +142,7 @@ Faster spliced alignment
Spliced alignment can be slow. It can be sped up, at a small cost in
accuracy, by not favouring cis-splices::
lastdb -uNEAR -R01 db genome.fasta
lastal -Q1 -D10 db q.fastq | last-split -c0 -t0.004 -g db > out.maf
lastal -p train.out -D10 db q.fastq | last-split -c0 -t0.004 -g db > out.maf
The -c0 turns off cis-splicing, and the -t0.004 specifies a higher
probability of trans-splicing.
......@@ -172,8 +154,7 @@ If we do not wish to allow arbitrarily large unaligned parts in the
middle of the query, we can do "spliced" alignment without considering
splice signals or favouring cis-splices::
lastdb -uNEAR -R01 db genome.fasta
lastal -Q1 db q.fastq | last-split -c0 > out.maf
lastal -p train.out db q.fastq | last-split -c0 > out.maf
Options
-------
......@@ -221,8 +202,6 @@ Options
-m, --mismap=PROB
Don't write alignments with mismap probability > PROB.
Low-confidence alignments will be discarded unless you
increase this value!
-s, --score=INT
Don't write alignments with score < INT.
......
......@@ -478,6 +478,17 @@ position in each query.</td></tr>
<kbd><span class="option">-P <var>COUNT</var></span></kbd></td>
<td>Number of parallel threads.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-X <var>NUMBER</var></span></kbd></td>
<td><p class="first">How to score a match/mismatch involving N (for DNA) or X
(otherwise). By default, the lowest match/mismatch score
is used. 0 means the default; 1 means treat reference
Ns/Xs as fully-ambiguous letters; 2 means treat query
Ns/Xs as ambiguous; 3 means treat reference and query
Ns/Xs as ambiguous.</p>
<p class="last">If specified, this parameter is written in last-train's
output, so it will override lastal's default.</p>
</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-Q <var>NUMBER</var></span></kbd></td>
<td><p class="first">How to read the query sequences. By default, they must
be in <tt class="docutils literal">fasta</tt> format. <tt class="docutils literal"><span class="pre">-Q0</span></tt> means <tt class="docutils literal">fasta</tt> or
......
......@@ -102,6 +102,16 @@ Alignment options
-k STEP Look for initial matches starting only at every STEP-th
position in each query.
-P COUNT Number of parallel threads.
-X NUMBER How to score a match/mismatch involving N (for DNA) or X
(otherwise). By default, the lowest match/mismatch score
is used. 0 means the default; 1 means treat reference
Ns/Xs as fully-ambiguous letters; 2 means treat query
Ns/Xs as ambiguous; 3 means treat reference and query
Ns/Xs as ambiguous.
If specified, this parameter is written in last-train's
output, so it will override lastal's default.
-Q NUMBER How to read the query sequences. By default, they must
be in ``fasta`` format. ``-Q0`` means ``fasta`` or
``fastq-ignore``. ``-Q1`` means ``fastq-sanger``.
......
......@@ -432,19 +432,17 @@ options, please see <a class="reference external" href="last-split.html">last-sp
<h2>Example 6: Find very short DNA alignments</h2>
<p>By default, LAST is quite strict, and only reports significant
alignments that will rarely occur by chance. In the preceding
example, the minimum alignment length is about 28 bases (less for
example, the minimum alignment length is about 26 bases (less for
smaller genomes). To find shorter alignments, we must down-tune the
strictness:</p>
<pre class="literal-block">
lastdb -uNEAR -R01 humandb human/chr*.fa
lastal -D100 humandb queries.fa | last-split -m1 &gt; myalns.maf
lastal -D100 humandb queries.fa | last-split &gt; myalns.maf
</pre>
<ul>
<li><p class="first">-D100 makes lastal report alignments that could occur by chance once
per hundred query letters. (The default is once per million.)</p>
</li>
<li><p class="first">-m1 tells last-split to keep low-confidence alignments.</p>
</li>
</ul>
<p>In this example, the minimum alignment length is about 20 bases (less
for smaller genomes).</p>
......
......@@ -123,18 +123,16 @@ Example 6: Find very short DNA alignments
By default, LAST is quite strict, and only reports significant
alignments that will rarely occur by chance. In the preceding
example, the minimum alignment length is about 28 bases (less for
example, the minimum alignment length is about 26 bases (less for
smaller genomes). To find shorter alignments, we must down-tune the
strictness::
lastdb -uNEAR -R01 humandb human/chr*.fa
lastal -D100 humandb queries.fa | last-split -m1 > myalns.maf
lastal -D100 humandb queries.fa | last-split > myalns.maf
* -D100 makes lastal report alignments that could occur by chance once
per hundred query letters. (The default is once per million.)
* -m1 tells last-split to keep low-confidence alignments.
In this example, the minimum alignment length is about 20 bases (less
for smaller genomes).
......
......@@ -476,6 +476,14 @@ anything.</p>
but command line options override them.</p>
</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-X <var>NUMBER</var></span></kbd></td>
<td>How to score a match/mismatch involving N (for DNA) or X
(otherwise), if not specified by a score matrix. By default,
the lowest match/mismatch score is used. 0 means the default; 1
means treat reference Ns/Xs as fully-ambiguous letters; 2 means
treat query Ns/Xs as ambiguous; 3 means treat reference and
query Ns/Xs as ambiguous.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-a <var>COST</var></span></kbd></td>
<td>Gap existence cost.</td></tr>
<tr><td class="option-group">
......@@ -769,9 +777,9 @@ option has no effect unless DNA-versus-protein alignment is
selected with option -F.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-t <var>TEMPERATURE</var></span></kbd></td>
<td>Parameter for converting between scores and likelihood ratios.
<td>Parameter for converting between scores and probability ratios.
This affects the column ambiguity estimates. A score is
converted to a likelihood ratio by this formula: exp(score /
converted to a probability ratio by this formula: exp(score /
TEMPERATURE). The default value is 1/lambda, where lambda is
the scale factor of the scoring matrix, which is calculated by
the method of Yu and Altschul (YK Yu et al. 2003, PNAS
......
......@@ -144,6 +144,14 @@ Score options
Other options can be specified on lines starting with "#last",
but command line options override them.
-X NUMBER
How to score a match/mismatch involving N (for DNA) or X
(otherwise), if not specified by a score matrix. By default,
the lowest match/mismatch score is used. 0 means the default; 1
means treat reference Ns/Xs as fully-ambiguous letters; 2 means
treat query Ns/Xs as ambiguous; 3 means treat reference and
query Ns/Xs as ambiguous.
-a COST
Gap existence cost.
......@@ -400,9 +408,9 @@ Miscellaneous options
selected with option -F.
-t TEMPERATURE
Parameter for converting between scores and likelihood ratios.
Parameter for converting between scores and probability ratios.
This affects the column ambiguity estimates. A score is
converted to a likelihood ratio by this formula: exp(score /
converted to a probability ratio by this formula: exp(score /
TEMPERATURE). The default value is 1/lambda, where lambda is
the scale factor of the scoring matrix, which is calculated by
the method of Yu and Altschul (YK Yu et al. 2003, PNAS
......
......@@ -97,12 +97,37 @@ def mafBlocks(beg1, beg2, seq1, seq2):
size += 1
if size: yield beg1, beg2, size
def alignmentFromSegment(qrySeqName, qrySeqLen, segment):
refSeqLen = sys.maxsize # XXX
refSeqName, refSeqBeg, qrySeqBeg, size = segment
block = refSeqBeg, qrySeqBeg, size
return refSeqName, refSeqLen, qrySeqName, qrySeqLen, [block]
def alignmentInput(lines):
'''Get alignments and sequence lengths, from MAF or tabular format.'''
mafCount = 0
qrySeqName = ""
segments = []
for line in lines:
w = line.split()
if line[0].isdigit(): # tabular format
if line[0] == "#":
pass
elif len(w) == 1:
for i in segments:
yield alignmentFromSegment(qrySeqName, qrySeqLen, i)
qrySeqName = w[0]
qrySeqLen = 0
segments = []
elif len(w) == 2 and qrySeqName and w[1].isdigit():
qrySeqLen += int(w[1])
elif len(w) == 4 and qrySeqName and w[1].isdigit() and w[3].isdigit():
refSeqName, refSeqBeg, refSeqEnd = w[0], int(w[1]), int(w[3])
size = abs(refSeqEnd - refSeqBeg)
if refSeqBeg > refSeqEnd:
refSeqBeg = -refSeqBeg
segments.append((refSeqName, refSeqBeg, qrySeqLen, size))
qrySeqLen += size
elif line[0].isdigit(): # tabular format
chr1, beg1, seqlen1 = w[1], int(w[2]), int(w[5])
if w[4] == "-": beg1 -= seqlen1
chr2, beg2, seqlen2 = w[6], int(w[7]), int(w[10])
......@@ -120,6 +145,8 @@ def alignmentInput(lines):
blocks = mafBlocks(beg1, beg2, seq1, seq2)
yield chr1, seqlen1, chr2, seqlen2, blocks
mafCount = 0
for i in segments:
yield alignmentFromSegment(qrySeqName, qrySeqLen, i)
def seqRequestFromText(text):
if ":" in text:
......@@ -504,6 +531,49 @@ def alignmentPixels(width, height, alignments, bp_per_pix,
ori1 + beg1, ori2 - beg2 - 1, size)
return hits
def orientedBlocks(alignments, seqIndex):
otherIndex = 1 - seqIndex
for a in alignments:
seq1, seq2, blocks = a
for b in blocks:
beg1, beg2, size = b
if b[seqIndex] < 0:
b = -(beg1 + size), -(beg2 + size), size
yield a[seqIndex], b[seqIndex], a[otherIndex], b[otherIndex], size
def drawJoins(im, alignments, bpPerPix, seqIndex, rangeDict1, rangeDict2):
blocks = orientedBlocks(alignments, seqIndex)
oldSeq1 = ""
for seq1, beg1, seq2, beg2, size in sorted(blocks):
isReverse1, ori1 = strandAndOrigin(rangeDict1[seq1], beg1, size)
isReverse2, ori2 = strandAndOrigin(rangeDict2[seq2], beg2, size)
end1 = beg1 + size - 1
end2 = beg2 + size - 1
if isReverse1:
beg1 = -(beg1 + 1)
end1 = -(end1 + 1)
if isReverse2:
beg2 = -(beg2 + 1)
end2 = -(end2 + 1)
newPix1 = (ori1 + beg1) // bpPerPix
newPix2 = (ori2 + beg2) // bpPerPix
if seq1 == oldSeq1:
lowerPix2 = min(oldPix2, newPix2)
upperPix2 = max(oldPix2, newPix2)
midPix1 = (oldPix1 + newPix1) // 2
if isReverse1:
midPix1 = (oldPix1 + newPix1 + 1) // 2
oldPix1, newPix1 = newPix1, oldPix1
if upperPix2 - lowerPix2 > 1 and oldPix1 <= newPix1 <= oldPix1 + 1:
if seqIndex == 0:
box = midPix1, lowerPix2, midPix1 + 1, upperPix2 + 1
else:
box = lowerPix2, midPix1, upperPix2 + 1, midPix1 + 1
im.paste("lightgray", box)
oldPix1 = (ori1 + end1) // bpPerPix
oldPix2 = (ori2 + end2) // bpPerPix
oldSeq1 = seq1
def expandedSeqDict(seqDict):
'''Allow lookup by short sequence names, e.g. chr7 as well as hg19.chr7.'''
newDict = seqDict.copy()
......@@ -826,7 +896,8 @@ def lastDotplot(opts, args):
logging.info("height: " + str(height))
logging.info("processing alignments...")
hits = alignmentPixels(width, height, alignments + alignmentsB, bpPerPix,
allAlignments = alignments + alignmentsB
hits = alignmentPixels(width, height, allAlignments, bpPerPix,
rangeDict1, rangeDict2)
logging.info("reading annotations...")
......@@ -854,6 +925,16 @@ def lastDotplot(opts, args):
drawAnnotations(im, boxes)
joinA, joinB = twoValuesFromOption(opts.join, ":")
if joinA in "13":
drawJoins(im, alignments, bpPerPix, 0, rangeDict1, rangeDict2)
if joinB in "13":
drawJoins(im, alignmentsB, bpPerPix, 0, rangeDict1, rangeDict2)
if joinA in "23":
drawJoins(im, alignments, bpPerPix, 1, rangeDict2, rangeDict1)
if joinB in "23":
drawJoins(im, alignmentsB, bpPerPix, 1, rangeDict2, rangeDict1)
for i in range(height):
for j in range(width):
store_value = hits[i * width + j]
......@@ -933,6 +1014,9 @@ if __name__ == "__main__":
op.add_option("--pad", metavar="FRAC", type="float", default=0.04, help=
"pad length when cutting unaligned gaps: "
"fraction of aligned length (default=%default)")
op.add_option("-j", "--join", default="0", metavar="N", help=
"join: 0=nothing, 1=alignments adjacent in genome1, "
"2=alignments adjacent in genome2 (default=%default)")
op.add_option("--border-pixels", metavar="INT", type="int", default=1,
help="number of pixels between sequences (default=%default)")
op.add_option("--border-color", metavar="COLOR", default="black",
......
......@@ -26,7 +26,7 @@ def randomSample(things, sampleSize):
return reservoir
def writeWords(outFile, words):
outFile.write(" ".join(words) + "\n")
print(*words, file=outFile)
def seqInput(fileNames):
if not fileNames:
......@@ -306,15 +306,12 @@ def gapCostsFromProbs(scale, probs):
if insExtendCost == 0: insExtendCost = 1
return delExistCost, insExistCost, delExtendCost, insExtendCost
def writeLine(out, *things):
out.write(" ".join(map(str, things)) + "\n")
def writeGapCosts(gapCosts, out):
delExistCost, insExistCost, delExtendCost, insExtendCost = gapCosts
writeLine(out, "#last -a", delExistCost)
writeLine(out, "#last -A", insExistCost)
writeLine(out, "#last -b", delExtendCost)
writeLine(out, "#last -B", insExtendCost)
print("#last -a", delExistCost, file=out)
print("#last -A", insExistCost, file=out)
print("#last -b", delExtendCost, file=out)
print("#last -B", insExtendCost, file=out)
def printGapCosts(gapCosts):
delExistCost, insExistCost, delExtendCost, insExtendCost = gapCosts
......@@ -349,6 +346,7 @@ def fixedLastalArgs(opts, lastalProgName):
if opts.m: x.append("-m" + opts.m)
if opts.k: x.append("-k" + opts.k)
if opts.P: x.append("-P" + opts.P)
if opts.X: x.append("-X" + opts.X)
if opts.Q: x.append("-Q" + opts.Q)
if opts.verbose: x.append("-" + "v" * opts.verbose)
return x
......@@ -366,7 +364,7 @@ def doTraining(opts, args):
if opts.A: x.append("-A" + opts.A)
if opts.B: x.append("-B" + opts.B)
x += args
y = ["last-split", "-n"]
y = ["last-split", "-n", "-m0.01"] # xxx ???
z = ["last-postmask"]
p = subprocess.Popen(x, stdout=subprocess.PIPE, universal_newlines=True)
q = subprocess.Popen(y, stdin=p.stdout, stdout=subprocess.PIPE,
......@@ -418,11 +416,12 @@ def doTraining(opts, args):
q = subprocess.Popen(z, stdin=q.stdout, stdout=subprocess.PIPE,
universal_newlines=True)
if opts.Q: writeLine(sys.stdout, "#last -Q", opts.Q)
if opts.X: print("#last -X", opts.X)
if opts.Q: print("#last -Q", opts.Q)
gapCosts = gapCostsFromProbs(externalScale, gapProbs)
writeGapCosts(gapCosts, sys.stdout)
if opts.s: writeLine(sys.stdout, "#last -s", opts.s)
if opts.S: writeLine(sys.stdout, "#last -S", opts.S)
if opts.s: print("#last -s", opts.s)
if opts.S: print("#last -S", opts.S)
matScores = matScoresFromProbs(externalScale, matProbs)
externalMatrix = matrixWithLetters(matScores)
print("# score matrix "
......@@ -499,6 +498,9 @@ if __name__ == "__main__":
"every STEP-th position in each query (default: 1)")
og.add_option("-P", metavar="THREADS",
help="number of parallel threads")
og.add_option("-X", metavar="NUMBER", help="N/X is ambiguous in: "
"0=neither sequence, 1=reference, 2=query, 3=both "
"(default=0)")
og.add_option("-Q", metavar="NUMBER", help=
"input format: 0=fasta or fastq-ignore, 1=fastq-sanger "
"(default=fasta)")
......
......@@ -4,7 +4,6 @@
#include "Alphabet.hh"
#include "Centroid.hh"
#include "GeneticCode.hh"
#include "GeneralizedAffineGapCosts.hh"
#include "GreedyXdropAligner.hh"
#include "TwoQualityScoreMatrix.hh"
#include <cassert>
......@@ -35,7 +34,7 @@ static void addExpectedCounts( double* expectedCounts,
const int numEmissionCounts = alph.size * alph.size;
double* transitionCounts = &expectedCounts[ numEmissionCounts ];
transitionCounts[0] += ec.MM + ec.DM + ec.IM + ec.PM; // match count
transitionCounts[0] += ec.toMatch;
transitionCounts[1] += ec.DD + ec.MD + ec.PD; // deleted letter count
transitionCounts[2] += ec.II + ec.MI + ec.DI + ec.PI; // ins. letter count
transitionCounts[3] += ec.MD + ec.PD; // deletion open/close count
......@@ -49,8 +48,6 @@ static void addExpectedCounts( double* expectedCounts,
// MI = IM - DI - PI + IQ
// PM = MP - PD - PI - PQ
// DM + IM + PM = MD + MI + MP - DQ - IQ - PQ
// SM, SD, SP, SI seem to be always zero.
// MQ, SQ ?
}
static void countSeedMatches( double* expectedCounts,
......@@ -73,7 +70,7 @@ void Alignment::makeXdrop( Centroid& centroid,
GreedyXdropAligner& greedyAligner, bool isGreedy,
const uchar* seq1, const uchar* seq2, int globality,
const ScoreMatrixRow* scoreMatrix, int smMax,
const GeneralizedAffineGapCosts& gap, int maxDrop,
const mcf::GapCosts& gap, int maxDrop,
int frameshiftCost, size_t frameSize,
const ScoreMatrixRow* pssm2,
const TwoQualityScoreMatrix& sm2qual,
......@@ -171,7 +168,7 @@ void Alignment::makeXdrop( Centroid& centroid,
// cost of the gap between x and y
static int gapCost(const SegmentPair &x, const SegmentPair &y,
const GeneralizedAffineGapCosts &gapCosts,
const mcf::GapCosts &gapCosts,
int frameshiftCost, size_t frameSize) {
size_t gapSize1 = y.beg1() - x.end1();
size_t gapSize2, frameshift2;
......@@ -183,7 +180,7 @@ static int gapCost(const SegmentPair &x, const SegmentPair &y,
bool Alignment::isOptimal( const uchar* seq1, const uchar* seq2, int globality,
const ScoreMatrixRow* scoreMatrix, int maxDrop,
const GeneralizedAffineGapCosts& gapCosts,
const mcf::GapCosts& gapCosts,
int frameshiftCost, size_t frameSize,
const ScoreMatrixRow* pssm2,
const TwoQualityScoreMatrix& sm2qual,
......@@ -226,7 +223,7 @@ bool Alignment::isOptimal( const uchar* seq1, const uchar* seq2, int globality,
bool Alignment::hasGoodSegment(const uchar *seq1, const uchar *seq2,
int minScore, const ScoreMatrixRow *scoreMatrix,
const GeneralizedAffineGapCosts &gapCosts,
const mcf::GapCosts &gapCosts,
int frameshiftCost, size_t frameSize,
const ScoreMatrixRow *pssm2,
const TwoQualityScoreMatrix &sm2qual,
......@@ -292,13 +289,15 @@ void Alignment::extend( std::vector< SegmentPair >& chunks,
size_t start1, size_t start2,
bool isForward, int globality,
const ScoreMatrixRow* sm, int smMax, int maxDrop,
const GeneralizedAffineGapCosts& gap,
const mcf::GapCosts& gap,
int frameshiftCost, size_t frameSize,
const ScoreMatrixRow* pssm2,
const TwoQualityScoreMatrix& sm2qual,
const uchar* qual1, const uchar* qual2,
const Alphabet& alph, AlignmentExtras& extras,
double gamma, int outputType ){
const mcf::GapCosts::Piece &del = gap.delPieces[0];
const mcf::GapCosts::Piece &ins = gap.insPieces[0];
GappedXdropAligner& aligner = centroid.aligner();
if( frameSize ){
......@@ -307,7 +306,6 @@ void Alignment::extend( std::vector< SegmentPair >& chunks,
assert( !globality );
assert( !pssm2 );
assert( !sm2qual );
assert( gap.isSymmetric() );
size_t dnaStart = aaToDna( start2, frameSize );
size_t f = dnaStart + 1;
......@@ -317,13 +315,13 @@ void Alignment::extend( std::vector< SegmentPair >& chunks,
score += aligner.align3( seq1 + start1, seq2 + start2,
seq2 + frame1, seq2 + frame2, isForward,
sm, gap.delExist, gap.delExtend, gap.pairExtend,
sm, del.openCost, del.growCost, gap.pairCost,
frameshiftCost, maxDrop, smMax );
size_t end1, end2, size;
// This should be OK even if end2 < size * 3:
while( aligner.getNextChunk3( end1, end2, size,
gap.delExist, gap.delExtend, gap.pairExtend,
del.openCost, del.growCost, gap.pairCost,
frameshiftCost ) )
chunks.push_back( SegmentPair( end1 - size, end2 - size * 3, size ) );
......@@ -336,19 +334,19 @@ void Alignment::extend( std::vector< SegmentPair >& chunks,
: sm2qual ? aligner.align2qual( seq1 + start1, qual1 + start1,
seq2 + start2, qual2 + start2,
isForward, globality, sm2qual,
gap.delExist, gap.delExtend,
gap.insExist, gap.insExtend,
gap.pairExtend, maxDrop, smMax )
del.openCost, del.growCost,
ins.openCost, ins.growCost,
gap.pairCost, maxDrop, smMax )
: pssm2 ? aligner.alignPssm( seq1 + start1, pssm2 + start2,
isForward, globality,
gap.delExist, gap.delExtend,
gap.insExist, gap.insExtend,
gap.pairExtend, maxDrop, smMax )
del.openCost, del.growCost,
ins.openCost, ins.growCost,
gap.pairCost, maxDrop, smMax )
: aligner.align( seq1 + start1, seq2 + start2,
isForward, globality, sm,
gap.delExist, gap.delExtend,
gap.insExist, gap.insExtend,
gap.pairExtend, maxDrop, smMax );
del.openCost, del.growCost,
ins.openCost, ins.growCost,
gap.pairCost, maxDrop, smMax );
if( extensionScore == -INF ){
score = -INF; // avoid score overflow
......@@ -364,9 +362,8 @@ void Alignment::extend( std::vector< SegmentPair >& chunks,
chunks.push_back( SegmentPair( end1 - size, end2 - size, size ) );
}else{
while( aligner.getNextChunk( end1, end2, size,
gap.delExist, gap.delExtend,
gap.insExist, gap.insExtend,
gap.pairExtend ) )
del.openCost, del.growCost,
ins.openCost, ins.growCost, gap.pairCost ) )
chunks.push_back( SegmentPair( end1 - size, end2 - size, size ) );
}
}
......
......@@ -6,6 +6,7 @@
#define ALIGNMENT_HH
#include "ScoreMatrixRow.hh"
#include "SegmentPair.hh"
#include "mcf_gap_costs.hh"
#include <stddef.h> // size_t
#include <string>
#include <vector>
......@@ -15,7 +16,6 @@ namespace cbrc{
typedef unsigned char uchar;
class GeneralizedAffineGapCosts;
class GreedyXdropAligner;
class LastEvaluer;
class MultiSequence;
......@@ -81,7 +81,7 @@ struct Alignment{
GreedyXdropAligner& greedyAligner, bool isGreedy,
const uchar* seq1, const uchar* seq2, int globality,
const ScoreMatrixRow* scoreMatrix, int smMax,
const GeneralizedAffineGapCosts& gap, int maxDrop,
const mcf::GapCosts& gap, int maxDrop,
int frameshiftCost, size_t frameSize,
const ScoreMatrixRow* pssm2,
const TwoQualityScoreMatrix& sm2qual,
......@@ -95,7 +95,7 @@ struct Alignment{
// If "globality" is non-zero, skip the prefix and suffix checks.
bool isOptimal( const uchar* seq1, const uchar* seq2, int globality,
const ScoreMatrixRow* scoreMatrix, int maxDrop,
const GeneralizedAffineGapCosts& gapCosts,
const mcf::GapCosts& gapCosts,
int frameshiftCost, size_t frameSize,
const ScoreMatrixRow* pssm2,
const TwoQualityScoreMatrix& sm2qual,
......@@ -104,7 +104,7 @@ struct Alignment{
// Does the Alignment have any segment with score >= minScore?
bool hasGoodSegment(const uchar *seq1, const uchar *seq2,
int minScore, const ScoreMatrixRow *scoreMatrix,
const GeneralizedAffineGapCosts &gapCosts,
const mcf::GapCosts &gapCosts,
int frameshiftCost, size_t frameSize,
const ScoreMatrixRow *pssm2,
const TwoQualityScoreMatrix &sm2qual,
......@@ -134,7 +134,7 @@ struct Alignment{
size_t start1, size_t start2,
bool isForward, int globality,
const ScoreMatrixRow* sm, int smMax, int maxDrop,
const GeneralizedAffineGapCosts& gap,
const mcf::GapCosts& gap,
int frameshiftCost, size_t frameSize,
const ScoreMatrixRow* pssm2,
const TwoQualityScoreMatrix& sm2qual,
......
......@@ -18,50 +18,26 @@ namespace{
}
}
static bool isAffineGapCosts(const mcf::GapCosts &g) {
return g.pairCost >= g.delPieces[0].growCost + g.insPieces[0].growCost +
std::max(g.delPieces[0].openCost, g.insPieces[0].openCost);
}
namespace cbrc{
ExpectedCount::ExpectedCount ()
{
double d0 = 0;
MM = d0; MD = d0; MP = d0; MI = d0; MQ = d0;
DD = d0; DM = d0; DI = d0;
PP = d0; PM = d0; PD = d0; PI = d0;
II = d0; IM = d0;
SM = d0; SD = d0; SP = d0; SI = d0; SQ = d0;
toMatch = d0;
MD = d0; MP = d0; MI = d0;
DD = d0; DI = d0;
PP = d0; PD = d0; PI = d0;
II = d0;
for (int n=0; n<scoreMatrixRowSize; n++)
for (int m=0; m<scoreMatrixRowSize; m++) emit[n][m] = d0;
}
std::ostream& ExpectedCount::write (std::ostream& os) const
{
for (int n=0; n<scoreMatrixRowSize; ++n) {
for (int m=0; m<scoreMatrixRowSize; ++m) {
double prob = emit[n][m];
if (prob > 0)
os << "emit[" << n << "][" << m << "]=" << emit[n][m] << std::endl;
}
}
os << "M->M=" << MM << std::endl;
os << "M->D=" << MD << std::endl;
os << "M->P=" << MP << std::endl;
os << "M->I=" << MI << std::endl;
os << "D->D=" << DD << std::endl;
os << "D->M=" << DM << std::endl;
os << "D->I=" << DI << std::endl;
os << "P->P=" << PP << std::endl;
os << "P->M=" << PM << std::endl;
os << "P->D=" << PD << std::endl;
os << "P->I=" << PI << std::endl;
os << "I->I=" << II << std::endl;
os << "I->M=" << IM << std::endl;
return os;
}
void Centroid::setScoreMatrix( const ScoreMatrixRow* sm, double T ) {
this -> T = T;
this -> isPssm = false;
......@@ -126,14 +102,12 @@ namespace cbrc{
void Centroid::initForwardMatrix(){
size_t n = xa.scoreEndIndex( numAntidiagonals );
if ( fM.size() < n ) {
fM.resize( n );
fD.resize( n );
fI.resize( n );
fP.resize( n );
}
fM[0] = 1;
}
......@@ -147,17 +121,17 @@ namespace cbrc{
void Centroid::forward(const uchar* seq1, const uchar* seq2,
const ExpMatrixRow* pssm, bool isExtendFwd,
const GeneralizedAffineGapCosts& gap,
int globality) {
initForwardMatrix();
const mcf::GapCosts& gap, int globality) {
const int seqIncrement = isExtendFwd ? 1 : -1;
const bool isAffine = gap.isAffine();
const double eE = EXP ( - gap.delExtend / T );
const double eF = EXP ( - gap.delExist / T );
const double eEI = EXP ( - gap.insExtend / T );
const double eFI = EXP ( - gap.insExist / T );
const double eP = EXP ( - gap.pairExtend / T );
assert( gap.insExist == gap.delExist || eP <= 0.0 );
const bool isAffine = isAffineGapCosts(gap);
initForwardMatrix();
const double eE = EXP(-gap.delPieces[0].growCost / T);
const double eF = EXP(-gap.delPieces[0].openCost / T);
const double eEI = EXP(-gap.insPieces[0].growCost / T);
const double eFI = EXP(-gap.insPieces[0].openCost / T);
const double eP = EXP(-gap.pairCost / T);
double Z = 0.0; // partion function of forward values
for( size_t k = 0; k < numAntidiagonals; ++k ){ // loop over antidiagonals
......@@ -196,18 +170,20 @@ namespace cbrc{
if (isAffine) {
while (1) {
const double xM = *fM2 * scale12;
const double xD = *fD1 * seE;
const double xI = *fI1 * seEI;
const unsigned letter1 = *s1;
const unsigned letter2 = *s2;
const double matchProb = match_score[letter1][letter2];
const double xM = *fM2 * scale12;
const double xD = *fD1 * seE;
const double xI = *fI1 * seEI;
const double xSum = xM + xD + xI;
*fD0 = xM * eF + xD;
*fI0 = (xM + xD) * eFI + xI;
const double total = xM + xD + xI;
*fM0 = total * matchProb;
*fM0 = xSum * matchProb;
sum_f += xM;
if (globality && matchProb <= 0) Z += total; // xxx
if (globality && matchProb <= 0) Z += xSum; // xxx
if (fM0 == fM0last) break;
fM0++; fD0++; fI0++;
......@@ -217,20 +193,22 @@ namespace cbrc{
}
} else {
while (1) {
const unsigned letter1 = *s1;
const unsigned letter2 = *s2;
const double matchProb = match_score[letter1][letter2];
const double xM = *fM2 * scale12;
const double xD = *fD1 * seE;
const double xI = *fI1 * seEI;
const double xP = *fP2 * seP;
const unsigned letter1 = *s1;
const unsigned letter2 = *s2;
const double matchProb = match_score[letter1][letter2];
const double xSum = (xM + xD) + (xI + xP);
*fP0 = xM * eF + xP;
*fD0 = xM * eF + xP + xD;
*fI0 = (xM + xD) * eFI + (xI + xP);
const double total = (xM + xD) + (xI + xP);
*fM0 = total * matchProb;
*fM0 = xSum * matchProb;
sum_f += xM;
if (globality && matchProb <= 0) Z += total; // xxx
if (globality && matchProb <= 0) Z += xSum; // xxx
if (fM0 == fM0last) break;
fM0++; fD0++; fI0++; fP0++;
......@@ -244,18 +222,20 @@ namespace cbrc{
if (isAffine) {
while (1) {
const double xM = *fM2 * scale12;
const double xD = *fD1 * seE;
const double xI = *fI1 * seEI;
const unsigned letter1 = *s1;
const double *matchProbs = *p2;
const double matchProb = matchProbs[letter1];
const double xM = *fM2 * scale12;
const double xD = *fD1 * seE;
const double xI = *fI1 * seEI;
const double xSum = xM + xD + xI;
*fD0 = xM * eF + xD;
*fI0 = (xM + xD) * eFI + xI;
const double total = xM + xD + xI;
*fM0 = total * matchProb;
*fM0 = xSum * matchProb;
sum_f += xM;
if (globality && matchProb <= 0) Z += total; // xxx
if (globality && matchProb <= 0) Z += xSum; // xxx
if (fM0 == fM0last) break;
fM0++; fD0++; fI0++;
......@@ -265,20 +245,22 @@ namespace cbrc{
}
} else {
while (1) {
const unsigned letter1 = *s1;
const double *matchProbs = *p2;
const double matchProb = matchProbs[letter1];
const double xM = *fM2 * scale12;
const double xD = *fD1 * seE;
const double xI = *fI1 * seEI;
const double xP = *fP2 * seP;
const unsigned letter1 = *s1;
const double *matchProbs = *p2;
const double matchProb = matchProbs[letter1];
const double xSum = (xM + xD) + (xI + xP);
*fP0 = xM * eF + xP;
*fD0 = xM * eF + xP + xD;
*fI0 = (xM + xD) * eFI + (xI + xP);
const double total = (xM + xD) + (xI + xP);
*fM0 = total * matchProb;
*fM0 = xSum * matchProb;
sum_f += xM;
if (globality && matchProb <= 0) Z += total; // xxx
if (globality && matchProb <= 0) Z += xSum; // xxx
if (fM0 == fM0last) break;
fM0++; fD0++; fI0++; fP0++;
......@@ -303,17 +285,17 @@ namespace cbrc{
// compute posterior probabilities while executing backward algorithm
void Centroid::backward(const uchar* seq1, const uchar* seq2,
const ExpMatrixRow* pssm, bool isExtendFwd,
const GeneralizedAffineGapCosts& gap,
int globality) {
initBackwardMatrix();
const mcf::GapCosts& gap, int globality) {
const int seqIncrement = isExtendFwd ? 1 : -1;
const bool isAffine = gap.isAffine();
const double eE = EXP ( - gap.delExtend / T );
const double eF = EXP ( - gap.delExist / T );
const double eEI = EXP ( - gap.insExtend / T );
const double eFI = EXP ( - gap.insExist / T );
const double eP = EXP ( - gap.pairExtend / T );
assert( gap.insExist == gap.delExist || eP <= 0.0 );
const bool isAffine = isAffineGapCosts(gap);
initBackwardMatrix();
const double eE = EXP(-gap.delPieces[0].growCost / T);
const double eF = EXP(-gap.delPieces[0].openCost / T);
const double eEI = EXP(-gap.insPieces[0].growCost / T);
const double eFI = EXP(-gap.insPieces[0].openCost / T);
const double eP = EXP(-gap.pairCost / T);
double scaledUnit = 1.0;
for( size_t k = numAntidiagonals; k-- > 0; ){
......@@ -357,10 +339,14 @@ namespace cbrc{
if (isAffine) {
while (1) {
const double matchProb = match_score[*s1][*s2];
const unsigned letter1 = *s1;
const unsigned letter2 = *s2;
const double matchProb = match_score[letter1][letter2];
const double yM = (*bM0) * matchProb;
const double yD = *bD0;
const double yI = *bI0;
double zM = yM + yD * eF + yI * eFI;
double zD = yM + yD + yI * eFI;
double zI = yM + yI;
......@@ -390,11 +376,15 @@ namespace cbrc{
}
} else {
while (1) {
const double matchProb = match_score[*s1][*s2];
const unsigned letter1 = *s1;
const unsigned letter2 = *s2;
const double matchProb = match_score[letter1][letter2];
const double yM = (*bM0) * matchProb;
const double yD = *bD0;
const double yI = *bI0;
const double yP = *bP0;
double zM = yM + yD * eF + yI * eFI + yP * eF;
double zD = yM + yD + yI * eFI;
double zI = yM + yI;
......@@ -430,10 +420,13 @@ namespace cbrc{
if (isAffine) {
while (1) {
const double matchProb = (*p2)[*s1];
const unsigned letter1 = *s1;
const double matchProb = (*p2)[letter1];
const double yM = (*bM0) * matchProb;
const double yD = *bD0;
const double yI = *bI0;
double zM = yM + yD * eF + yI * eFI;
double zD = yM + yD + yI * eFI;
double zI = yM + yI;
......@@ -461,11 +454,14 @@ namespace cbrc{
}
} else {
while (1) {
const double matchProb = (*p2)[*s1];
const unsigned letter1 = *s1;
const double matchProb = (*p2)[letter1];
const double yM = (*bM0) * matchProb;
const double yD = *bD0;
const double yI = *bI0;
const double yP = *bP0;
double zM = yM + yD * eF + yI * eFI + yP * eF;
double zD = yM + yD + yI * eFI;
double zI = yM + yI;
......@@ -729,7 +725,7 @@ namespace cbrc{
void Centroid::computeExpectedCounts ( const uchar* seq1, const uchar* seq2,
size_t start1, size_t start2,
bool isExtendFwd,
const GeneralizedAffineGapCosts& gap,
const mcf::GapCosts& gap,
unsigned alphabetSize,
ExpectedCount& c ) const{
seq1 += start1;
......@@ -744,13 +740,13 @@ namespace cbrc{
const int seqIncrement = isExtendFwd ? 1 : -1;
int alphabetSizeIncrement = alphabetSize;
if (!isExtendFwd) alphabetSizeIncrement *= -1;
const bool isAffine = gap.isAffine();
const double eE = EXP ( - gap.delExtend / T );
const double eF = EXP ( - gap.delExist / T );
const double eEI = EXP ( - gap.insExtend / T );
const double eFI = EXP ( - gap.insExist / T );
const double eP = EXP ( - gap.pairExtend / T );
assert( gap.insExist == gap.delExist || eP <= 0.0 );
const bool isAffine = isAffineGapCosts(gap);
const double eE = EXP(-gap.delPieces[0].growCost / T);
const double eF = EXP(-gap.delPieces[0].openCost / T);
const double eEI = EXP(-gap.insPieces[0].growCost / T);
const double eFI = EXP(-gap.insPieces[0].openCost / T);
const double eP = EXP(-gap.pairCost / T);
for( size_t k = 0; k < numAntidiagonals; ++k ){ // loop over antidiagonals
const size_t seq1beg = seq1start( k );
......@@ -785,23 +781,28 @@ namespace cbrc{
if (isAffine) {
while (1) {
const double xM = *fM2 * scale12;
const double xD = *fD1 * seE;
const double xI = *fI1 * seEI;
const unsigned letter1 = *s1;
const unsigned letter2 = *s2;
const double yM = *bM0 * match_score[letter1][letter2];
const double matchProb = match_score[letter1][letter2];
const double yM = *bM0 * matchProb;
const double yD = *bD0;
const double yI = *bI0;
c.emit[letter1][letter2] += (xM + xD + xI) * yM;
c.MM += xM * yM;
c.DM += xD * yM;
c.IM += xI * yM;
const double xM = *fM2 * scale12;
const double xD = *fD1 * seE;
const double xI = *fI1 * seEI;
const double xSum = xM + xD + xI;
const double alignProb = xSum * yM;
c.emit[letter1][letter2] += alignProb;
c.toMatch += alignProb;
c.MD += xM * yD * eF;
c.DD += xD * yD;
c.MI += xM * yI * eFI;
c.DI += xD * yI * eFI;
c.II += xI * yI;
if (bM0 == bM0last) break;
fM2++; fD1++; fI1++;
bM0++; bD0++; bI0++;
......@@ -810,21 +811,24 @@ namespace cbrc{
}
} else {
while (1) {
const double xM = *fM2 * scale12;
const double xD = *fD1 * seE;
const double xI = *fI1 * seEI;
const double xP = *fP2 * seP;
const unsigned letter1 = *s1;
const unsigned letter2 = *s2;
const double yM = *bM0 * match_score[letter1][letter2];
const double matchProb = match_score[letter1][letter2];
const double yM = *bM0 * matchProb;
const double yD = *bD0;
const double yI = *bI0;
const double yP = *bP0;
c.emit[letter1][letter2] += (xM + xD + xI + xP) * yM;
c.MM += xM * yM;
c.DM += xD * yM;
c.IM += xI * yM;
c.PM += xP * yM;
const double xM = *fM2 * scale12;
const double xD = *fD1 * seE;
const double xI = *fI1 * seEI;
const double xP = *fP2 * seP;
const double xSum = xM + xD + xI + xP;
const double alignProb = xSum * yM;
c.emit[letter1][letter2] += alignProb;
c.toMatch += alignProb;
c.MD += xM * yD * eF;
c.DD += xD * yD;
c.PD += xP * yD;
......@@ -834,6 +838,7 @@ namespace cbrc{
c.PI += xP * yI;
c.MP += xM * yP * eF;
c.PP += xP * yP;
if (bM0 == bM0last) break;
fM2++; fD1++; fI1++; fP2++;
bM0++; bD0++; bI0++; bP0++;
......@@ -841,32 +846,35 @@ namespace cbrc{
s2 -= seqIncrement;
}
}
}
else {
} else {
const ExpMatrixRow* p2 = isExtendFwd ? pssm + seq2pos : pssm - seq2pos;
const size_t a2 = seq2pos * alphabetSize;
const double* lp2 = isExtendFwd ? letterProbs + a2 : letterProbs - a2;
if (isAffine) {
while (1) { // inner most loop
const double xM = *fM2 * scale12;
const double xD = *fD1 * seE;
const double xI = *fI1 * seEI;
const unsigned letter1 = *s1;
const double yM = *bM0 * (*p2)[letter1];
const double matchProb = (*p2)[letter1];
const double yM = *bM0 * matchProb;
const double yD = *bD0;
const double yI = *bI0;
const double alignProb = (xM + xD + xI) * yM;
const double xM = *fM2 * scale12;
const double xD = *fD1 * seE;
const double xI = *fI1 * seEI;
const double xSum = xM + xD + xI;
const double alignProb = xSum * yM;
countUncertainLetters(c.emit[letter1], alignProb,
alphabetSize, match_score[letter1], lp2);
c.MM += xM * yM;
c.DM += xD * yM;
c.IM += xI * yM;
c.toMatch += alignProb;
c.MD += xM * yD * eF;
c.DD += xD * yD;
c.MI += xM * yI * eFI;
c.DI += xD * yI * eFI;
c.II += xI * yI;
if (bM0 == bM0last) break;
fM2++; fD1++; fI1++;
bM0++; bD0++; bI0++;
......@@ -876,22 +884,24 @@ namespace cbrc{
}
} else {
while (1) { // inner most loop
const double xM = *fM2 * scale12;
const double xD = *fD1 * seE;
const double xI = *fI1 * seEI;
const double xP = *fP2 * seP;
const unsigned letter1 = *s1;
const double yM = *bM0 * (*p2)[letter1];
const double matchProb = (*p2)[letter1];
const double yM = *bM0 * matchProb;
const double yD = *bD0;
const double yI = *bI0;
const double yP = *bP0;
const double alignProb = (xM + xD + xI + xP) * yM;
const double xM = *fM2 * scale12;
const double xD = *fD1 * seE;
const double xI = *fI1 * seEI;
const double xP = *fP2 * seP;
const double xSum = xM + xD + xI + xP;
const double alignProb = xSum * yM;
countUncertainLetters(c.emit[letter1], alignProb,
alphabetSize, match_score[letter1], lp2);
c.MM += xM * yM;
c.DM += xD * yM;
c.IM += xI * yM;
c.PM += xP * yM;
c.toMatch += alignProb;
c.MD += xM * yD * eF;
c.DD += xD * yD;
c.PD += xP * yD;
......@@ -901,6 +911,7 @@ namespace cbrc{
c.PI += xP * yI;
c.MP += xM * yP * eF;
c.PP += xP * yP;
if (bM0 == bM0last) break;
fM2++; fD1++; fI1++; fP2++;
bM0++; bD0++; bI0++; bP0++;
......
......@@ -4,7 +4,7 @@
#ifndef CENTROID_HH
#define CENTROID_HH
#include "GappedXdropAligner.hh"
#include "GeneralizedAffineGapCosts.hh"
#include "mcf_gap_costs.hh"
#include "SegmentPair.hh"
#include "OneQualityScoreMatrix.hh"
#include <stddef.h> // size_t
......@@ -16,14 +16,13 @@ namespace cbrc{
struct ExpectedCount{
public:
double emit[scoreMatrixRowSize][scoreMatrixRowSize];
double MM, MD, MP, MI, MQ;
double DD, DM, DI;
double PP, PM, PD, PI;
double II, IM;
double SM, SD, SP, SI, SQ;
double toMatch;
double MD, MP, MI;
double DD, DI;
double PP, PD, PI;
double II;
public:
ExpectedCount ();
std::ostream& write (std::ostream& os) const;
};
/**
......@@ -58,8 +57,7 @@ namespace cbrc{
void doForwardBackwardAlgorithm(const uchar* seq1, const uchar* seq2,
size_t start1, size_t start2,
bool isExtendFwd,
const GeneralizedAffineGapCosts& gap,
bool isExtendFwd, const mcf::GapCosts& gap,
int globality) {
seq1 += start1;
seq2 += start2;
......@@ -100,8 +98,7 @@ namespace cbrc{
// Added by MH (2008/10/10) : compute expected counts for transitions and emissions
void computeExpectedCounts(const uchar* seq1, const uchar* seq2,
size_t start1, size_t start2, bool isExtendFwd,
const GeneralizedAffineGapCosts& gap,
unsigned alphabetSize,
const mcf::GapCosts& gap, unsigned alphabetSize,
ExpectedCount& count) const;
private:
......@@ -144,11 +141,11 @@ namespace cbrc{
void forward(const uchar* seq1, const uchar* seq2,
const ExpMatrixRow* pssm, bool isExtendFwd,
const GeneralizedAffineGapCosts& gap, int globality);
const mcf::GapCosts& gap, int globality);
void backward(const uchar* seq1, const uchar* seq2,
const ExpMatrixRow* pssm, bool isExtendFwd,
const GeneralizedAffineGapCosts& gap, int globality);
const mcf::GapCosts& gap, int globality);
void initForwardMatrix();
void initBackwardMatrix();
......