Skip to content
Commits on Source (6)
2018-04-20 Martin C. Frith <Martin C. Frith>
* doc/Makefile, doc/lastal.txt, doc/maf-cut.txt, scripts/maf-cut:
Add maf-cut utility
[482845444bcd] [tip]
2018-04-18 Martin C. Frith <Martin C. Frith>
* src/ScoreMatrix.cc, src/ScoreMatrix.hh, src/lastal.cc, src/makefile,
test/last-test.out:
Use appropriate scores for ambiguous DNA letters
[75785542d5a0]
2018-04-17 Martin C. Frith <Martin C. Frith>
* src/ScoreMatrix.cc, src/ScoreMatrix.hh, test/last-test.out:
Prettify the score matrix in lastal's header
[68bd00d2faff]
* src/ScoreMatrix.cc, src/lastal.cc:
Refactor
[d1aa91889332]
* src/ScoreMatrix.cc, src/ScoreMatrix.hh, src/TantanMasker.cc,
src/lastal.cc:
Refactor
[e22cac29168c]
2018-04-16 Martin C. Frith <Martin C. Frith>
* src/QualityPssmMaker.cc, src/ScoreMatrix.cc, src/ScoreMatrix.hh,
src/qualityScoreUtil.hh:
Refactor
[fb851b8892a5]
2018-03-05 Martin C. Frith <Martin C. Frith>
* scripts/last-dotplot, test/last-dotplot-test.out, test/last-dotplot-
test.sh:
Fix dotplot --strands bug with cut sequences
[bc08832db1a1]
* scripts/last-dotplot, test/last-dotplot-test.out, test/last-dotplot-
test.sh:
Fix dotplot crash for sorting by sequence length
[e1861f956f60]
2018-02-26 Martin C. Frith <Martin C. Frith>
* doc/lastdb.txt, src/Alphabet.cc, test/last-test.out, test/last-
test.sh:
Convert Us in RNA to Ts
[e7d7cb7db0a7]
2018-02-13 Martin C. Frith <Martin C. Frith>
* doc/lastdb.txt, src/SubsetSuffixArray.hh,
src/SubsetSuffixArraySort.cc, src/lastdb.cc:
Multi-thread lastdb's main sort
[77d8757028c2]
* src/alp/sls_alignment_evaluer.cpp, test/last-test.out:
Fix some E-value & last-train fails
[2c57015b1e89]
2018-01-05 Martin C. Frith <Martin C. Frith>
* src/Alignment.cc, src/Alignment.hh, src/Centroid.cc,
src/Centroid.hh:
Refactor
[8430dbf5abe7] [tip]
[8430dbf5abe7]
* src/Centroid.cc:
Try to calculate lastal probs a tiny bit faster
......
last-align (932-1) unstable; urgency=medium
* New upstream version
* Point Vcs fields to salsa.debian.org
* Standards-Version: 4.1.4
-- Andreas Tille <tille@debian.org> Thu, 26 Apr 2018 22:27:19 +0200
last-align (921-1) unstable; urgency=medium
* New upstream version
......
......@@ -8,9 +8,9 @@ Build-Depends: debhelper (>= 11~),
help2man,
python-pil,
zlib1g-dev
Standards-Version: 4.1.3
Vcs-Browser: https://anonscm.debian.org/cgit/debian-med/last-align.git
Vcs-Git: https://anonscm.debian.org/git/debian-med/last-align.git
Standards-Version: 4.1.4
Vcs-Browser: https://salsa.debian.org/med-team/last-align
Vcs-Git: https://salsa.debian.org/med-team/last-align.git
Homepage: http://last.cbrc.jp/
Package: last-align
......
......@@ -5,7 +5,7 @@ last-map-probs.html last-matrices.html last-pair-probs.html \
last-papers.html last-parallel.html last-postmask.html \
last-repeats.html last-seeds.html last-split.html last-train.html \
last-tuning.html last-tutorial.html last.html lastal.html lastdb.html \
maf-convert.html
maf-convert.html maf-cut.html
all: ${DOCS}
......
......@@ -462,14 +462,18 @@ value of -e, so if you specify -e then -E has no effect.</td></tr>
<td><p class="first">Specify a match/mismatch score matrix. Options -r and -q will
be ignored. The built-in matrices are described in
<a class="reference external" href="last-matrices.html">last-matrices.html</a>.</p>
<p class="last">Any other NAME is assumed to be a file name. For an example of
the format, see the matrix files in the data directory. Any
letters that aren't in the matrix will get the lowest score in
the matrix when aligned to anything. Asymmetric scores are
allowed: query letters correspond to columns and reference
letters correspond to rows. Other options can be specified on
lines starting with &quot;#last&quot;, but command line options override
them.</p>
<p>Any other NAME is assumed to be a file name. For an example of
the format, see the matrix files in the data directory.
Asymmetric scores are allowed: query letters correspond to
columns and reference letters correspond to rows.</p>
<p>Any letters that aren't in the matrix get default match/mismatch
scores. For doubly- and triply-ambiguous bases (such as &quot;W&quot;
meaning A or T), these default scores are derived from the ACGT
scores, and are shown in the header of lastal's output. Any
other letters get the lowest score in the matrix when aligned to
anything.</p>
<p class="last">Other options can be specified on lines starting with &quot;#last&quot;,
but command line options override them.</p>
</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-a <var>COST</var></span></kbd></td>
......
......@@ -130,13 +130,19 @@ Score options
`<last-matrices.html>`_.
Any other NAME is assumed to be a file name. For an example of
the format, see the matrix files in the data directory. Any
letters that aren't in the matrix will get the lowest score in
the matrix when aligned to anything. Asymmetric scores are
allowed: query letters correspond to columns and reference
letters correspond to rows. Other options can be specified on
lines starting with "#last", but command line options override
them.
the format, see the matrix files in the data directory.
Asymmetric scores are allowed: query letters correspond to
columns and reference letters correspond to rows.
Any letters that aren't in the matrix get default match/mismatch
scores. For doubly- and triply-ambiguous bases (such as "W"
meaning A or T), these default scores are derived from the ACGT
scores, and are shown in the header of lastal's output. Any
other letters get the lowest score in the matrix when aligned to
anything.
Other options can be specified on lines starting with "#last",
but command line options override them.
-a COST
Gap existence cost.
......
......@@ -461,8 +461,7 @@ lastdb and then used by lastal. These formats are described in
<kbd><span class="option">-P <var>THREADS</var></span></kbd></td>
<td>Divide the work between this number of threads running in
parallel. 0 means use as many threads as your computer claims
it can handle simultaneously. Currently, multi-threading is
used for tantan masking only.</td></tr>
it can handle simultaneously.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-m <var>PATTERN</var></span></kbd></td>
<td><p class="first">Specify a spaced seed pattern, for example &quot;-m 110101&quot;. In this
......@@ -490,7 +489,8 @@ alphabet is equivalent to &quot;-a ACGT&quot;. The protein alphabet (-p)
is equivalent to &quot;-a ACDEFGHIKLMNPQRSTVWY&quot;. Non-alphabet
letters are allowed in sequences, but by default they are
excluded from initial matches and get the mismatch score when
aligned to anything. If -a is specified, -p is ignored.</td></tr>
aligned to anything. As a special case, for the DNA alphabet,
Us are converted to Ts. If -a is specified, -p is ignored.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-i <var>MATCHES</var></span></kbd></td>
<td>This option makes lastdb faster, at the expense of limiting your
......
......@@ -129,8 +129,7 @@ Advanced Options
-P THREADS
Divide the work between this number of threads running in
parallel. 0 means use as many threads as your computer claims
it can handle simultaneously. Currently, multi-threading is
used for tantan masking only.
it can handle simultaneously.
-m PATTERN
Specify a spaced seed pattern, for example "-m 110101". In this
......@@ -161,7 +160,8 @@ Advanced Options
is equivalent to "-a ACDEFGHIKLMNPQRSTVWY". Non-alphabet
letters are allowed in sequences, but by default they are
excluded from initial matches and get the mismatch score when
aligned to anything. If -a is specified, -p is ignored.
aligned to anything. As a special case, for the DNA alphabet,
Us are converted to Ts. If -a is specified, -p is ignored.
-i MATCHES
This option makes lastdb faster, at the expense of limiting your
......
<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="Docutils 0.6: http://docutils.sourceforge.net/" />
<title>maf-cut</title>
<style type="text/css">
/*
:Author: David Goodger (goodger@python.org)
:Id: $Id: html4css1.css 5951 2009-05-18 18:03:10Z milde $
:Copyright: This stylesheet has been placed in the public domain.
Default cascading style sheet for the HTML output of Docutils.
See http://docutils.sf.net/docs/howto/html-stylesheets.html for how to
customize this style sheet.
*/
/* used to remove borders from tables and images */
.borderless, table.borderless td, table.borderless th {
border: 0 }
table.borderless td, table.borderless th {
/* Override padding for "table.docutils td" with "! important".
The right padding separates the table cells. */
padding: 0 0.5em 0 0 ! important }
.first {
/* Override more specific margin styles with "! important". */
margin-top: 0 ! important }
.last, .with-subtitle {
margin-bottom: 0 ! important }
.hidden {
display: none }
a.toc-backref {
text-decoration: none ;
color: black }
blockquote.epigraph {
margin: 2em 5em ; }
dl.docutils dd {
margin-bottom: 0.5em }
/* Uncomment (and remove this text!) to get bold-faced definition list terms
dl.docutils dt {
font-weight: bold }
*/
div.abstract {
margin: 2em 5em }
div.abstract p.topic-title {
font-weight: bold ;
text-align: center }
div.admonition, div.attention, div.caution, div.danger, div.error,
div.hint, div.important, div.note, div.tip, div.warning {
margin: 2em ;
border: medium outset ;
padding: 1em }
div.admonition p.admonition-title, div.hint p.admonition-title,
div.important p.admonition-title, div.note p.admonition-title,
div.tip p.admonition-title {
font-weight: bold ;
font-family: sans-serif }
div.attention p.admonition-title, div.caution p.admonition-title,
div.danger p.admonition-title, div.error p.admonition-title,
div.warning p.admonition-title {
color: red ;
font-weight: bold ;
font-family: sans-serif }
/* Uncomment (and remove this text!) to get reduced vertical space in
compound paragraphs.
div.compound .compound-first, div.compound .compound-middle {
margin-bottom: 0.5em }
div.compound .compound-last, div.compound .compound-middle {
margin-top: 0.5em }
*/
div.dedication {
margin: 2em 5em ;
text-align: center ;
font-style: italic }
div.dedication p.topic-title {
font-weight: bold ;
font-style: normal }
div.figure {
margin-left: 2em ;
margin-right: 2em }
div.footer, div.header {
clear: both;
font-size: smaller }
div.line-block {
display: block ;
margin-top: 1em ;
margin-bottom: 1em }
div.line-block div.line-block {
margin-top: 0 ;
margin-bottom: 0 ;
margin-left: 1.5em }
div.sidebar {
margin: 0 0 0.5em 1em ;
border: medium outset ;
padding: 1em ;
background-color: #ffffee ;
width: 40% ;
float: right ;
clear: right }
div.sidebar p.rubric {
font-family: sans-serif ;
font-size: medium }
div.system-messages {
margin: 5em }
div.system-messages h1 {
color: red }
div.system-message {
border: medium outset ;
padding: 1em }
div.system-message p.system-message-title {
color: red ;
font-weight: bold }
div.topic {
margin: 2em }
h1.section-subtitle, h2.section-subtitle, h3.section-subtitle,
h4.section-subtitle, h5.section-subtitle, h6.section-subtitle {
margin-top: 0.4em }
h1.title {
text-align: center }
h2.subtitle {
text-align: center }
hr.docutils {
width: 75% }
img.align-left, .figure.align-left{
clear: left ;
float: left ;
margin-right: 1em }
img.align-right, .figure.align-right {
clear: right ;
float: right ;
margin-left: 1em }
.align-left {
text-align: left }
.align-center {
clear: both ;
text-align: center }
.align-right {
text-align: right }
/* reset inner alignment in figures */
div.align-right {
text-align: left }
/* div.align-center * { */
/* text-align: left } */
ol.simple, ul.simple {
margin-bottom: 1em }
ol.arabic {
list-style: decimal }
ol.loweralpha {
list-style: lower-alpha }
ol.upperalpha {
list-style: upper-alpha }
ol.lowerroman {
list-style: lower-roman }
ol.upperroman {
list-style: upper-roman }
p.attribution {
text-align: right ;
margin-left: 50% }
p.caption {
font-style: italic }
p.credits {
font-style: italic ;
font-size: smaller }
p.label {
white-space: nowrap }
p.rubric {
font-weight: bold ;
font-size: larger ;
color: maroon ;
text-align: center }
p.sidebar-title {
font-family: sans-serif ;
font-weight: bold ;
font-size: larger }
p.sidebar-subtitle {
font-family: sans-serif ;
font-weight: bold }
p.topic-title {
font-weight: bold }
pre.address {
margin-bottom: 0 ;
margin-top: 0 ;
font: inherit }
pre.literal-block, pre.doctest-block {
margin-left: 2em ;
margin-right: 2em }
span.classifier {
font-family: sans-serif ;
font-style: oblique }
span.classifier-delimiter {
font-family: sans-serif ;
font-weight: bold }
span.interpreted {
font-family: sans-serif }
span.option {
white-space: nowrap }
span.pre {
white-space: pre }
span.problematic {
color: red }
span.section-subtitle {
/* font-size relative to parent (h1..h6 element) */
font-size: 80% }
table.citation {
border-left: solid 1px gray;
margin-left: 1px }
table.docinfo {
margin: 2em 4em }
table.docutils {
margin-top: 0.5em ;
margin-bottom: 0.5em }
table.footnote {
border-left: solid 1px black;
margin-left: 1px }
table.docutils td, table.docutils th,
table.docinfo td, table.docinfo th {
padding-left: 0.5em ;
padding-right: 0.5em ;
vertical-align: top }
table.docutils th.field-name, table.docinfo th.docinfo-name {
font-weight: bold ;
text-align: left ;
white-space: nowrap ;
padding-left: 0 }
h1 tt.docutils, h2 tt.docutils, h3 tt.docutils,
h4 tt.docutils, h5 tt.docutils, h6 tt.docutils {
font-size: 100% }
ul.auto-toc {
list-style-type: none }
</style>
<style type="text/css">
/* Style sheet for LAST HTML documents */
h1 { color: navy }
h2 { color: teal }
div.document { margin-left: auto; margin-right: auto; max-width: 45em }
strong { color: red }
.option-list td { padding-bottom: 1em }
table.field-list { border: thin solid green }
</style>
</head>
<body>
<div class="document" id="maf-cut">
<h1 class="title">maf-cut</h1>
<p><tt class="docutils literal"><span class="pre">maf-cut</span></tt> cuts out parts of <a class="reference external" href="http://genome.ucsc.edu/FAQ/FAQformat.html#format5">MAF</a> format alignments. For example:</p>
<pre class="literal-block">
maf-cut chr7:1234000-1235000 alignments.maf
</pre>
<p>This gets alignment parts within the range 1234000-1235000 of the
sequence named &quot;chr7&quot;.</p>
<p>The ranges are zero-based. For example, <tt class="docutils literal"><span class="pre">chr7:0-1000</span></tt> means the
first 1000 bases of chr7.</p>
<p>You can omit the sequence name:</p>
<pre class="literal-block">
maf-cut 1234000-1235000 alignments.maf
</pre>
<p>Then, the range applies to the topmost sequence in each alignment.</p>
</div>
</body>
</html>
maf-cut
=======
``maf-cut`` cuts out parts of MAF_ format alignments. For example::
maf-cut chr7:1234000-1235000 alignments.maf
This gets alignment parts within the range 1234000-1235000 of the
sequence named "chr7".
The ranges are zero-based. For example, ``chr7:0-1000`` means the
first 1000 bases of chr7.
You can omit the sequence name::
maf-cut 1234000-1235000 alignments.maf
Then, the range applies to the topmost sequence in each alignment.
.. _MAF: http://genome.ucsc.edu/FAQ/FAQformat.html#format5
......@@ -285,7 +285,7 @@ def nameKey(oneSeqRanges):
return natural_sort_key(oneSeqRanges[0][0])
def sizeKey(oneSeqRanges):
return sum(b - e for n, b, e in oneSeqRanges), nameKey(oneSeqRanges)
return sum(b - e for n, b, e, s in oneSeqRanges), nameKey(oneSeqRanges)
def alignmentKey(seqNamesToLists, oneSeqRanges):
seqName = oneSeqRanges[0][0]
......@@ -479,7 +479,7 @@ def strandAndOrigin(ranges, beg, size):
if isReverseStrand:
beg = -(beg + size)
for rangeBeg, rangeEnd, isReverseRange, origin in ranges:
if rangeEnd > beg:
if rangeEnd > beg: # assumes the ranges are sorted
return (isReverseStrand != isReverseRange), origin
def alignmentPixels(width, height, alignments, bp_per_pix,
......@@ -698,7 +698,7 @@ def rangesWithOrigins(sortedRanges, rangePixBegs, rangePixLens, bpPerPix):
def rangesPerSeq(sortedRanges, rangePixBegs, rangePixLens, bpPerPix):
a = rangesWithOrigins(sortedRanges, rangePixBegs, rangePixLens, bpPerPix)
for k, v in itertools.groupby(a, itemgetter(0)):
yield k, [i[1] for i in v]
yield k, sorted(i[1] for i in v)
def getFont(opts):
if opts.fontfile:
......
#! /usr/bin/env python
from __future__ import print_function
import gzip
import itertools
import optparse
import signal
import sys
def myOpen(fileName): # faster than fileinput
if fileName == "-":
return sys.stdin
if fileName.endswith(".gz"):
return gzip.open(fileName)
return open(fileName)
def alnBegFromSeqBeg(gappedSequence, seqBeg):
for i, x in enumerate(gappedSequence):
if x != "-":
if seqBeg == 0:
return i
seqBeg -= 1
def alnEndFromSeqEnd(gappedSequence, seqEnd):
for i, x in enumerate(gappedSequence):
if x != "-":
seqEnd -= 1
if seqEnd == 0:
return i + 1
def alignmentRange(cutBeg, cutEnd, sLineFields):
beg = int(sLineFields[2])
if beg >= cutEnd:
return 0, 0
sequenceWithGaps = sLineFields[6]
span = len(sequenceWithGaps) - sequenceWithGaps.count("-")
end = beg + span
if end <= cutBeg:
return 0, 0
seqBeg = max(cutBeg - beg, 0)
alnBeg = alnBegFromSeqBeg(sequenceWithGaps, seqBeg)
seqEnd = min(cutEnd - beg, span)
alnEnd = alnEndFromSeqEnd(sequenceWithGaps, seqEnd)
return alnBeg, alnEnd
def findTheSpecifiedSequence(seqName, mafLines):
for line in mafLines:
if line[0] == "s":
fields = line.split()
if seqName is None or fields[1] == seqName:
return fields
return None
def cutMafRecords(mafLines, alnBeg, alnEnd):
for line in mafLines:
fields = line.split()
if line[0] == "s":
oldSeq = fields[6]
newSeq = oldSeq[alnBeg:alnEnd]
beg = int(fields[2]) + alnBeg - oldSeq[:alnBeg].count("-")
span = len(newSeq) - newSeq.count("-")
yield fields[:2] + [str(beg), str(span)] + fields[4:6] + [newSeq]
elif line[0] == "q":
yield fields[:2] + [fields[2][alnBeg:alnEnd]]
elif line[0] == "p":
yield fields[:1] + [fields[1][alnBeg:alnEnd]]
else:
yield fields
def mafFieldWidths(mafRecords):
sRecords = (i for i in mafRecords if i[0] == "s")
sColumns = zip(*sRecords)
for i in sColumns:
yield max(map(len, i))
def printMafLine(fieldWidths, fields):
formatParams = itertools.chain.from_iterable(zip(fieldWidths, fields))
print("%*s %-*s %*s %*s %*s %*s %*s" % tuple(formatParams))
def cutOneMaf(cutRange, mafLines):
seqName, cutBeg, cutEnd = cutRange
sLineFields = findTheSpecifiedSequence(seqName, mafLines)
if not sLineFields:
return
alnBeg, alnEnd = alignmentRange(cutBeg, cutEnd, sLineFields)
if alnBeg >= alnEnd:
return
mafRecords = list(cutMafRecords(mafLines, alnBeg, alnEnd))
fieldWidths = list(mafFieldWidths(mafRecords))
for fields in mafRecords:
if fields[0] == "s":
printMafLine(fieldWidths, fields)
elif fields[0] == "q":
printMafLine(fieldWidths, fields[:2] + [""] * 4 + fields[2:])
elif fields[0] == "p":
printMafLine(fieldWidths, fields[:1] + [""] * 5 + fields[1:])
else:
print(" ".join(fields))
print()
def mafCutOneFile(cutRange, lines):
mafLines = []
for line in lines:
if line.isspace():
cutOneMaf(cutRange, mafLines)
mafLines = []
elif line[0] != "#":
mafLines.append(line)
cutOneMaf(cutRange, mafLines)
def wantedRange(cutSpecification):
seqName = None
if ":" in cutSpecification:
seqName, cutSpecification = cutSpecification.rsplit(":", 1)
beg, end = cutSpecification.rsplit("-", 1)
return seqName, int(beg), int(end)
def mafCut(opts, args):
cutRange = wantedRange(args[0])
if len(args) > 1:
for i in args[1:]:
mafCutOneFile(cutRange, myOpen(i))
else:
mafCutOneFile(cutRange, sys.stdin)
if __name__ == "__main__":
signal.signal(signal.SIGPIPE, signal.SIG_DFL) # avoid silly error message
usage = "%prog chrN:start-end alignments.maf"
description = "Get parts of MAF-format alignments."
op = optparse.OptionParser(usage=usage, description=description)
opts, args = op.parse_args()
if not args:
op.error("please give me a cut specification and MAF alignments")
mafCut(opts, args)
......@@ -96,6 +96,11 @@ void Alphabet::initCaseConversions( unsigned codeEnd ) {
numbersToLowercase[i] = i;
}
if (letters == dna) { // kludge for RNA:
encode['U'] = encode['T'];
encode['u'] = encode['t'];
}
for( unsigned i = 0; i < capacity; ++i ){
lettersToUppercase[i] = numbersToUppercase[ encode[i] ];
}
......
......@@ -29,8 +29,8 @@ void QualityPssmMaker::init(const ScoreMatrixRow *scoreMatrix,
this->isMatchMismatchMatrix = isMatchMismatchMatrix;
this->toUnmasked = toUnmasked;
double matchExp = std::exp(lambda * matchScore);
double mismatchExp = std::exp(lambda * mismatchScore);
double matchExp = probFromScore(lambda, matchScore);
double mismatchExp = probFromScore(lambda, mismatchScore);
for (int q = 0; q < qualityCapacity; ++q) {
double e = errorProbFromQual(q, qualityOffset, false);
......@@ -40,12 +40,12 @@ void QualityPssmMaker::init(const ScoreMatrixRow *scoreMatrix,
double expScore = p * matchExp + e * mismatchExp;
assert(p > 0);
assert(expScore > 0);
qualityToMatchScore[q] = nearestInt(std::log(expScore) / lambda);
qualityToMatchScore[q] = scoreFromProb(lambda, expScore);
}
for (int i = 0; i < scoreMatrixRowSize; ++i)
for (int j = 0; j < scoreMatrixRowSize; ++j)
expMatrix[i][j] = std::exp(lambda * scoreMatrix[i][j]); // can be 0
expMatrix[i][j] = probFromScore(lambda, scoreMatrix[i][j]); // can be 0
}
// This function is a wee bit slow, even if isMatchMismatchMatrix and
......@@ -88,7 +88,7 @@ void QualityPssmMaker::make(const uchar *sequenceBeg,
double expScore = std::inner_product(letterProbs, letterProbsEnd,
expMatrix[unmasked1], 0.0);
assert(expScore > 0);
score = nearestInt(std::log(expScore) / lambda);
score = scoreFromProb(lambda, expScore);
}
}
......
......@@ -2,6 +2,7 @@
#include "ScoreMatrix.hh"
#include "ScoreMatrixData.hh"
#include "qualityScoreUtil.hh"
#include "zio.hh"
#include <sstream>
#include <iomanip>
......@@ -9,17 +10,24 @@
#include <stdexcept>
#include <cassert>
#include <cctype> // toupper, tolower
#include <cstddef> // size_t
#include <stddef.h> // size_t
//#include <iostream> // for debugging
#define ERR(x) throw std::runtime_error(x)
#define COUNTOF(a) (sizeof (a) / sizeof *(a))
static void makeUppercase(std::string& s) {
for (size_t i = 0; i < s.size(); ++i) {
unsigned char c = s[i];
s[i] = std::toupper(c);
}
}
namespace cbrc{
const char *ScoreMatrix::canonicalName( const std::string& name ){
for( std::size_t i = 0; i < COUNTOF(scoreMatrixNicknames); ++i )
for( size_t i = 0; i < COUNTOF(scoreMatrixNicknames); ++i )
if( name == scoreMatrixNicknames[i].nickname )
return scoreMatrixNicknames[i].realname;
return name.c_str();
......@@ -28,24 +36,24 @@ const char *ScoreMatrix::canonicalName( const std::string& name ){
std::string ScoreMatrix::stringFromName( const std::string& name ){
std::string n = canonicalName( name );
for( std::size_t i = 0; i < COUNTOF(scoreMatrices); ++i )
for( size_t i = 0; i < COUNTOF(scoreMatrices); ++i )
if( n == scoreMatrices[i].name )
return scoreMatrices[i].text;
return slurp( n.c_str() );
}
void ScoreMatrix::matchMismatch( int match, int mismatch,
const std::string& letters ){
rows = letters;
cols = letters;
std::size_t size = letters.size();
void ScoreMatrix::setMatchMismatch(int matchScore, int mismatchCost,
const std::string& symbols) {
rowSymbols.assign(symbols.begin(), symbols.end());
colSymbols.assign(symbols.begin(), symbols.end());
size_t size = symbols.size();
cells.resize(size);
for( std::size_t i = 0; i < size; ++i ){
cells[i].assign( size, -mismatch );
cells[i][i] = match;
for (size_t i = 0; i < size; ++i) {
cells[i].assign(size, -mismatchCost);
cells[i][i] = matchScore;
}
}
......@@ -55,20 +63,31 @@ void ScoreMatrix::fromString( const std::string& matString ){
if( !iss ) ERR( "can't read the score matrix" );
}
void ScoreMatrix::init( const uchar encode[] ){
assert( !rows.empty() && !cols.empty() );
static unsigned s2i(const uchar symbolToIndex[], uchar c) {
return symbolToIndex[c];
}
static void upperAndLowerIndex(unsigned tooBig, const uchar symbolToIndex[],
char symbol, unsigned& upper, unsigned& lower) {
uchar s = symbol;
upper = symbolToIndex[s];
lower = symbolToIndex[std::tolower(s)];
if (upper >= tooBig || lower >= tooBig) {
ERR(std::string("bad letter in score matrix: ") + symbol);
}
}
for( std::string::iterator i = rows.begin(); i < rows.end(); ++i )
*i = std::toupper( *i );
void ScoreMatrix::init(const uchar symbolToIndex[]) {
assert( !rowSymbols.empty() && !colSymbols.empty() );
for( std::string::iterator i = cols.begin(); i < cols.end(); ++i )
*i = std::toupper( *i );
makeUppercase(rowSymbols);
makeUppercase(colSymbols);
minScore = cells[0][0];
maxScore = cells[0][0];
for( std::size_t i = 0; i < rows.size(); ++i ){
for( std::size_t j = 0; j < cols.size(); ++j ){
for( size_t i = 0; i < rowSymbols.size(); ++i ){
for( size_t j = 0; j < colSymbols.size(); ++j ){
minScore = std::min( minScore, cells[i][j] );
maxScore = std::max( maxScore, cells[i][j] );
}
......@@ -82,30 +101,25 @@ void ScoreMatrix::init( const uchar encode[] ){
}
}
for( std::size_t i = 0; i < rows.size(); ++i ){
for( std::size_t j = 0; j < cols.size(); ++j ){
uchar x = encode[ uchar(rows[i]) ];
uchar y = encode[ uchar(cols[j]) ];
uchar a = encode[ std::tolower( rows[i] ) ];
uchar b = encode[ std::tolower( cols[j] ) ];
if( a >= MAT )
ERR( std::string("bad letter in score matrix: ") + rows[i] );
if( b >= MAT )
ERR( std::string("bad letter in score matrix: ") + cols[j] );
caseSensitive[x][b] = std::min( cells[i][j], 0 );
caseSensitive[a][y] = std::min( cells[i][j], 0 );
caseSensitive[a][b] = std::min( cells[i][j], 0 );
caseSensitive[x][y] = cells[i][j]; // careful: maybe a==x or b==y
caseInsensitive[x][y] = cells[i][j];
caseInsensitive[x][b] = cells[i][j];
caseInsensitive[a][y] = cells[i][j];
caseInsensitive[a][b] = cells[i][j];
for( size_t i = 0; i < rowSymbols.size(); ++i ){
for( size_t j = 0; j < colSymbols.size(); ++j ){
unsigned iu, il, ju, jl;
upperAndLowerIndex(MAT, symbolToIndex, rowSymbols[i], iu, il);
upperAndLowerIndex(MAT, symbolToIndex, colSymbols[j], ju, jl);
caseSensitive[iu][jl] = std::min( cells[i][j], 0 );
caseSensitive[il][ju] = std::min( cells[i][j], 0 );
caseSensitive[il][jl] = std::min( cells[i][j], 0 );
caseSensitive[iu][ju] = cells[i][j]; // careful: maybe il==iu or jl==ju
caseInsensitive[iu][ju] = cells[i][j];
caseInsensitive[iu][jl] = cells[i][j];
caseInsensitive[il][ju] = cells[i][j];
caseInsensitive[il][jl] = cells[i][j];
}
}
// set a hugely negative score for the delimiter symbol:
uchar delimiter = ' ';
uchar z = encode[delimiter];
uchar z = symbolToIndex[delimiter];
assert( z < MAT );
for( unsigned i = 0; i < MAT; ++i ){
caseSensitive[z][i] = -INF;
......@@ -116,24 +130,26 @@ void ScoreMatrix::init( const uchar encode[] ){
}
void ScoreMatrix::writeCommented( std::ostream& stream ) const{
int colWidth = colSymbols.size() < 20 ? 3 : 2;
stream << "# " << ' ';
for( std::size_t i = 0; i < cols.size(); ++i ){
stream << ' ' << std::setw(OUTPAD) << cols[i];
for( size_t i = 0; i < colSymbols.size(); ++i ){
stream << ' ' << std::setw(colWidth) << colSymbols[i];
}
stream << '\n';
for( std::size_t i = 0; i < rows.size(); ++i ){
stream << "# " << rows[i];
for( std::size_t j = 0; j < cols.size(); ++j ){
stream << ' ' << std::setw(OUTPAD) << cells[i][j];
for( size_t i = 0; i < rowSymbols.size(); ++i ){
stream << "# " << rowSymbols[i];
for( size_t j = 0; j < colSymbols.size(); ++j ){
stream << ' ' << std::setw(colWidth) << cells[i][j];
}
stream << '\n';
}
}
std::istream& operator>>( std::istream& stream, ScoreMatrix& m ){
std::string tmpRows;
std::string tmpCols;
std::string tmpRowSymbols;
std::string tmpColSymbols;
std::vector< std::vector<int> > tmpCells;
std::string line;
int state = 0;
......@@ -145,46 +161,131 @@ std::istream& operator>>( std::istream& stream, ScoreMatrix& m ){
if( state == 0 ){
if( c == '#' ) continue; // skip comment lines at the top
do{
tmpCols.push_back(c);
tmpColSymbols.push_back(c);
}while( iss >> c );
state = 1;
}
else{
tmpRows.push_back(c);
tmpRowSymbols.push_back(c);
tmpCells.resize( tmpCells.size() + 1 );
int score;
while( iss >> score ){
tmpCells.back().push_back(score);
}
if( tmpCells.back().size() != tmpCols.size() ) ERR( "bad score matrix" );
if (tmpCells.back().size() != tmpColSymbols.size()) {
ERR("bad score matrix");
}
}
}
if( stream.eof() && !stream.bad() && !tmpRows.empty() ){
if( stream.eof() && !stream.bad() && !tmpRowSymbols.empty() ){
stream.clear();
m.rows.swap(tmpRows);
m.cols.swap(tmpCols);
m.rowSymbols.swap(tmpRowSymbols);
m.colSymbols.swap(tmpColSymbols);
m.cells.swap(tmpCells);
}
return stream;
}
std::ostream& operator<<( std::ostream& stream, const ScoreMatrix& m ){
for( std::size_t i = 0; i < m.cols.size(); ++i ){
stream << ' ' << std::setw(m.OUTPAD) << m.cols[i];
const char *ambiguities[] = {
"M" "AC",
"S" "CG",
"K" "GT",
"W" "TA",
"R" "AG",
"Y" "CT",
"B" "CGT",
"D" "AGT",
"H" "ACT",
"V" "ACG"
};
const size_t numOfAmbiguousSymbols = COUNTOF(ambiguities);
static bool isIn(const std::string& s, char x) {
return find(s.begin(), s.end(), x) != s.end();
}
stream << '\n';
for( std::size_t i = 0; i < m.rows.size(); ++i ){
stream << m.rows[i];
for( std::size_t j = 0; j < m.cols.size(); ++j ){
stream << ' ' << std::setw(m.OUTPAD) << m.cells[i][j];
static const char *ambiguityList(char symbol, char scratch[]) {
for (size_t i = 0; i < numOfAmbiguousSymbols; ++i) {
if (ambiguities[i][0] == symbol) return ambiguities[i] + 1;
}
stream << '\n';
scratch[0] = symbol;
return scratch;
}
return stream;
static double symbolProbSum(const uchar symbolToIndex[],
const char *symbols, const double probs[]) {
if (symbols[1] == 0) return 1;
double p = 0;
for (size_t i = 0; symbols[i]; ++i) {
unsigned x = s2i(symbolToIndex, symbols[i]);
p += probs[x];
}
return p;
}
static int jointScore(const uchar symbolToIndex[], int **fastMatrix,
double scale,
const double rowSymbolProbs[],
const double colSymbolProbs[],
const char *rSymbols, const char *cSymbols) {
bool isOneRowSymbol = (rSymbols[1] == 0);
bool isOneColSymbol = (cSymbols[1] == 0);
double p = 0;
for (size_t i = 0; rSymbols[i]; ++i) {
for (size_t j = 0; cSymbols[j]; ++j) {
unsigned x = s2i(symbolToIndex, rSymbols[i]);
unsigned y = s2i(symbolToIndex, cSymbols[j]);
double r = isOneRowSymbol ? 1 : rowSymbolProbs[x];
double c = isOneColSymbol ? 1 : colSymbolProbs[y];
p += r * c * probFromScore(scale, fastMatrix[x][y]);
}
}
double rowProbSum = symbolProbSum(symbolToIndex, rSymbols, rowSymbolProbs);
double colProbSum = symbolProbSum(symbolToIndex, cSymbols, colSymbolProbs);
return scoreFromProb(scale, p / (rowProbSum * colProbSum));
}
void ScoreMatrix::addAmbiguousScores(const uchar symbolToIndex[],
double scale,
const double rowSymbolProbs[],
const double colSymbolProbs[]) {
int *fastMatrix[MAT];
std::copy(caseInsensitive, caseInsensitive + MAT, fastMatrix);
char scratch[2] = {0};
for (size_t k = 0; k < numOfAmbiguousSymbols; ++k) {
char ambiguousSymbol = ambiguities[k][0];
if (isIn(colSymbols, ambiguousSymbol)) continue;
colSymbols.push_back(ambiguousSymbol);
for (size_t i = 0; i < rowSymbols.size(); ++i) {
const char *rSymbols = ambiguityList(rowSymbols[i], scratch);
const char *cSymbols = ambiguities[k] + 1;
int s = jointScore(symbolToIndex, fastMatrix, scale,
rowSymbolProbs, colSymbolProbs, rSymbols, cSymbols);
cells[i].push_back(s);
}
}
for (size_t k = 0; k < numOfAmbiguousSymbols; ++k) {
char ambiguousSymbol = ambiguities[k][0];
if (isIn(rowSymbols, ambiguousSymbol)) continue;
rowSymbols.push_back(ambiguousSymbol);
cells.resize(cells.size() + 1);
for (size_t j = 0; j < colSymbols.size(); ++j) {
const char *rSymbols = ambiguities[k] + 1;
const char *cSymbols = ambiguityList(colSymbols[j], scratch);
int s = jointScore(symbolToIndex, fastMatrix, scale,
rowSymbolProbs, colSymbolProbs, rSymbols, cSymbols);
cells.back().push_back(s);
}
}
}
} // end namespace cbrc
}
......@@ -15,23 +15,33 @@
namespace cbrc{
struct ScoreMatrix{
typedef unsigned char uchar;
struct ScoreMatrix{
enum { INF = INT_MAX / 2 }; // big, but try to avoid overflow
enum { MAT = 64 }; // matrix size = MAT x MAT
enum { OUTPAD = 2 }; // cell-padding for output
static const char *canonicalName( const std::string& name );
static std::string stringFromName( const std::string& name );
void matchMismatch( int match, int mismatch, const std::string& letters );
void setMatchMismatch(int matchScore, // usually > 0
int mismatchCost, // usually > 0
const std::string& symbols); // case is preserved
void fromString( const std::string& s );
void init( const uchar encode[] ); // unspecified letters get minScore
void init(const uchar symbolToIndex[]); // unspecified letters get minScore
// add scores for e.g. "W" meaning A or T
void addAmbiguousScores(const uchar symbolToIndex[],
double scale, // "lambda" for getting probabilities
const double rowSymbolProbs[],
const double colSymbolProbs[]);
void writeCommented( std::ostream& stream ) const; // write preceded by "#"
std::string rows; // row headings (letters)
std::string cols; // column headings (letters)
std::string rowSymbols; // row headings (letters)
std::string colSymbols; // column headings (letters)
std::vector< std::vector<int> > cells; // scores
int caseSensitive[MAT][MAT];
int caseInsensitive[MAT][MAT];
......@@ -42,5 +52,6 @@ struct ScoreMatrix{
std::istream& operator>>( std::istream& stream, ScoreMatrix& mat );
std::ostream& operator<<( std::ostream& stream, const ScoreMatrix& mat );
} // end namespace cbrc
#endif // SCOREMATRIX_HH
}
#endif
......@@ -32,6 +32,8 @@ class SubsetSuffixArray{
public:
typedef LAST_INT_TYPE indexT;
struct Range {indexT* beg; indexT* end; indexT depth;};
CyclicSubsetSeed& getSeed() { return seed; }
const CyclicSubsetSeed& getSeed() const { return seed; }
......@@ -44,8 +46,8 @@ public:
size_t step, size_t minimizerWindow );
// Sort the suffix array (but don't make the buckets).
void sortIndex( const uchar* text,
size_t maxUnsortedInterval, int childTableType );
void sortIndex( const uchar* text, size_t maxUnsortedInterval,
int childTableType, size_t numOfThreads );
// Make the buckets. If bucketDepth+1 == 0, then a default
// bucketDepth is used. The default is: the maximum possible
......@@ -121,17 +123,26 @@ private:
void sort2( const uchar* text, indexT* beg, const uchar* subsetMap );
void radixSort1( const uchar* text, const uchar* subsetMap,
void radixSort1( std::vector<Range>& rangeStack,
const uchar* text, const uchar* subsetMap,
indexT* beg, indexT* end, indexT depth );
void radixSort2( const uchar* text, const uchar* subsetMap,
void radixSort2( std::vector<Range>& rangeStack,
const uchar* text, const uchar* subsetMap,
indexT* beg, indexT* end, indexT depth );
void radixSort3( const uchar* text, const uchar* subsetMap,
void radixSort3( std::vector<Range>& rangeStack,
const uchar* text, const uchar* subsetMap,
indexT* beg, indexT* end, indexT depth );
void radixSort4( const uchar* text, const uchar* subsetMap,
void radixSort4( std::vector<Range>& rangeStack,
const uchar* text, const uchar* subsetMap,
indexT* beg, indexT* end, indexT depth );
void radixSortN( const uchar* text, const uchar* subsetMap,
void radixSortN( std::vector<Range>& rangeStack,
const uchar* text, const uchar* subsetMap,
indexT* beg, indexT* end, indexT depth,
unsigned subsetCount );
unsigned subsetCount, indexT* bucketSize );
void sortRanges( std::vector<Range>* stacks, indexT* bucketSizes,
const uchar* text, size_t maxUnsortedInterval,
size_t numOfThreads );
// Same as the 1st equalRange, but uses more info and may be faster:
void equalRange( indexT& beg, indexT& end, const uchar* textBase,
......
......@@ -7,17 +7,24 @@
#include <algorithm> // iter_swap, min
//#include <iostream> // for debugging
#ifdef HAS_CXX_THREADS
#include <thread>
#endif
using namespace cbrc;
namespace{
typedef SubsetSuffixArray::indexT indexT;
struct Stack{ indexT* beg; indexT* end; indexT depth; };
Stack stack[1048576]; // big enough???
Stack* sp = stack;
typedef SubsetSuffixArray::Range Range;
}
#define PUSH(b, e, d) if( e-b > 1 ) sp->beg = b, sp->end = e, (sp++)->depth = d
#define POP(b, e, d) b = (--sp)->beg, e = sp->end, d = sp->depth
static void pushRange(std::vector<Range> &v,
indexT *beg, indexT *end, indexT depth) {
if (end - beg > 1) {
Range r = {beg, end, depth};
v.push_back(r);
}
}
static void insertionSort( const uchar* text, const CyclicSubsetSeed& seed,
indexT* beg, indexT* end, const uchar* subsetMap ){
......@@ -60,7 +67,8 @@ void SubsetSuffixArray::sort2( const uchar* text, indexT* beg,
// Specialized sort for 1 symbol + 1 delimiter.
// E.g. wildcard positions in spaced seeds.
void SubsetSuffixArray::radixSort1( const uchar* text, const uchar* subsetMap,
void SubsetSuffixArray::radixSort1( std::vector<Range>& rangeStack,
const uchar* text, const uchar* subsetMap,
indexT* beg, indexT* end, indexT depth ){
indexT* end0 = beg; // end of '0's
indexT* begN = end; // beginning of delimiters
......@@ -78,7 +86,7 @@ void SubsetSuffixArray::radixSort1( const uchar* text, const uchar* subsetMap,
}
}
PUSH( beg, end0, depth ); // the '0's
pushRange( rangeStack, beg, end0, depth ); // the '0's
if( isChildDirectionForward( beg ) ){
if( end0 == end ) return;
......@@ -91,7 +99,8 @@ void SubsetSuffixArray::radixSort1( const uchar* text, const uchar* subsetMap,
// Specialized sort for 2 symbols + 1 delimiter.
// E.g. transition-constrained positions in subset seeds.
void SubsetSuffixArray::radixSort2( const uchar* text, const uchar* subsetMap,
void SubsetSuffixArray::radixSort2( std::vector<Range>& rangeStack,
const uchar* text, const uchar* subsetMap,
indexT* beg, indexT* end, indexT depth ){
indexT* end0 = beg; // end of '0's
indexT* end1 = beg; // end of '1's
......@@ -114,8 +123,8 @@ void SubsetSuffixArray::radixSort2( const uchar* text, const uchar* subsetMap,
}
}
PUSH( beg, end0, depth ); // the '0's
PUSH( end0, end1, depth ); // the '1's
pushRange( rangeStack, beg, end0, depth ); // the '0's
pushRange( rangeStack, end0, end1, depth ); // the '1's
if( isChildDirectionForward( beg ) ){
if( end0 == end ) return;
......@@ -132,7 +141,8 @@ void SubsetSuffixArray::radixSort2( const uchar* text, const uchar* subsetMap,
// Specialized sort for 3 symbols + 1 delimiter.
// E.g. subset seeds for bisulfite-converted DNA.
void SubsetSuffixArray::radixSort3( const uchar* text, const uchar* subsetMap,
void SubsetSuffixArray::radixSort3( std::vector<Range>& rangeStack,
const uchar* text, const uchar* subsetMap,
indexT* beg, indexT* end, indexT depth ){
indexT* end0 = beg; // end of '0's
indexT* end1 = beg; // end of '1's
......@@ -161,9 +171,9 @@ void SubsetSuffixArray::radixSort3( const uchar* text, const uchar* subsetMap,
}
}
PUSH( beg, end0, depth ); // the '0's
PUSH( end0, end1, depth ); // the '1's
PUSH( beg2, begN, depth ); // the '2's
pushRange( rangeStack, beg, end0, depth ); // the '0's
pushRange( rangeStack, end0, end1, depth ); // the '1's
pushRange( rangeStack, beg2, begN, depth ); // the '2's
if( isChildDirectionForward( beg ) ){
if( end0 == end ) return;
......@@ -183,7 +193,8 @@ void SubsetSuffixArray::radixSort3( const uchar* text, const uchar* subsetMap,
}
// Specialized sort for 4 symbols + 1 delimiter. E.g. DNA.
void SubsetSuffixArray::radixSort4( const uchar* text, const uchar* subsetMap,
void SubsetSuffixArray::radixSort4( std::vector<Range>& rangeStack,
const uchar* text, const uchar* subsetMap,
indexT* beg, indexT* end, indexT depth ){
indexT* end0 = beg; // end of '0's
indexT* end1 = beg; // end of '1's
......@@ -218,10 +229,10 @@ void SubsetSuffixArray::radixSort4( const uchar* text, const uchar* subsetMap,
}
}
PUSH( beg, end0, depth ); // the '0's
PUSH( end0, end1, depth ); // the '1's
PUSH( end1, end2, depth ); // the '2's
PUSH( beg3, begN, depth ); // the '3's
pushRange( rangeStack, beg, end0, depth ); // the '0's
pushRange( rangeStack, end0, end1, depth ); // the '1's
pushRange( rangeStack, end1, end2, depth ); // the '2's
pushRange( rangeStack, beg3, begN, depth ); // the '3's
if( isChildDirectionForward( beg ) ){
if( end0 == end ) return;
......@@ -244,11 +255,13 @@ void SubsetSuffixArray::radixSort4( const uchar* text, const uchar* subsetMap,
}
}
void SubsetSuffixArray::radixSortN( const uchar* text, const uchar* subsetMap,
const unsigned numOfBuckets = 256;
void SubsetSuffixArray::radixSortN( std::vector<Range>& rangeStack,
const uchar* text, const uchar* subsetMap,
indexT* beg, indexT* end, indexT depth,
unsigned subsetCount ){
static indexT bucketSize[256]; // initialized to zero at startup
/* */ indexT* bucketEnd[256]; // "static" makes little difference to speed
unsigned subsetCount, indexT* bucketSize ){
indexT* bucketEnd[numOfBuckets];
// get bucket sizes (i.e. letter counts):
// The intermediate oracle array makes it faster (see "Engineering
......@@ -268,7 +281,7 @@ void SubsetSuffixArray::radixSortN( const uchar* text, const uchar* subsetMap,
indexT* pos = beg;
for( unsigned i = 0; i < subsetCount; ++i ){
indexT* nextPos = pos + bucketSize[i];
PUSH( pos, nextPos, depth );
pushRange( rangeStack, pos, nextPos, depth );
pos = nextPos;
bucketEnd[i] = pos;
}
......@@ -306,31 +319,91 @@ void SubsetSuffixArray::radixSortN( const uchar* text, const uchar* subsetMap,
}
}
void SubsetSuffixArray::sortIndex( const uchar* text,
size_t maxUnsortedInterval,
int childTableType ){
const indexT minLength = 1;
static size_t rangeSize(const Range &r) {
return r.end - r.beg;
}
if( childTableType == 1 ) chibiTable.v.assign( suffixArray.v.size(), -1 );
if( childTableType == 2 ) kiddyTable.v.assign( suffixArray.v.size(), -1 );
if( childTableType == 3 ) childTable.v.assign( suffixArray.v.size(), 0 );
static size_t nextRangeSize(const std::vector<Range> &ranges) {
return rangeSize(ranges.back());
}
PUSH( &suffixArray.v.front(), &suffixArray.v.back() + 1, 0 );
setChildReverse( &suffixArray.v.back() + 1, &suffixArray.v.front() );
static size_t rangeSizeSum(const std::vector<Range> &ranges) {
size_t s = 0;
for (size_t i = 0; i < ranges.size(); ++i) {
s += rangeSize(ranges[i]);
}
return s;
}
while( sp > stack ){
indexT* beg;
indexT* end;
indexT depth;
POP( beg, end, depth );
static size_t numOfThreadsForOneRange(size_t numOfThreads,
size_t sizeOfThisRange,
size_t sizeOfAllRanges) {
// We want:
// min(t | sizeOfThisRange / t < sizeOfOtherRanges / (numOfThreads - (t+1)))
// Or equivalently:
// max(t | sizeOfThisRange / (t-1) >= sizeOfOtherRanges / (numOfThreads - t))
double x = numOfThreads - 1; // use double to avoid overflow
return (x * sizeOfThisRange + sizeOfAllRanges) / sizeOfAllRanges;
}
void SubsetSuffixArray::sortRanges(std::vector<Range> *stacks,
indexT *bucketSizes,
const uchar *text,
size_t maxUnsortedInterval,
size_t numOfThreads) {
std::vector<Range> &myStack = stacks[0];
while( !myStack.empty() ){
#ifdef HAS_CXX_THREADS
size_t numOfChunks = std::min(numOfThreads, myStack.size());
if (numOfChunks > 1) {
size_t totalSize = rangeSizeSum(myStack);
size_t numOfNewThreads = numOfChunks - 1;
std::vector<std::thread> threads(numOfNewThreads);
for (size_t i = 0; i < numOfNewThreads; ++i) {
size_t thisSize = nextRangeSize(myStack);
size_t t = numOfThreadsForOneRange(numOfThreads, thisSize, totalSize);
size_t maxThreads = numOfThreads - (numOfNewThreads - i);
size_t thisThreads = std::min(t, maxThreads);
numOfThreads -= thisThreads;
do {
totalSize -= nextRangeSize(myStack);
stacks[numOfThreads].push_back(myStack.back());
myStack.pop_back();
thisSize += nextRangeSize(myStack);
} while (myStack.size() > numOfThreads &&
thisSize <= totalSize / numOfThreads);
// We want:
// max(r | sizeSum(r) <= (totalSize - sizeSum(r-1)) / newNumOfThreads)
threads[i] = std::thread(&SubsetSuffixArray::sortRanges, this,
stacks + numOfThreads,
bucketSizes + numOfThreads * numOfBuckets,
text, maxUnsortedInterval, thisThreads);
}
sortRanges(stacks, bucketSizes, text, maxUnsortedInterval, numOfThreads);
for (size_t i = 0; i < numOfNewThreads; ++i) {
threads[i].join();
}
return;
}
#endif
indexT* beg = myStack.back().beg;
indexT* end = myStack.back().end;
indexT depth = myStack.back().depth;
myStack.pop_back();
size_t interval = end - beg;
const indexT minLength = 1;
if( interval <= maxUnsortedInterval && depth >= minLength ) continue;
const uchar* textBase = text + depth;
const uchar* subsetMap = seed.subsetMap(depth);
if( childTableType == 0 ){
if( childTable.v.empty() && kiddyTable.v.empty() && chibiTable.v.empty() ){
if( interval < 10 ){ // ???
insertionSort( textBase, seed, beg, end, subsetMap );
continue;
......@@ -347,11 +420,30 @@ void SubsetSuffixArray::sortIndex( const uchar* text,
++depth;
switch( subsetCount ){
case 1: radixSort1( textBase, subsetMap, beg, end, depth ); break;
case 2: radixSort2( textBase, subsetMap, beg, end, depth ); break;
case 3: radixSort3( textBase, subsetMap, beg, end, depth ); break;
case 4: radixSort4( textBase, subsetMap, beg, end, depth ); break;
default: radixSortN( textBase, subsetMap, beg, end, depth, subsetCount );
case 1: radixSort1(myStack, textBase, subsetMap, beg, end, depth); break;
case 2: radixSort2(myStack, textBase, subsetMap, beg, end, depth); break;
case 3: radixSort3(myStack, textBase, subsetMap, beg, end, depth); break;
case 4: radixSort4(myStack, textBase, subsetMap, beg, end, depth); break;
default: radixSortN(myStack, textBase, subsetMap, beg, end, depth,
subsetCount, bucketSizes);
}
}
}
void SubsetSuffixArray::sortIndex( const uchar* text,
size_t maxUnsortedInterval,
int childTableType,
size_t numOfThreads ){
if( childTableType == 1 ) chibiTable.v.assign( suffixArray.v.size(), -1 );
if( childTableType == 2 ) kiddyTable.v.assign( suffixArray.v.size(), -1 );
if( childTableType == 3 ) childTable.v.assign( suffixArray.v.size(), 0 );
std::vector< std::vector<Range> > stacks(numOfThreads);
std::vector<indexT> bucketSizes(numOfThreads * numOfBuckets);
pushRange( stacks[0], &suffixArray.v.front(), &suffixArray.v.back() + 1, 0 );
setChildReverse( &suffixArray.v.back() + 1, &suffixArray.v.front() );
sortRanges(&stacks[0], &bucketSizes[0], text, maxUnsortedInterval,
numOfThreads);
}
......@@ -37,7 +37,7 @@ T -5 -5 -5 2\n\
s.fromString(ScoreMatrix::stringFromName("BL62"));
} else {
alphabetSize = alphabet.size();
s.matchMismatch(1, 1, alphabet);
s.setMatchMismatch(1, 1, alphabet);
}
s.init(letterToIndex);
......
......@@ -463,8 +463,12 @@ double temperature_)
double GaplessPreliminaryTime=CurrentTimeGaplessPreliminary-CurrentTime1;
//the choice for the importance sampling
long int gapOpen=alp_data::Tmin(gapOpen1_,gapOpen2_);
//long int gapOpen=alp_data::Tmin(gapOpen1_,gapOpen2_);
//long int gapEpen=alp_data::Tmin(gapEpen1_,gapEpen2_);
long int gapEpen = alp_data::Tmin(gapEpen1_, gapEpen2_);
long int gapOpen = alp_data::Tmin(gapOpen1_ + gapEpen1_, gapOpen2_ + gapEpen2_) - gapEpen;
if(max_time_<=0)
{
......