Skip to content
Commits on Source (7)
2018-12-19 Martin C. Frith <Martin C. Frith>
* src/tantan.cc:
Make it faster
[3523060bcfb9] [tip]
* src/tantan_repeat_finder.cc, src/tantan_repeat_finder.hh:
Make -f4 a bit faster
[1fccdf6e21be]
* README.txt, src/mcf_tantan_options.cc, src/mcf_tantan_options.hh,
src/tantan_app.cc, src/tantan_repeat_finder.cc,
src/tantan_repeat_finder.hh, test/hard.fa, test/tantan_test.out,
test/tantan_test.sh:
Add option to find straightforward tandem repeats
[1bae60144712]
* README.txt, src/mcf_tantan_options.cc, src/tantan.cc,
src/tantan_app.cc:
Tweak the help message
[fc1ca32a72aa]
2018-12-10 Martin C. Frith <Martin C. Frith>
* README.txt, src/mcf_tantan_options.cc, src/mcf_tantan_options.hh,
src/tantan_app.cc, test/tantan_test.out, test/tantan_test.sh:
Add match score, mismatch cost options
[2383404c795a] [tip]
[2383404c795a]
* src/tantan.cc:
Refactor
......
......@@ -451,7 +451,7 @@ repeats, so it's easy to lift the masking after determining homology.</p>
<td>preserve uppercase/lowercase in non-masked regions</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-m</span></kbd></td>
<td>file for letter pair scores (scoring matrix)</td></tr>
<td>file for letter-pair score matrix</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-r</span></kbd></td>
<td>probability of a repeat starting per position</td></tr>
......@@ -480,9 +480,12 @@ repeats, so it's easy to lift the masking after determining homology.</p>
<kbd><span class="option">-s</span></kbd></td>
<td>minimum repeat probability for masking</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-n</span></kbd></td>
<td>minimum copy number, affects -f4 only</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-f</span></kbd></td>
<td>output type: 0=masked sequence, 1=repeat probabilities,
2=repeat counts, 3=BED</td></tr>
2=repeat counts, 3=BED, 4=tandem repeats</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-h</span>, <span class="option">--help</span></kbd></td>
<td>show help message, then exit</td></tr>
......@@ -512,6 +515,29 @@ TGCTAGCAA ttaggcttaggtcagtgc TTAGGCTTAGGTCAGTGC AACGTA
</pre>
<p>(My thanks to Junko Tsuji and Paul Horton for finding these issues.)</p>
</div>
<div class="section" id="finding-straightforward-tandem-repeats">
<h1>Finding straightforward tandem repeats</h1>
<p>Option -f4 runs tantan in a different mode, where it finds
straightforward tandem repeats only. (Technically, it uses a Viterbi
algorithm instead of a Forward-Backward algorithm.) This is <em>not</em>
recommended for avoiding false homologs! But it might be useful for
studying tandem repeats. The output looks like this:</p>
<pre class="literal-block">
mySeq 14765 14780 6 2.5 GTCATG GTCATG,GTCATG,GTC
mySeq 632362 632377 2 6 GC GC,GC,GC,GCt,GCT,GCT
mySeq 1278353 1278369 3 6.5 TCA TCA,TCA,TCA,TC-,TC,TC,T
mySeq 3616084 3616100 3 5.33333 TGG TGA,TGA,TGG,TGG,TGG,T
</pre>
<p>The first 3 columns show the start and end coordinates of the
repetitive region, in <a class="reference external" href="https://genome.ucsc.edu/FAQ/FAQformat.html#format1">BED</a> format. Column
4 shows the length of the repeating unit (which might vary due to
insertions and deletions, so this column shows the most common
length). Column 5 shows the number of repeat units. Column 6 shows
the repeating unit (which again might vary, so this is just a
representative). Column 7 shows the whole repeat: lowercase letters
are insertions relative to the previous repeat unit, and dashes are
deletions relative to the previous repeat unit.</p>
</div>
<div class="section" id="miscellaneous">
<h1>Miscellaneous</h1>
<p>tantan is distributed under the GNU General Public License, either
......
......@@ -136,7 +136,7 @@ Options
-p interpret the sequences as proteins
-x letter to use for masking, instead of lowercase
-c preserve uppercase/lowercase in non-masked regions
-m file for letter pair scores (scoring matrix)
-m file for letter-pair score matrix
-r probability of a repeat starting per position
-e probability of a repeat ending per position
-w maximum tandem repeat period to consider
......@@ -146,8 +146,9 @@ Options
-a gap existence cost
-b gap extension cost
-s minimum repeat probability for masking
-n minimum copy number, affects -f4 only
-f output type: 0=masked sequence, 1=repeat probabilities,
2=repeat counts, 3=BED
2=repeat counts, 3=BED, 4=tandem repeats
-h, --help show help message, then exit
--version show version information, then exit
......@@ -172,6 +173,31 @@ align it on the other strand::
(My thanks to Junko Tsuji and Paul Horton for finding these issues.)
Finding straightforward tandem repeats
--------------------------------------
Option -f4 runs tantan in a different mode, where it finds
straightforward tandem repeats only. (Technically, it uses a Viterbi
algorithm instead of a Forward-Backward algorithm.) This is *not*
recommended for avoiding false homologs! But it might be useful for
studying tandem repeats. The output looks like this::
mySeq 14765 14780 6 2.5 GTCATG GTCATG,GTCATG,GTC
mySeq 632362 632377 2 6 GC GC,GC,GC,GCt,GCT,GCT
mySeq 1278353 1278369 3 6.5 TCA TCA,TCA,TCA,TC-,TC,TC,T
mySeq 3616084 3616100 3 5.33333 TGG TGA,TGA,TGG,TGG,TGG,T
The first 3 columns show the start and end coordinates of the
repetitive region, in `BED
<https://genome.ucsc.edu/FAQ/FAQformat.html#format1>`_ format. Column
4 shows the length of the repeating unit (which might vary due to
insertions and deletions, so this column shows the most common
length). Column 5 shows the number of repeat units. Column 6 shows
the repeating unit (which again might vary, so this is just a
representative). Column 7 shows the whole repeat: lowercase letters
are insertions relative to the previous repeat unit, and dashes are
deletions relative to the previous repeat unit.
Miscellaneous
-------------
......
tantan (22-1) unstable; urgency=medium
* New upstream release.
* Update my uploaders email address in d/control.
* Use correct path for source lintian overrides.
* Update lintian override tag to debian-watch-does-not-check-gpg-signature.
* Fix whitespace in d/changelog.
* Use secure Format URL in d/copyright.
-- Sascha Steinbiss <satta@debian.org> Sat, 22 Dec 2018 20:38:13 +0000
tantan (18-1) unstable; urgency=medium
* Team upload.
......
Source: tantan
Maintainer: Debian Med Packaging Team <debian-med-packaging@lists.alioth.debian.org>
Uploaders: Sascha Steinbiss <sascha@steinbiss.name>
Uploaders: Sascha Steinbiss <satta@debian.org>
Section: science
Priority: optional
Build-Depends: debhelper (>= 11~)
......
Format: http://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
Upstream-Name: tantan
Source: http://www.cbrc.jp/tantan/
......
# Upstream does not provide signed tarballs.
tantan source: debian-watch-may-check-gpg-signature
tantan source: debian-watch-does-not-check-gpg-signature
......@@ -36,7 +36,7 @@ static int myGetopt(int argc, char **argv, const char *optstring,
std::istream &operator>>(std::istream &s, TantanOptions::OutputType &x) {
int i = 0;
s >> i;
if (i < 0 || i > 3)
if (i < 0 || i > 4)
s.setstate(std::ios::failbit);
if (s)
x = static_cast<TantanOptions::OutputType>(i);
......@@ -57,6 +57,7 @@ TantanOptions::TantanOptions() :
gapExistenceCost(0),
gapExtensionCost(-1), // means: no gaps
minMaskProb(0.5),
minCopyNumber(2.0),
outputType(maskOut),
indexOfFirstNonOptionArgument(-1) {}
......@@ -69,7 +70,7 @@ Options (default settings):\n\
-p interpret the sequences as proteins\n\
-x letter to use for masking, instead of lowercase\n\
-c preserve uppercase/lowercase in non-masked regions\n\
-m file for letter pair scores (+1/-1, but -p selects BLOSUM62)\n\
-m file for letter-pair score matrix\n\
-r probability of a repeat starting per position ("
+ stringify(repeatProb) + ")\n\
-e probability of a repeat ending per position ("
......@@ -77,15 +78,17 @@ Options (default settings):\n\
-w maximum tandem repeat period to consider (100, but -p selects 50)\n\
-d probability decay per period ("
+ stringify(repeatOffsetProbDecay) + ")\n\
-i match score (1, but -p selects BLOSUM62)\n\
-j mismatch cost (1, but -p selects BLOSUM62)\n\
-i match score (BLOSUM62 if -p, else 2 if -f4, else 1)\n\
-j mismatch cost (BLOSUM62 if -p, else 7 if -f4, else 1)\n\
-a gap existence cost ("
+ stringify(gapExistenceCost) + ")\n\
-b gap extension cost (infinite: no gaps)\n\
-b gap extension cost (7 if -f4, else infinity)\n\
-s minimum repeat probability for masking ("
+ stringify(minMaskProb) + ")\n\
-n minimum copy number, affects -f4 only ("
+ stringify(minCopyNumber) + ")\n\
-f output type: 0=masked sequence, 1=repeat probabilities,\n\
2=repeat counts, 3=BED ("
2=repeat counts, 3=BED, 4=tandem repeats ("
+ stringify(outputType) + ")\n\
-h, --help show help message, then exit\n\
--version show version information, then exit\n\
......@@ -93,12 +96,13 @@ Options (default settings):\n\
Report bugs to: tantan@cbrc.jp\n\
Home page: http://www.cbrc.jp/tantan/\n\
";
// -k for transition cost?
std::string version = "tantan "
#include "version.hh"
"\n";
const char *optstring = "px:cm:r:e:w:d:i:j:a:b:s:f:h";
const char *optstring = "px:cm:r:e:w:d:i:j:a:b:s:n:f:h";
int i;
while ((i = myGetopt(argc, argv, optstring, help, version)) != -1) {
......@@ -158,6 +162,9 @@ Home page: http://www.cbrc.jp/tantan/\n\
unstringify(minMaskProb, optarg);
// don't bother checking for stupid values?
break;
case 'n':
unstringify(minCopyNumber, optarg);
break;
case 'f':
unstringify(outputType, optarg);
break;
......@@ -168,11 +175,18 @@ Home page: http://www.cbrc.jp/tantan/\n\
}
}
if (gapExtensionCost < 0 && outputType == repOut) gapExtensionCost = 7;
if (gapExtensionCost > 0 && gapExistenceCost + gapExtensionCost <= 0)
throw Error("gap existence + extension cost is too low");
if (maxCycleLength < 0) maxCycleLength = (isProtein ? 50 : 100);
if (!isProtein || matchScore || mismatchCost) {
if (matchScore == 0) matchScore = (outputType == repOut ? 2 : 1);
if (mismatchCost == 0) mismatchCost = (outputType == repOut ? 7 : 1);
}
indexOfFirstNonOptionArgument = optind;
}
......
......@@ -23,7 +23,8 @@ struct TantanOptions {
int gapExistenceCost;
int gapExtensionCost;
double minMaskProb;
enum OutputType { maskOut, probOut, countOut, bedOut } outputType;
double minCopyNumber;
enum OutputType { maskOut, probOut, countOut, bedOut, repOut } outputType;
int indexOfFirstNonOptionArgument;
};
......
......@@ -20,9 +20,9 @@ void multiplyAll(std::vector<double> &v, double factor) {
}
double firstRepeatOffsetProb(double probMult, int maxRepeatOffset) {
if (probMult < 1 || probMult > 1)
if (probMult < 1 || probMult > 1) {
return (1 - probMult) / (1 - std::pow(probMult, maxRepeatOffset));
else
}
return 1.0 / maxRepeatOffset;
}
......@@ -64,6 +64,7 @@ struct Tantan {
double b2fLast; // background state to last foreground state
double backgroundProb;
std::vector<double> b2fProbs; // background state to each foreground state
std::vector<double> foregroundProbs;
std::vector<double> insertionProbs;
......@@ -110,9 +111,16 @@ struct Tantan {
b2fFirst = repeatProb * firstRepeatOffsetProb(b2fDecay, maxRepeatOffset);
b2fLast = repeatProb * firstRepeatOffsetProb(b2fGrowth, maxRepeatOffset);
b2fProbs.resize(maxRepeatOffset);
foregroundProbs.resize(maxRepeatOffset);
insertionProbs.resize(maxRepeatOffset - 1);
double p = b2fFirst;
for (int i = 0; i < maxRepeatOffset; ++i) {
b2fProbs[i] = p;
p *= b2fDecay;
}
scaleFactors.resize((seqEnd - seqBeg) / scaleStepSize);
}
......@@ -211,20 +219,17 @@ struct Tantan {
void calcForwardTransitionProbs() {
if (endGapProb > 0) return calcForwardTransitionProbsWithGaps();
double fromBackground = backgroundProb * b2fLast;
double b = backgroundProb;
double fromForeground = 0;
double *foregroundPtr = END(foregroundProbs);
double *foregroundBeg = BEG(foregroundProbs);
while (foregroundPtr > foregroundBeg) {
--foregroundPtr;
double f = *foregroundPtr;
for (int i = 0; i < maxRepeatOffset; ++i) {
double f = foregroundBeg[i];
fromForeground += f;
*foregroundPtr = fromBackground + f * f2f0;
fromBackground *= b2fGrowth;
foregroundBeg[i] = b * b2fProbs[i] + f * f2f0;
}
backgroundProb = backgroundProb * b2b + fromForeground * f2b;
backgroundProb = b * b2b + fromForeground * f2b;
}
void calcBackwardTransitionProbs() {
......@@ -232,18 +237,15 @@ struct Tantan {
double toBackground = f2b * backgroundProb;
double toForeground = 0;
double *foregroundPtr = BEG(foregroundProbs);
double *foregroundEnd = END(foregroundProbs);
double *foregroundBeg = BEG(foregroundProbs);
while (foregroundPtr < foregroundEnd) {
toForeground *= b2fGrowth;
double f = *foregroundPtr;
toForeground += f;
*foregroundPtr = toBackground + f2f0 * f;
++foregroundPtr;
for (int i = 0; i < maxRepeatOffset; ++i) {
double f = foregroundBeg[i];
toForeground += b2fProbs[i] * f;
foregroundBeg[i] = toBackground + f2f0 * f;
}
backgroundProb = b2b * backgroundProb + b2fLast * toForeground;
backgroundProb = b2b * backgroundProb + toForeground;
}
void addEndCounts(double forwardProb,
......
......@@ -9,17 +9,18 @@
#include "mcf_tantan_options.hh"
#include "mcf_util.hh"
#include "tantan.hh"
#include "tantan_repeat_finder.hh"
#include "LambdaCalculator.hh"
#include <algorithm> // copy, fill_n
#include <cassert>
#include <cmath>
#include <cstring> // strchr
#include <cstdlib> // EXIT_SUCCESS, EXIT_FAILURE
#include <exception> // exception
#include <fstream>
#include <iostream>
#include <new> // bad_alloc
#include <string.h>
#define BEG(v) ((v).empty() ? 0 : &(v).front())
#define END(v) ((v).empty() ? 0 : &(v).back() + 1)
......@@ -33,7 +34,7 @@ bool isDubiousDna(const uchar *beg, const uchar *end) {
int badLetters = 0;
if (end - beg > 100) end = beg + 100; // just check the first 100 letters
while (beg < end) {
if (!std::strchr("AaCcGgTtNnUu", *beg)) {
if (!strchr("AaCcGgTtNnUu", *beg)) {
++badLetters;
if (badLetters > 10) return true;
}
......@@ -45,6 +46,7 @@ bool isDubiousDna(const uchar *beg, const uchar *end) {
namespace {
TantanOptions options;
Alphabet alphabet;
tantan::RepeatFinder repeatFinder;
enum { scoreMatrixSize = 64 };
int fastMatrix[scoreMatrixSize][scoreMatrixSize];
......@@ -76,16 +78,14 @@ void initScoresAndProbabilities() {
if (options.scoreMatrixFileName) {
unfilify(scoreMatrix, options.scoreMatrixFileName);
} else if (options.isProtein) {
if (options.matchScore || options.mismatchCost) {
scoreMatrix.initMatchMismatch(std::max(options.matchScore, 1),
std::max(options.mismatchCost, 1),
if (options.matchScore && options.mismatchCost) {
scoreMatrix.initMatchMismatch(options.matchScore, options.mismatchCost,
Alphabet::protein);
} else {
unstringify(scoreMatrix, ScoreMatrix::blosum62);
}
} else {
scoreMatrix.initMatchMismatch(std::max(options.matchScore, 1),
std::max(options.mismatchCost, 1),
scoreMatrix.initMatchMismatch(options.matchScore, options.mismatchCost,
"ACGTU"); // allow for RNA
}
......@@ -101,7 +101,9 @@ void initScoresAndProbabilities() {
for (int i = 0; i < scoreMatrixSize; ++i) {
for (int j = 0; j < scoreMatrixSize; ++j) {
probMatrix[i][j] = std::exp(matrixLambda * fastMatrix[i][j]);
double x = matrixLambda * fastMatrix[i][j];
if (options.outputType != options.repOut) x = std::exp(x);
probMatrix[i][j] = x;
}
}
......@@ -116,6 +118,10 @@ void initScoresAndProbabilities() {
// XXX check if firstGapProb is too high
}
repeatFinder.init(options.maxCycleLength, probMatrixPointers,
options.repeatProb, options.repeatEndProb,
options.repeatOffsetProbDecay, firstGapProb, otherGapProb);
//std::cerr << "lambda: " << matrixLambda << "\n";
//std::cerr << "firstGapProb: " << firstGapProb << "\n";
//std::cerr << "otherGapProb: " << otherGapProb << "\n";
......@@ -157,6 +163,138 @@ void writeBed(const float *probBeg, const float *probEnd,
if (maskBeg) writeBedLine(seqName, probBeg, maskBeg, probEnd, output);
}
void storeSequence(const uchar *beg, const uchar *end, std::string &out) {
out.clear();
for (const uchar *i = beg; i < end; ++i) {
out.push_back(std::toupper(alphabet.numbersToLetters[*i]));
}
}
struct RepeatUnit {
const uchar *beg;
int len;
};
bool less(const RepeatUnit &x, const RepeatUnit &y) {
if (x.len != y.len) return x.len < y.len;
int c = memcmp(x.beg, y.beg, x.len);
if (c != 0) return c < 0;
return x.beg > y.beg;
}
int mainLen(const std::vector<RepeatUnit> &repUnits) {
int bestLen;
size_t bestCount = 0;
size_t count = 0;
for (size_t i = 0; i < repUnits.size(); ++i) {
if (i && repUnits[i].len > repUnits[i - 1].len) {
count = 0;
}
++count;
if (count > bestCount) {
bestCount = count;
bestLen = repUnits[i].len;
}
}
return bestLen;
}
const uchar *mainBeg(const std::vector<RepeatUnit> &repUnits, int len) {
const uchar *bestBeg;
size_t bestCount = 0;
size_t count = 0;
for (size_t i = 0; i < repUnits.size(); ++i) {
if (repUnits[i].len != len) continue;
if (count && memcmp(repUnits[i - 1].beg, repUnits[i].beg, len) != 0) {
count = 0;
}
++count;
if (count < bestCount) continue;
if (count > bestCount || repUnits[i].beg < bestBeg) {
bestCount = count;
bestBeg = repUnits[i].beg;
}
}
return bestBeg;
}
void writeRepeat(const FastaSequence &f,
const uchar *repBeg, const uchar *repEnd,
const std::string &repText, std::vector<RepeatUnit> &repUnits,
const uchar *commaPos, int finalOffset) {
double repeatCount = count(repText.begin(), repText.end(), ',');
double copyNumber = repeatCount + (repEnd - commaPos) * 1.0 / finalOffset;
if (copyNumber < options.minCopyNumber) return;
sort(repUnits.begin(), repUnits.end(), less);
int bestLen = mainLen(repUnits);
const uchar *bestBeg = mainBeg(repUnits, bestLen);
const uchar *beg = BEG(f.sequence);
std::cout << firstWord(f.title) << '\t'
<< (repBeg - beg) << '\t' << (repEnd - beg) << '\t'
<< bestLen << '\t' << copyNumber << '\t';
for (int i = 0; i < bestLen; ++i) {
char c = std::toupper(alphabet.numbersToLetters[bestBeg[i]]);
std::cout << c;
}
std::cout << '\t';
std::cout << repText << '\n';
}
void findRepeatsInOneSequence(const FastaSequence &f) {
const uchar *beg = BEG(f.sequence);
const uchar *end = END(f.sequence);
repeatFinder.calcBestPathScore(beg, end);
std::vector<RepeatUnit> repUnits;
std::string repText;
const uchar *repBeg;
const uchar *commaPos;
int state = 0;
for (const uchar *seqPtr = beg; seqPtr < end; ++seqPtr) {
int newState = repeatFinder.nextState();
if (newState == 0) {
if (state > 0) {
writeRepeat(f, repBeg, seqPtr, repText, repUnits, commaPos, state);
}
} else if (newState <= options.maxCycleLength) {
if (state == 0) {
repUnits.clear();
repBeg = seqPtr - newState;
storeSequence(repBeg, seqPtr, repText);
commaPos = repBeg;
} else if (state <= options.maxCycleLength) {
for (int i = state; i > newState; --i) {
if (seqPtr - commaPos >= i) {
repText.push_back(',');
commaPos = seqPtr;
}
repText.push_back('-');
}
}
RepeatUnit unit = {seqPtr - newState, newState};
repUnits.push_back(unit);
if (seqPtr - commaPos >= newState) {
repText.push_back(',');
commaPos = seqPtr;
}
repText.push_back(std::toupper(alphabet.numbersToLetters[*seqPtr]));
} else {
repText.push_back(std::tolower(alphabet.numbersToLetters[*seqPtr]));
}
state = newState;
}
if (state > 0) {
writeRepeat(f, repBeg, end, repText, repUnits, commaPos, state);
}
}
void processOneSequence(FastaSequence &f, std::ostream &output) {
uchar *beg = BEG(f.sequence);
uchar *end = END(f.sequence);
......@@ -181,6 +319,8 @@ void processOneSequence(FastaSequence &f, std::ostream &output) {
BEG(transitionCounts));
double sequenceLength = static_cast<double>(f.sequence.size());
transitionTotal += sequenceLength + 1;
} else if (options.outputType == options.repOut) {
findRepeatsInOneSequence(f);
} else {
std::vector<float> probabilities(end - beg);
float *probBeg = BEG(probabilities);
......
// Copyright 2018 Martin C. Frith
#include "tantan_repeat_finder.hh"
#include <algorithm>
#include <assert.h>
#include <limits.h>
#include <math.h>
//#include <iostream> // for debugging
namespace tantan {
static double max3(double x, double y, double z) {
return std::max(std::max(x, y), z);
}
static double myLog(double x) {
return x > 0 ? log(x) : -HUGE_VAL;
}
static double firstRepeatOffsetProb(double probMult, int maxRepeatOffset) {
if (probMult < 1 || probMult > 1) {
return (1 - probMult) / (1 - pow(probMult, 1.0 * maxRepeatOffset));
}
return 1.0 / maxRepeatOffset;
}
static int numOfDpScoresPerLetter(int maxRepeatOffset, double endGapScore) {
if (endGapScore > -HUGE_VAL) {
assert(maxRepeatOffset <= INT_MAX / 2);
return maxRepeatOffset * 2;
}
assert(maxRepeatOffset < INT_MAX);
return maxRepeatOffset + 1;
}
static unsigned minStoredPositions(const uchar *beg, const uchar *end) {
// We will do a dynamic programming algorithm along the sequence.
// To save memory, we will keep only some DP values, and recalculate
// the others later. We keep the values in an array x of size s.
// We will store DP initial values in x[0]. We will put
// the DP values after the first s-1 sequence positions in x[1],
// the DP values after the next s-2 positions in x[2],
// the DP values after the next s-3 positions in x[3], etc.
// This function returns the minimum possible value of s.
unsigned t = 0;
while (t < end - beg) {
beg += t;
++t;
}
return t + 1;
}
void RepeatFinder::init(int maxRepeatOffset,
const const_double_ptr *substitutionMatrix,
double repeatProb,
double repeatEndProb,
double repeatOffsetProbDecay,
double firstGapProb,
double otherGapProb) {
assert(maxRepeatOffset > 0);
this->maxRepeatOffset = maxRepeatOffset;
this->substitutionMatrix = substitutionMatrix;
b2b = myLog(1 - repeatProb);
f2b = myLog(repeatEndProb);
g2g = myLog(otherGapProb);
oneGapScore = myLog(firstGapProb * (1 - otherGapProb));
endGapScore = myLog(firstGapProb * (maxRepeatOffset > 1));
f2f0 = myLog(1 - repeatEndProb);
f2f1 = myLog(1 - repeatEndProb - firstGapProb);
f2f2 = myLog(1 - repeatEndProb - firstGapProb * 2);
double x = 1 / repeatOffsetProbDecay;
b2fGrowth = myLog(x);
b2fLast = myLog(repeatProb * firstRepeatOffsetProb(x, maxRepeatOffset));
dpScoresPerLetter = numOfDpScoresPerLetter(maxRepeatOffset, endGapScore);
}
void RepeatFinder::initializeBackwardScores() {
scoresPtr[0] = b2b;
std::fill_n(scoresPtr + 1, maxRepeatOffset, f2b);
if (endGapScore > -HUGE_VAL) {
std::fill_n(scoresPtr + 1 + maxRepeatOffset, maxRepeatOffset-1, -HUGE_VAL);
}
}
void RepeatFinder::calcBackwardTransitionScoresWithGaps() {
double toBackground = f2b + scoresPtr[0];
double *foregroundPtr = scoresPtr + 1;
double f = *foregroundPtr;
double toForeground = f;
double *insertionPtr = scoresPtr + 1 + maxRepeatOffset;
double i = *insertionPtr;
*foregroundPtr = max3(toBackground, f2f1 + f, i);
double d = endGapScore + f;
++foregroundPtr;
toForeground += b2fGrowth;
while (foregroundPtr < scoresPtr + maxRepeatOffset) {
f = *foregroundPtr;
toForeground = std::max(toForeground, f);
i = *(insertionPtr + 1);
*foregroundPtr = max3(toBackground, f2f2 + f, std::max(i, d));
double oneGapScore_f = oneGapScore + f;
*insertionPtr = std::max(oneGapScore_f, g2g + i);
d = std::max(oneGapScore_f, g2g + d);
++foregroundPtr;
++insertionPtr;
toForeground += b2fGrowth;
}
f = *foregroundPtr;
toForeground = std::max(toForeground, f);
*foregroundPtr = max3(toBackground, f2f1 + f, d);
*insertionPtr = endGapScore + f;
scoresPtr[0] = std::max(b2b + scoresPtr[0], b2fLast + toForeground);
}
void RepeatFinder::calcBackwardTransitionScores() {
if (endGapScore > -HUGE_VAL) return calcBackwardTransitionScoresWithGaps();
double toBackground = f2b + scoresPtr[0];
double toForeground = -HUGE_VAL;
double *foregroundPtr = scoresPtr + 1;
double *foregroundEnd = foregroundPtr + maxRepeatOffset;
while (foregroundPtr < foregroundEnd) {
toForeground += b2fGrowth;
double f = *foregroundPtr;
toForeground = std::max(toForeground, f);
*foregroundPtr = std::max(toBackground, f2f0 + f);
++foregroundPtr;
}
scoresPtr[0] = std::max(b2b + scoresPtr[0], b2fLast + toForeground);
}
void RepeatFinder::calcEmissionScores() {
const double *matrixRow = substitutionMatrix[*seqPtr];
double *oldScores = scoresPtr - dpScoresPerLetter;
int maxOffset = maxOffsetInTheSequence();
int i = 1;
scoresPtr[0] = oldScores[0];
for (; i <= maxOffset; ++i) {
scoresPtr[i] = oldScores[i] + matrixRow[seqPtr[-i]];
}
for (; i <= maxRepeatOffset; ++i) {
scoresPtr[i] = -HUGE_VAL;
}
std::copy(oldScores + i, scoresPtr, scoresPtr + i);
}
void RepeatFinder::makeCheckpoint() {
checkpoint += dpScoresPerLetter;
std::copy(scoresPtr - dpScoresPerLetter, scoresPtr, checkpoint);
scoresPtr = checkpoint + dpScoresPerLetter;
assert(scoresPtr < scoresEnd);
}
void RepeatFinder::redoCheckpoint() {
seqPtr += (scoresEnd - scoresPtr) / dpScoresPerLetter;
while (scoresPtr < scoresEnd) {
--seqPtr;
calcScoresForOneSequencePosition();
scoresPtr += dpScoresPerLetter;
}
scoresPtr -= dpScoresPerLetter;
checkpoint -= dpScoresPerLetter;
}
double RepeatFinder::calcBestPathScore(const uchar *seqBeg,
const uchar *seqEnd) {
this->seqBeg = seqBeg;
this->seqEnd = seqEnd;
unsigned long numOfStoredPositions = minStoredPositions(seqBeg, seqEnd);
unsigned long numOfScores = numOfStoredPositions * dpScoresPerLetter;
assert(numOfStoredPositions > 0);
assert(numOfScores > 0);
dpScores.resize(numOfScores);
scoresPtr = &dpScores[0];
scoresEnd = scoresPtr + numOfScores;
checkpoint = scoresPtr;
seqPtr = seqEnd;
initializeBackwardScores();
while (seqPtr > seqBeg) {
--seqPtr;
scoresPtr += dpScoresPerLetter;
if (scoresPtr == scoresEnd) makeCheckpoint();
calcScoresForOneSequencePosition();
}
state = 0;
return scoresPtr[0];
}
int RepeatFinder::offsetWithMaxScore() const {
const double *matrixRow = substitutionMatrix[*seqPtr];
int maxOffset = maxOffsetInTheSequence();
int bestOffset = 0;
double toForeground = -HUGE_VAL;
for (int i = 1; i <= maxOffset; ++i) {
toForeground += b2fGrowth;
double f = scoreWithEmission(matrixRow, i);
if (f > toForeground) {
toForeground = f;
bestOffset = i;
}
}
return bestOffset;
}
int RepeatFinder::deletionWithMaxScore() const {
const double *matrixRow = substitutionMatrix[*seqPtr];
int bestOffset = 1;
double f = scoreWithEmission(matrixRow, 1);
double d = endGapScore + f;
for (int i = 2; i < state; ++i) {
d += g2g;
f = scoreWithEmission(matrixRow, i);
if (oneGapScore + f > d) {
d = oneGapScore + f;
bestOffset = i;
}
}
return bestOffset;
}
int RepeatFinder::nextState() {
double maxScore = scoresPtr[state];
if (scoresPtr == checkpoint) redoCheckpoint();
scoresPtr -= dpScoresPerLetter;
if (state == 0) {
if (b2b + scoresPtr[0] < maxScore) state = offsetWithMaxScore();
} else if (state <= maxRepeatOffset) {
if (f2b + scoresPtr[0] >= maxScore) {
state = 0;
} else if (endGapScore > -HUGE_VAL) {
double f = scoreWithEmission(substitutionMatrix[*seqPtr], state);
if (state == 1) {
if (f2f1 + f < maxScore) state += maxRepeatOffset;
} else if (state == maxRepeatOffset) {
if (f2f1 + f < maxScore) state = deletionWithMaxScore();
} else if (f2f2 + f < maxScore) {
if (scoresPtr[state + maxRepeatOffset] >= maxScore) {
state += maxRepeatOffset;
} else {
state = deletionWithMaxScore();
}
}
}
} else {
++state;
if (state == dpScoresPerLetter || g2g + scoresPtr[state] < maxScore) {
state -= maxRepeatOffset;
}
}
++seqPtr;
return state;
}
}
// Copyright 2018 Martin C. Frith
// This class finds tandem repeats in a sequence, using a Viterbi
// algorithm.
// The input parameters are as in tantan.hh, with one difference: the
// substitutionMatrix should be a *log* likelihood ratio matrix:
// substitutionMatrix[x][y] = lambda * scoringMatrix[x][y].
// Usage: first call init. Then call calcBestPathScore, which runs
// the Viterbi algorithm backwards from the end to the start of the
// sequence. Finally, call nextState once per sequence letter, to get
// the "state" of each letter from start to end.
// state = 0: non-repeat.
// 0 < state <= maxRepeatOffset: tandem repeat with period = state.
// maxRepeatOffset < state < 2*maxRepeatOffset: insertion in repeat.
#ifndef TANTAN_REPEAT_FINDER_HH
#define TANTAN_REPEAT_FINDER_HH
#include <vector>
namespace tantan {
typedef unsigned char uchar;
typedef const double *const_double_ptr;
class RepeatFinder {
public:
void init(int maxRepeatOffset,
const const_double_ptr *substitutionMatrix,
double repeatProb,
double repeatEndProb,
double repeatOffsetProbDecay,
double firstGapProb,
double otherGapProb);
double calcBestPathScore(const uchar *seqBeg, const uchar *seqEnd);
int nextState();
private:
const const_double_ptr *substitutionMatrix;
double b2b;
double f2b;
double g2g;
double oneGapScore;
double endGapScore;
double f2f0;
double f2f1;
double f2f2;
double b2fGrowth;
double b2fLast;
int maxRepeatOffset;
int dpScoresPerLetter;
std::vector<double> dpScores;
double *scoresPtr;
double *scoresEnd;
double *checkpoint;
const uchar *seqBeg;
const uchar *seqEnd;
const uchar *seqPtr;
int state;
void initializeBackwardScores();
void calcBackwardTransitionScoresWithGaps();
void calcBackwardTransitionScores();
void calcEmissionScores();
void makeCheckpoint();
void redoCheckpoint();
int offsetWithMaxScore() const;
int deletionWithMaxScore() const;
bool isNearSeqBeg() const {
return seqPtr - seqBeg < maxRepeatOffset;
}
int maxOffsetInTheSequence() const {
return isNearSeqBeg() ? (seqPtr - seqBeg) : maxRepeatOffset;
}
double scoreWithEmission(const double *matrixRow, int offset) const {
return scoresPtr[offset] + matrixRow[seqPtr[-offset]];
}
void calcScoresForOneSequencePosition() {
calcEmissionScores();
calcBackwardTransitionScores();
}
};
}
#endif
>hard
catcatcatcatcat
atcatatcatatcatatcatatcat
......@@ -1489,3 +1489,27 @@ GTGGTCCACATATACAATGGAATATTACTCAGCTATCAGAAAGAA
IIIIIIIIIIIIIIIIDIII2BDHIIHCF=/-6A1921*/1-*-7
205
SRR019778.4 6 27 7 3 TTGTGTG TGTGTAT,TGTGTGT,TGTGTGT
SRR019778.4 28 45 2 12 GT GT,GT,GT,GT,GT,-T,T,T,T,T,T,T
SRR019778.42 3 44 4 11 TGTG TGTG,TGTG,TGTG,TGTG,TGTG,TGTG,TGTT,TGTT,T-TT,TTT,TTT
SRR019778.45 9 37 8 3.33333 GTGTGTGG GTGTGTGG,GTGTGTGT,GTtGTGTGT,GTT
SRR019778.50 0 21 7 3 GTGTGTT GTGTGTT,GTGTGTT,GAGTGTT
SRR019778.50 22 33 2 5.5 TG TG,TG,TG,TG,TG,T
SRR019778.64 30 44 3 4.66667 GAG GAG,GAG,GAG,GAG,GA
SRR019778.65 2 13 2 5.5 TG TG,TG,TG,TG,TG,T
SRR019778.65 4 37 16 2.0625 TGTGTGTGTTTTCTGT TGTGTGTGTTTTCTGT,TGTGTGTGTTTTTTGT,T
SRR019778.78 9 31 10 2.2 TGCCTTACTA TGCCTTACTA,TGCCTTACTA,TG
SRR019778.95 2 18 4 4 TGAT TGAT,TGAT,TGAT,TGAT
SRR019778.95 22 45 4 5.75 ATAG ATAG,ATAG,ATAG,ATAG,ATAG,ATA
hard 0 14 3 4.66667 CAT CAT,CAT,CAT,CAT,CA
hard 10 40 5 6 ATCAT ATCAT,ATCAT,ATCAT,ATCAT,ATCAT,ATCAT
chrM 207 220 4 3.25 TTAA TTAA,TTAA,TTAA,T
chrM 515 526 2 5.5 CA CA,CA,CA,CA,CA,C
chrM 5305 5326 12 1.75 CCACCATCATAG CCACCATCATAG,CCACCATCA
chrM 8271 8290 9 2.11111 ACCCCCTCT ACCCCCTCT,ACCCCCTCT,A
chrM 13647 13691 30 1.48276 CCCCACCCTTACTAACATTAACGAAAATAA CCCCACCCTTACTAACATTAACGAAAATAA,CCCCACCCT-ACTAA
chrM 15829 15844 6 2.5 AATCCT AATCCT,AATCCT,AAT
chrM 16183 16195 1 12 C C,C,C,C,C,C,C,C,C,C,C,C
......@@ -29,5 +29,11 @@ countLowercaseLetters () {
tantan panda.fastq
echo
tantan -i2 -j3 hg19_chrM.fa | countLowercaseLetters
echo
tantan -f4 panda.fastq
echo
tantan -f4 -b12 hard.fa
echo
tantan -f4 -n1 hg19_chrM.fa
} 2>&1 |
diff -u tantan_test.out -