Skip to content
Commits on Source (4)
2018-05-14 Martin C. Frith <Martin C. Frith>
* scripts/last-dotplot, scripts/last-postmask, test/102.maf, test
/last-postmask-test.out, test/last-postmask-test.sh, test/last-
split-test.out, test/last-split-test.sh:
postmask: fix bug for unusual alignment headers
[65969d4d464d] [tip]
* scripts/last-dotplot, scripts/last-map-probs, scripts/last-postmask,
scripts/last-train, scripts/maf-convert, scripts/maf-swap:
Make last-train find LAST programs more robustly
[8a4fadbe6080]
2018-05-07 Martin C. Frith <Martin C. Frith>
* scripts/last-postmask, scripts/last-train, scripts/maf-convert,
scripts/maf-join, scripts/maf-swap:
Modernize the Python code
[b1c09fdd12fe]
* doc/last-tuning.txt, doc/lastal.txt, doc/lastdb.txt,
src/LastalArguments.cc, src/LastalArguments.hh, test/last-test.out,
test/last-test.sh:
Change lastal -x & -z options
[b09d82479c91]
* doc/last-train.txt, scripts/last-train:
Make last-train use last-postmask
[2755c8f0dc79]
* scripts/last-postmask, test/last-postmask-test.out:
postmask: bugfix for strand-asymmetric scores
[9caca45429d8]
2018-04-20 Martin C. Frith <Martin C. Frith>
* doc/Makefile, doc/lastal.txt, doc/maf-cut.txt, scripts/maf-cut:
Add maf-cut utility
[482845444bcd] [tip]
[482845444bcd]
2018-04-18 Martin C. Frith <Martin C. Frith>
......
last-align (938-1) unstable; urgency=medium
* New upstream version
-- Andreas Tille <tille@debian.org> Wed, 23 May 2018 07:10:11 +0200
last-align (932-1) unstable; urgency=medium
* New upstream version
......
......@@ -5,4 +5,4 @@ IndexIgnore last-map-probs.txt last-matrices.txt last-pair-probs.txt
IndexIgnore last-papers.txt last-parallel.txt last-postmask.txt
IndexIgnore last-repeats.txt last-seeds.txt last-split.txt last-train.txt
IndexIgnore last-tuning.txt last-tutorial.txt last.txt lastal.txt lastdb.txt
IndexIgnore maf-convert.txt
IndexIgnore maf-convert.txt maf-cut.txt
......@@ -334,7 +334,7 @@ final score parameters, in a format that can be read by <a class="reference exte
option</a>.</p>
<p>You can also pipe sequences into last-train, for example:</p>
<pre class="literal-block">
zcat queries.fasta.gz | last-train mydb
bzcat queries.fasta.bz2 | last-train mydb
</pre>
<div class="section" id="options">
<h2>Options</h2>
......@@ -377,6 +377,11 @@ e.g. score(A→G) = score(G→A).</td></tr>
optimize the parameters for low-similarity alignments,
similarly to the BLOSUM matrices.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">--postmask=<var>NUMBER</var></span></kbd></td>
<td>By default, last-train ignores alignments of mostly-lowercase
sequence (by using <a class="reference external" href="last-postmask.html">last-postmask</a>).
To turn this off, do <tt class="docutils literal"><span class="pre">--postmask=0</span></tt>.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">--sample-number=<var>N</var></span></kbd></td>
<td>Use N randomly-chosen chunks of the query sequences. The
queries are chopped into fixed-length chunks (as if they were
......
......@@ -20,7 +20,7 @@ option <lastal.html#score-options>`_.
You can also pipe sequences into last-train, for example::
zcat queries.fasta.gz | last-train mydb
bzcat queries.fasta.bz2 | last-train mydb
Options
-------
......@@ -46,6 +46,10 @@ Training options
Ignore alignments with > PID% identity. This aims to
optimize the parameters for low-similarity alignments,
similarly to the BLOSUM matrices.
--postmask=NUMBER
By default, last-train ignores alignments of mostly-lowercase
sequence (by using `last-postmask <last-postmask.html>`_).
To turn this off, do ``--postmask=0``.
--sample-number=N
Use N randomly-chosen chunks of the query sequences. The
queries are chopped into fixed-length chunks (as if they were
......
......@@ -370,7 +370,7 @@ alphabetically earliest.</p>
<div class="section" id="lastdb8-lastal8">
<h2>lastdb8 &amp; lastal8</h2>
<p>If your reference has more than about 4 billion letters, 8-byte LAST
<em>may</em> be beneficial. Ordinary (4-byte) LAST cannot directly handle so
may be beneficial. Ordinary (4-byte) LAST cannot directly handle so
much data, so it splits it into volumes, which is inefficient. 8-byte
LAST can handle such data without voluming, but it uses more memory.</p>
<p>8-byte LAST combines well with lastdb option -w or -W, which reduce
......@@ -401,15 +401,27 @@ high-identity matches.</p>
<p>This option (gapless alignment culling) can make lastal <strong>faster</strong> but
<strong>less sensitive</strong>. It can also <strong>reduce redundant output</strong>. For
example, -C2 makes it discard alignments (before gapped extension)
whose query coordinates lie in those of 2 or more stronger alignments.</p>
whose query coordinates lie in those of 2 or more stronger alignments.
This works well for aligning long, repeat-rich, indel-poor sequences
(e.g. mammal chromosomes) without repeat-masking.</p>
</div>
<div class="section" id="lastal-x">
<h3>lastal -x</h3>
<div class="section" id="lastal-z">
<h3>lastal -z</h3>
<p>This option can make lastal <strong>faster</strong> but <strong>less sensitive</strong>. It
sets the maximum score drop in alignments, in the gapped extension
phase. Lower values make it faster, by quitting unpromising
extensions sooner. The default aims at best accuracy. For example,
use -x50% to specify 50% of the default value.</p>
extensions sooner. The default aims at best accuracy.</p>
<p>You can set this option in several ways: perhaps the most intuitive is
via maximum gap length. For example, -z10g sets the maximum score
drop such that the longest possible gap length is 10.</p>
</div>
<div class="section" id="lastal-x">
<h3>lastal -x</h3>
<p>This option (preliminary gapped extension) can make lastal <strong>faster</strong>
but <strong>less sensitive</strong>. For example, -x2g makes it extend gapped
alignments with a maximum gap length of 2, discard those with score
below the gapped score threshold, then redo the survivors with the
final max score drop (z).</p>
</div>
<div class="section" id="id2">
<h3>lastal -M</h3>
......
......@@ -65,7 +65,7 @@ lastdb8 & lastal8
~~~~~~~~~~~~~~~~~
If your reference has more than about 4 billion letters, 8-byte LAST
*may* be beneficial. Ordinary (4-byte) LAST cannot directly handle so
may be beneficial. Ordinary (4-byte) LAST cannot directly handle so
much data, so it splits it into volumes, which is inefficient. 8-byte
LAST can handle such data without voluming, but it uses more memory.
......@@ -102,15 +102,29 @@ This option (gapless alignment culling) can make lastal **faster** but
**less sensitive**. It can also **reduce redundant output**. For
example, -C2 makes it discard alignments (before gapped extension)
whose query coordinates lie in those of 2 or more stronger alignments.
This works well for aligning long, repeat-rich, indel-poor sequences
(e.g. mammal chromosomes) without repeat-masking.
lastal -x
lastal -z
---------
This option can make lastal **faster** but **less sensitive**. It
sets the maximum score drop in alignments, in the gapped extension
phase. Lower values make it faster, by quitting unpromising
extensions sooner. The default aims at best accuracy. For example,
use -x50% to specify 50% of the default value.
extensions sooner. The default aims at best accuracy.
You can set this option in several ways: perhaps the most intuitive is
via maximum gap length. For example, -z10g sets the maximum score
drop such that the longest possible gap length is 10.
lastal -x
---------
This option (preliminary gapped extension) can make lastal **faster**
but **less sensitive**. For example, -x2g makes it extend gapped
alignments with a maximum gap length of 2, discard those with score
below the gapped score threshold, then redo the survivors with the
final max score drop (z).
lastal -M
---------
......
......@@ -520,25 +520,37 @@ looks like this:</p>
<p class="last">The &quot;-1&quot; indicates the reverse frameshift.</p>
</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-x <var>DROP</var></span></kbd></td>
<kbd><span class="option">-z <var>DROP</var></span></kbd></td>
<td><p class="first">Maximum score drop for gapped alignments. Gapped alignments are
forbidden from having any internal region with score &lt; -DROP.
This serves two purposes: accuracy (avoid spurious internal
regions in alignments) and speed (the smaller the faster).</p>
<p class="last">You can specify either a score (e.g. -x20), or a percentage; for
example, -x50% specifies 50% of the default value (rounded down
to the nearest integer).</p>
The default value is e-1, which arguably produces the best
alignments. Lower values improve speed, by quitting unpromising
extensions sooner. You can specify this parameter in 3 ways:</p>
<ul class="last">
<li><p class="first">A score (e.g. -z20).</p>
</li>
<li><p class="first">A percentage. For example, -z50% specifies 50% of the default
value (rounded down to the nearest integer).</p>
</li>
<li><p class="first">A maximum gap length. For example, -z8g sets the maximum
score drop to: min[a+8b, A+8B]. However, this never increases
the value above the default.</p>
</li>
</ul>
</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-x <var>DROP</var></span></kbd></td>
<td>This option makes lastal extend gapped alignments twice. First,
it extends gapped alignments with a maximum score drop of x, and
discards those with score &lt; e. The surviving alignments are
redone with a (presumably higher) maximum score drop of z. This
aims to improve speed with minimal effect on the final
alignments. You can specify -x in the same ways as -z (with the
default value of x being z).</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-y <var>DROP</var></span></kbd></td>
<td>Maximum score drop for gapless alignments.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-z <var>DROP</var></span></kbd></td>
<td>Maximum score drop for final gapped alignments. Setting z
different from x causes lastal to extend gapless alignments
twice: first with a maximum score drop of x, and then with a
(presumably higher) maximum score drop of z.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-d <var>SCORE</var></span></kbd></td>
<td>Minimum score for gapless alignments.</td></tr>
<tr><td class="option-group">
......
......@@ -187,25 +187,34 @@ Score options
The "-1" indicates the reverse frameshift.
-x DROP
-z DROP
Maximum score drop for gapped alignments. Gapped alignments are
forbidden from having any internal region with score < -DROP.
This serves two purposes: accuracy (avoid spurious internal
regions in alignments) and speed (the smaller the faster).
The default value is e-1, which arguably produces the best
alignments. Lower values improve speed, by quitting unpromising
extensions sooner. You can specify this parameter in 3 ways:
* A score (e.g. -z20).
* A percentage. For example, -z50% specifies 50% of the default
value (rounded down to the nearest integer).
You can specify either a score (e.g. -x20), or a percentage; for
example, -x50% specifies 50% of the default value (rounded down
to the nearest integer).
* A maximum gap length. For example, -z8g sets the maximum
score drop to: min[a+8b, A+8B]. However, this never increases
the value above the default.
-x DROP
This option makes lastal extend gapped alignments twice. First,
it extends gapped alignments with a maximum score drop of x, and
discards those with score < e. The surviving alignments are
redone with a (presumably higher) maximum score drop of z. This
aims to improve speed with minimal effect on the final
alignments. You can specify -x in the same ways as -z (with the
default value of x being z).
-y DROP
Maximum score drop for gapless alignments.
-z DROP
Maximum score drop for final gapped alignments. Setting z
different from x causes lastal to extend gapless alignments
twice: first with a maximum score drop of x, and then with a
(presumably higher) maximum score drop of z.
-d SCORE
Minimum score for gapless alignments.
......
......@@ -569,8 +569,7 @@ bytes.</p>
<p>lastdb can become catastrophically slow for highly redundant
sequences, e.g. two almost-identical genomes. It usually processes
several GB per hour, but if it becomes much slower, try option &quot;-i
10&quot;, which is likely to solve the problem. (If even that is too slow,
try &quot;-i 100&quot; or so.)</p>
10&quot;, which is likely to solve the problem.</p>
</div>
</div>
</body>
......
......@@ -234,5 +234,4 @@ Limitations
lastdb can become catastrophically slow for highly redundant
sequences, e.g. two almost-identical genomes. It usually processes
several GB per hour, but if it becomes much slower, try option "-i
10", which is likely to solve the problem. (If even that is too slow,
try "-i 100" or so.)
10", which is likely to solve the problem.
......@@ -18,8 +18,15 @@ import subprocess
import itertools, optparse, os, re, sys
# Try to make PIL/PILLOW work:
try: from PIL import Image, ImageDraw, ImageFont, ImageColor
except ImportError: import Image, ImageDraw, ImageFont, ImageColor
try:
from PIL import Image, ImageDraw, ImageFont, ImageColor
except ImportError:
import Image, ImageDraw, ImageFont, ImageColor
try:
from future_builtins import zip
except ImportError:
pass
def myOpen(fileName): # faster than fileinput
if fileName is None:
......@@ -75,7 +82,7 @@ def tabBlocks(beg1, beg2, blocks):
def mafBlocks(beg1, beg2, seq1, seq2):
'''Get the gapless blocks of an alignment, from MAF format.'''
size = 0
for x, y in itertools.izip(seq1, seq2):
for x, y in zip(seq1, seq2):
if x == "-":
if size:
yield beg1, beg2, size
......@@ -149,8 +156,8 @@ def updateSeqs(coverDict, seqRanges, seqName, ranges, coveredRange):
def readAlignments(fileName, opts):
'''Get alignments and sequence limits, from MAF or tabular format.'''
seqRequests1 = map(seqRequestFromText, opts.seq1)
seqRequests2 = map(seqRequestFromText, opts.seq2)
seqRequests1 = [seqRequestFromText(i) for i in opts.seq1]
seqRequests2 = [seqRequestFromText(i) for i in opts.seq2]
alignments = []
seqRanges1 = []
......
......@@ -120,6 +120,6 @@ if __name__ == "__main__":
try: lastMapProbs(opts, args)
except KeyboardInterrupt: pass # avoid silly error message
except Exception, e:
except Exception as e:
prog = os.path.basename(sys.argv[0])
sys.exit(prog + ": error: " + str(e))
......@@ -12,8 +12,15 @@
# Limitations: doesn't (yet) handle sequence quality data,
# frameshifts, or generalized affine gaps.
from __future__ import print_function
import gzip
import itertools, optparse, os, signal, sys
import optparse, os, signal, sys
try:
from future_builtins import zip
except ImportError:
pass
def myOpen(fileName):
if fileName == "-":
......@@ -22,9 +29,17 @@ def myOpen(fileName):
return gzip.open(fileName)
return open(fileName)
def getScoreMatrix(rowHeads, colHeads, matrix, deleteCost, insertCost):
def complement(base):
x = "ACGTRYKMBDHV"
y = "TGCAYRMKVHDB"
i = x.find(base)
return y[i] if i >= 0 else base
def fastScoreMatrix(rowHeads, colHeads, matrix, deleteCost, insertCost):
matrixLen = 128
defaultScore = min(map(min, matrix))
scoreMatrix = [[defaultScore for i in range(128)] for j in range(128)]
fastMatrix = [[defaultScore for i in range(matrixLen)]
for j in range(matrixLen)]
for i, x in enumerate(rowHeads):
for j, y in enumerate(colHeads):
xu = ord(x.upper())
......@@ -33,70 +48,90 @@ def getScoreMatrix(rowHeads, colHeads, matrix, deleteCost, insertCost):
yl = ord(y.lower())
score = matrix[i][j]
maskScore = min(score, 0)
scoreMatrix[xu][yu] = score
scoreMatrix[xu][yl] = maskScore
scoreMatrix[xl][yu] = maskScore
scoreMatrix[xl][yl] = maskScore
for i in range(128):
scoreMatrix[i][ord("-")] = -deleteCost
scoreMatrix[ord("-")][i] = -insertCost
return scoreMatrix
def isGoodAlignment(seqs, scoreMatrix, delOpenCost, insOpenCost, minScore):
fastMatrix[xu][yu] = score
fastMatrix[xu][yl] = maskScore
fastMatrix[xl][yu] = maskScore
fastMatrix[xl][yl] = maskScore
for i in range(matrixLen):
fastMatrix[i][ord("-")] = -deleteCost
fastMatrix[ord("-")][i] = -insertCost
return fastMatrix
def matrixPerStrand(rowHeads, colHeads, matrix, deleteCost, insertCost):
rowComps = [complement(i) for i in rowHeads]
colComps = [complement(i) for i in colHeads]
fwd = fastScoreMatrix(rowHeads, colHeads, matrix, deleteCost, insertCost)
rev = fastScoreMatrix(rowComps, colComps, matrix, deleteCost, insertCost)
return fwd, rev
def isGoodAlignment(columns, scoreMatrix, delOpenCost, insOpenCost, minScore):
"""Does the alignment have a segment with score >= minScore?"""
r, q = seqs
score = 0
xOld = " "
yOld = " "
for x, y in itertools.izip(r, q):
xOld = yOld = " "
for x, y in columns:
score += scoreMatrix[ord(x)][ord(y)]
if score >= minScore: return True
if x == "-" and xOld != "-": score -= insOpenCost
if y == "-" and yOld != "-": score -= delOpenCost
if score < 0: score = 0
if score >= minScore:
return True
if x == "-" and xOld != "-":
score -= insOpenCost
if y == "-" and yOld != "-":
score -= delOpenCost
if score < 0:
score = 0
xOld = x
yOld = y
return False
def printIfGood(maf, seqs, scoreMatrix, delOpenCost, insOpenCost, minScore):
if isGoodAlignment(seqs, scoreMatrix, delOpenCost, insOpenCost, minScore):
cols = zip(*seqs)
if isGoodAlignment(cols, scoreMatrix, delOpenCost, insOpenCost, minScore):
for line in maf:
print line,
print
print(line, end="")
print()
def doOneFile(lines):
aDel = bDel = aIns = bIns = minScore = matrices = None
strandParam = 0
scoreMatrix = []
rowHeads = []
colHeads = []
maf = []
seqs = []
for line in lines:
if line[0] == "#":
print line,
w = line.split()
for i in w:
print(line, end="")
fields = line.split()
for i in fields:
if i.startswith("a="): aDel = int(i[2:])
if i.startswith("b="): bDel = int(i[2:])
if i.startswith("A="): aIns = int(i[2:])
if i.startswith("B="): bIns = int(i[2:])
if i.startswith("e="): minScore = int(i[2:])
if len(w) > 1 and max(map(len, w)) == 1:
colHeads = w[1:]
rowHeads = []
matrix = []
elif len(w) > 2 and len(w[1]) == 1:
rowHeads.append(w[1])
matrix.append(map(int, w[2:]))
if i.startswith("S="): strandParam = int(i[2:])
if not colHeads and len(fields) > 1 and max(map(len, fields)) == 1:
colHeads = fields[1:]
elif len(fields) > 2 and len(fields[1]) == 1:
rowHeads.append(fields[1])
scoreMatrix.append([int(i) for i in fields[2:]])
elif line.isspace():
if seqs: printIfGood(maf, seqs, scoreMatrix, aDel, aIns, minScore)
if seqs: printIfGood(maf, seqs, matrix, aDel, aIns, minScore)
maf = []
seqs = []
else:
if not scoreMatrix:
scoreMatrix = getScoreMatrix(rowHeads, colHeads, matrix,
bDel, bIns)
maf.append(line)
if line[0] == "s": seqs.append(line.split()[6])
if seqs: printIfGood(maf, seqs, scoreMatrix, aDel, aIns, minScore)
if line[0] == "s":
if not matrices:
if None in (aDel, bDel, aIns, bIns, minScore):
raise Exception("can't read alignment header")
matrices = matrixPerStrand(rowHeads, colHeads,
scoreMatrix, bDel, bIns)
fields = line.split()
if len(seqs) == strandParam:
strand = fields[4]
matrix = matrices[strand == "-"]
seqs.append(fields[6])
if seqs: printIfGood(maf, seqs, matrix, aDel, aIns, minScore)
def lastPostmask(args):
if not args:
......@@ -116,6 +151,6 @@ if __name__ == "__main__":
try: lastPostmask(args)
except KeyboardInterrupt: pass # avoid silly error message
except Exception, e:
except Exception as e:
prog = os.path.basename(sys.argv[0])
sys.exit(prog + ": error: " + str(e))
#! /usr/bin/env python
# Copyright 2015 Martin C. Frith
from __future__ import print_function
import gzip
import math, optparse, os, random, signal, subprocess, sys, tempfile
......@@ -157,7 +159,7 @@ def countsFromLastOutput(lines, opts):
if line[0] == "s":
strand = line.split()[4] # slow?
if line[0] == "c":
c = map(float, line.split()[1:])
c = [float(i) for i in line.split()[1:]]
if not matrix:
matrixSize = int(math.sqrt(len(c) - 10))
matrix = [[1.0] * matrixSize for i in range(matrixSize)]
......@@ -236,14 +238,16 @@ def matProbsFromCounts(counts, opts):
total = sum(map(sum, counts))
probs = [[j / total for j in i] for i in counts]
print "# substitution percent identity: %g" % (100 * identities / total)
print
print "# count matrix (query letters = columns, reference letters = rows):"
print("# substitution percent identity: %g" % (100 * identities / total))
print()
print("# count matrix "
"(query letters = columns, reference letters = rows):")
writeCountMatrix(sys.stdout, counts, "# ")
print
print "# probability matrix (query letters = columns, reference letters = rows):"
print()
print("# probability matrix "
"(query letters = columns, reference letters = rows):")
writeProbMatrix(sys.stdout, probs, "# ")
print
print()
return probs
......@@ -264,19 +268,19 @@ def gapProbsFromCounts(counts, opts):
delExtendProb = (deletes - delOpens) / deletes
insExtendProb = (inserts - insOpens) / inserts
print "# aligned letter pairs:", matches
print "# deletes:", deletes
print "# inserts:", inserts
print "# delOpens:", delOpens
print "# insOpens:", insOpens
print "# alignments:", alignments
print "# mean delete size: %g" % (deletes / delOpens)
print "# mean insert size: %g" % (inserts / insOpens)
print "# delOpenProb: %g" % delOpenProb
print "# insOpenProb: %g" % insOpenProb
print "# delExtendProb: %g" % delExtendProb
print "# insExtendProb: %g" % insExtendProb
print
print("# aligned letter pairs:", matches)
print("# deletes:", deletes)
print("# inserts:", inserts)
print("# delOpens:", delOpens)
print("# insOpens:", insOpens)
print("# alignments:", alignments)
print("# mean delete size: %g" % (deletes / delOpens))
print("# mean insert size: %g" % (inserts / insOpens))
print("# delOpenProb: %g" % delOpenProb)
print("# insOpenProb: %g" % insOpenProb)
print("# delExtendProb: %g" % delExtendProb)
print("# insExtendProb: %g" % insExtendProb)
print()
delCloseProb = 1 - delExtendProb
insCloseProb = 1 - insExtendProb
......@@ -301,8 +305,8 @@ def scoreFromLetterProbs(scale, pairProb, prob1, prob2):
return scoreFromProb(scale, probRatio)
def matScoresFromProbs(scale, probs):
rowProbs = map(sum, probs)
colProbs = map(sum, zip(*probs))
rowProbs = [sum(i) for i in probs]
colProbs = [sum(i) for i in zip(*probs)]
return [[scoreFromLetterProbs(scale, j, x, y) for j, y in zip(i, colProbs)]
for i, x in zip(probs, rowProbs)]
......@@ -328,17 +332,17 @@ def writeGapCosts(gapCosts, out):
def printGapCosts(gapCosts):
delExistCost, insExistCost, delExtendCost, insExtendCost = gapCosts
print "# delExistCost:", delExistCost
print "# insExistCost:", insExistCost
print "# delExtendCost:", delExtendCost
print "# insExtendCost:", insExtendCost
print
print("# delExistCost:", delExistCost)
print("# insExistCost:", insExistCost)
print("# delExtendCost:", delExtendCost)
print("# insExtendCost:", insExtendCost)
print()
def tryToMakeChildProgramsFindable():
myDir = os.path.dirname(__file__)
srcDir = os.path.join(myDir, os.pardir, "src")
# put srcDir first, to avoid getting older versions of LAST:
os.environ["PATH"] = srcDir + os.pathsep + os.environ["PATH"]
d = os.path.dirname(__file__)
e = os.path.join(d, os.pardir, "src")
# put them first, to avoid getting older versions of LAST:
os.environ["PATH"] = d + os.pathsep + e + os.pathsep + os.environ["PATH"]
def readLastalProgName(lastdbIndexName):
bitsPerInt = "32"
......@@ -377,8 +381,11 @@ def doTraining(opts, args):
if opts.B: x.append("-B" + opts.B)
x += args
y = ["last-split", "-n"]
z = ["last-postmask"]
p = subprocess.Popen(x, stdout=subprocess.PIPE)
q = subprocess.Popen(y, stdin=p.stdout, stdout=subprocess.PIPE)
if opts.postmask:
q = subprocess.Popen(z, stdin=q.stdout, stdout=subprocess.PIPE)
lastalVersion = versionFromHeader(q.stdout)
externalScale = scaleFromHeader(q.stdout)
internalScale = externalScale * scaleIncrease
......@@ -387,15 +394,15 @@ def doTraining(opts, args):
internalMatrix = scaledMatrix(externalMatrix, scaleIncrease)
oldParameters = []
print "# lastal version:", lastalVersion
print "# maximum percent identity:", opts.pid
print "# scale of score parameters:", externalScale
print "# scale used while training:", internalScale
print
print("# lastal version:", lastalVersion)
print("# maximum percent identity:", opts.pid)
print("# scale of score parameters:", externalScale)
print("# scale used while training:", internalScale)
print()
while True:
print "#", " ".join(x)
print
print("#", " ".join(x))
print()
sys.stdout.flush()
matCounts, gapCounts = countsFromLastOutput(q.stdout, opts)
gapProbs = gapProbsFromCounts(gapCounts, opts)
......@@ -407,9 +414,10 @@ def doTraining(opts, args):
else:
matProbs = matProbsFromCounts(matCounts, opts)
matScores = matScoresFromProbs(internalScale, matProbs)
print "# score matrix (query letters = columns, reference letters = rows):"
print("# score matrix "
"(query letters = columns, reference letters = rows):")
writeScoreMatrix(sys.stdout, matScores, "# ")
print
print()
parameters = gapCosts, matScores
if parameters in oldParameters: break
oldParameters.append(parameters)
......@@ -423,6 +431,8 @@ def doTraining(opts, args):
p.stdin.close()
# in python2.6, the next line must come after p.stdin.close()
q = subprocess.Popen(y, stdin=p.stdout, stdout=subprocess.PIPE)
if opts.postmask:
q = subprocess.Popen(z, stdin=q.stdout, stdout=subprocess.PIPE)
gapCosts = gapCostsFromProbs(externalScale, gapProbs)
writeGapCosts(gapCosts, sys.stdout)
......@@ -431,7 +441,8 @@ def doTraining(opts, args):
if not opts.Q:
matScores = matScoresFromProbs(externalScale, matProbs)
externalMatrix = matrixWithLetters(matScores)
print "# score matrix (query letters = columns, reference letters = rows):"
print("# score matrix "
"(query letters = columns, reference letters = rows):")
writeMatrixWithLetters(sys.stdout, externalMatrix, "")
def lastTrain(opts, args):
......@@ -464,6 +475,8 @@ if __name__ == "__main__":
help="force insertion/deletion symmetry")
og.add_option("--pid", type="float", default=100, help=
"skip alignments with > PID% identity (default: %default)")
og.add_option("--postmask", type="int", metavar="NUMBER", default=1, help=
"skip mostly-lowercase alignments (default=%default)")
og.add_option("--sample-number", type="int", default=500, metavar="N",
help="number of random sequence samples (default: %default)")
og.add_option("--sample-length", type="int", default=2000, metavar="L",
......@@ -518,6 +531,6 @@ if __name__ == "__main__":
try: lastTrain(opts, args)
except KeyboardInterrupt: pass # avoid silly error message
except Exception, e:
except Exception as e:
prog = os.path.basename(sys.argv[0])
sys.exit(prog + ": error: " + str(e))
......@@ -6,6 +6,8 @@
# By "MAF" we mean "multiple alignment format" described in the UCSC
# Genome FAQ, not e.g. "MIRA assembly format".
from __future__ import print_function
from itertools import *
import gzip
import math, optparse, os, signal, sys
......@@ -126,7 +128,7 @@ def mafInput(opts, lines):
elif line[0] == "#":
updateEvalueParameters(opts, line)
if opts.isKeepComments:
print line,
print(line, end="")
if sLines: yield aLine, sLines, qLines, pLines
def isJoinable(opts, oldMaf, newMaf):
......@@ -178,10 +180,10 @@ def writeAxt(maf):
if score:
outWords.append(score)
print " ".join(outWords)
print(" ".join(outWords))
for i in sLines:
print i[6]
print # print a blank line at the end
print(i[6])
print() # print a blank line at the end
def mafConvertToAxt(opts, lines):
for maf in mafInput(opts, lines):
......@@ -212,7 +214,7 @@ def writeTab(maf):
gapWord = ",".join(matchAndInsertSizes(alignmentColumns, letterSizes))
outWords.append(gapWord)
print "\t".join(outWords + endWords)
print("\t".join(outWords + endWords))
def mafConvertToTab(opts, lines):
for maf in mafInput(opts, lines):
......@@ -236,7 +238,7 @@ def writeChain(maf):
outWords.append(str(next(axtCounter) + 1))
print " ".join(outWords)
print(" ".join(outWords))
letterSizes = [i[3] for i in sLines]
rows = [i[6] for i in sLines]
......@@ -244,11 +246,11 @@ def writeChain(maf):
size = "0"
for i in matchAndInsertSizes(alignmentColumns, letterSizes):
if ":" in i:
print size + "\t" + i.replace(":", "\t")
print(size + "\t" + i.replace(":", "\t"))
size = "0"
else:
size = i
print size
print(size)
def mafConvertToChain(opts, lines):
for maf in mafInput(opts, lines):
......@@ -358,7 +360,7 @@ def writePsl(opts, mafs):
seqNameB, seqLenB, begB, endB, seqNameA, seqLenA, begA, endA,
blockCount, blockSizes, blockStartsB, blockStartsA)
print "\t".join(map(str, outWords))
print("\t".join(map(str, outWords)))
def mafConvertToPsl(opts, lines):
if opts.join:
......@@ -409,12 +411,12 @@ def copyDictFile(lines):
break
def writeSamHeader(opts, fileNames):
print "@HD\tVN:1.3\tSO:unsorted"
print("@HD\tVN:1.3\tSO:unsorted")
if opts.dictionary:
sequenceLengths = dict(readSequenceLengths(fileNames))
for k in sorted(sequenceLengths, key=karyotypicSortKey):
print "@SQ\tSN:%s\tLN:%s" % (k, sequenceLengths[k])
print("@SQ\tSN:%s\tLN:%s" % (k, sequenceLengths[k]))
if opts.dictfile:
f = myOpen(opts.dictfile)
......@@ -422,7 +424,7 @@ def writeSamHeader(opts, fileNames):
f.close()
if opts.readgroup:
print "@RG\t" + "\t".join(opts.readgroup.split())
print("@RG\t" + "\t".join(opts.readgroup.split()))
mapqMissing = "255"
mapqMaximum = "254"
......@@ -537,7 +539,7 @@ def writeSam(readGroup, maf):
if score: out.append(score)
if evalue: out.append(evalue)
if readGroup: out.append(readGroup)
print "\t".join(out)
print("\t".join(out))
def mafConvertToSam(opts, lines):
readGroup = ""
......@@ -598,13 +600,13 @@ def writeBlast(opts, maf, oldQueryName):
seqNameB, seqLenB, strandB, letterSizeB, begB, endB, rowB = fieldsB
if seqNameB != oldQueryName:
print "Query= " + seqNameB
print " (%s letters)" % seqLenB
print
print("Query= " + seqNameB)
print(" (%s letters)" % seqLenB)
print()
print ">" + seqNameA
print " Length = %s" % seqLenA
print
print(">" + seqNameA)
print(" Length = %s" % seqLenA)
print()
score, evalue = scoreAndEvalue(aLine)
......@@ -617,7 +619,7 @@ def writeBlast(opts, maf, oldQueryName):
if evalue:
scoreLine += ", Expect = %s" % evalue
print scoreLine
print(scoreLine)
alignmentColumns = zip(rowA, rowB)
......@@ -629,19 +631,19 @@ def writeBlast(opts, maf, oldQueryName):
if gaps:
gapPercent = 100 * gaps // alnSize # round down, like BLAST
identLine += ", Gaps = %s/%s (%s%%)" % (gaps, alnSize, gapPercent)
print identLine
print(identLine)
print " Strand = %s / %s" % (strandText(strandB), strandText(strandA))
print
print(" Strand = %s / %s" % (strandText(strandB), strandText(strandA)))
print()
for chunk in blastChunker(sLines, opts.linesize, alignmentColumns):
cols, rows, begs, ends = chunk
begWidth = maxlen(begs)
matchSymbols = ''.join(map(pairwiseMatchSymbol, cols))
print "Query: %-*s %s %s" % (begWidth, begs[1], rows[1], ends[1])
print " %-*s %s" % (begWidth, " ", matchSymbols)
print "Sbjct: %-*s %s %s" % (begWidth, begs[0], rows[0], ends[0])
print
print("Query: %-*s %s %s" % (begWidth, begs[1], rows[1], ends[1]))
print(" %-*s %s" % (begWidth, " ", matchSymbols))
print("Sbjct: %-*s %s %s" % (begWidth, begs[0], rows[0], ends[0]))
print()
def mafConvertToBlast(opts, lines):
oldQueryName = ""
......@@ -682,7 +684,7 @@ def writeBlastTab(opts, maf):
bitScore = opts.bitScoreA * float(score) - opts.bitScoreB
out.append("%.3g" % bitScore)
print "\t".join(out)
print("\t".join(out))
def mafConvertToBlastTab(opts, lines):
for maf in mafInput(opts, lines):
......@@ -691,7 +693,7 @@ def mafConvertToBlastTab(opts, lines):
##### Routines for converting to HTML format: #####
def writeHtmlHeader():
print '''
print('''
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html lang="en"><head>
......@@ -718,7 +720,7 @@ pre {font-family: "Courier New", monospace, serif; font-size: 0.8125em}
<pre class="key"><span class="e"> </span> prob &gt; 0.5 </pre>
<pre class="key"><span class="f"> </span> prob &le; 0.5 </pre>
</div>
'''
''')
def probabilityClass(probabilityColumn):
p = ord(min(probabilityColumn)) - 33
......@@ -756,7 +758,7 @@ def writeHtml(opts, maf):
scoreLine += " score=" + score
if evalue:
scoreLine += ", expect=" + evalue
print "<h3>%s:</h3>" % scoreLine
print("<h3>%s:</h3>" % scoreLine)
if pLines:
probRows = [i.split()[1] for i in pLines]
......@@ -770,7 +772,7 @@ def writeHtml(opts, maf):
rows = [i[6] for i in sLines]
alignmentColumns = zip(*rows)
print '<pre>'
print('<pre>')
for chunk in blastChunker(sLines, opts.linesize, alignmentColumns):
cols, rows, begs, ends = chunk
begWidth = maxlen(begs)
......@@ -782,10 +784,10 @@ def writeHtml(opts, maf):
spans = [htmlSpan(r, i) for i in classRuns]
spans = ''.join(spans)
formatParams = nameWidth, n, begWidth, b, spans, endWidth, e
print '%-*s %*s %s %*s' % formatParams
print ' ' * nameWidth, ' ' * begWidth, matchSymbols
print
print '</pre>'
print('%-*s %*s %s %*s' % formatParams)
print(' ' * nameWidth, ' ' * begWidth, matchSymbols)
print()
print('</pre>')
def mafConvertToHtml(opts, lines):
for maf in mafInput(opts, lines):
......@@ -841,7 +843,7 @@ def mafConvert(opts, args):
if not opts.noheader:
if isFormat(formatName, "html"):
print "</body></html>"
print("</body></html>")
if __name__ == "__main__":
signal.signal(signal.SIGPIPE, signal.SIG_DFL) # avoid silly error message
......@@ -882,6 +884,6 @@ if __name__ == "__main__":
op.error("need file (not pipe) with option -d")
try: mafConvert(opts, args)
except Exception, e:
except Exception as e:
prog = os.path.basename(sys.argv[0])
sys.exit(prog + ": error: " + str(e))
......@@ -10,6 +10,8 @@
# WARNING: Alignment columns with a gap in the top genome are joined
# arbitrarily!!!
from __future__ import print_function
import sys, os, fileinput, optparse, signal
signal.signal(signal.SIGPIPE, signal.SIG_DFL) # stop spurious error message
......@@ -70,16 +72,16 @@ class MafBlock:
seq = [''.join(i) for i in self.seq]
columns = self.chr, beg, size, self.strand, self.chrSize, seq
widths = map(maxLen, columns)
print 'a'
print('a')
for row in zip(*columns):
widthsAndFields = zip(widths, row)
field0 = "%-*s" % widthsAndFields[0] # left-justify
fields = ["%*s" % i for i in widthsAndFields[1:]] # right-justify
print 's', field0, ' '.join(fields)
print('s', field0, ' '.join(fields))
pad = ' '.join(' ' * i for i in widths[:-1])
for i in self.prob:
print 'p', pad, i
print # blank line afterwards
print('p', pad, i)
print() # blank line afterwards
def topSeqBeg(maf): return maf.beg[0]
def topSeqEnd(maf): return maf.end[0]
......
......@@ -9,11 +9,13 @@
# Seems to work with Python 2.x, x>=4
from __future__ import print_function
import fileinput, itertools, optparse, os, signal, string, sys
def filterComments(lines):
for i in lines:
if i.startswith("#"): print i,
if i.startswith("#"): print(i, end="")
else: yield i
def mafInput(lines):
......@@ -119,8 +121,8 @@ def mafSwap(opts, args):
mafLines[1:stop] = mafLines[start:stop] + mafLines[1:start]
if not isCanonicalStrand(mafLines[1]):
mafLines = flippedMaf(mafLines)
for i in mafLines: print i,
print # blank line after each alignment
for i in mafLines: print(i, end="")
print() # blank line after each alignment
if __name__ == "__main__":
signal.signal(signal.SIGPIPE, signal.SIG_DFL) # avoid silly error message
......@@ -135,6 +137,6 @@ if __name__ == "__main__":
try: mafSwap(opts, args)
except KeyboardInterrupt: pass # avoid silly error message
except Exception, e:
except Exception as e:
prog = os.path.basename(sys.argv[0])
sys.exit(prog + ": error: " + str(e))
......@@ -41,15 +41,14 @@ static char parseOutputFormat( const char* text ){
return 0;
}
static int parseScoreOrPercent(const std::string &s) {
static int parseScoreAndSuffix(const std::string &s, char &suffix) {
char c = 0;
int x;
std::istringstream iss(s);
if(!(iss >> x) || x < 0) ERR("bad value: " + s);
char c;
if (iss >> c) {
if (c != '%' || iss >> c) ERR("bad value: " + s);
x = -x; // negative value indicates percentage (kludge)
std::istringstream i(s);
if (!(i >> x) || x < 0 || (i >> c && i >> c)) {
ERR("bad value: " + s);
}
suffix = c;
return x;
}
......@@ -78,9 +77,11 @@ LastalArguments::LastalArguments() :
gapPairCost(-1), // this means: OFF
frameshiftCost(-1), // this means: ordinary, non-translated alignment
matrixFile(""),
maxDropGapped(-100), // depends on minScoreGapped & maxDropGapless
maxDropGapped(100), // depends on maxDropFinal
maxDropGappedSuffix('%'),
maxDropGapless(-1), // depends on the score matrix
maxDropFinal(-1), // depends on maxDropGapped
maxDropFinal(100), // depends on minScoreGapped
maxDropFinalSuffix('%'),
inputFormat(sequenceFormat::fasta),
minHitDepth(1),
maxHitDepth(-1),
......@@ -128,9 +129,9 @@ Score options (default settings):\n\
-B: insertion extension cost (b)\n\
-c: unaligned residue pair cost (off)\n\
-F: frameshift cost (off)\n\
-x: maximum score drop for gapped alignments (e-1)\n\
-x: maximum score drop for preliminary gapped alignments (z)\n\
-y: maximum score drop for gapless alignments (min[t*10, x])\n\
-z: maximum score drop for final gapped alignments (x)\n\
-z: maximum score drop for final gapped alignments (e-1)\n\
-d: minimum score for gapless alignments (min[e, t*ln(1000*refSize/n)])\n\
-e: minimum score for gapped alignments\n\
\n\
......@@ -234,15 +235,14 @@ LAST home page: http://last.cbrc.jp/\n\
if( frameshiftCost < 0 ) badopt( c, optarg );
break;
case 'x':
maxDropGapped = parseScoreOrPercent(optarg);
maxDropGapped = parseScoreAndSuffix(optarg, maxDropGappedSuffix);
break;
case 'y':
unstringify( maxDropGapless, optarg );
if( maxDropGapless < 0 ) badopt( c, optarg );
break;
case 'z':
unstringify( maxDropFinal, optarg );
if( maxDropFinal < 0 ) badopt( c, optarg );
maxDropFinal = parseScoreAndSuffix(optarg, maxDropFinalSuffix);
break;
case 'd':
unstringify( minScoreGapless, optarg );
......@@ -523,6 +523,10 @@ void LastalArguments::setDefaultsFromAlphabet( bool isDna, bool isProtein,
}
}
static int percent(int val, int percentage) {
return (percentage == 100) ? val : val * percentage / 100;
}
void LastalArguments::setDefaultsFromMatrix( double lambda, int minScore ){
if( outputType < 2 && minScoreGapped < 0 ) minScoreGapped = minScoreGapless;
if( minScoreGapped < 0 ){
......@@ -536,11 +540,19 @@ To proceed without E-values, set a score threshold with option -e.");
if( temperature < 0 ) temperature = 1 / lambda;
if( maxDropGapped < 0 ){
int percent = -maxDropGapped;
maxDropGapped = std::max( minScoreGapped - 1, 0 );
maxDropGapped = std::max( maxDropGapped, maxDropGapless );
if (percent != 100) maxDropGapped = maxDropGapped * percent / 100;
int defaultMaxScoreDrop = std::max(minScoreGapped - 1, 0);
defaultMaxScoreDrop = std::max(defaultMaxScoreDrop, maxDropGapless);
if (maxDropFinalSuffix == '%') {
maxDropFinal = percent(defaultMaxScoreDrop, maxDropFinal);
} else if (maxDropFinalSuffix == 'g') {
maxDropFinal = std::min(minGapCost(maxDropFinal), defaultMaxScoreDrop);
}
if (maxDropGappedSuffix == '%') {
maxDropGapped = percent(maxDropFinal, maxDropGapped);
} else if (maxDropGappedSuffix == 'g') {
maxDropGapped = std::min(minGapCost(maxDropGapped), maxDropFinal);
}
if( maxDropGapless < 0 ){ // should it depend on temperature or lambda?
......@@ -548,8 +560,6 @@ To proceed without E-values, set a score threshold with option -e.");
else maxDropGapless = int( 10.0 * temperature + 0.5 );
maxDropGapless = std::min( maxDropGapless, maxDropGapped );
}
if( maxDropFinal < 0 ) maxDropFinal = maxDropGapped;
}
int LastalArguments::calcMinScoreGapless( double numLettersInReference,
......
......@@ -56,6 +56,11 @@ struct LastalArguments{
// how many strands are we scanning (1 or 2)?
int numOfStrands() const{ return (strand == 2) ? 2 : 1; }
int minGapCost(int gapLength) const {
return std::min(gapExistCost + gapLength * gapExtendCost,
insExistCost + gapLength * insExtendCost);
}
// options:
int outputFormat;
int outputType;
......@@ -80,8 +85,10 @@ struct LastalArguments{
int frameshiftCost;
std::string matrixFile;
int maxDropGapped;
char maxDropGappedSuffix;
int maxDropGapless;
int maxDropFinal;
char maxDropFinalSuffix;
sequenceFormat::Enum inputFormat;
size_t minHitDepth;
size_t maxHitDepth;
......