Skip to content
Commits on Source (5)
2018-12-10 Martin C. Frith <Martin C. Frith>
* doc/last-dotplot.txt, scripts/last-dotplot:
Add last-dotplot --maxseqs option
[ed0fb9b1eb40] [tip]
* src/split/last-split.cc:
Make last-split -n keep E-values
[2063a5a7d7ac]
2018-10-30 Martin C. Frith <Martin C. Frith>
* doc/last-train.txt, doc/lastal.txt, doc/lastdb.txt, scripts/last-
train, src/LastalArguments.cc, src/LastdbArguments.cc,
src/MultiSequence.hh, src/MultiSequenceQual.cc,
src/SequenceFormat.hh, src/last.hh, test/last-test.out, test/last-
test.sh:
Add fastq-ignore option
[c551d04a3928]
2018-10-29 Martin C. Frith <Martin C. Frith>
* src/LastalArguments.cc, src/SequenceFormat.hh,
src/SubsetSuffixArray.cc, src/lastal.cc, src/lastdb.cc:
Refactor
[96b45da7bf0f]
* scripts/last-dotplot, scripts/last-map-probs, scripts/last-postmask,
scripts/last-train, scripts/maf-convert, scripts/maf-cut, scripts
/maf-swap:
Fix gz reading for Python3
[c54672c8892e]
2018-09-10 Martin C. Frith <Martin C. Frith>
* scripts/last-map-probs:
Python3-ify last-map-probs
[83191e47259e] [tip]
[83191e47259e]
2018-09-07 Martin C. Frith <Martin C. Frith>
......
last-align (956-2) UNRELEASED; urgency=medium
last-align (961-1) unstable; urgency=medium
[ Jelmer Vernooij ]
* Use secure copyright file specification URI.
* Trim trailing whitespace.
-- Jelmer Vernooij <jelmer@debian.org> Wed, 24 Oct 2018 22:46:10 +0000
[ Andreas Tille ]
* New upstream version
* Remove trailing whitespace in debian/rules
-- Andreas Tille <tille@debian.org> Mon, 17 Dec 2018 11:13:03 +0100
last-align (956-1) unstable; urgency=medium
......
......@@ -335,62 +335,6 @@ last-dotplot alns alns.gif
set of sequences. This document calls a &quot;set of sequences&quot; a
&quot;genome&quot;, though it need not actually be a genome.</p>
</div>
<div class="section" id="choosing-sequences">
<h2>Choosing sequences</h2>
<p>If there are too many sequences, the dotplot will be very cluttered,
or the script might give up with an error message. You can exclude
sequences with names like &quot;chrUn_random522&quot; like this:</p>
<pre class="literal-block">
last-dotplot -1 'chr[!U]*' -2 'chr[!U]*' alns alns.png
</pre>
<p>Option &quot;-1&quot; selects sequences from the 1st (horizontal) genome, and
&quot;-2&quot; selects sequences from the 2nd (vertical) genome. 'chr[!U]*' is
a <em>pattern</em> that specifies names starting with &quot;chr&quot;, followed by any
character except U, followed by anything.</p>
<table border="1" class="docutils">
<colgroup>
<col width="26%" />
<col width="74%" />
</colgroup>
<tbody valign="top">
<tr><td>Pattern</td>
<td>Meaning</td>
</tr>
<tr><td><tt class="docutils literal">*</tt></td>
<td>zero or more of any character</td>
</tr>
<tr><td><tt class="docutils literal">?</tt></td>
<td>any single character</td>
</tr>
<tr><td><tt class="docutils literal">[abc]</tt></td>
<td>any character in abc</td>
</tr>
<tr><td><tt class="docutils literal">[!abc]</tt></td>
<td>any character not in abc</td>
</tr>
</tbody>
</table>
<p>If a sequence name has a dot (e.g. &quot;hg19.chr7&quot;), the pattern is
compared to both the whole name and the part after the dot.</p>
<p>You can specify more than one pattern, e.g. this gets sequences with
names starting in &quot;chr&quot; followed by one or two characters:</p>
<pre class="literal-block">
last-dotplot -1 'chr?' -1 'chr??' alns alns.png
</pre>
<p>You can also specify a sequence range; for example this gets the first
1000 bases of chr9:</p>
<pre class="literal-block">
last-dotplot -1 chr9:0-1000 alns alns.png
</pre>
</div>
<div class="section" id="fonts">
<h2>Fonts</h2>
<p>last-dotplot tries to find a nice font on your computer, but may fail
and use an ugly font. You can specify a font like this:</p>
<pre class="literal-block">
last-dotplot -f /usr/share/fonts/liberation/LiberationSans-Regular.ttf alns alns.png
</pre>
</div>
<div class="section" id="options">
<h2>Options</h2>
<blockquote>
......@@ -405,18 +349,23 @@ last-dotplot -f /usr/share/fonts/liberation/LiberationSans-Regular.ttf alns alns
<kbd><span class="option">-v</span>, <span class="option">--verbose</span></kbd></td>
<td>Show progress messages &amp; data about the plot.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-x <var>INT</var></span>, <span class="option">--width=<var>INT</var></span></kbd></td>
<td>Maximum width in pixels.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-y <var>INT</var></span>, <span class="option">--height=<var>INT</var></span></kbd></td>
<td>Maximum height in pixels.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-m <var>M</var></span>, <span class="option">--maxseqs=<var>M</var></span></kbd></td>
<td>Maximum number of horizontal or vertical sequences. If there
are &gt;M sequences, the smallest ones (after cutting) will be
discarded.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-1 <var>PATTERN</var></span>, <span class="option">--seq1=<var>PATTERN</var></span></kbd></td>
<td>Which sequences to show from the 1st (horizontal) genome.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-2 <var>PATTERN</var></span>, <span class="option">--seq2=<var>PATTERN</var></span></kbd></td>
<td>Which sequences to show from the 2nd (vertical) genome.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-x <var>WIDTH</var></span>, <span class="option">--width=<var>WIDTH</var></span></kbd></td>
<td>Maximum width in pixels.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-y <var>HEIGHT</var></span>, <span class="option">--height=<var>HEIGHT</var></span></kbd></td>
<td>Maximum height in pixels.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-c <var>COLOR</var></span>, <span class="option">--forwardcolor=<var>COLOR</var></span></kbd></td>
<td>Color for forward alignments.</td></tr>
<tr><td class="option-group">
......@@ -602,6 +551,65 @@ of unknown sequence.</p>
pixel.</p>
</div>
</div>
<div class="section" id="choosing-sequences">
<h2>Choosing sequences</h2>
<p>For example, you can exclude sequences with names like
&quot;chrUn_random522&quot; like this:</p>
<pre class="literal-block">
last-dotplot -1 'chr[!U]*' -2 'chr[!U]*' alns alns.png
</pre>
<p>Option &quot;-1&quot; selects sequences from the 1st (horizontal) genome, and
&quot;-2&quot; selects sequences from the 2nd (vertical) genome. 'chr[!U]*' is
a <em>pattern</em> that specifies names starting with &quot;chr&quot;, followed by any
character except U, followed by anything.</p>
<table border="1" class="docutils">
<colgroup>
<col width="26%" />
<col width="74%" />
</colgroup>
<tbody valign="top">
<tr><td>Pattern</td>
<td>Meaning</td>
</tr>
<tr><td><tt class="docutils literal">*</tt></td>
<td>zero or more of any character</td>
</tr>
<tr><td><tt class="docutils literal">?</tt></td>
<td>any single character</td>
</tr>
<tr><td><tt class="docutils literal">[abc]</tt></td>
<td>any character in abc</td>
</tr>
<tr><td><tt class="docutils literal">[!abc]</tt></td>
<td>any character not in abc</td>
</tr>
</tbody>
</table>
<p>If a sequence name has a dot (e.g. &quot;hg19.chr7&quot;), the pattern is
compared to both the whole name and the part after the dot.</p>
<p>You can specify more than one pattern, e.g. this gets sequences with
names starting in &quot;chr&quot; followed by one or two characters:</p>
<pre class="literal-block">
last-dotplot -1 'chr?' -1 'chr??' alns alns.png
</pre>
<p>You can also specify a sequence range; for example this gets the first
1000 bases of chr9:</p>
<pre class="literal-block">
last-dotplot -1 chr9:0-1000 alns alns.png
</pre>
</div>
<div class="section" id="text-font">
<h2>Text font</h2>
<p>You can improve the font quality by increasing its size, e.g. to 20
points:</p>
<blockquote>
last-dotplot -s20 my-alignments my-plot.png</blockquote>
<p>last-dotplot tries to find a nice font on your computer, but may fail
and use an ugly font. You can specify a font like this:</p>
<pre class="literal-block">
last-dotplot -f /usr/share/fonts/liberation/LiberationSans-Regular.ttf alns alns.png
</pre>
</div>
<div class="section" id="colors">
<h2>Colors</h2>
<p>Colors can be specified in <a class="reference external" href="http://effbot.org/imagingbook/imagecolor.htm">various ways described here</a>.</p>
......
......@@ -19,50 +19,6 @@ last-dotplot shows alignments of one set of sequences against another
set of sequences. This document calls a "set of sequences" a
"genome", though it need not actually be a genome.
Choosing sequences
------------------
If there are too many sequences, the dotplot will be very cluttered,
or the script might give up with an error message. You can exclude
sequences with names like "chrUn_random522" like this::
last-dotplot -1 'chr[!U]*' -2 'chr[!U]*' alns alns.png
Option "-1" selects sequences from the 1st (horizontal) genome, and
"-2" selects sequences from the 2nd (vertical) genome. 'chr[!U]*' is
a *pattern* that specifies names starting with "chr", followed by any
character except U, followed by anything.
========== =============================
Pattern Meaning
---------- -----------------------------
``*`` zero or more of any character
``?`` any single character
``[abc]`` any character in abc
``[!abc]`` any character not in abc
========== =============================
If a sequence name has a dot (e.g. "hg19.chr7"), the pattern is
compared to both the whole name and the part after the dot.
You can specify more than one pattern, e.g. this gets sequences with
names starting in "chr" followed by one or two characters::
last-dotplot -1 'chr?' -1 'chr??' alns alns.png
You can also specify a sequence range; for example this gets the first
1000 bases of chr9::
last-dotplot -1 chr9:0-1000 alns alns.png
Fonts
-----
last-dotplot tries to find a nice font on your computer, but may fail
and use an ugly font. You can specify a font like this::
last-dotplot -f /usr/share/fonts/liberation/LiberationSans-Regular.ttf alns alns.png
Options
-------
......@@ -70,14 +26,18 @@ Options
Show a help message, with default option values, and exit.
-v, --verbose
Show progress messages & data about the plot.
-x INT, --width=INT
Maximum width in pixels.
-y INT, --height=INT
Maximum height in pixels.
-m M, --maxseqs=M
Maximum number of horizontal or vertical sequences. If there
are >M sequences, the smallest ones (after cutting) will be
discarded.
-1 PATTERN, --seq1=PATTERN
Which sequences to show from the 1st (horizontal) genome.
-2 PATTERN, --seq2=PATTERN
Which sequences to show from the 2nd (vertical) genome.
-x WIDTH, --width=WIDTH
Maximum width in pixels.
-y HEIGHT, --height=HEIGHT
Maximum height in pixels.
-c COLOR, --forwardcolor=COLOR
Color for forward alignments.
-r COLOR, --reversecolor=COLOR
......@@ -207,6 +167,54 @@ of unknown sequence.
An unsequenced gap will be shown only if it covers at least one whole
pixel.
Choosing sequences
------------------
For example, you can exclude sequences with names like
"chrUn_random522" like this::
last-dotplot -1 'chr[!U]*' -2 'chr[!U]*' alns alns.png
Option "-1" selects sequences from the 1st (horizontal) genome, and
"-2" selects sequences from the 2nd (vertical) genome. 'chr[!U]*' is
a *pattern* that specifies names starting with "chr", followed by any
character except U, followed by anything.
========== =============================
Pattern Meaning
---------- -----------------------------
``*`` zero or more of any character
``?`` any single character
``[abc]`` any character in abc
``[!abc]`` any character not in abc
========== =============================
If a sequence name has a dot (e.g. "hg19.chr7"), the pattern is
compared to both the whole name and the part after the dot.
You can specify more than one pattern, e.g. this gets sequences with
names starting in "chr" followed by one or two characters::
last-dotplot -1 'chr?' -1 'chr??' alns alns.png
You can also specify a sequence range; for example this gets the first
1000 bases of chr9::
last-dotplot -1 chr9:0-1000 alns alns.png
Text font
---------
You can improve the font quality by increasing its size, e.g. to 20
points:
last-dotplot -s20 my-alignments my-plot.png
last-dotplot tries to find a nice font on your computer, but may fail
and use an ugly font. You can specify a font like this::
last-dotplot -f /usr/share/fonts/liberation/LiberationSans-Regular.ttf alns alns.png
Colors
------
......
......@@ -333,7 +333,7 @@ last-train mydb queries.fasta
<p>last-train prints a summary of each alignment step, followed by the
final score parameters, in a format that can be read by <a class="reference external" href="lastal.html#score-options">lastal's -p
option</a>.</p>
<p>You can also pipe sequences into last-train, for example:</p>
<p>last-train can read .gz files, or from pipes:</p>
<pre class="literal-block">
bzcat queries.fasta.bz2 | last-train mydb
</pre>
......@@ -479,15 +479,21 @@ position in each query.</td></tr>
<td>Number of parallel threads.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-Q <var>NUMBER</var></span></kbd></td>
<td><p class="first">Query sequence format: 0=fasta, 1=fastq-sanger.</p>
<p>last-train assumes that fastq quality codes indicate
substitution error probabilities, <em>not</em> insertion or
deletion error probabilities. If this assumption is
dubious (e.g. for data with many insertion or deletion
errors), it may be better to use fasta.</p>
<p>For fastq, last-train finds the rates of substitutions
not explained by the quality data (ideally, real
substitutions as opposed to errors).</p>
<td><p class="first">How to read the query sequences. By default, they must
be in <tt class="docutils literal">fasta</tt> format. <tt class="docutils literal"><span class="pre">-Q0</span></tt> means <tt class="docutils literal">fasta</tt> or
<tt class="docutils literal"><span class="pre">fastq-ignore</span></tt>. <tt class="docutils literal"><span class="pre">-Q1</span></tt> means <tt class="docutils literal"><span class="pre">fastq-sanger</span></tt>.</p>
<p>The <tt class="docutils literal">fastq</tt> formats are described here:
<a class="reference external" href="lastal.html">lastal.html</a>. <tt class="docutils literal"><span class="pre">fastq-ignore</span></tt> means that the
quality data is ignored, so the results will be the same
as for <tt class="docutils literal">fasta</tt>.</p>
<p>For <tt class="docutils literal"><span class="pre">fastq-sanger</span></tt>, last-train assumes the quality
codes indicate substitution error probabilities, <em>not</em>
insertion or deletion error probabilities. If this
assumption is dubious (e.g. for data with many insertion
or deletion errors), it may be better to use
<tt class="docutils literal"><span class="pre">fastq-ignore</span></tt>. For <tt class="docutils literal"><span class="pre">fastq-sanger</span></tt>, last-train finds
the rates of substitutions not explained by the quality
data (ideally, real substitutions as opposed to errors).</p>
<p class="last">If specified, this parameter is written in last-train's
output, so it will override lastal's default.</p>
</td></tr>
......
......@@ -19,7 +19,7 @@ last-train prints a summary of each alignment step, followed by the
final score parameters, in a format that can be read by `lastal's -p
option <lastal.html#score-options>`_.
You can also pipe sequences into last-train, for example::
last-train can read .gz files, or from pipes::
bzcat queries.fasta.bz2 | last-train mydb
......@@ -102,17 +102,23 @@ Alignment options
-k STEP Look for initial matches starting only at every STEP-th
position in each query.
-P COUNT Number of parallel threads.
-Q NUMBER Query sequence format: 0=fasta, 1=fastq-sanger.
last-train assumes that fastq quality codes indicate
substitution error probabilities, *not* insertion or
deletion error probabilities. If this assumption is
dubious (e.g. for data with many insertion or deletion
errors), it may be better to use fasta.
For fastq, last-train finds the rates of substitutions
not explained by the quality data (ideally, real
substitutions as opposed to errors).
-Q NUMBER How to read the query sequences. By default, they must
be in ``fasta`` format. ``-Q0`` means ``fasta`` or
``fastq-ignore``. ``-Q1`` means ``fastq-sanger``.
The ``fastq`` formats are described here:
`<lastal.html>`_. ``fastq-ignore`` means that the
quality data is ignored, so the results will be the same
as for ``fasta``.
For ``fastq-sanger``, last-train assumes the quality
codes indicate substitution error probabilities, *not*
insertion or deletion error probabilities. If this
assumption is dubious (e.g. for data with many insertion
or deletion errors), it may be better to use
``fastq-ignore``. For ``fastq-sanger``, last-train finds
the rates of substitutions not explained by the quality
data (ideally, real substitutions as opposed to errors).
If specified, this parameter is written in last-train's
output, so it will override lastal's default.
......
......@@ -849,13 +849,18 @@ generalized affine gap costs. They are:</p>
</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-Q <var>NUMBER</var></span></kbd></td>
<td><p class="first">This option allows lastal to use sequence quality scores, or
PSSMs, for the queries. 0 means read queries in fasta format
(without quality scores); 1 means fastq-sanger format; 2 means
fastq-solexa format; 3 means fastq-illumina format; 4 means prb
format; 5 means read PSSMs. (<em>Warning</em>: Illumina data is not
necessarily in fastq-illumina format; it is often in
fastq-sanger format.)</p>
<td><p class="first">Specify how to read the query sequences:</p>
<pre class="literal-block">
Default fasta
0 fasta or fastq-ignore
1 fastq-sanger
2 fastq-solexa
3 fastq-illumina
4 prb
5 PSSM
</pre>
<p><em>Warning</em>: Illumina data is not necessarily in fastq-illumina
format; it is often in fastq-sanger format.</p>
<p>The fastq formats look like this:</p>
<pre class="literal-block">
&#64;mySequenceName
......@@ -863,12 +868,18 @@ TTTTTTTTGCCTCGGGCCTGAGTTCTTAGCCGCG
+
55555555*&amp;5-/55*5//5(55,5#&amp;$)$)*+$
</pre>
<p>The &quot;+&quot; may optionally be followed by a name (ignored), and the
sequence and quality codes are allowed to wrap onto more than
one line. For fastq-sanger, the quality scores are obtained by
subtracting 33 from the ASCII values of the characters below the
&quot;+&quot;. For fastq-solexa and fastq-illumina, they are obtained by
subtracting 64.</p>
<p>The &quot;+&quot; may be followed by text (ignored). The symbols below
the &quot;+&quot; are quality codes, one per sequence letter. The
sequence and quality codes may wrap onto more than one line.</p>
<p>fastq-ignore means that the quality codes are ignored. For the
other fastq variants, lastal assumes the quality codes indicate
substitution error probabilities, <em>not</em> insertion or deletion
error probabilities. If this assumption is dubious (e.g. for
data with many insertion or deletion errors), it may be better
to use fastq-ignore.</p>
<p>For fastq-sanger, quality scores are obtained by subtracting 33
from the ASCII values of the quality codes. For fastq-solexa
and fastq-illumina, they are obtained by subtracting 64.</p>
<p>prb format stores four quality scores (A, C, G, T) per position,
with one sequence per line, like this:</p>
<pre class="literal-block">
......
......@@ -474,13 +474,18 @@ Miscellaneous options
* The count of adjacent pairs of insertions and unaligned letter pairs.
-Q NUMBER
This option allows lastal to use sequence quality scores, or
PSSMs, for the queries. 0 means read queries in fasta format
(without quality scores); 1 means fastq-sanger format; 2 means
fastq-solexa format; 3 means fastq-illumina format; 4 means prb
format; 5 means read PSSMs. (*Warning*: Illumina data is not
necessarily in fastq-illumina format; it is often in
fastq-sanger format.)
Specify how to read the query sequences::
Default fasta
0 fasta or fastq-ignore
1 fastq-sanger
2 fastq-solexa
3 fastq-illumina
4 prb
5 PSSM
*Warning*: Illumina data is not necessarily in fastq-illumina
format; it is often in fastq-sanger format.
The fastq formats look like this::
......@@ -489,12 +494,20 @@ Miscellaneous options
+
55555555*&5-/55*5//5(55,5#&$)$)*+$
The "+" may optionally be followed by a name (ignored), and the
sequence and quality codes are allowed to wrap onto more than
one line. For fastq-sanger, the quality scores are obtained by
subtracting 33 from the ASCII values of the characters below the
"+". For fastq-solexa and fastq-illumina, they are obtained by
subtracting 64.
The "+" may be followed by text (ignored). The symbols below
the "+" are quality codes, one per sequence letter. The
sequence and quality codes may wrap onto more than one line.
fastq-ignore means that the quality codes are ignored. For the
other fastq variants, lastal assumes the quality codes indicate
substitution error probabilities, *not* insertion or deletion
error probabilities. If this assumption is dubious (e.g. for
data with many insertion or deletion errors), it may be better
to use fastq-ignore.
For fastq-sanger, quality scores are obtained by subtracting 33
from the ASCII values of the quality codes. For fastq-solexa
and fastq-illumina, they are obtained by subtracting 64.
prb format stores four quality scores (A, C, G, T) per position,
with one sequence per line, like this::
......
......@@ -457,11 +457,10 @@ about 4 billion.</p>
</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-Q <var>NUMBER</var></span></kbd></td>
<td>Specify the input format. 0 means fasta, 1 means fastq-sanger,
2 means fastq-solexa, and 3 means fastq-illumina. The fastq
formats provide sequence quality data, which will be stored by
lastdb and then used by lastal. These formats are described in
<a class="reference external" href="lastal.html">lastal.html</a>.</td></tr>
<td>Specify how to read the sequences. The default is fasta, 0
means fasta or fastq-ignore, 1 means fastq-sanger, 2 means
fastq-solexa, and 3 means fastq-illumina. The fastq formats are
described in <a class="reference external" href="lastal.html">lastal.html</a>.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-P <var>THREADS</var></span></kbd></td>
<td>Divide the work between this number of threads running in
......
......@@ -125,11 +125,10 @@ Advanced Options
about 4 billion.
-Q NUMBER
Specify the input format. 0 means fasta, 1 means fastq-sanger,
2 means fastq-solexa, and 3 means fastq-illumina. The fastq
formats provide sequence quality data, which will be stored by
lastdb and then used by lastal. These formats are described in
`<lastal.html>`_.
Specify how to read the sequences. The default is fasta, 0
means fasta or fastq-ignore, 1 means fastq-sanger, 2 means
fastq-solexa, and 3 means fastq-illumina. The fastq formats are
described in `<lastal.html>`_.
-P THREADS
Divide the work between this number of threads running in
......
......@@ -13,6 +13,7 @@ import collections
import functools
import gzip
from fnmatch import fnmatchcase
import logging
from operator import itemgetter
import subprocess
import itertools, optparse, os, re, sys
......@@ -34,14 +35,9 @@ def myOpen(fileName): # faster than fileinput
if fileName == "-":
return sys.stdin
if fileName.endswith(".gz"):
return gzip.open(fileName)
return gzip.open(fileName, "rt") # xxx dubious for Python2
return open(fileName)
def warn(message):
if opts.verbose:
prog = os.path.basename(sys.argv[0])
sys.stderr.write(prog + ": " + message + "\n")
def groupByFirstItem(things):
for k, v in itertools.groupby(things, itemgetter(0)):
yield k, [i[1:] for i in v]
......@@ -414,8 +410,8 @@ def dataFromRanges(sortedRanges, font, fontSize, imageMode, labelOpt, textRot):
out = [seqName, str(rangeBeg), str(rangeEnd)]
if strandNum > 0:
out.append(".+-"[strandNum])
warn("\t".join(out))
warn("")
logging.info("\t".join(out))
logging.info("")
rangeSizes = [e - b for n, b, e, s in sortedRanges]
labs = list(rangeLabels(sortedRanges, labelOpt, font, fontSize,
imageMode, textRot))
......@@ -430,7 +426,7 @@ def div_ceil(x, y):
def get_bp_per_pix(rangeSizes, pixTweenRanges, maxPixels):
'''Get the minimum bp-per-pixel that fits in the size limit.'''
warn("choosing bp per pixel...")
logging.info("choosing bp per pixel...")
numOfRanges = len(rangeSizes)
maxPixelsInRanges = maxPixels - pixTweenRanges * (numOfRanges - 1)
if maxPixelsInRanges < numOfRanges:
......@@ -718,18 +714,37 @@ def getFont(opts):
out, err = p.communicate()
fileNames.append(out)
except OSError as e:
warn("fc-match error: " + str(e))
logging.info("fc-match error: " + str(e))
fileNames.append("/Library/Fonts/Arial.ttf") # for Mac
for i in fileNames:
try:
font = ImageFont.truetype(i, opts.fontsize)
warn("font: " + i)
logging.info("font: " + i)
return font
except IOError as e:
warn("font load error: " + str(e))
logging.info("font load error: " + str(e))
return ImageFont.load_default()
def sequenceSizesAndNames(seqRanges):
for seqName, ranges in itertools.groupby(seqRanges, itemgetter(0)):
size = sum(e - b for n, b, e in ranges)
yield size, seqName
def biggestSequences(seqRanges, maxNumOfSequences):
s = sorted(sequenceSizesAndNames(seqRanges), reverse=True)
if len(s) > maxNumOfSequences:
logging.warning("too many sequences - discarding the smallest ones")
s = s[:maxNumOfSequences]
return set(i[1] for i in s)
def remainingSequenceRanges(seqRanges, alignments, seqIndex):
remainingSequences = set(i[seqIndex] for i in alignments)
return [i for i in seqRanges if i[0] in remainingSequences]
def lastDotplot(opts, args):
logLevel = logging.INFO if opts.verbose else logging.WARNING
logging.basicConfig(format="%(filename)s: %(message)s", level=logLevel)
font = getFont(opts)
image_mode = 'RGB'
forward_color = ImageColor.getcolor(opts.forwardcolor, image_mode)
......@@ -740,11 +755,11 @@ def lastDotplot(opts, args):
maxGap1, maxGapB1 = twoValuesFromOption(opts.max_gap1, ":")
maxGap2, maxGapB2 = twoValuesFromOption(opts.max_gap2, ":")
warn("reading alignments...")
logging.info("reading alignments...")
alnData = readAlignments(args[0], opts)
alignments, seqRanges1, coverDict1, seqRanges2, coverDict2 = alnData
if not alignments: raise Exception("there are no alignments")
warn("cutting...")
logging.info("cutting...")
coverDict1 = dict(mergedRangesPerSeq(coverDict1))
coverDict2 = dict(mergedRangesPerSeq(coverDict2))
minAlignedBases = min(coveredLength(coverDict1), coveredLength(coverDict2))
......@@ -754,10 +769,20 @@ def lastDotplot(opts, args):
cutRanges2 = list(trimmed(seqRanges2, coverDict2, minAlignedBases,
maxGap2, pad, pad))
warn("reading secondary alignments...")
biggestSeqs1 = biggestSequences(cutRanges1, opts.maxseqs)
cutRanges1 = [i for i in cutRanges1 if i[0] in biggestSeqs1]
alignments = [i for i in alignments if i[0] in biggestSeqs1]
cutRanges2 = remainingSequenceRanges(cutRanges2, alignments, 1)
biggestSeqs2 = biggestSequences(cutRanges2, opts.maxseqs)
cutRanges2 = [i for i in cutRanges2 if i[0] in biggestSeqs2]
alignments = [i for i in alignments if i[1] in biggestSeqs2]
cutRanges1 = remainingSequenceRanges(cutRanges1, alignments, 0)
logging.info("reading secondary alignments...")
alnDataB = readSecondaryAlignments(opts, cutRanges1, cutRanges2)
alignmentsB, seqRangesB1, coverDictB1, seqRangesB2, coverDictB2 = alnDataB
warn("cutting...")
logging.info("cutting...")
coverDictB1 = dict(mergedRangesPerSeq(coverDictB1))
coverDictB2 = dict(mergedRangesPerSeq(coverDictB2))
cutRangesB1 = trimmed(seqRangesB1, coverDictB1, minAlignedBases,
......@@ -765,7 +790,7 @@ def lastDotplot(opts, args):
cutRangesB2 = trimmed(seqRangesB2, coverDictB2, minAlignedBases,
maxGapB2, 0, 0)
warn("sorting...")
logging.info("sorting...")
sortOut = allSortedRanges(opts, alignments, alignmentsB,
cutRanges1, cutRangesB1, cutRanges2, cutRangesB2)
sortedRanges1, sortedRanges2 = sortOut
......@@ -785,7 +810,7 @@ def lastDotplot(opts, args):
bpPerPix1 = get_bp_per_pix(rangeSizes1, opts.border_pixels, maxPixels1)
bpPerPix2 = get_bp_per_pix(rangeSizes2, opts.border_pixels, maxPixels2)
bpPerPix = max(bpPerPix1, bpPerPix2)
warn("bp per pixel = " + str(bpPerPix))
logging.info("bp per pixel = " + str(bpPerPix))
p1 = pixelData(rangeSizes1, bpPerPix, opts.border_pixels, lMargin)
rangePixBegs1, rangePixLens1, width = p1
......@@ -797,14 +822,14 @@ def lastDotplot(opts, args):
rangeDict2 = dict(rangesPerSeq(sortedRanges2, rangePixBegs2,
rangePixLens2, bpPerPix))
warn("width: " + str(width))
warn("height: " + str(height))
logging.info("width: " + str(width))
logging.info("height: " + str(height))
warn("processing alignments...")
logging.info("processing alignments...")
hits = alignmentPixels(width, height, alignments + alignmentsB, bpPerPix,
rangeDict1, rangeDict2)
warn("reading annotations...")
logging.info("reading annotations...")
rangeDict1 = expandedSeqDict(rangeDict1)
beds1 = itertools.chain(readBed(opts.bed1, rangeDict1),
......@@ -822,7 +847,7 @@ def lastDotplot(opts, args):
boxes = sorted(itertools.chain(b1, b2))
warn("drawing...")
logging.info("drawing...")
image_size = width, height
im = Image.new(image_mode, image_size, opts.background_color)
......@@ -868,17 +893,20 @@ if __name__ == "__main__":
op = optparse.OptionParser(usage=usage, description=description)
op.add_option("-v", "--verbose", action="count",
help="show progress messages & data about the plot")
# Replace "width" & "height" with a single "length" option?
op.add_option("-x", "--width", metavar="INT", type="int", default=1000,
help="maximum width in pixels (default: %default)")
op.add_option("-y", "--height", metavar="INT", type="int", default=1000,
help="maximum height in pixels (default: %default)")
op.add_option("-m", "--maxseqs", type="int", default=100, metavar="M",
help="maximum number of horizontal or vertical sequences "
"(default=%default)")
op.add_option("-1", "--seq1", metavar="PATTERN", action="append",
default=[],
help="which sequences to show from the 1st genome")
op.add_option("-2", "--seq2", metavar="PATTERN", action="append",
default=[],
help="which sequences to show from the 2nd genome")
# Replace "width" & "height" with a single "length" option?
op.add_option("-x", "--width", type="int", default=1000,
help="maximum width in pixels (default: %default)")
op.add_option("-y", "--height", type="int", default=1000,
help="maximum height in pixels (default: %default)")
op.add_option("-c", "--forwardcolor", metavar="COLOR", default="red",
help="color for forward alignments (default: %default)")
op.add_option("-r", "--reversecolor", metavar="COLOR", default="blue",
......
......@@ -16,7 +16,7 @@ def myOpen(fileName):
if fileName == "-":
return sys.stdin
if fileName.endswith(".gz"):
return gzip.open(fileName)
return gzip.open(fileName, "rt") # xxx dubious for Python2
return open(fileName)
def logsum(numbers):
......
......@@ -26,7 +26,7 @@ def myOpen(fileName):
if fileName == "-":
return sys.stdin
if fileName.endswith(".gz"):
return gzip.open(fileName)
return gzip.open(fileName, "rt") # xxx dubious for Python2
return open(fileName)
def complement(base):
......
......@@ -10,7 +10,7 @@ def myOpen(fileName): # faster than fileinput
if fileName == "-":
return sys.stdin
if fileName.endswith(".gz"):
return gzip.open(fileName)
return gzip.open(fileName, "rt") # xxx dubious for Python2
return open(fileName)
def randomSample(things, sampleSize):
......@@ -499,8 +499,9 @@ if __name__ == "__main__":
"every STEP-th position in each query (default: 1)")
og.add_option("-P", metavar="THREADS",
help="number of parallel threads")
og.add_option("-Q", metavar="NUMBER",
help="input format: 0=fasta, 1=fastq-sanger")
og.add_option("-Q", metavar="NUMBER", help=
"input format: 0=fasta or fastq-ignore, 1=fastq-sanger "
"(default=fasta)")
op.add_option_group(og)
(opts, args) = op.parse_args()
if len(args) < 1:
......
......@@ -21,7 +21,7 @@ def myOpen(fileName):
if fileName == "-":
return sys.stdin
if fileName.endswith(".gz"):
return gzip.open(fileName)
return gzip.open(fileName, "rt") # xxx dubious for Python2
return open(fileName)
def maxlen(s):
......
......@@ -12,7 +12,7 @@ def myOpen(fileName): # faster than fileinput
if fileName == "-":
return sys.stdin
if fileName.endswith(".gz"):
return gzip.open(fileName)
return gzip.open(fileName, "rt") # xxx dubious for Python2
return open(fileName)
def alnBegFromSeqBeg(gappedSequence, seqBeg):
......
......@@ -25,7 +25,7 @@ def myOpen(fileName):
if fileName == "-":
return sys.stdin
if fileName.endswith(".gz"):
return gzip.open(fileName)
return gzip.open(fileName, "rt") # xxx dubious for Python2
return open(fileName)
def indexOfNthSequence(mafLines, n):
......
......@@ -172,9 +172,8 @@ Miscellaneous options (default settings):\n\
4=column ambiguity estimates, 5=gamma-centroid, 6=LAMA,\n\
7=expected counts ("
+ stringify(outputType) + ")\n\
-Q: input format: 0=fasta, 1=fastq-sanger, 2=fastq-solexa, 3=fastq-illumina,\n\
4=prb, 5=PSSM ("
+ stringify(inputFormat) + ")\n\
-Q: input format: 0=fasta or fastq-ignore, 1=fastq-sanger, 2=fastq-solexa,\n\
3=fastq-illumina, 4=prb, 5=PSSM (fasta)\n\
\n\
Report bugs to: last-align (ATmark) googlegroups (dot) com\n\
LAST home page: http://last.cbrc.jp/\n\
......@@ -445,7 +444,7 @@ void LastalArguments::setDefaultsFromAlphabet( bool isDna, bool isProtein,
if( gapExistCost == INT_MIN ) gapExistCost = 11;
if( gapExtendCost < 0 ) gapExtendCost = 2;
}
else if( !isQuality( inputFormat ) ){
else if( !isUseQuality( inputFormat ) ){
if( matchScore < 0 ) matchScore = 1;
if( mismatchCost < 0 ) mismatchCost = 1;
if( gapExistCost == INT_MIN ) gapExistCost = 7;
......@@ -641,7 +640,7 @@ void LastalArguments::writeCommented( std::ostream& stream ) const{
if( outputType > 4 && outputType < 7 )
stream << " g=" << gamma;
stream << " j=" << outputType;
stream << " Q=" << inputFormat;
stream << " Q=" << (inputFormat % sequenceFormat::fasta);
stream << '\n';
stream << "# " << lastdbName << '\n';
......
......@@ -70,8 +70,8 @@ Advanced Options (default settings):\n\
-S: strand: 0=reverse, 1=forward, 2=both ("
+ stringify(strand) + ")\n\
-s: volume size (unlimited)\n\
-Q: input format: 0=fasta, 1=fastq-sanger, 2=fastq-solexa, 3=fastq-illumina ("
+ stringify(inputFormat) + ")\n\
-Q: input format: 0=fasta or fastq-ignore,\n\
1=fastq-sanger, 2=fastq-solexa, 3=fastq-illumina (fasta)\n\
-P: number of parallel threads ("
+ stringify(numOfThreads) + ")\n\
-m: seed pattern\n\
......