Commits on Source (8)
......@@ -36,6 +36,7 @@ $stoppingCommits{"bbbdcd063560e5f86006ee6b8b96d2d7b80bb750"} = 1; # 21 NOV 20
$stoppingCommits{"64459fe33f97f6d23fe036ba1395743d0cdd03e4"} = 1; # 17 APR 2017
$stoppingCommits{"9e9bd674b705f89817b07ff30067210c2d180f42"} = 1; # 14 AUG 2017
$stoppingCommits{"0fff8a511fd7d74081d94ff9e0f6c0351650ae2e"} = 1; # 27 FEB 2018 - v1.7
$stoppingCommits{"fcc3fe19eb635abd735486d215fbf65c56bcf4ee"} = 1; # 22 OCT 2018 - v1.8
open(F, "< logs") or die "Failed to open 'logs': $!\n";
......
......@@ -225,6 +225,8 @@ my %derived;
$authcopy{$1} .= "$2\n";
} elsif (m/^D\s+(\S+)\s+(\S+)$/) {
$authcopy{$1} .= $authcopy{$2}; # Include all authors of old file in new file.
#$derived{$1} .= $derived{$2};
$derived{$1} .= "$2\n";
} else {
......
#!/bin/sh
# Before building a release:
#
# Update copyrights
# Increase version in documentation/source/conf.py
# Increase version in src/canu_version_update.pl
version=$1
if [ x$version = x ] ; then
......
canu (1.8+dfsg-1) unstable; urgency=medium
* Team upload.
* New upstream version
* Standards-Version: 4.2.1
* Remove unused paragraphs in d/copyright
* Fix perl interpreter path
-- Andreas Tille <tille@debian.org> Thu, 01 Nov 2018 08:56:31 +0100
canu (1.7.1+dfsg-1) unstable; urgency=medium
* Team upload.
......
......@@ -9,7 +9,7 @@ Build-Depends: debhelper (>= 11~),
# For File::Path
libfilesys-df-perl,
mhap (>= 2.1.3)
Standards-Version: 4.1.5
Standards-Version: 4.2.1
Vcs-Browser: https://salsa.debian.org/med-team/canu
Vcs-Git: https://salsa.debian.org/med-team/canu.git
Homepage: http://canu.readthedocs.org/en/latest/
......
......@@ -51,31 +51,6 @@ License: BSD-3-Clause-PacBio
OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
SUCH DAMAGE.
Files: src/AS_UTL/md5.C
Copyright: 1991-1992 RSA Data Security, Inc.
License: RSA
License to copy and use this software is granted provided that it
is identified as the "RSA Data Security, Inc. MD5 Message-Digest
Algorithm" in all material mentioning or referencing this software
or this function.
.
License is also granted to make and use derivative works provided
that such works are identified as "derived from the RSA Data
Security, Inc. MD5 Message-Digest Algorithm" in all material
mentioning or referencing the derived work.
.
RSA Data Security, Inc. makes no representations concerning either
the merchantability of this software or the suitability of this
software for any particular purpose. It is provided "as is"
without express or implied warranty of any kind.
.
These notices must be retained in any copies of any part of this
documentation and/or software.
Files: src/AS_UTL/mt19937ar.*
Copyright: 1997 - 2002 Makoto Matsumoto and Takuji Nishimura
License: BSD-3-Clause-BNBI
Files: debian/*
Copyright: 2016-2017 Afif Elghraoui <afif@debian.org>
License: GPL-2.0+
......
......@@ -3,11 +3,11 @@ Description: don't expect bundled MHAP
Author: Afif Elghraoui <afif@debian.org>
Forwarded: not-needed
Last-Update: 2018-03-10
--- canu.orig/src/Makefile
+++ canu/src/Makefile
@@ -615,7 +615,6 @@
--- a/src/Makefile
+++ b/src/Makefile
@@ -665,7 +665,6 @@ all: UPDATE_VERSION MAKE_DIRS \
$(addprefix ${TARGET_DIR}/,${ALL_TGTS}) \
${TARGET_DIR}/bin/canu \
${TARGET_DIR}/bin/trioCanu \
${TARGET_DIR}/bin/canu.defaults \
- ${TARGET_DIR}/share/java/classes/mhap-2.1.3.jar \
${TARGET_DIR}/lib/site_perl/canu/Consensus.pm \
......
Author: Andreas Tille <tille@debian.org>
Last-Update: Sat, 02 Sep 2017 15:30:21 +0200
Bug-Debian: https://bugs.debian.org/871390
Description: Fix gcc-7 error (violation of format-security)
 Passing a variable as the fprintf() format string is rejected when built
 with -Werror=format-security; print it through a constant "%s" format instead.
--- canu.orig/src/merTrim/merTrim.C
+++ canu/src/merTrim/merTrim.C
@@ -1790,7 +1790,7 @@
if (i+1 == clrEnd) { logLine[logPos++] = ']'; logLine[logPos++] = '-'; }
}
strcpy(logLine + logPos, " (ORI)\n");
- fprintf(stderr, logLine);
+ fprintf(stderr, "%s", logLine);
logPos = 0;
for (uint32 i=0; i<seqLen; i++) {
@@ -1800,7 +1800,7 @@
if (i+1 == clrEnd) { logLine[logPos++] = ']'; logLine[logPos++] = '-'; }
}
strcpy(logLine + logPos, " (SEQ)\n");
- fprintf(stderr, logLine);
+ fprintf(stderr, "%s", logLine);
if (corrSeq && verifySeq) {
uint32 i=0;
@@ -1821,7 +1821,7 @@
if (i+1 == clrEnd) { logLine[logPos++] = ']'; logLine[logPos++] = '-'; }
}
strcpy(logLine + logPos, " (VAL)\n");
- fprintf(stderr, logLine);
+ fprintf(stderr, "%s", logLine);
logPos = 0;
for (uint32 i=0; i<seqLen; i++) {
@@ -1831,7 +1831,7 @@
if (i+1 == clrEnd) { logLine[logPos++] = ']'; logLine[logPos++] = '-'; }
}
strcpy(logLine + logPos, " (VAL)\n");
- fprintf(stderr, logLine);
+ fprintf(stderr, "%s", logLine);
}
logPos = 0;
@@ -1842,7 +1842,7 @@
if (i+1 == clrEnd) { logLine[logPos++] = ']'; logLine[logPos++] = '-'; }
}
strcpy(logLine + logPos, " (QLT)\n");
- fprintf(stderr, logLine);
+ fprintf(stderr, "%s", logLine);
logPos = 0;
for (uint32 i=0; i<seqLen; i++) {
@@ -1852,7 +1852,7 @@
if (i+1 == clrEnd) { logLine[logPos++] = ']'; logLine[logPos++] = '-'; }
}
strcpy(logLine + logPos, " (COVERAGE)\n");
- fprintf(stderr, logLine);
+ fprintf(stderr, "%s", logLine);
logPos = 0;
for (uint32 i=0; i<seqLen; i++) {
@@ -1862,7 +1862,7 @@
if (i+1 == clrEnd) { logLine[logPos++] = ']'; logLine[logPos++] = '-'; }
}
strcpy(logLine + logPos, " (CORRECTIONS)\n");
- fprintf(stderr, logLine);
+ fprintf(stderr, "%s", logLine);
logPos = 0;
for (uint32 i=0; i<seqLen; i++) {
@@ -1872,7 +1872,7 @@
if (i+1 == clrEnd) { logLine[logPos++] = ']'; logLine[logPos++] = '-'; }
}
strcpy(logLine + logPos, " (DISCONNECTION)\n");
- fprintf(stderr, logLine);
+ fprintf(stderr, "%s", logLine);
logPos = 0;
for (uint32 i=0; i<seqLen; i++) {
@@ -1882,7 +1882,7 @@
if (i+1 == clrEnd) { logLine[logPos++] = ']'; logLine[logPos++] = '-'; }
}
strcpy(logLine + logPos, " (ADAPTER)\n");
- fprintf(stderr, logLine);
+ fprintf(stderr, "%s", logLine);
delete [] logLine;
}
use-debian-mhap-at-runtime.patch
gcc-7_format-security.patch
external-mhap.patch
......@@ -2,21 +2,21 @@ Description: Use mhap jar from /usr/share/java
Author: Afif Elghraoui <afif@debian.org>
Forwarded: not-needed
Last-Update: 2016-03-20
--- canu.orig/src/pipelines/canu/OverlapMhap.pm
+++ canu/src/pipelines/canu/OverlapMhap.pm
@@ -364,7 +364,7 @@
--- a/src/pipelines/canu/OverlapMhap.pm
+++ b/src/pipelines/canu/OverlapMhap.pm
@@ -368,7 +368,7 @@ sub mhapConfigure ($$$) {
print F "cd ./blocks\n";
print F "\n";
print F "$javaPath -d64 -server -Xmx", $javaMemory, "m \\\n";
print F "$javaPath $javaOpt -XX:ParallelGCThreads=", getGlobal("${tag}mhapThreads"), " -server -Xms", $javaMemory, "m -Xmx", $javaMemory, "m \\\n";
- print F " -jar $cygA \$bin/../share/java/classes/mhap-" . getGlobal("${tag}MhapVersion") . ".jar $cygB \\\n";
+ print F " -jar $cygA /usr/share/java/mhap.jar $cygB \\\n";
print F " --repeat-weight 0.9 --repeat-idf-scale 10 -k $merSize \\\n";
print F " --supress-noise 2 \\\n" if (defined(getGlobal("${tag}MhapFilterUnique")) && getGlobal("${tag}MhapFilterUnique") == 1);
print F " --no-tf \\\n" if (defined(getGlobal("${tag}MhapNoTf")) && getGlobal("${tag}MhapNoTf") == 1);
@@ -464,7 +464,7 @@
@@ -468,7 +468,7 @@ sub mhapConfigure ($$$) {
print F "\n";
print F "if [ ! -e ./results/\$qry.mhap ] ; then\n";
print F " $javaPath -d64 -server -Xmx", $javaMemory, "m \\\n";
print F " $javaPath $javaOpt -XX:ParallelGCThreads=", getGlobal("${tag}mhapThreads"), " -server -Xms", $javaMemory, "m -Xmx", $javaMemory, "m \\\n";
- print F " -jar $cygA \$bin/../share/java/classes/mhap-" . getGlobal("${tag}MhapVersion") . ".jar $cygB \\\n";
+ print F " -jar $cygA /usr/share/java/mhap.jar $cygB \\\n";
print F " --repeat-weight 0.9 --repeat-idf-scale 10 -k $merSize \\\n";
......
......@@ -18,3 +18,9 @@ override_dh_auto_build:
find $$builddir \
-name OverlapMhap.pm \
-exec sed -i 's#\(\s*my \$$javaPath = \).*#\1 "/usr/lib/jvm/java-8-openjdk-$(DEB_HOST_ARCH)/bin/java";#' {} +
override_dh_install:
dh_install
for pl in `grep -Rl '#![[:space:]]*/usr/bin/env[[:space:]]\+perl' debian/*/usr/*` ; do \
sed -i '1s?^#![[:space:]]*/usr/bin/env[[:space:]]\+perl?#!/usr/bin/perl?' $${pl} ; \
done
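In effect, this loop rewrites env-style perl shebangs to Debian's fixed interpreter path. A standalone sketch of the same substitution applied to a single file (the filename is illustrative)::

    # Turn '#!/usr/bin/env perl' into '#!/usr/bin/perl', first line only.
    sed -i '1s?^#![[:space:]]*/usr/bin/env[[:space:]]\+perl?#!/usr/bin/perl?' some-script.pl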
......@@ -35,7 +35,7 @@ bogart
When loading overlaps, an inflated maximum (to allow reruns with different error rates):
-eM 0.05 no more than 0.05 fraction (5.0%) error in any overlap loaded into bogart
the maximum used will ALWAYS be at leeast the maximum of the four error rates
the maximum used will ALWAYS be at least the maximum of the four error rates
For all, the lower limit on overlap length
-el 500 no shorter than 40 bases
......
......@@ -31,7 +31,7 @@ canu
If you want to change the defaults, use the various utg*ErrorRate options.
A full list of options can be printed with '-options'. All options
can be supplied in an optional sepc file.
can be supplied in an optional spec file.
Reads can be either FASTA or FASTQ format, uncompressed, or compressed
with gz, bz2 or xz. Reads are specified by the technology they were
......
......@@ -31,7 +31,7 @@ overlapInCore
-w filter out overlaps with too many errors in a window
-z skip the hopeless check
--maxerate <n> only output overlaps with fraction <n> or less error (e.g., 0.06 == 6%)
--maxrate <n> only output overlaps with fraction <n> or less error (e.g., 0.06 == 6%)
--minlength <n> only output overlaps of <n> or more bases
--hashbits n Use n bits for the hash mask.
......
......@@ -13,7 +13,7 @@ splitReads
-t bgn-end limit processing to only reads from bgn to end (inclusive)
-Ci clearFile path to input clear ranges (NOT SUPPORTED)
-Co clearFile path to ouput clear ranges
-Co clearFile path to output clear ranges
-e erate ignore overlaps with more than 'erate' percent error
......
......@@ -55,9 +55,9 @@ copyright = u'2015, Adam Phillippy, Sergey Koren, Brian Walenz'
# built documents.
#
# The short X.Y version.
version = '1.7'
version = '1.8'
# The full version, including alpha/beta/rc tags.
release = '1.7'
release = '1.8'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
......
......@@ -31,11 +31,11 @@ What resources does Canu require for a bacterial genome assembly? A mammalian as
How do I run Canu on my SLURM / SGE / PBS / LSF / Torque system?
-------------------------------------
Canu will detect and configure itself to use on most grids. You can supply your own grid
options, such as a partition on SLURM or an account code on SGE, with ``gridOptions="<your
options list>"`` which will passed to every job submitted by Canu. Similar options exist for
every stage of Canu, which could be used to, for example, restrict overlapping to a specific
partition or queue.
Canu will detect and configure itself to run on most grids. Canu will NOT request explicit time limits or
queues/partitions. You can supply your own grid options, such as a partition on SLURM, an account code
on SGE, and/or time limits with ``gridOptions="<your options list>"``, which will be passed to every job
submitted by Canu. Similar options exist for every stage of Canu, which could be used to, for example,
restrict overlapping to a specific partition or queue.
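For example, a minimal sketch of passing site-specific scheduler options through (the partition name, time limit, genome size, and read file are placeholders, not defaults)::

    canu -p asm -d asm-dir genomeSize=4.8m \
         gridOptions="--partition=long --time=72:00:00" \
         -pacbio-raw reads.fastq.gz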
To disable grid support and run only on the local machine, specify ``useGrid=false``
......@@ -61,6 +61,38 @@ My run stopped with the error ``'Failed to submit batch jobs'``
compute nodes.
My run of Canu was killed by the sysadmin; the power going out; my cat stepping on the power button; et cetera. Is it safe to restart? How do I restart?
-------------------------------------
Yes, perfectly safe! It's actually how Canu runs normally: each time Canu starts, it examines
the state of the assembly to decide what it should do next. For example, if six overlap tasks
have no results, it'll run just those six tasks.
This also means that if you want to redo some step, just remove those results from the assembly
directory. Some care needs to be taken to make sure results computed after those are also
removed.
Short answer: just rerun the *exact* same command as before. It'll do the right thing.
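As a sketch, assuming the run was started as below, restarting is literally the identical invocation (all names are illustrative)::

    # First attempt, killed partway through:
    canu -p asm -d asm-dir genomeSize=4.8m -nanopore-raw reads.fastq.gz
    # Restart: the exact same command; Canu inspects asm-dir and resumes.
    canu -p asm -d asm-dir genomeSize=4.8m -nanopore-raw reads.fastq.gz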
My genome size and assembly size are different, help!
-------------------------------------
The difference could be due to a heterozygous genome where the assembly separated some loci. It could also be because the previous estimate is incorrect. We typically use two analyses to see what happened. First, a `BUSCO <https://busco.ezlab.org>`_ analysis will indicate duplicated genes. For example this assembly::
INFO C:98.5%[S:97.9%,D:0.6%],F:1.0%,M:0.5%,n:2799
INFO 2756 Complete BUSCOs (C)
INFO 2740 Complete and single-copy BUSCOs (S)
INFO 16 Complete and duplicated BUSCOs (D)
does not have much duplication but this assembly::
INFO C:97.6%[S:15.8%,D:81.8%],F:0.9%,M:1.5%,n:2799
INFO 2732 Complete BUSCOs (C)
INFO 443 Complete and single-copy BUSCOs (S)
INFO 2289 Complete and duplicated BUSCOs (D)
does. We have had some success (in limited testing) using `purge_haplotigs <https://bitbucket.org/mroachawri/purge_haplotigs>`_ to remove duplication. Purge haplotigs will also generate a coverage plot which will usually have two peaks when assemblies have separated some loci.
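A minimal sketch of the BUSCO check described above, assuming BUSCO v4 or later is installed (the lineage and file names are placeholders; older releases use run_BUSCO.py with similar flags)::

    busco -i asm.contigs.fasta -l embryophyta_odb10 -m genome -o asm_busco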
What parameters should I use for my reads?
-------------------------------------
Canu is designed to be universal on a large range of PacBio (C2, P4-C2, P5-C3, P6-C4) and Oxford
......@@ -91,12 +123,12 @@ What parameters should I use for my reads?
Slightly decrease the maximum allowed difference in overlaps from the default of 14.4% to 12.0%
with ``correctedErrorRate=0.120``
**Early PacBio Sequel**
Based on exactly one publically released *A. thaliana* `dataset
**PacBio Sequel**
Based on an *A. thaliana* `dataset
<http://www.pacb.com/blog/sequel-system-data-release-arabidopsis-dataset-genome-assembly/>`_,
slightly decrease the maximum allowed difference from the default of 4.5% to 4.0% with
``correctedErrorRate=0.040 corMhapSensitivity=normal``. For recent Sequel data, the defaults
seem to be appropriate.
and a few more recent mammalian genomes, slightly increase the maximum allowed difference from the default of 4.5% to 8.5% with
``correctedErrorRate=0.085 corMhapSensitivity=normal``.
Only add the second parameter (``corMhapSensitivity=normal``) if you have >50x coverage.
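Putting that together, a sketch of a Sequel run with these settings (prefix, directory, genome size, and input file are placeholders)::

    canu -p asm -d asm-sequel genomeSize=135m \
         correctedErrorRate=0.085 corMhapSensitivity=normal \
         -pacbio-raw sequel-reads.fastq.gz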
**Nanopore R9 large genomes**
Due to some systematic errors, the identity estimate used by Canu for correction can be an
......@@ -106,6 +138,25 @@ What parameters should I use for my reads?
coverage.
Can I assemble RNA sequence data?
-------------------------------------
Canu will likely mis-assemble, or completely fail to assemble, RNA data. It will do a
reasonable job at generating corrected reads though. Reads are corrected using (local) best
alignments to other reads, and alignments between different isoforms are usually obviously not
'best'. Just like with DNA sequences, similar isoforms can get 'mixed' together. We've heard
of reasonable success from users, but do not have any parameter suggestions to make.
Note that Canu will silently translate 'U' bases to 'T' bases on input, but **NOT** translate
the output bases back to 'U'.
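If the corrected reads are all you are after, a correction-only sketch looks like this (names are illustrative; ``genomeSize`` is still required, so supply a rough transcriptome size)::

    # -correct stops after read correction; output is asm-rna/asm.correctedReads.fasta.gz
    canu -correct -p asm -d asm-rna genomeSize=10m -pacbio-raw isoseq-reads.fastq.gz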
My assembly is running out of space, or is too slow?
-------------------------------------
We don't have a good way to estimate the disk space used for an assembly. It varies with genome size, repeat content, and sequencing depth. A human genome sequenced with PacBio or Nanopore at 40-50x typically requires 1-2TB of space at the peak. Plants, unfortunately, seem to want a lot of space. 10TB is a reasonable guess. We've seen it as bad as 20TB on some very repetitive genomes.
The most common cause of high disk usage is a very repetitive or large genome. There are some parameters you can tweak to both reduce disk space and speed up the run. Try adding the options ``corMhapFilterThreshold=0.0000000002 corMhapOptions="--threshold 0.80 --num-hashes 512 --num-min-matches 3 --ordered-sketch-size 1000 --ordered-kmer-size 14 --min-olap-length 2000 --repeat-idf-scale 50" mhapMemory=60g mhapBlockSize=500 ovlMerThreshold=500``. This will suppress repeats more than the default settings and speed up both correction and assembly.
It is also possible to clean up some intermediate outputs before the assembly is complete to save space. If you already have a ``*.ovlStore.BUILDING/1-bucketize.success`` file in your current step (e.g. ``correct``), you can clean up the files under ``1-overlapper/blocks``. You can also remove the ovlStore for the previous step if you have its output (e.g. if you have ``asm.trimmedReads.fasta.gz``, you can remove ``trimming/asm.ovlStore``).
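As a sketch, the cleanup described above might look like the following, assuming the default directory layout and prefix ``asm``::

    # Safe once correction/asm.ovlStore.BUILDING/1-bucketize.success exists:
    rm -rf correction/1-overlapper/blocks
    # Safe once asm.trimmedReads.fasta.gz has been written:
    rm -rf trimming/asm.ovlStore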
My assembly continuity is not good, how can I improve it?
-------------------------------------
The most important determinant for assembly quality is sequence length, followed by the repeat
......@@ -160,13 +211,13 @@ What parameters can I tweak?
- ``corMinCoverage``, loosely, controls the quality of the corrected reads. It is the coverage
in evidence reads that is needed before a (portion of a) corrected read is reported.
Corrected reads are generated as a consensus of other reads; this is just the minimum ocverage
Corrected reads are generated as a consensus of other reads; this is just the minimum coverage
needed for the consensus sequence to be reported. The default is based on input read
coverage: 0x coverage for less than 30X input coverage, and 4x coverage for more than that.
For assembly:
- ``utgOvlErrorRate`` is essientially a speed optimization. Overlaps above this error rate are
- ``utgOvlErrorRate`` is essentially a speed optimization. Overlaps above this error rate are
not computed. Setting it too high generally just wastes compute time, while setting it too
low will degrade assemblies by missing true overlaps between lower quality reads.
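For instance, a sketch setting both of the knobs above explicitly (the values shown are purely illustrative, not recommendations)::

    canu -p asm -d asm-dir genomeSize=4.8m \
         corMinCoverage=4 utgOvlErrorRate=0.045 \
         -pacbio-raw reads.fastq.gz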
......@@ -194,19 +245,20 @@ What parameters can I tweak?
more conservative at picking the error rate to use for the assembly to try to maintain
haplotype separation. If it works, you'll end up with an assembly >= 2x your haploid
genome size. Post-processing using gene information or other synteny information is
required to remove redunancy from this assembly.
required to remove redundancy from this assembly.
2) **Smash haplotypes together** and then do phasing using another approach (like HapCUT2 or
whatshap or others). In that case you want to do the opposite, increase the error rates
used for finding overlaps:
``corOutCoverage=200 ovlErrorRate=0.15 obtErrorRate=0.15``
``corOutCoverage=200 correctedErrorRate=0.15``
Error rates for trimming (``obtErrorRate``) and assembling (``batErrorRate``) can usually
be left as is. When trimming, reads will be trimmed using other reads in the same
When trimming, reads will be trimmed using other reads in the same
chromosome (and probably some reads from other chromosomes). When assembling, overlaps
well outside the observed error rate distribution are discarded.
We typically prefer option 1 which will lead to a larger than expected genome size. We have had some success (in limited testing) using `purge_haplotigs <https://bitbucket.org/mroachawri/purge_haplotigs>`_ to remove this duplication.
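A sketch of option 2, smashing haplotypes together with the parameters above (names and genome size are placeholders)::

    canu -p asm -d asm-het genomeSize=500m \
         corOutCoverage=200 correctedErrorRate=0.15 \
         -pacbio-raw reads.fastq.gz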
For metagenomes:
The basic idea is to use all data for assembly rather than just the longest as default. The
......