Commits on Source (5)
......@@ -11,15 +11,15 @@ Canu is a hierarchical assembly pipeline which runs in four steps:
## Install:
The easiest way to get started is to download a [release](http://github.com/marbl/canu/releases).
* The easiest way to get started is to download a [release](http://github.com/marbl/canu/releases). Installing with a 'package manager' is not encouraged.
Alternatively, you can also build the latest unreleased version from github:
* Alternatively, you can use the latest unreleased version from the source code. This version has not undergone the same testing as a release and so may have unknown bugs or issues generating sub-optimal assemblies. We recommend the release version for most users.
    git clone https://github.com/marbl/canu.git
    cd canu/src
    make -j <number of threads>
The unreleased tip has not undergone the same testing as a release and so may have unknown bugs or issues generating sub-optimal assemblies. We recommend the release version for most users.
* An *unsupported* Docker image made by Frank Förster is at https://hub.docker.com/r/greatfireball/canu/.
## Learn:
......
......@@ -37,6 +37,7 @@ $stoppingCommits{"64459fe33f97f6d23fe036ba1395743d0cdd03e4"} = 1; # 17 APR 20
$stoppingCommits{"9e9bd674b705f89817b07ff30067210c2d180f42"} = 1; # 14 AUG 2017
$stoppingCommits{"0fff8a511fd7d74081d94ff9e0f6c0351650ae2e"} = 1; # 27 FEB 2018 - v1.7
$stoppingCommits{"fcc3fe19eb635abd735486d215fbf65c56bcf4ee"} = 1; # 22 OCT 2018 - v1.8
$stoppingCommits{"438412092d4470c63323f20780252b48ec132d44"} = 1; # 04 NOV 2019 - v1.9
open(F, "< logs") or die "Failed to open 'logs': $!\n";
......@@ -68,6 +69,10 @@ while (!eof(F)) {
$author = "Brian P. Walenz";
} elsif (m/koren/i) {
$author = "Sergey Koren";
} elsif (m/nurk/i) {
$author = "Sergey Nurk";
} elsif (m/rhie/i) {
$author = "Arang Rhie";
} else {
print STDERR "Skipping commit from '$_'\n";
$author = undef;
......
......@@ -226,7 +226,7 @@ my %derived;
} elsif (m/^D\s+(\S+)\s+(\S+)$/) {
$authcopy{$1} .= $authcopy{$2}; # Include all authors of old file in new file.
#$derived{$1} .= $derived{$2};
$derived{$1} .= $derived{$2};
$derived{$1} .= "$2\n";
} else {
......
#!/bin/sh
# Generate a script to compile Canu using the Holy Build Box.
echo > build-linux.sh \#\!/bin/bash
echo >> build-linux.sh yum install -y git
echo >> build-linux.sh cd /build/src
echo >> build-linux.sh gmake -j 12 \> ../Linux-amd64.out 2\>\&1
echo >> build-linux.sh cd ..
#echo >> build-linux.sh rm -rf Linux-amd64/obj
#echo >> build-linux.sh tar -cf canu-linux.Linux-amd64.tar canu-linux/README* canu-linux/Linux-amd64
chmod 755 build-linux.sh
echo ""
echo "-- Build Linux and make tarballs."
echo ""
echo "% docker run ..."
docker run \
-v `pwd`:/build \
-t \
-i \
--rm phusion/holy-build-box-64:latest /hbb_exe/activate-exec bash /build/build-linux.sh \
> build-linux.sh.out 2>&1
rm -f build-linux.sh
echo ""
echo "-- Build success?"
echo ""
tail -n 1 build-linux.sh.out | head -n 1
tail -n 2 Linux-amd64.out | head -n 1
# Fetch the Upload Agent and install in our bin/.
if [ ! -e src/pipelines/dx-canu/resources/bin/ua ] ; then
echo ""
echo "-- Fetch UploadAgent."
echo ""
curl -L -R -O https://dnanexus-sdk.s3.amazonaws.com/dnanexus-upload-agent-1.5.31-osx.zip
curl -L -R -O https://dnanexus-sdk.s3.amazonaws.com/dnanexus-upload-agent-1.5.31-linux.tar.gz
tar zxf dnanexus-upload-agent-1.5.31-linux.tar.gz
cp -p dnanexus-upload-agent-1.5.31-linux/ua src/pipelines/dx-canu/resources/bin/
cp -p dnanexus-upload-agent-1.5.31-linux/ua src/pipelines/dx-trio/resources/bin/
#rm -rf dnanexus-upload-agent-1.5.31-linux.tar.gz
rm -rf dnanexus-upload-agent-1.5.31-linux
fi
# Remove the old app.
echo ""
echo "-- Purge previous dx-canu and dx-trio builds."
echo ""
echo "% rm -rf dx-canu/ dx-trio/"
rm -rf dx-canu/ dx-trio/
mkdir -p dx-canu/ dx-trio/
mkdir -p dx-canu/resources/bin/ dx-trio/resources/bin/
mkdir -p dx-canu/resources/usr/bin/ dx-trio/resources/usr/bin/
mkdir -p dx-canu/resources/usr/lib/ dx-trio/resources/usr/lib/
mkdir -p dx-canu/resources/usr/share/ dx-trio/resources/usr/share/
# Package all that up into dx-canu.
echo ""
echo "-- Package new bits into dx-canu and dx-trio builds."
echo ""
echo "% rsync ..."
rsync -a src/pipelines/dx-canu/ dx-canu/
rsync -a Linux-amd64/bin/ dx-canu/resources/usr/bin/
rsync -a Linux-amd64/lib/ dx-canu/resources/usr/lib/
rsync -a Linux-amd64/share/ dx-canu/resources/usr/share/
rsync -a src/pipelines/dx-trio/ dx-trio/
rsync -a Linux-amd64/bin/ dx-trio/resources/usr/bin/
rsync -a Linux-amd64/lib/ dx-trio/resources/usr/lib/
rsync -a Linux-amd64/share/ dx-trio/resources/usr/share/
#rm -fr Linux-amd64/obj Linux-amd64.tar Linux-amd64.tar.gz
#tar -cf Linux-amd64.tar Linux-amd64
#gzip -1v Linux-amd64.tar
#dx rm Linux-amd64.tar.gz
#dx upload Linux-amd64.tar.gz
echo ""
echo "-- Build the DNAnexus apps."
echo ""
echo "% dx build -f dx-canu"
dx build -f dx-canu
echo "% dx build -f dx-trio"
dx build -f dx-trio
exit 0
......@@ -36,6 +36,15 @@ cd canu-$version
echo Build MacOS.
cd src
gmake -j 12 > ../Darwin-amd64.out 2>&1
echo Make static binaries MacOS
cd ../Darwin-amd64
if [ ! -e ../statifyOSX.py ]; then
curl -L -R -o ../statifyOSX.py https://raw.githubusercontent.com/marbl/canu/master/statifyOSX.py
fi
python ../statifyOSX.py bin lib true true >> ../Darwin-amd64.out 2>&1
python ../statifyOSX.py lib lib true true >> ../Darwin-amd64.out 2>&1
cd ../..
rm -f canu-$version/linux.sh
......
canu (1.9+dfsg-1) unstable; urgency=medium
* Team upload.
* New upstream version
* debhelper-compat 12
-- Steffen Moeller <moeller@debian.org> Sun, 10 Nov 2019 11:12:40 +0100
canu (1.8+dfsg-2) unstable; urgency=medium
* Team upload
......
......@@ -2,13 +2,13 @@ Source: canu
Maintainer: Debian Med Packaging Team <debian-med-packaging@lists.alioth.debian.org>
Section: science
Priority: optional
Build-Depends: debhelper (>= 11~),
Build-Depends: debhelper-compat (= 12),
libboost-dev,
libmeryl-dev,
# For File::Path
libfilesys-df-perl,
mhap (>= 2.1.3)
Standards-Version: 4.2.1
Standards-Version: 4.4.1
Vcs-Browser: https://salsa.debian.org/med-team/canu
Vcs-Git: https://salsa.debian.org/med-team/canu.git
Homepage: http://canu.readthedocs.org/en/latest/
......
Index: canu/src/canu_version_update.pl
===================================================================
--- canu.orig/src/canu_version_update.pl
+++ canu/src/canu_version_update.pl
@@ -52,7 +52,8 @@ my $dirtyc = undef;
# If in a git repo, we can get the actual values.
-if (-d "../.git") {
+#if (-d "../.git") {
+if (0) {
$label = "snapshot";
# Count the number of changes since the last release.
......@@ -3,13 +3,15 @@ Description: don't expect bundled MHAP
Author: Afif Elghraoui <afif@debian.org>
Forwarded: not-needed
Last-Update: 2018-03-10
--- a/src/Makefile
+++ b/src/Makefile
@@ -665,7 +665,6 @@ all: UPDATE_VERSION MAKE_DIRS \
$(addprefix ${TARGET_DIR}/,${ALL_TGTS}) \
Index: canu/src/Makefile
===================================================================
--- canu.orig/src/Makefile
+++ canu/src/Makefile
@@ -670,7 +670,6 @@ all: UPDATE_VERSION MAKE_DIRS \
${TARGET_DIR}/bin/canu \
${TARGET_DIR}/bin/canu-time \
${TARGET_DIR}/bin/canu.defaults \
- ${TARGET_DIR}/share/java/classes/mhap-2.1.3.jar \
${TARGET_DIR}/lib/site_perl/canu/Consensus.pm \
${TARGET_DIR}/lib/site_perl/canu/CorrectReads.pm \
${TARGET_DIR}/lib/site_perl/canu/HaplotypeReads.pm \
${TARGET_DIR}/share/sequence/ultra-long-nanopore \
${TARGET_DIR}/share/sequence/pacbio \
${TARGET_DIR}/share/sequence/pacbio-hifi\
use-debian-mhap-at-runtime.patch
external-mhap.patch
tell_version_properly.patch
canu_version.patch
......@@ -3,8 +3,10 @@ Bug-Debian: https://bugs.debian.org/915269
Author: Andreas Tille <tille@debian.org>
Lest-Update: Mon, 03 Dec 2018 09:53:52 +0100
--- a/src/canu_version_update.pl
+++ b/src/canu_version_update.pl
Index: canu/src/canu_version_update.pl
===================================================================
--- canu.orig/src/canu_version_update.pl
+++ canu/src/canu_version_update.pl
@@ -34,7 +34,7 @@ use Cwd qw(getcwd);
my $cwd = getcwd();
......@@ -12,5 +14,5 @@ Lest-Update: Mon, 03 Dec 2018 09:53:52 +0100
-my $label = "snapshot"; # Automagically set to 'release' for releases.
+my $label = "release"; # Automagically set to 'release' for releases.
my $major = "1"; # Bump before release.
my $minor = "8"; # Bump before release.
my $minor = "9"; # Bump before release.
......@@ -2,9 +2,11 @@ Description: Use mhap jar from /usr/share/java
Author: Afif Elghraoui <afif@debian.org>
Forwarded: not-needed
Last-Update: 2016-03-20
--- a/src/pipelines/canu/OverlapMhap.pm
+++ b/src/pipelines/canu/OverlapMhap.pm
@@ -368,7 +368,7 @@ sub mhapConfigure ($$$) {
Index: canu/src/pipelines/canu/OverlapMhap.pm
===================================================================
--- canu.orig/src/pipelines/canu/OverlapMhap.pm
+++ canu/src/pipelines/canu/OverlapMhap.pm
@@ -365,7 +365,7 @@ sub mhapConfigure ($$$) {
print F "cd ./blocks\n";
print F "\n";
print F "$javaPath $javaOpt -XX:ParallelGCThreads=", getGlobal("${tag}mhapThreads"), " -server -Xms", $javaMemory, "m -Xmx", $javaMemory, "m \\\n";
......@@ -13,9 +15,9 @@ Last-Update: 2016-03-20
print F " --repeat-weight 0.9 --repeat-idf-scale 10 -k $merSize \\\n";
print F " --supress-noise 2 \\\n" if (defined(getGlobal("${tag}MhapFilterUnique")) && getGlobal("${tag}MhapFilterUnique") == 1);
print F " --no-tf \\\n" if (defined(getGlobal("${tag}MhapNoTf")) && getGlobal("${tag}MhapNoTf") == 1);
@@ -468,7 +468,7 @@ sub mhapConfigure ($$$) {
@@ -473,7 +473,7 @@ sub mhapConfigure ($$$) {
print F "\n";
print F "if [ ! -e ./results/\$qry.mhap ] ; then\n";
print F "if [ ! -e \$outPath/\$qry.mhap ] ; then\n";
print F " $javaPath $javaOpt -XX:ParallelGCThreads=", getGlobal("${tag}mhapThreads"), " -server -Xms", $javaMemory, "m -Xmx", $javaMemory, "m \\\n";
- print F " -jar $cygA \$bin/../share/java/classes/mhap-" . getGlobal("${tag}MhapVersion") . ".jar $cygB \\\n";
+ print F " -jar $cygA /usr/share/java/mhap.jar $cygB \\\n";
......
......@@ -24,3 +24,7 @@ override_dh_install:
for pl in `grep -Rl '#![[:space:]]*/usr/bin/env[[:space:]]\+perl' debian/*/usr/*` ; do \
sed -i '1s?^#![[:space:]]*/usr/bin/env[[:space:]]\+perl?#!/usr/bin/perl?' $${pl} ; \
done
override_dh_auto_clean:
rm -rf Linux-*
rm -f src/canu_version.H
......@@ -55,9 +55,9 @@ copyright = u'2015, Adam Phillippy, Sergey Koren, Brian Walenz'
# built documents.
#
# The short X.Y version.
version = '1.8'
version = '1.9'
# The full version, including alpha/beta/rc tags.
release = '1.8'
release = '1.9'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
......
......@@ -45,6 +45,11 @@ How do I run Canu on my SLURM / SGE / PBS / LSF / Torque system?
`Issue #756 <https://github.com/marbl/canu/issues/756>`_ for Slurm and SGE examples.
My run stopped with the error ``'Mhap precompute jobs failed'``
-------------------------------------
Several package managers (conda and Ubuntu in particular) make a mess of the installation, causing this error. Package managers add little benefit to a tool like Canu, which is distributed as pre-compiled binaries compatible with most systems, so our recommended installation method is downloading a binary release. Try re-running the assembly from scratch using our release distribution and, if you continue to encounter errors, submit an issue.
My run stopped with the error ``'Failed to submit batch jobs'``
-------------------------------------
......@@ -120,23 +125,21 @@ What parameters should I use for my reads?
contiguous assemblies on large genomes.
**Nanopore R9 2D** and **PacBio P6**
Slightly decrease the maximum allowed difference in overlaps from the default of 14.4% to 12.0%
with ``correctedErrorRate=0.120``
Slightly decrease the maximum allowed difference in overlaps from the default of 12% to 10.5%
with ``correctedErrorRate=0.105``
**PacBio Sequel**
**PacBio Sequel V2**
Based on an *A. thaliana* `dataset
<http://www.pacb.com/blog/sequel-system-data-release-arabidopsis-dataset-genome-assembly/>`_,
and a few more recent mammalian genomes, slightly increase the maximum allowed difference from the default of 4.5% to 8.5% with
``correctedErrorRate=0.085 corMhapSensitivity=normal``.
Only add the second parameter (``corMhapSensivity=normal``) if you have >50x coverage.
**Nanopore R9 large genomes**
Due to some systematic errors, the identity estimate used by Canu for correction can be an
over-estimate of true error, inflating runtime. For recent large genomes (>1gbp) with more
than 30x coverage, we've used ``'corMhapOptions=--threshold 0.8 --num-hashes
512 --ordered-sketch-size 1000 --ordered-kmer-size 14'``. This is not needed below 30x
coverage.
**PacBio Sequel V3**
The defaults for PacBio should work on this data.
**Nanopore flip-flop R9.4**
Based on a human dataset, the flip-flop basecaller reduces both the raw read error rate and the residual error rate remaining after Canu read correction. For this reason you can reduce the error tolerated by Canu. If you have over 30x coverage add the options: ``'corMhapOptions=--threshold 0.8 --ordered-sketch-size 1000 --ordered-kmer-size 14' correctedErrorRate=0.105``. This is primarily a speed optimization so you can use defaults, especially if your genome's accuracy is not improved by the flip-flop caller.
Can I assemble RNA sequence data?
-------------------------------------
......@@ -153,7 +156,7 @@ My assembly is running out of space, is too slow?
-------------------------------------
We don't have a good way to estimate the disk space used for the assembly. It varies with genome size, repeat content, and sequencing depth. A human genome sequenced with PacBio or Nanopore at 40-50x typically requires 1-2TB of space at the peak. Plants, unfortunately, seem to want a lot of space. 10TB is a reasonable guess. We've seen it as bad as 20TB on some very repetitive genomes.
The most common cause of high disk usage is a very repetitive or large genome. There are some parameters you can tweak to both reduce disk space and speed up the run. Try adding the options ``corMhapFilterThreshold=0.0000000002 corMhapOptions="--threshold 0.80 --num-hashes 512 --num-min-matches 3 --ordered-sketch-size 1000 --ordered-kmer-size 14 --min-olap-length 2000 --repeat-idf-scale 50" mhapMemory=60g mhapBlockSize=500 ovlMerThreshold=500``. This will suppress repeats more than the default settings and speed up both correction and assembly.
The most common cause of high disk usage is a very repetitive or large genome. There are some parameters you can tweak to both reduce disk space and speed up the run. Try adding the options ``corMhapFilterThreshold=0.0000000002 corMhapOptions="--threshold 0.80 --num-hashes 512 --num-min-matches 3 --ordered-sketch-size 1000 --ordered-kmer-size 14 --min-olap-length 2000 --repeat-idf-scale 50" mhapMemory=60g mhapBlockSize=500 ovlMerDistinct=0.975``. This will suppress repeats more than the default settings and speed up both correction and assembly.
It is also possible to clean up some intermediate outputs before the assembly is complete to save space. If you already have a ``*.ovlStore.BUILDING/1-bucketize.success`` file in your current step (e.g. ``correction``), you can clean up the files under ``1-overlapper/blocks``. You can also remove the ovlStore for the previous step if you have its output (e.g. if you have ``asm.trimmedReads.fasta.gz``, you can remove ``trimming/asm.ovlStore``).
......@@ -322,8 +325,17 @@ Why do I get less corrected read data than I asked for?
What is the minimum coverage required to run Canu?
-------------------------------------
For eukaryotic genomes, coverage more than 20X is enough to outperform current hybrid
methods. Below that, you will likely not assemble the full genome.
methods. Below that, you will likely not assemble the full genome. The following
two papers have several examples.
* `Koren et al. (2013) Reducing assembly complexity of microbial genomes with single-molecule sequencing <https://www.ncbi.nlm.nih.gov/pubmed/24034426>`_
* `Koren and Walenz et al. (2017) Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation <https://www.ncbi.nlm.nih.gov/pubmed/28298431>`_
Can I use Illumina data too?
-------------------------------------
No. We've seen that using short reads for correction will homogenize repeats and
mix up haplotypes. Even though the short reads are very high quality, their length
isn't sufficient for the true alignment to be identified, and so reads from other repeat
instances are used for correction, resulting in incorrect corrections.
My circular element is duplicated/has overlap?
-------------------------------------
......
......@@ -105,6 +105,13 @@ genomeSize <float=unset> *required*
parameter) and how sensitive the mhap overlapper should be (via the :ref:`mhapSensitivity <mhapSensitivity>`
parameter). It also impacts some logging, in particular, reports of NG50 sizes.
.. _fast:
fast <toggle>
This option uses MHAP overlapping for all steps, not just correction, making assembly significantly faster. It can be used on any genome size but may produce less continuous assemblies on genomes larger than 1 Gbp. It is recommended for nanopore genomes smaller than 1 Gbp or metagenomes.
The fast option will also optionally use `wtdbg <https://github.com/ruanjue/wtdbg2>`_ for unitigging if wtdbg is manually copied to the Canu binary folder. However, this is only tested with very small genomes and is **NOT** recommended.
.. _canuIteration:
canuIteration <internal parameter, do not use>
......@@ -189,22 +196,49 @@ stopOnLowCoverage <integer=10>
.. _stopAfter:
stopAfter <string=undefined>
If set, Canu will stop processing after a specific stage in the pipeline finishes.
Valid values for ``stopAfter`` are:
- ``gatekeeper`` - stops after the reads are loaded into the assembler read database.
- ``meryl`` - stops after frequent kmers are tabulated.
- ``overlapConfigure`` - stops after overlap jobs are configured.
- ``overlap`` - stops after overlaps are generated, before they are loaded into the overlap database.
- ``overlapStoreConfigure`` - stops after the jobs for creating the overlap store are configured.
- ``overlapStore`` - stops after overlaps are loaded into the overlap database.
- ``readCorrection`` - stops after corrected reads are generated.
- ``readTrimming`` - stops after trimmed reads are generated.
- ``unitig`` - stops after unitigs and contigs are created.
- ``consensusConfigure`` - stops after consensus jobs are configured.
- ``consensus`` - stops after consensus sequences are loaded into the databases.
If set, Canu will stop processing after a specific stage in the pipeline finishes. Valid values are:
+-----------------------+-------------------------------------------------------------------+
| **stopAfter=** | **Canu will stop after ....** |
+-----------------------+-------------------------------------------------------------------+
| sequenceStore | reads are loaded into the assembler read database. |
+-----------------------+-------------------------------------------------------------------+
| meryl-configure | kmer counting jobs are configured. |
+-----------------------+-------------------------------------------------------------------+
| meryl-count | kmers are counted, but not processed into one database. |
+-----------------------+-------------------------------------------------------------------+
| meryl-merge | kmers are merged into one database. |
+-----------------------+-------------------------------------------------------------------+
| meryl-process | frequent kmers are generated. |
+-----------------------+-------------------------------------------------------------------+
| meryl-subtract | haplotype specific kmers are generated. |
+-----------------------+-------------------------------------------------------------------+
| meryl | all kmer work is complete. |
+-----------------------+-------------------------------------------------------------------+
| haplotype-configure | haplotype read separation jobs are configured. |
+-----------------------+-------------------------------------------------------------------+
| haplotype | haplotype-specific reads are generated. |
+-----------------------+-------------------------------------------------------------------+
| overlapConfigure | overlap jobs are configured. |
+-----------------------+-------------------------------------------------------------------+
| overlap | overlaps are generated, before they are loaded into the database. |
+-----------------------+-------------------------------------------------------------------+
| overlapStoreConfigure | the jobs for creating the overlap database are configured. |
+-----------------------+-------------------------------------------------------------------+
| overlapStore | overlaps are loaded into the overlap database. |
+-----------------------+-------------------------------------------------------------------+
| correction | corrected reads are generated. |
+-----------------------+-------------------------------------------------------------------+
| trimming | trimmed reads are generated. |
+-----------------------+-------------------------------------------------------------------+
| unitig | unitigs and contigs are created. |
+-----------------------+-------------------------------------------------------------------+
| consensusConfigure | consensus jobs are configured. |
+-----------------------+-------------------------------------------------------------------+
| consensus | consensus sequences are loaded into the databases. |
+-----------------------+-------------------------------------------------------------------+
*readCorrection* and *readTrimming* are deprecated synonyms for *correction* and *trimming*, respectively.
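The stage names in the table above, including the deprecated synonyms, can be validated before launching a long run. This is an illustrative helper, not part of Canu:

```python
# Valid stopAfter stages, as listed in the table above.
VALID_STOP_AFTER = {
    "sequenceStore", "meryl-configure", "meryl-count", "meryl-merge",
    "meryl-process", "meryl-subtract", "meryl",
    "haplotype-configure", "haplotype",
    "overlapConfigure", "overlap", "overlapStoreConfigure", "overlapStore",
    "correction", "trimming", "unitig", "consensusConfigure", "consensus",
}

# Deprecated synonyms map onto their current names.
DEPRECATED = {"readCorrection": "correction", "readTrimming": "trimming"}

def normalize_stop_after(value):
    """Return the canonical stage name, or raise ValueError if unknown."""
    value = DEPRECATED.get(value, value)
    if value not in VALID_STOP_AFTER:
        raise ValueError(f"unknown stopAfter stage: {value}")
    return value
```

Note that ``gatekeeper``, valid in earlier releases, is no longer accepted; ``sequenceStore`` replaces it.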
General Options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
......@@ -216,6 +250,9 @@ shell <string="/bin/sh">
java <string="java">
A path to a Java application launcher of at least version 1.8.
minimap <string="minimap2">
A path to the minimap2 versatile pairwise aligner.
gnuplot <string="gnuplot">
A path to the gnuplot graphing utility. Plotting is disabled if this is unset
(`gnuplot=` or `gnuplot=undef`), or if gnuplot fails to execute, or if gnuplot
......@@ -263,17 +300,45 @@ Cleanup Options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
saveOverlaps <boolean=false>
If set to 'false', the raw overlapper outputs are removed as soon as they are loaded into an
overlap store. Also, the correction and trimming overlap stores are removed when they are no
longer needed. This is recommended in nearly every case.
If 'true', retain all overlap stores. If 'false', delete the correction
and trimming overlap stores when they are no longer useful. Overlaps used
for contig construction are never deleted.
purgeOverlaps <string=normal>
Controls when to remove intermediate overlap results.
'never' removes no intermediate overlap results. This is only useful if
you have a desire to exhaust your disk space.
'false' is the same as 'never'.
'normal' removes intermediate overlap results after they are loaded into an
overlap store.
'true' is the same as 'normal'.
'aggressive' removes intermediate overlap results as soon as possible. In
the event of a corrupt or lost file, this can result in a fair amount of
suffering to recompute the data. In particular, overlapper output is removed
as soon as it is loaded into buckets, and buckets are removed once they are
rewritten as sorted overlaps.
If set to 'stores', the raw overlapper outputs are removed, but all of the overlap stores are
retained. The overlap stores capture all the critical information in the raw outputs and the raw
outputs are redundant and unwieldy. Retaining the overlap stores can allow one to 'back up' and
redo a step, but this is generally not useful unless one is familiar with the algorithms.
'dangerous' removes intermediate results as soon as possible, in some
cases, before they are even fully processed. In addition to corrupt files,
jobs killed by out of memory, power outages, stray cosmic rays, et cetera,
will result in a fair amount of suffering to recompute the lost data. This
mode can help when creating ginormous overlap stores, by removing the
bucketized data immediately after it is loaded into the sorting jobs, thus
making space for the output of the sorting jobs.
If set to 'true', all overlapper outputs and all stores are retained. This is useful for
debugging potential problems with the overlap store.
Use 'normal' for non-large assemblies, and when disk space is plentiful.
Use 'aggressive' on large assemblies when disk space is tight. Never use
'dangerous', unless you know how to recover from an error and you fully
trust your compute environment.
For Mhap and Minimap2, the raw overlaps (in Mhap and PAF format) are
deleted immediately after being converted to Canu ovb format, except when
purgeOverlaps=never.
saveReadCorrections <boolean=false>
If set, do not remove raw corrected read output from correction/2-correction. Normally, this
......@@ -356,7 +421,36 @@ Overlapper Configuration, ovl Algorithm
overlaps with :ref:`mhapReAlign <mhapReAlign>`.
{prefix}OvlFrequentMers <string=undefined>
Do not seed overlaps with these kmers (fasta format).
Do not seed overlaps with these kmers, or, for mhap, do not seed with these kmers unless necessary (down-weight them).
For corFrequentMers (mhap), the file must contain a single-line header followed by number-of-kmers data lines::
0 number-of-kmers
forward-kmer word-frequency kmer-count total-number-of-kmers
reverse-kmer word-frequency kmer-count total-number-of-kmers
Where `kmer-count` is the number of times this kmer sequence occurs in the reads, 'total-number-of-kmers'
is the number of kmers in the reads (including duplicates; roughly the number of bases in the reads),
and 'word-frequency' is 'kmer-count' / 'total-number-of-kmers'.
For example::
0 4
AAAATAATAGACTTATCGAGTC 0.0000382200 52 1360545
GACTCGATAAGTCTATTATTTT 0.0000382200 52 1360545
AAATAATAGACTTATCGAGTCA 0.0000382200 52 1360545
TGACTCGATAAGTCTATTATTT 0.0000382200 52 1360545
This file must be gzip compressed.
For obtFrequentMers and ovlFrequentMers, the file must contain a list of the canonical kmers and
their count on a single line. The count value is ignored, but needs to be present. This file
should not be compressed.
For example::
AAAATAATAGACTTATCGAGTC 52
AAATAATAGACTTATCGAGTCA 52
{prefix}OvlHashBits <integer=unset>
Width of the kmer hash. Width 22=1gb, 23=2gb, 24=4gb, 25=8gb. Plus 10b per ovlHashBlockLength.
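The sizing rule above (22 bits = 1gb, doubling per extra bit, plus "10b" per ovlHashBlockLength, which we read as ten bytes per base) can be written as a back-of-the-envelope estimate. Treat this sketch as an approximation, not Canu's exact accounting:

```python
def ovl_hash_memory_bytes(hash_bits, hash_block_length=0):
    """Rough memory estimate for the ovl hash table.

    22 bits -> 1 GB, and each additional bit doubles the table;
    add ~10 bytes per base of ovlHashBlockLength (assumption).
    """
    return (2 ** 30) * 2 ** (hash_bits - 22) + 10 * hash_block_length
```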
......@@ -484,6 +578,12 @@ trimReadsCoverage <integer=1>
Minimum depth of evidence to retain bases.
Trio binning Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. _hapUnknownFraction:
hapUnknownFraction <float=0.05>
Fraction of unclassified bases to ignore for haplotype assemblies. If there are more than this fraction of unclassified bases, they are included in both haplotype assemblies.
.. _grid-engine:
......@@ -539,32 +639,13 @@ commands, they all start with ``gridEngine``. For each grid, these parameters a
various ``src/pipeline/Grid_*.pm`` modules. The parameters are used in
``src/pipeline/canu/Execution.pm``.
For SGE grids, two options are sometimes necessary to tell canu about pecularities of your grid:
``gridEngineThreadsOption`` describes how to request multiple cores, and ``gridEngineMemoryOption``
describes how to request memory. Usually, canu can figure out how to do this, but sometimes it
reports an error such as::
-- WARNING: Couldn't determine the SGE parallel environment to run multi-threaded codes.
-- Valid choices are (pick one and supply it to canu):
-- gridEngineThreadsOption="-pe make THREADS"
-- gridEngineThreadsOption="-pe make-dedicated THREADS"
-- gridEngineThreadsOption="-pe mpich-rr THREADS"
-- gridEngineThreadsOption="-pe openmpi-fill THREADS"
-- gridEngineThreadsOption="-pe smp THREADS"
-- gridEngineThreadsOption="-pe thread THREADS"
or::
-- WARNING: Couldn't determine the SGE resource to request memory.
-- Valid choices are (pick one and supply it to canu):
-- gridEngineMemoryOption="-l h_vmem=MEMORY"
-- gridEngineMemoryOption="-l mem_free=MEMORY"
If you get such a message, just add the appropriate line to your canu command line. Both options
will replace the uppercase text (THREADS or MEMORY) with the value canu wants when the job is
submitted. For ``gridEngineMemoryOption``, any number of ``-l`` options can be supplied; we could
use ``gridEngineMemoryOption="-l h_vmem=MEMORY -l mem_free=MEMORY"`` to request both ``h_vmem`` and
``mem_free`` memory.
In Canu 1.8 and earlier, ``gridEngineMemoryOption`` and ``gridEngineThreadsOption`` are used to tell
Canu how to request resources from the grid. Starting with ``snapshot v1.8 +90 changes`` (roughly
January 11th), those options were merged into ``gridEngineResourceOption``. These options specify
the grid options needed to request memory and threads for each job. For example, the default
``gridEngineResourceOption`` for PBS/Torque is "-l nodes=1:ppn=THREADS:mem=MEMORY", and for Slurm it
is "--cpus-per-task=THREADS --mem-per-cpu=MEMORY". Canu will replace "THREADS" and "MEMORY" with
the specific values needed for each job.
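The placeholder substitution described above amounts to a simple string replacement. The templates below are the defaults quoted in the text; the per-job values are invented examples:

```python
def expand_resource_option(template, threads, memory):
    """Replace THREADS and MEMORY with the per-job values Canu computes."""
    return template.replace("THREADS", str(threads)).replace("MEMORY", memory)

# Defaults quoted above, expanded for a hypothetical 4-thread, 16g job:
pbs   = expand_resource_option("-l nodes=1:ppn=THREADS:mem=MEMORY", 4, "16g")
slurm = expand_resource_option("--cpus-per-task=THREADS --mem-per-cpu=MEMORY", 4, "16g")
```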
.. _grid-options:
......
......@@ -81,7 +81,7 @@ For Nanopore::
Output and intermediate files will be in directories 'ecoli-pacbio' and 'ecoli-nanopore',
respectively. Intermediate files are written in directories 'correction', 'trimming' and
'unitigging' for the respective stages. Output files are named using the '-p' prefix, such as
'ecoli.contigs.fasta', 'ecoli.contigs.gfa', etc. See section :ref:`outputs` for more details on
'ecoli.contigs.fasta', 'ecoli.unitigs.gfa', etc. See section :ref:`outputs` for more details on
outputs (intermediate files aren't documented).
......@@ -168,14 +168,25 @@ Canu has support for using parental short-read sequencing to classify and bin th
curl -L -o O157.parental.fasta https://gembox.cbcb.umd.edu/triobinning/example/o157.12.fasta
curl -L -o F1.fasta https://gembox.cbcb.umd.edu/triobinning/example/pacbio.fasta
trioCanu \
canu \
-p asm -d ecoliTrio \
genomeSize=5m \
-haplotypeK12 K12.parental.fasta \
-haplotypeO157 O157.parental.fasta \
-pacbio-raw F1.fasta
The run will produce two assemblies, ecoliTrio/haplotypeK12/asm.contigs.fasta and ecoliTrio/haplotypeO157/asm.contigs.fasta. As comparison, you can try co-assembling the datasets instead::
The run will first bin the reads into the haplotypes (``ecoliTrio/haplotype/haplotype-*.fasta.gz``) and provide a summary of the classification in ``ecoliTrio/haplotype/haplotype.log``::
-- Processing reads in batches of 100 reads each.
--
-- 119848 reads 378658103 bases written to haplotype file ./haplotype-K12.fasta.gz.
-- 308353 reads 1042955878 bases written to haplotype file ./haplotype-O157.fasta.gz.
-- 4114 reads 6520294 bases written to haplotype file ./haplotype-unknown.fasta.gz.
Next, the haplotypes are assembled in ``ecoliTrio/asm-haplotypeK12/asm-haplotypeK12.contigs.fasta`` and ``ecoliTrio/asm-haplotypeO157/asm-haplotypeO157.contigs.fasta``. By default, if the unassigned bases are > 5% of the total, they are included in both haplotypes. This can be controlled with the :ref:`hapUnknownFraction <hapUnknownFraction>` option.
As comparison, you can try co-assembling the datasets instead::
canu \
-p asm -d ecoliHap \
......@@ -183,11 +194,7 @@ The run will produce two assemblies, ecoliTrio/haplotypeK12/asm.contigs.fasta an
corOutCoverage=200 "batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50" \
-pacbio-raw F1.fasta
and compare the contiguity/accuracy. The current version of trioCanu is not yet optimized for memory use so requires adjusted parameters for large genomes. Adding the options::
gridOptionsExecutive="--mem=250g" gridOptionsMeryl='--partition=largemem --mem=1000g'
should be sufficient for a mammalian genome.
and compare the continuity/accuracy.
Consensus Accuracy
-------------------
......
......@@ -367,7 +367,7 @@ and 8 'distinct' kmers.
pick a threshold so as to seed overlaps using this fraction of all kmers in the input. In the example above,
fraction 0.667 of the k-mers (8/12) will be at or below threshold 2.
<tag>FrequentMers
don't compute frequent kmers, use those listed in this fasta file
don't compute frequent kmers, use those listed in this file
Mhap Overlapper Parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~
......@@ -515,15 +515,17 @@ The header line for each sequence provides some metadata on the sequence.::
If yes, sequence was detected as a repeat based on graph topology or read overlaps to other sequences.
suggestCircular
If yes, sequence is likely circular. Not implemented.
If yes, sequence is likely circular. The GFA file includes the CIGAR sequence for the overlap.
GRAPHS
<prefix>.contigs.gfa
Unused or ambiguous edges between contig sequences. Bubble edges cannot be represented in this format.
Canu versions prior to v1.9 created a GFA of the contig graph. However, as noted at the time, the
GFA format cannot represent partial overlaps between contigs (for more details see the discussion of
general edges on the `GFA2 <https://github.com/GFA-spec/GFA-spec/blob/master/GFA2.md>`_ page).
Because Canu contigs are not compatible with the GFA format, <prefix>.contigs.gfa has been removed.
<prefix>.unitigs.gfa
Contigs split at bubble intersections.
Since the GFA format cannot represent partial overlaps, the contigs are split at all such overlap junctions into unitigs. The unitigs capture non-branching subsequences within the contigs and will break at any ambiguity (e.g. a haplotype switch).
<prefix>.unitigs.bed
The position of each unitig in a contig.
......