Commits on Source (5)
......@@ -11,15 +11,15 @@ Canu is a hierarchical assembly pipeline which runs in four steps:
## Install:
The easiest way to get started is to download a [release](http://github.com/marbl/canu/releases).
* The easiest way to get started is to download a [release](http://github.com/marbl/canu/releases). Installing with a 'package manager' is not encouraged.
Alternatively, you can also build the latest unreleased version from github:
* Alternatively, you can use the latest unreleased version from the source code. This version has not undergone the same testing as a release and so may have unknown bugs or issues generating sub-optimal assemblies. We recommend the release version for most users.
    git clone https://github.com/marbl/canu.git
    cd canu/src
    make -j <number of threads>
The unreleased tip has not undergone the same testing as a release and so may have unknown bugs or issues generating sub-optimal assemblies. We recommend the release version for most users.
* An *unsupported* Docker image made by Frank Förster is at https://hub.docker.com/r/greatfireball/canu/.
## Learn:
......
......@@ -37,6 +37,7 @@ $stoppingCommits{"64459fe33f97f6d23fe036ba1395743d0cdd03e4"} = 1; # 17 APR 20
$stoppingCommits{"9e9bd674b705f89817b07ff30067210c2d180f42"} = 1; # 14 AUG 2017
$stoppingCommits{"0fff8a511fd7d74081d94ff9e0f6c0351650ae2e"} = 1; # 27 FEB 2018 - v1.7
$stoppingCommits{"fcc3fe19eb635abd735486d215fbf65c56bcf4ee"} = 1; # 22 OCT 2018 - v1.8
$stoppingCommits{"438412092d4470c63323f20780252b48ec132d44"} = 1; # 04 NOV 2019 - v1.9
open(F, "< logs") or die "Failed to open 'logs': $!\n";
......@@ -68,6 +69,10 @@ while (!eof(F)) {
$author = "Brian P. Walenz";
} elsif (m/koren/i) {
$author = "Sergey Koren";
} elsif (m/nurk/i) {
$author = "Sergey Nurk";
} elsif (m/rhie/i) {
$author = "Arang Rhie";
} else {
print STDERR "Skipping commit from '$_'\n";
$author = undef;
......
......@@ -226,7 +226,7 @@ my %derived;
} elsif (m/^D\s+(\S+)\s+(\S+)$/) {
$authcopy{$1} .= $authcopy{$2}; # Include all authors of old file in new file.
#$derived{$1} .= $derived{$2};
$derived{$1} .= $derived{$2};
$derived{$1} .= "$2\n";
} else {
......
#!/bin/sh
# Generate a script to compile Canu using the Holy Build Box.
echo > build-linux.sh \#\!/bin/bash
echo >> build-linux.sh yum install -y git
echo >> build-linux.sh cd /build/src
echo >> build-linux.sh gmake -j 12 \> ../Linux-amd64.out 2\>\&1
echo >> build-linux.sh cd ..
#echo >> build-linux.sh rm -rf Linux-amd64/obj
#echo >> build-linux.sh tar -cf canu-linux.Linux-amd64.tar canu-linux/README* canu-linux/Linux-amd64
chmod 755 build-linux.sh
echo ""
echo "-- Build Linux and make tarballs."
echo ""
echo "% docker run ..."
docker run \
-v `pwd`:/build \
-t \
-i \
--rm phusion/holy-build-box-64:latest /hbb_exe/activate-exec bash /build/build-linux.sh \
> build-linux.sh.out 2>&1
rm -f build-linux.sh
echo ""
echo "-- Build success?"
echo ""
tail -n 1 build-linux.sh.out | head -n 1
tail -n 2 Linux-amd64.out | head -n 1
# Fetch the Upload Agent and install in our bin/.
if [ ! -e src/pipelines/dx-canu/resources/bin/ua ] ; then
echo ""
echo "-- Fetch UploadAgent."
echo ""
curl -L -R -O https://dnanexus-sdk.s3.amazonaws.com/dnanexus-upload-agent-1.5.31-osx.zip
curl -L -R -O https://dnanexus-sdk.s3.amazonaws.com/dnanexus-upload-agent-1.5.31-linux.tar.gz
tar zxf dnanexus-upload-agent-1.5.31-linux.tar.gz
cp -p dnanexus-upload-agent-1.5.31-linux/ua src/pipelines/dx-canu/resources/bin/
cp -p dnanexus-upload-agent-1.5.31-linux/ua src/pipelines/dx-trio/resources/bin/
#rm -rf dnanexus-upload-agent-1.5.31-linux.tar.gz
rm -rf dnanexus-upload-agent-1.5.31-linux
fi
# Remove the old app.
echo ""
echo "-- Purge previous dx-canu and dx-trio builds."
echo ""
echo "% rm -rf dx-canu/ dx-trio/"
rm -rf dx-canu/ dx-trio/
mkdir -p dx-canu/ dx-trio/
mkdir -p dx-canu/resources/bin/ dx-trio/resources/bin/
mkdir -p dx-canu/resources/usr/bin/ dx-trio/resources/usr/bin/
mkdir -p dx-canu/resources/usr/lib/ dx-trio/resources/usr/lib/
mkdir -p dx-canu/resources/usr/share/ dx-trio/resources/usr/share/
# Package all that up into dx-canu.
echo ""
echo "-- Package new bits into dx-canu and dx-trio builds."
echo ""
echo "% rsync ..."
rsync -a src/pipelines/dx-canu/ dx-canu/
rsync -a Linux-amd64/bin/ dx-canu/resources/usr/bin/
rsync -a Linux-amd64/lib/ dx-canu/resources/usr/lib/
rsync -a Linux-amd64/share/ dx-canu/resources/usr/share/
rsync -a src/pipelines/dx-trio/ dx-trio/
rsync -a Linux-amd64/bin/ dx-trio/resources/usr/bin/
rsync -a Linux-amd64/lib/ dx-trio/resources/usr/lib/
rsync -a Linux-amd64/share/ dx-trio/resources/usr/share/
#rm -fr Linux-amd64/obj Linux-amd64.tar Linux-amd64.tar.gz
#tar -cf Linux-amd64.tar Linux-amd64
#gzip -1v Linux-amd64.tar
#dx rm Linux-amd64.tar.gz
#dx upload Linux-amd64.tar.gz
echo ""
echo "-- Build the DNAnexus apps."
echo ""
echo "% dx build -f dx-canu"
dx build -f dx-canu
echo "% dx build -f dx-trio"
dx build -f dx-trio
exit 0
......@@ -36,6 +36,15 @@ cd canu-$version
echo Build MacOS.
cd src
gmake -j 12 > ../Darwin-amd64.out 2>&1
echo Make static binaries MacOS
cd ../Darwin-amd64
if [ ! -e ../statifyOSX.py ]; then
curl -L -R -o ../statifyOSX.py https://raw.githubusercontent.com/marbl/canu/master/statifyOSX.py
fi
python ../statifyOSX.py bin lib true true >> ../Darwin-amd64.out 2>&1
python ../statifyOSX.py lib lib true true >> ../Darwin-amd64.out 2>&1
cd ../..
rm -f canu-$version/linux.sh
......
canu (1.9+dfsg-1) unstable; urgency=medium
* Team upload.
* New upstream version
* debhelper-compat 12
-- Steffen Moeller <moeller@debian.org> Sun, 10 Nov 2019 11:12:40 +0100
canu (1.8+dfsg-2) unstable; urgency=medium
* Team upload
......
......@@ -2,13 +2,13 @@ Source: canu
Maintainer: Debian Med Packaging Team <debian-med-packaging@lists.alioth.debian.org>
Section: science
Priority: optional
Build-Depends: debhelper (>= 11~),
Build-Depends: debhelper-compat (= 12),
libboost-dev,
libmeryl-dev,
# For File::Path
libfilesys-df-perl,
mhap (>= 2.1.3)
Standards-Version: 4.2.1
Standards-Version: 4.4.1
Vcs-Browser: https://salsa.debian.org/med-team/canu
Vcs-Git: https://salsa.debian.org/med-team/canu.git
Homepage: http://canu.readthedocs.org/en/latest/
......
Index: canu/src/canu_version_update.pl
===================================================================
--- canu.orig/src/canu_version_update.pl
+++ canu/src/canu_version_update.pl
@@ -52,7 +52,8 @@ my $dirtyc = undef;
# If in a git repo, we can get the actual values.
-if (-d "../.git") {
+#if (-d "../.git") {
+if (0) {
$label = "snapshot";
# Count the number of changes since the last release.
......@@ -3,13 +3,15 @@ Description: don't expect bundled MHAP
Author: Afif Elghraoui <afif@debian.org>
Forwarded: not-needed
Last-Update: 2018-03-10
--- a/src/Makefile
+++ b/src/Makefile
@@ -665,7 +665,6 @@ all: UPDATE_VERSION MAKE_DIRS \
$(addprefix ${TARGET_DIR}/,${ALL_TGTS}) \
Index: canu/src/Makefile
===================================================================
--- canu.orig/src/Makefile
+++ canu/src/Makefile
@@ -670,7 +670,6 @@ all: UPDATE_VERSION MAKE_DIRS \
${TARGET_DIR}/bin/canu \
${TARGET_DIR}/bin/canu-time \
${TARGET_DIR}/bin/canu.defaults \
- ${TARGET_DIR}/share/java/classes/mhap-2.1.3.jar \
${TARGET_DIR}/lib/site_perl/canu/Consensus.pm \
${TARGET_DIR}/lib/site_perl/canu/CorrectReads.pm \
${TARGET_DIR}/lib/site_perl/canu/HaplotypeReads.pm \
${TARGET_DIR}/share/sequence/ultra-long-nanopore \
${TARGET_DIR}/share/sequence/pacbio \
${TARGET_DIR}/share/sequence/pacbio-hifi\
use-debian-mhap-at-runtime.patch
external-mhap.patch
tell_version_properly.patch
canu_version.patch
......@@ -3,8 +3,10 @@ Bug-Debian: https://bugs.debian.org/915269
Author: Andreas Tille <tille@debian.org>
Lest-Update: Mon, 03 Dec 2018 09:53:52 +0100
--- a/src/canu_version_update.pl
+++ b/src/canu_version_update.pl
Index: canu/src/canu_version_update.pl
===================================================================
--- canu.orig/src/canu_version_update.pl
+++ canu/src/canu_version_update.pl
@@ -34,7 +34,7 @@ use Cwd qw(getcwd);
my $cwd = getcwd();
......@@ -12,5 +14,5 @@ Lest-Update: Mon, 03 Dec 2018 09:53:52 +0100
-my $label = "snapshot"; # Automagically set to 'release' for releases.
+my $label = "release"; # Automagically set to 'release' for releases.
my $major = "1"; # Bump before release.
my $minor = "8"; # Bump before release.
my $minor = "9"; # Bump before release.
......@@ -2,9 +2,11 @@ Description: Use mhap jar from /usr/share/java
Author: Afif Elghraoui <afif@debian.org>
Forwarded: not-needed
Last-Update: 2016-03-20
--- a/src/pipelines/canu/OverlapMhap.pm
+++ b/src/pipelines/canu/OverlapMhap.pm
@@ -368,7 +368,7 @@ sub mhapConfigure ($$$) {
Index: canu/src/pipelines/canu/OverlapMhap.pm
===================================================================
--- canu.orig/src/pipelines/canu/OverlapMhap.pm
+++ canu/src/pipelines/canu/OverlapMhap.pm
@@ -365,7 +365,7 @@ sub mhapConfigure ($$$) {
print F "cd ./blocks\n";
print F "\n";
print F "$javaPath $javaOpt -XX:ParallelGCThreads=", getGlobal("${tag}mhapThreads"), " -server -Xms", $javaMemory, "m -Xmx", $javaMemory, "m \\\n";
......@@ -13,9 +15,9 @@ Last-Update: 2016-03-20
print F " --repeat-weight 0.9 --repeat-idf-scale 10 -k $merSize \\\n";
print F " --supress-noise 2 \\\n" if (defined(getGlobal("${tag}MhapFilterUnique")) && getGlobal("${tag}MhapFilterUnique") == 1);
print F " --no-tf \\\n" if (defined(getGlobal("${tag}MhapNoTf")) && getGlobal("${tag}MhapNoTf") == 1);
@@ -468,7 +468,7 @@ sub mhapConfigure ($$$) {
@@ -473,7 +473,7 @@ sub mhapConfigure ($$$) {
print F "\n";
print F "if [ ! -e ./results/\$qry.mhap ] ; then\n";
print F "if [ ! -e \$outPath/\$qry.mhap ] ; then\n";
print F " $javaPath $javaOpt -XX:ParallelGCThreads=", getGlobal("${tag}mhapThreads"), " -server -Xms", $javaMemory, "m -Xmx", $javaMemory, "m \\\n";
- print F " -jar $cygA \$bin/../share/java/classes/mhap-" . getGlobal("${tag}MhapVersion") . ".jar $cygB \\\n";
+ print F " -jar $cygA /usr/share/java/mhap.jar $cygB \\\n";
......
......@@ -24,3 +24,7 @@ override_dh_install:
for pl in `grep -Rl '#![[:space:]]*/usr/bin/env[[:space:]]\+perl' debian/*/usr/*` ; do \
sed -i '1s?^#![[:space:]]*/usr/bin/env[[:space:]]\+perl?#!/usr/bin/perl?' $${pl} ; \
done
override_dh_auto_clean:
rm -rf Linux-*
rm -f src/canu_version.H
......@@ -55,9 +55,9 @@ copyright = u'2015, Adam Phillippy, Sergey Koren, Brian Walenz'
# built documents.
#
# The short X.Y version.
version = '1.8'
version = '1.9'
# The full version, including alpha/beta/rc tags.
release = '1.8'
release = '1.9'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
......
......@@ -45,6 +45,11 @@ How do I run Canu on my SLURM / SGE / PBS / LSF / Torque system?
`Issue #756 <https://github.com/marbl/canu/issues/756>`_ for Slurm and SGE examples.
My run stopped with the error ``'Mhap precompute jobs failed'``
-------------------------------------
Several package managers (conda and Ubuntu in particular) make a mess of the installation, causing this error. Package managers add little benefit to a tool like Canu, which is distributed as pre-compiled binaries compatible with most systems, so our recommended installation method is downloading a binary release. Try re-running the assembly from scratch using our release distribution and, if you continue to encounter errors, submit an issue.
My run stopped with the error ``'Failed to submit batch jobs'``
-------------------------------------
......@@ -120,23 +125,21 @@ What parameters should I use for my reads?
contiguous assemblies on large genomes.
**Nanopore R9 2D** and **PacBio P6**
Slightly decrease the maximum allowed difference in overlaps from the default of 14.4% to 12.0%
with ``correctedErrorRate=0.120``
Slightly decrease the maximum allowed difference in overlaps from the default of 12% to 10.5%
with ``correctedErrorRate=0.105``
**PacBio Sequel**
**PacBio Sequel V2**
Based on an *A. thaliana* `dataset
<http://www.pacb.com/blog/sequel-system-data-release-arabidopsis-dataset-genome-assembly/>`_,
and a few more recent mammalian genomes, slightly increase the maximum allowed difference from the default of 4.5% to 8.5% with
``correctedErrorRate=0.085 corMhapSensitivity=normal``.
Only add the second parameter (``corMhapSensivity=normal``) if you have >50x coverage.
**Nanopore R9 large genomes**
Due to some systematic errors, the identity estimate used by Canu for correction can be an
over-estimate of true error, inflating runtime. For recent large genomes (>1gbp) with more
than 30x coverage, we've used ``'corMhapOptions=--threshold 0.8 --num-hashes
512 --ordered-sketch-size 1000 --ordered-kmer-size 14'``. This is not needed below 30x
coverage.
**PacBio Sequel V3**
The defaults for PacBio should work on this data.
**Nanopore flip-flop R9.4**
Based on a human dataset, the flip-flop basecaller reduces both the raw read error rate and the residual error rate remaining after Canu read correction. For this reason you can reduce the error tolerated by Canu. If you have over 30x coverage add the options: ``'corMhapOptions=--threshold 0.8 --ordered-sketch-size 1000 --ordered-kmer-size 14' correctedErrorRate=0.105``. This is primarily a speed optimization so you can use defaults, especially if your genome's accuracy is not improved by the flip-flop caller.
Can I assemble RNA sequence data?
-------------------------------------
......@@ -153,7 +156,7 @@ My assembly is running out of space, is too slow?
-------------------------------------
We don't have a good way to estimate the disk space used for the assembly. It varies with genome size, repeat content, and sequencing depth. A human genome sequenced with PacBio or Nanopore at 40-50x typically requires 1-2TB of space at the peak. Plants, unfortunately, seem to want a lot of space. 10TB is a reasonable guess. We've seen it as bad as 20TB on some very repetitive genomes.
The most common cause of high disk usage is a very repetitive or large genome. There are some parameters you can tweak to both reduce disk space and speed up the run. Try adding the options ``corMhapFilterThreshold=0.0000000002 corMhapOptions="--threshold 0.80 --num-hashes 512 --num-min-matches 3 --ordered-sketch-size 1000 --ordered-kmer-size 14 --min-olap-length 2000 --repeat-idf-scale 50" mhapMemory=60g mhapBlockSize=500 ovlMerThreshold=500``. This will suppress repeats more than the default settings and speed up both correction and assembly.
The most common cause of high disk usage is a very repetitive or large genome. There are some parameters you can tweak to both reduce disk space and speed up the run. Try adding the options ``corMhapFilterThreshold=0.0000000002 corMhapOptions="--threshold 0.80 --num-hashes 512 --num-min-matches 3 --ordered-sketch-size 1000 --ordered-kmer-size 14 --min-olap-length 2000 --repeat-idf-scale 50" mhapMemory=60g mhapBlockSize=500 ovlMerDistinct=0.975``. This will suppress repeats more than the default settings and speed up both correction and assembly.
It is also possible to clean up some intermediate outputs before the assembly is complete to save space. If you already have a ``*.ovlStore.BUILDING/1-bucketize.success`` file in your current step (e.g. ``correction``), you can clean up the files under ``1-overlapper/blocks``. You can also remove the ovlStore for the previous step if you have its output (e.g. if you have ``asm.trimmedReads.fasta.gz``, you can remove ``trimming/asm.ovlStore``).
......@@ -322,8 +325,17 @@ Why do I get less corrected read data than I asked for?
What is the minimum coverage required to run Canu?
-------------------------------------
For eukaryotic genomes, coverage more than 20X is enough to outperform current hybrid
methods. Below that, you will likely not assemble the full genome.
methods. Below that, you will likely not assemble the full genome. The following
two papers have several examples.
* `Koren et al. (2013) Reducing assembly complexity of microbial genomes with single-molecule sequencing <https://www.ncbi.nlm.nih.gov/pubmed/24034426>`_
* `Koren and Walenz et al. (2017) Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation <https://www.ncbi.nlm.nih.gov/pubmed/28298431>`_
Can I use Illumina data too?
-------------------------------------
No. We've seen that using short reads for correction will homogenize repeats and
mix up haplotypes. Even though the short reads are very high quality, their length
isn't sufficient for the true alignment to be identified, and so reads from other repeat
instances are used for correction, resulting in incorrect corrections.
My circular element is duplicated/has overlap?
-------------------------------------
......
......@@ -105,6 +105,13 @@ genomeSize <float=unset> *required*
parameter) and how sensitive the mhap overlapper should be (via the :ref:`mhapSensitivity <mhapSensitivity>`
parameter). It also impacts some logging, in particular, reports of NG50 sizes.
.. _fast:
fast <toggle>
This option uses MHAP overlapping for all steps, not just correction, making assembly significantly faster. It can be used on any genome size but may produce less continuous assemblies on genomes larger than 1 Gbp. It is recommended for nanopore genomes smaller than 1 Gbp or metagenomes.
The fast option will also optionally use `wtdbg <https://github.com/ruanjue/wtdbg2>`_ for unitigging if wtdbg is manually copied to the Canu binary folder. However, this is only tested with very small genomes and is **NOT** recommended.
.. _canuIteration:
canuIteration <internal parameter, do not use>
......@@ -189,22 +196,49 @@ stopOnLowCoverage <integer=10>
.. _stopAfter:
stopAfter <string=undefined>
If set, Canu will stop processing after a specific stage in the pipeline finishes.
Valid values for ``stopAfter`` are:
- ``gatekeeper`` - stops after the reads are loaded into the assembler read database.
- ``meryl`` - stops after frequent kmers are tabulated.
- ``overlapConfigure`` - stops after overlap jobs are configured.
- ``overlap`` - stops after overlaps are generated, before they are loaded into the overlap database.
- ``overlapStoreConfigure`` - stops after the jobs for creating the overlap store are configured.
- ``overlapStore`` - stops after overlaps are loaded into the overlap database.
- ``readCorrection`` - stops after corrected reads are generated.
- ``readTrimming`` - stops after trimmed reads are generated.
- ``unitig`` - stops after unitigs and contigs are created.
- ``consensusConfigure`` - stops after consensus jobs are configured.
- ``consensus`` - stops after consensus sequences are loaded into the databases.
If set, Canu will stop processing after a specific stage in the pipeline finishes. Valid values are:
+-----------------------+-------------------------------------------------------------------+
| **stopAfter=** | **Canu will stop after ....** |
+-----------------------+-------------------------------------------------------------------+
| sequenceStore | reads are loaded into the assembler read database. |
+-----------------------+-------------------------------------------------------------------+
| meryl-configure | kmer counting jobs are configured. |
+-----------------------+-------------------------------------------------------------------+
| meryl-count | kmers are counted, but not processed into one database. |
+-----------------------+-------------------------------------------------------------------+
| meryl-merge | kmers are merged into one database. |
+-----------------------+-------------------------------------------------------------------+
| meryl-process | frequent kmers are generated. |
+-----------------------+-------------------------------------------------------------------+
| meryl-subtract | haplotype specific kmers are generated. |
+-----------------------+-------------------------------------------------------------------+
| meryl | all kmer work is complete. |
+-----------------------+-------------------------------------------------------------------+
| haplotype-configure | haplotype read separation jobs are configured. |
+-----------------------+-------------------------------------------------------------------+
| haplotype | haplotype-specific reads are generated. |
+-----------------------+-------------------------------------------------------------------+
| overlapConfigure | overlap jobs are configured. |
+-----------------------+-------------------------------------------------------------------+
| overlap | overlaps are generated, before they are loaded into the database. |
+-----------------------+-------------------------------------------------------------------+
| overlapStoreConfigure | the jobs for creating the overlap database are configured. |
+-----------------------+-------------------------------------------------------------------+
| overlapStore | overlaps are loaded into the overlap database. |
+-----------------------+-------------------------------------------------------------------+
| correction | corrected reads are generated. |
+-----------------------+-------------------------------------------------------------------+
| trimming | trimmed reads are generated. |
+-----------------------+-------------------------------------------------------------------+
| unitig | unitigs and contigs are created. |
+-----------------------+-------------------------------------------------------------------+
| consensusConfigure | consensus jobs are configured. |
+-----------------------+-------------------------------------------------------------------+
| consensus | consensus sequences are loaded into the databases. |
+-----------------------+-------------------------------------------------------------------+
*readCorrection* and *readTrimming* are deprecated synonyms for *correction* and *trimming*, respectively.
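The stage names in the table above, including the deprecated synonyms, can be validated before launching a long run. This is an illustrative helper, not part of Canu:

```python
# Valid stopAfter stages, as listed in the table above.
VALID_STOP_AFTER = {
    "sequenceStore", "meryl-configure", "meryl-count", "meryl-merge",
    "meryl-process", "meryl-subtract", "meryl",
    "haplotype-configure", "haplotype",
    "overlapConfigure", "overlap", "overlapStoreConfigure", "overlapStore",
    "correction", "trimming", "unitig", "consensusConfigure", "consensus",
}

# Deprecated synonyms map onto their current names.
DEPRECATED = {"readCorrection": "correction", "readTrimming": "trimming"}

def normalize_stop_after(value):
    """Return the canonical stage name, or raise ValueError if unknown."""
    value = DEPRECATED.get(value, value)
    if value not in VALID_STOP_AFTER:
        raise ValueError(f"unknown stopAfter stage: {value}")
    return value
```

Note that ``gatekeeper``, valid in earlier releases, is no longer accepted; ``sequenceStore`` replaces it.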
General Options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
......@@ -216,6 +250,9 @@ shell <string="/bin/sh">
java <string="java">
A path to a Java application launcher of at least version 1.8.
minimap <string="minimap2">
A path to the minimap2 versatile pairwise aligner.
gnuplot <string="gnuplot">
A path to the gnuplot graphing utility. Plotting is disabled if this is unset
(`gnuplot=` or `gnuplot=undef`), or if gnuplot fails to execute, or if gnuplot
......@@ -263,17 +300,45 @@ Cleanup Options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
saveOverlaps <boolean=false>
If set to 'false', the raw overlapper outputs are removed as soon as they are loaded into an
overlap store. Also, the correction and trimming overlap stores are removed when they are no
longer needed. This is recommended in nearly every case.
If 'true', retain all overlap stores. If 'false', delete the correction
and trimming overlap stores when they are no longer useful. Overlaps used
for contig construction are never deleted.
purgeOverlaps <string=normal>
Controls when to remove intermediate overlap results.
'never' removes no intermediate overlap results. This is only useful if
you have a desire to exhaust your disk space.
'false' is the same as 'never'.
'normal' removes intermediate overlap results after they are loaded into an
overlap store.
'true' is the same as 'normal'.
'aggressive' removes intermediate overlap results as soon as possible. In
the event of a corrupt or lost file, this can result in a fair amount of
suffering to recompute the data. In particular, overlapper output is removed
as soon as it is loaded into buckets, and buckets are removed once they are
rewritten as sorted overlaps.
If set to 'stores', the raw overlapper outputs are removed, but all of the overlap stores are
retained. The overlap stores capture all the critical information in the raw outputs and the raw
outputs are redundant and unwieldy. Retaining the overlap stores can allow one to 'back up' and
redo a step, but this is generally not useful unless one is familiar with the algorithms.
'dangerous' removes intermediate results as soon as possible, in some
cases, before they are even fully processed. In addition to corrupt files,
jobs killed by out of memory, power outages, stray cosmic rays, et cetera,
will result in a fair amount of suffering to recompute the lost data. This
mode can help when creating ginormous overlap stores, by removing the
bucketized data immediately after it is loaded into the sorting jobs, thus
making space for the output of the sorting jobs.
If set to 'true', all overlapper outputs and all stores are retained. This is useful for
debugging potential problems with the overlap store.
Use 'normal' for non-large assemblies, and when disk space is plentiful.
Use 'aggressive' on large assemblies when disk space is tight. Never use
'dangerous', unless you know how to recover from an error and you fully
trust your compute environment.
For Mhap and Minimap2, the raw overlaps (in Mhap and PAF format) are
deleted immediately after being converted to Canu ovb format, except when
purgeOverlaps=never.
saveReadCorrections <boolean=false>
If set, do not remove raw corrected read output from correction/2-correction. Normally, this
......@@ -356,7 +421,36 @@ Overlapper Configuration, ovl Algorithm
overlaps with :ref:`mhapReAlign <mhapReAlign>`.
{prefix}OvlFrequentMers <string=undefined>
Do not seed overlaps with these kmers (fasta format).
Do not seed overlaps with these kmers, or, for mhap, do not seed with these kmers unless necessary (down-weight them).
For corFrequentMers (mhap), the file must contain a single-line header followed by number-of-kmers data lines::
0 number-of-kmers
forward-kmer word-frequency kmer-count total-number-of-kmers
reverse-kmer word-frequency kmer-count total-number-of-kmers
Where `kmer-count` is the number of times this kmer sequence occurs in the reads, 'total-number-of-kmers'
is the number of kmers in the reads (including duplicates; roughly the number of bases in the reads),
and 'word-frequency' is 'kmer-count' / 'total-number-of-kmers'.
For example::
0 4
AAAATAATAGACTTATCGAGTC 0.0000382200 52 1360545
GACTCGATAAGTCTATTATTTT 0.0000382200 52 1360545
AAATAATAGACTTATCGAGTCA 0.0000382200 52 1360545
TGACTCGATAAGTCTATTATTT 0.0000382200 52 1360545
This file must be gzip compressed.
For obtFrequentMers and ovlFrequentMers, the file must contain a list of the canonical kmers and
their count on a single line. The count value is ignored, but needs to be present. This file
should not be compressed.
For example::
AAAATAATAGACTTATCGAGTC 52
AAATAATAGACTTATCGAGTCA 52
{prefix}OvlHashBits <integer=unset>
Width of the kmer hash. Width 22=1gb, 23=2gb, 24=4gb, 25=8gb. Plus 10b per ovlHashBlockLength.
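The sizing rule above (22 bits = 1gb, doubling per extra bit, plus "10b" per ovlHashBlockLength, which we read as ten bytes per base) can be written as a back-of-the-envelope estimate. Treat this sketch as an approximation, not Canu's exact accounting:

```python
def ovl_hash_memory_bytes(hash_bits, hash_block_length=0):
    """Rough memory estimate for the ovl hash table.

    22 bits -> 1 GB, and each additional bit doubles the table;
    add ~10 bytes per base of ovlHashBlockLength (assumption).
    """
    return (2 ** 30) * 2 ** (hash_bits - 22) + 10 * hash_block_length
```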
......@@ -484,6 +578,12 @@ trimReadsCoverage <integer=1>
Minimum depth of evidence to retain bases.
Trio binning Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. _hapUnknownFraction:
hapUnknownFraction <float=0.05>
Fraction of unclassified bases to ignore for haplotype assemblies. If there are more than this fraction of unclassified bases, they are included in both haplotype assemblies.
.. _grid-engine:
......@@ -539,32 +639,13 @@ commands, they all start with ``gridEngine``. For each grid, these parameters a
various ``src/pipeline/Grid_*.pm`` modules. The parameters are used in
``src/pipeline/canu/Execution.pm``.
For SGE grids, two options are sometimes necessary to tell canu about pecularities of your grid:
``gridEngineThreadsOption`` describes how to request multiple cores, and ``gridEngineMemoryOption``
describes how to request memory. Usually, canu can figure out how to do this, but sometimes it
reports an error such as::
-- WARNING: Couldn't determine the SGE parallel environment to run multi-threaded codes.
-- Valid choices are (pick one and supply it to canu):
-- gridEngineThreadsOption="-pe make THREADS"
-- gridEngineThreadsOption="-pe make-dedicated THREADS"
-- gridEngineThreadsOption="-pe mpich-rr THREADS"
-- gridEngineThreadsOption="-pe openmpi-fill THREADS"
-- gridEngineThreadsOption="-pe smp THREADS"
-- gridEngineThreadsOption="-pe thread THREADS"
or::
-- WARNING: Couldn't determine the SGE resource to request memory.
-- Valid choices are (pick one and supply it to canu):
-- gridEngineMemoryOption="-l h_vmem=MEMORY"
-- gridEngineMemoryOption="-l mem_free=MEMORY"
If you get such a message, just add the appropriate line to your canu command line. Both options
will replace the uppercase text (THREADS or MEMORY) with the value canu wants when the job is
submitted. For ``gridEngineMemoryOption``, any number of ``-l`` options can be supplied; we could
use ``gridEngineMemoryOption="-l h_vmem=MEMORY -l mem_free=MEMORY"`` to request both ``h_vmem`` and
``mem_free`` memory.
In Canu 1.8 and earlier, ``gridEngineMemoryOption`` and ``gridEngineThreadsOption`` are used to tell
Canu how to request resources from the grid. Starting with ``snapshot v1.8 +90 changes`` (roughly
January 11th), those options were merged into ``gridEngineResourceOption``. These options specify
the grid options needed to request memory and threads for each job. For example, the default
``gridEngineResourceOption`` for PBS/Torque is "-l nodes=1:ppn=THREADS:mem=MEMORY", and for Slurm it
is "--cpus-per-task=THREADS --mem-per-cpu=MEMORY". Canu will replace "THREADS" and "MEMORY" with
the specific values needed for each job.
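The placeholder substitution described above amounts to a simple string replacement. The templates below are the defaults quoted in the text; the per-job values are invented examples:

```python
def expand_resource_option(template, threads, memory):
    """Replace THREADS and MEMORY with the per-job values Canu computes."""
    return template.replace("THREADS", str(threads)).replace("MEMORY", memory)

# Defaults quoted above, expanded for a hypothetical 4-thread, 16g job:
pbs   = expand_resource_option("-l nodes=1:ppn=THREADS:mem=MEMORY", 4, "16g")
slurm = expand_resource_option("--cpus-per-task=THREADS --mem-per-cpu=MEMORY", 4, "16g")
```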
.. _grid-options:
......
......@@ -81,7 +81,7 @@ For Nanopore::
Output and intermediate files will be in directories 'ecoli-pacbio' and 'ecoli-nanopore',
respectively. Intermediate files are written in directories 'correction', 'trimming' and
'unitigging' for the respective stages. Output files are named using the '-p' prefix, such as
'ecoli.contigs.fasta', 'ecoli.contigs.gfa', etc. See section :ref:`outputs` for more details on
'ecoli.contigs.fasta', 'ecoli.unitigs.gfa', etc. See section :ref:`outputs` for more details on
outputs (intermediate files aren't documented).
......@@ -168,14 +168,25 @@ Canu has support for using parental short-read sequencing to classify and bin th
curl -L -o O157.parental.fasta https://gembox.cbcb.umd.edu/triobinning/example/o157.12.fasta
curl -L -o F1.fasta https://gembox.cbcb.umd.edu/triobinning/example/pacbio.fasta
trioCanu \
canu \
-p asm -d ecoliTrio \
genomeSize=5m \
-haplotypeK12 K12.parental.fasta \
-haplotypeO157 O157.parental.fasta \
-pacbio-raw F1.fasta
The run will produce two assemblies, ecoliTrio/haplotypeK12/asm.contigs.fasta and ecoliTrio/haplotypeO157/asm.contigs.fasta. As comparison, you can try co-assembling the datasets instead::
The run will first bin the reads into the haplotypes (``ecoliTrio/haplotype/haplotype-*.fasta.gz``) and provide a summary of the classification in ``ecoliTrio/haplotype/haplotype.log``::
-- Processing reads in batches of 100 reads each.
--
-- 119848 reads 378658103 bases written to haplotype file ./haplotype-K12.fasta.gz.
-- 308353 reads 1042955878 bases written to haplotype file ./haplotype-O157.fasta.gz.
-- 4114 reads 6520294 bases written to haplotype file ./haplotype-unknown.fasta.gz.
Next, the haplotypes are assembled in ``ecoliTrio/asm-haplotypeK12/asm-haplotypeK12.contigs.fasta`` and ``ecoliTrio/asm-haplotypeO157/asm-haplotypeO157.contigs.fasta``. By default, if the unassigned bases are > 5% of the total, they are included in both haplotypes. This can be controlled with the :ref:`hapUnknownFraction <hapUnknownFraction>` option.
As comparison, you can try co-assembling the datasets instead::
canu \
-p asm -d ecoliHap \
......@@ -183,11 +194,7 @@ The run will produce two assemblies, ecoliTrio/haplotypeK12/asm.contigs.fasta an
corOutCoverage=200 "batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50" \
-pacbio-raw F1.fasta
and compare the contiguity/accuracy. The current version of trioCanu is not yet optimized for memory use so requires adjusted parameters for large genomes. Adding the options::
gridOptionsExecutive="--mem=250g" gridOptionsMeryl='--partition=largemem --mem=1000g'
should be sufficient for a mammalian genome.
and compare the continuity/accuracy.
Consensus Accuracy
-------------------
......
......@@ -367,7 +367,7 @@ and 8 'distinct' kmers.
pick a threshold so as to seed overlaps using this fraction of all kmers in the input. In the example above,
fraction 0.667 of the k-mers (8/12) will be at or below threshold 2.
<tag>FrequentMers
don't compute frequent kmers, use those listed in this fasta file
don't compute frequent kmers, use those listed in this file
Mhap Overlapper Parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~
......@@ -515,15 +515,17 @@ The header line for each sequence provides some metadata on the sequence.::
If yes, sequence was detected as a repeat based on graph topology or read overlaps to other sequences.
suggestCircular
If yes, sequence is likely circular. Not implemented.
If yes, sequence is likely circular. The GFA file includes the CIGAR sequence for the overlap.
GRAPHS
<prefix>.contigs.gfa
Unused or ambiguous edges between contig sequences. Bubble edges cannot be represented in this format.
Canu versions prior to v1.9 created a GFA of the contig graph. However, as noted at the time, the
GFA format cannot represent partial overlaps between contigs (for more details see the discussion of
general edges on the `GFA2 <https://github.com/GFA-spec/GFA-spec/blob/master/GFA2.md>`_ page).
Because Canu contigs are not compatible with the GFA format, <prefix>.contigs.gfa has been removed.
<prefix>.unitigs.gfa
Contigs split at bubble intersections.
Since the GFA format cannot represent partial overlaps, the contigs are split at all such overlap junctions into unitigs. The unitigs capture non-branching subsequences within the contigs and will break at any ambiguity (e.g. a haplotype switch).
<prefix>.unitigs.bed
The position of each unitig in a contig.
......