Commits on Source (8)
......@@ -36,6 +36,7 @@ $stoppingCommits{"bbbdcd063560e5f86006ee6b8b96d2d7b80bb750"} = 1; # 21 NOV 20
$stoppingCommits{"64459fe33f97f6d23fe036ba1395743d0cdd03e4"} = 1; # 17 APR 2017
$stoppingCommits{"9e9bd674b705f89817b07ff30067210c2d180f42"} = 1; # 14 AUG 2017
$stoppingCommits{"0fff8a511fd7d74081d94ff9e0f6c0351650ae2e"} = 1; # 27 FEB 2018 - v1.7
$stoppingCommits{"fcc3fe19eb635abd735486d215fbf65c56bcf4ee"} = 1; # 22 OCT 2018 - v1.8
open(F, "< logs") or die "Failed to open 'logs': $!\n";
......
......@@ -225,6 +225,8 @@ my %derived;
$authcopy{$1} .= "$2\n";
} elsif (m/^D\s+(\S+)\s+(\S+)$/) {
$authcopy{$1} .= $authcopy{$2}; # Include all authors of old file in new file.
#$derived{$1} .= $derived{$2};
$derived{$1} .= "$2\n";
} else {
......
#!/bin/sh
# Before building a release:
#
# Update copyrights
# Increase version in documentation/source/conf.py
# Increase version in src/canu_version_update.pl
version=$1
if [ x$version = x ] ; then
......
canu (1.8+dfsg-1) unstable; urgency=medium
* Team upload.
* New upstream version
* Standards-Version: 4.2.1
* Remove unused paragraphs in d/copyright
* Fix perl interpreter path
-- Andreas Tille <tille@debian.org> Thu, 01 Nov 2018 08:56:31 +0100
canu (1.7.1+dfsg-1) unstable; urgency=medium
* Team upload.
......
......@@ -9,7 +9,7 @@ Build-Depends: debhelper (>= 11~),
# For File::Path
libfilesys-df-perl,
mhap (>= 2.1.3)
Standards-Version: 4.1.5
Standards-Version: 4.2.1
Vcs-Browser: https://salsa.debian.org/med-team/canu
Vcs-Git: https://salsa.debian.org/med-team/canu.git
Homepage: http://canu.readthedocs.org/en/latest/
......
......@@ -51,31 +51,6 @@ License: BSD-3-Clause-PacBio
OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
SUCH DAMAGE.
Files: src/AS_UTL/md5.C
Copyright: 1991-1992 RSA Data Security, Inc.
License: RSA
License to copy and use this software is granted provided that it
is identified as the "RSA Data Security, Inc. MD5 Message-Digest
Algorithm" in all material mentioning or referencing this software
or this function.
.
License is also granted to make and use derivative works provided
that such works are identified as "derived from the RSA Data
Security, Inc. MD5 Message-Digest Algorithm" in all material
mentioning or referencing the derived work.
.
RSA Data Security, Inc. makes no representations concerning either
the merchantability of this software or the suitability of this
software for any particular purpose. It is provided "as is"
without express or implied warranty of any kind.
.
These notices must be retained in any copies of any part of this
documentation and/or software.
Files: src/AS_UTL/mt19937ar.*
Copyright: 1997 - 2002 Makoto Matsumoto and Takuji Nishimura
License: BSD-3-Clause-BNBI
Files: debian/*
Copyright: 2016-2017 Afif Elghraoui <afif@debian.org>
License: GPL-2.0+
......
......@@ -3,11 +3,11 @@ Description: don't expect bundled MHAP
Author: Afif Elghraoui <afif@debian.org>
Forwarded: not-needed
Last-Update: 2018-03-10
--- canu.orig/src/Makefile
+++ canu/src/Makefile
@@ -615,7 +615,6 @@
--- a/src/Makefile
+++ b/src/Makefile
@@ -665,7 +665,6 @@ all: UPDATE_VERSION MAKE_DIRS \
$(addprefix ${TARGET_DIR}/,${ALL_TGTS}) \
${TARGET_DIR}/bin/canu \
${TARGET_DIR}/bin/trioCanu \
${TARGET_DIR}/bin/canu.defaults \
- ${TARGET_DIR}/share/java/classes/mhap-2.1.3.jar \
${TARGET_DIR}/lib/site_perl/canu/Consensus.pm \
......
Author: Andreas Tille <tille@debian.org>
Last-Update: Sat, 02 Sep 2017 15:30:21 +0200
Bug-Debian: https://bugs.debian.org/871390
Description: Fix gcc-7 error (violation of format-security)
 Passing a variable as the fprintf() format string is rejected when built
 with -Werror=format-security; print it through a constant "%s" format instead.
--- canu.orig/src/merTrim/merTrim.C
+++ canu/src/merTrim/merTrim.C
@@ -1790,7 +1790,7 @@
if (i+1 == clrEnd) { logLine[logPos++] = ']'; logLine[logPos++] = '-'; }
}
strcpy(logLine + logPos, " (ORI)\n");
- fprintf(stderr, logLine);
+ fprintf(stderr, "%s", logLine);
logPos = 0;
for (uint32 i=0; i<seqLen; i++) {
@@ -1800,7 +1800,7 @@
if (i+1 == clrEnd) { logLine[logPos++] = ']'; logLine[logPos++] = '-'; }
}
strcpy(logLine + logPos, " (SEQ)\n");
- fprintf(stderr, logLine);
+ fprintf(stderr, "%s", logLine);
if (corrSeq && verifySeq) {
uint32 i=0;
@@ -1821,7 +1821,7 @@
if (i+1 == clrEnd) { logLine[logPos++] = ']'; logLine[logPos++] = '-'; }
}
strcpy(logLine + logPos, " (VAL)\n");
- fprintf(stderr, logLine);
+ fprintf(stderr, "%s", logLine);
logPos = 0;
for (uint32 i=0; i<seqLen; i++) {
@@ -1831,7 +1831,7 @@
if (i+1 == clrEnd) { logLine[logPos++] = ']'; logLine[logPos++] = '-'; }
}
strcpy(logLine + logPos, " (VAL)\n");
- fprintf(stderr, logLine);
+ fprintf(stderr, "%s", logLine);
}
logPos = 0;
@@ -1842,7 +1842,7 @@
if (i+1 == clrEnd) { logLine[logPos++] = ']'; logLine[logPos++] = '-'; }
}
strcpy(logLine + logPos, " (QLT)\n");
- fprintf(stderr, logLine);
+ fprintf(stderr, "%s", logLine);
logPos = 0;
for (uint32 i=0; i<seqLen; i++) {
@@ -1852,7 +1852,7 @@
if (i+1 == clrEnd) { logLine[logPos++] = ']'; logLine[logPos++] = '-'; }
}
strcpy(logLine + logPos, " (COVERAGE)\n");
- fprintf(stderr, logLine);
+ fprintf(stderr, "%s", logLine);
logPos = 0;
for (uint32 i=0; i<seqLen; i++) {
@@ -1862,7 +1862,7 @@
if (i+1 == clrEnd) { logLine[logPos++] = ']'; logLine[logPos++] = '-'; }
}
strcpy(logLine + logPos, " (CORRECTIONS)\n");
- fprintf(stderr, logLine);
+ fprintf(stderr, "%s", logLine);
logPos = 0;
for (uint32 i=0; i<seqLen; i++) {
@@ -1872,7 +1872,7 @@
if (i+1 == clrEnd) { logLine[logPos++] = ']'; logLine[logPos++] = '-'; }
}
strcpy(logLine + logPos, " (DISCONNECTION)\n");
- fprintf(stderr, logLine);
+ fprintf(stderr, "%s", logLine);
logPos = 0;
for (uint32 i=0; i<seqLen; i++) {
@@ -1882,7 +1882,7 @@
if (i+1 == clrEnd) { logLine[logPos++] = ']'; logLine[logPos++] = '-'; }
}
strcpy(logLine + logPos, " (ADAPTER)\n");
- fprintf(stderr, logLine);
+ fprintf(stderr, "%s", logLine);
delete [] logLine;
}
use-debian-mhap-at-runtime.patch
gcc-7_format-security.patch
external-mhap.patch
......@@ -2,21 +2,21 @@ Description: Use mhap jar from /usr/share/java
Author: Afif Elghraoui <afif@debian.org>
Forwarded: not-needed
Last-Update: 2016-03-20
--- canu.orig/src/pipelines/canu/OverlapMhap.pm
+++ canu/src/pipelines/canu/OverlapMhap.pm
@@ -364,7 +364,7 @@
--- a/src/pipelines/canu/OverlapMhap.pm
+++ b/src/pipelines/canu/OverlapMhap.pm
@@ -368,7 +368,7 @@ sub mhapConfigure ($$$) {
print F "cd ./blocks\n";
print F "\n";
print F "$javaPath -d64 -server -Xmx", $javaMemory, "m \\\n";
print F "$javaPath $javaOpt -XX:ParallelGCThreads=", getGlobal("${tag}mhapThreads"), " -server -Xms", $javaMemory, "m -Xmx", $javaMemory, "m \\\n";
- print F " -jar $cygA \$bin/../share/java/classes/mhap-" . getGlobal("${tag}MhapVersion") . ".jar $cygB \\\n";
+ print F " -jar $cygA /usr/share/java/mhap.jar $cygB \\\n";
print F " --repeat-weight 0.9 --repeat-idf-scale 10 -k $merSize \\\n";
print F " --supress-noise 2 \\\n" if (defined(getGlobal("${tag}MhapFilterUnique")) && getGlobal("${tag}MhapFilterUnique") == 1);
print F " --no-tf \\\n" if (defined(getGlobal("${tag}MhapNoTf")) && getGlobal("${tag}MhapNoTf") == 1);
@@ -464,7 +464,7 @@
@@ -468,7 +468,7 @@ sub mhapConfigure ($$$) {
print F "\n";
print F "if [ ! -e ./results/\$qry.mhap ] ; then\n";
print F " $javaPath -d64 -server -Xmx", $javaMemory, "m \\\n";
print F " $javaPath $javaOpt -XX:ParallelGCThreads=", getGlobal("${tag}mhapThreads"), " -server -Xms", $javaMemory, "m -Xmx", $javaMemory, "m \\\n";
- print F " -jar $cygA \$bin/../share/java/classes/mhap-" . getGlobal("${tag}MhapVersion") . ".jar $cygB \\\n";
+ print F " -jar $cygA /usr/share/java/mhap.jar $cygB \\\n";
print F " --repeat-weight 0.9 --repeat-idf-scale 10 -k $merSize \\\n";
......
......@@ -18,3 +18,9 @@ override_dh_auto_build:
find $$builddir \
-name OverlapMhap.pm \
-exec sed -i 's#\(\s*my \$$javaPath = \).*#\1 "/usr/lib/jvm/java-8-openjdk-$(DEB_HOST_ARCH)/bin/java";#' {} +
override_dh_install:
dh_install
for pl in `grep -Rl '#![[:space:]]*/usr/bin/env[[:space:]]\+perl' debian/*/usr/*` ; do \
sed -i '1s?^#![[:space:]]*/usr/bin/env[[:space:]]\+perl?#!/usr/bin/perl?' $${pl} ; \
done
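In effect, this loop rewrites env-style perl shebangs to Debian's fixed interpreter path. A standalone sketch of the same substitution applied to a single file (the filename is illustrative)::

    # Turn '#!/usr/bin/env perl' into '#!/usr/bin/perl', first line only.
    sed -i '1s?^#![[:space:]]*/usr/bin/env[[:space:]]\+perl?#!/usr/bin/perl?' some-script.pl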
......@@ -35,7 +35,7 @@ bogart
When loading overlaps, an inflated maximum (to allow reruns with different error rates):
-eM 0.05 no more than 0.05 fraction (5.0%) error in any overlap loaded into bogart
the maximum used will ALWAYS be at leeast the maximum of the four error rates
the maximum used will ALWAYS be at least the maximum of the four error rates
For all, the lower limit on overlap length
-el 500 no shorter than 40 bases
......
......@@ -31,7 +31,7 @@ canu
If you want to change the defaults, use the various utg*ErrorRate options.
A full list of options can be printed with '-options'. All options
can be supplied in an optional sepc file.
can be supplied in an optional spec file.
Reads can be either FASTA or FASTQ format, uncompressed, or compressed
with gz, bz2 or xz. Reads are specified by the technology they were
......
......@@ -31,7 +31,7 @@ overlapInCore
-w filter out overlaps with too many errors in a window
-z skip the hopeless check
--maxerate <n> only output overlaps with fraction <n> or less error (e.g., 0.06 == 6%)
--maxrate <n> only output overlaps with fraction <n> or less error (e.g., 0.06 == 6%)
--minlength <n> only output overlaps of <n> or more bases
--hashbits n Use n bits for the hash mask.
......
......@@ -13,7 +13,7 @@ splitReads
-t bgn-end limit processing to only reads from bgn to end (inclusive)
-Ci clearFile path to input clear ranges (NOT SUPPORTED)
-Co clearFile path to ouput clear ranges
-Co clearFile path to output clear ranges
-e erate ignore overlaps with more than 'erate' percent error
......
......@@ -55,9 +55,9 @@ copyright = u'2015, Adam Phillippy, Sergey Koren, Brian Walenz'
# built documents.
#
# The short X.Y version.
version = '1.7'
version = '1.8'
# The full version, including alpha/beta/rc tags.
release = '1.7'
release = '1.8'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
......
......@@ -31,11 +31,11 @@ What resources does Canu require for a bacterial genome assembly? A mammalian as
How do I run Canu on my SLURM / SGE / PBS / LSF / Torque system?
-------------------------------------
Canu will detect and configure itself to use on most grids. You can supply your own grid
options, such as a partition on SLURM or an account code on SGE, with ``gridOptions="<your
options list>"`` which will passed to every job submitted by Canu. Similar options exist for
every stage of Canu, which could be used to, for example, restrict overlapping to a specific
partition or queue.
Canu will detect and configure itself to run on most grids. Canu will NOT request explicit time limits or
queues/partitions. You can supply your own grid options, such as a partition on SLURM, an account code
on SGE, and/or time limits with ``gridOptions="<your options list>"``, which will be passed to every job
submitted by Canu. Similar options exist for every stage of Canu, which could be used to, for example,
restrict overlapping to a specific partition or queue.
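For example, a minimal sketch of passing site-specific scheduler options through (the partition name, time limit, genome size, and read file are placeholders, not defaults)::

    canu -p asm -d asm-dir genomeSize=4.8m \
         gridOptions="--partition=long --time=72:00:00" \
         -pacbio-raw reads.fastq.gz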
To disable grid support and run only on the local machine, specify ``useGrid=false``
......@@ -61,6 +61,38 @@ My run stopped with the error ``'Failed to submit batch jobs'``
compute nodes.
My run of Canu was killed by the sysadmin; the power going out; my cat stepping on the power button; et cetera. Is it safe to restart? How do I restart?
-------------------------------------
Yes, perfectly safe! It's actually how Canu runs normally: each time Canu starts, it examines
the state of the assembly to decide what it should do next. For example, if six overlap tasks
have no results, it'll run just those six tasks.
This also means that if you want to redo some step, just remove those results from the assembly
directory. Some care needs to be taken to make sure results computed after those are also
removed.
Short answer: just rerun the *exact* same command as before. It'll do the right thing.
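As a sketch, assuming the run was started as below, restarting is literally the identical invocation (all names are illustrative)::

    # First attempt, killed partway through:
    canu -p asm -d asm-dir genomeSize=4.8m -nanopore-raw reads.fastq.gz
    # Restart: the exact same command; Canu inspects asm-dir and resumes.
    canu -p asm -d asm-dir genomeSize=4.8m -nanopore-raw reads.fastq.gz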
My genome size and assembly size are different, help!
-------------------------------------
The difference could be due to a heterozygous genome where the assembly separated some loci. It could also be because the previous estimate is incorrect. We typically use two analyses to see what happened. First, a `BUSCO <https://busco.ezlab.org>`_ analysis will indicate duplicated genes. For example this assembly::
INFO C:98.5%[S:97.9%,D:0.6%],F:1.0%,M:0.5%,n:2799
INFO 2756 Complete BUSCOs (C)
INFO 2740 Complete and single-copy BUSCOs (S)
INFO 16 Complete and duplicated BUSCOs (D)
does not have much duplication but this assembly::
INFO C:97.6%[S:15.8%,D:81.8%],F:0.9%,M:1.5%,n:2799
INFO 2732 Complete BUSCOs (C)
INFO 443 Complete and single-copy BUSCOs (S)
INFO 2289 Complete and duplicated BUSCOs (D)
does. We have had some success (in limited testing) using `purge_haplotigs <https://bitbucket.org/mroachawri/purge_haplotigs>`_ to remove duplication. Purge haplotigs will also generate a coverage plot which will usually have two peaks when assemblies have separated some loci.
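A minimal sketch of the BUSCO check described above, assuming BUSCO v4 or later is installed (the lineage and file names are placeholders; older releases use run_BUSCO.py with similar flags)::

    busco -i asm.contigs.fasta -l embryophyta_odb10 -m genome -o asm_busco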
What parameters should I use for my reads?
-------------------------------------
Canu is designed to be universal on a large range of PacBio (C2, P4-C2, P5-C3, P6-C4) and Oxford
......@@ -91,12 +123,12 @@ What parameters should I use for my reads?
Slightly decrease the maximum allowed difference in overlaps from the default of 14.4% to 12.0%
with ``correctedErrorRate=0.120``
**Early PacBio Sequel**
Based on exactly one publically released *A. thaliana* `dataset
**PacBio Sequel**
Based on an *A. thaliana* `dataset
<http://www.pacb.com/blog/sequel-system-data-release-arabidopsis-dataset-genome-assembly/>`_,
slightly decrease the maximum allowed difference from the default of 4.5% to 4.0% with
``correctedErrorRate=0.040 corMhapSensitivity=normal``. For recent Sequel data, the defaults
seem to be appropriate.
and a few more recent mammalian genomes, slightly increase the maximum allowed difference from the default of 4.5% to 8.5% with
``correctedErrorRate=0.085 corMhapSensitivity=normal``.
Only add the second parameter (``corMhapSensitivity=normal``) if you have >50x coverage.
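Putting that together, a sketch of a Sequel run with these settings (prefix, directory, genome size, and input file are placeholders)::

    canu -p asm -d asm-sequel genomeSize=135m \
         correctedErrorRate=0.085 corMhapSensitivity=normal \
         -pacbio-raw sequel-reads.fastq.gz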
**Nanopore R9 large genomes**
Due to some systematic errors, the identity estimate used by Canu for correction can be an
......@@ -106,6 +138,25 @@ What parameters should I use for my reads?
coverage.
Can I assemble RNA sequence data?
-------------------------------------
Canu will likely mis-assemble, or completely fail to assemble, RNA data. It will do a
reasonable job at generating corrected reads though. Reads are corrected using (local) best
alignments to other reads, and alignments between different isoforms are usually obviously not
'best'. Just like with DNA sequences, similar isoforms can get 'mixed' together. We've heard
of reasonable success from users, but do not have any parameter suggestions to make.
Note that Canu will silently translate 'U' bases to 'T' bases on input, but **NOT** translate
the output bases back to 'U'.
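If the corrected reads are all you are after, a correction-only sketch looks like this (names are illustrative; ``genomeSize`` is still required, so supply a rough transcriptome size)::

    # -correct stops after read correction; output is asm-rna/asm.correctedReads.fasta.gz
    canu -correct -p asm -d asm-rna genomeSize=10m -pacbio-raw isoseq-reads.fastq.gz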
My assembly is running out of space, or is too slow?
-------------------------------------
We don't have a good way to estimate the disk space used for an assembly. It varies with genome size, repeat content, and sequencing depth. A human genome sequenced with PacBio or Nanopore at 40-50x typically requires 1-2TB of space at the peak. Plants, unfortunately, seem to want a lot of space. 10TB is a reasonable guess. We've seen it as bad as 20TB on some very repetitive genomes.
The most common cause of high disk usage is a very repetitive or large genome. There are some parameters you can tweak to both reduce disk space and speed up the run. Try adding the options ``corMhapFilterThreshold=0.0000000002 corMhapOptions="--threshold 0.80 --num-hashes 512 --num-min-matches 3 --ordered-sketch-size 1000 --ordered-kmer-size 14 --min-olap-length 2000 --repeat-idf-scale 50" mhapMemory=60g mhapBlockSize=500 ovlMerThreshold=500``. This will suppress repeats more than the default settings and speed up both correction and assembly.
It is also possible to clean up some intermediate outputs before the assembly is complete to save space. If you already have a ``*.ovlStore.BUILDING/1-bucketize.success`` file in your current step (e.g. ``correct``), you can clean up the files under ``1-overlapper/blocks``. You can also remove the ovlStore for the previous step if you have its output (e.g. if you have ``asm.trimmedReads.fasta.gz``, you can remove ``trimming/asm.ovlStore``).
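As a sketch, the cleanup described above might look like the following, assuming the default directory layout and prefix ``asm``::

    # Safe once correction/asm.ovlStore.BUILDING/1-bucketize.success exists:
    rm -rf correction/1-overlapper/blocks
    # Safe once asm.trimmedReads.fasta.gz has been written:
    rm -rf trimming/asm.ovlStore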
My assembly continuity is not good, how can I improve it?
-------------------------------------
The most important determinant for assembly quality is sequence length, followed by the repeat
......@@ -160,13 +211,13 @@ What parameters can I tweak?
- ``corMinCoverage``, loosely, controls the quality of the corrected reads. It is the coverage
in evidence reads that is needed before a (portion of a) corrected read is reported.
Corrected reads are generated as a consensus of other reads; this is just the minimum ocverage
Corrected reads are generated as a consensus of other reads; this is just the minimum coverage
needed for the consensus sequence to be reported. The default is based on input read
coverage: 0x coverage for less than 30X input coverage, and 4x coverage for more than that.
For assembly:
- ``utgOvlErrorRate`` is essientially a speed optimization. Overlaps above this error rate are
- ``utgOvlErrorRate`` is essentially a speed optimization. Overlaps above this error rate are
not computed. Setting it too high generally just wastes compute time, while setting it too
low will degrade assemblies by missing true overlaps between lower quality reads.
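For instance, a sketch setting both of the knobs above explicitly (the values shown are purely illustrative, not recommendations)::

    canu -p asm -d asm-dir genomeSize=4.8m \
         corMinCoverage=4 utgOvlErrorRate=0.045 \
         -pacbio-raw reads.fastq.gz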
......@@ -194,19 +245,20 @@ What parameters can I tweak?
more conservative at picking the error rate to use for the assembly to try to maintain
haplotype separation. If it works, you'll end up with an assembly >= 2x your haploid
genome size. Post-processing using gene information or other synteny information is
required to remove redunancy from this assembly.
required to remove redundancy from this assembly.
2) **Smash haplotypes together** and then do phasing using another approach (like HapCUT2 or
whatshap or others). In that case you want to do the opposite, increase the error rates
used for finding overlaps:
``corOutCoverage=200 ovlErrorRate=0.15 obtErrorRate=0.15``
``corOutCoverage=200 correctedErrorRate=0.15``
Error rates for trimming (``obtErrorRate``) and assembling (``batErrorRate``) can usually
be left as is. When trimming, reads will be trimmed using other reads in the same
When trimming, reads will be trimmed using other reads in the same
chromosome (and probably some reads from other chromosomes). When assembling, overlaps
well outside the observed error rate distribution are discarded.
We typically prefer option 1 which will lead to a larger than expected genome size. We have had some success (in limited testing) using `purge_haplotigs <https://bitbucket.org/mroachawri/purge_haplotigs>`_ to remove this duplication.
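A sketch of option 2, smashing haplotypes together with the parameters above (names and genome size are placeholders)::

    canu -p asm -d asm-het genomeSize=500m \
         corOutCoverage=200 correctedErrorRate=0.15 \
         -pacbio-raw reads.fastq.gz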
For metagenomes:
The basic idea is to use all data for assembly rather than just the longest as default. The
......