@@ -19,6 +19,8 @@ Alternatively, you can also build the latest unreleased from github:
cd canu/src
make -j <number of threads>
The unreleased tip has not undergone the same testing as a release and so may have unknown bugs or issues generating sub-optimal assemblies. We recommend the release version for most users.
## Learn:
The [quick start](http://canu.readthedocs.io/en/latest/quick-start.html) will get you assembling quickly, while the [tutorial](http://canu.readthedocs.io/en/latest/tutorial.html) explains things in more detail.
@@ -13,7 +13,7 @@ What resources does Canu require for a bacterial genome assembly? A mammalian as
-------------------------------------
Canu will detect available resources and configure itself to run efficiently using those
resources. It will request resources, for example, the number of compute threads to use, based
on the ``genomeSize`` being assembled. It will fail to even start if it feels there are
on the genome size being assembled. It will fail to even start if it feels there are
insufficient resources available.
A typical bacterial genome can be assembled with 8GB memory in a few CPU hours - around an hour
...
...
@@ -39,44 +39,71 @@ How do I run Canu on my SLURM / SGE / PBS / LSF / Torque system?
To disable grid support and run only on the local machine, specify ``useGrid=false``
It is possible to limit the number of grid jobs running at the same time, but this isn't
directly supported by Canu. The various :ref:`gridOptions <grid-options>` parameters
can pass grid-specific parameters to the submit commands used; see
`Issue #756 <https://github.com/marbl/canu/issues/756>`_ for Slurm and SGE examples.
My run stopped with the error ``'Failed to submit batch jobs'``
-------------------------------------
The grid you run on must allow compute nodes to submit jobs. This means that if you are on a compute host, ``qsub/bsub/sbatch/etc`` must be available and working. You can test this by starting an interactive compute session and running the submit command manually (e.g. ``qsub`` on SGE, ``bsub`` on LSF, ``sbatch`` on SLURM).
The grid you run on must allow compute nodes to submit jobs. This means that if you are on a
compute host, ``qsub/bsub/sbatch/etc`` must be available and working. You can test this by
starting an interactive compute session and running the submit command manually (e.g. ``qsub``
on SGE, ``bsub`` on LSF, ``sbatch`` on SLURM).
If this is not the case, Canu **WILL NOT** work on your grid. You must then set ``useGrid=false`` and run on a single machine. Alternatively, you can run Canu with ``useGrid=remote``, which will stop at every submit command and list what should be submitted. You then submit these jobs manually, wait for them to complete, and run the Canu command again. This is a manual process but currently the only workaround for grids without submit support on the compute nodes.
If this is not the case, Canu **WILL NOT** work on your grid. You must then set
``useGrid=false`` and run on a single machine. Alternatively, you can run Canu with
``useGrid=remote``, which will stop at every submit command and list what should be submitted. You
then submit these jobs manually, wait for them to complete, and run the Canu command again. This
is a manual process but currently the only workaround for grids without submit support on the
compute nodes.
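
As a rough first check before the manual test described above, the sketch below looks for a submit command on ``PATH``. The function name ``find_submit_command`` is hypothetical, and finding the command does not prove the scheduler accepts submissions from that host, so a manual test submission is still needed.

```shell
#!/bin/sh
# Sketch: report the first grid submit command found on PATH.
# Run this from an interactive session on a compute node, not the login node.
# Finding the command does not prove that job submission is permitted there.
find_submit_command() {
    for cmd in qsub bsub sbatch; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
            return 0
        fi
    done
    return 1
}

if submit=$(find_submit_command); then
    echo "found submit command: $submit"
else
    echo "no submit command found; consider useGrid=false or useGrid=remote"
fi
```

If nothing is found on a compute node, the ``useGrid=false`` or ``useGrid=remote`` settings described above are the options to consider.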
What parameters should I use for my reads?
-------------------------------------
Canu is designed to be universal on a large range of PacBio (C2, P4-C2, P5-C3, P6-C4) and Oxford Nanopore
(R6 through R9) data. Assembly quality and/or efficiency can be enhanced for specific datatypes:
Canu is designed to be universal on a large range of PacBio (C2, P4-C2, P5-C3, P6-C4) and Oxford
Nanopore (R6 through R9) data. Assembly quality and/or efficiency can be enhanced for specific
datatypes:
**Nanopore R7 1D** and **Low Identity Reads**
With R7 1D sequencing data, and generally for any raw reads lower than 80% identity, five to
ten rounds of error correction are helpful. To run just the correction phase, use options
``-correct corOutCoverage=500 corMinCoverage=0 corMhapSensitivity=high``. Use the output of
the previous run (in ``asm.correctedReads.fasta.gz``) as input to the next round.
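
The repeated correction rounds could be scripted roughly as follows. This is a dry-run sketch that only prints the commands: the genome size, the input file name, the ``round<N>`` directory names, and the function name are assumptions for illustration, while the ``asm`` prefix is chosen to match the ``asm.correctedReads.fasta.gz`` output named above.

```shell
#!/bin/sh
# Dry-run sketch: print one canu correction command per round.
# Each round reads the corrected output of the previous round.
# Genome size, file names and directory names are illustrative assumptions.
print_correction_rounds() {
    rounds=$1        # number of correction rounds, e.g. 5
    genome_size=$2   # e.g. 4.8m
    input=$3         # raw reads for the first round
    i=1
    while [ "$i" -le "$rounds" ]; do
        echo "canu -correct -p asm -d round$i genomeSize=$genome_size corOutCoverage=500 corMinCoverage=0 corMhapSensitivity=high -nanopore-raw $input"
        # The next round starts from this round's corrected reads.
        input="round$i/asm.correctedReads.fasta.gz"
        i=$((i + 1))
    done
}

print_correction_rounds 5 4.8m reads.fastq
```

Printing the commands first makes it easy to review them before running each round by hand or piping them to a shell.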
slightly decrease the maximum allowed difference from the default of 4.5% to 4.0% with
``correctedErrorRate=0.040 corMhapSensitivity=normal``. For recent Sequel data, the defaults
are appropriate.
seem to be appropriate.
**Nanopore R9 large genomes**
Due to some systematic errors, the identity estimate used by Canu for correction can be an over-estimate of true error, inflating runtime. For recent large genomes (>1gbp) we've used ``'corMhapOptions=--threshold 0.8 --num-hashes 512 --ordered-sketch-size 1000 --ordered-kmer-size 14'``. This can be used with 30x or more of coverage; below that, the defaults are OK.
Due to some systematic errors, the identity estimate used by Canu for correction can be an
over-estimate of true error, inflating runtime. For recent large genomes (>1gbp) with more
than 30x coverage, we've used ``'corMhapOptions=--threshold 0.8 --num-hashes
512 --ordered-sketch-size 1000 --ordered-kmer-size 14'``. This is not needed below 30x
coverage.
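
Because the option value contains spaces, the single quotes matter on the command line. The sketch below is a dry run that prints each argument as the shell would pass it, to show that the whole ``corMhapOptions=...`` string arrives as one argument. The helper ``show_args``, the genome size, and the file names are assumptions for illustration.

```shell
#!/bin/sh
# Dry-run sketch: print each argument on its own line, in brackets,
# to confirm that the quoted corMhapOptions value stays in one piece.
show_args() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

show_args canu -p asm -d asm-dir genomeSize=3.1g \
    'corMhapOptions=--threshold 0.8 --num-hashes 512 --ordered-sketch-size 1000 --ordered-kmer-size 14' \
    -nanopore-raw reads.fastq
```

Without the quotes, the shell would split the value at each space and the remaining ``--`` flags would be passed as separate arguments.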
My assembly continuity is not good, how can I improve it?
...
...
@@ -161,7 +188,7 @@ What parameters can I tweak?
divergence, you'd end up collapsing the variations. We've used the following parameters
My genome is AT (or GC) rich, do I need to adjust parameters? What about highly repetitive genomes?
-------------------------------------
...
...
@@ -250,12 +311,9 @@ How can I send data to you?
FTP to ftp://ftp.cbcb.umd.edu/incoming/sergek. This is a write-only location that only the Canu
developers can see.
Here is a quick walk-through using a command-line ftp client (should be available on most Linux and OSX installations). Say we want to transfer a file named ``reads.fastq``. First, run ``ftp ftp.cbcb.umd.edu``, specify ``anonymous`` as the user name and hit return for password (blank). Then:
.. code-block::
cd incoming/sergek
put reads.fastq
quit
Here is a quick walk-through using a command-line ftp client (should be available on most Linux
and OSX installations). Say we want to transfer a file named ``reads.fastq``. First, run ``ftp
ftp.cbcb.umd.edu``, specify ``anonymous`` as the user name and hit return for password
(blank). Then ``cd incoming/sergek``, ``put reads.fastq``, and ``quit``.
That's it, you won't be able to see the file but we can download it.
The allowed difference in an overlap between two uncorrected reads, expressed as fraction error.
Sets :ref:`corOvlErrorRate` and :ref:`corErrorRate`. The `rawErrorRate` typically does not need
to be modified. It might need to be increased if very early reads are being assembled. The
default is 0.300 for PacBio reads, and 0.500 for Nanopore reads.
Sets :ref:`corOvlErrorRate <corOvlErrorRate>` and :ref:`corErrorRate <corErrorRate>`. The
:ref:`rawErrorRate <rawErrorRate>` typically does not need to be modified. It might need to be
increased if very early reads are being assembled. The default is 0.300 for PacBio reads, and
0.500 for Nanopore reads.
.. _correctedErrorRate:
correctedErrorRate <float=unset>
The allowed difference in an overlap between two corrected reads, expressed as fraction error. Sets :ref:`obtOvlErrorRate`, :ref:`utgOvlErrorRate`, :ref:`obtErrorRate`, :ref:`utgErrorRate`, and :ref:`cnsErrorRate`.
The `correctedErrorRate` can be adjusted to account for the quality of read correction, for the amount of divergence in the sample being
assembled, and for the amount of sequence being assembled. The default is 0.045 for PacBio reads, and 0.144 for Nanopore reads.
The allowed difference in an overlap between two corrected reads, expressed as fraction error.
Sets :ref:`obtOvlErrorRate <obtOvlErrorRate>`, :ref:`utgOvlErrorRate <utgOvlErrorRate>`,
:ref:`obtErrorRate <obtErrorRate>`, :ref:`utgErrorRate <utgErrorRate>`, and :ref:`cnsErrorRate
<cnsErrorRate>`.
The :ref:`correctedErrorRate <correctedErrorRate>` can be adjusted to account for the quality of
read correction, for the amount of divergence in the sample being assembled, and for the amount of
sequence being assembled. The default is 0.045 for PacBio reads, and 0.144 for Nanopore reads.
For low coverage datasets (less than 30X), we recommend increasing `correctedErrorRate` slightly, by 1% or so.
For low coverage datasets (less than 30X), we recommend increasing :ref:`correctedErrorRate
<correctedErrorRate>` slightly, by 1% or so.
For high-coverage datasets (more than 60X), we recommend decreasing `correctedErrorRate` slightly, by 1% or so.
For high-coverage datasets (more than 60X), we recommend decreasing :ref:`correctedErrorRate
<correctedErrorRate>` slightly, by 1% or so.
Raising the `correctedErrorRate` will increase run time. Likewise, decreasing `correctedErrorRate` will decrease run time, at the risk of missing overlaps and fracturing the assembly.
Raising the :ref:`correctedErrorRate <correctedErrorRate>` will increase run time. Likewise,
decreasing :ref:`correctedErrorRate <correctedErrorRate>` will decrease run time, at the risk of
missing overlaps and fracturing the assembly.
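
The coverage guidance above can be sketched as a small helper. It assumes "by 1% or so" means an absolute change of 0.01 in the fraction error, which is one reading of the text and not a confirmed rule. The function name is hypothetical, and the defaults and coverage thresholds are the ones quoted above.

```shell
#!/bin/sh
# Sketch: suggest a correctedErrorRate from read type and coverage.
# Assumption: "by 1% or so" is interpreted as 0.01 absolute.
# Defaults (0.045 PacBio, 0.144 Nanopore) and the 30X/60X thresholds
# are taken from the text; treat the result as a starting point only.
suggest_corrected_error_rate() {
    tech=$1       # pacbio or nanopore
    coverage=$2   # estimated coverage, e.g. 25
    awk -v tech="$tech" -v cov="$coverage" 'BEGIN {
        rate = (tech == "nanopore") ? 0.144 : 0.045
        if (cov < 30)      rate += 0.01
        else if (cov > 60) rate -= 0.01
        printf "%.3f\n", rate
    }'
}

echo "correctedErrorRate=$(suggest_corrected_error_rate pacbio 25)"
```

The printed value can be passed to Canu as a ``correctedErrorRate=...`` option, keeping in mind the run time trade-off described above.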
.. _minReadLength:
...
...
@@ -60,7 +69,7 @@ minReadLength <integer=1000>
Must be no smaller than minOverlapLength.
If set high enough, the gatekeeper module will halt as too many of the input reads have been
discarded. Set `stopOnReadQuality` to false to avoid this.
discarded. Set :ref:`stopOnReadQuality <stopOnReadQuality>` to false to avoid this.
The type of image to generate in gnuplot. By default, canu will use png, svg or gif, in that order.
gnuplotTested <boolean=false>
If set, skip the tests to determine if gnuplot will run, and to decide the image type to generate. This is used when gnuplot fails to run, or isn't even installed, and allows canu to continue execution without generating graphs.
If set, skip the tests to determine if gnuplot will run, and to decide the image type to generate.
This is used when gnuplot fails to run, or isn't even installed, and allows canu to continue
execution without generating graphs.
preExec <string=undef>
A single command that will be run before Canu starts in a grid-enabled configuration.
Can be used to set up the environment, e.g., with 'module'.
File Staging
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...
...
@@ -171,8 +178,8 @@ File Staging
The correction stage of Canu requires random access to all the reads. Performance is greatly
improved if the gkpStore database of reads is copied locally to each node that computes corrected
read consensus sequences. This 'staging' is enabled by supplying a path name to fast local storage
with the `stageDirectory` option, and, optionally, requesting access to that resource from the grid
with the `gridEngineStageOption` option.
with the :ref:`stageDirectory` option, and, optionally, requesting access to that resource from the grid
with the :ref:`gridEngineStageOption` option.
stageDirectory <string=undefined>
A path to a directory local to each compute node. The directory should use an environment
...
...
@@ -198,11 +205,12 @@ Cleanup Options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
saveOverlaps <boolean=false>
If set, do not remove raw overlap output from either mhap or overlapInCore. Normally, this output is removed once
the overlaps are loaded into an overlap store.
If set, do not remove raw overlap output from either mhap or overlapInCore. Normally, this output
is removed once the overlaps are loaded into an overlap store.
saveReadCorrections <boolean=false>
If set, do not remove raw corrected read output from correction/2-correction. Normally, this output is removed once the corrected reads are generated.
If set, do not remove raw corrected read output from correction/2-correction. Normally, this
output is removed once the corrected reads are generated.
saveIntermediates <boolean=false>
If set, do not remove intermediate outputs. Normally, intermediate files are removed
- To change the k-mer size for just the ovl overlapper used during correction, 'corMerSize=16' would be used.
- To change the mhap k-mer size for all instances, 'mhapMerSize=18' would be used.
- To change the mhap k-mer size just during correction, 'corMhapMerSize=15' would be used.
- To use minimap for overlap computation just during correction, 'corOverlapper=minimap' would be used.
- To use minimap for overlap computation just during correction, 'corOverlapper=minimap' would be used. The minimap2 executable must be symlinked from the Canu binary folder ('Linux-amd64/bin' or 'Darwin-amd64/bin' depending on your system).
Ovl Overlapper Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...
...
@@ -416,7 +416,7 @@ READS
SEQUENCE
<prefix>.contigs.fasta
Everything which could be assembled and is part of the primary assembly, including both unique
Everything which could be assembled and is the primary assembly, including both unique