Commits on Source (2)
......@@ -19,6 +19,8 @@ Alternatively, you can also build the latest unreleased from github:
cd canu/src
make -j <number of threads>
The unreleased tip has not undergone the same testing as a release and so may have unknown bugs or issues generating sub-optimal assemblies. We recommend the release version for most users.
## Learn:
The [quick start](http://canu.readthedocs.io/en/latest/quick-start.html) will get you assembling quickly, while the [tutorial](http://canu.readthedocs.io/en/latest/tutorial.html) explains things in more detail.
......
......@@ -35,6 +35,7 @@ $stoppingCommits{"1ef335952342ef06ad1651a888f09c312f54dab8"} = 1; # 18 MAY 20
$stoppingCommits{"bbbdcd063560e5f86006ee6b8b96d2d7b80bb750"} = 1; # 21 NOV 2016
$stoppingCommits{"64459fe33f97f6d23fe036ba1395743d0cdd03e4"} = 1; # 17 APR 2017
$stoppingCommits{"9e9bd674b705f89817b07ff30067210c2d180f42"} = 1; # 14 AUG 2017
$stoppingCommits{"0fff8a511fd7d74081d94ff9e0f6c0351650ae2e"} = 1; # 27 FEB 2018 - v1.7
open(F, "< logs") or die "Failed to open 'logs': $!\n";
......
......@@ -286,6 +286,7 @@ foreach my $file (@filesToProcess) {
next if ($file =~ m/libfalcon/);
next if ($file =~ m/libNDFalcon/);
next if ($file =~ m/libbacktrace/);
next if ($file =~ m/libsnappy/);
next if ($file =~ m/qsort_mt.c$/);
......
......@@ -8,3 +8,11 @@ http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#definition-li
http://rest-sphinx-memo.readthedocs.io/en/latest/ReST.html
- A very useful page with examples.
`italics`
*italics*
**bold**
``red-code``
......@@ -55,9 +55,9 @@ copyright = u'2015, Adam Phillippy, Sergey Koren, Brian Walenz'
# built documents.
#
# The short X.Y version.
version = '1.6'
version = '1.7'
# The full version, including alpha/beta/rc tags.
release = '1.6'
release = '1.7'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
......
......@@ -13,7 +13,7 @@ What resources does Canu require for a bacterial genome assembly? A mammalian as
-------------------------------------
Canu will detect available resources and configure itself to run efficiently using those
resources. It will request resources, for example, the number of compute threads to use, based
on the ``genomeSize`` being assembled. It will fail to even start if it feels there are
on the genome size being assembled. It will fail to even start if it feels there are
insufficient resources available.
A typical bacterial genome can be assembled with 8GB memory in a few CPU hours - around an hour
......@@ -39,44 +39,71 @@ How do I run Canu on my SLURM / SGE / PBS / LSF / Torque system?
To disable grid support and run only on the local machine, specify ``useGrid=false``
It is possible to limit the number of grid jobs running at the same time, but this isn't
directly supported by Canu. The various :ref:`gridOptions <grid-options>` parameters
can pass grid-specific parameters to the submit commands used; see
`Issue #756 <https://github.com/marbl/canu/issues/756>`_ for Slurm and SGE examples.
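For example, on Slurm every submitted job could be capped in wall time and pointed at a
particular partition (a sketch; the partition name and genome size are placeholders)::

    canu -p asm -d asm genomeSize=4.8m \
        gridOptions="--time=4:00:00 --partition=quick" \
        -pacbio-raw your_reads.fastq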
My run stopped with the error ``'Failed to submit batch jobs'``
-------------------------------------
The grid you run on must allow compute nodes to submit jobs. This means that if you are on a compute host, ``qsub/bsub/sbatch/etc`` must be available and working. You can test this by starting an interactive compute session and running the submit command manually (e.g. ``qsub`` on SGE, ``bsub`` on LSF, ``sbatch`` on SLURM).
The grid you run on must allow compute nodes to submit jobs. This means that if you are on a
compute host, ``qsub/bsub/sbatch/etc`` must be available and working. You can test this by
starting an interactive compute session and running the submit command manually (e.g. ``qsub``
on SGE, ``bsub`` on LSF, ``sbatch`` on SLURM).
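On SLURM, for example, the test might look like this (a sketch; the test script is a placeholder)::

    srun --pty bash                          # start an interactive compute session
    printf '#!/bin/sh\nhostname\n' > test.sh
    sbatch test.sh                           # should print a job id, not an error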
If this is not the case, Canu **WILL NOT** work on your grid. You must then set ``useGrid=false`` and run on a single machine. Alternatively, you can run Canu with ``useGrid=remote`` which will stop at every submit command, list what should be submitted. You then submit these jobs manually, wait for them to complete, and run the Canu command again. This is a manual process but currently the only workaround for grids without submit support on the compute nodes.
If this is not the case, Canu **WILL NOT** work on your grid. You must then set
``useGrid=false`` and run on a single machine. Alternatively, you can run Canu with
``useGrid=remote`` which will stop at every submit command, list what should be submitted. You
then submit these jobs manually, wait for them to complete, and run the Canu command again. This
is a manual process but currently the only workaround for grids without submit support on the
compute nodes.
What parameters should I use for my reads?
-------------------------------------
Canu is designed to be universal on a large range of PacBio (C2, P4-C2, P5-C3, P6-C4) and Oxford Nanopore
(R6 through R9) data. Assembly quality and/or efficiency can be enhanced for specific datatypes:
Canu is designed to be universal on a large range of PacBio (C2, P4-C2, P5-C3, P6-C4) and Oxford
Nanopore (R6 through R9) data. Assembly quality and/or efficiency can be enhanced for specific
datatypes:
**Nanopore R7 1D** and **Low Identity Reads**
With R7 1D sequencing data, and generally for any raw reads lower than 80% identity, five to
ten rounds of error correction are helpful. To run just the correction phase, use options
``-correct corOutCoverage=500 corMinCoverage=0 corMhapSensitivity=high``. Use the output of
the previous run (in ``asm.correctedReads.fasta.gz``) as input to the next round.
ten rounds of error correction are helpful::
canu -p r1 -d r1 -correct corOutCoverage=500 corMinCoverage=0 corMhapSensitivity=high -nanopore-raw your_reads.fasta
canu -p r2 -d r2 -correct corOutCoverage=500 corMinCoverage=0 corMhapSensitivity=high -nanopore-raw r1/r1.correctedReads.fasta.gz
canu -p r3 -d r3 -correct corOutCoverage=500 corMinCoverage=0 corMhapSensitivity=high -nanopore-raw r2/r2.correctedReads.fasta.gz
canu -p r4 -d r4 -correct corOutCoverage=500 corMinCoverage=0 corMhapSensitivity=high -nanopore-raw r3/r3.correctedReads.fasta.gz
canu -p r5 -d r5 -correct corOutCoverage=500 corMinCoverage=0 corMhapSensitivity=high -nanopore-raw r4/r4.correctedReads.fasta.gz
Then assemble the output of the last round, allowing up to 30% difference in overlaps::
Once corrected, assemble with ``-nanopore-corrected <your data> correctedErrorRate=0.3 utgGraphDeviation=50``
canu -p asm -d asm correctedErrorRate=0.3 utgGraphDeviation=50 -nanopore-corrected r5/r5.correctedReads.fasta.gz
**Nanopore R7 2D** and **Nanopore R9 1D**
Increase the maximum allowed difference in overlaps from the default of 4.5% to 7.5% with
``correctedErrorRate=0.075``
The defaults were designed with these datasets in mind so they should work. Having very high
coverage or very long Nanopore reads can slow down the assembly significantly. You can try the
``overlapper=mhap utgReAlign=true`` option which is much faster but may produce less
contiguous assemblies on large genomes.
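A minimal sketch of an invocation using this option (the read file and genome size are
placeholders)::

    canu -p asm -d asm genomeSize=4.8m overlapper=mhap utgReAlign=true -nanopore-raw your_reads.fasta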
**Nanopore R9 2D** and **PacBio P6**
Slightly decrease the maximum allowed difference in overlaps from the default of 4.5% to 4.0%
with ``correctedErrorRate=0.040``
Slightly decrease the maximum allowed difference in overlaps from the default of 14.4% to 12.0%
with ``correctedErrorRate=0.120``
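For example (a sketch; pick the rate matching your read type as described above)::

    canu -p asm -d asm genomeSize=4.8m correctedErrorRate=0.120 -nanopore-raw your_reads.fasta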
**Early PacBio Sequel**
Based on exactly one publicly released *A. thaliana* `dataset
<http://www.pacb.com/blog/sequel-system-data-release-arabidopsis-dataset-genome-assembly/>`_,
slightly decrease the maximum allowed difference from the default of 4.5% to 4.0% with
``correctedErrorRate=0.040 corMhapSensitivity=normal``. For recent Sequel data, the defaults
are appropriate.
seem to be appropriate.
**Nanopore R9 large genomes**
Due to some systematic errors, the identity estimate used by Canu for correction can be an over-estimate of true error, inflating runtime. For recent large genomes (>1gbp) we've used ``'corMhapOptions=--threshold 0.8 --num-hashes 512 --ordered-sketch-size 1000 --ordered-kmer-size 14'``. This can be used with 30x or more of coverage, below that the defaults are OK.
Due to some systematic errors, the identity estimate used by Canu for correction can be an
over-estimate of true error, inflating runtime. For recent large genomes (>1gbp) with more
than 30x coverage, we've used ``'corMhapOptions=--threshold 0.8 --num-hashes
512 --ordered-sketch-size 1000 --ordered-kmer-size 14'``. This is not needed for below 30x
coverage.
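A sketch of a complete command with these settings (the genome size and input file are
placeholders)::

    canu -p asm -d asm genomeSize=3g \
        'corMhapOptions=--threshold 0.8 --num-hashes 512 --ordered-sketch-size 1000 --ordered-kmer-size 14' \
        -nanopore-raw your_reads.fasta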
My assembly continuity is not good, how can I improve it?
......@@ -161,7 +188,7 @@ What parameters can I tweak?
divergence, you'd end up collapsing the variations. We've used the following parameters
for polyploid populations (PacBio data):
``corOutCoverage=200 correctedErrorRate=0.040 "batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50"``
``corOutCoverage=200 "batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50"``
This will output more corrected reads (than the default 40x). The latter option will be
more conservative at picking the error rate to use for the assembly to try to maintain
......@@ -180,17 +207,28 @@ What parameters can I tweak?
chromosome (and probably some reads from other chromosomes). When assembling, overlaps
well outside the observed error rate distribution are discarded.
For metagenomes:
The basic idea is to use all data for assembly rather than just the longest reads, as is the default. The
parameters we've used recently are:
``corOutCoverage=10000 corMhapSensitivity=high corMinCoverage=0 redMemory=32 oeaMemory=32 batMemory=200``
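Put together, a metagenome run might look like this (a sketch; the genome size here only guides
read selection and reporting)::

    canu -p meta -d meta genomeSize=50m \
        corOutCoverage=10000 corMhapSensitivity=high corMinCoverage=0 \
        redMemory=32 oeaMemory=32 batMemory=200 \
        -pacbio-raw your_reads.fastq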
For low coverage:
- For less than 30X coverage, increase the allowed difference in overlaps from 4.5% to 7.5%
(or more) with ``correctedErrorRate=0.075``, to adjust for inferior read correction. Canu
will automatically reduce ``corMinCoverage`` to zero to correct as many reads as possible.
- For less than 30X coverage, increase the allowed difference in overlaps by a few percent
(from 4.5% to 8.5% (or more) with ``correctedErrorRate=0.105`` for PacBio and from 14.4% to
16% (or more) with ``correctedErrorRate=0.16`` for Nanopore), to adjust for inferior read
correction. Canu will automatically reduce ``corMinCoverage`` to zero to correct as many
reads as possible.
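For example, with 20X of Nanopore data (a sketch)::

    canu -p asm -d asm genomeSize=4.8m correctedErrorRate=0.16 -nanopore-raw your_reads.fasta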
For high coverage:
- For more than 60X coverage, decrease the allowed difference in overlaps from 4.5% to 4.0%
with ``correctedErrorRate=0.040``, so that only the better corrected reads are used. This is
primarily an optimization for speed and generally does not change assembly continuity.
- For more than 60X coverage, decrease the allowed difference in overlaps (from 4.5% to 4.0%
with ``correctedErrorRate=0.040`` for PacBio, from 14.4% to 12% with
``correctedErrorRate=0.12`` for Nanopore), so that only the better corrected reads are used.
This is primarily an optimization for speed and generally does not change assembly
continuity.
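For example, with 80X of PacBio data (a sketch)::

    canu -p asm -d asm genomeSize=4.8m correctedErrorRate=0.040 -pacbio-raw your_reads.fastq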
My asm.contigs.fasta is empty, why?
......@@ -200,24 +238,26 @@ My asm.contigs.fasta is empty, why?
output, unitigs are the primary output split at alternate paths,
and unassembled are the leftover pieces.
The :ref:`contigFilter` parameter sets several parameters that control how small or low coverage
initial contigs are handled. By default, initial contigs with more than 50% of the length at
less than 5X coverage will be classified as 'unassembled' and removed from the assembly, that
is, ``contigFilter="2 0 1.0 0.5 5"``. The filtering can be disabled by changing the last number
from '5' to '0' (meaning, filter if 50% is less than 0X coverage).
The :ref:`contigFilter <contigFilter>` parameter sets several parameters that control how small
or low coverage initial contigs are handled. By default, initial contigs with more than 50% of
the length at less than 3X coverage will be classified as 'unassembled' and removed from the
assembly, that is, ``contigFilter="2 0 1.0 0.5 3"``. The filtering can be disabled by changing
the last number from '3' to '0' (meaning, filter if 50% of the contig is less than 0X coverage).
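For example, to disable the filter entirely (a sketch)::

    canu -p asm -d asm genomeSize=4.8m contigFilter="2 0 1.0 0.5 0" -pacbio-raw your_reads.fastq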
Why is my assembly missing my favorite short plasmid?
-------------------------------------
Only the longest 40X of data (based on the specified genome size) is used for
correction. Datasets with uneven coverage or small plasmids can fail to generate enough
corrected reads to give enough coverage for assembly, resulting in gaps in the genome or even no
reads for small plasmids. Set ``corOutCoverage=1000`` (or any value greater than your total input
coverage) to correct all input data.
In Canu v1.6 and earlier only the longest 40X of data (based on the specified genome size) is
used for correction. Datasets with uneven coverage or small plasmids can fail to generate
enough corrected reads to give enough coverage for assembly, resulting in gaps in the genome or
even no reads for small plasmids. Set ``corOutCoverage=1000`` (or any value greater than your
total input coverage) to correct all input data.
An alternate approach is to correct all reads (``-correct corOutCoverage=1000``) then assemble
40X of reads picked at random from the ``<prefix>.correctedReads.fasta.gz`` output.
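One way to pick that subset (a sketch assuming the third-party ``seqtk`` tool is installed;
choose the sampling fraction so that roughly 40X remains)::

    seqtk sample -s 100 asm.correctedReads.fasta.gz 0.25 > asm.subset.fasta
    canu -p asm -d asm-subset genomeSize=4.8m -pacbio-corrected asm.subset.fasta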
More recent Canu versions dynamically select poorly represented sequences to avoid missing short
plasmids so this should no longer happen.
Why do I get less corrected read data than I asked for?
-------------------------------------
......@@ -229,8 +269,29 @@ Why do I get less corrected read data than I asked for?
What is the minimum coverage required to run Canu?
-------------------------------------
For eukaryotic genomes, coverage more than 20X is enough to outperform current hybrid methods.
For eukaryotic genomes, coverage more than 20X is enough to outperform current hybrid
methods. Below that, you will likely not assemble the full genome.
My circular element is duplicated/has overlap?
-------------------------------------
This is expected for any circular elements. They can overlap by up to a read length due to how
Canu constructs contigs. Canu provides an alignment string in the GFA output which can be
converted to an alignment to identify the trimming points.
An alternative is to run MUMmer to get self-alignments on the contig and use those trim
points. For example, assuming the circular element is in ``tig00000099.fa``, run::
nucmer -maxmatch -nosimplify tig00000099.fa tig00000099.fa
show-coords -lrcTH out.delta
to find the end overlaps in the tig. The output would be something like::
1 1895 48502 50400 1895 1899 99.37 50400 50400 3.76 3.77 tig00000001 tig00000001
48502 50400 1 1895 1899 1895 99.37 50400 50400 3.77 3.76 tig00000001 tig00000001
This means the contig should be trimmed to keep bases 1 through 48502. There is also an alternate `writeup
<https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Circularizing-and-trimming>`_.
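Given those coordinates, one way to cut the duplicated end (a sketch assuming ``samtools`` is
installed; note the contig is named ``tig00000001`` in the example output)::

    samtools faidx tig00000099.fa tig00000001:1-48502 > tig00000001.trimmed.fasta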
My genome is AT (or GC) rich, do I need to adjust parameters? What about highly repetitive genomes?
-------------------------------------
......@@ -250,12 +311,9 @@ How can I send data to you?
FTP to ftp://ftp.cbcb.umd.edu/incoming/sergek. This is a write-only location that only the Canu
developers can see.
Here is a quick walk-through using a command-line ftp client (should be available on most Linux and OSX installations). Say we want to transfer a file named ``reads.fastq``. First, run ``ftp ftp.cbcb.umd.edu``, specify ``anonymous`` as the user name and hit return for password (blank). Then:
.. code-block::
cd incoming/sergek
put reads.fastq
quit
Here is a quick walk-through using a command-line ftp client (should be available on most Linux
and OSX installations). Say we want to transfer a file named ``reads.fastq``. First, run ``ftp
ftp.cbcb.umd.edu``, specify ``anonymous`` as the user name and hit return for password
(blank). Then ``cd incoming/sergek``, ``put reads.fastq``, and ``quit``.
That's it, you won't be able to see the file but we can download it.
......@@ -34,22 +34,31 @@ errorRate <float=unset> (OBSOLETE)
rawErrorRate <float=unset>
The allowed difference in an overlap between two uncorrected reads, expressed as fraction error.
Sets :ref:`corOvlErrorRate` and :ref:`corErrorRate`. The `rawErrorRate` typically does not need
to be modified. It might need to be increased if very early reads are being assembled. The
default is 0.300 for PacBio reads, and 0.500 for Nanopore reads.
Sets :ref:`corOvlErrorRate <corOvlErrorRate>` and :ref:`corErrorRate <corErrorRate>`. The
:ref:`rawErrorRate <rawErrorRate>` typically does not need to be modified. It might need to be
increased if very early reads are being assembled. The default is 0.300 for PacBio reads, and
0.500 for Nanopore reads.
.. _correctedErrorRate:
correctedErrorRate <float=unset>
The allowed difference in an overlap between two corrected reads, expressed as fraction error. Sets :ref:`obtOvlErrorRate`, :ref:`utgOvlErrorRate`, :ref:`obtErrorRate`, :ref:`utgErrorRate`, and :ref:`cnsErrorRate`.
The `correctedErrorRate` can be adjusted to account for the quality of read correction, for the amount of divergence in the sample being
assembled, and for the amount of sequence being assembled. The default is 0.045 for PacBio reads, and 0.144 for Nanopore reads.
The allowed difference in an overlap between two corrected reads, expressed as fraction error.
Sets :ref:`obtOvlErrorRate <obtOvlErrorRate>`, :ref:`utgOvlErrorRate <utgOvlErrorRate>`,
:ref:`obtErrorRate <obtErrorRate>`, :ref:`utgErrorRate <utgErrorRate>`, and :ref:`cnsErrorRate
<cnsErrorRate>`.
The :ref:`correctedErrorRate <correctedErrorRate>` can be adjusted to account for the quality of
read correction, for the amount of divergence in the sample being assembled, and for the amount of
sequence being assembled. The default is 0.045 for PacBio reads, and 0.144 for Nanopore reads.
For low coverage datasets (less than 30X), we recommend increasing `correctedErrorRate` slightly, by 1% or so.
For low coverage datasets (less than 30X), we recommend increasing :ref:`correctedErrorRate
<correctedErrorRate>` slightly, by 1% or so.
For high-coverage datasets (more than 60X), we recommend decreasing `correctedErrorRate` slightly, by 1% or so.
For high-coverage datasets (more than 60X), we recommend decreasing :ref:`correctedErrorRate
<correctedErrorRate>` slightly, by 1% or so.
Raising the `correctedErrorRate` will increase run time. Likewise, decreasing `correctedErrorRate` will decrease run time, at the risk of missing overlaps and fracturing the assembly.
Raising the :ref:`correctedErrorRate <correctedErrorRate>` will increase run time. Likewise,
decreasing :ref:`correctedErrorRate <correctedErrorRate>` will decrease run time, at the risk of
missing overlaps and fracturing the assembly.
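For instance, a low-coverage PacBio run might bump the rate from the default 0.045 to 0.055
(a sketch)::

    canu -p asm -d asm genomeSize=4.8m correctedErrorRate=0.055 -pacbio-raw your_reads.fastq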
.. _minReadLength:
......@@ -60,7 +69,7 @@ minReadLength <integer=1000>
Must be no smaller than minOverlapLength.
If set high enough, the gatekeeper module will halt as too many of the input reads have been
discarded. Set `stopOnReadQuality` to false to avoid this.
discarded. Set :ref:`stopOnReadQuality <stopOnReadQuality>` to false to avoid this.
.. _minOverlapLength:
......@@ -76,18 +85,21 @@ minOverlapLength <integer=500>
genomeSize <float=unset> *required*
An estimate of the size of the genome. Common suffixes are allowed, for example, 3.7m or 2.8g.
The genome size estimate is used to decide how many reads to correct (via the corOutCoverage_
parameter) and how sensitive the mhap overlapper should be (via the mhapSensitivity_
The genome size estimate is used to decide how many reads to correct (via the :ref:`corOutCoverage <corOutCoverage>`
parameter) and how sensitive the mhap overlapper should be (via the :ref:`mhapSensitivity <mhapSensitivity>`
parameter). It also impacts some logging, in particular, reports of NG50 sizes.
.. _canuIteration:
canuIteration <internal parameter, do not use>
Which parallel iteration is being attempted.
canuIterationMax <integer=2>
How many parallel iterations to try. Ideally, the parallel jobs, run under grid control, would all finish successfully on the first try.
Sometimes, jobs fail due to other jobs exhausting resources (memory), or by the node itself failing. In this case, canu will launch the jobs
again. This parameter controls how many times it tries.
How many parallel iterations to try. Ideally, the parallel jobs, run under grid control, would
all finish successfully on the first try.
Sometimes, jobs fail due to other jobs exhausting resources (memory), or by the node itself
failing. In this case, canu will launch the jobs again. This parameter controls how many times
it tries.
.. _onSuccess:
......@@ -138,16 +150,6 @@ stopAfter <string=undefined>
General Options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
pathMap <string=undefined>
A text file containing lines that map a hostname to a path to the assembler binaries.
This can be used to provide fine-grained binary directories, for example, when incompatible versions
of the operating system are used, or when canu is installed in a non-standard way.
The hostname must be the same as returned by 'uname -n'. For example::
grid01 /grid/software/canu/Linux-amd64/bin/
devel01 /devel/canu/Linux-amd64/bin/
shell <string="/bin/sh">
A path to a Bourne shell, to be used for executing scripts. By default, '/bin/sh', which is typically
the same as 'bash'. C shells (csh, tcsh) are not supported.
......@@ -162,8 +164,13 @@ gnuplotImageFormat <string="png">
The type of image to generate in gnuplot. By default, canu will use png, svg or gif, in that order.
gnuplotTested <boolean=false>
If set, skip the tests to determine if gnuplot will run, and to decide the image type to generate. This is used when gnuplot fails to run, or isn't even installed, and allows canu to continue execution without generating graphs.
If set, skip the tests to determine if gnuplot will run, and to decide the image type to generate.
This is used when gnuplot fails to run, or isn't even installed, and allows canu to continue
execution without generating graphs.
preExec <string=undef>
A single command that will be run before Canu starts in a grid-enabled configuration.
Can be used to set up the environment, e.g., with 'module'.
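For example (a sketch; the module name is a placeholder for whatever your site provides)::

    canu -p asm -d asm genomeSize=4.8m preExec="module load gcc/6.3.0" -pacbio-raw your_reads.fastq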
File Staging
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
......@@ -171,8 +178,8 @@ File Staging
The correction stage of Canu requires random access to all the reads. Performance is greatly
improved if the gkpStore database of reads is copied locally to each node that computes corrected
read consensus sequences. This 'staging' is enabled by supplying a path name to fast local storage
with the `stageDirectory` option, and, optionally, requesting access to that resource from the grid
with the `gridEngineStageOption` option.
with the :ref:`stageDirectory` option, and, optionally, requesting access to that resource from the grid
with the :ref:`gridEngineStageOption` option.
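On Slurm, for example, the pair of options might look like this (a sketch; the local scratch
path and the resource request are site-specific placeholders)::

    canu -p asm -d asm genomeSize=3g \
        stageDirectory=/lscratch/$SLURM_JOBID \
        gridEngineStageOption="--gres=lscratch:256" \
        -pacbio-raw your_reads.fastq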
stageDirectory <string=undefined>
A path to a directory local to each compute node. The directory should use an environment
......@@ -198,11 +205,12 @@ Cleanup Options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
saveOverlaps <boolean=false>
If set, do not remove raw overlap output from either mhap or overlapInCore. Normally, this output is removed once
the overlaps are loaded into an overlap store.
If set, do not remove raw overlap output from either mhap or overlapInCore. Normally, this output
is removed once the overlaps are loaded into an overlap store.
saveReadCorrections <boolean=false>
If set, do not remove raw corrected read output from correction/2-correction. Normally, this output is removed once the corrected reads are generated.
If set, do not remove raw corrected read output from correction/2-correction. Normally, this
output is removed once the corrected reads are generated.
saveIntermediates <boolean=false>
If set, do not remove intermediate outputs. Normally, intermediate files are removed
......@@ -213,6 +221,10 @@ saveIntermediates <boolean=false>
saveMerCounts <boolean=false>
If set, do not remove meryl binary databases.
saveReads <boolean=false>
If set, save the corrected reads (in asm.correctedReads.fasta.gz) and trimmed reads (in asm.trimmedReads.fasta.gz).
Both read sets are saved in the asm.gkpStore, and can be retrieved later.
Overlapper Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
......@@ -223,14 +235,15 @@ Two overlap algorithms are in use. One, mhap, is typically applied to raw uncor
returns alignment-free overlaps with imprecise extents. The other, the original overlapper
algorithm 'ovl', returns alignments but is much more expensive.
There are three sets of parameters, one for the 'mhap' algorithm, one for the 'ovl' algorithm, and one for the 'minimap' algorithm.
Parameters used for a specific type of overlap are set by a prefix on the option: 'cor' for read
correction, 'obt' for read trimming ('overlap based trimming') or 'utg' for unitig construction.
For example, 'corOverlapper=ovl' would set the overlapper used for read correction to the 'ovl'
algorithm.
There are three sets of parameters, one for the 'mhap' algorithm, one for the 'ovl' algorithm, and
one for the 'minimap' algorithm. Parameters used for a specific type of overlap are set by a prefix
on the option: 'cor' for read correction, 'obt' for read trimming ('overlap based trimming') or
'utg' for unitig construction. For example, 'corOverlapper=ovl' would set the overlapper used for
read correction to the 'ovl' algorithm.
{prefix}Overlapper <string=see-below>
Specify which overlap algorithm, 'mhap' or 'ovl' or 'minimap'. The default is to use 'mhap' for 'cor' and 'ovl' for both 'obt' and 'utg'.
Specify which overlap algorithm, 'mhap' or 'ovl' or 'minimap'. The default is to use 'mhap' for
'cor' and 'ovl' for both 'obt' and 'utg'.
Overlapper Configuration, ovl Algorithm
---------------------------------------
......@@ -242,22 +255,24 @@ Overlapper Configuration, ovl Algorithm
{prefix}OvlErrorRate <float=unset>
Overlaps above this error rate are not computed.
* `corOvlErrorRate` applies to overlaps generated for correcting reads;
* `obtOvlErrorRate` applies to overlaps generated for trimming reads;
* `utgOvlErrorRate` applies to overlaps generated for assembling reads.
These limits apply to the 'ovl' overlap algorithm and when alignments are computed for mhap overlaps with :ref:`mhapReAlign <mhapReAlign>`.
* :ref:`corOvlErrorRate <corOvlErrorRate>` applies to overlaps generated for correcting reads;
* :ref:`obtOvlErrorRate <obtOvlErrorRate>` applies to overlaps generated for trimming reads;
* :ref:`utgOvlErrorRate <utgOvlErrorRate>` applies to overlaps generated for assembling reads.
These limits apply to the 'ovl' overlap algorithm and when alignments are computed for mhap
overlaps with :ref:`mhapReAlign <mhapReAlign>`.
{prefix}OvlFrequentMers <string=undefined>
Do not seed overlaps with these kmers (fasta format).
{prefix}OvlHashBits <integer=unset>
Width of the kmer hash. Width 22=1gb, 23=2gb, 24=4gb, 25=8gb. Plus 10b per corOvlHashBlockLength.
Width of the kmer hash. Width 22=1gb, 23=2gb, 24=4gb, 25=8gb. Plus 10b per ovlHashBlockLength.
{prefix}OvlHashBlockLength <integer=unset>
Amount of sequence (bp) to load into the overlap hash table.
{prefix}OvlHashLoad <integer=unset>
Maximum hash table load. If set too high, table lookups are inefficient; if too low, search overhead dominates run time.
Maximum hash table load. If set too high, table lookups are inefficient; if too low, search
overhead dominates run time.
{prefix}OvlMerDistinct <integer=unset>
K-mer frequency threshold; the least frequent fraction of distinct mers can seed overlaps.
......@@ -316,13 +331,24 @@ algorithms usually need to know all overlaps for a single read. The overlap sto
overlap, sorts them by the first ID, and stores them for quick retrieval of all overlaps for a
single read.
The overlap store construction process involves reading outputs of the overlap algorithm (in *.ovb files),
duplicating each overlap, and writing these to intermediate files. Each intermediate file is loaded into
memory, sorted, and written back to disk.
ovsMemory <float>
How much memory, in gigabytes, to use for constructing overlap stores. Must be at least 256m or 0.25g.
ovsMethod <string="sequential">
Two construction algorithms are supported. One uses a single data stream, and is faster for small
and moderate size assemblies. The other uses parallel data streams and can be faster (depending
on your network disk bandwidth) for moderate and large assemblies.
Two construction algorithms are supported. The 'sequential' method uses a single data stream, and
is faster for small and moderate size assemblies. The 'parallel' method uses multiple compute
nodes and can be faster (depending on your network disk bandwidth) for moderate and large
assemblies.
The parallel method is selected for genomes larger than 1 Gbp, but only when Canu runs in grid
mode. A special 'forceparallel' method will force usage of the parallel method regardless of the
availability of a grid. Be advised that the parallel method is less efficient than the sequential
method, and can easily thrash consumer-level NAS devices resulting in exceptionally poor
performance.
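For example, to force the parallel method regardless of grid availability (a sketch)::

    canu -p asm -d asm genomeSize=3g ovsMethod=forceparallel ovsMemory=16 -pacbio-raw your_reads.fastq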
Meryl
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
......@@ -567,6 +593,11 @@ For example, 'mhapMemory` would set the memory limit for computing overlaps with
'cormhapMemory' would set the memory limit only when mhap is used for generating overlaps used for
correction.
.. _maxMemory:
.. _minMemory:
.. _minThreads:
.. _maxThreads:
The 'minMemory', 'maxMemory', 'minThreads' and 'maxThreads' options will apply to all jobs, and
can be used to artificially limit canu to a portion of the current machine. In the overlapper
example above, setting maxThreads=4 would result in two concurrent jobs instead of four.
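For example, to confine Canu to half of a 16-core, 64 GB machine (a sketch)::

    canu -p asm -d asm genomeSize=4.8m maxThreads=8 maxMemory=32g -pacbio-raw your_reads.fastq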
......@@ -684,12 +715,13 @@ Output Filtering
.. _contigFilter:
contigFilter <minReads, integer=2> <minLength, integer=0> <singleReadSpan, float=1.0> <lowCovSpan, float=0.5> <lowCovDepth, integer=5>
Remove spurious assemblies from consideration. Any contig that meets any of the following
conditions is flagged as 'unassembled' and removed from further consideration:
- fewer than minReads reads
- shorter than minLength bases
- a single read covers more than singleReadSpan fraction of the contig
- more than lowCovSpan fraction of the contig is at coverage below lowCovDepth
This filtering is done immediately after initial contigs are formed, before repeat detection.
Initial contigs that span a repeat can be split into multiple contigs; none of these
new contigs will be 'unassembled', even if they are a single read.
A contig that meets any of the following conditions is flagged as 'unassembled' and removed from
further consideration:
- fewer than minReads reads (default 2)
- shorter than minLength bases (default 0)
- a single read covers more than singleReadSpan fraction of the contig (default 1.0)
- more than lowCovSpan fraction of the contig is at coverage below lowCovDepth (defaults 0.5, 5)
This filtering is done immediately after initial contigs are formed, before potentially
incorrectly spanned repeats are detected. Initial contigs that incorrectly span a repeat can be
split into multiple contigs; none of these new contigs will be flagged as 'unassembled', even if
they are a single read.
......@@ -22,7 +22,7 @@ terminations.
Canu will auto-detect computational resources and scale itself to fit, using all of the resources
that are available and reasonable for the size of your assembly. Memory and processors can be explicitly
limited with parameters :ref:`maxMemory` and :ref:`maxThreads`. See section :ref:`execution`
limited with parameters :ref:`maxMemory <maxMemory>` and :ref:`maxThreads <maxThreads>`. See section :ref:`execution`
for more details.
Canu will automatically take full advantage of any LSF/PBS/PBSPro/Torque/Slurm/SGE grid available,
......@@ -153,7 +153,7 @@ quality corrected reads::
canu \
-p asm -d yeast \
genomeSize=12.1m \
correctedErrorRate=0.075 \
correctedErrorRate=0.105 \
-pacbio-raw yeast.20x.fastq.gz
Consensus Accuracy
......
......@@ -305,7 +305,7 @@ For example:
- To change the k-mer size for just the ovl overlapper used during correction, 'corMerSize=16' would be used.
- To change the mhap k-mer size for all instances, 'mhapMerSize=18' would be used.
- To change the mhap k-mer size just during correction, 'corMhapMerSize=15' would be used.
- To use minimap for overlap computation just during correction, 'corOverlapper=minimap' would be used.
- To use minimap for overlap computation just during correction, 'corOverlapper=minimap' would be used. The minimap2 executable must be symlinked from the Canu binary folder ('Linux-amd64/bin' or 'Darwin-amd64/bin' depending on your system).
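For example, the symlink and a test run might look like this (a sketch; the canu install path is a placeholder)::

    ln -s $(which minimap2) /path/to/canu/Linux-amd64/bin/minimap2
    canu -p asm -d asm genomeSize=4.8m corOverlapper=minimap -pacbio-raw your_reads.fastq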
Ovl Overlapper Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
......@@ -416,7 +416,7 @@ READS
SEQUENCE
<prefix>.contigs.fasta
Everything which could be assembled and is part of the primary assembly, including both unique
Everything which could be assembled and is the primary assembly, including both unique
and repetitive elements.
<prefix>.unitigs.fasta
......
......@@ -62,6 +62,19 @@ allocateArray(TT*& array, LL arrayMax, uint32 op=resizeArray_clearNew) {
template<typename TT>
TT *
duplicateString(TT const *fr) {
uint32 ln = strlen(fr);
TT *to = new TT [ln+1];
memcpy(to, fr, sizeof(TT) * (ln+1));
return(to);
}
template<typename TT, typename LL>
void
duplicateArray(TT*& to, LL &toLen, LL &toMax, TT *fr, LL frLen, LL UNUSED(frMax), bool forceAlloc=false) {
......
......@@ -80,15 +80,39 @@ AS_UTL_writeFastQ(FILE *f,
char *q, int ql,
char *h, ...) {
va_list ap;
int qi = 0;
int oi = 0;
assert(sl == ql);
va_start(ap, h);
vfprintf(f, h, ap);
va_end(ap);
AS_UTL_safeWrite(f, s, "AS_UTL_writeFastQ", sizeof(char), sl);
fprintf(f, "\n");
fprintf(f, "+\n");
AS_UTL_safeWrite(f, q, "AS_UTL_writeFastQ", sizeof(char), ql);
fprintf(f, "\n");
}
void
AS_UTL_writeFastQ(FILE *f,
char *s, int sl,
uint8 *q, int ql,
char *h, ...) {
va_list ap;
char *o = new char [ql + 1];
int qi = 0;
int oi = 0;
assert(sl == ql);
// Reencode the QV to the Sanger spec. This is a copy, so we don't need to undo it.
while (qi < ql)
o[oi++] = q[qi++] + '!';
while (qi < ql) // Reencode the QV
o[oi++] = q[qi++] + '!'; // to the Sanger spec.
o[oi] = 0;
va_start(ap, h);
......
......@@ -47,7 +47,13 @@ AS_UTL_writeFastA(FILE *f,
void
AS_UTL_writeFastQ(FILE *f,
char *s, int sl,
char *q, int ql,
char *q, int ql, // As Sanger QV, already encoded
char *h, ...);
void
AS_UTL_writeFastQ(FILE *f,
char *s, int sl,
uint8 *q, int ql, // As Sanger QV, from integer values
char *h, ...);
#endif
......@@ -322,7 +322,24 @@ AS_UTL_unlink(const char *filename) {
errno = 0;
unlink(filename);
if (errno)
fprintf(stderr, "AS_UTL_unlink()-- Failed to remove file '%s': %s\n", filename, strerror(errno)), exit(1);
fprintf(stderr, "AS_UTL_unlink()-- Failed to remove file '%s': %s\n",
filename, strerror(errno)), exit(1);
}
// Rename a file, or do nothing if the file doesn't exist.
void
AS_UTL_rename(const char *oldname, const char *newname) {
if (AS_UTL_fileExists(oldname, FALSE, FALSE) == false)
return;
errno = 0;
rename(oldname, newname);
if (errno)
fprintf(stderr, "AS_UTL_renane()-- Failed to rename file '%s' to '%s': %s\n",
oldname, newname, strerror(errno)), exit(1);
}
......@@ -336,10 +353,9 @@ AS_UTL_fileExists(const char *path,
int directory,
int readwrite) {
struct stat s;
int r;
errno = 0;
r = stat(path, &s);
stat(path, &s);
if (errno)
return(0);
......@@ -379,11 +395,10 @@ AS_UTL_fileExists(const char *path,
off_t
AS_UTL_sizeOfFile(const char *path) {
struct stat s;
int r;
off_t size = 0;
errno = 0;
r = stat(path, &s);
stat(path, &s);
if (errno) {
fprintf(stderr, "Failed to stat() file '%s': %s\n", path, strerror(errno));
exit(1);
......@@ -495,10 +510,7 @@ AS_UTL_fseek(FILE *stream, off_t offset, int whence) {
void
AS_UTL_loadFileList(char *fileName, vector<char *> &fileList) {
errno = 0;
FILE *F = fopen(fileName, "r");
if (errno)
fprintf(stderr, "Can't open '%s': %s\n", fileName, strerror(errno)), exit(1);
FILE *F = AS_UTL_openInputFile(fileName);
char *line = new char [FILENAME_MAX];
......@@ -513,11 +525,91 @@ AS_UTL_loadFileList(char *fileName, vector<char *> &fileList) {
delete [] line;
fclose(F);
AS_UTL_closeFile(F);
}
FILE *
AS_UTL_openInputFile(char const *prefix,
char separator,
char const *suffix,
bool doOpen) {
char filename[FILENAME_MAX];
if (prefix == NULL)
return(NULL);
if (doOpen == false)
return(NULL);
if (suffix)
snprintf(filename, FILENAME_MAX, "%s%c%s", prefix, separator, suffix);
else
strncpy(filename, prefix, FILENAME_MAX-1);
errno = 0;
FILE *F = fopen(filename, "r");
if (errno)
fprintf(stderr, "Failed to open '%s' for reading: %s\n", filename, strerror(errno)), exit(1);
return(F);
}
FILE *
AS_UTL_openOutputFile(char const *prefix,
char separator,
char const *suffix,
bool doOpen) {
char filename[FILENAME_MAX];
if (prefix == NULL)
return(NULL);
if (doOpen == false)
return(NULL);
if (suffix)
snprintf(filename, FILENAME_MAX, "%s%c%s", prefix, separator, suffix);
else
strncpy(filename, prefix, FILENAME_MAX-1);
errno = 0;
FILE *F = fopen(filename, "w");
if (errno)
fprintf(stderr, "Failed to open '%s' for writing: %s\n", filename, strerror(errno)), exit(1);
return(F);
}
void
AS_UTL_closeFile(FILE *F, const char *filename, bool critical) {
if ((F == NULL) || (F == stdout) || (F == stderr))
return;
errno = 0;
fclose(F);
if ((critical == false) || (errno == 0))
return;
if (filename)
fprintf(stderr, "Failed to close file '%s': %s\n", filename, strerror(errno));
else
fprintf(stderr, "Failed to close file: %s\n", strerror(errno));
exit(1);
}
cftType
compressedFileType(char const *filename) {
......@@ -547,56 +639,58 @@ compressedFileReader::compressedFileReader(const char *filename) {
int32 len = 0;
_file = NULL;
_filename = duplicateString(filename);
_pipe = false;
_stdi = false;
cftType ft = compressedFileType(filename);
cftType ft = compressedFileType(_filename);
if ((ft != cftSTDIN) && (AS_UTL_fileExists(filename, FALSE, FALSE) == FALSE))
fprintf(stderr, "ERROR: Failed to open input file '%s': %s\n", filename, strerror(errno)), exit(1);
if ((ft != cftSTDIN) && (AS_UTL_fileExists(_filename, FALSE, FALSE) == FALSE))
fprintf(stderr, "ERROR: Failed to open input file '%s': %s\n", _filename, strerror(errno)), exit(1);
errno = 0;
switch (ft) {
case cftGZ:
snprintf(cmd, FILENAME_MAX, "gzip -dc '%s'", filename);
snprintf(cmd, FILENAME_MAX, "gzip -dc '%s'", _filename);
_file = popen(cmd, "r");
_pipe = true;
break;
case cftBZ2:
snprintf(cmd, FILENAME_MAX, "bzip2 -dc '%s'", filename);
snprintf(cmd, FILENAME_MAX, "bzip2 -dc '%s'", _filename);
_file = popen(cmd, "r");
_pipe = true;
break;
case cftXZ:
snprintf(cmd, FILENAME_MAX, "xz -dc '%s'", filename);
snprintf(cmd, FILENAME_MAX, "xz -dc '%s'", _filename);
_file = popen(cmd, "r");
_pipe = true;
if (_file == NULL) // popen() returns NULL on error. It does not reliably set errno.
fprintf(stderr, "ERROR: Failed to open input file '%s': popen() returned NULL\n", filename), exit(1);
fprintf(stderr, "ERROR: Failed to open input file '%s': popen() returned NULL\n", _filename), exit(1);
errno = 0;
break;
case cftSTDIN:
_file = stdin;
_stdi = 1;
_stdi = true;
break;
default:
_file = fopen(filename, "r");
_file = fopen(_filename, "r");
_pipe = false;
break;
}
if (errno)
fprintf(stderr, "ERROR: Failed to open input file '%s': %s\n", filename, strerror(errno)), exit(1);
fprintf(stderr, "ERROR: Failed to open input file '%s': %s\n", _filename, strerror(errno)), exit(1);
}
compressedFileReader::~compressedFileReader() {
if (_file == NULL)
......@@ -608,7 +702,9 @@ compressedFileReader::~compressedFileReader() {
if (_pipe)
pclose(_file);
else
fclose(_file);
AS_UTL_closeFile(_file);
delete [] _filename;
}
......@@ -618,28 +714,29 @@ compressedFileWriter::compressedFileWriter(const char *filename, int32 level) {
int32 len = 0;
_file = NULL;
_filename = duplicateString(filename);
_pipe = false;
_stdi = false;
cftType ft = compressedFileType(filename);
cftType ft = compressedFileType(_filename);
errno = 0;
switch (ft) {
case cftGZ:
snprintf(cmd, FILENAME_MAX, "gzip -%dc > '%s'", level, filename);
snprintf(cmd, FILENAME_MAX, "gzip -%dc > '%s'", level, _filename);
_file = popen(cmd, "w");
_pipe = true;
break;
case cftBZ2:
snprintf(cmd, FILENAME_MAX, "bzip2 -%dc > '%s'", level, filename);
snprintf(cmd, FILENAME_MAX, "bzip2 -%dc > '%s'", level, _filename);
_file = popen(cmd, "w");
_pipe = true;
break;
case cftXZ:
snprintf(cmd, FILENAME_MAX, "xz -%dc > '%s'", level, filename);
snprintf(cmd, FILENAME_MAX, "xz -%dc > '%s'", level, _filename);
_file = popen(cmd, "w");
_pipe = true;
break;
......@@ -650,13 +747,13 @@ compressedFileWriter::compressedFileWriter(const char *filename, int32 level) {
break;
default:
_file = fopen(filename, "w");
_file = fopen(_filename, "w");
_pipe = false;
break;
}
if (errno)
fprintf(stderr, "ERROR: Failed to open output file '%s': %s\n", filename, strerror(errno)), exit(1);
fprintf(stderr, "ERROR: Failed to open output file '%s': %s\n", _filename, strerror(errno)), exit(1);
}
......@@ -668,8 +765,15 @@ compressedFileWriter::~compressedFileWriter() {
if (_stdi)
return;
errno = 0;
if (_pipe)
pclose(_file);
else
fclose(_file);
AS_UTL_closeFile(_file);
if (errno)
fprintf(stderr, "ERROR: Failed to cleanly close output file '%s': %s\n", _filename, strerror(errno)), exit(1);
delete [] _filename;
}
......@@ -67,6 +67,8 @@ void AS_UTL_symlink(const char *pathToFile, const char *pathToLink);
void AS_UTL_unlink(const char *filename);
void AS_UTL_rename(const char *oldname, const char *newname);
int AS_UTL_fileExists(const char *path, int directory=false, int readwrite=false);
off_t AS_UTL_sizeOfFile(const char *path);
......@@ -75,7 +77,29 @@ void AS_UTL_fseek(FILE *stream, off_t offset, int whence);
// Read a file-of-files into a vector
void AS_UTL_loadFileList(char *fileName, vector<char *> &fileList);
void AS_UTL_loadFileList(char *fileName, vector<char *> &FILE);
FILE *AS_UTL_openInputFile (char const *prefix, char separator='.', char const *suffix=NULL, bool doOpen=true);
FILE *AS_UTL_openOutputFile(char const *prefix, char separator='.', char const *suffix=NULL, bool doOpen=true);
void AS_UTL_closeFile(FILE *F, const char *filename=NULL, bool critical=true);
template<typename OBJ>
void AS_UTL_loadFile(char *filename, OBJ *objects, uint64 numberToLoad) {
FILE *file = AS_UTL_openInputFile(filename);
uint64 length = AS_UTL_sizeOfFile(filename);
if (length / sizeof(OBJ) < numberToLoad)
fprintf(stderr, "AS_UTL_loadFile()-- File '%s' contains " F_U64 " objects, but asked to load " F_U64 ".\n",
filename, length / sizeof(OBJ), numberToLoad), exit(1);
AS_UTL_safeRead(file, objects, "", sizeof(OBJ), numberToLoad);
AS_UTL_closeFile(file);
}
......@@ -99,10 +123,13 @@ public:
FILE *operator*(void) { return(_file); };
FILE *file(void) { return(_file); };
bool isCompressed(void) { return(_pipe); };
bool isCompressed(void) { return(_pipe == true); };
bool isNormal(void) { return((_pipe == false) &&
(_stdi == false)); };
private:
FILE *_file;
char *_filename;
bool _pipe;
bool _stdi;
};
......@@ -117,10 +144,11 @@ public:
FILE *operator*(void) { return(_file); };
FILE *file(void) { return(_file); };
bool isCompressed(void) { return(_pipe); };
bool isCompressed(void) { return(_pipe == true); };
private:
FILE *_file;
char *_filename;
bool _pipe;
bool _stdi;
};
......
......@@ -111,11 +111,12 @@ reverseComplementCopy(char *seq, int len) {
template<typename qvType>
void
reverseComplement(char *seq, char *qlt, int len) {
reverseComplement(char *seq, qvType *qlt, int len) {
char c=0;
char *s=seq, *S=seq+len-1;
char *q=qlt, *Q=qlt+len-1;
qvType *q=qlt, *Q=qlt+len-1;
if (qlt == NULL) {
reverseComplementSequence(seq, len);
......@@ -142,3 +143,5 @@ reverseComplement(char *seq, char *qlt, int len) {
*s = inv[*s];
}
template void reverseComplement<char> (char *seq, char *qlt, int len); // Give the linker
template void reverseComplement<uint8>(char *seq, uint8 *qlt, int len); // something to link
......@@ -34,6 +34,8 @@
void reverseComplementSequence(char *seq, int len);
char *reverseComplementCopy(char *seq, int len);
void reverseComplement(char *seq, char *qlt, int len);
template<typename qvType>
void reverseComplement(char *seq, qvType *qlt, int len);
#endif
......@@ -36,15 +36,10 @@
#if (!defined(__CYGWIN__) && !defined(_WIN32))
#include <execinfo.h> // backtrace
#endif
#include <execinfo.h> // backtrace
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <cxxabi.h>
......
......@@ -35,13 +35,6 @@
* full conditions and disclaimers for each license.
*/
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <strings.h>
#include "bitPackedArray.H"
#include "bitPacking.H"
......
......@@ -47,11 +47,6 @@
#include "bitPackedFile.H"
#include "AS_UTL_fileIO.H"
//#include <stdio.h>
//#include <stdlib.h>
//#include <unistd.h>
//#include <errno.h>
//#include <string.h>
#include <fcntl.h>
......@@ -64,12 +59,6 @@ bitPackedFile::bitPackedFile(char const *name, uint64 offset, bool forceTruncate
_name = new char [strlen(name) + 1];
strcpy(_name, name);
#ifdef WITH_BZIP2
_bzFILE = 0L;
_bzerr = 0;
_bzfile = 0L;
#endif
_bfrmax = 1048576 / 8;
_bfr = new uint64 [_bfrmax];
_pos = uint64ZERO;
......@@ -81,7 +70,6 @@ bitPackedFile::bitPackedFile(char const *name, uint64 offset, bool forceTruncate
_bfrDirty = false;
_forceFirstLoad = false;
_isReadOnly = false;
_isBzip2 = false;
stat_seekInside = uint64ZERO;
stat_seekOutside = uint64ZERO;
......@@ -174,40 +162,6 @@ bitPackedFile::bitPackedFile(char const *name, uint64 offset, bool forceTruncate
return;
}
if ((c[0] == 'B') && (c[1] == 'Z') && (c[2] == 'h')) {
#ifdef WITH_BZIP2
// Looks like a bzip2 file!
errno = 0;
_bzFILE = fopen(_name, "r");
if (errno) {
fprintf(stderr, "bitPackedFile::bitPackedFile()-- failed to open bzip2 file '%s'\n", _name);
exit(1);
}
_bzerr = 0;
_bzfile = BZ2_bzReadOpen(&_bzerr, _bzFILE, 0, 0, 0L, 0);
if ((_bzfile == 0L) || (_bzerr != BZ_OK)) {
fprintf(stderr, "bitPackedFile::bitPackedFile()-- failed to init bzip2 file '%s'\n", _name);
exit(1);
}
BZ2_bzRead(&_bzerr, _bzfile, c, sizeof(char) * 16);
BZ2_bzRead(&_bzerr, _bzfile, &ac, sizeof(uint64));
BZ2_bzRead(&_bzerr, _bzfile, &bc, sizeof(uint64));
// XXX should check bzerr!
_isReadOnly = true;
_isBzip2 = true;
#else
fprintf(stderr, "bitPackedFile::bitPackedFile()-- '%s' looks like a bzip2 file, but bzip2 support not available!\n", _name);
exit(1);
#endif
}
// Check the magic number, decide on an endianess to use.
//
if (strncmp(t, c, 16) == 0) {
......@@ -238,14 +192,6 @@ bitPackedFile::~bitPackedFile() {
delete [] _bfr;
delete [] _name;
close(_file);
#ifdef WITH_BZIP2
if (_bzFILE)
fclose(_bzFILE);
if (_bzfile)
BZ2_bzReadClose(&_bzerr, _bzfile);
#endif
}
......@@ -306,86 +252,6 @@ bitPackedFile::flushDirty(void) {
void
bitPackedFile::seekBzip2(uint64 UNUSED(bitpos)) {
#ifdef WITH_BZIP2
// All we can do here is check that bitpos is
// a) in our current buffer
// b) would be in the next buffer once we read it
uint64 newpos = bitpos >> 6;
if (_pos + _bfrmax < newpos) {
// nope, not in the buffer -- we could probably handle this by just reading and
// discarding from the file until we get to the correct bitpos.
fprintf(stderr, "bitPackedFile::seekBzip2()-- '%s' seek was not contiguous!\n", _name);
exit(1);
}
// Copy the remaining bits of the current buffer to the start. Or
// not, if this is the first load.
uint64 lastpos = _bit >> 6; // The word we are currently in
uint64 lastlen = (_bfrmax - lastpos); // The number of words left in the buffer
if (_forceFirstLoad == true) {
lastpos = 0;
lastlen = 0;
} else {
memcpy(_bfr, _bfr + lastpos, sizeof(uint64) * lastlen);
}
// Update _bit and _pos -- lastlen is now the first invalid word
//
_bit = bitpos & 0x3f; // 64 * lastlen;
_pos = bitpos >> 6;
// Fill the buffer
size_t wordsread = 0;
if (_bzfile) {
_bzerr = 0;
wordsread = BZ2_bzRead(&_bzerr, _bzfile, _bfr + lastlen, sizeof(uint64) * (_bfrmax - lastlen));
if (_bzerr == BZ_STREAM_END) {
//fprintf(stderr, "bitPackedFile::seekBzip2() file ended.\n");
BZ2_bzReadClose(&_bzerr, _bzfile);
fclose(_bzFILE);
_bzfile = 0L;
_bzFILE = 0L;
} else if (_bzerr != BZ_OK) {
fprintf(stderr, "bitPackedFile::seekBzip2() '%s' read failed.\n", _name);
exit(1);
}
}
//fprintf(stderr, "Filled buffer with %d words!\n", wordsread);
// Adjust to make wordsread be the index of the last word we actually read.
//
wordsread += lastlen;
// Flip all the words we just read, if needed
//
if (endianess_flipped)
for (uint32 i=lastlen; i<wordsread; i++)
_bfr[i] = uint64Swap(_bfr[i]);
// Clear any words that we didn't read (supposedly, because we hit
// EOF).
//
while (wordsread < _bfrmax)
_bfr[wordsread++] = uint64ZERO;
#else
fprintf(stderr, "bitPackedFile::bitPackedFile()-- '%s'\n", _name);
fprintf(stderr, "bitPackedFile::bitPackedFile()-- bzip2 support not present, but still tried to read it??\n");
exit(1);
#endif
}
void
bitPackedFile::seekNormal(uint64 bitpos) {
......@@ -485,9 +351,6 @@ bitPackedFile::seek(uint64 bitpos) {
flushDirty();
if (_isBzip2)
seekBzip2(bitpos);
else
seekNormal(bitpos);
_forceFirstLoad = false;
......