Commits on Source (6)
# Changelog
## Unreleased
## [2.0.8] - 2019-04-25 (beta)
### Added
- FTP downloading option for taxonomy/libraries (--use-ftp for kraken2-build)
- Option to skip downloading taxonomy maps
### Changed
- Added lookup table to speed up parsing in MinimizerScanner class
- Default parameters for minimizer lengths and spaces changed (spaces=7 for
nucleotide search, length=12 for translated search)
### Fixed
- Linked space expansion value for proteins to constant used by MinimizerScanner
- Reporting of taxids in classified-out sequence files
- Confidence scoring bug associated with failure to leave some sequences
unclassified
- Reverse complement shifting bug, code made backwards-compatible with
existing databases (newly created DBs will have fix)
- NCBI taxonomy download error due to removal of EST/GSS files
## [2.0.7] - 2018-08-11 (beta)
### Added
......
kraken2 (2.0.8~beta-1) unstable; urgency=medium
* New upstream version
* debhelper-compat 12
* Standards-Version: 4.4.0
-- Andreas Tille <tille@debian.org> Thu, 01 Aug 2019 13:36:17 +0200
kraken2 (2.0.7~beta-1) unstable; urgency=low
* Initial release (Closes: #924567)
......
......@@ -3,8 +3,8 @@ Maintainer: Debian Med Packaging Team <debian-med-packaging@lists.alioth.debian.
Uploaders: Andreas Tille <tille@debian.org>
Section: science
Priority: optional
Build-Depends: debhelper (>= 12~)
Standards-Version: 4.3.0
Build-Depends: debhelper-compat (= 12)
Standards-Version: 4.4.0
Vcs-Browser: https://salsa.debian.org/med-team/kraken2
Vcs-Git: https://salsa.debian.org/med-team/kraken2.git
Homepage: https://www.ccb.jhu.edu/software/kraken2/
......
......@@ -13,7 +13,7 @@
<div class="pretoc">
<p class="title">Kraken taxonomic sequence classification system</p>
<p class="version">Version 2.0.7-beta</p>
<p class="version">Version 2.0.8-beta</p>
<p>Operating Manual</p>
</div>
......@@ -64,7 +64,8 @@
<p>Unlike Kraken 1, Kraken 2 does not use an external <span class="math inline"><em>k</em></span>-mer counter. However, by default, Kraken 2 will attempt to use the <code>dustmasker</code> or <code>segmasker</code> programs provided as part of NCBI's BLAST suite to mask low-complexity regions (see <a href="#masking-of-low-complexity-sequences">Masking of Low-complexity Sequences</a>).</p>
<p><strong>MacOS NOTE:</strong> MacOS and other non-Linux operating systems are <em>not</em> explicitly supported by the developers, and MacOS users should refer to the Kraken-users group for support in installing the appropriate utilities to allow for full operation of Kraken 2. We will attempt to use MacOS-compliant code when possible, but development and testing time is at a premium and we cannot guarantee that Kraken 2 will install and work to its full potential on a default installation of MacOS.</p>
<p>In particular, we note that the default MacOS X installation of GCC does not have support for OpenMP. Without OpenMP, Kraken 2 is limited to single-threaded operation, resulting in slower build and classification runtimes.</p></li>
<li><p><strong>Network connectivity</strong>: Kraken 2's standard database build and download commands expect unfettered FTP and rsync access to the NCBI FTP server. If you're working behind a proxy, you may need to set certain environment variables (such as <code>ftp_proxy</code> or <code>RSYNC_PROXY</code>) in order to get these commands to work properly.</p></li>
<li><p><strong>Network connectivity</strong>: Kraken 2's standard database build and download commands expect unfettered FTP and rsync access to the NCBI FTP server. If you're working behind a proxy, you may need to set certain environment variables (such as <code>ftp_proxy</code> or <code>RSYNC_PROXY</code>) in order to get these commands to work properly.</p>
<p>Kraken 2's scripts default to using rsync for most downloads; however, you may find that your network situation prevents use of rsync. In such cases, you can try the <code>--use-ftp</code> option to <code>kraken2-build</code> to force the downloads to occur via FTP.</p></li>
<li><p><strong>MiniKraken</strong>: At present, users with low-memory computing environments can replicate the &quot;MiniKraken&quot; functionality of Kraken 1 in two ways: first, by increasing the value of <span class="math inline"><em>k</em></span> with respect to <span class="math inline">ℓ</span> (using the <code>--kmer-len</code> and <code>--minimizer-len</code> options to <code>kraken2-build</code>); and secondly, through downsampling of minimizers (from both the database and query sequences) using a hash function. This second option is performed if the <code>--max-db-size</code> option to <code>kraken2-build</code> is used; however, the two options are not mutually exclusive. In a difference from Kraken 1, Kraken 2 does not require building a full database and then shrinking it to obtain a reduced database.</p></li>
</ul>
<h1 id="installation">Installation</h1>
......@@ -165,7 +166,8 @@
<ol style="list-style-type: decimal">
<li><p>Install a taxonomy. Usually, you will just use the NCBI taxonomy, which you can easily download using:</p>
<pre><code>kraken2-build --download-taxonomy --db $DBNAME</code></pre>
<p>This will download the accession number to taxon map, as well as the taxonomic name and tree information from NCBI. These files can be found in <code>$DBNAME/taxonomy/</code> . If you need to modify the taxonomy, edits can be made to the <code>names.dmp</code> and <code>nodes.dmp</code> files in this directory; you may also need to modify the <code>*.accession2taxid</code> files appropriately.</p></li>
<p>This will download the accession number to taxon maps, as well as the taxonomic name and tree information from NCBI. These files can be found in <code>$DBNAME/taxonomy/</code> . If you need to modify the taxonomy, edits can be made to the <code>names.dmp</code> and <code>nodes.dmp</code> files in this directory; you may also need to modify the <code>*.accession2taxid</code> files appropriately.</p>
<p>Some of the standard sets of genomic libraries have taxonomic information associated with them, and don't need the accession number to taxon maps to build the database successfully. These libraries include all those available through the <code>--download-library</code> option (see next point), except for the <code>plasmid</code> and non-redundant databases. If you are not using custom sequences (see the <code>--add-to-library</code> option) and are not using one of the <code>plasmid</code> or non-redundant database libraries, you may want to skip downloading of the accession number to taxon maps. This can be done by passing <code>--skip-maps</code> to the <code>kraken2-build --download-taxonomy</code> command.</p></li>
<li><p>Install one or more reference libraries. Several sets of standard genomes/proteins are made easily available through <code>kraken2-build</code>:</p>
<ul>
<li><code>archaea</code>: RefSeq complete archaeal genomes/proteins</li>
......
......@@ -111,6 +111,11 @@ System Requirements
certain environment variables (such as `ftp_proxy` or `RSYNC_PROXY`)
in order to get these commands to work properly.
Kraken 2's scripts default to using rsync for most downloads; however, you
may find that your network situation prevents use of rsync. In such cases,
you can try the `--use-ftp` option to `kraken2-build` to force the
downloads to occur via FTP.
* **MiniKraken**: At present, users with low-memory computing environments
can replicate the "MiniKraken" functionality of Kraken 1 in two ways:
first, by increasing
......@@ -410,13 +415,23 @@ To build a custom database:
kraken2-build --download-taxonomy --db $DBNAME
This will download the accession number to taxon map, as well as the
This will download the accession number to taxon maps, as well as the
taxonomic name and tree information from NCBI. These files can
be found in `$DBNAME/taxonomy/` . If you need to modify the taxonomy,
edits can be made to the `names.dmp` and `nodes.dmp` files in this
directory; you may also need to modify the `*.accession2taxid` files
appropriately.
Some of the standard sets of genomic libraries have taxonomic information
associated with them, and don't need the accession number to taxon maps
to build the database successfully. These libraries include all those
available through the `--download-library` option (see next point), except
for the `plasmid` and non-redundant databases. If you are not using
custom sequences (see the `--add-to-library` option) and are not using
one of the `plasmid` or non-redundant database libraries, you may want to
skip downloading of the accession number to taxon maps. This can be done
by passing `--skip-maps` to the `kraken2-build --download-taxonomy` command.
2. Install one or more reference libraries. Several sets of standard
genomes/proteins are made easily available through `kraken2-build`:
......
<div class="pretoc">
<p class="title">Kraken taxonomic sequence classification system</p>
<p class="version">Version 2.0.7-beta</p>
<p class="version">Version 2.0.8-beta</p>
<p>Operating Manual</p>
</div>
......
#!/bin/bash
# Copyright 2013-2018, Derrick Wood <dwood@cs.jhu.edu>
# Copyright 2013-2019, Derrick Wood <dwood@cs.jhu.edu>
#
# This file is part of the Kraken 2 taxonomic sequence classification system.
set -e
VERSION="2.0.7-beta"
VERSION="2.0.8-beta"
if [ -z "$1" ] || [ -n "$2" ]
then
......
#!/bin/bash
# Copyright 2013-2018, Derrick Wood <dwood@cs.jhu.edu>
# Copyright 2013-2019, Derrick Wood <dwood@cs.jhu.edu>
#
# This file is part of the Kraken 2 taxonomic sequence classification system.
......
#!/bin/bash
# Copyright 2013-2018, Derrick Wood <dwood@cs.jhu.edu>
# Copyright 2013-2019, Derrick Wood <dwood@cs.jhu.edu>
#
# This file is part of the Kraken 2 taxonomic sequence classification system.
......
#!/bin/bash
# Copyright 2013-2018, Derrick Wood <dwood@cs.jhu.edu>
# Copyright 2013-2019, Derrick Wood <dwood@cs.jhu.edu>
#
# This file is part of the Kraken 2 taxonomic sequence classification system.
......
#!/bin/bash
# Copyright 2013-2018, Derrick Wood <dwood@cs.jhu.edu>
# Copyright 2013-2019, Derrick Wood <dwood@cs.jhu.edu>
#
# This file is part of the Kraken 2 taxonomic sequence classification system.
......
#!/usr/bin/env perl
# Copyright 2013-2018, Derrick Wood <dwood@cs.jhu.edu>
# Copyright 2013-2019, Derrick Wood <dwood@cs.jhu.edu>
#
# This file is part of the Kraken 2 taxonomic sequence classification system.
......
#!/bin/bash
# Copyright 2013-2018, Derrick Wood <dwood@cs.jhu.edu>
# Copyright 2013-2019, Derrick Wood <dwood@cs.jhu.edu>
#
# This file is part of the Kraken 2 taxonomic sequence classification system.
......
#!/usr/bin/env perl
# Copyright 2013-2018, Derrick Wood <dwood@cs.jhu.edu>
# Copyright 2013-2019, Derrick Wood <dwood@cs.jhu.edu>
#
# This file is part of the Kraken 2 taxonomic sequence classification system.
......
#!/usr/bin/env perl
# Copyright 2013-2018, Derrick Wood <dwood@cs.jhu.edu>
# Copyright 2013-2019, Derrick Wood <dwood@cs.jhu.edu>
#
# This file is part of the Kraken 2 taxonomic sequence classification system.
......
#!/bin/bash
# Copyright 2013-2018, Derrick Wood <dwood@cs.jhu.edu>
# Copyright 2013-2019, Derrick Wood <dwood@cs.jhu.edu>
#
# This file is part of the Kraken 2 taxonomic sequence classification system.
......
#!/usr/bin/env perl
# Copyright 2013-2018, Derrick Wood <dwood@cs.jhu.edu>
# Copyright 2013-2019, Derrick Wood <dwood@cs.jhu.edu>
#
# This file is part of the Kraken 2 taxonomic sequence classification system.
......
#!/bin/bash
# Copyright 2013-2018, Derrick Wood <dwood@cs.jhu.edu>
# Copyright 2013-2019, Derrick Wood <dwood@cs.jhu.edu>
#
# This file is part of the Kraken 2 taxonomic sequence classification system.
......@@ -25,6 +25,16 @@ if [ -n "$KRAKEN2_PROTEIN_DB" ]; then
library_file="library.faa"
fi
function download_file() {
file="$1"
if [ -n "$KRAKEN2_USE_FTP" ]
then
wget -q ${FTP_SERVER}${file}
else
rsync --no-motd ${RSYNC_SERVER}${file} .
fi
}
case $library_name in
"archaea" | "bacteria" | "viral" | "fungi" | "plant" | "human" | "protozoa")
mkdir -p $LIBRARY_DIR/$library_name
......@@ -34,7 +44,7 @@ case $library_name in
if [ "$library_name" = "human" ]; then
remote_dir_name="vertebrate_mammalian/Homo_sapiens"
fi
if ! wget -q $FTP_SERVER/genomes/refseq/$remote_dir_name/assembly_summary.txt; then
if ! download_file "/genomes/refseq/$remote_dir_name/assembly_summary.txt"; then
1>&2 echo "Error downloading assembly summary file for $library_name, exiting."
exit 1
fi
......@@ -50,6 +60,7 @@ case $library_name in
mkdir -p $LIBRARY_DIR/plasmid
cd $LIBRARY_DIR/plasmid
rm -f library.f* plasmid.*
## This is staying FTP only D/L for now
1>&2 echo -n "Downloading plasmid files from FTP..."
wget -q --no-remove-listing --spider $FTP_SERVER/genomes/refseq/plasmid/
if [ -n "$KRAKEN2_PROTEIN_DB" ]; then
......@@ -75,8 +86,8 @@ case $library_name in
mkdir -p $LIBRARY_DIR/$library_name
cd $LIBRARY_DIR/$library_name
rm -f $library_name.gz
1>&2 echo -n "Downloading $library_name database from FTP..."
wget -q $FTP_SERVER/blast/db/FASTA/$library_name.gz
1>&2 echo -n "Downloading $library_name database from server... "
download_file "/blast/db/FASTA/$library_name.gz"
1>&2 echo "done."
1>&2 echo -n "Uncompressing $library_name database..."
gunzip $library_name.gz
......@@ -95,8 +106,8 @@ case $library_name in
fi
mkdir -p $LIBRARY_DIR/$library_name
cd $LIBRARY_DIR/$library_name
1>&2 echo -n "Downloading $library_name data from FTP..."
wget -q $FTP_SERVER/pub/UniVec/$library_name
1>&2 echo -n "Downloading $library_name data from server... "
download_file "/pub/UniVec/$library_name"
1>&2 echo "done."
# 28384: "other sequences"
special_taxid=28384
......
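The `download_file` helper added in the hunk above picks its transport from the `KRAKEN2_USE_FTP` environment variable. The following is a minimal dry-run sketch of that switch, not the upstream code: `show_download_command` is a hypothetical name, and it prints the command it would run instead of invoking `wget` or `rsync`, so it can be tried without network access. The server variables are copied from the taxonomy download script in this diff.

```shell
#!/bin/bash
# Dry-run sketch of the rsync/FTP switch in download_file.
# show_download_command is a hypothetical helper for illustration only;
# the real helper runs wget or rsync rather than echoing the command.
NCBI_SERVER="ftp.ncbi.nlm.nih.gov"
RSYNC_SERVER="rsync://$NCBI_SERVER"
FTP_SERVER="ftp://$NCBI_SERVER"

show_download_command() {
  local file="$1"
  if [ -n "$KRAKEN2_USE_FTP" ]; then
    # A non-empty KRAKEN2_USE_FTP selects FTP via wget.
    echo "wget -q ${FTP_SERVER}${file}"
  else
    # Default path: rsync into the current directory.
    echo "rsync --no-motd ${RSYNC_SERVER}${file} ."
  fi
}

# Show both transports for the same remote path.
unset KRAKEN2_USE_FTP
show_download_command "/pub/taxonomy/taxdump.tar.gz"
KRAKEN2_USE_FTP=1
show_download_command "/pub/taxonomy/taxdump.tar.gz"
```

Because the check is `-n`, any non-empty value of `KRAKEN2_USE_FTP` enables FTP. Per the manual text in this commit, the user-facing way to request this is the `--use-ftp` option to `kraken2-build`.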
#!/bin/bash
# Copyright 2013-2018, Derrick Wood <dwood@cs.jhu.edu>
# Copyright 2013-2019, Derrick Wood <dwood@cs.jhu.edu>
#
# This file is part of the Kraken 2 taxonomic sequence classification system.
......@@ -15,24 +15,32 @@ NCBI_SERVER="ftp.ncbi.nlm.nih.gov"
RSYNC_SERVER="rsync://$NCBI_SERVER"
FTP_SERVER="ftp://$NCBI_SERVER"
RSYNC="rsync --no-motd"
mkdir -p "$TAXONOMY_DIR"
cd "$TAXONOMY_DIR"
if [ ! -e "accmap.dlflag" ]
function download_file() {
file="$1"
if [ -n "$KRAKEN2_USE_FTP" ]
then
wget -q ${FTP_SERVER}${file}
else
rsync --no-motd ${RSYNC_SERVER}${file} .
fi
}
if [ ! -e "accmap.dlflag" ] && [ -z "$KRAKEN2_SKIP_MAPS" ]
then
if [ -z "$KRAKEN2_PROTEIN_DB" ]
then
for subsection in est gb gss wgs
for subsection in gb wgs
do
1>&2 echo -n "Downloading nucleotide ${subsection} accession to taxon map..."
$RSYNC $RSYNC_SERVER/pub/taxonomy/accession2taxid/nucl_${subsection}.accession2taxid.gz .
download_file "/pub/taxonomy/accession2taxid/nucl_${subsection}.accession2taxid.gz"
1>&2 echo " done."
done
else
1>&2 echo -n "Downloading protein accession to taxon map..."
$RSYNC $RSYNC_SERVER/pub/taxonomy/accession2taxid/prot.accession2taxid.gz .
download_file "/pub/taxonomy/accession2taxid/prot.accession2taxid.gz"
1>&2 echo " done."
fi
touch accmap.dlflag
......@@ -42,7 +50,7 @@ fi
if [ ! -e "taxdump.dlflag" ]
then
1>&2 echo -n "Downloading taxonomy tree data..."
$RSYNC $RSYNC_SERVER/pub/taxonomy/taxdump.tar.gz .
download_file "/pub/taxonomy/taxdump.tar.gz"
touch taxdump.dlflag
1>&2 echo " done."
fi
......
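The taxonomy script above now downloads accession maps only when `accmap.dlflag` is absent and `KRAKEN2_SKIP_MAPS` is empty, and it no longer requests the `est` and `gss` maps. The sketch below is a dry run of that selection logic under stated assumptions: `list_map_files` is a hypothetical helper that prints the file names the script would request rather than downloading them.

```shell
#!/bin/bash
# Dry-run sketch of the accession-map selection in the taxonomy script.
# list_map_files is a hypothetical helper for illustration only; the real
# script downloads these files instead of printing their names.
list_map_files() {
  # Nothing to fetch if the maps were already downloaded (flag file present)
  # or the user asked to skip them.
  if [ -e "accmap.dlflag" ] || [ -n "$KRAKEN2_SKIP_MAPS" ]; then
    return 0
  fi
  if [ -z "$KRAKEN2_PROTEIN_DB" ]; then
    # Nucleotide databases: only the gb and wgs maps remain in 2.0.8.
    for subsection in gb wgs; do
      echo "nucl_${subsection}.accession2taxid.gz"
    done
  else
    # Protein databases use a single map.
    echo "prot.accession2taxid.gz"
  fi
}

# Default nucleotide case, then the same call with maps skipped.
unset KRAKEN2_SKIP_MAPS KRAKEN2_PROTEIN_DB
list_map_files
KRAKEN2_SKIP_MAPS=1
list_map_files
```

The tree download (`taxdump.tar.gz`) is guarded separately by `taxdump.dlflag`, so skipping the maps does not skip the taxonomy itself. Per the manual text in this commit, the user-facing switch is `--skip-maps` passed to `kraken2-build --download-taxonomy`.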