Skip to content
Commits on Source (11)
......@@ -97,23 +97,11 @@ Full documentation, including a basic description of the algorithm, is hosted
at https://axe-demultiplexer.readthedocs.org/en/latest/ .
Implementation Progress:
------------------------
- [x] Single ended read de-multiplexing
- [x] Interleaved/Paired input and output with single-ended de-multiplexing
- [x] Combinatorial de-multiplexing
- [ ] CLI integration tests
- [ ] Comprehensive `libaxe` tests
- [ ] Comprehensive CLI tests
See also TODO.md
Publication
-----------
A publication is coming soon, if the reviewer gods decide to smile upon us.
We have a preprint describing AXE up at http://www.biorxiv.org/content/early/2017/07/07/160606
Versioning
----------
......
TODO list:
==========
- Re-read documentation; fix errors
- Exhaustive CLI tests
- Refactor trie lookups to hide all datrie operations from main body of code
- Summary info
- Valgrind integration tests - maybe a CLI flag to enable
- Investigate shipping the gsl functions bundled in source.
axe-demultiplexer (0.3.3+dfsg-1) unstable; urgency=medium
* Team upload.
* New upstream version
* Point Vcs fields to salsa.debian.org
* Standards-Version: 4.1.4
* debhelper 11
* Build-Depends: s/python-sphinx/python3-sphinx/
* Do not parse d/changelog
* Enable running tests using Python3 instead of Python2
-- Andreas Tille <tille@debian.org> Thu, 26 Apr 2018 15:24:43 +0200
axe-demultiplexer (0.3.2+dfsg1-1) unstable; urgency=medium
[ Kevin Murray ]
......
......@@ -3,15 +3,15 @@ Maintainer: Debian Med Packaging Team <debian-med-packaging@lists.alioth.debian.
Uploaders: Kevin Murray <kdmfoss@gmail.com>
Section: science
Priority: optional
Standards-Version: 3.9.8
Build-Depends: debhelper (>= 10),
Build-Depends: debhelper (>= 11~),
cmake,
zlib1g-dev,
libqes-dev,
libgsl-dev,
python-sphinx,
Vcs-Browser: https://anonscm.debian.org/git/debian-med/axe-demultiplexer.git
Vcs-Git: https://anonscm.debian.org/git/debian-med/axe-demultiplexer.git
python3-sphinx
Standards-Version: 4.1.4
Vcs-Browser: https://salsa.debian.org/med-team/axe-demultiplexer
Vcs-Git: https://salsa.debian.org/med-team/axe-demultiplexer.git
Homepage: https://github.com/kdmurray91/axe
Package: axe-demultiplexer
......
......@@ -6,8 +6,6 @@ Subject: Use standard include path of gsl headers
src/axe.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/axe.c b/src/axe.c
index 6390418..9326947 100644
--- a/src/axe.c
+++ b/src/axe.c
@@ -23,7 +23,7 @@
......@@ -16,6 +14,6 @@ index 6390418..9326947 100644
#include "axe.h"
-#include "gsl_combination.h"
+#include <gsl/gsl_combination.h>
/* Holds the current timestamp, so we don't have to free the returned string
* from now(). */
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
Author: Andreas Tille <tille@debian.org>
Last-Update: Thu, 26 Apr 2018 15:14:56 +0200
Description: Enable running tests using Python3 instead of Python2
When switching Build-Depends from python-sphinx to python3-sphinx
the Python 2 interpreter was not installed any more in the build
time environment. This was used to trigger the build time test
suite. By changing the script tp Python3 it is possible to get
rid of this dependency.
--- a/tests/axe_cli_tests.py
+++ b/tests/axe_cli_tests.py
@@ -1,5 +1,5 @@
-#!/usr/bin/env python
-from __future__ import print_function
+#!/usr/bin/env python3
+
import hashlib
import logging
import os
--- a/tests/CMakeLists.txt
+++ b/tests/CMakeLists.txt
@@ -19,7 +19,7 @@ ADD_TEST(NAME "UnitTests" COMMAND test_a
SET(COVERAGE_CMD test_axe)
SET(COVERAGE_OUT "${CMAKE_BINARY_DIR}/coverage_html")
-ADD_TEST(NAME "IntegrationTests" COMMAND python
+ADD_TEST(NAME "IntegrationTests" COMMAND python3
${CMAKE_BINARY_DIR}/bin/axe_cli_tests.py
${CMAKE_BINARY_DIR})
......@@ -2,3 +2,4 @@ cmake-version
0002-Remove-libqes-from-CML.txt.patch
0003-Use-standard-include-path-of-gsl-headers.patch
0004-Dynamically-link-to-libqes.patch
0005-do_not_depend_python2.patch
......@@ -2,12 +2,12 @@
export DEB_BUILD_MAINT_OPTIONS = hardening=+all
DEBPKGNAME := $(shell dpkg-parsechangelog | awk '/^Source:/ {print $$2}')
include /usr/share/dpkg/default.mk
%:
dh $@ --buildsystem=cmake
override_dh_installdocs:
dh_installdocs
mkdir -p debian/$(DEBPKGNAME)/usr/share/doc/$(DEBPKGNAME)/tests
cp -a tests/data debian/$(DEBPKGNAME)/usr/share/doc/$(DEBPKGNAME)/tests
mkdir -p debian/$(DEB_SOURCE)/usr/share/doc/$(DEB_SOURCE)/tests
cp -a tests/data debian/$(DEB_SOURCE)/usr/share/doc/$(DEB_SOURCE)/tests
......@@ -17,7 +17,7 @@ IF(SPHINXBUILD)
COMMAND ${CMAKE_COMMAND} -E remove_directory "${CMAKE_BINARY_DIR}/doc"
WORKING_DIRECTORY ${CMAKE_BINARY_DIR}
)
ADD_CUSTOM_TARGET(doc ALL DEPENDS doc_man doc_onehtml)
ADD_CUSTOM_TARGET(doc ALL DEPENDS doc_man doc_html doc_onehtml)
INSTALL(FILES ${CMAKE_BINARY_DIR}/doc/man/axe.1 DESTINATION "share/man/man1" RENAME "axe-demux.1")
SET_DIRECTORY_PROPERTIES(PROPERTY ADDITIONAL_MAKE_CLEAN_FILES doc/)
ELSE()
......
......@@ -4,17 +4,17 @@ Axe's matching algorithm
Axe uses an algorithm based on longest-prefix-in-trie matching to match a
variable length from the start of each read against a set of 'mutated'
barcodes.
indexes.
Hamming distance matching
-------------------------
While for most applications in high-throughput sequencing hamming distances are
a frowned-upon metric, it is typical for HTS read barcodes to be designed to
a frowned-upon metric, it is typical for HTS read indexes to be designed to
tolerate a certain level of hamming mismatches. Given these sequences are short
and typically occur at the 5' end of reads, insertions and deletions rarely
need be considered, and the increased rate of assignment of reads with many
errors is offset by the risk of falsely assigning barcodes to an incorrect
errors is offset by the risk of falsely assigning indexes to an incorrect
sample. In any case, reads with more than 1-2 sequencing errors in their first
several bases are likely to be poor quality, and will simply be filtered out
during downstream quality control.
......@@ -22,47 +22,47 @@ during downstream quality control.
Hamming mismatch tries
----------------------
Typically, reads are matched to a set of barcodes by calculating the hamming
distance between the barcode, and the first :math:`l` bases of a read for a
barcode of length :math:`l`. The "correct" barcode is then selected by
recording either the barcode with the lowest hamming distance to the read
(competitive matching) or by simply accepting the first barcode with a hamming
Typically, reads are matched to a set of indexes by calculating the hamming
distance between the index, and the first :math:`l` bases of a read for a
index of length :math:`l`. The "correct" index is then selected by
recording either the index with the lowest hamming distance to the read
(competitive matching) or by simply accepting the first index with a hamming
distance below a certain threshold. These approaches are both very
computationally expensive, and can have lower accuracy than the algorithm I
propose. Additionally, implementations of these methods rarely handle barcodes
of differing length and combinatorial barcoding well, if at all.
propose. Additionally, implementations of these methods rarely handle indexes
of differing length and combinatorial indexing well, if at all.
Central to Axe's algorithm is the concept of hamming-mismatch tries. A trie is
a N-ary tree for an N letter alphabet. In the case of high-throughput
sequencing reads, we have the alphabet ``AGCT``, corresponding to the four
nucleotides of DNA, plus ``N``, used to represent ambiguous base calls. Instead
of matching each barcode to each read, we pre-calculate all allowable sequences
of matching each index to each read, we pre-calculate all allowable sequences
at each mismatch level, and store these in level-wise tries. For example, to
match to a hamming distance of 2, we create three tries: One containing all
barcodes, verbatim, and two tries where every sequence within a hamming
distance of 1 and 2 of each barcode respectively. Hereafter, these tries are
indexes, verbatim, and two tries where every sequence within a hamming
distance of 1 and 2 of each index respectively. Hereafter, these tries are
referred to as the 0, 1 and 2-mm tries, for a hamming distance (mismatch) of
0, 1 and 2. Then, we find the longest prefix in each sequence read in the 0mm
trie. If this prefix is not a valid leaf in the 0mm trie, we find the longest
prefix in the 1mm trie, and so on for all tries in ascending order. If no
prefix of the read is a complete sequence in any trie, the read is assigned to
an "non-barcoded" output file.
an "non-indexed" output file.
This algorithm ensures optimal barcode matching in many ways, but is also
extremely fast. In situations with barcodes of differing length, we ensure that
the *longest* acceptable barcode at a given hamming distance is chosen;
assuming that sequence is random after the barcode, the probability of false
This algorithm ensures optimal index matching in many ways, but is also
extremely fast. In situations with indexes of differing length, we ensure that
the *longest* acceptable index at a given hamming distance is chosen;
assuming that sequence is random after the index, the probability of false
assignments using this method is low. We also ensure that short perfect matches
are preferred to longer inexact matches, as we firstly only consider barcodes
with no error, then 1 error, and so on. This ensures that reads with barcodes
are preferred to longer inexact matches, as we firstly only consider indexes
with no error, then 1 error, and so on. This ensures that reads with indexes
that are followed by random sequence that happens to inexactly match a longer
barcode in the set are not falsely assigned to this longer barcode.
index in the set are not falsely assigned to this longer index.
The speed of this algorithm is largely due to the constant time matching
algorithm with respect to the number of barcodes to match. The time taken to
match each read is proportional instead to the length of the barcodes, as for a
barcode of length :math:`l`, at most :math:`l + 1` trie level descents are
algorithm with respect to the number of indexes to match. The time taken to
match each read is proportional instead to the length of the indexes, as for a
index of length :math:`l`, at most :math:`l + 1` trie level descents are
required to find an entry in the trie. As this length is more-or-less constant
and small, the overall complexity of axe's algorithm is :math:`O(n)` for
:math:`n` reads, as opposed to :math:`O(nm)` for :math:`n` reads and :math:`m`
barcodes as is typical for traditional matching algorithms
indexes as is typical for traditional matching algorithms
......@@ -31,7 +31,7 @@ import gitversion
# ones.
extensions = [
'sphinx.ext.todo',
'sphinx.ext.pngmath',
'sphinx.ext.imgmath',
'sphinx.ext.ifconfig',
]
......@@ -188,10 +188,10 @@ htmlhelp_basename = 'axedoc'
latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
'papersize': 'a4paper',
#'papersize': 'a4paper',
# The font size ('10pt', '11pt' or '12pt').
'pointsize': '11pt',
#'pointsize': '11pt',
# Additional stuff for the LaTeX preamble.
#'preamble': '',
......
......@@ -8,8 +8,8 @@
# versioneer-0.13 (https://github.com/warner/python-versioneer)
# these strings will be replaced by git during git-archive
git_refnames = " (HEAD -> master, tag: 0.3.2)"
git_full = "0faf8e3e05ba45343357e0cfa6c9657cedee4cc9"
git_refnames = " (tag: 0.3.3)"
git_full = "4b985d2ac95b046e96f0a83cb98f1f5e31fa49bd"
# these strings are filled in when 'setup.py versioneer' creates _version.py
tag_prefix = ""
......
......@@ -7,16 +7,17 @@ Welcome to axe's documentation!
===============================
Axe is a read de-multiplexer, useful in situations where sequence reads contain
the barcodes that uniquely distinguish samples. Axe uses a rapid and accurate
the indexes that uniquely distinguish samples. Axe uses a rapid and accurate
algorithm based on hamming mismatch tries to competitively match the prefix of
a sequencing read against a set of barcodes. Axe supports combinatorial
barcoding schemes.
a sequencing read against a set of indexes. Axe supports combinatorial
indexing schemes.
Contents:
.. toctree::
:maxdepth: 2
:maxdepth: 1
tutorial
usage
algorithm
......
************
Axe Tutorial
************
In this tutorial, we'll use Axe to demultiplex some paired-end,
combinatorially-index Genotyping-by-Sequencing reads. The data for this
tutorial is available from figshare:
https://figshare.com/articles/axe-tutorial_tar/6143720 .
Axe should be run as the initial step of any analysis: don't use sequence QC
tools like AdapterRemoval or Trimmomatic before using axe, as indexes may be
trimmed away, or pairing information removed.
Step 0: Download the trial data
-------------------------------
This will download the trial data, and extract it on the fly:
.. code-block:: bash
curl -LS https://ndownloader.figshare.com/files/11094782 | tar xv
Step 1: prepare a key file
--------------------------
The key file associates index sequences with sample names. A key file can be
prepared in a spreadsheet editor, like LibreOffice Calc, or Excel. The format
is quite strict, and is described in detail in the online usage documentation.
Let's now inspect the keyfile I have provided for the tutorial.
.. code-block:: bash
head axe-keyfile.tsv
Step 2: Demultiplex with Axe
----------------------------
In this step, we will demultiplex our interleaved input file to per-sample
interleaved output files. To see a full range of Axe's options, please run
``axe-demux -h``, or inspect the online usage documentation.
First, let's inspect the input.
.. code-block:: bash
zcat axe-tutorial.fastq.gz | head -n 8
Then, we need to ensure that axe has somewhere to put the demultiplexed reads.
Axe outputs one file (or more, depending on pairing) per sample. Axe does so by
appending the sample name to some prefix (as given by the ``-I``, ``-F``,
and/or ``-R`` options). If this prefix is a directory, then sample fastq files
will be created in that sub-directory, but the directory must exist. Let's make
an output directory:
.. code-block:: bash
mkdir -p output
Now, let's demultiplex the reads!
.. code-block:: bash
axe-demux -i axe-tutorial.fastq.gz -I output/ \
-c -b axe-keyfile.tsv -t demux-stats.tsv -z 1
The command above demultiplexes reads from ``axe-tutorial.fastq.gz`` into
separate files under ``output``, based on the combinatorial (``-c``)
sample-to-index-sequence mapping described in ``axe-keyfile.tsv``, and saves a
file of statistics as ``demux-stats.tsv``. Note that we have enabled
compression of output files using the ``-z`` option, in case you don't have
much disk space available. This will make Axe slightly slower.
......@@ -9,15 +9,16 @@ Axe Usage
did not change.
Axe has several usage modes. The primary distinction is between the two
alternate barcoding schemes, single and combinatorial barcoding. Single barcode
matching is used when only the first read contains barcode sequences.
Combinatorial barcoding is used when both reads in a read pair contain
independent (typically different) barcode sequences.
alternate indexing schemes, single and combinatorial indexing. Single index
matching is used when only the first read contains index sequences.
Combinatorial indexing is used when both reads in a read pair contain
independent (typically different) index sequences.
For concise reference, the command-line usage of ``axe-demux`` is reproduced
below:
.. literalinclude:: usage.txt
:language: text
Inputs and Outputs
------------------
......@@ -37,8 +38,8 @@ default) and 9, where 0 indicates plain text output (``gzopen`` mode "wT"), and
fastest and 9 is most compact.
The output flags should be prefixes that are used to generate the output file
name based on the barcode's (or barcode pair's) ID. The names are generated as:
``prefix`` + ``_`` + ``barcode ID`` + ``_`` + ``read number`` + ``.extension``.
name based on the index's (or index pair's) ID. The names are generated as:
``prefix`` + ``_`` + ``index ID`` + ``_`` + ``read number`` + ``.extension``.
The output file for reads that could not be demultiplexed is ``prefix`` + ``_``
+ ``unknown`` + ``_`` + ``read number`` + ``.extension``. The read number is
omitted unless the paired read file scheme is used, and is "il" for interleaved
......@@ -51,65 +52,65 @@ The corresponding CLI flags are:
- ``-r`` and ``-R``: Paired R2 file input and output.
- ``-i`` and ``-I``: Interleaved paired input and output.
The barcode file
The index file
----------------
The barcode file is a tab-separated file with an optional header. It is
The index file is a tab-separated file with an optional header. It is
mandatory, and is always supplied using the ``-b`` command line flag. The exact
format is dependent on barcoding mode, and is described further in the sections
format is dependent on indexing mode, and is described further in the sections
below. If a header is present, the header line must start with either
`Barcode` or ``barcode``, or it will be interpreted as a barcode line, leading
`Barcode` or ``index``, or it will be interpreted as a index line, leading
to a parsing error. Any line starting with ';' or '#' is ignored, allowing
comments to be added in line with barcodes. Please ensure that the software
used to produce the barcode uses ASCII encoding, and does not insert a
comments to be added in line with indexes. Please ensure that the software
used to produce the index uses ASCII encoding, and does not insert a
Byte-order Mark (BoM) as many text editors can silently use Unicode-based
encoding schemes. I recommend the use of
`LibreOffice Calc <www.libreoffice.org>`_ (part of a free and open source
office suite) to generate barcode tables; Microsoft Excel can also be used.
office suite) to generate index tables; Microsoft Excel can also be used.
Mismatch level selection
------------------------
Independent of barcode mode, the ``-m`` flag is used to select the maximum
allowable hamming distance between a read's prefix and a barcode to be
considered as a match. As "mutated" barcodes must be unique, a hamming distance
of one is the default as typically barcodes are designed to differ by a hamming
Independent of index mode, the ``-m`` flag is used to select the maximum
allowable hamming distance between a read's prefix and a index to be
considered as a match. As "mutated" indexes must be unique, a hamming distance
of one is the default as typically indexes are designed to differ by a hamming
distance of at least two. Optionally, (using the ``-p`` flag), axe will allow
selective mismatch levels, where, if clashes are observed, the barcode will
only be matched exactly. This allows one to process datasets with barcodes that
selective mismatch levels, where, if clashes are observed, the index will
only be matched exactly. This allows one to process datasets with indexes that
don't have a sufficiently high distance between them.
Single barcode mode
Single index mode
-------------------
Single barcode mode is the default mode of operation. Barcodes are matched
against read one (hereafter the forward read), and the barcode is trimmed from
Single index mode is the default mode of operation. Barcodes are matched
against read one (hereafter the forward read), and the index is trimmed from
only the forward read, unless the ``-2`` command line flag is given, in which
case a prefix the same length as the matched barcode is also trimmed from the
case a prefix the same length as the matched index is also trimmed from the
second or reverse read. Note that sequence of this second read is not checked
before trimming.
In single barcode mode, the barcode file has two columns: ``Barcode`` and
In single index mode, the index file has two columns: ``Barcode`` and
``ID``.
Combinatorial barcode mode
Combinatorial index mode
--------------------------
Combinatorial barcode mode is activated by giving the ``-c`` flag on the
command line. Forward read barcodes are matched against the forward read, and
reverse read barcodes are matched against the reverse read. The optimal
barcodes are selected independently, and the barcode pair is selected from
these two barcodes. The respective barcodes are trimmed from both reads; the
``-2`` command line flag has no effect in combinatorial barcode mode.
Combinatorial index mode is activated by giving the ``-c`` flag on the
command line. Forward read indexes are matched against the forward read, and
reverse read indexes are matched against the reverse read. The optimal
indexes are selected independently, and the index pair is selected from
these two indexes. The respective indexes are trimmed from both reads; the
``-2`` command line flag has no effect in combinatorial index mode.
In combinatorial barcode mode, the barcode file has three columns:
``Barcode1``, ``Barcode2`` and ``ID``. Individual barcodes can occur many times
within the forward and reverse barcodes, but barcode pairs must be unique
In combinatorial index mode, the index file has three columns:
``Barcode1``, ``Barcode2`` and ``ID``. Individual indexes can occur many times
within the forward and reverse indexes, but index pairs must be unique
combinations.
The Demultipexing Statistics File
---------------------------------
The Demultiplexing Statistics File
----------------------------------
The ``-t`` option allows the output of per-sample read counts to a
tab-separated file. The file will have a header describing its format, and
includes a line for unbarcoded reads.
includes a line for reads which could not be demultiplexed.
......@@ -3,7 +3,7 @@
*
* Filename: axe.c
* Description: Demultiplex reads by 5' barcodes
* Copyright: 2014-2015 Kevin Murray <spam@kdmurray.id.au>
* Copyright: 2014-2016 Kevin Murray <kdmfoss@gmail.com>
* License: GNU GPL v3+
*
* This program is free software: you can redistribute it and/or modify it
......@@ -24,6 +24,9 @@
#include "axe.h"
#include "gsl_combination.h"
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
/* Holds the current timestamp, so we don't have to free the returned string
* from now(). */
......@@ -169,16 +172,15 @@ _axe_format_outfile_path (const char *prefix, const char *id, int read,
char buf[4096];
int res = 0;
char *our_prefix = NULL;
char lastchr = '\0';
size_t prefix_len = 0;
struct stat statbuff;
if (prefix == NULL || id == NULL) {
return NULL;
}
prefix_len = strlen(prefix);
lastchr = prefix[prefix_len - 1];
if (lastchr == '/' || lastchr == '\\') {
if (stat(prefix, &statbuff) == 0 && S_ISDIR(statbuff.st_mode)) {
/* Our prefix is a directory, don't add '_' */
our_prefix = strdup(prefix);
} else {
......
......@@ -3,7 +3,7 @@
*
* Filename: axe.h
* Description: Demultiplex reads by 5' barcodes
* Copyright: 2014-2015 Kevin Murray <spam@kdmurray.id.au>
* Copyright: 2014-2016 Kevin Murray <kdmfoss@gmail.com>
* License: GNU GPL v3+
*
* This program is free software: you can redistribute it and/or modify it
......
......@@ -3,7 +3,7 @@
*
* Filename: axe_config.h.in
* Description: Pull in build system definitions
* Copyright: 2014-2015 Kevin Murray <spam@kdmurray.id.au>
* Copyright: 2014-2016 Kevin Murray <kdmfoss@gmail.com>
* License: GNU GPL v3+
*
* This program is free software: you can redistribute it and/or modify it
......
......@@ -3,7 +3,7 @@
*
* Filename: axe_main.c
* Description: Main loop for axe
* Copyright: 2014-2015 Kevin Murray <spam@kdmurray.id.au>
* Copyright: 2014-2016 Kevin Murray <kdmfoss@gmail.com>
* License: GNU GPL v3+
*
* This program is free software: you can redistribute it and/or modify it
......