Commits on Source (3)
......@@ -2,6 +2,21 @@
Changes
=======
v2.4 (2019-07-09)
-----------------
* :issue:`292`: Implement support for demultiplexing paired-end reads that use
:ref:`combinatorial indexing (“combinatorial demultiplexing”)
<combinatorial-demultiplexing>`.
* :pr:`384`: Speed up reading compressed files by requiring an xopen version
that uses an external pigz process even for *reading* compressed input files
(not only for writing).
* :issue:`381`: Fix ``--report=minimal`` not working.
* :issue:`380`: Add a ``--fasta`` option for forcing that FASTA is written
to standard output even when input is FASTQ. Previously, forcing
FASTA was only possible by providing an output file name.
v2.3 (2019-04-25)
-----------------
......
Metadata-Version: 2.1
Name: cutadapt
Version: 2.3
Version: 2.4
Summary: trim adapters from high-throughput sequencing reads
Home-page: https://cutadapt.readthedocs.io/
Author: Marcel Martin
......
python-cutadapt (2.4-1) unstable; urgency=medium
* Team upload
* New upstream version.
* Standards-Version: 4.4.0
* Cleaning auto-generated files
-- Steffen Moeller <moeller@debian.org> Thu, 01 Aug 2019 20:34:10 +0200
python-cutadapt (2.3-1) UNRELEASED; urgency=medium
* Team upload.
......
......@@ -16,7 +16,7 @@ Build-Depends: debhelper (>= 11~),
python3-xopen (>= 0.5.0),
python3-dnaio (>= 0.3),
cython3
Standards-Version: 4.3.0
Standards-Version: 4.4.0
Vcs-Browser: https://salsa.debian.org/med-team/python-cutadapt
Vcs-Git: https://salsa.debian.org/med-team/python-cutadapt.git
Homepage: http://pypi.python.org/pypi/cutadapt
......
......@@ -28,3 +28,7 @@ override_dh_install:
override_dh_auto_test:
dh_auto_test -- -s custom --test-args="cd {build_dir}; \
py.test-3 --pyargs cutadapt tests"
override_dh_auto_clean:
dh_auto_clean
rm -rf src/cutadapt.egg-info/ .pytest_cache
......@@ -66,7 +66,7 @@ else:
release = version
issues_uri = 'https://github.com/marcelm/cutadapt/issues/{issue}'
issues_pr_uri = 'https://github.com/marcelm/cutadapt/pull/{pr}'
suppress_warnings = ['image.nonlocal_uri']
# The language for content autogenerated by Sphinx. Refer to documentation
......
......@@ -33,38 +33,13 @@ versions installed)::
venv/bin/tox
Development installation (without virtualenv)
---------------------------------------------
Alternatively, if you do not want to use virtualenv, running the following may
work from within the cloned repository::
python3 setup.py build_ext -i
pytest
This requires Cython and pytest to be installed. Avoid this method and use a
virtualenv instead if you can.
Code style
----------
Cutadapt tries to follow PEP8, with some exceptions:
* Indentation is made with tabs, not with spaces
* The maximum line length for code is 100 characters, not 80, but try to wrap
comments at 80 characters for readability.
Yes, there are inconsistencies in the current code base since it’s a few years old already.
Making a release
----------------
Since version 1.17, Travis CI is used to automatically deploy a new Cutadapt release
(both as an sdist and as wheels) whenever a new tag is pushed to the Git repository.
Cutadapt uses `versioneer <https://github.com/warner/python-versioneer>`_ to automatically manage
Cutadapt uses `setuptools_scm <https://github.com/pypa/setuptools_scm>`_ to automatically manage
version numbers. This means that the version is not stored in the source code but derived from
the most recent Git tag. The following procedure can be used to bump the version and make a new
release.
......@@ -90,82 +65,16 @@ release.
#. Wait for Travis to finish and to deploy to PyPI.
#. Update the `bioconda recipe <https://github.com/bioconda/bioconda-recipes/blob/master/recipes/cutadapt/meta.yaml>`_.
It is probably easiest to edit the recipe via the web interface and send in a
pull request. Ensure that the list of dependencies (the ``requirements:``
section in the recipe) is in sync with the ``setup.py`` file.
Since this is just a version bump, the pull request does not need a
review by other bioconda developers. As soon as the tests pass and if you
have the proper permissions, it can be merged directly.
Releases to bioconda still need to be made manually.
Making a release manually
-------------------------
.. note::
This section is outdated, see the previous section!
If this is the first time you attempt to upload a distribution to PyPI, create a
configuration file named ``.pypirc`` in your home directory with the following
contents::
[distutils]
index-servers =
pypi
[pypi]
username=my-user-name
password=my-password
See also `this blog post about getting started with
PyPI <http://peterdowns.com/posts/first-time-with-pypi.html>`_. In particular,
note that a ``%`` in your password needs to be doubled and that the password
must *not* be put between quotation marks even if it contains spaces.
Cutadapt uses `versioneer <https://github.com/warner/python-versioneer>`_ to automatically manage
version numbers. This means that the version is not stored in the source code but derived from
the most recent Git tag. The following procedure can be used to bump the version and make a new
release.
#. Update ``CHANGES.rst`` (version number and list of changes)
#. Ensure you have no uncommitted changes in the working copy.
#. Run a ``git pull``.
#. Run ``tox``, ensuring all tests pass.
#. Tag the current commit with the version number (there must be a ``v`` prefix)::
git tag v0.1
#. Create a distribution (``.tar.gz`` file). Double-check that the auto-generated version number in
the tarball is as you expect it by looking at the name of the generated file in ``dist/``::
python3 setup.py sdist
#. If necessary, pip install ``twine`` and then upload the generated tar file to PyPI::
twine upload dist/cutadapt-0.1.tar.gz # adjust version number
#. Push the tag::
git push --tags
#. The `bioconda recipe <https://github.com/bioconda/bioconda-recipes/blob/master/recipes/cutadapt/meta.yaml>`_
also needs to be updated, but the bioconda bot will likely do this automatically
if you just wait a little while.
#. Update the `bioconda recipe <https://github.com/bioconda/bioconda-recipes/blob/master/recipes/cutadapt/meta.yaml>`_.
It is probably easiest to edit the recipe via the web interface and send in a
pull request. Ensure that the list of dependencies (the ``requirements:``
Ensure that the list of dependencies (the ``requirements:``
section in the recipe) is in sync with the ``setup.py`` file.
Since this is just a version bump, the pull request does not need a
review by other bioconda developers. As soon as the tests pass and if you
have the proper permissions, it can be merged directly.
If something went wrong *after* you uploaded a tarball, fix the problem and follow the
above instructions again, but with an incremented revision in the version number.
That is, go from version x.y to x.y.1. Do not change a version that has already
If something went wrong *after* a version has already been tagged and published to
PyPI, fix the problem and tag a new version. Do not change a version that has already
been uploaded.
.. include:: ../CONTRIBUTING.rst
......@@ -30,20 +30,30 @@ explained further down.
Input and output file formats
-----------------------------
Input files for Cutadapt need to be in one of these formats:
Input and output files need to be in FASTA or FASTQ format. Reading and writing
compressed file formats ``.gz``, ``.bz2`` or ``.xz`` is also supported. Cutadapt
uses ``pigz`` internally if possible to speed up writing and reading of
gzipped files.
* FASTA with extensions ``.fasta``, ``.fa`` or ``.fna``
* FASTQ with extensions ``.fastq`` or ``.fq``
* Any of the above, but compressed as ``.gz``, ``.bz2`` or ``.xz``
The input file format is recognized from the file name extension. If the
extension was not recognized or when Cutadapt reads from standard input,
the contents are inspected instead.
Input and output file formats are recognized from the file name extension. You
can override the input format with the ``--format`` option.
The output file format is also recognized from the file name extension. If the
extension was not recognized or when Cutadapt writes to standard output, the
same format as the input is used for the output.
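The extension-based detection described above can be sketched in a few lines of Python. This is only an illustration of the rule, not Cutadapt's actual implementation, and the function name is made up::

```python
# Illustrative sketch of extension-based format detection
# (hypothetical helper, not Cutadapt's actual code).
def detect_format(path):
    name = path.lower()
    # A compression suffix is stripped before looking at the format suffix.
    for ext in ('.gz', '.bz2', '.xz'):
        if name.endswith(ext):
            name = name[:-len(ext)]
            break
    if name.endswith(('.fasta', '.fa', '.fna')):
        return 'fasta'
    if name.endswith(('.fastq', '.fq')):
        return 'fastq'
    return None  # unknown: fall back to inspecting the file contents

print(detect_format('reads.fastq.gz'))   # fastq
print(detect_format('genome.fa'))        # fasta
```

An unrecognized extension returns ``None``, corresponding to the fallback of inspecting the file contents.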
You can use the automatic format detection to convert from FASTQ to FASTA
(without doing any adapter trimming)::
You can use this to convert from FASTQ to FASTA (without doing any adapter
trimming)::
cutadapt -o output.fasta.gz input.fastq.gz
When you want to do the same (read FASTQ, write FASTA), but want to write to
standard output, you need to use ``--fasta`` instead because there is no
output file name::
cutadapt --fasta input.fastq.gz > out.fasta
.. _compressed-files:
......@@ -52,8 +62,8 @@ Compressed files
Cutadapt supports compressed input and output files. Whether an input file
needs to be decompressed or an output file needs to be compressed is detected
automatically by inspecting the file name: If it ends in ``.gz``, then gzip
compression is assumed. This is why the example given above works::
automatically by inspecting the file name: For example, if it ends in ``.gz``,
then gzip compression is assumed::
cutadapt -a AACCGGTT -o output.fastq.gz input.fastq.gz
......@@ -62,6 +72,10 @@ All of Cutadapt's options that expect a file name support this.
The supported compression formats are gzip (``.gz``), bzip2 (``.bz2``)
and xz (``.xz``).
The default compression level for gzip output is 6. Use option ``-Z`` to
change this to level 1. The files will need more space, but compression is
faster, which makes this a good choice for short-lived intermediate files.
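The size/speed tradeoff can be demonstrated with Python's standard ``gzip`` module. This is a standalone sketch of the effect only; Cutadapt itself delegates compression to ``xopen``/``pigz``::

```python
import gzip

# Mock, highly repetitive read-like data.
data = b"ACGTACGTTTGA" * 50000

fast = gzip.compress(data, compresslevel=1)   # what -Z selects
small = gzip.compress(data, compresslevel=6)  # the default

# Level 1 trades larger output for less CPU time.
assert len(small) <= len(fast)
print(len(fast), len(small))
```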
Standard input and output
-------------------------
......@@ -153,6 +167,33 @@ Some of these limitations will be lifted in the future, as time allows.
.. versionadded:: 1.18
``--cores=0`` for autodetection
Speed-up tricks
---------------
There are several tricks for reducing wall-clock time when using Cutadapt.
``-Z`` (equivalent to ``--compression-level=1``) can be used to limit the
amount of CPU time spent on compressing output files.
Alternatively, choosing file names that do not end in ``.gz``, ``.bz2`` or
``.xz`` ensures that no CPU time is spent on compression at all. On systems
with slow I/O, it can actually be faster to set a compression level higher
than 1.
Increasing the number of cores with ``-j`` will increase the number of reads
processed per minute at a near-linear rate.
It is also possible to use pipes in order to bypass the filesystem and feed
Cutadapt's output directly into an aligner such as BWA. The ``mkfifo`` command
allows you to create named pipes in bash.
.. code-block:: bash
mkfifo R1.fastq R2.fastq
cutadapt -a ${ADAPTER_R1} -A ${ADAPTER_R2} -o R1.fastq -p R2.fastq ${READ1} ${READ2} > cutadapt.report & \
bwa mem -o output.sam ${INDEX} R1.fastq R2.fastq
This command runs Cutadapt and BWA simultaneously, using Cutadapt's output as
BWA's input and capturing Cutadapt's report in ``cutadapt.report``.
Read processing stages
======================
......@@ -431,7 +472,7 @@ Input read Processed read
Linked adapters (combined 5' and 3' adapter)
--------------------------------------------
If your sequence of interest ist “framed” by a 5' and a 3' adapter, and you want
If your sequence of interest is “framed” by a 5' and a 3' adapter, and you want
to remove both adapters, then you may want to use a *linked adapter*. A linked
adapter combines a 5' and a 3' adapter. By default, the adapters are not anchored,
but in many cases, you should anchor the 5’ adapter by prefixing it with ``^``.
......@@ -1258,6 +1299,10 @@ This scheme, also called “non-redundant indexing”, uses 96 unique i5 indices
indices, which are only used in pairs, that is, the first i5 index is always used with the first i7
index and so on.
.. note::
If the adapters do not come in pairs, but all combinations are possible, see
:ref:`the section about combinatorial demultiplexing <combinatorial-demultiplexing>`.
An example::
cutadapt --pair-adapters -a AAAAA -a GGGG -A CCCCC -a TTTT -o out.1.fastq -p out.2.fastq in.1.fastq in.2.fastq
......@@ -1273,7 +1318,7 @@ The ``--pair-adapters`` option can be used also :ref:`when demultiplexing <demul
There is one limitation of the algorithm at the moment: The program looks for the best-matching R1 adapter
first and then checks whether the corresponding R2 adapter can be found. If not, the read pair
remains unchanged. However, it is in theory possible that a different R1 adapter that does not
fit as well has a partner that *can* be found. Some read pairs may therefore remain untrimmed.
fit as well would have a partner that *can* be found. Some read pairs may therefore remain untrimmed.
.. versionadded:: 2.1
......@@ -1495,6 +1540,45 @@ More advice on demultiplexing:
* If you want to demultiplex, but keep the barcode in the reads, use the option ``--action=none``.
.. _combinatorial-demultiplexing:
Demultiplexing paired-end reads with combinatorial dual indexes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Illumina’s combinatorial dual indexing strategy uses a set of indexed adapters on R1 and another one
on R2. Unlike
`unique dual indexes (UDI) <https://support.illumina.com/bulletins/2018/08/understanding-unique-dual-indexes--udi--and-associated-library-p.html>`_,
all combinations of indexes are possible.
For demultiplexing this type of data ("combinatorial demultiplexing"), it is necessary to write each
read pair to an output file depending on the adapters found on R1 *and* R2.
Doing this with Cutadapt is similar to normal demultiplexing as described above, but you need
to use ``{name1}`` and ``{name2}`` in both output file name templates. For example::
cutadapt \
-e 0.15 --no-indels \
-g file:barcodes_fwd.fasta \
-G file:barcodes_rev.fasta \
-o trimmed-{name1}-{name2}.1.fastq.gz -p trimmed-{name1}-{name2}.2.fastq.gz \
input.1.fastq.gz input.2.fastq.gz
The ``{name1}`` will be replaced with the name of the best-matching R1 adapter and ``{name2}`` will
be replaced with the name of the best-matching R2 adapter.
If there was no match of an R1 adapter, ``{name1}`` is set to "unknown", and if there is no match of
an R2 adapter, ``{name2}`` is set to "unknown". To discard read pairs for which one or both adapters
could not be found, use ``--discard-untrimmed``.
The ``--untrimmed-output`` and ``--untrimmed-paired-output`` options cannot be used.
Read the :ref:`demultiplexing <demultiplexing>` section for how to choose the error rate etc.
Also, the tips below about how to speed up demultiplexing apply even with combinatorial
demultiplexing.
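The file-name expansion described above amounts to a simple substitution. The helper below is purely illustrative (it is not part of Cutadapt's API) but shows how a template and the two adapter names combine, including the "unknown" fallback::

```python
def output_path(template, name1, name2):
    """Expand a combinatorial demultiplexing template (illustrative only)."""
    # Adapters without a match fall back to the name "unknown".
    return (template
            .replace('{name1}', name1 if name1 else 'unknown')
            .replace('{name2}', name2 if name2 else 'unknown'))

print(output_path('trimmed-{name1}-{name2}.1.fastq.gz', 'bc01', 'bc07'))
# trimmed-bc01-bc07.1.fastq.gz
print(output_path('trimmed-{name1}-{name2}.1.fastq.gz', 'bc01', None))
# trimmed-bc01-unknown.1.fastq.gz
```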
.. versionadded:: 2.4
.. _speed-up-demultiplexing:
Speeding up demultiplexing
......
......@@ -102,7 +102,7 @@ setup(
package_dir={'': 'src'},
packages=find_packages('src'),
entry_points={'console_scripts': ['cutadapt = cutadapt.__main__:main']},
install_requires=['dnaio>=0.3', 'xopen>=0.5.0'],
install_requires=['dnaio>=0.3', 'xopen>=0.7.3'],
extras_require={
'dev': ['Cython', 'pytest', 'pytest-timeout', 'sphinx', 'sphinx_issues'],
},
......
......@@ -107,6 +107,10 @@ class CommandLineError(Exception):
pass
# Custom log level, see setup_logging
REPORT = 25
class NiceFormatter(logging.Formatter):
"""
Do not prefix "INFO:" to info-level log messages (but do it for all other
......@@ -115,24 +119,31 @@ class NiceFormatter(logging.Formatter):
Based on http://stackoverflow.com/a/9218261/715090 .
"""
def format(self, record):
if record.levelno != logging.INFO:
if record.levelno not in (logging.INFO, REPORT):
record.msg = '{}: {}'.format(record.levelname, record.msg)
return super().format(record)
def setup_logging(stdout=False, quiet=False, debug=False):
def setup_logging(stdout=False, minimal=False, quiet=False, debug=False):
"""
Attach handler to the global logger object
"""
# For --report=minimal, we need this custom log level because we want to
# print nothing except the minimal report and therefore cannot use the
# INFO level (and the ERROR level would give us an 'ERROR:' prefix).
logging.addLevelName(REPORT, 'REPORT')
# Due to backwards compatibility, logging output is sent to standard output
# instead of standard error if the -o option is used.
stream_handler = logging.StreamHandler(sys.stdout if stdout else sys.stderr)
stream_handler.setFormatter(NiceFormatter())
# debug overrides quiet
# debug overrides quiet overrides minimal
if debug:
level = logging.DEBUG
elif quiet:
level = logging.ERROR
elif minimal:
level = REPORT
else:
level = logging.INFO
stream_handler.setLevel(level)
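The minimal-report mechanism in this hunk can be exercised on its own. The sketch below reproduces the pattern (a custom ``REPORT`` level between ``INFO`` and ``WARNING``, with a formatter that suppresses the level-name prefix for it) using a throwaway logger rather than Cutadapt's module-level one:

```python
import io
import logging

REPORT = 25  # between INFO (20) and WARNING (30), mirroring the patch
logging.addLevelName(REPORT, 'REPORT')

class NiceFormatter(logging.Formatter):
    """Prefix the level name to every message except INFO and REPORT."""
    def format(self, record):
        if record.levelno not in (logging.INFO, REPORT):
            record.msg = '{}: {}'.format(record.levelname, record.msg)
        return super().format(record)

buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(NiceFormatter())
handler.setLevel(REPORT)  # --report=minimal: only REPORT and above get through

logger = logging.getLogger('demo')
logger.addHandler(handler)
logger.setLevel(REPORT)
logger.propagate = False

logger.info('suppressed')              # below REPORT, not emitted
logger.log(REPORT, 'minimal report')   # emitted without a prefix
logger.error('problem')                # emitted with an "ERROR:" prefix

print(buf.getvalue())
```

This is why the report itself is logged with `logger.log(REPORT, ...)` further down: at the `minimal` level, everything below `REPORT` is silenced while the report still prints cleanly.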
......@@ -159,6 +170,9 @@ def get_argument_parser():
# Buffer size for the reader process when running in parallel
group.add_argument("--buffer-size", type=int, default=4000000,
help=SUPPRESS)
# Compression level for gzipped output files. Not exposed since we have -Z
group.add_argument("--compression-level", default=6,
help=SUPPRESS)
# Deprecated: The input format is always auto-detected
group.add_argument("-f", "--format", help=SUPPRESS)
......@@ -283,9 +297,13 @@ def get_argument_parser():
help="Which type of report to print: 'full' or 'minimal'. Default: full")
group.add_argument("-o", "--output", metavar="FILE",
help="Write trimmed reads to FILE. FASTQ or FASTA format is chosen "
"depending on input. The summary report is sent to standard output. "
"Use '{name}' in FILE to demultiplex reads into multiple "
"files. Default: write to standard output")
"depending on input. Summary report is sent to standard output. "
"Use '{name}' for demultiplexing (see docs). "
"Default: write to standard output")
group.add_argument("--fasta", default=False, action='store_true',
help="Output FASTA to standard output even on FASTQ input.")
group.add_argument("-Z", action="store_const", const=1, dest="compression_level",
help="Use compression level 1 for gzipped output files (faster, but uses more space)")
group.add_argument("--info-file", metavar="FILE",
help="Write information about each read and its adapter matches into FILE. "
"See the documentation for the file format.")
......@@ -402,20 +420,21 @@ def open_output_files(args, default_outfile, interleaved):
Return an OutputFiles instance. If demultiplex is True, the untrimmed, untrimmed2, out and out2
attributes are not opened files, but paths (out and out2 with the '{name}' template).
"""
compression_level = args.compression_level
rest_file = info_file = wildcard = None
if args.rest_file is not None:
rest_file = xopen(args.rest_file, 'w')
rest_file = xopen(args.rest_file, 'w', compresslevel=compression_level)
if args.info_file is not None:
info_file = xopen(args.info_file, 'w')
info_file = xopen(args.info_file, 'w', compresslevel=compression_level)
if args.wildcard_file is not None:
wildcard = xopen(args.wildcard_file, 'w')
wildcard = xopen(args.wildcard_file, 'w', compresslevel=compression_level)
def open2(path1, path2):
file1 = file2 = None
if path1 is not None:
file1 = xopen(path1, 'wb')
file1 = xopen(path1, 'wb', compresslevel=compression_level)
if path2 is not None:
file2 = xopen(path2, 'wb')
file2 = xopen(path2, 'wb', compresslevel=compression_level)
return file1, file2
too_short = too_short2 = None
......@@ -427,18 +446,30 @@ def open_output_files(args, default_outfile, interleaved):
too_long, too_long2 = open2(args.too_long_output, args.too_long_paired_output)
if int(args.discard_trimmed) + int(args.discard_untrimmed) + int(
args.untrimmed_output is not None) > 1:
args.untrimmed_output is not None) > 1:
raise CommandLineError("Only one of the --discard-trimmed, --discard-untrimmed "
"and --untrimmed-output options can be used at the same time.")
demultiplex = args.output is not None and '{name}' in args.output
if args.paired_output is not None and (demultiplex != ('{name}' in args.paired_output)):
raise CommandLineError('When demultiplexing paired-end data, "{name}" must appear in '
'both output file names (-o and -p)')
demultiplex_combinatorial = (
args.output is not None
and args.paired_output is not None
and '{name1}' in args.output
and '{name2}' in args.output
and '{name1}' in args.paired_output
and '{name2}' in args.paired_output
)
if (demultiplex or demultiplex_combinatorial) and args.discard_trimmed:
raise CommandLineError("Do not use --discard-trimmed when demultiplexing.")
if demultiplex:
if args.discard_trimmed:
raise CommandLineError("Do not use --discard-trimmed when demultiplexing.")
if demultiplex_combinatorial:
raise CommandLineError("You cannot combine {name} with {name1} and {name2}")
out = args.output
untrimmed = args.output.replace('{name}', 'unknown')
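The detection logic added in this hunk can be isolated as a predicate. The standalone sketch below mirrors the checks (the function name is made up, and the error type is simplified to ``ValueError``):

```python
def detect_demultiplex_mode(output, paired_output):
    """Classify the -o/-p templates; a sketch mirroring the patch's checks."""
    demultiplex = output is not None and '{name}' in output
    combinatorial = (
        output is not None
        and paired_output is not None
        and '{name1}' in output and '{name2}' in output
        and '{name1}' in paired_output and '{name2}' in paired_output
    )
    if demultiplex and combinatorial:
        raise ValueError("You cannot combine {name} with {name1} and {name2}")
    return demultiplex, combinatorial

print(detect_demultiplex_mode('out-{name}.fq', None))
# (True, False)
print(detect_demultiplex_mode('t-{name1}-{name2}.1.fq', 't-{name1}-{name2}.2.fq'))
# (False, True)
```

Note that a template containing ``{name1}`` does not also match the plain ``{name}`` check, so the two modes are cleanly distinguished by substring tests alone.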
......@@ -457,14 +488,24 @@ def open_output_files(args, default_outfile, interleaved):
else:
untrimmed2 = out2 = None
assert out is not None and '{name}' in out and (out2 is None or '{name}' in out2)
elif demultiplex_combinatorial:
out = args.output
out2 = args.paired_output
if args.untrimmed_output or args.untrimmed_paired_output:
raise CommandLineError("Combinatorial demultiplexing (with {name1} and {name2})"
" cannot be combined with --untrimmed-output or --untrimmed-paired-output")
if args.discard_untrimmed:
untrimmed = untrimmed2 = None
else:
untrimmed = untrimmed2 = 'unknown'
else:
untrimmed, untrimmed2 = open2(args.untrimmed_output, args.untrimmed_paired_output)
out, out2 = open2(args.output, args.paired_output)
if out is None:
out = default_outfile
if demultiplex:
assert out is not None and '{name}' in out and (out2 is None or '{name}' in out2)
return OutputFiles(
rest=rest_file,
info=info_file,
......@@ -477,8 +518,9 @@ def open_output_files(args, default_outfile, interleaved):
untrimmed2=untrimmed2,
out=out,
out2=out2,
demultiplex=demultiplex,
demultiplex=demultiplex or demultiplex_combinatorial,
interleaved=interleaved,
force_fasta=args.fasta,
)
......@@ -739,7 +781,7 @@ def main(cmdlineargs=None, default_outfile=sys.stdout.buffer):
# this function is being called externally such as from unit tests)
if not logging.root.handlers:
setup_logging(stdout=log_to_stdout,
quiet=args.quiet or args.report == 'minimal', debug=args.debug)
quiet=args.quiet, minimal=args.report == 'minimal', debug=args.debug)
if args.profile:
import cProfile
profiler = cProfile.Profile()
......@@ -754,6 +796,7 @@ def main(cmdlineargs=None, default_outfile=sys.stdout.buffer):
"--colorspace, -c, -d, --double-encode, -t, --trim-primer, "
"--strip-f3, --maq, --bwa, --no-zero-cap. "
"Use Cutadapt 1.18 or earlier to work with colorspace data.")
paired = determine_paired_mode(args)
assert paired in (False, True)
......@@ -786,7 +829,7 @@ def main(cmdlineargs=None, default_outfile=sys.stdout.buffer):
'--too-long-paired-output, --format\n'
'Also, demultiplexing is not supported.\n'
'Omit --cores/-j to continue.')
sys.exit(1)
return # avoid IDE warnings below
else:
runner_class = SerialPipelineRunner
runner_kwargs = dict()
......@@ -796,6 +839,7 @@ def main(cmdlineargs=None, default_outfile=sys.stdout.buffer):
runner = runner_class(pipeline, infiles, outfiles, **runner_kwargs)
except (dnaio.UnknownFileFormat, IOError) as e:
parser.error(e)
return # avoid IDE warnings below
logger.info("Processing reads on %d core%s in %s mode ...",
cores, 's' if cores > 1 else '',
......@@ -818,7 +862,7 @@ def main(cmdlineargs=None, default_outfile=sys.stdout.buffer):
report = minimal_report
else:
report = full_report
logger.info('%s', report(stats, elapsed, args.gc_content / 100))
logger.log(REPORT, '%s', report(stats, elapsed, args.gc_content / 100))
if args.profile:
import pstats
profiler.disable()
......
/* Generated by Cython 0.29.7 */
/* Generated by Cython 0.29.12 */
 
/* BEGIN: Cython Metadata
{
......@@ -20,8 +20,8 @@ END: Cython Metadata */
#elif PY_VERSION_HEX < 0x02060000 || (0x03000000 <= PY_VERSION_HEX && PY_VERSION_HEX < 0x03030000)
#error Cython requires Python 2.6+ or Python 3.3+.
#else
#define CYTHON_ABI "0_29_7"
#define CYTHON_HEX_VERSION 0x001D07F0
#define CYTHON_ABI "0_29_12"
#define CYTHON_HEX_VERSION 0x001D0CF0
#define CYTHON_FUTURE_DIVISION 1
#include <stddef.h>
#ifndef offsetof
......@@ -323,8 +323,13 @@ END: Cython Metadata */
#define __Pyx_DefaultClassType PyClass_Type
#else
#define __Pyx_BUILTIN_MODULE_NAME "builtins"
#if PY_VERSION_HEX >= 0x030800A4 && PY_VERSION_HEX < 0x030800B2
#define __Pyx_PyCode_New(a, k, l, s, f, code, c, n, v, fv, cell, fn, name, fline, lnos)\
PyCode_New(a, 0, k, l, s, f, code, c, n, v, fv, cell, fn, name, fline, lnos)
#else
#define __Pyx_PyCode_New(a, k, l, s, f, code, c, n, v, fv, cell, fn, name, fline, lnos)\
PyCode_New(a, k, l, s, f, code, c, n, v, fv, cell, fn, name, fline, lnos)
#endif
#define __Pyx_DefaultClassType PyType_Type
#endif
#ifndef Py_TPFLAGS_CHECKTYPES
......@@ -1001,7 +1006,7 @@ static CYTHON_INLINE int __Pyx_IterFinish(void);
#define __Pyx_PyFunction_FastCall(func, args, nargs)\
__Pyx_PyFunction_FastCallDict((func), (args), (nargs), NULL)
#if 1 || PY_VERSION_HEX < 0x030600B1
static PyObject *__Pyx_PyFunction_FastCallDict(PyObject *func, PyObject **args, int nargs, PyObject *kwargs);
static PyObject *__Pyx_PyFunction_FastCallDict(PyObject *func, PyObject **args, Py_ssize_t nargs, PyObject *kwargs);
#else
#define __Pyx_PyFunction_FastCallDict(func, args, nargs, kwargs) _PyFunction_FastCallDict(func, args, nargs, kwargs)
#endif
......@@ -7366,6 +7371,9 @@ static PyTypeObject __pyx_type_8cutadapt_6_align_Aligner = {
#if PY_VERSION_HEX >= 0x030400a1
0, /*tp_finalize*/
#endif
#if PY_VERSION_HEX >= 0x030800b1
0, /*tp_vectorcall*/
#endif
};
 
static PyObject *__pyx_tp_new_8cutadapt_6_align_PrefixComparer(PyTypeObject *t, CYTHON_UNUSED PyObject *a, CYTHON_UNUSED PyObject *k) {
......@@ -7465,6 +7473,9 @@ static PyTypeObject __pyx_type_8cutadapt_6_align_PrefixComparer = {
#if PY_VERSION_HEX >= 0x030400a1
0, /*tp_finalize*/
#endif
#if PY_VERSION_HEX >= 0x030800b1
0, /*tp_vectorcall*/
#endif
};
 
static PyObject *__pyx_tp_new_8cutadapt_6_align_SuffixComparer(PyTypeObject *t, PyObject *a, PyObject *k) {
......@@ -7540,6 +7551,9 @@ static PyTypeObject __pyx_type_8cutadapt_6_align_SuffixComparer = {
#if PY_VERSION_HEX >= 0x030400a1
0, /*tp_finalize*/
#endif
#if PY_VERSION_HEX >= 0x030800b1
0, /*tp_vectorcall*/
#endif
};
 
static struct __pyx_obj_8cutadapt_6_align___pyx_scope_struct____str__ *__pyx_freelist_8cutadapt_6_align___pyx_scope_struct____str__[8];
......@@ -7651,6 +7665,9 @@ static PyTypeObject __pyx_type_8cutadapt_6_align___pyx_scope_struct____str__ = {
#if PY_VERSION_HEX >= 0x030400a1
0, /*tp_finalize*/
#endif
#if PY_VERSION_HEX >= 0x030800b1
0, /*tp_vectorcall*/
#endif
};
 
static struct __pyx_obj_8cutadapt_6_align___pyx_scope_struct_1_genexpr *__pyx_freelist_8cutadapt_6_align___pyx_scope_struct_1_genexpr[8];
......@@ -7750,6 +7767,9 @@ static PyTypeObject __pyx_type_8cutadapt_6_align___pyx_scope_struct_1_genexpr =
#if PY_VERSION_HEX >= 0x030400a1
0, /*tp_finalize*/
#endif
#if PY_VERSION_HEX >= 0x030800b1
0, /*tp_vectorcall*/
#endif
};
 
static struct __pyx_obj_8cutadapt_6_align___pyx_scope_struct_2_genexpr *__pyx_freelist_8cutadapt_6_align___pyx_scope_struct_2_genexpr[8];
......@@ -7849,6 +7869,9 @@ static PyTypeObject __pyx_type_8cutadapt_6_align___pyx_scope_struct_2_genexpr =
#if PY_VERSION_HEX >= 0x030400a1
0, /*tp_finalize*/
#endif
#if PY_VERSION_HEX >= 0x030800b1
0, /*tp_vectorcall*/
#endif
};
 
static PyMethodDef __pyx_methods[] = {
......@@ -8164,7 +8187,9 @@ static int __Pyx_modinit_type_init_code(void) {
__Pyx_RefNannySetupContext("__Pyx_modinit_type_init_code", 0);
/*--- Type init code ---*/
if (PyType_Ready(&__pyx_type_8cutadapt_6_align_Aligner) < 0) __PYX_ERR(0, 113, __pyx_L1_error)
#if PY_VERSION_HEX < 0x030800B1
__pyx_type_8cutadapt_6_align_Aligner.tp_print = 0;
#endif
if ((CYTHON_USE_TYPE_SLOTS && CYTHON_USE_PYTYPE_LOOKUP) && likely(!__pyx_type_8cutadapt_6_align_Aligner.tp_dictoffset && __pyx_type_8cutadapt_6_align_Aligner.tp_getattro == PyObject_GenericGetAttr)) {
__pyx_type_8cutadapt_6_align_Aligner.tp_getattro = __Pyx_PyObject_GenericGetAttr;
}
......@@ -8172,7 +8197,9 @@ static int __Pyx_modinit_type_init_code(void) {
if (__Pyx_setup_reduce((PyObject*)&__pyx_type_8cutadapt_6_align_Aligner) < 0) __PYX_ERR(0, 113, __pyx_L1_error)
__pyx_ptype_8cutadapt_6_align_Aligner = &__pyx_type_8cutadapt_6_align_Aligner;
if (PyType_Ready(&__pyx_type_8cutadapt_6_align_PrefixComparer) < 0) __PYX_ERR(0, 534, __pyx_L1_error)
#if PY_VERSION_HEX < 0x030800B1
__pyx_type_8cutadapt_6_align_PrefixComparer.tp_print = 0;
#endif
if ((CYTHON_USE_TYPE_SLOTS && CYTHON_USE_PYTYPE_LOOKUP) && likely(!__pyx_type_8cutadapt_6_align_PrefixComparer.tp_dictoffset && __pyx_type_8cutadapt_6_align_PrefixComparer.tp_getattro == PyObject_GenericGetAttr)) {
__pyx_type_8cutadapt_6_align_PrefixComparer.tp_getattro = __Pyx_PyObject_GenericGetAttr;
}
......@@ -8181,7 +8208,9 @@ static int __Pyx_modinit_type_init_code(void) {
__pyx_ptype_8cutadapt_6_align_PrefixComparer = &__pyx_type_8cutadapt_6_align_PrefixComparer;
__pyx_type_8cutadapt_6_align_SuffixComparer.tp_base = __pyx_ptype_8cutadapt_6_align_PrefixComparer;
if (PyType_Ready(&__pyx_type_8cutadapt_6_align_SuffixComparer) < 0) __PYX_ERR(0, 634, __pyx_L1_error)
#if PY_VERSION_HEX < 0x030800B1
__pyx_type_8cutadapt_6_align_SuffixComparer.tp_print = 0;
#endif
if ((CYTHON_USE_TYPE_SLOTS && CYTHON_USE_PYTYPE_LOOKUP) && likely(!__pyx_type_8cutadapt_6_align_SuffixComparer.tp_dictoffset && __pyx_type_8cutadapt_6_align_SuffixComparer.tp_getattro == PyObject_GenericGetAttr)) {
__pyx_type_8cutadapt_6_align_SuffixComparer.tp_getattro = __Pyx_PyObject_GenericGetAttr;
}
......@@ -8189,19 +8218,25 @@ static int __Pyx_modinit_type_init_code(void) {
if (__Pyx_setup_reduce((PyObject*)&__pyx_type_8cutadapt_6_align_SuffixComparer) < 0) __PYX_ERR(0, 634, __pyx_L1_error)
__pyx_ptype_8cutadapt_6_align_SuffixComparer = &__pyx_type_8cutadapt_6_align_SuffixComparer;
if (PyType_Ready(&__pyx_type_8cutadapt_6_align___pyx_scope_struct____str__) < 0) __PYX_ERR(0, 102, __pyx_L1_error)
#if PY_VERSION_HEX < 0x030800B1
__pyx_type_8cutadapt_6_align___pyx_scope_struct____str__.tp_print = 0;
#endif
if ((CYTHON_USE_TYPE_SLOTS && CYTHON_USE_PYTYPE_LOOKUP) && likely(!__pyx_type_8cutadapt_6_align___pyx_scope_struct____str__.tp_dictoffset && __pyx_type_8cutadapt_6_align___pyx_scope_struct____str__.tp_getattro == PyObject_GenericGetAttr)) {
__pyx_type_8cutadapt_6_align___pyx_scope_struct____str__.tp_getattro = __Pyx_PyObject_GenericGetAttrNoDict;
}
__pyx_ptype_8cutadapt_6_align___pyx_scope_struct____str__ = &__pyx_type_8cutadapt_6_align___pyx_scope_struct____str__;
if (PyType_Ready(&__pyx_type_8cutadapt_6_align___pyx_scope_struct_1_genexpr) < 0) __PYX_ERR(0, 106, __pyx_L1_error)
#if PY_VERSION_HEX < 0x030800B1
__pyx_type_8cutadapt_6_align___pyx_scope_struct_1_genexpr.tp_print = 0;
#endif
if ((CYTHON_USE_TYPE_SLOTS && CYTHON_USE_PYTYPE_LOOKUP) && likely(!__pyx_type_8cutadapt_6_align___pyx_scope_struct_1_genexpr.tp_dictoffset && __pyx_type_8cutadapt_6_align___pyx_scope_struct_1_genexpr.tp_getattro == PyObject_GenericGetAttr)) {
__pyx_type_8cutadapt_6_align___pyx_scope_struct_1_genexpr.tp_getattro = __Pyx_PyObject_GenericGetAttrNoDict;
}
__pyx_ptype_8cutadapt_6_align___pyx_scope_struct_1_genexpr = &__pyx_type_8cutadapt_6_align___pyx_scope_struct_1_genexpr;
if (PyType_Ready(&__pyx_type_8cutadapt_6_align___pyx_scope_struct_2_genexpr) < 0) __PYX_ERR(0, 108, __pyx_L1_error)
#if PY_VERSION_HEX < 0x030800B1
__pyx_type_8cutadapt_6_align___pyx_scope_struct_2_genexpr.tp_print = 0;
#endif
if ((CYTHON_USE_TYPE_SLOTS && CYTHON_USE_PYTYPE_LOOKUP) && likely(!__pyx_type_8cutadapt_6_align___pyx_scope_struct_2_genexpr.tp_dictoffset && __pyx_type_8cutadapt_6_align___pyx_scope_struct_2_genexpr.tp_getattro == PyObject_GenericGetAttr)) {
__pyx_type_8cutadapt_6_align___pyx_scope_struct_2_genexpr.tp_getattro = __Pyx_PyObject_GenericGetAttrNoDict;
}
@@ -8419,9 +8454,9 @@ if (!__Pyx_RefNanny) {
   }
   #endif
   /*--- Builtin init code ---*/
-  if (__Pyx_InitCachedBuiltins() < 0) __PYX_ERR(0, 1, __pyx_L1_error)
+  if (__Pyx_InitCachedBuiltins() < 0) goto __pyx_L1_error;
   /*--- Constants init code ---*/
-  if (__Pyx_InitCachedConstants() < 0) __PYX_ERR(0, 1, __pyx_L1_error)
+  if (__Pyx_InitCachedConstants() < 0) goto __pyx_L1_error;
   /*--- Global type/function init code ---*/
   (void)__Pyx_modinit_global_init_code();
   (void)__Pyx_modinit_variable_export_code();
@@ -8663,7 +8698,7 @@ static PyObject* __Pyx_PyFunction_FastCallNoKw(PyCodeObject *co, PyObject **args
     return result;
 }
 #if 1 || PY_VERSION_HEX < 0x030600B1
-static PyObject *__Pyx_PyFunction_FastCallDict(PyObject *func, PyObject **args, int nargs, PyObject *kwargs) {
+static PyObject *__Pyx_PyFunction_FastCallDict(PyObject *func, PyObject **args, Py_ssize_t nargs, PyObject *kwargs) {
     PyCodeObject *co = (PyCodeObject *)PyFunction_GET_CODE(func);
     PyObject *globals = PyFunction_GET_GLOBALS(func);
     PyObject *argdefs = PyFunction_GET_DEFAULTS(func);
@@ -8734,12 +8769,12 @@ static PyObject *__Pyx_PyFunction_FastCallDict(PyObject *func, PyObject **args,
     }
 #if PY_MAJOR_VERSION >= 3
     result = PyEval_EvalCodeEx((PyObject*)co, globals, (PyObject *)NULL,
-                               args, nargs,
+                               args, (int)nargs,
                                k, (int)nk,
                                d, (int)nd, kwdefs, closure);
 #else
     result = PyEval_EvalCodeEx(co, globals, (PyObject *)NULL,
-                               args, nargs,
+                               args, (int)nargs,
                                k, (int)nk,
                                d, (int)nd, closure);
 #endif
@@ -10986,6 +11021,9 @@ static PyTypeObject __pyx_CyFunctionType_type = {
 #if PY_VERSION_HEX >= 0x030400a1
     0,
 #endif
+#if PY_VERSION_HEX >= 0x030800b1
+    0,
+#endif
 };
 static int __pyx_CyFunction_init(void) {
     __pyx_CyFunctionType = __Pyx_FetchCommonType(&__pyx_CyFunctionType_type);
@@ -13107,6 +13145,9 @@ static PyTypeObject __pyx_GeneratorType_type = {
 #elif PY_VERSION_HEX >= 0x030400a1
     0,
 #endif
+#if PY_VERSION_HEX >= 0x030800b1
+    0,
+#endif
 };
 static int __pyx_Generator_init(void) {
     __pyx_GeneratorType_type.tp_getattro = __Pyx_PyObject_GenericGetAttrNoDict;
......
 # coding: utf-8
 # file generated by setuptools_scm
 # don't change, don't track in version control
-version = '2.3'
+version = '2.4'
@@ -295,7 +295,72 @@ class PairedDemultiplexer:
     def close(self):
         self._demultiplexer1.close()
-        self._demultiplexer1.close()
+        self._demultiplexer2.close()
+
+
+class CombinatorialDemultiplexer:
+    """
+    Demultiplex read pairs depending on which adapter matches, taking into account
+    the matches on both R1 and R2.
+    """
+    def __init__(self, path_template, path_paired_template, untrimmed_name, qualities):
+        """
+        path_template must contain the strings '{name1}' and '{name2}', which are replaced
+        with the names of the adapters found on R1 and R2, respectively, to form the final
+        output path. For reads without an adapter match, name1 and/or name2 are set to the
+        string given by untrimmed_name. Alternatively, untrimmed_name can be set to None;
+        in that case, read pairs in which at least one read has no adapter match are
+        discarded.
+        """
+        assert '{name1}' in path_template and '{name2}' in path_template
+        assert '{name1}' in path_paired_template and '{name2}' in path_paired_template
+        self.template = path_template
+        self.paired_template = path_paired_template
+        self.untrimmed_name = untrimmed_name
+        self.writers = dict()
+        self.written = 0
+        self.written_bp = [0, 0]
+        self.qualities = qualities
+
+    @staticmethod
+    def _make_path(template, name1, name2):
+        return template.replace('{name1}', name1).replace('{name2}', name2)
+
+    def __call__(self, read1, read2, matches1, matches2):
+        """
+        Write the read pair to the proper output files according to the most recent
+        matches on both R1 and R2.
+        """
+        assert read2 is not None
+        name1 = matches1[-1].adapter.name if matches1 else None
+        name2 = matches2[-1].adapter.name if matches2 else None
+        key = (name1, name2)
+        if key not in self.writers:
+            if name1 is None:
+                name1 = self.untrimmed_name
+            if name2 is None:
+                name2 = self.untrimmed_name
+            if name1 is None or name2 is None:
+                return DISCARD
+            path1 = self._make_path(self.template, name1, name2)
+            path2 = self._make_path(self.paired_template, name1, name2)
+            self.writers[key] = (
+                dnaio.open(path1, mode='w', qualities=self.qualities),
+                dnaio.open(path2, mode='w', qualities=self.qualities),
+            )
+        writer1, writer2 = self.writers[key]
+        self.written += 1
+        self.written_bp[0] += len(read1)
+        self.written_bp[1] += len(read2)
+        writer1.write(read1)
+        writer2.write(read2)
+        return DISCARD
+
+    def close(self):
+        for w1, w2 in self.writers.values():
+            w1.close()
+            w2.close()
+
+
 class RestFileWriter:
......
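The template substitution performed by `CombinatorialDemultiplexer` above can be sketched in isolation. The template strings and adapter names below are hypothetical, and plain path strings stand in for the `dnaio` writers:

```python
def make_path(template, name1, name2):
    # Same substitution CombinatorialDemultiplexer._make_path performs:
    # each placeholder is replaced by the matching adapter's name.
    return template.replace('{name1}', name1).replace('{name2}', name2)

# Hypothetical output templates, as would be passed for R1 and R2
template_r1 = 'demux-{name1}-{name2}.1.fastq'
template_r2 = 'demux-{name1}-{name2}.2.fastq'

# An R1 adapter named "fwd_A" combined with an R2 adapter named "rev_B"
print(make_path(template_r1, 'fwd_A', 'rev_B'))  # demux-fwd_A-rev_B.1.fastq
print(make_path(template_r2, 'fwd_A', 'rev_B'))  # demux-fwd_A-rev_B.2.fastq
```

Writers are cached per `(name1, name2)` key, so each adapter combination opens its two output files only once.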
@@ -18,7 +18,7 @@ from .report import Statistics
 from .filters import (Redirector, PairedRedirector, NoFilter, PairedNoFilter, InfoFileWriter,
     RestFileWriter, WildcardFileWriter, TooShortReadFilter, TooLongReadFilter, NContentFilter,
     CasavaFilter, DiscardTrimmedFilter, DiscardUntrimmedFilter, Demultiplexer,
-    PairedDemultiplexer)
+    PairedDemultiplexer, CombinatorialDemultiplexer)

 logger = logging.getLogger()
@@ -33,8 +33,8 @@ class InputFiles:
 class OutputFiles:
     """
     The attributes are open file-like objects except when demultiplex is True. In that case,
-    untrimmed, untrimmed2 are file names, and out and out2 are file name templates
-    containing '{name}'.
+    untrimmed, untrimmed2, out and out2 are file names or templates
+    as required by the demultiplexer in use ('{name}' etc.).
     If interleaved is True, then out is written interleaved.
     Files may also be None.
     """
@@ -54,6 +54,7 @@ class OutputFiles:
         wildcard=None,
         demultiplex=False,
         interleaved=False,
+        force_fasta=None,
     ):
         self.out = out
         self.out2 = out2
@@ -68,6 +69,7 @@ class OutputFiles:
         self.wildcard = wildcard
         self.demultiplex = demultiplex
         self.interleaved = interleaved
+        self.force_fasta = force_fasta

     def __iter__(self):
         yield self.out
@@ -109,13 +111,15 @@ class Pipeline(ABC):
         self._reader = dnaio.open(infiles.file1, file2=infiles.file2,
             interleaved=infiles.interleaved, mode='r')

-    def _open_writer(self, file, file2, **kwargs):
+    def _open_writer(self, file, file2, force_fasta=None, **kwargs):
         # TODO backwards-incompatible change (?) would be to use outfiles.interleaved
         # for all outputs
+        if force_fasta:
+            kwargs['fileformat'] = 'fasta'
         return dnaio.open(file, file2=file2, mode='w', qualities=self.uses_qualities,
             **kwargs)

-    def set_output(self, outfiles):
+    def set_output(self, outfiles: OutputFiles):
         self._filters = []
         self._outfiles = outfiles
         filter_wrapper = self._filter_wrapper()
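The force_fasta handling in `_open_writer` amounts to overriding the output file format before the writer is opened. A minimal sketch of that keyword handling (the function name here is illustrative; the real code forwards the resulting kwargs to `dnaio.open`):

```python
def writer_kwargs(force_fasta, **kwargs):
    # When FASTA output is forced, override the format choice that
    # would otherwise produce FASTQ for input that has qualities.
    if force_fasta:
        kwargs['fileformat'] = 'fasta'
    return kwargs

print(writer_kwargs(True))   # {'fileformat': 'fasta'}
print(writer_kwargs(False))  # {}
```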
@@ -188,7 +192,7 @@ class Pipeline(ABC):
     @abstractmethod
     # TODO progress shouldn’t be a parameter
-    def process_reads(self, progress: Progress=None):
+    def process_reads(self, progress: Progress = None):
         pass

     @abstractmethod
@@ -247,7 +251,7 @@ class SingleEndPipeline(Pipeline):
         return Redirector

     def _final_filter(self, outfiles):
-        writer = self._open_writer(outfiles.out, outfiles.out2)
+        writer = self._open_writer(outfiles.out, outfiles.out2, force_fasta=outfiles.force_fasta)
         return NoFilter(writer)

     def _create_demultiplexer(self, outfiles):
@@ -344,11 +348,17 @@ class PairedEndPipeline(Pipeline):
         return self._filter_wrapper()

     def _final_filter(self, outfiles):
-        writer = self._open_writer(outfiles.out, outfiles.out2, interleaved=outfiles.interleaved)
+        writer = self._open_writer(
+            outfiles.out, outfiles.out2, interleaved=outfiles.interleaved,
+            force_fasta=outfiles.force_fasta)
         return PairedNoFilter(writer)

     def _create_demultiplexer(self, outfiles):
-        return PairedDemultiplexer(outfiles.out, outfiles.out2,
+        if '{name1}' in outfiles.out and '{name2}' in outfiles.out:
+            return CombinatorialDemultiplexer(outfiles.out, outfiles.out2,
+                outfiles.untrimmed, qualities=self.uses_qualities)
+        else:
+            return PairedDemultiplexer(outfiles.out, outfiles.out2,
                 outfiles.untrimmed, outfiles.untrimmed2, qualities=self.uses_qualities)

     @property
@@ -478,7 +488,7 @@ class WorkerProcess(Process):
             output2 = None
         infiles = InputFiles(input, input2, interleaved=self._interleaved_input)
-        outfiles = OutputFiles(out=output, out2=output2, interleaved=self._orig_outfiles.interleaved)
+        outfiles = OutputFiles(out=output, out2=output2, interleaved=self._orig_outfiles.interleaved, force_fasta=self._orig_outfiles.force_fasta)
         self._pipeline.set_input(infiles)
         self._pipeline.set_output(outfiles)
         (n, bp1, bp2) = self._pipeline.process_reads()
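The dispatch in `_create_demultiplexer` above can be summarized as a small standalone sketch (simplified: the real method also passes the untrimmed-output names and quality settings to the chosen demultiplexer):

```python
def choose_demultiplexer(out_template):
    # Combinatorial demultiplexing is selected when the R1 output
    # template contains both placeholders; otherwise the regular
    # paired demultiplexer (single '{name}' placeholder) is used.
    if '{name1}' in out_template and '{name2}' in out_template:
        return 'CombinatorialDemultiplexer'
    return 'PairedDemultiplexer'

print(choose_demultiplexer('out-{name1}-{name2}.fastq'))  # CombinatorialDemultiplexer
print(choose_demultiplexer('out-{name}.fastq'))           # PairedDemultiplexer
```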
......
-/* Generated by Cython 0.29.7 */
+/* Generated by Cython 0.29.12 */
/* BEGIN: Cython Metadata
{
@@ -19,8 +19,8 @@ END: Cython Metadata */
 #elif PY_VERSION_HEX < 0x02060000 || (0x03000000 <= PY_VERSION_HEX && PY_VERSION_HEX < 0x03030000)
     #error Cython requires Python 2.6+ or Python 3.3+.
 #else
-#define CYTHON_ABI "0_29_7"
-#define CYTHON_HEX_VERSION 0x001D07F0
+#define CYTHON_ABI "0_29_12"
+#define CYTHON_HEX_VERSION 0x001D0CF0
 #define CYTHON_FUTURE_DIVISION 1
 #include <stddef.h>
 #ifndef offsetof
@@ -322,8 +322,13 @@ END: Cython Metadata */
   #define __Pyx_DefaultClassType PyClass_Type
 #else
   #define __Pyx_BUILTIN_MODULE_NAME "builtins"
+#if PY_VERSION_HEX >= 0x030800A4 && PY_VERSION_HEX < 0x030800B2
+  #define __Pyx_PyCode_New(a, k, l, s, f, code, c, n, v, fv, cell, fn, name, fline, lnos)\
+          PyCode_New(a, 0, k, l, s, f, code, c, n, v, fv, cell, fn, name, fline, lnos)
+#else
   #define __Pyx_PyCode_New(a, k, l, s, f, code, c, n, v, fv, cell, fn, name, fline, lnos)\
           PyCode_New(a, k, l, s, f, code, c, n, v, fv, cell, fn, name, fline, lnos)
+#endif
   #define __Pyx_DefaultClassType PyType_Type
 #endif
 #ifndef Py_TPFLAGS_CHECKTYPES
@@ -1922,9 +1927,9 @@ if (!__Pyx_RefNanny) {
   }
   #endif
   /*--- Builtin init code ---*/
-  if (__Pyx_InitCachedBuiltins() < 0) __PYX_ERR(0, 1, __pyx_L1_error)
+  if (__Pyx_InitCachedBuiltins() < 0) goto __pyx_L1_error;
   /*--- Constants init code ---*/
-  if (__Pyx_InitCachedConstants() < 0) __PYX_ERR(0, 1, __pyx_L1_error)
+  if (__Pyx_InitCachedConstants() < 0) goto __pyx_L1_error;
   /*--- Global type/function init code ---*/
   (void)__Pyx_modinit_global_init_code();
   (void)__Pyx_modinit_variable_export_code();
......
@@ -8,8 +8,8 @@ import textwrap
 from .adapters import Where, EndStatistics
 from .modifiers import QualityTrimmer, NextseqQualityTrimmer, AdapterCutter, PairedAdapterCutter
 from .filters import (NoFilter, PairedNoFilter, TooShortReadFilter, TooLongReadFilter,
-    PairedDemultiplexer, Demultiplexer, NContentFilter, InfoFileWriter, WildcardFileWriter,
-    RestFileWriter)
+    PairedDemultiplexer, CombinatorialDemultiplexer, Demultiplexer, NContentFilter, InfoFileWriter,
+    WildcardFileWriter, RestFileWriter)


 def safe_divide(numerator, denominator):
@@ -91,7 +91,8 @@ class Statistics:
         for w in writers:
             if isinstance(w, (InfoFileWriter, RestFileWriter, WildcardFileWriter)):
                 pass
-            elif isinstance(w, (NoFilter, PairedNoFilter, PairedDemultiplexer, Demultiplexer)):
+            elif isinstance(w, (NoFilter, PairedNoFilter, PairedDemultiplexer,
+                    CombinatorialDemultiplexer, Demultiplexer)):
                 self.written += w.written
                 self.written_bp[0] += w.written_bp[0]
                 self.written_bp[1] += w.written_bp[1]
......