Commit 02fef595 authored by Andreas Tille

New upstream version 1.18

parent de9edc7e
......@@ -2,6 +2,71 @@
v1.18 (2018-09-07)
* Close :issue:`327`: Maximum and minimum lengths can now be specified
separately for R1 and R2 with ``-m LENGTH1:LENGTH2``. One of the
lengths can be omitted, in which case only the length of the other
read is checked (as in ``-m 17:`` or ``-m :17``).
* Close :issue:`322`: Use ``-j 0`` to auto-detect how many cores to run on.
This should even work correctly on cluster systems when Cutadapt runs as
a batch job to which fewer cores than exist on the machine have been
assigned. Note that the number of threads used by ``pigz`` cannot be
controlled at the moment, see :issue:`290`.
* Close :issue:`225`: Allow setting the maximum error rate and minimum overlap
length per adapter. A new :ref:`syntax for adapter-specific
parameters <trimming-parameters>` was added for this. Example:
``-a "ADAPTER;min_overlap=5"``.
* Close :issue:`152`: Using the new syntax for adapter-specific parameters,
it is now possible to allow partial matches of a 3' adapter at the 5' end
(and partial matches of a 5' adapter at the 3' end) by specifying the
``anywhere`` parameter (as in ``-a "ADAPTER;anywhere"``).
* Allow ``--pair-filter=first`` in addition to ``both`` and ``any``. If
used, a read pair is discarded if the filtering criterion applies to R1;
and R2 is ignored.
* Close :issue:`112`: Implement a ``--report=minimal`` option for printing
a succinct two-line report in tab-separated value (tsv) format. Thanks
to :user:`jvolkening` for coming up with an initial patch!
Bug fixes
* Fix :issue:`128`: The Reads written figure in the report incorrectly
included both trimmed and untrimmed reads if ``--untrimmed-output`` was used.
* The options ``--no-trim`` and ``--mask-adapter`` should now be written as
``--action=none`` and ``--action=mask``, respectively. The old options still work.
* This is the last release to support :ref:`colorspace data <colorspace>`.
* This is the last release to support Python 2.
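The ``-m LENGTH1:LENGTH2`` syntax from the first entry above could be parsed roughly as follows. This is only a minimal sketch: the function name ``parse_length_pair`` is hypothetical and not taken from the Cutadapt source, and applying a single value without a colon to both reads is an assumption made here.

```python
def parse_length_pair(spec):
    """Parse 'LEN', 'LEN1:LEN2', 'LEN1:' or ':LEN2' into a (r1, r2) tuple.

    A missing length is returned as None, meaning that the length of
    that read is not checked. A single value without a colon is applied
    to both reads (an assumption of this sketch).
    """
    if ':' not in spec:
        value = int(spec)
        return (value, value)
    first, second = spec.split(':', 1)
    if not first and not second:
        raise ValueError("At least one of the two lengths must be given")
    return (
        int(first) if first else None,
        int(second) if second else None,
    )
```

With this sketch, ``parse_length_pair("17:")`` checks only R1 and ``parse_length_pair(":17")`` checks only R2.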
v1.17 (2018-08-20)
* Close :issue:`53`: Implement adapters :ref:`that disallow internal matches <non-internal>`.
This is a bit like anchoring, but less strict: The adapter sequence
can appear at different lengths, but must always be at one of the ends.
Use ``-a ADAPTERX`` (with a literal ``X``) to disallow internal matches
for a 3' adapter. Use ``-g XADAPTER`` to disallow for a 5' adapter.
* :user:`klugem` contributed PR :issue:`299`: The ``--length`` option (and its
alias ``-l``) can now be used with negative lengths, which will remove bases
from the beginning of the read instead of from the end.
* Close :issue:`107`: Add a ``--discard-casava`` option to remove reads
that did not pass CASAVA filtering (this is possibly relevant only for
older datasets).
* Fix :issue:`318`: Cutadapt should now be installable with Python 3.7.
* Running Cutadapt under Python 3.3 is no longer supported (Python 2.7 or
3.4+ is required).
* Planned change: One of the next Cutadapt versions will drop support for
Python 2 entirely, requiring Python 3.
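The behaviour of negative ``--length`` values described in the v1.17 notes above can be illustrated with plain Python slicing. This is a sketch of the semantics only; the helper name ``shorten`` is hypothetical and the real implementation may differ.

```python
def shorten(sequence, length):
    """Keep at most abs(length) bases of the sequence.

    A positive length keeps the first bases (bases are removed from the
    end). A negative length keeps the last bases (bases are removed from
    the beginning).
    """
    if length >= 0:
        return sequence[:length]
    return sequence[length:]
```

For example, ``shorten("ACGTACGT", 3)`` gives ``"ACG"``, while ``shorten("ACGTACGT", -3)`` gives ``"CGT"``.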
v1.16 (2018-02-21)
Metadata-Version: 1.1
Name: cutadapt
Version: 1.16
Version: 1.18
Summary: trim adapters from high-throughput sequencing reads
Author: Marcel Martin
License: MIT
Description-Content-Type: UNKNOWN
Description: .. image::
Algorithm details
.. _adapter-alignment-algorithm:
Adapter alignment algorithm
Since the publication of the `EMBnet journal application note about
cutadapt <>`_, the alignment algorithm
used for finding adapters has changed significantly. An overview of this new
algorithm is given in this section. An even more detailed description is
available in Chapter 2 of my PhD thesis `Algorithms and tools for the analysis
of high-throughput DNA sequencing data <>`_.
The algorithm is based on *semiglobal alignment*, also called *free-shift*,
*ends-free* or *overlap* alignment. In a regular (global) alignment, the
two sequences are compared from end to end and all differences occurring over
that length are counted. In semiglobal alignment, the sequences are allowed to
freely shift relative to each other and differences are only penalized in the
overlapping region between them::

      FANTASTIC
   ELEFANT

The prefix ``ELE`` and the suffix ``ASTIC`` do not have a counterpart in the
respective other row, but this is not counted as an error. The overlap ``FANT``
has a length of four characters.
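To make the idea of freely shifting sequences concrete, the following sketch enumerates every relative shift of two strings and counts mismatches only inside the overlapping region. It is a simplification for illustration: it ignores insertions and deletions, which the real algorithm handles, and the function name ``ungapped_overlaps`` is hypothetical.

```python
def ungapped_overlaps(a, b):
    """Yield (shift, overlap_length, mismatches) for every relative shift.

    shift is the position in `a` at which `b` starts. A negative shift
    means that `b` starts before `a`. Characters outside the overlap are
    not counted as errors.
    """
    for shift in range(-len(b) + 1, len(a)):
        start_a = max(shift, 0)
        start_b = max(-shift, 0)
        length = min(len(a) - start_a, len(b) - start_b)
        mismatches = sum(
            1 for i in range(length) if a[start_a + i] != b[start_b + i]
        )
        yield shift, length, mismatches
```

For the words ``ELEFANT`` and ``FANTASTIC``, the shift of 3 yields an overlap of length four (``FANT``) with zero mismatches, which corresponds to the prefix, overlap and suffix named above.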
Traditionally, *alignment scores* are used to find an optimal overlap alignment:
This means that the scoring function assigns a positive value to matches,
while mismatches, insertions and deletions get negative values. The optimal
alignment is then the one that has the maximal total score. Usage of scores
has the disadvantage that they are not at all intuitive: What does a total score
of *x* mean? Is that good or bad? How should a threshold be chosen in order to
avoid finding alignments with too many errors?
For cutadapt, the adapter alignment algorithm uses *unit costs* instead.
This means that mismatches, insertions and deletions are counted as one error, which
is easier to understand and makes it possible to specify a single parameter for the
algorithm (the maximum error rate) in order to describe how many errors are
acceptable.
There is a problem with this: When using costs instead of scores, we would like
to minimize the total costs in order to find an optimal alignment. But then the
best alignment would always be the one in which the two sequences do not overlap
at all! This would be correct, but meaningless for the purpose of finding an
adapter sequence.
The optimization criteria are therefore a bit different. The basic idea is to
consider the alignment optimal that maximizes the overlap between the two
sequences, as long as the allowed error rate is not exceeded.
Conceptually, the procedure is as follows:
1. Consider all possible overlaps between the two sequences and compute an
alignment for each, minimizing the total number of errors in each one.
2. Keep only those alignments that do not exceed the specified maximum error
rate.
3. Then, keep only those alignments that have a maximal number of matches
(that is, there is no alignment with more matches).
4. If there are multiple alignments with the same number of matches, then keep
only those that have the smallest error rate.
5. If there are still multiple candidates left, choose the alignment that starts
at the leftmost position within the read.
In Step 1, the different adapter types are taken into account: Only those
overlaps that the adapter type allows are considered.
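The selection in Steps 2 to 5 can be sketched as a filter followed by a single sort key. This is a conceptual illustration, not the actual implementation, which works on a dynamic-programming matrix. The ``Candidate`` record and the function ``choose_alignment`` are hypothetical names, and the error rate is computed here as errors divided by the aligned length, which is an assumption of this sketch.

```python
from __future__ import division

from collections import namedtuple

# Hypothetical record describing one candidate alignment from Step 1.
Candidate = namedtuple("Candidate", "start length matches errors")


def choose_alignment(candidates, max_error_rate):
    """Return the preferred candidate, or None if none is acceptable."""
    # Step 2: discard candidates that exceed the maximum error rate.
    allowed = [
        c for c in candidates
        if c.length > 0 and c.errors / c.length <= max_error_rate
    ]
    if not allowed:
        return None
    # Step 3: most matches first. Step 4: then the smallest error rate.
    # Step 5: then the leftmost start position within the read.
    return min(
        allowed,
        key=lambda c: (-c.matches, c.errors / c.length, c.start),
    )
```

Expressing the three tie-breaking steps as one tuple key keeps their order of priority explicit.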
.. _quality-trimming-algorithm:
Quality trimming algorithm
The trimming algorithm implemented in cutadapt is the same as the one used by
BWA, but applied to both
ends of the read in turn (if requested). That is: Subtract the given cutoff
from all qualities; compute partial sums from all indices to the end of the
sequence; cut the sequence at the index at which the sum is minimal. If both
ends are to be trimmed, repeat this for the other end.
The basic idea is to remove all bases starting from the end of the read whose
quality is smaller than the given threshold. This is refined a bit by allowing
some good-quality bases among the bad-quality ones. In the following example,
we assume that the 3' end is to be quality-trimmed.
Assume you use a threshold of 10 and have these quality values:
42, 40, 26, 27, 8, 7, 11, 4, 2, 3
Subtracting the threshold gives:
32, 30, 16, 17, -2, -3, 1, -6, -8, -7
Then sum up the numbers, starting from the end (partial sums). Stop early if
the sum is greater than zero:
(70), (38), 8, -8, -25, -23, -20, -21, -15, -7
The numbers in parentheses are not computed (because 8 is greater than zero),
but shown here for completeness. The position of the minimum (-25) is used as
the trimming position. Therefore, the read is trimmed to the first four bases,
which have quality values 42, 40, 26, 27.
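The worked example above can be reproduced with a short pure-Python sketch of the 3' end trimming. It assumes that the qualities are already decoded into a list of integers, and the function name ``quality_trim_3prime`` is hypothetical. The actual implementation is a Cython function that works on the encoded quality string.

```python
def quality_trim_3prime(qualities, cutoff):
    """Return the index at which to cut, so that qualities[:index] is kept."""
    partial_sum = 0
    min_sum = 0
    stop = len(qualities)
    # Walk from the 3' end towards the 5' end.
    for i in reversed(range(len(qualities))):
        partial_sum += qualities[i] - cutoff
        if partial_sum > 0:
            # Stop early: the remaining partial sums are not needed.
            break
        if partial_sum < min_sum:
            min_sum = partial_sum
            stop = i
    return stop
```

For the quality values ``42, 40, 26, 27, 8, 7, 11, 4, 2, 3`` and a threshold of 10, this returns 4, so the first four bases are kept.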
......@@ -4,8 +4,7 @@ Developing
The `Cutadapt source code is on GitHub <>`_.
Cutadapt is written in Python with some extension modules that are written
in Cython. Cutadapt uses a single code base that is compatible with both
Python 2 and 3. Python 2.7 is the minimum supported Python version. With
relatively little effort, compatibility with Python 2.6 could be restored.
Python 2 and 3. Python 2.7 is the minimum supported Python version.
Development installation
......@@ -63,6 +62,54 @@ Yes, there are inconsistencies in the current code base since it’s a few years
Making a release
Since version 1.17, Travis CI is used to automatically deploy a new Cutadapt release
(both as an sdist and as wheels) whenever a new tag is pushed to the Git repository.
Cutadapt uses `versioneer <>`_ to automatically manage
version numbers. This means that the version is not stored in the source code but derived from
the most recent Git tag. The following procedure can be used to bump the version and make a new
release:
#. Update ``CHANGES.rst`` (version number and list of changes)
#. Ensure you have no uncommitted changes in the working copy.
#. Run a ``git pull``.
#. Run ``tox``, ensuring all tests pass.
#. Tag the current commit with the version number (there must be a ``v`` prefix)::
git tag v0.1
To release a development version, use a ``dev`` version number such as ``v1.17.dev1``.
Users will not automatically get these unless they use ``pip install --pre``.
#. Push the tag::
git push --tags
#. Wait for Travis to finish and to deploy to PyPI.
#. Update the `bioconda recipe <>`_.
It is probably easiest to edit the recipe via the web interface and send in a
pull request. Ensure that the list of dependencies (the ``requirements:``
section in the recipe) is in sync with the ```` file.
Since this is just a version bump, the pull request does not need a
review by other bioconda developers. As soon as the tests pass and if you
have the proper permissions, it can be merged directly.
Releases to bioconda still need to be made manually.
Making a release manually
.. note::
This section is outdated, see the previous section!
If this is the first time you attempt to upload a distribution to PyPI, create a
configuration file named ``.pypirc`` in your home directory with the following
......@@ -22,7 +22,6 @@ improvements.
- allow to remove not the adapter itself, but the sequence before or after it
- instead of trimming, convert adapter to lowercase
- warn when given adapter sequence contains non-IUPAC characters
- try multithreading again, this time use os.pipe() or 0mq
- extensible file type detection
- the --times setting should be an attribute of Adapter
......@@ -34,7 +33,6 @@ Backwards-incompatible changes
- Possibly drop wildcard-file support, extend info-file instead
- Drop "legacy mode"
- For non-anchored 5' adapters, find rightmost match
- Move ``scripts/`` to ````
Specifying adapters
......@@ -10,6 +10,7 @@ Table of contents
......@@ -45,7 +45,7 @@ Dependencies
Cutadapt installation requires this software to be installed:
* Python 2.7 or at least Python 3.3
* Python 2.7 or at least Python 3.4
* Possibly a C compiler. For Linux, cutadapt packages are provided as
so-called “wheels” (``.whl`` files) which come pre-compiled.
......@@ -5,18 +5,6 @@ Recipes (FAQ)
This section gives answers to frequently asked questions. It shows you how to
get cutadapt to do what you want it to do!
.. _avoid-internal-adapter-matches:
Avoid internal adapter matches
To force matches to be at the end of the read and thus avoid internal
adapter matches, append a few ``X`` characters to the adapter sequence, like
this: ``-a TACGGCATXXX``. The ``X`` is counted as a mismatch and will force the
match to be at the end. Just make sure that there are more ``X`` characters than
the length of the adapter times the error rate. This is not the same as an
anchored 3' adapter since partial matches are still allowed.
Remove more than one adapter
......@@ -9,5 +9,4 @@ parentdir_prefix = cutadapt-
tag_build =
tag_date = 0
tag_svn_revision = 0
......@@ -12,8 +12,9 @@ import versioneer
if sys.version_info < (2, 7):
sys.stdout.write("At least Python 2.7 is required.\n")
vi = sys.version_info
if (vi[0] == 2 and vi[1] < 7) or (vi[0] == 3 and vi[1] < 4):
sys.stdout.write('Minimum supported Python versions are 2.7 and 3.4.\n')
......@@ -110,8 +111,11 @@ setup(
package_dir={'': 'src'},
entry_points={'console_scripts': ['cutadapt = cutadapt.__main__:main']},
extras_require = {
'dev': ['Cython', 'pytest', 'pytest-timeout', 'nose', 'sphinx', 'sphinx_issues'],
"Development Status :: 5 - Production/Stable",
"Environment :: Console",
Metadata-Version: 1.1
Name: cutadapt
Version: 1.16
Version: 1.18
Summary: trim adapters from high-throughput sequencing reads
Author: Marcel Martin
License: MIT
Description-Content-Type: UNKNOWN
Description: .. image::
......@@ -7,6 +7,7 @@ setup.cfg
......@@ -35,6 +36,7 @@ src/cutadapt/
......@@ -54,11 +56,13 @@ tests/
......@@ -100,6 +104,8 @@ tests/cut/overlapa.fa
......@@ -125,6 +131,7 @@ tests/cut/polya.fasta
......@@ -159,6 +166,7 @@ tests/cut/wildcard.fa
......@@ -168,6 +176,7 @@ tests/data/anchored-back.fasta
......@@ -216,7 +225,9 @@ tests/data/tooshort.noprimer.fa
\ No newline at end of file
\ No newline at end of file
# coding: utf-8
from __future__ import print_function, division, absolute_import
import sys
from ._version import get_versions
__version__ = get_versions()['version']
del get_versions
def check_importability(): # pragma: no cover
import cutadapt._align
except ImportError as e:
if 'undefined symbol' in str(e):
ERROR: A required extension module could not be imported because it is
incompatible with your system. A quick fix is to recompile the extension
modules with the following command:
{0} build_ext -i
See the documentation for alternative ways of installing the program.
The original error message follows.
......@@ -86,7 +86,7 @@ class DPMatrix:
Representation of the dynamic-programming matrix.
This used only when debugging is enabled in the Aligner class since the
This is used only when debugging is enabled in the Aligner class since the
matrix is normally not stored in full.
Entries in the matrix may be None, in which case that value was not
......@@ -178,19 +178,20 @@ cdef class Aligner:
If any of the flags is set, all non-IUPAC characters in the sequences
compare as 'not equal'.
cdef int m
cdef _Entry* column # one column of the DP matrix
cdef double max_error_rate
cdef int flags
cdef int _insertion_cost
cdef int _deletion_cost
cdef int _min_overlap
cdef bint wildcard_ref
cdef bint wildcard_query
cdef bint debug
cdef object _dpmatrix
cdef bytes _reference # TODO rename to translated_reference or so
cdef str str_reference
int m
_Entry* column # one column of the DP matrix
double max_error_rate
int flags
int _insertion_cost
int _deletion_cost
int _min_overlap
bint wildcard_ref
bint wildcard_query
bint debug
object _dpmatrix
bytes _reference # TODO rename to translated_reference or so
str str_reference
......@@ -226,7 +227,7 @@ cdef class Aligner:
def __set__(self, value):
if value < 1:
raise ValueError('Insertion/deletion cost must be at leat 1')
raise ValueError('Insertion/deletion cost must be at least 1')
self._insertion_cost = value
self._deletion_cost = value
......@@ -276,17 +277,18 @@ cdef class Aligner:
The alignment itself is not returned.
cdef char* s1 = self._reference
cdef bytes query_bytes = query.encode('ascii')
cdef char* s2 = query_bytes
cdef int m = self.m
cdef int n = len(query)
cdef _Entry* column = self.column
cdef double max_error_rate = self.max_error_rate
cdef bint start_in_ref = self.flags & START_WITHIN_SEQ1
cdef bint start_in_query = self.flags & START_WITHIN_SEQ2
cdef bint stop_in_ref = self.flags & STOP_WITHIN_SEQ1
cdef bint stop_in_query = self.flags & STOP_WITHIN_SEQ2
char* s1 = self._reference
bytes query_bytes = query.encode('ascii')
char* s2 = query_bytes
int m = self.m
int n = len(query)
_Entry* column = self.column
double max_error_rate = self.max_error_rate
bint start_in_ref = self.flags & START_WITHIN_SEQ1
bint start_in_query = self.flags & START_WITHIN_SEQ2
bint stop_in_ref = self.flags & STOP_WITHIN_SEQ1
bint stop_in_query = self.flags & STOP_WITHIN_SEQ2
if self.wildcard_query:
query_bytes = query_bytes.translate(IUPAC_TABLE)
......@@ -366,13 +368,14 @@ cdef class Aligner:
if start_in_ref:
last = m
cdef int cost_diag
cdef int cost_deletion
cdef int cost_insertion
cdef int origin, cost, matches
cdef int length
cdef bint characters_equal
cdef _Entry tmp_entry
int cost_diag
int cost_deletion
int cost_insertion
int origin, cost, matches
int length
bint characters_equal
_Entry tmp_entry
with nogil:
# iterate over columns
......@@ -502,15 +505,16 @@ def compare_prefixes(str ref, str query, bint wildcard_ref=False, bint wildcard_
This function returns a tuple compatible with what Aligner.locate outputs.
cdef int m = len(ref)
cdef int n = len(query)
cdef bytes query_bytes = query.encode('ascii')
cdef bytes ref_bytes = ref.encode('ascii')
cdef char* r_ptr
cdef char* q_ptr
cdef int length = min(m, n)
cdef int i, matches = 0
cdef bint compare_ascii = False
int m = len(ref)
int n = len(query)
bytes query_bytes = query.encode('ascii')
bytes ref_bytes = ref.encode('ascii')
char* r_ptr
char* q_ptr
int length = min(m, n)
int i, matches = 0
bint compare_ascii = False
if wildcard_ref:
ref_bytes = ref_bytes.translate(IUPAC_TABLE)
......@@ -17,11 +17,12 @@ def quality_trim_index(str qualities, int cutoff_front, int cutoff_back, int bas
- Compute partial sums from all indices to the end of the sequence.
- Trim sequence at the index at which the sum is minimal.
cdef int s
cdef int max_qual
cdef int stop = len(qualities)
cdef int start = 0
cdef int i
int s
int max_qual
int stop = len(qualities)
int start = 0
int i
# find trim position for 5' end
s = 0
......@@ -16,56 +16,6 @@ ctypedef fused bytes_or_bytearray:
def head(bytes_or_bytearray buf, Py_ssize_t lines):
Skip forward by a number of lines in the given buffer and return
how many bytes this corresponds to.
Py_ssize_t pos = 0
Py_ssize_t linebreaks_seen = 0
Py_ssize_t length = len(buf)
unsigned char* data = buf
while linebreaks_seen < lines and pos < length:
if data[pos] == '\n':
linebreaks_seen += 1
pos += 1
return pos
def fastq_head(bytes_or_bytearray buf, Py_ssize_t end=-1):
Return an integer length such that buf[:length] contains the highest
possible number of complete four-line records.
If end is -1, the full buffer is searched. Otherwise only buf[:end].
Py_ssize_t pos = 0
Py_ssize_t linebreaks = 0
Py_ssize_t length = len(buf)
unsigned char* data = buf
Py_ssize_t record_start = 0
if end != -1:
length = min(length, end)
while True:
while pos < length and data[pos] != '\n':
pos += 1
if pos == length:
pos += 1
linebreaks += 1
if linebreaks == 4:
linebreaks = 0
record_start = pos
# Reached the end of the data block
return record_start
def two_fastq_heads(bytes_or_bytearray buf1, bytes_or_bytearray buf2, Py_ssize_t end1, Py_ssize_t end2):
Skip forward in the two buffers by multiples of four lines.
......@@ -108,25 +58,19 @@ cdef class Sequence(object):
A record in a FASTQ file. Also used for FASTA (then the qualities attribute
is None). qualities is a string and it contains the qualities encoded as
If an adapter has been matched to the sequence, the 'match' attribute is
set to the corresponding Match instance.
public str name
public str sequence
public str qualities
public bint second_header
public object match
def __init__(self, str name, str sequence, str qualities=None, bint second_header=False,
def __init__(self, str name, str sequence, str qualities=None, bint second_header=False):
"""Set qualities to None if there are no quality values""" = name
self.sequence = sequence
self.qualities = qualities
self.second_header = second_header
self.match = match
if qualities is not None and len(qualities) != len(sequence):
rname = _shorten(name)
raise FormatError("In read named {0!r}: length of quality sequence ({1}) and length "
......@@ -139,8 +83,7 @@ cdef class Sequence(object):,
self.qualities[key] if self.qualities is not None else None,
def __repr__(self):
qstr = ''
......@@ -164,8 +107,7 @@ cdef class Sequence(object):
raise NotImplementedError()
def __reduce__(self):
return (Sequence, (, self.sequence, self.qualities, self.second_header,
return (Sequence, (, self.sequence, self.qualities, self.second_header))
class FastqReader(SequenceReader):
......@@ -11,8 +11,8 @@ version_json = '''
"dirty": false,
"error": null,
"full-revisionid": "77ade52bc2a7fe2d278fdb4256c5b46936011c2c",
"version": "1.16"
"full-revisionid": "069226f9bded83d8e72ef274d2d08686c0a2382a",
"version": "1.18"
......@@ -52,7 +52,7 @@ def encode(s):
Given a sequence of nucleotides, convert them to