Commits on Source (3)
......@@ -2,6 +2,106 @@
Changes
=======
v2.3 (2019-04-25)
-----------------
* :issue:`378`: The ``--pair-adapters`` option, added in version 2.1, was
not actually usable for demultiplexing.
v2.2 (2019-04-20)
-----------------
* :issue:`376`: Fix a crash when using anchored 5' adapters together with
``--no-indels`` and trying to trim an empty read.
* :issue:`369`: Fix a crash when attempting to trim an empty read using a ``-g``
adapter with wildcards.
v2.1 (2019-03-15)
-----------------
* :issue:`366`: Fix problems when combining ``--cores`` with
reading from standard input or writing to standard output.
* :issue:`347`: Support :ref:`“paired adapters” <paired-adapters>`. One use case is
demultiplexing Illumina *Unique Dual Indices* (UDI).
v2.0 (2019-03-06)
-----------------
This is a major new release with lots of bug fixes and new features, but
also some backwards-incompatible changes. These should hopefully
not affect too many users, but please make sure to review them and
possibly update your scripts!
Backwards-incompatible changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* :issue:`329`: Linked adapters specified with ``-a ADAPTER1...ADAPTER2``
are no longer anchored by default. To get results consistent with the old
behavior, use ``-a ^ADAPTER1...ADAPTER2`` instead.
* Support for colorspace data was removed. Thus, the following command-line
options can no longer be used: ``-c``, ``-d``, ``-t``, ``--strip-f3``,
``--maq``, ``--bwa``, ``--no-zero-cap``.
* “Legacy mode” has been removed. This mode was enabled under certain
conditions and would change the behavior such that the read-modifying options
such as ``-q`` would only apply to the forward/R1 reads. This was necessary
for compatibility with old Cutadapt versions, but became increasingly
confusing.
* :issue:`360`: Computation of the error rate of an adapter match no longer
counts the ``N`` wildcard bases. Previously, an adapter like ``N{18}CC``
(18 ``N`` wildcards followed by ``CC``) would effectively match
anywhere because the default error rate of 0.1 (10%) would allow for
two errors. The error rate of a match is now computed as
the number of errors divided by the number of non-``N`` bases in the
matching part of the adapter.
* This release of Cutadapt requires at least Python 3.4 to run. Python 2.7
is no longer supported.
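As a rough illustration of the error-rate change described for :issue:`360` above, the following sketch compares how many errors would be tolerated for an adapter depending on whether ``N`` wildcards count towards its length. The function ``allowed_errors`` and its parameters are hypothetical names chosen for this example. It is not Cutadapt's actual implementation, and it assumes that the full adapter takes part in the match.

```python
def allowed_errors(adapter, max_error_rate=0.1, count_wildcards=False):
    """Return how many errors a full-length match may contain (illustrative only)."""
    if count_wildcards:
        # Old behavior: every adapter position counts, including N wildcards.
        effective_length = len(adapter)
    else:
        # New behavior: only non-N positions count.
        effective_length = sum(1 for base in adapter.upper() if base != "N")
    return int(effective_length * max_error_rate)


adapter = "N" * 18 + "CC"  # expanded form of N{18}CC
old_limit = allowed_errors(adapter, count_wildcards=True)   # 20 positions counted
new_limit = allowed_errors(adapter, count_wildcards=False)  # 2 positions counted
```

With the old counting, two errors are tolerated, so even the ``CC`` part may mismatch entirely. With the new counting, no errors are tolerated for this adapter.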
Features
~~~~~~~~
* A progress indicator is printed while Cutadapt is working. If you redirect
standard error to a file, the indicator is disabled.
* Reading of FASTQ files has gotten faster due to a new parser. The FASTA
and FASTQ reading/writing functions are now available as part of the
`dnaio library <https://github.com/marcelm/dnaio/>`_. This is a separate
Python package that can be installed independently from Cutadapt.
There is one regression at the moment: FASTQ files that use a second
header (after the "+") will have that header removed in the output.
* Some other performance optimizations were made. Speedups of up to 15%
are possible.
* Demultiplexing has become a lot faster :ref:`under certain conditions <speed-up-demultiplexing>`.
* :issue:`335`: For linked adapters, it is now possible to
:ref:`specify which of the two adapters should be required <linked-override>`,
overriding the default.
* :issue:`166`: By specifying ``--action=lowercase``, it is now possible
to not trim adapters, but to instead convert the section of the read
that would have been trimmed to lowercase.
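The effect of ``--action=lowercase`` listed above can be pictured with a small sketch for a 3' adapter. The helper ``lowercase_instead_of_trim`` is a hypothetical name, and the sketch assumes exact matching with ``str.find``; Cutadapt's real matching is error-tolerant and also handles partial adapter occurrences.

```python
def lowercase_instead_of_trim(read, adapter):
    """Lowercase the part of the read that 3' adapter trimming would remove."""
    pos = read.find(adapter)  # exact match only; an assumption made for brevity
    if pos == -1:
        return read  # no adapter found, read is left unchanged
    # Keep the read length; only the would-be-trimmed section changes case.
    return read[:pos] + read[pos:].lower()


marked = lowercase_instead_of_trim("ACGTACGTAGATCGG", "AGATCGG")
```

Here ``marked`` keeps all 15 bases of the read, with the adapter and everything after it in lowercase.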
Bug fixes
~~~~~~~~~
* Removing legacy mode also fixes :issue:`345`: ``--length`` would not enable
legacy mode.
* The switch to ``dnaio`` also fixed :issue:`275`: Input files with
non-standard names now no longer lead to a crash. Instead the format
is now recognized from the file content.
* Fix :issue:`354`: Sequences given using ``file:`` can now be unnamed.
* Fix :issue:`257` and :issue:`242`: When only R1 or only R2 adapters are given, the
``--pair-filter`` setting is now forced to ``both`` for the
``--discard-untrimmed`` (and ``--untrimmed-(paired-)output``) filters.
Otherwise, with the default ``--pair-filter=any``, all pairs would be
considered untrimmed because one of the reads in the pair is always
untrimmed.
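The ``--pair-filter`` fix for :issue:`257` and :issue:`242` above can be sketched as a small decision function. The name ``pair_is_discarded`` and its boolean arguments are hypothetical and only model the ``--discard-untrimmed`` case. This shows the logic as described in the text, not Cutadapt's code.

```python
def pair_is_discarded(r1_untrimmed, r2_untrimmed, pair_filter="any"):
    """Decide whether --discard-untrimmed would drop a read pair (illustrative)."""
    if pair_filter == "any":
        # One untrimmed read is enough to discard the pair.
        return r1_untrimmed or r2_untrimmed
    if pair_filter == "both":
        # Both reads must be untrimmed for the pair to be discarded.
        return r1_untrimmed and r2_untrimmed
    raise ValueError("pair_filter must be 'any' or 'both'")


# Only R1 adapters are given, so R2 can never be trimmed.
r2_untrimmed = True
trimmed_r1_with_any = pair_is_discarded(False, r2_untrimmed, "any")
trimmed_r1_with_both = pair_is_discarded(False, r2_untrimmed, "both")
```

With ``any``, even a pair whose R1 was trimmed is discarded, which is why the setting is forced to ``both`` in this situation.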
Other
~~~~~
* :issue:`359`: The ``-f``/``--format`` option is now ignored and a warning
will be printed if it is used. The input file format is always
auto-detected.
v1.18 (2018-09-07)
------------------
......
Copyright (c) 2010-2018 Marcel Martin <marcel.martin@scilifelab.se>
Copyright (c) 2010-2019 Marcel Martin <marcel.martin@scilifelab.se>
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
......
include CHANGES.rst
include CITATION
include LICENSE
include doc/*.rst
include doc/conf.py
include doc/Makefile
include versioneer.py
include src/cutadapt/*.c
include src/cutadapt/*.pyx
include tests/utils.py
include tests/test_*.py
graft tests/data
graft tests/cut
Metadata-Version: 1.1
Metadata-Version: 2.1
Name: cutadapt
Version: 1.18
Version: 2.3
Summary: trim adapters from high-throughput sequencing reads
Home-page: https://cutadapt.readthedocs.io/
Author: Marcel Martin
Author-email: marcel.martin@scilifelab.se
License: MIT
Description-Content-Type: UNKNOWN
Description: .. image:: https://travis-ci.org/marcelm/cutadapt.svg?branch=master
:target: https://travis-ci.org/marcelm/cutadapt
......@@ -14,7 +13,7 @@ Description: .. image:: https://travis-ci.org/marcelm/cutadapt.svg?branch=master
:target: https://pypi.python.org/pypi/cutadapt
========
cutadapt
Cutadapt
========
Cutadapt finds and removes adapter sequences, primers, poly-A tails and other
......@@ -35,7 +34,7 @@ Description: .. image:: https://travis-ci.org/marcelm/cutadapt.svg?branch=master
Cutadapt comes with an extensive suite of automated tests and is available under
the terms of the MIT license.
If you use cutadapt, please cite
If you use Cutadapt, please cite
`DOI:10.14806/ej.17.1.200 <http://dx.doi.org/10.14806/ej.17.1.200>`_ .
......@@ -56,6 +55,7 @@ Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Cython
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.4
Provides-Extra: dev
......@@ -5,7 +5,7 @@
:target: https://pypi.python.org/pypi/cutadapt
========
cutadapt
Cutadapt
========
Cutadapt finds and removes adapter sequences, primers, poly-A tails and other
......@@ -26,7 +26,7 @@ also just demultiplex your input data, without removing adapter sequences at all
Cutadapt comes with an extensive suite of automated tests and is available under
the terms of the MIT license.
If you use cutadapt, please cite
If you use Cutadapt, please cite
`DOI:10.14806/ej.17.1.200 <http://dx.doi.org/10.14806/ej.17.1.200>`_ .
......
python-cutadapt (2.3-1) UNRELEASED; urgency=medium
* Team upload.
* New upstream version
-- Liubov Chuprikova <chuprikovalv@gmail.com> Fri, 28 Jun 2019 19:07:03 +0200
python-cutadapt (1.18-1) unstable; urgency=medium
* New upstream version
......
......@@ -9,7 +9,7 @@ Adapter alignment algorithm
===========================
Since the publication of the `EMBnet journal application note about
cutadapt <http://dx.doi.org/10.14806/ej.17.1.200>`_, the alignment algorithm
Cutadapt <http://dx.doi.org/10.14806/ej.17.1.200>`_, the alignment algorithm
used for finding adapters has changed significantly. An overview of this new
algorithm is given in this section. An even more detailed description is
available in Chapter 2 of my PhD thesis `Algorithms and tools for the analysis
......@@ -37,7 +37,7 @@ has the disadvantage that they are not at all intuitive: What does a total score
of *x* mean? Is that good or bad? How should a threshold be chosen in order to
avoid finding alignments with too many errors?
For cutadapt, the adapter alignment algorithm uses *unit costs* instead.
For Cutadapt, the adapter alignment algorithm uses *unit costs* instead.
This means that mismatches, insertions and deletions are counted as one error, which
is easier to understand and makes it possible to specify a single parameter for the
algorithm (the maximum error rate) in order to describe how many errors are
......@@ -75,7 +75,7 @@ overlaps that are actually allowed by the adapter type are actually considered.
Quality trimming algorithm
--------------------------
The trimming algorithm implemented in cutadapt is the same as the one used by
The trimming algorithm implemented in Cutadapt is the same as the one used by
BWA, but applied to both
ends of the read in turn (if requested). That is: Subtract the given cutoff
from all qualities; compute partial sums from all indices to the end of the
......
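A minimal sketch of the 3' quality trimming steps described above follows. The description is cut off in this view, so the final step used here (cut the read at the index where the partial sum is minimal) is an assumption based on the BWA-style procedure. The function name ``quality_trim_index`` and the list-of-integers input are choices made for this example, and any early-stopping rule an actual implementation may apply is omitted.

```python
def quality_trim_index(qualities, cutoff):
    """Return the number of bases to keep at the 3' end trimming step."""
    best_index = len(qualities)  # default: keep the whole read
    best_sum = 0
    running = 0
    # Walk from the 3' end; 'running' is the sum of (quality - cutoff)
    # from position i to the end of the read.
    for i in range(len(qualities) - 1, -1, -1):
        running += qualities[i] - cutoff
        if running < best_sum:
            best_sum = running
            best_index = i
    return best_index


keep = quality_trim_index([42, 40, 26, 27, 8, 7, 11, 4, 2, 3], cutoff=10)
```

For these example qualities, the partial sums reach their minimum at index 4, so the first four bases are kept.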
.. _colorspace:
Colorspace reads
================
Cutadapt was designed to work with colorspace reads from the ABi SOLiD
sequencer. Colorspace trimming is activated by the ``--colorspace``
option (or use ``-c`` for short). The input reads can be given either:
- in a FASTA file (typically extensions ``.csfasta`` or ``.csfa``)
- in a FASTQ file
- in a ``.csfasta`` and a ``.qual`` file (this is the native SOLiD
format). That is, cutadapt expects *two* file names in this case.
In all cases, the colors must be represented by the characters 0, 1, 2,
3. Here is an example input file in ``.fastq`` format that is accepted::
@1_13_85_F3
T110020300.0113010210002110102330021
+
7&9<&77)& <7))%4'657-1+9;9,.<8);.;8
@1_13_573_F3
T312311200.3021301101113203302010003
+
6)3%)&&&& .1&(6:<'67..*,:75)'77&&&5
Further example input files can be found in the cutadapt distribution at
``tests/data/solid.*``. The ``.csfasta``/``.qual`` file format is
automatically assumed if two input files are given to cutadapt, and when no
paired-end trimming options are used.
Cutadapt always converts input data given as a pair of FASTA/QUAL files to FASTQ.
In colorspace mode, the adapter sequences given to the ``-a``, ``-b``
and ``-g`` options can be given both as colors or as nucleotides. If
given as nucleotides, they will automatically be converted to
colorspace. For example, to trim an adapter from ``solid.csfasta`` and
``solid.qual``, use this command-line::
cutadapt -c -a CGCCTTGGCCGTACAGCAG solid.csfasta solid.qual > output.fastq
In case you know the colorspace adapter sequence, you can also write
``330201030313112312`` instead of ``CGCCTTGGCCGTACAGCAG``, and the result
is the same.
Ambiguity in colorspace
-----------------------
The ambiguity of colorspace encoding leads to some effects to be aware
of when trimming 3' adapters from colorspace reads. For example, when
trimming the adapter ``AACTC``, cutadapt searches for its
colorspace-encoded version ``0122``. But also ``TTGAG``, ``CCAGA`` and
``GGTCT`` have an encoding of ``0122``. This means that effectively four
different adapter sequences are searched and trimmed at the same time.
There is no way around this, unless the decoded sequence were available,
but that is usually only the case after read mapping.
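The ambiguity can be reproduced with a short sketch of colorspace encoding. It assumes the usual SOLiD scheme, in which each color is the XOR of the 2-bit codes of two neighboring nucleotides (A=0, C=1, G=2, T=3). The function name ``encode_colorspace`` is hypothetical and not part of Cutadapt.

```python
_NUCLEOTIDE_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_colorspace(nucleotides):
    """Encode a nucleotide sequence as colors (one color per adjacent pair)."""
    codes = [_NUCLEOTIDE_CODES[base] for base in nucleotides.upper()]
    # Each color depends only on the transition between neighbors,
    # which is why different sequences can share one encoding.
    return "".join(str(a ^ b) for a, b in zip(codes, codes[1:]))


encodings = {seq: encode_colorspace(seq) for seq in ("AACTC", "TTGAG", "CCAGA", "GGTCT")}
```

All four sequences in ``encodings`` map to the same color string, so searching for one of them in colorspace effectively searches for all of them.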
The effect should usually be quite small. The number of false positives
is multiplied by four, but with a sufficiently large overlap (3 or 4 is
already enough), this is still only around 0.2 bases lost per read on
average. If inspecting k-mer frequencies or using small overlaps, you
need to be aware of the effect, however.
Double-encoding, BWA and MAQ
----------------------------
The read mappers MAQ and BWA (and possibly others) need their colorspace
input reads to be in a so-called "double encoding". This simply means
that they cannot deal with the characters 0, 1, 2, 3 in the reads, but
require that the letters A, C, G, T be used for colors. For example, the
colorspace sequence ``0011321`` would be ``AACCTGC`` in double-encoded
form. This is not the same as conversion to basespace! The read is still
in colorspace, only letters are used instead of digits. If that sounds
confusing, that is because it is.
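Double-encoding is a plain character substitution, which the following sketch illustrates. The function name ``double_encode`` is hypothetical, and the mapping (0 to A, 1 to C, 2 to G, 3 to T) is inferred from the example in the text.

```python
_DOUBLE_ENCODING = str.maketrans("0123", "ACGT")


def double_encode(colors):
    """Replace color digits with letters; the read stays in colorspace."""
    return colors.translate(_DOUBLE_ENCODING)


encoded = double_encode("0011321")
```

The result uses nucleotide letters, but it still represents colors and must not be interpreted as a basespace sequence.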
Note that MAQ is unmaintained and should not be used in new projects.
BWA’s colorspace support was dropped in versions more recent than 0.5.9,
but that version works well.
When you want to trim reads that will be mapped with BWA or MAQ, you can
use the ``--bwa`` option, which enables colorspace mode (``-c``),
double-encoding (``-d``), primer trimming (``-t``), all of which are
required for BWA, in addition to some other useful options.
The ``--maq`` option is an alias for ``--bwa``.
Colorspace examples
-------------------
To cut an adapter from SOLiD data given in ``solid.csfasta`` and
``solid.qual``, to produce MAQ- and BWA-compatible output, allow the
default of 10% errors and write the resulting FASTQ file to
output.fastq::
cutadapt --bwa -a CGCCTTGGCCGTACAGCAG solid.csfasta solid.qual > output.fastq
Instead of redirecting standard output with ``>``, the ``-o`` option can
be used. This also shows that you can give the adapter in colorspace and
how to use a different error rate::
cutadapt --bwa -e 0.15 -a 330201030313112312 -o output.fastq solid.csfasta solid.qual
This does the same as above, but produces BFAST-compatible output,
strips the \_F3 suffix from read names and adds the prefix "abc:" to
them::
cutadapt -c -e 0.15 -a 330201030313112312 -x abc: --strip-f3 solid.csfasta solid.qual > output.fastq
Bowtie
------
Quality values of colorspace reads are sometimes negative. Bowtie gets
confused and prints this message::
Encountered a space parsing the quality string for read xyz
BWA also has a problem with such data. Cutadapt therefore converts
negative quality values to zero in colorspace data. Use the option
``--no-zero-cap`` to turn this off.
.. _sra-fastq:
.. _colorspace:
Sequence Read Archive
---------------------
The Sequence Read Archive provides files in a special "SRA" file format. When
the ``fastq-dump`` program from the sra-toolkit package is used to convert
these ``.sra`` files to FASTQ format, colorspace reads will get an extra
quality value in the beginning of each read. You may get an error like this::
cutadapt: error: In read named 'xyz': length of colorspace quality
sequence (36) and length of read (35) do not match (primer is: 'T')
Colorspace
==========
To make cutadapt ignore the extra quality base, add ``--format=sra-fastq`` to
your command-line, as in this example::
Support for processing data in so-called “colorspace”, as produced by
the ABI SOLiD sequencer, was removed from Cutadapt versions newer than
1.18.
cutadapt -c --format=sra-fastq -a CGCCTTGGCCG sra.fastq > trimmed.fastq
To process colorspace data, please use Cutadapt 1.18 or earlier.
That version also knows how to process ``.csfasta``/``.qual`` file
pairs.
When you use ``--format=sra-fastq``, the spurious quality value will be removed
from all reads in the file.
`See also the colorspace section in the documentation for
Cutadapt 1.18 <https://cutadapt.readthedocs.io/en/v1.18/colorspace.html>`_.
......@@ -47,22 +47,20 @@ master_doc = 'index'
# General information about the project.
project = u'cutadapt'
copyright = u'2010-2018, Marcel Martin'
copyright = u'2010-2019, Marcel Martin'
# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
from cutadapt import __version__
# The short X.Y version.
version = __version__
from pkg_resources import get_distribution
release = get_distribution('cutadapt').version
# Read The Docs modifies the conf.py script and we therefore get
# version numbers like 0.7+0.g27d0d31.dirty from versioneer.
if version.endswith('.dirty') and os.environ.get('READTHEDOCS') == 'True':
version, _, rest = version.partition('+')
if not rest.startswith('0.'):
version = version + '+' + rest[:-6]
# version numbers like 0.12+0.g27d0d31
if os.environ.get('READTHEDOCS') == 'True':
version = '.'.join(release.split('.')[:2])
else:
version = release
# The full version, including alpha/beta/rc tags.
release = version
......
......@@ -2,9 +2,8 @@ Developing
==========
The `Cutadapt source code is on GitHub <https://github.com/marcelm/cutadapt/>`_.
Cutadapt is written in Python with some extension modules that are written
in Cython. Cutadapt uses a single code base that is compatible with both
Python 2 and 3. Python 2.7 is the minimum supported Python version.
Cutadapt is written in Python 3 with some extension modules that are written
in Cython. Support for Python 2 has been dropped.
Development installation
......@@ -15,8 +14,8 @@ using a virtualenv. This sequence of commands should work::
git clone https://github.com/marcelm/cutadapt.git # or clone your own fork
cd cutadapt
virtualenv -p python3 venv # or omit the "-p python3" for Python 2
venv/bin/pip3 install Cython pytest nose tox # pip3 becomes just pip for Python 2
python3 -m venv venv
venv/bin/pip3 install Cython pytest nose tox
venv/bin/pip3 install -e .
Then you can run Cutadapt like this (or activate the virtualenv and omit the
......@@ -40,7 +39,7 @@ Development installation (without virtualenv)
Alternatively, if you do not want to use virtualenv, running the following may
work from within the cloned repository::
python3 setup.py build_ext -i # omit the "3" for Python 2
python3 setup.py build_ext -i
pytest
This requires Cython and pytest to be installed. Avoid this method and use a
......
......@@ -6,24 +6,17 @@ things that could be improved in the source code, and of possible algorithmic
improvements.
- show average error rate
- In colorspace and probably also for Illumina data, gapped alignment
is not necessary
- ``--progress``
- run pylint, pychecker
- length histogram
- check whether input is FASTQ although -f fasta is given
- search for adapters in the order in which they are given on the
command line
- more tests for the alignment algorithm
- ``--detect`` prints out best guess which of the given adapters is the correct one
- alignment algorithm: make a 'banded' version
- it seems the str.find optimization isn't very helpful. In any case, it should be
moved into the Aligner class.
- allow to remove not the adapter itself, but the sequence before or after it
- instead of trimming, convert adapter to lowercase
- warn when given adapter sequence contains non-IUPAC characters
- extensible file type detection
- the --times setting should be an attribute of Adapter
Backwards-incompatible changes
......@@ -31,7 +24,6 @@ Backwards-incompatible changes
- Drop ``--rest-file`` support
- Possibly drop wildcard-file support, extend info-file instead
- Drop "legacy mode"
- For non-anchored 5' adapters, find rightmost match
......@@ -87,16 +79,6 @@ Model somehow all the flags that exist for semiglobal alignment. For start of th
Not degraded and no bases before allowed = anchored.
Degraded and bases before allowed = regular 5'
By default, the 5' end should be anchored, the 3' end not.
* ``-a ADAPTER...`` → not degraded, no bases before allowed
* ``-a N*ADAPTER...`` → not degraded, bases before allowed
* ``-a ADAPTER^...`` → degraded, no bases before allowed
* ``-a N*ADAPTER^...`` → degraded, bases before allowed
* ``-a ...ADAPTER`` → degraded, bases after allowed
* ``-a ...ADAPTER$`` → not degraded, no bases after allowed
Paired-end trimming
-------------------
......@@ -108,4 +90,4 @@ Available/used letters for command-line options
* Remaining characters: All uppercase letters except A, B, G, M, N, O, U
* Lowercase letters: i, j, k, s, w
* Planned/reserved: Q (paired-end quality trimming), j (multithreading)
* Planned/reserved: Q (paired-end quality trimming)
......@@ -9,14 +9,14 @@ successfully under macOS and Windows.
Quick installation
------------------
The easiest way to install cutadapt is to use ``pip`` on the command line::
The easiest way to install Cutadapt is to use ``pip3`` on the command line::
pip install --user --upgrade cutadapt
pip3 install --user --upgrade cutadapt
This will download the software from `PyPI (the Python packaging
index) <https://pypi.python.org/pypi/cutadapt/>`_, and
install the cutadapt binary into ``$HOME/.local/bin``. If an old version of
cutadapt exists on your system, the ``--upgrade`` parameter is required in order
Cutadapt exists on your system, the ``--upgrade`` parameter is required in order
to install a newer version. You can then run the program like this::
~/.local/bin/cutadapt --help
......@@ -28,14 +28,14 @@ If you want to avoid typing the full path, add the directory
Installation with conda
-----------------------
Alternatively, cutadapt is available as a conda package from the
Alternatively, Cutadapt is available as a conda package from the
`bioconda channel <https://bioconda.github.io/>`_. If you do not have conda,
`install miniconda <http://conda.pydata.org/miniconda.html>`_ first.
Then install cutadapt like this::
Then install Cutadapt like this::
conda install -c bioconda cutadapt
If neither `pip` nor `conda` installation works, keep reading.
If neither ``pip`` nor ``conda`` installation works, keep reading.
.. _dependencies:
......@@ -45,37 +45,44 @@ Dependencies
Cutadapt installation requires this software to be installed:
* Python 2.7 or at least Python 3.4
* Possibly a C compiler. For Linux, cutadapt packages are provided as
* Python 3.4 or newer
* Possibly a C compiler. For Linux, Cutadapt packages are provided as
so-called “wheels” (``.whl`` files) which come pre-compiled.
Under Ubuntu, you may need to install the packages ``build-essential`` and
``python-dev`` (or ``python3-dev``) to get a C compiler.
On Windows, you need `Microsoft Visual C++ Compiler for
Python 2.7 <https://www.microsoft.com/en-us/download/details.aspx?id=44266>`_.
``python3-dev`` to get a C compiler.
If you get an error message::
error: command 'gcc' failed with exit status 1
Then check the entire error message. If it says something about a missing
``Python.h`` file, then the problem are missing Python development
packages (``python-dev``/``python3-dev`` in Ubuntu).
``Python.h`` file, then the problem is that you are missing Python development
packages (``python3-dev`` in Ubuntu).
System-wide installation (root required)
----------------------------------------
If you have root access, then you can install cutadapt system-wide by running::
If you have root access, then you can install Cutadapt system-wide by running::
sudo pip install cutadapt
sudo python3 -m pip install cutadapt
This installs cutadapt into `/usr/local/bin`.
This installs cutadapt into ``/usr/local/bin``.
If you want to upgrade from an older version, use this command instead::
sudo pip install --upgrade cutadapt
sudo python3 -m pip install --upgrade cutadapt
If the above does not work for you, then you can try to install Cutadapt
into a virtual environment. This may lead to fewer conflicts with
system-installed packages::
sudo python3 -m venv /usr/local/cutadapt
sudo /usr/local/cutadapt/bin/pip install cutadapt
cd /usr/local/bin/
sudo ln -s ../cutadapt/bin/cutadapt
Uninstalling
......@@ -83,7 +90,7 @@ Uninstalling
Type ::
pip uninstall cutadapt
pip3 uninstall cutadapt
and confirm with ``y`` to remove the package. Under some circumstances, multiple
versions may be installed at the same time. Repeat the above command until you
......@@ -93,7 +100,7 @@ get an error message in order to make sure that all versions are removed.
Shared installation (on a cluster)
----------------------------------
If you have a larger installation and want to provide cutadapt as a module
If you have a larger installation and want to provide Cutadapt as a module
that can be loaded and unloaded (with the Lmod system, for example), we
recommend that you create a virtual environment and 'pip install' cutadapt into
it. These instructions work on our SLURM cluster that uses the Lmod system
......@@ -105,7 +112,7 @@ it. These instructions work on our SLURM cluster that uses the Lmod system
The ``install-option`` part is important. It ensures that a second, separate
``bin/`` directory is created (``/software/cutadapt-1.9.1/bin/``) that *only*
contains the ``cutadapt`` script and nothing else. To make cutadapt available to
contains the ``cutadapt`` script and nothing else. To make Cutadapt available to
the users, that directory (``$BASE/bin``) needs to be added to the ``$PATH``.
Make sure you *do not* add the ``bin/`` directory within the ``venv`` directory
......@@ -128,11 +135,11 @@ Activation merely adds the ``bin/`` directory to the ``$PATH``, so the
Installing the development version
----------------------------------
We recommend that you install cutadapt into a so-called virtual environment if
We recommend that you install Cutadapt into a so-called virtual environment if
you decide to use the development version. The virtual environment is a single
directory that contains everything needed to run the software. Nothing else on
your system is changed, so you can simply uninstall this particular version of
cutadapt by removing the directory with the virtual environment.
Cutadapt by removing the directory with the virtual environment.
The following instructions work on Linux using Python 3. Make sure you have
installed the :ref:`dependencies <dependencies>` (``python3-dev`` and
......@@ -143,13 +150,13 @@ environment and what you want to call it. Let us assume you chose the path
``~/cutadapt-venv``. Then use these commands for the installation::
python3 -m venv ~/cutadapt-venv
~/cutadapt-venv/bin/pip install Cython
~/cutadapt-venv/bin/pip install https://github.com/marcelm/cutadapt/archive/master.zip
~/cutadapt-venv/bin/python3 -m pip install --upgrade pip
~/cutadapt-venv/bin/pip install git+https://github.com/marcelm/cutadapt.git#egg=cutadapt
To run cutadapt and see the version number, type ::
To run Cutadapt and see the version number, type ::
~/cutadapt-venv/bin/cutadapt --version
The reported version number will be something like ``1.14+65.g5610275``. This
means that you are now running a cutadapt version that contains 65 additional
changes (*commits*) since version 1.14.
The reported version number will be something like ``2.2.dev5+gf564208``. This
means that you are now running the version of Cutadapt that will become 2.2, and that it contains
5 changes (*commits*) since the previous release (2.1 in this case).
=============
Recipes (FAQ)
=============
===============
Recipes and FAQ
===============
This section gives answers to frequently asked questions. It shows you how to
get cutadapt to do what you want it to do!
get Cutadapt to do what you want it to do!
Remove more than one adapter
......@@ -20,13 +20,13 @@ version ``-n 2``). For example::
cutadapt -g ^TTAAGGCC -g ^AAGCTTA -a TACGGACT -n 2 -o output.fastq input.fastq
This instructs cutadapt to run two rounds of adapter finding and removal. That
This instructs Cutadapt to run two rounds of adapter finding and removal. That
means that, after the first round and only when an adapter was actually found,
another round is performed. In both rounds, all given adapters are searched and
removed. The problem is that it could happen that one adapter is found twice (so
the 3' adapter, for example, could be removed twice).
The second option is to not use the ``-n`` option, but to run cutadapt twice,
The second option is to not use the ``-n`` option, but to run Cutadapt twice,
first removing one adapter and then the other. It is easiest if you use a pipe
as in this example::
......@@ -38,7 +38,7 @@ Trim poly-A tails
If you want to trim a poly-A tail from the 3' end of your reads, use the 3'
adapter type (``-a``) with an adapter sequence of many repeated ``A``
nucleotides. Starting with version 1.8 of cutadapt, you can use the
nucleotides. Starting with version 1.8 of Cutadapt, you can use the
following notation to specify a sequence that consists of 100 ``A``::
cutadapt -a "A{100}" -o output.fastq input.fastq
......@@ -54,7 +54,7 @@ will be trimmed to::
If for some reason you would like to use a shorter sequence of ``A``, you can
do so: The matching algorithm always picks the leftmost match that it can find,
so cutadapt will do the right thing even when the tail has more ``A`` than you
so Cutadapt will do the right thing even when the tail has more ``A`` than you
used in the adapter sequence. However, sequencing errors may result in shorter
matches than desired. For example, using ``-a "A{10}"``, the read above (where
the ``AAAT`` is followed by eleven ``A``) would be trimmed to::
......@@ -114,7 +114,7 @@ the linked adapter option that needs to be used is therefore ::
where ``FWDPRIMER`` needs to be replaced with the sequence of your
forward primer and ``RCREVPRIMER`` with the reverse complement of
the reverse primer. The three dots ``...`` need to be entered
as they are -- they tell cutadapt that this is a linked adapter
as they are -- they tell Cutadapt that this is a linked adapter
with a 5' and a 3' part.
Sequencing of R2 starts before the 3' sequencing primer and
......@@ -127,7 +127,7 @@ swapped and reverse-complemented::
The uppercase ``-A`` specifies that this option is
meant to work on R2. Similar to above, ``REVPRIMER`` is
the sequence of the reverse primer and ``RCFWDPRIMER`` is the
reverse-complement of the forward primer. Note that cutadapt
reverse-complement of the forward primer. Note that Cutadapt
does not reverse-complement any sequences of its own; you
will have to do that yourself.
......@@ -158,7 +158,7 @@ you know must be there::
Piping paired-end data
----------------------
Sometimes it is necessary to run cutadapt twice on your data. For example, when
Sometimes it is necessary to run Cutadapt twice on your data. For example, when
you want to change the order in which read modification or filtering options are
applied. To simplify this, you can use Unix pipes (``|``), but this is more
difficult with paired-end data since then input and output consists of two files
......@@ -171,13 +171,113 @@ principle::
cutadapt [options] --interleaved in.1.fastq.gz in.2.fastq.gz | \
cutadapt [options] --interleaved -o out.1.fastq.gz -p out.2.fastq.gz -
Note the ``-`` character in the second invocation to cutadapt.
Note the ``-`` character in the second invocation to Cutadapt.
Support for concatenated compressed files
-----------------------------------------
Cutadapt supports concatenated gzip and bzip2 input files.
Paired-end read name check
--------------------------
When reading paired-end files, Cutadapt checks whether the read names match.
Only the part of the read name before the first space is considered. If the
read name ends with ``1`` or ``2``, then that is also ignored. For example,
two FASTQ headers that would be considered to denote properly paired reads are::
@my_read/1 a comment
and::
@my_read/2 another comment
This is an example for *improperly paired* read names::
@my_read/1;1
and::
@my_read/2;1
Since the ``1`` and ``2`` are ignored only if they occur at the end of the read
name, and since the ``;1`` is considered to be part of the read name, these
reads will not be considered to be properly paired.
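The rule described in this section can be sketched as follows. This follows the wording of the text only (cut at the first space, then ignore a trailing ``1`` or ``2``). The function names are hypothetical, and the actual check in Cutadapt may differ in details.

```python
def comparable_name(header):
    """Reduce a FASTQ header to the part that is compared between R1 and R2."""
    name = header.split(" ", 1)[0]  # ignore everything after the first space
    if name.endswith(("1", "2")):
        name = name[:-1]  # ignore a trailing 1 or 2
    return name


def names_match(header1, header2):
    """Return True if two headers denote a properly paired read."""
    return comparable_name(header1) == comparable_name(header2)


proper = names_match("@my_read/1 a comment", "@my_read/2 another comment")
improper = names_match("@my_read/1;1", "@my_read/2;1")
```

In the second pair, only the final ``1`` is dropped, leaving ``@my_read/1;`` and ``@my_read/2;``, which differ.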
Rescuing single reads from paired-end reads that were filtered
--------------------------------------------------------------
When trimming and filtering paired-end reads, Cutadapt always discards entire read pairs. If you
want to keep one of the reads, you need to write the filtered read pairs to an output file and
postprocess it.
For example, assume you are using ``-m 30`` to discard reads that are too short. Cutadapt discards all
read pairs in which just one of the reads is too short (but see the ``--pair-filter`` option).
To recover those (individual) reads that are long enough, you can first use the
``--too-short-(paired)-output`` options to write the filtered pairs to a file, and then postprocess
those files to keep only the reads that are long enough::
cutadapt -m 30 -q 20 -o out.1.fastq.gz -p out.2.fastq.gz --too-short-output=tooshort.1.fastq.gz --too-short-paired-output=tooshort.2.fastq.gz in.1.fastq.gz in.2.fastq.gz
cutadapt -m 30 -o rescued.a.fastq.gz tooshort.1.fastq.gz
cutadapt -m 30 -o rescued.b.fastq.gz tooshort.2.fastq.gz
The two output files ``rescued.a.fastq.gz`` and ``rescued.b.fastq.gz`` contain those individual
reads that are long enough. Note that the file names do not end in ``.1.fastq.gz`` and
``.2.fastq.gz`` to make it very clear that these files no longer contain synchronized paired-end
reads.
.. _bisulfite:
Bisulfite sequencing (RRBS)
---------------------------
When trimming reads that come from a library prepared with the RRBS (reduced
representation bisulfite sequencing) protocol, the last two 3' bases must be
removed in addition to the adapter itself. This can be achieved by using not
the adapter sequence itself, but by adding two wildcard characters to its
beginning. If the adapter sequence is ``ADAPTER``, the command for trimming
should be::
cutadapt -a NNADAPTER -o output.fastq input.fastq
Details can be found in `Babraham bioinformatics' "Brief guide to
RRBS" <http://www.bioinformatics.babraham.ac.uk/projects/bismark/RRBS_Guide.pdf>`_.
A summary follows.
During RRBS library preparation, DNA is digested with the restriction enzyme
MspI, generating a two-base overhang on the 5' end (``CG``). MspI recognizes
the sequence ``CCGG`` and cuts
between ``C`` and ``CGG``. A double-stranded DNA fragment is cut in this way::
5'-NNNC|CGGNNN-3'
3'-NNNGGC|CNNN-5'
The fragment between two MspI restriction sites looks like this::
5'-CGGNNN...NNNC-3'
3'-CNNN...NNNGGC-5'
Before sequencing (or PCR) adapters can be ligated, the missing base positions
must be filled in with GTP and CTP::
5'-ADAPTER-CGGNNN...NNNCcg-ADAPTER-3'
3'-ADAPTER-gcCNNN...NNNGGC-ADAPTER-5'
The filled-in bases, marked in lowercase above, do not contain any original
methylation information, and must therefore not be used for methylation calling.
By prefixing the adapter sequence with ``NN``, the bases will be automatically
stripped during adapter trimming.
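
To make the effect of the ``NN`` prefix concrete, here is a minimal Python sketch. It assumes an exact, full-length adapter match, whereas Cutadapt matches adapters in an error-tolerant way and also handles partial matches, so this is not how Cutadapt works internally. The function name ``trim_rrbs`` and the read and adapter sequences are made up for illustration.

```python
def trim_rrbs(read, adapter):
    """Remove the adapter and the two filled-in bases in front of it.

    Simplification: only an exact, complete occurrence of the adapter is
    recognized. Reads without such an occurrence are returned unchanged.
    """
    pos = read.find(adapter)
    if pos == -1:
        return read
    # Cutting two bases before the adapter mimics matching "NN" + adapter.
    return read[:max(pos - 2, 0)]


# Made-up fragment ending in C, followed by the filled-in "CG" and an
# example adapter sequence.
adapter = "AGATCGGAAGAGC"
read = "CGGTTTAAAC" + "CG" + adapter
trimmed = trim_rrbs(read, adapter)
```

Here ``trimmed`` is ``CGGTTTAAAC``: both the adapter and the two filled-in bases are gone.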
Other things (unfinished)
-------------------------
* How to detect adapters
* Use cutadapt for quality-trimming only
* Use Cutadapt for quality-trimming only
* Use it for minimum/maximum length filtering
* Use it for conversion to FASTQ
[build-system]
requires = ["setuptools", "wheel", "setuptools_scm", "cython"]
[versioneer]
vcs = git
style = pep440
versionfile_source = src/cutadapt/_version.py
versionfile_build = cutadapt/_version.py
tag_prefix = v
parentdir_prefix = cutadapt-
[egg_info]
tag_build =
tag_date = 0
"""
Build cutadapt.
Build Cutadapt.
"""
import sys
import os.path
......@@ -8,13 +8,11 @@ from setuptools import setup, Extension, find_packages
from distutils.version import LooseVersion
from distutils.command.sdist import sdist as _sdist
from distutils.command.build_ext import build_ext as _build_ext
import versioneer
MIN_CYTHON_VERSION = '0.24'
MIN_CYTHON_VERSION = '0.28'
vi = sys.version_info
if (vi[0] == 2 and vi[1] < 7) or (vi[0] == 3 and vi[1] < 4):
sys.stdout.write('Minimum supported Python versions are 2.7 and 3.4.\n')
if sys.version_info[:2] < (3, 4):
sys.stdout.write('You need at least Python 3.4\n')
sys.exit(1)
......@@ -56,16 +54,11 @@ def check_cython_version():
extensions = [
Extension('cutadapt._align', sources=['src/cutadapt/_align.pyx']),
Extension('cutadapt._qualtrim', sources=['src/cutadapt/_qualtrim.pyx']),
Extension('cutadapt._seqio', sources=['src/cutadapt/_seqio.pyx']),
Extension('cutadapt.qualtrim', sources=['src/cutadapt/qualtrim.pyx']),
]
cmdclass = versioneer.get_cmdclass()
versioneer_build_ext = cmdclass.get('build_ext', _build_ext)
versioneer_sdist = cmdclass.get('sdist', _sdist)
class build_ext(versioneer_build_ext):
class BuildExt(_build_ext):
def run(self):
# If we encounter a PKG-INFO file, then this is likely a .tar.gz/.zip
# file retrieved from PyPI that already includes the pre-cythonized
......@@ -78,20 +71,16 @@ class build_ext(versioneer_build_ext):
check_cython_version()
from Cython.Build import cythonize
self.extensions = cythonize(self.extensions)
versioneer_build_ext.run(self)
super().run()
class sdist(versioneer_sdist):
class SDist(_sdist):
def run(self):
# Make sure the compiled Cython files in the distribution are up-to-date
from Cython.Build import cythonize
check_cython_version()
cythonize(extensions)
versioneer_sdist.run(self)
cmdclass['build_ext'] = build_ext
cmdclass['sdist'] = sdist
super().run()
encoding_arg = {'encoding': 'utf-8'} if sys.version > '3' else dict()
......@@ -100,22 +89,24 @@ with open('README.rst', **encoding_arg) as f:
setup(
name='cutadapt',
version=versioneer.get_version(),
setup_requires=['setuptools_scm'], # Support pip versions that don't know about pyproject.toml
use_scm_version={'write_to': 'src/cutadapt/_version.py'},
author='Marcel Martin',
author_email='marcel.martin@scilifelab.se',
url='https://cutadapt.readthedocs.io/',
description='trim adapters from high-throughput sequencing reads',
long_description=long_description,
license='MIT',
cmdclass=cmdclass,
cmdclass={'build_ext': BuildExt, 'sdist': SDist},
ext_modules=extensions,
package_dir={'': 'src'},
packages=find_packages('src'),
entry_points={'console_scripts': ['cutadapt = cutadapt.__main__:main']},
install_requires=['xopen>=0.3.2'],
install_requires=['dnaio>=0.3', 'xopen>=0.5.0'],
extras_require={
'dev': ['Cython', 'pytest', 'pytest-timeout', 'nose', 'sphinx', 'sphinx_issues'],
'dev': ['Cython', 'pytest', 'pytest-timeout', 'sphinx', 'sphinx_issues'],
},
python_requires='>=3.4',
classifiers=[
"Development Status :: 5 - Production/Stable",
"Environment :: Console",
......@@ -123,8 +114,7 @@ setup(
"License :: OSI Approved :: MIT License",
"Natural Language :: English",
"Programming Language :: Cython",
"Programming Language :: Python :: 2.7",
"Programming Language :: Python :: 3",
"Topic :: Scientific/Engineering :: Bio-Informatics"
]
],
)
Metadata-Version: 1.1
Name: cutadapt
Version: 1.18
Summary: trim adapters from high-throughput sequencing reads
Home-page: https://cutadapt.readthedocs.io/
Author: Marcel Martin
Author-email: marcel.martin@scilifelab.se
License: MIT
Description-Content-Type: UNKNOWN
Description: .. image:: https://travis-ci.org/marcelm/cutadapt.svg?branch=master
:target: https://travis-ci.org/marcelm/cutadapt
.. image:: https://img.shields.io/pypi/v/cutadapt.svg?branch=master
:target: https://pypi.python.org/pypi/cutadapt
========
cutadapt
========
Cutadapt finds and removes adapter sequences, primers, poly-A tails and other
types of unwanted sequence from your high-throughput sequencing reads.
Cleaning your data in this way is often required: Reads from small-RNA
sequencing contain the 3’ sequencing adapter because the read is longer than
the molecule that is sequenced. Amplicon reads start with a primer sequence.
Poly-A tails are useful for pulling out RNA from your sample, but often you
don’t want them to be in your reads.
Cutadapt helps with these trimming tasks by finding the adapter or primer
sequences in an error-tolerant way. It can also modify and filter reads in
various ways. Adapter sequences can contain IUPAC wildcard characters. Also,
paired-end reads and even colorspace data is supported. If you want, you can
also just demultiplex your input data, without removing adapter sequences at all.
Cutadapt comes with an extensive suite of automated tests and is available under
the terms of the MIT license.
If you use cutadapt, please cite
`DOI:10.14806/ej.17.1.200 <http://dx.doi.org/10.14806/ej.17.1.200>`_ .
Links
-----
* `Documentation <https://cutadapt.readthedocs.io/>`_
* `Source code <https://github.com/marcelm/cutadapt/>`_
* `Report an issue <https://github.com/marcelm/cutadapt/issues>`_
* `Project page on PyPI (Python package index) <https://pypi.python.org/pypi/cutadapt/>`_
* `Follow @marcelm_ on Twitter <https://twitter.com/marcelm_>`_
* `Wrapper for the Galaxy platform <https://bitbucket.org/lance_parsons/cutadapt_galaxy_wrapper>`_
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Cython
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
CHANGES.rst
CITATION
LICENSE
MANIFEST.in
README.rst
setup.cfg
setup.py
versioneer.py
doc/Makefile
doc/algorithms.rst
doc/changes.rst
doc/colorspace.rst
doc/conf.py
doc/develop.rst
doc/guide.rst
doc/ideas.rst
doc/index.rst
doc/installation.rst
doc/recipes.rst
src/cutadapt/__init__.py
src/cutadapt/__main__.py
src/cutadapt/_align.c
src/cutadapt/_align.pyx
src/cutadapt/_qualtrim.c
src/cutadapt/_qualtrim.pyx
src/cutadapt/_seqio.c
src/cutadapt/_seqio.pyx
src/cutadapt/_version.py
src/cutadapt/adapters.py
src/cutadapt/align.py
src/cutadapt/colorspace.py
src/cutadapt/compat.py
src/cutadapt/filters.py
src/cutadapt/modifiers.py
src/cutadapt/pipeline.py
src/cutadapt/qualtrim.py
src/cutadapt/report.py
src/cutadapt/seqio.py
src/cutadapt/utils.py
src/cutadapt.egg-info/PKG-INFO
src/cutadapt.egg-info/SOURCES.txt
src/cutadapt.egg-info/dependency_links.txt
src/cutadapt.egg-info/entry_points.txt
src/cutadapt.egg-info/requires.txt
src/cutadapt.egg-info/top_level.txt
tests/test_adapters.py
tests/test_align.py
tests/test_colorspace.py
tests/test_commandline.py
tests/test_filters.py
tests/test_modifiers.py
tests/test_paired.py
tests/test_qualtrim.py
tests/test_seqio.py
tests/test_trim.py
tests/utils.py
tests/cut/454.fa
tests/cut/SRR2040271_1.fastq
tests/cut/adapterx.fasta
tests/cut/anchored-back.fasta
tests/cut/anchored.fasta
tests/cut/anchored_no_indels.fasta
tests/cut/anchored_no_indels_wildcard.fasta
tests/cut/anywhere_repeat.fastq
tests/cut/casava.fastq
tests/cut/demultiplexed.first.1.fastq
tests/cut/demultiplexed.first.2.fastq
tests/cut/demultiplexed.second.1.fastq
tests/cut/demultiplexed.second.2.fastq
tests/cut/demultiplexed.unknown.1.fastq
tests/cut/demultiplexed.unknown.2.fastq
tests/cut/discard-untrimmed.fastq
tests/cut/discard.fastq
tests/cut/dos.fastq
tests/cut/empty.fastq
tests/cut/example.fa
tests/cut/examplefront.fa
tests/cut/illumina.fastq
tests/cut/illumina.info.txt
tests/cut/illumina5.fastq
tests/cut/illumina5.info.txt
tests/cut/illumina64.fastq
tests/cut/interleaved.fastq
tests/cut/issue46.fasta
tests/cut/linked-anchored.fasta
tests/cut/linked-discard-g.fasta
tests/cut/linked-discard.fasta
tests/cut/linked-not-anchored.fasta
tests/cut/linked.fasta
tests/cut/lowercase.fastq
tests/cut/lowqual.fastq
tests/cut/maxlen.fa
tests/cut/maxn0.2.fasta
tests/cut/maxn0.4.fasta
tests/cut/maxn0.fasta
tests/cut/maxn1.fasta
tests/cut/maxn2.fasta
tests/cut/minlen.fa
tests/cut/minlen.noprimer.fa
tests/cut/nextseq.fastq
tests/cut/no-trim.fastq
tests/cut/no_indels.fasta
tests/cut/overlapa.fa
tests/cut/overlapb.fa
tests/cut/paired-filterboth.1.fastq
tests/cut/paired-filterboth.2.fastq
tests/cut/paired-filterfirst.1.fastq
tests/cut/paired-filterfirst.2.fastq
tests/cut/paired-m27.1.fastq
tests/cut/paired-m27.2.fastq
tests/cut/paired-onlyA.1.fastq
tests/cut/paired-onlyA.2.fastq
tests/cut/paired-separate.1.fastq
tests/cut/paired-separate.2.fastq
tests/cut/paired-too-short.1.fastq
tests/cut/paired-too-short.2.fastq
tests/cut/paired-trimmed.1.fastq
tests/cut/paired-trimmed.2.fastq
tests/cut/paired-untrimmed.1.fastq
tests/cut/paired-untrimmed.2.fastq
tests/cut/paired.1.fastq
tests/cut/paired.2.fastq
tests/cut/paired.m14.1.fastq
tests/cut/paired.m14.2.fastq
tests/cut/pairedq.1.fastq
tests/cut/pairedq.2.fastq
tests/cut/pairedu.1.fastq
tests/cut/pairedu.2.fastq
tests/cut/plus.fastq
tests/cut/polya.fasta
tests/cut/rest.fa
tests/cut/restfront.fa
tests/cut/s_1_sequence.txt
tests/cut/shortened-negative.fastq
tests/cut/shortened.fastq
tests/cut/small-no-trim.fasta
tests/cut/small.fasta
tests/cut/small.fastq
tests/cut/small.trimmed.fastq
tests/cut/small.untrimmed.fastq
tests/cut/solid-no-zerocap.fastq
tests/cut/solid.fasta
tests/cut/solid.fastq
tests/cut/solid5p-anchored.fasta
tests/cut/solid5p-anchored.fastq
tests/cut/solid5p-anchored.notrim.fasta
tests/cut/solid5p-anchored.notrim.fastq
tests/cut/solid5p.fasta
tests/cut/solid5p.fastq
tests/cut/solidbfast.fastq
tests/cut/solidmaq.fastq
tests/cut/solidqual.fastq
tests/cut/sra.fastq
tests/cut/stripped.fasta
tests/cut/suffix.fastq
tests/cut/trimN3.fasta
tests/cut/trimN5.fasta
tests/cut/twoadapters.fasta
tests/cut/twoadapters.first.fasta
tests/cut/twoadapters.second.fasta
tests/cut/twoadapters.unknown.fasta
tests/cut/unconditional-back.fastq
tests/cut/unconditional-both.fastq
tests/cut/unconditional-front.fastq
tests/cut/wildcard.fa
tests/cut/wildcardN.fa
tests/cut/wildcard_adapter.fa
tests/cut/wildcard_adapter_anywhere.fa
tests/cut/xadapter.fasta
tests/data/454.fa
tests/data/E3M.fasta
tests/data/E3M.qual
tests/data/SRR2040271_1.fastq
tests/data/adapter.fasta
tests/data/anchored-back.fasta
tests/data/anchored.fasta
tests/data/anchored_no_indels.fasta
tests/data/anywhere_repeat.fastq
tests/data/casava.fastq
tests/data/dos.fastq
tests/data/empty.fastq
tests/data/example.fa
tests/data/illumina.fastq.gz
tests/data/illumina5.fastq
tests/data/illumina64.fastq
tests/data/interleaved.fastq
tests/data/issue46.fasta
tests/data/lengths.fa
tests/data/linked.fasta
tests/data/lowqual.fastq
tests/data/maxn.fasta
tests/data/multiblock.fastq.bz2
tests/data/multiblock.fastq.gz
tests/data/nextseq.fastq
tests/data/no_indels.fasta
tests/data/overlapa.fa
tests/data/overlapb.fa
tests/data/paired.1.fastq
tests/data/paired.2.fastq
tests/data/plus.fastq
tests/data/polya.fasta
tests/data/prefix-adapter.fasta
tests/data/rest.fa
tests/data/rest.txt
tests/data/restfront.txt
tests/data/s_1_sequence.txt.gz
tests/data/simple.fasta
tests/data/simple.fastq
tests/data/small.fastq
tests/data/small.fastq.bz2
tests/data/small.fastq.gz
tests/data/small.fastq.xz
tests/data/small.myownextension
tests/data/solid.csfasta
tests/data/solid.fasta
tests/data/solid.fastq
tests/data/solid.qual
tests/data/solid5p.fasta
tests/data/solid5p.fastq
tests/data/sra.fastq
tests/data/suffix-adapter.fasta
tests/data/toolong.fa
tests/data/tooshort.fa
tests/data/tooshort.noprimer.fa
tests/data/trimN3.fasta
tests/data/trimN5.fasta
tests/data/twoadapters.fasta
tests/data/underscore_fastq.gz
tests/data/wildcard.fa
tests/data/wildcardN.fa
tests/data/wildcard_adapter.fa
tests/data/withplus.fastq
tests/data/xadapterx.fasta
\ No newline at end of file