# Changes
v1.1.22 (dev)
-------------
* Documentation fixes (#79)
* Fix for index error when running `detect` command (#80)
v1.1.21 (2018.11.24)
--------------------
* Added long options for paired adapter parameters.
......
......@@ -19,8 +19,7 @@ test:
$(TEST)
docs:
make -C docs api
make -C docs html
make -C doc html
readme:
pandoc --from=markdown --to=rst --output=README.rst README.md
......@@ -58,7 +57,7 @@ release:
$(TEST)
python setup.py sdist bdist_wheel
# release
python setup.py sdist upload
python setup.py sdist upload -r pypi
git push origin --tags
$(github_release)
$(docker)
......
[![Travis CI](https://travis-ci.org/jdidion/atropos.svg?branch=master)](https://travis-ci.org/jdidion/atropos])
[![Travis CI](https://travis-ci.org/jdidion/atropos.svg?branch=1.1)](https://travis-ci.org/jdidion/atropos)
[![PyPi](https://img.shields.io/pypi/v/atropos.svg)](https://pypi.python.org/pypi/atropos)
[![DOI](https://zenodo.org/badge/61393086.svg)](https://zenodo.org/badge/latestdoi/61393086)
......@@ -127,12 +127,15 @@ The citation for the original Cutadapt paper is:
* Provide option for RNA-seq data that will trim polyA sequence.
* Add formal config file support (#53)
* Automate crash reporting using [sentry](https://github.com/getsentry/raven-python).
* Look at [NGmerge](https://github.com/harvardinformatics/NGmerge) for improving read merging.
* Look at replacing pysam with [pybam](https://github.com/JohnLonginotto/pybam)
### 1.4
* Currently, InsertAligner requires a single 3' adapter for each end. Adapter trimming will later be generalized so that A) the InsertAligner can handle multiple matched pairs of adapters and/or B) multiple different aligners can be used for different adapters.
* Integrate with [AdapterBase](https://github.com/NCBI-Hackathons/OnlineAdapterDatabase) for improved matching of detected contaminants to known adapters, automated trimming of datasets with known adapters, and (opt-in) submission of adapter information for novel datasets.
* Migrate to [seqio](https://github.com/jdidion/seqio) for reading/writing sequence files.
* Also look at using [HTSeq](https://github.com/simon-anders/htseq) instead.
* General-purpose read filtering based on read ID ([cutadapt#107](https://github.com/marcelm/cutadapt/issues/107)).
* Currently, SAM/BAM input files must be name sorted; add an option to 1) pre-sort reads inline using samtools or sambamba, or 2) cache each read in memory until its mate is found.
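As a sketch of option (2), assuming pysam is the reader (a hypothetical helper for illustration, not actual Atropos code):

```python
import pysam

def iter_read_pairs(bam_path):
    """Yield (read1, read2) tuples from a BAM that is not name-sorted,
    caching each read in memory until its mate shows up (option 2 above).
    """
    cache = {}
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam:
            # Only primary alignments of paired reads can be paired up.
            if not read.is_paired or read.is_secondary or read.is_supplementary:
                continue
            mate = cache.pop(read.query_name, None)
            if mate is None:
                cache[read.query_name] = read
            else:
                yield (mate, read) if mate.is_read1 else (read, mate)
```

The trade-off versus pre-sorting with samtools/sambamba is memory: on a coordinate-sorted file the cache can grow to roughly half the reads in the worst case.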
......@@ -169,7 +172,9 @@ The citation for the original Cutadapt paper is:
### 2.0
* Simplify command-line options, perhaps by further splitting functionality into sub-commands, but also by choosing more intelligent default option values based on context.
* Consider adding additional report formats.
* Add fuzzy matching: [Java-Fuzzy-Search](https://github.com/Daniel-Liu-c0deb0t/Java-Fuzzy-Search)
  * See also [cutadapt#112](https://github.com/marcelm/cutadapt/issues/112)
* Performance enhancements using [Numba](http://numba.pydata.org/)
......
......@@ -23,8 +23,8 @@ def get_keywords():
# setup.py/versioneer.py will grep for the variable names, so they must
# each be defined on a line of their own. _version.py will just call
# get_keywords().
git_refnames = " (tag: 1.1.21)"
git_full = "1b7273b082551e80652a45df2736e5ffff8ce7f9"
git_refnames = " (tag: 1.1.22)"
git_full = "2b15c778f0ccf1d0fb753e4334fa6dc0048a9ee6"
keywords = {"refnames": git_refnames, "full": git_full}
return keywords
......
......@@ -212,6 +212,12 @@ MatchInfo = namedtuple("MatchInfo", (
# * http://www.combio.pl/alfree/tools/
# * http://www.combio.pl/alfree
# * http://bioinformatics.org.au/tools/decaf+py/
# 12. https://github.com/sdu-hpcl/BGSA
# 13. Train a NN to approximate the pairwise alignment distance
# https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty887/5140215?redirectedFrom=fulltext
# 14. Nucl2vec: https://github.com/prakharg24/Nucl2vec (local alignment only - could
# it be adapted to semi-global?)
# * Can re-implement in SpaCy? https://spacy.io/usage/vectors-similarity
class InsertAligner(object):
"""Implementation of an insert matching algorithm.
......
......@@ -8,7 +8,7 @@ from atropos.align import Aligner, SEMIGLOBAL
from atropos.commands.base import (
BaseCommandRunner, Pipeline, SingleEndPipelineMixin, PairedEndPipelineMixin)
from atropos.util import (
reverse_complement, sequence_complexity, enumerate_range, run_interruptible)
reverse_complement, sequence_complexity, run_interruptible)
# TODO: Test whether using rc=True in parse_known_contaminants is as fast
# and as accurate as testing both the forward and reverse complement
......@@ -88,6 +88,8 @@ class CommandRunner(BaseCommandRunner):
logging.getLogger().debug(
"Detecting contaminants using the kmer-based algorithm")
detector_class = KhmerDetector
else:
raise ValueError("Invalid value for 'detector': {}".format(detector))
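# The branches above select among KnownContaminantDetector,
# HeuristicDetector, and KhmerDetector (all defined below); any other
# value for 'detector' falls through to this error.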
summary_args = dict(
kmer_size=kmer_size, n_reads=n_reads,
......@@ -113,6 +115,7 @@ class CommandRunner(BaseCommandRunner):
self.summary.update(mode='serial', threads=1)
return run_interruptible(detector, self, raise_on_error=True)
class Match(object):
"""A contaminant match.
......@@ -224,6 +227,7 @@ class Match(object):
known_seqs=self.known_seqs)
return summary
class ContaminantMatcher(object):
"""Matches a known contaminant against other sequences.
......@@ -279,7 +283,8 @@ class ContaminantMatcher(object):
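# match_frac1 is the fraction of this contaminant's kmers found in the
# read, and match_frac2 the fraction of the read's kmers that match the
# contaminant (as computed from n_kmers and len(kmers) below).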
match_frac1 = n_matches / self.n_kmers
if len(kmers) > 0:
match_frac2 = n_matches / len(kmers)
return (match_frac1, match_frac2, compare_seq)
return match_frac1, match_frac2, compare_seq
def create_contaminant_matchers(contaminants, kmer_size):
"""Create :class:`ContaminantMatcher`s from sequences.
......@@ -296,6 +301,7 @@ def create_contaminant_matchers(contaminants, kmer_size):
for seq, names in contaminants.iter_sequences()
]
class Detector(SingleEndPipelineMixin, Pipeline):
"""Base class for contaminant detectors.
......@@ -304,7 +310,7 @@ class Detector(SingleEndPipelineMixin, Pipeline):
n_reads: Number of reads to sample.
overrep_cutoff: Degree of overrepresentation required for a kmer to be
considered as a contaminant.
known_contaminant: :class:`ContaminantMatcher`s to match against.
known_contaminants: :class:`ContaminantMatcher`s to match against.
past_end_bases: On Illumina, long runs of A (and sometimes other bases)
can signify that the sequencer has read past the end of a fragment
that is shorter than the read length + adapter length. Those
......@@ -392,39 +398,39 @@ class Detector(SingleEndPipelineMixin, Pipeline):
for match in matches:
match.estimate_abundance(self._read_sequences)
def _filter(match):
if match.count < self.min_report_freq:
def _filter(_match):
if _match.count < self.min_report_freq:
logging.getLogger().debug(
"Filtering {} because frequency {} < {}".format(
match.seq, match.count, self.min_report_freq))
_match.seq, _match.count, self.min_report_freq))
return False
if min_len and len(match) < min_len:
if min_len and len(_match) < min_len:
logging.getLogger().debug(
"Filtering {} because it's too short (< {} bp)".format(
match.seq, min_len))
_match.seq, min_len))
return False
if min_complexity and match.seq_complexity < min_complexity:
if min_complexity and _match.seq_complexity < min_complexity:
logging.getLogger().debug(
"Filtering {} because its complexity {} < {}".format(
match.seq, match.seq_complexity, min_complexity))
_match.seq, _match.seq_complexity, min_complexity))
return False
if self.include == 'known' and not match.is_known:
if self.include == 'known' and not _match.is_known:
logging.getLogger().debug(
"Filtering {} because it's not known".format(
match.seq))
_match.seq))
return False
elif self.include == 'unknown' and match.is_known:
elif self.include == 'unknown' and _match.is_known:
logging.getLogger().debug(
"Filtering {} because it's known".format(
match.seq))
_match.seq))
return False
if (
min_match_frac and match.is_known and
match.match_frac < min_match_frac):
min_match_frac and _match.is_known and
_match.match_frac < min_match_frac):
logging.getLogger().debug(
"Filtering {} because its match_frac {} < {}".format(
match.seq, match.match_frac, min_match_frac))
_match.seq, _match.match_frac, min_match_frac))
return False
return True
......@@ -451,6 +457,7 @@ class Detector(SingleEndPipelineMixin, Pipeline):
for match in self.matches(**kwargs)
],)
class PairedDetector(PairedEndPipelineMixin, Pipeline):
"""Detector for paired-end reads.
"""
......@@ -484,6 +491,7 @@ class PairedDetector(PairedEndPipelineMixin, Pipeline):
for match in self.read2_detector.matches(**kwargs)
])
class KnownContaminantDetector(Detector):
"""Test known contaminants against reads. This has linear complexity and is
more specific than the khmer matcher, but less specific than the heuristic
......@@ -540,6 +548,7 @@ class KnownContaminantDetector(Detector):
)
]
class HeuristicDetector(Detector):
"""Use a heuristic iterative algorithm to arrive at likely contaminants.
This is the most accurate algorithm overall, but it has quadratic complexity
......@@ -557,11 +566,11 @@ class HeuristicDetector(Detector):
return 0.1 * self.n_reads
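# i.e. a candidate must occur in at least 10% of the sampled reads to
# be reported (compare the min_report_freq check in _filter above).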
def _get_contaminants(self):
def _min_count(kmer_size):
def _min_count(_kmer_size):
return math.ceil(self.n_reads * max(
self.min_frequency,
(self._read_length - kmer_size + 1) * self.overrep_cutoff /
float(4**kmer_size)))
(self._read_length - _kmer_size + 1) * self.overrep_cutoff /
float(4 ** _kmer_size)))
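# Rationale: a read of length L contains (L - k + 1) kmers, and a given
# kmer occurs at each position with probability 1/4**k under a uniform
# null model, so its expected per-read frequency is (L - k + 1) / 4**k.
# Scaling by overrep_cutoff and n_reads (floored at min_frequency) gives
# the minimum count for a kmer to count as over-represented. E.g. with
# L=100, k=12, overrep_cutoff=100, n_reads=10000 and a small
# min_frequency, this is ceil(10000 * 89 * 100 / 4**12) = 6.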
kmer_size = self.kmer_size
kmers = defaultdict(lambda: [0, set()])
......@@ -613,8 +622,7 @@ class HeuristicDetector(Detector):
# Now merge overlapping sequences by length and frequency to eliminate
# redundancy in the set of candidate kmers.
results.sort(key=lambda i: len(i[0]) * math.log(i[1]), reverse=True)
cur = results[0]
results.sort(key=lambda r: len(r[0]) * math.log(r[1]), reverse=True)
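# Sorting by len(seq) * log(count), descending, ranks long, reasonably
# frequent candidates above short, very frequent ones, so the merge loop
# below absorbs redundant substrings into their longer parents first.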
merged = []
unmerged = []
while len(results) > 1:
......@@ -643,7 +651,7 @@ class HeuristicDetector(Detector):
# appear in full very infrequently
# Re-sort by frequency
results.sort(key=lambda i: i[1], reverse=True)
results.sort(key=lambda r: r[1], reverse=True)
# Keep anything that's within 50% of the top hit
# TODO: make this user-configurable?
min_count = int(results[0][1] * 0.5)
......@@ -660,28 +668,33 @@ class HeuristicDetector(Detector):
known = {}
unknown = []
def find_best_match(seq, best_matches, best_match_frac):
def find_best_match(_seq, _best_matches, _best_match_frac):
"""Find best contaminant matches to `seq`.
"""
seqrc = reverse_complement(seq)
for contam in contaminants:
match_frac1, match_frac2, compare_seq = contam.match(
seq, seqrc)
if match_frac1 < best_match_frac[0]:
seqrc = reverse_complement(_seq)
for _contam in contaminants:
match_frac1, match_frac2, compare_seq = _contam.match(
_seq, seqrc)
if match_frac1 < _best_match_frac[0]:
continue
if (
contam.seq in compare_seq or
_contam.seq in compare_seq or
align(
compare_seq, contam.seq,
compare_seq, _contam.seq,
self.min_contaminant_match_frac)):
if (match_frac1 > best_match_frac[0] or (
match_frac1 == best_match_frac[0] and
match_frac2 > best_match_frac[1])):
best_matches = {}
best_match_frac = (match_frac1, match_frac2)
best_matches[contam] = (
if (
match_frac1 > _best_match_frac[0] or (
match_frac1 == _best_match_frac[0] and
match_frac2 > _best_match_frac[1]
)
):
_best_matches = {}
_best_match_frac = (match_frac1, match_frac2)
_best_matches[_contam] = (
match, (match_frac1, match_frac2))
return (best_matches, best_match_frac)
return _best_matches, _best_match_frac
for match in matches:
best_matches, best_match_frac = find_best_match(
......@@ -692,9 +705,9 @@ class HeuristicDetector(Detector):
match.longest_match[0], best_matches, best_match_frac)
if best_matches:
for contam, match in best_matches.items():
if contam not in known or match[1] > known[contam][1]:
known[contam] = match
for contam, _match in best_matches.items():
if contam not in known or _match[1] > known[contam][1]:
known[contam] = _match
else:
unknown.append(match)
......@@ -720,7 +733,7 @@ class HeuristicDetector(Detector):
match.set_contaminant(contam, *match_frac)
else:
names = set(contam.names)
seqs = set((contam.seq,))
seqs = {contam.seq}
for other_contam in equiv:
names.update(other_contam[0].names)
seqs.add(other_contam[0].seq)
......@@ -731,6 +744,7 @@ class HeuristicDetector(Detector):
return matches
class KhmerDetector(Detector):
"""Identify contaminants based on kmer frequency using a fast kmer counting
approach (as implemented in the khmer library). This approach is fast but
......@@ -770,12 +784,12 @@ class KhmerDetector(Detector):
matches = []
seen = set()
def match(kmer):
def match(_kmer):
"""Returns the frequency of `kmer` in `candidates`.
"""
freq = candidates.get(kmer, 0)
freq = candidates.get(_kmer, 0)
if freq > 0:
seen.add(kmer)
seen.add(_kmer)
return freq
for seq, names in self.known_contaminants.iter_sequences():
......@@ -817,11 +831,13 @@ class KhmerDetector(Detector):
return matches
def align(seq1, seq2, min_overlap_frac=0.9):
"""Align two sequences.
Args:
seq1, seq2: The sequences to align.
seq1: The first sequence to align.
seq2: The second sequence to align.
min_overlap_frac: Minimum fraction of overlapping bases required for a
match.
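Returns:
A truthy match on success, otherwise a falsy value (inferred from
the boolean use of align() in find_best_match above).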
......
......@@ -46,7 +46,7 @@ master_doc = 'index'
# General information about the project.
project = u'atropos'
copyright = u'Original Cutadapt code: 2010-2014, Marcel Martin; Atropos modifications and improvements: 2015-2016, John P Didion'
copyright = u'Original Cutadapt code: 2010-2014, Marcel Martin; Atropos modifications and improvements: 2015-2018, John P Didion'
# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
......@@ -208,8 +208,7 @@ latex_elements = {
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
('index', 'atropos.tex', u'atropos Documentation',
u'Marcel Martin', 'manual'),
('index', 'atropos.tex', u'Atropos Documentation', u'John Didion', 'manual'),
]
# The name of an image file (relative to this directory) to place at the top of
......@@ -238,8 +237,7 @@ latex_documents = [
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
('index', 'atropos', u'atropos Documentation',
[u'Marcel Martin'], 1)
('index', 'atropos', u'Atropos Documentation', [u'John Didion'], 1)
]
# If true, show URL addresses after external links.
......@@ -252,9 +250,8 @@ man_pages = [
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
('index', 'atropos', u'atropos Documentation',
u'Marcel Martin', 'atropos', 'One line description of project.',
'Miscellaneous'),
('index', 'atropos', u'atropos Documentation', u'John Didion', 'atropos',
'General purpose NGS read manipulation tool.', 'Read manipulation'),
]
# Documents to append as an appendix to all manuals.
......
......@@ -657,10 +657,10 @@ been removed. For example, if the following sequence is in reads.fastq::
ACGTACGTACGTADAP
The following command will first trim the ``ADAP`` part of the adapter
off the end. Then, since only 4 bases were trimmed, the ``-i 5`` option
off the end. Then, since only 4 bases were trimmed, the ``-i -5`` option
will cause a 5th base to be removed.
atropos -A ADAPTER -i 5 -o trimmed.fastq -se reads.fastq
atropos -A ADAPTER -i -5 -o trimmed.fastq -se reads.fastq
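Working through the example: trimming ``ADAP`` removes 4 bases, leaving ``ACGTACGTACGT``; the negative cut then removes one additional base to reach 5 in total, so the final read is ``ACGTACGTACG``.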
.. _quality-trimming:
......
.. include:: ./readme.rst
=================
Table of contents
=================
......
......@@ -146,7 +146,7 @@ setup(
url="https://atropos.readthedocs.org/",
description="trim adapters from high-throughput sequencing reads",
long_description=long_description,
log_description_content_type="text/markdown",
long_description_content_type="text/markdown",
license="Original Cutadapt code is under MIT license; improvements and additions are in the Public Domain",
ext_modules=extensions,
packages=find_packages(),
......