# Changes
v1.1.22 (dev)
-------------
* Documentation fixes (#79)
* Fix for index error when running `detect` command (#80)
v1.1.21 (2018.11.24)
--------------------
* Added long options for paired adapter parameters.
......
......@@ -19,8 +19,7 @@ test:
$(TEST)
docs:
make -C docs api
make -C docs html
make -C doc html
readme:
pandoc --from=markdown --to=rst --output=README.rst README.md
......@@ -58,7 +57,7 @@ release:
$(TEST)
python setup.py sdist bdist_wheel
# release
python setup.py sdist upload
python setup.py sdist upload -r pypi
git push origin --tags
$(github_release)
$(docker)
......
[![Travis CI](https://travis-ci.org/jdidion/atropos.svg?branch=master)](https://travis-ci.org/jdidion/atropos])
[![Travis CI](https://travis-ci.org/jdidion/atropos.svg?branch=1.1)](https://travis-ci.org/jdidion/atropos)
[![PyPi](https://img.shields.io/pypi/v/atropos.svg)](https://pypi.python.org/pypi/atropos)
[![DOI](https://zenodo.org/badge/61393086.svg)](https://zenodo.org/badge/latestdoi/61393086)
......@@ -127,12 +127,15 @@ The citation for the original Cutadapt paper is:
* Provide option for RNA-seq data that will trim polyA sequence.
* Add formal config file support (#53)
* Automate crash reporting using [sentry](https://github.com/getsentry/raven-python).
* Look at [NGmerge](https://github.com/harvardinformatics/NGmerge) for improving read merging.
* Look at replacing pysam with [pybam](https://github.com/JohnLonginotto/pybam)
### 1.4
* Currently, InsertAligner requires a single 3' adapter for each end. Adapter trimming will later be generalized so that A) the InsertAligner can handle multiple matched pairs of adapters and/or B) multiple different aligners can be used for different adapters.
* Integrate with [AdapterBase](https://github.com/NCBI-Hackathons/OnlineAdapterDatabase) for improved matching of detected contaminants to known adapters, automated trimming of datasets with known adapters, and (opt-in) submission of adapter information for novel datasets.
* Migrate to [seqio](https://github.com/jdidion/seqio) for reading/writing sequence files.
* Also look at using [HTSeq](https://github.com/simon-anders/htseq) instead.
* General-purpose read filtering based on read ID ([cutadapt#107](https://github.com/marcelm/cutadapt/issues/107)).
* Currently, SAM/BAM input files must be name sorted; add an option to 1) pre-sort reads inline using samtools or sambamba, or 2) cache each read in memory until its mate is found.
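As a sketch of option (2), assuming pysam is the reader (a hypothetical helper for illustration, not actual Atropos code):

```python
import pysam

def iter_read_pairs(bam_path):
    """Yield (read1, read2) tuples from a BAM that is not name-sorted,
    caching each read in memory until its mate shows up (option 2 above).
    """
    cache = {}
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam:
            # Only primary alignments of paired reads can be paired up.
            if not read.is_paired or read.is_secondary or read.is_supplementary:
                continue
            mate = cache.pop(read.query_name, None)
            if mate is None:
                cache[read.query_name] = read
            else:
                yield (mate, read) if mate.is_read1 else (read, mate)
```

The trade-off versus pre-sorting with samtools/sambamba is memory: on a coordinate-sorted file the cache can grow to roughly half the reads in the worst case.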
......@@ -169,7 +172,9 @@ The citation for the original Cutadapt paper is:
### 2.0
* Simplify command-line options, perhaps by further splitting functionality into sub-commands, but also by choosing more intelligent default option values based on context.
* Consider adding additional report formats.
* Add fuzzy matching: [Java-Fuzzy-Search](https://github.com/Daniel-Liu-c0deb0t/Java-Fuzzy-Search)
  * See also [cutadapt#112](https://github.com/marcelm/cutadapt/issues/112)
* Performance enhancements using [Numba](http://numba.pydata.org/)
......
......@@ -23,8 +23,8 @@ def get_keywords():
# setup.py/versioneer.py will grep for the variable names, so they must
# each be defined on a line of their own. _version.py will just call
# get_keywords().
git_refnames = " (tag: 1.1.21)"
git_full = "1b7273b082551e80652a45df2736e5ffff8ce7f9"
git_refnames = " (tag: 1.1.22)"
git_full = "2b15c778f0ccf1d0fb753e4334fa6dc0048a9ee6"
keywords = {"refnames": git_refnames, "full": git_full}
return keywords
......
......@@ -212,6 +212,12 @@ MatchInfo = namedtuple("MatchInfo", (
# * http://www.combio.pl/alfree/tools/
# * http://www.combio.pl/alfree
# * http://bioinformatics.org.au/tools/decaf+py/
# 12. https://github.com/sdu-hpcl/BGSA
# 13. Train a NN to approximate the pairwise alignment distance
# https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty887/5140215?redirectedFrom=fulltext
# 14. Nucl2vec: https://github.com/prakharg24/Nucl2vec (local alignment only - could
# it be adapted to semi-global?)
# * Can re-implement in SpaCy? https://spacy.io/usage/vectors-similarity
class InsertAligner(object):
"""Implementation of an insert matching algorithm.
......
......@@ -8,7 +8,7 @@ from atropos.align import Aligner, SEMIGLOBAL
from atropos.commands.base import (
BaseCommandRunner, Pipeline, SingleEndPipelineMixin, PairedEndPipelineMixin)
from atropos.util import (
reverse_complement, sequence_complexity, enumerate_range, run_interruptible)
reverse_complement, sequence_complexity, run_interruptible)
# TODO: Test whether using rc=True in parse_known_contaminants is as fast
# and as accurate as testing both the forward and reverse complement
......@@ -88,6 +88,8 @@ class CommandRunner(BaseCommandRunner):
logging.getLogger().debug(
"Detecting contaminants using the kmer-based algorithm")
detector_class = KhmerDetector
else:
raise ValueError("Invalid value for 'detector': {}".format(detector))
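# The branches above select among KnownContaminantDetector,
# HeuristicDetector, and KhmerDetector (all defined below); any other
# value for 'detector' falls through to this error.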
summary_args = dict(
kmer_size=kmer_size, n_reads=n_reads,
......@@ -113,6 +115,7 @@ class CommandRunner(BaseCommandRunner):
self.summary.update(mode='serial', threads=1)
return run_interruptible(detector, self, raise_on_error=True)
class Match(object):
"""A contaminant match.
......@@ -224,6 +227,7 @@ class Match(object):
known_seqs=self.known_seqs)
return summary
class ContaminantMatcher(object):
"""Matches a known contaminant against other sequences.
......@@ -279,7 +283,8 @@ class ContaminantMatcher(object):
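# match_frac1 is the fraction of this contaminant's kmers found in the
# read, and match_frac2 the fraction of the read's kmers that match the
# contaminant (as computed from n_kmers and len(kmers) below).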
match_frac1 = n_matches / self.n_kmers
if len(kmers) > 0:
match_frac2 = n_matches / len(kmers)
return (match_frac1, match_frac2, compare_seq)
return match_frac1, match_frac2, compare_seq
def create_contaminant_matchers(contaminants, kmer_size):
"""Create :class:`ContaminantMatcher`s from sequences.
......@@ -296,6 +301,7 @@ def create_contaminant_matchers(contaminants, kmer_size):
for seq, names in contaminants.iter_sequences()
]
class Detector(SingleEndPipelineMixin, Pipeline):
"""Base class for contaminant detectors.
......@@ -304,7 +310,7 @@ class Detector(SingleEndPipelineMixin, Pipeline):
n_reads: Number of reads to sample.
overrep_cutoff: Degree of overrepresentation required for a kmer to be
considered as a contaminant.
known_contaminant: :class:`ContaminantMatcher`s to match against.
known_contaminants: :class:`ContaminantMatcher`s to match against.
past_end_bases: On Illumina, long runs of A (and sometimes other bases)
can signify that the sequencer has read past the end of a fragment
that is shorter than the read length + adapter length. Those
......@@ -392,39 +398,39 @@ class Detector(SingleEndPipelineMixin, Pipeline):
for match in matches:
match.estimate_abundance(self._read_sequences)
def _filter(match):
if match.count < self.min_report_freq:
def _filter(_match):
if _match.count < self.min_report_freq:
logging.getLogger().debug(
"Filtering {} because frequency {} < {}".format(
match.seq, match.count, self.min_report_freq))
_match.seq, _match.count, self.min_report_freq))
return False
if min_len and len(match) < min_len:
if min_len and len(_match) < min_len:
logging.getLogger().debug(
"Filtering {} because it's too short (< {} bp)".format(
match.seq, min_len))
_match.seq, min_len))
return False
if min_complexity and match.seq_complexity < min_complexity:
if min_complexity and _match.seq_complexity < min_complexity:
logging.getLogger().debug(
"Filtering {} because its complexity {} < {}".format(
match.seq, match.seq_complexity, min_complexity))
_match.seq, _match.seq_complexity, min_complexity))
return False
if self.include == 'known' and not match.is_known:
if self.include == 'known' and not _match.is_known:
logging.getLogger().debug(
"Filtering {} because it's not known".format(
match.seq))
_match.seq))
return False
elif self.include == 'unknown' and match.is_known:
elif self.include == 'unknown' and _match.is_known:
logging.getLogger().debug(
"Filtering {} because it's known".format(
match.seq))
_match.seq))
return False
if (
min_match_frac and match.is_known and
match.match_frac < min_match_frac):
min_match_frac and _match.is_known and
_match.match_frac < min_match_frac):
logging.getLogger().debug(
"Filtering {} because its match_frac {} < {}".format(
match.seq, match.match_frac, min_match_frac))
_match.seq, _match.match_frac, min_match_frac))
return False
return True
......@@ -451,6 +457,7 @@ class Detector(SingleEndPipelineMixin, Pipeline):
for match in self.matches(**kwargs)
],)
class PairedDetector(PairedEndPipelineMixin, Pipeline):
"""Detector for paired-end reads.
"""
......@@ -484,6 +491,7 @@ class PairedDetector(PairedEndPipelineMixin, Pipeline):
for match in self.read2_detector.matches(**kwargs)
])
class KnownContaminantDetector(Detector):
"""Test known contaminants against reads. This has linear complexity and is
more specific than the khmer matcher, but less specific than the heuristic
......@@ -540,6 +548,7 @@ class KnownContaminantDetector(Detector):
)
]
class HeuristicDetector(Detector):
"""Use a heuristic iterative algorithm to arrive at likely contaminants.
This is the most accurate algorithm overall, but it has quadratic complexity
......@@ -557,11 +566,11 @@ class HeuristicDetector(Detector):
return 0.1 * self.n_reads
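# i.e. a candidate must occur in at least 10% of the sampled reads to
# be reported (compare the min_report_freq check in _filter above).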
def _get_contaminants(self):
def _min_count(kmer_size):
def _min_count(_kmer_size):
return math.ceil(self.n_reads * max(
self.min_frequency,
(self._read_length - kmer_size + 1) * self.overrep_cutoff /
float(4**kmer_size)))
(self._read_length - _kmer_size + 1) * self.overrep_cutoff /
float(4 ** _kmer_size)))
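# Rationale: a read of length L contains (L - k + 1) kmers, and a given
# kmer occurs at each position with probability 1/4**k under a uniform
# null model, so its expected per-read frequency is (L - k + 1) / 4**k.
# Scaling by overrep_cutoff and n_reads (floored at min_frequency) gives
# the minimum count for a kmer to count as over-represented. E.g. with
# L=100, k=12, overrep_cutoff=100, n_reads=10000 and a small
# min_frequency, this is ceil(10000 * 89 * 100 / 4**12) = 6.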
kmer_size = self.kmer_size
kmers = defaultdict(lambda: [0, set()])
......@@ -613,8 +622,7 @@ class HeuristicDetector(Detector):
# Now merge overlapping sequences by length and frequency to eliminate
# redundancy in the set of candidate kmers.
results.sort(key=lambda i: len(i[0]) * math.log(i[1]), reverse=True)
cur = results[0]
results.sort(key=lambda r: len(r[0]) * math.log(r[1]), reverse=True)
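# Sorting by len(seq) * log(count), descending, ranks long, reasonably
# frequent candidates above short, very frequent ones, so the merge loop
# below absorbs redundant substrings into their longer parents first.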
merged = []
unmerged = []
while len(results) > 1:
......@@ -643,7 +651,7 @@ class HeuristicDetector(Detector):
# appear in full very infrequently
# Re-sort by frequency
results.sort(key=lambda i: i[1], reverse=True)
results.sort(key=lambda r: r[1], reverse=True)
# Keep anything that's within 50% of the top hit
# TODO: make this user-configurable?
min_count = int(results[0][1] * 0.5)
......@@ -660,28 +668,33 @@ class HeuristicDetector(Detector):
known = {}
unknown = []
def find_best_match(seq, best_matches, best_match_frac):
def find_best_match(_seq, _best_matches, _best_match_frac):
"""Find best contaminant matches to `seq`.
"""
seqrc = reverse_complement(seq)
for contam in contaminants:
match_frac1, match_frac2, compare_seq = contam.match(
seq, seqrc)
if match_frac1 < best_match_frac[0]:
seqrc = reverse_complement(_seq)
for _contam in contaminants:
match_frac1, match_frac2, compare_seq = _contam.match(
_seq, seqrc)
if match_frac1 < _best_match_frac[0]:
continue
if (
contam.seq in compare_seq or
_contam.seq in compare_seq or
align(
compare_seq, contam.seq,
compare_seq, _contam.seq,
self.min_contaminant_match_frac)):
if (match_frac1 > best_match_frac[0] or (
match_frac1 == best_match_frac[0] and
match_frac2 > best_match_frac[1])):
best_matches = {}
best_match_frac = (match_frac1, match_frac2)
best_matches[contam] = (
if (
match_frac1 > _best_match_frac[0] or (
match_frac1 == _best_match_frac[0] and
match_frac2 > _best_match_frac[1]
)
):
_best_matches = {}
_best_match_frac = (match_frac1, match_frac2)
_best_matches[_contam] = (
match, (match_frac1, match_frac2))
return (best_matches, best_match_frac)
return _best_matches, _best_match_frac
for match in matches:
best_matches, best_match_frac = find_best_match(
......@@ -692,9 +705,9 @@ class HeuristicDetector(Detector):
match.longest_match[0], best_matches, best_match_frac)
if best_matches:
for contam, match in best_matches.items():
if contam not in known or match[1] > known[contam][1]:
known[contam] = match
for contam, _match in best_matches.items():
if contam not in known or _match[1] > known[contam][1]:
known[contam] = _match
else:
unknown.append(match)
......@@ -720,7 +733,7 @@ class HeuristicDetector(Detector):
match.set_contaminant(contam, *match_frac)
else:
names = set(contam.names)
seqs = set((contam.seq,))
seqs = {contam.seq}
for other_contam in equiv:
names.update(other_contam[0].names)
seqs.add(other_contam[0].seq)
......@@ -731,6 +744,7 @@ class HeuristicDetector(Detector):
return matches
class KhmerDetector(Detector):
"""Identify contaminants based on kmer frequency using a fast kmer counting
approach (as implemented in the khmer library). This approach is fast but
......@@ -770,12 +784,12 @@ class KhmerDetector(Detector):
matches = []
seen = set()
def match(kmer):
def match(_kmer):
"""Returns the frequency of `kmer` in `candidates`.
"""
freq = candidates.get(kmer, 0)
freq = candidates.get(_kmer, 0)
if freq > 0:
seen.add(kmer)
seen.add(_kmer)
return freq
for seq, names in self.known_contaminants.iter_sequences():
......@@ -817,11 +831,13 @@ class KhmerDetector(Detector):
return matches
def align(seq1, seq2, min_overlap_frac=0.9):
"""Align two sequences.
Args:
seq1, seq2: The sequences to align.
seq1: The first sequence to align.
seq2: The second sequence to align.
min_overlap_frac: Minimum fraction of overlapping bases required for a
match.
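Returns:
A truthy match on success, otherwise a falsy value (inferred from
the boolean use of align() in find_best_match above).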
......
......@@ -46,7 +46,7 @@ master_doc = 'index'
# General information about the project.
project = u'atropos'
copyright = u'Original Cutadapt code: 2010-2014, Marcel Martin; Atropos modifications and improvements: 2015-2016, John P Didion'
copyright = u'Original Cutadapt code: 2010-2014, Marcel Martin; Atropos modifications and improvements: 2015-2018, John P Didion'
# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
......@@ -208,8 +208,7 @@ latex_elements = {
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
('index', 'atropos.tex', u'atropos Documentation',
u'Marcel Martin', 'manual'),
('index', 'atropos.tex', u'Atropos Documentation', u'John Didion', 'manual'),
]
# The name of an image file (relative to this directory) to place at the top of
......@@ -238,8 +237,7 @@ latex_documents = [
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
('index', 'atropos', u'atropos Documentation',
[u'Marcel Martin'], 1)
('index', 'atropos', u'Atropos Documentation', [u'John Didion'], 1)
]
# If true, show URL addresses after external links.
......@@ -252,9 +250,8 @@ man_pages = [
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
('index', 'atropos', u'atropos Documentation',
u'Marcel Martin', 'atropos', 'One line description of project.',
'Miscellaneous'),
('index', 'atropos', u'atropos Documentation', u'John Didion', 'atropos',
'General purpose NGS read manipulation tool.', 'Read manipulation'),
]
# Documents to append as an appendix to all manuals.
......
......@@ -657,10 +657,10 @@ been removed. For example, if the following sequence is in reads.fastq::
ACGTACGTACGTADAP
The following command will first trim the ``ADAP`` part of the adapter
off the end. Then, since only 4 bases were trimmed, the ``-i 5`` option
off the end. Then, since only 4 bases were trimmed, the ``-i -5`` option
will cause a 5th base to be removed.
atropos -A ADAPTER -i 5 -o trimmed.fastq -se reads.fastq
atropos -A ADAPTER -i -5 -o trimmed.fastq -se reads.fastq
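Working through the example: trimming ``ADAP`` removes 4 bases, leaving ``ACGTACGTACGT``; the negative cut then removes one additional base to reach 5 in total, so the final read is ``ACGTACGTACG``.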
.. _quality-trimming:
......
.. include:: ./readme.rst
=================
Table of contents
=================
......
......@@ -146,7 +146,7 @@ setup(
url="https://atropos.readthedocs.org/",
description="trim adapters from high-throughput sequencing reads",
long_description=long_description,
log_description_content_type="text/markdown",
long_description_content_type="text/markdown",
license="Original Cutadapt code is under MIT license; improvements and additions are in the Public Domain",
ext_modules=extensions,
packages=find_packages(),
......