Metadata-Version: 1.1
Name: gffutils
Version: 0.8.7.1
Version: 0.9
Summary: Work with GFF and GTF files in a flexible database framework
Home-page: none
Home-page: https://github.com/daler/gffutils
Author: Ryan Dale
Author-email: dalerr@niddk.nih.gov
License: UNKNOWN
Description: UNKNOWN
Platform: UNKNOWN
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: GNU General Public License (GPL)
Classifier: License :: OSI Approved :: MIT License
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 2
......@@ -17,4 +17,7 @@ Classifier: Programming Language :: Python :: 2.6
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Topic :: Software Development :: Libraries :: Python Modules
......@@ -9,10 +9,11 @@
:target: https://pypi.python.org/pypi/gffutils
See docs at http://daler.github.io/gffutils.
``gffutils`` is a Python package for working with and manipulating the GFF and
GTF format files typically used for genomic annotations. Files are loaded into
a sqlite3 database, allowing much more complex manipulation of hierarchical
features (e.g., genes, transcripts, and exons) than is possible with plain-text
methods alone.
See documentation at **http://daler.github.io/gffutils**.
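The database-backed approach described above can be sketched with plain sqlite3. The schema and names below are illustrative assumptions for the sketch, not gffutils' actual schema; the point is that parent/child queries across genes, transcripts, and exons become simple SQL.

```python
import sqlite3

# Minimal sketch of a feature/relation model: features in one table,
# parent-child links in another (illustrative, not gffutils' schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE features (id TEXT PRIMARY KEY, featuretype TEXT,
                       seqid TEXT, start INT, end INT);
CREATE TABLE relations (parent TEXT, child TEXT);
""")
conn.executemany("INSERT INTO features VALUES (?,?,?,?,?)", [
    ("gene1", "gene", "chr1", 100, 900),
    ("tx1", "transcript", "chr1", 100, 900),
    ("exon1", "exon", "chr1", 100, 200),
    ("exon2", "exon", "chr1", 800, 900),
])
conn.executemany("INSERT INTO relations VALUES (?,?)", [
    ("gene1", "tx1"), ("tx1", "exon1"), ("tx1", "exon2"),
])

# All exons of tx1 ordered by start -- the kind of hierarchical query
# that is awkward with plain-text tools alone.
exons = conn.execute("""
    SELECT f.id, f.start, f.end FROM features f
    JOIN relations r ON f.id = r.child
    WHERE r.parent = 'tx1' AND f.featuretype = 'exon'
    ORDER BY f.start
""").fetchall()
print(exons)  # [('exon1', 100, 200), ('exon2', 800, 900)]
```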
python-gffutils (0.8.7.1-1) unstable; urgency=medium
python-gffutils (0.9-1) UNRELEASED; urgency=medium
* Removal reported via bug #894298.
* Team non-upload.
* New upstream version.
-- Steffen Moeller <moeller@debian.org> Sun, 15 Apr 2018 14:03:00 +0200
python-gffutils (0.8.7.1-1) REMOVED; urgency=medium
* Initial release. (Closes: #851488)
......
......@@ -17,7 +17,7 @@ Build-Depends: debhelper (>= 10),
python3-nose,
python3-biopython,
python3-pybedtools
Standards-Version: 3.9.8
Standards-Version: 4.1.3
Vcs-Browser: https://anonscm.debian.org/cgit/debian-med/python-gffutils.git
Vcs-Git: https://anonscm.debian.org/git/debian-med/python-gffutils.git
Homepage: https://daler.github.io/gffutils
......
{{ fullname }}
{{ underline }}
.. currentmodule:: {{ module }}
.. autoclass:: {{ objname }}
{% block methods %}
.. automethod:: __init__
{% if methods %}
.. rubric:: Methods
.. autosummary::
{% for item in methods %}
~{{ name }}.{{ item }}
{%- endfor %}
{% endif %}
{% endblock %}
{% block attributes %}
{% if attributes %}
.. rubric:: Attributes
.. autosummary::
{% for item in attributes %}
~{{ name }}.{{ item }}
{%- endfor %}
{% endif %}
{% endblock %}
Metadata-Version: 1.1
Name: gffutils
Version: 0.8.7.1
Version: 0.9
Summary: Work with GFF and GTF files in a flexible database framework
Home-page: none
Home-page: https://github.com/daler/gffutils
Author: Ryan Dale
Author-email: dalerr@niddk.nih.gov
License: UNKNOWN
Description: UNKNOWN
Platform: UNKNOWN
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: GNU General Public License (GPL)
Classifier: License :: OSI Approved :: MIT License
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 2
......@@ -17,4 +17,7 @@ Classifier: Programming Language :: Python :: 2.6
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Topic :: Software Development :: Libraries :: Python Modules
......@@ -3,6 +3,7 @@ MANIFEST.in
README.rst
requirements.txt
setup.py
doc/source/_templates/class.rst
gffutils/__init__.py
gffutils/attributes.py
gffutils/bins.py
......@@ -34,23 +35,34 @@ gffutils/test/expected.py
gffutils/test/feature_test.py
gffutils/test/helpers_test.py
gffutils/test/parser_test.py
gffutils/test/performance_test.py
gffutils/test/test.py
gffutils/test/test_biopython_integration.py
gffutils/test/data/F3-unique-3.v2.gff
gffutils/test/data/FBgn0031208.gff
gffutils/test/data/FBgn0031208.gtf
gffutils/test/data/Saccharomyces_cerevisiae.R64-1-1.83.5000_gene_ids.txt
gffutils/test/data/Saccharomyces_cerevisiae.R64-1-1.83.5000_transcript_ids.txt
gffutils/test/data/Saccharomyces_cerevisiae.R64-1-1.83.chromsizes.txt
gffutils/test/data/c_elegans_WS199_ann_gff.txt
gffutils/test/data/c_elegans_WS199_dna_shortened.fa
gffutils/test/data/c_elegans_WS199_shortened_gff.txt
gffutils/test/data/dm6-chr2L.fa
gffutils/test/data/dmel-all-no-analysis-r5.49_50k_lines.gff
gffutils/test/data/download-large-annotation-files.sh
gffutils/test/data/ensembl_gtf.txt
gffutils/test/data/gencode-v19.gtf
gffutils/test/data/gencode.vM8.5000_gene_ids.txt
gffutils/test/data/gencode.vM8.5000_transcript_ids.txt
gffutils/test/data/gencode.vM8.chromsizes.txt
gffutils/test/data/gff_example1.gff3
gffutils/test/data/gff_example1.gff3.gz
gffutils/test/data/glimmer_nokeyval.gff3
gffutils/test/data/hybrid1.gff3
gffutils/test/data/intro_docs_example.gff
gffutils/test/data/jgi_gff2.txt
gffutils/test/data/keep-order-test.gtf
gffutils/test/data/keyval_sep_in_attrs.gff
gffutils/test/data/mouse_extra_comma.gff3
gffutils/test/data/ncbi_gff3.txt
gffutils/test/data/nonascii
......
......@@ -139,6 +139,7 @@ dialect = {
}
always_return_list = True
ignore_url_escape_characters = False
# these keyword args are used by iterators.
_iterator_kwargs = (
......
......@@ -4,24 +4,19 @@ Conversion functions that operate on :class:`FeatureDB` classes.
import six
def to_bed12(f, db, child_type='exon', name_field='ID'):
"""
Given a top-level feature (e.g., transcript), construct a BED12 entry
Parameters
----------
f : Feature object or string
This is the top-level feature represented by one BED12 line. For
a canonical GFF or GTF, this will generally be a transcript.
db : a FeatureDB object
This is needed to get the children of the feature.
child_type : str
Featuretypes that will be represented by the BED12 "blocks". Typically
"exon".
name_field : str
Attribute to be used in the "name" field of the BED12 entry. Usually
"ID" for GFF; "transcript_id" for GTF.
......
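The blockSizes/blockStarts arithmetic that a BED12 conversion needs can be sketched as follows. This is an illustrative sketch, not gffutils' implementation; `bed12_blocks` is a hypothetical helper.

```python
def bed12_blocks(tx_start, exons):
    """Compute BED12 blockCount, blockSizes, blockStarts from 1-based,
    closed-interval exon coordinates (hypothetical helper for
    illustration, not gffutils' own code)."""
    exons = sorted(exons)
    # GFF/GTF intervals are 1-based and closed, so length = end - start + 1.
    sizes = [end - start + 1 for start, end in exons]
    # BED block starts are relative to the feature's chromStart.
    starts = [start - tx_start for start, _ in exons]
    return len(exons), ",".join(map(str, sizes)), ",".join(map(str, starts))

count, sizes, starts = bed12_blocks(100, [(100, 200), (800, 900)])
print(count, sizes, starts)  # 2 101,101 0,700
```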
......@@ -57,6 +57,7 @@ class _DBCreator(object):
force_merge_fields=None,
text_factory=sqlite3.OptimizedUnicode,
pragmas=constants.default_pragmas, _keep_tempfiles=False,
directives=None,
**kwargs):
"""
Base class for _GFFDBCreator and _GTFDBCreator; see create_db()
......@@ -80,6 +81,9 @@ class _DBCreator(object):
self.pragmas = pragmas
self.merge_strategy = merge_strategy
self.default_encoding = default_encoding
if directives is None:
directives = []
self.directives = directives
if not infer_gene_extent:
warnings.warn("'infer_gene_extent' will be deprecated. For now, "
......@@ -121,6 +125,7 @@ class _DBCreator(object):
dialect=dialect
)
def set_verbose(self, verbose=None):
if verbose == 'debug':
logger.setLevel(logging.DEBUG)
......@@ -439,9 +444,10 @@ class _DBCreator(object):
In general, if you'll be adding stuff to the meta table, do it here.
"""
c = self.conn.cursor()
directives = self.directives + self.iterator.directives
c.executemany('''
INSERT INTO directives VALUES (?)
''', ((i,) for i in self.iterator.directives))
''', ((i,) for i in directives))
c.execute(
'''
INSERT INTO meta (version, dialect)
......@@ -472,6 +478,16 @@ class _DBCreator(object):
logger.info("Creating features(featuretype) index")
c.execute('DROP INDEX IF EXISTS featuretype')
c.execute('CREATE INDEX featuretype ON features (featuretype)')
logger.info("Creating features (seqid, start, end) index")
c.execute('DROP INDEX IF EXISTS seqidstartend')
c.execute('CREATE INDEX seqidstartend ON features (seqid, start, end)')
logger.info("Creating features (seqid, start, end, strand) index")
c.execute('DROP INDEX IF EXISTS seqidstartendstrand')
c.execute('CREATE INDEX seqidstartendstrand ON features (seqid, start, end, strand)')
# speeds computation 1000x in some cases
logger.info("Running ANALYZE features")
c.execute('ANALYZE features')
self.conn.commit()
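The effect of this new ANALYZE step can be seen directly with the standard library: running ANALYZE populates the `sqlite_stat1` table that the query planner consults, which is also how a database can be tested for having been analyzed. A self-contained sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE features (seqid TEXT, start INT, end INT)")
conn.execute("CREATE INDEX seqidstartend ON features (seqid, start, end)")
conn.executemany("INSERT INTO features VALUES (?,?,?)",
                 [("chr1", i, i + 100) for i in range(1000)])

def analyzed(conn):
    # sqlite_stat1 only exists after ANALYZE has been run.
    res = conn.execute("SELECT name FROM sqlite_master "
                       "WHERE type='table' AND name='sqlite_stat1'")
    return len(res.fetchall()) == 1

assert not analyzed(conn)
# ANALYZE gathers index statistics that let the planner choose better
# query plans -- the source of the large speedups noted in this diff.
conn.execute("ANALYZE features")
conn.commit()
assert analyzed(conn)
```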
......@@ -1104,7 +1120,7 @@ def create_db(data, dbfn, id_spec=None, force=False, verbose=False,
Using `merge_strategy="warning"`, a warning will be printed to the
logger, and the duplicate feature will be skipped.
Using `merge_strategy="replace" will replace the entire existing
Using `merge_strategy="replace"` will replace the entire existing
feature with the new feature.
transform : callable
......@@ -1216,7 +1232,6 @@ def create_db(data, dbfn, id_spec=None, force=False, verbose=False,
-------
New :class:`FeatureDB` object.
"""
_locals = locals()
# Check if any older kwargs made it in
......@@ -1235,16 +1250,16 @@ def create_db(data, dbfn, id_spec=None, force=False, verbose=False,
if dialect is None:
dialect = iterator.dialect
if isinstance(iterator, iterators._FeatureIterator):
# However, a side-effect of this is that if `data` was a generator,
# then we've just consumed `checklines` items (see
# iterators.BaseIterator.__init__, which calls iterators.peek).
#
# But it also chains those consumed items back onto the beginning, and
# the result is available as as iterator._iter.
#
# That's what we should be using now for `data:
kwargs['data'] = iterator._iter
# However, a side-effect of this is that if `data` was a generator, then
# we've just consumed `checklines` items (see
# iterators.BaseIterator.__init__, which calls iterators.peek).
#
# But it also chains those consumed items back onto the beginning, and the
# result is available as iterator._iter.
#
# That's what we should be using now for `data`:
kwargs['data'] = iterator._iter
kwargs['directives'] = iterator.directives
# Since we've already checked lines, we don't want to do it again
kwargs['checklines'] = 0
......
......@@ -74,6 +74,14 @@ class Feature(object):
dictionary and the dialect -- except if the original attributes
string was provided, in which case that will be used directly.
Notes on encoding/decoding: the only time unquoting
(e.g., "%2C" becomes ",") happens is if `attributes` is a string
and if `settings.ignore_url_escape_characters = False`. If dict or
JSON, the contents are used as-is.
Similarly, the only time characters are quoted ("," becomes "%2C")
is when the feature is printed (`__str__` method).
extra : string or list
Additional fields after the canonical 9 fields for GFF/GTF.
......@@ -114,11 +122,11 @@ class Feature(object):
"""
# start/end can be provided as int-like, ".", or None, but will be
# converted to int or None
if start == ".":
if start == "." or start == "":
start = None
elif start is not None:
start = int(start)
if end == ".":
if end == "." or end == "":
end = None
elif end is not None:
end = int(end)
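The coordinate parsing above can be captured in one small helper. `coerce_coord` is a hypothetical name, shown only to illustrate the rule that ".", the empty string, and None all mean a missing coordinate, while anything else must be int-like:

```python
def coerce_coord(value):
    # ".", "" and None all mean "no coordinate"; otherwise require
    # an int-like value (raises ValueError for anything else).
    if value in (".", "", None):
        return None
    return int(value)

print(coerce_coord("."))    # None
print(coerce_coord("100"))  # 100
```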
......@@ -224,6 +232,7 @@ class Feature(object):
return unicode(self).encode('utf-8')
def __unicode__(self):
# All fields but attributes (and extra).
items = [getattr(self, k) for k in constants._gffkeys[:-1]]
......@@ -264,7 +273,7 @@ class Feature(object):
return self.stop - self.start + 1
# aliases for official GFF field names; this way x.chrom == x.seqid; and
# x.start == x.end.
# x.stop == x.end.
@property
def chrom(self):
return self.seqid
......@@ -334,11 +343,14 @@ class Feature(object):
string
"""
if isinstance(fasta, six.string_types):
fasta = Fasta(fasta, as_raw=True)
fasta = Fasta(fasta, as_raw=False)
# recall GTF/GFF is 1-based closed; pyfaidx uses Python slice notation
# and is therefore 0-based half-open.
return fasta[self.chrom][self.start-1:self.stop]
seq = fasta[self.chrom][self.start-1:self.stop]
if use_strand and self.strand == '-':
seq = seq.reverse.complement
return seq.seq
def feature_from_line(line, dialect=None, strict=True, keep_order=False):
......
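The 1-based/0-based conversion and minus-strand handling added to `sequence()` above can be sketched without pyfaidx, using plain strings. `extract` is a hypothetical helper for illustration only:

```python
def extract(seq, start, stop, strand):
    """Extract a 1-based, closed-interval subsequence, reverse-complementing
    on the minus strand (stdlib sketch of the pyfaidx-based logic)."""
    # GFF/GTF is 1-based closed; Python slices are 0-based half-open.
    sub = seq[start - 1:stop]
    if strand == "-":
        comp = str.maketrans("ACGTacgt", "TGCAtgca")
        sub = sub.translate(comp)[::-1]
    return sub

print(extract("AACGTT", 2, 4, "+"))  # ACG
print(extract("AACGTT", 2, 4, "-"))  # CGT
```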
......@@ -2,6 +2,7 @@ import os
import six
import sqlite3
import shutil
import warnings
from gffutils import bins
from gffutils import helpers
from gffutils import constants
......@@ -150,6 +151,15 @@ class FeatureDB(object):
self.set_pragmas(pragmas)
if not self._analyzed():
warnings.warn(
"It appears that this database has not had the ANALYZE "
"sqlite3 command run on it. Doing so can dramatically "
"speed up queries, and is done by default for databases "
"created with gffutils >0.8.7.1 (this database was "
"created with version %s). Consider calling the analyze() "
"method of this object." % self.version)
def set_pragmas(self, pragmas):
"""
Set pragmas for the current database connection.
......@@ -178,6 +188,14 @@ class FeatureDB(object):
kwargs.setdefault('sort_attribute_values', self.sort_attribute_values)
return Feature(**kwargs)
def _analyzed(self):
res = self.execute(
"""
SELECT name FROM sqlite_master WHERE type='table'
AND name='sqlite_stat1';
""")
return len(list(res)) == 1
def schema(self):
"""
Returns the database schema as a string.
......@@ -442,6 +460,14 @@ class FeatureDB(object):
c = self.conn.cursor()
return c.execute(query)
def analyze(self):
"""
Runs the sqlite ANALYZE command to potentially speed up queries
dramatically.
"""
self.execute('ANALYZE features')
self.conn.commit()
def region(self, region=None, seqid=None, start=None, end=None,
strand=None, featuretype=None, completely_within=False):
"""
......
......@@ -2,6 +2,8 @@
import re
import copy
import collections
from six.moves import urllib
from gffutils import constants
from gffutils.exceptions import AttributeStringError
......@@ -16,6 +18,60 @@ logger.addHandler(ch)
gff3_kw_pat = re.compile(r'\w+=')
# Encoding/decoding notes
# -----------------------
# From
# https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md#description-of-the-format:
#
# GFF3 files are nine-column, tab-delimited, plain text files.
# Literal use of tab, newline, carriage return, the percent (%) sign,
# and control characters must be encoded using RFC 3986
# Percent-Encoding; no other characters may be encoded. Backslash and
# other ad-hoc escaping conventions that have been added to the GFF
# format are not allowed. The file contents may include any character
# in the set supported by the operating environment, although for
# portability with other systems, use of Latin-1 or Unicode are
# recommended.
#
# tab (%09)
# newline (%0A)
# carriage return (%0D)
# % percent (%25)
# control characters (%00 through %1F, %7F)
#
# In addition, the following characters have reserved meanings in
# column 9 and must be escaped when used in other contexts:
#
# ; semicolon (%3B)
# = equals (%3D)
# & ampersand (%26)
# , comma (%2C)
#
#
# See also issue #98.
#
# Note that spaces are NOT encoded. Some GFF files have spaces encoded; in
# these cases round-trip invariance will not hold since the %20 will be decoded
# but not re-encoded.
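These encoding rules can be exercised with the standard library. gffutils uses its own quoting table rather than `urllib.parse.quote`, so the hand-built `safe` set below is only an approximation for illustration:

```python
from urllib.parse import quote, unquote

# Column-9 reserved characters per the GFF3 spec excerpt above;
# note that space is deliberately NOT in this set.
reserved = ";=&,%"

def encode_attr(value):
    # quote() leaves characters in `safe` untouched; allow all printable
    # ASCII except the reserved set (sketch only, not gffutils' code).
    safe = "".join(chr(i) for i in range(32, 127) if chr(i) not in reserved)
    return quote(value, safe=safe)

print(encode_attr("marker name(s): T0028"))  # unchanged: spaces stay
print(encode_attr("HMM PF00491; match"))     # HMM PF00491%3B match
print(unquote("growth%20hormone%201"))       # growth hormone 1
```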
_to_quote = '\n\t\r%;=&,'
_to_quote += ''.join([chr(i) for i in range(32)])
_to_quote += chr(127)
# Caching idea from urllib.parse.Quoter, which uses a defaultdict for
# efficiency. Here we're sort of doing the reverse of the "reserved" idea used
# there.
class Quoter(collections.defaultdict):
def __missing__(self, b):
if b in _to_quote:
res = '%{:02X}'.format(ord(b))
else:
res = b
self[b] = res
return res
quoter = Quoter()
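The caching trick can be demonstrated with a self-contained copy of the quoting table and class above: the first lookup of a character computes its encoding in `__missing__` and stores it, so repeated characters hit a plain dict lookup.

```python
import collections

# Self-contained copy of the table and class above, for demonstration.
_to_quote = '\n\t\r%;=&,' + ''.join(chr(i) for i in range(32)) + chr(127)

class Quoter(collections.defaultdict):
    def __missing__(self, b):
        # Compute once, cache in the dict itself, return.
        res = '%{:02X}'.format(ord(b)) if b in _to_quote else b
        self[b] = res
        return res

quoter = Quoter()
encoded = ''.join(quoter[c] for c in 'a;b,c')
print(encoded)        # a%3Bb%2Cc
print(';' in quoter)  # True -- the encoding was cached on first use
```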
def _reconstruct(keyvals, dialect, keep_order=False,
sort_attribute_values=False):
......@@ -46,17 +102,27 @@ def _reconstruct(keyvals, dialect, keep_order=False,
return ""
parts = []
# Re-encode when reconstructing attributes
if constants.ignore_url_escape_characters or dialect['fmt'] != 'gff3':
attributes = keyvals
else:
attributes = {}
for k, v in keyvals.items():
attributes[k] = []
for i in v:
attributes[k].append(''.join([quoter[j] for j in i]))
# May need to split multiple values into multiple key/val pairs
if dialect['repeated keys']:
items = []
for key, val in keyvals.items():
for key, val in attributes.items():
if len(val) > 1:
for v in val:
items.append((key, [v]))
else:
items.append((key, val))
else:
items = list(keyvals.items())
items = list(attributes.items())
def sort_key(x):
# sort keys by their order in the dialect; anything not in there will
......@@ -87,7 +153,10 @@ def _reconstruct(keyvals, dialect, keep_order=False,
# Typically "=" for GFF3 or " " otherwise
part = dialect['keyval separator'].join([key, val_str])
else:
part = key
if dialect['fmt'] == 'gtf':
part = dialect['keyval separator'].join([key, '""'])
else:
part = key
parts.append(part)
# Typically ";" or "; "
......@@ -116,6 +185,19 @@ def _split_keyvals(keyval_str, dialect=None):
Otherwise, use the provided dialect (and return it at the end).
"""
def _unquote_quals(quals, dialect):
"""
Handles the unquoting (decoding) of percent-encoded characters.
See notes on encoding/decoding above.
"""
if not constants.ignore_url_escape_characters and dialect['fmt'] == 'gff3':
for key, vals in quals.items():
unquoted = [urllib.parse.unquote(v) for v in vals]
quals[key] = unquoted
return quals
infer_dialect = False
if dialect is None:
# Make a copy of default dialect so it can be modified as needed
......@@ -160,11 +242,14 @@ def _split_keyvals(keyval_str, dialect=None):
key, val = item
# Only key provided?
else:
assert len(item) == 1, item
elif len(item) == 1:
key = item[0]
val = ''
else:
key = item[0]
val = dialect['keyval separator'].join(item[1:])
try:
quals[key]
except KeyError:
......@@ -181,6 +266,7 @@ def _split_keyvals(keyval_str, dialect=None):
vals = val.split(',')
quals[key].extend(vals)
quals = _unquote_quals(quals, dialect)
return quals, dialect
# If we got here, then we need to infer the dialect....
......@@ -229,10 +315,16 @@ def _split_keyvals(keyval_str, dialect=None):
key, val = item
# Only key provided?
elif len(item) == 1:
key = item[0]
val = ''
# Pathological cases where a key's values themselves contain the
# key-val separator, e.g.,
# Alias=SGN-M1347;ID=T0028;Note=marker name(s): T0028 SGN-M1347 |identity=99.58|escore=2e-126
else:
assert len(item) == 1, item
key = item[0]
val = ''
val = dialect['keyval separator'].join(item[1:])
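The rejoining logic for these pathological cases can be sketched in isolation. `split_keyval` is a hypothetical helper: split on the separator once conceptually, then glue the tail back together so embedded separators survive.

```python
def split_keyval(item_str, sep="="):
    # Rejoin everything after the first separator so values that
    # themselves contain "=" (e.g. "|identity=99.58") survive intact.
    parts = item_str.split(sep)
    if len(parts) == 1:
        return parts[0], ""
    return parts[0], sep.join(parts[1:])

key, val = split_keyval("Note=marker T0028 |identity=99.58|escore=2e-126")
print(key)  # Note
print(val)  # marker T0028 |identity=99.58|escore=2e-126
```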
# Is the key already in there?
if key in quals:
......@@ -258,29 +350,11 @@ def _split_keyvals(keyval_str, dialect=None):
# keep track of the order of keys
dialect['order'].append(key)
#for key, vals in quals.items():
#
# TODO: urllib.unquote breaks round trip invariance for "hybrid1.gff3"
# test file. This is because the "Note" field has %xx escape chars,
# but "Dbxref" has ":" which, if everything were consistent, should
# have also been escaped.
#
# (By the way, GFF3 spec says only literal use of \t, \n, \r, %, and
# control characters should be encoded)
#
# Solution 1: don't unquote
# Solution 2: store, along with each attribute, whether or not it
# should be quoted later upon reconstruction
# Solution 3: don't care about invariance
# unquoted = [urllib.unquote(v) for v in vals]
#quals[key] = vals
if (
(dialect['keyval separator'] == ' ') and
(dialect['quoted GFF2 values'])
):
dialect['fmt'] = 'gtf'
quals = _unquote_quals(quals, dialect)
return quals, dialect
......@@ -89,10 +89,12 @@ attrs = [
'AFFX-U95:1332_f_at',
'Swissprot:SOMA_HUMAN',
],
'Note': ['growth%20hormone%201'],
'Note': ['growth hormone 1'],
'Alias': ['GH1']},
None,
'ID=A00469;Dbxref=AFFX-U133:205840_x_at,Locuslink:2688,Genbank-mRNA:'
'A00469,Swissprot:P01241,PFAM:PF00103,AFFX-U95:1332_f_at,Swissprot:'
'SOMA_HUMAN;Note=growth hormone 1;Alias=GH1',
),
# jgi_gff2.txt
......@@ -157,8 +159,8 @@ attrs = [
'Parent': ['NC_008596.1:speB'],
'locus_tag': ['MSMEG_1072'],
'EC_number': ['3.5.3.11'],
'note': ['identified%20by%20match%20to%20protein%20family%20HMM%20P'
'F00491%3B%20match%20to%20protein%20family%20HMM%20TIGR01'
'note': ['identified by match to protein family HMM P'
'F00491; match to protein family HMM TIGR01'
'230'],
'transl_table': ['11'],
'product': ['agmatinase'],
......@@ -167,7 +169,12 @@ attrs = [
'exon_number': ['1'],
},
None,
'ID=NC_008596.1:speB:unknown_transcript_1;Parent=NC_008596.1:speB;'
'locus_tag=MSMEG_1072;EC_number=3.5.3.11;note=identified by mat'
'ch to protein family HMM PF00491%3B match to prote'
'in family HMM TIGR01230;transl_table=11;product=agmatinase;p'
'rotein_id=YP_885468.1;db_xref=GI:118469242;db_xref=GeneID:4535378;'
'exon_number=1',
),
# wormbase_gff2_alt.txt
......
I 230218
II 813184
III 316620
IV 1531933
IX 439888
Mito 85779
V 576874
VI 270161
VII 1090940
VIII 562643
X 745751
XI 666816
XII 1078177
XIII 924431
XIV 784333
XV 1091291
XVI 948066
>chr2L
Cgacaatgcacgacagaggaagcagaacagatatttagattgcctctcat
tttctctcccatattatagggagaaatatgatcgcgtatgcgagagtagt
gccaacatattgtgctctttgattttttggcaacccaaaatggtggcgga
tgaaCGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCA
TTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTG
CCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCG
CAAACCCAAAAAgacaatacacgacagagagagagagcagcggagatatt
tagattgcctattaaatatgatcgcgtatgcgagagtagtgccaacatat
tgtgctctCTATATAATGACTGCCTCTCATTCTGTCTTATTTTACCGCAA
ACCCAAatcgacaatgcacgacagaggaagcagaacagatatttagattg
cctctcattttctctcccatattatagggagaaatatgatcgcgtatgcg
agagtagtgccaacatattgtgctctttgattttttggcaacccaaaatg
gtggcggatgaaCGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAA
TAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGA
GAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTA
TATTACCGCAAACCCAAAAAgacaatacacgacagagagagagagcagcg
gagatatttagattgcctattaaatatgatcgcgtatgcgagagtagtgc
caacatattgtgctctCTATATAATGACTGCCTCTCATTCTGTCTTATTT
TACCGCAAACCCAAatcgacaatgcacgacagaggaagcagaacagatat
ttagattgcctctcattttctctcccatattatagggagaaatatgatcg
cgtatgcgagagtagtgccaacatattgtgctctttgattttttggcaac
ccaaaatggtggcggatgaaCGAGATGATAATATATTCAAGTTGCCGCTA
ATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGC
GTATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCT
CTGTCTTATATTACCGCAAACCCAAAAAgacaatacacgacagagagaga
gagcagcggagatatttagattgcctattaaatatgatcgcgtatgcgag
agtagtgccaacatattgtgctctCTATATAATGACTGCCTCTCATTCTG
TCTTATTTTACCGCAAACCCAAatcgacaatgcacgacagaggaagcaga
acagatatttagattgcctctcattttctctcccatattatagggagaaa
tatgatcgcgtatgcgagagtagtgccaacatattgtgctctttgatttt
ttggcaacccaaaatggtggcggatgaaCGAGATGATAATATATTCAAGT
TGCCGCTAATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATAT
ATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCT
CTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAgacaatacacgaca
gagagagagagcagcggagatatttagattgcctattaaatatgatcgcg
tatgcgagagtagtgccaacatattgtgctctCTATATAATGACTGCCTC
TCATTCTGTCTTATTTTACCGCAAACCCAAatcgacaatgcacgacagag
gaagcagaacagatatttagattgcctctcattttctctcccatattata
gggagaaatatgatcgcgtatgcgagagtagtgccaacatattgtgctct
ttgattttttggcaacccaaaatggtggcggatgaaCGAGATGATAATAT
ATTCAAGTTGCCGCTAATCAGAAATAAATTCATTGCAACGTTAAATACAG
CACAATATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTAAT
GAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAgacaat
acacgacagagagagagagcagcggagatatttagattgcctattaaata
tgatcgcgtatgcgagagtagtgccaacatattgtgctctCTATATAATG
ACTGCCTCTCATTCTGTCTTATTTTACCGCAAACCCAAatcgacaatgca
cgacagaggaagcagaacagatatttagattgcctctcattttctctccc
atattatagggagaaatatgatcgcgtatgcgagagtagtgccaacatat
tgtgctctttgattttttggcaacccaaaatggtggcggatgaaCGAGAT
# Download large annotation files needed for testing
# to gffutils/test/data/ directory.
cd $(dirname $0)
wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_mouse/release_M8/gencode.vM8.annotation.gff3.gz
gzip -d gencode.vM8.annotation.gff3.gz
wget ftp://ftp.ensembl.org/pub/release-83/gff3/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.83.gff3.gz
gzip -d Saccharomyces_cerevisiae.R64-1-1.83.gff3.gz