Andreas Tille · Andreas Tille · Andreas Tille · Andreas Tille · Andreas Tille · Andreas Tille
--- a/PKG-INFO
+++ b/PKG-INFO
 Metadata-Version: 1.1
 Name: PyVCF
-Version: 0.6.7
+Version: 0.6.8
 Summary: Variant Call Format (VCF) parser for Python
 Home-page: https://github.com/jamescasbon/PyVCF
 Author: James Casbon and @jdoughertyii
 Author-email: casbon@gmail.com
 License: UNKNOWN
-Description: A VCFv4.0 and 4.1 parser for Python.
-        
-        Online version of PyVCF documentation is available at http://pyvcf.rtfd.org/
-        
-        The intent of this module is to mimic the ``csv`` module in the Python stdlib,
-        as opposed to more flexible serialization formats like JSON or YAML.  ``vcf``
-        will attempt to parse the content of each record based on the data types
-        specified in the meta-information lines --  specifically the ##INFO and
-        ##FORMAT lines.  If these lines are missing or incomplete, it will check
-        against the reserved types mentioned in the spec.  Failing that, it will just
-        return strings.
-        
-        There main interface is the class: ``Reader``.  It takes a file-like
-        object and acts as a reader::
-        
-            >>> import vcf
-            >>> vcf_reader = vcf.Reader(open('vcf/test/example-4.0.vcf', 'r'))
-            >>> for record in vcf_reader:
-            ...     print record
-            Record(CHROM=20, POS=14370, REF=G, ALT=[A])
-            Record(CHROM=20, POS=17330, REF=T, ALT=[A])
-            Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
-            Record(CHROM=20, POS=1230237, REF=T, ALT=[None])
-            Record(CHROM=20, POS=1234567, REF=GTCT, ALT=[G, GTACT])
-        
-        
-        This produces a great deal of information, but it is conveniently accessed.
-        The attributes of a Record are the 8 fixed fields from the VCF spec::
-        
-            * ``Record.CHROM``
-            * ``Record.POS``
-            * ``Record.ID``
-            * ``Record.REF``
-            * ``Record.ALT``
-            * ``Record.QUAL``
-            * ``Record.FILTER``
-            * ``Record.INFO``
-        
-        plus attributes to handle genotype information:
-        
-            * ``Record.FORMAT``
-            * ``Record.samples``
-            * ``Record.genotype``
-        
-        ``samples`` and ``genotype``, not being the title of any column, are left lowercase.  The format
-        of the fixed fields is from the spec.  Comma-separated lists in the VCF are
-        converted to lists.  In particular, one-entry VCF lists are converted to
-        one-entry Python lists (see, e.g., ``Record.ALT``).  Semicolon-delimited lists
-        of key=value pairs are converted to Python dictionaries, with flags being given
-        a ``True`` value. Integers and floats are handled exactly as you'd expect::
-        
-            >>> vcf_reader = vcf.Reader(open('vcf/test/example-4.0.vcf', 'r'))
-            >>> record = vcf_reader.next()
-            >>> print record.POS
-            14370
-            >>> print record.ALT
-            [A]
-            >>> print record.INFO['AF']
-            [0.5]
-        
-        There are a number of convienience methods and properties for each ``Record`` allowing you to
-        examine properties of interest::
-        
-            >>> print record.num_called, record.call_rate, record.num_unknown
-            3 1.0 0
-            >>> print record.num_hom_ref, record.num_het, record.num_hom_alt
-            1 1 1
-            >>> print record.nucl_diversity, record.aaf, record.heterozygosity
-            0.6 [0.5] 0.5
-            >>> print record.get_hets()
-            [Call(sample=NA00002, CallData(GT=1|0, GQ=48, DP=8, HQ=[51, 51]))]
-            >>> print record.is_snp, record.is_indel, record.is_transition, record.is_deletion
-            True False True False
-            >>> print record.var_type, record.var_subtype
-            snp ts
-            >>> print record.is_monomorphic
-            False
-        
-        ``record.FORMAT`` will be a string specifying the format of the genotype
-        fields.  In case the FORMAT column does not exist, ``record.FORMAT`` is
-        ``None``.  Finally, ``record.samples`` is a list of dictionaries containing the
-        parsed sample column and ``record.genotype`` is a way of looking up genotypes
-        by sample name::
-        
-            >>> record = vcf_reader.next()
-            >>> for sample in record.samples:
-            ...     print sample['GT']
-            0|0
-            0|1
-            0/0
-            >>> print record.genotype('NA00001')['GT']
-            0|0
-        
-        The genotypes are represented by ``Call`` objects, which have three attributes: the
-        corresponding Record ``site``, the sample name in ``sample`` and a dictionary of
-        call data in ``data``::
-        
-             >>> call = record.genotype('NA00001')
-             >>> print call.site
-             Record(CHROM=20, POS=17330, REF=T, ALT=[A])
-             >>> print call.sample
-             NA00001
-             >>> print call.data
-             CallData(GT=0|0, GQ=49, DP=3, HQ=[58, 50])
-        
-        Please note that as of release 0.4.0, attributes known to have single values (such as
-        ``DP`` and ``GQ`` above) are returned as values.  Other attributes are returned
-        as lists (such as ``HQ`` above).
-        
-        There are also a number of methods::
-        
-            >>> print call.called, call.gt_type, call.gt_bases, call.phased
-            True 0 T|T True
-        
-        Metadata regarding the VCF file itself can be investigated through the
-        following attributes:
-        
-            * ``Reader.metadata``
-            * ``Reader.infos``
-            * ``Reader.filters``
-            * ``Reader.formats``
-            * ``Reader.samples``
-        
-        For example::
-        
-            >>> vcf_reader.metadata['fileDate']
-            '20090805'
-            >>> vcf_reader.samples
-            ['NA00001', 'NA00002', 'NA00003']
-            >>> vcf_reader.filters
-            OrderedDict([('q10', Filter(id='q10', desc='Quality below 10')), ('s50', Filter(id='s50', desc='Less than 50% of samples have data'))])
-            >>> vcf_reader.infos['AA'].desc
-            'Ancestral Allele'
-        
-        ALT records are actually classes, so that you can interrogate them::
-        
-            >>> reader = vcf.Reader(open('vcf/test/example-4.1-bnd.vcf'))
-            >>> _ = reader.next(); row = reader.next()
-            >>> print row
-            Record(CHROM=1, POS=2, REF=T, ALT=[T[2:3[])
-            >>> bnd = row.ALT[0]
-            >>> print bnd.withinMainAssembly, bnd.orientation, bnd.remoteOrientation, bnd.connectingSequence
-            True False True T
-        
-        Random access is supported for files with tabix indexes.  Simply call fetch for the
-        region you are interested in::
-        
-            >>> vcf_reader = vcf.Reader(filename='vcf/test/tb.vcf.gz')
-            >>> for record in vcf_reader.fetch('20', 1110696, 1230237):  # doctest: +SKIP
-            ...     print record
-            Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
-            Record(CHROM=20, POS=1230237, REF=T, ALT=[None])
-        
-        Or extract a single row::
-        
-            >>> print vcf_reader.fetch('20', 1110696)  # doctest: +SKIP
-            Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
-        
-        
-        The ``Writer`` class provides a way of writing a VCF file.  Currently, you must specify a
-        template ``Reader`` which provides the metadata::
-        
-            >>> vcf_reader = vcf.Reader(filename='vcf/test/tb.vcf.gz')
-            >>> vcf_writer = vcf.Writer(open('/dev/null', 'w'), vcf_reader)
-            >>> for record in vcf_reader:
-            ...     vcf_writer.write_record(record)
-        
-        
-        An extensible script is available to filter vcf files in vcf_filter.py.  VCF filters
-        declared by other packages will be available for use in this script.  Please
-        see :doc:`FILTERS` for full description.
-        
-        
+Description: UNKNOWN
 Keywords: bioinformatics
 Platform: UNKNOWN
 Classifier: Development Status :: 4 - Beta
 Classifier: Intended Audience :: Developers
 Classifier: Intended Audience :: Science/Research
+Classifier: License :: OSI Approved :: BSD License
+Classifier: License :: OSI Approved :: MIT License
 Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Cython
 Classifier: Programming Language :: Python
+Classifier: Programming Language :: Python :: 2
+Classifier: Programming Language :: Python :: 2.6
+Classifier: Programming Language :: Python :: 2.7
 Classifier: Programming Language :: Python :: 3
-Classifier: Topic :: Scientific/Engineering
+Classifier: Programming Language :: Python :: 3.2
+Classifier: Programming Language :: Python :: 3.3
+Classifier: Programming Language :: Python :: 3.4
+Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
--- a/README.rst
+++ b/README.rst
@@ -50,7 +50,7 @@ of key=value pairs are converted to Python dictionaries, with flags being given
 a ``True`` value. Integers and floats are handled exactly as you'd expect::

    >>> vcf_reader = vcf.Reader(open('vcf/test/example-4.0.vcf', 'r'))
-    >>> record = vcf_reader.next()
+    >>> record = next(vcf_reader)
    >>> print record.POS
    14370
    >>> print record.ALT
@@ -58,7 +58,7 @@ a ``True`` value. Integers and floats are handled exactly as you'd expect::
    >>> print record.INFO['AF']
    [0.5]

-There are a number of convienience methods and properties for each ``Record`` allowing you to
+There are a number of convenience methods and properties for each ``Record`` allowing you to
 examine properties of interest::

    >>> print record.num_called, record.call_rate, record.num_unknown
@@ -82,7 +82,7 @@ fields.  In case the FORMAT column does not exist, ``record.FORMAT`` is
 parsed sample column and ``record.genotype`` is a way of looking up genotypes
 by sample name::

-    >>> record = vcf_reader.next()
+    >>> record = next(vcf_reader)
    >>> for sample in record.samples:
    ...     print sample['GT']
    0|0
@@ -135,27 +135,38 @@ For example::
 ALT records are actually classes, so that you can interrogate them::

    >>> reader = vcf.Reader(open('vcf/test/example-4.1-bnd.vcf'))
-    >>> _ = reader.next(); row = reader.next()
+    >>> _ = next(reader); row = next(reader)
    >>> print row
    Record(CHROM=1, POS=2, REF=T, ALT=[T[2:3[])
    >>> bnd = row.ALT[0]
    >>> print bnd.withinMainAssembly, bnd.orientation, bnd.remoteOrientation, bnd.connectingSequence
    True False True T

-Random access is supported for files with tabix indexes.  Simply call fetch for the
-region you are interested in::
+The Reader supports retrieval of records within designated regions for files
+with tabix indexes via the fetch method. This requires the pysam module as a
+dependency. Pass in a chromosome, and, optionally, start and end coordinates,
+for the regions of interest::

    >>> vcf_reader = vcf.Reader(filename='vcf/test/tb.vcf.gz')
-    >>> for record in vcf_reader.fetch('20', 1110696, 1230237):  # doctest: +SKIP
+    >>> # fetch all records on chromosome 20 from base 1110696 through 1230237
+    >>> for record in vcf_reader.fetch('20', 1110695, 1230237):  # doctest: +SKIP
    ...     print record
    Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
    Record(CHROM=20, POS=1230237, REF=T, ALT=[None])

-Or extract a single row::
+Note that the start and end coordinates are in the zero-based, half-open
+coordinate system, similar to ``_Record.start`` and ``_Record.end``. The very
+first base of a chromosome is index 0, and the the region includes bases up
+to, but not including the base at the end coordinate. For example::

-    >>> print vcf_reader.fetch('20', 1110696)  # doctest: +SKIP
-    Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
+    >>> # fetch all records on chromosome 4 from base 11 through 20
+    >>> vcf_reader.fetch('4', 10, 20)   # doctest: +SKIP

+would include all records overlapping a 10 base pair region from the 11th base
+of through the 20th base (which is at index 19) of chromosome 4. It would not
+include the 21st base (at index 20). (See
+http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms for more
+information on the zero-based, half-open coordinate system.)

 The ``Writer`` class provides a way of writing a VCF file.  Currently, you must specify a
 template ``Reader`` which provides the metadata::
@@ -165,8 +176,6 @@ template ``Reader`` which provides the metadata::
    >>> for record in vcf_reader:
    ...     vcf_writer.write_record(record)

-
 An extensible script is available to filter vcf files in vcf_filter.py.  VCF filters
 declared by other packages will be available for use in this script.  Please
 see :doc:`FILTERS` for full description.
-
--- a/debian/changelog
+++ b/debian/changelog
+python-pyvcf (0.6.8-1~bpo8+1) jessie-backports; urgency=medium
+
+  * Rebuild for jessie-backports.
+  * export DEB_BUILD_MAINT_OPTIONS = hardening=+all,-pie
+
+ -- Andreas Tille <tille@debian.org>  Fri, 09 Jun 2017 23:06:01 +0200
+
+python-pyvcf (0.6.8-1) unstable; urgency=medium
+
+  * New upstream version
+  * Fix watch file
+  * cme fix dpkg-control
+  * debhelper 10
+  * hardening=+all
+  * Remove unused lintian override
+
+ -- Andreas Tille <tille@debian.org>  Fri, 02 Dec 2016 20:33:48 +0100
+
+python-pyvcf (0.6.7-2) unstable; urgency=medium
+
+  [ Scott Kitterman ]
+  * Change python3-dev build-dep to python3-all-dev so package is built for
+    all supported python3 versions
+    Closes: #809487
+
+ -- Andreas Tille <tille@debian.org>  Thu, 31 Dec 2015 12:22:30 +0100
+
 python-pyvcf (0.6.7-1~bpo8+1) jessie-backports; urgency=medium

  * Rebuild for jessie-backports.

--- a/debian/compat
+++ b/debian/compat
-9
+10
--- a/debian/control
+++ b/debian/control
@@ -3,17 +3,17 @@ Maintainer: Debian Med Packaging Team <debian-med-packaging@lists.alioth.debian.
 Uploaders: Andreas Tille <tille@debian.org>
 Section: python
 Priority: optional
-Build-Depends: debhelper (>= 9),
+Build-Depends: debhelper (>= 10),
               dh-python,
               cython,
-               python-dev (>= 2.7),
+               python-dev,
               python-setuptools,
               cython3,
-               python3-dev,
+               python3-all-dev,
               python3-setuptools
-Standards-Version: 3.9.6
+Standards-Version: 3.9.8
 Vcs-Browser: https://anonscm.debian.org/cgit/debian-med/python-pyvcf.git
-Vcs-Git: git://anonscm.debian.org/debian-med/python-pyvcf.git
+Vcs-Git: https://anonscm.debian.org/git/debian-med/python-pyvcf.git
 Homepage: https://pypi.python.org/pypi/PyVCF

 Package: python-pyvcf

--- a/debian/get-orig-source
+++ b/debian/get-orig-source
@@ -2,7 +2,6 @@
 # Test-Data needs to be fetched separately from Git

 set -e
-set -x

 PKG=`dpkg-parsechangelog | awk '/^Source/ { print $2 }'`

@@ -10,7 +9,7 @@ if ! echo $@ | grep -q upstream-version ; then
    VERSION=`dpkg-parsechangelog | awk '/^Version:/ { print $2 }' | sed 's/\([0-9\.]\+\)-[0-9]\+$/\1/'`
    uscan --verbose --force-download
 else
-    VERSION=`echo $@ | sed 's?^.*--upstream-version \([0-9.]\+\) .*?\1?'`
+    VERSION=`echo $@ | sed 's?^.*--upstream-version \([0-9.]\+\).*?\1?'`
    if echo "$VERSION" | grep -q "upstream-version" ; then
 	echo "Unable to parse version number"
        exit

--- a/debian/patches/series
+++ b/debian/patches/series
-use_setuptools.patch
--- a/debian/patches/use_setuptools.patch
+++ b/debian/patches/use_setuptools.patch
-Author: Andreas Tille <tille@debian.org>
-Last-Updated: Tue, 13 Oct 2015 17:02:08 +0200
-Description: Use setuptools instead of distribute
- see https://lists.alioth.debian.org/pipermail/debian-med-packaging/2015-October/035410.html
-
--- a/setup.py
-+++ b/setup.py
-@@ -53,7 +53,7 @@ setup(
-     description='Variant Call Format (VCF) parser for Python',
-     long_description=DOC,
-     test_suite='vcf.test.test_vcf.suite',
-    install_requires=['distribute'],
-+    install_requires=['setuptools'],
-     requires=requires,
-     entry_points = {
-         'vcf.filters': [
--- a/debian/python-pyvcf-examples.lintian-overrides
+++ b/debian/python-pyvcf-examples.lintian-overrides
-# This is an unchanged file from upstream tarball
-python-pyvcf-examples: package-contains-timestamped-gzip usr/share/doc/python-pyvcf-examples/test/1kg.vcf.gz
-
--- a/debian/rules
+++ b/debian/rules
@@ -4,6 +4,8 @@
 # Uncomment this to turn on verbose mode.
 #export DH_VERBOSE=1

+export DEB_BUILD_MAINT_OPTIONS = hardening=+all,-pie
+
 export PYBUILD_NAME=pyvcf
 examplesdir=debian/python-$(PYBUILD_NAME)-examples/usr/share/doc/python-$(PYBUILD_NAME)-examples
 helperdir=debian/$(PYBUILD_NAME)/usr
@@ -21,7 +23,9 @@ override_dh_install:
 	mkdir -p $(helperdir)
 	mv debian/python3-$(PYBUILD_NAME)/usr/bin $(helperdir)
 	# strip .py extension
-	mv $(helperdir)/bin/vcf_filter.py $(helperdir)/bin/vcf_filter
+	for py in $(helperdir)/bin/*.py ; do \
+	    mv $${py} $(helperdir)/bin/`basename $${py} .py` ; \
+	done
 	rm -rf debian/python-$(PYBUILD_NAME)/usr/bin

 get-orig-source:

--- a/debian/watch
+++ b/debian/watch
 # Test-Data needs to be fetched separately from Git so wee need to call get-orig-source
-
-version=3
-https://pypi.python.org/pypi/PyVCF .*/PyVCF-([0-9.]+).tar.[xgb]z2? \
+version=4
+http://pypi.debian.net/PyVCF/PyVCF-(.+)\.(?:tar(?:\.gz|\.bz2)?|tgz) \
  debian debian/get-orig-source
--- a/scripts/vcf_sample_filter.py
+++ b/scripts/vcf_sample_filter.py
+#!/usr/bin/env python
+
+# Author: Lenna X. Peterson
+# github.com/lennax
+# arklenna at gmail dot com
+
+import argparse
+import logging
+
+from vcf import SampleFilter
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("file", help="VCF file to filter")
+    parser.add_argument("-o", metavar="outfile",
+                       help="File to write out filtered samples")
+    parser.add_argument("-f", metavar="filters",
+                       help="Comma-separated list of sample indices or names \
+                        to filter")
+    parser.add_argument("-i", "--invert", action="store_true",
+                       help="Keep rather than discard the filtered samples")
+    parser.add_argument("-q", "--quiet", action="store_true",
+                       help="Less output")
+
+    args = parser.parse_args()
+
+    if args.quiet:
+        log_level = logging.WARNING
+    else:
+        log_level = logging.INFO
+    logging.basicConfig(format='%(message)s', level=log_level)
+
+    sf = SampleFilter(infile=args.file, outfile=args.o,
+                      filters=args.f, invert=args.invert)
+    if args.f is None:
+        print "Samples:"
+        for idx, val in enumerate(sf.samples):
+            print "{0}: {1}".format(idx, val)
--- a/setup.cfg
+++ b/setup.cfg
 [egg_info]
-tag_build = 
 tag_date = 0
+tag_build = 
 tag_svn_revision = 0

--- a/setup.py
+++ b/setup.py
 from setuptools import setup
-from distutils.core import setup
 from distutils.extension import Extension
+import sys

 try:
    from Cython.Distutils import build_ext
@@ -8,23 +8,14 @@ try:
 except:
    CYTHON = False

-requires = []
+IS_PYTHON26 = sys.version_info[:2] == (2, 6)

-# python 2.6 does not have argparse
-try:
-    import argparse
-except ImportError:
-    requires.append('argparse')
+DEPENDENCIES = ['setuptools']
+
+if IS_PYTHON26:
+    DEPENDENCIES.extend(['argparse', 'counter', 'ordereddict',
+                         'unittest2'])

-import collections
-try:
-    collections.Counter
-except AttributeError:
-    requires.append('counter')
-try:
-    collections.OrderedDict
-except AttributeError:
-    requires.append('ordereddict')

 # get the version without an import
 VERSION = "Undefined"
@@ -47,14 +38,14 @@ if CYTHON:
 setup(
    name='PyVCF',
    packages=['vcf', 'vcf.test'],
-    scripts=['scripts/vcf_melt', 'scripts/vcf_filter.py'],
+    scripts=['scripts/vcf_melt', 'scripts/vcf_filter.py',
+             'scripts/vcf_sample_filter.py'],
    author='James Casbon and @jdoughertyii',
    author_email='casbon@gmail.com',
    description='Variant Call Format (VCF) parser for Python',
    long_description=DOC,
    test_suite='vcf.test.test_vcf.suite',
-    install_requires=['distribute'],
-    requires=requires,
+    install_requires=DEPENDENCIES,
    entry_points = {
        'vcf.filters': [
            'site_quality = vcf.filters:SiteQuality',
@@ -71,10 +62,19 @@ setup(
        'Development Status :: 4 - Beta',
        'Intended Audience :: Developers',
        'Intended Audience :: Science/Research',
+        'License :: OSI Approved :: BSD License',
+        'License :: OSI Approved :: MIT License',
        'Operating System :: OS Independent',
+        'Programming Language :: Cython',
        'Programming Language :: Python',
+        'Programming Language :: Python :: 2',
+        'Programming Language :: Python :: 2.6',
+        'Programming Language :: Python :: 2.7',
        'Programming Language :: Python :: 3',
-        'Topic :: Scientific/Engineering',
+        'Programming Language :: Python :: 3.2',
+        'Programming Language :: Python :: 3.3',
+        'Programming Language :: Python :: 3.4',
+        'Topic :: Scientific/Engineering :: Bio-Informatics',
      ],
    keywords='bioinformatics',
    use_2to3=True,

--- a/vcf/__init__.py
+++ b/vcf/__init__.py
 #!/usr/bin/env python
-'''A VCFv4.0 and 4.1 parser for Python.
+"""
+A VCFv4.0 and 4.1 parser for Python.

 Online version of PyVCF documentation is available at http://pyvcf.rtfd.org/
+"""

-The intent of this module is to mimic the ``csv`` module in the Python stdlib,
-as opposed to more flexible serialization formats like JSON or YAML.  ``vcf``
-will attempt to parse the content of each record based on the data types
-specified in the meta-information lines --  specifically the ##INFO and
-##FORMAT lines.  If these lines are missing or incomplete, it will check
-against the reserved types mentioned in the spec.  Failing that, it will just
-return strings.

-There main interface is the class: ``Reader``.  It takes a file-like
-object and acts as a reader::
-
-    >>> import vcf
-    >>> vcf_reader = vcf.Reader(open('vcf/test/example-4.0.vcf', 'r'))
-    >>> for record in vcf_reader:
-    ...     print record
-    Record(CHROM=20, POS=14370, REF=G, ALT=[A])
-    Record(CHROM=20, POS=17330, REF=T, ALT=[A])
-    Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
-    Record(CHROM=20, POS=1230237, REF=T, ALT=[None])
-    Record(CHROM=20, POS=1234567, REF=GTCT, ALT=[G, GTACT])
-
-
-This produces a great deal of information, but it is conveniently accessed.
-The attributes of a Record are the 8 fixed fields from the VCF spec::
-
-    * ``Record.CHROM``
-    * ``Record.POS``
-    * ``Record.ID``
-    * ``Record.REF``
-    * ``Record.ALT``
-    * ``Record.QUAL``
-    * ``Record.FILTER``
-    * ``Record.INFO``
-
-plus attributes to handle genotype information:
-
-    * ``Record.FORMAT``
-    * ``Record.samples``
-    * ``Record.genotype``
-
-``samples`` and ``genotype``, not being the title of any column, are left lowercase.  The format
-of the fixed fields is from the spec.  Comma-separated lists in the VCF are
-converted to lists.  In particular, one-entry VCF lists are converted to
-one-entry Python lists (see, e.g., ``Record.ALT``).  Semicolon-delimited lists
-of key=value pairs are converted to Python dictionaries, with flags being given
-a ``True`` value. Integers and floats are handled exactly as you'd expect::
-
-    >>> vcf_reader = vcf.Reader(open('vcf/test/example-4.0.vcf', 'r'))
-    >>> record = vcf_reader.next()
-    >>> print record.POS
-    14370
-    >>> print record.ALT
-    [A]
-    >>> print record.INFO['AF']
-    [0.5]
-
-There are a number of convienience methods and properties for each ``Record`` allowing you to
-examine properties of interest::
-
-    >>> print record.num_called, record.call_rate, record.num_unknown
-    3 1.0 0
-    >>> print record.num_hom_ref, record.num_het, record.num_hom_alt
-    1 1 1
-    >>> print record.nucl_diversity, record.aaf, record.heterozygosity
-    0.6 [0.5] 0.5
-    >>> print record.get_hets()
-    [Call(sample=NA00002, CallData(GT=1|0, GQ=48, DP=8, HQ=[51, 51]))]
-    >>> print record.is_snp, record.is_indel, record.is_transition, record.is_deletion
-    True False True False
-    >>> print record.var_type, record.var_subtype
-    snp ts
-    >>> print record.is_monomorphic
-    False
-
-``record.FORMAT`` will be a string specifying the format of the genotype
-fields.  In case the FORMAT column does not exist, ``record.FORMAT`` is
-``None``.  Finally, ``record.samples`` is a list of dictionaries containing the
-parsed sample column and ``record.genotype`` is a way of looking up genotypes
-by sample name::
-
-    >>> record = vcf_reader.next()
-    >>> for sample in record.samples:
-    ...     print sample['GT']
-    0|0
-    0|1
-    0/0
-    >>> print record.genotype('NA00001')['GT']
-    0|0
-
-The genotypes are represented by ``Call`` objects, which have three attributes: the
-corresponding Record ``site``, the sample name in ``sample`` and a dictionary of
-call data in ``data``::
-
-     >>> call = record.genotype('NA00001')
-     >>> print call.site
-     Record(CHROM=20, POS=17330, REF=T, ALT=[A])
-     >>> print call.sample
-     NA00001
-     >>> print call.data
-     CallData(GT=0|0, GQ=49, DP=3, HQ=[58, 50])
-
-Please note that as of release 0.4.0, attributes known to have single values (such as
-``DP`` and ``GQ`` above) are returned as values.  Other attributes are returned
-as lists (such as ``HQ`` above).
-
-There are also a number of methods::
-
-    >>> print call.called, call.gt_type, call.gt_bases, call.phased
-    True 0 T|T True
-
-Metadata regarding the VCF file itself can be investigated through the
-following attributes:
-
-    * ``Reader.metadata``
-    * ``Reader.infos``
-    * ``Reader.filters``
-    * ``Reader.formats``
-    * ``Reader.samples``
-
-For example::
-
-    >>> vcf_reader.metadata['fileDate']
-    '20090805'
-    >>> vcf_reader.samples
-    ['NA00001', 'NA00002', 'NA00003']
-    >>> vcf_reader.filters
-    OrderedDict([('q10', Filter(id='q10', desc='Quality below 10')), ('s50', Filter(id='s50', desc='Less than 50% of samples have data'))])
-    >>> vcf_reader.infos['AA'].desc
-    'Ancestral Allele'
-
-ALT records are actually classes, so that you can interrogate them::
-
-    >>> reader = vcf.Reader(open('vcf/test/example-4.1-bnd.vcf'))
-    >>> _ = reader.next(); row = reader.next()
-    >>> print row
-    Record(CHROM=1, POS=2, REF=T, ALT=[T[2:3[])
-    >>> bnd = row.ALT[0]
-    >>> print bnd.withinMainAssembly, bnd.orientation, bnd.remoteOrientation, bnd.connectingSequence
-    True False True T
-
-Random access is supported for files with tabix indexes.  Simply call fetch for the
-region you are interested in::
-
-    >>> vcf_reader = vcf.Reader(filename='vcf/test/tb.vcf.gz')
-    >>> for record in vcf_reader.fetch('20', 1110696, 1230237):  # doctest: +SKIP
-    ...     print record
-    Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
-    Record(CHROM=20, POS=1230237, REF=T, ALT=[None])
-
-Or extract a single row::
-
-    >>> print vcf_reader.fetch('20', 1110696)  # doctest: +SKIP
-    Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
-
-
-The ``Writer`` class provides a way of writing a VCF file.  Currently, you must specify a
-template ``Reader`` which provides the metadata::
-
-    >>> vcf_reader = vcf.Reader(filename='vcf/test/tb.vcf.gz')
-    >>> vcf_writer = vcf.Writer(open('/dev/null', 'w'), vcf_reader)
-    >>> for record in vcf_reader:
-    ...     vcf_writer.write_record(record)
-
-
-An extensible script is available to filter vcf files in vcf_filter.py.  VCF filters
-declared by other packages will be available for use in this script.  Please
-see :doc:`FILTERS` for full description.
-
-'''
 from vcf.parser import Reader, Writer
 from vcf.parser import VCFReader, VCFWriter
 from vcf.filters import Base as Filter
 from vcf.parser import RESERVED_INFO, RESERVED_FORMAT
+from vcf.sample_filter import SampleFilter

-VERSION = '0.6.7'
+VERSION = '0.6.8'
--- a/vcf/cparse.pyx
+++ b/vcf/cparse.pyx
@@ -36,7 +36,10 @@ def parse_samples(
            vals = sampvals[j]

            # short circuit the most common
-            if vals == '.' or vals == './.':
+            if samp_fmt._fields[j] == 'GT':
+                sampdat[j] = vals
+                continue
+            elif not vals or vals == '.':
                sampdat[j] = None
                continue


--- a/vcf/model.py
+++ b/vcf/model.py
 from abc import ABCMeta, abstractmethod
 import collections
 import sys
+import re

 try:
    from collections import Counter
 except ImportError:
    from counter import Counter

+allele_delimiter = re.compile(r'''[|/]''') # to split a genotype into alleles

 class _Call(object):
    """ A genotype call, a cell entry in a VCF file"""

-    __slots__ = ['site', 'sample', 'data', 'gt_nums', 'called']
+    __slots__ = ['site', 'sample', 'data', 'gt_nums', 'gt_alleles', 'called', 'ploidity']

    def __init__(self, site, sample, data):
        #: The ``_Record`` for this ``_Call``
        self.site = site
        #: The sample name
        self.sample = sample
-        #: Dictionary of data from the VCF file
+        #: Namedtuple of data from the VCF file
        self.data = data
-        try:
-            self.gt_nums = self.data.GT
-            #: True if the GT is not ./.
-            self.called = self.gt_nums is not None
-        except AttributeError:
-            self.gt_nums = None
+
+        if hasattr(self.data, 'GT'):
+            self.gt_alleles = [(al if al != '.' else None) for al in allele_delimiter.split(self.data.GT)]
+            self.ploidity = len(self.gt_alleles)
+            self.called = all([al != None for al in self.gt_alleles])
+            self.gt_nums = self.data.GT if self.called else None
+        else:
            #62 a call without a genotype is not defined as called or not
+            self.gt_alleles = None
+            self.ploidity = None
            self.called = None
+            self.gt_nums = None

    def __repr__(self):
        return "Call(sample=%s, %s)" % (self.sample, str(self.data))
@@ -50,12 +56,6 @@ class _Call(object):
    def gt_phase_char(self):
        return "/" if not self.phased else "|"

-    @property
-    def gt_alleles(self):
-        '''The numbers of the alleles called at a given sample'''
-        # grab the numeric alleles of the gt string; tokenize by phasing
-        return self.gt_nums.split(self.gt_phase_char())
-
    @property
    def gt_bases(self):
        '''The actual genotype alleles.
@@ -125,10 +125,45 @@ class _Record(object):
        INFO and FORMAT are available as properties.

        The list of genotype calls is in the ``samples`` property.
+
+        Regarding the coordinates associated with each instance:
+
+        - ``POS``, per VCF specification, is the one-based index
+          (the first base of the contig has an index of 1) of the first
+          base of the ``REF`` sequence.
+        - The ``start`` and ``end`` denote the coordinates of the entire
+          ``REF`` sequence in the zero-based, half-open coordinate
+          system (see
+          http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms),
+          where the first base of the contig has an index of 0, and the
+          interval runs up to, but does not include, the base at the
+          ``end`` index. This indexing scheme is analagous to Python
+          slice notation.
+        - The ``affected_start`` and ``affected_end`` coordinates are
+          also in the zero-based, half-open coordinate system. These
+          coordinates indicate the precise region of the reference
+          genome actually affected by the events denoted in ``ALT``
+          (i.e., the minimum ``affected_start`` and maximum
+          ``affected_end``).
+
+          - For SNPs and structural variants, the affected region
+            includes all bases of ``REF``, including the first base
+            (i.e., ``affected_start = start = POS - 1``).
+          - For deletions, the region includes all bases of ``REF``
+            except the first base, which flanks upstream the actual
+            deletion event, per VCF specification.
+          - For insertions, the ``affected_start`` and ``affected_end``
+            coordinates represent a 0 bp-length region between the two
+            flanking bases (i.e., ``affected_start`` =
+            ``affected_end``). This is analagous to Python slice
+            notation (see http://stackoverflow.com/a/2947881/38140).
+            Neither the upstream nor downstream flanking bases are
+            included in the region.
    """
    def __init__(self, CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT,
            sample_indexes, samples=None):
        self.CHROM = CHROM
+        #: the one-based coordinate of the first nucleotide in ``REF``
        self.POS = POS
        self.ID = ID
        self.REF = REF
@@ -137,9 +172,9 @@ class _Record(object):
        self.FILTER = FILTER
        self.INFO = INFO
        self.FORMAT = FORMAT
-        #: 0-based start coordinate
+        #: zero-based, half-open start coordinate of ``REF``
        self.start = self.POS - 1
-        #: 1-based end coordinate
+        #: zero-based, half-open end coordinate of ``REF``
        self.end = self.start + len(self.REF)
        #: list of alleles. [0] = REF, [1:] = ALTS
        self.alleles = [self.REF]
@@ -148,6 +183,61 @@ class _Record(object):
        self.samples = samples or []
        self._sample_indexes = sample_indexes

+        # Setting affected_start and affected_end here for Sphinx
+        # autodoc purposes...
+        #: zero-based, half-open start coordinate of affected region of reference genome
+        self.affected_start = None
+        #: zero-based, half-open end coordinate of affected region of reference genome (not included in the region)
+        self.affected_end = None
+        self._set_start_and_end()
+
+
+    def _set_start_and_end(self):
+        self.affected_start = self.affected_end = self.POS
+        for alt in self.ALT:
+            if alt is None:
+                start, end = self._compute_coordinates_for_none_alt()
+            elif alt.type == 'SNV':
+                start, end = self._compute_coordinates_for_snp()
+            elif alt.type == 'MNV':
+                start, end = self._compute_coordinates_for_indel()
+            else:
+                start, end = self._compute_coordinates_for_sv()
+            self.affected_start = min(self.affected_start, start)
+            self.affected_end = max(self.affected_end, end)
+
+
+    def _compute_coordinates_for_none_alt(self):
+        start = self.POS - 1
+        end = start + len(self.REF)
+        return (start, end)
+
+
+    def _compute_coordinates_for_snp(self):
+        if len(self.REF) > 1:
+            start = self.POS
+            end = start + (len(self.REF) - 1)
+        else:
+            start = self.POS - 1
+            end = self.POS
+        return (start, end)
+
+
+    def _compute_coordinates_for_indel(self):
+        if len(self.REF) > 1:
+            start = self.POS
+            end = start + (len(self.REF) - 1)
+        else:
+            start = end = self.POS
+        return (start, end)
+
+
+    def _compute_coordinates_for_sv(self):
+        start = self.POS - 1
+        end = start + len(self.REF)
+        return (start, end)
+
+
    # For Python 2
    def __cmp__(self, other):
        return cmp((self.CHROM, self.POS), (getattr(other, "CHROM", None), getattr(other, "POS", None)))
@@ -221,12 +311,13 @@ class _Record(object):
        """ A list of allele frequencies of alternate alleles.
           NOTE: Denominator calc'ed from _called_ genotypes.
        """
-        num_chroms = 2.0 * self.num_called
+        num_chroms = 0.0
        allele_counts = Counter()
        for s in self.samples:
            if s.gt_type is not None:
-                allele_counts.update([s.gt_alleles[0]])
-                allele_counts.update([s.gt_alleles[1]])
+                for a in s.gt_alleles:
+                    allele_counts.update([a])
+                    num_chroms += 1
        return [allele_counts[str(i)]/num_chroms for i in range(1, len(self.ALT)+1)]

    @property
@@ -239,7 +330,7 @@ class _Record(object):

        Derived from:
        \"Population Genetics: A Concise Guide, 2nd ed., p.45\"
-          John Gillespie.
+        John Gillespie.
        """
        # skip if more than one alternate allele. assumes bi-allelic
        if len(self.ALT) > 1:
@@ -285,7 +376,7 @@ class _Record(object):
        for alt in self.ALT:
            if alt is None or alt.type != "SNV":
                return False
-            if alt not in ['A', 'C', 'G', 'T']:
+            if alt not in ['A', 'C', 'G', 'T', 'N', '*']:
                return False
        return True

@@ -376,13 +467,14 @@ class _Record(object):
    def var_subtype(self):
        """
        Return the subtype of variant.
+
        - For SNPs and INDELs, yeild one of: [ts, tv, ins, del]
-        - For SVs yield either "complex" or the SV type defined
-          in the ALT fields (removing the brackets).
-          E.g.:
-               <DEL>       -> DEL
-               <INS:ME:L1> -> INS:ME:L1
-               <DUP>       -> DUP
+        - For SVs yield either "complex" or the SV type defined in the ALT
+          fields (removing the brackets). E.g.::
+
+              <DEL>       -> DEL
+              <INS:ME:L1> -> INS:ME:L1
+              <DUP>       -> DUP

        The logic is meant to follow the rules outlined in the following
        paragraph at:

--- a/vcf/parser.py
+++ b/vcf/parser.py
+import codecs
 import collections
-import re
 import csv
 import gzip
-import sys
 import itertools
-import codecs
+import os
+import re
+import sys

 try:
    from collections import OrderedDict
@@ -62,12 +63,13 @@ SINGULAR_METADATA = ['fileformat', 'fileDate', 'reference']
 # Conversion between value in file and Python value
 field_counts = {
    '.': None,  # Unknown number of values
-    'A': -1,  # Equal to the number of alleles in a given record
+    'A': -1,  # Equal to the number of alternate alleles in a given record
    'G': -2,  # Equal to the number of genotypes in a given record
+    'R': -3,  # Equal to the number of alleles including reference in a given record
 }


-_Info = collections.namedtuple('Info', ['id', 'num', 'type', 'desc'])
+_Info = collections.namedtuple('Info', ['id', 'num', 'type', 'desc', 'source', 'version'])
 _Filter = collections.namedtuple('Filter', ['id', 'desc'])
 _Alt = collections.namedtuple('Alt', ['id', 'desc'])
 _Format = collections.namedtuple('Format', ['id', 'num', 'type', 'desc'])
@@ -80,36 +82,39 @@ class _vcf_metadata_parser(object):
    def __init__(self):
        super(_vcf_metadata_parser, self).__init__()
        self.info_pattern = re.compile(r'''\#\#INFO=<
-            ID=(?P<id>[^,]+),
-            Number=(?P<number>-?\d+|\.|[AG]),
-            Type=(?P<type>Integer|Float|Flag|Character|String),
+            ID=(?P<id>[^,]+),\s*
+            Number=(?P<number>-?\d+|\.|[AGR]),\s*
+            Type=(?P<type>Integer|Float|Flag|Character|String),\s*
            Description="(?P<desc>[^"]*)"
+            (?:,\s*Source="(?P<source>[^"]*)")?
+            (?:,\s*Version="?(?P<version>[^"]*)"?)?
            >''', re.VERBOSE)
        self.filter_pattern = re.compile(r'''\#\#FILTER=<
-            ID=(?P<id>[^,]+),
+            ID=(?P<id>[^,]+),\s*
            Description="(?P<desc>[^"]*)"
            >''', re.VERBOSE)
        self.alt_pattern = re.compile(r'''\#\#ALT=<
-            ID=(?P<id>[^,]+),
+            ID=(?P<id>[^,]+),\s*
            Description="(?P<desc>[^"]*)"
            >''', re.VERBOSE)
        self.format_pattern = re.compile(r'''\#\#FORMAT=<
-            ID=(?P<id>.+),
-            Number=(?P<number>-?\d+|\.|[AG]),
-            Type=(?P<type>.+),
+            ID=(?P<id>.+),\s*
+            Number=(?P<number>-?\d+|\.|[AGR]),\s*
+            Type=(?P<type>.+),\s*
            Description="(?P<desc>.*)"
            >''', re.VERBOSE)
        self.contig_pattern = re.compile(r'''\#\#contig=<
-            ID=(?P<id>[^,]+),
-            .*
-            length=(?P<length>-?\d+)
+            ID=(?P<id>[^>,]+)
+            (,.*length=(?P<length>-?\d+))?
            .*
            >''', re.VERBOSE)
        self.meta_pattern = re.compile(r'''##(?P<key>.+?)=(?P<val>.+)''')

    def vcf_field_count(self, num_str):
        """Cast vcf header numbers to integer or None"""
-        if num_str not in field_counts:
+        if num_str is None:
+            return None
+        elif num_str not in field_counts:
            # Fixed, specified number
            return int(num_str)
        else:
@@ -125,7 +130,8 @@ class _vcf_metadata_parser(object):
        num = self.vcf_field_count(match.group('number'))

        info = _Info(match.group('id'), num,
-                     match.group('type'), match.group('desc'))
+                     match.group('type'), match.group('desc'),
+                     match.group('source'), match.group('version'))

        return (match.group('id'), info)

@@ -164,31 +170,28 @@ class _vcf_metadata_parser(object):
                       match.group('type'), match.group('desc'))

        return (match.group('id'), form)
-    
+
    def read_contig(self, contig_string):
        '''Read a meta-contigrmation INFO line.'''
        match = self.contig_pattern.match(contig_string)
        if not match:
            raise SyntaxError(
                "One of the contig lines is malformed: %s" % contig_string)
-
        length = self.vcf_field_count(match.group('length'))
-
        contig = _Contig(match.group('id'), length)
-
        return (match.group('id'), contig)

-
    def read_meta_hash(self, meta_string):
-        items = re.split("[<>]", meta_string)
-        # Removing initial hash marks and final equal sign
-        key = items[0][2:-1]
+        # assert re.match("##.+=<", meta_string)
+        items = meta_string.split('=', 1)
+        # Removing initial hash marks
+        key = items[0].lstrip('#')
        # N.B., items can have quoted values, so cannot just split on comma
        val = OrderedDict()
        state = 0
        k = ''
        v = ''
-        for c in items[1]:
+        for c in items[1].strip('[<>]'):

            if state == 0:  # reading item key
                if c == '=':
@@ -219,16 +222,20 @@ class _vcf_metadata_parser(object):
    def read_meta(self, meta_string):
        if re.match("##.+=<", meta_string):
            return self.read_meta_hash(meta_string)
-        else:
-            match = self.meta_pattern.match(meta_string)
-            return match.group('key'), match.group('val')
+        match = self.meta_pattern.match(meta_string)
+        if not match:
+            # Spec only allows key=value, but we try to be liberal and
+            # interpret anything else as key=none (and all values are parsed
+            # as strings).
+            return meta_string.lstrip('#'), 'none'
+        return match.group('key'), match.group('val')


 class Reader(object):
    """ Reader for a VCF v 4.0 file, an iterator returning ``_Record objects`` """

-    def __init__(self, fsock=None, filename=None, compressed=False, prepend_chr=False,
-                 strict_whitespace=False):
+    def __init__(self, fsock=None, filename=None, compressed=None, prepend_chr=False,
+                 strict_whitespace=False, encoding='ascii'):
        """ Create a new Reader for a VCF file.

            You must specify either fsock (stream) or filename.  Gzipped streams
@@ -250,21 +257,26 @@ class Reader(object):
            self._reader = fsock
            if filename is None and hasattr(fsock, 'name'):
                filename = fsock.name
-                compressed = compressed or filename.endswith('.gz')
+                if compressed is None:
+                    compressed = filename.endswith('.gz')
        elif filename:
-            compressed = compressed or filename.endswith('.gz')
+            if compressed is None:
+                compressed = filename.endswith('.gz')
            self._reader = open(filename, 'rb' if compressed else 'rt')
        self.filename = filename
        if compressed:
            self._reader = gzip.GzipFile(fileobj=self._reader)
            if sys.version > '3':
-                self._reader = codecs.getreader('ascii')(self._reader)
+                self._reader = codecs.getreader(encoding)(self._reader)

        if strict_whitespace:
            self._separator = '\t'
        else:
            self._separator = '\t| +'

+        self._row_pattern = re.compile(self._separator)
+        self._alt_pattern = re.compile('[\[\]]')
+
        self.reader = (line.strip() for line in self._reader if line.strip())

        #: metadata fields from header (string or hash, depending)
@@ -287,6 +299,7 @@ class Reader(object):
        self._prepend_chr = prepend_chr
        self._parse_metainfo()
        self._format_cache = {}
+        self.encoding = encoding

    def __iter__(self):
        return self
@@ -301,7 +314,7 @@ class Reader(object):

        parser = _vcf_metadata_parser()

-        line = self.reader.next()
+        line = next(self.reader)
        while line.startswith('##'):
            self._header_lines.append(line)

@@ -320,7 +333,7 @@ class Reader(object):
            elif line.startswith('##FORMAT'):
                key, val = parser.read_format(line)
                self.formats[key] = val
-            
+
            elif line.startswith('##contig'):
                key, val = parser.read_contig(line)
                self.contigs[key] = val
@@ -334,9 +347,9 @@ class Reader(object):
                        self.metadata[key] = []
                    self.metadata[key].append(val)

-            line = self.reader.next()
+            line = next(self.reader)

-        fields = re.split(self._separator, line[1:])
+        fields = self._row_pattern.split(line[1:])
        self._column_headers = fields[:9]
        self.samples = fields[9:]
        self._sample_indexes = dict([(x,i) for (i,x) in enumerate(self.samples)])
@@ -358,7 +371,7 @@ class Reader(object):
        retdict = {}

        for entry in entries:
-            entry = entry.split('=')
+            entry = entry.split('=', 1)
            ID = entry[0]
            try:
                entry_type = self.infos[ID].type
@@ -389,6 +402,7 @@ class Reader(object):
                    vals = entry[1].split(',') # commas are reserved characters indicating multiple values
                    val = self._map(str, vals)
                except IndexError:
+                    entry_type = 'Flag'
                    val = True

            try:
@@ -430,7 +444,6 @@ class Reader(object):
        # check whether we already know how to parse this format
        if samp_fmt not in self._format_cache:
            self._format_cache[samp_fmt] = self._parse_sample_format(samp_fmt)
-
        samp_fmt = self._format_cache[samp_fmt]

        if cparse:
@@ -450,7 +463,10 @@ class Reader(object):
            for i, vals in enumerate(sample.split(':')):

                # short circuit the most common
-                if vals == '.' or vals == './.':
+                if samp_fmt._fields[i] == 'GT':
+                    sampdat[i] = vals
+                    continue
+                elif not vals or vals == ".":
                    sampdat[i] = None
                    continue

@@ -494,9 +510,9 @@ class Reader(object):
        return samp_data

    def _parse_alt(self, str):
-        if re.search('[\[\]]', str) is not None:
+        if self._alt_pattern.search(str) is not None:
            # Paired breakend
-            items = re.split('[\[\]]', str)
+            items = self._alt_pattern.split(str)
            remoteCoords = items[1].split(':')
            chr = remoteCoords[0]
            if chr[0] == '<':
@@ -523,8 +539,8 @@ class Reader(object):

    def next(self):
        '''Return the next record in the file.'''
-        line = self.reader.next()
-        row = re.split(self._separator, line.rstrip())
+        line = next(self.reader)
+        row = self._row_pattern.split(line.rstrip())
        chrom = row[0]
        if self._prepend_chr:
            chrom = 'chr' + chrom
@@ -559,6 +575,9 @@ class Reader(object):
            fmt = row[8]
        except IndexError:
            fmt = None
+        else:
+            if fmt == '.':
+                fmt = None

        record = _Record(chrom, pos, ID, ref, alt, qual, filt,
                info, fmt, self._sample_indexes)
@@ -569,45 +588,60 @@ class Reader(object):

        return record

-    def fetch(self, chrom, start, end=None):
-        """ fetch records from a Tabix indexed VCF, requires pysam
-            if start and end are specified, return iterator over positions
-            if end not specified, return individual ``_Call`` at start or None
+    def fetch(self, chrom, start=None, end=None):
+        """ Fetches records from a tabix-indexed VCF file and returns an
+            iterable of ``_Record`` instances
+
+            chrom must be specified.
+
+            The start and end coordinates are in the zero-based,
+            half-open coordinate system, similar to ``_Record.start`` and
+            ``_Record.end``. The very first base of a chromosome is
+            index 0, and the the region includes bases up to, but not
+            including the base at the end coordinate. For example
+            ``fetch('4', 10, 20)`` would include all variants
+            overlapping a 10 base pair region from the 11th base of
+            through the 20th base (which is at index 19) of chromosome
+            4. It would not include the 21st base (at index 20). See
+            http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms
+            for more information on the zero-based, half-open coordinate
+            system.
+
+            If end is omitted, all variants from start until the end of
+            the chromosome chrom will be included.
+
+            If start and end are omitted, all variants on chrom will be
+            returned.
+
+            requires pysam
+
        """
        if not pysam:
            raise Exception('pysam not available, try "pip install pysam"?')
-
        if not self.filename:
            raise Exception('Please provide a filename (or a "normal" fsock)')

        if not self._tabix:
-            self._tabix = pysam.Tabixfile(self.filename)
+            self._tabix = pysam.Tabixfile(self.filename,
+                                          encoding=self.encoding)

        if self._prepend_chr and chrom[:3] == 'chr':
            chrom = chrom[3:]

-        # not sure why tabix needs position -1
-        start = start - 1
-
-        if end is None:
-            self.reader = self._tabix.fetch(chrom, start, start + 1)
-            try:
-                return self.next()
-            except StopIteration:
-                return None
-
        self.reader = self._tabix.fetch(chrom, start, end)
        return self


 class Writer(object):
-    """ VCF Writer """
+    """VCF Writer. On Windows Python 2, open stream with 'wb'."""

    # Reverse keys and values in header field count dictionary
    counts = dict((v,k) for k,v in field_counts.iteritems())

    def __init__(self, stream, template, lineterminator="\n"):
-        self.writer = csv.writer(stream, delimiter="\t", lineterminator=lineterminator)
+        self.writer = csv.writer(stream, delimiter="\t",
+                                 lineterminator=lineterminator,
+                                 quotechar='', quoting=csv.QUOTE_NONE)
        self.template = template
        self.stream = stream

@@ -639,7 +673,10 @@ class Writer(object):
        for line in template.alts.itervalues():
            stream.write(two.format(key="ALT", *line))
        for line in template.contigs.itervalues():
-            stream.write('##contig=<ID={0},length={1}>\n'.format(*line))
+            if line.length:
+                stream.write('##contig=<ID={0},length={1}>\n'.format(*line))
+            else:
+                stream.write('##contig=<ID={0}>\n'.format(*line))

        self._write_header()

@@ -699,28 +736,14 @@ class Writer(object):
                        sorted(info, key=order_key))

    def _format_sample(self, fmt, sample):
-        try:
-            # Try to get the GT value first.
-            gt = getattr(sample.data, 'GT')
-            # PyVCF stores './.' GT values as None, so we need to revert it back
-            # to './.' when writing.
-            if gt is None:
-                gt = './.'
-        except AttributeError:
-            # Failing that, try to check whether 'GT' is specified in the FORMAT
-            # field. If yes, use the recommended empty value ('./.')
-            if 'GT' in fmt:
-                gt = './.'
-            # Otherwise use an empty string as the value
-            else:
-                gt = ''
-        # If gt is an empty string (i.e. not stored), write all other data
+        if hasattr(sample.data, 'GT'):
+            gt = sample.data.GT
+        else:
+            gt = './.' if 'GT' in fmt else ''
+
        if not gt:
            return ':'.join([self._stringify(x) for x in sample.data])
-        # Otherwise use the GT values from above and combine it with the rest of
-        # the data.
-        # Note that this follows the VCF spec, where GT is always the first
-        # item whenever it is present.
+        # Following the VCF spec, GT is always the first item whenever it is present.
        else:
            return ':'.join([gt] + [self._stringify(x) for x in sample.data[1:]])


--- a/vcf/sample_filter.py
+++ b/vcf/sample_filter.py
+# Author: Lenna X. Peterson
+# github.com/lennax
+# arklenna at gmail dot com
+
+import logging
+import sys
+import warnings
+
+
+from parser import Reader, Writer
+
+
+class SampleFilter(object):
+    """
+    Modifies the vcf Reader to filter each row by sample as it is parsed.
+
+    """
+
+    def __init__(self, infile, outfile=None, filters=None, invert=False):
+        # Methods to add to Reader
+        def get_filter(self):
+            return self._samp_filter
+
+        def set_filter(self, filt):
+            self._samp_filter = filt
+            if filt:
+                self.samples = [val for idx, val in enumerate(self.samples)
+                               if idx not in set(filt)]
+
+        def filter_samples(fn):
+            """Decorator function to filter sample parameter"""
+            def filt(self, samples, *args):
+                samples = [val for idx, val in enumerate(samples)
+                           if idx not in set(self.sample_filter)]
+                return fn(self, samples, *args)
+            return filt
+
+        # Add property to Reader for filter list
+        Reader.sample_filter = property(get_filter, set_filter)
+        Reader._samp_filter = []
+        # Modify Reader._parse_samples to filter samples
+        self._orig_parse_samples = Reader._parse_samples
+        Reader._parse_samples = filter_samples(Reader._parse_samples)
+        self.parser = Reader(filename=infile)
+        # Store initial samples and indices
+        self.samples = self.parser.samples
+        self.smp_idx = dict([(v, k) for k, v in enumerate(self.samples)])
+        # Properties for filter/writer
+        self.outfile = outfile
+        self.invert = invert
+        self.filters = filters
+        if filters is not None:
+            self.set_filters()
+            self.write()
+
+    def __del__(self):
+        try:
+            self._undo_monkey_patch()
+        except AttributeError:
+            pass
+
+    def set_filters(self, filters=None, invert=False):
+        """Convert filters from string to list of indices, set on Reader"""
+        if filters is not None:
+            self.filters = filters
+        if invert:
+            self.invert = invert
+        filt_l = self.filters.split(",")
+        filt_s = set(filt_l)
+        if len(filt_s) < len(filt_l):
+            warnings.warn("Non-unique filters, ignoring", RuntimeWarning)
+
+        def filt2idx(item):
+            """Convert filter to valid sample index"""
+            try:
+                item = int(item)
+            except ValueError:
+                # not an idx, check if it's a value
+                return self.smp_idx.get(item)
+            else:
+                # is int, check if it's an idx
+                if item < len(self.samples):
+                    return item
+        filters = set(filter(lambda x: x is not None, map(filt2idx, filt_s)))
+        if len(filters) < len(filt_s):
+            # TODO print the filters that were ignored
+            warnings.warn("Invalid filters, ignoring", RuntimeWarning)
+
+        if self.invert:
+            filters = set(xrange(len(self.samples))).difference(filters)
+
+        # `sample_filter` setter updates `samples`
+        self.parser.sample_filter = filters
+        if len(self.parser.samples) == 0:
+            warnings.warn("Number of samples to keep is zero", RuntimeWarning)
+        logging.info("Keeping these samples: {0}\n".format(self.parser.samples))
+        return self.parser.samples
+
+    def write(self, outfile=None):
+        if outfile is not None:
+            self.outfile = outfile
+        if self.outfile is None:
+            _out = sys.stdout
+        elif hasattr(self.outfile, 'write'):
+            _out = self.outfile
+        else:
+            _out = open(self.outfile, "wb")
+        logging.info("Writing to '{0}'\n".format(self.outfile))
+        writer = Writer(_out, self.parser)
+        for row in self.parser:
+            writer.write_record(row)
+
+    def _undo_monkey_patch(self):
+        Reader._parse_samples = self._orig_parse_samples
+        delattr(Reader, 'sample_filter')
--- a/vcf/test/FT.vcf
+++ b/vcf/test/FT.vcf
+##fileformat=VCFv4.2
+##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
+##FILTER=<ID=DP125,Description="DP<125">
+##FILTER=<ID=DP130,Description="DP<130">
+##FILTER=<ID=LowQual,Description="Low quality">
+##FILTER=<ID=Q4800,Description="QUAL<4800.0">
+##FILTER=<ID=Q5000,Description="QUAL<5000.0">
+##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
+##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
+##FORMAT=<ID=FT,Number=.,Type=String,Description="Genotype-level filter">
+##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
+##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
+##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
+##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
+##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
+##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
+##GATKCommandLine.VariantFiltration=<ID=VariantFiltration,Version=3.6-0-g89b7209,Date="Wed Jul 27 09:50:44 CEST 2016",Epoch=1469605844963,CommandLineOptions="analysis_type=VariantFiltration input_file=[] showFullBamList=false read_buffer_size=null read_filter=[] disable_read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=../ref.fasta nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 static_quantized_quals=null round_down_quantized=false disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_rods=false no_cmdline_in_header=false sites_only=false never_trim_vcf_format_field=false bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 reference_window_stop=0 phone_home= gatk_key=null tag=NA logging_level=INFO log_to_file=null help=false version=false variant=(RodBinding name=variant source=10.vcf) mask=(RodBinding name= source=UNBOUND) out=/home/redmar/tmp/example/snps/10_filt.vcf filterExpression=[QUAL<5000.0, QUAL<4800.0] filterName=[Q5000, Q4800] genotypeFilterExpression=[DP<130, DP<125] genotypeFilterName=[DP130, DP125] clusterSize=3 clusterWindowSize=0 maskExtension=0 maskName=Mask filterNotInMask=false missingValuesInExpressionsShouldEvaluateAsFailing=false invalidatePreviousFilters=false invertFilterExpression=false invertGenotypeFilterExpression=false setFilteredGtToNocall=false filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">
+##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
+##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
+##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
+##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
+##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
+##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
+##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
+##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
+##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
+##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
+##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
+##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
+##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
+##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
+##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
+##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
+##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
+##INFO=<ID=RAW_MQ,Number=1,Type=Float,Description="Raw data for RMS Mapping Quality">
+##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
+##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
+##contig=<ID=ref,length=4888768>
+##reference=file://../ref.fasta
+#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	1	2	3	4	5
+ref	63393	.	C	A	29454.60	.	AC=5;AF=1.00;AN=5;DP=719;FS=0.000;MLEAC=5;MLEAF=1.00;MQ=60.00;QD=29.67;SOR=0.965	GT:AD:DP:GQ:PL	1:0,166:166:99:6740,0	1:0,142:142:99:5824,0	1:0,134:134:99:5616,0	1:0,122:122:99:4930,0	1:0,155:155:99:6371,0
+ref	65903	.	AATTGCGCTG	A	7340.57	PASS	AC=1;AF=0.200;AN=5;DP=524;FS=0.000;MLEAC=1;MLEAF=0.200;MQ=60.00;QD=34.04;SOR=1.091	GT:AD:DP:FT:GQ:PL	1:0,164:164:PASS:99:7383,0	0:95,0:95:DP125;DP130:99:0,1800	0:88,0:88:DP125;DP130:99:0,1800	0:87,0:87:DP125;DP130:99:0,1800	0:89,0:89:DP125;DP130:99:0,1800
+ref	70837	.	C	A	4711.61	Q4800;Q5000	AC=1;AF=0.200;AN=5;DP=512;FS=0.000;MLEAC=1;MLEAF=0.200;MQ=60.00;QD=27.64;SOR=0.726	GT:AD:DP:FT:GQ:PL	0:121,0:121:DP125;DP130:99:0,1800	0:95,0:95:DP125;DP130:99:0,1800	1:0,120:120:DP125;DP130:99:4745,0	0:87,0:87:DP125;DP130:99:0,1800	0:89,0:89:DP125;DP130:99:0,1800
+ref	71448	.	C	T	31134.60	PASS	AC=5;AF=1.00;AN=5;BaseQRankSum=2.22;ClippingRankSum=0.00;DP=768;FS=0.000;MLEAC=5;MLEAF=1.00;MQ=60.00;MQRankSum=0.00;QD=29.43;ReadPosRankSum=2.03;SOR=0.295	GT:AD:DP:FT:GQ:PL	1:0,147:147:PASS:99:5996,0	1:1,183:184:PASS:99:7501,0	1:0,113:113:DP125;DP130:99:4559,0	1:0,161:161:PASS:99:6436,0	1:0,163:163:PASS:99:6669,0
+ref	104257	.	C	T	5521.61	PASS	AC=1;AF=0.200;AN=5;DP=506;FS=0.000;MLEAC=1;MLEAF=0.200;MQ=60.00;QD=29.45;SOR=0.854	GT:AD:DP:FT:GQ:PL	0:101,0:101:DP125;DP130:99:0,1800	0:109,0:109:DP125;DP130:99:0,1800	1:0,132:132:PASS:99:5555,0	0:67,0:67:DP125;DP130:99:0,1800	0:97,0:97:DP125;DP130:99:0,1800
+ref	140658	.	C	A	32467.60	PASS	AC=5;AF=1.00;AN=5;BaseQRankSum=2.24;ClippingRankSum=0.00;DP=801;FS=0.000;MLEAC=5;MLEAF=1.00;MQ=60.00;MQRankSum=0.00;QD=29.65;ReadPosRankSum=1.27;SOR=0.346	GT:AD:DP:GQ:PL	1:0,170:170:99:6854,0	1:0,198:198:99:8098,0	1:0,136:136:99:5554,0	1:0,141:141:99:5661,0	1:1,155:156:99:6327,0
+ref	147463	.	C	A	4885.61	Q5000	AC=1;AF=0.200;AN=5;BaseQRankSum=-7.720e-01;ClippingRankSum=0.00;DP=503;FS=0.000;MLEAC=1;MLEAF=0.200;MQ=60.00;MQRankSum=0.00;QD=35.03;ReadPosRankSum=-6.950e-01;SOR=0.278	GT:AD:DP:FT:GQ:PL	0:97,0:97:DP125;DP130:99:0,1800	0:104,0:104:DP125;DP130:99:0,1800	0:84,0:84:DP125;DP130:99:0,1800	1:1,128:129:DP130:99:4919,0	0:89,0:89:DP125;DP130:99:0,1800
+ref	154578	.	A	G	32015.60	PASS	AC=5;AF=1.00;AN=5;DP=776;FS=0.000;MLEAC=5;MLEAF=1.00;MQ=60.00;QD=25.82;SOR=0.902	GT:AD:DP:GQ:PL	1:0,152:152:99:6300,0	1:0,183:183:99:7608,0	1:0,137:137:99:5713,0	1:0,148:148:99:6040,0	1:0,156:156:99:6381,0
+ref	203200	.	C	T	30880.60	PASS	AC=5;AF=1.00;AN=5;DP=752;FS=0.000;MLEAC=5;MLEAF=1.00;MQ=60.00;QD=29.65;SOR=0.878	GT:AD:DP:FT:GQ:PL	1:0,161:161:PASS:99:6708,0	1:0,185:185:PASS:99:7602,0	1:0,136:136:PASS:99:5602,0	1:0,126:126:DP130:99:5080,0	1:0,144:144:PASS:99:5915,0
+ref	231665	.	C	T	30074.60	PASS	AC=5;AF=1.00;AN=5;DP=735;FS=0.000;MLEAC=5;MLEAF=1.00;MQ=60.00;QD=33.23;SOR=0.938	GT:AD:DP:FT:GQ:PL	1:0,191:191:PASS:99:7867,0	1:0,159:159:PASS:99:6431,0	1:0,130:130:PASS:99:5299,0	1:0,129:129:DP130:99:5290,0	1:0,126:126:DP130:99:5214,0