Skip to content
Commits on Source (16)
Metadata-Version: 1.1
Name: PyVCF
Version: 0.6.7
Version: 0.6.8
Summary: Variant Call Format (VCF) parser for Python
Home-page: https://github.com/jamescasbon/PyVCF
Author: James Casbon and @jdoughertyii
Author-email: casbon@gmail.com
License: UNKNOWN
Description: A VCFv4.0 and 4.1 parser for Python.
Online version of PyVCF documentation is available at http://pyvcf.rtfd.org/
The intent of this module is to mimic the ``csv`` module in the Python stdlib,
as opposed to more flexible serialization formats like JSON or YAML. ``vcf``
will attempt to parse the content of each record based on the data types
specified in the meta-information lines -- specifically the ##INFO and
##FORMAT lines. If these lines are missing or incomplete, it will check
against the reserved types mentioned in the spec. Failing that, it will just
return strings.
There main interface is the class: ``Reader``. It takes a file-like
object and acts as a reader::
>>> import vcf
>>> vcf_reader = vcf.Reader(open('vcf/test/example-4.0.vcf', 'r'))
>>> for record in vcf_reader:
... print record
Record(CHROM=20, POS=14370, REF=G, ALT=[A])
Record(CHROM=20, POS=17330, REF=T, ALT=[A])
Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
Record(CHROM=20, POS=1230237, REF=T, ALT=[None])
Record(CHROM=20, POS=1234567, REF=GTCT, ALT=[G, GTACT])
This produces a great deal of information, but it is conveniently accessed.
The attributes of a Record are the 8 fixed fields from the VCF spec::
* ``Record.CHROM``
* ``Record.POS``
* ``Record.ID``
* ``Record.REF``
* ``Record.ALT``
* ``Record.QUAL``
* ``Record.FILTER``
* ``Record.INFO``
plus attributes to handle genotype information:
* ``Record.FORMAT``
* ``Record.samples``
* ``Record.genotype``
``samples`` and ``genotype``, not being the title of any column, are left lowercase. The format
of the fixed fields is from the spec. Comma-separated lists in the VCF are
converted to lists. In particular, one-entry VCF lists are converted to
one-entry Python lists (see, e.g., ``Record.ALT``). Semicolon-delimited lists
of key=value pairs are converted to Python dictionaries, with flags being given
a ``True`` value. Integers and floats are handled exactly as you'd expect::
>>> vcf_reader = vcf.Reader(open('vcf/test/example-4.0.vcf', 'r'))
>>> record = vcf_reader.next()
>>> print record.POS
14370
>>> print record.ALT
[A]
>>> print record.INFO['AF']
[0.5]
There are a number of convienience methods and properties for each ``Record`` allowing you to
examine properties of interest::
>>> print record.num_called, record.call_rate, record.num_unknown
3 1.0 0
>>> print record.num_hom_ref, record.num_het, record.num_hom_alt
1 1 1
>>> print record.nucl_diversity, record.aaf, record.heterozygosity
0.6 [0.5] 0.5
>>> print record.get_hets()
[Call(sample=NA00002, CallData(GT=1|0, GQ=48, DP=8, HQ=[51, 51]))]
>>> print record.is_snp, record.is_indel, record.is_transition, record.is_deletion
True False True False
>>> print record.var_type, record.var_subtype
snp ts
>>> print record.is_monomorphic
False
``record.FORMAT`` will be a string specifying the format of the genotype
fields. In case the FORMAT column does not exist, ``record.FORMAT`` is
``None``. Finally, ``record.samples`` is a list of dictionaries containing the
parsed sample column and ``record.genotype`` is a way of looking up genotypes
by sample name::
>>> record = vcf_reader.next()
>>> for sample in record.samples:
... print sample['GT']
0|0
0|1
0/0
>>> print record.genotype('NA00001')['GT']
0|0
The genotypes are represented by ``Call`` objects, which have three attributes: the
corresponding Record ``site``, the sample name in ``sample`` and a dictionary of
call data in ``data``::
>>> call = record.genotype('NA00001')
>>> print call.site
Record(CHROM=20, POS=17330, REF=T, ALT=[A])
>>> print call.sample
NA00001
>>> print call.data
CallData(GT=0|0, GQ=49, DP=3, HQ=[58, 50])
Please note that as of release 0.4.0, attributes known to have single values (such as
``DP`` and ``GQ`` above) are returned as values. Other attributes are returned
as lists (such as ``HQ`` above).
There are also a number of methods::
>>> print call.called, call.gt_type, call.gt_bases, call.phased
True 0 T|T True
Metadata regarding the VCF file itself can be investigated through the
following attributes:
* ``Reader.metadata``
* ``Reader.infos``
* ``Reader.filters``
* ``Reader.formats``
* ``Reader.samples``
For example::
>>> vcf_reader.metadata['fileDate']
'20090805'
>>> vcf_reader.samples
['NA00001', 'NA00002', 'NA00003']
>>> vcf_reader.filters
OrderedDict([('q10', Filter(id='q10', desc='Quality below 10')), ('s50', Filter(id='s50', desc='Less than 50% of samples have data'))])
>>> vcf_reader.infos['AA'].desc
'Ancestral Allele'
ALT records are actually classes, so that you can interrogate them::
>>> reader = vcf.Reader(open('vcf/test/example-4.1-bnd.vcf'))
>>> _ = reader.next(); row = reader.next()
>>> print row
Record(CHROM=1, POS=2, REF=T, ALT=[T[2:3[])
>>> bnd = row.ALT[0]
>>> print bnd.withinMainAssembly, bnd.orientation, bnd.remoteOrientation, bnd.connectingSequence
True False True T
Random access is supported for files with tabix indexes. Simply call fetch for the
region you are interested in::
>>> vcf_reader = vcf.Reader(filename='vcf/test/tb.vcf.gz')
>>> for record in vcf_reader.fetch('20', 1110696, 1230237): # doctest: +SKIP
... print record
Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
Record(CHROM=20, POS=1230237, REF=T, ALT=[None])
Or extract a single row::
>>> print vcf_reader.fetch('20', 1110696) # doctest: +SKIP
Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
The ``Writer`` class provides a way of writing a VCF file. Currently, you must specify a
template ``Reader`` which provides the metadata::
>>> vcf_reader = vcf.Reader(filename='vcf/test/tb.vcf.gz')
>>> vcf_writer = vcf.Writer(open('/dev/null', 'w'), vcf_reader)
>>> for record in vcf_reader:
... vcf_writer.write_record(record)
An extensible script is available to filter vcf files in vcf_filter.py. VCF filters
declared by other packages will be available for use in this script. Please
see :doc:`FILTERS` for full description.
Description: UNKNOWN
Keywords: bioinformatics
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: BSD License
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Cython
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.6
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering
Classifier: Programming Language :: Python :: 3.2
Classifier: Programming Language :: Python :: 3.3
Classifier: Programming Language :: Python :: 3.4
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
......@@ -50,7 +50,7 @@ of key=value pairs are converted to Python dictionaries, with flags being given
a ``True`` value. Integers and floats are handled exactly as you'd expect::
>>> vcf_reader = vcf.Reader(open('vcf/test/example-4.0.vcf', 'r'))
>>> record = vcf_reader.next()
>>> record = next(vcf_reader)
>>> print record.POS
14370
>>> print record.ALT
......@@ -58,7 +58,7 @@ a ``True`` value. Integers and floats are handled exactly as you'd expect::
>>> print record.INFO['AF']
[0.5]
There are a number of convienience methods and properties for each ``Record`` allowing you to
There are a number of convenience methods and properties for each ``Record`` allowing you to
examine properties of interest::
>>> print record.num_called, record.call_rate, record.num_unknown
......@@ -82,7 +82,7 @@ fields. In case the FORMAT column does not exist, ``record.FORMAT`` is
parsed sample column and ``record.genotype`` is a way of looking up genotypes
by sample name::
>>> record = vcf_reader.next()
>>> record = next(vcf_reader)
>>> for sample in record.samples:
... print sample['GT']
0|0
......@@ -135,27 +135,38 @@ For example::
ALT records are actually classes, so that you can interrogate them::
>>> reader = vcf.Reader(open('vcf/test/example-4.1-bnd.vcf'))
>>> _ = reader.next(); row = reader.next()
>>> _ = next(reader); row = next(reader)
>>> print row
Record(CHROM=1, POS=2, REF=T, ALT=[T[2:3[])
>>> bnd = row.ALT[0]
>>> print bnd.withinMainAssembly, bnd.orientation, bnd.remoteOrientation, bnd.connectingSequence
True False True T
Random access is supported for files with tabix indexes. Simply call fetch for the
region you are interested in::
The Reader supports retrieval of records within designated regions for files
with tabix indexes via the fetch method. This requires the pysam module as a
dependency. Pass in a chromosome, and, optionally, start and end coordinates,
for the regions of interest::
>>> vcf_reader = vcf.Reader(filename='vcf/test/tb.vcf.gz')
>>> for record in vcf_reader.fetch('20', 1110696, 1230237): # doctest: +SKIP
>>> # fetch all records on chromosome 20 from base 1110696 through 1230237
>>> for record in vcf_reader.fetch('20', 1110695, 1230237): # doctest: +SKIP
... print record
Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
Record(CHROM=20, POS=1230237, REF=T, ALT=[None])
Or extract a single row::
Note that the start and end coordinates are in the zero-based, half-open
coordinate system, similar to ``_Record.start`` and ``_Record.end``. The very
first base of a chromosome is index 0, and the the region includes bases up
to, but not including the base at the end coordinate. For example::
>>> print vcf_reader.fetch('20', 1110696) # doctest: +SKIP
Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
>>> # fetch all records on chromosome 4 from base 11 through 20
>>> vcf_reader.fetch('4', 10, 20) # doctest: +SKIP
would include all records overlapping a 10 base pair region from the 11th base
of through the 20th base (which is at index 19) of chromosome 4. It would not
include the 21st base (at index 20). (See
http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms for more
information on the zero-based, half-open coordinate system.)
The ``Writer`` class provides a way of writing a VCF file. Currently, you must specify a
template ``Reader`` which provides the metadata::
......@@ -165,8 +176,6 @@ template ``Reader`` which provides the metadata::
>>> for record in vcf_reader:
... vcf_writer.write_record(record)
An extensible script is available to filter vcf files in vcf_filter.py. VCF filters
declared by other packages will be available for use in this script. Please
see :doc:`FILTERS` for full description.
python-pyvcf (0.6.8-1~bpo8+1) jessie-backports; urgency=medium
* Rebuild for jessie-backports.
* export DEB_BUILD_MAINT_OPTIONS = hardening=+all,-pie
-- Andreas Tille <tille@debian.org> Fri, 09 Jun 2017 23:06:01 +0200
python-pyvcf (0.6.8-1) unstable; urgency=medium
* New upstream version
* Fix watch file
* cme fix dpkg-control
* debhelper 10
* hardening=+all
* Remove unused lintian override
-- Andreas Tille <tille@debian.org> Fri, 02 Dec 2016 20:33:48 +0100
python-pyvcf (0.6.7-2) unstable; urgency=medium
[ Scott Kitterman ]
* Change python3-dev build-dep to python3-all-dev so package is built for
all supported python3 versions
Closes: #809487
-- Andreas Tille <tille@debian.org> Thu, 31 Dec 2015 12:22:30 +0100
python-pyvcf (0.6.7-1~bpo8+1) jessie-backports; urgency=medium
* Rebuild for jessie-backports.
......
......@@ -3,17 +3,17 @@ Maintainer: Debian Med Packaging Team <debian-med-packaging@lists.alioth.debian.
Uploaders: Andreas Tille <tille@debian.org>
Section: python
Priority: optional
Build-Depends: debhelper (>= 9),
Build-Depends: debhelper (>= 10),
dh-python,
cython,
python-dev (>= 2.7),
python-dev,
python-setuptools,
cython3,
python3-dev,
python3-all-dev,
python3-setuptools
Standards-Version: 3.9.6
Standards-Version: 3.9.8
Vcs-Browser: https://anonscm.debian.org/cgit/debian-med/python-pyvcf.git
Vcs-Git: git://anonscm.debian.org/debian-med/python-pyvcf.git
Vcs-Git: https://anonscm.debian.org/git/debian-med/python-pyvcf.git
Homepage: https://pypi.python.org/pypi/PyVCF
Package: python-pyvcf
......
......@@ -2,7 +2,6 @@
# Test-Data needs to be fetched separately from Git
set -e
set -x
PKG=`dpkg-parsechangelog | awk '/^Source/ { print $2 }'`
......@@ -10,7 +9,7 @@ if ! echo $@ | grep -q upstream-version ; then
VERSION=`dpkg-parsechangelog | awk '/^Version:/ { print $2 }' | sed 's/\([0-9\.]\+\)-[0-9]\+$/\1/'`
uscan --verbose --force-download
else
VERSION=`echo $@ | sed 's?^.*--upstream-version \([0-9.]\+\) .*?\1?'`
VERSION=`echo $@ | sed 's?^.*--upstream-version \([0-9.]\+\).*?\1?'`
if echo "$VERSION" | grep -q "upstream-version" ; then
echo "Unable to parse version number"
exit
......
Author: Andreas Tille <tille@debian.org>
Last-Updated: Tue, 13 Oct 2015 17:02:08 +0200
Description: Use setuptools instead of distribute
see https://lists.alioth.debian.org/pipermail/debian-med-packaging/2015-October/035410.html
--- a/setup.py
+++ b/setup.py
@@ -53,7 +53,7 @@ setup(
description='Variant Call Format (VCF) parser for Python',
long_description=DOC,
test_suite='vcf.test.test_vcf.suite',
- install_requires=['distribute'],
+ install_requires=['setuptools'],
requires=requires,
entry_points = {
'vcf.filters': [
# This is an unchanged file from upstream tarball
python-pyvcf-examples: package-contains-timestamped-gzip usr/share/doc/python-pyvcf-examples/test/1kg.vcf.gz
......@@ -4,6 +4,8 @@
# Uncomment this to turn on verbose mode.
#export DH_VERBOSE=1
export DEB_BUILD_MAINT_OPTIONS = hardening=+all,-pie
export PYBUILD_NAME=pyvcf
examplesdir=debian/python-$(PYBUILD_NAME)-examples/usr/share/doc/python-$(PYBUILD_NAME)-examples
helperdir=debian/$(PYBUILD_NAME)/usr
......@@ -21,7 +23,9 @@ override_dh_install:
mkdir -p $(helperdir)
mv debian/python3-$(PYBUILD_NAME)/usr/bin $(helperdir)
# strip .py extension
mv $(helperdir)/bin/vcf_filter.py $(helperdir)/bin/vcf_filter
for py in $(helperdir)/bin/*.py ; do \
mv $${py} $(helperdir)/bin/`basename $${py} .py` ; \
done
rm -rf debian/python-$(PYBUILD_NAME)/usr/bin
get-orig-source:
......
# Test-Data needs to be fetched separately from Git so wee need to call get-orig-source
version=3
https://pypi.python.org/pypi/PyVCF .*/PyVCF-([0-9.]+).tar.[xgb]z2? \
version=4
http://pypi.debian.net/PyVCF/PyVCF-(.+)\.(?:tar(?:\.gz|\.bz2)?|tgz) \
debian debian/get-orig-source
#!/usr/bin/env python
# Author: Lenna X. Peterson
# github.com/lennax
# arklenna at gmail dot com
import argparse
import logging
from vcf import SampleFilter
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("file", help="VCF file to filter")
parser.add_argument("-o", metavar="outfile",
help="File to write out filtered samples")
parser.add_argument("-f", metavar="filters",
help="Comma-separated list of sample indices or names \
to filter")
parser.add_argument("-i", "--invert", action="store_true",
help="Keep rather than discard the filtered samples")
parser.add_argument("-q", "--quiet", action="store_true",
help="Less output")
args = parser.parse_args()
if args.quiet:
log_level = logging.WARNING
else:
log_level = logging.INFO
logging.basicConfig(format='%(message)s', level=log_level)
sf = SampleFilter(infile=args.file, outfile=args.o,
filters=args.f, invert=args.invert)
if args.f is None:
print "Samples:"
for idx, val in enumerate(sf.samples):
print "{0}: {1}".format(idx, val)
[egg_info]
tag_build =
tag_date = 0
tag_build =
tag_svn_revision = 0
from setuptools import setup
from distutils.core import setup
from distutils.extension import Extension
import sys
try:
from Cython.Distutils import build_ext
......@@ -8,23 +8,14 @@ try:
except:
CYTHON = False
requires = []
IS_PYTHON26 = sys.version_info[:2] == (2, 6)
# python 2.6 does not have argparse
try:
import argparse
except ImportError:
requires.append('argparse')
DEPENDENCIES = ['setuptools']
if IS_PYTHON26:
DEPENDENCIES.extend(['argparse', 'counter', 'ordereddict',
'unittest2'])
import collections
try:
collections.Counter
except AttributeError:
requires.append('counter')
try:
collections.OrderedDict
except AttributeError:
requires.append('ordereddict')
# get the version without an import
VERSION = "Undefined"
......@@ -47,14 +38,14 @@ if CYTHON:
setup(
name='PyVCF',
packages=['vcf', 'vcf.test'],
scripts=['scripts/vcf_melt', 'scripts/vcf_filter.py'],
scripts=['scripts/vcf_melt', 'scripts/vcf_filter.py',
'scripts/vcf_sample_filter.py'],
author='James Casbon and @jdoughertyii',
author_email='casbon@gmail.com',
description='Variant Call Format (VCF) parser for Python',
long_description=DOC,
test_suite='vcf.test.test_vcf.suite',
install_requires=['distribute'],
requires=requires,
install_requires=DEPENDENCIES,
entry_points = {
'vcf.filters': [
'site_quality = vcf.filters:SiteQuality',
......@@ -71,10 +62,19 @@ setup(
'Development Status :: 4 - Beta',
'Intended Audience :: Developers',
'Intended Audience :: Science/Research',
'License :: OSI Approved :: BSD License',
'License :: OSI Approved :: MIT License',
'Operating System :: OS Independent',
'Programming Language :: Cython',
'Programming Language :: Python',
'Programming Language :: Python :: 2',
'Programming Language :: Python :: 2.6',
'Programming Language :: Python :: 2.7',
'Programming Language :: Python :: 3',
'Topic :: Scientific/Engineering',
'Programming Language :: Python :: 3.2',
'Programming Language :: Python :: 3.3',
'Programming Language :: Python :: 3.4',
'Topic :: Scientific/Engineering :: Bio-Informatics',
],
keywords='bioinformatics',
use_2to3=True,
......
#!/usr/bin/env python
'''A VCFv4.0 and 4.1 parser for Python.
"""
A VCFv4.0 and 4.1 parser for Python.
Online version of PyVCF documentation is available at http://pyvcf.rtfd.org/
"""
The intent of this module is to mimic the ``csv`` module in the Python stdlib,
as opposed to more flexible serialization formats like JSON or YAML. ``vcf``
will attempt to parse the content of each record based on the data types
specified in the meta-information lines -- specifically the ##INFO and
##FORMAT lines. If these lines are missing or incomplete, it will check
against the reserved types mentioned in the spec. Failing that, it will just
return strings.
There main interface is the class: ``Reader``. It takes a file-like
object and acts as a reader::
>>> import vcf
>>> vcf_reader = vcf.Reader(open('vcf/test/example-4.0.vcf', 'r'))
>>> for record in vcf_reader:
... print record
Record(CHROM=20, POS=14370, REF=G, ALT=[A])
Record(CHROM=20, POS=17330, REF=T, ALT=[A])
Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
Record(CHROM=20, POS=1230237, REF=T, ALT=[None])
Record(CHROM=20, POS=1234567, REF=GTCT, ALT=[G, GTACT])
This produces a great deal of information, but it is conveniently accessed.
The attributes of a Record are the 8 fixed fields from the VCF spec::
* ``Record.CHROM``
* ``Record.POS``
* ``Record.ID``
* ``Record.REF``
* ``Record.ALT``
* ``Record.QUAL``
* ``Record.FILTER``
* ``Record.INFO``
plus attributes to handle genotype information:
* ``Record.FORMAT``
* ``Record.samples``
* ``Record.genotype``
``samples`` and ``genotype``, not being the title of any column, are left lowercase. The format
of the fixed fields is from the spec. Comma-separated lists in the VCF are
converted to lists. In particular, one-entry VCF lists are converted to
one-entry Python lists (see, e.g., ``Record.ALT``). Semicolon-delimited lists
of key=value pairs are converted to Python dictionaries, with flags being given
a ``True`` value. Integers and floats are handled exactly as you'd expect::
>>> vcf_reader = vcf.Reader(open('vcf/test/example-4.0.vcf', 'r'))
>>> record = vcf_reader.next()
>>> print record.POS
14370
>>> print record.ALT
[A]
>>> print record.INFO['AF']
[0.5]
There are a number of convienience methods and properties for each ``Record`` allowing you to
examine properties of interest::
>>> print record.num_called, record.call_rate, record.num_unknown
3 1.0 0
>>> print record.num_hom_ref, record.num_het, record.num_hom_alt
1 1 1
>>> print record.nucl_diversity, record.aaf, record.heterozygosity
0.6 [0.5] 0.5
>>> print record.get_hets()
[Call(sample=NA00002, CallData(GT=1|0, GQ=48, DP=8, HQ=[51, 51]))]
>>> print record.is_snp, record.is_indel, record.is_transition, record.is_deletion
True False True False
>>> print record.var_type, record.var_subtype
snp ts
>>> print record.is_monomorphic
False
``record.FORMAT`` will be a string specifying the format of the genotype
fields. In case the FORMAT column does not exist, ``record.FORMAT`` is
``None``. Finally, ``record.samples`` is a list of dictionaries containing the
parsed sample column and ``record.genotype`` is a way of looking up genotypes
by sample name::
>>> record = vcf_reader.next()
>>> for sample in record.samples:
... print sample['GT']
0|0
0|1
0/0
>>> print record.genotype('NA00001')['GT']
0|0
The genotypes are represented by ``Call`` objects, which have three attributes: the
corresponding Record ``site``, the sample name in ``sample`` and a dictionary of
call data in ``data``::
>>> call = record.genotype('NA00001')
>>> print call.site
Record(CHROM=20, POS=17330, REF=T, ALT=[A])
>>> print call.sample
NA00001
>>> print call.data
CallData(GT=0|0, GQ=49, DP=3, HQ=[58, 50])
Please note that as of release 0.4.0, attributes known to have single values (such as
``DP`` and ``GQ`` above) are returned as values. Other attributes are returned
as lists (such as ``HQ`` above).
There are also a number of methods::
>>> print call.called, call.gt_type, call.gt_bases, call.phased
True 0 T|T True
Metadata regarding the VCF file itself can be investigated through the
following attributes:
* ``Reader.metadata``
* ``Reader.infos``
* ``Reader.filters``
* ``Reader.formats``
* ``Reader.samples``
For example::
>>> vcf_reader.metadata['fileDate']
'20090805'
>>> vcf_reader.samples
['NA00001', 'NA00002', 'NA00003']
>>> vcf_reader.filters
OrderedDict([('q10', Filter(id='q10', desc='Quality below 10')), ('s50', Filter(id='s50', desc='Less than 50% of samples have data'))])
>>> vcf_reader.infos['AA'].desc
'Ancestral Allele'
ALT records are actually classes, so that you can interrogate them::
>>> reader = vcf.Reader(open('vcf/test/example-4.1-bnd.vcf'))
>>> _ = reader.next(); row = reader.next()
>>> print row
Record(CHROM=1, POS=2, REF=T, ALT=[T[2:3[])
>>> bnd = row.ALT[0]
>>> print bnd.withinMainAssembly, bnd.orientation, bnd.remoteOrientation, bnd.connectingSequence
True False True T
Random access is supported for files with tabix indexes. Simply call fetch for the
region you are interested in::
>>> vcf_reader = vcf.Reader(filename='vcf/test/tb.vcf.gz')
>>> for record in vcf_reader.fetch('20', 1110696, 1230237): # doctest: +SKIP
... print record
Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
Record(CHROM=20, POS=1230237, REF=T, ALT=[None])
Or extract a single row::
>>> print vcf_reader.fetch('20', 1110696) # doctest: +SKIP
Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
The ``Writer`` class provides a way of writing a VCF file. Currently, you must specify a
template ``Reader`` which provides the metadata::
>>> vcf_reader = vcf.Reader(filename='vcf/test/tb.vcf.gz')
>>> vcf_writer = vcf.Writer(open('/dev/null', 'w'), vcf_reader)
>>> for record in vcf_reader:
... vcf_writer.write_record(record)
An extensible script is available to filter vcf files in vcf_filter.py. VCF filters
declared by other packages will be available for use in this script. Please
see :doc:`FILTERS` for full description.
'''
from vcf.parser import Reader, Writer
from vcf.parser import VCFReader, VCFWriter
from vcf.filters import Base as Filter
from vcf.parser import RESERVED_INFO, RESERVED_FORMAT
from vcf.sample_filter import SampleFilter
VERSION = '0.6.7'
VERSION = '0.6.8'
......@@ -36,7 +36,10 @@ def parse_samples(
vals = sampvals[j]
# short circuit the most common
if vals == '.' or vals == './.':
if samp_fmt._fields[j] == 'GT':
sampdat[j] = vals
continue
elif not vals or vals == '.':
sampdat[j] = None
continue
......
from abc import ABCMeta, abstractmethod
import collections
import sys
import re
try:
from collections import Counter
except ImportError:
from counter import Counter
allele_delimiter = re.compile(r'''[|/]''') # to split a genotype into alleles
class _Call(object):
""" A genotype call, a cell entry in a VCF file"""
__slots__ = ['site', 'sample', 'data', 'gt_nums', 'called']
__slots__ = ['site', 'sample', 'data', 'gt_nums', 'gt_alleles', 'called', 'ploidity']
def __init__(self, site, sample, data):
#: The ``_Record`` for this ``_Call``
self.site = site
#: The sample name
self.sample = sample
#: Dictionary of data from the VCF file
#: Namedtuple of data from the VCF file
self.data = data
try:
self.gt_nums = self.data.GT
#: True if the GT is not ./.
self.called = self.gt_nums is not None
except AttributeError:
self.gt_nums = None
if hasattr(self.data, 'GT'):
self.gt_alleles = [(al if al != '.' else None) for al in allele_delimiter.split(self.data.GT)]
self.ploidity = len(self.gt_alleles)
self.called = all([al != None for al in self.gt_alleles])
self.gt_nums = self.data.GT if self.called else None
else:
#62 a call without a genotype is not defined as called or not
self.gt_alleles = None
self.ploidity = None
self.called = None
self.gt_nums = None
def __repr__(self):
return "Call(sample=%s, %s)" % (self.sample, str(self.data))
......@@ -50,12 +56,6 @@ class _Call(object):
def gt_phase_char(self):
return "/" if not self.phased else "|"
@property
def gt_alleles(self):
'''The numbers of the alleles called at a given sample'''
# grab the numeric alleles of the gt string; tokenize by phasing
return self.gt_nums.split(self.gt_phase_char())
@property
def gt_bases(self):
'''The actual genotype alleles.
......@@ -125,10 +125,45 @@ class _Record(object):
INFO and FORMAT are available as properties.
The list of genotype calls is in the ``samples`` property.
Regarding the coordinates associated with each instance:
- ``POS``, per VCF specification, is the one-based index
(the first base of the contig has an index of 1) of the first
base of the ``REF`` sequence.
- The ``start`` and ``end`` denote the coordinates of the entire
``REF`` sequence in the zero-based, half-open coordinate
system (see
http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms),
where the first base of the contig has an index of 0, and the
interval runs up to, but does not include, the base at the
``end`` index. This indexing scheme is analagous to Python
slice notation.
- The ``affected_start`` and ``affected_end`` coordinates are
also in the zero-based, half-open coordinate system. These
coordinates indicate the precise region of the reference
genome actually affected by the events denoted in ``ALT``
(i.e., the minimum ``affected_start`` and maximum
``affected_end``).
- For SNPs and structural variants, the affected region
includes all bases of ``REF``, including the first base
(i.e., ``affected_start = start = POS - 1``).
- For deletions, the region includes all bases of ``REF``
except the first base, which flanks upstream the actual
deletion event, per VCF specification.
- For insertions, the ``affected_start`` and ``affected_end``
coordinates represent a 0 bp-length region between the two
flanking bases (i.e., ``affected_start`` =
``affected_end``). This is analagous to Python slice
notation (see http://stackoverflow.com/a/2947881/38140).
Neither the upstream nor downstream flanking bases are
included in the region.
"""
def __init__(self, CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT,
sample_indexes, samples=None):
self.CHROM = CHROM
#: the one-based coordinate of the first nucleotide in ``REF``
self.POS = POS
self.ID = ID
self.REF = REF
......@@ -137,9 +172,9 @@ class _Record(object):
self.FILTER = FILTER
self.INFO = INFO
self.FORMAT = FORMAT
#: 0-based start coordinate
#: zero-based, half-open start coordinate of ``REF``
self.start = self.POS - 1
#: 1-based end coordinate
#: zero-based, half-open end coordinate of ``REF``
self.end = self.start + len(self.REF)
#: list of alleles. [0] = REF, [1:] = ALTS
self.alleles = [self.REF]
......@@ -148,6 +183,61 @@ class _Record(object):
self.samples = samples or []
self._sample_indexes = sample_indexes
# Setting affected_start and affected_end here for Sphinx
# autodoc purposes...
#: zero-based, half-open start coordinate of affected region of reference genome
self.affected_start = None
#: zero-based, half-open end coordinate of affected region of reference genome (not included in the region)
self.affected_end = None
self._set_start_and_end()
def _set_start_and_end(self):
self.affected_start = self.affected_end = self.POS
for alt in self.ALT:
if alt is None:
start, end = self._compute_coordinates_for_none_alt()
elif alt.type == 'SNV':
start, end = self._compute_coordinates_for_snp()
elif alt.type == 'MNV':
start, end = self._compute_coordinates_for_indel()
else:
start, end = self._compute_coordinates_for_sv()
self.affected_start = min(self.affected_start, start)
self.affected_end = max(self.affected_end, end)
def _compute_coordinates_for_none_alt(self):
start = self.POS - 1
end = start + len(self.REF)
return (start, end)
def _compute_coordinates_for_snp(self):
if len(self.REF) > 1:
start = self.POS
end = start + (len(self.REF) - 1)
else:
start = self.POS - 1
end = self.POS
return (start, end)
def _compute_coordinates_for_indel(self):
if len(self.REF) > 1:
start = self.POS
end = start + (len(self.REF) - 1)
else:
start = end = self.POS
return (start, end)
def _compute_coordinates_for_sv(self):
start = self.POS - 1
end = start + len(self.REF)
return (start, end)
# For Python 2
def __cmp__(self, other):
return cmp((self.CHROM, self.POS), (getattr(other, "CHROM", None), getattr(other, "POS", None)))
......@@ -221,12 +311,13 @@ class _Record(object):
""" A list of allele frequencies of alternate alleles.
NOTE: Denominator calc'ed from _called_ genotypes.
"""
num_chroms = 2.0 * self.num_called
num_chroms = 0.0
allele_counts = Counter()
for s in self.samples:
if s.gt_type is not None:
allele_counts.update([s.gt_alleles[0]])
allele_counts.update([s.gt_alleles[1]])
for a in s.gt_alleles:
allele_counts.update([a])
num_chroms += 1
return [allele_counts[str(i)]/num_chroms for i in range(1, len(self.ALT)+1)]
@property
......@@ -239,7 +330,7 @@ class _Record(object):
Derived from:
\"Population Genetics: A Concise Guide, 2nd ed., p.45\"
John Gillespie.
John Gillespie.
"""
# skip if more than one alternate allele. assumes bi-allelic
if len(self.ALT) > 1:
......@@ -285,7 +376,7 @@ class _Record(object):
for alt in self.ALT:
if alt is None or alt.type != "SNV":
return False
if alt not in ['A', 'C', 'G', 'T']:
if alt not in ['A', 'C', 'G', 'T', 'N', '*']:
return False
return True
......@@ -376,13 +467,14 @@ class _Record(object):
def var_subtype(self):
"""
Return the subtype of variant.
- For SNPs and INDELs, yeild one of: [ts, tv, ins, del]
- For SVs yield either "complex" or the SV type defined
in the ALT fields (removing the brackets).
E.g.:
<DEL> -> DEL
<INS:ME:L1> -> INS:ME:L1
<DUP> -> DUP
- For SVs yield either "complex" or the SV type defined in the ALT
fields (removing the brackets). E.g.::
<DEL> -> DEL
<INS:ME:L1> -> INS:ME:L1
<DUP> -> DUP
The logic is meant to follow the rules outlined in the following
paragraph at:
......
import codecs
import collections
import re
import csv
import gzip
import sys
import itertools
import codecs
import os
import re
import sys
try:
from collections import OrderedDict
......@@ -62,12 +63,13 @@ SINGULAR_METADATA = ['fileformat', 'fileDate', 'reference']
# Conversion between value in file and Python value
field_counts = {
'.': None, # Unknown number of values
'A': -1, # Equal to the number of alleles in a given record
'A': -1, # Equal to the number of alternate alleles in a given record
'G': -2, # Equal to the number of genotypes in a given record
'R': -3, # Equal to the number of alleles including reference in a given record
}
_Info = collections.namedtuple('Info', ['id', 'num', 'type', 'desc'])
_Info = collections.namedtuple('Info', ['id', 'num', 'type', 'desc', 'source', 'version'])
_Filter = collections.namedtuple('Filter', ['id', 'desc'])
_Alt = collections.namedtuple('Alt', ['id', 'desc'])
_Format = collections.namedtuple('Format', ['id', 'num', 'type', 'desc'])
......@@ -80,36 +82,39 @@ class _vcf_metadata_parser(object):
def __init__(self):
super(_vcf_metadata_parser, self).__init__()
self.info_pattern = re.compile(r'''\#\#INFO=<
ID=(?P<id>[^,]+),
Number=(?P<number>-?\d+|\.|[AG]),
Type=(?P<type>Integer|Float|Flag|Character|String),
ID=(?P<id>[^,]+),\s*
Number=(?P<number>-?\d+|\.|[AGR]),\s*
Type=(?P<type>Integer|Float|Flag|Character|String),\s*
Description="(?P<desc>[^"]*)"
(?:,\s*Source="(?P<source>[^"]*)")?
(?:,\s*Version="?(?P<version>[^"]*)"?)?
>''', re.VERBOSE)
self.filter_pattern = re.compile(r'''\#\#FILTER=<
ID=(?P<id>[^,]+),
ID=(?P<id>[^,]+),\s*
Description="(?P<desc>[^"]*)"
>''', re.VERBOSE)
self.alt_pattern = re.compile(r'''\#\#ALT=<
ID=(?P<id>[^,]+),
ID=(?P<id>[^,]+),\s*
Description="(?P<desc>[^"]*)"
>''', re.VERBOSE)
self.format_pattern = re.compile(r'''\#\#FORMAT=<
ID=(?P<id>.+),
Number=(?P<number>-?\d+|\.|[AG]),
Type=(?P<type>.+),
ID=(?P<id>.+),\s*
Number=(?P<number>-?\d+|\.|[AGR]),\s*
Type=(?P<type>.+),\s*
Description="(?P<desc>.*)"
>''', re.VERBOSE)
self.contig_pattern = re.compile(r'''\#\#contig=<
ID=(?P<id>[^,]+),
.*
length=(?P<length>-?\d+)
ID=(?P<id>[^>,]+)
(,.*length=(?P<length>-?\d+))?
.*
>''', re.VERBOSE)
self.meta_pattern = re.compile(r'''##(?P<key>.+?)=(?P<val>.+)''')
def vcf_field_count(self, num_str):
"""Cast vcf header numbers to integer or None"""
if num_str not in field_counts:
if num_str is None:
return None
elif num_str not in field_counts:
# Fixed, specified number
return int(num_str)
else:
......@@ -125,7 +130,8 @@ class _vcf_metadata_parser(object):
num = self.vcf_field_count(match.group('number'))
info = _Info(match.group('id'), num,
match.group('type'), match.group('desc'))
match.group('type'), match.group('desc'),
match.group('source'), match.group('version'))
return (match.group('id'), info)
......@@ -164,31 +170,28 @@ class _vcf_metadata_parser(object):
match.group('type'), match.group('desc'))
return (match.group('id'), form)
def read_contig(self, contig_string):
'''Read a meta-contigrmation INFO line.'''
match = self.contig_pattern.match(contig_string)
if not match:
raise SyntaxError(
"One of the contig lines is malformed: %s" % contig_string)
length = self.vcf_field_count(match.group('length'))
contig = _Contig(match.group('id'), length)
return (match.group('id'), contig)
def read_meta_hash(self, meta_string):
items = re.split("[<>]", meta_string)
# Removing initial hash marks and final equal sign
key = items[0][2:-1]
# assert re.match("##.+=<", meta_string)
items = meta_string.split('=', 1)
# Removing initial hash marks
key = items[0].lstrip('#')
# N.B., items can have quoted values, so cannot just split on comma
val = OrderedDict()
state = 0
k = ''
v = ''
for c in items[1]:
for c in items[1].strip('[<>]'):
if state == 0: # reading item key
if c == '=':
......@@ -219,16 +222,20 @@ class _vcf_metadata_parser(object):
def read_meta(self, meta_string):
if re.match("##.+=<", meta_string):
return self.read_meta_hash(meta_string)
else:
match = self.meta_pattern.match(meta_string)
return match.group('key'), match.group('val')
match = self.meta_pattern.match(meta_string)
if not match:
# Spec only allows key=value, but we try to be liberal and
# interpret anything else as key=none (and all values are parsed
# as strings).
return meta_string.lstrip('#'), 'none'
return match.group('key'), match.group('val')
class Reader(object):
""" Reader for a VCF v 4.0 file, an iterator returning ``_Record objects`` """
def __init__(self, fsock=None, filename=None, compressed=False, prepend_chr=False,
strict_whitespace=False):
def __init__(self, fsock=None, filename=None, compressed=None, prepend_chr=False,
strict_whitespace=False, encoding='ascii'):
""" Create a new Reader for a VCF file.
You must specify either fsock (stream) or filename. Gzipped streams
......@@ -250,21 +257,26 @@ class Reader(object):
self._reader = fsock
if filename is None and hasattr(fsock, 'name'):
filename = fsock.name
compressed = compressed or filename.endswith('.gz')
if compressed is None:
compressed = filename.endswith('.gz')
elif filename:
compressed = compressed or filename.endswith('.gz')
if compressed is None:
compressed = filename.endswith('.gz')
self._reader = open(filename, 'rb' if compressed else 'rt')
self.filename = filename
if compressed:
self._reader = gzip.GzipFile(fileobj=self._reader)
if sys.version > '3':
self._reader = codecs.getreader('ascii')(self._reader)
self._reader = codecs.getreader(encoding)(self._reader)
if strict_whitespace:
self._separator = '\t'
else:
self._separator = '\t| +'
self._row_pattern = re.compile(self._separator)
self._alt_pattern = re.compile('[\[\]]')
self.reader = (line.strip() for line in self._reader if line.strip())
#: metadata fields from header (string or hash, depending)
......@@ -287,6 +299,7 @@ class Reader(object):
self._prepend_chr = prepend_chr
self._parse_metainfo()
self._format_cache = {}
self.encoding = encoding
def __iter__(self):
return self
......@@ -301,7 +314,7 @@ class Reader(object):
parser = _vcf_metadata_parser()
line = self.reader.next()
line = next(self.reader)
while line.startswith('##'):
self._header_lines.append(line)
......@@ -320,7 +333,7 @@ class Reader(object):
elif line.startswith('##FORMAT'):
key, val = parser.read_format(line)
self.formats[key] = val
elif line.startswith('##contig'):
key, val = parser.read_contig(line)
self.contigs[key] = val
......@@ -334,9 +347,9 @@ class Reader(object):
self.metadata[key] = []
self.metadata[key].append(val)
line = self.reader.next()
line = next(self.reader)
fields = re.split(self._separator, line[1:])
fields = self._row_pattern.split(line[1:])
self._column_headers = fields[:9]
self.samples = fields[9:]
self._sample_indexes = dict([(x,i) for (i,x) in enumerate(self.samples)])
......@@ -358,7 +371,7 @@ class Reader(object):
retdict = {}
for entry in entries:
entry = entry.split('=')
entry = entry.split('=', 1)
ID = entry[0]
try:
entry_type = self.infos[ID].type
......@@ -389,6 +402,7 @@ class Reader(object):
vals = entry[1].split(',') # commas are reserved characters indicating multiple values
val = self._map(str, vals)
except IndexError:
entry_type = 'Flag'
val = True
try:
......@@ -430,7 +444,6 @@ class Reader(object):
# check whether we already know how to parse this format
if samp_fmt not in self._format_cache:
self._format_cache[samp_fmt] = self._parse_sample_format(samp_fmt)
samp_fmt = self._format_cache[samp_fmt]
if cparse:
......@@ -450,7 +463,10 @@ class Reader(object):
for i, vals in enumerate(sample.split(':')):
# short circuit the most common
if vals == '.' or vals == './.':
if samp_fmt._fields[i] == 'GT':
sampdat[i] = vals
continue
elif not vals or vals == ".":
sampdat[i] = None
continue
......@@ -494,9 +510,9 @@ class Reader(object):
return samp_data
def _parse_alt(self, str):
if re.search('[\[\]]', str) is not None:
if self._alt_pattern.search(str) is not None:
# Paired breakend
items = re.split('[\[\]]', str)
items = self._alt_pattern.split(str)
remoteCoords = items[1].split(':')
chr = remoteCoords[0]
if chr[0] == '<':
......@@ -523,8 +539,8 @@ class Reader(object):
def next(self):
'''Return the next record in the file.'''
line = self.reader.next()
row = re.split(self._separator, line.rstrip())
line = next(self.reader)
row = self._row_pattern.split(line.rstrip())
chrom = row[0]
if self._prepend_chr:
chrom = 'chr' + chrom
......@@ -559,6 +575,9 @@ class Reader(object):
fmt = row[8]
except IndexError:
fmt = None
else:
if fmt == '.':
fmt = None
record = _Record(chrom, pos, ID, ref, alt, qual, filt,
info, fmt, self._sample_indexes)
......@@ -569,45 +588,60 @@ class Reader(object):
return record
def fetch(self, chrom, start, end=None):
""" fetch records from a Tabix indexed VCF, requires pysam
if start and end are specified, return iterator over positions
if end not specified, return individual ``_Call`` at start or None
def fetch(self, chrom, start=None, end=None):
""" Fetches records from a tabix-indexed VCF file and returns an
iterable of ``_Record`` instances
chrom must be specified.
The start and end coordinates are in the zero-based,
half-open coordinate system, similar to ``_Record.start`` and
``_Record.end``. The very first base of a chromosome is
index 0, and the the region includes bases up to, but not
including the base at the end coordinate. For example
``fetch('4', 10, 20)`` would include all variants
overlapping a 10 base pair region from the 11th base of
through the 20th base (which is at index 19) of chromosome
4. It would not include the 21st base (at index 20). See
http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms
for more information on the zero-based, half-open coordinate
system.
If end is omitted, all variants from start until the end of
the chromosome chrom will be included.
If start and end are omitted, all variants on chrom will be
returned.
requires pysam
"""
if not pysam:
raise Exception('pysam not available, try "pip install pysam"?')
if not self.filename:
raise Exception('Please provide a filename (or a "normal" fsock)')
if not self._tabix:
self._tabix = pysam.Tabixfile(self.filename)
self._tabix = pysam.Tabixfile(self.filename,
encoding=self.encoding)
if self._prepend_chr and chrom[:3] == 'chr':
chrom = chrom[3:]
# not sure why tabix needs position -1
start = start - 1
if end is None:
self.reader = self._tabix.fetch(chrom, start, start + 1)
try:
return self.next()
except StopIteration:
return None
self.reader = self._tabix.fetch(chrom, start, end)
return self
class Writer(object):
""" VCF Writer """
"""VCF Writer. On Windows Python 2, open stream with 'wb'."""
# Reverse keys and values in header field count dictionary
counts = dict((v,k) for k,v in field_counts.iteritems())
def __init__(self, stream, template, lineterminator="\n"):
self.writer = csv.writer(stream, delimiter="\t", lineterminator=lineterminator)
self.writer = csv.writer(stream, delimiter="\t",
lineterminator=lineterminator,
quotechar='', quoting=csv.QUOTE_NONE)
self.template = template
self.stream = stream
......@@ -639,7 +673,10 @@ class Writer(object):
for line in template.alts.itervalues():
stream.write(two.format(key="ALT", *line))
for line in template.contigs.itervalues():
stream.write('##contig=<ID={0},length={1}>\n'.format(*line))
if line.length:
stream.write('##contig=<ID={0},length={1}>\n'.format(*line))
else:
stream.write('##contig=<ID={0}>\n'.format(*line))
self._write_header()
......@@ -699,28 +736,14 @@ class Writer(object):
sorted(info, key=order_key))
def _format_sample(self, fmt, sample):
try:
# Try to get the GT value first.
gt = getattr(sample.data, 'GT')
# PyVCF stores './.' GT values as None, so we need to revert it back
# to './.' when writing.
if gt is None:
gt = './.'
except AttributeError:
# Failing that, try to check whether 'GT' is specified in the FORMAT
# field. If yes, use the recommended empty value ('./.')
if 'GT' in fmt:
gt = './.'
# Otherwise use an empty string as the value
else:
gt = ''
# If gt is an empty string (i.e. not stored), write all other data
if hasattr(sample.data, 'GT'):
gt = sample.data.GT
else:
gt = './.' if 'GT' in fmt else ''
if not gt:
return ':'.join([self._stringify(x) for x in sample.data])
# Otherwise use the GT values from above and combine it with the rest of
# the data.
# Note that this follows the VCF spec, where GT is always the first
# item whenever it is present.
# Following the VCF spec, GT is always the first item whenever it is present.
else:
return ':'.join([gt] + [self._stringify(x) for x in sample.data[1:]])
......
# Author: Lenna X. Peterson
# github.com/lennax
# arklenna at gmail dot com
import logging
import sys
import warnings
from parser import Reader, Writer
class SampleFilter(object):
"""
Modifies the vcf Reader to filter each row by sample as it is parsed.
"""
def __init__(self, infile, outfile=None, filters=None, invert=False):
# Methods to add to Reader
def get_filter(self):
return self._samp_filter
def set_filter(self, filt):
self._samp_filter = filt
if filt:
self.samples = [val for idx, val in enumerate(self.samples)
if idx not in set(filt)]
def filter_samples(fn):
"""Decorator function to filter sample parameter"""
def filt(self, samples, *args):
samples = [val for idx, val in enumerate(samples)
if idx not in set(self.sample_filter)]
return fn(self, samples, *args)
return filt
# Add property to Reader for filter list
Reader.sample_filter = property(get_filter, set_filter)
Reader._samp_filter = []
# Modify Reader._parse_samples to filter samples
self._orig_parse_samples = Reader._parse_samples
Reader._parse_samples = filter_samples(Reader._parse_samples)
self.parser = Reader(filename=infile)
# Store initial samples and indices
self.samples = self.parser.samples
self.smp_idx = dict([(v, k) for k, v in enumerate(self.samples)])
# Properties for filter/writer
self.outfile = outfile
self.invert = invert
self.filters = filters
if filters is not None:
self.set_filters()
self.write()
def __del__(self):
try:
self._undo_monkey_patch()
except AttributeError:
pass
def set_filters(self, filters=None, invert=False):
"""Convert filters from string to list of indices, set on Reader"""
if filters is not None:
self.filters = filters
if invert:
self.invert = invert
filt_l = self.filters.split(",")
filt_s = set(filt_l)
if len(filt_s) < len(filt_l):
warnings.warn("Non-unique filters, ignoring", RuntimeWarning)
def filt2idx(item):
"""Convert filter to valid sample index"""
try:
item = int(item)
except ValueError:
# not an idx, check if it's a value
return self.smp_idx.get(item)
else:
# is int, check if it's an idx
if item < len(self.samples):
return item
filters = set(filter(lambda x: x is not None, map(filt2idx, filt_s)))
if len(filters) < len(filt_s):
# TODO print the filters that were ignored
warnings.warn("Invalid filters, ignoring", RuntimeWarning)
if self.invert:
filters = set(xrange(len(self.samples))).difference(filters)
# `sample_filter` setter updates `samples`
self.parser.sample_filter = filters
if len(self.parser.samples) == 0:
warnings.warn("Number of samples to keep is zero", RuntimeWarning)
logging.info("Keeping these samples: {0}\n".format(self.parser.samples))
return self.parser.samples
def write(self, outfile=None):
if outfile is not None:
self.outfile = outfile
if self.outfile is None:
_out = sys.stdout
elif hasattr(self.outfile, 'write'):
_out = self.outfile
else:
_out = open(self.outfile, "wb")
logging.info("Writing to '{0}'\n".format(self.outfile))
writer = Writer(_out, self.parser)
for row in self.parser:
writer.write_record(row)
def _undo_monkey_patch(self):
Reader._parse_samples = self._orig_parse_samples
delattr(Reader, 'sample_filter')
##fileformat=VCFv4.2
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=DP125,Description="DP<125">
##FILTER=<ID=DP130,Description="DP<130">
##FILTER=<ID=LowQual,Description="Low quality">
##FILTER=<ID=Q4800,Description="QUAL<4800.0">
##FILTER=<ID=Q5000,Description="QUAL<5000.0">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=FT,Number=.,Type=String,Description="Genotype-level filter">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GATKCommandLine.VariantFiltration=<ID=VariantFiltration,Version=3.6-0-g89b7209,Date="Wed Jul 27 09:50:44 CEST 2016",Epoch=1469605844963,CommandLineOptions="analysis_type=VariantFiltration input_file=[] showFullBamList=false read_buffer_size=null read_filter=[] disable_read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=../ref.fasta nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 static_quantized_quals=null round_down_quantized=false disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_rods=false no_cmdline_in_header=false sites_only=false never_trim_vcf_format_field=false bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 reference_window_stop=0 phone_home= gatk_key=null tag=NA logging_level=INFO log_to_file=null help=false version=false variant=(RodBinding name=variant source=10.vcf) mask=(RodBinding name= source=UNBOUND) out=/home/redmar/tmp/example/snps/10_filt.vcf filterExpression=[QUAL<5000.0, QUAL<4800.0] filterName=[Q5000, Q4800] genotypeFilterExpression=[DP<130, DP<125] genotypeFilterName=[DP130, DP125] clusterSize=3 clusterWindowSize=0 maskExtension=0 maskName=Mask filterNotInMask=false missingValuesInExpressionsShouldEvaluateAsFailing=false invalidatePreviousFilters=false invertFilterExpression=false invertGenotypeFilterExpression=false setFilteredGtToNocall=false filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=RAW_MQ,Number=1,Type=Float,Description="Raw data for RMS Mapping Quality">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
##contig=<ID=ref,length=4888768>
##reference=file://../ref.fasta
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1 2 3 4 5
ref 63393 . C A 29454.60 . AC=5;AF=1.00;AN=5;DP=719;FS=0.000;MLEAC=5;MLEAF=1.00;MQ=60.00;QD=29.67;SOR=0.965 GT:AD:DP:GQ:PL 1:0,166:166:99:6740,0 1:0,142:142:99:5824,0 1:0,134:134:99:5616,0 1:0,122:122:99:4930,0 1:0,155:155:99:6371,0
ref 65903 . AATTGCGCTG A 7340.57 PASS AC=1;AF=0.200;AN=5;DP=524;FS=0.000;MLEAC=1;MLEAF=0.200;MQ=60.00;QD=34.04;SOR=1.091 GT:AD:DP:FT:GQ:PL 1:0,164:164:PASS:99:7383,0 0:95,0:95:DP125;DP130:99:0,1800 0:88,0:88:DP125;DP130:99:0,1800 0:87,0:87:DP125;DP130:99:0,1800 0:89,0:89:DP125;DP130:99:0,1800
ref 70837 . C A 4711.61 Q4800;Q5000 AC=1;AF=0.200;AN=5;DP=512;FS=0.000;MLEAC=1;MLEAF=0.200;MQ=60.00;QD=27.64;SOR=0.726 GT:AD:DP:FT:GQ:PL 0:121,0:121:DP125;DP130:99:0,1800 0:95,0:95:DP125;DP130:99:0,1800 1:0,120:120:DP125;DP130:99:4745,0 0:87,0:87:DP125;DP130:99:0,1800 0:89,0:89:DP125;DP130:99:0,1800
ref 71448 . C T 31134.60 PASS AC=5;AF=1.00;AN=5;BaseQRankSum=2.22;ClippingRankSum=0.00;DP=768;FS=0.000;MLEAC=5;MLEAF=1.00;MQ=60.00;MQRankSum=0.00;QD=29.43;ReadPosRankSum=2.03;SOR=0.295 GT:AD:DP:FT:GQ:PL 1:0,147:147:PASS:99:5996,0 1:1,183:184:PASS:99:7501,0 1:0,113:113:DP125;DP130:99:4559,0 1:0,161:161:PASS:99:6436,0 1:0,163:163:PASS:99:6669,0
ref 104257 . C T 5521.61 PASS AC=1;AF=0.200;AN=5;DP=506;FS=0.000;MLEAC=1;MLEAF=0.200;MQ=60.00;QD=29.45;SOR=0.854 GT:AD:DP:FT:GQ:PL 0:101,0:101:DP125;DP130:99:0,1800 0:109,0:109:DP125;DP130:99:0,1800 1:0,132:132:PASS:99:5555,0 0:67,0:67:DP125;DP130:99:0,1800 0:97,0:97:DP125;DP130:99:0,1800
ref 140658 . C A 32467.60 PASS AC=5;AF=1.00;AN=5;BaseQRankSum=2.24;ClippingRankSum=0.00;DP=801;FS=0.000;MLEAC=5;MLEAF=1.00;MQ=60.00;MQRankSum=0.00;QD=29.65;ReadPosRankSum=1.27;SOR=0.346 GT:AD:DP:GQ:PL 1:0,170:170:99:6854,0 1:0,198:198:99:8098,0 1:0,136:136:99:5554,0 1:0,141:141:99:5661,0 1:1,155:156:99:6327,0
ref 147463 . C A 4885.61 Q5000 AC=1;AF=0.200;AN=5;BaseQRankSum=-7.720e-01;ClippingRankSum=0.00;DP=503;FS=0.000;MLEAC=1;MLEAF=0.200;MQ=60.00;MQRankSum=0.00;QD=35.03;ReadPosRankSum=-6.950e-01;SOR=0.278 GT:AD:DP:FT:GQ:PL 0:97,0:97:DP125;DP130:99:0,1800 0:104,0:104:DP125;DP130:99:0,1800 0:84,0:84:DP125;DP130:99:0,1800 1:1,128:129:DP130:99:4919,0 0:89,0:89:DP125;DP130:99:0,1800
ref 154578 . A G 32015.60 PASS AC=5;AF=1.00;AN=5;DP=776;FS=0.000;MLEAC=5;MLEAF=1.00;MQ=60.00;QD=25.82;SOR=0.902 GT:AD:DP:GQ:PL 1:0,152:152:99:6300,0 1:0,183:183:99:7608,0 1:0,137:137:99:5713,0 1:0,148:148:99:6040,0 1:0,156:156:99:6381,0
ref 203200 . C T 30880.60 PASS AC=5;AF=1.00;AN=5;DP=752;FS=0.000;MLEAC=5;MLEAF=1.00;MQ=60.00;QD=29.65;SOR=0.878 GT:AD:DP:FT:GQ:PL 1:0,161:161:PASS:99:6708,0 1:0,185:185:PASS:99:7602,0 1:0,136:136:PASS:99:5602,0 1:0,126:126:DP130:99:5080,0 1:0,144:144:PASS:99:5915,0
ref 231665 . C T 30074.60 PASS AC=5;AF=1.00;AN=5;DP=735;FS=0.000;MLEAC=5;MLEAF=1.00;MQ=60.00;QD=33.23;SOR=0.938 GT:AD:DP:FT:GQ:PL 1:0,191:191:PASS:99:7867,0 1:0,159:159:PASS:99:6431,0 1:0,130:130:PASS:99:5299,0 1:0,129:129:DP130:99:5290,0 1:0,126:126:DP130:99:5214,0