Commits on Source (8)
comment: false
coverage:
status:
patch: false
......@@ -2,6 +2,7 @@ language: python
sudo: false
python:
- 'nightly'
- '3.6'
- '3.5'
- '3.4'
- '3.3'
......@@ -13,14 +14,19 @@ install:
- pip wheel -f wheelhouse coverage biopython cython pysam pyvcf || true
- pip install -f wheelhouse biopython cython pysam pyfasta coverage pyvcf || true
- python setup.py install
- if [ ! -f samtools-1.2 ]; then wget -q -O - https://github.com/samtools/samtools/releases/download/1.2/samtools-1.2.tar.bz2 | tar -xjv; fi
- if [ ! -f samtools-1.2 ]; then curl -sL https://github.com/samtools/samtools/releases/download/1.2/samtools-1.2.tar.bz2 | tar -xjv; fi
- cd samtools-1.2
- make
- export PATH=$PATH:$PWD
- cd ..
- if [ ! -f htslib-1.4 ]; then curl -sL https://github.com/samtools/htslib/releases/download/1.4/htslib-1.4.tar.bz2 | tar -xjv; fi
- cd htslib-1.4
- make
- export PATH=$PATH:$PWD
- cd ..
before_script:
- /usr/local/bin/pip install --user biopython
- /usr/bin/python tests/data/download_gene_fasta.py
- if [ $(python --version) -gt 2.6 ]; then env pip install biopython; fi
- env python tests/data/download_gene_fasta.py
script: nosetests --with-coverage --cover-package=pyfaidx
deploy:
provider: pypi
......@@ -29,18 +35,20 @@ deploy:
secure: MbSaeuitkVTZqxa0PJ3RcR1aMf+B/sMbcx2sWOo9xfLlRFDFpYWJZ0EfXWEhrVu2YWXpBsasgunTDWSi0jNcZMH92MzOC+UTVYr45LO5sy6hm4iSiAgm/DPgYWdjP0SFKr7eL/HWPS+gHvgkXL1upleX21O358bxaezoasuKFvs=
on:
all_branches: true
python: 2.6
python: 2.7
tags: true
repo: mdshw5/pyfaidx
matrix:
allow_failures:
- python: 'nightly'
- python: 'pypy3'
- python: 'pypy'
fast_finish: true
cache:
directories:
- tests/data
- samtools-1.2
- htslib-1.4
- wheelhouse
after_success:
- bash <(curl -s https://codecov.io/bash)
......
# Contributor Covenant Code of Conduct
## Our Pledge
In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, gender identity and expression, level of experience, nationality, personal appearance, race, religion, or sexual identity and orientation.
## Our Standards
Examples of behavior that contributes to creating a positive environment include:
* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members
Examples of unacceptable behavior by participants include:
* The use of sexualized language or imagery and unwelcome sexual attention or advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a professional setting
## Our Responsibilities
Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior.
Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful.
## Scope
This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team at mdshw5@gmail.com. The project team will review and investigate all complaints, and will respond in a way that it deems appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately.
Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project's leadership.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4, available at [http://contributor-covenant.org/version/1/4][version]
[homepage]: http://contributor-covenant.org
[version]: http://contributor-covenant.org/version/1/4/
|Travis| |PyPI| |Landscape| |Coverage| |Depsy| |Appveyor| |Flattr|
|Travis| |PyPI| |Landscape| |Coverage| |Depsy| |Appveyor|
Description
-----------
......@@ -38,6 +38,8 @@ or download a `release <https://github.com/mdshw5/pyfaidx/releases>`_ and:
python setup.py install
If using ``pip install --user``, make sure to add ``/home/$(whoami)/.local/bin`` to your ``$PATH`` if you want to run the ``faidx`` script.
Usage
-----
......@@ -72,7 +74,7 @@ Acts like a dictionary.
>>> genes['NM_001282543.1'][200:230].end
230
>>> genes['NM_001282543.1'][200:230].longname
>>> genes['NM_001282543.1'][200:230].fancy_name
'NM_001282543.1:201-230'
>>> len(genes['NM_001282543.1'])
......@@ -110,26 +112,6 @@ Slices just like a string:
- Slicing start and end coordinates are 0-based, just like Python sequences.
Sequence can be buffered in memory using a read-ahead buffer
for fast sequential access:
.. code:: python
>>> from timeit import timeit
>>> fetch = "genes['NM_001282543.1'][200:230]"
>>> read_ahead = "import pyfaidx; genes = pyfaidx.Fasta('tests/data/genes.fasta', read_ahead=10000)"
>>> no_read_ahead = "import pyfaidx; genes = pyfaidx.Fasta('tests/data/genes.fasta')"
>>> string_slicing = "genes = {}; genes['NM_001282543.1'] = 'N'*10000"
>>> timeit(fetch, no_read_ahead, number=10000)
0.2204863309962093
>>> timeit(fetch, read_ahead, number=10000)
0.1121859749982832
>>> timeit(fetch, string_slicing, number=10000)
0.0033553699977346696
Read-ahead buffering can reduce runtime by 1/2 for sequential accesses to buffered regions.
Complements and reverse complements just like DNA
.. code:: python
......@@ -146,6 +128,31 @@ Complements and reverse complements just like DNA
>NM_001282543.1 (complement):230-201
CATCCGGTTCCATGGCGGGCGCGGAACGAG
``Fasta`` objects can also be accessed using method calls:
.. code:: python
>>> genes.get_seq('NM_001282543.1', 201, 210)
>NM_001282543.1:201-210
CTCGTTCCGC
>>> genes.get_seq('NM_001282543.1', 201, 210, rc=True)
>NM_001282543.1 (complement):210-201
GCGGAACGAG
Spliced sequences can be retrieved from a list of [start, end] coordinates:
**TODO** update this section
.. code:: python
# new in v0.5.1
segments = [[1, 10], [50, 70]]
>>> genes.get_spliced_seq('NM_001282543.1', segments)
>gi|543583786|ref|NM_001282543.1|:1-70
CCCCGCCCCTGGTTTCGAGTCGCTGGCCTGC
.. _keyfn:
Custom key functions provide cleaner access:
.. code:: python
......@@ -158,6 +165,24 @@ Custom key functions provide cleaner access:
>NR_104212:1-10
CCCCGCCCCT
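For reference, a key function is just a callable applied to each header before it becomes a dictionary key. A minimal sketch (assuming the ``key_function`` keyword argument referenced below; the full documentation's example may differ) that strips version suffixes:
.. code:: python
>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta', key_function=lambda name: name.split('.')[0])
>>> genes['NR_104212'][:10]  # the versionless accession is now a valid key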
You can specify a character to split names on, which will generate additional entries:
.. code:: python
>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta', split_char='.', duplicate_action="first") # default duplicate_action="stop"
>>> genes.keys()
dict_keys(['.1', 'NR_104212', 'NM_001282543', 'XM_005249644', 'XM_005249645', 'NR_104216', 'XM_005249643', 'NR_104215', 'KF435150', 'AB821309', 'NM_001282549', 'XR_241081', 'KF435149', 'XR_241079', 'NM_000465', 'XM_005265508', 'XR_241080', 'XM_005249642', 'NM_001282545', 'XM_005265507', 'NM_001282548'])
If your `key_function` or `split_char` generates duplicate entries, you can choose what action to take:
.. code:: python
# new in v0.4.9
>>> genes = Fasta('tests/data/genes.fasta', split_char="|", duplicate_action="longest")
>>> genes.keys()
dict_keys(['gi', '563317589', 'dbj', 'AB821309.1', '', '557361099', 'gb', 'KF435150.1', '557361097', 'KF435149.1', '543583796', 'ref', 'NR_104216.1', '543583795', 'NR_104215.1', '543583794', 'NR_104212.1', '543583788', 'NM_001282545.1', '543583786', 'NM_001282543.1', '543583785', 'NM_000465.3', '543583740', 'NM_001282549.1', '543583738', 'NM_001282548.1', '530384540', 'XM_005249645.1', '530384538', 'XM_005249644.1', '530384536', 'XM_005249643.1', '530384534', 'XM_005249642.1', '530373237', 'XM_005265508.1', '530373235', 'XM_005265507.1', '530364726', 'XR_241081.1', '530364725', 'XR_241080.1', '530364724', 'XR_241079.1'])
Filter functions (returning True) limit the index:
.. code:: python
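# A hedged sketch, not necessarily the example from the full documentation:
# ``filt_function`` (the keyword also visible in the CLI code further below)
# receives each sequence name and keeps only entries for which it returns True.
>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta', filt_function=lambda x: x[0] == 'N')
>>> genes.keys()  # only names beginning with 'N' remain in the index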
......@@ -226,6 +251,36 @@ Sequence names are truncated on any whitespace. This is a limitation of the inde
gi|557361097|gb|KF435149.1| Homo sapiens MDM4 protein variant G (MDM4) mRNA, complete cds
...
# new in v0.4.9
>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta', read_long_names=True)
>>> for record in genes:
... print(record.name)
...
gi|563317589|dbj|AB821309.1| Homo sapiens FGFR2-AHCYL1 mRNA for FGFR2-AHCYL1 fusion kinase protein, complete cds
gi|557361099|gb|KF435150.1| Homo sapiens MDM4 protein variant Y (MDM4) mRNA, complete cds, alternatively spliced
gi|557361097|gb|KF435149.1| Homo sapiens MDM4 protein variant G (MDM4) mRNA, complete cds
Sequence can be buffered in memory using a read-ahead buffer
for fast sequential access:
.. code:: python
>>> from timeit import timeit
>>> fetch = "genes['NM_001282543.1'][200:230]"
>>> read_ahead = "import pyfaidx; genes = pyfaidx.Fasta('tests/data/genes.fasta', read_ahead=10000)"
>>> no_read_ahead = "import pyfaidx; genes = pyfaidx.Fasta('tests/data/genes.fasta')"
>>> string_slicing = "genes = {}; genes['NM_001282543.1'] = 'N'*10000"
>>> timeit(fetch, no_read_ahead, number=10000)
0.2204863309962093
>>> timeit(fetch, read_ahead, number=10000)
0.1121859749982832
>>> timeit(fetch, string_slicing, number=10000)
0.0033553699977346696
Read-ahead buffering can reduce runtime by 1/2 for sequential accesses to buffered regions.
.. role:: red
If you want to modify the contents of your FASTA file in-place, you can use the `mutable` argument.
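A minimal sketch of in-place editing with the ``mutable`` keyword (slice assignment writes straight through to the file on disk, which is why the tests below operate on a ``genes_mutable.fasta`` copy):
.. code:: python
>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes_mutable.fasta', mutable=True)
>>> genes['NM_001282543.1'][:10] = 'NNNNNNNNNN'  # overwrites the first ten bases in the file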
......@@ -304,12 +359,18 @@ cli script: faidx
default base for missing positions and masking. default: N
-d DELIMITER, --delimiter DELIMITER
delimiter for splitting names to multiple values (duplicate names will be discarded). default: None
-e HEADER_FUNCTION, --header-function HEADER_FUNCTION
python function to modify header lines e.g: "lambda x: x.split("|")[0]". default: lambda x: x.split()[0]
-u {stop,first,last,longest,shortest}, --duplicates-action {stop,first,last,longest,shortest}
entry to take when duplicate sequence names are encountered. default: stop
-g REGEX, --regex REGEX
selected sequences are those matching regular expression. default: .*
-v, --invert-match selected sequences are those not matching 'regions' argument. default: False
-m, --mask-with-default-seq
mask the FASTA file using --default-seq default: False
-M, --mask-by-case mask the FASTA file by changing to lowercase. default: False
-e HEADER_FUNCTION, --header-function HEADER_FUNCTION
python function to modify header lines e.g: "lambda x: x.split("|")[0]". default: None
--no-rebuild do not rebuild the .fai index even if it is out of date. default: False
--version print pyfaidx version number
......@@ -424,8 +485,6 @@ Examples:
AGCTTCCCTGTGGTTTCCCGAGGCTTCCTTGCTTCCCGCTCTGCGAGGAGCCTTTCATCCGAAGGCGGGA
.......
$ faidx --size-range 5500,6000 -i chromsizes tests/data/genes.fasta
NM_000465.3 5523
......@@ -435,6 +494,14 @@ Examples:
$ faidx -M --bed regions.bed tests/data/genes.fasta
### Modifies tests/data/genes.fasta by masking regions using lowercase characters ###
$ faidx -e "lambda x: x.split('.')[0]" tests/data/genes.fasta -i bed
AB821309 1 3510
KF435150 1 481
KF435149 1 642
NR_104216 1 4573
NR_104215 1 5317
.......
Similar syntax as ``samtools faidx``
......@@ -462,6 +529,19 @@ A lower-level Faidx class is also available:
where "filename.fa" is the original FASTA file.
- Start and end coordinates are 1-based.
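A minimal sketch of the lower-level interface (``fetch`` takes the 1-based, inclusive coordinates described above; the record name is one of the entries in ``tests/data/genes.fasta``):
.. code:: python
>>> from pyfaidx import Faidx
>>> fa = Faidx('tests/data/genes.fasta')  # builds tests/data/genes.fasta.fai if it is missing
>>> fa.fetch('gi|557361099|gb|KF435150.1|', 100, 150)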
Support for compressed FASTA
----------------------------
``pyfaidx`` can create and read ``.fai`` indices for FASTA files that have
been compressed using the `bgzip <http://www.htslib.org/doc/tabix.html>`_
tool from `samtools <http://www.htslib.org/>`_. ``bgzip`` writes compressed
data in ``BGZF`` format. ``BGZF`` is ``gzip``-compatible: it consists of
multiple concatenated ``gzip`` blocks, each carrying an extra ``gzip``
header field, which makes it possible to build an index for rapid random
access. In other words, files compressed with ``bgzip`` are valid ``gzip``
files and can therefore be read by ``gunzip``. See `this description
<http://pydoc.net/Python/biopython/1.66/Bio.bgzf/>`_ for more details on
``bgzip``.
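In practice a ``bgzip``-compressed file is opened just like an uncompressed one; this support appears to rely on Biopython (the new tests below skip when it is unavailable), and ``mutable=True`` is not supported for compressed files. A minimal sketch:
.. code:: python
>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta.gz')  # index is written to tests/data/genes.fasta.gz.fai
>>> genes['gi|557361099|gb|KF435150.1|'][100:150]  # random access into the compressed data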
Changelog
---------
......@@ -478,7 +558,7 @@ I try to fix as many bugs as possible, but most of this work is supported by a s
Contributing
------------
Create a new Pull Request with one feauture. If you add a new feature, please
Create a new Pull Request with one feature. If you add a new feature, please
also create the relevant test.
To get the tests running on your machine:
......@@ -519,8 +599,3 @@ Comprehensive Cancer Center in the Department of Oncology.
.. |Appveyor| image:: https://ci.appveyor.com/api/projects/status/80ihlw30a003596w?svg=true
:target: https://ci.appveyor.com/project/mdshw5/pyfaidx
.. |Flattr| image:: http://button.flattr.com/flattr-badge-large.png
:target: https://flattr.com/submit/auto?fid=po00kq&url=https%3A%2F%2Fgithub.com%2Fmdshw5%2Fpyfaidx
Replace
Testsuite: autopkgtest-pkg-python
by a real test suite running build time tests
python-pyfaidx (0.5.3.1-1) unstable; urgency=medium
* New upstream version
* Point Vcs fields to salsa.debian.org
* Standards-Version: 4.1.4
* debhelper 11
* Testsuite: autopkgtest-pkg-python
-- Andreas Tille <tille@debian.org> Mon, 16 Apr 2018 13:12:29 +0200
python-pyfaidx (0.4.8.1-1) unstable; urgency=medium
* New upstream version
......
......@@ -2,8 +2,9 @@ Source: python-pyfaidx
Maintainer: Debian Med Packaging Team <debian-med-packaging@lists.alioth.debian.org>
Uploaders: Andreas Tille <tille@debian.org>
Section: python
Testsuite: autopkgtest-pkg-python
Priority: optional
Build-Depends: debhelper (>= 10),
Build-Depends: debhelper (>= 11~),
dh-python,
python-all,
python-coverage,
......@@ -19,9 +20,9 @@ Build-Depends: debhelper (>= 10),
python3-numpy,
python3-six,
python3-mock
Standards-Version: 3.9.8
Vcs-Browser: https://anonscm.debian.org/cgit/debian-med/python-pyfaidx.git
Vcs-Git: https://anonscm.debian.org/git/debian-med/python-pyfaidx.git
Standards-Version: 4.1.4
Vcs-Browser: https://salsa.debian.org/med-team/python-pyfaidx
Vcs-Git: https://salsa.debian.org/med-team/python-pyfaidx.git
Homepage: https://github.com/mdshw5/pyfaidx
Package: python-pyfaidx
......
......@@ -4,23 +4,23 @@ import sys
import os.path
import re
from pyfaidx import Fasta, wrap_sequence, FetchError, ucsc_split, bed_split
from collections import defaultdict
keepcharacters = (' ', '.', '_')
def write_sequence(args):
_, ext = os.path.splitext(args.fasta)
if ext:
ext = ext[1:] # remove the dot from extension
filt_function = re.compile(args.regex).search
fasta = Fasta(args.fasta, default_seq=args.default_seq, strict_bounds=not args.lazy, split_char=args.delimiter, filt_function=filt_function, rebuild=not args.no_rebuild)
fasta = Fasta(args.fasta, default_seq=args.default_seq, key_function=eval(args.header_function), strict_bounds=not args.lazy, split_char=args.delimiter, filt_function=filt_function, read_long_names=args.long_names, rebuild=not args.no_rebuild)
regions_to_fetch, split_function = split_regions(args)
if not regions_to_fetch:
regions_to_fetch = fasta.keys()
if args.invert_match:
sequences_to_exclude = set([split_function(region)[0] for region in regions_to_fetch])
fasta = Fasta(args.fasta, default_seq=args.default_seq, strict_bounds=not args.lazy, split_char=args.delimiter, rebuild=not args.no_rebuild)
fasta = Fasta(args.fasta, default_seq=args.default_seq, key_function=eval(args.header_function), strict_bounds=not args.lazy, split_char=args.delimiter, rebuild=not args.no_rebuild)
regions_to_fetch = (key for key in fasta.keys() if key not in sequences_to_exclude)
split_function = ucsc_split
......@@ -45,14 +45,14 @@ def write_sequence(args):
try:
if args.transform:
if not header and args.transform == 'nucleotide':
outfile.write("name\tstart\tend\tA\tT\tC\tG\tN\n")
outfile.write("name\tstart\tend\tA\tT\tC\tG\tN\tothers\n")
header = True
outfile.write(transform_sequence(args, fasta, name, start, end))
else:
for line in fetch_sequence(args, fasta, name, start, end):
outfile.write(line)
except FetchError as e:
raise FetchError(e.msg.rstrip() + "Try setting --lazy.\n")
raise FetchError(str(e) + " Try setting --lazy.\n")
if args.split_files:
outfile.close()
fasta.__exit__()
......@@ -61,6 +61,11 @@ def write_sequence(args):
def fetch_sequence(args, fasta, name, start=None, end=None):
try:
line_len = fasta.faidx.index[name].lenc
if args.auto_strand and start > end and start is not None and end is not None:
# flip (0, 1] coordinates
sequence = fasta[name][end - 1:start + 1]
sequence = sequence.reverse.complement
else:
sequence = fasta[name][start:end]
except KeyError:
sys.stderr.write("warning: {name} not found in file\n".format(**locals()))
......@@ -69,13 +74,13 @@ def fetch_sequence(args, fasta, name, start=None, end=None):
sequence = sequence.complement
if args.reverse:
sequence = sequence.reverse
if args.no_output:
return
if args.no_names:
pass
elif args.full_names:
yield ''.join(['>', fasta[name].long_name, '\n'])
else:
if start or end:
yield ''.join(['>', sequence.longname, '\n'])
if (start or end) and not args.no_coords:
yield ''.join(['>', sequence.fancy_name, '\n'])
else:
yield ''.join(['>', sequence.name, '\n'])
for line in wrap_sequence(line_len, sequence.seq):
......@@ -114,14 +119,23 @@ def split_regions(args):
def transform_sequence(args, fasta, name, start=None, end=None):
line_len = fasta.faidx.index[name].lenc
s = fasta[name][start:end]
if args.no_output:
return
if args.transform == 'bed':
return '{name}\t{start}\t{end}\n'.format(name=s.name, start=s.start, end=s.end)
return '{name}\t{start}\t{end}\n'.format(name=s.name, start=s.start - 1 , end=s.end)
elif args.transform == 'chromsizes':
return '{name}\t{length}\n'.format(name=s.name, length=len(s))
elif args.transform == 'nucleotide':
nucs = Counter(dict([('A', 0), ('T', 0), ('C', 0), ('G', 0), ('N', 0)]))
nucs.update(str(s).upper())
return '{name}\t{start}\t{end}\t{A}\t{T}\t{C}\t{G}\t{N}\n'.format(name=s.name, start=s.start, end=s.end, **nucs)
ss = str(s).upper()
nucs = defaultdict(int)
nucs.update([(c, str(ss).count(c)) for c in set(str(ss))])
A = nucs.pop('A', 0)
T = nucs.pop('T', 0)
C = nucs.pop('C', 0)
G = nucs.pop('G', 0)
N = nucs.pop('N', 0)
others = '|'.join([':'.join((k, str(v))) for k, v in nucs.items()])
return '{sname}\t{sstart}\t{send}\t{A}\t{T}\t{C}\t{G}\t{N}\t{others}\n'.format(sname=s.name, sstart=s.start, send=s.end, **locals())
elif args.transform == 'transposed':
return '{name}\t{start}\t{end}\t{seq}\n'.format(name=s.name, start=s.start, end=s.end, seq=str(s))
......@@ -133,25 +147,33 @@ def main(ext_args=None):
epilog="Please cite: Shirley MD, Ma Z, Pedersen BS, Wheelan SJ. (2015) Efficient \"pythonic\" access to FASTA files using pyfaidx. PeerJ PrePrints 3:e1196 https://dx.doi.org/10.7287/peerj.preprints.970v1")
parser.add_argument('fasta', type=str, help='FASTA file')
parser.add_argument('regions', type=str, nargs='*', help="space separated regions of sequence to fetch e.g. chr1:1-1000")
parser.add_argument('-b', '--bed', type=argparse.FileType('r'), help="bed file of regions")
parser.add_argument('-o', '--out', type=argparse.FileType('w'), help="output file name (default: stdout)")
parser.add_argument('-i', '--transform', type=str, choices=('bed', 'chromsizes', 'nucleotide', 'transposed'), help="transform the requested regions into another format. default: %(default)s")
parser.add_argument('-c', '--complement', action="store_true", default=False, help="complement the sequence. default: %(default)s")
parser.add_argument('-r', '--reverse', action="store_true", default=False, help="reverse the sequence. default: %(default)s")
parser.add_argument('-a', '--size-range', type=parse_size_range, default=None, help='selected sequences are in the size range [low, high]. example: 1,1000 default: %(default)s')
names = parser.add_mutually_exclusive_group()
_input = parser.add_argument_group('input options')
output = parser.add_argument_group('output options')
header = parser.add_argument_group('header options')
_input.add_argument('-b', '--bed', type=argparse.FileType('r'), help="bed file of regions")
output.add_argument('-o', '--out', type=argparse.FileType('w'), help="output file name (default: stdout)")
output.add_argument('-i', '--transform', type=str, choices=('bed', 'chromsizes', 'nucleotide', 'transposed'), help="transform the requested regions into another format. default: %(default)s")
output.add_argument('-c', '--complement', action="store_true", default=False, help="complement the sequence. default: %(default)s")
output.add_argument('-r', '--reverse', action="store_true", default=False, help="reverse the sequence. default: %(default)s")
output.add_argument('-y', '--auto-strand', action="store_true", default=False, help="reverse complement the sequence when start > end coordinate. default: %(default)s")
output.add_argument('-a', '--size-range', type=parse_size_range, default=None, help='selected sequences are in the size range [low, high]. example: 1,1000 default: %(default)s')
names = header.add_mutually_exclusive_group()
names.add_argument('-n', '--no-names', action="store_true", default=False, help="omit sequence names from output. default: %(default)s")
names.add_argument('-f', '--full-names', action="store_true", default=False, help="output full names including description. default: %(default)s")
parser.add_argument('-x', '--split-files', action="store_true", default=False, help="write each region to a separate file (names are derived from regions)")
parser.add_argument('-l', '--lazy', action="store_true", default=False, help="fill in --default-seq for missing ranges. default: %(default)s")
parser.add_argument('-s', '--default-seq', type=check_seq_length, default='N', help='default base for missing positions and masking. default: %(default)s')
parser.add_argument('-d', '--delimiter', type=str, default=None, help='delimiter for splitting names to multiple values (duplicate names will be discarded). default: %(default)s')
matcher = parser.add_mutually_exclusive_group()
names.add_argument('-f', '--long-names', action="store_true", default=False, help="output full (long) names from the input fasta headers. default: headers are truncated after the first whitespace")
header.add_argument('-t', '--no-coords', action="store_true", default=False, help="omit coordinates (e.g. chr:start-end) from output headers. default: %(default)s")
output.add_argument('-x', '--split-files', action="store_true", default=False, help="write each region to a separate file (names are derived from regions)")
output.add_argument('-l', '--lazy', action="store_true", default=False, help="fill in --default-seq for missing ranges. default: %(default)s")
output.add_argument('-s', '--default-seq', type=check_seq_length, default='N', help='default base for missing positions and masking. default: %(default)s')
header.add_argument('-d', '--delimiter', type=str, default=None, help='delimiter for splitting names to multiple values (duplicate names will be discarded). default: %(default)s')
header.add_argument('-e', '--header-function', type=str, default='lambda x: x.split()[0]', help='python function to modify header lines e.g: "lambda x: x.split("|")[0]". default: %(default)s')
header.add_argument('-u', '--duplicates-action', type=str, default="stop", choices=("stop", "first", "last", "longest", "shortest"), help='entry to take when duplicate sequence names are encountered. default: %(default)s')
matcher = header.add_mutually_exclusive_group()
matcher.add_argument('-g', '--regex', type=str, default='.*', help='selected sequences are those matching regular expression. default: %(default)s')
matcher.add_argument('-v', '--invert-match', action="store_true", default=False, help="selected sequences are those not matching 'regions' argument. default: %(default)s")
masking = parser.add_mutually_exclusive_group()
masking = output.add_mutually_exclusive_group()
masking.add_argument('-m', '--mask-with-default-seq', action="store_true", default=False, help="mask the FASTA file using --default-seq default: %(default)s")
masking.add_argument('-M', '--mask-by-case', action="store_true", default=False, help="mask the FASTA file by changing to lowercase. default: %(default)s")
output.add_argument('--no-output', action="store_true", default=False, help="do not output any sequence. default: %(default)s")
parser.add_argument('--no-rebuild', action="store_true", default=False, help="do not rebuild the .fai index even if it is out of date. default: %(default)s")
parser.add_argument('--version', action="version", version=__version__, help="print pyfaidx version number")
# print help usage if no arguments are supplied
......@@ -163,12 +185,19 @@ def main(ext_args=None):
else:
args = parser.parse_args()
if args.auto_strand:
if args.complement:
sys.stderr.write("--auto-strand and --complement are both set. Are you sure this is what you want?\n")
if args.reverse:
sys.stderr.write("--auto-strand and --reverse are both set. Are you sure this is what you want?\n")
if args.mask_with_default_seq or args.mask_by_case:
mask_sequence(args)
else:
write_sequence(args)
def check_seq_length(value):
if len(value) != 1:
raise argparse.ArgumentTypeError("--default-seq value must be a single character!")
......
......@@ -48,6 +48,10 @@ def make_long_fasta(filename, nrecs=250, seqlen=SEQLEN):
f.write(line)
return headers
def bgzip_compress_fasta(filename):
from subprocess import call
call(' '.join(['bgzip', '-c', filename, '>', filename + '.gz']), shell=True)
def read_dict(f, headers):
for k in islice(headers, 0, None, 10):
for start, end in intervals:
......@@ -89,7 +93,10 @@ def read_samtools(f, headers):
def main():
fa_file = NamedTemporaryFile()
index = fa_file.name + '.fai'
bgzf_index = fa_file.name + '.gz.fai'
headers = make_long_fasta(fa_file.name)
bgzip_compress_fasta(fa_file.name)
def pyfaidx_fasta(n):
print('timings for pyfaidx.Fasta')
ti = []
......@@ -113,6 +120,29 @@ def main():
print(mean(tf)/nreads/10*1000*1000)
tracemalloc.stop()
def pyfaidx_bgzf_faidx(n):
print('timings for pyfaidx.Faidx with bgzf compression')
ti = []
tf = []
for _ in range(n):
t = time.time()
f = pyfaidx.Faidx(fa_file.name + '.gz')
ti.append(time.time() - t)
t = time.time()
read_faidx(f, headers)
tf.append(time.time() - t)
os.remove(index)
# profile memory usage and report timings
tracemalloc.start()
f = pyfaidx.Faidx(fa_file.name + '.gz')
read_faidx(f, headers)
os.remove(index)
print(tracemalloc.get_traced_memory())
print(mean(ti))
print(mean(tf)/nreads/10*1000*1000)
tracemalloc.stop()
def pyfaidx_faidx(n):
print('timings for pyfaidx.Faidx')
ti = []
......@@ -276,6 +306,7 @@ def main():
n = 3
pyfaidx_fasta(n)
pyfaidx_faidx(n)
pyfaidx_bgzf_faidx(n)
pyfasta_fasta(n)
pyfasta_fseek(n)
seqio_read(n)
......
......@@ -65,6 +65,11 @@ def fake_chr22(filename):
chr22_len -= 70
fake_file.write('N' * mod_70 + '\n')
def bgzip_compress_fasta(filename):
from subprocess import call
call(' '.join(['bgzip', '-c', filename, '>', filename + '.gz']), shell=True)
def fetch_chr22_vcf(filename):
from subprocess import call
call(['curl', '-s', 'ftp://ftp-trace.ncbi.nih.gov//1000genomes/ftp/pilot_data/release/2010_07/exon/snps/CEU.exon.2010_03.genotypes.vcf.gz',
......@@ -83,3 +88,5 @@ if __name__ == "__main__":
fetch_chr22_vcf("chr22.vcf.gz")
if not os.path.isfile("chr22.fasta"):
fetch_chr22("chr22.fasta")
bgzip_compress_fasta("genes.fasta")
bgzip_compress_fasta("chr22.fasta")
chr17 6010 31420 uc010vpx.1 0 - 6012 31270 0 6 158,127,110,75,80,523, 0,5195,5861,7910,16317,24887,
>chr17:6010-31420(-)
CGGAGCGGGCAGCGGCCAAGTCAGGGCCGTCCGGGGGCGCGGCCGGCGATGCCCGCAGCCCCCgccgcgccccgccgggcctgctgagccgcccccgggccggggtcgcgccgggccgggccgcgcccggggcggggcggCGCTGCCTGCATGACCCTCCGGCGGCGCGGGGAGAAGGCGACCATCAGCATCCAGGAGCATATGGCCATCGACGTGTGCCCCGGCCCCATCCGTCCCATCAAGCAGATCTCCGACTACTTCCCCCGCTTcccgcggggcctgcccccggacgccgggccccgagccgctgcacccccggacgcccccgcgcgcccggctgtggccggtgccggccgccgcagcccctccgacggcgcccgcgAGGACGACGAGGATGTGGACCAGCTCTTCGGAGCCTAcggctccagcccgggccccagcccgggtcccagccccGCGCGGCCGCCAGCCAAGCCGCCGGAGGACGAGCCGGACGCCGACGGCTACGAGTCGGACGACTGCACTGCCCTGGGCACGCTGGACTTCAGCCTGCTGTATGACCAGGAGAACAACGCCCTCCACTGCACCATCACCAAGGCCAAGGGCCTGAAGCCAATGGACCACAATGGGCTGGCAGACCCCTACGTCAAGCTGCACCTGCTGCCAGGAGCCAGTAAGGCAAATAAGCTCAGAACAAAAACTCTCCGTAACACTCTGAACCCCACATGGAACGAGACCCTCACTTACTACGGGATCACAGATGAAGACATGATCCGCAAGACCCTGCGGATCTCTGTGTGTGACGAGGACAAATTCCGGCACAATGAGTTCATCGGGGAGACACGTGTGCCCCTGAAGAAGCTGAAACCCAACCACACCAAGACCTTCAGCATCTGCCTGGAGAAGCAGCTGCCGGTGGACAAGACTGAAGACAAGTCCCTGGAGGAGCGGGGCCGCATCCTCATCTCCCTCAAGTACAGCTCACAGAAGCAAGGCCTGCTGGTAGGCATCGTGCGGTGCGCCCACCTGGCCGCCATGGACGCCAACGGCTACTCGGACCCCTACGTGAAAAC
......@@ -75,6 +75,30 @@ class TestFastaRecord(TestCase):
sys.stdout.writelines(tuple(Differ().compare(deflines, long_names)))
assert deflines == long_names
def test_unpadded_length(self):
filename = "data/padded.fasta"
with open(filename, 'w') as padded:
padded.write(">test_padded\n")
for n in range(10):
padded.write("N" * 80)
padded.write("\n")
padded.write("N" * 30)
padded.write("A" * 20)
padded.write("N" * 30)
padded.write("\n")
for n in range(10):
padded.write("N" * 80)
padded.write("\n")
fasta = Fasta(filename)
expect = 20
result = fasta["test_padded"].unpadded_len
print (expect, result)
assert expect == result
os.remove('data/padded.fasta')
os.remove('data/padded.fasta.fai')
class TestMutableFastaRecord(TestCase):
def setUp(self):
with open('data/genes_mutable.fasta', 'wb') as mutable:
......
......@@ -25,3 +25,10 @@ class TestFastaRecordIter(TestCase):
fasta = Fasta('data/genes.fasta')
for record in fasta:
assert len(next(iter(record))) == fasta.faidx.index[record.name].lenc
def test_reverse_iter(self):
expect = list(chain(*([line[::-1] for line in record][::-1] for record in Fasta('data/genes.fasta', as_raw=True))))
result = list(chain(*([line for line in reversed(record)] for record in Fasta('data/genes.fasta', as_raw=True))))
for a, b in zip(expect, result):
print(a, b)
assert expect == result
......@@ -9,6 +9,9 @@ os.chdir(path)
class TestFastaVariant(TestCase):
def setUp(self):
raise SkipTest
def tearDown(self):
try:
os.remove('data/chr22.fasta.fai')
......
import os
from pyfaidx import Fasta, Faidx, UnsupportedCompressionFormat, FetchError
from itertools import chain
try:
from unittest import TestCase, expectedFailure
except ImportError:
from unittest import TestCase
from nose.plugins.skip import SkipTest as expectedFailure # python2.6
from nose.tools import raises
from nose.plugins.skip import SkipTest
path = os.path.dirname(__file__)
os.chdir(path)
class TestIndexing(TestCase):
def setUp(self):
try:
from Bio import SeqIO
except ImportError:
raise SkipTest
def tearDown(self):
try:
os.remove('data/genes.fasta.gz.fai')
except EnvironmentError:
pass # some tests may delete this file
@expectedFailure
def test_build_issue_126(self):
""" Samtools BGZF index should be identical to pyfaidx BGZF index """
expect_index = ("gi|563317589|dbj|AB821309.1| 3510 114 70 71\n"
"gi|557361099|gb|KF435150.1| 481 3789 70 71\n"
"gi|557361097|gb|KF435149.1| 642 4368 70 71\n"
"gi|543583796|ref|NR_104216.1| 4573 5141 70 71\n"
"gi|543583795|ref|NR_104215.1| 5317 9901 70 71\n"
"gi|543583794|ref|NR_104212.1| 5374 15415 70 71\n"
"gi|543583788|ref|NM_001282545.1| 4170 20980 70 71\n"
"gi|543583786|ref|NM_001282543.1| 5466 25324 70 71\n"
"gi|543583785|ref|NM_000465.3| 5523 30980 70 71\n"
"gi|543583740|ref|NM_001282549.1| 3984 36696 70 71\n"
"gi|543583738|ref|NM_001282548.1| 4113 40851 70 71\n"
"gi|530384540|ref|XM_005249645.1| 2752 45151 70 71\n"
"gi|530384538|ref|XM_005249644.1| 3004 48071 70 71\n"
"gi|530384536|ref|XM_005249643.1| 3109 51246 70 71\n"
"gi|530384534|ref|XM_005249642.1| 3097 54528 70 71\n"
"gi|530373237|ref|XM_005265508.1| 2794 57830 70 71\n"
"gi|530373235|ref|XM_005265507.1| 2848 60824 70 71\n"
"gi|530364726|ref|XR_241081.1| 1009 63849 70 71\n"
"gi|530364725|ref|XR_241080.1| 4884 65009 70 71\n"
"gi|530364724|ref|XR_241079.1| 2819 70099 70 71\n")
index_file = Faidx('data/genes.fasta.gz').indexname
result_index = open(index_file).read()
assert result_index == expect_index
class TestFastaBGZF(TestCase):
def setUp(self):
try:
from Bio import SeqIO
except ImportError:
raise SkipTest
def tearDown(self):
try:
os.remove('data/genes.fasta.gz.fai')
except EnvironmentError:
pass # some tests may delete this file
def test_integer_slice(self):
fasta = Fasta('data/genes.fasta.gz')
expect = fasta['gi|563317589|dbj|AB821309.1|'][:100].seq
result = fasta[0][:100].seq
assert expect == result
def test_integer_index(self):
fasta = Fasta('data/genes.fasta.gz')
expect = fasta['gi|563317589|dbj|AB821309.1|'][100].seq
result = fasta[0][100].seq
assert expect == result
def test_fetch_whole_fasta(self):
expect = [line.rstrip('\n') for line in open('data/genes.fasta') if line[0] != '>']
result = list(chain(*([line for line in record] for record in Fasta('data/genes.fasta.gz', as_raw=True))))
assert expect == result
def test_line_len(self):
fasta = Fasta('data/genes.fasta.gz')
for record in fasta:
assert len(next(iter(record))) == fasta.faidx.index[record.name].lenc
@raises(UnsupportedCompressionFormat)
def test_mutable_bgzf(self):
fasta = Fasta('data/genes.fasta.gz', mutable=True)
@raises(NotImplementedError)
def test_long_names(self):
""" Test that deflines extracted using FastaRecord.long_name are
identical to deflines in the actual file.
"""
deflines = []
with open('data/genes.fasta') as fasta_file:
for line in fasta_file:
if line[0] == '>':
deflines.append(line[1:-1])
fasta = Fasta('data/genes.fasta.gz')
long_names = []
for record in fasta:
long_names.append(record.long_name)
assert deflines == long_names
def test_fetch_whole_entry(self):
faidx = Faidx('data/genes.fasta.gz')
expect = ('ATGACATCATTTTCCACCTCTGCTCAGTGTTCAACATCTGA'
'CAGTGCTTGCAGGATCTCTCCTGGACAAATCAATCAGGTACGACCA'
'AAACTGCCGCTTTTGAAGATTTTGCATGCAGCAGGTGCGCAAGG'
'TGAAATGTTCACTGTTAAAGAGGTCATGCACTATTTAGGTCAGTACAT'
'AATGGTGAAGCAACTTTATGATCAGCAGGAGCAGCATATGGTATATTG'
'TGGTGGAGATCTTTTGGGAGAACTACTGGGACGTCAGAGCTTCTCCGTG'
'AAAGACCCAAGCCCTCTCTATGATATGCTAAGAAAGAATCTTGTCACTTT'
'AGCCACTGCTACTACAGCAAAGTGCAGAGGAAAGTTCCACTTCCAGAAAAA'
'GAACTACAGAAGACGATATCCCCACACTGCCTACCTCAGAGCATAAATGCA'
'TACATTCTAGAGAAGGTGATTGAAGTGGGAAAAAATGATGACCTGGAGGACTC')
result = faidx.fetch('gi|557361099|gb|KF435150.1|',
1, 481)
assert str(result) == expect
def test_fetch_middle(self):
faidx = Faidx('data/genes.fasta.gz')
expect = 'TTGAAGATTTTGCATGCAGCAGGTGCGCAAGGTGAAATGTTCACTGTTAAA'
result = faidx.fetch('gi|557361099|gb|KF435150.1|',
100, 150)
assert str(result) == expect
def test_fetch_end(self):
faidx = Faidx('data/genes.fasta.gz')
expect = 'TC'
result = faidx.fetch('gi|557361099|gb|KF435150.1|',
480, 481)
assert str(result) == expect
@raises(FetchError)
def test_fetch_border(self):
""" Fetch past the end of a gene entry """
faidx = Faidx('data/genes.fasta.gz')
expect = 'TC'
result = faidx.fetch('gi|557361099|gb|KF435150.1|',
480, 500)
print(result)
assert str(result) == expect
def test_rev(self):
faidx = Faidx('data/genes.fasta.gz')
expect = 'GA'
result = faidx.fetch('gi|557361099|gb|KF435150.1|',
480, 481)
assert str(-result) == expect, result
@raises(FetchError)
def test_fetch_past_bounds(self):
""" Fetch past the end of a gene entry """
faidx = Faidx('data/genes.fasta.gz', strict_bounds=True)
result = faidx.fetch('gi|557361099|gb|KF435150.1|',
480, 5000)
@raises(FetchError)
def test_fetch_negative(self):
""" Fetch starting with a negative coordinate """
faidx = Faidx('data/genes.fasta.gz', strict_bounds=True)
result = faidx.fetch('gi|557361099|gb|KF435150.1|',
-10, 10)
@raises(FetchError)
def test_fetch_reversed_coordinates(self):
""" Fetch starting with a negative coordinate """
faidx = Faidx('data/genes.fasta.gz', strict_bounds=True)
result = faidx.fetch('gi|557361099|gb|KF435150.1|',
50, 10)
@raises(FetchError)
def test_fetch_keyerror(self):
""" Fetch a key that does not exist """
faidx = Faidx('data/genes.fasta.gz', strict_bounds=True)
result = faidx.fetch('gi|joe|gb|KF435150.1|',
1, 10)
def test_blank_string(self):
""" seq[0:0] should return a blank string mdshw5/pyfaidx#53 """
fasta = Fasta('data/genes.fasta.gz', as_raw=True)
assert fasta['gi|557361099|gb|KF435150.1|'][0:0] == ''
def test_slice_from_beginning(self):
fasta = Fasta('data/genes.fasta.gz', as_raw=True)
assert fasta['gi|557361099|gb|KF435150.1|'][:4] == 'ATGA'
def test_slice_from_end(self):
fasta = Fasta('data/genes.fasta.gz', as_raw=True)
assert fasta['gi|557361099|gb|KF435150.1|'][-4:] == 'ACTC'
def test_issue_74_start(self):
f0 = Fasta('data/genes.fasta.gz', one_based_attributes=False)
f1 = Fasta('data/genes.fasta.gz', one_based_attributes=True)
assert f0['gi|557361099|gb|KF435150.1|'][0:90].start == f1['gi|557361099|gb|KF435150.1|'][0:90].start - 1
def test_issue_74_consistency(self):
f0 = Fasta('data/genes.fasta.gz', one_based_attributes=False)
f1 = Fasta('data/genes.fasta.gz', one_based_attributes=True)
assert str(f0['gi|557361099|gb|KF435150.1|'][0:90]) == str(f1['gi|557361099|gb|KF435150.1|'][0:90])
def test_issue_74_end_faidx(self):
f0 = Faidx('data/genes.fasta.gz', one_based_attributes=False)
f1 = Faidx('data/genes.fasta.gz', one_based_attributes=True)
end0 = f0.fetch('gi|557361099|gb|KF435150.1|', 1, 90).end
end1 = f1.fetch('gi|557361099|gb|KF435150.1|', 1, 90).end
assert end0 == end1
def test_issue_74_end_fasta(self):
f0 = Fasta('data/genes.fasta.gz', one_based_attributes=False)
f1 = Fasta('data/genes.fasta.gz', one_based_attributes=True)
end0 = f0['gi|557361099|gb|KF435150.1|'][1:90].end
end1 = f1['gi|557361099|gb|KF435150.1|'][1:90].end
print((end0, end1))
assert end0 == end1
def test_issue_79_fix(self):
f = Fasta('data/genes.fasta.gz')
s = f['gi|557361099|gb|KF435150.1|'][100:105]
print((s.start, s.end))
assert (101, 105) == (s.start, s.end)
def test_issue_79_fix_negate(self):
f = Fasta('data/genes.fasta.gz')
s = f['gi|557361099|gb|KF435150.1|'][100:105]
s = -s
print((s.start, s.end))
assert (105, 101) == (s.start, s.end)
def test_issue_79_fix_one_based_false(self):
f = Fasta('data/genes.fasta.gz', one_based_attributes=False)
s = f['gi|557361099|gb|KF435150.1|'][100:105]
print((s.start, s.end))
assert (100, 105) == (s.start, s.end)
def test_issue_79_fix_one_based_false_negate(self):
f = Fasta('data/genes.fasta.gz', one_based_attributes=False)
s = f['gi|557361099|gb|KF435150.1|'][100:105]
print(s.__dict__)
s = -s
print(s.__dict__)
assert (105, 100) == (s.start, s.end)
@raises(FetchError)
def test_fetch_border_padded(self):
""" Fetch past the end of a gene entry """
faidx = Faidx('data/genes.fasta.gz', default_seq='N')
expect = 'TCNNNNNNNNNNNNNNNNNNN'
result = faidx.fetch('gi|557361099|gb|KF435150.1|',
480, 500)
print(result)
assert str(result) == expect
import os
import filecmp
from pyfaidx import FastaIndexingError, BedError, FetchError
from pyfaidx.cli import main
from nose.tools import raises
from unittest import TestCase
from tempfile import NamedTemporaryFile
path = os.path.dirname(__file__)
os.chdir(path)
......@@ -36,3 +38,15 @@ class TestCLI(TestCase):
def test_key_warning(self):
main(['data/genes.fasta', 'foo'])
def test_auto_strand(self):
""" Test that --auto-strand produces the same output as --reverse --complement"""
with NamedTemporaryFile() as auto_strand:
with NamedTemporaryFile() as noto_strand:
main(['--auto-strand', '-o', auto_strand.name, 'data/genes.fasta', 'gi|557361099|gb|KF435150.1|:100-1'])
main(['--reverse', '--complement', '-o', noto_strand.name, 'data/genes.fasta', 'gi|557361099|gb|KF435150.1|:1-100'])
print(auto_strand.read())
print()
print(noto_strand.read())
self.assertTrue(filecmp.cmp(auto_strand.name, noto_strand.name))