Commits on Source (8)
......@@ -4,3 +4,4 @@ __pycache__/
*~
.tox
venv/
src/xopen/_version.py
......@@ -12,11 +12,13 @@ python:
- "3.5"
- "3.6"
- "3.7"
- "3.8-dev"
install:
- pip install .
script:
- sudo apt-get update && sudo apt-get install -y pigz
- python setup.py --version # Detect encoding problems
- python -m pytest
......@@ -30,10 +32,13 @@ jobs:
services:
- docker
python: "3.6"
install: python3 -m pip install twine 'requests-toolbelt!=0.9.0'
install: python3 -m pip install twine
if: tag IS present
script:
- |
python3 setup.py sdist
ls -l dist/
python3 -m twine upload dist/*
allowed_failures:
- python: "3.8-dev"
\ No newline at end of file
Metadata-Version: 2.1
Name: xopen
Version: 0.5.0
Version: 0.7.3
Summary: Open compressed files transparently
Home-page: https://github.com/marcelm/xopen/
Author: Marcel Martin
......@@ -22,9 +22,11 @@ Description: .. image:: https://travis-ci.org/marcelm/xopen.svg?branch=master
recognized by their file extensions `.gz`, `.bz2` or `.xz`.
The focus is on being as efficient as possible on all supported Python versions.
For example, simply using ``gzip.open`` is very slow in older Pythons, and
it is a lot faster to use a ``gzip`` subprocess. For writing to gzip files,
``xopen`` uses ``pigz`` when available.
For example, ``xopen`` uses ``pigz``, which is a parallel version of ``gzip``,
to open ``.gz`` files, which is faster than using the built-in ``gzip.open``
function. ``pigz`` can use multiple threads when compressing, but is also faster
when reading ``.gz`` files, so it is used both for reading and writing if it is
available.
This module has originally been developed as part of the `cutadapt
tool <https://cutadapt.readthedocs.io/>`_ that is used in bioinformatics to
......@@ -52,21 +54,26 @@ Description: .. image:: https://travis-ci.org/marcelm/xopen.svg?branch=master
content = f.read()
f.close()
Open a file for writing::
Open a file in binary mode for writing::
from xopen import xopen
with xopen('file.txt.gz', mode='w') as f:
f.write('Hello')
with xopen('file.txt.gz', mode='wb') as f:
f.write(b'Hello')
Credits
-------
The name ``xopen`` was taken from the C function of the same name in the
`utils.h file which is part of BWA <https://github.com/lh3/bwa/blob/83662032a2192d5712996f36069ab02db82acf67/utils.h>`_.
`utils.h file which is part of
BWA <https://github.com/lh3/bwa/blob/83662032a2192d5712996f36069ab02db82acf67/utils.h>`_.
Kyle Beauchamp <https://github.com/kyleabeauchamp/> has contributed support for appending to files.
Kyle Beauchamp <https://github.com/kyleabeauchamp/> has contributed support for
appending to files.
Ruben Vorderman <https://github.com/rhpvorderman/> contributed improvements to
make reading gzipped files faster.
Some ideas were taken from the `canopener project <https://github.com/selassid/canopener>`_.
If you also want to open S3 files, you may want to use that module instead.
......@@ -75,6 +82,12 @@ Description: .. image:: https://travis-ci.org/marcelm/xopen.svg?branch=master
Changes
-------
v0.6.0
~~~~~~
* For reading from gzipped files, xopen will now use a ``pigz`` subprocess.
This is faster than using ``gzip.open``.
* Python 2 support will be dropped in one of the next releases.
v0.5.0
~~~~~~
* By default, pigz is now only allowed to use at most four threads. This hopefully reduces
......
......@@ -14,9 +14,11 @@ Supported compression formats are gzip, bzip2 and xz. They are automatically
recognized by their file extensions `.gz`, `.bz2` or `.xz`.
The focus is on being as efficient as possible on all supported Python versions.
For example, simply using ``gzip.open`` is very slow in older Pythons, and
it is a lot faster to use a ``gzip`` subprocess. For writing to gzip files,
``xopen`` uses ``pigz`` when available.
For example, ``xopen`` uses ``pigz``, which is a parallel version of ``gzip``,
to open ``.gz`` files, which is faster than using the built-in ``gzip.open``
function. ``pigz`` can use multiple threads when compressing, but is also faster
when reading ``.gz`` files, so it is used both for reading and writing if it is
available.
This module has originally been developed as part of the `cutadapt
tool <https://cutadapt.readthedocs.io/>`_ that is used in bioinformatics to
......@@ -44,21 +46,26 @@ Or without context manager::
content = f.read()
f.close()
Open a file for writing::
Open a file in binary mode for writing::
from xopen import xopen
with xopen('file.txt.gz', mode='w') as f:
f.write('Hello')
with xopen('file.txt.gz', mode='wb') as f:
f.write(b'Hello')
Credits
-------
The name ``xopen`` was taken from the C function of the same name in the
`utils.h file which is part of BWA <https://github.com/lh3/bwa/blob/83662032a2192d5712996f36069ab02db82acf67/utils.h>`_.
`utils.h file which is part of
BWA <https://github.com/lh3/bwa/blob/83662032a2192d5712996f36069ab02db82acf67/utils.h>`_.
Kyle Beauchamp <https://github.com/kyleabeauchamp/> has contributed support for appending to files.
Kyle Beauchamp <https://github.com/kyleabeauchamp/> has contributed support for
appending to files.
Ruben Vorderman <https://github.com/rhpvorderman/> contributed improvements to
make reading gzipped files faster.
Some ideas were taken from the `canopener project <https://github.com/selassid/canopener>`_.
If you also want to open S3 files, you may want to use that module instead.
......@@ -67,6 +74,12 @@ If you also want to open S3 files, you may want to use that module instead.
Changes
-------
v0.6.0
~~~~~~
* For reading from gzipped files, xopen will now use a ``pigz`` subprocess.
This is faster than using ``gzip.open``.
* Python 2 support will be dropped in one of the next releases.
v0.5.0
~~~~~~
* By default, pigz is now only allowed to use at most four threads. This hopefully reduces
......
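The README change above says that `pigz` is used for reading `.gz` files when it is available, with the built-in `gzip` module as the alternative. A minimal sketch of that idea, using only the standard library, might look like the following. The function name `read_gz_bytes` is hypothetical and not part of xopen, and unlike xopen this sketch reads the whole file into memory instead of streaming from a pipe.

```python
import gzip
import shutil
import subprocess


def read_gz_bytes(path):
    """Return the decompressed contents of a .gz file as bytes.

    Hypothetical helper for illustration only; not part of xopen.
    """
    if shutil.which("pigz") is not None:
        # -c writes to stdout, -d decompresses (same flags as in the diff)
        result = subprocess.run(
            ["pigz", "-cd", path],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
        if result.returncode != 0:
            raise IOError(result.stderr.decode(errors="replace"))
        return result.stdout
    # pigz is not installed: fall back to the built-in gzip module
    with gzip.open(path, "rb") as f:
        return f.read()
```

Checking for the executable up front is a design choice of this sketch; the code in this diff instead starts the subprocess and catches the exception raised when `pigz` is missing.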
python-xopen (0.7.3-1) unstable; urgency=medium
* Remove Python2
* New upstream version
* debhelper-compat 12
* Standards-Version: 4.4.0
* (Build-)Depends: pigz
-- Andreas Tille <tille@debian.org> Fri, 02 Aug 2019 21:24:25 +0200
python-xopen (0.5.0-2) unstable; urgency=medium
* Add missing Depends
......
......@@ -4,53 +4,26 @@ Uploaders: Andreas Tille <tille@debian.org>
Section: python
Testsuite: autopkgtest-pkg-python
Priority: optional
Build-Depends: debhelper (>= 12~),
Build-Depends: debhelper-compat (= 12),
dh-python,
python,
python-setuptools,
python-nose,
python-bz2file,
python-setuptools-scm,
python-pytest,
python3,
python3-setuptools,
python3-nose,
python3-bz2file,
python3-setuptools-scm,
python3-pytest
Standards-Version: 4.3.0
python3-pytest,
pigz
Standards-Version: 4.4.0
Vcs-Browser: https://salsa.debian.org/med-team/python-xopen
Vcs-Git: https://salsa.debian.org/med-team/python-xopen.git
Homepage: https://github.com/marcelm/xopen
Package: python-xopen
Architecture: all
Depends: ${python:Depends},
${misc:Depends},
python-pkg-resources,
python-bz2file
Provides: ${python:Provides}
Description: Python module to open compressed files transparently
This small Python module provides a xopen function that works like the
built-in open function, but can also deal with compressed files.
Supported compression formats are gzip, bzip2 and xz. They are
automatically recognized by their file extensions .gz, .bz2 or .xz.
.
The focus is on being as efficient as possible on all supported Python
versions. For example, simply using gzip.open is slow in older Pythons,
and it is a lot faster to use a gzip subprocess.
.
This module has originally been developed as part of the cutadapt tool
that is used in bioinformatics to manipulate sequencing data. It has
been in successful use within that software for a few years.
.
This is the Python 2 version.
Package: python3-xopen
Architecture: all
Depends: ${python3:Depends},
${misc:Depends},
python3-pkg-resources
python3-pkg-resources,
pigz
Description: Python3 module to open compressed files transparently
This small Python3 module provides a xopen function that works like the
built-in open function, but can also deal with compressed files.
......
......@@ -4,4 +4,4 @@ DH_VERBOSE := 1
export PYBUILD_NAME=xopen
%:
dh $@ --with python2,python3 --buildsystem=pybuild
dh $@ --with python3 --buildsystem=pybuild
import sys
from setuptools import setup
from setuptools import setup, find_packages
if sys.version_info < (2, 7):
sys.stdout.write("At least Python 2.7 is required.\n")
......@@ -10,7 +10,7 @@ with open('README.rst') as f:
setup(
name='xopen',
use_scm_version=True,
use_scm_version={'write_to': 'src/xopen/_version.py'},
setup_requires=['setuptools_scm'], # Support pip versions that don't know about pyproject.toml
author='Marcel Martin',
author_email='mail@marcelm.net',
......@@ -18,7 +18,8 @@ setup(
description='Open compressed files transparently',
long_description=long_description,
license='MIT',
py_modules=['xopen'],
package_dir={'': 'src'},
packages=find_packages('src'),
install_requires=[
'bz2file; python_version=="2.7"',
],
......
Metadata-Version: 2.1
Name: xopen
Version: 0.5.0
Version: 0.7.3
Summary: Open compressed files transparently
Home-page: https://github.com/marcelm/xopen/
Author: Marcel Martin
......@@ -22,9 +22,11 @@ Description: .. image:: https://travis-ci.org/marcelm/xopen.svg?branch=master
recognized by their file extensions `.gz`, `.bz2` or `.xz`.
The focus is on being as efficient as possible on all supported Python versions.
For example, simply using ``gzip.open`` is very slow in older Pythons, and
it is a lot faster to use a ``gzip`` subprocess. For writing to gzip files,
``xopen`` uses ``pigz`` when available.
For example, ``xopen`` uses ``pigz``, which is a parallel version of ``gzip``,
to open ``.gz`` files, which is faster than using the built-in ``gzip.open``
function. ``pigz`` can use multiple threads when compressing, but is also faster
when reading ``.gz`` files, so it is used both for reading and writing if it is
available.
This module has originally been developed as part of the `cutadapt
tool <https://cutadapt.readthedocs.io/>`_ that is used in bioinformatics to
......@@ -52,21 +54,26 @@ Description: .. image:: https://travis-ci.org/marcelm/xopen.svg?branch=master
content = f.read()
f.close()
Open a file for writing::
Open a file in binary mode for writing::
from xopen import xopen
with xopen('file.txt.gz', mode='w') as f:
f.write('Hello')
with xopen('file.txt.gz', mode='wb') as f:
f.write(b'Hello')
Credits
-------
The name ``xopen`` was taken from the C function of the same name in the
`utils.h file which is part of BWA <https://github.com/lh3/bwa/blob/83662032a2192d5712996f36069ab02db82acf67/utils.h>`_.
`utils.h file which is part of
BWA <https://github.com/lh3/bwa/blob/83662032a2192d5712996f36069ab02db82acf67/utils.h>`_.
Kyle Beauchamp <https://github.com/kyleabeauchamp/> has contributed support for appending to files.
Kyle Beauchamp <https://github.com/kyleabeauchamp/> has contributed support for
appending to files.
Ruben Vorderman <https://github.com/rhpvorderman/> contributed improvements to
make reading gzipped files faster.
Some ideas were taken from the `canopener project <https://github.com/selassid/canopener>`_.
If you also want to open S3 files, you may want to use that module instead.
......@@ -75,6 +82,12 @@ Description: .. image:: https://travis-ci.org/marcelm/xopen.svg?branch=master
Changes
-------
v0.6.0
~~~~~~
* For reading from gzipped files, xopen will now use a ``pigz`` subprocess.
This is faster than using ``gzip.open``.
* Python 2 support will be dropped in one of the next releases.
v0.5.0
~~~~~~
* By default, pigz is now only allowed to use at most four threads. This hopefully reduces
......
......@@ -7,15 +7,16 @@ pyproject.toml
setup.cfg
setup.py
tox.ini
xopen.py
src/xopen/__init__.py
src/xopen/_version.py
src/xopen.egg-info/PKG-INFO
src/xopen.egg-info/SOURCES.txt
src/xopen.egg-info/dependency_links.txt
src/xopen.egg-info/requires.txt
src/xopen.egg-info/top_level.txt
tests/file.txt
tests/file.txt.bz2
tests/file.txt.gz
tests/file.txt.xz
tests/hello.gz
tests/test_xopen.py
\ No newline at end of file
xopen.egg-info/PKG-INFO
xopen.egg-info/SOURCES.txt
xopen.egg-info/dependency_links.txt
xopen.egg-info/requires.txt
xopen.egg-info/top_level.txt
\ No newline at end of file
......@@ -9,14 +9,8 @@ import io
import os
import time
from subprocess import Popen, PIPE
from pkg_resources import get_distribution, DistributionNotFound
try:
__version__ = get_distribution(__name__).version
except DistributionNotFound:
# package is not installed
pass
from ._version import version as __version__
_PY3 = sys.version > '3'
......@@ -176,12 +170,27 @@ class PipedGzipWriter(Closing):
if retcode != 0:
raise IOError("Output {0} process terminated with exit code {1}".format(self.program, retcode))
def __iter__(self):
return self
def __next__(self):
raise io.UnsupportedOperation('not readable')
class PipedGzipReader(Closing):
"""
Open a pipe to pigz for reading a gzipped file. Even though pigz is mostly
used to speed up writing, when it can use many compression threads, it is
also faster than gzip when reading (three times faster).
"""
def __init__(self, path, mode='r'):
"""
Raise an OSError when pigz could not be found.
"""
if mode not in ('r', 'rt', 'rb'):
raise ValueError("Mode is '{0}', but it must be 'r', 'rt' or 'rb'".format(mode))
self.process = Popen(['gzip', '-cd', path], stdout=PIPE, stderr=PIPE)
self.process = Popen(['pigz', '-cd', path], stdout=PIPE, stderr=PIPE)
self.name = path
if _PY3 and 'b' not in mode:
self._file = io.TextIOWrapper(self.process.stdout)
......@@ -192,7 +201,7 @@ class PipedGzipReader(Closing):
else:
self._stderr = self.process.stderr
self.closed = False
# Give gzip a little bit of time to report any errors (such as
# Give the subprocess a little bit of time to report any errors (such as
# a non-existing file)
time.sleep(0.01)
self._raise_if_error()
......@@ -229,13 +238,33 @@ class PipedGzipReader(Closing):
self._raise_if_error()
return data
def readinto(self, *args):
data = self._file.readinto(*args)
return data
if bz2 is not None:
class ClosingBZ2File(bz2.BZ2File, Closing):
"""
A better BZ2File that supports the context manager protocol.
This is relevant only for Python 2.6.
"""
def readline(self, *args):
data = self._file.readline(*args)
if len(args) == 0 or args[0] <= 0:
# wait for process to terminate until we check the exit code
self.process.wait()
self._raise_if_error()
return data
def seekable(self):
return self._file.seekable()
def peek(self, n=None):
return self._file.peek(n)
if _PY3:
def readable(self):
return self._file.readable()
def writable(self):
return self._file.writable()
def flush(self):
return None
def _open_stdin_or_out(mode):
......@@ -258,9 +287,6 @@ def _open_bz2(filename, mode):
else:
if mode[0] == 'a':
raise ValueError("mode '{0}' not supported with BZ2 compression".format(mode))
if sys.version_info[:2] <= (2, 6):
return ClosingBZ2File(filename, mode)
else:
return bz2.BZ2File(filename, mode)
......@@ -272,32 +298,34 @@ def _open_xz(filename, mode):
def _open_gz(filename, mode, compresslevel, threads):
if _PY3 and 'r' in mode:
return gzip.open(filename, mode)
if sys.version_info[:2] == (2, 7):
buffered_reader = io.BufferedReader
buffered_writer = io.BufferedWriter
else:
buffered_reader = lambda x: x
buffered_writer = lambda x: x
if _PY3:
exc = FileNotFoundError # was introduced in Python 3.3
else:
exc = OSError
if 'r' in mode:
try:
return PipedGzipReader(filename, mode)
except OSError:
# gzip not installed
except exc:
# pigz is not installed
return buffered_reader(gzip.open(filename, mode))
else:
try:
return PipedGzipWriter(filename, mode, compresslevel, threads=threads)
except OSError:
except exc:
return buffered_writer(gzip.open(filename, mode, compresslevel=compresslevel))
def xopen(filename, mode='r', compresslevel=6, threads=None):
"""
A replacement for the "open" function that can also open files that have
been compressed with gzip, bzip2 or xz. If the filename is '-', standard
output (mode 'w') or input (mode 'r') is returned.
A replacement for the "open" function that can also read and write
compressed files transparently. The supported compression formats are gzip,
bzip2 and xz. If the filename is '-', standard output (mode 'w') or input (mode 'r') is returned.
The file type is determined based on the filename: .gz is gzip, .bz2 is bzip2 and .xz is
xz/lzma.
......@@ -314,12 +342,14 @@ def xopen(filename, mode='r', compresslevel=6, threads=None):
In Python 2, the 't' and 'b' characters are ignored.
Append mode ('a', 'at', 'ab') is unavailable with BZ2 compression and
Append mode ('a', 'at', 'ab') is not available with BZ2 compression and
will raise an error.
compresslevel is the gzip compression level. It is not used for bz2 and xz.
threads is the number of threads for pigz. If None, then the pigz default is used.
threads is the number of threads for pigz. If left at None, then the pigz
default is used. With pigz 2.4, this is "the number of online processors,
or 8 if unknown".
"""
if mode in ('r', 'w', 'a'):
mode += 't'
......
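The `xopen` docstring above describes two behaviours: the compression format is chosen from the filename extension, and a bare mode of `'r'`, `'w'` or `'a'` is treated as text mode. A rough sketch of that dispatch, assuming Python 3 and using only the standard-library `gzip`, `bz2` and `lzma` modules (no `pigz` subprocess, no `'-'` handling), could look like this. The name `open_by_extension` is hypothetical.

```python
import bz2
import gzip
import lzma


def open_by_extension(filename, mode="r"):
    """Open a plain or compressed file based on its extension.

    Hypothetical, simplified stand-in for illustration; not xopen itself.
    """
    if mode in ("r", "w", "a"):
        # Bare modes default to text, as in the diff above
        mode += "t"
    if filename.endswith(".gz"):
        return gzip.open(filename, mode)
    if filename.endswith(".bz2"):
        return bz2.open(filename, mode)
    if filename.endswith(".xz"):
        return lzma.open(filename, mode)
    return open(filename, mode)
```

Usage follows the README examples: `with open_by_extension('file.txt.gz', 'w') as f: f.write('Hello')`.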
# coding: utf-8
# file generated by setuptools_scm
# don't change, don't track in version control
version = '0.7.3'
# coding: utf-8
from __future__ import print_function, division, absolute_import
import io
import os
import random
import sys
import signal
from contextlib import contextmanager
import pytest
from xopen import xopen, PipedGzipReader
from xopen import xopen, PipedGzipReader, PipedGzipWriter
extensions = ["", ".gz", ".bz2"]
......@@ -20,7 +21,8 @@ except ImportError:
base = "tests/file.txt"
files = [base + ext for ext in extensions]
CONTENT = 'Testing, testing ...\nThe second line.\n'
CONTENT_LINES = ['Testing, testing ...\n', 'The second line.\n']
CONTENT = ''.join(CONTENT_LINES)
# File extensions for which appending is supported
append_extensions = extensions[:]
......@@ -28,6 +30,16 @@ if sys.version_info[0] == 2:
append_extensions.remove(".bz2")
@pytest.fixture(params=extensions)
def ext(request):
return request.param
@pytest.fixture(params=files)
def fname(request):
return request.param
@contextmanager
def temporary_path(name):
directory = os.path.join(os.path.dirname(__file__), 'testtmp')
......@@ -38,61 +50,115 @@ def temporary_path(name):
os.remove(path)
@pytest.mark.parametrize("name", files)
def test_xopen_text(name):
with xopen(name, 'rt') as f:
def test_xopen_text(fname):
with xopen(fname, 'rt') as f:
lines = list(f)
assert len(lines) == 2
assert lines[1] == 'The second line.\n', name
assert lines[1] == 'The second line.\n', fname
@pytest.mark.parametrize("name", files)
def test_xopen_binary(name):
with xopen(name, 'rb') as f:
def test_xopen_binary(fname):
with xopen(fname, 'rb') as f:
lines = list(f)
assert len(lines) == 2
assert lines[1] == b'The second line.\n', name
assert lines[1] == b'The second line.\n', fname
@pytest.mark.parametrize("name", files)
def test_no_context_manager_text(name):
f = xopen(name, 'rt')
def test_no_context_manager_text(fname):
f = xopen(fname, 'rt')
lines = list(f)
assert len(lines) == 2
assert lines[1] == 'The second line.\n', name
assert lines[1] == 'The second line.\n', fname
f.close()
assert f.closed
@pytest.mark.parametrize("name", files)
def test_no_context_manager_binary(name):
f = xopen(name, 'rb')
def test_no_context_manager_binary(fname):
f = xopen(fname, 'rb')
lines = list(f)
assert len(lines) == 2
assert lines[1] == b'The second line.\n', name
assert lines[1] == b'The second line.\n', fname
f.close()
assert f.closed
@pytest.mark.parametrize("ext", extensions)
def test_readinto(fname):
# Test whether .readinto() works
content = CONTENT.encode('utf-8')
with xopen(fname, 'rb') as f:
b = bytearray(len(content) + 100)
length = f.readinto(b)
assert length == len(content)
assert b[:length] == content
def test_pipedgzipreader_readinto():
# Test whether PipedGzipReader.readinto works
content = CONTENT.encode('utf-8')
with PipedGzipReader("tests/file.txt.gz", "rb") as f:
b = bytearray(len(content) + 100)
length = f.readinto(b)
assert length == len(content)
assert b[:length] == content
if sys.version_info[0] != 2:
def test_pipedgzipreader_textiowrapper():
with PipedGzipReader("tests/file.txt.gz", "rb") as f:
wrapped = io.TextIOWrapper(f)
assert wrapped.read() == CONTENT
def test_readline(fname):
first_line = CONTENT_LINES[0].encode('utf-8')
with xopen(fname, 'rb') as f:
assert f.readline() == first_line
def test_readline_text(fname):
with xopen(fname, 'r') as f:
assert f.readline() == CONTENT_LINES[0]
def test_readline_pipedgzipreader():
first_line = CONTENT_LINES[0].encode('utf-8')
with PipedGzipReader("tests/file.txt.gz", "rb") as f:
assert f.readline() == first_line
def test_readline_text_pipedgzipreader():
with PipedGzipReader("tests/file.txt.gz", "r") as f:
assert f.readline() == CONTENT_LINES[0]
def test_xopen_has_iter_method(ext, tmpdir):
path = str(tmpdir.join("out" + ext))
with xopen(path, mode='w') as f:
assert hasattr(f, '__iter__')
def test_pipedgzipwriter_has_iter_method(tmpdir):
with PipedGzipWriter(str(tmpdir.join("out.gz"))) as f:
assert hasattr(f, '__iter__')
def test_nonexisting_file(ext):
with pytest.raises(IOError):
with xopen('this-file-does-not-exist' + ext) as f:
pass
@pytest.mark.parametrize("ext", extensions)
def test_write_to_nonexisting_dir(ext):
with pytest.raises(IOError):
with xopen('this/path/does/not/exist/file.txt' + ext, 'w') as f:
pass
@pytest.mark.parametrize("ext", append_extensions)
def test_append(ext):
@pytest.mark.parametrize("aext", append_extensions)
def test_append(aext):
text = "AB".encode("utf-8")
reference = text + text
with temporary_path('truncated.fastq' + ext) as path:
with temporary_path('truncated.fastq' + aext) as path:
try:
os.unlink(path)
except OSError:
......@@ -111,11 +177,11 @@ def test_append(ext):
assert appended == reference
@pytest.mark.parametrize("ext", append_extensions)
def test_append_text(ext):
@pytest.mark.parametrize("aext", append_extensions)
def test_append_text(aext):
text = "AB"
reference = text + text
with temporary_path('truncated.fastq' + ext) as path:
with temporary_path('truncated.fastq' + aext) as path:
try:
os.unlink(path)
except OSError:
......@@ -222,19 +288,16 @@ if sys.version_info[:2] >= (3, 4):
# pathlib was added in Python 3.4
from pathlib import Path
@pytest.mark.parametrize("file", files)
def test_read_pathlib(file):
path = Path(file)
def test_read_pathlib(fname):
path = Path(fname)
with xopen(path, mode='rt') as f:
assert f.read() == CONTENT
@pytest.mark.parametrize("file", files)
def test_read_pathlib_binary(file):
path = Path(file)
def test_read_pathlib_binary(fname):
path = Path(fname)
with xopen(path, mode='rb') as f:
assert f.read() == bytes(CONTENT, 'ascii')
@pytest.mark.parametrize("ext", extensions)
def test_write_pathlib(ext, tmpdir):
path = Path(str(tmpdir)) / ('hello.txt' + ext)
with xopen(path, mode='wt') as f:
......@@ -242,7 +305,6 @@ if sys.version_info[:2] >= (3, 4):
with xopen(path, mode='rt') as f:
assert f.read() == 'hello'
@pytest.mark.parametrize("ext", extensions)
def test_write_pathlib_binary(ext, tmpdir):
path = Path(str(tmpdir)) / ('hello.txt' + ext)
with xopen(path, mode='wb') as f:
......
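The new `test_readinto` tests above check that a reader can fill a caller-supplied buffer. The same pattern can be tried without xopen or the test data files. The sketch below is an assumption-laden stand-in: it compresses the test content in memory and reads it back through the standard-library `gzip.GzipFile` rather than through `PipedGzipReader`.

```python
import gzip
import io

# Same text as CONTENT in the test module
payload = b"Testing, testing ...\nThe second line.\n"
compressed = gzip.compress(payload)

with gzip.GzipFile(fileobj=io.BytesIO(compressed), mode="rb") as f:
    # Oversized buffer, as in the tests: readinto reports how much was filled
    buf = bytearray(len(payload) + 100)
    length = f.readinto(buf)

data = bytes(buf[:length])
```

Supporting `readinto` matters because wrappers such as `io.TextIOWrapper` and `io.BufferedReader` may call it on the underlying binary stream.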
......@@ -3,4 +3,4 @@ envlist = py27,py34,py35,py36,py37
[testenv]
deps = pytest
commands = pytest
commands = pytest --doctest-modules --pyargs src/xopen tests