Commits on Source (7)
......@@ -76,8 +76,8 @@ docs/_build/
target/
# IPython Notebook
.ipynb_checkpoints
*.ipynb
.ipynb_checkpoints/
#*.ipynb
# pyenv
.python-version
......
language: python - "3.6"
language: python - "3.7"
# Maybe everything should be packed in a `job` that states `include:` `if:` with some condition
before_install:
......@@ -13,27 +13,14 @@ before_install:
- sudo ln -s /run/shm /dev/shm
install:
- conda install --yes python="3.6" numpy scipy cython h5py typing pandas
- pip install sphinx_bootstrap_theme
# Install the dev version (1.7) of sphinx that has solved a problem with f-strings
- git clone https://github.com/sphinx-doc/sphinx.git
- cd sphinx; pip install .;cd ..
- conda install --yes python="3.7" numpy scipy cython h5py typing numba
- pip install Sphinx sphinxcontrib-websupport sphinx_bootstrap_theme numpy-groupies
script:
# Build documentation
- cd doc
- make html
- cd ..
# Additionally we could run some tests and/or the following commented code
# Possibility to auto-tag upon a change of the version name
#after_success:
# - git config --global user.email "builds@travis-ci.com"
# - git config --global user.name "Travis CI" # Not sure if here it would be my username and password
## Here I would put a python script that makes something like GIT_TAG=open("_version.py").read()
# - export GIT_TAG=build-$TRAVIS_BRANCH-$(date -u "+%Y-%m-%d-%H-%M-%S")-$TRAVIS_BUILD_NUMBER
# - git tag $GIT_TAG -a -m "Generated tag from TravisCI"
# - git push --tags
deploy:
# Automatically upload docs to github pages for master commits
......
# loompy 2
# loompy v3.0
⭐ Loompy v2.0 was released Dec. 24, 2017! ([what's new](https://github.com/linnarsson-lab/loompy/releases/tag/v2.0)?)
⭐ Loompy v3.0 was released Sep. 24, 2019!
`.loom` is an efficient file format for very large omics datasets,
consisting of a main matrix, optional additional layers, and a variable number of row and column
annotations. Loom also supports sparse graphs. We use loom files to store single-cell gene expression
data: the main matrix contains the actual expression values (one
column per cell, one row per gene); row and column annotations
contain metadata for genes and cells, such as `Name`, `Chromosome`,
`Position` (for genes), and `Strain`, `Sex`, `Age` (for cells).
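For example, a small Loom file can be created and read back with the `loompy` package (a minimal sketch; the matrix values and attribute names are illustrative):

```python
import numpy as np
import loompy

# 100 genes x 10 cells of illustrative counts
matrix = np.random.randint(0, 10, size=(100, 10))
row_attrs = {"Gene": np.array([f"Gene_{i}" for i in range(100)])}
col_attrs = {"CellID": np.array([f"Cell_{j}" for j in range(10)])}
loompy.create("example.loom", matrix, row_attrs, col_attrs)

with loompy.connect("example.loom") as ds:
    print(ds.shape)        # (100, 10)
    print(ds.ra.Gene[:5])  # first five gene names
```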
![Illustration of Loom format structure](/doc/Loom-images.png)
Loom files (`.loom`) are created in the [HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) file format, which
supports an internal collection of numerical multidimensional datasets.
HDF5 is supported by many computer languages, including Java, MATLAB,
Mathematica, Python, R, and Julia. `.loom` files are accessible from
any language that supports HDF5.
To get started, head over to [the documentation](http://linnarssonlab.org/loompy/)!
To get started, head over to [the documentation](http://loompy.org)!
Loom, loompy, and the [loom-viewer](https://github.com/linnarsson-lab/loom-viewer) are being developed by members of the [Linnarsson Lab](http://linnarssonlab.org).
python-loompy for Debian
-----------------------
The folder doc/_build was removed.
The folder doc/_build and the file assets/LoomR_logo.ai were removed for copyright issues.
-- Steffen Moeller <moeller@debian.org> Mon, 12 Aug 2019 13:14:47 +0200
python-loompy (2.0.17+dfsg-1) UNRELEASED; urgency=medium
python-loompy (3.0.6+dfsg-1) UNRELEASED; urgency=medium
* Initial release (Closes: #934597)
* TODO:
- package documentation
Resubmitted with improved copyright. Thanks go to Thorsten.
The new version (up from 2.x) has a build-dep on a Python module "numpy_groupies".
This still needs to be packaged.
-- Steffen Moeller <moeller@debian.org> Mon, 12 Aug 2019 13:14:47 +0200
......@@ -5,20 +5,24 @@ Maintainer: Debian Med Packaging Team <debian-med-packaging@lists.alioth.debian.
Uploaders: Steffen Moeller <moeller@debian.org>
Build-Depends: debhelper (>= 11), dh-python,
python3-all, python3-setuptools,
python3-h5py
python3-h5py,
python3-numba,
python3-click
Standards-Version: 4.4.0
Homepage: https://github.com/linnarsson-lab/loompy
Vcs-Browser: https://salsa.debian.org/med-team/python-loompy
Vcs-Git: https://salsa.debian.org/med-team/python-loompy.git
#Testsuite: autopkgtest-pkg-python
#install_requires=['h5py', 'numpy', 'scipy', 'setuptools', 'numba', 'click', "numpy-groupies"],
Package: python3-loompy
Architecture: all
Depends: ${python3:Depends}, ${misc:Depends},
python3-h5py,
python3-numpy,
python3-scipy,
python3-pandas
python3-pandas,
python3-click
Description: access loom formatted files for bioinformatics
Loom is an efficient file format for very large omics datasets,
consisting of a main matrix, optional additional layers, a variable
......
Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
Upstream-Name: loompy
Source: https://github.com/linnarsson-lab/loompy
Files-Excluded:
assets/LoomR_logo.ai
doc/_build
Files: *
Copyright: 2016 Linnarsson Lab
Copyright: 2016 Linnarsson Lab <sten.linnarsson@ki.se>
License: BSD-2-Clause
Files: ./loompy/cell_calling.py
Copyright: 2018 10X Genomics, Inc.
License: MIT
Files: debian/*
Copyright: 2019 Steffen Moeller <moeller@debian.org>
License: BSD-2-Clause
......@@ -31,3 +38,22 @@ License: BSD-2-Clause
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
License: MIT
Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the "Software"),
to deal in the Software without restriction, including without limitation
the rights to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following conditions:
.
The above copyright notice and this permission notice shall be included
in all copies or substantial portions of the Software.
.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
version=4
opts="dversionmangle=s/+dfsg$//,uversionmangle=s/$/+dfsg/,repack,compression=xz,filenamemangle=s%(?:.*?)?v?(\d[\d.]*)\.tar\.gz%python-loompy-$1.tar.gz%" \
opts="dversionmangle=s/\+dfsg$//,uversionmangle=s/$/+dfsg/,repack,compression=xz,filenamemangle=s%(?:.*?)?v?(\d[\d.]*)\.tar\.gz%python-loompy-$1.tar.gz%" \
https://github.com/linnarsson-lab/loompy/tags \
(?:.*?/)?v?(\d[\d.]*)\.tar\.gz debian uupdate
......@@ -6,7 +6,7 @@ Loom file format specs
Versions
--------
This specification defines the Loom file format version ``2.0.1``.
This specification defines the Loom file format version ``3.0.0``.
.. _formatinfo:
......@@ -61,7 +61,7 @@ standard for storing large numerical datasets. Quoting from h5py.org:
dictionaries, and datasets work like NumPy arrays*.
A valid Loom file is simply an HDF5 file that contains specific
*groups* representing the main matrix as well as row and column
*groups* containing the main matrix as well as row and column
attributes. Because of this, Loom files can be created and read by
any language that supports HDF5, including `Python <http://h5py.org>`__,
`R <http://bioconductor.org/packages/release/bioc/html/rhdf5.html>`__,
......@@ -74,6 +74,19 @@ any language that supports HDF5, including `Python <http://h5py.org>`__,
.. _specifications:
Standards
---------
The official MIME media type for a loom file is ``application/vnd.loom``, `approved by IANA <https://www.iana.org/assignments/media-types/application/vnd.loom>`_.
You should use this media type when requesting a file using HTTP, or if you create a server that offers Loom files for download.
See the `IANA record for application/vnd.loom <https://www.iana.org/assignments/media-types/application/vnd.loom>`_ for important security considerations.
The Loom file format is designated as a standard format for gene expression matrices by the `Global Alliance for Genomics and Health (GA4GH) <https://www.ga4gh.org>`_ standards body, as part of the `rnaget API <https://github.com/ga4gh-rnaseq/schema/blob/master/rnaget.md>`_.
The rnaget API documentation contains good examples of how to use the media type specification.
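As a minimal client-side sketch (the URL is hypothetical), a download request that checks the media type might look like this:

.. code:: python

  from urllib.request import Request, urlopen

  # Hypothetical URL; a conforming server should respond with
  # Content-Type: application/vnd.loom
  req = Request("https://example.org/datasets/sample.loom",
                headers={"Accept": "application/vnd.loom"})
  with urlopen(req) as resp:
      content_type = resp.headers.get("Content-Type")
      if content_type != "application/vnd.loom":
          print(f"Unexpected media type: {content_type}")
      with open("sample.loom", "wb") as out:
          out.write(resp.read())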
Specification
-------------
......@@ -91,15 +104,20 @@ Main matrix and layers
Global attributes
^^^^^^^^^^^^^^^^^
- There can OPTIONALLY be at least one `HDF5
attribute <https://www.hdfgroup.org/HDF5/Tutor/crtatt.html>`__ on the
root ``/`` group, which can be any valid scalar or multidimensional datatype and should be
interpreted as attributes of the whole Loom file.
- There can OPTIONALLY be an `HDF5
attribute <https://www.hdfgroup.org/HDF5/Tutor/crtatt.html>`__ on the
root ``/`` group named ``LOOM_SPEC_VERSION``, a string value giving the
loom file spec version that was followed in creating the file. See top of this
document for the current version of the spec.
- There MUST be an HDF5 group ``/attrs`` containing global attributes.
- There MUST be an HDF5 dataset ``/attrs/LOOM_SPEC_VERSION`` with the value ``v3.0.0``.
Global attributes apply semantically to the whole file, not any specific part of it.
Such attributes are stored in the HDF5 group ``/attrs`` and can be any valid scalar
or multidimensional datatype.
As of Loom file format v3.0.0, only one global attribute is mandatory: the ``LOOM_SPEC_VERSION``
attribute, which is a string value giving the loom file spec version that was followed in creating
the file. See the top of this document for the current version of the spec.
Note: previous versions of the loom file format stored global attributes as `HDF5 attributes <https://www.hdfgroup.org/HDF5/Tutor/crtatt.html>`__
on the root ``/`` group. However, such attributes are size-limited, which caused problems for some
applications.
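As an illustration, a minimal sketch using ``h5py`` directly (the ``Species`` attribute and the version string ``3.0.0`` are illustrative) of how a v3.0.0 writer might store global attributes:

.. code:: python

  import h5py

  with h5py.File("example.loom", "a") as f:
      attrs = f.require_group("attrs")
      vlen_str = h5py.special_dtype(vlen=str)  # variable-length UTF-8 strings (see Datatypes below)
      attrs.create_dataset("LOOM_SPEC_VERSION", data="3.0.0", dtype=vlen_str)
      # Any other global attribute, applying to the file as a whole
      attrs.create_dataset("Species", data="Homo sapiens", dtype=vlen_str)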
Row and column attributes
......@@ -144,6 +162,8 @@ Row and column sparse graphs
format. The lengths of the three datasets MUST be equal, which defines the number
of edges in the graph. Note that the number of rows in the dataset defines
the vertices, so an unconnected vertex is one that has no entry in ``a`` or ``b``.
- Vertex indexing is zero-based. When an entry in ``a`` or ``b`` is zero, this denotes the first column
in the matrix. If there are N columns, then vertices are numbered from 0 to N - 1.
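For illustration, a minimal sketch (assuming the ``a``/``b``/``w`` coordinate-list layout described above, stored under a hypothetical column graph named ``KNN``) that writes a graph using ``h5py`` and ``scipy.sparse``:

.. code:: python

  import h5py
  import scipy.sparse as sparse

  # A tiny graph over 5 columns (cells); vertex indices are zero-based
  g = sparse.coo_matrix(([1.0, 0.5, 0.25], ([0, 1, 4], [1, 2, 0])), shape=(5, 5))

  with h5py.File("example.loom", "a") as f:
      grp = f.require_group("col_graphs/KNN")
      grp["a"] = g.row.astype("int64")     # source vertices (0 denotes the first column)
      grp["b"] = g.col.astype("int64")     # target vertices
      grp["w"] = g.data.astype("float64")  # edge weights; all three datasets have equal length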
Datatypes
---------
......@@ -154,19 +174,12 @@ Row and column attributes are multidimensional arrays whose first dimension matc
Global attributes are scalars or multidimensional arrays of any shape, whose elements are any of the numeric datatypes ``int8``, ``int16``, ``int32``, ``int64``, ``uint8``, ``uint16``, ``uint32``, ``uint64``, ``float16``, ``float32`` and ``float64`` or fixed-length ASCII strings.
All strings in Loom files are stored as fixed-length null-padded 7-bit ASCII. ``h5dump`` should report something like this:
.. code::
DATATYPE H5T_STRING {
STRSIZE 24;
STRPAD H5T_STR_NULLPAD;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
Starting with v3.0.0 of the spec, all strings are stored as variable-length UTF-8 encoded data.
Note: in previous versions, strings were stored as fixed-length null-padded 7-bit ASCII. Unicode characters outside 7-bit ASCII were stored using
`XML entity encoding <https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references>`_, to ensure maximum compatibility. Strings
were decoded when read and encoded when written.
Unicode characters outside 7-bit ASCII are stored using `XML entity encoding <https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references>`_, to ensure maximum compatibility. Strings SHOULD be decoded when read and encoded when written. A compatible implementation may choose to encode/decode or not, but MUST decode on reading if it encodes on writing.
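A minimal sketch of writing a v3.0.0-style variable-length UTF-8 string attribute with ``h5py`` (the attribute name is illustrative):

.. code:: python

  import h5py
  import numpy as np

  names = np.array(["TP53", "MYC", "α-tubulin"], dtype=object)  # arbitrary Unicode is allowed
  with h5py.File("example.loom", "a") as f:
      # Variable-length UTF-8, rather than fixed-length null-padded ASCII
      f.create_dataset("row_attrs/Gene", data=names,
                       dtype=h5py.special_dtype(vlen=str))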
.. _loomexample:
......@@ -178,6 +191,10 @@ Here's an example of the structure of a valid Loom file:
+----------------------+-------------------------------+---------------------------------------------+
| Group | Type | Description |
+======================+===============================+=============================================+
| /attrs/ | (subgroup) | Global attributes |
+----------------------+-------------------------------+---------------------------------------------+
| /attrs/Species | string | Global attribute "Species" of type string |
+----------------------+-------------------------------+---------------------------------------------+
| /matrix | float32[N,M] or uint16[N,M] | Main matrix of N rows and M columns |
+----------------------+-------------------------------+---------------------------------------------+
| /layers/ | (subgroup) | Subgroup of additional matrix layers |
......@@ -205,3 +222,13 @@ Here's an example of the structure of a valid Loom file:
Backwards compatibility
^^^^^^^^^^^^^^^^^^^^^^^
Loom v3.0.0 introduces two major backwards-incompatible changes (global attributes and variable-length strings; see above).
A compliant Loom implementation MUST check the LOOM_SPEC_VERSION and treat files consistently with their spec. For example,
when writing a global attribute, the writer MUST write only to the ``/attrs`` group if ``LOOM_SPEC_VERSION`` is
``3.0.0`` or higher. The writer MUST write the HDF5 attributes on the root ``/``
group if ``LOOM_SPEC_VERSION`` is lower than ``3.0.0`` or if it does not exist. This is to preserve a consistent
format for legacy files.
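A minimal, non-normative sketch of such a version check using ``h5py`` (loompy itself uses its own helper functions for this):

.. code:: python

  import h5py

  def loom_spec_version(f: h5py.File) -> str:
      """Return the declared spec version, or '0.0.0' for legacy files without one."""
      if "attrs" in f and "LOOM_SPEC_VERSION" in f["attrs"]:   # v3.0.0 and later
          v = f["attrs"]["LOOM_SPEC_VERSION"][()]
      elif "LOOM_SPEC_VERSION" in f.attrs:                     # legacy: HDF5 attribute on the root group
          v = f.attrs["LOOM_SPEC_VERSION"]
      else:
          return "0.0.0"
      return v.decode() if isinstance(v, bytes) else str(v)

  def write_global_attr(f: h5py.File, name: str, value: str) -> None:
      """Write a global attribute where the file's declared spec version says it belongs."""
      version = loom_spec_version(f).lstrip("v")
      is_v3 = tuple(int(x) for x in version.split(".")[:3]) >= (3, 0, 0)
      if is_v3:
          grp = f.require_group("attrs")
          if name in grp:
              del grp[name]
          grp.create_dataset(name, data=value, dtype=h5py.special_dtype(vlen=str))
      else:
          f.attrs[name] = value  # keep legacy files in the legacy format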
create_from_fastq (function)
============================
Use `kallisto <https://pachterlab.github.io/kallisto/>`_ to build a loom file from a set of fastq files.
.. autofunction:: loompy.create_from_fastq
......@@ -10,6 +10,7 @@ This is a complete API reference to loompy |version| generated using the docstri
.. toctree::
create
new
create_from_fastq
create_from_cellranger
connect
combine
......
......@@ -49,6 +49,7 @@ including `Python <http://h5py.org>`__,
self
installation/index
semantics/index
kallisto/index
apiwalkthrough/index
cookbook/index
conventions/index
......
.. _kallisto:
Create from fastq
=================
This section introduces the ``loompy`` command-line tool, and shows how it is used to create .loom files directly from fastq files.
The ``loompy`` command-line tool
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
After installing loompy 3, a new command-line tool ``loompy`` will be available. To verify, launch a Terminal window and type ``loompy``,
and you should see the following output:
.. code::
Usage: loompy [OPTIONS] COMMAND [ARGS]...
Options:
--show-message / --hide-message
--verbosity [error|warning|info|debug]
--help Show this message and exit.
Commands:
fromfq
Using the ``loompy fromfq`` command
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1. Install `kallisto <https://pachterlab.github.io/kallisto/>`_
---------------------------------------------------------------
The excellent kallisto tool performs ultra-fast pseudoalignment, which loompy uses to count reads (UMIs) on genes.
2. Download an index (or build your own)
----------------------------------------
We provide a pre-built index of the `human genome <https://storage.googleapis.com/linnarsson-lab-www-blobs/human_GRCh38_gencode.v31.tar.gz>`_.
.. warning::
This index is really only suitable for use with 10x Chromium data (it makes strong assumptions about the read distribution).
Unzip the index to a directory, which will have the following content:
.. code::
10xv1_whitelist.txt
10xv2_whitelist.txt
10xv3_whitelist.txt
fragments2genes.txt
gencode.v31.fragments.idx
gencode.v31.metadata.tab
manifest.json
spliced_fragments.txt
unspliced_fragments.txt
To build your own index, see the `build instructions <https://github.com/linnarsson-lab/loompy/blob/master/notebooks/build_index.ipynb>`_.
3. Download metadata and fastq files
-------------------------------------
You need to provide the input fastq files, and metadata for your samples. For this tutorial, please download the 1k PBMC v3 reference dataset
from 10x Genomics: `pbmc_1k_v3_fastqs.tar <http://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v3/pbmc_1k_v3_fastqs.tar>`_.
Metadata can be provided either in the form of a tab-delimited file with a single header row, or as a sqlite3 database with a table named ``sample``.
One metadata column must be named ``name`` and contain sample names (sample IDs). One column must be named ``technology`` and contain the technology name (currently one of ``10xv1``,
``10xv2`` or ``10xv3``). The technology name will be used to locate the barcode whitelist file (e.g. ``10xv3_whitelist.txt``). Finally, one
column must be named ``targetnumcells`` and contain the target (or expected) number of cells in the sample. If absent, the default will be 5000.
The metadata file (or database) will typically contain metadata for all the samples you are working with, one row per sample. You can add any number of
additional columns of metadata, which will all be copied to the newly generated loom file as global attributes.
For this tutorial, create a tab-delimited file named ``metadata.tab`` with the following content:
.. code::
name technology targetnumcells
1kPBMC 10xv3 1000
(If you copy-paste the above, make sure the fields are tab-delimited.)
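Alternatively, a short sketch that writes the same file programmatically, so the tab delimiters are guaranteed:

.. code:: python

  import csv

  with open("metadata.tab", "w", newline="") as f:
      writer = csv.writer(f, delimiter="\t")
      writer.writerow(["name", "technology", "targetnumcells"])
      writer.writerow(["1kPBMC", "10xv3", "1000"])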
4. Run the ``loompy fromfq`` command
------------------------------------
Run the following command (replace ``human_GRCh38_gencode.v31`` with the path to the human genome index you just downloaded):
.. code::
loompy fromfq 1kPBMC.loom 1kPBMC human_GRCh38_gencode.v31 metadata.tab pbmc_1k_v3_S1_L001_R1_001.fastq.gz pbmc_1k_v3_S1_L001_R2_001.fastq.gz pbmc_1k_v3_S1_L002_R1_001.fastq.gz pbmc_1k_v3_S1_L002_R2_001.fastq.gz
Note that the fastq files are listed in pairs of R1 (read 1) and R2 (read 2) files. The index files (I1) are not used.
After about half an hour, you will have a ``1kPBMC.loom`` file with separate ``spliced`` and ``unspliced`` layers (the main matrix will be
the sum of the two), and rich metadata for genes, cells and the sample itself stored as attributes. The output should look something like this:
.. code::
2019-09-29 16:28:08,186 - INFO - kallisto bus -i human_GRCh38_gencode.v31/gencode.v31.fragments.idx -o /tmp/tmp7yk3rf07 -x 10xv3 -t 56 pbmc_1k_v3_S1_L001_R1_001.fastq.gz pbmc_1k_v3_S1_L001_R2_001.fastq.gz pbmc_1k_v3_S1_L002_R1_001.fastq.gz pbmc_1k_v3_S1_L002_R2_001.fastq.gz
2019-09-29 16:28:08,307 - INFO - [index] k-mer length: 31
2019-09-29 16:28:08,307 - INFO - [index] number of targets: 845,338
2019-09-29 16:28:08,307 - INFO - [index] number of k-mers: 178,605,364
2019-09-29 16:28:29,537 - INFO - [index] number of equivalence classes: 4,191,221
2019-09-29 16:28:40,951 - INFO - [quant] will process sample 1: pbmc_1k_v3_S1_L001_R1_001.fastq.gz
2019-09-29 16:28:40,951 - INFO - pbmc_1k_v3_S1_L001_R2_001.fastq.gz
2019-09-29 16:28:40,951 - INFO - [quant] will process sample 2: pbmc_1k_v3_S1_L002_R1_001.fastq.gz
2019-09-29 16:28:40,951 - INFO - pbmc_1k_v3_S1_L002_R2_001.fastq.gz
2019-09-29 16:31:44,144 - INFO - [quant] finding pseudoalignments for the reads ... done
2019-09-29 16:31:44,145 - INFO - [quant] processed 66,601,887 reads, 46,119,840 reads pseudoaligned
2019-09-29 16:31:52,543 - INFO - Loading gene metadata
2019-09-29 16:31:52,818 - INFO - Loading fragments-to-gene mappings
2019-09-29 16:31:53,426 - INFO - Indexing genes
2019-09-29 16:31:53,846 - INFO - Loading equivalence classes
2019-09-29 16:32:22,273 - INFO - Mapping equivalence classes to genes
2019-09-29 16:32:32,817 - INFO - Loading fragment IDs
2019-09-29 16:32:33,280 - INFO - Loading BUS records
2019-09-29 16:33:46,692 - INFO - Sorting cell IDs
2019-09-29 16:33:49,611 - INFO - Found 46,119,840 records for 60,662 genes and 551,892 uncorrected cell barcodes.
2019-09-29 16:33:49,611 - INFO - Correcting cell barcodes
2019-09-29 16:35:58,753 - INFO - Found 307,677 corrected cell barcodes.
2019-09-29 16:35:58,754 - INFO - Removing redundant reads using UMIs
2019-09-29 16:36:45,546 - INFO - 71% sequencing saturation.
2019-09-29 16:36:45,546 - INFO - Counting pseudoalignments for main matrix
2019-09-29 16:36:52,752 - INFO - Found 5,027,188 UMIs.
2019-09-29 16:36:53,536 - INFO - Counting pseudoalignments for layer 'unspliced'
2019-09-29 16:38:00,099 - INFO - Found 2,376,590 UMIs.
2019-09-29 16:38:00,706 - INFO - Counting pseudoalignments for layer 'spliced'
2019-09-29 16:39:09,718 - INFO - Found 3,231,999 UMIs.
2019-09-29 16:39:09,718 - INFO - Calling cells
2019-09-29 16:42:32,387 - INFO - Found 1189 valid cells and ~77 ambient UMIs.
2019-09-29 16:42:32,388 - INFO - Creating loom file '1kPBMC.loom'
2019-09-29 16:42:32,388 - INFO - Saving
As you can see, 46,119,840 of 66,601,887 reads pseudoaligned (~70%), which is typical. The sequencing saturation was 71%, and the cell
calling algorithm found 1189 valid cells (similar to the 1,222 cells reported by cellranger). Empty beads carried a median of 77
UMIs, presumably from cell-free ambient RNA.
Connect to the loom file and examine its global attributes:
.. code::
import loompy
with loompy.connect("1kPBMC.loom") as ds:
print(ds.attrs.keys())
['AmbientPValue', 'AmbientUMIs', 'BarcodeTotalUMIs', 'CellBarcodes', 'CreationDate', 'KallistoCommand', 'KallistoVersion', 'LOOM_SPEC_VERSION', 'NumPseudoaligned', 'NumReadsProcessed', 'RedundantReadFraction', 'SampleID', 'Saturation', 'Species', 'name', 'targetnumcells', 'technology']
...column attributes...
.. code::
import loompy
with loompy.connect("1kPBMC.loom") as ds:
print(ds.ca.keys())
['CellID', 'TotalUMIs']
...row attributes (see the `index build instructions <https://github.com/linnarsson-lab/loompy/blob/master/notebooks/build_index.ipynb>`_ for an explanation of these)...
.. code::
import loompy
with loompy.connect("1kPBMC.loom") as ds:
print(ds.ra.keys())
['Accession', 'AccessionVersion', 'Aliases', 'CcdsID', 'Chromosome', 'ChromosomeEnd', 'ChromosomeStart', 'CosmicID', 'DnaBindingDomain', 'FullName', 'Gene', 'GeneType', 'HgncID', 'IsTF', 'Location', 'LocationSortable', 'LocusGroup', 'LocusType', 'MgdID', 'MirBaseID', 'OmimID', 'PubmedID', 'RefseqID', 'RgdID', 'UcscID', 'UniprotID', 'VegaID']
...and layers:
.. code::
import loompy
with loompy.connect("1kPBMC.loom") as ds:
print(ds.layers.keys())
['', 'spliced', 'unspliced']
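To read counts out of one of the layers (a short usage sketch):

.. code:: python

  import loompy

  with loompy.connect("1kPBMC.loom") as ds:
      spliced = ds.layers["spliced"][:, :]  # dense numpy array of spliced counts
      print(spliced.shape)                  # (n_genes, n_cells)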
from .utils import *
from .utils import get_loom_spec_version, compare_loom_spec_version, timestamp, deprecated
from .normalize import normalize_attr_values, materialize_attr_values
from .attribute_manager import AttributeManager
from .file_attribute_manager import FileAttributeManager
from .global_attribute_manager import GlobalAttributeManager
from .graph_manager import GraphManager
from .layer_manager import LayerManager
from .loom_view import LoomView
from .loom_layer import MemoryLoomLayer, LoomLayer
from .to_html import to_html
from .view_manager import ViewManager
from .loompy import connect, create, create_append, combine, create_from_cellranger, LoomConnection, new
from .loompy import connect, create, create_append, combine, create_from_cellranger, LoomConnection, new, combine_faster, create_from_matrix_market
from .loom_validator import LoomValidator
from ._version import __version__, loom_spec_version
from .bus_file import create_from_fastq
from .cell_calling import call_cells
__version__ = '2.0.17'
loom_spec_version = '2.0.1'
__version__ = '3.0.6'
loom_spec_version = '3.0.0'
from typing import *
import numpy as np
import h5py
import loompy
from loompy import timestamp
from .utils import compare_loom_spec_version
class AttributeManager:
......@@ -144,13 +146,17 @@ class AttributeManager:
raise KeyError("Attribute name cannot contain slash (/)")
else:
if self.ds is not None:
values = loompy.normalize_attr_values(val)
values = loompy.normalize_attr_values(val, compare_loom_spec_version(self.ds._file, "3.0.0") >= 0)
a = ["/row_attrs/", "/col_attrs/"][self.axis]
if self.ds.shape[self.axis] != 0 and values.shape[0] != self.ds.shape[self.axis]:
raise ValueError(f"Attribute '{name}' must have exactly {self.ds.shape[self.axis]} values but {len(values)} were given")
if self.ds._file[a].__contains__(name):
del self.ds._file[a + name]
self.ds._file[a + name] = values # TODO: for 2D arrays, use block compression along columns/rows
if values.dtype == np.object_:
self.ds._file.create_dataset(a + name, data=values, dtype=h5py.special_dtype(vlen=str))
else:
self.ds._file[a + name] = values
self.ds._file[a + name].attrs["last_modified"] = timestamp()
self.ds._file[a].attrs["last_modified"] = timestamp()
self.ds._file.attrs["last_modified"] = timestamp()
......
import array
import gzip
import json
import logging
import os
import sqlite3 as sqlite
import subprocess
from math import lgamma
from tempfile import TemporaryDirectory
from typing import Dict, Generator, List, Optional, Tuple
import numpy as np
import scipy.sparse as sparse
from numba import jit
from .cell_calling import call_cells
from .loompy import connect, create
# Copied from cytograph
@jit("float32(float64[:], float64[:])", nopython=True, parallel=True, nogil=True)
def multinomial_distance(p: np.ndarray, q: np.ndarray) -> float:
N = p.shape[0]
p_sum = p.sum()
q_sum = q.sum()
x = lgamma(N) + lgamma(p_sum + q_sum + N) - lgamma(p_sum + N) - lgamma(q_sum + N)
for k in range(N):
x += lgamma(p[k] + 1) + lgamma(q[k] + 1) - lgamma(1) - lgamma(p[k] + q[k] + 1)
x = np.exp(x)
return 1 - 1 / (1 + x)
# https://maciejkula.github.io/2015/02/22/incremental-construction-of-sparse-matrices/
class IncrementalSparseMatrixUInt16:
def __init__(self, shape: Tuple[int, int]):
self.dtype = np.uint16
self.shape = shape
self.rows = array.array('i')
self.cols = array.array('i')
self.data = array.array('H')
def append(self, i: int, j: int, v: int) -> None:
m, n = self.shape
if (i >= m or j >= n):
raise Exception('Index out of bounds')
self.rows.append(i)
self.cols.append(j)
self.data.append(v)
def tocoo(self) -> sparse.coo_matrix:
rows = np.frombuffer(self.rows, dtype=np.int32)
cols = np.frombuffer(self.cols, dtype=np.int32)
data = np.frombuffer(self.data, dtype=np.uint16)
return sparse.coo_matrix((data, (rows, cols)), shape=self.shape)
def __len__(self) -> int:
return len(self.data)
twobit_to_dna_table = {0: "A", 1: "C", 2: "G", 3: "T"}
dna_to_twobit_table = {"A": 0, "C": 1, "G": 2, "T": 3}
@jit
def twobit_to_dna(twobit: int, size: int) -> str:
result = []
for i in range(size):
x = (twobit & (3 << 2 * i)) >> 2 * i
if x == 0:
result.append("A")
elif x == 1:
result.append("C")
elif x == 2:
result.append("G")
elif x == 3:
result.append("T")
result.reverse()
return "".join(result)
@jit
def dna_to_twobit(dna: str) -> int:
x = 0
for nt in dna:
if nt == "A":
x += 0
elif nt == "C":
x += 1
elif nt == "G":
x += 2
elif nt == "T":
x += 3
x <<= 2
x >>= 2
return x
@jit
def twobit_1hamming(twobit: int, size: int) -> List[int]:
result = []
for i in range(size):
x = (twobit >> 2 * (size - i - 1)) & 3
for j in range(4):
if x == j:
continue
result.append(twobit & ~(3 << 2 * (size - i - 1)) | (j << 2 * (size - i - 1)))
return result
def ixs_thatsort_a2b(a: np.ndarray, b: np.ndarray, check_content: bool = True) -> np.ndarray:
"This is super duper magic sauce to make the order of one list to be like another"
if check_content:
assert len(np.intersect1d(a, b)) == len(a), f"The two arrays are not matching"
return np.argsort(a)[np.argsort(np.argsort(b))]
def load_sample_metadata(path: str, sample_id: str) -> Dict[str, str]:
if not os.path.exists(path):
raise ValueError(f"Samples metadata file '{path}' not found.")
if path.endswith(".db"):
# sqlite3
with sqlite.connect(path) as db:
cursor = db.cursor()
cursor.execute("SELECT * FROM sample WHERE name = ?", (sample_id,))
keys = [x[0] for x in cursor.description]
vals = cursor.fetchone()
if vals is not None:
return dict(zip(keys, vals))
raise ValueError(f"SampleID '{sample_id}' was not found in the samples database.")
else:
result = {}
with open(path) as f:
headers = [x.lower() for x in f.readline()[:-1].split("\t")]
if "sampleid" not in headers and 'name' not in headers:
raise ValueError("Required column 'SampleID' or 'Name' not found in sample metadata file")
if "sampleid" in headers:
sample_metadata_key_idx = headers.index("sampleid")
else:
sample_metadata_key_idx = headers.index("name")
sample_found = False
for line in f:
items = line[:-1].split("\t")
if len(items) > sample_metadata_key_idx and items[sample_metadata_key_idx] == sample_id:
for i, item in enumerate(items):
result[headers[i]] = item
sample_found = True
if not sample_found:
raise ValueError(f"SampleID '{sample_id}' not found in sample metadata file")
return result
class BusFile:
def __init__(self, path: str, genes_metadata_file: str, genes_metadata_key: str, fragments2genes_file: str, equivalence_classes_file: str, fragments_file: str) -> None:
self.matrix: sparse.coo_matrix = None
logging.info("Loading gene metadata")
self.genes: Dict[str, List[str]] = {} # Keys are Accessions, values are lists of attribute values
self.gene_metadata_attributes: List[str] = [] # Attribute names
with open(genes_metadata_file) as f:
line = f.readline()
self.gene_metadata_attributes = line[:-1].split("\t")
if genes_metadata_key not in self.gene_metadata_attributes:
raise ValueError(f"Metadata key '{genes_metadata_key}' not found in gene metadata file")
key_col = self.gene_metadata_attributes.index(genes_metadata_key)
for line in f:
items = line[:-1].split("\t")
self.genes[items[key_col]] = items
self.accessions = np.array([x for x in self.genes.keys()])
accession_idx = {acc: i for (i, acc) in enumerate(self.accessions)}
self.n_genes = len(self.accessions)
logging.info("Loading fragments-to-gene mappings")
self.gene_for_fragment: List[str] = []
with open(fragments2genes_file) as f:
for line in f:
transcript_id, accession = line[:-1].split("\t")
self.gene_for_fragment.append(accession)
self.gene_for_fragment = np.array(self.gene_for_fragment)
logging.info("Indexing genes")
# Array of indices into self.accessions for each gene in gene_for_transcript
self.gene_for_fragment_idx = np.zeros(len(self.gene_for_fragment), dtype="int32")
for i in range(len(self.gene_for_fragment)):
self.gene_for_fragment_idx[i] = accession_idx[self.gene_for_fragment[i]]
logging.info("Loading equivalence classes")
self.equivalence_classes: Dict[int, List[int]] = {}
with open(equivalence_classes_file) as f:
for line in f:
ec, trs = line[:-1].split("\t")
# Each equivalence class is a set of fragments (transcripts)
self.equivalence_classes[int(ec)] = np.array([int(x) for x in trs.split(",")])
# But we want each equivalence class mapped to a gene (or -1 if multimapping)
logging.info("Mapping equivalence classes to genes")
self.gene_for_ec: Dict[int, int] = {}
for eqc in self.equivalence_classes.keys():
gene = -1
for tid in self.equivalence_classes[eqc]:
if gene == -1:
gene = self.gene_for_fragment_idx[tid]
continue
elif self.gene_for_fragment_idx[tid] != gene:
gene = -1 # Multimapping UMI
break
self.gene_for_ec[eqc] = gene
logging.info("Loading fragment IDs")
self.fragments: List[str] = []
with open(fragments_file) as f:
for line in f:
self.fragments.append(line[:-1])
self.fragments = np.array(self.fragments)
logging.info("Loading BUS records")
fsize = os.path.getsize(path)
with open(path, "rb") as fb:
# Read the header
magic = fb.read(4)
if magic != b"BUS\0":
raise IOError("Not a valid BUS file (four leading magic bytes are missing)")
self.version = int.from_bytes(fb.read(4), byteorder="little", signed=False)
self.barcode_length = int.from_bytes(fb.read(4), byteorder="little", signed=False)
self.umi_length = int.from_bytes(fb.read(4), byteorder="little", signed=False)
tlen = int.from_bytes(fb.read(4), byteorder="little", signed=False)
self.header = fb.read(tlen).decode("utf-8") # BUS does not specify an encoding, but let's assume UTF8
# Read the records
self.n_records = (fsize - tlen - 20) // 32
self.bus = np.fromfile(fb, dtype=[
("barcode", np.uint64),
("UMI", np.uint64),
("equivalence_class", np.int32),
("count", np.uint32),
("flags", np.uint32),
("padding", np.int32)
], count=self.n_records)
self.bus_gene = np.array([self.gene_for_ec[x] for x in self.bus["equivalence_class"]], dtype=np.int32)
self.bus_valid = self.bus_gene != -1
logging.info("Sorting cell IDs")
self.cell_ids = np.unique(self.bus["barcode"])
self.cell_ids.sort()
self.n_cells = len(self.cell_ids) # This will change after error correction
self.layers: Dict[str, sparse.coo_matrix] = {} # Dict of layer name -> sparse matrix
self.ambient_umis = 0
def correct(self, whitelist_file: str = None) -> np.ndarray:
if whitelist_file is not None:
size = 0
whitelist = set()
with open(whitelist_file) as f:
for bc in f:
size = len(bc) - 1 # Don't count the newline
whitelist.add(dna_to_twobit(bc[:-1]))
for i, bc in enumerate(self.bus["barcode"]):
if bc in whitelist:
continue
corrected = False
for mut in twobit_1hamming(int(bc), size=size):
if mut in whitelist:
self.bus["barcode"][i] = mut
corrected = True
break
if not corrected:
self.bus_valid[i] = False
self.cell_ids = np.unique(self.bus["barcode"][self.bus_valid])
self.cell_ids.sort()
self.cell_for_barcode_idx = {bc: i for (i, bc) in enumerate(self.cell_ids)}
self.n_cells = len(self.cell_ids)
def deduplicate(self) -> None:
# Sort by barcode, then by UMI, then by gene
ordering = np.lexsort((self.bus_gene, self.bus["UMI"], self.bus["barcode"]))
self.bus = self.bus[ordering]
self.bus_gene = self.bus_gene[ordering]
self.bus_valid = self.bus_valid[ordering]
dupes = (self.bus["barcode"][1:] == self.bus["barcode"][:-1]) & (self.bus["UMI"][1:] == self.bus["UMI"][:-1]) & (self.bus_gene[1:] == self.bus_gene[:-1])
self.bus_valid[1:][dupes] = False
def count(self) -> sparse.coo_matrix:
logging.info("Counting pseudoalignments for main matrix")
genes = self.bus_gene[self.bus_valid]
cells = [self.cell_for_barcode_idx[x] for x in self.bus["barcode"][self.bus_valid]]
self.matrix = sparse.coo_matrix((np.ones_like(genes), (genes, cells)), shape=(self.n_genes, self.n_cells), dtype=np.uint16)
self.total_umis = np.array(self.matrix.sum(axis=0))[0]
return self.matrix
def remove_empty_beads(self, expected_n_cells: int) -> None:
logging.info("Calling cells")
self.ambient_umis, self.ambient_pvalue = call_cells(self.matrix.tocsc(), expected_n_cells)
self.valid_cells = (self.ambient_pvalue < 0.01) | (self.total_umis > 1500)
self.matrix = self.matrix.tocsr()[:, self.valid_cells]
self.cell_ids = self.cell_ids[self.valid_cells]
self.n_cells = self.valid_cells.sum()
for name in self.layers.keys():
self.layers[name] = self.layers[name].tocsc()[:, self.valid_cells]
logging.info(f"Found {self.n_cells} valid cells and ~{int(self.ambient_umis)} ambient UMIs.")
def count_layer(self, layer_name: str, layer_fragments_file: str) -> sparse.coo_matrix:
fragments_idx = {f: i for i, f in enumerate(self.fragments)}
with open(layer_fragments_file) as f:
include_fragments = set([fragments_idx[x[:-1]] for x in f.readlines()])
logging.info(f"Counting pseudoalignments for layer '{layer_name}'")
# Figure out which of the equivalence classes are relevant
include_ec = {}
for ec, tids in self.equivalence_classes.items():
if any([tid in include_fragments for tid in tids]):
include_ec[ec] = True
else:
include_ec[ec] = False
m = IncrementalSparseMatrixUInt16((self.n_genes, self.n_cells))
for ix in range(self.n_records):
if not self.bus_valid[ix]:
continue
gene = self.bus_gene[ix]
if include_ec[self.bus["equivalence_class"][ix]]:
m.append(gene, self.cell_for_barcode_idx[self.bus["barcode"][ix]], 1)
self.layers[layer_name] = m.tocoo()
return self.layers[layer_name]
def save(self, out_file: str, sample_id: str, samples_metadata_file: str) -> None:
logging.info("Saving")
row_attrs = {}
# Transpose the gene metadata
for i, attr in enumerate(self.gene_metadata_attributes):
row_attrs[attr] = [v[i] for v in self.genes.values()]
# Create cell attributes
col_attrs = {
"CellID": [sample_id + "_" + twobit_to_dna(int(cid), 16) for cid in self.cell_ids],
"TotalUMIs": self.total_umis[self.valid_cells]
}
# Load sample metadata
metadata = load_sample_metadata(samples_metadata_file, sample_id)
global_attrs = {
**{
"SampleID": sample_id,
"AmbientUMIs": self.ambient_umis,
"RedundantReadFraction": 1 - self.bus_valid.sum() / self.n_records,
"AmbientPValue": self.ambient_pvalue,
"BarcodeTotalUMIs": self.total_umis,
"CellBarcodes": self.valid_cells
}, **metadata}
layers = self.layers.copy()
layers[""] = self.matrix
create(out_file, layers, row_attrs, col_attrs, file_attrs=global_attrs)
def execute(cmd: List[str], synchronous: bool = False) -> Generator:
if synchronous:
yield os.popen(" ".join(cmd)).read()
else:
popen = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, universal_newlines=True) # type: ignore
for stdout_line in iter(popen.stdout.readline, ""):
yield stdout_line
popen.stdout.close()
return_code = popen.wait()
if return_code:
raise subprocess.CalledProcessError(return_code, cmd)
def create_from_fastq(out_file: str, sample_id: str, fastqs: List[str], index_path: str, samples_metadata_file: str, n_threads: int = 1, temp_folder: str = None, synchronous: bool = False) -> None:
"""
Args:
technology String like "10xv2" or None to read the technology from the sample metadata file
expected_n_cells Expected number of cells captured in the sample, or None to read the number from the sample metadata file
samples_metadata_file Path to tab-delimited file with one header row OR path to sqlite database with one table called "sample"
Remarks:
Samples metadata table should contain these columns:
name Sample name (i.e. sample id)
chemistry 10x chemistry version (v1, v2 or v3)
targetnumcells Number of cells expected in the sample
"""
manifest_file = os.path.join(index_path, "manifest.json")
if not os.path.exists(manifest_file):
raise ValueError(f"Manifest file 'manifest.json' was missing from index at '{index_path}'")
for fastq in fastqs:
if not os.path.exists(fastq):
raise ValueError(f"Fastq file '{fastq}' was not found")
if not os.path.exists(samples_metadata_file):
raise ValueError("Samples metadata file not found")
with open(manifest_file) as f:
manifest = json.load(f)
metadata = load_sample_metadata(samples_metadata_file, sample_id)
if (("technology" not in metadata) and ("chemistry" not in metadata)) or ("targetnumcells" not in metadata):
print(metadata.keys())
raise ValueError("Samples metadata must contain columns 'targetnumcells' and either 'chemistry' or 'technology'")
if "technology" in metadata:
technology = metadata["technology"]
else:
technology = "10x" + metadata["chemistry"]
try:
expected_n_cells = int(metadata["targetnumcells"])
except:
expected_n_cells = 5000
whitelist_file: Optional[str] = os.path.join(index_path, f"{technology}_whitelist.txt")
if not os.path.exists(whitelist_file): # type: ignore
logging.warning(f"Barcode whitelist file {whitelist_file} not found in index folder at '{index_path}'; barcode correction will be skipped.")
whitelist_file = None
with TemporaryDirectory() as d:
if temp_folder is not None:
d = temp_folder
if not os.path.exists(d):
os.mkdir(d)
cmd = ["kallisto", "bus", "-i", os.path.join(index_path, manifest["index_file"]), "-o", d, "-x", technology, "-t", str(n_threads)] + fastqs
logging.info(" ".join(cmd))
for line in execute(cmd, synchronous):
if line != "\n":
logging.info(line[:-1])
run_info: Optional[Dict[str, str]] = None
try:
with open(os.path.join(d, "run_info.json")) as f:
run_info = json.load(f)
except json.JSONDecodeError as e:
with open(os.path.join(d, "run_info.json")) as f:
for line in f:
logging.error(line)
logging.error(f"Error decoding run_info.json: {e}")
bus = BusFile(
os.path.join(d, "output.bus"),
os.path.join(index_path, manifest["gene_metadata_file"]),
manifest["gene_metadata_key"],
os.path.join(index_path, manifest["fragments_to_genes_file"]),
os.path.join(d, "matrix.ec"),
os.path.join(d, "transcripts.txt")
)
logging.info(f"Found {bus.n_records:,} records for {bus.n_genes:,} genes and {bus.n_cells:,} uncorrected cell barcodes.")
if whitelist_file is None:
logging.warning("Not correcting barcodes, because whitelist file was not provided.")
else:
logging.info("Correcting cell barcodes")
bus.correct(whitelist_file)
logging.info(f"Found {bus.n_cells:,} corrected cell barcodes.")
logging.info("Removing redundant reads using UMIs")
bus.deduplicate()
seq_sat = 1 - bus.bus_valid.sum() / bus.n_records
logging.info(f"{int(seq_sat * 100)}% sequencing saturation.")
bus.count()
logging.info(f"Found {bus.matrix.count_nonzero():,} UMIs.")
for layer, layer_def in manifest["layers"].items():
bus.count_layer(layer, os.path.join(index_path, layer_def))
logging.info(f"Found {bus.layers[layer].count_nonzero():,} UMIs.")
bus.remove_empty_beads(expected_n_cells)
logging.info(f"Creating loom file '{out_file}'")
bus.save(out_file, sample_id, samples_metadata_file)
with connect(out_file) as ds:
ds.attrs.Species = manifest["species"]
ds.attrs.Saturation = seq_sat
if run_info is not None:
ds.attrs.NumReadsProcessed = int(run_info["n_processed"])
ds.attrs.NumPseudoaligned = int(run_info["n_pseudoaligned"])
ds.attrs.KallistoCommand = run_info["call"]
ds.attrs.KallistoVersion = run_info["kallisto_version"]
# Original algorithm was published by John Marioni and colleagues as EmptyDrops (Lun, A. et al. Distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data.)
# This implementation is based on the code in cellranger v3.0 by 10x Genomics
# Copyright 2018 10X Genomics, Inc.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
# WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
import logging
import numpy as np
import scipy.sparse as sparse
import scipy.stats as sp_stats
# Simple Good-Turing estimator.
# Based on S implementation in
# William A. Gale & Geoffrey Sampson (1995) Good-turing frequency estimation without tears,
# Journal of Quantitative Linguistics, 2:3, 217-237, DOI: 10.1080/09296179508590051
class SimpleGoodTuringError(Exception):
pass
def _averaging_transform(r, nr):
d = np.concatenate((np.ones(1, dtype=int), np.diff(r)))
dr = np.concatenate((
0.5 * (d[1:] + d[0:-1]),
np.array((d[-1],), dtype=float),
))
return nr.astype(float) / dr
def _rstest(r, coef):
return r * np.power(1 + 1 / r, 1 + coef)
def simple_good_turing(xr, xnr):
"""Make a Simple Good-Turing estimate of the frequencies.
Args:
xr (np.array(int)): Non-zero item frequencies
xnr (np.array(int)): Non-zero frequencies of frequencies
Returns:
(rstar (np.array(float)), p0 (float)):
rstar: The adjusted non-zero frequencies
p0: The total probability of unobserved items
"""
xr = xr.astype(float)
xnr = xnr.astype(float)
xN = np.sum(xr * xnr)
# Get Linear Good-Turing estimate
xnrz = _averaging_transform(xr, xnr)
slope, intercept, _, _, _ = sp_stats.linregress(np.log(xr), np.log(xnrz))
if slope > -1:
raise SimpleGoodTuringError("The log-log slope is > -1 (%d); the SGT estimator is not applicable to these data." % slope)
xrst = _rstest(xr, slope)
xrstrel = xrst / xr
# Get traditional Good-Turing estimate
xrtry = xr == np.concatenate((xr[1:] - 1, np.zeros(1)))
xrstarel = np.zeros(len(xr))
xrstarel[xrtry] = (xr[xrtry] + 1) / xr[xrtry] * np.concatenate((xnr[1:], np.zeros(1)))[xrtry] / xnr[xrtry]
# Determine when to switch from GT to LGT estimates
tursd = np.ones(len(xr))
for i in range(len(xr)):
if xrtry[i]:
tursd[i] = float(i + 2) / xnr[i] * np.sqrt(xnr[i + 1] * (1 + xnr[i + 1] / xnr[i]))
xrstcmbrel = np.zeros(len(xr))
useturing = True
for r in range(len(xr)):
if not useturing:
xrstcmbrel[r] = xrstrel[r]
else:
if np.abs(xrstrel[r] - xrstarel[r]) * (1 + r) / tursd[r] > 1.65:
xrstcmbrel[r] = xrstarel[r]
else:
useturing = False
xrstcmbrel[r] = xrstrel[r]
# Renormalize the probabilities for observed objects
sumpraw = np.sum(xrstcmbrel * xr * xnr / xN)
xrstcmbrel = xrstcmbrel * (1 - xnr[0] / xN) / sumpraw
p0 = xnr[0] / xN
return (xr * xrstcmbrel, p0)
def sgt_proportions(frequencies):
"""Use Simple Good-Turing estimate to adjust for unobserved items
Args:
frequencies (np.array(int)): Nonzero frequencies of items
Returns:
(pstar (np.array(float)), p0 (float)):
pstar: The adjusted non-zero proportions
p0: The total probability of unobserved items
"""
if len(frequencies) == 0:
raise ValueError("Input frequency vector is empty")
if np.count_nonzero(frequencies) != len(frequencies):
raise ValueError("Frequencies must be greater than zero")
freqfreqs = np.bincount(frequencies.astype(np.int64))
assert freqfreqs[0] == 0
use_freqs = np.flatnonzero(freqfreqs)
if len(use_freqs) < 10:
raise SimpleGoodTuringError("Too few non-zero frequency items (%d). Aborting SGT." % len(use_freqs))
rstar, p0 = simple_good_turing(use_freqs, freqfreqs[use_freqs])
# rstar contains the smoothed frequencies.
# Map each original frequency r to its smoothed rstar.
rstar_dict = dict(zip(use_freqs, rstar))
rstar_sum = np.sum(freqfreqs[use_freqs] * rstar)
rstar_i = np.fromiter((rstar_dict[f] for f in frequencies), dtype=float, count=len(frequencies))
pstar = (1 - p0) * (rstar_i / rstar_sum)
assert np.isclose(p0 + np.sum(pstar), 1)
return (pstar, p0)
def adjust_pvalue_bh(p):
""" Multiple testing correction of p-values using the Benjamini-Hochberg procedure """
descending = np.argsort(p)[::-1]
# q = p * N / k where p = p-value, N = # tests, k = p-value rank
scale = float(len(p)) / np.arange(len(p), 0, -1)
q = np.minimum(1, np.minimum.accumulate(scale * p[descending]))
# Return to original order
return q[np.argsort(descending)]
def eval_multinomial_loglikelihoods(matrix, profile_p, max_mem_gb=0.1):
"""Compute the multinomial log PMF for many barcodes
Args:
matrix (scipy.sparse.csc_matrix): Matrix of UMI counts (feature x barcode)
profile_p (np.ndarray(float)): Multinomial probability vector
max_mem_gb (float): Try to bound memory usage.
Returns:
log_likelihoods (np.ndarray(float)): Log-likelihood for each barcode
"""
gb_per_bc = float(matrix.shape[0] * matrix.dtype.itemsize) / (1024**3)
bcs_per_chunk = max(1, int(round(max_mem_gb / gb_per_bc)))
num_bcs = matrix.shape[1]
loglk = np.zeros(num_bcs)
for chunk_start in range(0, num_bcs, bcs_per_chunk):
chunk = slice(chunk_start, chunk_start + bcs_per_chunk)
matrix_chunk = matrix[:, chunk].transpose().toarray()
n = matrix_chunk.sum(1)
loglk[chunk] = sp_stats.multinomial.logpmf(matrix_chunk, n, p=profile_p)
return loglk
def simulate_multinomial_loglikelihoods(profile_p, umis_per_bc, num_sims=1000, jump=1000, n_sample_feature_block=1000000, verbose=False):
"""Simulate draws from a multinomial distribution for various values of N.
Uses the approximation from Lun et al. ( https://www.biorxiv.org/content/biorxiv/early/2018/04/04/234872.full.pdf )
Args:
profile_p (np.ndarray(float)): Probability of observing each feature.
umis_per_bc (np.ndarray(int)): UMI counts per barcode (multinomial N).
num_sims (int): Number of simulations per distinct N value.
jump (int): Vectorize the sampling if the gap between two distinct Ns exceeds this.
n_sample_feature_block (int): Vectorize this many feature samplings at a time.
Returns:
(distinct_ns (np.ndarray(int)), log_likelihoods (np.ndarray(float)):
distinct_ns is an array containing the distinct N values that were simulated.
log_likelihoods is a len(distinct_ns) x num_sims matrix containing the
simulated log likelihoods.
"""
distinct_n = np.flatnonzero(np.bincount(umis_per_bc.astype(np.int64)))
loglk = np.zeros((len(distinct_n), num_sims), dtype=float)
sampled_features = np.random.choice(len(profile_p), size=n_sample_feature_block, p=profile_p, replace=True)
k = 0
log_profile_p = np.log(profile_p)
for sim_idx in range(num_sims):
curr_counts = np.ravel(sp_stats.multinomial.rvs(distinct_n[0], profile_p, size=1))
curr_loglk = sp_stats.multinomial.logpmf(curr_counts, distinct_n[0], p=profile_p)
loglk[0, sim_idx] = curr_loglk
for i in range(1, len(distinct_n)):
step = distinct_n[i] - distinct_n[i - 1]
if step >= jump:
# Instead of iterating for each n, sample the intermediate ns all at once
curr_counts += np.ravel(sp_stats.multinomial.rvs(step, profile_p, size=1))
curr_loglk = sp_stats.multinomial.logpmf(curr_counts, distinct_n[i], p=profile_p)
assert not np.isnan(curr_loglk)
else:
# Iteratively sample between the two distinct values of n
for n in range(distinct_n[i - 1] + 1, distinct_n[i] + 1):
j = sampled_features[k]
k += 1
if k >= n_sample_feature_block:
# Amortize this operation
sampled_features = np.random.choice(len(profile_p), size=n_sample_feature_block, p=profile_p, replace=True)
k = 0
curr_counts[j] += 1
curr_loglk += log_profile_p[j] + np.log(float(n) / curr_counts[j])
loglk[i, sim_idx] = curr_loglk
return distinct_n, loglk
def compute_ambient_pvalues(umis_per_bc, obs_loglk, sim_n, sim_loglk):
"""Compute p-values for observed multinomial log-likelihoods
Args:
umis_per_bc (nd.array(int)): UMI counts per barcode
obs_loglk (nd.array(float)): Observed log-likelihoods of each barcode deriving from an ambient profile
sim_n (nd.array(int)): Multinomial N for simulated log-likelihoods
sim_loglk (nd.array(float)): Simulated log-likelihoods of shape (len(sim_n), num_simulations)
Returns:
pvalues (nd.array(float)): p-values
"""
assert len(umis_per_bc) == len(obs_loglk)
assert sim_loglk.shape[0] == len(sim_n)
# Find the index of the simulated N for each barcode
sim_n_idx = np.searchsorted(sim_n, umis_per_bc)
num_sims = sim_loglk.shape[1]
num_barcodes = len(umis_per_bc)
pvalues = np.zeros(num_barcodes)
for i in range(num_barcodes):
num_lower_loglk = np.sum(sim_loglk[sim_n_idx[i], :] < obs_loglk[i])
pvalues[i] = float(1 + num_lower_loglk) / (1 + num_sims)
return pvalues
def estimate_profile_sgt(matrix, barcode_indices, nz_feat):
""" Estimate a gene expression profile by Simple Good Turing.
Args:
raw_mat (sparse matrix): Sparse matrix of all counts
barcode_indices (np.array(int)): Barcode indices to use
nz_feat (np.array(int)): Indices of features that are non-zero at least once
Returns:
profile (np.array(float)): Estimated probabilities of length len(nz_feat).
"""
# Initial profile estimate
prof_mat = matrix[:, barcode_indices]
profile = np.ravel(prof_mat[nz_feat, :].sum(axis=1))
zero_feat = np.flatnonzero(profile == 0)
# Simple Good Turing estimate
p_smoothed, p0 = sgt_proportions(profile[np.flatnonzero(profile)])
# Distribute p0 equally among the zero elements.
p0_i = p0 / len(zero_feat)
profile_p = np.repeat(p0_i, len(nz_feat))
profile_p[np.flatnonzero(profile)] = p_smoothed
assert np.isclose(profile_p.sum(), 1.0)
return profile_p
# Construct a background expression profile from barcodes with <= T UMIs
def est_background_profile_sgt(matrix, use_bcs):
""" Estimate a gene expression profile on a given subset of barcodes.
Use Good-Turing to smooth the estimated profile.
Args:
matrix (scipy.sparse.csc_matrix): Sparse matrix of all counts
use_bcs (np.array(int)): Indices of barcodes to use (col indices into matrix)
Returns:
profile (use_features, np.array(float)): Estimated probabilities of length use_features.
"""
# Use features that are nonzero anywhere in the data
use_feats = np.flatnonzero(np.asarray(matrix.sum(1)))
# Estimate background profile
bg_profile_p = estimate_profile_sgt(matrix, use_bcs, use_feats)
return (use_feats, bg_profile_p)
# Sten Linnarsson's version (Aug 2019)
def call_cells(matrix: sparse.csr_matrix, expected_n_cells: int = 5000) -> np.ndarray:
"""
Determine likely true cells among the barcodes by contrasting with the ambient RNA profile
Args:
matrix: expression matrix
expected_n_cells: expected number of true cells in the sample
Returns:
max_ambient_umis: upper UMI count for barcodes considered ambient
pvalues_adj_all: adjusted p-value per barcode for the null hypothesis that it contains only ambient RNA (low values indicate likely true cells)
"""
n_barcodes = matrix.shape[1]
expected_n_cells = min(expected_n_cells, n_barcodes // 5)
total_umis = np.array(matrix.sum(axis=0))[0] # total UMIs per barcode
# upper limit of UMIs for barcodes considered ambient, calculated as greatest UMI count after removing twice the expected number of cells
max_ambient_umis = np.percentile(total_umis, 100 * (n_barcodes - expected_n_cells * 2) / n_barcodes)
# median number of UMIs among the top expected_n_cells barcodes
median_initial_umis = np.median(total_umis[total_umis > np.percentile(total_umis, 100 * (n_barcodes - expected_n_cells) / n_barcodes)])
min_cell_umis = int(max(500, median_initial_umis * 0.1)) # 10% of median, but at least 500 UMIs
# Ambient RNA beads, covering the range 20 to max_ambient_umis
ambient_bcs = (total_umis < max_ambient_umis) & (total_umis > 20)
if ambient_bcs.sum() == 0:
# No beads were ambient, because cells had very low UMIs
logging.warning("No ambient RNA beads were found; maybe sample had too few cells?")
return max_ambient_umis, np.ones_like(total_umis)
try:
eval_features, ambient_profile_p = est_background_profile_sgt(matrix, ambient_bcs)
except SimpleGoodTuringError as e:
logging.error(e)
return max_ambient_umis, np.ones_like(total_umis)
# Evaluate candidate barcodes
eval_bcs = total_umis > min_cell_umis
eval_mat = matrix[eval_features, :][:, eval_bcs]
# Compute observed log-likelihood of barcodes being generated from ambient RNA
obs_loglk = eval_multinomial_loglikelihoods(eval_mat, ambient_profile_p)
# Simulate log likelihoods
distinct_ns, sim_loglk = simulate_multinomial_loglikelihoods(ambient_profile_p, total_umis[eval_bcs], num_sims=1000, verbose=True)
# Compute p-values
pvalues = compute_ambient_pvalues(total_umis[eval_bcs], obs_loglk, distinct_ns, sim_loglk)
pvalues_adj = adjust_pvalue_bh(pvalues)
pvalues_adj_all = np.ones_like(total_umis)
pvalues_adj_all[eval_bcs] = pvalues_adj
return max_ambient_umis, pvalues_adj_all
import colorsys
import json
import os
import random
import sys
from typing import *
import matplotlib.colors as colors
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import LabelEncoder
# See https://graphicdesign.stackexchange.com/questions/3682/where-can-i-find-a-large-palette-set-of-contrasting-colors-for-coloring-many-d
# Adapted from https://github.com/kevinwuhoo/randomcolor-py
# Copyright (c) 2016 Kevin Wu
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
_hues = {
"blue": {
"hue_range": [179, 257],
"lower_bounds": [
[20, 100],
[30, 86],
[40, 80],
[50, 74],
[60, 60],
[70, 52],
[80, 44],
[90, 39],
[100, 35]
]
},
"green": {
"hue_range": [63, 178],
"lower_bounds": [
[30, 100],
[40, 90],
[50, 85],
[60, 81],
[70, 74],
[80, 64],
[90, 50],
[100, 40]
]
},
"monochrome": {
"hue_range": [0, 0],
"lower_bounds": [
[0, 0],
[100, 0]
]
},
"orange": {
"hue_range": [19, 46],
"lower_bounds": [
[20, 100],
[30, 93],
[40, 88],
[50, 86],
[60, 85],
[70, 70],
[100, 70]
]
},
"pink": {
"hue_range": [283, 334],
"lower_bounds": [
[20, 100],
[30, 90],
[40, 86],
[60, 84],
[80, 80],
[90, 75],
[100, 73]
]
},
"purple": {
"hue_range": [258, 282],
"lower_bounds": [
[20, 100],
[30, 87],
[40, 79],
[50, 70],
[60, 65],
[70, 59],
[80, 52],
[90, 45],
[100, 42]
]
},
"red": {
"hue_range": [-26, 18],
"lower_bounds": [
[20, 100],
[30, 92],
[40, 89],
[50, 85],
[60, 78],
[70, 70],
[80, 60],
[90, 55],
[100, 50]
]
},
"yellow": {
"hue_range": [47, 62],
"lower_bounds": [
[25, 100],
[40, 94],
[50, 89],
[60, 86],
[70, 84],
[80, 82],
[90, 80],
[100, 75]
]
}
}
class RandomColor:
def __init__(self, seed=None):
# Load color dictionary and populate the color dictionary
self.colormap = _hues
self.seed = seed if seed else random.randint(0, sys.maxsize)
self.random = random.Random(self.seed)
for color_name, color_attrs in self.colormap.items():
lower_bounds = color_attrs['lower_bounds']
s_min = lower_bounds[0][0]
s_max = lower_bounds[len(lower_bounds) - 1][0]
b_min = lower_bounds[len(lower_bounds) - 1][1]
b_max = lower_bounds[0][1]
self.colormap[color_name]['saturation_range'] = [s_min, s_max]
self.colormap[color_name]['brightness_range'] = [b_min, b_max]
def generate(self, hue=None, luminosity=None, count=1, format_='hex'):
colors = []
for _ in range(count):
# First we pick a hue (H)
H = self.pick_hue(hue)
# Then use H to determine saturation (S)
S = self.pick_saturation(H, hue, luminosity)
# Then use S and H to determine brightness (B).
B = self.pick_brightness(H, S, luminosity)
# Then we return the HSB color in the desired format
colors.append(self.set_format([H, S, B], format_))
return colors
def pick_hue(self, hue):
hue_range = self.get_hue_range(hue)
hue = self.random_within(hue_range)
# Instead of storing red as two separate ranges,
# we group them, using negative numbers
if (hue < 0):
hue += 360
return hue
def pick_saturation(self, hue, hue_name, luminosity):
if luminosity == 'random':
return self.random_within([0, 100])
if hue_name == 'monochrome':
return 0
saturation_range = self.get_saturation_range(hue)
s_min = saturation_range[0]
s_max = saturation_range[1]
if luminosity == 'bright':
s_min = 55
elif luminosity == 'dark':
s_min = s_max - 10
elif luminosity == 'light':
s_max = 55
return self.random_within([s_min, s_max])
def pick_brightness(self, H, S, luminosity):
b_min = self.get_minimum_brightness(H, S)
b_max = 100
if luminosity == 'dark':
b_max = b_min + 20
elif luminosity == 'light':
b_min = (b_max + b_min) / 2
elif luminosity == 'random':
b_min = 0
b_max = 100
return self.random_within([b_min, b_max])
def set_format(self, hsv, format_):
if 'hsv' in format_:
color = hsv
elif 'rgb' in format_:
color = self.hsv_to_rgb(hsv)
elif 'hex' in format_:
r, g, b = self.hsv_to_rgb(hsv)
return '#%02x%02x%02x' % (r, g, b)
else:
return "unrecognized format"
if "Array" in format_ or format_ == 'hex':
return color
else:
prefix = format_[:3]
color_values = [str(x) for x in color]
return "%s(%s)" % (prefix, ', '.join(color_values))
def get_minimum_brightness(self, H, S):
lower_bounds = self.get_color_info(H)['lower_bounds']
for i in range(len(lower_bounds) - 1):
s1 = lower_bounds[i][0]
v1 = lower_bounds[i][1]
s2 = lower_bounds[i + 1][0]
v2 = lower_bounds[i + 1][1]
if s1 <= S <= s2:
m = (v2 - v1) / (s2 - s1)
b = v1 - m * s1
return m * S + b
return 0
def get_hue_range(self, color_input):
if color_input and color_input.isdigit():
number = int(color_input)
if 0 < number < 360:
return [number, number]
elif color_input and color_input in self.colormap:
color = self.colormap[color_input]
if 'hue_range' in color:
return color['hue_range']
else:
return [0, 360]
def get_saturation_range(self, hue):
return self.get_color_info(hue)['saturation_range']
def get_color_info(self, hue):
# Maps red colors to make picking hue easier
if 334 <= hue <= 360:
hue -= 360
for color_name, color in self.colormap.items():
if color['hue_range'] and color['hue_range'][0] <= hue <= color['hue_range'][1]:
return self.colormap[color_name]
# this should probably raise an exception
return 'Color not found'
def random_within(self, r):
return self.random.randint(int(r[0]), int(r[1]))
@classmethod
def hsv_to_rgb(cls, hsv):
h, s, v = hsv
h = 1 if h == 0 else h
h = 359 if h == 360 else h
h = float(h) / 360
s = float(s) / 100
v = float(v) / 100
rgb = colorsys.hsv_to_rgb(h, s, v)
return [int(c * 255) for c in rgb]
zviridis = colors.LinearSegmentedColormap.from_list("zviridis", [(0.9, 0.9, 0.9, 1)] + list(plt.cm.viridis(np.arange(1000) / 1000)), N=1001)
_color_alphabet = [
'#efa2fe',
'#0075db',
'#983f00',
'#4c005c',
'#005c31',
'#2bcd48',
'#fecb98',
'#808080',
'#93feb4',
'#8e7c00',
'#9ccb00',
'#c10087',
'#003380',
'#fea305',
'#fea7ba',
'#426600',
'#fe0010',
'#5ef0f1',
'#00988e',
'#dffe66',
'#740afe',
'#980000',
'#fefe80',
'#fefe00',
'#fe5005'
]
def cat_colors(N: int = 1, *, hue: str = None, luminosity: str = None, bgvalue: int = None, loop: bool = False, seed: str = "cat") -> Union[List[Any], colors.LinearSegmentedColormap]:
"""
Return a colormap suitable for N categorical values, optimized to be both aesthetically pleasing and perceptually distinct.
Args:
N The number of colors requested.
hue Controls the hue of the generated color. You can pass a string representing a color name: "red", "orange", "yellow", "green", "blue", "purple", "pink" and "monochrome" are currently supported. If you pass a hexadecimal color string such as "#00FFFF", its hue value will be used to generate colors.
luminosity Controls the luminosity of the generated color: "bright", "light" or "dark".
bgvalue If not None, then the corresponding index color will be set to light gray
loop If True, loop the color alphabet instead of generating random colors
seed If not None, use as the random seed (default: "cat")
Returns:
A matplotlib LinearSegmentedColormap containing the N requested colors.
If N <= 25 and hue and luminosity are both None, a subset of the optimally perceptually distinct "color alphabet" is returned.
Else, a pleasing set of random colors is returned.
Colors are designed to be displayed on a white background.
"""
c: List[str] = []
if N <= 25 and hue is None and luminosity is None:
c = _color_alphabet[:N]
elif not loop:
c = RandomColor(seed=seed).generate(count=N, hue=hue, luminosity=luminosity, format_="hex")
else:
n = N
while n > 0:
c += _color_alphabet[:n]
n -= 25
if bgvalue is not None:
c[bgvalue] = "#aaaaaa"
return colors.LinearSegmentedColormap.from_list("", c, N)
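# Example usage (illustrative sketch, not part of the original module):
#
#     import numpy as np
#     import matplotlib.pyplot as plt
#
#     labels = np.random.randint(0, 8, size=200)   # e.g. cluster labels
#     xy = np.random.randn(200, 2)                 # e.g. a 2D embedding
#     plt.scatter(xy[:, 0], xy[:, 1], c=labels, cmap=cat_colors(8), s=10)
#     plt.show()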