Skip to content
Commits on Source (17)
......@@ -18,7 +18,7 @@ In addition to scripting TreeTime or using it via the command-line, there is als
![Molecular clock phylogeny of 200 NA sequences of influenza A H3N2](https://raw.githubusercontent.com/neherlab/treetime_examples/master/figures/tree_and_clock.png)
Have a look at our [examples and tutorials](https://github.com/neherlab/treetime_examples).
Have a look at our repository with [example data](https://github.com/neherlab/treetime_examples) and the [tutorials](https://treetime.readthedocs.io/en/latest/tutorials.html).
#### Features
* ancestral sequence reconstruction (marginal and joint maximum likelihood)
......@@ -83,7 +83,7 @@ The to infer a timetree, i.e. a phylogenetic tree in which branch length reflect
treetime --aln <input.fasta> --tree <input.nwk> --dates <dates.csv>
```
This command will infer a time tree, ancestral sequences, a GTR model, and optionally confidence intervals and coalescent models.
A detailed explanation is of this command with its various options and examples are available at [treetime_examples/timetree.md](http://github.com/neherlab/treetime_examples/blob/master/timetree.md)
A detailed explanation is of this command with its various options and examples is available in [the documentation on readthedocs.org](https://treetime.readthedocs.io/en/latest/tutorials/timetree.html).
#### Rerooting and substitution rate estimation
......@@ -93,7 +93,7 @@ To explore the temporal signal in the data and estimate the substitution rate (i
```
The full list if options is available by typing `treetime clock -h`.
Instead of an input alignment, `--sequence-length <L>` can be provided.
Documentation of additional options and examples are available at [treetime_examples/clock.md](https://github.com/neherlab/treetime_examples/blob/master/clock.md)
Documentation of additional options and examples are available at in [the documentation on readthedocs.org](https://treetime.readthedocs.io/en/latest/tutorials/clock.html).
#### Ancestral sequence reconstruction:
......@@ -103,7 +103,7 @@ The subcommand
```
will reconstruct ancestral sequences at internal nodes of the input tree.
The full list if options is available by typing `treetime ancestral -h`.
A detailed explanation of `treetime ancestral` with examples is available at [treetime_examples/ancestral.md](https://github.com/neherlab/treetime_examples/blob/master/ancestral.md)
A detailed explanation of `treetime ancestral` with examples is available at in [the documentation on readthedocs.org](https://treetime.readthedocs.io/en/latest/tutorials/ancestral.html).
#### Homoplasy analysis
Detecting and quantifying homoplasies or recurrent mutations is useful to check for recombination, putative adaptive sites, or contamination.
......@@ -112,7 +112,7 @@ TreeTime provides a simple command to summarize homoplasies in data
treetime homoplasy --aln <input.fasta> --tree <input.nwk>
```
The full list if options is available by typing `treetime homoplasy -h`.
Please see [treetime_examples/homoplasy.md](https://github.com/neherlab/treetime_examples/blob/master/homoplasy.md) for examples and more documentation.
Please see [the documentation on readthedocs.org](https://treetime.readthedocs.io/en/latest/tutorials/homoplasy.html) for examples and more documentation.
#### Mugration analysis
Migration between discrete geographic regions, host switching, or other transition between discrete states are often parameterized by time-reversible models analogous to models describing evolution of genome sequences.
......@@ -123,7 +123,7 @@ TreeTime GTR model machinery can be used to infer mugration models:
```
where `<field>` is the relevant column in the csv file specifying the metadata `states.csv`, e.g. `<field>=country`.
The full list if options is available by typing `treetime mugration -h`.
Please see [treetime_examples/mugration.md](https://github.com/neherlab/treetime_examples/blob/master/mugration.md) for examples and more documentation.
Please see [the documentation on readthedocs.org](https://treetime.readthedocs.io/en/latest/tutorials/mugration.html) for examples and more documentation.
#### Metadata and date format
Several of TreeTime commands require the user to specify a file with dates and/or other meta data.
......@@ -177,12 +177,6 @@ The API documentation for the TreeTime package is generated created with Sphinx.
pip install Sphinx
```
- basicstrap Html theme for sphinx:
```bash
pip install sphinxjp.themes.basicstrap
```
After required packages are installed, navigate to doc directory, and build the docs by typing:
```bash
......
import datetime
from treetime.treeanc import TreeAnc
from treetime.clock_tree import ClockTree
from treetime.treetime import TreeTime
from treetime.treetime import ttconf as treetime_conf
from treetime.gtr import GTR
from treetime.treetime import plot_vs_years
from treetime.treetime import treetime_to_newick
from treetime.tree_regression import TreeRegression
from treetime.merger_models import Coalescent
import treetime.seq_utils as seq_utils
from treetime.utils import numeric_date
# 0.7.0 -- restructuring
## Major changes
This release largely includes changes under the hood, some of which also affect how treetime behaves. The biggest changes are
* sequence data handling is now done by a separate class `SequenceData`. There is now a clear distinction between input data that is never changed and inferred sequences. This class also provides consolidated set of functions to convert sparse, compressed, and full sequence representations into each other.
* sequences are now unicode when running from python3. This does not seem to come with a measurable performance hit compared to byte sequences as long as all characters are ASCII. Moving away from bytes to unicode proved much less hassle than converting sequences back and forth from unicode to bytes during IO.
* Ancestral state reconstruction no longer reconstructs the state of terminal nodes by default and sequence accessors and output will return the input data by default. Reconstruction is optional.
* The command-line mugration model inference now optimize the overall rate numerically and is hence no longer making a short-branch length assumption.
* TreeTime raises now a number of custom errors rather than returning success or error codes. This should result in fewer "silent errors" that cause problems downstream.
## Minor new features
In addition, we implemented a number of other changes to the interface
* `treetime`, `treetime clock` now accept the arguments `--name-column` and `-date-column` to explicitly specify the metadata columns to be used as name or date
* `treetime mugration` accepts a `--name-column` argument.
## Bug fixes
* scaling of skyline confidence intervals was wrong. It now reflects the inverse second derivative in log-space
* catch problems after rerooting associated with missing attributes in the newly generated root node.
* make conversion from calendar dates to numeric dates and vice versa compatible and remove approximate handling of leap-years.
* avoid overwriting content of output directory with default names
* don't export inferred dates of tips labeled as `bad_branch`.
\ No newline at end of file
# Contributing to TreeTime
Thank you for your interest in contributing to TreeTime.
We welcome pull-requests that fix bugs or implement new features.
## Bugs
If you come across a bug or unexpected behavior, please file an issue.
## Testing
Upon pushing a commit, travis will run a few simple tests. These use data available in the [neherlab/treetime_examples](https://github.com/neherlab/treetime_examples) repository.
## Coding conventions (loosly adhered to)
* indentation: 4 spaces
* docstrings: numpy style
* variable names: snake_case
python-treetime (0.7.1-1) unstable; urgency=medium
* Team upload.
* New upstream version
* Remove trailing whitespace in debian/changelog (routine-update)
* Set field Upstream-Name in debian/copyright.
* Adjusted d/watch to allow omission of "v" prefix
-- Steffen Moeller <moeller@debian.org> Sat, 18 Jan 2020 21:23:53 +0100
python-treetime (0.7.0~beta.1-1) UNRELEASED; urgency=medium
* Team upload.
* New upstream version
* Standards-Version: 4.4.1 (routine-update)
* Set upstream metadata fields: Bug-Database, Bug-Submit, Repository,
Repository-Browse.
* Fixed indent of autotest in patch
https://github.com/neherlab/treetime/issues/102
* Adjusted d/watch to prefer version without "-beta"
-- Steffen Moeller <moeller@debian.org> Sat, 18 Jan 2020 20:52:28 +0100
python-treetime (0.6.2-1) unstable; urgency=medium
* New upstream version
......
......@@ -12,7 +12,7 @@ Build-Depends: debhelper-compat (= 12),
python3-pandas,
python3-scipy,
python3-setuptools
Standards-Version: 4.4.0
Standards-Version: 4.4.1
Vcs-Browser: https://salsa.debian.org/med-team/python-treetime
Vcs-Git: https://salsa.debian.org/med-team/python-treetime.git
Homepage: https://github.com/neherlab/treetime
......
Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
Upstream-Contact: Pavel Sagulenko <pavel.sagulenko@tuebingen.mpg.de>
Source: https://github.com/neherlab/treetime
Upstream-Name: phylo-treetime
Files: *
Copyright: 2016-208 Pavel Sagulenko, Emma Hodcroft, and Richard Neher
......
Index: python-treetime/treetime/test_opt.py
===================================================================
--- python-treetime.orig/treetime/test_opt.py
+++ python-treetime/treetime/test_opt.py
@@ -1,22 +1,83 @@
- def optimize_tree_marginal_new(self, damping=0.5):
- L = self.data.compressed_length
- n_states = self.gtr.alphabet.shape[0]
- # propagate leaves --> root, set the marginal-likelihood messages
- for node in self.tree.find_clades(order='postorder'): #leaves -> root
- if node.up is None and len(node.clades)==2:
- continue
-
- profiles = [c.marginal_subtree_LH for c in node] + [node.marginal_outgroup_LH]
- bls = [c.branch_length for c in nodes] + [node.branch_length]
- new_bls = self.optimize_star(profiles,bls, last_is_root=node.up is None)
-
- # regardless of what was before, set the profile to ones
+def optimize_tree_marginal_new(self, damping=0.5):
+ L = self.data.compressed_length
+ n_states = self.gtr.alphabet.shape[0]
+ # propagate leaves --> root, set the marginal-likelihood messages
+ for node in self.tree.find_clades(order='postorder'): #leaves -> root
+ if node.up is None and len(node.clades)==2:
+ continue
+
+ profiles = [c.marginal_subtree_LH for c in node] + [node.marginal_outgroup_LH]
+ bls = [c.branch_length for c in nodes] + [node.branch_length]
+ new_bls = self.optimize_star(profiles,bls, last_is_root=node.up is None)
+
+ # regardless of what was before, set the profile to ones
+ tmp_log_subtree_LH = np.zeros((L,n_states), dtype=float)
+ node.marginal_subtree_LH_prefactor = np.zeros(L, dtype=float)
+ for ch in ci,node.clades:
+ ch.branch_length = new_bls[ci]
+ ch.marginal_log_Lx = self.gtr.propagate_profile(ch.marginal_subtree_LH,
+ ch.branch_length, return_log=True)
+ tmp_log_subtree_LH += ch.marginal_log_Lx
+ node.marginal_subtree_LH_prefactor += ch.marginal_subtree_LH_prefactor
+
+ node.marginal_subtree_LH, offset = normalize_profile(tmp_log_subtree_LH, log=True)
+ node.marginal_subtree_LH_prefactor += offset # and store log-prefactor
+
+ if node.up:
+ node.marginal_log_Lx = self.gtr.propagate_profile(node.marginal_subtree_LH,
+ node.branch_length, return_log=True) # raw prob to transfer prob up
+ tmp_msg_from_parent = self.gtr.evolve(node.marginal_outgroup_LH,
+ self._branch_length_to_gtr(node), return_log=False)
+ node.marginal_profile, pre = normalize_profile(node.marginal_subtree_LH * tmp_msg_from_parent, return_offset=False)
+ else:
+ node.marginal_profile, pre = normalize_profile(node.marginal_subtree_LH * node.marginal_outgroup_LH, return_offset=False)
+
+ root=self.tree.root
+ print(len(root.clades))
+ if len(root.clades)==2:
+ tmp_log_subtree_LH = np.zeros((L,n_states), dtype=float)
+ root.marginal_subtree_LH_prefactor = np.zeros(L, dtype=float)
+ old_bl = root.clades[0].branch_length + root.clades[1]
+ bl = self.gtr.optimal_t_compressed((root.clades[0].marginal_subtree_LH*root.marginal_outgroup_LH,
+ root.clades[1].marginal_subtree_LH), multiplicity=self.data.multiplicity,
+ profiles=True, tol=1e-8)
+ for ch in root:
+ ch.branch_length *= ((1-damping)*old_bl + damping*bl)/old_bl
+ ch.marginal_log_Lx = self.gtr.propagate_profile(ch.marginal_subtree_LH,
+ ch.branch_length, return_log=True) # raw prob to transfer prob up
+ tmp_log_subtree_LH += ch.marginal_log_Lx
+ root.marginal_subtree_LH_prefactor += ch.marginal_subtree_LH_prefactor
+
+ root.marginal_subtree_LH, offset = normalize_profile(tmp_log_subtree_LH, log=True)
+ root.marginal_subtree_LH_prefactor += offset # and store log-prefactor
+
+
+ self.total_LH_and_root_sequence(assign_sequence=False)
+ self.preorder_traversal_marginal(assign_sequence=False, reconstruct_leaves=False)
+
+
+
+
+def optimize_tree_marginal_new2(self, n_iter_internal=2, damping=0.5):
+ L = self.data.compressed_length
+ n_states = self.gtr.alphabet.shape[0]
+ # propagate leaves --> root, set the marginal-likelihood messages
+ for node in self.tree.get_nonterminals(order='postorder'): #leaves -> root
+ if node.up is None and len(node.clades)==2:
+ continue
+ # regardless of what was before, set the profile to ones
+ for ii in range(n_iter_internal):
+ damp = damping**(1+ii)
tmp_log_subtree_LH = np.zeros((L,n_states), dtype=float)
node.marginal_subtree_LH_prefactor = np.zeros(L, dtype=float)
- for ch in ci,node.clades:
- ch.branch_length = new_bls[ci]
+ for ch in node.clades:
+ outgroup = np.exp(np.log(np.maximum(ttconf.TINY_NUMBER, node.marginal_profile)) - ch.marginal_log_Lx)
+
+ bl = self.gtr.optimal_t_compressed((ch.marginal_subtree_LH, outgroup), multiplicity=self.data.multiplicity, profiles=True, tol=1e-8)
+ new_bl = (1-damp)*bl + damp*ch.branch_length
+ ch.branch_length=new_bl
ch.marginal_log_Lx = self.gtr.propagate_profile(ch.marginal_subtree_LH,
- ch.branch_length, return_log=True)
+ new_bl, return_log=True) # raw prob to transfer prob up
tmp_log_subtree_LH += ch.marginal_log_Lx
node.marginal_subtree_LH_prefactor += ch.marginal_subtree_LH_prefactor
@@ -24,101 +85,40 @@
node.marginal_subtree_LH_prefactor += offset # and store log-prefactor
if node.up:
+ bl = self.gtr.optimal_t_compressed((node.marginal_subtree_LH, node.marginal_outgroup_LH), multiplicity=self.data.multiplicity, profiles=True, tol=1e-8)
+ new_bl = (1-damp)*bl + damp*node.branch_length
+ node.branch_length=new_bl
node.marginal_log_Lx = self.gtr.propagate_profile(node.marginal_subtree_LH,
- node.branch_length, return_log=True) # raw prob to transfer prob up
+ new_bl, return_log=True) # raw prob to transfer prob up
+ node.marginal_outgroup_LH, pre = normalize_profile(np.log(np.maximum(ttconf.TINY_NUMBER, node.up.marginal_profile)) - node.marginal_log_Lx,
+ log=True, return_offset=False)
+
tmp_msg_from_parent = self.gtr.evolve(node.marginal_outgroup_LH,
self._branch_length_to_gtr(node), return_log=False)
node.marginal_profile, pre = normalize_profile(node.marginal_subtree_LH * tmp_msg_from_parent, return_offset=False)
else:
node.marginal_profile, pre = normalize_profile(node.marginal_subtree_LH * node.marginal_outgroup_LH, return_offset=False)
- root=self.tree.root
- print(len(root.clades))
- if len(root.clades)==2:
- tmp_log_subtree_LH = np.zeros((L,n_states), dtype=float)
- root.marginal_subtree_LH_prefactor = np.zeros(L, dtype=float)
- old_bl = root.clades[0].branch_length + root.clades[1]
- bl = self.gtr.optimal_t_compressed((root.clades[0].marginal_subtree_LH*root.marginal_outgroup_LH,
- root.clades[1].marginal_subtree_LH), multiplicity=self.data.multiplicity,
- profiles=True, tol=1e-8)
- for ch in root:
- ch.branch_length *= ((1-damping)*old_bl + damping*bl)/old_bl
- ch.marginal_log_Lx = self.gtr.propagate_profile(ch.marginal_subtree_LH,
- ch.branch_length, return_log=True) # raw prob to transfer prob up
- tmp_log_subtree_LH += ch.marginal_log_Lx
- root.marginal_subtree_LH_prefactor += ch.marginal_subtree_LH_prefactor
-
- root.marginal_subtree_LH, offset = normalize_profile(tmp_log_subtree_LH, log=True)
- root.marginal_subtree_LH_prefactor += offset # and store log-prefactor
-
-
- self.total_LH_and_root_sequence(assign_sequence=False)
- self.preorder_traversal_marginal(assign_sequence=False, reconstruct_leaves=False)
+ root=self.tree.root
+ print(len(root.clades))
+ if len(root.clades)==2:
+ tmp_log_subtree_LH = np.zeros((L,n_states), dtype=float)
+ root.marginal_subtree_LH_prefactor = np.zeros(L, dtype=float)
+ old_bl = root.clades[0].branch_length + root.clades[1]
+ bl = self.gtr.optimal_t_compressed((root.clades[0].marginal_subtree_LH*root.marginal_outgroup_LH,
+ root.clades[1].marginal_subtree_LH), multiplicity=self.data.multiplicity,
+ profiles=True, tol=1e-8)
+ for ch in root:
+ ch.branch_length *= bl/old_bl
+ ch.marginal_log_Lx = self.gtr.propagate_profile(ch.marginal_subtree_LH,
+ ch.branch_length, return_log=True) # raw prob to transfer prob up
+ tmp_log_subtree_LH += ch.marginal_log_Lx
+ root.marginal_subtree_LH_prefactor += ch.marginal_subtree_LH_prefactor
+ root.marginal_subtree_LH, offset = normalize_profile(tmp_log_subtree_LH, log=True)
+ root.marginal_subtree_LH_prefactor += offset # and store log-prefactor
- def optimize_tree_marginal_new2(self, n_iter_internal=2, damping=0.5):
- L = self.data.compressed_length
- n_states = self.gtr.alphabet.shape[0]
- # propagate leaves --> root, set the marginal-likelihood messages
- for node in self.tree.get_nonterminals(order='postorder'): #leaves -> root
- if node.up is None and len(node.clades)==2:
- continue
- # regardless of what was before, set the profile to ones
- for ii in range(n_iter_internal):
- damp = damping**(1+ii)
- tmp_log_subtree_LH = np.zeros((L,n_states), dtype=float)
- node.marginal_subtree_LH_prefactor = np.zeros(L, dtype=float)
- for ch in node.clades:
- outgroup = np.exp(np.log(np.maximum(ttconf.TINY_NUMBER, node.marginal_profile)) - ch.marginal_log_Lx)
-
- bl = self.gtr.optimal_t_compressed((ch.marginal_subtree_LH, outgroup), multiplicity=self.data.multiplicity, profiles=True, tol=1e-8)
- new_bl = (1-damp)*bl + damp*ch.branch_length
- ch.branch_length=new_bl
- ch.marginal_log_Lx = self.gtr.propagate_profile(ch.marginal_subtree_LH,
- new_bl, return_log=True) # raw prob to transfer prob up
- tmp_log_subtree_LH += ch.marginal_log_Lx
- node.marginal_subtree_LH_prefactor += ch.marginal_subtree_LH_prefactor
-
- node.marginal_subtree_LH, offset = normalize_profile(tmp_log_subtree_LH, log=True)
- node.marginal_subtree_LH_prefactor += offset # and store log-prefactor
-
- if node.up:
- bl = self.gtr.optimal_t_compressed((node.marginal_subtree_LH, node.marginal_outgroup_LH), multiplicity=self.data.multiplicity, profiles=True, tol=1e-8)
- new_bl = (1-damp)*bl + damp*node.branch_length
- node.branch_length=new_bl
- node.marginal_log_Lx = self.gtr.propagate_profile(node.marginal_subtree_LH,
- new_bl, return_log=True) # raw prob to transfer prob up
- node.marginal_outgroup_LH, pre = normalize_profile(np.log(np.maximum(ttconf.TINY_NUMBER, node.up.marginal_profile)) - node.marginal_log_Lx,
- log=True, return_offset=False)
-
- tmp_msg_from_parent = self.gtr.evolve(node.marginal_outgroup_LH,
- self._branch_length_to_gtr(node), return_log=False)
- node.marginal_profile, pre = normalize_profile(node.marginal_subtree_LH * tmp_msg_from_parent, return_offset=False)
- else:
- node.marginal_profile, pre = normalize_profile(node.marginal_subtree_LH * node.marginal_outgroup_LH, return_offset=False)
-
-
- root=self.tree.root
- print(len(root.clades))
- if len(root.clades)==2:
- tmp_log_subtree_LH = np.zeros((L,n_states), dtype=float)
- root.marginal_subtree_LH_prefactor = np.zeros(L, dtype=float)
- old_bl = root.clades[0].branch_length + root.clades[1]
- bl = self.gtr.optimal_t_compressed((root.clades[0].marginal_subtree_LH*root.marginal_outgroup_LH,
- root.clades[1].marginal_subtree_LH), multiplicity=self.data.multiplicity,
- profiles=True, tol=1e-8)
- for ch in root:
- ch.branch_length *= bl/old_bl
- ch.marginal_log_Lx = self.gtr.propagate_profile(ch.marginal_subtree_LH,
- ch.branch_length, return_log=True) # raw prob to transfer prob up
- tmp_log_subtree_LH += ch.marginal_log_Lx
- root.marginal_subtree_LH_prefactor += ch.marginal_subtree_LH_prefactor
-
- root.marginal_subtree_LH, offset = normalize_profile(tmp_log_subtree_LH, log=True)
- root.marginal_subtree_LH_prefactor += offset # and store log-prefactor
-
-
- self.total_LH_and_root_sequence(assign_sequence=False)
- self.preorder_traversal_marginal(assign_sequence=False, reconstruct_leaves=False)
+ self.total_LH_and_root_sequence(assign_sequence=False)
+ self.preorder_traversal_marginal(assign_sequence=False, reconstruct_leaves=False)
fix_test_indent.patch
......@@ -13,3 +13,7 @@ override_dh_install:
dh_install
## enable dh_python finding dependency
find debian -name "*.py" -exec sed -i '1s:/usr/local/bin/python:/usr/bin/python:' \{\} \;
override_dh_auto_clean:
dh_auto_clean
rm -rf phylo_treetime.egg-info
......@@ -7,3 +7,7 @@ Registry:
Entry: treetime
- Name: SciCrunch
Entry: NA
Bug-Database: https://github.com/neherlab/treetime/issues
Bug-Submit: https://github.com/neherlab/treetime/issues/new
Repository: https://github.com/neherlab/treetime.git
Repository-Browse: https://github.com/neherlab/treetime
version=4
https://github.com/neherlab/treetime/releases .*/archive/v@ANY_VERSION@@ARCHIVE_EXT@
opts="uversionmangle=s/-beta/~beta/" \
https://github.com/neherlab/treetime/releases .*/archive/v?@ANY_VERSION@@ARCHIVE_EXT@
......@@ -45,7 +45,7 @@ def test_ancestral():
t = TreeAnc(gtr='Jukes-Cantor', tree=nwk, aln=fasta)
print('ancestral reconstruction' + ("marginal" if marginal else "joint"))
t.reconstruct_anc(method='ml', marginal=marginal)
assert "".join(t.tree.root.sequence) == 'ATGAATCCAAATCAAAAGATAATAACGATTGGCTCTGTTTCTCTCACCATTTCCACAATATGCTTCTTCATGCAAATTGCCATCTTGATAACTACTGTAACATTGCATTTCAAGCAATATGAATTCAACTCCCCCCCAAACAACCAAGTGATGCTGTGTGAACCAACAATAATAGAAAGAAACATAACAGAGATAGTGTATCTGACCAACACCACCATAGAGAAGGAAATATGCCCCAAACCAGCAGAATACAGAAATTGGTCAAAACCGCAATGTGGCATTACAGGATTTGCACCTTTCTCTAAGGACAATTCGATTAGGCTTTCCGCTGGTGGGGACATCTGGGTGACAAGAGAACCTTATGTGTCATGCGATCCTGACAAGTGTTATCAATTTGCCCTTGGACAGGGAACAACACTAAACAACGTGCATTCAAATAACACAGTACGTGATAGGACCCCTTATCGGACTCTATTGATGAATGAGTTGGGTGTTCCTTTTCATCTGGGGACCAAGCAAGTGTGCATAGCATGGTCCAGCTCAAGTTGTCACGATGGAAAAGCATGGCTGCATGTTTGTATAACGGGGGATGATAAAAATGCAACTGCTAGCTTCATTTACAATGGGAGGCTTGTAGATAGTGTTGTTTCATGGTCCAAAGAAATTCTCAGGACCCAGGAGTCAGAATGCGTTTGTATCAATGGAACTTGTACAGTAGTAATGACTGATGGAAGTGCTTCAGGAAAAGCTGATACTAAAATACTATTCATTGAGGAGGGGAAAATCGTTCATACTAGCACATTGTCAGGAAGTGCTCAGCATGTCGAAGAGTGCTCTTGCTATCCTCGATATCCTGGTGTCAGATGTGTCTGCAGAGACAACTGGAAAGGCTCCAATCGGCCCATCGTAGATATAAACATAAAGGATCATAGCATTGTTTCCAGTTATGTGTGTTCAGGACTTGTTGGAGACACACCCAGAAAAAACGACAGCTCCAGCAGTAGCCATTGTTTGGATCCTAACAATGAAGAAGGTGGTCATGGAGTGAAAGGCTGGGCCTTTGATGATGGAAATGACGTGTGGATGGGAAGAACAATCAACGAGACGTCACGCTTAGGGTATGAAACCTTCAAAGTCATTGAAGGCTGGTCCAACCCTAAGTCCAAATTGCAGATAAATAGGCAAGTCATAGTTGACAGAGGTGATAGGTCCGGTTATTCTGGTATTTTCTCTGTTGAAGGCAAAAGCTGCATCAATCGGTGCTTTTATGTGGAGTTGATTAGGGGAAGAAAAGAGGAAACTGAAGTCTTGTGGACCTCAAACAGTATTGTTGTGTTTTGTGGCACCTCAGGTACATATGGAACAGGCTCATGGCCTGATGGGGCGGACCTCAATCTCATGCCTATA'
assert t.data.compressed_to_full_sequence(t.tree.root.cseq, as_string=True) == 'ATGAATCCAAATCAAAAGATAATAACGATTGGCTCTGTTTCTCTCACCATTTCCACAATATGCTTCTTCATGCAAATTGCCATCTTGATAACTACTGTAACATTGCATTTCAAGCAATATGAATTCAACTCCCCCCCAAACAACCAAGTGATGCTGTGTGAACCAACAATAATAGAAAGAAACATAACAGAGATAGTGTATCTGACCAACACCACCATAGAGAAGGAAATATGCCCCAAACCAGCAGAATACAGAAATTGGTCAAAACCGCAATGTGGCATTACAGGATTTGCACCTTTCTCTAAGGACAATTCGATTAGGCTTTCCGCTGGTGGGGACATCTGGGTGACAAGAGAACCTTATGTGTCATGCGATCCTGACAAGTGTTATCAATTTGCCCTTGGACAGGGAACAACACTAAACAACGTGCATTCAAATAACACAGTACGTGATAGGACCCCTTATCGGACTCTATTGATGAATGAGTTGGGTGTTCCTTTTCATCTGGGGACCAAGCAAGTGTGCATAGCATGGTCCAGCTCAAGTTGTCACGATGGAAAAGCATGGCTGCATGTTTGTATAACGGGGGATGATAAAAATGCAACTGCTAGCTTCATTTACAATGGGAGGCTTGTAGATAGTGTTGTTTCATGGTCCAAAGAAATTCTCAGGACCCAGGAGTCAGAATGCGTTTGTATCAATGGAACTTGTACAGTAGTAATGACTGATGGAAGTGCTTCAGGAAAAGCTGATACTAAAATACTATTCATTGAGGAGGGGAAAATCGTTCATACTAGCACATTGTCAGGAAGTGCTCAGCATGTCGAAGAGTGCTCTTGCTATCCTCGATATCCTGGTGTCAGATGTGTCTGCAGAGACAACTGGAAAGGCTCCAATCGGCCCATCGTAGATATAAACATAAAGGATCATAGCATTGTTTCCAGTTATGTGTGTTCAGGACTTGTTGGAGACACACCCAGAAAAAACGACAGCTCCAGCAGTAGCCATTGTTTGGATCCTAACAATGAAGAAGGTGGTCATGGAGTGAAAGGCTGGGCCTTTGATGATGGAAATGACGTGTGGATGGGAAGAACAATCAACGAGACGTCACGCTTAGGGTATGAAACCTTCAAAGTCATTGAAGGCTGGTCCAACCCTAAGTCCAAATTGCAGATAAATAGGCAAGTCATAGTTGACAGAGGTGATAGGTCCGGTTATTCTGGTATTTTCTCTGTTGAAGGCAAAAGCTGCATCAATCGGTGCTTTTATGTGGAGTTGATTAGGGGAAGAAAAGAGGAAACTGAAGTCTTGTGGACCTCAAACAGTATTGTTGTGTTTTGTGGCACCTCAGGTACATATGGAACAGGCTCATGGCCTGATGGGGCGGACCTCAATCTCATGCCTATA'
print('testing LH normalization')
from Bio import Phylo,AlignIO
......@@ -101,7 +101,7 @@ def test_seq_joint_reconstruction_correct():
tree = myTree.tree
seq_len = 400
tree.root.ref_seq = np.random.choice(mygtr.alphabet, p=mygtr.Pi, size=seq_len)
print ("Root sequence: " + ''.join(tree.root.ref_seq))
print ("Root sequence: " + ''.join(tree.root.ref_seq.astype('U')))
mutation_list = defaultdict(list)
for node in tree.find_clades():
for c in node.clades:
......@@ -110,7 +110,7 @@ def test_seq_joint_reconstruction_correct():
continue
t = node.branch_length
p = mygtr.evolve( seq_utils.seq2prof(node.up.ref_seq, mygtr.profile_map), t)
# normalie profile
# normalize profile
p=(p.T/p.sum(axis=1)).T
# sample mutations randomly
ref_seq_idxs = np.array([int(np.random.choice(np.arange(p.shape[1]), p=p[k])) for k in np.arange(p.shape[0])])
......@@ -127,25 +127,23 @@ def test_seq_joint_reconstruction_correct():
alnstr = ""
i = 1
for leaf in tree.get_terminals():
alnstr += ">" + leaf.name + "\n" + ''.join(leaf.ref_seq) + '\n'
alnstr += ">" + leaf.name + "\n" + ''.join(leaf.ref_seq.astype('U')) + '\n'
i += 1
print (alnstr)
myTree.aln = AlignIO.read(StringIO(alnstr), 'fasta')
myTree._attach_sequences_to_nodes()
# reconstruct ancestral sequences:
myTree._ml_anc_joint(debug=True)
myTree.infer_ancestral_sequences(final=True, debug=True, reconstruct_leaves=True)
diff_count = 0
mut_count = 0
for node in myTree.tree.find_clades():
if node.up is not None:
mut_count += len(node.ref_mutations)
diff_count += np.sum(node.sequence != node.ref_seq)==0
diff_count += np.sum(node.sequence != node.ref_seq)
if np.sum(node.sequence != node.ref_seq):
print("%s: True sequence does not equal inferred sequence. parent %s"%(node.name, node.up.name))
else:
print("%s: True sequence equals inferred sequence. parent %s"%(node.name, node.up.name))
print (node.name, np.sum(node.sequence != node.ref_seq), np.where(node.sequence != node.ref_seq), len(node.mutations), node.mutations)
# the assignment of mutations to the root node is probabilistic. Hence some differences are expected
assert diff_count/seq_len<2*(1.0*mut_count/seq_len)**2
......
from __future__ import print_function, division, absolute_import
version="0.6.2"
version="0.7.0"
class TreeTimeError(Exception):
"""TreeTimeError class"""
pass
class MissingDataError(TreeTimeError):
"""MissingDataError class raised when tree or alignment are missing"""
pass
class UnknownMethodError(TreeTimeError):
"""MissingDataError class raised when tree or alignment are missing"""
pass
class NotReadyError(TreeTimeError):
"""NotReadyError class raised when results are requested before inference"""
pass
from .treeanc import TreeAnc
from .treetime import TreeTime, plot_vs_years
from .clock_tree import ClockTree
from .treetime import ttconf as treetime_conf
from .gtr import GTR
from .gtr_site_specific import GTR_site_specific
from .merger_models import Coalescent
from .treeregression import TreeRegression
from .argument_parser import make_parser
......@@ -2,7 +2,7 @@
from __future__ import print_function, division, absolute_import
import sys, argparse, os
from treetime.wrappers import ancestral_reconstruction, mugration, scan_homoplasies, timetree, estimate_clock_model
import treetime
from treetime import version
py2 = sys.version_info.major==2
......@@ -148,6 +148,7 @@ def add_gtr_arguments(parser):
def add_anc_arguments(parser):
parser.add_argument('--keep-overhangs', default = False, action='store_true', help='do not fill terminal gaps')
parser.add_argument('--zero-based', default = False, action='store_true', help='zero based mutation indexing')
parser.add_argument('--reconstruct-tip-states', default = False, action='store_true', help='overwrite ambiguous states on tips with the most likely inferred state')
parser.add_argument('--report-ambiguous', default=False, action="store_true", help='include transitions involving ambiguous states')
......@@ -169,6 +170,8 @@ def make_parser():
t_parser.add_argument('--tree', type=str, help=tree_description)
add_seq_len_aln_group(t_parser)
t_parser.add_argument('--dates', type=str, help=dates_description)
t_parser.add_argument('--name-column', type=str, help="label of the column to be used as taxon name")
t_parser.add_argument('--date-column', type=str, help="label of the column to be used as sampling date")
add_reroot_group(t_parser)
add_gtr_arguments(t_parser)
t_parser.add_argument('--clock-rate', type=float, help="if specified, the rate of the molecular clock won't be optimized.")
......@@ -189,6 +192,8 @@ def make_parser():
help='maximal number of iterations the inference cycle is run. Note that for polytomy resolution and coalescence models max_iter should be at least 2')
t_parser.add_argument('--coalescent', default="0.0", type=str,
help=coalescent_description)
t_parser.add_argument('--n-skyline', default="20", type=int,
help="number of grid points in skyline coalescent model")
t_parser.add_argument('--plot-tree', default="timetree.pdf",
help = "filename to save the plot to. Suffix will determine format"
" (choices pdf, png, svg, default=pdf)")
......@@ -201,6 +206,7 @@ def make_parser():
help = "don't show tip labels (default for small trees with >=30 leaves)")
add_anc_arguments(t_parser)
add_common_args(t_parser)
t_parser.add_argument("--version", action="version", version="%(prog)s " + version)
def toplevel(params):
if (params.aln or params.tree) and params.dates:
......@@ -239,6 +245,7 @@ def make_parser():
## MUGRATION
m_parser = subparsers.add_parser('mugration', description=mugration_description)
m_parser.add_argument('--tree', required = True, type=str, help=tree_description)
m_parser.add_argument('--name-column', type=str, help="label of the column to be used as taxon name")
m_parser.add_argument('--attribute', type=str, help ="attribute to reconstruct, e.g. country")
m_parser.add_argument('--states', required = True, type=str, help ="csv or tsv file with discrete characters."
"\n#name,country,continent\ntaxon1,micronesia,oceania\n...")
......@@ -265,6 +272,8 @@ def make_parser():
"signal and recalculate branch length unless run with --keep_root.")
c_parser.add_argument('--tree', required=True, type=str, help=tree_description)
c_parser.add_argument('--dates', required=True, type=str, help=dates_description)
c_parser.add_argument('--date-column', type=str, help="label of the column to be used as sampling date")
c_parser.add_argument('--name-column', type=str, help="label of the column to be used as taxon name")
add_seq_len_aln_group(c_parser)
add_reroot_group(c_parser)
......@@ -279,7 +288,7 @@ def make_parser():
# make a version subcommand
v_parser = subparsers.add_parser('version', description='print version')
v_parser.set_defaults(func=lambda x: print(treetime.version))
v_parser.set_defaults(func=lambda x: print("treetime "+version))
## call the relevant function and return
if py2:
......
......@@ -88,19 +88,11 @@ class BranchLenInterpolator (Distribution):
elif branch_length_mode=='joint':
if not hasattr(node, 'compressed_sequence'):
#FIXME: this assumes node.sequence is set, but this might not be the case if
# ancestral reconstruction is run with final=False
if hasattr(node, 'sequence'):
seq_pairs, multiplicity = self.gtr.compress_sequence_pair(node.up.sequence,
node.sequence,
ignore_gaps=ignore_gaps)
node.compressed_sequence = {'pair':seq_pairs, 'multiplicity':multiplicity}
else:
raise Exception("uncompressed sequence needs to be assigned to nodes")
if not hasattr(node, 'branch_state'):
raise Exception("branch state pairs need to be assigned to nodes")
log_prob = np.array([-self.gtr.prob_t_compressed(node.compressed_sequence['pair'],
node.compressed_sequence['multiplicity'],
log_prob = np.array([-self.gtr.prob_t_compressed(node.branch_state['pair'],
node.branch_state['multiplicity'],
k,
return_log=True)
for k in grid])
......
from __future__ import print_function, division, absolute_import
import numpy as np
from treetime import config as ttconf
from treetime import MissingDataError
from .treeanc import TreeAnc
from .utils import numeric_date, DateConversion
from .utils import numeric_date, DateConversion, datestring_from_numeric
from .distribution import Distribution
from .branch_len_interpolator import BranchLenInterpolator
from .node_interpolator import NodeInterpolator
......@@ -79,8 +80,7 @@ class ClockTree(TreeAnc):
self.clock_model=None
self.use_covariation=use_covariation # if false, covariation will be ignored in rate estimates.
self._set_precision(precision)
if self._assign_dates()==ttconf.ERROR:
raise ValueError("ClockTree requires date constraints!")
self._assign_dates()
def _assign_dates(self):
......@@ -92,8 +92,7 @@ class ClockTree(TreeAnc):
success/error code
"""
if self.tree is None:
self.logger("ClockTree._assign_dates: tree is not set, can't assign dates", 0)
return ttconf.ERROR
raise MissingDataError("ClockTree._assign_dates: tree is not set, can't assign dates")
bad_branch_counter = 0
for node in self.tree.find_clades(order='postorder'):
......@@ -128,9 +127,9 @@ class ClockTree(TreeAnc):
bad_branch_counter += 1
if bad_branch_counter>self.tree.count_terminals()-3:
self.logger("ERROR: ALMOST NO VALID DATE CONSTRAINTS, EXITING", 1, warn=True)
return ttconf.ERROR
raise MissingDataError("ERROR: ALMOST NO VALID DATE CONSTRAINTS")
self.logger("ClockTree._assign_dates: assigned date contraints to {} out of {} tips.".format(self.tree.count_terminals()-bad_branch_counter, self.tree.count_terminals()), 1)
return ttconf.SUCCESS
......@@ -149,7 +148,7 @@ class ClockTree(TreeAnc):
self.precision=precision
if self.one_mutation and self.one_mutation<1e-4 and precision<2:
self.logger("ClockTree._set_precision: FOR LONG SEQUENCES (>1e4) precision>=2 IS RECOMMENDED."
" \n\t **** precision %d was specified by the user"%precision, level=0)
" precision %d was specified by the user"%precision, level=0)
else:
# otherwise adjust it depending on the minimal sensible branch length
if self.one_mutation:
......@@ -263,7 +262,7 @@ class ClockTree(TreeAnc):
"""
self.logger("ClockTree.init_date_constraints...",2)
self.tree.coalescent_joint_LH = 0
if self.aln and (ancestral_inference or (not hasattr(self.tree.root, 'sequence'))):
if self.aln and (not self.sequence_reconstruction):
self.infer_ancestral_sequences('probabilistic', marginal=self.branch_length_mode=='marginal',
sample_from_profile='root',**kwarks)
......@@ -286,9 +285,11 @@ class ClockTree(TreeAnc):
if self.branch_length_mode=='marginal':
node.profile_pair = self.marginal_branch_profile(node)
elif self.branch_length_mode=='joint' and (not hasattr(node, 'branch_state')):
self.add_branch_state(node)
node.branch_length_interpolator = BranchLenInterpolator(node, self.gtr,
pattern_multiplicity = self.multiplicity, min_width=self.min_width,
pattern_multiplicity = self.data.multiplicity, min_width=self.min_width,
one_mutation=self.one_mutation, branch_length_mode=self.branch_length_mode)
node.branch_length_interpolator.merger_cost = merger_cost
......@@ -312,8 +313,8 @@ class ClockTree(TreeAnc):
if hasattr(node, 'bad_branch') and node.bad_branch is True:
self.logger("ClockTree.init_date_constraints -- WARNING: Branch is marked as bad"
", excluding it from the optimization process.\n"
"\t\tDate constraint will be ignored!", 4, warn=True)
", excluding it from the optimization process."
" Date constraint will be ignored!", 4, warn=True)
else: # node without sampling date set
node.raw_date_constraint = None
node.date_constraint = None
......@@ -438,13 +439,14 @@ class ClockTree(TreeAnc):
if node.joint_pos_Cx is None: # no constraints or branch is bad - reconstruct from the branch len interpolator
node.branch_length = node.branch_length_interpolator.peak_pos
elif node.date_constraint is not None and node.date_constraint.is_delta:
node.branch_length = node.up.time_before_present - node.date_constraint.peak_pos
elif isinstance(node.joint_pos_Cx, Distribution):
# NOTE the Lx distribution is the likelihood, given the position of the parent
# (Lx.x = parent position, Lx.y = LH of the node_pos given Lx.x,
# the length of the branch corresponding to the most likely
# subtree is node.Cx(node.time_before_present))
subtree_LH = node.joint_pos_Lx(node.up.time_before_present)
# subtree_LH = node.joint_pos_Lx(node.up.time_before_present)
node.branch_length = node.joint_pos_Cx(max(node.joint_pos_Cx.xmin,
node.up.time_before_present)+ttconf.TINY_NUMBER)
......@@ -475,7 +477,7 @@ class ClockTree(TreeAnc):
# add the root sequence LH and return
if self.aln:
LH += self.gtr.sequence_logLH(self.tree.root.cseq, pattern_multiplicity=self.multiplicity)
LH += self.gtr.sequence_logLH(self.tree.root.cseq, pattern_multiplicity=self.data.multiplicity)
return LH
......@@ -525,7 +527,7 @@ class ClockTree(TreeAnc):
# no information
node.marginal_pos_Lx = None
else: # all other nodes
if node.date_constraint is not None and node.date_constraint.is_delta: # there is a time constraint
if node.date_constraint is not None and node.date_constraint.is_delta: # there is a hard time constraint
# initialize the Lx for nodes with precise date constraint:
# subtree probability given the position of the parent node
# position of the parent node is given by the branch length
......@@ -575,6 +577,8 @@ class ClockTree(TreeAnc):
if node.up is None:
node.msg_from_parent = None # nothing beyond the root
# all other cases (All internal nodes + unconstrained terminals)
elif node.date_constraint is not None and node.date_constraint.is_delta:
node.marginal_pos_LH = node.date_constraint
else:
parent = node.up
# messages from the complementary subtree (iterate over all sister nodes)
......@@ -584,8 +588,6 @@ class ClockTree(TreeAnc):
# if parent itself got smth from the root node, include it
if parent.msg_from_parent is not None:
complementary_msgs.append(parent.msg_from_parent)
elif parent.marginal_pos_Lx is not None:
complementary_msgs.append(parent.marginal_pos_LH)
if len(complementary_msgs):
msg_parent_to_node = NodeInterpolator.multiply(complementary_msgs)
......@@ -677,17 +679,7 @@ class ClockTree(TreeAnc):
"later than present day",4 , warn=True)
node.numdate = now - years_bp
# set the human-readable date
year = np.floor(node.numdate)
days = max(0,365.25 * (node.numdate - year)-1)
try: # datetime will only operate on dates after 1900
n_date = datetime(year, 1, 1) + timedelta(days=days)
node.date = datetime.strftime(n_date, "%Y-%m-%d")
except:
# this is the approximation not accounting for gap years etc
n_date = datetime(1900, 1, 1) + timedelta(days=days)
node.date = "%04d-%02d-%02d"%(year, n_date.month, n_date.day)
node.date = datestring_from_numeric(node.numdate)
def branch_length_to_years(self):
......@@ -722,8 +714,8 @@ class ClockTree(TreeAnc):
params = params or {}
if rate_std is None:
if not (self.clock_model['valid_confidence'] and 'cov' in self.clock_model):
self.logger("ClockTree.calc_rate_susceptibility: need valid standard deviation of the clock rate to estimate dating error.", 1, warn=True)
return ttconf.ERROR
raise ValueError("ClockTree.calc_rate_susceptibility: need valid standard deviation of the clock rate to estimate dating error.")
rate_std = np.sqrt(self.clock_model['cov'][0,0])
current_rate = np.abs(self.clock_model['slope'])
......
......@@ -30,8 +30,7 @@ class GTR(object):
of observing characters in the alphabet. This is used to
implement ambiguous characters like 'N'=[1,1,1,1] which are
equally likely to be any of the 4 nucleotides. Standard profile_maps
are defined in file seq_utils.py. If None is provided, no ambigous
characters are supported.
are defined in file seq_utils.py.
logger : callable
Custom logging function that should take arguments (msg, level, warn=False),
......@@ -39,6 +38,7 @@ class GTR(object):
"""
self.debug=False
self.is_site_specific=False
if isinstance(alphabet, str):
if alphabet not in alphabet_synonyms:
raise AttributeError("Unknown alphabet type specified")
......@@ -48,13 +48,14 @@ class GTR(object):
self.profile_map = profile_maps[tmp_alphabet]
else:
# not a predefined alphabet
self.alphabet = alphabet
self.alphabet = np.array(alphabet)
if prof_map is None: # generate trivial unambiguous profile map is none is given
self.profile_map = {s:x for s,x in zip(self.alphabet, np.eye(len(self.alphabet)))}
else:
self.profile_map = prof_map
self.profile_map = {x if type(x) is str else x:k for x,k in prof_map.items()}
self.state_index={s:si for si,s in enumerate(self.alphabet)}
self.state_index.update({s:si for si,s in enumerate(self.alphabet)})
if logger is None:
def logger_default(*args,**kwargs):
"""standard logging function if none provided"""
......@@ -69,13 +70,6 @@ class GTR(object):
self.n_states = len(self.alphabet)
self.assign_gap_and_ambiguous()
# NEEDED TO BREAK RATE MATRIX DEGENERACY AND FORCE NP TO RETURN REAL ORTHONORMAL EIGENVECTORS
# ugly hack, but works and shouldn't affect results
tmp_rng_state = np.random.get_state()
np.random.seed(12345)
self.break_degen = np.random.random(size=(self.n_states, self.n_states))*1e-6
np.random.set_state(tmp_rng_state)
# init all matrices with dummy values
self.logger("GTR: init with dummy values!", 3)
self.v = None # right eigenvectors
......@@ -86,7 +80,7 @@ class GTR(object):
def assign_gap_and_ambiguous(self):
n_states = len(self.alphabet)
self.logger("GTR: with alphabet: "+str(self.alphabet),1)
self.logger("GTR: with alphabet: "+str([x for x in self.alphabet]),1)
# determine if a character exists that corresponds to no info, i.e. all one profile
if any([x.sum()==n_states for x in self.profile_map.values()]):
amb_states = [c for c,x in self.profile_map.items() if x.sum()==n_states]
......@@ -97,7 +91,7 @@ class GTR(object):
# check for a gap symbol
try:
self.gap_index = list(self.alphabet).index('-')
self.gap_index = self.state_index['-']
except:
self.logger("GTR: no gap symbol!", 4, warn=True)
self.gap_index=None
......@@ -134,7 +128,10 @@ class GTR(object):
and the equilibrium frequencies to obtain the rate matrix
of the GTR model
"""
return (self.W*self.Pi).T
Q_tmp = (self.W*self.Pi).T
Q_diag = -np.sum(Q_tmp, axis=0)
np.fill_diagonal(Q_tmp, Q_diag)
return Q_tmp
######################################################################
......@@ -155,18 +152,18 @@ class GTR(object):
if not multi_site:
eq_freq_str += "\nEquilibrium frequencies (pi_i):\n"
for a,p in zip(self.alphabet, self.Pi):
eq_freq_str+=' '+str(a)+': '+str(np.round(p,4))+'\n'
eq_freq_str+=' '+a+': '+str(np.round(p,4))+'\n'
W_str = "\nSymmetrized rates from j->i (W_ij):\n"
W_str+='\t'+'\t'.join(map(str, self.alphabet))+'\n'
W_str+='\t'+'\t'.join(self.alphabet)+'\n'
for a,Wi in zip(self.alphabet, self.W):
W_str+= ' '+str(a)+'\t'+'\t'.join([str(np.round(max(0,p),4)) for p in Wi])+'\n'
W_str+= ' '+a+'\t'+'\t'.join([str(np.round(max(0,p),4)) for p in Wi])+'\n'
if not multi_site:
Q_str = "\nActual rates from j->i (Q_ij):\n"
Q_str+='\t'+'\t'.join(map(str, self.alphabet))+'\n'
Q_str+='\t'+'\t'.join(self.alphabet)+'\n'
for a,Qi in zip(self.alphabet, self.Q):
Q_str+= ' '+str(a)+'\t'+'\t'.join([str(np.round(max(0,p),4)) for p in Qi])+'\n'
Q_str+= ' '+a+'\t'+'\t'.join([str(np.round(max(0,p),4)) for p in Qi])+'\n'
return eq_freq_str + W_str + Q_str
......@@ -190,6 +187,7 @@ class GTR(object):
"""
n = len(self.alphabet)
self._mu = mu
self.is_site_specific=False
if pi is not None and len(pi)==n:
Pi = np.array(pi)
......@@ -213,7 +211,11 @@ class GTR(object):
W=np.array(W)
self._W = 0.5*(W+W.T)
self._check_fix_Q(fixed_mu=True)
np.fill_diagonal(W,0)
average_rate = W.dot(self.Pi).dot(self.Pi)
self._W = W/average_rate
self._mu *=average_rate
self._eig()
......@@ -508,8 +510,8 @@ class GTR(object):
if gtr.gap_index is not None:
if pi[gtr.gap_index]<gap_limit:
gtr.logger('The model allows for gaps which are estimated to occur at a low fraction of %1.3e'%pi[gtr.gap_index]+
'\n\t\tthis can potentially result in artificats.'+
'\n\t\tgap fraction will be set to %1.4f'%gap_limit,2,warn=True)
' this can potentially result in artificats.'+
' gap fraction will be set to %1.4f'%gap_limit,2,warn=True)
pi[gtr.gap_index] = gap_limit
pi /= pi.sum()
......@@ -519,39 +521,13 @@ class GTR(object):
########################################################################
### prepare model
########################################################################
def _check_fix_Q(self, fixed_mu=False):
"""
Check the main diagonal of Q and fix it in case it does not corresond
the definition of the rate matrix. Should be run every time when creating
custom GTR model.
"""
# NEEDED TO BREAK RATE MATRIX DEGENERACY AND FORCE NP TO RETURN REAL ORTHONORMAL EIGENVECTORS
self._W += self.break_degen + self.break_degen.T
# fix W
np.fill_diagonal(self.W, 0)
Wdiag = -(self.Q).sum(axis=0)/self.Pi
np.fill_diagonal(self.W, Wdiag)
scale_factor = -np.sum(np.diagonal(self.Q)*self.Pi)
self._W /= scale_factor
if not fixed_mu:
self._mu *= scale_factor
if (self.Q.sum(axis=0) < 1e-10).sum() < self.alphabet.shape[0]: # fix failed
print ("Cannot fix the diagonal of the GTR rate matrix. Should be all zero", self.Q.sum(axis=0))
import ipdb; ipdb.set_trace()
raise ArithmeticError("Cannot fix the diagonal of the GTR rate matrix.")
def _eig(self):
"""
Perform eigendecompositon of the rate matrix and stores the left- and right-
matrices to convert the sequence profiles to the GTR matrix eigenspace
and hence to speed-up the computations.
"""
W_nodiag = np.copy(self.W)
np.fill_diagonal(W_nodiag, 0)
self.eigenvals, self.v, self.v_inv = self._eig_single_site(W_nodiag, self.Pi)
self.eigenvals, self.v, self.v_inv = self._eig_single_site(self.W, self.Pi)
def _eig_single_site(self, W, p):
......@@ -574,7 +550,7 @@ class GTR(object):
return eigvals, tmp_v.T/one_norm, (eigvecs*one_norm).T/tmpp
def compress_sequence_pair(self, seq_p, seq_ch, pattern_multiplicity=None,
def state_pair(self, seq_p, seq_ch, pattern_multiplicity=None,
ignore_gaps=False):
'''
Make a compressed representation of a pair of sequences, only counting
......@@ -615,7 +591,7 @@ class GTR(object):
from collections import Counter
if seq_ch.shape != seq_p.shape:
raise ValueError("GTR.compress_sequence_pair: Sequence lengths do not match!")
raise ValueError("GTR.state_pair: Sequence lengths do not match!")
if len(self.alphabet)<10: # for small alphabet, repeatedly check array for all state pairs
pair_count = []
......@@ -724,7 +700,7 @@ class GTR(object):
Resulting probability
"""
seq_pair, multiplicity = self.compress_sequence_pair(seq_p, seq_ch,
seq_pair, multiplicity = self.state_pair(seq_p, seq_ch,
pattern_multiplicity=pattern_multiplicity, ignore_gaps=ignore_gaps)
return self.prob_t_compressed(seq_pair, multiplicity, t, return_log=return_log)
......@@ -752,7 +728,7 @@ class GTR(object):
If True, ignore gaps in distance calculations
'''
seq_pair, multiplicity = self.compress_sequence_pair(seq_p, seq_ch,
seq_pair, multiplicity = self.state_pair(seq_p, seq_ch,
pattern_multiplicity = pattern_multiplicity,
ignore_gaps=ignore_gaps)
return self.optimal_t_compressed(seq_pair, multiplicity)
......@@ -760,12 +736,13 @@ class GTR(object):
def optimal_t_compressed(self, seq_pair, multiplicity, profiles=False, tol=1e-10):
"""
Find the optimal distance between the two sequences, for compressed sequences
Find the optimal distance between the two sequences represented as state_pairs
or as pair of profiles
Parameters
----------
seq_pair : compressed_sequence_pair
seq_pair : state_pair, tuple
Compressed representation of sequences along a branch, either
as tuple of state pairs or as tuple of profiles.
......@@ -779,7 +756,7 @@ class GTR(object):
either end of the branch. With profiles==True, optimization is performed
while summing over all possible states of the nodes at either end of the
branch. Note that the meaning/format of seq_pair and multiplicity
depend on the value of profiles.
depend on the value of :profiles:.
"""
......@@ -1032,14 +1009,6 @@ class GTR(object):
char_dist=self.Pi,
flow_matrix=self.W)
def save_to_json(self, zip):
d = {
"full_gtr": self.mu * np.dot(self.Pi, self.W),
"Substitution rate" : self.mu,
"Equilibrium character composition": self.Pi,
"Flow rate matrix": self.W
}
if __name__ == "__main__":
pass
......@@ -25,6 +25,7 @@ class GTR_site_specific(GTR):
self.seq_len=seq_len
self.approximate = approximate
super(GTR_site_specific, self).__init__(**kwargs)
self.is_site_specific=True
@property
......@@ -57,25 +58,34 @@ class GTR_site_specific(GTR):
Equilibrium frequencies
"""
if not np.isscalar(mu) and pi is not None and len(pi.shape)==2:
if mu.shape[0]!=pi.shape[1]:
raise ValueError("GTR_site_specific: length of rate vector (got {}) and equilibrium frequency vector (got {}) must match!".format(mu.shape[0], pi.shape[1]))
n = len(self.alphabet)
if np.isscalar(mu):
self._mu = mu*np.ones(self.seq_len)
else:
self._mu = np.copy(mu)
self.seq_len = mu.shape[0]
if pi is not None and pi.shape[0]==n:
self.seq_len = pi.shape[-1]
if pi is not None and pi.shape[0]==n and len(pi.shape)==2:
self.seq_len = pi.shape[1]
Pi = np.copy(pi)
else:
if pi is not None and len(pi)!=n:
raise ArgumentError("GTR_site_specific: length of equilibrium frequency vector does not match alphabet length.")
if pi is not None:
if len(pi)==n:
Pi = np.repeat([pi], self.seq_len, axis=0).T
else:
raise ValueError("GTR_site_specific: length of equilibrium frequency vector (got {}) does not match alphabet length {}".format(len(pi), n))
else:
Pi = np.ones(shape=(n,self.seq_len))
self._Pi = Pi/np.sum(Pi, axis=0)
if W is None or W.shape!=(n,n):
if (W is not None) and W.shape!=(n,n):
raise ArgumentError("GTR_site_specific: Size of substitution matrix does not match alphabet length.")
raise ValueError("GTR_site_specific: Size of substitution matrix (got {}) does not match alphabet length {}".format(W.shape, n))
W = np.ones((n,n))
np.fill_diagonal(W, 0.0)
np.fill_diagonal(W, - W.sum(axis=0))
......@@ -83,11 +93,13 @@ class GTR_site_specific(GTR):
W=0.5*(np.copy(W)+np.copy(W).T)
np.fill_diagonal(W,0)
avg_pi = self.Pi.mean(axis=-1)
average_rate = W.dot(avg_pi).dot(avg_pi)
average_rate = np.einsum('ia,ij,ja',self.Pi, W, self.Pi)/self.seq_len
# average_rate = W.dot(avg_pi).dot(avg_pi)
self._W = W/average_rate
self._mu *=average_rate
self.is_site_specific=True
self._eig()
self._make_expQt_interpolator()
......@@ -124,6 +136,7 @@ class GTR_site_specific(GTR):
gtr = cls(alphabet=alphabet, seq_len=L)
n = gtr.alphabet.shape[0]
# Dirichlet distribution == l_1 normalized vector of samples of the Gamma distribution
if pi_dirichlet_alpha:
pi = 1.0*gamma.rvs(pi_dirichlet_alpha, size=(n,L))
else:
......@@ -143,7 +156,7 @@ class GTR_site_specific(GTR):
mu = np.ones(L)
gtr.assign_rates(mu=mu, pi=pi, W=W)
gtr.mu *= avg_mu/np.mean(gtr.mu)
gtr.mu *= avg_mu/np.mean(gtr.average_rate())
return gtr
......@@ -166,7 +179,7 @@ class GTR_site_specific(GTR):
Equilibrium frequencies
**kwargs:
Key word arguments to be passed
Key word arguments to be passed to the constructor
Keyword Args
------------
......@@ -257,7 +270,7 @@ class GTR_site_specific(GTR):
W_ij = (n_ij + n_ij.T + pc)/(S_ij + S_ij.T + pc)
avg_pi = p_ia.mean(axis=-1)
average_rate = W_ij.dot(avg_pi).dot(avg_pi)
average_rate = W_ij.dot(avg_pi).dot(avg_pi) # crude approx, will be fixed in assign rates
W_ij = W_ij/average_rate
mu_a *=average_rate
......@@ -276,7 +289,7 @@ class GTR_site_specific(GTR):
if p_ia[gtr.gap_index,p]<gap_limit:
gtr.logger('The model allows for gaps which are estimated to occur at a low fraction of %1.3e'%p_ia[gtr.gap_index,p]+
'\n\t\tthis can potentially result in artifacts.'+
'\n\t\tgap fraction will be set to %1.4f'%gap_limit,2,warn=True)
'\n\t\tgap fraction will be set to %1.4f'%gap_limit,4,warn=True)
p_ia[gtr.gap_index,p] = gap_limit
p_ia[:,p] /= p_ia[:,p].sum()
......@@ -456,7 +469,7 @@ class GTR_site_specific(GTR):
logQt[np.isnan(logQt) | np.isinf(logQt) | bad_indices] = -ttconf.BIG_NUMBER
seq_indices_c = np.zeros(len(seq_ch), dtype=int)
seq_indices_p = np.zeros(len(seq_p), dtype=int)
for ai, a in self.alphabet:
for ai, a in enumerate(self.alphabet):
seq_indices_p[seq_p==a] = ai
seq_indices_c[seq_ch==a] = ai
......
......@@ -164,7 +164,7 @@ class Coalescent(object):
if "success" in sol and sol["success"]:
self.set_Tc(sol['x'])
else:
self.logger("merger_models:optimze_Tc: optimization of coalescent time scale failed: " + str(sol), 0, warn=True)
self.logger("merger_models:optimize_Tc: optimization of coalescent time scale failed: " + str(sol), 0, warn=True)
self.set_Tc(initial_Tc.y, T=initial_Tc.x)
......@@ -190,8 +190,8 @@ class Coalescent(object):
# cap log Tc to avoid under or overflow and nan in logs
self.set_Tc(np.exp(np.maximum(-200,np.minimum(100,logTc))), tvals)
neglogLH = -self.total_LH() + stiffness*np.sum(np.diff(logTc)**2) \
+ np.sum((logTc>0)*logTc*regularization)\
- np.sum((logTc<-100)*logTc*regularization)
+ np.sum((logTc>0)*logTc)*regularization\
- np.sum((logTc<-100)*logTc)*regularization
return neglogLH
sol = minimize(cost, np.ones_like(tvals)*np.log(self.Tc.y.mean()), method=method, tol=tol)
......@@ -209,7 +209,7 @@ class Coalescent(object):
dcost = np.array(dcost)
optimal_cost = cost(opt_logTc)
self.confidence = -dlogTc/(2*optimal_cost - dcost[:,0] - dcost[:,1])
self.confidence = dlogTc/np.sqrt(np.abs(2*optimal_cost - dcost[:,0] - dcost[:,1]))
self.logger("Coalescent:optimize_skyline:...done. new LH: %f"%self.total_LH(),2)
else:
self.set_Tc(initial_Tc.y, T=initial_Tc.x)
......