Skip to content
Commits on Source (3)
......@@ -18,7 +18,7 @@ In addition to scripting TreeTime or using it via the command-line, there is als
![Molecular clock phylogeny of 200 NA sequences of influenza A H3N2](https://raw.githubusercontent.com/neherlab/treetime_examples/master/figures/tree_and_clock.png)
Have a look at our [examples and tutorials](https://github.com/neherlab/treetime_examples).
Have a look at our repository with [example data](https://github.com/neherlab/treetime_examples) and the [tutorials](https://treetime.readthedocs.io/en/latest/tutorials.html).
#### Features
* ancestral sequence reconstruction (marginal and joint maximum likelihood)
......@@ -83,7 +83,7 @@ The to infer a timetree, i.e. a phylogenetic tree in which branch length reflect
treetime --aln <input.fasta> --tree <input.nwk> --dates <dates.csv>
```
This command will infer a time tree, ancestral sequences, a GTR model, and optionally confidence intervals and coalescent models.
A detailed explanation is of this command with its various options and examples are available at [treetime_examples/timetree.md](http://github.com/neherlab/treetime_examples/blob/master/timetree.md)
A detailed explanation is of this command with its various options and examples is available in [the documentation on readthedocs.org](https://treetime.readthedocs.io/en/latest/tutorials/timetree.html).
#### Rerooting and substitution rate estimation
......@@ -93,7 +93,7 @@ To explore the temporal signal in the data and estimate the substitution rate (i
```
The full list if options is available by typing `treetime clock -h`.
Instead of an input alignment, `--sequence-length <L>` can be provided.
Documentation of additional options and examples are available at [treetime_examples/clock.md](https://github.com/neherlab/treetime_examples/blob/master/clock.md)
Documentation of additional options and examples are available at in [the documentation on readthedocs.org](https://treetime.readthedocs.io/en/latest/tutorials/clock.html).
#### Ancestral sequence reconstruction:
......@@ -103,7 +103,7 @@ The subcommand
```
will reconstruct ancestral sequences at internal nodes of the input tree.
The full list if options is available by typing `treetime ancestral -h`.
A detailed explanation of `treetime ancestral` with examples is available at [treetime_examples/ancestral.md](https://github.com/neherlab/treetime_examples/blob/master/ancestral.md)
A detailed explanation of `treetime ancestral` with examples is available at in [the documentation on readthedocs.org](https://treetime.readthedocs.io/en/latest/tutorials/ancestral.html).
#### Homoplasy analysis
Detecting and quantifying homoplasies or recurrent mutations is useful to check for recombination, putative adaptive sites, or contamination.
......@@ -112,7 +112,7 @@ TreeTime provides a simple command to summarize homoplasies in data
treetime homoplasy --aln <input.fasta> --tree <input.nwk>
```
The full list if options is available by typing `treetime homoplasy -h`.
Please see [treetime_examples/homoplasy.md](https://github.com/neherlab/treetime_examples/blob/master/homoplasy.md) for examples and more documentation.
Please see [the documentation on readthedocs.org](https://treetime.readthedocs.io/en/latest/tutorials/homoplasy.html) for examples and more documentation.
#### Mugration analysis
Migration between discrete geographic regions, host switching, or other transition between discrete states are often parameterized by time-reversible models analogous to models describing evolution of genome sequences.
......@@ -123,7 +123,7 @@ TreeTime GTR model machinery can be used to infer mugration models:
```
where `<field>` is the relevant column in the csv file specifying the metadata `states.csv`, e.g. `<field>=country`.
The full list if options is available by typing `treetime mugration -h`.
Please see [treetime_examples/mugration.md](https://github.com/neherlab/treetime_examples/blob/master/mugration.md) for examples and more documentation.
Please see [the documentation on readthedocs.org](https://treetime.readthedocs.io/en/latest/tutorials/mugration.html) for examples and more documentation.
#### Metadata and date format
Several of TreeTime commands require the user to specify a file with dates and/or other meta data.
......@@ -177,12 +177,6 @@ The API documentation for the TreeTime package is generated created with Sphinx.
pip install Sphinx
```
- basicstrap Html theme for sphinx:
```bash
pip install sphinxjp.themes.basicstrap
```
After required packages are installed, navigate to doc directory, and build the docs by typing:
```bash
......
import datetime
from treetime.treeanc import TreeAnc
from treetime.clock_tree import ClockTree
from treetime.treetime import TreeTime
from treetime.treetime import ttconf as treetime_conf
from treetime.gtr import GTR
from treetime.treetime import plot_vs_years
from treetime.treetime import treetime_to_newick
from treetime.tree_regression import TreeRegression
from treetime.merger_models import Coalescent
import treetime.seq_utils as seq_utils
from treetime.utils import numeric_date
# 0.7.0 -- restructuring
## Major changes
This release largely includes changes under the hood, some of which also affect how treetime behaves. The biggest changes are
* sequence data handling is now done by a separate class `SequenceData`. There is now a clear distinction between input data that is never changed and inferred sequences. This class also provides consolidated set of functions to convert sparse, compressed, and full sequence representations into each other.
* sequences are now unicode when running from python3. This does not seem to come with a measurable performance hit compared to byte sequences as long as all characters are ASCII. Moving away from bytes to unicode proved much less hassle than converting sequences back and forth from unicode to bytes during IO.
* Ancestral state reconstruction no longer reconstructs the state of terminal nodes by default and sequence accessors and output will return the input data by default. Reconstruction is optional.
* The command-line mugration model inference now optimize the overall rate numerically and is hence no longer making a short-branch length assumption.
* TreeTime raises now a number of custom errors rather than returning success or error codes. This should result in fewer "silent errors" that cause problems downstream.
## Minor new features
In addition, we implemented a number of other changes to the interface
* `treetime`, `treetime clock` now accept the arguments `--name-column` and `-date-column` to explicitly specify the metadata columns to be used as name or date
* `treetime mugration` accepts a `--name-column` argument.
## Bug fixes
* scaling of skyline confidence intervals was wrong. It now reflects the inverse second derivative in log-space
* catch problems after rerooting associated with missing attributes in the newly generated root node.
* make conversion from calendar dates to numeric dates and vice versa compatible and remove approximate handling of leap-years.
* avoid overwriting content of output directory with default names
* don't export inferred dates of tips labeled as `bad_branch`.
\ No newline at end of file
# Contributing to TreeTime
Thank you for your interest in contributing to TreeTime.
We welcome pull-requests that fix bugs or implement new features.
## Bugs
If you come across a bug or unexpected behavior, please file an issue.
## Testing
Upon pushing a commit, travis will run a few simple tests. These use data available in the [neherlab/treetime_examples](https://github.com/neherlab/treetime_examples) repository.
## Coding conventions (loosly adhered to)
* indentation: 4 spaces
* docstrings: numpy style
* variable names: snake_case
......@@ -45,7 +45,7 @@ def test_ancestral():
t = TreeAnc(gtr='Jukes-Cantor', tree=nwk, aln=fasta)
print('ancestral reconstruction' + ("marginal" if marginal else "joint"))
t.reconstruct_anc(method='ml', marginal=marginal)
assert "".join(t.tree.root.sequence) == 'ATGAATCCAAATCAAAAGATAATAACGATTGGCTCTGTTTCTCTCACCATTTCCACAATATGCTTCTTCATGCAAATTGCCATCTTGATAACTACTGTAACATTGCATTTCAAGCAATATGAATTCAACTCCCCCCCAAACAACCAAGTGATGCTGTGTGAACCAACAATAATAGAAAGAAACATAACAGAGATAGTGTATCTGACCAACACCACCATAGAGAAGGAAATATGCCCCAAACCAGCAGAATACAGAAATTGGTCAAAACCGCAATGTGGCATTACAGGATTTGCACCTTTCTCTAAGGACAATTCGATTAGGCTTTCCGCTGGTGGGGACATCTGGGTGACAAGAGAACCTTATGTGTCATGCGATCCTGACAAGTGTTATCAATTTGCCCTTGGACAGGGAACAACACTAAACAACGTGCATTCAAATAACACAGTACGTGATAGGACCCCTTATCGGACTCTATTGATGAATGAGTTGGGTGTTCCTTTTCATCTGGGGACCAAGCAAGTGTGCATAGCATGGTCCAGCTCAAGTTGTCACGATGGAAAAGCATGGCTGCATGTTTGTATAACGGGGGATGATAAAAATGCAACTGCTAGCTTCATTTACAATGGGAGGCTTGTAGATAGTGTTGTTTCATGGTCCAAAGAAATTCTCAGGACCCAGGAGTCAGAATGCGTTTGTATCAATGGAACTTGTACAGTAGTAATGACTGATGGAAGTGCTTCAGGAAAAGCTGATACTAAAATACTATTCATTGAGGAGGGGAAAATCGTTCATACTAGCACATTGTCAGGAAGTGCTCAGCATGTCGAAGAGTGCTCTTGCTATCCTCGATATCCTGGTGTCAGATGTGTCTGCAGAGACAACTGGAAAGGCTCCAATCGGCCCATCGTAGATATAAACATAAAGGATCATAGCATTGTTTCCAGTTATGTGTGTTCAGGACTTGTTGGAGACACACCCAGAAAAAACGACAGCTCCAGCAGTAGCCATTGTTTGGATCCTAACAATGAAGAAGGTGGTCATGGAGTGAAAGGCTGGGCCTTTGATGATGGAAATGACGTGTGGATGGGAAGAACAATCAACGAGACGTCACGCTTAGGGTATGAAACCTTCAAAGTCATTGAAGGCTGGTCCAACCCTAAGTCCAAATTGCAGATAAATAGGCAAGTCATAGTTGACAGAGGTGATAGGTCCGGTTATTCTGGTATTTTCTCTGTTGAAGGCAAAAGCTGCATCAATCGGTGCTTTTATGTGGAGTTGATTAGGGGAAGAAAAGAGGAAACTGAAGTCTTGTGGACCTCAAACAGTATTGTTGTGTTTTGTGGCACCTCAGGTACATATGGAACAGGCTCATGGCCTGATGGGGCGGACCTCAATCTCATGCCTATA'
assert t.data.compressed_to_full_sequence(t.tree.root.cseq, as_string=True) == 'ATGAATCCAAATCAAAAGATAATAACGATTGGCTCTGTTTCTCTCACCATTTCCACAATATGCTTCTTCATGCAAATTGCCATCTTGATAACTACTGTAACATTGCATTTCAAGCAATATGAATTCAACTCCCCCCCAAACAACCAAGTGATGCTGTGTGAACCAACAATAATAGAAAGAAACATAACAGAGATAGTGTATCTGACCAACACCACCATAGAGAAGGAAATATGCCCCAAACCAGCAGAATACAGAAATTGGTCAAAACCGCAATGTGGCATTACAGGATTTGCACCTTTCTCTAAGGACAATTCGATTAGGCTTTCCGCTGGTGGGGACATCTGGGTGACAAGAGAACCTTATGTGTCATGCGATCCTGACAAGTGTTATCAATTTGCCCTTGGACAGGGAACAACACTAAACAACGTGCATTCAAATAACACAGTACGTGATAGGACCCCTTATCGGACTCTATTGATGAATGAGTTGGGTGTTCCTTTTCATCTGGGGACCAAGCAAGTGTGCATAGCATGGTCCAGCTCAAGTTGTCACGATGGAAAAGCATGGCTGCATGTTTGTATAACGGGGGATGATAAAAATGCAACTGCTAGCTTCATTTACAATGGGAGGCTTGTAGATAGTGTTGTTTCATGGTCCAAAGAAATTCTCAGGACCCAGGAGTCAGAATGCGTTTGTATCAATGGAACTTGTACAGTAGTAATGACTGATGGAAGTGCTTCAGGAAAAGCTGATACTAAAATACTATTCATTGAGGAGGGGAAAATCGTTCATACTAGCACATTGTCAGGAAGTGCTCAGCATGTCGAAGAGTGCTCTTGCTATCCTCGATATCCTGGTGTCAGATGTGTCTGCAGAGACAACTGGAAAGGCTCCAATCGGCCCATCGTAGATATAAACATAAAGGATCATAGCATTGTTTCCAGTTATGTGTGTTCAGGACTTGTTGGAGACACACCCAGAAAAAACGACAGCTCCAGCAGTAGCCATTGTTTGGATCCTAACAATGAAGAAGGTGGTCATGGAGTGAAAGGCTGGGCCTTTGATGATGGAAATGACGTGTGGATGGGAAGAACAATCAACGAGACGTCACGCTTAGGGTATGAAACCTTCAAAGTCATTGAAGGCTGGTCCAACCCTAAGTCCAAATTGCAGATAAATAGGCAAGTCATAGTTGACAGAGGTGATAGGTCCGGTTATTCTGGTATTTTCTCTGTTGAAGGCAAAAGCTGCATCAATCGGTGCTTTTATGTGGAGTTGATTAGGGGAAGAAAAGAGGAAACTGAAGTCTTGTGGACCTCAAACAGTATTGTTGTGTTTTGTGGCACCTCAGGTACATATGGAACAGGCTCATGGCCTGATGGGGCGGACCTCAATCTCATGCCTATA'
print('testing LH normalization')
from Bio import Phylo,AlignIO
......@@ -101,7 +101,7 @@ def test_seq_joint_reconstruction_correct():
tree = myTree.tree
seq_len = 400
tree.root.ref_seq = np.random.choice(mygtr.alphabet, p=mygtr.Pi, size=seq_len)
print ("Root sequence: " + ''.join(tree.root.ref_seq))
print ("Root sequence: " + ''.join(tree.root.ref_seq.astype('U')))
mutation_list = defaultdict(list)
for node in tree.find_clades():
for c in node.clades:
......@@ -110,7 +110,7 @@ def test_seq_joint_reconstruction_correct():
continue
t = node.branch_length
p = mygtr.evolve( seq_utils.seq2prof(node.up.ref_seq, mygtr.profile_map), t)
# normalie profile
# normalize profile
p=(p.T/p.sum(axis=1)).T
# sample mutations randomly
ref_seq_idxs = np.array([int(np.random.choice(np.arange(p.shape[1]), p=p[k])) for k in np.arange(p.shape[0])])
......@@ -127,25 +127,23 @@ def test_seq_joint_reconstruction_correct():
alnstr = ""
i = 1
for leaf in tree.get_terminals():
alnstr += ">" + leaf.name + "\n" + ''.join(leaf.ref_seq) + '\n'
alnstr += ">" + leaf.name + "\n" + ''.join(leaf.ref_seq.astype('U')) + '\n'
i += 1
print (alnstr)
myTree.aln = AlignIO.read(StringIO(alnstr), 'fasta')
myTree._attach_sequences_to_nodes()
# reconstruct ancestral sequences:
myTree._ml_anc_joint(debug=True)
myTree.infer_ancestral_sequences(final=True, debug=True, reconstruct_leaves=True)
diff_count = 0
mut_count = 0
for node in myTree.tree.find_clades():
if node.up is not None:
mut_count += len(node.ref_mutations)
diff_count += np.sum(node.sequence != node.ref_seq)==0
diff_count += np.sum(node.sequence != node.ref_seq)
if np.sum(node.sequence != node.ref_seq):
print("%s: True sequence does not equal inferred sequence. parent %s"%(node.name, node.up.name))
else:
print("%s: True sequence equals inferred sequence. parent %s"%(node.name, node.up.name))
print (node.name, np.sum(node.sequence != node.ref_seq), np.where(node.sequence != node.ref_seq), len(node.mutations), node.mutations)
# the assignment of mutations to the root node is probabilistic. Hence some differences are expected
assert diff_count/seq_len<2*(1.0*mut_count/seq_len)**2
......
from __future__ import print_function, division, absolute_import
version="0.6.2"
version="0.7.0"
class TreeTimeError(Exception):
"""TreeTimeError class"""
pass
class MissingDataError(TreeTimeError):
"""MissingDataError class raised when tree or alignment are missing"""
pass
class UnknownMethodError(TreeTimeError):
"""MissingDataError class raised when tree or alignment are missing"""
pass
class NotReadyError(TreeTimeError):
"""NotReadyError class raised when results are requested before inference"""
pass
from .treeanc import TreeAnc
from .treetime import TreeTime, plot_vs_years
from .clock_tree import ClockTree
from .treetime import ttconf as treetime_conf
from .gtr import GTR
from .gtr_site_specific import GTR_site_specific
from .merger_models import Coalescent
from .treeregression import TreeRegression
from .argument_parser import make_parser
......@@ -2,7 +2,7 @@
from __future__ import print_function, division, absolute_import
import sys, argparse, os
from treetime.wrappers import ancestral_reconstruction, mugration, scan_homoplasies, timetree, estimate_clock_model
import treetime
from treetime import version
py2 = sys.version_info.major==2
......@@ -148,6 +148,7 @@ def add_gtr_arguments(parser):
def add_anc_arguments(parser):
parser.add_argument('--keep-overhangs', default = False, action='store_true', help='do not fill terminal gaps')
parser.add_argument('--zero-based', default = False, action='store_true', help='zero based mutation indexing')
parser.add_argument('--reconstruct-tip-states', default = False, action='store_true', help='overwrite ambiguous states on tips with the most likely inferred state')
parser.add_argument('--report-ambiguous', default=False, action="store_true", help='include transitions involving ambiguous states')
......@@ -169,6 +170,8 @@ def make_parser():
t_parser.add_argument('--tree', type=str, help=tree_description)
add_seq_len_aln_group(t_parser)
t_parser.add_argument('--dates', type=str, help=dates_description)
t_parser.add_argument('--name-column', type=str, help="label of the column to be used as taxon name")
t_parser.add_argument('--date-column', type=str, help="label of the column to be used as sampling date")
add_reroot_group(t_parser)
add_gtr_arguments(t_parser)
t_parser.add_argument('--clock-rate', type=float, help="if specified, the rate of the molecular clock won't be optimized.")
......@@ -189,6 +192,8 @@ def make_parser():
help='maximal number of iterations the inference cycle is run. Note that for polytomy resolution and coalescence models max_iter should be at least 2')
t_parser.add_argument('--coalescent', default="0.0", type=str,
help=coalescent_description)
t_parser.add_argument('--n-skyline', default="20", type=int,
help="number of grid points in skyline coalescent model")
t_parser.add_argument('--plot-tree', default="timetree.pdf",
help = "filename to save the plot to. Suffix will determine format"
" (choices pdf, png, svg, default=pdf)")
......@@ -201,6 +206,7 @@ def make_parser():
help = "don't show tip labels (default for small trees with >=30 leaves)")
add_anc_arguments(t_parser)
add_common_args(t_parser)
t_parser.add_argument("--version", action="version", version="%(prog)s " + version)
def toplevel(params):
if (params.aln or params.tree) and params.dates:
......@@ -229,9 +235,9 @@ def make_parser():
## ANCESTRAL RECONSTRUCTION
a_parser = subparsers.add_parser('ancestral', description=ancestral_description)
add_aln_group(a_parser)
a_parser.add_argument('--tree', type = str, help =tree_description)
a_parser.add_argument('--tree', type=str, help=tree_description)
add_gtr_arguments(a_parser)
a_parser.add_argument('--marginal', default = False, action="store_true", help ="marginal reconstruction of ancestral sequences")
a_parser.add_argument('--marginal', default=False, action="store_true", help ="marginal reconstruction of ancestral sequences")
add_anc_arguments(a_parser)
add_common_args(a_parser)
a_parser.set_defaults(func=ancestral_reconstruction)
......@@ -239,6 +245,7 @@ def make_parser():
## MUGRATION
m_parser = subparsers.add_parser('mugration', description=mugration_description)
m_parser.add_argument('--tree', required = True, type=str, help=tree_description)
m_parser.add_argument('--name-column', type=str, help="label of the column to be used as taxon name")
m_parser.add_argument('--attribute', type=str, help ="attribute to reconstruct, e.g. country")
m_parser.add_argument('--states', required = True, type=str, help ="csv or tsv file with discrete characters."
"\n#name,country,continent\ntaxon1,micronesia,oceania\n...")
......@@ -265,6 +272,8 @@ def make_parser():
"signal and recalculate branch length unless run with --keep_root.")
c_parser.add_argument('--tree', required=True, type=str, help=tree_description)
c_parser.add_argument('--dates', required=True, type=str, help=dates_description)
c_parser.add_argument('--date-column', type=str, help="label of the column to be used as sampling date")
c_parser.add_argument('--name-column', type=str, help="label of the column to be used as taxon name")
add_seq_len_aln_group(c_parser)
add_reroot_group(c_parser)
......@@ -279,7 +288,7 @@ def make_parser():
# make a version subcommand
v_parser = subparsers.add_parser('version', description='print version')
v_parser.set_defaults(func=lambda x: print(treetime.version))
v_parser.set_defaults(func=lambda x: print("treetime "+version))
## call the relevant function and return
if py2:
......
......@@ -88,19 +88,11 @@ class BranchLenInterpolator (Distribution):
elif branch_length_mode=='joint':
if not hasattr(node, 'compressed_sequence'):
#FIXME: this assumes node.sequence is set, but this might not be the case if
# ancestral reconstruction is run with final=False
if hasattr(node, 'sequence'):
seq_pairs, multiplicity = self.gtr.compress_sequence_pair(node.up.sequence,
node.sequence,
ignore_gaps=ignore_gaps)
node.compressed_sequence = {'pair':seq_pairs, 'multiplicity':multiplicity}
else:
raise Exception("uncompressed sequence needs to be assigned to nodes")
log_prob = np.array([-self.gtr.prob_t_compressed(node.compressed_sequence['pair'],
node.compressed_sequence['multiplicity'],
if not hasattr(node, 'branch_state'):
raise Exception("branch state pairs need to be assigned to nodes")
log_prob = np.array([-self.gtr.prob_t_compressed(node.branch_state['pair'],
node.branch_state['multiplicity'],
k,
return_log=True)
for k in grid])
......
from __future__ import print_function, division, absolute_import
import numpy as np
from treetime import config as ttconf
from treetime import MissingDataError
from .treeanc import TreeAnc
from .utils import numeric_date, DateConversion
from .utils import numeric_date, DateConversion, datestring_from_numeric
from .distribution import Distribution
from .branch_len_interpolator import BranchLenInterpolator
from .node_interpolator import NodeInterpolator
......@@ -79,8 +80,7 @@ class ClockTree(TreeAnc):
self.clock_model=None
self.use_covariation=use_covariation # if false, covariation will be ignored in rate estimates.
self._set_precision(precision)
if self._assign_dates()==ttconf.ERROR:
raise ValueError("ClockTree requires date constraints!")
self._assign_dates()
def _assign_dates(self):
......@@ -92,8 +92,7 @@ class ClockTree(TreeAnc):
success/error code
"""
if self.tree is None:
self.logger("ClockTree._assign_dates: tree is not set, can't assign dates", 0)
return ttconf.ERROR
raise MissingDataError("ClockTree._assign_dates: tree is not set, can't assign dates")
bad_branch_counter = 0
for node in self.tree.find_clades(order='postorder'):
......@@ -128,9 +127,9 @@ class ClockTree(TreeAnc):
bad_branch_counter += 1
if bad_branch_counter>self.tree.count_terminals()-3:
self.logger("ERROR: ALMOST NO VALID DATE CONSTRAINTS, EXITING", 1, warn=True)
return ttconf.ERROR
raise MissingDataError("ERROR: ALMOST NO VALID DATE CONSTRAINTS")
self.logger("ClockTree._assign_dates: assigned date contraints to {} out of {} tips.".format(self.tree.count_terminals()-bad_branch_counter, self.tree.count_terminals()), 1)
return ttconf.SUCCESS
......@@ -149,7 +148,7 @@ class ClockTree(TreeAnc):
self.precision=precision
if self.one_mutation and self.one_mutation<1e-4 and precision<2:
self.logger("ClockTree._set_precision: FOR LONG SEQUENCES (>1e4) precision>=2 IS RECOMMENDED."
" \n\t **** precision %d was specified by the user"%precision, level=0)
" precision %d was specified by the user"%precision, level=0)
else:
# otherwise adjust it depending on the minimal sensible branch length
if self.one_mutation:
......@@ -263,7 +262,7 @@ class ClockTree(TreeAnc):
"""
self.logger("ClockTree.init_date_constraints...",2)
self.tree.coalescent_joint_LH = 0
if self.aln and (ancestral_inference or (not hasattr(self.tree.root, 'sequence'))):
if self.aln and (not self.sequence_reconstruction):
self.infer_ancestral_sequences('probabilistic', marginal=self.branch_length_mode=='marginal',
sample_from_profile='root',**kwarks)
......@@ -286,9 +285,11 @@ class ClockTree(TreeAnc):
if self.branch_length_mode=='marginal':
node.profile_pair = self.marginal_branch_profile(node)
elif self.branch_length_mode=='joint' and (not hasattr(node, 'branch_state')):
self.add_branch_state(node)
node.branch_length_interpolator = BranchLenInterpolator(node, self.gtr,
pattern_multiplicity = self.multiplicity, min_width=self.min_width,
pattern_multiplicity = self.data.multiplicity, min_width=self.min_width,
one_mutation=self.one_mutation, branch_length_mode=self.branch_length_mode)
node.branch_length_interpolator.merger_cost = merger_cost
......@@ -312,8 +313,8 @@ class ClockTree(TreeAnc):
if hasattr(node, 'bad_branch') and node.bad_branch is True:
self.logger("ClockTree.init_date_constraints -- WARNING: Branch is marked as bad"
", excluding it from the optimization process.\n"
"\t\tDate constraint will be ignored!", 4, warn=True)
", excluding it from the optimization process."
" Date constraint will be ignored!", 4, warn=True)
else: # node without sampling date set
node.raw_date_constraint = None
node.date_constraint = None
......@@ -438,13 +439,14 @@ class ClockTree(TreeAnc):
if node.joint_pos_Cx is None: # no constraints or branch is bad - reconstruct from the branch len interpolator
node.branch_length = node.branch_length_interpolator.peak_pos
elif node.date_constraint is not None and node.date_constraint.is_delta:
node.branch_length = node.up.time_before_present - node.date_constraint.peak_pos
elif isinstance(node.joint_pos_Cx, Distribution):
# NOTE the Lx distribution is the likelihood, given the position of the parent
# (Lx.x = parent position, Lx.y = LH of the node_pos given Lx.x,
# the length of the branch corresponding to the most likely
# subtree is node.Cx(node.time_before_present))
subtree_LH = node.joint_pos_Lx(node.up.time_before_present)
# subtree_LH = node.joint_pos_Lx(node.up.time_before_present)
node.branch_length = node.joint_pos_Cx(max(node.joint_pos_Cx.xmin,
node.up.time_before_present)+ttconf.TINY_NUMBER)
......@@ -475,7 +477,7 @@ class ClockTree(TreeAnc):
# add the root sequence LH and return
if self.aln:
LH += self.gtr.sequence_logLH(self.tree.root.cseq, pattern_multiplicity=self.multiplicity)
LH += self.gtr.sequence_logLH(self.tree.root.cseq, pattern_multiplicity=self.data.multiplicity)
return LH
......@@ -525,7 +527,7 @@ class ClockTree(TreeAnc):
# no information
node.marginal_pos_Lx = None
else: # all other nodes
if node.date_constraint is not None and node.date_constraint.is_delta: # there is a time constraint
if node.date_constraint is not None and node.date_constraint.is_delta: # there is a hard time constraint
# initialize the Lx for nodes with precise date constraint:
# subtree probability given the position of the parent node
# position of the parent node is given by the branch length
......@@ -575,6 +577,8 @@ class ClockTree(TreeAnc):
if node.up is None:
node.msg_from_parent = None # nothing beyond the root
# all other cases (All internal nodes + unconstrained terminals)
elif node.date_constraint is not None and node.date_constraint.is_delta:
node.marginal_pos_LH = node.date_constraint
else:
parent = node.up
# messages from the complementary subtree (iterate over all sister nodes)
......@@ -584,8 +588,6 @@ class ClockTree(TreeAnc):
# if parent itself got smth from the root node, include it
if parent.msg_from_parent is not None:
complementary_msgs.append(parent.msg_from_parent)
elif parent.marginal_pos_Lx is not None:
complementary_msgs.append(parent.marginal_pos_LH)
if len(complementary_msgs):
msg_parent_to_node = NodeInterpolator.multiply(complementary_msgs)
......@@ -677,17 +679,7 @@ class ClockTree(TreeAnc):
"later than present day",4 , warn=True)
node.numdate = now - years_bp
# set the human-readable date
year = np.floor(node.numdate)
days = max(0,365.25 * (node.numdate - year)-1)
try: # datetime will only operate on dates after 1900
n_date = datetime(year, 1, 1) + timedelta(days=days)
node.date = datetime.strftime(n_date, "%Y-%m-%d")
except:
# this is the approximation not accounting for gap years etc
n_date = datetime(1900, 1, 1) + timedelta(days=days)
node.date = "%04d-%02d-%02d"%(year, n_date.month, n_date.day)
node.date = datestring_from_numeric(node.numdate)
def branch_length_to_years(self):
......@@ -722,8 +714,8 @@ class ClockTree(TreeAnc):
params = params or {}
if rate_std is None:
if not (self.clock_model['valid_confidence'] and 'cov' in self.clock_model):
self.logger("ClockTree.calc_rate_susceptibility: need valid standard deviation of the clock rate to estimate dating error.", 1, warn=True)
return ttconf.ERROR
raise ValueError("ClockTree.calc_rate_susceptibility: need valid standard deviation of the clock rate to estimate dating error.")
rate_std = np.sqrt(self.clock_model['cov'][0,0])
current_rate = np.abs(self.clock_model['slope'])
......
......@@ -30,8 +30,7 @@ class GTR(object):
of observing characters in the alphabet. This is used to
implement ambiguous characters like 'N'=[1,1,1,1] which are
equally likely to be any of the 4 nucleotides. Standard profile_maps
are defined in file seq_utils.py. If None is provided, no ambigous
characters are supported.
are defined in file seq_utils.py.
logger : callable
Custom logging function that should take arguments (msg, level, warn=False),
......@@ -39,6 +38,7 @@ class GTR(object):
"""
self.debug=False
self.is_site_specific=False
if isinstance(alphabet, str):
if alphabet not in alphabet_synonyms:
raise AttributeError("Unknown alphabet type specified")
......@@ -48,13 +48,14 @@ class GTR(object):
self.profile_map = profile_maps[tmp_alphabet]
else:
# not a predefined alphabet
self.alphabet = alphabet
self.alphabet = np.array(alphabet)
if prof_map is None: # generate trivial unambiguous profile map is none is given
self.profile_map = {s:x for s,x in zip(self.alphabet, np.eye(len(self.alphabet)))}
else:
self.profile_map = prof_map
self.profile_map = {x if type(x) is str else x:k for x,k in prof_map.items()}
self.state_index={s:si for si,s in enumerate(self.alphabet)}
self.state_index.update({s:si for si,s in enumerate(self.alphabet)})
if logger is None:
def logger_default(*args,**kwargs):
"""standard logging function if none provided"""
......@@ -69,13 +70,6 @@ class GTR(object):
self.n_states = len(self.alphabet)
self.assign_gap_and_ambiguous()
# NEEDED TO BREAK RATE MATRIX DEGENERACY AND FORCE NP TO RETURN REAL ORTHONORMAL EIGENVECTORS
# ugly hack, but works and shouldn't affect results
tmp_rng_state = np.random.get_state()
np.random.seed(12345)
self.break_degen = np.random.random(size=(self.n_states, self.n_states))*1e-6
np.random.set_state(tmp_rng_state)
# init all matrices with dummy values
self.logger("GTR: init with dummy values!", 3)
self.v = None # right eigenvectors
......@@ -86,7 +80,7 @@ class GTR(object):
def assign_gap_and_ambiguous(self):
n_states = len(self.alphabet)
self.logger("GTR: with alphabet: "+str(self.alphabet),1)
self.logger("GTR: with alphabet: "+str([x for x in self.alphabet]),1)
# determine if a character exists that corresponds to no info, i.e. all one profile
if any([x.sum()==n_states for x in self.profile_map.values()]):
amb_states = [c for c,x in self.profile_map.items() if x.sum()==n_states]
......@@ -97,7 +91,7 @@ class GTR(object):
# check for a gap symbol
try:
self.gap_index = list(self.alphabet).index('-')
self.gap_index = self.state_index['-']
except:
self.logger("GTR: no gap symbol!", 4, warn=True)
self.gap_index=None
......@@ -134,7 +128,10 @@ class GTR(object):
and the equilibrium frequencies to obtain the rate matrix
of the GTR model
"""
return (self.W*self.Pi).T
Q_tmp = (self.W*self.Pi).T
Q_diag = -np.sum(Q_tmp, axis=0)
np.fill_diagonal(Q_tmp, Q_diag)
return Q_tmp
######################################################################
......@@ -155,18 +152,18 @@ class GTR(object):
if not multi_site:
eq_freq_str += "\nEquilibrium frequencies (pi_i):\n"
for a,p in zip(self.alphabet, self.Pi):
eq_freq_str+=' '+str(a)+': '+str(np.round(p,4))+'\n'
eq_freq_str+=' '+a+': '+str(np.round(p,4))+'\n'
W_str = "\nSymmetrized rates from j->i (W_ij):\n"
W_str+='\t'+'\t'.join(map(str, self.alphabet))+'\n'
W_str+='\t'+'\t'.join(self.alphabet)+'\n'
for a,Wi in zip(self.alphabet, self.W):
W_str+= ' '+str(a)+'\t'+'\t'.join([str(np.round(max(0,p),4)) for p in Wi])+'\n'
W_str+= ' '+a+'\t'+'\t'.join([str(np.round(max(0,p),4)) for p in Wi])+'\n'
if not multi_site:
Q_str = "\nActual rates from j->i (Q_ij):\n"
Q_str+='\t'+'\t'.join(map(str, self.alphabet))+'\n'
Q_str+='\t'+'\t'.join(self.alphabet)+'\n'
for a,Qi in zip(self.alphabet, self.Q):
Q_str+= ' '+str(a)+'\t'+'\t'.join([str(np.round(max(0,p),4)) for p in Qi])+'\n'
Q_str+= ' '+a+'\t'+'\t'.join([str(np.round(max(0,p),4)) for p in Qi])+'\n'
return eq_freq_str + W_str + Q_str
......@@ -190,6 +187,7 @@ class GTR(object):
"""
n = len(self.alphabet)
self._mu = mu
self.is_site_specific=False
if pi is not None and len(pi)==n:
Pi = np.array(pi)
......@@ -213,7 +211,11 @@ class GTR(object):
W=np.array(W)
self._W = 0.5*(W+W.T)
self._check_fix_Q(fixed_mu=True)
np.fill_diagonal(W,0)
average_rate = W.dot(self.Pi).dot(self.Pi)
self._W = W/average_rate
self._mu *=average_rate
self._eig()
......@@ -508,8 +510,8 @@ class GTR(object):
if gtr.gap_index is not None:
if pi[gtr.gap_index]<gap_limit:
gtr.logger('The model allows for gaps which are estimated to occur at a low fraction of %1.3e'%pi[gtr.gap_index]+
'\n\t\tthis can potentially result in artificats.'+
'\n\t\tgap fraction will be set to %1.4f'%gap_limit,2,warn=True)
' this can potentially result in artificats.'+
' gap fraction will be set to %1.4f'%gap_limit,2,warn=True)
pi[gtr.gap_index] = gap_limit
pi /= pi.sum()
......@@ -519,39 +521,13 @@ class GTR(object):
########################################################################
### prepare model
########################################################################
def _check_fix_Q(self, fixed_mu=False):
"""
Check the main diagonal of Q and fix it in case it does not corresond
the definition of the rate matrix. Should be run every time when creating
custom GTR model.
"""
# NEEDED TO BREAK RATE MATRIX DEGENERACY AND FORCE NP TO RETURN REAL ORTHONORMAL EIGENVECTORS
self._W += self.break_degen + self.break_degen.T
# fix W
np.fill_diagonal(self.W, 0)
Wdiag = -(self.Q).sum(axis=0)/self.Pi
np.fill_diagonal(self.W, Wdiag)
scale_factor = -np.sum(np.diagonal(self.Q)*self.Pi)
self._W /= scale_factor
if not fixed_mu:
self._mu *= scale_factor
if (self.Q.sum(axis=0) < 1e-10).sum() < self.alphabet.shape[0]: # fix failed
print ("Cannot fix the diagonal of the GTR rate matrix. Should be all zero", self.Q.sum(axis=0))
import ipdb; ipdb.set_trace()
raise ArithmeticError("Cannot fix the diagonal of the GTR rate matrix.")
def _eig(self):
"""
Perform eigendecompositon of the rate matrix and stores the left- and right-
matrices to convert the sequence profiles to the GTR matrix eigenspace
and hence to speed-up the computations.
"""
W_nodiag = np.copy(self.W)
np.fill_diagonal(W_nodiag, 0)
self.eigenvals, self.v, self.v_inv = self._eig_single_site(W_nodiag, self.Pi)
self.eigenvals, self.v, self.v_inv = self._eig_single_site(self.W, self.Pi)
def _eig_single_site(self, W, p):
......@@ -574,7 +550,7 @@ class GTR(object):
return eigvals, tmp_v.T/one_norm, (eigvecs*one_norm).T/tmpp
def compress_sequence_pair(self, seq_p, seq_ch, pattern_multiplicity=None,
def state_pair(self, seq_p, seq_ch, pattern_multiplicity=None,
ignore_gaps=False):
'''
Make a compressed representation of a pair of sequences, only counting
......@@ -615,7 +591,7 @@ class GTR(object):
from collections import Counter
if seq_ch.shape != seq_p.shape:
raise ValueError("GTR.compress_sequence_pair: Sequence lengths do not match!")
raise ValueError("GTR.state_pair: Sequence lengths do not match!")
if len(self.alphabet)<10: # for small alphabet, repeatedly check array for all state pairs
pair_count = []
......@@ -724,7 +700,7 @@ class GTR(object):
Resulting probability
"""
seq_pair, multiplicity = self.compress_sequence_pair(seq_p, seq_ch,
seq_pair, multiplicity = self.state_pair(seq_p, seq_ch,
pattern_multiplicity=pattern_multiplicity, ignore_gaps=ignore_gaps)
return self.prob_t_compressed(seq_pair, multiplicity, t, return_log=return_log)
......@@ -752,20 +728,21 @@ class GTR(object):
If True, ignore gaps in distance calculations
'''
seq_pair, multiplicity = self.compress_sequence_pair(seq_p, seq_ch,
pattern_multiplicity = pattern_multiplicity,
ignore_gaps=ignore_gaps)
seq_pair, multiplicity = self.state_pair(seq_p, seq_ch,
pattern_multiplicity = pattern_multiplicity,
ignore_gaps=ignore_gaps)
return self.optimal_t_compressed(seq_pair, multiplicity)
def optimal_t_compressed(self, seq_pair, multiplicity, profiles=False, tol=1e-10):
"""
Find the optimal distance between the two sequences, for compressed sequences
Find the optimal distance between the two sequences represented as state_pairs
or as pair of profiles
Parameters
----------
seq_pair : compressed_sequence_pair
seq_pair : state_pair, tuple
Compressed representation of sequences along a branch, either
as tuple of state pairs or as tuple of profiles.
......@@ -779,7 +756,7 @@ class GTR(object):
either end of the branch. With profiles==True, optimization is performed
while summing over all possible states of the nodes at either end of the
branch. Note that the meaning/format of seq_pair and multiplicity
depend on the value of profiles.
depend on the value of :profiles:.
"""
......@@ -1032,14 +1009,6 @@ class GTR(object):
char_dist=self.Pi,
flow_matrix=self.W)
def save_to_json(self, zip):
d = {
"full_gtr": self.mu * np.dot(self.Pi, self.W),
"Substitution rate" : self.mu,
"Equilibrium character composition": self.Pi,
"Flow rate matrix": self.W
}
if __name__ == "__main__":
pass
......@@ -25,6 +25,7 @@ class GTR_site_specific(GTR):
self.seq_len=seq_len
self.approximate = approximate
super(GTR_site_specific, self).__init__(**kwargs)
self.is_site_specific=True
@property
......@@ -57,25 +58,34 @@ class GTR_site_specific(GTR):
Equilibrium frequencies
"""
if not np.isscalar(mu) and pi is not None and len(pi.shape)==2:
if mu.shape[0]!=pi.shape[1]:
raise ValueError("GTR_site_specific: length of rate vector (got {}) and equilibrium frequency vector (got {}) must match!".format(mu.shape[0], pi.shape[1]))
n = len(self.alphabet)
if np.isscalar(mu):
self._mu = mu*np.ones(self.seq_len)
else:
self._mu = np.copy(mu)
self.seq_len = mu.shape[0]
if pi is not None and pi.shape[0]==n:
self.seq_len = pi.shape[-1]
if pi is not None and pi.shape[0]==n and len(pi.shape)==2:
self.seq_len = pi.shape[1]
Pi = np.copy(pi)
else:
if pi is not None and len(pi)!=n:
raise ArgumentError("GTR_site_specific: length of equilibrium frequency vector does not match alphabet length.")
Pi = np.ones(shape=(n,self.seq_len))
if pi is not None:
if len(pi)==n:
Pi = np.repeat([pi], self.seq_len, axis=0).T
else:
raise ValueError("GTR_site_specific: length of equilibrium frequency vector (got {}) does not match alphabet length {}".format(len(pi), n))
else:
Pi = np.ones(shape=(n,self.seq_len))
self._Pi = Pi/np.sum(Pi, axis=0)
if W is None or W.shape!=(n,n):
if (W is not None) and W.shape!=(n,n):
raise ArgumentError("GTR_site_specific: Size of substitution matrix does not match alphabet length.")
raise ValueError("GTR_site_specific: Size of substitution matrix (got {}) does not match alphabet length {}".format(W.shape, n))
W = np.ones((n,n))
np.fill_diagonal(W, 0.0)
np.fill_diagonal(W, - W.sum(axis=0))
......@@ -83,11 +93,13 @@ class GTR_site_specific(GTR):
W=0.5*(np.copy(W)+np.copy(W).T)
np.fill_diagonal(W,0)
avg_pi = self.Pi.mean(axis=-1)
average_rate = W.dot(avg_pi).dot(avg_pi)
average_rate = np.einsum('ia,ij,ja',self.Pi, W, self.Pi)/self.seq_len
# average_rate = W.dot(avg_pi).dot(avg_pi)
self._W = W/average_rate
self._mu *=average_rate
self.is_site_specific=True
self._eig()
self._make_expQt_interpolator()
......@@ -124,6 +136,7 @@ class GTR_site_specific(GTR):
gtr = cls(alphabet=alphabet, seq_len=L)
n = gtr.alphabet.shape[0]
# Dirichlet distribution == l_1 normalized vector of samples of the Gamma distribution
if pi_dirichlet_alpha:
pi = 1.0*gamma.rvs(pi_dirichlet_alpha, size=(n,L))
else:
......@@ -143,7 +156,7 @@ class GTR_site_specific(GTR):
mu = np.ones(L)
gtr.assign_rates(mu=mu, pi=pi, W=W)
gtr.mu *= avg_mu/np.mean(gtr.mu)
gtr.mu *= avg_mu/np.mean(gtr.average_rate())
return gtr
......@@ -166,7 +179,7 @@ class GTR_site_specific(GTR):
Equilibrium frequencies
**kwargs:
Key word arguments to be passed
Key word arguments to be passed to the constructor
Keyword Args
------------
......@@ -255,15 +268,15 @@ class GTR_site_specific(GTR):
p_ia_old = np.copy(p_ia)
S_ij = np.einsum('a,ia,ja',mu_a, p_ia, T_ia)
W_ij = (n_ij + n_ij.T + pc)/(S_ij + S_ij.T + pc)
avg_pi = p_ia.mean(axis=-1)
average_rate = W_ij.dot(avg_pi).dot(avg_pi)
average_rate = W_ij.dot(avg_pi).dot(avg_pi) # crude approx, will be fixed in assign rates
W_ij = W_ij/average_rate
mu_a *=average_rate
p_ia = m_ia/(mu_a*np.dot(W_ij,T_ia)+Lambda)
p_ia = p_ia/p_ia.sum(axis=0)
mu_a = n_a/(pc+np.einsum('ia,ij,ja->a', p_ia, W_ij, T_ia))
......@@ -276,7 +289,7 @@ class GTR_site_specific(GTR):
if p_ia[gtr.gap_index,p]<gap_limit:
gtr.logger('The model allows for gaps which are estimated to occur at a low fraction of %1.3e'%p_ia[gtr.gap_index,p]+
'\n\t\tthis can potentially result in artifacts.'+
'\n\t\tgap fraction will be set to %1.4f'%gap_limit,2,warn=True)
'\n\t\tgap fraction will be set to %1.4f'%gap_limit,4,warn=True)
p_ia[gtr.gap_index,p] = gap_limit
p_ia[:,p] /= p_ia[:,p].sum()
......@@ -456,7 +469,7 @@ class GTR_site_specific(GTR):
logQt[np.isnan(logQt) | np.isinf(logQt) | bad_indices] = -ttconf.BIG_NUMBER
seq_indices_c = np.zeros(len(seq_ch), dtype=int)
seq_indices_p = np.zeros(len(seq_p), dtype=int)
for ai, a in self.alphabet:
for ai, a in enumerate(self.alphabet):
seq_indices_p[seq_p==a] = ai
seq_indices_c[seq_ch==a] = ai
......
......@@ -164,7 +164,7 @@ class Coalescent(object):
if "success" in sol and sol["success"]:
self.set_Tc(sol['x'])
else:
self.logger("merger_models:optimze_Tc: optimization of coalescent time scale failed: " + str(sol), 0, warn=True)
self.logger("merger_models:optimize_Tc: optimization of coalescent time scale failed: " + str(sol), 0, warn=True)
self.set_Tc(initial_Tc.y, T=initial_Tc.x)
......@@ -190,8 +190,8 @@ class Coalescent(object):
# cap log Tc to avoid under or overflow and nan in logs
self.set_Tc(np.exp(np.maximum(-200,np.minimum(100,logTc))), tvals)
neglogLH = -self.total_LH() + stiffness*np.sum(np.diff(logTc)**2) \
+ np.sum((logTc>0)*logTc*regularization)\
- np.sum((logTc<-100)*logTc*regularization)
+ np.sum((logTc>0)*logTc)*regularization\
- np.sum((logTc<-100)*logTc)*regularization
return neglogLH
sol = minimize(cost, np.ones_like(tvals)*np.log(self.Tc.y.mean()), method=method, tol=tol)
......@@ -209,7 +209,7 @@ class Coalescent(object):
dcost = np.array(dcost)
optimal_cost = cost(opt_logTc)
self.confidence = -dlogTc/(2*optimal_cost - dcost[:,0] - dcost[:,1])
self.confidence = dlogTc/np.sqrt(np.abs(2*optimal_cost - dcost[:,0] - dcost[:,1]))
self.logger("Coalescent:optimize_skyline:...done. new LH: %f"%self.total_LH(),2)
else:
self.set_Tc(initial_Tc.y, T=initial_Tc.x)
......
import numpy as np
from Bio import Seq, SeqRecord
alphabet_synonyms = {'nuc':'nuc', 'nucleotide':'nuc', 'aa':'aa', 'aminoacid':'aa',
'nuc_nogap':'nuc_nogap', 'nucleotide_nogap':'nuc_nogap',
......@@ -115,39 +117,83 @@ profile_maps = {
}
}
def seq2array(seq, fill_overhangs=True, ambiguous_character='N'):
def extend_profile(gtr, aln, logger=None):
tmp_unique_chars = []
for seq in aln:
tmp_unique_chars.extend(np.unique(seq))
unique_chars = np.unique(tmp_unique_chars)
for c in unique_chars:
if c not in gtr.profile_map:
gtr.profile_map[c] = np.ones(gtr.n_states)
if logger:
logger("WARNING: character %s is unknown. Treating it as missing information"%c,1,warn=True)
def guess_alphabet(aln):
total=0
nuc_count = 0
for seq in aln:
total += len(seq)
for n in np.array(list('acgtACGT-N')):
nuc_count += np.sum(seq==n)
if nuc_count>0.9*total:
return 'nuc'
else:
return 'aa'
def seq2array(seq, word_length=1, convert_upper=False, fill_overhangs=False, ambiguous='N'):
"""
Take the raw sequence, substitute the "overhanging" gaps with 'N' (missequenced),
and convert the sequence to the numpy array of chars.
Parameters
----------
seq : Biopython.SeqRecord, str, iterable
Sequence as an object of SeqRecord, string or iterable
fill_overhangs : bool
If True, substitute the "overhanging" gaps with ambiguous character symbol
ambiguous_character : char
Specify the character for ambiguous state ('N' default for nucleotide)
seq : Biopython.SeqRecord, str, iterable
Sequence as an object of SeqRecord, string or iterable
word_length : int, optional
1 for nucleotide or amino acids, 3 for codons etc.
convert_upper : bool, optional
convert the sequence to upper case
fill_overhangs : bool
If True, substitute the "overhanging" gaps with ambiguous character symbol
ambiguous : char
Specify the character for ambiguous state ('N' default for nucleotide)
Returns
-------
sequence : np.array
Sequence as 1D numpy array of chars
sequence : np.array
Sequence as 1D numpy array of chars
"""
try:
sequence = ''.join(seq)
except TypeError:
sequence = seq
if isinstance(seq, str):
seq_str = seq
elif isinstance(seq, Seq.Seq):
seq_str = str(seq)
elif isinstance(seq, SeqRecord.SeqRecord):
seq_str = str(seq.seq)
else:
raise TypeError("seq2array: sequence must be Bio.Seq, Bio.SeqRecord, or string. Got "+str(seq))
if convert_upper:
seq_str = seq_str.upper()
if word_length==1:
seq_array = np.array(list(seq_str))
else:
if len(seq_str)%word_length:
raise ValueError("sequence length has to be multiple of word length");
seq_array = np.array([seq_str[i*word_length:(i+1)*word_length]
for i in range(len(seq_str)/word_length)])
sequence = np.array(list(sequence))
# substitute overhanging unsequenced tails
if fill_overhangs:
sequence [:np.where(sequence != '-')[0][0]] = ambiguous_character
sequence [np.where(sequence != '-')[0][-1]+1:] = ambiguous_character
return sequence
gaps = np.where(seq_array != '-')[0]
seq_array[:gaps[0]] = ambiguous
seq_array[gaps[-1]+1:] = ambiguous
return seq_array
def seq2prof(seq, profile_map):
"""
......@@ -184,10 +230,8 @@ def prof2seq(profile, gtr, sample_from_prof=False, normalize=True):
profile : numpy 2D array
Profile. Shape of the profile should be (L x a), where L - sequence
length, a - alphabet size.
gtr : gtr.GTR
Instance of the GTR class to supply the sequence alphabet
collapse_prof : bool
Whether to convert the profile to the delta-function
......@@ -195,10 +239,8 @@ def prof2seq(profile, gtr, sample_from_prof=False, normalize=True):
-------
seq : numpy.array
Sequence as numpy array of length L
prof_values : numpy.array
Values of the profile for the chosen sequence characters (length L)
idx : numpy.array
Indices chosen from profile as array of length L
"""
......
......@@ -13,10 +13,10 @@ class SeqGen(TreeAnc):
This class inherits from TreeAnc.
'''
def __init__(self, *args, **kwargs):
"""Instantiate. Mandatory arguments are a tree and GTR model.
def __init__(self, L, *args, **kwargs):
"""Instantiate. Mandatory arguments are a the sequence length, tree and GTR model.
"""
super(SeqGen, self).__init__(reduce_alignment=False, **kwargs)
super(SeqGen, self).__init__(seq_len=L, compress=False, **kwargs)
def sample_from_profile(self, p):
......@@ -50,30 +50,23 @@ class SeqGen(TreeAnc):
sequence to be used as the root sequence of the tree. if not given,
will sample a sequence from the equilibrium probabilities of the GTR model.
"""
self.seq_len = self.gtr.seq_len
# set root if not given
if root_seq:
self.tree.root.sequence = seq2array(root_seq)
self.tree.root.ancestral_sequence = seq2array(root_seq)
else:
if len(self.gtr.Pi.shape)==2:
self.tree.root.sequence = self.sample_from_profile(self.gtr.Pi.T)
self.tree.root.ancestral_sequence = self.sample_from_profile(self.gtr.Pi.T)
else:
self.tree.root.sequence = self.sample_from_profile(np.repeat([self.gtr.Pi], self.seq_len, axis=0))
self.tree.root.ancestral_sequence = self.sample_from_profile(np.repeat([self.gtr.Pi], self.seq_len, axis=0))
# generate sequences in preorder
for n in self.tree.get_nonterminals(order='preorder'):
profile_p = seq2prof(n.sequence, self.gtr.profile_map)
profile_p = seq2prof(n.ancestral_sequence, self.gtr.profile_map)
for c in n:
profile = self.gtr.evolve(profile_p, c.branch_length)
c.sequence = self.sample_from_profile(profile)
self.make_reduced_alignment()
c.ancestral_sequence = self.sample_from_profile(profile)
# gather mutations
for n in self.tree.find_clades():
if n==self.tree.root:
n.mutations=[]
else:
n.mutations = self.get_mutations(n)
self.aln = self.get_aln()
def get_aln(self, internal=False):
......@@ -96,7 +89,7 @@ class SeqGen(TreeAnc):
tmp = []
for n in self.tree.get_terminals():
if n.is_terminal() or internal:
tmp.append(SeqRecord.SeqRecord(id=n.name, name=n.name, description='', seq=Seq.Seq(''.join(n.sequence))))
tmp.append(SeqRecord.SeqRecord(id=n.name, name=n.name, description='', seq=Seq.Seq(''.join(n.ancestral_sequence.astype('U')))))
return MultipleSeqAlignment(tmp)
......
This diff is collapsed.
def optimize_tree_marginal_new(self, damping=0.5):
L = self.data.compressed_length
n_states = self.gtr.alphabet.shape[0]
# propagate leaves --> root, set the marginal-likelihood messages
for node in self.tree.find_clades(order='postorder'): #leaves -> root
if node.up is None and len(node.clades)==2:
continue
profiles = [c.marginal_subtree_LH for c in node] + [node.marginal_outgroup_LH]
bls = [c.branch_length for c in nodes] + [node.branch_length]
new_bls = self.optimize_star(profiles,bls, last_is_root=node.up is None)
# regardless of what was before, set the profile to ones
tmp_log_subtree_LH = np.zeros((L,n_states), dtype=float)
node.marginal_subtree_LH_prefactor = np.zeros(L, dtype=float)
for ch in ci,node.clades:
ch.branch_length = new_bls[ci]
ch.marginal_log_Lx = self.gtr.propagate_profile(ch.marginal_subtree_LH,
ch.branch_length, return_log=True)
tmp_log_subtree_LH += ch.marginal_log_Lx
node.marginal_subtree_LH_prefactor += ch.marginal_subtree_LH_prefactor
node.marginal_subtree_LH, offset = normalize_profile(tmp_log_subtree_LH, log=True)
node.marginal_subtree_LH_prefactor += offset # and store log-prefactor
if node.up:
node.marginal_log_Lx = self.gtr.propagate_profile(node.marginal_subtree_LH,
node.branch_length, return_log=True) # raw prob to transfer prob up
tmp_msg_from_parent = self.gtr.evolve(node.marginal_outgroup_LH,
self._branch_length_to_gtr(node), return_log=False)
node.marginal_profile, pre = normalize_profile(node.marginal_subtree_LH * tmp_msg_from_parent, return_offset=False)
else:
node.marginal_profile, pre = normalize_profile(node.marginal_subtree_LH * node.marginal_outgroup_LH, return_offset=False)
root=self.tree.root
print(len(root.clades))
if len(root.clades)==2:
tmp_log_subtree_LH = np.zeros((L,n_states), dtype=float)
root.marginal_subtree_LH_prefactor = np.zeros(L, dtype=float)
old_bl = root.clades[0].branch_length + root.clades[1]
bl = self.gtr.optimal_t_compressed((root.clades[0].marginal_subtree_LH*root.marginal_outgroup_LH,
root.clades[1].marginal_subtree_LH), multiplicity=self.data.multiplicity,
profiles=True, tol=1e-8)
for ch in root:
ch.branch_length *= ((1-damping)*old_bl + damping*bl)/old_bl
ch.marginal_log_Lx = self.gtr.propagate_profile(ch.marginal_subtree_LH,
ch.branch_length, return_log=True) # raw prob to transfer prob up
tmp_log_subtree_LH += ch.marginal_log_Lx
root.marginal_subtree_LH_prefactor += ch.marginal_subtree_LH_prefactor
root.marginal_subtree_LH, offset = normalize_profile(tmp_log_subtree_LH, log=True)
root.marginal_subtree_LH_prefactor += offset # and store log-prefactor
self.total_LH_and_root_sequence(assign_sequence=False)
self.preorder_traversal_marginal(assign_sequence=False, reconstruct_leaves=False)
def optimize_tree_marginal_new2(self, n_iter_internal=2, damping=0.5):
L = self.data.compressed_length
n_states = self.gtr.alphabet.shape[0]
# propagate leaves --> root, set the marginal-likelihood messages
for node in self.tree.get_nonterminals(order='postorder'): #leaves -> root
if node.up is None and len(node.clades)==2:
continue
# regardless of what was before, set the profile to ones
for ii in range(n_iter_internal):
damp = damping**(1+ii)
tmp_log_subtree_LH = np.zeros((L,n_states), dtype=float)
node.marginal_subtree_LH_prefactor = np.zeros(L, dtype=float)
for ch in node.clades:
outgroup = np.exp(np.log(np.maximum(ttconf.TINY_NUMBER, node.marginal_profile)) - ch.marginal_log_Lx)
bl = self.gtr.optimal_t_compressed((ch.marginal_subtree_LH, outgroup), multiplicity=self.data.multiplicity, profiles=True, tol=1e-8)
new_bl = (1-damp)*bl + damp*ch.branch_length
ch.branch_length=new_bl
ch.marginal_log_Lx = self.gtr.propagate_profile(ch.marginal_subtree_LH,
new_bl, return_log=True) # raw prob to transfer prob up
tmp_log_subtree_LH += ch.marginal_log_Lx
node.marginal_subtree_LH_prefactor += ch.marginal_subtree_LH_prefactor
node.marginal_subtree_LH, offset = normalize_profile(tmp_log_subtree_LH, log=True)
node.marginal_subtree_LH_prefactor += offset # and store log-prefactor
if node.up:
bl = self.gtr.optimal_t_compressed((node.marginal_subtree_LH, node.marginal_outgroup_LH), multiplicity=self.data.multiplicity, profiles=True, tol=1e-8)
new_bl = (1-damp)*bl + damp*node.branch_length
node.branch_length=new_bl
node.marginal_log_Lx = self.gtr.propagate_profile(node.marginal_subtree_LH,
new_bl, return_log=True) # raw prob to transfer prob up
node.marginal_outgroup_LH, pre = normalize_profile(np.log(np.maximum(ttconf.TINY_NUMBER, node.up.marginal_profile)) - node.marginal_log_Lx,
log=True, return_offset=False)
tmp_msg_from_parent = self.gtr.evolve(node.marginal_outgroup_LH,
self._branch_length_to_gtr(node), return_log=False)
node.marginal_profile, pre = normalize_profile(node.marginal_subtree_LH * tmp_msg_from_parent, return_offset=False)
else:
node.marginal_profile, pre = normalize_profile(node.marginal_subtree_LH * node.marginal_outgroup_LH, return_offset=False)
root=self.tree.root
print(len(root.clades))
if len(root.clades)==2:
tmp_log_subtree_LH = np.zeros((L,n_states), dtype=float)
root.marginal_subtree_LH_prefactor = np.zeros(L, dtype=float)
old_bl = root.clades[0].branch_length + root.clades[1]
bl = self.gtr.optimal_t_compressed((root.clades[0].marginal_subtree_LH*root.marginal_outgroup_LH,
root.clades[1].marginal_subtree_LH), multiplicity=self.data.multiplicity,
profiles=True, tol=1e-8)
for ch in root:
ch.branch_length *= bl/old_bl
ch.marginal_log_Lx = self.gtr.propagate_profile(ch.marginal_subtree_LH,
ch.branch_length, return_log=True) # raw prob to transfer prob up
tmp_log_subtree_LH += ch.marginal_log_Lx
root.marginal_subtree_LH_prefactor += ch.marginal_subtree_LH_prefactor
root.marginal_subtree_LH, offset = normalize_profile(tmp_log_subtree_LH, log=True)
root.marginal_subtree_LH_prefactor += offset # and store log-prefactor
self.total_LH_and_root_sequence(assign_sequence=False)
self.preorder_traversal_marginal(assign_sequence=False, reconstruct_leaves=False)
This diff is collapsed.
......@@ -21,6 +21,9 @@ def base_regression(Q, slope=None):
TYPE
Description
"""
if np.isinf(Q).sum() or np.isnan(Q).sum():
raise ValueError("Invalid values in input data!")
if slope is None:
if (Q[tsqii] - Q[tavgii]**2/Q[sii])>0:
slope = (Q[dtavgii] - Q[tavgii]*Q[davgii]/Q[sii]) \
......@@ -355,7 +358,8 @@ class TreeRegression(object):
bv = self.branch_value(n)
var = self.branch_variance(n)
for dx in [-0.001, 0.001]:
y = min(1.0, max(0.0, best_root["split"]+dx))
# y needs to be bounded away from 0 and 1 to avoid division by 0
y = min(0.9999, max(0.0001, best_root["split"]+dx))
tmpQ = self.propagate_averages(n, tv, bv*y, var*y) \
+ self.propagate_averages(n, tv, bv*(1-y), var*(1-y), outgroup=True)
reg = base_regression(tmpQ, slope=slope)
......@@ -381,6 +385,10 @@ class TreeRegression(object):
+ self.propagate_averages(n, tv, bv*(1-x), var*(1-x), outgroup=True)
return base_regression(tmpQ, slope=slope)['chisq']
if n.bad_branch or (n!=self.tree.root and n.up.bad_branch):
return np.nan, np.inf
chisq_prox = np.inf if n.is_terminal() else base_regression(n.Qtot, slope=slope)['chisq']
chisq_dist = np.inf if n==self.tree.root else base_regression(n.up.Qtot, slope=slope)['chisq']
......@@ -423,6 +431,8 @@ class TreeRegression(object):
regression parameters
"""
best_root = self.find_best_root(force_positive=force_positive, slope=slope)
if best_root is None:
raise ValueError("Rerooting failed!")
best_node = best_root["node"]
x = best_root["split"]
......
......@@ -3,6 +3,7 @@ import numpy as np
from scipy import optimize as sciopt
from Bio import Phylo
from treetime import config as ttconf
from treetime import MissingDataError,UnknownMethodError,NotReadyError
from .utils import tree_layout
from .clock_tree import ClockTree
......@@ -105,7 +106,8 @@ class TreeTime(ClockTree):
use_covariation : bool, optional
default False, if False, rate estimates will be performed using simple
regression ignoring phylogenetic covaration between nodes.
regression ignoring phylogenetic covaration between nodes. If vary_rate is True,
use_covariation is true by default
**kwargs
Keyword arguments needed by the downstream functions
......@@ -120,11 +122,10 @@ class TreeTime(ClockTree):
"""
# register the specified covaration mode
self.use_covariation = use_covariation
self.use_covariation = use_covariation or (vary_rate and (not type(vary_rate)==float))
if (self.tree is None) or (self.aln is None and self.seq_len is None):
self.logger("TreeTime.run: ERROR, alignment or tree are missing", 0)
return ttconf.ERROR
if (self.tree is None) or (self.aln is None and self.data.full_length is None):
raise MissingDataError("TreeTime.run: ERROR, alignment or tree are missing")
if (self.aln is None):
branch_length_mode='input'
......@@ -132,7 +133,8 @@ class TreeTime(ClockTree):
# determine how to reconstruct and sample sequences
seq_kwargs = {"marginal_sequences":sequence_marginal or (self.branch_length_mode=='marginal'),
"sample_from_profile":"root"}
"sample_from_profile":"root",
"reconstruct_tip_states":kwargs.get("reconstruct_tip_states", False)}
tt_kwargs = {'clock_rate':fixed_clock_rate, 'time_marginal':False}
tt_kwargs.update(kwargs)
......@@ -160,11 +162,9 @@ class TreeTime(ClockTree):
else:
plot_rtt=False
reroot_mechanism = 'least-squares' if root=='clock_filter' else root
if self.clock_filter(reroot=reroot_mechanism, n_iqd=n_iqd, plot=plot_rtt, fixed_clock_rate=fixed_clock_rate)==ttconf.ERROR:
return ttconf.ERROR
self.clock_filter(reroot=reroot_mechanism, n_iqd=n_iqd, plot=plot_rtt, fixed_clock_rate=fixed_clock_rate)
elif root is not None:
if self.reroot(root=root, clock_rate=fixed_clock_rate)==ttconf.ERROR:
return ttconf.ERROR
self.reroot(root=root, clock_rate=fixed_clock_rate)
if self.branch_length_mode=='input':
if self.aln:
......@@ -181,8 +181,9 @@ class TreeTime(ClockTree):
self.LH =[[seq_LH, self.tree.positional_joint_LH, 0]]
if root is not None and max_iter:
if self.reroot(root='least-squares' if root=='clock_filter' else root, clock_rate=fixed_clock_rate)==ttconf.ERROR:
return ttconf.ERROR
new_root = self.reroot(root='least-squares' if root=='clock_filter' else root, clock_rate=fixed_clock_rate)
self.logger("###TreeTime.run: rerunning timetree after rerooting",0)
self.make_time_tree(**tt_kwargs)
# iteratively reconstruct ancestral sequences and re-infer
# time tree to ensure convergence.
......@@ -241,14 +242,11 @@ class TreeTime(ClockTree):
# rerun the estimation for variations of the rate
if vary_rate:
if type(vary_rate)==float:
res = self.calc_rate_susceptibility(rate_std=vary_rate, params=tt_kwargs)
self.calc_rate_susceptibility(rate_std=vary_rate, params=tt_kwargs)
elif self.clock_model['valid_confidence']:
res = self.calc_rate_susceptibility(params=tt_kwargs)
self.calc_rate_susceptibility(params=tt_kwargs)
else:
res = ttconf.ERROR
if res==ttconf.ERROR:
self.logger("TreeTime.run: rate variation failed and can't be used for confidence estimation", 1, warn=True)
raise UnknownMethodError("TreeTime.run: rate variation for confidence estimation is not available. Either specify it explicitly, or estimate from root-to-tip regression.")
# if marginal reconstruction requested, make one more round with marginal=True
# this will set marginal_pos_LH, which to be used as error bar estimations
......@@ -325,8 +323,7 @@ class TreeTime(ClockTree):
terminals = self.tree.get_terminals()
if reroot:
if self.reroot(root='least-squares' if reroot=='best' else reroot, covariation=False, clock_rate=fixed_clock_rate)==ttconf.ERROR:
return ttconf.ERROR
self.reroot(root='least-squares' if reroot=='best' else reroot, covariation=False, clock_rate=fixed_clock_rate)
else:
self.get_clock_model(covariation=False, slope=fixed_clock_rate)
......@@ -339,16 +336,23 @@ class TreeTime(ClockTree):
residuals = np.array(list(res.values()))
iqd = np.percentile(residuals,75) - np.percentile(residuals,25)
bad_branch_count = 0
for node,r in res.items():
if abs(r)>n_iqd*iqd and node.up.up is not None:
self.logger('TreeTime.ClockFilter: marking %s as outlier, residual %f interquartile distances'%(node.name,r/iqd), 3, warn=True)
node.bad_branch=True
bad_branch_count += 1
else:
node.bad_branch=False
if bad_branch_count>0.34*self.tree.count_terminals():
self.logger("TreeTime.clock_filter: More than a third of leaves have been excluded by the clock filter. Please check your input data.", 0, warn=True)
# reassign bad_branch flags to internal nodes
self.prepare_tree()
# redo root estimation after outlier removal
if reroot and self.reroot(root=reroot, clock_rate=fixed_clock_rate)==ttconf.ERROR:
return ttconf.ERROR
if reroot:
self.reroot(root=reroot, clock_rate=fixed_clock_rate)
if plot:
self.plot_root_to_tip()
......@@ -414,6 +418,7 @@ class TreeTime(ClockTree):
use_cov = self.use_covariation if covariation is None else covariation
slope = 0.0 if type(root)==str and root.startswith('min_dev') else clock_rate
old_root = self.tree.root
self.logger("TreeTime.reroot: with method or node: %s"%root,0)
for n in self.tree.find_clades():
......@@ -444,8 +449,7 @@ class TreeTime(ClockTree):
if n.raw_date_constraint is not None],
key=lambda x:np.mean(x.raw_date_constraint))[0]
else:
self.logger('TreeTime.reroot -- ERROR: unsupported rooting mechanisms or root not found',0,warn=True)
return ttconf.ERROR
raise UnknownMethodError('TreeTime.reroot -- ERROR: unsupported rooting mechanisms or root not found')
#this forces a bifurcating root, as we want. Branch lengths will be reoptimized anyway.
#(Without outgroup_branch_length, gives a trifurcating root, but this will mean
......@@ -454,9 +458,6 @@ class TreeTime(ClockTree):
self.get_clock_model(covariation=use_cov, slope = slope)
if new_root == ttconf.ERROR:
return ttconf.ERROR
self.logger("TreeTime.reroot: Tree was re-rooted to node "
+('new_node' if new_root.name is None else new_root.name), 2)
......@@ -478,7 +479,7 @@ class TreeTime(ClockTree):
self.get_clock_model(covariation=self.use_covariation, slope=slope)
return ttconf.SUCCESS
return new_root
def resolve_polytomies(self, merge_compressed=False):
......@@ -540,7 +541,7 @@ class TreeTime(ClockTree):
from .branch_len_interpolator import BranchLenInterpolator
zero_branch_slope = self.gtr.mu*self.seq_len
zero_branch_slope = self.gtr.mu*self.data.full_length
def _c_gain(t, n1, n2, parent):
"""
......@@ -598,13 +599,13 @@ class TreeTime(ClockTree):
# set parameters for the new node
new_node.up = clade
new_node.tt = self
n1.up = new_node
n2.up = new_node
if hasattr(clade, "cseq"):
new_node.cseq = clade.cseq
self._store_compressed_sequence_to_node(new_node)
if hasattr(clade, "_cseq"):
new_node._cseq = clade._cseq
self.add_branch_state(new_node)
new_node.mutations = []
new_node.mutation_length = 0.0
new_node.branch_length_interpolator = BranchLenInterpolator(new_node, self.gtr, one_mutation=self.one_mutation,
branch_length_mode = self.branch_length_mode)
......@@ -860,10 +861,14 @@ def plot_vs_years(tt, step = None, ax=None, confidence=None, ticks=True, **kwarg
tick_vals = [x+offset-shift for x in xticks]
ax.set_xticks(xticks)
ax.set_xticklabels(map(str, tick_vals))
if step>=1:
tick_labels = ["%d"%(int(x)) for x in tick_vals]
else:
tick_labels = ["%1.2f"%(x) for x in tick_vals]
ax.set_xlim((0,date_range))
ax.set_xticklabels(tick_labels)
ax.set_xlabel('year')
ax.set_ylabel('')
ax.set_xlim((0,date_range))
# put shaded boxes to delineate years
if step:
......@@ -878,7 +883,7 @@ def plot_vs_years(tt, step = None, ax=None, confidence=None, ticks=True, **kwarg
edgecolor=[1,1,1])
ax.add_patch(r)
if year in tick_vals and pos>=xlim[0] and pos<=xlim[1] and ticks:
label_str = str(step*(year//step)) if step<1 else str(int(year))
label_str = "%1.2f"%(step*(year//step)) if step<1 else str(int(year))
ax.text(pos,ylim[0]-0.04*(ylim[1]-ylim[0]), label_str,
horizontalalignment='center')
ax.set_axis_off()
......@@ -887,15 +892,13 @@ def plot_vs_years(tt, step = None, ax=None, confidence=None, ticks=True, **kwarg
if confidence:
tree_layout(tt.tree)
if not hasattr(tt.tree.root, "marginal_inverse_cdf"):
print("marginal time tree reconstruction required for confidence intervals")
return ttconf.ERROR
raise NotReadyError("marginal time tree reconstruction required for confidence intervals")
elif type(confidence) is float:
cfunc = tt.get_max_posterior_region
elif len(confidence)==2:
cfunc = tt.get_confidence_interval
else:
print("confidence needs to be either a float (for max posterior region) or a two numbers specifying lower and upper bounds")
return ttconf.ERROR
raise NotReadyError("confidence needs to be either a float (for max posterior region) or a two numbers specifying lower and upper bounds")
for n in tt.tree.find_clades():
pos = cfunc(n, confidence)
......
......@@ -7,7 +7,7 @@ from scipy.interpolate import interp1d
from scipy.integrate import quad
from scipy import stats
from scipy.ndimage import binary_dilation
from treetime import config as ttconf
from treetime import TreeTimeError
class DateConversion(object):
"""
......@@ -101,7 +101,7 @@ class DateConversion(object):
def to_numdate(self, tbp):
"""
Convert the numeric date to the branch-len scale
Convert time before present measured in clock rate units to numeric calendar dates
"""
return numeric_date() - self.to_years(tbp)
......@@ -150,19 +150,66 @@ def numeric_date(dt=None):
date of to be converted. if None, assume today
"""
from calendar import isleap
if dt is None:
dt = datetime.datetime.now()
days_in_year = 366 if isleap(dt.year) else 365
try:
res = dt.year + dt.timetuple().tm_yday / 365.25
res = dt.year + (dt.timetuple().tm_yday-0.5) / days_in_year
except:
res = None
return res
def datetime_from_numeric(numdate):
"""convert a numeric decimal date to a python datetime object
Note that this only works for AD dates since the range of datetime objects
is restricted to year>1.
Parameters
----------
numdate : float
numeric date as in 2018.23
Returns
-------
datetime.datetime
datetime object
"""
from calendar import isleap
days_in_year = 366 if isleap(int(numdate)) else 365
# add a small number of the time elapsed in a year to avoid
# unexpected behavior for values 1/365, 2/365, etc
days_elapsed = int(((numdate%1)+1e-10)*days_in_year)
date = datetime.datetime(int(numdate),1,1) + datetime.timedelta(days=days_elapsed)
return date
def datestring_from_numeric(numdate):
"""convert a numerical date to a formated date string YYYY-MM-DD
Parameters
----------
numdate : float
numeric date as in 2018.23
Returns
-------
str
date string YYYY-MM-DD
"""
if numdate>1900: # python datetime doesn't work for dates before 1900. This can be relaxed to numdate>1 once we drop python 2.7
return datetime.datetime.strftime(datetime_from_numeric(numdate), "%Y-%m-%d")
else:
year = int(np.floor(numdate))
dt = datetime_from_numeric(1900+(numdate%1))
return "%04d-%02d-%02d"%(year, dt.month, dt.day)
def parse_dates(date_file):
def parse_dates(date_file, name_col=None, date_col=None):
"""
parse dates from the arguments and return a dictionary mapping
taxon names to numerical dates.
......@@ -191,7 +238,7 @@ def parse_dates(date_file):
try:
# read the metadata file into pandas dataframe.
df = pd.read_csv(date_file, sep=full_sep, engine='python')
df = pd.read_csv(date_file, sep=full_sep, engine='python', dtype='str')
# check the metadata has strain names in the first column
# look for the column containing sampling dates
# We assume that the dates might be given either in human-readable format
......@@ -212,25 +259,37 @@ def parse_dates(date_file):
if any([x==col.lower() for x in ['name', 'strain', 'accession']]):
potential_index_columns.append((ci, col))
if date_col and date_col not in df.columns:
raise TreeTimeError("ERROR: specified column for dates does not exist. \n\tAvailable columns are: "\
+", ".join(df.columns)+"\n\tYou specified '%s'"%date_col)
if name_col and name_col not in df.columns:
raise TreeTimeError("ERROR: specified column for the taxon name does not exist. \n\tAvailable columns are: "\
+", ".join(df.columns)+"\n\tYou specified '%s'"%name_col)
dates = {}
# if a potential numeric date column was found, use it
# (use the first, if there are more than one)
if not len(potential_index_columns):
print("ERROR: Cannot read metadata: need at least one column that contains the taxon labels."
" Looking for the first column that contains 'name', 'strain', or 'accession' in the header.", file=sys.stderr)
return dates
if not (len(potential_index_columns) or name_col):
raise TreeTimeError("ERROR: Cannot read metadata: need at least one column that contains the taxon labels."
" Looking for the first column that contains 'name', 'strain', or 'accession' in the header.")
else:
# use the first column that is either 'name', 'strain', 'accession'
index_col = sorted(potential_index_columns)[0][1]
if name_col is None:
index_col = sorted(potential_index_columns)[0][1]
else:
index_col = name_col
print("\tUsing column '%s' as name. This needs match the taxon names in the tree!!"%index_col)
if len(potential_date_columns)>=1:
if len(potential_date_columns)>=1 or date_col:
#try to parse the csv file with dates in the idx column:
idx = potential_date_columns[0][0]
col_name = potential_date_columns[0][1]
print("\tUsing column '%s' as date."%col_name)
if date_col is None:
date_col = potential_date_columns[0][1]
print("\tUsing column '%s' as date."%date_col)
for ri, row in df.iterrows():
date_str = row.loc[col_name]
date_str = row.loc[date_col]
k = row.loc[index_col]
# try parsing as a float first
try:
......@@ -255,15 +314,16 @@ def parse_dates(date_file):
dates[k] = [numeric_date(x) for x in [lower, upper]]
else:
print("ERROR: Metadata file has no column which looks like a sampling date!", file=sys.stderr)
raise TreeTimeError("ERROR: Metadata file has no column which looks like a sampling date!")
if all(v is None for v in dates.values()):
print("ERROR: Cannot parse dates correctly! Check date format.", file=sys.stderr)
return {}
raise TreeTimeError("ERROR: Cannot parse dates correctly! Check date format.")
print(dates)
return dates
except TreeTimeError as err:
raise err
except:
print("ERROR: Cannot read the metadata file!", file=sys.stderr)
return {}
raise
def ambiguous_date_to_date_range(mydate, fmt="%Y-%m-%d", min_max_year=None):
......@@ -283,10 +343,9 @@ def ambiguous_date_to_date_range(mydate, fmt="%Y-%m-%d", min_max_year=None):
tuple
upper and lower bounds on the date. return (None, None) if errors
"""
from datetime import datetime
sep = fmt.split('%')[1][-1]
min_date, max_date = {}, {}
today = datetime.today().date()
today = datetime.date.today()
for val, field in zip(mydate.split(sep), fmt.split(sep+'%')):
f = 'year' if 'y' in field.lower() else ('day' if 'd' in field.lower() else 'month')
......@@ -315,8 +374,8 @@ def ambiguous_date_to_date_range(mydate, fmt="%Y-%m-%d", min_max_year=None):
return None, None
max_date['day'] = min(max_date['day'], 31 if max_date['month'] in [1,3,5,7,8,10,12]
else 28 if max_date['month']==2 else 30)
lower_bound = datetime(year=min_date['year'], month=min_date['month'], day=min_date['day']).date()
upper_bound = datetime(year=max_date['year'], month=max_date['month'], day=max_date['day']).date()
lower_bound = datetime.date(year=min_date['year'], month=min_date['month'], day=min_date['day'])
upper_bound = datetime.date(year=max_date['year'], month=max_date['month'], day=max_date['day'])
return (lower_bound, upper_bound if upper_bound<today else today)
......@@ -386,6 +445,7 @@ def build_newick_fasttree(aln_fname, nuc=True):
def build_newick_raxml(aln_fname, nthreads=2, raxml_bin="raxml", **kwargs):
import shutil,os
print("Building tree with raxml")
from Bio import Phylo, AlignIO
AlignIO.write(AlignIO.read(aln_fname, 'fasta'),"temp.phyx", "phylip-relaxed")
cmd = raxml_bin + " -f d -T " + str(nthreads) + " -m GTRCAT -c 25 -p 235813 -n tre -s temp.phyx"
......@@ -397,12 +457,33 @@ def build_newick_iqtree(aln_fname, nthreads=2, iqtree_bin="iqtree",
iqmodel="HKY", **kwargs):
import os
from Bio import Phylo, AlignIO
with open(aln_fname) as ifile:
tmp_seqs = ifile.readlines()
print("Building tree with iqtree")
aln = None
for fmt in ['fasta', 'phylip-relaxed']:
try:
aln = AlignIO.read(aln_fname, fmt)
break
except:
continue
if aln is None:
raise ValueError("failed to read alignment for tree building")
aln_file = "temp.fasta"
with open(aln_file, 'w') as ofile:
for line in tmp_seqs:
ofile.write(line.replace('/', '_X_X_').replace('|','_Y_Y_'))
seq_names = set()
for s in aln:
tmp = s.id
for c, sub in zip('/|()', 'VWXY'):
tmp = tmp.replace(c, '_%s_%s_'%(sub,sub))
if tmp in seq_names:
print("A sequence with name {} already exists, skipping....".format(s.id))
continue
s.id = tmp
s.name = s.id
s.description = ''
seq_names.add(s.id)
AlignIO.write(aln, aln_file, 'fasta')
fast_opts = [
"-ninit", "2",
......@@ -416,7 +497,10 @@ def build_newick_iqtree(aln_fname, nthreads=2, iqtree_bin="iqtree",
os.system(" ".join(call))
T = Phylo.read(aln_file+".treefile", 'newick')
for n in T.get_terminals():
n.name = n.name.replace('_X_X_','/').replace('_Y_Y_','|')
tmp = n.name
for c, sub in zip('/|()', 'VWXY'):
tmp = tmp.replace('_%s_%s_'%(sub,sub), c)
n.name = tmp
return T
if __name__ == '__main__':
......