Skip to content
Commits on Source (5)
......@@ -11,10 +11,12 @@ To optimize the likelihood of time-scaled phylogenies, TreeTime uses an iterativ
The only topology optimization are (optional) resolution of polytomies in a way that is most (approximately) consistent with the sampling time constraints on the tree.
The package is designed to be used as a stand-alone tool on the command-line or as a library used in larger phylogenetic analysis work-flows.
In addition to scripting TreeTime or using it via the command-line, there is also a small web server at [treetime.ch](https://treetime.ch).
In addition to scripting TreeTime or using it via the command-line, there is also a small web server at [treetime.ch](https://treetime.biozentrum.unibas.ch/).
![Molecular clock phylogeny of 200 NA sequences of influenza A H3N2](https://raw.githubusercontent.com/neherlab/treetime_examples/master/figures/tree_and_clock.png)
Have a look at our [examples and tutorials](https://github.com/neherlab/treetime_examples).
#### Features
* ancestral sequence reconstruction (marginal and joint maximum likelihood)
* molecular clock tree inference (marginal and joint maximum likelihood)
......@@ -30,6 +32,7 @@ In addition to scripting TreeTime or using it via the command-line, there is als
+ [Ancestral sequence reconstruction](#ancestral-sequence-reconstruction)
+ [Homoplasy analysis](#homoplasy-analysis)
+ [Mugration analysis](#mugration-analysis)
+ [Metadata and date format](#metadata-and-date-format)
* [Example scripts](#example-scripts)
* [Related tools](#related-tools)
* [Projects using TreeTime](#projects-using-treetime)
......@@ -56,6 +59,10 @@ You may install TreeTime and its dependencies by running
pip install .
```
within this repository.
You can also install TreeTime from PyPi via
```bash
pip install phylo-treetime
```
You might need root privileges for system wide installation. Alternatively, you can simply use it TreeTime locally without installation. In this case, just download and unpack it, and then add the TreeTime folder to your $PYTHONPATH.
......@@ -63,7 +70,8 @@ You might need root privileges for system wide installation. Alternatively, you
### Command-line usage
TreeTime can be used as part of python programs that create and interact with tree time objects. How TreeTime can be used to address typical questions like ancestral sequence reconstruction, rerooting, timetree inference etc is illustrated by a collection of example scripts described below.
In addition, we provide scripts that can be run from the command line with arguments specifying input data and parameters.
In addition, TreeTime can be used from the command line with arguments specifying input data and parameters.
Trees can be read as newick, nexus and phylip files; fasta and phylip are supported alignment formats; metadata and dates can be provided as csv or tsv files, see [below](#metadata-and-date-format) for details.
#### Timetrees
The to infer a timetree, i.e. a phylogenetic tree in which branch length reflect time rather than divergence, TreeTime offers implements the command:
......@@ -113,6 +121,25 @@ where `<field>` is the relevant column in the csv file specifying the metadata `
The full list if options is available by typing `treetime mugration -h`.
Please see [treetime_examples/mugration.md](https://github.com/neherlab/treetime_examples/blob/master/mugration.md) for examples and more documentation.
#### Metadata and date format
Several of TreeTime commands require the user to specify a file with dates and/or other meta data.
TreeTime assumes these files to by either comma (csv) or tab-separated (tsv) files.
The first line of these files is interpreted as header line specifying the content of the columns.
Each file needs to have at least one column that is named `name`, `accession`, or `strain`.
This column needs to contain the names of each sequence and match the names of taxons in the tree if one is provided.
If more than one of `name`, `accession`, or `strain` is found, TreeTime will use the first.
If the analysis requires dates, at least one column name needs to contain `date` (i.e. `sampling date` is fine).
Again, if multiple hits are found, TreeTime will use the first.
TreeTime will attempt to parse dates in the following way and order
| order | type/format | example | description|
| --- |-------------|---------|------------|
| 1| float | 2017.56 | decimal date |
| 2| [float:float] | [2013.45:2015.56] | decimal date range |
| 3| %Y-%m-%d | 2017-08-25 | calendar date in ISO format |
| 4| %Y-XX-XX | 2017-XX-XX | calendar date missing month and/or day |
### Example scripts
The following scripts illustrate how treetime can be used to solve common problem with short python scripts. They are meant to be used in an interactive ipython environment and run as `run examples/ancestral_inference.py`.
......
......@@ -132,6 +132,7 @@ def add_reroot_group(parser):
def add_gtr_arguments(parser):
parser.add_argument('--gtr', default='infer', help=gtr_description)
parser.add_argument('--gtr-params', nargs='+', help=gtr_params_description)
parser.add_argument('--aa', action='store_true', help="use aminoacid alphabet")
def add_anc_arguments(parser):
......@@ -162,15 +163,19 @@ if __name__ == '__main__':
add_reroot_group(t_parser)
add_gtr_arguments(t_parser)
t_parser.add_argument('--clock-rate', type=float, help="if specified, the rate of the molecular clock won't be optimized.")
t_parser.add_argument('--clock-std-dev', type=float, help="standard deviation of the provided clock rate estimate")
t_parser.add_argument('--branch-length-mode', default='auto', type=str, choices=['auto', 'input', 'joint', 'marginal'],
help="If set to 'input', the provided branch length will be used without modification. "
"Note that branch lengths optimized by treetime are only accurate at short evolutionary distances.")
t_parser.add_argument('--confidence', action='store_true', help="estimate confidence intervals of divergence times.")
t_parser.add_argument('--keep-polytomies', default=False, action='store_true',
help="Don't resolve polytomies using temporal information.")
t_parser.add_argument('--relax',nargs='*', default=False,
help='use an autocorrelated molecular clock. Prior strength and coupling of parent '
'and offspring rates can be specified e.g. as --relax 1.0 0.5')
t_parser.add_argument('--relax',nargs=2, type=float,
help='use an autocorrelated molecular clock. Strength of the gaussian priors on'
' branch specific rate deviation and the coupling of parent and offspring'
' rates can be specified e.g. as --relax 1.0 0.5. Values around 1.0 correspond'
' to weak priors, larger values constrain rate deviations more strongly.'
' Coupling 0 (--relax 1.0 0) corresponds to an un-correlated clock.')
t_parser.add_argument('--max-iter', default=2, type=int,
help='maximal number of iterations the inference cycle is run. Note that for polytomy resolution and coalescence models max_iter should be at least 2')
t_parser.add_argument('--coalescent', default="0.0", type=str,
......@@ -181,6 +186,10 @@ if __name__ == '__main__':
t_parser.add_argument('--plot-rtt', default="root_to_tip_regression.pdf",
help = "filename to save the plot to. Suffix will determine format"
" (choices pdf, png, svg, default=pdf)")
t_parser.add_argument('--tip-labels', action='store_true',
help = "add tip labels (default for small trees with <30 leaves)")
t_parser.add_argument('--no-tip-labels', action='store_true',
help = "don't show tip labels (default for small trees with >=30 leaves)")
add_anc_arguments(t_parser)
add_common_args(t_parser)
......
python-treetime (0.5.0-2) unstable; urgency=medium
python-treetime (0.5.2-1) unstable; urgency=medium
* Add Emma Hodcroft to copyright
Closes: #909784
* New upstream version
-- Andreas Tille <tille@debian.org> Fri, 28 Sep 2018 11:25:00 +0200
-- Andreas Tille <tille@debian.org> Wed, 05 Dec 2018 12:00:16 +0100
python-treetime (0.5.0-1) unstable; urgency=medium
......
......@@ -69,7 +69,7 @@ master_doc = 'index'
# General information about the project.
project = u'TreeTime'
copyright = u'2017, Pavel Sagulenko and Richard Neher'
copyright = u'2017-2018, Pavel Sagulenko and Richard Neher'
author = u'Pavel Sagulenko and Richard Neher'
# The version info for the project you're documenting, acts as replacement for
......@@ -77,9 +77,9 @@ author = u'Pavel Sagulenko and Richard Neher'
# built documents.
#
# The short X.Y version.
version = u'0.4.0'
version = u'0.5.0'
# The full version, including alpha/beta/rc tags.
release = u'0.4.0'
release = u'0.5.0'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
......
......@@ -48,41 +48,40 @@ Utility code
Command-line functions
======================
TreeTime is designed to be part of python workflows, but for a number of standard
tasks we have implemented scripts that can be called from the command line like
regular linux programs.
TreeTime is designed to be part of python workflows, but we have exposed a number of standard
tasks via a command-line interface.
The TreeTime command-line tool is called :code:`treetime`.
Examples and documentation of the command-line interface can be found in the github repo https://github.com/neherlab/treetime_examples.
In its standard mode, it will take a tree, an alignment, and file with dates as input and estimate a time-scaled phylogeny.
The full set of options are available via :code:`treetime -h`.
Subcommand :code:`treetime ancestral`
-------------------------------------
This subcommand reconstructs ancestral sequences and maps mutations to the tree.
It produces an alignment file containing inferred ancestral sequences and a tree file
with mutations included as comments. The inferred GTR model is written to stdout.
homoplasy_scanner
-----------------
Subcommand :code:`treetime homoplasy`
-------------------------------------
Reconstructs ancestral sequences and maps mutations to the tree.
The tree is then scanned for homoplasies. An excess number of homoplasies
might suggest contamination, recombination, culture adaptation or similar.
Results are printed to stdout.
ancestral_reconstruction
------------------------
Reconstructs ancestral sequences and maps mutations to the tree.
Produces an alignment file containing inferred ancestral sequences and a tree file
with mutations included as comments. The inferred GTR model is written to stdout.
temporal_signal
Subcommand :code:`treetime clock`
---------------
Calculates the root-to-tip regression and quantifies the 'clock-i-ness' of the tree.
It will reroot the tree to maximize the clock-like
signal and recalculate branch length unless run with --keep_root.
signal and recalculate branch length unless run with :code:`--keep_root`.
timetree_inference
------------------
Reconstructs ancestral sequences and infers a molecular clock tree. Produces
an alignment file containing inferred ancestral sequences and a tree file with
mutations included as comments, and prints the molecular clock and inferred
GTR model.
mugration
---------
Subcommand :code:`treetime mugration`
-------------------------------------
Reconstructs discrete ancestral states, for example geographic location, host, or similar.
Indices and tables
==================
......
......@@ -29,8 +29,6 @@ Additional functionality
.. automethod:: treetime.TreeTime.reroot_to_best_root
.. automethod:: treetime.TreeTime.find_best_root_and_regression
.. automethod:: treetime.TreeTime.plot_root_to_tip
.. automethod:: treetime.TreeTime.print_lh
......
......@@ -27,7 +27,7 @@ setup(
packages=['treetime'],
install_requires = [
'biopython>=1.66',
'matplotlib>=2.0',
'matplotlib>=2.0, ==2.*',
'numpy>=1.10.4',
'pandas>=0.17.1',
'scipy>=0.16.1'
......
......@@ -6,4 +6,4 @@ from .treetime import ttconf as treetime_conf
from .gtr import GTR
from .merger_models import Coalescent
from .treeregression import TreeRegression
version="0.5.0"
version="0.5.2"
......@@ -4,7 +4,7 @@ from .seq_utils import alphabets
def JTT92(mu=1.0):
from gtr import GTR
from .gtr import GTR
# stationary concentrations:
pis = np.array([
0.07674789,
......
......@@ -69,8 +69,10 @@ class ClockTree(TreeAnc):
self.rel_tol_refine = ttconf.REL_TOL_REFINE
self.branch_length_mode = branch_length_mode
self.clock_model=None
self.use_covariation=True # if false, covariation will be ignored in rate estimates.
self._set_precision(precision)
self._assign_dates()
if self._assign_dates()==ttconf.ERROR:
raise ValueError("ClockTree requires date constraints!")
def _assign_dates(self):
......@@ -78,21 +80,28 @@ class ClockTree(TreeAnc):
Returns
-------
TYPE
Description
str
success/error code
"""
if self.tree is None:
self.logger("ClockTree._assign_dates: tree is not set, can't assign dates", 0)
return ttconf.ERROR
bad_branch_counter = 0
for node in self.tree.find_clades(order='postorder'):
if node.name in self.date_dict:
tmp_date = self.date_dict[node.name]
if np.isscalar(tmp_date) and np.isnan(tmp_date):
self.logger("WARNING: ClockTree.init: node %s has a bad date: %s"%(node.name, str(tmp_date)), 2, warn=True)
node.raw_date_constraint = None
node.bad_branch = True
else:
try:
tmp = np.mean(self.date_dict[node.name])
node.raw_date_constraint = self.date_dict[node.name]
tmp = np.mean(tmp_date)
node.raw_date_constraint = tmp_date
node.bad_branch = False
except:
self.logger("WARNING: ClockTree.init: node %s has a bad date: %s"%(node.name, str(self.date_dict[node.name])), 2, warn=True)
self.logger("WARNING: ClockTree.init: node %s has a bad date: %s"%(node.name, str(tmp_date)), 2, warn=True)
node.raw_date_constraint = None
node.bad_branch = True
else: # nodes without date contraints
......@@ -107,6 +116,13 @@ class ClockTree(TreeAnc):
# this node, the branch is marked as 'bad'
node.bad_branch = np.all([x.bad_branch for x in node])
if node.is_terminal() and node.bad_branch:
bad_branch_counter += 1
if bad_branch_counter>self.tree.count_terminals()-3:
self.logger("ERROR: ALMOST NO VALID DATE CONSTRAINTS, EXITING", 1, warn=True)
return ttconf.ERROR
return ttconf.SUCCESS
......@@ -205,7 +221,7 @@ class ClockTree(TreeAnc):
def get_clock_model(self, covariation=True, slope=None):
Treg = self.setup_TreeRegression(covariation=covariation)
self.clock_model = Treg.regression(slope=slope)
if not Treg.valid_confidence:
if not Treg.valid_confidence or (slope is not None):
if 'cov' in self.clock_model:
self.clock_model.pop('cov')
self.clock_model['valid_confidence']=False
......@@ -273,7 +289,7 @@ class ClockTree(TreeAnc):
node.branch_length_interpolator.gamma = gamma
# use covariance in clock model only after initial timetree estimation is done
use_cov = (len(has_clock_length) - np.sum(has_clock_length))<3
use_cov = (np.sum(has_clock_length) > len(has_clock_length)*0.7) and self.use_covariation
self.get_clock_model(covariation=use_cov, slope=clock_rate)
# make node distribution objects
......@@ -312,7 +328,7 @@ class ClockTree(TreeAnc):
Key word arguments to initialize dates constraints
'''
self.logger("ClockTree: Maximum likelihood tree optimization with temporal constraints:",1)
self.logger("ClockTree: Maximum likelihood tree optimization with temporal constraints",1)
self.init_date_constraints(clock_rate=clock_rate, **kwargs)
......@@ -688,12 +704,101 @@ class ClockTree(TreeAnc):
n.branch_length = n.numdate - n.up.numdate
def calc_rate_susceptibility(self, rate_std=None, params=None):
"""return the time tree estimation of evolutionary rates +/- one
standard deviation form the ML estimate.
Returns
-------
TreeTime.return_code : str
success or failure
"""
params = params or {}
if rate_std is None:
if not (self.clock_model['valid_confidence'] and 'cov' in self.clock_model):
self.logger("ClockTree.calc_rate_susceptibility: need valid standard deviation of the clock rate to estimate dating error.", 1, warn=True)
return ttconf.ERROR
rate_std = np.sqrt(self.clock_model['cov'][0,0])
current_rate = self.clock_model['slope']
upper_rate = self.clock_model['slope'] + rate_std
lower_rate = max(0.1*current_rate, self.clock_model['slope'] - rate_std)
for n in self.tree.find_clades():
if n.up:
n.branch_length_interpolator.gamma*=upper_rate/current_rate
self.logger("###ClockTree.calc_rate_susceptibility: run with upper bound of rate estimate", 1)
self.make_time_tree(**params)
self.logger("###ClockTree.calc_rate_susceptibility: rate: %f, LH:%f"%(upper_rate, self.tree.positional_joint_LH), 2)
for n in self.tree.find_clades():
n.numdate_rate_variation = [(upper_rate, n.numdate)]
if n.up:
n.branch_length_interpolator.gamma*=lower_rate/upper_rate
self.logger("###ClockTree.calc_rate_susceptibility: run with lower bound of rate estimate", 1)
self.make_time_tree(**params)
self.logger("###ClockTree.calc_rate_susceptibility: rate: %f, LH:%f"%(lower_rate, self.tree.positional_joint_LH), 2)
for n in self.tree.find_clades():
n.numdate_rate_variation.append((lower_rate, n.numdate))
if n.up:
n.branch_length_interpolator.gamma*=current_rate/lower_rate
self.logger("###ClockTree.calc_rate_susceptibility: run with central rate estimate", 1)
self.make_time_tree(**params)
self.logger("###ClockTree.calc_rate_susceptibility: rate: %f, LH:%f"%(current_rate, self.tree.positional_joint_LH), 2)
for n in self.tree.find_clades():
n.numdate_rate_variation.append((current_rate, n.numdate))
n.numdate_rate_variation.sort(key=lambda x:x[1]) # sort estimates for different rates by numdate
return ttconf.SUCCESS
def date_uncertainty_due_to_rate(self, node, interval=(0.05, 0.095)):
"""use previously calculated variation of the rate to estimate
the uncertainty in a particular numdate due to rate variation.
Parameters
----------
node : PhyloTree.Clade
node for which the confidence interval is to be calculated
interval : tuple, optional
Array of length two, or tuple, defining the bounds of the confidence interval
"""
if hasattr(node, "numdate_rate_variation"):
from scipy.special import erfinv
nsig = [np.sqrt(2.0)*erfinv(-1.0 + 2.0*x) if x*(1.0-x) else 0
for x in interval]
l,c,u = [x[1] for x in node.numdate_rate_variation]
return np.array([c + x*np.abs(y-c) for x,y in zip(nsig, (l,u))])
else:
return None
def combine_confidence(self, center, limits, c1=None, c2=None):
if c1 is None and c2 is None:
return np.array(limits)
elif c1 is None:
min_val,max_val = c2
elif c2 is None:
min_val,max_val = c1
else:
min_val = center - np.sqrt((c1[0]-center)**2 + (c2[0]-center)**2)
max_val = center + np.sqrt((c1[1]-center)**2 + (c2[1]-center)**2)
return np.array([max(limits[0], min_val),
min(limits[1], max_val)])
def get_confidence_interval(self, node, interval = (0.05, 0.95)):
'''
If temporal reconstruction was done using the marginal ML mode, the entire distribution of
times is available. This function determines the 90% (or other) confidence interval, defined as the
range where 5% of probability is below and above. Note that this does not necessarily contain
the highest probability position.
In absense of marginal reconstruction, it will return uncertainty based on rate
variation. If both are present, the wider interval will be returned.
Parameters
----------
......@@ -711,21 +816,28 @@ class ClockTree(TreeAnc):
Array with two numerical dates delineating the confidence interval
'''
rate_contribution = self.date_uncertainty_due_to_rate(node, interval)
if hasattr(node, "marginal_inverse_cdf"):
min_date, max_date = [self.date2dist.to_numdate(x) for x in
(node.marginal_pos_LH.xmax, node.marginal_pos_LH.xmin)]
if node.marginal_inverse_cdf=="delta":
return np.array([node.numdate, node.numdate])
else:
return self.date2dist.to_numdate(node.marginal_inverse_cdf(np.array(interval))[::-1])
mutation_contribution = self.date2dist.to_numdate(node.marginal_inverse_cdf(np.array(interval))[::-1])
else:
return np.array([np.nan, np.nan])
min_date, max_date = [-np.inf, np.inf]
return self.combine_confidence(node.numdate, (min_date, max_date),
c1=rate_contribution, c2=mutation_contribution)
def get_max_posterior_region(self, node, fraction = 0.9):
'''
If temporal reconstruction was done using the marginal ML mode, the entire distribution of
times is available. This function determines the 95% confidence interval defines as the
range where 5% of probability are below and above. Note that this does not necessarily contain
the highest probability position.
times is available. This function determines the interval around the highest
posterior probability region that contains the specified fraction of the probability mass.
In absense of marginal reconstruction, it will return uncertainty based on rate
variation. If both are present, the wider interval will be returned.
Parameters
----------
......@@ -743,18 +855,18 @@ class ClockTree(TreeAnc):
Array with two numerical dates delineating the high posterior region
'''
if not hasattr(node, "marginal_inverse_cdf"):
return np.array([np.nan, np.nan])
if node.marginal_inverse_cdf=="delta":
return np.array([node.numdate, node.numdate])
min_max = (node.marginal_pos_LH.xmin, node.marginal_pos_LH.xmax)
min_date, max_date = [self.date2dist.to_numdate(x) for x in min_max][::-1]
if node.marginal_pos_LH.peak_pos == min_max[0]: #peak on the left
return self.get_confidence_interval(node, (0, fraction))
elif node.marginal_pos_LH.peak_pos == min_max[1]: #peak on the right
return self.get_confidence_interval(node, (1.0-fraction ,1.0))
else: # peak in the center of the distribution
rate_contribution = self.date_uncertainty_due_to_rate(node, ((1-fraction)*0.5, 1.0-(1.0-fraction)*0.5))
# construct height to position interpolators left and right of the peak
# this assumes there is only one peak --- might fail in odd cases
......@@ -775,9 +887,12 @@ class ClockTree(TreeAnc):
# minimza and determine success
sol = minimize(func, bracket=[0,10], args=(fraction,))
if sol['success']:
return self.date2dist.to_numdate(np.array([right(sol['x']), left(sol['x'])]).squeeze())
mutation_contribution = self.date2dist.to_numdate(np.array([right(sol['x']), left(sol['x'])]).squeeze())
else: # on failure, return standard confidence interval
return self.get_confidence_interval(node, (0.5*(1-fraction), 1-0.5*(1-fraction)))
mutation_contribution = None
return self.combine_confidence(node.numdate, (min_date, max_date),
c1=rate_contribution, c2=mutation_contribution)
if __name__=="__main__":
......
......@@ -3,7 +3,6 @@ from collections import defaultdict
import numpy as np
from treetime import config as ttconf
from .seq_utils import alphabets, profile_maps, alphabet_synonyms
from .aa_models import JTT92
class GTR(object):
......@@ -334,6 +333,7 @@ class GTR(object):
"""
from .nuc_models import JC69, K80, F81, HKY85, T92, TN93
from .aa_models import JTT92
if model.lower() in ['jc', 'jc69', 'jukes-cantor', 'jukes-cantor69', 'jukescantor', 'jukescantor69']:
return JC69(**kwargs)
......
......@@ -10,6 +10,7 @@ from .gtr import GTR
string_types = [str] if sys.version_info[0]==3 else [str, unicode]
class TreeAnc(object):
"""
Class defines simple tree object with basic interface methods: reading and
......@@ -82,24 +83,27 @@ class TreeAnc(object):
self.log=log
self.logger("TreeAnc: set-up",1)
self._internal_node_count = 0
self.additional_constant_sites = 0 # sites not part of the alignment but assumed constant
self.use_mutation_length=False
# if not specified, this will be set as 1/alignment_length
# if not specified, this will be set as the alignment_length or reference length
self._seq_len = None
self.seq_len = kwargs['seq_len'] if 'seq_len' in kwargs else None
self.fill_overhangs = fill_overhangs
self.is_vcf = False #this is set true when aln is set, if aln is dict
# if sequences represent multiple samples, this can be added as multiplicity here
self.seq_multiplicity = {} if seq_multiplicity is None else seq_multiplicity
self.multiplicity = None
self.ignore_gaps = ignore_gaps
self._gtr = None
self.set_gtr(gtr or 'JC69', **kwargs)
self._tree = None
self.tree = tree
if tree is None:
raise AttributeError("TreeAnc: tree loading failed! exiting")
# set up GTR model
self._gtr = None
self.set_gtr(gtr or 'JC69', **kwargs)
# will be None if not set
self.ref = ref
......@@ -111,7 +115,9 @@ class TreeAnc(object):
# otherwise self.aln will be None
self._aln = None
self.reduced_to_full_sequence_map = None
self.multiplicity = None
self.aln = aln
if self.aln and self.tree:
if len(self.tree.get_terminals()) != len(self.aln):
print("**WARNING: Number of sequences in tree differs from number of sequences in alignment!**")
......@@ -322,11 +328,30 @@ class TreeAnc(object):
if (not self.is_vcf) and self.convert_upper:
self._aln = MultipleSeqAlignment([seq.upper() for seq in self._aln])
if self.seq_len:
if self.is_vcf and self.seq_len!=len(self.ref):
self.logger("TreeAnc.aln: specified sequence length doesn't match reference length, ignoring sequence length.", 1, warn=True)
self._seq_len = len(self.ref)
else:
self.logger("TreeAnc.aln: specified sequence length doesn't match alignment length. Treating difference as constant sites.", 2, warn=True)
self.additional_constant_sites = max(0, self.seq_len - self.aln.get_alignment_length())
else:
if self.is_vcf:
self.seq_len = len(self.ref)
else:
self.seq_len = self.aln.get_alignment_length()
# check whether the alignment is consistent with a nucleotide alignment.
likely_alphabet = self._guess_alphabet()
from .seq_utils import alphabets
# if likely alignment is not nucleotide but the gtr alignment is, WARN
if likely_alphabet=='aa' and len(self.gtr.alphabet)==len(alphabets['nuc']) and np.all(self.gtr.alphabet==alphabets['nuc']):
self.logger('WARNING: small fraction of ACGT-N in alignment. Really a nucleotide alignment? if not, rerun with --aa', 1, warn=True)
# conversely, warn if alignment is consistent with nucleotide but gtr has a long alphabet
if likely_alphabet=='nuc' and len(self.gtr.alphabet)>10:
self.logger('WARNING: almost exclusively ACGT-N in alignment. Really a protein alignment?', 1, warn=True)
if hasattr(self, '_tree') and (self.tree is not None):
self._attach_sequences_to_nodes()
else:
......@@ -396,6 +421,26 @@ class TreeAnc(object):
"""
self._ref = in_ref
def _guess_alphabet(self):
if self.aln:
if self.is_vcf and self.ref:
total=self.seq_len
nuc_count = 0
for n in 'ACGT-N':
nuc_count += self.ref.upper().count(n)
else:
total=self.seq_len*len(self.aln)
nuc_count = 0
for seq in self.aln:
for n in 'ACGT-N':
nuc_count += seq.seq.upper().count(n)
if nuc_count>0.9*total:
return 'nuc'
else:
return 'aa'
else:
return 'nuc'
def _attach_sequences_to_nodes(self):
'''
......@@ -504,7 +549,7 @@ class TreeAnc(object):
return
else:
aln_transpose = np.array(seqs).T
positions = range(self.seq_len)
positions = range(aln_transpose.shape[0])
for pi in positions:
if self.is_vcf:
......@@ -539,6 +584,31 @@ class TreeAnc(object):
# sequence to the reduced aln<->sequence_pos_indexes map
alignment_patterns[str_pat][1].append(pi)
# add constant alignment column not in the alignment. We don't know where they
# are, so just add them to the end. First, determine sequence composition.
if self.additional_constant_sites:
character_counts = {c:np.sum(aln_transpose==c) for c in self.gtr.alphabet
if c not in [self.gtr.ambiguous, '-']}
total = np.sum(list(character_counts.values()))
additional_columns = [(c,int(np.round(self.additional_constant_sites*n/total)))
for c, n in character_counts.items()]
columns_left = self.additional_constant_sites
pi = len(positions)
for c,n in additional_columns:
if c==additional_columns[-1][0]: # make sure all additions add up to the correct number to avoid rounding
n = columns_left
str_pat = c*len(self.aln)
pos_list = list(range(pi, pi+n))
if str_pat in alignment_patterns:
alignment_patterns[str_pat][1].extend(pos_list)
else:
alignment_patterns[str_pat] = (len(tmp_reduced_aln), pos_list)
tmp_reduced_aln.append(np.array(list(str_pat)))
pi += n
columns_left -= n
# count how many times each column is repeated in the real alignment
self.multiplicity = np.zeros(len(alignment_patterns))
for p, pos in alignment_patterns.values():
......@@ -796,13 +866,14 @@ class TreeAnc(object):
self.logger("TreeAnc.infer_gtr: counting mutations...done", 3)
if print_raw:
print('alphabet:',alpha)
print('n_ij:', nij)
print('T_i:', Ti)
print('n_ij:', nij, nij.sum())
print('T_i:', Ti, Ti.sum())
root_state = np.array([np.sum((self.tree.root.cseq==nuc)*self.multiplicity) for nuc in alpha])
self._gtr = GTR.infer(nij, Ti, root_state, fixed_pi=fixed_pi, pc=pc,
alphabet=self.gtr.alphabet, logger=self.logger,
prof_map = self.gtr.profile_map)
if normalized_rate:
self.logger("TreeAnc.infer_gtr: setting overall rate to 1.0...", 2)
self._gtr.mu=1.0
......@@ -958,7 +1029,7 @@ class TreeAnc(object):
return mut_matrix_stack
def expanded_sequence(self, node):
def expanded_sequence(self, node, include_additional_constant_sites=False):
"""
Get node's compressed sequence and expand it to the real sequence
......@@ -968,13 +1039,18 @@ class TreeAnc(object):
Tree node
Returns
-------
-------f
seq : np.array
Sequence as np.array of chars
"""
seq = np.zeros_like(self.full_to_reduced_sequence_map, dtype='U1')
if include_additional_constant_sites:
L = self.seq_len
else:
L = self.seq_len - self.additional_constant_sites
seq = np.zeros_like(self.full_to_reduced_sequence_map[:L], dtype='U1')
for pos, state in enumerate(node.cseq):
seq[self.reduced_to_full_sequence_map[pos]] = state
full_pos = self.reduced_to_full_sequence_map[pos]
seq[full_pos[full_pos<L]] = state
return seq
......
......@@ -245,7 +245,7 @@ class TreeRegression(object):
a vector of length 6 containing the updated quantities
"""
if n.is_terminal() and outgroup==False:
if tv is None:
if tv is None or np.isinf(tv) or np.isnan(tv):
res = np.array([0, 0, 0, 0, 0, 0])
elif var==0:
res = np.array([np.inf, np.inf, np.inf, np.inf, np.inf, np.inf])
......@@ -332,7 +332,6 @@ class TreeRegression(object):
bv = self.branch_value(n)
var = self.branch_variance(n)
x, chisq = self._optimal_root_along_branch(n, tv, bv, var)
if (chisq<best_root["chisq"]):
tmpQ = self.propagate_averages(n, tv, bv*x, var*x) \
+ self.propagate_averages(n, tv, bv*(1-x), var*(1-x), outgroup=True)
......@@ -341,8 +340,16 @@ class TreeRegression(object):
best_root = {"node":n, "split":x}
best_root.update(reg)
if 'node' not in best_root:
print("TreeRegression.find_best_root: Not valid root found!", force_positive)
return None
# calculate differentials with respect to x
deriv = []
n = best_root["node"]
tv = self.tip_value(n)
bv = self.branch_value(n)
var = self.branch_variance(n)
for dx in [-0.001, 0.001]:
y = min(1.0, max(0.0, best_root["split"]+dx))
tmpQ = self.propagate_averages(n, tv, bv*y, var*y) \
......
......@@ -33,7 +33,8 @@ class TreeTime(ClockTree):
def run(self, root=None, infer_gtr=True, relaxed_clock=None, n_iqd = None,
resolve_polytomies=True, max_iter=0, Tc=None, fixed_clock_rate=None,
time_marginal=False, sequence_marginal=False, branch_length_mode='auto', **kwargs):
time_marginal=False, sequence_marginal=False, branch_length_mode='auto',
vary_rate=False, use_covariation=True, **kwargs):
"""
Run TreeTime reconstruction. Based on the input parameters, it divides
......@@ -43,7 +44,6 @@ class TreeTime(ClockTree):
Parameters
----------
root : str
Try to find better root position on a given tree. If string is passed,
the root will be searched according to the specified method. If none,
......@@ -77,7 +77,7 @@ class TreeTime(ClockTree):
If Tc is float, it is interpreted as the coalescence time scale
If Tc is str, it should be one of (:code:`opt`, :code:`skyline`)
If Tc is str, it should be one of (:code:`opt`, :code:`const`, :code:`skyline`)
fixed_clock_rate : float
Fixed clock rate to be used. If None, infer clock rate from the molecular clock.
......@@ -85,6 +85,9 @@ class TreeTime(ClockTree):
time_marginal : bool
If True, perform a final round of marginal reconstruction of the node's positions.
sequence_marginal : bool, optional
use marginal reconstruction for ancestral sequences
branch_length_mode : str
Should be one of: :code:`joint`, :code:`marginal`, :code:`input`.
......@@ -93,12 +96,30 @@ class TreeTime(ClockTree):
Otherwise, perform preliminary sequence reconstruction using parsimony
algorithm and do branch length optimization
vary_rate : bool or float, optional
redo the time tree estimation for rates +/- one standard deviation.
if a float is passed, it is interpreted as standard deviation,
otherwise this standard deviation is estimated from the root-to-tip regression
use_covariation : bool, optional
default True, if False, rate estimates will be performed using simple
regression ignoring phylogenetic covaration between nodes.
**kwargs
Keyword arguments needed by the dowstream functions
Returns
-------
TreeTime error/succces code : str
return value depending on success or error
"""
# register the specified covaration mode
self.use_covariation = use_covariation
if (self.tree is None) or (self.aln is None and self.seq_len is None):
self.logger("TreeTime.run: ERROR, alignment or tree are missing", 0)
return ttconf.ERROR
......@@ -214,6 +235,18 @@ class TreeTime(ClockTree):
self.logger("###TreeTime.run: CONVERGED",0)
break
# if the rate is too be varied and the rate estimate has a valid confidence interval
# rerun the estimation for variations of the rate
if vary_rate:
if type(vary_rate)==float:
res = self.calc_rate_susceptibility(rate_std=vary_rate, params=tt_kwargs)
elif self.clock_model['valid_confidence']:
res = self.calc_rate_susceptibility(params=tt_kwargs)
else:
res = ttconf.ERROR
if res==ttconf.ERROR:
self.logger("TreeTime.run: rate variation failed and can't be used for confidence estimation", 1, warn=True)
# if marginal reconstruction requested, make one more round with marginal=True
# this will set marginal_pos_LH, which to be used as error bar estimations
......@@ -394,7 +427,7 @@ class TreeTime(ClockTree):
#(Without outgroup_branch_length, gives a trifurcating root, but this will mean
#mutations may have to occur multiple times.)
self.tree.root_with_outgroup(new_root, outgroup_branch_length=new_root.branch_length/2)
self.get_clock_model(covariation=True)
self.get_clock_model(covariation=self.use_covariation)
if new_root == ttconf.ERROR:
......@@ -419,7 +452,7 @@ class TreeTime(ClockTree):
n.clock_length = n.branch_length
self.prepare_tree()
self.get_clock_model(covariation='ML' in root)
self.get_clock_model(covariation=(('ML' in root) and self.use_covariation))
return ttconf.SUCCESS
......@@ -682,6 +715,7 @@ class TreeTime(ClockTree):
act_len = node.clock_length if hasattr(node, 'clock_length') else node.branch_length
# opt_len \approx 1.0*len(node.mutations)/node.profile.shape[0] but calculated via gtr model
# stiffness is the expectation of the inverse variance of branch length (one_mutation/opt_len)
# contact term: stiffness*(g*bl - bl_opt)^2 + slack(g-1)^2 =
# (slack+bl^2) g^2 - 2 (bl*bl_opt+1) g + C= k2 g^2 + k1 g + C
node._k2 = slack + c*act_len**2/(opt_len+self.one_mutation)
......@@ -763,14 +797,15 @@ def plot_vs_years(tt, step = None, ax=None, confidence=None, ticks=True, **kwarg
'''
import matplotlib.pyplot as plt
tt.branch_length_to_years()
nleafs = tt.tree.count_terminals()
if ax is None:
fig = plt.figure()
fig = plt.figure(figsize=(12,10))
ax = plt.subplot(111)
else:
fig = None
# draw tree
if "label_func" not in kwargs:
nleafs = tt.tree.count_terminals()
kwargs["label_func"] = lambda x:x.name if (x.is_terminal() and nleafs<30) else ""
Phylo.draw(tt.tree, axes=ax, **kwargs)
......@@ -784,13 +819,13 @@ def plot_vs_years(tt, step = None, ax=None, confidence=None, ticks=True, **kwarg
step/=5
elif date_range/step<5:
step/=2
step = max(1,step)
step = max(1.0/12,step)
# set axis labels
if step:
dtick = step
min_tick = step*(offset//step)
extra = 0 if dtick<date_range else dtick
extra = dtick if dtick<date_range else dtick
tick_vals = np.arange(min_tick, min_tick+date_range+extra, dtick)
xticks = tick_vals - offset
else:
......@@ -819,7 +854,8 @@ def plot_vs_years(tt, step = None, ax=None, confidence=None, ticks=True, **kwarg
edgecolor=[1,1,1])
ax.add_patch(r)
if year in tick_vals and pos>=xlim[0] and pos<=xlim[1] and ticks:
ax.text(pos,ylim[0]-0.04*(ylim[1]-ylim[0]),str(int(year)),
label_str = str(step*(year//step)) if step<1 else str(int(year))
ax.text(pos,ylim[0]-0.04*(ylim[1]-ylim[0]), label_str,
horizontalalignment='center')
ax.set_axis_off()
......
......@@ -181,9 +181,10 @@ def parse_dates(date_file):
It will first try to parse the column as float, than via
pandas.to_datetime and finally as ambiguous date such as 2018-05-XX
"""
print("\nAttempting to parse dates...")
dates = {}
if not os.path.isfile(date_file):
print("\n ERROR: file %s does not exist, exiting..."%date_file)
print("\n\tERROR: file %s does not exist, exiting..."%date_file)
return dates
# separator for the csv/tsv file. If csv, we'll strip extra whitespace around ','
full_sep = '\t' if date_file.endswith('.tsv') else r'\s*,\s*'
......@@ -208,46 +209,60 @@ def parse_dates(date_file):
df.iloc[i,ci] = tmp_d.strip(d[0])
if 'date' in col.lower():
potential_date_columns.append((ci, col))
if any([x in col.lower() for x in ['name', 'strain', 'accession']]):
if any([x==col.lower() for x in ['name', 'strain', 'accession']]):
potential_index_columns.append((ci, col))
dates = {}
# if a potential numeric date column was found, use it
# (use the first, if there are more than one)
if not len(potential_index_columns):
print("Cannot read metadata: need at least one column that contains the taxon labels."
print("ERROR: Cannot read metadata: need at least one column that contains the taxon labels."
" Looking for the first column that contains 'name', 'strain', or 'accession' in the header.", file=sys.stderr)
return {}
return dates
else:
# use the first column that is either 'name', 'strain', 'accession'
index_col = sorted(potential_index_columns)[0][1]
print("\tUsing column '%s' as name. This needs match the taxon names in the tree!!"%index_col)
if len(potential_date_columns)>=1:
#try to parse the csv file with dates in the idx column:
idx = potential_date_columns[0][0]
col_name = potential_date_columns[0][1]
dates = {}
print("\tUsing column '%s' as date."%col_name)
for ri, row in df.iterrows():
date_str = row.loc[col_name]
k = row.loc[index_col]
# try parsing as a float first
try:
dates[k] = float(date_str)
continue
except ValueError:
# try whether the date string can be parsed as [2002.2:2004.3]
# to indicate general ambiguous ranges
if date_str[0]=='[' and date_str[-1]==']' and len(date_str[1:-1].split(':'))==2:
try:
dates[k] = [float(x) for x in date_str[1:-1].split(':')]
continue
except ValueError:
pass
# try date format parsing 2017-08-12
try:
tmp_date = pd.to_datetime(date_str)
dates[k] = numeric_date(tmp_date)
except ValueError:
except ValueError: # try ambiguous date format parsing 2017-XX-XX
lower, upper = ambiguous_date_to_date_range(date_str, '%Y-%m-%d')
if lower is not None:
dates[k] = [numeric_date(x) for x in [lower, upper]]
else:
print("Metadata file has no column which looks like a sampling date!", file=sys.stderr)
print("ERROR: Metadata file has no column which looks like a sampling date!", file=sys.stderr)
if all(v is None for v in dates.values()):
print("Cannot parse dates correctly! Check date format.", file=sys.stderr)
print("ERROR: Cannot parse dates correctly! Check date format.", file=sys.stderr)
return {}
return dates
except:
print("Cannot read the metadata file!", file=sys.stderr)
print("ERROR: Cannot read the metadata file!", file=sys.stderr)
return {}
......
......@@ -11,6 +11,7 @@ from treetime import TreeAnc, GTR, TreeTime
from treetime import config as ttconf
from treetime import utils
from treetime.vcf_utils import read_vcf, write_vcf
from treetime.seq_utils import alphabets
def assure_tree(params, tmp_dir='treetime_tmp'):
"""
......@@ -39,7 +40,7 @@ def create_gtr(params):
model = params.gtr
gtr_params = params.gtr_params
if model == 'infer':
gtr = GTR.standard('jc')
gtr = GTR.standard('jc', alphabet='aa' if params.aa else 'nuc')
else:
try:
kwargs = {}
......@@ -60,7 +61,7 @@ def create_gtr(params):
infer_gtr = False
except:
print ("Could not create GTR model from input arguments. Using default (Jukes-Cantor 1969)")
gtr = GTR.standard('jc')
gtr = GTR.standard('jc', alphabet='aa' if params.aa else 'nuc')
infer_gtr = False
return gtr
......@@ -138,7 +139,8 @@ def read_if_vcf(params):
aln = sequences
if not hasattr(params, 'gtr') or params.gtr=="infer": #if not specified, set it:
fixed_pi = [ref.count(base)/len(ref) for base in ['A','C','G','T','-']]
alpha = alphabets['aa'] if params.aa else alphabets['nuc']
fixed_pi = [ref.count(base)/len(ref) for base in alpha]
if fixed_pi[-1] == 0:
fixed_pi[-1] = 0.05
fixed_pi = [v-0.01 for v in fixed_pi]
......@@ -185,8 +187,7 @@ def export_sequences_and_tree(tt, basename, is_vcf=False, zero_based=False,
fh_dates.write('%s\t%s\t%f\t%f\t%f\n'%(n.name, n.date, n.numdate,conf[0], conf[1]))
else:
fh_dates.write('%s\t%s\t%f\n'%(n.name, n.date, n.numdate))
if n.up is None:
continue
n.confidence=None
# due to a bug in older versions of biopython that truncated filenames in nexus export
# we truncate them by hand and make them unique.
......@@ -204,14 +205,23 @@ def export_sequences_and_tree(tt, basename, is_vcf=False, zero_based=False,
n.comment+=(',' if n.comment else '&') + 'date=%1.2f'%n.numdate
# write tree to file
fmt_bl = "%1.6f" if tt.seq_len<1e6 else "%1.8e"
if timetree:
outtree_name = basename + 'timetree.nexus'
print("--- saved divergence times in \n\t %s\n"%dates_fname)
Phylo.write(tt.tree, outtree_name, 'nexus')
else:
outtree_name = basename + 'annotated_tree.nexus'
Phylo.write(tt.tree, outtree_name, 'nexus')
Phylo.write(tt.tree, outtree_name, 'nexus', format_branch_length=fmt_bl)
print("--- tree saved in nexus format as \n\t %s\n"%outtree_name)
if timetree:
for n in tt.tree.find_clades():
n.branch_length = n.mutation_length
outtree_name = basename + 'divergence_tree.nexus'
Phylo.write(tt.tree, outtree_name, 'nexus', format_branch_length=fmt_bl)
print("--- divergence tree saved in nexus format as \n\t %s\n"%outtree_name)
def print_save_plot_skyline(tt, n_std=2.0, screen=True, save='', plot=''):
if plot:
......@@ -452,8 +462,13 @@ def timetree(params):
"""
implementeing treetime tree
"""
if params.relax==[]:
params.relax=True
if params.relax is None:
relaxed_clock_params = None
elif params.relax==[]:
relaxed_clock_params=True
elif len(params.relax)==2:
relaxed_clock_params={'slack':params.relax[0], 'coupling':params.relax[1]}
dates = utils.parse_dates(params.dates)
if len(dates)==0:
......@@ -474,10 +489,13 @@ def timetree(params):
aln, ref, fixed_pi = read_if_vcf(params)
is_vcf = True if ref is not None else False
branch_length_mode = params.branch_length_mode
if is_vcf: #variable-site-only trees can have big branch lengths, setting this wrong.
#variable-site-only trees can have big branch lengths, the auto setting won't work.
if is_vcf or (params.aln and params.sequence_length):
if branch_length_mode == 'auto':
branch_length_mode = 'joint'
###########################################################################
### SET-UP and RUN
###########################################################################
......@@ -488,6 +506,10 @@ def timetree(params):
aln=aln, gtr=gtr, seq_len=params.sequence_length,
verbose=params.verbose)
if not myTree.one_mutation:
print("TreeTime setup failed, exiting")
return 1
# coalescent model options
try:
coalescent = float(params.coalescent)
......@@ -501,14 +523,18 @@ def timetree(params):
"a float, 'opt', 'const' or 'skyline'")
coalescent = None
vary_rate = params.confidence
if params.clock_std_dev and params.clock_rate:
vary_rate = params.clock_std_dev
root = None if params.keep_root else params.reroot
success = myTree.run(root=root, relaxed_clock=params.relax,
success = myTree.run(root=root, relaxed_clock=relaxed_clock_params,
resolve_polytomies=(not params.keep_polytomies),
Tc=coalescent, max_iter=params.max_iter,
fixed_clock_rate=params.clock_rate,
n_iqd=params.clock_filter,
time_marginal="assign" if params.confidence else False,
vary_rate = vary_rate,
branch_length_mode = branch_length_mode,
fixed_pi=fixed_pi)
if success==ttconf.ERROR: # if TreeTime.run failed, exit
......@@ -524,20 +550,23 @@ def timetree(params):
print(myTree.date2dist)
basename = get_basename(params, outdir)
if coalescent in ['skyline', 'opt']:
if coalescent in ['skyline', 'opt', 'const']:
print("Inferred coalescent model")
if coalescent=='skyline':
print_save_plot_skyline(myTree, plot=basename+'skyline.pdf', save=basename+'skyline.tsv', screen=True)
elif coalescent=='opt':
else:
Tc = myTree.merger_model.Tc.y[0]
print(" --T_c: \t %1.4f \toptimized inverse merger rate"%Tc)
print(" --N_e: \t %1.1f \tcorresponding pop size assument 50 gen/year\n"%(Tc/myTree.date2dist.clock_rate*50))
print(" --T_c: \t %1.2e \toptimized inverse merger rate in units of substitutions"%Tc)
print(" --T_c: \t %1.2e \toptimized inverse merger rate in years"%(Tc/myTree.date2dist.clock_rate))
print(" --N_e: \t %1.2e \tcorresponding 'effective population size' assuming 50 gen/year\n"%(Tc/myTree.date2dist.clock_rate*50))
# plot
import matplotlib.pyplot as plt
from .treetime import plot_vs_years
leaf_count = myTree.tree.count_terminals()
label_func = lambda x: x.name[:20] if (leaf_count<30 & x.is_terminal()) else ''
label_func = lambda x: (x.name if x.is_terminal() and ((leaf_count<30
and (not params.no_tip_labels))
or params.tip_labels) else '')
plot_vs_years(myTree, show_confidence=False, label_func=label_func,
confidence=0.9 if params.confidence else None)
......@@ -545,10 +574,21 @@ def timetree(params):
plt.savefig(tree_fname)
print("--- saved tree as \n\t %s\n"%tree_fname)
plot_rtt(myTree, outdir + params.plot_rtt)
if params.relax:
fname = outdir+'substitution_rates.tsv'
print("--- wrote branch specific rates to\n\t %s\n"%fname)
with open(fname, 'w') as fh:
fh.write("#node\tclock_length\tmutation_length\trate\tfold_change\n")
for n in myTree.tree.find_clades(order="preorder"):
if n==myTree.tree.root:
continue
g = n.branch_length_interpolator.gamma
fh.write("%s\t%1.3e\t%1.3e\t%1.3e\t%1.2f\n"%(n.name, n.clock_length, n.mutation_length, myTree.date2dist.clock_rate*g, g))
export_sequences_and_tree(myTree, basename, is_vcf, params.zero_based,
timetree=True, confidence=params.confidence)
plot_rtt(myTree, outdir + params.plot_rtt)
return 0
......@@ -810,7 +850,13 @@ def estimate_clock_model(params):
ofile.write("#Dates of nodes that didn't have a specified date are inferred from the root-to-tip regression.\n")
for n in myTree.tree.get_terminals():
if hasattr(n, "raw_date_constraint") and (n.raw_date_constraint is not None):
ofile.write("%s, %f, %f\n"%(n.name, n.raw_date_constraint, n.dist2root))
if np.isscalar(n.raw_date_constraint):
tmp_str = str(n.raw_date_constraint)
elif len(n.raw_date_constraint):
tmp_str = str(n.raw_date_constraint[0])+'-'+str(n.raw_date_constraint[1])
else:
tmp_str = ''
ofile.write("%s, %s, %f\n"%(n.name, tmp_str, n.dist2root))
else:
ofile.write("%s, %f, %f\n"%(n.name, d2d.numdate_from_dist2root(n.dist2root), n.dist2root))
for n in myTree.tree.get_nonterminals(order='preorder'):
......