Steffen Möller · Steffen Möller · Steffen Möller · 3b0003c7 · 84772318 · 3b0003c7
--- a/README.md
+++ b/README.md
@@ -18,7 +18,7 @@ In addition to scripting TreeTime or using it via the command-line, there is als

 ![Molecular clock phylogeny of 200 NA sequences of influenza A H3N2](https://raw.githubusercontent.com/neherlab/treetime_examples/master/figures/tree_and_clock.png)

-Have a look at our [examples and tutorials](https://github.com/neherlab/treetime_examples).
+Have a look at our repository with [example data](https://github.com/neherlab/treetime_examples) and the [tutorials](https://treetime.readthedocs.io/en/latest/tutorials.html).

 #### Features
 * ancestral sequence reconstruction (marginal and joint maximum likelihood)
@@ -83,7 +83,7 @@ The to infer a timetree, i.e. a phylogenetic tree in which branch length reflect
  treetime --aln <input.fasta> --tree <input.nwk> --dates <dates.csv>
 ```
 This command will infer a time tree, ancestral sequences, a GTR model, and optionally confidence intervals and coalescent models.
-A detailed explanation is of this command with its various options and examples are available at [treetime_examples/timetree.md](http://github.com/neherlab/treetime_examples/blob/master/timetree.md)
+A detailed explanation of this command with its various options and examples is available in [the documentation on readthedocs.org](https://treetime.readthedocs.io/en/latest/tutorials/timetree.html).


 #### Rerooting and substitution rate estimation
@@ -93,7 +93,7 @@ To explore the temporal signal in the data and estimate the substitution rate (i
 ```
 The full list if options is available by typing `treetime clock -h`.
 Instead of an input alignment, `--sequence-length <L>` can be provided.
-Documentation of additional options and examples are available at [treetime_examples/clock.md](https://github.com/neherlab/treetime_examples/blob/master/clock.md)
+Additional options and examples are described in [the documentation on readthedocs.org](https://treetime.readthedocs.io/en/latest/tutorials/clock.html).


 #### Ancestral sequence reconstruction:
@@ -103,7 +103,7 @@ The subcommand
 ```
 will reconstruct ancestral sequences at internal nodes of the input tree.
 The full list if options is available by typing `treetime ancestral -h`.
-A detailed explanation of `treetime ancestral` with examples is available at [treetime_examples/ancestral.md](https://github.com/neherlab/treetime_examples/blob/master/ancestral.md)
+A detailed explanation of `treetime ancestral` with examples is available in [the documentation on readthedocs.org](https://treetime.readthedocs.io/en/latest/tutorials/ancestral.html).

 #### Homoplasy analysis
 Detecting and quantifying homoplasies or recurrent mutations is useful to check for recombination, putative adaptive sites, or contamination.
@@ -112,7 +112,7 @@ TreeTime provides a simple command to summarize homoplasies in data
  treetime homoplasy --aln <input.fasta> --tree <input.nwk>
 ```
 The full list if options is available by typing `treetime homoplasy -h`.
-Please see [treetime_examples/homoplasy.md](https://github.com/neherlab/treetime_examples/blob/master/homoplasy.md) for examples and more documentation.
+Please see [the documentation on readthedocs.org](https://treetime.readthedocs.io/en/latest/tutorials/homoplasy.html) for examples and more documentation.

 #### Mugration analysis
 Migration between discrete geographic regions, host switching, or other transition between discrete states are often parameterized by time-reversible models analogous to models describing evolution of genome sequences.
@@ -123,7 +123,7 @@ TreeTime GTR model machinery can be used to infer mugration models:
 ```
 where `<field>` is the relevant column in the csv file specifying the metadata `states.csv`, e.g. `<field>=country`.
 The full list if options is available by typing `treetime mugration -h`.
-Please see [treetime_examples/mugration.md](https://github.com/neherlab/treetime_examples/blob/master/mugration.md) for examples and more documentation.
+Please see [the documentation on readthedocs.org](https://treetime.readthedocs.io/en/latest/tutorials/mugration.html) for examples and more documentation.

 #### Metadata and date format
 Several of TreeTime commands require the user to specify a file with dates and/or other meta data.
@@ -177,12 +177,6 @@ The API documentation for the TreeTime package is generated created with Sphinx.
  pip install Sphinx
  ```

-  - basicstrap Html theme for sphinx:
-
-  ```bash
-  pip install sphinxjp.themes.basicstrap
-  ```
-
 After required packages are installed, navigate to doc directory, and build the docs by typing:

 ```bash

--- a/__init__.py
+++ b/__init__.py
-import datetime
-from treetime.treeanc import TreeAnc
-from treetime.clock_tree import ClockTree
-from treetime.treetime import TreeTime
-from treetime.treetime import ttconf as treetime_conf
-from treetime.gtr import GTR
-from treetime.treetime import plot_vs_years
-from treetime.treetime import treetime_to_newick
-from treetime.tree_regression import TreeRegression
-from treetime.merger_models import Coalescent
-import treetime.seq_utils as seq_utils
-from treetime.utils import numeric_date
--- a/changelog.md
+++ b/changelog.md
+# 0.7.0 -- restructuring
+
+## Major changes
+This release largely includes changes under the hood, some of which also affect how treetime behaves. The biggest changes are
+ * sequence data handling is now done by a separate class `SequenceData`. There is now a clear distinction between input data that is never changed and inferred sequences. This class also provides a consolidated set of functions to convert sparse, compressed, and full sequence representations into each other.
+ * sequences are now unicode when running from python3. This does not seem to come with a measurable performance hit compared to byte sequences as long as all characters are ASCII. Moving away from bytes to unicode proved much less hassle than converting sequences back and forth from unicode to bytes during IO.
+ * Ancestral state reconstruction no longer reconstructs the state of terminal nodes by default and sequence accessors and output will return the input data by default. Reconstruction is optional.
+ * The command-line mugration model inference now optimizes the overall rate numerically and hence no longer makes a short-branch-length assumption.
+ * TreeTime now raises a number of custom errors rather than returning success or error codes. This should result in fewer "silent errors" that cause problems downstream.
+
+## Minor new features
+In addition, we implemented a number of other changes to the interface
+ * `treetime` and `treetime clock` now accept the arguments `--name-column` and `--date-column` to explicitly specify the metadata columns to be used as name or date
+ * `treetime mugration` accepts a `--name-column` argument.
+
+## Bug fixes
+ * scaling of skyline confidence intervals was wrong. It now reflects the inverse second derivative in log-space
+ * catch problems after rerooting associated with missing attributes in the newly generated root node.
+ * make conversion from calendar dates to numeric dates and vice versa compatible and remove approximate handling of leap-years.
+ * avoid overwriting content of output directory with default names
+ * don't export inferred dates of tips labeled as `bad_branch`.
\ No newline at end of file
--- a/contributing.md
+++ b/contributing.md
+# Contributing to TreeTime
+
+Thank you for your interest in contributing to TreeTime.
+We welcome pull-requests that fix bugs or implement new features. 
+
+## Bugs
+If you come across a bug or unexpected behavior, please file an issue. 
+
+## Testing
+Upon pushing a commit, travis will run a few simple tests. These use data available in the [neherlab/treetime_examples](https://github.com/neherlab/treetime_examples) repository.
+
+## Coding conventions (loosely adhered to)
+
+  * indentation: 4 spaces
+  * docstrings: numpy style
+  * variable names: snake_case
--- a/test/test_treetime.py
+++ b/test/test_treetime.py
@@ -45,7 +45,7 @@ def test_ancestral():
        t = TreeAnc(gtr='Jukes-Cantor', tree=nwk, aln=fasta)
        print('ancestral reconstruction' + ("marginal" if marginal else "joint"))
        t.reconstruct_anc(method='ml', marginal=marginal)
-        assert "".join(t.tree.root.sequence) == 'ATGAATCCAAATCAAAAGATAATAACGATTGGCTCTGTTTCTCTCACCATTTCCACAATATGCTTCTTCATGCAAATTGCCATCTTGATAACTACTGTAACATTGCATTTCAAGCAATATGAATTCAACTCCCCCCCAAACAACCAAGTGATGCTGTGTGAACCAACAATAATAGAAAGAAACATAACAGAGATAGTGTATCTGACCAACACCACCATAGAGAAGGAAATATGCCCCAAACCAGCAGAATACAGAAATTGGTCAAAACCGCAATGTGGCATTACAGGATTTGCACCTTTCTCTAAGGACAATTCGATTAGGCTTTCCGCTGGTGGGGACATCTGGGTGACAAGAGAACCTTATGTGTCATGCGATCCTGACAAGTGTTATCAATTTGCCCTTGGACAGGGAACAACACTAAACAACGTGCATTCAAATAACACAGTACGTGATAGGACCCCTTATCGGACTCTATTGATGAATGAGTTGGGTGTTCCTTTTCATCTGGGGACCAAGCAAGTGTGCATAGCATGGTCCAGCTCAAGTTGTCACGATGGAAAAGCATGGCTGCATGTTTGTATAACGGGGGATGATAAAAATGCAACTGCTAGCTTCATTTACAATGGGAGGCTTGTAGATAGTGTTGTTTCATGGTCCAAAGAAATTCTCAGGACCCAGGAGTCAGAATGCGTTTGTATCAATGGAACTTGTACAGTAGTAATGACTGATGGAAGTGCTTCAGGAAAAGCTGATACTAAAATACTATTCATTGAGGAGGGGAAAATCGTTCATACTAGCACATTGTCAGGAAGTGCTCAGCATGTCGAAGAGTGCTCTTGCTATCCTCGATATCCTGGTGTCAGATGTGTCTGCAGAGACAACTGGAAAGGCTCCAATCGGCCCATCGTAGATATAAACATAAAGGATCATAGCATTGTTTCCAGTTATGTGTGTTCAGGACTTGTTGGAGACACACCCAGAAAAAACGACAGCTCCAGCAGTAGCCATTGTTTGGATCCTAACAATGAAGAAGGTGGTCATGGAGTGAAAGGCTGGGCCTTTGATGATGGAAATGACGTGTGGATGGGAAGAACAATCAACGAGACGTCACGCTTAGGGTATGAAACCTTCAAAGTCATTGAAGGCTGGTCCAACCCTAAGTCCAAATTGCAGATAAATAGGCAAGTCATAGTTGACAGAGGTGATAGGTCCGGTTATTCTGGTATTTTCTCTGTTGAAGGCAAAAGCTGCATCAATCGGTGCTTTTATGTGGAGTTGATTAGGGGAAGAAAAGAGGAAACTGAAGTCTTGTGGACCTCAAACAGTATTGTTGTGTTTTGTGGCACCTCAGGTACATATGGAACAGGCTCATGGCCTGATGGGGCGGACCTCAATCTCATGCCTATA'
+        assert t.data.compressed_to_full_sequence(t.tree.root.cseq, as_string=True) == 'ATGAATCCAAATCAAAAGATAATAACGATTGGCTCTGTTTCTCTCACCATTTCCACAATATGCTTCTTCATGCAAATTGCCATCTTGATAACTACTGTAACATTGCATTTCAAGCAATATGAATTCAACTCCCCCCCAAACAACCAAGTGATGCTGTGTGAACCAACAATAATAGAAAGAAACATAACAGAGATAGTGTATCTGACCAACACCACCATAGAGAAGGAAATATGCCCCAAACCAGCAGAATACAGAAATTGGTCAAAACCGCAATGTGGCATTACAGGATTTGCACCTTTCTCTAAGGACAATTCGATTAGGCTTTCCGCTGGTGGGGACATCTGGGTGACAAGAGAACCTTATGTGTCATGCGATCCTGACAAGTGTTATCAATTTGCCCTTGGACAGGGAACAACACTAAACAACGTGCATTCAAATAACACAGTACGTGATAGGACCCCTTATCGGACTCTATTGATGAATGAGTTGGGTGTTCCTTTTCATCTGGGGACCAAGCAAGTGTGCATAGCATGGTCCAGCTCAAGTTGTCACGATGGAAAAGCATGGCTGCATGTTTGTATAACGGGGGATGATAAAAATGCAACTGCTAGCTTCATTTACAATGGGAGGCTTGTAGATAGTGTTGTTTCATGGTCCAAAGAAATTCTCAGGACCCAGGAGTCAGAATGCGTTTGTATCAATGGAACTTGTACAGTAGTAATGACTGATGGAAGTGCTTCAGGAAAAGCTGATACTAAAATACTATTCATTGAGGAGGGGAAAATCGTTCATACTAGCACATTGTCAGGAAGTGCTCAGCATGTCGAAGAGTGCTCTTGCTATCCTCGATATCCTGGTGTCAGATGTGTCTGCAGAGACAACTGGAAAGGCTCCAATCGGCCCATCGTAGATATAAACATAAAGGATCATAGCATTGTTTCCAGTTATGTGTGTTCAGGACTTGTTGGAGACACACCCAGAAAAAACGACAGCTCCAGCAGTAGCCATTGTTTGGATCCTAACAATGAAGAAGGTGGTCATGGAGTGAAAGGCTGGGCCTTTGATGATGGAAATGACGTGTGGATGGGAAGAACAATCAACGAGACGTCACGCTTAGGGTATGAAACCTTCAAAGTCATTGAAGGCTGGTCCAACCCTAAGTCCAAATTGCAGATAAATAGGCAAGTCATAGTTGACAGAGGTGATAGGTCCGGTTATTCTGGTATTTTCTCTGTTGAAGGCAAAAGCTGCATCAATCGGTGCTTTTATGTGGAGTTGATTAGGGGAAGAAAAGAGGAAACTGAAGTCTTGTGGACCTCAAACAGTATTGTTGTGTTTTGTGGCACCTCAGGTACATATGGAACAGGCTCATGGCCTGATGGGGCGGACCTCAATCTCATGCCTATA'

    print('testing LH normalization')
    from Bio import Phylo,AlignIO
@@ -101,7 +101,7 @@ def test_seq_joint_reconstruction_correct():
    tree = myTree.tree
    seq_len = 400
    tree.root.ref_seq = np.random.choice(mygtr.alphabet, p=mygtr.Pi, size=seq_len)
-    print ("Root sequence: " + ''.join(tree.root.ref_seq))
+    print ("Root sequence: " + ''.join(tree.root.ref_seq.astype('U')))
    mutation_list = defaultdict(list)
    for node in tree.find_clades():
        for c in node.clades:
@@ -110,7 +110,7 @@ def test_seq_joint_reconstruction_correct():
            continue
        t = node.branch_length
        p = mygtr.evolve( seq_utils.seq2prof(node.up.ref_seq, mygtr.profile_map), t)
-        # normalie profile
+        # normalize profile
        p=(p.T/p.sum(axis=1)).T
        # sample mutations randomly
        ref_seq_idxs = np.array([int(np.random.choice(np.arange(p.shape[1]), p=p[k])) for k in np.arange(p.shape[0])])
@@ -127,25 +127,23 @@ def test_seq_joint_reconstruction_correct():
    alnstr = ""
    i = 1
    for leaf in tree.get_terminals():
-        alnstr += ">" + leaf.name + "\n" + ''.join(leaf.ref_seq) + '\n'
+        alnstr += ">" + leaf.name + "\n" + ''.join(leaf.ref_seq.astype('U')) + '\n'
        i += 1
    print (alnstr)
    myTree.aln = AlignIO.read(StringIO(alnstr), 'fasta')
-    myTree._attach_sequences_to_nodes()
    # reconstruct ancestral sequences:
-    myTree._ml_anc_joint(debug=True)
+    myTree.infer_ancestral_sequences(final=True, debug=True, reconstruct_leaves=True)

    diff_count = 0
    mut_count = 0
    for node in myTree.tree.find_clades():
        if node.up is not None:
            mut_count += len(node.ref_mutations)
-            diff_count += np.sum(node.sequence != node.ref_seq)==0
+            diff_count += np.sum(node.sequence != node.ref_seq)
            if np.sum(node.sequence != node.ref_seq):
                print("%s: True sequence does not equal inferred sequence. parent %s"%(node.name, node.up.name))
            else:
                print("%s: True sequence equals inferred sequence. parent %s"%(node.name, node.up.name))
-        print (node.name, np.sum(node.sequence != node.ref_seq), np.where(node.sequence != node.ref_seq), len(node.mutations), node.mutations)

    # the assignment of mutations to the root node is probabilistic. Hence some differences are expected
    assert diff_count/seq_len<2*(1.0*mut_count/seq_len)**2
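
Note on the `diff_count` change in this hunk: the old expression `np.sum(...)==0` adds a boolean to the counter (1 for a perfectly matching node) rather than the number of mismatched sites. A minimal plain-Python sketch of the difference follows. The sequences and the helper `count_mismatches` are made up for illustration and do not come from the test.

```python
# Hypothetical sequences for illustration only; not taken from the test data.
true_seq = "ACGTACGT"
inferred_seq = "ACGTTCGA"


def count_mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


# Old behaviour: comparing the count with 0 yields a boolean, so the
# accumulator counts perfectly matching nodes instead of differing sites.
old_increment = int(count_mismatches(true_seq, inferred_seq) == 0)

# New behaviour: the accumulator grows by the number of mismatched sites.
new_increment = count_mismatches(true_seq, inferred_seq)

print(old_increment, new_increment)
```

With the old expression, the final assertion on `diff_count/seq_len` compared the wrong quantity against the expected mutation rate.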

--- a/treetime/__init__.py
+++ b/treetime/__init__.py
 from __future__ import print_function, division, absolute_import
-version="0.6.2"
+version="0.7.0"
+
+class TreeTimeError(Exception):
+    """TreeTimeError class"""
+    pass
+
+class MissingDataError(TreeTimeError):
+    """MissingDataError class raised when tree or alignment are missing"""
+    pass
+
+class UnknownMethodError(TreeTimeError):
+    """UnknownMethodError class raised when an unknown method is specified"""
+    pass
+
+class NotReadyError(TreeTimeError):
+    """NotReadyError class raised when results are requested before inference"""
+    pass
+
+
 from .treeanc import TreeAnc
 from .treetime import TreeTime, plot_vs_years
 from .clock_tree import ClockTree
 from .treetime import ttconf as treetime_conf
 from .gtr import GTR
+from .gtr_site_specific import GTR_site_specific
 from .merger_models import Coalescent
 from .treeregression import TreeRegression
 from .argument_parser import make_parser
+
+
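
The new exception hierarchy lets callers catch every TreeTime-specific failure through the base class. The sketch below illustrates that pattern with local stand-ins for two of the classes, so it runs without TreeTime installed. `assign_dates` and `run` are hypothetical helpers and not part of the package.

```python
# Local stand-ins mirroring two classes added in this diff (assumption:
# the real ones are importable from the treetime package).
class TreeTimeError(Exception):
    """Base class for TreeTime-specific errors."""


class MissingDataError(TreeTimeError):
    """Raised when tree or alignment are missing."""


def assign_dates(tree):
    """Hypothetical helper that raises instead of returning an error code."""
    if tree is None:
        raise MissingDataError("tree is not set, can't assign dates")
    return "ok"


def run(tree):
    """Hypothetical caller that handles any TreeTime-specific failure."""
    try:
        return assign_dates(tree)
    except TreeTimeError as err:
        # Catching the base class also catches all of its subclasses.
        return "failed: {}".format(err)


print(run(None))
print(run(object()))
```

Code that previously checked for `ttconf.ERROR` return values would need a `try`/`except` of this kind instead.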
--- a/treetime/argument_parser.py
+++ b/treetime/argument_parser.py
@@ -2,7 +2,7 @@
 from __future__ import print_function, division, absolute_import
 import sys, argparse, os
 from treetime.wrappers import ancestral_reconstruction, mugration, scan_homoplasies, timetree, estimate_clock_model
-import treetime
+from treetime import version

 py2 = sys.version_info.major==2

@@ -148,6 +148,7 @@ def add_gtr_arguments(parser):
 def add_anc_arguments(parser):
    parser.add_argument('--keep-overhangs', default = False, action='store_true', help='do not fill terminal gaps')
    parser.add_argument('--zero-based', default = False, action='store_true', help='zero based mutation indexing')
+    parser.add_argument('--reconstruct-tip-states', default = False, action='store_true', help='overwrite ambiguous states on tips with the most likely inferred state')
    parser.add_argument('--report-ambiguous', default=False, action="store_true", help='include transitions involving ambiguous states')


@@ -169,6 +170,8 @@ def make_parser():
    t_parser.add_argument('--tree', type=str, help=tree_description)
    add_seq_len_aln_group(t_parser)
    t_parser.add_argument('--dates', type=str, help=dates_description)
+    t_parser.add_argument('--name-column', type=str, help="label of the column to be used as taxon name")
+    t_parser.add_argument('--date-column', type=str, help="label of the column to be used as sampling date")
    add_reroot_group(t_parser)
    add_gtr_arguments(t_parser)
    t_parser.add_argument('--clock-rate', type=float, help="if specified, the rate of the molecular clock won't be optimized.")
@@ -189,6 +192,8 @@ def make_parser():
                        help='maximal number of iterations the inference cycle is run. Note that for polytomy resolution and coalescence models max_iter should be at least 2')
    t_parser.add_argument('--coalescent', default="0.0", type=str,
                          help=coalescent_description)
+    t_parser.add_argument('--n-skyline', default="20", type=int,
+                          help="number of grid points in skyline coalescent model")
    t_parser.add_argument('--plot-tree', default="timetree.pdf",
                            help = "filename to save the plot to. Suffix will determine format"
                                   " (choices pdf, png, svg, default=pdf)")
@@ -201,6 +206,7 @@ def make_parser():
                            help = "don't show tip labels (default for small trees with >=30 leaves)")
    add_anc_arguments(t_parser)
    add_common_args(t_parser)
+    t_parser.add_argument("--version", action="version", version="%(prog)s " + version)

    def toplevel(params):
        if (params.aln or params.tree) and params.dates:
@@ -229,9 +235,9 @@ def make_parser():
    ## ANCESTRAL RECONSTRUCTION
    a_parser = subparsers.add_parser('ancestral', description=ancestral_description)
    add_aln_group(a_parser)
-    a_parser.add_argument('--tree', type = str,  help =tree_description)
+    a_parser.add_argument('--tree', type=str,  help=tree_description)
    add_gtr_arguments(a_parser)
-    a_parser.add_argument('--marginal', default = False, action="store_true", help ="marginal reconstruction of ancestral sequences")
+    a_parser.add_argument('--marginal', default=False, action="store_true", help ="marginal reconstruction of ancestral sequences")
    add_anc_arguments(a_parser)
    add_common_args(a_parser)
    a_parser.set_defaults(func=ancestral_reconstruction)
@@ -239,6 +245,7 @@ def make_parser():
    ## MUGRATION
    m_parser = subparsers.add_parser('mugration', description=mugration_description)
    m_parser.add_argument('--tree', required = True, type=str, help=tree_description)
+    m_parser.add_argument('--name-column', type=str, help="label of the column to be used as taxon name")
    m_parser.add_argument('--attribute', type=str, help ="attribute to reconstruct, e.g. country")
    m_parser.add_argument('--states', required = True, type=str, help ="csv or tsv file with discrete characters."
                                    "\n#name,country,continent\ntaxon1,micronesia,oceania\n...")
@@ -265,6 +272,8 @@ def make_parser():
                        "signal and recalculate branch length unless run with --keep_root.")
    c_parser.add_argument('--tree', required=True, type=str,  help=tree_description)
    c_parser.add_argument('--dates', required=True, type=str, help=dates_description)
+    c_parser.add_argument('--date-column', type=str, help="label of the column to be used as sampling date")
+    c_parser.add_argument('--name-column', type=str, help="label of the column to be used as taxon name")
    add_seq_len_aln_group(c_parser)

    add_reroot_group(c_parser)
@@ -279,7 +288,7 @@ def make_parser():

    # make a version subcommand
    v_parser = subparsers.add_parser('version', description='print version')
-    v_parser.set_defaults(func=lambda x: print(treetime.version))
+    v_parser.set_defaults(func=lambda x: print("treetime "+version))

    ## call the relevant function and return
    if py2:
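
The options added in this file can be tried in isolation with a small stand-alone parser. This is a sketch that mirrors only the arguments visible in the diff. The program name `treetime-sketch`, the function `make_sketch_parser`, and the example column labels are assumptions made for illustration.

```python
import argparse


def make_sketch_parser(version="0.7.0"):
    """Build a reduced parser with only the options touched by this diff."""
    parser = argparse.ArgumentParser(prog="treetime-sketch")
    parser.add_argument("--dates", type=str, help="csv/tsv file with dates")
    parser.add_argument("--name-column", type=str,
                        help="label of the column to be used as taxon name")
    parser.add_argument("--date-column", type=str,
                        help="label of the column to be used as sampling date")
    parser.add_argument("--n-skyline", default=20, type=int,
                        help="number of grid points in skyline coalescent model")
    parser.add_argument("--version", action="version",
                        version="%(prog)s " + version)
    return parser


# Hypothetical command line; the column labels are made up.
args = make_sketch_parser().parse_args(
    ["--dates", "dates.csv",
     "--name-column", "strain",
     "--date-column", "collection_date"]
)
print(args.name_column, args.date_column, args.n_skyline)
```

argparse converts the dashes in option names to underscores, so `--name-column` is read back as `args.name_column`.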

--- a/treetime/branch_len_interpolator.py
+++ b/treetime/branch_len_interpolator.py
@@ -88,19 +88,11 @@ class BranchLenInterpolator (Distribution):


        elif branch_length_mode=='joint':
-            if not hasattr(node, 'compressed_sequence'):
-                #FIXME: this assumes node.sequence is set, but this might not be the case if
-                # ancestral reconstruction is run with final=False
-                if hasattr(node, 'sequence'):
-                    seq_pairs, multiplicity = self.gtr.compress_sequence_pair(node.up.sequence,
-                                                                          node.sequence,
-                                                                          ignore_gaps=ignore_gaps)
-                    node.compressed_sequence = {'pair':seq_pairs, 'multiplicity':multiplicity}
-                else:
-                    raise Exception("uncompressed sequence needs to be assigned to nodes")
-
-            log_prob = np.array([-self.gtr.prob_t_compressed(node.compressed_sequence['pair'],
-                                                    node.compressed_sequence['multiplicity'],
+            if not hasattr(node, 'branch_state'):
+                raise Exception("branch state pairs need to be assigned to nodes")
+
+            log_prob = np.array([-self.gtr.prob_t_compressed(node.branch_state['pair'],
+                                                    node.branch_state['multiplicity'],
                                                    k,
                                                    return_log=True)
                                for k in grid])

--- a/treetime/clock_tree.py
+++ b/treetime/clock_tree.py
 from __future__ import print_function, division, absolute_import
 import numpy as np
 from treetime import config as ttconf
+from treetime import MissingDataError
 from .treeanc import TreeAnc
-from .utils import numeric_date, DateConversion
+from .utils import numeric_date, DateConversion, datestring_from_numeric
 from .distribution import Distribution
 from .branch_len_interpolator import BranchLenInterpolator
 from .node_interpolator import NodeInterpolator
@@ -79,8 +80,7 @@ class ClockTree(TreeAnc):
        self.clock_model=None
        self.use_covariation=use_covariation # if false, covariation will be ignored in rate estimates.
        self._set_precision(precision)
-        if self._assign_dates()==ttconf.ERROR:
-            raise ValueError("ClockTree requires date constraints!")
+        self._assign_dates()


    def _assign_dates(self):
@@ -92,8 +92,7 @@ class ClockTree(TreeAnc):
            success/error code
        """
        if self.tree is None:
-            self.logger("ClockTree._assign_dates: tree is not set, can't assign dates", 0)
-            return ttconf.ERROR
+            raise MissingDataError("ClockTree._assign_dates: tree is not set, can't assign dates")

        bad_branch_counter = 0
        for node in self.tree.find_clades(order='postorder'):
@@ -128,9 +127,9 @@ class ClockTree(TreeAnc):
                bad_branch_counter += 1

        if bad_branch_counter>self.tree.count_terminals()-3:
-            self.logger("ERROR: ALMOST NO VALID DATE CONSTRAINTS, EXITING", 1, warn=True)
-            return ttconf.ERROR
+            raise MissingDataError("ERROR: ALMOST NO VALID DATE CONSTRAINTS")

+        self.logger("ClockTree._assign_dates: assigned date constraints to {} out of {} tips.".format(self.tree.count_terminals()-bad_branch_counter, self.tree.count_terminals()), 1)
        return ttconf.SUCCESS


@@ -149,7 +148,7 @@ class ClockTree(TreeAnc):
            self.precision=precision
            if self.one_mutation and self.one_mutation<1e-4 and precision<2:
                self.logger("ClockTree._set_precision: FOR LONG SEQUENCES (>1e4) precision>=2 IS RECOMMENDED."
-                            " \n\t **** precision %d was specified by the user"%precision, level=0)
+                            " precision %d was specified by the user"%precision, level=0)
        else:
            # otherwise adjust it depending on the minimal sensible branch length
            if self.one_mutation:
@@ -263,7 +262,7 @@ class ClockTree(TreeAnc):
        """
        self.logger("ClockTree.init_date_constraints...",2)
        self.tree.coalescent_joint_LH = 0
-        if self.aln and (ancestral_inference or (not hasattr(self.tree.root, 'sequence'))):
+        if self.aln and (not self.sequence_reconstruction):
            self.infer_ancestral_sequences('probabilistic', marginal=self.branch_length_mode=='marginal',
                                            sample_from_profile='root',**kwarks)

@@ -286,9 +285,11 @@ class ClockTree(TreeAnc):

                if self.branch_length_mode=='marginal':
                    node.profile_pair = self.marginal_branch_profile(node)
+                elif self.branch_length_mode=='joint' and (not hasattr(node, 'branch_state')):
+                    self.add_branch_state(node)

                node.branch_length_interpolator = BranchLenInterpolator(node, self.gtr,
-                            pattern_multiplicity = self.multiplicity, min_width=self.min_width,
+                            pattern_multiplicity = self.data.multiplicity, min_width=self.min_width,
                            one_mutation=self.one_mutation, branch_length_mode=self.branch_length_mode)

                node.branch_length_interpolator.merger_cost = merger_cost
@@ -312,8 +313,8 @@ class ClockTree(TreeAnc):

                if hasattr(node, 'bad_branch') and node.bad_branch is True:
                    self.logger("ClockTree.init_date_constraints -- WARNING: Branch is marked as bad"
-                                ", excluding it from the optimization process.\n"
-                                "\t\tDate constraint will be ignored!", 4, warn=True)
+                                ", excluding it from the optimization process."
+                                " Date constraint will be ignored!", 4, warn=True)
            else: # node without sampling date set
                node.raw_date_constraint = None
                node.date_constraint = None
@@ -438,13 +439,14 @@ class ClockTree(TreeAnc):

            if node.joint_pos_Cx is None: # no constraints or branch is bad - reconstruct from the branch len interpolator
                node.branch_length = node.branch_length_interpolator.peak_pos
-
+            elif node.date_constraint is not None and node.date_constraint.is_delta:
+                node.branch_length = node.up.time_before_present - node.date_constraint.peak_pos
            elif isinstance(node.joint_pos_Cx, Distribution):
                # NOTE the Lx distribution is the likelihood, given the position of the parent
                # (Lx.x = parent position, Lx.y = LH of the node_pos given Lx.x,
                # the length of the branch corresponding to the most likely
                # subtree is node.Cx(node.time_before_present))
-                subtree_LH = node.joint_pos_Lx(node.up.time_before_present)
+                # subtree_LH = node.joint_pos_Lx(node.up.time_before_present)
                node.branch_length = node.joint_pos_Cx(max(node.joint_pos_Cx.xmin,
                                            node.up.time_before_present)+ttconf.TINY_NUMBER)

@@ -475,7 +477,7 @@ class ClockTree(TreeAnc):

        # add the root sequence LH and return
        if self.aln:
-            LH += self.gtr.sequence_logLH(self.tree.root.cseq, pattern_multiplicity=self.multiplicity)
+            LH += self.gtr.sequence_logLH(self.tree.root.cseq, pattern_multiplicity=self.data.multiplicity)
        return LH


@@ -525,7 +527,7 @@ class ClockTree(TreeAnc):
                # no information
                node.marginal_pos_Lx = None
            else: # all other nodes
-                if node.date_constraint is not None and node.date_constraint.is_delta: # there is a time constraint
+                if node.date_constraint is not None and node.date_constraint.is_delta: # there is a hard time constraint
                    # initialize the Lx for nodes with precise date constraint:
                    # subtree probability given the position of the parent node
                    # position of the parent node is given by the branch length
@@ -575,6 +577,8 @@ class ClockTree(TreeAnc):
            if node.up is None:
                node.msg_from_parent = None # nothing beyond the root
            # all other cases (All internal nodes + unconstrained terminals)
+            elif node.date_constraint is not None and node.date_constraint.is_delta:
+                node.marginal_pos_LH = node.date_constraint
            else:
                parent = node.up
                # messages from the complementary subtree (iterate over all sister nodes)
@@ -584,8 +588,6 @@ class ClockTree(TreeAnc):
                # if parent itself got smth from the root node, include it
                if parent.msg_from_parent is not None:
                    complementary_msgs.append(parent.msg_from_parent)
-                elif parent.marginal_pos_Lx is not None:
-                    complementary_msgs.append(parent.marginal_pos_LH)

                if len(complementary_msgs):
                    msg_parent_to_node = NodeInterpolator.multiply(complementary_msgs)
@@ -677,17 +679,7 @@ class ClockTree(TreeAnc):
                        "later than present day",4 , warn=True)

            node.numdate = now - years_bp
-
-            # set the human-readable date
-            year = np.floor(node.numdate)
-            days = max(0,365.25 * (node.numdate - year)-1)
-            try:  # datetime will only operate on dates after 1900
-                n_date = datetime(year, 1, 1) + timedelta(days=days)
-                node.date = datetime.strftime(n_date, "%Y-%m-%d")
-            except:
-                # this is the approximation not accounting for gap years etc
-                n_date = datetime(1900, 1, 1) + timedelta(days=days)
-                node.date = "%04d-%02d-%02d"%(year, n_date.month, n_date.day)
+            node.date = datestring_from_numeric(node.numdate)


    def branch_length_to_years(self):
@@ -722,8 +714,8 @@ class ClockTree(TreeAnc):
        params = params or {}
        if rate_std is None:
            if not (self.clock_model['valid_confidence'] and 'cov' in self.clock_model):
-                self.logger("ClockTree.calc_rate_susceptibility: need valid standard deviation of the clock rate to estimate dating error.", 1, warn=True)
-                return ttconf.ERROR
+                raise ValueError("ClockTree.calc_rate_susceptibility: need valid standard deviation of the clock rate to estimate dating error.")
+
            rate_std = np.sqrt(self.clock_model['cov'][0,0])

        current_rate = np.abs(self.clock_model['slope'])
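
The removed block in `ClockTree` converted numeric dates with a fixed 365.25-day year, and this diff replaces it with a call to `datestring_from_numeric`. Its implementation is not shown here. The sketch below is one possible leap-year-aware conversion, not the library's code, and `numeric_to_datestring` is a hypothetical name.

```python
import math
from datetime import datetime, timedelta


def numeric_to_datestring(numdate):
    """Convert a fractional year (e.g. 2016.5) to a 'YYYY-MM-DD' string.

    Hypothetical helper: the fraction is scaled by the actual length of
    that year, so leap years are handled exactly. It only works for years
    that datetime supports.
    """
    year = int(math.floor(numdate))
    start = datetime(year, 1, 1)
    days_in_year = (datetime(year + 1, 1, 1) - start).days
    offset = timedelta(days=(numdate - year) * days_in_year)
    return (start + offset).strftime("%Y-%m-%d")


print(numeric_to_datestring(2000.0))
print(numeric_to_datestring(2016.5))
```

Scaling by the true year length keeps the conversion consistent with an inverse that divides the day of year by the same length, which is the compatibility the changelog entry refers to.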

--- a/treetime/gtr.py
+++ b/treetime/gtr.py
@@ -30,8 +30,7 @@ class GTR(object):
            of observing characters in the alphabet. This is used to
            implement ambiguous characters like 'N'=[1,1,1,1] which are
            equally likely to be any of the 4 nucleotides. Standard profile_maps
-            are defined in file seq_utils.py. If None is provided, no ambigous
-            characters are supported.
+            are defined in file seq_utils.py.

         logger : callable
            Custom logging function that should take arguments (msg, level, warn=False),
@@ -39,6 +38,7 @@ class GTR(object):

        """
        self.debug=False
+        self.is_site_specific=False
        if isinstance(alphabet, str):
            if alphabet not in alphabet_synonyms:
                raise AttributeError("Unknown alphabet type specified")
@@ -48,13 +48,14 @@ class GTR(object):
                self.profile_map = profile_maps[tmp_alphabet]
        else:
            # not a predefined alphabet
-            self.alphabet = alphabet
+            self.alphabet = np.array(alphabet)
            if prof_map is None: # generate trivial unambiguous profile map is none is given
                self.profile_map = {s:x for s,x in zip(self.alphabet, np.eye(len(self.alphabet)))}
            else:
-                self.profile_map = prof_map
-
+                self.profile_map = {x if type(x) is str else x:k for x,k in prof_map.items()}

+        # map each state to its index in the alphabet
+        self.state_index={s:si for si,s in enumerate(self.alphabet)}
        if logger is None:
            def logger_default(*args,**kwargs):
                """standard logging function if none provided"""
@@ -69,13 +70,6 @@ class GTR(object):
        self.n_states = len(self.alphabet)
        self.assign_gap_and_ambiguous()

-        # NEEDED TO BREAK RATE MATRIX DEGENERACY AND FORCE NP TO RETURN REAL ORTHONORMAL EIGENVECTORS
-        # ugly hack, but works and shouldn't affect results
-        tmp_rng_state = np.random.get_state()
-        np.random.seed(12345)
-        self.break_degen = np.random.random(size=(self.n_states, self.n_states))*1e-6
-        np.random.set_state(tmp_rng_state)
-
        # init all matrices with dummy values
        self.logger("GTR: init with dummy values!", 3)
        self.v = None # right eigenvectors
@@ -86,7 +80,7 @@ class GTR(object):

    def assign_gap_and_ambiguous(self):
        n_states = len(self.alphabet)
-        self.logger("GTR: with alphabet: "+str(self.alphabet),1)
+        self.logger("GTR: with alphabet: "+str(list(self.alphabet)),1)
        # determine if a character exists that corresponds to no info, i.e. all one profile
        if any([x.sum()==n_states for x in self.profile_map.values()]):
            amb_states = [c for c,x in self.profile_map.items() if x.sum()==n_states]
@@ -97,7 +91,7 @@ class GTR(object):

        # check for a gap symbol
        try:
-            self.gap_index = list(self.alphabet).index('-')
+            self.gap_index = self.state_index['-']
        except:
            self.logger("GTR: no gap symbol!", 4, warn=True)
            self.gap_index=None
@@ -134,7 +128,10 @@ class GTR(object):
           and the equilibrium frequencies to obtain the rate matrix
           of the GTR model
        """
-        return (self.W*self.Pi).T
+        Q_tmp = (self.W*self.Pi).T
+        Q_diag = -np.sum(Q_tmp, axis=0)
+        np.fill_diagonal(Q_tmp, Q_diag)
+        return Q_tmp


 ######################################################################
@@ -155,18 +152,18 @@ class GTR(object):
        if not multi_site:
            eq_freq_str += "\nEquilibrium frequencies (pi_i):\n"
            for a,p in zip(self.alphabet, self.Pi):
-                eq_freq_str+='  '+str(a)+': '+str(np.round(p,4))+'\n'
+                eq_freq_str+='  '+a+': '+str(np.round(p,4))+'\n'

        W_str = "\nSymmetrized rates from j->i (W_ij):\n"
-        W_str+='\t'+'\t'.join(map(str, self.alphabet))+'\n'
+        W_str+='\t'+'\t'.join(self.alphabet)+'\n'
        for a,Wi in zip(self.alphabet, self.W):
-            W_str+= '  '+str(a)+'\t'+'\t'.join([str(np.round(max(0,p),4)) for p in Wi])+'\n'
+            W_str+= '  '+a+'\t'+'\t'.join([str(np.round(max(0,p),4)) for p in Wi])+'\n'

        if not multi_site:
            Q_str = "\nActual rates from j->i (Q_ij):\n"
-            Q_str+='\t'+'\t'.join(map(str, self.alphabet))+'\n'
+            Q_str+='\t'+'\t'.join(self.alphabet)+'\n'
            for a,Qi in zip(self.alphabet, self.Q):
-                Q_str+= '  '+str(a)+'\t'+'\t'.join([str(np.round(max(0,p),4)) for p in Qi])+'\n'
+                Q_str+= '  '+a+'\t'+'\t'.join([str(np.round(max(0,p),4)) for p in Qi])+'\n'

        return eq_freq_str + W_str + Q_str

@@ -190,6 +187,7 @@ class GTR(object):
        """
        n = len(self.alphabet)
        self._mu = mu
+        self.is_site_specific=False

        if pi is not None and len(pi)==n:
            Pi = np.array(pi)
@@ -213,7 +211,11 @@ class GTR(object):
            W=np.array(W)

        self._W = 0.5*(W+W.T)
-        self._check_fix_Q(fixed_mu=True)
+        np.fill_diagonal(self._W,0)
+        average_rate = self._W.dot(self.Pi).dot(self.Pi)
+        self._W = self._W/average_rate
+        self._mu *=average_rate
+
        self._eig()


@@ -508,8 +510,8 @@ class GTR(object):
        if gtr.gap_index is not None:
            if pi[gtr.gap_index]<gap_limit:
                gtr.logger('The model allows for gaps which are estimated to occur at a low fraction of %1.3e'%pi[gtr.gap_index]+
-                       '\n\t\tthis can potentially result in artificats.'+
-                       '\n\t\tgap fraction will be set to %1.4f'%gap_limit,2,warn=True)
+                       ' this can potentially result in artifacts.'+
+                       ' gap fraction will be set to %1.4f'%gap_limit,2,warn=True)
            pi[gtr.gap_index] = gap_limit
            pi /= pi.sum()

@@ -519,39 +521,13 @@ class GTR(object):
 ########################################################################
 ### prepare model
 ########################################################################
-    def _check_fix_Q(self, fixed_mu=False):
-        """
-        Check the main diagonal of Q and fix it in case it does not corresond
-        the definition of the rate matrix. Should be run every time when creating
-        custom GTR model.
-        """
-
-        # NEEDED TO BREAK RATE MATRIX DEGENERACY AND FORCE NP TO RETURN REAL ORTHONORMAL EIGENVECTORS
-        self._W += self.break_degen + self.break_degen.T
-        # fix W
-        np.fill_diagonal(self.W, 0)
-        Wdiag = -(self.Q).sum(axis=0)/self.Pi
-        np.fill_diagonal(self.W, Wdiag)
-        scale_factor = -np.sum(np.diagonal(self.Q)*self.Pi)
-        self._W /= scale_factor
-        if not fixed_mu:
-            self._mu *= scale_factor
-        if (self.Q.sum(axis=0) < 1e-10).sum() <  self.alphabet.shape[0]: # fix failed
-            print ("Cannot fix the diagonal of the GTR rate matrix. Should be all zero", self.Q.sum(axis=0))
-            import ipdb; ipdb.set_trace()
-            raise ArithmeticError("Cannot fix the diagonal of the GTR rate matrix.")
-
-
    def _eig(self):
        """
        Perform eigendecompositon of the rate matrix and stores the left- and right-
        matrices to convert the sequence profiles to the GTR matrix eigenspace
        and hence to speed-up the computations.
        """
-        W_nodiag = np.copy(self.W)
-        np.fill_diagonal(W_nodiag, 0)
-
-        self.eigenvals, self.v, self.v_inv = self._eig_single_site(W_nodiag, self.Pi)
+        self.eigenvals, self.v, self.v_inv = self._eig_single_site(self.W, self.Pi)


    def _eig_single_site(self, W, p):
@@ -574,7 +550,7 @@ class GTR(object):
        return eigvals, tmp_v.T/one_norm, (eigvecs*one_norm).T/tmpp


-    def compress_sequence_pair(self, seq_p, seq_ch, pattern_multiplicity=None,
+    def state_pair(self, seq_p, seq_ch, pattern_multiplicity=None,
                               ignore_gaps=False):
        '''
        Make a compressed representation of a pair of sequences, only counting
@@ -615,7 +591,7 @@ class GTR(object):

        from collections import Counter
        if seq_ch.shape != seq_p.shape:
-            raise ValueError("GTR.compress_sequence_pair: Sequence lengths do not match!")
+            raise ValueError("GTR.state_pair: Sequence lengths do not match!")

        if len(self.alphabet)<10: # for small alphabet, repeatedly check array for all state pairs
            pair_count = []
@@ -724,7 +700,7 @@ class GTR(object):
            Resulting probability

        """
-        seq_pair, multiplicity = self.compress_sequence_pair(seq_p, seq_ch,
+        seq_pair, multiplicity = self.state_pair(seq_p, seq_ch,
                                        pattern_multiplicity=pattern_multiplicity, ignore_gaps=ignore_gaps)
        return self.prob_t_compressed(seq_pair, multiplicity, t, return_log=return_log)

@@ -752,20 +728,21 @@ class GTR(object):
            If True, ignore gaps in distance calculations

        '''
-        seq_pair, multiplicity = self.compress_sequence_pair(seq_p, seq_ch,
-                                                            pattern_multiplicity = pattern_multiplicity,
-                                                            ignore_gaps=ignore_gaps)
+        seq_pair, multiplicity = self.state_pair(seq_p, seq_ch,
+                                        pattern_multiplicity = pattern_multiplicity,
+                                        ignore_gaps=ignore_gaps)
        return self.optimal_t_compressed(seq_pair, multiplicity)


    def optimal_t_compressed(self, seq_pair, multiplicity, profiles=False, tol=1e-10):
        """
-        Find the optimal distance between the two sequences, for compressed sequences
+        Find the optimal distance between the two sequences represented as state_pairs
+        or as pair of profiles

        Parameters
        ----------

-         seq_pair : compressed_sequence_pair
+         seq_pair : state_pair, tuple
            Compressed representation of sequences along a branch, either
            as tuple of state pairs or as tuple of profiles.

@@ -779,7 +756,7 @@ class GTR(object):
            either end of the branch. With profiles==True, optimization is performed
            while summing over all possible states of the nodes at either end of the
            branch. Note that the meaning/format of seq_pair and multiplicity
-            depend on the value of profiles.
+            depend on the value of :profiles:.

        """

@@ -1032,14 +1009,6 @@ class GTR(object):
                            char_dist=self.Pi,
                            flow_matrix=self.W)

-    def save_to_json(self, zip):
-        d = {
-        "full_gtr": self.mu * np.dot(self.Pi, self.W),
-        "Substitution rate" : self.mu,
-        "Equilibrium character composition": self.Pi,
-        "Flow rate matrix": self.W
-        }
-

 if __name__ == "__main__":
    pass
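
Reviewer note on the `Q` property and `assign_rates` hunks above: the patch zeroes the diagonal of `W`, rescales it so that the average rate `pi·W·pi` equals one, and lets `Q` fill in its own diagonal so that every column sums to zero. The following is a minimal standalone sketch of that construction, assuming a symmetrized `W`. The function name `normalized_rate_matrix` is hypothetical and not part of TreeTime.

```python
import numpy as np

def normalized_rate_matrix(W, pi):
    """Return (W, pi, Q, average_rate) for a reversible substitution model.

    Hypothetical helper for illustration only, not a TreeTime function.
    """
    W = np.array(W, dtype=float)
    W = 0.5 * (W + W.T)              # symmetrize the exchangeabilities
    np.fill_diagonal(W, 0.0)         # the diagonal carries no information
    pi = np.array(pi, dtype=float)
    pi = pi / pi.sum()               # equilibrium frequencies sum to one

    # average number of substitutions per unit time at equilibrium
    average_rate = W.dot(pi).dot(pi)
    W = W / average_rate             # after this step, pi.W.pi == 1

    # Q[i, j] is the rate j -> i; the diagonal balances each column
    Q = (W * pi).T.copy()
    np.fill_diagonal(Q, -Q.sum(axis=0))
    return W, pi, Q, average_rate

W_norm, pi_norm, Q, rate = normalized_rate_matrix(
    [[0, 1, 2, 1], [1, 0, 1, 3], [2, 1, 0, 1], [1, 3, 1, 0]],
    [0.1, 0.2, 0.3, 0.4])
```

With this convention the factor removed from `W` has to be absorbed into `mu`, which is what the `self._mu *= average_rate` line in the hunk does.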
--- a/treetime/gtr_site_specific.py
+++ b/treetime/gtr_site_specific.py
@@ -25,6 +25,7 @@ class GTR_site_specific(GTR):
        self.seq_len=seq_len
        self.approximate = approximate
        super(GTR_site_specific, self).__init__(**kwargs)
+        self.is_site_specific=True


    @property
@@ -57,25 +58,34 @@ class GTR_site_specific(GTR):
            Equilibrium frequencies

        """
+        if not np.isscalar(mu) and pi is not None and len(pi.shape)==2:
+            if mu.shape[0]!=pi.shape[1]:
+                raise ValueError("GTR_site_specific: length of rate vector (got {}) and equilibrium frequency vector (got {}) must match!".format(mu.shape[0], pi.shape[1]))
+
        n = len(self.alphabet)
        if np.isscalar(mu):
            self._mu = mu*np.ones(self.seq_len)
        else:
            self._mu = np.copy(mu)
+            self.seq_len = mu.shape[0]

-        if pi is not None and pi.shape[0]==n:
-            self.seq_len = pi.shape[-1]
+        if pi is not None and pi.shape[0]==n and len(pi.shape)==2:
+            self.seq_len = pi.shape[1]
            Pi = np.copy(pi)
        else:
-            if pi is not None and len(pi)!=n:
-                raise ArgumentError("GTR_site_specific: length of equilibrium frequency vector does not match alphabet length.")
-            Pi = np.ones(shape=(n,self.seq_len))
+            if pi is not None:
+                if len(pi)==n:
+                    Pi = np.repeat([pi], self.seq_len, axis=0).T
+                else:
+                    raise ValueError("GTR_site_specific: length of equilibrium frequency vector (got {}) does not match alphabet length {}".format(len(pi), n))
+            else:
+                Pi = np.ones(shape=(n,self.seq_len))

        self._Pi = Pi/np.sum(Pi, axis=0)

        if W is None or W.shape!=(n,n):
            if (W is not None) and W.shape!=(n,n):
-                raise ArgumentError("GTR_site_specific: Size of substitution matrix does not match alphabet length.")
+                raise ValueError("GTR_site_specific: Size of substitution matrix (got {}) does not match alphabet length {}".format(W.shape, n))
            W = np.ones((n,n))
            np.fill_diagonal(W, 0.0)
            np.fill_diagonal(W, - W.sum(axis=0))
@@ -83,11 +93,13 @@ class GTR_site_specific(GTR):
            W=0.5*(np.copy(W)+np.copy(W).T)

        np.fill_diagonal(W,0)
-        avg_pi = self.Pi.mean(axis=-1)
-        average_rate = W.dot(avg_pi).dot(avg_pi)
+        average_rate = np.einsum('ia,ij,ja',self.Pi, W, self.Pi)/self.seq_len
+        # average_rate = W.dot(avg_pi).dot(avg_pi)
        self._W = W/average_rate
        self._mu *=average_rate

+
+        self.is_site_specific=True
        self._eig()
        self._make_expQt_interpolator()

@@ -124,6 +136,7 @@ class GTR_site_specific(GTR):
        gtr = cls(alphabet=alphabet, seq_len=L)
        n = gtr.alphabet.shape[0]

+        # Dirichlet distribution == l_1 normalized vector of samples of the Gamma distribution
        if pi_dirichlet_alpha:
            pi = 1.0*gamma.rvs(pi_dirichlet_alpha, size=(n,L))
        else:
@@ -143,7 +156,7 @@ class GTR_site_specific(GTR):
            mu = np.ones(L)

        gtr.assign_rates(mu=mu, pi=pi, W=W)
-        gtr.mu *= avg_mu/np.mean(gtr.mu)
+        gtr.mu *= avg_mu/np.mean(gtr.average_rate())

        return gtr

@@ -166,7 +179,7 @@ class GTR_site_specific(GTR):
            Equilibrium frequencies

         **kwargs:
-            Key word arguments to be passed
+            Key word arguments to be passed to the constructor

        Keyword Args
        ------------
@@ -255,15 +268,15 @@ class GTR_site_specific(GTR):
            p_ia_old = np.copy(p_ia)
            S_ij = np.einsum('a,ia,ja',mu_a, p_ia, T_ia)
            W_ij = (n_ij + n_ij.T + pc)/(S_ij + S_ij.T + pc)
-            
+
            avg_pi = p_ia.mean(axis=-1)
-            average_rate = W_ij.dot(avg_pi).dot(avg_pi)
+            average_rate = W_ij.dot(avg_pi).dot(avg_pi)  # crude approx, will be fixed in assign rates
            W_ij = W_ij/average_rate
            mu_a *=average_rate
-            
+
            p_ia = m_ia/(mu_a*np.dot(W_ij,T_ia)+Lambda)
            p_ia = p_ia/p_ia.sum(axis=0)
-            
+
            mu_a = n_a/(pc+np.einsum('ia,ij,ja->a', p_ia, W_ij, T_ia))


@@ -276,7 +289,7 @@ class GTR_site_specific(GTR):
                if p_ia[gtr.gap_index,p]<gap_limit:
                    gtr.logger('The model allows for gaps which are estimated to occur at a low fraction of %1.3e'%p_ia[gtr.gap_index,p]+
                           '\n\t\tthis can potentially result in artifacts.'+
-                           '\n\t\tgap fraction will be set to %1.4f'%gap_limit,2,warn=True)
+                           '\n\t\tgap fraction will be set to %1.4f'%gap_limit,4,warn=True)
                p_ia[gtr.gap_index,p] = gap_limit
                p_ia[:,p] /= p_ia[:,p].sum()

@@ -456,7 +469,7 @@ class GTR_site_specific(GTR):
            logQt[np.isnan(logQt) | np.isinf(logQt) | bad_indices] = -ttconf.BIG_NUMBER
            seq_indices_c = np.zeros(len(seq_ch), dtype=int)
            seq_indices_p = np.zeros(len(seq_p), dtype=int)
-            for ai, a in self.alphabet:
+            for ai, a in enumerate(self.alphabet):
                seq_indices_p[seq_p==a] = ai
                seq_indices_c[seq_ch==a] = ai


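Reviewer note on the `average_rate` change in `GTR_site_specific.assign_rates`: the new `einsum` call averages the per-site rate `pi_a·W·pi_a` over all sites, instead of evaluating the rate once at the site-averaged frequencies. A small sketch of what the expression computes, assuming `Pi` has shape `(n_states, seq_len)`. The function name `site_averaged_rate` is hypothetical.

```python
import numpy as np

def site_averaged_rate(W, Pi):
    """Mean over sites a of sum_ij Pi[i,a] * W[i,j] * Pi[j,a].

    Hypothetical helper for illustration only, not a TreeTime function.
    """
    W = np.array(W, dtype=float)
    np.fill_diagonal(W, 0.0)         # only off-diagonal rates contribute
    seq_len = Pi.shape[1]
    # all three indices are repeated, so einsum sums over i, j and a
    return np.einsum('ia,ij,ja', Pi, W, Pi) / seq_len

rng = np.random.default_rng(0)
n_states, seq_len = 4, 6
W = rng.random((n_states, n_states))
W = 0.5 * (W + W.T)
Pi = rng.random((n_states, seq_len))
Pi /= Pi.sum(axis=0)                 # normalize every site (column)

rate = site_averaged_rate(W, Pi)
```

The two definitions agree when all sites share the same frequencies and differ otherwise, which is presumably why the comment in `infer_gtr` calls the `avg_pi` version a crude approximation.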
--- a/treetime/merger_models.py
+++ b/treetime/merger_models.py
@@ -164,7 +164,7 @@ class Coalescent(object):
        if "success" in sol and sol["success"]:
            self.set_Tc(sol['x'])
        else:
-            self.logger("merger_models:optimze_Tc: optimization of coalescent time scale failed: " + str(sol), 0, warn=True)
+            self.logger("merger_models:optimize_Tc: optimization of coalescent time scale failed: " + str(sol), 0, warn=True)
            self.set_Tc(initial_Tc.y, T=initial_Tc.x)


@@ -190,8 +190,8 @@ class Coalescent(object):
            # cap log Tc to avoid under or overflow and nan in logs
            self.set_Tc(np.exp(np.maximum(-200,np.minimum(100,logTc))), tvals)
            neglogLH = -self.total_LH() + stiffness*np.sum(np.diff(logTc)**2) \
-                       + np.sum((logTc>0)*logTc*regularization)\
-                       - np.sum((logTc<-100)*logTc*regularization)
+                       + np.sum((logTc>0)*logTc)*regularization\
+                       - np.sum((logTc<-100)*logTc)*regularization
            return neglogLH

        sol = minimize(cost, np.ones_like(tvals)*np.log(self.Tc.y.mean()), method=method, tol=tol)
@@ -209,7 +209,7 @@ class Coalescent(object):

            dcost = np.array(dcost)
            optimal_cost = cost(opt_logTc)
-            self.confidence = -dlogTc/(2*optimal_cost - dcost[:,0] - dcost[:,1])
+            self.confidence = dlogTc/np.sqrt(np.abs(2*optimal_cost - dcost[:,0] - dcost[:,1]))
            self.logger("Coalescent:optimize_skyline:...done. new LH: %f"%self.total_LH(),2)
        else:
            self.set_Tc(initial_Tc.y, T=initial_Tc.x)
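
Reviewer note on the `self.confidence` change in `optimize_skyline`: the new expression reads like the standard deviation of a Gaussian approximation to the likelihood, obtained from a finite-difference second derivative of the cost. The sketch below assumes that `dcost` holds the cost evaluated at the optimum shifted by `-dlogTc` and `+dlogTc`, which is not visible in this hunk. The function name `curvature_std` is hypothetical.

```python
import numpy as np

def curvature_std(cost, x_opt, dx):
    """Estimate the width of a minimum of `cost` at `x_opt`.

    Uses f''(x) ~ (f(x+dx) + f(x-dx) - 2 f(x)) / dx**2 and sigma = 1/sqrt(f'').
    Hypothetical helper for illustration only, not a TreeTime function.
    """
    f0 = cost(x_opt)
    f_minus = cost(x_opt - dx)
    f_plus = cost(x_opt + dx)
    # abs() guards against a slightly negative curvature from numerical noise
    return dx / np.sqrt(np.abs(2 * f0 - f_minus - f_plus))

sigma = 0.3

def quadratic(x):
    # negative log of a Gaussian with mean 1.0 and standard deviation sigma
    return 0.5 * (x - 1.0) ** 2 / sigma ** 2

estimate = curvature_std(quadratic, x_opt=1.0, dx=0.01)
```

For an exactly quadratic cost the estimate reproduces `sigma`. The removed expression divided by the curvature without a square root, so it did not have the units of `logTc`.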

--- a/treetime/seq_utils.py
+++ b/treetime/seq_utils.py
 import numpy as np
+from Bio import Seq, SeqRecord
+

 alphabet_synonyms = {'nuc':'nuc', 'nucleotide':'nuc', 'aa':'aa', 'aminoacid':'aa',
                     'nuc_nogap':'nuc_nogap', 'nucleotide_nogap':'nuc_nogap',
@@ -115,39 +117,83 @@ profile_maps = {
    }
 }

-def seq2array(seq, fill_overhangs=True, ambiguous_character='N'):
+
+def extend_profile(gtr, aln, logger=None):
+    tmp_unique_chars = []
+    for seq in aln:
+        tmp_unique_chars.extend(np.unique(seq))
+
+    unique_chars = np.unique(tmp_unique_chars)
+    for c in unique_chars:
+        if c not in gtr.profile_map:
+            gtr.profile_map[c] = np.ones(gtr.n_states)
+            if logger:
+                logger("WARNING: character %s is unknown. Treating it as missing information"%c,1,warn=True)
+
+
+def guess_alphabet(aln):
+    total=0
+    nuc_count = 0
+    for seq in aln:
+        total += len(seq)
+        for n in np.array(list('acgtACGT-N')):
+            nuc_count += np.sum(seq==n)
+    if nuc_count>0.9*total:
+        return 'nuc'
+    else:
+        return 'aa'
+
+
+def seq2array(seq, word_length=1, convert_upper=False, fill_overhangs=False, ambiguous='N'):
    """
    Take the raw sequence, substitute the "overhanging" gaps with 'N' (missequenced),
    and convert the sequence to the numpy array of chars.

    Parameters
    ----------
-     seq : Biopython.SeqRecord, str, iterable
-        Sequence as an object of SeqRecord, string or iterable
-
-     fill_overhangs : bool
-        If True, substitute the "overhanging" gaps with ambiguous character symbol
-
-     ambiguous_character : char
-        Specify the character for ambiguous state ('N' default for nucleotide)
-
+    seq : Biopython.SeqRecord, str, iterable
+       Sequence as an object of SeqRecord, string or iterable
+    word_length : int, optional
+        1 for nucleotide or amino acids, 3 for codons etc.
+    convert_upper : bool, optional
+        convert the sequence to upper case
+    fill_overhangs : bool
+       If True, substitute the "overhanging" gaps with ambiguous character symbol
+    ambiguous : char
+       Specify the character for ambiguous state ('N' default for nucleotide)
    Returns
    -------
-     sequence : np.array
-        Sequence as 1D numpy array of chars
-
+    sequence : np.array
+       Sequence as 1D numpy array of chars
    """
-    try:
-        sequence = ''.join(seq)
-    except TypeError:
-        sequence = seq
+    if isinstance(seq, str):
+        seq_str = seq
+    elif isinstance(seq, Seq.Seq):
+        seq_str = str(seq)
+    elif isinstance(seq, SeqRecord.SeqRecord):
+        seq_str = str(seq.seq)
+    else:
+        raise TypeError("seq2array: sequence must be Bio.Seq, Bio.SeqRecord, or string. Got "+str(type(seq)))
+
+    if convert_upper:
+        seq_str = seq_str.upper()
+
+    if word_length==1:
+        seq_array = np.array(list(seq_str))
+    else:
+        if len(seq_str)%word_length:
+            raise ValueError("sequence length has to be multiple of word length")
+        seq_array = np.array([seq_str[i*word_length:(i+1)*word_length]
+                              for i in range(len(seq_str)//word_length)])

-    sequence = np.array(list(sequence))
    # substitute overhanging unsequenced tails
    if fill_overhangs:
-        sequence [:np.where(sequence != '-')[0][0]] = ambiguous_character
-        sequence [np.where(sequence != '-')[0][-1]+1:] = ambiguous_character
-    return sequence
+        non_gaps = np.where(seq_array != '-')[0]
+        seq_array[:non_gaps[0]] = ambiguous
+        seq_array[non_gaps[-1]+1:] = ambiguous
+
+    return seq_array
+

 def seq2prof(seq, profile_map):
    """
@@ -184,10 +230,8 @@ def prof2seq(profile, gtr, sample_from_prof=False, normalize=True):
     profile : numpy 2D array
        Profile. Shape of the profile should be (L x a), where L - sequence
        length, a - alphabet size.
-
     gtr : gtr.GTR
        Instance of the GTR class to supply the sequence alphabet
-
     collapse_prof : bool
        Whether to convert the profile to the delta-function

@@ -195,10 +239,8 @@ def prof2seq(profile, gtr, sample_from_prof=False, normalize=True):
    -------
     seq : numpy.array
        Sequence as numpy array of length L
-
     prof_values :  numpy.array
        Values of the profile for the chosen sequence characters (length L)
-
     idx : numpy.array
        Indices chosen from profile as array of length L
    """

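Reviewer note on the new `seq2array` signature: the `word_length` and `fill_overhangs` arguments can be illustrated without Biopython. The sketch below is a simplified stand-in that accepts plain strings only. The name `split_into_words` is hypothetical, and the guard for all-gap input is an addition that the patch itself does not contain.

```python
import numpy as np

def split_into_words(seq_str, word_length=1, fill_overhangs=False, ambiguous='N'):
    """Split a string into an array of fixed-length words.

    Hypothetical helper for illustration only, not the TreeTime function.
    """
    if len(seq_str) % word_length:
        raise ValueError("sequence length has to be multiple of word length")
    n_words = len(seq_str) // word_length      # integer division for range()
    words = np.array([seq_str[i * word_length:(i + 1) * word_length]
                      for i in range(n_words)])

    if fill_overhangs:
        # positions holding anything other than the single-character gap '-'
        non_gaps = np.where(words != '-')[0]
        if len(non_gaps):                      # skip sequences made of gaps only
            words[:non_gaps[0]] = ambiguous
            words[non_gaps[-1] + 1:] = ambiguous
    return words

nucleotides = split_into_words('--ACG-T--', fill_overhangs=True)
codons = split_into_words('ATGCCC', word_length=3)
```

Leading and trailing gaps become the ambiguous character while the internal gap is kept. Since the comparison is against `'-'`, overhang filling as written only applies to `word_length=1`.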
--- a/treetime/seqgen.py
+++ b/treetime/seqgen.py
@@ -13,10 +13,10 @@ class SeqGen(TreeAnc):
    This class inherits from TreeAnc.
    '''

-    def __init__(self, *args, **kwargs):
-        """Instantiate. Mandatory arguments are a tree and GTR model.
+    def __init__(self, L, *args, **kwargs):
+        """Instantiate. Mandatory arguments are the sequence length, a tree, and a GTR model.
        """
-        super(SeqGen, self).__init__(reduce_alignment=False, **kwargs)
+        super(SeqGen, self).__init__(seq_len=L, compress=False, **kwargs)


    def sample_from_profile(self, p):
@@ -50,30 +50,23 @@ class SeqGen(TreeAnc):
            sequence to be used as the root sequence of the tree. if not given,
            will sample a sequence from the equilibrium probabilities of the GTR model.
        """
-        self.seq_len = self.gtr.seq_len
        # set root if not given
        if root_seq:
-            self.tree.root.sequence = seq2array(root_seq)
+            self.tree.root.ancestral_sequence = seq2array(root_seq)
        else:
            if len(self.gtr.Pi.shape)==2:
-                self.tree.root.sequence = self.sample_from_profile(self.gtr.Pi.T)
+                self.tree.root.ancestral_sequence = self.sample_from_profile(self.gtr.Pi.T)
            else:
-                self.tree.root.sequence = self.sample_from_profile(np.repeat([self.gtr.Pi], self.seq_len, axis=0))
+                self.tree.root.ancestral_sequence = self.sample_from_profile(np.repeat([self.gtr.Pi], self.seq_len, axis=0))

        # generate sequences in preorder
        for n in self.tree.get_nonterminals(order='preorder'):
-            profile_p = seq2prof(n.sequence, self.gtr.profile_map)
+            profile_p = seq2prof(n.ancestral_sequence, self.gtr.profile_map)
            for c in n:
                profile = self.gtr.evolve(profile_p, c.branch_length)
-                c.sequence = self.sample_from_profile(profile)
-        self.make_reduced_alignment()
+                c.ancestral_sequence = self.sample_from_profile(profile)

-        # gather mutations
-        for n in self.tree.find_clades():
-            if n==self.tree.root:
-                n.mutations=[]
-            else:
-                n.mutations = self.get_mutations(n)
+        self.aln = self.get_aln()


    def get_aln(self, internal=False):
@@ -96,7 +89,7 @@ class SeqGen(TreeAnc):
        tmp = []
        for n in self.tree.get_terminals():
            if n.is_terminal() or internal:
-                tmp.append(SeqRecord.SeqRecord(id=n.name, name=n.name, description='', seq=Seq.Seq(''.join(n.sequence))))
+                tmp.append(SeqRecord.SeqRecord(id=n.name, name=n.name, description='', seq=Seq.Seq(''.join(n.ancestral_sequence.astype('U')))))

        return MultipleSeqAlignment(tmp)


--- a/treetime/sequence_data.py
+++ b/treetime/sequence_data.py
--- a/treetime/test_opt.py
+++ b/treetime/test_opt.py
+    def optimize_tree_marginal_new(self, damping=0.5):
+        L = self.data.compressed_length
+        n_states = self.gtr.alphabet.shape[0]
+        # propagate leaves --> root, set the marginal-likelihood messages
+        for node in self.tree.find_clades(order='postorder'): #leaves -> root
+            if node.up is None and len(node.clades)==2:
+                continue
+
+            profiles = [c.marginal_subtree_LH for c in node] + [node.marginal_outgroup_LH]
+            bls = [c.branch_length for c in node] + [node.branch_length]
+            new_bls = self.optimize_star(profiles,bls, last_is_root=node.up is None)
+
+            # regardless of what was before, set the profile to ones
+            tmp_log_subtree_LH = np.zeros((L,n_states), dtype=float)
+            node.marginal_subtree_LH_prefactor = np.zeros(L, dtype=float)
+            for ci,ch in enumerate(node.clades):
+                ch.branch_length = new_bls[ci]
+                ch.marginal_log_Lx = self.gtr.propagate_profile(ch.marginal_subtree_LH,
+                                                                ch.branch_length, return_log=True)
+                tmp_log_subtree_LH += ch.marginal_log_Lx
+                node.marginal_subtree_LH_prefactor += ch.marginal_subtree_LH_prefactor
+
+            node.marginal_subtree_LH, offset = normalize_profile(tmp_log_subtree_LH, log=True)
+            node.marginal_subtree_LH_prefactor += offset # and store log-prefactor
+
+            if node.up:
+                node.marginal_log_Lx = self.gtr.propagate_profile(node.marginal_subtree_LH,
+                                                node.branch_length, return_log=True) # raw prob to transfer prob up
+                tmp_msg_from_parent = self.gtr.evolve(node.marginal_outgroup_LH,
+                                                 self._branch_length_to_gtr(node), return_log=False)
+                node.marginal_profile, pre = normalize_profile(node.marginal_subtree_LH * tmp_msg_from_parent, return_offset=False)
+            else:
+                node.marginal_profile, pre = normalize_profile(node.marginal_subtree_LH * node.marginal_outgroup_LH, return_offset=False)
+
+        root=self.tree.root
+        print(len(root.clades))
+        if len(root.clades)==2:
+            tmp_log_subtree_LH = np.zeros((L,n_states), dtype=float)
+            root.marginal_subtree_LH_prefactor = np.zeros(L, dtype=float)
+            old_bl = root.clades[0].branch_length + root.clades[1].branch_length
+            bl = self.gtr.optimal_t_compressed((root.clades[0].marginal_subtree_LH*root.marginal_outgroup_LH,
+                                                root.clades[1].marginal_subtree_LH), multiplicity=self.data.multiplicity,
+                                                profiles=True, tol=1e-8)
+            for ch in root:
+                ch.branch_length *= ((1-damping)*old_bl + damping*bl)/old_bl
+                ch.marginal_log_Lx = self.gtr.propagate_profile(ch.marginal_subtree_LH,
+                                            ch.branch_length, return_log=True) # raw prob to transfer prob up
+                tmp_log_subtree_LH += ch.marginal_log_Lx
+                root.marginal_subtree_LH_prefactor += ch.marginal_subtree_LH_prefactor
+
+            root.marginal_subtree_LH, offset = normalize_profile(tmp_log_subtree_LH, log=True)
+            root.marginal_subtree_LH_prefactor += offset # and store log-prefactor
+
+
+        self.total_LH_and_root_sequence(assign_sequence=False)
+        self.preorder_traversal_marginal(assign_sequence=False, reconstruct_leaves=False)
+
+
+
+
+    def optimize_tree_marginal_new2(self, n_iter_internal=2, damping=0.5):
+        L = self.data.compressed_length
+        n_states = self.gtr.alphabet.shape[0]
+        # propagate leaves --> root, set the marginal-likelihood messages
+        for node in self.tree.get_nonterminals(order='postorder'): #leaves -> root
+            if node.up is None and len(node.clades)==2:
+                continue
+            # regardless of what was before, set the profile to ones
+            for ii in range(n_iter_internal):
+                damp = damping**(1+ii)
+                tmp_log_subtree_LH = np.zeros((L,n_states), dtype=float)
+                node.marginal_subtree_LH_prefactor = np.zeros(L, dtype=float)
+                for ch in node.clades:
+                    outgroup = np.exp(np.log(np.maximum(ttconf.TINY_NUMBER, node.marginal_profile)) - ch.marginal_log_Lx)
+
+                    bl = self.gtr.optimal_t_compressed((ch.marginal_subtree_LH, outgroup), multiplicity=self.data.multiplicity, profiles=True, tol=1e-8)
+                    new_bl = (1-damp)*bl + damp*ch.branch_length
+                    ch.branch_length=new_bl
+                    ch.marginal_log_Lx = self.gtr.propagate_profile(ch.marginal_subtree_LH,
+                                                new_bl, return_log=True) # raw prob to transfer prob up
+                    tmp_log_subtree_LH += ch.marginal_log_Lx
+                    node.marginal_subtree_LH_prefactor += ch.marginal_subtree_LH_prefactor
+
+                node.marginal_subtree_LH, offset = normalize_profile(tmp_log_subtree_LH, log=True)
+                node.marginal_subtree_LH_prefactor += offset # and store log-prefactor
+
+                if node.up:
+                    bl = self.gtr.optimal_t_compressed((node.marginal_subtree_LH, node.marginal_outgroup_LH), multiplicity=self.data.multiplicity, profiles=True, tol=1e-8)
+                    new_bl = (1-damp)*bl + damp*node.branch_length
+                    node.branch_length=new_bl
+                    node.marginal_log_Lx = self.gtr.propagate_profile(node.marginal_subtree_LH,
+                                                    new_bl, return_log=True) # raw prob to transfer prob up
+                    node.marginal_outgroup_LH, pre = normalize_profile(np.log(np.maximum(ttconf.TINY_NUMBER, node.up.marginal_profile)) - node.marginal_log_Lx,
+                                                 log=True, return_offset=False)
+
+                    tmp_msg_from_parent = self.gtr.evolve(node.marginal_outgroup_LH,
+                                                     self._branch_length_to_gtr(node), return_log=False)
+                    node.marginal_profile, pre = normalize_profile(node.marginal_subtree_LH * tmp_msg_from_parent, return_offset=False)
+                else:
+                    node.marginal_profile, pre = normalize_profile(node.marginal_subtree_LH * node.marginal_outgroup_LH, return_offset=False)
+
+
+        root=self.tree.root
+        print(len(root.clades))
+        if len(root.clades)==2:
+            tmp_log_subtree_LH = np.zeros((L,n_states), dtype=float)
+            root.marginal_subtree_LH_prefactor = np.zeros(L, dtype=float)
+            old_bl = root.clades[0].branch_length + root.clades[1].branch_length
+            bl = self.gtr.optimal_t_compressed((root.clades[0].marginal_subtree_LH*root.marginal_outgroup_LH,
+                                                root.clades[1].marginal_subtree_LH), multiplicity=self.data.multiplicity,
+                                                profiles=True, tol=1e-8)
+            for ch in root:
+                ch.branch_length *= bl/old_bl
+                ch.marginal_log_Lx = self.gtr.propagate_profile(ch.marginal_subtree_LH,
+                                            ch.branch_length, return_log=True) # raw prob to transfer prob up
+                tmp_log_subtree_LH += ch.marginal_log_Lx
+                root.marginal_subtree_LH_prefactor += ch.marginal_subtree_LH_prefactor
+
+            root.marginal_subtree_LH, offset = normalize_profile(tmp_log_subtree_LH, log=True)
+            root.marginal_subtree_LH_prefactor += offset # and store log-prefactor
+
+
+        self.total_LH_and_root_sequence(assign_sequence=False)
+        self.preorder_traversal_marginal(assign_sequence=False, reconstruct_leaves=False)
--- a/treetime/treeanc.py
+++ b/treetime/treeanc.py
--- a/treetime/treeregression.py
+++ b/treetime/treeregression.py
@@ -21,6 +21,9 @@ def base_regression(Q, slope=None):
    TYPE
        Description
    """
+    if np.isinf(Q).sum() or np.isnan(Q).sum():
+        raise ValueError("Invalid values in input data!")
+
    if slope is None:
        if (Q[tsqii] - Q[tavgii]**2/Q[sii])>0:
            slope = (Q[dtavgii] - Q[tavgii]*Q[davgii]/Q[sii]) \
@@ -355,7 +358,8 @@ class TreeRegression(object):
            bv = self.branch_value(n)
            var = self.branch_variance(n)
            for dx in [-0.001, 0.001]:
-                y = min(1.0, max(0.0, best_root["split"]+dx))
+                # y needs to be bounded away from 0 and 1 to avoid division by 0
+                y = min(0.9999, max(0.0001, best_root["split"]+dx))
                tmpQ = self.propagate_averages(n, tv, bv*y, var*y) \
                     + self.propagate_averages(n, tv, bv*(1-y), var*(1-y), outgroup=True)
                reg = base_regression(tmpQ, slope=slope)
@@ -381,6 +385,10 @@ class TreeRegression(object):
                 + self.propagate_averages(n, tv, bv*(1-x), var*(1-x), outgroup=True)
            return base_regression(tmpQ, slope=slope)['chisq']

+        if n.bad_branch or (n!=self.tree.root and n.up.bad_branch):
+            return np.nan, np.inf
+
+
        chisq_prox = np.inf if n.is_terminal() else base_regression(n.Qtot, slope=slope)['chisq']
        chisq_dist = np.inf if n==self.tree.root else base_regression(n.up.Qtot, slope=slope)['chisq']

@@ -423,6 +431,8 @@ class TreeRegression(object):
            regression parameters
        """
        best_root = self.find_best_root(force_positive=force_positive, slope=slope)
+        if best_root is None:
+            raise ValueError("Rerooting failed!")
        best_node = best_root["node"]

        x = best_root["split"]

--- a/treetime/treetime.py
+++ b/treetime/treetime.py
@@ -3,6 +3,7 @@ import numpy as np
 from scipy import optimize as sciopt
 from Bio import Phylo
 from treetime import config as ttconf
+from treetime import MissingDataError,UnknownMethodError,NotReadyError
 from .utils import tree_layout
 from .clock_tree import ClockTree

@@ -105,7 +106,8 @@ class TreeTime(ClockTree):

        use_covariation : bool, optional
            default False, if False, rate estimates will be performed using simple
-            regression ignoring phylogenetic covaration between nodes.
+            regression ignoring phylogenetic covariation between nodes. If vary_rate is True,
+            use_covariation is True by default.

        **kwargs
           Keyword arguments needed by the downstream functions
@@ -120,11 +122,10 @@ class TreeTime(ClockTree):
        """

        # register the specified covaration mode
-        self.use_covariation = use_covariation
+        self.use_covariation = use_covariation or (vary_rate and (not type(vary_rate)==float))

-        if (self.tree is None) or (self.aln is None and self.seq_len is None):
-            self.logger("TreeTime.run: ERROR, alignment or tree are missing", 0)
-            return ttconf.ERROR
+        if (self.tree is None) or (self.aln is None and self.data.full_length is None):
+            raise MissingDataError("TreeTime.run: ERROR, alignment or tree are missing")
        if (self.aln is None):
            branch_length_mode='input'

@@ -132,7 +133,8 @@ class TreeTime(ClockTree):

        # determine how to reconstruct and sample sequences
        seq_kwargs = {"marginal_sequences":sequence_marginal or (self.branch_length_mode=='marginal'),
-                      "sample_from_profile":"root"}
+                      "sample_from_profile":"root",
+                      "reconstruct_tip_states":kwargs.get("reconstruct_tip_states", False)}

        tt_kwargs = {'clock_rate':fixed_clock_rate, 'time_marginal':False}
        tt_kwargs.update(kwargs)
@@ -160,11 +162,9 @@ class TreeTime(ClockTree):
            else:
                plot_rtt=False
            reroot_mechanism = 'least-squares' if root=='clock_filter' else root
-            if self.clock_filter(reroot=reroot_mechanism, n_iqd=n_iqd, plot=plot_rtt, fixed_clock_rate=fixed_clock_rate)==ttconf.ERROR:
-                return ttconf.ERROR
+            self.clock_filter(reroot=reroot_mechanism, n_iqd=n_iqd, plot=plot_rtt, fixed_clock_rate=fixed_clock_rate)
        elif root is not None:
-            if self.reroot(root=root, clock_rate=fixed_clock_rate)==ttconf.ERROR:
-                return ttconf.ERROR
+            self.reroot(root=root, clock_rate=fixed_clock_rate)

        if self.branch_length_mode=='input':
            if self.aln:
@@ -181,8 +181,9 @@ class TreeTime(ClockTree):
        self.LH =[[seq_LH, self.tree.positional_joint_LH, 0]]

        if root is not None and max_iter:
-            if self.reroot(root='least-squares' if root=='clock_filter' else root, clock_rate=fixed_clock_rate)==ttconf.ERROR:
-                return ttconf.ERROR
+            new_root = self.reroot(root='least-squares' if root=='clock_filter' else root, clock_rate=fixed_clock_rate)
+            self.logger("###TreeTime.run: rerunning timetree after rerooting",0)
+            self.make_time_tree(**tt_kwargs)

        # iteratively reconstruct ancestral sequences and re-infer
        # time tree to ensure convergence.
@@ -241,14 +242,11 @@ class TreeTime(ClockTree):
        # rerun the estimation for variations of the rate
        if vary_rate:
            if type(vary_rate)==float:
-                res = self.calc_rate_susceptibility(rate_std=vary_rate, params=tt_kwargs)
+                self.calc_rate_susceptibility(rate_std=vary_rate, params=tt_kwargs)
            elif self.clock_model['valid_confidence']:
-                res = self.calc_rate_susceptibility(params=tt_kwargs)
+                self.calc_rate_susceptibility(params=tt_kwargs)
            else:
-                res = ttconf.ERROR
-
-            if res==ttconf.ERROR:
-                self.logger("TreeTime.run: rate variation failed and can't be used for confidence estimation", 1, warn=True)
+                raise UnknownMethodError("TreeTime.run: rate variation for confidence estimation is not available. Either specify it explicitly, or estimate from root-to-tip regression.")

        # if marginal reconstruction requested, make one more round with marginal=True
        # this will set marginal_pos_LH, which to be used as error bar estimations
@@ -325,8 +323,7 @@ class TreeTime(ClockTree):

        terminals = self.tree.get_terminals()
        if reroot:
-            if self.reroot(root='least-squares' if reroot=='best' else reroot, covariation=False, clock_rate=fixed_clock_rate)==ttconf.ERROR:
-                return ttconf.ERROR
+            self.reroot(root='least-squares' if reroot=='best' else reroot, covariation=False, clock_rate=fixed_clock_rate)
        else:
            self.get_clock_model(covariation=False, slope=fixed_clock_rate)

@@ -339,16 +336,23 @@ class TreeTime(ClockTree):

        residuals = np.array(list(res.values()))
        iqd = np.percentile(residuals,75) - np.percentile(residuals,25)
+        bad_branch_count = 0
        for node,r in res.items():
            if abs(r)>n_iqd*iqd and node.up.up is not None:
                self.logger('TreeTime.ClockFilter: marking %s as outlier, residual %f interquartile distances'%(node.name,r/iqd), 3, warn=True)
                node.bad_branch=True
+                bad_branch_count += 1
            else:
                node.bad_branch=False

+        if bad_branch_count>0.34*self.tree.count_terminals():
+            self.logger("TreeTime.clock_filter: More than a third of leaves have been excluded by the clock filter. Please check your input data.", 0, warn=True)
+        # reassign bad_branch flags to internal nodes
+        self.prepare_tree()
+
        # redo root estimation after outlier removal
-        if reroot and self.reroot(root=reroot, clock_rate=fixed_clock_rate)==ttconf.ERROR:
-                return ttconf.ERROR
+        if reroot:
+            self.reroot(root=reroot, clock_rate=fixed_clock_rate)

        if plot:
            self.plot_root_to_tip()
@@ -414,6 +418,7 @@ class TreeTime(ClockTree):

        use_cov = self.use_covariation if covariation is None else covariation
        slope = 0.0 if type(root)==str and root.startswith('min_dev') else clock_rate
+        old_root = self.tree.root

        self.logger("TreeTime.reroot: with method or node: %s"%root,0)
        for n in self.tree.find_clades():
@@ -444,8 +449,7 @@ class TreeTime(ClockTree):
                                   if n.raw_date_constraint is not None],
                                   key=lambda x:np.mean(x.raw_date_constraint))[0]
            else:
-                self.logger('TreeTime.reroot -- ERROR: unsupported rooting mechanisms or root not found',0,warn=True)
-                return ttconf.ERROR
+                raise UnknownMethodError('TreeTime.reroot -- ERROR: unsupported rooting mechanisms or root not found')

            #this forces a bifurcating root, as we want. Branch lengths will be reoptimized anyway.
            #(Without outgroup_branch_length, gives a trifurcating root, but this will mean
@@ -454,9 +458,6 @@ class TreeTime(ClockTree):
            self.get_clock_model(covariation=use_cov, slope = slope)


-        if new_root == ttconf.ERROR:
-            return ttconf.ERROR
-
        self.logger("TreeTime.reroot: Tree was re-rooted to node "
                    +('new_node' if new_root.name is None else new_root.name), 2)

@@ -478,7 +479,7 @@ class TreeTime(ClockTree):

        self.get_clock_model(covariation=self.use_covariation, slope=slope)

-        return ttconf.SUCCESS
+        return new_root


    def resolve_polytomies(self, merge_compressed=False):
@@ -540,7 +541,7 @@ class TreeTime(ClockTree):

        from .branch_len_interpolator import BranchLenInterpolator

-        zero_branch_slope = self.gtr.mu*self.seq_len
+        zero_branch_slope = self.gtr.mu*self.data.full_length

        def _c_gain(t, n1, n2, parent):
            """
@@ -598,13 +599,13 @@ class TreeTime(ClockTree):

                # set parameters for the new node
                new_node.up = clade
+                new_node.tt = self
                n1.up = new_node
                n2.up = new_node
-                if hasattr(clade, "cseq"):
-                    new_node.cseq = clade.cseq
-                    self._store_compressed_sequence_to_node(new_node)
+                if hasattr(clade, "_cseq"):
+                    new_node._cseq = clade._cseq
+                    self.add_branch_state(new_node)

-                new_node.mutations = []
                new_node.mutation_length = 0.0
                new_node.branch_length_interpolator = BranchLenInterpolator(new_node, self.gtr, one_mutation=self.one_mutation,
                                                                            branch_length_mode = self.branch_length_mode)
@@ -860,10 +861,14 @@ def plot_vs_years(tt, step = None, ax=None, confidence=None, ticks=True, **kwarg
        tick_vals = [x+offset-shift for x in xticks]

    ax.set_xticks(xticks)
-    ax.set_xticklabels(map(str, tick_vals))
+    if step>=1:
+        tick_labels = ["%d"%(int(x)) for x in tick_vals]
+    else:
+        tick_labels = ["%1.2f"%(x) for x in tick_vals]
+    ax.set_xlim((0,date_range))
+    ax.set_xticklabels(tick_labels)
    ax.set_xlabel('year')
    ax.set_ylabel('')
-    ax.set_xlim((0,date_range))

    # put shaded boxes to delineate years
    if step:
@@ -878,7 +883,7 @@ def plot_vs_years(tt, step = None, ax=None, confidence=None, ticks=True, **kwarg
                          edgecolor=[1,1,1])
            ax.add_patch(r)
            if year in tick_vals and pos>=xlim[0] and pos<=xlim[1] and ticks:
-                label_str = str(step*(year//step)) if step<1 else  str(int(year))
+                label_str = "%1.2f"%(step*(year//step)) if step<1 else  str(int(year))
                ax.text(pos,ylim[0]-0.04*(ylim[1]-ylim[0]), label_str,
                        horizontalalignment='center')
        ax.set_axis_off()
@@ -887,15 +892,13 @@ def plot_vs_years(tt, step = None, ax=None, confidence=None, ticks=True, **kwarg
    if confidence:
        tree_layout(tt.tree)
        if not hasattr(tt.tree.root, "marginal_inverse_cdf"):
-            print("marginal time tree reconstruction required for confidence intervals")
-            return ttconf.ERROR
+            raise NotReadyError("marginal time tree reconstruction required for confidence intervals")
        elif type(confidence) is float:
            cfunc = tt.get_max_posterior_region
        elif len(confidence)==2:
            cfunc = tt.get_confidence_interval
        else:
-            print("confidence needs to be either a float (for max posterior region) or a two numbers specifying lower and upper bounds")
-            return ttconf.ERROR
+            raise NotReadyError("confidence needs to be either a float (for max posterior region) or two numbers specifying lower and upper bounds")

        for n in tt.tree.find_clades():
            pos = cfunc(n, confidence)

--- a/treetime/utils.py
+++ b/treetime/utils.py
@@ -7,7 +7,7 @@ from scipy.interpolate import interp1d
 from scipy.integrate import quad
 from scipy import stats
 from scipy.ndimage import binary_dilation
-from treetime import config as ttconf
+from treetime import TreeTimeError

 class DateConversion(object):
    """
@@ -101,7 +101,7 @@ class DateConversion(object):

    def to_numdate(self, tbp):
        """
-        Convert the numeric date to the branch-len scale
+        Convert time before present measured in clock rate units to numeric calendar dates
        """
        return numeric_date() - self.to_years(tbp)

@@ -150,19 +150,66 @@ def numeric_date(dt=None):
        date of to be converted. if None, assume today

    """
+    from calendar import isleap
+
    if dt is None:
        dt = datetime.datetime.now()

+    days_in_year = 366 if isleap(dt.year) else 365
    try:
-        res = dt.year + dt.timetuple().tm_yday / 365.25
+        res = dt.year + (dt.timetuple().tm_yday-0.5) / days_in_year
    except:
        res = None

    return res


+def datetime_from_numeric(numdate):
+    """convert a numeric decimal date to a python datetime object
+    Note that this only works for AD dates since the range of datetime objects
+    is restricted to year>=1.
+
+    Parameters
+    ----------
+    numdate : float
+        numeric date as in 2018.23
+
+    Returns
+    -------
+    datetime.datetime
+        datetime object
+    """
+    from calendar import isleap
+    days_in_year = 366 if isleap(int(numdate)) else 365
+    # add a small offset to the fraction of the year elapsed to avoid
+    # floating-point truncation for values such as 1/365, 2/365, etc.
+    days_elapsed = int(((numdate%1)+1e-10)*days_in_year)
+    date = datetime.datetime(int(numdate),1,1) + datetime.timedelta(days=days_elapsed)
+    return date
+
+
+def datestring_from_numeric(numdate):
+    """convert a numerical date to a formatted date string YYYY-MM-DD
+
+    Parameters
+    ----------
+    numdate : float
+        numeric date as in 2018.23
+
+    Returns
+    -------
+    str
+        date string YYYY-MM-DD
+    """
+    if numdate>1900: # python datetime doesn't work for dates before 1900. This can be relaxed to numdate>1 once we drop python 2.7
+        return datetime.datetime.strftime(datetime_from_numeric(numdate), "%Y-%m-%d")
+    else:
+        year = int(np.floor(numdate))
+        dt = datetime_from_numeric(1900+(numdate%1))
+        return "%04d-%02d-%02d"%(year, dt.month, dt.day)
+

-def parse_dates(date_file):
+def parse_dates(date_file, name_col=None, date_col=None):
    """
    parse dates from the arguments and return a dictionary mapping
    taxon names to numerical dates.
@@ -191,7 +238,7 @@ def parse_dates(date_file):

    try:
        # read the metadata file into pandas dataframe.
-        df = pd.read_csv(date_file, sep=full_sep, engine='python')
+        df = pd.read_csv(date_file, sep=full_sep, engine='python', dtype='str')
        # check the metadata has strain names in the first column
        # look for the column containing sampling dates
        # We assume that the dates might be given either in human-readable format
@@ -212,25 +259,37 @@ def parse_dates(date_file):
            if any([x==col.lower() for x in ['name', 'strain', 'accession']]):
                potential_index_columns.append((ci, col))

+        if date_col and date_col not in df.columns:
+            raise TreeTimeError("ERROR: specified column for dates does not exist. \n\tAvailable columns are: "\
+                                +", ".join(df.columns)+"\n\tYou specified '%s'"%date_col)
+
+        if name_col and name_col not in df.columns:
+            raise TreeTimeError("ERROR: specified column for the taxon name does not exist. \n\tAvailable columns are: "\
+                                +", ".join(df.columns)+"\n\tYou specified '%s'"%name_col)
+
+
        dates = {}
        # if a potential numeric date column was found, use it
        # (use the first, if there are more than one)
-        if not len(potential_index_columns):
-            print("ERROR: Cannot read metadata: need at least one column that contains the taxon labels."
-                  " Looking for the first column that contains 'name', 'strain', or 'accession' in the header.", file=sys.stderr)
-            return dates
+        if not (len(potential_index_columns) or name_col):
+            raise TreeTimeError("ERROR: Cannot read metadata: need at least one column that contains the taxon labels."
+                  " Looking for the first column that contains 'name', 'strain', or 'accession' in the header.")
        else:
            # use the first column that is either 'name', 'strain', 'accession'
-            index_col = sorted(potential_index_columns)[0][1]
+            if name_col is None:
+                index_col = sorted(potential_index_columns)[0][1]
+            else:
+                index_col = name_col
            print("\tUsing column '%s' as name. This needs match the taxon names in the tree!!"%index_col)

-        if len(potential_date_columns)>=1:
+        if len(potential_date_columns)>=1 or date_col:
            #try to parse the csv file with dates in the idx column:
-            idx = potential_date_columns[0][0]
-            col_name = potential_date_columns[0][1]
-            print("\tUsing column '%s' as date."%col_name)
+            if date_col is None:
+                date_col = potential_date_columns[0][1]
+
+            print("\tUsing column '%s' as date."%date_col)
            for ri, row in df.iterrows():
-                date_str = row.loc[col_name]
+                date_str = row.loc[date_col]
                k = row.loc[index_col]
                # try parsing as a float first
                try:
@@ -255,15 +314,16 @@ def parse_dates(date_file):
                            dates[k] = [numeric_date(x) for x in [lower, upper]]

        else:
-            print("ERROR: Metadata file has no column which looks like a sampling date!", file=sys.stderr)
+            raise TreeTimeError("ERROR: Metadata file has no column which looks like a sampling date!")

        if all(v is None for v in dates.values()):
-            print("ERROR: Cannot parse dates correctly! Check date format.", file=sys.stderr)
-            return {}
+            raise TreeTimeError("ERROR: Cannot parse dates correctly! Check date format.")
+        print(dates)
        return dates
+    except TreeTimeError as err:
+        raise err
    except:
-        print("ERROR: Cannot read the metadata file!", file=sys.stderr)
-        return {}
+        raise


 def ambiguous_date_to_date_range(mydate, fmt="%Y-%m-%d", min_max_year=None):
@@ -283,10 +343,9 @@ def ambiguous_date_to_date_range(mydate, fmt="%Y-%m-%d", min_max_year=None):
    tuple
        upper and lower bounds on the date. return (None, None) if errors
    """
-    from datetime import datetime
    sep = fmt.split('%')[1][-1]
    min_date, max_date = {}, {}
-    today = datetime.today().date()
+    today = datetime.date.today()

    for val, field  in zip(mydate.split(sep), fmt.split(sep+'%')):
        f = 'year' if 'y' in field.lower() else ('day' if 'd' in field.lower() else 'month')
@@ -315,8 +374,8 @@ def ambiguous_date_to_date_range(mydate, fmt="%Y-%m-%d", min_max_year=None):
                return None, None
    max_date['day'] = min(max_date['day'], 31 if max_date['month'] in [1,3,5,7,8,10,12]
                                           else 28 if max_date['month']==2 else 30)
-    lower_bound = datetime(year=min_date['year'], month=min_date['month'], day=min_date['day']).date()
-    upper_bound = datetime(year=max_date['year'], month=max_date['month'], day=max_date['day']).date()
+    lower_bound = datetime.date(year=min_date['year'], month=min_date['month'], day=min_date['day'])
+    upper_bound = datetime.date(year=max_date['year'], month=max_date['month'], day=max_date['day'])
    return (lower_bound, upper_bound if upper_bound<today else today)


@@ -386,6 +445,7 @@ def build_newick_fasttree(aln_fname, nuc=True):

 def build_newick_raxml(aln_fname, nthreads=2, raxml_bin="raxml", **kwargs):
    import shutil,os
+    print("Building tree with raxml")
    from Bio import Phylo, AlignIO
    AlignIO.write(AlignIO.read(aln_fname, 'fasta'),"temp.phyx", "phylip-relaxed")
    cmd = raxml_bin + " -f d -T " + str(nthreads) + " -m GTRCAT -c 25 -p 235813 -n tre -s temp.phyx"
@@ -397,12 +457,33 @@ def build_newick_iqtree(aln_fname, nthreads=2, iqtree_bin="iqtree",
                        iqmodel="HKY",  **kwargs):
    import os
    from Bio import Phylo, AlignIO
-    with open(aln_fname) as ifile:
-        tmp_seqs = ifile.readlines()
+    print("Building tree with iqtree")
+    aln = None
+    for fmt in ['fasta', 'phylip-relaxed']:
+        try:
+            aln = AlignIO.read(aln_fname, fmt)
+            break
+        except:
+            continue
+
+    if aln is None:
+        raise ValueError("failed to read alignment for tree building")
+
    aln_file = "temp.fasta"
-    with open(aln_file, 'w') as ofile:
-        for line in tmp_seqs:
-            ofile.write(line.replace('/', '_X_X_').replace('|','_Y_Y_'))
+    seq_names = set()
+    for s in aln:
+        tmp  = s.id
+        for c, sub in zip('/|()', 'VWXY'):
+            tmp = tmp.replace(c, '_%s_%s_'%(sub,sub))
+        if tmp in seq_names:
+            print("A sequence with name {} already exists, skipping....".format(s.id))
+            continue
+        s.id = tmp
+        s.name = s.id
+        s.description = ''
+        seq_names.add(s.id)
+
+    AlignIO.write(aln, aln_file, 'fasta')

    fast_opts = [
        "-ninit", "2",
@@ -416,7 +497,10 @@ def build_newick_iqtree(aln_fname, nthreads=2, iqtree_bin="iqtree",
    os.system(" ".join(call))
    T = Phylo.read(aln_file+".treefile", 'newick')
    for n in T.get_terminals():
-        n.name = n.name.replace('_X_X_','/').replace('_Y_Y_','|')
+        tmp = n.name
+        for c, sub in zip('/|()', 'VWXY'):
+            tmp = tmp.replace('_%s_%s_'%(sub,sub), c)
+        n.name = tmp
    return T

 if __name__ == '__main__':