repo: 9abeda989914318a9bdd9de1f76edb42c4b84a8b
node: 2b20f3dbd996d3db969b64b744e0f0107f33876c
branch: default
tag: 2019.9.13a
@@ -18,3 +18,12 @@ draw_results
*.coverage
*egg-info*
*.wpu
.cache*
*taskpaper
*.ipynb_checkpoints*
*.sublime*
*.patch
*.pytest_cache
*.tox
*.vscode
*.code-workspace
\ No newline at end of file
4ee8471d007032a28f2d75edf63504fcdafb5eaa 3.0a1
72e162442356b3090e0e884d0461c1ddb16f6c5c remove
4ee8471d007032a28f2d75edf63504fcdafb5eaa 3.0a1
72e162442356b3090e0e884d0461c1ddb16f6c5c 3.0a1
72e162442356b3090e0e884d0461c1ddb16f6c5c 3.0a1
d8e9473a0b248c3a0875fd71053114412ab290ed 3.0a1
e2e8731e8ea3dd6b483f1d9b60268cd7762a7e48 3.0a2
e2e8731e8ea3dd6b483f1d9b60268cd7762a7e48 3.0a2
01be08e75ae196debb552110547315f751d2bd64 3.0a2
36da26dc56e4f400ac62c43bed0813af9d08e204 2019.08.06a
36da26dc56e4f400ac62c43bed0813af9d08e204 2019.08.06a
91686dc745555aa467b93d3f95e2a313e9d6e69c 2019.08.06a
91686dc745555aa467b93d3f95e2a313e9d6e69c 2019.08.06a
833f75a211c160c76267a98d094e32b571b284ec 2019.08.06a
833f75a211c160c76267a98d094e32b571b284ec 2019.08.06a
aa0e8057ad0f0a225eaf715773053481eb7f7184 2019.08.06a
aa0e8057ad0f0a225eaf715773053481eb7f7184 2019.08.06a
a79911d979e8e52b74e84922171305e1686a28f5 2019.08.06a
a79911d979e8e52b74e84922171305e1686a28f5 2019.08.06a
7828e9e27b0b79473eb34c821d68590ca2d909d6 2019.08.06a
7828e9e27b0b79473eb34c821d68590ca2d909d6 2019.08.06a
d24d24a416d58e406722903179c5788b8dfc5fbb 2019.08.06a
d24d24a416d58e406722903179c5788b8dfc5fbb 2019.08.06a
124948744635e6612d35d369cbff385a6e859781 2019.08.06a
124948744635e6612d35d369cbff385a6e859781 2019.08.06a
ddff5cd51d0b0e91dddc99cf38be5c5a5e580cca 2019.08.06a
39a382d7836bda436d9970e54b84242dca88f7a2 2019.8.20a
27890ba31bb55df37a4176dbd5fdbacbec4774cc 2019.8.23a
e24728110ae2894b52bd7f6f2b8ec02411deb0ae 2019.8.28a
e24728110ae2894b52bd7f6f2b8ec02411deb0ae 2019.8.28a
5ee7ba84b3a2c606b800313a99b5bf602679ce84 2019.8.28a
5b08dd8d5184773611bf265a13d20f427335a9ee 2019.8.30a
Copyright 2019 Gavin Huttley
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
\ No newline at end of file
include MANIFEST.in LICENSE
recursive-include src *.c
recursive-include src *.pyx *.c
The Readme
==========
:Download: `From github <https://github.com/pycogent/pycogent>`_ or follow the :ref:`quick-install` instructions.
Dependencies
------------
The toolkit requires Python 2.5.1 or greater, and Numpy 1.3 or greater. Aside from these, the dependencies below are optional and the code will work as is. A C compiler, however, will allow the external C modules responsible for the likelihood and matrix exponentiation calculations to be compiled, resulting in significantly improved performance.
.. _required:
Required
^^^^^^^^
- Python_: the language the toolkit is primarily written in, and in which the user writes control scripts.
- Numpy_: This is a python module used for speeding up matrix computations. It is available as source code for \*nix.
- zlib_: This is a compression library which is available for all platforms and comes pre-installed on most too. If, by chance, your platform doesn't have this installed then download the source from the zlib_ site and follow the install instructions, or refer to the instructions for `compiling matplotlib`_.
.. note:: On some linux platforms (like Ubuntu), you must specifically install a ``python-dev`` package so that the Python_ header files required for building some external dependencies are available.
Optional
^^^^^^^^
- C compiler: This is standard on most \*nix platforms. On Mac OS X this is available for free in the Developer tools which, if you don't already have them, can be obtained from Apple_.
- Matplotlib_: used to plot several kinds of graphs related to codon usage. For installation, see these instructions for `compiling matplotlib`_.
- Cython_: This module is only necessary if you are a developer who wants to modify the \*.pyx files.
- mpi4py_: Message Passing Interface (MPI) bindings, required for parallel computation.
- SQLAlchemy_ and a mysql connector such as `PyMySQL <https://github.com/PyMySQL/PyMySQL>`_: These are required for the Ensembl querying code.
See the :ref:`quick-install` approach for how these can be installed using the ``all`` option.
Installation
------------
See the :ref:`quick-install`.
Testing
-------
``PyCogent/tests`` contains all the tests. You can most readily run the tests using the ``PyCogent/run_tests`` shell script. This is done by typing:
.. code-block:: guess
$ sh run_tests
which will automatically build extensions in place, set up the PYTHONPATH and run ``PyCogent/tests/alltests.py``. Note that if certain optional applications are not installed this will be indicated in the output as "can't find" or "not installed". A "`.`" will be printed to screen for each test and if they all pass, you'll see output like:
.. code-block:: guess
Ran 3299 tests in 58.128s
OK
Tips for usage
--------------
A good IDE can greatly simplify writing control scripts. Features such as code completion and definition look-up are extremely useful. For a complete list of `editors go here`_.
To get help on attributes of an object in python, use
.. code-block:: python
>>> dir(myalign)
to list the attributes of ``myalign`` or
.. code-block:: python
>>> help(myalign.writeToFile)
to figure out how to use the ``myalign.writeToFile`` method. Also note that the directory structure of the package mirrors the import statements required to use a module -- to see the contents of ``alignment.py`` or ``sequence.py`` you look in the ``cogent/core`` directory; to use the classes in those files you specify ``cogent.core`` when importing.
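For example, to use the ``Alignment`` class defined in ``cogent/core/alignment.py``:

.. code-block:: python

    >>> from cogent.core.alignment import Alignment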
.. _Python: http://www.python.org
.. _Cython: http://www.cython.org/
.. _Numpy: http://numpy.scipy.org/
.. _Matplotlib: http://matplotlib.sourceforge.net
.. _Apple: http://www.apple.com
.. _Pyrex: http://www.cosc.canterbury.ac.nz/~greg/python/Pyrex/
.. _`editors go here`: http://www.python.org/cgi-bin/moinmoin/PythonEditors
.. _mpi4py: http://code.google.com/p/mpi4py
.. _`restructured text`: http://docutils.sourceforge.net/rst.html
.. _gcc: http://gcc.gnu.org/
.. _SQLAlchemy: http://www.sqlalchemy.org
.. _`mysql-connector-python`: https://pypi.python.org/pypi/mysql-connector-python
.. _zlib: http://www.zlib.net/
.. _`compiling matplotlib`: http://sourceforge.net/projects/pycogent/forums/forum/651121/topic/5635916
PyCogent: A toolkit for making sense from sequence
==================================================
# NOTICE
The official PyCogent source code repository. For details on PyCogent, see http://www.pycogent.org/.
We will be departing Bitbucket for GitHub soon due to their recent decision to drop support for Mercurial. Stay tuned for details.
PyCogent includes connectors to remote databases, built-in generalized probabilistic techniques for working with biological sequences, and controllers for 3rd party applications.
## PyCogent3 → Cogent3
For PyCogent news and announcements, including notification of new releases, you can subscribe to the [PyCogent News and Announcements Blog](http://pycogent.wordpress.com).
This is the port of PyCogent to Python 3. It is being renamed to Cogent3 because it has been rewritten extensively and `cogent` has always been the import name (dating back to 2004!).
## What's New?
A massive amount!
We integrated many of the new developments on modelling of non-stationary processes published by the Huttley lab over the last decade.
We have implemented a `cogent3.app` module with substantially new capabilities, notably a functional programming style interface that lowers the effort required to use cogent3's advanced capabilities. These are not yet documented.
Importantly, the app interface should be considered alpha level code. Documentation for these new features will follow in the next few weeks.
## What's Changed?
There are massive changes to the library since the PyCogent version 1.9 release (which is only compatible with python 2).
Please check the wiki for a list of the major API changes, but note that as of August 23, 2019 the wiki is very much out of date.
We will update it as soon as possible.
## Documentation
PyCogent3 documentation is on [readthedocs](https://cogent3.readthedocs.io/en/latest/).
\ No newline at end of file
image: python:3.7.1
pipelines:
default:
- step:
caches:
- pip
script: # Modify the commands below to build your repository.
- pip install numpy
- pip install pandas
- pip install tqdm
- pip install plotly
- pip install scitrack
- pip install tinydb
- pip install pytest
- pip install tox
- tox -e py37
"""The most commonly used constructors are available from this toplevel module.
The rest are in the subpackages: phylo, evolve, maths, draw, parse and format."""
import sys, os, re, cPickle
import numpy
__author__ = ""
__copyright__ = "Copyright 2007-2016, The Cogent Project"
__credits__ = ["Gavin Huttley", "Rob Knight", "Peter Maxwell",
"Jeremy Widmann", "Catherine Lozupone", "Matthew Wakefield",
"Edward Lang", "Greg Caporaso", "Mike Robeson",
"Micah Hamady", "Sandra Smit", "Zongzhi Liu",
"Andrew Butterfield", "Amanda Birmingham", "Brett Easton",
"Hua Ying", "Jason Carnes", "Raymond Sammut",
"Helen Lindsay", "Daniel McDonald"]
__license__ = "GPL"
__version__ = "1.9"
__maintainer__ = "Gavin Huttley"
__email__ = "gavin.huttley@anu.edu.au"
__status__ = "Production"
#SUPPORT2425
if sys.version_info < (2, 6):
py_version = ".".join([str(n) for n in sys.version_info])
raise RuntimeError("Python-2.6 or greater is required, Python-%s used." % py_version)
numpy_version = re.split("[^\d]", numpy.__version__)
numpy_version_info = tuple([int(i) for i in numpy_version if i.isdigit()])
if numpy_version_info < (1, 3):
raise RuntimeError("Numpy-1.3 is required, %s found." % numpy_version)
version = __version__
version_info = tuple([int(v) for v in version.split(".") if v.isdigit()])
from cogent.util.table import Table as _Table
from cogent.parse.table import load_delimited, autogen_reader
from cogent.core.tree import TreeBuilder, TreeError
from cogent.parse.tree_xml import parse_string as tree_xml_parse_string
from cogent.parse.newick import parse_string as newick_parse_string
from cogent.core.alignment import SequenceCollection
from cogent.core.alignment import Alignment
from cogent.parse.sequence import FromFilenameParser
from cogent.parse.structure import FromFilenameStructureParser
#note that moltype has to be imported last, because it sets the moltype in
#the objects created by the other modules.
from cogent.core.moltype import ASCII, DNA, RNA, PROTEIN, STANDARD_CODON, \
CodonAlphabet
def Sequence(moltype=None, seq=None, name=None, filename=None, format=None):
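    """Returns a Sequence built from a string, or the first record in a file.

    Raises ValueError if the file contains more than one sequence."""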
if seq is None:
for (a_name, a_seq) in FromFilenameParser(filename, format):
if seq is None:
seq = a_seq
if name is None:
name = a_name
else:
raise ValueError("Multiple sequences in '%s'" % filename)
if moltype is not None:
seq = moltype.makeSequence(seq)
elif not hasattr(seq, 'MolType'):
seq = ASCII.makeSequence(seq)
if name is not None:
seq.Name = name
return seq
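# A minimal usage sketch for Sequence(); the values are hypothetical and a
# string is supplied directly instead of a filename:
#     seq = Sequence(moltype=DNA, seq="ACGGTT", name="demo")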
def LoadSeqs(filename=None, format=None, data=None, moltype=None,
name=None, aligned=True, label_to_name=None, parser_kw={},
constructor_kw={}, **kw):
"""Initialize an alignment or collection of sequences.
Arguments:
- filename: name of the sequence file
- format: format of the sequence file
- data: optional explicit provision of sequences
- moltype: the MolType, eg DNA, PROTEIN
- aligned: set True if sequences are already aligned and have the same
length, results in an Alignment object. If False, a SequenceCollection
instance is returned instead. If callable, will use as a constructor
(e.g. can pass in DenseAlignment or CodonAlignment).
- label_to_name: function for converting original name into another
name. Default behavior is to preserve the original FASTA label and
comment.
To remove all FASTA label comments, and pass in only the label, pass in:
label_to_name=lambda x: x.split()[0]
To look up names in a dict, pass in:
label_to_name = lambda x: d.get(x, default_name)
...where d is a dict that's in scope, and default_name is what you want
to assign any sequence that isn't in the dict.
If format is None, will attempt to infer format from the filename
suffix. If label_to_name is None, will attempt to infer correct
conversion from the format.
"""
if filename is None:
assert data is not None
assert format is None
assert not kw, kw
else:
assert data is None, (filename, data)
data = list(FromFilenameParser(filename, format, **parser_kw))
# the following is a temp hack until we have the load API sorted out.
if aligned: #if callable, call it -- expect either f(data) or bool
if hasattr(aligned, '__call__'):
return aligned(data=data, MolType=moltype, Name=name,
label_to_name=label_to_name, **constructor_kw)
else: #was not callable, but wasn't False
return Alignment(data=data, MolType=moltype, Name=name,
label_to_name=label_to_name, **constructor_kw)
else: #generic case: return SequenceCollection
return SequenceCollection(data, MolType=moltype, Name=name,
label_to_name=label_to_name, **constructor_kw)
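# A minimal usage sketch for LoadSeqs ("seqs.fasta" is a hypothetical file);
# the label_to_name example comes from the docstring above:
#     aln = LoadSeqs("seqs.fasta", moltype=DNA)
#     coll = LoadSeqs("seqs.fasta", moltype=DNA, aligned=False)
#     aln = LoadSeqs("seqs.fasta", label_to_name=lambda x: x.split()[0])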
def LoadStructure(filename, format=None, parser_kw={}):
"""Initialize a Structure from data contained in filename.
Arguments:
- filename: name of the filename to create structure from.
- format: the optional file format extension.
- parser_kw: optional keyword arguments for the parser."""
# currently there is no support for string-input
assert filename is not None, 'No filename given.'
return FromFilenameStructureParser(filename, format, **parser_kw)
def LoadTable(filename=None, sep=',', reader=None, header=None, rows=None,
row_order=None, digits=4, space=4, title='', missing_data='',
max_width = 1e100, row_ids=False, legend='', column_templates=None,
dtype=None, static_column_types=False, limit=None, **kwargs):
"""
Arguments:
- filename: path to file containing a pickled table
- sep: the delimiting character between columns
- reader: a parser for reading filename. This approach assumes the first
row returned by the reader will be the header row.
- static_column_types: if True, and reader is None, identifies columns
with a numeric data type (int, float) from the first non-header row.
This assumes all subsequent entries in that column are of the same type.
Default is False.
- header: column headings
- rows: a 2D dict, list or tuple. If a dict, it must have column
headings as top level keys, and common row labels as keys in each
column.
    - row_order: the order in which rows will be pulled from the 2D dict
- digits: floating point resolution
- space: number of spaces between columns or a string
- title: as implied
- missing_data: character assigned if a row has no entry for a column
- max_width: maximum column width for printing
- row_ids: if True, the 0'th column is used as row identifiers and keys
for slicing.
- legend: table legend
- column_templates: dict of column headings: string format templates
or a function that will handle the formatting.
- dtype: optional numpy array typecode.
    - limit: exits after this many lines. Only applied for non-pickled data
      file types.
"""
#
if filename is not None and not (reader or static_column_types):
if filename[filename.rfind(".")+1:] == 'pickle':
f = file(filename, 'U')
loaded_table = cPickle.load(f)
f.close()
return _Table(**loaded_table)
sep = sep or kwargs.pop('delimiter', None)
header, rows, loaded_title, legend = load_delimited(filename,
delimiter = sep, limit=limit, **kwargs)
title = title or loaded_title
elif filename and (reader or static_column_types):
f = file(filename, "r")
if not reader:
reader = autogen_reader(f, sep, limit=limit,
with_title=kwargs.get('with_title', False))
rows = [row for row in reader(f)]
f.close()
header = rows.pop(0)
table = _Table(header=header, rows=rows, digits=digits, row_order=row_order,
title=title,
dtype=dtype, column_templates=column_templates, space=space,
missing_data=missing_data, max_width=max_width, row_ids=row_ids,
legend=legend)
return table
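# A minimal usage sketch for LoadTable ("stats.csv" is a hypothetical
# comma-delimited file whose first row holds the column headings):
#     table = LoadTable("stats.csv", sep=",", digits=2, title="stats")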
def LoadTree(filename=None, treestring=None, tip_names=None, format=None, \
underscore_unmunge=False):
"""Constructor for tree.
Arguments, use only one of:
- filename: a file containing a newick or xml formatted tree.
- treestring: a newick or xml formatted tree string.
- tip_names: a list of tip names.
Note: underscore_unmunging is turned off by default, although it is part
of the Newick format. Set underscore_unmunge to True to replace underscores
with spaces in all names read.
"""
if filename:
assert not (treestring or tip_names)
treestring = open(filename).read()
if format is None and filename.endswith('.xml'):
format = "xml"
if treestring:
assert not tip_names
if format is None and treestring.startswith('<'):
format = "xml"
if format == "xml":
parser = tree_xml_parse_string
else:
parser = newick_parse_string
tree_builder = TreeBuilder().createEdge
#FIXME: More general strategy for underscore_unmunge
if parser is newick_parse_string:
tree = parser(treestring, tree_builder, \
underscore_unmunge=underscore_unmunge)
else:
tree = parser(treestring, tree_builder)
if not tree.NameLoaded:
tree.Name = 'root'
elif tip_names:
tree_builder = TreeBuilder().createEdge
tips = [tree_builder([], tip_name, {}) for tip_name in tip_names]
tree = tree_builder(tips, 'root', {})
else:
raise TreeError, 'filename or treestring not specified'
return tree
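# A minimal usage sketch for LoadTree; per the docstring, use only one of
# filename, treestring or tip_names:
#     tree = LoadTree(treestring="((a,b),(c,d));")
#     star = LoadTree(tip_names=["a", "b", "c"])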
#!/usr/bin/env python
"""Code for performing alignments by Needleman-Wunsch and Smith-Waterman.
"""
__author__ = "Rob Knight"
__copyright__ = "Copyright 2007-2016, The Cogent Project"
__credits__ = ["Rob Knight", "Jeremy Widmann"]
__license__ = "GPL"
__version__ = "1.9"
__maintainer__ = "Rob Knight"
__email__ = "rob@spot.colorado.edu"
__status__ = "Development"
class ScoreCell(object):
"""Cell in a ScoreMatrix object. Contains score and pointer."""
__slots__ = ['Score', 'Pointer']
def __init__(self, Score=0, Pointer=None):
"""Returns new ScoreCell object.
Score should be a number.
Pointer should be a direction (typically None, "up", "diag", or "left")
"""
self.Score = Score
self.Pointer = Pointer
def update(self, diag, up, left):
"""Updates the ScoreCell object with scores for its three neighbors.
Finds max of (diag, up, and left), and sets the score and pointer
accordingly. In case of tie, picks up (i.e. tries to minimize
gaps since the first sequence is the shortest), then diag, then left.
"""
best = max(diag, up, left)
if best == up:
self.Score = up
self.Pointer = "up"
elif best == diag:
self.Score = diag
self.Pointer = "diag"
else:
self.Score = left
self.Pointer = "left"
def MatchScorer(match, mismatch):
"""Factory function that returns a score function set to match and mismatch.
match and mismatch should both be numbers. Typically, match should be
positive and mismatch should be negative.
Resulting function has signature f(x,y) -> number.
"""
def scorer(x, y):
if x == y:
return match
else:
return mismatch
return scorer
equality_scorer = MatchScorer(1, -1)
default_gap = -1
default_gap_symbol = '-'
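# A small usage sketch for MatchScorer: a +2/-3 scorer, analogous to
# equality_scorer above.
#     scorer = MatchScorer(2, -3)
#     scorer("A", "A")    # -> 2
#     scorer("A", "G")    # -> -3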
class ScoreMatrix(list):
"""Matrix that contains (score, pointer) pairs for sequence alignment."""
def __init__(self, First, Second, Score=equality_scorer, \
GapScore=default_gap, GapSymbol=default_gap_symbol):
"""Returns new ScoreMatrix object, initialized but not filled.
Calls internal methods to initialize first row, column and cell.
First and Second are the two sequences that will be aligned. Columns
correspond to positions in First; rows correspond to positions in
Second.
Score is a scoring function that takes two corresponding items in
First and Second, and returns a number that can be compared and added.
        GapScore is the score for inserting a gap.
Note that the matrix is one row and one column bigger than the lengths
of the sequences, since the first row and first column contain
initialization data.
"""
self.First = First #first sequence
self.Second = Second #second sequence
self.Cols = len(First) + 1 #need to add 1 for the start col
self.Rows = len(Second) + 1 #need to add 1 for the start row
self.Score = Score #scoring function: f(x,y) = num
self.GapScore = GapScore #gap penalty
self.GapSymbol = GapSymbol #symbol for gaps
self.Filled = False #whether the matrix has been filled
self.FirstAlign = [] #first sequence, aligned, as list
self.SecondAlign = [] #second sequence, aligned, as list
for r in range(self.Rows):
self.append([ScoreCell() for c in range(self.Cols)])
self._init_first_row()
self._init_first_col()
self._init_first_cell()
#print "INITIALIZED WITH:"
#print self
def __str__(self):
"""Prints score matrix, including labels for row and column."""
first = self.First
second = self.Second
if not first or not second:
return "Empty Score Matrix"
#add first sequence as first row: skip 2 positions for row label and
#for the first column, which is initialized to default values
rows = ['\t\t' + '\t'.join(self.First)]
for index, row in enumerate(self):
#if it's not the first row, add the appropriate char from the seq
if index:
curr = second[index-1]
else:
curr = ''
#add the row containing the char and the data
rows.append('%s\t'%curr + '\t'.join([str(i.Score) for i in row]))
return '\n'.join(rows)
def _init_first_row(self):
"""Hook for matrices that need to initialize the first row."""
pass
def _init_first_col(self):
"""Hook for matrices that need to initialize the first column."""
pass
def _init_first_cell(self):
"""Hook for matrices that need to initialize the first cell."""
pass
def alignment(self):
"""Returns alignment of first and second sequences.
Calls fill() and traceback() if necessary.
Converts the aligned versions to the same class as the originals.
"""
seq1, seq2 = self.First, self.Second
aln1, aln2 = self.FirstAlign, self.SecondAlign
if not aln1 or not aln2:
self.fill()
self.traceback()
aln1, aln2 = self.FirstAlign, self.SecondAlign
#Remember to return sequences that are the correct class. If the class
#subclasses str, you probably want ''.join rather than str to feed into
#the constructor, since str() on a list prints the brackets and commas.
if isinstance(seq1, str):
first_result = seq1.__class__(''.join(aln1))
else:
first_result = seq1.__class__(aln1)
if isinstance(seq2, str):
second_result = seq2.__class__(''.join(aln2))
else:
second_result = seq2.__class__(aln2)
return first_result, second_result
class NeedlemanWunschMatrix(ScoreMatrix):
"""Score matrix for global alignment using Needleman-Wunsch."""
def _init_first_row(self):
"""First row initialized with index * gap penalty."""
gap = self.GapScore
self[0] = [ScoreCell(gap * i, 'left') for i in range(self.Cols)]
#print "AFTER FIRST ROW:"
#print self
def _init_first_col(self):
"""First column initialized with index * gap penalty."""
gap = self.GapScore
for index, row in enumerate(self):
row[0] = ScoreCell(gap * index, 'up')
#print "AFTER FIRST COL:"
#print self
def _init_first_cell(self):
"""First cell initialized to 0 with no pointer."""
self[0][0] = ScoreCell(0)
#print "AFTER FIRST CELL:"
#print self
def fill(self):
"""Fills each cell with its best score and the direction of the next.
For moving up or left, calculates the gap penalty plus the score of the
cell. For moving diagonally, calculates the match/mismatch score for
the corresponding positions in the sequence plus the score of the cell.
Performs all calculations in place.
"""
score = self.Score
gap = self.GapScore
seq_1, seq_2 = self.First, self.Second
curr_row = self[0]
for row_i in range(1, self.Rows):
prev_row = curr_row
curr_row = self[row_i]
curr_cell = curr_row[0]
for col_i in range(1, self.Cols):
prev_cell = curr_cell
curr_cell = curr_row[col_i]
#remember to subtract 1 to find corresponding pos in seq
diag_score = score(seq_1[col_i-1], seq_2[row_i-1]) + \
prev_row[col_i-1].Score
up_score = gap + prev_row[col_i].Score
left_score = gap + prev_cell.Score
#print 'row: %s col: %s '%(row_i,col_i), \
# diag_score,up_score,left_score
curr_cell.update(diag_score, up_score, left_score)
self.Filled = True
self.MaxScore = (self[-1][-1].Score, len(self)-1, len(self[0])-1)
#print "FINISHED FILL"
#print self
def traceback(self):
"""Returns the optimal alignment as a (first, second) tuple w/ gaps.
Follows the pointers assigned in fill(), inserting gaps whenever the
movement is not diagonal.
"""
if not self.Filled:
self.fill()
align_1 = []
align_2 = []
seq_1 = self.First
seq_2 = self.Second
curr_row = self.Rows - 1 #subtract 1 to use length as index
curr_col = self.Cols - 1 #subtract 1 to use length as index
p = self[curr_row][curr_col].Pointer
while 1:
#print "ROW: %s, COL: %s" % (curr_row, curr_col),p
if p == 'diag':
align_1.append(seq_1[curr_col-1])
align_2.append(seq_2[curr_row-1])
curr_row -= 1
curr_col -= 1
elif p == 'left':
align_1.append(seq_1[curr_col-1])
align_2.append('-')
curr_col -= 1
elif p == 'up':
align_1.append('-')
align_2.append(seq_2[curr_row-1])
curr_row -= 1
else:
break
p = self[curr_row][curr_col].Pointer
align_1.reverse()
align_2.reverse()
self.FirstAlign, self.SecondAlign = align_1, align_2
class SmithWatermanMatrix(ScoreMatrix):
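    """Score matrix for local alignment using Smith-Waterman."""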
def fill(self):
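        """Fills each cell with its best score and pointer, leaving cells at
        their initial value of 0 when no candidate score is positive."""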
max_row, max_col, max_score = 0, 0, 0
score = self.Score
gap = self.GapScore
seq_1, seq_2 = self.First, self.Second
curr_row = self[0]
for row_i in range(1, self.Rows):
prev_row = curr_row
curr_row = self[row_i]
curr_cell = curr_row[0]
for col_i in range(1, self.Cols):
prev_cell = curr_cell
curr_cell = curr_row[col_i]
#remember to subtract 1 to find corresponding pos in seq
diag_score = score(seq_1[col_i-1], seq_2[row_i-1]) + \
prev_row[col_i-1].Score
up_score = gap + prev_row[col_i].Score
left_score = gap + prev_cell.Score
if max(up_score, left_score, diag_score) <= 0:
continue #leave uninitialized if scores all below threshold
curr_cell.update(diag_score, up_score, left_score)
if curr_cell.Score > max_score:
max_score = curr_cell.Score
max_row = row_i
max_col = col_i
self.MaxScore = (max_score, max_row, max_col)
#print "FINISHED FILL"
#print self
def traceback(self):
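        """Follows pointers back from the highest-scoring cell, adding gaps."""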
align_1 = []
align_2 = []
seq_1 = self.First
seq_2 = self.Second
max_score, curr_row, curr_col = self.MaxScore
p = self[curr_row][curr_col].Pointer
while 1:
#print "ROW: %s, COL: %s" % (curr_row, curr_col)
if p == 'diag':
align_1.append(seq_1[curr_col-1])
align_2.append(seq_2[curr_row-1])
curr_row -= 1
curr_col -= 1
elif p == 'left':
align_1.append(seq_1[curr_col-1])
align_2.append('-')
curr_col -= 1
elif p == 'up':
align_1.append('-')
align_2.append(seq_2[curr_row-1])
curr_row -= 1
else:
break
p = self[curr_row][curr_col].Pointer
align_1.reverse()
align_2.reverse()
self.FirstAlign, self.SecondAlign = align_1, align_2
def nw_align(seq1, seq2, scorer=equality_scorer, gap=default_gap, return_score=False):
"""Returns globally optimal alignment of seq1 and seq2."""
N = NeedlemanWunschMatrix(seq1, seq2, scorer, gap)
if return_score:
return N.alignment(), N.MaxScore[0]
else:
return N.alignment()
def sw_align(seq1, seq2, scorer=equality_scorer, gap=default_gap, return_score=False):
"""Returns locally optimal alignment of seq1 and seq2."""
S = SmithWatermanMatrix(seq1, seq2, scorer, gap)
if return_score:
return S.alignment(), S.MaxScore[0]
else:
return S.alignment()
def demo(seq1, seq2):
result = []
result.append("Global alignment:")
result.extend(nw_align(seq1, seq2))
result.append("Local alignment:")
result.extend(sw_align(seq1, seq2))
return "\n".join(result)
#initialization
if __name__ == '__main__':
from sys import argv
print demo(argv[1], argv[2])
#!/usr/bin/env python
"""
Implements sequence weighting using several standard algorithms.
"""
__all__ = ['util','methods']
__author__ = "Sandra Smit"
__copyright__ = "Copyright 2007-2016, The Cogent Project"
__credits__ = ["Sandra Smit", "Gavin Huttley", "Rob Knight"]
__license__ = "GPL"
__version__ = "1.9"
__maintainer__ = "Sandra Smit"
__email__ = "sandra.smit@colorado.edu"
__status__ = "Production"
#!/usr/bin/env python
"""Provides an implementation of several sequence weighting algorithm"""
from __future__ import division
from random import choice
from numpy.random.mtrand import exponential
from numpy import array, float64 as Float64, dot as matrixmultiply, transpose,\
ones, zeros
from numpy.linalg import inv as inverse
from cogent.core.profile import Profile
from cogent.core.alignment import Alignment, DenseAlignment
from cogent.parse.tree import DndParser
from cogent.util.array import hamming_distance
from cogent.align.weights.util import Weights, number_of_pseudo_seqs,\
pseudo_seqs_exact, pseudo_seqs_monte_carlo, row_to_vote, distance_matrix,\
eigenvector_for_largest_eigenvalue, DNA_ORDER,RNA_ORDER,PROTEIN_ORDER,\
SeqToProfile
from cogent.core.moltype import BYTES
__author__ = "Sandra Smit"
__copyright__ = "Copyright 2007-2016, The Cogent Project"
__credits__ = ["Sandra Smit", "Rob Knight", "Gavin Huttley"]
__license__ = "GPL"
__version__ = "1.9"
__maintainer__ = "Sandra Smit"
__email__ = "sandra.smit@colorado.edu"
__status__ = "Development"
def VA(alignment, distance_method=hamming_distance):
"""Returns Weight object with seq weights according to the VA method.
alignment: Alignment object
The VA method (Vingron & Argos 1989) calculates the Hamming distance
between all sequences in the alignment. The weight assigned to a sequence
is the sum of the distances of all other sequences in the alignment to that
sequence, divided by the sum of all pairwise distances.
Example:
ABBA ABCA CBCB
ABBA 0 1 3
ABCA 1 0 2
    CBCB 3 2 0
----------------------------
total 4 3 5 (=12)
normal. 0.333 0.25 0.417
so: weight(ABBA) = 0.333, weight(ABCA)=0.25, etc.
"""
distances = distance_matrix(alignment, distance_method)
sum_dist = sum(distances)
#total weights are the normalized sum of distances (sum over each column,
    # divided by the total distance in the matrix)
weights = sum_dist/sum(sum_dist)
#create a dictionary of {seq_id: weight}
weight_dict = Weights(dict(zip(alignment.Names,weights)))
return weight_dict
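# A hedged usage sketch of VA (aln is any Alignment object, e.g. those built
# in the __main__ block at the bottom of this file):
#     weights = VA(aln)    # Weights dict of {seq_id: weight}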
def VOR(alignment,n=1000,force_monte_carlo=False,mc_threshold=1000):
"""Returns sequence weights according to the Voronoi weighting method.
alignment: Alignment object
n: sampling size (in case monte carlo is used)
force_monte_carlo: generate pseudo seqs with monte carlo always (even
    if there's only a small number of possible unique pseudo seqs)
    mc_threshold: threshold for when to use the monte carlo sampling method;
    if the number of possible pseudo seqs exceeds this threshold, monte
    carlo is used.
VOR differs from VA in the set of sequences against which it's comparing
all the sequences in the alignment. In addition to the sequences in the
alignment itself, it uses a set of pseudo sequences.
Generating discrete random sequences:
A discrete random sequence is generated by choosing with equal
likelihood at each position one of the residues observed at that position
    in the alignment. A single occurrence in the alignment column is
sufficient to make the residue type an option. Note: you're choosing
with equal likelihood from each of the observed residues (independent
of their frequency at that position). In earlier versions of the algorithm
the characters were chosen either at the frequency with which they occur
at a position or at the frequency with which they occur in the database.
    Both trials were unsuccessful because they deviate from random sampling
(see Sibbald & Argos 1990).
Depending on the number of possible pseudo sequences, all of them are
used or a random sample is taken (monte carlo).
Example:
Alignment: AA, AA, BB
AA AA BB
AA 0 (.5) 0 (.5) 2
AB 1 (1/3) 1 (1/3) 1 (1/3)
BA 1 (1/3) 1 (1/3) 1 (1/3)
BB 2 2 0 (1)
-----------------------------
total 7/6 7/6 10/6
norm .291 .291 .418
    For a bigger example with more pseudo sequences, see Henikoff 1994.
    I tried the described optimization (pre-calculating the distance to the
    closest sequence). It doesn't have an advantage over the original method.
"""
MC_THRESHOLD = mc_threshold
#decide on sampling method
if force_monte_carlo or number_of_pseudo_seqs(alignment) > MC_THRESHOLD:
sampling_method = pseudo_seqs_monte_carlo
else:
sampling_method = pseudo_seqs_exact
#change sequences into arrays
aln_array = DenseAlignment(alignment, MolType=BYTES)
weights = zeros(len(aln_array.Names),Float64)
#calc distances for each pseudo seq
rows = [array(seq, 'c') for seq in map(str, aln_array.Seqs)]
for seq in sampling_method(aln_array,n=n):
seq = array(seq, 'c')
temp = [hamming_distance(row, seq) for row in rows]
votes = row_to_vote(array(temp)) #change distances to votes
weights += votes #add to previous weights
weight_dict = Weights(dict(zip(aln_array.Names,weights)))
weight_dict.normalize() #normalize
return weight_dict
def mVOR(alignment,n=1000,order=DNA_ORDER):
"""Returns sequence weights according to the modified Voronoi method.
alignment: Alignment object
n: sample size (=number of random profiles to be generated)
order: specifies the order of the characters found in the alignment,
used to build the sequence and random profiles.
mVOR is a modification of the VOR method. Instead of generating discrete
random sequences, it generates random profiles, to sample more equally from
    the sequence space and to prevent random sequences from being equidistant to
multiple sequences in the alignment.
See the Implementation notes to see how the random profiles are generated
and compared to the 'sequence profiles' from the alignment.
Random generalized sequences (or a profile filled with random numbers):
Sequences that are equidistant to multiple sequences in the alignment
can form a problem in small datasets. For longer sequences the likelihood
    of this event is negligible. Generating 'random generalized sequences' is
a solution, because we're then sampling from continuous sequence space.
Each column of a random profile is generated by normalizing a set of
independent, exponentially distributed random numbers. In other words, a
random profile is a two-dimensional array (rows are chars in the alphabet,
    columns are positions in the alignment) filled with random numbers,
sampled from the standard exponential distribution (lambda=1, and thus
the mean=1), where each column is normalized to one. These random profiles
are compared to the special profiles of just one sequence (ones for the
single character observed at that position). The distance between the
two profiles is simply the Euclidean distance.
"""
weights = zeros(len(alignment.Names),Float64)
#get seq profiles
seq_profiles = {}
for k,v in alignment.items():
#seq_profiles[k] = ProfileFromSeq(v,order=order)
seq_profiles[k] = SeqToProfile(v,alphabet=order)
for count in range(n):
#generate a random profile
exp = exponential(1,[alignment.SeqLen,len(order)])
r = Profile(Data=exp,Alphabet=order)
r.normalizePositions()
#append the distance between the random profile and the sequence
#profile to temp
temp = [seq_profiles[key].distance(r) for key in alignment.Names]
votes = row_to_vote(array(temp))
weights += votes
weight_dict = Weights(dict(zip(alignment.Names,weights)))
weight_dict.normalize()
return weight_dict
def pos_char_weights(alignment, order=DNA_ORDER):
"""Returns the contribution of each character at each position.
    alignment: Alignment object
order: the order of characters in the profile (all observed chars
    in the alignment)
This function is used by the function position_based
For example:
GYVGS
GFDGF
GYDGF
GYQGG
0 1 2 3 4 5
G 1/1*4 1/1*4 1/3*1
Y 1/2*3
F 1/2*1 1/3*2
V 1/3*1
D 1/3*2
Q 1/3*1
S 1/3*1
"""
counts = alignment.columnFreqs()
a = zeros([len(order), alignment.SeqLen],Float64)
for col, c in enumerate(counts):
for char in c:
a[order.index(char),col] = 1/(len(c)*c[char])
return Profile(a,Alphabet=order)
def PB(alignment, order=DNA_ORDER):
"""Returns sequence weights based on the diversity at each position.
The position-based (PB) sequence weighting method is described in Henikoff
1994. The idea is that sequences are weighted by the diversity observed
at each position in the alignment rather than on the diversity measured
for whole sequences.
A simple method to represent the diversity at a position is to award
each different residue an equal share of the weight, and then to divide
up that weight equally among the sequences sharing the same residue.
So if in a position of a MSA, r different residues are represented,
a residue represented in only one sequence contributes a score of 1/r to
that sequence, whereas a residue represented in s sequences contributes
a score of 1/rs to each of the s sequences. For each sequence, the
    contributions from each position are summed to give a sequence's weight.
See Henikoff 1994 for a good example.
"""
#calculate the contribution of each character at each position
pos_weights = pos_char_weights(alignment, order)
d = pos_weights.Data
result = Weights()
for key,seq in alignment.items():
weight = 0
for idx, char in enumerate(seq):
weight += d[order.index(char),idx]
result[key] = weight
result.normalize()
return result
def SS(alignment):
"""Returns dict of {seq_id: weight} for sequences in the alignment
alignment: Alignment object
The SS sequence weighting method is named after Sander and Schneider,
who published their method in 1991.
Their method starts with the same distance matrix as in the VA method,
where distances are the pairwise Hamming distances between the sequences
in the alignment. Where the VA method uses the normalized total weights
for each sequence, the SS method continues with calculating a
self-consistent set of weights.
They do this by finding the eigenvector of the distance matrix belonging
to the largest eigenvalue for the matrix. This special eigenvector can
be found by a numerical method.
"""
distances = distance_matrix(alignment)
v = eigenvector_for_largest_eigenvalue(distances)
return Weights(dict(zip(alignment.Names,v)))
def ACL(tree):
"""Returns a normalized dictionary of sequence weights {seq_id: weight}
tree: a PhyloNode object
The ACL method is named after Altschul, Carroll and Lipman, who published a
paper on sequence weighting in 1989.
The ACL method is based on an idea of Felsenstein (1973). Imagine
electrical current flows from the root of the tree down the edges and
out the leaves. If the edge lengths are proportional to their electrical
    resistances, the current flowing out of each leaf equals the leaf weight.
The first step in the calculation of the weight vector is calculating a
variance-covariance matrix. The variance of a leaf is proportional to the
distance from the root to that leaf. The covariance of two leaves is
proportional to the distance from the root to the last common ancestor
of the two leaves.
The second step in the calculation results in a vector of weights. Suppose
there are n leaves on the tree. Let i be the vector of size n, all of whose
elements are 1.0. The weight vector is calculated as:
w = (inverse(M)*i)/(transpose(i)*inverse(M)*i)
See Altschul 1989
"""
#clip branch lengths to avoid error due to negative or zero branch lengths
_clip_branch_lengths(tree)
#get a list of sequence IDs (in the order that the tree will be traversed)
seqs = []
for n in tree.tips():
seqs.append(n.Name)
#initialize the variance-covariance matrix
m = zeros([len(seqs),len(seqs)],Float64)
#calculate (co)variances
#variance of a node is defined as the distance from the root to the leaf
#covariance of two nodes is defined as the distance from the root to the
#last common ancestor of the two leaves.
for x in tree.tips():
for y in tree.tips():
idx_x = seqs.index(x.Name)
idx_y = seqs.index(y.Name)
if idx_x == idx_y:
m[idx_x,idx_y] = x.distance(tree)
else:
lca = x.lastCommonAncestor(y)
dist_lca_root = lca.distance(tree)
m[idx_x,idx_y] = dist_lca_root
m[idx_y,idx_x] = dist_lca_root
#get the inverse of the variance-covariance matrix
inv = inverse(m)
    #build vector i (vector of ones, length = # of leaves in the tree)
i = ones(len(seqs),Float64)
numerator = matrixmultiply(inv, i)
denominator = matrixmultiply(matrixmultiply(transpose(i),inv),i)
weight_vector = numerator/denominator
#return a Weights object (is dict {seq_id: weight})
return Weights(dict(zip(seqs,weight_vector)))
def _clip_branch_lengths(tree, min_val=1e-9, max_val=1e9):
"""Clips branch lengths in tree to a minimum or maximum value
tree: TreeNode object
min: minimum value that a branch length should have
max: maximum value that a branch length should have
Note: tree is changed in place!!!
"""
for i in tree.traverse():
bl = i.Length
if bl > max_val:
i.Length = max_val
elif bl < min_val:
i.Length = min_val
def _set_branch_sum(tree):
"""Sets the branch sum to each node
tree: TreeNode object
The branch sum of a node is the total branch length of all the nodes
under that node.
WARNING: changes the tree in place!!!
"""
total = 0
for child in tree:
_set_branch_sum(child)
total += child.BranchSum
total += child.Length
tree.BranchSum = total
def _set_node_weight(tree):
"""Sets the node weight to nodes according to the GSC method.
tree: TreeNode object
See documentation in GSC on how node weights are calculated.
WARNING: changes the tree in place!!!
"""
parent = tree.Parent
if parent is None: #root of tree always has weight of 1.0
tree.NodeWeight = 1.0
else:
tree.NodeWeight = parent.NodeWeight * \
(tree.Length + tree.BranchSum)/parent.BranchSum
for child in tree:
_set_node_weight(child)
def GSC(tree):
"""Returns dict of {seq_id: weight} for each leaf in the tree
tree: PhyloNode object
The GSC method is named after Gerstein, Sonnhammer, and Chothia,
who published their method in 1994.
The idea is that the sequences are weighted by their total branch length
in the tree. In the original algorithm weights are calculated from leaves
to root, we calculate them from the root to the leaves.
    First, the branch lengths have to be clipped to a minimum and maximum
    value to avoid errors due to negative or zero branch lengths, and to
    make sure the method behaves for two identical sequences. Next the
    branch sum for each node is pre-calculated (the branch sum is the
    total branch length in the tree below that node). From the (clipped)
    branch lengths and branch sums, we can now calculate how the weight has
    to be divided.
Example:
tree = (((A:20,B:10)x:30,C:40)r:0);
weight of the root (r) is always 1
to the left node (x) goes ((30+30)/100)*1 = 0.6
    to the right node (C) goes (40/100)*1 = 0.4
to node A goes (20/30)*0.6 = 0.4
to node B goes (10/30)*0.6 = 0.2
"""
_clip_branch_lengths(tree)
_set_branch_sum(tree)
_set_node_weight(tree)
weights = Weights()
for n in tree.tips():
weights[n.Name] = n.NodeWeight
return weights
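# A hedged usage sketch of GSC, re-using the docstring example tree (assumes
# cogent.LoadTree is importable here):
#     t = LoadTree(treestring="(((A:20,B:10)x:30,C:40)r:0);")
#     GSC(t)    # -> {'A': 0.4, 'B': 0.2, 'C': 0.4}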
#======================================================
if __name__ == "__main__":
    from cogent.core.alignment import Alignment
#This main block shows the results for four different alignments
a = Alignment({'seq1':'GYVGS','seq2':'GFDGF','seq3':'GYDGF',\
'seq4':'GYQGG'},Names=['seq1','seq2','seq3','seq4'])
b = Alignment({'seq1':'AA', 'seq2':'AA', 'seq3':'BB'},\
Names=['seq1','seq2','seq3'])
c = Alignment({'seq1':'AA', 'seq2':'AA', 'seq3':'BB', 'seq4':'BB',\
'seq5':'CC'},Names=['seq1','seq2','seq3','seq4','seq5'])
d = Alignment({'seq1':'AGCTA', 'seq2':'AGGTA', 'seq3':'ACCTG',
'seq4':'TGCAA'},Names=['seq1','seq2','seq3','seq4'])
print "This is a quick test run of the alignment-based methods"
for al in [a,b,c,d]:
#determine the needed character order for each alignment
if al is b or al is c:
char_order = "ABC"
elif al is a: #need protein_order
char_order=PROTEIN_ORDER
else: #need DNA_order
char_order=DNA_ORDER
print "===========ALIGNMENT=============="
for k in al.Names:
print '\t'.join([k,al[k]])
print 'RESULT OF THE VA METHOD:'
print VA(al)
print 'RESULT OF THE VOR METHOD:'
print VOR(al)
print 'RESULT OF THE mVOR METHOD:'
print mVOR(al, n=10000, order=char_order)
print 'RESULTS OF THE SS METHODS:'
print SS(al)
print 'RESULTS OF THE PB METHODS:'
print PB(al, order=char_order)
#!/usr/bin/env python
"""Provides utility methods for sequence weighting
"""
from __future__ import division
from numpy import array, zeros, dot as matrixmultiply, ones, identity, take,\
asarray, uint8 as UInt8, add, sqrt
from random import choice
from cogent.util.array import hamming_distance
from cogent.core.profile import Profile, CharMeaningProfile
from cogent.core.moltype import DNA, RNA, PROTEIN
from cogent.core.alignment import Alignment
__author__ = "Sandra Smit"
__copyright__ = "Copyright 2007-2016, The Cogent Project"
__credits__ = ["Sandra Smit", "Rob Knight", "Gavin Huttley", "Daniel McDonald"]
__license__ = "GPL"
__version__ = "1.9"
__maintainer__ = "Sandra Smit"
__email__ = "sandra.smit@colorado.edu"
__status__ = "Development"
PROTEIN_ORDER = ''.join(PROTEIN.Alphabet) + '-'
DNA_ORDER = ''.join(DNA.Alphabet) + '-'
RNA_ORDER = ''.join(RNA.Alphabet) + '-'
class Weights(dict):
"""Dictionary seq_ids: seq weight
"""
def normalize(self):
"""Normalizes to one. Works in place!"""
total = sum(self.values(), 0)
for k, v in self.items():
self[k] = v/total
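# A small usage sketch for Weights.normalize (works in place; the values are
# hypothetical):
#     w = Weights({'a': 2.0, 'b': 6.0})
#     w.normalize()    # w is now {'a': 0.25, 'b': 0.75}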
def number_of_pseudo_seqs(alignment):
"""Returns the # of unique randomized sequences that can be generated
from the alignment.
alignment: Alignment object
A random sequence is created by randomly choosing at each position one
    of the residues observed at that position (column) in the alignment.
A single occurrence of that residue type is sufficient to make it an
option and the choice is with equal likelihood from any of the observed
characters. (See Implementation Notes for more details).
"""
return reduce(lambda x, y: x*y, map(len,alignment.columnProbs()))
def pseudo_seqs_exact(alignment,n=None):
"""Returns all possible pseudo sequences (generated from the alignment)
alignment: Alignment object
n: has NO FUNCTION, except to make the API for all functions that generate
pseudo sequences the same.
This function is used by the VOR method (only when the number of unique
pseudo sequences is lower than 1000).
    The original sequences will also appear in this list of pseudo sequences.
Duplications of sequences in the alignment don't cause duplications
in the list of pseudo sequences.
Example:
AA, AA, BB in the alignment
generates pseudo sequences: AA,BB,AB,BA
ABC, BCC, BAC in the alignment
generates pseudo sequences: AAC, ABC, ACC, BAC, BBC, and BCC
"""
counts = alignment.columnFreqs()
results = [[]]
for col in counts:
new_results = []
for item in results:
for option in col:
new_results.append(item + [option])
results = new_results
return [''.join(i) for i in results]
def pseudo_seqs_monte_carlo(alignment,n=1000):
"""Yields sample of possible pseudo sequences (generated from alignment)
alignment: Alignment object
n = number of pseudo sequences generated (=sample size)
This function is used by the VOR method, usually when the number of
possible pseudo sequences exceeds 1000.
To see how the pseudo sequences are generated read the Implementation
Notes at the top of this file.
"""
freqs = alignment.columnFreqs()
for x in range(n):
seq = []
for i in freqs:
seq.append(choice(i.keys()))
yield ''.join(seq)
def row_to_vote(row):
"""Changes distances to votes.
There's one vote in total. The sequence with the lowest distance
gets the vote. If there are multiple sequences equidistant at the
minimum distance, the vote is split over all the sequences at the
minimum distance.
Example:
[4,2,3,7] -> [0,1,0,0]
[1,3,2,1] -> [.5,0,0,.5]
[1,1,1,1] -> [.25,.25,.25,.25]
"""
result = array(row) - min(row) == 0
return result/sum(result, 0)
def distance_matrix(alignment, distance_method=hamming_distance):
"""Returns distance matrix for seqs in the alignment.
    Order is either the Names in the alignment or the order in which the
    rows are iterated over.
    Distance is, by default, the Hamming distance between two sequences.
"""
#change sequences into arrays
alignment = [array(str(alignment.NamedSeqs[k]), 'c') for k in\
alignment.Names]
nr_of_seqs = len(alignment)
distances = zeros([nr_of_seqs,nr_of_seqs])
for i, seq_a in enumerate(alignment):
for j, seq_b in enumerate(alignment):
if i < j:
dist = distance_method(seq_a,seq_b)
distances[i,j] = dist
distances[j,i] = dist
else: # i==j (diagonal) or i>j (lower left half)
continue
return distances
def eigenvector_for_largest_eigenvalue(matrix):
"""Returns eigenvector corresponding to largest eigenvalue of matrix.
Implements a numerical method for finding an eigenvector by repeated
application of the matrix to a starting vector. For a matrix A the
process w(k) <-- A*w(k-1) converges to eigenvector w with the largest
eigenvalue. Because distance matrix D has all entries >= 0, a theorem
due to Perron and Frobenius on nonnegative matrices guarantees good
behavior of this method, excepting degenerate cases where the
iteration oscillates. For distance matrices a remedy is to add the
identity matrix to A, permitting the iteration to converge to the
eigenvector. (From Sander and Schneider (1991), and Vingron and
Sibbald (1993))
Note: Only works on square matrices.
"""
#always add the identity matrix to avoid oscillating behavior
matrix = matrix + identity(len(matrix))
#v is a random vector (chosen as the normalized vector of ones)
v = ones(len(matrix))/len(matrix)
#iterate until convergence
for i in range(1000):
new_v = matrixmultiply(matrix,v)
new_v = new_v/sum(new_v, 0) #normalize
if sum(map(abs,new_v-v), 0) > 1e-9:
v = new_v #not converged yet
continue
else: #converged
break
return new_v
def distance_to_closest(alignment, distance_method=hamming_distance):
"""Returns vector of distances to closest neighbor for each s in alignment
alignment: Alignment object
distance_method: function used to calculate the distance between two seqs
Function returns the closest distances according to the Names in the
alignment
example:
Alignment({1:'ABCD',2:'ABCC',3:'CBDD',4:'ACAA'},Names=[3,2,1,4])
[2,1,1,3]
"""
names = alignment.Names
seqs = [array(str(alignment.NamedSeqs[n]), 'c') for n in names]
closest = []
for i, key in enumerate(names):
seq = seqs[i]
dist = None
for j, other_key in enumerate(names):
if key == other_key:
continue
d = distance_method(seq,seqs[j])
            if dist is not None:
                if d < dist:
                    dist = d
            else:
                dist = d
closest.append(dist)
return array(closest)
def SeqToProfile(seq, alphabet=None, char_order=None,\
split_degenerates=False):
"""Generates a Profile object from a Sequence object.
seq: Sequence object
    alphabet (optional): Alphabet object (if you want to split
    degenerate symbols, the alphabet object should have a
    Degenerates property). Default is the Alphabet associated with
    the Sequence object.
char_order (optional): The order the characters occur in the Profile.
Default is the list(alphabet)
split_degenerates (optional): Whether you want the counts for the
degenerate symbols to be divided over the non-degenerate symbols they
code for.
A Profile is a position x character matrix describing which characters
occur at each position. In a sequence (as opposed to an alignment) only
one character occurs at each position. In general a sequence profile
will only contain ones and zeros. However, you have the possibility of
splitting degenerate characters. For example, if a position is R, it
    means that you have a 50/50 chance of A and G. It's also possible to
ignore characters, which in a sequence profile will lead to positions
(rows) containing only zeros.
Example:
Sequence = ACGU
Profile(seq, CharOrder=UCAG):
U C A G
0 0 1 0 first pos
0 1 0 0 second pos
0 0 0 1 third pos
1 0 0 0 fourth pos
Sequence= GURY
Profile(seq,CharOrder=UCAG, split_degenerates=True)
U C A G
0 0 0 1 first pos
1 0 0 0 second pos
0 0 .5 .5 third pos
.5 .5 0 0 fourth pos
Characters can also be ignored
Sequence = ACN-
    Profile(seq, CharOrder=UCAG, split_degenerates=True)
U C A G
0 0 1 0 first pos
0 1 0 0 second pos
.25 .25 .25 .25 third pos
0 0 0 0 fourth pos <--contains only zeros
"""
if alphabet is None:
alphabet = seq.MolType
if char_order is None:
char_order = list(alphabet)
#Determine the meaning of each character based on the alphabet, the
#character order, and the option to split degenerates
char_meaning = CharMeaningProfile(alphabet, char_order,\
split_degenerates)
#construct profile data
idxs = array(str(seq).upper(), 'c').view(UInt8)
result_data = char_meaning.Data[idxs]
#result_data = take(char_meaning.Data, asarray(str(seq).upper(), UInt8), axis=0)
return Profile(result_data, alphabet, char_order)
def AlnToProfile(aln, alphabet=None, char_order=None, split_degenerates=False,\
weights=None):
"""Generates a Profile object from an Alignment.
aln: Alignment object
alphabet (optional): an Alphabet object (or list of chars, but if you
want to split degenerate symbols, the alphabet must have a
    Degenerates property). Default is the alphabet of the first seq in
the alignment.
char_order (optional): order of the characters in the profile. Default
is list(alphabet)
split_degenerates (optional): Whether you want the counts for the
degenerate symbols to be divided over the non-degenerate symbols they
code for.
weights (optional): dictionary of seq_id: weight. If not entered all seqs
are weighted equally
A Profile is a position x character matrix describing which characters
occur at each position of an alignment. The Profile is always normalized,
so it gives the probabilities of each character at each position.
Ignoring chars: you can ignore characters in the alignment by not putting
the char in the CharOrder. If you ignore all characters at a particular
position, an error will be raised, because the profile can't be normalized.
Splitting degenerates: you can split degenerate characters over the
non-degenerate characters they code for. For example: R = A or G. So,
an R at a position counts for 0.5 A and 0.5 G.
Example:
seq1 TCAG weight: 0.5
seq2 TAR- weight: 0.25
seq3 YAG- weight: 0.25
Profile(aln,alphabet=DNA,char_order="TACG",weights=w,
split_degenerates=True)
Profile:
T A C G
[[ 0.875 0. 0.125 0. ]
[ 0. 0.5 0.5 0. ]
[ 0. 0.625 0. 0.375]
[ 0. 0. 0. 1. ]]
"""
if alphabet is None:
alphabet = aln.values()[0].MolType
if char_order is None:
char_order = list(alphabet)
if weights is None:
weights = dict.fromkeys(aln.keys(),1/len(aln))
char_meaning = CharMeaningProfile(alphabet, char_order,\
split_degenerates)
profiles = []
for k,v in aln.items():
idxs = array(str(v).upper(), 'c').view(UInt8)
profiles.append(char_meaning.Data[idxs] * weights[k])
s = reduce(add,profiles)
result = Profile(s,alphabet, char_order)
try:
result.normalizePositions()
except Exception, e:
raise ValueError,e
#"Probably one of the rows in your profile adds up to zero,\n "+\
#"because you are ignoring all of the characters in the "+\
#"corresponding\n column in the alignment"
return result
#!/usr/bin/env python
"""apps: provides support libraries for controlling applications (local or web).
"""
__all__ = ['blast',
'carnac',
'cd_hit',
'clustalw',
'cmfinder',
'comrna',
'consan',
'contrafold',
'cove',
'dialign',
'dynalign',
'fasttree',
           'fasttree_v1',
'foldalign',
'gctmpca',
'ilm',
'knetfold',
'mfold',
'muscle',
'msms',
'nupack',
'parameters',
'pfold',
'pknotsrg',
'raxml',
'rdp_classifier',
'rnaalifold',
'rnaforester',
'rnashapes',
'rnaview',
'sfold',
'unafold',
'util',
'vienna_package']
__author__ = ""
__copyright__ = "Copyright 2007-2016, The Cogent Project"
__credits__ = ["Jeremy Widmann", "Catherine Lozuopone", "Gavin Huttley",
"Rob Knight", "Zongzhi Liu", "Sandra Smit", "Micah Hamady",
"Greg Caporaso", "Mike Robeson", "Daniel McDonald",
"Marcin Cieslik"]
__license__ = "GPL"
__version__ = "1.9"
__maintainer__ = "Rob Knight"
__email__ = "rob@spot.colorado.edu"
__status__ = "Production"
"""Need to add:
tcoffee alignment
meme motif finding
phylip phylogeny
mrbayes phylogeny
paml phylogeny
other packages? web tools?
"""
#!/usr/bin/env python
from cogent.app.util import CommandLineApplication,\
CommandLineAppResult, ResultPath
from cogent.app.parameters import Parameter,FlagParameter,Parameters
__author__ = "Shandy Wikman"
__copyright__ = "Copyright 2007-2016, The Cogent Project"
__contributors__ = ["Shandy Wikman"]
__license__ = "GPL"
__version__ = "1.9"
__maintainer__ = "Shandy Wikman"
__email__ = "ens01svn@cs.umu.se"
__status__ = "Development"
class Carnac(CommandLineApplication):
"""Application controller for Carnac RNAfolding application
Info at:
http://bioinfo.lifl.fr/carnac/index.html
Options:
-a Inhibit the energy correction that is automatically
performed to create the initial set of potential
                stems. By default, the energy correction depends on the
GC percentage of each sequence.
-c Eliminate sequences that are too similar. The similarity
                threshold is 98%.
-h Add hairpins that are present only in one sequence to
the initial set of potential stems (may be time and space
demanding).
"""
    # Limitation: if -c is turned on and a file is eliminated, an error occurs
    # in the file handling in _get_result_paths
_parameters = {
'-c':FlagParameter(Prefix='-',Name='c',Value=False),
'-a':FlagParameter(Prefix='-',Name='a'),
'-h':FlagParameter(Prefix='-',Name='h')}
_command = 'carnac'
_input_handler='_input_as_string'
def _get_result_paths(self,data):
"""Specifies the paths of output files generated by the application
data: the data the instance of the application is called on
        Carnac writes its .ct and .eq output files to the location of the
        input file, and its .out files to the folder the program is run from.
        Graph and alignment files are also created.
You always get back: StdOut,StdErr, and ExitStatus
"""
result={}
name_counter = 0
ones='00'
tens='0'
count=''
if not isinstance(data,list):
#means data is file
path=str(data)
data=open(data).readlines()
else: #data input as lines
#path=''.join([self.WorkingDir,self._input_filename.split('/')[-1]])
path = ''.join(['/tmp/', self._input_filename.split('/')[-1]])
for item in data:
if item.startswith('>'):
name_counter += 1
if name_counter < 10:
count=ones
if name_counter > 9:
count=tens
if name_counter > 99:
count=''
name = item.strip('>\n')
else:
nr=name_counter
result['ct%d' % nr] =\
ResultPath(Path=('%s%s%d.ct' % (path,count,nr)))
result['eq%d' % nr] =\
ResultPath(Path=('%s%s%d.eq' % (path,count,nr)))
result['out_seq%d' % nr] = \
ResultPath(Path=(''.join([self.WorkingDir,'Z_%s%d.%s.out'% \
(count,nr,name)])))
result['graph'] =\
ResultPath(Path=(self.WorkingDir+'graph.out'))
result['align'] =\
ResultPath(Path=(self.WorkingDir+'alignment.out'))
return result
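# A hedged usage sketch (assumes the carnac binary is on the PATH; application
# controllers are invoked by calling the instance on their input):
#     app = Carnac()
#     res = app('seqs.fasta')    # 'seqs.fasta' is a hypothetical input file
#     res['StdOut'].read()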