Skip to content
Commits on Source (4)
......@@ -17,7 +17,6 @@ RUN apt-get update && apt-get install --no-install-recommends -y \
check \
libtool \
libsubunit-dev \
git \
python3 \
python3-dev \
python3-setuptools \
......@@ -59,8 +58,10 @@ RUN curl -L https://github.com/Cibiv/IQ-TREE/releases/download/v${iqtree_version
&& rm -rf iqtree-${iqtree_version}-Linux
# Install Gubbins
RUN git clone https://github.com/sanger-pathogens/gubbins.git \
&& cd gubbins \
ENV BUILD_DIR /opt/gubbins
RUN mkdir -p ${BUILD_DIR}
COPY . ${BUILD_DIR}
RUN cd ${BUILD_DIR} \
&& autoreconf -i \
&& ./configure \
&& make \
......
......@@ -9,11 +9,10 @@ PLEASE NOTE: we currently do not have the resources to provide support for Gubbi
[![status](https://img.shields.io/badge/NAR-10.1093-brightgreen.svg)](https://academic.oup.com/nar/article/43/3/e15/2410982)
[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg)](http://bioconda.github.io/recipes/gubbins/README.html)
[![Container ready](https://img.shields.io/badge/container-ready-brightgreen.svg)](https://quay.io/repository/biocontainers/gubbins)
[![Docker Build Status](https://img.shields.io/docker/build/sangerpathogens/gubbins.svg)](https://hub.docker.com/r/sangerpathogens/gubbins)
[![Docker Build Status](https://img.shields.io/docker/cloud/build/sangerpathogens/gubbins.svg)](https://hub.docker.com/r/sangerpathogens/gubbins)
[![Docker Pulls](https://img.shields.io/docker/pulls/sangerpathogens/gubbins.svg)](https://hub.docker.com/r/sangerpathogens/gubbins)
[![codecov](https://codecov.io/gh/sanger-pathogens/gubbins/branch/master/graph/badge.svg)](https://codecov.io/gh/sanger-pathogens/gubbins)
## Contents
* [Introduction](#introduction)
* [Installation](#installation)
......@@ -36,7 +35,15 @@ PLEASE NOTE: we currently do not have the resources to provide support for Gubbi
* [Ancestral sequence reconstruction](#ancestral-sequence-reconstruction)
## Introduction
Since the introduction of high-throughput, second-generation DNA sequencing technologies, there has been an enormous increase in the size of datasets being used for estimating bacterial population phylodynamics. Although many phylogenetic techniques are scalable to hundreds of bacterial genomes, methods which have been used for mitigating the effect of mechanisms of horizontal sequence transfer on phylogenetic reconstructions cannot cope with these new datasets. Gubbins (Genealogies Unbiased By recomBinations In Nucleotide Sequences) is an algorithm that iteratively identifies loci containing elevated densities of base substitutions while concurrently constructing a phylogeny based on the putative point mutations outside of these regions. Simulations demonstrate the algorithm generates highly accurate reconstructions under realistic models of short-term bacterial evolution, and can be run in only a few hours on alignments of hundreds of bacterial genome sequences.
Since the introduction of high-throughput, second-generation DNA sequencing technologies, there has been an enormous
increase in the size of datasets being used for estimating bacterial population phylodynamics.
Although many phylogenetic techniques are scalable to hundreds of bacterial genomes, methods which have been used
for mitigating the effect of mechanisms of horizontal sequence transfer on phylogenetic reconstructions cannot cope
with these new datasets. Gubbins (Genealogies Unbiased By recomBinations In Nucleotide Sequences) is an algorithm
that iteratively identifies loci containing elevated densities of base substitutions while concurrently constructing
a phylogeny based on the putative point mutations outside of these regions. Simulations demonstrate the algorithm
generates highly accurate reconstructions under realistic models of short-term bacterial evolution, and can be run
in only a few hours on alignments of hundreds of bacterial genome sequences.
## Installation
Before you do anything, please have a look at the [Gubbins webpage](http://sanger-pathogens.github.io/gubbins/).
......@@ -44,7 +51,7 @@ Before you do anything, please have a look at the [Gubbins webpage](http://sange
### Required dependencies
* [FastTree](http://www.microbesonline.org/fasttree/#Install) ( >=2.1.4 )
* [RAxML](https://github.com/stamatak/standard-RAxML) ( >=8.0 )
* Python modules: Biopython (> 1.59), DendroPy (>=4.0), Reportlab, nose, pillow
* Python modules: Biopython (> 1.59), DendroPy (>=4.0), nose
* Standard build environment tools (e.g. python3, pip3, make, autoconf, libtool, gcc, check, etc...)
There are a number of ways to install Gubbins and details are provided below. If you encounter an issue when installing Gubbins please contact your local system administrator.
......@@ -72,16 +79,15 @@ We have a docker container which gets automatically built from the latest versio
To use it you would use a command such as this:
docker run --rm -it -v <local_dir>:<remote_dir> sangerpathogens/gubbins run_gubbins -p <remote_dir>/<prefix> [<options>] <remote_dir>/<alignment_file>
docker run --rm -it -v <local_dir>:<remote_dir> sangerpathogens/gubbins run_gubbins.py -p <remote_dir>/<prefix> [<options>] <remote_dir>/<alignment_file>
The flag `-v` synchronizes a directory on the host machine (here denoted as `<local_dir>`) with a directory in the Docker container (here denoted as `<remote_dir>`).
`<remote_dir>` does not need to exist in the container before the run, a common choice is `/data`.
Note that both `<local_dir>` and `<remote_dir>` must be absolute paths.
The input alignment file must be present in `<local_dir>` (or in one of its subdirectories).
In order to retrieve the files produced by Gubbins, run the program with option `-p`;
the argument of this option must consist of `<remote_dir>`, followed by an arbitrary identifier (here denoted as `<prefix>`).
In order to retrieve the files produced by Gubbins, run the program with option `-p`; the argument of this option must consist of `<remote_dir>`,
followed by an arbitrary identifier (here denoted as `<prefix>`).
### OSX/Linux - from source
This is the most difficult method and is only suitable for someone with advanced computing skills. Please consider using Conda instead.
......@@ -136,16 +142,19 @@ Processing options:
--tree_builder, -t
The algorithm to use in the construction of phylogenies in the analysis; can be ‘raxml’, to use RAxML, ‘fasttree’, to use Fasttree, or ‘hybrid’, to use Fasttree for the first iteration and RAxML in all subsequent iterations. Default is raxml
The algorithm to use in the construction of phylogenies in the analysis; can be ‘raxml’, to use RAxML, ‘fasttree’,
to use Fasttree, or ‘hybrid’, to use Fasttree for the first iteration and RAxML in all subsequent iterations. Default is raxml
--iterations, -i
The maximum number of iterations to perform; the algorithm will stop earlier than this if it converges on the same tree in two successive iterations. Default is 5.
--min_snps, -m
The minimum number of base substitutions required to identify a recombination. Default is 3.
--converge_method, -z
Criteria to use to know when to halt iterations [weighted_robinson_foulds|robinson_foulds|recombination]. Default is weighted_robinson_foulds.
Output options:
......@@ -164,7 +173,8 @@ Print debugging messages. Default is off.
--no_cleanup, -n
Do not remove files from intermediate iterations. This option will also keep other files created by RAxML and fasttree, which would otherwise be deleted. Default is to only keep files from the final iteration.
Do not remove files from intermediate iterations. This option will also keep other files created by RAxML and fasttree,
which would otherwise be deleted. Default is to only keep files from the final iteration.
--raxml_model, -r
......@@ -241,8 +251,10 @@ For more information on this software see the [Gubbins webpage](http://sanger-pa
* ftp://ftp.sanger.ac.uk/pub/project/pathogens/gubbins/ST239.aln.gz
### Midpoint rerooting
From version 1.3.5 (25/6/15) to version 1.4.6 (29/2/16) trees were not midpoint rerooted by default. This doesnt have any effect on the recombination detection, but the output trees may not look as expected. Users are advised to upgrade to the latest version.
From version 1.3.5 (25/6/15) to version 1.4.6 (29/2/16) trees were not midpoint rerooted by default.
This doesnt have any effect on the recombination detection, but the output trees may not look as expected. Users are advised to upgrade to the latest version.
### Ancestral sequence reconstruction
From version 2.0.0 onwards, RAxML is used to reconstruction ancestral sequences instead of fastML. RAxML doesnt always produce results as you would expect so the results can be lower quaility than fastML. If you would like to stick with fastML for ancestral sequence reconstruction, please checkout and install v1.4.9.
From version 2.0.0 onwards, RAxML is used to reconstruction ancestral sequences instead of fastML.
RAxML doesnt always produce results as you would expect so the results can be lower quaility than fastML.
If you would like to stick with fastML for ancestral sequence reconstruction, please checkout and install v1.4.9.
gubbins (2.4.0-1) unstable; urgency=medium
* Team upload.
* New upstream version
* Removing debian/patches/disable_test.patch which apparently upstream
addressed in its new release.
* Added build-time (testing) and recommended runtime dependency on iqtree
* Demoted runtime-dependencies to fasttree and raxml to recommended
-- Steffen Moeller <moeller@debian.org> Wed, 08 Jan 2020 12:59:28 +0100
gubbins (2.3.5-1) unstable; urgency=medium
* Team upload.
......
......@@ -7,6 +7,7 @@ Priority: optional
Build-Depends: debhelper-compat (= 12),
dh-python,
fasttree,
iqtree,
raxml,
python3-setuptools,
zlib1g-dev,
......@@ -27,7 +28,8 @@ Architecture: any
Depends: ${shlibs:Depends},
${misc:Depends},
${python3:Depends},
python3,
python3
Recommends: iqtree,
fasttree,
raxml
Description: phylogenetic analysis of genome sequences
......
Description: disable failing tests
Disabled after checking with upstream, who were happy to have these disabled.
Author: Sascha Steinbiss <satta@debian.org>
Forwarded: https://github.com/sanger-pathogens/gubbins/issues/177
--- a/python/gubbins/tests/test_external_dependancies.py
+++ b/python/gubbins/tests/test_external_dependancies.py
@@ -25,7 +25,7 @@ class TestExternalDependancies(unittest.
assert re.match('.*/ls$', common.GubbinsCommon.which('ls -alrt')) != None
assert common.GubbinsCommon.which('non_existant_program') == None
- def test_change_window_size(self):
+ def _test_change_window_size(self):
parser = self.base_arg_parse()
parser.add_argument('--min_window_size', '-a', help='Minimum window size, default 100', type=int, default = 80)
parser.add_argument('--max_window_size', '-b', help='Maximum window size, default 10000', type=int, default = 1000)
@@ -74,7 +74,7 @@ class TestExternalDependancies(unittest.
self.cleanup()
- def test_robinson_foulds_convergence(self):
+ def _test_robinson_foulds_convergence(self):
self.cleanup()
parser = self.default_arg_parse()
......@@ -3,8 +3,10 @@ Last-Update: Thu, 19 Nov 2015 10:45:48 +0100
Description: Installation for Python files does not work that
way and this is done manually in debian/rules
--- a/Makefile.am
+++ b/Makefile.am
Index: gubbins/Makefile.am
===================================================================
--- gubbins.orig/Makefile.am
+++ gubbins/Makefile.am
@@ -1,5 +1,5 @@
EXTRA_DIST=debian/* tests/*.h tests/data/*
-SUBDIRS=src release python
......
......@@ -5,14 +5,16 @@ Description: gubbins executable prints that it is not intended for direct usage
it prints that it is intended to be called from run_gubbins. So it makes
sense to rather install it to /usr/lib/gubbins instead of /usr/bin.
--- a/python/gubbins/common.py
+++ b/python/gubbins/common.py
@@ -108,7 +108,7 @@
FASTTREE_EXEC = GubbinsCommon.choose_executable(fasttree_executables)
FASTTREE_PARAMS = '-nosupport -gtr -gamma -nt'
- GUBBINS_EXEC = 'gubbins'
+ GUBBINS_EXEC = '/usr/lib/gubbins/gubbins'
GUBBINS_BUNDLED_EXEC = '../src/gubbins'
Index: gubbins/python/gubbins/common.py
===================================================================
--- gubbins.orig/python/gubbins/common.py
+++ gubbins/python/gubbins/common.py
@@ -48,7 +48,7 @@ def parse_and_run(input_args, program_de
printer = utils.VerbosePrinter(True, "\n")
# Check if the Gubbins C-program is available. If so, print a welcome message. Otherwise exit.
- gubbins_exec = 'gubbins'
+ gubbins_exec = '/usr/lib/gubbins/gubbins'
if utils.which(gubbins_exec) is None:
# Check if the Gubbins C-program is available in its source directory (for tests/Travis)
gubbins_bundled_exec = os.path.abspath(os.path.join(current_directory, '../src/gubbins'))
do_not_handle_python_by_upstream_build_system.patch
gubbins_exe_in_usr_lib.patch
disable_test.patch
This diff is collapsed.
......@@ -10,4 +10,3 @@ run_gubbins.py PMEN1.aln
```
This directory contains the output files generated.
It required 60 Mbytes of memory and took 50 seconds to run.
......@@ -5,11 +5,22 @@ set -e
start_dir=$(pwd)
RAXML_VERSION="8.1.21"
FASTTREE_VERSION="2.1.9"
RAXML_VERSION="8.2.12"
FASTTREE_VERSION="2.1.10"
IQTREE_VERSION="1.6.6"
RAXML_DOWNLOAD_URL="https://github.com/stamatak/standard-RAxML/archive/v${RAXML_VERSION}.tar.gz"
FASTTREE_DOWNLOAD_URL="http://www.microbesonline.org/fasttree/FastTree-${FASTTREE_VERSION}.c"
RAXML_DIR="standard-RAxML-$RAXML_VERSION"
RAXML_ZIP_FILE="$RAXML_DIR.tar.gz"
IQTREE_DIR="iqtree-$IQTREE_VERSION-Linux"
IQTREE_ZIP_FILE="$IQTREE_DIR.tar.gz"
FASTTREE_DIR="FastTree-$FASTTREE_VERSION"
FASTTREE_SOURCE="$FASTTREE_DIR.c"
RAXML_DOWNLOAD_URL="https://github.com/stamatak/standard-RAxML/archive/v$RAXML_VERSION.tar.gz"
FASTTREE_DOWNLOAD_URL="http://www.microbesonline.org/fasttree/$FASTTREE_SOURCE"
IQTREE_DOWNLOAD_URL="https://github.com/Cibiv/IQ-TREE/releases/download/v$IQTREE_VERSION/$IQTREE_ZIP_FILE"
# Make an install location
if [ ! -d 'build' ]; then
......@@ -31,8 +42,9 @@ download () {
fi
}
download $RAXML_DOWNLOAD_URL "raxml-${RAXML_VERSION}.tgz"
download $FASTTREE_DOWNLOAD_URL "fasttree-${FASTTREE_VERSION}.c"
download $RAXML_DOWNLOAD_URL $RAXML_ZIP_FILE
download $FASTTREE_DOWNLOAD_URL $FASTTREE_SOURCE
download $IQTREE_DOWNLOAD_URL $IQTREE_ZIP_FILE
# Update dependencies
if [ "$TRAVIS" = 'true' ]; then
......@@ -49,32 +61,44 @@ else
fi
# Build all the things
cd $build_dir
## RAxML
raxml_dir=$(pwd)/"standard-RAxML-${RAXML_VERSION}"
if [ ! -d $raxml_dir ]; then
tar xzf raxml-${RAXML_VERSION}.tgz
cd $build_dir
if [ ! -d $RAXML_DIR ]; then
tar xzf $RAXML_ZIP_FILE
fi
cd $raxml_dir
if [ -e "${raxml_dir}/raxmlHPC" ]; then
echo "Already build RAxML; skipping build"
cd $RAXML_DIR
if [ -e "raxmlHPC" ]; then
echo "Already built single-processor RAxML; skipping build"
else
make -f Makefile.gcc
fi
cd $build_dir
if [ -e "raxmlHPC-PTHREADS" ]; then
echo "Already built multi-processor RAxML; skipping build"
else
make -f Makefile.PTHREADS.gcc
fi
## FastTree
fasttree_dir=${build_dir}/fasttree-${FASTTREE_VERSION}
if [ ! -d $fasttree_dir ]; then
mkdir $fasttree_dir
cd $build_dir
if [ ! -d $FASTTREE_DIR ]; then
mkdir $FASTTREE_DIR
fi
cd $fasttree_dir
if [ -e "${fasttree_dir}/FastTree" ]; then
cd $FASTTREE_DIR
if [ -e "$FASTTREE_DIR/FastTree" ]; then
echo "Skipping, FastTree already exists"
else
gcc -O3 -finline-functions -funroll-loops -Wall -o FastTree ${build_dir}/fasttree-${FASTTREE_VERSION}.c -lm
gcc -O3 -finline-functions -funroll-loops -Wall -o FastTree $build_dir/$FASTTREE_SOURCE -lm
fi
## IQTree
cd $build_dir
if [ ! -d $IQTREE_DIR ]; then
tar xzf $IQTREE_ZIP_FILE
fi
cd $IQTREE_DIR
if [ -e "bin/iqtree" ]; then
cp bin/iqtree iqtree
fi
# Setup environment variables
......@@ -85,8 +109,9 @@ update_path () {
fi
}
update_path ${raxml_dir}
update_path ${fasttree_dir}
update_path $build_dir/$RAXML_DIR
update_path $build_dir/$FASTTREE_DIR
update_path $build_dir/$IQTREE_DIR
cd $start_dir
......
This diff is collapsed.
#! /bin/sh
# Common stub for a few missing GNU programs while installing.
scriptversion=2012-01-06.13; # UTC
# Copyright (C) 1996, 1997, 1999, 2000, 2002, 2003, 2004, 2005, 2006,
# 2008, 2009, 2010, 2011, 2012 Free Software Foundation, Inc.
# Originally by Fran,cois Pinard <pinard@iro.umontreal.ca>, 1996.
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2, or (at your option)
# any later version.
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
# As a special exception to the GNU General Public License, if you
# distribute this file as part of a program that contains a
# configuration script generated by Autoconf, you may include it under
# the same distribution terms that you use for the rest of that program.
if test $# -eq 0; then
echo 1>&2 "Try \`$0 --help' for more information"
exit 1
fi
run=:
sed_output='s/.* --output[ =]\([^ ]*\).*/\1/p'
sed_minuso='s/.* -o \([^ ]*\).*/\1/p'
# In the cases where this matters, `missing' is being run in the
# srcdir already.
if test -f configure.ac; then
configure_ac=configure.ac
else
configure_ac=configure.in
fi
msg="missing on your system"
case $1 in
--run)
# Try to run requested program, and just exit if it succeeds.
run=
shift
"$@" && exit 0
# Exit code 63 means version mismatch. This often happens
# when the user try to use an ancient version of a tool on
# a file that requires a minimum version. In this case we
# we should proceed has if the program had been absent, or
# if --run hadn't been passed.
if test $? = 63; then
run=:
msg="probably too old"
fi
;;
-h|--h|--he|--hel|--help)
echo "\
$0 [OPTION]... PROGRAM [ARGUMENT]...
Handle \`PROGRAM [ARGUMENT]...' for when PROGRAM is missing, or return an
error status if there is no known handling for PROGRAM.
Options:
-h, --help display this help and exit
-v, --version output version information and exit
--run try to run the given command, and emulate it if it fails
Supported PROGRAM values:
aclocal touch file \`aclocal.m4'
autoconf touch file \`configure'
autoheader touch file \`config.h.in'
autom4te touch the output file, or create a stub one
automake touch all \`Makefile.in' files
bison create \`y.tab.[ch]', if possible, from existing .[ch]
flex create \`lex.yy.c', if possible, from existing .c
help2man touch the output file
lex create \`lex.yy.c', if possible, from existing .c
makeinfo touch the output file
yacc create \`y.tab.[ch]', if possible, from existing .[ch]
Version suffixes to PROGRAM as well as the prefixes \`gnu-', \`gnu', and
\`g' are ignored when checking the name.
Send bug reports to <bug-automake@gnu.org>."
exit $?
;;
-v|--v|--ve|--ver|--vers|--versi|--versio|--version)
echo "missing $scriptversion (GNU Automake)"
exit $?
;;
-*)
echo 1>&2 "$0: Unknown \`$1' option"
echo 1>&2 "Try \`$0 --help' for more information"
exit 1
;;
esac
# normalize program name to check for.
program=`echo "$1" | sed '
s/^gnu-//; t
s/^gnu//; t
s/^g//; t'`
# Now exit if we have it, but it failed. Also exit now if we
# don't have it and --version was passed (most likely to detect
# the program). This is about non-GNU programs, so use $1 not
# $program.
case $1 in
lex*|yacc*)
# Not GNU programs, they don't have --version.
;;
*)
if test -z "$run" && ($1 --version) > /dev/null 2>&1; then
# We have it, but it failed.
exit 1
elif test "x$2" = "x--version" || test "x$2" = "x--help"; then
# Could not run --version or --help. This is probably someone
# running `$TOOL --version' or `$TOOL --help' to check whether
# $TOOL exists and not knowing $TOOL uses missing.
exit 1
fi
;;
esac
# If it does not exist, or fails to run (possibly an outdated version),
# try to emulate it.
case $program in
aclocal*)
echo 1>&2 "\
WARNING: \`$1' is $msg. You should only need it if
you modified \`acinclude.m4' or \`${configure_ac}'. You might want
to install the \`Automake' and \`Perl' packages. Grab them from
any GNU archive site."
touch aclocal.m4
;;
autoconf*)
echo 1>&2 "\
WARNING: \`$1' is $msg. You should only need it if
you modified \`${configure_ac}'. You might want to install the
\`Autoconf' and \`GNU m4' packages. Grab them from any GNU
archive site."
touch configure
;;
autoheader*)
echo 1>&2 "\
WARNING: \`$1' is $msg. You should only need it if
you modified \`acconfig.h' or \`${configure_ac}'. You might want
to install the \`Autoconf' and \`GNU m4' packages. Grab them
from any GNU archive site."
files=`sed -n 's/^[ ]*A[CM]_CONFIG_HEADER(\([^)]*\)).*/\1/p' ${configure_ac}`
test -z "$files" && files="config.h"
touch_files=
for f in $files; do
case $f in
*:*) touch_files="$touch_files "`echo "$f" |
sed -e 's/^[^:]*://' -e 's/:.*//'`;;
*) touch_files="$touch_files $f.in";;
esac
done
touch $touch_files
;;
automake*)
echo 1>&2 "\
WARNING: \`$1' is $msg. You should only need it if
you modified \`Makefile.am', \`acinclude.m4' or \`${configure_ac}'.
You might want to install the \`Automake' and \`Perl' packages.
Grab them from any GNU archive site."
find . -type f -name Makefile.am -print |
sed 's/\.am$/.in/' |
while read f; do touch "$f"; done
;;
autom4te*)
echo 1>&2 "\
WARNING: \`$1' is needed, but is $msg.
You might have modified some files without having the
proper tools for further handling them.
You can get \`$1' as part of \`Autoconf' from any GNU
archive site."
file=`echo "$*" | sed -n "$sed_output"`
test -z "$file" && file=`echo "$*" | sed -n "$sed_minuso"`
if test -f "$file"; then
touch $file
else
test -z "$file" || exec >$file
echo "#! /bin/sh"
echo "# Created by GNU Automake missing as a replacement of"
echo "# $ $@"
echo "exit 0"
chmod +x $file
exit 1
fi
;;
bison*|yacc*)
echo 1>&2 "\
WARNING: \`$1' $msg. You should only need it if
you modified a \`.y' file. You may need the \`Bison' package
in order for those modifications to take effect. You can get
\`Bison' from any GNU archive site."
rm -f y.tab.c y.tab.h
if test $# -ne 1; then
eval LASTARG=\${$#}
case $LASTARG in
*.y)
SRCFILE=`echo "$LASTARG" | sed 's/y$/c/'`
if test -f "$SRCFILE"; then
cp "$SRCFILE" y.tab.c
fi
SRCFILE=`echo "$LASTARG" | sed 's/y$/h/'`
if test -f "$SRCFILE"; then
cp "$SRCFILE" y.tab.h
fi
;;
esac
fi
if test ! -f y.tab.h; then
echo >y.tab.h
fi
if test ! -f y.tab.c; then
echo 'main() { return 0; }' >y.tab.c
fi
;;
lex*|flex*)
echo 1>&2 "\
WARNING: \`$1' is $msg. You should only need it if
you modified a \`.l' file. You may need the \`Flex' package
in order for those modifications to take effect. You can get
\`Flex' from any GNU archive site."
rm -f lex.yy.c
if test $# -ne 1; then
eval LASTARG=\${$#}
case $LASTARG in
*.l)
SRCFILE=`echo "$LASTARG" | sed 's/l$/c/'`
if test -f "$SRCFILE"; then
cp "$SRCFILE" lex.yy.c
fi
;;
esac
fi
if test ! -f lex.yy.c; then
echo 'main() { return 0; }' >lex.yy.c
fi
;;
help2man*)
echo 1>&2 "\
WARNING: \`$1' is $msg. You should only need it if
you modified a dependency of a manual page. You may need the
\`Help2man' package in order for those modifications to take
effect. You can get \`Help2man' from any GNU archive site."
file=`echo "$*" | sed -n "$sed_output"`
test -z "$file" && file=`echo "$*" | sed -n "$sed_minuso"`
if test -f "$file"; then
touch $file
else
test -z "$file" || exec >$file
echo ".ab help2man is required to generate this page"
exit $?
fi
;;
makeinfo*)
echo 1>&2 "\
WARNING: \`$1' is $msg. You should only need it if
you modified a \`.texi' or \`.texinfo' file, or any other file
indirectly affecting the aspect of the manual. The spurious
call might also be the consequence of using a buggy \`make' (AIX,
DU, IRIX). You might want to install the \`Texinfo' package or
the \`GNU make' package. Grab either from any GNU archive site."
# The file to touch is that specified with -o ...
file=`echo "$*" | sed -n "$sed_output"`
test -z "$file" && file=`echo "$*" | sed -n "$sed_minuso"`
if test -z "$file"; then
# ... or it is the one specified with @setfilename ...
infile=`echo "$*" | sed 's/.* \([^ ]*\) *$/\1/'`
file=`sed -n '
/^@setfilename/{
s/.* \([^ ]*\) *$/\1/
p
q
}' $infile`
# ... or it is derived from the source name (dir/f.texi becomes f.info)
test -z "$file" && file=`echo "$infile" | sed 's,.*/,,;s,.[^.]*$,,'`.info
fi
# If the file does not exist, the user really needs makeinfo;
# let's fail without touching anything.
test -f $file || exit 1
touch $file
;;
*)
echo 1>&2 "\
WARNING: \`$1' is needed, and is $msg.
You might have modified some files without having the
proper tools for further handling them. Check the \`README' file,
it often tells you about the needed prerequisites for installing
this package. You may also peek at any GNU archive site, in case
some other package would contain this missing \`$1' program."
exit 1
;;
esac
exit 0
# Local variables:
# eval: (add-hook 'write-file-hooks 'time-stamp)
# time-stamp-start: "scriptversion="
# time-stamp-format: "%:y-%02m-%02d.%02H"
# time-stamp-time-zone: "UTC"
# time-stamp-end: "; # UTC"
# End:
import os
import sys
import hashlib
from Bio import AlignIO
from Bio.Align import MultipleSeqAlignment
from collections import defaultdict
class PreProcessFasta(object):
def __init__(self, input_filename, verbose=False, filter_percentage=25):
self.input_filename = input_filename
self.verbose = verbose
......@@ -27,7 +29,7 @@ class PreProcessFasta(object):
return sequence_hash_to_taxa
def calculate_sequences_missing_data_percentage(self):
sequences_to_missing_data = defaultdict(list)
sequences_to_missing_data = {}
with open(self.input_filename) as input_handle:
alignments = AlignIO.parse(input_handle, "fasta")
for alignment in alignments:
......@@ -43,10 +45,11 @@ class PreProcessFasta(object):
if self.verbose:
print("Sample " + str(record.id) + " has no sequence ")
else:
per_missing_data = (number_of_gaps*100/sequence_length)
per_missing_data = number_of_gaps*100/sequence_length
sequences_to_missing_data[record.id] = per_missing_data
if self.verbose:
print("Sample "+ str(record.id) + " has missing data percentage of " + str(per_missing_data))
print("Sample " + str(record.id) + " has missing data percentage of " +
str(per_missing_data))
input_handle.close()
return sequences_to_missing_data
......@@ -54,9 +57,10 @@ class PreProcessFasta(object):
def taxa_missing_too_much_data(self):
taxa_to_remove = []
for taxa, percentage_missing in self.calculate_sequences_missing_data_percentage().items():
if(percentage_missing > self.filter_percentage):
if percentage_missing > self.filter_percentage:
taxa_to_remove.append(taxa)
print("Excluded sequence " +taxa + " because it had " + str(percentage_missing) +" percentage missing data while a maximum of "+ str(self.filter_percentage) +" is allowed")
print("Excluded sequence " + taxa + " because it had " + str(percentage_missing) +
" percentage missing data while a maximum of " + str(self.filter_percentage) + " is allowed")
return taxa_to_remove
......@@ -66,15 +70,16 @@ class PreProcessFasta(object):
if len(taxa) > 1:
taxon_to_keep = taxa.pop()
for taxon in taxa:
print("Sequences in "+ taxon + " and " + taxon_to_keep + " are identical, removing " + taxon + " from analysis")
print("Sequences in " + taxon + " and " + taxon_to_keep + " are identical, removing " + taxon +
" from analysis")
taxa_to_remove.append(taxon)
return taxa_to_remove
def remove_duplicate_sequences_and_sequences_missing_too_much_data(self, output_filename,remove_identical_sequences = 0):
def remove_duplicate_sequences_and_sequences_missing_too_much_data(self, output_filename,
remove_identical_sequences=None):
taxa_to_remove = []
if remove_identical_sequences < 1:
if not remove_identical_sequences:
taxa_to_remove = self.taxa_missing_too_much_data()
else:
taxa_to_remove = self.taxa_of_duplicate_sequences() + self.taxa_missing_too_much_data()
......
# encoding: utf-8
# Wellcome Trust Sanger Institute
# Copyright (C) 2013 Wellcome Trust Sanger Institute
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
#
import os
import sys
import subprocess
import re
class RAxMLExecutable(object):
def __init__(self, threads, model = 'GTRCAT', verbose = False ):
self.verbose = verbose
self.threads = threads
self.single_threaded_executables = ['raxmlHPC-AVX2','raxmlHPC-AVX','raxmlHPC-SSE3','raxmlHPC']
self.multi_threaded_executables = ['raxmlHPC-PTHREADS-AVX2','raxmlHPC-PTHREADS-AVX','raxmlHPC-PTHREADS-SSE3','raxmlHPC-PTHREADS']
self.model = model
self.raxml_executable = self.select_executable_based_on_threads()
self.tree_building_parameters_gtrgamma = ' -f d -p 1 -m GTRGAMMA '
self.tree_building_parameters_gtrcat = ' -f d -p 1 -m GTRCAT -V '
self.internal_sequence_parameters = ' -f A -p 1 -m GTRGAMMA '
def tree_building_command(self):
tree_building_parameters = self.tree_building_parameters_gtrcat
if self.model == 'GTRGAMMA':
tree_building_parameters =self.tree_building_parameters_gtrgamma
command = self.raxml_executable + self.threads_parameter() + tree_building_parameters
if self.verbose:
print("Tree building command: "+command)
return command
def internal_sequence_reconstruction_command(self):
command = self.raxml_executable + self.threads_parameter() + self.internal_sequence_parameters
if self.verbose:
print("Internal sequence reconstruction command: "+command)
return command
def choose_executable_from_list(self,list_of_executables):
flags = []
if os.path.exists('/proc/cpuinfo'):
output = subprocess.Popen('grep flags /proc/cpuinfo', stdout = subprocess.PIPE, shell=True).communicate()[0].decode("utf-8")
flags = output.split()
for executable in list_of_executables:
if os.path.exists('/proc/cpuinfo'):
if re.search('AVX2', executable) and 'avx2' not in flags:
continue
elif re.search('AVX', executable) and 'avx' not in flags:
continue
elif re.search('SSE3', executable) and 'ssse3' not in flags:
continue
if self.which(executable) != None:
return executable
return None
def which(self,program):
executable = program.split(" ")
program = executable[0]
def is_exe(fpath):
return os.path.isfile(fpath) and os.access(fpath, os.X_OK)
fpath, fname = os.path.split(program)
if fpath:
if is_exe(program):
return program
else:
for path in os.environ["PATH"].split(os.pathsep):
exe_file = os.path.join(path, program)
if is_exe(exe_file):
return exe_file
return None
def threads_parameter(self):
if self.threads > 1:
return " -T " + str(self.threads) + " "
else:
return ""
def select_executable_based_on_threads(self):
single_threaded_exec = self.choose_executable_from_list(self.single_threaded_executables)
multi_threaded_exec = self.choose_executable_from_list(self.multi_threaded_executables)
if self.threads == 1:
if single_threaded_exec != None:
return single_threaded_exec
else:
print("Trying multithreaded version of RAxML because no single threaded version of RAxML could be found. Just to warn you, this requires 2 threads.\n")
self.threads = 2
if self.threads > 1:
if multi_threaded_exec != None:
return multi_threaded_exec
else:
sys.exit("No usable version of RAxML could be found, please ensure one of these executables is in your PATH:\nraxmlHPC-PTHREADS-AVX2\nraxmlHPC-PTHREADS-AVX\nraxmlHPC-PTHREADS-SSE3\nraxmlHPC-PTHREADS\nraxmlHPC-AVX2\nraxmlHPC-AVX\nraxmlHPC-SSE3\nraxmlHPC")
# encoding: utf-8
# Wellcome Trust Sanger Institute
# Copyright (C) 2013 Wellcome Trust Sanger Institute
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
#
import os
import sys
import tempfile
import dendropy
import subprocess
import shutil
import time
from random import randint
from Bio import AlignIO
from Bio.Align import MultipleSeqAlignment
class RAxMLSequenceReconstruction(object):
def __init__(self, input_alignment_filename, input_tree, output_alignment_filename, output_tree, raxml_internal_sequence_reconstruction_command, verbose = False ):
self.input_alignment_filename = os.path.abspath(input_alignment_filename)
self.input_tree = os.path.abspath(input_tree)
self.output_alignment_filename = os.path.abspath(output_alignment_filename)
self.output_tree = os.path.abspath(output_tree)
self.raxml_internal_sequence_reconstruction_command = raxml_internal_sequence_reconstruction_command
self.verbose = verbose
self.working_dir = tempfile.mkdtemp(dir=os.getcwd())
self.temp_rooted_tree = self.working_dir +'/' +'rooted_tree.newick'
self.temp_interal_fasta = self.working_dir +'/' +'internal.fasta'
self.internal_node_prefix = 'internal_'
def reconstruct_ancestor_sequences(self):
self.root_tree(self.input_tree, self.temp_rooted_tree)
self.run_raxml_ancestor_command(self.temp_rooted_tree)
self.convert_raw_ancestral_states_to_fasta(self.raw_internal_sequence_filename(), self.temp_interal_fasta)
self.combine_fastas(self.input_alignment_filename, self.temp_interal_fasta,self.output_alignment_filename)
if os.path.exists(self.temp_rooted_tree):
self.transfer_internal_names_to_tree(self.raw_internal_rooted_tree_filename(), self.temp_rooted_tree, self.output_tree)
shutil.rmtree(self.working_dir)
def run_raxml_ancestor_command(self,rooted_tree):
current_directory = os.getcwd()
if self.verbose > 0:
print(self.raxml_reconstruction_command(rooted_tree))
try:
os.chdir(self.working_dir)
subprocess.check_call(self.raxml_reconstruction_command(rooted_tree), shell=True)
os.chdir(current_directory)
except:
os.chdir(current_directory)
sys.exit("Something went wrong while creating the ancestor sequences using RAxML")
if self.verbose > 0:
print(int(time.time()))
def raw_internal_sequence_filename(self):
return self.working_dir +'/RAxML_marginalAncestralStates.internal'
def raw_internal_rooted_tree_filename(self):
return self.working_dir +'/RAxML_nodeLabelledRootedTree.internal'
def raxml_reconstruction_command(self,rooted_tree):
verbose_suffix = ''
if not self.verbose:
verbose_suffix = '> /dev/null 2>&1'
return " ".join([self.raxml_internal_sequence_reconstruction_command, ' -s', self.input_alignment_filename, '-t', rooted_tree, '-n', 'internal' ,verbose_suffix ])
def write_tree(self, tree, output_tree):
output_tree_string = tree.as_string(
schema='newick',
suppress_leaf_taxon_labels=False,
suppress_leaf_node_labels=True,
suppress_internal_taxon_labels=False,
suppress_internal_node_labels=False,
suppress_rooting=False,
suppress_edge_lengths=False,
unquoted_underscores=True,
preserve_spaces=False,
store_tree_weights=False,
suppress_annotations=True,
annotations_as_nhx=False,
suppress_item_comments=True,
node_label_element_separator=' '
)
with open(output_tree, 'w+') as output_file:
output_file.write(output_tree_string.replace('\'', ''))
output_file.closed
return output_tree
def transfer_internal_names_to_tree(self, source_tree, destination_tree, output_tree):
source_tree_obj = dendropy.Tree.get_from_path(source_tree, 'newick', preserve_underscores=True)
source_internal_node_labels = []
for source_internal_node in source_tree_obj.internal_nodes():
if source_internal_node.label:
source_internal_node_labels.append(source_internal_node.label)
else:
source_internal_node_labels.append('')
destination_tree_obj = dendropy.Tree.get_from_path(destination_tree, 'newick', preserve_underscores=True)
for index, destination_internal_node in enumerate(destination_tree_obj.internal_nodes()):
destination_internal_node.label = None
destination_internal_node.taxon = dendropy.Taxon(self.internal_node_prefix + str(source_internal_node_labels[index]))
self.write_tree( destination_tree_obj, output_tree)
def root_tree(self, input_tree_filename, output_tree):
# split bi nodes and root tree
tree = dendropy.Tree.get_from_path(input_tree_filename, 'newick', preserve_underscores=True)
self.split_all_non_bi_nodes(tree.seed_node)
self.write_tree( tree, output_tree)
def convert_raw_ancestral_states_to_fasta(self, input_filename, output_filename):
with open(input_filename, 'r') as infile:
with open(output_filename, 'w+') as outfile:
for sequence_line in infile:
[sequence_name, sequence_bases] = sequence_line.split(' ')
sequence_bases = sequence_bases.replace('?', 'N')
outfile.write('>'+sequence_name+'\n')
outfile.write(sequence_bases)
# Warning - recursion
def split_all_non_bi_nodes(self, node):
if node.is_leaf():
return None
elif len(node.child_nodes()) > 2:
self.split_child_nodes(node)
for child_node in node.child_nodes():
self.split_all_non_bi_nodes(child_node)
return None
def split_child_nodes(self,node):
all_child_nodes = node.child_nodes()
first_child = all_child_nodes.pop()
new_child_node = node.new_child(edge_length=0)
new_child_node.set_child_nodes(all_child_nodes)
node.set_child_nodes((first_child,new_child_node))
def combine_fastas(self, leaf_node_filename, internl_node_filename, output_file ):
with open(output_file, 'w') as output_handle:
# print out leafnodes as is
with open(leaf_node_filename, 'r') as input_handle:
alignments = AlignIO.parse(input_handle, "fasta")
AlignIO.write(alignments,output_handle, "fasta")
input_handle.closed
with open(internl_node_filename, 'r') as input_handle:
alignments = AlignIO.parse(input_handle, "fasta")
output_alignments = []
for alignment in alignments:
for record in alignment:
record.id = self.internal_node_prefix + str(record.id)
record.description = ''
output_alignments.append(record)
AlignIO.write(MultipleSeqAlignment(output_alignments),output_handle, "fasta")
input_handle.closed
output_handle.closed
\ No newline at end of file
This diff is collapsed.
>sequence1
AAAA
>sequence2
CCCC
>sequence3
GGGG
>sequence4
TTTT
>sequence5
AAAA
>sequence11
ACGT
>sequence12
CATG
>sequence13
GACT
>sequence14
TGAC
>sequence15
AGTC