\fBandi\fR estimates the evolutionary distance between closely related genomes. For this \fBandi\fR reads the input sequences from \fIFASTA\fR files and computes the pairwise anchor distance. The idea behind this is explained in a paper by Haubold et al. (2015).
The output is a symmetrical distance matrix in \fIPHYLIP\fR format, with each entry representing divergence with a positive real number. A distance of zero means that two sequences are identical, whereas other values are estimates for the nucleotide substitution rate (Jukes-Cantor corrected). For technical reasons the comparison might fail and no estimate can be computed. In such cases \fInan\fR is printed. This either means that the input sequences were too short (<200bp) or too diverse (K>0.5) for our method to work properly.
Compute multiple distance matrices, with \fIn-1\fR bootstrapped from the first. See the paper Klötzl & Haubold (2016) for a detailed explanation.
.TP
\fB--file-of-filenames\fR <FILE>
\fB--file-of-filenames\fR=\fIFILE\fR
Usually, \fBandi\fR is called with the filenames as commandline arguments. With this option the filenames may also be read from a file itself, with one name per line. Use a single dash (\fB'-'\fR) to read from stdin.
.TP
\fB\-j\fR, \fB\-\-join\fR
...
...
@@ -23,13 +23,16 @@ Use this mode if each of your \fIFASTA\fR files represents one assembly with num
\fB\-l\fR, \fB\-\-low-memory\fR
In multithreaded mode, \fBandi\fR requires memory linear in the number of threads. The low memory mode changes this to a constant demand independent of the number of threads used. Unfortunately, this comes at a significant runtime cost.
.TP
\fB\-m\fR, \fB\-\-model\fR <Raw|JC|Kimura>
Different models of nucleotide evolution are supported. By default the Jukes-Cantor correction is used.
Set the nucleotide evolution model to one of 'Raw', 'JC', or 'Kimura'. By default the Jukes-Cantor correction is used.
.TP
\fB\-p\fR <FLOAT>
\fB\-p\fR \fIFLOAT\fR
Significance of an anchor; default: 0.025.
.TP
\fB\-t\fR, \fB\-\-threads\fR <INT>
\fB--progress\fR[=\fIWHEN\fR]
Print a progress bar. \fIWHEN\fR can be 'auto' (default if omitted), 'always', or 'never'.
.TP
\fB\-t\fR \fIINT\fR, \fB\-\-threads\fR=\fIINT\fR
The number of threads to be used; by default, all available processors are used.
.br
Multithreading is only available if \fBandi\fR was compiled with OpenMP support.
...
...
@@ -38,7 +41,7 @@ Multithreading is only available if \fBandi\fR was compiled with OpenMP support.
By default \fBandi\fR outputs the full names of sequences, optionally padded with spaces, if they are shorter than ten characters. Names longer than ten characters may lead to problems with downstream tools. With this switch names will be truncated.
.TP
\fB\-v\fR, \fB\-\-verbose\fR
Prints additional information. Apply multiple times for extra verbosity.
Prints additional information, including the amount of homology found. Apply multiple times for extra verbosity.
.TP
\fB\-h\fR, \fB\-\-help\fR
Prints the synopsis and an explanation of available options.
...
...
@@ -46,7 +49,7 @@ Prints the synopsis and an explanation of available options.
\fB\-\-version\fR
Outputs version information and acknowledgments.
.SH COPYRIGHT
Copyright \(co 2014 - 2016 Fabian Klötzl
Copyright \(co 2014 - 2017 Fabian Klötzl
License GPLv3+: GNU GPL version 3 or later.
.br
This is free software: you are free to change and redistribute it.
@@ -106,61 +106,62 @@ This document is release under the Creative Commons Attribution Share-Alike lice
The easiest way to install \andi is via a package manager. This also handles all dependencies for you.
\noindent Debian and Ubuntu (since 16.04):
\noindent Debian and Ubuntu:
\begin{lstlisting}
~ % sudo apt-get install andi
\end{lstlisting}
\noindent OS X with homebrew:
\noindent macOS with homebrew:
\begin{lstlisting}
~ % brew install homebrew/science/andi
\end{lstlisting}
\noindent ArchLinux:
\noindent ArchLinux AUR package with aura:
\begin{lstlisting}
~ % aura -A andi
\end{lstlisting}
\andi is intended to run in a \algo{Unix} commandline such as \lstinline$bash$ or \lstinline$zsh$. All examples in this document are also intended for that environment. You can verify that \andi was installed correctly by executing \lstinline$andi -h$. This should give you a list of all available options (see Section~\ref{sec:options}).
\andi is intended to be run in a \algo{Unix} commandline such as \lstinline$bash$ or \lstinline$zsh$. All examples in this document are also intended for that environment. You can verify that \andi was installed correctly by executing \lstinline$andi -h$. This should give you a list of all available options (see Section~\ref{sec:options}).
\section{Source Package}\label{sub:regular}
Download the latest \href{https://github.com/EvolBioInf/andi/releases}{release} from GitHub. Please note that \andi requires the \algo{GNU Scientific Library} and optionally \algo{libdivsufsort}\footnote{\url{https://github.com/y-256/libdivsufsort}} for optimal performance \cite{divsufsort}. If you wish to install \andi without \algo{libdivsufsort}, consult Section~\ref{sub:wo-divsufsort}.
To build \andi from source, download the latest \href{https://github.com/EvolBioInf/andi/releases}{release} from GitHub. Please note that \andi requires the \algo{GNU Scientific Library} and optionally \algo{libdivsufsort}\footnote{\url{https://github.com/y-256/libdivsufsort}} for optimal performance \cite{divsufsort}. If you wish to install \andi without \algo{libdivsufsort}, consult Section~\ref{sub:wo-divsufsort}.
Once you have downloaded the package, unzip it and change into the newly created directory.
\begin{lstlisting}
~ % tar -xzvf andi-0.11.tar.gz
~ % cd andi-0.11
~ % tar -xzvf andi-0.12.tar.gz
~ % cd andi-0.12
\end{lstlisting}
\noindent Now build and install \andi.
\begin{lstlisting}
~/andi-0.11% ./configure
~/andi-0.11% make
~/andi-0.11% sudo make install
~/andi-0.12% ./configure
~/andi-0.12% make
~/andi-0.12% sudo make install
\end{lstlisting}
\noindent This installs \andi for all users on your system. If you do not have root privileges, you will find a working copy of \andi in the \lstinline$src$ subdirectory. For the rest of this documentation, I will assume that \andi is in your \textdollar\lstinline!PATH!.
\noindent This installs \andi for all users on your system. If you do not have root privileges, you will find a working copy of \andi in the \lstinline$src$ subdirectory. For the rest of this documentation, it is assumed that \andi is in your \textdollar\lstinline!PATH!.
Now \andi should be ready for use. Try invoking the help.
--file-of-filenames=FILE Read additional filenames from FILE; one per line
-j, --join Treat all sequences from one file as a single genome
-l, --low-memory Use less memory at the cost of speed
-m, --model <Raw|JC|Kimura> Pick an evolutionary model; default: JC
-p <FLOAT> Significance of an anchor; default: 0.025
-t, --threads <INT> Set the number of threads; by default, all available processors are used
-m, --model=MODEL Pick an evolutionary model of 'Raw', 'JC', 'Kimura'; default: JC
-p FLOAT Significance of an anchor; default: 0.025
--progress=WHEN Print a progress bar 'always', 'never', or 'auto'; default: auto
-t, --threads=INT Set the number of threads; by default, all processors are used
--truncate-names Truncate names to ten characters
-v, --verbose Prints additional information
-h, --help Display this help and exit
...
...
@@ -230,9 +231,9 @@ When the \algo{join} mode is active, the file names are used to label the indivi
If not enough file names are provided, \andi will try to read sequences from the standard input stream. This behaviour can be explicitly triggered by passing a single dash (\lstinline$-$) as a file name, which is useful in pipelines.
If \andi seems to take unusually long, or requires huge amounts of memory, then you might have forgotten the \algo{join} switch. This makes \andi compare each contig instead of each genome, resulting in many more comparisons! To make \andi output the number of genomes it is about to compare, use the \lstinline$--verbose$ switch.
If \andi seems to take unusually long, or requires huge amounts of memory, then you might have forgotten the \algo{join} switch. This makes \andi compare each contig instead of each genome, resulting in many more comparisons! Since version 0.12, \andi prints a progress meter on the standard error stream. \andi tries to be smart about when to show or hide the progress bar. You can manually change this behaviour using the \lstinline!--progress! option.
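\noindent As a minimal sketch, the progress bar can be switched off explicitly, for instance to keep the log of a batch job clean. The file names below are placeholders, not files shipped with \andi.
\begin{lstlisting}
~ % andi --progress=never -j a.fasta b.fasta > ab.mat
\end{lstlisting}
\noindent The other accepted values are \lstinline!always! and \lstinline!auto!, the latter being the default.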
Starting with version 0.11 \andi supports an additional way of passing input. Instead of passing file names directly to \andi via the commandline arguments, the files may also be read from a file itself. Using this new \lstinline$--file-of-filenames$ can work around limitations imposed by the shell.
Starting with version 0.11 \andi supports an additional way of passing input. Instead of passing file names directly to \andi via the commandline arguments, the file names may also be read from a file itself. Using this new \lstinline$--file-of-filenames$ argument can work around limitations imposed by the shell.
The following three snippets have the same functionality.
...
...
@@ -267,7 +268,7 @@ If the computation completed successfully, \andi exits with the status code 0. O
\section{Options}\label{sec:options}
\andi takes a small number of commandline options, of which even fewer are of interest on a day-to-day basis. If \lstinline$andi -h$ displays a \lstinline$-t$ option, then \andi was compiled with multi-threading support (implemented using \algo{OpenMP}). By default \andi uses all available processors. However, to restrict the number of threads, use \lstinline$-t$.
\andi takes a small number of commandline options, of which even fewer are of interest on a day-to-day basis. If \lstinline$andi -h$ displays a \lstinline$-t$ option, then \andi was compiled with multi-threading support (implemented using \algo{OpenMP}). By default, \andi uses all available processors. However, to restrict the number of threads, use \lstinline$-t$.
\begin{lstlisting}
~ % time andi ../test/1M.1.fasta -t 1
...
...
@@ -298,13 +299,13 @@ S1 0.0000 0.1071
S2 0.1071 0.0000
\end{lstlisting}
The original \algo{phylip} only supports distance matrices with names no longer than ten characters. However, this sometimes leads to problems with long accession numbers. Starting with version 0.11 \andi print the full name of a sequence, even if it is longer than ten characters. If your downstream tools have trouble with this, use \lstinline$--truncate-names$ to reimpose the limit.
The original \algo{phylip} only supports distance matrices with names no longer than ten characters. However, this sometimes leads to problems with long accession numbers. Starting with version 0.11 \andi prints the full name of a sequence, even if it is longer than ten characters. If your downstream tools have trouble with this, use \lstinline$--truncate-names$ to reimpose the limit.
Also new in version 0.11 is the \lstinline$--file-of-filenames$ option. See Section~\ref{sec:join} for details.
\section{Example: \algo{eco29}}
Here follows a real-world example of how to use \algo{andi}. It makes heavy use of the commandline and tools like \algo{Phylip}. If you prefer \algo{R}, check out this excellent \href{http://holtlab.net/2015/05/08/r-code-to-infer-tree-from-andi-output/}{blog post} by Kathryn Holt.
Here follows a real-world example of how to use \algo{andi}. It makes heavy use of the commandline and tools like \algo{Phylip}. If you prefer \algo{R}, check out this excellent blog post by Kathryn Holt.\footnote{\url{http://holtlab.net/2015/05/08/r-code-to-infer-tree-from-andi-output/}}
As a data set we use \algo{eco29}; 29 genomes of \textit{E. coli} and \textit{Shigella}. You can download the data from here: {\small{\url{http://guanine.evolbio.mpg.de/andi/eco29.fasta.gz}}}. The genomes have an average length of 4.9~million nucleotides, amounting to a total of \SI{138}{\mega\byte}.
...
...
@@ -414,11 +415,11 @@ Some command line parameters of \andi require arguments. If these are not of the
\section{Output-related Warnings}
As the input sequences become more evolutionarily divergent, \andi finds fewer anchors. With fewer anchors, fewer nucleotides are considered homologous between two sequences. If no anchors are found, the comparison fails and \lstinline!nan! is printed instead. See our paper and especially Figure~2 for details.
As the input sequences become more evolutionarily divergent, \andi finds fewer homologous anchors. With fewer anchors, fewer nucleotides are considered homologous between two sequences. If no anchors are found, the comparison fails and \lstinline!nan! is printed instead. See our paper and especially Figure~2 for details.
\subsection*{NaN}
No anchors were found. Your sequences are very divergent ($d>0.5$) or contain many indels that make comparison difficult.
No homologous sections were found. Your sequences are very divergent ($d>0.5$) or contain many indels that make comparison difficult.
\subsection*{Little Homology}
...
...
@@ -469,7 +470,7 @@ The unit tests are located in the \andi repository under the \lstinline$./test$
~/andi % make check
\end{lstlisting}
\noindent The unit tests are also checked each time a commit is sent to the repository. This is done via \algo{TravisCI}.\footnote{\url{https://travis-ci.org/EvolBioInf/andi}} Thus, a warning is produced when the builds fail or the unit tests do not run successfully. Currently, the unit tests cover more than 75\% of the code. This is computed via the \algo{Travis} builds and a service called \algo{Coveralls}.\footnote{\url{https://coveralls.io/r/EvolBioInf/andi}} Unfortunately, coveralls is broken at this point in time.
\noindent The unit tests are also checked each time a commit is sent to the repository. This is done via \algo{TravisCI}.\footnote{\url{https://travis-ci.org/EvolBioInf/andi}} Thus, a warning is produced when the builds fail or the unit tests do not run successfully. Currently, the unit tests cover more than 75\% of the code. This is computed via the \algo{Travis} builds and a service called \algo{Coveralls}.\footnote{\url{https://coveralls.io/r/EvolBioInf/andi}}
* @brief This file is a preprocessor hack for the two functions `distMatrix`
* and `distMatrixLM`.
*/
// clang-format off
#ifdef FAST
#define NAME distMatrix
#define P_OUTER _Pragma("omp parallel for num_threads( THREADS)")
#define P_OUTER _Pragma("omp parallel for num_threads( THREADS) default(none) shared(progress_counter) firstprivate( stderr, M, sequences, n, print_progress)")
#define P_INNER
#else
#undef NAME
...
...
@@ -12,8 +13,9 @@
#undef P_INNER
#define NAME distMatrixLM
#define P_OUTER
#define P_INNER _Pragma("omp parallel for num_threads( THREADS)")
#define P_INNER _Pragma("omp parallel for num_threads( THREADS) default(none) shared(progress_counter) firstprivate( stderr, M, sequences, n, print_progress, i, E, subject)")
#endif
// clang-format on
/** @brief This function calls dist_andi for pairs of subjects and queries, and