Skip to content
Commits on Source (2)
repo: 092c2fe2278cb7f0b18d81faeb4aab98b89dc096
node: cbd7880df400b453b8beb4e62b39e4a23b5523b6
node: 9760413b180fc2c68b817c23602541d3a97528af
branch: default
tag: 2.7.6
tag: 2.7.8
syntax: glob
databases/
metaphlan_databases/
tests/
*.pyc
build/
dist/
......
c168a100f37e23e2c110849a8d91fac8da49f5bd utils/export2graphlan
35dfd725e7f024fc6d0edef0cc191c7963108787 utils/hclust2
69efddea43ae5e37d761cd138fcf083090371d1a utils/hclust2
......@@ -14,3 +14,8 @@ a1fe0d15320c04f69d56f1b7dd31cff972a7b8df 2.7.2
b2f9b3286d4be376805e3b5c26cf141ed375c605 2.7.5
847b250adbe97b9f4adc7e15f0d4bb5a66e782ec 2.7.4
178d1aaf4ac76e5d5477833e8e614104dcd32088 2.7.3
cbd7880df400b453b8beb4e62b39e4a23b5523b6 2.7.6
1365be3b4e5a68ed872911e949a25d0944f822b2 2.7.61
4d50b9ccd95234a436541db13bcb10741f99138b 2.7.62
a71e7d6b9d50b89c4b80714af145e10d050be7cf 2.7.63
3e9c612e0a922b73214c90db51c5e26be17bf8b6 2.7.7
......@@ -2,11 +2,26 @@
# MetaPhlAn 2: Metagenomic Phylogenetic Analysis
## Installation
MetaPhlAn 2.0 can be obtained
Thorugh **Bioconda**
``$ conda install metaphlan2``
by **direct download** from [Bitbucket](https://bitbucket.org/biobakery/metaphlan2/get/default.zip)
or **cloning the repository** using the following command
``$ hg clone https://bitbucket.org/biobakery/metaphlan2``
----------------------
## Description
MetaPhlAn is a computational tool for profiling the composition of microbial communities (Bacteria, Archaea, Eukaryotes and Viruses) from metagenomic shotgun sequencing data (i.e. not 16S) with species-level. With the newly added StrainPhlAn module, it is now possible to perform accurate strain-level microbial profiling.
MetaPhlAn 2 relies on ~1M unique clade-specific marker genes ([the marker information file mpa_v20_m200_marker_info.txt.bz2 can be found in the Download page here](https://bitbucket.org/biobakery/metaphlan2/downloads/mpa_v20_m200_marker_info.txt.bz2)) identified from ~17,000 reference genomes (~13,500 bacterial and archaeal, ~3,500 viral, and ~110 eukaryotic), allowing:
MetaPhlAn 2 relies on ~1M unique clade-specific marker genes ([the marker information file `mpa_v20_m200_marker_info.txt.bz2` can be found in the Download page here](https://bitbucket.org/biobakery/metaphlan2/downloads/mpa_v20_m200_marker_info.txt.bz2)) identified from ~17,000 reference genomes (~13,500 bacterial and archaeal, ~3,500 viral, and ~110 eukaryotic), allowing:
* unambiguous taxonomic assignments;
* accurate estimation of organismal relative abundance;
......@@ -29,7 +44,7 @@ If you use StrainPhlAn, please cite the MetaPhlAn2 paper and the following Strai
-------------
## MetaPhlAn and StrainPhlAn resources
## MetaPhlAn and StrainPhlAn tutorials and resources
In addition to the information on this page, you can refer to the following additional resources.
......@@ -43,15 +58,18 @@ In addition to the information on this page, you can refer to the following addi
* The related [bioBakery workflows](https://bitbucket.org/biobakery/biobakery/wiki/biobakery_workflows)
-------------
## Pre-requisites
MetaPhlAn requires *python 2.7* or higher with argparse, tempfile and [*numpy*](http://www.numpy.org/) libraries installed
(apart for numpy they are usually installed together with the python distribution).
MetaPhlAn requires *python 2.7* or higher with argparse, tempfile, [numpy](http://www.numpy.org/), and [Biopython](https://biopython.org/) libraries installed
(apart for numpy and Biopython, the others are usually installed together with the python distribution).
Python3 is also now supported.
MetaPhlAn requires the `read_fastx.py` script to be present in the system path, if not found MetaPhlAn will try to locate it in the folder containing the `metaphlan2.py`
script under `utils/read_fastx.py`.
In case you moved the `metaphlan2.py` script, please export the `read_fastx.py` script in your PATH bash variable.
**If you provide the SAM output of [BowTie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) as input, there are no additional prerequisite.**
* If you would like to use the BowTie2 integrated in MetaPhlAn, you need to have BowTie2 version 2.0.0 or higher and perl installed (bowtie2 needs to be in the system path with execute _and_ read permission)
......@@ -62,22 +80,8 @@ MetaPhlAn requires *python 2.7* or higher with argparse, tempfile and [*numpy*](
* MetaPhlAn is not tightly integrated with advanced heatmap plotting with [hclust2](https://bitbucket.org/nsegata/hclust2) and cladogram visualization with [GraPhlAn](https://bitbucket.org/nsegata/graphlan/wiki/Home). If you use such visualization tool please refer to their prerequisites.
----------------------
## Installation
MetaPhlAn 2.0 can be obtained by either
Dowloading MetaPhlAn v2.0 [here](https://bitbucket.org/biobakery/metaphlan2/get/default.zip) or [here](https://www.dropbox.com/s/ztqr8qgbo727zpn/metaphlan2.zip?dl=0)
**OR**
Cloning the repository via the following commands
``$ hg clone https://bitbucket.org/biobakery/metaphlan2``
--------------------------
## Basic Usage
This section presents some basic usages of MetaPhlAn2, for more advanced usages, please see at [its wiki](https://bitbucket.org/biobakery/biobakery/wiki/metaphlan2).
......@@ -97,55 +101,28 @@ Here is the basic example to profile a metagenome from raw reads (requires BowTi
$ metaphlan2.py metagenome.fastq --input_type fastq > profiled_metagenome.txt
```
It is highly recommended to save the intermediate BowTie2 output for re-running MetaPhlAn extremely quickly (--bowtie2out), and use multiple CPUs (--nproc) if available:
It is highly recommended to save the intermediate BowTie2 output for re-running MetaPhlAn extremely quickly (`--bowtie2out`), and use multiple CPUs (`--nproc`) if available:
```
#!bash
$ metaphlan2.py metagenome.fastq --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --input_type fastq > profiled_metagenome.txt
```
If you already mapped your metagenome against the marker DB (using a previous MetaPhlAn run), you can obtain the results in few seconds by using the previously saved --bowtie2out file and specifying the input (--input_type bowtie2out):
If you already mapped your metagenome against the marker DB (using a previous MetaPhlAn run), you can obtain the results in few seconds by using the previously saved `--bowtie2out` file and specifying the input (`--input_type bowtie2out`):
```
#!bash
$ metaphlan2.py metagenome.bowtie2.bz2 --nproc 5 --input_type bowtie2out > profiled_metagenome.txt
```
You can also provide an externally BowTie2-mapped SAM if you specify this format with --input_type. Two steps here: first map your metagenome with BowTie2 and then feed MetaPhlAn2 with the obtained sam:
You can also provide an externally BowTie2-mapped SAM if you specify this format with `--input_type`. Two steps here: first map your metagenome with BowTie2 and then feed MetaPhlAn2 with the obtained sam:
```
#!bash
$ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x ${mpa_dir}/db_v20/mpa_v20_m200 -U metagenome.fastq
$ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x databases/mpa_v20_m200 -U metagenome.fastq
$ metaphlan2.py metagenome.sam --input_type sam > profiled_metagenome.txt
```
In order to make MetaPhlAn 2 easily compatible with complex metagenomic pipeline, there are now multiple alternative ways to pass the input:
```
#!bash
$ cat metagenome.fastq | metaphlan2.py --input_type fastq > profiled_metagenome.txt
```
```
#!bash
$ tar xjf metagenome.tar.bz2 --to-stdout | metaphlan2.py --input_type fastq --bowtie2db ${mpa_dir}/db_v20/mpa_v20_m200 > profiled_metagenome.txt
```
```
#!bash
$ metaphlan2.py --input_type fastq < metagenome.fastq > profiled_metagenome.txt
```
```
#!bash
$ metaphlan2.py --input_type fastq <(bzcat metagenome.fastq.bz2) > profiled_metagenome.txt
```
```
#!bash
$ metaphlan2.py --input_type fastq <(zcat metagenome_1.fastq.gz metagenome_2.fastq.gz) > profiled_metagenome.txt
```
MetaPhlAn 2 can also natively **handle paired-end metagenomes** (but does not use the paired-end information), and, more generally, metagenomes stored in multiple files (but you need to specify the --bowtie2out parameter):
```
......@@ -208,16 +185,9 @@ $ metaphlan2.py metagenome.bowtie2.bz2 --nproc 5 --input_type bowtie2out
* You can also provide an externally BowTie2-mapped SAM if you specify this format with
--input_type. Two steps: first apply BowTie2 and then feed MetaPhlAn2 with the obtained sam:
$ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x ${mpa_dir}/db_v20/mpa_v20_m200 -U metagenome.fastq
$ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x databases/mpa_v20_m200 -U metagenome.fastq
$ metaphlan2.py metagenome.sam --input_type sam > profiled_metagenome.txt
* Multiple alternative ways to pass the input are also available:
$ cat metagenome.fastq | metaphlan2.py --input_type fastq
$ tar xjf metagenome.tar.bz2 --to-stdout | metaphlan2.py --input_type fastq
$ metaphlan2.py --input_type fastq < metagenome.fastq
$ metaphlan2.py --input_type fastq <(bzcat metagenome.fastq.bz2)
$ metaphlan2.py --input_type fastq <(zcat metagenome_1.fastq.gz metagenome_2.fastq.gz)
* We can also natively handle paired-end metagenomes, and, more generally, metagenomes stored in
multiple files (but you need to specify the --bowtie2out parameter):
$ metaphlan2.py metagenome_1.fastq,metagenome_2.fastq --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --input_type fastq
......@@ -254,7 +224,7 @@ $ metaphlan2.py -t clade_profiles metagenome_outfmt.bz2 --input_type bowtie2out
* Finally, to obtain all markers present for a specific clade and all its subclades, the
'-t clade_specific_strain_tracker' should be used. For example, the following command
is reporting the presence/absence of the markers for the B. fragulis species and its strains
$ metaphlan2.py -t clade_specific_strain_tracker --clade s__Bacteroides_fragilis metagenome_outfmt.bz2 db_v20/mpa_v20_m200.pkl --input_type bowtie2out > marker_abundance_table.txt
$ metaphlan2.py -t clade_specific_strain_tracker --clade s__Bacteroides_fragilis metagenome_outfmt.bz2 databases/mpa_v20_m200.pkl --input_type bowtie2out > marker_abundance_table.txt
the optional argument --min_ab specifies the minimum clade abundance for reporting the markers
-------------------------------------------------------------------
......@@ -501,7 +471,7 @@ In order to add a marker to the database, the user needs the following steps:
```
#!bash
bowtie2-inspect metaphlan2/db_v20/mpa_v20_m200 > metaphlan2/markers.fasta
bowtie2-inspect metaphlan2/databases/mpa_v20_m200 > metaphlan2/markers.fasta
```
* Add the marker sequence stored in a file new_marker.fasta to the marker set:
......@@ -527,7 +497,7 @@ bowtie2-build metaphlan2/markers.fasta metaphlan2/db_v21/mpa_v21_m200
import cPickle as pickle
import bz2
db = pickle.load(bz2.BZ2File('db_v20/mpa_v20_m200.pkl', 'r'))
db = pickle.load(bz2.BZ2File('databases/mpa_v20_m200.pkl', 'r'))
# Add the taxonomy of the new genomes
db['taxonomy']['taxonomy of genome1'] = length of genome1
......@@ -577,12 +547,12 @@ In addition, the table below shows the number of snps between the sample strains
In the next sections, we will illustrate step by step how to run MetaPhlAn\_Strainer on this toy example to reproduce the above figures.
### Pre-requisites
StrainPhlAn requires *python 2.7* and the libraries [pysam](http://pysam.readthedocs.org/en/latest/) (tested on **version 0.8.3**), [biopython](http://biopython.org/wiki/Main_Page), [msgpack](https://pypi.python.org/pypi/msgpack-python) and [numpy](http://www.numpy.org/), [dendropy](https://pythonhosted.org/DendroPy/) (tested on version **3.12.0**). Besides, StrainPhlAn also needs the following programs in the executable path:
StrainPhlAn requires *python 2.7* and the libraries [pysam](http://pysam.readthedocs.org/en/latest/) (tested on **version 0.8.3**), [biopython](http://biopython.org/wiki/Main_Page), [msgpack](https://pypi.python.org/pypi/msgpack-python), [pandas](https://pandas.pydata.org) (tested on **version 0.22**), [numpy](http://www.numpy.org/) (tested on **version 1.14.2**) and [scipy](https://www.scipy.org) (tested on **version 1.0.0**), [dendropy](https://pythonhosted.org/DendroPy/) (tested on version **3.12.0**). Besides, StrainPhlAn also needs the following programs in the executable path:
* [bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) for mapping reads against the marker database.
* [MUSCLE](http://www.drive5.com/muscle/) for the alignment step.
* [samtools, bcftools and vcfutils.pl](http://samtools.sourceforge.net/) which can be downloaded from [here](https://github.com/samtools) for building consensus markers. Note that vcfutils.pl is included in bcftools and **StrainPhlAn only works with samtools version 0.1.19** as samtools has changed the output format after this version.
* [blastn](ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/) for adding reference genomes to the phylogenetic tree.
* [blast+](ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/) for adding reference genomes to the phylogenetic tree (blastn and makeblastdb commands)
* [raxmlHPC and raxmlHPC-PTHREADS-SSE3](http://sco.h-its.org/exelixis/web/software/raxml/index.html) for building the phylogenetic trees.
All dependence binaries on Linux 64 bit can be downloaded in the folder "bin" from [this link](https://www.dropbox.com/sh/m4na8wefp53j8ej/AABA3yVsG26TbB0t1cnBS9-Ra?dl=0).
......@@ -621,7 +591,7 @@ for f in $(ls fastqs/*.bz2)
do
echo "Running metaphlan2 on ${f}"
bn=$(basename ${f} | cut -d . -f 1)
tar xjfO ${f} | ../metaphlan2.py --bowtie2db ../db_v20/mpa_v20_m200 --mpa_pkl ../db_v20/mpa_v20_m200.pkl --input_type multifastq --nproc 10 -s sams/${bn}.sam.bz2 --bowtie2out sams/${bn}.bowtie2_out.bz2 -o sams/${bn}.profile
tar xjfO ${f} | ../metaphlan2.py --bowtie2db ../databases/mpa_v20_m200 --mpa_pkl ../databases/mpa_v20_m200.pkl --input_type multifastq --nproc 10 -s sams/${bn}.sam.bz2 --bowtie2out sams/${bn}.bowtie2_out.bz2 -o sams/${bn}.profile
done
```
......@@ -654,8 +624,8 @@ The commands to run are:
```
#!python
mkdir -p db_markers
bowtie2-inspect ../db_v20/mpa_v20_m200 > db_markers/all_markers.fasta
python ../strainphlan_src/extract_markers.py --mpa_pkl ../db_v20/mpa_v20_m200.pkl --ifn_markers db_markers/all_markers.fasta --clade s__Bacteroides_caccae --ofn_markers db_markers/s__Bacteroides_caccae.markers.fasta
bowtie2-inspect ../databases/mpa_v20_m200 > db_markers/all_markers.fasta
python ../strainphlan_src/extract_markers.py --mpa_pkl ../databases/mpa_v20_m200.pkl --ifn_markers db_markers/all_markers.fasta --clade s__Bacteroides_caccae --ofn_markers db_markers/s__Bacteroides_caccae.markers.fasta
```
Note that the "all\_markers.fasta" file consists can be reused for extracting other reference genomes.
......@@ -666,7 +636,7 @@ Before building the trees, we should get the list of all clades detected from th
```
#!python
python ../strainphlan.py --mpa_pkl ../db_v20/mpa_v20_m200.pkl --ifn_samples consensus_markers/*.markers --output_dir output --nprocs_main 10 --print_clades_only > output/clades.txt
python ../strainphlan.py --mpa_pkl ../databases/mpa_v20_m200.pkl --ifn_samples consensus_markers/*.markers --output_dir output --nprocs_main 10 --print_clades_only > output/clades.txt
```
The clade names in the output file "clades.txt" will be used for the next step.
......@@ -681,7 +651,7 @@ The commands to run are:
```
#!python
mkdir -p output
python ../strainphlan.py --mpa_pkl ../db_v20/mpa_v20_m200.pkl --ifn_samples consensus_markers/*.markers --ifn_markers db_markers/s__Bacteroides_caccae.markers.fasta --ifn_ref_genomes reference_genomes/G000273725.fna.bz2 --output_dir output --nprocs_main 10 --clades s__Bacteroides_caccae &> output/log_full.txt
python ../strainphlan.py --mpa_pkl ../databases/mpa_v20_m200.pkl --ifn_samples consensus_markers/*.markers --ifn_markers db_markers/s__Bacteroides_caccae.markers.fasta --ifn_ref_genomes reference_genomes/G000273725.fna.bz2 --output_dir output --nprocs_main 10 --clades s__Bacteroides_caccae &> output/log_full.txt
```
This step will take around 2 minutes. After this step, you will find the tree "output/RAxML\_bestTree.s\_\_Bacteroides\_caccae.tree". All the output files can be found in the folder "output" in [this link](https://www.dropbox.com/sh/m4na8wefp53j8ej/AABA3yVsG26TbB0t1cnBS9-Ra?dl=0).
......
......@@ -16,8 +16,8 @@ from __future__ import with_statement
__author__ = ('Nicola Segata (nicola.segata@unitn.it), '
'Duy Tin Truong, '
'Francesco Asnicar (f.asnicar@unitn.it)')
__version__ = '2.7.6'
__date__ = '2 March 2018'
__version__ = '2.7.7'
__date__ = '31 May 2018'
import sys
......@@ -62,7 +62,7 @@ DATABASE_DOWNLOAD = "https://bitbucket.org/biobakery/metaphlan2/downloads/"
# get the directory that contains this script
metaphlan2_script_install_folder = os.path.dirname(os.path.abspath(__file__))
# get the default database folder
DEFAULT_DB_FOLDER = os.path.join(metaphlan2_script_install_folder, "databases")
DEFAULT_DB_FOLDER = os.path.join(metaphlan2_script_install_folder, "metaphlan_databases")
#**********************************************************************************************
......@@ -420,12 +420,12 @@ def read_params(args):
"$ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x ${mpa_dir}/db_v20/mpa_v20_m200 -U metagenome.fastq\n"
"$ metaphlan2.py metagenome.sam --input_type sam > profiled_metagenome.txt\n\n"
"* Multiple alternative ways to pass the input are also available:\n"
"$ cat metagenome.fastq | metaphlan2.py --input_type fastq \n"
"$ tar xjf metagenome.tar.bz2 --to-stdout | metaphlan2.py --input_type fastq \n"
"$ metaphlan2.py --input_type fastq < metagenome.fastq\n"
"$ metaphlan2.py --input_type fastq <(bzcat metagenome.fastq.bz2)\n"
"$ metaphlan2.py --input_type fastq <(zcat metagenome_1.fastq.gz metagenome_2.fastq.gz)\n\n"
# "* Multiple alternative ways to pass the input are also available:\n"
# "$ cat metagenome.fastq | metaphlan2.py --input_type fastq \n"
# "$ tar xjf metagenome.tar.bz2 --to-stdout | metaphlan2.py --input_type fastq \n"
# "$ metaphlan2.py --input_type fastq < metagenome.fastq\n"
# "$ metaphlan2.py --input_type fastq <(bzcat metagenome.fastq.bz2)\n"
# "$ metaphlan2.py --input_type fastq <(zcat metagenome_1.fastq.gz metagenome_2.fastq.gz)\n\n"
"* We can also natively handle paired-end metagenomes, and, more generally, metagenomes stored in \n"
" multiple files (but you need to specify the --bowtie2out parameter):\n"
......@@ -503,10 +503,11 @@ def read_params(args):
help=("The BowTie2 database file of the MetaPhlAn database. Used if "
"--input_type is fastq, fasta, multifasta, or multifastq [default "+DEFAULT_DB_FOLDER+"]\n"))
INDEX = 'v20_m200'
arg('-x', '--index', type=str, default='v20_m200',
help=("Specify the id of the database version to use. If the database\n"
"files are not found on the local MetaPhlAn2 installation they\n"
"will be automatically downloaded\n"))
"will be automatically downloaded [default "+INDEX+"]\n"))
bt2ps = ['sensitive', 'very-sensitive', 'sensitive-local', 'very-sensitive-local']
arg('--bt2_ps', metavar="BowTie2 presets", default='very-sensitive',
......@@ -647,9 +648,9 @@ def read_params(args):
help="Only checks if the MetaPhlAn2 DB is installed and installs it if not. All other parameters are ignored.")
arg('--read_min_len', type=int, default=70,
help="Specify the minimum length of the reads to be considered when parsing the input file with "
"'read_fastx.py' script, default value is 60")
"'read_fastx.py' script, default value is 70")
arg('-v', '--version', action='version',
version="MetaPhlAn version {}\t({})".format(__version__, __date__),
version="MetaPhlAn version {} ({})".format(__version__, __date__),
help="Prints the current MetaPhlAn version and exit")
arg("-h", "--help", action="help", help="show this help message and exit")
......@@ -748,7 +749,7 @@ def download_unpack_tar(url, download_file_name, folder, bowtie2_build, nproc):
for row in f:
md5_md5 = row.strip().split(' ')[0]
else:
sys.stderr.write('File "{}" not found!'.format(md5_file))
sys.stderr.write('File "{}" not found!\n'.format(md5_file))
# compute MD5 of .tar.bz2
if os.path.isfile(tar_file):
......@@ -760,7 +761,7 @@ def download_unpack_tar(url, download_file_name, folder, bowtie2_build, nproc):
md5_tar = hash_md5.hexdigest()[:32]
else:
sys.stderr.write('File "{}" not found!'.format(tar_file))
sys.stderr.write('File "{}" not found!\n'.format(tar_file))
if (md5_tar is None) or (md5_md5 is None):
sys.exit("MD5 checksums not found, something went wrong!")
......@@ -1264,7 +1265,7 @@ def map2bbh(mapping_f, input_type='bowtie2out', min_alignment_len=None):
return markers2reads
def maybe_generate_biom_file(pars, abundance_predictions):
def maybe_generate_biom_file(tree, pars, abundance_predictions):
json_key = "MetaPhlAn2"
if not pars['biom']:
......@@ -1294,8 +1295,7 @@ def maybe_generate_biom_file(pars, abundance_predictions):
return {'taxonomy': clade_name.split(delimiter)}
clades = iter((abundance, findclade(name))
for (name, abundance) in abundance_predictions
if istip(name) )
for (name, abundance) in abundance_predictions if istip(name))
packed = iter(([abundance], clade.get_full_name(), clade.id)
for (abundance, clade) in clades)
......@@ -1493,7 +1493,7 @@ def metaphlan2():
outf.write( "\t".join( [k,str(v)] ) + "\n" )
else:
outf.write( "unclassified\t100.0\n" )
maybe_generate_biom_file(pars, outpred)
maybe_generate_biom_file(tree, pars, outpred)
elif pars['t'] == 'rel_ab_w_read_stats':
cl2ab, rr = tree.relative_abundances(
pars['tax_lev']+"__" if pars['tax_lev'] != 'a' else None )
......@@ -1520,7 +1520,7 @@ def metaphlan2():
outf.write( "#estimated total number of reads from known clades: " + str(totl)+"\n")
else:
outf.write( "unclassified\t100.0\n" )
maybe_generate_biom_file(pars, outpred)
maybe_generate_biom_file(tree, pars, outpred)
elif pars['t'] == 'clade_profiles':
cl2pr = tree.clade_profiles( pars['tax_lev']+"__" if pars['tax_lev'] != 'a' else None )
......
......@@ -2,9 +2,11 @@
# Author: Duy Tin Truong (duytin.truong@unitn.it)
# at CIBIO, University of Trento, Italy
__author__ = 'Duy Tin Truong (duytin.truong@unitn.it)'
__version__ = '1.0.0'
__date__ = '2nd August 2016'
__author__ = ('Duy Tin Truong (duytin.truong@unitn.it), '
'Francesco Asnicar (f.asnicar@unitn.it), '
'Moreno Zolfo (moreno.zolfo@unitn.it)')
__version__ = '1.2.0'
__date__ = '31 May 2018'
import sys
import os
......@@ -17,7 +19,10 @@ sys.path.append(os.path.join(MAIN_DIR, 'strainphlan_src'))
import which
import argparse as ap
try:
import cPickle as pickle
except ImportError:
import pickle
import msgpack
import glob
from mixed_utils import statistics
......@@ -32,9 +37,9 @@ from Bio.Alphabet import IUPAC
import pandas
import logging
import logging.config
import sample2markers
import copy
import threading
# import sample2markers
# import copy
# import threading
import numpy
import random
import gc
......@@ -86,12 +91,16 @@ def read_params():
help='The representative sample. The marker list of each species '\
'extracted from this sample will be used for all other samples.')
p.add_argument('-x', '--index', required=False, default="v20_m200", type=str,
help=("Specify the id of the database version to use. If the database files are not found "
"on the local MetaPhlAn2 installation they will be automatically downloaded"))
p.add_argument(
'--mpa_pkl',
required=False,
default=os.path.join(metaphlan2_script_install_folder,"db_v20","mpa_v20_m200.pkl"),
default=os.path.join(metaphlan2_script_install_folder, "metaphlan_databases"),
type=str,
help='The database of metaphlan3.py.')
help='The database of metaphlan2.py')
p.add_argument(
'--output_dir',
......@@ -1166,7 +1175,16 @@ def load_sample(args):
marker2seq = {}
# reformat 'pileup'
marker2seq2exclude = []
for m in marker2seq:
# FIX: Check if freq vector is longer than the reconstructed sequence. In this case, the marker is removed
# TODO: uniform the two vectors in samples2markers.py
if len(marker2seq[m]['freq']) > len(marker2seq[m]['seq']):
logger.info('Skipping marker'+m+' as sequence is reconstructed incorrectly')
marker2seq2exclude.append(m)
else:
freq = marker2seq[m]['freq']
marker2seq[m]['freq'] = [(0.0, 0.0, 0.0) for i in \
range(len(marker2seq[m]['seq']))]
......@@ -1174,7 +1192,12 @@ def load_sample(args):
marker2seq[m]['freq'][p] = freq[p]
marker2seq[m]['seq'] = marker2seq[m]['seq'].replace('-', 'N') # make sure we have no gaps in the sequence
# returning only the markers for which sequence was reconstructed correctly
for marker2exclude in marker2seq2exclude:
del marker2seq[marker2exclude]
return marker2seq
else:
# remove redundant clades and markers
clade2n_markers = defaultdict(int)
......@@ -1533,6 +1556,12 @@ def check_dependencies(args):
def strainphlan():
args = read_params()
# fix db .pkl file
if '--mpa_pkl' not in sys.argv:
if os.path.isfile(os.path.join(args['mpa_pkl'], "mpa_" + args['index'] + ".pkl")):
args['mpa_pkl'] = os.path.join(args['mpa_pkl'], "mpa_" + args['index'] + ".pkl")
check_dependencies(args)
strainer(args)
......
......@@ -3,14 +3,14 @@
# at CIBIO, University of Trento, Italy
import sys
import os
# import sys
# import os
import argparse as ap
import pandas
import copy
import ConfigParser
# import copy
# import ConfigParser
import dendropy
import numpy
# import numpy
# import ipdb
......
......@@ -6,7 +6,7 @@ __author__ = 'Duy Tin Truong (duytin.truong@unitn.it)'
__version__ = '0.1'
__date__ = '17 Sep 2015'
import sys
# import sys
import os
import argparse
import numpy
......
......@@ -15,8 +15,8 @@ sys.path.append(MAIN_DIR)
from mixed_utils import dist2file, statistics
import argparse as ap
from Bio import SeqIO, Seq, SeqRecord
from collections import defaultdict
from Bio import SeqIO#, Seq, SeqRecord
# from collections import defaultdict
import numpy
from ooSubprocess import ooSubprocess
......
......@@ -13,9 +13,9 @@ MAIN_DIR = os.path.dirname(ABS_PATH)
os.environ['PATH'] += ':%s'%MAIN_DIR
sys.path.append(MAIN_DIR)
import argparse as ap
from Bio import SeqIO, Seq, SeqRecord
from collections import defaultdict
import numpy
from Bio import SeqIO#, Seq, SeqRecord
# from collections import defaultdict
# import numpy
from compute_distance import compute_dist_matrix
from ooSubprocess import parallelize
......
......@@ -6,8 +6,8 @@ __author__ = 'Duy Tin Truong (duytin.truong@unitn.it)'
__version__ = '0.1'
__date__ = '1 Sep 2014'
import sys
import os
# import sys
# import os
import argparse as ap
import pickle
import bz2
......
......@@ -8,12 +8,12 @@ import os
import multiprocessing
from multiprocessing.pool import ThreadPool
import sys
import cStringIO
# import cStringIO
from tempfile import NamedTemporaryFile
import which
import functools
import traceback
import numpy
# import numpy
class ooSubprocessException(Exception):
......
......@@ -6,8 +6,8 @@ __author__ = 'Duy Tin Truong (duytin.truong@unitn.it)'
__version__ = '0.1'
__date__ = '7 Sep 2016'
import sys
import os
# import sys
# import os
import argparse
from ete2 import Tree, TreeStyle, NodeStyle
......
......@@ -6,14 +6,14 @@ __author__ = 'Duy Tin Truong (duytin.truong@unitn.it)'
__version__ = '0.1'
__date__ = '4 May 2015'
import sys
import os
# import sys
# import os
import argparse as ap
import dendropy
from StringIO import StringIO
import re
from collections import defaultdict
import ConfigParser
# import ConfigParser
import matplotlib.colors as colors
import subprocess
......
......@@ -7,7 +7,7 @@ __version__ = '0.1'
__date__ = '18 Jul 2015'
import sys
import os
# import os
import argparse
......
......@@ -16,23 +16,23 @@ import argparse as ap
import glob
import ooSubprocess
from ooSubprocess import print_stderr, trace_unhandled_exceptions
import ConfigParser
# import ConfigParser
from Bio import SeqIO, Seq, SeqRecord
import cStringIO
# import cStringIO
import msgpack
import random
import subprocess
import bz2
import gzip
# import random
# import subprocess
# import bz2
# import gzip
import logging
import logging.config
import tarfile
import threading
# import tarfile
# import threading
import multiprocessing
import pysam
from collections import defaultdict
from scipy import stats
import numpy
# import numpy
logging.basicConfig(level=logging.DEBUG, stream=sys.stderr,
disable_existing_loggers=False,
......
......@@ -39,6 +39,7 @@ def merge( aaastrIn, astrLabels, iCol, ostm ):
# For each input datum in each input stream...
pos = 0
dCol = None
for f in aaastrIn:
with open(f) as csvfile:
......@@ -51,12 +52,16 @@ def merge( aaastrIn, astrLabels, iCol, ostm ):
# For a line in the file
for astrLine in iIn:
if astrLine[0].startswith('#'):
dCol = astrLine.index('relative_abundance') if 'relative_abundance' in astrLine else None
continue
iLine += 1
if dCol:
strID, astrData = astrLine[iCol], [astrLine[dCol]]
else:
# ID is from first column, data are everything else
strID, astrData = astrLine[iCol], ( astrLine[:iCol] + astrLine[( iCol + 1 ):] )
strID, astrData = astrLine[iCol], (astrLine[:iCol] + astrLine[iCol + 1:])
hashIDs[strID] = iLine
aastrData.append(astrData)
......