language: python
python:
- "3.5"
- "3.6"
before_install:
- sudo apt-get -qq update
- sudo apt-get install -y ncbi-blast+ prodigal
- wget http://github.com/bbuchfink/diamond/releases/download/v0.8.36/diamond-linux64.tar.gz
- tar xzf diamond-linux64.tar.gz
- sudo mv diamond /usr/bin
install:
- pip install -r requirements.txt
- pip install .
script:
- ./test.sh
# base image (stripped down ubuntu for Docker)
FROM phusion/baseimage
# metadata
LABEL base.image="ubuntu"
LABEL version="1"
LABEL software="RGI"
LABEL software.version="4.0.3"
LABEL description="Tool to identify resistance genes using the CARD database"
LABEL website="https://card.mcmaster.ca/"
LABEL documentation="https://github.com/arpcard/rgi/blob/master/README.rst"
LABEL license="https://github.com/arpcard/rgi/blob/master/LICENSE"
LABEL tags="Genomics"
# maintainer
MAINTAINER Finlay Maguire <finlaymaguire@gmail.com>
# install system dependencies
RUN \
apt-get update && \
apt-get install -y git python3 python3-dev python3-pip ncbi-blast+ prodigal wget && \
wget http://github.com/bbuchfink/diamond/releases/download/v0.8.36/diamond-linux64.tar.gz && \
tar xvf diamond-linux64.tar.gz && \
mv diamond /usr/bin
# install and test rgi
RUN git clone https://github.com/arpcard/rgi
WORKDIR rgi/
RUN pip3 install -r requirements.txt && \
pip3 install . && \
bash test.sh
# sort the database path issue
RUN pwd
WORKDIR /usr
RUN cp /rgi/card_data/card.json /usr/local/lib/python3.5/dist-packages/app/_data/card.json
# move to workdir
WORKDIR /data/
# set rgi executable as cmd to allow overriding
CMD ["rgi"]
1. About the CARD, ARO, and RGI
The Comprehensive Antibiotic Resistance Database ("CARD") provides data, models, and
algorithms relating to the molecular basis of antimicrobial resistance. The CARD provides
curated reference sequences and SNPs organized via the Antibiotic Resistance Ontology
("ARO"). These data can be browsed on the website or downloaded in a number of formats.
These data are additionally associated with detection models, in the form of curated
homology cut-offs and SNP maps, for prediction of resistome from molecular sequences.
These models can be downloaded or can be used for analysis of genome sequences using the
Resistance Gene Identifier ("RGI"), either online or as a stand-alone tool.
The CARD was designed and developed by the laboratories of Drs. Gerry Wright and Andrew G.
McArthur of McMaster University's Department of Biochemistry & Biomedical Sciences
(Hamilton, Ontario, Canada) with the help of a global team of collaborators. It is built
entirely using open source software and tools. This research has been supported by funds
from the Canadian Foundation for Innovation, Canadian Institutes of Health Research,
Natural Sciences and Engineering Research Council of Canada, Medical Research Council
(UK), and Ontario Research Fund, as well as a Cisco Research Chair in Bioinformatics
supported by Cisco Systems Canada, Inc. (Dr. McArthur), Canada Research Chair (Dr.
Wright), and Killam Research Fellowship (Dr. Wright).
2. Trade Marks and Official Marks
All trade marks used or displayed on the CARD (card.mcmaster.ca) website are registered or
unregistered trade marks or official marks of McMaster University. Nothing on the website
should be interpreted to grant any license to use any trade mark or official mark without
the express written permission of McMaster University.
3. Copyright and Disclaimer
All of the materials on the website, including but not limited to the CARD, the RGI, and
the underlying HTML, text, illustration, designs, icons, audio clips, video clips,
documents, products, software and all other content (the "Materials") are protected by
copyright and are either owned by McMaster University, provided by the creator for free
use by the scientific community, or McMaster University has obtained a license to use the
Materials for their website. Except as noted in the Terms of Use below, specifically
sections 4 and 5 below, none of the Materials may be copied, reproduced, distributed,
republished, downloaded, displayed, posted or transmitted in any form or by any means,
without the prior written consent of McMaster University, or the original creator, where
applicable.
The only exception is the distribution of the Resistance Gene Identifier (RGI) software on
the Conda open source, cross-platform, language-agnostic package manager and environment
management system (https://conda.io/). Conda users are still required to abide by the Terms
of Use listed, specifically sections 4 and 5 below.
To obtain permission to reproduce material on the CARD website, please send a request to:
McMaster University
c/o McMaster Industry Liaison Office
305 – 175 Longwood Rd South
Hamilton, Ontario L8P 0A1
Attention: Copyright Advisor
E: milodsk@mcmaster.ca
4. Terms of Use for Non-Commercial, Research or Academic Reproduction
All information and the Materials at CARD have been posted to provide the public with
direct access to information about antibiotic resistance. The information and the
Materials have been posted with the intent that they be readily available for personal and
public non-commercial, research or academic use by individuals at academic institutions
only and, except where otherwise prohibited, may be reproduced, in part or in whole and by
any means, without charge or further permission from McMaster University. Any other
organization must seek a license as described in section 5 below.
We only ask that:
4.1 The Materials not be modified and be used "as is"
4.2 Users exercise due diligence in ensuring the accuracy of the Materials
4.3 McMaster University be identified as the source of the Materials
4.4 The reproduction is not represented as an official version of the Materials
reproduced, nor as having been made in affiliation with or with the endorsement of
McMaster University
5. Terms of Use for Commercial or Non-Academic Organizations
Use or reproduction of Materials on the site, in whole or in part, by any non-academic
organization whether or not for non-commercial (including research) or commercial purposes
is prohibited, except with written permission of McMaster University. Commercial uses of
the Materials on the site are offered only pursuant to a written license and user fee. To
obtain permission and begin the licensing process, please contact McMaster University
using the address in section 3.
6. Disclaimer
The Materials on this site are provided "as is" without warranties or conditions of any
kind either expressed or implied. To the fullest extent possible under applicable law,
McMaster University disclaims all warranties and conditions, expressed or implied,
including but not limited to, implied warranties or conditions of merchantability and
fitness for a particular purpose, non-infringement or other violation of rights. McMaster
University does not warrant or make any other representations regarding the use, accuracy,
timeliness, applicability, performance, security, availability or reliability of this
website or any of the sites linked to this website, or the results from the use of this
website or any sites linked to this website, or otherwise respecting the Materials on this
website or any sites linked to this website.
7. Limitation of Liability
Under no circumstances, including but not limited to negligence, shall McMaster University
be liable for any direct, indirect, special, punitive, incidental or consequential damages
arising out of the use of, or the inability to use, the website or the Materials.
Resistance Gene Identifier (RGI)
--------------------------------------------
This application is used to predict resistome(s) from protein or nucleotide data based on homology and SNP models. The application uses data from the `CARD database <https://card.mcmaster.ca/>`_.
.. image:: https://travis-ci.org/arpcard/rgi.svg?branch=master
:target: https://travis-ci.org/arpcard/rgi
Table of Contents
-------------------------------------
- `License`_
- `Requirements`_
- `Install dependencies`_
- `Install RGI from project root`_
- `Running RGI tests`_
- `Help menu`_
- `Usage`_
- `Load card.json`_
- `Check database version`_
- `Run RGI`_
- `Run RGI using GNU parallel`_
- `Running RGI with short contigs to predict partial genes`_
- `Clean previous or old databases`_
- `RGI Heatmap`_
- `Run RGI from docker`_
- `Tab-delimited results file`_
- `Support & Bug Reports`_
License
--------
Use or reproduction of these materials, in whole or in part, by any non-academic organization whether or not for non-commercial (including research) or commercial purposes is prohibited, except with written permission of McMaster University. Commercial uses are offered only pursuant to a written license and user fee. To obtain permission and begin the licensing process, see `CARD website <https://card.mcmaster.ca/about>`_
Requirements
--------------------
- `Prodigal v2.6.3 <https://github.com/hyattpd/prodigal/wiki/Installation>`_
- `NCBI BLAST v2.6.0+ <https://blast.ncbi.nlm.nih.gov/Blast.cgi>`_
- `DIAMOND v0.8.36 <https://ab.inf.uni-tuebingen.de/software/diamond>`_
- `Python 3.6 <https://www.python.org/>`_
Install dependencies
--------------------
- pip3 install biopython
- pip3 install filetype
- pip3 install pandas
- pip3 install pytest
- pip3 install mock
Install RGI from project root
-----------------------------
.. code-block:: sh
pip3 install .
or
.. code-block:: sh
python3 setup.py build
python3 setup.py test
python3 setup.py install
Running RGI tests
-------------------
.. code-block:: sh
cd tests
pytest -v -rxs
Help menu
-------------------
.. code-block:: sh
rgi --help
Usage
-------------------
.. code-block:: sh
usage: rgi <command> [<args>]
commands are:
main Runs rgi application
tab Creates a tab-delimited file from rgi results
parser Creates categorical JSON files for RGI wheel visualization
load Loads CARD database JSON file
clean Removes BLAST databases and temporary files
galaxy Galaxy project wrapper
database Information on installed card database
heatmap Heatmap for multiple analysis
---------------------------------------------------------------------------------------
bwt Metagenomics resistomes (Experimental)
card_annotation Create fasta files with annotations from card.json (Experimental)
wildcard_annotation Create fasta files with annotations from variants (Experimental)
baits_annotation Create fasta files with annotations from baits (Experimental)
remove_duplicates Removes duplicate sequences (Experimental)
kmer_build Build CARD*kmer database (Experimental)
kmer_query Query sequences through CARD*kmers (Experimental)
Resistance Gene Identifier - <version_number>
positional arguments:
command Subcommand to run
optional arguments:
-h, --help show this help message and exit
Use the Resistance Gene Identifier to predict resistome(s) from protein or
nucleotide data based on homology and SNP models. Check
https://card.mcmaster.ca/download for software and data updates. Receive email
notification of monthly CARD updates via the CARD Mailing List
(https://mailman.mcmaster.ca/mailman/listinfo/card-l)
Load card.json
-------------------
- local or working directory
.. code-block:: sh
rgi load --afile /path/to/card.json --local
- system wide
.. code-block:: sh
rgi load --afile /path/to/card.json
Check database version
-----------------------
- local or working directory
.. code-block:: sh
rgi database --version --local
- system wide
.. code-block:: sh
rgi database --version
Run RGI
----------------------
- local or working directory
.. code-block:: sh
rgi main --input_sequence /path/to/protein_input.fasta --output_file /path/to/output_file --input_type protein --local
- system wide
.. code-block:: sh
rgi main --input_sequence /path/to/nucleotide_input.fasta --output_file /path/to/output_file --input_type contig
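The invocations above can also be built programmatically when scripting many runs. A minimal sketch, assuming only the flags documented in this section; the helper name `rgi_main_cmd` is ours, not part of RGI:

```python
# Hypothetical helper (not part of RGI): assembles the `rgi main`
# argument list from the flags documented above.
def rgi_main_cmd(fasta, out, input_type="contig", local=False):
    cmd = [
        "rgi", "main",
        "--input_sequence", fasta,
        "--output_file", out,
        "--input_type", input_type,
    ]
    if local:
        cmd.append("--local")  # use the card.json loaded with `rgi load --local`
    return cmd

print(" ".join(rgi_main_cmd("protein_input.fasta", "out", "protein", local=True)))
```

The resulting list can be passed to `subprocess.run` without shell quoting concerns.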
Run RGI using GNU parallel
--------------------------------------------
- system wide and writing log files for each input file. (Note: add the code below to `script.sh`, then run with `./script.sh /path/to/input_files`)
.. code-block:: sh
#!/bin/bash
DIR=$(find . -mindepth 1 -type d)
for D in $DIR; do
NAME=$(basename "$D");
parallel --no-notice --progress -j+0 'rgi main -i {} -o {.} -n 16 -a diamond --clean --debug > {.}.log 2>&1' ::: "$NAME"/*.{fa,fasta};
done
Running RGI with short contigs to predict partial genes
--------------------------------------------------------
- local or working directory
.. code-block:: sh
rgi main --input_sequence /path/to/nucleotide_input.fasta --output_file /path/to/output_file --local --low_quality
- system wide
.. code-block:: sh
rgi main --input_sequence /path/to/nucleotide_input.fasta --output_file /path/to/output_file --low_quality
Clean previous or old databases
--------------------------------
- local or working directory
.. code-block:: sh
rgi clean --local
- system wide
.. code-block:: sh
rgi clean
RGI Heatmap
------------
- Default Heatmap
.. code-block:: sh
rgi heatmap --input /path/to/rgi_results_json_files_directory/
- Heatmap with `AMR Gene Family` categorization
.. code-block:: sh
rgi heatmap --input /path/to/rgi_results_json_files_directory/ --category gene_family
- Heatmap with `AMR Gene Family` categorization and fill display
.. code-block:: sh
rgi heatmap --input /path/to/rgi_results_json_files_directory/ --category gene_family --display fill
- Heatmap with `AMR Gene Family` categorization and coloured y-axis labels display
.. code-block:: sh
rgi heatmap --input /path/to/rgi_results_json_files_directory/ --category gene_family --display text
- Heatmap with frequency display enabled
.. code-block:: sh
rgi heatmap --input /path/to/rgi_results_json_files_directory/ --frequency
- Heatmap with drug class category and frequency enabled
.. code-block:: sh
rgi heatmap --input /path/to/rgi_results_json_files_directory/ --category drug_class --frequency --display text
- Heatmap with samples and genes clustered
.. code-block:: sh
rgi heatmap --input /path/to/rgi_results_json_files_directory/ --cluster both
- Heatmap with resistance mechanism categorization and clustered samples
.. code-block:: sh
rgi heatmap --input /path/to/rgi_results_json_files_directory/ --cluster samples --category resistance_mechanism --display fill
Run RGI from docker
-------------------
- First, you must either pull the docker container from Docker Hub (the latest CARD version is installed automatically)
.. code-block:: sh
docker pull finlaymaguire/rgi
- Or, alternatively, build it locally from the Dockerfile (the latest CARD version is installed automatically)
.. code-block:: sh
git clone https://github.com/arpcard/rgi
docker build -t arpcard/rgi rgi
- Then you can either run interactively (mounting a local directory called `rgi_data` in your current directory
to `/data/` within the container)
.. code-block:: sh
docker run -i -v $PWD/rgi_data:/data -t arpcard/rgi bash
- Or you can run the container directly as an executable, with `$RGI_ARGS` being any of the commands described above. Remember that paths to input and output files are relative to the container (i.e. `/data/` if mounted as above).
.. code-block:: sh
docker run -v $PWD/rgi_data:/data arpcard/rgi $RGI_ARGS
Tab-delimited results file
---------------------------
+----------------------------------------------------------+------------------------------------------------+
| :: | |
| ORF_ID | Open Reading Frame identifier (internal to RGI)|
+----------------------------------------------------------+------------------------------------------------+
| :: | |
| Contig | Source Sequence |
+----------------------------------------------------------+------------------------------------------------+
| :: | |
| Start | Start co-ordinate of ORF |
+----------------------------------------------------------+------------------------------------------------+
| :: | |
| Stop | End co-ordinate of ORF |
+----------------------------------------------------------+------------------------------------------------+
| :: | |
| Orientation | Strand of ORF |
+----------------------------------------------------------+------------------------------------------------+
| :: | |
| Cut_Off | RGI Detection Paradigm |
+----------------------------------------------------------+------------------------------------------------+
| :: | |
| Pass_Bitscore | STRICT detection model bitscore value cut-off |
+----------------------------------------------------------+------------------------------------------------+
| :: | |
| Best_Hit_Bitscore | Bitscore value of match to top hit in CARD |
+----------------------------------------------------------+------------------------------------------------+
| :: | |
| Best_Hit_ARO | ARO term of top hit in CARD |
+----------------------------------------------------------+------------------------------------------------+
| :: | |
| Best_Identities | Percent identity of match to top hit in CARD |
+----------------------------------------------------------+------------------------------------------------+
| :: | |
| ARO | ARO accession of top hit in CARD |
+----------------------------------------------------------+------------------------------------------------+
| :: | |
| Model_type | CARD detection model type |
+----------------------------------------------------------+------------------------------------------------+
| :: | |
| | Mutations observed in the ARO term of top hit |
| SNPs_in_Best_Hit_ARO | in CARD (if applicable) |
+----------------------------------------------------------+------------------------------------------------+
| :: | |
| | Mutations observed in ARO terms of other hits |
| Other_SNPs | indicated by model id (if applicable) |
+----------------------------------------------------------+------------------------------------------------+
| :: | |
| Drug Class | ARO Categorization |
+----------------------------------------------------------+------------------------------------------------+
| :: | |
| Resistance Mechanism | ARO Categorization |
+----------------------------------------------------------+------------------------------------------------+
| :: | |
| AMR Gene Family | ARO Categorization |
+----------------------------------------------------------+------------------------------------------------+
| :: | |
| Predicted_DNA | ORF predicted nucleotide sequence |
+----------------------------------------------------------+------------------------------------------------+
| :: | |
| Predicted_Protein | ORF predicted protein sequence |
+----------------------------------------------------------+------------------------------------------------+
| :: | |
| CARD_Protein_Sequence | Protein sequence of top hit in CARD |
+----------------------------------------------------------+------------------------------------------------+
| :: | Calculated as percentage |
| | (length of ORF protein / |
| Percentage Length of Reference Sequence | length of CARD reference protein) |
+----------------------------------------------------------+------------------------------------------------+
| :: | |
| ID | HSP identifier (internal to RGI) |
+----------------------------------------------------------+------------------------------------------------+
| :: | |
| Model_id | CARD detection model id |
+----------------------------------------------------------+------------------------------------------------+
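The tab-delimited report is plain TSV and can be post-processed with ordinary tooling. A minimal sketch using Python's `csv` module; the two data rows below are invented for illustration, only the column names come from the table above:

```python
import csv
import io

# Illustrative two-row sample of the tab-delimited report (not real output);
# column names match the table above.
tsv = (
    "ORF_ID\tBest_Hit_ARO\tCut_Off\tBest_Identities\n"
    "orf_1\tSHV-12\tStrict\t99.3\n"
    "orf_2\tNDM-1\tPerfect\t100.0\n"
)

# Keep only hits detected under the Perfect or Strict paradigm.
strict_or_better = [
    row["Best_Hit_ARO"]
    for row in csv.DictReader(io.StringIO(tsv), delimiter="\t")
    if row["Cut_Off"] in ("Perfect", "Strict")
]
print(strict_or_better)  # ['SHV-12', 'NDM-1']
```

For a real run, replace the string with `open("your_output.txt")`.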
Support & Bug Reports
----------------------
Please log an issue on `GitHub <https://github.com/arpcard/rgi/issues>`_.
You can also email the CARD curators or developers directly at `card@mcmaster.ca <mailto:card@mcmaster.ca>`_, or reach us via Twitter at `@arpcard <http://www.twitter.com/arpcard>`_.
protein*
!__init__.py
|--------------------------------------------------------------------------
| Resistance Gene Identifier (RGI) Documentation
|--------------------------------------------------------------------------
Before you run the RGI scripts, make sure you have installed the needed external tools:
|--------------------------------------------------------------------------
| Install open reading frame callers : Prodigal, http://prodigal.ornl.gov/, https://github.com/hyattpd/prodigal/wiki/Installation
|--------------------------------------------------------------------------
# Install Prodigal - conda should be installed
$ conda install --channel bioconda prodigal
# Mac OS X
$ brew tap homebrew/science
$ brew install homebrew/science/prodigal
# Linux Redhat / Centos
$ sudo yum groupinstall 'Development Tools' && sudo yum install curl git irb python-setuptools ruby
$ git clone https://github.com/Linuxbrew/brew.git ~/.linuxbrew
$ export PATH="$HOME/.linuxbrew/bin:$PATH"
$ export MANPATH="$HOME/.linuxbrew/share/man:$MANPATH"
$ export INFOPATH="$HOME/.linuxbrew/share/info:$INFOPATH"
# Linux Debian / Ubuntu
$ brew tap homebrew/science
$ brew install homebrew/science/prodigal
|--------------------------------------------------------------------------
| Install alignment software : BLAST and DIAMOND
|--------------------------------------------------------------------------
|--------------------------------------------------------------------------
| BLAST ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
|--------------------------------------------------------------------------
- Tested with BLAST 2.2.28, BLAST 2.2.31+ and 2.5.0+ on 64-bit Linux and Mac OS X
* You can also run the following command to install BLAST (note this installs only version 2.2.28):
$ sudo apt-get install ncbi-blast+
- Test the BLAST install with the following command:
$ makeblastdb
* Biopython http://biopython.org/DIST/docs/install/Installation.html#sec12
* Run the following command to install Biopython
$ sudo apt-get install python-biopython
* Download the database - card.json from Downloads on the CARD website (a copy may be included with this release)
|--------------------------------------------------------------------------
| DIAMOND
| - https://ab.inf.uni-tuebingen.de/software/diamond
| - https://github.com/bbuchfink/diamond
| - https://github.com/bbuchfink/diamond/releases
|--------------------------------------------------------------------------
* Install diamond using conda packages
$ conda install --channel bioconda diamond
|--------------------------------------------------------------------------
| Running RGI:
|--------------------------------------------------------------------------
Open a terminal, type:
$ python rgi.py -h
Check software version:
$ python rgi.py -sv or python rgi.py --software_version
Check data version:
$ python rgi.py -dv or python rgi.py --data_version
|--------------------------------------------------------------------------
| RGI inputs
|--------------------------------------------------------------------------
$ python rgi.py -h
usage: rgi.py [-h] [-t INTYPE] [-i INPUTSEQ] [-n THREADS] [-o OUTPUT]
[-e CRITERIA] [-c CLEAN] [-d DATA] [-l VERBOSE] [-a ALIGNER]
[-r DATABASE] [-sv] [-dv]
Resistance Gene Identifier - Version 3.1.2
optional arguments:
-h, --help show this help message and exit
-t INTYPE, --input_type INTYPE
must be one of contig, orf, protein, read (default:
contig)
-i INPUTSEQ, --input_sequence INPUTSEQ
input file must be in either FASTA (contig and
protein), FASTQ (read) or gzip format, e.g.
myFile.fasta, myFasta.fasta.gz
-n THREADS, --num_threads THREADS
Number of threads (CPUs) to use in the BLAST search
(default=32)
-o OUTPUT, --output_file OUTPUT
Output JSON file (default=Report)
-e CRITERIA, --loose_criteria CRITERIA
The options are YES to include loose hits and NO to
exclude loose hits. (default=NO to exclude loose hits)
-c CLEAN, --clean CLEAN
This removes temporary files in the results directory
after run. Options are NO or YES (default=YES for
remove)
-d DATA, --data DATA Specify a data-type, i.e. wgs, chromosome, plasmid,
etc. (default = NA)
-l VERBOSE, --verbose VERBOSE
log progress to file. Options are OFF or ON (default =
OFF for no logging)
-a ALIGNER, --alignment_tool ALIGNER
choose between BLAST and DIAMOND. Options are BLAST or
DIAMOND (default = BLAST)
-r DATABASE, --db DATABASE
specify path to CARD blast databases (default: None)
-sv, --software_version
Prints software version number
-dv, --data_version Prints data version number
INTYPE could be one of 'contig', 'protein' or 'read'.
1. 'contig' means that inputSequence is a DNA sequence stored in a FASTA file, presumably a complete genome or assembly contigs. RGI will predict ORFs de novo and predict the resistome using a combination of BLASTP against the CARD data, curated cut-offs, and SNP screening.
2. 'protein', as its name suggests, requires a FASTA file with protein sequences. As above, RGI predicts the resistome using a combination of BLASTP against the CARD data, curated cut-offs, and SNP screening.
3. 'read' expects raw FASTQ format nucleotide data and predicts resistome using a combination of BLASTX against the CARD data, curated cut-offs, and SNP screening. This is an experimental tool and we have yet to adjust the CARD cut-offs for BLASTX. We will be exploring other metagenomics or FASTQ screening methods. Note that RGI does not perform any pre-processing of the FASTQ data (linker trimming, etc).
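The input modes above differ only in preprocessing and alignment program. The following summary table is our own schematic of that dispatch, not RGI code; the 'orf' entry is an assumption, since that mode is not described in the text above:

```python
# Schematic of the INTYPE dispatch described above (illustrative only):
# (preprocessing step, alignment program) per input type. The 'orf'
# entry is an assumption; the others follow the descriptions above.
PIPELINES = {
    "contig":  ("predict ORFs de novo", "BLASTP vs CARD"),
    "orf":     ("translate ORFs", "BLASTP vs CARD"),
    "protein": (None, "BLASTP vs CARD"),
    "read":    ("none (raw FASTQ)", "BLASTX vs CARD"),
}

def pipeline_for(input_type):
    """Return the (preprocessing, alignment) pair for a given -t value."""
    if input_type not in PIPELINES:
        raise ValueError("input_type must be one of: " + ", ".join(sorted(PIPELINES)))
    return PIPELINES[input_type]

print(pipeline_for("read"))
```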
|--------------------------------------------------------------------------
| RGI outputs
|--------------------------------------------------------------------------
RGI will produce a detailed JSON file, a summary tab-delimited file, and a gff3 file (where applicable).
The JSON is as follows (example shows only one hit):
- gene_71|gi|378406451|gb|JN420336.1| Klebsiella pneumoniae plasmid pNDM-MAR, complete sequence: {
// Hit 1
"gnl|BL_ORD_ID|39|hsp_num:0": {
"SequenceFromBroadStreet": "MRYIRLCIISLLATLPLAVHASPQPLEQIKQSESQLSGRVGMIEMDLASGRTLTAWRADERFPMMSTFKVVLCGAVLARVDAGDEQLERKIHYRQQDLVDYSPVSEKHLADGMTVGELCAAAITMSDNSAANLLLATVGGPAGLTAFLRQIGDNVTRLDRWETELNEALPGDARDTTTPASMAATLRKLLTSQRLSARSQRQLLQWMVDDRVAGPLIRSVLPAGWFIADKTGASKRGARGIVALLGPNNKAERIVVIYLRDTPASMAERNQQIAGIGAA",
"orf_start": 67822,
"ARO_name": "SHV-12",
"type_match": "Loose",
"query": "INDWRLDYNECRPHSSLNYLTPAEFAAGWRN",
"evalue": 3.82304,
"max-identities": 10,
"orf_strand": "-",
"bit-score": 24.6386,
"cvterm_id": "35914",
"sequenceFromDB": "LDRWETELNEALPGDARDTTTPASMAATLRK",
"match": "++ W + NE P + + TPA AA R ",
"model_id": "103",
"orf_From": "gi|378406451|gb|JN420336.1| Klebsiella pneumoniae plasmid pNDM-MAR, complete sequence",
"pass_evalue": 1e-100,
"query_end": 68607,
"ARO_category": {
"36696": {
"category_aro_name": "antibiotic inactivation enzyme",
"category_aro_cvterm_id": "36696",
"category_aro_accession": "3000557",
"category_aro_description": "Enzyme that catalyzes the inactivation of an antibiotic resulting in resistance. Inactivation includes chemical modification, destruction, etc."
},
"36268": {
"category_aro_name": "beta-lactam resistance gene",
"category_aro_cvterm_id": "36268",
"category_aro_accession": "3000129",
"category_aro_description": "Genes conferring resistance to beta-lactams."
}
},
"ARO_accession": "3001071",
"query_start": 68515,
"model_name": "SHV-12",
"model_type": "protein homolog model",
"orf_end": 68646
},
...
// Hit 2
...
// Hit 3
...
}
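Because the JSON nests hits under each query sequence, summarising it takes only a dictionary comprehension. A sketch, assuming the key names shown above (`ARO_name`, `bit-score`); the sample data is abbreviated and the second hit is invented:

```python
# Sketch: pick the best hit per ORF from the nested JSON shown above.
# Key names mirror the example output; the data itself is abbreviated
# and the "TEM-1" hit is invented for illustration.
results = {
    "gene_71": {
        "hsp_1": {"ARO_name": "SHV-12", "type_match": "Loose", "bit-score": 24.6},
        "hsp_2": {"ARO_name": "TEM-1", "type_match": "Strict", "bit-score": 410.2},
    }
}

# For each ORF, keep the ARO term of the highest-scoring hit.
best = {
    orf: max(hits.values(), key=lambda h: h["bit-score"])["ARO_name"]
    for orf, hits in results.items()
}
print(best)  # {'gene_71': 'TEM-1'}
```

For a real file, load `results` with `json.load(open("Report.json"))` first.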
|--------------------------------------------------------------------------
| Getting Tab Delimited output after running RGI:
|--------------------------------------------------------------------------
Run the following command to get help on generating the tab-delimited output:
$ python convertJsonToTSV.py -h
|--------------------------------------------------------------------------
| convertJsonToTSV inputs
|--------------------------------------------------------------------------
$ python convertJsonToTSV.py -h
usage: convertJsonToTSV.py [-h] [-i AFILE] [-o OUTPUT] [-v VERBOSE]
Convert RGI JSON file to Tab-delimited file
optional arguments:
-h, --help show this help message and exit
-i AFILE, --afile AFILE
must be a json file generated from RGI in JSON or gzip
format e.g out.json, out.json.gz
-o OUTPUT, --out_file OUTPUT
Output JSON file (default=dataSummary)
-v VERBOSE, --verbose VERBOSE
include help menu. Options are OFF or ON (default =
OFF for no help)
|--------------------------------------------------------------------------
| convertJsonToTSV outputs
|--------------------------------------------------------------------------
This outputs a tab-delimited text file: dataSummary.txt
The tab-output is as follows:
---------------------------------------------------------------------
COLUMN HELP_MESSAGE
---------------------------------------------------------------------
ORF_ID ------- Open Reading Frame identifier (internal to RGI)
CONTIG ------- Source Sequence
START ------- Start co-ordinate of ORF
STOP ------- End co-ordinate of ORF
ORIENTATION ------- Strand of ORF
CUT_OFF ------- RGI Detection Paradigm
PASS_EVALUE ------- STRICT detection model Expectation value cut-off
Best_Hit_evalue ------- Expectation value of match to top hit in CARD
Best_Hit_ARO ------- ARO term of top hit in CARD
Best_Identities ------- Percent identity of match to top hit in CARD
ARO ------- ARO accession of top hit in CARD
ARO_name ------- ARO term of top hit in CARD
Model_type ------- CARD detection model type
SNP ------- Observed mutation (if applicable)
ARO_category ------- ARO Categorization
bit_score ------- Bitscore of match to top hit in CARD
Predicted_Protein ------- ORF predicted protein sequence
CARD_Protein_Sequence ------- Protein sequence of top hit in CARD
LABEL ------- ORF label (internal to RGI)
ID ------- HSP identifier (internal to RGI)
|--------------------------------------------------------------------------
| Files Structure
|--------------------------------------------------------------------------
`-- rgi/
|-- _data/
|-- card.json
|-- _db/
... (BLAST DBs)
|-- _docs/
|-- INSTALL
|-- README
|-- _tmp/
|-- tests/
|-- __init__.py
|-- blastnsnp.py
|-- clean.py
|-- contigToORF.py
|-- contigToProteins.py
|-- convertJsonToTSV.py
|-- create_gff3_file.py
|-- filepaths.py
|-- formatJson.py
|-- fqToFsa.py
|-- load.py
|-- rgi.py
|-- rrna.py
|--------------------------------------------------------------------------
| Loading new card.json:
|--------------------------------------------------------------------------
* If a new card.json is available, replace card.json in the directory shown above, then use the following command:
$ python load.py -h
|--------------------------------------------------------------------------
| Load inputs
|--------------------------------------------------------------------------
$ python load.py -h
usage: load.py [-h] [-i AFILE]
Load card database json file
optional arguments:
-h, --help show this help message and exit
-i AFILE, --afile AFILE
must be a card database json file
|--------------------------------------------------------------------------
| Clean databases
|--------------------------------------------------------------------------
* Databases are created once rgi.py is run. Use clean.py to remove databases after a new card.json is loaded.
* Then run clean.py to clean the directory.
$ python clean.py -h
|--------------------------------------------------------------------------
| Clean inputs
|--------------------------------------------------------------------------
$ python clean.py -h
usage: clean.py [-h]
Removes BLAST databases created using card.json
optional arguments:
-h, --help show this help message and exit
|--------------------------------------------------------------------------
| Format JSON
|--------------------------------------------------------------------------
$ python formatJson.py -h
usage: formatJson.py [-h] [-i IN_FILE] [-o OUT_FILE]
Convert RGI JSON file to Readable JSON file
optional arguments:
-h, --help show this help message and exit
-i IN_FILE, --in_file IN_FILE
input file must be in JSON format e.g Report.json
-o OUT_FILE, --out_file OUT_FILE
Output JSON file (default=ReportFormatted)
|--------------------------------------------------------------------------
| Contact Us:
|--------------------------------------------------------------------------
For help please contact us at:
* CARD Developers <card@mcmaster.ca>
|--------------------------------------------------------------------------
| Running RGI locally
|--------------------------------------------------------------------------
|--------------------------------------------------------------------------
| Files
|--------------------------------------------------------------------------
`-- rgi/
|-- _data/
|-- card.json
|-- _db/
... (BLAST DBs)
|-- _docs/
|-- INSTALL
|-- README
|-- _tmp/
|-- tests/
|-- __init__.py
|-- blastnsnp.py
|-- clean.py
|-- contigToORF.py
|-- contigToProteins.py
|-- convertJsonToTSV.py
|-- create_gff3_file.py
|-- filepaths.py
|-- formatJson.py
|-- fqToFsa.py
|-- load.py
|-- rgi.py
|-- rrna.py
|--------------------------------------------------------------------------
| Commands for Running RGI locally
|--------------------------------------------------------------------------
$ python rgi.py -h
$ python clean.py -h
$ python convertJsonToTSV.py -h
$ python load.py -h
$ python formatJson.py -h
|--------------------------------------------------------------------------
| Running RGI system-wide
|--------------------------------------------------------------------------
# install RGI (specify the conda channel that provides the rgi package)
$ conda install --channel <channel> rgi
# un-install RGI
$ conda remove rgi
|--------------------------------------------------------------------------
| Commands for Running RGI system-wide
|--------------------------------------------------------------------------
$ rgi -h
$ rgi_clean -h
$ rgi_jsontab -h
$ rgi_load -h
$ rgi_jsonformat -h
from app.settings import *
class Analyser(object):
"""Class to find analyse fasta files."""
def __init__(self,input_file):
"""Creates Analyser object for analysing fasta files."""
self.input_file = input_file
logger.info(repr(self))
def __repr__(self):
"""Returns Analyser class full object."""
return "Analyser({}".format(self.__dict__)
def run(self):
checksums = set()
d = {}
# Using the Biopython fasta parse we can read our fasta input
for record in SeqIO.parse(self.input_file, "fasta"):
checksum = seguid(record.seq)
if checksum in checksums:
model_type_id = record.description.split("|")[1].split(":")[-1].strip()
if model_type_id in ["40292"]:
logger.warning("Duplicates {} -> {} ~ {} - [{}]".format(record.id, record.description.split("|")[-1], d[checksum].split("|")[-1], model_type_id))
continue
d[checksum] = record.description
checksums.add(checksum)
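Analyser.run detects duplicate reference sequences by indexing each record under a sequence checksum (Biopython's seguid) and flagging any record whose checksum has already been seen. The same idea can be sketched without Biopython using hashlib; this is a minimal standalone sketch, with plain (header, sequence) tuples standing in for SeqIO records.

```python
import hashlib

def find_duplicates(records):
    """Return (first_header, duplicate_header) pairs for records
    whose sequences share a checksum.

    records: iterable of (header, sequence) tuples.
    """
    seen = {}        # checksum -> first header seen with that sequence
    duplicates = []
    for header, seq in records:
        # case-insensitive, like a sequence checksum over normalized residues
        checksum = hashlib.sha1(seq.upper().encode()).hexdigest()
        if checksum in seen:
            duplicates.append((seen[checksum], header))
        else:
            seen[checksum] = header
    return duplicates
```

Checksumming keeps memory proportional to the number of distinct sequences rather than their total length, which is why the real code stores checksums instead of the sequences themselves.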
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
"""
Base classes used throughout RGI
"""
import os
from Bio import SeqIO
import json
from abc import ABCMeta, abstractmethod
from app.settings import logger
from Bio.Alphabet import generic_dna
from Bio.Seq import Seq
from pyfaidx import Fasta
class RGIBase(object):
"""Interface for RGI"""
__metaclass__ = ABCMeta
@abstractmethod
def from_string(self): pass
@abstractmethod
def from_args(self): pass
@abstractmethod
def validate_inputs(self): pass
@abstractmethod
def run(self): pass
@abstractmethod
def create_databases(self): pass
@abstractmethod
def run_blast(self): pass
@abstractmethod
def filter_process(self): pass
@abstractmethod
def output(self): pass
class BaseModel(object):
def extract_nth_bar(self, bunch_str, n):
"""
Parse the nth "|" delimited result field from aligner output
and return the result information with the appropriate type
Parameters
----------
Args:
bunch_str (str): "|" delimited result information from aligner
n (int): "|" field index to extract
Return:
result (int, float, ascii byte-str): result with correct type
"""
# to get appropriate field set offset
start = n + 3
end = n + 4
# subset the string to find the relevant set of "|" delimited results
subset_split_str = bunch_str.split('|')[start: end]
temporary_str = "|".join(subset_split_str)
# rebuild the string and extract the hit information after colon
result = temporary_str[temporary_str.find(':')+2:]
result = result.rstrip()
# check if integer first
if result.isdigit():
return int(result)
# otherwise try to coerce to float
else:
try:
return float(result)
# if that doesn't work encode into ascii-bytestring
except ValueError:
# return result.encode("ascii", "replace")
return result
def extract_nth_hash(self, bunch_str, n):
"""
Parse and return the nth hash delimited field from
alignment output
Parameters
----------
Args:
bunch_str (str): '#' delimited string from aligner
n (int): '#' field index to extract
Return:
result (int, str): extracted value from output in appropriate type
"""
if "#" not in bunch_str:
return 0
else:
arr = bunch_str.split("#")
if n >= len(arr):
return ""
else:
# if first two positional information fields return as integers
if n == 1 or n == 2:
return int(arr[n])
# strandedness data field
elif n == 3:
# convert 1/0 strand indicator to +/- notation
if int(arr[n]) == 1:
return "+"
else:
return "-"
# return as string if not specific strandedness or positional
else:
return arr[n]
def find_num_dash(self, subject, index):
"""
Finds location of mutation by counting the
number of dashes in the aligner subject output
Parameters
----------
Args:
subject (str): aligner output string
index (int): max output size
Returns:
dash_count (int): position of the mutation/SNP
"""
dash_count = 0
output = []
for i in range(len(subject)):
if subject[i] == '-':
dash_count += 1
else:
output.append(subject[i])
if len(output) == index:
break
return dash_count
def get_submitted_protein_sequence(self, seq_filepath):
"""
Parses sequence fasta into a dictionary keyed with the sequence IDs
Parameters
----------
Args:
seq_filepath (str): sequence filepath
Returns:
submitted_proteins_dict (dict): dictionary of sequences of the
format {seq_id: sequence string}
"""
submitted_proteins_dict = {}
if os.stat(seq_filepath).st_size != 0:
for record in SeqIO.parse(seq_filepath, 'fasta'):
submitted_proteins_dict[record.id] = str(record.seq)
return submitted_proteins_dict
def get_orf_dna_sequence(self, input_file, input_type):
"""
Get the predicted open reading frame nucleotide sequence.
Args:
input_file (str): filepath of the input file
input_type (str): [contig, read]
"""
filename = os.path.basename(input_file)
predicted_genes_dict = {}
if input_type in ["contig"]:
contig_filepath = os.path.join(self.working_directory,
filename + ".temp.contigToORF.fsa")
if os.stat(contig_filepath).st_size != 0:
for record in SeqIO.parse(contig_filepath, 'fasta'):
predicted_genes_dict[record.id] = str(record.seq)
elif input_type in ["read"]:
read_filepath = os.path.join(self.working_directory,
filename + ".temp.read.fsa")
if os.stat(read_filepath).st_size != 0:
for record in SeqIO.parse(read_filepath, 'fasta'):
predicted_genes_dict[record.id] = str(record.seq)
else:
raise ValueError("input_type invalid \
(must be 'contig' or 'read'): {}".format(input_type))
# write json for all predicted file
pjson = json.dumps(predicted_genes_dict)
predicted_filepath = os.path.join(self.working_directory,
filename + ".temp.predictedGenes.json")
with open(predicted_filepath, 'w') as wf:
wf.write(pjson)
return predicted_genes_dict
def results(self, blast_results, query_id, perfect, strict, loose):
"""
Sort results to perfect, strict, loose paradigm
Parameters
----------
Args:
blast_results (dict): dictionary containing perfect, strict and loose hits
query_id (str): blast record query
perfect (dict): dictionary containing perfect hits
strict (dict): dictionary containing strict hits
loose (dict): dictionary containing loose hits
Returns:
blast_results (dict): dictionary of sorted results
"""
nudged = False
if len(perfect) == 0 and len(strict) == 0 and len(loose) > 0:
nudged , loose = self.nudge_loose_to_strict(loose)
if nudged is True and self.loose is False:
blast_results[query_id] = loose
elif self.loose is True:
blast_results[query_id] = loose
elif len(perfect) == 0:
if len(strict) > 0:
nudged , strict = self.nudge_strict_to_perfect(strict)
blast_results[query_id] = strict
else:
if len(perfect) > 0:
blast_results[query_id] = perfect
return blast_results
def nudge_strict_to_perfect(self, strict):
"""
Nudge strict hits with missing n-terminus, c-terminus and alternate start codons
Parameters
----------
Args:
strict (dict): dictionary containing strict hits
Returns:
nudged (bool): True or False
strict (dict): dictionary containing strict or perfect hits
"""
nudged = False
# - check if there is 100% match with matching part to the reference
# - getting matching protein then pull nucleotides from reference and translate
# - check the start codons including alternates
# - promote to perfect if the start codon is present in the N-terminus
for s in strict:
if int(strict[s]["perc_identity"]) == 100 and strict[s]["type_match"] == "Strict" and strict[s]["model_type_id"] not in [40295] and self.input_type == 'contig':
reference = strict[s]["sequence_from_broadstreet"]
query = strict[s]["orf_prot_sequence"]
orf_dna_sequence_ = ""
orf_end_ = ""
orf_start_ = ""
_partial_protein = ""
# Missing n-terminus or c-terminus
if len(query) < len(reference) and query in reference:
length_nucleotides = (len(reference) - len(strict[s]["match"]))*3
# pull nucleotides from query or submitted sequence
partial_bases, orf_start_, orf_end_ = self.get_part_sequence(
self.input_sequence, strict[s]["orf_from"],
strict[s]["orf_start"], strict[s]["orf_end"],
length_nucleotides, strict[s]["orf_strand"],
strict[s]["ARO_name"]
)
# check if partial_bases are multiples of 3
if len(partial_bases) % 3 == 0:
logger.info("Missing part: [{}]".format(partial_bases))
if strict[s]["orf_strand"] == "-":
# update orf_end
strict[s]["orf_end_possible"] = orf_end_ + 1
strict[s]["orf_start_possible"] = strict[s]["orf_start"]
orf_dna_sequence_ = str(Seq(partial_bases, generic_dna).reverse_complement()) + strict[s]["orf_dna_sequence"]
partial_protein = str(Seq(partial_bases, generic_dna).reverse_complement().translate(table=11))
# logger.info("Reverse strand: {}".format(partial_protein))
else:
# update orf_start
strict[s]["orf_start_possible"] = orf_start_
strict[s]["orf_end_possible"] = int(strict[s]["orf_end"]) + 1
orf_dna_sequence_ = partial_bases + strict[s]["orf_dna_sequence"]
partial_protein = str(Seq(partial_bases, generic_dna).translate(table=11))
# logger.info("Forward strand: {}".format(partial_protein))
logger.info("Translated protein: [{}]".format(partial_protein))
# update start codon to M for all other alternate start codons
if len(partial_protein) > 0:
_partial_protein = partial_protein[0]
if partial_protein[0] in ["L","M","I","V"]:
_partial_protein = "M"+partial_protein[1:]
combine = _partial_protein + strict[s]["match"]
if combine == strict[s]["sequence_from_broadstreet"]:
logger.info("Missing n-terminus push to Perfect: {} #complete_gene {}".format(strict[s]["ARO_name"],
json.dumps({
"ARO_name": strict[s]["ARO_name"],
"strand": strict[s]["orf_strand"],
"header": strict[s]["orf_from"],
"orf_start_possible": strict[s]["orf_start_possible"],
"orf_end_possible": strict[s]["orf_end_possible"]
}, indent=2)
)
)
strict[s]["type_match"] = "Perfect"
strict[s]["nudged"] = True
strict[s]["note"] = "possible complete gene, missing n-terminus"
strict[s]["missed_bases_by_prodigal"] = partial_bases
# update orf_dna_sequence
strict[s]["orf_dna_sequence_possible"] = orf_dna_sequence_
nudged = True
else:
logger.info("incomplete nucleotides for gene: {} #partial_gene {}".format(strict[s]["ARO_name"],
json.dumps({
"ARO_name": strict[s]["ARO_name"],
"strand": strict[s]["orf_strand"],
"header": strict[s]["orf_from"],
"orf_start": strict[s]["orf_start"],
"orf_end": strict[s]["orf_end"],
"partial_bases": partial_bases
}, indent=2)
)
)
# reference contained within open reading frame
elif len(query) > len(reference) and reference in query:
strict[s]["type_match"] = "Perfect"
strict[s]["nudged"] = True
strict[s]["note"] = "possible complete gene, reference contained within open reading frame"
nudged = True
if strict[s]["orf_strand"] == "-":
strict[s]["orf_start_possible"] = strict[s]["orf_start"]
strict[s]["orf_end_possible"] = int(strict[s]["orf_start"]) + len(strict[s]["dna_sequence_from_broadstreet"])
else:
strict[s]["orf_start_possible"] = int(strict[s]["orf_end"]) - len(strict[s]["dna_sequence_from_broadstreet"]) + 1
strict[s]["orf_end_possible"] = int(strict[s]["orf_end"]) + 1
logger.info("Reference contained within open reading frame push to Perfect: {} #complete_gene {}".format(strict[s]["ARO_name"],
json.dumps({
"ARO_name": strict[s]["ARO_name"],
"strand": strict[s]["orf_strand"],
"header": strict[s]["orf_from"],
"orf_start_possible": strict[s]["orf_start_possible"],
"orf_end_possible": strict[s]["orf_end_possible"]
}, indent=2)
))
# orf and reference are overlapping
'''
elif reference not in query and query not in reference:
logger.warning("TODO:: orf and reference are overlapping")
'''
return nudged, strict
def get_part_sequence(self, fasta_file, header, start, stop, nterminus, strand, name):
"""
Pull part sequence from fasta file
# https://github.com/mdshw5/pyfaidx
# pip install pyfaidx
Parameters
----------
Args:
fasta_file (str): input fasta file
header (str): header for fasta sequence
start (str): start coordinate
stop (str): stop coordinate
nterminus (int): length of missing sequence
strand (str): strand
name (str): gene name
Returns:
sequence (str): portion on a sequence
"""
# remove the last 2 characters from header as this is appended by prodigal
header = header[:header.rfind("_")]
# logger.info("[PARTIAL] ARO: {} | contig: {} | filename: {}".format(name, header, fasta_file))
genes = Fasta(fasta_file, sequence_always_upper=False, read_long_names=False, one_based_attributes=True)
# logger.info(genes.records)
# logger.info(json.dumps({"strand":strand, "start":start, "stop":stop, "nterminus":nterminus}, indent=2))
if strand == "-":
_start = stop + 1
_stop = stop + nterminus
# logger.info("grep sequence from {}|-|{}-{}".format(header,_start, _stop,))
return str(genes.get_spliced_seq( header, [[_start, _stop]])), _start, _stop
elif strand == "+":
_start = start - nterminus
_stop = start - 1
if _start <= 0:
_start = 1
if _stop <= 0:
_stop = 1
# logger.info("grep sequence from {}|+|{}-{}".format(header,_start, _stop))
return str(genes.get_spliced_seq( header, [[_start, _stop]])), _start, _stop
def nudge_loose_to_strict(self, loose):
"""
Nudge loose hits with at least 95 percent identity to be strict hits
Parameters
----------
Args:
loose (dict): dictionary containing loose hits
Returns:
nudged (bool): True or False
loose (dict): dictionary containing loose or strict hits
"""
nudged = False
# check if there are any loose hits that might be strict
for i in loose:
if 95 <= int(loose[i]["perc_identity"]) <= 100:
# add to strict
logger.info("loose hit with at least 95 percent identity push to Strict: {} #partial_gene".format(loose[i]["ARO_name"]))
loose[i]["type_match"] = "Strict"
loose[i]["nudged"] = True
nudged = True
return nudged, loose
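extract_nth_bar and extract_nth_hash above pull typed fields out of the "|"- and "#"-delimited FASTA headers RGI writes into its reference database. The "|" parser can be exercised standalone; the following is a minimal re-implementation of the method's logic, and the example header is hypothetical, written in the style of the headers produced by write_fasta_from_json.

```python
def extract_nth_bar(bunch_str, n):
    """Return the value after ': ' in the (n + 3)-th '|' field,
    coerced to int or float when possible, else left as a string."""
    field = "|".join(bunch_str.split('|')[n + 3:n + 4])
    # skip past "label: " to the value, then strip trailing whitespace
    result = field[field.find(':') + 2:].rstrip()
    if result.isdigit():
        return int(result)
    try:
        return float(result)
    except ValueError:
        return result

# hypothetical header in the style of RGI's reference FASTA
header = "Q1 | hsp | extra | model_type_id: 40292 | pass_bitscore: 1200.5 | OXA-1"
```

The `n + 3` offset skips the fixed leading fields of the aligner's subject description, so `n` indexes the metadata fields that follow.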
from app.settings import *
class Blast(object):
"""Class to create Blast object and align for protein and translated DNA searches."""
def __init__(self,input_file, output_file=None, program = 'blastp', num_threads=32, local_database=False ):
"""Creates Blast object for running NCBI BLAST algorithm."""
self.input_file = input_file
if output_file is None:
f_path, f_name = os.path.split(input_file)
self.output_file = os.path.join(f_path,"{}.blastRes.xml".format(f_name))
else:
self.output_file = output_file
self.local_database = local_database
self.db = path
if self.local_database:
self.db = LOCAL_DATABASE
self.program = program
self.num_threads = num_threads
self.outfmt = 5
def __repr__(self):
"""Returns Blast class full object."""
return "Blast({}".format(self.__dict__)
def run(self):
"""Runs BLAST algorithm."""
logger.info("run blast")
os.system('{program} -query {input} -db {path} \
-num_threads {num_threads} -outfmt {outfmt} -out {output_file}' \
.format(
program=self.program,
num_threads=self.num_threads,
outfmt=self.outfmt,
input=self.input_file,
path=os.path.join(self.db,"protein.db"),
output_file=self.output_file
)
)
def run_custom(self, db):
"""Runs BLAST against a custom database."""
os.system('{program} -query {input} -db {db} -num_threads {num_threads} \
-outfmt {outfmt} -out {output_file} -perc_identity {perc_identity} \
-strand {strand}' \
.format(
program=self.program,
input=self.input_file,
db=db,
num_threads=self.num_threads,
outfmt=self.outfmt,
output_file=self.output_file,
perc_identity=95.0,
strand="both"
)
)
logger.info("done running {} -> {}".format(self.program, db))
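Blast.run shells out with os.system and string interpolation. The same invocation can be expressed with subprocess and an argument list, which avoids shell quoting issues and surfaces failures as exceptions. This is a sketch of an alternative, not the code above; paths and database names in the usage are placeholders.

```python
import subprocess

def build_blast_cmd(program, query, db, output_file, num_threads=32, outfmt=5):
    """Assemble the BLAST invocation as an argument list (no shell interpolation)."""
    return [program,
            "-query", query,
            "-db", db,
            "-num_threads", str(num_threads),
            "-outfmt", str(outfmt),
            "-out", output_file]

def run_blast(**kwargs):
    """Run BLAST; check=True raises CalledProcessError on a non-zero exit."""
    return subprocess.run(build_blast_cmd(**kwargs), check=True)
```

Separating command construction from execution also makes the invocation testable without BLAST installed.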
import csv
from app.settings import *
from operator import itemgetter, attrgetter
class ConvertJsonToTSV(object):
def __init__(self, filepath, homolog_file=None, variant_file=None, overexpression_file=None, rrna_file=None):
f_path, f_name = os.path.split(filepath)
name, ext = os.path.splitext(f_name)
self.filepath = os.path.join(f_path, "{}.json".format(f_name))
if ext.lower() == ".json":
self.filepath = os.path.join(f_path, "{}{}".format(name,ext))
self.homolog_file = homolog_file
self.variant_file = variant_file
self.overexpression_file = overexpression_file
self.rrna_file = rrna_file
def __repr__(self):
"""Returns ConvertJsonToTSV class full object."""
return "ConvertJsonToTSV({}".format(self.__dict__)
def parse_jsons(self, part, data):
if data.keys():
for k1,v1 in part.items():
if k1 in data:
data[k1].update(v1)
elif k1 not in data:
data[k1]=v1
else:
data.update(part)
return data
def combine_jsons(self):
if self.homolog_file is not None and self.variant_file is not None and self.overexpression_file is not None and self.rrna_file is not None:
data={}
if os.path.isfile(self.homolog_file):
with open (self.homolog_file) as hf:
homolog = json.load(hf)
data = self.parse_jsons(homolog, data)
if os.path.isfile(self.variant_file):
with open (self.variant_file) as vf:
variant = json.load(vf)
data = self.parse_jsons(variant, data)
if os.path.isfile(self.overexpression_file):
with open (self.overexpression_file) as of:
overexpression = json.load(of)
data = self.parse_jsons(overexpression, data)
if os.path.isfile(self.rrna_file):
with open (self.rrna_file) as rf:
rrna = json.load(rf)
data = self.parse_jsons(rrna, data)
with open(self.filepath, "w") as outfile:
json.dump(data, outfile)
def run(self):
if os.path.isfile(self.filepath):
f_path, f_name = os.path.split(self.filepath)
with open(os.path.join(f_path, "{}.txt".format(os.path.splitext(f_name)[0])), "w") as af:
writer = csv.writer(af, delimiter='\t', dialect='excel')
writer.writerow(["ORF_ID",
"Contig",
"Start",
"Stop",
"Orientation",
"Cut_Off",
"Pass_Bitscore",
"Best_Hit_Bitscore",
"Best_Hit_ARO",
"Best_Identities",
"ARO",
"Model_type",
"SNPs_in_Best_Hit_ARO",
"Other_SNPs",
"Drug Class",
"Resistance Mechanism",
"AMR Gene Family",
"Predicted_DNA",
"Predicted_Protein",
"CARD_Protein_Sequence",
"Percentage Length of Reference Sequence",
"ID",
"Model_ID"])
if os.path.isfile(self.filepath):
with open(self.filepath) as rgi_file:
rgi_data = json.load(rgi_file)
try:
del rgi_data["_metadata"]
except KeyError:
pass
for hsp in rgi_data:
order_perfect = []
order_loose = []
order_strict = []
dna = 0
cgList = []
hitID = []
temp2=[]
temp3 = []
best_snps = ""
other_snps = ""
for hit in rgi_data[hsp]:
if rgi_data[hsp][hit]["type_match"] == "Perfect":
order_perfect.append((
hit, rgi_data[hsp][hit]["bit_score"], rgi_data[hsp][hit]["perc_identity"]
))
if rgi_data[hsp][hit]["type_match"] == "Strict":
order_strict.append((
hit, rgi_data[hsp][hit]["bit_score"], rgi_data[hsp][hit]["perc_identity"]
))
if rgi_data[hsp][hit]["type_match"] == "Loose":
order_loose.append((
hit, rgi_data[hsp][hit]["bit_score"], rgi_data[hsp][hit]["perc_identity"]
))
ordered = []
# sort by bitscore and percent identity
if len(order_perfect) > 0:
ordered = sorted(order_perfect, key=itemgetter(1,2)).pop(0)
elif len(order_strict) > 0:
ordered = sorted(order_strict, key=itemgetter(1,2)).pop(0)
else:
ordered = sorted(order_loose, key=itemgetter(1,2)).pop(0)
ordered = [i for i in ordered]
if "orf_dna_sequence" in rgi_data[hsp][ordered[0]]:
dna = 1
if "ARO_category" in rgi_data[hsp][ordered[0]]:
for aroctkey in rgi_data[hsp][hit]["ARO_category"]:
cgList.append(str(rgi_data[hsp][hit]["ARO_category"][aroctkey]["category_aro_name"].encode('ascii','replace').decode("utf-8")))
if "hsp_num:" in rgi_data[hsp][ordered[0]]:
hitID.append(rgi_data[hsp][ordered[0]])
match_dict = {}
if dna == 1:
if len(rgi_data[hsp]) != 0:
if rgi_data[hsp][hit]["model_type_id"] == 41091:
if "snp" in rgi_data[hsp][ordered[0]]:
for x in rgi_data[hsp].values():
if "snp" in x.keys():
if x['model_id'] == rgi_data[hsp][ordered[0]]['model_id']:
temp2.append(x["snp"]["original"] + str(x["snp"]["position"]) + x["snp"]["change"])
best_snps = ', '.join(temp2)
else:
temp3.append(x["snp"]["original"] + str(x["snp"]["position"]) + x["snp"]["change"] + ":" + x['model_id'])
other_snps = ', '.join(temp3)
else:
best_snps = "n/a"
other_snps = "n/a"
elif rgi_data[hsp][hit]["model_type_id"] in [40293,40295]:
if "snp" in rgi_data[hsp][ordered[0]]:
for x in rgi_data[hsp].values():
if "snp" in x.keys():
if x['model_id'] == rgi_data[hsp][ordered[0]]['model_id']:
temp2.append(x["snp"]["original"] + str(x["snp"]["position"]) + x["snp"]["change"])
best_snps = ', '.join(temp2)
else:
temp3.append(x["snp"]["original"] + str(x["snp"]["position"]) + x["snp"]["change"] + ":" + x['model_id'])
other_snps = ', '.join(temp3)
else:
best_snps = "n/a"
other_snps = "n/a"
elif rgi_data[hsp][hit]["model_type_id"] == 40292:
best_snps = "n/a"
other_snps = "n/a"
if not other_snps:
other_snps = "n/a"
if rgi_data[hsp][hit]["model_type_id"] in [40295]:
percentage_length_reference_sequence = format(((rgi_data[hsp][ordered[0]]["orf_end"] - rgi_data[hsp][ordered[0]]["orf_start"]) /\
len(rgi_data[hsp][ordered[0]]["dna_sequence_from_broadstreet"]))*100, '.2f')
else:
percentage_length_reference_sequence = format((len(rgi_data[hsp][ordered[0]]["orf_prot_sequence"]) /\
len(rgi_data[hsp][ordered[0]]["sequence_from_broadstreet"]))*100, '.2f')
match_dict[hsp] = [hsp,
rgi_data[hsp][ordered[0]]["orf_from"],
rgi_data[hsp][ordered[0]]["orf_start"],
rgi_data[hsp][ordered[0]]["orf_end"],
rgi_data[hsp][ordered[0]]["orf_strand"],
rgi_data[hsp][ordered[0]]["type_match"],
rgi_data[hsp][ordered[0]]["pass_bitscore"],
rgi_data[hsp][ordered[0]]["bit_score"],
rgi_data[hsp][ordered[0]]["ARO_name"],
rgi_data[hsp][ordered[0]]["perc_identity"],
rgi_data[hsp][ordered[0]]["ARO_accession"],
rgi_data[hsp][ordered[0]]["model_type"],
best_snps,
other_snps,
"; ".join(rgi_data[hsp][ordered[0]]["ARO_category"][x]["category_aro_name"] for x in rgi_data[hsp][ordered[0]]["ARO_category"] \
if rgi_data[hsp][ordered[0]]["ARO_category"][x]["category_aro_class_name"] == 'Drug Class'),
"; ".join(rgi_data[hsp][ordered[0]]["ARO_category"][x]["category_aro_name"] for x in rgi_data[hsp][ordered[0]]["ARO_category"] \
if rgi_data[hsp][ordered[0]]["ARO_category"][x]["category_aro_class_name"] == 'Resistance Mechanism'),
"; ".join(rgi_data[hsp][ordered[0]]["ARO_category"][x]["category_aro_name"] for x in rgi_data[hsp][ordered[0]]["ARO_category"] \
if rgi_data[hsp][ordered[0]]["ARO_category"][x]["category_aro_class_name"] == 'AMR Gene Family'),
rgi_data[hsp][ordered[0]]["orf_dna_sequence"],
rgi_data[hsp][ordered[0]]["orf_prot_sequence"],
rgi_data[hsp][ordered[0]]["sequence_from_broadstreet"],
# length of hsps / length reference
percentage_length_reference_sequence,
ordered[0],
rgi_data[hsp][ordered[0]]["model_id"]
]
for key, value in match_dict.items():
writer.writerow(value)
else:
if len(rgi_data[hsp]) != 0:
if rgi_data[hsp][hit]["model_type_id"] == 41091:
if "snp" in rgi_data[hsp][ordered[0]]:
for x in rgi_data[hsp].values():
if "snp" in x.keys():
if x['model_id'] == rgi_data[hsp][ordered[0]]['model_id']:
temp2.append(x["snp"]["original"] + str(x["snp"]["position"]) + x["snp"]["change"])
best_snps = ', '.join(temp2)
else:
temp3.append(x["snp"]["original"] + str(x["snp"]["position"]) + x["snp"]["change"] + ":" + x['model_id'])
other_snps = ', '.join(temp3)
else:
best_snps = "n/a"
other_snps = "n/a"
elif rgi_data[hsp][hit]["model_type_id"] == 40293:
if "snp" in rgi_data[hsp][ordered[0]]:
for x in rgi_data[hsp].values():
if "snp" in x.keys():
if x['model_id'] == rgi_data[hsp][ordered[0]]['model_id']:
temp2.append(x["snp"]["original"] + str(x["snp"]["position"]) + x["snp"]["change"])
best_snps = ', '.join(temp2)
else:
temp3.append(x["snp"]["original"] + str(x["snp"]["position"]) + x["snp"]["change"] + ":" + x['model_id'])
other_snps = ', '.join(temp3)
else:
best_snps = "n/a"
other_snps = "n/a"
elif rgi_data[hsp][hit]["model_type_id"] == 40292:
best_snps = "n/a"
other_snps = "n/a"
match_dict[hsp] = [hsp, "", "", "", "",
rgi_data[hsp][ordered[0]]["type_match"],
rgi_data[hsp][ordered[0]]["pass_bitscore"],
rgi_data[hsp][ordered[0]]["bit_score"],
rgi_data[hsp][ordered[0]]["ARO_name"],
rgi_data[hsp][ordered[0]]["perc_identity"],
rgi_data[hsp][ordered[0]]["ARO_accession"],
rgi_data[hsp][ordered[0]]["model_type"],
best_snps,
other_snps,
"; ".join(rgi_data[hsp][ordered[0]]["ARO_category"][x]["category_aro_name"] for x in rgi_data[hsp][ordered[0]]["ARO_category"] \
if rgi_data[hsp][ordered[0]]["ARO_category"][x]["category_aro_class_name"] == 'Drug Class'),
"; ".join(rgi_data[hsp][ordered[0]]["ARO_category"][x]["category_aro_name"] for x in rgi_data[hsp][ordered[0]]["ARO_category"] \
if rgi_data[hsp][ordered[0]]["ARO_category"][x]["category_aro_class_name"] == 'Resistance Mechanism'),
"; ".join(rgi_data[hsp][ordered[0]]["ARO_category"][x]["category_aro_name"] for x in rgi_data[hsp][ordered[0]]["ARO_category"] \
if rgi_data[hsp][ordered[0]]["ARO_category"][x]["category_aro_class_name"] == 'AMR Gene Family'),
"",
rgi_data[hsp][ordered[0]]["orf_prot_sequence"],
rgi_data[hsp][ordered[0]]["sequence_from_broadstreet"],
format((len(rgi_data[hsp][ordered[0]]["orf_prot_sequence"]) / len(rgi_data[hsp][ordered[0]]["sequence_from_broadstreet"]))*100, '.2f'),
ordered[0],
rgi_data[hsp][ordered[0]]["model_id"]
]
for key, value in match_dict.items():
writer.writerow(value)
def manual():
h = {}
h["ORF_ID"] = "Open Reading Frame identifier (internal to RGI)"
h["Contig"] = "Source Sequence"
h["Start"] = "Start co-ordinate of ORF"
h["Stop"] = "End co-ordinate of ORF"
h["Orientation"] = "Strand of ORF"
h["Cut_Off"] = "RGI Detection Paradigm"
h["Pass_Bitscore"] = "STRICT detection model bitscore value cut-off"
h["Best_Hit_Bitscore"] = "bitscore value of match to top hit in CARD"
h["Best_Hit_ARO"] = "ARO term of top hit in CARD"
h["Best_Identities"] = "Percent identity of match to top hit in CARD"
h["ARO"] = "ARO accession of top hit in CARD"
h["Model_type"] = "CARD detection model type"
h["SNPs_in_Best_Hit_ARO"] = "Mutations observed in the ARO term of top hit in CARD (if applicable)"
h["Other_SNPs"] = "Mutations observed in ARO terms of other hits indicated by model id (if applicable)"
h["Drug Class"] = "ARO Categorization"
h["Resistance Mechanism"] = "ARO Categorization"
h["AMR Gene Family"] = "ARO Categorization"
h["Predicted_DNA"] = "ORF predicted nucleotide sequence"
h["Predicted_Protein"] = "ORF predicted protein sequence"
h["CARD_Protein_Sequence"] = "Protein sequence of top hit in CARD"
h["Percentage Length of Reference Sequence"] = "Percentage Length of Reference Sequence"
h["ID"] = "HSP identifier (internal to RGI)"
h["Model_ID"] = "CARD detection model id"
print ("\n")
print ("COLUMN","\t\t\t","HELP_MESSAGE")
for i in h:
print (i,"\t\t\t",h[i])
print ("\n")
from app.settings import *
class Database(object):
"""Class to create BLAST databases from a card.json file."""
def __init__(self, local_database=False):
"""Creates Database object."""
self.local_database = local_database
self.db = path
self.data = data_path
self.stdout = "2>&1 >> /dev/null" #"2> /dev/null"
if self.local_database:
self.db = LOCAL_DATABASE
self.data = LOCAL_DATABASE
def __repr__(self):
"""Returns Database class full object."""
return "Database({}".format(self.__dict__)
def build_databases(self):
"""Build BLAST and DIAMOND databases."""
self.write_fasta_from_json()
self.make_blast_database()
self.make_diamond_database()
self.write_fasta_from_json_rna()
def make_blast_database(self):
"""Build BLAST database from a FASTA file."""
if os.path.isfile(os.path.join(self.db,"proteindb.fsa")) \
and os.path.exists(os.path.join(self.db,"protein.db.phr")) \
and os.path.exists(os.path.join(self.db,"protein.db.pin")) \
and os.path.exists(os.path.join(self.db,"protein.db.psq")):
logger.info("blast DB exists")
else:
logger.info("create blast DB.")
os.system('makeblastdb -in {} -dbtype prot -out {} {stdout}'.format(os.path.join(self.db,"proteindb.fsa"),os.path.join(self.db,"protein.db"),stdout=self.stdout))
def make_diamond_database(self):
"""Build DIAMOND database from a FASTA file."""
if os.path.isfile(os.path.join(self.db,"proteindb.fsa")) \
and os.path.exists(os.path.join(self.db,"protein.db.dmnd")):
logger.info("diamond DB exists")
else:
logger.info("create diamond DB.")
os.system('diamond makedb --quiet --in {} --db {} {stdout}'.format(os.path.join(self.db,"proteindb.fsa"),os.path.join(self.db,"protein.db"),stdout=self.stdout))
def make_custom_db(self, in_file, out_file, db_type="nucl", program="blast"):
if program == 'blast':
os.system('makeblastdb -in {in_file} -dbtype {db_type} -out {out_file} \
{stdout}'.format( in_file=in_file, db_type=db_type, out_file=out_file, stdout=self.stdout))
else:
exit("Only NCBI BLAST is supported.")
def write_fasta_from_json(self):
"""Creates a fasta file from card.json file."""
if os.path.isfile(os.path.join(self.db, "proteindb.fsa")):
# logger.info("Database already exists.")
return
else:
try:
with open(os.path.join(self.data, "card.json"), 'r') as jfile:
j = json.load(jfile)
except Exception as e:
logger.error(e)
exit()
with open(os.path.join(self.db, "proteindb.fsa"), 'w') as fout:
for i in j:
if i.isdigit():
# model_type: protein homolog model
if j[i]['model_type_id'] == '40292':
try:
pass_bit_score = j[i]['model_param']['blastp_bit_score']['param_value']
except KeyError:
logger.warning("No bitscore for model (%s, %s). RGI will omit this model and keep running." \
% (j[i]['model_id'], j[i]['model_name']))
logger.info("Please let the CARD Admins know! Email: card@mcmaster.ca")
else:
try:
for seq in j[i]['model_sequences']['sequence']:
fout.write('>%s_%s | model_type_id: 40292 | pass_bitscore: %s | %s\n' % (i, seq, pass_bit_score, j[i]['ARO_name']))
fout.write('%s\n' %(j[i]['model_sequences']['sequence'][seq]['protein_sequence']['sequence']))
except Exception as e:
logger.warning("No model sequences for model (%s, %s). RGI will omit this model and keep running." \
% (j[i]['model_id'], j[i]['model_name']))
logger.info("Please let the CARD Admins know! Email: card@mcmaster.ca")
# model_type: protein variant model
elif j[i]["model_type_id"] == "40293":
try:
pass_bit_score = j[i]['model_param']['blastp_bit_score']['param_value']
except KeyError:
logger.warning("No bitscore for model (%s, %s). RGI will omit this model and keep running." \
% (j[i]['model_id'], j[i]['model_name']))
logger.info("Please let the CARD Admins know! Email: card@mcmaster.ca")
else:
try:
snpList = [j[i]['model_param']['snp']['param_value'][k] for k in j[i]['model_param']['snp']['param_value']]
except Exception as e:
logger.warning("No snp for model (%s, %s). RGI will omit this model and keep running." \
% (j[i]['model_id'], j[i]['model_name']))
logger.info("Please let the CARD Admins know! Email: card@mcmaster.ca")
try:
for seq in j[i]['model_sequences']['sequence']:
fout.write('>%s_%s | model_type_id: 40293 | pass_bit_score: %s | SNP: %s | %s\n' \
% (i, seq, pass_bit_score, ','.join(snpList), j[i]['ARO_name']))
fout.write('%s\n' % (j[i]['model_sequences']['sequence'][seq]['protein_sequence']['sequence']))
except Exception as e:
logger.warning("No model sequences for model (%s, %s). RGI will omit this model and keep running." \
% (j[i]['model_id'], j[i]['model_name']))
logger.info("Please let the CARD Admins know! Email: card@mcmaster.ca")
# model_type: protein overexpression model
elif j[i]["model_type_id"] == "41091":
try:
pass_bit_score = j[i]["model_param"]["blastp_bit_score"]["param_value"]
except KeyError:
logger.warning("No bitscore for model (%s, %s). RGI will omit this model and keep running." \
% (j[i]['model_id'], j[i]['model_name']))
logger.info("Please let the CARD Admins know! Email: card@mcmaster.ca")
else:
try:
snpList = [j[i]['model_param']['snp']['param_value'][k] for k in j[i]['model_param']['snp']['param_value']]
except Exception as e:
logger.warning("No snp for model (%s, %s). RGI will omit this model and keep running." \
% (j[i]['model_id'], j[i]['model_name']))
logger.info("Please let the CARD Admins know! Email: card@mcmaster.ca")
try:
for seq in j[i]['model_sequences']['sequence']:
fout.write('>%s_%s | model_type_id: 41091 | pass_bit_score: %s | SNP: %s | %s\n' \
% (i, seq, pass_bit_score, ','.join(snpList), j[i]['ARO_name']))
fout.write('%s\n' % (j[i]['model_sequences']['sequence'][seq]['protein_sequence']['sequence']))
except Exception as e:
logger.warning("No model sequences for model (%s, %s). RGI will omit this model and keep running." \
% (j[i]['model_id'], j[i]['model_name']))
logger.info("Please let the CARD Admins know! Email: card@mcmaster.ca")
def write_fasta_from_json_rna(self):
"""Creates a fasta file for 16S and 23S data from card.json file."""
snpList_16s = []
snpList_23s = []
if os.path.isfile(os.path.join(self.db, "rnadb.fsa")):
# logger.info("RNA database already exists.")
return
else:
with open(os.path.join(self.data, "card.json"), 'r') as jfile:
j = json.load(jfile)
with open(os.path.join(self.db, "rnadb.fsa"), 'w') as fout:
for i in j:
if i.isdigit():
# model_type: ribosomal RNA model
if j[i]["model_type_id"] == "40295":
try:
pass_bit_score = j[i]['model_param']['blastn_bit_score']['param_value']
except KeyError:
logger.warning("No bitscore for model (%s, %s). RGI will omit this model and keep running." \
% (j[i]['model_id'], j[i]['model_name']))
logger.info("Please let the CARD Admins know! Email: card@mcmaster.ca")
else:
snpList = list(j[i]['model_param'].get('snp', {}).get('param_value', {}).values())  # tolerate models with no SNP data, as the other model types do
for s in snpList:
if "16S" in j[i]['ARO_name']:
if s not in snpList_16s:
snpList_16s.append(s)
if "23S" in j[i]['ARO_name']:
if s not in snpList_23s:
snpList_23s.append(s)
for seq in j[i]['model_sequences']['sequence']:
dna_sequence = j[i]['model_sequences']['sequence'][seq]['dna_sequence']['sequence']
if j[i]['model_sequences']['sequence'][seq]['dna_sequence']['strand'] == "-":
dna_sequence = self.complementary_strand(dna_sequence)
fout.write('>%s_%s | model_type_id: 40295 | pass_bit_score: %s | SNP: %s | %s\n' \
% (i, seq, pass_bit_score, ','.join(snpList), j[i]['ARO_name']))
fout.write('%s\n' % (dna_sequence))
# write snps to file
with open(os.path.join(self.db,"16s_rRNA.txt"), 'w') as f16s:
snpList_16s.sort()
f16s.write(',\n'.join(snpList_16s))
with open(os.path.join(self.db, "23s_rRNA.txt"), 'w') as f23s:
snpList_23s.sort()
f23s.write(',\n'.join(snpList_23s))
def complementary_strand(self, strand):
'''Takes a DNA strand string and returns its complement (IUPAC-aware; does not reverse).'''
trans = { "T": "A", "A": "T", "G": "C", "C": "G", "N": "N", "M": "K", "K": "M", \
"R": "Y", "Y": "R", "S": "S", "W": "W", "B": "V", "V": "B", "H": "D", "D": "H"}
return ''.join(trans[base] for base in strand)
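The complement table above can be exercised in isolation. A minimal standalone sketch of the same mapping (the `complement` name is hypothetical, not part of RGI):

```python
# Sketch of the base-complement mapping used by complementary_strand() above,
# including IUPAC ambiguity codes. It complements each base in place but does
# not reverse the string; reversing is left to the caller.
TRANS = {"T": "A", "A": "T", "G": "C", "C": "G", "N": "N", "M": "K", "K": "M",
         "R": "Y", "Y": "R", "S": "S", "W": "W", "B": "V", "V": "B", "H": "D", "D": "H"}

def complement(strand):
    """Return the complementary strand, preserving character order."""
    return ''.join(TRANS[base] for base in strand)
```

For a reverse complement, callers would reverse the result themselves, e.g. `complement("ATGC")[::-1]`.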
from app.settings import *
class Diamond(object):
"""Class to create Diamond object and align for protein and translated DNA searches."""
def __init__(self, input_file, output_file=None, program='blastp', num_threads=32, local_database=False):
"""Creates Diamond object for running DIAMOND algorithm."""
self.input_file = input_file
if output_file is None:
f_path, f_name = os.path.split(input_file)
self.output_file = os.path.join(f_path,"{}.blastRes.xml".format(f_name))
else:
self.output_file = output_file
self.local_database = local_database
self.db = path
if self.local_database:
self.db = LOCAL_DATABASE
self.program = program
self.num_threads = num_threads
self.index_chunks = 1
self.block_size = 1
self.outfmt = 5
def __repr__(self):
"""Returns Diamond class full object."""
return "Diamond({})".format(self.__dict__)
def run(self):
"""Runs DIAMOND algorithm."""
# logger.info("run diamond")
os.system('diamond {program} --in {in_ref} --db {db} \
--query {input} --outfmt {outfmt} --out {output_file} \
--threads {num_threads} --index-chunks {index_chunks} \
--block-size {block_size} \
--salltitles --quiet --more-sensitive' \
.format(
program=self.program,
in_ref=os.path.join(self.db,"proteindb.fsa"),
db=os.path.join(self.db,"protein.db"),
input=self.input_file,
output_file=self.output_file,
num_threads=self.num_threads,
index_chunks=self.index_chunks,
block_size=self.block_size,
outfmt=self.outfmt
)
)
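Diamond.run above shells out via os.system, which silently discards a non-zero exit status. A hedged sketch of building the same invocation as an explicit argument list suitable for subprocess (the helper name is hypothetical; flag values mirror the attributes set in Diamond.__init__ above):

```python
# Build the DIAMOND argument list explicitly so no shell quoting is needed;
# running it with subprocess.run(cmd, check=True) would raise on a non-zero
# exit status instead of ignoring it as os.system does.
def build_diamond_cmd(program, ref_fasta, db, query, out_xml, threads=32):
    """Return the DIAMOND argument list; run with subprocess.run(cmd, check=True)."""
    return [
        "diamond", program,
        "--in", ref_fasta,
        "--db", db,
        "--query", query,
        "--outfmt", "5",        # 5 = BLAST XML, matching self.outfmt
        "--out", out_xml,
        "--threads", str(threads),
        "--index-chunks", "1",
        "--block-size", "1",
        "--salltitles", "--quiet", "--more-sensitive",
    ]
```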
from app.Base import BaseModel
from app.HomologModel import Homolog
from app.VariantModel import Variant
from app.OverexpressionModel import Overexpression
from app.RrnaModel import Rrna
from app.Blast import Blast
from app.Database import Database
from app.ConvertRGIJsonToTSV import ConvertJsonToTSV
from app.settings import *
import hashlib
import multiprocessing
class Filter(BaseModel):
"""This class takes a BLAST XML file and a card.json file and produces the perfect/strict paradigm for RGI."""
def __init__(self, input_type, loose, input_sequence, xml_file, card_json, input_file, output_file, num_threads ,rgi_obj=None):
self.input_type = input_type
self.xml_file = xml_file
self.card_json = card_json
self.input_filename = input_file
self.input_sequence = input_sequence
self.loose = loose
self.blast_results = {}
self.rna_results = {}
self.rgi_obj = rgi_obj
self.num_threads = num_threads
self.working_directory = rgi_obj.working_directory
if output_file is None:
f_path, f_name = os.path.split(input_file)
self.output_file = os.path.join(f_path,"{}.Report.json".format(f_name))
else:
self.output_file = output_file
def __repr__(self):
"""Returns Filter class full object."""
return "Filter({})".format(self.__dict__)
def getHashName(self, name):
m = hashlib.md5()
t = time.gmtime()
# hashlib requires bytes in Python 3, so encode before updating
m.update((name + str(t)).encode('utf-8'))
return m.hexdigest()
def worker(self, model_type):
logger.info("{}_worker started...".format(model_type))
# results = {}
try:
if model_type == "homolog":
obj = Homolog(self.input_type, self.loose, self.input_sequence, self.xml_file, self.working_directory, self.rgi_obj.local_database)
if model_type == "variant":
obj = Variant(self.input_type, self.loose, self.input_sequence, self.xml_file, self.working_directory, self.rgi_obj.local_database)
if model_type == "overexpression":
obj = Overexpression(self.input_type, self.loose, self.input_sequence, self.xml_file, self.working_directory, self.rgi_obj.local_database)
results = obj.run()
logger.info("save {} results...".format(model_type))
file_name = os.path.basename(self.input_sequence)
with open(os.path.join(self.working_directory,"{}.temp.{}.json".format(file_name, model_type)), 'w') as fout:
fout.write(json.dumps(results))
except Exception as e:
logger.warning("Exception: {} -> {} -> model_type: {}".format(type(e), e, model_type))
def async_func(self):
p = multiprocessing.Process(target=self.worker, args=("homolog",))
p2 = multiprocessing.Process(target=self.worker, args=("variant",))
p3 = multiprocessing.Process(target=self.worker, args=("overexpression",))
p4 = multiprocessing.Process(target=self.process_rrna, args=("rrna",))
p.start()
p2.start()
p3.start()
p4.start()
# pids are only assigned once the processes have started
logger.info("{} -> {}".format(p.pid, p.name))
logger.info("{} -> {}".format(p2.pid, p2.name))
logger.info("{} -> {}".format(p3.pid, p3.name))
logger.info("{} -> {}".format(p4.pid, p4.name))
def prepare_output(self):
"""
Read all json into one json results file
"""
logger.info("prepare output(s) for input: {}".format(self.input_sequence))
file_name = os.path.basename(self.input_sequence)
obj=ConvertJsonToTSV(self.output_file, \
os.path.join(self.working_directory,"{}.temp.{}.json".format(file_name, "homolog")), \
os.path.join(self.working_directory,"{}.temp.{}.json".format(file_name, "variant")), \
os.path.join(self.working_directory,"{}.temp.{}.json".format(file_name, "overexpression")), \
os.path.join(self.working_directory,"{}.temp.{}.json".format(file_name, "rrna"))
)
# combine the four temp json files (homolog, variant, overexpression, rrna)
obj.combine_jsons()
# write tsv
obj.run()
def cleanup(self):
self.rgi_obj.clean_files()
def process_xml_file(self):
"""Processes the BLAST XML results file."""
model_thread = multiprocessing.Process(target=self.async_func, args=())
model_thread.start()
model_thread.join()
prepare_output_thread = multiprocessing.Process(target=self.prepare_output, args=())
prepare_output_thread.start()
prepare_output_thread.join()
cleanup_thread = multiprocessing.Process(target=self.cleanup, args=())
cleanup_thread.start()
def process_rrna(self, model_type="rrna"):
logger.info("rRNA process: {}".format(self.input_type))
if self.input_type == "protein":
logger.info("Skip rRNA...")
else:
logger.info("rRNA process")
self.format_fasta()
# clean the rRNA model's previous results and temporary files
self.file_name = os.path.basename(self.input_sequence)
d, x = self.create_db_query()
rrna_obj = Rrna(self.input_sequence, self.output_file, d, x, self.loose, self.rgi_obj.local_database)
res = rrna_obj.run()
file_name = os.path.basename(self.input_sequence)
with open(os.path.join(self.working_directory,"{}.{}.json".format(file_name, model_type)), 'w') as fout:
fout.write(json.dumps(res))
# with open(os.path.splitext(self.output_file)[0]+".{}.json".format(model_type), 'w') as fout:
# fout.write(json.dumps(res))
logger.info("rRNA process Done.")
def create_db_query(self):
logger.info("create_db_query")
# make_custom_db(self, in_file, out_file, db_type="diamond")
in_file = self.input_sequence
f_path, f_name = os.path.split(self.input_sequence)
out_file = os.path.join(self.working_directory, "{}.db".format(f_name))
xml_file = os.path.join(self.working_directory,"{}.blastRes.rrna.xml".format(f_name))
logger.info("DB from user query")
db_obj = Database(self.rgi_obj.local_database)
db_obj.make_custom_db(in_file, out_file)
self.blast_reference_to_db_query(out_file, xml_file)
return out_file, xml_file
def blast_reference_to_db_query(self, db, xml_file):
logger.info("blast_reference_to_db_query")
# blast all rrna db against query db
rrna_db_fasta = os.path.join(self.rgi_obj.db, "rnadb.fsa")
blast_obj = Blast(rrna_db_fasta, program='blastn', output_file=xml_file, local_database=self.rgi_obj.local_database, num_threads=self.num_threads)
blast_obj.run_custom(db)
def format_fasta(self):
f_path, f_name = os.path.split(self.input_sequence)
temp_file = os.path.join(self.working_directory, "{}.temp".format(f_name))
with open(temp_file, 'w') as fout:
for record in SeqIO.parse(self.input_sequence, 'fasta'):
header = record.id
seq = record.seq
fout.write(">{}\n{}\n".format(header, seq))
self.input_sequence = temp_file
def encode_header(self,name):
return hashlib.md5(name.encode('utf-8')).hexdigest()
def write_output(self):
file_name = os.path.basename(self.input_sequence)
logger.info(self.output_file)
with open(self.output_file, 'w') as rrna_js:
rrna_js.write(json.dumps(self.rna_results))
def run(self):
if os.path.exists(self.xml_file):
self.process_xml_file()
else:
logger.error("missing blast xml file ({}). Please check that input_type: '{}' corresponds with input file: '{}', or use the '--low_quality' flag for short contigs to predict partial genes." \
.format(self.xml_file, self.input_type, self.input_sequence))
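Filter.encode_header above hashes a FASTA header down to a stable hexadecimal id. A standalone sketch of the same idea (the free function name is hypothetical):

```python
import hashlib

# MD5 of the UTF-8 bytes gives a fixed-length (32-char), deterministic id
# that is safe to embed in file names and report keys, regardless of what
# characters appear in the original FASTA header.
def encode_header(name):
    return hashlib.md5(name.encode('utf-8')).hexdigest()
```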
from app.settings import *
import shutil
class Galaxy(object):
def __init__(self, database, debug):
self.database = database
self.debug = debug
if self.debug:
logger.setLevel(10)  # 10 == logging.DEBUG
def __repr__(self):
"""Returns Galaxy class full object."""
return "Galaxy({})".format(self.__dict__)
def load_db_galaxy(self):
"""
verify that we have the following files at the specified location:
# CARD json data file
- card.json
# diamond blast
- protein.db.dmnd
# ncbi blast
- protein.db.phr
- protein.db.pin
- protein.db.psq
# Protein fasta file
- proteindb.fsa
"""
needed_files = ['card.json','proteindb.fsa','protein.db.dmnd','protein.db.phr','protein.db.pin','protein.db.psq']
found_files = []
files = os.listdir(self.database)
for f in files:
if not f.startswith('.'):
found_files.append(f)
missing_files = list(set(needed_files) - set(found_files))
if len(missing_files) > 0:
logger.error("missing database files {}".format(missing_files))
exit()
logger.info("found all needed files: {}".format(json.dumps(needed_files, indent=2)))
# Files found - move files into _data and _db directories
logger.info("files found - now move files into _data and _db directories")
for f in os.listdir(self.database):
if not f.startswith('.') and f in needed_files:
src_path = os.path.join(self.database,f)
dst_path = ""
if f in ['card.json']:
dst_path = os.path.join(data_path,f)
else:
dst_path = os.path.join(path,f)
logger.info("copy {} to {}".format(src_path,dst_path))
try:
shutil.copy2(src_path,dst_path)
except Exception as e:
logger.error("failed to copy file: {}".format(e))
exit()
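The missing-file check in load_db_galaxy above reduces to a set difference; a minimal sketch with a hypothetical directory listing (`NEEDED` and `missing_files` are illustration names, not RGI's):

```python
# The database files Galaxy.load_db_galaxy expects to find.
NEEDED = {'card.json', 'proteindb.fsa', 'protein.db.dmnd',
          'protein.db.phr', 'protein.db.pin', 'protein.db.psq'}

def missing_files(listing):
    """Return the needed files absent from a directory listing,
    ignoring hidden files (names starting with '.')."""
    found = {f for f in listing if not f.startswith('.')}
    return sorted(NEEDED - found)
```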
from app.Base import BaseModel
from app.settings import *
class Homolog(BaseModel):
"""Class for homology searches."""
def __init__(self, input_type, loose, input_sequence, xml_file, working_directory, local_database=False):
self.input_type = input_type
self.loose = loose
self.input_sequence = input_sequence
self.xml_file = xml_file
self.output = {}
self.working_directory = working_directory
self.local_database = local_database
self.data = data_path
if self.local_database:
self.db = LOCAL_DATABASE
self.data = LOCAL_DATABASE
def __repr__(self):
"""Returns Homolog class full object."""
return "Homolog({})".format(self.__dict__)
def run(self):
"""Runs homolog search."""
blastResults = {}
predicted_genes_dict = {}
submitted_proteins_dict = {}
if self.input_type == "contig":
predicted_genes_dict = self.get_orf_dna_sequence(self.input_sequence,self.input_type)
if self.input_type == "protein":
submitted_proteins_dict = self.get_submitted_protein_sequence(self.input_sequence)
with open(os.path.join(self.data,"card.json")) as json_file:
json_data = json.load(json_file)
with open(self.xml_file, 'r') as result_handle:
blast_records = NCBIXML.parse(result_handle)
for blast_record in blast_records:
perfect = {}
strict = {}
loose = {}
for alignment in blast_record.alignments:
alignTitle = alignment.title
orfInfo = blast_record.query.encode('ascii','replace')
c = 0
barc = 0
# orfInfo is bytes, so iteration yields ints in Python 3; compare byte values
for eachc in orfInfo:
if barc >= 6:
break
elif eachc == ord('|'):
barc += 1
c += 1
else:
c += 1
orffrom = orfInfo[c:]
modelTypeID = self.extract_nth_bar(alignTitle, 0)
if modelTypeID == 40292:
spacepos = alignTitle.index(' ')
hitid = alignTitle[0:spacepos]
hitid = hitid.encode('ascii','replace')
modelDescrpt = alignTitle[alignTitle.index(' ')+1:]
underscoreinMD = modelDescrpt.index('_')
modelID = modelDescrpt[0:underscoreinMD]
seqinModel = modelDescrpt[underscoreinMD+1: modelDescrpt.index(' ')]
pass_bitscore = "{}".format(self.extract_nth_bar(alignment.title, 1))
pass_evalue = "n/a"
# logger.info("pass_evalue: {}".format(pass_evalue))
# logger.info("pass_bitscore: {}".format(pass_bitscore))
init = 0
for hsp in alignment.hsps:
querySeq = hsp.query.replace('-', '')
realQueryLength = len(querySeq)
card_sequence = ""
orf_protein_sequence = ""
try:
card_sequence = str(json_data[modelID]["model_sequences"]["sequence"][seqinModel]["protein_sequence"]["sequence"])
except Exception as e:
logger.warning("Exception : {} -> {} -> Model({}) missing in database. Please generate new database.".format(type(e), e, modelID))
if predicted_genes_dict:
if orfInfo.decode().strip() in predicted_genes_dict:
orf_protein_sequence = str(Seq(predicted_genes_dict[orfInfo.decode()], generic_dna).translate(table=11)).strip("*")
else:
orf_protein_sequence = str(Seq(predicted_genes_dict[orfInfo.decode()[:orfInfo.decode().index(' # ')]], generic_dna).translate(table=11)).strip("*")
if submitted_proteins_dict:
orf_protein_sequence = str(submitted_proteins_dict[orfInfo.decode().split(" ")[0]])
if card_sequence.upper() == orf_protein_sequence.upper():
""" Perfect hits """
# logger.info("Perfect hits")
ppinsidedict = {}
ppinsidedict["type_match"] = "Perfect"
ppinsidedict["model_id"] = modelID
ppinsidedict["orf_strand"] = self.extract_nth_bar(orfInfo.decode(), 0)
ppinsidedict["orf_start"] = self.extract_nth_bar(orfInfo.decode(), 1)
ppinsidedict["orf_end"] = self.extract_nth_bar(orfInfo.decode(), 2)
ppinsidedict["orf_from"] = orffrom.decode()
ppinsidedict["model_name"] = json_data[modelID]["model_name"]
ppinsidedict["model_type"] = json_data[modelID]["model_type"]
ppinsidedict["model_type_id"] = modelTypeID
ppinsidedict["pass_evalue"] = pass_evalue
ppinsidedict["pass_bitscore"] = pass_bitscore
ppinsidedict["ARO_accession"] = json_data[modelID]["ARO_accession"]
ppinsidedict["ARO_name"] = json_data[modelID]["ARO_name"]
ppinsidedict["ARO_category"] = json_data[modelID]["ARO_category"]
ppinsidedict["evalue"] = hsp.expect
ppinsidedict["bit_score"] = hsp.bits
ppinsidedict["max_identities"] = hsp.identities
ppinsidedict["cvterm_id"] = json_data[modelID]["model_sequences"]["sequence"][seqinModel]["NCBI_taxonomy"]["NCBI_taxonomy_cvterm_id"]
ppinsidedict["query"] = hsp.query
ppinsidedict["match"] = hsp.match
ppinsidedict["sequence_from_db"] = hsp.sbjct
ppinsidedict["sequence_from_broadstreet"] = json_data[modelID]["model_sequences"]["sequence"][seqinModel]["protein_sequence"]["sequence"]
ppinsidedict["dna_sequence_from_broadstreet"] = json_data[modelID]["model_sequences"]["sequence"][seqinModel]["dna_sequence"]["sequence"]
if self.input_type == 'contig':
ppinsidedict["query_start"] = self.extract_nth_hash(orfInfo.decode(), 1) + (hsp.query_start - 1)*3
ppinsidedict["query_end"] = self.extract_nth_hash(orfInfo.decode(), 1) + (hsp.query_start - 1)*3 + realQueryLength*3 - 1
ppinsidedict["orf_strand"] = self.extract_nth_hash(orfInfo.decode(), 3)
ppinsidedict["orf_start"] = self.extract_nth_hash(orfInfo.decode(), 1)
ppinsidedict["orf_end"] = self.extract_nth_hash(orfInfo.decode(), 2)
ppinsidedict["orf_from"] = self.extract_nth_hash(orfInfo.decode(), 0).rstrip()
if orfInfo.decode().split(' # ')[0] in predicted_genes_dict:
ppinsidedict["orf_dna_sequence"] = predicted_genes_dict[orfInfo.decode().split(' # ')[0]]
ppinsidedict["orf_prot_sequence"] = str(Seq(predicted_genes_dict[orfInfo.decode().split(' # ')[0]], generic_dna).translate(table=11)).strip("*")
else:
ppinsidedict["orf_dna_sequence"] = ""
ppinsidedict["orf_prot_sequence"] = ""
elif self.input_type == 'protein':
ppinsidedict["query_start"] = hsp.query_start
ppinsidedict["query_end"] = hsp.query_start + realQueryLength
ppinsidedict["query_from"] = blast_record.query
ppinsidedict["orf_prot_sequence"] = orf_protein_sequence
elif self.input_type == 'read':
pass
ppinsidedict["perc_identity"] = float(format(float(ppinsidedict["max_identities"]*100) / len(ppinsidedict["query"]),'.2f'))
perfect["{}|hsp_num:{}".format(hitid.decode(),init)] = ppinsidedict
init += 1
elif hsp.bits >= float(pass_bitscore):
""" Strict hits """
# logger.info("Strict hits")
insidedict = {}
insidedict["type_match"] = "Strict"
insidedict["orf_strand"] = self.extract_nth_bar(orfInfo.decode(), 0)
insidedict["orf_start"] = self.extract_nth_bar(orfInfo.decode(), 1)
insidedict["orf_end"] = self.extract_nth_bar(orfInfo.decode(), 2)
insidedict["orf_from"] = orffrom.decode()
insidedict["model_name"] = json_data[modelID]["model_name"]
insidedict["model_type"] = json_data[modelID]["model_type"]
insidedict["model_type_id"] = modelTypeID
insidedict["model_id"] = modelID
insidedict["pass_evalue"] = pass_evalue
insidedict["pass_bitscore"] = pass_bitscore
insidedict["ARO_accession"] = json_data[modelID]["ARO_accession"]
insidedict["ARO_name"] = json_data[modelID]["ARO_name"]
insidedict["ARO_category"] = json_data[modelID]["ARO_category"]
insidedict["evalue"] = hsp.expect
insidedict["bit_score"] = hsp.bits
insidedict["max_identities"] = hsp.identities
insidedict["cvterm_id"] = json_data[modelID]["model_sequences"]["sequence"][seqinModel]["NCBI_taxonomy"]["NCBI_taxonomy_cvterm_id"]
insidedict["query"] = hsp.query
insidedict["match"] = hsp.match
insidedict["sequence_from_db"] = hsp.sbjct
insidedict["sequence_from_broadstreet"] = json_data[modelID]["model_sequences"]["sequence"][seqinModel]["protein_sequence"]["sequence"]
insidedict["dna_sequence_from_broadstreet"] = json_data[modelID]["model_sequences"]["sequence"][seqinModel]["dna_sequence"]["sequence"]
if self.input_type == 'contig':
insidedict["query_start"] = self.extract_nth_hash(orfInfo.decode(), 1) + (hsp.query_start - 1)*3
insidedict["query_end"] = self.extract_nth_hash(orfInfo.decode(), 1) + (hsp.query_start - 1)*3 + realQueryLength*3 - 1
insidedict["orf_strand"] = self.extract_nth_hash(orfInfo.decode(), 3)
insidedict["orf_start"] = self.extract_nth_hash(orfInfo.decode(), 1)
insidedict["orf_end"] = self.extract_nth_hash(orfInfo.decode(), 2)
insidedict["orf_from"] = self.extract_nth_hash(orfInfo.decode(), 0).rstrip()
if orfInfo.decode().split(' # ')[0] in predicted_genes_dict:
insidedict["orf_dna_sequence"] = predicted_genes_dict[orfInfo.decode().split(' # ')[0]]
insidedict["orf_prot_sequence"] = str(Seq(predicted_genes_dict[orfInfo.decode().split(' # ')[0]], generic_dna).translate(table=11)).strip("*")
else:
insidedict["orf_dna_sequence"] = ""
insidedict["orf_prot_sequence"] = ""
elif self.input_type == 'protein':
insidedict["query_start"] = hsp.query_start
insidedict["query_end"] = hsp.query_start + realQueryLength
insidedict["query_from"] = blast_record.query
insidedict["orf_prot_sequence"] = orf_protein_sequence
elif self.input_type == 'read':
pass
insidedict["perc_identity"] = float(format(float(insidedict["max_identities"]*100) / len(insidedict["query"]),'.2f'))
strict["{}|hsp_num:{}".format(hitid.decode(),init)] = insidedict
init += 1
else:
""" Loose hits """
# logger.info("Loose hits")
linsidedict = {}
linsidedict["type_match"] = "Loose"
linsidedict["orf_strand"] = self.extract_nth_bar(orfInfo.decode(), 0)
linsidedict["orf_start"] = self.extract_nth_bar(orfInfo.decode(), 1)
linsidedict["orf_end"] = self.extract_nth_bar(orfInfo.decode(), 2)
linsidedict["orf_from"] = orffrom.decode().strip()
linsidedict["model_name"] = json_data[modelID]["model_name"]
linsidedict["model_type"] = json_data[modelID]["model_type"]
linsidedict["model_type_id"] = modelTypeID
linsidedict["pass_evalue"] = pass_evalue
linsidedict["pass_bitscore"] = pass_bitscore
linsidedict["model_id"] = modelID
linsidedict["ARO_accession"] = json_data[modelID]["ARO_accession"]
linsidedict["ARO_name"] = json_data[modelID]["ARO_name"]
linsidedict["ARO_category"] = json_data[modelID]["ARO_category"]
linsidedict["evalue"] = hsp.expect
linsidedict["max_identities"] = hsp.identities
linsidedict["bit_score"] = hsp.bits
linsidedict["cvterm_id"] = json_data[modelID]["model_sequences"]["sequence"][seqinModel]["NCBI_taxonomy"]["NCBI_taxonomy_cvterm_id"]
linsidedict["query"] = hsp.query
linsidedict["match"] = hsp.match
linsidedict["sequence_from_db"] = hsp.sbjct
linsidedict["sequence_from_broadstreet"] = json_data[modelID]["model_sequences"]["sequence"][seqinModel]["protein_sequence"]["sequence"]
linsidedict["dna_sequence_from_broadstreet"] = json_data[modelID]["model_sequences"]["sequence"][seqinModel]["dna_sequence"]["sequence"]
if self.input_type == 'contig':
linsidedict["query_start"] = self.extract_nth_hash(orfInfo.decode(), 1) + (hsp.query_start - 1)*3
linsidedict["query_end"] = self.extract_nth_hash(orfInfo.decode(), 1) + (hsp.query_start - 1)*3 + realQueryLength*3 - 1
linsidedict["orf_strand"] = self.extract_nth_hash(orfInfo.decode(), 3)
linsidedict["orf_start"] = self.extract_nth_hash(orfInfo.decode(), 1)
linsidedict["orf_end"] = self.extract_nth_hash(orfInfo.decode(), 2)
linsidedict["orf_from"] = self.extract_nth_hash(orfInfo.decode(), 0)
if orfInfo.decode().split(' # ')[0] in predicted_genes_dict:
linsidedict["orf_dna_sequence"] = predicted_genes_dict[orfInfo.decode().split(' # ')[0]]
linsidedict["orf_prot_sequence"] = str(Seq(predicted_genes_dict[orfInfo.decode().split(' # ')[0]], generic_dna).translate(table=11)).strip("*")
else:
linsidedict["orf_dna_sequence"] = ""
linsidedict["orf_prot_sequence"] = ""
elif self.input_type == 'protein':
linsidedict["query_start"] = hsp.query_start
linsidedict["query_end"] = hsp.query_start + realQueryLength
linsidedict["query_from"] = blast_record.query
linsidedict["orf_prot_sequence"] = orf_protein_sequence
elif self.input_type == 'read':
pass
linsidedict["perc_identity"] = float(format(float(linsidedict["max_identities"]*100) / len(linsidedict["query"]), '.2f'))
loose["{}|hsp_num:{}".format(hitid.decode(),init)] = linsidedict
init += 1
blastResults = self.results(blastResults, blast_record.query, perfect, strict , loose)
return blastResults