proteinortho-nf
This is a Nextflow implementation of Proteinortho (available on GitLab).
Proteinortho is a tool to detect orthologous genes within different species.
Input: Multiple FASTA files (orange boxes) with many proteins/genes (circles). Output: Groups (*.proteinortho) and pairs (*.proteinortho-graph) of orthologous proteins/genes.
To do so, it compares the similarities of the given gene sequences and clusters them to find significant groups. The algorithm was designed to handle large-scale data and can be applied to hundreds of species at once. Details can be found in (doi:10.1186/1471-2105-12-124). To enhance the prediction accuracy, the relative order of genes (synteny) can be used as an additional feature for the discrimination of orthologs. The corresponding extension, namely PoFF (doi:10.1371/journal.pone.0105015), is already built into Proteinortho.
The general workflow of proteinortho:
First an initial all vs. all comparison between all proteins of all species is performed to determine protein similarities (upper right image).
The second stage is the clustering of similar genes to meaningful co-orthologous groups (lower right image).
Connected components within this graph can be considered as putative co-orthologous groups in theory and are returned in the output (lower left image).
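As a rough illustration of the last step only, the sketch below groups proteins into the connected components of a similarity graph. It is not Proteinortho's actual clustering code: the function name `connected_components` and the protein pairs are hypothetical and were made up for this example.

```python
from collections import defaultdict


def connected_components(edges):
    """Group node IDs into the connected components of an undirected graph."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from `start`.
        stack = [start]
        seen.add(start)
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components


# Hypothetical similarity pairs (IDs styled after the files in test/).
edges = [("C_10", "E_10"), ("E_10", "L_10"), ("C_11", "M_11")]
groups = connected_components(edges)
for group in groups:
    print(group)
```

Each printed list corresponds to one putative co-orthologous group in this simplified picture.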
SYNOPSIS
Modify nextflow.conf to specify the input and output directories, then run:
nextflow main.nf
EXAMPLES
Calling proteinortho
Sequences are typically given in plain FASTA format, like the files in test/
test/C.faa:
>C_10
VVLCRYEIGGLAQVLDTQFDMYTNCHKMCSADSQVTYKEAANLTARVTTDRQKEPLTGGY
HGAKLGFLGCSLLRSRDYGYPEQNFHAKTDLFALPMGDHYCGDEGSGNAYLCDFDNQYGR
...
test/E.faa:
>E_10
CVLDNYQIALLRNVLPKLFMTKNFIEGMCGGGGEENYKAMTRATAKSTTDNQNAPLSGGF
NDGKMGTGCLPSAAKNYKYPENAVSGASNLYALIVGESYCGDENDDKAYLCDVNQYAPNV
...
To run proteinortho for these sequences, simply call
perl proteinortho6.pl test/C.faa test/E.faa test/L.faa test/M.faa
To give the outputs the name 'test', call
perl proteinortho6.pl -project=test test/*faa
To use blast instead of the default diamond, call
perl proteinortho6.pl -project=test -p=blastp+ test/*faa
If installed with make install, you can also call
proteinortho -project=test -p=blastp+ test/*faa
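When proteinortho is driven from a script, calls like the ones above can be assembled programmatically. The following is a minimal sketch that uses only the options shown in this section (-project and -p). The helper name `build_command` is hypothetical, and the command is only built and printed, not executed.

```python
import shlex


def build_command(fasta_files, project=None, program=None, executable="proteinortho"):
    """Assemble a proteinortho call as an argument list (nothing is executed)."""
    command = [executable]
    if project is not None:
        command.append(f"-project={project}")
    if program is not None:
        command.append(f"-p={program}")
    command.extend(fasta_files)
    return command


command = build_command(
    ["test/C.faa", "test/E.faa"], project="test", program="blastp+"
)
# Print a shell-quoted form of the argument list for inspection.
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` once proteinortho is installed and on the PATH.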
Hints
Using .faa to indicate that your file contains amino acids and .fna to show it contains nucleotides makes life much easier but is not required.
Sequence IDs must be unique within a single FASTA file. Consider renaming otherwise. Note: Until version 5.15, sequence IDs had to be unique across the whole dataset. Proteinortho now keeps track of name and species to avoid the necessity of renaming.
You need write permissions in the directory of your FASTA files as Proteinortho will create blast databases. If this is not the case, consider using symbolic links to the FASTA files.
The directory src/ contains useful tools, e.g. proteinortho_grab_proteins.pl, which fetches protein sequences of orthologous groups from the Proteinortho output table. (These files are installed during 'make install'.)
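Two of the hints above (unique sequence IDs per file, and symbolic links when the FASTA directory is not writable) can be checked before a run. The sketch below is one possible way to do this and is not part of Proteinortho. The helper names `duplicate_ids` and `link_fasta_files` are hypothetical, the sequence ID is assumed to be the first whitespace-delimited token after '>', and the demo uses made-up files in a temporary directory.

```python
import tempfile
from collections import Counter
from pathlib import Path


def duplicate_ids(fasta_path):
    """Return the sequence IDs that occur more than once in one FASTA file.

    Assumption: the ID is the first whitespace-delimited token after '>'.
    """
    counts = Counter()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].split()
                if header:
                    counts[header[0]] += 1
    return sorted(seq_id for seq_id, n in counts.items() if n > 1)


def link_fasta_files(source_dir, work_dir, pattern="*.faa"):
    """Create symbolic links in a writable work_dir for each FASTA file."""
    source_dir = Path(source_dir).resolve()
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for fasta in sorted(source_dir.glob(pattern)):
        link = work_dir / fasta.name
        # Do not overwrite anything that is already present.
        if not link.exists() and not link.is_symlink():
            link.symlink_to(fasta)
        links.append(link)
    return links


# Demo with made-up sequences in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "readonly"
    source.mkdir()
    (source / "C.faa").write_text(">C_10\nVVLC\n>C_11\nHGAK\n>C_10\nRYEI\n")
    (source / "E.faa").write_text(">E_10\nCVLD\n")

    links = link_fasta_files(source, Path(tmp) / "work")
    linked_names = [link.name for link in links]
    all_symlinks = all(link.is_symlink() for link in links)
    duplicates = {link.name: duplicate_ids(link) for link in links}
    print(duplicates)
```

Proteinortho would then be pointed at the linked files in the writable directory, so that the databases it creates land there rather than next to the original FASTA files.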
Credit where credit is due
- The all-versus-all BLAST-analysis (-step=2) is only possible with (one of) the following underlying algorithms:
  - NCBI BLAST+ or NCBI BLAST legacy (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download)
  - Diamond (doi:10.1038/nmeth.3176, https://github.com/bbuchfink/diamond)
  - Last (doi:10.1101/gr.113985.110, http://last.cbrc.jp/)
  - Rapsearch2 (doi:10.1093/bioinformatics/btr595, https://github.com/zhaoyanswill/RAPSearch2)
  - Topaz (doi:10.1186/s12859-018-2290-3, https://github.com/ajm/topaz)
  - usearch, ublast (doi:10.1093/bioinformatics/btq461, https://www.drive5.com/usearch/download.html)
  - blat (http://hgdownload.soe.ucsc.edu/admin/)
  - mmseqs2 (doi:10.1038/nbt.3988, https://github.com/soedinglab/MMseqs2)
- The clustering step (-step=3) got a huge speedup with the integration of LAPACK (Univ. of Tennessee; Univ. of California, Berkeley; Univ. of Colorado Denver; and NAG Ltd., http://www.netlib.org/lapack/)
- The HTML output of the *proteinortho.tsv (orthology groups) is enhanced by Clusterize.js (https://github.com/NeXTs/Clusterize.js), which reduces the scroll lag.
ONLINE INFORMATION
For download and online information, see https://www.bioinf.uni-leipzig.de/Software/proteinortho/ or https://gitlab.com/paulklemm_PHD/proteinortho
REFERENCES
Lechner, M., Findeisz, S., Steiner, L., Marz, M., Stadler, P. F., & Prohaska, S. J. (2011). Proteinortho: detection of (co-)orthologs in large-scale analysis. BMC Bioinformatics, 12(1), 124.