Skip to content
Commits on Source (5)
[This file has been replaced by README.md]
December, 2017
The most up-to-date version information on FASTA versions is available
in README.md and doc/readme.v36 .
July, 2015
This version of the FASTA programs is fasta-36.3.8. Since March, 2011
......
......@@ -12,9 +12,26 @@ includes programs for aligning translated DNA sequences against
proteins (`fastx`, `fasty` are equivalent to `blastx`, `tfastx`,
`tfasty` are similar to `tblastn`).
####September, 2016
####December, 2017
The current FASTA version is fasta-36.3.8f, Dec. 2017
The statistics routines for normally distributed scores (ggsearch36,
glsearch36) are more robust to very low E()-value thresholds.
####Sept, 2017
The current FASTA version is fasta-36.3.8f, Sept. 2017
If the -S option is used and a query sequence has no upper case
letters, it is re-read with lower-case letters converted to upper-case.
The current FASTA version is fasta-36.3.8e.
####May, 2017
The current FASTA version is fasta-36.3.8f, May. 2017
Various bugs in sub-alignment scoring corrected and support for the
EBI SP:GSTM1_HUMAN P09488 added. The format for the $SRCH_URL and
$SRCH_URL2 format strings has changed to enable pairwise alignment.
####September, 2016
The fasta-36.3.6e version includes a new directory, `psisearch2`, with
scripts to run iterative PSSM (PSI-BLAST or SSEARCH36) searches using
......
Placeholder file to create destination for program binaries.
fasta3 (36.3.8f-1) UNRELEASED; urgency=low
fasta3 (36.3.8g-1) unstable; urgency=low
* Initial release (Closes: #895740)
* New upstream version.
* Fixed remaining lintian error.
* Bumped policy to 4.1.4.
[ Steffen Moeller ]
* Initial release (Closes: #coming)
-- Steffen Moeller <moeller@debian.org> Sun, 15 Apr 2018 16:08:39 +0200
fasta3 (36.3.8f-1) UNRELEASED; urgency=low
[ Andreas Tille ]
* Moved packaging from SVN to Git
......
......@@ -4,9 +4,9 @@ Priority: optional
Maintainer: Debian Med Packaging Team <debian-med-packaging@lists.alioth.debian.org>
Uploaders: Steffen Moeller <moeller@debian.org>
Build-Depends: debhelper (>= 9)
Standards-Version: 3.9.6
Vcs-Browser: https://anonscm.debian.org/cgit/debian-med/fasta3.git
Vcs-Git: https://anonscm.debian.org/git/debian-med/fasta3.git
Standards-Version: 4.1.4
Vcs-Browser: https://salsa.debian.org/med-team/fasta3
Vcs-Git: https://salsa.debian.org/med-team/fasta3.git
Homepage: http://fasta.bioch.virginia.edu
Package: fasta3
......
......@@ -12,4 +12,4 @@ Abstract: This documentation describes the version 36 of the FASTA program
Section: Science/Biology
Format: PDF
Files: /usr/share/doc/fasta3/fasta_guide.pdf
Files: /usr/share/doc/fasta3-doc/fasta_guide.pdf
Index: fasta3-36.3.7a/make/Makefile
Index: fasta3/make/Makefile
===================================================================
--- fasta3-36.3.7a.orig/make/Makefile
+++ fasta3-36.3.7a/make/Makefile
@@ -34,6 +34,7 @@
--- fasta3.orig/make/Makefile
+++ fasta3/make/Makefile
@@ -34,6 +34,7 @@ THR_SUBS = pthr_subs2
THR_LIBS = -lpthread
THR_CC =
......@@ -10,33 +10,34 @@ Index: fasta3-36.3.7a/make/Makefile
XDIR = /seqprg/bin
DROPGSW_NA_O = dropgsw2.o wm_align.o calcons_sw.o
Index: fasta3-36.3.7a/make/Makefile.linux64_sse2
Index: fasta3/make/Makefile.linux64_sse2
===================================================================
--- fasta3-36.3.7a.orig/make/Makefile.linux64_sse2
+++ fasta3-36.3.7a/make/Makefile.linux64_sse2
@@ -12,7 +12,7 @@
--- fasta3.orig/make/Makefile.linux64_sse2
+++ fasta3/make/Makefile.linux64_sse2
@@ -12,7 +12,8 @@
SHELL=/bin/bash
-CC = gcc -g -O -msse2
+CC = gcc -g -O -msse2 $(CPPFLAGS)
+CC = gcc
+CFLAGS = -g -O -msse2 $(CPPFLAGS)
#CC= gcc -pg -g -O -msse2 -ffast-math
#CC = gcc -g -DDEBUG -msse2
@@ -25,7 +25,7 @@
#CC=gcc -Wall -pedantic -ansi -g -msse2 -DDEBUG
@@ -24,7 +25,7 @@ CC = gcc -g -O -msse2
# standard options
-CFLAGS= -DSHOW_HELP -DSHOWSIM -DUNIX -DTIMES -DHZ=100 -DMAX_WORKERS=8 -DTHR_EXIT=pthread_exit -DPROGRESS -DM10_CONS -DFASTA_HOST='"your_fasta_host_here"' -D_REENTRANT -DHAS_INTTYPES -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -DUSE_FSEEKO -DSAMP_STATS -DPGM_DOC -DUSE_MMAP -D_LARGEFILE64_SOURCE -DBIG_LIB64
+CFLAGS+= -DSHOW_HELP -DSHOWSIM -DUNIX -DTIMES -DHZ=100 -DMAX_WORKERS=8 -DTHR_EXIT=pthread_exit -DPROGRESS -DM10_CONS -DFASTA_HOST='"your_fasta_host_here"' -D_REENTRANT -DHAS_INTTYPES -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -DUSE_FSEEKO -DSAMP_STATS -DPGM_DOC -DUSE_MMAP -D_LARGEFILE64_SOURCE -DBIG_LIB64
-CFLAGS= -DSHOW_HELP -DSHOWSIM -DUNIX -DTIMES -DHZ=100 -DMAX_WORKERS=8 -DTHR_EXIT=pthread_exit -DM10_CONS -D_REENTRANT -DHAS_INTTYPES -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -DUSE_FSEEKO -DSAMP_STATS -DPGM_DOC -DUSE_MMAP -D_LARGEFILE64_SOURCE -DBIG_LIB64
+CFLAGS += -DSHOW_HELP -DSHOWSIM -DUNIX -DTIMES -DHZ=100 -DMAX_WORKERS=8 -DTHR_EXIT=pthread_exit -DM10_CONS -D_REENTRANT -DHAS_INTTYPES -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -DUSE_FSEEKO -DSAMP_STATS -DPGM_DOC -DUSE_MMAP -D_LARGEFILE64_SOURCE -DBIG_LIB64
# -I/usr/include/mysql -DMYSQL_DB
# -DSUPERFAMNUM -DSFCHAR="'|'"
Index: fasta3-36.3.7a/make/Makefile36m.common
Index: fasta3/make/Makefile36m.common
===================================================================
--- fasta3-36.3.7a.orig/make/Makefile36m.common
+++ fasta3-36.3.7a/make/Makefile36m.common
@@ -34,7 +34,7 @@
--- fasta3.orig/make/Makefile36m.common
+++ fasta3/make/Makefile36m.common
@@ -34,7 +34,7 @@ NGETLIB=nmgetlib
# and "-L/usr/lib64/mysql -lmysqlclient -lz" in LIB_M
# some systems may also require a LD_LIBRARY_PATH change
......
Index: fasta3/src/dropnnw2.c
===================================================================
--- fasta3.orig/src/dropnnw2.c
+++ fasta3/src/dropnnw2.c
@@ -575,7 +575,7 @@ void do_work (const unsigned char *aa0,
* be rerun with 16 bits. If it is more, and we have tried at least
* 500 sequences, we switch off the 8-bit mode.
*/
- if (score == OVERFLOW) {
+ if (score == OVERFLOW_SCORE) {
f_str->done_16bit++;
if(f_str->done_8bit>500 && (3*f_str->done_16bit)>(f_str->done_8bit))
f_str->try_8bit = 0;
......@@ -8,7 +8,8 @@ Reference:
Pages: 2444-8
PMID: 3162770
URL: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC280013/
eprint: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC280013/pdf/pnas00260-0036.pdf
ePrint: >
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC280013/pdf/pnas00260-0036.pdf
- Author: William R. Pearson
Title: "Effective protein sequence comparison"
Journal: Methods Enzymol.
......@@ -18,3 +19,10 @@ Reference:
DOI: 10.1016/S0076-6879(96)66017-0
PMID: 8743688
URL: http://www.sciencedirect.com/science/article/pii/S0076687996660170
Registry:
- Name: OMICtools
Entry: NA
- Name: SciCrunch
Entry: NA
- Name: bio.tools
Entry: NA
......@@ -6,7 +6,7 @@ Changes in **fasta-36.3.8d** released 13-April-2016:
1. Various bug fixes to `pssm_asn_subs.c` that avoid coredumps when
reading NCBI PSSM ASN.1 binary files. `pssm_asn_subs.c` can now read
UUPACAA sequences.
IUPACAA sequences.
2. default gap penalties for VT40 (from -14/-2 to -13/-1), VT80 (from
-14/-2 to -11/-1), and VT120 (from -10/-1 to 11/-1) have changed
......
## The FASTA package - protein and DNA sequence similarity searching and alignment programs
Changes in **fasta-36.3.8f** released 31-Dec-2017
1. (December, 2017) -- Make statistical thresholds more robust for
small E()-values with normally distributed scores (ggsearch36,
glsearch36).
2. (September, 2017) Treat all lower-case queries as uppercase with -S option.
3. (May, 2017) Improvements/fixes to sub-alignment scoring strategies.
4. Improvements/fixes to psisearch2 scripts.
For more detailed information, see `doc/readme.v36`.
No preview for this file type
......@@ -1470,8 +1470,8 @@ environment variables are used in HTML mode (\texttt{-m 6}) to provide
links from the sequence alignment (see the links at
\url{http://fasta.bioch.virginia.edu/fasta_www2/}). \texttt{REF\_URL}
is associated with the \texttt{Entrez Lookup} link; \texttt{SRCH\_URL}
with the \texttt{Re-search database} link, and \texttt{SRCH\_URL1}
with the \texttt{General re-search} link. In each case, the text
with the \texttt{General re-search} link, and \texttt{SRCH\_URL1}
with the \texttt{Pairwise alignment} link. In each case, the text
corresponds to a HTML URL, but with positions containing the
\texttt{\%s} or \texttt{\%ld} (for numbers) part of a 'C'
\texttt{sprintf()} call for specific variables. \texttt{REF\_URL} uses
......
......@@ -6,6 +6,24 @@ multiple high-scoring alignments to be shown, rather than just one.
This is the main functional difference between FASTA and BLAST -
BLAST could show multiple HSPs, FASTA did not.
>>Dec. 30, 2017 [released as fasta-36.3.8g]
[scaleswn.c]
Replace np_to_z() with np1_to_z(), which does not substract low
probability from 1.0, thus allowing accurate z-values for very low
probabilities.
>>Sept. 26, 2017
[comp_lib9.c, compacc2e.c]
Previously, if the query sequence was all lower-case letters (seg-ed)
and the '-S' option specified, the search would effectively be done
with a zero-length sequence, which broke the statistics. The code has
been modified to convert all lower-case queries to upper-case when -S
is used.
[scaleswn.c]
Fixed problem with scaleswn.c/ag_stats() not setting parameters
properly when matrix was unknown.
>>May 23, 2017 [released as fasta-36.3.8f]
[url_subs.c]
A small, but major change in the output available to the $SRCH_URL and
......@@ -18,7 +36,7 @@ pairwise alignment becomes possible.
>>May 23, 2017
[scripts/ann_feats2ipr.pl,ann_feats_up_www2.pl,test_ann_scripts.sh src/defs.h]
Changes to ensure that EBI format databases, which place the ID before
the accession, e.g. SP:GSTM1_HUMAN P09388, can be processed properly
the accession, e.g. SP:GSTM1_HUMAN P09488, can be processed properly
by annotation scripts. This involved displaying more of the
description line, so that the accession field is included, in the
annot_XXXXX file.
......
......@@ -50,7 +50,7 @@ use Getopt::Long;
my ($shelp, $help, $m_format, $evalue, $qvalue, $domain_bound) = (0, 0, "m8CB", 0.001, 30.0,0);
my ($query_file, $bound_file_in, $bound_file_only, $bound_file_out, $masked_lib_out,$mask_type_end, $mask_type_int) = ("","","","","","","");
my ($query_file, $sel_file, $bound_file_in, $bound_file_only, $bound_file_out, $masked_lib_out,$mask_type_end, $mask_type_int) = ("","","","","","","","");
my $query_lib_r = 0;
my ($eval2_fmt, $eval2) = (0,"");
......@@ -62,6 +62,9 @@ GetOptions(
"expect=f" => \$evalue,
"qvalue=f" => \$qvalue,
"format=s" => \$m_format,
"selected_file_in=s" => \$sel_file,
"sel_file_in=s" => \$sel_file,
"sel_file=s" => \$sel_file,
"m_format=s" => \$m_format,
"mformat=s" => \$m_format,
"bound_file_in=s" => \$bound_file_in,
......@@ -120,6 +123,30 @@ if (! $query_file || !$query_len) {
die "query sequence required";
}
my %sel_accs = ();
my $have_sel_accs = 0;
if ($sel_file) {
if (open (my $sfd, $sel_file)) {
while (my $line = <$sfd>) {
chomp $line;
next if $line =~ m/^#/;
if ($line =~ m/\t/) {
my @data = split(/\t/,$line);
$sel_accs{$data[0]} = $data[1];
}
else {
$sel_accs{$line} = 1;
}
}
close($sfd);
$evalue = 1000000.0;
$have_sel_accs = 1;
}
else {
warn "Cannot open selected sequence file: $sel_file";
}
}
push @multi_names, $query_acc;
$acc_names{$query_acc} = 1;
$multi_align{$query_acc} = btop2alignment($query_seq_r, $query_len, {BTOP=>$query_len, q_start=>1, q_end=>$query_len}, 0);
......@@ -242,6 +269,10 @@ while (my $line = <>) {
$subj_acc =~ s/^gi\|\d+\|(\w+\|\w+)\|?\w+/$1/;
}
if ($have_sel_accs) {
next unless ($sel_accs{$hit_data{'s_seqid'}});
}
# a better solution would be to rename the q_seqid, or at least to
# check for identity
# next if ($hit_data{q_seqid} eq $hit_data{s_seqid});
......
......@@ -25,21 +25,20 @@ use Pod::Usage;
# implementation of simple shell script to do iterative searches
#
# logic:
# (1) do initial search
# (1) do initial search or take results from previous search with --prev_m89res
# (2) use results of initial search to produce MSA/PSSM for next search
# (3) do PSSM search
# (4) use results of PSSM search to produce MSA/PSSM for iterative step 3
# (4) repeat at step (2)
#
################
#
# command:
# psisearch2_msa.pl --query query_file --db database --num_iter N --evalue 0.002 --no_msa --int_mask none/query/random --end_mask none/query/random --tmp_dir results/ --domain --align --out_suffix none --pgm ssearch/psiblast
# psisearch2_msa.pl --query query.file --db database.file --num_iter N --pssm_evalue 0.002 --int_mask none/query/random --end_mask none/query/random --tmp_dir results/ --domain --align --out_suffix none --pgm ssearch/psiblast --prev_m89res prev_results.itx.m8CB.file --sel_res selected_accs.file --prev_bounds boundary.file
#
################
use vars qw( $query_file $db_file $num_iter $evalue $int_mask $end_mask $query_mask $no_msa $tmp_dir $dom_flag $align_flag $suffix $srch_pgm $file_out $help $shelp $error_log $rm_flag $annot_type $quiet);
use vars qw( $prev_msa $next_msa $prev_hitdb $next_hitdb $prev_pssm $next_pssm $prev_bound_in $next_bound_out $tmp_file_list $save_all $delete_bnd $delete_tmp);
use vars qw( $query_file $db_file $num_iter $pssm_evalue $srch_evalue $int_mask $end_mask $query_mask $tmp_dir $dom_flag $align_flag $suffix $srch_pgm $file_out $help $shelp $error_log $rm_flag $annot_type $quiet);
use vars qw( $prev_m89res $m_format $prev_sel_res $this_iter $prev_msa $next_msa $prev_hitdb $next_hitdb $prev_pssm $next_pssm $prev_bound $next_bound_out $tmp_file_list $save_all $delete_bnd $delete_tmp $use_stdout);
################
# locations of required programs:
......@@ -54,21 +53,22 @@ my $ssearch_bin = "$pgm_bin/ssearch36";
my $psiblast_bin = "$pgm_bin/psiblast";
my $makeblastdb_bin = "$pgm_bin/makeblastdb";
my $datatool_bin = "$pgm_bin/datatool -m $pgm_data/NCBI_all.asn";
my $align2msa_lib = "m89_btop_msa2.pl";
my $align2msa_lib = "$pgm_bin/m89_btop_msa2.pl";
my %srch_subs = ('ssearch' => \&get_ssearch_cmd,
'psiblast' => \&get_psiblast_cmd,
);
my %annot_cmds = ('rpd3' => qq("\!../scripts/ann_pfam28.pl --pfacc --db RPD3 --vdoms --split_over"),
'rpd3nv' => qq("\!../scripts/ann_pfam28.pl --pfacc --db RPD3 --split_over"),
'rpd3nvn' => qq("\!../scripts/ann_pfam28.pl --pfacc --db RPD3 --split_over --neg"),
'pfam' => qq("\!../scripts/ann_pfam30.pl --pfacc --vdoms --split_over"));
my %annot_cmds = ('rpd3' => qq("\!ann_pfam28.pl --pfacc --db RPD3 --vdoms --split_over"),
'rpd3nv' => qq("\!ann_pfam28.pl --pfacc --db RPD3 --split_over"),
'rpd3nvn' => qq("\!ann_pfam28.pl --pfacc --db RPD3 --split_over --neg"),
'pfam' => qq("\!ann_pfam30.pl --vdoms --split_over --neg")
);
($num_iter, $evalue, $dom_flag, $align_flag, $int_mask, $end_mask, $query_mask, $srch_pgm, $tmp_dir, $error_log, $annot_type, $quiet) =
( 5, 0.002, 0, 0, 'none', 'none', 0, 'ssearch','',0, 0, "", 0);
($num_iter, $pssm_evalue, $srch_evalue, $dom_flag, $align_flag, $int_mask, $end_mask, $query_mask, $srch_pgm, $tmp_dir, $error_log, $annot_type, $quiet) =
( 5, 0.002, 5.0, 0, 0, 'none', 'none', 0, 'ssearch','',0, 0, "", 0);
($save_all, $tmp_file_list, $delete_bnd, $delete_tmp) = (0, "", 0, 0);
($prev_m89res, $m_format, $prev_sel_res, $prev_bound, $this_iter, $use_stdout) = ("","", "","", 1, 0);
my $pgm_command = "# ".join(" ",($0,@ARGV));
print STDERR "# ",join(" ",($0,@ARGV)),"\n" if ($error_log);
......@@ -78,25 +78,33 @@ GetOptions(
'db|database=s' => \$db_file,
'suffix|out_suffix=s' => \$suffix,
'dir=s' => \$tmp_dir,
'evalue=f' => \$evalue,
'pssm_evalue=f' => \$pssm_evalue,
'search_evalue=f' => \$srch_evalue,
'annot_db=s' => \$annot_type,
'out_name=s' => \$file_out,
'use_stdout' => \$use_stdout,
'this_iter=s' => \$this_iter,
'iter=i' => \$num_iter,
'prev_m89res=s' => \$prev_m89res,
'sel_res_in=s' => \$prev_sel_res,
'sel_accs=s' => \$prev_sel_res,
'sel_file=s' => \$prev_sel_res,
'sel_file_in=s' => \$prev_sel_res,
# 'in_msa=s' => \$prev_msa,
# 'out_msa=s' => \$next_msa,
# 'in_hitdb=s' => \$prev_hitdb,
# 'out_hitdb=s' => \$next_hitdb,
'in_pssm=s' => \$prev_pssm,
# 'out_pssm=s' => \$next_pssm,
'in_bounds=s' => \$prev_bound_in,
'prev_bounds=s' => \$prev_bound,
'out_bounds=s' => \$next_bound_out,
'num_iter|max_iter=i' => \$num_iter,
# 'no_msa' => \$no_msa,
'dom|domain' => \$dom_flag,
'align' => \$align_flag,
'query_seed|query_mask' => \$query_mask,
'int_mask|int-mask|int_seed|int-seed=s' => \$int_mask,
'end_mask|end-mask|end_seed|end-seed=s' => \$end_mask,
'm_format=s' => \$m_format,
'pgm=s' => \$srch_pgm,
'quiet' => \$quiet,
'q' => \$quiet,
......@@ -116,6 +124,8 @@ pod2usage(exitstatus => 0, verbose => 2) if $help;
pod2usage(1) unless $query_file && -r $query_file; # need a query
pod2usage(1) unless $db_file ; # need a database
my %sel_hits = ();
my @del_file_ext = qw(msa psibl_out hit_db asntxt asnbin);
if ($srch_pgm =~ m/psiblast/) {
......@@ -127,10 +137,7 @@ if ($query_mask) {
$end_mask='query' unless $end_mask ne 'none';
}
if ($delete_tmp) {
$delete_bnd = 1;
}
elsif ($save_all) {
if (!$delete_tmp && $save_all) {
@del_file_ext = ();
$tmp_file_list = "";
$delete_bnd = 0
......@@ -155,58 +162,75 @@ if ($tmp_file_list) {
print "$pgm_command\n" unless ($quiet);
my $this_iter = "it1";
####
# generate output filenames
my ($query_pref) = ($query_file =~ m/([\w\.]+)$/);
$file_out = $query_pref unless $file_out;
my $this_file_pref = "$file_out.$this_iter";
my $this_file_pref = "$file_out.it$this_iter";
$this_file_pref = "$this_file_pref.$suffix" if ($suffix);
my $this_file_out = $this_file_pref;
$this_file_out = "$tmp_dir/$this_file_out" if ($tmp_dir);
my $prev_file_out = $this_file_out;
my $prev_file_out = "";
####
# parse output to build PSSM
# generate output filenames
# do the first search or use $prev_m89res
my $first_iter = 0;
my $iter_val = $this_iter;
my $search = "";
my @del_err_files = ();
# do the first search
my $search = $srch_subs{$srch_pgm}($query_file, $db_file, $prev_pssm);
unless ($prev_m89res) {
$search = $srch_subs{$srch_pgm}($query_file, $db_file, $prev_pssm);
unless ($use_stdout) {
log_system("$search > $this_file_out 2> $this_file_out.err");
}
else {
log_system("$search 2> $this_file_out.err");
}
push @del_err_files, "$this_file_out.err";
$first_iter++;
}
else {
$this_file_out = $prev_m89res;
}
my ($this_pssm, $this_bound_out) = build_msa_pssm($query_file, $this_file_out, $prev_bound_in);
my ($this_pssm, $this_bound_out) = ("","");
# now have necessary files for next iteration
for (my $it=2; $it <= $num_iter; $it++) {
for (my $it=$first_iter; $it < $num_iter; $it++) {
####
# build the PSSM for the current search
my ($this_pssm, $this_bound_out) = build_msa_pssm($query_file, $this_file_out, $prev_bound, $prev_sel_res, $m_format);
$prev_file_out = $this_file_out;
$prev_sel_res = "";
$iter_val = $this_iter + $it;
$prev_pssm = $this_pssm;
$prev_bound_in = $this_bound_out;
####
# build filename for this iteration
$this_file_pref = $this_file_out = "$file_out.it$it";
$this_file_pref = $this_file_out = "$file_out.it$iter_val";
$this_file_out = "$this_file_pref.$suffix" if ($suffix);
$this_file_out = "$tmp_dir/$this_file_out" if ($tmp_dir);
$search = $srch_subs{$srch_pgm}($query_file, $db_file, $prev_pssm);
$search = $srch_subs{$srch_pgm}($query_file, $db_file, $this_pssm);
unless ($use_stdout) {
log_system("$search > $this_file_out 2> $this_file_out.err");
# here, we are done with previous .msa, .asntxt, .asnbin, etc files. Delete them if desired
if (@del_file_ext) {
my @del_file_list = ();
for my $ext (@del_file_ext) {
push @del_file_list, "$prev_file_out.$ext";
}
log_system("rm ".join(" ",@del_file_list));
else {
log_system("$search 2> $this_file_out.err");
}
$prev_file_out = $this_file_out;
push @del_err_files, "$this_file_out.err";
($this_pssm, $this_bound_out) = build_msa_pssm($query_file, $this_file_out, $prev_bound_in);
if (has_converged($prev_bound_in, $this_bound_out)) {
print STDERR "$0 $srch_pgm $query_file $db_file converged ($it iterations)\n" unless ($quiet);
####
# here, we are done with previous .msa, .asntxt, .asnbin, etc files. Delete them if desired
if (@del_file_ext) {
my @del_file_list = ();
......@@ -214,23 +238,21 @@ for (my $it=2; $it <= $num_iter; $it++) {
push @del_file_list, "$prev_file_out.$ext";
}
log_system("rm ".join(" ",@del_file_list));
log_system("rm $prev_bound_in $this_bound_out") if ($delete_bnd);
}
exit(0);
}
log_system("rm $prev_bound_in") if ($delete_bnd);
}
log_system("rm $prev_bound") if ($delete_bnd);
if (@del_file_ext) {
my @del_file_list = ();
for my $ext (@del_file_ext) {
push @del_file_list, "$prev_file_out.$ext";
if (has_converged($prev_bound, $this_bound_out)) {
print STDERR "$0 $srch_pgm $query_file $db_file converged ($iter_val iterations)\n" unless ($quiet);
last;
}
log_system("rm ".join(" ",@del_file_list));
$prev_bound = $this_bound_out
}
log_system("rm $this_bound_out") if ($delete_bnd);
if (@del_err_files) {
log_system("rm ".join(" ",@del_err_files));
}
log_system("rm $prev_bound") if ($delete_bnd);
unless ($quiet) {
print STDERR "$0 $srch_pgm $query_file $db_file finished ($num_iter iterations)\n";
}
......@@ -254,7 +276,7 @@ sub log_system {
sub get_ssearch_cmd {
my ($query_file, $db_file, $pssm_file) = @_;
my $search_cmd = qq($ssearch_bin -S -m 8CB -d 0 -E "1.0 0" -s BP62);
my $search_cmd = qq($ssearch_bin -S -m 6 -m 9B -E "$srch_evalue 0" -s BP62);
if ($annot_type) {
$search_cmd .= qq( -V $annot_cmds{$annot_type});
}
......@@ -274,7 +296,7 @@ sub get_ssearch_cmd {
sub get_psiblast_cmd {
my ($query_file, $db_file, $pssm_file) = @_;
my $search_cmd = qq($psiblast_bin -num_threads 4 -max_target_seqs 5000 -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore score btop' -inclusion_ethresh $evalue -num_iterations 1 -db $db_file);
my $search_cmd = qq($psiblast_bin -num_threads 4 -max_target_seqs 5000 -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore score btop' -inclusion_ethresh $pssm_evalue -evalue $srch_evalue -num_iterations 1 -db $db_file);
if ($pssm_file) {
$search_cmd .= qq( -in_pssm $pssm_file);
# $search_cmd .= qq( -comp_based_stats 0);
......@@ -296,7 +318,7 @@ sub get_psiblast_cmd {
# always produce a $bound_file_out file to test for convergence
#
sub build_msa_pssm {
my ($query_file, $this_file_out,$prev_bound_in) = @_;
my ($query_file, $this_file_out,$prev_bound_in, $prev_sel_in, $m_format) = @_;
my ($this_msa, $this_hit_db, $this_pssm_asntxt, $this_pssm_asnbin, $this_psibl_out, $this_bound_out) =
("$this_file_out.msa",
......@@ -308,7 +330,18 @@ sub build_msa_pssm {
);
my $blastdb_err = "$this_file_out.mkbldb_err";
my $aln2msa_cmd = qq($align2msa_lib --query $query_file --evalue $evalue --masked_lib_out=$this_hit_db);
my $aln2msa_cmd = qq($align2msa_lib --query $query_file --masked_lib_out=$this_hit_db);
if ($m_format) {
$aln2msa_cmd .= qq( --m_format $m_format);
}
if ($prev_sel_in) {
$aln2msa_cmd .= qq( --sel_file_in $prev_sel_in);
}
else {
$aln2msa_cmd .= qq( --evalue $pssm_evalue);
}
if ($int_mask) {
$aln2msa_cmd .= qq( --int_mask_type $int_mask);
......@@ -342,7 +375,7 @@ sub build_msa_pssm {
log_system("rm $this_hit_db.p* $blastdb_err");
# remove uninformative error logs
log_system("rm $this_psibl_out.err $this_file_out.err") unless $error_log;
log_system("rm $this_psibl_out.err") unless $error_log;
unless ($srch_pgm eq 'psiblast') {
my $asn2asn_cmd = "$datatool_bin -v $this_pssm_asntxt -e $this_pssm_asnbin";
......@@ -361,6 +394,8 @@ sub build_msa_pssm {
sub has_converged {
my ($file1, $file2) = @_;
return 0 unless ($file1 && $file2);
my @f1_names = ();
my @f2_names = ();
......@@ -408,12 +443,19 @@ psisearch2_msa.pl
-h short help
--help include description
--query query file (also --sequence)
--db database file (--database)
--pgm program used for searching, ssearch or psiblast
--num_iter maximum number of iterations (--max_iter)
--pgm [ssearch] program used for searching, ssearch or psiblast
--num_iter/iter [5] maximum number of iterations (--max_iter)
--this_iter [0] iteration number for tracking
--pssm_evalue [0.002] threshold for inclusion in PSSM
--search_evalue [5.0] threshold for inclusion in search display
--annot_db [null] (rpd3, rpd3nv, rpd3nvn, pfam)
--dir working directory and location of output
--evalue threshold for inclusion in PSSM
--out_name/--suffix result file is "out_name.it#.suffix"
--in_msa/--out_msa [not implemented] MSA used to build PSSM, requires --in_hitdb
--in_hitdb/--out_hitdb [not implemented] used to build PSSM
......@@ -427,8 +469,8 @@ psisearch2_msa.pl
=head1 DESCRIPTION
C<psisearch2_msa.pl> automates successive iterations of C<psiblast> or
C<ssearch36> using different strategies to reduce PSSM contamination
C<psisearch2_msa.pl> can perform one, or multiple, successive iterations of C<psiblast> or
C<ssearch36> using different query seeding strategies to reduce PSSM contamination
from alignment over-extension. C<psisearch2_msa.pl> uses the
C<m89_btop_msa2.pl> program to read BTOP formatted output from
C<psiblast> or C<ssearch36> and produce both a multiple sequence
......
......@@ -34,7 +34,7 @@ import re
################
#
# command:
# psisearch2_msa.py --query query_file --db database --num_iter N --evalue 0.002 --no_msa --int_mask none/query/random --end_mask none/query/random --tmp_dir results/ --domain --align --suffix --pgm ssearch/psiblast
# psisearch2_msa.py --query query_file --db database --num_iter N --evalue 0.002 --no_msa --int_mask none/query/random --end_mask none/query/random --tmp_dir results/ --domain --align --suffix M8CB --pgm ssearch/psiblast --prev_m89res pre_iter.out --this_iter # --num_iter #
#
################
......@@ -59,15 +59,9 @@ annot_cmds = {'rpd3': '"!../scripts/ann_pfam28.pl --pfacc --db RPD3 --vdoms --sp
num_iter = 5
evalue = 0.002
dom_flag = 0
align_flag = 0
int_mask = 'none'
end_mask = 'none'
srch_pgm = 'ssearch'
tmp_dir = ''
error_log = 0
rm_flag = 0
annot_type = ''
quiet = 0
################
......@@ -90,8 +84,8 @@ def get_ssearch_cmd(query_file, db_file, pssm_file) :
search_cmd = '%s -S -m 8CB -d 0 -E "1.0 0" -s BP62' % (ssearch_bin)
if (annot_type) :
search_cmd += " -V %s" % (annot_cmds[annot_type])
if (args.annot_type) :
search_cmd += " -V %s" % (annot_cmds[args.annot_type])
if (pssm_file) :
search_cmd += ' -P "%s 2"' % (pssm_file)
......@@ -107,7 +101,7 @@ def get_ssearch_cmd(query_file, db_file, pssm_file) :
#
def get_psiblast_cmd(query_file, db_file, pssm_file) :
search_cmd = "%s -num_threads 4 -max_target_seqs 5000 -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore score btop' -inclusion_ethresh %f -num_iterations 1 -db %s" % (psiblast_bin, evalue, db_file)
search_cmd = "%s -num_threads 4 -max_target_seqs 5000 -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore score btop' -inclusion_ethresh %f -num_iterations 1 -db %s" % (psiblast_bin, args.evalue, db_file)
if (pssm_file) :
search_cmd += " -in_pssm %s" % (pssm_file)
......@@ -126,24 +120,29 @@ def get_psiblast_cmd(query_file, db_file, pssm_file) :
#
# always produce a bound_file_out file to test for convergence
#
def build_msa_pssm(query_file, this_file_out,prev_bound_in, error_log) :
def build_msa_pssm(query_file, this_file_out,prev_bound_in, prev_sel_res, args, error_log) :
(this_msa, this_hit_db, this_pssm_asntxt, this_pssm_asnbin, this_psibl_out, this_bound_out) = (this_file_out+".msa",this_file_out+".hit_db",this_file_out+".asntxt",this_file_out+".asnbin",this_file_out+".psibl_out",this_file_out+".bnd_out")
blastdb_err = this_file_out+".mkbldb_err"
aln2msa_cmd = "%s --query %s --evalue %f --masked_lib_out=%s" % (align2msa_lib, query_file, evalue, this_hit_db)
aln2msa_cmd = "%s --query %s --masked_lib_out=%s" % (align2msa_lib, query_file, this_hit_db)
if (int_mask) :
aln2msa_cmd += " --int_mask_type %s" % (int_mask)
if (prev_sel_res) :
aln2msa_cmd += " --sel_res %s" % (prev_sel_res)
else:
aln2msa_cmd += " --evalue %f" % (args.evalue)
if (args.int_mask) :
aln2msa_cmd += " --int_mask_type %s" % (args.int_mask)
if (end_mask) :
aln2msa_cmd += " --end_mask_type %s" % (end_mask)
if (args.end_mask) :
aln2msa_cmd += " --end_mask_type %s" % (args.end_mask)
if (dom_flag) :
if (args.dom_flag) :
aln2msa_cmd += " --domain"
if (align_flag and prev_bound_in) :
aln2msa_cmd += " --bound_file_in %s" %(prev_bound_in)
if (args.align_flag and args.prev_bound_in) :
aln2msa_cmd += " --bound_file_in %s" %(args.prev_bound_in)
# always produce this file to check for convergence
aln2msa_cmd += " --bound_file_out %s" % (this_bound_out)
......@@ -162,7 +161,7 @@ def build_msa_pssm(query_file, this_file_out,prev_bound_in, error_log) :
# remove uninformative error logs
if (not error_log) :
log_system("rm "+this_psibl_out+".err "+this_file_out+".err",error_log)
log_system("rm "+this_psibl_out+".err ",error_log)
if (srch_pgm != 'psiblast') :
asn2asn_cmd = "%s -v %s -e %s" % (datatool_bin, this_pssm_asntxt, this_pssm_asnbin)
......@@ -177,6 +176,9 @@ def build_msa_pssm(query_file, this_file_out,prev_bound_in, error_log) :
#
def has_converged(file1, file2) :
if (not file1 or not file2):
return 0
f1_names = []
f2_names = []
......@@ -224,13 +226,16 @@ arg_parse.add_argument('--evalue', dest='evalue', default=0.002, type=float, act
arg_parse.add_argument('--annot_db', dest='annot_type', action='store',help='source of domain annotations')
arg_parse.add_argument('--suffix', dest='suffix', action='store',help='suffix for result output')
arg_parse.add_argument('--out_name', dest='file_out', action='store',help='result file name')
arg_parse.add_argument('--iter', dest='num_iter', default=5, type=int, action='store',help='number of iterations')
arg_parse.add_argument('--num_iter', dest='num_iter', default=5, type=int, action='store',help='number of iterations to run')
arg_parse.add_argument('--in_pssm', dest='prev_pssm', action='store',help='initial PSSM')
arg_parse.add_argument('--in_bounds', dest='prev_bound_in', type=str, action='store',help='initial boundaries')
arg_parse.add_argument('--domain', dest='dom_flag', action='store_true',help='use domain annotations')
arg_parse.add_argument('--align', dest='align_flag', action='store_true',help='use alignment boundaries')
arg_parse.add_argument('--align', dest='align_flag', default=0, action='store_true',help='use alignment boundaries')
arg_parse.add_argument('--pgm', dest='srch_pgm', action='store',default='ssearch',help='search program: ssearch/psiblast')
arg_parse.add_argument('--query_seed', dest='query_mask', action='store_true',help='use query seeding')
arg_parse.add_argument('--prev_m89res', dest='prev_m89res', action='store', help='prevous iteration result')
arg_parse.add_argument('--sel_res', dest='prev_sel_res', action='store', help='selected accession file')
arg_parse.add_argument('--this_iter', dest='this_iter', help='this iteration number',type=int)
arg_parse.add_argument('--int_seed', dest='int_mask', action='store',default='none',help='sequence masking: none/query/random')
arg_parse.add_argument('--end_seed', dest='end_mask', action='store',default='none',help='sequence masking: none/query/random')
arg_parse.add_argument('--int_mask', dest='int_mask', action='store',default='none',help='sequence masking: none/query/random')
......@@ -257,13 +262,16 @@ del_file_ext = ["msa","psibl_out","hit_db","asntxt","asnbin"]
if (re.match('psiblast',srch_pgm)) :
del_file_ext.pop()
if (args.query_mask) :
if (args.int_mask == 'none') :
args.int_mask = 'query'
if (args.end_mask == 'none') :
args.end_mask = 'query'
this_iter = 1
if (args.this_iter):
this_iter = args.this_iter
delete_bnd = args.delete_bnd
if (args.delete_tmp) :
delete_bnd = 1
......@@ -282,8 +290,6 @@ if (args.tmp_file_list) :
new_del_file_ext.append(ext)
del_file_ext = new_del_file_ext[:]
this_iter = "it1"
query_pref = query_file = args.query_file
m = re.search(r'([\w\.]+)$',str(args.query_file))
query_pref = m.groups()[0]
......@@ -293,35 +299,50 @@ if (not args.file_out) :
else :
file_out = args.file_out
this_file_out = this_file_pref = file_out+"."+this_iter
this_file_out = this_file_pref = file_out+".it"+str(this_iter)
if (args.suffix) :
this_file_out = this_file_pref+"."+args.suffix
if (args.tmp_dir) :
this_file_out = args.tmp_dir+"/"+this_file_out
prev_bound_in = ''
if (args.prev_bound_in):
prev_bound_in = args.prev_bound_in
####
# parse output to build PSSM
# generate output filenames
prev_file_out = this_file_out
first_iter = 0
del_err_files = []
# do the first search
if (not args.prev_m89res):
search_str = srch_subs[srch_pgm](args.query_file, args.db_file, args.prev_pssm)
log_system(search_str+" > "+this_file_out+" 2> "+this_file_out+".err", error_log)
del_err_files.append(this_file_out+".err")
first_iter += 1
else:
this_file_out = args.prev_m89res
prev_file_out = this_file_out
(this_pssm, this_bound_out) = build_msa_pssm(args.query_file, this_file_out, args.prev_bound_in, error_log)
it=first_iter
# now have necessary files for next iteration
it=2
while (it <= args.num_iter) :
while (it < args.num_iter) :
(this_pssm, this_bound_out) = build_msa_pssm(args.query_file, this_file_out, prev_bound_in, arg.prev_sel_res, error_log)
prev_file_out = this_file_out
arg.prev_sel_res = ''
iter_val = this_iter + it
prev_pssm = this_pssm
prev_bound_in = this_bound_out
####
# build filename for this iteration
this_file_out = this_file_pref = "%s.it%d" % (file_out,it)
this_file_out = this_file_pref = "%s.it%d" % (file_out,iter_val)
if (args.suffix) :
this_file_out = this_file_pref+"."+args.suffix
if (args.tmp_dir) :
......@@ -329,22 +350,19 @@ while (it <= args.num_iter) :
search_str = srch_subs[srch_pgm](args.query_file, args.db_file, prev_pssm)
log_system("%s > %s 2> %s" % (search_str,this_file_out,this_file_out+".err"), error_log)
del_err_files.append(this_file_out+".err")
if (len(del_file_ext)):
del_file_list = [ prev_file_out+'.'+ext for ext in del_file_ext]
log_system('rm '+' '.join(del_file_list),error_log)
prev_file_out = this_file_out
(this_pssm, this_bound_out) = build_msa_pssm(query_file, this_file_out, prev_bound_in, error_log)
if (has_converged(prev_bound_in, this_bound_out)) :
if (not quiet) :
sys.stderr.write(" %s %s %s %s converged (%d iterations)\n" % (sys.argv[0], srch_pgm, query_file, args.db_file, it))
if (len(del_file_ext)):
del_file_list = [ prev_file_out+'.'+ext for ext in del_file_ext]
log_system('rm '+' '.join(del_file_list),error_log)
# if (len(del_file_ext)):
# del_file_list = [ prev_file_out+'.'+ext for ext in del_file_ext]
# log_system('rm '+' '.join(del_file_list),error_log)
if (delete_bnd) :
log_system("rm "+prev_bound_in,error_log)
......@@ -353,16 +371,20 @@ while (it <= args.num_iter) :
if (delete_bnd) :
log_system("rm "+prev_bound_in,error_log)
prev_bound_in = this_bound_out
it += 1
if (len(del_file_ext)):
del_file_list = [ prev_file_out+'.'+ext for ext in del_file_ext]
log_system('rm '+' '.join(del_file_list),error_log)
if (len(del_err_files)):
log_system('rm '+' '.join(del_err_files),error_log)
# if (len(del_file_ext)):
# del_file_list = [ prev_file_out+'.'+ext for ext in del_file_ext]
# log_system('rm '+' '.join(del_file_list),error_log)
if (delete_bnd):
log_system("rm "+this_bound_out,error_log)
if (not quiet) :
sys.stderr.write(" %s %s %s %s finished (%d iterations)\n" % (sys.argv[0], srch_pgm, query_file, args.db_file, it-1))
sys.stderr.write(" %s %s %s %s finished (%d iterations)\n" % (sys.argv[0], srch_pgm, query_file, args.db_file, it))
......@@ -4,13 +4,13 @@
# gi|23065544|ref|NP_000552.2|
#
# and returns the exons present in the protein from NCBI gff3 tables (human and mouse only)
# and returns the exons present in the protein from NCBI gff3 tables (human, mouse, rat, xtrop)
#
# it must:
# (1) read in the line
# (2) parse it to get the acc
# (3) return the tab delimited exon boundaries
#
use strict;
......