Commits on Source (7)
## The FASTA package - protein and DNA sequence similarity searching and alignment programs
The **FASTA** (pronounced FAST-Aye, not FAST-Ah) programs are a
comprehensive set of similarity searching and alignment programs for
searching protein and DNA sequence databases. Like the **BLAST** programs `blastp` and `blastn`, the `fasta` program itself uses a rapid heuristic strategy for finding similar regions in protein and DNA sequences. But in
addition to heuristic similarity searching, the FASTA package provides
programs for rigorous local (`ssearch`) and global (`ggsearch`)
similarity searching, as well as a program for finding non-overlapping
sequence similarities (`lalign`). Like BLAST, the FASTA package also
includes programs for aligning translated DNA sequences against
proteins (`fastx`, `fasty` are equivalent to `blastx`, `tfastx`,
`tfasty` are similar to `tblastn`).
####December, 2017
The current FASTA version is fasta-36.3.8f, Dec. 2017
The **FASTA** (pronounced FAST-Aye, not FAST-Ah) programs are a comprehensive set of similarity searching and alignment programs for searching protein and DNA sequence databases. Like the **BLAST** programs `blastp` and `blastn`, the `fasta` program itself uses a rapid heuristic strategy for finding similar regions in protein and DNA sequences. But in addition to heuristic similarity searching, the FASTA package provides
programs for rigorous local (`ssearch`) and global (`ggsearch`) similarity searching, as well as a program for finding non-overlapping sequence similarities (`lalign`). Like BLAST, the FASTA package also includes programs for aligning translated DNA sequences against proteins (`fastx`, `fasty` are equivalent to `blastx`, and `tfastx`, `tfasty` are similar to `tblastn`).
#### March, 2019
An updated release of the FASTA package (`fasta-36.3.8h`) is
available. In addition to minor bug fixes, the latest version can
generate query and library sequences using program scripts.
See doc/README_v36.3.8h.md and doc/readme.v36 for a more complete summary of changes.
#### December, 2018
The latest version of the FASTA package is `fasta-36.3.8h`, Dec. 2018.
See doc/README_v36.3.8h.md for a more complete summary of changes.
#### November, 2018
The current released version of the FASTA package is `fasta-36.3.8h`, Nov. 2018
See doc/README_v36.3.8h.md for a more complete summary of changes.
#### October, 2018
The current version of the FASTA package is fasta-36.3.8g, Oct. 2018
See doc/README_v36.3.8h.md for a more complete summary of changes.
#### April, 2018
The current version of the FASTA package is fasta-36.3.8g, Apr. 2018
#### December, 2017
The current FASTA version is fasta-36.3.8g, Dec. 2017
The statistics routines for normally distributed scores (ggsearch36,
glsearch36) are more robust to very low E()-value thresholds.
####Sept, 2017
#### Sept, 2017
The current FASTA version is fasta-36.3.8f, Sept. 2017
If the -S option is used and a query sequence has no upper case
letters, it is re-read with lower-case letters converted to upper-case.
####May, 2017
#### May, 2017
The current FASTA version is fasta-36.3.8f, May. 2017
Various bugs in sub-alignment scoring corrected and support for the
EBI SP:GSTM1_HUMAN P09488 added. The format for the $SRCH_URL and
$SRCH_URL2 format strings has changed to enable pairwise alignment.
EBI SP:GSTM1_HUMAN P09488 added. The format for the `$SRCH_URL` and
`$SRCH_URL2` format strings has changed to enable pairwise alignment.
####September, 2016
#### September, 2016
The fasta-36.3.6e version includes a new directory, `psisearch2`, with
scripts to run iterative PSSM (PSI-BLAST or SSEARCH36) searches using
......
Placeholder file to create destination for program binaries.
fasta3 (36.3.8h-1) UNRELEASED; urgency=medium
* Team upload.
* New upstream version
* debhelper-compat 12
* Standards-Version: 4.4.0
TODO: Do we really need to use non-free smith waterman code?
There is a free libssw. Please contact upstream!
-- Andreas Tille <tille@debian.org> Mon, 19 Aug 2019 21:45:02 +0200
fasta3 (36.3.8g-1) unstable; urgency=low
[ Andreas Tille ]
......
......@@ -4,9 +4,9 @@ Uploaders: Steffen Moeller <moeller@debian.org>
Section: non-free/science
XS-Autobuild: no
Priority: optional
Build-Depends: debhelper (>= 9),
Build-Depends: debhelper-compat (= 12),
zlib1g-dev
Standards-Version: 4.1.3
Standards-Version: 4.4.0
Vcs-Browser: https://salsa.debian.org/med-team/fasta3
Vcs-Git: https://salsa.debian.org/med-team/fasta3.git
Homepage: http://fasta.bioch.virginia.edu
......
Description: Makefile
Index: fasta3/make/Makefile
===================================================================
--- fasta3.orig/make/Makefile
+++ fasta3/make/Makefile
--- a/make/Makefile
+++ b/make/Makefile
@@ -34,6 +34,7 @@ THR_SUBS = pthr_subs2
THR_LIBS = -lpthread
THR_CC =
......@@ -11,10 +9,8 @@ Index: fasta3/make/Makefile
XDIR = /seqprg/bin
DROPGSW_NA_O = dropgsw2.o wm_align.o calcons_sw.o
Index: fasta3/make/Makefile.linux64_sse2
===================================================================
--- fasta3.orig/make/Makefile.linux64_sse2
+++ fasta3/make/Makefile.linux64_sse2
--- a/make/Makefile.linux64_sse2
+++ b/make/Makefile.linux64_sse2
@@ -12,7 +12,8 @@
SHELL=/bin/bash
......@@ -22,10 +18,10 @@ Index: fasta3/make/Makefile.linux64_sse2
-CC = gcc -g -O -msse2
+CC = gcc
+CFLAGS = -g -O -msse2 $(CPPFLAGS)
LIB_DB=
#CC= gcc -pg -g -O -msse2 -ffast-math
#CC = gcc -g -DDEBUG -msse2
#CC=gcc -Wall -pedantic -ansi -g -msse2 -DDEBUG
@@ -24,7 +25,7 @@ CC = gcc -g -O -msse2
@@ -26,7 +27,7 @@ LIB_DB=
# standard options
......@@ -34,16 +30,57 @@ Index: fasta3/make/Makefile.linux64_sse2
# -I/usr/include/mysql -DMYSQL_DB
# -DSUPERFAMNUM -DSFCHAR="'|'"
Index: fasta3/make/Makefile36m.common
===================================================================
--- fasta3.orig/make/Makefile36m.common
+++ fasta3/make/Makefile36m.common
@@ -34,7 +34,7 @@ NGETLIB=nmgetlib
# and "-L/usr/lib64/mysql -lmysqlclient -lz" in LIB_M
# some systems may also require a LD_LIBRARY_PATH change
-LIB_M= -lm -lz
+LIB_M= $(LDFLAGS) -lm -lz
#LIB_M= -L/usr/lib64/mysql -lmysqlclient -lz -lm
NCBL_LIB=ncbl2_mlib.o
#NCBL_LIB=ncbl2_mlib.o mysql_lib.o
--- a/make/Makefile36m.common
+++ /dev/null
@@ -1,51 +0,0 @@
-#
-# $Name: $ - $Id: Makefile36m.common 1250 2014-01-24 21:33:39Z wrp $
-#
-# commands common to all architectures
-# if your architecture does not support "include", append at the end.
-#
-
-COMP_LIBO=comp_mlib9.o # reads database into memory for multi-query without delay
-COMP_THRO=comp_mthr9.o # threaded version
-
-WORK_THRO=work_thr2.o
-GETSEQO =
-
-# standard nxgetaa, no memory mapping for 0 - 6
-#LGETLIB=getseq.o lgetlib.o
-#NGETLIB=nmgetlib
-
-# memory mapping for 0FASTA, 5PIRVMS, 6GCGBIN
-LGETLIB= $(GETSEQO) lgetlib.o lgetaa_m.o
-NGETLIB=nmgetlib
-
-# use ncbl_lib.c for BLAST1.4 support instead of ncbl2_mlib.c
-#NCBL_LIB=ncbl_lib.o
-
-# this option should support both formats (BLAST1.4 not currently supported):
-#NCBL_LIB=ncbl_lib.o ncbl2_mlib.o
-
-# normally use ncbl2_mlib.c
-#NCBL_LIB=ncbl2_mlib.o
-#LIB_M= -lm
-
-# this option supports NCBI BLAST2 and mySQL
-# it requires "-I/usr/include/mysql -DMYSQL_DB" in CFLAGS
-# and "-L/usr/lib64/mysql -lmysqlclient -lz" in LIB_M
-# some systems may also require a LD_LIBRARY_PATH change
-
-LIB_M= -lm
-#LIB_M= -L/usr/lib64/mysql -lmysqlclient -lm # -lz
-NCBL_LIB=ncbl2_mlib.o
-#NCBL_LIB=ncbl2_mlib.o mysql_lib.o
-
-# threaded as _t, serial
-# include ../make/Makefile.pcom
-
-# threaded without _t
-include ../make/Makefile.pcom_t
-
-# serial only
-# include ../make/Makefile.pcom_s
-
-include ../make/Makefile.fcom
Description: OVERFLOW
Index: fasta3/src/dropnnw2.c
===================================================================
--- fasta3.orig/src/dropnnw2.c
+++ fasta3/src/dropnnw2.c
@@ -575,7 +575,7 @@ void do_work (const unsigned char *aa0,
* be rerun with 16 bits. If it is more, and we have tried at least
* 500 sequences, we switch off the 8-bit mode.
*/
- if (score == OVERFLOW) {
+ if (score == OVERFLOW_SCORE) {
f_str->done_16bit++;
if(f_str->done_8bit>500 && (3*f_str->done_16bit)>(f_str->done_8bit))
f_str->try_8bit = 0;
## The FASTA package - protein and DNA sequence similarity searching and alignment programs
Changes in **fasta-36.3.8f** released 31-Dec-2017
1. (December, 2017) -- Make statistical thresholds more robust for
small E()-values with normally distributed scores (ggsearch36,
glsearch36).
2. (September, 2017) Treat all lower-case queries as uppercase with -S option.
3. (May, 2017) Improvements/fixes to sub-alignment scoring strategies.
4. Improvements/fixes to psisearch2 scripts.
For more detailed information, see `doc/readme.v36`.
## The FASTA package - protein and DNA sequence similarity searching and alignment programs
Changes in **fasta-36.3.8h** August, 2019
1. Modifications to support makeblastdb format v5 databases. Currently, only simple database reads have been tested.
Changes in **fasta-36.3.8h** March, 2019
1. Translation table 1 (`-t 1`) now translates 'TGA'->'U' (selenocysteine).
2. New script for extracting DNA sequences from genomes (`scripts/get_genome_seq.py`). Currently works with human (hg38), mouse (mm10), and rat (rn6).
Changes in **fasta-36.3.8h** January, 2019
1. Bug fixes: `fastx`/`tfastx` searches done with the `-t t` option (which adds a `*` to protein sequences so that termination codons can be matched), did not work properly with the `VT` series of matrices, particularly `VT10`. This has been fixed.
2. New features: Both query and library/subject sequences can be generated by specifying a program script, either by putting a `!` at the start of the query/subject file name, or by specifying library type `9`. Thus, `fasta36 \!../scripts/get_protein.py+P09488+P30711 /seqlib/swissprot.fa` or `fasta36 "../scripts/get_protein.py+P09488+P30711 9" /seqlib/swissprot.fa` will compare two query sequences, `P09488` and `P30711`, to SwissProt, by downloading them from Uniprot using the `get_protein.py` script (which can download sequences using either Uniprot or RefSeq protein accessions). Often, the leading `!` must be escaped from shell interpretation with `\!`.
New scripts that return FASTA sequences using accessions or genome coordinates are available in `scripts/`: `get_protein.py`, `get_uniprot.py`, `get_up_prot_iso_sql.py`, and `get_refseq.py`. `get_refseq.py` can download either protein or mRNA RefSeq entries. `get_up_prot_iso_sql.py` retrieves a protein and its isoforms from a MySQL database.
`get_genome_seq.py` extracts genome sequences using coordinates from local reference genomes (`hg38` and `mm10` included by default).
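The script-specification conventions described above (a leading `!` or a trailing ` 9` to mark a script, and `+` standing in for spaces between arguments) can be illustrated with a small sketch. This is not the parsing code used inside the FASTA programs; `parse_script_spec` is a hypothetical helper written only to make the conventions concrete.

```python
def parse_script_spec(spec):
    """Split a FASTA-style script specification into (argv, lib_type).

    Hypothetical illustration of the conventions in the release notes;
    it does not reproduce the C code in the FASTA package.
    """
    spec = spec.strip()
    lib_type = None
    if spec.startswith("!"):
        # A leading '!' marks the name as a script (library type 9).
        spec = spec[1:]
        lib_type = 9
    if " " in spec:
        # A space separates the script string from an explicit library type.
        spec, type_str = spec.rsplit(" ", 1)
        lib_type = int(type_str)
    # '+' stands in for the spaces between the script and its arguments.
    argv = spec.split("+")
    return argv, lib_type


if __name__ == "__main__":
    print(parse_script_spec("!../scripts/get_protein.py+P09488+P30711"))
    print(parse_script_spec("../scripts/get_protein.py+P09488+P30711 9"))
```

Both calls describe the same request: run `get_protein.py` with two accessions and treat its output as library type 9.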
Changes in **fasta-36.3.8h** December, 2018
The `scripts/ann_exons_up_www.pl` and `scripts/ann_exons_up_sql.pl` scripts now include the option `--gen_coord`, which provides the associated genome coordinate (including chromosome) as a feature, indicated by `'<'` (start of exon) and `'>'` (end of exon).
Changes in **fasta-36.3.8h** released November, 2018
**fasta-36.3.8h** provides new scripts and modifications to the `fasta` programs that normalize the process of merging sub-alignment scores and region information into both FASTA and BLAST results. To move BLASTP towards FASTA with respect to alignment annotation and sub-alignment scoring:
1. The `blastp_annot_cmd.sh` runs a blast search, finds and scores domain information for the alignments, and merges this information back into the blast output `.html` file. This script uses:
1. `annot_blast_btab2.pl --query query.file --ann_script annot_script.pl --q_ann_script annot_script.pl blast.btab_file > blast.btab_file_ann` (a blast tabular file with one or two new fields: an annotation field and, optionally with `--dom_info`, a raw domain content field).
2. `merge_blast_btab.pl --btab blast.btab_file_ann blast.html > blast_ann.html` (merge the annotations and domain content information in the `blast.btab_file_ann` file together with the standard blast output file to produce annotated alignments).
3. In addition, `rename_exons.py` is available to rename exons (later other domains) in the subject sequences to match the exon labeling in the aligned query sequence.
4. `relabel_domains.py` can be used to adjust color sets for homologous domains.
2. There is also an equivalent `fasta_annot_cmd.sh` script that provides similar functionality for the FASTA programs. This script does not need to use `annot_blast_btab2.pl` to produce domain subalignment scores (that functionality is provided in FASTA), but it can also use `merge_fasta_btab.pl` and `rename_exons.py` to modify the names of the aligned exons/domains in the subject sequences.
3. To support the independence of the `blastp`/`fasta` output from html annotation, the FASTA package includes some new options:
1. The `-m 8CBL` option includes query sequence length and subject sequence length in the blast tabular output. In addition, if domain annotations are available, the raw domain coordinates are provided in an additional field after the annotation/subalignment scoring field. `-m 8CBl` provides the sequence lengths, but does not add the raw domain coordinates.
2. The `-Xa` option prevents annotation information from being included in the html output -- it is only available in the `-m 8CB` (or `-m 8CBL/l`) output.
3. To reduce problems with spaces in script arguments, annotation scripts with spaces separating arguments can use '+' instead of ' '.
4. The `fasta_annot_cmd.sh` script produces both a conventional alignment on `stdout` and a `-m 8CBL` alignment, which is sent to a separate file. The file name is separated from the `-m F8CBL` option with a `=`, thus `-m F8CBL=tmp_output.blast_tab`.
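The change log for this release gives the raw domain coordinates added by `-m 8CBL` in the form `|DX:1-100;C=PF12345|XD:1-100;C=PF12345`, where `DX` marks a query annotation and `XD` a subject annotation. A minimal sketch of reading such a field follows. It assumes that every entry matches exactly that pattern, and `parse_raw_domains` is a hypothetical name, not a script shipped with the package.

```python
import re

# Assumed entry layout, taken from the single example in the change log:
#   DX:<start>-<end>;C=<label>   (query)
#   XD:<start>-<end>;C=<label>   (subject)
_DOMAIN_RE = re.compile(r"^(DX|XD):(\d+)-(\d+);C=(.+)$")


def parse_raw_domains(field):
    """Return a list of dicts describing the raw domains in one field."""
    domains = []
    for item in field.split("|"):
        match = _DOMAIN_RE.match(item)
        if match is None:
            # Empty pieces and anything outside the assumed layout are skipped.
            continue
        tag, start, end, label = match.groups()
        domains.append({
            "seq": "query" if tag == "DX" else "subject",
            "start": int(start),
            "end": int(end),
            "label": label,
        })
    return domains


if __name__ == "__main__":
    for dom in parse_raw_domains("|DX:1-100;C=PF12345|XD:1-100;C=PF12345"):
        print(dom)
```

Entries that do not match the assumed layout are silently skipped here; a stricter reader might raise an error instead.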
Changes in **fasta-36.3.8g** released 23-Oct-2018
1. (Oct. 2018) Improvements to scripts in the `psisearch2/` directory:
1. `psisearch2/m89_btop_msa2.pl`
1. the `--clustal` option produces a "CLUSTALW (1.8)" header, which is required for some downstream programs
2. the `--trunc_acc` option removes the database and accession from identifiers of the form: `sp|P09488|GSTM1_HUMAN` to produce `GSTM1_HUMAN`.
3. the `--min_align` option specifies the fraction of the query sequence that must be aligned, `(q_end-q_start+1)/q_length`
Together, these changes make it possible for the output of `m89_btop_msa2.pl` to be used by the EMBOSS program `fprotdist`.
2. A more general implementation of `psisearch2_msa_iter.sh`, which does `psisearch2` one iteration at a time, and a new equivalent `psisearch2_msa_iter_bl.sh`, which uses `psiblast` to do the search.
2. (Oct. 2018) A small restructuring of the `make/Makefiles` to remove the `-lz` dependence for non-debugging scripts (and add it back when -DDEBUG is used).
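The `--min_align` fraction quoted above, `(q_end-q_start+1)/q_length`, can be worked through in a few lines. This is a sketch only: the function names are hypothetical, and whether `m89_btop_msa2.pl` compares with `>=` or `>` is an assumption.

```python
def query_coverage(q_start, q_end, q_length):
    """Fraction of the query covered by an alignment (1-based, inclusive)."""
    if q_length <= 0:
        raise ValueError("q_length must be positive")
    return (q_end - q_start + 1) / q_length


def passes_min_align(q_start, q_end, q_length, min_align):
    """True if the alignment covers at least min_align of the query.

    The '>=' comparison is an assumption made for illustration.
    """
    return query_coverage(q_start, q_end, q_length) >= min_align


if __name__ == "__main__":
    # An alignment spanning query residues 11..110 of a 200-residue query.
    print(query_coverage(11, 110, 200))
    print(passes_min_align(11, 110, 200, 0.7))
```

With these numbers the alignment covers half of the query, so it would be dropped at a threshold of 0.7.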
Changes in **fasta-36.3.8g** released 5-Aug-2018
1. (Apr 2018) incorporation of `-t t1` termination codes ("*") in `-m 8CB`, `-m 8CC`, and `-m9C` so that aligned termination codons are indicated as `**` (`-m8CB`) or `*1` (`-m8CC`, `-m9C`).
2. (Mar 2018) Updates to scripts/annot_blast_btop2.pl to provide subalignment scoring for blastp searches (BLOSUM62 only). (see doc/readme.v36)
3. (Feb. 2018) a new extended option, `-XB`, which causes percent identity, percent similarity, and alignment length to be calculated using the BLAST model, which does not count gaps in the alignment length.
See `doc/readme.v36` for other bug fixes.
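The effect of `-XB` on percent identity and alignment length can be seen with a small worked example. The sketch below assumes that the default model counts every aligned column, gaps included, and that the BLAST model described above drops gapped columns from the length. `percent_identity` is a hypothetical helper, not code from the FASTA package.

```python
def percent_identity(query_aln, subject_aln, blast_model=False):
    """Return (percent identity, alignment length) for two aligned strings.

    Gaps are written as '-'. With blast_model=True, gapped columns are
    left out of the alignment length, as the -XB description states.
    """
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned sequences must have the same length")
    n_ident = 0
    n_gap = 0
    for q, s in zip(query_aln, subject_aln):
        if q == "-" or s == "-":
            n_gap += 1
        elif q == s:
            n_ident += 1
    aln_len = len(query_aln)
    if blast_model:
        aln_len -= n_gap
    if aln_len == 0:
        return 0.0, 0
    return 100.0 * n_ident / aln_len, aln_len


if __name__ == "__main__":
    query = "ACDEFGHIK"
    subject = "ACD---HIK"
    print(percent_identity(query, subject))
    print(percent_identity(query, subject, blast_model=True))
```

In this example six residues are identical and three columns contain gaps, so the length is 9 when gaps are counted and 6 when they are not.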
Changes in **fasta-36.3.8g** released 31-Dec-2017
1. (December, 2017) -- Make statistical thresholds more robust for small E()-values with normally distributed scores (`ggsearch36`,`glsearch36`).
2. (September, 2017) Treat lower-case queries with no upper-case residues as uppercase with `-S` option.
3. (May, 2017) Improvements/fixes to sub-alignment scoring strategies.
4. Improvements/fixes to psisearch2 scripts.
For more detailed information, see `doc/readme.v36`.
......@@ -24,28 +24,44 @@ font-size: 12px; font-family: sans-serif; text-decoration:none; background-color
</small>
</pre>
<hr>
<h2>Latest Updates - FASTA version 36.3.8d (April, 2016)</h2>
<h2>Latest Updates - FASTA version 36.3.8h (March, 2019)</h2>
<ol>
<li>The FASTA programs have been released under the Apache2.0 Open
Source License. The COPYRIGHT file, and copyright notices in
program files, have been updated to reflect this change.
<p>
<li>
fasta-36.3.8h includes bug fixes for translated alignments
with termination codons, the ability to use scripts as query
and library sequences, and new scripts for extracting genomic
DNA sequences given chromosome coordinates.
<li>
fasta-36.3.8g includes bug fixes for sub-alignment scoring and
psisearch2 scripts, new annotation scripts for exons, and
fixes enabling very low statistical thresholds with ggsearch36
and glsearch36.
<li>
fasta-36.3.8e/scripts includes updated scripts for
capturing domain and feature annotations using the
EBI/proteins API (https://www.ebi.ac.uk/proteins/api/) to get
Uniprot annotations and exon locations.
<p>
<li>
The <tt>fasta-36.3.8e/psisearch2/</tt> directory now
provides <tt>psisearch2_msa.pl</tt>
and <tt>psisearch2_msa.py</tt>, functionally identical scripts
for iterative searching with <tt>psiblast</tt>
or <tt>ssearch36</tt>. <tt>psisearch2_msa.pl</tt> offers an
option, <tt>--query_seed</tt>, that can dramatically reduce
false-positives caused by alignment overextension, with very
little loss of search sensitivity.
<p>
<li>
The <tt>fasta-36.3.8d/scripts/</tt> directory now provides a
script, <tt>annot_blast_btop2.pl</tt> that allows annotations and
sub-alignment scoring on BLAST alignments that use the tabular format
with BTOP alignment encoding.
<p>
<li>
Bug fixes for overlapping domain scoring. v36.3.7 was not thread-safe.
<li>
Annotation scripts accessing the Pfam domain database can now use
the <tt>--vdoms</tt> option to highlight missing parts of a Pfam
domain model. In addition, domains from clans are labeled as clans
unless <tt>--no-clans</tt> is specified.
</ol>
<h2>Updates - FASTA version 36.3.7 (November, 2014)</h2>
<ol>
<li>The FASTA programs have been released under the Apache2.0 Open
Source License. The COPYRIGHT file, and copyright notices in
program files, have been updated to reflect this change.
<p>
<li>Alignment sub-scoring scripts have been extended to allow
overlapping domains. This requires a modified annotation file format.
The "classic" format placed the beginning and end of a domain on different lines:
......@@ -70,7 +86,7 @@ which allows annotations of the form:
</pre>
<p>
<li> New annotation scripts are available in
the <tt>fasta-36.3.7/scripts</tt> directory,
the <tt>fasta-36.3.8/scripts</tt> directory,
e.g. <tt>ann_pfam_www_e.pl</tt> (Pfam) and <tt>ann_up_www2_e.pl</tt>
(Uniprot) to support this new format. If the domain annotations
provided by Pfam or Uniprot overlap, then overlapping domains are
......
......@@ -267,28 +267,41 @@ FASTA format files consist of a description line, beginning
with a '$>$' character, followed by the sequence itself:
\begin{quote}
\begin{verbatim}
>sequence name and description 1
>sequence_name1 and description
A F A S Y T .... actual sequence.
F S S .... second line of sequence.
>sequence name and description 2
>sequence_name2 and description
PMILTYV ... sequence 2
\end{verbatim}
\end{quote}
All of the characters of the description line are read, and special
characters can be used to indicate additional information about the
sequence. In general, non-amino-acid/non-nucleotide sequences in the
sequence lines are ignored.
sequence. In particular, a \texttt{'@:C 12345'} at the end of the
description line indicates that the first residue of the sequence has
coordinate \texttt{'12345'}, instead of starting at \texttt{'1'}.
Coordinates can be negative; a DNA sequence upstream from the start of
transcription could be displayed with negative coordinates.
In general, non-amino-acid/non-nucleotide sequences in the sequence
lines are ignored, with the exception of \texttt{'*'}, which indicates
a termination codon in a protein sequence, and can be used to indicate
the match to a termination codon in protein:DNA alignments.
FASTA format files from major sequence distributors, like the NCBI and
EBI, have specially formatted description lines, e.g.:\\
\indent
\texttt{
>gi|54321|ref|np\_12345| example NCBI refseq sequence\\
>np\_12345| example NCBI refseq sequence\\
}
or\\
\indent
\texttt{
>sw:gstm1\_human P01234 glutathione transferase GSTM1 - human\\
>sp:gstm1\_human P01234 glutathione transferase GSTM1 - human\\
}
or
\indent
\texttt{
>sp|P09488|GSTM1\_HUMAN glutathione transferase GSTM1 - human\\
}
Several sample test files are included with the FASTA distribution:
......@@ -852,7 +865,11 @@ can use \texttt{-m 1 -m 6 -m 9}.
comments, \texttt{-m 8XC} without comments) and, if available, an
annotation encoding matching FASTA \texttt{-m 9C} output. All the
\texttt{-m 9c/C/d/D} encodings are available with BLAST tabular
output using \texttt{-m 8C[c/C/d/D]}.
output using \texttt{-m 8C[c/C/d/D]}. In the v36.3.8h release, a
new option has been added to \texttt{-m 8CB}, \texttt{-m 8CBL} (or
\texttt{-m 8CBl}). The \texttt{L/l} option adds the lengths of the
query and subject sequences after the \texttt{seqid}'s to BLAST
tabular output, e.g. \texttt{qseqid qlen sseqid slen percid ...}
\item[\texttt{-m 9}] display alignment coordinates and scores with the
best score information. \texttt{-m 9i} provides alignment length,
......@@ -926,7 +943,7 @@ while \texttt{-m 9D} would be:
\texttt{1M1X2M4X2M1X2M7X3M9D1M2X1M4X2M1X1M1X2I1X1M1X1M3X1M2X1I3M1D1X1M2X1M}
\end{footnotesize}
\item[\texttt{-m 10}]
a parseable format for use with other programs.
a parseable format for use with other programs (this option is no longer reliably tested; \texttt{-m 8CBL} is easier to parse and tested more extensively).
\item[\texttt{-m 11}]
Provide \texttt{lav}-like output (used by \texttt{lalign}) for graphical output.
\begin{quote}
......@@ -1124,17 +1141,24 @@ not treated as low complexity by the translated alignment
programs. (There is an option in the \texttt{Makefile},
\texttt{-DDNALIB\_LC}, to enable preserving case in DNA sequences.)
\item[\texttt{-t \#}]
Translation table - fastx36, tfastx36, fasty36, and
tfasty3 now support the BLAST translation tables. See
\url{http://www.ncbi.nih.gov/Taxonomy/Utils/wprintgc.cgi}.
\texttt{-t t} or \texttt{-t t\#} enables the addition of
an implicit termination codon to a protein:translated DNA match. That
is, each protein sequence implicitly ends with \texttt{*}, which
matches the termination codes for the appropriate genetic code.
\texttt{-t t\#} sets implicit termination and a different genetic
code.
\item[\texttt{-t \#}] Translation table - fastx36, tfastx36, fasty36,
and tfasty36 now support the BLAST translation tables. See
\url{http://www.ncbi.nih.gov/Taxonomy/Utils/wprintgc.cgi}.
\texttt{-t 1} also enables translation of \texttt{'TGA'} to
\texttt{'U'} (seleno-cysteine) (by default, \texttt{'TGA'} is
translated to \texttt{'*'}). Because of the ambiguity of the
\texttt{'TGA'} codon, translated alignments of \texttt{'TGA'} with
\texttt{-t 1} match \texttt{'U'} and \texttt{'*'} (termination)
equally well.
\texttt{-t t} enables the addition of an implicit termination codon to
a protein:translated DNA match. That is, each protein sequence
implicitly ends with \texttt{*}, which matches the termination codes
for the appropriate genetic code. To change the translation table and
insert a termination character after each protein sequence, use
\texttt{-t 1 -t t}.
\item[\texttt{-T \#}]
set number of threads/workers. Normally on a multi-core machine, the maximum
number of processors/cores is used.
......@@ -1349,7 +1373,13 @@ A number of rarely used options are now only available as extended options:
\item[\texttt{X1}] sort output by \texttt{init1} score (for
compatibility with FASTP; obsolete).
\item[\texttt{XB}] (Previously \texttt{-B}.) Show the z-score, rather
\item[\texttt{XB}] Calculate percent identity, percent similarity, and
alignment length using the BLAST model, which excludes gapped residues.
This allows very high identity alignments with large gaps to look
much closer, but causes the alignment length to drop by the length
of the gap.
\item[\texttt{Xb}] (Previously \texttt{-B}.) Show the z-score, rather
than the bit-score in the list of best scores (rarely used, provided
for backward compatibility).
......@@ -1795,6 +1825,7 @@ read the libraries in the following formats:\\
5 & NBRF/PIR VMS (\texttt{>P1;SEQID}/comment/sequence) (obsolete)\\
6 & GCG (version 8.0) Unix Protein and DNA (compressed)\\
7 & FASTQ (sequence only, quality ignored)\\
9 & a script that is executed to produce FASTA format sequences \\
10 & subset format (</slib2/swissprot.lseg 0:2 4|) \\
11 & NCBI Blast1.3.2 format (unix only) (obsolete)\\
12 & NCBI Blast2.0 format\\
......@@ -1870,11 +1901,15 @@ remember where the libraries are kept or how they are named.
\section{Frequently Asked Questions (FAQs)}
{\noindent}\textbf{Where can I get FASTA?} --
\url{http://faculty.virginia.edu/wrpearson/fasta} has the latest
versions of the FASTA programs. This document describes
\texttt{\CURRENT}, which is available from
\url{http://faculty.virginia.edu/wrpearson/fasta/fasta3.tar.gz}.
In addition, pre-compiled versions of the programs are available for
The most current version of the FASTA source code is available from
\url{http://github.com/wrpearson/fasta36}. In addition, you can get
the programs from \url{http://faculty.virginia.edu/wrpearson/fasta},
but sometimes there is a lag between the latest release on GitHub and
the compiled versions at \url{faculty.virginia.edu}. This document
describes \texttt{\CURRENT}, which is available from
\url{http://faculty.virginia.edu/wrpearson/fasta/fasta3.tar.gz}. In
addition, pre-compiled versions of the programs are available for
MacOSX and Windows.
\needspace{4\baselineskip}
......@@ -1887,7 +1922,7 @@ Query & Library & FASTA pgm. & BLAST pgm. & \\[1.2ex]
Prot. & Prot. & \texttt{fasta36} & \texttt{blastp} & heuristic local similarity \\
& & \texttt{ssearch36} & & optimal local sim.\\
& & \texttt{ggsearch36} & & global:global sim. \\
& & \texttt{ggsearch36} & & global:local sim.\\
& & \texttt{glsearch36} & & global:local sim.\\
DNA & DNA & \texttt{fasta36}$^*$ & \texttt{blastn} & \\[1.2ex]
\hline \\[-1.0ex]
Prot. & Prot. & \texttt{lalign36} & & multiple non-intersecting \\
......@@ -2029,7 +2064,7 @@ As always, please inform me of bugs as soon as possible.
\begin{quote}
William R. Pearson\\
Department of Biochemistry\\
Jordan Hall Box 800733\\
Pinn Hall Box 800733\\
U. of Virginia\\
Charlottesville, VA\\
wrp@virginia.EDU
......
README_v36.3.8h.md
\ No newline at end of file
......@@ -111,7 +111,7 @@ compilation with Sun compiler with Makefile.sun_x86.
This release provides an extremely efficient SSE2 implementation of
the Smith-Waterman algorithm for the SSE2 vector instructions written
by Michael Farrar (farrar.michael@gmail.com). The SSE code speeds up
by Michael Farrar. The SSE code speeds up
Smith-Waterman 8 - 10-fold in my tests, making it comparable to Eric
Lindahl's Altivec code for the Apple/IBM G4/G5 architecture.
......
......@@ -6,6 +6,205 @@ multiple high-scoring alignments to be shown, rather than just one.
This is the main functional difference between FASTA and BLAST -
BLAST could show multiple HSPs, FASTA did not.
>>Aug. 9, 2019
[src/ncbl2_mlib.c, ncbl2_head.h]
Modest extensions made to support reading makeblastdb format v5
databases. Changes have only been made to read the db.pin file, but
things work in simple tests.
>>July 16, 2019
[src/comp_lib9.c]
Fixed a memory leak problem when searching with large libraries that
could be memory mapped (libraries with .xin index files). If the
library did not fit in memory, then the program kept allocating new memory.
By default, the largest database that fits in memory must be less than
16 GB. Larger libraries will be re-read, which slows down multi-query
searches considerably. To increase the size of the library allowed in
memory, use the option: "-X M32G" to fit 32 GB libraries.
>>Mar. 8, 2019
[src/initfa.c,faatran.c,dropfx2.c]
Modify translation table 1 to allow selenocysteine translation
(TGA->'U'), and modify scoring matrices to give positive scores to
'*':'U'. The translation modification ONLY works with "-t 1". In
addition, BLAST BTOP alignments (-m 8CB) convert a 'U' aligned with a
'*' to a '*', so the end of the alignment is '**' rather than 'U*'
(fastx36) or '*U' (tfastx36).
dropfx2.c (fastx36/tfastx36), dropfz3.c(fasty36/tfasty36) did not
properly switch protein and translated DNA codes with -m 8CB -- fixed.
version date updated to Mar, 2019
>>Feb. 26, 2019
[scripts/get_genome_seq.py]
added get_genome_seq.py as a replacement for get_hg38_bed.py, remove
get_hg38_bed.py. 'get_genome_seq.py --genome mm10' also produces
sequences from mouse mm10 (and can now do any genome that bedtools can
read).
>>Feb. 23, 2019
[src/comp_lib9.c, mshowbest.c]
Modify repeat_thresh so that poor alignment scores (E() >
ppst->e_cut_r, typically -E-threshold/10.0) do not look for additional
alignments.
>>Feb. 21, 2019
[src/nmgetaa.c, scaleswn.c, scripts/get_protein.py, get_hg38_bed.py]
Modify nmgetaa.c to ignore ':'s (for sequence subsets) in scripts.
The script can do the subsetting. Modify scripts/get_protein.py to
provide subsetting. Add scripts/get_hg38_bed.py to extract fasta
sequences using the format "chr2:123456-543210"
Modify scaleswn.c to estimate Altschul-Gish parameters when gap and
extension do not match exactly.
>>Feb. 6, 2019
[src/compacc2e.c, nmgetaa.c]
modify build_link_data() to allow '+' for space in scripts. Ensure
that lib_type is properly initialized (open_lib.c()).
>>Jan. 23, 2019
[nmgetaa.c]
Fix bug introduced when checking for lib_type.
>>Jan. 15, 2019
[src/upam.h, altlib.h, nmgetaa.c]
[scripts/rename_exons.py, map_exons_coords.py, get_uniprot.py, get_refseq.py, get_proteins.py]
Bug fixes: The VT10, VT20, etc scoring matrices did not have scores for '*:*'
alignments, used with FASTX/TFASTX for extending alignments through
the termination codon. As a result, searches with '-t t' did not
extend through the termination codon, even though they should have.
This has been fixed.
Enhancements: FASTA can now download both query and library sequences using a script, by specifying file type 9. Thus:
fasta36 "../scripts/get_uniprot.py+P09488 9" /seqlib/swissprot.fasta
Will run the script "get_uniprot.py" with the argument "P09488" and
use the output of the script as the query sequence. In this example,
the library type (9) is specified by the " 9" (this space cannot be
replaced with a '+' character).
Alternatively, library type '9' can be specified by putting a '!' before the script file name.
fasta36 \!../scripts/get_uniprot.py+P09488 /seqlib/swissprot.fasta
Scripts can be used to produce query or library sequences, or both.
Three scripts that download sequences from the NCBI and Uniprot have
been added in the "scripts" directory: "get_uniprot.py" takes Uniprot
accessions as arguments, "get_refseq.py" takes refseq accessions
(protein or mRNA), and "get_protein.py" gets both Uniprot and RefSeq
protein sequences.
rename_exons.py and map_exons_coords.py can take annotated BTOP
alignments with genome coordinates and map exons to the alternative
genome.
>>Jan. 2, 2019
[src/mshowbest.c]
Fix problems with site annotation when dom_info is provided with -m8CBL
[scripts/ann_exons_up_sql.pl, ann_exons_up_www.pl]
Make scripts more robust to missing chromosome information,
reverse-strand coordinates.
>>Dec. 11, 2018
[scripts/ann_exons_up_www.pl, ann_exons_up_sql.pl]
Add the option "--gen_coord" to report exon start ('<') and end ('>')
genome coordinates as exon features.
>>Nov. 14, 2018
[scripts/rename_exons.py, relabel_domains.py, compacc2e.c]
Two new scripts, rename_exons.py and relabel_domains.py, that take a
blast tabular output file with domain alignment annotations (and
possibly raw domain information) and modify the names
(rename_exons.py) or colors (relabel_domains.py). rename_exons.py
takes the exon numbering associated with the query sequence and maps
it onto the subject alignments. relabel_domains.py can be used to use
different color numbers for homologous and non-homologous domains.
Both of these programs modify blast tabular output files, which can
then be merged back into an alignment display using
merge_blastp_annot.pl or merge_fasta_annot.pl.
compacc2e.c:build_link_data() has been modified to convert '+' in the
script string to ' ', to allow passing command line options. A space
in the script string is used to separate the script from the library
type of the file returned by the script.
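Because '+' is converted to ' ', a script string can be assembled from a script name and its options without embedding spaces. A minimal Python sketch of that assembly follows; the helper name join_script_string is hypothetical and is not part of the FASTA package.

```python
def join_script_string(script, args):
    """Join a script name and its arguments with '+'.

    Hypothetical helper: it produces the '+'-separated form described
    in the text, in which each '+' is later converted to a space.
    """
    parts = [script] + list(args)
    for part in parts:
        # an item that itself contains '+' or ' ' cannot be
        # represented unambiguously in this form
        if "+" in part or " " in part:
            raise ValueError("cannot encode %r" % part)
    return "+".join(parts)


if __name__ == "__main__":
    print(join_script_string("get_uniprot.py", ["P09488"]))
    print(join_script_string("ann_pfam_sql.pl",
                             ["--db=pfam31", "--neg", "--vdoms"]))
```

The helper rejects items containing '+' or a space, since those characters already have a meaning in the script string.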
>>Nov. 6-7, 2018
[doinit.c, mshowbest.c, mshowalign2.c, defs.h, structs.h]
(a) Add options to provide query and subject sequence lengths and raw
domain coordinates in BLASTP tabular output with the options -m 8CBl
and -m 8CBL. If domain annotations are available, -m 8CBL also
provides the raw domain coordinates (not just those included in the
alignment) in the form |DX:1-100;C=PF12345|XD:1-100;C=PF12345 where
|DX indicates a query annotation and |XD indicates a subject annotation. -m
8CBl (lower-case L) shows the sequence lengths, but not the raw domain
info.
(b) parse the annotation program strings so that '+' are converted to
' '. This greatly simplifies passing arguments to the annotation scripts. Thus:
-V \!ann_pfam_sql.pl --db=pfam31 --neg --vdoms can be written as:
-V \!ann_pfam_sql.pl+--db=pfam31+--neg+--vdoms (likewise for -V q\!ann_pfam...)
(c) provide an option to remove region/feature annotations from non-m8
(blast-tabular) output. This simplifies the process of using
scripts/merge_fasta_btab.pl to use .bl_tab (-m 8CBL) files to inject
sub-alignment scores and domain information.
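The raw domain coordinate form described in (a), |DX:1-100;C=PF12345|XD:1-100;C=PF12345, can be read with a short Python sketch. It assumes that the raw domain text has already been extracted from the tabular line as a single string and that entries are separated by '|'; the function name parse_raw_domains is hypothetical.

```python
import re

# one raw-domain entry, for example "DX:1-100;C=PF12345"
_DOMAIN_RE = re.compile(r"^(DX|XD):(\d+)-(\d+);C=(.+)$")


def parse_raw_domains(field):
    """Return a list of (side, start, end, name) tuples.

    side is "query" for DX entries and "subject" for XD entries.
    """
    domains = []
    for entry in field.split("|"):
        if not entry:
            continue  # empty piece before the leading '|'
        m = _DOMAIN_RE.match(entry)
        if m is None:
            continue  # skip anything not in the documented form
        side = "query" if m.group(1) == "DX" else "subject"
        domains.append((side, int(m.group(2)), int(m.group(3)), m.group(4)))
    return domains


if __name__ == "__main__":
    for dom in parse_raw_domains("|DX:1-100;C=PF12345|XD:1-100;C=PF12345"):
        print(dom)
```

Entries that do not match the documented form are skipped rather than treated as errors, which keeps the sketch tolerant of fields it does not know about.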
>>Nov. 1, 2018
[doinit.c]
Allow -m F#=file.name in addition to -m "F# file.name" to address
problems I had with spaces in shell scripts.
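The two equivalent spellings can be pictured with a small Python sketch that splits the option value at the first '=' or space. This is an illustration only, not the code in doinit.c; the function name, the "8CB" format string, and the file name used below are assumptions for the example.

```python
def parse_m_file_option(value):
    """Split the argument of an -m F... option into (format, file_name)."""
    if not value.startswith("F"):
        raise ValueError("not an -m F option: %r" % value)
    body = value[1:]
    # use whichever separator comes first: '=' or ' '
    positions = [p for p in (body.find("="), body.find(" ")) if p >= 0]
    if not positions:
        raise ValueError("no file name in %r" % value)
    cut = min(positions)
    return body[:cut], body[cut + 1:]


if __name__ == "__main__":
    print(parse_m_file_option("F8CB=out.bl_tab"))
    print(parse_m_file_option("F8CB out.bl_tab"))
```

The '=' form needs no quoting in a shell script, which is the problem the change addresses.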
>>Oct. 23, 2018 [re-released as fasta-36.3.8g] (see README_v36.3.8g.md)
[make/Makefiles*,psisearch2/m89_btop_msa2.pl]
Add options to psisearch2/m89_btop_msa2.pl to provide clustalw header
(--clustal), require a minimum coverage of the query sequence
(--min_align 0.8), and edit sequence identifiers to remove database
and accession (--trunc_acc).
Remove -lz dependency from non-debug Makefiles.
>>Aug. 5, 2018 [re-released as fasta-36.3.8g]
[lib_sel.c]
Make lib_sel.c more robust to missing indirect name files.
[scripts/ann*.pl]
update various annotation scripts to use https:// instead of http://
>>April 3, 2018
[initfa.c, comp_lib.c, dropfx2.c]
Changes to (a) ensure that the "-t t" option correctly inserts and
aligns a termination codon '*', and (b) modify -m 8CB, -m8CC, and -m9C
so that aligned termination codons are indicated as "**" (-m8CB) or
"*1" (-m8CC, -m9C).
>>Mar. 9, 2018
[scripts/annot_blast_btop2.pl, merge_blast_btab.pl, blastp_annot_cmd.sh]
Code is now in place to provide sub-alignment scoring using domain
annotations with blastp searches (BLOSUM62 only). blastp_annot_cmd.sh
runs blast and produces both a standard HTML and a tabular output
file. It then runs annot_blast_btop2.pl to add sub-alignment scoring
to the tabular output file, and then merge_blast_btab.pl merges the
domain-annotated blast tabular file with the HTML output file. When
combined in this way, the FASTA web server (fasta.bioch.virginia.edu)
can produce blastp searches with domain highlights/scoring.
>>Feb. 6, 2018
[initfa.c, doinit.c, mshowbest.c, mshowalign2.c]
Add a new extended option, -XB, which causes percent identity, percent
similarity, and alignment length to be presented using the BLAST
model, which does not count gaps in the alignment length.
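The effect of counting or not counting gaps in the alignment length can be seen in a small Python sketch. It only contrasts the two denominators for percent identity; the exact formulas used by FASTA and BLAST are not reproduced here, and the function name and the example alignment are made up for illustration.

```python
def percent_identity(query_aln, subject_aln, count_gaps):
    """Percent identity of two aligned strings ('-' marks a gap).

    count_gaps=True  divides by every alignment column;
    count_gaps=False divides only by columns without a gap.
    """
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned strings must have equal length")
    identical = 0
    gap_columns = 0
    for q, s in zip(query_aln, subject_aln):
        if q == "-" or s == "-":
            gap_columns += 1
        elif q == s:
            identical += 1
    length = len(query_aln)
    if not count_gaps:
        length -= gap_columns
    if length == 0:
        return 0.0
    return 100.0 * identical / length


if __name__ == "__main__":
    q = "ACDEF-GH"
    s = "ACD-FYGH"
    print(percent_identity(q, s, count_gaps=True))
    print(percent_identity(q, s, count_gaps=False))
```

For the same alignment, leaving gap columns out of the length can only raise or preserve the reported percent identity.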
>>Dec. 30, 2017 [released as fasta-36.3.8g]
[scaleswn.c]
Replace np_to_z() with np1_to_z(), which does not subtract low
......
Makefile.linux64_sse2
\ No newline at end of file
# $ Id: $
#
# makefile for fasta3, fasta3_t Use Makefile.mpi for fasta36_mpi
#
# This file is designed for 64-bit Linux systems using an X86
# architecture with SSE2 extensions. -D_LARGEFILE64_SOURCE and
# -DBIG_LIB64 require a 64-bit linux system.
# SSE2 extensions are used for ssearch35(_t)
#
# Use Makefile.linux32_sse2 for 32-bit linux x86
#
SHELL=/bin/bash
CC = gcc -g -O -msse2
LIB_DB=
#CC= gcc -pg -g -O -msse2 -ffast-math
#CC = gcc -g -DDEBUG -msse2
#CC=gcc -Wall -pedantic -ansi -g -msse2 -DDEBUG
# EBI uses the following with pgcc, -O3 does not work:
# CC= pgcc -O2 -pipe -mcpu=pentiumpro -march=pentiumpro -fomit-frame-pointer
# this file works for x86 LINUX
# standard options
CFLAGS= -DSHOW_HELP -DSHOWSIM -DUNIX -DTIMES -DHZ=100 -DMAX_WORKERS=8 -DTHR_EXIT=pthread_exit -DM10_CONS -D_REENTRANT -DHAS_INTTYPES -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -DUSE_FSEEKO -DSAMP_STATS -DPGM_DOC -DUSE_MMAP -D_LARGEFILE64_SOURCE -DBIG_LIB64
# -I/usr/include/mysql -DMYSQL_DB
# -DSUPERFAMNUM -DSFCHAR="'|'"
#
#(for mySQL databases) (also requires change to Makefile36m.common or use of Makefile36m.common_mysql)
# run 'mysql_config' to find locations of mySQL files
LIB_M = -lm
# for mySQL databases
# LIB_M = -L/usr/lib64/mysql -lmysqlclient -lm
HFLAGS= -o
NFLAGS= -o
# for Linux
THR_SUBS = pthr_subs2
THR_LIBS = -lpthread
THR_CC =
BIN = ../bin
XDIR = /seqprg/bin
#XDIR = ~/bin/LINUX
# set up files for SSE2/Altivec acceleration
#
include ../make/Makefile.sse_alt
# SSE2 acceleration
#
DROPGSW_O = $(DROPGSW_SSE_O)
DROPLAL_O = $(DROPLAL_SSE_O)
DROPGNW_O = $(DROPGNW_SSE_O)
DROPLNW_O = $(DROPLNW_SSE_O)
# renamed (fasta36) programs
include ../make/Makefile36m.common
# conventional (fasta3) names
# include ../make/Makefile.common
......@@ -13,9 +13,11 @@ SHELL=/bin/bash
#CC= gcc -g -O
#CC = gcc -g -DDEBUG
#LIB_DB=
#CC=gcc -Wall -pedantic -ansi -g -O
CC= /usr/local/parasoft/bin/insure -g -DDEBUG
LIB_DB=-lz
# EBI uses the following with pgcc, -O3 does not work:
# CC= pgcc -O2 -pipe -mcpu=pentiumpro -march=pentiumpro -fomit-frame-pointer
......
......@@ -13,9 +13,11 @@
SHELL=/bin/bash
CC= gcc -g -O -msse2 -ffast-math
LIB_DB=
#CC = gcc -g -DDEBUG -msse2
#CC= /usr/local/parasoft/bin/insure -g -DDEBUG
#LIB_DB=-lz
#CC=gcc -Wall -pedantic -ansi -g -O
......