Skip to content
Commits on Source (2)
2019-11-28 Martin C. Frith <Martin C. Frith>
* doc/last-train.txt, scripts/last-train, test/last-train-test.out:
train: double scale if integer-rounding is bad
[887c0e6339d1] [tip]
2019-11-27 Martin C. Frith <Martin C. Frith>
* scripts/last-train, test/last-train-test.out:
!Make last-train use balanced length probability
[fff603c39404]
* scripts/last-train, test/last-train-test.out:
train: change integer-rounding of gap open costs
[d4e0fcd8cc2b]
2019-11-26 Martin C. Frith <Martin C. Frith>
* src/LastalArguments.cc, src/lastal.cc, test/last-test.out:
!Change lastal default gapless score threshold
[24e288b4947c]
* scripts/last-train:
Refactor
[4a486c301cd5]
* scripts/last-train:
Refactor
[932a5e822d1c]
2019-11-21 Martin C. Frith <Martin C. Frith>
* scripts/last-train:
Refactor
[037daaa2fe00]
* scripts/last-train:
Refactor
[ad8ed4200d8c]
* scripts/last-train:
Refactor
[87e4bbb5b03f]
2019-11-18 Martin C. Frith <Martin C. Frith>
* src/GappedXdropAligner.cc, src/GappedXdropAligner.hh,
src/GappedXdropAligner2qual.cc, src/GappedXdropAligner3frame.cc,
src/GappedXdropAligner3framePssm.cc, src/GappedXdropAlignerInl.hh,
src/GappedXdropAlignerPssm.cc, src/mcf_simd.hh:
Refactor
[f0bc43d74861]
* src/GappedXdropAligner.cc, src/GappedXdropAlignerInl.hh,
src/GappedXdropAlignerPssm.cc, src/ScoreMatrix.cc,
src/ScoreMatrix.hh, src/mcf_simd.hh:
Refactor
[f164ba36017b]
* scripts/last-train, src/Alignment.cc, src/Centroid.cc,
src/Centroid.hh, src/GappedXdropAligner.cc,
src/GappedXdropAligner.hh, test/last-test.out:
Remove always-zero counts from lastal -j7
[43344af70eab]
2019-11-15 Martin C. Frith <Martin C. Frith>
* src/Centroid.cc:
Re-order some probability calculations
[f46e1c98977c]
* doc/lastal.txt, examples/multiMito.maf, scripts/last-train,
src/Alignment.cc, src/Centroid.cc, src/Centroid.hh, test/last-
test.out, test/last-train-test.out, test/maf-swap-test.out:
!Get lastal probabilities from "model A"
[033e6b4b0308]
* src/Centroid.cc:
Re-order probability calculations
[b97eb8aee1f7]
2019-11-14 Martin C. Frith <Martin C. Frith>
* src/Centroid.cc:
Refactor (better variable names)
[54db8ec6938c]
* doc/lastal.txt, src/Alignment.cc, src/Centroid.cc, src/Centroid.hh,
src/LastalArguments.cc:
Remove probabilities for generalized affine gaps
[2e10e14cc66d]
2019-11-12 Martin C. Frith <Martin C. Frith>
* src/GappedXdropAlignerPssm.cc:
Make fastq gapped alignment faster (SIMD)!
[6afae8dec8f0]
* src/GappedXdropAlignerPssm.cc:
Refactor
[168d611f79ab]
* src/GappedXdropAlignerPssm.cc, test/last-test.out, test/last-
test.sh:
Refactor
[bdc3abede08a]
* makefile, src/Centroid.cc, src/GappedXdropAligner.cc,
src/GappedXdropAligner.hh, src/GappedXdropAligner2qual.cc,
src/GappedXdropAlignerPssm.cc, src/makefile, src/mcf_simd.hh:
Make fasta gapped alignment faster (SIMD)!
[2a74b539aecf]
2019-11-11 Martin C. Frith <Martin C. Frith>
* src/GappedXdropAligner.cc:
Refactor
[92f84eaa2c77]
* src/Centroid.cc, src/Centroid.hh, src/GappedXdropAligner.cc,
src/GappedXdropAligner.hh, src/makefile,
src/mcf_contiguous_queue.hh:
Refactor (not sure if useful)
[f1192efe0bec]
* src/Alignment.cc, src/GappedXdropAligner.cc,
src/GappedXdropAligner.hh, src/GappedXdropAligner2qual.cc,
src/GappedXdropAlignerPssm.cc, src/GreedyXdropAligner.cc,
src/GreedyXdropAligner.hh:
Refactor
[de3177d4a626]
* src/GappedXdropAligner.cc, src/GappedXdropAligner.hh,
src/GappedXdropAligner2qual.cc, src/GappedXdropAligner3frame.cc,
src/GappedXdropAligner3framePssm.cc, src/GappedXdropAlignerPssm.cc:
Refactor
[b726e939f94c]
* src/GappedXdropAligner.cc, src/GappedXdropAligner.hh,
src/GappedXdropAligner2qual.cc, src/GappedXdropAlignerPssm.cc:
Refactor
[1587261f2469]
* src/Centroid.hh, src/GappedXdropAligner.cc,
src/GappedXdropAligner.hh, src/GappedXdropAligner2qual.cc,
src/GappedXdropAligner3frame.cc,
src/GappedXdropAligner3framePssm.cc, src/GappedXdropAlignerInl.hh,
src/GappedXdropAlignerPssm.cc:
Refactor (std::size_t -> size_t)
[0cdc7da1019d]
* src/Centroid.cc, src/Centroid.hh, src/GappedXdropAligner.cc,
src/GappedXdropAligner.hh, src/GappedXdropAligner3frame.cc,
src/alp/njn_random.cpp, src/makefile:
Make lastal gapped alignment a bit faster
[5506d2756b47]
* src/LastEvaluer.hh, test/last-test.out, test/last-test.sh:
Fix lastal crash on empty lastdb database
[e5c59d1ed3a6]
2019-10-09 Martin C. Frith <Martin C. Frith>
* data/BL80.mat, data/PAM10.mat, data/PAM30.mat:
Add PAM10 substitution score matrix
[e57e547b8235]
2019-10-08 Martin C. Frith <Martin C. Frith>
* src/lastal.cc, src/lastdb.cc:
Omit non-alphabet letters for DB seq lengths
[7f841cc97f20]
* src/SubsetMinimizerFinder.cc, src/SubsetMinimizerFinder.hh,
src/SubsetSuffixArray.cc, src/lastal.cc:
Refactor
[4011a21f96ad]
2019-10-07 Martin C. Frith <Martin C. Frith>
* src/ScoreMatrix.cc, src/ScoreMatrix.hh, src/lastal.cc,
src/mcf_substitution_matrix_stats.cc,
src/mcf_substitution_matrix_stats.hh:
Allow score matrix with letter frequencies
[50bcc51ca59b]
* src/CyclicSubsetSeed.hh, src/ScoreMatrix.cc, src/ScoreMatrix.hh,
src/ScoreMatrixRow.hh, src/makefile:
Refactor
[ca6613dd1f06]
* src/LastEvaluer.cc, src/LastEvaluer.hh, src/lastal.cc:
Enable stored E-value params for any genetic code
[f757e6ca4edb]
* build/gc-inc.sh, data/gc.prt, doc/lastal.txt, src/GeneticCode.cc,
src/GeneticCode.hh, src/LastalArguments.cc, src/lastal.cc,
src/makefile, test/last-test.out, test/last-test.sh:
Add NCBI genetic codes
[88adecc284af]
* doc/last-evalues.txt, src/LastEvaluer.cc, src/LastEvaluer.hh,
src/LastalArguments.cc, src/LastalArguments.hh, src/lastal.cc,
src/lastdb.cc, test/last-test.out:
Improve E-values for multiple short DB sequences
[49caf55d4d4d]
2019-08-21 Martin C. Frith <Martin C. Frith>
* doc/last-dotplot.txt, scripts/last-dotplot:
dotplot: change satellite-repeat color to purple
[fb43f850eba2] [tip]
[fb43f850eba2]
2019-07-16 Martin C. Frith <Martin C. Frith>
......
#! /bin/sh
# This generates source code from genetic codes.
cat <<EOF
const struct {
const char *name;
const char *text;
} geneticCodes[] = {
EOF
cat "$@" | tr -d '",' |
awk '
$1 == "id" {print "{\"" $2 "\", \"\\"}
$1 == "ncbieaa" {print " AAs = " $2 "\\n\\"}
/-- Base/ {print $2 " = " $3 "\\n\\"}
/-- Base3/ {print "\"},"}
'
echo "};"
# This protein scoring scheme is good at finding somewhat
# short-and-strong similarities. (S Henikoff & JG Henikoff, PNAS 1992
# This protein scoring scheme is good for finding somewhat strong, and
# short, similarities. (S Henikoff & JG Henikoff, PNAS 1992
# 89(22):10915-9).
#nickname BLOSUM80
......
# This protein scoring scheme is good for finding very strong, and
# short, similarities (MO Dayhoff et al. 1978).
#last -a20 -b3
A R N D C Q E G H I L K M F P S T W Y V B Z X *
A 7 -10 -7 -6 -10 -7 -5 -4 -11 -8 -9 -10 -8 -12 -4 -3 -3 -20 -11 -5 -6 -6 -6 -23
R -10 9 -9 -17 -11 -4 -15 -13 -4 -8 -12 -2 -7 -12 -7 -6 -10 -5 -14 -11 -11 -7 -9 -23
N -7 -9 9 -1 -17 -7 -5 -6 -2 -8 -10 -4 -15 -12 -9 -2 -5 -11 -7 -12 7 -6 -6 -23
D -6 -17 -1 8 -21 -6 0 -6 -7 -11 -19 -8 -17 -21 -12 -7 -8 -21 -17 -11 7 -1 -9 -23
C -10 -11 -17 -21 10 -20 -20 -13 -10 -9 -21 -20 -20 -19 -11 -6 -11 -22 -7 -9 -18 -20 -13 -23
Q -7 -4 -7 -6 -20 9 -1 -10 -2 -11 -8 -6 -7 -19 -6 -8 -9 -19 -18 -10 -6 7 -8 -23
E -5 -15 -5 0 -20 -1 8 -7 -9 -8 -13 -7 -10 -20 -9 -7 -9 -23 -11 -10 -1 7 -8 -23
G -4 -13 -6 -6 -13 -10 -7 7 -13 -17 -14 -10 -12 -12 -10 -4 -10 -21 -20 -9 -6 -8 -8 -23
H -11 -4 -2 -7 -10 -2 -9 -13 10 -13 -9 -10 -17 -9 -7 -9 -11 -10 -6 -9 -4 -4 -8 -23
I -8 -8 -8 -11 -9 -11 -8 -17 -13 9 -4 -9 -3 -5 -12 -10 -5 -20 -9 -1 -9 -9 -8 -23
L -9 -12 -10 -19 -21 -8 -13 -14 -9 -4 7 -11 -2 -5 -10 -12 -10 -9 -10 -5 -12 -10 -9 -23
K -10 -2 -4 -8 -20 -6 -7 -10 -10 -9 -11 7 -4 -20 -10 -7 -6 -18 -12 -13 -5 -6 -8 -23
M -8 -7 -15 -17 -20 -7 -10 -12 -17 -3 -2 -4 12 -7 -11 -8 -7 -19 -17 -4 -16 -8 -9 -23
F -12 -12 -12 -21 -19 -19 -20 -12 -9 -5 -5 -20 -7 9 -13 -9 -12 -7 -1 -12 -14 -20 -12 -23
P -4 -7 -9 -12 -11 -6 -9 -10 -7 -12 -10 -10 -11 -13 8 -4 -7 -20 -20 -9 -10 -7 -8 -23
S -3 -6 -2 -7 -6 -8 -7 -4 -9 -10 -12 -7 -8 -9 -4 7 -2 -8 -10 -10 -4 -8 -6 -23
T -3 -10 -5 -8 -11 -9 -9 -10 -11 -5 -10 -6 -7 -12 -7 -2 8 -19 -9 -6 -6 -9 -7 -23
W -20 -5 -11 -21 -22 -19 -23 -21 -10 -20 -9 -18 -19 -7 -20 -8 -19 13 -8 -22 -13 -21 -16 -23
Y -11 -14 -7 -17 -7 -18 -11 -20 -6 -9 -10 -12 -17 -1 -20 -10 -9 -8 10 -10 -9 -13 -11 -23
V -5 -11 -12 -11 -9 -10 -10 -9 -9 -1 -5 -13 -4 -12 -9 -10 -6 -22 -10 8 -11 -10 -8 -23
B -6 -11 7 7 -18 -6 -1 -6 -4 -9 -12 -5 -16 -14 -10 -4 -6 -13 -9 -11 7 -3 -8 -23
Z -6 -7 -6 -1 -20 7 7 -8 -4 -9 -10 -6 -8 -20 -7 -8 -9 -21 -13 -10 -3 7 -8 -23
X -6 -9 -6 -9 -13 -8 -8 -8 -8 -8 -9 -8 -9 -12 -8 -6 -7 -16 -11 -8 -8 -8 -8 -23
* -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 1
# This protein scoring scheme is good for finding short-and-strong
# This protein scoring scheme is good for finding strong, and short,
# similarities (MO Dayhoff et al. 1978).
#last -a13 -b3
......
--**************************************************************************
-- This is the NCBI genetic code table
-- Initial base data set from Andrzej Elzanowski while at PIR International
-- Addition of Eubacterial and Alternative Yeast by J.Ostell at NCBI
-- Base 1-3 of each codon have been added as comments to facilitate
-- readability at the suggestion of Peter Rice, EMBL
-- Later additions by Taxonomy Group staff at NCBI
--
-- Version 4.5
-- Added Cephalodiscidae mitochondrial genetic code 33
--
-- Version 4.4
-- Added GTG as start codon for genetic code 3
-- Added Balanophoraceae plastid genetic code 32
--
-- Version 4.3
-- Change to CTG -> Leu in genetic codes 27, 28, 29, 30
--
-- Version 4.2
-- Added Karyorelict nuclear genetic code 27
-- Added Condylostoma nuclear genetic code 28
-- Added Mesodinium nuclear genetic code 29
-- Added Peritrich nuclear genetic code 30
-- Added Blastocrithidia nuclear genetic code 31
--
-- Version 4.1
-- Added Pachysolen tannophilus nuclear genetic code 26
--
-- Version 4.0
-- Updated version to reflect numerous undocumented changes:
-- Corrected start codons for genetic code 25
-- Name of new genetic code is Candidate Division SR1 and Gracilibacteria
-- Added candidate division SR1 nuclear genetic code 25
-- Added GTG as start codon for genetic code 24
-- Corrected Pterobranchia Mitochondrial genetic code (24)
-- Added genetic code 24, Pterobranchia Mitochondrial
-- Genetic code 11 is now Bacterial, Archaeal and Plant Plastid
-- Fixed capitalization of mitochondrial in codes 22 and 23
-- Added GTG, ATA, and TTG as alternative start codons to code 13
--
-- Version 3.9
-- Code 14 differs from code 9 only by translating UAA to Tyr rather than
-- STOP. A recent study (Telford et al, 2000) has found no evidence that
-- the codon UAA codes for Tyr in the flatworms, but other opinions exist.
-- There are very few GenBank records that are translated with code 14,
-- but a test translation shows that retranslating these records with code
-- 9 can cause premature terminations. Therefore, GenBank will maintain
-- code 14 until further information becomes available.
--
-- Version 3.8
-- Added GTG start to Echinoderm mitochondrial code, code 9
--
-- Version 3.7
-- Added code 23 Thraustochytrium mitochondrial code
-- formerly OGMP code 93
-- submitted by Gertraude Berger, Ph.D.
--
-- Version 3.6
-- Added code 22 TAG-Leu, TCA-stop
-- found in mitochondrial DNA of Scenedesmus obliquus
-- submitted by Gertraude Berger, Ph.D.
-- Organelle Genome Megasequencing Program, Univ Montreal
--
-- Version 3.5
-- Added code 21, Trematode Mitochondrial
-- (as deduced from: Garey & Wolstenholme,1989; Ohama et al, 1990)
-- Added code 16, Chlorophycean Mitochondrial
-- (TAG can translated to Leucine instaed to STOP in chlorophyceans
-- and fungi)
--
-- Version 3.4
-- Added CTG,TTG as allowed alternate start codons in Standard code.
-- Prats et al. 1989, Hann et al. 1992
--
-- Version 3.3 - 10/13/95
-- Added alternate intiation codon ATC to code 5
-- based on complete mitochondrial genome of honeybee
-- Crozier and Crozier (1993)
--
-- Version 3.2 - 6/24/95
-- Code Comments
-- 10 Alternative Ciliate Macronuclear renamed to Euplotid Macro...
-- 15 Blepharisma Macro.. code added
-- 5 Invertebrate Mito.. GTG allowed as alternate initiator
-- 11 Eubacterial renamed to Bacterial as most alternate starts
-- have been found in Archea
--
--
-- Version 3.1 - 1995
-- Updated as per Andrzej Elzanowski at NCBI
-- Complete documentation in NCBI toolkit documentation
-- Note: 2 genetic codes have been deleted
--
-- Old id Use id - Notes
--
-- id 7 id 4 - Kinetoplast code now merged in code id 4
-- id 8 id 1 - all plant chloroplast differences due to RNA edit
--
--
--*************************************************************************
Genetic-code-table ::= {
{
name "Standard" ,
name "SGC0" ,
id 1 ,
ncbieaa "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
sncbieaa "---M------**--*----M---------------M----------------------------"
-- Base1 TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
-- Base2 TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
-- Base3 TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
},
{
name "Vertebrate Mitochondrial" ,
name "SGC1" ,
id 2 ,
ncbieaa "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG",
sncbieaa "----------**--------------------MMMM----------**---M------------"
-- Base1 TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
-- Base2 TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
-- Base3 TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
},
{
name "Yeast Mitochondrial" ,
name "SGC2" ,
id 3 ,
ncbieaa "FFLLSSSSYY**CCWWTTTTPPPPHHQQRRRRIIMMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
sncbieaa "----------**----------------------MM---------------M------------"
-- Base1 TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
-- Base2 TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
-- Base3 TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
},
{
name "Mold Mitochondrial; Protozoan Mitochondrial; Coelenterate
Mitochondrial; Mycoplasma; Spiroplasma" ,
name "SGC3" ,
id 4 ,
ncbieaa "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
sncbieaa "--MM------**-------M------------MMMM---------------M------------"
-- Base1 TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
-- Base2 TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
-- Base3 TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
},
{
name "Invertebrate Mitochondrial" ,
name "SGC4" ,
id 5 ,
ncbieaa "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSSSSVVVVAAAADDEEGGGG",
sncbieaa "---M------**--------------------MMMM---------------M------------"
-- Base1 TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
-- Base2 TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
-- Base3 TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
},
{
name "Ciliate Nuclear; Dasycladacean Nuclear; Hexamita Nuclear" ,
name "SGC5" ,
id 6 ,
ncbieaa "FFLLSSSSYYQQCC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
sncbieaa "--------------*--------------------M----------------------------"
-- Base1 TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
-- Base2 TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
-- Base3 TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
},
{
name "Echinoderm Mitochondrial; Flatworm Mitochondrial" ,
name "SGC8" ,
id 9 ,
ncbieaa "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNNKSSSSVVVVAAAADDEEGGGG",
sncbieaa "----------**-----------------------M---------------M------------"
-- Base1 TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
-- Base2 TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
-- Base3 TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
},
{
name "Euplotid Nuclear" ,
name "SGC9" ,
id 10 ,
ncbieaa "FFLLSSSSYY**CCCWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
sncbieaa "----------**-----------------------M----------------------------"
-- Base1 TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
-- Base2 TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
-- Base3 TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
},
{
name "Bacterial, Archaeal and Plant Plastid" ,
id 11 ,
ncbieaa "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
sncbieaa "---M------**--*----M------------MMMM---------------M------------"
-- Base1 TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
-- Base2 TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
-- Base3 TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
},
{
name "Alternative Yeast Nuclear" ,
id 12 ,
ncbieaa "FFLLSSSSYY**CC*WLLLSPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
sncbieaa "----------**--*----M---------------M----------------------------"
-- Base1 TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
-- Base2 TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
-- Base3 TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
},
{
name "Ascidian Mitochondrial" ,
id 13 ,
ncbieaa "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSSGGVVVVAAAADDEEGGGG",
sncbieaa "---M------**----------------------MM---------------M------------"
-- Base1 TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
-- Base2 TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
-- Base3 TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
},
{
name "Alternative Flatworm Mitochondrial" ,
id 14 ,
ncbieaa "FFLLSSSSYYY*CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNNKSSSSVVVVAAAADDEEGGGG",
sncbieaa "-----------*-----------------------M----------------------------"
-- Base1 TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
-- Base2 TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
-- Base3 TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
} ,
{
name "Blepharisma Macronuclear" ,
id 15 ,
ncbieaa "FFLLSSSSYY*QCC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
sncbieaa "----------*---*--------------------M----------------------------"
-- Base1 TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
-- Base2 TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
-- Base3 TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
} ,
{
name "Chlorophycean Mitochondrial" ,
id 16 ,
ncbieaa "FFLLSSSSYY*LCC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
sncbieaa "----------*---*--------------------M----------------------------"
-- Base1 TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
-- Base2 TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
-- Base3 TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
} ,
{
name "Trematode Mitochondrial" ,
id 21 ,
ncbieaa "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNNKSSSSVVVVAAAADDEEGGGG",
sncbieaa "----------**-----------------------M---------------M------------"
-- Base1 TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
-- Base2 TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
-- Base3 TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
} ,
{
name "Scenedesmus obliquus Mitochondrial" ,
id 22 ,
ncbieaa "FFLLSS*SYY*LCC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
sncbieaa "------*---*---*--------------------M----------------------------"
-- Base1 TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
-- Base2 TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
-- Base3 TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
} ,
{
name "Thraustochytrium Mitochondrial" ,
id 23 ,
ncbieaa "FF*LSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
sncbieaa "--*-------**--*-----------------M--M---------------M------------"
-- Base1 TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
-- Base2 TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
-- Base3 TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
} ,
{
name "Pterobranchia Mitochondrial" ,
id 24 ,
ncbieaa "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSSKVVVVAAAADDEEGGGG",
sncbieaa "---M------**-------M---------------M---------------M------------"
-- Base1 TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
-- Base2 TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
-- Base3 TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
} ,
{
name "Candidate Division SR1 and Gracilibacteria" ,
id 25 ,
ncbieaa "FFLLSSSSYY**CCGWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
sncbieaa "---M------**-----------------------M---------------M------------"
-- Base1 TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
-- Base2 TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
-- Base3 TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
} ,
{
name "Pachysolen tannophilus Nuclear" ,
id 26 ,
ncbieaa "FFLLSSSSYY**CC*WLLLAPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
sncbieaa "----------**--*----M---------------M----------------------------"
-- Base1 TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
-- Base2 TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
-- Base3 TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
} ,
{
name "Karyorelict Nuclear" ,
id 27 ,
ncbieaa "FFLLSSSSYYQQCCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
sncbieaa "--------------*--------------------M----------------------------"
-- Base1 TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
-- Base2 TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
-- Base3 TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
} ,
{
name "Condylostoma Nuclear" ,
id 28 ,
ncbieaa "FFLLSSSSYYQQCCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
sncbieaa "----------**--*--------------------M----------------------------"
-- Base1 TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
-- Base2 TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
-- Base3 TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
} ,
{
name "Mesodinium Nuclear" ,
id 29 ,
ncbieaa "FFLLSSSSYYYYCC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
sncbieaa "--------------*--------------------M----------------------------"
-- Base1 TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
-- Base2 TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
-- Base3 TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
} ,
{
name "Peritrich Nuclear" ,
id 30 ,
ncbieaa "FFLLSSSSYYEECC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
sncbieaa "--------------*--------------------M----------------------------"
-- Base1 TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
-- Base2 TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
-- Base3 TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
} ,
{
name "Blastocrithidia Nuclear" ,
id 31 ,
ncbieaa "FFLLSSSSYYEECCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
sncbieaa "----------**-----------------------M----------------------------"
-- Base1 TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
-- Base2 TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
-- Base3 TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
} ,
{
name "Balanophoraceae Plastid" ,
id 32 ,
ncbieaa "FFLLSSSSYY*WCC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
sncbieaa "---M------*---*----M------------MMMM---------------M------------"
-- Base1 TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
-- Base2 TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
-- Base3 TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
} ,
{
name "Cephalodiscidae Mitochondrial" ,
id 33 ,
ncbieaa "FFLLSSSSYYY*CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSSKVVVVAAAADDEEGGGG",
sncbieaa "---M-------*-------M---------------M---------------M------------"
-- Base1 TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
-- Base2 TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
-- Base3 TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
}
}
......@@ -334,7 +334,7 @@ expected number of alignments with greater or equal score, between two
randomly-shuffled sequences of length 1 billion each.</p>
<p><strong>E</strong> is the expected number of alignments with greater or equal
score, between: a random sequence with the same length as the query
sequence, and a random sequence with the same length as the database.</p>
sequence, and random sequences with the same length as the database.</p>
<p>In this example, the two alignments are identical, but their &quot;E&quot;
values are not. That is because the query sequences have different
lengths: human.chrX is longer than human.chrY.</p>
......@@ -375,7 +375,7 @@ are fixed at 0.25.</p>
<h3>Calculation of EG2</h3>
<p>EG2 is calculated from the score like this:</p>
<pre class="literal-block">
EG2 = K * (1 billion) * (1 billion) * exp(-lambda * score)
EG2 = K * exp(-lambda * score) * (1 billion) * (1 billion)
</pre>
<p>The parameters lambda and K are printed in the header of lastal's
output.</p>
......@@ -384,21 +384,30 @@ output.</p>
<h3>Calculation of E</h3>
<p>E is calculated by this formula:</p>
<pre class="literal-block">
E = K * area * exp(-lambda * score) * (number of strands)
E = K * exp(-lambda * score) * strands * area(m, n, score) * N / n,
</pre>
<p>This depends on the number of strands that are searched, either 1 or
2.</p>
<p>The area is approximately (database length) * (query length). However,
if the query (or database) is very short, this formula over-estimates
E. LAST uses a &quot;finite-size correction&quot; to calculate E more
accurately.</p>
<p>where:</p>
<ul>
<li><p class="first"><tt class="docutils literal">strands</tt> is either 1 or 2 (if forward and reverse DNA strands are
searched).</p>
</li>
<li><p class="first"><tt class="docutils literal">m</tt> is the length of the query sequence.</p>
</li>
<li><p class="first"><tt class="docutils literal">n</tt> is the length of the longest sequence in the database.</p>
</li>
<li><p class="first"><tt class="docutils literal">N</tt> is the total length of all database sequences.</p>
</li>
<li><p class="first"><tt class="docutils literal">area(m, n, score)</tt> is <a class="reference external" href="https://doi.org/10.1186/1756-0500-5-286">slightly less than</a> <tt class="docutils literal">m * n</tt>.</p>
</li>
</ul>
</div>
<div class="section" id="effect-of-option-d">
<h3>Effect of option -D</h3>
<p>Option -D indirectly sets the EG2 threshold, via this formula:</p>
<p>Option -D sets the minimum alignment score, by this formula:</p>
<pre class="literal-block">
1 = EG2 * (database length)/1e9 * (option -D)/1e9 * (number of strands)
K * exp(-lambda * score) * strands * D * length(n, score) * N / n ≤ 1,
</pre>
<p>where <tt class="docutils literal">length(n, score)</tt> is slightly less than <tt class="docutils literal">n</tt>.</p>
</div>
<div class="section" id="bit-score">
<h3>Bit score</h3>
......
......@@ -18,7 +18,7 @@ randomly-shuffled sequences of length 1 billion each.
**E** is the expected number of alignments with greater or equal
score, between: a random sequence with the same length as the query
sequence, and a random sequence with the same length as the database.
sequence, and random sequences with the same length as the database.
In this example, the two alignments are identical, but their "E"
values are not. That is because the query sequences have different
......@@ -67,7 +67,7 @@ Calculation of EG2
EG2 is calculated from the score like this::
EG2 = K * (1 billion) * (1 billion) * exp(-lambda * score)
EG2 = K * exp(-lambda * score) * (1 billion) * (1 billion)
The parameters lambda and K are printed in the header of lastal's
output.
......@@ -77,22 +77,26 @@ Calculation of E
E is calculated by this formula::
E = K * area * exp(-lambda * score) * (number of strands)
E = K * exp(-lambda * score) * strands * area(m, n, score) * N / n,
This depends on the number of strands that are searched, either 1 or
2.
where:
The area is approximately (database length) * (query length). However,
if the query (or database) is very short, this formula over-estimates
E. LAST uses a "finite-size correction" to calculate E more
accurately.
* ``strands`` is either 1 or 2 (if forward and reverse DNA strands are
searched).
* ``m`` is the length of the query sequence.
* ``n`` is the length of the longest sequence in the database.
* ``N`` is the total length of all database sequences.
* ``area(m, n, score)`` is `slightly less than
<https://doi.org/10.1186/1756-0500-5-286>`_ ``m * n``.
Effect of option -D
~~~~~~~~~~~~~~~~~~~
Option -D indirectly sets the EG2 threshold, via this formula::
Option -D sets the minimum alignment score, by this formula::
1 = EG2 * (database length)/1e9 * (option -D)/1e9 * (number of strands)
K * exp(-lambda * score) * strands * D * length(n, score) * N / n ≤ 1,
where ``length(n, score)`` is slightly less than ``n``.
Bit score
~~~~~~~~~
......
......@@ -419,8 +419,8 @@ X -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -4
</div>
<div class="section" id="bl80-or-blosum80">
<h2>BL80 or BLOSUM80</h2>
<p>This protein scoring scheme is good at finding somewhat
short-and-strong similarities. (S Henikoff &amp; JG Henikoff, PNAS 1992
<p>This protein scoring scheme is good for finding somewhat strong, and
short, similarities. (S Henikoff &amp; JG Henikoff, PNAS 1992
89(22):10915-9).
It uses this matrix:</p>
<pre class="literal-block">
......@@ -504,9 +504,44 @@ X -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -6 -6
<p>It sets these default lastal parameter values:
-a13 -b2</p>
</div>
<div class="section" id="pam10">
<h2>PAM10</h2>
<p>This protein scoring scheme is good for finding very strong, and
short, similarities (MO Dayhoff et al. 1978).
It uses this matrix:</p>
<pre class="literal-block">
A R N D C Q E G H I L K M F P S T W Y V B Z X *
A 7 -10 -7 -6 -10 -7 -5 -4 -11 -8 -9 -10 -8 -12 -4 -3 -3 -20 -11 -5 -6 -6 -6 -23
R -10 9 -9 -17 -11 -4 -15 -13 -4 -8 -12 -2 -7 -12 -7 -6 -10 -5 -14 -11 -11 -7 -9 -23
N -7 -9 9 -1 -17 -7 -5 -6 -2 -8 -10 -4 -15 -12 -9 -2 -5 -11 -7 -12 7 -6 -6 -23
D -6 -17 -1 8 -21 -6 0 -6 -7 -11 -19 -8 -17 -21 -12 -7 -8 -21 -17 -11 7 -1 -9 -23
C -10 -11 -17 -21 10 -20 -20 -13 -10 -9 -21 -20 -20 -19 -11 -6 -11 -22 -7 -9 -18 -20 -13 -23
Q -7 -4 -7 -6 -20 9 -1 -10 -2 -11 -8 -6 -7 -19 -6 -8 -9 -19 -18 -10 -6 7 -8 -23
E -5 -15 -5 0 -20 -1 8 -7 -9 -8 -13 -7 -10 -20 -9 -7 -9 -23 -11 -10 -1 7 -8 -23
G -4 -13 -6 -6 -13 -10 -7 7 -13 -17 -14 -10 -12 -12 -10 -4 -10 -21 -20 -9 -6 -8 -8 -23
H -11 -4 -2 -7 -10 -2 -9 -13 10 -13 -9 -10 -17 -9 -7 -9 -11 -10 -6 -9 -4 -4 -8 -23
I -8 -8 -8 -11 -9 -11 -8 -17 -13 9 -4 -9 -3 -5 -12 -10 -5 -20 -9 -1 -9 -9 -8 -23
L -9 -12 -10 -19 -21 -8 -13 -14 -9 -4 7 -11 -2 -5 -10 -12 -10 -9 -10 -5 -12 -10 -9 -23
K -10 -2 -4 -8 -20 -6 -7 -10 -10 -9 -11 7 -4 -20 -10 -7 -6 -18 -12 -13 -5 -6 -8 -23
M -8 -7 -15 -17 -20 -7 -10 -12 -17 -3 -2 -4 12 -7 -11 -8 -7 -19 -17 -4 -16 -8 -9 -23
F -12 -12 -12 -21 -19 -19 -20 -12 -9 -5 -5 -20 -7 9 -13 -9 -12 -7 -1 -12 -14 -20 -12 -23
P -4 -7 -9 -12 -11 -6 -9 -10 -7 -12 -10 -10 -11 -13 8 -4 -7 -20 -20 -9 -10 -7 -8 -23
S -3 -6 -2 -7 -6 -8 -7 -4 -9 -10 -12 -7 -8 -9 -4 7 -2 -8 -10 -10 -4 -8 -6 -23
T -3 -10 -5 -8 -11 -9 -9 -10 -11 -5 -10 -6 -7 -12 -7 -2 8 -19 -9 -6 -6 -9 -7 -23
W -20 -5 -11 -21 -22 -19 -23 -21 -10 -20 -9 -18 -19 -7 -20 -8 -19 13 -8 -22 -13 -21 -16 -23
Y -11 -14 -7 -17 -7 -18 -11 -20 -6 -9 -10 -12 -17 -1 -20 -10 -9 -8 10 -10 -9 -13 -11 -23
V -5 -11 -12 -11 -9 -10 -10 -9 -9 -1 -5 -13 -4 -12 -9 -10 -6 -22 -10 8 -11 -10 -8 -23
B -6 -11 7 7 -18 -6 -1 -6 -4 -9 -12 -5 -16 -14 -10 -4 -6 -13 -9 -11 7 -3 -8 -23
Z -6 -7 -6 -1 -20 7 7 -8 -4 -9 -10 -6 -8 -20 -7 -8 -9 -21 -13 -10 -3 7 -8 -23
X -6 -9 -6 -9 -13 -8 -8 -8 -8 -8 -9 -8 -9 -12 -8 -6 -7 -16 -11 -8 -8 -8 -8 -23
* -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 1
</pre>
<p>It sets these default lastal parameter values:
-a20 -b3</p>
</div>
<div class="section" id="pam30">
<h2>PAM30</h2>
<p>This protein scoring scheme is good for finding short-and-strong
<p>This protein scoring scheme is good for finding strong, and short,
similarities (MO Dayhoff et al. 1978).
It uses this matrix:</p>
<pre class="literal-block">
......
......@@ -107,8 +107,8 @@ It uses this matrix::
BL80 or BLOSUM80
----------------
This protein scoring scheme is good at finding somewhat
short-and-strong similarities. (S Henikoff & JG Henikoff, PNAS 1992
This protein scoring scheme is good for finding somewhat strong, and
short, similarities. (S Henikoff & JG Henikoff, PNAS 1992
89(22):10915-9).
It uses this matrix::
......@@ -193,10 +193,46 @@ It uses this matrix::
It sets these default lastal parameter values:
-a13 -b2
PAM10
-----
This protein scoring scheme is good for finding very strong, and
short, similarities (MO Dayhoff et al. 1978).
It uses this matrix::
A R N D C Q E G H I L K M F P S T W Y V B Z X *
A 7 -10 -7 -6 -10 -7 -5 -4 -11 -8 -9 -10 -8 -12 -4 -3 -3 -20 -11 -5 -6 -6 -6 -23
R -10 9 -9 -17 -11 -4 -15 -13 -4 -8 -12 -2 -7 -12 -7 -6 -10 -5 -14 -11 -11 -7 -9 -23
N -7 -9 9 -1 -17 -7 -5 -6 -2 -8 -10 -4 -15 -12 -9 -2 -5 -11 -7 -12 7 -6 -6 -23
D -6 -17 -1 8 -21 -6 0 -6 -7 -11 -19 -8 -17 -21 -12 -7 -8 -21 -17 -11 7 -1 -9 -23
C -10 -11 -17 -21 10 -20 -20 -13 -10 -9 -21 -20 -20 -19 -11 -6 -11 -22 -7 -9 -18 -20 -13 -23
Q -7 -4 -7 -6 -20 9 -1 -10 -2 -11 -8 -6 -7 -19 -6 -8 -9 -19 -18 -10 -6 7 -8 -23
E -5 -15 -5 0 -20 -1 8 -7 -9 -8 -13 -7 -10 -20 -9 -7 -9 -23 -11 -10 -1 7 -8 -23
G -4 -13 -6 -6 -13 -10 -7 7 -13 -17 -14 -10 -12 -12 -10 -4 -10 -21 -20 -9 -6 -8 -8 -23
H -11 -4 -2 -7 -10 -2 -9 -13 10 -13 -9 -10 -17 -9 -7 -9 -11 -10 -6 -9 -4 -4 -8 -23
I -8 -8 -8 -11 -9 -11 -8 -17 -13 9 -4 -9 -3 -5 -12 -10 -5 -20 -9 -1 -9 -9 -8 -23
L -9 -12 -10 -19 -21 -8 -13 -14 -9 -4 7 -11 -2 -5 -10 -12 -10 -9 -10 -5 -12 -10 -9 -23
K -10 -2 -4 -8 -20 -6 -7 -10 -10 -9 -11 7 -4 -20 -10 -7 -6 -18 -12 -13 -5 -6 -8 -23
M -8 -7 -15 -17 -20 -7 -10 -12 -17 -3 -2 -4 12 -7 -11 -8 -7 -19 -17 -4 -16 -8 -9 -23
F -12 -12 -12 -21 -19 -19 -20 -12 -9 -5 -5 -20 -7 9 -13 -9 -12 -7 -1 -12 -14 -20 -12 -23
P -4 -7 -9 -12 -11 -6 -9 -10 -7 -12 -10 -10 -11 -13 8 -4 -7 -20 -20 -9 -10 -7 -8 -23
S -3 -6 -2 -7 -6 -8 -7 -4 -9 -10 -12 -7 -8 -9 -4 7 -2 -8 -10 -10 -4 -8 -6 -23
T -3 -10 -5 -8 -11 -9 -9 -10 -11 -5 -10 -6 -7 -12 -7 -2 8 -19 -9 -6 -6 -9 -7 -23
W -20 -5 -11 -21 -22 -19 -23 -21 -10 -20 -9 -18 -19 -7 -20 -8 -19 13 -8 -22 -13 -21 -16 -23
Y -11 -14 -7 -17 -7 -18 -11 -20 -6 -9 -10 -12 -17 -1 -20 -10 -9 -8 10 -10 -9 -13 -11 -23
V -5 -11 -12 -11 -9 -10 -10 -9 -9 -1 -5 -13 -4 -12 -9 -10 -6 -22 -10 8 -11 -10 -8 -23
B -6 -11 7 7 -18 -6 -1 -6 -4 -9 -12 -5 -16 -14 -10 -4 -6 -13 -9 -11 7 -3 -8 -23
Z -6 -7 -6 -1 -20 7 7 -8 -4 -9 -10 -6 -8 -20 -7 -8 -9 -21 -13 -10 -3 7 -8 -23
X -6 -9 -6 -9 -13 -8 -8 -8 -8 -8 -9 -8 -9 -12 -8 -6 -7 -16 -11 -8 -8 -8 -8 -23
* -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 -23 1
It sets these default lastal parameter values:
-a20 -b3
PAM30
-----
This protein scoring scheme is good for finding short-and-strong
This protein scoring scheme is good for finding strong, and short,
similarities (MO Dayhoff et al. 1978).
It uses this matrix::
......
......@@ -445,7 +445,9 @@ alignments: they are described in more detail at <a class="reference external" h
<td>Maximum expected alignments per square giga. (See <a class="reference external" href="last-evalues.html">here</a>.)</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-s <var>NUMBER</var></span></kbd></td>
<td>Which query strand to use: 0=reverse, 1=forward, 2=both.</td></tr>
<td>Which query strand to use: 0=reverse, 1=forward, 2=both.
If specified, this parameter is written in last-train's
output, so it will override lastal's default.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-S <var>NUMBER</var></span></kbd></td>
<td><p class="first">Specify how to use the substitution score matrix for
......@@ -513,6 +515,36 @@ output, so it will override lastal's default.</p>
</blockquote>
</div>
</div>
<div class="section" id="details">
<h2>Details</h2>
<ul>
<li><p class="first">last-train (and lastal) uses &quot;Model A&quot;, in Figure 5A of <a class="reference external" href="https://doi.org/10.1093/bioinformatics/btz576">btz576</a>.</p>
</li>
<li><p class="first">last-train (and lastal) converts between path and alignment
parameters as in Supplementary Section 3.1 of <a class="reference external" href="https://doi.org/10.1093/bioinformatics/btz576">btz576</a>.</p>
</li>
<li><p class="first">last-train converts probabilities to scores using an arbitrary scale
factor. (If all the scores are multiplied by any positive constant,
it makes no difference to alignment.) This scale is determined by
the initial score parameters.</p>
</li>
<li><p class="first">last-train uses parameters with &quot;homogeneous letter probabilities&quot;
and &quot;balanced length probability&quot; (<a class="reference external" href="https://doi.org/10.1093/bioinformatics/btz576">btz576</a>).</p>
</li>
<li><p class="first">last-train rounds the scores to integers, which makes them slightly
inaccurate. It then finds an adjusted scale factor (without
changing the scores), which makes the integer-rounded scores
correspond to homogeneous letter probabilities and balanced length
probability. It writes this adjusted scale as a &quot;-t&quot; option for
lastal, e.g. &quot;-t4.4363&quot;.</p>
</li>
<li><p class="first">In rare cases, it may be impossible to find such an adjusted scale
factor. If that happens, last-train doubles the original scale (to
reduce the inaccuracy of integer rounding), until the problem goes
away.</p>
</li>
</ul>
</div>
<div class="section" id="bugs">
<h2>Bugs</h2>
<ul>
......
......@@ -82,6 +82,8 @@ Alignment options
-E EG2 Maximum expected alignments per square giga. (See `here
<last-evalues.html>`_.)
-s NUMBER Which query strand to use: 0=reverse, 1=forward, 2=both.
If specified, this parameter is written in last-train's
output, so it will override lastal's default.
-S NUMBER Specify how to use the substitution score matrix for
reverse strands. If you use ``--revsym``, this makes no
difference. "0" means that the matrix is used as-is for
......@@ -133,6 +135,36 @@ Alignment options
If specified, this parameter is written in last-train's
output, so it will override lastal's default.
Details
-------
* last-train (and lastal) uses "Model A", in Figure 5A of btz576_.
* last-train (and lastal) converts between path and alignment
parameters as in Supplementary Section 3.1 of btz576_.
* last-train converts probabilities to scores using an arbitrary scale
factor. (If all the scores are multiplied by any positive constant,
it makes no difference to alignment.) This scale is determined by
the initial score parameters.
* last-train uses parameters with "homogeneous letter probabilities"
and "balanced length probability" (btz576_).
* last-train rounds the scores to integers, which makes them slightly
inaccurate. It then finds an adjusted scale factor (without
changing the scores), which makes the integer-rounded scores
correspond to homogeneous letter probabilities and balanced length
probability. It writes this adjusted scale as a "-t" option for
lastal, e.g. "-t4.4363".
* In rare cases, it may be impossible to find such an adjusted scale
factor. If that happens, last-train doubles the original scale (to
reduce the inaccuracy of integer rounding), until the problem goes
away.
.. _btz576: https://doi.org/10.1093/bioinformatics/btz576
Bugs
----
......
......@@ -769,12 +769,14 @@ this core</p>
<p class="last">Use -w0 to turn this off.</p>
</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-G <var>FILE</var></span></kbd></td>
<td>Use an alternative genetic code in the specified file. For an
example of the format, see vertebrateMito.gc in the examples
directory. By default, the standard genetic code is used. This
<kbd><span class="option">-G <var>GENETIC-CODE</var></span></kbd></td>
<td>Specify the genetic code for translating DNA to protein. This
option has no effect unless DNA-versus-protein alignment is
selected with option -F.</td></tr>
selected with option -F. Codes are specified by numbers
(e.g. 1 = standard, 2 = vertebrate mitochondrial), listed here:
<a class="reference external" href="https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi">https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi</a>. Any
other GENETIC-CODE is assumed to be a file name: for an example
of the format, see vertebrateMito.gc in the examples directory.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-t <var>TEMPERATURE</var></span></kbd></td>
<td>Parameter for converting between scores and probability ratios.
......@@ -824,9 +826,9 @@ with &quot;c&quot; for each alignment. The first 16 numbers on this line
are the expected counts of matches and mismatches: first the
count of reference As aligned to query As, then the count of
reference As aligned to query Cs, and so on. For proteins there
will be 400 such numbers. The final ten numbers are expected
will be 400 such numbers. The next 5 numbers are expected
counts related to gaps. They are:</p>
<ul>
<ul class="last">
<li><p class="first">The count of matches plus mismatches. (This may exceed the
total of the preceding numbers, if the sequences have non-ACGT
letters.)</p>
......@@ -839,20 +841,6 @@ letters.)</p>
</li>
<li><p class="first">The count of insert opens (= count of insert closes).</p>
</li>
<li><p class="first">The count of adjacent pairs of insertions and deletions.</p>
</li>
</ul>
<p>The final four numbers are always zero, unless you use
generalized affine gap costs. They are:</p>
<ul class="last">
<li><p class="first">The count of unaligned letter pairs.</p>
</li>
<li><p class="first">The count of unaligned letter pair opens (= count of closes).</p>
</li>
<li><p class="first">The count of adjacent pairs of deletions and unaligned letter pairs.</p>
</li>
<li><p class="first">The count of adjacent pairs of insertions and unaligned letter pairs.</p>
</li>
</ul>
</td></tr>
<tr><td class="option-group">
......
......@@ -400,12 +400,14 @@ Miscellaneous options
Use -w0 to turn this off.
-G FILE
Use an alternative genetic code in the specified file. For an
example of the format, see vertebrateMito.gc in the examples
directory. By default, the standard genetic code is used. This
-G GENETIC-CODE
Specify the genetic code for translating DNA to protein. This
option has no effect unless DNA-versus-protein alignment is
selected with option -F.
selected with option -F. Codes are specified by numbers
(e.g. 1 = standard, 2 = vertebrate mitochondrial), listed here:
https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi. Any
other GENETIC-CODE is assumed to be a file name: for an example
of the format, see vertebrateMito.gc in the examples directory.
-t TEMPERATURE
Parameter for converting between scores and probability ratios.
......@@ -461,7 +463,7 @@ Miscellaneous options
are the expected counts of matches and mismatches: first the
count of reference As aligned to query As, then the count of
reference As aligned to query Cs, and so on. For proteins there
will be 400 such numbers. The final ten numbers are expected
will be 400 such numbers. The next 5 numbers are expected
counts related to gaps. They are:
* The count of matches plus mismatches. (This may exceed the
......@@ -471,15 +473,6 @@ Miscellaneous options
* The count of inserted letters.
* The count of delete opens (= count of delete closes).
* The count of insert opens (= count of insert closes).
* The count of adjacent pairs of insertions and deletions.
The final four numbers are always zero, unless you use
generalized affine gap costs. They are:
* The count of unaligned letter pairs.
* The count of unaligned letter pair opens (= count of closes).
* The count of adjacent pairs of deletions and unaligned letter pairs.
* The count of adjacent pairs of insertions and unaligned letter pairs.
-Q NUMBER
Specify how to read the query sequences::
......
This diff is collapsed.
CXXFLAGS = -O3 -std=c++11 -pthread -DHAS_CXX_THREADS
CXXFLAGS = -msse4.1 -O3 -std=c++11 -pthread -DHAS_CXX_THREADS
all:
@cd src && $(MAKE) CXXFLAGS="$(CXXFLAGS)"
......
#! /usr/bin/env python
# Copyright 2015 Martin C. Frith
# References:
# [Fri19] How sequence alignment scores correspond to probability models,
# MC Frith, Bioinformatics, 2019.
from __future__ import print_function
import gzip
import math, optparse, os, random, signal, subprocess, sys, tempfile
import math
import optparse
import os
import random
import signal
import subprocess
import sys
import tempfile
def myOpen(fileName): # faster than fileinput
if fileName == "-":
......@@ -13,6 +24,38 @@ def myOpen(fileName): # faster than fileinput
return gzip.open(fileName, "rt") # xxx dubious for Python2
return open(fileName)
def rootOfIncreasingFunction(func, lowerBound, upperBound, *args):
# Find x such that func(x, *args) == 0
while True:
mid = 0.5 * (lowerBound + upperBound)
if mid <= lowerBound or mid >= upperBound:
return mid
if func(mid, *args) < 0:
lowerBound = mid
else:
upperBound = mid
def homogeneousLetterFreqs(scale, matScores):
# Solve the simultaneous equations in Section 2.1 of [Fri19]
expMat = [[math.exp(j / scale) for j in i] for i in matScores]
m = [row[:] + [1.0] for row in expMat] # augmented matrix
n = len(expMat)
for k in range(n):
iMax = k
for i in range(k, n):
if abs(m[i][k]) > abs(m[iMax][k]):
iMax = i
if iMax > k:
m[k], m[iMax] = m[iMax], m[k]
if abs(m[k][k]) <= 0:
raise ArithmeticError("singular matrix")
for i in range(n):
if i != k:
mul = m[i][k] / m[k][k]
for j in range(k + 1, n + 1):
m[i][j] -= m[k][j] * mul
return [m[k][n] / m[k][k] for k in range(n)]
def randomSample(things, sampleSize):
"""Randomly get sampleSize things (or all if fewer)."""
reservoir = [] # "reservoir sampling" algorithm
......@@ -133,6 +176,7 @@ def scaleFromHeader(lines):
raise Exception("couldn't read the scale")
def countsFromLastOutput(lines, opts):
numTransitionCounts = 5
matrix = []
# use +1 pseudocounts as a kludge to mitigate numerical problems:
matches = 1.0
......@@ -146,11 +190,12 @@ def countsFromLastOutput(lines, opts):
strand = line.split()[4] # slow?
if line[0] == "c":
c = [float(i) for i in line.split()[1:]]
transCounts = c[-numTransitionCounts:]
if not matrix:
matrixSize = int(math.sqrt(len(c) - 10))
matrixSize = int(math.sqrt(len(c) - numTransitionCounts))
matrix = [[1.0] * matrixSize for i in range(matrixSize)]
identities = sum(c[i * matrixSize + i] for i in range(matrixSize))
alignmentLength = c[-10] + c[-9] + c[-8]
alignmentLength = transCounts[0] + transCounts[1] + transCounts[2]
if 100 * identities > opts.pid * alignmentLength: continue
for i in range(matrixSize):
for j in range(matrixSize):
......@@ -158,11 +203,11 @@ def countsFromLastOutput(lines, opts):
matrix[i][j] += c[i * matrixSize + j]
else:
matrix[-1-i][-1-j] += c[i * matrixSize + j]
matches += c[-10]
deletes += c[-9]
inserts += c[-8]
delOpens += c[-7]
insOpens += c[-6]
matches += transCounts[0]
deletes += transCounts[1]
inserts += transCounts[2]
delOpens += transCounts[3]
insOpens += transCounts[4]
alignments += 1
gapCounts = matches, deletes, inserts, delOpens, insOpens, alignments
return matrix, gapCounts
......@@ -180,10 +225,6 @@ def guessAlphabet(matrixSize):
if matrixSize == 20: return "ACDEFGHIKLMNPQRSTVWY"
raise Exception("can't handle unusual alphabets")
def matrixWithLetters(matrix):
alphabet = guessAlphabet(len(matrix))
return [alphabet] + [[a] + i for a, i in zip(alphabet, matrix)]
def writeMatrixHead(outFile, prefix, alphabet, formatString):
writeWords(outFile, [prefix + " "] + [formatString % k for k in alphabet])
......@@ -206,14 +247,6 @@ def writeScoreMatrix(outFile, matrix, prefix):
writeMatrixHead(outFile, prefix, alphabet, "%6s")
writeMatrixBody(outFile, prefix, alphabet, matrix, "%6s")
def writeMatrixWithLetters(outFile, matrix, prefix):
head = matrix[0]
tail = matrix[1:]
left = [i[0] for i in tail]
body = [i[1:] for i in tail]
writeMatrixHead(outFile, prefix, head, "%6s")
writeMatrixBody(outFile, prefix, left, body, "%6s")
def matProbsFromCounts(counts, opts):
r = range(len(counts))
if opts.revsym: # add complement (reverse strand) substitutions
......@@ -237,12 +270,27 @@ def matProbsFromCounts(counts, opts):
return probs
def gapProbsFromCounts(counts, opts):
def probImbalance(endProb, matchProb, firstDelProb, delExtendProb,
firstInsProb, insExtendProb):
# (RHS - LHS) of Equation (12) in [Fri19]
d = firstDelProb / (endProb - delExtendProb)
i = firstInsProb / (endProb - insExtendProb)
return 1 - matchProb / (endProb * endProb) - d - i
def balancedEndProb(*args):
matchProb, firstDelProb, delExtendProb, firstInsProb, insExtendProb = args
lowerBound = max(delExtendProb, insExtendProb)
upperBound = 1.0
r = rootOfIncreasingFunction(probImbalance, lowerBound, upperBound, *args)
return r
def gapRatiosFromCounts(counts, opts):
matches, deletes, inserts, delOpens, insOpens, alignments = counts
if not alignments: raise Exception("no alignments")
gaps = deletes + inserts
gapOpens = delOpens + insOpens
denominator = matches + gapOpens + (alignments + 1) # +1 pseudocount
matchProb = matches / denominator
if opts.gapsym:
delOpenProb = gapOpens / denominator / 2
insOpenProb = gapOpens / denominator / 2
......@@ -262,6 +310,7 @@ def gapProbsFromCounts(counts, opts):
print("# alignments:", alignments)
print("# mean delete size: %g" % (deletes / delOpens))
print("# mean insert size: %g" % (inserts / insOpens))
print("# matchProb: %g" % matchProb)
print("# delOpenProb: %g" % delOpenProb)
print("# insOpenProb: %g" % insOpenProb)
print("# delExtendProb: %g" % delExtendProb)
......@@ -269,54 +318,109 @@ def gapProbsFromCounts(counts, opts):
print()
delCloseProb = 1 - delExtendProb
insCloseProb = 1 - insExtendProb
firstDelProb = delOpenProb * delCloseProb
insCloseProb = 1 - insExtendProb
firstInsProb = insOpenProb * insCloseProb
# If we define "an alignment" to mean "a set of indistinguishable
# paths", then:
#delExtendProb += firstDelProb
#insExtendProb += firstInsProb
# Else, this ensures gap existence cost >= 0:
delExtendProb = max(delExtendProb, firstDelProb)
insExtendProb = max(insExtendProb, firstInsProb)
delExistProb = firstDelProb / delExtendProb
insExistProb = firstInsProb / insExtendProb
return delExistProb, insExistProb, delExtendProb, insExtendProb
def scoreFromLetterProbs(scale, pairProb, prob1, prob2):
probRatio = pairProb / (prob1 * prob2)
return scoreFromProb(scale, probRatio)
def matScoresFromProbs(scale, probs):
rowProbs = [sum(i) for i in probs]
colProbs = [sum(i) for i in zip(*probs)]
return [[scoreFromLetterProbs(scale, j, x, y) for j, y in zip(i, colProbs)]
for i, x in zip(probs, rowProbs)]
def gapCostsFromProbs(scale, probs):
delExistProb, insExistProb, delExtendProb, insExtendProb = probs
delExistCost = costFromProb(scale, delExistProb)
insExistCost = costFromProb(scale, insExistProb)
delExtendCost = costFromProb(scale, delExtendProb)
insExtendCost = costFromProb(scale, insExtendProb)
if delExtendCost == 0: delExtendCost = 1
if insExtendCost == 0: insExtendCost = 1
return delExistCost, insExistCost, delExtendCost, insExtendCost
endProb = balancedEndProb(matchProb, firstDelProb, delExtendProb,
firstInsProb, insExtendProb)
# probably, endProb is negligibly less than 1
matchRatio = matchProb / (endProb * endProb)
firstDelRatio = firstDelProb / endProb
delExtendRatio = delExtendProb / endProb
delRatios = firstDelRatio, delExtendRatio
firstInsRatio = firstInsProb / endProb
insExtendRatio = insExtendProb / endProb
insRatios = firstInsRatio, insExtendRatio
return matchRatio, delRatios, insRatios
def scoreFromLetterProbs(scale, matchRatio, pairProb, rowProb, colProb):
# Equation (4) in [Fri19]
probRatio = pairProb / (rowProb * colProb)
return scoreFromProb(scale, matchRatio * probRatio)
def matScoresFromProbs(scale, matchRatio, matProbs, rowProbs, colProbs):
r = range(len(matProbs))
return [[scoreFromLetterProbs(scale, matchRatio, matProbs[i][j],
rowProbs[i], colProbs[j]) for j in r]
for i in r]
def gapCostsFromProbRatios(scale, firstGapRatio, gapExtendRatio):
# The next addition gets the alignment parameter from the path
# parameters, as in Supplementary section 3.1 of [Fri19]:
gapExtendRatio += firstGapRatio
firstGapCost = max(costFromProb(scale, firstGapRatio), 1)
gapExtendCost = max(costFromProb(scale, gapExtendRatio), 1)
return firstGapCost, gapExtendCost
def imbalanceFromGap(scale, firstGapCost, gapExtendCost):
firstGapRatio = math.exp(-firstGapCost / scale)
gapExtendRatio = math.exp(-gapExtendCost / scale)
# The next subtraction gets the path parameter from the alignment
# parameters, as in Supplementary section 3.1 of [Fri19]:
gapExtendRatio -= firstGapRatio
return firstGapRatio / (1 - gapExtendRatio)
def scoreImbalance(scale, matScores, delCosts, insCosts):
# C' - 1, where C' is defined in Equation (13) of [Fri19]
d = imbalanceFromGap(scale, *delCosts)
i = imbalanceFromGap(scale, *insCosts)
return 1 / sum(homogeneousLetterFreqs(scale, matScores)) + d + i - 1
def negScoreImbalance(*args):
return -scoreImbalance(*args)
def balancedScale(nearScale, *args):
# Find a scale, near nearScale, with balanced length probability
bump = 1.000001
value = scoreImbalance(nearScale, *args)
if abs(value) <= 0:
return nearScale
oldLower = oldUpper = nearScale
while True: # xxx
newLower = oldLower / bump
lowerValue = scoreImbalance(newLower, *args)
if (lowerValue < 0) != (value < 0):
func = scoreImbalance if value > 0 else negScoreImbalance
return rootOfIncreasingFunction(func, newLower, oldLower, *args)
oldLower = newLower
newUpper = oldUpper * bump
upperValue = scoreImbalance(newUpper, *args)
if (upperValue < 0) != (value < 0):
func = scoreImbalance if value < 0 else negScoreImbalance
return rootOfIncreasingFunction(func, oldUpper, newUpper, *args)
oldUpper = newUpper
def scoresAndScale(originalScale, matParams, delRatios, insRatios):
while True:
matScores = matScoresFromProbs(originalScale, *matParams)
delCosts = gapCostsFromProbRatios(originalScale, *delRatios)
insCosts = gapCostsFromProbRatios(originalScale, *insRatios)
scale = balancedScale(originalScale, matScores, delCosts, insCosts)
rowFreqs = homogeneousLetterFreqs(scale, zip(*matScores))
colFreqs = homogeneousLetterFreqs(scale, matScores)
if all(i >= 0 for i in rowFreqs + colFreqs):
return matScores, delCosts, insCosts, scale
print("# the integer-rounded scores are too inaccurate: "
"doubling the scale")
originalScale *= 2
def writeGapCosts(gapCosts, out):
delExistCost, insExistCost, delExtendCost, insExtendCost = gapCosts
print("#last -a", delExistCost, file=out)
print("#last -A", insExistCost, file=out)
firstDelCost, delExtendCost, firstInsCost, insExtendCost = gapCosts
print("#last -a", firstDelCost - delExtendCost, file=out)
print("#last -A", firstInsCost - insExtendCost, file=out)
print("#last -b", delExtendCost, file=out)
print("#last -B", insExtendCost, file=out)
def printGapCosts(gapCosts):
delExistCost, insExistCost, delExtendCost, insExtendCost = gapCosts
print("# delExistCost:", delExistCost)
print("# insExistCost:", insExistCost)
firstDelCost, delExtendCost, firstInsCost, insExtendCost = gapCosts
print("# delExistCost:", firstDelCost - delExtendCost)
print("# insExistCost:", firstInsCost - insExtendCost)
print("# delExtendCost:", delExtendCost)
print("# insExtendCost:", insExtendCost)
print()
......@@ -388,26 +492,29 @@ def doTraining(opts, args):
print()
sys.stdout.flush()
matCounts, gapCounts = countsFromLastOutput(q.stdout, opts)
gapProbs = gapProbsFromCounts(gapCounts, opts)
gapCosts = gapCostsFromProbs(internalScale, gapProbs)
printGapCosts(gapCosts)
matchRatio, delRatios, insRatios = gapRatiosFromCounts(gapCounts, opts)
matProbs = matProbsFromCounts(matCounts, opts)
matScores = matScoresFromProbs(internalScale, matProbs)
rowProbs = [sum(i) for i in matProbs]
colProbs = [sum(i) for i in zip(*matProbs)]
matParams = matchRatio, matProbs, rowProbs, colProbs
sas = scoresAndScale(internalScale, matParams, delRatios, insRatios)
matScores, delCosts, insCosts, scale = sas
printGapCosts(delCosts + insCosts)
print("# score matrix "
"(query letters = columns, reference letters = rows):")
writeScoreMatrix(sys.stdout, matScores, "# ")
print()
parameters = gapCosts, matScores
parameters = delCosts, insCosts, matScores
if parameters in oldParameters: break
oldParameters.append(parameters)
internalMatrix = matrixWithLetters(matScores)
x = fixedLastalArgs(opts, lastalProgName)
x.append("-t{0:.6}".format(scale))
x.append("-p-")
x += args
p = subprocess.Popen(x, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
universal_newlines=True)
writeGapCosts(gapCosts, p.stdin)
writeMatrixWithLetters(p.stdin, internalMatrix, "")
writeGapCosts(delCosts + insCosts, p.stdin)
writeScoreMatrix(p.stdin, matScores, "")
p.stdin.close()
# in python2.6, the next line must come after p.stdin.close()
q = subprocess.Popen(y, stdin=p.stdout, stdout=subprocess.PIPE,
......@@ -416,17 +523,17 @@ def doTraining(opts, args):
q = subprocess.Popen(z, stdin=q.stdout, stdout=subprocess.PIPE,
universal_newlines=True)
sas = scoresAndScale(externalScale, matParams, delRatios, insRatios)
matScores, delCosts, insCosts, scale = sas
if opts.X: print("#last -X", opts.X)
if opts.Q: print("#last -Q", opts.Q)
gapCosts = gapCostsFromProbs(externalScale, gapProbs)
writeGapCosts(gapCosts, sys.stdout)
print("#last -t{0:.6}".format(scale))
writeGapCosts(delCosts + insCosts, sys.stdout)
if opts.s: print("#last -s", opts.s)
if opts.S: print("#last -S", opts.S)
matScores = matScoresFromProbs(externalScale, matProbs)
externalMatrix = matrixWithLetters(matScores)
print("# score matrix "
"(query letters = columns, reference letters = rows):")
writeMatrixWithLetters(sys.stdout, externalMatrix, "")
writeScoreMatrix(sys.stdout, matScores, "")
def lastTrain(opts, args):
if opts.sample_number:
......
......@@ -35,19 +35,10 @@ static void addExpectedCounts( double* expectedCounts,
double* transitionCounts = &expectedCounts[ numEmissionCounts ];
transitionCounts[0] += ec.toMatch;
transitionCounts[1] += ec.DD + ec.MD + ec.PD; // deleted letter count
transitionCounts[2] += ec.II + ec.MI + ec.DI + ec.PI; // ins. letter count
transitionCounts[3] += ec.MD + ec.PD; // deletion open/close count
transitionCounts[4] += ec.MI + ec.DI + ec.PI; // insertion open/close count
transitionCounts[5] += ec.DI; // adjacent insertion & deletion count
transitionCounts[7] += ec.PP + ec.MP; // unaligned letter pair count
transitionCounts[6] += ec.MP; // pair-gap open/close count
transitionCounts[8] += ec.PD;
transitionCounts[9] += ec.PI;
// MD = DM + DI - PD + DQ
// MI = IM - DI - PI + IQ
// PM = MP - PD - PI - PQ
// DM + IM + PM = MD + MI + MP - DQ - IQ - PQ
transitionCounts[1] += ec.delNext + ec.delInit; // deleted letter count
transitionCounts[2] += ec.insNext + ec.insInit; // inserted letter count
transitionCounts[3] += ec.delInit; // deletion open/close count
transitionCounts[4] += ec.insInit; // insertion open/close count
}
static void countSeedMatches( double* expectedCounts,
......@@ -83,7 +74,7 @@ void Alignment::makeXdrop( Centroid& centroid,
if( outputType == 7 ){
assert( seed.size > 0 ); // makes things easier to understand
const int numEmissionCounts = alph.size * alph.size;
const int numTransitionCounts = 10;
const int numTransitionCounts = 5;
std::vector<double>& expectedCounts = extras.expectedCounts;
expectedCounts.resize( numEmissionCounts + numTransitionCounts );
countSeedMatches( &expectedCounts[0],
......@@ -328,6 +319,11 @@ void Alignment::extend( std::vector< SegmentPair >& chunks,
return;
}
if (!isForward) {
--start1;
--start2;
}
int extensionScore =
isGreedy ? greedyAligner.align( seq1 + start1, seq2 + start2,
isForward, sm, maxDrop, alph.size )
......@@ -371,10 +367,6 @@ void Alignment::extend( std::vector< SegmentPair >& chunks,
if( outputType > 3 ){ // calculate match probabilities
assert( !isGreedy );
assert( !sm2qual );
if (!isForward) {
--start1;
--start2;
}
centroid.doForwardBackwardAlgorithm(seq1, seq2, start1, start2, isForward,
gap, globality);
......
This diff is collapsed.
......@@ -7,7 +7,6 @@
#include "mcf_gap_costs.hh"
#include "SegmentPair.hh"
#include "OneQualityScoreMatrix.hh"
#include <stddef.h> // size_t
#include <vector>
#include <iostream> // for debug
......@@ -17,10 +16,10 @@ namespace cbrc{
public:
double emit[scoreMatrixRowSize][scoreMatrixRowSize];
double toMatch;
double MD, MP, MI;
double DD, DI;
double PP, PD, PI;
double II;
double delInit;
double delNext;
double insInit;
double insNext;
public:
ExpectedCount ();
};
......@@ -57,7 +56,7 @@ namespace cbrc{
void doForwardBackwardAlgorithm(const uchar* seq1, const uchar* seq2,
size_t start1, size_t start2,
bool isExtendFwd, const mcf::GapCosts& gap,
bool isExtendFwd, const GapCosts& gap,
int globality) {
seq1 += start1;
seq2 += start2;
......@@ -98,7 +97,7 @@ namespace cbrc{
// Added by MH (2008/10/10) : compute expected counts for transitions and emissions
void computeExpectedCounts(const uchar* seq1, const uchar* seq2,
size_t start1, size_t start2, bool isExtendFwd,
const mcf::GapCosts& gap, unsigned alphabetSize,
const GapCosts& gap, unsigned alphabetSize,
ExpectedCount& count) const;
private:
......@@ -119,12 +118,10 @@ namespace cbrc{
dvec_t fM; // f^M(i,j)
dvec_t fD; // f^D(i,j) Ix
dvec_t fI; // f^I(i,j) Iy
dvec_t fP; // f^P(i,j)
dvec_t bM; // b^M(i,j)
dvec_t bD; // b^D(i,j)
dvec_t bI; // b^I(i,j)
dvec_t bP; // b^P(i,j)
dvec_t mD;
dvec_t mI;
......@@ -141,11 +138,11 @@ namespace cbrc{
void forward(const uchar* seq1, const uchar* seq2,
const ExpMatrixRow* pssm, bool isExtendFwd,
const mcf::GapCosts& gap, int globality);
const GapCosts& gap, int globality);
void backward(const uchar* seq1, const uchar* seq2,
const ExpMatrixRow* pssm, bool isExtendFwd,
const mcf::GapCosts& gap, int globality);
const GapCosts& gap, int globality);
void initForwardMatrix();
void initBackwardMatrix();
......@@ -157,12 +154,6 @@ namespace cbrc{
bestPos1 = cur;
}
}
// start of the x-drop region (i.e. number of skipped seq1 letters
// before the x-drop region) for this antidiagonal
size_t seq1start( size_t antidiagonal ) const {
return xa.scoreEndIndex( antidiagonal ) - xa.scoreOrigin( antidiagonal );
}
};
}
......