Skip to content
Commits on Source (9)
......@@ -7,7 +7,7 @@
## Overview
[*FreeBayes*](http://arxiv.org/abs/1207.3907) is a
[*freebayes*](http://arxiv.org/abs/1207.3907) is a
[Bayesian](http://en.wikipedia.org/wiki/Bayesian_inference) genetic variant
detector designed to find small polymorphisms, specifically SNPs
(single-nucleotide polymorphisms), indels (insertions and deletions), MNPs
......@@ -15,7 +15,7 @@ detector designed to find small polymorphisms, specifically SNPs
substitution events) smaller than the length of a short-read sequencing
alignment.
*FreeBayes* is haplotype-based, in the sense that it calls variants based on
*freebayes* is haplotype-based, in the sense that it calls variants based on
the literal sequences of reads aligned to a particular target, not their
precise alignment. This model is a straightforward generalization of previous
ones (e.g. PolyBayes, samtools, GATK) which detect or report variants based on
......@@ -25,7 +25,7 @@ alignments:
<img src="https://github.com/ekg/freebayes/raw/v1.3.0/paper/haplotype_calling.png" width=500/>
*FreeBayes* uses short-read alignments
*freebayes* uses short-read alignments
([BAM](http://samtools.sourceforge.net/SAMv1.pdf) files with
[Phred+33](http://en.wikipedia.org/wiki/Phred_quality_score) encoded quality
scores, now standard) for any number of individuals from a population and a
......@@ -41,123 +41,70 @@ variation across the samples under analysis.
## Citing freebayes
A preprint [Haplotype-based variant detection from short-read
sequencing](http://arxiv.org/abs/1207.3907) provides an overview of the
statistical models
used in FreeBayes. We ask that you cite this paper if you use FreeBayes in
work that leads to publication.
A preprint [Haplotype-based variant detection from short-read sequencing](http://arxiv.org/abs/1207.3907) provides an overview of the
statistical models used in freebayes.
We ask that you cite this paper if you use freebayes in work that leads to publication.
This preprint is used for documentation and citation.
freebayes was never submitted for review, but has been used in over 1000 publications.
Please use this citation format:
Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing.
*arXiv preprint arXiv:1207.3907 [q-bio.GN]* 2012
Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. *arXiv preprint arXiv:1207.3907 [q-bio.GN]* 2012
If possible, please also refer to the version number provided by freebayes when
it is run without arguments or with the `--help` option. For example, you
should see something like this:
If possible, please also refer to the version number provided by freebayes when it is run without arguments or with the `--help` option.
For example, you should see something like this:
version: v0.9.10-3-g47a713e
version: v1.3.1-1-g5eb71a3-dirty
This provides both a point release number and a git commit id, which will
ensure precise reproducibility of results.
This provides both a point release number and a git commit id, which will ensure precise reproducibility of results.
## Download
## Obtaining
Precompiled static binaries are available for [freebayes relases](https://github.com/ekg/freebayes/releases).
Most users should simply download these and run them.
To download FreeBayes, please use git to download the most recent development
tree. Currently, the tree is hosted on github, and can be obtained via:
Other packages are available from various sources, including conda and Debian, but these are relatively old (at least as of 2019) and do not include important updates from the past few years.
git clone --recursive git://github.com/ekg/freebayes.git
Note the use of --recursive. This is required in order to download all
nested git submodules for external repositories.
### Resolving proxy issues with git
Depending on your local network configuration, you may have problems obtaining
freebayes via git. If you see something like this you may be behind a proxy
that blocks access to standard git:// port (9418).
$ git clone --recursive git://github.com/ekg/freebayes.git
Cloning into 'freebayes'...
fatal: Unable to look up github.com (port 9418) (Name or service not known)
Luckily, if you have access to https:// on port 443, then you can use this
'magic' command as a workaround to enable download of the submodules:
git config --global url.https://github.com/.insteadOf git://github.com/
## Compilation
FreeBayes requires g++ and the standard C and C++ development libraries.
make
Will build the executable freebayes, as well as the utilities bamfiltertech and
bamleftalign. These executables can be found in the `bin/` directory in the
repository.
Users may wish to install to e.g. /usr/local/bin (default), which is
accomplished via
sudo make install
## Support
Note that the freebayes-parallel script and the programs on which it depends are
not installed by this command.
Please report any issues or questions to the [freebayes mailing list](https://groups.google.com/forum/#!forum/freebayes), [freebayes issue tracker](https://github.com/ekg/freebayes/issues), or by email to <erik.garrison@gmail.com>.
Users can optionally build with [BamTools](https://github.com/pezmaster31/bamtools) instead of [SeqLib](https://github.com/walaj/SeqLib). Building with BamTools requires CMake.
## Usage
make wbamtools
In its simplest operation, freebayes requires only two inputs: a FASTA reference sequence, and a BAM-format alignment file sorted by reference position.
For instance:
## Usage
freebayes -f ref.fa aln.bam >var.vcf
In its simplest operation, freebayes requires only two inputs: a FASTA reference
sequence, and a BAM-format alignment file sorted by reference position. For
instance:
... will produce a VCF file describing all SNPs, INDELs, and haplotype variants between the reference and aln.bam.
freebayes --fasta-reference h.sapiens.fasta NA20504.bam
Multiple BAM files may be given for joint calling.
... produce (on standard output) a VCF file on standard out describing
all SNPs, INDELs, MNPs, and Complex events between the reference and the
alignments in NA20504.bam. In order to produce correct output, the reference
supplied must be the reference to which NA20504.bam was aligned.
Typically, we might consider two additional parameters.
GVCF output allows us to have coverage information about non-called sites, and we can enable it with `--gcvcf`.
For performance reasons we may want to skip regions of extremely high coverage in the reference using the `--skip-limit` parameter `-g`.
These can greatly increase runtime but do not produce meaningful results.
For instance, if we wanted to exclude regions of 1000X coverage, we would run:
Users may specify any number of BAM files on the command line. FreeBayes uses
the [BamTools API](http://github.com/pezmaster31/bamtools) to open and parse
these files in parallel, virtually merging them at runtime into one logical
file with a merged header.
freebayes -f ref.fa --gvcf -g 1000 >var.vcf
For a description of available command-line options and their defaults, run:
freebayes --help
## Getting the best results
freebayes provides many options to configure its operation.
These may be important in certain use cases, but they are not meant for standard use.
It is strongly recommended that you run freebayes with parameters that are as close to default as possible unless you can validate that the chosen parameters improve performance.
To achieve the desired tradeoff between sensitivity and specificity, the best approach is to filter the output using the `QUAL` field, which encodes the posterior estimate of the probability of variation.
Filtering the input with the provided options demonstrably hurts performance where truth sets can be used to evaluate results.
By removing information from the input, you can confuse the Bayesian model by making it appear that certain alleles frequently indicative of context specific error (such as indels in homopolymers) don't exist.
Many users apply these filters to force freebayes to not make haplotype calls.
This is almost always a mistake, as the haplotype calling process greatly improves the method's signal to noise ratio and normalizes differential alignment in complex regions.
## Examples
If you wish to obtain a VCF that does not contain haplotype calls or complex alleles, first call with default parameters and then decompose the output with tools in vcflib, vt, vcf-tools, bcftools, GATK, and Picard.
Here we use a tool in vcflib that normalizes the haplotype calls into pointwise SNPs and indels:
Call variants assuming a diploid sample:
freebayes ... | vcfallelicprimitives -kg >calls.vcf
freebayes -f ref.fa aln.bam >var.vcf
Note that this is not done by default as it makes it difficult to determine which variant calls freebayes completed.
The raw output faithfully describes exactly the calls that were made.
Call variants on only chrQ:
## Examples
freebayes -f ref.fa -r chrQ aln.bam >var.vcf
Call variants assuming a diploid sample:
Call variants on only chrQ, from position 1000 to 2000:
freebayes -f ref.fa aln.bam >var.vcf
freebayes -f ref.fa -r chrQ:1000-2000 aln.bam >var.vcf
Require at least 5 supporting observations to consider a variant:
......@@ -208,7 +155,6 @@ should probably be running it in parallel using this script to run on a single
host, or generating a series of scripts, one per region, and run them on a
cluster. Be aware that the freebayes-parallel script contains calls to other programs using relative paths from the scripts subdirectory; the easiest way to ensure a successful run is to invoke the freebayes-parallel script from within the scripts subdirectory.
## Calling variants: from fastq to VCF
You've sequenced some samples. You have a reference genome or assembled set of
......@@ -221,10 +167,8 @@ samples. You can use freebayes to detect the variants, following these steps:
* Ensure your alignments have **read groups** attached so their sample may be
identified by freebayes. Aligners allow you to do this, but you can also use
[bamaddrg](http://github.com/ekg/bamaddrg) to do so post-alignment.
* **Sort** the alignments (e.g. bamtools sort).
* **Mark duplicates**, for instance with [samtools
rmdup](http://samtools.sourceforge.net/) (if PCR was used in the preparation of
your sequencing library)
* **Sort** the alignments (e.g. [sambamba sort](https://github.com/biod/sambamba)).
* **Mark duplicates**, for instance with [sambamba markdup](https://github.com/biod/sambamba) (if PCR was used in the preparation of your sequencing library)
* ***Run freebayes*** on all your alignment data simultaneously, generating a
VCF. The default settings should work for most use cases, but if your samples
are not diploid, set the `--ploidy` and adjust the `--min-alternate-fraction`
......@@ -235,7 +179,7 @@ observation count (AO).
* (possibly, **Iterate** the variant detection process in response to insight
gained from your interpretation)
FreeBayes emits a standard VCF 4.1 output stream. This format is designed for the
freebayes emits a standard VCF 4.1 output stream. This format is designed for the
probabilistic description of allelic variants within a population of samples,
but it is equally suited to describing the probability of variation in a single
sample.
......@@ -269,7 +213,7 @@ can also be done by filtering on the DP flag.
## Calling variants in a population
FreeBayes is designed to be run on many individuals from the same population
freebayes is designed to be run on many individuals from the same population
(e.g. many human individuals) simultaneously. The algorithm exploits a neutral
model of allele diffusion to impute most-confident genotypings
across the entire population. In practice, the discriminant power of the
......@@ -327,41 +271,46 @@ more flexible than setting a hard count on the number of observations.
## Observation filters and qualities
### Input filters
FreeBayes filters its input so as to ignore low-confidence alignments and
alleles which are only supported by low-quality sequencing observations (see
`--min-mapping-quality` and `--min-base-quality`). It also will only evaluate a
position if at least one read has mapping quality of
`--min-supporting-mapping-quality` and one allele has quality of at least
`--min-supporting-base-quality`.
Reads with more than a fixed number of high-quality mismatches can be excluded
by specifying `--read-mismatch-limit`. This is meant as a workaround when
mapping quality estimates are not appropriately calibrated.
By default, freebayes doesn't
freebayes may be configured to filter its input so as to ignore low-confidence alignments and alleles which are only supported by low-quality sequencing observations (see `--min-mapping-quality` and `--min-base-quality`).
It also will only evaluate a position if at least one read has mapping quality of `--min-supporting-mapping-quality` and one allele has quality of at least `--min-supporting-base-quality`.
Reads with more than a fixed number of high-quality mismatches can be excluded by specifying `--read-mismatch-limit`.
This is meant as a workaround when mapping quality estimates are not appropriately calibrated.
Reads marked as duplicates in the BAM file are ignored, but this can be
disabled for testing purposes by providing `--use-duplicate-reads`. FreeBayes
does not mark duplicates on its own, you must use another process to do this.
Reads marked as duplicates in the BAM file are ignored, but this can be disabled for testing purposes by providing `--use-duplicate-reads`.
freebayes does not mark duplicates on its own, you must use another process to do this, such as that in [sambamba](https://github.com/biod/sambamba).
### Observation thresholds
As a guard against spurious variation caused by sequencing artifacts, positions
are skipped when no more than `--min-alternate-count` or
`--min-alternate-fraction`
non-clonal observations of an alternate are found in one sample. These default
to 2 and 0.2 respectively. The default setting of `--min-alternate-fraction
0.2` is suitable for diploid samples but should be changed for ploidy > 2.
As a guard against spurious variation caused by sequencing artifacts, positions are skipped when no more than `--min-alternate-count` or `--min-alternate-fraction` non-clonal observations of an alternate are found in one sample.
These default to 2 and 0.05 respectively.
The default setting of `--min-alternate-fraction 0.05` is suitable for diploid samples but may need to be changed for higher ploidy.
### Allele type exclusion
FreeBayes provides a few methods to ignore certain classes of allele, e.g.
`--no-indels` and `--no-mnps`. Users are *strongly cautioned against using
freebayes provides a few methods to ignore certain classes of allele, e.g.
`--throw-away-indels-obs` and `--throw-awary-mnps-obs`. Users are *strongly cautioned against using
these*, because removing this information is very likely to reduce detection
power. To generate a report only including SNPs, use vcffilter post-call as
such:
freebayes ... | vcffilter -f "TYPE = snp"
### Normalizing variant representation
If you wish to obtain a VCF that does not contain haplotype calls or complex alleles, first call with default parameters and then decompose the output with tools in vcflib, vt, vcf-tools, bcftools, GATK, or Picard.
Here we use a tool in vcflib that normalizes the haplotype calls into pointwise SNPs and indels:
freebayes ... | vcfallelicprimitives -kg >calls.vcf
Note that this is not done by default as it makes it difficult to determine which variant calls freebayes completed.
The raw output faithfully describes exactly the calls that were made.
### Observation qualities
FreeBayes estimates observation quality using several simple heuristics based
freebayes estimates observation quality using several simple heuristics based
on manipulations of the phred-scaled base qualities:
* For single-base observations, *mismatches* and *reference observations*: the
......@@ -373,21 +322,13 @@ deleted sequence.
* For *haplotypes*: the mean quality of allele observations within the
haplotype.
### Effective base depth
By default, filters are left completely open.
Use `--experimental-gls` if you would like to integrate both base and mapping
quality are into the reported site quality (QUAL in the VCF) and
genotype quality (GQ, when supplying `--genotype-qualities`). This integration
is driven by the "Effective Base Depth" metric first developed in
[snpTools](http://www.hgsc.bcm.edu/software/snptools), which scales observation
quality by mapping quality. When `--experimental-gls` is given, *P(Obs|Genotype) ~
P(MappedCorrectly(Obs))P(SequencedCorrectly(Obs))*.
By default, both base and mapping quality are into the reported site quality (QUAL in the VCF) and genotype quality (GQ, when supplying `--genotype-qualities`).
This integration is driven by the "Effective Base Depth" metric first developed in [snpTools](http://www.hgsc.bcm.edu/software/snptools), which scales observation quality by mapping quality: *P(Obs|Genotype) ~ P(MappedCorrectly(Obs))P(SequencedCorrectly(Obs))*.
Set `--standard-gls` to use the model described in the freebayes preprint.
## Stream processing
FreeBayes can read BAM from standard input `--stdin` instead of directly from
freebayes can read BAM from standard input `--stdin` instead of directly from
files. This allows the application of any number of streaming BAM filters and
calibrators to its input.
......@@ -487,67 +428,58 @@ is combined with high a sequencing error rate.
## Best practices and design philosophy
FreeBayes follows the patterns suggested by the [Unix
philosophy](https://en.wikipedia.org/wiki/Unix_philosophy), which promotes the
development of simple, modular systems that perform a single function, and can
be combined into more complex systems using stream processing of common
interchange formats.
FreeBayes incorporates a number of features in order to reduce the complexity
of variant detection for researchers and developers:
* **Indel realignment is accomplished internally** using a read-independent
method, and issues resulting from discordant alignments are dramatically
reducedy through the direct detection of haplotypes.
* The need for **base quality recalibration is avoided** through the direct
detection of haplotypes. Sequencing platform errors tend to cluster (e.g. at
the ends of reads), and generate unique, non-repeating haplotypes at a given
locus.
* **Variant quality recalibration is avoided** by incorporating a number of
metrics, such as read placement bias and allele balance, directly into the
Bayesian model. (Our upcoming publication will discuss this in more detail.)
A minimal pre-processing pipeline similar to that described in "Calling
variants" should be sufficient for most uses. For more information, please
refer to a recent post by Brad Chapman [on minimal BAM preprocessing
methods](http://bcbio.wordpress.com/2013/10/21/updated-comparison-of-variant-detection-methods-ensemble-freebayes-and-minimal-bam-preparation-pipelines/).
For a push-button solution to variant detection, from reads to variant calls,
look no further than the [gkno genome analysis platform](http://gkno.me/).
## Contributors
FreeBayes is made by:
- Erik Garrison
- Thomas Sibley
- Dillon Lee
- Patrick Marks
- Noah Spies
- Joshua Randall
- Jeremy Anderson
freebayes follows the patterns suggested by the [Unix philosophy](https://en.wikipedia.org/wiki/Unix_philosophy), which promotes the development of simple, modular systems that perform a single function, and can be combined into more complex systems using stream processing of common interchange formats.
## Support
freebayes incorporates a number of features in order to reduce the complexity of variant detection for researchers and developers:
### email
* **Indel realignment is accomplished internally** using a read-independent method, and issues resulting from discordant alignments are dramatically reducedy through the direct detection of haplotypes.
* The need for **base quality recalibration is avoided** through the direct detection of haplotypes. Sequencing platform errors tend to cluster (e.g. at the ends of reads), and generate unique, non-repeating haplotypes at a given locus.
* **Variant quality recalibration is avoided** by incorporating a number of metrics, such as read placement bias and allele balance, directly into the Bayesian model. (Our upcoming publication will discuss this in more detail.)
Please report any issues or questions to the [freebayes mailing
list](https://groups.google.com/forum/#!forum/freebayes), [freebayes issue
tracker](https://github.com/ekg/freebayes/issues), or by email to
<erik.garrison@gmail.com>.
A minimal pre-processing pipeline similar to that described in "Calling variants" should be sufficient for most uses.
For more information, please refer to a recent post by Brad Chapman [on minimal BAM preprocessing methods](http://bcbio.wordpress.com/2013/10/21/updated-comparison-of-variant-detection-methods-ensemble-freebayes-and-minimal-bam-preparation-pipelines/).
### IRC
## Development
To download freebayes, please use git to download the most recent development tree:
git clone --recursive git://github.com/ekg/freebayes.git
If you would like to chat real-time about freebayes, join #freebayes on
freenode. A gittr.im chat is also available.
Note the use of --recursive. This is required in order to download all nested git submodules for external repositories.
### reversion
### Resolving proxy issues with git
Note that if you encounter issues with the development HEAD and you would like
a quick workaround for an issue that is likely to have been reintroduced
recently, you can use `git checkout` to step back a few revisions.
Depending on your local network configuration, you may have problems obtaining freebayes via git.
If you see something like this you may be behind a proxy that blocks access to standard git:// port (9418).
git checkout [git-commit-id]
$ git clone --recursive git://github.com/ekg/freebayes.git
Cloning into 'freebayes'...
fatal: Unable to look up github.com (port 9418) (Name or service not known)
It will also help with debugging to know if a problem has arisen in recent
commits!
Luckily, if you have access to https:// on port 443, then you can use this
'magic' command as a workaround to enable download of the submodules:
git config --global url.https://github.com/.insteadOf git://github.com/
## Compilation
freebayes requires g++ and the standard C and C++ development libraries.
make
Will build the executable freebayes, as well as the utilities bamfiltertech and
bamleftalign. These executables can be found in the `bin/` directory in the
repository.
Users may wish to install to e.g. /usr/local/bin (default), which is
accomplished via
sudo make install
Note that the freebayes-parallel script and the programs on which it depends are
not installed by this command.
Users can optionally build with [BamTools](https://github.com/pezmaster31/bamtools) instead of [SeqLib](https://github.com/walaj/SeqLib). Building with BamTools requires CMake.
make wbamtools
freebayes (1.3.2-1) UNRELEASED; urgency=medium
* New upstream version
* Some additional patches created by 2to3 despite the general code base
is Python3
Closes: #936551
* Standards-Version: 4.4.1
* Secure URI in copyright format
* Remove trailing whitespace in debian/copyright
* Set upstream metadata fields: Repository, Repository-Browse.
-- Andreas Tille <tille@debian.org> Tue, 17 Dec 2019 10:46:24 +0100
freebayes (1.3.1-1) unstable; urgency=medium
* New upstream version
......
......@@ -7,7 +7,7 @@ Priority: optional
Build-Depends: debhelper-compat (= 12),
cmake,
pkg-config,
python,
python3,
zlib1g-dev,
libbamtools-dev,
libvcflib-dev (>= 1.0.0~rc2+dfsg-3~),
......@@ -17,7 +17,7 @@ Build-Depends: debhelper-compat (= 12),
samtools,
parallel,
libvcflib-tools (>= 1.0.0~rc1+dfsg1-6)
Standards-Version: 4.4.0
Standards-Version: 4.4.1
Vcs-Browser: https://salsa.debian.org/med-team/freebayes
Vcs-Git: https://salsa.debian.org/med-team/freebayes.git
Homepage: https://github.com/ekg/freebayes
......
Format: http://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
Upstream-Name: freebayes
Source: https://github.com/ekg/freebayes
......
Description: Some additional patches created by 2to3 despite the general code base is Python3
Bug-Debian: https://bugs.debian.org/936551
Author: Andreas Tille <tille@debian.org>
Last-Update: Tue, 17 Dec 2019 10:46:24 +0100
--- a/python/allelebayes.py
+++ b/python/allelebayes.py
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/python3
# calculates data likelihoods for sets of alleles
from __future__ import print_function, division
--- a/python/dirichlet.py
+++ b/python/dirichlet.py
@@ -1,5 +1,5 @@
-#!/usr/bin/env python
-from __future__ import division
+#!/usr/bin/python3
+
from scipy.special import gamma, gammaln
import operator
import math
@@ -38,7 +38,7 @@ def dirichlet_maximum_likelihood_ratio(p
return dirichlet(probs, obs, s) / float(maximum_likelihood)
def multinomial(probs, obs):
- return math.factorial(sum(obs)) / product(map(math.factorial, obs)) * product([math.pow(p, x) for p,x in zip(probs, obs)])
+ return math.factorial(sum(obs)) / product(list(map(math.factorial, obs))) * product([math.pow(p, x) for p,x in zip(probs, obs)])
def multinomialln(probs, obs):
return factorialln(sum(obs)) - sum(map(factorialln, obs)) + sum([math.log(math.pow(p, x)) for p,x in zip(probs, obs)])
--- a/python/hwe.py
+++ b/python/hwe.py
@@ -1,5 +1,5 @@
-#!/usr/bin/env python
-from __future__ import print_function, division
+#!/usr/bin/python3
+
from dirichlet import multinomial, multinomialln, multinomial_coefficient, multinomial_coefficientln
import math
import operator
--- a/python/multiset.py
+++ b/python/multiset.py
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/python3
# -*- coding: utf-8 -*-
"""
multiset.py -- non-recursive n multichoose k and
@@ -99,8 +99,8 @@ class ListElement:
def nth(self, n):
o = self
i = 0
- while i < n and o.next is not None:
- o = o.next
+ while i < n and o.__next__ is not None:
+ o = o.__next__
i += 1
return o
@@ -117,24 +117,24 @@ def __visit(h):
l = []
while o is not None:
l.append(o.value)
- o = o.next
+ o = o.__next__
return l
def permutations(multiset):
"""Generator providing all multiset permutations of a multiset."""
h, i, j = __init(multiset)
yield __visit(h)
- while j.next is not None or j.value < h.value:
- if j.next is not None and i.value >= j.next.value:
+ while j.__next__ is not None or j.value < h.value:
+ if j.__next__ is not None and i.value >= j.next.value:
s = j
else:
s = i
- t = s.next
- s.next = t.next
+ t = s.__next__
+ s.next = t.__next__
t.next = h
if t.value < h.value:
i = t
- j = i.next
+ j = i.__next__
h = t
yield __visit(h)
--- a/scripts/coverage_to_regions.py
+++ b/scripts/coverage_to_regions.py
@@ -1,5 +1,5 @@
-#!/usr/bin/env python
-from __future__ import print_function, division
+#!/usr/bin/python3
+
import sys
if len(sys.argv) < 3:
--- a/scripts/fasta_generate_regions.py
+++ b/scripts/fasta_generate_regions.py
@@ -1,5 +1,5 @@
-#!/usr/bin/env python
-from __future__ import print_function
+#!/usr/bin/python3
+
import sys
if len(sys.argv) == 1:
......@@ -3,3 +3,4 @@ fix_all_target.patch
fix_test.patch
vcffirstheader.patch
skip_failing_test.patch
2to3.patch
......@@ -15,3 +15,5 @@ Registry:
Entry: freebayes
- Name: conda:bioconda
Entry: freebayes
Repository: https://github.com/ekg/freebayes.git
Repository-Browse: https://github.com/ekg/freebayes
#!/usr/bin/env python
# calculates data likelihoods for sets of alleles
from __future__ import print_function, division
import multiset
import sys
import cjson
......@@ -73,8 +73,16 @@ def alleles_quality_to_lnprob(alleles):
allele['quality'] = phred.phred2ln(allele['quality'])
return alleles
def product(l):
return reduce(operator.mul, l)
def fold(func, iterable, initial=None, reverse=False):
x=initial
if reverse:
iterable=reversed(iterable)
for e in iterable:
x=func(x,e) if x is not None else e
return x
def product(listy):
return fold(operator.mul, listy)
def observed_alleles_in_genotype(genotype, allele_groups):
in_genotype = {}
......@@ -125,12 +133,12 @@ def sampling_prob(genotype, alleles):
genotype, follows the multinomial probability distribution."""
allele_groups = group_alleles(alleles)
multiplicity = sum([x[1] for x in genotype])
print genotype, multiplicity, alleles
print(genotype, multiplicity, alleles)
for allele, count in genotype:
if allele_groups.has_key(allele):
print allele, count, math.pow(float(count) / multiplicity, len(allele_groups[allele]))
print product([math.factorial(len(obs)) for obs in allele_groups.values()])
print allele_groups.values()
print(product([math.factorial(len(obs)) for obs in allele_groups.values()]))
print(allele_groups.values())
return float(math.factorial(len(alleles))) \
/ product([math.factorial(len(obs)) for obs in allele_groups.values()]) \
* product([math.pow(float(count) / multiplicity, len(allele_groups[allele])) \
......@@ -271,7 +279,7 @@ def banded_genotype_combinations(sample_genotypes, bandwidth, band_depth):
yield [(sample, genotypes[index]) for index, (sample, genotypes) in zip(index_permutation, sample_genotypes)]
def genotype_str(genotype):
return reduce(operator.add, [allele * count for allele, count in genotype])
return fold(operator.add, [allele * count for allele, count in genotype])
if __name__ == '__main__':
......@@ -371,5 +379,5 @@ if __name__ == '__main__':
sample['genotypes'] = sorted([[genotype_str(genotype), math.exp(prob)] for genotype, prob in sample['genotypes']],
key=lambda c: c[1], reverse=True)
print cjson.encode(position)
print(cjson.encode(position))
#print position['position']
#!/usr/bin/env python
from __future__ import division
from scipy.special import gamma, gammaln
import operator
import math
from logsumexp import logsumexp
from factorialln import factorialln
def product(l):
return reduce(operator.mul, l)
def fold(func, iterable, initial=None, reverse=False):
x=initial
if reverse:
iterable=reversed(iterable)
for e in iterable:
x=func(x,e) if x is not None else e
return x
def product(listy):
return fold(operator.mul, listy)
def beta(alphas):
"""Multivariate beta function"""
......
#!/usr/bin/env python
from __future__ import print_function, division
from dirichlet import multinomial, multinomialln, multinomial_coefficient, multinomial_coefficientln
import math
import operator
def fold(func, iterable, initial=None, reverse=False):
x=initial
if reverse:
iterable=reversed(iterable)
for e in iterable:
x=func(x,e) if x is not None else e
return x
def hwe_expectation(genotype, allele_counts):
"""@genotype is counts of A,B,C etc. alleles, e.g. (2,0) is AA and (1,1) is AB
@allele_counts is the counts of the alleles in the population"""
......@@ -9,7 +19,7 @@ def hwe_expectation(genotype, allele_counts):
ploidy = sum(genotype)
genotype_coeff = multinomial_coefficient(ploidy, genotype)
allele_frequencies = [count / float(population_total_alleles) for count in allele_counts]
genotype_expected_frequency = genotype_coeff * reduce(operator.mul, [math.pow(freq, p) for freq, p in zip(allele_frequencies, genotype)])
genotype_expected_frequency = genotype_coeff * fold(operator.mul, [math.pow(freq, p) for freq, p in zip(allele_frequencies, genotype)])
return genotype_expected_frequency
......@@ -28,7 +38,7 @@ def hwe_sampling_probln(genotype, genotypes, ploidy):
population as suggested by the genotype counts, given HWE"""
population_total_alleles = sum([sum(g[0]) * g[1] for g in genotypes])
#print "population_total_alleles", population_total_alleles
allele_counts = reduce(add_tuple, [[a * g[1] for a in g[0]] for g in genotypes])
allele_counts = fold(add_tuple, [[a * g[1] for a in g[0]] for g in genotypes])
#print "allele_counts", allele_counts
genotype_counts = [g[1] for g in genotypes]
#print "genotype_counts", genotype_counts
......@@ -49,7 +59,7 @@ def hwe_sampling_probln(genotype, genotypes, ploidy):
def inbreeding_coefficient(genotype, genotypes):
population_total_alleles = sum([sum(g[0]) * g[1] for g in genotypes])
allele_counts = reduce(add_tuple, [[a * g[1] for a in g[0]] for g in genotypes])
allele_counts = fold(add_tuple, [[a * g[1] for a in g[0]] for g in genotypes])
genotype_counts = [g[1] for g in genotypes]
population_total_genotypes = sum([gtc[1] for gtc in genotypes])
expected = hwe_expectation(genotype, allele_counts) * population_total_genotypes
......@@ -59,6 +69,6 @@ def inbreeding_coefficient(genotype, genotypes):
observed = g[1]
break
if observed == 0:
print "error, no observations of genotype, cannot calculate inbreeding coefficient"
print("error, no observations of genotype, cannot calculate inbreeding coefficient")
return 0
return 1 - (observed / expected)
#!/usr/bin/env python
from __future__ import print_function, division
import sys
if len(sys.argv) < 3:
print "usage: <bamtools_coverage_output ", sys.argv[0], " fasta_index num_regions >regions.bed"
print "Generates regions with even sequencing coverage, provided an input of coverage per-position as"
print "generated by bamtools coverage. In other words, generates regions such that the integral of"
print "coverage is approximately equal for each. These can be used when variant calling to reduce"
print "variance in job runtime."
print("usage: <bamtools_coverage_output {} fasta_index num_regions >regions.bed").format(sys.argv[0])
print("Generates regions with even sequencing coverage, provided an input of coverage per-position as")
print("generated by bamtools coverage. In other words, generates regions such that the integral of")
print("coverage is approximately equal for each. These can be used when variant calling to reduce")
print("variance in job runtime.")
exit(1)
lengths = {}
......@@ -35,7 +35,7 @@ for line in positions:
chrom, pos, depth = line #line.strip().split("\t")
if lchrom != chrom:
if lchrom:
print lchrom+":"+str(lpos)+"-"+str(lengths[lchrom])
print(lchrom+":"+str(lpos)+"-"+str(lengths[lchrom]))
lpos = 0
lchrom = chrom
bp_in_region = 0
......@@ -43,9 +43,8 @@ for line in positions:
lchrom = chrom
bp_in_region += depth
if bp_in_region > bp_per_region:
print chrom+":"+str(lpos)+"-"+str(pos) #, pos - lpos
print(chrom+":"+str(lpos)+"-"+str(pos)) #, pos - lpos
lpos = pos
bp_in_region = 0
print lchrom+":"+str(lpos)+"-"+str(lengths[lchrom])
print(lchrom+":"+str(lpos)+"-"+str(lengths[lchrom]))
#!/usr/bin/env python
from __future__ import print_function
import sys
if len(sys.argv) == 1:
print "usage: ", sys.argv[0], " <fasta file or index file> <region size>"
print "generates a list of freebayes/bamtools region specifiers on stdout"
print "intended for use in creating cluster jobs"
print("usage: {} <fasta file or index file> <region size>").format(sys.argv[0])
print("generates a list of freebayes/bamtools region specifiers on stdout")
print("intended for use in creating cluster jobs")
exit(1)
fasta_index_file = sys.argv[1]
......@@ -17,19 +16,14 @@ fasta_index_file = open(fasta_index_file)
region_size = int(sys.argv[2])
for line in fasta_index_file:
fields = line.strip().split("\t")
chrom_name = fields[0]
chrom_length = int(fields[1])
region_start = 0
while region_start < chrom_length:
start = region_start
end = region_start + region_size
if end > chrom_length:
end = chrom_length
print chrom_name + ":" + str(region_start) + "-" + str(end)
print(chrom_name + ":" + str(region_start) + "-" + str(end))
region_start = end
......@@ -1317,6 +1317,13 @@ RegisteredAlignment& AlleleParser::registerAlignment(BAMALIGN& alignment, Regist
string rDna = alignment.QUERYBASES;
string rQual = alignment.QUALITIES;
if (qualityChar2LongDouble(rQual.at(0)) == -1) {
// force rQual to be 0
char q0 = qualityInt2Char(0);
for (size_t i = 0; i < rQual.size(); ++i) {
rQual[i] = q0;
}
}
int rp = 0; // read position, 0-based relative to read
int csp = currentSequencePosition(alignment); // current sequence position, 0-based relative to currentSequence
int sp = alignment.POSITION; // sequence position
......@@ -1979,12 +1986,8 @@ void AlleleParser::updateAlignmentQueue(long int position,
}
// skip alignments which are non-primary
#ifdef HAVE_BAMTOOLS
if (!currentAlignment.IsPrimaryAlignment()) {
#else
if (currentAlignment.SecondaryFlag()) {
#endif
//DEBUG("skipping alignment " << currentAlignment.Name << " because it is not marked primary");
DEBUG("skipping alignment " << currentAlignment.QNAME << " because it is not marked primary");
continue;
}
......@@ -2878,8 +2881,6 @@ bool AlleleParser::toNextPosition(void) {
// remove past registered alleles
DEBUG2("marking previous alleles as processed and removing from registered alleles");
removePreviousAlleles(registeredAlleles, currentPosition);
sort(registeredAlleles.begin(), registeredAlleles.end());
registeredAlleles.erase(unique(registeredAlleles.begin(), registeredAlleles.end()), registeredAlleles.end());
// if we have alignments which ended at the previous base, erase them and their alleles
DEBUG2("erasing old registered alignments");
......@@ -3341,12 +3342,15 @@ void AlleleParser::buildHaplotypeAlleles(
groupAlleles(samples, alleleGroups);
/*
if (parameters.debug) {
DEBUG("after re-grouping alleles");
for (Samples::iterator s = samples.begin(); s != samples.end(); ++s) {
cerr << s->first << endl;
for (Sample::iterator t = s->second.begin(); t != s->second.end(); ++t) {
cerr << t->first << " " << t->second << endl << endl;
}
}
}
*/
Allele refAllele = genotypeAllele(ALLELE_REFERENCE,
......
......@@ -17,7 +17,7 @@ void Parameters::simpleUsage(char ** argv) {
<< " \"Haplotype-based variant detection from short-read sequencing\"" << endl
<< " arXiv:1207.3907 (http://arxiv.org/abs/1207.3907)" << endl
<< endl
<< "author: Erik Garrison <erik.garrison@bc.edu>, Marth Lab, Boston College, 2010-2014" << endl
<< "author: Erik Garrison <erik.garrison@gmail.com>" << endl
<< "version: " << VERSION_GIT << endl;
}
......@@ -381,7 +381,7 @@ void Parameters::usage(char** argv) {
<< " -dd Print more verbose debugging output (requires \"make DEBUG\")" << endl
<< endl
<< endl
<< "author: Erik Garrison <erik.garrison@bc.edu>, Marth Lab, Boston College, 2010-2014" << endl
<< "author: Erik Garrison <erik.garrison@gmail.com>" << endl
<< "version: " << VERSION_GIT << endl;
}
......