@@ -41,7 +41,7 @@ Most commands accept VCF, bgzipped VCF and BCF with filetype detected automatica
BCFtools is designed to work on a stream\&. It regards an input file "\-" as the standard input (stdin) and outputs to the standard output (stdout)\&. Several commands can thus be combined with Unix pipes\&.
.SS "VERSION"
.sp
This manual page was last updated \fB2018\-02\-12\fR and refers to bcftools git version \fB1\&.7\fR\&.
This manual page was last updated \fB2018\-04\-03\fR and refers to bcftools git version \fB1\&.8\fR\&.
.SS "BCF1"
.sp
The BCF1 format output by versions of samtools <= 0\&.1\&.19 is \fBnot\fR compatible with this version of bcftools\&. To read BCF1 files one can use the view command from old versions of bcftools packaged with samtools versions <= 0\&.1\&.19 to convert to VCF, which can then be read by this version of bcftools\&.
...
...
@@ -1796,8 +1796,45 @@ reference sequence in fasta format (required)
Checks sample identity or, without \fB\-g\fR, multi\-sample cross\-check is performed\&.
Checks sample identity\&. The program can operate in two modes\&. If the \fB\-g\fR option is given, the identity of the \fB\-s\fR sample from \fIquery\&.vcf\&.gz\fR is checked against the samples in the \fB\-g\fR file\&. Without the \fB\-g\fR option, a multi\-sample cross\-check of the samples in \fIquery\&.vcf\&.gz\fR is performed\&.
.PP
\fB\-a, \-\-all\-sites\fR
.RS 4
...
...
@@ -2227,21 +2264,14 @@ is used for the unseen genotypes\&. With
can be used instead; the discordance value then gives exactly the number of differing genotypes\&.
.RE
.PP
SM, Average Discordance
.RS 4
Average discordance between sample
\fIa\fR
and all other samples\&.
.RE
.PP
SM, Average Depth
ERR, error rate
.RS 4
Average depth at evaluated sites, or 1 if FORMAT/DP field is not present\&.
Pairwise error rate, calculated as the number of differences divided by the total number of comparisons\&.
.RE
.PP
SM, Average Number of sites
CLUSTER, TH, DOT
.RS 4
The average number of sites used to calculate the discordance\&. In other words, the average number of non\-missing PLs/genotypes seen in both samples\&.
In the presence of multiple samples, related samples and outliers can be identified by clustering samples by error rate\&. A simple hierarchical clustering based on minimization of standard deviation is used\&. This is useful to detect sample swaps, for example in situations where one sample has been sequenced in multiple runs\&.
.RE
.RE
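The ERR definition above (differences divided by comparisons) can be illustrated with a minimal Python sketch. This is not the gtcheck implementation: the function name, the sample data and the choice to skip sites where either genotype is missing are assumptions made for illustration only, and the real program may work on PLs rather than hard genotype calls.

```python
from itertools import combinations


def pairwise_error_rate(gt_a, gt_b):
    """Number of differing genotypes divided by the number of compared sites.

    Assumption: a site is compared only when both samples have a call
    (None marks a missing genotype in this sketch).
    """
    compared = 0
    differing = 0
    for a, b in zip(gt_a, gt_b):
        if a is None or b is None:
            continue
        compared += 1
        if a != b:
            differing += 1
    return differing / compared if compared else float("nan")


# Hypothetical genotype calls at four sites for three samples.
samples = {
    "A": ["0/0", "0/1", "1/1", "0/1"],
    "B": ["0/0", "0/1", "1/1", None],
    "C": ["0/1", "1/1", "1/1", "0/0"],
}

# One error rate per unordered pair of samples.
rates = {
    (x, y): pairwise_error_rate(samples[x], samples[y])
    for x, y in combinations(samples, 2)
}

for pair, rate in sorted(rates.items()):
    print(pair, round(rate, 3))
```

In this toy data, samples A and B agree at every site they share, so a clustering by error rate would be expected to group them together and leave C apart.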
.SS "bcftools index [\fIOPTIONS\fR] \fIin\&.bcf\fR|\fIin\&.vcf\&.gz\fR"
...
...
@@ -2986,7 +3016,7 @@ and
can swap alleles and will update genotypes (GT) and AC counts, but will not attempt to fix PL or other fields\&.
If a record is present multiple times, output only the first instance, see
\fB\-\-collapse\fR
...
...
@@ -2997,8 +3027,7 @@ in
\fB\-D, \-\-remove\-duplicates\fR
.RS 4
If a record is present in multiple files, output only the first instance\&. Alias for
\fB\-d none\fR\&. Requires
\fB\-a, \-\-allow\-overlaps\fR\&.
\fB\-d none\fR, deprecated\&.
.RE
.PP
\fB\-f, \-\-fasta\-ref\fR \fIFILE\fR
...
...
@@ -3864,12 +3893,28 @@ nor the other
options are given, the allele frequency is estimated from AC and AN counts which are already present in the INFO field\&.
.RE
.PP
\fB\-\-exclude\fR \fIEXPRESSION\fR
.RS 4
exclude sites for which
\fIEXPRESSION\fR
is true\&. For valid expressions see
\fBEXPRESSIONS\fR\&.
.RE
.PP
\fB\-G, \-\-GTs\-only\fR \fIFLOAT\fR
.RS 4
use genotypes (FORMAT/GT fields) ignoring genotype likelihoods (FORMAT/PL), setting PL of unseen genotypes to
\fIFLOAT\fR\&. Safe value to use is 30 to account for GT errors\&.
.RE
.PP
\fB\-\-include\fR \fIEXPRESSION\fR
.RS 4
include only sites for which
\fIEXPRESSION\fR
is true\&. For valid expressions see
\fBEXPRESSIONS\fR\&.
.RE
.PP
\fB\-I, \-\-skip\-indels\fR
.RS 4
skip indels as their genotypes are usually enriched for errors
...
...
@@ -4680,15 +4725,17 @@ TYPE!~"snp"
.sp -1
.IP \(bu 2.3
.\}
array subscripts (0\-based), "*" for any field, "\-" to indicate a range\&. Note that for querying FORMAT vectors, the colon ":" can be used to select a sample and a subfield
array subscripts (0\-based), "*" for any element, "\-" to indicate a range\&. Note that for querying FORMAT vectors, the colon ":" can be used to select a sample and an element of the vector, as shown in the examples below
.sp
.if n \{\
.RS 4
.\}
.nf
(DP4[0]+DP4[1])/(DP4[2]+DP4[3]) > 0\&.3
DP4[*] == 0
CSQ[*] ~ "missense_variant\&.*deleterious"
INFO/AF[0] > 0\&.3 \&.\&. first AF value bigger than 0\&.3
FORMAT/AD[0:0] > 30 \&.\&. first AD value of the first sample bigger than 30
FORMAT/AD[0:1] \&.\&. first sample, second AD value
FORMAT/AD[1:0] \&.\&. second sample, first AD value
DP4[*] == 0 \&.\&. any DP4 value
FORMAT/DP[0] > 30 \&.\&. DP of the first sample bigger than 30
FORMAT/DP[1\-3] > 10 \&.\&. samples 2\-4
FORMAT/DP[1\-] < 7 \&.\&. all samples but the first
FORMAT/AD[0:1] \&.\&. first sample, second AD field
FORMAT/AD[0:*], AD[0:] or AD[0] \&.\&. first sample, any AD field
FORMAT/AD[*:1] or AD[:1] \&.\&. any sample, second AD field
(DP4[0]+DP4[1])/(DP4[2]+DP4[3]) > 0\&.3
CSQ[*] ~ "missense_variant\&.*deleterious"
.fi
.if n \{\
.RE
...
...
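The subscript rules listed in the examples above (0-based indices, "*" for any element, "-" for a range, ":" separating the sample from the vector element) can be sketched in Python as follows. The helper names `expand` and `select` and the AD values are hypothetical; this only mimics the documented selection semantics and is not the bcftools expression parser.

```python
def expand(spec, n):
    """Expand one subscript ("", "*", "2", "1-3", "1-") into 0-based indices below n."""
    if spec in ("", "*"):
        return list(range(n))
    if "-" in spec:
        lo, hi = spec.split("-")
        last = int(hi) if hi else n - 1
        return list(range(int(lo), min(last, n - 1) + 1))
    i = int(spec)
    return [i] if i < n else []


def select(values, subscript):
    """Pick values from a per-sample list of vectors.

    values:    one list per sample, e.g. FORMAT/AD for three samples
    subscript: the text between the brackets, e.g. "0:1", "*:1", "1-"
    """
    sample_spec, _, elem_spec = subscript.partition(":")
    picked = []
    for s in expand(sample_spec, len(values)):
        picked.extend(values[s][e] for e in expand(elem_spec, len(values[s])))
    return picked


# Hypothetical FORMAT/AD values for three samples (ref depth, alt depth).
ad = [[10, 3], [7, 0], [12, 5]]

print(select(ad, "0:1"))  # first sample, second AD value
print(select(ad, "1:0"))  # second sample, first AD value
print(select(ad, "*:1"))  # any sample, second AD value
print(select(ad, "1-"))   # all samples but the first, all AD values

# A comparison such as AD[*:1] > 4 is then true if any selected value matches.
print(any(v > 4 for v in select(ad, "*:1")))
```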
@@ -4743,6 +4792,29 @@ N_ALT, N_SAMPLES, AC, MAC, AF, MAF, AN, N_MISSING, F_MISSING
.RE
.\}
.RE
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
custom perl filtering\&. Note that this command is not compiled in by default, see the section
\fBOptional Compilation with Perl\fR
in the INSTALL file for help and misc/demo\-flt\&.pl for a working example\&. The demo defines the perl subroutine "severity", which can be invoked from the command line as follows:
@@ -4776,7 +4848,7 @@ Variables and function names are case\-insensitive, but not tag names\&. For exa
.sp -1
.IP \(bu 2.3
.\}
When querying multiple subfields, all subfields are tested and the OR logic is used on the result\&. For example, when querying "TAG=1,2,3,4", it will be evaluated as follows:
When querying multiple values, all elements are tested and the OR logic is used on the result\&. For example, when querying "TAG=1,2,3,4", it will be evaluated as follows:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>bcftools</title><link rel="stylesheet" type="text/css" href="docbook-xsl.css" /><meta name="generator" content="DocBook XSL Stylesheets V1.76.1" /></head><body><div xml:lang="en" class="refentry" title="bcftools" lang="en"><a id="idp144864"></a><div class="titlepage"></div><div class="refnamediv"><h2>Name</h2><p>bcftools — utilities for variant calling and manipulating VCFs and BCFs.</p></div><div class="refsynopsisdiv" title="Synopsis"><a id="_synopsis"></a><h2>Synopsis</h2><p><span class="strong"><strong>bcftools</strong></span> [--version|--version-only] [--help] [<span class="emphasis"><em>COMMAND</em></span>] [<span class="emphasis"><em>OPTIONS</em></span>]</p></div><div class="refsect1" title="DESCRIPTION"><a id="_description"></a><h2>DESCRIPTION</h2><p>BCFtools is a set of utilities that manipulate variant calls in the Variant
<html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>bcftools</title><link rel="stylesheet" type="text/css" href="docbook-xsl.css" /><meta name="generator" content="DocBook XSL Stylesheets V1.76.1" /></head><body><div xml:lang="en" class="refentry" title="bcftools" lang="en"><a id="idp25098944"></a><div class="titlepage"></div><div class="refnamediv"><h2>Name</h2><p>bcftools — utilities for variant calling and manipulating VCFs and BCFs.</p></div><div class="refsynopsisdiv" title="Synopsis"><a id="_synopsis"></a><h2>Synopsis</h2><p><span class="strong"><strong>bcftools</strong></span> [--version|--version-only] [--help] [<span class="emphasis"><em>COMMAND</em></span>] [<span class="emphasis"><em>OPTIONS</em></span>]</p></div><div class="refsect1" title="DESCRIPTION"><a id="_description"></a><h2>DESCRIPTION</h2><p>BCFtools is a set of utilities that manipulate variant calls in the Variant
Call Format (VCF) and its binary counterpart BCF. All commands work
transparently with both VCFs and BCFs, both uncompressed and BGZF-compressed.</p><p>Most commands accept VCF, bgzipped VCF and BCF with filetype detected
automatically even when streaming from a pipe. Indexed VCF and BCF
...
...
@@ -8,7 +8,7 @@ will work in all situations. Un-indexed VCF and BCF and streams will
work in most, but not all situations. In general, whenever multiple VCFs are
read simultaneously, they must be indexed and therefore also compressed.</p><p>BCFtools is designed to work on a stream. It regards an input file "-" as the
standard input (stdin) and outputs to the standard output (stdout). Several
commands can thus be combined with Unix pipes.</p><div class="refsect2" title="VERSION"><a id="_version"></a><h3>VERSION</h3><p>This manual page was last updated <span class="strong"><strong>2018-02-12</strong></span> and refers to bcftools git version <span class="strong"><strong>1.7</strong></span>.</p></div><div class="refsect2" title="BCF1"><a id="_bcf1"></a><h3>BCF1</h3><p>The BCF1 format output by versions of samtools <= 0.1.19 is <span class="strong"><strong>not</strong></span>
commands can thus be combined with Unix pipes.</p><div class="refsect2" title="VERSION"><a id="_version"></a><h3>VERSION</h3><p>This manual page was last updated <span class="strong"><strong>2018-04-03</strong></span> and refers to bcftools git version <span class="strong"><strong>1.8</strong></span>.</p></div><div class="refsect2" title="BCF1"><a id="_bcf1"></a><h3>BCF1</h3><p>The BCF1 format output by versions of samtools <= 0.1.19 is <span class="strong"><strong>not</strong></span>
compatible with this version of bcftools. To read BCF1 files one can use
the view command from old versions of bcftools packaged with samtools
versions <= 0.1.19 to convert to VCF, which can then be read by
...
...
@@ -1036,8 +1036,36 @@ output VCF and are ignored for the prediction analysis.</p><div class="variablel
GFF3 annotation file (required), such as <a class="ulink" href="ftp://ftp.ensembl.org/pub/current_gff3/homo_sapiens/" target="_top">ftp://ftp.ensembl.org/pub/current_gff3/homo_sapiens/</a>
</dd><dt><span class="term">
GFF3 annotation file (required), such as <a class="ulink" href="ftp://ftp.ensembl.org/pub/current_gff3/homo_sapiens" target="_top">ftp://ftp.ensembl.org/pub/current_gff3/homo_sapiens</a>.
An example of a minimal working GFF file:
</dd></dl></div><pre class="screen"> # The program looks for "CDS", "exon", "three_prime_UTR" and "five_prime_UTR" lines,
# looks up their parent transcript (determined from the "Parent=transcript:" attribute),
# the gene (determined from the transcript's "Parent=gene:" attribute), and the biotype
# (the most interesting is "protein_coding").
#
# Attributes required for
# gene lines:
# - ID=gene:<gene_id>
# - biotype=<biotype>
# - Name=<gene_name> [optional]
#
# transcript lines:
# - ID=transcript:<transcript_id>
# - Parent=gene:<gene_id>
# - biotype=<biotype>
#
# other lines (CDS, exon, five_prime_UTR, three_prime_UTR):
# - Parent=transcript:<transcript_id>
#
# Supported biotypes:
# - see the function gff_parse_biotype() in bcftools/csq.c
see <span class="strong"><strong><a class="link" href="#common_options" title="Common Options">Common Options</a></strong></span>
</dd></dl></div></div><div class="refsect2" title="bcftools gtcheck [OPTIONS] [-g genotypes.vcf.gz] query.vcf.gz"><a id="gtcheck"></a><h3>bcftools gtcheck [<span class="emphasis"><em>OPTIONS</em></span>] [<span class="strong"><strong>-g</strong></span> <span class="emphasis"><em>genotypes.vcf.gz</em></span>] <span class="emphasis"><em>query.vcf.gz</em></span></h3><p>Checks sample identity or, without <span class="strong"><strong>-g</strong></span>, multi-sample cross-check is performed.</p><div class="variablelist"><dl><dt><span class="term">
</dd></dl></div></div><div class="refsect2" title="bcftools gtcheck [OPTIONS] [-g genotypes.vcf.gz] query.vcf.gz"><a id="gtcheck"></a><h3>bcftools gtcheck [<span class="emphasis"><em>OPTIONS</em></span>] [<span class="strong"><strong>-g</strong></span> <span class="emphasis"><em>genotypes.vcf.gz</em></span>] <span class="emphasis"><em>query.vcf.gz</em></span></h3><p>Checks sample identity. The program can operate in two modes. If the <span class="strong"><strong>-g</strong></span>
option is given, the identity of the <span class="strong"><strong>-s</strong></span> sample from <span class="emphasis"><em>query.vcf.gz</em></span>
is checked against the samples in the <span class="strong"><strong>-g</strong></span> file.
Without the <span class="strong"><strong>-g</strong></span> option, a multi-sample cross-check of the samples in <span class="emphasis"><em>query.vcf.gz</em></span> is performed.</p><div class="variablelist"><dl><dt><span class="term">
<span class="strong"><strong>-G</strong></span>, the value <span class="emphasis"><em>1</em></span> can be used instead; the discordance value then
gives exactly the number of differing genotypes.
</dd><dt><span class="term">
SM, Average Discordance
</span></dt><dd>
Average discordance between sample <span class="emphasis"><em>a</em></span> and all other samples.
</dd><dt><span class="term">
SM, Average Depth
ERR, error rate
</span></dt><dd>
Average depth at evaluated sites, or 1 if FORMAT/DP field is not
present.
Pairwise error rate, calculated as the number of differences divided
by the total number of comparisons.
</dd><dt><span class="term">
SM, Average Number of sites
CLUSTER, TH, DOT
</span></dt><dd>
The average number of sites used to calculate the discordance. In
other words, the average number of non-missing PLs/genotypes seen
in both samples.
In the presence of multiple samples, related samples and outliers can be
identified by clustering samples by error rate. A simple hierarchical
clustering based on minimization of standard deviation is used. This is
useful to detect sample swaps, for example in situations where one
sample has been sequenced in multiple runs.
</dd></dl></div></div></div><div class="refsect2" title="bcftools index [OPTIONS] in.bcf|in.vcf.gz"><a id="index"></a><h3>bcftools index [<span class="emphasis"><em>OPTIONS</em></span>] <span class="emphasis"><em>in.bcf</em></span>|<span class="emphasis"><em>in.vcf.gz</em></span></h3><p>Creates index for bgzip compressed VCF/BCF files for random access. CSI
(coordinate-sorted index) is created by default. The CSI format
supports indexing of chromosomes up to length 2^31. TBI (tabix index)
...
...
@@ -1778,7 +1807,7 @@ the <span class="strong"><strong><a class="link" href="#fasta_ref">--fasta-ref</
can swap alleles and will update genotypes (GT) and AC counts,
If a record is present multiple times, output only the first instance,
see <span class="strong"><strong>--collapse</strong></span> in <span class="strong"><strong><a class="link" href="#common_options" title="Common Options">Common Options</a></strong></span>.
...
...
@@ -1786,7 +1815,7 @@ the <span class="strong"><strong><a class="link" href="#fasta_ref">--fasta-ref</
If neither <span class="strong"><strong>-e</strong></span> nor the other <span class="strong"><strong>--AF-…</strong></span> options are given, the allele frequency is
estimated from AC and AN counts which are already present in the INFO field.
variables calculated on the fly if not present: number of alternate alleles;
...
...
@@ -2726,14 +2770,19 @@ number of samples; count of alternate alleles; minor allele count (similar to
AC but is always smaller than 0.5); frequency of alternate alleles (AF=AC/AN);
frequency of minor alleles (MAF=MAC/AN); number of alleles in called genotypes;
number of samples with missing genotype; fraction of samples with missing genotype
</p><pre class="literallayout">N_ALT, N_SAMPLES, AC, MAC, AF, MAF, AN, N_MISSING, F_MISSING</pre></li></ul></div><div class="itemizedlist" title="Notes:"><p class="title"><strong>Notes:</strong></p><ul class="itemizedlist" type="disc"><li class="listitem">
</p><pre class="literallayout">N_ALT, N_SAMPLES, AC, MAC, AF, MAF, AN, N_MISSING, F_MISSING</pre></li><li class="listitem"><p class="simpara">
custom perl filtering. Note that this command is not compiled in by default, see
the section <span class="strong"><strong>Optional Compilation with Perl</strong></span> in the INSTALL file for help
and misc/demo-flt.pl for a working example. The demo defines the perl subroutine
"severity", which can be invoked from the command line as follows:
bcf_hdr_append(args->hdr_out,"##INFO=<ID=NOVELAL,Number=.,Type=String,Description=\"List of samples with novel alleles\">");
bcf_hdr_append(args->hdr_out,"##INFO=<ID=NOVELGT,Number=.,Type=String,Description=\"List of samples with novel genotypes. Note that only samples w/o a novel allele are listed.\">");