VCFtools

A set of tools written in Perl and C++ for working with VCF files.

The bcftools/htslib VCF commands

HTSlib is a C library for high-throughput sequencing data formats. It is designed for speed and works with both VCF and BCFv2.

Download and installation

The library is hosted on GitHub and can be downloaded and compiled in the usual way. Run the clone command only once; run the pull command whenever the latest snapshot from GitHub is needed. Please see the bcftools GitHub page for the up-to-date version of the clone command. The software is under heavy development, so the --branch option may be required.

git clone [--branch=name] git://github.com/samtools/htslib.git htslib
git clone git://github.com/samtools/bcftools.git bcftools
cd htslib; git pull; cd ..
cd bcftools; git pull; cd ..

# Compile
cd bcftools; make; make test

# Run
./bcftools stats file.vcf.gz

The tools

bcftools annotate

Adds or removes annotations and supports user-written plugins.

Fast alternative to vcf-annotate

About:   Annotate and edit VCF/BCF files.
Usage:   bcftools annotate [options] <in.vcf.gz>

Options:
   -a, --annotations <file>       VCF file or tabix-indexed file with annotations: CHR\tPOS[\tVALUE]+
   -c, --columns <list>           list of columns in the annotation file, e.g. CHROM,POS,REF,ALT,-,INFO/TAG. See man page for details
   -h, --header-lines <file>      lines which should be appended to the VCF header
   -l, --list-plugins             list available plugins. See BCFTOOLS_PLUGINS environment variable and man page for details
   -O, --output-type <b|u|z|v>    b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]
   -p, --plugins <name|...>       comma-separated list of dynamically loaded user-defined plugins. See man page for details
   -r, --regions <reg|file>       restrict to comma-separated list of regions or regions listed in a file, see man page for details
   -R, --remove <list>            list of annotations to remove (e.g. ID,INFO/DP,FORMAT/DP,FILTER). See man page for details
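
As a rough illustration of the -R option listed above, the sketch below wraps a call that strips the ID column and the INFO/DP tag and writes compressed VCF. The function name strip_annotations and the file names are hypothetical, and the option letters are the ones shown in the usage text above, so other bcftools releases may differ.

```shell
# Hypothetical helper: remove the ID column and the INFO/DP tag.
# $1 = input VCF/BCF; compressed VCF (-O z) goes to standard output.
strip_annotations() {
    bcftools annotate -R ID,INFO/DP -O z "$1"
}

# Example call with made-up file names; skipped when the input is absent.
if [ -f in.vcf.gz ]; then
    strip_annotations in.vcf.gz > in.stripped.vcf.gz
fi
```

Wrapping the call in a function keeps the list of removed annotations in one place when the same clean-up is applied to several files.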

bcftools call

Formerly known as bcftools view, this is the successor of the popular caller from the samtools package with extended capabilities.

About:   SNP/indel variant calling from VCF/BCF. To be used in conjunction with samtools mpileup.
         This command replaces the former "bcftools view" caller. Some of the original
         functionality has been temporarily lost in the process of transition to htslib,
         but will be added back on popular demand. The original calling model can be
         invoked with the -c option.
Usage:   bcftools call [options] <in.vcf.gz>

File format options:
   -O, --output-type <b|u|z|v>     output type: 'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v]
   -r, --regions <reg|file>        restrict to comma-separated list of regions or regions listed in a file, see man page for details
   -s, --samples <list|:file>      sample list, PED file or a file with optional second column for ploidy (0, 1 or 2) [all samples]
   -t, --targets <reg|file>        similar to -r but streams rather than index-jumps, see man page for details

Input/output options:
   -A, --keep-alts                 keep all possible alternate alleles at variant sites
   -M, --keep-masked-ref           keep sites with masked reference allele (REF=N)
   -S, --skip <snps|indels>        skip indels/snps
   -v, --variants-only             output variant sites only

Consensus/variant calling options:
   -c, --consensus-caller          the original calling method (conflicts with -m)
   -C, --constrain <str>           one of: alleles, trio (see manual)
   -m, --multiallelic-caller       alternative model for multiallelic and rare-variant calling (conflicts with -c)
   -n, --novel-rate <float>,[...]  likelihood of novel mutation for constrained trio calling, see man page for details [1e-8,1e-9,1e-9]
   -p, --pval-threshold <float>    variant if P(ref|D)<FLOAT with -c [0.5] or another allele accepted if P(chi^2)>=1-FLOAT with -m [1e-2]
   -X, --chromosome-X              haploid output for male samples (requires PED file with -s)
   -Y, --chromosome-Y              haploid output for males and skips females (requires PED file with -s)
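
A minimal sketch of a calling step, assuming the genotype likelihoods have already been produced by samtools mpileup and saved to a file. It uses only options from the usage text above: -m for the multiallelic caller, -v to output variant sites only, and -O z for compressed VCF. The function name call_variants and the file names are hypothetical.

```shell
# Hypothetical helper: call variants with the multiallelic model.
# $1 = VCF/BCF with genotype likelihoods from samtools mpileup.
call_variants() {
    bcftools call -m -v -O z "$1"
}

# Example call with made-up file names; skipped when the input is absent.
if [ -f pileup.vcf.gz ]; then
    call_variants pileup.vcf.gz > calls.vcf.gz
fi
```

Replacing -m with -c would select the original calling model instead; the two options conflict, so only one of them should be given.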

bcftools filter

Powerful fixed-threshold filtering that accepts boolean and arithmetic expressions.
See also bcftools view below.

About:   Apply fixed-threshold filters.
Usage:   bcftools filter [options] <in.vcf.gz>

Options:
    -e, --exclude <expr>          exclude sites for which the expression is true (e.g. '%TYPE="snp" && %QUAL>=10 && (DP4[2]+DP4[3] > 2)')
    -g, --SnpGap <int>            filter SNPs within <int> base pairs of an indel
    -G, --IndelGap <int>          filter clusters of indels separated by <int> or fewer base pairs allowing only one to pass
    -i, --include <expr>          include only sites for which the expression is true
    -m, --mode <+|x>              "+": do not replace but add to existing FILTER; "x": reset filters at sites which pass
    -O, --output-type <b|u|z|v>   b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]
    -r, --regions <reg|file>      restrict to comma-separated list of regions or regions listed in a file, see man page for details
    -s, --soft-filter <string>    annotate FILTER column with <string> or unique filter name ("Filter%d") made up by the program ("+")
    -t, --targets <reg|file>      similar to -r but streams rather than index-jumps, see man page for details

Filter expressions may contain:
    - arithmetic operators: +,*,-,/
    - logical operators: && (same as &), || (same as |)
    - comparison operators: == (same as =), >, >=, <=, <, !=
    - parentheses: (, )
    - array subscripts (e.g. AC[0]>=10)
    - double quotes for string values (e.g. %FILTER="PASS")
    - 1 (or 0) for testing the presence (or absence) of a flag (e.g. FlagA=1 && FlagB=0)
    - TAG or INFO/TAG for INFO values (e.g. DP<800 or INFO/DP<800)
    - %QUAL, %FILTER, etc. for column names (note: currently only some columns are supported)
    - %TYPE for variant type, such as %TYPE="indel"|"snp"|"mnp"|"other"
    - %FUNC(TAG) where FUNC is one of MAX, MIN, AVG and TAG is one of the FORMAT fields (e.g. %MIN(DV)>5)
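
The options and expression syntax above might be combined as in the following sketch. It excludes sites with QUAL below 10, also filters SNPs within 3 bp of an indel, and passes the name LowQual to -s so that the FILTER column is annotated with it. The function name, the filter name LowQual, the thresholds and the file names are all illustrative assumptions.

```shell
# Hypothetical helper: soft-filter low-quality sites and SNPs near indels.
# $1 = input VCF/BCF; compressed VCF (-O z) goes to standard output.
soft_filter() {
    # Single quotes keep the shell from interpreting % and < in the expression.
    bcftools filter -s LowQual -e '%QUAL<10' -g 3 -O z "$1"
}

# Example call with made-up file names; skipped when the input is absent.
if [ -f calls.vcf.gz ]; then
    soft_filter calls.vcf.gz > calls.flt.vcf.gz
fi
```

Quoting the expression with single quotes matters because characters such as <, >, & and | would otherwise be interpreted by the shell.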

bcftools gtcheck

A tool for detecting sample swaps and contamination

About:   Check sample identity. With no -g BCF given, multi-sample cross-check is performed.
Usage:   bcftools gtcheck [options] [-g <genotypes.vcf.gz>] <query.vcf.gz>

Options:
    -a, --all-sites                 output comparison for all sites
    -g, --genotypes <file>          genotypes to compare against
    -G, --GTs-only <int>            use GTs, ignore PLs, using <int> for unseen genotypes [99]
    -H, --homs-only                 homozygous genotypes only (useful for low coverage data)
    -p, --plot <prefix>             plot
    -r, --regions <file|reg>        restrict to list of regions or regions listed in a file, see man page for details
    -s, --query-sample <string>     query sample (by default the first sample is checked)
    -S, --target-sample <string>    target sample in the -g file (used only for plotting)
    -t, --targets <reg|file>        similar to -r but streams rather than index-jumps, see man page for details
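
A possible way to check one sample against a set of known genotypes, using the -g and -s options from the usage text above. The function name check_sample, the sample name and the file names are hypothetical.

```shell
# Hypothetical helper: compare one query sample against known genotypes.
# $1 = genotypes file, $2 = query sample name, $3 = query VCF/BCF.
check_sample() {
    bcftools gtcheck -g "$1" -s "$2" "$3"
}

# Example call with made-up names; skipped when either input is absent.
if [ -f genotypes.vcf.gz ] && [ -f query.vcf.gz ]; then
    check_sample genotypes.vcf.gz NA00001 query.vcf.gz > gtcheck.txt
fi
```

Leaving out -g would, according to the usage text, run a multi-sample cross-check on the query file instead.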

bcftools isec

Fast alternative to vcf-isec

About:   Create intersections, unions and complements of VCF files.
Usage:   bcftools isec [options] <A.vcf.gz> <B.vcf.gz> [...]

Options:
    -c, --collapse <string>           treat as identical records with <snps|indels|both|all|some|none>, see man page for details [none]
    -C, --complement                  output positions present only in the first file but missing in the others
    -f, --apply-filters <list>        require at least one of the listed FILTER strings (e.g. "PASS,.")
    -n, --nfiles [+-=]<int>           output positions present in this many (=), this many or more (+), or this many or fewer (-) files
    -O, --output-type <b|u|z|v>       b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]
    -p, --prefix <dir>                if given, subset each of the input files accordingly, see also -w
    -r, --regions <file|reg>          restrict to comma-separated list of regions or regions listed in a file, see man page for details
    -t, --targets <file|reg>          similar to -r but streams rather than index-jumps, see man page for details
    -w, --write <list>                list of files to write with -p given as 1-based indexes. By default, all files are written

Examples:
   # Create intersection and complements of two sets saving the output in dir/*
   bcftools isec A.vcf.gz B.vcf.gz -p dir

   # Extract and write records from A shared by both A and B using exact allele match
   bcftools isec A.vcf.gz B.vcf.gz -p dir -n =2 -w 1

   # Extract records private to A or B comparing by position only
   bcftools isec A.vcf.gz B.vcf.gz -p dir -n -1 -c all

bcftools merge

Fast alternative to vcf-merge with extended capabilities and correct handling of Number=A,G,R INFO fields.

About:   Merge multiple VCF or BCF files to create one multi-sample file combining compatible records
         into one according to the -m option.
Usage:   bcftools merge [options] <A.vcf.gz> <B.vcf.gz> [...]

Options:
        --use-header <file>            use the provided header
        --print-header                 print only the merged header and exit
    -f, --apply-filters <list>         require at least one of the listed FILTER strings (e.g. "PASS,.")
    -i, --info-rules <tag:method,..>   rules for merging INFO fields (method is one of sum,avg,min,max,join) or "-" to turn off the default [DP:sum,DP4:sum]
    -m, --merge <string>               merge sites with differing alleles for <snps|indels|both|all|none>, see man page for details [both]
    -O, --output-type <b|u|z|v>        'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v]
    -r, --regions <reg|file>           merge in the given regions only
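
The following sketch shows one way the merge options above could be combined: -f restricts the input to records whose FILTER is PASS or ".", and -m none asks that sites with differing alleles not be merged into a single record. The function name merge_pass_only and the file names are hypothetical.

```shell
# Hypothetical helper: merge any number of files, PASS or "." records only.
# All arguments are input files; compressed VCF goes to standard output.
merge_pass_only() {
    bcftools merge -f PASS,. -m none -O z "$@"
}

# Example call with made-up file names; skipped when an input is absent.
if [ -f A.vcf.gz ] && [ -f B.vcf.gz ]; then
    merge_pass_only A.vcf.gz B.vcf.gz > merged.vcf.gz
fi
```

Using "$@" lets the same helper take two files or twenty without changes.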

bcftools norm

Left-align and normalize indels to the shortest possible representation.

About:   Left-align and normalize indels.
Usage:   bcftools norm [options] -f <ref.fa> <in.vcf.gz>

Options:
    -D, --remove-duplicates           remove duplicate lines of the same type. [Todo: merge genotypes, don't just throw away.]
    -f, --fasta-ref <file>            reference sequence
    -O, --output-type <type>          'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v]
    -r, --regions <file|reg>          restrict to comma-separated list of regions or regions listed in a file, see man page for details
    -w, --win <int,int>               alignment window and buffer window [50,1000]
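
A minimal sketch of a normalization step, assuming a reference FASTA that matches the one the variants were called against. Only the -f and -O options from the usage text above are used; the function name normalize_indels and the file names are hypothetical.

```shell
# Hypothetical helper: left-align and normalize indels.
# $1 = reference FASTA, $2 = input VCF/BCF.
normalize_indels() {
    bcftools norm -f "$1" -O z "$2"
}

# Example call with made-up file names; skipped when an input is absent.
if [ -f ref.fa ] && [ -f in.vcf.gz ]; then
    normalize_indels ref.fa in.vcf.gz > in.norm.vcf.gz
fi
```

Normalizing before comparing or merging call sets helps ensure that the same indel is written the same way in every file.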

bcftools query

Fast alternative to vcf-query

About:   Extracts fields from VCF/BCF file and prints them in user-defined format
Usage:   bcftools query [options] <A.vcf.gz> [<B.vcf.gz> [...]]

Options:
    -a, --annots <list>               alias for -f '%CHROM\t%POS\t%MASK\t%REF\t%ALT\t%TYPE\t' + tab-separated <list> of tags
    -c, --collapse <string>           collapse lines with duplicate positions for <snps|indels|both|all|some|none>, see man page [none]
    -f, --format <string>             learn by example, see below
    -H, --print-header                print header
    -l, --list-samples                print the list of samples and exit
    -r, --regions <reg|file>          restrict to comma-separated list of regions or regions listed in a file, see man page for details
    -t, --targets <reg|file>          similar to -r but streams rather than index-jumps, see man page for details
    -s, --samples <list|:file>        comma-separated list of samples to include or one name per line in a file
    -v, --vcf-list <file>             process multiple VCFs listed in the file

Expressions:
	%CHROM          The CHROM column (similarly also other columns, such as POS, ID, QUAL, etc.)
	%INFO/TAG       Any tag in the INFO column
	%TYPE           Variant type (REF, SNP, MNP, INDEL, OTHER)
	%MASK           Indicates presence of the site in other files (with multiple files)
	%TAG{INT}       Curly brackets to subscript vectors (0-based)
	[]              The brackets loop over all samples
	%GT             Genotype (e.g. 0/1)
	%TGT            Translated genotype (e.g. C/A)
	%LINE           Prints the whole line
	%SAMPLE         Sample name

Examples:
	bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%SAMPLE=%GT]\n' file.vcf.gz
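
As a further illustration of the format expressions above, this sketch prints one line per site with the translated genotype (%TGT) of each selected sample. The function name genotype_table, the sample names and the file names are hypothetical.

```shell
# Hypothetical helper: tab-separated table of translated genotypes.
# $1 = comma-separated sample list, $2 = input VCF/BCF.
genotype_table() {
    # The [] brackets loop over the selected samples; \t and \n are
    # passed through literally for bcftools to interpret.
    bcftools query -s "$1" -f '%CHROM\t%POS[\t%TGT]\n' "$2"
}

# Example call with made-up names; skipped when the input is absent.
if [ -f file.vcf.gz ]; then
    genotype_table NA00001,NA00002 file.vcf.gz > genotypes.tsv
fi
```

The resulting table has no header line; adding -H to the command would request one.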

bcftools stats

Formerly known as vcfcheck. Extract stats from a VCF/BCF file or compare two VCF/BCF files. The resulting text file can be plotted using plot-vcfstats.

bcftools stats file.vcf.gz > file.vchk
plot-vcfstats file.vchk -p plots/

About:   Parses VCF or BCF and produces stats which can be plotted using plot-vcfstats.
         When two files are given, the program generates separate stats for intersection
         and the complements.
Usage:   bcftools stats [options] <A.vcf.gz> [<B.vcf.gz>]

Options:
    -1, --1st-allele-only              include only 1st allele at multiallelic sites
    -c, --collapse <string>            treat as identical records with <snps|indels|both|all|some|none>, see man page for details [none]
    -d, --depth <int,int,int>          depth distribution: min,max,bin size [0,500,1]
        --debug                        produce verbose per-site and per-sample output
    -e, --exons <file.gz>              tab-delimited file with exons for indel frameshifts (chr,from,to; 1-based, inclusive, bgzip compressed)
    -f, --apply-filters <list>         require at least one of the listed FILTER strings (e.g. "PASS,.")
    -F, --fasta-ref <file>             faidx indexed reference sequence file to determine INDEL context
    -i, --split-by-ID                  collect stats for sites with ID separately (known vs novel)
    -r, --regions <reg|file>           restrict to comma-separated list of regions or regions listed in a file, see man page for details
    -s, --samples <list|:file>         produce sample stats, "-" to include all samples
    -t, --targets <reg|file>           similar to -r but streams rather than index-jumps, see man page for details
    -u, --user-tstv <TAG[:min:max:n]>  collect Ts/Tv stats for any tag using the given binning [0:1,100]

bcftools view

This versatile tool can be used for subsetting by sample and position, and even for flexible fixed-threshold filtering.

About:   VCF/BCF conversion, view, subset and filter VCF/BCF files.
Usage:   bcftools view [options] <in.vcf.gz> [region1 [...]]

Output options:
    -G,   --drop-genotypes              drop individual genotype information (after subsetting if -s option set)
    -h/H, --header-only/--no-header     print the header only/suppress the header in VCF output
    -l,   --compression-level [0-9]     compression level: 0 uncompressed, 1 best speed, 9 best compression [-1]
    -o,   --output-file <file>          output file name [stdout]
    -O,   --output-type <b|u|z|v>       b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]
    -r,   --regions <reg|file>          restrict to comma-separated list of regions or regions in a file, see man page for details
    -t,   --targets <reg|file>          similar to -r but streams rather than index-jumps, see man page for details

Subset options:
    -a, --trim-alt-alleles      trim alternate alleles not seen in the subset
    -I, --no-update             do not (re)calculate INFO fields for the subset (currently INFO/AC and INFO/AN)
    -s, --samples STR/FILE      list of samples (FILE or comma separated list STR) [null]

Filter options:
    -c/C, --min-ac/--max-ac <int>[:<type>]      minimum/maximum count for non-reference (nref), 1st alternate (alt1) or minor (minor) alleles [nref]
    -f,   --apply-filters <list>                require at least one of the listed FILTER strings (e.g. "PASS,.")
    -i/e, --include/--exclude <expr>            select/exclude sites for which the expression is true (see below for details)
    -k/n, --known/--novel                       select known/novel sites only (ID is not/is '.')
    -m/M, --min-alleles/--max-alleles <int>     minimum/maximum number of alleles listed in REF and ALT (e.g. -m2 -M2 for biallelic sites)
    -p/P, --phased/--exclude-phased             select/exclude sites where all samples are phased/not all samples are phased
    -q/Q, --min-af/--max-af <float>[:<type>]    minimum/maximum frequency for non-reference (nref), 1st alternate (alt1) or minor (minor) alleles [nref]
    -u/U, --uncalled/--exclude-uncalled         select/exclude sites without a called genotype
    -v/V, --types/--exclude-types <list>        select/exclude comma-separated list of variant types: snps,indels,mnps,other [null]
    -x/X, --private/--exclude-private           select/exclude sites where the non-reference alleles are exclusive (private) to the subset samples

Filter expressions may contain:
    - arithmetic operators: +,*,-,/
    - logical operators: && (same as &), || (same as |)
    - comparison operators: == (same as =), >, >=, <=, <, !=
    - parentheses: (, )
    - array subscripts (e.g. AC[0]>=10)
    - double quotes for string values (e.g. %FILTER="PASS")
    - 1 (or 0) for testing the presence (or absence) of a flag (e.g. FlagA=1 && FlagB=0)
    - TAG or INFO/TAG for INFO values (e.g. DP<800 or INFO/DP<800)
    - %QUAL, %FILTER, etc. for column names (note: currently only some columns are supported)
    - %TYPE for variant type, such as %TYPE="indel"|"snp"|"mnp"|"other"
    - %FUNC(TAG) where FUNC is one of MAX, MIN, AVG and TAG is one of the FORMAT fields (e.g. %MIN(DV)>5)
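
The subset and filter options above can be combined in a single call. The sketch below keeps only the listed samples and restricts the output to biallelic SNPs (-m2 -M2 -v snps), writing compressed VCF. The function name biallelic_snps, the sample names and the file names are hypothetical.

```shell
# Hypothetical helper: subset samples and keep biallelic SNPs only.
# $1 = comma-separated sample list, $2 = input VCF/BCF.
biallelic_snps() {
    bcftools view -s "$1" -m2 -M2 -v snps -O z "$2"
}

# Example call with made-up names; skipped when the input is absent.
if [ -f in.vcf.gz ]; then
    biallelic_snps NA00001,NA00002 in.vcf.gz > subset.snps.vcf.gz
fi
```

An -i or -e expression in the syntax listed above could be added to the same command to apply a threshold filter in the same pass.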