A set of tools written in Perl and C++ for working with VCF files.
HTSlib is a C library for high-throughput sequencing data formats. It is designed for speed and works with both VCF and BCFv2.
The library is hosted on GitHub and can be downloaded and compiled in the usual way. The clone command is run only once; the pull command is run whenever the latest snapshot from GitHub is needed. Please see the bcftools GitHub page for the up-to-date version of the clone command. The software is under heavy development, and the --branch option may be required.
git clone [--branch=name] git://github.com/samtools/htslib.git htslib
git clone git://github.com/samtools/bcftools.git bcftools
cd htslib; git pull; cd ..
cd bcftools; git pull; cd ..
# Compile
cd bcftools; make; make test
# Run
./bcftools stats file.vcf.gz
Adds or removes annotations and supports user-written plugins.
Fast alternative to vcf-annotate
About: Annotate and edit VCF/BCF files.
Usage: bcftools annotate [options] <in.vcf.gz>
Options:
    -a, --annotations <file>       VCF file or tabix-indexed file with annotations: CHR\tPOS[\tVALUE]+
    -c, --columns <list>           list of columns in the annotation file, e.g. CHROM,POS,REF,ALT,-,INFO/TAG. See man page for details
    -h, --header-lines <file>      lines which should be appended to the VCF header
    -l, --list-plugins             list available plugins. See BCFTOOLS_PLUGINS environment variable and man page for details
    -O, --output-type <b|u|z|v>    b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]
    -p, --plugins <name|...>       comma-separated list of dynamically loaded user-defined plugins. See man page for details
    -r, --regions <reg|file>       restrict to comma-separated list of regions or regions listed in a file, see man page for details
    -R, --remove <list>            list of annotations to remove (e.g. ID,INFO/DP,FORMAT/DP,FILTER). See man page for details
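As an illustration of the options listed above, the following sketch removes the INFO/DP and FORMAT/DP annotations and writes compressed VCF. The file names and the helper function strip_depth are hypothetical, and the option letters are taken from the usage text above; other bcftools releases may name them differently, so check the usage printed by the installed version.

```shell
# Hypothetical file names; adjust to your data.
input=in.vcf.gz
output=in.noDP.vcf.gz

# Remove INFO/DP and FORMAT/DP and write compressed VCF.
# -R and -O are used as described in the usage text above.
strip_depth() {
    bcftools annotate -R INFO/DP,FORMAT/DP -O z "$1" > "$2"
}

# Run only when the tool and the input file are actually present.
if command -v bcftools >/dev/null 2>&1 && [ -f "$input" ]; then
    strip_depth "$input" "$output"
else
    echo "bcftools or $input not found; nothing done" >&2
fi
```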
Formerly known as bcftools view, this is the successor of the popular caller from the samtools package with extended capabilities.
About: SNP/indel variant calling from VCF/BCF. To be used in conjunction with samtools mpileup.
       This command replaces the former "bcftools view" caller. Some of the original
       functionality has been temporarily lost in the process of transition to htslib,
       but will be added back on popular demand. The original calling model can be
       invoked with the -c option.
Usage: bcftools call [options] <in.vcf.gz>
File format options:
    -O, --output-type <b|u|z|v>      output type: 'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v]
    -r, --regions <reg|file>         restrict to comma-separated list of regions or regions listed in a file, see man page for details
    -s, --samples <list|:file>       sample list, PED file or a file with optional second column for ploidy (0, 1 or 2) [all samples]
    -t, --targets <reg|file>         similar to -r but streams rather than index-jumps, see man page for details
Input/output options:
    -A, --keep-alts                  keep all possible alternate alleles at variant sites
    -M, --keep-masked-ref            keep sites with masked reference allele (REF=N)
    -S, --skip <snps|indels>         skip indels/snps
    -v, --variants-only              output variant sites only
Consensus/variant calling options:
    -c, --consensus-caller           the original calling method (conflicts with -m)
    -C, --constrain <str>            one of: alleles, trio (see manual)
    -m, --multiallelic-caller        alternative model for multiallelic and rare-variant calling (conflicts with -c)
    -n, --novel-rate <float>,[...]   likelihood of novel mutation for constrained trio calling, see man page for details [1e-8,1e-9,1e-9]
    -p, --pval-threshold <float>     variant if P(ref|D)<FLOAT with -c [0.5] or another allele accepted if P(chi^2)>=1-FLOAT with -m [1e-2]
    -X, --chromosome-X               haploid output for male samples (requires PED file with -s)
    -Y, --chromosome-Y               haploid output for males and skips females (requires PED file with -s)
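Since the caller is meant to be used together with samtools mpileup, a two-step run might look like the sketch below. The file names and the function call_variants are hypothetical. The mpileup options (-u, -g, -f) are an assumption about the installed samtools release and are not taken from this page; the bcftools call options are those listed above.

```shell
# Hypothetical file names; adjust to your data.
ref=ref.fa
bam=aln.bam
calls=calls.vcf.gz

# Step 1: compute genotype likelihoods as uncompressed BCF (assumed mpileup options).
# Step 2: call variants with the multiallelic model (-m), keep variant sites only (-v),
#         and write compressed VCF (-O z).
call_variants() {
    samtools mpileup -u -g -f "$1" "$2" > pileup.bcf &&
    bcftools call -m -v -O z pileup.bcf > "$3"
}

# Run only when both tools and both inputs are actually present.
if command -v samtools >/dev/null 2>&1 && command -v bcftools >/dev/null 2>&1 \
   && [ -f "$ref" ] && [ -f "$bam" ]; then
    call_variants "$ref" "$bam" "$calls"
else
    echo "samtools, bcftools or input files not found; nothing done" >&2
fi
```

Replacing -m with -c would select the original calling model instead; the two options conflict and cannot be combined.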
Powerful fixed-threshold filtering, accepts boolean and arithmetic expressions.
See also bcftools view below.
About: Apply fixed-threshold filters.
Usage: bcftools filter [options] <in.vcf.gz>
Options:
    -e, --exclude <expr>          exclude sites for which the expression is true (e.g. '%TYPE="snp" && %QUAL>=10 && (DP4[2]+DP4[3] > 2)')
    -g, --SnpGap <int>            filter SNPs within <int> base pairs of an indel
    -G, --IndelGap <int>          filter clusters of indels separated by <int> or fewer base pairs allowing only one to pass
    -i, --include <expr>          include only sites for which the expression is true
    -m, --mode <+|x>              "+": do not replace but add to existing FILTER; "x": reset filters at sites which pass
    -O, --output-type <b|u|z|v>   b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]
    -r, --regions <reg|file>      restrict to comma-separated list of regions or regions listed in a file, see man page for details
    -s, --soft-filter <string>    annotate FILTER column with <string> or unique filter name ("Filter%d") made up by the program ("+")
    -t, --targets <reg|file>      similar to -r but streams rather than index-jumps, see man page for details
Filter expressions may contain:
    - arithmetic operators: +,*,-,/
    - logical operators: && (same as &), || (same as |)
    - comparison operators: == (same as =), >, >=, <=, <, !=
    - parentheses: (, )
    - array subscripts (e.g. AC[0]>=10)
    - double quotes for string values (e.g. %FILTER="PASS")
    - 1 (or 0) for testing the presence (or absence) of a flag (e.g. FlagA=1 && FlagB=0)
    - TAG or INFO/TAG for INFO values (e.g. DP<800 or INFO/DP<800)
    - %QUAL, %FILTER, etc. for column names (note: currently only some columns are supported)
    - %TYPE for variant type, such as %TYPE="indel"|"snp"|"mnp"|"other"
    - %FUNC(TAG) where FUNC is one of MAX, MIN, AVG and TAG is one of the FORMAT fields (e.g. %MIN(DV)>5)
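To show how the expression syntax and the soft-filter option fit together, the sketch below marks low-quality or very deep sites with a FILTER label instead of dropping them. The file names, the LowQual label, the thresholds and the function soft_filter are hypothetical. The expression uses the %-prefixed column syntax from the usage text above, which may differ in other bcftools releases.

```shell
# Hypothetical file names; adjust to your data.
input=in.vcf.gz
output=filtered.vcf

# Label sites with QUAL below 10 or INFO/DP above 800 as "LowQual".
# Single quotes keep the shell from interpreting %, < and | in the expression.
soft_filter() {
    bcftools filter -s LowQual -e '%QUAL<10 || DP>800' "$1" > "$2"
}

# Run only when the tool and the input file are actually present.
if command -v bcftools >/dev/null 2>&1 && [ -f "$input" ]; then
    soft_filter "$input" "$output"
else
    echo "bcftools or $input not found; nothing done" >&2
fi
```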
A tool for detecting sample swaps and contamination
About: Check sample identity. With no -g BCF given, multi-sample cross-check is performed.
Usage: bcftools gtcheck [options] [-g <genotypes.vcf.gz>] <query.vcf.gz>
Options:
    -a, --all-sites                 output comparison for all sites
    -g, --genotypes <file>          genotypes to compare against
    -G, --GTs-only <int>            use GTs, ignore PLs, using <int> for unseen genotypes [99]
    -H, --homs-only                 homozygous genotypes only (useful for low coverage data)
    -p, --plot <prefix>             plot
    -r, --regions <file|reg>        restrict to list of regions or regions listed in a file, see man page for details
    -s, --query-sample <string>     query sample (by default the first sample is checked)
    -S, --target-sample <string>    target sample in the -g file (used only for plotting)
    -t, --targets <reg|file>        similar to -r but streams rather than index-jumps, see man page for details
Fast alternative to vcf-isec
About: Create intersections, unions and complements of VCF files.
Usage: bcftools isec [options] <A.vcf.gz> <B.vcf.gz> [...]
Options:
    -c, --collapse <string>       treat as identical records with <snps|indels|both|all|some|none>, see man page for details [none]
    -C, --complement              output positions present only in the first file but missing in the others
    -f, --apply-filters <list>    require at least one of the listed FILTER strings (e.g. "PASS,.")
    -n, --nfiles [+-=]<int>       output positions present in this many (=), this many or more (+), or this many or fewer (-) files
    -O, --output-type <b|u|z|v>   b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]
    -p, --prefix <dir>            if given, subset each of the input files accordingly, see also -w
    -r, --regions <file|reg>      restrict to comma-separated list of regions or regions listed in a file, see man page for details
    -t, --targets <file|reg>      similar to -r but streams rather than index-jumps, see man page for details
    -w, --write <list>            list of files to write with -p given as 1-based indexes. By default, all files are written
Examples:
    # Create intersection and complements of two sets saving the output in dir/*
    bcftools isec A.vcf.gz B.vcf.gz -p dir
    # Extract and write records from A shared by both A and B using exact allele match
    bcftools isec A.vcf.gz B.vcf.gz -p dir -n =2 -w 1
    # Extract records private to A or B comparing by position only
    bcftools isec A.vcf.gz B.vcf.gz -p dir -n -1 -c all
Fast alternative to vcf-merge with extended capabilities and correct handling of Number=A,G,R INFO fields.
About: Merge multiple VCF or BCF files to create one multi-sample file combining compatible records into one according to the -m option.
Usage: bcftools merge [options] <A.vcf.gz> <B.vcf.gz> [...]
Options:
        --use-header <file>             use the provided header
        --print-header                  print only the merged header and exit
    -f, --apply-filters <list>          require at least one of the listed FILTER strings (e.g. "PASS,.")
    -i, --info-rules <tag:method,..>    rules for merging INFO fields (method is one of sum,avg,min,max,join) or "-" to turn off the default [DP:sum,DP4:sum]
    -m, --merge <string>                merge sites with differing alleles for <snps|indels|both|all|none>, see man page for details [both]
    -O, --output-type <b|u|z|v>         'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v]
    -r, --regions <reg|file>            merge in the given regions only
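A minimal sketch of a merge run follows, assuming two bgzip-compressed and indexed single-sample inputs. The file names and the function merge_pair are hypothetical. The -m none setting keeps records with differing alleles on separate lines rather than combining them; the default listed above is both.

```shell
# Hypothetical file names; adjust to your data.
a=A.vcf.gz
b=B.vcf.gz
merged=merged.vcf.gz

# Merge two files into one multi-sample file without combining
# records that have differing alleles, writing compressed VCF.
merge_pair() {
    bcftools merge -m none -O z "$1" "$2" > "$3"
}

# Run only when the tool and both inputs are actually present.
if command -v bcftools >/dev/null 2>&1 && [ -f "$a" ] && [ -f "$b" ]; then
    merge_pair "$a" "$b" "$merged"
else
    echo "bcftools or input files not found; nothing done" >&2
fi
```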
Left-align and normalize indels to the shortest possible representation.
About: Left-align and normalize indels.
Usage: bcftools norm [options] -f <ref.fa> <in.vcf.gz>
Options:
    -D, --remove-duplicates      remove duplicate lines of the same type. [Todo: merge genotypes, don't just throw away.]
    -f, --fasta-ref <file>       reference sequence
    -O, --output-type <type>     'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v]
    -r, --regions <file|reg>     restrict to comma-separated list of regions or regions listed in a file, see man page for details
    -w, --win <int,int>          alignment window and buffer window [50,1000]
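The usage above could be exercised as in the following sketch, which normalizes indels against a reference and drops duplicate lines of the same type. The file names and the function normalize_indels are hypothetical; the reference must be the same one the variants were called against.

```shell
# Hypothetical file names; adjust to your data.
ref=ref.fa
input=in.vcf.gz
output=in.norm.vcf.gz

# Left-align and normalize indels (-f gives the reference),
# remove duplicate lines (-D) and write compressed VCF (-O z).
normalize_indels() {
    bcftools norm -D -f "$1" -O z "$2" > "$3"
}

# Run only when the tool and both inputs are actually present.
if command -v bcftools >/dev/null 2>&1 && [ -f "$ref" ] && [ -f "$input" ]; then
    normalize_indels "$ref" "$input" "$output"
else
    echo "bcftools or input files not found; nothing done" >&2
fi
```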
Fast alternative to vcf-query
About: Extracts fields from VCF/BCF file and prints them in user-defined format
Usage: bcftools query [options] <A.vcf.gz> [<B.vcf.gz> [...]]
Options:
    -a, --annots <list>           alias for -f '%CHROM\t%POS\t%MASK\t%REF\t%ALT\t%TYPE\t' + tab-separated <list> of tags
    -c, --collapse <string>       collapse lines with duplicate positions for <snps|indels|both|all|some|none>, see man page [none]
    -f, --format <string>         learn by example, see below
    -H, --print-header            print header
    -l, --list-samples            print the list of samples and exit
    -r, --regions <reg|file>      restrict to comma-separated list of regions or regions listed in a file, see man page for details
    -t, --targets <reg|file>      similar to -r but streams rather than index-jumps, see man page for details
    -s, --samples <list|:file>    comma-separated list of samples to include or one name per line in a file
    -v, --vcf-list <file>         process multiple VCFs listed in the file
Expressions:
    %CHROM       The CHROM column (similarly also other columns, such as POS, ID, QUAL, etc.)
    %INFO/TAG    Any tag in the INFO column
    %TYPE        Variant type (REF, SNP, MNP, INDEL, OTHER)
    %MASK        Indicates presence of the site in other files (with multiple files)
    %TAG{INT}    Curly brackets to subscript vectors (0-based)
    []           The brackets loop over all samples
    %GT          Genotype (e.g. 0/1)
    %TGT         Translated genotype (e.g. C/A)
    %LINE        Prints the whole line
    %SAMPLE      Sample name
Examples:
    bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%SAMPLE=%GT]\n' file.vcf.gz
Formerly known as vcfcheck. Extract stats from a VCF/BCF file or compare two VCF/BCF files. The resulting text file can be plotted using plot-vcfstats.
bcftools stats file.vcf.gz > file.vchk
plot-vcfstats file.vchk -p plots/
About: Parses VCF or BCF and produces stats which can be plotted using plot-vcfstats.
       When two files are given, the program generates separate stats for intersection
       and the complements.
Usage: bcftools stats [options] <A.vcf.gz> [<B.vcf.gz>]
Options:
    -1, --1st-allele-only                include only 1st allele at multiallelic sites
    -c, --collapse <string>              treat as identical records with <snps|indels|both|all|some|none>, see man page for details [none]
    -d, --depth <int,int,int>            depth distribution: min,max,bin size [0,500,1]
        --debug                          produce verbose per-site and per-sample output
    -e, --exons <file.gz>                tab-delimited file with exons for indel frameshifts (chr,from,to; 1-based, inclusive, bgzip compressed)
    -f, --apply-filters <list>           require at least one of the listed FILTER strings (e.g. "PASS,.")
    -F, --fasta-ref <file>               faidx indexed reference sequence file to determine INDEL context
    -i, --split-by-ID                    collect stats for sites with ID separately (known vs novel)
    -r, --regions <reg|file>             restrict to comma-separated list of regions or regions listed in a file, see man page for details
    -s, --samples <list|:file>           produce sample stats, "-" to include all samples
    -t, --targets <reg|file>             similar to -r but streams rather than index-jumps, see man page for details
    -u, --user-tstv <TAG[:min:max:n]>    collect Ts/Tv stats for any tag using the given binning [0:1:100]
This versatile tool can be used for subsetting by sample or position and for flexible fixed-threshold filtering.
About: VCF/BCF conversion, view, subset and filter VCF/BCF files.
Usage: bcftools view [options] <in.vcf.gz> [region1 [...]]
Output options:
    -G, --drop-genotypes                       drop individual genotype information (after subsetting if -s option set)
    -h/H, --header-only/--no-header            print the header only/suppress the header in VCF output
    -l, --compression-level [0-9]              compression level: 0 uncompressed, 1 best speed, 9 best compression [-1]
    -o, --output-file <file>                   output file name [stdout]
    -O, --output-type <b|u|z|v>                b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]
    -r, --regions <reg|file>                   restrict to comma-separated list of regions or regions in a file, see man page for details
    -t, --targets <reg|file>                   similar to -r but streams rather than index-jumps, see man page for details
Subset options:
    -a, --trim-alt-alleles                     trim alternate alleles not seen in the subset
    -I, --no-update                            do not (re)calculate INFO fields for the subset (currently INFO/AC and INFO/AN)
    -s, --samples STR/FILE                     list of samples (FILE or comma separated list STR) [null]
Filter options:
    -c/C, --min-ac/--max-ac <int>[:<type>]     minimum/maximum count for non-reference (nref), 1st alternate (alt1) or minor (minor) alleles [nref]
    -f, --apply-filters <list>                 require at least one of the listed FILTER strings (e.g. "PASS,.")
    -i/e, --include/--exclude <expr>           select/exclude sites for which the expression is true (see below for details)
    -k/n, --known/--novel                      select known/novel sites only (ID is not/is '.')
    -m/M, --min-alleles/--max-alleles <int>    minimum/maximum number of alleles listed in REF and ALT (e.g. -m2 -M2 for biallelic sites)
    -p/P, --phased/--exclude-phased            select/exclude sites where all samples are phased/not all samples are phased
    -q/Q, --min-af/--max-af <float>[:<type>]   minimum/maximum frequency for non-reference (nref), 1st alternate (alt1) or minor (minor) alleles [nref]
    -u/U, --uncalled/--exclude-uncalled        select/exclude sites without a called genotype
    -v/V, --types/--exclude-types <list>       select/exclude comma-separated list of variant types: snps,indels,mnps,other [null]
    -x/X, --private/--exclude-private          select/exclude sites where the non-reference alleles are exclusive (private) to the subset samples
Filter expressions may contain:
    - arithmetic operators: +,*,-,/
    - logical operators: && (same as &), || (same as |)
    - comparison operators: == (same as =), >, >=, <=, <, !=
    - parentheses: (, )
    - array subscripts (e.g. AC[0]>=10)
    - double quotes for string values (e.g. %FILTER="PASS")
    - 1 (or 0) for testing the presence (or absence) of a flag (e.g. FlagA=1 && FlagB=0)
    - TAG or INFO/TAG for INFO values (e.g. DP<800 or INFO/DP<800)
    - %QUAL, %FILTER, etc. for column names (note: currently only some columns are supported)
    - %TYPE for variant type, such as %TYPE="indel"|"snp"|"mnp"|"other"
    - %FUNC(TAG) where FUNC is one of MAX, MIN, AVG and TAG is one of the FORMAT fields (e.g. %MIN(DV)>5)
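The subsetting and filtering options above can be combined in a single call. The sketch below keeps two samples and restricts the output to biallelic SNPs. The sample names, file names and the function subset_biallelic_snps are hypothetical, and the option letters follow the usage text above, so they may differ in other bcftools releases.

```shell
# Hypothetical sample and file names; adjust to your data.
samples=NA00001,NA00002
input=in.vcf.gz
output=subset.vcf.gz

# Keep the listed samples (-s), biallelic sites only (-m2 -M2),
# SNPs only (-v snps), and write compressed VCF to the named file (-O z -o).
subset_biallelic_snps() {
    bcftools view -s "$1" -m2 -M2 -v snps -O z -o "$3" "$2"
}

# Run only when the tool and the input file are actually present.
if command -v bcftools >/dev/null 2>&1 && [ -f "$input" ]; then
    subset_biallelic_snps "$samples" "$input" "$output"
else
    echo "bcftools or $input not found; nothing done" >&2
fi
```

Adding -a would also trim alternate alleles that are no longer seen once the samples have been subset.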