VCFtools

A set of tools written in Perl and C++ for working with VCF files.

The C++ executable module examples

This page provides usage examples for the executable module. Extended documentation for all of the options can be found on the manual page.

Run the program

By default the executable can be found in the bin/ subdirectory. To run the program, type:

./vcftools

The program will return information regarding the version number.

Get basic file statistics

The executable can be run with only an input VCF file and no other options, in which case it returns basic information about the contents of the file. To specify an input file you must use one of the input options (--vcf, --gzvcf, or --bcf), depending on the type of file. For example, for a VCF file called input_data.vcf the following command could be run:

./vcftools --vcf input_data.vcf

It will return information about the file such as the number of variants and the number of individuals in the file.

Beginning with VCFtools v0.1.12, the program can also take input from standard input (stdin). To do this, use any of the normal file type input options followed by the dash - character.

zcat input_data.vcf.gz | ./vcftools --vcf -
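The stdin workflow can be tried end to end on a throwaway file. The following is a minimal sketch: the file tiny.vcf and its three records are made up for illustration, gzip -dc is used as a portable stand-in for zcat, and the vcftools call is skipped when the program is not on the PATH. The grep count gives an independent number to compare against the site count VCFtools reports.

```shell
# Build a tiny three-site VCF (hypothetical data) using printf so the tabs are explicit.
printf '##fileformat=VCFv4.2\n' > tiny.vcf
printf '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n' >> tiny.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n' >> tiny.vcf
printf '1\t1500000\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t0/0\n' >> tiny.vcf
printf '1\t1600000\t.\tC\tT\t50\tPASS\t.\tGT\t0/0\t1/1\n' >> tiny.vcf
printf '2\t300000\t.\tG\tA\t50\tPASS\t.\tGT\t0/1\t0/1\n' >> tiny.vcf
gzip -c tiny.vcf > tiny.vcf.gz

# Count data lines (everything that is not a header line) without VCFtools.
n_sites=$(gzip -dc tiny.vcf.gz | grep -vc '^#')
echo "sites in stream: $n_sites"

# Stream the decompressed file into VCFtools, if it is installed.
if command -v vcftools >/dev/null 2>&1; then
  gzip -dc tiny.vcf.gz | vcftools --vcf -
fi
```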

Applying a filter

You can use VCFtools to filter out variants or individuals based on the values within the file. For example, to filter the sites within a file based upon their location in the genome, use the options --chr, --from-bp, and --to-bp to specify the region.

./vcftools --vcf input_data.vcf --chr 1 --from-bp 1000000 --to-bp 2000000

After running this line, the program will return the number of sites in the file that are included in the chromosomal region chr1:1000000-2000000. These options can be modified to work with any desired region.
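As an independent cross-check of the count VCFtools reports, the same region can be tallied with awk. This is a minimal sketch, not how VCFtools implements the filter: the file region_demo.vcf and its records are made up for illustration, and the sketch assumes that both bounds are inclusive.

```shell
# Hypothetical five-site file: two sites fall inside 1:1000000-2000000.
printf '##fileformat=VCFv4.2\n' > region_demo.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> region_demo.vcf
printf '1\t999999\t.\tA\tG\t.\tPASS\t.\n' >> region_demo.vcf
printf '1\t1000000\t.\tA\tG\t.\tPASS\t.\n' >> region_demo.vcf
printf '1\t1500000\t.\tC\tT\t.\tPASS\t.\n' >> region_demo.vcf
printf '1\t2000001\t.\tG\tA\t.\tPASS\t.\n' >> region_demo.vcf
printf '2\t1500000\t.\tT\tC\t.\tPASS\t.\n' >> region_demo.vcf

# Skip header lines, match the chromosome name as a string,
# and keep positions within the (assumed inclusive) bounds.
in_region=$(awk -F'\t' '
  !/^#/ && $1 == "1" && $2 >= 1000000 && $2 <= 2000000 { n++ }
  END { print n + 0 }
' region_demo.vcf)
echo "sites in region: $in_region"
```

If the two counts disagree, the chromosome naming is worth checking first, since a file that uses chr1 rather than 1 will not match --chr 1.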

Writing to a new VCF file

VCFtools can perform analyses on the variants that pass through the filters or simply write those variants out to a new file. This function is helpful for creating subsets of VCF files or just removing unwanted variants from VCF files. To write out the variants that pass through filters use the --recode option. In addition, use --recode-INFO-all to include all data from the INFO fields in the output. By default INFO fields are not written because many filters will alter the variants in a file, rendering the INFO values incorrect.

./vcftools --vcf input_data.vcf --chr 1 --from-bp 1000000 --to-bp 2000000 --recode --recode-INFO-all

In this example, VCFtools will create a new VCF file containing only variants within the specified chromosomal region while keeping all INFO fields included in the original file.

By default, any files written out by VCFtools are placed in the current working directory and named out.SUFFIX. To change this, specify a new path prefix using the option --out followed by the desired path. The program will add a suffix to that path based on the chosen output function.

./vcftools --vcf input_data.vcf --chr 1 --from-bp 1000000 --to-bp 2000000 --recode --out subset

Writing out to screen

Beginning with VCFtools v0.1.12, the program can also write out to screen instead of having the program write to a specified path. Using the options --stdout or -c will redirect all output to standard out. The output can then be piped into other programs or written out to a specified file name.

./vcftools --vcf input_data.vcf --chr 1 --from-bp 1000000 --to-bp 2000000 --recode --stdout | more

The above example will page the resulting file to the screen for quick inspection of the results.

./vcftools --vcf input_data.vcf --chr 1 --from-bp 1000000 --to-bp 2000000 --recode -c > /home/usr/data/subset.vcf

The above example will redirect the output and write it to the specified file name.

./vcftools --vcf input_data.vcf --chr 1 --from-bp 1000000 --to-bp 2000000 --recode -c | gzip -c > /home/usr/data/subset.vcf.gz

The above example will redirect the output into gzip (assuming it is installed) for compression, and then gzip will write the file to the specified destination.
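After compressing a stream this way, it can be worth confirming that the archive is intact and starts with a VCF header. The sketch below uses a printf stand-in for the VCFtools output and a hypothetical local file name, subset_demo.vcf.gz, so that it runs without VCFtools installed.

```shell
# Stand-in for "vcftools ... --recode -c": two header lines and one record.
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '1\t1500000\t.\tA\tG\t.\tPASS\t.\n'
} | gzip -c > subset_demo.vcf.gz

# gzip -t tests the integrity of the compressed file without extracting it.
if gzip -t subset_demo.vcf.gz; then
  status=ok
else
  status=corrupt
fi

# Peek at the first line; a VCF file should begin with ##fileformat.
first_line=$(gzip -dc subset_demo.vcf.gz | head -n 1)
echo "archive: $status, first line: $first_line"
```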

Converting a VCF file to BCF

Beginning with VCFtools v0.1.11, the program has the ability to read and write BCF files. This means that the program can also convert files between the two formats. This is accomplished in a similar way to the above example, but using the --recode-bcf option instead. All output BCF files are automatically compressed using BGZF.

./vcftools --vcf input_data.vcf --recode-bcf --recode-INFO-all --out converted_output

Comparing two files

Using VCFtools, two VCF files can be compared to determine which sites and individuals are shared between them. The first file is declared using the input file options just like any other output function. The second file must be specified using --diff, --gzdiff, or --diff-bcf. There are also advanced options to determine additional discordance between the two files.

./vcftools --vcf input_data.vcf --diff other_data.vcf --out compare

Getting allele frequency

To determine the frequency of each allele over all individuals in a VCF file, the --freq argument is used.

./vcftools --vcf input_data.vcf --freq --out output

The output file will be written to output.frq.
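To make clear what a per-site allele frequency is, the calculation can be reproduced by hand for a small file. This is a simplified sketch rather than VCFtools' own implementation: it assumes biallelic sites, assumes GT is the first FORMAT key, ignores missing alleles, and writes its own ad hoc columns (chromosome, position, number of called alleles, reference frequency, alternate frequency) to a hypothetical file freq_demo.tsv.

```shell
# Hypothetical two-site, three-sample file; the second site has one missing genotype.
printf '##fileformat=VCFv4.2\n' > freq_demo.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n' >> freq_demo.vcf
printf '1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\t0/0\t1/1\n' >> freq_demo.vcf
printf '1\t200\t.\tC\tT\t.\tPASS\t.\tGT\t0/0\t./.\t0/1\n' >> freq_demo.vcf

awk -F'\t' '
  /^#/ { next }                          # skip header lines
  {
    ref = 0; alt = 0
    for (i = 10; i <= NF; i++) {         # sample columns start at field 10
      split($i, fmt, ":")                # GT is assumed to be the first subfield
      n = split(fmt[1], gt, "[/|]")      # handle unphased (/) and phased (|) calls
      for (j = 1; j <= n; j++) {
        if (gt[j] == "0") ref++
        else if (gt[j] == "1") alt++     # "." (missing) is counted as neither
      }
    }
    total = ref + alt
    if (total > 0)
      printf "%s\t%s\t%d\t%.3f\t%.3f\n", $1, $2, total, ref / total, alt / total
  }
' freq_demo.vcf > freq_demo.tsv

cat freq_demo.tsv
```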

Getting sequencing depth information

Another useful output function summarizes sequencing depth for each individual or for each site. It is invoked in the same way as the allele frequency example above.

./vcftools --vcf input_data.vcf --depth -c > depth_summary.txt

With VCFtools, you can use many combinations of filters and an output function. For example, to write out site-wise sequence depths only at sites that have no missing data, include the --max-missing argument.

./vcftools --vcf input_data.vcf --site-depth --max-missing 1.0 --out site_depth_summary

Getting linkage disequilibrium statistics

Linkage disequilibrium between sites can be determined as well. This is accomplished using the --hap-r2, --geno-r2, or --geno-chisq arguments. Since the program must do pairwise site comparisons, this analysis can be time-consuming, so it is recommended to filter the sites first or use one of the other options (--ld-window, --ld-window-bp or --min-r2) to reduce the number of comparisons. In this example, VCFtools will only compare sites within 50,000 base pairs of one another.

./vcftools --vcf input_data.vcf --hap-r2 --ld-window-bp 50000 --out ld_window_50000
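The benefit of a window follows from simple counting: n sites form n(n-1)/2 unordered pairs, whereas a window limits each site to its near neighbours. The figures below are hypothetical (100,000 sites and an assumed average of 50 downstream sites per 50,000 bp window) and only illustrate the order of magnitude. The arithmetic assumes a shell with 64-bit integers, which is the case for bash.

```shell
n=100000            # hypothetical number of sites in the file
k=50                # assumed average number of downstream sites inside the window

all_pairs=$(( n * (n - 1) / 2 ))   # every unordered pair of sites
window_pairs=$(( n * k ))          # rough upper bound with the window applied

echo "all pairs:      $all_pairs"
echo "windowed pairs: $window_pairs"
```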

Getting Fst population statistics

VCFtools can also calculate Fst statistics between individuals of different populations. It is an estimate calculated in accordance with Weir and Cockerham’s 1984 paper. The user must supply text files that contain lists of individuals (one per line) that are members of each population. The function will work with multiple populations if multiple --weir-fst-pop arguments are used. The following example shows how to perform a per-site Fst calculation with two populations. Other arguments can be used in conjunction with this function, such as --fst-window-size and --fst-window-step.

./vcftools --vcf input_data.vcf --weir-fst-pop population_1.txt --weir-fst-pop population_2.txt --out pop1_vs_pop2
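The population files are plain lists of sample IDs, one per line, and the IDs must match the sample columns of the VCF header. One way to build them is to pull the IDs from the #CHROM line and split them by a naming convention. The sketch below assumes a hypothetical convention in which IDs start with A_ or B_; the file fst_demo.vcf is made up for illustration, and real data will usually need a different rule or a hand-curated list.

```shell
# Hypothetical header with four samples from two populations.
printf '##fileformat=VCFv4.2\n' > fst_demo.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tA_01\tA_02\tB_01\tB_02\n' >> fst_demo.vcf

# Sample IDs are fields 10 onward of the #CHROM header line; print one per line.
awk -F'\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }' \
  fst_demo.vcf > all_samples.txt

# Split by the assumed ID prefix into the two population files.
grep '^A_' all_samples.txt > population_1.txt
grep '^B_' all_samples.txt > population_2.txt

cat population_1.txt population_2.txt
```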

Converting VCF files to PLINK format

VCFtools can convert VCF files into formats convenient for use in other programs. One such example is the ability to convert into PLINK format. The following command will output the variants in .ped and .map files.

./vcftools --vcf input_data.vcf --plink --chr 1 --out output_in_plink