A set of tools written in Perl and C++ for working with VCF files.
VCFtools contains a Perl API (Vcf.pm) and a number of Perl scripts that can be used to perform common tasks with VCF files such as file validation, file merging, intersecting, complements, etc. The Perl tools support all versions of the VCF specification (3.2, 3.3, 4.0, 4.1 and 4.2), nevertheless, the users are encouraged to use the latest versions VCFv4.1 or VCFv4.2. The VCFtools in general have been used mainly with diploid data, but the Perl tools aim to support polyploid data as well. Run any of the Perl scripts with the --help switch to obtain more help.
Many of the Perl scripts require that the VCF files are compressed by bgzip and indexed by tabix (both tools are part of the htslib/tabix package, available for download here). The VCF files can be compressed and indexed using the following commands
bgzip my_file.vcf
tabix -p vcf my_file.vcf.gz
Fill or recalculate AN and AC INFO fields.
zcat file.vcf.gz | fill-an-ac | bgzip -c > out.vcf.gz
Usage: fill-an-ac [OPTIONS] < in.vcf >out.vcf Options: -h, -?, --help This help message.
Annotates the VCF file with flanking sequence (INFO/FS tag) masking known variants with N's. Useful for designing primers.
fill-fs -r /path/to/refseq.fa | vcf-query '%CHROM\t%POS\t%INFO/FS\n' > out.tab
About: Annotate VCF with flanking sequence (INFO/FS tag) Usage: fill-fs [OPTIONS] file.vcf Options: -b, --bed-mask <file> Regions to mask (tabix indexed), multiple files can be given -c, --cluster <int> Do self-masking of clustered variants within this range. -l, --length <int> Flanking sequence length [100] -m, --mask-char <char|lc> The character to use or "lc" for lowercase. This option must preceed -b, -v or -c in order to take effect. With multiple files works as a switch on the command line, see the example below [N] -r, --refseq <file> The reference sequence. -v, --vcf-mask <file> Mask known variants in the flanking sequence, multiple files can be given (tabix indexed) -h, -?, --help This help message. Example: # Mask variants from the VCF file with N's and use lowercase for the bed file regions fill-fs file.vcf -v mask.vcf -m lc -b mask.bed
Fill missing reference info and sequence MD5s into VCF header.
fill-ref-md5 -i "SP:Homo\ Sapiens" -r ref.fasta in.vcf.gz -d ref.dict out.vcf.gz
About: The script computes MD5 sum of the reference sequence and inserts 'reference' and 'contig' tags into header as recommended by VCFv4.1. The VCF file must be compressed and tabix indexed, as it takes advantage of the lightning fast tabix reheader functionality. Usage: fill-ref-md5 [OPTIONS] in.vcf.gz out.vcf.gz Options: -d, --dictionary <file> Where to read/write computed MD5s. Opened in append mode, existing records are not touched. -i, --info <AS:xx,SP:xx,TX:xx> Optional info on reference assembly (AS), species (SP), taxonomy (TX) -r, --refseq <file> The reference sequence in fasta format indexed by samtools faidx -h, -?, --help This help message. Examples: fill-ref-md5 -i AS:NCBIM37,SP:"Mus\ Musculus" -r NCBIM37_um.fa -d NCBIM37_um.fa.dict in.vcf.gz out.vcf.gz
Fill missing rsIDs. This script has been discontinued, please use vcf-annotate instead.
The script adds or removes filters and custom annotations to VCF files. To add custom annotations to VCF files, create TAB delimited file with annotations such as
#CHR FROM TO ANNOTATION 1 12345 22345 gene1 1 67890 77890 gene2
Compress the file (using bgzip annotations), index (using tabix -s 1 -b 2 -e 3 annotations.gz) and run
cat in.vcf | vcf-annotate -a annotations.gz \
-d key=INFO,ID=ANN,Number=1,Type=Integer,Description='My custom annotation' \
-c CHROM,FROM,TO,INFO/ANN > out.vcf
The script is also routinely used to apply filters. There are a number of predefined filters and custom filters can be easily added, see vcf-annotate -h for examples. Some of the predefined filters take advantage of tags added by bcftools, the descriptions of the most frequently asked ones follow:
Note: A fast htslib C version of this tool is now available (see bcftools annotate).
About: Annotates VCF file, adding filters or custom annotations. Requires tabix indexed file with annotations. Currently it can annotate ID, QUAL, FILTER and INFO columns, but will be extended on popular demand. For examples of user-defined filters see online documentation or examples/filters.txt in vcftools distribution. Usage: cat in.vcf | vcf-annotate [OPTIONS] > out.vcf Options: -a, --annotations <file.gz> The tabix indexed file with the annotations: CHR\tFROM[\tTO][\tVALUE]+. -c, --columns <list> The list of columns in the annotation file, e.g. CHROM,FROM,TO,-,QUAL,INFO/STR,INFO/GN. The dash in this example indicates that the third column should be ignored. If TO is not present, it is assumed that TO equals to FROM. When REF and ALT columns are present, only matching lines are annotated. -d, --description <file|string> Header annotation, e.g. key=INFO,ID=HM2,Number=0,Type=Flag,Description='HapMap2 membership'. The descriptions can be read from a file, one annotation per line. --fill-AC-AN (Re)Calculate AC and AN tags --fill-HWE (Re)Calculate HWE, AC and AN tags --fill-ICF (Re)Calculate Inbreeding Coefficient F, HWE, AC and AN --fill-type Annotate INFO/TYPE with snp,del,ins,mnp,complex -f, --filter <file|list> Apply filters, list is in the format flt1=value/flt2/flt3=value/etc. If argument to -f is a file, user-defined filters be applied. See User Defined Filters below. -H, --hard-filter Remove lines with FILTER anything else than PASS or "." -n, --normalize-alleles Make REF and ALT alleles more compact if possible (e.g. TA,TAA -> T,TA). -r, --remove <list> Comma-separated list of tags to be removed (e.g. ID,INFO/DP,FORMAT/DP,FILTER). -h, -?, --help This help message. Filters: + Apply all filters with default values (can be overriden, see the example below). -X Exclude the filter X 1, StrandBias FLOAT Min P-value for strand bias (INFO/PV4) [0.0001] 2, BaseQualBias FLOAT Min P-value for baseQ bias (INFO/PV4) [0] 3, MapQualBias FLOAT Min P-value for mapQ bias (INFO/PV4) [0] 4, EndDistBias FLOAT Min P-value for end distance bias (INFO/PV4) [0.0001] a, MinAB INT Minimum number of alternate bases (INFO/DP4) [2] c, SnpCluster INT1,INT2 Filters clusters of 'INT1' or more SNPs within a run of 'INT2' bases [] D, MaxDP INT Maximum read depth (INFO/DP or INFO/DP4) [10000000] d, MinDP INT Minimum read depth (INFO/DP or INFO/DP4) [2] H, HWE FLOAT Minimum P-value for HWE and F<0 (invokes --fill-HWE) [] H2, HWE2 FLOAT Minimum P-value for HWE (plus F<0) (INFO/AC and INFO/AN or --fill-AC-AN) [] HG, HWE_G3 FLOAT Minimum P-value for HWE and F<0 (INFO/HWE and INFO/G3) [] q, MinMQ INT Minimum RMS mapping quality for SNPs (INFO/MQ) [10] Q, Qual INT Minimum value of the QUAL field [10] r, RefN Reference base is N [] v, VDB FLOAT Minimum Variant Distance Bias (INFO/VDB) [0] W, GapWin INT Window size for filtering adjacent gaps [3] w, SnpGap INT SNP within INT bp around a gap to be filtered [10] Examples: zcat in.vcf.gz | vcf-annotate -a annotations.gz -d descriptions.txt -c FROM,TO,CHROM,ID,INFO/DP | bgzip -c >out.vcf.gz zcat in.vcf.gz | vcf-annotate -f +/-a/c=3,10/q=3/d=5/-D -a annotations.gz -d key=INFO,ID=GN,Number=1,Type=String,Description='Gene Name' | bgzip -c >out.vcf.gz zcat in.vcf.gz | vcf-annotate -a dbSNPv132.tab.gz -c CHROM,POS,REF,ALT,ID,-,-,- | bgzip -c >out.vcf.gz zcat in.vcf.gz | vcf-annotate -r FILTER/MinDP | bgzip -c >out.vcf.gz Where descriptions.txt contains: key=INFO,ID=GN,Number=1,Type=String,Description='Gene Name' key=INFO,ID=STR,Number=1,Type=Integer,Description='Strand' The file dbSNPv132.tab.gz with dbSNP IDs can be downloaded from ftp://ftp.sanger.ac.uk/pub/1000genomes/pd3/dbSNP/
# Examples of user-defined filters. Edit and run with -f filters.txt. # The examples below are self-explanatory. Notice the use of the predefined # variables ($PASS, $FAIL, $MATCH, $RECORD) and methods (error). # In this example, a minimum value of AF1=0.1 is required { tag => 'INFO/AF1', # The VCF tag to apply this filter on name => 'MinAF', # The filter ID desc => 'Minimum AF1 [0.01]', # Description for the VCF header test => sub { return $MATCH < 0.01 ? $FAIL : $PASS }, }, # Filter all indels (presence of INDEL tag is tested) { tag => 'INFO/INDEL', apply_to => 'indels', # Can be one of SNPs, indels, all. Default: [All] name => 'Indel', desc => 'INDEL tag present', test => sub { return $FAIL }, }, # Only loci with enough reads supporting the variant will pass the filter { tag => 'INFO/DP4', name => 'FewAlts', desc => 'Too few reads supporting the variant', apply_to => 'SNPs', test => sub { if ( !($MATCH =~ /^([^,]+),([^,]+),([^,]+),(.+)$/) ) { error("Could not parse INFO/DP4: $CHROM:$POS [$MATCH]"); } if ( 0.1*($1+$2) > $3+$4 ) { return $PASS; } return $FAIL; }, }, # Example of filtering based on genotype columns and the QUAL column { tag => 'FORMAT/PL', name => 'NoHets', desc => 'Inbred homozygous mouse, no hets expected', apply_to => 'SNPs', test => sub { for my $pl (@$MATCH) { my @pls = split(/,/,$pl); if ( $pls[1]<$pls[0] && $pls[1]<$pls[2] ) { return $FAIL; } } return $PASS; }, }, # This example splits the four PV4 values into four tags names PV0, PV1, PV2 and PV3. # Note the use of the 'header' key, and the $RECORD and $VCF variables. { header => [ qq[key=INFO,ID=PV0,Number=1,Type=Float,Description="P-value for strand bias"], qq[key=INFO,ID=PV1,Number=1,Type=Float,Description="P-value for baseQ bias"], qq[key=INFO,ID=PV2,Number=1,Type=Float,Description="P-value for mapQ bias"], qq[key=INFO,ID=PV3,Number=1,Type=Float,Description="P-value for tail distance bias"] ], tag => 'INFO/PV4', name => 'SplitPV4', desc => 'Split PV4', apply_to => 'all', test => sub { my @vals = split(/,/,$MATCH); $$RECORD[7] = $VCF->add_info_field($$RECORD[7],'PV0'=>$vals[0],'PV1'=>$vals[1],'PV2'=>$vals[2],'PV3'=>$vals[3]); return $PASS; }, }, # Do whatever you want with every record and edit it according to your needs. This silly # example removes the tag SILLY in records where ID is set and depth is bigger than 5. { tag => 'Dummy', test => sub { if ( $$RECORD[2] eq '.' ) { return $PASS; } # Modify only lines with ID my $dp = $vcf->get_info_field($$RECORD[7],'DP'); if ( $dp>5 ) { $$RECORD[7] = $VCF->add_info_field($$RECORD[7],'SILLY'=>undef); } return $PASS; }, } # Filter records with the value XY absent or not equal to 42 { tag => 'Dummy', header => [ qq[key=FILTER,ID=XY,Description="XY not OK"], ], test => sub { my $xy = $VCF->get_info_field($$RECORD[7],'XY'); my $is_bad = ( !defined $xy or $xy!=42 ) ? 1 : 0; $$RECORD[6] = $VCF->add_filter($$RECORD[6],'XY'=>$is_bad); return $PASS; }, }, # Annotate INFO field with SINGLETON flag when one and only one sample is different from the reference { header => [ qq[key=INFO,ID=SINGLETON,Number=0,Type=Flag,Description="Only one non-ref sample"], ], tag => 'FORMAT/GT', name => 'Dummy', desc => 'Dummy', test => sub { my $nalt = 0; for my $gt (@$MATCH) { my @gt = $VCF->split_gt($gt); for my $allele (@gt) { if ( $allele ne 0 && $allele ne '.' ) { $nalt++; last; } } if ( $nalt>1 ) { last; } } if ( $nalt==1 ) { $$RECORD[7] = $VCF->add_info_field($$RECORD[7],'SINGLETON'=>''); } return $PASS; }, }, # Set genotypes to unknown ("." or "./." depending on ploidy) when coverage is low (by Shane McCarthy). { tag => 'FORMAT/DP', name => 'MinSampleDP', desc => 'Genotypes set to . for samples with DP < 2', apply_to => 'all', test => sub { my $i = 8; for my $dp (@$MATCH) { $i++; next unless ($dp<2); my @format = split(/:/,$$RECORD[$i]); $format[0] = $format[0] =~ /\// ? "./." : "."; $$RECORD[$i] = join(":",@format); } return $PASS; }, },
Compares positions in two or more VCF files and outputs the numbers of
positions contained in one but not the other files; two but not the other files, etc, which
comes handy when generating Venn diagrams. The script also computes numbers
such as nonreference discordance rates (including multiallelic sites), compares
actual sequence (useful when comparing indels), etc.
vcf-compare -H A.vcf.gz B.vcf.gz C.vcf.gz
Note: A fast htslib C version of this tool is now available (see bcftools stats).
About: Compare bgzipped and tabix indexed VCF files. (E.g. bgzip file.vcf; tabix -p vcf file.vcf.gz) Usage: vcf-compare [OPTIONS] file1.vcf file2.vcf ... vcf-compare -p plots chr1.cmp chr2.cmp ... Options: -a, --apply-filters Ignore lines where FILTER column is anything else than PASS or '.' -c, --chromosomes <list|file> Same as -r, left for backward compatibility. Please do not use as it will be dropped in the future. -d, --debug Debugging information. Giving the option multiple times increases verbosity -g, --cmp-genotypes Compare genotypes, not only positions --ignore-indels Exclude sites containing indels from genotype comparison -m, --name-mapping <list|file> Use with -g when comparing files with differing column names. The argument to this options is a comma-separated list or one mapping per line in a file. The names are colon separated and must appear in the same order as the files on the command line. --INFO <string> [<int>] Calculate genotype errors by INFO. Use zero based indecies if field has more than one value. Can be given multiple times. -p, --plot <prefix> Create plots. Multiple files (e.g. per-chromosome outputs from vcf-compare) can be given. -R, --refseq <file> Compare the actual sequence, not just positions. Use with -w to compare indels. -r, --regions <list|file> Process the given regions (comma-separated list or one region per line in a file). -s, --samples <list|file> Process only the listed samples. Excluding unwanted samples may increase performance considerably. -t, --title <string> Title for graphs (see also -p) -w, --win <int> In repetitive sequences, the same indel can be called at different positions. Consider records this far apart as matching (be it a SNP or an indel). -h, -?, --help This help message.
Concatenates VCF files (for example split by chromosome). Note that the input and output VCFs will have the same number of columns, the script does not merge VCFs by position (see also vcf-merge).
In the basic mode it does not do anything fancy except for a sanity check that all files have the same columns. When run with the -s option, it will perform a partial merge sort, looking at limited number of open files simultaneously.
vcf-concat A.vcf.gz B.vcf.gz C.vcf.gz | gzip -c > out.vcf.gz
About: Convenience tool for concatenating VCF files (e.g. VCFs split by chromosome). In the basic mode it does not do anything fancy except for a sanity check that all files have the same columns. When run with the -s option, it will perform a partial merge sort, looking at limited number of open files simultaneously. Usage: vcf-concat [OPTIONS] A.vcf.gz B.vcf.gz C.vcf.gz > out.vcf Options: -c, --check-columns Do not concatenate, only check if the columns agree. -f, --files <file> Read the list of files from a file. -p, --pad-missing Write '.' in place of missing columns. Useful for joining chrY with the rest. -s, --merge-sort <int> Allow small overlaps in N consecutive files. -h, -?, --help This help message.
Apply VCF variants to a fasta file to create consensus sequence.
cat ref.fa | vcf-consensus file.vcf.gz > out.fa
Usage: cat ref.fa | vcf-consensus [OPTIONS] in.vcf.gz > out.fa Options: -h, -?, --help This help message. -H, --haplotype <int> Apply only variants for the given haplotype (1,2) -i, --iupac-codes Apply variants in the form of IUPAC ambiguity codes -s, --sample <name> If not given, all variants are applied Examples: # Get the consensus for one region. The fasta header lines are then expected # in the form ">chr:from-to". samtools faidx ref.fa 8:11870-11890 | vcf-consensus in.vcf.gz > out.fa
Convert between VCF versions, currently from VCFv3.3 to VCFv4.0.
zcat file.vcf.gz | vcf-convert -r reference.fa > out.vcf
About: Convert between VCF versions. Usage: cat in.vcf | vcf-convert [OPTIONS] > out.vcf Options: -r, --refseq <file> The reference sequence in samtools faindexed fasta file. (Not required with SNPs only.) -v, --version <string> 4.0, 4.1, 4.2 -h, -?, --help This help message.
A tool for finding differences between groups of samples, useful in trio analysises, cancer genomes etc.
In the example below variants with average mapping quality of 30 (-f MinMQ=30) and minimum depth of 10 (-d 10) are considered. Only novel alleles are reported (-n). Then vcf-query is used to extract the INFO/NOVEL* annotations into a table. Finally the sites are sorted by confidence of the site being different in the child (-k5,5nr).
vcf-annotate -f MinMQ=30 file.vcf | vcf-contrast -n +Child -Mother,Father -d 10 -f | vcf-query -f '%CHROM %POS\t%INFO/NOVELTY\t%INFO/NOVELAL\t%INFO/NOVELGT[\t%SAMPLE %GTR %PL]\n' | sort -k3,3nr | head
About: Finds differences amongst samples adding NOVELGT, NOVELAL and NOVELTY annotations to INFO field. Note that haploid genotypes are internally treated as homozygous diploid genotypes, therefore "0/1" and "1" are considered different genotypes. Usage: vcf-contrast +<list> -<list> [OPTIONS] file.vcf.gz Options: +<list> List of samples where unique variant is expected -<list> List of background samples -d, --min-DP <int> Minimum depth across all -<list> samples -f, --apply-filters Skip sites with FILTER column different from PASS or "." -n, --novel-sites Print only records with novel genotypes -h, -?, --help This help message. Example: # Test if any of the samples A,B is different from all C,D,E vcf-contrast +A,B -C,D,E -m file.vcf.gz # Same as above but printing only sites with novel variants and table output vcf-contrast -n +A,B -C,D,E -m file.vcf.gz | vcf-query -f '%CHROM %POS\t%INFO/NOVELTY\t%INFO/NOVELAL\t%INFO/NOVELGT[\t%SAMPLE %GTR %PL]\n' # Similar to above but require minimum mapping quality of 20 vcf-annotate -f MinMQ=20 file.vcf.gz | vcf-contrast +A,B,C -D,E,F -f
Please take a look at vcf-annotate and bcftools view which does what you are looking for. Apologies for the non-intuitive naming.
Note: A fast HTSlib C version of a filtering tool is now available (see bcftools filter and
bcftools view).
Fixes diploid vs haploid genotypes on sex chromosomes, including the pseudoautosomal regions.
About: Reads in a VCF file with any (commonly used) newline representation and outputs with the current system's newline representation. Usage: vcf-fix-newlines [OPTIONS] Options: -i, --info Report if the file is consistent with the current platform based. -h, -?, --help This help message. Example: vcf-fix-newlines -i file.vcf vcf-fix-newlines file.vcf.gz > out.vcf cat file.vcf | vcf-fix-newlines > out.vcf
Fixes diploid vs haploid genotypes on sex chromosomes, including the pseudoautosomal regions.
Usage: cat broken.vcf | vcf-fix-ploidy [OPTIONS] > fixed.vcf Options: -a, --assumed-sex <sex> M or F, required if the list is not complete in -s -l, --fix-likelihoods Add or remove het likelihoods (not the default behaviour) -p, --ploidy <file> Ploidy definition. The default is shown below. -s, --samples <file> List of sample sexes (sample_name [MF]). -h, -?, --help This help message. Default ploidy definition: ploidy => { X => [ # The pseudoautosomal regions 60,001-2,699,520 and 154,931,044-155,270,560 with the ploidy 2 { from=>1, to=>60_000, M=>1 }, { from=>2_699_521, to=>154_931_043, M=>1 }, ], Y => [ # No chrY in females and one copy in males { from=>1, to=>59_373_566, M=>1, F=>0 }, ], MT => [ # Haploid MT in males and females { from=>1, to => 16_569, M=>1, F=>1 }, ], }
Calculate in-frame ratio.
Note: A fast htslib C version of this tool is now available (see bcftools stats).
About: Currently calculates in-frame ratio. Usage: vcf-indel-stats [OPTIONS] < in.vcf > out.txt Options: -h, -?, --help This help message. -e, --exons <file> Tab-separated file with exons (chr,from,to; 1-based, inclusive) -v, --verbose
Creates intersections and complements of two or more VCF files. Given multiple VCF files, it can output the list of positions which are shared by at least N files, at most N files, exactly N files, etc. The first example below outputs positions shared by at least two files and the second outputs positions present in the files A but absent from files B and C.
vcf-isec -n +2 A.vcf.gz B.vcf.gz | bgzip -c > out.vcf.gz
vcf-isec -c A.vcf.gz B.vcf.gz C.vcf.gz | bgzip -c > out.vcf.gz
Note: A fast htslib C version of this tool is now available (see bcftools isec).
About: Create intersections, unions, complements on bgzipped and tabix indexed VCF or tab-delimited files. Note that lines from all files can be intermixed together on the output, which can yield unexpected results. Usage: vcf-isec [OPTIONS] file1.vcf file2.vcf ... Options: -a, --apply-filters Ignore lines where FILTER column is anything else than PASS or '.' -c, --complement Output positions present in the first file but missing from the other files. -d, --debug Debugging information -f, --force Continue even if the script complains about differing columns, VCF versions, etc. -o, --one-file-only Print only entries from the left-most file. Without -o, all unique positions will be printed. -n, --nfiles [+-=]<int> Output positions present in this many (=), this many or more (+), or this many or fewer (-) files. -p, --prefix <path> If present, multiple files will be created with all possible isec combinations. (Suitable for Venn Diagram analysis.) -r, --regions <list|file> Do only the given regions (comma-separated list or one region per line in a file). -t, --tab <chr:pos:file> Tab-delimited file with indexes of chromosome and position columns. (1-based indexes) -w, --win <int> In repetitive sequences, the same indel can be called at different positions. Consider records this far apart as matching (be it a SNP or an indel). -h, -?, --help This help message. Examples: bgzip file.vcf; tabix -p vcf file.vcf.gz bgzip file.tab; tabix -s 1 -b 2 -e 2 file.tab.gz
Merges two or more VCF files into one so that, for example, if two source files had one column each, on output will be printed a file with two columns. See also vcf-concat for concatenating VCFs split by chromosome.
vcf-merge A.vcf.gz B.vcf.gz C.vcf.gz | bgzip -c > out.vcf.gz
Note that this script is not intended for concatenating VCF files. For this, use
vcf-concat instead.
Note: A fast htslib C version of this tool is now available (see bcftools merge).
About: Merges VCF files by position, creating multi-sample VCFs from fewer-sample VCFs. The tool requires bgzipped and tabix indexed VCF files on input. (E.g. bgzip file.vcf; tabix -p vcf file.vcf.gz) If you need to concatenate VCFs (e.g. files split by chromosome), look at vcf-concat instead. Usage: vcf-merge [OPTIONS] file1.vcf file2.vcf.gz ... > out.vcf Options: -c, --collapse <snps|indels|both|any|none> treat as identical sites with differing alleles [any] -d, --remove-duplicates If there should be two consecutive rows with the same chr:pos, print only the first one. -H, --vcf-header <file> Use the provided VCF header -h, -?, --help This help message. -r, --regions <list|file> Do only the given regions (comma-separated list or one region per line in a file). -R, --ref-for-missing <string> Use the REF allele instead of the default missing genotype. Because it is not obvious what ploidy should be used, a user-defined string is used instead (e.g. 0/0). -s, --silent Try to be a bit more silent, no warnings about duplicate lines. -t, --trim-ALTs If set, redundant ALTs will be removed
Concatenates multiple overlapping VCFs preserving phasing.
About: The script takes multiple overlapping pre-phased chunks and concatenates them into one VCF using heterozygous calls from the overlaps to determine correct phase. Usage: vcf-phased-join [OPTIONS] A.vcf B.vcf C.vcf Options: -j, --min-join-quality <num> Quality threshold for gluing the pre-phased blocks together [10] -l, --list <file> List of VCFs to join. -o, --output <file> Output file name. When "-" is supplied, STDOUT and STDERR will be used -q, --min-PQ <num> Break pre-phased segments if PQ value is lower in input VCFs [0.6] -h, -?, --help This help message
Powerful tool for converting VCF files into format defined by the user. Supports retrieval of subsets of positions, columns and fields.
vcf-query file.vcf.gz 1:10327-10330
vcf-query file.vcf -f '%CHROM:%POS %REF %ALT [ %DP]\n'
Note: A fast htslib C version of this tool is now available (see bcftools query).
Usage: vcf-query [OPTIONS] file.vcf.gz Options: -c, --columns <file|list> List of comma-separated column names or one column name per line in a file. -f, --format <string> The default is '%CHROM:%POS\t%REF[\t%SAMPLE=%GT]\n' -l, --list-columns List columns. -r, --region chr:from-to Retrieve the region. (Runs tabix.) --use-old-method Use old version of API, which is slow but more robust. -h, -?, --help This help message. Expressions: %CHROM The CHROM column (similarly also other columns) %GT Translated genotype (e.g. C/A) %GTR Raw genotype (e.g. 0/1) %INFO/TAG Any tag in the INFO column %LINE Prints the whole line %SAMPLE Sample name [] The brackets loop over all samples %*<A><B> All format fields printed as KEY<A>VALUE<B> Examples: vcf-query file.vcf.gz 1:1000-2000 -c NA001,NA002,NA003 vcf-query file.vcf.gz -r 1:1000-2000 -f '%CHROM:%POS\t%REF\t%ALT[\t%SAMPLE:%*=,]\n' vcf-query file.vcf.gz -f '[%GT\t]%LINE\n' vcf-query file.vcf.gz -f '[%GT\ ]%LINE\n' vcf-query file.vcf.gz -f '%CHROM\_%POS\t%INFO/DP\t%FILTER\n' Notes: Please use `bcftools query` instead, this script will not be supported in future.
Reorder columns
vcf-shuffle-cols -t template.vcf.gz file.vcf.gz > out.vcf
About: Reorder columns to match the order in the template VCF. Usage: vcf-shuffle-cols [OPTIONS] -t template.vcf.gz file.vcf.gz > out.vcf Options: -t, --template <file> The file with the correct order of the columns. -h, -?, --help This help message.
Sort a VCF file.
vcf-sort file.vcf.gz
Usage: vcf-sort > out.vcf cat file.vcf | vcf-sort > out.vcf Options: -c, --chromosomal-order Use natural ordering (1,2,10,MT,X) rather then the default (1,10,2,MT,X). This requires new version of the unix "sort" command which supports the --version-sort option. -p, --parallel <int> Change the number of sorts run concurrently to <int> -t, --temporary-directory Use a directory other than /tmp as the temporary directory for sorting. -h, -?, --help This help message.
Outputs some basic statistics: the number of SNPs, indels, etc.
vcf-stats file.vcf.gz
Note: A fast htslib C version of this tool is now available (see bcftools stats).
Use of qw(...) as parentheses is deprecated at /usr/share/perl5/VcfStats.pm line 428. Use of qw(...) as parentheses is deprecated at /usr/share/perl5/VcfStats.pm line 433. Usage: vcf-stats [OPTIONS] file.vcf.gz Options: -d, --dump <file> Take an existing dump file and recreate the files (works with -p) -f, --filters <filter1,filter2> List of filters such as column/field (any value), column/field=bin:max (cluster in bins),column/field=value (exact value) -p, --prefix <dir/string> Prefix of output files. If slashes are present, directories will be created. -s, --samples <list> Process only the listed samples, - for none. Excluding unwanted samples may increase performance considerably. -h, -?, --help This help message. Examples: # Calculate stats separately for the filter field, quality and non-indels vcf-stats file.vcf.gz -f FILTER,QUAL=10:200,INFO/INDEL=False -p out/ # Calculate stats for all samples vcf-stats file.vcf.gz -f FORMAT/DP=10:200 -p out/ # Calculate stats only for the sample NA00001 vcf-stats file.vcf.gz -f SAMPLE/NA00001/DP=1:200 -p out/ vcf-stats file.vcf.gz > perl.dump
Remove some columns from the VCF file.
vcf-subset -c NA0001,NA0002 file.vcf.gz | bgzip -c > out.vcf.gz
Note: A fast HTSlib C version of this tool is now available (see bcftools view).
Usage: vcf-subset [OPTIONS] in.vcf.gz > out.vcf Options: -a, --trim-alt-alleles Remove alternate alleles if not found in the subset -c, --columns <string> File or comma-separated list of columns to keep in the vcf file. If file, one column per row -e, --exclude-ref Exclude rows not containing variants. -f, --force Proceed anyway even if VCF does not contain some of the samples. -p, --private Print only rows where only the subset columns carry an alternate allele. -r, --replace-with-ref Replace the excluded types with reference allele instead of dot. -t, --type <list> Comma-separated list of variant types to include: ref,SNPs,indels,MNPs,other. -u, --keep-uncalled Do not exclude rows without calls. -h, -?, --help This help message. Examples: cat in.vcf | vcf-subset -r -t indels -e -c SAMPLE1 > out.vcf
A lightweight script for quick calculation of Ts/Tv ratio.
cat file.vcf | vcf-tstv
Note: A fast htslib C version of this tool is now available (see bcftools stats).
Usage: cat file.vcf | vcf-tstv Options: -h, -?, --help This help message.
A simple script which converts the VCF file into a tab-delimited text file listing the actual variants instead of ALT indexes.
zcat file.vcf.gz | vcf-to-tab > out.tab
Usage: vcf-to-tab [OPTIONS] < in.vcf > out.tab Options: -h, -?, --help This help message. -i, --iupac Use one-letter IUPAC codes Notes: Please use `bcftools query` instead, this script will not be supported in future.
vcf-validator file.vcf.gz
Usage: vcf-validator [OPTIONS] file.vcf.gz Options: -d, --duplicates Warn about duplicate positions. -u, --unique-messages Output all messages only once. -h, -?, --help This help message.
For examples how to use the Perl API, it is best to look at some of the simpler scripts, for example vcf-to-tab. The detailed documentation can be obtained by running
perldoc Vcf.pm