Provided by: libvcflib-tools_1.0.9+dfsg1-3build1_amd64 

NAME
vcflib index
DESCRIPTION
vcflib contains tools and libraries for dealing with the Variant Call Format (VCF) which is a flat-file,
tab-delimited textual format intended to describe reference-indexed variations between individuals.
VCF provides a common interchange format for the description of variation in individuals and populations
of samples, and has become the de facto standard reporting format for a wide array of genomic variant
detectors.
vcflib provides methods to manipulate and interpret sequence variation as it can be described by VCF. It
is both:
• an API for parsing and operating on records of genomic variation as it can be described by the VCF for‐
mat,
• and a collection of command-line utilities for executing complex manipulations on VCF files.
The API itself provides a quick and extremely permissive method to read and write VCF files. Extensions
and applications of the library provided in the included utilities (*.cpp) comprise the vast bulk of the
library’s utility for most users.
filter
filter command description
──────────────────────────────────────────────────────────────────────────
vcffilter Filter the specified VCF
file using the given set of
filters.
vcfuniq List unique genotypes. Simi‐
lar to GNU uniq, but aimed at
VCF records. vcfuniq removes
records which have the same
position, ref, and alt as the
previous record on a sorted
VCF file. Note that it does
not adjust/combine genotypes
in the output, but simply
takes the first record. See
also vcfcreatemulti for com‐
bining records.
vcfuniqalleles List unique alleles. For each
record, remove any duplicate
alternate alleles that may
have resulted from merging
separate VCF files.
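
The vcfuniq behaviour described above can be illustrated with a minimal Python sketch. This is not vcflib's implementation: the function name uniq_records is hypothetical, and including CHROM in the comparison key is an assumption (the description mentions only position, ref, and alt). As in the description, the first record of a run is kept and genotypes are not combined.

```python
def uniq_records(lines):
    """Yield VCF lines from a sorted input, dropping any record whose
    CHROM, POS, REF and ALT match the record immediately before it."""
    previous_key = None
    for line in lines:
        if line.startswith("#"):
            # Header lines pass through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # VCF columns: CHROM=0, POS=1, ID=2, REF=3, ALT=4
        key = (fields[0], fields[1], fields[3], fields[4])
        if key != previous_key:
            yield line
        previous_key = key
```

Because only adjacent records are compared, the input must already be sorted, which matches the requirement stated for vcfuniq.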
metrics
metrics command description
──────────────────────────────────────────────────────────────────────────
vcfcheck Validate integrity and identi‐
ty of the VCF by verifying
that the VCF record’s REF
matches a given reference
file.
vcfdistance Adds a tag to each variant
record which indicates the
distance to the nearest
variant (defaults to
BasesToClosestVariant if no
custom tag name is given).
vcfentropy Annotate VCF records with the
Shannon entropy of flanking
sequence. Annotates the output
VCF file with, for each
record, EntropyLeft, Entropy‐
Right, EntropyCenter, which
are the entropies of the se‐
quence of the given window
size to the left, right, and
center of the record. Also
adds EntropyRef and EntropyAlt
for each alt.
vcfhetcount Calculate the heterozygosity
rate: count the number of al‐
ternate alleles in heterozy‐
gous genotypes in all records
in the vcf file
vcfhethomratio Generates the het/hom ratio
for each individual in the
file
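
The Shannon entropy that vcfentropy reports for a flanking window can be sketched in Python as follows. This is an illustration only: the function name shannon_entropy, the use of single bases as symbols, and the base-2 logarithm are assumptions, and vcfentropy itself may compute the value differently.

```python
import math
from collections import Counter

def shannon_entropy(sequence):
    """Return the Shannon entropy (in bits) of the symbols in `sequence`."""
    total = len(sequence)
    if total == 0:
        return 0.0
    counts = Counter(sequence)
    # H = -sum(p * log2(p)) over the observed symbol frequencies p
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A homopolymer window such as "AAAA" has zero entropy under this definition, while a window with all four bases at equal frequency reaches the maximum of 2 bits, so low values flag low-complexity sequence context.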
phenotype
phenotype command description
──────────────────────────────────────────────────────────────────────────
permuteGPAT++ permuteGPAT++ is a method for
adding empirical p-values to a
GPAT++ score.
genotype
genotype command description
──────────────────────────────────────────────────────────────────────────
abba-baba abba-baba calculates the tree
pattern for four individuals.
This tool assumes reference is
ancestral and ignores
non-abba-baba sites. The
output is a boolean value:
1 = true, 0 = false for abba
and baba. The tree argument
should be specified from the
most basal taxon to the most
derived.
hapLrt HapLRT is a likelihood ratio
test for haplotype lengths.
The lengths are modeled with
an exponential distribution.
The sign denotes if the target
has longer haplotypes (1) or
the background (-1).
normalize-iHS normalizes iHS or XP-EHH
scores.
transformation
transformation command description
──────────────────────────────────────────────────────────────────────────
dumpContigsFromHeader Dump contigs from header
smoother smoother is a method for
window smoothing many of the
GPAT++ formats.
vcf2dag Modify VCF to be able to build
a directed acyclic graph (DAG)
vcf2fasta Generates sample_seq:N.fa for
each sample, reference se‐
quence, and chromosomal copy N
in [0,1... ploidy]. Each se‐
quence in the fasta file is
named using the same pattern
used for the file name, allow‐
ing them to be combined.
vcf2tsv Converts VCF to per-allele or
per-genotype tab-delimited
format, using null string to
replace empty values in the
table. Specifying -g will
output one line per sample
with genotype information.
When there is more than one
alt allele there will be
multiple rows, one for each
allele, and the info will
match the `A' index.
vcfaddinfo Adds info fields from the sec‐
ond file which are not present
in the first vcf file.
vcfafpath Display genotype paths
vcfallelicprimitives WARNING: this tool is consid‐
ered legacy and is only re‐
tained for older workflows.
It will emit a warning! Even
though it can use the WFA you
should use vcfwave instead.
vcfannotate Intersect the records in the
VCF file with targets provided
in a BED file. Intersections
are done on the reference se‐
quences in the VCF file. If
no VCF filename is specified
on the command line (last
argument), the VCF is read
from stdin.
vcfannotategenotypes Examine genotype
correspondence. Annotate
genotypes in the first file
with genotypes in the second,
adding the genotype as another
flag to each sample field in
the first file. annotation-tag
is the name of the sample flag
which is added to store the
annotation. Also adds a
`has_variant' flag for sites
where the second file has a
variant.
vcfbreakmulti If multiple alleles are speci‐
fied in a single record, break
the record into multiple
lines, preserving allele-spe‐
cific INFO fields.
vcfcat Concatenates VCF files
vcfclassify Creates a new VCF where each
variant is tagged by allele
class: snp, ts/tv, indel, mnp
vcfcleancomplex Removes reference-matching se‐
quence from complex alleles
and adjusts records to reflect
positional change.
vcfcombine Combine VCF files positional‐
ly, combining samples when
sites and alleles are identi‐
cal. Any number of VCF files
may be combined. The INFO
field and other columns are
taken from one of the files
which are combined when
records in multiple files
match. Alleles must have
identical ordering to be com‐
bined into one record. If
they do not, multiple records
will be emitted.
vcfcommonsamples Generates each record in the
first file, removing samples
not present in the second
vcfcreatemulti Go through sorted VCF and when
overlapping alleles are repre‐
sented across multiple
records, merge them into a
single multi-ALT record. See
the documentation for more in‐
formation.
vcfecho Echo VCF to stdout (simple de‐
mo)
vcfevenregions Generates a list of regions,
e.g. chr20:10..30 using the
variant density information
provided in the VCF file to
ensure that the regions have
even numbers of variants.
This can be used to reduce the
variance in runtime when
dividing variant detection or
genotyping by genomic
coordinates.
vcffixup Generates a VCF stream where
AC and NS have been generated
for each record using sample
genotypes
vcfflatten Removes multi-allelic sites by
picking the most common alter‐
nate. Requires allele fre‐
quency specification `AF' and
use of `G' and `A' to specify
the fields which vary accord‐
ing to the Allele or Genotype.
VCF file may be specified on
the command line or piped as
stdin.
vcfgeno2alleles modifies the genotypes field
to provide the literal alleles
rather than indexes
vcfgeno2haplo Convert genotype-based phased
alleles within --window-size
into haplotype alleles. Will
break haplotype construction
when encountering non-phased
genotypes on input.
vcfgenosamplenames Get samplenames
vcfglbound Adjust GLs so that the maximum
GL is 0 by dividing all GLs
for each sample by the max.
vcfglxgt Set genotypes using the maxi‐
mum genotype likelihood for
each sample.
vcfindex Adds an index number to the
INFO field (id=position)
vcfinfo2qual Sets QUAL from info field tag
keyed by [key]. The VCF file
may be omitted and read from
stdin. The average of the
field is used if it contains
multiple values.
vcfinfosummarize Take annotations given in the
per-sample fields and add the
mean, median, min, or max to
the site-level INFO.
vcfintersect VCF set analysis
vcfkeepgeno Reduce file size by removing
FORMAT fields not listed on
the command line from sample
specifications in the output
vcfkeepinfo To decrease file size remove
INFO fields not listed on the
command line
vcfkeepsamples outputs each record in the vcf
file, removing samples not
listed on the command line
vcfld Compute LD
vcfleftalign Left-align indels and complex
variants in the input using a
pairwise ref/alt alignment
followed by a heuristic, iter‐
ative left realignment process
that shifts indel representa‐
tions to their absolute left‐
most (5’) extent.
vcflength Add length info field
vcfnullgenofields Makes the FORMAT for each
variant line the same (uses
all the FORMAT fields de‐
scribed in the header). Fills
out per-sample fields to match
FORMAT. Expands GT values of
`.' with number of alleles
based on ploidy (e.g. `./.' for
diploid).
vcfnumalt outputs a VCF stream where NU‐
MALT has been generated for
each record using sample geno‐
types
vcfoverlay Overlay records in the input
vcf files with order as prece‐
dence.
vcfprimers For each VCF record, extract
the flanking sequences, and
write them to stdout as FASTA
records suitable for align‐
ment.
vcfqual2info Puts QUAL into an info field
tag keyed by [key].
vcfremap For each alternate allele, at‐
tempt to realign against the
reference with lowered gap
open penalty. If realignment
is possible, adjust the cigar
and reference/alternate alle‐
les. Observe how different
alignment parameters, includ‐
ing context and entropy-depen‐
dent ones, influence variant
classification and interpreta‐
tion.
vcfremoveaberrantgenotypes strips samples which are ho‐
mozygous but have observations
implying heterozygosity. Re‐
move samples for which the re‐
ported genotype (GT) and ob‐
servation counts disagree (AO,
RO).
vcfremovesamples outputs each record in the vcf
file, removing samples listed
on the command line
vcfsample2info Take annotations given in the
per-sample fields and add the
mean, median, min, or max to
the site-level INFO.
vcfsamplediff Establish putative somatic
variants using reported dif‐
ferences between germline and
somatic samples. Tags each
record where the listed sample
genotypes differ with <tag>.
The first sample is assumed to
be germline, the second
somatic. Each record is tagged
with
<tag>={germline,somatic,loh}
to specify the type of variant
given the genotype difference
between the two samples.
vcfsamplenames List sample names
vcfstreamsort Sorts the input (either stdin
or file) using a streaming
sort algorithm. Guarantees
that the positional order is
correct provided out-of-order
variants are no more than 100
positions in the VCF file
apart.
vcfwave Realign reference and alter‐
nate alleles with WFA, parsing
out the `primitive' alleles
into multiple VCF records.
New records have IDs that ref‐
erence the source record ID.
Genotypes/samples are handled
correctly. Deletions generate
haploid/missing genotypes at
overlapping sites.
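
The bounded-window guarantee described for vcfstreamsort can be illustrated with a minimal heap-based Python sketch. This is not vcflib's code: the function name stream_sort and the (position, payload) record shape are hypothetical, and the window is interpreted here as a number of records, which is an assumption about what "positions in the VCF file" means.

```python
import heapq

def stream_sort(records, window=100):
    """Yield (position, payload) pairs in ascending position order.

    Only `window` records are held in memory at a time. The output is
    fully sorted provided no record arrives more than `window` records
    later than its sorted place.
    """
    heap = []
    for order, (position, payload) in enumerate(records):
        # `order` breaks ties so payloads are never compared directly.
        heapq.heappush(heap, (position, order, payload))
        if len(heap) > window:
            pos, _, item = heapq.heappop(heap)
            yield pos, item
    # Flush whatever remains once the input is exhausted.
    while heap:
        pos, _, item = heapq.heappop(heap)
        yield pos, item
```

The design trades completeness for memory: a record displaced further than the window would be emitted out of order, which is why the guarantee in the description is conditional.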
statistics
statistics command description
──────────────────────────────────────────────────────────────────────────
bFst bFst is a Bayesian approach to
Fst. Importantly bFst ac‐
counts for genotype uncertain‐
ty in the model using genotype
likelihoods. For a more de‐
tailed description see: `A
Bayesian approach to inferring
population structure from dom‐
inant markers’ by Holsinger et
al. Molecular Ecology Vol 11,
issue 7 2002. The likelihood
function has been modified to
use genotype likelihoods pro‐
vided by variant callers.
There are five free parameters
estimated in the model: each
subpopulation’s allele fre‐
quency and Fis (fixation in‐
dex, within each subpopula‐
tion), a free parameter for
the total population’s allele
frequency, and Fst.
genotypeSummary Generates a table of genotype
counts. Summarizes genotype
counts for bi-allelic SNVs and
indels.
iHS iHS calculates the integrated
haplotype score which measures
the relative decay of extended
haplotype homozygosity (EHH)
for the reference and
alternative alleles at a site
(see: Voight et al. 2006,
Szpiech & Hernandez 2014).
meltEHH
pFst pFst is a probabilistic ap‐
proach for detecting differ‐
ences in allele frequencies
between two populations.
pVst pVst calculates vst, a measure
of CNV stratification.
permuteSmooth permuteSmooth is a method for
adding empirical p-values to
smoothed wcFst scores.
plotHaps plotHaps provides the format‐
ted output that can be used
with `bin/plotHaplotypes.R'.
popStats General population genetic
statistics for each SNP
segmentFst segmentFst creates genomic
segments (bed file) for re‐
gions with high wcFst
segmentIhs Creates genomic segments (bed
file) for regions with high
iHS scores
sequenceDiversity The sequenceDiversity program
calculates two popular metrics
of haplotype diversity: pi and
extended haplotype
homozygosity (eHH). Pi is
calculated using the Nei and
Li 1979 formulation. eHH is a
convenient way to think about
haplotype diversity. When
eHH = 0 all
haplotypes in the window are
unique and when eHH = 1 all
haplotypes in the window are
identical.
vcfaltcount count the number of alternate
alleles in all records in the
vcf file
vcfcountalleles Count alleles
vcfgenosummarize Adds summary statistics to
each record summarizing quali‐
ties reported in called geno‐
types. Uses: RO (reference
observation count), QR (quali‐
ty sum reference observations)
AO (alternate observation
count), QA (quality sum alter‐
nate observations)
vcfgenotypecompare adds statistics to the INFO
field of the vcf file describ‐
ing the amount of discrepancy
between the genotypes (GT) in
the vcf file and the genotypes
reported in the
<other-genotype-tag>. Use this
after vcfannotategenotypes to
get correspondence statistics
for two vcfs.
vcfgenotypes Report the genotypes for each
sample, for each variant in
the VCF. Convert the numerical
representation of genotypes
provided by the GT field to a
human-readable genotype for‐
mat.
vcfparsealts Alternate allele parsing
method. This method uses
pairwise alignment of REF and
ALTs to determine component
allelic primitives for each
alternate allele.
vcfrandom Generate a random VCF file
vcfrandomsample Randomly sample sites from an
input VCF file, which may be
provided as stdin. Scale the
sampling probability by the
field specified in KEY. This
may be used to provide uniform
sampling across allele fre‐
quencies, for instance.
vcfroc Generates a pseudo-ROC curve
using sensitivity and speci‐
ficity estimated against a pu‐
tative truth set. Threshold‐
ing is provided by successive
QUAL cutoffs.
vcfsitesummarize Summarize by site
vcfstats Prints statistics about vari‐
ants in the input VCF file.
wcFst wcFst is Weir & Cockerham’s
Fst for two populations.
Negative values are VALID;
they are sites which can be
treated as zero Fst. For more
information see Evolution,
Vol. 38, No. 6, Nov 1984.
Specifically wcFst uses
equations 1, 2, 3, and 4.
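
The eHH boundary cases given for sequenceDiversity (0 when all haplotypes in the window are unique, 1 when all are identical) follow from the usual definition of extended haplotype homozygosity as the fraction of haplotype pairs that are identical. A minimal Python sketch under that assumption is shown here; the function name ehh is hypothetical and the code is not taken from vcflib.

```python
from collections import Counter
from math import comb

def ehh(haplotypes):
    """Return the fraction of haplotype pairs in the window that are identical.

    `haplotypes` is a list of hashable haplotype strings, one per chromosome.
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two haplotypes are required")
    counts = Counter(haplotypes)
    # Identical pairs within each haplotype class, over all possible pairs.
    return sum(comb(c, 2) for c in counts.values()) / comb(n, 2)
```

For example, four chromosomes carrying two distinct haplotypes twice each give 2 identical pairs out of 6, so the value is one third.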
SOURCE CODE
See the source code repository at https://github.com/vcflib/vcflib
CREDIT
Citations are the bread and butter of Science. If you are using this software in your research and want
to support our future work, please cite the following publication:
A spectrum of free software tools for processing the VCF variant call format: vcflib, bio-vcf, cyvcf2,
hts-nim and slivar (https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009123).
Garrison E, Kronenberg ZN, Dawson ET, Pedersen BS, Prins P (2022), PLoS Comput Biol 18(5): e1009123.
https://doi.org/10.1371/journal.pcbi.1009123
LICENSE
Copyright 2011-2023 (C) Erik Garrison and vcflib contributors. MIT licensed.
AUTHORS
Erik Garrison and vcflib contributors.
vcflib vcflib(1)