Provided by: qtltools_1.3.1+dfsg-4build4_amd64 bug

NAME

       QTLtools cis - cis QTL analysis

SYNOPSIS

       QTLtools  cis  --vcf  [in.vcf|in.vcf.gz|in.bcf|in.bed.gz] --bed quantifications.bed.gz [--nominal float |
       --permute integer | --mapping in.txt] --out output.txt [OPTIONS]

DESCRIPTION

       This mode maps cis (proximal) quantitative trait loci (QTLs) that  affect  the  phenotype,  using  linear
       regression.   The  method is detailed in <https://www.nature.com/articles/ncomms15452>.  We first regress
       out the provided covariates from the phenotype data, followed by running the  linear  regression  between
       the  phenotype residuals and the genotype.  If --normal and --cov are provided at the same time, then the
       residuals after the covariate correction are rank  normal  transformed.   It  incorporates  an  efficient
       permutation  scheme to control for differential multiple testing burden of each phenotype.  You can run a
       nominal pass (--nominal threshold) listing all genotype-phenotype associations below a certain threshold,
       a permutation pass (--permute no_of_permutations) to empirically characterize the  null  distribution  of
       associations  for  each  phenotype separately, thus adjusting the nominal p-value of the best association
       for a phenotype, or a conditional analysis pass (--mapping filename) to discover multiple  proximal  QTLs
       with independent effects on a phenotype.

       As  multiple  molecular  phenotypes  can belong to higher order biological entities, e.g. exons of genes,
       QTLtools cis allows grouping of  phenotypes  to  maximize  the  discoveries  in  such  particular  cases.
       Specifically,  QTLtools can either aggregate multiple phenotypes in a given group into a single phenotype
       via PCA (--grp-pca1) or by taking their mean (--grp-mean), or directly use all individual  phenotypes  in
       an extended permutation scheme that accounts for their number and correlation structure (--grp-best).  In
       our experience, --grp-best outperforms the other options for expression QTLs (eQTLs).

       The conditional analysis pass first uses permutations to derive a nominal p-value threshold per phenotype
       that   varies   and  reflects  the  number  of  independent  tests  per  cis-window.   Then,  it  uses  a
       forward-backward stepwise regression to learn the number of independent signals per phenotype,  determine
       the  best  candidate  variant  per  signal and assign all significant hits to the independent signal they
       relate to.

       Since linear regressions assumes normally distributed data, we highly recommend using the --normal option
       to rank normal transform the phenotype quantifications in order to avoid false positive associations  due
       to outliers.

OPTIONS

       --vcf [in.vcf|in.bcf|in.vcf.gz|in.bed.gz]
              Genotypes in VCF/BCF format, or another molecular phenotype in BED format.  If there is a DS field
              in   the  genotype  FORMAT  of  a  variant  (dosage  of  the  genotype  calculated  from  genotype
              probabilities, e.g. after imputation), then this is used as the genotype.  If there is only the GT
              field in the genotype FORMAT then this is used and it is converted to a dosage.  REQUIRED.

       --bed quantifications.bed.gz
              Molecular phenotype quantifications in BED format.  REQUIRED.

       --out output.txt
              Output file.  REQUIRED.

       --cov covariates.txt
              Covariates to correct the phenotype data with.

       --normal
              Rank normal transform  the  phenotype  data  so  that  each  phenotype  is  normally  distributed.
              RECOMMENDED.

       --std-err
              Calculate and output the standard error of the beta (slope).

       --window integer
              Size   of   the   cis   window   flanking   each  phenotype's  start  position.   DEFAULT=1000000.
              RECOMMENDED=1000000.

       --nominal float
              Calculate the nominal p-value for the genotype-phenotype associations and print out the ones  that
              pass the provided threshold.  Give 1.0 to print everything.  Mutually exclusive with --permute and
              --mapping.

       --permute integer
              Adjust  the  best nominal p-value for this phenotype accounting for the number of variants and the
              linkage disequilibrium in its cis-window.  We recommend at least 1000 permutation  for  the  final
              analysis,  and  in  most cases you will see diminishing returns when going over 5000.  However, if
              you are doing exploratory analyses like which/how many covariates to include, you can go as low as
              100.  Mutually exclusive with --nominal and --mapping.  RECOMMENDED=1000.

       --mapping thresholds_filename
              The conditional analysis.  First  you  need  to  run  a  permutation  analysis,  then  generate  a
              thresholds  file  using  the runFDR_cis.R script in the script directory.  Mutually exclusive with
              --nominal and --permute.

       --grp-best
              Correct for multiple phenotypes within a phenotype group.  Mutually exclusive with --grp-pca1  and
              --grp-mean.

       --grp-pca1
              Run  PCA  on  phenotypes  within  a phenotype group and use PC1 for association testing.  Mutually
              exclusive with --grp-best and --grp-mean.

       --grp-mean
              Average phenotypes within a group and use the results for association testing.  Mutually exclusive
              with --grp-best and --grp-pca1.

       --chunk integer1 integer2
              For parallelization.  Divide the data into integer2 number of  chunks  and  process  chunk  number
              integer1.   Chunk  0  will  print  a  header.  Mutually exclusive with --region and --region-pair.
              Minimum number of chunks has to be at least the same number of chromosomes in the --bed file.

       --region chr:start-end
              Genomic region to be processed.  E.g. chr4:12334456-16334456,  or  chr5  Mutually  exclusive  with
              --chunk and --region-pair.

       --region-pair chr:start-end chr:start-end
              Genomic  region  for  genotypes  followed  by  region  for  phenotypes  to be processed.  Mutually
              exclusive with --chunk and --region.

OUTPUT FILES

       --nominal output file
        Space separated output file with the following columns  (certain  columns  are  only  printed  based  on
        options).  We recommend including chunk 0 to print out a header in order to avoid confusion.
         1     phe_id | grp_id                     The  phenotype  ID  or  if  one  of  the  grouping options is
                                                   provided, then phenotype group ID
         2     phe_chr                             The phenotype chromosome
         3     phe_from                            Start position of the phenotype
         4     phe_to                              End position of the phenotype
         5     phe_strd                            The phenotype strand
         5.1   phe_id | ve_by_pc1 | n_phe_in_grp   Only printed if --group-best | --group-pca1  |  --group-mean.
                                                   The  phenotype  ID,  variance  explained by PC1, or number of
                                                   phenotypes in the phenotype group for --group-best,  --group-
                                                   pca1, and --group-mean, respectively.
         5.2   n_phe_in_grp                        Only  printed  if --group-pca1 | --group-mean.  The number of
                                                   phenotypes in the phenotype group.
         6     n_var_in_cis                        The number variants in the cis window for this phenotype.
         7     dist_phe_var                        The distance between the  variant  and  the  phenotype  start
                                                   positions.
         8     var_id                              The variant ID.
         9     var_chr                             The variant chromosome.
        10     var_from                            The start position of the variant.
        11     var_to                              The end position of the variant.
        12     nom_pval                            The  nominal  p-value  of the association between the variant
                                                   and the phenotype.
        13     r_squared                           The r squared of the linear regression.
        14     slope                               The beta (slope) of the linear regression.
        14.1   slope_se                            The standard error of the beta.  Only printed if --std-err is
                                                   provided.
        15     best_hit                            Whether this varint was the best hit for this phenotype.

       --permute output file
        Space separated output file with the following columns  (certain  columns  are  only  printed  based  on
        options).  We recommend including chunk 0 to print out a header in order to avoid confusion.
         1     phe_id | grp_id                     The  phenotype  ID  or  if  one  of  the  grouping options is
                                                   provided, then phenotype group ID
         2     phe_chr                             The phenotype chromosome
         3     phe_from                            Start position of the phenotype
         4     phe_to                              End position of the phenotype
         5     phe_strd                            The phenotype strand
         5.1   phe_id | ve_by_pc1 | n_phe_in_grp   Only printed if --group-best | --group-pca1  |  --group-mean.
                                                   The  phenotype  ID,  variance  explained by PC1, or number of
                                                   phenotypes in the phenotype group for --group-best,  --group-
                                                   pca1, and --group-mean, respectively.
         5.2   n_phe_in_grp                        Only  printed  if --group-pca1 | --group-mean.  The number of
                                                   phenotypes in the phenotype group.
         6     n_var_in_cis                        The number variants in the cis window for this phenotype.
         7     dist_phe_var                        The distance between the  variant  and  the  phenotype  start
                                                   positions.
         8     var_id                              The most significant variant ID.
         9     var_chr                             The most significant variant's chromosome.
        10     var_from                            The start position of the most significant variant.
        11     var_to                              The end position of the most significant variant.
        12     dof1                                The  number  of  degrees  of  freedom  used to compute the p-
                                                   values.
        13     dof2                                Estimated  number  of  degrees  of  freedom  used   in   beta
                                                   approximation p-value calculations.
        14     bml1                                The  first  shape  parameter  of the fitted beta distribution
                                                   (alpha parameter).  These should be close to 1.
        15     bml2                                The second shape parameter of the  fitted  beta  distribution
                                                   (beta  parameter).   This corresponds to the effective number
                                                   of independent tests in the region.
        16     nom_pval                            The nominal p-value  of  the  association  between  the  most
                                                   significant variant and the phenotype.
        17     r_squared                           The r squared of the linear regression.
        18     slope                               The beta (slope) of the linear regression.
        18.1   slope_se                            The standard error of the beta.  Only printed if --std-err is
                                                   provided.
        19     adj_emp_pval                        Adjusted  empirical  p-value  from permutations.  This is the
                                                   adjusted p-value not using the  beta  approximation.   Simply
                                                   calculated   as:   (number   of   p-values   observed  during
                                                   permutations that were smaller than or equal to  the  nominal
                                                   p-value  +  1)  /  (number  of  permutations  + 1).  The most
                                                   significant p-value  achievable  would  be  1  /  (number  of
                                                   permutations + 1).
        20     adj_beta_pval                       Adjusted   empirical   p-value   given  by  the  fitted  beta
                                                   distribution.  We strongly recommend using this  adjusted  p-
                                                   value in any downstream analysis.

       --mapping output file
        Space  separated  output  file  with  the  following  columns (certain columns are only printed based on
        options).  We recommend including chunk 0 to print out a header in order to avoid confusion.
         1     phe_id | grp_id                     The phenotype ID  or  if  one  of  the  grouping  options  is
                                                   provided, then phenotype group ID
         2     phe_chr                             The phenotype chromosome
         3     phe_from                            Start position of the phenotype
         4     phe_to                              End position of the phenotype
         5     phe_strd                            The phenotype strand
         5.1   phe_id | ve_by_pc1 | n_phe_in_grp   Only  printed  if --group-best | --group-pca1 | --group-mean.
                                                   The phenotype ID, variance explained by  PC1,  or  number  of
                                                   phenotypes  in the phenotype group for --group-best, --group-
                                                   pca1, and --group-mean, respectively.
         5.2   n_phe_in_grp                        Only printed if --group-pca1 | --group-mean.  The  number  of
                                                   phenotypes in the phenotype group.
         6     n_var_in_cis                        The number variants in the cis window for this phenotype.
         7     dist_phe_var                        The  distance  between  the  variant  and the phenotype start
                                                   positions.
         8     var_id                              The most significant variant ID.
         9     var_chr                             The most significant variant's chromosome.
        10     var_from                            The start position of the most significant variant.
        11     var_to                              The end position of the most significant variant.
        12     rank                                The rank of the association.  This tells you if  the  variant
                                                   has been mapped as belonging to the best signal (rank=0), the
                                                   second  best (rank=1), etc ...  As a consequence, the maximum
                                                   rank  value  for  a  given  phenotype  tells  you  how   many
                                                   independent   signals   there   are   (e.g.  rank=2  means  3
                                                   independent signals).
        13     fwd_pval                            The nominal forward p-value of the  association  between  the
                                                   most significant variant and the phenotype.
        14     fwd_r_squared                       The r squared of the forward linear regression.
        15     fwd_slope                           The beta (slope) of the forward linear regression.
        15.1   fwd_slope_se                        The  standard  error  of  the  forward beta.  Only printed if
                                                   --std-err is provided.
        16     fwd_best_hit                        Whether or not this variant was the forward most  significant
                                                   variant.
        17     fwd_sig                             Whether this variant was significant.  Currently all variants
                                                   are significant so this is redundant.
        18     bwd_pval                            The  nominal  backward p-value of the association between the
                                                   most significant variant and the phenotype.
        19     bwd_r_squared                       The r squared of the backward linear regression.
        20     bwd_slope                           The beta (slope) of the backward linear regression.
        20.1   bwd_slope_se                        The standard error of the backward  beta.   Only  printed  if
                                                   --std-err is provided.
        21     bwd_best_hit                        Whether or not this variant was the backward most significant
                                                   variant.
        22     bwd_sig                             Whether this variant was significant.  Currently all variants
                                                   are significant so this is redundant.

NOMINAL ANALYSIS EXAMPLES

       o Nominal pass, correcting for technical covariates, rank normal transforming the phenotype, and printing
         out  associations  with  a  p-value  <=  0.01  on  chromosome  22 between 17000000 and 18000000 bp, and
         excluding some samples (see QTLtools (1)):

         QTLtools    cis    --vcf    genotypes.chr22.vcf.gz     --bed     genes.50percent.chr22.bed.gz     --cov
         genes.covariates.pc50.txt.gz   --nominal   0.01   --region   chr22:17000000-18000000   --normal   --out
         nominals.txt --exclude-samples sample_names_to_exclude.txt

       o Nominal pass with parallelization correcting for technical covariates,  rank  normal  transforming  the
         phenotype,  and  printing  out  associations  with a p-value <= 0.01.  To facilitate parallelization on
         compute cluster, we developed an option to run the analysis into chunks of molecular  phenotypes.   For
         instance, to run analysis on chunk 12 when splitting the example data set into 20 chunks, run:

         QTLtools     cis     --vcf     genotypes.chr22.vcf.gz    --bed    genes.50percent.chr22.bed.gz    --cov
         genes.covariates.pc50.txt.gz --nominal 0.01 --chunk 12 20 --normal --out nominals_12_20.txt

       o If you want to submit the whole analysis with 20 jobs on a compute cluster, just run (qsub needs to  be
         changed to the job submission system used [bsub, psub, etc...]):

         for j in $(seq 0 20); do
             echo   "QTLtools   cis   --vcf   genotypes.chr22.vcf.gz  --bed  genes.50percent.chr22.bed.gz  --cov
             genes.covariates.pc50.txt.gz --nominal 0.01 --chunk $j 20  --normal  --out  nominals_$j\_20.txt"  |
             qsub
         done

PERMUTATION ANALYSIS EXAMPLES

       o Permutation  pass,  correcting  for  technical  covariates, rank normal transforming the phenotype, and
         running 1000 permutations with a specific random seed on chromosome 22 between  17000000  and  18000000
         bp:

         QTLtools     cis     --vcf     genotypes.chr22.vcf.gz    --bed    genes.50percent.chr22.bed.gz    --cov
         genes.covariates.pc50.txt.gz --permute 1000 --region chr22:17000000-18000000  --normal  --seed  1354145
         --out permutation.txt

       o Permutation pass with parallelization correcting for technical covariates, rank normal transforming the
         phenotype,  and  running  5000  permutations.   To  facilitate  parallelization  on compute cluster, we
         developed an option to run the analysis into chunks of molecular  phenotypes.   For  instance,  to  run
         analysis on chunk 12 when splitting the example data set into 20 chunks, run:

         QTLtools     cis     --vcf     genotypes.chr22.vcf.gz    --bed    genes.50percent.chr22.bed.gz    --cov
         genes.covariates.pc50.txt.gz --permute 5000 --chunk 12 20 --normal --out permutations_12_20.txt

       o If you want to submit the whole analysis with 20 jobs on a compute cluster, just run (qsub needs to  be
         changed to the job submission system used [bsub, psub, etc...]):

         for j in $(seq 0 20); do
             echo   "QTLtools   cis   --vcf   genotypes.chr22.vcf.gz  --bed  genes.50percent.chr22.bed.gz  --cov
             genes.covariates.pc50.txt.gz --permute 5000 --chunk $j 20 --normal --out permutations_$j\_20.txt" |
             qsub
         done

       o When your phenotypes in the BED file are grouped, you can perform a permutation pass at  the  phenotype
         group level in order to discover group-level QTLs:

         QTLtools     cis     --vcf     genotypes.chr22.vcf.gz    --bed    exons.50percent.chr22.bed.gz    --cov
         genes.covariates.pc50.txt.gz --permute 1000 --normal --grp-best --out permutation.group.txt

CONDITIONAL ANALYSIS EXAMPLE

       Conditional analysis to discover independent signals.

       1 First we need to run a permutation analysis (see previous  section),  then  calculate  nominal  p-value
         threshold for each gene.  Here an FDR of 5% is given as an example:

         cat permutations_*.txt | gzip -c > permutations_all.txt.gz
         Rscript ./script/qtltools_runFDR_cis.R permutations_all.txt.gz 0.05 permutations_all

       2 Now  you can proceed with the actual conditional analysis.  Here splitting into 20 chunks, and when all
         complete concatenate the results:

         for j in $(seq 0 20); do
             echo  "QTLtools  cis  --vcf   genotypes.chr22.vcf.gz   --bed   genes.50percent.chr22.bed.gz   --cov
             genes.covariates.pc50.txt.gz --mapping permutations_all.thresholds.txt --chunk $j 20 --normal --out
             conditional_$j\_20.txt" | qsub
         done
         cat conditional_*.txt > conditional_all.txt

       3 If  you  are  interested  in  the  most significant variants per independent signal, you can filter the
         results, using the backward p-value:

         awk '{ if ($21 == 1) print $0 }' conditional_all.txt > conditional_top_variants.txt

SEE ALSO

       QTLtools(1)

       QTLtools website: <https://qtltools.github.io/qtltools>

BUGS

       o Versions up to and including 1.2, suffer from a bug in reading  missing  genotypes  in  VCF/BCF  files.
         This  bug  affects  variants with a DS field in their genotype's FORMAT and have a missing genotype (DS
         field is .) in one of the samples, in which case genotypes for all the  samples  are  set  to  missing,
         effectively removing this variant from the analyses.

       Please submit bugs to <https://github.com/qtltools/qtltools>

CITATION

       Delaneau,  O.,  Ongen, H., Brown, A. et al. A complete tool set for molecular QTL discovery and analysis.
       Nat Commun 8, 15452 (2017).  <https://doi.org/10.1038/ncomms15452>

AUTHORS

       Olivier Delaneau (olivier.delaneau@gmail.com), Halit Ongen (halitongen@gmail.com)

QTLtools-v1.3                                      06 May 2020                                   QTLtools-cis(1)