Provided by: igdiscover_0.11-4_all bug

NAME

       igdiscover - IgDiscover Documentation

       IgDiscover analyzes antibody repertoires and discovers new V genes from high-throughput sequencing reads.
       Heavy chains, kappa and lambda light chains are supported (to discover VH, VK and VL genes).

       IgDiscover is the result of a collaboration between the Gunilla Karlsson Hedestam group at the Department
       of Microbiology, Tumor and Cell Biology at Karolinska Institutet, Sweden and the Bioinformatics Long-Term
       Support facility at Science for Life Laboratory (SciLifeLab), Sweden.

       If you use IgDiscover, please cite:
          Corcoran, Martin M. and Phad, Ganesh E. and Bernat, Néstor Vázquez and Stahl-Hennig,
          Christiane and Sumida, Noriyuki and Persson, Mats A.A. and Martin, Marcel and
          Karlsson Hedestam, Gunilla B.
          Production of individualized V gene databases reveals high levels of immunoglobulin genetic
          diversity.
          Nature Communications 7:13642 (2016)
          https://dx.doi.org/10.1038/ncomms13642

LINKS

DocumentationSource codeReport an issueProject page on PyPI (Python package index)

         [image]

INSTALLATION

       IgDiscover  is  written in Python 3 and is developed on Linux. The tool also runs on macOS, but is not as
       well tested on that platform.

       For installation on either system, we recommend that you follow the instructions below, which will  first
       explain  how  to  install  the Conda package manager. IgDiscover is available as a Conda-package from the
       bioconda channel.  Using Conda will  make  the  installation  easy  because  all  dependencies  are  also
       available as Conda packages and can thus be installed automatically along with IgDiscover.

       There are also non-Conda installation instructions if you cannot use Conda.

   Installing IgDiscover with Conda
       1. Install  Conda  by  following  the conda installation instructions as appropriate for your system. You
          will need to choose between a “Miniconda” and “Anaconda” installation. We recommend Miniconda  as  the
          download  is  smaller.  If  you  are  in a hurry, these two commands are usually sufficient to install
          Miniconda on Linux (read the linked document for macOS instructions):

             wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
             bash Miniconda3-latest-Linux-x86_64.sh

          When the installer asks you about modifying the PATH in your .bashrc file, answer yes.

       2. Close the terminal window and open a new one. Then  test  whether  conda  is  installed  correctly  by
          running

             conda --version

          If you see the conda version number, it worked.

       3. Set  up  Conda  so  that it can access the bioconda channel.  For that, follow the instructions on the
          bioconda website or simply run these commands:

             conda config --add channels defaults
             conda config --add channels bioconda
             conda config --add channels conda-forge

       4. Install IgDiscover with this command:

             conda create -n igdiscover igdiscover

          This will create a new so-called “environment” for IgDiscover (retry if it fails). Whenever  you  want
          to run IgDiscover, you will need to activate the environment with this command:

             source activate igdiscover

       5. Make  sure  you  have activated the igdiscover environment.  Then test whether IgDiscover is correctly
          installed with this command:

             igdiscover --version

          If you see the version number of IgDiscover, it worked! If an error message  appears  that  says  "The
          'networkx' distribution was not found and is required by snakemake", install networkx manually with:

             pip install networkx==2.1

          Then retry to check the igdiscover version.

       6. You can now run IgDiscover on the test data set to familiarize yourself with how it works.

   Troubleshooting on Linux
       If  you  use zsh instead of bash (applies to Bio-Linux, for example), the $PATH environment variable will
       not be setup correctly by the Conda installer. The miniconda installer adds a line export PATH=... to the
       to the end of your /home/your-user-name/.bashrc file. Copy that line from the file and add it to the  end
       of the file /home/your-user-name/.zshrc instead.

       Alternatively, change your default shell to bash by running chsh -s /bin/bash.

       If you use conda and see an error that includes something like this:

          ImportError: .../.local/lib/python3.5/site-packages/sqt/_helpers.cpython-35m-x86_64-linux-gnu.so: undefined symbol: PyFPE_jbuf

       Or  you  see  any  error that mentions a .local/ directory, then a previous installation of IgDiscover is
       interfering with the conda installation.

       The easiest way to solve this problem is to delete the directory .local/ in your home directory, see also
       how to remove IgDiscover from a Linux system.

   Troubleshooting on macOS
       If you get the error

          ValueError: unknown locale: UTF-8

       Then follow these instructions.

   Development version
       To install IgDiscover directly from  the  most  recent  source  code,  read  the  developer  installation
       instructions.

MANUAL INSTALLATION

       IgDiscover  requires  quite  a few other software tools that are not included in most Linux distributions
       (or mac OS) and which are also not available from the Python packaging index (PyPI) because they are  not
       Python  tools.  If you do not use the recommended simple installation instructions via Conda, you need to
       install those non-Python dependencies manually. Regular Python dependencies are automatically  pulled  in
       when IgDiscover itself is installed in the last step with the pip install command. The instructions below
       are written for Linux and require modifications if you want to try this on OS X.

       NOTE:
          We  recommend  the  much  simpler  installation  via  Conda  instead of using the instructions in this
          section.

   Install non-Python dependencies
       The dependencies are: MUSCLE, IgBLAST, PEAR, and -- optionally -- flash.

       1. Install Python 3.5 or newer. It most likely is already installed on your system, but in Debian/Ubuntu,
          you can get it with

             sudo apt-get install python3

       2. Create the directory where binaries will be installed. We assume $HOME/.local/bin here, but  this  can
          be anywhere as long as they are in your $PATH.

             mkdir -p ~/.local/bin

          Add this line to the end of your ~/.bashrc file:

             export PATH=$HOME/.local/bin:$PATH

          Then either start a new shell or run source ~/.bashrc to get the changes.

       3. Install MUSCLE. This is available as a package in Ubuntu:

             sudo apt-get install muscle

          If your distribution does not have a 'muscle' package or if you are not allowed to run sudo:

             wget -O - http://www.drive5.com/muscle/downloads3.8.31/muscle3.8.31_i86linux64.tar.gz | tar xz
             mv muscle3.8.31_i86linux64 ~/.local/bin/

       4. Install PEAR:

             wget http://sco.h-its.org/exelixis/web/software/pear/files/pear-0.9.6-bin-64.tar.gz
             tar xvf pear-0.9.6-bin-64.tar.gz
             mv pear-0.9.6-bin-64/pear-0.9.6-bin-64 ~/.local/bin/pear

       5. Install IgBLAST:

             wget ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/1.4.0/ncbi-igblast-1.4.0-x64-linux.tar.gz
             tar xvf ncbi-igblast-1.4.0-x64-linux.tar.gz
             mv ncbi-igblast-1.4.0/bin/igblast? ~/.local/bin/

          IgBLAST  requires  some  data files that must be downloaded separately. The following commands put the
          files into ~/.local/igdata:

             mkdir ~/.local/igdata
             cd ~/.local/igdata
             wget -r -nH --cut-dirs=4 ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/internal_data
             wget -r -nH --cut-dirs=4 ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/database/
             wget -r -nH --cut-dirs=4 ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/optional_file/

          Also, you must set the $IGDATA environment variable to point to the directory  with  data  files.  Add
          this line to your ~/.bashrc:

             export IGDATA=$HOME/.local/igdata

          Then run source ~/.bashrc to get the changes.

       7. Optionally, install flash:

             wget -O FLASH-1.2.11.tar.gz http://sourceforge.net/projects/flashpage/files/FLASH-1.2.11.tar.gz/download
             tar xf FLASH-1.2.11.tar.gz
             cd FLASH-1.2.11
             make
             mv flash ~/.local/bin/

   Install IgDiscover
       Install  IgDiscover  with  the Python package manager pip, which will download and install IgDiscover and
       its dependencies:

          pip3 install --user igdiscover

       Both commands also install all remaining dependencies. The  --user  option  instructs  both  commands  to
       install everything into $HOME/.local.

       Finally, check the installation with

          igdiscover --version

       and you should see the version number of IgDiscover.

       You should now run IgDiscover on the test data set.

TEST DATA SET

       After  installing  IgDiscover,  you should run it once on a small test data that we provide, both to test
       your installation and to familiarize yourself with running the program.

       1. Download und unpack the test data set (version 0.5). To do  this  from  the  command-line,  use  these
          commands:

             wget https://bitbucket.org/igdiscover/testdata/downloads/igdiscover-testdata-0.5.tar.gz
             tar xvf igdiscover-testdata-0.5.tar.gz
          The  test  data set contains some paired-end reads from human IgM heavy chain dataset ERR1760498 and a
          database of IGHV, IGHD, IGHJ sequences based on Ensembl annotations. You  should  use  a  database  of
          higher quality for your own experiments.

       2. Initialize the IgDiscover pipeline directory:

             igdiscover init --db igdiscover-testdata/database/ --reads igdiscover-testdata/reads.1.fastq.gz discovertest

          The  name  discovertest is the name of the pipeline directory that will be created. Note that only the
          path to the first reads file needs to be given. The second file is found automatically. There may be a
          couple of messages “Skipping 'x' because it contains the same sequence as 'y'”, which you can ignore.

          The command will have printed a message telling you that the pipeline directory has been  initialized,
          that you should edit the configuration file, and how to actually run IgDiscover after that.

       3. The  generated  igdiscover.yaml  configuration  file  does not actually need to be edited for the test
          dataset, but you may still want to have a read through it as you will need to do so for you own  data.
          You  may  want to do this while the pipeline is running in the next step. The configuration is in YAML
          format. When editing the file, just follow the way it is already structured.

       4. Run the analysis. To do so, change into the pipeline directory and run this command:

             cd discovertest && igdiscover run

          On this small dataset, running the pipeline should take not more than about 5 minutes.

       5. Finally, inspect the results in the discovertest/iteration-01 or discovertest/final  directories.  The
          discovered  V  genes and extra information are listed in discovertest/iteration-01/new_V_germline.tab.
          Discovered J genes are in discovertest/iteration-01/new_J.tab. There  are  also  corresponding  .fasta
          files with the sequences only.

          See the explanation of final result files.

   Other test data sets
       ENA  project  PRJEB15295  contains  the data for our Nature Communications paper from 2016, in particular
       ERR1760498, which is the data for the human “H1” sample (multiplex PCR, IgM heavy chain).

       Data used for testing TCR detection (human, RACE): SRR2905677 and SRR2905710.

USER GUIDE

   Overview
       IgDiscover works on a single library at a time. It works within an “analysis directory” for the  library,
       which contains all intermediate and result files.

       To start an analysis, you need:

       1. A  FASTA or FASTQ file with single-end reads or two FASTQ files with paired-end reads (also, the files
          must be gzip-compressed)

       2. A database of V/D/J genes (three FASTA files named V.fasta, D.fasta, J.fasta)

       3. A configuration file that describes the library

       If you do not have a V/D/J database, yet, you may want to read the section  about  how  to  obtain  V/D/J
       sequences.

       To run an analysis, proceed as follows.

       NOTE:
          If you are on macOS, it may be necessary to run export SHELL=/bin/bash before continuing.

       1. Create and initialize the analysis directory.

          First,  pick a name for your analysis. We will use myexperiment in the following.  Then run igdiscover
          init:

             igdiscover init myexperiment

          A dialog will appear and ask for the file with the first (forward) reads.  Find your compressed  FASTQ
          file  that contains them and select it.  Typical file names may be Library1_S1_L001_R1_001.fastq.gz or
          mylibrary.1.fastq.gz.  You do not need to choose the second read file!  It is found automatically.

          Next, choose the directory with your database.  The directory must contain the  three  files  V.fasta,
          D.fasta,  J.fasta.   These  files contain the V, D, J gene sequences, respectively.  Even if have have
          only light chains in your data, a D.fasta file needs to be provided, just use one with the heavy chain
          D gene sequences.

          If you do not want a graphical user interface, use the two command-line parameters --db  and  --reads1
          to provide this information instead:

             igdiscover init --db path/to/my/database/ --reads1 mylibrary.1.fastq.gz myexperiment

          Again,  the  second reads file will be found automatically.  Use --single-reads instead of --reads1 if
          you have single-end reads or a dataset with already merged reads.  For --single-reads,  a  FASTA  file
          (not  only  FASTQ)  is  also allowed.  In any case, an analysis directory named myexperiment will have
          been created.

       2. Adjust the configuration file

          The previous step created a configuration file named myexperiment/igdiscover.yaml, which you may  need
          to  adjust.  In  particular, the number of discovery rounds is set to 3 by default, which takes a long
          time. Reducing this to 2 or even 1 often works just as well.

       3. Run the analysis

          Change into the newly created analysis directory and run the analysis:

             igdiscover run

          Depending on the size of your library, your computer, and the number of iterations, this will now take
          from a few hours to a day. See the running IgDiscover section for more fine-grained control over  what
          to run and how to resume the process if something failed.

   Obtaining a V/D/J database
       We  use  the  term “database” to refer to three FASTA files that contain the sequences for the V, D and J
       genes.  IMGT provides sequences for download.  For discovering new VH genes, for example, you need to get
       the IGHV, IGHD and IGHJ files of your species.  As IgDiscover uses this only as a starting point, using a
       similar species will also work.

       When using an IMGT database, it is very important to change the  long  IMGT  sequence  headers  to  short
       headers as IgBLAST does not accept the long headers. We recommend using the program edit_imgt_file.pl. If
       you  installed  IgDiscover  from  Conda, the script is already installed and you can run it by typing the
       name. It is also available on the IgBlast FTP site.

       Run it for all three downloaded files, and then rename files appropritely to make sure  that  they  named
       V.fasta, D.fasta and J.fasta.

       You always need a file with D genes even if you analyze light chains.

       In  case  you  have  used  IgBLAST  previously, note that there is no need to run the makeblastdb tool as
       IgDiscover will do that for you.

   Input data requirements
   Paired-end or single-end data
       IgDiscover can process input data of three different types:

       • Paired-end reads in gzipped FASTQ format,

       • Single-end reads in gzipped FASTQ format,

       • Single-end reads in gzipped FASTA format.

       IgDiscover was tested mainly on paired-end Illumina MiSeq reads (2x300bp), but it can also handle 454 and
       Ion Torrent data.

       Depending on the input file type, use a variant of one  of  the  following  commands  to  initialize  the
       analysis directory:

          igdiscover init --single-reads=file.fasta.gz  --database=my-database-dir/ myexperiment
          igdiscover init --reads1=file.1.fasta.gz  --database=my-database-dir/ myexperiment
          igdiscover init --reads1=file.1.fastq.gz  --database=my-database-dir/ myexperiment

   Read layout
       Paired-end  reads  are  first  merged  and then processed in the same way as single-end reads. Reads that
       could not be merged are discarded. Single-end reads and merged paired-end reads are  expected  to  follow
       this structure (from 5' to 3'):

       • The forward primer sequence. This is optional.

       • A   random   barcode   (molecular   identifier).   This  is  optional.  Set  the  configuration  option
         barcode_length_5p to 0 if you don’t have random barcodes or if you don’t want the program to use them.

       • Optionally, a run of G nucleotides. This is an artifact of the RACE protocol  (Rapid  amplification  of
         cDNA ends). If you have this, set race_g to true in the configuration file.

       • 5' UTR

       • Leader

       • Re-arranged V, D and J gene sequences for heavy chains; only V and J for light chains

       • An  optional  random  barcode.  Set  the  configuration  option barcode_length_3p to the length of this
         barcode. You can currently not have both a 5' and a 3' barcode.

       • The reverse primer. This is optional.

       We use IgBLAST to detect the location of the V, D, J genes through the igdiscover igblast subcommand. The
       G nucleotides after the barcode are split off if the configuration specifies  race_g:  true.  The  leader
       sequence is detected by looking for a start codon near 60 bp upstream of the start of the V gene match.

   Configuration
       The  igdiscover  init  command creates a configuration file igdiscover.yaml in the analysis directory. To
       configure your analysis, change that file with a text editor before running the analysis with  igdiscover
       run.

       The syntax should be mostly self-explanatory.  The file is in YAML format, but you will not need to learn
       that.   Just  follow  the  examples  given  in  the  file.   A few rules that may be good to know are the
       following ones:

       1. Lines starting with the # symbol are comments (they are ignored)

       2. A configuration option that is meant to be switched on or off will say something like stranded:  false
          if it is off.  Change this to stranded: true to switch the option on (and vice versa).

       3. The  primer  sequences  are  given  as a list, and must be written in a certain way - one sequence per
          line, and a - (dash) in front, like so:

             forward_primers:
             - ACGTACGTACGT
             - AACCGGTTAACC

          Even if you have only one primer sequence, you still need to use this syntax.

       To find out what the configuration options achieve,  see  the  explanations  in  the  configuration  file
       itself.

       The main parameters parameters that may require adjusting are the following.

       The  iterations  option sets the number of rounds of V gene discovery that will be performed. By default,
       three iterations are run. Even with a very restricted starting V database (for example with only a single
       V gene sequence), this is usually sufficient to identify most novel germline sequences.

       When the starting database is more complete, for example, when analyzing a human  IgM  library  with  the
       current  IMGT  heavy  chain  database,  a single iteration may be sufficient to produce an individualized
       database.

       If you do not want to discover any new genes and only want to produce an expression profile, for example,
       then use iterations: 0.

       The ignore_j option should be set to true when producing  a  V  gene  database  for  a  species  where  J
       sequences are unknown:

          ignore_j: true

       Setting the parameters stranded, forward_primers and reverse_primers to the correct values can be used to
       remove  5' and 3' primers from the sequences.  Doing this is not strictly necessary for IgDiscover. It is
       simplest if you do not specify any primer sequences.

   Pregermline and germline filter criteria
       This provides IgDiscover with stringency requirements for V gene discovery that  enable  the  program  to
       filter  out  false  positives. Usually the ”pregermline filter” can be used in the default mode since all
       these sequences will be subsequently passed to the higher stringency ”germline filter” where the criteria
       are set to maximize stringency. Here is how it looks in the configuration file:

          pre_germline_filter:
            unique_cdr3s: 2      # Minimum number of unique CDR3s (within exact matches)
            unique_js: 2         # Minimum number of unique J genes (within exact matches)
            check_motifs: false  # Check whether 5' end starts with known motif
            whitelist: true      # Add database sequences to the whitelist
            cluster_size: 0      # Minimum number of sequences assigned to cluster
            differences: 0       # Merge sequences if they have at most this number of differences
            allow_stop: true     # Whether to allow non-productive sequences containing stop codons
            cross_mapping_ratio: 0.02  # Threshold for removal of cross-mapping artifacts (set to 0 to disable)
            allele_ratio: 0.1    # Required minimum ratio between alleles of a single gene

          # Filtering criteria applied to candidate sequences in the last iteration.
          # These should be more strict than the pre_germline_filter criteria.
          #
          germline_filter:
            unique_cdr3s: 5      # Minimum number of unique CDR3s (within exact matches)
            unique_js: 3         # Minimum number of unique J genes (within exact matches)
            check_motifs: false  # Check whether 5' end starts with known motif
            whitelist: true      # Add database sequences to the whitelist
            cluster_size: 100    # Minimum number of sequences assigned to cluster
            differences: 0       # Merge sequences if they have at most this number of differences
            allow_stop: false    # Whether to allow non-productive sequences containing stop codons
            cross_mapping_ratio: 0.02  # Threshold for removal of cross-mapping artifacts (set to 0 to disable)
            allele_ratio: 0.1    # Required minimum ratio between alleles of a single gene

       Factors that affect germline discovery include library source (IgM vs IgK,  IgL  or  IgG)  library  size,
       sequence  error  rate  and individual genomic factors (for example the number of J segments present in an
       individual).

       In general, setting a higher cutoff of unique_cdr3s and unique_js  will  minimize  the  number  of  false
       positives in the output. Example:

          unique_cdr3s: 10      # Minimum number of unique CDR3s (within exact matches)
          unique_js: 4          # Minimum number of unique J genes (within exact matches)

       If  the  differences  parameter is set to a value higher than 0, the germline filter inspects clusters of
       sequences that are closely related (when the edit distance between  them  is  at  most  differences)  and
       retains  only  the  most common sequence of each cluster. Previously, we believed this would removes some
       false positives due to accumulated random sequence errors of  highly  expressed  alleles  that  otherwise
       would pass the cutoff criteria. However, we found out that we miss true positives, in particular if there
       are  two  alleles  in  the  sample that differ in only a single nucleotide. We have now implemented other
       measures to avoid false positives and recommend against setting the differences to something  other  than
       0.

       Read also about the cross mapping, for which germline filtering corrects, and about the germline filters.

       Changed in version The: default for the differences setting was changed from 1 to 0.

   Running IgDiscover
   Resuming failed runs
       The  command igdiscover run, which is used to start the pipeline, can also be used to resume execution if
       there was an interruption (a transient failure). Reasons for interruptions might be:

       • Ctrl+C was pressed on the keyboard

       • A full harddisk

       • If running on a cluster, the program may have been terminated because it exceeded its allocated time

       • Too little RAM

       • Power loss

       To resume execution after you have fixed the problem, go to the analysis directory and run igdiscover run
       again. It will skip the steps that have already finished successfully.  This capability  comes  from  the
       workflow  management  system  snakemake,  on  which  igdiscover  run  is based.  Snakemake will determine
       automatically which steps need to be re-run in order to get to a full result and then run only those.

       Alterations to the configuration file after an interruption are possible, but affect only steps that have
       not already finished successfully. For example, assume you interrupted a run  with  Ctrl+C  after  it  is
       already  past  the step in which barcodes are removed. Then, even if you change the barcode length in the
       configuration, the barcode removal step will not be re-run when you resume the pipeline and the  previous
       barcode length is in effect.  See also the next section.

   Changing parameters and re-running parts of the pipeline
       When you experiment with parameters in the igdiscover.yaml file, such as germline filtering criteria, you
       do  not  need  to  re-run the entire pipeline from the beginning, but can re-use the results that already
       exist. This can save a lot of processing time, in particular when you avoid re-running  IgBLAST  in  this
       way.

       As  described  in  the  previous section, igdiscover run automatically figures out which files need to be
       re-created if a run was interrupted. Unfortunately, this mechanism is currently not smart enough to  also
       look  for changes in the igdiscover.yaml file. Thus, if the full pipeline has finished successfully, then
       re-running igdiscover run will just print the message Nothing to be done.  even after  you  have  changed
       the configuration file.

       You  will  therefore  need to know yourself which file you want to regenerate.  Then follow the following
       steps. Note that these will remove parts of the existing results, and if you need to keep  them,  make  a
       copy of your analysis directory first.

       1. Change the configuration setting.

       2. Delete the file that needs to be re-generated. Assume it is filename

       3. Run  igdiscover  run filename to re-create the file. Only that file will be created, not the ones that
          usually would be created afterwards.

       4. Optionally, run igdiscover run (without a file name this time) to update the  remaining  files  (those
          that depend on the file that was just updated).

       For  example,  assume  you  want  to modify some germline filtering setting and then re-run the pipeline.
       Change the setting in your igdiscover.yaml, then run these commands:

          rm iteration-01/new_V_germline.tab
          igdiscover run iteration-01/new_V_germline.tab

       The   above   will   have   regenerated   the   iteration-01/new_V_germline.tab   file   and   also   the
       iteration-01/new_V_germline.fasta file since they are generated by the same script. If you want to update
       any other files, then also run

          igdiscover run

   The analysis directory
       IgDiscover  writes  all  intermediate  files,  the  final  V gene database, statistics and plots into the
       analysis directory that was created with igdiscover init.  Inside  that  directory,  there  is  a  final/
       subdirectory that contains the analysis results.

       These  are  the files and subdirectories that can be found in the analysis directory.  Subdirectories are
       described in detail below.

       igdiscover.yaml
              The configuration file.  Make sure to adjust this to your needs as described above.

       reads.1.fastq.gz, reads.2.fastq.gz
              Symbolic links to the raw paired-end reads.

       database/
              The input V/D/J database (as three FASTA files).  The files are a copy of the  ones  you  selected
              when running igdiscover init.

       reads/ Processed reads (merged, de-duplicated etc.)

       iteration-xx/
              Iteration-specific analysis directory, where “xx” is a number starting from 01.  Each iteration is
              run  in  one  of these directories.  The first iteration (in iteration-01) uses the original input
              database, which is also found in the database/ directory.  The database is updated and  then  used
              as input for the next iteration.

       final/ After  the  last  iteration,  IgBLAST  is  run  again  on the input sequences, but using the final
              database (the one created in the very last iteration).  This directory contains all  the  results,
              such  as  plots  of  the  repertoire  profiles.   If  you set the number of iterations to 0 in the
              configuration file, this directory is the only one that is created.

   Final results
       Final results are found in the final/ subdirectory of the analysis directory.

       final/database/(V,D,J).fasta
              These three files represent the final, individualized V/D/J database found by IgDiscover.   The  D
              and J files are copies of the original starting database; they are not updated by IgDiscover.

       final/dendrogram_(V,D,J).pdf
              These  three  PDF  files  contain  dendrograms  of  the V, D and J sequences in the individualized
              database.

       final/assigned.tab.gz
              V/D/J gene assignments and other information for each sequence.  The file is  created  by  parsing
              the  IgBLAST  output  in  the igblast.txt.gz file.  This is a table that contains one row for each
              input sequence.  See below for a detailed description of the columns.

       final/filtered.tab.gz
              Filtered V/D/J gene assignments. This is the same as the assigned.tab file  mentioned  above,  but
              with  low-quality  assignments  filtered  out.   Run igdiscover filter --help to see the filtering
              criteria.

       final/expressed_(V,D,J).tab, final/expressed_(V,D,J).pdf
              The V, D and J gene expression counts. Some assignments are filtered out to reduce  artifacts.  In
              particular,  an  allele-ratio filter of 10% is applied. For D genes, only those with an E-value of
              at most 1E-4 and a coverage of at least 70% are counted. See also  the  help  for  the  igdiscover
              count subcommand, which is used to create these files.

              The  .tab  file  contains  the  counts  as a table, while the PDF file contains a plot of the same
              values.

              These tables also exist in the iteration-specific directories (iteration-xx). For those, note that
              the numbers do not include the  genes  that  were  discovered  in  that  iteration.  For  example,
              iteration-01/expressed_V.tab shows only expression counts of the V genes in the starting database.

       final/errorhistograms.pdf
              A  PDF with one page per V gene/allele.  Each page shows a histogram of the percentage differences
              for that gene.

       final/clusterplots/
              This is a directory that contains one PNG file for each discovered gene/allele.  Each image  shows
              a clusterplot of all the sequences assigned to that gene.  Note that the shown clusterplots are by
              default  restricted  to  showing  only  at most 300 sequences, while the actual clustering used by
              IgDiscover uses 1000 sequences.

       If you are interested in the results of each iteration, you can inspect  the  iteration-xx/  directories.
       They  are structured in the same way as the final/ subdirectory, except that the results are based on the
       intermediate databases of that iteration.  They also contain the following additional files.

       iteration-xx/candidates.tab
              A table with candidate novel V alleles (or genes).  This is a list of sequences found through  the
              windowing  strategy  or  linkage  cluster  analysis,  as  discussed  in  our  paper.  See the full
              description of candidates.tab.

       iteration-xx/read_names_map.tab
              For each candidate novel V allele listed in candidates.tab, this file contains one row that  lists
              which  sequences  went into generating this candidate. Only the exact matches are listed, that is,
              the number of listed sequence names should be equal to the value in the exact column. Each line in
              this file contains tab-separated values. The first is name of the candidate, the  others  are  the
              names of the sequences. Some of these sequences may be consensus sequences if barcode grouping was
              enabled, so in that case, this will not be a read name.

       iteration-xx/new_V_germline.fasta, iteration-xx/new_V_pregermline.fasta
              The  discovered  list  of V genes for this iteration.  The file is created from the candidates.tab
              file by applying either the germline or pre-germline filter.  The file resulting from  application
              of the germline filter is used in the last iteration only.  The file resulting from application of
              the pre-germline filter is used in earlier iterations.

       iteration-xx/annotated_V_germline.tab, iteration-xx/annotated_V_pregermline.tab
              A  version  of  the  candidates.tab  file that is annotated with extra columns that describe why a
              candidate was filtered out. See the description of this file.

   Other files
       For completeness, here is a description of the files in the reads/  and  stats/  directories.   They  are
       created during pre-processing and are not iteration specific.

       reads/1-limited.1.fastq.gz, reads/1-limited.1.fastq.gz
              Input reads file limited to the first N entries. This is just a symbolic link to the input file if
              the limit configuration option is not set.

       reads/2-merged.fastq.gz
              Reads merged with PEAR or FLASH

       reads/3-forward-primer-trimmed.fastq.gz
              Merged  reads with 5' primer sequences removed. (This file is automatically removed when it is not
              needed anymore.)

       reads/4-trimmed.fastq.gz
              Merged reads with 5' and 3' primer sequences removed.

       reads/5-filtered.fasta
              Merged, primer-trimmed sequences converted to FASTA, and too short sequences removed.  (This  file
              is automatically removed when it is not needed anymore.)

       reads/sequences.fasta.gz
              Fully pre-processed sequences. That is, filtered sequences without duplicates (using VSEARCH)

       stats/reads.txt
              Statistics of pre-processed sequences.

       stats/readlengths.txt, stats/readlengths.pdf
              Histogram of the lengths of pre-processed sequences (created from reads/sequences.fasta)

   Format of output files
   assigned.tab.gz
       This  file  is a gzip-compressed table with tab-separated values. It is created by the igdiscover igblast
       subcommand and is the result of parsing raw output from IgBLAST.  It contains a  few  additional  columns
       that  do  not  come  directly  from  IgBLAST.  In particular, the CDR3 sequence is detected, the sequence
       before the V gene match is split into UTR and leader, and the RACE-specific run of G nucleotides is  also
       detected.   The  first  row is a header row with column names.  Each subsequent row describes the IgBLAST
       results for a single pre-processed input sequence.

       Note: This file is typically quite large.  LibreOffice can open the file  directly  (even  though  it  is
       compressed), but make sure you have enough RAM.

       Columns:

       count  How  many  copies of input sequence this query sequence represents. Copied from the ;size=3; entry
              in the FASTA header field that is added by VSEARCH -derep_fulllength.

       V_gene, D_gene, J_gene
              V/D/J gene match for the query sequence

       stop   whether the sequence contains a stop codon (either “yes” or “no”)

       productive

       V_covered, D_covered, J_covered
              percentage of bases of the reference gene that is covered by the bases of the query sequence

       V_evalue, D_evalue, J_evalue
              E-value of V/D/J hit

       FR1_SHM, CDR1_SHM, FR2_SHM, CDR2_SHM, FR3_SHM, V_SHM, J_SHM
              rate of somatic hypermutation (actually, an error rate)

       V_errors, J_errors
              Absolute number of errors (differences) in the V and J gene match

       UTR    Sequence of the 5' UTR (the part before the V gene match up  to,  but  not  including,  the  start
              codon)

       leader Leader sequence (the part between UTR and the V gene match)

       CDR1_nt, CDR1_aa, CDR2_nt, CDR2_aa, CDR3_nt, CDR3_aa
              nucleotide and amino acid sequence of CDR1/2/3

       V_nt, V_aa
              Nucleotide and amino acid sequence of V gene match

       V_CDR3_start
              Start  coordinate of CDR3 within V_nt. Set to zero if no CDR3 was detected.  Comparisons involving
              the V gene ignore those V bases that are part of the CDR3.

       V_end, VD_junction, D_region, DJ_junction, J_start
              nucleotide sequences for various match regions

       name, barcode, race_G, genomic_sequence
              see the following explanation

       The UTR, leader, barcode, race_G and genomic_sequence columns are filled in the following way.

       1. Split the 5' end barcode from the sequence (if barcode length is zero, this will be empty), put it  in
          the barcode column.

       2. Remove the initial run of G bases from the remaining sequence, put that in the race_G column.

       3. The remainder is put into the genomic_sequence column.

       4. If  there  is a V gene match, take the sequence before it and split it up in the following way. Search
          for the start codon and write the part before it into the UTR column. Write the part starting with the
          start column into the leader column.

   filtered.tab.gz
       This table is the same as the assigned.tab.gz table, except that rows containing low-quality matches have
       been filtered out.  Rows fulfilling any of the following criteria are filtered:

       • The J gene was not assigned

       • A stop was codon found

       • The V gene coverage is less than 90%

       • The J gene coverage is less than 60%

       • The V gene E-value is greater than 10-3

   candidates.tab
       This table contains the candidates for novel V genes found by the  discover  subcommand.   As  the  other
       files,  it  is  a  text  file  in  tab-separated  values format, with the first row containing the column
       headings.  It can be opened directly in LibreOffice, for example.

       Candidates are found by inspecting all the sequences assigned to a database gene, and clustering them  in
       multiple ways.  The candidate sequences are found by computing a consensus from each found cluster.

       Each  row  describes  a single candidate, but possibly multiple clusters.  If there are multiple clusters
       from a single gene that lead to the same consensus sequence, then they get only  one  row.   The  cluster
       column  lists  the  source clusters for the given sequence.  Duplicate sequences can still occur when two
       different genes lead to identical consensus sequences.  (These duplicated sequences  are  merged  by  the
       germline filters.)

       Below, we use the term cluster set to refer to all the sequences that are in any of the listed clusters.

       Some  clusters  lead  to  ambiguous consensus sequences (those that include N bases).  These have already
       been filtered out.

       name   The name of the candidate gene. See novel gene names.

       source The original database gene to which the sequences from this row  were  originally  assigned.   All
              candidates coming from the same source gene are grouped together.

       chain  Chain type: VH for heavy, VK for light chain lambda, VL for light chain kappa

       cluster
              From which type of cluster or clusters the consensus was computed.  If there are multiple clusters
              that  give  rise to the same consensus sequence, they are all listed here, separated by semicolon.
              A cluster name such as 2-4 is for a percentage difference window: Such a cluster consists  of  all
              sequences  assigned  to  the  source  gene that have a percentage difference to it between 2 and 4
              percent.

              A cluster name such as cl3 describes a cluster generated through linkage  cluster  analysis.   The
              clusters  are  simply named cl1, cl2, cl3 etc.  If any cluster number seems to be missing (such as
              when cl1 and cl3 occur, but not cl2), then this  means  that  the  cluster  led  to  an  ambiguous
              consensus  sequence  that  has been filtered out.  Since the cl clusters are created from a random
              subsample of the data (in order to keep computation time down), they are  never  larger  than  the
              size of the subsample (currently 1000).

              The name db represents a cluster that is identical to the database sequence.  If no actual cluster
              corresponding  to  the  database  sequence  is found, but the database sequence is expressed, a db
              cluster is inserted artificially in order to make sure that the sequence is not lost.  The cluster
              name all represents the set of all sequences assigned to the source  gene.   This  means  that  an
              unambiguous  consensus  could  be computed from all the sequences.  Typically, this happens during
              later iterations when there are no more novel  sequences  among  the  sequences  assigned  to  the
              database gene.

       cluster_size
              The  number  of  sequences  from  which the consensus was computed.  Equivalently, the size of the
              cluster set (all clusters described in this row).  Sequences that are in multiple clusters at  the
              same time are counted only once.

       Js     The number of unique J genes associated with the sequences in the cluster set.

              Consensus sequences are computed only from V gene sequences, but each V gene sequence is part of a
              full  V/D/J sequence.  We therefore know for each V sequence which J gene it was found with.  This
              number says how many different J genes were found for all sequences that the consensus in this row
              was computed from.

       CDR3s  The number of unique CDR3 sequences associated with the sequences in the cluster  set.   See  also
              the  description for the Js column.  This number says how many different CDR3 sequences were found
              for all sequences that the consensus in this row was computed from.

       exact  The number of exact occurrences of the consensus sequence among  all  sequences  assigned  to  the
              source gene, ignoring the 3' junction region.

              To clarify: While the consensus sequence is computed only from a subset of sequences assigned to a
              source  gene, all sequences assigned to the source gene are searched for exact occurrences of that
              consensus sequence.

              When comparing sequences, they are first truncated at the 3' end by removing those  (typically  8)
              bases that correspond to the CDR3 region.

       barcodes_exact
              How  many  unique  barcode  sequences  were  used  by  the sequences in the set of exact sequences
              (described above).

       Ds_exact
              How many unique D genes were used by the sequences  in  the  set  of  exact  sequences  (described
              above). Only those D gene assignments are included in this count for which the number of errors is
              zero,  the  E-value  is at most a given threshold, and for which the number of covered bases is at
              least a given percentage.

       Js_exact
              How many unique J genes were used by the sequences  in  the  set  of  exact  sequences  (described
              above).

       CDR3s_exact
              How many unique CDR3 sequences were used by the sequences in the set of exact sequences (described
              above).

       clonotypes
              The  estimated  number of clonotypes within the set of exact sequences (which is described above).
              The value is  computed  by  clustering  the  unique  CDR3  sequences  associated  with  all  exact
              occurrences,  allowing up to six differences (mismatches, insertions, deletions) and then counting
              the number of resulting clusters.

       database_diff
              The number of differences between the consensus sequence and the  sequence  of  the  source  gene.
              (Given as edit distance, that is insertion, deletion, mismatch count as one difference each.)

       has_stop
              Indicates whether the consensus sequence contains a stop codon.

       looks_like_V
              Whether  the  consensus  sequence “looks like” a true V gene (1 if yes, 0 if no).  Currently, this
              checks whether the 5' end of the sequence matches a known V gene motif.

       CDR3_start
              Where the CDR3 starts within the discovered V gene sequence. This uses the most common CDR3  start
              location among the sequences from which this consensus is derived.

       consensus
              The consensus sequence itself.

       The  igdiscover  discover command can also be run by hand with other parameters, in which case additional
       columns may appear.

       N_bases
              Number of N bases in the consensus

   annotated_V_*.tab
       The two files annotated_V_germline.tab and annotated_V_pregermline.tab are copies of  the  candidates.tab
       file  with  two  extra  columns  that  show why a candidate was filtered in the germline and pre-germline
       filtering steps. The two columns are:

          • is_filtered – This is a number that indicates how many filtering  criteria  exclude  this  candidate
            apply.

          • why_filtered – This is a semicolon-separated list of filtering reasons.

       The following values can occur in the why_filtered column:

       too_low_dbdiff
              The  number  of  differences  between  this  candidate and the database is lower than the required
              number.

       too_many_N_bases
              The candidate contains too many N nucleotide wildcard characters.

       too_low_CDR3s_exact
              The CDR3s_exact value for this candidate is lower than required.

       too_high_CDR3_shared_ratio
              The CDR3_shared_ratio is higher than the configured threshold.

       too_low_Js_exact
              The Js_exact value is lower than the configured threshold.

       has_stop
              The filter configuration disallows stop codons, but this candidate has one and is not whitelisted.

       too_low_cluster_size
              The cluster_size of this candidate is lower than the configured threshold, and  the  candidate  is
              not whitelisted.

       is_duplicate
              A filtering criterion not listed above applies to this candidate. This covers all the filters that
              need  to  compare  candidates  to  each  other: cross-mapping ratio, clonotype allele ratio, exact
              ratio, Ds_exact ratio.

   Names for discovered genes
       Each gene discovered by IgDiscover gets a unique name such as “VH4.11_S1234”.  The “VH4.11” is  the  name
       of  the  database  gene to which the novel V gene was initially assigned. The number 1234 is derived from
       the nucleotide sequence of the novel gene. That is, if you discover the same sequence  in  two  different
       runs  of the IgDiscover, or just in different iterations, the number will be the same. This may help when
       manually inspecting results.

       Be aware that you still need to check the sequence itself since even different  sequences  can  sometimes
       lead to the same number (a “hash collision”).

       The _S1234 suffixes do not accumulate.  Before IgDiscover adds the suffix in an iteration, it removes the
       suffix if it already exists.

   Subcommands
       The  igdiscover  program  has multiple subcommands.  You should already be familiar with the two commands
       init and run.  Each subcommand comes with its own help page that shows how to use that  subcommand.   Run
       the command with the --help option to see the help. For example,

          igdiscover run --help

       shows the help for the run subcommand.

       The following additional subcommands may be useful for further analysis.

       commonv
              Find common V genes between two different antibody libraries

       upstream
              Cluster upstream sequences (UTR and leader) for each gene

       dendrogram
              Draw a dendrogram of sequences in a FASTA file.

       rename Rename sequences in a target FASTA file using a template FASTA file

       union  Compute union of sequences in multiple FASTA files

       The following subcommands are used internally, and listed here for completeness.

       filter Filter a table with IgBLAST results

       count  Count and plot V, D, J gene usage

       group  Group  sequences  by  barcode  and  V/J  assignment  and  print  each group’s consensus (unused in
              IgDiscover)

       germlinefilter
              Create new V gene database from V gene candidates  using  the  germline  and  pre-germline  filter
              criteria.

       discover
              Discover candidate new V genes within a single antibody library

       clusterplot
              For each V gene, plot a clustermap of the sequences assigned to it

       errorplot
              Plot histograms of differences to reference V gene

   Germline and pre-germline filtering
       V  gene sequences found by the clustering step of the program (the discover subcommand) are stored in the
       candidates.tab file. The entries are “candidates” because many of these will be PCR  or  other  artifacts
       and  therefore  do  not  represent true novel V genes. The germline and pre-germline filters take care of
       removing artifacts. They germline filter is the “real” filter and used only  in  the  last  iteration  in
       order  to  obtain  the  final  gene  database. The pre-germline filter is less strict and used in all the
       earlier iterations.

       The germline filters are implemented  in  the  igdiscover  germlinefilter  subcommand.  It  performs  the
       following filtering and processing steps:

       • Discard sequences with N bases

       • Discard sequences that come from a consensus over too few source sequences

       • Discard sequences with too few unique CDR3s (CDR3s_exact column)

       • Discard sequences with too few unique Js (Js_exact column)

       • Discard sequences identical to one of the database sequences (if DB given)

       • Discard sequences that do not match a set of known good motifs

       • Discard sequences that contain a stop codon (has_stop column)

       • Discard near-duplicate sequences

       • Discard cross-mapping artifacts

       • Discard sequences whose “allele ratio” is too low.

       If  a  whitelist  of  sequences  is  provided  (by  default, this is the input V gene database), then the
       candidates that appear on it

       • are not checked for the cluster size criterion,

       • do not need to match a set of known good motifs,

       • are never considered near-duplicates (but they are checked for cross-mapping and for the allele ratio),

       • are allowed to contain a stop codon.

       Whitelisting allows IgDiscover to identify known germline sequences that are expressed at low levels in a
       library. If enabled with whitelist: true (the default) in the pregermline and germline filter sections of
       the configuration file, the sequences present in the starting database are treated as validated  germline
       sequences  and  will  not  be  discarded  if  due  to  too small cluster size as long as they fulfill the
       remaining criteria (unique_cdr3s, unique_js etc.).

       You can see why a candidate was filtered by inspecting the annotated_V_*.tab files

   Cross-mapping artifacts
       If two very similar sequences appear in the database used by IgBLAST, then sequencing errors may lead  to
       one  sequence  incorrectly  being  assigned  to the other. This is particularly problematic if one of the
       sequences is highly expressed while the other is not expressed at all. The not expressed sequence is even
       included in the list of V gene candidates because it is in the input database and therefore  whitelisted.
       We call this a “cross-mapping artifact”.

       The germline filtering step of IgDiscover therefore aims to eliminate cross-mapping artifacts by checking
       all pairs of sequences for the following:

       • The two sequences have a distance of 1,

       • they are both in the database for that particular iteration (only then can cross-mapping occur)

       • the  ratio  between  the  expression  levels  of the two sequences (using the cluster_size field in the
         candidates.tab file) is less than the value cross_mapping_ratio defined in the configuration file (0.02
         by default).

       If all that is the case, then the sequence with the lower expression is discarded.

   Allele-ratio filtering
       When multiple alleles of the same gene appear in the list of V gene candidates, such  as  IGHV1-2*02  and
       IGHV1-2*04,  the germline filter computes the ratio of the values in the exact and the clonotypes columns
       between them.  If the ratio is under the configured threshold, the candidate  with  the  lower  count  is
       discarded. See the exact_ratio and clonotype_ratio settings in the germline_filter and pregermline_filter
       sections of the configuration file.

       New in version 0.7.0.

   Data from the Sequence Read Archive (SRA)
       To  work with datasets from the Sequence Read Archive, you may want to use the tool fastq-dump, which can
       download the reads in the format required by IgDiscover. You just need to know the accession number, such
       as “SRR2905710” and then run this command to download the files to the current directory:

          fastq-dump --split-files --gzip SRR2905710

       The --split-files option ensures that the paired-end reads are stored in two separate files, one for  the
       forward  and  one  for  the  reverse  read,  respectively.   (If  you  do not provide it, you will get an
       interleaved FASTQ file that currently cannot be read by IgDiscover). The --gzip option creates compressed
       output.  The command creates two files in the current directory. In the  above  example,  they  would  be
       named SRR2905710_1.fastq.gz and SRR2905710_2.fastq.gz.

       The  program  fastq-dump  is  part  of  the  SRA  toolkit. On Debian-derived Linux distributions, you can
       typically install it with sudo apt-get install sra-toolkit. On Conda, install it with  conda  install  -c
       bioconda sra-tools.

   Does random subsampling influence results?
       Random  subsampling  indeed  influences  somewhat  which  sequences  are  found  by the cluster analysis,
       particularly in the beginning. However, the probability is large that all highly expressed sequences  are
       represented  in  the random sample. Also, due to the database growing with subsequent iterations, the set
       of sequences assigned to a single database gene becomes smaller  and  more  homogeneous.  This  makes  it
       increasingly likely that also sequences expressed at lower levels result in a cluster since they now make
       up a larger fraction of each subsample.

       Also,  many  of  the clusters which are captured in one subsample but not in the other are artifacts that
       are then filtered out anyway by the pre-germline or germline filter.

       On human data with a nearly complete starting database, the subsampling seems to  have  no  influence  at
       all,  as  we  determined  experimentally.  We  repeated a run of the program four times on the same human
       dataset, using identical parameters each time except that the subsampling was done in  a  different  way.
       Although  intermediate  results  differed, all four personalized databases that the program produced were
       exactly identical.

       Concordance is lower, though, when the input database is not as complete as the human one.

       The way in which random subsampling is done is modified by the seed configuration setting, which  is  set
       to  1 by default. If its value is the same for two different runs of the program with otherwise identical
       settings, the numbers chosen by the  random  number  generator  will  be  the  same  and  therefore  also
       subsampling  will  be  done in an identical way. This makes runs of the program reproducible. In order to
       test how results differ when subsampling is done in a different way,  change  the  seed  to  a  different
       value.

   Logging the program’s output to a file
       When  you  report  a  bug or unusual behavior to us, we might ask you to send us the output of igdiscover
       run. You can send its output to a file by running the program like this:

          igdiscover run >& logfile.txt

       And here is how to send the logging output to a file and also see the output in your terminal at the same
       time (but you lose the colors):

          igdiscover run |& tee logfile.txt

   Caching of IgBLAST results and of merged reads
       Sometimes you may want to re-analyze a dataset multiple times with different filter settings.   To  speed
       this up, IgDiscover can cache the results of two of the most time-consuming steps, read-merging with PEAR
       and running IgBLAST.

       The cache is disabled by default as it uses a lot of disk space. To enable the cache, create a file named
       ~/.config/igdiscover.conf with the following contents:

          use_cache: true

       If  you do so, a directory named ~/.cache/igdiscover/ is created the next time you run IgDiscover and all
       IgBLAST results as well as merged reads from PEAR are stored there.  On  subsequent  runs,  the  existing
       result  is  used  directly  without  calling  the  respective  program,  which  speeds  up  the  pipeline
       considerably.

       The cache is only used when we are certain that the results will indeed be the same. For example, if  the
       IgBLAST program version or th V/D/J database changes, the cached result is not used.

       The  files  in  the cache are compressed, but the cache may still get large over time. You can delete the
       cache with rm -r ~/.cache/igdiscover to free the space.

       You should also delete the cache when updating to a newer IgBLAST version as the old results will not  be
       used anymore.

   Terms
       Analysis directory
              The  directory  that  was  created  with  igdiscover  init.  Separate  ones  are  created for each
              experiment.  When  you  used  igdiscover  init  myexperiment,  the  analysis  directory  would  be
              myexperiment/.

       Starting database
              The  initial list of V/D/J genes. These are expected to be in FASTA format and are copied into the
              database/ directory within each analysis directory.

QUESTIONS AND ANSWERS

   How many sequences are needed to discover germline V gene sequences?
       Library sizes of several hundred thousand sequences are required for V gene discovery, with  even  higher
       numbers  necessary  for  full database production. For example, IgM library sizes of 750,000 to 1,000,000
       sequences for heavy chain databases and 1.5 to 2 million sequences for light chain databases.

   Can IgDiscover analyze IgG libraries?
       IgDiscover has been developed to identify germline databases  from  libraries  that  contain  substantial
       fractions  of  unswitched  antibody  sequences.  We  recommend  IgM  libraries  for  heavy  chain  V gene
       identification and IgKappa and  IgLambda  libraries  for  light  chain  identification.   IgDiscover  can
       identify  a  proportion of gemline sequences in IgG libraries but the process is much more efficient with
       IgM libraries, enabling the full set of germline sequences to be discovered.

   Can IgDiscover analyze a previously sequenced library?
       Yes, IgDiscover accepts both unpaired FASTQ files and paired FASTA files but the program should  be  made
       aware which is being used, see input requirements.

   Do the positions of the PCR primers make a difference to the output?
       Yes.  For  accurate V gene discovery, all primer sequences must be external to the V gene sequences.  For
       example, forward multiplex amplification primers should be present in the leader sequence or 5' UTR,  and
       reverse amplification primers should be located in the constant region, preferably close to the 5' border
       of  the constant region. Primers that are present in framework 1 region or J segments are not recommended
       for library production.

   What are the advantages to 5'-RACE compared to multiplex PCR for IgDiscover analysis?
       Both 5'-RACE and multiplex PCR have their own advantages.

       5'-RACE will enable library production from species where the upstream V gene sequence is  unknown.   The
       output  of  the  upstream  subcommand  in  IgDiscovery enables the identification of consensus leader and
       5'-UTR sequences for each of the identified germline V genes, that can subsequenctly be used  for  primer
       design for either multiplex PCR or for monoclonal antibody amplification sets.

       Multiplex  PCR is recommended for species where the upstream sequences are well characterized.  Multiplex
       amplification products are shorter than 5'-RACE products and therefore will be easier to  pair  and  will
       have less length associated sequence errors.

   What is meant by 'starting database'?
       The  starting database refers to the folder that contains the three FASTA files necessary for the process
       of iterative V gene discovery to begin. IgDiscover uses the standalone IgBLAST  program  for  comparative
       assignment  of  sequences  to  the  starting  database. Because IgBlast requires three files (for example
       V.fasta, D.fasta, J.fasta), three FASTA files should be included in the database folder for each analysis
       to proceed.

       In the case of light chains (that do not contain D segments), a dummy D segment file should  be  included
       as  IgBLAST  will  not proceed if it does not see three files in the database folder. It is sufficient to
       save the following sequence as a fasta file and rename it D.fasta, for example, for it to function as the
       dummy D.fasta file for human light chain analysis:

          >D_ummy
          GGGGGGGGGG

   How can I use the IMGT database as a starting database?
       Since we do not have permission to distribute IMGT database files with IgDiscover, you need  to  download
       them directly from IMGT.  See the section about obtaining a V/D/J database.

   How do I change the parameters of the program?
       By editing the configuration file.

   Where do I find the individualized database produced by IgDiscover?
       The  final  germline  database  in  FASTA  format  is  in  your  analysis  directory  in the subdirectory
       final/database/. The V.fasta file contains the new list of V genes. The D.fasta  and  J.fasta  files  are
       unchanged from the starting database.

       A phylogenetic tree of the V sequences can be found in final/dendrogram_V.pdf.

       For  more  details  of  how  that database was created, you need to inspect the files created in the last
       iteration of the discovery process, located in  iteration-xx,  where  xx  is  the  number  of  iterations
       configured  in  the  igdiscover.yaml configuration file. For example, if three iterations were used, look
       into iteration-03/.

       Most interesting in that folder are likely

       • the linkage cluster analysis plots in iteration-03/clusterplots/,

       • the error histograms in iteration-03/errorhistograms.pdf, which contain the windowed  cluster  analysis
         figures.

       • Details about the individualized database in new_V_germline.tab in tab-separated-value format

       The new_V_germline.fasta file is identical to the one in final/database/V.fasta

   What does the _S1234 at the end of same gene names mean?
       Please see the Section on gene names.

ADVANCED TOPICS

       IgDiscover itself does not (yet) come with all imaginable analysis facilities built into it.  However, it
       creates  many  files  (mostly  with  tables) that can be used for custom analysis.  For example, all .tab
       files (in particular assigned.tab.gz and candidates.tab) can be opened and  inspected  in  a  spreadsheet
       application such as LibreOffice. From there, you can do basic tasks such as sorting from the menu of that
       application.

       Often,  these  facilities  are  not  enough, however, and some basic understanding of the command-line is
       helpful. Clearly, this is not as convenient as working in a graphical user interface (GUI), but we do not
       currently have the resources to provide one for IgDiscover. To alleviate this somewhat, we  provide  here
       instructions for a few things that you may want to do with the IgDiscover result files.

   Extract all sequences that match any database gene exactly
       The  candidates.tab file tells you for each discovered sequence how often an exact match of that sequence
       was found in your input reads. A high number of exact matches is a good indication that the candidate  is
       actually  a new gene or allele. In order to find the original reads that correspond to those matches, you
       can look at the filtered.tab.gz file and extract all rows where the V_errors column is zero.

       First, run this on the filtered.tab.gz file:

          zcat filtered.tab.gz | head -n 1 | tr '\t' '\n' | nl

       This will enumerate the columns in the file. Take a note of the index that the V_errors  column  has.  In
       newer  pipeline versions, the index is 21. Then extract all rows of the file where that field is equal to
       zero:
          zcat filtered.tab.gz | awk -vFS="t" '$21 == 0 || NR == 1' > exact.tab

       If the column wasn’t 21, then replace the $21 appropriately. The part where it says NR == 1 ensures  that
       the column headings are also printed.

   Extra configuration settings
       Some configuration settings are not documented in the default igdiscover.yaml file since they rarely need
       to be changed.

          # Leave empty or choose a species name supported by IgBLAST:
          # human, mouse, rabbit, rat, rhesus_monkey
          # This setting is not used anywhere except that it is passed
          # to IgBLAST using the -organism option. Since we provide IgBLAST
          # with our own gene databases, it seems this has no effect.
          species:

          # Which program to use for computing multiple alignments. This is used for
          # computing consens sequences.
          # Choose 'mafft', 'clustalo', 'muscle' or 'muscle-fast'.
          # 'muscle-fast' runs muscle with parameters "-maxiters 1 -diags".
          #
          #multialign_program: muscle-fast

DEVELOPMENT

Source codeReport an issue

   Installing the development version
       To use the most recent IgDiscover version from Git, follow these steps.

       1. If  you  haven’t  done  so,  install  miniconda.  See  the  first  steps  of  the regular installation
          instructions. Do not install IgDiscover, yet!

       2. Clone the IgDiscover repository:

             git clone https://github.com/NBISweden/IgDiscover.git

          (Use the git@ URL instead if you are a developer.)

       4. Create a new Conda environment using the environment.yml file in the repository:

             cd IgDiscover
             conda env create -n igdiscover -f environment.yml

          You can choose a different environment name by changing the name after the -n parameter. This  may  be
          necessary,  when  you  already have a regular (non-developer) IgDiscover installation in an igdiscover
          environment that you don’t want to overwrite.

       5. Activate the environment:

             source activate igdiscover

          (Or use whichever name you chose above.)

       6. Install IgDiscover in “editable” mode:

             python3 -m pip install -e .

       Whenever you want to update the software:

          cd IgDiscover
          git pull

       It may also be necessary to repeat the python3 -m pip install -e . step.

   IgBLAST result cache
       For development, in particular when running tests repeatedly, you should enable the IgBLAST result cache.
       The cache stores IgBLAST output. If the same dataset with the same dataset is  run  a  second  time,  the
       result  is  retrieved  from the cache and IgBLAST is not re-run. This saves a lot of time when re-running
       datasets, but may also fill up the cache directory ~/.cache/igdiscover/.  Also, in  production,  datasets
       are usually not re-run with the same settings, which is why caching is disabled by default.

       To enable the cache, create a file ~/.config/igdiscover.conf with the following content:

          use_cache: true

       The file is in YAML format, but at the moment, no other settings are supported.

   Building the documentation
       Go to the doc/ directory in the repository, then run:

          make

       to  build  the  documentation  locally. Open _build/html/index.html in a browser. The layout is different
       from the version shown on Read the Docs, but allows you to preview any changes you may have made.

   Making a release
       We use versioneer to manage version numbers. It extracts the version number from the most recent  tag  in
       Git. Thus, to increment the version number, create a Git tag:

          git tag v0.5

       The v prefix is mandatory.

       Then:

       • tests/run.shpython3 setup.py sdisttwine upload sdist/igdiscover-0.10.tar.gz

       • Update bioconda recipe

   Removing IgDiscover from a Linux system
       If  you  have  been playing around with different installation methods (pip, conda, git, python3 setup.py
       install etc.) you may have multiple copies of IgDiscover on your system and  you  will  likely  run  into
       problems  on  updates.  Here  is  a  list  you  can  follow  in  order to get rid of the installations as
       preparation for a clean re-install. Do not add sudo to the commands below if you get permission problems,
       unless explicitly told to do so! If one of the steps does not work, that is fine, just continue.

       1. Delete  miniconda:  Run   the   command   which   conda.   The   output   will   be   something   like
          /home/myusername/miniconda3/bin/conda.  The  part  before  bin/conda  is  the  miniconda  installation
          directory. Delete that folder. In this case, you would need to delete miniconda3 in /home/myusername.

       2. Run pip3 uninstall igdiscover. If this runs successfully  and  prints  some  messages  about  removing
          files,  then  repeat  the  same  command! Do this until you get a message telling you that the package
          cannot be uninstalled because it is not installed.

       3. Repeat the previous step, but with pip3 uninstall sqt.

       4. If you have a directory named .local within your home directory, you may want to rename it: mv  .local
          dot-local-backup  You  can  also  delete  it,  but  there  is  a  small  risk that other software (not
          IgDiscover) uses that directory. The directory is hidden, so a normal ls will not show it.  Use ls -la
          while in your home directory to see it.

       5. If you have ever used sudo to install IgDiscover, you may have an installation in /usr/local/. You can
          try to remove it with sudo pip3 uninstall igdiscover.

       6. Delete the cloned Git repository if you have one. This is the directory in which you run git pull.

       Finally, you can follow  the  normal  installation  instructions  and  then  the  developer  installation
       instructions.

CHANGES

   v0.11 (2018-11-27)
       • The  IgBLAST  cache  is  now  disabled  by default. We assume that, in most cases, datasets will not be
         re-run with the exact same parameters, and then it only fills up the disk. Delete your cache with rm -r
         ~/.cache/igdiscover to reclaim the space. To enable the cache, create a file  ~/.config/igdiscover.conf
         with the contents use_cache: true.

       • If you choose to enable the cache, results from the PEAR merging step will now also be cached. See also
         the caching documentation.

       • Added  detection of chimeras to the (pre-)germline filters. Any novel allele that can be explained as a
         chimera of two unmodified reference alleles is marked in the new_V_germline.tab file.  This  is  a  bit
         sensitive, so the candidate is currently not discarded.

       • Two  additional  files  annotated_V_germline.tab  and  annotated_V_pregermline.tab  are created in each
         iteration during the germline filtering step. These are identical to the  candidates.tab  file,  except
         that  they  contain  a  why_filtered  column  that  describes  why  a  sequence  was  filtered. See the
         documentation for this feature.

       • A more realistic test dataset (v0.5), now based on human instead of  rhesus  data,  was  prepared.  The
         testing instructions have been updated accordingly.

       • J discovery has been tuned to give fewer truncated sequences.

       • Statistics are written to stats/stats.json.

       • V  SHM  distribution  plots are created automatically and written written to v-shm-distributions.pdf in
         each iteration folder.

       • An igdiscover dbdiff subcommand was added that can compare two FASTA files.

   v0.10 (2018-05-11)
       • When computing a consensus sequence, allow some sequences to be truncated in the 3' end.  Many  of  the
         discovered  novel  V  alleles  were  truncated by one nucleotide in the 3' end because IgBLAST does not
         always extend the alignment to the end of the V sequence. If these slightly too short V sequences  were
         in  the  majority,  their  consensus  would  lead  to  a  truncated sequence as well. The new consensus
         algorithm allows for this effect at the 3' end and can therefore more often than  previously  find  the
         full sequence.  Example:

            TACTGTGCGAGAGA (seq 1)
            TACTGTGCGAGAGA (seq 2)
            TACTGTGCGAGAG- (seq 3)
            TACTGTGCGAG--- (seq 4)
            TACTGTGCGAG--- (seq 5)

            TACTGTGCGAGAG  (previous consensus)
            TACTGTGCGAGAGA (new consensus)

       • Add  a  column  database_changes  to  the new_V_germline.tab file that describes how the novel sequence
         differs from the database sequence. Example: 93C>T; 114A>G

       • Allow filtering by CDR3_shared_ratio and do so by default (needs documentation)

       • Cache the edit distance when computing the distance matrix. Speeds up the discover command slightly.

       • discover: Use more than six CPU cores if available

       • igblast: Print progress every minute

   v0.9 (2018-03-22)
       • Implemented allele ratio filtering for J gene discovery

       • J genes are discovered as part of the pipeline (previously, one needed  to  run  the  discoverj  script
         manually)

       • In  each  iteration,  dendrograms are now created not only for V genes, but also for D and J genes. The
         file names are dendrogram_D.pdf, dendrogram_J.pdf

       • The V dendrograms are  now  in  dendrogram_V.pdf  (no  longer  V_dendrogram.pdf).  This  puts  all  the
         dendrograms together when looking at the files in the iteration directory.

       • The   V_usage.tab   and  V_usage.pdf  files  are  no  longer  created.   Instead,  expressed_V.tab  and
         expressed_V.pdf are created. These contain similar information, but an allele-ratio filter is  used  to
         filter out artifacts.

       • Similarly,  expressed_D.tab  and  expressed_J.tab  and  their  .pdf  counterparts  are  created in each
         iteration.

       • Removed parse subcommand (functionality is in the igblast subcommand)

       • New CDR3 detection method (only heavy chain sequences): CDR3  start/end  coordinates  are  pre-computed
         using the database V and J sequences. Increases detection rate to 99% (previously less than 90%).

       • Remove the ability to check discovered genes for required motifs. This has never worked well.

       • Add  a  column  clonotypes to the candidates.tab that tries to count how many clonotypes are associated
         with a single candidate (using only exact occurrences).  This is intended to  replace  the  CDR3s_exact
         column.

       • Add  an  exact_ratio  to  the  germline  filtering  options.  This checks the ratio between the exact V
         occurrence counts (exact column) between alleles.

       • Germline filtering option allele_ratio was renamed to clonotypes_ratio

       • Implement a cache for IgBLAST results. When the same dataset is re-analyzed,  possibly  with  different
         parameters,  the  cached  results are used instead of re-running IgBLAST, which saves a lot of time. If
         the V/D/J database or the IgBLAST version has changed, results are not re-used.

   v0.8.0 (2017-06-20)
       • Add a barcodes_exact column to the candidates table. It gives the number of  unique  barcode  sequences
         that  were  used  by  the  sequences  in  the set of exact sequences. Also, add a configuration setting
         barcode_consensus that can turn off consensus taking of barcode groups, which needs to be set to  false
         for barcodes_exact to work.

       • Add a Ds_exact column to candidates table.

       • Add a D_coverage configuration option.

       • The pre-processing filtering step no longer reads in the full table of IgBLAST assignments, but filters
         the  table  piece by piece. Memory usage for this step therefore does not depend anymore on the dataset
         size and should always be below 1 GB.

       • The functionality of the parse subcommand has been integrated into the igblast subcommand.  This  means
         that igdiscover igblast now directly outputs a result table (assigned.tab). This makes it easier to use
         that subcommand directly instead of only via the workflow.

       • The igblast subcommand now always runs makeblastdb by itself and deletes the BLAST database afterwards.
         This reduces clutter and ensures the database is always up to date.

       • Remove  the library_name configuration setting. Instead, the library_name is now always the same as the
         name of analysis directory.

   v0.7.0 (2017-05-04)
       • Add an “allele ratio” criterion to the germline filter to further reduce the number of false positives.
         The filter is activated by default and can be  configured  through  the  allele_ratio  setting  in  the
         configuration file. See the documentation for how it works.

       • Ignore the CDR3-encoding bases whenever comparing two V gene sequences.

       • Avoid finding 5'-truncated V genes by extending found hits towards the 5' end.

       • By  default,  candidate  sequences  are  no  longer  merged  if they are nearly identical. That is, the
         differences setting within the two germline filter  configuration  sections  is  now  set  to  zero  by
         default.   Previously,  we  believed the merging would remove some false positives, but it turns out we
         also miss true positives. It also seems that with the other changes in this version we also  no  longer
         get the particular false positives the setting was supposed to catch.

       • Implement  an experimental discoverj script for J gene discovery.  It is curently not run automatically
         as part of igdiscover run. See igdiscover discoverj --help for how to run it manually.

       • Add a config subcommand, which can be used to change the configuration file from the command-line.

       • Add a V_CDR3_start column to the assigned.tab/filtered.tab tables. It describes where the  CDR3  starts
         within the V sequence.

       • Similarly,  add  a  CDR3_start  column  to the new_V_germline.tab file describing where the CDR3 starts
         within a discovered V sequence.  It is computed by using the most common CDR3 start  of  the  sequences
         within the cluster.

       • Rename the compose subcommand to germlinefilter.

       • The  init  subcommand  automatically fixes certain problems in the input database (duplicate sequences,
         empty records, duplicate sequence names). Previously, it would complain, but the user would have to fix
         the problems themselves.

       • Move source code to GitHub

       • Set up automatic code testing (continuous integration) via Travis

       • Many documentation improvements

   v0.6.0 (2016-12-07)
       • The FASTA files of the input V/D/J gene lists now need to be named V.fasta, D.fasta  and  J.fasta.  The
         species name is no longer part of the file name. This should reduce confusion when working with species
         not supported by IgBLAST.

       • The  species:  configuration  setting in the configuration can (and should) now be left empty. Its only
         use was that it is passed to IgBLAST,  but  since  IgDiscover  provides  IgBLAST  with  its  own  V/D/J
         sequences anyway, it does not seem to make a difference.

       • A “cross-mapping” detection has been added, which should reduce the number of false positives.  See the
         documentation for an explanation.

       • Novel sequences identical to a database sequence no longer get the _S1234 suffix.

       • No  longer trim trim the initial G run in sequences (due to RACE) by default. It is now a configuration
         setting.

       • Add cdr3_location configuration setting: It allows to set whether to use a  CDR3  in  addition  to  the
         barcode for grouping sequences.

       • Create a groups.tab.gz file by default (describing the de-barcoded groups)

       • The   pre-processing   filter  is  now  configurable.  See  the  preprocessing_filter  section  in  the
         configuration file.

       • Many improvements to the documentation

       • Extended and fixed unit tests. These are now run via a CI system.

       • Statistics in JSON format are written to stats/stats.json.

       • IgBLAST 1.5.0 output can now be parsed. Parsing is also faster by 25%.

       • More helpful warning message when no sequences were discovered in an iteration.

       • Drop support for Python 3.3.

   v0.5 (2016-09-01)
       • V sequences of the input database are now  whitelisted  by  default.   The  meaning  of  the  whitelist
         configuration  option  has  changed:  If  set  to false, those sequences are no longer whitelisted.  To
         whitelist additional sequences, create a whitelist.fasta file as before.

       • Sequences with stop codons are now filtered out by default.

       • Use more stringent germline filtering parameters by default.

   v0.4 (2016-08-24)
       • It is now possible to install and run IgDiscover on OS X. Appropriate Conda packages are  available  on
         bioconda.

       • Add  column  has_stop to candidates.tab, which indicates whether the candidate sequence contains a stop
         codon.

       • Add a configuration  option  that  makes  it  possible  to  disable  the  5'  motif  check  by  setting
         check_motifs: false (the looks_like_V column is ignored in this case).

       • Make  it  possible  to  whitelist  known sequences: If a found gene candidate appears in that list, the
         sequence is included in the list of  discovered  sequences  even  when  it  would  otherwise  not  pass
         filtering  criteria.  To  enable  this, just add a whitelist.fasta file to the project directory before
         starting the analysis.

       • The criteria for germline filter and pre-germline filter are now configurable: See germline_filter  and
         pre_germline_filter sections in the configuration file.

       • Different  runs  of  IgDiscover with the same parameters on the same input files will now give the same
         results. See the seed parameter in the configuration, also on how to get  non-reproducible  results  as
         before.

       • Both  the  germline  and  pre-germline  filter  are  now  applied  in  each  iteration.  Instead of the
         new_V_database.fasta  file,  two  files  named  new_V_germline.fasta  and  new_V_pregermline.fasta  are
         created.

       • The compose subcommand now outputs a filtered version of the candidates.tab file in addition to a FASTA
         file.  The  table  contains  columns  closest_whitelist,  which  is  the  name of the closest whitelist
         sequence, and whitelist_diff, which is the number of differences to that whitelist sequence.

   v0.3 (2016-08-08)
       • Optionally, sequences are not renamed in the assigned.tab file, but retain their original  name  as  in
         the FASTA or FASTQ file. Set rename: false in the configuration file to get this behavior.

       • Started an “advanced” section in the manual.

   v0.2
       • IgDiscover can now also detect kappa and lambda light chain V genes (VK, VL)

AUTHOR

       Marcel Martin

COPYRIGHT

       2015-2023, Marcel Martin

0.11                                              Mar 10, 2023                                     IGDISCOVER(1)