Provided by: phast_1.6+dfsg-4_amd64

NAME

       pbsTrain - Estimate a discrete encoding scheme for probabilistic biological sequences

DESCRIPTION

       Estimate a discrete encoding scheme for probabilistic biological sequences (PBSs) based on training data.
       Input file should be a table of probability vectors, with a row for each distinct vector, and a column of
       counts  (positive  integers)  followed  by  d  columns  for the elements of the d-dimensional probability
       vectors (see example below).  It may be produced with 'prequel' using the --suff-stats option.  Output is
       a code file that can be used with pbsEncode, pbsDecode, etc.  By default, a code of size 255 is  created,
       so  that  encoded  PBSs  can  be  represented with one byte per position (the 256th letter in the code is
       reserved for gaps).  The --nbytes option allows larger codes to be created, if desired.
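
       The relationship between the number of bytes per position and the code size can be checked with a
       short calculation.  The following is an illustrative Python sketch, not part of PHAST; the function
       name code_size is hypothetical, and the formula 256^b - 1 is the one given under --nbytes.

```python
def code_size(nbytes: int) -> int:
    """Number of usable code letters for a given bytes-per-position setting."""
    if not 1 <= nbytes <= 4:
        raise ValueError("nbytes must be between 1 and 4")
    # 256**nbytes distinct byte patterns, minus one letter reserved for gaps.
    return 256 ** nbytes - 1


for b in (1, 2, 3, 4):
    print(b, code_size(b))
```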

       The code is estimated by a two-part procedure designed to minimize the "training error" (defined  as  the
       total  KL  divergence) of the encoded training data with respect to the original training data.  First, a
       "grid" is defined for the probability simplex,  partitioning  it  into  regions  that  intersect  "cells"
       (hypercubes) in a matrix in d-dimensional space.  This grid has n "rows" per dimension.  By default, n is
       given  the  largest  possible  value such that the number of simplex regions is no larger than the target
       code size, but smaller values of n can be specified using --nrows.  Each simplex  region  is  assigned  a
       letter  in  the  code, and the representative point for that letter is set equal to the mean (weighted by
       the counts) of all vectors in the training data that fall in that region.  This can be shown to  minimize
       the  training  error  for  this  initial  encoding  scheme.   (If  no  vectors fall in a region, then the
       representative point is set equal to the centroid of the region, which  can  be  shown  to  minimize  the
       expected KL divergence of points uniformly distributed in the region.)
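
       The first part of the procedure can be sketched roughly as follows.  This is a simplified Python
       illustration, not the PHAST implementation.  It assumes that a vector p falls in the cell with index
       floor(p_i * n) along each dimension, and that a cell counts as a simplex region only when the simplex
       passes through its interior; both are assumptions, as are all function names.

```python
from collections import defaultdict
from itertools import product


def count_regions(n, d):
    """Count grid cells whose interior is crossed by the simplex sum(p) == 1."""
    total = 0
    for cell in product(range(n), repeat=d):
        s = sum(cell)
        # Inside the cell each coordinate lies in (k_i/n, (k_i+1)/n), so the
        # coordinate sum lies in (s/n, (s+d)/n); it can equal 1 iff s < n < s+d.
        if s < n < s + d:
            total += 1
    return total


def max_rows(d, code_size):
    """Largest n whose region count does not exceed code_size."""
    n = 1
    while count_regions(n + 1, d) <= code_size:
        n += 1
    return n


def cell_of(p, n):
    """Grid cell containing p; a coordinate equal to 1.0 goes in the last row."""
    return tuple(min(int(x * n), n - 1) for x in p)


def region_means(rows, n):
    """Count-weighted mean vector per occupied cell; rows are (count, vector)."""
    sums, weights = {}, defaultdict(int)
    for count, p in rows:
        key = cell_of(p, n)
        acc = sums.setdefault(key, [0.0] * len(p))
        for i, x in enumerate(p):
            acc[i] += count * x
        weights[key] += count
    return {k: [v / weights[k] for v in acc] for k, acc in sums.items()}
```

       Under this counting rule, max_rows(4, 255) evaluates to 7, which agrees with the "##NROWS = 7" line
       in the example code file below.  Regions with no training vectors are not handled here; the text
       above says they receive the centroid of the region instead.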

       In the second part of the estimation procedure, the remaining letters in the code are defined by a greedy
       algorithm,  which  attempts  to  further minimize the training error.  Briefly, on each step, the simplex
       region with the largest contribution to the total error is identified, and the next letter in the code is
       assigned to that region.  In this new encoding, there are multiple letters, hence multiple representative
       points, per region; the representative point for a given vector is taken to be the closest, in  terms  of
       KL  divergence,  of  the  representative  points  associated with the simplex region in which that vector
       falls.  When a new representative point is added to a region, all representative points for  that  region
       are  reoptimized using a k-means type algorithm.  This procedure is repeated, letter by letter, until the
       number of code letters equals the target code size.
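
       The reoptimization within a single region can be sketched as a k-means-style loop.  The Python below
       is a minimal illustration under stated assumptions, not the PHAST code: it takes the divergence to be
       D(p || q) with p the original vector and q the representative point, measures it in bits, and uses
       hypothetical function names.  How the next region is chosen and where a new point is first placed
       are not shown.

```python
import math


def kl_bits(p, q):
    """KL divergence D(p || q) in bits; terms with p_i == 0 contribute zero."""
    total = 0.0
    for x, y in zip(p, q):
        if x > 0:
            if y <= 0:
                return math.inf  # q gives zero probability to a possible outcome
            total += x * math.log2(x / y)
    return total


def nearest(p, reps):
    """Index of the representative point closest to p in KL divergence."""
    return min(range(len(reps)), key=lambda j: kl_bits(p, reps[j]))


def reoptimize(rows, reps, iterations=20):
    """Refine the representatives of one region; rows are (count, vector)."""
    reps = [list(r) for r in reps]
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest representative.
        groups = [[] for _ in reps]
        for count, p in rows:
            groups[nearest(p, reps)].append((count, p))
        # Update step: move each representative to its group's weighted mean.
        new_reps = []
        for rep, group in zip(reps, groups):
            total = sum(c for c, _ in group)
            if total == 0:
                new_reps.append(rep)  # empty group: leave the point in place
                continue
            new_reps.append([sum(c * p[i] for c, p in group) / total
                             for i in range(len(rep))])
        if new_reps == reps:
            break  # assignments are stable, so further passes change nothing
        reps = new_reps
    return reps
```

       The update step uses the weighted mean because, for a fixed group of vectors, the mean is the point q
       that minimizes the summed divergence D(p || q).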

EXAMPLE

       Generate training data using prequel:

              prequel --suff-stats mammals.fa mytree.mod training

       A file called "training.stats" will be generated.  It will look something like this:

              #count  p(A)      p(C)      p(G)      p(T)
              170085  0.043485  0.797886  0.029534  0.129096
              158006  0.191119  0.046081  0.695205  0.067595
              221937  0.047309  0.122834  0.043852  0.786004
              221585  0.781156  0.044520  0.126179  0.048146
              159472  0.067254  0.697947  0.045959  0.188840
              ...
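
       A table of this kind is straightforward to read in a script.  The sketch below is illustrative
       Python; it assumes whitespace-separated columns and a header line starting with "#", and the function
       name parse_stats is hypothetical.

```python
def parse_stats(lines):
    """Yield (count, probability_vector) for each data row."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the "#count ..." header
        fields = line.split()
        yield int(fields[0]), [float(x) for x in fields[1:]]


sample = """\
#count  p(A)      p(C)      p(G)      p(T)
170085  0.043485  0.797886  0.029534  0.129096
158006  0.191119  0.046081  0.695205  0.067595
"""
rows = list(parse_stats(sample.splitlines()))
for count, vector in rows:
    # Each vector should sum to 1 up to rounding in the printed digits.
    print(count, vector, round(sum(vector), 4))
```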

       Now estimate a code from the training data:

              pbsTrain training.stats > mammals.code

       The code file contains some metadata followed by a list of code indices and representative points, e.g.,

              ##NROWS = 7
              ##DIMENSION = 4
              ##NBYTES = 1
              ##CODESIZE = 255
              # Code generated by pbsTrain, with argument(s) "training.stats"
              # acs, Mon Jul 18 23:29:07 2005
              # Average training error = 0.001298 bits

       Each index of the code is shown below with its representative probability vector (p1, p2, ..., pd).

              #code_index p1 p2 ...
              0       0.107143  0.107143  0.107143  0.678571
              1       0.033226  0.093854  0.031987  0.840933
              2       0.000059  0.001645  0.000111  0.998185
              3       0.139270  0.021059  0.278993  0.560678
              ...
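
       A code file with this layout can be loaded with a few lines of script.  The following Python sketch
       assumes that metadata lines start with "##" and have the form KEY = integer, that other lines
       starting with "#" are comments, and that data lines are whitespace-separated; the function name
       parse_code is hypothetical.

```python
def parse_code(lines):
    """Return (metadata dict, {code_index: representative vector})."""
    meta, points = {}, {}
    for line in lines:
        line = line.strip()
        if line.startswith("##"):
            # Metadata such as "##NROWS = 7".
            key, _, value = line[2:].partition("=")
            meta[key.strip()] = int(value)
        elif line and not line.startswith("#"):
            # Data row: code index followed by the representative point.
            fields = line.split()
            points[int(fields[0])] = [float(x) for x in fields[1:]]
    return meta, points


sample = """\
##NROWS = 7
##DIMENSION = 4
##NBYTES = 1
##CODESIZE = 255
# Average training error = 0.001298 bits
#code_index p1 p2 ...
0       0.107143  0.107143  0.107143  0.678571
1       0.033226  0.093854  0.031987  0.840933
"""
meta, points = parse_code(sample.splitlines())
print(meta)
print(points[1])
```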

       The reported "average training error" is the training error divided by the number of data points (the sum
       of the counts).
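
       As a rough illustration of this definition, the average can be computed as below.  The Python is a
       sketch under assumptions: the divergence is taken as D(p || q) in bits, with p the original vector
       and q its representative point, and encode is a hypothetical callable that returns the
       representative point for a vector.

```python
import math


def average_error_bits(rows, encode):
    """Count-weighted total KL divergence divided by the sum of the counts."""
    total_error = 0.0
    total_count = 0
    for count, p in rows:
        q = encode(p)
        # D(p || q) in bits; components with p_i == 0 contribute nothing.
        divergence = sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)
        total_error += count * divergence
        total_count += count
    return total_error / total_count


# Toy data: three copies of a vector encoded exactly, and one vector that
# costs exactly one bit, giving an average of 1 / 4 bits.
rows = [(3, [0.5, 0.5]), (1, [1.0, 0.0])]
print(average_error_bits(rows, lambda p: [0.5, 0.5]))
```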

OPTIONS


       --nrows, -n <n>

              Number of "rows" per dimension in the simplex grid.  Default is the maximum possible for the
              code size.

       --nbytes, -b <b>

              Number of bytes per encoded probabilistic base (default 1).  The size of the code will be 256^b  -
              1  (one  letter in the code is reserved for gaps).  Values as large as 4 are allowed for b, but in
              the current implementation, performance considerations effectively limit it to 2 or 3.

       --no-greedy, -G

              Skip greedy optimization; just assign a single representative point to each region of the
              probability simplex, equal to the (weighted) mean of all vectors from the training data that
              fall in that region.

       --no-train, -x <dim>

              Ignore the data entirely; just use the centroid of each simplex partition.  The dimension  of  the
              simplex must be given (<dim>) but no data file is required.

       --log, -l <file>

              Write a log of the optimization procedure to the specified file.

       --help, -h

              Print this help message.

pbsTrain 1.4                                        May 2016                                         PBSTRAIN(1)