Provided by: autoclass_3.3.6.dfsg.2-1_amd64

NAME

       autoclass - automatically discover classes in data

SYNOPSIS

       autoclass -search data_file header_file model_file s_param_file
       autoclass -report results_file search_file r_params_file
       autoclass -predict results_file search_file results_file

DESCRIPTION

       AutoClass  solves  the problem of automatic discovery of classes in data (sometimes called clustering, or
       unsupervised learning), as distinct from the generation  of  class  descriptions  from  labeled  examples
       (called  supervised  learning).   It  aims  to  discover the "natural" classes in the data.  AutoClass is
       applicable to observations of things that can be described by a set of attributes, without  referring  to
       other  things.   The  data values corresponding to each attribute are limited to be either numbers or the
       elements of a fixed set of symbols.  With numeric data, a measurement error must be provided.

       AutoClass is looking for the best classification(s) of  the  data  it  can  find.   A  classification  is
       composed of:

       1)     A  set  of classes, each of which is described by a set of class parameters, which specify how the
              class is distributed along the various attributes.  For example, "height normally distributed with
              mean 4.67 ft and standard deviation .32 ft",

       2)     A set of class weights, describing what percentage of cases are likely to be in each class.

       3)     A probabilistic assignment of cases in the data  to  these  classes.   I.e.  for  each  case,  the
              relative probability that it is a member of each class.

       As  a  strictly Bayesian system (accept no substitutes!), the quality measure AutoClass uses is the total
       probability that, had you known nothing about your data or its domain, you would have found this  set  of
       data generated by this underlying model.  This includes the prior probability that the "world" would have
       chosen  this  number  of classes, this set of relative class weights, and this set of parameters for each
       class, and the likelihood that such a set of classes would have generated this  set  of  values  for  the
       attributes in the data cases.

       These  probabilities  are typically very small, in the range of e^-30000, and so are usually expressed in
       exponential notation.

       When run with the -search command, AutoClass searches for a classification.  The required  arguments  are
       the  paths  to  the  four input files, which supply the data, the data format, the desired classification
       model, and the search parameters, respectively.

       By default, AutoClass writes intermediate results in a binary file.  With the -report command,  AutoClass
       generates an ASCII report.  The arguments are the full path names of the .results, .search, and .r-params
       files.

       When run with the -predict command, AutoClass predicts the class membership of a "test" data set based on
       classes found in a "training" data set (see "PREDICTIONS" below).

INPUT FILES

       An  AutoClass data set resides in two files.  There is a header file (file type "hd2") that describes the
       specific data format and attribute definitions.  The actual data values are in a  data  file  (file  type
       "db2").   We  use  two files to allow editing of data descriptions without having to deal with the entire
       data set.  This makes it easy to experiment with different descriptions of the database without having to
       reproduce the data set.  Internally, an AutoClass database structure is identified by its header and data
       files, and the number of data loaded.

       For more detailed information on the formats of these  files,  see  /usr/share/doc/autoclass/preparation-
       c.text.

   DATA FILE
       The  data file contains a sequence of data objects (datum or case) terminated by the end of the file. The
       number of values for each data object must be equal to the number of attributes  defined  in  the  header
       file.  Data objects must be groups of tokens delimited by "new-line".  Attributes are typed as REAL,
       DISCRETE, or DUMMY.  Real attribute values are numbers, either integer or floating point.  Discrete
       attribute values can be strings, symbols, or integers.  A dummy attribute value can be any of these
       types.  Dummies are read in but otherwise ignored -- they will be set to zeros in the internal
       database, so the actual values will not be available for use in report output.  To have these
       attribute values available, use either type REAL or type DISCRETE, and define their model type as
       IGNORE in the .model file.  Missing values for any attribute type may be represented by "?" or by
       another token specified in the header file.  All are translated to a special unique value after being
       read, so this symbol is effectively reserved for unknown/missing values.

       For example:
             white       38.991306 0.54248405  2 2 1
             red         25.254923 0.5010235   9 2 1
             yellow      32.407973 ?           8 2 1
             all_white   28.953982 0.5267696   0 1 1
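        The rows above could be tokenized as follows.  This is an illustrative Python sketch, not AutoClass
        code; it assumes whitespace separators and the default "?" unknown token:

```python
# Parse db2-style rows: whitespace-separated tokens, "?" marks a
# missing value.  The attribute type list is assumed for illustration.
UNKNOWN = "?"

def parse_row(line, types):
    """types: one of 'real' or 'discrete' per attribute."""
    values = []
    for token, att_type in zip(line.split(), types):
        if token == UNKNOWN:
            values.append(None)          # missing value
        elif att_type == "real":
            values.append(float(token))
        else:
            values.append(token)         # discrete: keep as a symbol
    return values

row = parse_row("yellow      32.407973 ?           8 2 1",
                ["discrete", "real", "real",
                 "discrete", "discrete", "discrete"])
```

        A real loader would also honor separator_char, comment_char, and unknown_token from the header file.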

   HEADER FILE
        The header file specifies the data file format, and the definitions of the data attributes.  The
        header file specification consists of two parts -- the data set format definition specifications,
        and the attribute descriptors.  ";" in column 1 identifies a comment.

       A header file follows this general format:

           ;; num_db2_format_defs value (number of format def lines
           ;; that follow), range of n is 1 -> 5
           num_db2_format_defs n
           ;; number_of_attributes token and value required
           number_of_attributes <as required>
           ;; following are optional - default values are specified
           separator_char  ' '
           comment_char    ';'
           unknown_token   '?'
           separator_char  ','

           ;; attribute descriptors
           ;; <zero-based att#>  <att_type>  <att_sub_type>  <att_description>
           ;; <att_param_pairs>

       Each attribute descriptor is a line of:

             Attribute index (zero based, beginning in column 1)
             Attribute type.  See below.
             Attribute subtype.  See below
             Attribute description: symbol (no embedded blanks) or
                   string; <= 40 characters
             Specific property and value pairs.
                   Currently available combinations:

                type           subtype         property type(s)
                ----           --------        ---------------
                dummy          none/nil        --
                discrete       nominal         range
                real           location        error
                real           scalar          zero_point rel_error

       The  ERROR  property should represent your best estimate of the average error expected in the measurement
       and recording of that real attribute.  Lacking better information, the error can  be  taken  as  1/2  the
       minimum  possible  difference  between  measured  values.   It  can  be argued that real values are often
       truncated, so that smaller errors may be justified, particularly for generated data.  But AutoClass  only
       sees  the  recorded  values.   So  it  needs  the  error  in  the recorded values, rather than the actual
       measurement error.  Setting this error much smaller than the minimum expressible difference  implies  the
       possibility  of values that cannot be expressed in the data.  Worse, it implies that two identical values
       must represent measurements that were much closer than they might actually  have  been.   This  leads  to
       over-fitting of the classification.
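        The "1/2 the minimum possible difference" rule of thumb above can be computed directly from the
        recorded values.  A hedged Python sketch (the helper name is mine, not AutoClass's):

```python
def default_error(values):
    """Half the minimum gap between distinct recorded values --
    a fallback ERROR estimate when nothing better is known."""
    distinct = sorted(set(values))
    gaps = [b - a for a, b in zip(distinct, distinct[1:])]
    return min(gaps) / 2.0 if gaps else None

# e.g. weights recorded to a 0.1 kg resolution give an error of ~0.05
error = default_error([5.0, 5.1, 5.3, 7.2])
```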

        The REL_ERROR property is used for SCALAR reals, whose error is taken to be proportional to the
        measured value.  For SCALAR reals the ERROR property is not supported; use REL_ERROR instead.

       AutoClass uses the error as a lower bound on the width  of  the  normal  distribution.   So  small  error
       estimates  tend  to give narrower peaks and to increase both the number of classes and the classification
       probability.  Broad error estimates tend to limit the number of classes.

        The scalar ZERO_POINT property is the smallest value that the measurement process could have produced.
        This is often 0.0, or less by some error range.  Similarly, the bounded real's min and max properties
        are exclusive bounds on the attribute's generating process.  For a calculated percentage these would
        be 0-e and 100+e, where e is an error value.  The discrete attribute's range is the number of possible
        values the attribute can take on.  This range must include unknown as a value when such values occur.

       Header File Example:

       !#; AutoClass C header file -- extension .hd2
       !#; the following chars in column 1 make the line a comment:
       !#; '!', '#', ';', ' ', and '\n' (empty line)

       ;#! num_db2_format_defs <num of def lines -- min 1, max 4>
       num_db2_format_defs 2
       ;; required
       number_of_attributes 7
       ;; optional - default values are specified
       ;; separator_char  ' '
       ;; comment_char    ';'
       ;; unknown_token   '?'
       separator_char     ','

        ;; <zero-based att#>  <att_type>  <att_sub_type>  <att_description>
        ;; <att_param_pairs>
       0 dummy nil       "True class, range = 1 - 3"
       1 real location "X location, m. in range of 25.0 - 40.0" error .25
       2 real location "Y location, m. in range of 0.5 - 0.7" error .05
        3 real scalar   "Weight, kg. in range of 5.0 - 10.0" zero_point 0.0 rel_error .001
       4 discrete nominal  "Truth value, range = 1 - 2" range 2
       5 discrete nominal  "Color of foobar, 10 values" range 10
       6 discrete nominal  Spectral_color_group range 6

   MODEL FILE
       A  classification  of  a  data  set  is  made  with  respect  to  a model which specifies the form of the
       probability distribution function for classes in that data set.  Normally the model structure is  defined
       in  a  model  file  (file  type  "model"), containing one or more models.  Internally, a model is defined
       relative to a particular database.  Thus it is identified by  the  corresponding  database,  the  model's
       model file and its sequential position in the file.

       Each  model  is  specified by one or more model group definition lines.  Each model group line associates
       attribute indices with a model term type.

       Here is an example model file:

       # AutoClass C model file -- extension .model
       model_index 0 7
       ignore 0
       single_normal_cn 3
       single_normal_cn 17 18 21
       multi_normal_cn 1 2
       multi_normal_cn 8 9 10
       multi_normal_cn 11 12 13
       single_multinomial default

       Here, the first line is a comment.  The following characters in column 1 make the line  a  comment:  `!',
       `#', ` ', `;', and `\n' (empty line).

       The  tokens  "model_index  n  m"  must  appear  on the first non-comment line, and precede the model term
       definition lines. n is the zero-based model index, typically 0 where there  is  only  one  model  --  the
       majority of search situations.  m is the number of model term definition lines that follow.

       The last seven lines are model group lines.  Each model group line consists of:

       A model term type (one of single_multinomial, single_normal_cm, single_normal_cn, multi_normal_cn, or
           ignore).

       A list of attribute indices (the attribute set list), or the symbol default.  Attribute indices are zero-
           based.  Single model terms may have one or more attribute indices on each line, while multi model
           terms require two or more attribute indices per line.  An attribute index must not appear more than
           once in a model list.
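        The index rules above can be sketched as a quick consistency check.  This illustrative Python is not
        part of AutoClass; the term-type grouping simply restates the list in this page:

```python
# Model term types from this page, grouped by arity requirement.
SINGLE = {"single_multinomial", "single_normal_cn", "single_normal_cm"}
MULTI = {"multi_normal_cn"}

def check_terms(terms):
    """terms: list of (term_type, [zero-based attribute indices])."""
    seen = set()
    for term_type, atts in terms:
        if term_type in MULTI and len(atts) < 2:
            raise ValueError("multi term needs >= 2 attributes")
        if term_type in SINGLE and len(atts) < 1:
            raise ValueError("single term needs >= 1 attribute")
        for a in atts:
            if a in seen:
                raise ValueError("attribute %d listed twice" % a)
            seen.add(a)
    return True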

       Notes:

       1)     At least one model definition is required (model_index token).

       2)     There may be multiple entries in a model for any model term type.

       3)     Model term types currently consist of:

              single_multinomial
                     models discrete attributes as multinomials, with missing values.

              single_normal_cn
                     models real valued attributes as normals; no missing values.

              single_normal_cm
                     models real valued attributes with missing values.

              multi_normal_cn
                     is a covariant normal model without missing values.

              ignore allows  the  model  to  ignore one or more attributes.  ignore is not a valid default model
                     term type.

              See the documentation in models-c.text for further information about specific model terms.

        4)     Single_normal_cn, single_normal_cm, and multi_normal_cn modeled data, whose subtype is scalar
               (value distribution is away from 0.0, and is thus not a "normal" distribution), will be log
               transformed and modeled with the log-normal model.  For data whose subtype is location (value
               distribution is around 0.0), no transform is done, and the normal model is used.

SEARCHING

        AutoClass, when invoked in the "search" mode, will check the validity of the set of data, header,
        model, and search parameter files.  Errors will stop the search from starting, and warnings will ask
        the user whether to continue.  A history of the error and warning messages is saved, by default, in
        the log file.

        Once you have succeeded in describing your data with a header file and model file that pass the
        AUTOCLASS -SEARCH <...> input checks, you will have entered the search domain where AutoClass
        classifies your data.  (At last!)

       The main function to use in finding a good classification of your data is AUTOCLASS -SEARCH, and using it
       will take most of the computation time.  Searches are invoked with:

       autoclass -search <.db2 file path> <.hd2 file path>
            <.model file path> <.s-params file path>

       All  files  must  be  specified  as fully qualified relative or absolute pathnames.  File name extensions
       (file types) for all files are forced to canonical values required by the AutoClass program:

               data file   ("ascii")   db2
               data file   ("binary")  db2-bin
               header file             hd2
               model file              model
               search params file      s-params

       The sample-run (/usr/share/doc/autoclass/examples/) that comes with AutoClass shows some sample searches,
       and browsing these is probably the fastest way to get familiar with how to do searches.   The  test  data
       sets  located  under  /usr/share/doc/autoclass/examples/  will  show  you some other header (.hd2), model
       (.model), and search params (.s-params) file setups.  The remainder of this section describes how  to  do
       searches in somewhat more detail.

       The  bold faced tokens below are generally search params file parameters.  For more information on the s-
       params file, see SEARCH PARAMETERS below, or /usr/share/doc/autoclass/search-c.text.gz.

   WHAT RESULTS ARE
        AutoClass is looking for the best classification(s) of the data it can find.  A classification is
        composed of a set of classes (each described by class parameters), a set of class weights, and a
        probabilistic assignment of cases to classes.  Its quality measure is the total probability that,
        absent any prior knowledge of the data, this underlying model would have generated this data set.
        See DESCRIPTION above for the full statement.

   WHAT RESULTS MEAN
       It is important to remember that all of these probabilities are GIVEN that the real model is in the model
       family that AutoClass has restricted its attention to.  If AutoClass is looking for Gaussian classes  and
       the  real  classes  are  Poisson,  then the fact that AutoClass found 5 Gaussian classes may not say much
       about how many Poisson classes there really are.

       The relative probability between different classifications found can be very large, like e^1000,  so  the
       very  best classification found is usually overwhelmingly more probable than the rest (and overwhelmingly
       less probable than any better classifications as yet undiscovered).  If AutoClass should manage  to  find
       two  classifications  that are within about exp(5-10) of each other (i.e. within 100 to 10,000 times more
       probable) then you should consider them to be about equally probable, as our computation is  usually  not
       more accurate than this (and sometimes much less).
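        Because the raw probabilities underflow anything representable, such comparisons are made on log
        probabilities.  A Python sketch of the "within exp(5-10)" rule above (the function and the slack
        value of 10 are illustrative):

```python
def about_equally_probable(log_p1, log_p2, slack=10.0):
    """Treat two classifications as ties when their log probabilities
    differ by less than ~5-10, per the accuracy caveat above."""
    return abs(log_p1 - log_p2) <= slack

# exp(7) apart, i.e. roughly a 1000-fold probability ratio:
# within computational noise, so call it a tie.
tie = about_equally_probable(-30000.0, -30007.0)
```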

   HOW IT WORKS
        AutoClass repeatedly creates a random classification and then tries to massage this into a high
        probability classification through local changes, until it converges to some "local maximum".  It
        then remembers what it found and starts over again, continuing until you tell it to stop.  Each
        effort is called a "try", and the computed probability is intended to cover the whole volume in
        parameter space around this maximum, rather than just the peak.

       The standard approach to massaging is to

       1)     Compute the probabilistic class memberships of cases using the class parameters  and  the  implied
              relative likelihoods.

       2)     Using the new class members, compute class statistics (like mean) and revise the class parameters.

        and repeat till they stop changing.  There are three available convergence algorithms:
        "converge_search_3" (the default), "converge_search_4", and "converge".  Their specification is
        controlled by the search params file parameter try_fn_type.
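        The two-step loop above is essentially expectation-maximization.  A minimal one-dimensional Gaussian
        sketch in Python (illustrative only; it holds the class sigmas fixed for brevity and is not
        AutoClass's implementation):

```python
import math

def em_step(data, means, sigmas, weights):
    """One massaging cycle for a 1-D mixture of Gaussians."""
    # 1) Compute probabilistic class memberships from the current
    #    class parameters and the implied relative likelihoods.
    resp = []
    for x in data:
        ps = [w * math.exp(-((x - m) ** 2) / (2.0 * s * s)) / s
              for m, s, w in zip(means, sigmas, weights)]
        total = sum(ps)
        resp.append([p / total for p in ps])
    # 2) Using the new memberships, compute class statistics and
    #    revise the class parameters (sigmas held fixed here).
    new_means, new_weights = [], []
    for j in range(len(means)):
        n_j = sum(r[j] for r in resp)
        new_weights.append(n_j / len(data))
        new_means.append(sum(r[j] * x for r, x in zip(resp, data)) / n_j)
    return new_means, new_weights

# Two well-separated clusters: one cycle already lands near their means.
means, weights = em_step([0.0, 0.1, 10.0, 10.1], [0.0, 10.0],
                         [1.0, 1.0], [0.5, 0.5])
```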

   WHEN TO STOP
        You can tell AUTOCLASS -SEARCH to stop by: 1) giving a max_duration (in seconds) argument at the
        beginning; 2) giving a max_n_tries (an integer) argument at the beginning; or 3) by typing a "q" and
        <return> after you have seen enough tries.  The max_duration and max_n_tries arguments are useful if
        you desire to run AUTOCLASS -SEARCH in batch mode.  If you are restarting AUTOCLASS -SEARCH from a
        previous search, the value of max_n_tries you provide, for instance 3, will tell the program to
        compute 3 more tries in addition to however many it has already done.  The same incremental behavior
        is exhibited by max_duration.
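        For batch runs these stopping parameters are placed in the .s-params file.  A hypothetical fragment,
        using the token = value syntax of the example .s-params files under
        /usr/share/doc/autoclass/examples/:

```
;; stop this run after an hour, or after 50 more tries,
;; whichever comes first
max_duration = 3600
max_n_tries = 50
```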

        Deciding when to stop is a judgment call and it's up to you.  Since the search includes a random
        component, there's always the chance that if you let it keep going it will find something better.
        So you need to trade off how much better it might be with how long it might take to find it.  The
        search status reports that are printed when a new best classification is found are intended to
        provide you information to help you make this tradeoff.

       One clear sign that you should probably stop is if most of the classifications found  are  duplicates  of
       previous  ones (flagged by "dup" as they are found).  This should only happen for very small sets of data
       or when fixing a very small number of classes, like two.

        Our experience is that for moderately large to extremely large data sets (~200 to ~10,000 cases), it
        is necessary to run AutoClass for at least 50 trials.

   WHAT GETS RETURNED
       Just  before returning, AUTOCLASS -SEARCH will give short descriptions of the best classifications found.
       How many will be described can be controlled with n_final_summary.

        By default AUTOCLASS -SEARCH will write out a number of files, both at the end and periodically
        during the search (in case your system crashes before it finishes).  These files will all have the
        same name (taken from the search params pathname [<name>.s-params]), and differ only in their file
        extensions.  If your search runs are very long and there is a possibility that your machine may
        crash, you can have intermediate "results" files written out.  These can be used to restart your
        search run with minimum loss of search effort.  See the documentation file
        /usr/share/doc/autoclass/checkpoint-c.text.

       A  ".log"  file  will hold a listing of most of what was printed to the screen during the run, unless you
       set log_file_p to false to say you want no such foolishness.  Unless results_file_p is  false,  a  binary
       ".results-bin"  file  (the  default) or an ASCII ".results" text file, will hold the best classifications
       that were returned, and unless search_file_p is false, a ".search" file  will  hold  the  record  of  the
       search tries. save_compact_p controls whether the "results" files are saved as binary or ASCII text.

        If the C global variable "G_safe_file_writing_p" is defined as TRUE in "autoclass-c/prog/globals.c",
        the names of "results" files (those that contain the saved classifications) are modified internally
        to account for redundant file writing.  If the search params file name is "my_saved_clsfs" you will
        see the following "results" file names (ignoring directories and pathnames for this example):

         save_compact_p = true --
         "my_saved_clsfs.results-bin"  - completely written file
         "my_saved_clsfs.results-tmp-bin" - partially written file, renamed
                             when complete

         save_compact_p = false --
         "my_saved_clsfs.results" - completely written file
         "my_saved_clsfs.results-tmp"  - partially written file, renamed
                             when complete

       If check pointing is being done, these additional names will appear

         save_compact_p = true --
         "my_saved_clsfs.chkpt-bin"    - completely written checkpoint file
         "my_saved_clsfs.chkpt-tmp-bin" - partially written checkpoint file,
                                renamed when complete
         save_compact_p = false --
         "my_saved_clsfs.chkpt"   - completely written checkpoint file
         "my_saved_clsfs.chkpt-tmp"    - partially written checkpoint file,
                                renamed when complete

   HOW TO GET STARTED
       The way to invoke AUTOCLASS -SEARCH is:

       autoclass -search <.db2 file path> <.hd2 file path>
            <.model file path> <.s-params file path>

        To restart a previous search, specify that force_new_search_p has the value false in the search
        params file, since its default is true.  Specifying false tells AUTOCLASS -SEARCH to try to find a
        previous compatible search (<...>.results[-bin] & <...>.search) to continue from, and it will restart
        using that search if found.  To force a new search instead of restarting an old one, give the
        parameter force_new_search_p the value of true, or use the default.  In that case, if there is an
        existing search (<...>.results[-bin] & <...>.search), the user will be asked to confirm, since
        starting a new search will discard the existing one.

       If  a  previous  search  is continued, the message "RESTARTING SEARCH" will be given instead of the usual
       "BEGINNING SEARCH".  It is generally better to continue a previous search than to start a new one, unless
       you are trying a significantly different search method, in which case statistics from the previous search
       may mislead the current one.

   STATUS REPORTS
       A running commentary on the search will be printed to the screen and to the log file  (unless  log_file_p
       is false).  Note that the ".log" file will contain a listing of all default search params values, and the
       values of all params that are overridden.

        After each try a very short report (only a few characters long) is given.  After each new best
        classification, a longer report is given, but no more often than min_report_period (default is 30
        seconds).

   SEARCH VARIATIONS
        AUTOCLASS -SEARCH by default uses a certain standard search method or "try function" (try_fn_type =
        "converge_search_3").  Two others are also available: "converge_search_4" and "converge".  They are
        provided in case your problem is one that may happen to benefit from them.  In general the default
        method will result in finding better classifications at the expense of a longer search time.  The
        default was chosen so as to be robust, giving even performance across many problems.  The
        alternatives to the default may do better on some problems, but may do substantially worse on others.

       "converge_search_3"  uses an absolute stopping criterion (rel_delta_range, default value of 0.0025) which
       tests the variation of each class of the delta of the log approximate-marginal-likelihood  of  the  class
       statistics  with-respect-to  the  class  hypothesis  (class->log_a_w_s_h_j)  divided  by the class weight
       (class->w_j) between successive convergence cycles.  Increasing this value loosens  the  convergence  and
       reduces the number of cycles.  Decreasing this value tightens the convergence and increases the number of
       cycles. n_average (default value of 3) specifies how many successive cycles must meet the stopping crite‐
       rion before the trial terminates.
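        Paraphrased, this stops a trial once the per-class relative delta has stayed inside rel_delta_range
        for n_average consecutive cycles.  An illustrative Python sketch (variable names are mine, not
        AutoClass's):

```python
def converged(recent_deltas, rel_delta_range=0.0025, n_average=3):
    """recent_deltas: per-cycle maximum over classes of the relative
    change in log_a_w_s_h_j / w_j.  Stop once the last n_average
    cycles all fall inside rel_delta_range."""
    if len(recent_deltas) < n_average:
        return False
    return all(d < rel_delta_range for d in recent_deltas[-n_average:])
```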

       "converge_search_4"  uses an absolute stopping criterion (cs4_delta_range, default value of 0.0025) which
       tests the variation of each class of the slope for each class of log  approximate-marginal-likelihood  of
       the  class  statistics  with-respect-to  the class hypothesis (class->log_a_w_s_h_j) divided by the class
       weight (class->w_j) over sigma_beta_n_values (default value 6) convergence cycles.  Increasing the  value
       of  cs4_delta_range  loosens  the  convergence  and  reduces the number of cycles.  Decreasing this value
       tightens the convergence and increases the number of cycles.  Computationally, this try function is  more
       expensive than "converge_search_3", but may prove useful if the computational "noise" is significant com‐
       pared  to  the variations in the computed values.  Key calculations are done in double precision floating
       point, and for the largest data base we have tested so far ( 5,420 cases of 93 attributes), computational
       noise has not been a problem, although the value of max_cycles needed to be increased to 400.

       "converge" uses one of two absolute stopping criterion which test the  variation  of  the  classification
       (clsf)  log_marginal  (clsf->log_a_x_h)  delta  between  successive  convergence  cycles.  The largest of
       halt_range (default value 0.5) and halt_factor * current_clsf_log_marginal) is  used  (default  value  of
       halt_factor  is  0.0001).   Increasing these values loosens the convergence and reduces the number of cy‐
       cles.  Decreasing these values tightens the convergence and increases the number  of  cycles.   n_average
       (default  value  of  3) specifies how many cycles must meet the stopping criteria before the trial termi‐
       nates.  This is a very approximate stopping criterion, but will give you some feel for the kind of  clas‐
       sifications to expect.  It would be useful for "exploratory" searches of a data base.
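        The threshold used by "converge" amounts to the following.  The Python helper is illustrative, and
        taking the magnitude of the log marginal is my assumption (log marginals are large negative numbers):

```python
def halt_threshold(log_marginal, halt_range=0.5, halt_factor=0.0001):
    """Stop when successive log_marginal deltas fall below the larger
    of halt_range and halt_factor * |log_marginal|."""
    return max(halt_range, halt_factor * abs(log_marginal))

t1 = halt_threshold(-30000.0)  # ~3.0: the factor term dominates here
t2 = halt_threshold(-1000.0)   # 0.5: halt_range dominates
```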

        The purpose of reconverge_type = "chkpt" is to complete an interrupted classification by continuing
        from its last checkpoint.  The purpose of reconverge_type = "results" is to attempt further
        refinement of the best completed classification using a different value of try_fn_type
        ("converge_search_3", "converge_search_4", "converge").  If max_n_tries is greater than 1, then in
        each case, after the reconvergence has completed, AutoClass will perform further search trials based
        on the parameter values in the <...>.s-params file.

        With the use of reconverge_type (default value ""), you may apply more than one try function to a
        classification.  Say you generate several exploratory trials using try_fn_type = "converge", and
        quit the search saving .search and .results[-bin] files.  Then you can begin another search with
        try_fn_type = "converge_search_3", reconverge_type = "results", and max_n_tries = 1.  This will
        result in the further convergence of the best classification generated with try_fn_type = "converge",
        with try_fn_type = "converge_search_3".  When AutoClass completes this search try, you will have an
        additional refined classification.

        A good way to verify that any of the alternate try_fn_type values is generating a well converged
        classification is to run AutoClass in prediction mode on the same data used for generating the
        classification.  Then generate and compare the corresponding case or class cross reference files for
        the original classification and the prediction.  Small differences between these files are to be
        expected, while large differences indicate incomplete convergence.  Differences between such file
        pairs should, on average and modulo class deletions, decrease monotonically with further convergence.

        The standard way to create a random classification to begin a try is with the default value of
        "random" for start_fn_type.  The only alternative, "block", produces repeatable non-random searches.
        That is how the <..>.s-params files in the autoclass-c/data/..  sub-directories are specified.  This
        is how development testing is done.

        max_cycles controls the maximum number of convergence cycles that will be performed in any one trial
        by the convergence functions.  Its default value is 200.  The screen output shows a period (".") for
        each cycle completed.  If your search trials run for 200 cycles, then either your data base is very
        complex (increase the value), or the try_fn_type is not adequate for the situation (try another of
        the available ones, and use converge_print_p to get more information on what is going on).

        Specifying converge_print_p to be true will generate a brief print-out for each cycle which will
        provide information so that you can modify the default values of rel_delta_range & n_average for
        "converge_search_3"; cs4_delta_range & sigma_beta_n_values for "converge_search_4"; and halt_range,
        halt_factor, and n_average for "converge".  Their default values are given in the <..>.s-params
        files in the autoclass-c/data/..  sub-directories.

   HOW MANY CLASSES?
       Each new try begins with a certain number of classes and may end up with a smaller number, as some class‐
       es may drop out of the convergence.  In general, you want to begin the try with some  number  of  classes
       that  previous  tries have indicated look promising, and you want to be sure you are fishing around else‐
       where in case you missed something before.

       n_classes_fn_type = "random_ln_normal" is the default way to make this choice.  It fits a log  normal  to
       the  number  of  classes  (usually called "j" for short) of the 10 best classifications found so far, and
       randomly selects from that.  There is currently no alternative.

       To start the game off, the default is to go down start_j_list for the first few tries, and then switch to
       n_classes_fn_type.  If you believe that the probable number of classes in your database is, say, 75, then
       instead  of  using the default value of start_j_list (2, 3, 5, 7, 10, 15, 25), specify something like 50,
       60, 70, 80, 90, 100.

       If one wants to always look for, say, three classes, one can use fixed_j and override the above.   Search
       status reports will describe what the current method for choosing j is.

   DO I HAVE ENOUGH MEMORY AND DISK SPACE?
       Internally,  the  storage  requirements in the current system are of order n_classes_per_clsf * (n_data +
       n_stored_clsfs * n_attributes * n_attribute_values).  This depends on the number of cases, the number  of
       attributes,  the  values  per attribute (use 2 if a real value), and the number of classifications stored
       away for comparison to see if others are duplicates -- controlled by max_n_store (default  value  =  10).
       The search process does not itself consume significant memory, but storage of the results may do so.
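       As a rough illustration, the storage estimate above can be computed directly.  This is a sketch only:
       the result is a relative cell count, not bytes, and the example parameter values are hypothetical.

```python
def storage_order(n_classes_per_clsf, n_data, n_stored_clsfs,
                  n_attributes, n_attribute_values):
    # Order-of-magnitude cell count from the formula above; a relative
    # measure, not bytes.  Use 2 attribute values for a real attribute.
    return n_classes_per_clsf * (
        n_data + n_stored_clsfs * n_attributes * n_attribute_values)

# e.g. 10 classes, 1000 cases, max_n_store = 10, 5 real attributes
cells = storage_order(10, 1000, 10, 5, 2)
```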

       AutoClass  C  is  configured to handle a maximum of 999 attributes.  If you attempt to run with more than
       that you will get array bound violations.   In  that  case,  change  these  configuration  parameters  in
       prog/autoclass.h and recompile AutoClass C:

       #define ALL_ATTRIBUTES                  999
       #define VERY_LONG_STRING_LENGTH         20000
       #define VERY_LONG_TOKEN_LENGTH          500

       For example, these values will handle several thousand attributes:

       #define ALL_ATTRIBUTES                  9999
       #define VERY_LONG_STRING_LENGTH         50000
       #define VERY_LONG_TOKEN_LENGTH          50000

       Disk  space  taken up by the "log" file will of course depend on the duration of the search.  n_save (de‐
       fault value = 2) determines how many best classifications  are  saved  into  the  ".results[-bin]"  file.
       save_compact_p  controls  whether the "results" and "checkpoint" files are saved as binary.  Binary files
       are faster and more compact, but are not portable.  The default value of save_compact_p  is  true,  which
       causes binary files to be written.

       If  the time taken to save the "results" files is a problem, consider increasing min_save_period (default
       value = 1800 seconds or 30 minutes).  Files are saved to disk this often if there is  anything  different
       to report.

   JUST HOW SLOW IS IT?
       Compute time is of order n_data * n_attributes * n_classes * n_tries * converge_cycles_per_try. The major
       uncertainties  in this are the number of basic back and forth cycles till convergence in each try, and of
       course the number of tries.  The number of cycles per trial is typically 10-100 for try_fn_type
       "converge", and 10-200+ for "converge_search_3" and "converge_search_4".  The maximum number of cycles
       is specified by max_cycles (default value = 200).  The number of trials is up to you and your available
       computing resources.

       The  running time of very large data sets will be quite uncertain.  We advise that a few small scale test
       runs be made on your system to determine a baseline.  Specify n_data to limit how many data  vectors  are
       read.   Given  a very large quantity of data, AutoClass may find its most probable classifications at up‐
       wards of a hundred classes, and this will require that start_j_list be specified appropriately (See above
       section HOW MANY CLASSES?).  If you are quite certain that you only want a few classes, you can force Au‐
       toClass to search with a fixed number of classes specified by fixed_j.  You will then need to  run  sepa‐
       rate searches with each different fixed number of classes.

   CHANGING FILENAMES IN A SAVED CLASSIFICATION FILE
       AutoClass  caches the data, header, and model file pathnames in the saved classification structure of the
       binary (".results-bin") or ASCII (".results") "results" files.  If the "results" and "search"  files  are
       moved to a different directory location, the search cannot be successfully restarted if you have used ab‐
       solute pathnames.  Thus it is advantageous to invoke AutoClass in a parent directory of the data,
       header, and model files, so that relative pathnames can be used.  Since the pathnames cached will then be
       relative, the files can be moved to a different host or file system and restarted -- providing  the  same
       relative pathname hierarchy exists.

       However,  since  the  ".results"  file is ASCII text, those pathnames could be changed with a text editor
       (save_compact_p must be specified as false).

   SEARCH PARAMETERS
       The search is controlled by the ".s-params" file.  In this file, an empty line or a  line  starting  with
       one  of these characters is treated as a comment: "#", "!", or ";".  The parameter name and its value can
       be separated by an equal sign, a space, or a tab:

            n_clsfs 1
            n_clsfs = 1
            n_clsfs<tab>1

       Spaces are ignored if "=" or "<tab>" are used as separators.  Note there are no trailing semicolons.
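       For illustration, the comment and separator rules above amount to the following minimal sketch (this is
       not part of AutoClass itself; it simply restates the file syntax):

```python
def parse_params(lines):
    # Minimal sketch of the ".s-params" syntax: blank lines and lines
    # starting with "#", "!", or ";" are comments; name and value are
    # separated by "=", space, or tab.
    params = {}
    for raw in lines:
        line = raw.strip()
        if not line or line[0] in "#!;":
            continue
        parts = line.replace("=", " ", 1).split(None, 1)
        if parts:
            params[parts[0]] = parts[1] if len(parts) > 1 else ""
    return params
```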

       The search parameters, with their default values, are as follows:

       rel_error = 0.01
              Specifies the relative difference measure used by clsf-DS-%=, when deciding if a new clsf is a du‐
              plicate of an old one.

       start_j_list = 2, 3, 5, 7, 10, 15, 25
              Initially try these numbers of classes, so as not to narrow the search too quickly.  The state  of
              this  list is saved in the <..>.search file and used on restarts, unless an override specification
              of start_j_list is made in the .s-params file for the restart run.  This list should bracket  your
               expected number of classes, and by a wide margin!  "start_j_list = -999" specifies an empty list
               (allowed only on restarts).

       n_classes_fn_type = "random_ln_normal"
              Once start_j_list is exhausted, AutoClass will call this function to decide how  many  classes  to
              start  with  on  the  next try, based on the 10 best classifications found so far.  Currently only
              "random_ln_normal" is available.

       fixed_j = 0
              When fixed_j > 0, overrides start_j_list and n_classes_fn_type, and AutoClass will always use this
              value for the initial number of classes.

       min_report_period = 30
              Wait at least this time (in seconds) since last report until reporting verbosely again.  Should be
              set longer than the expected run time when checking for repeatability of results.  For  repeatable
              results,  also see force_new_search_p, start_fn_type and randomize_random_p. NOTE: At least one of
              "interactive_p", "max_duration", and "max_n_tries" must be active.  Otherwise AutoClass  will  run
              indefinitely.  See below.

       interactive_p = true
              When  false,  allows run to continue until otherwise halted.  When true, standard input is queried
              on each cycle for the quit character "q", which, when detected, triggers an immediate halt.

       max_duration = 0
              When = 0, allows run to continue until otherwise halted.  When > 0, specifies the  maximum  number
              of seconds to run.

       max_n_tries = 0
              When  =  0, allows run to continue until otherwise halted.  When > 0, specifies the maximum number
              of tries to make.

       n_save = 2
               Save this many clsfs to disk in the .results[-bin] and .search files.  If 0, don't save anything
               (no .search & .results[-bin] files).

       log_file_p = true
              If false, do not write a log file.

       search_file_p = true
              If false, do not write a search file.

       results_file_p = true
              If false, do not write a results file.

       min_save_period = 1800
              CPU crash protection.  This specifies the maximum time, in seconds, that AutoClass will run before
              it saves the current results to disk.  The default time is 30 minutes.

       max_n_store = 10
              Specifies the maximum number of classifications stored internally.

       n_final_summary = 10
              Specifies the number of trials to be printed out after search ends.

       start_fn_type = "random"
              One  of {"random", "block"}.  This specifies the type of class initialization.  For normal search,
              use "random", which randomly selects instances to be initial class  means,  and  adds  appropriate
              variances.  For  testing  with  repeatable search, use "block", which partitions the database into
              successive blocks of near equal  size.   For  repeatable  results,  also  see  force_new_search_p,
              min_report_period, and randomize_random_p.

       try_fn_type = "converge_search_3"
              One  of  {"converge_search_3",  "converge_search_4",  "converge"}.  These specify alternate search
               stopping criteria.  "converge" merely tests the rate of change of the log_marginal classification
               probability (clsf->log_a_x_h), without checking the rate of change of individual classes (see
               halt_range and halt_factor).  "converge_search_3" and "converge_search_4" each monitor the ratio
               class->log_a_w_s_h_j/class->w_j for all classes, and continue convergence until all pass the
               quiescence criteria for n_average cycles.  "converge_search_3" tests differences between
               successive convergence cycles (see rel_delta_range).  This provides a reasonable, general-purpose
               stopping criterion.  "converge_search_4" averages the ratio over "sigma_beta_n_values" cycles
               (see cs4_delta_range).  This is preferred when converge_search_3 produces many similar classes.

       initial_cycles_p = true
              If true, perform base_cycle in initialize_parameters.  false is used only for testing.

       save_compact_p = true
              true  saves  classifications as machine dependent binary (.results-bin & .chkpt-bin).  false saves
               as ascii text (.results & .chkpt).

       read_compact_p = true
              true reads classifications as machine dependent binary (.results-bin & .chkpt-bin).   false  reads
              as ascii text (.results & .chkpt).

       randomize_random_p = true
               false seeds lrand48, the pseudo-random number function, with 1, giving repeatable test cases.
               true uses the universal time clock as the seed, giving semi-random searches.  For repeatable
               results, also see force_new_search_p, min_report_period and start_fn_type.

       n_data = 0
              With n_data = 0, the entire database is read from .db2.  With n_data > 0, only this number of data
              are read.

       halt_range = 0.5
              Passed to try_fn_type "converge".  With the "converge" try_fn_type, convergence is halted when the
              larger of halt_range and (halt_factor * current_log_marginal) exceeds the difference between  suc‐
              cessive  cycle values of the classification log_marginal (clsf->log_a_x_h).  Decreasing this value
              may tighten the convergence and increase the number of cycles.

       halt_factor = 0.0001
              Passed to try_fn_type "converge".  With the "converge" try_fn_type, convergence is halted when the
              larger of halt_range and (halt_factor * current_log_marginal) exceeds the difference between  suc‐
              cessive  cycle values of the classification log_marginal (clsf->log_a_x_h).  Decreasing this value
              may tighten the convergence and increase the number of cycles.
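       The "converge" stopping test described above amounts to the following sketch.  The variable names are
       illustrative, not AutoClass C internals, and the absolute value of the factor term is an assumption
       (log_marginal values are negative).

```python
def converge_halted(prev_log_marginal, curr_log_marginal,
                    halt_range=0.5, halt_factor=0.0001):
    # Halt when the tolerance -- the larger of halt_range and
    # halt_factor * current log_marginal -- exceeds the change in
    # the classification log_marginal between successive cycles.
    tolerance = max(halt_range, abs(halt_factor * curr_log_marginal))
    return abs(curr_log_marginal - prev_log_marginal) < tolerance
```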

       rel_delta_range = 0.0025
              Passed to try function "converge_search_3", which monitors the ratio of log  approx-marginal-like‐
              lihood  of class statistics with-respect-to the class hypothesis (class->log_a_w_s_h_j) divided by
              the class weight (class->w_j), for each class.  "converge_search_3"  halts  convergence  when  the
              difference  between cycles, of this ratio, for every class, has been exceeded by "rel_delta_range"
              for "n_average" cycles.  Decreasing "rel_delta_range" tightens the convergence and  increases  the
              number of cycles.
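       The "converge_search_3" quiescence test can be sketched as follows.  This is illustrative only; the
       history representation is an assumption, not the AutoClass C data structure.

```python
def cs3_quiescent(ratio_history, rel_delta_range=0.0025, n_average=3):
    # ratio_history: one list per cycle of the per-class values of
    # class->log_a_w_s_h_j / class->w_j.  Quiescent when every class's
    # cycle-to-cycle change stays within rel_delta_range for n_average
    # consecutive cycles.
    if len(ratio_history) < n_average + 1:
        return False
    recent = ratio_history[-(n_average + 1):]
    for prev, curr in zip(recent, recent[1:]):
        if any(abs(c - p) > rel_delta_range for p, c in zip(prev, curr)):
            return False
    return True
```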

       cs4_delta_range = 0.0025
              Passed    to    try    function    "converge_search_4",    which    monitors    the    ratio    of
              (class->log_a_w_s_h_j)/(class->w_j), for each class, averaged over  "sigma_beta_n_values"  conver‐
              gence cycles.  "converge_search_4" halts convergence when the maximum difference in average values
              of  this  ratio  falls below "cs4_delta_range".  Decreasing "cs4_delta_range" tightens the conver‐
              gence and increases the number of cycles.

       n_average = 3
              Passed to try functions "converge_search_3" and "converge".  The number of cycles  for  which  the
              convergence criterion must be satisfied for the trial to terminate.

       sigma_beta_n_values = 6
              Passed  to try_fn_type "converge_search_4".  The number of past values to use in computing sigma^2
              (noise) and beta^2 (signal).

       max_cycles = 200
              This is the maximum number of cycles permitted for any one convergence of  a  classification,  re‐
              gardless  of any other stopping criteria.  This is very dependent upon your database and choice of
              model and convergence parameters, but should be about twice the average number of cycles  reported
               in the screen dump and .log file.

       converge_print_p = false
              If  true,  the  selected try function will print to the screen values useful in specifying non-de‐
              fault values for halt_range, halt_factor,  rel_delta_range,  n_average,  sigma_beta_n_values,  and
              range_factor.

       force_new_search_p = true
              If  true,  will  ignore  any  previous  search  results,  discarding the existing .search and .re‐
              sults[-bin] files after confirmation by the user; if false, will continue the search using the ex‐
              isting .search and .results[-bin] files.  For  repeatable  results,  also  see  min_report_period,
              start_fn_type and randomize_random_p.

       checkpoint_p = false
              If  true,  checkpoints of the current classification will be written every "min_checkpoint_period"
               seconds, with file extension .chkpt[-bin].  This is only useful for very large classifications.

       min_checkpoint_period = 10800
              If checkpoint_p = true, the checkpointed classification will be written this often  -  in  seconds
               (default = 3 hours).

        reconverge_type = ""
               Can be either "chkpt" or "results".  If "checkpoint_p" = true and "reconverge_type" = "chkpt",
               then continue convergence of the classification contained in <...>.chkpt[-bin].  If
               "checkpoint_p" = false and "reconverge_type" = "results", continue convergence of the best
               classification contained in <...>.results[-bin].

       screen_output_p = true
              If false, no output is directed to the screen.  Assuming log_file_p = true, output will be direct‐
              ed to the log file only.

       break_on_warnings_p = true
              The default value asks the user whether or not to continue,  when  data  definition  warnings  are
              found.   If specified as false, then AutoClass will continue, despite warnings -- the warning will
              continue to be output to the terminal and the log file.

       free_storage_p = true
              The default value tells AutoClass to free the majority of its allocated storage.  This is not  re‐
               quired, and in the case of the DEC Alpha causes a core dump [is this still true?].  If specified as
              false, AutoClass will not attempt to free storage.

   HOW TO GET AUTOCLASS C TO PRODUCE REPEATABLE RESULTS
       In some situations, repeatable classifications are required: comparing basic  AutoClass  C  integrity  on
       different  platforms, porting AutoClass C to a new platform, etc.  In order to accomplish this two things
       are necessary: 1) the same random number generator must be used, and 2) the  search  parameters  must  be
       specified properly.

       Random  Number  Generator. This implementation of AutoClass C uses the Unix srand48/lrand48 random number
       generator which generates pseudo-random numbers using the well-known linear  congruential  algorithm  and
       48-bit integer arithmetic.  lrand48() returns non-negative long integers uniformly distributed over the
       interval [0, 2**31).

       Search Parameters.  The following .s-params file parameters should be specified:

       force_new_search_p = true
       start_fn_type   "block"
       randomize_random_p = false
       ;; specify the number of trials you wish to run
       max_n_tries = 50
       ;; specify a time greater than duration of run
       min_report_period = 30000

       Note that no current best classification reports will be produced.  Only a final  classification  summary
       will be output.

CHECKPOINTING

       With very large databases there is a significant probability of a system crash during any one classifica‐
       tion  try.   Under such circumstances it is advisable to take the time to checkpoint the calculations for
       possible restart.

       Checkpointing is initiated by specifying "checkpoint_p = true" in the ".s-params" file.  This causes the
       inner convergence step to save a copy of the classification onto the checkpoint file each time the
       classification is updated, provided a certain period of time has elapsed.  The file extension is
       ".chkpt[-bin]".

       Each time AutoClass completes a cycle, a "." is output to the screen to provide you with information to
       be used in setting the min_checkpoint_period value (default 10800 seconds or 3 hours).  There is obvious‐
       ly a trade-off between frequency of checkpointing and the probability that your machine may crash,  since
       the repetitive writing of the checkpoint file will slow the search process.

       Restarting AutoClass Search:

       To  recover  the  classification and continue the search after rebooting and reloading AutoClass, specify
       reconverge_type = "chkpt" in the ".s-params" file (specify force_new_search_p as false).

       AutoClass will reload the appropriate database and models, provided there has been  no  change  in  their
       filenames  since the time they were loaded for the checkpointed classification run.  The ".s-params" file
       contains any non-default arguments that were provided to the original call.

       In the beginning of a search, before start_j_list has been emptied, it will  be  necessary  to  trim  the
       original  list  to  what would have remained in the crashed search.  This can be determined by looking at
       the ".log" file to determine what values were already used.  If the start_j_list has been  emptied,  then
       an empty start_j_list should be specified in the ".s-params" file.  This is done either by

               start_j_list =

       or

                start_j_list = -999

       Here is a set of scripts to demonstrate check-pointing:

       autoclass -search data/glass/glassc.db2 data/glass/glass-3c.hd2 \
            data/glass/glass-mnc.model data/glass/glassc-chkpt.s-params

       Run 1)
         ## glassc-chkpt.s-params
         max_n_tries = 2
         force_new_search_p = true
         ## --------------------
         ;; run to completion

       Run 2)
         ## glassc-chkpt.s-params
         force_new_search_p = false
         max_n_tries = 10
         checkpoint_p = true
         min_checkpoint_period = 2
         ## --------------------
         ;; after 1 checkpoint, ctrl-C to simulate cpu crash

       Run 3)
         ## glassc-chkpt.s-params
         force_new_search_p = false
         max_n_tries = 1
         checkpoint_p = true
         min_checkpoint_period = 1
         reconverge_type = "chkpt"
         ## --------------------
         ;; checkpointed trial should finish

OUTPUT FILES

       The standard reports are

       1)     Attribute  influence values: presents the relative influence or significance of the data's attrib‐
              utes both globally (averaged over all classes), and  locally  (specifically  for  each  class).  A
              heuristic for relative class strength is also listed;

       2)     Cross-reference  by  case  (datum) number: lists the primary class probability for each datum, or‐
              dered by case number.  When report_mode = "data", additional lesser class  probabilities  (greater
              than or equal to 0.001) are listed for each datum;

       3)     Cross-reference by class number: for each class the primary class probability and any lesser class
              probabilities  (greater than or equal to 0.001) are listed for each datum in the class, ordered by
               case number.  It is also possible to list, for each datum, the values of attributes which you
               select.

       The attribute influence values report attempts to provide relative measures of the "influence" of the
       data attributes on the classes found by the classification.  The normalized class strengths, the
       normalized attribute influence values summed over all classes, and the individual influence values
       (I[jkl]) are all only relative measures, and should be interpreted as little more than a rank ordering,
       not as anything approaching absolute values.

       The  reports  are output to files whose names and pathnames are taken from the ".r-params" file pathname.
       The report file types (extensions) are:

       influence values report
              "influ-o-text-n" or "influ-no-text-n"

       cross-reference by case
              "case-text-n"

       cross-reference by class
              "class-text-n"

       or, if report_mode is overridden to "data":

       influence values report
              "influ-o-data-n" or "influ-no-data-n"

       cross-reference by case
              "case-data-n"

       cross-reference by class
              "class-data-n"

       where n is the classification number from the "results" file.  The first or best classification  is  num‐
       bered  1,  the  next best 2, etc.  The default is to generate reports only for the best classification in
       the "results" file.  You can produce reports for other saved classifications by using report params  key‐
       words n_clsfs and clsf_n_list.  The "influ-o-text-n" file type is the default (order_attributes_by_influ‐
       ence_p  =  true), and lists each class's attributes in descending order of attribute influence value.  If
       the value of order_attributes_by_influence_p is overridden to be false in the <...>.r-params  file,  then
       each class's attributes will be listed in ascending order by attribute number.  The extension of the file
       generated  will  be  "influ-no-text-n".   This method of listing facilitates the visual comparison of at‐
       tribute values between classes.

       For example, this command:

            autoclass -reports sample/imports-85c.results-bin
                 sample/imports-85c.search sample/imports-85c.r-params

       with this line in the ".r-params" file:

            xref_class_report_att_list = 2, 5, 6

       will generate these output files:

             imports-85c.influ-o-text-1
             imports-85c.case-text-1
             imports-85c.class-text-1

       The AutoClass C reports provide the capability to compute sigma class contour values for specified  pairs
       of  real valued attributes, when generating the influence values report with the data option (report_mode
       = "data").  Note that sigma class contours are not generated from discrete type attributes.

       The sigma contours are the two dimensional equivalent of n-sigma error bars in one  dimension.   Specifi‐
       cally, for two independent attributes the n-sigma contour is defined as the ellipse where

       ((x - xMean) / xSigma)^2 + ((y - yMean) / ySigma)^2 == n

       With covariant attributes, the n-sigma contours are defined identically, in the rotated coordinate system
       of the distribution's principal axes.  Thus independent attributes give ellipses oriented parallel with
       the attribute axes, while the axes of sigma contours of covariant attributes are rotated about the center
       determined by the means.  In either case the sigma contour represents a line where the class  probability
       is constant, irrespective of any other class probabilities.

       With  three or more attributes the n-sigma contours become k-dimensional ellipsoidal surfaces.  This code
       takes advantage of the fact that the parallel projection of an n-dimensional ellipsoid,  onto  any  2-dim
       plane,  is  bounded by an ellipse.  In this simplified case of projecting the single sigma ellipsoid onto
       the coordinate planes, it is also true that the 2-dim covariances of this ellipse are equal to the corre‐
       sponding elements of the n-dim ellipsoid's covariances.  The Eigen-system of the  2-dim  covariance  then
       gives the variances w.r.t. the principal components of the ellipse, and the rotation that aligns it with
       the data.  This represents the best way to display a distribution in the marginal plane.
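       For two covariant attributes, the projected contour parameters can be sketched as follows.  This is
       illustrative only (AutoClass C computes these internally in C); it is the standard eigen-decomposition
       of a 2x2 covariance matrix, done here in closed form.

```python
import math

def sigma_ellipse(var_x, var_y, cov_xy, n=1.0):
    # Eigen-system of the 2x2 covariance matrix: the eigenvalues are
    # the variances along the principal axes, and 'angle' is the
    # rotation aligning the ellipse with the data.
    tr = var_x + var_y
    det = var_x * var_y - cov_xy * cov_xy
    disc = math.sqrt(max(tr * tr / 4.0 - det, 0.0))
    lam1, lam2 = tr / 2.0 + disc, tr / 2.0 - disc
    angle = 0.5 * math.atan2(2.0 * cov_xy, var_x - var_y)
    # n-sigma semi-axes of the contour ellipse
    return n * math.sqrt(lam1), n * math.sqrt(max(lam2, 0.0)), angle
```

       With cov_xy = 0 the semi-axes reduce to n*xSigma and n*ySigma, parallel to the attribute axes, as
       stated above.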

       To get contour values, set the keyword sigma_contours_att_list to a list of real valued attribute indices
       (from .hd2 file), and request an influence values report with the data option.  For example,

            report_mode = "data"
            sigma_contours_att_list = 3, 4, 5, 8, 15

   OUTPUT REPORT PARAMETERS
       The contents of the output report are controlled by the ".r-params" file.  In this file, an empty line or
       a line starting with one of these characters is treated as a comment: "#", "!", or  ";".   The  parameter
       name and its value can be separated by an equal sign, a space, or a tab:

            n_clsfs 1
            n_clsfs = 1
            n_clsfs<tab>1

       Spaces are ignored if "=" or "<tab>" are used as separators.  Note there are no trailing semicolons.

       The following are the allowed parameters and their default values:

       n_clsfs = 1
              number  of  clsfs  in  the .results file for which to generate reports, starting with the first or
              "best".

       clsf_n_list =
              if specified, this is a one-based index list of clsfs in the clsf sequence read from the  .results
              file.  It overrides "n_clsfs".  For example:

                   clsf_n_list = 1, 2

              will produce the same output as

                   n_clsfs = 2

              but

                   clsf_n_list = 2

              will only output the "second best" classification report.

       report_type =
              type of reports to generate: "all", "influence_values", "xref_case", or "xref_class".

       report_mode =
              mode of reports to generate. "text" is formatted text layout.  "data" is numerical -- suitable for
              further processing.

       comment_data_headers_p = false
              the  default  value  does  not insert # in column 1 of most report_mode = "data" header lines.  If
              specified as true, the comment character will be inserted in most header lines.

       num_atts_to_list =
              if specified, the number of attributes to list in influence values report.  if not specified,  all
              attributes will be listed.  (e.g. "num_atts_to_list = 5")

       xref_class_report_att_list =
              if  specified,  a  list  of  attribute  numbers  (zero-based),  whose values will be output in the
              "xref_class" report along with the case probabilities.  if not  specified,  no  attributes  values
              will be output.  (e.g. "xref_class_report_att_list = 1, 2, 3")

       order_attributes_by_influence_p = true
              The  default value lists each class's attributes in descending order of attribute influence value,
              and uses ".influ-o-text-n" as the influence values report file type.  If specified as false,  then
              each  class's  attributes will be listed in ascending order by attribute number.  The extension of
              the file generated will be "influ-no-text-n".

       break_on_warnings_p = true
              The default value asks the user whether to continue or  not  when  data  definition  warnings  are
              found.   If specified as false, then AutoClass will continue, despite warnings -- the warning will
              continue to be output to the terminal.

       free_storage_p = true
              The default value tells AutoClass to free the majority of its allocated storage.  This is not  re‐
              quired,  and  in the case of the DEC Alpha causes a core dump [is this still true?].  If specified
              as false, AutoClass will not attempt to free storage.

       max_num_xref_class_probs = 5
               Determines how many lesser class probabilities will be printed for the case and class
               cross-reference reports.  The default is to print the most probable class probability value and
               up to 4 lesser class probabilities.  Note this is true for both the "text" and "data" class
               cross-reference reports, but only true for the "data" case cross-reference report.  The "text"
               case cross-reference report only has the most probable class probability.

       sigma_contours_att_list =
               If specified, a list of real valued attribute indices (from the .hd2 file) used to compute
               sigma class contour values when generating the influence values report with the data option
               (report_mode = "data").  If not specified, there will be no sigma class contour output.
               (e.g. "sigma_contours_att_list = 3, 4, 5, 8, 15")

INTERPRETATION OF AUTOCLASS RESULTS

   WHAT HAVE YOU GOT?
       Now you have run AutoClass on your data set -- what have you got?  Typically, the AutoClass search proce‐
       dure finds many classifications, but only saves the few best.  These are now available for inspection and
       interpretation.  The most important indicator of the relative merits of these alternative classifications
        is the Log total posterior probability value.  Note that since the probability lies between 0 and 1, the
        corresponding Log probability is negative and ranges from 0 to negative infinity.  Raising e to the
        difference between these Log probability values gives the relative probability of the alternative
        classifications.  So a difference of, say, 100 implies one classification is e^100 ~= 10^43 more likely
        than the other.  However, these numbers can be very misleading, since they give the relative probability
        of alternative classifications under the AutoClass assumptions.
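        The arithmetic above can be checked directly; a minimal sketch, using made-up Log probability values
        (real ones come from the AutoClass search output):

```python
import math

# Hypothetical Log total posterior probabilities of two classifications
# (invented values for illustration only).
log_p_a = -12450.0
log_p_b = -12550.0

# Raising e to the difference gives the relative probability of the two.
ratio = math.exp(log_p_a - log_p_b)   # e^100
print(math.log10(ratio))              # ~43.4, i.e. e^100 ~= 10^43
```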

   ASSUMPTIONS
       Specifically,  the  most important AutoClass assumptions are the use of normal models for real variables,
       and the assumption of independence of attributes within a class.  Since these assumptions are often  vio‐
       lated  in  practice, the difference in posterior probability of alternative classifications can be partly
       due to one classification being closer to satisfying the assumptions than another, rather than to a  real
       difference in classification quality.  Another source of uncertainty about the utility of Log probability
       values is that they do not take into account any specific prior knowledge the user may have about the do‐
       main.   This means that it is often worth looking at alternative classifications to see if you can inter‐
       pret them, but it is worth starting from the most probable first.  Note that if the Log probability value
       is much greater than that for the one class case, it is saying that there is  overwhelming  evidence  for
       some structure in the data, and part of this structure has been captured by the AutoClass classification.

   INFLUENCE REPORT
       So  you  have now picked a classification you want to examine, based on its Log probability value; how do
       you examine it?  The first thing to do is to generate an "influence" report on the  classification  using
       the report generation facilities documented in /usr/share/doc/autoclass/reports-c.text.  An influence re‐
       port is designed to summarize the important information buried in the AutoClass data structures.

        The first part of this report gives the heuristic class "strengths".  Class "strength" is here defined as
        the geometric mean probability that any instance "belonging to" the class would have been generated from
        the class probability model.  It thus provides a heuristic measure of how strongly each class predicts
        "its" instances.

       The second part is a listing of the overall "influence" of each of the attributes used in the classifica‐
       tion.  These give a rough heuristic measure of the relative importance of each attribute in the classifi‐
       cation.  Attribute "influence values" are a class probability weighted average of the "influence" of each
       attribute in the classes, as described below.

       The next part of the report is a summary description of each of the classes.  The classes are arbitrarily
        numbered from 0 up to n, in order of descending class weight.  A class weight of, say, 34.1 means that
        the weighted sum of membership probabilities for the class is 34.1.  Note that a class weight of 34 does
        not necessarily mean that 34 cases belong to that class, since many cases may have only partial
        membership in that class.  Within each class, attributes or attribute sets are ordered by the "influence"
        of their model term.

   CROSS ENTROPY
       A commonly used measure of the divergence between two probability distributions is the cross entropy: the
       sum over all possible values x, of P(x|c...)*log[P(x|c...)/P(x|g...)], where c...  and  g...  define  the
        distributions.  It ranges from zero, for identical distributions, to infinity for distributions placing
       probability 1 on differing values of an attribute.  With conditionally independent terms in the probabil‐
       ity distributions, the cross entropy can be factored to a sum over these terms.  These factors provide  a
       measure of the corresponding modeled attribute's influence in differentiating the two distributions.
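        For a discrete attribute the sum runs over its finite set of outcomes.  A minimal sketch of the
        computation, with made-up class and global outcome probabilities:

```python
import math

def cross_entropy(p, q):
    # D(p || q) = sum over outcomes x of p(x) * log(p(x) / q(x)).
    # Terms with p(x) == 0 contribute nothing and are skipped.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical class vs. global outcome probabilities for one attribute.
p_class  = [0.70, 0.20, 0.10]
p_global = [0.30, 0.40, 0.30]

divergence = cross_entropy(p_class, p_global)   # > 0; zero only if identical
```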

       We  define the modeled term's "influence" on a class to be the cross entropy term for the class distribu‐
       tion w.r.t. the global class distribution of the single class classification.  "Influence" is thus a mea‐
       sure of how strongly the model term helps differentiate the class from the whole data set.  With indepen‐
        dently modeled attributes, the influence can legitimately be ascribed to the attribute itself.  With
        correlated or covariant attribute sets, the cross entropy factor is a function of the entire set, and we
        distribute the influence value equally over the modeled attributes.

   ATTRIBUTE INFLUENCE VALUES
       In  the  "influence"  report on each class, the attribute parameters for that class are given in order of
       highest influence value for the model term attribute sets.  Only the first  few  attribute  sets  usually
       have  significant  influence  values.   If an influence value drops below about 20% of the highest value,
       then it is probably not significant, but all attribute sets are listed for completeness.  In addition  to
       the  influence value for each attribute set, the values of the attribute set parameters in that class are
       given along with the corresponding "global" values.  The global values are computed directly from the da‐
       ta independent of the classification.  For example, if the class mean of attribute  "temperature"  is  90
       with  standard  deviation  of 2.5, but the global mean is 68 with a standard deviation of 16.3, then this
       class has selected out cases with much higher than average temperature, and a rather small spread in this
       high range.  Similarly, for discrete attribute sets, the probability of each outcome  in  that  class  is
       given, along with the corresponding global probability -- ordered by its significance: the absolute value
       of  (log {<local-probability> / <global-probability>}).  The sign of the significance value shows the di‐
       rection of change from the global class.  This information gives an overview of how  each  class  differs
       from the average for all the data, in order of the most significant differences.
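        The ordering of discrete outcomes by significance can be sketched as follows; the local and global
        probabilities are hypothetical:

```python
import math

# Hypothetical (local, global) outcome probabilities for one discrete attribute.
outcomes = {"low": (0.05, 0.30), "medium": (0.15, 0.40), "high": (0.80, 0.30)}

# Significance of an outcome is |log(local / global)|; the sign of
# log(local / global) shows the direction of change from the global class.
ranked = sorted(outcomes.items(),
                key=lambda kv: abs(math.log(kv[1][0] / kv[1][1])),
                reverse=True)
# "low" ranks first here: it is six times less probable in the class
# than globally.
```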

   CLASS AND CASE REPORTS
        Having gained a description of the classes from the "influence" report, you may want to follow up to see
        which classes your favorite cases ended up in.  Conversely, you may want to see which cases belong to a
        particular class.  For this kind of cross-reference information two complementary reports can be
        generated.  These are more fully documented in /usr/share/doc/autoclass/reports-c.text.  The "class"
        report lists all the cases which have significant membership in each class and the degree to which each
        such case belongs to that class.  Cases whose class membership is less than 90% in the current class
        have their other class memberships listed as well.  The cases within a class are ordered by increasing
        case number.  The alternative "cases" report states which class (or classes) a case belongs to, and the
        membership probability in the most probable class.  These two reports allow you to find which cases belong
       to  which  classes or the other way around.  If nearly every case has close to 99% membership in a single
       class, then it means that the classes are well separated, while a high degree of  cross-membership  indi‐
        cates that the classes are heavily overlapped.  Highly overlapped classes are an indication that the
        idea of classification is breaking down, and that a group of mutually highly overlapped classes, a kind
        of meta class, is probably a better way of understanding the data.

   COMPARING CLASS WEIGHTS AND CLASS/CASE REPORT ASSIGNMENTS
        The class weight, given as the class probability parameter, is essentially the sum over all data
        instances of the normalized probability that the instance is a member of the class.  It is probably an
        error on our part that we format this number as an integer in the report, rather than emphasizing its
        real nature.  You will find the actual real value recorded as the w_j parameter in the class_DS
        structures on any .results[-bin] file.

       The  .case  and  .class  reports give probabilities that cases are members of classes.  Any assignment of
       cases to classes requires some decision rule.  The maximum probability assignment rule is often implicit‐
       ly assumed, but it cannot be expected that the resulting partition sizes will equal the class weights un‐
       less nearly all class membership probabilities are effectively one  or  zero.   With  non-1/0  membership
       probabilities, matching the class weights requires summing the probabilities.
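        The distinction can be illustrated with a small sketch (the membership probabilities are invented):
        summing each column reproduces the class weights, while the maximum probability rule yields different
        partition sizes.

```python
# Hypothetical membership probabilities of four cases in two classes
# (each row sums to 1).
memberships = [
    [0.9, 0.1],
    [0.6, 0.4],
    [0.7, 0.3],
    [0.2, 0.8],
]

# Class weights: sum the membership probabilities down each column.
weights = [sum(row[j] for row in memberships) for j in range(2)]   # [2.4, 1.6]

# Maximum probability assignment: count the cases won by each class.
counts = [0, 0]
for row in memberships:
    counts[row.index(max(row))] += 1                               # [3, 1]
```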

       In  addition, there is the question of completeness of the EM (expectation maximization) convergence.  EM
       alternates between estimating class parameters and estimating class membership probabilities.  These  es‐
       timates  converge on each other, but never actually meet.  AutoClass implements several convergence algo‐
        rithms with alternate stopping criteria, using appropriate parameters in the .s-params file.  Proper
        setting of these parameters, to get reasonably complete and efficient convergence, may require
        experimentation.

   ALTERNATIVE CLASSIFICATIONS
       In summary, the various reports that can be generated give you a way of viewing the  current  classifica‐
       tion.   It is usually a good idea to look at alternative classifications even though they do not have the
        highest Log probability values.  These other classifications usually have classes that correspond closely
       to strong classes in other classifications, but can differ in the weak  classes.   The  "strength"  of  a
       class  within  a classification can usually be judged by how dramatically the highest influence value at‐
       tributes in the class differ from the corresponding global attributes.  If none  of  the  classifications
       seem quite satisfactory, it is always possible to run AutoClass again to generate new classifications.

   WHAT NEXT?
       Finally,  the  question of what to do after you have found an insightful classification arises.  Usually,
       classification is a preliminary data analysis step for examining a set of cases (things, examples,  etc.)
       to  see if they can be grouped so that members of the group are "similar" to each other.  AutoClass gives
       such a grouping without the user having to define a similarity measure.  The built-in  "similarity"  mea‐
       sure  is  the  mutual predictiveness of the cases.  The next step is to try to "explain" why some objects
       are more like others than those in a different group.  Usually, domain knowledge suggests an answer.  For
       example, a classification of people based on income, buying habits, location, age, etc., may reveal  par‐
        ticular social classes that were not obvious before the classification analysis.  Collecting further
        attributes for such classes, such as number of cars or what TV shows are watched, would reveal even
        more about them.  Longitudinal studies would give information about how social classes arise and what
        influences their attitudes -- all of which goes well beyond the initial classification.

PREDICTIONS

       Classifications can be used to predict class membership for new cases.  So in addition to possibly giving
       you some insight into the structure behind your data, you can now use AutoClass directly to make  predic‐
       tions, and compare AutoClass to other learning systems.

       This  technique  for  predicting  class probabilities is applicable to all attributes, regardless of data
       type/sub_type or likelihood model term type.

       In the event that the class membership of a data case does not exceed 0.0099999 for any of the "training"
       classes, the following message will appear in the screen output for each case:

               xref_get_data: case_num xxx => class 9999

       Class 9999 members will appear in the "case" and "class" cross-reference reports with a class  membership
       of 1.0.

       Cautionary Points:

       The usual way of using AutoClass is to put all of your data in a data_file, describe that data with model
       and header files, and run "autoclass -search".  Now, instead of one data_file you will have two, a train‐
       ing_data_file and a test_data_file.

        It is most important that both databases have the same AutoClass internal representation.  Should this
        not be true, AutoClass will exit, or possibly, in some situations, crash.  The prediction mode is
        designed to direct the user into conforming to this requirement.

       Preparation:

        Prediction requires a training classification and a test database.  The training classification is
        generated by running "autoclass -search" on the training data_file ("data/soybean/soyc.db2"), for
        example:

           autoclass -search data/soybean/soyc.db2 data/soybean/soyc.hd2
               data/soybean/soyc.model data/soybean/soyc.s-params

       This will produce "soyc.results-bin" and "soyc.search".  Then create a "reports" parameter file, such  as
       "soyc.r-params"  (see /usr/share/doc/autoclass/reports-c.text), and run AutoClass in "reports" mode, such
       as:

           autoclass -reports data/soybean/soyc.results-bin
               data/soybean/soyc.search data/soybean/soyc.r-params

       This will generate class and case cross-reference files, and an influence values file.   The  file  names
       are based on the ".r-params" file name:

               data/soybean/soyc.class-text-1
               data/soybean/soyc.case-text-1
               data/soybean/soyc.influ-text-1

       These  will describe the classes found in the training_data_file.  Now this classification can be used to
       predict the probabilistic class membership of the test_data_file cases  ("data/soybean/soyc-predict.db2")
       in the training_data_file classes.

           autoclass -predict data/soybean/soyc-predict.db2
               data/soybean/soyc.results-bin data/soybean/soyc.search
               data/soybean/soyc.r-params

       This  will  generate  class  and case cross-reference files for the test_data_file cases predicting their
       probabilistic class memberships in the training_data_file classes.  The  file  names  are  based  on  the
       ".db2" file name:

               data/soybean/soyc-predict.class-text-1
               data/soybean/soyc-predict.case-text-1

SEE ALSO

       AutoClass is documented fully here:

       /usr/share/doc/autoclass/introduction-c.text Guide to the documentation

       /usr/share/doc/autoclass/preparation-c.text How to prepare data for use by AutoClass

       /usr/share/doc/autoclass/search-c.text How to run AutoClass to find classifications.

       /usr/share/doc/autoclass/reports-c.text How to examine the classification in various ways.

       /usr/share/doc/autoclass/interpretation-c.text How to interpret AutoClass results.

       /usr/share/doc/autoclass/checkpoint-c.text Protocols for running a checkpointed search.

       /usr/share/doc/autoclass/prediction-c.text Use classifications to predict class membership for new cases.

       These provide supporting documentation:

       /usr/share/doc/autoclass/classes-c.text What classification is all about, for beginners.

       /usr/share/doc/autoclass/models-c.text Brief descriptions of the model term implementations.

       The mathematical theory behind AutoClass is explained in these documents:

       /usr/share/doc/autoclass/kdd-95.ps  Postscript file containing: P. Cheeseman, J. Stutz, "Bayesian Classi‐
       fication (AutoClass): Theory and Results", in "Advances in Knowledge Discovery and Data Mining", Usama M.
       Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, & Ramasamy Uthurusamy,  Eds.  The  AAAI  Press,  Menlo
       Park, expected fall 1995.

       /usr/share/doc/autoclass/tr-fia-90-12-7-01.ps Postscript file containing: R. Hanson, J. Stutz, P. Cheese‐
       man,  "Bayesian Classification Theory", Technical Report FIA-90-12-7-01, NASA Ames Research Center, Arti‐
       ficial Intelligence Branch, May 1991 (The figures are not included, since they were inserted by "cut-and-
       paste" methods into the original "camera-ready" copy.)

AUTHORS

       Dr. Peter Cheeseman
       Principal Investigator - NASA Ames, Computational Sciences Division
       cheesem@ptolemy.arc.nasa.gov

       John Stutz
       Research Programmer - NASA Ames, Computational Sciences Division
       stutz@ptolemy.arc.nasa.gov

       Will Taylor
       Support Programmer - NASA Ames, Computational Sciences Division
       taylor@ptolemy.arc.nasa.gov

SEE ALSO

       multimix(1).

                                                December 9, 2001                                    AUTOCLASS(1)