Ubuntu Manpage: Algorithm::NaiveBayes - Bayesian prediction of categories

Provided by: libalgorithm-naivebayes-perl_0.04-2_all

NAME

       Algorithm::NaiveBayes - Bayesian prediction of categories

SYNOPSIS

         use Algorithm::NaiveBayes;
         my $nb = Algorithm::NaiveBayes->new;

         $nb->add_instance
           (attributes => {foo => 1, bar => 1, baz => 3},
            label => 'sports');

         $nb->add_instance
           (attributes => {foo => 2, blurp => 1},
            label => ['sports', 'finance']);

         # ... repeat for several more instances, then:
         $nb->train;

         # Find results for unseen instances
         my $result = $nb->predict
           (attributes => {bar => 3, blurp => 2});

DESCRIPTION

       This module implements the classic "Naive Bayes" machine learning algorithm.  It is a well-studied
       probabilistic algorithm often used in automatic text categorization.  Compared to other algorithms (kNN,
       SVM, Decision Trees), it's pretty fast and reasonably competitive in the quality of its results.

       A paper by Fabrizio Sebastiani provides a really good introduction to text categorization:
       <http://faure.iei.pi.cnr.it/~fabrizio/Publications/ACMCS02.pdf>

METHODS

       new()
           Creates a new "Algorithm::NaiveBayes" object and returns it.  The following parameters are accepted:

           purge
               If set to a true value, the "do_purge()" method will be invoked during "train()".  The default is
               true.   Set  this  to  a  false  value if you'd like to be able to add additional instances after
               training and then call "train()" again.
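
           As a minimal sketch of the retraining workflow described above, the following passes a false
           "purge" value to "new()" so that instances can be added after the first "train()" call.  The
           attribute names and counts are made up for illustration.

```perl
use strict;
use warnings;
use Algorithm::NaiveBayes;

# Keep the training data after train() so that train() can be called again.
my $nb = Algorithm::NaiveBayes->new(purge => 0);

$nb->add_instance(attributes => {goal => 2, match => 1},  label => 'sports');
$nb->add_instance(attributes => {stock => 3, market => 1}, label => 'finance');
$nb->train;

# Because purge is false, a further instance can be added and the model retrained.
$nb->add_instance(attributes => {goal => 1, stock => 1}, label => ['sports', 'finance']);
$nb->train;

my $result = $nb->predict(attributes => {goal => 1});
```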

       add_instance( attributes => HASH, label => STRING|ARRAY )
           Adds a training instance to the categorizer.  The "attributes" parameter contains  a  hash  reference
           whose keys are string attributes and whose values are the weights of those attributes.  For instance,
           if  you're  categorizing  text  documents, the attributes might be the words of the document, and the
           weights might be the number of times each word occurs in the document.

           The "label" parameter can contain  a  single  string  or  an  array  of  strings,  with  each  string
           representing a label for this instance.  The labels can be any arbitrary strings.  To indicate that a
           document has no applicable labels, pass an empty array reference.
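
           For the text-document case mentioned above, the attribute hash can be built by counting words.
           The helper below, "word_counts", is a hypothetical name and not part of this module; its
           lowercasing and "\w+" tokenization are assumptions made for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Algorithm::NaiveBayes): turn a text into
# a hash reference mapping each lowercased word to its number of occurrences.
sub word_counts {
    my ($text) = @_;
    my %count;
    $count{lc $1}++ while $text =~ /(\w+)/g;
    return \%count;
}

my $attributes = word_counts('The match ended; the home team won the match.');

# $attributes is now in the format add_instance() expects, for example:
#   $nb->add_instance(attributes => $attributes, label => 'sports');
```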

       train()
           Calculates the probabilities that will be necessary for categorization using the "predict()" method.

       predict( attributes => HASH )
           Use  this  method  to predict the label of an unknown instance.  The attributes should be of the same
           format as you passed to "add_instance()".  "predict()" returns a hash reference whose  keys  are  the
           names  of labels, and whose values are the score for each label.  Scores are between 0 and 1, where 0
           means the label doesn't seem to apply to this instance, and 1 means it does.

           In practice, scores using Naive Bayes  tend  to  be  very  close  to  0  or  1  because  of  the  way
           normalization is performed.  I might try to alleviate this in future versions of the code.
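
           One way to consume the returned hash reference is to rank the labels by score.  The scores below
           are made up and only mimic the shape of a "predict()" result; the 0.5 threshold is an arbitrary
           choice for this sketch.

```perl
use strict;
use warnings;

# Made-up scores with the same shape as a predict() result:
# a hash reference mapping label names to scores between 0 and 1.
my $result = { sports => 0.98, finance => 0.12, politics => 0.01 };

# Labels ordered from highest to lowest score.
my @ranked = sort { $result->{$b} <=> $result->{$a} } keys %$result;
my $best   = $ranked[0];

# For multi-label use, keep every label whose score exceeds a threshold.
my @accepted = grep { $result->{$_} > 0.5 } @ranked;

print "best: $best\n";
print "accepted: @accepted\n";
```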

       labels()
           Returns  a  list  of all the labels the object knows about (in no particular order), or the number of
           labels if called in a scalar context.

       do_purge()
           Purges training instances and their associated information from the NaiveBayes object.  This can save
           memory after training.

       purge()
           Returns true or false depending on the value of the object's "purge" property.  An  optional  boolean
           argument sets the property.

       save_state($path)
           This object method saves the object to disk for later use.  The $path argument indicates the place on
           disk where the object should be saved:

             $nb->save_state($path);

       restore_state($path)
           This class method reads the file specified by $path and returns the object that was previously stored
           there using "save_state()":

             $nb = Algorithm::NaiveBayes->restore_state($path);
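
           A round trip through "save_state()" and "restore_state()" might look like the following sketch.
           The temporary file from File::Temp and the training data are assumptions chosen only to keep the
           example self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Algorithm::NaiveBayes;

my $nb = Algorithm::NaiveBayes->new;
$nb->add_instance(attributes => {goal => 2, match => 1},  label => 'sports');
$nb->add_instance(attributes => {stock => 3, market => 1}, label => 'finance');
$nb->train;

# Save the trained object to a temporary file, then load it back.
my ($fh, $path) = tempfile(UNLINK => 1);
close $fh;
$nb->save_state($path);

my $restored = Algorithm::NaiveBayes->restore_state($path);

# The restored object can predict without being retrained.
my $result = $restored->predict(attributes => {stock => 1});
my @labels = sort $restored->labels;
```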

THEORY

       Bayes' Theorem is a way of inverting a conditional probability. It states:

                       P(y|x) P(x)
             P(x|y) = -------------
                          P(y)

       The notation "P(x|y)" means "the probability of x given y".  See also
       <http://mathforum.org/dr.math/problems/battisfore.03.22.99.html> for a simple but complete example
       of Bayes' Theorem.
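
       A small numeric check of the theorem, using made-up probabilities: here x stands for "the document
       is about sports" and y for "the document contains the word goal".  The value of "P(y)" is obtained
       from the law of total probability, which is an extra step not spelled out above.

```perl
use strict;
use warnings;

# Made-up numbers for illustration only.
my $p_x           = 0.20;   # P(x)
my $p_y_given_x   = 0.50;   # P(y|x)
my $p_y_given_nox = 0.05;   # P(y|not x)

# P(y) by the law of total probability.
my $p_y = $p_y_given_x * $p_x + $p_y_given_nox * (1 - $p_x);

# Bayes' Theorem: P(x|y) = P(y|x) P(x) / P(y)
my $p_x_given_y = $p_y_given_x * $p_x / $p_y;

printf "P(y) = %.2f, P(x|y) = %.3f\n", $p_y, $p_x_given_y;
```

       With these numbers, seeing the word raises the probability of the sports category from 0.20 to
       roughly 0.71.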

       In  this  case,  we want to know the probability of a given category given a certain string of words in a
       document, so we have:

                           P(words | cat) P(cat)
         P(cat | words) = --------------------
                                  P(words)

       We have applied Bayes' Theorem because "P(cat | words)" is a difficult quantity to compute directly,  but
       "P(words | cat)" and "P(cat)" are accessible (see below).

       The  greater  the  expression  above,  the greater the probability that the given document belongs to the
       given category.  So we want to find the maximum value.  We write this as

                                        P(words | cat) P(cat)
         Best category =   ArgMax      -----------------------
                          cat in cats          P(words)

       Since "P(words)" doesn't change over the range of categories, we can get rid of it.  That's good, because
       we didn't want to have to compute these values anyway.  So our new formula is:

         Best category =   ArgMax      P(words | cat) P(cat)
                          cat in cats
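
       The following sketch illustrates why dropping "P(words)" is safe: dividing every candidate by the
       same positive constant does not change which one is largest.  The numerator values are made up,
       "argmax" is a hypothetical helper, and computing "P(words)" as the sum of the numerators assumes
       that exactly one category applies to the document.

```perl
use strict;
use warnings;

# Made-up values of P(words | cat) * P(cat) for three categories.
my %numerator = (sports => 0.006, finance => 0.0009, politics => 0.0001);

# P(words) is the same for every category.  Assuming exactly one category
# applies, it is the sum of the numerators.
my $p_words = 0;
$p_words += $_ for values %numerator;

# Hypothetical helper: return the key with the largest value.
sub argmax {
    my ($scores) = @_;
    my ($best) = sort { $scores->{$b} <=> $scores->{$a} } keys %$scores;
    return $best;
}

my %posterior = map { $_ => $numerator{$_} / $p_words } keys %numerator;

my $best_with    = argmax(\%posterior);   # after dividing by P(words)
my $best_without = argmax(\%numerator);   # without the division
print "$best_with / $best_without\n";
```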

       Finally, we note that if "w1, w2, ... wn" are the words in the document, and we treat the words as
       independent of one another given the category, then this expression becomes:

         Best category =   ArgMax      P(w1|cat)*P(w2|cat)*...*P(wn|cat)*P(cat)
                          cat in cats

       That's  the formula I use in my document categorization code.  The last step is the only non-rigorous one
       in the derivation, and this is the "naive" part of the  Naive  Bayes  technique.   It  assumes  that  the
       probability  of  each word appearing in a document is unaffected by the presence or absence of each other
       word in the document.  We assume this even though  we  know  this  isn't  true:  for  example,  the  word
       "iodized"  is  far more likely to appear in a document that contains the word "salt" than it is to appear
       in a document that contains the word "subroutine".  Luckily, as it turns out, making this assumption even
       when it isn't true may have little effect on our results, as the following paper by Pedro Domingos
       argues: <http://www.cs.washington.edu/homes/pedrod/mlj97.ps.gz>
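
       The product formula can be sketched in a few lines of plain Perl.  This is an illustration under
       stated assumptions and not this module's actual code: the training counts are made up, "log_score"
       is a hypothetical helper, logarithms are summed in place of multiplying probabilities to avoid
       floating-point underflow, and add-one smoothing is used so that a word unseen in a category does
       not force the whole product to zero.

```perl
use strict;
use warnings;

# Toy training data: word counts per category and documents per category.
my %word_counts = (
    sports  => { goal  => 3, match  => 2, team => 1 },
    finance => { stock => 4, market => 2, team => 1 },
);
my %doc_counts = (sports => 2, finance => 2);

# Vocabulary size: number of distinct words seen in any category.
my %vocab;
for my $cat (keys %word_counts) {
    $vocab{$_} = 1 for keys %{ $word_counts{$cat} };
}
my $vocab_size = keys %vocab;

my $total_docs = 0;
$total_docs += $_ for values %doc_counts;

# Hypothetical helper: log of P(w1|cat)*...*P(wn|cat)*P(cat),
# with add-one smoothing applied to each P(w|cat).
sub log_score {
    my ($cat, @words) = @_;
    my $tokens = 0;
    $tokens += $_ for values %{ $word_counts{$cat} };
    my $score = log($doc_counts{$cat} / $total_docs);            # log P(cat)
    for my $w (@words) {
        my $count = $word_counts{$cat}{$w} || 0;
        $score += log(($count + 1) / ($tokens + $vocab_size));   # log P(w|cat)
    }
    return $score;
}

my @doc   = qw(goal team match);
my %score = map { $_ => log_score($_, @doc) } keys %word_counts;
my ($best) = sort { $score{$b} <=> $score{$a} } keys %score;
print "best category: $best\n";
```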

HISTORY

       My  first  implementation of a Naive Bayes algorithm was in the now-obsolete AI::Categorize module, first
       released in May 2001.  I replaced it with the Naive Bayes implementation  in  AI::Categorizer  (note  the
       extra  'r'),  first released in July 2002.  I then extracted that implementation into its own module that
       could be used outside the framework, and that's what you see here.

AUTHOR

       Ken Williams, ken@mathforum.org

COPYRIGHT

       Copyright 2003-2004 Ken Williams.  All rights reserved.

       This library is free software; you can redistribute it and/or modify it under  the  same  terms  as  Perl
       itself.
