Provided by: mlpack-bin_4.3.0-2build1_amd64

NAME

       mlpack_kde - kernel density estimation

SYNOPSIS

        mlpack_kde [-E double] [-a string] [-b double] [-s int] [-m unknown] [-k string] [-c double] [-C double] [-P double] [-S bool] [-q unknown] [-r unknown] [-e double] [-t string] [-V bool] [-M unknown] [-p unknown] [-h -v]

DESCRIPTION

       This program performs kernel density estimation (KDE), a non-parametric way of estimating a probability
       density function. For each query point, the program estimates the probability density by applying a
       kernel function to each reference point. The computational complexity of this is O(N^2) where there are N
       query points and N reference points, but this implementation will typically see better performance as it
       uses an approximate dual-tree or single-tree algorithm for acceleration.
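
       As a point of comparison for the O(N^2) cost mentioned above, the brute-force computation can be sketched
       in a few lines of Python. This is only an illustration, not mlpack's implementation: the function names
       are hypothetical, the Gaussian kernel is left unnormalized, and the kernel values are simply averaged over
       the reference points.

```python
import math


def gaussian_kernel(distance, bandwidth):
    """Unnormalized Gaussian kernel value for a given Euclidean distance."""
    return math.exp(-(distance ** 2) / (2.0 * bandwidth ** 2))


def naive_kde(query_points, reference_points, bandwidth):
    """Brute-force KDE: every query point visits every reference point."""
    estimates = []
    for q in query_points:
        total = 0.0
        for r in reference_points:
            total += gaussian_kernel(math.dist(q, r), bandwidth)
        # Average the kernel values (the normalization constant is omitted).
        estimates.append(total / len(reference_points))
    return estimates


reference = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
queries = [(0.0, 0.0), (5.0, 5.0)]
densities = naive_kde(queries, reference, bandwidth=1.0)
print(densities)
```

       The nested loops make the quadratic cost explicit; the tree-based algorithms aim to skip most of the inner
       iterations.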

       Dual-tree or single-tree optimization avoids many barely relevant calculations (as kernel function values
       decrease with distance), so the result is an approximation. You can specify the maximum relative error
       tolerance for each query value with '--rel_error (-e)' as well as the maximum absolute error tolerance
       with the parameter '--abs_error (-E)'. This program uses a Euclidean metric. The kernel function can be
       selected using the '--kernel (-k)' option. You can also choose which type of tree to use for the
       dual-tree algorithm with '--tree (-t)'. It is also possible to select whether to use the dual-tree
       algorithm or the single-tree algorithm using the '--algorithm (-a)' option.
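
       The remark that kernel values decrease with distance can be made concrete with a short Python sketch. The
       two functions below use the usual unnormalized forms of the Gaussian and Epanechnikov kernels; the
       function names and the sample distances are chosen here for illustration and are not taken from mlpack.

```python
import math


def gaussian(distance, bandwidth):
    """Unnormalized Gaussian kernel: decays smoothly but never reaches zero."""
    return math.exp(-(distance ** 2) / (2.0 * bandwidth ** 2))


def epanechnikov(distance, bandwidth):
    """Unnormalized Epanechnikov kernel: exactly zero beyond one bandwidth."""
    return max(0.0, 1.0 - (distance / bandwidth) ** 2)


bandwidth = 0.2
for d in (0.0, 0.1, 0.2, 1.0):
    print(f"d={d}: gaussian={gaussian(d, bandwidth):.3e}, "
          f"epanechnikov={epanechnikov(d, bandwidth):.3f}")
```

       Reference points that are many bandwidths away from a query point contribute almost nothing to the sum,
       which is why whole groups of distant points can be approximated or skipped within an error tolerance.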

       Monte Carlo estimations can be used to accelerate the KDE estimate when the Gaussian kernel is used. This
       provides a probabilistic guarantee on the error of the resulting KDE instead of an absolute guarantee. To
       enable Monte Carlo estimations, the '--monte_carlo (-S)' flag can be used, and the success probability can
       be set with the '--mc_probability (-P)' option. It is possible to set the initial sample size for the
       Monte Carlo estimation using '--initial_sample_size (-s)'. This implementation will only consider a node
       as a candidate for the Monte Carlo estimation if its number of descendant nodes is bigger than the
       initial sample size. This can be controlled using a coefficient that multiplies the initial sample size
       and can be set using '--mc_entry_coef (-C)'. To avoid performing as many computations as an exact approach
       would, the program recurses down the tree once a given fraction of the node's descendant points has
       already been computed. This fraction is set using '--mc_break_coef (-c)'.
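
       The general idea of replacing an exact kernel sum with a sampled one can be sketched in Python as follows.
       This is plain uniform sampling without replacement, shown only to illustrate the principle; it is not
       mlpack's estimator, it does not compute a probabilistic error bound, and all names in it are
       hypothetical.

```python
import math
import random


def gaussian(distance, bandwidth):
    """Unnormalized Gaussian kernel value."""
    return math.exp(-(distance ** 2) / (2.0 * bandwidth ** 2))


def exact_kernel_sum(query, points, bandwidth):
    """Sum of kernel values over every point (the exact contribution)."""
    return sum(gaussian(math.dist(query, p), bandwidth) for p in points)


def sampled_kernel_sum(query, points, bandwidth, sample_size, rng):
    """Estimate the kernel sum from a random subset of the points."""
    sample = rng.sample(points, sample_size)  # sampling without replacement
    mean = sum(gaussian(math.dist(query, p), bandwidth) for p in sample) / sample_size
    # Scale the sample mean up to the full number of points.
    return mean * len(points)


rng = random.Random(0)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(1000)]
query = (0.0, 0.0)
exact = exact_kernel_sum(query, points, bandwidth=1.0)
approx = sampled_kernel_sum(query, points, 1.0, sample_size=200, rng=rng)
print(f"exact={exact:.2f} approx={approx:.2f} "
      f"relative error={abs(approx - exact) / exact:.3%}")
```

       Here 200 kernel evaluations stand in for 1000, at the price of an error that can only be bounded with
       some probability rather than with certainty.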

       For  example,  the  following  will run KDE using the data in 'ref_data.csv' for training and the data in
       'qu_data.csv' as query data. It will apply an Epanechnikov kernel with a 0.2 bandwidth to each  reference
       point and use a KD-Tree for the dual-tree optimization. The returned predictions will be within 5% of the
       real KDE value for each query point.

       $ mlpack_kde --reference_file ref_data.csv --query_file qu_data.csv --bandwidth 0.2 --kernel epanechnikov
       --tree kd-tree --rel_error 0.05 --predictions_file out_data.csv

       The predicted density estimates will be stored in 'out_data.csv'. If no '--query_file (-q)' is provided,
       then KDE will be computed on the '--reference_file (-r)' dataset. It is possible to select either a
       reference dataset or an input model, but not both at the same time. If an input model is selected and
       parameter values are not set (e.g. '--bandwidth (-b)'), then default parameter values will be used.
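
       To try the command above, input files named 'ref_data.csv' and 'qu_data.csv' can be generated with a
       short Python script. The sketch below assumes the plain CSV layout of one point per row and no header
       line; the data itself is random, and the helper name write_points is hypothetical.

```python
import csv
import os
import random
import tempfile


def write_points(path, points):
    """Write one point per row, comma-separated, without a header line."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(points)


rng = random.Random(42)
# 500 two-dimensional reference points and 50 query points (made-up data).
reference = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
queries = [(rng.uniform(-3, 3), rng.uniform(-3, 3)) for _ in range(50)]

out_dir = tempfile.mkdtemp()
ref_path = os.path.join(out_dir, "ref_data.csv")
query_path = os.path.join(out_dir, "qu_data.csv")
write_points(ref_path, reference)
write_points(query_path, queries)
print(ref_path, query_path)
```

       The script writes into a temporary directory and prints the two paths, which can then be passed to
       '--reference_file (-r)' and '--query_file (-q)'.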

       In addition to the previous program call, it is also possible to activate Monte Carlo estimations if a
       Gaussian kernel is used. This can provide faster results, but the KDE will only have a probabilistic
       guarantee of meeting the desired error bound (instead of an absolute guarantee). The following example
       will run KDE using a Monte Carlo estimation when possible. The results will be within 5% of the real KDE
       value with a 95% probability. The initial sample size for the Monte Carlo estimation will be 200 points,
       and a node will be a candidate for the estimation only when it contains 700 (i.e. 3.5*200) points. If a
       node contains 700 points and 420 (i.e. 0.6*700) have already been sampled, then the algorithm will
       recurse instead of continuing to sample.

       $ mlpack_kde --reference_file ref_data.csv --query_file qu_data.csv  --bandwidth  0.2  --kernel  gaussian
       --tree  kd-tree  --rel_error  0.05  --predictions_file  out_data.csv  --monte_carlo --mc_probability 0.95
       --initial_sample_size 200 --mc_entry_coef 3.5 --mc_break_coef 0.6
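
       The two thresholds quoted for this example (700 and 420 points) follow directly from the option values,
       as the Python sketch below shows. The helper functions and the use of ">=" in the comparisons are
       assumptions made for illustration; they are not mlpack's code.

```python
import math

initial_sample_size = 200   # value passed to --initial_sample_size
mc_entry_coef = 3.5         # value passed to --mc_entry_coef
mc_break_coef = 0.6         # value passed to --mc_break_coef

# A node becomes a Monte Carlo candidate at this many points.
entry_threshold = mc_entry_coef * initial_sample_size

# For a node holding 700 points, sampling stops at this many samples.
node_points = 700
break_limit = mc_break_coef * node_points


def is_candidate(num_points):
    """Assumed comparison: the node is large enough for Monte Carlo."""
    return num_points >= entry_threshold


def should_recurse(sampled, num_points):
    """Assumed comparison: too much of the node has been sampled already."""
    return sampled >= mc_break_coef * num_points


print(f"entry threshold: {entry_threshold:g} points")
print(f"break limit for a {node_points}-point node: {break_limit:g} samples")
print("300-point node is a candidate:", is_candidate(300))
```

       Raising '--mc_entry_coef (-C)' restricts Monte Carlo estimation to larger nodes, while lowering
       '--mc_break_coef (-c)' makes the algorithm give up on sampling sooner.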

OPTIONAL INPUT OPTIONS

       --abs_error (-E) [double]
              Absolute error tolerance for the prediction. Default value 0.

       --algorithm (-a) [string]
              Algorithm to use for the prediction ('dual-tree', 'single-tree'). Default value 'dual-tree'.

       --bandwidth (-b) [double]
              Bandwidth of the kernel. Default value 1.

       --help (-h) [bool]
              Default help info.

       --info [string]
              Print help on a specific option. Default value ''.

       --initial_sample_size (-s) [int]
              Initial sample size for Monte Carlo estimations. Default value 100.

       --input_model_file (-m) [unknown]
              Contains pre-trained KDE model.

       --kernel (-k) [string]
              Kernel to use for the prediction ('gaussian', 'epanechnikov', 'laplacian', 'spherical',
              'triangular'). Default value 'gaussian'.

       --mc_break_coef (-c) [double]
              Controls what fraction of a node's descendants may be sampled before the algorithm recurses
              instead. Default value 0.4.

       --mc_entry_coef (-C) [double]
              Controls how much larger the number of node descendants has to be than the initial sample size
              for the node to be a candidate for Monte Carlo estimations. Default value 3.

       --mc_probability (-P) [double]
              Probability of the estimation being bounded by relative error when using Monte Carlo  estimations.
              Default value 0.95.

       --monte_carlo (-S) [bool]
              Whether to use Monte Carlo estimations when possible.

       --query_file (-q) [unknown]
              Query dataset on which to compute KDE.

       --reference_file (-r) [unknown]
              Input reference dataset to use for KDE.

       --rel_error (-e) [double]
              Relative error tolerance for the prediction.  Default value 0.05.

       --tree (-t) [string]
              Tree to use for the prediction ('kd-tree', 'ball-tree', 'cover-tree', 'octree', 'r-tree').
              Default value 'kd-tree'.

       --verbose (-v) [bool]
              Display informational messages and the full list of parameters and timers at the end of execution.

       --version (-V) [bool]
              Display the version of mlpack.

OPTIONAL OUTPUT OPTIONS

       --output_model_file (-M) [unknown]
              If specified, the KDE model will be saved here.

       --predictions_file (-p) [unknown]
              Vector to store density predictions.

ADDITIONAL INFORMATION

       For further information, including relevant papers, citations,  and  theory,  consult  the  documentation
       found at http://www.mlpack.org or included with your distribution of mlpack.

mlpack-4.3.0                                     19 January 2024                                   mlpack_kde(1)