Provided by: likwid_5.2.2+dfsg1-2_amd64

NAME

       likwid-bench - low-level benchmark suite and microbenchmarking framework

SYNOPSIS

       likwid-bench    [-hap]    [-t    <testname>]    [-s    <min_time>]    [-w   <workgroup_expression>]   [-W
       <workgroup_expression_short>] [-l <testname>] [-d <delimiter>] [-i <iterations>] [-f <filepath>]

DESCRIPTION

       likwid-bench is a benchmark suite of low-level (assembly) benchmarks for measuring bandwidths and
       instruction throughput of specific instruction code on x86 systems. The currently included benchmark
       codes cover common data access patterns like load and store, but also calculations like vector triad
       and sum. likwid-bench includes architecture-specific benchmarks for x86, x86_64 and x86 for Intel Xeon
       Phi coprocessors. Since LIKWID 5, ARM and POWER benchmarks are also supported. The performance values
       can either be calculated by likwid-bench or measured with performance counters by using likwid-perfctr
       as a wrapper around likwid-bench. The latter requires building likwid-bench with instrumentation
       enabled in config.mk. Benchmarks can be added dynamically when a proper ptt file is present at
       $HOME/.likwid/bench/<arch>/<testname>.ptt. The file is translated to a .S file, which is then compiled
       using gcc, icc or pgcc (searched in $PATH). The default folder for the generated files is /tmp/<PID>.
       Possible values for <arch> are 'x86', 'x86-64', 'phi', 'armv7', 'armv8' and 'power'.
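
       The lookup location for user-provided benchmarks can be sketched in a few lines of Python. This is
       only an illustration of the path layout described above; the helper name user_ptt_path and the test
       name "mycopy" are hypothetical and not part of LIKWID.

```python
from pathlib import Path

# Architecture names as listed in this manual page.
KNOWN_ARCHS = ("x86", "x86-64", "phi", "armv7", "armv8", "power")


def user_ptt_path(arch, testname, home=None):
    """Return the expected location of a user-provided ptt file.

    Hypothetical helper: it only mirrors the layout
    $HOME/.likwid/bench/<arch>/<testname>.ptt and does not call LIKWID.
    """
    if arch not in KNOWN_ARCHS:
        raise ValueError(f"unknown architecture: {arch!r}")
    base = Path(home) if home is not None else Path.home()
    return base / ".likwid" / "bench" / arch / f"{testname}.ptt"


# "mycopy" is a made-up benchmark name used for illustration only.
print(user_ptt_path("x86-64", "mycopy", home="/home/user"))
```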

OPTIONS

       -h     Print a help message to standard output, then exit.

       -a     List available benchmark codes for the current system.

       -p     List available thread domains.

       -s <min_time>
              Run the benchmark for at least <min_time> seconds. The number of iterations is determined using
              this value. Default: 1 second.

       -t <testname>
              Name of the benchmark code to run (mandatory).

       -w <workgroup_expression>
              Specify  the  affinity domain, thread count and data set size for the current benchmarking run (-w
              or -W mandatory). The first thread in the thread domain initializes the streams.

       -W <workgroup_expression_short>
              Specify the affinity domain, thread count and data set size for the current benchmarking  run  (-w
              or -W mandatory). Each thread in the workgroup initializes its own chunk of the stream.

       -l <testname>
              List properties of a benchmark code.

       -i <iterations>
              Set the number of iterations per thread (optional).

       -f <filepath>
              Filepath for the dynamic generation of benchmarks. Default: /tmp/. <PID> is always appended.

WORKGROUP SYNTAX

       <thread_domain>:<size>[:<num_threads>[:<chunk_size>:<stride>]][-<streamId>:<domain_id>] with <size> in
       kB, MB or GB. The <thread_domain> defines where the threads are placed. <size> is the total data set
       size for the benchmark; the allocated vectors in memory sum up to this size. <num_threads> specifies
       how many threads are used in the <thread_domain>. Threads are always placed using a compact policy in
       likwid-bench. This means that by default all SMT threads are used. Optionally, similar to the
       expression-based syntax in likwid-pin, a <chunk_size> and <stride> can be provided. The placement can
       also be controlled for every stream (array, vector). By default all arrays are placed in the same
       <thread_domain> the threads are running in. To place the data in a different domain, specify the
       target domain for every stream of a benchmark case (the total number of streams can be obtained with
       the -l option). Multiple streams are comma separated. Either no placement is given, or all streams
       have to be placed explicitly. Please refer to the Wiki pages on
       https://github.com/RRZE-HPC/likwid/wiki/Likwid-Bench for further details and examples on usage. With
       -W each thread initializes its own chunk of the streams, but placement of the streams is deactivated.
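
       To make the structure of a workgroup expression concrete, the following Python sketch splits an
       expression into its fields. It is an illustration under the grammar given above, not the parser used
       by likwid-bench; the function name parse_workgroup and the dictionary keys are assumptions made for
       this example, and the size is kept as an uninterpreted string.

```python
def parse_workgroup(expr):
    """Split a workgroup expression into its fields.

    Illustrative only (not LIKWID's parser). Grammar assumed from this page:
    <thread_domain>:<size>[:<num_threads>[:<chunk_size>:<stride>]]
        [-<streamId>:<domain_id>[,<streamId>:<domain_id>...]]
    """
    # Everything after the first '-' is the optional stream placement.
    main, sep, placement = expr.partition("-")
    fields = main.split(":")
    # Valid forms have 2, 3 or 5 colon-separated fields.
    if len(fields) not in (2, 3, 5):
        raise ValueError(f"malformed workgroup expression: {expr!r}")

    group = {
        "domain": fields[0],
        "size": fields[1],  # kept as text, e.g. "100kB"
        "num_threads": int(fields[2]) if len(fields) >= 3 else None,
        "chunk_size": int(fields[3]) if len(fields) == 5 else None,
        "stride": int(fields[4]) if len(fields) == 5 else None,
        "streams": {},  # stream id -> domain the stream is placed in
    }
    if sep:
        for item in placement.split(","):
            stream_id, domain_id = item.split(":")
            group["streams"][int(stream_id)] = domain_id
    return group


print(parse_workgroup("S0:1GB:10:1:2-0:S1,1:S1"))
```

       A value of None for num_threads corresponds to the case where the field is omitted and every hardware
       thread of the domain is used.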

EXAMPLE

       1.  Run the copy benchmark on socket 0 ( S0 ) with a total data set size of 100kB.

       likwid-bench -t copy -w S0:100kB

       Since no <num_threads> is given in the workgroup expression, each hardware thread of socket 0 gets one
       application thread. The workload is split up between all threads and the number of iterations is
       determined automatically.

       2.  Run the triad benchmark code with explicitly 100 iterations per thread with 2 threads on the socket 0
           ( S0 ) and a data size of 1GB.

       likwid-bench -t triad -i 100 -w S0:1GB:2:1:2

       Assuming socket 0 ( S0 ) has 2 physical cores with SMT enabled, hence 4 hardware threads in total, one
       thread is assigned to each physical core of socket 0.
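
       The effect of <chunk_size> and <stride> in this example can be modelled with a short Python sketch. It
       assumes that the hardware threads of the domain form a list in which the SMT siblings of a core are
       adjacent (the compact policy), and that <chunk_size> entries are taken every <stride> entries. The
       function name and the labels such as "core0-smt0" are hypothetical; this is a simplified model, not
       LIKWID code.

```python
def select_hw_threads(hw_threads, num_threads, chunk_size=1, stride=1):
    """Pick num_threads entries, taking chunk_size entries every stride entries.

    Simplified model of the <chunk_size>:<stride> selection; not LIKWID code.
    """
    if chunk_size < 1 or stride < 1:
        raise ValueError("chunk_size and stride must be at least 1")
    selected = []
    start = 0
    while len(selected) < num_threads and start < len(hw_threads):
        for entry in hw_threads[start:start + chunk_size]:
            if len(selected) < num_threads:
                selected.append(entry)
        start += stride
    return selected


# Socket with 2 cores and 2 SMT threads each, listed in compact order.
socket0 = ["core0-smt0", "core0-smt1", "core1-smt0", "core1-smt1"]

# S0:1GB:2:1:2 -> 2 threads, chunk size 1, stride 2: one thread per core.
print(select_hw_threads(socket0, 2, chunk_size=1, stride=2))

# S0:1GB:2 -> default compact placement: both SMT threads of the first core.
print(select_hw_threads(socket0, 2))
```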

       3.  Run the update benchmark on socket 0 ( S0 ) with a workload of 100kB and on socket 1 ( S1 ) with  the
           same workload.

       likwid-bench -t update -w S0:100kB -w S1:100kB

       The results of both workgroups are combined for the output. Hence the workload in each workgroup
       expression should have the same size.

       4.  Run the update benchmark but measure the memory traffic with likwid-perfctr. The option
           INSTRUMENT_BENCH in config.mk needs to be true at compile time to use that feature.

       likwid-perfctr -c E:S0:4 -g MEM -m likwid-bench -t update -w S0:100kB

       likwid-perfctr will configure and start the performance counters on socket 0 ( S0 ) with 4 threads prior
       to the execution of likwid-bench. The performance counters are read right before and after running the
       benchmarking code to minimize the interference caused by the measurement.

       5.  Run the copy benchmark and place the data on another socket

       likwid-bench -t copy -w S0:1GB:10:1:2-0:S1,1:S1

       Streams 0 and 1 are placed in thread domain S1, which is socket 1. This can be verified because the
       initialization threads print where they are running.

WARNING

       Since LIKWID 5.0, workgroups may have different numbers of threads and different sizes. Both features
       seem promising, but they cause a range of problems. If you have a NUMA system and run with multiple
       threads on NUMA node 0 but with fewer on NUMA node 1, the threads on NUMA node 1 cause less pressure on
       the memory interface and consequently achieve higher throughput. They will finish early compared to the
       threads on NUMA node 0. The runtime used for calculating the bandwidth and MFlops/s values is the
       maximal runtime of all threads, hence that of NUMA node 0. Similar problems exist with different
       sizes: one workgroup might run in cache while the other waits for data from the memory interface.
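
       The effect can be illustrated with a small calculation in Python. The byte counts and runtimes below
       are made-up numbers, and the two functions are a simplified model of the reporting described above,
       not code taken from likwid-bench.

```python
def combined_bandwidth(groups):
    """Bandwidth when all data is divided by the maximal runtime.

    groups: list of (bytes_moved, runtime_seconds), one entry per workgroup.
    Simplified model of the behaviour described in this section.
    """
    total_bytes = sum(moved for moved, _ in groups)
    max_runtime = max(runtime for _, runtime in groups)
    return total_bytes / max_runtime


def sum_of_group_bandwidths(groups):
    """Bandwidth each workgroup achieved on its own, summed up."""
    return sum(moved / runtime for moved, runtime in groups)


# Hypothetical run: both workgroups move 8 GB, but the workgroup on
# NUMA node 1 has fewer threads competing and finishes in half the time.
groups = [(8.0e9, 2.0), (8.0e9, 1.0)]

print(combined_bandwidth(groups) / 1e9, "GB/s reported with maximal runtime")
print(sum_of_group_bandwidths(groups) / 1e9, "GB/s achieved per group, summed")
```

       In this model the faster workgroup is charged with the runtime of the slower one, so the combined
       value understates what the individual workgroups achieved.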

AUTHOR

       Written by Thomas Gruber <thomas.roehl@googlemail.com>.

BUGS

       Report bugs on <https://github.com/RRZE-HPC/likwid/issues>.

SEE ALSO

       likwid-perfctr(1), likwid-pin(1), likwid-topology(1), likwid-setFrequencies(1)

likwid-5                                           26.07.2022                                    LIKWID-BENCH(1)