NAME

       xen-tscmode - Xen TSC (time stamp counter) and timekeeping discussion

OVERVIEW

       As of Xen 4.0, a config option called tsc_mode may be specified for each domain.  The default for
       tsc_mode handles the vast majority of hardware and software environments.  This document is intended for
       Xen users and administrators who may need to select a non-default tsc_mode.

       Proper selection of tsc_mode depends on an understanding not only of the guest operating system (OS), but
       also of every application that will ever run on this guest OS.  This is because tsc_mode applies
       equally to the OS and to ALL apps that run on this domain, now or in the future.

       Key questions to be answered for the OS and/or each application are:

       •   Does the OS/app use the rdtsc instruction at all?  (We will explain below how to determine this.)

       •   At what frequency is the rdtsc instruction executed by either the OS or any running apps?  If the sum
           exceeds about 10,000 rdtsc instructions per second per processor, we call this a "high-TSC-frequency"
           OS/app/environment.  (This is relatively rare, and developers of OSes and apps that are
           high-TSC-frequency are usually aware of it.)

       •   If the OS/app does use rdtsc, will it behave incorrectly if "time goes backwards" or if the frequency
           of the TSC suddenly changes?  If so, we call this a "TSC-sensitive" app or OS; otherwise it is
           "TSC-resilient".

       This last is the US$64,000 question, as it may be very difficult (or, for legacy apps, even impossible) to
       predict all possible failure cases.  As a result, unless proven otherwise, any app that uses rdtsc must
       be assumed to be TSC-sensitive and, as we will see, this is the default starting in Xen 4.0.

       Xen's tsc_mode parameter determines the circumstances under which the rdtsc family of instructions
       is executed "natively" vs emulated.  Roughly speaking, native means rdtsc is fast but TSC-sensitive apps
       may, under unpredictable circumstances, run incorrectly; emulated means there is some performance
       degradation (unobservable in most cases), but TSC-sensitive apps will always run correctly.  Prior to Xen
       4.0, all rdtsc instructions were native: "fast but potentially incorrect."  Starting with Xen 4.0, the
       default is that all rdtsc instructions are "correct but potentially slow".  The tsc_mode parameter in 4.0
       provides an intelligent default but allows system administrators to choose, domain by domain, how
       rdtsc instructions are executed.

       The non-default choices for tsc_mode are:

       •   tsc_mode=1 (always emulate).

           All rdtsc instructions are emulated; this is the best choice when TSC-sensitive apps are running  and
           it is necessary to understand worst-case performance degradation for a specific hardware environment.

       •   tsc_mode=2 (never emulate).

           This is the same as prior to Xen 4.0 and is the best choice if it is certain that all apps running in
           this VM are TSC-resilient and highest performance is required.

       •   tsc_mode=3 (PVRDTSCP).

           This mode has been removed.

       If tsc_mode is left unspecified (or set to tsc_mode=0), a hybrid algorithm is used to ensure
       correctness while providing the best performance possible given:

       •   the requirement of correctness,

       •   the underlying hardware, and

       •   whether or not the VM has been saved/restored/migrated.

       The rest of this document explains this in more detail.

DETERMINING RDTSC FREQUENCY

       To determine the frequency of rdtsc instructions that are emulated, an "xl" command  can  be  used  by  a
       privileged user of domain0.  The command:

           # xl debug-key s; xl dmesg | tail

       provides information about TSC usage in each domain where TSC emulation is currently enabled.
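
       As a rough illustration of the "high-TSC-frequency" threshold from the OVERVIEW, the sketch below turns
       two samples of a cumulative emulated-rdtsc count into a rate per second per processor.  The function
       names, parameters, and sample numbers are hypothetical; how the counts are read out of the xl dmesg
       output is not covered here.

```python
# Threshold quoted in this document: about 10,000 rdtsc per second per processor.
HIGH_TSC_FREQUENCY_THRESHOLD = 10_000


def rdtsc_rate_per_cpu(count_start, count_end, interval_seconds, num_vcpus):
    """Return emulated rdtsc instructions per second per processor.

    count_start and count_end are two samples of a cumulative counter
    (hypothetical inputs), taken interval_seconds apart.
    """
    if interval_seconds <= 0 or num_vcpus <= 0:
        raise ValueError("interval_seconds and num_vcpus must be positive")
    return (count_end - count_start) / interval_seconds / num_vcpus


def is_high_tsc_frequency(rate_per_cpu):
    """True if the rate exceeds the document's rough threshold."""
    return rate_per_cpu > HIGH_TSC_FREQUENCY_THRESHOLD


# Made-up samples: 240,000 emulated rdtsc in 60 seconds on a 2-vCPU domain.
rate = rdtsc_rate_per_cpu(1_000_000, 1_240_000, 60, 2)
print(rate, is_high_tsc_frequency(rate))
```

       Sampling over a longer interval smooths out bursts; the threshold itself is only the approximate figure
       given in this document.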

TSC HISTORY

       To understand tsc_mode completely, some background on TSC is required:

       The x86 "timestamp counter", or TSC, is a 64-bit register on each processor that increases monotonically.
       Historically, TSC incremented every processor cycle, but on recent processors, it increases at a constant
       rate even if the processor changes frequency (for example, to reduce processor power usage).  TSC is
       known by x86 programmers as the fastest, highest-precision measurement of the passage of time, so it is
       often used as a foundation for performance monitoring.  And since it is guaranteed to be monotonically
       increasing and, at 64 bits, is guaranteed not to wrap around within 10 years, it is sometimes used as a
       random number or a unique sequence identifier, such as to stamp transactions so they can be replayed in a
       specific order.
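
       The wraparound claim can be checked with simple arithmetic.  The sketch below is only an illustration:
       the function name and the 3GHz tick rate are assumptions chosen for the example.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_until_wrap(tsc_hz, bits=64):
    """Years for a counter of the given width to wrap when ticking at tsc_hz."""
    return (2 ** bits) / tsc_hz / SECONDS_PER_YEAR


# Assumed example rate: a TSC ticking at 3GHz.
print(round(years_until_wrap(3e9)))
```

       At that assumed rate the counter takes on the order of two centuries to wrap, so the 10-year figure
       leaves a wide margin.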

       On most older SMP and early multi-core machines, TSC was not synchronized between processors.  Thus if an
       application were to read the TSC on one processor, were then moved by the OS to another processor, and
       then read TSC again, it might appear that "time went backwards".  This loss of monotonicity resulted in
       many obscure application bugs when TSC-sensitive apps were ported from a uniprocessor to an SMP
       environment; as a result, many applications (especially in the Windows world) removed their dependency on
       TSC and replaced their timestamp needs with OS-specific functions, losing both performance and precision.

       On some more recent generations of multi-core machines, especially multi-socket multi-core machines, the
       TSC was synchronized, but if one processor were to enter certain low-power states, its TSC would stop,
       destroying the synchrony and again causing obscure bugs.  This reinforced decisions to avoid use of TSC
       altogether.

       On the most recent generations of multi-core machines, however, synchronization is provided across all
       processors in all power states, even on multi-socket machines, and the hardware provides a flag that
       indicates that TSC is synchronized and "invariant".  Thus TSC is once again useful for applications, and
       even newer operating systems are using and depending upon TSC for critical timekeeping tasks when running
       on these recent machines.

       We  will  refer  to  hardware  that  ensures TSC is both synchronized and invariant as "TSC-safe" and any
       hardware on which TSC is not (or may not remain) synchronized as "TSC-unsafe".

       As a result of TSC's sordid history, two classes of applications use TSC: old applications designed for
       single processors, and the most recent enterprise applications, which require high-frequency,
       high-precision timestamping.

       We will refer to apps that might break if running on a TSC-unsafe machine as "TSC-sensitive", and to apps
       that don't use TSC, or that use it in a way for which monotonicity and frequency invariance are
       unimportant, as "TSC-resilient".

       The emergence of virtualization once  again  complicates  the  usage  of  TSC.   When  features  such  as
       save/restore or live migration are employed, a guest OS and all its currently running applications may be
       invisibly transported to an entirely different physical machine.  While TSC may be "safe" on one machine,
       it  is  essentially  impossible  to  precisely  synchronize  TSC  across  a data center or even a pool of
       machines.  As a result, when run in a virtualized environment, rare and obscure  "time  going  backwards"
       problems  might  once again occur for those TSC-sensitive applications.  Worse, if a guest OS moves from,
       for example, a 3GHz machine to a 1.5GHz machine, attempts by an OS/app to measure time intervals with TSC
       may without notice be incorrect by a factor of two.
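
       The factor-of-two error can be reproduced with a small calculation.  This is a sketch under stated
       assumptions: the app calibrated its TSC frequency once on the 3GHz source machine and keeps using that
       value after the move.  All names and numbers are illustrative.

```python
def elapsed_seconds(tsc_start, tsc_end, assumed_hz):
    """Convert a TSC delta to seconds using the frequency the app believes in."""
    return (tsc_end - tsc_start) / assumed_hz


calibrated_hz = 3_000_000_000   # frequency the app measured on the source machine
actual_hz = 1_500_000_000       # native TSC rate on the destination machine
true_seconds = 10               # real time that passes on the destination

ticks = actual_hz * true_seconds            # ticks that actually accumulate
measured = elapsed_seconds(0, ticks, calibrated_hz)
print(measured, true_seconds / measured)
```

       The app reports half the real interval, and nothing in the TSC values themselves tells it that the tick
       rate changed.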

       The rdtsc (read timestamp counter) instruction is used to read the TSC register.  The rdtscp instruction
       is a variant of rdtsc on recent processors.  We refer to these together as the rdtsc family of
       instructions, or just "rdtsc".  Instructions in the rdtsc family are non-privileged, but privileged
       software may set a control-register bit (CR4.TSD) to cause all rdtsc family instructions to trap.  This
       trap can be detected by Xen, which can then transparently "emulate" the results of the rdtsc instruction
       and return control to the code following the rdtsc instruction.

       To provide a "safe" TSC, i.e. to ensure both TSC monotonicity and a fixed rate, Xen provides rdtsc
       emulation whenever necessary or when explicitly specified by a per-VM configuration option.  TSC
       emulation is relatively slow: roughly 15-20 times slower than the rdtsc instruction when executed
       natively.  However, except when an OS or application uses the rdtsc instruction at a high frequency (e.g.
       more than about 10,000 times per second per processor), this performance degradation is not noticeable
       (i.e. <0.3%).  Moreover, TSC emulation is nearly always faster than OS-provided alternatives (e.g. Linux's
       gettimeofday).  For environments where it is certain that all apps are TSC-resilient (i.e.
       "TSC-safeness" is not necessary) and highest performance is a requirement, TSC emulation may be entirely
       disabled (tsc_mode==2).
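
       To see how the overhead scales with rdtsc frequency, the sketch below multiplies the rdtsc rate by a
       per-instruction emulation cost.  The 300ns cost is an assumption, chosen only because it reproduces the
       0.3% bound at 10,000 rdtsc per second quoted above; it is not a measured value, and the function name is
       hypothetical.

```python
def emulation_overhead_fraction(rdtsc_per_second, cost_ns_per_emulated_rdtsc):
    """Fraction of one processor's time spent emulating rdtsc."""
    return rdtsc_per_second * cost_ns_per_emulated_rdtsc / 1e9


# Assumed cost per emulated rdtsc; back-derived from the 0.3% figure, not measured.
ASSUMED_COST_NS = 300

for rate in (1_000, 10_000, 1_000_000):
    fraction = emulation_overhead_fraction(rate, ASSUMED_COST_NS)
    print(f"{rate:>9} rdtsc/s -> {fraction:.2%} of one processor")
```

       Under this assumption the cost is negligible at low rates but grows linearly, which is why only
       high-TSC-frequency workloads are likely to notice emulation.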

       The default mode (tsc_mode==0) checks TSC-safeness of the underlying hardware on which the virtual
       machine is launched.  If it is TSC-safe, rdtsc will execute at hardware speed; if it is not, rdtsc will
       be emulated.  Once a virtual machine is saved and restored, or migrated, however, there are two
       possibilities: TSC remains native IF the source physical machine and target physical machine have the
       same TSC frequency (or, for HVM/PVH guests, if TSC scaling support is available); else TSC is emulated.
       Note that, though emulated, the "apparent" TSC frequency will be the TSC frequency of the initial
       physical machine, even after migration.
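
       The decision described above can be summarized as a small model.  This is a sketch of the rules as
       stated in this document, not Xen's actual code: the function and parameter names are hypothetical, and
       the sketch assumes that a default-mode domain that was already emulated stays emulated after migration.

```python
def emulated_at_launch(tsc_mode, host_is_tsc_safe):
    """Return True if rdtsc is emulated when the domain is first launched."""
    if tsc_mode == 1:          # always emulate
        return True
    if tsc_mode == 2:          # never emulate
        return False
    if tsc_mode == 0:          # default: native only on TSC-safe hardware
        return not host_is_tsc_safe
    raise ValueError("unsupported tsc_mode: %r" % (tsc_mode,))


def emulated_after_migration(tsc_mode, emulated_before, same_tsc_frequency,
                             guest_type, tsc_scaling_available):
    """Return True if rdtsc is emulated after save/restore or migration."""
    if tsc_mode == 1:
        return True
    if tsc_mode == 2:
        return False
    if tsc_mode != 0:
        raise ValueError("unsupported tsc_mode: %r" % (tsc_mode,))
    if emulated_before:
        return True            # assumption: an emulated TSC stays emulated
    if same_tsc_frequency:
        return False           # same frequency on source and target: stay native
    if guest_type in ("hvm", "pvh") and tsc_scaling_available:
        return False           # hardware TSC scaling keeps the guest frequency
    return True                # otherwise fall back to emulation


before = emulated_at_launch(0, host_is_tsc_safe=True)
after = emulated_after_migration(0, before, same_tsc_frequency=False,
                                 guest_type="pv", tsc_scaling_available=True)
print(before, after)
```

       In the example, a PV domain launched natively on TSC-safe hardware becomes emulated after moving to a
       host with a different TSC frequency, because the scaling exception applies only to HVM/PVH guests.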

       Finally, tsc_mode==1 always enables TSC emulation, regardless of the underlying physical hardware.  The
       "apparent" TSC frequency will be the TSC frequency of the initial physical machine, even after migration.
       This mode is useful to measure any performance degradation that might be encountered by a tsc_mode==0
       domain after migration occurs or when it is running on TSC-unsafe hardware.

       Note that while Xen ensures that an emulated TSC is "safe" across migration, it does not ensure  that  it
       continues  to tick at the same rate during the actual migration.  As an oversimplified example, if TSC is
       ticking once per second in a guest, and the guest is saved when the TSC is 1000, then restored 30 seconds
       later, TSC is only guaranteed to be greater than or equal to 1001, not precisely 1030.  This has some  OS
       implications as will be seen in the next section.

TSC INVARIANT BIT and NO_MIGRATE

       Related to TSC emulation, the "TSC Invariant" bit is an architecturally defined cpuid bit on the most
       recent x86 processors.  If set, TSC invariance ensures that the TSC is "safe"; that is, it will increment
       at a constant rate regardless of power events, will be synchronized across all processors, and was
       properly initialized to zero on all processors at boot-time by system hardware/BIOS.  As long as system
       software never writes to TSC, TSC will be safe and continuously incremented at a fixed rate and thus can
       be used as a system "clocksource".

       This bit is used by some OSes, and specifically by Linux starting with version 2.6.30(?), to select TSC
       as a system clocksource.  Once selected, TSC remains the Linux system clocksource unless manually
       overridden.  In a virtualized environment, since it is not possible to synchronize TSC across all the
       machines in a pool or data center, a migration may "break" TSC as a usable clocksource; while time will
       not go backwards, it may not track wallclock time well enough to avoid certain time-sensitive
       consequences.  As a result, Xen can only expose the TSC Invariant bit to a guest OS if it is certain that
       the domain will never migrate.  As of Xen 4.0, the "no_migrate=1" VM configuration option may be
       specified to disable migration.  If no_migrate is selected and the VM is running on a physical machine
       with "TSC Invariant", Linux 2.6.30+ will safely use TSC as the system clocksource.  However, attempts to
       migrate or, once saved, restore this domain will fail.

       There is another cpuid-related complication: The x86 cpuid instruction is  non-privileged.   HVM  domains
       are  configured  to  always trap this instruction to Xen, where Xen can "filter" the result.  In a PV OS,
       all cpuid instructions have been replaced by  a  paravirtualized  equivalent  of  the  cpuid  instruction
       ("pvcpuid")  and  also  trap  to  Xen.   But  apps  in a PV guest that use a cpuid instruction execute it
       directly, without a trap to Xen.  As a result, an app may directly examine  the  physical  TSC  Invariant
       cpuid bit and make decisions based on that bit.

HARDWARE TSC SCALING

       Intel VMX TSC scaling and AMD SVM TSC ratio allow the guest TSC read by guest rdtsc/rdtscp to increase at
       a frequency different from the host TSC frequency.

       If an HVM container in default TSC mode (tsc_mode=0) is created on a host that provides constant TSC, its
       guest TSC frequency will be the same as the host's.  If it is later migrated to another host that provides
       constant TSC and supports Intel VMX TSC scaling/AMD SVM TSC ratio, its guest TSC frequency will be the
       same before and after migration.

       For the HVM container above in default TSC mode (tsc_mode=0), if both hosts support rdtscp, both guest
       rdtsc and rdtscp instructions will be executed natively before and after migration.
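
       The effect of scaling can be pictured as multiplying host ticks by the ratio of guest to host
       frequency.  The sketch below shows only that ratio using integer arithmetic; the hardware applies a
       fixed-point multiplier whose format is not modelled here, and the function name and frequencies are
       illustrative assumptions.

```python
def scaled_guest_ticks(host_tsc_delta, guest_khz, host_khz):
    """Guest-visible TSC ticks for a given number of host TSC ticks."""
    if host_khz <= 0:
        raise ValueError("host_khz must be positive")
    return host_tsc_delta * guest_khz // host_khz


# Assumed example: a guest created on a 3GHz host, later migrated to a 2GHz host.
guest_khz = 3_000_000
new_host_khz = 2_000_000
host_ticks_in_one_second = new_host_khz * 1000

print(scaled_guest_ticks(host_ticks_in_one_second, guest_khz, new_host_khz))
```

       One second on the new host still advances the guest TSC by 3,000,000,000 ticks, so the guest continues
       to see its original frequency without emulation.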

AUTHORS

       Dan Magenheimer <dan.magenheimer@oracle.com>

4.17.4-pre                                         2024-04-01                                     xen-tscmode(7)