Provided by: zfsutils-linux_2.1.5-1ubuntu6~22.04.5_amd64

NAME

       zpoolconcepts — overview of ZFS storage pools

DESCRIPTION

   Virtual Devices (vdevs)
       A  "virtual  device"  describes a single device or a collection of devices organized according to certain
       performance and fault characteristics.  The following virtual devices are supported:

       disk     A block device, typically located under /dev.  ZFS can  use  individual  slices  or  partitions,
                though  the  recommended  mode of operation is to use whole disks.  A disk can be specified by a
                full path, or it can be a shorthand name (the relative portion of the path under /dev).  A whole
                disk can be specified by omitting the slice or  partition  designation.   For  example,  sda  is
                equivalent  to  /dev/sda.   When  given  a  whole  disk,  ZFS  automatically labels the disk, if
                necessary.

       file     A regular file.  The use of files as a backing store is strongly discouraged.   It  is  designed
                primarily  for  experimental  purposes,  as the fault tolerance of a file is only as good as the
                file system on which it resides.  A file must be specified by a full path.

       mirror   A mirror of two or more devices.   Data  is  replicated  in  an  identical  fashion  across  all
                components  of a mirror.  A mirror with N disks of size X can hold X bytes and can withstand N-1
                devices failing without losing data.

       raidz, raidz1, raidz2, raidz3
                A variation on RAID-5 that allows for better distribution of parity and  eliminates  the  RAID-5
                "write hole" (in which data and parity become inconsistent after a power loss).  Data and parity
                are striped across all disks within a raidz group.

                A  raidz  group  can  have  single,  double,  or triple parity, meaning that the raidz group can
                sustain one, two, or three failures, respectively, without losing any  data.   The  raidz1  vdev
                type specifies a single-parity raidz group; the raidz2 vdev type specifies a double-parity raidz
                group;  and  the raidz3 vdev type specifies a triple-parity raidz group.  The raidz vdev type is
                an alias for raidz1.

                A raidz group with N disks of size X with P parity disks can hold  approximately  (N-P)*X  bytes
                and  can  withstand  P  devices  failing without losing data. The minimum number of devices in a
                raidz group is one more than the number of parity disks.  The recommended number  is  between  3
                and 9 to help increase performance.

       draid, draid1, draid2, draid3
                A variant of raidz that provides integrated distributed hot spares, which allow for faster
                resilvering while retaining the benefits of raidz.  A dRAID vdev is  constructed  from  multiple
                internal  raidz  groups,  each  with  D  data  devices  and  P  parity devices. These groups are
                distributed over all of the children in order to fully utilize the available disk performance.

                Unlike raidz, dRAID uses a fixed stripe width (padding as necessary with zeros) to  allow  fully
                sequential resilvering.  This fixed stripe width significantly affects both usable capacity and
                IOPS.  For example, with the default D=8 and 4kB disk sectors the  minimum  allocation  size  is
                32kB.   If  using  compression,  this  relatively large allocation size can reduce the effective
                compression ratio.  When using ZFS volumes and dRAID, the default of the  volblocksize  property
                is increased to account for the allocation size.  If a dRAID pool will hold a significant amount
                of small blocks, it is recommended to also add a mirrored special vdev to store those blocks.

                With regard to I/O, performance is similar to raidz since for any read all D data disks must be
                accessed.    Delivered     random     IOPS     can     be     reasonably     approximated     as
                floor((N-S)/(D+P))*single_drive_IOPS.

                Like raidz, a dRAID can have single-, double-, or triple-parity.  The draid1, draid2, and draid3
                types can be used to specify the parity level.  The draid vdev type is an alias for draid1.

                A  dRAID  with  N  disks  of  size  X,  D data disks per redundancy group, P parity level, and S
                distributed hot spares can hold  approximately  (N-S)*(D/(D+P))*X  bytes  and  can  withstand  P
                devices failing without losing data.

       draid[parity][:datad][:childrenc][:sparess]
                A  non-default  dRAID  configuration  can be specified by appending one or more of the following
                optional arguments to the draid keyword:
                parity    The parity level (1-3).
                data      The number of data devices per redundancy group.  In general, a  smaller  value  of  D
                          will  increase  IOPS,  improve  the compression ratio, and speed up resilvering at the
                          expense of total usable capacity.  Defaults to 8, unless N-P-S is less than 8.
                children  The expected number of children.  Useful as a cross-check when listing a large  number
                          of devices.  An error is returned when the provided number of children differs.
                spares    The number of distributed hot spares.  Defaults to zero.

       spare    A  pseudo-vdev  which keeps track of available hot spares for a pool.  For more information, see
                the “Hot Spares” section.

       log      A separate intent log device.  If more than one log device is specified, then writes  are  load-
                balanced  between  devices.   Log  devices  can  be mirrored.  However, raidz vdev types are not
                supported for the intent log.  For more information, see the “Intent Log” section.

       dedup    A device dedicated solely for deduplication tables.  The redundancy of this device should  match
                the  redundancy  of  the  other  normal  devices  in the pool.  If more than one dedup device is
                specified, then allocations are load-balanced between those devices.

       special  A device dedicated solely for allocating various kinds  of  internal  metadata,  and  optionally
                small  file  blocks.   The  redundancy  of  this device should match the redundancy of the other
                normal devices in the pool.  If more than one special device is specified, then allocations  are
                load-balanced between those devices.

                For more information on special allocations, see the “Special Allocation Class” section.

       cache    A  device  used  to cache storage pool data.  A cache device cannot be configured as a mirror or
                raidz group.  For more information, see the “Cache Devices” section.
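
       The capacity and IOPS formulas given above for raidz and dRAID can be tried out with a small shell
       sketch.  The function names are hypothetical, sizes are plain integers in a unit of the caller's
       choosing, and the minimum allocation rule (D times the sector size) is inferred from the D=8, 4kB
       example rather than stated as a general formula.  The results are approximations only.

```shell
# Approximate sizing helpers for raidz and dRAID (illustrative names).

# raidz: N disks of size X with P parity disks hold about (N-P)*X.
raidz_capacity() {
    n=$1 x=$2 p=$3
    # A raidz group needs at least one more device than its parity level.
    if [ "$n" -le "$p" ]; then
        echo "raidz$p needs at least $((p + 1)) devices" >&2
        return 1
    fi
    echo $(( (n - p) * x ))
}

# dRAID: minimum allocation, assumed to be D data devices times sector size.
draid_min_alloc() {
    d=$1 sector=$2
    echo $(( d * sector ))
}

# dRAID: random IOPS ~ floor((N-S)/(D+P)) * single_drive_IOPS.
# Shell integer division truncates, which equals floor for these values.
draid_random_iops() {
    n=$1 s=$2 d=$3 p=$4 drive_iops=$5
    echo $(( (n - s) / (d + p) * drive_iops ))
}

# dRAID: usable capacity ~ (N-S) * (D/(D+P)) * X.
# Multiplication is done before division to limit integer rounding.
draid_capacity() {
    n=$1 s=$2 d=$3 p=$4 x=$5
    echo $(( (n - s) * d * x / (d + p) ))
}

raidz_capacity 6 4 2            # six 4 TB disks, raidz2: prints 16
draid_min_alloc 8 4             # D=8, 4 kB sectors: prints 32
draid_random_iops 11 1 8 2 200  # 200 IOPS per drive is a made-up figure: prints 200
draid_capacity 11 1 8 2 4       # eleven 4 TB children, one spare: prints 32
```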

       Virtual devices cannot be nested, so a mirror or raidz virtual device can only contain  files  or  disks.
       Mirrors of mirrors (or other combinations) are not allowed.

       A  pool  can  have any number of virtual devices at the top of the configuration (known as "root vdevs").
       Data is dynamically distributed across all top-level devices to  balance  data  among  devices.   As  new
       virtual devices are added, ZFS automatically places data on the newly available devices.

       Virtual  devices are specified one at a time on the command line, separated by whitespace.  Keywords like
       mirror and raidz are used to distinguish where a  group  ends  and  another  begins.   For  example,  the
       following creates a pool with two root vdevs, each a mirror of two disks:
             # zpool create mypool mirror sda sdb mirror sdc sdd
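
       The same command-line grammar applies to the draid[parity][:datad][:childrenc][:sparess] keyword
       described earlier.  As a rough sketch, a hypothetical helper (draid_spec is not part of ZFS) can
       assemble that keyword from its optional parts and reject parity levels outside 1-3:

```shell
# Build a dRAID vdev keyword of the form draid[parity][:datad][:childrenc][:sparess].
# Empty arguments are left out so that the defaults apply.
draid_spec() {
    parity=$1 data=$2 children=$3 spares=$4
    case "$parity" in
        ''|1|2|3) ;;
        *) echo "parity must be 1-3" >&2; return 1 ;;
    esac
    spec="draid$parity"
    if [ -n "$data" ]; then spec="$spec:${data}d"; fi
    if [ -n "$children" ]; then spec="$spec:${children}c"; fi
    if [ -n "$spares" ]; then spec="$spec:${spares}s"; fi
    echo "$spec"
}

draid_spec 2 4 11 1    # prints draid2:4d:11c:1s
draid_spec 1 '' '' 2   # prints draid1:2s
```

       The resulting keyword takes the place of mirror or raidz in front of the device list.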

   Device Failure and Recovery
       ZFS  supports a rich set of mechanisms for handling device failure and data corruption.  All metadata and
       data are checksummed, and ZFS automatically repairs bad data from a good copy when corruption is detected.

       In order to take advantage of these features, a pool must make use of  some  form  of  redundancy,  using
       either mirrored or raidz groups.  While ZFS supports running in a non-redundant configuration, where each
       root  vdev  is  simply a disk or file, this is strongly discouraged.  A single case of bit corruption can
       render some or all of your data unavailable.

       A pool's health status is described by one of three states: online, degraded, or faulted.  An online pool
       has all devices operating normally.  A degraded pool is one in which one or more devices have failed, but
       the data is still available due to a redundant configuration.  A faulted pool has corrupted metadata,  or
       one or more faulted devices, and insufficient replicas to continue functioning.
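
       A monitoring script might flag any pool that is not online.  The following is a minimal sketch:
       unhealthy_pools is a hypothetical name, and the function assumes tab-separated "name, health" lines
       on standard input, as produced by zpool list -H -o name,health.

```shell
# Print every pool whose health is not ONLINE and return 1 if any was found.
# Input: one "name<TAB>health" line per pool on stdin.
unhealthy_pools() {
    awk -F '\t' 'NF >= 2 && $2 != "ONLINE" { print $1 ": " $2; bad = 1 }
                 END { exit bad + 0 }'
}

# Usage on a system with ZFS loaded:
#   zpool list -H -o name,health | unhealthy_pools
```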

       The  health of the top-level vdev, such as a mirror or raidz device, is potentially impacted by the state
       of its associated vdevs, or component devices.  A top-level vdev or component device is  in  one  of  the
       following states:

       DEGRADED  One  or more top-level vdevs is in the degraded state because one or more component devices are
                 offline.  Sufficient replicas exist to continue functioning.

                 One or more component devices is in the degraded or  faulted  state,  but  sufficient  replicas
                 exist to continue functioning.  The underlying conditions are as follows:
                  •   The number of checksum errors exceeds acceptable levels and the device is degraded as an
                      indication that something may be wrong.  ZFS continues to use the device as necessary.
                  •   The number of I/O errors exceeds acceptable levels.  The device could not be marked as
                      faulted because there are insufficient replicas to continue functioning.

       FAULTED   One  or  more top-level vdevs is in the faulted state because one or more component devices are
                 offline.  Insufficient replicas exist to continue functioning.

                 One or more component devices is in the faulted  state,  and  insufficient  replicas  exist  to
                 continue functioning.  The underlying conditions are as follows:
                  •   The device could be opened, but the contents did not match expected values.
                  •   The number of I/O errors exceeds acceptable levels and the device is faulted to prevent
                      further use of the device.

       OFFLINE   The device was explicitly taken offline by the zpool offline command.

       ONLINE    The device is online and functioning.

       REMOVED   The device was physically removed while the system was running.  Device  removal  detection  is
                 hardware-dependent and may not be supported on all platforms.

       UNAVAIL   The  device could not be opened.  If a pool is imported when a device was unavailable, then the
                 device will be identified by a unique identifier instead of its path since the path  was  never
                 correct in the first place.

       Checksum errors represent events where a disk returned data that was expected to be correct, but was not.
       In other words, these are instances of silent data corruption.  The checksum errors are reported in zpool
       status  and zpool events.  When a block is stored redundantly, a damaged block may be reconstructed (e.g.
       from raidz parity or a mirrored copy).  In this case, ZFS reports the checksum error  against  the  disks
       that contained damaged data.  If a block is unable to be reconstructed (e.g. due to 3 disks being damaged
       in  a  raidz2 group), it is not possible to determine which disks were silently corrupted.  In this case,
       checksum errors are reported for all disks on which the block is stored.
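
       To pick out the devices that zpool status reports checksum errors against, a filter along these
       lines may help.  It is only a sketch: devices_with_cksum_errors is a hypothetical name, and the
       filter assumes the usual five-column config table (NAME, STATE, READ, WRITE, CKSUM).

```shell
# Read `zpool status` text on stdin and print "device count" for each row of
# the config table whose CKSUM column is not zero.
# Rows are recognised by numeric READ, WRITE and CKSUM fields, which skips the
# table header and the surrounding status text.
devices_with_cksum_errors() {
    awk '$3 ~ /^[0-9]/ && $4 ~ /^[0-9]/ && $5 ~ /^[0-9]/ && $5 != "0" { print $1, $5 }'
}

# Usage on a system with ZFS loaded:
#   zpool status tank | devices_with_cksum_errors
```

       As noted above, when a block cannot be reconstructed every disk holding it is charged with the
       error, so a nonzero count does not by itself identify the faulty disk.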

       If a device is removed and later re-attached to the system, ZFS attempts to online the device automatically.
       Device attachment detection is hardware-dependent and might not be supported on all platforms.

   Hot Spares
       ZFS allows devices to be associated with pools as "hot spares".  These devices are not actively  used  in
       the pool, but when an active device fails, it is automatically replaced by a hot spare.  To create a pool
       with hot spares, specify a spare vdev with any number of devices.  For example,
             # zpool create pool mirror sda sdb spare sdc sdd

       Spares  can be shared across multiple pools, and can be added with the zpool add command and removed with
       the zpool remove command.  Once a spare replacement is initiated, a new spare vdev is created within  the
       configuration that will remain there until the original device is replaced.  At this point, the hot spare
       becomes available again if another device fails.

       If a pool has a shared spare that is currently being used, the pool cannot be exported since other pools
       may use this shared spare, which may lead to potential data corruption.

       Shared  spares  add  some  risk.   If  the pools are imported on different hosts, and both pools suffer a
       device failure at the same time, both could attempt to use the spare at the same time.  This may  not  be
       detected, resulting in data corruption.

       An  in-progress  spare  replacement can be cancelled by detaching the hot spare.  If the original faulted
       device is detached, then the hot spare assumes its place in the configuration, and is  removed  from  the
       spare list of all active pools.

       The  draid  vdev  type  provides distributed hot spares.  These hot spares are named after the dRAID vdev
       they're a part of (draid1-2-3 specifies spare 3 of vdev 2, which is a single parity dRAID) and  may  only
       be used by that dRAID vdev.  Otherwise, they behave the same as normal hot spares.
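
       The naming scheme for distributed spares can be unpacked mechanically.  The sketch below uses a
       hypothetical helper, parse_draid_spare, and assumes names of exactly the form shown above (draid,
       a parity digit, then the vdev and spare numbers separated by hyphens).

```shell
# Split a distributed spare name such as draid1-2-3 into its three parts.
parse_draid_spare() {
    name=$1
    case "$name" in
        draid[1-3]-*-*) ;;
        *) echo "not a distributed spare name: $name" >&2; return 1 ;;
    esac
    rest=${name#draid}     # "1-2-3"
    parity=${rest%%-*}     # text before the first hyphen: parity level
    rest=${rest#*-}        # "2-3"
    vdev=${rest%%-*}       # dRAID vdev number
    spare=${rest#*-}       # spare number within that vdev
    echo "parity=$parity vdev=$vdev spare=$spare"
}

parse_draid_spare draid1-2-3   # prints parity=1 vdev=2 spare=3
```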

       Spares cannot replace log devices.

   Intent Log
       The  ZFS  Intent  Log  (ZIL)  satisfies  POSIX  requirements for synchronous transactions.  For instance,
       databases often require their transactions to be on stable storage devices when returning from  a  system
       call.  NFS and other applications can also use fsync(2) to ensure data stability.  By default, the intent
       log  is  allocated  from  blocks  within  the  main  pool.   However,  it might be possible to get better
       performance using separate intent log devices such as NVRAM or a dedicated disk.  For example:
             # zpool create pool sda sdb log sdc

       Multiple log devices can also be specified, and they can be mirrored.  See the “EXAMPLES” section for  an
       example of mirroring multiple log devices.

       Log  devices  can  be  added,  replaced,  attached,  detached  and removed.  In addition, log devices are
       imported and exported as part of the pool that  contains  them.   Mirrored  devices  can  be  removed  by
       specifying the top-level mirror vdev.

   Cache Devices
       Devices  can be added to a storage pool as "cache devices".  These devices provide an additional layer of
       caching between main memory and disk.  For read-heavy workloads, where  the  working  set  size  is  much
       larger  than  what can be cached in main memory, using cache devices allows much more of this working set
       to be served from low latency media.  Using cache devices provides the greatest  performance  improvement
       for random-read workloads of mostly static content.

       To create a pool with cache devices, specify a cache vdev with any number of devices.  For example:
             # zpool create pool sda sdb cache sdc sdd

       Cache  devices  cannot be mirrored or part of a raidz configuration.  If a read error is encountered on a
       cache device, that read I/O is reissued to the original storage pool device, which might  be  part  of  a
       mirrored or raidz configuration.

       The  content of the cache devices is persistent across reboots and restored asynchronously when importing
       the pool in L2ARC (persistent L2ARC).  This can be  disabled  by  setting  l2arc_rebuild_enabled=0.   For
       cache devices smaller than 1GB, we do not write the metadata structures required for rebuilding the L2ARC
       in order not to waste space.  This can be changed with l2arc_rebuild_blocks_min_l2size.  The cache device
       header  (512B)  is  updated  even  if  no metadata structures are written.  Setting l2arc_headroom=0 will
       result in scanning the full-length ARC lists for cacheable content to be  written  in  L2ARC  (persistent
       ARC).  If a cache device is added with zpool add, its label and header are overwritten and its
       contents are not restored in L2ARC, even if the device was previously part of the pool.  If a
       cache device is onlined with zpool online, its contents are restored in L2ARC.  This is useful when
       memory pressure kept the contents of the cache device from being fully restored in L2ARC.  The user can
       offline and then online the cache device when there is less memory pressure in order to fully restore its
       contents to L2ARC.
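
       On Linux, the module parameters named above can normally be read from /sys/module/zfs/parameters.
       The following sketch prints the three L2ARC settings mentioned in this section.  The function name
       and the ZFS_PARAM_DIR override are hypothetical and exist only to make the example easy to try on
       a machine without ZFS loaded.

```shell
# Print the L2ARC rebuild-related module parameters discussed above.
# ZFS_PARAM_DIR is an illustrative override; by default the Linux sysfs
# location of the zfs module parameters is used.
show_l2arc_params() {
    dir=${ZFS_PARAM_DIR:-/sys/module/zfs/parameters}
    for p in l2arc_rebuild_enabled l2arc_rebuild_blocks_min_l2size l2arc_headroom; do
        if [ -r "$dir/$p" ]; then
            printf '%s=%s\n' "$p" "$(cat "$dir/$p")"
        else
            printf '%s=(unavailable)\n' "$p"
        fi
    done
}

# Usage: show_l2arc_params
```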

   Pool checkpoint
       Before starting critical procedures that include destructive actions (like zfs destroy), an administrator
       can checkpoint the pool's state and, in the case of a mistake or failure, rewind the entire pool back to
       the   checkpoint.   Otherwise,  the  checkpoint  can  be  discarded  when  the  procedure  has  completed
       successfully.

       A pool checkpoint can be thought of as a pool-wide snapshot and should be used with care as  it  contains
       every  part of the pool's state, from properties to vdev configuration.  Thus, certain operations are not
       allowed while a pool has a checkpoint, specifically vdev removal/attach/detach, mirror splitting, and
       changing the pool's GUID.  Adding a new vdev is supported, but in the case of a rewind it will have to be
       added  again.   Finally,  users  of  this  feature  should  keep in mind that scrubs in a pool that has a
       checkpoint do not repair checkpointed data.

       To create a checkpoint for a pool:
             # zpool checkpoint pool

       To later rewind the pool to its checkpointed state, first export it and then rewind it during import:
             # zpool export pool
             # zpool import --rewind-to-checkpoint pool

       To discard the checkpoint from a pool:
             # zpool checkpoint -d pool

       Dataset reservations (controlled by the reservation and refreservation properties) may  be  unenforceable
       while  a  checkpoint  exists,  because  the  checkpoint  is allowed to consume the dataset's reservation.
       Finally, data that is part of the checkpoint but has been freed in the current state of the pool won't be
       scanned during a scrub.

   Special Allocation Class
       Allocations in the special class are dedicated to specific block types.  By  default  this  includes  all
       metadata,  the  indirect  blocks  of  user  data,  and  any  deduplication tables.  The class can also be
       provisioned to accept small file blocks.

       A pool must always have at least one  normal  (non-dedup/-special)  vdev  before  other  devices  can  be
       assigned  to the special class.  If the special class becomes full, then allocations intended for it will
       spill back into the normal class.

       Deduplication tables can be excluded from the special class by unsetting the zfs_ddt_data_is_special  ZFS
       module parameter.

       Inclusion  of  small  file  blocks  in the special class is opt-in.  Each dataset can control the size of
       small file blocks allowed in the special class by setting the special_small_blocks property to a nonzero value.
       See zfsprops(7) for more info on this property.
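
       The opt-in rule can be illustrated with a short sketch.  It assumes, following zfsprops(7), that a
       file block qualifies when its size is less than or equal to special_small_blocks and that a value
       of zero disables the behaviour.  The function name is hypothetical, sizes are in bytes, and the
       spill-over that occurs when the special class is full is not modelled.

```shell
# Succeed when a file block of the given size would be eligible for the
# special class under the given special_small_blocks value (both in bytes).
eligible_for_special() {
    block=$1 threshold=$2
    # A threshold of zero means small file blocks are not opted in.
    [ "$threshold" -gt 0 ] && [ "$block" -le "$threshold" ]
}

if eligible_for_special 16384 32768; then
    echo "special"     # a 16K block with a 32K threshold: this branch runs
else
    echo "normal"
fi
```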

OpenZFS                                           June 2, 2021                                  ZPOOLCONCEPTS(7)