Ubuntu Manpage: zpoolconcepts — overview of ZFS storage pools

Provided by: zfsutils-linux_2.1.5-1ubuntu6~22.04.5_amd64

NAME

       zpoolconcepts — overview of ZFS storage pools

DESCRIPTION

Virtual Devices (vdevs)
A "virtual device" describes a single device or a collection of devices organized according to certain
performance and fault characteristics. The following virtual devices are supported:

disk A block device, typically located under /dev. ZFS can use individual slices or partitions,
though the recommended mode of operation is to use whole disks. A disk can be specified by a
full path, or it can be a shorthand name (the relative portion of the path under /dev). A whole
disk can be specified by omitting the slice or partition designation. For example, sda is
equivalent to /dev/sda. When given a whole disk, ZFS automatically labels the disk, if
necessary.

file A regular file. The use of files as a backing store is strongly discouraged. File vdevs are
intended primarily for experimental purposes, as the fault tolerance of a file is only as good as the
file system on which it resides. A file must be specified by a full path.

mirror A mirror of two or more devices. Data is replicated in an identical fashion across all
components of a mirror. A mirror with N disks of size X can hold X bytes and can withstand N-1
devices failing without losing data.

raidz, raidz1, raidz2, raidz3
A variation on RAID-5 that allows for better distribution of parity and eliminates the RAID-5
"write hole" (in which data and parity become inconsistent after a power loss). Data and parity
are striped across all disks within a raidz group.

A raidz group can have single, double, or triple parity, meaning that the raidz group can
sustain one, two, or three failures, respectively, without losing any data. The raidz1 vdev
type specifies a single-parity raidz group; the raidz2 vdev type specifies a double-parity raidz
group; and the raidz3 vdev type specifies a triple-parity raidz group. The raidz vdev type is
an alias for raidz1.

A raidz group with N disks of size X with P parity disks can hold approximately (N-P)*X bytes
and can withstand P devices failing without losing data. The minimum number of devices in a
raidz group is one more than the number of parity disks. The recommended number is between 3
and 9 to help increase performance.
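The capacity rule above can be worked through with a small shell function. This is only a sketch of the approximation (N-P)*X stated in the text; the function name raidz_capacity is hypothetical and is not part of any ZFS tool.

```shell
#!/bin/sh
# Approximate usable capacity of a raidz group: (N - P) * X.
# Arguments: N = number of disks, P = parity level (1-3), X = size of each disk.
# The result is in the same unit as X. raidz_capacity is a hypothetical helper.
raidz_capacity() {
    n=$1 p=$2 x=$3
    # The text requires at least P + 1 devices in a raidz group.
    if [ "$n" -le "$p" ]; then
        echo "raidz$p needs at least $((p + 1)) devices, got $n" >&2
        return 1
    fi
    echo $(( (n - p) * x ))
}

# Six 4 TB disks in raidz2: (6 - 2) * 4 = 16 TB of approximate usable space.
raidz_capacity 6 2 4
```

The same six disks as raidz1 would give (6 - 1) * 4 = 20 TB, but the group would then survive only one failed device.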

draid, draid1, draid2, draid3
A variant of raidz that provides integrated distributed hot spares which allows for faster
resilvering while retaining the benefits of raidz. A dRAID vdev is constructed from multiple
internal raidz groups, each with D data devices and P parity devices. These groups are
distributed over all of the children in order to fully utilize the available disk performance.

Unlike raidz, dRAID uses a fixed stripe width (padding as necessary with zeros) to allow fully
sequential resilvering. This fixed stripe width significantly affects both usable capacity and
IOPS. For example, with the default D=8 and 4kB disk sectors the minimum allocation size is
32kB. If using compression, this relatively large allocation size can reduce the effective
compression ratio. When using ZFS volumes and dRAID, the default of the volblocksize property
is increased to account for the allocation size. If a dRAID pool will hold a significant amount
of small blocks, it is recommended to also add a mirrored special vdev to store those blocks.
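The 32kB figure in the example above is consistent with multiplying the number of data devices per group by the sector size. The sketch below assumes that relationship (it is inferred from the example, not quoted from the ZFS sources), and the function name draid_min_alloc_kb is hypothetical.

```shell
#!/bin/sh
# Minimum dRAID allocation size implied by the example in the text:
# D data devices per redundancy group times the disk sector size (in kB).
# Assumption: min allocation = D * sector size, as suggested by 8 * 4kB = 32kB.
draid_min_alloc_kb() {
    d=$1 sector_kb=$2
    echo $(( d * sector_kb ))
}

draid_min_alloc_kb 8 4   # default D=8 with 4kB sectors: prints 32
draid_min_alloc_kb 4 4   # a smaller D=4 halves the figure: prints 16
```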

With regard to I/O, performance is similar to raidz since for any read all D data disks must be
accessed. Delivered random IOPS can be reasonably approximated as
floor((N-S)/(D+P))*single_drive_IOPS.
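As a worked instance of the approximation above, the following sketch evaluates the formula with shell integer arithmetic. The function name draid_random_iops and the per-drive figure of 200 IOPS are illustrative assumptions.

```shell
#!/bin/sh
# floor((N - S) / (D + P)) * single_drive_IOPS
# Arguments: N children, S distributed spares, D data devices per group,
# P parity level, and the random IOPS of a single drive.
draid_random_iops() {
    n=$1 s=$2 d=$3 p=$4 drive_iops=$5
    # Shell integer division truncates, which equals floor() for
    # non-negative operands.
    echo $(( (n - s) / (d + p) * drive_iops ))
}

# 31 children, 1 distributed spare, D=8, P=2, 200 IOPS per drive:
# floor(30 / 10) * 200 = 600
draid_random_iops 31 1 8 2 200
```

With the same drives but D=4 and P=1, the result is floor(30 / 5) * 200 = 1200, which illustrates why a smaller D raises delivered IOPS.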

Like raidz, a dRAID can have single-, double-, or triple-parity. The draid1, draid2, and draid3
types can be used to specify the parity level. The draid vdev type is an alias for draid1.

A dRAID with N disks of size X, D data disks per redundancy group, P parity level, and S
distributed hot spares can hold approximately (N-S)*(D/(D+P))*X bytes and can withstand P
devices failing without losing data.
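The capacity approximation above can be evaluated in the same way. This is a sketch only: draid_capacity is a hypothetical name, and the multiplication is done before the division so that integer arithmetic loses as little precision as possible.

```shell
#!/bin/sh
# Approximate usable capacity of a dRAID vdev: (N - S) * (D / (D + P)) * X.
# Arguments: N children, S distributed spares, D data devices per group,
# P parity level, X size of each disk. The result is in the same unit as X.
draid_capacity() {
    n=$1 s=$2 d=$3 p=$4 x=$5
    # Multiply first, divide last: the shell only has integer arithmetic.
    echo $(( (n - s) * d * x / (d + p) ))
}

# 11 disks of 4000 GB, 1 distributed spare, D=4, P=1:
# (11 - 1) * (4 / 5) * 4000 = 32000 GB
draid_capacity 11 1 4 1 4000
```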

draid[<parity>][:<data>d][:<children>c][:<spares>s]
A non-default dRAID configuration can be specified by appending one or more of the following
optional arguments to the draid keyword:
parity The parity level (1-3).
data The number of data devices per redundancy group. In general, a smaller value of D
will increase IOPS, improve the compression ratio, and speed up resilvering at the
expense of total usable capacity. Defaults to 8, unless N-P-S is less than 8.
children The expected number of children. Useful as a cross-check when listing a large number
of devices. An error is returned when the provided number of children differs.
spares The number of distributed hot spares. Defaults to zero.
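To make the syntax concrete, the sketch below assembles a dRAID vdev specification from the four optional arguments. The helper name draid_spec is hypothetical; it only builds the string and does not call zpool.

```shell
#!/bin/sh
# Build a dRAID vdev specification of the form
# draid<parity>:<data>d:<children>c:<spares>s
# draid_spec is a hypothetical helper, not part of the ZFS tools.
draid_spec() {
    parity=$1 data=$2 children=$3 spares=$4
    # The text limits the parity level to 1-3.
    case $parity in
        1|2|3) ;;
        *) echo "parity must be 1-3, got $parity" >&2; return 1 ;;
    esac
    echo "draid${parity}:${data}d:${children}c:${spares}s"
}

# Double parity, 4 data devices per group, 11 children, 1 distributed spare.
draid_spec 2 4 11 1   # prints draid2:4d:11c:1s
```

The resulting string takes the place of the draid keyword on the zpool create command line and is followed by the list of child devices.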

spare A pseudo-vdev which keeps track of available hot spares for a pool. For more information, see
the “Hot Spares” section.

log A separate intent log device. If more than one log device is specified, then writes are load-
balanced between devices. Log devices can be mirrored. However, raidz vdev types are not
supported for the intent log. For more information, see the “Intent Log” section.

dedup A device dedicated solely for deduplication tables. The redundancy of this device should match
the redundancy of the other normal devices in the pool. If more than one dedup device is
specified, then allocations are load-balanced between those devices.

special A device dedicated solely for allocating various kinds of internal metadata, and optionally
small file blocks. The redundancy of this device should match the redundancy of the other
normal devices in the pool. If more than one special device is specified, then allocations are
load-balanced between those devices.

For more information on special allocations, see the “Special Allocation Class” section.

cache A device used to cache storage pool data. A cache device cannot be configured as a mirror or
raidz group. For more information, see the “Cache Devices” section.

Virtual devices cannot be nested, so a mirror or raidz virtual device can only contain files or disks.
Mirrors of mirrors (or other combinations) are not allowed.

A pool can have any number of virtual devices at the top of the configuration (known as "root vdevs").
Data is dynamically distributed across all top-level devices to balance data among devices. As new
virtual devices are added, ZFS automatically places data on the newly available devices.

Virtual devices are specified one at a time on the command line, separated by whitespace. Keywords like
mirror and raidz are used to distinguish where a group ends and another begins. For example, the
following creates a pool with two root vdevs, each a mirror of two disks:
# zpool create mypool mirror sda sdb mirror sdc sdd
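The grouping rule can be illustrated with a short sketch that splits a vdev specification into its groups. It is a simplified model, not the parser used by zpool: the function name list_top_level_vdevs is hypothetical, and only the mirror and raidz keywords are recognized (draid, spare, log, and the other keywords are deliberately left out).

```shell
#!/bin/sh
# Print one line per top-level vdev for a vdev specification passed as
# arguments. Simplified model: only mirror and raidz keywords are handled.
list_top_level_vdevs() {
    group=""
    for word in "$@"; do
        case $word in
            mirror|raidz|raidz1|raidz2|raidz3)
                # A keyword closes the previous group (if any) and opens a new one.
                [ -n "$group" ] && echo "$group"
                group="$word:"
                ;;
            *)
                if [ -n "$group" ]; then
                    # Devices after a keyword belong to that group.
                    group="$group $word"
                else
                    # A device before any keyword is a top-level vdev on its own.
                    echo "disk: $word"
                fi
                ;;
        esac
    done
    [ -n "$group" ] && echo "$group"
    return 0
}

# The specification from the example above yields two mirror groups.
list_top_level_vdevs mirror sda sdb mirror sdc sdd
```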

Device Failure and Recovery
ZFS supports a rich set of mechanisms for handling device failure and data corruption. All metadata and
data is checksummed, and ZFS automatically repairs bad data from a good copy when corruption is detected.

In order to take advantage of these features, a pool must make use of some form of redundancy, using
either mirrored or raidz groups. While ZFS supports running in a non-redundant configuration, where each
root vdev is simply a disk or file, this is strongly discouraged. A single case of bit corruption can
render some or all of your data unavailable.

A pool's health status is described by one of three states: online, degraded, or faulted. An online pool
has all devices operating normally. A degraded pool is one in which one or more devices have failed, but
the data is still available due to a redundant configuration. A faulted pool has corrupted metadata, or
one or more faulted devices, and insufficient replicas to continue functioning.
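The relationship between failed devices and health can be sketched for a single redundant group. This is a deliberately simplified model that looks only at the number of failed devices and ignores corrupted metadata; the function name group_health is hypothetical. The tolerance is N-1 for a mirror of N disks and P for a raidz or dRAID group with parity level P, as stated earlier.

```shell
#!/bin/sh
# Health of one redundant group, given how many device failures it
# tolerates and how many devices have actually failed.
# Simplified model: corrupted metadata and other conditions are ignored.
group_health() {
    tolerated=$1 failed=$2
    if [ "$failed" -eq 0 ]; then
        echo online
    elif [ "$failed" -le "$tolerated" ]; then
        echo degraded
    else
        echo faulted
    fi
}

group_health 1 0   # two-way mirror, nothing failed: online
group_health 2 2   # raidz2 with two failed disks: degraded
group_health 2 3   # raidz2 with three failed disks: faulted
```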

The health of the top-level vdev, such as a mirror or raidz device, is potentially impacted by the state
of its associated vdevs, or component devices. A top-level vdev or component device is in one of the
following states:

DEGRADED One or more top-level vdevs is in the degraded state because one or more component devices are
offline. Sufficient replicas exist to continue functioning.

One or more component devices is in the degraded or faulted state, but sufficient replicas
exist to continue functioning. The underlying conditions are as follows:
• The number of checksum errors exceeds acceptable levels and the device is degraded as an
indication that something may be wrong. ZFS continues to use the device as necessary.
• The number of I/O errors exceeds acceptable levels. The device could not be marked as
faulted because there are insufficient replicas to continue functioning.

FAULTED One or more top-level vdevs is in the faulted state because one or more component devices are
offline. Insufficient replicas exist to continue functioning.

One or more component devices is in the faulted state, and insufficient replicas exist to
continue functioning. The underlying conditions are as follows:
• The device could be opened, but the contents did not match expected values.
• The number of I/O errors exceeds acceptable levels and the device is faulted to prevent
further use of the device.

OFFLINE The device was explicitly taken offline by the zpool offline command.

ONLINE The device is online and functioning.

REMOVED The device was physically removed while the system was running. Device removal detection is
hardware-dependent and may not be supported on all platforms.

UNAVAIL The device could not be opened. If a pool is imported when a device was unavailable, then the
device will be identified by a unique identifier instead of its path since the path was never
correct in the first place.

Checksum errors represent events where a disk returned data that was expected to be correct, but was not.
In other words, these are instances of silent data corruption. The checksum errors are reported in zpool
status and zpool events. When a block is stored redundantly, a damaged block may be reconstructed (e.g.
from raidz parity or a mirrored copy). In this case, ZFS reports the checksum error against the disks
that contained damaged data. If a block cannot be reconstructed (e.g. because 3 disks are damaged
in a raidz2 group), it is not possible to determine which disks were silently corrupted. In this case,
checksum errors are reported for all disks on which the block is stored.

If a device is removed and later re-attached to the system, ZFS attempts to bring the device online automatically.
Device attachment detection is hardware-dependent and might not be supported on all platforms.

Hot Spares
ZFS allows devices to be associated with pools as "hot spares". These devices are not actively used in
the pool, but when an active device fails, it is automatically replaced by a hot spare. To create a pool
with hot spares, specify a spare vdev with any number of devices. For example,
# zpool create pool mirror sda sdb spare sdc sdd

Spares can be shared across multiple pools, and can be added with the zpool add command and removed with
the zpool remove command. Once a spare replacement is initiated, a new spare vdev is created within the
configuration that will remain there until the original device is replaced. At this point, the hot spare
becomes available again if another device fails.

If a pool has a shared spare that is currently being used, the pool cannot be exported since other pools
may use this shared spare, which may lead to potential data corruption.

Shared spares add some risk. If the pools are imported on different hosts, and both pools suffer a
device failure at the same time, both could attempt to use the spare at the same time. This may not be
detected, resulting in data corruption.

An in-progress spare replacement can be cancelled by detaching the hot spare. If the original faulted
device is detached, then the hot spare assumes its place in the configuration, and is removed from the
spare list of all active pools.

The draid vdev type provides distributed hot spares. These hot spares are named after the dRAID vdev
they're a part of (draid1-2-3 specifies spare 3 of vdev 2, which is a single parity dRAID) and may only
be used by that dRAID vdev. Otherwise, they behave the same as normal hot spares.
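The naming scheme can be taken apart with plain shell parameter expansion. The sketch below assumes names of exactly the form shown in the example (draid, parity level, vdev number, spare number, separated by dashes); the function name parse_draid_spare is hypothetical.

```shell
#!/bin/sh
# Split a distributed spare name such as draid1-2-3 into its parts:
# parity level, index of the dRAID vdev, and index of the spare.
parse_draid_spare() {
    name=$1
    case $name in
        draid[1-3]-*-*) ;;
        *) echo "not a distributed spare name: $name" >&2; return 1 ;;
    esac
    rest=${name#draid}    # e.g. "1-2-3"
    parity=${rest%%-*}    # text before the first dash
    rest=${rest#*-}       # e.g. "2-3"
    vdev=${rest%%-*}      # text before the next dash
    spare=${rest#*-}      # remaining text
    echo "parity=$parity vdev=$vdev spare=$spare"
}

parse_draid_spare draid1-2-3   # prints parity=1 vdev=2 spare=3
```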

Spares cannot replace log devices.

Intent Log
The ZFS Intent Log (ZIL) satisfies POSIX requirements for synchronous transactions. For instance,
databases often require their transactions to be on stable storage devices when returning from a system
call. NFS and other applications can also use fsync(2) to ensure data stability. By default, the intent
log is allocated from blocks within the main pool. However, it might be possible to get better
performance using separate intent log devices such as NVRAM or a dedicated disk. For example:
# zpool create pool sda sdb log sdc

Multiple log devices can also be specified, and they can be mirrored. See the “EXAMPLES” section for an
example of mirroring multiple log devices.

Log devices can be added, replaced, attached, detached and removed. In addition, log devices are
imported and exported as part of the pool that contains them. Mirrored devices can be removed by
specifying the top-level mirror vdev.

Cache Devices
Devices can be added to a storage pool as "cache devices". These devices provide an additional layer of
caching between main memory and disk. For read-heavy workloads, where the working set size is much
larger than what can be cached in main memory, using cache devices allows much more of this working set
to be served from low latency media. Using cache devices provides the greatest performance improvement
for random-read workloads of mostly static content.

To create a pool with cache devices, specify a cache vdev with any number of devices. For example:
# zpool create pool sda sdb cache sdc sdd

Cache devices cannot be mirrored or part of a raidz configuration. If a read error is encountered on a
cache device, that read I/O is reissued to the original storage pool device, which might be part of a
mirrored or raidz configuration.

The content of the cache devices is persistent across reboots and is restored asynchronously into the
L2ARC when the pool is imported (persistent L2ARC). This can be disabled by setting
l2arc_rebuild_enabled=0. For cache devices smaller than 1GB, ZFS does not write the metadata structures
required for rebuilding the L2ARC, in order not to waste space. This threshold can be changed with
l2arc_rebuild_blocks_min_l2size. The cache device header (512B) is updated even if no metadata
structures are written. Setting l2arc_headroom=0 will result in scanning the full-length ARC lists for
cacheable content to be written in L2ARC (persistent ARC). If a cache device is added with zpool add,
its label and header will be overwritten and its contents will not be restored in L2ARC, even if the
device was previously part of the pool. If a cache device is onlined with zpool online, its contents
will be restored in L2ARC. This is useful when memory pressure has prevented the contents of the cache
device from being fully restored in L2ARC: the user can offline and then online the cache device once
memory pressure has eased in order to fully restore its contents to L2ARC.
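The size threshold described above can be expressed as a simple check. This is a sketch under stated assumptions: the function name l2arc_writes_rebuild_metadata is hypothetical, and the 1GB default is taken here to mean 1 GiB (1073741824 bytes). The optional second argument stands in for a changed l2arc_rebuild_blocks_min_l2size value.

```shell
#!/bin/sh
# Assumption: the default threshold of "1GB" is interpreted as 1 GiB.
L2ARC_MIN_L2SIZE=$((1024 * 1024 * 1024))

# Succeeds if a cache device of the given size (in bytes) is large enough
# for the rebuild metadata structures to be written to it.
# The optional second argument overrides the threshold.
l2arc_writes_rebuild_metadata() {
    size_bytes=$1
    min_bytes=${2:-$L2ARC_MIN_L2SIZE}
    [ "$size_bytes" -ge "$min_bytes" ]
}

# A 512 MiB cache device falls below the default threshold.
if l2arc_writes_rebuild_metadata $((512 * 1024 * 1024)); then
    echo "rebuild metadata written"
else
    echo "rebuild metadata not written"
fi
```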

Pool checkpoint
Before starting critical procedures that include destructive actions (like zfs destroy), an administrator
can checkpoint the pool's state and in the case of a mistake or failure, rewind the entire pool back to
the checkpoint. Otherwise, the checkpoint can be discarded when the procedure has completed
successfully.

A pool checkpoint can be thought of as a pool-wide snapshot and should be used with care as it contains
every part of the pool's state, from properties to vdev configuration. Thus, certain operations are not
allowed while a pool has a checkpoint: specifically, vdev removal/attach/detach, mirror splitting, and
changing the pool's GUID. Adding a new vdev is supported, but in the case of a rewind it will have to be
added again. Finally, users of this feature should keep in mind that scrubs in a pool that has a
checkpoint do not repair checkpointed data.

To create a checkpoint for a pool:
# zpool checkpoint pool

To later rewind the pool to its checkpointed state, first export it and then rewind it during import:
# zpool export pool
# zpool import --rewind-to-checkpoint pool

To discard the checkpoint from a pool:
# zpool checkpoint -d pool

Dataset reservations (controlled by the reservation and refreservation properties) may be unenforceable
while a checkpoint exists, because the checkpoint is allowed to consume the dataset's reservation.
In addition, data that is part of the checkpoint but has been freed in the current state of the pool won't be
scanned during a scrub.

Special Allocation Class
Allocations in the special class are dedicated to specific block types. By default this includes all
metadata, the indirect blocks of user data, and any deduplication tables. The class can also be
provisioned to accept small file blocks.

A pool must always have at least one normal (non-dedup/-special) vdev before other devices can be
assigned to the special class. If the special class becomes full, then allocations intended for it will
spill back into the normal class.

Deduplication tables can be excluded from the special class by unsetting the zfs_ddt_data_is_special ZFS
module parameter.

Inclusion of small file blocks in the special class is opt-in. Each dataset can control the size of
small file blocks allowed in the special class by setting the special_small_blocks property to a nonzero value.
See zfsprops(7) for more info on this property.
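The opt-in rule can be sketched as a predicate on the block size. This is an illustration only: the function name goes_to_special is hypothetical, and the comparison is assumed to be inclusive (blocks smaller than or equal to the property value qualify); zfsprops(7) remains the authority on the exact semantics.

```shell
#!/bin/sh
# Succeeds if a file block of block_bytes would qualify for the special
# class under a given special_small_blocks value (in bytes).
# Assumption: the threshold is inclusive; zero leaves the feature disabled.
goes_to_special() {
    block_bytes=$1 special_small_blocks=$2
    [ "$special_small_blocks" -gt 0 ] && [ "$block_bytes" -le "$special_small_blocks" ]
}

# With the property set to 32768 (32K), a 16K block qualifies and a 64K block does not.
if goes_to_special 16384 32768; then echo special; else echo normal; fi
if goes_to_special 65536 32768; then echo special; else echo normal; fi
```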

OpenZFS June 2, 2021 ZPOOLCONCEPTS(7)