Provided by: ocfs2-tools_1.8.7-1build4_amd64

NAME

       OCFS2 - A Shared-Disk Cluster File System for Linux

INTRODUCTION

       OCFS2  is a file system. It allows users to store and retrieve data. The data is stored in files that are
       organized in a hierarchical directory tree. It is  a  POSIX  compliant  file  system  that  supports  the
       standard interfaces and the behavioral semantics as spelled out by that specification.

       It  is  also a shared disk cluster file system, one that allows multiple nodes to access the same disk at
       the same time. This is where the fun begins as allowing a file system to be accessible on multiple  nodes
       opens a can of worms. What if the nodes are of different architectures? What if a node dies while writing
       to  the  file  system?  What  data  consistency  can one expect if processes on two nodes are reading and
       writing concurrently? What if one node removes a file while it is still being used on another node?

       Unlike most shared file systems where the answer is fuzzy, the answer in OCFS2 is very well  defined.  It
       behaves  on  all  nodes  exactly  like  a local file system. If a file is removed, the directory entry is
       removed but the inode is kept as long as it is in use across the cluster. When the last user  closes  the
       descriptor, the inode is marked for deletion.

       The  data consistency model follows the same principle. It works as if the two processes that are running
       on two different nodes are running on the same node. A read on a node gets the last write irrespective of
       the IO mode used. The modes can be buffered, direct, asynchronous, splice or memory  mapped  IOs.  It  is
       fully cache coherent.

       Take  for  example  the  REFLINK  feature that allows a user to create multiple write-able snapshots of a
       file. This feature, like all others, is fully cluster-aware. A file being written to  on  multiple  nodes
       can be safely reflinked on another node. The snapshot created is a point-in-time image of the file that
       includes both the file data and all its attributes (including extended attributes).

       It is a journaling file system. When a node dies, a surviving node transparently replays the  journal  of
       the  dead  node.  This  ensures  that  the file system metadata is always consistent. It also defaults to
       ordered data journaling to ensure the file data is flushed to disk before the journal commit,  to  remove
       the small possibility of stale data appearing in files after a crash.

       It  is  architecture  and  endian neutral. It allows concurrent mounts on nodes with different processors
       like x86, x86_64, IA64 and PPC64. It handles little and big endian, 32-bit and 64-bit architectures.

       It is feature rich. It supports indexed directories, metadata checksums, extended attributes, POSIX ACLs,
       quotas, REFLINKs, sparse files, unwritten extents and inline-data.

       It is fully integrated with the mainline Linux kernel. The file  system  was  merged  into  Linux  kernel
       2.6.16 in early 2006.

       It is quickly installed. It is available with almost all Linux distributions.  The file system is on-disk
       compatible across all of them.

       It  is modular. The file system can be configured to operate with other cluster stacks like Pacemaker and
       CMAN along with its own stack, O2CB.

       It is easily configured. The O2CB cluster stack configuration involves editing two files, one for cluster
       layout and the other for cluster timeouts.

       It is very efficient. The file system consumes very few resources. It is used to store virtual machine
       images in limited memory environments like Xen and KVM.

       In summary, OCFS2 is an efficient, easily configured, modular, quickly installed,  fully  integrated  and
       compatible,  feature-rich,  architecture  and  endian  neutral,  cache coherent, ordered data journaling,
       POSIX-compliant, shared disk cluster file system.

OVERVIEW

       OCFS2 is a general-purpose shared-disk cluster file system for  Linux  capable  of  providing  both  high
       performance and high availability.

       As  it  provides local file system semantics, it can be used with almost all applications.  Cluster-aware
       applications can make use of cache-coherent parallel I/Os from multiple nodes to scale  out  applications
       easily.  Other applications can make use of the clustering facilities to fail over a running application
       in the event of a node failure.

       The notable features of the file system are:

       Tunable Block size
              The file system supports block  sizes  of  512,  1K,  2K  and  4K  bytes.  4KB  is  almost  always
              recommended. This feature is available in all releases of the file system.

       Tunable Cluster size
              A  cluster  size is also referred to as an allocation unit. The file system supports cluster sizes
              of 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K and 1M bytes. For most use cases, 4KB  is  recommended.
              However,  a  larger value is recommended for volumes hosting mostly very large files like database
              files, virtual machine images, etc. A large cluster size allows the file  system  to  store  large
              files more efficiently. This feature is available in all releases of the file system.
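
               For illustration, the mkfs.ocfs2(8) invocation below (label, sizes and device are placeholders)
               formats a volume with a 4KB block size and a 1MB cluster size suitable for large files:

               # mkfs.ocfs2 -b 4K -C 1M -L "vmstore" /dev/sdc1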

       Endian and Architecture neutral
               The file system can be mounted concurrently on nodes having different architectures, such as
               32-bit, 64-bit, little-endian (x86, x86_64, ia64) and big-endian (ppc64, s390x).  This feature is
              available in all releases of the file system.

       Buffered, Direct, Asynchronous, Splice and Memory Mapped I/O modes
               The file system supports all modes of I/O for maximum flexibility and performance.  It also
               supports cluster-wide shared writeable mmap(2). The support for buffered, direct and asynchronous
               I/O is available in all releases. The support for splice I/O was added in Linux kernel 2.6.20 and
               for shared writeable mmap(2) in 2.6.23.

       Multiple Cluster Stacks
               The file system includes a flexible framework to allow it to function with userspace cluster
               stacks like Pacemaker (pcmk) and CMAN (cman), with its own in-kernel cluster stack o2cb, or with
               no cluster stack at all (local mount).

              The support for o2cb cluster stack is available in all releases.

              The support for no cluster stack, or local mount, was added in Linux kernel 2.6.20.

              The support for userspace cluster stack was added in Linux kernel 2.6.26.

       Journaling
              The file system supports both ordered (default) and writeback data  journaling  modes  to  provide
              file  system  consistency  in  the  event of power failure or system crash.  It uses JBD2 in Linux
              kernel 2.6.28 and later. It used JBD in earlier kernels.

       Extent-based Allocations
              The file system allocates and tracks space in ranges of clusters. This is unlike block based  file
              systems  that  have  to track each and every block. This feature allows the file system to be very
              efficient when dealing with both large volumes and large files.  This feature is available in  all
              releases of the file system.

       Sparse files
              Sparse  files  are  files  with  holes. With this feature, the file system delays allocating space
              until a write is issued to a cluster. This feature was added in Linux kernel 2.6.22  and  requires
              enabling on-disk feature sparse.

       Unwritten Extents
              An  unwritten  extent  is  also  referred  to  as user pre-allocation. It allows an application to
              request a range of clusters to be allocated, but not initialized, within a  file.   Pre-allocation
              allows  the file system to optimize the data layout with fewer, larger extents. It also provides a
              performance boost, delaying initialization until the user writes to the clusters. This feature was
              added in Linux kernel 2.6.23 and requires enabling on-disk feature unwritten.
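
               Assuming the running kernel and util-linux support fallocate(2) on OCFS2, such a pre-allocation
               can be requested from the command line (file name and size are placeholders):

               $ fallocate -l 1G prealloc.dat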

       Hole Punching
              Hole punching allows an application to remove arbitrary allocated regions within a file.  Creating
              holes,  essentially.  This  is  more  efficient  than  zeroing  the same extents.  This feature is
              especially useful in virtualized environments as it allows a block discard in a guest file  system
              to  be  converted to a hole punch in the host file system thus allowing users to reduce disk space
              usage. This feature was added in Linux kernel 2.6.23 and requires enabling on-disk features sparse
              and unwritten.
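
               Assuming the running kernel supports hole punching via fallocate(2) on OCFS2, a hole can be
               punched from the command line (offset, length and file name are placeholders):

               $ fallocate --punch-hole --offset 4M --length 1M vmimage.img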

       Inline-data
              Inline data is also referred to as data-in-inode as it allows storing small files and  directories
              in  the  inode  block.  This  not  only  saves  space but also has a positive impact on cold-cache
              directory and file operations. The data is transparently moved out to an extent when it no  longer
              fits  inside  the inode block. This feature was added in Linux kernel 2.6.24 and requires enabling
              on-disk feature inline-data.

       REFLINK
              REFLINK is also referred to as fast copy. It allows  users  to  atomically  (and  instantly)  copy
              regular files. In other words, create multiple writeable snapshots of regular files.  It is called
              REFLINK  because it looks and feels more like a (hard) link(2) than a traditional snapshot. Like a
              link, it is a regular user operation, subject to  the  security  attributes  of  the  inode  being
              reflinked  and  not  to  the super user privileges typically required to create a snapshot. Like a
              link, it operates within a file system. But unlike a link, it links the inodes at the data  extent
              level  allowing  each  reflinked  inode  to  grow independently as and when written to. Up to four
              billion inodes can share a data extent.  This  feature  was  added  in  Linux  kernel  2.6.32  and
              requires enabling on-disk feature refcount.
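
               A snapshot can be created with the reflink(1) utility described in the REFLINK OPERATION section
               below (the file names are placeholders):

               $ reflink myfile myfile-snap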

       Allocation Reservation
              File  contiguity  plays an important role in file system performance. When a file is fragmented on
              disk, reading and writing to the file involves many seeks, leading to lower throughput. Contiguous
              files, on the other hand, minimize seeks, allowing the disks to perform IO at the maximum rate.

              With allocation reservation, the file system reserves a window in the  bitmap  for  all  extending
              files  allowing  each  to  grow  as  contiguously as possible. As this extra space is not actually
              allocated, it is available for use by other files if the need arises.  This feature was  added  in
              Linux kernel 2.6.35 and can be tuned using the mount option resv_level.
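
               A minimal illustrative mount (device and mount point are placeholders), assuming resv_level
               accepts levels 0 through 8 with a default of 2:

               # mount -o resv_level=4 /dev/sda1 /ocfs2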

       Indexed Directories
              An indexed directory allows users to perform quick lookups of a file in very large directories. It
              also  results  in  faster  creates  and unlinks and thus provides better overall performance. This
              feature was added in Linux kernel 2.6.30 and requires enabling on-disk feature indexed-dirs.

       File Attributes
              This refers to EXT2-style file attributes, such as immutable, modified using chattr(1) and queried
              using lsattr(1). This feature was added in Linux kernel 2.6.19.
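
               For example, a file can be marked immutable and its attributes queried as follows (the file name
               is a placeholder):

               # chattr +i /ocfs2/config.lock
               # lsattr /ocfs2/config.lock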

       Extended Attributes
               An extended attribute refers to a name:value pair that can be associated with file system objects
              like regular files, directories, symbolic links, etc. OCFS2 allows associating an unlimited number
              of  attributes per object. The attribute names can be up to 255 bytes in length, terminated by the
              first NUL character. While it is not  required,  printable  names  (ASCII)  are  recommended.  The
              attribute values can be up to 64 KB of arbitrary binary data. These attributes can be modified and
              listed using standard Linux utilities setfattr(1) and getfattr(1). This feature was added in Linux
              kernel 2.6.29 and requires enabling on-disk feature xattr.
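
               For example, an attribute can be set and listed as follows (the attribute name, value and file
               name are placeholders):

               $ setfattr -n user.origin -v "imported" report.txt
               $ getfattr -d report.txt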

       Metadata Checksums
              This  feature  allows  the  file  system  to detect silent corruptions in all metadata blocks like
              inodes and directories. This feature was added in Linux kernel 2.6.29 and  requires  enabling  on-
              disk feature metaecc.

       POSIX ACLs and Security Attributes
               POSIX ACLs allow assigning fine-grained discretionary access rights for files and directories.
               This security scheme is a lot more flexible than the traditional file access permissions that
               impose a strict user-group-other model.

              Security  attributes  allow the file system to support other security regimes like SELinux, SMACK,
              AppArmor, etc.

               Both these security extensions were added in Linux kernel 2.6.29 and require enabling on-disk
               feature xattr.

       User and Group Quotas
               This feature allows setting up usage quotas on a per-user and per-group basis using the standard
               utilities like quota(1), setquota(8), quotacheck(8), and quotaon(8). This feature was added in
              Linux kernel 2.6.29 and requires enabling on-disk features usrquota and grpquota.
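
               For illustration (user name, limits and mount point are placeholders), a block quota can be set
               with setquota(8) on a volume mounted with quotas enabled:

               # setquota -u jeff 1000000 1048576 0 0 /ocfs2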

       Unix File Locking
              The  Unix  operating system has historically provided two system calls to lock files.  flock(2) or
              BSD locking and fcntl(2) or POSIX locking. OCFS2 extends both file  locks  to  the  cluster.  File
              locks taken on one node interact with those taken on other nodes.

              The  support  for  clustered  flock(2) was added in Linux kernel 2.6.26.  All flock(2) options are
               supported, including the kernel's ability to cancel a lock request when an appropriate kill signal
              is received by the user. This feature is supported with all cluster-stacks including o2cb.
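
               For illustration (the lock file and command are placeholders), a cluster-wide flock(2) can be
               taken from the shell with the flock(1) utility:

               $ flock /ocfs2/app.lock -c "run-batch-job"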

              The  support  for  clustered  fcntl(2)  was added in Linux kernel 2.6.28.  But because it requires
              group communication to make the locks coherent,  it  is  only  supported  with  userspace  cluster
               stacks, pcmk and cman, and not with the default cluster stack o2cb.

       Comprehensive Tools Support
              The  file  system  has a comprehensive EXT3-style toolset that tries to use similar parameters for
              ease-of-use. It includes mkfs.ocfs2(8) (format), tunefs.ocfs2(8)  (tune),  fsck.ocfs2(8)  (check),
              debugfs.ocfs2(8) (debug), etc.

       Online Resize
              The  file  system  can be dynamically grown using tunefs.ocfs2(8). This feature was added in Linux
              kernel 2.6.25.

RECENT CHANGES

       The O2CB cluster stack has a global heartbeat mode. It allows users to specify heartbeat regions that are
       consistent across all nodes. The cluster stack also allows online addition and removal of both nodes  and
       heartbeat regions.

        o2cb(8) is the new cluster configuration utility. It is an easy-to-use utility that allows users to
        create the cluster configuration on a node that is not part of the cluster. It replaces the older
        utility o2cb_ctl(8), which has been deprecated.

       ocfs2console(8) has been obsoleted.

        o2info(1) is a new utility that can be used to provide file system information.  It allows non-privileged
       users to see the enabled file system features, block and cluster sizes, extended file  stat,  free  space
       fragmentation, etc.

        o2hbmonitor(8) is an o2hb heartbeat monitor. It is an extremely lightweight utility that logs messages to
        the system logger once the heartbeat delay exceeds the warn threshold. This utility is useful in
        identifying volumes encountering I/O delays.

        debugfs.ocfs2(8) has some new commands. net_stats shows the o2net message times between various nodes.
        This is useful in identifying nodes that are slowing down cluster operations. stat_sysdir allows the
        user to dump the entire system directory, which can be used to debug issues. grpextents dumps the
        complete free space fragmentation in the cluster group allocator.
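
        As an illustration (the device path is a placeholder), these commands can be run non-interactively with
        the -R option:

        # debugfs.ocfs2 -R "net_stats" /dev/sda1
        # debugfs.ocfs2 -R "stat_sysdir" /dev/sda1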

       mkfs.ocfs2(8)  now  enables xattr, indexed-dirs, discontig-bg, refcount, extended-slotmap and clusterinfo
       feature flags by default, in addition to the older defaults, sparse, unwritten and inline-data.

        mount.ocfs2(8) allows users to specify the level of cache coherency between nodes.  By default the file
        system operates in full coherency mode, which also serializes the direct I/Os. While this mode is
        technically correct, it limits the I/O throughput in a clustered database. This mount option allows the
        user to limit the cache coherency to only the buffered I/Os to allow multiple nodes to do concurrent
        direct writes to the same file. This feature works with Linux kernel 2.6.37 and later.
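
        A minimal illustrative invocation (device and mount point are placeholders), assuming the mount option
        is named coherency with values full and buffered:

        # mount -o coherency=buffered /dev/sdd1 /dbvol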

COMPATIBILITY

        The OCFS2 development team goes to great lengths to maintain compatibility. It attempts to maintain both
       on-disk  and network protocol compatibility across all releases of the file system. It does so even while
       adding new features that entail on-disk format and network protocol changes. To do this successfully,  it
       follows a few rules:

           1.  The  on-disk  format changes are managed by a set of feature flags that can be turned on and off.
           The file system in kernel detects these features during mount and continues only  if  it  understands
           all  the  features.  Users  encountering  this  have  the  option of either disabling that feature or
           upgrading the file system to a newer release.

            2. The latest release of ocfs2-tools is compatible with all versions of the file system. All
            utilities detect the features enabled on disk and continue only if they understand all the features.
            Users encountering this have to upgrade the tools to a newer release.

           3. The network protocol version is negotiated by the nodes to ensure all nodes understand the  active
           protocol version.

       FEATURE FLAGS
              The feature flags are split into three categories, namely, Compat, Incompat and RO Compat.

              Compat,  or  compatible,  is  a  feature that the file system does not need to fully understand to
              safely read/write to the volume. An example of this is the backup-super  feature  that  added  the
              capability to backup the super block in multiple locations in the file system. As the backup super
              blocks  are  typically not read nor written to by the file system, an older file system can safely
              mount a volume with this feature enabled.

              Incompat, or incompatible, is a feature  that  the  file  system  needs  to  fully  understand  to
              read/write to the volume. Most features fall under this category.

              RO Compat, or read-only compatible, is a feature that the file system needs to fully understand to
              write to the volume. Older software can safely read a volume with this feature enabled. An example
              of  this  would  be  user and group quotas. As quotas are manipulated only when the file system is
              written to, older software can safely mount such volumes in read-only mode.

               The list of feature flags, the kernel version in which each was added, the earliest version of the
               tools that understands it, its category, and its hex value is as follows:

                     ┌──────────────────────┬────────────────┬─────────────────┬───────────┬───────────┐
                      │ Feature Flags        │ Kernel Version │  Tools Version  │ Category  │ Hex Value │
                     ├──────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
                     │ backup-super         │      All       │ ocfs2-tools 1.2 │  Compat   │     1     │
                     ├──────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
                     │ strict-journal-super │      All       │       All       │  Compat   │     2     │
                     ├──────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
                     │ local                │  Linux 2.6.20  │ ocfs2-tools 1.2 │ Incompat  │     8     │
                     ├──────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
                     │ sparse               │  Linux 2.6.22  │ ocfs2-tools 1.4 │ Incompat  │    10     │
                     ├──────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
                     │ inline-data          │  Linux 2.6.24  │ ocfs2-tools 1.4 │ Incompat  │    40     │
                     ├──────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
                     │ extended-slotmap     │  Linux 2.6.27  │ ocfs2-tools 1.6 │ Incompat  │    100    │
                     ├──────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
                     │ xattr                │  Linux 2.6.29  │ ocfs2-tools 1.6 │ Incompat  │    200    │
                     ├──────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
                     │ indexed-dirs         │  Linux 2.6.30  │ ocfs2-tools 1.6 │ Incompat  │    400    │
                     ├──────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
                     │ metaecc              │  Linux 2.6.29  │ ocfs2-tools 1.6 │ Incompat  │    800    │
                     ├──────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
                     │ refcount             │  Linux 2.6.32  │ ocfs2-tools 1.6 │ Incompat  │   1000    │
                     ├──────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
                     │ discontig-bg         │  Linux 2.6.35  │ ocfs2-tools 1.6 │ Incompat  │   2000    │
                     ├──────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
                     │ clusterinfo          │  Linux 2.6.37  │ ocfs2-tools 1.8 │ Incompat  │   4000    │
                     ├──────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
                     │ unwritten            │  Linux 2.6.23  │ ocfs2-tools 1.4 │ RO Compat │     1     │
                     ├──────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
                     │ grpquota             │  Linux 2.6.29  │ ocfs2-tools 1.6 │ RO Compat │     2     │
                     ├──────────────────────┼────────────────┼─────────────────┼───────────┼───────────┤
                     │ usrquota             │  Linux 2.6.29  │ ocfs2-tools 1.6 │ RO Compat │     4     │
                     └──────────────────────┴────────────────┴─────────────────┴───────────┴───────────┘

              To query the features enabled on a volume, do:

              $ o2info --fs-features /dev/sdf1
              backup-super strict-journal-super sparse extended-slotmap inline-data xattr
              indexed-dirs refcount discontig-bg clusterinfo unwritten

       ENABLING AND DISABLING FEATURES

               The format utility, mkfs.ocfs2(8), allows a user to enable and disable specific features using the
               --fs-features option. The features are provided as a comma separated list. The enabled features
               are listed as is. The disabled features are prefixed with no.  The example below shows the file
               system being formatted with sparse disabled and inline-data enabled.

              # mkfs.ocfs2 --fs-features=nosparse,inline-data /dev/sda1

               After formatting, the users can toggle features using the tune utility, tunefs.ocfs2(8).  This is
               an offline operation. The volume needs to be unmounted across the cluster.  The example below
               shows the sparse feature being enabled and inline-data disabled.

              # tunefs.ocfs2 --fs-features=sparse,noinline-data /dev/sda1

               Care should be taken before enabling and disabling features. Users planning to use a volume with
               an older version of the file system will be better off not enabling newer features, as disabling
               them later may not succeed.

              An example would be disabling the sparse feature; this requires filling every hole.  The operation
              can only succeed if the file system has enough free space.

       DETECTING FEATURE INCOMPATIBILITY

              Say  one  tries  to  mount  a volume with an incompatible feature. What happens then? How does one
              detect the problem? How does one know the name of that incompatible feature?

              To begin with, one should look for error messages in dmesg(8). Mount failures that are due  to  an
              incompatible feature will always result in an error message like the following:

              ERROR: couldn't mount because of unsupported optional features (200).

               Here the file system is unable to mount the volume due to an unsupported optional feature. That
               means the feature is an Incompat feature. By referring to the table above, one can then deduce
               that the user failed to mount a volume with the xattr feature enabled. (The value in the error
               message is in hexadecimal.)

              Another example of an error message due to incompatibility is as follows:

              ERROR: couldn't mount RDWR because of unsupported optional features (1).

               Here the file system is unable to mount the volume in the RW mode. That means the feature is a RO
               Compat feature. Another look at the table and it becomes apparent that the volume had the
               unwritten feature enabled.

              In  both cases, the user has the option of disabling the feature. In the second case, the user has
              the choice of mounting the volume in the RO mode.

GETTING STARTED

       The OCFS2 software is split into two components, namely, kernel and tools. The kernel component  includes
       the core file system and the cluster stack, and is packaged along with the kernel. The tools component is
       packaged  as  ocfs2-tools  and needs to be specifically installed. It provides utilities to format, tune,
       mount, debug and check the file system.

        To install ocfs2-tools, refer to the package handling utility of your distribution.
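
        On Debian or Ubuntu based systems, for example, the package can be installed as follows; other
        distributions use their own package managers:

        # apt-get install ocfs2-tools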

       The next step is selecting a cluster stack. The options include:

           A. No cluster stack, or local mount.

           B. In-kernel o2cb cluster stack with local or global heartbeat.

           C. Userspace cluster stacks pcmk or cman.

       The file system allows changing cluster stacks easily using tunefs.ocfs2(8).  To list the cluster  stacks
       stamped on the OCFS2 volumes, do:

       # mounted.ocfs2 -d
       Device     Stack  Cluster     F  UUID                              Label
       /dev/sdb1  o2cb   webcluster  G  DCDA2845177F4D59A0F2DCD8DE507CC3  hbvol1
       /dev/sdc1  None                  23878C320CF3478095D1318CB5C99EED  localmount
       /dev/sdd1  o2cb   webcluster  G  8AB016CD59FC4327A2CDAB69F08518E3  webvol
       /dev/sdg1  o2cb   webcluster  G  77D95EF51C0149D2823674FCC162CF8B  logsvol
       /dev/sdh1  o2cb   webcluster  G  BBA1DBD0F73F449384CE75197D9B7098  scratch

       NON-CLUSTERED OR LOCAL MOUNT

               To format an OCFS2 volume as a non-clustered (local) volume, do:

              # mkfs.ocfs2 -L "mylabel" --fs-features=local /dev/sda1

              To convert an existing clustered volume to a non-clustered volume, do:

              # tunefs.ocfs2 --fs-features=local /dev/sda1

              Non-clustered volumes do not interact with the cluster stack. One can have both clustered and non-
              clustered volumes mounted at the same time.

               While formatting a non-clustered volume, users should consider the possibility of later converting
               that volume to a clustered one. If there is a possibility of that, then the user should add enough
               node-slots using the -N option. Adding node-slots during format creates journals with large
               extents. If created later, the journals will be fragmented, which is not good for performance.
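
               A minimal sketch (label, slot count and device are placeholders) that formats a local volume
               while provisioning node slots for a possible later conversion to a clustered volume:

               # mkfs.ocfs2 -L "mylabel" -N 4 --fs-features=local /dev/sda1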

       CLUSTERED MOUNT WITH O2CB CLUSTER STACK

               Only one of the two heartbeat modes can be active at any one time. Changing heartbeat modes is an
              offline operation.

              Both  heartbeat  modes  require /etc/ocfs2/cluster.conf and /etc/sysconfig/o2cb to be populated as
               described in ocfs2.cluster.conf(5) and o2cb.sysconfig(5) respectively. The only difference in
               setup between the two modes is that global requires heartbeat devices to be configured whereas
               local does not.

               Refer to o2cb(7) for more information.

              LOCAL HEARTBEAT
                     This is the default heartbeat mode. The user needs to populate the configuration  files  as
                     described  in  ocfs2.cluster.conf(5) and o2cb.sysconfig(5). In this mode, the cluster stack
                     heartbeats on all mounted volumes. Thus, one does not have to specify heartbeat devices  in
                     cluster.conf.
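
                      The sketch below shows the general shape of a two-node /etc/ocfs2/cluster.conf; the
                      cluster name, node names, addresses and port are placeholders, and ocfs2.cluster.conf(5)
                      documents the authoritative format:

                      cluster:
                              node_count = 2
                              name = webcluster

                      node:
                              ip_port = 7777
                              ip_address = 192.168.1.101
                              number = 1
                              name = node1
                              cluster = webcluster

                      node:
                              ip_port = 7777
                              ip_address = 192.168.1.102
                              number = 2
                              name = node2
                              cluster = webcluster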

                     Once configured, the o2cb cluster stack can be onlined and offlined as follows:

                     # service o2cb online
                     Setting cluster stack "o2cb": OK
                     Registering O2CB cluster "webcluster": OK
                     Setting O2CB cluster timeouts : OK

                     # service o2cb offline
                     Clean userdlm domains: OK
                     Stopping O2CB cluster webcluster: OK
                     Unregistering O2CB cluster "webcluster": OK

              GLOBAL HEARTBEAT
                     The  configuration  is  similar to local heartbeat. The one additional step in this mode is
                      that heartbeat devices also need to be configured.

                     These heartbeat devices are OCFS2 formatted volumes with global heartbeat enabled on  disk.
                     These volumes can later be mounted and used as clustered file systems.

                      The steps to format a volume with global heartbeat enabled are listed in o2cb(7).  Also
                      listed there are the steps to list all volumes with the cluster stack stamped on disk.

                     In this mode, the heartbeat is started when the cluster is onlined  and  stopped  when  the
                     cluster is offlined.

                     # service o2cb online
                     Setting cluster stack "o2cb": OK
                     Registering O2CB cluster "webcluster": OK
                     Setting O2CB cluster timeouts : OK
                     Starting global heartbeat for cluster "webcluster": OK

                     # service o2cb offline
                     Clean userdlm domains: OK
                     Stopping global heartbeat on cluster "webcluster": OK
                     Stopping O2CB cluster webcluster: OK
                     Unregistering O2CB cluster "webcluster": OK

                     # service o2cb status
                     Driver for "configfs": Loaded
                     Filesystem "configfs": Mounted
                     Stack glue driver: Loaded
                     Stack plugin "o2cb": Loaded
                     Driver for "ocfs2_dlmfs": Loaded
                     Filesystem "ocfs2_dlmfs": Mounted
                     Checking O2CB cluster "webcluster": Online
                       Heartbeat dead threshold: 31
                       Network idle timeout: 30000
                       Network keepalive delay: 2000
                       Network reconnect delay: 2000
                       Heartbeat mode: Global
                     Checking O2CB heartbeat: Active
                       77D95EF51C0149D2823674FCC162CF8B /dev/sdg1
                     Nodes in O2CB cluster: 92 96

       CLUSTERED MOUNT WITH USERSPACE CLUSTER STACK

              Configure  and  online the userspace stack pcmk or cman before using tunefs.ocfs2(8) to update the
              cluster stack on disk.

              # tunefs.ocfs2 --update-cluster-stack /dev/sdd1
              Updating on-disk cluster information to match the running cluster.
              DANGER: YOU MUST BE ABSOLUTELY SURE THAT NO OTHER NODE IS USING THIS
              FILESYSTEM BEFORE MODIFYING ITS CLUSTER CONFIGURATION.
              Update the on-disk cluster information? y

              Refer to the cluster stack documentation for information on  starting  and  stopping  the  cluster
              stack.

FILE SYSTEM UTILITIES

        This section lists the utilities that are used to manage OCFS2 file systems.  This includes tools to
        format, tune, check, mount, and debug the file system. Each utility has a man page that lists its
       capabilities in detail.

       mkfs.ocfs2(8)
               This is the file system format utility. All volumes have to be formatted prior to use.  As
              this  utility overwrites the volume, use it with care. Double check to ensure the volume is not in
              use on any node in the cluster.

              As a precaution, the utility will abort if the volume is locally  mounted.  It  also  detects  use
              across the cluster if used by OCFS2. But these checks are not comprehensive and can be overridden.
              So use it with care.

              While it is not always required, the cluster should be online.

       tunefs.ocfs2(8)
              This  is  the  file system tune utility. It allows users to change certain on-disk parameters like
              label, uuid, number of node-slots, volume size and the  size  of  the  journals.  It  also  allows
              turning on and off the file system features as listed above.

              This utility requires the cluster to be online.

       fsck.ocfs2(8)
              This  is  the  file system check utility. It detects and fixes on-disk errors. All the check codes
              and their fixes are listed in fsck.ocfs2.checks(8).

              This utility requires the cluster to be online to ensure the volume is not in use on another  node
              and to prevent the volume from being mounted for the duration of the check.

       mount.ocfs2(8)
              This is the file system mount utility. It is invoked indirectly by the mount(8) utility.

              This utility detects the cluster status and aborts if the cluster is offline or does not match the
              cluster stamped on disk.

       o2cluster(8)
              This  is  the  file system cluster stack update utility. It allows the users to update the on-disk
              cluster stack to the one provided.

              This utility only updates the disk if the utility is reasonably assured that the  file  system  is
              not in use on any node.

       o2info(1)
              This  is the file system information utility. It provides information like the features enabled on
              disk, block size, cluster size, free space fragmentation, etc.

              It can be used by both privileged and non-privileged users. Users having read  permission  on  the
              device can provide the path to the device. Other users can provide the path to a file on a mounted
              file system.

       debugfs.ocfs2(8)
              This  is  the  file  system  debug  utility. It allows users to examine all file system structures
              including walking directory  structures,  displaying  inodes,  backing  up  files,  etc.,  without
              mounting the file system.

              This utility requires the user to have read permission on the device.

       o2image(8)
              This  is the file system image utility. It allows users to copy the file system metadata skeleton,
               including the inodes, directories, bitmaps, etc. As it excludes data, the image is much smaller
               than the file system itself.

              The image file created can be used in debugging on-disk corruptions.

       mounted.ocfs2(8)
               This is the file system detect utility. It detects all OCFS2 volumes in the system and lists their
               label, uuid and cluster stack.

O2CB CLUSTER STACK UTILITIES

        This section lists the utilities that are used to manage the O2CB cluster stack.  Each utility has a man
       page that lists its capabilities in detail.

       o2cb(8)
              This  is the cluster configuration utility. It allows users to update the cluster configuration by
              adding and removing nodes and heartbeat regions. This utility is used by the o2cb init  script  to
              online and offline the cluster.

              This is a new utility and replaces o2cb_ctl(8) which has been deprecated.

       ocfs2_hb_ctl(8)
              This  is  the  cluster  heartbeat utility. It allows users to start and stop local heartbeat. This
              utility is invoked by mount.ocfs2(8) and should not be invoked directly by the user.

       o2hbmonitor(8)
              This is the disk heartbeat monitor. It tracks the elapsed time since the last heartbeat  and  logs
              warnings once that time exceeds the warn threshold.

FILE SYSTEM NOTES

       This section includes some useful notes that may prove helpful to the user.

       BALANCED CLUSTER
              A  cluster  is a computer. This is a fact and not a slogan. What this means is that an errant node
              in the cluster can affect the behavior of other nodes. If one node is slow, the cluster operations
              will slow down on all nodes. To prevent that, it is best to have a balanced  cluster.  This  is  a
              cluster that has equally powered and loaded nodes.

              The  standard  recommendation  for such clusters is to have identical hardware and software across
              all the nodes. However, that is not a hard and fast rule. After all, we have taken the  effort  to
              ensure that OCFS2 works in a mixed architecture environment.

              If  one  uses  OCFS2 in a mixed architecture environment, try to ensure that the nodes are equally
              powered and loaded. The use of a load balancer can assist with the latter.  Power  refers  to  the
              number of processors, speed, amount of memory, I/O throughput, network bandwidth, etc. In reality,
              having  equally  powered heterogeneous nodes is not always practical. In that case, make the lower
              node numbers more powerful than the higher node numbers. The O2CB cluster stack favors lower  node
              numbers in all of its tiebreaking logic.

              This  is not to suggest you should add a single core node in a cluster of quad cores. No amount of
              node number juggling will help you there.

       FILE DELETION
              In Linux, rm(1) removes the directory entry. It does  not  necessarily  delete  the  corresponding
              inode. But by removing the directory entry, it gives the illusion that the inode has been deleted.
              This  puzzles  users when they do not see a corresponding up-tick in the reported free space.  The
              reason is that inode deletion has a few more hurdles to cross.

              First is the hard link count, that indicates the number of  directory  entries  pointing  to  that
              inode. As long as an inode has one or more directory entries pointing to it, it cannot be deleted.
              The  file  system has to wait for the removal of all those directory entries. In other words, wait
              for that count to drop to zero.

              The second hurdle is the POSIX semantics allowing files to be unlinked even while they are in-use.
              In OCFS2, that translates to in-use across the cluster. The  file  system  has  to  wait  for  all
              processes across the cluster to stop using the inode.

              Once  these conditions are met, the inode is deleted and the freed space is visible after the next
              sync.

              Now the amount of space freed depends on the allocation. Only space that is actually allocated  to
              that  inode is freed. The example below shows a sparsely allocated file of size 51TB of which only
              2.4GB is actually allocated.

              $ ls -lsh largefile
              2.4G -rw-r--r-- 1 mark mark 51T Sep 29 15:04 largefile

              Furthermore, for reflinked files, only private extents are freed. Shared extents  are  freed  when
               the last inode accessing them is deleted. The example below shows a 4GB file that shares 3GB with
              other reflinked files. Deleting it will increase the free space by 1GB. However, if it is the only
              remaining file accessing the shared extents, the full 4G will be freed.  (More information on  the
              shared-du(1) utility is provided below.)

              $ shared-du -m -c --shared-size reflinkedfile
              4000    (3000)  reflinkedfile

              The  deletion itself is a multi-step process. Once the hard link count falls to zero, the inode is
              moved to the orphan_dir system directory where it remains  until  the  last  process,  across  the
              cluster,  stops  using  the inode. Then the file system frees the extents and adds the freed space
              count to the truncate_log system file where it remains until the next sync.  The  freed  space  is
              made visible to the user only after that sync.

       DIRECTORY LISTING
              ls(1)  may  be  a  simple command, but it is not cheap. What is expensive is not the part where it
               reads the directory listing, but the second part where it reads all the inodes, also referred to as
               an inode stat(2). If the inodes are not in cache, this can entail disk I/O.  Now, while a cold
              cache inode stat(2) is expensive in all file systems, it is especially  so  in  a  clustered  file
              system as it needs to take a cluster lock on each inode.

               A hot cache stat(2), on the other hand, has been shown to perform on OCFS2 like it does on EXT3.

              In  other  words,  the second ls(1) will be quicker than the first. However, it is not guaranteed.
              Say you have a million files in a file system and not  enough  kernel  memory  to  cache  all  the
              inodes. In that case, each ls(1) will involve some cold cache stat(2)s.

       ALLOCATION RESERVATION
              Allocation  reservation  allows  multiple  concurrently extending files to grow as contiguously as
              possible. One way to demonstrate its functioning is to run a script that extends multiple files in
              a circular order. The script below does that by writing one hundred 4KB chunks to four files,  one
              after another.

              $ for i in $(seq 0 99);
              > do
              >   for j in $(seq 4);
              >   do
              >     dd if=/dev/zero of=file$j bs=4K count=1 seek=$i;
              >   done;
              > done;

              When run on a system running Linux kernel 2.6.34 or earlier, we end up with files with 100 extents
              each.  That  is full fragmentation. As the files are being extended one after another, the on-disk
              allocations are fully interleaved.

              $ filefrag file1 file2 file3 file4
              file1: 100 extents found
              file2: 100 extents found
              file3: 100 extents found
              file4: 100 extents found

              When run on a system running Linux kernel 2.6.35 or later, we see files with 7 extents each.  That
              is  a  lot  fewer than before. Fewer extents mean more on-disk contiguity and that always leads to
              better overall performance.

              $ filefrag file1 file2 file3 file4
              file1: 7 extents found
              file2: 7 extents found
              file3: 7 extents found
              file4: 7 extents found

       REFLINK OPERATION
              This feature allows a user to create a writeable snapshot of a regular file.  In  this  operation,
              the  file system creates a new inode with the same extent pointers as the original inode. Multiple
              inodes are thus able to share data extents. This  adds  a  twist  in  file  system  administration
              because none of the existing file system utilities in Linux expect this behavior. du(1), a utility
               used to compute file space usage, simply adds the blocks allocated to each inode. As it does
               not know about shared extents, it overestimates the space used.  Say, we have a 5GB file in a
              volume having 42GB free.

              $ ls -l
              total 5120000
              -rw-r--r--  1 jeff jeff   5242880000 Sep 24 17:15 myfile

              $ du -m myfile*
              5000    myfile

              $ df -h .
              Filesystem            Size  Used Avail Use% Mounted on
              /dev/sdd1             50G   8.2G   42G  17% /ocfs2

              If  we were to reflink it 4 times, we would expect the directory listing to report five 5GB files,
              but the df(1) to report no loss of available space. du(1), on the other  hand,  would  report  the
              disk usage to climb to 25GB.

              $ reflink myfile myfile-ref1
              $ reflink myfile myfile-ref2
              $ reflink myfile myfile-ref3
              $ reflink myfile myfile-ref4

              $ ls -l
              total 25600000
              -rw-r--r--  1 jeff jeff   5242880000 Sep 24 17:15 myfile
              -rw-r--r--  1 jeff jeff   5242880000 Sep 24 17:16 myfile-ref1
              -rw-r--r--  1 jeff jeff   5242880000 Sep 24 17:16 myfile-ref2
              -rw-r--r--  1 jeff jeff   5242880000 Sep 24 17:16 myfile-ref3
              -rw-r--r--  1 jeff jeff   5242880000 Sep 24 17:16 myfile-ref4

              $ df -h .
              Filesystem            Size  Used Avail Use% Mounted on
              /dev/sdd1             50G   8.2G   42G  17% /ocfs2

              $ du -m myfile*
              5000    myfile
              5000    myfile-ref1
              5000    myfile-ref2
              5000    myfile-ref3
              5000    myfile-ref4
              25000 total

              Enter  shared-du(1), a shared extent-aware du. This utility reports the shared extents per file in
               parentheses and the overall footprint. As expected, it lists the overall footprint at 5GB. One can
              view the details of the extents using shared-filefrag(1).  Both these utilities are  available  at
              http://oss.oracle.com/~smushran/reflink-tools/.   We  are  currently in the process of pushing the
              changes to the upstream maintainers of these utilities.

              $ shared-du -m -c --shared-size myfile*
              5000    (5000)  myfile
              5000    (5000)  myfile-ref1
              5000    (5000)  myfile-ref2
              5000    (5000)  myfile-ref3
              5000    (5000)  myfile-ref4
              25000 total
              5000 footprint

              # shared-filefrag -v myfile
              Filesystem type is: 7461636f
              File size of myfile is 5242880000 (1280000 blocks, blocksize 4096)
              ext logical physical expected length flags
              0         0  2247937            8448
              1      8448  2257921  2256384  30720
              2     39168  2290177  2288640  30720
              3     69888  2322433  2320896  30720
              4    100608  2354689  2353152  30720
              7    192768  2451457  2449920  30720
               . . .
              37  1073408  2032129  2030592  30720 shared
              38  1104128  2064385  2062848  30720 shared
              39  1134848  2096641  2095104  30720 shared
              40  1165568  2128897  2127360  30720 shared
              41  1196288  2161153  2159616  30720 shared
              42  1227008  2193409  2191872  30720 shared
              43  1257728  2225665  2224128  22272 shared,eof
              myfile: 44 extents found

       DATA COHERENCY
              One of the challenges in a shared file system is data coherency when multiple nodes are writing to
              the same set of files. NFS, for example, provides close-to-open data coherency that results in the
              data being flushed to the server when the file is closed on the client.  This leaves open  a  wide
              window for stale data being read on another node.

              A  simple test to check the data coherency of a shared file system involves concurrently appending
              the same file. Like running "uname -a >>/dir/file" using a parallel distributed shell like dsh  or
              pconsole. If coherent, the file will contain the results from all nodes.

              # dsh -R ssh -w node32,node33,node34,node35 "uname -a >> /ocfs2/test"
              # cat /ocfs2/test
              Linux node32 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
              Linux node35 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
              Linux node33 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
              Linux node34 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

              OCFS2 is a fully cache coherent cluster file system.

       DISCONTIGUOUS BLOCK GROUP
              Most  file  systems  pre-allocate space for inodes during format. OCFS2 dynamically allocates this
              space when required.

              However, this dynamic allocation has been problematic when the  free  space  is  very  fragmented,
              because  the file system required the inode and extent allocators to grow in contiguous fixed-size
              chunks.

              The discontiguous block group feature takes care of this problem by  allowing  the  allocators  to
              grow in smaller, variable-sized chunks.

              This feature was added in Linux kernel 2.6.35 and requires enabling on-disk feature discontig-bg.

       BACKUP SUPER BLOCKS
              A  file  system  super  block  stores critical information that is hard to recreate.  In OCFS2, it
              stores the block size, cluster size, and the locations of the root and system  directories,  among
              other  things.  As  this  block is close to the start of the disk, it is very susceptible to being
              overwritten by an errant write.  Say, dd if=file of=/dev/sda1.

              Backup super blocks are copies of the super block. These blocks are dispersed  in  the  volume  to
              minimize  the  chances of being overwritten. On the small chance that the original gets corrupted,
              the backups are available to scan and fix the corruption.

              mkfs.ocfs2(8) enables this feature  by  default.  Users  can  disable  this  by  specifying  --fs-
              features=nobackup-super during format.

              o2info(1) can be used to view whether the feature has been enabled on a device.

              # o2info --fs-features /dev/sdb1
              backup-super strict-journal-super sparse extended-slotmap inline-data xattr
              indexed-dirs refcount discontig-bg clusterinfo unwritten

              In  OCFS2, the super block is on the third block. The backups are located at the 1G, 4G, 16G, 64G,
              256G and 1T byte offsets. The actual number of backup blocks depends on the size  of  the  device.
              The super block is not backed up on devices smaller than 1GB.

              fsck.ocfs2(8)  refers  to  these six offsets by numbers, 1 to 6. Users can specify any backup with
              the -r option to recover the volume. The example below uses  the  second  backup.  If  successful,
              fsck.ocfs2(8) overwrites the corrupted super block with the backup.

              # fsck.ocfs2 -f -r 2 /dev/sdb1
              fsck.ocfs2 1.8.0
              [RECOVER_BACKUP_SUPERBLOCK] Recover superblock information from backup block#1048576? <n> y
              Checking OCFS2 filesystem in /dev/sdb1:
                Label:              webhome
                UUID:               B3E021A2A12B4D0EB08E9E986CDC7947
                Number of blocks:   13107196
                Block size:         4096
                Number of clusters: 13107196
                Cluster size:       4096
                Number of slots:    8

              /dev/sdb1 was run with -f, check forced.
              Pass 0a: Checking cluster allocation chains
              Pass 0b: Checking inode allocation chains
              Pass 0c: Checking extent block allocation chains
              Pass 1: Checking inodes and blocks.
              Pass 2: Checking directory entries.
              Pass 3: Checking directory connectivity.
              Pass 4a: checking for orphaned inodes
              Pass 4b: Checking inodes link counts.
              All passes succeeded.

       SYNTHETIC FILE SYSTEMS
              The  OCFS2  development  effort  included  two synthetic file systems, configfs and dlmfs. It also
              makes use of a third, debugfs.

              configfs
                     configfs has since been accepted as  a  generic  kernel  component  and  is  also  used  by
                     netconsole  and fs/dlm. OCFS2 tools use it to communicate the list of nodes in the cluster,
                     details of the heartbeat device, cluster timeouts, and so on to the in-kernel node manager.
                     The o2cb init script mounts this file system at /sys/kernel/config.

               dlmfs  dlmfs exposes the in-kernel o2dlm to user space. While it was developed primarily for
                     OCFS2  tools,  it  has  seen  usage by others looking to add a cluster locking dimension in
                     their applications. Users interested in doing the same should look at the libo2dlm  library
                     provided by ocfs2-tools. The o2cb init script mounts this file system at /dlm.

              debugfs
                     OCFS2  uses debugfs to expose its in-kernel information to user space. For example, listing
                     the file system cluster locks, dlm locks, dlm state, o2net state, etc. Users can access the
                     information by mounting the  file  system  at  /sys/kernel/debug.  To  automount,  add  the
                     following to /etc/fstab: debugfs /sys/kernel/debug debugfs defaults 0 0

       DISTRIBUTED LOCK MANAGER
              One of the key technologies in a cluster is the lock manager, which maintains the locking state of
              all  resources  across  the cluster. An easy implementation of a lock manager involves designating
              one node to handle everything. In this model, if a node wanted to acquire a lock,  it  would  send
               the request to the lock manager. However, this model has a weakness: the lock manager's death
               causes the cluster to seize up.

              A better model is one where all nodes manage a subset of the lock resources. Each  node  maintains
               enough information for all the lock resources it is interested in. In the event of a node death,
               the remaining nodes pool their information to reconstruct the lock state maintained by the dead
               node.
              In this scheme, the locking overhead is  distributed  amongst  all  the  nodes.  Hence,  the  term
              distributed lock manager.

               O2DLM is a distributed lock manager. It is based on the specification titled "Programming Locking
               Applications" written by Kristin Thomas and is available at the following link.
              http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf

       DLM DEBUGGING
              O2DLM  has  a  rich debugging infrastructure that allows it to show the state of the lock manager,
              all the lock resources, among other things.  The figure below shows the dlm state of  a  nine-node
              cluster  that  has  just  lost three nodes: 12, 32, and 35. It can be ascertained that node 7, the
              recovery master, is currently recovering node 12 and has received the lock states of the dead node
              from all other live nodes.

              # cat /sys/kernel/debug/o2dlm/45F81E3B6F2B48CCAAD1AE7945AB2001/dlm_state
              Domain: 45F81E3B6F2B48CCAAD1AE7945AB2001  Key: 0x10748e61
              Thread Pid: 24542  Node: 7  State: JOINED
              Number of Joins: 1  Joining Node: 255
              Domain Map: 7 31 33 34 40 50
              Live Map: 7 31 33 34 40 50
              Lock Resources: 48850 (439879)
              MLEs: 0 (1428625)
                Blocking: 0 (1066000)
                Mastery: 0 (362625)
                Migration: 0 (0)
              Lists: Dirty=Empty  Purge=Empty  PendingASTs=Empty  PendingBASTs=Empty
              Purge Count: 0  Refs: 1
              Dead Node: 12
              Recovery Pid: 24543  Master: 7  State: ACTIVE
              Recovery Map: 12 32 35
              Recovery Node State:
                      7 - DONE
                      31 - DONE
                      33 - DONE
                      34 - DONE
                      40 - DONE
                      50 - DONE

              The figure below shows the state of a dlm lock resource that is mastered (owned) by node 25,  with
              6 locks in the granted queue and node 26 holding the EX (writelock) lock on that resource.

              # debugfs.ocfs2 -R "dlm_locks M000000000000000022d63c00000000" /dev/sda1
              Lockres: M000000000000000022d63c00000000   Owner: 25    State: 0x0
              Last Used: 0      ASTs Reserved: 0    Inflight: 0    Migration Pending: No
              Refs: 8    Locks: 6    On Lists: None
              Reference Map: 26 27 28 94 95
               Lock-Queue  Node  Level  Conv  Cookie           Refs  AST  BAST  Pending-Action
               Granted     94    NL     -1    94:3169409       2     No   No    None
               Granted     28    NL     -1    28:3213591       2     No   No    None
               Granted     27    NL     -1    27:3216832       2     No   No    None
               Granted     95    NL     -1    95:3178429       2     No   No    None
               Granted     25    NL     -1    25:3513994       2     No   No    None
               Granted     26    EX     -1    26:3512906       2     No   No    None

               The figure below shows a lock from the file system perspective. Specifically, it shows a lock that
               is in the process of being upconverted from NL to EX. Locks in this state are referred to in the
               file system as busy locks and can be listed using the debugfs.ocfs2 command, "fs_locks -B".

              # debugfs.ocfs2 -R "fs_locks -B" /dev/sda1
              Lockres: M000000000000000000000b9aba12ec  Mode: No Lock
              Flags: Initialized Attached Busy
              RO Holders: 0  EX Holders: 0
              Pending Action: Convert  Pending Unlock Action: None
              Requested Mode: Exclusive  Blocking Mode: No Lock
              PR > Gets: 0  Fails: 0    Waits Total: 0us  Max: 0us  Avg: 0ns
              EX > Gets: 1  Fails: 0    Waits Total: 544us  Max: 544us  Avg: 544185ns
              Disk Refreshes: 1

              With this debugging infrastructure in place, users can debug hang issues as follows:

                  *  Dump  the busy fs locks for all the OCFS2 volumes on the node with hanging processes. If no
                  locks are found, then the problem is not related to O2DLM.

                  * Dump the corresponding dlm lock for all the busy fs locks. Note down the owner  (master)  of
                  all the locks.

                  * Dump the dlm locks on the master node for each lock.
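
               Put together, the first two steps reuse the commands shown earlier in this section. The device
               name and lock resource below are the ones from the examples above.

               # debugfs.ocfs2 -R "fs_locks -B" /dev/sda1
               # debugfs.ocfs2 -R "dlm_locks M000000000000000000000b9aba12ec" /dev/sda1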

              At this stage, one should note that the hanging node is waiting to get an AST from the master. The
              master,  on  the  other hand, cannot send the AST until the current holder has down converted that
              lock, which it will do upon receiving a Blocking AST. However, a node can only down convert if all
              the lock holders have stopped using that lock.  After dumping the dlm lock  on  the  master  node,
              identify the current lock holder and dump both the dlm and fs locks on that node.

               The trick here is to see whether the Blocking AST message has been relayed to the file system. If
               not, the problem is in the dlm layer. If it has, then the most common reason is an active lock
               holder; the holder counts are maintained in the fs lock.

               At this stage, printing the list of processes helps.

              $ ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN

              Make a note of all D state processes. At least one of them is responsible  for  the  hang  on  the
              first node.
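
               A quick way to narrow the listing down to D state processes is to filter on the stat column, for
               example:

               $ ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN | awk '$2 ~ /^D/'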

              The  challenge  then  is to figure out why those processes are hanging. Failing that, at least get
              enough information (like alt-sysrq t output) for the kernel developers to review.  What to do next
              depends on where the process is hanging. If it is waiting for the I/O  to  complete,  the  problem
              could  be  anywhere  in  the I/O subsystem, from the block device layer through the drivers to the
              disk array. If the hang concerns a user lock (flock(2)),  the  problem  could  be  in  the  user’s
              application.  A  possible  solution  could  be  to kill the holder. If the hang is due to tight or
              fragmented memory, free up some memory by killing non-essential processes.
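
               If sysrq is enabled, the task dump mentioned above can also be triggered from a shell, with the
               output appearing in the kernel log:

               # echo t > /proc/sysrq-trigger
               # dmesg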

              The thing to note is that the symptom for the problem was on one node but the cause is on another.
              The issue can only be resolved on the node holding the lock. Sometimes, the best solution will  be
               to reset that node. Once the node is killed, the O2DLM recovery process will clear all locks
               owned by the dead node and let the cluster continue to operate. As harsh as that sounds, at times
               it is the only
              solution. The good news is that, by following the trail, you now have enough information to file a
              bug and get the real issue resolved.

       NFS EXPORTING
              OCFS2 volumes can be exported as NFS volumes. This support is limited  to  NFS  version  3,  which
              translates to Linux kernel version 2.4 or later.

              If  the  version of the Linux kernel on the system exporting the volume is older than 2.6.30, then
              the NFS clients must mount the volumes using  the  nordirplus  mount  option.  This  disables  the
               READDIRPLUS RPC call to work around a bug in NFSD, detailed in the following link:

              http://oss.oracle.com/pipermail/ocfs2-announce/2008-June/000025.html
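
               For example, a client mount with that option might look like the following (the server name and
               paths are illustrative):

               # mount -t nfs -o nordirplus server:/exports/webhome /mnt/webhome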

              Users  running  NFS  version 2 can export the volume after having disabled subtree checking (mount
              option no_subtree_check). Be warned, disabling the check has security implications (documented  in
              the exports(5) man page) that users must evaluate on their own.
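
               An illustrative /etc/exports entry with subtree checking disabled (the export path and client
               range are hypothetical):

               /exports/webhome  192.168.1.0/24(rw,no_subtree_check)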

       FILE SYSTEM LIMITS
               OCFS2 has no intrinsic limit on the total number of files and directories in the file system. In
               general, it is only limited by the size of the device. But there is one limit imposed by the
               current file system format: it can address at most four billion clusters. A file system with a
               1MB cluster size can thus grow to 4PB, while a file system with a 4KB cluster size can address up
               to 16TB.
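
               As the cluster size cannot be changed after the file system has been created, volumes expected to
               grow beyond 16TB should be formatted with a larger cluster size. For example (the device and
               label are illustrative):

               # mkfs.ocfs2 -C 1M -L "bigvol" /dev/sdX1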

       SYSTEM OBJECTS
              The OCFS2 file system stores its internal meta-data, including bitmaps, journals, etc., as  system
              files. These are grouped in a system directory. These files and directories are not accessible via
              the file system interface but can be viewed using the debugfs.ocfs2(8) tool.

              To list the system directory (referred to as double-slash), do:

              # debugfs.ocfs2 -R "ls -l //" /dev/sde1
                      66     drwxr-xr-x  10  0  0         3896 19-Jul-2011 13:36 .
                      66     drwxr-xr-x  10  0  0         3896 19-Jul-2011 13:36 ..
                      67     -rw-r--r--   1  0  0            0 19-Jul-2011 13:36 bad_blocks
                      68     -rw-r--r--   1  0  0      1179648 19-Jul-2011 13:36 global_inode_alloc
                      69     -rw-r--r--   1  0  0         4096 19-Jul-2011 14:35 slot_map
                      70     -rw-r--r--   1  0  0      1048576 19-Jul-2011 13:36 heartbeat
                      71     -rw-r--r--   1  0  0  53686960128 19-Jul-2011 13:36 global_bitmap
                      72     drwxr-xr-x   2  0  0         3896 25-Jul-2011 15:05 orphan_dir:0000
                      73     drwxr-xr-x   2  0  0         3896 19-Jul-2011 13:36 orphan_dir:0001
                      74     -rw-r--r--   1  0  0      8388608 19-Jul-2011 13:36 extent_alloc:0000
                      75     -rw-r--r--   1  0  0      8388608 19-Jul-2011 13:36 extent_alloc:0001
                      76     -rw-r--r--   1  0  0    121634816 19-Jul-2011 13:36 inode_alloc:0000
                      77     -rw-r--r--   1  0  0            0 19-Jul-2011 13:36 inode_alloc:0001
                       78     -rw-r--r--   1  0  0    268435456 19-Jul-2011 13:36 journal:0000
                      79     -rw-r--r--   1  0  0    268435456 19-Jul-2011 13:37 journal:0001
                      80     -rw-r--r--   1  0  0            0 19-Jul-2011 13:36 local_alloc:0000
                      81     -rw-r--r--   1  0  0            0 19-Jul-2011 13:36 local_alloc:0001
                      82     -rw-r--r--   1  0  0            0 19-Jul-2011 13:36 truncate_log:0000
                      83     -rw-r--r--   1  0  0            0 19-Jul-2011 13:36 truncate_log:0001

              The  file  names  that end with numbers are slot specific and are referred to as node-local system
              files. The set of node-local files used by a node can be determined from the slot map. To list the
              slot map, do:

              # debugfs.ocfs2 -R "slotmap" /dev/sde1
                  Slot#    Node#
                      0       32
                      1       35
                      2       40
                      3       31
                      4       34
                      5       33

              For more information, refer to the OCFS2 support guides available in the Documentation section  at
              http://oss.oracle.com/projects/ocfs2.

       HEARTBEAT, QUORUM, AND FENCING
              Heartbeat  is  an  essential  component  in any cluster. It is charged with accurately designating
              nodes as dead or alive. A mistake here could lead to a cluster hang or a corruption.

              o2hb is the disk heartbeat component of  o2cb.  It  periodically  updates  a  timestamp  on  disk,
              indicating  to  others that this node is alive. It also reads all the timestamps to identify other
              live nodes. Other cluster components, like o2dlm and o2net, use the o2hb service to  get  node  up
              and down events.

              The  quorum  is  the group of nodes in a cluster that is allowed to operate on the shared storage.
               When there is a failure in the cluster, nodes may be split into groups that can communicate
               within their own group and with the shared storage, but not with the other groups. o2quo
               determines which group is
              allowed to continue and initiates fencing of the other group(s).

              Fencing is the act of forcefully removing a node from a cluster. A node with  OCFS2  mounted  will
              fence  itself when it realizes that it does not have quorum in a degraded cluster. It does this so
              that other nodes won’t be stuck trying to access its resources.

              o2cb uses a machine reset to fence. This is the quickest route for the node to rejoin the cluster.

       PROCESSES

              [o2net]
                     One per node. It is a work-queue thread started when the cluster  is  brought  on-line  and
                     stopped when it is off-lined. It handles network communication for all mounts.  It gets the
                     list  of  active  nodes from O2HB and sets up a TCP/IP communication channel with each live
                     node. It sends regular keep-alive packets to detect any interruption on the channels.

              [user_dlm]
                     One per node. It is a work-queue thread started when dlmfs is loaded and stopped when it is
                     unloaded (dlmfs is a synthetic file system that allows user space processes to  access  the
                     in-kernel dlm).

              [ocfs2_wq]
                     One per node. It is a work-queue thread started when the OCFS2 module is loaded and stopped
                     when  it  is  unloaded.  It  is assigned background file system tasks that may take cluster
                     locks like flushing the truncate log, orphan directory recovery and local  alloc  recovery.
                     For  example,  orphan  directory recovery runs in the background so that it does not affect
                     recovery time.

              [o2hb-14C29A7392]
                     One per heartbeat device. It is a kernel  thread  started  when  the  heartbeat  region  is
                     populated  in  configfs  and  stopped  when it is removed. It writes every two seconds to a
                     block in the heartbeat region, indicating that this node is alive. It also reads the region
                     to maintain a map of live nodes. It notifies  subscribers  like  o2net  and  o2dlm  of  any
                     changes in the live node map.

              [ocfs2dc]
                     One  per  mount. It is a kernel thread started when a volume is mounted and stopped when it
                     is unmounted. It downgrades locks in response to blocking ASTs (BASTs) requested  by  other
                     nodes.

              [jbd2/sdf1-97]
                     One per mount. It is part of JBD2, which OCFS2 uses for journaling.

              [ocfs2cmt]
                     One  per  mount. It is a kernel thread started when a volume is mounted and stopped when it
                     is unmounted. It works with kjournald2.

              [ocfs2rec]
                     It is started whenever a node has  to  be  recovered.  This  thread  performs  file  system
                     recovery  by  replaying  the  journal  of  the  dead node. It is scheduled to run after dlm
                     recovery has completed.

              [dlm_thread]
                     One per dlm domain. It is a kernel thread started when a dlm domain is created and  stopped
                     when  it  is  destroyed. This thread sends ASTs and blocking ASTs in response to lock level
                     convert requests. It also frees unused lock resources.

              [dlm_reco_thread]
                     One per dlm domain. It is a kernel thread that handles dlm recovery when another node dies.
                     If this node is the dlm recovery master, it re-masters every lock  resource  owned  by  the
                     dead node.

              [dlm_wq]
                     One per dlm domain. It is a work-queue thread that o2dlm uses to queue blocking tasks.
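
               Whether these threads are running on a node can be checked with a simple process listing, for
               example:

               # ps -ef | grep -E 'o2net|o2hb|ocfs2|dlm' | grep -v grep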

       FUTURE WORK
               File system development is a never-ending cycle. Faster and larger disks, faster and more
               numerous processors, larger caches, etc. keep changing the sweet spot for performance, forcing
               developers to rethink long-held beliefs. Add to that new use cases, which force developers to be
               innovative in providing solutions that meld seamlessly with existing semantics.

              We  are  currently  looking  to add features like transparent compression, transparent encryption,
              delayed allocation, multi-device support, etc. as well as work on improving performance  on  newer
              generation machines.

              If you are interested in contributing, email the development team at ocfs2-devel@oss.oracle.com.

ACKNOWLEDGEMENTS

       The principal developers of the OCFS2 file system, its tools and the O2CB cluster stack, are Joel Becker,
       Zach Brown, Mark Fasheh, Jan Kara, Kurt Hackel, Tao Ma, Sunil Mushran, Tiger Yang and Tristan Ye.

       Other developers who have contributed to the file system via bug fixes, testing, etc.  are Wim Coekaerts,
       Srinivas Eeda, Coly Li, Jeff Mahoney, Marcos Matsunaga, Goldwyn Rodrigues, Manish Singh and Wengang Wang.

       The members of the Linux Cluster community including Andrew Beekhof, Lars Marowsky-Bree, Fabio Massimo Di
       Nitto and David Teigland.

       The members of the Linux File system community including Christoph Hellwig and Chris Mason.

       The  corporations  that  have  contributed  resources  for this project including Oracle, SUSE Labs, EMC,
       Emulex, HP, IBM, Intel and Network Appliance.

SEE ALSO

       debugfs.ocfs2(8)  fsck.ocfs2(8)  fsck.ocfs2.checks(8)   mkfs.ocfs2(8)   mount.ocfs2(8)   mounted.ocfs2(8)
       o2cluster(8)  o2image(8) o2info(1) o2cb(7) o2cb(8) o2cb.sysconfig(5) o2hbmonitor(8) ocfs2.cluster.conf(5)
       tunefs.ocfs2(8)

AUTHOR

       Oracle Corporation

COPYRIGHT

       Copyright © 2004, 2012 Oracle. All rights reserved.

Version 1.8.7                                     January 2012                                          OCFS2(7)