Provided by: liburing-dev_2.1-2build1_amd64

NAME

       io_uring_setup - set up a context for performing asynchronous I/O

SYNOPSIS

       #include <linux/io_uring.h>

       int io_uring_setup(u32 entries, struct io_uring_params *p);

DESCRIPTION

       The  io_uring_setup() system call sets up a submission queue (SQ) and completion queue (CQ) with at least
       entries entries, and returns a file descriptor which can be used to perform subsequent operations on  the
       io_uring  instance.   The  submission  and completion queues are shared between userspace and the kernel,
       which eliminates the need to copy data when initiating and completing I/O.

       params, the structure that p points to, is used by the application to pass options to the kernel, and by
       the kernel to convey information about the ring buffers.

           struct io_uring_params {
               __u32 sq_entries;
               __u32 cq_entries;
               __u32 flags;
               __u32 sq_thread_cpu;
               __u32 sq_thread_idle;
               __u32 features;
               __u32 wq_fd;
               __u32 resv[3];
               struct io_sqring_offsets sq_off;
               struct io_cqring_offsets cq_off;
           };
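
       The C library may not provide a wrapper for this system call. The following is a minimal sketch of
       invoking it through syscall(2), assuming a Linux system with kernel headers installed. The helper names
       sys_io_uring_setup and ring_create_default are hypothetical and are not part of any library.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>

/* Fallback for older headers. Assumption: 425 is the syscall number on
 * most architectures (it is not valid on alpha, for example). */
#ifndef __NR_io_uring_setup
#define __NR_io_uring_setup 425
#endif

/* Hypothetical helper: thin wrapper around the raw system call.
 * Returns a ring file descriptor, or -1 with errno set. */
static int sys_io_uring_setup(unsigned entries, struct io_uring_params *p)
{
    return (int)syscall(__NR_io_uring_setup, entries, p);
}

/* Hypothetical helper: create a ring with no flags set. The structure
 * is zeroed first, which also clears the resv array. */
static int ring_create_default(unsigned entries, struct io_uring_params *p)
{
    memset(p, 0, sizeof(*p));
    return sys_io_uring_setup(entries, p);
}
```

       On success the kernel fills in the remaining fields of the structure; on failure the caller should
       inspect errno.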

       The  flags,  sq_thread_cpu, and sq_thread_idle fields are used to configure the io_uring instance.  flags
       is a bit mask of 0 or more of the following values ORed together:

       IORING_SETUP_IOPOLL
              Perform  busy-waiting  for  an  I/O  completion,  as  opposed  to  getting  notifications  via  an
              asynchronous  IRQ  (Interrupt  Request).   The  file system (if any) and block device must support
              polling in order for this to work.  Busy-waiting provides lower latency, but may consume more  CPU
              resources  than interrupt driven I/O.  Currently, this feature is usable only on a file descriptor
              opened using the O_DIRECT flag.  When a read or write  is  submitted  to  a  polled  context,  the
              application  must poll for completions on the CQ ring by calling io_uring_enter(2).  It is illegal
              to mix and match polled and non-polled I/O on an io_uring instance.

       IORING_SETUP_SQPOLL
              When this flag is specified, a kernel thread is created to perform submission queue  polling.   An
              io_uring  instance configured in this way enables an application to issue I/O without ever context
              switching into the kernel.  By using the submission queue to fill in new submission queue  entries
              and  watching  for  completions  on the completion queue, the application can submit and reap I/Os
              without doing a single system call.

              If the kernel thread  is  idle  for  more  than  sq_thread_idle  milliseconds,  it  will  set  the
              IORING_SQ_NEED_WAKEUP  bit  in  the  flags field of the struct io_sq_ring.  When this happens, the
              application must call io_uring_enter(2) to wake the kernel thread.   If  I/O  is  kept  busy,  the
              kernel  thread will never sleep.  An application making use of this feature will need to guard the
              io_uring_enter(2) call with the following code sequence:

                  /*
                   * Ensure that the wakeup flag is read after the tail pointer
                   * has been written. It's important to use memory load acquire
                   * semantics for the flags read, as otherwise the application
                   * and the kernel might not agree on the consistency of the
                   * wakeup flag.
                   */
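                  /* C11 <stdatomic.h>; sq_ring->flags points to the mapped flags word */
                  unsigned flags = atomic_load_explicit(sq_ring->flags,
                                                        memory_order_acquire);
                  if (flags & IORING_SQ_NEED_WAKEUP)
                      io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP, NULL);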

              where sq_ring is a submission queue ring set up using the struct io_sqring_offsets described below.

              Before version 5.11 of the Linux kernel, to successfully use this feature, the application must
              register a set of files to be used for IO through io_uring_register(2) using the
              IORING_REGISTER_FILES opcode.  Failure to do so will result in submitted IO being errored with
              EBADF.  Whether registration is still required can be detected through the
              IORING_FEAT_SQPOLL_NONFIXED feature flag.  In version 5.11 and later, it is no longer necessary to
              register files to use this feature.  5.11 also allows using this as non-root, if the user has the
              CAP_SYS_NICE capability.

       IORING_SETUP_SQ_AFF
              If this flag is specified, then the poll thread will be bound to the CPU set in the sq_thread_cpu
              field of the struct io_uring_params.  This flag is only meaningful when IORING_SETUP_SQPOLL is
              specified.  When the cgroup setting cpuset.cpus changes (typically in a container environment),
              the bound CPU set may be changed as well.

       IORING_SETUP_CQSIZE
              Create the completion queue with struct io_uring_params.cq_entries entries.   The  value  must  be
              greater than entries, and may be rounded up to the next power-of-two.

       IORING_SETUP_CLAMP
              If this flag is specified, and if entries exceeds IORING_MAX_ENTRIES, then entries will be
              clamped at IORING_MAX_ENTRIES.  If the flag IORING_SETUP_CQSIZE is set, and if the value of
              struct io_uring_params.cq_entries exceeds IORING_MAX_CQ_ENTRIES, then it will be clamped at
              IORING_MAX_CQ_ENTRIES.

       IORING_SETUP_ATTACH_WQ
              This flag should be set in conjunction with struct io_uring_params.wq_fd being set to an  existing
              io_uring  ring  file  descriptor.  When  set,  the  io_uring instance being created will share the
              asynchronous worker thread backend of the specified  io_uring  ring,  rather  than  create  a  new
              separate thread pool.

       IORING_SETUP_R_DISABLED
              If  this  flag  is  specified,  the  io_uring  ring  starts  in  a disabled state.  In this state,
              restrictions can be registered, but submissions are not  allowed.   See  io_uring_register(2)  for
              details on how to enable the ring. Available since 5.10.

       If no flags are specified, the io_uring instance is set up for interrupt driven I/O.  I/O may be
       submitted using io_uring_enter(2) and can be reaped by polling the completion queue.

       The resv array must be initialized to zero.
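
       As an illustration of how the configuration fields fit together, the following sketch fills in the
       structure for a kernel-polled submission queue whose poll thread is pinned to one CPU, with an
       explicitly sized completion queue. The helper name params_sqpoll and its parameters are hypothetical.

```c
#include <string.h>
#include <linux/io_uring.h>

/* Hypothetical helper: prepare params for IORING_SETUP_SQPOLL with CPU
 * affinity and a caller-chosen CQ size. Nothing is submitted to the
 * kernel here; the result is meant to be passed to io_uring_setup(). */
static void params_sqpoll(struct io_uring_params *p, unsigned cpu,
                          unsigned idle_ms, unsigned cq_entries)
{
    memset(p, 0, sizeof(*p));           /* resv must be zero */
    p->flags = IORING_SETUP_SQPOLL      /* kernel thread polls the SQ */
             | IORING_SETUP_SQ_AFF      /* bind it to sq_thread_cpu */
             | IORING_SETUP_CQSIZE;     /* honor cq_entries below */
    p->sq_thread_cpu = cpu;
    p->sq_thread_idle = idle_ms;        /* idle time in milliseconds */
    p->cq_entries = cq_entries;
}
```

       Zeroing the whole structure before setting individual fields is a simple way to satisfy the
       requirement on resv.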

       features is filled in by the kernel and specifies the features supported by the running kernel version.
       It is a bit mask of the following values:

       IORING_FEAT_SINGLE_MMAP
              If this flag is set, the two SQ and CQ rings can be mapped with a single mmap(2)  call.  The  SQEs
              must  still  be  allocated  separately. This brings the necessary mmap(2) calls down from three to
              two. Available since kernel 5.4.

       IORING_FEAT_NODROP
              If this flag is set, io_uring supports never dropping completion events.  If  a  completion  event
              occurs  and the CQ ring is full, the kernel stores the event internally until such a time that the
              CQ ring has room for more entries. If this overflow condition is  entered,  attempting  to  submit
              more  IO  will  fail with the -EBUSY error value, if it can't flush the overflown events to the CQ
              ring. If this happens, the application must reap events from the CQ ring and  attempt  the  submit
              again. Available since kernel 5.5.

       IORING_FEAT_SUBMIT_STABLE
              If this flag is set, applications can be certain that any data for async offload has been consumed
              when the kernel has consumed the SQE. Available since kernel 5.5.

       IORING_FEAT_RW_CUR_POS
              If this flag is set, applications can specify offset == -1 with IORING_OP_{READV,WRITEV},
              IORING_OP_{READ,WRITE}_FIXED, and IORING_OP_{READ,WRITE} to mean current file position, which
              behaves like preadv2(2) and pwritev2(2) with offset == -1. It'll use (and update) the current file
              position.  This obviously comes with the caveat that if the application has multiple reads or
              writes in flight, then the end result will not be as expected. This is similar to threads sharing
              a file descriptor and doing IO using the current file position. Available since kernel 5.6.

       IORING_FEAT_CUR_PERSONALITY
              If  this  flag  is  set,  then io_uring guarantees that both sync and async execution of a request
              assumes the credentials of the task that called io_uring_enter(2) to queue the requests.  If  this
              flag  isn't  set,  then  requests  are  issued  with  the  credentials of the task that originally
              registered the io_uring. If only one task is using a ring, then this flag doesn't  matter  as  the
              credentials will always be the same.  Note that this is the default behavior; tasks can still
              register different personalities through io_uring_register(2) with IORING_REGISTER_PERSONALITY and
              specify the personality to use in the sqe. Available since kernel 5.6.

       IORING_FEAT_FAST_POLL
              If this flag is set, then io_uring supports using an internal poll mechanism to  drive  data/space
              readiness.  This means that requests that cannot read or write data to a file no longer need to be
              punted to an async thread for handling; instead, they will begin operation when the file is ready.
              This is similar to doing poll + read/write in userspace, but eliminates the need to do so. If this
              flag  is set, requests waiting on space/data consume a lot less resources doing so as they are not
              blocking a thread. Available since kernel 5.7.

       IORING_FEAT_POLL_32BITS
              If this flag is set, the IORING_OP_POLL_ADD command accepts the full 32-bit range of epoll based
              flags, most notably EPOLLEXCLUSIVE, which allows exclusive (waking single waiters) behavior.
              Available since kernel 5.9.

       IORING_FEAT_SQPOLL_NONFIXED
              If this flag is set, the IORING_SETUP_SQPOLL feature no longer requires the use  of  fixed  files.
              Any  normal  file  descriptor  can be used for IO commands without needing registration. Available
              since kernel 5.11.

       IORING_FEAT_ENTER_EXT_ARG
              If this flag is set, then the io_uring_enter(2)  system  call  supports  passing  in  an  extended
              argument instead of just the sigset_t of earlier kernels. This extended argument is of type
              struct io_uring_getevents_arg and allows the caller to pass in  both  a  sigset_t  and  a  timeout
              argument for waiting on events. The struct layout is as follows:

                  struct io_uring_getevents_arg {
                      __u64 sigmask;
                      __u32 sigmask_sz;
                      __u32 pad;
                      __u64 ts;
                  };

              and a pointer to this struct must be passed in if IORING_ENTER_EXT_ARG is set in the flags for the
              enter system call. Available since kernel 5.11.

       IORING_FEAT_NATIVE_WORKERS
              If  this  flag  is  set, io_uring is using native workers for its async helpers.  Previous kernels
              used kernel threads that assumed the identity of the original  io_uring  owning  task,  but  later
              kernels will actively create what looks more like regular process threads instead. Available since
              kernel 5.12.

       IORING_FEAT_RSRC_TAGS
              If  this  flag  is  set,  then  io_uring supports a variety of features related to fixed files and
              buffers. In particular, it indicates that registered buffers  can  be  updated  in-place,  whereas
              before the full set would have to be unregistered first. Available since kernel 5.13.
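
       The feature flags above are ordinary bits in the features field and can be tested after
       io_uring_setup() returns. The following is a minimal sketch; the helper names ring_has_feature and
       ring_mmap_calls are hypothetical.

```c
#include <stdbool.h>
#include <linux/io_uring.h>

/* Hypothetical helper: true if every bit in `feature` was reported
 * by the kernel in p->features. */
static bool ring_has_feature(const struct io_uring_params *p, unsigned feature)
{
    return (p->features & feature) == feature;
}

/* Hypothetical helper: number of mmap(2) calls needed to map the rings
 * and the SQE array, per the IORING_FEAT_SINGLE_MMAP description. */
static int ring_mmap_calls(const struct io_uring_params *p)
{
    return ring_has_feature(p, IORING_FEAT_SINGLE_MMAP) ? 2 : 3;
}
```

       Testing the bits at run time, rather than comparing kernel version numbers, lets an application adapt
       to whichever kernel it happens to run on.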

       The  rest  of  the  fields  in  the  struct  io_uring_params are filled in by the kernel, and provide the
       information necessary to memory map the submission queue, completion queue, and the array  of  submission
       queue  entries.  sq_entries specifies the number of submission queue entries allocated.  sq_off describes
       the offsets of various ring buffer fields:

           struct io_sqring_offsets {
               __u32 head;
               __u32 tail;
               __u32 ring_mask;
               __u32 ring_entries;
               __u32 flags;
               __u32 dropped;
               __u32 array;
               __u32 resv[3];
           };

       Taken together, sq_entries and sq_off  provide  all  of  the  information  necessary  for  accessing  the
       submission  queue  ring  buffer and the submission queue entry array.  The submission queue can be mapped
       with a call like:

           ptr = mmap(0, sq_off.array + sq_entries * sizeof(__u32),
                      PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
                      ring_fd, IORING_OFF_SQ_RING);

       where sq_off is the io_sqring_offsets structure, and ring_fd is the file descriptor returned from
       io_uring_setup(2).  The addition of sq_off.array to the length of the region accounts for the fact that
       the ring is located at the end of the data structure.  As an example, the ring buffer head pointer can
       be accessed by adding sq_off.head to the address returned from mmap(2):

           head = ptr + sq_off.head;
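
       The length computation and the pointer arithmetic can be sketched in portable C as follows. The cast
       through char * is used because arithmetic on a void pointer is not standard C. The helper names
       sq_ring_bytes and ring_field are hypothetical.

```c
#include <stddef.h>
#include <linux/io_uring.h>

/* Hypothetical helper: length of the SQ ring mapping, i.e. the offset of
 * the index array plus one 32-bit index per submission queue entry. */
static size_t sq_ring_bytes(const struct io_uring_params *p)
{
    return p->sq_off.array + (size_t)p->sq_entries * sizeof(__u32);
}

/* Hypothetical helper: address of a 32-bit ring field located `off`
 * bytes into a mapping that starts at ring_ptr. */
static unsigned *ring_field(void *ring_ptr, __u32 off)
{
    return (unsigned *)((char *)ring_ptr + off);
}
```

       With these, the head pointer would be obtained as ring_field(ptr, p.sq_off.head), and the tail,
       ring_mask, flags, and dropped fields in the same way.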

       The flags field is used by the kernel to communicate state information to the application.  Currently, it
       is  used  to inform the application when a call to io_uring_enter(2) is necessary.  See the documentation
       for the IORING_SETUP_SQPOLL flag above.  The dropped member is incremented for  each  invalid  submission
       queue entry encountered in the ring buffer.

       The  head  and  tail  track  the  ring  buffer  state.   The  tail is incremented by the application when
       submitting new I/O, and the head is incremented  by  the  kernel  when  the  I/O  has  been  successfully
       submitted.  Determining the index of the head or tail into the ring is accomplished by applying a mask:

           index = tail & ring_mask;
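
       Because an index is obtained by masking, the head and tail values themselves are never reduced to the
       ring size and may wrap around as 32-bit counters. The following sketch shows the arithmetic, assuming
       ring_mask equals ring_entries - 1 with ring_entries a power of two. The helper names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: slot in the ring for a head or tail value. */
static uint32_t ring_index(uint32_t pos, uint32_t ring_mask)
{
    return pos & ring_mask;
}

/* Hypothetical helper: entries between head and tail. Unsigned
 * subtraction gives the right answer even after tail has wrapped. */
static uint32_t ring_pending(uint32_t head, uint32_t tail)
{
    return tail - head;
}

/* Hypothetical helper: slots still available to the producer. */
static uint32_t ring_free(uint32_t head, uint32_t tail, uint32_t ring_entries)
{
    return ring_entries - ring_pending(head, tail);
}
```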

       The array of submission queue entries is mapped with:

           sqentries = mmap(0, sq_entries * sizeof(struct io_uring_sqe),
                            PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
                            ring_fd, IORING_OFF_SQES);

       The completion queue is described by cq_entries and cq_off shown here:

           struct io_cqring_offsets {
               __u32 head;
               __u32 tail;
               __u32 ring_mask;
               __u32 ring_entries;
               __u32 overflow;
               __u32 cqes;
               __u32 flags;
               __u32 resv[3];
           };

       The  completion  queue  is simpler, since the entries are not separated from the queue itself, and can be
       mapped with:

           ptr = mmap(0, cq_off.cqes + cq_entries * sizeof(struct io_uring_cqe),
                      PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, ring_fd,
                      IORING_OFF_CQ_RING);
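
       The mapping lengths for both rings can be computed from the values the kernel returns. The sketch below
       also shows one way to size a shared mapping when IORING_FEAT_SINGLE_MMAP is reported, assuming that a
       single mapping should span the larger of the two lengths. The helper names are hypothetical.

```c
#include <stddef.h>
#include <linux/io_uring.h>

/* Hypothetical helper: length of the SQ ring mapping. */
static size_t sq_map_len(const struct io_uring_params *p)
{
    return p->sq_off.array + (size_t)p->sq_entries * sizeof(__u32);
}

/* Hypothetical helper: length of the CQ ring mapping; the completion
 * entries live inside the ring, starting at cq_off.cqes. */
static size_t cq_map_len(const struct io_uring_params *p)
{
    return p->cq_off.cqes +
           (size_t)p->cq_entries * sizeof(struct io_uring_cqe);
}

/* Hypothetical helper: length to use when both rings share one mapping.
 * Assumption: the larger of the two lengths covers both rings. */
static size_t single_map_len(const struct io_uring_params *p)
{
    size_t sq = sq_map_len(p), cq = cq_map_len(p);
    return sq > cq ? sq : cq;
}
```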

       Closing the file descriptor returned by io_uring_setup(2) will free all  resources  associated  with  the
       io_uring context.

RETURN VALUE

       io_uring_setup(2)  returns  a  new file descriptor on success.  The application may then provide the file
       descriptor in a subsequent mmap(2)  call  to  map  the  submission  and  completion  queues,  or  to  the
       io_uring_register(2) or io_uring_enter(2) system calls.

       On error, -1 is returned and errno is set appropriately.

ERRORS

       EFAULT params is outside your accessible address space.

       EINVAL The resv array contains non-zero data; p->flags contains an unsupported flag; entries is out of
              bounds; IORING_SETUP_SQ_AFF was specified but IORING_SETUP_SQPOLL was not; or
              IORING_SETUP_CQSIZE was specified but io_uring_params.cq_entries was invalid.

       EMFILE The per-process limit on the number of open file descriptors has been reached (see the description
              of RLIMIT_NOFILE in getrlimit(2)).

       ENFILE The system-wide limit on the total number of open files has been reached.

       ENOMEM Insufficient kernel resources are available.

       EPERM  IORING_SETUP_SQPOLL was specified, but the effective user ID of the caller did not have sufficient
              privileges.
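
       An application may want to turn these error codes into diagnostics. The following sketch maps the
       errno values listed above to short messages; the helper name setup_errno_hint and the message texts
       are hypothetical and simply paraphrase this section.

```c
#include <errno.h>

/* Hypothetical helper: short description of an errno value reported
 * after a failed io_uring_setup() call. */
static const char *setup_errno_hint(int err)
{
    switch (err) {
    case EFAULT: return "params is outside the accessible address space";
    case EINVAL: return "invalid entries, flags, resv, or cq_entries";
    case EMFILE: return "per-process file descriptor limit reached";
    case ENFILE: return "system-wide open file limit reached";
    case ENOMEM: return "insufficient kernel resources";
    case EPERM:  return "insufficient privileges for IORING_SETUP_SQPOLL";
    default:     return "unexpected error";
    }
}
```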

SEE ALSO

       io_uring_register(2), io_uring_enter(2)

Linux                                              2019-01-29                                  IO_URING_SETUP(2)