Provided by: charliecloud-runtime_0.37-1build1_amd64

NAME

       ch-run - Run a command in a Charliecloud container

SYNOPSIS

          $ ch-run [OPTION...] IMAGE -- COMMAND [ARG...]

DESCRIPTION

       Run command COMMAND in a fully unprivileged Charliecloud container using the image specified by IMAGE,
       which can be: (1) a path to a directory, (2) the name of an image in ch-image storage (e.g.
       example.com:5050/foo), or (3) if the proper support is enabled, a SquashFS archive. ch-run does not use
       any setuid or setcap helpers, even for mounting SquashFS images with FUSE.

          -b, --bind=SRC[:DST]
                 Bind-mount SRC at guest DST. If DST is not specified, it defaults to the same path as on the
                 host; i.e., the default is --bind=SRC:SRC. Can be repeated.

                 With a read-only image (the default), DST must exist. However, if --write or --write-fake is
                 given, DST will be created as an empty directory (possibly with the tmpfs overmount trick
                 described in the FAQ entry "--bind creates mount points within un-writeable directories!").
                 In this case, DST must be entirely within the image itself, i.e., DST cannot enter a previous
                 bind mount. For example, --bind /foo:/tmp/foo will fail because /tmp is shared with the host
                 via bind-mount (unless $TMPDIR is set to something else or --private-tmp is given).

                 Most images have ten directories /mnt/[0-9] already available as mount points.

                 Symlinks in DST are followed, and absolute links can have  surprising  behavior.  Bind-mounting
                 happens  after  namespace setup but before pivoting into the container image, so absolute links
                 use the host root. For  example,  suppose  the  image  has  a  symlink  /foo  ->  /mnt.   Then,
                 --bind=/bar:/foo  will bind-mount on the host’s /mnt, which is inaccessible on the host because
                 namespaces are already set up and also inaccessible in the container because of the  subsequent
                 pivot  into  the  image. Currently, this problem is only detected when DST needs to be created:
                 ch-run will refuse to follow absolute symlinks  in  this  case,  to  avoid  directory  creation
                 surprises.

          -c, --cd=DIR
                 Initial working directory in container.

          --env-no-expand
                 Don’t expand variables when using --set-env.

          --feature=FEAT
                 If  feature  FEAT  is enabled, exit with success. Valid values of FEAT are extglob for extended
                 globs, seccomp for seccomp(2), and squash for squashfs archives.

          -g, --gid=GID
                 Run as group GID within container.

          --home Bind-mount your host home directory (i.e., $HOME) at guest  /home/$USER,  hiding  any  existing
                 image content at that path.  Implies --write-fake so the mount point can be created if needed.

          -j, --join
                 Use the same container (namespaces) as peer ch-run invocations.

          --join-pid=PID
                 Join the namespaces of an existing process.

          --join-ct=N
                 Number of ch-run peers (implies --join; default: see below).

          --join-tag=TAG
                 Label for ch-run peer group (implies --join; default: see below).

          -m, --mount=DIR
                 Use  DIR  for the SquashFS mount point, which must already exist. If not specified, the default
                 is /var/tmp/$USER.ch/mnt, which will be created if needed.

          --no-passwd
                 By default, temporary /etc/passwd and /etc/group files are created according to the UID and GID
                 maps for the container and bind-mounted into it. If this is specified, no such temporary  files
                 are created and the image’s files are exposed.

          -q, --quiet
                 Be quieter; can be repeated. Incompatible with -v. See the FAQ entry "How can I control
                 Charliecloud’s quietness or verbosity?" for details.

          -s, --storage DIR
                 Set the storage directory. Equivalent to the same option for ch-image(1).

          --seccomp
                 Using seccomp, intercept some system calls that  would  fail  due  to  lack  of  privilege,  do
                 nothing,  and  return  fake  success  to  the  calling  program.   This  is intended for use by
                 ch-image(1) when building images; see that man page for a detailed discussion.

          -t, --private-tmp
                 By default, the host’s /tmp (or $TMPDIR if set) is bind-mounted at container /tmp. If  this  is
                 specified, a new tmpfs is mounted on the container’s /tmp instead.

          --set-env, --set-env=FILE, --set-env=VAR=VALUE
                 Set  environment variables with newline-separated file (/ch/environment within the image if not
                 specified) or on the command line. See below for details.

          --set-env0, --set-env0=FILE, --set-env0=VAR=VALUE
                 Like --set-env, but file is null-byte separated.

          -u, --uid=UID
                 Run as user UID within container.

          --unsafe
                 Enable various unsafe behavior. For internal use only. Seriously, stay away from this option.

          --unset-env=GLOB
                 Unset environment variables whose names match GLOB.

          -v, --verbose
                 Print extra chatter; can be repeated. See the FAQ entry on verbosity for details.

          -w, --write
                 Mount image read-write. By default, the image is mounted read-only. This option should be
                 avoided for most use cases, because (1) changing images live (as opposed to prescriptively with
                 a Dockerfile) destroys their provenance and (2) SquashFS images, the best-practice format on
                 parallel filesystems, must be read-only. It is better to use --write-fake (for disposable
                 data) or bind-mount host directories (for retained data).

          -W, --write-fake[=SIZE]
                 Overlay  a  writeable tmpfs on top of the image. This makes the image appear read-write, but it
                 actually remains read-only and unchanged. All data “written” to the image  are  discarded  when
                 the container exits.

                 The  size  of the writeable filesystem SIZE is any size specification acceptable to tmpfs, e.g.
                 4m for 4MiB or 50% for half of physical memory. If this option is specified without  SIZE,  the
                 default  is  12%. Note (1) this limit is a maximum — only actually stored files consume virtual
                 memory — and (2) SIZE larger than memory can be requested without error  (the  failure  happens
                 later if the actual contents become too large).

                 This  requires  kernel  support and there are some caveats. See section “Writeable overlay with
                 --write-fake” below for details.

          -?, --help
                 Print help and exit.

          --usage
                 Print a short usage message and exit.

          -V, --version
                 Print version and exit.

       Note: Because ch-run is fully unprivileged, it is not  possible  to  change  UIDs  and  GIDs  within  the
       container  (the relevant system calls fail). In particular, setuid, setgid, and setcap executables do not
       work. As a precaution, ch-run calls prctl(PR_SET_NO_NEW_PRIVS, 1) to disable these executables within the
       container. This does not reduce functionality but is a “belt and suspenders”  precaution  to  reduce  the
       attack surface should bugs in these system calls or elsewhere arise.
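
       The --bind=SRC[:DST] default described above (DST falls back to SRC) can be sketched as follows. This is
       only an illustration, not ch-run's own parser: the helper name parse_bind is hypothetical, and splitting
       at the first colon is an assumption, since this page does not say how colons inside SRC are handled.

```python
def parse_bind(arg):
    """Split a --bind argument of the form SRC[:DST] into (src, dst).

    Assumption: the first colon separates SRC from DST. When no colon is
    present, DST defaults to SRC, i.e. --bind=SRC means --bind=SRC:SRC.
    """
    src, sep, dst = arg.partition(":")
    return (src, dst if sep else src)


if __name__ == "__main__":
    print(parse_bind("/data"))        # no DST given: same path in the guest
    print(parse_bind("/bar:/mnt/0"))  # explicit DST
```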

IMAGE FORMAT

       ch-run supports two different image formats.

       The first is a simple directory that contains a Linux filesystem tree. Such a directory can be produced by:

       • Running ch-convert directly from ch-image or another builder to a directory.

       • Charliecloud’s tarball workflow: build or pull the image, ch-convert it to a tarball, transfer the
         tarball to the target system, then ch-convert the tarball to a directory.

       • Manually mounting a SquashFS image, e.g. with squashfuse(1), and then unmounting it after the run with
         fusermount -u.

       • Any other workflow that produces an appropriate directory tree.

       The second is a SquashFS image archive mounted internally by ch-run, available if it’s  linked  with  the
       optional libsquashfuse_ll shared library. ch-run mounts the image filesystem, services all FUSE requests,
       and unmounts it, all within ch-run. See --mount above to set the mount point location.
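
       Whether a given ch-run build has internal SquashFS support can be probed with --feature=squash, which
       exits with success when the feature is enabled. The following is a minimal sketch of such a probe; the
       helper name has_feature and its injectable run parameter are illustrative assumptions, and a missing
       ch-run executable is simply treated as "feature not available".

```python
import subprocess


def has_feature(feat, run=subprocess.run):
    """Return True if `ch-run --feature=FEAT` exits with success.

    `run` is injectable so the logic can be exercised without ch-run
    installed; by default it is subprocess.run.
    """
    try:
        result = run(
            ["ch-run", f"--feature={feat}"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        # Assumption: no ch-run on PATH means the feature is unavailable.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    for feat in ("extglob", "seccomp", "squash"):
        print(feat, has_feature(feat))
```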

       Like  other  FUSE  implementations,  Charliecloud  calls the fusermount3(1) utility to mount the SquashFS
       filesystem. However, this executable does not need to be  installed  setuid  root,  and  in  fact  ch-run
       actively suppresses its setuid bit if set (using prctl(2)).

       Prior  versions  of  Charliecloud  provided  wrappers for the squashfuse and squashfuse_ll SquashFS mount
       commands and fusermount -u unmount command. We removed  these  because  we  concluded  they  had  minimal
       value-add over the standard, unwrapped commands.

       WARNING:
          Currently, Charliecloud unmounts the SquashFS filesystem when user command COMMAND’s process exits. It
          does not monitor any of its child processes. Therefore, if the user command spawns child processes and
          then  exits  before  them  (e.g.,  some  daemons),  those  children will have the image unmounted from
          underneath them. In this case, the workaround is to mount/unmount using external tools. We  expect  to
          remove this limitation in a future version.

HOST FILES AND DIRECTORIES AVAILABLE IN CONTAINER VIA BIND MOUNTS

       In  addition  to  any  directories  specified by the user with --bind, ch-run has standard host files and
       directories that are bind-mounted in as well.

       The following host files and directories are bind-mounted at the same location in  the  container.  These
       give  access  to  the  host’s  devices  and various kernel facilities. (Recall that Charliecloud provides
       minimal isolation and containerized processes are mostly normal unprivileged processes.) They  cannot  be
       disabled and are required; i.e., they must exist both on host and within the image.

          • /dev

          • /proc

          • /sys

       The following are optional: each is bind-mounted only if the path exists both on the host and within the
       image, with no error or warning otherwise.

          • /etc/hosts  and  /etc/resolv.conf. Because Charliecloud containers share the host network namespace,
            they need the same hostname resolution configuration.

          • /etc/machine-id. Provides a unique ID for the OS installation; matching  the  host  works  for  most
            situations. Needed to support D-Bus, some software licensing situations, and likely other use cases.
            See also issue #1050.

          • /var/lib/hugetlbfs  at  guest  /var/opt/cray/hugetlbfs,  and /var/opt/cray/alps/spool. These support
            Cray MPI.

       Additional bind mounts are controlled by the options above.

          • $HOME at /home/$USER (and image /home is hidden), if --home is given. Makes user data and init
            files available.

          • /tmp (or $TMPDIR if set) at guest /tmp, unless --private-tmp is given. Provides a temporary
            directory that persists between container runs and is shared with non-containerized application
            components.

          • Temporary files at /etc/passwd and /etc/group, unless --no-passwd is given. Usernames and group
            names need to be customized for each container run.

MULTIPLE PROCESSES IN THE SAME CONTAINER WITH --JOIN

       By default, different ch-run invocations  use  different  user  and  mount  namespaces  (i.e.,  different
       containers).  While  this  has  no  impact on sharing most resources between invocations, there are a few
       important exceptions.  These include:

       1. ptrace(2), used by debuggers and related tools. One can attach a debugger to processes  in  descendant
          namespaces,  but  not  sibling namespaces.  The practical effect of this is that (without --join), you
          can’t run a command with ch-run and then attach to it with a debugger also run with ch-run.

       2. Cross-memory attach (CMA) is used by cooperating  processes  to  communicate  by  simply  reading  and
          writing  one  another’s  memory.  This  is also not permitted between sibling namespaces. This affects
          various MPI implementations that use CMA to pass messages between ranks on the same node, because it’s
          faster than traditional shared memory.

       --join is designed to address this by placing related ch-run commands (the  “peer  group”)  in  the  same
       container.  This  is  done  by  one  of  the peers creating the namespaces with unshare(2) and the others
       joining with setns(2).

       To do so, we need to know the number of peers and a name for the group. These are specified by additional
       arguments that can (hopefully) be left at default values in most cases:

       • --join-ct sets the number of peers. The default is the value of the first of the following  environment
         variables that is defined: OMPI_COMM_WORLD_LOCAL_SIZE, SLURM_STEP_TASKS_PER_NODE, SLURM_CPUS_ON_NODE.

       • --join-tag  sets  the tag that names the peer group. The default is environment variable SLURM_STEP_ID,
         if defined; otherwise, the PID of ch-run’s parent. Tags can be re-used for peer groups  that  start  at
         different times, i.e., once all peer ch-run have replaced themselves with the user command, the tag can
         be re-used.
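
       The default selection rules for --join-ct and --join-tag can be sketched as below. This is an
       illustration of the rules as stated here, not ch-run's code: the function names are hypothetical, the
       environment is passed in as a plain mapping, and the raw variable value is returned without any
       parsing, because this page does not describe how the values are interpreted.

```python
import os

# Environment variables consulted for the default peer count, in priority order.
JOIN_CT_VARS = (
    "OMPI_COMM_WORLD_LOCAL_SIZE",
    "SLURM_STEP_TASKS_PER_NODE",
    "SLURM_CPUS_ON_NODE",
)


def default_join_ct(env):
    """Return the raw value of the first defined variable, or None if none is defined."""
    for name in JOIN_CT_VARS:
        if name in env:
            return env[name]
    return None


def default_join_tag(env, parent_pid):
    """Return SLURM_STEP_ID if defined; otherwise the parent PID as a string."""
    return env.get("SLURM_STEP_ID", str(parent_pid))


if __name__ == "__main__":
    print(default_join_ct(os.environ))
    print(default_join_tag(os.environ, os.getppid()))
```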

       Caveats:

       • One cannot currently add peers after the fact, for example, if one later decides to start a debugger.
         (This is only required for code with bugs and is thus an unusual use case.)

       • ch-run instances race. The winner of this race sets up the namespaces, and  the  other  peers  use  the
         winner  to  find  the  namespaces  to  join.  Therefore,  if  the user command of the winner exits, any
         remaining peers will not be able to join the namespaces, even  if  they  are  still  active.  There  is
         currently no general way to specify which ch-run should be the winner.

       • If  --join-ct  is  too  high,  the winning ch-run’s user command exits before all peers join, or ch-run
         itself crashes, IPC resources such as semaphores and shared  memory  segments  will  be  leaked.  These
         appear as files in /dev/shm/ and can be removed with rm(1).

       • Many  of  the arguments given to the race losers, such as the image path and --bind, will be ignored in
         favor of what was given to the winner.

WRITEABLE OVERLAY WITH --WRITE-FAKE

       If you need the image to stay read-only but appear writeable, you may be  able  to  use  --write-fake  to
       overlay a writeable tmpfs atop the image. This requires kernel support. Specifically:

       1. To use the feature at all, you need unprivileged overlayfs support. This is available in upstream
          Linux 5.11 (February 2021), but distributions vary considerably. If you don’t have this, the
          container will fail to start with error “operation not permitted”.

       2. For a fully functional overlay, you need a tmpfs that supports xattrs in the user namespace. This is
          available in upstream Linux 6.6 (October 2023). If you don’t have this, most things will work fine,
          but some operations will fail with “I/O error”, for example creating a directory with the same path
          as a previously deleted directory. There will also be syslog noise about xattr problems.

          (overlayfs can also use xattrs in the trusted namespace, but this requires CAP_SYS_ADMIN on  the  host
          and thus is not helpful for unprivileged containers.)
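
       As a rough first check, the kernel release string can be compared against the two upstream versions
       named above. The sketch below is only a heuristic: distributions may backport or disable these
       features, so the version number alone does not prove support. The function name and the three result
       labels are illustrative assumptions.

```python
import platform
import re


def write_fake_support(release):
    """Classify a kernel release string such as '5.15.0-91-generic'.

    Returns "none" below 5.11, "partial" from 5.11 up to (not including)
    6.6, and "full" from 6.6 on, based on upstream version numbers only.
    """
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    version = (int(match.group(1)), int(match.group(2)))
    if version < (5, 11):
        return "none"      # no unprivileged overlayfs upstream
    if version < (6, 6):
        return "partial"   # overlayfs works, tmpfs lacks user xattrs upstream
    return "full"


if __name__ == "__main__":
    print(write_fake_support(platform.release()))
```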

ENVIRONMENT VARIABLES

       ch-run  leaves  environment  variables  unchanged, i.e. the host environment is passed through unaltered,
       except:

       • by default (--home not specified), HOME is set to /root, if it exists, and / otherwise.

       • limited tweaks to avoid significant guest breakage;

       • user-set variables via --set-env;

       • user-unset variables via --unset-env; and

       • set CH_RUNNING.

       This section describes these features.

       The default tweaks happen first, then --set-env and --unset-env in the order  specified  on  the  command
       line,  and  then  CH_RUNNING.  The two options can be repeated arbitrarily many times, e.g. to add/remove
       multiple variable sets or add only some variables in a file.

   Default behavior
       By default, ch-run makes the following environment variable changes:

       $CH_RUNNING
              Set to Weird Al Yankovic. While a process can figure out that it’s in  an  unprivileged  container
              and  what  namespaces are active without this hint, that can be messy, and there is no way to tell
              that it’s a Charliecloud container specifically. This  variable  makes  such  a  test  simple  and
              well-defined. (Note: This variable is unaffected by --unset-env.)

       $HOME  If --home is specified, then your home directory is bind-mounted into the guest at /home/$USER. If
              you  also  have  a different home directory path on the host, an inherited $HOME will be incorrect
              inside the guest, which confuses lots of software, notably Spack. Thus, with --home, $HOME is  set
              to /home/$USER (by default, it is unchanged.)

       $PATH  Newer  Linux  distributions  replace  some  root-level directories, such as /bin, with symlinks to
              their counterparts in /usr.

              Some of these distributions (e.g., Fedora 24) have also dropped /bin from the default $PATH.  This
              is a problem when the guest OS does not have a merged /usr (e.g., Debian 8 “Jessie”). Thus, we add
              /bin to $PATH if it’s not already present.

              Further reading:

                 • The case for the /usr Merge

                 • Fedora

                 • Debian

       $TMPDIR
              Unset,  because  this is almost certainly a host path, and that host path is made available in the
              guest at /tmp unless --private-tmp is given.
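
       A program can use $CH_RUNNING, described above, to test whether it is inside a Charliecloud container.
       A minimal sketch, assuming only that the variable is defined inside the container; the helper name is
       hypothetical, and the test deliberately checks for presence rather than for a particular value.

```python
import os


def in_charliecloud(env=None):
    """Return True if CH_RUNNING is defined in the given environment.

    With no argument, the current process environment is inspected.
    """
    if env is None:
        env = os.environ
    return "CH_RUNNING" in env


if __name__ == "__main__":
    print("in Charliecloud container:", in_charliecloud())
```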

   Setting variables with --set-env or --set-env0
       The purpose of these two options is to set environment  variables  within  the  container.  Values  given
       replace  any  already  in the environment (i.e., inherited from the host shell) or set by earlier uses of
       the options. These flags take an optional argument with two possible forms:

       1. If the argument contains an equals sign (=, ASCII 61), that sets an environment variable directly. For
          example, to set FOO to the string value bar:

             $ ch-run --set-env=FOO=bar ...

          Single straight quotes around the value (', ASCII 39) are stripped, though be aware that both single
          and double quotes are also interpreted by the shell. For instance, the following is similar to the
          prior example; the double quotes are removed by the shell and the single quotes are removed by ch-run:

             $ ch-run --set-env="BAZ='qux'" ...

       2. If  the  argument does not contain an equals sign, it is a host path to a file containing zero or more
          variables using the same syntax as above (except with no prior shell processing).

          With --set-env, this file contains a sequence of assignments separated by newline (\n or ASCII 10);
          with --set-env0, the assignments are separated by the null byte (i.e., \0 or ASCII 0). Empty
          assignments are ignored, and no comments are interpreted. (This syntax is designed to accept the
          output of printenv and be easily produced by other simple mechanisms.) The file need not be seekable.

          For example:

             $ cat /tmp/env.txt
             FOO=bar
             BAZ='qux'
             $ ch-run --set-env=/tmp/env.txt ...

          For  directory  images only (because the file is read before containerizing), guest paths can be given
          by prepending the image path.

       3. If there is no argument, the file /ch/environment within the image is  used.  This  file  is  commonly
          populated by ENV instructions in the Dockerfile. For example, equivalently to form 2:

             $ cat Dockerfile
             [...]
             ENV FOO=bar
             ENV BAZ=qux
             [...]
             $ ch-image build -t foo .
             $ ch-convert foo /var/tmp/foo.sqfs
             $ ch-run --set-env /var/tmp/foo.sqfs -- ...

          (Note the image path is interpreted correctly, not as the --set-env argument.)

          At present, there is no way to use files other than /ch/environment within SquashFS images.
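
       The file syntax described in form 2 can be sketched as a small parser. This is an illustration under
       stated assumptions, not ch-run's implementation: the name parse_env is hypothetical, each assignment is
       split at its first equals sign, exactly one pair of surrounding single quotes is stripped from the
       value, and an entry without an equals sign raises an error (this page does not say what ch-run does in
       that case). Variable expansion is not performed here.

```python
def parse_env(text, delim="\n"):
    """Parse delimiter-separated NAME=VALUE assignments into (name, value) pairs.

    Use delim="\n" for --set-env style files and delim="\0" for
    --set-env0 style files. Empty assignments are ignored.
    """
    pairs = []
    for entry in text.split(delim):
        if entry == "":
            continue  # empty assignments are ignored
        name, sep, value = entry.partition("=")
        if not sep:
            # Assumption: treat a missing equals sign as an error.
            raise ValueError(f"no '=' in assignment: {entry!r}")
        # Strip one pair of single straight quotes around the value.
        if len(value) >= 2 and value[0] == "'" and value[-1] == "'":
            value = value[1:-1]
        pairs.append((name, value))
    return pairs


if __name__ == "__main__":
    print(parse_env("FOO=bar\nBAZ='qux'\n"))
```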

       Environment variables are expanded for values that look like search paths, unless --env-no-expand is
       given prior to --set-env. For expansion, the value is treated as a sequence of zero or more
       possibly-empty items separated by colon (:, ASCII 58). If an item begins with dollar sign ($, ASCII 36),
       then the rest of the item is the name of an environment variable. If this variable is set to a non-empty
       value, that value is substituted for the item; otherwise (i.e., the variable is unset or the empty
       string), the item is deleted, including a delimiter colon. The purpose of omitting empty expansions is
       to avoid surprising behavior such as an empty element in $PATH meaning the current directory.
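
       The expansion rule above can be sketched in a few lines. This is a simplified illustration rather than
       ch-run's code: the function name expand is hypothetical, the environment is passed as a plain mapping,
       and the sketch applies the rule to every value instead of first deciding whether the value "looks like"
       a search path.

```python
def expand(value, env):
    """Expand $NAME items in a colon-separated value.

    Items that begin with "$" are replaced by the named variable's value
    if it is set and non-empty; otherwise the item is dropped together
    with one delimiter colon. Other items, including empty ones, are
    kept unchanged.
    """
    kept = []
    for item in value.split(":"):
        if item.startswith("$"):
            substitute = env.get(item[1:], "")
            if substitute:
                kept.append(substitute)
            # Unset or empty variable: the item is deleted.
        else:
            kept.append(item)
    return ":".join(kept)


if __name__ == "__main__":
    env = {"PATH": "/usr/bin:/bin"}
    print(expand("/opt/bin:$PATH", env))
    print(expand("baz:$UNSET:qux", env))
```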

       For example, to set HOSTPATH to the search path in the current shell (this is expanded by ch-run,  though
       letting the shell do it happens to be equivalent):

          $ ch-run --set-env='HOSTPATH=$PATH' ...

       To prepend /opt/bin to this current search path:

          $ ch-run --set-env='PATH=/opt/bin:$PATH' ...

       To  prepend  /opt/bin  to  the  search  path  set  by  the  Dockerfile,  as  retrieved  from  guest  file
       /ch/environment (here we really cannot let the shell expand $PATH):

          $ ch-run --set-env --set-env='PATH=/opt/bin:$PATH' ...

       Examples of valid assignments, assuming that environment variable BAR is set to bar and UNSET is unset
       or set to the empty string:
                  ──────────────────────────────────────────────────────────────────────
                    Assignment                      Name    Value
                  ──────────────────────────────────────────────────────────────────────
                    FOO=bar                         FOO     bar
                    FOO=bar=baz                     FOO     bar=baz
                    FLAGS=-march=foo -mtune=bar     FLAGS   -march=foo -mtune=bar
                    FLAGS='-march=foo -mtune=bar'   FLAGS   -march=foo -mtune=bar
                    FOO=$BAR                        FOO     bar
                    FOO=$BAR:baz                    FOO     bar:baz
                    FOO=                            FOO     empty string
                    FOO=$UNSET                      FOO     empty string
                    FOO=baz:$UNSET:qux              FOO     baz:qux (not baz::qux)
                    FOO=:bar:baz::                  FOO     :bar:baz::
                    FOO=''                          FOO     empty string
                    FOO=''''                        FOO     '' (two single quotes)
                  ──────────────────────────────────────────────────────────────────────

EXAMPLES

       Run the command echo hello inside a Charliecloud container using the unpacked image at /data/foo:

          $ ch-run /data/foo -- echo hello
          hello

       Run an MPI job that can use CMA to communicate:

          $ srun ch-run --join /data/foo -- bar

SYSLOG

       By  default,  ch-run  logs  its  command  line  to  syslog.  (This  can  be  disabled by configuring with
       --disable-syslog.) This includes: (1) the invoking real UID, (2) the number of  command  line  arguments,
       and (3) the arguments, separated by spaces. For example:

          Dec 10 18:19:08 mybox ch-run: uid=1000 args=7: ch-run -v /var/tmp/00_tiny -- echo hello "wor l}\$d"

       Logging  is one of the first things done during program initialization, even before command line parsing.
       That is, almost all command lines are logged, even if erroneous, and  there  is  no  logging  of  program
       success or failure.

       Arguments  are  serialized  with  the  following  procedure.  The  purpose is to provide a human-readable
       reconstruction of the command line while also allowing each argument to be recovered byte-for-byte.

          • If an argument contains only printable ASCII bytes that are not whitespace, shell metacharacters,
            double quote (", ASCII 34 decimal), or backslash (\, ASCII 92), then log it unchanged.

          • Otherwise,  (a) enclose  the  argument  in  double  quotes  and  (b) backslash-escape double quotes,
            backslashes, and characters interpreted by Bash (including POSIX shells) within double quotes.
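
       The two serialization rules can be sketched as follows. This is an approximation, not ch-run's code:
       the function names are hypothetical, and the exact character sets are assumptions, since this page
       names neither the shell metacharacters that force quoting nor the full set of characters escaped
       within double quotes. With the sets chosen here, the sketch reproduces the example log line above.

```python
# Assumption: characters that force an argument to be quoted.
SHELL_META = set("|&;()<>*?[]{}~#!$`'")
# Assumption: characters backslash-escaped inside double quotes.
DQUOTE_ESCAPES = set('"\\$`!')


def serialize_arg(arg):
    """Serialize one argument for the log line."""
    plain = all(
        "!" <= ch <= "~"              # printable ASCII, not whitespace
        and ch not in SHELL_META
        and ch not in '"\\'
        for ch in arg
    )
    if plain:
        return arg                    # rule 1: log unchanged
    # Rule 2: enclose in double quotes and backslash-escape as needed.
    body = "".join("\\" + ch if ch in DQUOTE_ESCAPES else ch for ch in arg)
    return '"' + body + '"'


def format_log(uid, argv):
    """Build the message: real UID, argument count, serialized arguments."""
    return f"uid={uid} args={len(argv)}: " + " ".join(serialize_arg(a) for a in argv)


if __name__ == "__main__":
    argv = ["ch-run", "-v", "/var/tmp/00_tiny", "--", "echo", "hello", "wor l}$d"]
    print(format_log(1000, argv))
```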

       The verbatim command line typed in the shell cannot be  recovered,  because  not  enough  information  is
       provided  to UNIX programs. For example, echo  'foo' is given to programs as a sequence of two arguments,
       echo and foo; the two spaces and single quotes are removed by the shell. The zero byte, ASCII NUL, cannot
       appear in arguments because it would terminate the string.

EXIT STATUS

       If there is an error during containerization, ch-run exits with a non-zero status. If the user command
       is started successfully, the exit status is that of the user command, with one exception: if the image
       is an internally mounted SquashFS filesystem and the user command is killed by a signal, the exit status
       is 1 regardless of the signal value.

REPORTING BUGS

       If Charliecloud was obtained  from  your  Linux  distribution,  use  your  distribution’s  bug  reporting
       procedures.

       Otherwise, report bugs to: https://github.com/hpc/charliecloud/issues

SEE ALSO

       charliecloud(7)

       Full documentation at: <https://hpc.github.io/charliecloud>

COPYRIGHT

       2014–2023, Triad National Security, LLC and others

0.37                                          2024-04-01 05:37 UTC                                     CH-RUN(1)