
Provided by: manpages-posix-dev_2017a-2_all

PROLOG

       This  manual  page  is part of the POSIX Programmer's Manual.  The Linux implementation of this interface
       may differ (consult the corresponding Linux manual page for details of Linux behavior), or the  interface
       may not be implemented on Linux.

NAME

       regcomp, regerror, regexec, regfree — regular expression matching

SYNOPSIS

       #include <regex.h>

       int regcomp(regex_t *restrict preg, const char *restrict pattern,
           int cflags);
       size_t regerror(int errcode, const regex_t *restrict preg,
           char *restrict errbuf, size_t errbuf_size);
       int regexec(const regex_t *restrict preg, const char *restrict string,
           size_t nmatch, regmatch_t pmatch[restrict], int eflags);
       void regfree(regex_t *preg);

DESCRIPTION

       These  functions  interpret  basic  and extended regular expressions as described in the Base Definitions
       volume of POSIX.1‐2017, Chapter 9, Regular Expressions.

       The regex_t structure is defined in <regex.h> and contains at least the following member:
                             ┌───────────────┬──────────────┬───────────────────────────┐
                             │ Member Type   │ Member Name  │        Description        │
                             ├───────────────┼──────────────┼───────────────────────────┤
                             │ size_t        │re_nsub       │ Number  of  parenthesized │
                             │               │              │ subexpressions.           │
                             └───────────────┴──────────────┴───────────────────────────┘

       The regmatch_t structure is defined in <regex.h> and contains at least the following members:
                             ┌───────────────┬──────────────┬───────────────────────────┐
                             │ Member Type   │ Member Name  │        Description        │
                             ├───────────────┼──────────────┼───────────────────────────┤
                             │ regoff_t      │rm_so         │ Byte offset from start of │
                             │               │              │ string    to   start   of │
                             │               │              │ substring.                │
                             │ regoff_t      │rm_eo         │ Byte offset from start of │
                             │               │              │ string   of   the   first │
                             │               │              │ character  after  the end │
                             │               │              │ of substring.             │
                             └───────────────┴──────────────┴───────────────────────────┘

       The regcomp() function shall compile the regular expression contained in the string  pointed  to  by  the
       pattern  argument  and place the results in the structure pointed to by preg.  The cflags argument is the
       bitwise-inclusive OR of zero or more of the following flags, which are defined in the <regex.h> header:

       REG_EXTENDED  Use Extended Regular Expressions.

       REG_ICASE     Ignore case in match (see the Base Definitions volume of POSIX.1‐2017, Chapter  9,  Regular
                     Expressions).

       REG_NOSUB     Report only success/fail in regexec().

       REG_NEWLINE   Change the handling of <newline> characters, as described in the text.

       The  default  regular  expression  type  for  pattern  is a Basic Regular Expression. The application can
       specify Extended Regular Expressions using the REG_EXTENDED cflags flag.

       If the REG_NOSUB flag was not set  in  cflags,  then  regcomp()  shall  set  re_nsub  to  the  number  of
       parenthesized  subexpressions  (delimited  by  "\(\)"  in  basic  regular expressions or "()" in extended
       regular expressions) found in pattern.
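
       As a minimal sketch of how cflags selects the expression type and how re_nsub is reported, the
       following helper compiles a pattern and returns the subexpression count. The name
       count_subexpressions() is hypothetical and is not part of this interface.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: compile pattern with the given cflags and return
 * the number of parenthesized subexpressions, or -1 if regcomp() fails.
 * REG_NOSUB must not be set in cflags, since re_nsub is only required
 * to be set when that flag is absent.
 */
long
count_subexpressions(const char *pattern, int cflags)
{
    regex_t re;
    long    nsub;

    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    nsub = (long) re.re_nsub;
    regfree(&re);
    return nsub;
}
```

       With this sketch, "\(a\)\(b\)" compiled with cflags 0 and "(a)(b)" compiled with REG_EXTENDED both
       report two subexpressions, whereas "(a)(b)" compiled as a basic regular expression reports none,
       because unescaped parentheses are ordinary characters there.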

       The regexec() function compares the null-terminated string specified by string with the compiled  regular
       expression preg initialized by a previous call to regcomp().  If it finds a match, regexec() shall return
       0; otherwise, it shall return non-zero indicating either no match or an error. The eflags argument is the
       bitwise-inclusive OR of zero or more of the following flags, which are defined in the <regex.h> header:

       REG_NOTBOL    The  first  character  of the string pointed to by string is not the beginning of the line.
                     Therefore, the <circumflex> character ('^'), when taken as a special character,  shall  not
                     match the beginning of string.

       REG_NOTEOL    The  last  character  of  the  string  pointed  to  by  string  is not the end of the line.
                     Therefore, the <dollar-sign> ('$'), when taken as a special character, shall not match  the
                     end of string.
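
       The effect of these two flags can be sketched with a small wrapper that runs one match with caller-
       supplied eflags. The helper name matches_with_eflags() is an assumption made for illustration only.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: match the basic regular expression pattern
 * against string using the given eflags.
 * Returns 1 for a match, 0 for REG_NOMATCH, -1 for any other failure.
 */
int
matches_with_eflags(const char *pattern, const char *string, int eflags)
{
    regex_t re;
    int     status;

    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return -1;
    status = regexec(&re, string, 0, NULL, eflags);
    regfree(&re);
    if (status == 0)
        return 1;
    return (status == REG_NOMATCH) ? 0 : -1;
}
```

       For example, "^abc" matches "abcdef" when eflags is 0 but not when REG_NOTBOL is set, and "def$"
       matches "abcdef" when eflags is 0 but not when REG_NOTEOL is set. A pattern without anchors is not
       affected by either flag.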

       If  nmatch is 0 or REG_NOSUB was set in the cflags argument to regcomp(), then regexec() shall ignore the
       pmatch argument. Otherwise, the application shall ensure that the pmatch argument points to an array with
       at least nmatch elements, and regexec() shall fill in the elements of that  array  with  offsets  of  the
       substrings  of  string  that  correspond  to the parenthesized subexpressions of pattern: pmatch[i].rm_so
       shall be the byte offset of the beginning and pmatch[i].rm_eo shall be one greater than the  byte  offset
       of  the  end  of substring i.  (Subexpression i begins at the ith matched open parenthesis, counting from
       1.) Offsets in pmatch[0] identify the substring that corresponds to the entire regular expression. Unused
       elements of pmatch up to pmatch[nmatch-1] shall be  filled  with  -1.  If  there  are  more  than  nmatch
       subexpressions  in  pattern (pattern itself counts as a subexpression), then regexec() shall still do the
       match, but shall record only the first nmatch substrings.
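
       The use of pmatch described above can be sketched as follows. The helper extract_group() and its
       fixed pattern are hypothetical; the sketch only shows how rm_so and rm_eo delimit a substring of
       string.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: match the ERE "([a-z]+)=([0-9]+)" against string
 * and copy the text of subexpression idx (0 = whole match) into out.
 * Returns the substring length, or -1 on no match, error, or if out is
 * too small to hold the substring and its terminating null.
 */
int
extract_group(const char *string, size_t idx, char *out, size_t outsize)
{
    regex_t    re;
    regmatch_t pm[3];   /* pm[0] whole match, pm[1] name, pm[2] number */
    int        len = -1;

    if (idx >= 3 || regcomp(&re, "([a-z]+)=([0-9]+)", REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, string, 3, pm, 0) == 0 && pm[idx].rm_so != -1) {
        /* rm_eo is one past the last byte, so the length is rm_eo - rm_so. */
        size_t n = (size_t) (pm[idx].rm_eo - pm[idx].rm_so);

        if (n < outsize) {
            memcpy(out, string + pm[idx].rm_so, n);
            out[n] = '\0';
            len = (int) n;
        }
    }
    regfree(&re);
    return len;
}
```

       Given the string "set width=640 now" in the POSIX locale, index 0 yields "width=640", index 1 yields
       "width", and index 2 yields "640".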

       When matching a basic or extended regular expression, any given parenthesized  subexpression  of  pattern
       might  participate  in  the  match  of  several different substrings of string, or it might not match any
       substring even though the pattern as a whole did match. The following rules shall be  used  to  determine
       which substrings to report in pmatch when matching regular expressions:

        1. If  subexpression  i  in  a  regular expression is not contained within another subexpression, and it
           participated in the match several times, then the byte offsets in pmatch[i] shall  delimit  the  last
           such match.

        2. If  subexpression  i  is not contained within another subexpression, and it did not participate in an
           otherwise successful match, the byte offsets in pmatch[i] shall  be  -1.  A  subexpression  does  not
           participate in the match when:

           '*' or "\{\}" appears immediately after the subexpression in a basic regular expression, or '*', '?',
           or  "{}"  appears  immediately  after  the  subexpression  in an extended regular expression, and the
           subexpression did not match (matched 0 times)

           or:

                  '|' is used in an extended regular expression to select this subexpression or another, and the
                  other subexpression matched.

        3. If subexpression i is contained within another subexpression j, and i is  not  contained  within  any
           other  subexpression  that  is  contained  within  j,  and  a match of subexpression j is reported in
           pmatch[j], then the match or non-match of subexpression i reported in pmatch[i] shall be as described
           in 1. and 2. above, but within the substring reported in pmatch[j] rather than the whole string.  The
           offsets in pmatch[i] are still relative to the start of string.

        4. If  subexpression  i  is contained in subexpression j, and the byte offsets in pmatch[j] are -1, then
           the byte offsets in pmatch[i] shall also be -1.

        5. If subexpression i matched a zero-length string, then both byte offsets in  pmatch[i]  shall  be  the
           byte offset of the character or null terminator immediately following the zero-length string.
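
       Rules 1 and 2 can be observed with a sketch such as the following. The helper group_offsets() is
       hypothetical, and the fixed array size of four elements is an arbitrary choice for the illustration.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: compile the ERE pattern, match it against string,
 * and store the offsets of pmatch element idx (0..3) in *so and *eo.
 * Returns 0 on a successful match, non-zero otherwise.
 */
int
group_offsets(const char *pattern, const char *string, size_t idx,
    long *so, long *eo)
{
    regex_t    re;
    regmatch_t pm[4];
    int        status;

    if (idx >= 4)
        return -1;
    status = regcomp(&re, pattern, REG_EXTENDED);
    if (status != 0)
        return status;
    status = regexec(&re, string, 4, pm, 0);
    if (status == 0) {
        *so = (long) pm[idx].rm_so;
        *eo = (long) pm[idx].rm_eo;
    }
    regfree(&re);
    return status;
}
```

       Matching "(a|b)*c" against "abc" reports offsets 1 and 2 for subexpression 1, the last of its two
       matches (rule 1). Matching "(a)|(b)" against "b" reports -1 for subexpression 1, which did not
       participate, and offsets 0 and 1 for subexpression 2 (rule 2).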

       If,  when regexec() is called, the locale is different from when the regular expression was compiled, the
       result is undefined.

       If REG_NEWLINE is not set in cflags, then a <newline> in  pattern  or  string  shall  be  treated  as  an
       ordinary  character.  If  REG_NEWLINE  is  set,  then <newline> shall be treated as an ordinary character
       except as follows:

        1. A <newline> in string shall not be matched by a <period> outside a bracket expression or by any  form
           of  a  non-matching  list  (see  the  Base  Definitions  volume  of  POSIX.1‐2017, Chapter 9, Regular
           Expressions).

        2. A <circumflex> ('^') in pattern, when used to specify expression anchoring (see the Base  Definitions
           volume  of POSIX.1‐2017, Section 9.3.8, BRE Expression Anchoring), shall match the zero-length string
           immediately after a <newline> in string, regardless of the setting of REG_NOTBOL.

        3. A <dollar-sign> ('$') in pattern, when used to specify expression anchoring, shall  match  the  zero-
           length string immediately before a <newline> in string, regardless of the setting of REG_NOTEOL.
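
       The three exceptions above can be illustrated with a short sketch that compares matching with and
       without REG_NEWLINE. The helper match_start() is an assumed name, not part of this interface.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: search string for the ERE pattern, compiled with
 * cflags OR'ed with REG_EXTENDED. Returns the byte offset at which the
 * match starts, or -1 if there is no match or the pattern is invalid.
 */
long
match_start(const char *pattern, const char *string, int cflags)
{
    regex_t    re;
    regmatch_t pm;
    long       start = -1;

    if (regcomp(&re, pattern, cflags | REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, string, 1, &pm, 0) == 0)
        start = (long) pm.rm_so;
    regfree(&re);
    return start;
}
```

       With the string "one\ntwo\nthree", the pattern "^two$" does not match when cflags is 0, but matches
       at offset 4 when REG_NEWLINE is set. Conversely, "e.t" matches across the <newline> in "one\ntwo"
       only when REG_NEWLINE is not set.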

       The regfree() function frees any memory allocated by regcomp() associated with preg.

       The  following  constants  are  defined  as the minimum set of error return values, although other errors
       listed as implementation extensions in <regex.h> are possible:

       REG_BADBR     Content of "\{\}" invalid: not a number, number too large, more  than  two  numbers,  first
                     larger than second.

       REG_BADPAT    Invalid regular expression.

       REG_BADRPT    '?', '*', or '+' not preceded by valid regular expression.

       REG_EBRACE    "\{\}" imbalance.

       REG_EBRACK    "[]" imbalance.

       REG_ECOLLATE  Invalid collating element referenced.

       REG_ECTYPE    Invalid character class type referenced.

       REG_EESCAPE   Trailing <backslash> character in pattern.

       REG_EPAREN    "\(\)" or "()" imbalance.

       REG_ERANGE    Invalid endpoint in range expression.

       REG_ESPACE    Out of memory.

       REG_ESUBREG   Number in "\digit" invalid or in error.

       REG_NOMATCH   regexec() failed to match.

       If  more  than  one  error occurs in processing a function call, any one of the possible constants may be
       returned, as the order of detection is unspecified.

       The regerror() function provides a mapping from error  codes  returned  by  regcomp()  and  regexec()  to
       unspecified  printable strings. It generates a string corresponding to the value of the errcode argument,
       which the application shall ensure is the last non-zero value returned by regcomp() or regexec() with the
       given value of preg.  If errcode is not such a value, the content of the generated string is unspecified.

       If preg is a null pointer, but errcode is a value returned by a previous call to regexec() or  regcomp(),
       regerror() still generates an error string corresponding to the value of errcode, but it might not be
       as detailed under some implementations.

       If the errbuf_size argument is not 0, regerror() shall place the generated string into the buffer of size
       errbuf_size bytes pointed to by errbuf.  If the string (including the terminating null) cannot fit in the
       buffer, regerror() shall truncate the string and null-terminate the result.

       If  errbuf_size  is  0,  regerror()  shall  ignore the errbuf argument, and return the size of the buffer
       needed to hold the generated string.
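
       Because regerror() returns the size needed for the full string, a caller with a fixed buffer can
       detect truncation by comparing the return value with the buffer size. The following is a minimal
       sketch; the helper name error_text() is hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: write the message for errcode into buf, which
 * must hold at least one byte. Returns 1 if the message was truncated
 * to fit, 0 if the whole message (with its terminating null) fit.
 */
int
error_text(int errcode, const regex_t *preg, char *buf, size_t bufsize)
{
    size_t needed = regerror(errcode, preg, buf, bufsize);

    return needed > bufsize;
}
```

       In either case the buffer is null-terminated, so it can be printed directly; the return value only
       tells the caller whether a larger buffer would yield more text.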

       If the preg argument to regexec()  or  regfree()  is  not  a  compiled  regular  expression  returned  by
       regcomp(), the result is undefined. A preg is no longer treated as a compiled regular expression after it
       is given to regfree().

RETURN VALUE

       Upon  successful completion, the regcomp() function shall return 0. Otherwise, it shall return an integer
       value indicating an error as described in <regex.h>, and the content of preg is undefined. If a  code  is
       returned, the interpretation shall be as given in <regex.h>.

       If  regcomp()  detects  an  invalid RE, it may return REG_BADPAT, or it may return one of the error codes
       that more precisely describes the error.

       Upon successful completion, the regexec() function shall return 0. Otherwise, it shall return REG_NOMATCH
       to indicate no match.

       Upon successful completion, the regerror() function shall return the number of bytes needed to  hold  the
       entire generated string, including the null termination. If the return value is greater than errbuf_size,
       the string returned in the buffer pointed to by errbuf has been truncated.

       The regfree() function shall not return a value.

ERRORS

       No errors are defined.

       The following sections are informative.

EXAMPLES

           #include <regex.h>
           #include <stddef.h>     /* NULL */

           /*
            * Match string against the extended regular expression in
            * pattern, treating errors as no match.
            *
            * Return 1 for match, 0 for no match.
            */

           int
           match(const char *string, const char *pattern)
           {
               int    status;
               regex_t    re;

               if (regcomp(&re, pattern, REG_EXTENDED|REG_NOSUB) != 0) {
                   return(0);      /* Compile error: treat as no match. */
               }
               status = regexec(&re, string, (size_t) 0, NULL, 0);
               regfree(&re);
               if (status != 0) {
                   return(0);      /* No match, or an error. */
               }
               return(1);
           }

       The following demonstrates how the REG_NOTBOL flag could be used with regexec() to find all substrings in
       a  line  that  match  a  pattern  supplied  by a user.  (For simplicity of the example, very little error
       checking is done.)

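           (void) regcomp (&re, pattern, 0);
           p = buffer;  /* char *p: start of the part not yet searched. */
           /* This call to regexec() finds the first match on the line. */
           error = regexec (&re, p, 1, &pm, 0);
           while (error == 0) {  /* While matches found. */
               /* Substring found between p + pm.rm_so and p + pm.rm_eo. */
               if (pm.rm_eo == 0)
                   break;  /* Empty match at p: stop rather than loop forever. */
               p += pm.rm_eo;  /* Offsets are relative to p, not to buffer. */
               /* This call to regexec() finds the next match. */
               error = regexec (&re, p, 1, &pm, REG_NOTBOL);
           }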
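
       A fuller sketch of the same loop is given below as a self-contained function. The name find_all(),
       its parameters, and the way empty matches are stepped over are assumptions made for illustration.
       The offsets filled in by regexec() are relative to the string passed to that call, so the sketch
       tracks the start of the unsearched remainder and converts each offset back to an offset in the line.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: count the non-overlapping matches of the basic
 * regular expression pattern in line, storing the start offset (relative
 * to line) of up to max matches in starts.
 * Returns the number of matches found, or -1 if regcomp() fails.
 */
int
find_all(const char *pattern, const char *line, size_t *starts, size_t max)
{
    regex_t     re;
    regmatch_t  pm;
    const char *p = line;       /* start of the unsearched remainder */
    int         count = 0;
    int         eflags = 0;

    if (regcomp(&re, pattern, 0) != 0)
        return -1;
    while (regexec(&re, p, 1, &pm, eflags) == 0) {
        /* pm offsets are relative to p; convert to offsets in line. */
        if ((size_t) count < max)
            starts[count] = (size_t) (p - line) + (size_t) pm.rm_so;
        count++;
        if (pm.rm_eo == 0) {
            /*
             * Empty match at p: step one byte so the loop advances.
             * (A single byte step is a simplification that ignores
             * multi-byte characters.)
             */
            if (*p == '\0')
                break;
            p++;
        } else {
            p += pm.rm_eo;
        }
        eflags = REG_NOTBOL;    /* p no longer points at the line start */
    }
    regfree(&re);
    return count;
}
```

       For the line "ab cab ab" and the pattern "ab", this sketch finds three matches, at offsets 0, 4, and
       7. For the pattern "^ab" and the line "abab" it finds only one, because REG_NOTBOL prevents the
       anchor from matching at the start of the remainder.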

APPLICATION USAGE

       An application could use:

           regerror(code,preg,(char *)NULL,(size_t)0)

       to find out how big a buffer is needed for the generated string, malloc() a buffer to  hold  the  string,
       and then call regerror() again to get the string. Alternatively, it could allocate a fixed, static buffer
       that  is  big  enough to hold most strings, and then use malloc() to allocate a larger buffer if it finds
       that this is too small.
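
       The two-call approach described above might look like the following sketch. The helper name
       error_message() is hypothetical, and the caller is assumed to free() the returned string.

```c
#include <regex.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return a malloc()ed copy of the message for
 * errcode, or NULL if the allocation fails. The caller frees the result.
 */
char *
error_message(int errcode, const regex_t *preg)
{
    /* First call: errbuf_size of 0 only reports the size needed. */
    size_t needed = regerror(errcode, preg, NULL, 0);
    char  *msg;

    if (needed == 0)
        return NULL;
    msg = malloc(needed);
    /* Second call: the buffer is now large enough for the whole string. */
    if (msg != NULL)
        (void) regerror(errcode, preg, msg, needed);
    return msg;
}
```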

       To match a pattern as described in the Shell and Utilities volume of POSIX.1‐2017, Section 2.13,  Pattern
       Matching Notation, use the fnmatch() function.

RATIONALE

The regexec() function must fill in all nmatch elements of pmatch, where nmatch and pmatch are supplied
by the application, even if some elements of pmatch do not correspond to subexpressions in pattern. The
application developer should note that there is probably no reason for using a value of nmatch that is
larger than preg->re_nsub+1.
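
One way to follow this advice is to size the pmatch array from re_nsub after compiling. The sketch below
assumes a hypothetical helper count_reported() that allocates exactly re_nsub+1 elements and counts how
many of them report a match.

```c
#include <regex.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper: match the ERE pattern against string using a
 * pmatch array of exactly re_nsub + 1 elements. Returns the number of
 * elements that report a match (rm_so != -1), or -1 on no match or error.
 */
int
count_reported(const char *pattern, const char *string)
{
    regex_t     re;
    regmatch_t *pm;
    size_t      n, i;
    int         reported = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    n = re.re_nsub + 1;         /* whole match plus each subexpression */
    pm = malloc(n * sizeof *pm);
    if (pm != NULL && regexec(&re, string, n, pm, 0) == 0) {
        reported = 0;
        for (i = 0; i < n; i++)
            if (pm[i].rm_so != -1)
                reported++;
    }
    free(pm);
    regfree(&re);
    return reported;
}
```

For "(a)(b)(c)" matched against "abc" all four elements report a match; for "(a)|(b)" matched against "b"
only two of the three do, since the first subexpression did not participate.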

The REG_NEWLINE flag supports a use of RE matching that is needed in some applications like text editors.
In such applications, the user supplies an RE asking the application to find a line that matches the
given expression. An anchor in such an RE anchors at the beginning or end of any line. Such an
application can pass a sequence of <newline>-separated lines to regexec() as a single long string and
specify REG_NEWLINE to regcomp() to get the desired behavior. The application must ensure that there are
no explicit <newline> characters in pattern if it wants to ensure that any match occurs entirely within a
single line.

The REG_NEWLINE flag affects the behavior of regexec(), but it is in the cflags parameter to regcomp() to
allow flexibility of implementation. Some implementations will want to generate the same compiled RE in
regcomp() regardless of the setting of REG_NEWLINE and have regexec() handle anchors differently based on
the setting of the flag. Other implementations will generate different compiled REs based on the
REG_NEWLINE.

The REG_ICASE flag supports the operations taken by the grep -i option and the historical implementations
of ex and vi. Including this flag will make it easier for application code to be written that does the
same thing as these utilities.

The substrings reported in pmatch[] are defined using offsets from the start of the string rather than
pointers. This allows type-safe access to both constant and non-constant strings.

The type regoff_t is used for the elements of pmatch[] to ensure that the application can represent large
arrays in memory (important for an application conforming to the Shell and Utilities volume of
POSIX.1‐2017).

The 1992 edition of this standard required regoff_t to be at least as wide as off_t, to facilitate future
extensions in which the string to be searched is taken from a file. However, these future extensions have
not appeared. The requirement rules out popular implementations with 32-bit regoff_t and 64-bit off_t,
so it has been removed.

The standard developers rejected the inclusion of a regsub() function that would be used to do
substitutions for a matched RE. While such a routine would be useful to some applications, its utility
would be much more limited than the matching function described here. Both RE parsing and substitution
are possible to implement without support other than that required by the ISO C standard, but matching is
much more complex than substituting. The only difficult part of substitution, given the information
supplied by regexec(), is finding the next character in a string when there can be multi-byte characters.
That is a much larger issue, and one that needs a more general solution.
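
As an illustration of how little is needed for simple substitution once regexec() has supplied the
offsets, the sketch below replaces the first match with literal text. The function replace_first() is
hypothetical and is not part of POSIX; it does not interpret back-references in the replacement and
treats the string as bytes.

```c
#include <regex.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of POSIX): return a malloc()ed copy of
 * string with the first match of the ERE pattern replaced by the literal
 * text repl. Returns NULL on no match, an invalid pattern, or allocation
 * failure. The caller frees the result.
 */
char *
replace_first(const char *pattern, const char *string, const char *repl)
{
    regex_t    re;
    regmatch_t pm;
    char      *out = NULL;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return NULL;
    if (regexec(&re, string, 1, &pm, 0) == 0) {
        size_t head = (size_t) pm.rm_so;            /* bytes before the match */
        size_t tail = strlen(string + pm.rm_eo);    /* bytes after the match */
        size_t rlen = strlen(repl);

        out = malloc(head + rlen + tail + 1);
        if (out != NULL) {
            memcpy(out, string, head);
            memcpy(out + head, repl, rlen);
            /* tail + 1 also copies the terminating null. */
            memcpy(out + head + rlen, string + pm.rm_eo, tail + 1);
        }
    }
    regfree(&re);
    return out;
}
```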

The errno variable has not been used for error returns to avoid filling the errno name space for this
feature.

The interface is defined so that the matched substring offsets rm_so and rm_eo are in a separate regmatch_t
structure instead of in regex_t. This allows a single compiled RE to be used simultaneously in several
contexts; in main() and a signal handler, perhaps, or in multiple threads of lightweight processes. (The
preg argument to regexec() is declared with type const, so the implementation is not permitted to use the
structure to store intermediate results.) It also allows an application to request an arbitrary number of
substrings from an RE. The number of subexpressions in the RE is reported in re_nsub in preg. With this
change to regexec(), consideration was given to dropping the REG_NOSUB flag since the user can now
specify this with a zero nmatch argument to regexec(). However, keeping REG_NOSUB allows an
implementation to use a different (perhaps more efficient) algorithm if it knows in regcomp() that no
subexpressions need be reported. The implementation is only required to fill in pmatch if nmatch is not
zero and if REG_NOSUB is not specified. Note that the size_t type, as defined in the ISO C standard, is
unsigned, so the description of regexec() does not need to address negative values of nmatch.

REG_NOTBOL was added to allow an application to do repeated searches for the same pattern in a line. If
the pattern contains a <circumflex> character that should match the beginning of a line, then the pattern
should only match when matched against the beginning of the line. Without the REG_NOTBOL flag, the
application could rewrite the expression for subsequent matches, but in the general case this would
require parsing the expression. The need for REG_NOTEOL is not as clear; it was added for symmetry.

The addition of the regerror() function addresses the historical need for conforming application programs
to have access to error information more than "Function failed to compile/match your RE for unknown
reasons".

This interface provides for two different methods of dealing with error conditions. The specific error
codes (REG_EBRACE, for example), defined in <regex.h>, allow an application to recover from an error if
it is so able. Many applications, especially those that use patterns supplied by a user, will not try to
deal with specific error cases, but will just use regerror() to obtain a human-readable error message to
present to the user.

The regerror() function uses a scheme similar to confstr() to deal with the problem of allocating memory
to hold the generated string. The scheme used by strerror() in the ISO C standard was considered
unacceptable since it creates difficulties for multi-threaded applications.

The preg argument is provided to regerror() to allow an implementation to generate a more descriptive
message than would be possible with errcode alone. An implementation might, for example, save the
character offset of the offending character of the pattern in a field of preg, and then include that in
the generated message string. The implementation may also ignore preg.

A REG_FILENAME flag was considered, but omitted. This flag caused regexec() to match patterns as
described in the Shell and Utilities volume of POSIX.1‐2017, Section 2.13, Pattern Matching Notation
instead of REs. This service is now provided by the fnmatch() function.

Notice that there is a difference in philosophy between the ISO POSIX‐2:1993 standard and POSIX.1‐2008 in
how to handle a "bad" regular expression. The ISO POSIX‐2:1993 standard says that many bad constructs
"produce undefined results", or that "the interpretation is undefined". POSIX.1‐2008, however, says
that the interpretation of such REs is unspecified. The term "undefined" means that the action by the
application is an error, of similar severity to passing a bad pointer to a function.

The regcomp() and regexec() functions are required to accept any null-terminated string as the pattern
argument. If the meaning of the string is "undefined", the behavior of the function is "unspecified".
POSIX.1‐2008 does not specify how the functions will interpret the pattern; they might return error
codes, or they might do pattern matching in some completely unexpected way, but they should not do
something like abort the process.

FUTURE DIRECTIONS

       None.

COPYRIGHT

       Portions of this text are reprinted and reproduced in electronic form from IEEE Std 1003.1-2017, Standard
       for  Information  Technology  --  Portable  Operating  System  Interface  (POSIX),  The  Open  Group Base
       Specifications Issue 7, 2018 Edition, Copyright (C) 2018 by the Institute of Electrical  and  Electronics
       Engineers, Inc and The Open Group.  In the event of any discrepancy between this version and the original
       IEEE  and The Open Group Standard, the original IEEE and The Open Group Standard is the referee document.
       The original Standard can be obtained online at http://www.opengroup.org/unix/online.html .

       Any typographical or formatting errors that appear in this page are most likely to have  been  introduced
       during   the   conversion  of  the  source  files  to  man  page  format.  To  report  such  errors,  see
       https://www.kernel.org/doc/man-pages/reporting_bugs.html .

IEEE/The Open Group                                   2017                                       REGCOMP(3POSIX)