Ubuntu Manpage: Regexp::Common - Provide commonly requested regular expressions

Provided by: libregexp-common-perl_2024080801-1_all

NAME

       Regexp::Common - Provide commonly requested regular expressions

SYNOPSIS

        # STANDARD USAGE

        use Regexp::Common;

        while (<>) {
            /$RE{num}{real}/               and print q{a number};
            /$RE{quoted}/                  and print q{a ['"`] quoted string};
           m[$RE{delimited}{-delim=>'/'}]  and print q{a /.../ sequence};
            /$RE{balanced}{-parens=>'()'}/ and print q{balanced parentheses};
            /$RE{profanity}/               and print q{a #*@%-ing word};
        }

        # SUBROUTINE-BASED INTERFACE

        use Regexp::Common 'RE_ALL';

        while (<>) {
            $_ =~ RE_num_real()              and print q{a number};
            $_ =~ RE_quoted()                and print q{a ['"`] quoted string};
            $_ =~ RE_delimited(-delim=>'/')  and print q{a /.../ sequence};
            $_ =~ RE_balanced(-parens=>'()'} and print q{balanced parentheses};
            $_ =~ RE_profanity()             and print q{a #*@%-ing word};
        }

        # IN-LINE MATCHING...

        if ( $RE{num}{int}->matches($text) ) {...}

        # ...AND SUBSTITUTION

        my $cropped = $RE{ws}{crop}->subs($uncropped);

        # ROLL-YOUR-OWN PATTERNS

        use Regexp::Common 'pattern';

        pattern name   => ['name', 'mine'],
                create => '(?i:J[.]?\s+A[.]?\s+Perl-Hacker)',
                ;

        my $name_matcher = $RE{name}{mine};

        pattern name    => [ 'lineof', '-char=_' ],
                create  => sub {
                               my $flags = shift;
                               my $char = quotemeta $flags->{-char};
                               return '(?:^$char+$)';
                           },
                match   => sub {
                               my ($self, $str) = @_;
                               return $str !~ /[^$self->{flags}{-char}]/;
                           },
                subs   => sub {
                               my ($self, $str, $replacement) = @_;
                               $_[1] =~ s/^$self->{flags}{-char}+$//g;
                          },
                ;

        my $asterisks = $RE{lineof}{-char=>'*'};

        # DECIDING WHICH PATTERNS TO LOAD.

        use Regexp::Common qw /comment number/;  # Comment and number patterns.
        use Regexp::Common qw /no_defaults/;     # Don't load any patterns.
        use Regexp::Common qw /!delimited/;      # All, but delimited patterns.

DESCRIPTION

       By default, this module exports a single hash (%RE) that stores or generates commonly needed regular
       expressions (see "List of available patterns").

       There is an alternative, subroutine-based syntax described in "Subroutine-based interface".

   General syntax for requesting patterns
       To access a particular pattern, %RE is treated as a hierarchical hash of hashes (of hashes...), with each
       successive key being an identifier. For example, to access the pattern that matches real numbers, you
       specify:

               $RE{num}{real}

       and to access the pattern that matches integers:

               $RE{num}{int}

       Deeper layers of the hash are used to specify flags: arguments that modify the resulting pattern in some
       way. The keys used to access these layers are prefixed with a minus sign and may have a value; if a value
       is given, it's done by using a multidimensional key.  For example, to access the pattern that matches
       base-2 real numbers with embedded commas separating groups of three digits (e.g. 10,101,110.110101101):

               $RE{num}{real}{-base => 2}{-sep => ','}{-group => 3}

       Through the magic of Perl, these flag layers may be specified in any order (and even interspersed through
       the identifier keys!)  so you could get the same pattern with:

               $RE{num}{real}{-sep => ','}{-group => 3}{-base => 2}

       or:

               $RE{num}{-base => 2}{real}{-group => 3}{-sep => ','}

       or even:

               $RE{-base => 2}{-group => 3}{-sep => ','}{num}{real}

       etc.

       Note, however, that the relative order of amongst the identifier keys is significant. That is:

               $RE{list}{set}

       would not be the same as:

               $RE{set}{list}

   Flag syntax
       In versions prior to 2.113, flags could also be written as "{"-flag=value"}". This no longer works,
       although "{"-flag$;value"}" still does. However, "{-flag => 'value'}" is the preferred syntax.

   Universal flags
       Normally, flags are specific to a single pattern.  However, there is two flags that all patterns may
       specify.

       "-keep"
           By  default,  the  patterns provided by %RE contain no capturing parentheses. However, if the "-keep"
           flag is specified (it requires no value) then any significant substrings that the pattern matches are
           captured. For example:

                   if ($str =~ $RE{num}{real}{-keep}) {
                           $number   = $1;
                           $whole    = $3;
                           $decimals = $5;
                   }

           Special care is needed if a "kept" pattern is interpolated into a larger regular expression,  as  the
           presence  of  other  capturing  parentheses  is  likely  to  change the "number variables" into which
           significant substrings are saved.

           See also "Adding new regular expressions", which describes how to create new patterns with "optional"
           capturing brackets that respond to "-keep".

       "-i"
           Some patterns or subpatterns only match lowercase or uppercase letters.  If one  wants  the  do  case
           insensitive  matching, one option is to use the "/i" regexp modifier, or the special sequence "(?i)".
           But if the functional interface is used, one does not have this option. The "-i" switch  solves  this
           problem; by using it, the pattern will do case insensitive matching.

   OO interface and inline matching/substitution
       The patterns returned from %RE are objects, so rather than writing:

               if ($str =~ /$RE{some}{pattern}/ ) {...}

       you can write:

               if ( $RE{some}{pattern}->matches($str) ) {...}

       For matching this would seem to have no great advantage apart from readability (but see below).

       For  substitutions, it has other significant benefits. Frequently you want to perform a substitution on a
       string without changing the original. Most people use this:

               $changed = $original;
               $changed =~ s/$RE{some}{pattern}/$replacement/;

       The more adept use:

               ($changed = $original) =~ s/$RE{some}{pattern}/$replacement/;

       Regexp::Common allows you do write this:

               $changed = $RE{some}{pattern}->subs($original=>$replacement);

       Apart from reducing precedence-angst, this approach  has  the  added  advantages  that  the  substitution
       behaviour  can  be  optimized  from the regular expression, and the replacement string can be provided by
       default (see "Adding new regular expressions").

       For example, in the implementation of this substitution:

               $cropped = $RE{ws}{crop}->subs($uncropped);

       the default empty string is provided automatically, and the substitution is optimized to use:

               $uncropped =~ s/^\s+//;
               $uncropped =~ s/\s+$//;

       rather than:

               $uncropped =~ s/^\s+|\s+$//g;

   Subroutine-based interface
       The hash-based interface was chosen because it  allows  regexes  to  be  effortlessly  interpolated,  and
       because it also allows them to be "curried". For example:

               my $num = $RE{num}{int};

               my $commad     = $num->{-sep=>','}{-group=>3};
               my $duodecimal = $num->{-base=>12};

       However,  the  use  of  tied  hashes does make the access to Regexp::Common patterns slower than it might
       otherwise be. In contexts where impatience overrules  laziness,  Regexp::Common  provides  an  additional
       subroutine-based interface.

       For  each  (sub-)entry  in  the  %RE  hash  ("$RE{key1}{key2}{etc}"), there is a corresponding exportable
       subroutine: RE_key1_key2_etc(). The name of each subroutine is the underscore-separated concatenation  of
       the non-flag keys that locate the same pattern in %RE. Flags are passed to the subroutine in its argument
       list. Thus:

               use Regexp::Common qw( RE_ws_crop RE_num_real RE_profanity );

               $str =~ RE_ws_crop() and die "Surrounded by whitespace";

               $str =~ RE_num_real(-base=>8, -sep=>" ") or next;

               $offensive = RE_profanity(-keep);
               $str =~ s/$offensive/$bad{$1}++; "<expletive deleted>"/ge;

       Note  that,  unlike  the  hash-based interface (which returns objects), these subroutines return ordinary
       "qr"'d regular expressions. Hence they do not curry, nor do they provide the OO  match  and  substitution
       inlining described in the previous section.

       It is also possible to export subroutines for all available patterns like so:

               use Regexp::Common 'RE_ALL';

       Or you can export all subroutines with a common prefix of keys like so:

               use Regexp::Common 'RE_num_ALL';

       which will export "RE_num_int" and "RE_num_real" (and if you have create more patterns who have first key
       num,  those will be exported as well). In general, RE_key1_..._keyn_ALL will export all subroutines whose
       pattern names have first keys key1 ... keyn.

   Adding new regular expressions
       You can add your own regular expressions to the %RE hash at  run-time,  using  the  exportable  "pattern"
       subroutine. It expects a hash-like list of key/value pairs that specify the behaviour of the pattern. The
       various possible argument pairs are:

       "name => [ @list ]"
           A  required  argument  that  specifies  the  name  of  the  pattern, and any flags it may take, via a
           reference to a list of strings. For example:

                    pattern name => [qw( line of -char )],
                            # other args here
                            ;

           This specifies an entry "$RE{line}{of}", which may take a "-char" flag.

           Flags may also be specified with a default value, which is then used whenever the flag  is  specified
           without an explicit value (but not when the flag is omitted). For example:

                    pattern name => [qw( line of -char=_ )],
                            # default char is '_'
                            # other args here
                            ;

       "create => $sub_ref_or_string"
           A required argument that specifies either a string that is to be returned as the pattern:

                   pattern name    => [qw( line of underscores )],
                           create  => q/(?:^_+$)/
                           ;

           or a reference to a subroutine that will be called to create the pattern:

                   pattern name    => [qw( line of -char=_ )],
                           create  => sub {
                                           my ($self, $flags) = @_;
                                           my $char = quotemeta $flags->{-char};
                                           return '(?:^$char+$)';
                                       },
                           ;

           If the subroutine version is used, the subroutine will be called with three arguments: a reference to
           the  pattern  object  itself,  a  reference  to  a  hash containing the flags and their values, and a
           reference to an array containing the non-flag keys.

           Whatever the subroutine returns is stringified as the pattern.

           No matter how the pattern is created, it is immediately postprocessed to include or exclude capturing
           parentheses (according to the value of the  "-keep"  flag).  To  specify  such  "optional"  capturing
           parentheses  within the regular expression associated with "create", use the notation "(?k:...)". Any
           parentheses of this type will be converted to  "(...)"   when  the  "-keep"  flag  is  specified,  or
           "(?:...)" when it is not.  It is a Regexp::Common convention that the outermost capturing parentheses
           always capture the entire pattern, but this is not enforced.

       "match => $sub_ref"
           An   optional   argument   that   specifies   a   subroutine   that   is   to   be  called  when  the
           "$RE{...}->matches(...)" method of this pattern is invoked.

           The subroutine should expect two arguments: a reference to the pattern object itself, and the  string
           to be matched against.

           It should return the same types of values as a "m/.../" does.

                pattern name    => [qw( line of -char )],
                        create  => sub {...},
                        match   => sub {
                                        my ($self, $str) = @_;
                                        $str !~ /[^$self->{flags}{-char}]/;
                                   },
                        ;

       "subs => $sub_ref"
           An  optional argument that specifies a subroutine that is to be called when the "$RE{...}->subs(...)"
           method of this pattern is invoked.

           The subroutine should expect three arguments: a reference to the pattern object itself, the string to
           be changed, and the value to be substituted into it.  The third argument may be  "undef",  indicating
           the default substitution is required.

           The subroutine should return the same types of values as an "s/.../.../" does.

           For example:

                pattern name    => [ 'lineof', '-char=_' ],
                        create  => sub {...},
                        subs    => sub {
                                     my ($self, $str, $ignore_replacement) = @_;
                                     $_[1] =~ s/^$self->{flags}{-char}+$//g;
                                   },
                        ;

           Note that such a subroutine will almost always need to modify $_[1] directly.

       "version => $minimum_perl_version"
           If  this argument is given, it specifies the minimum version of perl required to use the new pattern.
           Attempts to use the pattern with earlier versions of perl will generate a fatal diagnostic.

   Loading specific sets of patterns.
       By default, all the sets of patterns listed below  are  made  available.   However,  it  is  possible  to
       indicate  which  sets of patterns should be made available - the wanted sets should be given as arguments
       to "use". Alternatively, it is also possible to indicate which  sets  of  patterns  should  not  be  made
       available  -  those  sets  will  be  given  as  argument to the "use" statement, but are preceded with an
       exclaimation mark. The argument no_defaults indicates  none  of  the  default  patterns  should  be  made
       available. This is useful for instance if all you want is the pattern() subroutine.

       Examples:

        use Regexp::Common qw /comment number/;  # Comment and number patterns.
        use Regexp::Common qw /no_defaults/;     # Don't load any patterns.
        use Regexp::Common qw /!delimited/;      # All, but delimited patterns.

       It's  also  possible to load your own set of patterns. If you have a module "Regexp::Common::my_patterns"
       that makes patterns available, you can have it made available with

        use Regexp::Common qw /my_patterns/;

       Note that the default patterns will still be made available - only if you use no_defaults, or mention one
       of the default sets explicitly, the non mentioned defaults aren't made available.

   List of available patterns
       The patterns listed below are currently  available.  Each  set  of  patterns  has  its  own  manual  page
       describing  the  details. For each pattern set named name, the manual page Regexp::Common::name describes
       the details.

       Currently available are:

       Regexp::Common::balanced
           Provides regexes for strings with balanced parenthesized delimiters.

       Regexp::Common::comment
           Provides regexes for comments of various languages (43 languages currently).

       Regexp::Common::delimited
           Provides regexes for delimited strings.

       Regexp::Common::lingua
           Provides regexes for palindromes.

       Regexp::Common::list
           Provides regexes for lists.

       Regexp::Common::net
           Provides regexes for IPv4, IPv6, and MAC addresses.

       Regexp::Common::number
           Provides regexes for numbers (integers and reals).

       Regexp::Common::profanity
           Provides regexes for profanity.

       Regexp::Common::whitespace
           Provides regexes for leading and trailing whitespace.

       Regexp::Common::zip
           Provides regexes for zip codes.

   Forthcoming patterns and features
       Future releases of the module will also provide patterns for the following:

               * email addresses
               * HTML/XML tags
               * more numerical matchers,
               * mail headers (including multiline ones),
               * more URLS
               * telephone numbers of various countries
               * currency (universal 3 letter format, Latin-1, currency names)
               * dates
               * binary formats (e.g. UUencoded, MIMEd)

       If you have other patterns or pattern generators that you think would be generally  useful,  please  send
       them  to  the  maintainer  --  preferably as source code using the "pattern" subroutine. Submissions that
       include a set of tests will be especially welcome.

DIAGNOSTICS

       "Can't export unknown subroutine %s"
           The subroutine-based interface didn't recognize the requested subroutine.  Often caused by a spelling
           mistake or an incompletely specified name.

       "Can't create unknown regex: $RE{...}"
           Regexp::Common doesn't have a generator for the requested pattern.  Often  indicates  a  misspelt  or
           missing parameter.

        "Perl %f does not support the pattern $RE{...}. You need Perl %f or later"
           The  requested  pattern  requires advanced regex features (e.g. recursion) that not available in your
           version of Perl. Time to upgrade.

       "pattern() requires argument: name => [ @list ]"
           Every user-defined pattern specification must have a name.

       "pattern() requires argument: create => $sub_ref_or_string"
           Every user-defined pattern specification must provide a pattern creation mechanism: either a  pattern
           string or a reference to a subroutine that returns the pattern string.

       "Base must be between 1 and 36"
           The  "$RE{num}{real}{-base=>'N'}"  pattern  uses  the  characters [0-9A-Z] to represent the digits of
           various bases. Hence it only produces regular expressions for bases up to hexatricensimal.

       "Must specify delimiter in $RE{delimited}"
           The pattern has no default delimiter.  You  need  to  write:  "$RE{delimited}{-delim=>X'}"  for  some
           character X

ACKNOWLEDGEMENTS

       Deepest  thanks  to  the  many  people  who  have encouraged and contributed to this project, especially:
       Elijah, Jarkko, Tom, Nat, Ed, and Vivek.

       Further thanks go to: Alexandr Ciornii, Blair Zajac, Bob Stockdale, Charles Thomas, Chris Vertonghen, the
       CPAN Testers, David Hand, Fany, Geoffrey Leach, Hermann-Marcus Behrens, Jerome Quelin, Jim  Cromie,  Lars
       Wilke,  Linda Julien, Mike Arms, Mike Castle, Mikko, Murat Uenalan, Rafaël Garcia-Suarez, Ron Savage, Sam
       Vilain, Slaven Rezic, Smylers, Tim Maher, and all the others I've forgotten.

AUTHOR

       Damian Conway (damian@conway.org)

MAINTENANCE

       This package is maintained by Abigail (regexp-common@abigail.freedom.nl).

BUGS AND IRRITATIONS

       Bound to be plenty.

       For a start, there are many common regexes missing.  Send them in to regexp-common@abigail.freedom.nl.

       There are some POD issues when installing this module using a pre-5.6.0 perl; some manual pages  may  not
       install,  or  may  not install correctly using a perl that is that old. You might consider upgrading your
       perl.

NOT A BUG

       •   The various patterns are not anchored. That is, a pattern like "$RE {num} {int}" will  match  against
           "abc4def",  because a substring of the subject matches. This is by design, and not a bug. If you want
           the pattern to be anchored, use something like:

            my $integer = $RE {num} {int};
            $subj =~ /^$integer$/ and print "Matches!\n";

LICENSE and COPYRIGHT

       This software is Copyright (c) 2001 - 2024, Damian Conway and Abigail.

       This module is free software, and maybe used under any of the following licenses:

        1) The Perl Artistic License.     See the file COPYRIGHT.AL.
        2) The Perl Artistic License 2.0. See the file COPYRIGHT.AL2.
        3) The BSD License.               See the file COPYRIGHT.BSD.
        4) The MIT License.               See the file COPYRIGHT.MIT.

perl v5.38.2                                       2024-08-10                                Regexp::Common(3pm)