Ubuntu Manpage: Test::utf8 - handy utf8 tests

Provided by: libtest-utf8-perl_1.03-1_all

NAME

       Test::utf8 - handy utf8 tests

SYNOPSIS

         # check the string is good
         is_valid_string($string);   # check the string is valid
         is_sane_utf8($string);      # check not double encoded

         # check the string has certain attributes
         is_flagged_utf8($string1);   # has utf8 flag set
         is_within_ascii($string2);   # only has ascii chars in it
         isnt_within_ascii($string3); # has chars outside the ascii range
         is_within_latin_1($string4); # only has latin-1 chars in it
         isnt_within_latin_1($string5); # has chars outside the latin-1 range

DESCRIPTION

       This module is a collection of tests useful for dealing with utf8 strings in Perl.

       This module has two types of tests: the validity tests check whether a string is valid and not
       corrupt, whereas the characteristic tests check that a string has a given set of characteristics.

   Validity Tests
       is_valid_string($string, $testname)
           Checks if the string is "valid", i.e. this passes and returns true unless the internal utf8 flag
           has been set on a scalar that isn't made up of a valid utf-8 byte sequence.

           This should never happen and, in theory, this test should always pass. Unless you (or a module you
           use) go monkeying around inside a scalar using Encode's private functions or XS code, you shouldn't
           ever end up in a situation where you've got a corrupt scalar.  But if you do, this function should
           help you detect the problem.

           To be clear, here's an example of the error case this can detect:

             my $mark = "Mark";
             my $leon = "L\x{e9}on";
             is_valid_string($mark);  # passes, not utf-8
             is_valid_string($leon);  # passes, not utf-8

             my $iloveny = "I \x{2665} NY";
             is_valid_string($iloveny);      # passes, proper utf-8

             my $acme = "L\x{c3}\x{a9}on";
             Encode::_utf8_on($acme);      # (please don't do things like this)
             is_valid_string($acme);       # passes, proper utf-8 byte sequence upgraded

             Encode::_utf8_on($leon);      # (this is why you don't do things like this)
             is_valid_string($leon);       # fails! the byte \x{e9} isn't valid utf-8
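
           The corrupt state in the last case can also be observed with perl's core functions.  The
           following is a minimal sketch, not the way Test::utf8 itself is implemented.  It assumes only
           the core Encode module and the built-in utf8::is_utf8 and utf8::valid functions; the helper
           name describe_scalar is hypothetical.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: report the internal state of a scalar using
# only core functions (an illustration, not the module's own code).
sub describe_scalar {
    my ($string) = @_;
    return "not flagged"    unless utf8::is_utf8($string);
    return "flagged, valid" if utf8::valid($string);
    return "flagged, corrupt";
}

my $leon    = "L\x{e9}on";       # latin-1 character, stored as bytes
my $iloveny = "I \x{2665} NY";   # needs utf8 internally, flagged by perl

print "leon:    ", describe_scalar($leon), "\n";
print "iloveny: ", describe_scalar($iloveny), "\n";

# Forcing the flag on without re-encoding leaves the lone byte \x{e9}
# in the buffer, which is not a valid utf-8 sequence.
Encode::_utf8_on($leon);         # (please don't do things like this)
print "leon after _utf8_on: ", describe_scalar($leon), "\n";
```

           In a test script, is_valid_string remains the better choice because it reports the problem
           as a test failure with diagnostics.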

       is_sane_utf8($string, $name)
           This test fails if the string contains something that looks like it might be dodgy utf8, i.e.
           something that looks like the multi-byte sequence for a latin-1 character but that perl hasn't
           been instructed to treat as such.  Strings that are not utf8 always automatically pass.

           Some examples may help:

             # This will pass as it's a normal latin-1 string
             is_sane_utf8("Hello L\x{e9}on");

             # this will fail because the \x{c3}\x{a9} looks like the
             # utf8 byte sequence for e-acute
             my $string = "Hello L\x{c3}\x{a9}on";
             is_sane_utf8($string);

             # this will pass because the utf8 is correctly interpreted as utf8
             Encode::_utf8_on($string);
             is_sane_utf8($string);

           Obviously this isn't a hundred percent reliable.  The edge case where this will fail is where you
           have "\x{c2}" (which is "LATIN CAPITAL LETTER A WITH CIRCUMFLEX") or "\x{c3}" (which is "LATIN
           CAPITAL LETTER A WITH TILDE") followed by one of the latin-1 punctuation symbols.

             # a capital letter A with circumflex surrounded by smart quotes
             # this will fail because it'll see the "\x{c2}\x{94}" and think
             # it's actually the utf8 sequence for the end smart quote
             is_sane_utf8("\x{93}\x{c2}\x{94}");

           However, since this rarely comes up, this test is reasonably reliable in most cases.  Still, take
           care in cases where dynamic data is placed next to latin-1 punctuation, to avoid spurious
           failures.

           There are two situations that cause this test to fail.  Either the string contains utf8 byte
           sequences and hasn't been flagged as utf8 (this normally means that you got it from an external
           source like a C library; when Perl needs to store a string internally as utf8 it does its own
           encoding and flagging transparently), or a utf8 flagged string contains byte sequences that,
           when translated to characters, themselves look like a utf8 byte sequence.  The test diagnostics
           tell you which is the case.
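
           The two failure situations can be reproduced with the core Encode module.  The sketch below is
           a simplified illustration under stated assumptions: the helper looks_double_encoded is
           hypothetical and uses a single regular expression covering only latin-1 characters, which is
           not necessarily how is_sane_utf8 performs its check.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical, simplified check: does the string contain a character
# pair that looks like the utf-8 encoding of a latin-1 character?
sub looks_double_encoded {
    my ($string) = @_;
    return $string =~ /[\x{c2}\x{c3}][\x{80}-\x{bf}]/ ? 1 : 0;
}

my $chars = "L\x{e9}on";                 # four characters
my $once  = encode('UTF-8', $chars);     # five bytes, not flagged
my $twice = encode('UTF-8', $once);      # seven bytes, encoded twice

# Situation 1: utf-8 bytes that were never decoded into characters.
print "raw bytes:      ", looks_double_encoded($once), "\n";

# Decoding once gives the intended characters, so nothing looks odd.
print "decoded once:   ", looks_double_encoded(decode('UTF-8', $once)), "\n";

# Situation 2: a flagged string whose characters still look like utf-8.
print "double encoded: ", looks_double_encoded(decode('UTF-8', $twice)), "\n";
```

           The same pattern also matches the edge case described earlier ("\x{93}\x{c2}\x{94}"), which is
           why a check of this kind cannot be completely reliable.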

   String Characteristic Tests
       These routines allow you to check the range of characters in a string.  Note that these routines are
       blind to the actual encoding perl internally uses to store the characters; they just check whether the
       string contains only characters that can be represented in the named encoding:

       is_within_ascii
           Tests that a string only contains characters that are in the ASCII character set.

       is_within_latin_1
           Tests that a string only contains characters that are in latin-1.
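
           As an illustration of what "blind to the internal encoding" means, the following sketch
           expresses the two range checks as regular expressions.  The helper names within_ascii and
           within_latin_1 are hypothetical stand-ins and are not part of Test::utf8; only the core
           utf8::upgrade function is assumed.

```perl
use strict;
use warnings;

# Hypothetical helpers: character-range checks written as regexes.
sub within_ascii   { return $_[0] =~ /\A[\x{00}-\x{7f}]*\z/ ? 1 : 0 }
sub within_latin_1 { return $_[0] =~ /\A[\x{00}-\x{ff}]*\z/ ? 1 : 0 }

my $leon  = "L\x{e9}on";        # e-acute is latin-1 but not ascii
my $heart = "I \x{2665} NY";    # U+2665 is outside latin-1

print "leon ascii:    ", within_ascii($leon),    "\n";
print "leon latin-1:  ", within_latin_1($leon),  "\n";
print "heart latin-1: ", within_latin_1($heart), "\n";

# Changing how perl stores the string does not change its characters,
# so the range checks give the same answers afterwards.
utf8::upgrade($leon);
print "upgraded leon ascii:   ", within_ascii($leon),   "\n";
print "upgraded leon latin-1: ", within_latin_1($leon), "\n";
```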

       The following routines simply check whether a scalar is or isn't flagged as utf8 by perl's internals:

       is_flagged_utf8($string, $name)
           Passes if the string is flagged by perl's internals as utf8, fails if it's not.

       isnt_flagged_utf8($string,$name)
           The opposite of "is_flagged_utf8", passes if and only if the string isn't flagged as utf8  by  perl's
           internals.

           Note: you can refer to this function as "isn't_flagged_utf8" if you really want to.
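
           The flag describes how perl stores a string, not which characters it holds.  The sketch below
           illustrates this with the core utf8::is_utf8 and utf8::upgrade functions; the helper name
           flag_state is hypothetical and the code is an illustration rather than the module's
           implementation.

```perl
use strict;
use warnings;

# Hypothetical helper: describe the utf8 flag of a scalar.
sub flag_state { return utf8::is_utf8($_[0]) ? "flagged" : "not flagged" }

my $leon = "L\x{e9}on";   # stored as bytes, so not flagged
my $copy = $leon;
utf8::upgrade($copy);     # same characters, now stored as utf8

print "original: ", flag_state($leon), "\n";
print "upgraded: ", flag_state($copy), "\n";

# The two scalars differ only in internal storage; as strings of
# characters they still compare equal.
print "equal:    ", ($leon eq $copy ? "yes" : "no"), "\n";
```

           Because of this, a flag test is mainly useful for checking where a string came from (for
           example, whether it has been decoded), while the range tests describe its contents.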

AUTHOR

       Written by Mark Fowler mark@twoshortplanks.com

COPYRIGHT

       Copyright Mark Fowler 2004,2012.  All rights reserved.

       This  program  is  free  software;  you can redistribute it and/or modify it under the same terms as Perl
       itself.

BUGS

       None known.  Please report any to me via the CPAN RT system.  See http://rt.cpan.org/ for more details.
