Provided by: libencoding-fixlatin-perl_1.04-4_all

NAME

       Encoding::FixLatin - takes mixed encoding input and produces UTF-8 output

SYNOPSIS

           use Encoding::FixLatin qw(fix_latin);

           my $utf8_string = fix_latin($mixed_encoding_string);

DESCRIPTION

       Most encoding conversion tools take input in one encoding and produce output in another encoding.  This
       module takes input which may contain characters in more than one encoding and makes a best effort to
       convert them all to UTF-8 output.

EXPORTS

       Nothing is exported by default.  The only public function is "fix_latin" which will be exported on
       request (as per SYNOPSIS).

FUNCTIONS

   fix_latin( string, options ... )
       Decodes the supplied 'string' and returns a UTF-8 version of the string.  The following rules are used:

       •   ASCII characters (single bytes in the range 0x00 - 0x7F) are passed through unchanged.

       •   Well-formed UTF-8 multi-byte characters are also passed through unchanged.

       •   UTF-8 multi-byte characters which are over-long but otherwise well-formed are converted to the
           shortest UTF-8 normal form.

       •   Bytes in the range 0xA0 - 0xFF are assumed to be  Latin-1  characters  (ISO8859-1  encoded)  and  are
           converted to UTF-8.

       •   Bytes in the range 0x80 - 0x9F are assumed to be Win-Latin-1 characters (CP1252 encoded) and are
           converted to UTF-8, except for the five bytes in this range which are not defined in CP1252 (see
           the "ascii_hex" option below).

       The Achilles heel of these rules is that it's possible for certain combinations of two consecutive
       Latin-1 characters to be misinterpreted as a single UTF-8 character, i.e. there is some risk of data
       corruption.  See the 'LIMITATIONS OF THIS MODULE' section below to quantify this risk for the type of
       data you're working with.
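
       The rules above can be exercised with a small mixed-encoding byte string.  This is only a sketch:
       the sample bytes are made up for illustration, and the comments describe what the documented rules
       imply rather than captured output.

```perl
use strict;
use warnings;
use Encoding::FixLatin qw(fix_latin);

# One byte string mixing four kinds of input (sample data, invented for this sketch).
my $mixed = join '',
    "cafe ",          # plain ASCII
    "\xC3\xA9 ",      # well-formed UTF-8 for U+00E9 (e with acute accent)
    "\xE9 ",          # Latin-1 byte for the same character
    "\x93hi\x94";     # CP1252 curly double quotes

my $fixed = fix_latin($mixed);

# Print the code point of every character in the result.
printf "U+%04X ", ord($_) for split //, $fixed;
print "\n";
```

       If the rules hold as documented, both spellings of the accented letter should come out as U+00E9 and
       the two quote bytes as U+201C and U+201D.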

       If  you  pass  in  a  string  that  is already a UTF-8 character string (the utf8 flag is set on the Perl
       scalar) then the string will simply be  returned  unchanged.   However  if  the  'bytes_only'  option  is
       specified  (see  below),  the  returned string will be a byte string rather than a character string.  The
       rules described above will not be applied in either case.

       The "fix_latin" function accepts options as name => value pairs.  Recognised options are:

       bytes_only => 1/0
           The value returned by fix_latin is normally a Perl character string and will have the utf8  flag  set
           if  it  contains  non-ASCII  characters.   If  you  set  the "bytes_only" option to a true value, the
           returned string will be a binary string of UTF-8 bytes.  The utf8 flag will  not  be  set.   This  is
           useful  if  you're  going  to  immediately  use  the  string in an IO operation and wish to avoid the
           overhead of converting to and from Perl's internal representation.
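
           A minimal sketch of the difference between the two return types, assuming a Latin-1 input
           string chosen for illustration.  It compares the length and utf8 flag of each result:

```perl
use strict;
use warnings;
use Encoding::FixLatin qw(fix_latin);

my $input = "caf\xE9";    # Latin-1 byte string, 4 bytes (sample data)

my $chars = fix_latin($input);                     # Perl character string
my $bytes = fix_latin($input, bytes_only => 1);    # UTF-8 encoded byte string

printf "chars: length %d, utf8 flag %s\n",
    length($chars), utf8::is_utf8($chars) ? 'on' : 'off';
printf "bytes: length %d, utf8 flag %s\n",
    length($bytes), utf8::is_utf8($bytes) ? 'on' : 'off';

# The byte form can go straight to a raw filehandle with no encoding layer.
binmode(STDOUT);
print $bytes, "\n";
```

           The character form is one element shorter because the accented letter is a single character
           but two UTF-8 bytes.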

       ascii_hex => 1/0
           Bytes in the range 0x80-0x9F are assumed to be CP1252; however, CP1252 does not define a mapping
           for 5 of these bytes (0x81, 0x8D, 0x8F, 0x90 and 0x9D).  Use this option to specify how they should
           be handled:

           •   If  the  ascii_hex  option  is  set  to  true  (the  default), these bytes will be converted to 3
               character ASCII hex strings of the form %XX.  For example the byte 0x81 will become %81.

           •   If the ascii_hex option is set  to  false,  these  bytes  will  be  treated  as  Latin-1  control
               characters and converted to the equivalent UTF-8 multi-byte sequences.

           When  processing text strings you will almost certainly never encounter these bytes at all.  The most
           likely reason you would see them is if  a  malicious  attacker  was  feeding  random  bytes  to  your
           application.  It is difficult to conceive of a scenario in which it makes sense to change this option
           from its default setting.
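
           As a hedged illustration of the two settings, the sketch below passes the same undefined
           CP1252 byte through fix_latin twice.  The input string is invented for the example:

```perl
use strict;
use warnings;
use Encoding::FixLatin qw(fix_latin);

my $input = "A\x81B";    # 0x81 has no CP1252 mapping (sample data)

# Default: the undefined byte becomes a 3 character ASCII hex string.
my $default = fix_latin($input);

# ascii_hex off: the byte is treated as the Latin-1 control character U+0081.
my $control = fix_latin($input, ascii_hex => 0);

print "default:   $default\n";
printf "ascii_hex => 0: %s\n",
    join ' ', map { sprintf 'U+%04X', ord($_) } split //, $control;
```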

       overlong_fatal => 1/0
           An  over-long UTF-8 byte sequence is one which uses more than the minimum number of bytes required to
           represent the character.  Use this option to specify how overlong sequences should be handled.

           •   If the overlong_fatal option is set to false (the default), over-long sequences will be
               converted to the shortest normal UTF-8 sequence.  For example the input byte string
               "\xC0\xBCscript>" would be converted to "<script>".

           •   If  the overlong_fatal option is set to true, this module will die with an error when an overlong
               sequence is encountered.  You would probably want to use eval to trap and handle this scenario.

           There is a strong argument that overlong sequences are only ever encountered in malicious  input  and
           therefore they should always be rejected.
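
           One possible way to trap the exception with eval is sketched below.  It reuses the over-long
           example from above; the variable names and the accept/reject handling are assumptions, not
           part of the module:

```perl
use strict;
use warnings;
use Encoding::FixLatin qw(fix_latin);

# "\xC0\xBC" is an over-long two-byte encoding of "<" (U+003C).
my $suspect = "\xC0\xBCscript>";

# Default behaviour: the over-long sequence is normalised.
my $lenient = fix_latin($suspect);

# Strict behaviour: trap the exception and treat the input as rejected.
my $strict = eval { fix_latin($suspect, overlong_fatal => 1) };
my $error  = $@;

if (defined $strict) {
    print "accepted: $strict\n";
}
else {
    print "rejected: $error";
}
```

           Note that the lenient result contains a real "<" character, so any HTML escaping still has to
           happen after the call.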

       use_xs => 'auto' | 'always' | 'never'
           This option controls whether or not the XS (compiled C) implementation of "fix_latin" is used.  Note,
           the  Encoding::FixLatin::XS  module must be installed separately.  The three possible values for this
           option are:

           •   'auto' is the default behaviour - if Encoding::FixLatin::XS is installed, it will be  loaded  and
               used, otherwise the pure Perl implementation will be used.

           •   'always'  means  the  XS  module  will  be used and a fatal exception will be thrown if it is not
               available.

           •   'never' means no attempt will be made to use the XS module.
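
           A small sketch of requesting the XS implementation explicitly and falling back by hand, for
           example so the application can log which implementation is in use.  This mirrors what 'auto'
           already does; the logging and fallback structure are assumptions made for illustration:

```perl
use strict;
use warnings;
use Encoding::FixLatin qw(fix_latin);

my $input = "caf\xE9";    # sample Latin-1 byte string

# Ask for the XS implementation; this dies if Encoding::FixLatin::XS is missing.
my $result = eval { fix_latin($input, use_xs => 'always') };

if (defined $result) {
    print "using XS implementation\n";
}
else {
    print "XS not available, using pure Perl\n";
    $result = fix_latin($input, use_xs => 'never');
}
```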

LIMITATIONS OF THIS MODULE

       This module is perfectly safe when handling data containing only ASCII and UTF-8 characters.  Introducing
       ISO8859-1 or CP1252 characters does add a risk of data corruption (i.e. some characters in the input being
       converted to incorrect characters in the output).  To quantify the risk it is necessary to understand
       its cause.  First, let's break the input bytes into two categories:

       •   ASCII bytes fall into the range 0x00-0x7F - the most significant bit is always set to zero.  I'll use
           the symbol 'a' to represent these bytes.

       •   Non-ASCII  bytes fall into the range 0x80-0xFF - the most significant bit is always set to one.  I'll
           use the symbol 'B' to represent these bytes.

       A sequence of ASCII bytes ('aaa') is always unambiguous and will not be misinterpreted.

       Lone non-ASCII bytes within sequences of ASCII bytes ('aaBaBa') are also  unambiguous  and  will  not  be
       misinterpreted.
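
       The 'a' and 'B' notation is easy to reproduce in code.  The helper below is hypothetical (it is not
       part of this module) and uses only core Perl; it maps each byte of a string to its category so you can
       check whether your data contains only lone non-ASCII bytes:

```perl
use strict;
use warnings;

# Hypothetical helper: map each byte to 'a' (0x00-0x7F) or 'B' (0x80-0xFF).
sub byte_pattern {
    my ($bytes) = @_;
    return join '', map { ord($_) < 0x80 ? 'a' : 'B' } split //, $bytes;
}

my $pattern = byte_pattern("na\xEFve caf\xE9");    # sample Latin-1 bytes
print "$pattern\n";

# A pattern without two adjacent 'B's contains only lone non-ASCII bytes.
print "only lone non-ASCII bytes\n" unless $pattern =~ /BB/;
```

       For the sample string this prints the pattern 'aaBaaaaaaB', which contains no adjacent 'B' bytes.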

       The  potential for error occurs with two (or more) consecutive non-ASCII bytes.  For example the sequence
       'BB' might be intended to represent two characters in one of the legacy encodings or a  single  character
       in UTF-8.  Because this module gives precedence to the UTF-8 characters it is possible that a random pair
       of legacy characters may be misinterpreted as a single UTF-8 character.
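
       The ambiguity can be seen without this module by decoding the same two bytes both ways with the core
       Encode module.  The byte pair is chosen for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $pair = "\xC3\xA9";    # two non-ASCII bytes, i.e. 'BB'

# Read as Latin-1: two characters, U+00C3 and U+00A9.
my $as_latin1 = decode('ISO-8859-1', $pair);

# Read as UTF-8: one character, U+00E9.
my $as_utf8 = decode('UTF-8', $pair);

for my $reading ([ 'Latin-1', $as_latin1 ], [ 'UTF-8', $as_utf8 ]) {
    my ($name, $string) = @$reading;
    printf "%-8s %s\n", $name,
        join ' ', map { sprintf 'U+%04X', ord($_) } split //, $string;
}
```

       Both readings are valid, so the bytes alone cannot say which one was intended.  As described above,
       this module resolves the tie in favour of the UTF-8 reading.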

       The  risk is reduced by the fact that not all pairs of non-ASCII bytes form valid UTF-8 sequences.  Every
       non-ASCII UTF-8 character is made up of two or  more  'B'  bytes  and  no  'a'  bytes.   For  a  two-byte
       character, the first byte must be in the range 0xC0-0xDF and the second must be in the range 0x80-0xBF.

       Any pair of 'BB' bytes that does not fall into the required ranges is unambiguous and will not be
       misinterpreted.
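
       As a rough worked example, the loop below counts how many of the possible 'BB' byte pairs fall into
       the two-byte ranges given above.  It treats every byte value as equally likely, which real text is not,
       and it ignores three and four byte sequences, so the figure is only an upper-level illustration of
       the two-byte case:

```perl
use strict;
use warnings;

my ($total, $ambiguous) = (0, 0);

for my $first (0x80 .. 0xFF) {
    for my $second (0x80 .. 0xFF) {
        $total++;
        # Two-byte UTF-8 pattern: lead byte 0xC0-0xDF, continuation byte 0x80-0xBF.
        $ambiguous++
            if $first >= 0xC0 && $first <= 0xDF
            && $second >= 0x80 && $second <= 0xBF;
    }
}

printf "%d of %d BB pairs (%.1f%%) match the two-byte UTF-8 pattern\n",
    $ambiguous, $total, 100 * $ambiguous / $total;
```

       That is 32 possible lead bytes times 64 possible continuation bytes, or 2048 of the 16384 pairs
       (12.5%).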

       Pairs of 'BB' bytes that are actually individual Latin-1 characters but happen to fall into the required
       ranges to be misinterpreted as a UTF-8 character are rather unlikely to appear in normal text.  If you
       look those ranges up on a Latin-1 code chart you'll see that the first character would need to be an
       uppercase accented letter and the second would need to be a non-printable control character or a special
       punctuation symbol.

       One  way  to summarise the role of this module is that it guarantees to produce UTF-8 output, possibly at
       the cost of introducing the odd 'typo'.

BUGS

       Please report any bugs to "bug-encoding-fixlatin  at  rt.cpan.org",  or  through  the  web  interface  at
       <http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Encoding-FixLatin>.   I will be notified, and then you'll
       automatically be notified of progress on your bug as I make changes.

SUPPORT

       You can also look for information at:

       •   Issue tracker

           <https://github.com/grantm/encoding-fixlatin/issues>

       •   AnnoCPAN: Annotated CPAN documentation

           <http://annocpan.org/dist/Encoding-FixLatin>

       •   CPAN Ratings

           <http://cpanratings.perl.org/d/Encoding-FixLatin>

       •   Search CPAN

           <http://search.cpan.org/dist/Encoding-FixLatin/>

       •   Source code repository

           <http://github.com/grantm/encoding-fixlatin>

COPYRIGHT & LICENSE

       Copyright 2009-2014 Grant McLean "<grantm@cpan.org>"

       This program is free software; you can redistribute it and/or modify it under  the  same  terms  as  Perl
       itself.

perl v5.38.2                                       2024-03-05                            Encoding::FixLatin(3pm)