Ubuntu Manpage: I18N::Charset - IANA Character Set Registry names and Unicode::MapUTF8 (et al.) conversion scheme names

Provided by: libi18n-charset-perl_1.419-1_all

NAME

       I18N::Charset - IANA Character Set Registry names and Unicode::MapUTF8 (et al.) conversion scheme names

SYNOPSIS

         use I18N::Charset;

         $sCharset = iana_charset_name('WinCyrillic');
         # $sCharset is now 'windows-1251'
         $sCharset = umap_charset_name('Adobe DingBats');
         # $sCharset is now 'ADOBE-DINGBATS' which can be passed to Unicode::Map->new()
         $sCharset = map8_charset_name('windows-1251');
         # $sCharset is now 'cp1251' which can be passed to Unicode::Map8->new()
         $sCharset = umu8_charset_name('x-sjis');
         # $sCharset is now 'sjis' which can be passed to Unicode::MapUTF8->new()
         $sCharset = libi_charset_name('x-sjis');
         # $sCharset is now 'MS_KANJI' which can be passed to `iconv -f $sCharset ...`
         $sCharset = enco_charset_name('Shift-JIS');
         # $sCharset is now 'shiftjis' which can be passed to Encode::from_to()

         I18N::Charset::add_iana_alias('my-japanese' => 'iso-2022-jp');
         I18N::Charset::add_map8_alias('my-arabic' => 'arabic7');
         I18N::Charset::add_umap_alias('my-hebrew' => 'ISO-8859-8');
         I18N::Charset::add_libi_alias('my-sjis' => 'x-sjis');
         I18N::Charset::add_enco_alias('my-japanese' => 'shiftjis');

DESCRIPTION

       The "I18N::Charset" module provides access to the IANA Character Set Registry names for identifying
       character encoding schemes.  It also maps those names to the character set names used by the
       Unicode::Map, Unicode::Map8, and Unicode::MapUTF8 modules, by the iconv library, and by the Encode
       module.

       For example, given an HTML document with a META CHARSET="..." tag, you can look up which
       Unicode::MapXXX module can be used to convert it to Unicode.

       If you don't have the module Unicode::Map installed, the umap_ functions will always return undef.  If
       you don't have the module Unicode::Map8 installed, the map8_ functions will always return undef.  If you
       don't have the module Unicode::MapUTF8 installed, the umu8_ functions will always return undef.  If you
       don't have the iconv library installed, the libi_ functions will always return undef.  If you don't have
       the Encode module installed, the enco_ functions will always return undef.
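
       Because every lookup can return undef, callers usually need to test each result before using it.
       The following is a minimal sketch, assuming I18N::Charset is installed; the label 'x-sjis' is taken
       from the SYNOPSIS, and the report format is purely illustrative.

```perl
use strict;
use warnings;
use I18N::Charset;

# A charset label as it might appear in a META tag (taken from the SYNOPSIS).
my $label = 'x-sjis';

# Each lookup yields undef when the name is unknown or when the matching
# backend is not installed.  scalar() forces one value per hash slot.
my %lookup = (
    'Encode'           => scalar(enco_charset_name($label)),
    'iconv'            => scalar(libi_charset_name($label)),
    'Unicode::MapUTF8' => scalar(umu8_charset_name($label)),
    'Unicode::Map8'    => scalar(map8_charset_name($label)),
    'Unicode::Map'     => scalar(umap_charset_name($label)),
);

# Report which backends can handle the label on this system.
for my $backend (sort keys %lookup) {
    my $name = $lookup{$backend};
    printf "%-17s %s\n", $backend, defined $name ? $name : '(unavailable)';
}
```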

CONVERSION ROUTINES

       There are four main conversion routines: "iana_charset_name()", "map8_charset_name()",
       "umap_charset_name()", and "umu8_charset_name()".  The related lookups for MIME, Encode, iconv, and
       MIBenum values are documented alongside them.

       iana_charset_name()
           This  function  takes  a  string  containing  the  name of a character set and returns a string which
           contains the official IANA name of the character set identified. If no valid character set  name  can
           be  identified,  then  "undef"  will be returned.  The case and punctuation within the string are not
           important.

               $sCharset = iana_charset_name('WinCyrillic');
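
           A short sketch of the case and punctuation insensitivity described above, assuming
           I18N::Charset is installed.  The spelling variants are hypothetical inputs chosen for
           illustration, so each result is checked rather than assumed.

```perl
use strict;
use warnings;
use I18N::Charset;

# Hypothetical spelling variants of the SYNOPSIS example 'WinCyrillic'.
my @variants = ('WinCyrillic', 'wincyrillic', 'WIN-CYRILLIC', 'win_cyrillic');

for my $variant (@variants) {
    my $name = iana_charset_name($variant);
    # undef means the module could not identify the character set.
    print "$variant => ", (defined $name ? $name : 'undef'), "\n";
}
```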

       mime_charset_name()
           This function takes a string containing the name of a  character  set  and  returns  a  string  which
           contains  the preferred MIME name of the character set identified. If no valid character set name can
           be identified, then "undef" will be returned.  The case and punctuation within  the  string  are  not
           important.

               $sCharset = mime_charset_name('Extended_UNIX_Code_Packed_Format_for_Japanese');

       enco_charset_name()
           This  function  takes  a  string  containing  the  name of a character set and returns a string which
           contains a name of the character set suitable to be  passed  to  the  Encode  module.   If  no  valid
           character  set  name can be identified, or if Encode is not installed, then "undef" will be returned.
           The case and punctuation within the string are not important.

               $sCharset = enco_charset_name('Extended_UNIX_Code_Packed_Format_for_Japanese');
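
           The returned name is intended for the Encode module.  The sketch below, which assumes both
           I18N::Charset and Encode are installed, feeds it to Encode::from_to() to convert a two-byte
           Shift_JIS sample to UTF-8 in place.  The sample bytes and the hex dump are illustrative
           choices, not part of this module.

```perl
use strict;
use warnings;
use I18N::Charset;
use Encode ();

# Label taken from the SYNOPSIS.
my $enc = enco_charset_name('Shift-JIS');

# Two bytes that encode HIRAGANA LETTER A (U+3042) in Shift_JIS.
my $octets = "\x82\xA0";

if (defined $enc) {
    # from_to() converts $octets in place and returns the new length,
    # or undef if the conversion failed.
    my $len = Encode::from_to($octets, $enc, 'UTF-8');
    if (defined $len) {
        print "converted via $enc: ", unpack('H*', $octets), "\n";
    } else {
        warn "conversion from $enc failed\n";
    }
} else {
    warn "no Encode name found (unknown charset, or Encode not installed)\n";
}
```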

       libi_charset_name()
           This function takes a string containing the name of a  character  set  and  returns  a  string  which
           contains  a name of the character set suitable to be passed to iconv.  If no valid character set name
           can be identified, then "undef" will be returned.  The case and punctuation within the string are not
           important.

               $sCharset = libi_charset_name('Extended_UNIX_Code_Packed_Format_for_Korean');

       mib_to_charset_name
           This function takes a string containing the MIBenum of a character set and  returns  a  string  which
           contains  a  name  for  the character set.  If the given MIBenum does not correspond to any character
           set, then "undef" will be returned.

               $sCharset = mib_to_charset_name('3');

       mib_charset_name
           This is a synonym for mib_to_charset_name.

       charset_name_to_mib
           This function takes a string containing the name of a character set in almost any format and  returns
           a  MIBenum  for  the  character set.  For IANA-registered character sets, this is the IANA-registered
           MIB.  For non-IANA character sets, this is an unambiguous unique string whose only use is to pass  to
           other  functions in this module.  If no valid character set name can be identified, then "undef" will
           be returned.

               $iMIB = charset_name_to_mib('US-ASCII');
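
           The two MIBenum functions can be combined into a round trip.  This is a minimal sketch,
           assuming I18N::Charset is installed.  Because non-IANA character sets get an opaque string
           instead of a number, the MIBenum is treated as a string throughout.

```perl
use strict;
use warnings;
use I18N::Charset;

# Name -> MIBenum.  The value is only meant to be passed back to this module.
my $mib = charset_name_to_mib('US-ASCII');
die "US-ASCII was not recognised\n" unless defined $mib;

# MIBenum -> name.
my $name = mib_to_charset_name($mib);
print "MIBenum $mib maps back to ", (defined $name ? $name : 'undef'), "\n";
```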

       map8_charset_name()
           This function takes a string containing the name of a  character  set  (in  almost  any  format)  and
           returns   a   string   which   contains  a  name  for  the  character  set  that  can  be  passed  to
           Unicode::Map8::new().  Note: the returned string will be capitalized just like the name of  the  .bin
           file  in  the  Unicode::Map8::MAPS_DIR  directory.  If no valid character set name can be identified,
           then "undef" will be returned.   The  case  and  punctuation  within  the  argument  string  are  not
           important.

               $sCharset = map8_charset_name('windows-1251');

       umap_charset_name()
           This  function  takes  a  string  containing  the  name of a character set (in almost any format) and
           returns  a  string  which  contains  a  name  for  the  character  set  that   can   be   passed   to
           Unicode::Map::new(). If no valid character set name can be identified, then "undef" will be returned.
           The case and punctuation within the argument string are not important.

               $sCharset = umap_charset_name('hebrew');

       umu8_charset_name()
           This  function  takes  a  string  containing  the  name of a character set (in almost any format) and
           returns  a  string  which  contains  a  name  for  the  character  set  that   can   be   passed   to
           Unicode::MapUTF8::new().  If  no  valid  character  set  name can be identified, then "undef" will be
           returned.  The case and punctuation within the argument string are not important.

               $sCharset = umu8_charset_name('windows-1251');

QUERY ROUTINES

       There is one function which can be used to obtain a list of all IANA-registered character set names.

       "all_iana_charset_names()"
           Returns a list of all registered IANA character set names.  The  names  are  not  in  any  particular
           order.
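
           Since the order is unspecified, callers who need stable output can sort the list themselves.
           A minimal sketch, assuming I18N::Charset is installed.  The function is called by its fully
           qualified name so that the example does not depend on whether it is exported by default.

```perl
use strict;
use warnings;
use I18N::Charset;

# Sort case-insensitively to get a reproducible order.
my @names = sort { lc $a cmp lc $b } I18N::Charset::all_iana_charset_names();

printf "%d registered names\n", scalar @names;

# Print at most the first five names.
my $last = $#names < 4 ? $#names : 4;
print "$_\n" for @names[0 .. $last];
```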

CHARACTER SET NAME ALIASING

       This module supports several semi-private routines for specifying character set name aliases.

       add_iana_alias()
           This  function  takes  two  strings:  a  new  alias, and a target IANA Character Set Name (or another
           alias).  It defines the new alias to refer to that character set name (or to the character  set  name
           to which the second alias refers).

           Returns  the  target  character set name of the successfully installed alias.  Returns 'undef' if the
           target character set name is not registered.  Returns 'undef' if the target character set name of the
           second alias is not registered.

             I18N::Charset::add_iana_alias('my-alias1' => 'Shift_JIS');

           With this code, "my-alias1" becomes an alias for the existing IANA character set name 'Shift_JIS'.

             I18N::Charset::add_iana_alias('my-alias2' => 'sjis');

           With this code, "my-alias2" becomes an alias for the IANA character  set  name  referred  to  by  the
           existing alias 'sjis' (which happens to be 'Shift_JIS').
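
           The return value makes it possible to confirm that an alias was installed before relying on
           it.  A minimal sketch, assuming I18N::Charset is installed; 'my-alias1' is a hypothetical
           alias name.

```perl
use strict;
use warnings;
use I18N::Charset;

# Returns the target character set name on success, undef otherwise.
my $target = I18N::Charset::add_iana_alias('my-alias1' => 'Shift_JIS');

if (defined $target) {
    # The new alias should now resolve like any other name.
    my $resolved = iana_charset_name('my-alias1');
    print "my-alias1 => ", (defined $resolved ? $resolved : 'undef'), "\n";
} else {
    warn "target character set name is not registered\n";
}
```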

       add_map8_alias()
           This  function  takes  two strings: a new alias, and a target Unicode::Map8 Character Set Name (or an
           existing alias to a Map8 name).  It defines the new alias to refer to that mapping name  (or  to  the
           mapping name to which the second alias refers).

           If  the  first  argument  is  a  registered  IANA  character  set name, then all aliases of that IANA
           character set name will end up pointing to the target Map8 mapping name.

           Returns the target mapping name of the successfully installed alias.  Returns 'undef' if  the  target
           mapping  name  is  not registered.  Returns 'undef' if the target mapping name of the second alias is
           not registered.

             I18N::Charset::add_map8_alias('normal' => 'ANSI_X3.4-1968');

           With the above statement, "normal" becomes an alias  for  the  existing  Unicode::Map8  mapping  name
           'ANSI_X3.4-1968'.

             I18N::Charset::add_map8_alias('normal' => 'US-ASCII');

           With the above statement, "normal" becomes an alias for the existing Unicode::Map8 mapping name
           'ANSI_X3.4-1968' (which is what "US-ASCII" is an alias for).

             I18N::Charset::add_map8_alias('IBM297' => 'EBCDIC-CA-FR');

           With the above statement, "IBM297" becomes an alias for the existing Unicode::Map8 mapping name
           'EBCDIC-CA-FR'.  As a side effect, all the aliases for 'IBM297' (i.e. 'cp297' and 'ebcdic-cp-fr')
           also become aliases for 'EBCDIC-CA-FR'.

       add_umap_alias()
           This function works identically to add_map8_alias() above,  but  operates  on  Unicode::Map  encoding
           tables.

       add_libi_alias()
           This  function  takes  two  strings:  a new alias, and a target iconv Character Set Name (or existing
           iconv alias).  It defines the new alias to refer to that character set name (or to the character  set
           name to which the existing alias refers).

           Returns  the  target  conversion scheme name of the successfully installed alias.  Returns 'undef' if
           there is no such target conversion scheme or alias.

           Examples:

             I18N::Charset::add_libi_alias('my-chinese1' => 'CN-GB');

           With this code, "my-chinese1" becomes an alias for the existing iconv conversion scheme 'CN-GB'.

             I18N::Charset::add_libi_alias('my-chinese2' => 'EUC-CN');

           With this code, "my-chinese2" becomes an alias for the iconv conversion scheme  referred  to  by  the
           existing alias 'EUC-CN' (which happens to be 'CN-GB').

       add_enco_alias()
           This function takes two strings: a new alias, and a target Encode encoding name (or existing Encode
           alias).  It defines the new alias to refer to that encoding name (or to the encoding to which the
           existing alias refers).

           Returns the target encoding name of the successfully installed alias.  Returns 'undef' if there is no
           such encoding or alias.

           Examples:

             I18N::Charset::add_enco_alias('my-japanese1' => 'jis0201-raw');

           With this code, "my-japanese1" becomes an alias for the existing encoding 'jis0201-raw'.

             I18N::Charset::add_enco_alias('my-japanese2' => 'my-japanese1');

           With  this  code,  "my-japanese2" becomes an alias for the encoding referred to by the existing alias
           'my-japanese1' (which happens to be 'jis0201-raw' after the previous call).

KNOWN BUGS AND LIMITATIONS

       •   There could probably be many more aliases added (for convenience) to all the IANA names.  If you have
           some specific recommendations, please email the author!

       •   The only character set names which have a corresponding mapping in the Unicode::Map8 module  are  the
           character sets that Unicode::Map8 can convert.

           Similarly, the only character set names which have a corresponding mapping in the Unicode::Map module
           are the character sets that Unicode::Map can convert.

       •   In  the current implementation, all tables are read in and initialized when the module is loaded, and
           then held in memory until the program exits.  A "lazy" implementation (or a less-portable tied  hash)
           might lead to a shorter startup time.  Suggestions, patches, comments are always welcome!

AUTHOR

       Martin 'Kingpin' Thurn, "mthurn at cpan.org", <http://tinyurl.com/nn67z>.

LICENSE

       This  module  is  free  software;  you  can redistribute it and/or modify it under the same terms as Perl
       itself.

perl v5.32.1                                       2021-02-27                                 I18N::Charset(3pm)
