Provided by: libsgml-parser-opensp-perl_0.994-7build3_amd64

NAME

       SGML::Parser::OpenSP - Parse SGML documents using OpenSP

SYNOPSIS

         use SGML::Parser::OpenSP;

         my $p = SGML::Parser::OpenSP->new;
         my $h = ExampleHandler->new;

         $p->catalogs(qw(xhtml.soc));
         $p->warnings(qw(xml valid));
         $p->handler($h);

         $p->parse("example.xhtml");

DESCRIPTION

       This module provides an interface to the OpenSP SGML parser. OpenSP and this module are event based: as
       the parser recognizes parts of the document (say, the start or end of an element), any handlers
       registered for that type of event are called with suitable parameters.

COMMON METHODS

       new()
           Returns a new SGML::Parser::OpenSP object. Takes no arguments.

       parse($file)
           Parses  the  file  passed as an argument. Note that this must be a filename and not a filehandle. See
           "PROCESSING FILES" below for details.

       parse_string($data)
           Parses the data passed as an argument. See "PROCESSING FILES" below for details.

       halt()
           Halts processing before parsing the entire document. Takes no arguments.

       split_message()
           Splits OpenSP's error messages into their component  parts.   See  "POST-PROCESSING  ERROR  MESSAGES"
           below for details.

       get_location()
           See "POSITIONING INFORMATION" below for details.
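
       As an illustration of "halt", the following is a minimal sketch of a handler that stops the parse after a
       few elements. The "FirstElements" package, its "limit" and "parser" fields and the "first_elements"
       helper are hypothetical names, not part of this module. The sketch assumes the handler is given a
       reference to the parser, which it weakens on the assumption that the parser keeps its own reference to
       the handler.

```perl
use strict;
use warnings;
use Scalar::Util ();

{
    # Hypothetical handler: records element names and asks the parser
    # to halt once "limit" elements have been seen.
    package FirstElements;

    sub new {
        my ($class, %args) = @_;
        my $self = bless { limit => $args{limit} || 3, names => [] }, $class;
        $self->{parser} = $args{parser};
        # Weakened so that parser and handler do not keep each other alive.
        Scalar::Util::weaken($self->{parser}) if ref $self->{parser};
        return $self;
    }

    sub start_element {
        my ($self, $elem) = @_;
        push @{ $self->{names} }, $elem->{Name};
        $self->{parser}->halt if @{ $self->{names} } >= $self->{limit};
    }
}

# Wire parser and handler together. The module is loaded at run time,
# so this file compiles even where the module is not installed.
sub first_elements {
    my ($file, $limit) = @_;
    require SGML::Parser::OpenSP;
    my $p = SGML::Parser::OpenSP->new;
    my $h = FirstElements->new(parser => $p, limit => $limit);
    $p->handler($h);
    $p->parse($file);
    return @{ $h->{names} };
}
```

       A call such as first_elements("example.xhtml", 5) would then return the recorded names. Whether further
       events are still delivered after "halt" has been requested is not specified here, so the handler should
       not rely on an exact count.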

CONFIGURATION

   BOOLEAN OPTIONS
       $p->handler([$handler])
           Report events to the blessed reference $handler.

   ERROR MESSAGE FORMAT
       $p->show_open_entities([$bool])
           Describe  open  entities  in  error  messages. Error messages always include the position of the most
           recently opened external entity. The default is false.

       $p->show_open_elements([$bool])
           Show the generic identifiers of open elements in error messages.  The default is false.

       $p->show_error_numbers([$bool])
           Show message numbers in error messages.

   GENERATED EVENTS
       $p->output_comment_decls([$bool])
           Generate "comment_decl" events. The default is false.

       $p->output_marked_sections([$bool])
           Generate marked section events ("marked_section_start", "marked_section_end",  "ignored_chars").  The
           default is false.

       $p->output_general_entities([$bool])
           Generate "general_entity" events. The default is false.
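
       The options above only take effect if the handler also implements the corresponding methods. The sketch
       below enables all three options and counts the resulting events without inspecting their payload. The
       "OptionalEventCounter" package and the "count_optional_events" helper are hypothetical names.

```perl
use strict;
use warnings;

{
    # Hypothetical handler: tallies the optional events by name.
    package OptionalEventCounter;

    sub new { bless { count => {} }, shift }

    # Install one counting method per optional event.
    for my $event (qw(comment_decl general_entity marked_section_start
                      marked_section_end ignored_chars)) {
        no strict 'refs';
        *{"OptionalEventCounter::$event"} = sub { $_[0]{count}{$event}++ };
    }
}

sub count_optional_events {
    my ($file) = @_;
    require SGML::Parser::OpenSP;
    my $p = SGML::Parser::OpenSP->new;
    my $h = OptionalEventCounter->new;

    # These events are off by default and must be requested explicitly.
    $p->output_comment_decls(1);
    $p->output_marked_sections(1);
    $p->output_general_entities(1);

    $p->handler($h);
    $p->parse($file);
    return $h->{count};   # hash ref: event name => number of events
}
```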

   IO SETTINGS
       $p->map_catalog_document([$bool])
           Treat the arguments of "parse" as catalog files rather than as the document entity. The document
           entity is then specified by the first DOCUMENT entry in the catalog files. The default is false.

       $p->restrict_file_reading([$bool])
           Restrict  file  reading  to  the  specified  directories  (see  the  "search_dirs"  method  and   the
           "SGML_SEARCH_PATH"  environment  variable).  You  should turn this option on and configure the search
           paths accordingly if you intend to process untrusted resources. The default is false.

       $p->catalogs([@catalogs])
           Map public identifiers and entity names to system  identifiers  using  the  specified  catalog  entry
           files.  Multiple  catalogs are allowed. If there is a catalog entry file called "catalog" in the same
           place as the document entity, it will be searched for immediately after those specified.

       $p->search_dirs([@search_dirs])
           Search the specified directories for files specified in system identifiers. Multiple directories
           are allowed. See the description of the osfile storage manager in the OpenSP documentation for more
           information about file searching.

       $p->pass_file_descriptor([$bool])
           Instruct "parse_string" to pass the input data down to the guts of OpenSP using  the  "OSFD"  storage
           manager  (if true) or the "OSFILE" storage manager (if false). This amounts to the difference between
           passing a file descriptor and a (temporary) file name.

           The default is true except on platforms, such as Win32, which are known to not support  passing  file
           descriptors  around  in  this  manner.  On platforms which support it you can call this method with a
           false parameter to force use of temporary file names instead.

           In general, this will do the right thing on its own so it's best to consider this an internal method.
           If your platform is such that you have to force use of the OSFILE storage manager, please  report  it
           as  a  bug  and include the values of $^O, $Config{archname}, and a description of the platform (e.g.
           "Windows Vista Service Pack 42").

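       For untrusted input, the IO settings above might be combined as in the following sketch. The
       "parse_untrusted" helper and its argument layout are hypothetical; only the method calls come from this
       documentation.

```perl
use strict;
use warnings;

# Hypothetical helper: configure a parser for untrusted input.
# $handler is any object with event methods; $catalogs and $dirs are
# array references of catalog files and permitted directories.
sub parse_untrusted {
    my ($file, $handler, $catalogs, $dirs) = @_;
    require SGML::Parser::OpenSP;
    my $p = SGML::Parser::OpenSP->new;

    $p->restrict_file_reading(1);   # false by default
    $p->search_dirs(@$dirs);        # the only directories to read from
    $p->catalogs(@$catalogs);

    $p->handler($handler);
    $p->parse($file);
    return;
}
```

       Note that the "SGML_SEARCH_PATH" environment variable also contributes to the permitted directories, so
       it should be checked as well.
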
   PROCESSING OPTIONS
       $p->include_params([@include_params])
           For each name in @include_params pretend that

             <!ENTITY % name "INCLUDE">

           occurs at the start of the document type declaration  subset  in  the  SGML  document  entity.  Since
           repeated  definitions  of  an entity are ignored, this definition will take precedence over any other
           definitions of this entity in the document type declaration. Multiple names are allowed.  If the SGML
           declaration replaces the reserved name INCLUDE then the new reserved name  will  be  the  replacement
           text of the entity. Typically the document type declaration will contain

             <!ENTITY % name "IGNORE">

           and will use %name; in the status keyword specification of a marked section declaration. In this case
           the effect of the option will be to cause the marked section not to be ignored.

       $p->active_links([@active_links])
           ???
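
       For "include_params", the following sketch shows the typical pattern described above: a marked section
       that is ignored unless the parameter entity is overridden. The document, the "TextCollector" package and
       the "text_of" helper are made up for illustration. The "Data" key of the "data" event is an assumption
       based on the capitalized property names of the generic interface.

```perl
use strict;
use warnings;

# Hypothetical document: the marked section is ignored unless the
# "draft" parameter entity is overridden with INCLUDE.
my $doc = <<'SGML';
<!DOCTYPE doc [
  <!ELEMENT doc - - (#PCDATA)>
  <!ENTITY % draft "IGNORE">
]>
<doc>Final text.<![%draft;[ Draft-only text.]]></doc>
SGML

{
    # Hypothetical handler: concatenates character data.
    package TextCollector;
    sub new  { bless { text => '' }, shift }
    sub data { $_[0]{text} .= $_[1]{Data} }   # "Data" key is an assumption
}

sub text_of {
    my ($sgml, @include) = @_;
    require SGML::Parser::OpenSP;
    my $p = SGML::Parser::OpenSP->new;
    my $h = TextCollector->new;
    $p->include_params(@include) if @include;
    $p->handler($h);
    $p->parse_string($sgml);
    return $h->{text};
}
```

       With this setup, text_of($doc, 'draft') would be expected to include the draft-only text, while
       text_of($doc) would not.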

   ENABLING WARNINGS
       Additional warnings can be enabled using

         $p->warnings([@warnings])

       The following values can be used to enable warnings:

       xml Warn about constructs that are not allowed by XML.

       mixed
           Warn about mixed content models that do not allow #pcdata anywhere.

       sgmldecl
           Warn about various dubious constructions in the SGML declaration.

       should
           Warn  about  various  recommendations  made  in  ISO  8879  that  the  document does not comply with.
           (Recommendations are expressed with ``should'', as  distinct  from  requirements  which  are  usually
           expressed with ``shall''.)

       default
           Warn about defaulted references.

       duplicate
           Warn about duplicate entity declarations.

       undefined
           Warn about undefined elements: elements used in the DTD but not defined.

       unclosed
           Warn about unclosed start and end-tags.

       empty
           Warn about empty start and end-tags.

       net Warn about net-enabling start-tags and null end-tags.

       min-tag
           Warn about minimized start and end-tags. Equivalent to the combination of the unclosed, empty and
           net warnings.

       unused-map
           Warn about unused short reference maps: maps  that  are  declared  with  a  short  reference  mapping
           declaration but never used in a short reference use declaration in the DTD.

       unused-param
           Warn  about  parameter  entities  that  are defined but not used in a DTD.  Unused internal parameter
           entities whose text is "INCLUDE" or "IGNORE" won't get the warning.

       notation-sysid
           Warn about notations for which no system identifier could be generated.

       all Warn about conditions that should usually be avoided (in the opinion of the author).  Equivalent  to:
           "mixed",  "should",  "default",  "undefined",  "sgmldecl",  "unused-map", "unused-param", "empty" and
           "unclosed".

   DISABLING WARNINGS
       A warning can be disabled by using its name prefixed with "no-". Thus calling
       warnings(qw(all no-duplicate)) will enable all warnings except those about duplicate entity declarations.

       The following values for warnings() disable errors:

       no-idref
           Do  not give an error for an ID reference value which no element has as its ID. The effect will be as
           if each attribute declared as an ID reference value had been declared as a name.

       no-significant
           Do not give an error when a character that is not a significant character in the  reference  concrete
           syntax  occurs  in  a literal in the SGML declaration. This may be useful in conjunction with certain
           buggy test suites.

       no-valid
           Do not require the document to be type-valid. This has the effect of changing the SGML declaration to
           specify "VALIDITY NOASSERT" and "IMPLYDEF ATTLIST YES ELEMENT YES". An  option  of  "valid"  has  the
           effect  of  changing the SGML declaration to specify "VALIDITY TYPE" and "IMPLYDEF ATTLIST NO ELEMENT
           NO". If neither "valid" nor "no-valid" is specified, then the "VALIDITY" and "IMPLYDEF" specified in
           the SGML declaration will be used.
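
       Since enabling and disabling both go through the same "warnings" method, the argument list can be
       assembled programmatically. The "warning_list" helper below is a hypothetical convenience, not part of
       this module.

```perl
use strict;
use warnings;

# Hypothetical helper: turn lists of warnings to enable and to disable
# into the flat list that warnings() takes.
sub warning_list {
    my (%args) = @_;
    my @enable  = @{ $args{enable}  || [] };
    my @disable = map { "no-$_" } @{ $args{disable} || [] };
    return (@enable, @disable);
}

my @w = warning_list(enable => [qw(all xml)], disable => [qw(duplicate idref)]);
print "@w\n";   # prints "all xml no-duplicate no-idref"
```

       The result can be passed on directly, as in $p->warnings(@w).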

   XML WARNINGS
       The following warnings are turned on for the "xml" warning described above:

       inclusion
           Warn about inclusions in element type declarations.

       exclusion
           Warn about exclusions in element type declarations.

       rcdata-content
           Warn about RCDATA declared content in element type declarations.

       cdata-content
           Warn about CDATA declared content in element type declarations.

       ps-comment
           Warn about comments in parameter separators.

       attlist-group-decl
           Warn about name groups in attribute declarations.

       element-group-decl
           Warn about name groups in element type declarations.

       pi-entity
           Warn about PI entities.

       internal-sdata-entity
           Warn about internal SDATA entities.

       internal-cdata-entity
           Warn about internal CDATA entities.

       external-sdata-entity
           Warn about external SDATA entities.

       external-cdata-entity
           Warn about external CDATA entities.

       bracket-entity
           Warn about bracketed text entities.

       data-atts
           Warn about attribute definition list declarations for notations.

       missing-system-id
           Warn about external identifiers without system identifiers.

       conref
           Warn about content reference attributes.

       current
           Warn about current attributes.

       nutoken-decl-value
           Warn about attributes with a declared value of NUTOKEN or NUTOKENS.

       number-decl-value
           Warn about attributes with a declared value of NUMBER or NUMBERS.

       name-decl-value
           Warn about attributes with a declared value of NAME or NAMES.

       named-char-ref
           Warn about named character references.

       refc
           Warn about omitted refc delimiters.

       temp-ms
           Warn about TEMP marked sections.

       rcdata-ms
           Warn about RCDATA marked sections.

       instance-include-ms
           Warn about INCLUDE marked sections in the document instance.

       instance-ignore-ms
           Warn about IGNORE marked sections in the document instance.

       and-group
           Warn about AND connectors in model groups.

       rank
           Warn about ranked elements.

       empty-comment-decl
           Warn about empty comment declarations.

       att-value-not-literal
           Warn about attribute values which are not literals.

       missing-att-name
           Warn about omitted attribute names in start tags.

       comment-decl-s
           Warn about spaces before the MDC in comment declarations.

       comment-decl-multiple
           Warn about comment declarations containing multiple comments.

       missing-status-keyword
           Warn about marked sections without a status keyword.

       multiple-status-keyword
           Warn about marked sections with multiple status keywords.

       instance-param-entity
           Warn about parameter entities in the document instance.

       min-param
           Warn about minimization parameters in element type declarations.

       mixed-content-xml
           Warn about cases of mixed content which are not allowed in XML.

       name-group-not-or
           Warn about name groups with a connector different from OR.

       pi-missing-name
           Warn about processing instructions which don't start with a name.

       instance-status-keyword-s
           Warn about spaces between DSO and status keyword in marked sections.

       external-data-entity-ref
           Warn about references to external data entities in the content.

       att-value-external-entity-ref
           Warn about references to external data entities in attribute values.

       data-delim
           Warn about occurrences of `<' and `&' as data.

       explicit-sgml-decl
           Warn about an explicit SGML declaration.

       internal-subset-ms
           Warn about marked sections in the internal subset.

       default-entity
           Warn about a default entity declaration.

       non-sgml-char-ref
           Warn about numeric character references to non-SGML characters.

       internal-subset-ps-param-entity
           Warn about parameter entity references in parameter separators in the internal subset.

       internal-subset-ts-param-entity
           Warn about parameter entity references in token separators in the internal subset.

       internal-subset-literal-param-entity
           Warn about parameter entity references in parameter literals in the internal subset.

PROCESSING FILES

       In order to start processing of a document and receive events, the "parse"  method  must  be  called.  It
       takes one argument specifying the path to a file (not a file handle). You must set an event handler using
       the "handler" method prior to using this method. The return value of "parse" is currently undefined.
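
       Because the return value of "parse" is undefined, results have to be gathered through the handler. A
       minimal sketch, with the hypothetical names "ElementTally" and "tally_elements":

```perl
use strict;
use warnings;

{
    # Hypothetical handler: counts start tags per element name.
    package ElementTally;
    sub new           { bless { seen => {} }, shift }
    sub start_element { $_[0]{seen}{ $_[1]{Name} }++ }
}

# Accepts either file => $path or string => $data.
sub tally_elements {
    my (%args) = @_;
    require SGML::Parser::OpenSP;
    my $p = SGML::Parser::OpenSP->new;
    my $h = ElementTally->new;
    $p->handler($h);                  # must be set before parsing
    if (defined $args{file}) {
        $p->parse($args{file});       # a file name, not a file handle
    }
    else {
        $p->parse_string($args{string});
    }
    return $h->{seen};                # results come from the handler
}
```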

EVENT HANDLERS

       In order to receive data from the parser you need to write an event handler. For example,

         package ExampleHandler;

         sub new { bless {}, shift }

         sub start_element
         {
             my ($self, $elem) = @_;
             printf "  * %s\n", $elem->{Name};
         }

       This handler prints the name of each element as it is found in the document. For a typical XHTML
       document this might result in something like

         * html
         * head
         * title
         * body
         * p
         * ...

       The events closely match those in the generic interface to OpenSP; see
       <http://openjade.sf.net/doc/generic.htm> for more information.

       Event names have been changed to lowercase with underscores separating words, and property names are
       capitalized. Arrays are represented as Perl array references. "Position" information is not passed to the
       handler but is made available through the "get_location" method, which can be called from event handlers.
       Some redundant information has also been stripped, and the generic identifier of an element is stored in
       the "Name" hash entry.

       For example, for an EndElementEvent the "end_element" handler gets called with a hash reference

         {
           Name => 'gi'
         }

       The following events are defined:

         * appinfo
         * processing_instruction
         * start_element
         * end_element
         * data
         * sdata
         * external_data_entity_ref
         * subdoc_entity_ref
         * start_dtd
         * end_dtd
         * end_prolog
         * general_entity       # set $p->output_general_entities(1)
         * comment_decl         # set $p->output_comment_decls(1)
         * marked_section_start # set $p->output_marked_sections(1)
         * marked_section_end   # set $p->output_marked_sections(1)
         * ignored_chars        # set $p->output_marked_sections(1)
         * error
         * open_entity_change

       If the documentation of the generic interface to OpenSP states that certain data is not  valid,  it  will
       not be available through this interface (i.e., the respective key does not exist in the hash ref).
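
       To make the calling convention concrete, the sketch below pairs "start_element" and "end_element" to
       build an indented outline. The "OutlineHandler" package is hypothetical, and the events are fed in by
       hand here; in real use the parser makes these calls.

```perl
use strict;
use warnings;

{
    # Hypothetical handler: builds an indented outline of the element tree.
    package OutlineHandler;

    sub new { bless { depth => 0, lines => [] }, shift }

    sub start_element {
        my ($self, $elem) = @_;
        push @{ $self->{lines} }, ('  ' x $self->{depth}) . $elem->{Name};
        $self->{depth}++;
    }

    sub end_element {
        my ($self, $elem) = @_;
        $self->{depth}--;
    }

    sub outline { join "\n", @{ $_[0]{lines} } }
}

# Simulate the events for <html><head/><body/></html>.
my $h = OutlineHandler->new;
$h->start_element({ Name => 'html' });
$h->start_element({ Name => 'head' });
$h->end_element({ Name => 'head' });
$h->start_element({ Name => 'body' });
$h->end_element({ Name => 'body' });
$h->end_element({ Name => 'html' });
print $h->outline, "\n";
```

       This prints "html" followed by "head" and "body", each indented by two spaces.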

POSITIONING INFORMATION

       Event handlers can call the "get_location" method on the parser object to retrieve positioning
       information. The method returns a hash reference with the following properties:

         LineNumber   => ..., # line number
         ColumnNumber => ..., # column number
         ByteOffset   => ..., # number of preceding bytes
         EntityOffset => ..., # number of preceding bit combinations
         EntityName   => ..., # name of the external entity
         FileName     => ..., # name of the file

       These can be "undef" or an empty string.
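
       A handler that reports where each element starts might look like the following sketch. The
       "LocatingHandler" package is hypothetical, and it assumes the handler was constructed with a reference to
       the parser. Missing values are replaced by "?" because any property can be "undef" or empty.

```perl
use strict;
use warnings;
use Scalar::Util ();

{
    # Hypothetical handler: records "name at line:column" per element.
    package LocatingHandler;

    sub new {
        my ($class, $parser) = @_;
        my $self = bless { parser => $parser, report => [] }, $class;
        # Weakened so that parser and handler do not keep each other alive.
        Scalar::Util::weaken($self->{parser});
        return $self;
    }

    sub start_element {
        my ($self, $elem) = @_;
        my $loc = $self->{parser}->get_location || {};
        my ($line, $col) =
            map { (defined($_) && length($_)) ? $_ : '?' }
            @{$loc}{qw(LineNumber ColumnNumber)};
        push @{ $self->{report} }, "$elem->{Name} at $line:$col";
    }
}
```

       It would be set up with something like my $h = LocatingHandler->new($p); $p->handler($h); before calling
       "parse".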

POST-PROCESSING ERROR MESSAGES

       OpenSP returns error messages as a single string rather than as individual components such as line
       numbers or message text. The "split_message" method on the parser object can be used to post-process
       these error message strings as reliably as possible. It can be called, for example, from an error event
       handler if the parser object is accessible to the handler, like so:

         sub error
         {
           my $self = shift;
           my $erro = shift;
           my $mess = $self->{parser}->split_message($erro);
         }

       See the description of "split_message" in the SGML::Parser::OpenSP::Tools documentation.
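
       The example above relies on the handler having access to the parser. One way to arrange that is sketched
       below; the "ErrorCollector" package, its "parser" accessor and the "collect_errors" helper are
       hypothetical names. The sketch stores whatever "split_message" returns without assuming its structure.

```perl
use strict;
use warnings;
use Scalar::Util ();

{
    # Hypothetical handler: keeps the split form of every error event.
    package ErrorCollector;

    sub new { bless { errors => [] }, shift }

    # Accessor that gives the handler a weakened reference to the parser.
    sub parser {
        my $self = shift;
        if (@_) {
            $self->{parser} = shift;
            Scalar::Util::weaken($self->{parser});
        }
        return $self->{parser};
    }

    sub error {
        my ($self, $erro) = @_;
        my $mess = $self->{parser}->split_message($erro);
        push @{ $self->{errors} }, $mess;
    }
}

sub collect_errors {
    my ($file) = @_;
    require SGML::Parser::OpenSP;
    my $p = SGML::Parser::OpenSP->new;
    my $h = ErrorCollector->new;
    $h->parser($p);
    $p->handler($h);
    $p->parse($file);
    return @{ $h->{errors} };
}
```

       Dumping the collected values, for instance with Data::Dumper, is a quick way to see which components are
       available for a given OpenSP build.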

UNICODE SUPPORT

       All strings returned from event handlers and helper routines are UTF-8 encoded with the UTF-8 flag turned
       on. Helper functions like "split_message" expect (but don't check) that string arguments are UTF-8
       encoded and have the UTF-8 flag turned on. The behavior of helper functions is undefined when you pass
       unexpected input, so doing so should be avoided.

       "parse" has limited support for binary input, but the binary  input  must  be  compatible  with  OpenSP's
       generic  interface  requirements  and  you must specify the encoding through means available to OpenSP to
       enable it to properly decode the binary input. Any encoding meta data about such binary input specific to
       Perl (such as encoding disciplines for file handles when you pass a file descriptor) will be ignored. For
       more specific information refer to the OpenSP manual.

       •   <http://openjade.sourceforge.net/doc/sysid.htm>

       •   <http://openjade.sourceforge.net/doc/charset.htm>

ENVIRONMENT VARIABLES

       OpenSP supports a number of  environment  variables  to  control  specific  processing  aspects  such  as
       "SGML_SEARCH_PATH"  or "SP_CHARSET_FIXED".  Portable applications need to ensure that these are set prior
       to loading the OpenSP library into memory which happens when the XS code is loaded. This means  you  need
       to wrap the code into a "BEGIN" block:

         BEGIN { $ENV{SP_CHARSET_FIXED} = 1; }
         use SGML::Parser::OpenSP;
         # ...

       Otherwise  changes  to the environment might not propagate to OpenSP.  This applies specifically to Win32
       systems.

       SGML_SEARCH_PATH
           See <http://openjade.sourceforge.net/doc/sysid.htm>.

       SP_HTTP_USER_AGENT
           The "User-Agent" header for HTTP requests.

       SP_HTTP_ACCEPT
           The "Accept" header for HTTP requests.

       SP_MESSAGE_FORMAT
           Enables run-time selection of the message format. The value is one of "XML", "NONE", "TRADITIONAL".
           Whether this has an effect depends on a compile-time setting which might not be enabled in your
           OpenSP build. This module assumes that no such support was compiled in.

       SGML_CATALOG_FILES
       SP_USE_DOCUMENT_CATALOG
           See <http://openjade.sourceforge.net/doc/catalog.htm>.

       SP_SYSTEM_CHARSET
       SP_CHARSET_FIXED
       SP_BCTF
       SP_ENCODING
           See <http://openjade.sourceforge.net/doc/charset.htm>.

       Note that you can use the "search_dirs" method instead of using  "SGML_SEARCH_PATH"  and  the  "catalogs"
       method  instead  of  using  "SGML_CATALOG_FILES"  and  attributes  on  storage  object specifications for
       "SP_BCTF" and "SP_ENCODING" respectively. For example, if "SP_CHARSET_FIXED" is set to 1 you can use

         $p->parse("<OSFILE encoding='UTF-8'>example.xhtml");

       to process "example.xhtml" using the "UTF-8" character encoding.

KNOWN ISSUES

       OpenSP must be compiled with "SP_MULTI_BYTE" defined and with "SP_WIDE_SYSTEM" undefined; otherwise this
       module will break at runtime or fail to compile.

BUG REPORTS

       Please report bugs in this module via <http://rt.cpan.org/NoAuth/Bugs.html?Dist=SGML-Parser-OpenSP>

       Please report bugs in OpenSP via <http://sf.net/tracker/?group_id=2115&atid=102115>

       Please send comments and questions to the spo-devel mailing list; see
       <http://lists.sf.net/lists/listinfo/spo-devel> for details.

SEE ALSO

       •   <http://openjade.sf.net/doc/generic.htm>

       •   <http://openjade.sf.net/>

       •   <http://sf.net/projects/spo/>

AUTHORS

         Terje Bless <link@cpan.org> wrote version 0.01.
         Bjoern Hoehrmann <bjoern@hoehrmann.de> wrote version 0.02+.

COPYRIGHT AND LICENSE

         Copyright (c) 2006-2008 Bjoern Hoehrmann <bjoern@hoehrmann.de>.
         This module is licensed under the same terms as Perl itself.

perl v5.38.2                                       2024-03-31                          SGML::Parser::OpenSP(3pm)