Ubuntu Manpage: HTML::HTML5::Parser

Provided by: libhtml-html5-parser-perl_0.992-2_all

NAME

       HTML::HTML5::Parser - parse HTML reliably

SYNOPSIS

         use HTML::HTML5::Parser;

         my $parser = HTML::HTML5::Parser->new;
         my $doc    = $parser->parse_string(<<'EOT');
         <!doctype html>
         <title>Foo</title>
         <p><b><i>Foo</b> bar</i>.
         <p>Baz</br>Quux.
         EOT

         my $fdoc   = $parser->parse_file( $html_file_name );
         my $fhdoc  = $parser->parse_fh( $html_file_handle );

DESCRIPTION

       This library is substantially the same as the non-CPAN module Whatpm::HTML.  Changes include:

       •       Provides  an  XML::LibXML-like  DOM  interface. If you usually use XML::LibXML's DOM parser, this
               should be a drop-in solution for tag soup HTML.

       •       Constructs an XML::LibXML::Document as the result of parsing.

       •       Via bundling and modifications, removed external dependencies on non-CPAN packages.

   Constructor
       "new"
                 $parser = HTML::HTML5::Parser->new;
                 # or
                 $parser = HTML::HTML5::Parser->new(no_cache => 1);

               The constructor does nothing interesting besides take one flag  argument,  "no_cache  =>  1",  to
               disable  the global element metadata cache. Disabling the cache is handy for conserving memory if
               you parse a large number of documents, however, class methods such  as  "/source_line"  will  not
               work, and must be run from an instance of this parser.

   XML::LibXML-Compatible Methods
       "parse_file", "parse_html_file"
             $doc = $parser->parse_file( $html_file_name [,\%opts] );

           This  function  parses  an  HTML  document  from  a  file or network; $html_file_name can be either a
           filename or an URL.

           Options include 'encoding' to indicate file encoding (e.g.  'utf-8') and 'user_agent' which should be
           a blessed "LWP::UserAgent" (or HTTP::Tiny) object to be used when retrieving URLs.

           If requesting a URL and the response Content-Type header indicates an XML-based media type  (such  as
           XHTML),  XML::LibXML::Parser  will  be  used  automatically (instead of the tag soup parser). The XML
           parser can be told to use a DTD catalogue by setting the option 'xml_catalogue' to  the  filename  of
           the catalogue.

           HTML  (tag  soup) parsing can be forced using the option 'force_html', even when an XML media type is
           returned. If an options hashref was passed, parse_file will set $options->{'parser_used'} to the name
           of the class used to parse the URL, to allow the calling code to double-check which parser  was  used
           afterwards.

           If  an  options  hashref was passed, parse_file will set $options->{'response'} to the HTTP::Response
           object obtained by retrieving the URI.

       "parse_fh", "parse_html_fh"
             $doc = $parser->parse_fh( $io_fh [,\%opts] );

           "parse_fh()" parses a IOREF or a subclass of "IO::Handle".

           Options include 'encoding' to indicate file encoding (e.g.  'utf-8').

       "parse_string", "parse_html_string"
             $doc = $parser->parse_string( $html_string [,\%opts] );

           This function is similar to "parse_fh()", but it parses an HTML  document  that  is  available  as  a
           single string in memory.

           Options include 'encoding' to indicate file encoding (e.g.  'utf-8').

       "load_xml", "load_html"
           Wrappers  for  the  parse_* functions. These should be roughly compatible with the equivalently named
           functions in XML::LibXML.

           Note that "load_xml" first attempts to parse as real XML, falling back to HTML5 parsing;  "load_html"
           just goes straight for HTML5.

       "parse_balanced_chunk"
             $fragment = $parser->parse_balanced_chunk( $string [,\%opts] );

           This  method  is roughly equivalent to XML::LibXML's method of the same name, but unlike XML::LibXML,
           and despite its name it does not require the chunk to be "balanced". This method  is  somewhat  black
           magic,  but should work, and do the proper thing in most cases. Of course, the proper thing might not
           be what you'd expect! I'll try to keep this explanation as brief as possible...

           Consider the following string:

             <b>Hello</b></td></tr> <i>World</i>

           What is the proper way to parse that? If it were found in a document like this:

             <html>
               <head><title>X</title></head>
               <body>
                 <div>
                   <b>Hello</b></td></tr> <i>World</i>
                 </div>
               </body>
             </html>

           Then the document would end up equivalent to the following XHTML:

             <html>
               <head><title>X</title></head>
               <body>
                 <div>
                   <b>Hello</b> <i>World</i>
                 </div>
               </body>
             </html>

           The superfluous "</td></tr>" is simply ignored. However, if it were found in a document like this:

             <html>
               <head><title>X</title></head>
               <body>
                 <table><tbody><tr><td>
                   <b>Hello</b></td></tr> <i>World</i>
                 </td></tr></tbody></table>
               </body>
             </html>

           Then the result would be:

             <html>
               <head><title>X</title></head>
               <body>
                 <i>World</i>
                 <table><tbody><tr><td>
                   <b>Hello</b></td></tr>
                 </tbody></table>
               </body>
             </html>

           Yes, "<i>World</i>" gets hoisted up before the "<table>".  This  is  weird,  I  know,  but  it's  how
           browsers do it in real life.

           So what should:

             $string   = q{<b>Hello</b></td></tr> <i>World</i>};
             $fragment = $parser->parse_balanced_chunk($string);

           actually return? Well, you can choose...

             $string = q{<b>Hello</b></td></tr> <i>World</i>};

             $frag1  = $parser->parse_balanced_chunk($string, {within=>'div'});
             say $frag1->toString; # <b>Hello</b> <i>World</i>

             $frag2  = $parser->parse_balanced_chunk($string, {within=>'td'});
             say $frag2->toString; # <i>World</i><b>Hello</b>

           If you don't pass a "within" option, then the chunk is parsed as if it were within a "<div>" element.
           This  is  often  the  most sensible option. If you pass something like "{ within => "foobar" }" where
           "foobar" is not a real HTML element name (as found in the HTML5 spec), then this method  will  croak;
           if you pass the name of a void element (e.g. "br" or "meta") then this method will croak; there are a
           handful of other unsupported elements which will croak (namely: "noscript", "noembed", "noframes").

           Note  that  the  second  time  around,  although  we parsed the string "as if it were within a "<td>"
           element", the "<i>Hello</i>" bit did not strictly end up within the "<td>" element (not  even  within
           the "<table>" element!) yet it still gets returned.  We'll call things such as this "outliers". There
           is a "force_within" option which tells parse_balanced_chunk to ignore outliers:

             $frag3  = $parser->parse_balanced_chunk($string,
                                                     {force_within=>'td'});
             say $frag3->toString; # <b>Hello</b>

           There   is   a   boolean   option   "mark_outliers"  which  marks  each  outlier  with  an  attribute
           ("data-perl-html-html5-parser-outlier") to indicate its outlier status. Clearly, this is ignored when
           you use "force_within" because no outliers are  returned.  Some  outliers  may  be  XML::LibXML::Text
           elements; text nodes don't have attributes, so these will not be marked with an attribute.

           A   last   note   is   to   mention   what   gets   returned   by   this  method.  Normally  it's  an
           XML::LibXML::DocumentFragment object, but if you call the method in  list  context,  a  list  of  the
           individual  node  elements  is  returned. Alternatively you can request the data to be returned as an
           XML::LibXML::NodeList object:

            # Get an XML::LibXML::NodeList
            my $list = $parser->parse_balanced_chunk($str, {as=>'list'});

           The exact implementation of this method may change from version to version, but  the  long-term  goal
           will be to approach how common desktop browsers parse HTML fragments when implementing the setter for
           DOM's "innerHTML" attribute.

       The  push  parser  and  SAX-based  parser  are  not  supported.  Trying  to  change  an  option  (such as
       recover_silently) will make HTML::HTML5::Parser carp a warning. (But you can inspect the options.)

   Error Handling
       Error handling is obviously different to XML::LibXML, as errors are (bugs notwithstanding) non-fatal.

       "error_handler"
           Get/set an error handling function. Must be set to a coderef or undef.

           The error handling function will be called with  a  single  parameter,  a  HTML::HTML5::Parser::Error
           object.

       "errors"
           Returns a list of errors that occurred during the last parse.

           See HTML::HTML5::Parser::Error.

   Additional Methods
       The module provides a few methods to obtain additional, non-DOM data from DOM nodes.

       "dtd_public_id"
             $pubid = $parser->dtd_public_id( $doc );

           For  an  XML::LibXML::Document which has been returned by HTML::HTML5::Parser, using this method will
           tell you the Public Identifier of the DTD used (if any).

       "dtd_system_id"
             $sysid = $parser->dtd_system_id( $doc );

           For an XML::LibXML::Document which has been returned by HTML::HTML5::Parser, using this  method  will
           tell you the System Identifier of the DTD used (if any).

       "dtd_element"
             $element = $parser->dtd_element( $doc );

           For  an  XML::LibXML::Document which has been returned by HTML::HTML5::Parser, using this method will
           tell you the root element declared in the DTD used (if any).  That  is,  if  the  document  has  this
           doctype:

             <!doctype html>

           ... it will return "html".

           This may return the empty string if a DTD was present but did not contain a root element; or undef if
           no DTD was present.

       "compat_mode"
             $mode = $parser->compat_mode( $doc );

           Returns 'quirks', 'limited quirks' or undef (standards mode).

       "charset"
             $charset = $parser->charset( $doc );

           The character set apparently used by the document.

       "source_line"
             ($line, $col) = $parser->source_line( $node );
             $line = $parser->source_line( $node );

           In scalar context, "source_line" returns the line number of the source code that started a particular
           node (element, attribute or comment).

           In list context, returns a tuple: $line, $column, $implicitness.  Tab characters count as one column,
           not eight.

           $implicitness  indicates  that  the  node  was  not  explicitly marked up in the source code, but its
           existence was inferred by the parser.  For example, in the following markup, the HTML,  TITLE  and  P
           elements are explicit, but the HEAD and BODY elements are implicit.

            <html>
             <title>I have an implicit head</title>
             <p>And an implicit body too!</p>
            </html>

           (Note  that  implicit  elements  do  still  have  a  line  number and column number.) The implictness
           indicator is a new feature, and I'd appreciate any bug reports where it gets things wrong.

           XML::LibXML::Node  has  a  "line_number"  method.  In  general  this  will  always   return   0   and
           HTML::HTML5::Parser    has    no    way    of    influencing    it.    However,    if   you   install
           XML::LibXML::Devel::SetLineNumber on your system, the "line_number" method  will  start  working  (at
           least for elements).

AUTHOR

       Toby Inkster, <tobyink@cpan.org>

COPYRIGHT AND LICENCE

       Copyright (C) 2007-2011 by Wakaba

       Copyright (C) 2009-2012 by Toby Inkster

       This library is free software; you can redistribute it and/or modify it under  the  same  terms  as  Perl
       itself.

DISCLAIMER OF WARRANTIES

       THIS  PACKAGE  IS  PROVIDED  "AS  IS"  AND  WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, WITHOUT
       LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE.

perl v5.36.0                                       2022-08-28                           HTML::HTML5::Parser(3pm)

NAME

SYNOPSIS

DESCRIPTION

SEE ALSO

AUTHOR

COPYRIGHT AND LICENCE

DISCLAIMER OF WARRANTIES