Provided by: tcllib_1.21+dfsg-1_all

NAME

       htmlparse - Procedures to parse HTML strings

SYNOPSIS

       package require Tcl  8.2

       package require struct::stack  1.3

       package require cmdline  1.1

       package require htmlparse  ?1.2.2?

       ::htmlparse::parse ?-cmd cmd? ?-vroot tag? ?-split n? ?-incvar var? ?-queue q? html

       ::htmlparse::debugCallback ?clientdata? tag slash param textBehindTheTag

       ::htmlparse::mapEscapes html

       ::htmlparse::2tree html tree

       ::htmlparse::removeVisualFluff tree

       ::htmlparse::removeFormDefs tree

________________________________________________________________________________________________________________

DESCRIPTION

       The  htmlparse  package provides commands that allow libraries and applications to parse HTML in a string
       into a representation of their choice.

       The following commands are available:

       ::htmlparse::parse ?-cmd cmd? ?-vroot tag? ?-split n? ?-incvar var? ?-queue q? html
              This command is the basic parser for HTML. It takes an  HTML  string,  parses  it  and  invokes  a
              command  prefix  for  every tag encountered. It is not necessary for the HTML to be valid for this
              parser to function. It is the responsibility of the command invoked for every tag to  check  this.
              Another  responsibility  of  the  invoked  command is the handling of tag attributes and character
              entities (escaped characters). The parser  provides  the  un-interpreted  tag  attributes  to  the
              invoked  command  to  aid  in  the  former,  and  the  package at large provides a helper command,
              ::htmlparse::mapEscapes, to aid in the handling of the latter. The parser ignores leading
              DOCTYPE declarations and all valid HTML comments it encounters.
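
              As a minimal sketch of the callback contract described above, the following script registers a
              command prefix that records each tag together with its raw attribute string. The names recordTag
              and ::seen are illustrative and not part of the package.

```tcl
package require htmlparse

# Hypothetical callback: remember every tag the parser reports,
# together with its un-interpreted attribute string.
proc recordTag {tag slash param text} {
    lappend ::seen [list $slash$tag $param]
}

set ::seen {}
::htmlparse::parse -cmd recordTag {<p class="intro">Hello <b>world</b></p>}

# Each entry is {name attributes}; closing tags carry a leading slash.
foreach entry $::seen {
    puts $entry
}
```

              Splitting the attribute string into names and values is left to the callback, as noted above.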

              All information beyond the HTML string itself is specified via options, which are explained below.

              To help understand the options, here is some background information about the parser.

              It  is  capable  of  detecting  incomplete  tags  in  the  HTML  string  given to it. Under normal
              circumstances this will cause the parser to throw an error, but if the option -incvar is  used  to
              specify  a  global (or namespace) variable, the parser will store the incomplete part of the input
              into this variable instead. This will aid greatly in the handling of incrementally arriving  HTML,
              as the parser will handle whatever it can and defer the handling of the incomplete part until more
              data has arrived.
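
              The handling of incrementally arriving HTML might be sketched as follows. This assumes that the
              parser picks up the fragment stored in the variable when it is called again with more data; the
              names recordTag, ::seen and ::pending are illustrative only.

```tcl
package require htmlparse

# Hypothetical callback: record opening and closing tags in order.
proc recordTag {tag slash param text} {
    lappend ::seen $slash$tag
}

set ::seen {}
set ::pending {}   ;# receives any incomplete trailing tag

# Two chunks, split in the middle of the <div> tag.
foreach chunk [list {<p>first</p><di} {v>second</div>}] {
    ::htmlparse::parse -cmd recordTag -incvar ::pending $chunk
    puts "pending after chunk: '$::pending'"
}
puts $::seen
```

              Without -incvar the first call would be expected to fail with an error, because the chunk ends in
              an incomplete tag.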

              Another feature of the parser is its two possible modes of operation. The normal mode is
              activated if the option -queue is not present on the command line invoking the parser.  If  it  is
              present, the parser will go into the incremental mode instead.

              The main difference is that a parser in normal mode will immediately invoke the command prefix for
              each  tag  it encounters. In incremental mode however the parser will generate a number of scripts
              which invoke the command prefix for groups of tags in the HTML string and then store these scripts
              in the specified queue. It is then the responsibility of the caller of the parser  to  ensure  the
              execution of the scripts in the queue.

              Note:  The queue object given to the parser has to provide the same interface as the queue defined
              in tcllib -> struct. This means, for example, that all queues created via that tcllib  module  can
              be  immediately  used here. Still, the queue doesn't have to come from tcllib -> struct as long as
              the same interface is provided.

              In both modes the parser will return an empty string to the caller.

              The -split option may be given to a parser in incremental mode to specify the size of  the  groups
              it  creates.  In  other  words,  -split 5 means that each of the generated scripts will invoke the
              command prefix for 5 consecutive tags in the HTML string. A parser in normal mode will ignore this
              option and its value.

              The option -vroot specifies a virtual root tag. A parser in normal mode will  invoke  the  command
              prefix for it immediately before and after it processes the tags in the HTML, thus simulating that
              the  HTML  string  is  enclosed in a <vroot> </vroot> combination. In incremental mode however the
              parser is unable to provide the closing virtual root as it never knows when the input is complete.
              In this case the first script  generated  by  each  invocation  of  the  parser  will  contain  an
              invocation of the command prefix for the virtual root as its first command.  The following options
              are available:

              -cmd cmd
                     The   command   prefix   to   invoke  for  every  tag  in  the  HTML  string.  Defaults  to
                     ::htmlparse::debugCallback.

              -vroot tag
                     The virtual root tag to add around the HTML in normal mode. In incremental mode it  is  the
                     first  tag  in  each  chunk  processed  by  the  parser, but there will be no closing tags.
                     Defaults to hmstart.

              -split n
                     The size of the groups produced by an incremental mode parser. Ignored when in normal mode.
                     Defaults to 10. Values <= 0 are not allowed.

              -incvar var
                     The name of the variable in which to store any incomplete HTML. This makes most sense for
                     the incremental mode. The parser will throw an error if it sees incomplete HTML and has no
                     place to store it. This makes sense for the normal mode. Only incomplete tags are
                     detected, not missing tags. Optional, defaults to 'no variable'.

              -queue q
                     The queue in which to store the generated scripts. The presence of this option activates
                     incremental mode; if it is absent the parser operates in normal mode.

              Interface to the command prefix
                     In normal mode the parser will invoke the command prefix with four arguments appended.  See
                     ::htmlparse::debugCallback for a description.

                     In  incremental  mode,  however,  the generated scripts will invoke the command prefix with
                     five arguments appended. The last four of these are the same as those mentioned above.
                     The first is a placeholder string (@win@) for a clientdata value to be supplied later
                     during the actual execution of the generated scripts. This could be a Tk window path, for
                     example. This allows the user of this package to preprocess HTML strings without committing
                     them to a specific window, object, or similar during parsing. This connection can be made
                     later. This also means that it is possible to cache preprocessed HTML. Of  course,  nothing
                     prevents the user of the parser from replacing the placeholder with an empty string.
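
                     A hedged sketch of incremental mode: the parser only fills the queue, and the caller later
                     binds the placeholder and runs the scripts. Replacing @win@ via string map is one possible
                     approach, not something the package prescribes; recordTag, ::seen and the value .viewer
                     are illustrative.

```tcl
package require struct::queue
package require htmlparse

# Hypothetical callback for incremental mode: clientdata comes first.
proc recordTag {clientdata tag slash param text} {
    lappend ::seen [list $clientdata $slash$tag]
}

set ::seen {}
set q [::struct::queue]

# Parsing stores scripts in the queue; each covers a group of tags.
::htmlparse::parse -cmd recordTag -queue $q -split 2 {<p>one</p><p>two</p>}

# Later: bind the placeholder to a concrete value and run the scripts.
# ".viewer" is an illustrative value, not a real widget.
while {[$q size] > 0} {
    set script [string map [list @win@ .viewer] [$q get]]
    uplevel #0 $script
}
puts $::seen
```

                     Because the scripts are plain strings, they can be cached and executed more than once with
                     different clientdata values.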

       ::htmlparse::debugCallback ?clientdata? tag slash param textBehindTheTag
              This  command  is  the  standard  callback  used  by  the parser in ::htmlparse::parse if none was
              specified by the user. It simply dumps its arguments to stdout.  This callback  can  be  used  for
              both normal and incremental mode of the calling parser. In other words, it accepts four or five
              arguments. The last four arguments are described below. The optional additional argument comes
              first and contains the clientdata value passed to the callback by a parser in incremental mode.
              All callbacks have to follow the signature of this command in the last four arguments, and
              callbacks used in incremental parsing have to follow it in all five arguments.

              The first argument, clientdata, is optional and present only if  this  command  is  invoked  by  a
              parser in incremental mode. It contains whatever the user of this package wishes.

              The second argument, tag, contains the name of the tag which is currently processed by the parser.

              The  third  argument, slash, is either empty or contains a slash character. It allows the callback
              to distinguish between opening  (slash  is  empty)  and  closing  tags  (slash  contains  a  slash
              character).

              The fourth argument, param, contains the un-interpreted list of parameters to the tag.

              The  fifth  and  last argument, textBehindTheTag, contains the text found by the parser behind the
              tag named in tag.
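
              A callback that serves both modes can inspect its argument count, as in this minimal sketch. The
              name describeTag and its return format are illustrative; only the argument order is taken from
              the description above.

```tcl
# Hypothetical callback usable in both modes: it accepts four arguments
# (normal mode) or five (incremental mode, clientdata first).
proc describeTag {args} {
    if {[llength $args] == 5} {
        set clientdata [lindex $args 0]
        set args [lrange $args 1 end]
    } else {
        set clientdata {}
    }
    foreach {tag slash param text} $args break
    set kind [expr {$slash eq "/" ? "close" : "open"}]
    return [list $kind $tag $param $text $clientdata]
}

puts [describeTag p {} align=left Hello]   ;# open p align=left Hello {}
puts [describeTag .viewer p / {} {}]       ;# close p {} {} .viewer
```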

       ::htmlparse::mapEscapes html
              This command takes an HTML string, substitutes all escape sequences with their actual characters
              and  then  returns  the  resulting string.  HTML strings which do not contain escape sequences are
              returned unchanged.
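
              For illustration, a short usage sketch with two common named entities:

```tcl
package require htmlparse

puts [::htmlparse::mapEscapes {Tom &amp; Jerry &lt;3}]   ;# Tom & Jerry <3
puts [::htmlparse::mapEscapes {no escapes here}]         ;# no escapes here
```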

       ::htmlparse::2tree html tree
              This command is a wrapper around ::htmlparse::parse which takes  an  HTML  string  (in  html)  and
              converts  it  into a tree containing the logical structure of the parsed document. The name of the
              tree is given to the command as its second argument (tree). The command does not generate the tree
              by itself but expects that the caller provided it with an existing and empty tree. It also expects
              that the specified tree object follows the same interface as the tree object in tcllib ->  struct.
              It doesn't have to be from tcllib -> struct, but it must provide the same interface.

              The internal callback does some basic checking of HTML validity and tries to recover from the most
              basic  errors.  The  command  returns  the  contents  of its second argument. Side effects are the
              creation and manipulation of a tree object.

              Each node in the generated tree represents one tag in the input. The name of the tag is stored in
              the attribute type of the node. Any HTML attributes coming with the tag are stored unmodified in
              the attribute data of the node. In other words, the command does not parse HTML attributes into
              their names and values.

              If  a  tag contains text its node will have children of type PCDATA containing this text. The text
              will be stored in the attribute data of these children.
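
              The tree layout described above can be inspected with a sketch like the following. It assumes
              the tree object comes from struct::tree version 2, whose walk method takes a loop variable and
              a script; other tree implementations may differ.

```tcl
package require struct::tree
package require htmlparse

set t [::struct::tree]
::htmlparse::2tree {<html><body><p>Hello</p></body></html>} $t

# Print one line per node: its type and its data, where present.
# Assumes the loop-variable form of "walk" from struct::tree version 2.
$t walk root -order pre node {
    set type [$t get $node type]
    set data [expr {[$t keyexists $node data] ? [$t get $node data] : ""}]
    puts "$node: $type '$data'"
}
```

              Text such as "Hello" should appear as the data of a PCDATA child below the node for the p tag.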

       ::htmlparse::removeVisualFluff tree
              This command walks a tree as generated by ::htmlparse::2tree  and  removes  all  the  nodes  which
              represent  visual  tags  and  not  structural ones. The purpose of the command is to make the tree
              easier to navigate without getting bogged down in visual information not relevant to  the  search.
              Its only argument is the name of the tree to cut down.

       ::htmlparse::removeFormDefs tree
              Like  ::htmlparse::removeVisualFluff  this  command is here to cut down on the size of the tree as
              generated by ::htmlparse::2tree. It removes all nodes representing forms and  form  elements.  Its
              only argument is the name of the tree to cut down.
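
              The effect of both pruning commands can be observed by counting nodes, as in this sketch. Which
              tags count as visual or form-related is decided by the package; the input string and the
              variable names are illustrative.

```tcl
package require struct::tree
package require htmlparse

set t [::struct::tree]
::htmlparse::2tree {<p><b>bold</b> text</p><form><input type="text"></form>} $t
set before [$t size]

::htmlparse::removeVisualFluff $t
set afterFluff [$t size]

::htmlparse::removeFormDefs $t
set afterForms [$t size]

puts "nodes: $before -> $afterFluff -> $afterForms"
```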

BUGS, IDEAS, FEEDBACK

       This  document,  and  the package it describes, will undoubtedly contain bugs and other problems.  Please
       report such in the category htmlparse  of  the  Tcllib  Trackers  [http://core.tcl.tk/tcllib/reportlist].
       Please also report any ideas for enhancements you may have for either package and/or documentation.

       When proposing code changes, please provide unified diffs, i.e. the output of diff -u.

       Note  further  that  attachments  are strongly preferred over inlined patches. Attachments can be made by
       going to the Edit form of the ticket immediately after its creation, and then using the left-most  button
       in the secondary navigation bar.

SEE ALSO

       struct::tree

KEYWORDS

       html, parsing, queue, tree

CATEGORY

       Text processing

tcllib                                                1.2.2                                      htmlparse(3tcl)