Provided by: swish-e_2.4.7-6.2build3_amd64

NAME

       SWISH-FAQ - The Swish-e FAQ. Answers to Common Questions

OVERVIEW

       List  of  commonly  asked and answered questions.  Please review this document before asking questions on
       the Swish-e discussion list.

       General Questions

       What is Swish-e?

       Swish-e is the Simple Web Indexing System for Humans - Enhanced.  With it, you can quickly and easily
       index directories of files or remote web sites and search the generated indexes for words and phrases.

       So, is Swish-e a search engine?

       Well,  yes.   Probably  the  most common use of Swish-e is to provide a search engine for web sites.  The
       Swish-e distribution includes CGI scripts that can be used with it to add a search engine  for  your  web
       site.  The CGI scripts can be found in the example directory of the distribution package.  See the README
       file for information about the scripts.

       But  Swish-e  can  also  be  used  to  index  all sorts of data, such as email messages, data stored in a
       relational database management system, XML documents, or documents such as Word and PDF documents  --  or
       any combination of those sources at the same time.  Searches can be limited to fields or MetaNames within
       a  document,  or  limited  to  areas within an HTML document (e.g. body, title).  Programs other than CGI
       applications can use Swish-e, as well.

       Should I upgrade if I'm already running a previous version of Swish-e?

       A large number of bug fixes, feature additions, and logic corrections were made in version 2.2.  In
       addition, indexing speed has been drastically improved (reports of indexing times dropping from four
       hours to five minutes), and major parts of the indexing and search parsers have been rewritten.  There
       are better debugging options, enhanced output formats, more document meta data (e.g. last modified
       date, document summary), options for indexing from external data sources, and faster spidering, just
       to name a few changes.  (See the CHANGES file for more information.)

       Since so much effort has gone into version 2.2, support for previous versions will probably be limited.

       Are there binary distributions available for Swish-e on platform foo?

       Foo?   Well,  yes  there  are some binary distributions available.  Please see the Swish-e web site for a
       list at http://swish-e.org/.

       In general, it is recommended that you build Swish-e from source, if possible.

       Do I need to reindex my site each time I upgrade to a new Swish-e version?

       At times it might not strictly be necessary, but since you don't really know if anything in the index has
       changed, it is a good rule to reindex.

       What's the advantage of using the libxml2 library for parsing HTML?

       Swish-e may be linked with libxml2, a library for working with HTML and XML documents.  Swish-e  can  use
       libxml2 for parsing HTML and XML documents.

       The  libxml2 parser is a better parser than Swish-e's built-in HTML parser.  It offers more features, and
       it does a much better job at extracting out the text from a web page.   In  addition,  you  can  use  the
       "ParserWarningLevel"  configuration  setting  to find structural errors in your documents that could (and
       would with Swish-e's HTML parser) cause documents to be indexed incorrectly.

       Libxml2 is not required, but is strongly recommended for parsing HTML documents.  It's  also  recommended
       for parsing XML, as it offers many more features than the internal Expat xml.c parser.

       The internal HTML parser will receive only limited support, and it does have a number of bugs.  For
       example, HTML entities may not always be converted correctly, and entities in properties are not
       converted at all.  The internal parser also tends to get confused by invalid HTML, where the libxml2
       parser does not get confused as often, and document structure is detected better by the libxml2
       parser.

       If you are using the Perl module (the C interface to the Swish-e library)  you  may  wish  to  build  two
       versions  of  Swish-e,  one with the libxml2 library linked in the binary, and one without, and build the
       Perl module against the library without the libxml2  code.   This  is  to  save  space  in  the  library.
       Hopefully, the library will someday soon be split into indexing and searching code (volunteers welcome).

       Does Swish-e include a CGI interface?

       Yes.  Kind of.

       There are two example CGI scripts included, swish.cgi and search.cgi.  Both are installed at
       $prefix/lib/swish-e.

       Both require a bit of work to setup and use.  Swish.cgi is probably what most people will want to use  as
       it  contains more features.  Search.cgi is for those that want to start with a small script and customize
       it to fit their needs.

       An example of using swish.cgi is given in the INSTALL man page and in the swish.cgi documentation.  As
       is often the case, it will be easier to use if you first read the documentation.

       Please use caution about CGI scripts found on the Internet for use with Swish-e.  Some are not secure.

       The included example CGI scripts were designed with security in mind.  Regardless, you are encouraged  to
       have  your  local  Perl  expert  review  it  (and  all  other CGI scripts you use) before placing it into
       production.  This is just a good policy to follow.

       How secure is Swish-e?

       We know of no security issues with using Swish-e.  Careful attention has been paid to common security
       problems, such as buffer overruns, when programming Swish-e.

       The  most  likely security issue with Swish-e is when it is run via a poorly written CGI interface.  This
       is not limited to CGI scripts written in Perl, as it's just as easy to write an insecure CGI script in C,
       Java, PHP, or Python.  A good source of  information  is  included  with  the  Perl  distribution.   Type
       "perldoc  perlsec"  at  your local prompt for more information.  Another must-read document is located at
       "http://www.w3.org/Security/faq/wwwsf4.html".

       Note that there are many free yet insecure and poorly written CGI scripts available -- even some designed
       for use with Swish-e.  Please carefully review any CGI script you use.  Free is not  such  a  good  price
       when you get your server hacked...

       Should I run Swish-e as the superuser (root)?

       No.  Never.

       What files does Swish-e write?

       Swish  writes  the index file, of course.  This is specified with the "IndexFile" configuration directive
       or by the "-f" command line switch.

       The index file is actually a collection of files, but all start with the file  name  specified  with  the
       "IndexFile" directive or the "-f" command line switch.

       For example, the file ending in .prop contains the document properties.

       When creating the index files Swish-e appends the extension .temp to the index file names.  When indexing
       is complete Swish-e renames the .temp files to the index files specified by "IndexFile" or "-f".  This is
       done so that existing indexes remain untouched until it completes indexing.

       Swish-e  also  writes  temporary  files  in  some  cases  during indexing (e.g. "-s http", "-s prog" with
       filters), when merging, and when using "-e").  Temporary files are created with the  mkstemp(3)  function
       (with 0600 permission on unix-like operating systems).

       The temporary files are created in the directory specified by the environment variables "TMPDIR" and
       "TMP", in that order.  If neither is set then swish uses the "TmpDir" configuration setting.
       Otherwise, the temporary files will be created in the current directory.

       Can I index PDF and MS-Word documents?

       Yes,  you  can  use  a  Filter to convert documents while indexing, or you can use a program that "feeds"
       documents to Swish-e that have already been converted.  See "Indexing" below.

       Can I index documents on a web server?

       Yes, Swish-e provides two ways to index (spider) documents on a web server.  See "Spidering" below.

       Swish-e can retrieve documents from a file system or from a remote web server.  It can also execute a
       program that returns documents back to it.  This program can retrieve documents from a database, filter
       compressed document files, convert PDF files, extract data from mail archives, or spider remote web
       sites.

       Can I implement keywords in my documents?

       Yes,  Swish-e can associate words with MetaNames while indexing, and you can limit your searches to these
       MetaNames while searching.

       In your HTML files you can put keywords in HTML META tags or in XML blocks.

       META tags can have two formats in your source documents:

           <META NAME="DC.subject" CONTENT="digital libraries">

       And in XML format (can also be used in HTML documents when using libxml2):

           <meta2>
               Some Content
           </meta2>

       Then, to inform Swish-e about the existence of the meta name in your documents, edit  the  line  in  your
       configuration file:

           MetaNames DC.subject meta1 meta2

       When searching you can now limit some or all search terms to that MetaName.  For example, you might
       look for documents that contain the word apple and also have either fruit or cooking in the DC.subject
       meta tag.
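       Such a search might look like this (the query itself is purely illustrative, using the meta name
       search syntax shown later in this FAQ):

```
swish-e -w 'apple and DC.subject=(fruit or cooking)'
```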

       What are document properties?

       A document property is typically data that describes the document.  For example, properties might include
       a document's path name, its last modified date, its title, or its  size.   Swish-e  stores  a  document's
       properties in the index file, and they can be reported back in search results.

       Swish-e  also  uses  properties  for  sorting.   You  may sort your results by one or more properties, in
       ascending or descending order.

       Properties can also be defined within your documents.  HTML and XML files can specify tags (see  previous
       question)  as  properties.   The  contents of these tags can then be returned with search results.  These
       user-defined properties can also be used for sorting search results.

       For example, if you had the following in your documents

          <meta name="creator" content="accounting department">

       and "creator" is defined  as  a  property  (see  "PropertyNames"  in  SWISH-CONFIG)  Swish-e  can  return
       "accounting department" with the result for that document.

           swish-e -w foo -p creator

       Or for sorting:

           swish-e -w foo -s creator

       What's the difference between MetaNames and PropertyNames?

       MetaNames  allows  keywords  searches  in  your  documents.   That  is, you can use MetaNames to restrict
       searches to just parts of your documents.

       PropertyNames, on the other hand, define text that can be returned with results,  and  can  be  used  for
       sorting.

       Both  use  meta tags found in your documents (as shown in the above two questions) to define the text you
       wish to use as a property or meta name.

       You may define a tag as both a property and a meta name.  For example:

          <meta name="creator" content="accounting department">

       placed in your documents and then using configuration settings of:

           PropertyNames creator
           MetaNames creator

       will allow you to limit your searches to documents created by accounting:

           swish-e -w 'foo and creator=(accounting)'

       That will find all documents with the word "foo" that also have a creator meta tag that contains the word
       "accounting".  This is using MetaNames.

       And you can also say:

           swish-e -w foo -p creator

       which will return all documents with the word "foo", but the results will also include  the  contents  of
       the "creator" meta tag along with results.  This is using properties.

       You can use properties and meta names at the same time, too:

           swish-e -w creator=(accounting or marketing) -p creator -s creator

       That searches only in the "creator" meta name for either of the words "accounting" or "marketing", prints
       out  the  contents  of  the  contents  of  the "creator" property, and sorts the results by the "creator"
       property name.

       (See also the "-x" output format switch in SWISH-RUN.)

       Can Swish-e index multi-byte characters?

       No.  Adding multi-byte support would require much work.  But Swish-e works with eight-bit characters,
       so many character sets can be used.  Note that it does call the ANSI C tolower() function, which
       depends on the current locale setting.  See locale(7) for more information.

       Indexing

       How do I pass Swish-e a list of files to index?

       Currently, there is not a configuration directive to include a file that contains  a  list  of  files  to
       index.  But, there is a directive to include another configuration file.

           IncludeConfigFile /path/to/other/config

       And in "/path/to/other/config" you can say:

           IndexDir file1 file2 file3 file4 file5 ...
           IndexDir file20 file21 file22

       You may also specify more than one configuration file on the command line:

           ./swish-e -c config_one config_two config_three

       Another  option  is  to create a directory with symbolic links of the files to index, and index just that
       directory.
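       As a sketch of that last approach (the paths here are hypothetical, and you may need the
       "FollowSymLinks" directive set for the links to be followed):

```
mkdir /tmp/to_index
ln -s /docs/report1.html /docs/report2.html /tmp/to_index/
swish-e -i /tmp/to_index -v 1
```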

       How does Swish-e know which parser to use?

       Swish can parse HTML, XML, and text documents.  The parser is set by associating a file extension with  a
       parser  by  the  "IndexContents"  directive.   You  may set the default parser with the "DefaultContents"
       directive.  If a document is not assigned a parser it will default to the HTML  parser  (HTML2  if  built
       with libxml2).
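       For example, a configuration might map file extensions to parsers like this (the extension lists are
       only an illustration; the HTML2, XML2, and TXT2 parsers require libxml2):

```
IndexContents HTML2 .htm .html
IndexContents XML2 .xml .rdf
IndexContents TXT2 .txt .log
DefaultContents HTML2
```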

       You may use Filters or an external program to convert documents to HTML, XML, or text.

       Can I reindex and search at the same time?

       Yes.   Starting  with  version  2.2  Swish-e  indexes to temporary files, and then renames the files when
       indexing is complete.  On most systems renames are atomic.  But, since Swish-e also generates  more  than
       one  file  during  indexing  there will be a very short period of time between renaming the various files
       when the index is out of sync.

       Settings in src/config.h control some options related to temporary files, and their use during indexing.

       Can I index phrases?

       Phrases are indexed automatically.  To search for a phrase simply place double quotes around the phrase.

       For example:

           swish-e -w 'free and "fast search engine"'

       How can I prevent phrases from matching across sentences?

       Use the BumpPositionCounterCharacters configuration directive.
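       For example, to bump the word position counter at sentence-ending punctuation, so phrases cannot match
       across those characters (the exact character list is your choice):

```
BumpPositionCounterCharacters .!?
```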

       Swish-e isn't indexing a certain word or phrase.

       There are a number of configuration parameters that control what Swish-e considers a "word" and it has  a
       debugging feature to help pinpoint any indexing problems.

       Configuration   file  directives  (SWISH-CONFIG)  "WordCharacters",  "BeginCharacters",  "EndCharacters",
       "IgnoreFirstChar", and "IgnoreLastChar" are the main settings that Swish-e uses to define a "word".   See
       SWISH-CONFIG and SWISH-RUN for details.

       Swish-e also uses compile-time defaults for many settings.  These are located in src/config.h file.

       The command line arguments "-k", "-v" and "-T" are useful when debugging these problems.  Using
       "-T INDEXED_WORDS" while indexing will display each word as it is indexed.  You should specify only one
       file when using this feature, since it can generate a lot of output.

            ./swish-e -c my.conf -i problem.file -T INDEXED_WORDS

       You may also wish to index a single file that contains words that are or are not being indexed as you
       expect, and use -T to output debugging information about the index.  A useful command might be:

           ./swish-e -f index.swish-e -T INDEX_FULL

       Once you see how Swish-e is parsing and indexing your words, you can adjust  the  configuration  settings
       mentioned above to control what words are indexed.

       Another useful command might be:

            ./swish-e -c my.conf -i problem.file -T PARSED_WORDS INDEXED_WORDS

       This  will show white-spaced words parsed from the document (PARSED_WORDS), and how those words are split
       up into separate words for indexing (INDEXED_WORDS).

       How do I keep Swish-e from indexing numbers?

       Swish-e indexes words as defined by the "WordCharacters"  setting,  as  described  above.   So  to  avoid
       indexing numbers you simply remove digits from the "WordCharacters" setting.
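       For example, a "WordCharacters" setting with the digits removed might look like this (a minimal
       illustration; your real setting will likely include more characters):

```
WordCharacters abcdefghijklmnopqrstuvwxyz.-
```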

       There  are  also  some settings in src/config.h that control what "words" are indexed.  You can configure
       swish to never index words that are all digits, vowels, or consonants, or that  contain  more  than  some
       consecutive  number  of  digits,  vowels,  or  consonants.   In  general,  you won't need to change these
       settings.

       Also, there's an experimental feature called "IgnoreNumberChars" which allows you  to  define  a  set  of
       characters that describe a number.  If a word is made up of only those characters it will not be indexed.
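       For example, to skip words made up entirely of digits and common numeric punctuation (an illustrative
       character set):

```
IgnoreNumberChars 0123456789.,-
```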

       Swish-e crashes and burns on a certain file. What can I do?

       This  shouldn't  happen.   If it does please post to the Swish-e discussion list the details so it can be
       reproduced by the developers.

       In the mean time, you can use a "FileRules" directive to exclude the particular file name,  or  pathname,
       or  its title.  If there are serious problems in indexing certain types of files, they may not have valid
       text in them (they may be binary files, for instance). You can use NoContents to  exclude  that  type  of
       file.

       Swish-e  will issue a warning if an embedded null character is found in a document.  This warning will be
       an indication that you are trying to index binary data.  If you need to index binary files try to find  a
       program that will extract out the text (e.g. strings(1), catdoc(1), pdftotext(1)).

       How do I prevent indexing of some documents?

       When  using  the  file  system  to  index  your  files you can use the "FileRules" directive.  Other than
       "FileRules title", "FileRules" only works with the file system ("-S fs") indexing method,  not  with  "-S
       prog" or "-S http".

       If you are spidering a site you have control over, use a robots.txt file in your document root.  This
       is a standard way to exclude files from search engines, and it is fully supported by Swish-e.  See
       http://www.robotstxt.org/
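       A minimal robots.txt that keeps all robots (including Swish-e's spider) out of a hypothetical /private/
       area looks like this:

```
User-agent: *
Disallow: /private/
```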

       If  spidering  a website with the included spider.pl program then add any necessary tests to the spider's
       configuration file.  Type <perldoc spider.pl> in the "prog-bin" directory for details or see  the  spider
       documentation on the Swish-e website.  Look for the section on callback functions.

       If  using the libxml2 library for parsing HTML (which you probably are), you may also use the Meta Robots
       Exclusion in your documents:

           <meta name="robots" content="noindex">

       See the obeyRobotsNoIndex directive.

       How do I prevent indexing parts of a document?

       To prevent Swish-e from indexing a common header, footer, or navigation bar, AND you  are  using  libxml2
       for  parsing  HTML,  then  you  may  use  a  fake HTML tag around the text you wish to ignore and use the
       "IgnoreMetaTags" directive.  This will generate an error message if the "ParserWarningLevel"  is  set  as
       it's invalid HTML.

       "IgnoreMetaTags"  works with XML documents (and HTML documents when using libxml2 as the parser), but not
       with documents parsed by the text (TXT) parser.
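       As a sketch, you could wrap the navigation text in a made-up tag (the tag name here is arbitrary):

```
<nav-ignore>
    ... header and navigation links ...
</nav-ignore>
```

       and tell Swish-e to skip that content with:

```
IgnoreMetaTags nav-ignore
```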

       If you are using the libxml2 parser (HTML2 and XML2) then you can use the following comments in your
       documents to prevent indexing:

              <!-- SwishCommand noindex -->
              <!-- SwishCommand index -->

       and/or these may be used also:

              <!-- noindex -->
              <!-- index -->

       How do I modify the path or URL of the indexed documents?

       Use  the  "ReplaceRules"  configuration  directive  to rewrite path names and URLs.  If you are using "-S
       prog" input method you may set the path to any string.

       How can I index data from a database?

       Use the "prog" document source method of indexing.  Write a program to extract out  the  data  from  your
       database,  and  format  it  as XML, HTML, or text.  See the examples in the "prog-bin" directory, and the
       next question.
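       As a minimal sketch of the output a "-S prog" program must produce, here is a shell script that emits
       one hard-coded record in the headers-then-body format shown elsewhere in this FAQ (the path and content
       are hypothetical; a real program would loop over database query results):

```shell
#!/bin/sh
# Emit one document: headers, a blank line, then exactly Content-Length bytes of body.
doc='<html><head><title>Record 1</title></head><body>row data here</body></html>'
printf 'Path-Name: /db/record/1\nContent-Length: %s\n\n%s' "${#doc}" "$doc"
```

       Swish-e would run such a script with "swish-e -S prog -i ./myscript.sh" and index each emitted
       document.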

       How do I index my PDF, Word, and compressed documents?

       Swish-e can internally only parse HTML, XML and TXT (text) files by default, but can make use of  filters
       that  will  convert other types of files such as MS Word documents, PDF, or gzipped files into one of the
       file types that Swish-e understands.

       Please see SWISH-CONFIG and the examples in the filters and filter-bin directory for more information.

       See the next question to learn about the filtering options with Swish-e.

       How do I filter documents?

       The term "filter" in Swish-e means the conversion of a document of one type (one that swish-e cannot
       index directly) into a type that Swish-e can index, namely HTML, plain text, or XML.  To add to the
       confusion, there are a number of ways to accomplish this in Swish-e.  So here's a bit of background.

       The FileFilter directive was added to swish first.  This feature allows you to specify a program to run
       for documents that match a given file extension.  For example, to filter PDF files (files that end in
       .pdf) you can specify the configuration setting of:

           FileFilter .pdf pdftotext   "'%p' -"

       which says to run the program "pdftotext" passing it the pathname of the file  (%p)  and  a  dash  (which
       tells  pdftotext  to  output to stdout).   Then for each .pdf file Swish-e runs this program and reads in
       the filtered document from the output from the filter program.

       This has the advantage that it is easy to setup -- a single line in the config file is all that is needed
       to add the filter into Swish-e.  But it also has a number of problems.  For example, if you  use  a  Perl
       script  to  do your filtering it can be very slow since the filter script must be run (and thus compiled)
       for each processed document.  This is exacerbated when using the -S http method since the -S http  method
       also  uses  a  Perl  script  that is run for every URL fetched.  Also, when using -S prog method of input
       (reading input from a program) using FileFilter means that Swish-e must first read the file in  from  the
       external program and then write the file out to a temporary file before running the filter.

       With  -S  prog  it  makes  much  more  sense  to  filter the document in the program that is fetching the
       documents than to have swish-e read the file into memory, write it to a temporary file and  then  run  an
       external program.

       The  Swish-e distribution contains a couple of example -S prog programs.  spider.pl is a reasonably full-
       featured web spider that offers many more options than the -S http method.  And it is  much  faster  than
       running -S http, too.

       The  spider  has  a  perl  configuration  file,  which means you can add programming logic right into the
       configuration file without editing the spider program.  One bit of logic that is provided in the spider's
       configuration file is a "call-back" function that allows you to filter  the  content.   In  other  words,
       before  the  spider  passes  a  fetched  web  document to swish for indexing the spider can call a simple
       subroutine in the spider's configuration file passing the document and its content type.  The  subroutine
       can then look at the content type and decide if the document needs to be filtered.

       For example, when processing a document of type "application/msword" the call-back subroutine might
       call the doc2txt.pm perl module, and a document of type "application/pdf" could use the pdf2html.pm
       module.  The prog-bin/SwishSpiderConfig.pl file shows this usage.

       This  system  works  reasonably  well,  but  also  means that more work is required to setup the filters.
       First, you must explicitly check for specific content types and then call the  appropriate  Perl  module,
       and  second,  you  have to know how each module must be called and how each returns the possibly modified
       content.

       In comes SWISH::Filter.

       To make things easier the SWISH::Filter Perl module was created.  The idea of this module is  that  there
       is  one  interface  used  to filter all types of documents.  So instead of checking for specific types of
       content you just pass the content type and the document to the SWISH::Filter module and it returns a  new
       content  type  and  document if it was filtered.  The filters that do the actual work are designed with a
       standard interface and work like filter "plug-ins".  Adding a new filter means just downloading the
       filter to a directory; no changes are needed to the spider's configuration file.  Download a filter for
       Postscript, and the next time you run indexing your Postscript files will be indexed.

       Since the filters are standardized, hopefully when you have the need to filter documents  of  a  specific
       type there will already be a filter ready for your use.

       Now, note that the perl modules may or may not do the actual conversion of a document.  For example,
       the PDF conversion module calls the pdfinfo and pdftotext programs.  Those programs (part of the Xpdf
       package) must be installed separately from the filters.

       The SwishSpiderConfig.pl example spider configuration file shows how to use the SWISH::Filter module
       for filtering.  This file is installed at $prefix/share/doc/swish-e/examples/prog-bin, where $prefix is
       normally /usr/local on unix-type machines.

       The SWISH::Filter method of filtering can also be used with the -S http method of indexing.  By
       default the swishspider program (the Perl helper script that fetches documents from the web) will
       attempt to use the SWISH::Filter module if it can be found in Perl's library path.  This path is set
       automatically for spider.pl, but not for swishspider (because it would slow down a method that's
       already slow, and spider.pl is recommended over the -S http method).

       Therefore,  all that's required to use this system with -S http is setting the @INC array to point to the
       filter directory.

       For example, if the swish-e distribution was unpacked into ~/swish-e:

          PERL5LIB=~/swish-e/filters swish-e -c conf -S http

       will allow the -S http method to make use of the SWISH::Filter module.

       Note that if you are not using the SWISH::Filter module you may wish to edit the swishspider program  and
       disable the use of the SWISH::Filter module using this setting:

           use constant USE_FILTERS  => 0;  # disable SWISH::Filter

       This  prevents the program from attempting to use the SWISH::Filter module for every non-text URL that is
       fetched.  Of course, if you are concerned with indexing speed you should be using the -S prog method with
       spider.pl instead of -S http.

       If you are not spidering, but you still want to make use of the SWISH::Filter module  for  filtering  you
       can  use  the  DirTree.pl  program (in $prefix/lib/swish-e).  This is a simple program that traverses the
       file system and uses SWISH::Filter for filtering.

       Here's two examples of how to run a filter program, one using Swish-e's "FileFilter"  directive,  another
       using  a  "prog"  input  method  program.   See the SwishSpiderConfig.pl file for an example of using the
       SWISH::Filter module.

       These filters simply use the program "/bin/cat" as a filter and only index .html files.

       First, using the "FileFilter" method, here's the entire configuration file (swish.conf):

           IndexDir .
           IndexOnly .html
           FileFilter .html "/bin/cat"   "'%p'"

       and index with the command

           swish-e -c swish.conf -v 1

       Now, the same thing using the "-S prog" document source input method and a Perl program called
       catfilter.pl.  You can see that it's much more work than using the "FileFilter" method above, but it
       provides a place to do additional processing.  In this example, the "prog" method is only slightly
       faster.  But if you needed a perl script to run as a FileFilter, then "prog" will be significantly
       faster.

            #!/usr/local/bin/perl -w
            use strict;
            use File::Find;  # for recursing a directory tree

            $/ = undef;      # slurp whole files at once
            find(
                { wanted => \&wanted, no_chdir => 1, },
                '.',
            );

            sub wanted {
                return if -d;
                return unless /\.html$/;

                my $mtime  = (stat)[9];

                # fork a child whose STDOUT (the output of /bin/cat) is read from FH
                my $child = open( FH, '-|' );
                die "Failed to fork $!" unless defined $child;
                exec '/bin/cat', $_ unless $child;

                my $content = <FH>;
                close FH;
                my $size = length $content;

                print <<EOF;
            Content-Length: $size
            Last-Mtime: $mtime
            Path-Name: $_

            EOF

                print $content;
            }

       And index with the command:

           swish-e -S prog -i ./catfilter.pl -v 1

       This example will probably not work under Windows due to the '-|' open.  A simple piped open may work
       just as well:

       That is, replace:

            my $child = open( FH, '-|' );
            die "Failed to fork $!" unless defined $child;
            exec '/bin/cat', $_ unless $child;

       with this:

            open( FH, "/bin/cat $_ |" ) or die $!;

       Perl will try to avoid running the command through the shell if meta characters are  not  passed  to  the
       open.  See "perldoc -f open" for more information.

       Eh, but I just want to know how to index PDF documents!

       See the examples in the conf directory and the comments in the SwishSpiderConfig.pl file.

       See  the previous question for the details on filtering.  The method you decide to use will depend on how
       fast you want to index, and your comfort level with using Perl modules.

       Regardless of the filtering method you use you will need to install  the  Xpdf  packages  available  from
       http://www.foolabs.com/xpdf/.

       I'm using Windows and can't get Filters or the prog input method to work!

       Both  the  "-S  prog" input method and filters use the "popen()" system call to run the external program.
       If your external program is, for example, a perl script, you have to tell Swish-e to run perl, instead of
       the script.  Swish-e will convert forward slashes to backslashes when running under Windows.

       For example, you would need to specify the path to perl as (assuming  this  is  where  perl  is  on  your
       system):

           IndexDir e:/perl/bin/perl.exe

       Or run a filter like:

           FileFilter .foo e:/perl/bin/perl.exe 'myscript.pl "%p"'

       It's often easier to just install Linux.

       How do I index non-English words?

       Swish-e  indexes  8-bit characters only.  This is the ISO 8859-1 Latin-1 character set, and includes many
       non-English letters (and symbols).  As long as they are listed in "WordCharacters" they will be indexed.

       Actually, you probably can index any 8-bit character set, as long as you don't mix character sets in  the
       same index and don't use libxml2 for parsing (see below).

       The "TranslateCharacters" directive (SWISH-CONFIG) can translate characters while indexing and searching.
       You  may  specify  the  mapping  of  one  character  to  another character with the "TranslateCharacters"
       directive.

       "TranslateCharacters :ascii7:" is a predefined set of characters that will translate eight-bit characters
       to ascii7 characters.  Using the ":ascii7:" rule will, for  example,  translate  "Ääç"  to  "aac".   This
       means: searching "Çelik", "çelik" or "celik" will all match the same word.
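
       As a rough illustration of what such a translation does (the mapping below is a tiny hand-picked
       subset, not Swish-e's actual :ascii7: table):

```python
# Illustrative sketch of a TranslateCharacters-style mapping.  The
# character pairs here are hand-picked, not the real :ascii7: table.
FOLD = str.maketrans("ÄäÇçÖöÜü", "aaccoouu")

def fold_ascii7(word):
    """Lowercase first (Swish-e searches are case-insensitive), then
    map selected 8-bit characters to 7-bit equivalents."""
    return word.lower().translate(FOLD)

print(fold_ascii7("Çelik"))   # -> celik
print(fold_ascii7("çelik"))   # -> celik
```

       With this folding in place, "Çelik", "çelik" and "celik" all reduce to the same indexed word.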

       Note:  When  using  libxml2  for  parsing,  parsed documents are converted internally (within libxml2) to
       UTF-8.  This is converted to ISO 8859-1 Latin-1 when indexing.  In  cases  where  a  string  can  not  be
       converted  from  UTF-8 to ISO 8859-1 (because it contains non 8859-1 characters), the string will be sent
       to Swish-e in UTF-8 encoding.  This will result in some words being indexed incorrectly.  Setting
       "ParserWarningLevel" to 1 or more will display warnings when UTF-8 to 8859-1 conversion fails.
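
       The fallback behavior can be sketched as follows (illustration only; the real conversion happens inside
       libxml2 and Swish-e's C code):

```python
# Sketch of the downgrade libxml2/Swish-e performs: parsed text is
# UTF-8 internally and converted to ISO 8859-1 where possible.
def to_latin1_or_utf8(text):
    """Return (bytes, ok): ISO 8859-1 bytes when every character fits,
    otherwise the raw UTF-8 bytes (which would be indexed incorrectly)."""
    try:
        return text.encode("iso-8859-1"), True
    except UnicodeEncodeError:
        return text.encode("utf-8"), False

print(to_latin1_or_utf8("Çelik")[1])   # -> True  (fits in 8859-1)
print(to_latin1_or_utf8("Čelik")[1])   # -> False (C-with-caron is not in 8859-1)
```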

       Can I add/remove files from an index?

       Try building swish-e with the "--enable-incremental" option.

       The rest of this FAQ applies to the default swish-e format.

       Swish-e currently has no way to add or remove items from its index.  But Swish-e indexes so quickly
       that it's often possible to reindex the entire document set when a file needs to be added, modified, or
       removed.  If you are spidering a remote site, consider caching the documents locally (compressed) so
       they do not have to be fetched again.

       Incremental additions can be handled in a couple of ways, depending on  your  situation.   It's  probably
       easiest  to  create  one main index every night (or every week), and then create an index of just the new
       files between main indexing jobs and use the "-f" option to pass both indexes to Swish-e while searching.
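
       As an illustrative sketch (the index file names here are made up), such a schedule might look like:

```
# weekly (or nightly) full index
swish-e -c swish.conf -f index.main

# between full runs: index only files newer than the main index file
swish-e -c swish.conf -N index.main -f index.new

# search both indexes at once
swish-e -w 'foo' -f index.main index.new
```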

       You can merge the indexes into one index (instead of using -f), but it's not  clear  that  this  has  any
       advantage over searching multiple indexes.

       How does one create the incremental index?

       One  method is by using the "-N" switch to pass a file path to Swish-e when indexing.  It will only index
       files that have a last modification date "newer" than the file supplied with the "-N" switch.

       This option has the disadvantage that Swish-e must process every file in every directory as if it were
       going to be indexed (the test for "-N" is done last, right before indexing of the file contents begins
       and after all other tests on the file have been completed) -- all that just to find a few new files.
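
       The mtime comparison itself is simple; here is a sketch of the "newer than a reference file" test
       (illustration only, not Swish-e's implementation):

```python
import os

def newer_than(path, reference):
    """True if path's last-modification time is newer than reference's."""
    return os.stat(path).st_mtime > os.stat(reference).st_mtime

def incremental_candidates(root, reference):
    """Walk root and yield the files modified after the reference file --
    the set the -N switch would pick up for an incremental index."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if newer_than(path, reference):
                yield path
```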

       Also, if you use the Swish-e index file as the file passed to "-N" there may be  files  that  were  added
       after indexing was started, but before the index file was written.  This could result in a file not being
       added to the index.

       Another  option  is  to  maintain  a  parallel directory tree that contains symlinks pointing to the main
       files.  When a new file is added (or changed) to the main directory tree you create a symlink to the real
       file in the parallel directory tree.  Then just index the symlink directory to generate  the  incremental
       index.

       This  option has the disadvantage that you need to have a central program that creates the new files that
       can also create the symlinks.  But, indexing is quite fast since Swish-e only has to look  at  the  files
       that need to be indexed.  When you run full indexing you simply unlink (delete) all the symlinks.

       Both of these methods have issues where files could end up in both indexes, or files being left out of an
       index.   Use  of  file  locks  while  indexing,  and  hash lookups during searches can help prevent these
       problems.

       I run out of memory trying to index my files.

       It's true that indexing can take up a lot of memory!  Swish-e is extremely fast  at  indexing,  but  that
       comes at the cost of memory.

       The best answer is to install more memory.

       Another option is to use the "-e" switch.  This will require less memory, but indexing will take
       longer, as not all data will be stored in memory while indexing.  How much less memory and how much
       more time depends on the documents you are indexing and the hardware you are using.

       Here's  an  example of indexing all .html files in /usr/doc on Linux.  This first example is without "-e"
       and used about 84M of memory:

           270279 unique words indexed.
           23841 files indexed.  177640166 total bytes.
           Elapsed time: 00:04:45 CPU time: 00:03:19

       This is with "-e", and used about 26M of memory:

           270279 unique words indexed.
           23841 files indexed.  177640166 total bytes.
           Elapsed time: 00:06:43 CPU time: 00:04:12

       You can also build a number of smaller indexes and then merge them together with "-M".  Using "-e"
       while merging will save memory.

       Finally,  if you do build a number of smaller indexes, you can specify more than one index when searching
       by using the "-f" switch.  Sorting large results sets by  a  property  will  be  slower  when  specifying
       multiple index files while searching.

       "too many open files" when indexing with -e option

       Some platforms report "too many open files" when using the -e economy option.  The -e feature uses many
       temporary files (something like 377), plus the index files, and this may exceed your system's limits.

       Depending on your platform you may need to set "ulimit" or "unlimit".

       For example, under Linux bash shell:

         $ ulimit -n 1024

       Or under an old Sparc

         % unlimit openfiles

       My system admin says Swish-e uses too much of the CPU!

       That's a good thing!  That expensive CPU is supposed to be busy.

       Indexing takes a lot of work -- to make indexing fast, much of the work is done in memory, which
       reduces the amount of time Swish-e is waiting on I/O.  But there are two things you can try:

       The "-e" option will run Swish-e in economy mode, which uses the disk to store data while indexing.  This
       makes  Swish-e run somewhat slower, but also uses less memory.  Since it is writing to disk more often it
       will be spending more time waiting on I/O and less time in CPU.  Maybe.

       The other thing is to simply lower the priority of the job using the nice(1) command:

           /bin/nice -15 swish-e -c search.conf

       If concerned about searching time, make sure you are using the -b and -m switches to only return a page
       at a time.  If you know that your result sets will be large, that you wish to return results one page
       at a time, and that many pages of the same query will often be requested, it may be wise to request all
       the documents on the first request and then cache the results to a temporary file.  The Perl module
       File::Cache makes this very simple to accomplish.

       Spidering

       How can I index documents on a web server?

       If possible, use the file system method "-S fs" of indexing to index documents in your web area of the
       file  system.   This  avoids  the overhead of spidering a web server and is much faster.  ("-S fs" is the
       default method if "-S" is not specified).

       If this is impossible (the web server is not local, or  documents  are  dynamically  generated),  Swish-e
       provides  two methods of spidering. First, it includes the http method of indexing "-S http". A number of
       special configuration directives are available that control  spidering  (see  "Directives  for  the  HTTP
       Access  Method  Only"  in  SWISH-CONFIG).   A  perl  helper  script  (swishspider) is included in the src
       directory to assist with spidering web servers. There are example configurations  for  spidering  in  the
       conf directory.

       As of Swish-e 2.2, there's a general purpose "prog" document source where a program can feed documents to
       it  for  indexing.   A  number  of example programs can be found in the "prog-bin" directory, including a
       program to spider web servers.  The provided spider.pl program is full-featured and is easily customized.

       The advantage of the "prog" document source feature over the "http" method is that the  program  is  only
       executed one time, where the swishspider program used in the "http" method is executed once for every
       document read from the web server.  The forking of Swish-e and compiling of the perl script can be  quite
       expensive, time-wise.

       The  other  advantage of the "spider.pl" program is that it's simple and efficient to add filtering (such
       as for PDF or MS Word docs) right into the spider.pl's configuration, and it includes  features  such  as
       MD5  checks  to  prevent  duplicate  indexing,  options to avoid spidering some files, or index but avoid
       spidering.  And since it's a perl program there's no limit on the features you can add.

       Why does swish report "./swishspider: not found"?

       Does the file swishspider exist where the error message displays?  If not, either set  the  configuration
       option  SpiderDirectory  to  point  to the directory where the swishspider program is found, or place the
       swishspider program in the current directory when running swish-e.

       If you are running Windows, make sure "perl" is in your path.  Try typing perl from a command prompt.

       If you are not running Windows, make sure that the shebang line (the first line of the swishspider program
       that  starts  with  #!)  points to the correct location of perl.  Typically this will be /usr/bin/perl or
       /usr/local/bin/perl.  Also, make sure that you have execute and read permissions on swishspider.

       The swishspider perl script is only used with the -S http method of indexing.

       I'm using the spider.pl program to spider my web site, but some large files are not indexed.

       The "spider.pl" program has a default limit of 5MB file size.  This can be changed  with  the  "max_size"
       parameter setting.  See "perldoc spider.pl" for more information.

       I still don't think all my web pages are being indexed.

       The  spider.pl  program has a number of debugging switches and can be quite verbose in telling you what's
       happening, and why.  See "perldoc spider.pl" for instructions.

       Swish is not spidering Javascript links!

       Swish cannot follow links generated by Javascript, as they are generated by the browser and are not  part
       of the document.

       How do I spider other websites and combine it with my own (filesystem) index?

       You  can  either  merge  "-M" two indexes into a single index, or use "-f" to specify more than one index
       while searching.

       You will have better results with the "-f" method.

       Searching

       How do I limit searches to just parts of the index?

       If you can identify "parts" of your index by the path name you have two options.

       The first option is to index the document path.  Add this to your configuration:

           MetaNames swishdocpath

       Now you can search for words or phrases in the path name:

           swish-e -w 'foo AND swishdocpath=(sales)'

       So that will only find documents with the word "foo" and where the file's path contains "sales".  That
       might not work as well as you'd like, though, as both of these paths will match:

           /web/sales/products/index.html
           /web/accounting/private/sales_we_messed_up.html

       This can be solved by searching with a phrase (assuming "/" is not a WordCharacter):

           swish-e -w 'foo AND swishdocpath=("/web/sales/")'
           swish-e -w 'foo AND swishdocpath=("web sales")'  (same thing)

       The  second  option  is  a  bit  more  powerful.   With the "ExtractPath" directive you can use a regular
       expression to extract out a sub-set of the path and save it as a separate meta name:

           MetaNames department
           ExtractPath department regex !^/web/([^/]+).+$!$1/

       Which says: match a path that starts with "/web/", capture everything after that up to, but not
       including, the next "/" into $1, and then match everything from the "/" onward.  The entire matched
       string is replaced with $1, and that gets indexed as meta name "department".

       Now you can search like:

           swish-e -w 'foo AND department=sales'

       and be sure that you will only match the documents in the /web/sales/* path.  Note that you can map
       completely different areas of your file system to the same metaname:

           # flag the marketing specific pages
           ExtractPath department regex !^/web/(marketing|sales)/.+$!marketing/
           ExtractPath department regex !^/internal/marketing/.+$!marketing/

           # flag the technical departments pages
           ExtractPath department regex !^/web/(tech|bugs)/.+$!tech/
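
       The substitutions above can be sketched with ordinary regular expressions (Python here purely for
       illustration; the rule order mirrors the config above):

```python
import re

def extract_department(path):
    """Return the 'department' value a set of ExtractPath-style rules
    would assign to a path.  Rules are tried in order; a replacement of
    None means 'keep the first directory component' (the generic rule)."""
    rules = [
        (r'^/web/(marketing|sales)/.+$', 'marketing'),
        (r'^/internal/marketing/.+$',    'marketing'),
        (r'^/web/(tech|bugs)/.+$',       'tech'),
        (r'^/web/([^/]+).+$',            None),
    ]
    for pattern, replacement in rules:
        m = re.match(pattern, path)
        if m:
            return replacement if replacement else m.group(1)
    return None

print(extract_department('/web/sales/products/index.html'))   # -> marketing
print(extract_department('/web/accounting/private/x.html'))   # -> accounting
```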

       Finally,  if  you have something more complicated, use "-S prog" and write a perl program or use a filter
       to set a meta tag when processing each file.

       How is ranking calculated?

       The "swishrank" property value is calculated based on  which  Ranking  Scheme  (or  algorithm)  you  have
       selected.  In  this  discussion,  any time the word fancy is used, you should consult the actual code for
       more details. It is open source, after all.

       Things you can do to affect ranking:

       MetaNamesRank
           You may configure your index to bias certain metaname values more  or  less  than  others.   See  the
           "MetaNamesRank" configuration option in SWISH-CONFIG.

       IgnoreTotalWordCountWhenRanking
           Set  to  1 (default) or 0 in your config file. See SWISH-CONFIG.  NOTE: You must set this to 0 to use
           the IDF Ranking Scheme.

       structure
           Each term's position in each HTML document is given a structure value based on the context  in  which
           the  word  appears. The structure value is used to artificially inflate the frequency of each term in
           that particular document.  These structural values are defined in config.h:

            #define RANK_TITLE             7
            #define RANK_HEADER            5
            #define RANK_META              3
            #define RANK_COMMENTS          1
            #define RANK_EMPHASIZED        0

           For example, if the word "foo" appears in the title  of  a  document,  the  Scheme  will  treat  that
           document as if "foo" appeared 7 additional times.

       All Schemes share the following characteristics:

       AND searches
           The  rank value is averaged for all AND'd terms. Terms within a set of parentheses () are averaged as
           a single term (this is an acknowledged weakness and is on the TODO list).

       OR searches
           The rank value is summed and then doubled for each pair of OR'd terms. This results in  higher  ranks
           for documents that have multiple OR'd terms.

       scaled rank
           After  a  document's  raw  rank  score  is calculated, a final rank score is calculated using a fancy
           "log()" function. All the documents are then scaled against a base score  of  1000.   The  top-ranked
           document will therefore always have a "swishrank" value of 1000.

       Here  is  a brief overview of how the different Schemes work. The number in parentheses after the name is
       the value to invoke that scheme with "swish-e -R" or "RankScheme()".

       Default (0)
           The default ranking scheme considers the number of times a term appears in  a  document  (frequency),
           the MetaNamesRank and the structure value. The rank might be summarized as:

            DocRank = Sum of ( structure + metabias )

           Consider this output with the DEBUG_RANK variable set at compile time:

            Ranking Scheme: 0
            Word entry 0 at position 6 has struct 7
            Word entry 1 at position 64 has struct 41
            Word entry 2 at position 71 has struct 9
            Word entry 3 at position 132 has struct 9
            Word entry 4 at position 154 has struct 9
            Word entry 5 at position 423 has struct 73
            Word entry 6 at position 541 has struct 73
            Word entry 7 at position 662 has struct 73
            File num: 1104.  Raw Rank: 21.  Frequency: 8 scaled rank: 30445
             Structure tally:
             struct 0x7 = count of 1 ( HEAD TITLE FILE ) x rank map of 8 = 8

             struct 0x9 = count of 3 ( BODY FILE ) x rank map of 1 = 3

             struct 0x29 = count of 1 ( HEADING BODY FILE ) x rank map of 6 = 6

             struct 0x49 = count of 3 ( EM BODY FILE ) x rank map of 1 = 3

           Every  word  instance  starts with a base score of 1.  Then for each instance of your word, a running
           sum is taken of the structural value of that word position plus any bias you've configured.   In  the
           example above, the raw rank is "1 + 8 + 3 + 6 + 3 = 21".

           Consider this line:

             struct 0x7 = count of 1 ( HEAD TITLE FILE ) x rank map of 8 = 8

            That means there was one instance of our word in the title of the file.  Its context was in the
           <head> tagset, inside the <title>.  The <title> is the  most  specific  structure,  so  it  gets  the
           RANK_TITLE score: 7. The base rank of 1 plus the structure score of 7 equals 8. If there had been two
           instances of this word in the title, then the score would have been "8 + 8 = 16".
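
            The raw-rank arithmetic from the structure tally above can be recomputed directly:

```python
# Base score of 1, plus (count x rank-map value) for each structure
# context listed in the DEBUG_RANK tally above.
tally = {
    0x07: (1, 8),   # HEAD TITLE FILE
    0x09: (3, 1),   # BODY FILE
    0x29: (1, 6),   # HEADING BODY FILE
    0x49: (3, 1),   # EM BODY FILE
}
raw_rank = 1 + sum(count * rank for count, rank in tally.values())
print(raw_rank)   # -> 21
```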

       IDF (1)
           IDF  is  short for Inverse Document Frequency. That's fancy ranking lingo for taking into account the
           total frequency of a term across the entire index, in addition to the term's frequency  in  a  single
           document.  IDF ranking also uses the relative density of a word in a document to judge its relevancy.
           Words that appear more often in a doc make that doc's rank higher, and longer docs are  not  weighted
           higher than shorter docs.

           The IDF Scheme might be summarized as:

             DocRank = Sum of ( density * idf * ( structure + metabias ) )

           Consider this output from DEBUG_RANK:

            Ranking Scheme: 1
            File num: 1104  Word Score: 1  Frequency: 8  Total files: 1451
            Total word freq: 108   IDF: 2564
            Total words: 1145877   Indexed words in this doc: 562
            Average words: 789   Density: 1120    Word Weight: 28716
            Word entry 0 at position 6 has struct 7
            Word entry 1 at position 64 has struct 41
            Word entry 2 at position 71 has struct 9
            Word entry 3 at position 132 has struct 9
            Word entry 4 at position 154 has struct 9
            Word entry 5 at position 423 has struct 73
            Word entry 6 at position 541 has struct 73
            Word entry 7 at position 662 has struct 73
            Rank after IDF weighting: 574321
            scaled rank: 132609
             Structure tally:
             struct 0x7 = count of  1 ( HEAD TITLE FILE ) x rank map of 8 = 8

             struct 0x9 = count of  3 ( BODY FILE ) x rank map of 1 = 3

             struct 0x29 = count of  1 ( HEADING BODY FILE ) x rank map of 6 = 6

             struct 0x49 = count of  3 ( EM BODY FILE ) x rank map of 1 = 3

           It  is  similar  to the default Scheme, but notice how the total number of files in the index and the
           total word frequency (as opposed to the document frequency) are both part of the equation.
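
            A textbook-style sketch of the quantities named in that formula (Swish-e's implementation uses
            scaled integer arithmetic and its own constants, so these functions only show the shape of the
            calculation, not its exact values):

```python
import math

def idf_weight(total_files, files_with_term):
    """Inverse document frequency: rarer terms across the index weigh more."""
    return math.log(total_files / files_with_term)

def density(term_freq, words_in_doc, average_doc_words):
    """Relative density: the same frequency counts for more in a shorter
    document, so long documents are not automatically ranked higher."""
    return (term_freq / words_in_doc) * average_doc_words

def doc_rank(struct_scores, metabias, idf, dens):
    """DocRank = Sum of ( density * idf * ( structure + metabias ) )."""
    return sum(dens * idf * (s + metabias) for s in struct_scores)
```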

       Ranking is a  complicated  subject.  SWISH-E  allows  for  more  Ranking  Schemes  to  be  developed  and
       experimented  with,  using  the  -R  option  (from  the  swish-e command) and the RankScheme (see the API
       documentation). Experiment and share your findings via the discussion list.

       How can I limit searches to the title, body, or comment?

       Use the "-t" switch.

       I can't limit searches to title/body/comment.

       Or, I can't search with meta names, all the names are indexed as "plain".

       Check whether "#define INDEXTAGS" is set to 1 in the config.h file.  If it is, change it to 0,
       recompile, and index again.  When INDEXTAGS is 1, ALL the tags are indexed as plain text; that is, you
       index "title", "h1", and so on, AND they lose their indexing meaning.  If INDEXTAGS is set to 0, you
       will still index meta tags and comments, unless you have indicated otherwise in the user config file
       with the IndexComments directive.

       Also, check for the "UndefinedMetaTags" setting in your configuration file.

       I've tried running the included CGI script and I get an "Internal Server Error"

       Debugging CGI scripts is beyond the scope of this document.  "Internal Server Error" basically means
       "check the web server's log for an error message", as it can mean a bad shebang (#!) line, a missing perl
       module,  FTP  transfer error, or simply an error in the program.  The CGI script swish.cgi in the example
       directory contains some debugging suggestions.  Type "perldoc swish.cgi" for information.

       There are also many, many CGI FAQs available on the Internet.  A quick web search should offer help.   As
       a last resort you might ask your webadmin for help...

       When I try to view the swish.cgi page I see the contents of the Perl program.

       Your  web  server  is  not  configured  to run the program as a CGI script.  This problem is described in
       "perldoc swish.cgi".

       How do I make Swish-e highlight words in search results?

       Short answer:

       Use the supplied swish.cgi or search.cgi scripts located in the example directory.

       Long answer:

       Swish-e itself can't, of course, because it doesn't have access to the source documents when returning
       results.  But a front-end program of your creation can highlight terms.  Your program can open the
       source documents and then use regular expressions to replace search terms with highlighted or bolded
       words.
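
       The naive version of that approach can be sketched as follows (Python purely for illustration):

```python
import re

def highlight(text, terms):
    """Wrap each whole-word occurrence of each term in <b>...</b>.
    Works on plain text only; real HTML needs a proper parser."""
    for term in terms:
        text = re.sub(r'\b(%s)\b' % re.escape(term),
                      r'<b>\1</b>', text, flags=re.IGNORECASE)
    return text

print(highlight('Foo and foobar and foo.', ['foo']))
# -> <b>Foo</b> and foobar and <b>foo</b>.
```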

       But, that will fail with all but the most simple source documents.  For HTML documents, for example,  you
       must  parse  the  document  into  words  and  tags (and comments).  A word you wish to highlight may span
       multiple HTML tags, or be a word in a URL and you wish to highlight the entire link text.

       Perl modules such as HTML::Parser and XML::Parser make word  extraction  possible.   Next,  you  need  to
       consider that Swish-e uses settings such as WordCharacters, BeginCharacters, EndCharacters,
       IgnoreFirstChar, and IgnoreLastChar to define a "word".  That is, you can't assume that a string of
       characters with white space on each side is a word.

       Then  things like TranslateCharacters, and HTML Entities may transform a source word into something else,
       as far as Swish-e is concerned.  Finally, searches can be limited by metanames, so you may need to  limit
       your  highlighting  to  only  parts of the source document.  Throw phrase searches and stopwords into the
       equation and you can see that it's not a trivial problem to solve.

       All hope is not lost, though, as Swish-e does provide some help.  Using the "-H" option it will return
       in  the headers the current index (or indexes) settings for WordCharacters (and others) required to parse
       your source documents as it parses them during indexing, and will return a "Parsed  Words:"  header  that
       will  show  how  it  parsed  the query internally.  If you use fuzzy indexing (word stemming, soundex, or
       metaphone) then you will also need to stem each word in your document before comparing with  the  "Parsed
       Words:" returned by Swish-e.

       The  Swish-e  stemming  code  is  available either by using the Swish-e Perl module (SWISH::API) or the C
       library (included with the swish-e distribution), or by using  the  SWISH::Stemmer  module  available  on
       CPAN.   Also  on  CPAN  is the module Text::DoubleMetaphone.  Using SWISH::API probably provides the best
       stemming support.

       Do filters affect performance during search?

       No.  Filters (FileFilter or via "prog" method) are only used for  building  the  search  index  database.
       During search requests there will be no filter calls.

       I have read the FAQ but I still have questions about using Swish-e.

       The  Swish-e  discussion  list  is the place to go.  http://swish-e.org/.  Please do not email developers
       directly.  The list is the best place to ask questions.

       Before you post please read QUESTIONS AND TROUBLESHOOTING located in the INSTALL page.  You  should  also
       search the Swish-e discussion list archive which can be found on the swish-e web site.

       In short, be sure to include the following when asking for help:

       * The swish-e version (./swish-e -V)
       * What you are indexing (and perhaps a sample), and the number of files
       * Your Swish-e configuration file
       * Any error messages that Swish-e is reporting

Document Info

       $Id: SWISH-FAQ.pod 2147 2008-07-21 02:48:55Z karpet $


2.4.7                                              2009-04-04                                       SWISH-FAQ(1)