Provided by: linkchecker_10.4.0-1_amd64

NAME

       linkchecker - command line client to check HTML documents and websites for broken links

SYNOPSIS

       linkchecker [options] [file-or-url]...

DESCRIPTION

       LinkChecker features

       • recursive and multithreaded checking

       • output in colored or normal text, HTML, SQL, CSV, XML or a sitemap graph in different formats

       • support for HTTP/1.1, HTTPS, FTP, mailto: and local file links

       • restriction of link checking with URL filters

       • proxy support

       • username/password authorization for HTTP and FTP

       • support for robots.txt exclusion protocol

       • support for Cookies

       • support for HTML5

       • Antivirus check

       • a command line and web interface

EXAMPLES

       The most common use checks the given domain recursively:

          $ linkchecker http://www.example.com/

       Beware  that  this  checks the whole site which can have thousands of URLs. Use the -r option to restrict
       the recursion depth.
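
       For example, this limits the recursion depth to two levels:

          $ linkchecker -r2 http://www.example.com/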

       Don't check URLs with /secret in their name; all other links are checked as usual:

          $ linkchecker --ignore-url=/secret mysite.example.com

       Checking a local HTML file on Unix:

          $ linkchecker ../bla.html

       Checking a local HTML file on Windows:

          C:\> linkchecker c:\temp\test.html

       You can skip the http:// URL part if the domain starts with www.:

          $ linkchecker www.example.com

       You can skip the ftp:// URL part if the domain starts with ftp.:

          $ linkchecker -r0 ftp.example.com

       Generate a sitemap graph and convert it with the graphviz dot utility:

          $ linkchecker -odot -v www.example.com | dot -Tps > sitemap.ps

OPTIONS

   General options
       -f FILENAME, --config=FILENAME
              Use     FILENAME     as     configuration     file.     By      default      LinkChecker      uses
              $XDG_CONFIG_HOME/linkchecker/linkcheckerrc.

       -h, --help
              Help me! Print usage information for this program.

       -t NUMBER, --threads=NUMBER
              Generate  no  more  than  the given number of threads. Default number of threads is 10. To disable
              threading specify a non-positive number.
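
               For example, to disable threading entirely:

                  $ linkchecker -t0 http://www.example.com/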

       -V, --version
              Print version and exit.

       --list-plugins
              Print available check plugins and exit.

   Output options
   URL checking results
       -F TYPE[/ENCODING][/FILENAME], --file-output=TYPE[/ENCODING][/FILENAME]
              Output to a file linkchecker-out.TYPE, $XDG_DATA_HOME/linkchecker/failures for the failures output
              type, or FILENAME if specified. The ENCODING specifies the output encoding, the default is that of
              your          locale.          Valid          encodings          are           listed           at
               https://docs.python.org/library/codecs.html#standard-encodings. The FILENAME and ENCODING parts
               are ignored for the none output type; otherwise, if the file already exists, it will be
               overwritten. You can specify this option more than once. Valid file output TYPEs are text, html,
               sql, csv, gml, dot, xml, sitemap, none or failures. Default is no file output. The various output
               types are documented below. Note that you can suppress all console output with the option -o none.
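
               For example, to write CSV results to a UTF-8 encoded file (the file name results.csv is
               arbitrary):

                  $ linkchecker -Fcsv/utf-8/results.csv http://www.example.com/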

       --no-warnings
              Don't log warnings. Default is to log warnings.

       -o TYPE[/ENCODING], --output=TYPE[/ENCODING]
              Specify the console output type as text, html, sql, csv, gml, dot, xml, sitemap, none or failures.
              Default type is text. The various output types are documented below.  The ENCODING  specifies  the
              output   encoding,   the   default  is  that  of  your  locale.  Valid  encodings  are  listed  at
              https://docs.python.org/library/codecs.html#standard-encodings.

       -v, --verbose
              Log all checked URLs, overriding --no-warnings.  Default is to log only errors and warnings.

   Progress updates
       --no-status
              Do not print URL check status messages.

   Application
       -D STRING, --debug=STRING
              Print debugging output for the given logger.   Available  debug  loggers  are  cmdline,  checking,
              cache,  plugin  and  all.   all  is  an alias for all available loggers.  This option can be given
              multiple times to debug with more than one logger.
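
               For example, to debug both the checking and cache loggers:

                  $ linkchecker -Dchecking -Dcache http://www.example.com/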

   Quiet
       -q, --quiet
               Quiet operation, an alias for -o none that also hides application information messages. This is
               only useful with -F, otherwise no results will be output.
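
               For example, to run silently while still writing an HTML report to linkchecker-out.html:

                  $ linkchecker -q -Fhtml http://www.example.com/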

   Checking options
       --cookiefile=FILENAME
              Use initial cookie data read from a file. The cookie data format is explained below.

       --check-extern
              Check external URLs.

       --ignore-url=REGEX
              URLs  matching the given regular expression will only be syntax checked.  This option can be given
              multiple times.  See section REGULAR EXPRESSIONS for more info.
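
               For example, to syntax-check only all mailto: links and any URL containing /secret:

                  $ linkchecker --ignore-url=^mailto: --ignore-url=/secret http://www.example.com/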

       --no-follow-url=REGEX
              Check but do not recurse into URLs matching the given regular  expression.   This  option  can  be
              given multiple times.  See section REGULAR EXPRESSIONS for more info.

       --no-robots
              Check URLs regardless of any robots.txt files.

       -p, --password
               Read a password from the console and use it for HTTP and FTP authorization. For FTP the default
               password is anonymous@. For HTTP there is no default password. See also -u.
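
               For example, to check a password-protected site (the user name and host below are placeholders):

                  $ linkchecker -u myuser -p http://intranet.example.com/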

       -r NUMBER, --recursion-level=NUMBER
              Check recursively all links up to given depth. A negative depth will  enable  infinite  recursion.
              Default depth is infinite.

       --timeout=NUMBER
              Set the timeout for connection attempts in seconds. The default timeout is 60 seconds.

       -u STRING, --user=STRING
              Try  the given username for HTTP and FTP authorization. For FTP the default username is anonymous.
              For HTTP there is no default username. See also -p.

       --user-agent=STRING
              Specify the User-Agent string to send to the HTTP server, for example "Mozilla/4.0".  The  default
              is "LinkChecker/X.Y" where X.Y is the current version of LinkChecker.

   Input options
       --stdin
              Read from stdin a list of white-space separated URLs to check.
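
               For example:

                  $ echo "http://www.example.com/ http://www.example.org/" | linkchecker --stdin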

       FILE-OR-URL
              The  location  to  start checking with.  A file can be a simple list of URLs, one per line, if the
              first line is "# LinkChecker URL list".
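
               For example, given a file urls.txt (the name is arbitrary) with the contents

                  # LinkChecker URL list
                  http://www.example.com/
                  http://www.example.org/

               all listed URLs can be checked with:

                  $ linkchecker urls.txt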

CONFIGURATION FILES

       Configuration files can specify all options above. They can also specify some options that cannot be  set
       on the command line. See linkcheckerrc(5) for more info.

OUTPUT TYPES

       Note  that by default only errors and warnings are logged. You should use the option --verbose to get the
       complete URL list, especially when outputting a sitemap graph format.

       text   Standard text logger, logging URLs in keyword: argument fashion.

       html   Log URLs in keyword:  argument  fashion,  formatted  as  HTML.   Additionally  has  links  to  the
              referenced pages. Invalid URLs have HTML and CSS syntax check links appended.

       csv    Log check result in CSV format with one URL per line.

       gml    Log parent-child relations between linked URLs as a GML sitemap graph.

       dot    Log parent-child relations between linked URLs as a DOT sitemap graph.

       gxml   Log check result as a GraphXML sitemap graph.

       xml    Log check result as machine-readable XML.

       sitemap
              Log    check    result    as    an    XML    sitemap    whose    protocol    is    documented   at
              https://www.sitemaps.org/protocol.html.
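
               For example, to generate a sitemap for a whole site (note the -v option, so that all URLs are
               logged, not only errors and warnings):

                  $ linkchecker -v -ositemap http://www.example.com/ > sitemap.xml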

       sql    Log check result as SQL script with INSERT commands. An example script to create the  initial  SQL
              table is included as create.sql.

       failures
              Suitable  for  cron  jobs.  Logs  the check result into a file $XDG_DATA_HOME/linkchecker/failures
              which only contains entries with invalid URLs and the number of times they have failed.
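
               For example, a crontab entry that records failures every morning (the schedule is only an
               illustration):

                  0 6 * * * linkchecker -q -Ffailures http://www.example.com/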

       none   Logs nothing. Suitable for debugging or checking the exit code.

REGULAR EXPRESSIONS

       LinkChecker accepts Python regular expressions; see https://docs.python.org/howto/regex.html for an
       introduction. In addition, a leading exclamation mark negates the regular expression.
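
       For example, this only syntax-checks every URL that does not point to the example.com domain:

          $ linkchecker --ignore-url='!^https?://(www\.)?example\.com/' http://www.example.com/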

COOKIE FILES

       A cookie file contains standard HTTP header (RFC 2616) data with the following possible names:

       Host (required)
              Sets the domain the cookies are valid for.

       Path (optional)
               Gives the path the cookies are valid for; the default path is /.

       Set-cookie (required)
              Set cookie name/value. Can be given more than once.

       Multiple  entries  are  separated  by  a  blank line. The example below will send two cookies to all URLs
       starting with http://example.com/hello/ and one to all URLs starting with https://example.org/:

          Host: example.com
          Path: /hello
          Set-cookie: ID="smee"
          Set-cookie: spam="egg"

          Host: example.org
          Set-cookie: baggage="elitist"; comment="hologram"
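
       Assuming the entries above are stored in a file cookies.txt (the name is arbitrary), they can be used
       like this:

          $ linkchecker --cookiefile=cookies.txt http://example.com/hello/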

PROXY SUPPORT

       To use a proxy on Unix or Windows set the http_proxy or https_proxy environment variables  to  the  proxy
       URL. The URL should be of the form http://[user:pass@]host[:port].  LinkChecker also detects manual proxy
       settings  of Internet Explorer under Windows systems. On a Mac use the Internet Config to select a proxy.
       You can also set a comma-separated domain list in the no_proxy environment variable to ignore  any  proxy
       settings  for  these  domains.   The  curl_ca_bundle  environment  variable  can  be  used to identify an
       alternative certificate bundle to be used with an HTTPS proxy.

       Setting an HTTP proxy on Unix for example looks like this:

          $ export http_proxy="http://proxy.example.com:8080"

       Proxy authentication is also supported:

          $ export http_proxy="http://user1:mypass@proxy.example.org:8081"

       Setting a proxy on the Windows command prompt:

          C:\> set http_proxy=http://proxy.example.com:8080
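
       Excluding domains from proxying and pointing to an alternative certificate bundle (the bundle path below
       is a placeholder):

          $ export no_proxy="example.com,localhost"
          $ export curl_ca_bundle="$HOME/certs/proxy-ca.pem"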

PERFORMED CHECKS

       All URLs have to pass a preliminary syntax test. Minor quoting mistakes will issue a warning,  all  other
       invalid  syntax  issues  are  errors.  After  the  syntax  check passes, the URL is queued for connection
       checking. All connection check types are described below.

       HTTP links (http:, https:)
              After connecting to the given HTTP server the given path or query is requested.  All  redirections
              are  followed,  and if user/password is given it will be used as authorization when necessary. All
              final HTTP status codes other than 2xx are errors.

              HTML page contents are checked for recursion.

       Local files (file:)
              A regular, readable file that can be opened is valid. A readable  directory  is  also  valid.  All
              other files, for example device files, unreadable or non-existing files are errors.

              HTML or other parseable file contents are checked for recursion.

       Mail links (mailto:)
              A  mailto: link eventually resolves to a list of email addresses.  If one address fails, the whole
              list will fail. For each mail address we check the following things:

              1. Check the address syntax, both the parts before and after the @ sign.

               2. Look up the MX DNS records. If no MX record is found, print an error.

               3. Check if one of the mail hosts accepts an SMTP connection. Hosts with higher priority are
                  checked first. If no host accepts SMTP, we print a warning.

               4. Try to verify the address with the VRFY command. If we get an answer, print the verified
                  address as an info message.

       FTP links (ftp:)
              For FTP links we do:

              1. connect to the specified host

               2. try to log in with the given user and password. The default user is anonymous, the default
                  password is anonymous@.

              3. try to change to the given directory

              4. list the file with the NLST command

       Unsupported links (javascript:, etc.)
              An unsupported link will only print a warning. No further checking will be made.

               The complete list of recognized, but unsupported links can be found in the
               linkcheck/checker/unknownurl.py source file. The most prominent of them are JavaScript links.

SITEMAPS

       Sitemaps  are  parsed for links to check and can be detected either from a sitemap entry in a robots.txt,
       or when passed as a FILE-OR-URL argument in which case detection requires the urlset/sitemapindex tag  to
       be within the first 70 characters of the sitemap.  Compressed sitemap files are not supported.
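
       For example, a sitemap can be announced in robots.txt with a standard Sitemap line:

          Sitemap: http://www.example.com/sitemap.xml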

PLUGINS

       There are two plugin types: connection and content plugins. Connection plugins are run after a successful
       connection  to  the  URL  host. Content plugins are run if the URL type has content (mailto: URLs have no
       content for example) and if the check is not forbidden (i.e. by HTTP robots.txt). Use the option
       --list-plugins  for  a  list  of  plugins  and  their  documentation.  All  plugins  are  enabled via the
       linkcheckerrc(5) configuration file.
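
       For example, a plugin reported by --list-plugins, such as AnchorCheck, is enabled by adding a section
       with its name to the configuration file:

          [AnchorCheck]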

RECURSION

       Before descending recursively into a URL, it has to fulfill several conditions. They are checked in  this
       order:

       1. A URL must be valid.

       2. A  URL  must be parseable. This currently includes HTML files, Opera bookmarks files, and directories.
          If a file type cannot be determined (for example it does not have a common HTML  file  extension,  and
          the content does not look like HTML), it is assumed to be non-parseable.

       3. The  URL  content  must be retrievable. This is usually the case except for example mailto: or unknown
          URL types.

       4. The maximum recursion level must not be exceeded. It is configured with the --recursion-level option
          and is unlimited by default.

       5. It must not match the ignored URL list. This is controlled with the --ignore-url option.

       6. The Robots Exclusion Protocol must allow links in the URL to be followed recursively. This is checked
          by searching for a "nofollow" directive in the HTML header data (see the example below).

       Note that the directory recursion reads all files in that directory, not just a subset like index.htm.
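
       For example, recursion is blocked for pages whose HTML header data contains a robots "nofollow"
       directive:

          <meta name="robots" content="nofollow">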

NOTES

       URLs on the command line starting with ftp. are treated like ftp://ftp., URLs starting with www. are
       treated like http://www.. You can also give local files as arguments.  If you have your system configured
       to  automatically establish a connection to the internet (e.g. with diald), it will connect when checking
       links not pointing to your local host. Use the --ignore-url option to prevent this.

       JavaScript links are not supported.

       If your platform does not support threading, LinkChecker disables it automatically.

       You can supply multiple user/password pairs in a configuration file.

ENVIRONMENT

       http_proxy
              specifies default HTTP proxy server

       https_proxy
              specifies default HTTPS proxy server

       curl_ca_bundle
              an alternative certificate bundle to be used with an HTTPS proxy

       no_proxy
              comma-separated list of domains to not contact over a proxy server

       LC_MESSAGES, LANG, LANGUAGE
              specify output language

RETURN VALUE

       The return value is 2 when

       • a program error occurred.

       The return value is 1 when

       • invalid links were found or

       • link warnings were found and warnings are enabled.

       Else the return value is zero.
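
       For example, the return value can be inspected in a shell script:

          $ linkchecker -q http://www.example.com/ ; echo "exit code: $?"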

LIMITATIONS

       LinkChecker consumes memory for each queued URL to check. With thousands of queued  URLs  the  amount  of
       consumed memory can become quite large.  This might slow down the program or even the whole system.

FILES

       $XDG_CONFIG_HOME/linkchecker/linkcheckerrc - default configuration file

       $XDG_DATA_HOME/linkchecker/failures - default failures logger output filename

       linkchecker-out.TYPE - default logger file output name

SEE ALSO

       linkcheckerrc(5)

       https://docs.python.org/library/codecs.html#standard-encodings - valid output encodings

       https://docs.python.org/howto/regex.html - regular expression documentation

AUTHOR

       Bastian Kleineidam <bastian.kleineidam@web.de>

COPYRIGHT

       2000-2016 Bastian Kleineidam, 2010-2023 LinkChecker Authors
