Ubuntu Manpage: gztool - extract random-positioned data from gzip files, even like `tail -f`

NAME

       gztool - extract random-positioned data from gzip files, even like `tail -f`

SYNOPSIS

       gztool  [ [-[abLnsv] #] [-[1..9]AcCdDeEfFhilpPrRStTwWxXzZ|u[cCdD]] [-I <INDEX>] ] "files" ...

       Note that actions `-bcStT` proceed to an index file creation (if none exists) INTERLEAVED with data flow.
       As  data flow and index creation occur at the same time there's no waste of time.  Also you can interrupt
       actions at any moment and the remaining index file will be reused (and completed  if  necessary)  on  the
       next gztool run over the same data.

DESCRIPTION

       gztool  is  a GZIP files indexer, compressor and data retriever.  It can create small indexes for gzipped
       files and use them for quick and random-positioned data extraction.

       gztool can extract random-positioned data from gzip files with no penalty, including  gzip  tailing  like
       with `tail -f`.

       gztool  creates  an  index  file  (.gzi)  for  every  gzip it treats, and this action is interleaved with
       compression/uncompression so there's no waste of time. Any action can be interrupted at any  moment,  and
       the remaining index will be reused on next runs.

       Extraction  is possible from any byte (or line) position in the uncompressed data using `-b` (or `-L` for
       lines).

       If the uncompressed file is a text file, then using the `-x` modifier the index will take care of  lines,
       so later gztool can be requested to extract data from a particular line with `-L`.

       See the full listing of capabilities on INTERNALS.

OPTIONS

       files  One or more files. If no file is indicated, standard input is used.

       -[1..9]
              compression  factor  to  use with `-[c|u[cC]]`, from best speed (`-1`) to best compression (`-9`).
              Default is `-6`.

       -a #   Await # seconds between reads when `-[ST]|Ec`. Default is 4 s.

       -A     modifier for `-[rR]` to indicate the range of bytes/lines  in  absolute  values,  instead  of  the
              default incremental values.

       -b #   extract  data from indicated uncompressed byte position of gzip file (creating or reusing an index
              file) to STDOUT.  Accepts '0', '0x', and suffixes 'kmgtpe' (^10) or 'KMGTPE' (^2).

       -C     always create a 'Complete' index file, ignoring possible errors.

       -c     compress a file like with gzip, creating an index at the same time.

       -d     decompress a file like with gzip.

       -D     do not delete original file when using `-[cd]`.

       -e     if multiple files are indicated, continue on error (if any).

       -E     end processing on first GZIP end of file marker at EOF.  Nonetheless with  `-c`,  `-E`  waits  for
              more data even at EOF.

       -f     force file overwriting if destination file already exists.

       -F     force  index  creation/completion  first,  and  then action: if `-F` is not used, index is created
              interleaved with actions.

       -h     print brief help; `-hh` prints this help.

       -i     create index for indicated  gzip  file  (For  'file.gz'  the  default  index  file  name  will  be
              'file.gzi'). This is the default action.

       -I string
              index file name will be the indicated string.

       -l     check  and  list  info  contained in indicated index file.  `-ll` and `-lll` increase the level of
              index checking detail.

       -L #   extract data from indicated uncompressed line position of gzip file (creating or reusing an  index
              file) to STDOUT.  Accepts '0', '0x', and suffixes 'kmgtpe' (^10) or 'KMGTPE' (^2).

       -n #   indicates  that the first byte on compressed input is #, not 1, and so truncated compressed inputs
              can be used if an index exists.

       -p     indicates that the gzip input stream may  be  composed  of  various  incorrectly  terminated  GZIP
              streams, and so then a careful Patching of the input may be needed to extract correct data.

       -P     like  `-p`, but when used with `-[ST]` implies that checking for errors in stream is made as quick
              as possible as the gzip file grows. Warning: this may lead to some errors not being patched.

       -r #   (range): Number of bytes to extract when using `-[bL]`.  Accepts '0', '0x', and suffixes  'kmgtpe'
              (^10) 'KMGTPE' (^2).

       -R #   (Range):  Number of lines to extract when using `-[bL]`.  Accepts '0', '0x', and suffixes 'kmgtpe'
              (^10) 'KMGTPE' (^2).

       -s #   span in uncompressed MiB between index points when creating the index. By default is `10`.

       -S     Supervise indicated file.  Create a growing  index,  for  a  still-growing  gzip  file.  (`-i`  is
              implicit).

       -t     tail (extract last bytes) to STDOUT on indicated gzip file.

       -T     tail (extract last bytes) to STDOUT on indicated still-growing gzip file, and continue Supervising
              & extracting to STDOUT.

       -u [cCdD]
              utility  to  compress  (`-u c`) or decompress (`-u d`) zlib-format files to STDOUT. Use `-u C` and
              `-u D` to manage raw compressed files. No index involved.

       -v #   output verbosity from `0` (none) to `5` (nuts). Default is `1` (normal).

       -w     wait for creation if file doesn't exist, when using `-[cdST]`.

       -W     do not Write index to disk. But if one is already available read and use it. Useful if  the  index
              is still under a `-S` run.

       -x     create index with line number information (win/*nix compatible).
              Please,  note  that  gztool's  index counts last line even if the last char isn't a newline char -
              whilst `wc` command will not count it in this case!.
              This is implicit unless `-X` or `-z` are indicated.

       -X     like `-x`, but newline character is '\r' (old mac).

       -z     create index without line number information.

       -Z     adjust index points to a byte boundary: no previous byte needed.

QUICK EXAMPLE

       Extract data from 1 GiB byte (byte 2^30) on, from `myfile.gz` to the file `myfile.txt`. Also gztool  will
       create (or reuse, or complete) an index file named `myfile.gzi`:

           $ gztool -b 1G myfile.gz > myfile.txt

MORE EXAMPLES

       * Make an index for `test.gz`. The index will be named `test.gzi`:

           $ gztool -i test.gz

       * Make an index for `test.gz` with name `test.index`, using `-I`:

           $ gztool -I test.index test.gz

       *  Also  `-I`  can  be  used to indicate the complete path to an index in another directory. This way the
       directory where the gzip file resides could be read-only and the index be created in  another  read-write
       path:

           $ gztool -I /tmp/test.gzi test.gz

       * Retrieve data from uncompressed byte position 1000000 inside test.gz. Index file will be created at the
       same time (named `test.gzi`):

           $ gztool -b 1m test.gz

       * Supervise an still-growing gzip file and generate the index for it on-the-fly. The index file name will
       be  `openldap.log.gzi`  in  this case. `gztool` will execute until interrupted (it can also stop at first
       end-of-gzip data with `-E`):

           $ gztool -S openldap.log.gz

       * The previous command can be sent to background and with no verbosity, so we can forget about it:

           $ gztool -v0 -S openldap.log.gz &

       Creating and index for all "*gz" files in a directory:

           $ gztool -i *gz

       * Extract data from `project.gz` byte at 1 GiB to STDOUT, and use `grep` on this output. Index file  name
       will be `project.gzi`:

           $ gztool -b 1G project.gz | grep -i "balance = "

       *  Please,  note  that STDOUT is used for data extraction with `-bcdtT` modifiers, so an explicit command
       line redirection is needed if output is to be stored in a file:

           $ gztool -b 99m project.gz > uncompressed.data

       * Extract data from a gzipped file which index is still growing  with  a  `gztool  -S`  process  that  is
       monitoring  the  (still-growing) gzip file: in this case the use of `-W` will not try to update the index
       on disk so the other process is not disturb! (Note that `gztool` always tries to update the index used if
       it thinks it's necessary):

           $ gztool -Wb 100k still-growing-gzip-file.gz > mytext

       * Extract data from line 10 million, to STDOUT:

           $ gztool -L 10m compressed_text_file.gz

       * Nonetheless note that if in the precedent example an index was previously created  for  the  gzip  file
       without  the `-x` parameter (or not using `-L`), as it doesn't contain line numbering info, `gztool` will
       complain and stop. This can be circumvented by telling `gztool`  to  use  another  new  index  file  name
       (`-I`),  or  even  not  using  anyone  at  all with `-W` (do not write index) and an index file name that
       doesn't exists (in this case `None` - it won't be created because of `-W`), and so ((just) this time) the
       gzip will be processed from the beginning:

           $ gztool -L 10m -WI None compressed_text_file.gz

       *         Extract         all         data         from         a         rsyslog's         veryRobustZip
       (//www.rsyslog.com/doc/v8-stable/configuration/modules/omfile.html#veryrobustzip)   that  contains  dirty
       data. This *corrupted-gzip-files* can arise when using rsyslog's  veryRobustZip  omfile  option  and  the
       process  that  is  logging  is  abruptly  terminated  and  then restarted - this produces an incorrectly-
       terminated-gzip stream that is followed by another gzip stream in the  same  file.  `gzip`  (nor  `zlib`)
       cannot  read  this files beyond the point of error. But `gztool` can correctly extract all data (and only
       good data) using `-p` (*patch*) parameter:

           $ gztool -p -b0 compressed_text_file.gz

       This creates, as usual, the index file `compressed_text_file.gzi`. In order to not create  it,  `-W`  (do
       not Write index) can be used:

           $ gztool -pWb0 compressed_text_file.gz

       Note  that `-p` can require up to twice the time for decompression, because it performs two decompression
       processes: the usual one, and another one that is performed in advance of the usual and which is the  one
       that  detects errors, marks them, and finds new entry points to end/begin the decompression circumventing
       the problems.
       Note also that these corrupted-gzip-files should be always decompressed with `-p` parameter,  even  if  a
       `gztool` index file exists for them, because the index file stores entry points, but does not store where
       do errors occur in the `gzip` file.  That said, if the `-[bL]` point of extraction is beyond the point(s)
       of  error  in  the  `gzip` file and an index file exists, then the decompression can proceed fine without
       `-p`, as the index points stored in the index file are always clean.

       * When tailing an still-growing gzip file (`-T`) that could contain errors at some point, one  may  still
       want to obtain output from the gzip stream as soon as possible - this is what the patching option `-P` is
       for  (like  `-p`  but  capitalized):  with  `-p` `gztool` decompress the stream about 48 kiB ahead of the
       output that is actually shown/written in order to catch possible gzip-stream errors ahead of output,  and
       so  maintain  always  a  clean  output  without error-introduced artifacts. This has the side effect that
       output must always wait for that 48 kiB of data to be available in  advance,  which  if  the  file  grows
       slowly  can take a very long time. With `-P` the buffer-ahead restriction is relaxed to just as few bytes
       as available before reaching end-of-file and waiting for new data,  so  responsiveness  is  as  quick  as
       without  `-p`.  The  side  effect  of  `-P`  is  that  depending on the gzip file some errors may lead to
       incorrect output being shown/written - though in this case  a  "PATCHING  WARNING"  would  be  shown  (to
       stderr).

           $ gztool -PT application_log.gz

       The same applies to `-S` though in this case there's no output, as only the index is being constructed:

           $ gztool -PS application_log.gz

       *  To  tail  to  stdout, like a `tail -f`, an still-growing gzip file (an index file will be created with
       name `still-growing-gzip-file.gzi` in this case):

           $ gztool -T still-growing-gzip-file.gz

       * More on files still being "Supervised" (`-S`) by another `gztool` instance: they can also be  tailed  à
       la `tail -f` without updating the index on disk using `-W`:

           $ gztool -WT still-growing-gzip-file.gz

       *  Compress  (`-c`)  an  still growing (`-E`) file: in this case both `still-growing-file.gz` and `still-
       growing-file.gzi` files will be created on-the-fly as the source  file  grows.  Note  that  in  order  to
       terminate compression, Ctrl+C must be used to kill gztool: this results in an incomplete-gzip-file as per
       GZIP  standard,  but  this  is  not important as it will contain all the source data, and both `gzip` and
       `gztool` (or any other tool) can correctly and completely decompress it:

           $ gztool -Ec still-growing-file

       * If you have an incomplete index file (it just does not have the length of the source data, as it didn't
       correctly finish) and want to make it complete and so that the length of the uncompressed data be stored,
       just unconditionally complete it with `-C` with a new `-i` run over your gzip  file:  note  that  as  the
       existent  index  data is used (in this case the file `my-incomplete-gzip-data.gzi`), only last compressed
       bytes are decompressed to complete this action:

           $ gztool -Ci my-incomplete-gzip-data.gz

       * Decompress a file like with gzip (`-d`), but do not delete (`-D`) the original one:  Decompressed  file
       will be `myfile`. Note that gzipped file must have a ".gz" extension or `gztool` will complain:

           $ gztool -Dd myfile.gz

       * Decompress a file that does not have ".gz" file extension, like with gzip (`-d`):

           $ cat mycompressedfile | gztool -d > my_uncompressed_file

       *  Show internals of all index files in this directory. `-e` is used not to stop the process on the first
       error, if a `*.gzi` file is not a valid gzip index file. The `-ll` list option repetition will show  data
       about each index point. `-lll` also decompress each point's window to ensure index integrity:

           $ gztool -ell *.gzi

       If  `gztool`  finds  the  gzip  file  companion  of  the  index file, some statistics are shown, like the
       index/gzip size ratio, or the ratio of compression of the gzip file.  Also, if the gzip is complete,  the
       size of the uncompressed data is shown. This number is interesting if the gzip file is bigger than 4 GiB,
       in  which  case  `gunzip  -l`  cannot  correctly  calculate  it as it is limited to a 32 bit counter (see
       //tools.ietf.org/html/rfc1952#page-5), or if the gzip file is in `bgzip` format, in  which  case  `gunzip
       -l` would only show data about the first block (< 64 kiB).
       Note  that  `gztool  -l`  tries to guess the companion gzip file of the index looking for a file with the
       same name, but without the `i` of the `.gzi` file name extension, or without the  `.gzi`.  But  the  gzip
       file name can also be directly indicated with this format:

           $ gztool -l -I index_filename gzip_filename

       In this latter case only a pair of index+gzip filenames can be indicated with each use.

       *  Use  a  truncated  gzip file (100000 first bytes are removed: (not zeroed, removed); if they're zeroed
       cautions are the same, but `-n` is not needed), to extract from byte 20 MiB, using a previously generated
       index: as far as the `-b` parameter refers to a byte after an index point (See `-ll`) and  `-n`  be  less
       than  that  needed  first  index  point,  this  is  always possible. In this case -I gzip_filename.gzi is
       implicit:

           $ gztool -n 100001 -b 20M gzip_filename.gz

       Take into account that, as shown, the first byte of the truncated  `gzip_filename.gz`  file  is  numbered
       100001,  that is, the bytes retain the order number in which they appear in the original file (that's the
       reason why it is not the 1 byte).
       Please, note that index point positions at index file may require also the previous byte to be  available
       in  the truncated gzip file, as gzip stream is not byte-rounded but a stream of pure bits. Thus if you're
       thinking on truncating a gzip file, please do it always at least by one byte before the  indicated  index
       point in the gzip - as said, it may not be needed, but in 7 of 8 cases it is needed. Another option is to
       use `-Z` when creating the index, as indicated below.

       *  Create  an  index for a gzip file in which every index entry point is adjusted to byte boundary, so no
       previous byte (bits) is needed. Note that in general the byte at which the index entry point begins  does
       not  represent a clear cut point as the gzip window needs up to 7 bits from the previous byte. This is so
       because gzip is a bit-level stream compressor. With `-Z` the cut point is always clean and no  bits  from
       the  previous  byte are required. This will result in index points spaced by more than `-s` bytes between
       then, and so, may be, less points in the index. But this is completely safe and sound.

               $ gztool -Z my_gzip_file.gz

       `-Z` exists since gztool v1.6.0.

       * Since v1.5.0, using `-[fW]` (`-f`: force index overwriting; `-W`: do  not  write  index)  with  `-[ST]`
       (`-S`:  create  index  on  still-growing  gzip  file;  `-T`:  tail  and continue decompressing to stdout)
       indicates `gztool` to continue operations even after the source file is overwritten. If using  `-f`,  the
       index file will be overwritten. For example:

           $ gztool -WT log_filename.gz
           ...
           File overwriting detected and restarting decompression...
           Processing 'log_filename.gz'...

INTERNALS

       By  default  gzip-compressed  files  cannot  be  accessed in random mode: any byte required at position N
       requires the complete gzip file to be decompressed from the beginning to the N  byte.   Nonetheless  Mark
       Adler,  the  author  of zlib (//github.com/madler/zlib), provided years ago a cryptic file named `zran.c`
       (//github.com/madler/zlib/blob/master/examples/zran.c) that creates an "index" of "windows"  filled  with
       32  kiB  of  uncompressed  data at different positions along the un/compressed file, which can be used to
       initialize the zlib library and make it behave as if compressed data begin there.

       gztool builds upon zran.c to provide a useful command line tool.  Also, some optimizations has been made:

       * gztool can correctly read incomplete gzip-concatenated-files (using `-p`), that is, a gzip composed  of
       a  concatenation  of  `gzip`  files,  some  of  which  are not correctly terminated. This can happen, for
       example,         when         using         rsyslog's         veryRobustZip         omfile         option
       (//www.rsyslog.com/doc/v8-stable/configuration/modules/omfile.html#veryrobustzip) and the process that is
       logging is abruptly terminated and then restarted.

       *  gztool  can  store  line  numbering  information  in the index (use only if source data is text!), and
       retrieve data from a specific line number using `-L`. (Using `-[xXz]` when  creating  the  index  selects
       Unix new line format (default), old Mac new line format, or no line information respectively.)

       * gztool can Supervise an still-growing gzip file (for example, a log created by rsyslog directly in gzip
       format)  and  generate  the  index  on-the-fly,  thus  reducing in the practice to zero the time of index
       creation. See `-S`.

       * extraction of data and index creation are interleaved, so there's  no  waste  of  time  for  the  index
       creation.

       * index files are reusable, so they can be stopped at any time and reused and/or completed later.

       * an ex novo index file format has been created to store the index

       *  span  between  index  points  is raised by default from 1 MiB to 10 MiB, and can be adjusted with `-s`
       (span).

       * windows are compressed in file

       * windows are not loaded in memory unless they're needed, so the application memory footprint  is  fairly
       low (< 1 MiB)

       *  gztool  can  compress  files  (`-c`) and at the same time generate an index that is about 10-100 times
       smaller than if the index is generated after the file has already been compressed with gzip.

       * Compatible with `bgzip` files (//www.htslib.org/doc/bgzip.html)

       * Compatible with complete `gzip` concatenated files

       * Compatible with rsyslog's veryRobustZip omfile option (variable-short-uncompressed  complete-gzip-block
       sizes)

       * data can be provided from/to stdin/stdout

       *  gztool  can  be  used  to  remotely  retrieve  just  a small part of a bigger gzip compressed file and
       successfully decompress it locally. See  //unix.stackexchange.com/questions/429197/#541903  .  Just  note
       that the gztool index file must be also available.

PROJECT HOME PAGE

       //github.com/circulosmeos/gztool

AUTHOR

       This program was written by Roberto S. Galende <roberto.s.galende@gmail.com> on work by Mark Adler's zlib
       (examples/zran.c) and is copyrighted under zlib licence terms.

gztool v1.6.1                                      Nov  6 2023                                         gztool(1)

NAME

SYNOPSIS

DESCRIPTION

OPTIONS

QUICK EXAMPLE

MORE EXAMPLES

INTERNALS

PROJECT HOME PAGE

SEE ALSO

AUTHOR