NAME

       Data::TableReader - Extract records from "dirty" tabular data sources

VERSION

       version 0.021

SYNOPSIS

         # Find a row in the Excel file containing the headers
         #   "address", "city", "state", "zip" (in any order)
         # and then convert each row under that into a hashref of those fields.

  my $records= Data::TableReader->new(
             input => 'path/to/file.xlsx',
             fields => [qw( address city state zip )],
           )
           ->iterator->all;

       but there are plenty of options to choose from...

         my $tr= Data::TableReader->new(
           # path or file handle
           # let it auto-detect the format (but can override that if we need)
           input => 'path/to/file.csv',

           # We want these fields to exist in the file (identified by headers)
           fields => [
             { name => 'address', header => qr/street|address/i },
             'city',
             'state',
             # can validate with Type::Tiny classes
             { name => 'zip', header => qr/zip\b|postal/i, type => US_Zipcode },
           ],

           # Our data provider is horrible; just ignore any nonsense we encounter
           on_blank_row => 'next',
           on_validation_error => 'next',

           # Capture warnings and show to user who uploaded file
    log => \my @messages
         );

         my $records= $tr->iterator->all;
         ...
         $http_response->body( encode_json({ messages => \@messages }) );

DESCRIPTION

       This module is designed to take "loose" or "dirty" tabular data sources (such as Excel, CSV, TSV, or
       HTML) which may have been edited by non-technical humans and extract the data into sanitized records,
       while also verifying that the data file contains roughly the schema you were expecting.  It is primarily
       intended for making automated imports of data from non-automated or unstable sources, and providing
       human-readable feedback about the validity of the data file.

ATTRIBUTES

   input
       This can be a file name, Path::Class instance, file handle, arrayref, or
       Spreadsheet::ParseExcel::Worksheet object.  If you supply a file handle, it must be seekable in order to
       auto-detect the file format, or you may specify the decoder directly to avoid auto-detection.  Arrayrefs
       are passed to the 'Mock' decoder which just returns the data as-is.
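
        For illustration, any of these would be an acceptable "input" (the file names and data here are made up):

          input => 'path/to/file.xlsx',            # file name
          input => Path::Class::file('data.csv'),  # Path::Class instance
          input => $seekable_fh,                   # file handle; must be seekable unless 'decoder' is given
          input => [ ['a','b','c'], [1,2,3] ],     # arrayref, handled by the 'Mock' decoder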

   decoder
        This is either an instance of Data::TableReader::Decoder, a class name, a partial class name to be
        expanded as "Data::TableReader::Decoder::$name", or an arrayref or hashref of arguments used to build the
        decoder.

        In an arrayref, the first element can be undef, and in a hashref the CLASS key can be missing or undef.
        In those cases, the decoder class is detected from the "input" attribute, and any default arguments from
        that detection are combined with (and, if necessary, overridden by) the extra arguments in the arrayref or
        hashref.

       Examples:

         'CSV'
         # becomes Data::TableReader::Decoder::CSV->new()

         [ 'CSV', sep_char => "|" ]
         # becomes Data::TableReader::Decoder::CSV->new(sep_char => "|")

         { CLASS => 'CSV', sep_char => "|" }
         # becomes Data::TableReader::Decoder::CSV->new({ sep_char => "|" })
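
        As a further sketch of the undef/missing CLASS case (assuming a CSV file was given as "input", so
        detection settles on the CSV decoder):

  [ undef, sep_char => "|" ]
  # class detected from 'input', then sep_char merged into the detected arguments

  { sep_char => "|" }
  # same, with CLASS simply left out of the hashref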

   fields
       An arrayref of Data::TableReader::Field objects which this module should search for within the tables
       (worksheets etc.) of "input".

       If an element of this array is a hashref or string, it will be coerced to an instance of
       Data::TableReader::Field, with plain strings becoming the "name" attribute.  See "header" in
       Data::TableReader::Field for how names are automatically converted to the header-matching regex.

       There are some convenience accessors for the fields:

       field_list
           List access for "@{ $reader->fields }"

       field_by_name
           Map  of "{ $field->name => $field }".  If you have multiple fields of the same name (allowed, but not
           recommended) the value is the first per the order of "field_list".

       field_by_addr
           Map of "{ refaddr($field) => $field }".

   record_class
       Default is the special value 'HASH' for un-blessed hashref  records.   The  special  value  'ARRAY'  will
       result  in  arrayrefs  with  fields  in the same order they were specified in the "fields" specification.
       Setting it to anything else will return records created with "$record_class->new(\%fields);"
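
        For example ("My::Address" is a hypothetical record class of your own providing a "new" constructor):

          record_class => 'HASH',          # default; plain hashrefs
          record_class => 'ARRAY',         # arrayrefs, fields in the order they were specified
          record_class => 'My::Address',   # blessed records, via My::Address->new(\%fields)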

   static_field_order
       Boolean, whether the "fields" must be found in columns in the  exact  order  that  they  were  specified.
       Default is false.

   header_row_at
       Row  number, or range of row numbers where the header must be found.  (All row numbers in this module are
       1-based, to match end-user expectations.)  The default is "[1,10]" to limit header scanning to the  first
       10  rows.  As a special case, if you are reading a source which lacks headers and you trust the source to
       deliver the columns in the right order, you can set this to undef if you also set "static_field_order  =>
       1".

   col_map
        This is an arrayref with one element per column of input data, listing which field was detected to come
        from that column.  If you specify this to the constructor, "find_table" will respect any defined element
        of the array, but still search for matching headers in the undefined columns.  After a successful
        "find_table", "col_map" is changed to refer to the same array as
        "->table_search_results->{found}{col_map}".  (If you want to re-run the search for the table, you need to
        both "clear_table_search_results" and reset "col_map" to whatever you passed to the constructor.)

        For backward compatibility, if you did not specify this attribute to the constructor and you access it
        before calling "find_table", it automatically calls "find_table" for you (and dies if it fails).

   has_col_map
       Check whether col_map has been defined, to avoid lazy-building it.

   table_search_results
       This is the output of the most recent "find_table" operation.

         {
           candidates => [
             { dataset_idx => $n,
               row => $n,
               col_map => [ $field_or_undef, $field_or_undef, ... ],
               missing_required => \@fields,
               ambiguous_columns => { $col_idx => \@fields, ... },
               ambiguous_fields => { $field_name => { $col_idx => $field }, ... },
               unmatched => \@col_idx,
               messages => [],
             },
             ...
           ],
           found => $ref_to_candidate, # undef if find_table failed
         }

       The  fields  describing  problems  ("missing_required",  "ambiguous_columns",   "ambiguous_fields",   and
       "unmatched") are not present unless they contain data.  All @fields are refs to the actual Field objects.
       "col_map"  has  one  element  per  element  of  the  header row.  If "missing_required" is populated, the
       analysis of ambiguity may be incomplete, because missing  required  columns  abort  the  search  for  the
       header.  All "_idx" values are 0-based, but the errors in "messages" use 1-based descriptions.

   has_table_search_results, clear_table_search_results
       Predicate and clearer for lazy-built table_search_results.

   on_partial_match
         on_partial_match    => 'next'    # keep searching for a better line of headers
         on_partial_match    => 'last'    # return failure from ->find_table
         on_partial_match    => sub {
           my ($reader, $candidate, $header_row)= @_;
           return $action; # one of the above values
         }

       During  "find_table",  if  a  row  is  found that matches at least one header, but fails to match all the
       requirements (required columns, unknown or ambiguous columns if those are configured as an error) you can
       either keep searching for a better header row, or stop here.  The default is 'next', to  keep  searching,
       but this may result in a lot of noise.  The 'last' setting allows you to stop after a likely header row.

       If you supply a coderef, you receive the "candidate" info described in "table_search_results".

   on_ambiguous_columns
         on_ambiguous_columns => 'warn'    # warn, and omit from the match
         on_ambiguous_columns => 'error'   # fail the header match for this row

       During  "find_table",  when  matching  a  field's header pattern vs. the columns of a row, if the pattern
       could match more than one cell it is an error.  You might want to handle it in various ways:

       'warn'
            If a Field matches multiple columns (and isn't an array field), the field is omitted from the col_map
            entirely.  If a column matches multiple fields, the col_map is left blank for that column.  Both cases
            generate warnings, but the header match can still proceed to a successful result.

       'error' (default)
            Any ambiguity (a field matching multiple columns, or multiple fields matching a column) causes the
            match of the header on this row to fail.  Further attempts at finding the header depend on the
            "on_partial_match" setting.

   on_unknown_columns
         on_unknown_columns => 'warn'  # warn, and then accept these headers
         on_unknown_columns => 'error' # fail the header match for this row
         on_unknown_columns => sub {
           my ($reader, $col_headers)= @_;
           ...;
           return $opt; # one of the above values
         }

       This determines handling for columns that aren't associated with any field.  The "required" columns  must
       all  be  found before it considers this setting, but once it has found everything it needs to make this a
       candidate, you might or might not care about the leftover columns.

       'warn'  (default)
           You don't care if there are extra columns, just log warnings about them and proceed  extracting  from
           this table.

       'error'
            Extra columns mean that you didn't find the table you wanted.  Log the near-miss, and keep searching
            additional rows or additional tables, according to "on_partial_match".

       "sub {}"
           You can add your own logic to handle this.  Inspect the headers however you like, and then return one
           of the above values.

   on_blank_rows
         on_blank_rows => 'next' # warn, and then skip the row(s)
         on_blank_rows => 'last' # warn, and stop iterating the table
         on_blank_rows => 'die'  # fatal error
         on_blank_rows => 'use'  # actually try to return the blank rows as records
         on_blank_rows => sub {
           my ($reader, $first_blank_rownum, $last_blank_rownum)= @_;
           ...;
           return $opt; # one of the above values
         }

       This determines what happens when you've found the table, are extracting records, and encounter a  series
       of  blank  rows  (defined as a row with no printable characters in any field) followed by non-blank rows.
       If you use the callback, it suppresses the default warning, since you can generate your own.

       The default is 'next'.

   on_validation_error
         on_validation_error => 'next'  # warn, and then skip the record
         on_validation_error => 'use'   # warn, and then use the record anyway
         on_validation_error => 'die'   # fatal error
         on_validation_error => sub {
           my ($tablereader, $failures, $record, $data_iterator)= @_;
           # $record is the assembled hashref (unblessed) or arrayref of fields
           # $data_iterator is the Decoder's row iterator, useful for context
           for (@$failures) {
             my ($field, $value_ref, $message, $path)= @$_;
             ...
             # $field is a Data::TableReader::Field
             # $$value_ref is the string that failed validation
             # $message is the error returned from the validation function
             # $path is the element (and maybe sub-element) of $record
              #   i.e.  $value_ref= \$record->{$path->[0]}[$path->[1]]
             # You may modify $$value_ref or $record to alter the output
           }
           # Clear the failures array to suppress warnings, if you actually corrected
           # the validation problems.
           @$failures= ();
           # return one of the above constants to tell the iterator what to do next
           return $opt;
         }

       This determines what happens when you've found the table, are extracting records, and one row  fails  its
       validation.   In  addition  to  deciding  an  option, the callback gives you a chance to alter the record
       before 'use'ing it.

       The default is 'die'.

   log
        If undefined (the default) all log messages above 'info' will be emitted with "warn "$message\n"".  If
        set to an object, it should support an API of:

         trace,  is_trace
         debug,  is_debug
         info,   is_info
         warn,   is_warn
         error,  is_error

        such as Log::Any and many other Perl logging modules use.  You can also set it to a coderef such as:

         my @messages;
         sub { my ($level, $message)= @_;
           push @messages, [ $level, $message ]
             if grep { $level eq $_ } qw( info warn error );
         };

       for  a simple way to capture the messages without involving a logging module.  And for extra convenience,
       you can set it to an arrayref which will receive any message that would otherwise have gone to 'warn'  or
       'error'.
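
        As a sketch, a Log::Any logger satisfies the object API described above:

          use Log::Any '$log';

          my $tr= Data::TableReader->new(
            input  => 'path/to/file.csv',
            fields => [qw( address city state zip )],
            log    => $log,   # object with trace/debug/info/warn/error methods
          );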

METHODS

   detect_input_format
          my ($class, @args)= $tr->detect_input_format( $filename, $head_of_file );

       This  is  used  internally to detect the format of a file, but you can call it manually if you like.  The
       first argument (optional) is a file name, and the second  argument  (also  optional)  is  the  first  few
       hundred  bytes of the file.  Missing arguments will be pulled from "input" if possible.  The return value
       is the best guess of module name and constructor arguments  that  should  be  used  to  parse  the  file.
        However, this doesn't guarantee that such a module actually exists or is installed; it might just echo
        the file extension back to you.
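
        For example (the exact return value depends on the file, so treat this as illustrative):

          my ($class, @args)= $tr->detect_input_format('records.xlsx');
          # $class would likely be 'XLSX', i.e. Data::TableReader::Decoder::XLSX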

   find_table
         if ($tr->find_table) { ... }

       Search through the input for the beginning of the records,  identified  by  a  header  row  matching  the
       various  constraints  defined  in  "fields".   If  "header_row_at" is "undef", then this does nothing and
       assumes success.

        Returns a boolean of whether it succeeded.  This method does not "croak" on failure like "iterator" does,
        on the assumption that you want to handle failures gracefully.  All diagnostics about the search are
        logged via "log", but also reported in "table_search_results".

   field_map
       Build a hashref of "{ $field_name => $col_idx_or_arrayref }"  for the current "col_map".  If the field is
       defined  as an array field, the value will be an arrayref (even if only found in one column).  Otherwise,
       the value is a simple scalar of the column index.
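
        For example (the column indices shown are purely illustrative):

          my $map= $tr->field_map;
          # e.g. { address => 0, city => 1, state => 2, zip => 3 }
          my $city_idx= $map->{city};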

   iterator
         my $iter= $tr->iterator;
         while (my $rec= $iter->()) { ... }

        Create an iterator.  If the table has not been located, this will find it, and "croak" if it can't be
        found.  Depending on the decoder and input filehandle, you might only be able to have one instance of the
        iterator at a time.

       The iterator derives from Data::TableReader::Iterator but also has  a  method  "all"  which  returns  all
       records in an arrayref.

         my $records= $tr->iterator->all;

THANKS

       Portions    of    this    software   were   funded   by   Ellis,   Partners   in   Management   Solutions
       <http://www.epmsonline.com/> and Candela Corporation <https://www.candelacorp.com/>.

AUTHOR

       Michael Conrad <mike@nrdvana.net>

CONTRIBUTOR

       Christian Walde <walde.christian@gmail.com>

COPYRIGHT AND LICENSE

       This software is copyright (c) 2024 by Michael Conrad.

       This is free software; you can redistribute it and/or modify it under  the  same  terms  as  the  Perl  5
       programming language system itself.
