Provided by: libebook-tools-perl_0.5.4-1.3_amd64

NAME

       EBook::Tools::Mobipocket - Palm::PDB handler for manipulating the Mobipocket format.

SYNOPSIS

        use EBook::Tools::Mobipocket qw(:all);
        my $mobi = EBook::Tools::Mobipocket->new();
        $mobi->Load('filename.prc');
        print "Title: ",$mobi->{title},"\n";
        print "Author: ",$mobi->{header}{exth}{author},"\n";
        print "Language: ",$mobi->{header}{mobi}{language},"\n";

        my $mobigen = find_mobigen();
        system_mobigen(infile => 'myfile.opf');

DEPENDENCIES

       •   "Bit::Vector"

       •   "Compress::Zlib"

       •   "HTML::Tree"

       •   "Image::Size"

       •   "List::MoreUtils"

       •   "P5-Palm"

       •   "String::CRC32"

CONSTRUCTOR

   "new()"
       Instantiates a new EBook::Tools::Mobipocket object.

ACCESSOR METHODS

   "drm()"
       Returns  1  if the "drmoffset" header value is neither 0 nor 0xffffffff.  Returns undef if "drmoffset" is
       undefined. Returns 0 otherwise.
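
       The rule above can be sketched as a standalone function.  This is only an illustration of the
       documented behaviour, not the module's source; the name drm_status() is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the documented drm() rule.
sub drm_status {
    my ($drmoffset) = @_;
    return unless defined $drmoffset;        # header too short to hold the value
    return 0 if $drmoffset == 0;             # documented "no DRM" marker
    return 0 if $drmoffset == 0xffffffff;    # documented "no DRM" marker
    return 1;                                # any other offset: DRM data present
}

for my $offset (undef, 0, 0xffffffff, 248) {
    my $status = drm_status($offset);
    printf "%-10s => %s\n",
        defined $offset ? sprintf('0x%08x', $offset) : 'undef',
        defined $status ? $status                    : 'undef';
}
```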

   "text()"
       Returns the text of the file.

   "write_images()"
       Writes each image record to the disk.

       Returns the number of images written.

   "write_text($filename)"
       Writes the book text to disk with the given filename.  This filename must match  the  filename  given  to
       "fix_html()" for the internal links to be consistent.

       Croaks if $filename is not specified.

       Returns 1 on success, or undef if there was no text to write.

   "write_unknown_records()"
       Writes each unidentified record to disk with a filename in the format of 'raw-record-####', where #### is
       the record number (not the record ID).

       Returns the number of records written.

MODIFIER METHODS

       These  methods  have  two naming/capitalization schemes -- methods directly related to the subclassing of
       Palm::PDB use its MethodName capitalization style.  Any other methods are lowercase_with_underscores  for
       consistency with the rest of EBook::Tools.

   "Load($filename)"
       Sets   "$self->{filename}"   and  then  loads  and  parses  the  file  specified  by  $filename,  calling
       "ParseRecord(%record)" on every record found.

       If DictionaryHuffman compression is detected, text records will be left untouched during the  ParseRecord
       pass,  and  "uncompress_dictionaryhuffman_records()"  will  be  called  after the initial parsing pass is
       complete.

   "ParseRecord(%record)"
       Parses PDB records, updating the object  attributes.   This  method  is  called  automatically  on  every
       database record during "Load()".

   "ParseRecord0($data)"
       Parses  the  header  record  and  places  the parsed values into the hashref "$self->{header}{palm}", the
       hashref  "$self->{header}{mobi}",  and  "$self->{header}{exth}"  by   calling   "parse_palmdoc_header()",
       "parse_mobi_header()", and "parse_mobi_exth()" respectively.

   "ParseRecordCDIC(\$data)"
       Parses a CDIC record.  Takes as a sole argument a reference to the data of the record.

       Record format

       •   Offset 0: Record identifier

           4 bytes, always 'CDIC'

       •   Offset 4: Header length

           4 bytes, big-endian long int, always = 16

       •   Offset 8: Index count

           4  bytes,  big-endian  long  int, marks the number of big-endian short ints immediately following the
           header used as index points into the dictionary data

       •   Offset 12: Codelength

           4 bytes, big-endian long int, number of code bits

       •   Offset 16: Indexes

           A number of big-endian short ints used as index points into the dictionary data

       •   Offset ??: Dictionary data

           Dictionary result strings immediately following the indexes
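
       The layout above maps directly onto unpack() templates.  The following is a minimal sketch rather
       than the module's own parser: the function name and the returned hash keys are hypothetical, and
       it assumes that the dictionary data begins immediately after the last index (that is, at header
       length plus two bytes per index).

```perl
use strict;
use warnings;

# Hypothetical helper: split a CDIC record into its documented parts.
sub parse_cdic_sketch {
    my ($dataref) = @_;
    my ($id, $headerlen, $indexcount, $codelength)
        = unpack('a4 N N N', $$dataref);
    die "Not a CDIC record\n" unless defined $codelength and $id eq 'CDIC';

    # Index points: $indexcount big-endian shorts right after the header.
    my @indexes = unpack('x' . $headerlen . ' n' . $indexcount, $$dataref);

    # Assumption: dictionary data starts right after the last index.
    my $dictionary = substr($$dataref, $headerlen + 2 * $indexcount);

    return {
        indexcount => $indexcount,
        codelength => $codelength,
        indexes    => \@indexes,
        dictionary => $dictionary,
    };
}

# Synthetic record: two index points followed by five bytes of dictionary data.
my $record = pack('a4 N N N n2', 'CDIC', 16, 2, 8, 0, 3) . 'abcde';
my $cdic   = parse_cdic_sketch(\$record);
print "Code bits: $cdic->{codelength}\n";
print "Indexes: @{ $cdic->{indexes} }\n";
print "Dictionary bytes: ", length($cdic->{dictionary}), "\n";
```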

   "ParseRecordHUFF(\$data)"
       Parses a HUFF record.  Takes as a sole argument a reference to the data of the record.

       Record format

       •   Offset 0: Record identifier

           4 bytes, always 'HUFF'

       •   Offset 4: Header length

           4 bytes, big-endian long int, always = 24

       •   Offset 8: Cache table (big-endian) offset

           4 bytes, big-endian long int, always = 24

       •   Offset 12: Base table (big-endian) offset

           4 bytes, big-endian long int, always = 1048

       •   Offset 16: Cache table (little-endian) offset

           4 bytes, big-endian long int, always = 1304

       •   Offset 20: Base table (little-endian) offset

           4 bytes, big-endian long int, always = 2328

       •   Offset 24: Cache table (big-endian)

           1024 bytes, 256 big-endian long ints

           This is a look up table for the length and decoding of short codewords.  If the codeword  represented
           by  the 8 bits is unique, then bit 7 (0x80) will be set, and the low 5 bits are the length in bits of
           the code.  The high three bytes partially represent the final symbol.

           If bit 7 is clear, then the code is looked up in the base table.

       •   Offset 1048: Base table (big-endian)

           256 bytes, 64 big-endian long ints

           This is where the codeword is looked up if it isn't found in the cache table.

       •   Offset 1304: Cache table (little-endian)

           1024 bytes, 256 little-endian long ints.

           This contains exactly the same data as in the cache table at offset 24, except that all of the values
           are stored in little-endian format instead of big-endian.

           Presumably this is for a speed advantage on slow little-endian processors.  This module uses only the
           big-endian tables.

       •   Offset 2328: Base table (little-endian)

           256 bytes, 64 little-endian long ints

           This contains exactly the same data as in the base table at offset  1048,  except  that  all  of  the
           values are stored in little-endian format instead of big-endian.

           Presumably this is for a speed advantage on slow little-endian processors.  This module uses only the
           big-endian tables.
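
       As an illustration of the layout above, the header and a single big-endian cache table entry can
       be read with unpack().  This is a sketch under the documented layout only; the function names and
       hash keys are hypothetical, and the record used here is synthetic.

```perl
use strict;
use warnings;

# Hypothetical helper: read the six-field HUFF header described above.
sub parse_huff_header_sketch {
    my ($dataref) = @_;
    my ($id, $headerlen, $cache_be, $base_be, $cache_le, $base_le)
        = unpack('a4 N5', $$dataref);
    die "Not a HUFF record\n" unless defined $base_le and $id eq 'HUFF';
    return {
        headerlength => $headerlen,
        cache_be     => $cache_be,
        base_be      => $base_be,
        cache_le     => $cache_le,
        base_le      => $base_le,
    };
}

# Hypothetical helper: split one cache table entry into its documented fields.
sub decode_cache_entry_sketch {
    my ($entry) = @_;
    return {
        terminal   => ($entry & 0x80) ? 1 : 0,   # bit 7: codeword is unique
        codelength => $entry & 0x1f,             # low 5 bits: length in bits
        partial    => $entry >> 8,               # high three bytes
    };
}

# Synthetic record: documented header values plus zero-filled tables.
my $record = pack('a4 N5', 'HUFF', 24, 24, 1048, 1304, 2328)
    . ("\0" x (2584 - 24));
# Place one sample entry in slot 0 of the big-endian cache table.
substr($record, 24, 4, pack('N', 0x00004188));

my $huff  = parse_huff_header_sketch(\$record);
my @cache = unpack('x' . $huff->{cache_be} . ' N256', $record);
my $entry = decode_cache_entry_sketch($cache[0]);
printf "terminal=%d codelength=%d partial=0x%x\n",
    $entry->{terminal}, $entry->{codelength}, $entry->{partial};
```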

   "ParseRecordImage(\$dataref)"
       Parses  image  records,  updating  object  attributes,  most  notably  adding  the image data to the hash
       "$self->{imagedata}",  adding  the  image  filename   to   "$self->{recindexlinks}",   and   incrementing
       "$self->{recindex}".

       Takes as an argument a reference to the record data.  Croaks if it isn't provided, or isn't a reference.

       This is called automatically by "ParseRecord()" and "ParseResource()" as needed.

   "ParseRecordText(\$dataref)"
       Parses  text  records, updating object attributes, most notably appending text to "$self->{text}".  Takes
       as an argument a reference to the record data.

       This is called automatically by "ParseRecord()" and "ParseResource()" as needed.

   "fix_html(%args)"
       Takes raw Mobipocket text and replaces the custom tags and file position anchors.

       Arguments

       •   "filename"

           The name of the output HTML file (used in generating hrefs).  The procedure croaks  if  this  is  not
           supplied.

       •   "nonewlines" (optional)

           If this is set to true, the procedure will not attempt to insert newlines for readability.  This will
           leave  the output in a single unreadable line, but has the advantage of reducing the processing time,
           especially useful if tidy is going to be run on the output anyway.

   "fix_html_filepos()"
       Takes the raw HTML text of the object and replaces the filepos anchors.  This has to be called before any
       other action that modifies the text, or the filepos positions will not be valid.

       Returns 1 if successful, undef if there was no text to fix.

       This is called automatically by "fix_html()".

   "uncompress_dictionaryhuffman_records()"
       Uncompresses all  text  records  using  "uncompress_dictionaryhuffman()".   This  destroys  the  existing
       contents of $self->{text} if any.

       This method is called automatically at the end of "Load()" if DictionaryHuffman encoding is detected.

PROCEDURES

       All procedures are exportable, but none are exported by default.  All procedures can be exported by using
       the ":all" tag.

   "find_mobidedrm()"
       Attempts to locate a copy of the MobiDeDrm script by searching PATH and looking in the EBook::Tools user
       configuration directory (see "userconfigdir()" in EBook::Tools).

       Returns the complete path to the script, or undef if nothing was found.

       This will use package variable $mobidedrm_cmd as its first guess, and set that  variable  to  the  return
       value as well.

   "find_mobigen()"
       Attempts to locate the mobigen executable by making a test execution on predicted locations (including
       just checking PATH) and looking in the EBook::Tools user configuration directory (see "userconfigdir()"
       in EBook::Tools).

       Returns the system command used for a successful invocation, or undef if nothing worked.

       This will use package variable $mobigen_cmd as its first guess, and set that variable to the return value
       as well.

   "parse_mobi_exth($headerdata)"
       Takes  as an argument a scalar containing the variable-length Mobipocket EXTH data from the first record.
       Returns an array of hashes, each hash containing the data from one EXTH record with values from that data
       keyed to recognizable names.

       If $headerdata doesn't appear to be an EXTH header, carps a warning and returns an empty list.

       See:

       http://wiki.mobileread.com/wiki/MOBI

       Hash keys

       •   "type"

           A numeric value indicating the type of EXTH data in the record.  See package variable %exthtypes.

       •   "length"

           The length of the "data" value in bytes

       •   "data"

           The data of the record.
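
       A minimal sketch of how such a list of hashes could be built is shown below.  It is not the
       module's implementation.  It assumes the EXTH layout described on the MobileRead wiki page cited
       above: the identifier 'EXTH', a 4-byte header length, a 4-byte record count, then records made of
       a 4-byte type, a 4-byte record length (which includes those 8 bytes), and the data.  The function
       name and the sample values are hypothetical.

```perl
use strict;
use warnings;
use Carp;

# Hypothetical helper: walk the EXTH records and return a list of hashes.
sub parse_exth_sketch {
    my ($headerdata) = @_;
    my ($id, $headerlen, $count) = unpack('a4 N N', $headerdata);
    unless (defined $count and $id eq 'EXTH') {
        carp "Data does not look like an EXTH header";
        return;
    }

    my @records;
    my $pos = 12;    # first record follows the 12-byte EXTH preamble
    for (1 .. $count) {
        last if $pos + 8 > length $headerdata;
        my ($type, $reclen) = unpack('N N', substr($headerdata, $pos, 8));
        last if $reclen < 8;    # malformed record; stop rather than loop
        my $data = substr($headerdata, $pos + 8, $reclen - 8);
        push @records,
            { type => $type, length => length($data), data => $data };
        $pos += $reclen;
    }
    return @records;
}

# Synthetic EXTH block holding a single record of sample type 100.
my $exth = pack('a4 N N', 'EXTH', 28, 1) . pack('N N a*', 100, 16, 'Jane Doe');
for my $rec (parse_exth_sketch($exth)) {
    print "type=$rec->{type} length=$rec->{length} data=$rec->{data}\n";
}
```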

   "parse_mobi_header($headerdata)"
       Takes as an argument a scalar containing the variable-length Mobipocket-specific  header  data  from  the
       first record.  Returns a hash containing values from that data keyed to recognizable names.

       See:

       http://wiki.mobileread.com/wiki/MOBI

       Hash keys

       The  returned hash will have the following keys (documented in the order in which they are encountered in
       the header):

       "identifier"
           This should always be the string 'MOBI'.  If it isn't, the procedure croaks.

       "headerlength"
           This is the size of the complete header.  If this value is different from the length of the argument,
           the procedure croaks.

       "type"
           A numeric code indicating what category of Mobipocket file this is.

       "encoding"
           A numeric code representing the encoding.  Expected values are '1252' (for Windows-1252) and '65001'
           (for UTF-8).

           The procedure carps a warning if an unexpected value is encountered.

       "uniqueid"
           This is thought to be a unique ID for the book, but its actual use is unknown.

           Use with caution.  This key may be renamed in the future if more information is found.

       "version"
           This  is  thought to be the Mobipocket format version.  A second version code shows up again later as
           "version2" which is usually the same on unprotected books but different on DRMd books.

           Use with caution.  This key may be renamed in the future if more information is found.

       "reserved"
           40 bytes of reserved data.

           Use with caution.  This key may be renamed in the future if more information is found.

       "indxrecord"
           This is thought to be the record offset to the first 'INDX' record,  so  named  for  its  first  four
           letters.

           Use with caution.  This key may be renamed in the future if more information is found.

       "titleoffset"
           Offset in record 0 (not from start of file) of the full title of the book.

       "titlelength"
           Length in bytes of the full title of the book.

       "languageunknown"
           16 bits of unknown data thought to be related to the book language.

           Use with caution.  This key may be renamed in the future if more information is found.

       "language"
           A  pseudo-IANA  language  code  string  representing  the  main  book  language  (i.e.  the  value of
           <dc:language>).  See %mobilangcodes for an exact map of raw values to this string and notes  on  non-
           compliant results.

       "dilanguageunknown"
           16 bits of unknown data thought to be related to the dictionary input language.

           Use with caution.  This key may be renamed in the future if more information is found.

       "dilanguage"
           A  pseudo-IANA  language code string for the DictionaryInLanguage element.  See %mobilangcodes for an
           exact map of raw values to this string and notes on non-compliant results.

       "dolanguageunknown"
           16 bits of unknown data thought to be related to the dictionary output language.

           Use with caution.  This key may be renamed in the future if more information is found.

       "dolanguage"
           A pseudo-IANA language code string for the DictionaryOutLanguage element.  See %mobilangcodes for  an
           exact map of raw values to this string and notes on non-compliant results.

       "version2"
           This  is  another  Mobipocket  format version related to DRM.  If no DRM is present, it should be the
           same as "version".

           Use with caution.  This key may be renamed in the future if more information is found.

       "firstimagerecord"
           This is thought to be an index to the first record containing image data.  If there are no images in
           the book, this value will be 4294967295 (0xffffffff).

           Use with caution.  This key may be renamed in the future if more information is found.

       "huffrecord"
           This is thought to be the record offset to the 'HUFF' record, used in HUFF/CDIC decompression.

           Use with caution.  This key may be renamed in the future if more information is found.

       "huffreccnt"
           This is thought to be the number of HUFF and CDIC records, starting at "huffrecord".

           Use with caution.  This key may be renamed in the future if more information is found.

       "datprecord"
           This  is  thought  to  be  the  record offset to the first 'DATP' record, so named for its first four
           letters.

           Use with caution.  This key may be renamed in the future if more information is found.

       "datpreccnt"
           This is thought to be the number of 'DATP' records present.

           Use with caution.  This key may be renamed in the future if more information is found.

       "exthflags"
           A 32-bit bitfield related to the Mobipocket EXTH data.  If bit 6 (0x40) is  set,  then  there  is  at
           least one EXTH record.

       "unknown116"
           36 bytes of unknown data at offset 116.  This value will be undefined if the header data was not long
           enough to contain it.

           Use with caution.  This key may be renamed in the future if more information is found.

       "drmoffset"
           A  number  thought  to be the byte offset inside of the record 0 data in which DRM data can be found.
           If present and no DRM is set, contains either the  value  0xFFFFFFFF  (normal  books)  or  0x00000000
           (samples).  This value will be undefined if the header data was not long enough to contain it.

           Use with caution.  This key may be renamed in the future if more information is found.

       "drmcount"
           A number thought to be related to DRM.

           This value will be undefined if the header data was not long enough to contain it.

           Use with caution.  This key may be renamed in the future if more information is found.

       "drmsize"
           A number thought to be the size of the data in bytes after "drmoffset" containing DRM keys.

           This value will be undefined if the header data was not long enough to contain it.

           Use with caution.  This key may be renamed in the future if more information is found.

       "drmflags"
           A number thought to be related to DRM.

           This value will be undefined if the header data was not long enough to contain it.

           Use with caution.  This key may be renamed in the future if more information is found.

       "unknown168"
           32  bits  of  unknown data at offset 168, usually zeroes.  This value will be undefined if the header
           data was not long enough to contain it.

           Use with caution.  This key may be renamed in the future if more information is found.

       "unknown172"
           32 bits of unknown data at offset 172, usually zeroes.  This value will be undefined  if  the  header
           data was not long enough to contain it.

           Use with caution.  This key may be renamed in the future if more information is found.

       "unknown176"
           16  bits of unknown data at offset 176.  This value will be undefined if the header data was not long
           enough to contain it.

           Use with caution.  This key may be renamed in the future if more information is found.

       "lastimagerecord"
           This is thought to be an index to the last record containing image data.  If there are no  images  in
           the book, this value will be 65535 (0xffff).

           Use with caution.  This key may be renamed in the future if more information is found.

       "unknown180"
           32  bits of unknown data at offset 180.  This value will be undefined if the header data was not long
           enough to contain it.

           Use with caution.  This key may be renamed in the future if more information is found.

       "fcisrecord"
           This is thought to be an index to a 'FCIS' record, so named because those are always the  first  four
           characters when the record data is decompressed using uncompress_palmdoc().

           This value will be undefined if the header data was not long enough to contain it.

           Use with caution.  This key may be renamed in the future if more information is found.

       "unknown188"
           32  bits of unknown data at offset 188.  This value will be undefined if the header data was not long
           enough to contain it.

           Use with caution.  This key may be renamed in the future if more information is found.

       "flisrecord"
           This is thought to be an index to a 'FLIS' record, so named because those are always the  first  four
           characters when the record data is decompressed using uncompress_palmdoc().

           This value will be undefined if the header data was not long enough to contain it.

           Use with caution.  This key may be renamed in the future if more information is found.

       "unknown196"
           32 bits of unknown data at offset 196.  This value will be undefined if the header data was not long
           enough to contain it.

           Use with caution.  This key may be renamed in the future if more information is found.

       "unknown200"
           Unknown data of unknown length running to the end of the header.  This value will be undefined if the
           header data was not long enough to contain it.

           Use with caution.  This key may be renamed in the future if more information is found.

       "extradataflags"
           Two bytes sometimes found inside of "unknown200", used to determine if extra data has  been  appended
           to each text record that should not be used in decompression.
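
       A few of the documented values lend themselves to simple checks once the hash has been obtained.
       The sketch below works on a hand-built hash with made-up sample values and uses a hypothetical
       function name; it relies only on the encoding codes, the "exthflags" bit, and the
       "firstimagerecord" sentinel documented above.

```perl
use strict;
use warnings;

# Documented encoding codes.
my %encoding_name = (1252 => 'Windows-1252', 65001 => 'UTF-8');

# Hypothetical helper: interpret a few keys of the parsed header hash.
sub describe_header_sketch {
    my (%header) = @_;
    my $encoding = $encoding_name{ $header{encoding} };
    warn "Unexpected encoding $header{encoding}\n" unless defined $encoding;
    return {
        encoding   => $encoding,
        has_exth   => ($header{exthflags} & 0x40) ? 1 : 0,    # bit 6
        has_images => $header{firstimagerecord} == 0xffffffff ? 0 : 1,
    };
}

# Sample values standing in for the output of parse_mobi_header().
my $info = describe_header_sketch(
    encoding         => 65001,
    exthflags        => 0x50,
    firstimagerecord => 0xffffffff,
);
print "Encoding: $info->{encoding}\n";
print "EXTH present: $info->{has_exth}\n";
print "Images present: $info->{has_images}\n";
```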

   "parse_mobi_language($languagecode, $regioncode)"
       Takes  the integer values $languagecode and $regioncode unpacked from the Mobipocket header and returns a
       language string mostly (but not entirely) conformant to the IANA language subtag registry codes.

       Croaks if $languagecode is not provided.  If it isn't recognized, a warning is carped and the sub
       returns undef.  Note that 0,0 is a recognized code returning an empty string.

       If $regioncode is not provided or not recognized, it is disregarded and the base language string
       (with no region or script) is returned.

       See %mobilangcodes for an exact map of values.  Note that the bottom two bits of the region code
       appear to be unused (i.e. the values are all multiples of 4).

   "pid_append_checksum($pid)"
       Computes  the  Mobipocket  PID  checksum used as the final two bytes of the PID and appends them to $pid,
       returning the merged string.

       Used by "pid_is_valid($pid)".

   "pid_is_valid($pid)"
       Returns 1 if the PID is a valid Mobipocket/Kindle PID and 0 otherwise.

       This is determined by first ensuring that $pid is exactly ten bytes long, and then  stripping  the  final
       two  bytes  normally  used  as  a  checksum and recomputing them, returning 1 only if they are recomputed
       correctly.

   "pukall_cipher_1(%args)"
       This is a COMPLETELY UNTESTED implementation of the Pukall Cipher 1 algorithm  used  for  encryption  and
       decryption  in  Mobipocket  files.   It  is  a 128-bit stream cipher.  For more information and alternate
       implementations, see <http://membres.lycos.fr/pc1/>.

       Use at your own risk.  Bug reports appreciated.

       Arguments

       •   "key"

           16-byte encryption key.  This must be provided, and must be exactly 16 bytes, or the  procedure  will
           croak.

       •   "input"

           Input data to be either encrypted or decrypted.  If this is not provided, the procedure croaks.

       •   "encrypt" (optional)

           If  set to true, the cipher will be used to encrypt the input data.  If not set, or set to false, the
           cipher will be used to decrypt the input data.

   "record_extradata_size(%args)"
       This checks the end of a text record for extra data that should not be made  part  of  decompression  and
       returns the total size of all data fields.

       Arguments

       •   "dataref"

           A reference to the record data

       •   "extradataflags"

           16 bits worth of flags indicating which extra data fields are present.

   "system_mobidedrm(%args)"
       Runs  python  on  a  copy  of  "MobiDeDrm.py" if it is available (not included with this distribution) to
       downconvert a Mobipocket file.

       Returns the output filename on success, or undef otherwise.

       Arguments

       •   "infile"

           The input filename.  If not specified or invalid, the procedure returns undef.

       •   "outfile"

           The output filename.  If not specified, the program  will  use  a  name  based  on  the  input  file,
           appending  '-nodrm'  to  the  basename  and keeping the extension.  In the special case of Mobipocket
           files ending in '-sm', the '-sm' portion of the basename is  simply  removed,  and  nothing  else  is
           appended.

       •   "pid"

           The PID to use to decrypt the file.  If not specified or invalid, the procedure returns undef.
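
       The default output naming rule described under "outfile" can be sketched as follows.  This is an
       approximation of the documented rule, not the module's code; the function name is hypothetical
       and the handling of names without an extension is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: derive the default output name from the input name.
sub default_nodrm_name_sketch {
    my ($infile) = @_;

    # Split into everything before the final extension, and the extension.
    my ($stem, $ext) = $infile =~ m{\A(.*?)(\.[^./]*)?\z}s;
    $ext = '' unless defined $ext;

    # Strip a trailing '-sm'; otherwise append '-nodrm'.
    $stem .= '-nodrm' unless $stem =~ s/-sm\z//;

    return $stem . $ext;
}

for my $name ('book.prc', 'sample-sm.prc', 'dir/my.book.mobi') {
    print "$name -> ", default_nodrm_name_sketch($name), "\n";
}
```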

   "system_mobigen(%args)"
       Runs  "mobigen"  to  convert  OPF,  HTML, or ePub input into a Mobipocket .prc/.mobi book.  The procedure
       find_mobigen() is called to locate the executable.

       Returns the return value from mobigen, or undef if no filename was specified or the file did  not  exist.
       Also returns undef if mobigen could not be found.

       Arguments

       •   "infile"

           The input filename.  If not specified or invalid, the procedure returns undef.

       •   "outfile"

           The  output  filename.  The mobigen executable will choose its own filename for direct output, but if
           this argument is specified, the output file will be renamed to the specified filename instead.

           If not specified, the default output will be left in place.

       •   "dir"

           The directory in which to place the output file.  The mobigen executable itself will always place its
           output into the current working directory, but if this argument is specified, the output file will be
           moved into the specified directory, creating that directory if necessary.

       •   "compression"

           Compression level from 0-2, where 0 is no compression, 1 is PalmDoc compression, and 2  is  HUFF/CDIC
           compression.  If not specified, defaults to 1 (PalmDoc compression).

   "uncompress_dictionaryhuffman(%args)"
       Uncompresses text compressed with the DictionaryHuffman compression scheme.

       Arguments

       •   "data"

           A scalar containing the compressed data to uncompress.

       •   "huff"

           A hashref pointing to the HUFF record data

       •   "cdics"

           An arrayref pointing to the CDIC record data

       •   "depth"

           The current depth of the huffman tree, currently only used in debugging.

   "unpack_mobi_language($data)"
       Takes as an argument 4 bytes of data.  If fewer than 4 bytes are provided, the sub croaks.  If more
       are provided, a debug warning is issued, but the sub continues.

       In scalar context returns a language string mostly (but not entirely) conformant  to  the  IANA  language
       subtag registry codes.

       In  list  context,  returns  the  language  string, an unknown code integer, a region code integer, and a
       language code integer, with the last three being directly unpacked values.

       See %mobilangcodes for an exact map of values.  Note that the bottom two bits of the region  code  appear
       to  be  unused  (i.e. the values are all multiples of 4).  The unknown code integer appears to be unused,
       and is generally zero.

       The original implementation by Mobipocket may have been via Microsoft's  .NET  CultureInfo  class.   See:
       <http://msdn.microsoft.com/en-us/library/system.globalization.cultureinfo(VS.71).aspx>
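
       The raw unpacking step might look like the sketch below.  The byte layout used here (a 16-bit
       unknown value, then one region byte, then one language byte) is an assumption inferred from the
       order of the list-context return values and from the "languageunknown" header key; the function
       name is hypothetical, and the lookup of the language string in %mobilangcodes is left out.

```perl
use strict;
use warnings;

# Hypothetical helper: unpack the three raw integers from 4 bytes of data.
sub unpack_language_sketch {
    my ($data) = @_;
    die "Need at least 4 bytes of data\n"
        unless defined $data and length($data) >= 4;

    # Assumed layout: 16-bit unknown, 8-bit region code, 8-bit language code.
    my ($unknown, $region, $language) = unpack('n C C', $data);
    return ($unknown, $region, $language);
}

my ($unknown, $region, $language) = unpack_language_sketch("\x00\x00\x08\x09");
printf "unknown=%d region=%d language=%d\n", $unknown, $region, $language;
print "Region code is a multiple of 4\n" if $region % 4 == 0;
```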

BUGS AND LIMITATIONS

       •   Unpacking DRM-protected text isn't supported.  Although infrastructure may be added later to make use
           of  external  helpers  and plugins, direct DRM support will never be added to the main code for legal
           reasons.

       •   Repacking a .prc without fully extracting to OPF and  completely  converting  back  isn't  supported.
           This  will  have  to  be implemented before an interface to perform minor metadata alterations can be
           implemented.

       •   Mobipocket HUFF/CDIC decoding (used mostly on dictionaries) isn't well documented.

       •   Not all Mobipocket data is understood, so a conversion from OPF to Mobipocket .prc back to  OPF  will
           not result in all data being retained.  Patches welcome.

       •   Mobipocket INDX, DATP, FCIS, and FLIS records are not understood and are completely ignored.

       •   Mobipocket  EXTH  subjectcode  records  may not end up attached to the correct subject element if the
           number of subject records differs from the number  of  subjectcode  records.   This  is  because  the
           Mobipocket  format  leaves the EXTH subjectcode records completely unlinked from the subject records,
           and there is no way to detect if a subject with no associated subjectcode comes before a subject with
           an associated subjectcode.

           Fortunately, this should rarely be a problem with real data, as  Mobipocket  Creator  only  allows  a
           single  subject  to  be set, and the only other way to have a subjectcode attached to a subject is to
           manually edit the OPF file and insert an additional dc:Subject element with a BASICCode attribute.

           Mobipocket has indicated that they may move data currently in their custom elements and attributes to
           the standard <meta> elements in a future release, so this problem may become moot then.

AUTHOR

       Zed Pobre <zed@debian.org>

LICENSE AND COPYRIGHT

       Copyright 2008 Zed Pobre

       Licensed to the public under the terms of the GNU GPL, version 2

perl v5.24.1                                       2017-05-22                      EBook::Tools::Mobipocket(3pm)