Provided by: po4a_0.69-1_all

NAME

       po4a-gettextize - convert an original file (and its translation) to a PO file

SYNOPSIS

       po4a-gettextize -f fmt -m master.doc -l XX.doc -p XX.po

       (XX.po is the output, all others are inputs)

DESCRIPTION

       po4a (PO for anything) eases the maintenance of documentation translation using the classical gettext
       tools. The main feature of po4a is that it decouples the translation of content from its document
       structure.  Please refer to the page po4a(7) for a gentle introduction to this project.

       The po4a-gettextize script helps you convert your previously existing translations into a po4a-based
       workflow. This is only to be done once to salvage an existing translation while converting to po4a, not
       on a regular basis after the conversion of your project. This tedious process is explained in detail in
       Section 'Converting a manual translation to po4a' below.

       You must provide both a master file (e.g., the source in English) and an existing translated file (e.g.,
       a previous translation attempt without po4a). If you provide more than one master or translation file,
       they will be used in sequence, but it may be easier to gettextize each page or chapter separately and
       then use msgmerge to merge all produced PO files. As you wish.

       If the master document has non-ASCII characters, the newly generated PO file will be in UTF-8. If the
       master document is completely in ASCII, the generated PO will use the encoding of the translated input
       document.
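
       The encoding rule above can be restated as a small decision function. This is only a sketch of the
       rule as worded in this page, not po4a's actual code; the function name po_charset and its parameters
       are hypothetical.

```python
def po_charset(master_text: str, localized_charset: str) -> str:
    """Pick the charset of the generated PO file (illustrative only)."""
    # A fully ASCII master keeps the encoding of the translated document.
    if master_text.isascii():
        return localized_charset
    # Any non-ASCII character in the master forces UTF-8.
    return "UTF-8"


print(po_charset("Plain ASCII master", "ISO-8859-1"))
print(po_charset("Master with an accent: é", "ISO-8859-1"))
```

       The first call keeps the charset of the localized file, while the second one switches to UTF-8.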

OPTIONS

       -f, --format
           Format  of  the  documentation  you  want  to handle. Use the --help-format option to see the list of
           available formats.

       -m, --master
           File containing the master document to translate. You can use this option multiple times if you  want
           to gettextize multiple documents.

       -M, --master-charset
           Charset of the file containing the document to translate.

       -l, --localized
           File containing the localized (translated) document. If you provided multiple master files, you may
           wish to provide multiple localized files by using this option more than once.

       -L, --localized-charset
           Charset of the file containing the localized document.

       -p, --po
           File where the message catalog should be written. If not given, the message catalog will  be  written
           to the standard output.

       -o, --option
           Extra  option(s)  to  pass  to  the  format  plugin.  See  the  documentation of each plugin for more
           information about the valid options and their meanings. For example, you could pass  '-o  tablecells'
           to the AsciiDoc parser, while the text parser would accept '-o tabs=split'.

       -h, --help
           Show a short help message.

       --help-format
           List the documentation formats understood by po4a.

       -k, --keep-temps
           Keep the temporary master and localized POT files built before merging. This can be useful to
           understand why these files get desynchronized, leading to gettextization problems.

       -V, --version
           Display the version of the script and exit.

       -v, --verbose
           Increase the verbosity of the program.

       -d, --debug
           Output some debugging information.

       --msgid-bugs-address email@address
           Set the report address for msgid bugs. By default, the created POT files have no Report-Msgid-Bugs-To
           fields.

       --copyright-holder string
           Set the copyright holder in the POT header. The default value is "Free Software Foundation, Inc."

       --package-name string
           Set the package name for the POT header. The default is "PACKAGE".

       --package-version string
           Set the package version for the POT header. The default is "VERSION".

   Converting a manual translation to po4a
       po4a-gettextize synchronizes the master and localized files to extract their content into a PO file.  The
       content of the master file gives the msgid while the content of the localized file gives the msgstr. This
       process  is  somewhat fragile: the Nth string of the translated file is supposed to be the translation of
       the Nth string in the original.

       Gettextization works best if you manage to retrieve the exact version of the original document that was
       used for translation. Even so, you may need to fiddle with both master and localized files to align their
       structure if it was changed by the original translator, so working on copies of the files is advised.

       Internally, each po4a parser reports the syntactical type of each extracted string. This is how
       desynchronizations are detected during the gettextization. In the example depicted below, it is very
       unlikely that the 4th string in translation (of type 'chapter') is the translation of the 4th string in
       original (of type 'paragraph'). It is more likely that a new paragraph was added to the original, or that
       two original paragraphs were merged together in the translation.

           Original         Translation

         chapter            chapter
           paragraph          paragraph
           paragraph          paragraph
           paragraph        chapter
         chapter              paragraph
           paragraph          paragraph
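
       The positional pairing and the type check described above can be sketched as follows. This is an
       illustration only, not po4a's implementation: the function name gettextize_sketch, the (type, text)
       tuples and the placeholder strings O1..O6 and T1..T6 are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

Unit = Tuple[str, str]  # (syntactical type, extracted text); hypothetical representation


def gettextize_sketch(
    original: List[Unit], translation: List[Unit]
) -> Tuple[List[Tuple[str, str]], Optional[str]]:
    """Pair the Nth original string with the Nth translated string.

    Returns (pairs, problem). Pairing stops at the first type mismatch,
    and problem is None when both structures line up.
    """
    pairs = []
    for n, ((o_type, o_text), (t_type, t_text)) in enumerate(
        zip(original, translation), start=1
    ):
        if o_type != t_type:
            return pairs, f"string {n}: original is a {o_type}, translation is a {t_type}"
        pairs.append((o_text, t_text))  # o_text plays the msgid, t_text the msgstr
    if len(original) != len(translation):
        return pairs, f"original has {len(original)} strings, translation has {len(translation)}"
    return pairs, None


# The structure from the diagram above, with placeholder texts.
original = [("chapter", "O1"), ("paragraph", "O2"), ("paragraph", "O3"),
            ("paragraph", "O4"), ("chapter", "O5"), ("paragraph", "O6")]
translation = [("chapter", "T1"), ("paragraph", "T2"), ("paragraph", "T3"),
               ("chapter", "T4"), ("paragraph", "T5"), ("paragraph", "T6")]

pairs, problem = gettextize_sketch(original, translation)
print(pairs)
print(problem)
```

       With this input, the first three strings are paired and the mismatch is reported on the 4th string,
       which is where the files would need to be edited before trying again.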

       po4a-gettextize will verbosely diagnose any structure desynchronization. When this happens, you should
       manually edit the files to add fake paragraphs or remove some content here and there until the
       structures of both files actually match. Some tricks are given below to salvage as much of the existing
       translation as possible while doing so.

       If you are lucky enough to have a perfect match in the file structures out of the box, building a correct
       PO file is a matter of seconds. Otherwise, you will soon understand why this process  has  such  an  ugly
       name :) Even so, gettextization often remains faster than translating everything again. I gettextized the
       French  translation  of  the whole Perl documentation in one day despite the many synchronization issues.
       Given the amount of text (2 MB of original text), restarting the translation without first salvaging the
       old translations would have required several months of work. In addition, this grunt work is the price to
       pay  to  get  the  comfort  of  po4a.  Once  converted,  the synchronization between master documents and
       translations will always be fully automatic.

       After a successful gettextization, the produced documents  should  be  manually  checked  for  undetected
       disparities and silent errors, as explained below.

       Hints and tricks for the gettextization process

       The  gettextization stops as soon as a desynchronization is detected. When this happens, you need to edit
       the files as much as needed to re-align the files' structures. po4a-gettextize  is  rather  verbose  when
       things  go  wrong.  It reports the strings that don't match, their positions in the text, and the type of
       each of them. Moreover, the PO file generated so far is dumped as  gettextization.failed.po  for  further
       inspection.

       Here are some tricks to help you in this tedious process and ensure that you salvage as much of the
       previous translation as possible:

       •   Remove all extra content of the translations, such as the section giving credits to the translators.
           They should be added separately to po4a as addenda (see po4a(7)).

       •   When editing the files to align their structures, prefer editing the translation if possible. Indeed,
           if the changes to the original are too intrusive, the old and new versions will not be matched during
           the first po4a run after gettextization (see below). Any unmatched translation will be dumped anyway.
           That  being  said,  you  still  want  to  edit  the  original  document  if  it's too hard to get the
           gettextization to proceed otherwise, even if it means  that  one  paragraph  of  the  translation  is
           dumped. The important thing is to get a first PO file to start with.

       •   Do  not  hesitate  to  kill any original content that would not exist in the translated version. This
           content will be automatically reintroduced  afterward,  when  synchronizing  the  PO  file  with  the
           document.

       •   You should probably inform the original author of any structural change in the translation that seems
           justified. Issues in the original document should be reported to the author. Fixing them in your
           translation only fixes them for a part of the community. Plus, it is impossible to do so when using
           po4a ;) But you probably want to wait until the end of the conversion to po4a before changing the
           original files.

       •   Sometimes, the content of the paragraphs does match, but their types do not. Fixing it is rather
           format-dependent. In POD and man, it often comes from the fact that one of them contains a line
           beginning with a white space while the other does not. In those formats, such paragraphs cannot be
           wrapped and thus become a different type. Just remove the space and you are fine. It may also be a
           typo in the tag name in XML.

           Likewise, two paragraphs may get merged together in  POD  when  the  separating  line  contains  some
           spaces, or when there is no empty line between the =item line and the content of the item.

       •   Sometimes,  the  desynchronization message seems odd because the translation is attached to the wrong
           original paragraph. It is the sign of an undetected issue earlier in  the  process.  Search  for  the
           actual desynchronization point by inspecting the file gettextization.failed.po that was produced, and
           fix the problem where it really is.

       •   Other issues may come from duplicated strings in either the original or translation. Duplicated
           strings are merged in PO files, with two references. This constitutes a difficulty for the
           gettextization algorithm, which is a simple one-to-one pairing between the msgids of both the master
           and the localized files. It is however believed that recent versions of po4a deal properly with
           duplicated strings, so you should report any remaining issue that you may encounter.

   Reviewing files produced by po4a-gettextize
       Any  file  produced  by  po4a-gettextize  should  be  manually  reviewed, even when the script terminates
       successfully. You should skim over the PO file, ensuring that the msgid and msgstr actually match. It  is
       not necessary to ensure that the translation is perfectly correct yet, as all entries are marked as fuzzy
       translations  anyway.  You  only  need  to  check  for  obvious  matching  issues  because  badly matched
       translations will be dumped in subsequent steps while you want to salvage them.

       Fortunately, this step does not require mastering the target languages, as you only want to recognize
       similar elements in each msgid and its corresponding msgstr. As a speaker of French, English, and some
       German myself, I can do this for all European languages at least, even if I cannot say one word of most
       of these languages. I sometimes manage to detect matching issues in non-Latin languages by looking at
       string length, phrase structures (does the number of question marks match?) and other clues, but I
       prefer when someone else can review those languages.
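
       The clues mentioned above (string length and punctuation) can be turned into a rough pre-filter that
       points the reviewer at the entries worth a closer look. This is a hypothetical helper, not something
       shipped with po4a: the function name suspicious, the set of question mark characters and the length
       ratio threshold of 3 are arbitrary assumptions, and a human review remains necessary.

```python
from typing import List

# ASCII, fullwidth (CJK) and Arabic question marks; an assumed, incomplete set.
QUESTION_MARKS = "?\uff1f\u061f"


def count_questions(text: str) -> int:
    """Count the question mark characters found in text."""
    return sum(text.count(mark) for mark in QUESTION_MARKS)


def suspicious(msgid: str, msgstr: str, max_ratio: float = 3.0) -> List[str]:
    """Return the reasons why a msgid/msgstr pair looks badly matched."""
    reasons = []
    if count_questions(msgid) != count_questions(msgstr):
        reasons.append("question marks differ")
    shorter, longer = sorted((len(msgid), len(msgstr)))
    # An empty side, or one side much longer than the other, is a warning sign.
    if longer > 0 and (shorter == 0 or longer / shorter > max_ratio):
        reasons.append("lengths differ a lot")
    return reasons


print(suspicious("Do you want to continue?", "Voulez-vous continuer ?"))
print(suspicious("Do you want to continue?", "Chapitre 2"))
print(suspicious("Name", "Ceci est un très long paragraphe sans rapport."))
```

       An empty list means that neither heuristic fired, which does not prove that the pair is correct. Such
       a filter only narrows down where to look first.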

       If you detect a mismatch, edit the original and translation files as if po4a-gettextize reported an
       error, and try again. Once you have a decent PO file for your previous translation, back it up until you
       get po4a working correctly.

   Running po4a for the first time
       The easiest way to set up po4a is to write a po4a.conf configuration file, and use the integrated po4a
       program (po4a-updatepo and po4a-translate are deprecated). Please check the "CONFIGURATION FILE" Section
       in po4a(1) documentation for more details.

       When po4a runs for the first time, the current version of the master documents will be used to update the
       PO  files containing the old translations that you salvaged through gettextization. This can take quite a
       long time, because many of the msgids from the gettextization do not exactly match the elements of the
       POT file built from the recent master files. This forces gettext to search for the closest  one  using  a
       costly  string  proximity  algorithm.   For  example,  the first run over the Perl documentation's French
       translation (5.5 MB PO file) took about 48 hours (yes, two days) while  the  subsequent  ones  only  take
       seconds.
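
       To give an idea of why this first run is slow, the sketch below looks up the closest old msgid for a
       new one. It uses Python's difflib as a stand-in and does not reproduce the algorithm used by gettext;
       the function name closest_msgid, the cutoff of 0.6 and the sample strings are assumptions made for the
       example.

```python
import difflib
from typing import List, Optional


def closest_msgid(new_msgid: str, old_msgids: List[str], cutoff: float = 0.6) -> Optional[str]:
    """Return the old msgid most similar to new_msgid, or None if nothing is close enough."""
    matches = difflib.get_close_matches(new_msgid, old_msgids, n=1, cutoff=cutoff)
    return matches[0] if matches else None


old_msgids = [
    "The quick brown fox jumps over the lazy dog.",
    "Show a short help message.",
]

print(closest_msgid("Show a short help message and exit.", old_msgids))
print(closest_msgid("zzz", old_msgids))
```

       In this sketch, every new msgid without an exact match is compared against every old msgid, so the
       cost grows with the product of both counts. Once the PO file is synchronized, most msgids match
       exactly and no such search is needed.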

   Moving your translations to production
       After this first run, the PO files are ready to be reviewed by translators. All entries were marked as
       fuzzy in the PO file by po4a-gettextize, forcing their careful review before use. Translators should
       take each entry to verify that the salvaged translation actually matches the current original text,
       update the translation as needed, and remove the fuzzy markers.

       Once enough fuzzy markers are removed, po4a will start generating the translation files on disk, and
       you're ready to move your translation workflow to production. Some projects find it useful to rely on
       Weblate to coordinate between translators and maintainers, but that's beyond po4a's scope.

SEE ALSO

       po4a(1), po4a-normalize(1), po4a-translate(1), po4a-updatepo(1), po4a(7).

AUTHORS

        Denis Barbier <barbier@linuxfr.org>
        Nicolas François <nicolas.francois@centraliens.net>
        Martin Quinson (mquinson#debian.org)

COPYRIGHT AND LICENSE

       Copyright 2002-2022 by SPI, inc.

       This program is free software; you may redistribute it and/or modify it under the terms of GPL  (see  the
       COPYING file).

Po4a Tools                                         2023-01-03                                PO4A-GETTEXTIZE(1p)