Provided by: getdata_0.2-4_all

NAME

       getData - retrieves databases from the Internet

SYNOPSIS

       getData [ --mirrordir <path> ] <list of db names>

       getData --list

DESCRIPTION

       Bioinformatics has the intrinsic problem of bringing biological data to the end user. Astronomers have
       the equivalent problem, and particle physicists, well, they have come up with (first) the web and
       (second) computational grids to address theirs. Debian helps with the programs but will not provide
       such huge and frequently updated datasets, not even in volatile.debian.org. Most bioinformatics
       researchers will not need many such databases, and even more will gladly continue to use public
       services remotely.

       For those who need a set of databases on a regular basis, this script is a first step towards automating
       the burden of downloading the data and updating indices and the like. The world has seen such magic
       before with the Lion Biosciences Prisma tool (http://bib.oxfordjournals.org/cgi/reprint/3/4/389.pdf),
       but how about something simpler (as a start) that at least gets close to what we desire and is Free?
       The aim must be to address the needs of all (most) communities, not only of the bioinformatics world.
       The seed was hence made with databases from astronomy.

       Please contact the Debian-Med community if you consider this program to be almost ready for your needs
       and explain what still needs to be added. Public databases that you managed to integrate with this system
       are also very warmly welcomed as feedback.

OPTIONS

       --help
           Print this help.

       --man
           Present a more detailed description in form of a man page.

       --verbose
           Say one or two words more than required.

       --mirrordir <path>
           Specifies the destination directory. The data will be mirrored to the folder $mirrordir/$dbname/.
           Please be aware that this mirrordir is not stored anywhere. The directory can consequently be moved
           to an arbitrary location at any time, provided the users of the data are informed of the move.
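
           The layout described above can be sketched in shell. This is only an illustration: the helper
           name db_dir is hypothetical and is not part of getData.

```shell
#!/bin/sh
# Hypothetical helper (not part of getData): print the folder in which a
# database would be mirrored, following the $mirrordir/$dbname/ layout.
db_dir() {
    # $1: mirror directory (the value given to --mirrordir)
    # $2: database name as shown by 'getData --list'
    # ${1%/} drops one trailing slash so the result has no double slash.
    printf '%s/%s/\n' "${1%/}" "$2"
}

db_dir /local/databases/mirrored swiss.dat
```

           Since the location is not recorded anywhere, the same path has to be passed again via --mirrordir
           on every later call that should act on the same mirror.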

       --list
           Lists all databases that may be requested to be installed.

       <list of db names>
           Only those databases that are  explicitly  requested  to  be  downloaded  will  be  downloaded.  Such
           databases  may  require  considerable bandwidth, so please make sure you know you are doing the right
           thing.

       --post
           Perform only the unpacking/indexing, but  do  not  retrieve/update  the  databases.  This  option  is
           considered  useful  when adding a new database management system to the system, e.g. after installing
           EMBOSS.

       --source
           Perform only the retrieval/update of the databases, but not the unpacking/indexing. This option may
           be beneficial when the site administrator is aware of current analyses that should not be disturbed
           by the indexing process, while the download from the net can already be started.

       --confd <directory>
           Allows for the specification of a directory in which multiple files can be stored that will  be  read
           by  getData  upon  its  invocation.  These  may  add values to the global variable %toBeMirrored that
           specifies the databases and their download scripts.

       --config <system>
           Prepares the configuration file that would be required for a particular system that deals with the
           database. The configuration is printed to stdout and is expected to be copied manually to the
           proper file or folder. One could imagine this process being automated, though this is not yet
           implemented. Support is currently available for two systems:

           emboss  This  specifies  the  EMBOSS  suite of tools for bioinformatics (www.emboss.org) that is also
                   available as a Debian package.  The configuration for the Uniprot databases  will  allow  the
                   sequence retrieval with the seqret tool.

           dre - ARC Grid Runtime Environment
                   Runtime environments (REs) are a concept of the ARC grid middleware, about which more can be
                   learned at http://www.nordugrid.org. A script is needed to indicate the presence of a
                   runtime environment. The name of that script matters, but getData cannot set it, since it
                   only writes to stdout.

           Unfortunately, the configuration has not yet been modularised. It all needs to happen within the
           getData script itself.

       --remove <list of dbnames>
           This  command  removes  folders  that  store the data. In principle this could be performed manually,
           though some databases may have special requirements pre- or  post-removal,  which  can  be  specified
           individually for every database.

SPECIFICATION OF DATABASES

       Databases for download and their post-processing are specified in two different locations. One is the
       getData script itself, the other is the set of files stored in /etc/getData.d. Either defines elements
       of one rather large hash. The key is the identifier that is also shown by 'getData --list'. The value
       is a reference to another hash, which assigns values to all the properties that a database has for its
       download and post-processing:

       name - a human-readable pretty-printed name or short description that makes clear to the world what this
       database is about.
           A bad example is the mere assignment of "DE405", which few people understand.  A  better  example  is
           "Pfam-A  :  Manually  curated  protein  families and domains, only the seed is presented.". One could
           argue that one should have that field renamed to "description".

       source - shell commands to perform the initial download and subsequent updates
           Commonly the wget tool is used for the download. The little script given here is executed beneath
           the mirrordir directory. One simple example is
           "wget --mirror ftp://ssd.jpl.nasa.gov/pub/eph/export/unix/unxp2[01]*.405". With increasing
           proficiency in using wget, one is tempted to substitute "--mirror" with
           "--recursive --no-host-directories --no-directories --level 1 --no-parent".
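
           As a rough sketch of the longer option set, the following prints the resulting command line
           instead of running it, so nothing is downloaded. The function name wget_flat_cmd is hypothetical;
           the wget options are the ones quoted above.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the wget command for a URL, do not run it.
wget_flat_cmd() {
    # --recursive --level 1     follow links, at most one level deep
    # --no-host-directories     do not create a directory named after the host
    # --no-directories          do not recreate the remote directory tree
    # --no-parent               never ascend above the starting directory
    printf 'wget --recursive --no-host-directories --no-directories --level 1 --no-parent %s\n' "$1"
}

# The URL is single-quoted so the local shell does not expand [01] or *.
wget_flat_cmd 'ftp://ssd.jpl.nasa.gov/pub/eph/export/unix/unxp2[01]*.405'
```

           Note that wget documents --mirror as shorthand for -r -N -l inf --no-remove-listing, so the longer
           form also gives up timestamping (-N) unless that option is added back.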

       post-download - shell commands to perform after the data has been downloaded.
           A simple example (unnecessary when the right flags are passed to wget) is the mere setting of a
           symbolic link:

             "post-download" => "ln -s ssd.jpl.nasa.gov/pub/eph/export/unix/unxp*.405 ."

           Some more effort has been put into TrEMBL for the merging of releases with subsequent updates and the
           indexing for EMBOSS:

             "d=uncompressed; if [ ! -d \$d ]; then mkdir \$d; fi; "
              ."rm -rf \$d/trembl.dat; "
              ."(find ftp.ebi.ac.uk -name '*.dat.gz' | xargs -r zcat ) > \$d/trembl.dat; "
              ."[ -x /usr/bin/dbxflat ] "
              . "&& cd \$d && "
              . "dbxflat -dbresource embl -dbname trembllocal -idformat swiss -filenames=trembl.dat -fields id,acc -auto",

           The dots concatenate strings in Perl, which helps the readability of the code. When writing these
           scripts, please be aware that the line breaks do not separate the individual commands here.
           Semicolons are required.
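
           A minimal sketch of why the semicolons matter, assuming the concatenated Perl string reaches the
           shell as one single line. The backslash before $d in the Perl source only keeps Perl from
           interpolating the variable; the shell sees a plain $d. The scratch directory stands in for the
           mirror directory.

```shell
#!/bin/sh
# Assumption: the concatenated Perl string arrives at the shell as ONE line.
# The semicolons are what split it into three separate commands.
joined='d=uncompressed; mkdir -p "$d"; echo "created $d"'

# Run the line in a scratch directory that stands in for the mirror directory.
workdir=$(mktemp -d)
result=$(cd "$workdir" && sh -c "$joined")
echo "$result"

# Clean up the scratch directory.
rm -rf "$workdir"
```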

       recommends - suggests a series of packages to be present for the use of the database or the performance
       of the indexing.
           This information is not used at the moment, partly to keep this script useful for Linux
           distributions other than Debian.

       getWgetOptions - private command to get wget options
           This is used at download time by makefiles, is not intended to be used interactively, and could be
           removed at any time.

EXAMPLES

       The following will list the identifiers and the descriptions of the first 4 databases that are available
       via getData on your system.

            ./getData --mirrordir=/local/databases/mirrored --list | head -n 4

       To install a particular database, give just its name as an argument. If the installation is to go to a
       directory other than the default, then --mirrordir needs to be set again.

            ./getData swiss.dat

       To remove the database again, give the script a hint with the --remove flag:

            ./getData --remove swiss.dat

       To perform the indexing only and circumvent the download (attention, this is dangerous  since  the  index
       files will look newer than the database is), do

            ./getData --post swiss.dat

       The --config flag is a special case in that it takes a list of extra arguments. Each denotes a
       particular system for which this database may be of interest. Two systems are supported today, emboss
       and dre, both described under --config in OPTIONS.

TODO

       We  now  need  a mechanism with which packages can specify hooks that shall be called upon an update of a
       database. But we cannot assume that every indexing that can be performed because of the  installation  of
       some package is also desired by the user. How to configure this properly is left to be decided.

SEE ALSO

       http://debian-med.alioth.debian.org, http://wiki.debian.org/DebianMed, /etc/getData.conf

AUTHORS

       This script was prepared by Steffen Moeller <moeller@debian.org> and Charles Plessy
       <debian-no-spam@plessy.org> and is distributed under the terms of the GNU General Public License (GPL).
       On Debian systems, this license can be found under /usr/share/common-licenses/GPL.

perl v5.32.0                                       2020-11-29                                         GETDATA(1)