Ubuntu Manpage: map2slim - maps gene associations to a 'slim' ontology

NAME

       map2slim - maps gene associations to a 'slim' ontology

SYNOPSIS

         cd go
         map2slim GO_slims/goslim_generic.obo ontology/gene_ontology.obo gene-associations/gene_association.fb

DESCRIPTION

       Given a GO slim file, and a current ontology (in one or more files), this script will map a gene
       association file (containing annotations to the full GO) to the terms in the GO slim.

       The script can be used either to create a new gene association file containing the most pertinent GO
       slim accessions, or in count mode, in which case it gives distinct gene product counts for each slim
       term.

       The association file format is described here:

       <http://www.geneontology.org/GO.annotation.shtml#file>

ARGUMENTS

       -b bucket slim file
           This argument adds bucket terms to the slim ontology; see BUCKET TERMS below for an explanation.
           The new slim ontology file, including the bucket terms, will be written to bucket slim file.

       -outmap slim mapping file
           This will generate a mapping file for every term in the full ontology, showing both the most
           pertinent slim term and all slim terms that are ancestors. If you use this option, do NOT supply a
           gene-associations file.

       shownames
           (Only works with -outmap)

           Show the names of the terms in the slim mapping file.

       -c  This will force map2slim to give counts for the association file, rather than map it.

       -t  When used in conjunction with -c, this will tab-indent the output so that the indentation reflects
           the tree hierarchy in the slim file.

       -o out file
           This will write the mapped associations (or counts) to the specified file, rather than to the screen.

DOWNLOAD

       This script is part of the go-perl package, available from CPAN:

       <http://search.cpan.org/~cmungall/go-perl/>

       This script will not work unless go-perl is installed.

   MAPPING ALGORITHM
       GO  is a DAG, not a tree. This means that there is often more than one path from a GO term up to the root
       Gene_Ontology node; the path may intersect multiple terms in the slim ontology -  which  means  that  one
       annotation can map to multiple slim terms!

       (Note: you need to view this online to see the image below. If you are not viewing this on the
       http://www.geneontology.org site, you can look at the following URL:
       <http://geneontology.cvs.sourceforge.net/*checkout*/geneontology/go-dev/go-perl/doc/map2slim.gif>)

       In this hypothetical example, blue circles show terms in the GO slim, and yellow circles show terms in
       the full ontology. The full ontology subsumes the slim, so the blue terms are also in the ontology.

         GO ID  MAPS TO SLIM ID        ALL SLIM ANCESTORS
         =====  ===============        ==================
         5      2+3                    2,3,1
         6      3 only                 3,1
         7      4 only                 4,3,1
         8      3 only                 3,1
         9      4 only                 4,3,1
         10     2+3                    2,3,1

       The 2nd column shows the most pertinent ID(s) in the slim, that is, the direct mapping. The 3rd column
       shows all ancestors in the slim.

       Note in particular the mapping of ID 9: although this has two paths to the root through the slim (via 3
       and via 4), 3 is discarded because it is subsumed by 4.

       On  the  other  hand,  10  maps to both 2 and 3 because these are both the first slim ID in the two valid
       paths to the root, and neither subsumes the other.

       The algorithm used to map any one term in the full ontology is:

       find all valid paths from that term through to the root node in the full ontology;

       for each path, take the first slim term encountered in the path;

       discard any redundant slim terms in this set, i.e. slim terms subsumed by other slim terms in the set.
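
       The three steps above can be sketched in Python. This is an illustration only, not the go-perl
       implementation. The PARENTS edge list is an assumption: it is one set of parent links that reproduces
       the table above, and the graph in the original figure may differ. All function names are hypothetical.

```python
# Hypothetical parent links (child -> parents), chosen to reproduce the table
# above. Terms 1-4 are in the slim; terms 5-10 are only in the full ontology.
PARENTS = {
    1: [], 2: [1], 3: [1], 4: [3],
    5: [2, 3], 6: [3], 7: [4], 8: [6], 9: [7, 8], 10: [5],
}
SLIM = {1, 2, 3, 4}


def ancestors(term):
    """Return all proper ancestors of a term in the full ontology."""
    seen, stack = set(), list(PARENTS[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def first_slim_terms(term):
    """Collect the first slim term met on each path from term to the root."""
    if term in SLIM:
        return {term}
    found = set()
    for parent in PARENTS[term]:
        found |= first_slim_terms(parent)
    return found


def map_to_slim(term):
    """Most pertinent slim terms: drop candidates that are ancestors of other candidates."""
    candidates = first_slim_terms(term)
    return {c for c in candidates
            if not any(c in ancestors(other) for other in candidates)}


def all_slim_ancestors(term):
    """Every slim term on any path from term to the root (3rd column of the table)."""
    return (ancestors(term) | {term}) & SLIM


for go_id in range(5, 11):
    print(go_id, sorted(map_to_slim(go_id)), sorted(all_slim_ancestors(go_id)))
```

       With these assumed edges, term 9 first collects the candidates 3 and 4, and 3 is then dropped because
       it is an ancestor of 4; term 10 keeps both 2 and 3 because neither is an ancestor of the other.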

   BUCKET TERMS
       If  you  run  the script with the -b option, bucket terms will be added. For any term P in the slim, if P
       has at least one child C, a bucket term P' will be created under P. This is a catch-all term for  mapping
       any  term  in  the full ontology that is a descendant of P, but NOT a descendant of any child of P in the
       slim ontology.

       For example, the slim generic.0208 has the following terms and structure:

           %DNA binding ; GO:0003677
            %chromatin binding ; GO:0003682
            %transcription factor activity ; GO:0003700, GO:0000130

       After adding bucket terms, it will look like this:

          %DNA binding ; GO:0003677
           %chromatin binding ; GO:0003682
           %transcription factor activity ; GO:0003700 ; synonym:GO:0000130
           @bucket:Z-OTHER-DNA binding ; slim_temp_id:12

       Terms from the full ontology that are other children of DNA binding, such as single-stranded DNA binding
       and its descendants, will map to the bucket term.

       The  bucket  term has a slim ID which is transient and is there only to facilitate the mapping. It should
       not be used externally.

       The bucket term has the prefix Z-OTHER; the Z is a hack to make sure that the term is always listed  last
       in the alphabetic ordering.

       The algorithm is slightly modified if bucket terms are used. The bucket term has an implicit relationship
       to all OTHER siblings not in the slim.
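
       The bucket rule described in this section can be sketched as follows. This is one reading of the prose
       above, not the code map2slim actually runs. The small ontology is an assumption built from the DNA
       binding example, "hypothetical ssDNA subterm" is a made-up placeholder term, and the function names are
       hypothetical.

```python
# Slim structure from the example above: slim parent -> slim children.
SLIM_CHILDREN = {
    "DNA binding": ["chromatin binding", "transcription factor activity"],
    "chromatin binding": [],
    "transcription factor activity": [],
}

# Assumed fragment of the full ontology (child -> parents).
# "hypothetical ssDNA subterm" is a placeholder, not a real GO term.
FULL_PARENTS = {
    "DNA binding": [],
    "chromatin binding": ["DNA binding"],
    "transcription factor activity": ["DNA binding"],
    "single-stranded DNA binding": ["DNA binding"],
    "hypothetical ssDNA subterm": ["single-stranded DNA binding"],
}


def ancestors_or_self(term):
    """Return the term together with all of its ancestors in the full ontology."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(FULL_PARENTS[current])
    return seen


def bucket_name(parent):
    """Name of the catch-all term created under a slim parent."""
    return f"Z-OTHER-{parent}"


def buckets_for(term):
    """Buckets a term falls into: a descendant of slim term P that is not
    under (or equal to) any child of P in the slim."""
    lineage = ancestors_or_self(term)
    hits = []
    for parent, children in SLIM_CHILDREN.items():
        # Buckets exist only for slim terms with at least one slim child,
        # and only proper descendants of the parent are candidates.
        if not children or parent == term or parent not in lineage:
            continue
        if not any(child in lineage for child in children):
            hits.append(bucket_name(parent))
    return sorted(hits)


print(buckets_for("single-stranded DNA binding"))
print(buckets_for("chromatin binding"))
```

       In this sketch single-stranded DNA binding and its placeholder descendant land in the bucket under DNA
       binding, while chromatin binding does not, because it is itself a child of DNA binding in the slim.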

       Do I need bucket terms?

       Nowadays most slim files are entirely or nearly 'complete', that is, there are no gaps. This means that
       the -b option will not produce noticeably different results. For example, you may see a bucket term
       OTHER-binding created with nothing annotated to it, because all the children of binding in the GO are
       represented in the slim file.

       The bucket option is really only necessary for some of the older archived slim files, which are static
       and were generated in a fairly ad hoc way; they tend to accumulate 'gaps' over time (e.g. GO will add a
       new child of binding, but the static slim file won't be up to date, so any gene products annotated to
       this new term will map to OTHER-binding in the slim).

   GRAPH MISMATCHES
       Note that the slim ontology file(s) may be out of date with respect to the current ontology.

       Currently  map2slim  does  not  flag  graph  mismatches  between the slim graph and the graph in the full
       ontology file; it takes the full ontology as being the real graph. However, the  slim  ontology  will  be
       used to format the results if you select -t -c as options.

   OUTPUT
       In normal mode, a standard format gene-association file will be written. The GO ID column (5) will
       contain GO slim IDs. The mapping corresponds to the 2nd column in the table above. Note that the output
       file may contain more lines than the input file. This is because some full GO IDs have more than one
       pertinent slim ID.
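
       The following Python sketch illustrates why the output can have more lines than the input. It assumes
       the usual tab-delimited association layout with the GO ID in column 5 and comment lines starting with
       "!". The IDs, the sample rows, the SLIM_MAP lookup and the function name are all made up for
       illustration; passing comment lines through and dropping unmapped rows are assumptions, not documented
       map2slim behaviour.

```python
# Hypothetical lookup: full-ontology ID -> most pertinent slim IDs.
# These placeholder IDs mirror terms 5 and 6 of the table above.
SLIM_MAP = {
    "FULL:5": ["SLIM:2", "SLIM:3"],
    "FULL:6": ["SLIM:3"],
}

# Made-up, shortened association rows; column 5 (index 4) holds the GO ID.
SAMPLE = [
    "!made-up header line",
    "DB\tGP1\tsym1\t\tFULL:5\tREF:1\tIEA",
    "DB\tGP2\tsym2\t\tFULL:6\tREF:2\tIEA",
]


def map_association_lines(lines, slim_map):
    """Rewrite column 5 with slim IDs, emitting one row per pertinent slim ID."""
    out = []
    for line in lines:
        if line.startswith("!"):
            # Assumption: comment lines are copied through unchanged.
            out.append(line)
            continue
        cols = line.rstrip("\n").split("\t")
        # Assumption: rows whose ID has no slim mapping are dropped.
        for slim_id in slim_map.get(cols[4], []):
            mapped = cols.copy()
            mapped[4] = slim_id
            out.append("\t".join(mapped))
    return out


result = map_association_lines(SAMPLE, SLIM_MAP)
for row in result:
    print(row)
```

       Here the row for GP1 is written twice, once for each of its two pertinent slim IDs, so two input rows
       become three output rows.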

       COUNT MODE

       map2slim can be run with the -c option, which will give the counts of distinct gene products mapped to
       each slim term. The columns are as follows:

       GO Term
           The first column is the GO ID followed by the term name (the term name is provided as it is found in
           both the full GO and slim ontologies; these will usually be the same, but occasionally the slim file
           will lag behind changes in the GO file).

       Count of gene products for which this is the most relevant slim term
           The number of distinct gene products for which this is the most pertinent/direct slim ID. By most
           direct we mean that either the association is made directly to this term, OR the association is made
           to a child of this slim term AND there is no child slim term which the association maps to.

           For most slims, this count will be equivalent to the number of associations directly mapped to this
           slim term. However, some older slim files are "spotty" in that they admit "gaps". For example, if the
           slim has all children of "biological process" with the exception of "behavior", then all annotations
           to "behavior" or its children will be counted here.

           See the example below.

       Count of gene products inferred to be associated with slim term
           The number of distinct gene products which are annotated to any descendant of this slim ID (or
           annotated directly to the slim ID).

       obsoletion flag
       GO ontology

       To take an example; if we use -t and -c like this:

         map2slim -t -c GO_slims/goslim_generic.obo ontology/gene_ontology.obo gene-associations/gene_association.fb

       Then part of the results may look like this:

        GO:0008150 biological_process (biological_process)     34      10025           biological_process
         GO:0007610 behavior (behavior)        632     632             biological_process
         GO:0000004 biological process unknown (biological process unknown)    832     832             biological_process
         GO:0007154 cell communication (cell communication)    333     1701            biological_process
          GO:0008037 cell recognition (cell recognition)       19      19              biological_process

       19 products were mapped to GO:0008037 or one of its children. (GO:0008037 is a leaf node in the slim, so
       the two counts are identical.)

       On the other hand, GO:0008150 only gets 34 products for which this is the most  relevant  term.  This  is
       because  most  annotations  would  map  to  some  child  of  GO:0008150  in  the slim, such as GO:0007610
       (behavior). These 34 gene products are either annotated directly to GO:0008150, or to some child of  this
       term  which is not in the slim. This can point to 'gaps' in the slim. Note that running map2slim with the
       -b option will 'plug' these gaps with artificial filler terms.
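
       The two count columns can be sketched in Python using the small hypothetical graph from the MAPPING
       ALGORITHM table. This is an illustration under stated assumptions, not map2slim's own code: the PARENTS
       edges are one edge set consistent with that table, and the gene products gpA to gpD and their
       annotations are invented.

```python
from collections import defaultdict

# Hypothetical parent links (child -> parents); terms 1-4 form the slim.
PARENTS = {
    1: [], 2: [1], 3: [1], 4: [3],
    5: [2, 3], 6: [3], 7: [4], 8: [6], 9: [7, 8], 10: [5],
}
SLIM = {1, 2, 3, 4}

# Invented (gene product, annotated term) pairs.
ANNOTATIONS = [("gpA", 5), ("gpA", 6), ("gpB", 9), ("gpC", 9), ("gpD", 1)]


def ancestors(term):
    """Return all proper ancestors of a term."""
    seen, stack = set(), list(PARENTS[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def first_slim(term):
    """First slim term met on each path from term to the root."""
    if term in SLIM:
        return {term}
    found = set()
    for parent in PARENTS[term]:
        found |= first_slim(parent)
    return found


def most_pertinent(term):
    """Candidates that are not ancestors of another candidate."""
    candidates = first_slim(term)
    return {c for c in candidates
            if not any(c in ancestors(other) for other in candidates)}


def count_products(annotations):
    """Return {slim term: (direct count, inferred count)} of distinct products."""
    direct, inferred = defaultdict(set), defaultdict(set)
    for product, term in annotations:
        for slim_term in most_pertinent(term):
            direct[slim_term].add(product)
        for slim_term in (ancestors(term) | {term}) & SLIM:
            inferred[slim_term].add(product)
    return {s: (len(direct[s]), len(inferred[s])) for s in sorted(SLIM)}


counts = count_products(ANNOTATIONS)
for slim_term, (n_direct, n_inferred) in counts.items():
    print(slim_term, n_direct, n_inferred)
```

       As in the real output above, the root term 1 has a small direct count and a large inferred count, while
       term 4, a leaf in the slim, has two identical counts.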

AUTHOR

       Chris Mungall BDGP
