Provided by: tigr-glimmer_3.02b-6_amd64 bug

NAME

       long-orfs — Find/Score potential genes in genome-file using the probability model in icm-file

SYNOPSIS

       tigr-long-orgs [genome-file options]

DESCRIPTION

       Program  long-orfs  takes  a  sequence  file  (in FASTA format) and outputs a list of all long "potential
       genes" in it that do not overlap by too much.  By "potential gene" I mean the portion of an orf from  the
       first start codon to the stop codon at the end.

       The first few lines of output specify the settings of various parameters in the program:

       Minimum  gene  length  is  the  length  of  the smallest fragment considered to be a gene.  The length is
       measured from the first base of the start codon to the last base *before* the stop codon.  This value can
       be specified when running the program with the  -g  option.  By default, the  program  now  (April  2003)
       will  compute  an  optimal  length  for  this  parameter,  where "optimal" is the value that produces the
       greatest number of long ORFs, thereby increasing the amount of data used for training.

       Minimum overlap length is a lower bound on the number of bases overlap between 2 genes that is considered
       a problem.  Overlaps shorter than this are ignored.

       Minimum overlap percent is another lower bound on the number  of  bases  overlap  that  is  considered  a
       problem.  Overlaps shorter than this percentage of *both* genes are ignored.

       The next portion of the output is a list of potential genes:

       Column  1  is  an ID number for reference purposes.  It is assigned sequentially starting with  1  to all
       long potential genes.  If overlapping genes are eliminated, gaps in  the  numbers  will  occur.   The  ID
       prefix is specified in the constant  ID_PREFIX .

       Column 2 is the position of the first base of the first start codon in the orf.  Currently I use atg, and
       gtg as start codons.  This is easily changed in the function  Is_Start () .

       Column  3  is  the position of the last base *before* the stop codon.  Stop codons are taa, tag, and tga.
       Note that for orfs in the reverse reading frames have their start position higher than the end  position.
       The  order  in  which  orfs  are  listed is in increasing order by Max {OrfStart, End}, i.e., the highest
       numbered position in the orf, except for orfs that "wrap around" the end of the sequence.

       When two genes with ID numbers overlap by at least a sufficient amount (as  determined  by  Min_Olap  and
       Min_Olap_Percent ), they are eliminated and do not appear in the output.

       The  final  output  of the program (sent to the standard error file so it does not show up when output is
       redirected to a file) is the length of the longest orf found.

       Specifying Different Start and Stop Codons:

       To specify different sets of start  and  stop  codons,  modify  the  file  gene.h  .   Specifically,  the
       functions:

       Is_Forward_Start     Is_Reverse_Start     Is_Start Is_Forward_Stop      Is_Reverse_Stop      Is_Stop

       are used to determine what is used for start and stop codons.

       Is_Start   and   Is_Stop   do simple string comparisons to specify which patterns are used.  To add a new
       pattern, just add the comparison for it.  To remove a pattern, comment out or delete the  comparison  for
       it.

       The  other  four  functions  use a bit comparison to determine start and stop patterns.  They represent a
       codon as a 12-bit pattern, with 4 bits for each base, one bit for each possible value of the bases, T, G,
       C or A.  Thus the bit pattern  0010 0101 1100  represents the base pattern  [C] [A or G] [G  or  T].   By
       doing  bit operations (& | ~) and comparisons, more complicated patterns involving ambiguous reads can be
       tested efficiently.  Simple patterns can be tested as in the current code.

       For example, to insert an additional start codon of CAT requires 3 changes:  1.  The  line  ||  (Codon  &
       0x218)  == Codon should be inserted into  Is_Forward_Start , since 0x218 = 0010 0001 1000 represents CAT.
       2. The line || (Codon & 0x184) == Codon should be inserted into  Is_Reverse_Start , since  0x184  =  0001
       1000  0100  represents  ATG,  which  is the reverse-complement of CAT.  Alternately, the #define constant
       ATG_MASK  could be used.  3. The line || strncmp (S, "cat", 3) == 0 should be inserted into  Is_Start .

OPTIONS

       -g n      Set minimum gene length to n.  Default is to compute an  optimal  value  automatically.   Don't
                 change this unless you know what you're doing.

       -l        Regard  the  genome as linear (not circular), i.e., do not allow genes to "wrap around" the end
                 of the genome.  This option works on both  glimmer and long-orfs .  The default behavior is  to
                 regard the genome as circular.

       -o n      Set maximum overlap length to n.  Overlaps shorter than this are permitted.  (Default is 0 bp.)

       -p n      Set  maximum overlap percentage to n%.  Overlaps shorter than this percentage of *both* strings
                 are ignored.  (Default is 10%.)

SEE ALSO

       tigr-glimmer3 (1), tigr-adjust (1), tigr-anomaly   (1), tigr-build-icm (1), tigr-check  (1),  tigr-codon-
       usage  (1),  tigr-compare-lists  (1),  tigr-extract  (1),  tigr-generate (1), tigr-get-len (1), tigr-get-
       putative (1),

       http://www.tigr.org/software/glimmer/

       Please see the readme in /usr/share/doc/tigr-glimmer for a description on how to use Glimmer3.

AUTHOR

       This manual page was quickly copied from the glimmer web site by Steffen Moeller  moeller@debian.org  for
       the Debian system.

                                                                                                    LONG-ORFS(1)