Provided by: sim4_0.0.20121010-8_amd64

NAME

       sim4 - align an expressed DNA sequence with a genomic sequence

SYNOPSIS

       sim4 seqfile1 seqfile2 {[WXKCRDHAPNBS]=value}

DESCRIPTION

       sim4  is  a similarity-based tool for aligning an expressed DNA sequence (EST, cDNA, mRNA) with a genomic
       sequence for the gene. It also detects end matches when the two input sequences overlap at one end (i.e.,
       the start of one sequence overlaps the end of the other). If seqfile2 is a  database  of  sequences,  the
       sequence in seqfile1 will be aligned with each of the sequences in seqfile2.

       sim4  employs a blast-based technique to first determine the basic matching blocks representing the "exon
       cores". In this first stage, it detects all possible exact matches of W-mers (i.e., DNA words of size  W)
       between the two sequences and extends them to maximal scoring gap-free segments. In the second stage, the
       exon  cores  are extended into the adjacent as-yet-unmatched fragments using greedy alignment algorithms,
       and heuristics are used to favor configurations that conform to the splice-site recognition signals  (GT-
       AG,  CT-AC).  If  necessary,  the  process  is  repeated  with less stringent parameters on the unmatched
       fragments.
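
       The first stage described above can be illustrated with a minimal Python sketch. It is not sim4's
       actual implementation: the function names, the match/mismatch scores, and the exact termination rule
       are assumptions chosen for illustration only. It finds all exact W-mer matches between two sequences
       and extends one hit in both directions without gaps, stopping once the score falls too far below the
       best score seen.

```python
from collections import defaultdict


def find_word_hits(cdna, genomic, w=12):
    """Return all (i, j) with cdna[i:i+w] == genomic[j:j+w] (exact W-mer hits)."""
    index = defaultdict(list)
    for j in range(len(genomic) - w + 1):
        index[genomic[j:j + w]].append(j)
    hits = []
    for i in range(len(cdna) - w + 1):
        for j in index.get(cdna[i:i + w], ()):
            hits.append((i, j))
    return hits


def extend_hit(cdna, genomic, i, j, w, match=1, mismatch=-5, x_drop=12):
    """Extend a W-mer hit in both directions without gaps.

    The scores and the x_drop rule are illustrative assumptions, not sim4's values.
    Returns (start_in_cdna, start_in_genomic, length, score) of the best segment.
    """
    score = best = w * match

    # Extend to the right, remembering the length that gave the best score.
    best_right = 0
    k = 0
    while i + w + k < len(cdna) and j + w + k < len(genomic):
        score += match if cdna[i + w + k] == genomic[j + w + k] else mismatch
        k += 1
        if score > best:
            best, best_right = score, k
        elif best - score > x_drop:
            break

    # Extend to the left, starting again from the best score reached so far.
    score = best
    best_left = 0
    k = 0
    while i - k - 1 >= 0 and j - k - 1 >= 0:
        score += match if cdna[i - k - 1] == genomic[j - k - 1] else mismatch
        k += 1
        if score > best:
            best, best_left = score, k
        elif best - score > x_drop:
            break

    return (i - best_left, j - best_left, w + best_left + best_right, best)
```

       With a toy word size of 4, the hit at (0, 2) between "ACGTACGTTTGA" and "GGACGTACGTTTGACC" extends to
       a gap-free segment covering the whole shorter sequence.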

       By default, sim4 searches both strands and reports the best match, measured by  the  number  of  matching
       nucleotides  found  in the alignment. The R command line option can be used to restrict the search to one
       orientation (strand) only.
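
       As a rough illustration of what searching the "-" strand involves, the Python sketch below computes a
       reverse complement and maps the R values listed under OPTIONS to the strands searched. The function
       names are hypothetical and the code is not taken from sim4.

```python
# Complement table for the four unambiguous bases; other characters
# (for example N or X) are left unchanged by str.translate.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def strands_to_search(r=2):
    """Map the R option to strands: R=0 "+" only, R=1 "-" only, R=2 both."""
    if r == 0:
        return ["+"]
    if r == 1:
        return ["-"]
    if r == 2:
        return ["+", "-"]
    raise ValueError("R must be 0, 1 or 2")
```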

       Currently, five major alignment display options are supported, controlled by the  A  option.  By  default
       (A=0), only the endpoints, overall similarity, and orientation of the introns are reported. An arrow sign
       (`->' or `<-') indicates the orientation of the intron (`+' or `-' strand), when the signals flanking the
       intron have three or more position matches with either the GT-AG or the CT-AC splice recognition signals.
       When  the same number of matches is found for both orientations, the intron is reported as ambiguous, and
       represented by `--'. The sign `==' marks the absence from the alignment of a cDNA  fragment  starting  at
       that  position.  Alternative  formats  (lav-block  format,  text,  PipMaker-type `exons file', or certain
       combinations of these options) can be requested by specifying a different value for A.
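
       The orientation rule above can be sketched in Python as follows. This is an illustration under stated
       assumptions, not sim4's code: the function names are hypothetical, the intron is passed as a plain
       string of at least four bases, and the case where neither signal reaches three matching positions is
       reported as ambiguous, which the text above does not specify.

```python
def count_matches(observed, signal):
    """Count positions at which two equal-length strings agree."""
    return sum(1 for a, b in zip(observed, signal) if a == b)


def intron_orientation(intron):
    """Return '->', '<-' or '--' from the two bases at each end of the intron."""
    intron = intron.upper()
    flank = intron[:2] + intron[-2:]      # e.g. "GT" + "AG"
    forward = count_matches(flank, "GTAG")
    reverse = count_matches(flank, "CTAC")
    if forward == reverse:
        return "--"                       # same number of matches: ambiguous
    if forward > reverse and forward >= 3:
        return "->"
    if reverse > forward and reverse >= 3:
        return "<-"
    return "--"                           # assumption: too few matches either way
```

       For example, an intron beginning with GT and ending with AG is reported as `->', while one beginning
       with GT and ending with AC matches three positions of each signal and is reported as `--'.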

       If the P option is specified with a non-zero value, sim4 will remove any  3'-end  poly-A  tails  that  it
       detects in the alignment.

       Occasionally, sim4 may miss an internal exon that is surrounded by very large introns, typically longer
       than 100 Kb. When this is suspected, the H option can be used to reset the exons' weight to compensate
       for the intron gap penalty.

       Ambiguity codes are allowed in sequence data by default, but sim4 treats them all alike, without
       distinguishing among them. If desired, the B command line option can restrict the set of acceptable
       characters to A, C, G, T, N, and X only.
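
       The two character sets can be checked ahead of time with a short Python sketch like the one below. The
       sets are those listed for the B option under OPTIONS; the function name is hypothetical, and folding
       lowercase input to uppercase is an assumption rather than documented sim4 behavior.

```python
# Character sets as listed for the B option.
AMBIGUITY_SET = set("ABCDGHKMNRSTVWXY")   # B=1 (default)
STRICT_SET = set("ACGTNX")                # B=0


def invalid_characters(seq, b=1):
    """Return the sorted list of characters in seq not allowed for this B value."""
    allowed = AMBIGUITY_SET if b else STRICT_SET
    # Assumption: case is ignored when checking characters.
    return sorted({c for c in seq.upper() if c not in allowed})
```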

       sim4 compares the lengths of the input sequences to  distinguish  between  the  cDNA  (`short')  and  the
       genomic  (`long')  components  in  the  comparison. When seqfile2 contains a collection of sequences, the
       first entry in the file will be used to determine the type of this and all subsequent comparisons.

       In the description below, the term MSP denotes a Maximal Segment Pair, that is, a pair of highly  similar
       fragments  in  the  two  sequences,  obtained during the blast-like procedure by extending a W-mer hit by
       matches and perhaps a few mismatches.

OPTIONS

       The algorithm parameters (included in the first two sections below) have already been tuned  and  do  not
       normally require adjustment by the user.

       Parameters internal to the blast-like procedure:

       W      Sets  the  word  size for blast hits in the first stage of the algorithm. The default value is 12,
              but it can be increased for a more stringent search or decreased to find weaker matches.

       X      Controls the limits for terminating word extensions in the blast-like stage of the algorithm.  The
              default value is 12.

       K      Sets  the  threshold  for the MSP scores when determining the basic `exon cores', during the first
              stage of the algorithm. (If this option is not specified,  the  threshold  is  computed  from  the
              lengths  of  the  sequences,  using  statistical  criteria.) For example, a good value for genomic
              sequences in the range of a few hundred Kb is 16. To avoid spurious  matches,  however,  a  larger
              value may be needed for longer sequences.

       C      Sets  the  threshold  for  the MSP scores when aligning the as-yet-unmatched fragments, during the
              second stage of the algorithm. By default, the smaller of the constant 12 and  a  statistics-based
              threshold is chosen.

       Additional algorithm parameters:

       D      Sets the bound for the "diagonal" distance between consecutive MSPs in an exon. The default value
              is 10.
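
       One way to picture the D bound is the Python sketch below. It rests on assumptions that are not stated
       in this page: an MSP is represented as a (cDNA start, genomic start, length) tuple, its diagonal is
       the genomic start minus the cDNA start, and consecutive MSPs whose diagonals differ by at most D are
       placed in the same exon. The function name is hypothetical and sim4's actual procedure may differ.

```python
def group_by_diagonal(msps, d=10):
    """Group MSPs, ordered by cDNA start, into runs with diagonal shift <= d."""
    groups = []
    for msp in sorted(msps):
        diagonal = msp[1] - msp[0]
        if groups:
            previous = groups[-1][-1]
            previous_diagonal = previous[1] - previous[0]
            if abs(diagonal - previous_diagonal) <= d:
                groups[-1].append(msp)    # small shift: same candidate exon
                continue
        groups.append([msp])              # large shift: start a new group
    return groups
```

       In this picture, a shift of a few positions between neighboring MSPs corresponds to a small gap inside
       an exon, while a shift of thousands of positions corresponds to an intron.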

       Context parameters:

       R      Specifies the direction of the search. If R=0, only the "+" (direct) strand is searched.  If  R=1,
              only the "-" (reverse complement) matches are sought. By default (R=2), sim4 searches both strands
              and reports the best match, measured by the number of matching pairs in the alignment.

       A      Specifies  the  format  of the output: exon endpoints only (A=0), exon endpoints and boundaries of
              the coding region (CDS) in the  genomic  sequence,  when  specified  for  the  input  mRNA  (A=5),
              alignment  text  (A=1),  alignment in lav-block format (A=2), or both exon endpoints and alignment
              text (A=3 or A=4). If a reverse complement match is found, A=0,1,2,3,5 will give its  position  in
              the  "+"  strand  of the longer sequence and the "-" strand of the shorter sequence. A=4 will give
              its position in the "+" strand of the first sequence (seqfile1) and the "-" strand of  the  second
              sequence (seqfile2), regardless of which sequence is longer. The A=5 option can be used with the S
              command  line  option  to specify the endpoints of the CDS in the mRNA, and produces output in the
              `exons file' format required by PipMaker.

       P      Specifies whether or not the program should report the fragment of the alignment containing the
              poly-A tail (if found). By default (P=0) the alignment is displayed as computed; specifying a
              non-zero value makes sim4 remove the poly-A tails. When this feature is enabled, all
              display options produce additional lav alignment headers.

       H      Resets the MSPs' weight to compensate for very large introns. The default value is H=500, but some
              introns larger than 100 Kb may require higher values, typically between 1000 and 2500. This option
              should be used cautiously, generally in cases where an unmatched internal portion of the cDNA may
              disguise a missed exon within a very large intron. It is not recommended for ESTs, where it may
              produce spurious exons.

       N      Requests an additional search for small marginal exons (N=1) guided by the splice-site recognition
              signals. This option can be used when a high accuracy match is expected. The default value is N=0,
              specifying no additional search.

       B      Controls  the  set  of  characters  allowed  in  the  input sequences. By default (B=1), ambiguity
              characters (ABCDGHKMNRSTVWXY) are allowed. By specifying B=0, the set of acceptable characters  is
              restricted to A,C,G,T,N and X only.

       S      Allows the user to specify the endpoints of the CDS in the input mRNA, with the syntax S=n1..n2.
              This option is only available with the A=5 flag, which produces output in the format required by
              PipMaker. Alternatively, the CDS coordinates can be given as CDS=n1..n2 in the FastA header of the
              mRNA sequence. When the second file is an mRNA database, the command line specification for the
              CDS applies only to the first sequence in the file.
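
       The CDS=n1..n2 header construct can be read with a short Python sketch such as the one below, for
       example to build the equivalent S= argument. The function names and the sample header are hypothetical;
       only the CDS=n1..n2 and S=n1..n2 syntax comes from this page.

```python
import re

# Matches the CDS=n1..n2 construct anywhere in a FastA header line.
_CDS_PATTERN = re.compile(r"CDS=(\d+)\.\.(\d+)")


def cds_from_header(header):
    """Return (n1, n2) from a FastA header, or None if no CDS= construct is present."""
    found = _CDS_PATTERN.search(header)
    if found is None:
        return None
    return int(found.group(1)), int(found.group(2))


def s_option(cds):
    """Format a (n1, n2) pair as an S=n1..n2 command line argument."""
    return "S={}..{}".format(cds[0], cds[1])
```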

EXAMPLES

       sim4 est genomic

       sim4 genomic estdb

       sim4 est genomic A=1 P=1

       sim4 est1 est2 R=1

       sim4 mRNA genomic A=5 S=123..1020

       sim4 mouse_cDNA human_genomic K=15 C=11 A=3 W=10

AUTHORS

       sim4 was written by Liliana Florea <florea@gwu.edu> and Scott Schwartz.

       This  manual  page  was  written  by  Nelson  A.  de  Oliveira  <naoliv@gmail.com>,  based  on the online
       documentation at http://globin.cse.psu.edu/html/docs/sim4.html, for the Debian project (but may  be  used
       by others).

                                         Wed, 03 Aug 2005 18:40:58 -0300                                 SIM4(1)