Provided by: ncbi-tools-bin_6.1.20170106+dfsg2-2_amd64

NAME

       spidey - align mRNA sequences to a genome

SYNOPSIS

       spidey   [-]   [-F N]   [-G]  [-L N]  [-M filename]  [-N filename]  [-R filename]  [-S p/m]  [-T N]  [-X]
       [-a filename] [-c N] [-d] [-e X] [-f X] [-g X] -i filename [-j] [-k filename] [-l N]  -m filename  [-n N]
       [-o str] [-p N] [-r c/d/m/p/v] [-s] [-t filename] [-u] [-w]

DESCRIPTION

       spidey is a tool for aligning one or more mRNA sequences to a given genomic sequence.  spidey was written
       with  two  main goals in mind: find good alignments regardless of intron size; and avoid getting confused
       by nearby pseudogenes and paralogs.  Toward the first goal, spidey uses BLAST and DotView to find its
       alignments; since both are local alignment tools, spidey does not
       intrinsically favor shorter or longer introns and has no maximum intron size.  To avoid mistakenly
       including exons from paralogs and pseudogenes, spidey first defines windows on the genomic  sequence  and
       then  performs  the  mRNA-to-genomic  alignment  separately  within  each window.  Because of the way the
       windows are constructed, neighboring paralogs or pseudogenes should be in separate windows and should not
       be included in the final spliced alignment.

   Initial alignments and construction of genomic windows
       spidey takes as input a single genomic sequence and a set of mRNA accessions  or  FASTA  sequences.   All
       processing  is  done  one  mRNA  sequence  at  a  time.  The first step for each mRNA sequence is a high-
       stringency BLAST against the genomic sequence.  The resulting hits  are  analyzed  to  find  the  genomic
       windows.

       The  BLAST  alignments  are  sorted by score and then assigned into windows by a recursive function which
       takes the first alignment and then goes  down  the  alignment  list  to  find  all  alignments  that  are
       consistent  with the first (same strand of mRNA, both the mRNA and genomic coordinates are nonoverlapping
       and linearly consistent).  On subsequent passes, the remaining alignments are examined and are  put  into
       their  own  nonoverlapping, consistent windows, until no alignments are left.  Depending on how many gene
       models are desired, the top n windows are chosen to go on to the next step and the others are deleted.
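
       The window assignment described above can be sketched roughly as follows.  This is an illustration
       only, not spidey's actual code: the Hit type and the function names are hypothetical, a loop replaces
       the recursion, each hit is checked against every member of the window rather than only the first, and
       windows are ranked by the score of their seed alignment, which is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One BLAST hit; coordinates are 0-based, inclusive (hypothetical type)."""
    m_start: int   # start on the mRNA
    m_stop: int    # stop on the mRNA
    g_start: int   # start on the genomic sequence
    g_stop: int    # stop on the genomic sequence
    strand: str    # '+' or '-'
    score: float


def consistent(a: Hit, b: Hit) -> bool:
    """True if a and b share a strand, do not overlap, and are collinear."""
    if a.strand != b.strand:
        return False
    first, second = (a, b) if a.m_start <= b.m_start else (b, a)
    if first.m_stop >= second.m_start:
        return False  # the two hits overlap on the mRNA
    if a.strand == '+':
        return first.g_stop < second.g_start
    return second.g_stop < first.g_start  # minus strand: genomic order reversed


def build_windows(hits, n):
    """Group hits into windows, best-scoring seed first, and keep the top n."""
    remaining = sorted(hits, key=lambda h: h.score, reverse=True)
    windows = []
    while remaining:
        window, leftover = [remaining[0]], []
        for hit in remaining[1:]:
            if all(consistent(hit, member) for member in window):
                window.append(hit)
            else:
                leftover.append(hit)
        windows.append(window)
        remaining = leftover  # next pass handles the hits that did not fit
    return windows[:n]
```

       With two copies of a two-exon gene on the same genomic sequence, the hits from each copy end up in
       separate windows, because hits from the second copy overlap the first copy's hits on the mRNA.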

   Aligning in each window
       Once the genomic windows are constructed, the initial BLAST alignments are freed and another BLAST search
       is performed, this time with the entire mRNA against the genomic region defined by the window, and  at  a
       lower  stringency  than  the  initial  search.   spidey  then uses a greedy algorithm to generate a high-
       scoring, nonoverlapping subset of the alignments from the second BLAST search.  This  consistent  set  is
       checked to make sure that the alignments cover the entire mRNA sequence.  When gaps
       are found between the alignments, the corresponding region of genomic sequence is searched against the
       missing mRNA, first using a very low-stringency BLAST and, if that BLAST fails to find a hit, using
       DotView functions to locate the alignment.  When gaps are found at the ends of the alignments, the BLAST
       and DotView searches are allowed to extend past the boundaries of the window.  If the 3' end of
       the mRNA does not align completely, it is first examined for the presence of a poly(A) tail.  No attempt
       is made to align the portion of the mRNA that appears to be a poly(A) tail.  Sometimes a poly(A)
       tail does align to the genomic sequence; such cases are noted because they indicate the possibility
       of a pseudogene.
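
       As a rough illustration of the greedy selection and the coverage check, the sketch below picks
       alignments by descending score and then lists the mRNA intervals left uncovered.  The tuple layout
       and function names are hypothetical, and the strict no-overlap rule is a simplification; spidey's
       own criteria may differ.

```python
def greedy_subset(alignments):
    """Pick a high-scoring subset whose mRNA intervals do not overlap.

    Each alignment is a (m_start, m_stop, score) tuple with 0-based,
    inclusive mRNA coordinates (an illustrative representation).
    """
    chosen = []
    for aln in sorted(alignments, key=lambda a: a[2], reverse=True):
        # Keep the alignment only if it is disjoint from everything chosen.
        if all(aln[1] < c[0] or aln[0] > c[1] for c in chosen):
            chosen.append(aln)
    return sorted(chosen)


def uncovered(chosen, mrna_len):
    """Return the mRNA intervals not covered by the chosen alignments."""
    gaps, pos = [], 0
    for m_start, m_stop, _ in chosen:
        if m_start > pos:
            gaps.append((pos, m_start - 1))
        pos = max(pos, m_stop + 1)
    if pos < mrna_len:
        gaps.append((pos, mrna_len - 1))  # unaligned 3' end
    return gaps
```

       The intervals returned by uncovered() correspond to the regions that would be handed to the
       lower-stringency searches.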

       Now that the mRNA is completely covered by the set of alignments, the boundaries of the alignments (there
       should  be  one alignment per exon now) are adjusted so that the alignments abut each other precisely and
       so that they are adjacent to good splice donor and acceptor sites.  Most commonly,  two  adjacent  exons'
       alignments  overlap  by  as much as 20 or 30 base pairs on the mRNA sequence.  The true exon boundary may
       lie anywhere within this overlap, or (as we have seen empirically) even a  few  base  pairs  outside  the
       overlap.  To position the exon boundaries, the overlap plus a few base pairs on each side is examined for
       splice donor sites, using functions that have different splice matrices depending on the organism chosen.
       The top few splice donor sites (by score) are then evaluated by how much each would change the original
       alignment boundaries.  The site that changes the boundaries the least is chosen and is then checked for
       the presence of an acceptor site.  The alignments are truncated or extended as necessary so that they
       terminate at the splice donor site and so that they do not overlap.
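
       A minimal sketch of the boundary choice, assuming the candidate donor sites have already been
       scored.  The function name, the top_k parameter, the measure of boundary change (bases trimmed or
       added on both alignments), and the tie-break on score are all assumptions made for illustration;
       the acceptor check is omitted.

```python
def pick_donor(candidates, up_stop, down_start, top_k=3):
    """Choose an exon boundary from scored candidate donor sites.

    candidates: (position, score) pairs, where position is the last mRNA
        base assigned to the upstream exon.
    up_stop: last mRNA base of the upstream alignment before adjustment.
    down_start: first mRNA base of the downstream alignment.
    """
    best = sorted(candidates, key=lambda c: c[1], reverse=True)[:top_k]

    def change(site):
        pos = site[0]
        # Bases trimmed from or added to each alignment to abut at pos.
        return abs(up_stop - pos) + abs(down_start - (pos + 1))

    # Smallest boundary change wins; ties go to the higher-scoring site.
    pos, _ = min(best, key=lambda s: (change(s), -s[1]))
    # The upstream exon now ends at pos and the next exon starts at pos + 1.
    return pos, pos + 1
```

       With this measure, every site inside the overlap costs the same, so the score decides among them,
       while sites outside the overlap are penalized by their distance from it.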

   Final result
       Each window is examined to determine the percent identity per exon, the number of gaps per exon, the
       overall percent identity, the percent coverage of the mRNA,  presence  of  an  aligning  or  non-aligning
       poly(A) tail, number of splice donor sites and the presence or absence of splice donor and acceptor sites
       for each exon, and the occurrence of an mRNA that has a 5' or 3' end (or both) that does not align to the
       genomic sequence.  If the overall percent identity and percent length coverage are above the user-defined
       cutoffs,  a  summary  report  is  printed,  and,  if  requested,  a text alignment showing identities and
       mismatches is also printed.
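
       The identity and coverage figures could be computed along these lines.  This is a sketch under
       stated assumptions: exons are given as pairs of equal-length aligned strings with '-' for gaps,
       identity is measured over alignment columns, gaps are counted per column rather than per opening,
       and a value equal to the cutoff passes.  The function names are hypothetical.

```python
def exon_identity(mrna_row, genomic_row):
    """Percent identity and number of gap columns for one exon."""
    pairs = list(zip(mrna_row, genomic_row))
    matches = sum(a == b and a != '-' for a, b in pairs)
    gaps = sum(a == '-' or b == '-' for a, b in pairs)
    return 100.0 * matches / len(pairs), gaps


def summarize(exons, mrna_len, id_cutoff, len_cutoff):
    """exons: list of (mrna_row, genomic_row); cutoffs are percentages."""
    columns = sum(len(m) for m, _ in exons)
    matches = sum(a == b and a != '-' for m, g in exons for a, b in zip(m, g))
    # mRNA bases that take part in any exon alignment.
    aligned_bases = sum(len(m) - m.count('-') for m, _ in exons)
    identity = 100.0 * matches / columns
    coverage = 100.0 * aligned_bases / mrna_len
    return {
        'identity': identity,
        'coverage': coverage,
        # Report only when both figures reach the cutoffs (-c and -l).
        'report': identity >= id_cutoff and coverage >= len_cutoff,
    }
```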

   Interspecies alignments
       spidey is capable of performing interspecies alignments.  The major difference in interspecies alignments
       is that the mRNA-genomic identity will not be close to 100% as it is in  intraspecies  alignments;  also,
       the alignments have numerous and lengthy gaps.  If spidey is used in its normal mode to do interspecies
       alignments, it produces gene models with a large number of short exons.  When the interspecies flag
       (-s) is set, spidey uses different BLAST parameters that allow more and longer gaps and penalize
       mismatches less heavily.  As a result, the alignments for the exons are much longer and more closely
       approximate the actual gene structure.

   Extracting CDS alignments
       When spidey is run in network-aware mode or when ASN.1 files are used for the mRNA records, it can
       extract a CDS alignment from an mRNA alignment and print the CDS information as well.  Since the
       CDS alignment is just a subset of the mRNA alignment, it is relatively straightforward  to  truncate  the
       exon  alignments as necessary and to generate a CDS alignment.  Furthermore, the untranslated regions are
       now defined, so the percent identity for the 5' and 3' untranslated regions is also calculated.
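
       The truncation step can be illustrated with a small sketch.  It assumes ungapped, plus-strand exon
       alignments and a hypothetical tuple layout; it is not taken from spidey's source.

```python
def clip_to_cds(exons, cds_start, cds_stop):
    """Truncate exon alignments to the coding region.

    exons: (m_start, m_stop, g_start) tuples for ungapped plus-strand exons,
        0-based and inclusive on the mRNA; g_start is the genomic position
        aligned to m_start.
    cds_start, cds_stop: CDS boundaries on the mRNA, 0-based and inclusive.
    """
    clipped = []
    for m_start, m_stop, g_start in exons:
        new_start = max(m_start, cds_start)
        new_stop = min(m_stop, cds_stop)
        if new_start > new_stop:
            continue  # exon lies entirely in an untranslated region
        # Shift the genomic start by the number of bases trimmed on the left.
        clipped.append((new_start, new_stop, g_start + (new_start - m_start)))
    return clipped
```

       Whatever is trimmed away on either side of the CDS makes up the 5' and 3' untranslated regions.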

OPTIONS

       A summary of options is included below.

       -      Print usage message.

       -F N   Start of genomic interval desired (from; 0-based).

       -G     Input file is a GI list.

       -L N   The extra-large intron size to use (default = 220000).

       -M filename
              File with donor splice matrix.

       -N filename
              File with acceptor splice matrix.

       -R filename
              File (including path) to repeat blast database for filtering.

       -S p/m Restrict to plus (p) or minus (m) strand of genomic sequence.

       -T N   Stop of genomic interval desired (to; 0-based).

       -X     Use extra-large intron sizes (increases the limit for initial and terminal introns from  100kb  to
              240kb and for all others from 35kb to 120kb); may result in significantly longer compute times.

       -a filename
              Output file for alignments when directed to a separate file with -p 3 (default = spidey.aln).

       -c N   Identity cutoff, in percent, for quality control purposes.

       -d     Also  try  to  align coding sequences corresponding to the given mRNA records (may require network
              access).

       -e X   First-pass e-value (default = 1.0e-10).  Higher values increase speed at the cost of sensitivity.

       -f X   Second-pass e-value (default = 0.001).

       -g X   Third-pass e-value (default = 10).

       -i filename
              Input file containing the genomic sequence in ASN.1 or FASTA format.  If your computer is  running
              on  a  network  that  can  access GenBank, you can substitute the desired accession number for the
              filename.

       -j     Print the alignment in ASN.1 format.

       -k filename
              File for ASN.1 output with -j (default = spidey.asn).

       -l N   Length coverage cutoff, in percent.

       -m filename
              Input file containing the mRNA sequence(s) in ASN.1 or FASTA format, or a list of their accessions
              (with -G).  If your computer is running on a network that can access GenBank, you can substitute a
              single accession number for the filename.

       -n N   Number of gene models to return per input mRNA (default = 1).

       -o str Main output file (default = stdout; contents controlled by -p).

       -p N   What to print:
              0      summary and alignments together (default)
              1      just the summary
              2      just the alignments
              3      summary and alignments in different files

       -r c/d/m/p/v
              Organism of genomic sequence, used to determine splice matrices.
              c      C. elegans
              d      Drosophila
              m      Dictyostelium discoideum
              p      plant
              v      vertebrate (default)

       -s     Tune for interspecies alignments.

       -t filename
              File with feature table, in 4 tab-delimited columns:
              seqid  (e.g., NM_04377.1)
              name   (only repetitive_region is currently supported)
              start  (0-based)
              stop   (0-based)
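
              A feature table in this layout could be generated as in the sketch below.  The coordinates
              are made up for illustration, the helper name is hypothetical, and the absence of a header
              line is an assumption.

```python
import csv
import io


def write_feature_table(rows, handle):
    """Write (seqid, name, start, stop) rows as tab-delimited text."""
    writer = csv.writer(handle, delimiter='\t', lineterminator='\n')
    for seqid, name, start, stop in rows:
        if start > stop:
            raise ValueError(f'start {start} is after stop {stop}')
        writer.writerow([seqid, name, start, stop])


# Example row: one repetitive region, 0-based coordinates (made-up values).
buffer = io.StringIO()
write_feature_table([('NM_04377.1', 'repetitive_region', 120, 410)], buffer)
table_text = buffer.getvalue()
```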

       -u     Make a multiple alignment of all input mRNAs (which must overlap on the genomic sequence).

       -w     Consider lowercase characters in input FASTA sequences to be masked.
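
       As a usage illustration, the following sketch assembles a spidey command line from a few of the
       options listed above.  The helper function and the file names are hypothetical; only flags
       documented on this page are used, and the command is built but not executed.

```python
import shlex


def spidey_command(genomic, mrna, organism='v', models=1, out=None,
                   interspecies=False):
    """Build a spidey argument list from options described in this page."""
    if organism not in ('c', 'd', 'm', 'p', 'v'):
        raise ValueError(f'unknown organism code: {organism!r}')
    argv = ['spidey', '-i', genomic, '-m', mrna,
            '-r', organism, '-n', str(models)]
    if out is not None:
        argv += ['-o', out]   # main output file instead of stdout
    if interspecies:
        argv.append('-s')     # tune for interspecies alignments
    return argv


# Hypothetical input files; Drosophila splice matrices, two gene models.
argv = spidey_command('genomic.fasta', 'mrna.fasta', organism='d', models=2,
                      out='report.txt')
command_line = shlex.join(argv)
```

       The resulting list can be passed to subprocess.run(argv) on a system where spidey is installed.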

AUTHOR

       Sarah Wheelan  and  others  at  the  National  Center  for  Biotechnology  Information;  Steffen  Moeller
       contributed to this documentation.

SEE ALSO

       <http://www.ncbi.nlm.nih.gov/spidey>

NCBI                                               2005-01-25                                          SPIDEY(1)