Provided by: lucy_1.20-4_amd64

NAME

       lucy - Assembly Sequence Cleanup Program

SYNOPSIS

       lucy [-pass_along min_value max_value med_value]
       [-range area1 area2 area3] [-alignment area1 area2 area3]
       [-vector vector_sequence_file splice_site_file]
       [-cdna [minimum_span maximum_error initial_search_range]] [-keep]
       [-size vector_tag_size] [-threshold vector_cutoff]
       [-minimum good_sequence_length] [-debug [filename]]
       [-output sequence_filename quality_filename]
       [-error max_avg_error max_error_at_ends]
       [-window window_size max_avg_error
           [window_size max_avg_error ...]]
       [-bracket window_size max_avg_error]
       [-quiet] [-inform_me] [-xtra cpu_threads]
       sequence_file quality_file [2nd_sequence_file]

DESCRIPTION

       Lucy is a utility that prepares raw DNA sequence fragments for sequence assembly, possibly using the TIGR
       Assembler. Raw DNA sequence fragments are obtained from DNA sequencing machines, such as those from Applied
       Biosystems Inc. (ABI). Lucy accepts three input data files: the sequence file, the quality file, and
       (optionally) a second sequence file for comparison purposes. All three files should be in plain FASTA format.

       The first sequence file and its accompanying quality file are obtained  from  other  utility  programs,  such  as
       phred(1),  which  reads  the  sequencing  machine  chromatograph  outputs  and generates sequence base calls in a
       sequence file together with a quality assessment file for each base call made. The optional second sequence  file
       usually  comes  directly  from  the sequencing machine itself and are used to reassure/enhance the quality of the
       sequences.

       Lucy makes no assumption about the order of sequences in  the  three  input  files.  As  long  as  all  necessary
       information can be found, DNA sequences and quality sequences can be in different orders. A sequence without its
       quality assessment companion or vice versa will be reported as an error. The second sequence file is  allowed  to
       have  missing  sequences  that  appear  only in the first sequence file. Sequences that appear only in the second
       sequence file will be reported and ignored by lucy.

       The operations of Lucy are divided into 7 phases:

       Phase 1:
              Count the number of input sequences in the first  sequence  file,  and  create  internal  data  structures
              accordingly.   Lucy  allocates memory dynamically, so there is no preset limit on the size of input files.
              The size of input files is only limited by your available computer memory.

              Note that lucy's static memory requirement grows very slowly  with  the  number  of  input  sequences,  at
              roughly  60  bytes  plus  the sequence name storage for each input sequence. This is because lucy does not
              store any sequence data in the memory after processing them.  Therefore, by all  practical  considerations
              lucy  can handle any number of input sequences.  The dynamic memory requirement of lucy is proportional to
              the longest sequence in the input, not the number of sequences, but its actual size varies from one
              processing phase to the next.

       Phase 2:
              Read all sequence information, including name, length, and positions. To save memory, lucy does not load
              all sequence data into main memory at once. Instead, it uses direct file addressing to access each
              sequence only when it is needed. Therefore, it is very important that the content of the input files stays
              unchanged during runtime.
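
              The direct file addressing idea can be illustrated with a minimal Python sketch. This is not lucy's
              actual code; the function names index_fasta and fetch_sequence are hypothetical, and the sketch
              assumes plain FASTA input opened in binary mode. It records one byte offset per sequence and seeks
              to it on demand, which is also why the file must not change while the index is in use.

```python
import io


def index_fasta(handle):
    """Map each sequence name to the byte offset of its header line."""
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            fields = line[1:].split()
            if fields:
                index[fields[0].decode("ascii")] = offset
    return index


def fetch_sequence(handle, offset):
    """Seek to a stored offset and read the bases of that one record."""
    handle.seek(offset)
    handle.readline()  # skip the header line
    parts = []
    for line in handle:
        if line.startswith(b">"):
            break  # reached the next record
        parts.append(line.strip())
    return b"".join(parts).decode("ascii")


# Illustrative in-memory "file"; a real run would use open(path, "rb").
demo = io.BytesIO(b">read1 extra\nACGT\nAC\n>read2\nGGTT\n")
demo_index = index_fasta(demo)
read2 = fetch_sequence(demo, demo_index["read2"])
```

              Only the name-to-offset table stays in memory, so the footprint grows with the number of sequences
              rather than with their total length.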

       Phase 3:
              Read the quality information for sequences,  and  compute  good  quality  regions,  i.e.,  regions  within
              sequences  that  have  higher  quality values and can be trusted to be correct.  Lucy determines a "clean"
              range which has an average probability of error (per  base)  that  is  no  greater  than  the  probability
              specified  by  the  -error  option  (or the default, if -error is not used).  Note that secondary sequence
              extension (next step) is performed after quality trimming.  Because of this, it is possible that the final
              "clear" range (after vector trimming) will have a probability of error which is greater than the specified
              value.

       Phase 4:
              Read the second sequence file, compare its sequences to the first sequence file, and extend  good  quality
              regions  if  they both agree. If the second sequence file is not provided, this step is skipped. Note that
              this sequence comparison phase will not in any way shorten the good quality region determined by the
              previous phase; it will only extend it if possible. It is very important that the second sequence file
              does not come from the same base calling software as the first sequence file, and is base-called with
              a different algorithm. Otherwise, if the two sequence files are identical, lucy will extend sequences all
              the way to both ends and completely defeat the purpose of the quality trimming done in the previous phase.

              Usually, the first sequence file comes from phred(1) with the companion quality file, and the second
              sequence file comes directly from the original ABI base calling software supplied with the sequencing
              machines.

       Phase 5:
              Locate splice sites on ends of sequences. In this phase, lucy tries to compare all input sequences against
              splice site sequences in a splice site file which defines the vector sequences near the insertion point on
              the  vector. If splice site sequences are found on any input sequence, they will be excluded from the good
              quality region so that the sequence assembly program will not  mistakenly  take  them  into  account  when
              trying  to  reassemble  the  sequences.  Note that lucy assumes all input sequences are read from the same
              direction and matching the direction of the splice site sequences. Therefore, the forward and reverse read
              sequences of a clone should not be mixed together in a single input file. If such a mixture of forward and
              reverse read sequences is unavoidable, lucy can be run twice to check in both directions,  once  with  the
              forward splice site sequences, the other time with the reverse splice site sequences.  See the description
              of option -vector below for more details.

              By popular demand, a poly-A/T trimming feature has been built into lucy.  It is designated Phase 5a and is
              an optional step. See the options -cdna and -keep below for details of their usage.

       Phase 6:
              Remove  vector  insert  sequences.  In  this  phase, all input sequences are checked against a full length
              vector sequence in a vector file, and sequences that are vector inserts themselves will  be  detected  and
              removed. Lucy uses a quick fragment match method to check for vector sequences. Both the target vector
              sequence and the input sequences are converted into fragments (from 8 to 16 bases long; the default is
              10), and matching fragments are detected quickly. A sequence is flagged as vector when its good quality
              region (from which splice site sequences have already been excluded) contains more matches to vector
              fragments than a normal, non-vector sequence could plausibly match by chance. The default cutoff
              threshold is 20%. A sequence whose match to the vector exceeds 20% is considered a vector insert and
              discarded.
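
              A fragment match of this kind might be sketched in Python as follows. The exact way lucy counts
              matches is not documented here, so this sketch assumes the score is the fraction of read fragments
              that also occur in the vector or its reverse complement; the function names are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string; unknown symbols become N."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp.get(base, "N") for base in reversed(seq.upper()))


def fragments(seq, size):
    """Set of all overlapping fragments of the given size."""
    seq = seq.upper()
    return {seq[i:i + size] for i in range(len(seq) - size + 1)}


def vector_match_fraction(read, vector, size=10):
    """Fraction of read fragments found in the vector (either strand).

    Assumption: this ratio stands in for lucy's match percentage.
    """
    vector_frags = fragments(vector, size) | fragments(reverse_complement(vector), size)
    positions = len(read) - size + 1
    if positions <= 0:
        return 0.0
    read = read.upper()
    hits = sum(1 for i in range(positions) if read[i:i + size] in vector_frags)
    return hits / positions


def is_vector_insert(read, vector, size=10, threshold=0.20):
    """True if the read matches the vector above the cutoff (default 20%)."""
    return vector_match_fraction(read, vector, size) > threshold


# Illustrative data only, not a real cloning vector.
demo_vector = "ACGTTGCAAGGCTTAACCGGTACGATCGATTGCA"
```

              Storing the vector fragments in a set makes each lookup constant time, which is the reason a fragment
              match can be much faster than a full alignment.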

       Phase 7:
              Produce  output  sequences  for fragment assembly. In the final phase, Lucy produces two output files, the
              cleaned sequence file with markers for good quality regions, and a  companion  quality  file.  Optionally,
              lucy can also generate a cleavage information file (i.e. the good quality region information) which can be
              used to update a database.

       Each sequence in the output sequence file begins with a header that includes its name, the three clone length
       values that are passed along to the fragment assembly program, and left and right markers denoting the beginning
       and end of the good quality, vector free region. The following is an example of lucy output:

       >GCCAA03TF 1500 3000 2000 43 490
       AGCCAAGTTTGCAGCCCTGCAGGTCGACTCTAGAGGATCCCCAGGATGATCAGCCACATT
       GGGACTGAGACACGGCCCAAACTCCTACGGGAGGCAGCAGTGGGGAATCTTGCGCAATGG
       GCGAAAGCCTGACGCAGCCATGCCGCGTGAATGATGAAGGTCTTAGGATTGTAAAATTCT
       TTCACCGGGGACGATAATGACGGTACCCGGAGAAGAAGCCCCGGCTAACTTCGTGCCAGC
           ...
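
       A downstream script could read such a header with a short Python sketch like the one below. The names
       LucyHeader, parse_header and clear_range are hypothetical, and the sketch assumes that the left and right
       markers are 1-based, inclusive base positions, which this page does not state explicitly.

```python
from typing import NamedTuple


class LucyHeader(NamedTuple):
    name: str
    min_len: int   # pass-along minimum clone length
    max_len: int   # pass-along maximum clone length
    med_len: int   # pass-along med_value
    left: int      # left marker of the good region
    right: int     # right marker of the good region


def parse_header(line):
    """Split a lucy output header into its name and five integers."""
    fields = line.strip().lstrip(">").split()
    numbers = [int(x) for x in fields[1:6]]
    return LucyHeader(fields[0], *numbers)


def clear_range(header, sequence):
    """Extract the good region, assuming 1-based inclusive markers."""
    return sequence[header.left - 1:header.right]


example = parse_header(">GCCAA03TF 1500 3000 2000 43 490")
```

       With the example header above, clear_range would return bases 43 through 490 of the full sequence under the
       stated assumption.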

OPTIONS

       Note  that  lucy  checks  only  the  first letter of each option, so all options below can be represented by just
       typing the first letter, e.g. -p for -pass_along.

       -pass_along min_value max_value med_value
              The three pass along values of minimum, maximum and medium clone lengths  are  given  using  this  option.
              Lucy  does  not  interpret  these  values;  they are used by some sequence assembly programs, such as TIGR
              Assembler.  These values are directly copied over to the output sequence file. The default values  are  0,
              0, and 0.

       -range area1 area2 area3
              This option is used in combination with the following option -alignment.  It defines the three splice site
              checking  areas which may need different strengths of splice site alignment. The quality of the base calls
              is usually poor at the beginning of a sequence but gradually improves when moving into the sequence  read.
              Therefore,  when looking for splice sites, stronger and stronger alignment measurements are needed to cope
              with the quality change. The default range values are 40, 60 and 100, i.e., lucy will check for splice
              sites in the first 200 DNA bases. If a splice site is not found in the first 200 bases, the next 100
              (=area3) bases will be checked, for a total checking length of 300. Once a splice site is found, the rest
              of the sequence after the splice site is searched for the other end of the splice site, if any,  to  guard
              against short inserts.

       -alignment area1 area2 area3
              This  option  is  used  in combination with the previous option.  It defines the three different alignment
              strengths for the three areas. An alignment within each area must be at least as long as the
              corresponding value before it is considered a match of the splice site. Default values are 8, 12 and 16
              for the default areas of 40, 60 and 100 bases, respectively.
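
              The relationship between -range and -alignment can be sketched in Python. This sketch assumes that
              the three areas are laid end to end from the start of the read (bases 0-39, 40-99 and 100-199 with
              the defaults) and uses 0-based positions; the function name required_alignment is hypothetical, and
              the extra area3 retry described under -range is not modelled.

```python
def required_alignment(position, ranges=(40, 60, 100), alignments=(8, 12, 16)):
    """Minimum alignment length needed at a 0-based read position.

    Returns None when the position lies beyond the three areas.
    """
    end = 0
    for width, strength in zip(ranges, alignments):
        end += width  # areas are assumed to be consecutive
        if position < end:
            return strength
    return None
```

              For example, required_alignment(50) falls in the second area and therefore demands the second
              alignment strength.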

       -vector vector_sequence_file splice_site_file
              This option provides the complete vector sequence file and a partial  splice  site  sequence  file.   Lucy
              expects  to see one single (probably long) sequence of the vector that is used to do cloning in the vector
              file, and two splice site sequences before and after the insertion point on the vector in the splice  site
              file. The splice site sequences are usually 150 bases in length, with a 50-base overlap right around the
              vector insertion point. Their actual lengths are not very critical. For example, the following are the
              PUC18 splice site sequences that can be used by lucy:

              >PUCsplice.for.begin
              gattaagttgggtaacgccagggttttcccagtcacgacgttgtaaaacg
              acggccagtgccaagcttgcatgcctgcaggtcgactctagaggatcccc
              gggtaccgagctcgaattcgtaatcatggtcatagctgtttcctgtgtga
              >PUCsplice.for.end
              acggccagtgccaagcttgcatgcctgcaggtcgactctagaggatcccc
              gggtaccgagctcgaattcgtaatcatggtcatagctgtttcctgtgtga
              aattgttatccgctcacaattccacacaacatacgagccggaagcataaa

              With  two  splice site sequences as above, lucy assumes all sequences from the input files are read in the
              same direction as the splice site sequences. If that is not true and the input consists of sequences  from
              both forward and reverse reads of a clone, there are two options. One can either separate the sequences
              into forward and reverse read sets and run each set through lucy with the matching splice site sequences,
              or run lucy with a combined splice site file containing both the forward and reverse splice site sequence
              pairs. That is, if lucy sees four splice site sequences, it will assume that bidirectional splice site
              trimming has been requested. For example, the following reverse PUC18 splice site sequences can be
              appended to the forward splice sequences above to instruct lucy to do bidirectional trimming:

              >PUCsplice.rev.begin
              tttatgcttccggctcgtatgttgtgtggaattgtgagcggataacaatt
              tcacacaggaaacagctatgaccatgattacgaattcgagctcggtaccc
              ggggatcctctagagtcgacctgcaggcatgcaagcttggcactggccgt
              >PUCsplice.rev.end
              tcacacaggaaacagctatgaccatgattacgaattcgagctcggtaccc
              ggggatcctctagagtcgacctgcaggcatgcaagcttggcactggccgt
              cgttttacaacgtcgtgactgggaaaaccctggcgttacccaacttaatc

              Bidirectional trimming will run about two times slower because each sequence is compared against two sets
              of splice site sequences when only one set is actually needed. It is also possible that random alignments
              with the other (unneeded) set will shorten the good quality region of a sequence. However,
              bidirectional trimming can guarantee that there are no vector fragments in the good quality region even
              when the assumed direction of some sequence reads is wrong.

              During  phase  6  contaminant  removal,  lucy will automatically reverse complement the full length vector
              sequence and check for both its forward and reverse inserts, so only one vector sequence is needed in  the
              vector  file.   Lucy  can  also  get  the vector and splice site file names from the environment variables
              VECTOR_FILE and SPLICE_FILE, if they are not given at the command line.

       -cdna [minimum_span maximum_error initial_search_range]
              Since the release of lucy, many people have requested that a poly-A/T trimming feature be built into  lucy
              for  the  convenience  of  people doing cDNA sequencing. This option is added for that purpose. By default
              lucy will not do this step unless this option is given. This option can be given alone without any
              parameters, in which case the default values are used, or it can be given with all three parameters. The
              minimum_span  defines  the minimum length of continuous poly-A or T before lucy believes that it has found
              them. The maximum_error denotes the maximum number of errors allowed  before  a  new  continuous  poly-A/T
              region  stops.   If mismatch error count goes beyond maximum_error, then lucy believes that it has reached
              the end of the poly-A/T tail/head. Note that each  new  continuous  poly-A/T  region  that  is  more  than
              minimum_span  long will reset the error counter to zero, therefore an interleaving number of maximum_error
              followed by minimum_span can keep the poly-A/T region expanding. The last  parameter  initial_search_range
              denotes  the range from the ends of a sequence within which a minimum_span number of continuous poly-A/T's
              have to be found, otherwise lucy believes that it cannot find poly-A/T regions for the sequence. Note that
              the ends of the sequence are related to the clear region after vector splice sites trimming,  not  to  the
              actual physical ends of the sequence. The default values of the three parameters are minimum_span=10,
              maximum_error=3 and initial_search_range=50. Warning: these three parameters have to be set carefully in
              order to avoid throwing away good sequence in the middle of a read, or letting poly-A/T regions slip
              by at the ends of a sequence.
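
              The role of minimum_span and initial_search_range can be illustrated with a simplified Python
              sketch. It covers only the initial detection at the head of the clear region, not the
              maximum_error extension, and it assumes that the whole run must lie inside the search range. The
              function name find_initial_run is hypothetical and this is not lucy's actual algorithm.

```python
def find_initial_run(seq, base="T", minimum_span=10, initial_search_range=50):
    """Start index of the first run of at least minimum_span `base`
    characters lying entirely within the first initial_search_range
    bases of seq, or None if there is no such run."""
    head = seq[:initial_search_range].upper()
    run = 0
    for i, ch in enumerate(head):
        run = run + 1 if ch == base else 0
        if run >= minimum_span:
            return i - minimum_span + 1
    return None
```

              A poly-A tail at the other end of a read could be searched the same way by passing the reversed
              sequence and base="A".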

       -keep  When -cdna option is turned on, lucy will trim all poly-A/T fragments it  finds.  This  is  good  for  EST
              clustering  purposes, where you don't want the poly-A/T fragments to stay. However, if you want to see the
              EST sequence in its entirety to know its direction, it is not helpful to trim poly-A/T  away.   The  -keep
              option,  when used in combination with the -cdna option, will preserve the poly-A/T tails/heads at ends of
              each EST sequence to keep them as tags indicating the direction of the EST sequence.

       -size vector_tag_size
              This option is used in combination with the following  option.  It  defines  the  size  of  fragments  for
              checking vector using the fragment matching algorithm. The default value is 10 bases. The range of
              acceptable values is 8 to 16.

       -threshold vector_cutoff
              This option is used in combination with the previous option. It defines the threshold of similarity between
              a sequence and the vector for it to be considered a vector insert. Since splice sites are not included  in
              vector  screening,  any  sequence which has a higher than normal similarity to the vector sequence will be
              considered a vector itself and discarded. The default value of cutoff is 20% of the good quality region.

       -minimum good_sequence_length
              After all checking, comparing and trimming, the good region of a sequence must still be at least
              the minimum length for it to be considered useful to the sequence assembly program. We do not
              want our sequence assembly program to be bothered by many small, trashy fragments. The default minimum
              good sequence length is 100 bases.

       -debug [filename]
              This  option,  if  given,  tells lucy to produce a sequence cleavage information file for reference or for
              updating the database. The default file name is "lucy.debug", which can be overridden by  the  given  file
              name.

       -output sequence_filename quality_filename
              This  option defines the output sequence and quality file names. If not given, the default file names lucy
              uses are "lucy.seq" and "lucy.qul".

       -error max_avg_error max_error_at_ends
              There are three main steps in the quality trimming performed by lucy.  The first  involves  removing  low-
              quality  bases  from  each  end of the sequence, using the criteria specified by the -bracket option.  The
              second involves finding regions of the sequence where the probability of error meets all of  the  criteria
              specified by the -window option.  After these regions are found, the third step is to trim each of them to
              the  largest  region having an average probability of error no greater than the max_avg_error specified by
              the -error option.  Finally, the largest region meeting all of the criteria is chosen as the final "clean"
              range.

              Two parameters  are  specified  with  this  option:   max_avg_error  is  the  maximum  acceptable  average
              probability  of  error  over the final clean range.  max_error_at_ends is the maximum probability of error
              that is allowed for the 2 bases at each end of the final clean range.  The defaults are  0.025  and  0.02,
              respectively, if -error is not specified.

              Note:   A  base's  estimated probability of error is calculated from the quality value that is assigned by
              the base caller.  The quality value (Q) is defined as:

              Q = -10 * log10(Probability of error)
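
              The conversion follows directly from the formula above. The Python sketch below, with hypothetical
              function names, turns quality values into error probabilities and averages them, which is the
              quantity the max_avg_error parameters are compared against.

```python
import math


def error_probability(q):
    """Probability of error for a quality value Q."""
    return 10 ** (-q / 10)


def quality_value(p):
    """Quality value Q for a probability of error p."""
    return -10 * math.log10(p)


def average_error(quals):
    """Mean per-base probability of error over a list of quality values."""
    return sum(error_probability(q) for q in quals) / len(quals)
```

              By this formula the default max_avg_error of 0.025 corresponds to a quality value of roughly 16.
              Note that the average is taken over probabilities, not over quality values.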

       -window window_size max_avg_error [window_size max_avg_error ...]
              This option affects the quality trimming of the sequence  (see  the  description  of  the  -error  option,
              above).   It  specifies one or more window sizes, and a maximum allowable average probability of error for
              each of those window sizes.  If more than one  window  size  is  specified,  they  must  be  specified  in
              decreasing order by window size.  The maximum number of windows that may be specified is 20.

              Lucy  uses a sliding window algorithm to find regions of the sequence within which the average probability
              of error, within any window of the specified  size,  is  no  greater  than  the  specified  max_avg_error.
              Regions  which  meet  all  of  the  specified  window  criteria  are then trimmed again using the criteria
              specified by the -error option, and the final "clean" range is the largest region that meets  all  of  the
              criteria.

              If  the  -window  option  is  not specified, then lucy uses 2 windows by default, of 50 and 10 bases.  The
              default maximum allowable probabilities of error in the two windows are listed below:

              50-base window: 0.08
              10-base window:  0.3
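
              A sliding window test of this kind might look like the following Python sketch. It is an
              illustration under stated assumptions rather than lucy's implementation: errors is a list of
              per-base error probabilities, regions are reported as half-open (start, end) index pairs, and the
              function names are hypothetical.

```python
def window_passes(errors, size, max_avg):
    """One flag per window start: True if that window's mean error
    is no greater than max_avg. Empty if the input is too short."""
    if len(errors) < size:
        return []
    total = sum(errors[:size])
    flags = [total / size <= max_avg]
    for i in range(size, len(errors)):
        total += errors[i] - errors[i - size]  # slide the window by one base
        flags.append(total / size <= max_avg)
    return flags


def passing_regions(errors, size, max_avg):
    """Half-open spans covered by runs of consecutive passing windows."""
    regions, run_start = [], None
    flags = window_passes(errors, size, max_avg)
    for i, ok in enumerate(flags):
        if ok and run_start is None:
            run_start = i
        elif not ok and run_start is not None:
            regions.append((run_start, i - 1 + size))
            run_start = None
    if run_start is not None:
        regions.append((run_start, len(flags) - 1 + size))
    return regions


def meets_all_windows(errors, criteria=((50, 0.08), (10, 0.3))):
    """True if every window of every listed size passes.

    A window size longer than the input is treated as passing here.
    """
    return all(all(window_passes(errors, size, max_avg))
               for size, max_avg in criteria)
```

              The default criteria tuple mirrors the two default windows listed above; the larger window enforces
              a stricter average over a longer stretch, while the smaller one catches short bursts of bad bases.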

       -bracket window_size max_avg_error
              This option controls the initial quality trimming step, which is the removal  of  low-quality  bases  from
              both ends of the sequence.  Lucy looks for the first and last window of size window_size having an average
              probability  of error no greater than max_avg_error.  The subsequence which extends from the first base of
              the first such window to the last base of the last window is then  examined  further  to  find  the  clean
              range.   Bases  which precede the first window or follow the last window are excluded from the clean range
              (so the two terminal windows bracket the clean range).

              The defaults for window_size and max_avg_error are 10 and 0.02.
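
              The bracketing step described above can be sketched in Python as follows. The function name bracket
              is hypothetical, errors is assumed to be a list of per-base error probabilities, and the result is
              a half-open (start, end) index pair.

```python
def bracket(errors, window_size=10, max_avg_error=0.02):
    """Span from the first base of the first passing window to the
    last base of the last passing window, or None if no window passes."""
    passing = [
        i for i in range(len(errors) - window_size + 1)
        if sum(errors[i:i + window_size]) / window_size <= max_avg_error
    ]
    if not passing:
        return None
    return passing[0], passing[-1] + window_size
```

              Only the two terminal windows matter in this step; low-quality stretches between them are left for
              the later -window and -error checks.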

       -quiet Tells lucy to shut up and only report serious errors it finds. :)

       -inform_me
              Asks lucy to report sequences by names that have been thrown out due to low quality  values,  or  salvaged
              due to comparison to the 2nd sequence file.

       -xtra cpu_threads
              If you have multiple CPUs in your computer, you can dramatically increase lucy's speed by allowing lucy to
              run  multiple  execution  threads concurrently. For example, if you have a dual-CPU computer, you can give
              the option -xtra 2 to cut lucy's execution time roughly in half. By default, lucy will run just one thread
              if this option is not given. The maximum number of allowable threads is 32. Note that this option is  only
              available  with  the  multi-threaded  lucy  version  1.16p. There is also a 1.16s version that does not do
              multi-threading.

ENVIRONMENT

       The environment variable VECTOR_FILE defines the vector file  name,  and  the  environment  variable  SPLICE_FILE
       defines  the splice site file name. Both variables are used when the user does not specify them using the -vector
       option.
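
       A wrapper script could mirror this precedence with a short Python sketch. The function name
       resolve_vector_files is hypothetical; only the variable names VECTOR_FILE and SPLICE_FILE come from this
       page.

```python
import os


def resolve_vector_files(cli_vector=None, cli_splice=None, environ=os.environ):
    """Prefer file names given on the command line, else the environment."""
    vector = cli_vector or environ.get("VECTOR_FILE")
    splice = cli_splice or environ.get("SPLICE_FILE")
    return vector, splice
```

       If neither source supplies a name, the sketch returns None for that file, leaving the caller to decide
       how to proceed.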

SEE ALSO

       TIGR_Assembler(1), grim(1), everm.sp(1), phred(1), TraceTuner(1), and ethyl.pl(1).

BUGS

       No known bugs for the program at this moment. Some of the manual pages mentioned above do not exist.

CAVEATS

       The "no bugs" claim above can never be true. This is a new program built mostly from scratch, and there  must  be
       bugs somewhere, somehow. Please direct all bug reports to the authors.

ACRONYM

       Lucy  stands  for  Less  Useful  Chunks Yank, an awkward combination of words in order to make it a member of the
       family with phred, the base caller, ethyl, the old scripting  system  lucy  replaced,  and  ricky,  the  database
       linking and communication driving software of lucy for use in TIGR.

AUTHOR

       Lucy  was  written  by  Hui-Hsien  Chou  and  Michael Holmes at The Institute for Genomic Research, with help and
       suggestions from Granger Sutton, Anna Glodek, John Scott, and Terry Shea. Michael Holmes is currently responsible
       for lucy.  Please direct any suggestions, bug reports, etc. to mholmes@tigr.org.

TIGR software                                      10/28/2000                                            LUCY(1)