Ubuntu Manpage: CollapseSeq.py - removes duplicate sequences from FASTA/FASTQ files

NAME

       CollapseSeq.py - removes duplicate sequences from FASTA/FASTQ files

DESCRIPTION

       usage: CollapseSeq.py [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
              [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
              [--outname OUT_NAME] [--log LOG_FILE] [--failed] [--fasta]
              [--delim DELIMITER DELIMITER DELIMITER] [-n MAX_MISSING]
              [--uf UNIQ_FIELDS [UNIQ_FIELDS ...]]
              [--cf COPY_FIELDS [COPY_FIELDS ...]]
              [--act {min,max,sum,set} [{min,max,sum,set} ...]] [--inner]
              [--keepmiss] [--maxf MAX_FIELD | --minf MIN_FIELD]

       Removes duplicate sequences from FASTA/FASTQ files

   help:
       --version
              show program's version number and exit

       -h, --help
              show this help message and exit

   standard arguments:
       -s SEQ_FILES [SEQ_FILES ...]
              A list of FASTA/FASTQ files containing sequences to process. (default: None)

       -o OUT_FILES [OUT_FILES ...]
              Explicit  output  file name(s). Note, this argument cannot be used with the --failed, --outdir, or
              --outname arguments. If unspecified,  then  the  output  filename  will  be  based  on  the  input
              filename(s).  (default: None)

       --outdir OUT_DIR
              Changes the output directory to the location specified. The input file directory is
              used if this is not specified. (default: None)

       --outname OUT_NAME
              Changes the prefix of the successfully processed output file to the string specified. May  not  be
              specified with multiple input files. (default: None)

       --log LOG_FILE
              Specify  to  write  verbose  logging  to  a  file. May not be specified with multiple input files.
              (default: None)

       --failed
              If specified, create files containing records that fail processing. (default: False)

       --fasta
              Specify to force output as FASTA rather than FASTQ.  (default: None)

       --delim DELIMITER DELIMITER DELIMITER
              A list of the three delimiters that separate annotation blocks, field names and values, and values
              within a field, respectively. (default: ('|', '=', ','))
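
       As an illustration of how the three default delimiters divide a sequence header, the following
       is a minimal Python sketch and not the tool's own parser. It assumes that the first block of
       the header is the sequence identifier; the function name parse_header and the "ID" key are
       hypothetical names chosen for this example.

```python
def parse_header(header, delimiter=("|", "=", ",")):
    """Split a header such as 'SEQ1|PRIMER=IGHM|CONSCOUNT=3,5' into a dict.

    Assumption: the first block is the sequence identifier and every
    following block has the form NAME=VALUE[,VALUE...].
    """
    block_sep, field_sep, value_sep = delimiter
    blocks = header.split(block_sep)
    annotation = {"ID": blocks[0]}  # "ID" is a hypothetical key name
    for block in blocks[1:]:
        # Split each block into a field name and its raw value string
        name, _, value = block.partition(field_sep)
        # A field may hold several values separated by the third delimiter
        annotation[name] = value.split(value_sep)
    return annotation


if __name__ == "__main__":
    print(parse_header("SEQ1|PRIMER=IGHM|CONSCOUNT=3,5"))
```

       Passing a different three-element tuple as delimiter mirrors what --delim changes on the
       command line.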

   collapse arguments:
       -n MAX_MISSING
              Maximum number of missing nucleotides to consider for collapsing sequences.  A  sequence  will  be
              considered undetermined if it contains too many missing nucleotides. (default: 0)

       --uf UNIQ_FIELDS [UNIQ_FIELDS ...]
              Specifies  a  set  of annotation fields that must match for sequences to be considered duplicates.
              (default: None)

       --cf COPY_FIELDS [COPY_FIELDS ...]
              Specifies a set of annotation fields to copy into the unique sequence output. (default: None)

       --act {min,max,sum,set} [{min,max,sum,set} ...]
              List of actions to take for each copy field, defining how each annotation will be combined
              into a single value. The actions "min", "max", and "sum" perform the corresponding mathematical
              operation on numeric annotations. The action "set" collapses annotations into a comma-delimited
              list of unique values. (default: None)

       --inner
              If  specified,  exclude  consecutive  missing  characters at either end of the sequence. (default:
              False)

       --keepmiss
              If specified, sequences with more missing characters than the threshold set by  the  -n  parameter
              will be written to the unique sequence output file with a DUPCOUNT=1 annotation. If not specified,
              such sequences will be written to a separate file.  (default: False)

       --maxf MAX_FIELD
              Specify  the  field  whose maximum value determines the retained sequence; mutually exclusive with
              --minf.  (default: None)

       --minf MIN_FIELD
              Specify the field whose minimum value determines the retained sequence; mutually exclusive with
              --maxf. (default: None)
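
       The interaction between -n and --inner can be sketched as follows. This is an illustrative
       Python fragment under stated assumptions, not the program's implementation: it treats only
       "N" as a missing character, and the function names count_missing and is_undetermined are
       hypothetical.

```python
def count_missing(sequence, inner=False, missing="N"):
    """Count missing characters in a sequence.

    With inner=True, runs of missing characters at either end of the
    sequence are excluded from the count, as described for --inner.
    Assumption: only the characters in `missing` count as missing.
    """
    if inner:
        # str.strip removes leading and trailing characters found in `missing`
        sequence = sequence.strip(missing)
    return sum(1 for base in sequence if base in missing)


def is_undetermined(sequence, max_missing=0, inner=False):
    """Return True when the sequence exceeds the -n threshold."""
    return count_missing(sequence, inner=inner) > max_missing


if __name__ == "__main__":
    read = "NNACGNTNN"
    print(count_missing(read), count_missing(read, inner=True))
    print(is_undetermined(read, max_missing=1, inner=True))
```

       In this sketch a read with missing characters only at its ends passes a threshold of 0 once
       inner is set, while the same read fails it otherwise.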

   output files:
              collapse-unique

              unique  sequences.  Contains one representative from each set of duplicate sequences. The retained
              representative is determined by user defined criteria.

              collapse-duplicate

              raw reads which are duplicates of the sequences retained in the collapse-unique file.

              collapse-undetermined

              raw reads which were excluded from consideration due to  having  too  many  N  characters  in  the
              sequence.
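
       How records might be routed into the three output files can be sketched in Python as below.
       This is a simplified model rather than the tool's algorithm: it compares sequences by exact
       string equality, counts only "N" as missing, and uses a hypothetical collapse function whose
       max_field argument stands in for --maxf. Records are assumed to be pairs of an annotation
       dict and a sequence string.

```python
from collections import defaultdict


def collapse(records, max_missing=0, max_field=None):
    """Split (annotation_dict, sequence) records into three lists:
    unique, duplicate, and undetermined.

    Simplifications: exact string matching, "N" as the only missing
    character, and the first record kept when max_field is not given.
    """
    groups = defaultdict(list)
    undetermined = []
    for fields, seq in records:
        if seq.count("N") > max_missing:
            undetermined.append((fields, seq))
        else:
            groups[seq].append((fields, seq))

    unique, duplicate = [], []
    for seq, members in groups.items():
        if max_field is not None:
            # Keep the record with the largest numeric value in max_field
            keep = max(members, key=lambda rec: float(rec[0][max_field]))
        else:
            keep = members[0]
        # Annotate the representative with the size of its duplicate set
        unique.append((dict(keep[0], DUPCOUNT=len(members)), seq))
        duplicate.extend(rec for rec in members if rec is not keep)
    return unique, duplicate, undetermined


if __name__ == "__main__":
    reads = [
        ({"ID": "a", "CONSCOUNT": "2"}, "ACGT"),
        ({"ID": "b", "CONSCOUNT": "5"}, "ACGT"),
        ({"ID": "c", "CONSCOUNT": "1"}, "ACNT"),
    ]
    for name, group in zip(("unique", "duplicate", "undetermined"),
                           collapse(reads, max_field="CONSCOUNT")):
        print(name, group)
```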

   output annotation fields:
              DUPCOUNT

              total number of sequences within the set of duplicates for each retained unique sequence; that
              is, the copy number of each unique sequence within the data file.

              <user defined>

              annotation fields specified by the --cf parameter.
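
       The four --act actions applied to a copy field can be illustrated with the short Python sketch
       below. It is an approximation for explanation only: the function name combine is hypothetical,
       numeric results are returned as floats, and keeping the first-seen order of unique values for
       "set" is an assumption rather than documented behaviour.

```python
def combine(values, action):
    """Combine the values of one copy field across a set of duplicates.

    values: list of annotation strings, one per duplicate sequence.
    action: one of "min", "max", "sum", or "set".
    """
    if action == "set":
        # dict.fromkeys drops repeats while keeping first-seen order
        return ",".join(dict.fromkeys(values))
    operations = {"min": min, "max": max, "sum": sum}
    # Numeric actions convert each annotation to a float first
    return operations[action](float(v) for v in values)


if __name__ == "__main__":
    print(combine(["3", "5"], "sum"))
    print(combine(["IGHM", "IGHG", "IGHM"], "set"))
```

       On the command line, each action given to --act is paired with a field given to --cf, so a
       call such as --cf CONSCOUNT PRIMER --act sum set would correspond to one combine step per
       field; the field names here are examples only.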

AUTHOR

        This manpage was written by Andreas Tille for the Debian distribution and
        can be used for any other usage of the program.

CollapseSeq.py 0.6.0                                May 2020                                   COLLAPSESEQ.PY(1)