Ubuntu Manpage: cdbfasta - Creates an index file for records from a multi-fasta file.

Provided by: cdbfasta_1.00+git20230710.da8f5ba+dfsg-1build1_amd64

NAME

       cdbfasta - Creates an index file for records from a multi-fasta file.

DESCRIPTION

   Usage:
              cdbfasta <fastafile> [-o <index_file>] [-r <record_delimiter>]
              [-z <compressed_db>] [-i] [-m|-n <numkeys>|-f<LIST>|-c|-C]
              [-w <stopwords_list>] [-s <stripendchars>] [{-Q|-G}] [-v]

              Creates an index file for records from a multi-fasta file. By default (without any of the
              -m/-n/-c/-C options), only the first space-delimited token from the defline is used as a key.

              <fastafile> is the multi-fasta file to index. With -o, the index file is named <index_file>; if
              -o is not given, the index filename is the database name plus the suffix '.cidx'.
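
       The default behaviour can be sketched in a few lines of Python. This is an illustration only, not
       cdbfasta's own code: the function names are hypothetical, the "database name" is assumed to be the
       file name exactly as given, and tokens are split on any whitespace.

```python
def default_key(defline: str) -> str:
    """Return the first whitespace-delimited token of a defline (assumed default key)."""
    # Drop the leading record delimiter if present.
    body = defline[1:] if defline.startswith(">") else defline
    tokens = body.split()
    return tokens[0] if tokens else ""


def default_index_name(fasta_path: str) -> str:
    """Index filename when -o is not given: database name plus '.cidx' (assumed to be appended as-is)."""
    return fasta_path + ".cidx"


print(default_key(">seq1 Homo sapiens mRNA"))
print(default_index_name("ests.fa"))
```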

       -r <record_delimiter>
              a string of characters at the beginning of a line marking the start of a record (default: '>')
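
       The record delimiter determines where each record begins. A minimal Python sketch of such a scan
       follows, assuming a record runs from one delimiter line up to the next. The name record_offsets is
       hypothetical, and the returned list is an in-memory illustration, not the on-disk index format.

```python
def record_offsets(data: bytes, delimiter: bytes = b">"):
    """Return (offset, length) for every record in `data`.

    A record is assumed to start at a line beginning with `delimiter`
    and to end just before the next such line (or at end of data).
    """
    starts = []
    pos = 0
    for line in data.splitlines(keepends=True):
        if line.startswith(delimiter):
            starts.append(pos)
        pos += len(line)
    ends = starts[1:] + [len(data)]
    return [(start, end - start) for start, end in zip(starts, ends)]


sample = b">a desc\nACGT\nAC\n>b\nGG\n"
print(record_offsets(sample))
```

       With an (offset, length) pair per key, a record can be fetched with a single seek and read rather
       than by scanning the whole file.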

       -Q     treat input as fastq format, i.e. with '@' as record delimiter and with records expected to have
              at least 4 lines

       -z <compressed_db>
              the database is compressed into the file <compressed_db> before indexing (<fastafile> can be "-"
              or "stdin" in order to get the input records from stdin)

       -s <stripendchars>
              strip extraneous characters from *around* the space-delimited tokens, for the multikey options
              below (-m, -n, -f). The default <stripendchars> set is:
              '",`.(){}/[]!:;~|><+-

       -m     ("multi-key" option) create hash entries pointing to the same record for all tokens found in the
              defline

       -n <numkeys>
              same as -m, but only takes the first <numkeys> tokens from the defline; when used with the -a
              option (see below), only collects the first <numkeys> accessions from each defline
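
       How -m, -n and -s interact can be sketched as follows. This is a rough Python illustration, not
       the program's implementation. It assumes the leading apostrophe belongs to the default strip set,
       that the first <numkeys> tokens are selected before stripping, and that tokens left empty after
       stripping are dropped. The name multi_keys is hypothetical.

```python
# Default strip set as printed in the -s description (assumed to include the apostrophe).
STRIP_CHARS = "'\",`.(){}/[]!:;~|><+-"


def multi_keys(defline, numkeys=None, strip_chars=STRIP_CHARS):
    """Return the keys a multi-key pass might collect from one defline."""
    body = defline[1:] if defline.startswith(">") else defline
    tokens = body.split()[:numkeys]  # numkeys=None keeps every token (-m behaviour)
    keys = []
    for token in tokens:
        stripped = token.strip(strip_chars)  # strips only from both ends of the token
        if stripped:
            keys.append(stripped)
    return keys


print(multi_keys(">seq1 (Homo sapiens) kinase, mRNA."))
print(multi_keys(">seq1 (Homo sapiens) kinase, mRNA.", numkeys=2))
```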

       -f<LIST>
              indexes *space* delimited tokens (fields) in the defline as given by LIST of fields or field
              ranges (the same syntax as UNIX 'cut')
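
       For readers unfamiliar with the 'cut' LIST syntax, the sketch below expands a list such as
       1,3-4 or 4- into 1-based field numbers and selects the matching defline tokens. It is an
       illustration in Python under the assumption that cdbfasta follows 'cut' conventions (open-ended
       ranges, fields beyond the last token ignored). Both function names are hypothetical.

```python
def parse_field_list(spec: str, nfields: int):
    """Expand a cut-style LIST ('1,3-5', '-2', '4-') into sorted 1-based field numbers."""
    selected = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-", 1)
            start = int(lo) if lo else 1        # '-N' means from the first field
            end = int(hi) if hi else nfields    # 'N-' means through the last field
        else:
            start = end = int(part)
        if start < 1:
            raise ValueError("fields are numbered from 1")
        selected.update(range(start, min(end, nfields) + 1))
    return sorted(selected)


def field_keys(defline: str, spec: str):
    """Return the defline tokens selected by a cut-style LIST."""
    body = defline[1:] if defline.startswith(">") else defline
    tokens = body.split()
    return [tokens[i - 1] for i in parse_field_list(spec, len(tokens))]


print(field_keys(">id1 alpha beta gamma delta", "1,3-4"))
```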

       -w <stopwords_list>
              exclude from indexing all the words found in the file <stopwords_list> (for options -m, -n and -k)

       -i     do case insensitive indexing (i.e. create additional keys for all-lowercase tokens used for
              indexing from the defline)

       -c     for deflines in the format db1|accession1|db2|accession2|..., only the first db-accession pair
              ('db1|accession1') is taken as key

       -C     like -c, but also subsequent db|accession constructs are indexed, along with the full (default)
              token; additionally, all nrdb concatenated accessions found in the defline are parsed and stored
              (assuming 0x01 or '^|^' as separators)

       -a     accession mode: like -C but indexes only the 'accession' part for all 'db|accession' constructs
              found, plus the default first tokens

       -A     like -a and -C together (both accessions and 'db|accession' constructs are used as keys)
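
       The differences between -c, -C, -a and -A are easiest to see on one token. The Python sketch below
       is one interpretation of the descriptions above, not the program's logic. It ignores the nrdb
       concatenated-accession handling, assumes the full first token is kept as a key in the -C, -a and -A
       modes, and uses hypothetical function names.

```python
def db_accession_pairs(token: str):
    """Split 'db1|acc1|db2|acc2|...' into [('db1', 'acc1'), ('db2', 'acc2'), ...]."""
    parts = token.split("|")
    return [
        (parts[i], parts[i + 1])
        for i in range(0, len(parts) - 1, 2)
        if parts[i] and parts[i + 1]
    ]


def keys_for_mode(token: str, mode: str):
    """Return candidate keys for the first defline token under mode 'c', 'C', 'a' or 'A'."""
    pairs = db_accession_pairs(token)
    constructs = [f"{db}|{acc}" for db, acc in pairs]
    accessions = [acc for _, acc in pairs]
    if mode == "c":
        return constructs[:1]                      # first db|accession pair only
    if mode == "C":
        return [token] + constructs                # full token plus every construct
    if mode == "a":
        return [token] + accessions                # full token plus accession parts
    if mode == "A":
        return [token] + constructs + accessions   # -C and -a combined
    raise ValueError(f"unknown mode: {mode}")


for mode in "cCaA":
    print(mode, keys_for_mode("gi|12345|gb|AB000001", mode))
```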

       -D     index each pipe ('|') delimited token found in the record identifier (e.g. >key1|key2|key3|.. )

       -d <kdelim>
              same as -D but using a custom key delimiter <kdelim> instead of the pipe character '|'
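
       A short Python sketch of delimiter-based key splitting, as described for -D and -d. It assumes
       the "record identifier" is the first whitespace-delimited token of the defline and that empty
       pieces (e.g. from a trailing delimiter) are skipped. The name delimited_keys is hypothetical.

```python
def delimited_keys(defline: str, kdelim: str = "|"):
    """Split the record identifier into keys on `kdelim` (default: the pipe character)."""
    body = defline[1:] if defline.startswith(">") else defline
    tokens = body.split()
    identifier = tokens[0] if tokens else ""
    return [key for key in identifier.split(kdelim) if key]


print(delimited_keys(">key1|key2|key3| some description"))
print(delimited_keys(">a:b:c", kdelim=":"))
```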

       -G     FASTA records are treated as large genomic sequences (e.g. full chromosomes/contigs) and their
              formatting is checked for suitability for fast range queries (i.e. uniform line length within
              each record)
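
       Uniform line length matters because it lets the byte position of any base be computed
       arithmetically instead of by scanning. The Python sketch below shows one plausible form of the
       check and the resulting offset calculation. It assumes the last sequence line may be shorter than
       the others and that line endings are a single newline byte. Both function names are hypothetical.

```python
def has_uniform_line_length(record: str) -> bool:
    """Check that all sequence lines share one length (the last line may be shorter)."""
    lines = record.splitlines()[1:]  # skip the defline
    if len(lines) <= 1:
        return True
    width = len(lines[0])
    if any(len(line) != width for line in lines[1:-1]):
        return False
    return len(lines[-1]) <= width


def base_offset(i: int, width: int, newline_len: int = 1) -> int:
    """Byte offset of 0-based base `i`, relative to the start of the sequence lines."""
    return (i // width) * (width + newline_len) + (i % width)


body = "ACGT\nACGT\nAC\n"
print(has_uniform_line_length(">chr1\n" + body))
print(body[base_offset(5, 4)])
```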

       -v show program version and exit

cdbfasta version 1.00                              April 2024                                        CDBFASTA(1)