Ubuntu Manpage: rsem-simulate-reads - Simulate RNA-Seq data ("reads") for a given model and a RSEM reference transcript

Provided by: rsem_1.3.3+dfsg-3build1_amd64

NAME

       rsem-simulate-reads - Simulate RNA-Seq data ("reads") for a given model and an RSEM reference
       transcript collection.

SYNOPSIS

       rsem-simulate-reads reference_name estimated_model_file estimated_isoform_results  theta0  N  output_name
       [--seed seed] [-q]

DESCRIPTION

       Parameters:

       reference_name: The name of the RSEM references, which should already have been generated by
       'rsem-prepare-reference'.

       estimated_model_file: This file describes how the RNA-Seq reads will be sequenced given the
       expression levels. It determines what kind of reads will be simulated (single-end or paired-end,
       with or without quality scores) and includes parameters for the fragment length distribution, the
       read start position distribution, the sequencing error models, etc. Normally, this file should be
       learned from real data using 'rsem-calculate-expression'. The file can be found in the
       'sample_name.stat' folder under the name 'sample_name.model'.

       estimated_isoform_results: This file contains expression levels for all isoforms recorded in the
       reference. It can be learned from real data using 'rsem-calculate-expression'; the corresponding
       file is 'sample_name.isoforms.results'. To simulate from a user-designed expression profile, start
       from a learned 'sample_name.isoforms.results' file and modify only the 'TPM' column. The simulator
       reads only the TPM column, but the file format must otherwise be kept unchanged. If the RSEM
       references were built to be aware of allele-specific transcripts, 'sample_name.alleles.results'
       should be used instead.

       theta0: This parameter determines the fraction of reads that come from background "noise" (instead
       of from a transcript). It can also be estimated from real data using 'rsem-calculate-expression'.
       Users can find it as the first value of the third line of the file
       'sample_name.stat/sample_name.theta'.

       N: The total number of reads to be simulated. If 'rsem-calculate-expression' has been executed on a
       real data set, the total number of reads can be found as the 4th number of the first line of the
       file 'sample_name.stat/sample_name.cnt'.

       output_name: Prefix for all output files.

       --seed seed: Set the seed for the random number generator used in the simulation. The seed should
       be a 32-bit unsigned integer.

       -q: Suppress the output of intermediate information.
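
       For the user-designed expression profile described under estimated_isoform_results, the following
       Python sketch rewrites only the 'TPM' column of a learned results table and leaves every other
       field untouched. It is an illustration, not part of RSEM: the function name set_tpm is hypothetical,
       and the code assumes the file is tab-separated with a header row that contains the columns
       'transcript_id' and 'TPM'. This manual does not state whether the simulator renormalizes the TPM
       values, so the sketch does not rescale them.

```python
def set_tpm(results_text, new_tpm):
    """Return a copy of an isoforms.results table with selected TPM values replaced.

    results_text: full text of a 'sample_name.isoforms.results' file
                  (assumed tab-separated, first line = column names).
    new_tpm:      dict mapping transcript_id -> new TPM value; transcripts
                  that are not listed keep their original value.
    """
    lines = results_text.rstrip("\n").split("\n")
    header = lines[0].split("\t")
    id_col = header.index("transcript_id")   # assumed column name
    tpm_col = header.index("TPM")            # column the simulator reads
    out = [lines[0]]                         # keep the header line verbatim
    for line in lines[1:]:
        fields = line.split("\t")
        if fields[id_col] in new_tpm:
            fields[tpm_col] = f"{new_tpm[fields[id_col]]:.2f}"
        out.append("\t".join(fields))
    return "\n".join(out) + "\n"
```

       The returned text can be written to a new file and passed as estimated_isoform_results.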

       Outputs:

       output_name.sim.isoforms.results, output_name.sim.genes.results: Expression levels estimated by
       counting where each simulated read comes from.

       output_name.sim.alleles.results: Allele-specific expression levels estimated by counting where each
       simulated read comes from.

       output_name.fa  if  single-end  without  quality  score; output_name.fq if single-end with quality score;
       output_name_1.fa  &  output_name_2.fa  if  paired-end   without   quality   score;   output_name_1.fq   &
       output_name_2.fq if paired-end with quality score.
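
       The naming rule for the read files can be summarized in a few lines of Python. This is a sketch
       for scripts that need to locate the simulator's output; the function name simulated_read_files is
       hypothetical and the code only restates the rule given above.

```python
def simulated_read_files(output_name, paired_end, with_quality):
    """List the read files expected for a given output prefix and read type."""
    # FASTQ (.fq) when quality scores are simulated, FASTA (.fa) otherwise.
    ext = "fq" if with_quality else "fa"
    if paired_end:
        # Paired-end data is split into one file per mate.
        return [f"{output_name}_1.{ext}", f"{output_name}_2.{ext}"]
    return [f"{output_name}.{ext}"]
```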

       Format of the header line: Each simulated read's header line encodes where it comes from. The header line
       has the format:

              {>/@}_rid_dir_sid_pos[_insertL]

       {>/@}: Either '>' or '@' must appear. '>' appears if FASTA files are generated and '@' appears if
       FASTQ files are generated.

       rid: The simulated read's index, numbered from 0.

       dir: The direction of the simulated read. 0 refers to the forward strand ('+') and 1 refers to the
       reverse strand ('-').

       sid: Indicates which transcript this read is simulated from. It ranges between 0 and M, where M is
       the total number of transcripts. If sid=0, the read is simulated from the background noise.
       Otherwise, the read is simulated from the transcript with index sid. The name of transcript sid can
       be found in the 'transcript_id' column of the 'sample_name.isoforms.results' file (at line sid + 1;
       line 1 holds the column names).

       pos: The start position of the simulated read in strand dir of transcript sid. It is numbered
       from 0.

       insertL: Only appears for paired-end reads. It gives the insert length of the simulated read.
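
       Because each header encodes the read's origin, the true source of a simulated read can be recovered
       with a small parser. The Python sketch below is illustrative only: the names SimReadOrigin and
       parse_sim_header are hypothetical, and the code makes two assumptions that this manual does not
       confirm, namely that an underscore directly after '>' or '@' is optional and that paired-end mates
       may carry a trailing '/1' or '/2'. It must be applied to header lines only, since FASTQ quality
       lines can also begin with '@'.

```python
from typing import NamedTuple, Optional


class SimReadOrigin(NamedTuple):
    rid: int                      # read index, numbered from 0
    direction: int                # 0 = forward strand, 1 = reverse strand
    sid: int                      # 0 = background noise, otherwise transcript index
    pos: int                      # 0-based start position on the given strand
    insert_length: Optional[int]  # present for paired-end reads only


def parse_sim_header(line):
    """Parse a header of the form {>/@}_rid_dir_sid_pos[_insertL]."""
    text = line.strip()
    if not text or text[0] not in ">@":
        raise ValueError(f"not a FASTA/FASTQ header line: {line!r}")
    body = text[1:].lstrip("_")   # assumption: leading underscore is optional
    if body.endswith(("/1", "/2")):
        body = body[:-2]          # assumption: drop a mate suffix if present
    fields = body.split("_")
    if len(fields) not in (4, 5):
        raise ValueError(f"unexpected number of fields in header: {line!r}")
    values = [int(f) for f in fields]
    insert_length = values[4] if len(values) == 5 else None
    return SimReadOrigin(values[0], values[1], values[2], values[3], insert_length)
```

       A result with sid equal to 0 marks a background-noise read; any other sid can be looked up at line
       sid + 1 of the isoform results file, as described above.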

       Example:

       Suppose we want to simulate 50 million single-end reads with quality scores and use the parameters
       learned for the sample 'mmliver_single_quals' (for example, by a previous run of
       'rsem-calculate-expression'). In addition, we set theta0 to 0.2 and output_name to
       'simulated_reads'. The command is:

              rsem-simulate-reads /ref/mouse_125 \
                  mmliver_single_quals.stat/mmliver_single_quals.model \
                  mmliver_single_quals.isoforms.results 0.2 50000000 simulated_reads
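
       Instead of choosing theta0 and N by hand, both can be taken from the statistics written by
       'rsem-calculate-expression', as described under Parameters. The Python sketch below reads the two
       values and assembles the argument list in the order given in the SYNOPSIS. The function names are
       hypothetical, and the code assumes that the values in the '.theta' and '.cnt' files are separated
       by whitespace.

```python
def read_theta0(theta_text):
    """theta0 = first value on the third line of sample_name.stat/sample_name.theta."""
    return float(theta_text.splitlines()[2].split()[0])


def read_total_reads(cnt_text):
    """N = 4th number on the first line of sample_name.stat/sample_name.cnt."""
    return int(cnt_text.splitlines()[0].split()[3])


def build_command(reference, sample, output_name, theta0, n_reads, seed=None):
    """Assemble the rsem-simulate-reads argument list for a learned sample."""
    cmd = [
        "rsem-simulate-reads",
        reference,
        f"{sample}.stat/{sample}.model",   # estimated_model_file
        f"{sample}.isoforms.results",      # estimated_isoform_results
        str(theta0),
        str(n_reads),
        output_name,
    ]
    if seed is not None:
        # The seed should be a 32-bit unsigned integer.
        if not 0 <= seed < 2**32:
            raise ValueError("seed must fit in a 32-bit unsigned integer")
        cmd += ["--seed", str(seed)]
    return cmd
```

       The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where RSEM is
       installed.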

rsem-simulate-reads 1.3.3+dfsg                     April 2024                             RSEM-SIMULATE-READS(1)