# pbsim(1)
## NAME
pbsim - simulator for PacBio sequencing reads
## SYNOPSIS
*pbsim* [options] <reference.fasta>
## DESCRIPTION
The `pbsim` command produces simulated PacBio reads from the reference FASTA sequence <reference.fasta>.
Model files (arguments for the `--model_qc` option) can be found in the /usr/share/pbsim/models directory.
## OPTIONS
The options for pbsim can be divided into general, sampling-based and model-based simulation options. Default values are given in parentheses.
### General options
*--prefix*::
prefix of output files (sd).
*--data-type*::
data type, either CLR or CCS (CLR).
*--depth*::
depth of coverage (CLR: 20.0, CCS: 50.0).
*--length-min*::
minimum length (100).
*--length-max*::
maximum length (CLR: 25000, CCS: 2500).
*--accuracy-min*::
minimum accuracy (CLR: 0.75, CCS: fixed as 0.75).
This option can be used only with CLR.
*--accuracy-max*::
maximum accuracy (CLR: 1.00, CCS: fixed as 1.00).
This option can be used only with CLR.
*--difference-ratio*::
ratio of differences, given as substitution:insertion:deletion.
Each value may be at most 1000 (CLR: 10:60:30, CCS: 6:21:73).
*--seed*::
seed for the pseudorandom number generator (Unix time).
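As an illustration of how a *--difference-ratio* value can be read, the following Python sketch turns a `substitution:insertion:deletion` string into relative fractions. The helper name `parse_difference_ratio` is hypothetical and not part of pbsim, and normalizing by the sum of the three values is an assumption about how the ratio is interpreted.

```python
def parse_difference_ratio(ratio: str) -> dict:
    """Convert 'sub:ins:del' into fractions that sum to 1.

    Illustrative helper only; it is not part of pbsim.
    """
    parts = ratio.split(":")
    if len(parts) != 3:
        raise ValueError("expected substitution:insertion:deletion")
    values = [int(p) for p in parts]
    # The option description states that each value may be at most 1000.
    if any(v < 0 or v > 1000 for v in values):
        raise ValueError("each value must be between 0 and 1000")
    total = sum(values)
    if total == 0:
        raise ValueError("at least one value must be non-zero")
    names = ("substitution", "insertion", "deletion")
    # Assumption: the three values are relative weights.
    return {name: v / total for name, v in zip(names, values)}


print(parse_difference_ratio("10:60:30"))  # CLR default
print(parse_difference_ratio("6:21:73"))   # CCS default
```

Under this reading, the CLR default assigns most differences to insertions and the CCS default assigns most to deletions.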
### Options for sampling-based simulation
*--sample-fastq*::
FASTQ format file to sample.
*--sample-profile-id*::
ID of the (filtered) sample-fastq profile. When *--sample-fastq* is given, the profile is stored: `sample_profile_<ID>.fastq` and `sample_profile_<ID>.stats` are created.
When *--sample-fastq* is not given, the stored profile is re-used. Note that when a profile is re-used, *--length-min*, *--length-max*, *--accuracy-min* and *--accuracy-max* take the values stored with the profile.
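The store and re-use behaviour of *--sample-profile-id* can be sketched as two consecutive invocations. The Python helper below only assembles the argument lists; the function name `pbsim_sampling_args` and the profile ID `pf1` are hypothetical, and the `--data-type` and `--depth` values are simply copied from the examples in this page.

```python
import shlex


def pbsim_sampling_args(reference, profile_id, sample_fastq=None):
    """Build a pbsim argument list for sampling-based simulation.

    Illustrative helper only; it is not part of pbsim.
    """
    args = ["pbsim", "--data-type", "CLR", "--depth", "20"]
    if sample_fastq is not None:
        # First run: sample from the FASTQ file, and the profile is stored.
        args += ["--sample-fastq", sample_fastq]
    # Without --sample-fastq, the profile stored under this ID is re-used.
    args += ["--sample-profile-id", profile_id, reference]
    return args


first_run = pbsim_sampling_args("reference.fasta", "pf1", sample_fastq="sample.fastq")
later_run = pbsim_sampling_args("reference.fasta", "pf1")
print(shlex.join(first_run))
print(shlex.join(later_run))
```

The second command line omits *--sample-fastq*, so it relies on the profile files written by the first run.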
### Options for model-based simulation
*--model_qc*::
model of quality code.
*--length-mean*::
mean of length model (CLR: 3000.0, CCS: 450.0).
*--length-sd*::
standard deviation of length model (CLR: 2300.0, CCS: 170.0).
*--accuracy-mean*::
mean of accuracy model (CLR: 0.78, CCS: fixed as 0.98).
This option can be used only with CLR.
*--accuracy-sd*::
standard deviation of accuracy model (CLR: 0.02, CCS: fixed as 0.02).
This option can be used only with CLR.
## EXAMPLES
To run model-based simulation:
    pbsim --data-type CLR \
          --depth 20 \
          --model_qc /usr/share/pbsim/models/model_qc_clr \
          reference.fasta
In the example above, simulated reads are randomly sampled from
the reference sequence ("reference.fasta"), and differences (errors) are
introduced into the sampled reads.
The data type is CLR, and the coverage depth is 20.
If the reference is a multi-FASTA file, simulated data is created
for each sequence in it. Three output files are created for each sequence.
"sd_0001.ref" is a single-FASTA file copied from the reference
sequence.
"sd_0001.fastq" is the simulated read dataset in FASTQ format.
"sd_0001.maf" is a list of alignments between the reference sequence and
the simulated reads in MAF format.
The length and accuracy of reads are simulated based on our model of PacBio
reads.
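The output naming described above can be sketched in Python as follows. The helper `expected_output_files` is hypothetical, and the 1-based, zero-padded four-digit counter is an assumption extrapolated from the single name "sd_0001" given in this page.

```python
def expected_output_files(num_sequences, prefix="sd"):
    """List the three output names expected for each reference sequence.

    Illustrative helper only; it is not part of pbsim.
    """
    files = []
    for index in range(1, num_sequences + 1):
        # Assumption: 1-based, zero-padded four-digit counter, as in "sd_0001".
        stem = f"{prefix}_{index:04d}"
        files.append((f"{stem}.ref", f"{stem}.fastq", f"{stem}.maf"))
    return files


# A multi-FASTA reference with two sequences and the default prefix.
for names in expected_output_files(2):
    print(*names)
```

A different *--prefix* value would replace "sd" in each name.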
To run sampling-based simulation:
    pbsim --data-type CLR \
          --depth 20 \
          --sample-fastq sample.fastq \
          reference.fasta
In the sampling-based simulation, the length and quality scores of each
simulated read are the same as those of a read chosen at random from the
sample PacBio dataset ("sample.fastq").
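The idea of drawing a read's length and quality scores from a sample dataset can be sketched in Python as below. This is a minimal illustration, not pbsim's implementation: the function names are hypothetical, the two-record FASTQ text is made up for the example, and any filtering that pbsim applies to the sample is left out.

```python
import io
import random


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality


def sample_length_and_quality(records, rng):
    """Pick one sample read at random and return its length and quality string."""
    _, sequence, quality = rng.choice(records)
    return len(sequence), quality


# Made-up sample data standing in for "sample.fastq".
sample = io.StringIO("@r1\nACGT\n+\n!!!!\n@r2\nACGTAC\n+\n######\n")
records = list(read_fastq(sample))
rng = random.Random(1)  # fixed seed for repeatability, similar in spirit to --seed
length, quality = sample_length_and_quality(records, rng)
print(length, quality)
```

A simulated read would then be cut from the reference at the sampled length and given the sampled quality string.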
## LICENSE
pbsim is available under the terms of the GNU General Public License, version 2 (GPL-2).
## AUTHORS
Michiaki Hamada (mhamada@k.u-tokyo.ac.jp), Yukiteru Ono