
           Electronic PCR commandline tools: operating instructions

                                                          Version: 2.3.12
     _________________________________________________________________

   Use e-PCR to map sequences using an STS database.

   Use re-PCR to map STSes or short primers in a sequence database.

   Use famap and fahash to prepare a sequence database for re-PCR searches.
     _________________________________________________________________

Forward e-PCR

Example

work> e-PCR -w9 -f 1 -m100 mystsdb.sts D=100-400 myfastafile.fa N=1 G=1 T=3

Synopsis


e-PCR [-hV] [posix-options] stsfile [fasta ...] [compat-options]
where posix-options are:
        -m ##   Margin (default 50)
        -w ##   Wordsize  (default 7)
        -n ##   Max mismatches allowed (default 0)
        -g ##   Max indels allowed (default 0)
        -f ##   Use ## discontiguous words
        -o ##   Set output file
        -t ##   Set output format:
                1 - classic, range (pos1..pos2)
                2 - classic, midpoint
                3 - tabular
                4 - tabular with alignment in comments (slow)
        -d ##-## Set default sts size
        -p +-   Turn hits postprocess on/off
        -v +-   Verbose on/Off
        -a a|f  Use precise alignments (only if gaps>0), slow
                 a - Always or f - as Fallback
        -x +-   Use 5'-end lowercase masking of primers (default -)
        -u +-   Uppercase all primers (default -)
and compat-options (duplicate posix-options) are:
        M=##    Margin (default 50)
        W=##    Wordsize  (default 7)
        N=##    Max mismatches allowed (default 0)
        G=##    Max indels allowed (default 0)
        F=##    Use ## discontiguous words
        O=##    Set output file to ##
        T=##    Set output format (1..4)
        D=##-## Set default sts size
        P=+-    Postprocess hits on/off
        V=+-    Verbose on/Off
        A=a|f   Use precise alignments (only if gaps>0), slow
                 a - Always or f - as Fallback
        X=+-    Use 5'-end lowercase masking of primers (default -)
        U=+-    Uppercase all primers (default -)
        -mid    Same as T=2

Description

   e-PCR parses stsfile in unists format, then reads nucleotide sequence
   data in FASTA format from the files listed on the commandline, or
   from stdin if none are given. For each input sequence e-PCR finds
   matches and prints them in one of four output formats.

Options

   Two sets of options are used: POSIX-compatible options and old-style
   options provided for compatibility with previous versions of e-PCR.

   POSIX-style options can appear only before the first parameter that
   does not start with '-'. The argument '--' explicitly stops parsing
   of arguments as POSIX options.

   Compatibility options can appear anywhere in the commandline. '-mid'
   can appear anywhere and does not stop recognition of POSIX options.
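
   The ordering rules above can be sketched as a small helper that
   assembles an argument list. This is only an illustration: the function
   name build_epcr_argv and its parameters are hypothetical, and the
   helper only builds the list, it does not run e-PCR. It assumes that
   placing '--' directly before stsfile is acceptable, since '--' is
   described as ending POSIX option parsing.

```python
def build_epcr_argv(stsfile, fasta_files=(), posix_options=(), compat_options=()):
    """Assemble an e-PCR argument list in the order described above.

    Hypothetical helper: POSIX-style options go first, '--' ends POSIX
    option parsing, then stsfile and FASTA files, then compatibility
    options (which may appear anywhere, so putting them last is safe).
    """
    argv = ["e-PCR"]
    argv.extend(posix_options)   # must precede the first non-'-' parameter
    argv.append("--")            # explicitly stop POSIX option parsing
    argv.append(stsfile)
    argv.extend(fasta_files)
    argv.extend(compat_options)  # e.g. "N=1", "G=1", "T=3"
    return argv


if __name__ == "__main__":
    print(" ".join(build_epcr_argv(
        "mystsdb.sts",
        fasta_files=["myfastafile.fa"],
        posix_options=["-w9", "-f", "1", "-m100"],
        compat_options=["D=100-400", "N=1", "G=1", "T=3"],
    )))
```

   The resulting list can be passed to a process launcher such as
   Python's subprocess.run without further quoting.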

General options

   -V
          Print version, exit after parsing commandline

   -h
          Print help, exit after parsing commandline

Hash building options

   -w wordsize | W=wordsize
          Set word size for primers hash (nucleotide positions). Longer
          word size decreases hash collision rate, but increases memory
          usage. Also no mismatches are allowed within word size near
          "inner" boundary of primers unless one uses discontiguous
          words, and no gaps are ever allowed in that region.

   -f wordcnt | F=wordcnt
          Set discontiguous word count for primers hash (1 means "use
          contiguous words"). Discontiguous words increase the number of
          hash tables and decrease the "effective" word size (thus
          increasing hash collision rate), which makes the search
          significantly slower but increases sensitivity by allowing
          mismatches within the word size. Reasonable values are 1
          (contiguous words) and 3.

   -d lo-hi | D=lo-hi
          Set default STS size range - values used for STSs that have no
          size associated in file.

Hit quality options

   -m margin | M=margin
          Set maximal allowed deviation of hit product size from expected
          STS size.

   -n mism | N=mism
          Set maximal number of mismatches allowed in primer-to-sequence
          alignment (per primer!).

   -g gaps | G=gaps
          Set maximal number of gaps allowed in primer-to-sequence
          alignment (per primer!).
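
   As a rough illustration of the margin option, the check below accepts
   a hit when its product size lies within the expected STS size range
   widened by the margin on both sides. This interpretation (deviation
   measured from the nearest end of the range) is an assumption, and the
   function name size_within_margin is hypothetical. The default of 50
   is the documented default margin.

```python
def size_within_margin(observed, lo, hi, margin=50):
    """Return True if observed product size is acceptable.

    Assumption: the allowed deviation is measured from the nearest end
    of the expected size range lo..hi, so the accepted interval is
    lo - margin .. hi + margin inclusive.
    """
    return lo - margin <= observed <= hi + margin


if __name__ == "__main__":
    # Expected range 100-400 with the default margin of 50.
    for size in (49, 50, 250, 450, 451):
        print(size, size_within_margin(size, 100, 400))
```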

Alignment algorithms options

   -a a|f | A=a|f
          Use NW algorithm to align primers to sequence: a - always, f -
          as fallback if fast algorithm gives no hit at this position.

   -x +|- | X=+|-
          Turn on/off recognising of lowercase characters at 5'-ends of
          primers as nucleotides that don't need to be aligned to
          sequence (floppy tails).

   -u +|- | U=+|-
          Uppercase primers. To use with files prepared for ``-x=+''
          mode, but requiring full primer alignment.

   If the STS file contains primers with lowercase characters, you have
   to use either the -x+ or the -u+ flag.
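
   The two ways of handling lowercase primers can be sketched as
   follows. This is an illustration of the idea only, not e-PCR's own
   code; the function names are hypothetical. It assumes primers are
   written 5' to 3', so the floppy tail is the leading run of lowercase
   characters.

```python
def split_floppy_tail(primer):
    """Split a primer into (floppy_tail, aligned_part), as for -x+.

    The floppy tail is the leading run of lowercase characters, i.e.
    the nucleotides at the 5'-end that need not align to the sequence.
    """
    tail_len = 0
    for ch in primer:
        if not ch.islower():
            break
        tail_len += 1
    return primer[:tail_len], primer[tail_len:]


def uppercase_primer(primer):
    """Treat the whole primer as requiring alignment, as for -u+."""
    return primer.upper()


if __name__ == "__main__":
    tail, core = split_floppy_tail("acgTTGACCA")
    print(repr(tail), repr(core))
    print(uppercase_primer("acgTTGACCA"))
```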

Report options

   -o output | O=output
          Set output file.

   -t 1|2|3|4 | T=1|2|3|4
          Set output format.

   -p +|- | P=+|-
          Set hit grouping on/off: when using discontiguous words and
          gaps, some hits may be reported multiple times with slightly
          different quality. When on, only the best hit of each group of
          overlapping hits is reported. Default depends on F and G
          values.

   -v +|- | V=+|-
          Report sequence ids to stderr on/off.

Output formats

   1: Traditional: reports whitespace-separated

          + Sequence FASTA identifier
          + POS1..POS2 -- start and end positions of hit (includes length
            of floppy tail)
          + STS identifier (col. 1 from STS file)
          + STS description (columns 5..last from STS file)

          In this format product size equals POS2-POS1+1

   2: Traditional midpoint: reports whitespace-separated

          + Sequence FASTA identifier
          + POS -- middle point position of hit
          + STS identifier (col. 1 from STS file)
          + STS description (columns 5..last from STS file)

   3: Tab-separated detailed

          + Sequence FASTA identifier
          + STS identifier (col. 1 from STS file)
          + +|- -- strand of hit (order of primers in hit)
          + POS1 -- start position of hit (does not include floppy tail
            if any)
          + POS2 -- end position of hit (does not include floppy tail)
          + SIZE/MIN..MAX -- observed size of hit/expected size range of
            STS
          + MISM -- Total number of mismatches for two primers
          + GAPS -- Total number of gaps for two primers
          + STS description (columns 5..last from STS file)

          In this format product size may be greater than POS2-POS1+1 for
          probes with floppy tails

   4: Tab-separated detailed with alignment
          Same as format 3, but also contains visualisations of
          alignments in comment lines (lines starting with ``#'')
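
   A minimal sketch of reading format 3 (or format 4) output is shown
   below. The names Hit and parse_tabular_hit are hypothetical. The
   exact spelling of the expected size range inside the SIZE/MIN..MAX
   field is not assumed: the sketch only splits that field at '/' and
   keeps the range as text. The sample line is invented for
   illustration and is not real e-PCR output.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    sequence_id: str
    sts_id: str
    strand: str
    pos1: int
    pos2: int
    observed_size: int
    expected_range: str   # kept as text, e.g. the MIN..MAX part
    mismatches: int
    gaps: int
    description: str


def parse_tabular_hit(line: str) -> Optional[Hit]:
    """Parse one tab-separated hit line; return None for blank lines
    and for comment lines starting with '#' (format 4 alignments)."""
    line = line.rstrip("\r\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated fields")
    observed, _, expected = fields[5].partition("/")
    return Hit(fields[0], fields[1], fields[2],
               int(fields[3]), int(fields[4]),
               int(observed), expected,
               int(fields[6]), int(fields[7]),
               "\t".join(fields[8:]))


if __name__ == "__main__":
    sample = "seq1\tSTS_A\t+\t101\t250\t150/100..400\t1\t0\tsome description"
    print(parse_tabular_hit(sample))
```

   Comparing observed_size with pos2 - pos1 + 1 is one way to spot hits
   whose primers carry floppy tails.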

Exit codes

   Zero on success, nonzero on failure
     _________________________________________________________________

Reverse e-PCR

Example

work> famap -tN -b genome.famap org/chr_*.fa
work> fahash -b genome.hash -w 12 -f3 ${PWD}/genome.famap
work> re-PCR -s genome.hash -n1 -g1 ACTATTGATGATGA AGGTAGATGTTTTT 120-200

Synopsis


famap [-hV]
famap -b mmapped-file [-t cvt] [fasta-file ...]
famap -d mmapped-file [ord ...]
famap -l mmapped-file [ord ...]
where cvt is one of: off n N nx NX

fahash [-hV]
fahash -b hash-file [build-options] mmapped-file ...
fahash -T hash-file [-o output]

where:
        -b hash-file    Build hash tables (hash-file) from sequence files,
        -T hash-file    Print word usage statistics for hash-file
        -o outfile      Set output file name for -T

build-options:
        -w wordsize     Set word size when building hash tables
        -f period       Set discontiguity when building hash tables
        -k              Skip repeats when building indexfile
        -F min,max      Set watermarks for fragment size (in Mb) for -v1
        -v 1|2          Build file of format version 1 or 2
        -c cachesize    Use cache size cachesize (for -v2)

re-PCR [-hV]
re-PCR -p hash-file [-g gaps] [-n mism] [primer ...]
re-PCR -P hash-file [-g gaps] [-n mism] [primer-file ...]
re-PCR -s hash-file [search-options] [-o output] [left right lo hi [...]]
re-PCR -S hash-file [search-options] [-o output] [-C bcnt] [stsfile ...]

where:
        -p hash-file    Perform primer lookup using hash-file, primers in cmdline
        -P hash-file    Perform primer lookup using hash-file, primers in file
        -s hash-file    Perform STS lookup using hash-file, STSs in cmdline
        -S hash-file    Perform STS lookup using hash-file, STSs in file


search-options:
        -n mism         Set max allowed mismatches per primer for lookup
        -g gaps         Set max allowed indels per primer for lookup
        -m margin       Set variability for STS size for lookup
        -d min-max      Set default STS size (for STSs without size set)
        -r +|-          Enable/disable reverse STS lookup
        -O +|-          Enable/disable syscall optimisation

        -C batchcnt     Set number of STSes per batch
        -o outfile      Set output file name

Description

   Reverse e-PCR (re-PCR) performs STS or primer lookup against a
   sequence database. Two files are required for the database: a
   mmapped-file with sequence data in a fast, randomly accessible format,
   and a hash-file that keeps precalculated positions of all words of
   the sequence database.

   Use famap to build mmapped-file from FASTA files.

   Use fahash to build hash-file, and output word usage statistics.

   Use re-PCR to perform STS and primer searches.

   Discontiguous words are supported by re-PCR as well as contiguous.

Options

Common options

   -V
          Print version, exit after parsing commandline

   -h
          Print help, exit after parsing commandline

famap options

   -b mmapped-file
          Build famap-file from input fasta file(s). If no fasta files
          are set in commandline, use stdin as input.

   -d mmapped-file
          Dump famap-file contents in fasta format. If ord number(s) are
          set, print only sequences with given ordinals.

   -l mmapped-file
          List famap-file sequence identifiers. If ord number(s) are set,
          print only sequences with given ordinals.

   -t cvt-table
          Use compiled-in table to convert input.

        n
                Nucleotides. Allowed characters are [actgACTGnN]. Other
                letters are converted to n or N. Rest of symbols are
                ignored. Case is preserved.

        nx
                Nucleotides with extended ambiguity codes (iupac_na);
                lowercase is allowed. Other letters are converted to n
                or N. Rest of symbols are ignored. Case is preserved.

        N
                Nucleotides. Allowed characters are [ACTGN]. [actgn] are
                converted to uppercase. Other letters are converted to N.
                Rest of symbols are ignored.

        NX
                Nucleotides with extended ambiguity codes (iupac_na);
                lowercase is converted to uppercase. Other letters are
                converted to N. Rest of symbols are ignored.
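
   The behaviour of the n and N tables can be sketched as below. This is
   an illustration written from the descriptions above, not famap's
   compiled-in tables; the function names convert_n and convert_N are
   hypothetical. The nx and NX tables are omitted because the exact set
   of iupac_na codes is not listed here.

```python
BASES = set("ACGTNacgtn")


def convert_n(seq):
    """Table 'n': keep [actgACTGnN], turn other letters into n or N
    (preserving case), and drop every non-letter symbol."""
    out = []
    for ch in seq:
        if not (ch.isascii() and ch.isalpha()):
            continue                      # rest of symbols are ignored
        if ch in BASES:
            out.append(ch)
        else:
            out.append("n" if ch.islower() else "N")
    return "".join(out)


def convert_N(seq):
    """Table 'N': same as 'n', but the result is all uppercase."""
    return convert_n(seq).upper()


if __name__ == "__main__":
    raw = "acgRy-T 5N"
    print(convert_n(raw))
    print(convert_N(raw))
```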

fahash options

   -b hash-file
          Build hash-file for mmapped-file(s).

   -T hash-file
          Dump word usage statistics for hash-file.

   -v version
          Build hash-file of version 1 or 2 (2 is default).

   -w wordsize
          Build hash-file for word wordsize nucleotides long.

   -f wordcnt
          Build hash-file for wordcnt discontiguous words. 1 stands for
          contiguous words.

   -F min,max
          Use memory watermarks (Mbytes) for hash table size (for -v 1).

   -c cachesize
          Set cache size for -v 2.

   -o output-file
          Use output-file for output result of -T.

re-PCR commands

   -p hash-file
          Perform lookup for primers given in commandline.

   -P hash-file
          Perform lookup for primers taken from primers file(s) given in
          commandline.

   -s hash-file
          Perform lookup for STSes given in commandline.

   -S hash-file
          Perform lookup for STSes taken from unists file(s) given in
          commandline.

Search options

   -n mism
          Number of mismatches allowed per primer.

   -g gaps
          Number of gaps allowed per primer.

   -m margin
          Maximal deviation of observed product size to expected STS
          size.

   -d lo-hi
          Set default STS size range - values used for STSs that have no
          size associated in file.

   -r +|-
          Enable|disable flipped STS lookup (default is "enabled").

   -O +|-
          Enable|disable syscall optimisation. Since lookup is i/o
          expensive, enabling this parameter may improve search
          performance diskwise. On the other hand, it takes significantly
          more memory and CPU.

   -C batchcount
          How many STSs from the input file to look up in one pass. May
          affect performance, especially when used with -O +.

   -o output-file
          Use output-file for output.

Output format

   Tab-separated file with the following fields:

For primer lookup

     * Primer ID
     * Sequence ID
     * Strand
     * Hit start
     * Hit end
     * Mismatches
     * Gaps
     * Size

For STS lookup

     * STS ID
     * Sequence ID
     * Strand
     * Hit start
     * Hit end
     * Mismatches
     * Gaps
     * Observed Size/Expected size range

Exit codes

   Zero on success, non-zero on errors

Bugs and features

     * The mmapped-file path is stored in the hash-file exactly as it was
       given in the commandline when the hash-file was built. When
       performing searches, the mmapped-file must therefore be accessible
       under the same name from the current directory.
     * Mmapped-file is a proprietary format, that could be substituted
       with megablast database format, but is not (yet?) for performance
       reasons.
     * If sequence sizes are large, it may be tricky to create database
       with discontiguous words because of memory usage requirements.
       Changing parameter -F (for -v 1) or -c (for -v 2) may help.
     _________________________________________________________________

File formats

   STS database
          A single-tab-separated file (i.e. two tabs in a row mean
          "empty field") with the following fields:

          + STS id (required).
          + First (left) primer (required).
          + Second (right) primer (required).
          + Product size (optional): can be number for strict size, or
            two numbers separated by dash for size range.
          + Additional info, that can be used by applications (optional).

          Primers should be in iupac_na encoding, everything that is not
          ACTG or actg is translated to N or n. Primer sequences should
          be uppercase, unless you want to use file with e-PCR -x+ flag -
          then several first nucleotides of primers may be
          lowercase-masked. If primers are not fully uppercase and you
          don't use -x+ flag, you have to use -u+ flag with e-PCR.
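
   A minimal sketch of reading one line of such an STS database follows.
   The function name parse_sts_line and the shape of the returned
   dictionary are hypothetical. The default_size argument stands in for
   the value given with -d or D=; no built-in default is assumed.

```python
def parse_sts_line(line, default_size=None):
    """Parse one single-tab-separated STS line.

    Returns a dict with id, left and right primers, a (lo, hi) size
    tuple (or default_size when the size field is empty or absent),
    and the list of additional info fields.
    """
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 3 or not all(fields[:3]):
        raise ValueError("STS id and both primers are required")
    sts_id, left, right = fields[:3]
    size_field = fields[3] if len(fields) > 3 else ""
    if not size_field:
        size = default_size                 # two tabs in a row: empty field
    elif "-" in size_field:
        lo, hi = size_field.split("-", 1)   # size range, e.g. 120-200
        size = (int(lo), int(hi))
    else:
        size = (int(size_field), int(size_field))  # strict size
    return {"id": sts_id, "left": left, "right": right,
            "size": size, "info": fields[4:]}


if __name__ == "__main__":
    print(parse_sts_line("STS1\tACTATTGATGATGA\tAGGTAGATGTTTTT\t120-200\tmarker"))
    print(parse_sts_line("STS2\tACGT\tTGCA\t\tnote", default_size=(100, 400)))
```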

   Primers file
          A single-tab-separated file (i.e. two tabs in a row mean
          "empty field") with the following fields:

          + Primer id (required).
          + Primer sequence.
     _________________________________________________________________