.PU
.TH LUCY 1 10/28/2000 "TIGR software" "Sequence Assembly Utilities"
.SH NAME
.B lucy 
\- Assembly Sequence Cleanup Program
.SH SYNOPSIS
.ll +8
.B lucy
[-pass_along min_value max_value med_value]
.br
[-range area1 area2 area3] [-alignment area1 area2 area3]
.br
[-vector vector_sequence_file splice_site_file]
.br
[-cdna [minimum_span maximum_error initial_search_range]] [-keep]
.br
[-size vector_tag_size] [-threshold vector_cutoff]
.br
[-minimum good_sequence_length] [-debug [filename]]
.br
[-output sequence_filename quality_filename]
.br
[-error max_avg_error max_error_at_ends]
.br
[-window window_size max_avg_error 
.br
    [window_size max_avg_error ...]]
.br
[-bracket window_size max_avg_error]
.br
[-quiet] [-inform_me] [-xtra cpu_threads]
.br
.B sequence_file quality_file [2nd_sequence_file]
.SH DESCRIPTION
.I Lucy
is a utility that prepares raw DNA sequence fragments
for sequence assembly,
possibly using the
.I TIGR Assembler.
Raw DNA sequence fragments are obtained from DNA
sequencing machines, such as those from Applied
Biosystems Inc. (ABI).
.I Lucy
accepts three input data files: the
.B sequence file,
the
.B quality file,
and (optionally) a 
.B second sequence file
for comparison purposes. All three files should be in the plain FASTA
format.
.PP
The first sequence file and its accompanying
quality file are obtained from other utility programs, such as
.I phred(1),
which reads the sequencing machine chromatogram outputs and generates
sequence base calls in a sequence file together with a quality
assessment file for each base call made. The optional second
sequence file usually comes directly from the sequencing machine
itself and is used to confirm or improve the quality of the sequences.
.PP
.I Lucy
makes no assumption about the order of sequences in the three input
files. As long as all necessary information can be found, DNA
sequences and quality sequences can be in different orders. A sequence
without its quality assessment companion, or vice versa, will
be reported as an error. The second sequence file may omit
sequences that appear in the first sequence
file. Sequences that appear only in the second sequence file
will be reported and ignored by
.I lucy.
.PP
The operations of 
.I "Lucy"
are divided into 7 phases:
.TP
.B Phase 1:
Count the number of input sequences in the first sequence file, and
create internal data structures accordingly.
.I Lucy
allocates memory dynamically, so there is no preset limit on the size
of input files. The size of input files is only limited by your
available computer memory. 

Note that lucy's static memory requirement grows very slowly with the
number of input sequences, at roughly 60 bytes plus the sequence name
storage for each input sequence. This is because lucy does not store
any sequence data in memory after processing it.  Therefore, for
all practical purposes lucy can handle any number of input
sequences.  The dynamic memory requirement of lucy is proportional to
the length of the longest sequence in the input, not the number of
sequences, but its actual size varies from phase to phase.
.TP
.B Phase 2:
Read all sequence information, including name, length, and positions.
To save memory,
.I lucy
does not load all sequence data into main memory at
once. Instead, it uses direct file addressing to access each
sequence only when it is needed. Therefore, it is very
important that the content of the input files stays unchanged during
runtime.
.TP
.B Phase 3:
Read the quality information for sequences, and compute good quality
regions, i.e., regions within sequences that have higher quality
values and can be trusted to be correct.
.I Lucy
determines a "clean" range which has an average probability of error
(per base) that is no greater than the probability specified by the
-error option (or the default, if -error is not used).  Note that
secondary sequence extension (next step) is performed after quality
trimming.  Because of this, it is possible that the final "clear"
range (after vector trimming) will have a probability of error which
is greater than the specified value.
.TP
.B Phase 4:
Read the second sequence file, compare its sequences to the first
sequence file, and extend good quality regions if they both agree. If
the second sequence file is not provided, this step is skipped. Note
that this sequence comparison phase will not in any way shorten the
good quality region determined by the previous phase; it will only
extend it if possible. It is very important that the second sequence
file does
.I not
come from the same base calling software as the first sequence file,
and is base-called with
.I different
algorithms. Otherwise, if the two sequence files are
identical,
.I lucy
will extend sequences all the way to both ends, and completely defeat
the purpose of the quality trimming done in the previous phase.

Usually,
the first sequence file comes from
.I phred(1)
with the companion quality file, and the second sequence file comes
directly from the original ABI base calling software with the
sequencing machines.
.TP
.B Phase 5:
Locate splice sites on ends of sequences. In this phase,
.I lucy
tries to compare all input sequences against splice site sequences in
a splice site file which defines the vector sequences near the
insertion point on the vector. If splice site sequences are found on
any input sequence, they will be excluded from the good quality
region so that the sequence assembly program will not mistakenly take
them into account when trying to reassemble the sequences. Note that
.I lucy
assumes all input sequences are read in the same direction,
matching the direction of the splice site sequences. Therefore, the
forward and reverse read sequences of a clone should not be mixed
together in a single input file. If such a mixture of forward and
reverse read sequences is unavoidable,
.I lucy
can be run twice to check in both directions, once with the forward
splice site sequences, the other time with the reverse splice site sequences.
See the description of option
.B -vector
below for more details.

By popular demand, a poly-A/T trimming feature has been built into 
.I lucy.
It is designated Phase 5a and is an optional step. See the options
.B -cdna
and
.B -keep
below for details of their usage.
.TP
.B Phase 6:
Remove vector insert sequences. In this phase, all input sequences are
checked against a full length vector sequence in a vector file, and
sequences that are vector inserts themselves will be detected and
removed.
.I Lucy
uses a quick fragment match method to check for vector sequences. Both
the target vector sequence and the input sequences are converted into
fragments (8 to 16 bases long; the default is 10), and matching
fragments are detected quickly. Vector sequences are detected when
they contain more matches to vector fragments in their good quality
region (from which splice site sequences have already been excluded)
than a normal, non-vector sequence could match by chance. The
default cutoff threshold is 20%. A sequence with more than a 20%
match to the vector will be considered a vector insert and discarded.
.TP
.B Phase 7:
Produce output sequences for fragment assembly. In the final phase, 
.I Lucy
produces two output files, the cleaned sequence file with markers for
good quality regions, and a companion quality file. Optionally,
.I lucy
can also generate a cleavage information file (i.e. the good quality
region information) which can be used to update a database.
.PP
Each sequence in the output sequence file begins with a header that
includes its name, the three pass along clone length values for the
fragment assembly program, and left and right markers denoting the
beginning and end of the
.B good
quality, vector free region.  The following is an example of
.I lucy
output:
.PP
>GCCAA03TF 1500 3000 2000 43 490
.br
AGCCAAGTTTGCAGCCCTGCAGGTCGACTCTAGAGGATCCCCAGGATGATCAGCCACATT
.br
GGGACTGAGACACGGCCCAAACTCCTACGGGAGGCAGCAGTGGGGAATCTTGCGCAATGG
.br
GCGAAAGCCTGACGCAGCCATGCCGCGTGAATGATGAAGGTCTTAGGATTGTAAAATTCT
.br
TTCACCGGGGACGATAATGACGGTACCCGGAGAAGAAGCCCCGGCTAACTTCGTGCCAGC
.br
    ...
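.PP
A downstream script could read such a header along the following
lines. This is a minimal Python sketch and not part of lucy: the names
LucyHeader, parse_header and good_region are hypothetical, and treating
the two markers as 1-based inclusive positions is an assumption that
this page does not confirm.
.nf
```python
from typing import NamedTuple, Tuple


class LucyHeader(NamedTuple):
    name: str
    pass_along: Tuple[int, int, int]  # min, max and med clone length values
    left: int                         # left marker of the good region
    right: int                        # right marker of the good region


def parse_header(line: str) -> LucyHeader:
    """Split a header such as '>NAME min max med left right' into fields."""
    fields = line.strip().lstrip(">").split()
    if len(fields) != 6:
        raise ValueError("expected a name followed by five integers")
    min_v, max_v, med_v, left, right = (int(f) for f in fields[1:])
    return LucyHeader(fields[0], (min_v, max_v, med_v), left, right)


def good_region(sequence: str, header: LucyHeader) -> str:
    """Extract the marked region (ASSUMPTION: 1-based, inclusive markers)."""
    return sequence[header.left - 1:header.right]


header = parse_header(">GCCAA03TF 1500 3000 2000 43 490")
print(header.name, header.pass_along, header.left, header.right)
```
.fi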
.SH OPTIONS
Note that
.I lucy
checks only the first letter of each option, so all options below can
be represented by just typing the first letter, e.g. -p for
-pass_along.
.TP
.B \-pass_along min_value max_value med_value
The three pass along values of minimum, maximum and medium clone
lengths are given using this option.
.I Lucy
does not interpret these values; they are used by some sequence
assembly programs, such as
.I TIGR Assembler.
These values are directly copied over to the output sequence file. The
default values are 0, 0, and 0.
.TP
.B \-range area1 area2 area3 
This option is used in combination with the following option
.B -alignment. 
It defines the three splice site checking areas which may need
different strengths of splice site alignment. The quality of the base
calls is usually poor at the beginning of a sequence but gradually
improves when moving into the sequence read. Therefore, when looking
for splice sites, stronger and stronger alignment measurements are
needed to cope with the quality change. The default range values are
40, 60 and 100, i.e., 
.I lucy
will check for splice sites in the first 200 DNA bases. If a splice site
is not found in the first 200 bases, the next 100 (=area3) bases will be
checked, with a total checking length of 300. Once a splice site is
found, the rest of the sequence after the splice site is searched for
the other end of the splice site, if any, to guard against short inserts.
.TP
.B \-alignment area1 area2 area3
This option is used in combination with the previous option.  It
defines the three different alignment strengths for the three
areas. An alignment within each area must be at least as long as the
corresponding value before it is considered a match of the splice
site. Default values are 8, 12 and 16 for the first 40, 60 and 100
bases, respectively.
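
The way the default areas and alignment strengths line up can be
sketched in Python as below. This is an illustration only, not lucy's
code: the function name required_alignment is hypothetical, positions
are taken as 0-based, and reusing the area3 strength for the extra
area3-sized extension is an assumption.
.nf
```python
def required_alignment(position, ranges=(40, 60, 100), strengths=(8, 12, 16)):
    """Return the minimum alignment length required at a 0-based position.

    Returns None when the position lies beyond the checked part of the read.
    """
    if position < 0:
        raise ValueError("position must be non-negative")
    boundary = 0
    for width, strength in zip(ranges, strengths):
        boundary += width  # areas are consecutive: 0-39, 40-99, 100-199
        if position < boundary:
            return strength
    # Extension by another area3 bases when no splice site was found.
    # ASSUMPTION: the area3 strength also applies to this extension.
    if position < boundary + ranges[-1]:
        return strengths[-1]
    return None


for pos in (0, 39, 40, 99, 100, 199, 250, 300):
    print(pos, required_alignment(pos))
```
.fi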
.TP
.B \-vector vector_sequence_file splice_site_file
This option provides the complete vector sequence file and a partial
splice site sequence file.
.I Lucy
expects to see one single (probably long) sequence of the vector that
is used to do cloning in the vector file, and
.B two
splice site sequences
.B before
and
.B after
the insertion point on the vector in the splice site file. The splice
site sequences are usually 150 bases in length, with a 50-base
overlay right around the vector insertion point. Their actual lengths
are not very critical. For example, the following are the PUC18 splice
site sequences that can be used by
.I lucy:

>PUCsplice.for.begin
.br
gattaagttgggtaacgccagggttttcccagtcacgacgttgtaaaacg
.br
acggccagtgccaagcttgcatgcctgcaggtcgactctagaggatcccc
.br
gggtaccgagctcgaattcgtaatcatggtcatagctgtttcctgtgtga
.br
>PUCsplice.for.end
.br
acggccagtgccaagcttgcatgcctgcaggtcgactctagaggatcccc
.br
gggtaccgagctcgaattcgtaatcatggtcatagctgtttcctgtgtga
.br
aattgttatccgctcacaattccacacaacatacgagccggaagcataaa

With two splice site sequences as above,
.I lucy
assumes all sequences from the input files are read in the same
direction as the splice site sequences. If that is not true and the
input consists of sequences from both forward and reverse reads of a
clone, there are two options. One can either separate the
sequences into forward and reverse read sets and run them through
.I lucy
with correct splice site sequences. One can also run
.I lucy
with a combined splice site file with 
.B both 
the forward and reverse splice site sequence pairs. That is, if
.I lucy
sees four splice site sequences, it will assume that a
.B bidirectional
splice site trimming has been ordered. For example, the following
reverse PUC18 splice site sequences can be appended to the forward
splice sequences above to instruct
.I lucy
to do bidirectional trimming:

>PUCsplice.rev.begin
.br
tttatgcttccggctcgtatgttgtgtggaattgtgagcggataacaatt
.br
tcacacaggaaacagctatgaccatgattacgaattcgagctcggtaccc
.br
ggggatcctctagagtcgacctgcaggcatgcaagcttggcactggccgt
.br
>PUCsplice.rev.end
.br
tcacacaggaaacagctatgaccatgattacgaattcgagctcggtaccc
.br
ggggatcctctagagtcgacctgcaggcatgcaagcttggcactggccgt
.br
cgttttacaacgtcgtgactgggaaaaccctggcgttacccaacttaatc

Bidirectional trimming runs about twice as slowly because each
sequence is compared against two sets of splice site sequences when
only one set is actually needed. It is possible that random alignments
with the other (unneeded) set will result in a somewhat shortened good
quality region of a sequence. However, bidirectional trimming can
guarantee that there are no vector fragments in the good quality
region even when the assumed direction of some sequence reads is
wrong.

During
.B phase 6
contaminant removal,
.I lucy 
will automatically reverse complement the full length vector sequence
and check for both its forward and reverse inserts, so only one vector
sequence is needed in the vector file.
.I Lucy
can also get the vector and splice site file names from the
environment variables
.B VECTOR_FILE
and
.B SPLICE_FILE,
if they are not given at the command line.
.TP
.B \-cdna [minimum_span maximum_error initial_search_range]
Since the release of
.I lucy,
many people have requested that a poly-A/T trimming feature be
built into
.I lucy
for the convenience of people doing cDNA sequencing. This option is
added for that purpose. By default 
.I lucy
will not do this step, unless this option is given. This option can be
given alone without any parameter, in which case the default values
will be used, or it can be given with 
.B all
three
parameters. The 
.B minimum_span
defines the minimum length of
.I continuous
poly-A or T before 
.I lucy
believes that it has found them. The 
.B maximum_error
denotes the maximum
number of errors allowed before a new continuous poly-A/T region stops.
If mismatch error count goes beyond maximum_error, then
.I lucy
believes that it has reached the end of the poly-A/T tail/head. Note
that each new continuous poly-A/T region that is more than
minimum_span long will reset the error counter to zero, therefore an
interleaving number of maximum_error followed by minimum_span can keep
the poly-A/T region expanding. The last parameter
.B initial_search_range
denotes the range from the ends of a sequence within which a
minimum_span number of continuous poly-A/T's have to be found, otherwise
.I lucy
believes that it cannot find poly-A/T regions for the sequence. Note
that the ends of the sequence are related to the clear region 
.B after
vector splice sites trimming, not to the actual physical ends of the
sequence. The default values of the three parameters are
minimum_span=10, maximum_error=3 and initial_search_range=50.
.B Warning:
these three parameters have to be set carefully in order to avoid
throwing good sequence away in the middle of a sequence, or letting
poly-A/T regions slip by at the ends of a sequence.
.TP
.B \-keep 
When the
.B -cdna
option is turned on,
.I lucy
will trim all poly-A/T fragments it finds. This is good for EST
clustering purposes, where you don't want the poly-A/T fragments to
stay. However, if you want to see the EST sequence in its entirety to
know its direction, it is not helpful to trim poly-A/T away.
The
.B -keep 
option, when used in combination with the
.B -cdna
option, will preserve the poly-A/T tails/heads at ends of each EST
sequence to keep them as tags indicating the direction of the EST
sequence.
.TP
.B \-size vector_tag_size
This option is used in combination with the following option. It
defines the size of fragments for checking vector using the fragment
matching algorithm. The default value is 10 bases. The range of
acceptable values is 8 to 16.
.TP
.B \-threshold vector_cutoff
This option is used in combination with the previous option. It defines
the threshold of similarity between a sequence and the vector for it
to be considered a vector insert. Since splice sites are not included
in vector screening, any sequence which has a higher than normal
similarity to the vector sequence will be considered a vector itself
and discarded. The default value of cutoff is 20% of the good quality
region.
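
The fragment matching idea behind
.B -size
and
.B -threshold
can be sketched in Python as follows. This is not lucy's
implementation: all function names are hypothetical, and measuring
similarity as the fraction of fragment positions in the good quality
region that also occur in the vector (or its reverse complement) is an
assumption about how the percentage is computed.
.nf
```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def fragments(seq, size):
    """Return the set of all overlapping fragments of the given size."""
    seq = seq.upper()
    return {seq[i:i + size] for i in range(len(seq) - size + 1)}


def vector_match_fraction(good_region, vector, size=10):
    """Fraction of fragment positions in good_region found in the vector."""
    if not 8 <= size <= 16:
        raise ValueError("fragment size must be between 8 and 16")
    # Check both orientations of the vector, as phase 6 is described to do.
    vector_tags = fragments(vector, size) | fragments(
        reverse_complement(vector), size)
    region = good_region.upper()
    total = len(region) - size + 1
    if total <= 0:
        return 0.0
    hits = sum(1 for i in range(total) if region[i:i + size] in vector_tags)
    return hits / total


def is_vector_insert(good_region, vector, size=10, threshold=0.20):
    """Apply the cutoff: more than threshold similarity means vector."""
    return vector_match_fraction(good_region, vector, size) > threshold
```
.fi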
.TP
.B \-minimum good_sequence_length
After all kinds of checking, comparing and trimming, the good region
of a sequence must still be at least the minimum length for it
to be considered useful to the sequence assembly program. We do not
want our sequence assembly program to be bothered by many small,
trashy fragments. The default minimum good sequence length is 100
bases.
.TP
.B \-debug [filename]
This option, if given, tells
.I lucy
to produce a sequence cleavage information file for reference or for
updating the database. The default file name is "lucy.debug", which
can be overridden by the given file name.
.TP
.B \-output sequence_filename quality_filename
This option defines the output sequence and quality file names. If not
given, the default file names
.I lucy
uses are "lucy.seq" and "lucy.qul".
.TP
.B \-error max_avg_error max_error_at_ends
There are three main steps in the quality trimming performed by
.I lucy.
The first involves removing low-quality bases from each end of the
sequence, using the criteria specified by the
.B -bracket
option.  The second involves finding regions of the sequence where 
the probability of error meets all of the criteria specified by the
.B -window
option.  After these regions are found, the third step is to trim
each of them to the largest region having an average probability of
error no greater than the max_avg_error specified by the
.B -error
option.  Finally, the largest region meeting all of the criteria is 
chosen as the final "clean" range.  

Two parameters are specified with this option:  max_avg_error is 
the maximum acceptable average probability
of error over the final clean range.  max_error_at_ends is the maximum
probability of error that is allowed for the 2 bases at each end of the
final clean range.  The defaults are 0.025 and 0.02, respectively, if
.B -error
is not specified.

Note:  A base's estimated probability of error is calculated from the
quality value that is assigned by the base caller.  The quality value (Q)
is defined as:

Q = -10 * log10(Probability of error)
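
Inverting this relation gives the probability of error for a given
quality value, P = 10^(-Q/10); for instance, the default max_avg_error
of 0.025 corresponds to a quality value of about 16. The Python sketch
below illustrates the conversion. The function names are hypothetical
and the code is not taken from lucy. Note that the average is taken
over probabilities, not over quality values.
.nf
```python
import math


def error_probability(quality):
    """Convert a quality value Q into a probability of error."""
    return 10 ** (-quality / 10)


def quality_value(probability):
    """Convert a probability of error back into a quality value Q."""
    if not 0 < probability <= 1:
        raise ValueError("probability must be in (0, 1]")
    return -10 * math.log10(probability)


def average_error(qualities):
    """Average probability of error over a run of quality values."""
    if not qualities:
        raise ValueError("need at least one quality value")
    return sum(error_probability(q) for q in qualities) / len(qualities)


print(error_probability(20))
print(quality_value(0.025))
```
.fi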
.TP
.B \-window window_size max_avg_error [window_size max_avg_error ...]
.br
This option affects the quality trimming of the sequence (see the
description of the
.B -error
option, above).  It specifies one or more window sizes, and a maximum
allowable average probability of error for each of those window sizes.
If more than one window size is specified, they must be specified in
decreasing order by window size.  The maximum number of windows that
may be specified is 20.

.I Lucy
uses a sliding window algorithm to find regions of the sequence within
which the average probability of error, within any window of the
specified size, is no greater than the specified max_avg_error.
Regions which meet all of the specified window criteria are then
trimmed again using the criteria specified by the
.B -error
option, and the final "clean" range is the largest region that meets
all of the criteria.

If the
.B -window
option is not specified, then
.I lucy
uses 2 windows by default, of 50 and 10 bases.  The default maximum
allowable probabilities of error in the two windows are listed
below:

50-base window: 0.08
.br
10-base window:  0.3
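
A sliding window check of this kind can be sketched in Python as
follows. This is an illustration under stated assumptions rather than
lucy's actual algorithm: the function names are hypothetical, regions
are returned as half-open (start, end) index pairs, input shorter than
a window yields no region, and multiple windows are combined by
re-checking each surviving region with the next, smaller window.
.nf
```python
def window_regions(errors, window_size, max_avg_error):
    """Find maximal regions in which every window passes the error limit.

    errors is a list of per-base probabilities of error.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    n = len(errors)
    regions = []
    if n < window_size:
        return regions
    limit = max_avg_error * window_size
    total = sum(errors[:window_size])
    run_start = None
    for start in range(n - window_size + 1):
        if start > 0:
            # Slide the window one base to the right.
            total += errors[start + window_size - 1] - errors[start - 1]
        if total <= limit:
            if run_start is None:
                run_start = start
        elif run_start is not None:
            # The last passing window began at start - 1.
            regions.append((run_start, start - 1 + window_size))
            run_start = None
    if run_start is not None:
        regions.append((run_start, n))
    return regions


def apply_windows(errors, windows=((50, 0.08), (10, 0.3))):
    """Apply each (window_size, max_avg_error) pair in decreasing size."""
    regions = [(0, len(errors))]
    for size, limit in windows:
        refined = []
        for begin, end in regions:
            for a, b in window_regions(errors[begin:end], size, limit):
                refined.append((begin + a, begin + b))
        regions = refined
    return regions
```
.fi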
.TP
.B \-bracket window_size max_avg_error
This option controls the initial quality trimming step, which is the
removal of low-quality bases from both ends of the sequence.
.I Lucy
looks for the first and last window of size window_size having an 
average probability of error no greater than max_avg_error.  The 
subsequence which extends from the first base of the first such window 
to the last base of the last window is then examined further to
find the clean range.  Bases which precede the first window or
follow the last window are excluded from the clean range (so the two
terminal windows bracket the clean range).

The defaults for window_size and max_avg_error are 10 and 0.02.
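
The bracketing step described above can be sketched in Python as
follows. The function name bracket is hypothetical, the result is a
half-open (start, end) index pair, and returning None when no window
qualifies is an assumption; this is not lucy's own code.
.nf
```python
def bracket(errors, window_size=10, max_avg_error=0.02):
    """Return the span from the first to the last acceptable window.

    errors is a list of per-base probabilities of error. The span runs
    from the first base of the first passing window to the last base of
    the last passing window.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    limit = max_avg_error * window_size
    first = last = None
    for start in range(len(errors) - window_size + 1):
        if sum(errors[start:start + window_size]) <= limit:
            if first is None:
                first = start
            last = start
    if first is None:
        return None  # ASSUMPTION: no passing window means no clean range
    return first, last + window_size
```
.fi
Windows between the two terminal windows are not required to pass
here; they are left to the later window and error checks.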
.TP
.B \-quiet
Tells
.I lucy
to shut up and only report serious errors it finds. :)
.TP
.B \-inform_me
Asks
.I lucy
to report sequences by names that have been thrown out due to low
quality values, or salvaged due to comparison to the 2nd sequence file.
.TP
.B \-xtra cpu_threads
.br
If you have multiple CPUs in your computer, you can dramatically
increase lucy's speed by allowing lucy to run multiple execution
.I threads 
concurrently. For example, if you have a dual-CPU computer, you
can give the option
.B -xtra 2
to cut lucy's execution time roughly in half. By default, lucy will run just
one thread if this option is not given. The maximum number of
allowable threads is 32. Note that this option is only available with
the multi-threaded lucy version 1.16p. There is also a 1.16s version
that does not do multi-threading.
.SH ENVIRONMENT
The environment variable
.B VECTOR_FILE
defines the vector file name, and the environment variable
.B SPLICE_FILE
defines the splice site file name. Both variables are used when the
user does not specify them using the
.B -vector
option.
.SH "SEE ALSO"
TIGR_Assembler(1), grim(1), everm.sp(1), phred(1), TraceTuner(1), and
ethyl.pl(1).
.SH BUGS
No known bugs for the program at this moment. Some of the manual pages
mentioned above do not exist.
.SH CAVEATS
The "no bugs" claim above can never be true. This is a new program
built mostly from scratch, and there must be bugs somewhere,
somehow. Please direct all bug reports to the authors.
.SH ACRONYM
.I Lucy
stands for
.B Less Useful Chunks Yank,
an awkward combination of words in order to make it a member of the
family with phred, the base caller, ethyl, the old scripting system
lucy replaced, and ricky, the database linking and communication
driving software of lucy for use in TIGR.
.SH AUTHOR
.I Lucy
was written by Hui-Hsien Chou and Michael Holmes at The Institute for
Genomic Research, with help and suggestions from Granger Sutton, Anna
Glodek, John Scott, and Terry Shea. Michael Holmes is currently
responsible for
.I lucy.
Please direct any suggestions, bug reports, etc. to mholmes@tigr.org.