Version 1.20, 9/3/2008. This version fixes a bug that caused lucy
to fail when there was too much information on the FASTA header lines
in the input files. Lucy had a 256 character buffer for reading
lines from the input files. If any FASTA header lines were longer
than 256 characters, the remaining characters would get read as part
of the FASTA sequence. The solution implemented in this version is
simply to increase the buffer size to 4096 characters. While that
still leaves the potential for the same error to occur with extremely
long header lines, the limitation that this entails seems reasonable,
and this should fix the problem for all practical purposes.
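The failure mode described above can be reproduced with a small sketch.
This is not lucy's code: read_chunks and parse_fasta are hypothetical
helpers that mimic a C-style fixed-size line buffer, on the assumption
that any chunk not starting with '>' is treated as sequence data.

```python
import io

def read_chunks(stream, bufsize):
    """Mimic C fgets: return at most bufsize - 1 characters per call."""
    chunks = []
    while True:
        chunk = stream.readline(bufsize - 1)
        if not chunk:
            break
        chunks.append(chunk)
    return chunks

def parse_fasta(text, bufsize):
    """Naive parser: a chunk starting with '>' is a header, else sequence."""
    records = []
    for chunk in read_chunks(io.StringIO(text), bufsize):
        if chunk.startswith(">"):
            records.append([chunk.strip(), ""])
        elif records:
            records[-1][1] += chunk.strip()
    return records

header = ">seq1 " + "x" * 300          # 306 characters, longer than 256
text = header + "\nACGT\n"
small = parse_fasta(text, 256)         # header tail leaks into the sequence
large = parse_fasta(text, 4096)        # header fits, sequence is clean
```

With the 256 character buffer the tail of the header ends up in the
sequence; with the 4096 character buffer the record is read correctly.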
---------------------------------------------------------------------
Version 1.19, 12/30/2003. This version fixes a bug that could cause
sequences to be rejected incorrectly in the vector detection step
(phase 6). Because of the way that lucy compares "tags" in the
target sequence with "tags" in the vector sequence, some bases in
the target sequence could get counted more than once in the tally
of bases that match the vector sequence. In rare instances, this
could cause the sequence to exceed the minimum threshold for
rejection, as a result of random sequence similarity.
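A minimal sketch of how overlapping tag hits can inflate a base tally.
It assumes that tags are fixed-length substrings and that every hit adds
the full tag length to the count; lucy's actual tag comparison may
differ, and all function names here are hypothetical.

```python
def matching_tag_starts(target, vector, tag_len):
    """Start positions in target whose tag also occurs in the vector."""
    vector_tags = {vector[i:i + tag_len]
                   for i in range(len(vector) - tag_len + 1)}
    return [i for i in range(len(target) - tag_len + 1)
            if target[i:i + tag_len] in vector_tags]

def count_naive(starts, tag_len):
    """Adds tag_len per hit, so overlapping hits count bases repeatedly."""
    return len(starts) * tag_len

def count_unique(starts, tag_len):
    """Counts each matched base position once."""
    covered = set()
    for s in starts:
        covered.update(range(s, s + tag_len))
    return len(covered)

vector = "ACGTACGGTCA"
target = "TTACGTACGGTT"
starts = matching_tag_starts(target, vector, 4)
naive = count_naive(starts, 4)      # exceeds the target length
unique = count_unique(starts, 4)    # bounded by the target length
```

Counting covered positions instead of summing per-hit lengths keeps the
tally from exceeding the number of bases that actually match.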
The reporting of the CLB range in the -debug output file has also
been changed. If the CLB range begins with the first base of the
sequence, then the left coordinate of the CLB range will be reported
as 1 (instead of 0). The range "CLB 0 0" still indicates an empty
CLB range.
---------------------------------------------------------------------
Version 1.18, 2/20/2003. This version fixes a bug that could cause
lucy to crash. If the size of the largest sequence in the batch to
be processed was less than the sum of the alignment ranges (200 by
default), earlier versions of lucy could crash. The alignment ranges
are controlled by the -range parameter.
---------------------------------------------------------------------
Version 1.17, 8/29/02. There are 2 changes in this version. Lucy now
allows up to 40 sequences in the splice file (up to 20 begin/end
pairs). This is to allow for better trimming in complex situations,
such as transposon systems. (40 sequences is overkill, I hope!)
There is also an additional dynamic programming step added for
splice site trimming at the 3' end of the sequence. The first
trimming attempt at the 3' end uses the target sequence up to the
end of the current clear range only. Normally this corresponds to
the end of the good quality range. If that attempt does not find
a hit, then a second attempt is made using N bases before the
end of the current clear range and N bases after, where N is the
value of the third -alignment parameter (16 by default). The purpose
of this additional step is to handle the case where the 3' splice
site begins less than N bases from the end of the good quality range.
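The window used by the second 3' attempt can be sketched as follows.
This is an illustration only: the function name, the 0-based half-open
coordinates, and the clamping to the sequence length are assumptions,
not lucy's implementation.

```python
def second_attempt_window(seq_len, clear_end, n=16):
    """Return (start, stop) covering n bases on each side of clear_end.

    clear_end is the exclusive end of the current clear range, in
    0-based coordinates; n is the third -alignment value (16 by default).
    """
    start = max(0, clear_end - n)
    stop = min(seq_len, clear_end + n)   # cannot run past the sequence
    return start, stop

window = second_attempt_window(500, 400)       # room on both sides
clamped = second_attempt_window(410, 400)      # only 10 bases remain
```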
---------------------------------------------------------------------
Version 1.16, 5/16/01. There was an obvious bug, not in lucy itself
but in the included utility program zapping.awk, which physically
trims sequences after lucy processing. Apparently nobody had used it
until now?!
We also found a bug in lucy itself with regard to poly-A/T trimming.
Previously, lucy attempted to trim poly-A at the beginning of an EST
sequence and poly-T at the end of an EST sequence, in addition to
following the correct strategy, i.e., finding them at the opposite
ends. This meant that lucy looked for both poly-A and poly-T at both
ends, but chose only one of them, not both, at each end as the
dominant target. Normally, if there is a strong poly-T signal at the
beginning or a strong poly-A signal at the end of a sequence, this
redundant checking did not cause a problem because lucy would still
find the correct targets. However, this was not biologically correct
behavior anyway, and it could cause some innocent A's at the beginning
of a sequence and some innocent T's at the end of a sequence to be
trimmed away if there were no poly-T/A signals at either end. This bug
has now been corrected. Because lucy isn't often used to trim EST
sequences in our research environments (we handle mostly genomic DNA),
this bug wasn't found until some colleagues at Iowa State University
tried to use lucy on their EST sequences and noticed this behavior.
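The corrected strategy (poly-T only at the beginning, poly-A only at the
end) can be sketched very roughly as below. This is far simpler than
lucy's real poly-A/T detection: it only recognizes exact runs at the
very ends, and the function name and the min_run threshold are
hypothetical.

```python
def polyat_clear_range(seq, min_run=10):
    """Return (left, right) so that seq[left:right] drops a leading
    poly-T run and a trailing poly-A run of at least min_run bases."""
    seq = seq.upper()
    head = len(seq) - len(seq.lstrip("T"))   # leading T run length
    tail = len(seq) - len(seq.rstrip("A"))   # trailing A run length
    left = head if head >= min_run else 0
    right = len(seq) - (tail if tail >= min_run else 0)
    return left, right

est = "T" * 12 + "GATTACAG" + "A" * 15
trimmed = polyat_clear_range(est)

# A's at the beginning and T's at the end are left alone.
innocent = "A" * 12 + "GCGC" + "T" * 12
untouched = polyat_clear_range(innocent)
```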
Also, at ISU scientists' request, we added a new feature to lucy, the
-keep option. This option, when used in combination with the -cdna
option, will preserve the poly-A/T tails/heads at ends of each EST
sequence to keep them as tags indicating the direction of the EST
sequence. Previously, lucy always trimmed poly-A/T away when given the
-cdna option. The original design was for EST clustering purposes,
where you
don't want the poly-A/T to stay. However, if the researcher wants to
see the EST sequence in its entirety to know its direction, it is not
helpful to trim poly-A/T away. Therefore, now users can just give the
-keep option to tell lucy to keep those poly-A/T tags but still remove
anything else before and/or beyond them.
Finally, there is a cosmetic bug-fix to remove the sequence count
reporting redundancy during phase I of lucy processing. This did not
affect lucy's output before, but it won't hurt to fix it either.
---------------------------------------------------------------------
Version 1.15, 10/29/00. Developed concurrently with version 1.14, this
new version comes with multi-threading capability at the slowest step
of lucy: phase 5 splice site trimming. Lucy can now take advantage of
extra CPUs in a computer to speed up its processing. With two CPUs,
lucy will run roughly two times faster than before. With four CPUs,
lucy will take only a quarter of the runtime of the previous
version. Because more affordable multiple CPU computers are becoming
widely available, this multi-threading feature will be very useful for
organizations that have lots of sequences to process. Functionally,
this version is identical to version 1.14 and should produce exactly
the same output. If you find any difference in the outputs between the
two versions, please report that as a bug to us.
---------------------------------------------------------------------
Version 1.14, 10/29/00. In version 1.12, all splice site trimming in
lucy was made adaptive to the beginning of the quality region so that
there would be no chance of vector fragments in the good quality
region, *but* it became possible to miss some vector fragments in the
low quality region. The rationale was that anything in the low quality
region will be tossed away anyway. However, after that modification
lucy's output became somewhat different from before, particularly in
its CLV summary report.
Improving lucy further proved to be a very difficult job because lucy
already has a very stable code base that strikes a delicate balance
between various parameters. Although the different outputs generated
by version 1.12 should not be considered wrong, our confidence in the
new code is not even remotely comparable to our confidence in the
original code base, which has been in constant use and under scrutiny
at TIGR for over two years. Therefore, we decided to limit the scope
of change and add to lucy only specialized code that deals with
occasional erroneous cases. In other words, we changed the strategy of
the first attempt to find vector splice site fragments back to its
original, fixed and non-adaptive design.
As a result, if you run the new version 1.14 of lucy over most data
sets you may not see any difference in its output at all. However, in
those rare cases where the low quality region extends much further
into the sequence, the vector splice site trimming should now always
be correct, and there should not be any missed vector splice sites in
the good quality region lucy reports. A special test case is included
in this release of lucy to illustrate the idea:
If you run the following command with this version of lucy:
    lucy -v pSPORT1vector pSPORT1splice ARMTM40TR.seq ARMTM40TR.qul \
        -debug ARMTM40TR.info
you will find that lucy reports the vector splice site trimming
position at 279. In previous versions of lucy (you can try), the
vector splice site would be missed because of a much longer than
expected low quality region in this sequence (up to 211bp!).
However, if you run lucy with the standard test command shown in the
README.FIRST file:
    lucy -v PUC19 PUC19splice atie.seq atie.qul atie.2nd -debug lucy.info
you should get exactly the same output as before, i.e., lucy's
behavior has not changed dramatically compared to versions before
1.12. We opted for the final solution of letting lucy do adaptive
vector splice site trimming only in the second (and rare) attempt.
In summary, the new splice site trimming strategy in lucy is to do
"adaptive vector" trimming if the initial attempt to find vector
splice sites in a fixed beginning region fails, or if the low quality
region extends too far into the sequence. After quality assessment,
lucy starts looking for vector splice sites in a fixed region at the
beginning of a sequence. By default, that's the first 200 bases of a
sequence, whether or not that region has been excluded from the "good
quality" region lucy determined earlier. If that search fails, lucy
will then look at the next 100 bases (201-300) for splice sites,
adaptive to the quality region lucy determined, i.e., it will look at
the next 100 bp to the right of the CLN reported in phase 3. For
example, if CLN is 250 (which is rare, of course), then lucy will look
for vector sites between 250 and 350 in the second attempt, instead of
between 201 and 300. The reason for this change is that it is more
important to prevent vector sequences in the good quality region than
to actually find them. Of course, if CLN is 200 or less then lucy will
still behave as before. Note that the trimming regions can be changed
by users; thus if a user sets regions 1, 2 and 3 to 50, 100, and 150
bases, then lucy will look for vector splice sites between 1 and 300
bp in the first attempt, then in the next 150 bp to the right of CLN
or 300, whichever is further to the right.
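The two search windows described above can be written down as a small
sketch. The function names are hypothetical, and the default region
sizes (40, 60, 100) are an assumption chosen only so that they sum to
200 with a third region of 100, as the text implies.

```python
def first_window(ranges=(40, 60, 100)):
    """Fixed first attempt: bases 1 through the sum of all three regions."""
    return 1, sum(ranges)

def second_window(cln, ranges=(40, 60, 100)):
    """Adaptive second attempt: the region-3 bases to the right of CLN
    or of the first window, whichever lies further to the right.
    Returns (start, end): the bases after start, up to and including end.
    """
    start = max(cln, sum(ranges))
    return start, start + ranges[2]

default_first = first_window()                  # first 200 bases
default_second = second_window(150)             # bases 201-300
late_cln = second_window(250)                   # 250 to 350
custom_first = first_window((50, 100, 150))     # 1 to 300
custom_second = second_window(100, (50, 100, 150))
```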
If you don't understand anything I said above, don't worry. Lucy will
work for you as before with default parameters. These modifications
were put into lucy to deal with special cases. Most of the sequence
data will still be handled by lucy the same way and will get the same
or better trimming from lucy. With this latest modification, I have
fixed all bugs in lucy known to date. Please let me know of any new
bugs you find in lucy.
---------------------------------------------------------------------
Version 1.13, 2/10/00. Michael Holmes at TIGR wrote a new quality
assessment and trimming routine for lucy to replace the original
grim.c algorithm which was designed to handle only ABI 377
outputs. This new quality trimmer is much more flexible and allows
lucy to handle new phred and Paracel TraceTuner base caller outputs
from the new ABI 3700 sequencer as well as the old ones. Some earlier
versions of phred produced no non-zero quality values smaller than
15. Previous versions of lucy tend to over-trim sequences from those
earlier versions of phred. Currently, we are using phred version
0.990722.g as our base caller for sequences from the ABI 377
sequencer, and TraceTuner (from Paracel) for sequences from the ABI
3700. We recommend that you use phred version 0.990722.g or later for
377 sequences.
Note that this version was developed independently from version 1.11,
so the changes in version 1.12 were not incorporated into it. However,
in version 1.14 above, we merged all modifications together into a
single code base again. Also, Michael is now a coauthor of lucy
and will be maintaining the quality trimming code.
---------------------------------------------------------------------
Version 1.12, 12/29/99. After over a year of practical use of lucy by
many institutions, I have received several bug reports. Many of the
bugs that were reported to me are actually not lucy's fault but due to
incorrect/incomplete parametric data files users supplied to lucy. In
other words, if lucy is given wrong vector splice files, you cannot
expect it to be able to clean the vector fragments out of your
sequences. My sincere words to users of lucy: thank you for using
lucy, and please read the documentation that comes with lucy before
you put it to work for you.
Two actual modifications were made to lucy in this new revision, one
minor, the other, oh well, also minor, but it can potentially remove
the last known bug of incorrect vector trimming for sequences whose
quality cut-off lies extremely far into the sequence, i.e., beyond
position 200.
The first modification is the relaxation of the sequence names
recognized by lucy. In the past, especially at TIGR, only numbers and
letters were used for sequence names, so lucy was built that way.
After people outside TIGR started using lucy, I heard many complaints
that the underscore character '_' in sequence names was not recognized
by lucy, and that lucy therefore thought there were duplicate names in
the sequences when there were none. I have now changed lucy to treat
everything as part of a sequence name except white space, carriage
return and end-of-line characters. I hope this makes some users' lives
easier, since now you don't have to rename your sequences just because
you want to use lucy.
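The effect of the relaxed name rule can be illustrated with a short
sketch. Both functions are hypothetical: old_name assumes the earlier
behavior simply stopped at the first character that is not a letter or
digit, and new_name takes everything up to the first white space.

```python
import re

def old_name(header):
    """Assumed earlier rule: the name is the leading run of letters/digits."""
    match = re.match(r">([A-Za-z0-9]+)", header)
    return match.group(1) if match else ""

def new_name(header):
    """Relaxed rule: the name is everything up to the first white space."""
    parts = header[1:].split()
    return parts[0] if parts else ""

headers = [">clone_12 len=500\n", ">clone_13 len=480\n"]
old = [old_name(h) for h in headers]   # both collapse to the same name
new = [new_name(h) for h in headers]   # names stay distinct
```

Under the assumed old rule both headers yield the same name, which is
how spurious duplicate-name reports could arise.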
The second improvement is that I now make lucy do "adaptive vector"
trimming rather than the fixed trimming of earlier versions. In the
past, after quality assessment, lucy started looking for vector splice
sites in a fixed region at the beginning of a sequence. By default,
that was the first 200 bases of a sequence, no matter how bad the
first 200 bases were and whether or not that region had been excluded
from the "good quality" region lucy determined earlier. In this new
version, I make that region adaptive to the quality region lucy
determines, i.e., the vector trimming region is centered around the
CLN reported in phase 3. For example, if CLN is 250 (which is rare, of
course), then lucy will look for vector sites between 150 and 350,
assuming the default trimming region settings have not been changed.
Note that the trimming regions can be changed by users; thus if a user
sets regions 1, 2 and 3 to 50, 100 and 150 bases, then for the same
example lucy will look for vector sites between 100 and 400 on the
sequence. That is, regions 1 and 2 are always to the left of the CLN
and region 3 to the right of the CLN. Of course, if CLN is 150 or less
then lucy will look at the first 300 bases, assuming the same user
settings above.
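The version 1.12 window rule can be sketched as below. The function name
is hypothetical, and the default region sizes (40, 60, 100) are an
assumption chosen to be consistent with the 150 to 350 example in the
text.

```python
def adaptive_window(cln, ranges=(40, 60, 100)):
    """Search window around CLN: regions 1 and 2 to the left, region 3
    to the right. Falls back to the fixed window at the start of the
    sequence when CLN is too close to the beginning."""
    left = ranges[0] + ranges[1]
    right = ranges[2]
    if cln <= left:
        return 1, sum(ranges)
    return cln - left, cln + right

default_case = adaptive_window(250)                   # 150 to 350
custom_case = adaptive_window(250, (50, 100, 150))    # 100 to 400
early_cln = adaptive_window(150, (50, 100, 150))      # first 300 bases
```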
---------------------------------------------------------------------
Version 1.11, 12/17/98. Mike Holmes at TIGR found a special case where
the false-positive preventive code was overdoing its job and prevented
an otherwise perfect vector splice site match from being flagged and
trimmed. It only happened when the vector match site ended right after
the search range (200 bp by default). This bug has been corrected.
---------------------------------------------------------------------
Version 1.10, 11/25/98. By popular demand, a poly-A/T trimming
feature has been added to lucy. Since this is a medium scale
improvement that required a new source code file, poly.c, I have
jumped the version number to 1.10 to reflect that difference. Check
the manual page about the new option -cdna for this purpose. I am
going home for
Thanksgiving! :-)
---------------------------------------------------------------------
Version 1.05, 10/14/98. Corrected a leftover CLB!=CLN+CLZ bug when a
low quality sequence is salvaged. This bug does not show up for good
sequences that have high quality values.
---------------------------------------------------------------------
Version 1.04, 8/27/98. Corrected CL? reporting mechanism errors when
sequences have no corresponding CL? data.
---------------------------------------------------------------------
Version 1.03, 8/7/98. Added reporting option -inform so the user
knows which low quality sequences are dropped and salvaged.
---------------------------------------------------------------------
Version 1.02, 8/3/98. Added a salvage effort for low quality
sequences, so that even if they are dropped at the quality assessment
phase, they can still be recycled if they match well to the ABI base
calls. Note that if the phred base calls themselves have been wrongly
supplied as the ABI base calls, you will end up having some really
nasty sequences included. Never provide a secondary (ABI) sequence
which is exactly the phred sequence.
---------------------------------------------------------------------
Version 1.01, 7/27/98. Added support for bidirectional vector
trimming, multi-segment vector trimming, and corrected over-sensitive
cases where a small fragment of false-positive vector causes an
otherwise good sequence to be dropped. Other minor bugs fixed.
---------------------------------------------------------------------
Version 1.00, 6/15/98. The first release version was done. Errors
found during beta testing v.0.99 were corrected.
---------------------------------------------------------------------
Version 0.99, 5/19/98. The first version of lucy was finished. I went
on vacation. :)