1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274
|
TODO list
---------
GENERAL
- Get rid of all 'R CMD check' warnings.
- Re-enable the unit tests that were disabled and add tests for all the
functionalities that are still not covered. There is a lot of them!
Right now we use the RUnit framework for the unit tests but feel free
to switch to testthat or any other framework of your choice.
- An R problem that might be worth reporting: Loading a serialized
XStringViews object without prior loading the Biostrings package seems
to have problems: the object loads, but then calling show() on the object
dispatches on the show() method for List objects (defined in S4Vectors)
instead of the method for XStringViews objects.
BASIC CONTAINERS
- Make sure that validObject() on an XStringSet object is actually performing
an extensive validation of the object with no possibility for false
positives (i.e. objects that are actually invalid get caught). This might
involve looking at validity methods for XStringSet parent/ancestor
classes like XRawList and/or XVectorList (both defined in XVector).
- DNA/RNA/AAStringSet constructor functions:
o DNA/RNA/AAStringSet(x, start=c(1,5), end=c(3,7)) when x is a single
string (character or XString). The result should be a DNA/RNA/AAStringSet
object of the length of 'start' (or 'end').
o Low priority: Make sure that DNA/RNA/AAStringSet(x) takes the shortest
possible path to return 'x' when 'x' is already a DNA/RNA/AAStringSet
_instance_ (i.e. when class(x) == "DNAStringSet" or "RNAStringSet"
or "AAStringSet"). Right now it seems that it goes thru some unnecessary
operations. Does it slow it down in a significant way though? I don't
know and some benchmarking would be nice to have. It's just not a
satisfying situation from a theoretical point of view when a no-op
wastes CPU cycles in performing pointless calculations.
- Maybe define a DNAorRNA virtual class (or NucleotideString) that would be
the common parent of the DNAString and RNAString classes. Advantage of this
is that it allows some code simplification by defining stuff shared by
the DNAString and RNAString classes at the DNAorRNA level. For example,
the two following methods:
setMethod("alphabetFrequency", "DNAString", ...)
setMethod("alphabetFrequency", "RNAString", ...)
can be replaced with a single method:
setMethod("alphabetFrequency", "NucleotideString", ...)
Also this:
if (is(x@subject, "DNAString") || is(x@subject, "RNAString")) ...
can be replaced with:
if (is(x@subject, "NucleotideString")) ...
Etc...
- Improve the MIndex container. It needs to support storage of the nb of
mismatches for each hit. No need to store the locations of those mismatches
though: this would require a huge amount of memory and there are other ways
to retrieve these locations on user demand so maybe it's not worth it.
BASIC OPERATIONS
- Several problems with comparison operators (==, !=, <=, >=, < and >)
when at least one of the two operands is an XString derivative.
Issue https://github.com/Bioconductor/Biostrings/issues/51 is just the
tip of the iceberg :-(
For example:
> DNAString("A") != "A"
Error: C stack usage 7969796 is too close to the limit
> DNAString("A") <= "A"
Error: C stack usage 7969828 is too close to the limit
> DNAString("A") >= "A"
Error in order(...) : unimplemented type 'list' in 'orderVector1'
> DNAString("A") < "A"
Error in order(...) : unimplemented type 'list' in 'orderVector1'
> DNAString("A") > "A"
Error: C stack usage 7969796 is too close to the limit
> DNAString("AAA") == "AAA"
Error in FUN(X[[i]], ...) :
coercion of character object to DNAString didn't preserve its length
> DNAString("AAA") <= "AAA"
Error in FUN(X[[i]], ...) :
coercion of character object to DNAString didn't preserve its length
> BString("AAA") <= "AAA"
Error in FUN(X[[i]], ...) :
coercion of character object to BString didn't preserve its length
> DNAString("A") > DNAString("A")
Error: C stack usage 7969796 is too close to the limit
> DNAString("C") > DNA_ALPHABET
Error in .charToXString(seqtype, x, start, end, width) :
input must be a single non-NA string
etc...
All of these should work and behave the same way as when replacing the
XString derivative with a character vector of length 1.
There will be many 'x <op> y' cases to test:
o 'x' is a B/DNA/RNA/AAString and 'y' is a character vector of
arbitrary length;
o 'x' is a character vector of arbitrary length and 'y' is a
B/DNA/RNA/AAString;
o both 'x' and 'y' are XString derivatives of the same type or not;
o <op> is any of the 6 comparison operators.
So there's a little bit of a combinatorial explosion here but we should
be able to come up with a solid set of unit tests that will cover it all.
UTILITIES
- Move lcprefix()/lcsuffix() out of pmatchPattern.R to a file of their own.
(This stuff needs to belong to the UTILITIES component of the package, not
to the STRING ALIGNMENT component.)
- Add patternFrequency() methods for XStringSet and XStringViews objects.
- Maybe add a new generic that combines a subject and an IRanges (or
IRanges-like) object to return an XStringViews object. The IRanges-like
object could be MIndex object and the method for it would do as if it had
received unlist(MIndex). Currently my problem is that I can't come up with
a good name for such generic :-/ Maybe I could just use views(subject, x) for
this (dispatch would be on x, not subject). And the current views function
could be renamed (or maybe it's not needed at all, maybe a fancy
new("IRanges", ...) could replace it).
STRING MATCHING
- Revisit matchLRPatterns() semantic and improve its performance.
The current behaviour is to return *all* L/R match pairs that satisfy
the search criteria. This leads to poor performance, and, maybe more
importantly, it tends to return redundant match pairs (e.g. 2 match
pairs can overlap, or one can be within the limits of the other).
Maybe, by default, a better semantic would be one similar to what
matchProbePair() does, that is, only match pairs that don't contain
another match pair are returned (the assumption here is that those are
the most relevant match pairs). Also, should L/R match pairs where the
left and right matches overlap be accepted?
- When algo="auto", use "naive_inexact" instead of "shift-or" when max.mismatch
is high (e.g. >= 8, finding the exact cut-value requires some testing).
Correct Robert's RBioinf book reporting that the performance decrease
significantly when max.mismatch becomes to large. Should not be the case
anymore.
- Add some convenience function (e.g. a wrapper to .valid.algos()) to let the
curious user know which algos are available/used for a given search problem.
- PDict()/matchPDict()/countPDict()/whichPDict():
o PDict(), matchPDict(): Support fuzzy matching with variable width
dictionaries.
The trick used internally that consists in splitting the patterns into N+1
Trusted Bands (where N is max.mismatch) could be restricted to a "not
really trusted band" specified by the user (not necessarily a prefix), and
then brute force could be used on the head and tail of this "not really
trusted band". The user would also need a way to specify 2 max.mismatch
values: one for the "not really trusted band" and one for the head/tail.
From a user point of view, it could look something like this:
# Right now the following is not allowed (you cannnot specify both
# 'max.mismatch' and 'tb.end'). Also the names of the
# tb.start/tb.end/tb.width args would need to change because it's
# not about the Trusted Band anymore (strictly speaking):
pdict <- PDict(dict, max.mismatch=2, tb.end=min(width(dict)))
# Then to allow up to 2 mismatches on the "not really trusted band"
# and up to 1 mismatch on the tail ('pdict' has no head):
mi <- matchPDict(pdict, subject, max.mismatch=c(2, 1))
The notion of Trusted Band as it is defined right now would not need
to be exposed anymore and would become an entirely internal thing.
From a user point of view it would be replaced by this more general kind
of constant-width band where a small number of mismatches is allowed.
I need a better name than "not really trusted band" for it.
o Document the "allow mismacthes anywhere in the patterns" feature (activated
via the 'max.mismatch' argument of PDict()).
o Support IUPAC ambiguity letters in the DNAStringSet object passed to
PDict().
o Harris suggestion: treat a max.mismatch value that is strictly between 0
and 1 like an error rate (so that the actual max number of mismatches
adjust to the length of each pattern).
o Patrick's suggestion: give the user the option to make matchPDict() return
directly the coverage of the hits (apparently a common use case). That
would avoid the overhead of storing the hits in the (generally big) MIndex
object first.
Maybe put this in a separate function e.g. coveragePDict().
o _match_tbACtree2() doesn't need to walk until the end of the subject: it
could stop when the number of remaining chars to read is < to the
difference between the depth of the AC tree (i.e. the width of the
Trusted Band) and the current depth. This should speed up matchPDict()
(and family) substantially when the length of the subject is very small.
A typical use case where this could be of great benefit is when finding
the neighbors of a given pattern with e.g.
whichPDict(pdict, pdict[[99]], max.mismatch=2).
o Implement the skip.invalid.patterns arg in PDict() (so the user can build
a PDict object from Solexa data that would contain Ns if he wants, reads
with an N would just be skipped).
o Implement "duplicated" and "patternFrequency" methods for PDict objects
with a head or a tail. Add 'as.prob' arg (default FALSE) to
patternFrequency() like for alphabetFrequency().
o extractAllMatches() fails on a very big MIndex object (can't allocate
vector of size 5.2Gb).
o C code improvement: no need to use temporary storage for 'dups_buf' and
'match_count' in match_pdict.c, store directly in the returned
INTEGER vector.
o MIndex objects: at some point the user will want to be able to combine
"compatible" MIndex objects. 2 MIndex objects are "compatible" if they
are describing 2 set of matches coming from the same original dict and on
the same target (subject). In practice, it will be enough that they have
the same index i.e. they have the same pids() or, if the pids() is NULL,
they have the same length.
Then methods like "union", "rangesect", "setdiff", "setequal", etc...
could be defined. The set operation would be performed between the 2
subsets of matches of each input pattern. Of course, 2 matches are
considered equal if their start/end are the same.
o Make reverseComplement() work on a PDict object.
- Maybe add a Biostrings.Rnw vignette with a short overview of the string
matching/aligning capabilities and a "how to choose the right string
matching/aligning function" diagram.
LOW PRIORITY
------------
- Maybe add a no.match.length argument to gregexpr2() that is FALSE by
default so gregexpr(pattern, text, fixed=TRUE) is more interchangeable
with gregexpr2(pattern, text) and use no.match.length=TRUE in matchPattern's
internal code.
- Look at apse.c in R/src/main for Levenshtein. It's coming from
http://search.cpan.org/dist/String-Approx/
and is what Perl and Python are using.
It should stop and return an error code after some max.distance has been
reached. We definitely want to be able to do this if we're going to use it
on the millions of elements of the head and tail of a TB_PDict object.
- Maybe: add a specific containers for results returned by matchPattern
(and family). Would derive from the XStringViews class with at least one
additional slot, the @call slot (of type "language"), that would contain
the value of match.call(), so that one knows what parameters were supplied
for a given matching task.
- Fix pb with length(x) <- 2 screwing up x if it's an XString or XStringViews
object (prevent people from doing this by defining a length() setter that
retuns an error).
- Still have to think about it (Robert suggestion): make [[ work on "out
of limits" views with a warning (we keep issuing an error only when the
view is all blank).
- Point raised in https://github.com/Bioconductor/Biostrings/issues/93 about
having a sequence to letter conversion for AAStrings (e.g., MetThyGly -> MTG).
Seems to be relatively low interest, but maybe it could be implemeneted as
a helper function in the future.
|