File: TODO

package info (click to toggle)
r-bioc-biostrings 2.74.1%2Bdfsg-2
links: PTS, VCS
area: main
in suites: sid, trixie
size: 13,856 kB
sloc: ansic: 9,172; makefile: 2
file content (274 lines) | stat: -rw-r--r-- 12,948 bytes
TODO list
---------

GENERAL

- Get rid of all 'R CMD check' warnings.

- Re-enable the unit tests that were disabled and add tests for all the
  functionalities that are still not covered. There is a lot of them!
  Right now we use the RUnit framework for the unit tests but feel free
  to switch to testthat or any other framework of your choice.

- An R problem that might be worth reporting: Loading a serialized
  XStringViews object without prior loading the Biostrings package seems
  to have problems: the object loads, but then calling show() on the object
  dispatches on the show() method for List objects (defined in S4Vectors)
  instead of the method for XStringViews objects.

BASIC CONTAINERS

- Make sure that validObject() on an XStringSet object is actually performing
  an extensive validation of the object with no possibility for false
  positives (i.e. objects that are actually invalid get caught). This might
  involve looking at validity methods for XStringSet parent/ancestor
  classes like XRawList and/or XVectorList (both defined in XVector).

- DNA/RNA/AAStringSet constructor functions:
    o DNA/RNA/AAStringSet(x, start=c(1,5), end=c(3,7)) when x is a single
      string (character or XString). The result should be a DNA/RNA/AAStringSet
      object of the length of 'start' (or 'end').
    o Low priority: Make sure that DNA/RNA/AAStringSet(x) takes the shortest
      possible path to return 'x' when 'x' is already a DNA/RNA/AAStringSet
      _instance_ (i.e. when class(x) == "DNAStringSet" or "RNAStringSet"
      or "AAStringSet"). Right now it seems that it goes thru some unnecessary
      operations. Does it slow it down in a significant way though? I don't
      know and some benchmarking would be nice to have. It's just not a
      satisfying situation from a theoretical point of view when a no-op
      wastes CPU cycles in performing pointless calculations.

- Maybe define a DNAorRNA virtual class (or NucleotideString) that would be
  the common parent of the DNAString and RNAString classes. Advantage of this
  is that it allows some code simplification by defining stuff shared by
  the DNAString and RNAString classes at the DNAorRNA level. For example,
  the two following methods:
    setMethod("alphabetFrequency", "DNAString", ...)
    setMethod("alphabetFrequency", "RNAString", ...)
  can be replaced with a single method:
    setMethod("alphabetFrequency", "NucleotideString", ...)
  Also this:
    if (is(x@subject, "DNAString") || is(x@subject, "RNAString")) ...
  can be replaced with:
    if (is(x@subject, "NucleotideString")) ...
  Etc...

- Improve the MIndex container. It needs to support storage of the nb of
  mismatches for each hit. No need to store the locations of those mismatches
  though: this would require a huge amount of memory and there are other ways
  to retrieve these locations on user demand so maybe it's not worth it.

BASIC OPERATIONS

- Several problems with comparison operators (==, !=, <=, >=, < and >)
  when at least one of the two operands is an XString derivative.
  Issue https://github.com/Bioconductor/Biostrings/issues/51 is just the
  tip of the iceberg :-(
  For example:

    > DNAString("A") != "A"
    Error: C stack usage  7969796 is too close to the limit

    > DNAString("A") <= "A"
    Error: C stack usage  7969828 is too close to the limit

    > DNAString("A") >= "A"
    Error in order(...) : unimplemented type 'list' in 'orderVector1'

    > DNAString("A") < "A"
    Error in order(...) : unimplemented type 'list' in 'orderVector1'

    > DNAString("A") > "A"
    Error: C stack usage  7969796 is too close to the limit

    > DNAString("AAA") == "AAA"
    Error in FUN(X[[i]], ...) :
      coercion of character object to DNAString didn't preserve its length

    > DNAString("AAA") <= "AAA"
    Error in FUN(X[[i]], ...) :
      coercion of character object to DNAString didn't preserve its length

    > BString("AAA") <= "AAA"
    Error in FUN(X[[i]], ...) :
      coercion of character object to BString didn't preserve its length

    > DNAString("A") > DNAString("A")
    Error: C stack usage  7969796 is too close to the limit

    > DNAString("C") > DNA_ALPHABET
    Error in .charToXString(seqtype, x, start, end, width) :
      input must be a single non-NA string

    etc...

  All of these should work and behave the same way as when replacing the
  XString derivative with a character vector of length 1.
  There will be many 'x <op> y' cases to test:
    o 'x' is a B/DNA/RNA/AAString and 'y' is a character vector of
      arbitrary length;
    o 'x' is a character vector of arbitrary length and 'y' is a
      B/DNA/RNA/AAString;
    o both 'x' and 'y' are XString derivatives of the same type or not;
    o <op> is any of the 6 comparison operators.
  So there's a little bit of a combinatorial explosion here but we should
  be able to come up with a solid set of unit tests that will cover it all.

UTILITIES

- Move lcprefix()/lcsuffix() out of pmatchPattern.R to a file of their own.
  (This stuff needs to belong to the UTILITIES component of the package, not
  to the STRING ALIGNMENT component.)

- Add patternFrequency() methods for XStringSet and XStringViews objects.

- Maybe add a new generic that combines a subject and an IRanges (or
  IRanges-like) object to return an XStringViews object. The IRanges-like
  object could be MIndex object and the method for it would do as if it had
  received unlist(MIndex). Currently my problem is that I can't come up with
  a good name for such generic :-/ Maybe I could just use views(subject, x) for
  this (dispatch would be on x, not subject). And the current views function
  could be renamed (or maybe it's not needed at all, maybe a fancy
  new("IRanges", ...) could replace it).

STRING MATCHING

- Revisit matchLRPatterns() semantic and improve its performance.
  The current behaviour is to return *all* L/R match pairs that satisfy
  the search criteria. This leads to poor performance, and, maybe more
  importantly, it tends to return redundant match pairs (e.g. 2 match
  pairs can overlap, or one can be within the limits of the other).
  Maybe, by default, a better semantic would be one similar to what
  matchProbePair() does, that is, only match pairs that don't contain
  another match pair are returned (the assumption here is that those are
  the most relevant match pairs). Also, should L/R match pairs where the
  left and right matches overlap be accepted?

- When algo="auto", use "naive_inexact" instead of "shift-or" when max.mismatch
  is high (e.g. >= 8, finding the exact cut-value requires some testing).
  Correct Robert's RBioinf book reporting that the performance decrease
  significantly when max.mismatch becomes to large. Should not be the case
  anymore.

- Add some convenience function (e.g. a wrapper to .valid.algos()) to let the
  curious user know which algos are available/used for a given search problem.

- PDict()/matchPDict()/countPDict()/whichPDict():

  o PDict(), matchPDict(): Support fuzzy matching with variable width
    dictionaries.

    The trick used internally that consists in splitting the patterns into N+1
    Trusted Bands (where N is max.mismatch) could be restricted to a "not
    really trusted band" specified by the user (not necessarily a prefix), and
    then brute force could be used on the head and tail of this "not really
    trusted band". The user would also need a way to specify 2 max.mismatch
    values: one for the "not really trusted band" and one for the head/tail.

    From a user point of view, it could look something like this:

    # Right now the following is not allowed (you cannnot specify both
    # 'max.mismatch' and 'tb.end'). Also the names of the
    # tb.start/tb.end/tb.width args would need to change because it's
    # not about the Trusted Band anymore (strictly speaking):
    pdict <- PDict(dict, max.mismatch=2, tb.end=min(width(dict)))

    # Then to allow up to 2 mismatches on the "not really trusted band"
    # and up to 1 mismatch on the tail ('pdict' has no head):
    mi <- matchPDict(pdict, subject, max.mismatch=c(2, 1))

    The notion of Trusted Band as it is defined right now would not need
    to be exposed anymore and would become an entirely internal thing.
    From a user point of view it would be replaced by this more general kind
    of constant-width band where a small number of mismatches is allowed.
    I need a better name than "not really trusted band" for it.

  o Document the "allow mismacthes anywhere in the patterns" feature (activated
    via the 'max.mismatch' argument of PDict()).

  o Support IUPAC ambiguity letters in the DNAStringSet object passed to
    PDict().

  o Harris suggestion: treat a max.mismatch value that is strictly between 0
    and 1 like an error rate (so that the actual max number of mismatches
    adjust to the length of each pattern).

  o Patrick's suggestion: give the user the option to make matchPDict() return
    directly the coverage of the hits (apparently a common use case). That
    would avoid the overhead of storing the hits in the (generally big) MIndex
    object first.
    Maybe put this in a separate function e.g. coveragePDict().

  o _match_tbACtree2() doesn't need to walk until the end of the subject: it
    could stop when the number of remaining chars to read is < to the
    difference between the depth of the AC tree (i.e. the width of the
    Trusted Band) and the current depth. This should speed up matchPDict()
    (and family) substantially when the length of the subject is very small.
    A typical use case where this could be of great benefit is when finding
    the neighbors of a given pattern with e.g.
    whichPDict(pdict, pdict[[99]], max.mismatch=2).

  o Implement the skip.invalid.patterns arg in PDict() (so the user can build
    a PDict object from Solexa data that would contain Ns if he wants, reads
    with an N would just be skipped).

  o Implement "duplicated" and "patternFrequency" methods for PDict objects
    with a head or a tail. Add 'as.prob' arg (default FALSE) to
    patternFrequency() like for alphabetFrequency().

  o extractAllMatches() fails on a very big MIndex object (can't allocate
    vector of size 5.2Gb).

  o C code improvement: no need to use temporary storage for 'dups_buf' and
    'match_count' in match_pdict.c, store directly in the returned
    INTEGER vector.

  o MIndex objects: at some point the user will want to be able to combine
    "compatible" MIndex objects. 2 MIndex objects are "compatible" if they
    are describing 2 set of matches coming from the same original dict and on
    the same target (subject). In practice, it will be enough that they have
    the same index i.e. they have the same pids() or, if the pids() is NULL,
    they have the same length.
    Then methods like "union", "rangesect", "setdiff", "setequal", etc...
    could be defined. The set operation would be performed between the 2
    subsets of matches of each input pattern. Of course, 2 matches are
    considered equal if their start/end are the same.

  o Make reverseComplement() work on a PDict object.

- Maybe add a Biostrings.Rnw vignette with a short overview of the string
  matching/aligning capabilities and a "how to choose the right string
  matching/aligning function" diagram.

LOW PRIORITY
------------

- Maybe add a no.match.length argument to gregexpr2() that is FALSE by
  default so gregexpr(pattern, text, fixed=TRUE) is more interchangeable
  with gregexpr2(pattern, text) and use no.match.length=TRUE in matchPattern's
  internal code.

- Look at apse.c in R/src/main for Levenshtein. It's coming from
    http://search.cpan.org/dist/String-Approx/
  and is what Perl and Python are using.
  It should stop and return an error code after some max.distance has been
  reached. We definitely want to be able to do this if we're going to use it
  on the millions of elements of the head and tail of a TB_PDict object.

- Maybe: add a specific containers for results returned by matchPattern
  (and family). Would derive from the XStringViews class with at least one
  additional slot, the @call slot (of type "language"), that would contain
  the value of match.call(), so that one knows what parameters were supplied
  for a given matching task.

- Fix pb with length(x) <- 2 screwing up x if it's an XString or XStringViews
  object (prevent people from doing this by defining a length() setter that
  retuns an error).

- Still have to think about it (Robert suggestion): make [[ work on "out
  of limits" views with a warning (we keep issuing an error only when the
  view is all blank).

- Point raised in https://github.com/Bioconductor/Biostrings/issues/93 about 
  having a sequence to letter conversion for AAStrings (e.g., MetThyGly -> MTG).
  Seems to be relatively low interest, but maybe it could be implemeneted as
  a helper function in the future.