TODO file for GNU ptx
Tell <email@example.com> if you feel like volunteering for any
of these ideas, listed more or less in decreasing order of priority.
* Use mmap for swallowing files (this may be wrong if the text is edited in memory).
* Sort keywords intelligently for Latin-1 code. See how to interface
this character set with various output formats. Also, introduce
options to inverse-sort and possibly to reverse-sort.
* Use rx instead of regex.
* Correct the infinite loop using -S '$' or -S '^'.
* Improve speed for Ignore and Only tables. Consider hashing instead
of sorting. Consider playing with obstacks to digest them.
* Provide better handling of format effectors obtained from input, and
also attempt white space compression on output, while still making
full use of the output width.
* See how TeX mode could be made more useful, and if a texinfo mode
would mean something to someone.
* Understand and mimic the `-t' option, if I can.
* Provide multiple language support
Most of the boosting work should go along the line of fast recognition
of multiple and complex boundaries, which define various `languages'.
Each such language has its own rules for words, sentences, paragraphs,
and reporting requests. This is less difficult than I first thought:
. Recognize language modifiers with each option. At least -b, -i, -o,
-W, -S, and also new language switcher options, will have such
modifiers. Modifiers on language switchers will allow or disallow
. Complete the transformation of underlying variables into arrays in
. Implement a heap of positions in the input file. There is one entry
in the heap for each compiled regexp; it is initialized by a re_search
after each regexp compile. Regexps reschedule themselves in the heap
when their position passes while scanning input. In this way, looking
simultaneously for a lot of regexps should not be too inefficient,
once the scanning starts. If this works ok, maybe consider accepting
regexps in Only and Ignore tables.
. Merge boundary processing options into language processing, really
integrating -S processing as a special case. Maybe implement several
levels of boundaries. See how to implement a stack of languages, for
handling quotations. See if more sophisticated references could be
handled as another special case of a language.
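
The heap of positions described above can be sketched roughly as
follows. This is only an illustration in Python, not ptx code: the
function name scan_boundaries and the use of Python's re module in
place of re_search are assumptions made for the example. Each regexp
gets one heap entry holding its next match position, and it
reschedules itself once the scan has passed that position.

```python
import heapq
import re


def scan_boundaries(text, patterns):
    """Yield (start, end, pattern_index) for every match, in input order.

    Hypothetical helper: one heap entry per compiled regexp, keyed on
    the position of its next match.
    """
    compiled = [re.compile(p) for p in patterns]
    heap = []

    # Initial search right after each compile.
    for index, regex in enumerate(compiled):
        match = regex.search(text, 0)
        if match:
            heapq.heappush(heap, (match.start(), index, match.end()))

    while heap:
        start, index, end = heapq.heappop(heap)
        yield start, end, index

        # Reschedule this regexp past its own match.  An empty match
        # (such as '$') must advance by one character, otherwise the
        # same position would be found again forever.
        next_pos = end if end > start else start + 1
        if next_pos <= len(text):
            match = compiled[index].search(text, next_pos)
            if match:
                heapq.heappush(heap, (match.start(), index, match.end()))
```

With this layout only the regexp whose match was just consumed is
searched again, so the cost per boundary is one search plus a heap
update, however many regexps are active.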
* Tackle other aspects, with a longer-term view
. Add options for statistics, frequency lists, referencing, and all
other prescreening tools and subsidiary tasks of concordance
. Develop an interactive mode. Even better, construct a GNU Emacs
interface. I'm looking at Gene Myers <firstname.lastname@example.org> suffix
arrays as a possible implementation along those ideas.
. Implement hooks so that word classification and tagging can be merged
in. See how to effectively hook in lemmatisation or other
morphological features. It is far from clear yet how to
interface this correctly, so some experimentation is mandatory.
. Profile and speed up the whole thing.
. Make it work on small address space machines. Consider three levels
of hugeness for files, and three corresponding algorithms to make
optimal use of memory. The first case is when all the input files and
all the word references fit in memory: this is the case currently
implemented. The second case is when the files cannot fit all together
in memory, but the word references do. The third case is when even
the word references cannot fit in memory.
. There also are subsidiary developments for in-core incremental sort
routines as well as for external sort packages. The need for more
flexible sort packages comes partly from the fact that linguists use
kinds of keys which compare in unusual and more sophisticated ways.
GNU `sort' and `ptx' could evolve together.
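
For the item on Ignore and Only tables, hashing instead of sorting
could look roughly like the sketch below. It is written in Python for
brevity and is not ptx code: the names load_word_table and
keep_keyword, and the fold_case parameter, are hypothetical. It
assumes that when both tables are given, a word is a keyword only if
it is listed in the Only table and not in the Ignore table.

```python
def load_word_table(lines, fold_case=False):
    """Build a hashed word table from an iterable of lines, one word each."""
    words = set()
    for line in lines:
        word = line.strip()
        if word:
            words.add(word.lower() if fold_case else word)
    return words


def keep_keyword(word, only=None, ignore=None, fold_case=False):
    """Return True if word should be kept as a keyword.

    Each membership test is a hash lookup, so it does not depend on
    the tables being sorted.
    """
    key = word.lower() if fold_case else word
    if only is not None and key not in only:
        return False
    if ignore is not None and key in ignore:
        return False
    return True
```

For example, with an Ignore table built from the lines "the" and
"of", keep_keyword("the", ignore=table) is false and
keep_keyword("index", ignore=table) is true.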
outline-regexp: " *[-+*.] \\|"