File: TODO

package info (click to toggle)
pinot 0.95-1
  • links: PTS, VCS
  • area: main
  • in suites: squeeze
  • size: 6,860 kB
  • ctags: 4,511
  • sloc: cpp: 39,387; sh: 9,973; ansic: 4,174; makefile: 657; xml: 372; python: 260
file content (128 lines) | stat: -rw-r--r-- 6,231 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
Documentation
- List what files from libtextcat 2.2 go where ?
- Say where libtextcat 3.0 can (and cannot ;-) be found
- MIME type should be as returned by xdg-utils' 'xdg-mime query filetype ...'
- Try listing names of dependency packages for most distros
- How to trouble-shoot with delve, get to a file with filters and labels
- Explain when indexing and updating are done

General
- Fix the FIXMEs
- Get rid of dead code/classes/methods...
- Advertise service via Rendezvous
- Extend metadata beyond title,location,language,type,timestamp,size
- Don't package gmo files, they are platform dependent
- CLI programs to use tty highlighting if available

Barpanel
- Write a plugin for barpanel http://www.igelle.org/barpanel/

Tokenize
- Allow to cache documents that had to be converted ? eg PDF, MS Word
- Write a PDF filter that handles columns correctly, with poppler ?
- WordPerfect filter with libwpd
- Office filter with libgst
- TeX filter
- HtmlFilter to look for META tags Author, Creator, Publisher and CreationDate
- XmlFilter is slow-ish, rewrite file parsing with the TextReader interface
- Filters should at least return errno when they fail
- Use libpng to extract PNGs' metadata

SQL
- Move history files into the index directories
- Set any PRAGMA ?

Monitor
- Implement support for Solaris FEM

Collect
- Comply with robot stuff defined at http://www.robotstxt.org/
- Harvest mode grabs all pages on a specific site down to a certain depth
- Make User-Agent string configurable
- Make download timeout configurable
- Support for HTML frames
- Curl and NeonDownloader don't share much code

Search
- With engines that provide a redirection URL for results (eg Acoona), it looks like
  the query hitory is not saved/checked correctly
- Make sure Description files' SyndicationRight is not private or closed
- getCloseTerms() should be a search engine method so that WebEngine can use plugins'
 suggestions Url field (http://developer.mozilla.org/en/docs/Supporting_search_suggestions_in_search_plugins)
- Filters with CJKV should work better; supporting quoting would help, eg title:"你好"
- Check Mozdex plugin once it's back up
- Add a plugin for http://arxiv.org/find

Index
- Play around with the XAPIAN_FLUSH_THRESHOLD env var
- MD5 hash to determine on updates whether documents have changed, as done by omindex
- Allow to access remote Xapian indexes tunneled through ssh with xapian-progsrv,
  and make sure ssh will ask passwords with /usr/libexec/openssh/ssh-askpass
- Index Nautilus metadata (http://linuxboxadmin.com/articles/nautilus.php)
- Reverse terms so that left wildcards can be applied ?
- XapianIndex could do with some common code refactoring
- Automatically categorize documents based on MIME type and source into picture, video, etc...
- After indexing or updating a document, a call to getDocumentInfo() shouldn't be necessary
- Labels and the rest of DocumentInfo are handled separately, they shouldn't be
- Indexes have no knowledge of indexId's
- Be ready to catch DatabaseModifiedError exceptions and reopen the index
- Think about security issues, especially when indexes are shared, based on http://plg.uwaterloo.ca/~claclark/fast2005.pdf
- Split "compound_word", index separately and as whole

Mail
- Find out what kind of locking scheme Mozilla uses (POSIX lock ?) and use that
- Index Evolution email (Camel, might be useful for other types actually)
- Index mail headers
- Decypher and use Mozilla's mailbox scheme, eg
  mailbox://mbox_file_name?number=2164959&part=1.2&type=text/plain&filename=portability.txt
- Keep track of attachments and avoid indexing the same file twice
- Mailboxes where all messages are flagged by Mozilla/that are empty are not indexed at all

Daemon
- Allow building without the daemon
- Enable to deactivate D-Bus interface
- Clean up method names
- Prefer ustring to string whenever possible
- Queue unindexing too
- Follow updates to Xesam specs
- The daemon should ask for permission before reindexing, especially if the corpus is large
- What does a first run mean for the daemon ? ie no configuration file
- Daemon should use worker threads' doWork() instead of duplicating code
- Only crawl newly added locations when the configuration changes
- Some D-Bus methods need not returning anything

UI
- Show which threads are running, what they are doing, and allow to stop them
 selectively
- Display search engines icons (Gtk::IconSource::set_filename() and Gtk::Style::render_icon())
- Replace glademm with libglademm ?
- Use unique (http://www.gnome.org/~ebassi/source/) if available
- Either Live Query behaves like a live query (eg results list updated when new
 documents match) or it is renamed to something else to avoid confusion
- When viewing or indexing a result, all rows for that same URL should be updated with
 the Viewed or Indexed icons (the latter after IndexingThread returns)
- Make use of GTKmm 2.10 StatusIcon
- Unknown exceptions in IndexingThread or elsewhere should be logged as errors
- Query expansion should be interactive
- Default cache provider should be configurable
- Unique preferences
- Use gtk2 2.14's gtk_show_uri()
- Changing set group by mode a few times will show index results under engine "xapian", why ?
- getIndexNames() to return ustring's
- Always call getIndexPropertiesByName() with a ustring, store engine names as ustring's
- Live and stored queries shouldn't cap on the number of results but the number of results per page
- For each query group in the results list, show Next and Previous buttons to page through results
- Browse mode to be merged with the new search and page mode

v0.90
- Filters should have a version number so that new versions only reindex documents
 of the given type
- Queries should be cancellable
- Queries should return the top N results first, then the rest
- D-Bus (Simple)Query shouldn't let the bus connection time out before replying
- The query builder for stored queries should be available for live queries too
- Make sure all of Core is localized
- HtmlParser should use CJKVTokenizer's Unicode conversion function
- Status dialog to show time of latest update 
- Send a signal when preferences are modified so that the UI and the daemon can reload them