TODO
----
General development directions
* Support for more databases.
* Support for more transport protocols.
* More APIs, e.g. write a Java class with libudmsearch support.
* Support for huge databases with hundreds or thousands of millions of documents.
* Make it more manageable, i.e. administration tools, etc.
Below are things that may be implemented sometime in the future.
They are given in no particular order. If you want to change the order of
their development, please ask on general@mnogosearch.org.
Search quality and results presentation
---------------------------------------
* Click rank
* Administrator-defined dynamic site priority:
- approved sites which should be displayed at the top of results;
- disapproved sites (e.g. for abuse) which should not be displayed.
* Take into account word context: <b>, <font size="xx">, <big> and so on.
* Optional automatic URL limit by SERVER_NAME variable.
* "Exclude" limits, for example "to search through everything except
given site": ue=http://esite/
* Fuzzy search for accented letters, for example Cyrillic "io" and "ie".
* Regex search
* Rank URLs with long pathnames lower than direct hits on let's say a domain
name with no directory path.
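One way to approach the path-length ranking idea above is to count directory levels in the URL and scale the score down by that count. The sketch below is only an illustration: url_path_depth(), depth_adjusted_score() and the penalty_per_level parameter are hypothetical names, not part of mnoGoSearch.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: count directory levels in the path part of a URL.
   "http://host/" gives 0, "http://host/a/b/c.html" gives 2. */
int url_path_depth(const char *url)
{
    const char *p;
    int depth = 0;

    assert(url != NULL);
    p = strstr(url, "://");
    p = p ? p + 3 : url;              /* skip the scheme, if any */
    p = strchr(p, '/');               /* first slash after the host */
    if (p == NULL)
        return 0;                     /* bare host name, no path */
    for (p++; *p != '\0' && *p != '?' && *p != '#'; p++)
        if (*p == '/')
            depth++;
    return depth;
}

/* Hypothetical penalty: divide the score by (1 + penalty_per_level * depth),
   so a hit on the site root keeps its full score. */
double depth_adjusted_score(double score, int depth, double penalty_per_level)
{
    assert(depth >= 0 && penalty_per_level >= 0.0);
    return score / (1.0 + penalty_per_level * (double) depth);
}
```

With penalty_per_level set to 0 the ranking is unchanged, so such a factor could be made optional.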
Indexing related stuff
----------------------
* Detect clones on site level. Currently it is implemented on page level
only. The idea is to detect that the site being indexed is a mirror of another
site after indexing only several pages, without having to index all pages.
* SPAM clearance.
* Fix the indexer becoming slow when ServerTable is big. This is because
of a full sequential scan. Make an in-memory cache for part of ServerTable.
* FTP digest ls-lR.gz support. For example, ftp://ftp.chg.ru/ls-lR.gz
* Make it possible for external parsers to return converted content
together with headers like Content-Type, Title and so on.
* Exclude autoincrement mode for 'url' table. We have to use CRC32 mode
since it is much faster for indexing and probably would take less space.
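To illustrate the CRC32 mode mentioned above, here is a minimal sketch that derives a 32-bit URL identifier from the URL string. It assumes the common IEEE 802.3 CRC-32 (reflected polynomial 0xEDB88320); whether mnoGoSearch uses exactly this variant is an assumption, and url_id_from_crc32() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320). */
uint32_t crc32_ieee(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    size_t i;
    int bit;

    assert(data != NULL || len == 0);
    for (i = 0; i < len; i++) {
        crc ^= p[i];
        for (bit = 0; bit < 8; bit++)
            /* XOR in the polynomial only when the low bit is set */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Hypothetical helper: use the CRC of the URL text as the record id,
   instead of asking the database for an autoincrement value. */
uint32_t url_id_from_crc32(const char *url)
{
    assert(url != NULL);
    return crc32_ieee(url, strlen(url));
}
```

The id can be computed without a database round trip, which is where the indexing speedup would come from. Distinct URLs can produce the same CRC, so a real implementation needs a policy for collisions.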
Charset related stuff
---------------------
* Remove the "ForceIISCharset1251 yes/no" command. Replace it with an
enhanced "CharsetByServer <charset> <regexp> [<regexp>...]"
command.
* Stateful character sets support: UTF-7, Asian ISO-2022-XX
and others. They will not be used as a LocalCharset because
of much space, however indexer should be able to index them,
as well as search frontend should be able to use them as
a BrowserCharset.
Misc
----
* Smart search results cache cleaning after reindexing.
* Make it possible to set table names in indexer.conf and search.htm
* There was a discussion about word separators back in January; see
http://www.mail-archive.com/udmsearch%40web.izhcom.ru/msg00200.html.
* Learn about Dublin Core, a simple set of standard metadata for web pages.
http://www.searchtools.com/related/metadata.html#dc
* Add curl library support.
* Rewrite mirroring functions. Make it possible to optionally store whole
document, not only MaxDocSize.
Portability and code quality
----------------------------
Remove warnings on various platforms. Currently it builds without
warnings on Linux and FreeBSD with these CFLAGS:
-Wall
-Wconversion
-Wshadow
-Wpointer-arith
-Wcast-qual
-Wcast-align
-Wwrite-strings
-Waggregate-return
-Wstrict-prototypes
-Wmissing-prototypes
-Wmissing-declarations
-Wredundant-decls
-Wnested-externs
-Wlong-long
-Winline
However, compilers on some other platforms do produce warnings.
For example, mixed signed/unsigned chars on NetBSD Alpha compiler.
Please report those warnings to general@mnogosearch.org!
Documentation
-------------
* Constantly improve it!
* PDF version.
Things that will most likely be done in 3.3 (in no particular order)
--------------------------------------------------------------------
1. Better relevancy
- DONE: separate word enumeration for each section
and add number_of_words_in_this_section into coord, i.e.
"section + position_inside_section + number_of_words_in_this_section"
instead of
"section + position_inside_document"
Note, number_of_words_in_this_section doesn't need to be exact,
it can be approximate, to save space.
- DONE: Use number_of_words_in_this_section in relevancy formula,
i.e. be close to the classic TF*IDF rank algorithm.
- DONE: better "body" capacity (get rid of "64K words in body" limit)
and more sections (get rid of "256 sections" limit).
It can be done using dynamic encoding, e.g.
128 sections with 256*256*256 words plus
128*256 sections with 256*256 words.
- DONE: Change MinCoordFactor and MaxCoordFactor to work per-section,
not per document.
- NumSections autodetection from wf
and from secno specifiers, e.g. "body:a b c".
- Find the best combination of the default values for all score commands.
- Link all score commands in the manual,
add a new section "commands affecting score".
- Move processing of DateFactor and RelevancyFactor to searchtool.c
Make sure they're documented.
- Move processing of UserScore/UserScoreFactor to searchtool.c
Make sure they're documented.
2. Cluster
- Res2XML (built-in XML template)
- DONE: XML2Res (to parse built-in XML template)
- DONE: DBAddr http://hostname/path/to/searchxml.cgi
- Make it possible to run search.cgi as a HTTPD server
- Site enumerating without having to talk to each cluster node
(e.g. crc48 or crc56, with direct encoding for short names)
- Clone detection at search time
- Configurable distribution type: by site_id, by seed, etc.
- Add "did you mean?" support into cluster.
3. Extend SQL drivers to use prepared statements in sql.c
- Prepare/Bind/Exec for MySQL
(using mysql_escape_string or hex notation for 4.0,
or using PS API for 4.1 and later)
- Prepare/Bind/Exec for PgSQL
- Prepare/Bind/Exec for Interbase
- Prepare/Bind/Exec for SQLite3
- Prepare/Bind/Exec for CTLib
- Modify sql.c to use Prepare/Bind/Exec for all databases
4. DBType=myinnodb (and maybe for other handler types)
- scripts MySQL with Engine=InnoDB
- true transactional code in sql.c, instead of LOCK TABLE.
- instead of separate DBType, it can perhaps use autodetection
of engines
5. More concurrent "indexer" safety
- test with concurrent indexers with all databases
- set isolation levels or lock tables when running
"indexer -Eblob" to prevent concurrent indexers from updating
tables (especially table "bdicti"),
which would give an inconsistent result in table "bdict".
Oracle:
SELECT FOR UPDATE
LOCK TABLE t1 IN {ROW SHARE|SHARE|EXCLUSIVE} MODE
SET TRANSACTION
DB2:
SELECT .. FOR {READ ONLY|FETCH ONLY|UPDATE [OF column [, column]*]}
LOCK TABLE t1 IN {SHARE|EXCLUSIVE} MODE
SET TRANSACTION
PostgreSQL:
SELECT FOR UPDATE
LOCK [ TABLE ] name [, ...] [ IN lockmode MODE ] [ NOWAIT ]
lockmode ::= ACCESS SHARE | ROW SHARE | ROW EXCLUSIVE
| SHARE UPDATE EXCLUSIVE | SHARE
| SHARE ROW EXCLUSIVE | EXCLUSIVE | ACCESS EXCLUSIVE
SET TRANSACTION ISOLATION LEVEL
MSSQL:
SET TRANSACTION ISOLATION LEVEL
Hints in SELECT statement: UPDLOCK, XLOCK, TABLOCK
SELECT...table_name (TABLOCK) - share mode
SELECT...table_name (TABLOCK REPEATABLEREAD) - exclusive mode
SELECT...table_name (TABLOCKX) - lock until the end of trans
SELECT FOR UPDATE allowed only for DECLARE CURSOR.
An exclusive lock can be placed on a SQL Server table with
the SELECT..table_name (TABLOCKX) statement.
This statement requests an exclusive lock on a table.
It is used to prevent others from reading or updating
the table and is held until the end of the command or transaction.
It is similar in function to the Oracle
LOCK TABLE..IN EXCLUSIVE MODE statement.
Sybase:
SELECT FOR UPDATE
LOCK TABLE table-name IN { SHARE | EXCLUSIVE } MODE
sa_locks - Displays all locks in the database.
FOR UPDATE can not be used in a SELECT which is not part of the
declaration of a cursor or which is not inside a stored procedure.
Mimer:
SELECT FOR UPDATE - is not allowed for a read-only cursor
SET TRANSACTION ISOLATION LEVEL
6. More multithread safety
- test with multiple threads
- DONE: better robots.txt locking
Currently all threads are waiting for a single thread
to fetch the robots.txt file, regardless of host name.
It can be done by implementing a shared array
of "robots.txt currently being fetched".
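The shared array of "robots.txt currently being fetched" from item 6 could look roughly like the sketch below. A thread waits only while another thread is fetching robots.txt for the same host; threads working on other hosts proceed at once. All names and the fixed table sizes are hypothetical, and this is not the code that was actually committed.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define MAX_INFLIGHT 64    /* assumed upper bound on simultaneous fetches */
#define MAX_HOST     256   /* assumed maximum host name length, with NUL */

static pthread_mutex_t inflight_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  inflight_cond  = PTHREAD_COND_INITIALIZER;
static char inflight[MAX_INFLIGHT][MAX_HOST];   /* "" marks a free slot */

/* Caller must hold inflight_mutex. Returns slot index or -1. */
static int find_slot(const char *host)
{
    int i;
    for (i = 0; i < MAX_INFLIGHT; i++)
        if (inflight[i][0] != '\0' && strcmp(inflight[i], host) == 0)
            return i;
    return -1;
}

/* Register the caller as the fetcher for this host, waiting first if
   another thread is already fetching for the same host.
   Returns 0 on success, -1 if the table is full or the name is too long.
   A real caller would re-check its robots cache after a wait. */
int robots_fetch_begin(const char *host)
{
    int i, rc = -1;

    assert(host != NULL && host[0] != '\0');
    if (strlen(host) >= MAX_HOST)
        return -1;
    pthread_mutex_lock(&inflight_mutex);
    while (find_slot(host) >= 0)
        pthread_cond_wait(&inflight_cond, &inflight_mutex);
    for (i = 0; i < MAX_INFLIGHT; i++) {
        if (inflight[i][0] == '\0') {
            strcpy(inflight[i], host);   /* length checked above */
            rc = 0;
            break;
        }
    }
    pthread_mutex_unlock(&inflight_mutex);
    return rc;
}

/* Mark the fetch as finished and wake up any waiting threads. */
void robots_fetch_end(const char *host)
{
    int i;

    assert(host != NULL);
    pthread_mutex_lock(&inflight_mutex);
    i = find_slot(host);
    if (i >= 0)
        inflight[i][0] = '\0';
    pthread_cond_broadcast(&inflight_cond);
    pthread_mutex_unlock(&inflight_mutex);
}

/* Returns 1 if some thread is fetching robots.txt for this host. */
int robots_fetch_in_progress(const char *host)
{
    int found;

    assert(host != NULL);
    pthread_mutex_lock(&inflight_mutex);
    found = find_slot(host) >= 0;
    pthread_mutex_unlock(&inflight_mutex);
    return found;
}
```

A linear scan under one mutex is enough for a small number of threads; a hash table keyed by host name would be the next step if the table grows.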
7. DBMode=blob improvements
- RENAME TABLE for more databases
MSSQL, Sybase:
[EXEC] sp_rename t1,t2
SELECT * INTO t1 FROM t2 WHERE 1=0; - copy structure (without indexes)
Oracle:
CREATE TABLE t2 AS SELECT field FROM t1 WHERE 1=0; -- does not copy idx
ALTER TABLE t1 RENAME TO t2;
RENAME t1 TO t2
PostgreSQL:
CREATE TABLE t2 (LIKE t1); -- does not copy indexes
CREATE TABLE t2 (LIKE t1 INCLUDING INDEXES); -- copy indexes
ALTER TABLE t1 RENAME TO t2;
DB2:
CREATE TABLE t1 LIKE t2; -- does not copy indexes
RENAME TABLE t1 TO t2
SQLite:
CREATE TABLE t2 (LIKE t1); -- does not copy indexes
Mimer, Interbase: do not seem to have table rename.
- Check a possibility to use VIEWs for those databases
not supporting RENAME
- Partial incremental "indexer -Eblob"
- Configurable choice to run partial or full
"indexer -Eblob", depending on amount
of new data collected.
- Put information from "url" into "bdict" table ???
- Put information from "urlinfo" ???
- Add HIGH_PRIORITY into this query:
SELECT rec_id, site_id, pop_rank, last_mod_time
FROM url WHERE rec_id IN (...)
ORDER BY rec_id
- DBMode=blob for sqlite3?
- DONE: DBMode=blob for Interbase
- DONE: DBMode=rawblob and a mixed blob+rawblob mode (live updates).
8. Database consistency check (and maybe repair) tools
- e.g. report (and/or remove) all bdicti/urlinfo records
which don't have corresponding url records.
- don't put lost url records during "indexer -Eblob" run,
generate warnings if found lost records.
9. Source code and packaging improvements
(see some more info added by svoj in TODO.ru)
- more separate files (e.g. break utils.c)
- dynamically loadable database modules
- build statically linked (platform independent) and
dynamically linked (distribution-specific) RPMs,
FreeBSD packages and so on.
Gentoo: http://www.mnogosearch.org/board/message.php?id=17992
Solaris SPARC: http://www.mnogosearch.org/board/message.php?id=17955
10. mnoGoSearch benchmark suite
- tiny (~1000 documents)
- medium (~10000 documents)
- huge (~1000000 documents)
- with 1,2,3,5,10,20 simultaneous users
- cluster with huge databases on several machines
11. Windows version
- Unix compatible indexer.conf
- UdmEnvWrite() (can be done by a Unix developer)
- GUI for all missing important commands
- GUI for "extra" (i.e. not so important) commands
- package prepared plugins, for example for ispell or external parsers,
to reduce manual actions required from user.
- build MySQL fulltext parser plug-in
12. API improvements (PHP, ASP, Perl) ???
- Stabilize and document C API.
- Put module code into the main tree,
add --with-php, --with-perl, and so on, options to configure.
- Add tests with Perl and PHP modules - find a way to cover
Perl and PHP modules by "msearch-test" tests.
- Add "PHP via COM" frontend example (Windows)
- From Yannick LE NY
http://pecl.php.net/package/mnogosearch, there is an update.
This update corrects:
- Initial PECL release
- fix compiler warnings and errors on 64bit platform
- #34705 (php bugs), disable udm_clear_search_limits when used with mnogosearch 3.2+
*this is a required backward compatibility break*
13. Better internationalization (from Yannick LE NY)
http://www.mnogosearch.org/board/message.php?id=17948
- add i18n templates
- use gettext to i18n the indexer binary help and messages.
- Korean frequency dictionaries ???
See this thread:
http://www.mnogosearch.org/board/message.php?id=17984
http://www.mnogosearch.org/board/message.php?id=18219
- Character set for FTP requests, replies, file listings.
http://www.mnogosearch.org/board/message.php?id=17992
14. Documentation
- Full step-by-step instructions on how to install and configure mnoGoSearch
15. Support for internationalized domain names:
http://en.wikipedia.org/wiki/Internationalized_domain_name
Maybe using GNU IDN Library: Libidn http://www.gnu.org/software/libidn/
16. Misc:
- Single char problem: <font>W</font>ord
http://www.mnogosearch.org/board/message.php?id=17998
- Add 'indexer -Esql -e "SELECT xxx FROM"'
- List all DB software tested with
- Get rid of disk based search result cache,
complete search result cache for all SQL databases,
and document it.
- Write/read cached copies in chunks, for faster excerpts
- Learn about MySQL's "LOAD INDEX INTO CACHE"
- Add FOREACH loop
For the near 3.3.x releases:
---------------------------
- Firebird: Add FIRST/SKIP syntax into UdmTargets's SQL query:
SELECT FIRST 2 * FROM t1;
SELECT SKIP 2 * FROM t1;
SELECT FIRST 2 SKIP 3 * FROM t1;
- Document ServerTable faq:
http://www.mnogosearch.org/board/message.php?id=16226
- Update PECL module at
http://pecl.php.net/package/mnogosearch
- Add PHP module documentation into mnoGoSearch manual
- Fix ispell in PHP module
- Fix CachedCopy in PHP module
- Document "QCache yes" and "search in found"
- $(ndocs) doesn't work with cluster
http://www.mnogosearch.org/bugs/index.php?id=1646
- Make sure "tmplt" variable removed from the manual, or make it work.
- Document rthc:
rthc --use-stdout $1 2>/dev/null
- Document rtfx - nice RTF to XML converter
http://memberwebs.com/nielsen/software/rtfx/
- Document PostgreSQL + unixODBC
- Highlight color #FFFFCA
- Document "indexing performance tips": MySQL log and binary log,
syslog, HoldBadHrefs, etc.
- FIX this problem:
==26951== Thread 79:
==26951== Invalid write of size 1
==26951== at 0x4006181: strncat (mc_replace_strmem.c:218)
==26951== by 0x405D766: UdmHTMLParseTag (parsehtml.c:870)
==26951== by 0x405E407: UdmHTMLParse (parsehtml.c:1119)
==26951== by 0x4019E1D: UdmDocParseContent (indexer.c:1412)
==26951== by 0x401CE11: UdmIndexNextURL (indexer.c:2107)
==26951== by 0x804C18C: thread_main (main.c:887)
==26951== by 0x7153DA: start_thread (in /lib/libpthread-2.5.so)
==26951== by 0x66F06D: clone (in /lib/libc-2.5.so)
==26951== Address 0x73A4A88 is 0 bytes after a block of size 128 alloc'd
==26951== at 0x4005400: malloc (vg_replace_malloc.c:149)
==26951== by 0x405CDA4: UdmHTMLParseTag (parsehtml.c:761)
==26951== by 0x405E407: UdmHTMLParse (parsehtml.c:1119)
==26951== by 0x4019E1D: UdmDocParseContent (indexer.c:1412)
==26951== by 0x401CE11: UdmIndexNextURL (indexer.c:2107)
==26951== by 0x804C18C: thread_main (main.c:887)
==26951== by 0x7153DA: start_thread (in /lib/libpthread-2.5.so)
==26951== by 0x66F06D: clone (in /lib/libc-2.5.so)
- Add reentrant gethostbyname() functions for non-Linux
platforms. See HOWTO-RESOLVE.
- Fix this problem:
http://www.city.otaru.hokkaido.jp/
==4897== Invalid read of size 1
==4897== at 0x40C1EE4: udm_mb_wc_iso2022jp (uconv-eucjp.c:8310)
==4897== by 0x40BC53A: UdmConv (uconv.c:53)
==4897== by 0x405BA05: UdmPrepareWords (parsehtml.c:169)
==4897== by 0x401D74D: UdmIndexNextURL (indexer.c:2251)
==4897== by 0x804C1CF: thread_main (main.c:888)
==4897== by 0x7153DA: start_thread (in /lib/libpthread-2.5.so)
==4897== by 0x493106D: clone (in /lib/libc-2.5.so)
==4897== Address 0x4C33DCC is 0 bytes after a block of size 12 alloc'd
==4897== at 0x4005400: malloc (vg_replace_malloc.c:149)
==4897== by 0x48D001F: strdup (in /lib/libc-2.5.so)
==4897== by 0x406A92C: UdmTextListAdd (textlist.c:36)
==4897== by 0x405E585: UdmHTMLParse (parsehtml.c:1087)
==4897== by 0x4019F4F: UdmDocParseContent (indexer.c:1440)
==4897== by 0x401CF4E: UdmIndexNextURL (indexer.c:2139)
==4897== by 0x804C1CF: thread_main (main.c:888)
==4897== by 0x7153DA: start_thread (in /lib/libpthread-2.5.so)
==4897== by 0x493106D: clone (in /lib/libc-2.5.so)
- Document the new style of clones.
Remove $(CL) variable, add new style code into search.htm.
Document $(nclones) variable.
- Check if it's possible to use CHM decompiler as a parser:
http://www.keyworks.net/keytools.htm
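For the reentrant gethostbyname() item above, one portable option is the POSIX getaddrinfo() call, which POSIX requires to be thread-safe. The sketch below is only an illustration of that option: resolve_ipv4() is a hypothetical name, it handles IPv4 only, and it does not reflect what HOWTO-RESOLVE recommends.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* expose getaddrinfo() in strict modes */
#endif

#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Hypothetical helper: resolve a host name to a dotted IPv4 string.
   Uses no static buffers, so it can be called from several threads.
   Returns 0 on success or a getaddrinfo() error code. */
int resolve_ipv4(const char *host, char *buf, size_t buflen)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;
    const struct sockaddr_in *sin;
    int rc;

    assert(host != NULL && buf != NULL && buflen >= INET_ADDRSTRLEN);
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;        /* IPv4 only in this sketch */
    hints.ai_socktype = SOCK_STREAM;

    rc = getaddrinfo(host, NULL, &hints, &res);
    if (rc != 0)
        return rc;                    /* see gai_strerror(rc) */

    /* Take the first address returned. */
    sin = (const struct sockaddr_in *) res->ai_addr;
    if (inet_ntop(AF_INET, &sin->sin_addr, buf, (socklen_t) buflen) == NULL)
        rc = EAI_SYSTEM;
    freeaddrinfo(res);
    return rc;
}
```

Platforms that lack getaddrinfo() would still need gethostbyname_r() variants or a lock around gethostbyname().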