Indexing Benchmarks

The purpose of this experiment is to test raw indexing speed, using
Reuters-21578, Distribution 1.0 as a test corpus.  As of this writing,
Reuters-21578 is available at:

    http://www.daviddlewis.com/resources/testcollections/reuters21578

The corpus comes packaged as SGML, which means we need to preprocess it so
that our results are not skewed by differences between SGML parsers.  A
simple Perl script, "./extract_reuters.plx", is supplied, which expands the
Reuters articles out into the file system, one article per file, with the
title as the first line of text.  It takes one command line argument: the
location of the un-tarred Reuters collection.

    ./extract_reuters.plx /path/to/reuters_collection
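
To make the preprocessing concrete, here is a rough Perl sketch of the
kind of extraction involved.  It is not the actual extract_reuters.plx:
the output directory and file naming scheme are invented, and it assumes
the standard Reuters-21578 layout of reut2-*.sgm files, each holding
multiple <REUTERS> elements.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $source_dir = $ARGV[0]
        or die "usage: $0 /path/to/reuters_collection\n";
    my $dest_dir = 'extracted_corpus';    # invented output location
    mkdir $dest_dir unless -d $dest_dir;

    my $count = 0;
    for my $sgm_file ( glob "$source_dir/reut2-*.sgm" ) {
        open( my $fh, '<', $sgm_file ) or die "Can't open $sgm_file: $!";
        my $content = do { local $/; <$fh> };    # slurp the whole file
        close $fh;

        # Crude regex matching rather than a real SGML parser --
        # sidestepping parser differences is the point of this step.
        while ( $content =~ m{<REUTERS[^>]*>(.*?)</REUTERS>}gs ) {
            my $article = $1;
            my ($title) = $article =~ m{<TITLE>(.*?)</TITLE>}s;
            my ($body)  = $article =~ m{<BODY>(.*?)</BODY>}s;
            next unless defined $body;

            my $path = sprintf "%s/article%05d.txt", $dest_dir, $count++;
            open( my $out, '>', $path ) or die "Can't write $path: $!";
            my $first_line = defined $title ? $title : '';
            print $out "$first_line\n$body";     # title as first line
            close $out;
        }
    }
    print "Extracted $count articles into $dest_dir/\n";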

File paths are hard-coded, and the apps are assumed to be run from within
the benchmarks/ directory.  Each of the indexing apps takes four optional
command line arguments:

  * The number of documents to index.
  * The number of times to repeat the indexing process.
  * The increment, or number of docs to add during each index writer instance.
  * Whether or not the main text should be stored and highlightable.

    $ perl -Mblib indexers/lucy_indexer.plx \
    > --docs=1000 --reps=6 --increment=10 --store=1

    $ java -server -Xmx500M -XX:CompileThreshold=100 LuceneIndexer \
    > -docs 1000 -reps 6 -increment 10 -store 1

If no command line args are supplied, the apps will index the entire
19,043-article collection once, using a single index writer, and will
neither store nor vectorize the main text.
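
In outline, each indexing app shares the same main loop.  The following
is a rough Perl illustration, not the actual lucy_indexer.plx:
create_writer() and add_doc() are hypothetical stand-ins for the real
library calls, but the loop shows how --reps and --increment interact,
with one index writer instance per increment of documents.

    use strict;
    use warnings;
    use Getopt::Long;
    use List::Util qw( min );
    use Time::HiRes qw( gettimeofday tv_interval );

    # Defaults mirror the no-argument behavior described above.
    my $docs      = 19_043;
    my $reps      = 1;
    my $increment = 0;     # 0 => one writer instance for the whole run
    my $store     = 0;

    GetOptions(
        'docs=i'      => \$docs,
        'reps=i'      => \$reps,
        'increment=i' => \$increment,
        'store=i'     => \$store,
    ) or die "bad arguments\n";
    $increment ||= $docs;

    my @secs;
    for my $rep ( 1 .. $reps ) {
        my $t0 = [gettimeofday];
        for ( my $first = 0; $first < $docs; $first += $increment ) {
            my $last   = min( $first + $increment, $docs ) - 1;
            my $writer = create_writer( store => $store );  # hypothetical
            add_doc( $writer, $_ ) for $first .. $last;     # hypothetical
            $writer->finish;
        }
        push @secs, tv_interval($t0);    # elapsed seconds for this rep
    }

    # Hypothetical stubs standing in for the real per-library calls:
    sub create_writer { my %opt = @_; bless {%opt}, 'StubWriter' }
    sub add_doc       { }  # real version feeds one article to the writer
    sub StubWriter::finish { }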

Upon finishing, each app produces a "truncated mean" report: the slowest
25% and fastest 25% of reps are discarded, and the rest are averaged.
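
The truncated mean itself amounts to little more than the following
sketch (the idea, not the apps' actual reporting code):

    use strict;
    use warnings;

    # Truncated mean: sort the per-rep times, drop the fastest 25% and
    # the slowest 25%, and average what survives.
    sub truncated_mean {
        my @sorted = sort { $a <=> $b } @_;
        my $chop   = int( @sorted / 4 );            # 25% off each end
        my @kept   = @sorted[ $chop .. $#sorted - $chop ];
        my $sum    = 0;
        $sum += $_ for @kept;
        return $sum / @kept;
    }

    # With 6 reps, the single fastest and slowest runs are discarded:
    printf "truncated mean: %.2f secs\n",
        truncated_mean( 11.2, 11.8, 11.9, 12.0, 12.1, 14.2 );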