File: facets.rst

package info (click to toggle)
python-whoosh 2.7.4%2Bgit6-g9134ad92-4
  • links: PTS, VCS
  • area: main
  • in suites: buster
  • size: 3,648 kB
  • sloc: python: 38,517; makefile: 118
file content (771 lines) | stat: -rw-r--r-- 28,315 bytes parent folder | download | duplicates (4)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
====================
Sorting and faceting
====================

.. note::
    The API for sorting and faceting changed in Whoosh 3.0.

Overview
========

Sorting and faceting search results in Whoosh is based on **facets**. Each
facet associates a value with each document in the search results, allowing you
to sort by the keys or use them to group the documents. Whoosh includes a variety
of **facet types** you can use for sorting and grouping (see below).


Sorting
=======

By default, the results of a search are sorted with the highest-scoring
documents first. You can use the ``sortedby`` keyword argument to order the
results by some other criteria instead, such as the value of a field.


Making fields sortable
----------------------

In order to sort on a field, you should create the field using the
``sortable=True`` keyword argument::

    schema = fields.Schema(title=fields.TEXT(sortable=True),
                           content=fields.TEXT,
                           modified=fields.DATETIME(sortable=True)
                           )

It's possible to sort on a field that doesn't have ``sortable=True``, but this
requires Whoosh to load the unique terms in the field into memory. Using
``sortable`` is much more efficient.


About column types
------------------

When you create a field using ``sortable=True``, you are telling Whoosh to store
per-document values for that field in a *column*. A column object specifies the
format to use to store the per-document values on disk.

The :mod:`whoosh.columns` module contains several different column object
implementations. Each field type specifies a reasonable default column type (for
example, the default for text fields is :class:`whoosh.columns.VarBytesColumn`,
the default for numeric fields is :class:`whoosh.columns.NumericColumn`).
However, if you want maximum efficiency you may want to use a different column
type for a field.

For example, if all document values in a field are a fixed length, you can use a
:class:`whoosh.columns.FixedBytesColumn`. If you have a field where many
documents share a relatively small number of possible values (an example might
be a "category" field, or "month" or other enumeration type fields), you might
want to use :class:`whoosh.columns.RefBytesColumn` (which can handle both
variable and fixed-length values). There are column types for storing
per-document bit values, structs, pickled objects, and compressed byte values.

To specify a custom column object for a field, pass it as the ``sortable``
keyword argument instead of ``True``::

    from whoosh import columns, fields

    category_col = columns.RefBytesColumn()
    schema = fields.Schema(title=fields.TEXT(sortable=True),
                           category=fields.KEYWORD(sortable=category_col)


Using a COLUMN field for custom sort keys
-----------------------------------------

When you add a document with a sortable field, Whoosh uses the value you pass
for the field as the sortable value. For example, if "title" is a sortable
field, and you add this document::

    writer.add_document(title="Mr. Palomar")

...then ``Mr. Palomar`` is stored in the field column as the sorting key for the
document.

This is usually good, but sometimes you need to "massage" the sortable key so
it's different from the value the user searches and/or sees in the interface.
For example, if you allow the user to sort by title, you might want to use
different values for the visible title and the value used for sorting::

    # Visible title
    title = "The Unbearable Lightness of Being"

    # Sortable title: converted to lowercase (to prevent different ordering
    # depending on uppercase/lowercase), with initial article moved to the end
    sort_title = "unbearable lightness of being, the"

The best way to do this is to use an additional field just for sorting. You can
use the :class:`whoosh.fields.COLUMN` field type to create a field that is not
indexed or stored, it only holds per-document column values::

    schema = fields.Schema(title=fields.TEXT(stored=True),
                           sort_title=fields.COLUMN(columns.VarBytesColumn())
                           )

The single argument to the :class:`whoosh.fields.COLUMN` initializer is a
:class:`whoosh.columns.ColumnType` object. You can use any of the various
column types in the :mod:`whoosh.columns` module.

As another example, say you are indexing documents that have a custom sorting
order associated with each document, such as a "priority" number::

    name=Big Wheel
    price=100
    priority=1

    name=Toss Across
    price=40
    priority=3

    name=Slinky
    price=25
    priority=2
    ...

You can use a column field with a numeric column object to hold the "priority"
and use it for sorting::

    schema = fields.Schema(name=fields.TEXT(stored=True),
                           price=fields.NUMERIC(stored=True),
                           priority=fields.COLUMN(columns.NumericColumn("i"),
                           )

(Note that :class:`columns.NumericColumn` takes a type code character like the
codes used by Python's ``struct`` and ``array`` modules.)


Making existing fields sortable
-------------------------------

If you have an existing index from before the ``sortable`` argument was added
in Whoosh 3.0, or you didn't think you needed a field to be sortable but now
you find that you need to sort it, you can add "sortability" to an existing
index using the :func:`whoosh.sorting.add_sortable` utility function::

    from whoosh import columns, fields, index, sorting

    # Say we have an existing index with this schema
    schema = fields.Schema(title=fields.TEXT,
                           price=fields.NUMERIC)

    # To use add_sortable, first open a writer for the index
    ix = index.open_dir("indexdir")
    with ix.writer() as w:
        # Add sortable=True to the "price" field using field terms as the
        # sortable values
        sorting.add_sortable(w, "price", sorting.FieldFacet("price"))

        # Add sortable=True to the "title" field using the
        # stored field values as the sortable value
        sorting.add_sortable(w, "title", sorting.StoredFieldFacet("title"))

You can specify a custom column type when you call ``add_sortable`` using the
``column`` keyword argument::

    add_sortable(w, "chapter", sorting.FieldFacet("chapter"),
                 column=columns.RefBytesColumn())

See the documentation for :func:`~whoosh.sorting.add_sortable` for more
information.


Sorting search results
----------------------

When you tell Whoosh to sort by a field (or fields), it uses the per-document
values in the field's column as sorting keys for the documents.

Normally search results are sorted by descending relevance score. You can tell
Whoosh to use a different ordering by passing the ``sortedby`` keyword argument
to the :meth:`~whoosh.searching.Searcher.search` method::

    from whoosh import fields, index, qparser

    schema = fields.Schema(title=fields.TEXT(stored=True),
                           price=fields.NUMERIC(sortable=True))
    ix = index.create_in("indexdir", schema)

    with ix.writer() as w:
        w.add_document(title="Big Deal", price=20)
        w.add_document(title="Mr. Big", price=10)
        w.add_document(title="Big Top", price=15)

    with ix.searcher() as s:
        qp = qparser.QueryParser("big", ix.schema)
        q = qp.parse(user_query_string)

        # Sort search results from lowest to highest price
        results = s.search(q, sortedby="price")
        for hit in results:
            print(hit["title"])

You can use any of the following objects as ``sortedby`` values:

A ``FacetType`` object
    Uses this object to sort the documents. See below for the available facet
    types.

A field name string
    Converts the field name into a ``FieldFacet`` (see below) and uses it to
    sort the documents.

A list of ``FacetType`` objects and/or field name strings
    Bundles the facets together into a ``MultiFacet`` so you can sort by
    multiple keys. Note that this shortcut does not allow you to reverse
    the sort direction of individual facets. To do that, you need to construct
    the ``MultiFacet`` object yourself.

.. note::
    You can use the ``reverse=True`` keyword argument to the
    ``Searcher.search()`` method to reverse the overall sort direction. This
    is more efficient than reversing each individual facet.


Examples
--------

Sort by the value of the size field::

    results = searcher.search(myquery, sortedby="size")

Sort by the reverse (highest-to-lowest) order of the "price" field::

    facet = sorting.FieldFacet("price", reverse=True)
    results = searcher.search(myquery, sortedby=facet)

Sort by ascending size and then descending price::

    mf = sorting.MultiFacet()
    mf.add_field("size")
    mf.add_field("price", reverse=True)
    results = searcher.search(myquery, sortedby=mf)

    # or...
    sizes = sorting.FieldFacet("size")
    prices = sorting.FieldFacet("price", reverse=True)
    results = searcher.search(myquery, sortedby=[sizes, prices])

Sort by the "category" field, then by the document's score::

    cats = sorting.FieldFacet("category")
    scores = sorting.ScoreFacet()
    results = searcher.search(myquery, sortedby=[cats, scores])


Accessing column values
-----------------------

Per-document column values are available in :class:`~whoosh.searching.Hit`
objects just like stored field values::

    schema = fields.Schema(title=fields.TEXT(stored=True),
                           price=fields.NUMERIC(sortable=True))

    ...

    results = searcher.search(myquery)
    for hit in results:
        print(hit["title"], hit["price"])

ADVANCED: if you want to access abitrary per-document values quickly you can get
a column reader object::

    with ix.searcher() as s:
        reader = s.reader()

        colreader = s.reader().column_reader("price")
        for docnum in reader.all_doc_ids():
            print(colreader[docnum])


Grouping
========

It is often very useful to present "faceted" search results to the user.
Faceting is dynamic grouping of search results into categories. The
categories let users view a slice of the total results based on the categories
they're interested in.

For example, if you are programming a shopping website, you might want to
display categories with the search results such as the manufacturers and price
ranges.

==================== =================
Manufacturer         Price
-------------------- -----------------
Apple (5)            $0 - $100 (2)
Sanyo (1)            $101 - $500 (10)
Sony (2)             $501 - $1000 (1)
Toshiba (5)
==================== =================

You can let your users click the different facet values to only show results
in the given categories.

Another useful UI pattern is to show, say, the top 5 results for different
types of found documents, and let the user click to see more results from a
category they're interested in, similarly to how the Spotlight quick results
work on Mac OS X.


The ``groupedby`` keyword argument
----------------------------------

You can use the following objects as ``groupedby`` values:

A ``FacetType`` object
    Uses this object to group the documents. See below for the available facet
    types.

A field name string
    Converts the field name into a ``FieldFacet`` (see below) and uses it to
    sort the documents. The name of the field is used as the facet name.

A list or tuple of field name strings
    Sets up multiple field grouping criteria.

A dictionary mapping facet names to ``FacetType`` objects
    Sets up multiple grouping criteria.

A ``Facets`` object
    This object is a lot like using a dictionary, but has some convenience
    methods to make setting up multiple groupings a little easier.


Examples
--------

Group by the value of the "category" field::

    results = searcher.search(myquery, groupedby="category")

Group by the value of the "category" field and also by the value of the "tags"
field and a date range::

    cats = sorting.FieldFacet("category")
    tags = sorting.FieldFacet("tags", allow_overlap=True)
    results = searcher.search(myquery, groupedby={"category": cats, "tags": tags})

    # ...or, using a Facets object has a little less duplication
    facets = sorting.Facets()
    facets.add_field("category")
    facets.add_field("tags", allow_overlap=True)
    results = searcher.search(myquery, groupedby=facets)

To group results by the *intersected values of multiple fields*, use a
``MultiFacet`` object (see below). For example, if you have two fields named
``tag`` and ``size``, you could group the results by all combinations of the
``tag`` and ``size`` field, such as ``('tag1', 'small')``,
``('tag2', 'small')``, ``('tag1', 'medium')``, and so on::

    # Generate a grouping from the combination of the "tag" and "size" fields
    mf = MultiFacet(["tag", "size"])
    results = searcher.search(myquery, groupedby={"tag/size": mf})


Getting the faceted groups
--------------------------

The ``Results.groups("facetname")`` method  returns a dictionary mapping
category names to lists of **document IDs**::

    myfacets = sorting.Facets().add_field("size").add_field("tag")
    results = mysearcher.search(myquery, groupedby=myfacets)
    results.groups("size")
    # {"small": [8, 5, 1, 2, 4], "medium": [3, 0, 6], "large": [7, 9]}

If there is only one facet, you can just use ``Results.groups()`` with no
argument to access its groups::

    results = mysearcher.search(myquery, groupedby=myfunctionfacet)
    results.groups()

By default, the values in the dictionary returned by ``groups()`` are lists of
document numbers in the same relative order as in the results. You can use the
``Searcher`` object's ``stored_fields()`` method to take a document number and
return the document's stored fields as a dictionary::

    for category_name in categories:
        print "Top 5 documents in the %s category" % category_name
        doclist = categories[category_name]
        for docnum, score in doclist[:5]:
            print "  ", searcher.stored_fields(docnum)
        if len(doclist) > 5:
            print "  (%s more)" % (len(doclist) - 5)

If you want different information about the groups, for example just the count
of documents in each group, or you don't need the groups to be ordered, you can
specify a :class:`whoosh.sorting.FacetMap` type or instance with the
``maptype`` keyword argument when creating the ``FacetType``::

    # This is the same as the default
    myfacet = FieldFacet("size", maptype=sorting.OrderedList)
    results = mysearcher.search(myquery, groupedby=myfacet)
    results.groups()
    # {"small": [8, 5, 1, 2, 4], "medium": [3, 0, 6], "large": [7, 9]}

    # Don't sort the groups to match the order of documents in the results
    # (faster)
    myfacet = FieldFacet("size", maptype=sorting.UnorderedList)
    results = mysearcher.search(myquery, groupedby=myfacet)
    results.groups()
    # {"small": [1, 2, 4, 5, 8], "medium": [0, 3, 6], "large": [7, 9]}

    # Only count the documents in each group
    myfacet = FieldFacet("size", maptype=sorting.Count)
    results = mysearcher.search(myquery, groupedby=myfacet)
    results.groups()
    # {"small": 5, "medium": 3, "large": 2}

    # Only remember the "best" document in each group
    myfacet = FieldFacet("size", maptype=sorting.Best)
    results = mysearcher.search(myquery, groupedby=myfacet)
    results.groups()
    # {"small": 8, "medium": 3, "large": 7}

Alternatively you can specify a ``maptype`` argument in the
``Searcher.search()`` method call which applies to all facets::

    results = mysearcher.search(myquery, groupedby=["size", "tag"],
                                maptype=sorting.Count)

(You can override this overall ``maptype`` argument on individual facets by
specifying the ``maptype`` argument for them as well.)


Facet types
===========

FieldFacet
----------

This is the most common facet type. It sorts or groups based on the
value in a certain field in each document. This generally works best
(or at all) if each document has only one term in the field (e.g. an ID
field)::

    # Sort search results by the value of the "path" field
    facet = sorting.FieldFacet("path")
    results = searcher.search(myquery, sortedby=facet)

    # Group search results by the value of the "parent" field
    facet = sorting.FieldFacet("parent")
    results = searcher.search(myquery, groupedby=facet)
    parent_groups = results.groups("parent")

By default, ``FieldFacet`` only supports **non-overlapping** grouping, where a
document cannot belong to multiple facets at the same time (each document will
be sorted into one category arbitrarily.) To get overlapping groups with
multi-valued fields, use the ``allow_overlap=True`` keyword argument::

    facet = sorting.FieldFacet(fieldname, allow_overlap=True)

This supports overlapping group membership where documents have more than one
term in a field (e.g. KEYWORD fields). If you don't need overlapping, don't
use ``allow_overlap`` because it's *much* slower and uses more memory (see
the secion on ``allow_overlap`` below).


QueryFacet
----------

You can set up categories defined by arbitrary queries. For example, you can
group names using prefix queries::

    # Use queries to define each category
    # (Here I'll assume "price" is a NUMERIC field, so I'll use
    # NumericRange)
    qdict = {}
    qdict["A-D"] = query.TermRange("name", "a", "d")
    qdict["E-H"] = query.TermRange("name", "e", "h")
    qdict["I-L"] = query.TermRange("name", "i", "l")
    # ...

    qfacet = sorting.QueryFacet(qdict)
    r = searcher.search(myquery, groupedby={"firstltr": qfacet})

By default, ``QueryFacet`` only supports **non-overlapping** grouping, where a
document cannot belong to multiple facets at the same time (each document will
be sorted into one category arbitrarily). To get overlapping groups with
multi-valued fields, use the ``allow_overlap=True`` keyword argument::

    facet = sorting.QueryFacet(querydict, allow_overlap=True)


RangeFacet
----------

The ``RangeFacet`` is for NUMERIC field types. It divides a range of possible
values into groups. For example, to group documents based on price into
buckets $100 "wide"::

    pricefacet = sorting.RangeFacet("price", 0, 1000, 100)

The first argument is the name of the field. The next two arguments are the
full range to be divided. Value outside this range (in this example, values
below 0 and above 1000) will be sorted into the "missing" (None) group. The
fourth argument is the "gap size", the size of the divisions in the range.

The "gap" can be a list instead of a single value. In that case, the values in
the list will be used to set the size of the initial divisions, with the last
value in the list being the size for all subsequent divisions. For example::

    pricefacet = sorting.RangeFacet("price", 0, 1000, [5, 10, 35, 50])

...will set up divisions of 0-5, 5-15, 15-50, 50-100, and then use 50 as the
size for all subsequent divisions (i.e. 100-150, 150-200, and so on).

The ``hardend`` keyword argument controls whether the last division is clamped
to the end of the range or allowed to go past the end of the range. For
example, this::

    facet = sorting.RangeFacet("num", 0, 10, 4, hardend=False)

...gives divisions 0-4, 4-8, and 8-12, while this::

    facet = sorting.RangeFacet("num", 0, 10, 4, hardend=True)

...gives divisions 0-4, 4-8, and 8-10. (The default is ``hardend=False``.)

.. note::
    The ranges/buckets are always **inclusive** at the start and **exclusive**
    at the end.


DateRangeFacet
--------------

This is like ``RangeFacet`` but for DATETIME fields. The start and end values
must be ``datetime.datetime`` objects, and the gap(s) is/are
``datetime.timedelta`` objects.

For example::

    from datetime import datetime, timedelta

    start = datetime(2000, 1, 1)
    end = datetime.now()
    gap = timedelta(days=365)
    bdayfacet = sorting.DateRangeFacet("birthday", start, end, gap)

As with ``RangeFacet``, you can use a list of gaps and the ``hardend`` keyword
argument.


ScoreFacet
----------

This facet is sometimes useful for sorting.

For example, to sort by the "category" field, then for documents with the same
category, sort by the document's score::

    cats = sorting.FieldFacet("category")
    scores = sorting.ScoreFacet()
    results = searcher.search(myquery, sortedby=[cats, scores])

The ``ScoreFacet`` always sorts higher scores before lower scores.

.. note::
    While using ``sortedby=ScoreFacet()`` should give the same results as using
    the default scored ordering (``sortedby=None``), using the facet will be
    slower because Whoosh automatically turns off many optimizations when
    sorting.


FunctionFacet
-------------

This facet lets you pass a custom function to compute the sorting/grouping key
for documents. (Using this facet type may be easier than subclassing FacetType
and Categorizer to set up some custom behavior.)

The function will be called with the index searcher and index document ID as
arguments. For example, if you have an index with term vectors::

    schema = fields.Schema(id=fields.STORED,
                           text=fields.TEXT(stored=True, vector=True))
    ix = RamStorage().create_index(schema)

...you could use a function to sort documents higher the closer they are to
having equal occurances of two terms::

    def fn(searcher, docnum):
        v = dict(searcher.vector_as("frequency", docnum, "text"))
        # Sort documents that have equal number of "alfa" and "bravo" first
        return 0 - (1.0 / (abs(v.get("alfa", 0) - v.get("bravo", 0)) + 1.0))

    facet = sorting.FunctionFacet(fn)
    results = searcher.search(myquery, sortedby=facet)


StoredFieldFacet
----------------

This facet lets you use stored field values as the sorting/grouping key for
documents. This is usually slower than using an indexed field, but when using
``allow_overlap`` it can actually be faster for large indexes just because it
avoids the overhead of reading posting lists.

:class:`~whoosh.sorting.StoredFieldFacet` supports ``allow_overlap`` by
splitting the stored value into separate keys. By default it calls the value's
``split()`` method (since most stored values are strings), but you can supply
a custom split function. See the section on ``allow_overlap`` below.


MultiFacet
==========

This facet type returns a composite of the keys returned by two or more
sub-facets, allowing you to sort/group by the intersected values of multiple
facets.

``MultiFacet`` has methods for adding facets::

    myfacet = sorting.RangeFacet(0, 1000, 10)

    mf = sorting.MultiFacet()
    mf.add_field("category")
    mf.add_field("price", reverse=True)
    mf.add_facet(myfacet)
    mf.add_score()

You can also pass a list of field names and/or ``FacetType`` objects to the
initializer::

    prices = sorting.FieldFacet("price", reverse=True)
    scores = sorting.ScoreFacet()
    mf = sorting.MultiFacet(["category", prices, myfacet, scores])


Missing values
==============

* When sorting, documents without any terms in a given field, or whatever else
  constitutes "missing" for different facet types, will always sort to the end.

* When grouping, "missing" documents will appear in a group with the
  key ``None``.


Using overlapping groups
========================

The common supported workflow for grouping and sorting is where the given field
has *one value for document*, for example a ``path`` field containing the file
path of the original document. By default, facets are set up to support this
single-value approach.

Of course, there are situations where you want documents to be sorted into
multiple groups based on a field with multiple terms per document. The most
common example would be a ``tags`` field. The ``allow_overlap`` keyword
argument to the :class:`~whoosh.sorting.FieldFacet`,
:class:`~whoosh.sorting.QueryFacet`, and
:class:`~whoosh.sorting.StoredFieldFacet` allows this multi-value approach.

However, there is an important caveat: using ``allow_overlap=True`` is slower
than the default, potentially *much* slower for very large result sets. This is
because Whoosh must read every posting of every term in the field to
create a temporary "forward index" mapping documents to terms.

If a field is indexed with *term vectors*, ``FieldFacet`` will use them to
speed up ``allow_overlap`` faceting for small result sets, but for large result
sets, where Whoosh has to open the vector list for every matched document, this
can still be very slow.

For very large indexes and result sets, if a field is stored, you can get
faster overlapped faceting using :class:`~whoosh.sorting.StoredFieldFacet`
instead of ``FieldFacet``. While reading stored values is usually slower than
using the index, in this case avoiding the overhead of opening large numbers of
posting readers can make it worthwhile.

``StoredFieldFacet`` supports ``allow_overlap`` by loading the stored value for
the given field and splitting it into multiple values. The default is to call
the value's ``split()`` method.

For example, if you've stored the ``tags`` field as a string like
``"tag1 tag2 tag3"``::

    schema = fields.Schema(name=fields.TEXT(stored=True),
                           tags=fields.KEYWORD(stored=True))
    ix = index.create_in("indexdir")
    with ix.writer() as w:
        w.add_document(name="A Midsummer Night's Dream", tags="comedy fairies")
        w.add_document(name="Hamlet", tags="tragedy denmark")
        # etc.

...Then you can use a ``StoredFieldFacet`` like this::

    ix = index.open_dir("indexdir")
    with ix.searcher() as s:
        sff = sorting.StoredFieldFacet("tags", allow_overlap=True)
        results = s.search(myquery, groupedby={"tags": sff})

For stored Python objects other than strings, you can supply a split function
(using the ``split_fn`` keyword argument to ``StoredFieldFacet``). The function
should accept a single argument (the stored value) and return a list or tuple
of grouping keys.


Using a custom sort order
=========================

It is sometimes useful to have a custom sort order per-search. For example,
different languages use different sort orders. If you have a function to return
the sorting order you want for a given field value, such as an implementation of
the Unicode Collation Algorithm (UCA), you can customize the sort order
for the user's language.

The :class:`whoosh.sorting.TranslateFacet` lets you apply a function to the
value of another facet. This lets you "translate" a field value into an
arbitrary sort key, such as with UCA::

    from pyuca import Collator

    # The Collator object has a sort_key() method which takes a unicode
    # string and returns a sort key
    c = Collator("allkeys.txt")

    # Make a facet object for the field you want to sort on
    nf = sorting.FieldFacet("name")

    # Wrap the facet in a TranslateFacet with the translation function
    # (the Collator object's sort_key method)
    tf = sorting.TranslateFacet(facet, c.sort_key)

    # Use the facet to sort the search results
    results = searcher.search(myquery, sortedby=tf)

(You can pass multiple "wrapped" facets to the ``TranslateFacet``, and it will
call the function with the values of the facets as multiple arguments.)

The TranslateFacet can also be very useful with numeric fields to sort on the
output of some formula::

    # Sort based on the average of two numeric fields
    def average(a, b):
        return (a + b) / 2.0

    # Create two facets for the fields and pass them with the function to
    # TranslateFacet
    af = sorting.FieldFacet("age")
    wf = sorting.FieldFacet("weight")
    facet = sorting.TranslateFacet(average, af, wf)

    results = searcher.search(myquery. sortedby=facet)

Remember that you can still sort by multiple facets. For example, you could sort
by a numeric value transformed by a quantizing function first, and then if that
is equal sort by the value of another field::

    # Sort by a quantized size first, then by name
    tf = sorting.TranslateFacet(quantize, sorting.FieldFacet("size"))
    results = searcher.search(myquery, sortedby=[tf, "name"])


Expert: writing your own facet
==============================

TBD.