# Customisation
Lunr.py ships with sensible defaults so you can create indexes and search easily,
but in some cases you may want to tweak how documents are indexed and searched.
You can do that in lunr.py by passing your own `Builder` instance to the `lunr`
function.
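In outline, the pattern looks like this (the documents shown here are hypothetical placeholders; any sequence of dicts containing the named fields works):
```python
from lunr import lunr, get_default_builder

# Hypothetical documents for illustration.
documents = [
    {"id": "1", "title": "Grey paint", "body": "Notes on grey (or gray) paint."},
    {"id": "2", "title": "Green paint", "body": "Notes on green paint."},
]

builder = get_default_builder()

# ... customise the builder here, as shown in the sections below ...

idx = lunr(ref="id", fields=("title", "body"), documents=documents, builder=builder)
```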
## Pipeline functions
When the builder processes your documents it splits (tokenises) the text, and
applies a series of functions to each token. These are called pipeline functions.
The builder includes two pipelines: one for indexing and one for searching.
If you want to change the way lunr.py indexes the documents you'll need to
change the indexing pipeline.
For example, say you wanted to support both the American and British spellings
of certain words; you could use a normalisation pipeline function to map one
token onto the other:
```python
from lunr import lunr, get_default_builder
from lunr.pipeline import Pipeline

documents = [...]

builder = get_default_builder()

def normalise_spelling(token, i, tokens):
    if str(token) == "gray":
        # Token.update applies the given function to the token's string.
        return token.update(lambda s, metadata: "grey")
    else:
        return token

Pipeline.register_function(normalise_spelling)
builder.pipeline.add(normalise_spelling)
idx = lunr(ref="id", fields=("title", "body"), documents=documents, builder=builder)
```
Note that pipeline functions take the token being processed, its position in
the token list, and the token list itself.
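As a minimal sketch of a function that uses those extra arguments (the function name is hypothetical, and, as in Lunr.js, returning a falsy value drops the token from the pipeline output):
```python
def drop_last_token(token, i, tokens):
    # Hypothetical example: discard the final token of each field.
    # Returning None removes the token from the pipeline output.
    if i == len(tokens) - 1:
        return None
    return token
```
Like any other pipeline function, it would need to be registered and added to the builder's pipeline before indexing.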
## Skip a pipeline function for specific field names
The `Pipeline.skip()` method allows you to skip a pipeline function
for specific field names. It takes the function itself (not its name
or its registered name) and a list of field names to skip it for. This
example skips the `stop_word_filter` pipeline function for the field
`fullName`.
```python
from lunr import lunr, get_default_builder, stop_word_filter

documents = [...]

builder = get_default_builder()
builder.pipeline.skip(stop_word_filter.stop_word_filter, ["fullName"])

idx = lunr(ref="id", fields=("fullName", "body"), documents=documents, builder=builder)
```
Importantly, if you are using language support, the above code will
not work, since there is a separate builder for each language and the
pipeline functions are generated at runtime, so they cannot be
imported. Instead, you can access them by name. For instance, to skip
the stop word filter and stemmer for French for the field `titre`, you
could do this:
```python
from lunr import lunr, get_default_builder

documents = [...]

builder = get_default_builder("fr")

for funcname in "stopWordFilter-fr", "stemmer-fr":
    builder.pipeline.skip(
        builder.pipeline.registered_functions[funcname], ["titre"]
    )

idx = lunr(ref="id", fields=("titre", "texte"), documents=documents, builder=builder)
```
The current language support registers the functions
`lunr-multi-trimmer-{lang}`, `stopWordFilter-{lang}` and
`stemmer-{lang}`, but these names are by convention only. You can see the
full list of registered functions through the `registered_functions`
attribute of the pipeline. Note that this is not necessarily the list of
actual pipeline steps, which is held in a private field, though you can
see the steps in the string representation of the pipeline.
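For example, a quick way to inspect both (assuming the French builder from the previous example):
```python
from lunr import get_default_builder

builder = get_default_builder("fr")

# Names of all registered pipeline functions.
print(sorted(builder.pipeline.registered_functions))

# The string representation shows the actual pipeline steps, in order.
print(builder.pipeline)
```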
## Token meta-data
Lunr.py `Token` instances carry meta-data which can be read and written in
pipeline functions. This meta-data is not stored in the index by default, but
it can be, by adding its key to the builder's `metadata_whitelist` property.
The whitelisted meta-data will then be included in the search results:
```python
from lunr import lunr, get_default_builder
from lunr.pipeline import Pipeline
documents = [...]

builder = get_default_builder()

def token_length(token, i, tokens):
    token.metadata["token_length"] = len(str(token))
    return token

Pipeline.register_function(token_length)
builder.pipeline.add(token_length)
builder.metadata_whitelist.append("token_length")

idx = lunr("id", ("title", "body"), documents, builder=builder)

[result, _, _] = idx.search("green")
assert result["match_data"].metadata["green"]["title"]["token_length"] == [5]
assert result["match_data"].metadata["green"]["body"]["token_length"] == [5, 5]
```
## Similarity tuning
The algorithm used by Lunr to calculate similarity between a query and a document
can be tuned using two parameters. Lunr ships with sensible defaults, and these
can be adjusted to provide the best results for a given collection of documents.
- **b**: This parameter controls the importance given to the length of a
document and its fields. It must be between 0 and 1; the default is 0.75.
Reducing this value reduces the effect that differences in document length
have on a term's importance to that document.
- **k1**: This parameter controls how quickly the boost given by a common word
reaches saturation. Increasing it slows the rate of saturation; decreasing it
makes saturation quicker. The default value is 1.2. If the collection of
documents being indexed has high occurrences of words that are not covered by
a stop word filter, these words can quickly dominate any similarity
calculation. In these cases, this value can be reduced to get more balanced
results.
These values can be changed in the builder:
```python
from lunr import lunr, get_default_builder

documents = [...]

builder = get_default_builder()
builder.k1(1.3)
builder.b(0)

idx = lunr("id", ("title", "body"), documents, builder=builder)
```