<!-- groonga-command -->
<!-- database: functions_language_model_vectorize -->
# `language_model_vectorize`
```{versionadded} 14.1.0
```
```{note}
This is an experimental feature. It is not yet stable.
```
## Summary
`language_model_vectorize` generates a normalized embedding from the
given text.
See {doc}`../language_model` for how to prepare a language model.
You can use a {ref}`generated-column` to automate embedding generation.
To enable this function, register the `functions/language_model` plugin with
the following command:
```shell
plugin_register functions/language_model
```
## Syntax
`language_model_vectorize` requires two parameters:
```
language_model_vectorize(model_name, text)
```
`model_name` is the name of the language model to be used. It's derived
from the model's file name. If
`${PREFIX}/share/groonga/language_models/mistral-7b-v0.1.Q4_K_M.gguf`
exists, you can refer to it as `mistral-7b-v0.1.Q4_K_M`. The name is computed
by removing the directory and the `.gguf` extension.
```{versionadded} 15.1.8
You can also specify a Hugging Face URI for `model_name`.
```
`text` is the input text.
## Requirements
You need a llama.cpp-enabled Groonga. The official packages enable it.
You also need enough CPU/memory resources to use this feature. Language
model related features require more resources than other features.
This feature can use a GPU.
## Usage
You need to register the `functions/language_model` plugin first:
<!-- groonga-command -->
```{include} ../../example/reference/functions/language_model_vectorize/usage_register.md
plugin_register functions/language_model
```
Here is a schema definition and sample data.
Sample schema:
<!-- groonga-command -->
```{include} ../../example/reference/functions/language_model_vectorize/usage_setup_schema.md
table_create --name Memos --flags TABLE_NO_KEY
column_create \
--table Memos \
--name content \
--flags COLUMN_SCALAR \
--type ShortText
```
Sample data:
<!-- groonga-command -->
```{include} ../../example/reference/functions/language_model_vectorize/usage_setup_data.md
load --table Memos
[
{"content": "Groonga is fast and embeddable full text search engine."},
{"content": "PGroonga is a PostgreSQL extension that uses Groonga."},
{"content": "PostgreSQL is a RDBMS."}
]
```
Here is a schema that creates a {ref}`generated-column` that
generates embeddings of `Memos.content` automatically:
<!-- groonga-command -->
```{include} ../../example/reference/functions/language_model_vectorize/usage_setup_generated_column.md
column_create \
--table Memos \
--name content_embedding \
--flags COLUMN_VECTOR \
--type Float32 \
--source content \
--generator 'language_model_vectorize("mistral-7b-v0.1.Q4_K_M", content)'
```
You can re-rank matched records by using `distance_inner_product()`
instead of `distance_cosine()` because `language_model_vectorize()` returns a
normalized embedding. For simplicity, the following example computes the
similarity for all records instead of only filtered records:
<!-- groonga-command -->
```{include} ../../example/reference/functions/language_model_vectorize/usage_rerank.md
select \
--table Memos \
--columns[similarity].stage filtered \
--columns[similarity].flags COLUMN_SCALAR \
--columns[similarity].type Float32 \
--columns[similarity].value 'distance_inner_product(content_embedding, language_model_vectorize("mistral-7b-v0.1.Q4_K_M", "high performance FTS"))' \
--output_columns content,similarity \
--sort_keys -similarity
```
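The reason `distance_inner_product()` suffices here is that, for unit-length
vectors, the inner product equals the cosine similarity, so the cheaper
computation can be used. A small Python sketch with made-up vectors (not real
embeddings) illustrates this:

```python
import math


def normalize(v):
    # Scale the vector to unit length, like a normalized embedding.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b)))


a = normalize([1.0, 2.0, 3.0])
b = normalize([4.0, 5.0, 6.0])
# For normalized vectors the two measures agree.
assert abs(inner_product(a, b) - cosine(a, b)) < 1e-9
```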
## Parameters
There are two required parameters.
### `model_name`
`model_name` is the name of the language model to be used. It's derived
from the model's file name. If
`${PREFIX}/share/groonga/language_models/mistral-7b-v0.1.Q4_K_M.gguf`
exists, you can refer to it as `mistral-7b-v0.1.Q4_K_M`. The name is computed
by removing the directory and the `.gguf` extension.
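As an illustration, the name derivation can be sketched in Python. This is a
hypothetical sketch of the rule described above, not Groonga's actual
implementation:

```python
import os


def model_name_from_path(path):
    # Strip the directory, then the ".gguf" extension, to get the model name.
    base = os.path.basename(path)
    name, ext = os.path.splitext(base)
    return name if ext == ".gguf" else base


print(model_name_from_path(
    "/usr/share/groonga/language_models/mistral-7b-v0.1.Q4_K_M.gguf"))
# → mistral-7b-v0.1.Q4_K_M
```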
````{versionadded} 15.1.8
You can specify a Hugging Face URI for `model_name`.
When you specify a Hugging Face URI, the model is downloaded automatically.
The model is downloaded to the directory where the Groonga database files are located.
URI example: `hf:///groonga/bge-m3-Q4_K_M-GGUF` for `https://huggingface.co/groonga/bge-m3-Q4_K_M-GGUF`.
Function call example:
```
language_model_vectorize("hf:///groonga/bge-m3-Q4_K_M-GGUF", content)
```
````
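For illustration, the mapping from a Hugging Face model page URL to the
`hf:///` URI form can be sketched in Python. This is a hypothetical helper
based on the example above, not part of Groonga:

```python
def hf_uri(url):
    # Replace the "https://huggingface.co/" prefix with the "hf://" scheme;
    # the path keeps its leading slash, giving three slashes in total.
    prefix = "https://huggingface.co/"
    if not url.startswith(prefix):
        raise ValueError("not a Hugging Face model URL")
    return "hf:///" + url[len(prefix):]


print(hf_uri("https://huggingface.co/groonga/bge-m3-Q4_K_M-GGUF"))
# → hf:///groonga/bge-m3-Q4_K_M-GGUF
```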
### `text`
`text` is the input text.
## Return value
`language_model_vectorize` returns a `Float32` vector, which is a
normalized embedding.