<!-- groonga-command -->
<!-- database: functions_language_model_vectorize -->

# `language_model_vectorize`

```{versionadded} 14.1.0

```

```{note}
This is an experimental feature. It is not yet stable.
```

## Summary

`language_model_vectorize` generates a normalized embedding from the
given text.

See also {doc}`../language_model` for how to prepare a language model.

You can use a {ref}`generated-column` to automate embedding generation.

To enable this function, register the `functions/language_model`
plugin with the following command:

```shell
plugin_register functions/language_model
```

## Syntax

`language_model_vectorize` requires two parameters:

```
language_model_vectorize(model_name, text)
```

`model_name` is the name of the language model to be used. It's
derived from the model's file name: if
`${PREFIX}/share/groonga/language_models/mistral-7b-v0.1.Q4_K_M.gguf`
exists, you can refer to it as `mistral-7b-v0.1.Q4_K_M`. The name is
computed by removing the directory part and the `.gguf` extension.
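
For example, with that model file installed, a call looks like the
following (the input text here is just an illustration):

```
language_model_vectorize("mistral-7b-v0.1.Q4_K_M", "Groonga is fast.")
```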

```{versionadded} 15.1.8

You can also specify a Hugging Face URI for `model_name`.

```

`text` is the input text.

## Requirements

You need a llama.cpp-enabled Groonga. The official packages enable it.

You need enough CPU/memory resources to use this feature. Language
model related features require more resources than other features.

You can use a GPU with this feature.

## Usage

You need to register the `functions/language_model` plugin first:

<!-- groonga-command -->

```{include} ../../example/reference/functions/language_model_vectorize/usage_register.md
plugin_register functions/language_model
```

Here is a schema definition and sample data.

Sample schema:

<!-- groonga-command -->

```{include} ../../example/reference/functions/language_model_vectorize/usage_setup_schema.md
table_create --name Memos --flags TABLE_NO_KEY
column_create \
  --table Memos \
  --name content \
  --flags COLUMN_SCALAR \
  --type ShortText
```

Sample data:

<!-- groonga-command -->

```{include} ../../example/reference/functions/language_model_vectorize/usage_setup_data.md
load --table Memos
[
{"content": "Groonga is fast and embeddable full text search engine."},
{"content": "PGroonga is a PostgreSQL extension that uses Groonga."},
{"content": "PostgreSQL is a RDBMS."}
]
```

Here is a schema definition that creates a {ref}`generated-column` to
generate embeddings of `Memos.content` automatically:

<!-- groonga-command -->

```{include} ../../example/reference/functions/language_model_vectorize/usage_setup_generated_column.md
column_create \
  --table Memos \
  --name content_embedding \
  --flags COLUMN_VECTOR \
  --type Float32 \
  --source content \
  --generator 'language_model_vectorize("mistral-7b-v0.1.Q4_K_M", content)'
```
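
After this column exists, loading a new record populates
`Memos.content_embedding` automatically; no explicit
`language_model_vectorize()` call is needed. A minimal sketch (the
record content is illustrative):

```
load --table Memos
[
{"content": "Mroonga is a MySQL storage engine that uses Groonga."}
]
```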

You can re-rank matched records by using `distance_inner_product()`
instead of `distance_cosine()` because `language_model_vectorize()`
returns a normalized embedding.
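
For unit-length embeddings, the inner product equals the cosine
similarity, so the cheaper `distance_inner_product()` produces the
same ranking:

```{math}
\cos\theta
  = \frac{\boldsymbol{a} \cdot \boldsymbol{b}}
         {\|\boldsymbol{a}\|\,\|\boldsymbol{b}\|}
  = \boldsymbol{a} \cdot \boldsymbol{b}
  \quad (\|\boldsymbol{a}\| = \|\boldsymbol{b}\| = 1)
```

The following example uses all records instead of filtered records to
keep the usage simple: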

<!-- groonga-command -->

```{include} ../../example/reference/functions/language_model_vectorize/usage_rerank.md
select \
  --table Memos \
  --columns[similarity].stage filtered \
  --columns[similarity].flags COLUMN_SCALAR \
  --columns[similarity].types Float32 \
  --columns[similarity].value 'distance_inner_product(content_embedding, language_model_vectorize("mistral-7b-v0.1.Q4_K_M", "high performance FTS"))' \
  --output_columns content,similarity \
  --sort_keys -similarity
```

## Parameters

There are two required parameters.

### `model_name`

`model_name` is the name of the language model to be used. It's
derived from the model's file name: if
`${PREFIX}/share/groonga/language_models/mistral-7b-v0.1.Q4_K_M.gguf`
exists, you can refer to it as `mistral-7b-v0.1.Q4_K_M`. The name is
computed by removing the directory part and the `.gguf` extension.

````{versionadded} 15.1.8

You can specify a Hugging Face URI for `model_name`.

When you specify a Hugging Face URI, the model will be automatically downloaded.

The model is downloaded to the directory where the Groonga database files are located.

Example of a URI: `hf:///groonga/bge-m3-Q4_K_M-GGUF`, which corresponds to `https://huggingface.co/groonga/bge-m3-Q4_K_M-GGUF`.

Example of function execution:

```
language_model_vectorize("hf:///groonga/bge-m3-Q4_K_M-GGUF", content)
```

````

### `text`

`text` is the input text.

## Return value

`language_model_vectorize` returns a `Float32` vector, which is a
normalized embedding.
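
A minimal sketch to inspect the returned vector, assuming the `Memos`
table and model from the examples above (the actual values depend on
the model):

```
select \
  --table Memos \
  --limit 1 \
  --output_columns 'language_model_vectorize("mistral-7b-v0.1.Q4_K_M", content)'
```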