File: Escaping.md

package info (click to toggle)
redisearch 1%3A1.2.1-4
  • links: PTS, VCS
  • area: main
  • in suites: buster
  • size: 12,076 kB
  • sloc: ansic: 79,131; python: 3,419; pascal: 1,644; makefile: 431; yacc: 422; sh: 5
file content (17 lines) | stat: -rw-r--r-- 1,427 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
# Controlling Text Tokenization and Escaping

At the moment, RediSearch uses a very simple tokenizer for documents and a slightly more sophisticated tokenizer for queries. Both allow a degree of control over string escaping and tokenization. 

Note: There is a different mechanism for tokenizing text and tag fields, this document refers only to text fields. For tag fields please refer to the [Tag Fields](/Tags) documentation. 

## The rules of text field tokenization

1. All punctuation marks and whitespaces (besides underscores) separate the document and queries into tokens. e.g. any character of `,.<>{}[]"':;!@#$%^&*()-+=~` will break the text into terms.  So the text `foo-bar.baz...bag` will be tokenized into `[foo, bar, baz, bag]`

2. Escaping separators in both queries and documents is done by prepending a backslash to any separator. e.g. the text `hello\-world hello-world` will be tokenized as `[hello-world, hello, world]`. **NOTE** that in most languages you will need an extra backslash when formatting the document or query, to signify an actual backslash, so the actual text in redis-cli for example, will be entered as `hello\\-world`. 

2. Underscores (`_`) are not used as separators in either document or query. So the text `hello_world` will remain as is after tokenization. 

3. Repeating spaces or punctuation marks are stripped. 

4. In Latin characters, everything gets converted to lowercase.