File: indexing.md

package info (click to toggle)
redisearch 1%3A1.2.1-4
links: PTS, VCS
area: main
in suites: buster
size: 12,076 kB
sloc: ansic: 79,131; python: 3,419; pascal: 1,644; makefile: 431; yacc: 422; sh: 5
file content (99 lines) | stat: -rw-r--r-- 4,428 bytes
parent folder | download | duplicates (2)
# Document Indexing

This document describes how documents are added to the index.

## Components

* `Document` - this contains the actual document and its fields.
* `RSAddDocumentCtx` - this is the per-document state that is used while it
  is being indexed. The state is discarded once complete
* `ForwardIndex` - contains terms found in the document. The forward index
  is used to write the `InvertedIndex` (later on)
* `InvertedIndex` - an index mapping a term to occurrences within applicable
  documents.

##  Architecture

The indexing process begins by creating a new RSAddDocumentCtx and adding a
document to it. Internally this is divided into several steps.


1. Submission.
   A DocumentContext is created, and is associated with a document (as received)
   from input. The submission process will also perform some preliminary caching.

2. Preprocessing

   Once a document has been submitted, it is preprocessed. Preprocessing performs
   stateless processing on all document input fields. For text fields, this
   means tokenizing the document and creating a forward index. The preprocesors
   will store this information in per-field variables within the `AddDocumentCtx`.
   This computed result is then written to the (persistent) index later on during
   the indexing phase.

   If the document is sufficiently large, the preprocessing is done in a separate
   thread, which allows concurrent preprocessing and also avoids blocking other
   threads. If the document is smaller, the preprocessing is done within the main
   thread, avoiding the overhead of additional context switching.
   The `SELF_EXC_THRESHOLD` (macro) contains the threshold for 'sufficiently large'.

   Once the document is preprocessed,  it is submitted to be indexed.

3. Indexing

   Indexing proper consists of writing down the precomputed results of the
   preprocessing phase above. It is done in a single thread, and is in the form
   of a queue.

   Because documents must be written to the index in the exact order of their
   document ID assignment, and because we must also yield to other potential
   indexing processes, we may end up in a situation where document IDs are written
   to the index out-of-order. In order to solve that, the order in which documents
   are actually written must be well-defined. If there is only one thread writing
   documents, then this thread will not need to worry about out-of-order IDs
   while writing.

   Having a single background thread also helps optimize in several areas, as
   will be seen later on. The basic idea is that when there are a lot of
   documents queued for the indexing thread, the indexing thread may treat them
   as batch commands, greatly reducing the number of locks/unlocks of the GIL
   and the number of times term keys need to be opened and closed.

4. Skipping already indexed documents

   The phases below may operate on more than one document at a time. When a document
   is fully indexed, it is marked as done. When the thread iterates over the queue
   it will only perform processing/indexing on items not yet marked as done.

5. Term Merging

   Term merging, or forward index merging, is done when there is more than a
   single document in the queue. The forward index of each document in the queue
   is scanned, and a larger, 'master' forward index is constructed in its place.
   Each entry in the forward index contains a reference to the origin document
   as well as the normal offset/score/frequency information.
   
   Creating a 'master' forward index avoids opening common term keys once per
   document.

   If there is only one document within the queue, a 'master' forward index
   is not created.

   Note that the internal type of the master forward index is not actually
   `ForwardIndex`.

6. Document ID assignment
   
   At this point, the GIL is locked and every document in the queue is assigned
   a document ID. The assignment is done immediately before writing to the index
   so as to reduce the number of times the GIL is locked; thus, the GIL is
   locked only once - right before the index is written.

7. Writing to Indexes

   With the GIL being locked, any pending index data is written to the indexes.
   This usually involves opening one or more Redis keys, and writing/copying
   computed data into those keys.

   Once this is done, the reply for the given document is sent, and the
   `AddDocumentCtx` freed.