(parallel-db)=

# IPython's Task Database

## Enabling a DB Backend

The IPython Hub can store all task requests and results in a database.
Currently supported backends are MongoDB, SQLite, and an in-memory DictDB.

The default is to store recent tasks in an in-memory dictionary,
which culls old records when it grows too large, and only survives
as long as the controller is running.

Using a real database is optional due to its potential {ref}`db-cost`.
You can enable one either at the command line:

```
$> ipcontroller --sqlitedb # or --mongodb or --nodb
```

or in your {file}`ipcontroller_config.py`:

```python
c.IPController.db_class = "NoDB"
c.IPController.db_class = "DictDB" # default
c.IPController.db_class = "MongoDB"
c.IPController.db_class = "SQLiteDB"
```
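
If you launch clusters from Python, the same flag can be forwarded to the controller. A minimal sketch, assuming the {class}`~.Cluster` API's `controller_args` option (present in recent ipyparallel releases) passes extra arguments through to `ipcontroller`:

```python
import ipyparallel as ipp

# Forward --sqlitedb to ipcontroller when starting a local cluster.
# controller_args is assumed to accept extra ipcontroller CLI flags;
# verify against your ipyparallel version.
cluster = ipp.Cluster(n=2, controller_args=["--sqlitedb"])
rc = cluster.start_and_connect_sync()
```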

## Using the Task Database

The most common use case for this is clients requesting results for tasks they did not submit, via:

```ipython
In [1]: rc.get_result(task_id)
```

However, since we have this DB backend, we provide a direct query method in the {class}`~.Client`
for users who want deeper introspection into their task history. The {meth}`db_query` method of
the Client is modeled after MongoDB queries, so if you have used MongoDB it should look
familiar. In fact, when the MongoDB backend is in use, the query is relayed directly.
When using other backends, the interface is emulated and only a subset of queries is possible.

```{seealso}
MongoDB [query docs](https://www.mongodb.com/docs/manual/tutorial/query-documents/)
```

{meth}`Client.db_query` takes a dictionary query object, with keys from the TaskRecord key list,
and values that are either exact values to test, or MongoDB queries, which are dicts of the form:
`{'operator' : 'argument(s)'}`. There is also an optional `keys` argument, which specifies
the subset of keys to retrieve. The default is to retrieve all keys except the
request and result buffers. {meth}`db_query` returns a list of TaskRecord dicts. As in
MongoDB, the `msg_id` key is always included, whether requested or not.
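
For example, even when only `started` is requested, each returned record still carries its `msg_id` (illustrative session):

```ipython
In [1]: recs = rc.db_query({'completed': {'$ne': None}}, keys=['started'])

In [2]: sorted(recs[0].keys())
Out[2]: ['msg_id', 'started']
```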

TaskRecord keys:

| Key            | Type        | Description                                                 |
| -------------- | ----------- | ----------------------------------------------------------- |
| msg_id         | uuid(ascii) | The msg ID                                                  |
| header         | dict        | The request header                                          |
| content        | dict        | The request content (likely empty)                          |
| buffers        | list(bytes) | buffers containing serialized request objects               |
| submitted      | datetime    | timestamp for time of submission (set by client)            |
| client_uuid    | uuid(ascii) | IDENT of client's socket                                    |
| engine_uuid    | uuid(ascii) | IDENT of engine's socket                                    |
| started        | datetime    | time task began execution on engine                         |
| completed      | datetime    | time task finished execution (success or failure) on engine |
| resubmitted    | uuid(ascii) | msg_id of resubmitted task (if applicable)                  |
| result_header  | dict        | header for result                                           |
| result_content | dict        | content for result                                          |
| result_buffers | list(bytes) | buffers containing serialized result objects                |
| queue          | str         | The name of the queue for the task ('mux' or 'task')        |
| execute_input  | str         | Python input source                                         |
| execute_result | dict        | Python output (execute_result message content)              |
| error          | dict        | Python traceback (error message content)                    |
| stdout         | str         | Stream of stdout data                                       |
| stderr         | str         | Stream of stderr data                                       |
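
Because `started` and `completed` are `datetime` objects, record fields can be combined directly; for instance, the runtime of a finished task (assuming `task_id` names one) is just their difference:

```ipython
In [1]: rec = rc.db_query({'msg_id': task_id})[0]

In [2]: runtime = rec['completed'] - rec['started']
```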

MongoDB operators we emulate on all backends:

| Operator | Python equivalent |
| -------- | ----------------- |
| '\$in'   | in                |
| '\$nin'  | not in            |
| '\$eq'   | ==                |
| '\$ne'   | !=                |
| '\$gt'   | >                 |
| '\$gte'  | >=                |
| '\$lt'   | \<                |
| '\$lte'  | \<=               |
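
On the non-MongoDB backends, these operators are applied record-by-record in Python. The following is an illustrative sketch of how such emulation can work; the names are hypothetical, not ipyparallel's actual internals:

```python
# Illustrative sketch of operator emulation, not the real implementation.
filters = {
    '$in':  lambda value, arg: value in arg,
    '$nin': lambda value, arg: value not in arg,
    '$eq':  lambda value, arg: value == arg,
    '$ne':  lambda value, arg: value != arg,
    '$gt':  lambda value, arg: value > arg,
    '$gte': lambda value, arg: value >= arg,
    '$lt':  lambda value, arg: value < arg,
    '$lte': lambda value, arg: value <= arg,
}

def match_record(record, query):
    """Check one TaskRecord dict against a query dict."""
    for key, test in query.items():
        if isinstance(test, dict):
            # operator form: {'$op': argument}
            if not all(filters[op](record.get(key), arg) for op, arg in test.items()):
                return False
        elif record.get(key) != test:
            # exact-value form
            return False
    return True
```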

The DB Query is useful for two primary cases:

1. deep polling of task status or metadata
2. selecting a subset of tasks, on which to perform a later operation (e.g. wait on result, purge records, resubmit,...)
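
As an example of the second case, one could select every task that raised an error and resubmit the lot with {meth}`Client.resubmit`:

```ipython
In [1]: failed = rc.db_query({'error': {'$ne': None}}, keys=['msg_id'])

In [2]: rc.resubmit([rec['msg_id'] for rec in failed])
```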

## Example Queries

To get all msg_ids that are not completed, only retrieving their ID and start time:

```ipython
In [1]: incomplete = rc.db_query({'completed' : None}, keys=['msg_id', 'started'])
```

All jobs started in the last hour by me:

```ipython
In [1]: from datetime import datetime, timedelta

In [2]: hourago = datetime.now() - timedelta(hours=1)

In [3]: recent = rc.db_query({'started' : {'$gte' : hourago },
                                'client_uuid' : rc.session.session})
```

All jobs started more than an hour ago, by clients _other than me_:

```ipython
In [3]: recent = rc.db_query({'started' : {'$lt' : hourago },
                                'client_uuid' : {'$ne' : rc.session.session}})
```

Result headers for all jobs on engine 3 or 4:

```ipython
In [1]: uuids = list(map(rc._engines.get, (3, 4)))

In [2]: hist34 = rc.db_query({'engine_uuid' : {'$in' : uuids}}, keys=['result_header'])
```

(db-cost)=

## Cost

The advantage of the database backends is, of course, that large amounts of
data can be stored that won't fit in memory. The basic DictDB 'backend'
stores all of this information in a Python dictionary. This is very fast,
but will run out of memory quickly if you move a lot of data around, or your
cluster runs for a long time.
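
DictDB's memory use can be bounded through its configuration. The trait names below are assumed from recent ipyparallel releases; verify them against `ipcontroller --help-all` for your version:

```python
# in ipcontroller_config.py; trait names assumed from recent
# ipyparallel releases, check `ipcontroller --help-all`
c.DictDB.size_limit = 1024**3  # cap total stored buffer size (bytes)
c.DictDB.record_limit = 1024   # cap number of task records kept
c.DictDB.cull_fraction = 0.1   # fraction of records culled when a cap is hit
```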

Unfortunately, the real database backends (SQLite and MongoDB) are currently
rather slow, and can still consume large amounts of resources, particularly if
large tasks or results are being created at a high frequency.

For this reason, we have added {class}`~.NoDB`, a dummy backend that doesn't
store any information. When you use this backend, nothing is stored and any
request for results raises a KeyError. This obviously prevents later result
retrieval and task resubmission from functioning, but sometimes those nice
features are not as useful as keeping Hub memory under control.