File: TAG2UPLOAD-MANAGER-PROTOCOL.md

# Protocol between the tag2upload Manager and the Oracle.

## Initiation and lifecycle

The Oracle initiates the connection.
The transport is `ssh manager nc`:
ie, the Manager will be listening on a local socket.
(There will be an ssh restricted command.)

In principle,
the Oracle might make multiple connections,
if it has multiple worker processes.
In that case, each worker has one connection.

The ssh connection will use "protocol keepalives",
so that the Manager will (eventually) detect a failure.

## Protocol 

### Basic principles; notation

The protocol is line-based.
Lines are terminated by newlines.
Extraneous whitespace is a protocol violation.
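A strict reader for fixed-field lines could look roughly like the sketch below. It assumes that fields are separated by exactly one space (the transcripts in this document are consistent with that, but it is an assumption), and the helper name `split_line` is hypothetical. It is not suitable for messages whose final field is free text (such as MESSAGE), where the rest of the line must be kept intact.

```python
def split_line(raw: bytes) -> list[str]:
    """Split one fixed-field protocol line into its words.

    Assumption: words are separated by exactly one space.
    Raises ValueError on anything that looks like extraneous whitespace.
    """
    if not raw.endswith(b"\n"):
        raise ValueError("line is not terminated by a newline")
    words = raw[:-1].decode("utf-8").split(" ")
    # An empty word means a leading, trailing, or doubled space;
    # any other whitespace character (tab, CR, ...) is also rejected.
    if any(w == "" or any(c.isspace() for c in w) for w in words):
        raise ValueError("extraneous whitespace in line")
    return words
```

A caller would typically turn the `ValueError` into a `protocol-violation` message and close the connection.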

We see things from the Oracle's point of view.
`<` is from Manager to Oracle;
`>` is from Oracle to Manager.

Medium- to long-term protocol states
are shown with `[state: ]` and are as follows:

 * `worker-waiting`: The worker is waiting for instructions/jobs.
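 * `processing-uncommitted`: The worker is processing a job,
   but has not yet been given permission to sign or send out anything;
   the manager may still retry the job.
 * `processing-committed`: The worker is processing a job
   and has permission to sign and send out the outputs;
   the manager is waiting to see what the outcome is.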

### Initial exchange

```
$ ssh manager nc -U /srv/socket
    < t2u-manager-ready
    > t2u-oracle-version 8
    > worker-id WORKER-ID FIDELITY
[state: worker-waiting]
```

If there are multiple protocol versions,
the Oracle gets to choose its preferred one.

This document describes version `8`.
In `7` and earlier, there is no `software-versions` message before `ack`.
In `6` and earlier, there is no `user-email` message.
In `5` and earlier, there is no RETRY-STATUS in `job`,
no `commit-to-public-upload` and `go-ahead` messages,
and no `processing-uncommitted` state
(`job` transitions directly to `processing-committed`);
there is no `retriable` outcome and jobs can never be retried;
and there are no `email reported` / `email unreported` messages.
In `4` and earlier, there was no `restart-worker` message.
In `3` and earlier, `PUTATIVE-PACKAGE` is omitted from the `job` message.
In `2` and earlier, `FIDELITY` is omitted from the `worker-id` message.

(The protocol version could be on the command line,
but that entangles it with the ssh restricted command.)

The WORKER-ID must consist of ASCII alphanumerics,
commas, hyphens, and dots, and must start with an alphanumeric.
It is used by the manager for reporting,
including in public-facing status reports.
If the Oracle manages multiple Builders,
it should make multiple connections to the Manager,
one for each Builder.
(The `worker-id` message is mandatory.)
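The WORKER-ID syntax described above maps directly onto a regular expression. A minimal sketch follows; the function name `is_valid_worker_id` is hypothetical.

```python
import re

# An ASCII alphanumeric, followed by any number of ASCII alphanumerics,
# commas, hyphens, and dots.
_WORKER_ID = re.compile(r"[A-Za-z0-9][A-Za-z0-9,.-]*")


def is_valid_worker_id(s: str) -> bool:
    """Check a string against the WORKER-ID syntax."""
    # fullmatch avoids accepting a trailing newline or other suffix.
    return _WORKER_ID.fullmatch(s) is not None
```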

`FIDELITY` is one of the fixed strings `testing` or `production`,
according to the Oracle's self-determination of its own status.
The Manager will not give out jobs to a non-`production` Oracle,
unless it is explicitly so instructed by its administrator.
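From the Oracle's side, the initial exchange might be sketched as below. This assumes the connection is already available as a pair of binary file objects (for example, the pipes of an `ssh` subprocess); the function name `oracle_handshake` and its parameters are hypothetical, and the default version is simply the one this document describes.

```python
import io


def oracle_handshake(rfile, wfile, worker_id: str, fidelity: str,
                     version: int = 8) -> None:
    """Perform the Oracle's half of the initial exchange."""
    if fidelity not in ("testing", "production"):
        raise ValueError(f"bad FIDELITY: {fidelity!r}")
    greeting = rfile.readline()
    if greeting != b"t2u-manager-ready\n":
        raise RuntimeError(f"unexpected greeting: {greeting!r}")
    wfile.write(f"t2u-oracle-version {version}\n".encode("ascii"))
    wfile.write(f"worker-id {worker_id} {fidelity}\n".encode("ascii"))
    wfile.flush()


# Usage with in-memory streams standing in for the connection:
sent = io.BytesIO()
oracle_handshake(io.BytesIO(b"t2u-manager-ready\n"), sent, "w1", "testing")
```

After this returns, the connection is in the `worker-waiting` state.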

### Readiness

The Oracle should then wait, indefinitely,
for a job to be available.

During this time,
the Manager will periodically poll the Oracle for readiness,
and may instruct an Oracle worker process to restart.
Polling works like this:

```
[state: worker-waiting]
    < ayt
    > software-versions SOFTWARE-VERSIONS-INFO
    > ack
[state: worker-waiting]
```

This allows the Manager to detect a dead Oracle connection.

`SOFTWARE-VERSIONS-INFO` is a string for public display by the Manager,
giving the version(s) of the software on the Oracle and maybe Builder.
It is the whole rest of the line (UTF-8 plain text).
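The Oracle's reply to `ayt` could be built as in the following sketch. The function name `ayt_reply` is hypothetical; the `version` parameter reflects the note that `software-versions` exists only from protocol version `8`.

```python
def ayt_reply(software_versions_info: str, version: int = 8) -> list[bytes]:
    """Return the lines the Oracle sends in response to `ayt`."""
    if "\n" in software_versions_info:
        raise ValueError("SOFTWARE-VERSIONS-INFO must fit on one line")
    lines = []
    if version >= 8:
        # The info string is the whole rest of the line, as UTF-8 text.
        lines.append(
            f"software-versions {software_versions_info}\n".encode("utf-8"))
    lines.append(b"ack\n")
    return lines
```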

Before responding, the Oracle should attempt to discover
any reasons why the processing of a source package is bound to fail.
In particular, ideally, the Oracle would check that:
 * it can contact its Builder;
 * the build environment (the autopkgtest testbed) is `open`;
 * the build environment is accessible (commands can be run in it);
 * the signing key it intends to use is available.

The Oracle need not check anything visible to the Manager.
For example, the Oracle need not check availability of dgit-repos,
the ftpmaster upload queue, or input git repository servers (eg
salsa).

The Manager instructing the Oracle to restart looks like this:

```
[state: worker-waiting]
    < restart-worker
[connection close]
```

The Oracle worker should close the connection
and then cause itself to be restarted.
In particular, this means 
the worker will establish a fresh connection to the Builder.
(In our implementation, this means we know
the worker will now use the latest container base image on the Builder.)
The process supervising workers need not restart.

The Manager may send restart commands 
whenever the worker is expecting
`ayt` or `job` messages.

### Job

```
[state: worker-waiting]
    < job JOB-ID RETRY-STATUS PUTATIVE-PACKAGE URL
    < data-block NBYTES
    < [NBYTES bytes of data for the tag (no newline)]
    < data-end
    < user-email EMAIL
[1] < last-attempt-message MESSAGE
[state: processing-uncommitted]
```

JOB-ID is the "job id" assigned by the Manager,
and displayed in the Manager's reporting web pages etc.
The Oracle should use it only for reporting.
It has the same syntax as WORKER-ID.
Note that subsequent retries of the same job will have the same JOB-ID.

PUTATIVE-PACKAGE is the source package name.
It is derived from the Manager's parse of the tag data,
so should be used for reporting only.
The Oracle must reparse the tag for itself after verifying the signature.

URL is the git URL for the repository where the tag exists.
It is guaranteed to consist of ASCII graphic characters.

RETRY-STATUS is `last-attempt` or `not-last-attempt`.
With `last-attempt` the Oracle should
send a report email to the uploader even on retriable failures.
\[1] `last-attempt-message` is sent iff RETRY-STATUS is `last-attempt`;
MESSAGE should be included in that email's summary report,
and will be a whole sentence.
(`last-attempt` does not affect the reported outcome.)

EMAIL (in the `user-email` message) is an email address suitable for writing
to the user who apparently initiated the upload.
It contains only ASCII graphic characters, spaces, and tabs.
It is a single address in RFC5322 destination field format.

The NBYTES of data are precisely the git tag object,
as output by `git cat-file tag`.

This protocol is identical to the `dgit rpush` file transfer protocol,
except that the Manager guarantees to put the whole tag
in one data block.
(So there will be only one `data-block`.)
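Reading a complete `job` message, as described above, might look like the sketch below. It assumes a buffered binary stream (so that `read(n)` returns `n` bytes unless the connection ends) and protocol version `8`; the `Job` class and the function names are hypothetical. It deliberately does not interpret the tag: the Oracle must still verify the signature and reparse the tag itself.

```python
import io
from dataclasses import dataclass
from typing import BinaryIO, Optional


@dataclass
class Job:
    job_id: str
    retry_status: str
    putative_package: str
    url: str
    tag: bytes
    email: str
    last_attempt_message: Optional[str]


def _read_line(stream: BinaryIO) -> str:
    raw = stream.readline()
    if not raw.endswith(b"\n"):
        raise ValueError("connection closed mid-message")
    return raw[:-1].decode("utf-8")


def _rest_after(line: str, keyword: str) -> str:
    """Return the rest of the line after 'KEYWORD '."""
    prefix = keyword + " "
    if not line.startswith(prefix):
        raise ValueError(f"expected {keyword!r}, got {line!r}")
    return line[len(prefix):]


def read_job(stream: BinaryIO) -> Job:
    words = _read_line(stream).split(" ")
    if len(words) != 5 or words[0] != "job":
        raise ValueError("malformed job message")
    _, job_id, retry_status, package, url = words
    if retry_status not in ("last-attempt", "not-last-attempt"):
        raise ValueError("bad RETRY-STATUS")

    # Exactly one data block: NBYTES raw bytes, then `data-end`,
    # with no newline between the data and `data-end`.
    nbytes_text = _rest_after(_read_line(stream), "data-block")
    if not (nbytes_text.isascii() and nbytes_text.isdigit()):
        raise ValueError("bad NBYTES")
    nbytes = int(nbytes_text)
    tag = stream.read(nbytes)
    if len(tag) != nbytes:
        raise ValueError("connection closed inside data block")
    if _read_line(stream) != "data-end":
        raise ValueError("expected data-end")

    email = _rest_after(_read_line(stream), "user-email")
    message = None
    if retry_status == "last-attempt":
        message = _rest_after(_read_line(stream), "last-attempt-message")
    return Job(job_id, retry_status, package, url, tag, email, message)


# Usage with made-up values:
example_tag = b"object 0123\ntype commit\ntag debian/1.0\n"
wire = (b"job j-1 not-last-attempt hello https://git.example.org/hello.git\n"
        + b"data-block %d\n" % len(example_tag) + example_tag + b"data-end\n"
        + b"user-email A Uploader <uploader@example.org>\n")
job = read_job(io.BytesIO(wire))
```

Any `ValueError` here corresponds to a protocol violation by the peer, or to a dropped connection.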

After receiving a job, the Oracle may start work on it.
It may download things, perform checks, and so on.
It must not make any signatures or send out git objects
or source packages without further consultation with the Manager.

If the connection drops for some reason, during this phase,
the Manager may retry the same tag.

### Commitment

```
[state: processing-uncommitted]
    > commit-to-public-upload
    < go-ahead
[state: processing-committed]
```

The worker requests, and the manager grants,
permission to go ahead and sign and send out the outputs.

The retry mechanism is principally a mitigation for an unreliable forge.
So the worker should send `commit-to-public-upload` only
*after* it has completed interactions with the forge.
Operations such as downloading origs from mirrors,
and ftpmaster API enquiries, may be done before or after.

After `go-ahead` is granted, all failures are irrecoverable.
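The commitment step is a single request and reply. A minimal sketch, assuming the connection is a pair of binary file objects; the function name `request_go_ahead` is hypothetical, and treating any other reply as an error is a simplification.

```python
import io


def request_go_ahead(rfile, wfile) -> None:
    """Ask the Manager for permission to sign and publish.

    Call this only after all interactions with the forge are complete.
    Returns normally once `go-ahead` has been received; from then on
    the job is in `processing-committed` and cannot be retried.
    """
    wfile.write(b"commit-to-public-upload\n")
    wfile.flush()
    reply = rfile.readline()
    if reply != b"go-ahead\n":
        raise RuntimeError(f"no go-ahead from Manager: {reply!r}")


# Usage with in-memory streams standing in for the connection:
to_manager = io.BytesIO()
request_go_ahead(io.BytesIO(b"go-ahead\n"), to_manager)
```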

### Outcome

Below, MESSAGE is UTF-8 text, possibly containing whitespace,
up to the newline.

The manager will log it,
and display it publicly in its status reports.

#### Success

```
[state: processing-committed]
    > message MESSAGE
    > email reported
    > uploaded
[state: worker-waiting]
```

The package was uploaded.

The Oracle has sent an email report to the uploader,
CC the administrators and the archive list.

#### Permanent failure

```
[state: processing-committed or processing-uncommitted]
    > message MESSAGE
[a] > email reported
[b] > email unreported
    > irrecoverable
[state: worker-waiting]
```

Something is irrevocably wrong with the tag or the package contents.
There should be no retries.

\[a]: The Oracle has sent an email report to the uploader,
CC the administrators and the archive list.

\[b]: The Oracle has *not* sent an email report.
(For example, the Oracle's security posture is to avoid processing tags
when the signature doesn't verify or isn't from an authorised signer,
and in that case it doesn't send emails.)
The manager should do so.

#### Temporary failure

```
[state: processing-uncommitted]
    > message MESSAGE
[a] > email reported
[b] > email unreported
    > retriable
[state: worker-waiting]
```

Something went wrong, but retrying may help.

`email reported` and `email unreported` are as for permanent failure.
The Oracle should send an email to the uploader and say `email reported`
only if the manager sent `job ... last-attempt ...` for this job.
(The Oracle may send an email to the public audit/archive list
in any case, but that doesn't count as `email reported`.)

The Oracle should avoid reporting `retriable` for situations
where a retry will not succeed.  An adequate implementation is
to do all forge accesses first; report `retriable` only if forge access fails;
and do all other processing after `go-ahead`.
But the Oracle must make an effort to detect permanent forge errors
(repo 404, missing git refs) as discussed in #1112106.
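The constraints on the three outcomes can be collected into one helper that builds the outcome lines and refuses combinations the protocol does not allow. This is a sketch of one reading of the rules above, not a definitive implementation; the function name `outcome_lines` and its parameters are hypothetical.

```python
def outcome_lines(outcome: str, message: str, *, committed: bool,
                  emailed: bool, last_attempt: bool = False) -> list[bytes]:
    """Build the lines reporting an outcome to the Manager.

    committed:    True if `go-ahead` has been received for this job.
    emailed:      True if a report email was sent to the uploader.
    last_attempt: True if the job's RETRY-STATUS was `last-attempt`.
    """
    if outcome not in ("uploaded", "irrecoverable", "retriable"):
        raise ValueError(f"unknown outcome: {outcome!r}")
    if "\n" in message:
        raise ValueError("MESSAGE must fit on one line")
    if outcome == "uploaded" and not (committed and emailed):
        raise ValueError("`uploaded` needs go-ahead and a sent email")
    if outcome == "retriable":
        if committed:
            raise ValueError("after go-ahead, failures are irrecoverable")
        if emailed and not last_attempt:
            raise ValueError("email the uploader only on the last attempt")
    email = "reported" if emailed else "unreported"
    return [
        f"message {message}\n".encode("utf-8"),
        f"email {email}\n".encode("ascii"),
        f"{outcome}\n".encode("ascii"),
    ]
```

For example, a forge failure before `go-ahead` on a job that is not the last attempt would be reported with `committed=False, emailed=False`.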

### Conclusion

After sending the outcome,
the Oracle should either close the connection,
or retain it and wait for further jobs.

### Protocol violations, reporting

Either side may send this message, at any time
(except in the middle of data blocks)
if it considers that its peer has violated the protocol:

```
    > protocol-violation MESSAGE
    < protocol-violation MESSAGE
```

The complaining side should then close the connection.

The complained-at side should report the error somewhere,
and will ideally display it in user-facing output
such as status web pages or emails.
It should also then close the connection.

The complaining side that sends `protocol-violation`
should *also* report or log the error as appropriate.

### Connection failures - handling by Oracle

If the connection is dropped,
or a connection attempt is unsuccessful,
the Oracle should retry with a delay.
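A reconnect loop for the Oracle might be sketched as follows. The delay policy is not specified here, so the fixed delay is an assumption, as are the function name `connect_with_retry` and the choice to treat `OSError` as the signal for a failed attempt.

```python
import time


def connect_with_retry(connect, delay_seconds: float, sleep=time.sleep):
    """Call connect() until it succeeds, sleeping between attempts.

    connect: a callable that returns a connection or raises OSError.
    sleep:   injectable for testing; defaults to time.sleep.
    """
    while True:
        try:
            return connect()
        except OSError:
            # Assumption: a fixed delay between attempts.
            sleep(delay_seconds)
```

This covers only re-establishing the connection; it does not retry jobs, which remains the Manager's responsibility.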

### Connection failures - handling by Manager

If the connection fails (or the protocol is violated)
after `go-ahead` and before the outcome,
the job is treated as irrecoverable.

The Manager always does an `ayt` check
immediately before issuing a job,
to minimise the opportunity for jobs to be lost
simply because of a broken connection.

(The rest of the time the Manager doesn't care about connection failure.)

### Error recovery and retrying jobs

Jobs can be retried when:

 * The Manager fails to access the forge in a way that looks retriable
   (eg 500 errors from HTTP).
 * The Oracle disconnects, or reports `retriable`, before `go-ahead`
   (without `job ... last-attempt ...`).
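The Manager's handling of a dropped connection or a `retriable` report can be summarised as a small decision function. This is a sketch of one reading of the rules in this document: the function name, the event names, and the returned labels are all hypothetical, and the `final-failure` label for the last-attempt case is an assumption, since the text does not name that situation.

```python
def manager_disposition(state: str, event: str, last_attempt: bool) -> str:
    """Decide what the Manager does with the current job.

    state: `worker-waiting`, `processing-uncommitted`,
           or `processing-committed`.
    event: "disconnect" or "retriable" (hypothetical event names).
    """
    if event not in ("disconnect", "retriable"):
        raise ValueError(f"unknown event: {event!r}")
    if state == "worker-waiting":
        if event == "retriable":
            raise ValueError("no job in progress")
        return "ignore"          # no job is at stake
    if state == "processing-committed":
        # After go-ahead, a failed connection or a protocol violation
        # (which an out-of-place `retriable` would be) loses the job.
        return "irrecoverable"
    if state == "processing-uncommitted":
        # Assumption: on the last attempt there is nothing left to retry.
        return "final-failure" if last_attempt else "retry"
    raise ValueError(f"unknown state: {state!r}")
```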

The retry and backoff implementation (including policy configuration)
is the responsibility of the Manager.
The Oracle never performs any retries or backoff itself;
its only responsibility with regard to this section of the specification
is to report whether a failure is retriable.

The retry schedule is controlled by the `retry`
configuration of `tag2upload-service-manager`.
See the Rustdoc for `tag2upload_service_manager::config::Retry`
for the configuration, and a working through of the implications.