File: TAG2UPLOAD-MANAGER-PROTOCOL.md

# Protocol between the tag2upload Manager and the Oracle.

## Initiation and lifecycle

The Oracle initiates the connection.
The transport is `ssh manager nc`:
ie, the Manager will be listening on a local socket.
(There will be an ssh restricted command.)

In principle,
the Oracle might make multiple connections,
if it has multiple worker processes.
In that case, each worker has one connection.

The ssh connection will use "protocol keepalives",
so that the Manager will (eventually) detect a failure.

## Protocol 

### Basic principles; notation

The protocol is line-based.
Lines are terminated by newlines.
Extraneous whitespace is a protocol violation.

We see things from the Oracle's point of view.  
`<` is from Manager to Oracle;
`>` is from Oracle to Manager.

Medium- to long-term protocol states
are shown with `[state: ]` and are as follows:

 * `worker-waiting`: The worker is waiting for instructions/jobs.
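 * `processing-uncommitted`: The worker is processing a job,
   but has not yet been given permission to publish anything;
   the job may still be retried.
 * `processing-committed`: The worker has been told to go ahead;
   the manager is waiting to see what the outcome is.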

### Initial exchange

```
$ ssh manager nc -U /srv/socket
    < t2u-manager-ready
    > t2u-oracle-version 5
    > worker-id WORKER-ID FIDELITY
[state: worker-waiting]
```

If there are multiple protocol versions,
the Oracle gets to choose its preferred one.

This document describes version `8`.
In `7` and earlier, there is no `software-versions` message before `ack`.
In `6` and earlier, there is no `user-email` message.
In `5` and earlier, there is no RETRY-STATUS in `job`,
there are no `commit-to-public-upload` and `go-ahead` messages,
and there is no `processing-uncommitted` state
(`job` transitions directly to `processing-committed`);
there is no `retriable` outcome and jobs can never be retried;
and there are no `email reported` / `email unreported` messages.
In `4` and earlier, there was no `restart-worker` message.
In `3` and earlier, `PUTATIVE-PACKAGE` is omitted from the `job` message.
In `2` and earlier, `FIDELITY` is omitted from the `worker-id` message.

(The protocol version could be on the command line,
but that entangles it with the ssh restricted command.)

The WORKER-ID must consist of ASCII alphanumerics,
commas, hyphens, and dots, and must start with an alphanumeric.
It is used by the manager for reporting,
including in public-facing status reports.
If the Oracle manages multiple Builders,
it should make multiple connections to the Manager,
one for each Builder.
(The `worker-id` message is mandatory.)

`FIDELITY` is one of the fixed strings `testing` or `production`,
according to the Oracle's self-determination of its own status.
The Manager will not give out jobs to a non-`production` Oracle,
unless it is explicitly so instructed by its administrator.
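
The syntax rules for `WORKER-ID` and `FIDELITY` can be checked mechanically.
The following is a minimal sketch in Python, not taken from the dgit source.
The name `opening_lines` is hypothetical,
and the sketch assumes a protocol version of `3` or later
(so that `FIDELITY` is present).

```python
import re

# ASCII alphanumerics, commas, hyphens and dots; first character alphanumeric.
WORKER_ID_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9,.-]*")
FIDELITIES = ("testing", "production")

def opening_lines(worker_id, fidelity, version=8):
    """Return the lines the Oracle sends after seeing `t2u-manager-ready`."""
    if not WORKER_ID_RE.fullmatch(worker_id):
        raise ValueError(f"bad WORKER-ID: {worker_id!r}")
    if fidelity not in FIDELITIES:
        raise ValueError(f"bad FIDELITY: {fidelity!r}")
    return [
        f"t2u-oracle-version {version}\n",
        f"worker-id {worker_id} {fidelity}\n",
    ]
```

`re.fullmatch` is used rather than `re.match` with `$`,
so that a trailing newline inside the identifier is rejected.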

### Readiness

The Oracle should then wait, indefinitely,
for a job to be available.

During this time,
the Manager will periodically poll the Oracle for readiness,
and may instruct an Oracle worker process to restart.
Polling works like this:

```
[state: worker-waiting]
    < ayt
    > software-versions SOFTWARE-VERSIONS-INFO
    > ack
[state: worker-waiting]
```

This allows the Manager to detect a dead Oracle connection.

`SOFTWARE-VERSIONS-INFO` is a string for public display by the Manager,
giving the version(s) of the software on the Oracle and maybe Builder.
It is the whole rest of the line (UTF-8 plain text).
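
As an illustration of how the reply depends on the negotiated version,
here is a small Python sketch.
The function name `ayt_reply` is hypothetical;
the only rule it encodes is the one stated in this document,
that `software-versions` does not exist in version `7` and earlier.

```python
def ayt_reply(version, software_versions):
    """Lines the Oracle sends in reply to `ayt`, for a given protocol version."""
    if "\n" in software_versions:
        raise ValueError("SOFTWARE-VERSIONS-INFO must fit on one line")
    lines = []
    if version >= 8:
        # `software-versions` is sent only from protocol version 8 onwards.
        lines.append(f"software-versions {software_versions}\n")
    lines.append("ack\n")
    return lines
```

The content of the versions string is free text;
any particular format for it would be the Oracle implementation's own choice.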

Before responding, the Oracle should attempt to discover
any reasons why the processing of a source package is bound to fail.
In particular, ideally, the Oracle would check that:
 * it can contact its Builder;
 * the build environment (the autopkgtest testbed) is `open`;
 * the build environment is accessible (commands can be run in it);
 * the signing key it intends to use is available.

The Oracle need not check anything visible to the Manager.
For example, the Oracle need not check availability of dgit-repos,
the ftpmaster upload queue, or input git repository servers (eg
salsa).

The Manager instructing the Oracle to restart looks like this:

```
[state: worker-waiting]
    < restart-worker
[connection close]
```

The Oracle worker should close the connection
and then cause itself to be restarted.
In particular, this means 
the worker will establish a fresh connection to the Builder.
(In our implementation, this means we know
the worker will now use the latest container base image on the Builder.)
The process supervising workers need not restart.

The Manager may send restart commands 
whenever the worker is expecting
`ayt` or `job` messages.

### Job

```
[state: worker-waiting]
    < job JOB-ID RETRY-STATUS PUTATIVE-PACKAGE URL
    < data-block NBYTES
    < [NBYTES bytes of data for the tag (no newline)]
    < data-end
    < user-email EMAIL
[1] < last-attempt-message MESSAGE
[state: processing-uncommitted]
```

JOB-ID is the "job id" assigned by the Manager,
and displayed in the Manager's reporting web pages etc.
The Oracle should use it only for reporting.
It has the same syntax as WORKER-ID.
Note that subsequent retries of the same job will have the same JOB-ID.

PUTATIVE-PACKAGE is the source package name.
It is derived from the Manager's parse of the tag data,
so should be used for reporting only.
The Oracle must reparse the tag for itself after verifying the signature.

URL is the git URL for the repository where the tag exists.
It is guaranteed to consist of ASCII graphic characters.

RETRY-STATUS is `last-attempt` or `not-last-attempt`.
With `last-attempt` the Oracle should
send a report email to the uploader even on retriable failures.
\[1] `last-attempt-message` is sent iff RETRY-STATUS is `last-attempt`;
MESSAGE should be included in that email's summary report,
and will be a whole sentence.
(`last-attempt` does not affect the reported outcome.)

EMAIL (in the `user-email` message) is an email address suitable for writing
to the user who apparently initiated the upload.
It contains only ASCII graphic characters, spaces, and tabs.
It is a single address in RFC5322 destination field format.

The NBYTES of data are precisely the git tag object,
as output by `git cat-file tag`.

This protocol is identical to the `dgit rpush` file transfer protocol,
except that the Manager guarantees to put the whole tag
in one data block.
(So there will be only one `data-block`.)
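
The framing of the `job` message can be sketched as a small parser.
This is an illustrative Python sketch for protocol version `8`,
not the Oracle's actual implementation:
the function names and the dictionary keys are hypothetical,
and it assumes a buffered binary stream
(for a socket, something like `sock.makefile("rb")`).

```python
import io

def read_line(stream):
    """Read one newline-terminated line; return it without the newline."""
    raw = stream.readline()
    if not raw.endswith(b"\n"):
        raise EOFError("connection closed before end of line")
    return raw[:-1].decode("utf-8")

def expect(stream, keyword):
    """Read a `KEYWORD REST` line and return REST (the rest of the line)."""
    line = read_line(stream)
    head, sep, rest = line.partition(" ")
    if head != keyword or sep != " ":
        raise ValueError(f"expected {keyword!r}, got {line!r}")
    return rest

def read_job(stream):
    """Parse one version-8 `job` message from a binary stream."""
    fields = expect(stream, "job").split(" ")
    if len(fields) != 4 or "" in fields:
        raise ValueError("malformed job line")
    job_id, retry_status, package, url = fields
    if retry_status not in ("last-attempt", "not-last-attempt"):
        raise ValueError(f"bad RETRY-STATUS: {retry_status!r}")

    nbytes_text = expect(stream, "data-block")
    if not (nbytes_text.isascii() and nbytes_text.isdigit()):
        raise ValueError(f"bad NBYTES: {nbytes_text!r}")
    nbytes = int(nbytes_text)
    tag = stream.read(nbytes)  # exactly NBYTES bytes, no newline added
    if len(tag) != nbytes:
        raise EOFError("connection closed inside data block")
    if read_line(stream) != "data-end":
        raise ValueError("expected data-end")

    job = {
        "job_id": job_id,
        "retry_status": retry_status,
        "putative_package": package,  # for reporting only
        "url": url,
        "tag": tag,
        "user_email": expect(stream, "user-email"),
    }
    if retry_status == "last-attempt":
        # Sent iff RETRY-STATUS is `last-attempt`.
        job["last_attempt_message"] = expect(stream, "last-attempt-message")
    return job

def read_job_from_bytes(data):
    """Convenience wrapper for trying the parser on an in-memory buffer."""
    return read_job(io.BytesIO(data))
```

The tag bytes are kept as `bytes`, not decoded,
since they are the input to signature verification.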

After receiving a job, the Oracle may start work on it.
It may download things, perform checks, and so on.
It must not make any signatures or send out git objects
or source packages without further consultation with the Manager.

If the connection drops for some reason, during this phase,
the Manager may retry the same tag.

### Commitment

```
[state: processing-uncommitted]
    > commit-to-public-upload
    < go-ahead
[state: processing-committed]
```

The worker requests, and the manager grants,
permission to go ahead and sign and send out the outputs.

The retry mechanism is principally a mitigation for an unreliable forge.
So the worker should send `commit-to-public-upload` only
*after* it has completed interactions with the forge.
Operations such as downloading origs from mirrors,
and ftpmaster API enquiries, may be done before or after.

After `go-ahead` is granted, all failures are irrecoverable.

### Outcome

Below, MESSAGE is UTF-8 text, possibly containing whitespace,
up to the newline.

The manager will log it,
and display it publicly in its status reports.

#### Success

```
[state: processing-committed]
    > message MESSAGE
    > email reported
    > uploaded
[state: worker-waiting]
```

The package was uploaded.

The Oracle has sent an email report to the uploader,
CC the administrators and the archive list.

#### Permanent failure

```
[state: processing-committed or processing-uncommitted]
    > message MESSAGE
[a] > email reported
[b] > email unreported
    > irrecoverable
[state: worker-waiting]
```

Something is irrevocably wrong with the tag or the package contents.
There should be no retries.

\[a]: The Oracle has sent an email report to the uploader,
CC the administrators and the archive list.

\[b]: The Oracle has *not* sent an email report.
(For example, the Oracle's security posture is to avoid processing tags
when the signature doesn't verify or isn't from an authorised signer,
and in that case it doesn't send emails.)
The manager should do so.

#### Temporary failure

```
[state: processing-uncommitted]
    > message MESSAGE
[a] > email reported
[b] > email unreported
    > retriable
[state: worker-waiting]
```

Something went wrong, but retrying may help.

`email reported` and `email unreported` are as for permanent failure.
The Oracle should send an email to the uploader and say `email reported`
only if the manager sent `job ... last-attempt ...` for this job.
(The Oracle may send an email to the public audit/archive list
in any case, but that doesn't count as `email reported`.)

The Oracle should avoid reporting `retriable` for situations
where a retry will not succeed.  An adequate implementation is
to do all forge accesses first;
to report `retriable` only if forge access fails;
and to do all other processing after `go-ahead`.
But the Oracle must make an effort to detect permanent forge errors
(repo 404, missing git refs) as discussed in #1112106.
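
The three outcome messages share one shape,
which the following Python sketch makes explicit.
It assumes protocol version `6` or later (where the `email` lines exist);
the function names are hypothetical,
and the sketch only formats lines: actually sending the email
is the caller's job.

```python
OUTCOMES = ("uploaded", "irrecoverable", "retriable")

def outcome_lines(outcome, message, email_reported):
    """Format the lines the Oracle sends to report the outcome of a job."""
    if outcome not in OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome!r}")
    if "\n" in message:
        raise ValueError("MESSAGE must fit on one line")
    if outcome == "uploaded" and not email_reported:
        # The success exchange only ever contains `email reported`.
        raise ValueError("a successful upload is always reported by email")
    email = "email reported" if email_reported else "email unreported"
    return [f"message {message}\n", f"{email}\n", f"{outcome}\n"]

def retriable_lines(message, retry_status):
    """Report a temporary failure.

    The uploader is emailed (and `email reported` is sent)
    only when the job was issued with `last-attempt`.
    """
    return outcome_lines("retriable", message, retry_status == "last-attempt")
```

For example, `retriable_lines("forge timed out", "not-last-attempt")`
yields `message`, `email unreported`, `retriable`, in that order.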

### Conclusion

After sending the outcome,
the Oracle should either close the connection,
or retain it and wait for further jobs.
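
The states and transitions described in the sections above
can be summarised as a lookup table.
This Python sketch is a reading aid rather than part of the protocol:
each multi-line exchange is collapsed into a single event
named after its first or final message,
and `restart-worker` and `protocol-violation`
(which end the connection) are left out.

```python
# (current state, event) -> next state.
# "<" events come from the Manager, ">" events from the Oracle.
TRANSITIONS = {
    ("worker-waiting", "< ayt"): "worker-waiting",
    ("worker-waiting", "< job"): "processing-uncommitted",
    ("processing-uncommitted", "< go-ahead"): "processing-committed",
    ("processing-uncommitted", "> retriable"): "worker-waiting",
    ("processing-uncommitted", "> irrecoverable"): "worker-waiting",
    ("processing-committed", "> irrecoverable"): "worker-waiting",
    ("processing-committed", "> uploaded"): "worker-waiting",
}

def next_state(state, event):
    """Return the state after `event`, or raise if it is not allowed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not valid in state {state!r}") from None
```

Note that `retriable` has no entry for `processing-committed`:
once `go-ahead` has been granted, every failure is irrecoverable.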

### Protocol violations, reporting

Either side may send this message, at any time
(except in the middle of data blocks)
if it considers that its peer has violated the protocol:

```
    > protocol-violation MESSAGE
    < protocol-violation MESSAGE
```

The complaining side should then close the connection.

The complained-at side should report the error somewhere,
and will ideally display it in user-facing output
such as status web pages or emails.
It should also then close the connection.

The complaining side that sends `protocol-violation`
should *also* report or log the error as appropriate.

### Connection failures - handling by Oracle

If the connection is dropped,
or a connection attempt is unsuccessful,
the Oracle should retry with a delay.
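
A reconnect loop along these lines might look as follows in Python.
This is only a sketch: the document does not specify the delay,
so the 30-second default is an arbitrary assumption,
as is treating `OSError` as the signal for a failed attempt.
`connect` is a hypothetical caller-supplied function
that opens the `ssh manager nc` transport.

```python
import time

def connect_with_retry(connect, delay_seconds=30.0, sleep=time.sleep):
    """Call `connect()` until it succeeds, sleeping between attempts."""
    while True:
        try:
            return connect()
        except OSError:
            # Connection attempt failed; wait, then try again.
            sleep(delay_seconds)
```

The `sleep` parameter exists so that the loop can be exercised in tests
without real waiting.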

### Connection failures - handling by Manager

If the connection fails (or the protocol is violated)
after `go-ahead` and before the outcome,
the job is treated as irrecoverable.

The Manager always does an `ayt` check
immediately before issuing a job,
to minimise the opportunity for jobs to be lost
simply because of a broken connection.

(The rest of the time the Manager doesn't care about connection failure.)

### Error recovery and retrying jobs

Jobs can be retried when:

 * The Manager fails to access the forge in a way that looks retriable
   (eg 500 errors from HTTP).
 * The Oracle disconnects, or reports `retriable`, before `go-ahead`
   (without `job ... last-attempt ...`).

The retry and backoff implementation (including policy configuration)
is the responsibility of the Manager.
The Oracle never performs any retries or backoff itself;
its only responsibility with regard to this section of the specification
is to report whether a failure is retriable.
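
The Manager's side of the second retry condition
can be expressed as a small decision function.
This Python sketch is not the `tag2upload-service-manager` code
(which is written in Rust);
the function name and the `"retry"` / `"give-up"` labels are hypothetical,
and Manager-side forge failures are not covered.

```python
def failure_disposition(event, committed, last_attempt):
    """Decide what the Manager does with a job that did not succeed.

    event: "retriable", "irrecoverable" or "disconnect"
    committed: whether `go-ahead` had already been sent
    last_attempt: whether the job was issued with `last-attempt`
    """
    if event == "irrecoverable":
        return "give-up"
    if event in ("retriable", "disconnect"):
        if committed:
            # After `go-ahead`, a lost connection (or a protocol
            # violation such as a late `retriable`) is irrecoverable.
            return "give-up"
        return "give-up" if last_attempt else "retry"
    raise ValueError(f"unknown event: {event!r}")
```

When and how often a `"retry"` actually happens
is then a matter for the Manager's retry configuration.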

The retry schedule is controlled by the `retry`
configuration of `tag2upload-service-manager`.
See the Rustdoc for `tag2upload_service_manager::config::Retry`
for the configuration, and a working through of the implications.