File: SPEC.md

package info (click to toggle)
runc 1.0.0~rc9+dfsg1-1
  • links: PTS, VCS
  • area: main
  • in suites: bullseye, sid
  • size: 1,996 kB
  • sloc: sh: 1,422; ansic: 1,008; makefile: 116
file content (465 lines) | stat: -rw-r--r-- 18,563 bytes parent folder | download | duplicates (4)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
## Container Specification - v1

This is the standard configuration for version 1 containers.  It includes
namespaces, standard filesystem setup, a default Linux capability set, and
information about resource reservations.  It also has information about any 
populated environment settings for the processes running inside a container.

Along with the configuration of how a container is created the standard also
discusses actions that can be performed on a container to manage and inspect
information about the processes running inside.

The v1 profile is meant to be able to accommodate the majority of applications
with a strong security configuration.

### System Requirements and Compatibility

Minimum requirements:
* Kernel version - 3.10 recommended 2.6.2x minimum(with backported patches)
* Mounted cgroups with each subsystem in its own hierarchy


### Namespaces

|     Flag        | Enabled |
| --------------- | ------- |
| CLONE_NEWPID    |    1    |
| CLONE_NEWUTS    |    1    |
| CLONE_NEWIPC    |    1    |
| CLONE_NEWNET    |    1    |
| CLONE_NEWNS     |    1    |
| CLONE_NEWUSER   |    1    |
| CLONE_NEWCGROUP |    1    |

Namespaces are created for the container via the `unshare` syscall.


### Filesystem

A root filesystem must be provided to a container for execution.  The container
will use this root filesystem (rootfs) to jail and spawn processes inside where
the binaries and system libraries are local to that directory.  Any binaries
to be executed must be contained within this rootfs.

Mounts that happen inside the container are automatically cleaned up when the
container exits as the mount namespace is destroyed and the kernel will 
unmount all the mounts that were setup within that namespace.

For a container to execute properly there are certain filesystems that 
are required to be mounted within the rootfs that the runtime will setup.

|     Path    |  Type  |                  Flags                 |                 Data                     |
| ----------- | ------ | -------------------------------------- | ---------------------------------------- |
| /proc       | proc   | MS_NOEXEC,MS_NOSUID,MS_NODEV           |                                          |
| /dev        | tmpfs  | MS_NOEXEC,MS_STRICTATIME               | mode=755                                 |
| /dev/shm    | tmpfs  | MS_NOEXEC,MS_NOSUID,MS_NODEV           | mode=1777,size=65536k                    |
| /dev/mqueue | mqueue | MS_NOEXEC,MS_NOSUID,MS_NODEV           |                                          |
| /dev/pts    | devpts | MS_NOEXEC,MS_NOSUID                    | newinstance,ptmxmode=0666,mode=620,gid=5 |
| /sys        | sysfs  | MS_NOEXEC,MS_NOSUID,MS_NODEV,MS_RDONLY |                                          |


After a container's filesystems are mounted within the newly created 
mount namespace `/dev` will need to be populated with a set of device nodes.
It is expected that a rootfs does not need to have any device nodes specified
for `/dev` within the rootfs as the container will setup the correct devices
that are required for executing a container's process.

|      Path    | Mode |   Access   |
| ------------ | ---- | ---------- |
| /dev/null    | 0666 |  rwm       |
| /dev/zero    | 0666 |  rwm       |
| /dev/full    | 0666 |  rwm       |
| /dev/tty     | 0666 |  rwm       |
| /dev/random  | 0666 |  rwm       |
| /dev/urandom | 0666 |  rwm       |


**ptmx**
`/dev/ptmx` will need to be a symlink to the host's `/dev/ptmx` within
the container.  

The use of a pseudo TTY is optional within a container and it should support both.
If a pseudo is provided to the container `/dev/console` will need to be 
setup by binding the console in `/dev/` after it has been populated and mounted
in tmpfs.

|      Source     | Destination  | UID GID | Mode | Type |
| --------------- | ------------ | ------- | ---- | ---- |
| *pty host path* | /dev/console | 0 0     | 0600 | bind | 


After `/dev/null` has been setup we check for any external links between
the container's io, STDIN, STDOUT, STDERR.  If the container's io is pointing
to `/dev/null` outside the container we close and `dup2` the `/dev/null` 
that is local to the container's rootfs.


After the container has `/proc` mounted a few standard symlinks are setup 
within `/dev/` for the io.

|    Source       | Destination |
| --------------- | ----------- |
| /proc/self/fd   | /dev/fd     |
| /proc/self/fd/0 | /dev/stdin  |
| /proc/self/fd/1 | /dev/stdout |
| /proc/self/fd/2 | /dev/stderr |

A `pivot_root` is used to change the root for the process, effectively 
jailing the process inside the rootfs.

```c
put_old = mkdir(...);
pivot_root(rootfs, put_old);
chdir("/");
unmount(put_old, MS_DETACH);
rmdir(put_old);
```

For container's running with a rootfs inside `ramfs` a `MS_MOVE` combined
with a `chroot` is required as `pivot_root` is not supported in `ramfs`.

```c
mount(rootfs, "/", NULL, MS_MOVE, NULL);
chroot(".");
chdir("/");
```

The `umask` is set back to `0022` after the filesystem setup has been completed.

### Resources

Cgroups are used to handle resource allocation for containers.  This includes
system resources like cpu, memory, and device access.

| Subsystem  | Enabled |
| ---------- | ------- |
| devices    | 1       |
| memory     | 1       |
| cpu        | 1       |
| cpuacct    | 1       |
| cpuset     | 1       |
| blkio      | 1       |
| perf_event | 1       |
| freezer    | 1       |
| hugetlb    | 1       |
| pids       | 1       |


All cgroup subsystem are joined so that statistics can be collected from
each of the subsystems.  Freezer does not expose any stats but is joined
so that containers can be paused and resumed.

The parent process of the container's init must place the init pid inside
the correct cgroups before the initialization begins.  This is done so
that no processes or threads escape the cgroups.  This sync is 
done via a pipe ( specified in the runtime section below ) that the container's
init process will block waiting for the parent to finish setup.

### IntelRdt

Intel platforms with new Xeon CPU support Resource Director Technology (RDT).
Cache Allocation Technology (CAT) and Memory Bandwidth Allocation (MBA) are
two sub-features of RDT.

Cache Allocation Technology (CAT) provides a way for the software to restrict
cache allocation to a defined 'subset' of L3 cache which may be overlapping
with other 'subsets'. The different subsets are identified by class of
service (CLOS) and each CLOS has a capacity bitmask (CBM).

Memory Bandwidth Allocation (MBA) provides indirect and approximate throttle
over memory bandwidth for the software. A user controls the resource by
indicating the percentage of maximum memory bandwidth or memory bandwidth limit
in MBps unit if MBA Software Controller is enabled.

It can be used to handle L3 cache and memory bandwidth resources allocation
for containers if hardware and kernel support Intel RDT CAT and MBA features.

In Linux 4.10 kernel or newer, the interface is defined and exposed via
"resource control" filesystem, which is a "cgroup-like" interface.

Comparing with cgroups, it has similar process management lifecycle and
interfaces in a container. But unlike cgroups' hierarchy, it has single level
filesystem layout.

CAT and MBA features are introduced in Linux 4.10 and 4.12 kernel via
"resource control" filesystem.

Intel RDT "resource control" filesystem hierarchy:
```
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
|   |-- L3
|   |   |-- cbm_mask
|   |   |-- min_cbm_bits
|   |   |-- num_closids
|   |-- MB
|       |-- bandwidth_gran
|       |-- delay_linear
|       |-- min_bandwidth
|       |-- num_closids
|-- ...
|-- schemata
|-- tasks
|-- <container_id>
    |-- ...
    |-- schemata
    |-- tasks
```

For runc, we can make use of `tasks` and `schemata` configuration for L3
cache and memory bandwidth resources constraints.

The file `tasks` has a list of tasks that belongs to this group (e.g.,
<container_id>" group). Tasks can be added to a group by writing the task ID
to the "tasks" file (which will automatically remove them from the previous
group to which they belonged). New tasks created by fork(2) and clone(2) are
added to the same group as their parent.

The file `schemata` has a list of all the resources available to this group.
Each resource (L3 cache, memory bandwidth) has its own line and format.

L3 cache schema:
It has allocation bitmasks/values for L3 cache on each socket, which
contains L3 cache id and capacity bitmask (CBM).
```
	Format: "L3:<cache_id0>=<cbm0>;<cache_id1>=<cbm1>;..."
```
For example, on a two-socket machine, the schema line could be "L3:0=ff;1=c0"
which means L3 cache id 0's CBM is 0xff, and L3 cache id 1's CBM is 0xc0.

The valid L3 cache CBM is a *contiguous bits set* and number of bits that can
be set is less than the max bit. The max bits in the CBM is varied among
supported Intel CPU models. Kernel will check if it is valid when writing.
e.g., default value 0xfffff in root indicates the max bits of CBM is 20
bits, which mapping to entire L3 cache capacity. Some valid CBM values to
set in a group: 0xf, 0xf0, 0x3ff, 0x1f00 and etc.

Memory bandwidth schema:
It has allocation values for memory bandwidth on each socket, which contains
L3 cache id and memory bandwidth.
```
	Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..."
```
For example, on a two-socket machine, the schema line could be "MB:0=20;1=70"

The minimum bandwidth percentage value for each CPU model is predefined and
can be looked up through "info/MB/min_bandwidth". The bandwidth granularity
that is allocated is also dependent on the CPU model and can be looked up at
"info/MB/bandwidth_gran". The available bandwidth control steps are:
min_bw + N * bw_gran. Intermediate values are rounded to the next control
step available on the hardware.

If MBA Software Controller is enabled through mount option "-o mba_MBps"
mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl
We could specify memory bandwidth in "MBps" (Mega Bytes per second) unit
instead of "percentages". The kernel underneath would use a software feedback
mechanism or a "Software Controller" which reads the actual bandwidth using
MBM counters and adjust the memory bandwidth percentages to ensure:
"actual memory bandwidth < user specified memory bandwidth".

For example, on a two-socket machine, the schema line could be
"MB:0=5000;1=7000" which means 5000 MBps memory bandwidth limit on socket 0
and 7000 MBps memory bandwidth limit on socket 1.

For more information about Intel RDT kernel interface:  
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt

```
An example for runc:
Consider a two-socket machine with two L3 caches where the default CBM is
0x7ff and the max CBM length is 11 bits, and minimum memory bandwidth of 10%
with a memory bandwidth granularity of 10%.

Tasks inside the container only have access to the "upper" 7/11 of L3 cache
on socket 0 and the "lower" 5/11 L3 cache on socket 1, and may use a
maximum memory bandwidth of 20% on socket 0 and 70% on socket 1.

"linux": {
    "intelRdt": {
        "closID": "guaranteed_group",
        "l3CacheSchema": "L3:0=7f0;1=1f",
        "memBwSchema": "MB:0=20;1=70"
    }
}
```

### Security 

The standard set of Linux capabilities that are set in a container
provide a good default for security and flexibility for the applications.


|     Capability       | Enabled |
| -------------------- | ------- |
| CAP_NET_RAW          | 1       |
| CAP_NET_BIND_SERVICE | 1       |
| CAP_AUDIT_READ       | 1       |
| CAP_AUDIT_WRITE      | 1       |
| CAP_DAC_OVERRIDE     | 1       |
| CAP_SETFCAP          | 1       |
| CAP_SETPCAP          | 1       |
| CAP_SETGID           | 1       |
| CAP_SETUID           | 1       |
| CAP_MKNOD            | 1       |
| CAP_CHOWN            | 1       |
| CAP_FOWNER           | 1       |
| CAP_FSETID           | 1       |
| CAP_KILL             | 1       |
| CAP_SYS_CHROOT       | 1       |
| CAP_NET_BROADCAST    | 0       |
| CAP_SYS_MODULE       | 0       |
| CAP_SYS_RAWIO        | 0       |
| CAP_SYS_PACCT        | 0       |
| CAP_SYS_ADMIN        | 0       |
| CAP_SYS_NICE         | 0       |
| CAP_SYS_RESOURCE     | 0       |
| CAP_SYS_TIME         | 0       |
| CAP_SYS_TTY_CONFIG   | 0       |
| CAP_AUDIT_CONTROL    | 0       |
| CAP_MAC_OVERRIDE     | 0       |
| CAP_MAC_ADMIN        | 0       |
| CAP_NET_ADMIN        | 0       |
| CAP_SYSLOG           | 0       |
| CAP_DAC_READ_SEARCH  | 0       |
| CAP_LINUX_IMMUTABLE  | 0       |
| CAP_IPC_LOCK         | 0       |
| CAP_IPC_OWNER        | 0       |
| CAP_SYS_PTRACE       | 0       |
| CAP_SYS_BOOT         | 0       |
| CAP_LEASE            | 0       |
| CAP_WAKE_ALARM       | 0       |
| CAP_BLOCK_SUSPEND    | 0       |


Additional security layers like [apparmor](https://wiki.ubuntu.com/AppArmor)
and [selinux](http://selinuxproject.org/page/Main_Page) can be used with
the containers.  A container should support setting an apparmor profile or 
selinux process and mount labels if provided in the configuration.  

Standard apparmor profile:
```c
#include <tunables/global>
profile <profile_name> flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/base>
  network,
  capability,
  file,
  umount,

  deny @{PROC}/sys/fs/** wklx,
  deny @{PROC}/sysrq-trigger rwklx,
  deny @{PROC}/mem rwklx,
  deny @{PROC}/kmem rwklx,
  deny @{PROC}/sys/kernel/[^s][^h][^m]* wklx,
  deny @{PROC}/sys/kernel/*/** wklx,

  deny mount,

  deny /sys/[^f]*/** wklx,
  deny /sys/f[^s]*/** wklx,
  deny /sys/fs/[^c]*/** wklx,
  deny /sys/fs/c[^g]*/** wklx,
  deny /sys/fs/cg[^r]*/** wklx,
  deny /sys/firmware/efi/efivars/** rwklx,
  deny /sys/kernel/security/** rwklx,
}
```

*TODO: seccomp work is being done to find a good default config*

### Runtime and Init Process

During container creation the parent process needs to talk to the container's init 
process and have a form of synchronization.  This is accomplished by creating
a pipe that is passed to the container's init.  When the init process first spawns 
it will block on its side of the pipe until the parent closes its side.  This
allows the parent to have time to set the new process inside a cgroup hierarchy 
and/or write any uid/gid mappings required for user namespaces.  
The pipe is passed to the init process via FD 3.

The application consuming libcontainer should be compiled statically.  libcontainer
does not define any init process and the arguments provided are used to `exec` the
process inside the application.  There should be no long running init within the 
container spec.

If a pseudo tty is provided to a container it will open and `dup2` the console
as the container's STDIN, STDOUT, STDERR as well as mounting the console
as `/dev/console`.

An extra set of mounts are provided to a container and setup for use.  A container's
rootfs can contain some non portable files inside that can cause side effects during
execution of a process.  These files are usually created and populated with the container
specific information via the runtime.  

**Extra runtime files:**
* /etc/hosts 
* /etc/resolv.conf
* /etc/hostname
* /etc/localtime


#### Defaults

There are a few defaults that can be overridden by users, but in their omission
these apply to processes within a container.

|       Type          |             Value              |
| ------------------- | ------------------------------ |
| Parent Death Signal | SIGKILL                        | 
| UID                 | 0                              |
| GID                 | 0                              |
| GROUPS              | 0, NULL                        |
| CWD                 | "/"                            |
| $HOME               | Current user's home dir or "/" |
| Readonly rootfs     | false                          |
| Pseudo TTY          | false                          |


## Actions

After a container is created there is a standard set of actions that can
be done to the container.  These actions are part of the public API for 
a container.

|     Action     |                         Description                                |
| -------------- | ------------------------------------------------------------------ |
| Get processes  | Return all the pids for processes running inside a container       | 
| Get Stats      | Return resource statistics for the container as a whole            |
| Wait           | Waits on the container's init process ( pid 1 )                    |
| Wait Process   | Wait on any of the container's processes returning the exit status | 
| Destroy        | Kill the container's init process and remove any filesystem state  |
| Signal         | Send a signal to the container's init process                      |
| Signal Process | Send a signal to any of the container's processes                  |
| Pause          | Pause all processes inside the container                           |
| Resume         | Resume all processes inside the container if paused                |
| Exec           | Execute a new process inside of the container  ( requires setns )  |
| Set            | Setup configs of the container after it's created                  |

### Execute a new process inside of a running container

User can execute a new process inside of a running container. Any binaries to be
executed must be accessible within the container's rootfs.

The started process will run inside the container's rootfs. Any changes
made by the process to the container's filesystem will persist after the
process finished executing.

The started process will join all the container's existing namespaces. When the
container is paused, the process will also be paused and will resume when
the container is unpaused.  The started process will only run when the container's
primary process (PID 1) is running, and will not be restarted when the container
is restarted.

#### Planned additions

The started process will have its own cgroups nested inside the container's
cgroups. This is used for process tracking and optionally resource allocation
handling for the new process. Freezer cgroup is required, the rest of the cgroups
are optional. The process executor must place its pid inside the correct
cgroups before starting the process. This is done so that no child processes or
threads can escape the cgroups.

When the process is stopped, the process executor will try (in a best-effort way)
to stop all its children and remove the sub-cgroups.