# Scheduler design
This document covers the design and implementation details of the swarmkit
scheduler.
## Overview
In the SwarmKit [task model](task_model.md), tasks start in the `New` state,
and advance to `Pending` once pre-scheduling activities like network allocation
are done. The scheduler becomes responsible for tasks once they reach the
`Pending` state. If the task can be scheduled, the scheduler schedules it
immediately (subject to batching), and advances the state to `Assigned`. If it
isn't possible to schedule the task immediately, for example, because no nodes
have sufficient resources, the task will stay in the `Pending` state until it
becomes possible to schedule it.

When the state of a task reaches `Assigned`, the dispatcher sends this task to
the assigned node to start the process of executing it.

Each task will only pass through the scheduler once. Once a task is assigned to
a node, this decision cannot be revisited. See the [task model](task_model.md)
for more details on task lifecycle.
## Global service tasks
Both replicated and global service tasks pass through the scheduler. For
replicated tasks, the scheduler needs to decide which node the task should run
on. For global service tasks, the job of the scheduler is considerably simpler,
because the global orchestrator creates these tasks with the `NodeID` field
already set. In this case, the scheduler only has to confirm that the node
satisfies all the constraints and other filters, and once it does, advance the
state to `Assigned`.
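
To make this concrete, here is a minimal sketch of the two paths in Go. The
helper functions passed in are illustrative assumptions standing in for the
filter pipeline and node-selection logic, not swarmkit's actual code:

```go
package scheduler

// Task is a minimal stand-in for the swarmkit task fields relevant here.
type Task struct {
    ID     string
    NodeID string // pre-set by the global orchestrator for global services
}

// scheduleOne sketches the decision described above. pipelineAccepts and
// pickNode are hypothetical helpers, not swarmkit functions.
func scheduleOne(t *Task, pipelineAccepts func(nodeID string) bool,
    pickNode func() (string, bool)) (nodeID string, ok bool) {
    if t.NodeID != "" {
        // Global service task: the node is already chosen; just confirm it
        // passes all filters before the task advances to Assigned.
        if pipelineAccepts(t.NodeID) {
            return t.NodeID, true
        }
        return "", false // remains Pending until the node qualifies
    }
    // Replicated service task: the scheduler must choose a node itself.
    return pickNode()
}
```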
## Filters
The scheduler needs to run several checks against each candidate node to make
sure that node is suitable for running the task. At present, this includes the
following set of checks:
- Confirming the node is in the `Ready` state (as opposed to `Down` or
  `Disconnected`) and that its availability is `Active` (as opposed to `Pause`
  or `Drain`)
- Confirming sufficient resource availability
- Checking that all necessary plugins are installed on the node
- Checking that user-specified constraints are satisfied
- Checking that the node has the correct OS and architecture
- Checking that host ports aren't used by an existing task on the node

These checks are implemented through a mechanism called `Pipeline`, which
chains together the filters that perform them.

Filters satisfy a simple interface. For efficiency, there is a `SetTask` method
that lets a task be loaded into the filter once and then checked against
several candidate nodes. The `SetTask` method can do any processing that
depends only on the task and not on the node, which saves redundant computation
and allocations. `Filter` also has a `Check` method that tests
the most-recently-loaded task against a candidate node, and an `Explain` method
that provides a human-readable explanation of what an unsuccessful result from
`Check` means. `Explain` is used to produce a message inside the task that
explains what is preventing it from being scheduled.
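
As a rough sketch (types simplified, and the exact signatures in the swarmkit
source may differ), the interface looks like this in Go; the `Pipeline` part
shows only the chaining idea:

```go
package scheduler

// Task and NodeInfo are placeholders standing in for swarmkit's api.Task and
// the scheduler's per-node bookkeeping type.
type Task struct{}
type NodeInfo struct{}

// Filter is a sketch of the interface described above.
type Filter interface {
    // SetTask loads a task and performs any work that depends only on the
    // task; it reports whether the filter applies to this task at all.
    SetTask(t *Task) bool

    // Check tests the most recently loaded task against a candidate node.
    Check(n *NodeInfo) bool

    // Explain produces a human-readable reason for an unsuccessful Check,
    // used in the message stored on an unschedulable task.
    Explain(nodes int) string
}

// Pipeline chains filters together: a node is suitable for the loaded task
// only if every applicable filter accepts it.
type Pipeline struct {
    filters []Filter
}

func (p *Pipeline) Process(n *NodeInfo) bool {
    for _, f := range p.filters {
        if !f.Check(n) {
            return false
        }
    }
    return true
}
```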
## Scheduling algorithm
The current scheduling algorithm works by building a tree of nodes which is
specific to the service, and attempting to equalize the total number of tasks
of this service below the branches of the tree at each level. This is done
subject to constraints, so a node that, for example, doesn't have enough
resources to accommodate more tasks will end up with fewer tasks than its peers.

By default, this tree has only one level, and contains all suitable nodes at
that level. When [placement preferences](topology.md) are specified, the tree
can be customized to equalize the number of tasks across specific sets of
nodes.

While the primary scheduling criterion is the number of tasks from the same
service on the node, the total number of tasks on the node is used as a
tiebreaker. The first priority is spreading tasks from each service over as many
nodes as possible, as evenly as possible, but when there's a choice between
suitable nodes for the next task, preference is given to the node with the
fewest total tasks. Note that this doesn't take into consideration things like
resource reservations and actual resource usage, so this is an area where there
may be a lot of room for future improvement.
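
A minimal sketch of this ordering in Go (field names are illustrative, not the
actual swarmkit identifiers):

```go
package scheduler

// nodeInfo is a minimal stand-in for the scheduler's per-node bookkeeping.
type nodeInfo struct {
    tasksByService map[string]int // running task count per service ID
    totalTasks     int            // all tasks assigned to this node
}

// betterNode reports whether node a is a better choice than node b for the
// given service: fewer tasks of that service wins, and the total number of
// tasks on the node breaks ties.
func betterNode(a, b *nodeInfo, serviceID string) bool {
    if a.tasksByService[serviceID] != b.tasksByService[serviceID] {
        return a.tasksByService[serviceID] < b.tasksByService[serviceID]
    }
    return a.totalTasks < b.totalTasks
}
```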
## Batching
The most expensive part of scheduling is building the tree described above. This
is `O(# nodes)`. If there were `n` nodes and `t` tasks to be scheduled,
scheduling those tasks independently would have `O(n*t)` runtime. We want to do
better than this.

A key optimization is that many tasks are effectively identical for the
scheduler's purposes, being generated by the same service. For example, a
replicated service with 1000 replicas will cause 1000 tasks to be created, but
those tasks can be viewed as equivalent from the scheduler's perspective (until
they are assigned nodes).

If the scheduler can identify a group of identical tasks, it can build a single
tree to be shared between them, instead of building a separate tree for each
one. It does this using the combination of service ID and `SpecVersion`. If
some number of tasks have the same service ID and `SpecVersion`, they get
scheduled as a batch using a single tree.

A slight complication with this is that the scheduler receives tasks one by one,
over a watch channel. If it processed each task immediately, there would be no
opportunities to group tasks and avoid redundant work. To solve this problem,
the scheduler waits up to 50 ms after receiving a task, in hopes of receiving
another identical task. The total latency associated with this batching is
limited to one second.
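
The timing behavior can be sketched as follows. This illustrates only the
50 ms / one second mechanics described above, not the actual swarmkit
implementation (which also groups tasks by service ID and `SpecVersion`):

```go
package scheduler

import "time"

// Task is a placeholder for the swarmkit task type.
type Task struct {
    ServiceID   string
    SpecVersion string
}

// collectBatch gathers tasks from the watch channel into one batch. It flushes
// when 50 ms pass without a new task arriving, or when the batch has been
// accumulating for a full second, whichever comes first.
func collectBatch(tasks <-chan *Task) []*Task {
    batch := []*Task{<-tasks}           // block until at least one task arrives
    deadline := time.After(time.Second) // hard cap on total batching latency
    for {
        select {
        case t := <-tasks:
            batch = append(batch, t) // keep batching; resets the 50 ms window
        case <-time.After(50 * time.Millisecond):
            return batch // quiet period elapsed
        case <-deadline:
            return batch // batch is a second old
        }
    }
}
```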
## Building and using the tree
The tree starts out as a tree of max-heaps containing node objects. The primary
sort criterion for the heaps is the number of tasks from the service in
question running on the node. This provides easy access to the "worst"
candidate node (i.e. the one with the most tasks from that service).

As an example, consider the following situation with nodes `N1`, `N2`, and
`N3`, and services `S1` and `S2`:

| node | S1 tasks | S2 tasks | labels                  |
|------|----------|----------|-------------------------|
| `N1` | 1        | 1        | engine.labels.os=ubuntu |
| `N2` | 1        | 0        | engine.labels.os=ubuntu |
| `N3` | 0        | 1        | engine.labels.os=centos |

Suppose we want to scale up `S2` by adding one more task. If there are no
placement preferences, the tree of max-heaps we generate in the context of `S2`
only has a single heap, which looks like this:
```
N1 <--- "worst" node choice for S2
/ \
N2 N3
```
Note that the above illustration shows a heap, not the tree that organizes the
heaps. The heap has `N1` at the root because `N1` ties `N3` for number of `S2`
tasks, but has more tasks in total. This makes `N1` the last-choice node to
schedule an additional `S2` task.

If there are placement preferences, the tree of heaps can contain multiple
heaps. Here is an example with a preference to spread over `engine.labels.os`:
```
                  [root]
                 /      \
         "ubuntu"        "centos"
        max heap:        max heap:
           N1                N3
           |
           N2
```
The scheduler iterates over the nodes and checks whether each one meets the
constraints. If it does, the node is added to the appropriate heap in the
tree. There is a maximum size for each heap, determined by the number of tasks
being scheduled in the batch (since there is no outcome where more than `n`
nodes are needed to schedule `n` tasks). If that maximum size gets reached for
a certain heap, new nodes will displace the current "worst" node if they score
better.

After this process of populating the heaps, they are converted in-place to
sorted lists, from minimum value (best node) to maximum value (worst node). The
resulting tree of sorted node lists can be used to schedule the group of tasks
by repeatedly choosing the branch with the fewest tasks from the service at
each level. Since the branches in the tree (and the leaves) are sorted by the
figure of merit, it is efficient to loop over these and "fill" them to the
level of the next node in the list. If there are still tasks left over after
doing a first pass, a round-robin approach is used to assign the tasks.
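
A condensed sketch of such a bounded max-heap and its conversion to a sorted
list, using Go's standard `container/heap` package; the names and structure are
illustrative rather than the actual swarmkit code:

```go
package scheduler

import (
    "container/heap"
    "sort"
)

// candidate captures the two figures of merit discussed above.
type candidate struct {
    id           string
    serviceTasks int // tasks from the service being scheduled
    totalTasks   int // all tasks on the node (tiebreaker)
}

// worse reports whether a is a worse scheduling choice than b.
func worse(a, b candidate) bool {
    if a.serviceTasks != b.serviceTasks {
        return a.serviceTasks > b.serviceTasks
    }
    return a.totalTasks > b.totalTasks
}

// nodeMaxHeap keeps the worst candidate at the root for cheap displacement.
type nodeMaxHeap []candidate

func (h nodeMaxHeap) Len() int            { return len(h) }
func (h nodeMaxHeap) Less(i, j int) bool  { return worse(h[i], h[j]) }
func (h nodeMaxHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *nodeMaxHeap) Push(x interface{}) { *h = append(*h, x.(candidate)) }
func (h *nodeMaxHeap) Pop() interface{} {
    old := *h
    x := old[len(old)-1]
    *h = old[:len(old)-1]
    return x
}

// add inserts c, displacing the current worst node once maxSize is reached.
func add(h *nodeMaxHeap, c candidate, maxSize int) {
    if h.Len() < maxSize {
        heap.Push(h, c)
    } else if worse((*h)[0], c) {
        (*h)[0] = c
        heap.Fix(h, 0)
    }
}

// toSortedList orders the collected candidates in place, best to worst.
func toSortedList(h nodeMaxHeap) []candidate {
    sort.Slice(h, func(i, j int) bool { return worse(h[j], h[i]) })
    return h
}
```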
## Local state
The scheduler tries to avoid querying the `MemoryStore`. Instead, it maintains
information on all nodes and tasks in formats that are well-optimized for its
purposes.

A map called `allTasks` contains all tasks relevant to the scheduler, indexed by
ID. In principle this is similar to calling `store.GetTask`, but is more
efficient. The map is kept up to date through events from the store.

A `nodeSet` struct wraps a map that contains information on each node, indexed
by the node ID. In addition to the `Node` structure itself, this includes some
calculated information that's useful to the scheduler, such as the total number
of tasks, the number of tasks by service, a tally of the available resources,
and the set of host ports that are taken on that node.
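
In rough outline (field names are illustrative; see the swarmkit source for the
real `allTasks` and `nodeSet` definitions), this local state looks something
like:

```go
package scheduler

// Placeholder stand-ins for swarmkit's api.Task, api.Node and resource totals.
type Task struct{}
type Node struct{}
type Resources struct{ NanoCPUs, MemoryBytes int64 }

// allTasks mirrors every task the scheduler cares about, keyed by task ID;
// it is kept up to date from store events rather than store queries.
type allTasks map[string]*Task

// nodeInfo caches the per-node facts the scheduler needs repeatedly.
type nodeInfo struct {
    node           *Node               // the node object itself
    totalTasks     int                 // all tasks assigned to the node
    tasksByService map[string]int      // task counts keyed by service ID
    available      Resources           // running tally of free resources
    usedHostPorts  map[uint32]struct{} // host ports already taken on the node
}

// nodeSet wraps the per-node map, keyed by node ID.
type nodeSet struct {
    nodes map[string]nodeInfo
}
```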
## Detecting faulty nodes
A possible problem with the original scheduler was that it might assign tasks to
a misbehaving node indefinitely. If a certain node is unable to successfully run
tasks, it will always look like the least loaded from the scheduler's
perspective, and be the favorite for task assignments. But this could result in
a failure loop where tasks could never get assigned to a node where they would
actually run successfully.

To handle this situation, the scheduler tracks failures of each service by node.
If a service fails several times on any given node within a certain time
interval, that node is marked as potentially faulty for the service. The sort
comparator that determines which nodes are best for scheduling the service
(normally the nodes with the fewest instances of that service) sorts any node
that has been marked potentially faulty as among the last possible choices for
scheduling that service.
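
A sketch of that bookkeeping follows; the window and threshold values here are
illustrative assumptions, not swarmkit's actual settings:

```go
package scheduler

import "time"

const (
    failureWindow    = 5 * time.Minute // look-back interval (illustrative)
    failureThreshold = 5               // failures before a node is suspect
)

// failureTracker records recent task failures per service, per node.
type failureTracker struct {
    // failures[serviceID][nodeID] holds timestamps of recent task failures.
    failures map[string]map[string][]time.Time
}

// faulty reports whether the node should be treated as potentially faulty for
// the service; the sort comparator pushes such nodes to the back of the
// candidate ordering.
func (ft *failureTracker) faulty(serviceID, nodeID string, now time.Time) bool {
    recent := 0
    for _, ts := range ft.failures[serviceID][nodeID] {
        if now.Sub(ts) <= failureWindow {
            recent++
        }
    }
    return recent >= failureThreshold
}
```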