File: Checkpointing_implementation.md

package info (click to toggle)
mpich 4.3.2-2
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid
  • size: 101,184 kB
  • sloc: ansic: 1,040,629; cpp: 82,270; javascript: 40,763; perl: 27,933; python: 16,041; sh: 14,676; xml: 14,418; f90: 12,916; makefile: 9,270; fortran: 8,046; java: 4,635; asm: 324; ruby: 103; awk: 27; lisp: 19; php: 8; sed: 4
file content (113 lines) | stat: -rw-r--r-- 5,433 bytes parent folder | download | duplicates (3)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
# Checkpointing Implementation

This document describes the implementation of checkpointing in Nemesis.
For now this consists of two emails I sent describing it. Eventually
I'll clean this up.

## Email 1

BLCR allows you to checkpoint a process and all of its children. When it
does this any open sockets or mmapped files between them are preserved
after restart.

Each process registers a checkpoint callback function which is called
when a checkpoint of that process is requested. We selected the option
to have BLCR create a thread and call the callback from within that
thread.

When the app is started using hydra, it starts a proxy process on each
node. The proxy then forks the mpi processes, so the mpi processes are
children of the proxy process.

Checkpoints are initiated by hydra.

So, when a checkpoint is taken, each proxy requests a checkpoint to be
taken for itself and all of its children. BLCR then calls the callback
in each process.

Because we don't want the hydra proxy to be part of the checkpoint
image, from inside the callback, the proxy specifies that it should be
omitted from the checkpoint.

When the callback is called in an mpi process, we set a flag indicating
that the checkpoint protocol should be initiated, then wait on a
condition variable. Remember that the callback is called from within a
thread, so the main thread (and any other non blcr threads) continue
running.

When the mpi process enters the progress engine, it sees the flag is set
and initiates the checkpoint protocol (sending markers, shutting down
netmods, etc). When this protocol is complete, it signals the condition
variable to wake the blcr thread, then waits on its own condition
variable.

Once woken, callback thread calls the blcr function to actually take the
checkpoint. When the checkpoint completes (I can't remember if that's a
blocking function or we poll), it signals the condition variable to wake
the main thread, and returns.

When the main thread wakes it restarts the netmods and continues.

Note that the callback is also called when the process restarts after a
failure, so there is a switch statement (I believe) in there to test for
that.

## Email 2

The hydra proxy starts a blcr checkpoint of itself and all of its
children (i.e., all of the mpi processes on that node). This results in
the callback being called for itself and all of the mpi processes. In
the callback, the hydra proxy excludes itself from the checkpoint and
exits the callback. This results in checkpointing all of the processes
on one node in a single checkpoint image. Note that the callback is
actually called in a separate thread (rather than in an interrupt
context). This allows us to initiate and complete the checkpoint
protocol before taking the checkpoint of the local process.

In the callback of the mpi processes, rank 0 sets `start_checkpoint=TRUE`
then waits on `ckpt_sem`, every other process just waits on `ckpt_sem`. In
the progress engine, when `start_checkpoint=TRUE` (at rank0), it calls
`ckpt_start()`, which sends marker messages to every process. When the
other processes receive the first marker message, they call
`ckpt_start()`. Once every process receives a marker from every other
process, the `finish_checkpoint` flag is set.

Back at the progress engine, when `finish_checkpoint` is `TRUE`,
`ckpt_finish()` is called. This signals the `ckpt_sem` semaphore that the
blcr callback is waiting on then waits on the `cont_sem`.

When the `ckpt_sem` is signaled, the callback continues, and tells blcr
it's ready for blcr to take the checkpoint image. Once all processes on
the node are ready, blcr checkpoints all processes into one checkpoint
image.

When the blcr checkpoint function returns, the callback checks the
return code to determine if this is a restart, or just continuing after
taking the checkpoint. In the restart case, the environment variables
are reset (the env vars after a restart are the same as they were the in
the original process, we need new env vars for the new incarnation of
this process because, e.g., the PMI connection info will have changed.).
Then the stdin out and err pipes are reconnected to the pmi proxy (when
the pmi proxy initially starts the mpi process, it dup's the stdin out
and err fds, but when the processes are restarted, the proxy does not
fork (it calls blcr restart), so it doesn't have a chance to dup the
fds, so the process itself needs to connect the fds through named pipes
that the pmi proxy set up). Next it reinitializes pmi so new business
cards can be exchanged. Finally, it lets the netmod know that it has
restarted from a checkpoint.

In the continue case, it lets the netmod know that the checkpoint has
completed, and we're continuing.

When a marker message is received we need to make sure that no messages
are received from that VC until the checkpoint has completed. In TCP, we
handle this by holding the messages at the sender (as opposed to
blocking incoming messages at the receiver). So after the restart we
need to send control messages to tell the processes it's OK to send us
messages again.

Note that maker messages are not sent to intranode processes. This is
because blcr will include the shared memory region in the checkpoint
when processes are checkpointed together. We use the `nem_barrier()`
(which is a shared memory intranode barrier) to sync the local
processes, but otherwise don't drain the shared-memory queues.