1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<!--Converted with LaTeX2HTML 2019.2 (Released June 5, 2019) -->
<HTML lang="EN">
<HEAD>
<TITLE>4.6 Restarting</TITLE>
<META NAME="description" CONTENT="4.6 Restarting">
<META NAME="keywords" CONTENT="user_guide">
<META NAME="resource-type" CONTENT="document">
<META NAME="distribution" CONTENT="global">
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
<META NAME="viewport" CONTENT="width=device-width, initial-scale=1.0">
<META NAME="Generator" CONTENT="LaTeX2HTML v2019.2">
<LINK REL="STYLESHEET" HREF="user_guide.css">
<LINK REL="previous" HREF="node19.html">
<LINK REL="next" HREF="node21.html">
</HEAD>
<BODY >
<!--Navigation Panel-->
<A
HREF="node21.html">
<IMG WIDTH="37" HEIGHT="24" ALT="next" SRC="next.png"></A>
<A
HREF="node14.html">
<IMG WIDTH="26" HEIGHT="24" ALT="up" SRC="up.png"></A>
<A
HREF="node19.html">
<IMG WIDTH="63" HEIGHT="24" ALT="previous" SRC="prev.png"></A>
<A ID="tex2html207"
HREF="node1.html">
<IMG WIDTH="65" HEIGHT="24" ALT="contents" SRC="contents.png"></A>
<BR>
<B> Next:</B> <A
HREF="node21.html">5 Troubleshooting</A>
<B> Up:</B> <A
HREF="node14.html">4 Performances</A>
<B> Previous:</B> <A
HREF="node19.html">4.5 Understanding the time</A>
<B> <A ID="tex2html208"
HREF="node1.html">Contents</A></B>
<BR>
<BR>
<!--End of Navigation Panel-->
<!--Table of Child-Links-->
<A ID="CHILD_LINKS"><STRONG>Subsections</STRONG></A>
<UL>
<LI><A ID="tex2html209"
HREF="node20.html#SECTION00056100000000000000">4.6.1 Signal trapping (experimental!)</A>
</UL>
<!--End of Table of Child-Links-->
<HR>
<H2><A ID="SECTION00056000000000000000">
4.6 Restarting</A>
</H2>
<P>
Since QE 5.1 restarting from an arbitrary point of the code is no more supported.
<P>
The code must terminate properly in order for restart to be possible. A clean stop can be triggered by one the following three conditions:
<OL>
<LI>The amount of time specified by the input variable max_seconds is reached
</LI>
<LI>The user creates a file named "$prefix.EXIT" either in the working
directory or in output directory "$outdir"
(variables $outdir and $prefix as specified in the control namelist)
</LI>
<LI>(experimental) The code is compiled with signal-trapping support and one of the trapped signals is received (see the next section for details).
</LI>
</OL>
<P>
After the condition is met, the code will try to stop cleanly as soon as possible, which can take a while for large calculation. Writing the files to disk can also be a long process. In order to be safe you need to reserve sufficient time for the stop process to complete.
<P>
If the previous execution of the code has stopped properly, restarting is possible setting restart_mode=``restart'' in the control namelist.
<P>
<H3><A ID="SECTION00056100000000000000">
4.6.1 Signal trapping (experimental!)</A>
</H3>
In order to compile signal-trapping add "-D__TERMINATE_GRACEFULLY" to MANUAL_DFLAGS in the make.doc file. Currently the code intercepts SIGINT, SIGTERM, SIGUSR1, SIGUSR2, SIGXCPU; signals can be added or removed editing the file <TT>clib/custom_signals.c</TT>.
<P>
Common queue systems will send a signal some time before killing a job. The exact behaviour depends on the queue systems and could be configured. Some examples:
<P>
With PBS:
<UL>
<LI>send the default signal (SIGTERM) 120 seconds before the end:
<BR> <TT>#PBS -l signal=@120</TT>
<P>
</LI>
<LI>send signal SIGUSR1 10 minutes before the end:
<BR> <TT>#PBS -l signal=SIGUSR1@600</TT>
<P>
</LI>
<LI>you cand also send a signal manually with qsig
</LI>
<LI>or send a signal and then stop:
<BR> <TT>qdel -W 120 jobid</TT>
<BR>
will send SIGTERM, wait 2 minutes than force stop.
</LI>
</UL>
<P>
With LoadLeveler (untested): the SIGXCPU signal will be sent when wall <I>softlimit</I> is reached, it will then stop the job when <I>hardlimit</I> is reached. You can specify both limits as:
<BR> <TT># @ wall_clock_limit = hardlimit,softlimit</TT>
<BR>
e.g. you can give pw.x thirty minutes to stop using:
<BR> <TT># @ wall_clock_limit = 5:00,4:30</TT>
<BR>
<P>
<HR>
<!--Navigation Panel-->
<A
HREF="node21.html">
<IMG WIDTH="37" HEIGHT="24" ALT="next" SRC="next.png"></A>
<A
HREF="node14.html">
<IMG WIDTH="26" HEIGHT="24" ALT="up" SRC="up.png"></A>
<A
HREF="node19.html">
<IMG WIDTH="63" HEIGHT="24" ALT="previous" SRC="prev.png"></A>
<A ID="tex2html207"
HREF="node1.html">
<IMG WIDTH="65" HEIGHT="24" ALT="contents" SRC="contents.png"></A>
<BR>
<B> Next:</B> <A
HREF="node21.html">5 Troubleshooting</A>
<B> Up:</B> <A
HREF="node14.html">4 Performances</A>
<B> Previous:</B> <A
HREF="node19.html">4.5 Understanding the time</A>
<B> <A ID="tex2html208"
HREF="node1.html">Contents</A></B>
<!--End of Navigation Panel-->
</BODY>
</HTML>
|