File: README.cray

package info (click to toggle)
gridengine 6.2-4
  • links: PTS, VCS
  • area: main
  • in suites: lenny
  • size: 51,532 kB
  • ctags: 51,172
  • sloc: ansic: 418,155; java: 37,080; sh: 22,593; jsp: 7,699; makefile: 5,292; csh: 4,244; xml: 2,901; cpp: 2,086; perl: 1,895; tcl: 1,188; lisp: 669; ruby: 642; yacc: 393; lex: 266
file content (84 lines) | stat: -rw-r--r-- 4,094 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
                    Cray kernel level checkpointing
                    -------------------------------

___INFO__MARK_BEGIN__
-------------------------------------------------------------------------
  The Contents of this file are made available subject to the terms of
  the Sun Industry Standards Source License Version 1.2

  Sun Microsystems Inc., March, 2001


  Sun Industry Standards Source License Version 1.2
  =================================================
  The contents of this file are subject to the Sun Industry Standards
  Source License Version 1.2 (the "License"); You may not use this file
  except in compliance with the License. You may obtain a copy of the
  License at http://gridengine.sunsource.net/Gridengine_SISSL_license.html

  Software provided under this License is provided on an "AS IS" basis,
  WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING,
  WITHOUT LIMITATION, WARRANTIES THAT THE SOFTWARE IS FREE OF DEFECTS,
  MERCHANTABLE, FIT FOR A PARTICULAR PURPOSE, OR NON-INFRINGING.
  See the License for the specific provisions governing your rights and
  obligations concerning the Software.

  The Initial Developer of the Original Code is: Sun Microsystems, Inc.

  Copyright: 2001 by Sun Microsystems, Inc.

  All Rights Reserved.
-------------------------------------------------------------------------
___INFO__MARK_END__


The Cray kernel-level checkpointing interfaces use the chkpnt(1) and
restart(1) commands to checkpoint and restart Grid Engine/Grid Engine
Enterprise Edition jobs. Here is some specific information about the Cray
checkpointing interface scripts.

1. The cray_ckpt_command and cray_migrate_command scripts both delete the
user's job script file before executing the chkpnt(1) command.  This forces
the chkpnt(1) command to save the unlinked script file as part of the
checkpoint restart file.  Otherwise the script will not have the same inode
number when the job is restarted, because Grid Engine/Grid Engine
Enterprise Edition will have deleted and restored it, and the restart(1)
will fail.  In order for this workaround to work, the user must have write
access to the $SGE_ROOT/$SGE_CELL/spool/<exec_host_name>/job_scripts
directory.

2. The cray_restart_command script uses the acctcom(1) command to retrieve
the status of the restarted job when it completes.  This is needed because
the restart(1) command only returns a 0 or a 1.  The acttcom(1) command 
returns data which is then parsed by the parse_job_status script in order to
get the status of the completed job.  Unfortunatley, the acctcom(1) utility
only reports the signal number if the job was killed by a signal or a 0 if
the job was not.  So the actual status code returned by the job script is not
available for a restarted job.  The cray_jobstatus script is a utility to
print out the status code reported by acctcom(1) given a cray job id and
process id.

3. When setting up a checkpointing interface for these scripts in the
Qmon checkpoint config dialog, the checkpoint commands should be
identified in the following format.  The names in <> should be filled
in with the appropriate values.

   <sge_root>/ckpt/<chkpnt_script_name> $job_id $job_pid $ckpt_dir

4. The checkpoint interface scripts log some general information to the file
$ckpt_dir/ckpt.log

5. The checkpoint files are placed in $ckpt_dir/ckpt.$job_id.  A detailed
log is available in $ckpt_dir/ckpt.$job_id/ckpt.log. The
$ckpt_dir/ckpt.$job_id directory is cleaned up in the cleanup_command
script.

6. These scripts are designed to handle a system failure during checkpointing.
That is, the latest checkpoint file will not be used for a restart until the
checkpoint command completes.  If the system fails before the checkpoint has
completed, the last checkpoint file will be used for restart.

7. The checkpoint files will not be deleted in the cray_clean_command script
if either the SGE_LEAVE_CKPT_DIR environment variable is set to a non-null
string or if the fourth parameter in the clean command for the checkpoint
configuration is set to the string "save".