File: InstMaint.htm

<HTML>
<HEAD>
<TITLE>1</TITLE>

<META NAME="GENERATOR" CONTENT="Internet Assistant for Microsoft Word 2.0z">
</HEAD>
<BODY>
<HR>
<P>
<CENTER><B><FONT SIZE=2>DISTRIBUTED QUEUEING </FONT></B></CENTER>
<P>
<CENTER><B><FONT SIZE=2>SYSTEM - 3.1.3</FONT></B></CENTER>
<H2><CENTER>August 28, 1996</CENTER></H2>
<H1><CENTER><FONT SIZE=4 COLOR=#FFFFFF>INSTALLATION  AND MAINTENANCE
 MANUAL</FONT></CENTER></H1>
<HR>
<P>
<B>SUPERCOMPUTER COMPUTATIONS RESEARCH INSTITUTE<BR>
<BR>
<BR>
<BR>
</B>
<H3><IMG SRC="IMG00018.GIF"></H3>
<H1><FONT SIZE=4 COLOR=#FFFFFF>Introduction</FONT></H1>
<HR>
<H2>The Distributed Queuing System (DQS)</H2>
<P>
<FONT SIZE=2>The Distributed Queuing System (DQS) is an experimental
batch queuing system which has been under development at the Supercomputer
Computations Research Institute (SCRI) at Florida State University
for the past 7 years. The first years of this activity were funded
by Department of Energy contract DE-FC0585ER250000. DQS is
freely distributed to all parties with the understanding that
it continues to be an evolving development system, and no warranties
should be implied by this distribution.<BR>
</FONT>
<P>
<FONT SIZE=2>DQS is intended to provide a mechanism for the management
of requests for execution of batch jobs on one or more members
of a homogeneous or heterogeneous network of computers. Facilities
for load-balancing, prioritization and expediting of a wide variety
of computational jobs are included to assist each site in tailoring
the behavior of the system to their particular environment.<BR>
</FONT>
<H2>SCRI support </H2>
<P>
<FONT SIZE=2>SCRI will make every effort, within its resources,
to ensure that DQS is suitable for operation as a batch queuing
system in as many site situations as possible. SCRI staff will
respond to requests for assistance from those who are using DQS,
and will investigate bugs, incorporate repairs and update
documentation. However, it is not possible, at this time, to
make a formal commitment for the long term support and enhancement
of this system. Any user or organization which decides to adopt
DQS will be assuming all risks from that undertaking.<BR>
</FONT>
<P>
<FONT SIZE=2>With this release, DQS 3.1.3, the distribution and
support of the previous version, DQS 3.1.2.4, will be continued
for at least the balance of calendar year 1996. Depending on the
need for continued support and SCRI resource availability, some
level of support may be continued beyond that time. We feel, however,
that since the DQS 3.1.3 system is based on the DQS 3.1.2.4 release,
most users will be using DQS 3.1.3 in preference to DQS 3.1.2.4
in the near future.<BR>
</FONT>
<P>
<FONT SIZE=2>DQS 3.1.3 and future enhancements can be obtained
by Internet  ftp from &quot;ftp.scri.fsu.edu&quot;.  <BR>
</FONT>
<P>
<FONT SIZE=2>Announcements of new releases and improvements will
be emailed to anyone who contacts SCRI to add their name to the
announcement list. This is done by:</FONT>
<P>
<FONT SIZE=2>send email to:     dqs-announce@scri.fsu.edu </FONT>
<P>
<FONT SIZE=2>Leave the &quot;subj:&quot;  field blank</FONT>
<P>
<FONT SIZE=2>Send a one line message:   subscribe</FONT>
<P>
<FONT SIZE=2>Names can be removed from this announcement list
by:</FONT>
<P>
<FONT SIZE=2>send email to:  dqs_announce@scri.fsu.edu</FONT>
<P>
<FONT SIZE=2>Leave the &quot;subj&quot;  field blank</FONT>
<P>
<FONT SIZE=2>Send a one line message:      unsubscribe <BR>
</FONT>
<P>
<FONT SIZE=2>Bug reports should be sent to :  dqs@scri.fsu.edu
<BR>
</FONT>
<P>
<FONT SIZE=2>DQS user information exchange is provided by Rensselaer
Polytechnic Institute. To add your name and email address to this
list:</FONT>
<P>
<FONT SIZE=2>Send email to  dqs-l@vm.its.rpi.edu</FONT>
<P>
<FONT SIZE=2>Leave the &quot;subj:&quot; line blank</FONT>
<P>
<FONT SIZE=2>Send  a one line message:      SUBSCRIBE dqs-l 1stname
Lastname</FONT>
<P>
<FONT SIZE=2>To remove name and email address:</FONT>
<P>
<FONT SIZE=2>Send email to  dqs-l@vm.its.rpi.edu</FONT>
<P>
<FONT SIZE=2>Leave the &quot;subj:&quot; line blank</FONT>
<P>
<FONT SIZE=2>Send  a one line message:      UNSUBSCRIBE dqs-l
1stname Lastname<BR>
</FONT>
<P>
<FONT SIZE=2>Where 1stname is the user's first name and Lastname
is the user's last name.<BR>
<BR>
<BR>
<BR>
<BR>
</FONT>
<P>
<FONT SIZE=2>With the release of DQS 3.1.3 the user intercommunication
through dqs_user@scri.fsu.edu will be re-instituted. All messages,
inquiries and announcements from any user or the DQS development
staff will be relayed to all other users automatically.<BR>
</FONT>
<H2>What's New in DQS 3.1.3</H2>
<P>
<FONT SIZE=2>The release of DQS 3.0 was a major departure in
the evolution of DQS. It was based on several years' experience with
DQS 2.1 in a variety of computing environments. Although it retained
many features of the 2.1 version, DQS 3.0 was a major restructuring
and re-coding of the basic system, with a strong focus on supporting
parallel (clustered) computation on two or more UNIX-based hardware
platforms. The newly emerging message passing scheme (MPI) was
considered throughout the DQS 3.0 implementation.</FONT>
<P>
<FONT SIZE=2>In early 1995 DQS 3.0-3.1 was subjected to extensive
testing and the contributions of numerous users were incorporated
to produce DQS 3.1.2, which was released in March and augmented
over a period of six months to become DQS 3.1.2.4. With the exception
of some minor &quot;improvements&quot;, this system has been fairly stable
and in operational use for nine months.</FONT>
<P>
<FONT SIZE=2>Operational experience at SCRI and other large production
sites revealed several features which needed to be added or adapted
to make the system easier to use or to manage. Several sites provided
the DQS development team with valuable insight, advice and code
which have been incorporated into this new release. Although the
user interfaces have not been changed (only &quot;enhanced&quot;),
the internals of this system have undergone considerable change,
hence the naming of this release as 3.1.3 instead of 3.1.2.5.
We took this opportunity to restructure the documentation (one
more time!) in response to numerous requests to make it easier
to access. In addition to numerous bug-fixes for DQS 3.1.2.4 provided
by several very helpful sites (see &quot;acknowledgments&quot;),
a number of new features have been added to the system. </FONT>
<P>
<FONT SIZE=2>The &quot;new&quot; features of DQS 3.1.3 tend to
be somewhat invisible to the DQS user. The bulk of this effort
has been focused on further &quot;bulletproofing&quot; the system
to minimize, if not eliminate, the unreported termination of daemons,
utilities and jobs. Some features are &quot;semi-visible&quot;,
such as the revised scheduling system. A few are quite evident
to all, such as the &quot;job pre-validation&quot; feature, which returns
immediate feedback on the complete absence of a requested resource.
With this in mind we list here the major changes which appear
in DQS 3.1.3:</FONT>
<H3>Job pre-validation</H3>
<P>
When a job is submitted to DQS using the QSUB utility it is checked
to ensure that:
<OL>
<LI>The &quot;fixed&quot; resources requested as HARD are present
somewhere in an existing DQS complex. If the resource is in use
by another job it is still considered &quot;present&quot; for
the purposes of pre-validation.
<LI>The &quot;consumable&quot; resources requested as HARD are
present in at least one DQS Consumables file. If the resource
is in use by another job it is still considered &quot;present&quot;
for the purposes of pre-validation.
</OL>
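<P>
The two checks above can be sketched in a few lines of Python. This
is only an illustration of the rule, not the DQS implementation: the
function name, the dictionary of complexes and the set of consumables
are all hypothetical stand-ins for whatever the qmaster actually stores.

```python
# Hypothetical data model: none of these names come from DQS itself.
def prevalidate(hard_requests, complexes, consumables):
    """Return the HARD resource names that exist nowhere in the cell.

    hard_requests: iterable of resource names requested as HARD
    complexes:     dict mapping complex name -> set of fixed resource names
    consumables:   set of consumable resource names defined by the administrator
    """
    known = set(consumables)
    for resources in complexes.values():
        known.update(resources)
    # Current usage is deliberately ignored: a busy resource is still "present".
    return [name for name in hard_requests if name not in known]


complexes = {"sparc": {"mem", "disk"}, "alpha": {"mem", "scratch"}}
consumables = {"f77_license"}

missing = prevalidate(["mem", "f77_license", "tape"], complexes, consumables)
print(missing)  # a non-empty list means the submission would be rejected
```

<P>
Because only existence is tested, a request for a resource that is
merely busy passes this check and waits in the queue, which matches
the behaviour described in the list.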
<H3>Consumable Resources</H3>
<P>
Many sites are confronted with the need to allocate scarce resources
to jobs during the scheduling process. Resources such as FORTRAN
compiler licenses, database licenses, shared memory and disk
space can be assigned names and values by the DQS administrator.
Job scheduling then reconciles requests for Consumable Resources,
and when a job is placed into execution the available amount of
these resources is reduced until the job terminates or releases
the resource with a DQS system utility. Facilities for managing
the Consumable Resource reservoir have been added to the QCONF
utility.
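<P>
The bookkeeping described above amounts to a counted reservoir. The
following Python sketch shows the idea under simple assumptions; the
class, its method names and the resource names are invented for
illustration and do not correspond to DQS source code or to QCONF
options.

```python
class ConsumablePool:
    """Toy reservoir of named consumable resources (hypothetical, not DQS code)."""

    def __init__(self, amounts):
        self.available = dict(amounts)

    def can_satisfy(self, request):
        # A request is satisfiable only if every named amount is on hand.
        return all(self.available.get(name, 0) >= qty
                   for name, qty in request.items())

    def acquire(self, request):
        """Reduce the pool when a job is placed into execution."""
        if not self.can_satisfy(request):
            return False
        for name, qty in request.items():
            self.available[name] -= qty
        return True

    def release(self, request):
        """Return resources when the job terminates or releases them early."""
        for name, qty in request.items():
            self.available[name] += qty


pool = ConsumablePool({"f77_license": 2, "scratch_mb": 500})
job = {"f77_license": 1, "scratch_mb": 300}

started = pool.acquire(job)                  # first job starts
second = pool.acquire({"scratch_mb": 300})   # only 200 left, so it must wait
pool.release(job)                            # first job ends
```

<P>
The second request is refused rather than partially granted, so a job
never starts holding only part of what it asked for.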
<H3>qhold, qrls</H3>
<P>
The QHOLD and QRLS utilities have been implemented. These permit
a user or administrator to place a &quot;hold&quot; on an already
submitted job until it has been released by the user or the DQS
administrator or removed using the QDEL utility.
<H3>qmove (multi-cell job transfer)</H3>
<P>
The POSIX utility QMOVE has been partially implemented for this
release. In a single cell system a queued job can be moved from
one queue to another using the QALTER utility, either by using
the &quot;-q&quot; option to explicitly identify a target
queue or, when queues are implicitly specified on the basis of
resource requests (&quot;-l&quot; option), by using QALTER to
change the resource request.
<P>
In a multi-cell system the QMOVE utility must be used to initiate
the transfer of a job from consideration by one cell to another.
The QMOVE request is pre-validated like any other QSUB submission,
and the job will not be moved if it cannot pass this first level
test.
<H3>&quot;fair use&quot; scheduling</H3>
<P>
The DQS scheduler has been rewritten ... again. Of the many components
in an operating system, the scheduling process is the most perplexing
and complex feature to provide in an adequately general form.
The DQS 3.1.3 scheduler has been commented and the code blocked out
in a manner which we hope will make site modifications easier
and more comprehensible. The scheduling methodology now in use
at SCRI is provided as the default in this release. It attempts
to prevent one or two users from dominating the utilization of the
system resources, while keeping all hosts as busy as possible.
<P>
Those submitting massive quantities of jobs to the system at one
time will discover four levels at which their jobs will be handled
by the scheduler. First, there is a limit on how many jobs will
be accepted at QSUB time. Second, there is a limit on how many
jobs in the queue for a single user will be considered by the
scheduler. Third, the user's jobs which are considered for
scheduling will be assigned sub-priorities according to their
DQS sequence number and the number of jobs for that user preceding
them in the queue. Finally, a queue can be assigned a time delay
which is imposed between consecutive allocations of that queue
to the same user.
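<P>
The second and third levels can be pictured with a small Python
sketch. It is a guess at the flavour of the policy, not the SCRI
scheduler: the text does not give the actual sub-priority formula, so
the ranking rule used here (jobs ordered first by how many of the same
user's jobs precede them, then by sequence number) is an assumption,
as are the function and parameter names. The QSUB-time limit and the
per-queue time delay are left out.

```python
from collections import defaultdict


def fair_order(pending, max_considered_per_user):
    """Order queued jobs so that no single user crowds out the others.

    pending: list of (sequence_number, user) tuples for queued jobs.
    Returns the jobs the scheduler would look at, in order.
    """
    seen = defaultdict(int)
    considered = []
    for seq, user in sorted(pending):
        rank = seen[user]          # how many of this user's jobs precede it
        seen[user] += 1
        if rank < max_considered_per_user:
            considered.append((rank, seq, user))
    # Lower rank first, so each user's first job beats anyone's second job;
    # ties are broken by sequence number.
    considered.sort()
    return [(seq, user) for rank, seq, user in considered]


pending = [(101, "ann"), (102, "ann"), (103, "ann"), (104, "bob"), (105, "ann")]
order = fair_order(pending, max_considered_per_user=3)
print(order)
```

<P>
In this example the fourth job from user ann is not considered at all
in this pass, and the single job from user bob moves ahead of ann's
second and third jobs.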
<H3>FORTRAN / &quot;C&quot; resource requests</H3>
<P>
In DQS 3.1.2.4 resources are requested by using the form &quot;-l
qty.eq.1,mem.gt.32,disk.gt.64&quot; (for example). DQS 3.1.3 permits
the retention of this format, but the user may now use either FORTRAN
or &quot;C&quot; syntax for these requests. The above example
could then appear as &quot;-l qty==1&amp;&amp;mem&gt;32&amp;&amp;disk&gt;64&quot;
or, alternatively, &quot;-l qty.eq.1.AND.mem.gt.32.AND.disk.gt.64&quot;.
The logical operators &quot;.NOT.&quot; (or &quot;!&quot;) and
&quot;.OR.&quot; (or &quot;||&quot;) may also be used, as well as parentheses
to increase readability. Future releases will permit more complex,
compound resource requests with the ability to specify alternative
resources which could satisfy the request. (This is different
from using the HARD and SOFT classifications.) For the time being
parentheses only assist in readability, as in &quot;-l qty==1&amp;&amp;(mem&gt;32)&amp;&amp;(disk&gt;64)&quot;.
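<P>
The equivalence of the two notations can be shown with a short Python
sketch that rewrites a FORTRAN-style request using the &quot;C&quot;
operators. It covers only the operators named in the text, and it
assumes that the comma of the 3.1.2.4 form joins conditions in the
same way as .AND.; the function name is hypothetical and this is not
the parser used by QSUB.

```python
import re

# FORTRAN-style operator -> "C"-style operator (only those named in the text).
_OPERATORS = {
    ".eq.": "==",
    ".gt.": ">",
    ".and.": "&&",
    ".or.": "||",
    ".not.": "!",
}
_PATTERN = re.compile(r"\.(?:eq|gt|and|or|not)\.", re.IGNORECASE)


def to_c_syntax(request):
    """Rewrite a FORTRAN-style resource request with "C"-style operators."""
    converted = _PATTERN.sub(lambda m: _OPERATORS[m.group(0).lower()], request)
    # Assumption: the comma of the 3.1.2.4 form joins conditions like .AND.
    return converted.replace(",", "&&")


print(to_c_syntax("qty.eq.1.AND.mem.gt.32.AND.disk.gt.64"))
print(to_c_syntax("qty.eq.1,mem.gt.32,disk.gt.64"))
```

<P>
Both calls produce the &quot;C&quot; form quoted in the paragraph
above, which is a convenient check that the three spellings describe
the same request.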
<H3> subordinate queues</H3>
<P>
DQS 2.1 introduced a feature known as &quot;subordinate queues&quot;
 which provided the capability to identify a queue as being subordinate
to another queue. If a job is running in the subordinate queue
and a job is launched in its &quot;superior&quot; queue the subordinate
job is suspended until termination of the &quot;superior&quot;
job. This feature is particularly important when managing a system
where hosts can function both as single processor and multiple
processor platforms. DQS 3.1.3 provides a re-implementation of
this feature.
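<P>
The suspend-and-resume relationship can be sketched as follows. The
class and the queue and job names are illustrative only, and the
sketch assumes a single job per queue; it does not reflect how the
qmaster or dqs_execd actually represent queues.

```python
class Queue:
    """Minimal stand-in for a DQS queue (names are illustrative only)."""

    def __init__(self, name, subordinates=()):
        self.name = name
        self.subordinates = list(subordinates)
        self.job = None
        self.suspended = False

    def launch(self, job):
        self.job = job
        for sub in self.subordinates:
            if sub.job is not None:
                sub.suspended = True   # subordinate job is paused, not killed

    def finish(self):
        self.job = None
        for sub in self.subordinates:
            sub.suspended = False      # subordinate job resumes


serial = Queue("node1_serial")
parallel = Queue("node1_parallel", subordinates=[serial])

serial.launch("job_17")
parallel.launch("job_18")        # the superior job starts
state_during = serial.suspended
parallel.finish()                # the superior job terminates
state_after = serial.suspended
```

<P>
The subordinate job keeps its place in the queue throughout, which is
what lets a host serve both as a single processor platform and as a
member of a multiple processor job.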
<H3>SMP AFS re-authentication</H3>
<P>
DQS 3.1.2.4 provided a simple facility for operating in an AFS
environment. Actual use of this system uncovered a number of problems.
The most significant of these was solved by Axel Brandes and Bodo
Bechenback and is incorporated in DQS 3.1.3. A key element of
their solution involves the use of a temporary daemon which we
call the Process Leader and others call the &quot;process sheep-herder&quot;.
<P>
The Process Leader is spawned by the dqs_execd and does the actual
job launching and cleanup. It can respond to system requests which
the job is not equipped to deal with, such as the AFS periodic
re-authentication task. This capability also makes it possible
to run multiple jobs for the same queue on the same host, and
 to detach the job from the dqs_execd daemon in case that daemon
needs to be restarted, without killing the job. 
<H3>qmaster&lt;-&gt;dqs_execd synchronization</H3>
<P>
A glaring shortcoming in DQS 3.1.2.4 was the lack of synchronization
among the DQS daemons. Under some circumstances the queue status
maintained by the qmaster did not reflect the actual state of
jobs handed off to the dqs_execd. There was no mechanism for making
the two states congruent, other than using the &quot;clean queue&quot;
(QCONF -cq) mechanism, which only affected the qmaster view
of the system. DQS 3.1.3 has implemented auxiliary communications
between the qmaster and dqs_execd to provide for automatic and
manual methods of re-synchronizing the system. 
<P>
Programmed aborts of the dqs_execd using the system &quot;abort&quot;
or &quot;exit&quot; commands have been eliminated. Instead, all
dqs_execd errors previously considered fatal are now communicated
to the qmaster, which emails an urgent message to the DQS administrator
and pauses the dqs_execd until the administrator can intervene.
Note that if a job is running under the Process Leader management,
it will continue execution, unaware of the dqs_execd pause. (If
the dqs_execd error is due to a failure in the dqs_execd&lt;-&gt;qmaster
interface, the dqs_execd independently mails the urgent cry for
help to the administrator.)
<H3>parallel job consistency and accounting</H3>
<P>
In DQS 3.1.2 parallel job scheduling handed off parallel jobs
when sufficient queues became available for the execution of the
requested number of processes. However, only the dqs_execd which
was managing the MASTER process was aware of the parallel job,
and the only accounting information obtained for the job was from
the MASTER host.
<P>
DQS 3.1.3 scheduling alerts all of the SLAVE queue managers to
the fact that they will be running one of the parallel job processes.
When the parallel job is launched by the MASTER dqs_execd, each
SLAVE dqs_execd verifies that it is permitted to participate
in that job before the slave process is started. A Process Leader
is used to launch each of these slave processes and, at their termination,
accounting information is gathered and sent to the qmaster. This
ensures that DQS is in charge of the execution of all parallel
job components. In the event that a LINDA parallel job is involved,
the Process Leader is initiated and it waits for the LINDA process
to be started by the master process on the MASTER host. It then
attaches itself to this process (since it cannot launch it itself)
in order to handle termination and accounting reporting.
<H3>qidle integration</H3>
<P>
In DQS 3.1.2.4 the QIDLE utility was part of the X-windows component
of the system and interfaced with DQS by invoking the QMOD utility
as a separate task. This created several problems, the principal
one being that at many sites the &quot;system console&quot; was
connected to a host which was also managing a DQS queue. Since
such a console usually has many users accessing it, there is not
one single &quot;owner&quot; for the queue on that machine with
permission to invoke the suspension of the queue in order to use
the console. 
<P>
The QIDLE function in DQS 3.1.3 is now an authenticated system
utility like QMOD, QDEL, etc. It communicates with the qmaster
itself and can suspend queues on a host for which the QIDLE function
is permitted to run in an X-windows environment. 
<H3>enhanced status displays</H3>
<P>
The somewhat cryptic symbols &quot;a&quot; &quot;e&quot; &quot;r&quot;
&quot;u&quot; &quot;s&quot; in the QSTAT display have been replaced
with more descriptive words: ALARM, ENABLED, RUNNING, UNKNOWN, SUSPENDED.
More important, the reasons why a job is residing in the PENDING
queue are listed. Thus, between the pre-validation of jobs and
this description of PENDING causes, DQS 3.1.3 should have eliminated
the most common problems of jobs never executing because they
had requested non-existent or illogical combinations of resources.
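<P>
For sites with scripts that still parse the old single-letter codes,
the correspondence can be captured in a small lookup table. The
pairing of letters to words below is inferred from the order in which
the text lists them, and the helper function is hypothetical; it is
not part of QSTAT.

```python
# Old QSTAT symbol -> DQS 3.1.3 word (pairing inferred from the listing order).
STATE_WORDS = {
    "a": "ALARM",
    "e": "ENABLED",
    "r": "RUNNING",
    "u": "UNKNOWN",
    "s": "SUSPENDED",
}


def describe(flags):
    """Translate a string of old-style symbols into the new words."""
    return " ".join(STATE_WORDS.get(flag, "?") for flag in flags)


print(describe("er"))
```

<P>
Symbols that are not in the table are shown as a question mark rather
than dropped, so an unexpected code is noticed instead of hidden.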
<H3>accounting tools</H3>
<P>
The DQS accounting information can play a key role in the management
and optimization of system resources. In the operational environment
at SCRI we have developed a small collection of tools for extracting,
summarizing and analyzing DQS accounting data. These have been
included in DQS 3.1.3 as a starting point for other sites to develop
their own methods.
<H3>&quot;Streamlined Installation&quot;</H3>
<P>
Many sites will find that the installation of DQS has been &quot;streamlined&quot;,
requiring less interaction to prepare a basic system for configuration
and testing. Sites which are already running DQS 3.1.2.4 and have
the need for extensive local adaptations will use the more complex
&quot;custom&quot; installation process or the manual editing
of Makefile.proto, def.h, and dqs.h with which they are already
familiar. The new installation process is based on the GNU Autoconf
package.
<H3>job &quot;wrapper&quot;  scripts</H3>
<P>
DQS 3.1.3 provides a mechanism for executing site-defined scripts
upon termination of the queued job. This script is executed by
the Process Leader and hence possesses root permissions, which can
be handy for specialized cleanup operations. This is important
for systems which support PVM and P4 daemons, which may have to
be stopped by the system when the MASTER process terminates abnormally.
<H3>elective linking or copying of output files during job execution
</H3>
<P>
DQS 3.1.3 supports the special handling  of files on a host's
local disk  during execution of a job, without the intervention
of the user. Options set in the DQS conf_file determine whether
the output files are to be left in place on the local disk, linked
to a site-defined file system or copied to a site-defined file
system.
<H3>Logging improvements</H3>
<H4><FONT SIZE=2>All DQS log entries are now time stamped with
the local time of the qmaster host system. The DEBUG and DEBUG_EXT
output is now written to a file (defined in def.h) instead of
stderr. This minimizes the jumbling of file output when several
processes attempt to write to the file simultaneously. All error
messages are now numbered and an appendix to this document lists
these error messages and suggests remedial actions when appropriate.</FONT>
</H4>
<H2>Documentation</H2>
<P>
<FONT SIZE=2>The DQS 3.1.3 documentation has been reorganized
... again. The POSIX specification has been extricated from the document
body and is now an Appendix. The reference manual pertains only
to the DQS 3.1.3 implementation, and all confusing references to
&quot;standard&quot; and &quot;non-standard&quot; options have been removed.
</FONT>
<P>
<FONT SIZE=2>The documentation consists of three principal chapters
and three appendices. The Installation and Maintenance Manual
is primarily aimed at the DQS system administrator. The User Guide
is obviously targeted at the DQS user community. The Reference
Manual will be accessed by both users and administrators. Appendix
A contains a catalog of all DQS error messages with information
on methods for dealing with the error. Appendix B contains the
POSIX specification on which DQS 3.1.3 is based. Appendix C contains
several miscellaneous sections, including installation variants
and system tuning guidelines.</FONT>
<P>
<FONT SIZE=2>The documentation is supplied in several forms:</FONT>
<OL>
<LI><FONT SIZE=2>Microsoft WORD (6.0 or 7.0 )</FONT>
<LI><FONT SIZE=2>PostScript</FONT>
<LI><FONT SIZE=2>HTML format (can be viewed with MOSAIC or any
of the commercial WEB browser products)</FONT>
</OL>
<H2>Installation</H2>
<P>
<FONT SIZE=2>DQS is designed to be installed on almost every existing
UNIX platform. This process thus must cope with many differences
and idiosyncrasies of the varied hardware configurations and 
operating systems. DQS 3.1.3 attempts to detect and resolve these
differences to minimize the need for operator actions, but with
even the simplest installation there will be a need for some input
from the DQS administrator.</FONT>
<H2>Obtaining DQS 3.1.3</H2>
<P>
<FONT SIZE=2>DQS 3.1.3 can be obtained by ftp download from ftp.scri.fsu.edu/pub/dqs.
The README.313 file in that directory will indicate which version
should be downloaded. To reduce download bandwidth, improvements
and bug fixes will be distributed on a file-by-file replacement
basis rather than requiring a complete download of the DQS 3.1.3
system. For this reason we do not envision distributing systems
such as DQS 3.1.3.1 &#133; DQS 3.1.3.n in the future. (But you
never know.)</FONT>
<H2>Setting up for installation</H2>
<P>
<FONT SIZE=2>DQS 3.1.3 is distributed as a compressed TAR file.
After this file is uncompressed it is recommended that the DQS
 system be extracted (with TAR) into a directory which is accessible
by all operating systems for which DQS will be built. The DQS
installation process will create a separate directory in the sub-directory
&#133;.DQS/ARCS for each different architecture/operating system.</FONT>
<P>
<FONT SIZE=2>Once the DQS tree has been extracted, the installation
process can be started by changing to ../DQS as the working
directory and typing &quot;install&quot;. This UNIX script will
execute the system evaluation procedures and produce a description
of the system on which the installation is being done. Three choices
are then offered to the administrator: &quot;quick install&quot;, &quot;Custom
Install&quot; and &quot;quit installation&quot;.</FONT>
<H2>&quot;Quick&quot; Install</H2>
<P>
<FONT SIZE=2>A very simplistic &quot;Quick Install&quot; feature
is provided to assist in an initial installation of DQS. For those
sites which are testing DQS for the first time we recommend using
this method. Choosing all of the defaults will result in an unrealistic
operating environment for DQS 3.1.3 but will offer a sample of
the system.</FONT>
<P>
<FONT SIZE=2>The choice of &quot;Quick Install&quot; produces
a list of defaults which will be used for the installation. The
user is asked to review this list to ensure that it meets their
requirements. The default-cell name and default initial queue
name are derived from the host-name of  the machine on which the
installation process is being executed. If the installation is
being executed as &quot;root&quot; the system will be set up to
use &quot;reserved&quot; ports for communication; otherwise &quot;non-reserved&quot;
ports will be utilized.</FONT>
<P>
<FONT SIZE=2>The &quot;quick&quot; installation method is intended
for new DQS sites which wish to experiment/evaluate DQS 3.1.3
and develop some experience on which to base an operational system
setup. If the installation parameters are acceptable the user
types &quot;y&quot; to accept them and begin the actual installation.
</FONT>
<P>
<FONT SIZE=2>The installation proceeds in six stages:</FONT>
<OL>
<LI><FONT SIZE=2>First the GNU configure program is used to determine
installation parameters for the host being used for the installation
process. One of the directories modified by the GNU configure
program is the DQS CONFIG directory. Once it is updated, the DQS config
utility is built on that host platform.</FONT>
<LI><FONT SIZE=2>The DQS config program then asks the user to
provide a base directory to use for the installation of DQS binaries,
libraries and documentation as well as the DQS configuration and
resolve files and directories for use by the qmaster and the
dqs_execd. The default paths offered by the dialogue are based
on the current working directory (if running as non-root) or /usr/local/DQS
(when running as root). This latter path is commonly used at DQS
sites as all hosts of a common architecture often share the path
&quot;/usr/local&quot;. The simple install will only request one
starting point for building a DQS313 tree. If the administrator
wishes to differentiate the various components (binaries, libraries,
spool directories, etc.), they can type &quot;CUSTOM&quot; when
asked to enter an alternative base path.</FONT>
<LI><FONT SIZE=2>The next step invokes the make operation to
create all of the DQS 3.1.3 executables. The binaries are placed
in a subdirectory within the ../DQS/ARCS directory named for the
specific platform being built. This provides a separate repository
for each type of host system in the cluster. <U><B>NOTE </B></U>The
addition of &quot;qidle&quot; has created some installation problems
on SOLARIS platforms not using the GNU &quot;C&quot; compiler.
If error messages appear related to missing X Windows include
files or libraries, the DQS administrator may have to add appropriate
compiler or linker directives to the Makefile.proto AFTER the
&quot;configure&quot; step is completed.</FONT>
<LI><FONT SIZE=2>The fourth step moves the binaries to the directory
from which they will be executed. This process renames the executables
by placing a tag &quot;313&quot; at the end of each name.
This is done to differentiate these binaries from other DQS versions
which might have preceded DQS313. The fourth step also moves the
sample conf_file and resolve file to the conf directory.</FONT>
<LI><FONT SIZE=2>The next step involves the addition of the three
DQS 3.1.3 entries to the /etc/services file on one or more hosts.
This step must be done with root permission and by someone familiar
with UNIX system administration. While DQS attempts
to identify proper port numbers to be used in the /etc/services
file, local conditions may dictate another choice. Upon successful
completion of the installation the administrator can proceed to
&quot;Testing the Installation&quot;. If error messages appear
and the installation is aborted, the administrator should refer
to &quot;Solving Installation Problems&quot;.</FONT>
<LI><FONT SIZE=2>Finally the administrator should proceed to the
step &quot;Testing the DQS313 system&quot;.</FONT>
</OL>
<H2>Custom Install<BR>
</H2>
<P>
<FONT SIZE=2>&quot;Custom&quot; Installation presents the administrator
with the same default configuration as the &quot;Quick&quot; install
process. Any of the parameters can be changed by the administrator
before the installation proceeds. Two choices are presented to
the administrator. The first initiates an interactive session
in which each parameter is displayed along with the proposed default and, if
a previous installation has been completed, the prior setup value.
The administrator may choose either of the displayed values or
enter their own parameter. </FONT>
<P>
<FONT SIZE=2>During this interactive exchange each parameter is
validated for consistency with the host system as well as DQS.
Upon completing the interactive setup the administrator may proceed
with the same installation steps as the &quot;Quick&quot; installation:</FONT>
<P>
<FONT SIZE=2>The installation proceeds in six stages. During
most of these the DQS administrator must make a selection as requested
by the config program.</FONT>
<OL>
<LI><FONT SIZE=2>First the GNU configure program is used to determine
installation parameters for the host being used for the installation
process. One of the directories modified by the GNU configure
program is the DQS CONFIG directory. Once it is updated, the DQS config
utility is built on that host platform. The GNU configure
program will attempt to build all the Makefiles to use the GNU
&quot;C&quot; compiler &quot;gcc&quot;. If the administrator wishes
to use an alternative compiler for any phase, the following files
must be modified AFTER the GNU configure step is complete: CONFIG/Makefile.proto.in,
SRC/Makefile.proto.in and DQS/Makefile.proto.</FONT>
<LI><FONT SIZE=2>The DQS config program then asks the user to
provide a base directory to use for the installation of DQS binaries,
libraries and documentation as well as the DQS configuration and
resolve files and directories. This base directory will then be
used to provide a &quot;default path&quot; for all items requiring
the entry of a file path. The default paths offered by the dialogue
are based on the current working directory (if running as non-root)
or /usr/local/DQS (when running as root). This latter path is
commonly used at DQS sites as all hosts of a common architecture
often share the path &quot;/usr/local&quot;. At each interactive step, a default value is presented.
Typing a question mark &quot;?&quot; will provide a brief comment
about that entry (which is intended to be helpful). A more detailed
explanation of each item to be entered may be found in Appendix
C Miscellaneous - &quot;Key System Variables and Manual Installation&quot;.
</FONT>
<LI><FONT SIZE=2>The next step invokes the make operation to
create all of the DQS 3.1.3 executables. The binaries are placed
in a subdirectory within the ../DQS/ARCS directory named for the
specific platform being built. This provides a separate repository
for each type of host system in the cluster.</FONT>
<LI><FONT SIZE=2>The fourth step moves the binaries to the directory
from which they will be executed. The target binary directory is
prescribed by the administrator during the configure process. This
process renames the executables by placing a tag &quot;313&quot;
at the end of each name. This is done to differentiate
these binaries from other DQS versions which might have preceded
DQS313. The fourth step also moves the sample conf_file
and resolve file to the conf directory.</FONT>
<LI><FONT SIZE=2>The next step involves the addition of the three
DQS 3.1.3 entries to the /etc/services file on one or more hosts.
This step must be done with root permission and by someone familiar
with UNIX system administration. While DQS attempts
to identify proper port numbers to be used in the /etc/services
file, local conditions may dictate another choice.</FONT>
<LI><FONT SIZE=2>Upon successful completion of the installation
the administrator can proceed to &quot;Testing the Installation&quot;.
If error messages appear and the installation is aborted, the administrator
should refer to &quot;Solving Installation Problems&quot;.</FONT>
</OL>
<P>
<FONT SIZE=2>An optional approach is available to the knowledgeable
DQS administrator which omits all interaction. This requires
the editing of three DQS files used during the make process. Details
for this approach may be found in Appendix C Miscellaneous
- &quot;Key System Variables and Manual Installation&quot;.</FONT>
<H2>The Graphical Interface</H2>
<P>
<FONT SIZE=2>The X-Windows based DQS graphical interface is installed
as a separate step. Change directory to &#133; DQS/XSRC and read
the INSTALL script. The X-Windows interface is being restructured
and will be integrated fully in future DQS releases.</FONT>
<H2>Testing the installation</H2>
<P>
<IMG SRC="IMG00019.GIF">
<P>
<FONT SIZE=2>The installation process creates a series of directories
and subdirectories and two crucial files, the &quot;conf_file&quot;
(configuration file) and the &quot;resolve_file&quot;. If the
system installation was completed correctly the conf_file will
contain information which will be read by every DQS binary file
when it is started. This includes the DQS daemons, qmaster and
dqs_execd, and the DQS interface &quot;utilities&quot; qsub, qdel,
qmod, qconf, qstat, qrls, qhold and qmove. It is best that these
two files be accessible through NFS/AFS/DFS file cross-mounting.
If that is not possible then the administrator must ensure that
<U>identical </U>copies of these files are present on each host.</FONT>
<P>
<FONT SIZE=2>Once the binaries have been moved to their execution
directory (we will use the path &quot;/usr/local/DQS/bin&quot; for all
future examples), the qmaster can be started. If during the installation
process the administrator chose &quot;FALSE (NO)&quot; when asked
the question &quot;Reserved ports?&quot;, then the /etc/services
file will have been updated (by a root user) with the three entries
suggested by the config process (or a rational alternative). The
conf_file will contain the names of these entries along with the
DEFAULT_CELL name, which must match the first entry on the first
(non-commented) line in the resolve file. The administrator should
make a visual check of these three crucial files, conf_file, resolve_file
and /etc/services, to make sure that they conform to these requirements.
</FONT>
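<P>
<FONT SIZE=2>The DEFAULT_CELL requirement can also be checked mechanically.
The following Python sketch is illustrative only and is not part of DQS. It assumes
that comment lines in the resolve file begin with &quot;#&quot; and that entries
on a line are separated by white space; the function names and the sample file
contents are hypothetical.</FONT>
<PRE>
```python
def first_resolve_entry(resolve_text, comment_prefix="#"):
    """Return the first entry on the first non-commented line, or None."""
    for line in resolve_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comment lines (the '#' prefix is an assumption).
        if not stripped or stripped.startswith(comment_prefix):
            continue
        return stripped.split()[0]
    return None


def cell_matches(default_cell, resolve_text):
    """True when DEFAULT_CELL equals the first resolve-file entry."""
    return first_resolve_entry(resolve_text) == default_cell


# Hypothetical resolve-file contents, for illustration only.
sample = "# sample resolve file\n\nmycell host0.example.org\n"
print(cell_matches("mycell", sample))
```
</PRE>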
<H4><FONT SIZE=2>QMASTER</FONT></H4>
<P>
<FONT SIZE=2>&lt;<I>The qmaster manages all resources for a single
DQS cell</I>.&gt;</FONT>
<P>
<FONT SIZE=2>Once satisfied that all is well, the qmaster can be
started by typing &quot;/usr/local/DQS/bin/qmaster&quot;. (We will use
the 313 appendage in all future discussions.) On this first occasion,
it would be useful to check that the process has actually started
by viewing the UNIX process status (ps). If the qmaster name does
not appear in the host's process list, the administrator should
check the &quot;err_file&quot; in the qmaster spool directory
(chosen during the DQS config stage; default: &quot;/usr/local/DQS/common/conf&quot;).</FONT>
<P>
<FONT SIZE=2>If the qmaster appears to be operating, it can be
tested by executing the command &quot;/usr/local/DQS/bin/qstat313
-f&quot; on the same host where the qmaster313 is running. A normal
response to this command would be one or more lines of output
describing the status of the current queues. For brand new installations
this will be simply a header with no other lines. Error messages
may appear if things are not quite &quot;in harmony&quot;; refer to
&quot;DQS Error Messages&quot; and &quot;Solving Installation
Problems&quot; for assistance in this case.<BR>
</FONT>
<H4><FONT SIZE=2>DQS_EXECD</FONT></H4>
<P>
<FONT SIZE=2>&lt;<I>The dqs_execd is a DQS daemon which resides
on each host which has at least one queue and will be executing
DQS managed jobs</I>.&gt;</FONT>
<P>
<FONT SIZE=2>If the &quot;qstat313&quot; command succeeds, it
is time to start a dqs_execd, which actually manages a particular
queue. For this test, on the same host where the qmaster &quot;dwelleth&quot;,
type the command &quot;/usr/local/DQS/bin/dqs_execd313&quot;.
Again the UNIX process status should be examined (ps). If the
dqs_execd is not executing, refer to the err_file for significant
error messages. Consult &quot;DQS Error Messages&quot; and
&quot;Solving Installation Problems&quot; for assistance.</FONT>
<P>
<FONT SIZE=2>Executing the command &quot;qconf -aq&quot; (queue
configuration, add queue) will produce an edit session with the
default editor on that host. If the &quot;qconf&quot; command
yields an error message and shuts down, consult &quot;Solving Installation
Problems&quot;. A queue &quot;template&quot; will be displayed
which can be modified using the editor commands. For this test
the queue name and queue host name should be changed to match
the name of the host on which the dqs_execd is executing. We
will deal with the remaining entries later (see &quot;The Queue Configuration&quot;).</FONT>
<MENU>
<LI><FONT SIZE=2>Q_name            <U><I><B>ibms30</B></I></U></FONT>
<LI><FONT SIZE=2>hostname          <U><I><B>ibms30.scri.fsu.edu</B></I></U></FONT>
<LI><FONT SIZE=2>seq_no               0</FONT>
<LI><FONT SIZE=2>load_masg         1</FONT>
<LI><FONT SIZE=2>load_alarm        175</FONT>
<LI><FONT SIZE=2>priority               0</FONT>
<LI><FONT SIZE=2>type                   batch</FONT>
<LI><FONT SIZE=2>rerun                 FALSE</FONT>
<LI><FONT SIZE=2>quantity           1</FONT>
<LI><FONT SIZE=2>tmpdir               /tmp</FONT>
<LI><FONT SIZE=2>shell              /bin/csh</FONT>
<LI><FONT SIZE=2>klog               /usr/local/bin/klog</FONT>
<LI><FONT SIZE=2>reauth_time       6000</FONT>
<LI><FONT SIZE=2>last_user_delay   0</FONT>
<LI><FONT SIZE=2>max_user_jobs     4</FONT>
<LI><FONT SIZE=2>notify             60</FONT>
<LI><FONT SIZE=2>owner_list        NONE</FONT>
<LI><FONT SIZE=2>user_acl          NONE</FONT>
<LI><FONT SIZE=2>xuser_acl         NONE</FONT>
<LI><FONT SIZE=2>subordinate_list  NONE</FONT>
<LI><FONT SIZE=2>complex_list      NONE</FONT>
<LI><FONT SIZE=2>consumables       NONE</FONT>
<LI><FONT SIZE=2>s_rt               7fffffff</FONT>
<LI><FONT SIZE=2>h_rt               7fffffff</FONT>
<LI><FONT SIZE=2>s_cpu             7fffffff</FONT>
<LI><FONT SIZE=2>h_cpu             7fffffff</FONT>
<LI><FONT SIZE=2>s_fsize           7fffffff</FONT>
<LI><FONT SIZE=2>h_fsize           7fffffff</FONT>
<LI><FONT SIZE=2>s_data            7fffffff</FONT>
<LI><FONT SIZE=2>h_data            7fffffff</FONT>
<LI><FONT SIZE=2>s_stack           7fffffff</FONT>
<LI><FONT SIZE=2>h_stack           7fffffff</FONT>
<LI><FONT SIZE=2>s_core            7fffffff</FONT>
<LI><FONT SIZE=2>h_core            7fffffff</FONT>
<LI><FONT SIZE=2>s_rss             7fffffff</FONT>
<LI><FONT SIZE=2>h_rss             7fffffff</FONT>
</MENU>
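<P>
<FONT SIZE=2>The template above is a list of name/value pairs, one per line.
As a rough illustration (not the parser used by qconf), the Python sketch below
reads such lines into a table. It assumes that the name and value are separated
by white space and that limit values such as 7fffffff are hexadecimal; both are
assumptions, and the function names are hypothetical.</FONT>
<PRE>
```python
def parse_queue_template(text):
    """Map each field name to its value string (one 'name value' pair per line)."""
    fields = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            fields[parts[0]] = parts[1].strip()
    return fields


def limit_value(raw):
    """Interpret a limit such as '7fffffff' as hexadecimal (an assumption)."""
    return int(raw, 16)


# A few lines taken from the template shown above.
template = """Q_name ibms30
hostname ibms30.scri.fsu.edu
type batch
h_rt 7fffffff
"""
queue = parse_queue_template(template)
print(queue["Q_name"], queue["hostname"], limit_value(queue["h_rt"]))
```
</PRE>
<FONT SIZE=2>Read as hexadecimal, 7fffffff is the largest signed 32-bit
integer (2147483647).</FONT>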
<P>
<FONT SIZE=2><BR>
</FONT>
<P>
<FONT SIZE=2>When the queue name and queue host name have been modified,
exit the editor in the normal manner (ESC-ZZ for vi or CTRL-X
CTRL-C for emacs). This will trigger the qconf utility to parse
the submitted definition and, if no syntactical errors are discovered,
create the requested queue. Running &quot;/usr/local/DQS/bin/qstat313
-f&quot; at this point should list the new queue:</FONT>
<MENU>
<LI><FONT SIZE=2>Queue Name            Queue Type     Quan   Load
          State</FONT>
<LI><FONT SIZE=2>----------                  ----------     ----
  ----           -----</FONT>
<LI><FONT SIZE=2>ibms30  batch          0/1    0.10   dr     
DISABLED</FONT>
</MENU>
<P>
<FONT SIZE=2></FONT>
<P>
<FONT SIZE=2>Note that the status entry in the right column of
the qstat output will display the word &quot;DISABLED&quot;. All
new queues are initiated in DISABLED mode. To enable the queue
we need to invoke another DQS command &quot;/usr/local/DQS/bin/qmod313
-e &lt;queue name&gt;&quot;  (modify queue, enable the queue name
given here as &lt;queue name&gt;).</FONT>
<P>
<FONT SIZE=2>Again execute the &quot;/usr/local/DQS/bin/qstat313
-f&quot; command :</FONT>
<MENU>
<LI><FONT SIZE=2>Queue Name            Queue Type     Quan   Load           State</FONT>
<LI><FONT SIZE=2>----------                  ----------     ----   ----           -----</FONT>
<LI><FONT SIZE=2>ibms30  batch          0/1    0.10   er      UP</FONT>
</MENU>
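<P>
<FONT SIZE=2>A site that wishes to script this check could pick the state out of
the qstat output. The Python sketch below is only an illustration: it assumes the
fields of each row are separated by white space and that the state word is the
last field, as in the two sample rows shown in this section. The function name
is hypothetical.</FONT>
<PRE>
```python
def queue_state(qstat_output, queue_name):
    """Return the last field of the row whose first field is queue_name."""
    for line in qstat_output.splitlines():
        fields = line.split()
        if fields and fields[0] == queue_name:
            return fields[-1]
    return None


# The two sample rows shown in this section.
before = "ibms30  batch  0/1  0.10  dr  DISABLED"
after = "ibms30  batch  0/1  0.10  er  UP"
print(queue_state(before, "ibms30"), queue_state(after, "ibms30"))
```
</PRE>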
<H4><FONT SIZE=2>TEST SCRIPT</FONT></H4>
<P>
<FONT SIZE=2>Once the qmaster and at least one dqs_execd daemon are running, a simple
test can be performed. The ../DQS/tests directory contains a collection
of sample scripts. The entire contents of this directory should
be copied to a user (non-root) directory owned by the administrator.
As a first test, change directory to this non-root directory and
type &quot;/usr/local/DQS/bin/qsub313 dqs.sh&quot;. This will
submit the following simple script to DQS:</FONT>
<MENU>
<LI><FONT SIZE=2>#!/bin/csh</FONT>
<LI><FONT SIZE=2>#$      -l qty.eq.1</FONT>
<LI><FONT SIZE=2>#$ -N UTESTJOB</FONT>
<LI><FONT SIZE=2>#$ -A dummy_account</FONT>
<LI><FONT SIZE=2>#$ -cwd</FONT>
<LI><FONT SIZE=2>echo 'we are now doing something else'</FONT>
<LI><FONT SIZE=2>printenv</FONT>
<LI><FONT SIZE=2>sleep 30</FONT>
<LI><FONT SIZE=2>echo 'end of script'</FONT>
</MENU>
<P>
<FONT SIZE=2> A message should appear in response to the qsub313
command:</FONT>
<P>
<FONT SIZE=2>       &quot;your job 1 has been submitted&quot;.</FONT>
<P>
<FONT SIZE=2>After 30 seconds the job should complete and in the
directory where the job was submitted two output files should
appear:</FONT>
<P>
<FONT SIZE=2>       UTESTJOB.e1.25674   and  UTESTJOB.o1.25674</FONT>
<MENU>
<LI><FONT SIZE=2>The title UTESTJOB was established by the DQS
directive  &quot;#$ -N UTESTJOB&quot;. The next field (either
e1 or o1) contains the job number preceded by the type of file.
The stderr file for the job will have an &quot;e&quot; in that
position and the stdout file will have an &quot;o&quot;. The UTESTJOB.e1.25674
file should be zero length. If not, examine its contents for the
cause of any error. The stdout file should begin with the line
'we are now doing something else', followed by a display of
the user's environment, and end with the line 'end of script'.</FONT>
</MENU>
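<P>
<FONT SIZE=2>The naming pattern just described can be taken apart as follows.
This Python sketch is illustrative only; it assumes that the job name itself may
contain periods but that the last two period-separated fields always follow the
pattern shown. The meaning of the final numeric field is not described here, so
it is returned unchanged. The function name is hypothetical.</FONT>
<PRE>
```python
def split_output_name(filename):
    """Split a name such as NAME.e1.25674 or NAME.o1.25674 into its parts."""
    name, stream_field, suffix = filename.rsplit(".", 2)
    # The first character of the middle field gives the file type.
    kind = {"e": "stderr", "o": "stdout"}[stream_field[0]]
    return {
        "job_name": name,
        "stream": kind,
        "job_number": int(stream_field[1:]),
        "suffix": suffix,  # left uninterpreted
    }


print(split_output_name("UTESTJOB.e1.25674"))
print(split_output_name("UTESTJOB.o1.25674"))
```
</PRE>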
<H2>COMPLETION OF INSTALLATION</H2>
<P>
<FONT SIZE=2>If the test script completes correctly, hosts can
be added, additional queues created and more complex job tests
submitted. If the &quot;Quick Install&quot; method was
chosen, the time has probably arrived to plan an operational cell
organization and set up resource files and queues. In order to
lay out an effective system it is important to understand how DQS
is constructed, the capabilities of its components and how they
may be tailored for a specific site.<BR>
</FONT>
<H2>System Topology and Operation<BR>
</H2>
<P>
<FONT SIZE=2>A basic DQS system consists of at least one computer
host which is running the qmaster program and at least one instantiation
of the dqs_execd daemon, which manages the actual execution of
jobs on the host which it 'inhabits'. All of the resources managed
and monitored by a qmaster are considered to be a &quot;cell&quot;.
<BR>
<BR>
</FONT>
<P>
<IMG SRC="IMG00020.GIF"><BR>
<P>
<FONT SIZE=2>Within a cell there are three classes of programs
operating: the qmaster daemon, the dqs_execd daemon and the DQS
utilities, which include qsub, qstat, qmod, qconf, qdel, qhold
and qrls. </FONT>
<OL>
<LI><FONT SIZE=2>The qmaster maintains all of the critical files
and tables for a cell. There are actually two types of tables
managed by the qmaster which are called &quot;queues&quot;. The
first is the job queue which is a linear, ordered list of all
jobs in the system. This list is sorted by job priority, an internal
job sub-priority (based on a site parameterized &quot;fair use&quot;
policy) and then by the order in which jobs have been submitted.
The second table type is a list of &quot;execution queues&quot;,
where each potential target for running a job is defined by a
queue configuration for that target.</FONT>
<LI><FONT SIZE=2>The qmaster possesses a set of &quot;auxiliary&quot;
files which are used to maintain information for system security
and to parameterize DQS for specific site characteristics. Access
control lists,  static and consumable resource definitions, and
a table of &quot;trusted hosts&quot; who are permitted to contact
the qmaster are &quot;mirrored&quot; in memory and disk at all
times so that the qmaster can survive interruptions such as power-outages.</FONT>
<LI><FONT SIZE=2>The primary mode of operation of the qmaster
is &quot;listening and waiting&quot;. The qmaster listens for
messages from other qmasters which are managing their own
cells [e], its own dqs_execd daemons [a] and the DQS utilities [b].
Periodically the qmaster examines the job list and attempts to
find an execution queue which can satisfy the requirements of
one or more jobs in the table.</FONT>
<LI><FONT SIZE=2>The basic operation of the dqs_execd is &quot;sleep
through class.. and wake up in time to answer a teacher's question
or hear the end-of-class bell&quot;.  The &quot;class bell&quot;
in this case is a periodic event where the dqs_execd gathers information
on the health and state of the host machine on which it resides.
This period is defined in the &quot;conf_file&quot; and can be
varied by each site. At this point the &quot;load average&quot;
is sent to the qmaster [a] to provide the qmaster with information
to help it distribute jobs among the available hosts. (If the
conf_file parameter &quot;DEFAULT_SORT_SEQ_NO&quot; is set to
TRUE, the load average report is subservient to the sequence number
of a queue.)</FONT>
<LI><FONT SIZE=2>The &quot;teacher's question&quot; in this case
is a probe from the qmaster for a system integrity test or  a
system request, usually to begin execution of a job [c]. At this
prodding the dqs_execd sets to work as we will see later.</FONT>
<LI><FONT SIZE=2>In a quiescent system, with no jobs queued, and
none executing, the qmaster and dqs_execd daemons continue their
&quot;sleepy handshaking&quot; described above. The term &quot;sleepy&quot;
was chosen because these programs have been designed to utilize
minimal system resources (memory and cpu cycles) on their hosts.
Thus both programs are either sleeping or performing the minor
handshaking indicated by the [a] in the diagram. In DQS313, a
qmaster in one cell does not poll or communicate with other qmasters
except to request an action such as moving a job from one queue
to another.</FONT>
<LI><FONT SIZE=2>Into the idle system described here, a user submits
a job from one of the &quot;trusted hosts&quot; in the system.
This could be a host in the cell which also houses the qmaster
or a dqs_execd, or a host with neither daemon which was
made a trusted host by the administrator using the &quot;qconf
-ah &#133; &quot; command. Two validation steps occur upon invocation
of the &quot;qsub&quot; command.  </FONT>
<OL>
<LI><FONT SIZE=2>The qsub command line and the script file are
scanned for DQS directives. DQS directives may occur in
either stream, but the scanning stops when a string is encountered
which is neither a comment nor a DQS directive. (The default flag
for a DQS directive is the character pair '#$'.) All DQS directives
are &quot;parsed&quot; for syntactical errors and rejected at
this point if problems are found.</FONT>
<LI><FONT SIZE=2>The syntactically verified command line and script
file are then sent to the qmaster (shown as [b] in the diagram).
The qmaster then performs a &quot;semantic&quot; validation of
the job request. By &quot;semantic&quot; here we mean &quot;does
the request make sense in the context of this system at this time&quot;.
</FONT>
</OL>
</OL>
<P>
<FONT SIZE=2>     The second test compares the user's request
for site-defined resources (such as        those actually present
in the system at the moment. Unless the submitted job possesses
the DQS directive &quot;-F&quot; ( force the acceptance of the
resource request ),  If one or more of the requested resources
do not exist (please note that this test verifies that a resource
is present in the system, not whether  or not it is in use by
another job!).<BR>
</FONT>
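<P>
<FONT SIZE=2>The two validation steps can be sketched as follows. This Python
fragment is a simplified illustration, not the DQS implementation. It assumes that
any other line beginning with &quot;#&quot; counts as a comment and that blank lines
are skipped; the function names and the resource name &quot;big_disk&quot; are
hypothetical.</FONT>
<PRE>
```python
def scan_directives(script_text, flag="#$"):
    """Collect directives until the first line that is neither a
    directive nor a comment."""
    directives = []
    for line in script_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(flag):
            directives.append(stripped[len(flag):].split())
        elif not stripped or stripped.startswith("#"):
            continue  # ordinary comment or blank line (blank handling is assumed)
        else:
            break  # first command line ends the scan
    return directives


def passes_resource_test(requested, present, force=False):
    """Pass when forced (-F), or when every requested resource exists."""
    return force or all(name in present for name in requested)


script = """#!/bin/csh
#$      -l qty.eq.1
#$ -N UTESTJOB
#$ -cwd
echo 'we are now doing something else'
#$ -N IGNORED
"""
print(scan_directives(script))
print(passes_resource_test({"big_disk"}, set()))
print(passes_resource_test({"big_disk"}, set(), force=True))
```
</PRE>
<FONT SIZE=2>The directive placed after the echo command is not collected,
because the scan has already stopped at the first command line.</FONT>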
<OL>
<LI><FONT SIZE=2>If the job request passes these tests it is placed
in the job queue [c]. This queue is &quot;mirrored&quot; on disk so
that it may be recovered after a system restart. When a new job
is placed in the list, the qmaster scheduler scans all the jobs
in the list and tries to find an execution queue which  will satisfy
each entry's request for resources. This process does NOT begin
with the newly arrived job but begins at the top of the list,
so it is possible that the job submission may trigger the scheduling
of a previously submitted job and leave this job &quot;awaiting
another time&quot;.</FONT>
<LI><FONT SIZE=2>At some point, motivated by the submission of
a new job, the termination of a running job or the passing of a period of seconds
defined by the &quot;SCHEDULE_TIME&quot; parameter in the conf_file,
the qmaster will scan the job list and find a job whose
resource requirements can be met. The job description and script file
are &quot;packaged up&quot; and sent to the target dqs_execd [d].
The status information for the target queue is updated to indicate
the change of state and the identity of the job's host machine.
Where parallel jobs have been specified, the qmaster will assign
additional hosts and mark their status as running the selected
job. Slave processes, however, are initiated by the dqs_execd host
for the Master process, and not by the qmaster.</FONT>
<LI><FONT SIZE=2>The dqs_execd first records the job request information
in its own &quot;mirror&quot; disk file, so that it may be retrieved
in the event of a system restart while the job is executing. Then
the job is prepared for execution. This process consists of first
creating a separate UNIX process to monitor and manage the executing
job. In DQS313 this is called the &quot;shepherd&quot; process.
It is the presence of this &quot;shepherd&quot; which permits
a single dqs_execd to manage multiple job executions on the same
host and deals with the need for AFS re-authentication invisibly
to the executing jobs. </FONT>
<LI><FONT SIZE=2>The first step for the &quot;shepherd&quot; 
is to establish  an environment for the job which matches that
of the submitting user, modified by the parameters in the job
script and on the command line. Next the &quot;shepherd&quot;
determines how the  system and the user wish to handle the stdout
and stderr files for the job. This is directed by DQS directives
and the system-wide parameters in the &quot;conf_file&quot;. </FONT>
<LI><FONT SIZE=2>If one of the forms of parallel job execution
has been specified (the &quot;-p&quot; option in the DQS directives),
the Master dqs_execd will &quot;remote-shell&quot; the DQS task
&quot;dsh&quot; (distributed shell) to the target Slave processes.
The dqs_execd on each Slave host will start a process to manage
the SLAVE task. (In this release of DQS 3.1.3 this task is NOT
identical to the process shepherd and does not support AFS re-authentication
of the SLAVE process.)</FONT>
<LI><FONT SIZE=2>After the user's environment has been set up and
any SLAVE process managers started on other hosts, the DQS313
&quot;process shepherd&quot; sends the job startup notice (if
requested) and then launches the job.</FONT>
<LI><FONT SIZE=2>The &quot;process shepherd&quot; then enters
its own &quot;sleep&quot; loop, occasionally awakening to peek
at the running job and copy output files (as directed) to their
target directories. </FONT>
<LI><FONT SIZE=2>Upon job termination the &quot;process shepherd&quot;
executes a system defined &quot;add-on script&quot; which usually
performs additional job-cleanup operations. The dqs_execd then
forms an accounting record including job execution statistics,
which is sent to the qmaster, signaling the completion of all
activities related to the job [a]. Any SLAVE processes terminate
their own portion of a job independently. These SLAVE tasks are
usually shut down by their master process, according to the methodology
of the specific parallel paradigm, P4, MPI, TCGMSG, or PVM.</FONT>
<LI><FONT SIZE=2>As with the &quot;qsub&quot; job submission program,
all DQS313 utilities interact only with the qmaster. The qmaster
rejects any requests if the originating computer is not in the
cell's host list. The qmaster then checks to see if the user has
permission to perform the actions. For example at most sites any
user can request a display of the queue status (qstat command),
while only a DQS administrator is permitted to add, delete or
disable queues.</FONT>
<LI><FONT SIZE=2>Thus in this system a valid request by a user
to delete one of their running jobs consists of the following
sequence: qdel &lt;job&gt; is sent to the qmaster; qmaster validates the request;
qmaster sends a job terminate message to the appropriate dqs_execd;
qmaster sends an acknowledgment to the qdel utility; qdel posts
a message to the submitting user; dqs_execd sends a UNIX SIGKILL
to the job; job termination triggers the dqs_execd to gather usage
data and send an end-job message to the qmaster; qmaster logs
the accounting information; qmaster deletes all job information;
qmaster marks the queue as available for scheduling.</FONT>
</OL>
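<P>
<FONT SIZE=2>The two checks the qmaster applies to every request (is the originating host in the cell's host list, and is the user permitted to perform the action) can be sketched as follows. This is a minimal illustration in Python, not DQS source code; the host names, user names, action names and the function validate_request are all hypothetical.</FONT>

```python
# Hypothetical model of the qmaster's two-stage request check.
# None of these names come from the DQS source.

TRUSTED_HOSTS = {"A0", "A1", "A2"}   # the cell's host list
MANAGERS = {"dqsadmin"}              # users allowed to change queues

# Actions any user may request versus actions reserved to managers.
OPEN_ACTIONS = {"qstat"}
MANAGER_ACTIONS = {"add_queue", "delete_queue", "disable_queue"}


def validate_request(host, user, action):
    """Return (accepted, reason) for a request reaching the qmaster."""
    # Stage 1: reject anything from a host outside the cell's host list.
    if host not in TRUSTED_HOSTS:
        return False, "host not in host list"
    # Stage 2: check whether this user may perform the action.
    if action in OPEN_ACTIONS:
        return True, "ok"
    if action in MANAGER_ACTIONS and user in MANAGERS:
        return True, "ok"
    return False, "permission denied"


if __name__ == "__main__":
    print(validate_request("A1", "alice", "qstat"))
    print(validate_request("A1", "alice", "delete_queue"))
    print(validate_request("Z9", "dqsadmin", "qstat"))
```

<P>
<FONT SIZE=2>The host check comes first on purpose: a request from an unknown host is rejected before any user permissions are consulted.</FONT>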
<P>
<FONT SIZE=2></FONT>
<H2>Cells, Hosts, Queues</H2>
<P>
<FONT SIZE=2>In the previous section a diagram of the elements
constituting a &quot;cell&quot; was displayed. A DQS313 site
may have several independent cells, or they may be aggregated
into a common operating environment:</FONT>
<P>
<IMG SRC="IMG00021.GIF">
<P>
<FONT SIZE=2>This example displays three cells, A, B and C, each
managed by its own qmaster QM-A, QM-B or QM-C. The hosts are labeled
A1 and A2 for Cell-A, B1, B2 and B3 for Cell-B and C1 and C2 for
Cell-C. For this discussion we will assign the qmasters to a separate
host in each cell. QM-A will thus be on host A0, QM-B on host
B0 and QM-C on host C0. </FONT>
<P>
<FONT SIZE=2>Communications among the various hosts in a cell
and between cells are structured by the inclusion of a host within
a qmaster's host list. In the above example qmaster QM-A has four
hosts in its table: A0 (its own host), A1, A2 and B0 (the qmaster
host for cell B). Instead of a completely symmetrical inter-cell
arrangement, here we have chosen not to have QM-A linked with QM-C.
Thus neither of these qmasters will have the other cell's qmaster
host in its own hosts table.</FONT>
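<P>
<FONT SIZE=2>The effect of the host tables can be illustrated with a small Python sketch. Only QM-A's table (A0, A1, A2, B0) is stated in the text; the tables shown for QM-B and QM-C are assumptions made for this example, and the function may_contact is hypothetical.</FONT>

```python
# Host tables for the three example cells. A qmaster accepts requests
# only from hosts named in its own table.
HOST_TABLES = {
    "QM-A": {"A0", "A1", "A2", "B0"},              # as given in the text
    "QM-B": {"B0", "B1", "B2", "B3", "A0", "C0"},  # assumed for illustration
    "QM-C": {"C0", "C1", "C2", "B0"},              # assumed for illustration
}


def may_contact(host, qmaster):
    """True if the qmaster's host table lists the given host."""
    return host in HOST_TABLES[qmaster]


if __name__ == "__main__":
    # QM-A and QM-B are linked through their qmaster hosts.
    print(may_contact("B0", "QM-A"))
    # QM-A and QM-C are deliberately not linked in either direction.
    print(may_contact("C0", "QM-A"), may_contact("A0", "QM-C"))
```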
<P>
<FONT SIZE=2>An option, which is less secure, is to permit a
host from one cell to contact the qmaster in another cell (as
shown by path [c]). In this case host B3 could execute utilities
and perhaps launch jobs in Cell-C as well as Cell-B.  Even without
this &quot;sneak path&quot;, hosts in cells A and C can interrogate
the status of queues in Cell-B, if the user permissions allow
such an activity.</FONT>
<P>
<FONT SIZE=2>Note, once again, that a host in a cell may have
no queues assigned to it for execution, or it may have one or
more queues assigned to it. It is also quite common to have a
dqs_execd running on the same host as the qmaster daemon. The
DQS313 utilities can be executed on any host in a cell, regardless
of whether that host is running a dqs_execd daemon.</FONT>
<P>
<FONT SIZE=2>The first level of security within DQS is then a
&quot;trust&quot; relationship among a cell's hosts and between
each cell's qmasters. The next level of security is the level
of permissions established by a qmaster's &quot;manager&quot;
and &quot;administrator&quot; lists. The third level of security
is defined by specific user permissions or exclusions for each
queue. Certain activities are permitted to a DQS administrator
or manager which a queue owner may not invoke. Among them are
deleting the queue itself or changing its configuration. A queue
owner and the DQS managers may perform activities such as queue
suspension, which of course the average user is prohibited from
doing.<BR>
</FONT>
<H2>System Directories</H2>
<P>
<FONT SIZE=2>To manage system security, queues, jobs and user
access, a number of directories are created during the startup
process. The DQS administrator will normally not have to deal
with these directories or their contents. However, when all DQS
files cannot or should not be cross-mounted, it is important that
the function of these elements is understood so that they can
be placed correctly in the system. </FONT>
<H3>Shared &amp; Local</H3>
<P>
As indicated in the installation instructions, the easiest method
for managing a DQS system is to have all the system files and directories
mounted by NFS, AFS or DFS on all hosts. The one exception is
the directories containing the binaries for the DQS executables,
which, of course, should only be shared by hosts with identical
architecture and operating system configurations. A knowledgeable
administrator may wish to make changes directly to the contents
of one of these directories. Where appropriate a hint or two is
provided to assist the system manager.  A typical installation
will possess a directory tree somewhat like the following (underlined names
are directories, italicized names are files):
<MENU>
<LI><U>/usr</U>
<MENU>
<LI><U>/local</U>
<MENU>
<LI><U>/DQS</U>
<MENU>
<LI><U>/bin</U>
<LI><U>/common</U>
<MENU>
<LI><U>/conf</U>
<MENU>
<LI><I>resolve_file</I>
<LI><I>conf_file</I>
<LI><I>act_file</I>
<LI><I>log_file</I>
<LI><U>/qmaster</U>
<MENU>
<LI><U>/QM-A</U>
<MENU>
<LI><U>/common_dir</U>
<MENU>
<LI><I>complex_file</I>
<LI><I>consumables_file</I>
<LI><I>generic_queue</I>
<LI><I>host_file</I>
<LI><I>man_file</I>
<LI><I>op_file</I>
<LI><I>seq_num_file</I>
<LI><I>acl_file</I>
</MENU>
<LI><U>/job_dir</U>
<MENU>
<LI><I>job1</I>
<LI><I>job2</I>
<LI>&#133;
</MENU>
<LI><U>/queue_dir</U>
<MENU>
<LI><I>queue-A1</I>
<LI><I>queue-A2</I>
<LI><I>queue_a3</I>
<LI>&#133;
</MENU>
<LI><U>/tid_dir</U>
<MENU>
<LI><I>tid_#xxxx</I>
<LI><I>tid_#xxxx</I>
<LI>&#133;
</MENU>
<LI><I>pid_file</I>
<LI><I>stat_file</I>
<LI><I><B>core</B></I>
</MENU>
</MENU>
<LI><U>/dqs_execd</U>
<MENU>
<LI><U>/host-A1</U> &#133; <U>/host_An</U>
<MENU>
<LI><U>/exec_dir</U>
<MENU>
<LI><I>script_file</I>
</MENU>
<LI><U>/job_dir</U>
<MENU>
<LI><I>job1</I>
<LI><I>job2</I>
<LI>&#133;
</MENU>
<LI><U>/rusage_dir</U>
<MENU>
<LI><I>current_usage</I>
</MENU>
<LI><U>/tid_dir</U>
<MENU>
<LI><I>tid_#xxxx</I>
<LI><I>tid_#xxxx</I>
</MENU>
<LI><I>pid_file</I>
<LI><I><B>core</B></I>
</MENU>
</MENU>
</MENU>
</MENU>
</MENU>
</MENU>
</MENU>
</MENU>
<P>
Four system files are classed as &quot;should be shared by all
hosts, if at all possible&quot;. They are:
<P>
<I><B><FONT SIZE=2>conf_file</FONT></B></I><FONT SIZE=2> --- This
file is created during the DQS313 &quot;config&quot; step of the
installation or system update. This file contains the system-wide
configuration, which is read by the qmaster, dqs_execd and all
DQS utilities when they start up. If it is necessary to make changes
to this file, the qmaster and all dqs_execd's should be shut down
and restarted after the changes are complete, so that they will
possess the latest configuration. Failure to observe this step
may result in bizarre and unexplained behavior of the system,
if not an outright collapse. If this file cannot be cross-mounted
by all hosts, then an IDENTICAL COPY of this file needs to be
distributed to all hosts before restarting the qmaster or dqs_execd
daemons or any of the command utilities. </FONT>
<P>
<FONT SIZE=2>The location from which this file is read is &quot;hard-wired&quot;
into the compiled DQS code, based on the #define CONF_FILE statement
in the dqs.h file, which is also created by the DQS &quot;config&quot;
step. It is important to understand that the default installation
setup places the conf_file in the &quot;/usr/local/common/conf&quot;
directory, which is also used as the default location for the
qmaster and dqs_execd spool directories. While those directories
can be relocated by changing the conf_file and restarting the
daemons, the location of the resolve_file and conf_file can only
be changed by modifying &quot;dqs.h&quot; with an editor or by
re-executing the &quot;config&quot; program.</FONT>
<P>
<FONT SIZE=2>The following are the initial entries in the conf_file,
with a description of each line's effect on the system.<BR>
</FONT>
<MENU>
<LI><FONT SIZE=2> QMASTER_SPOOL_DIR          /usr/local/DQS/common/conf</FONT>
<MENU>
<LI><FONT SIZE=2>This parameter points to the starting directory
from which the qmaster's sub-directories are created. While at
some sites with several cells the resulting tree can be shared
by multiple qmasters, it is only necessary that the qmaster have
access to the sub-directories for itself. This tree appears above
as &quot;&#133;/qmaster/QM-A&quot;.</FONT>
</MENU>
<LI><FONT SIZE=2> EXECD_SPOOL_DIR            /usr/local/DQS/common/conf</FONT>
<MENU>
<LI><FONT SIZE=2>This parameter points to the starting directory
from which all of the dqs_execd's in the cell will find their
individual queue management directories. In the default DQS setup
all dqs_execd's in a cell use this same directory tree, each terminating
in its own specific set of sub-directories. This is illustrated
in the preceding diagram by &quot;../dqs_execd/host-A1&quot;.</FONT>
</MENU>
<LI><FONT SIZE=2> DEFAULT_CELL                    user-network</FONT>
<MENU>
<LI><FONT SIZE=2>The system-wide, unique name for a given cell.
This can be any arbitrary ASCII string and is defaulted to the
qmaster's host domain name during the installation process. If
this name is changed, the corresponding string in the &quot;resolve_file&quot;
must be changed accordingly&#133; and vice-versa.</FONT>
</MENU>
<LI><FONT SIZE=2> RESERVED_PORT               TRUE</FONT>
<MENU>
<LI><FONT SIZE=2>This parameter indicates that all daemons and
utilities in a cell will be using UNIX reserved ports for socket
communications. UNIX system port numbers from 0 to 1023 are designated
as &quot;reserved&quot;. If this parameter is set to TRUE then
all of the DQS313 programs MUST execute with root ownership. If
this parameter is set to FALSE then the /etc/services port numbers
for DQS313 services must be greater than 1024.</FONT>
</MENU>
<LI><FONT SIZE=2> DQS_EXECD_SERVICE          dqs313_dqs_execd</FONT>
<MENU>
<LI><FONT SIZE=2>Any arbitrary ASCII string can be used to identify
the tcp port number to be used when the qmaster or the DQS utility
&quot;dsh&quot; is communicating with the dqs_execd. The only
requirement is that this name must be unique among all names in
the /etc/services file.</FONT>
</MENU>
<LI><FONT SIZE=2> QMASTER_SERVICE            dqs313_qmaster</FONT>
<MENU>
<LI><FONT SIZE=2>Any arbitrary ASCII string can be used to identify
the tcp port number to be used when the dqs_execd or DQS utilities
are communicating with the qmaster. The only requirement is that
this name must be unique among all names in the /etc/services
file.</FONT>
</MENU>
<LI><FONT SIZE=2> INTERCELL_SERVICE          dqs313_dqs_intercell</FONT>
<MENU>
<LI><FONT SIZE=2>Any arbitrary ASCII string can be used to identify
the tcp port number to be used when one qmaster is communicating
with another qmaster. The only requirement is that this name must be
unique among all names in the /etc/services file.</FONT>
</MENU>
<LI><FONT SIZE=2> KLOG                         /usr/local/bin/klog</FONT>
<MENU>
<LI><FONT SIZE=2>The re-authentication process in AFS systems
will use the klog program. This entry is only used when AFS support
was selected during DQS installation.</FONT>
</MENU>
<LI><FONT SIZE=2> REAUTH_TIME                 60</FONT>
<MENU>
<LI><FONT SIZE=2>If AFS has been selected, all daemons and executing
jobs will be re-authenticated at intervals of this number of
seconds.</FONT>
</MENU>
<LI><FONT SIZE=2> MAILER                      /bin/mail</FONT>
<MENU>
<LI><FONT SIZE=2>All  jobs can select options to send brief &quot;job
startup&quot;, &quot;job end&quot; and &quot;job abort&quot; messages
to one or more designated users. In addition the DQS313  system
will send mail messages to the administrator in the event of extraordinary
system events.</FONT>
</MENU>
<LI><FONT SIZE=2> DQS_BIN                     /usr/local/DQS/bin</FONT>
<MENU>
<LI><FONT SIZE=2>The qmaster, dqs_execd and all user-initiated
utilities locate their binaries in the BIN_DIR established during
the &quot;config&quot; step of installation. This entry is set
by that step, and acts as a &quot;place-holder&quot; for that
target directory. This parameter is used, however, by the parallel
queue management system. If the administrator wishes, this parameter
can be changed to point to a different directory where PVM, P4, TCGMSG
and MPI support programs may reside. Doing so will not affect
the continued use of the BIN_DIR for the remaining DQS executables.</FONT>
</MENU>
<LI><FONT SIZE=2> ADMINISTRATOR             admin@host_machine
 </FONT>
<MENU>
<LI><FONT SIZE=2>On startup of the qmaster this entry is used
to identify the primary DQS administrator for this cell. This
also forms the email address used to send system error messages.</FONT>
</MENU>
<LI><FONT SIZE=2> DEFAULT_ACCOUNT            GENERAL</FONT>
<MENU>
<LI><FONT SIZE=2>Any arbitrary ASCII string (without separator
characters such as blanks, periods, commas) can be used as an
account identifier. Each job submission can provide its own account
identifier, which overrides this default string. No validation
is performed on this or the user submitted account name string.
When a job terminates a record is created from hardware and software
usage data. The &quot;account string&quot; is appended and the
record is appended to the qmaster's &quot;act_file&quot;.</FONT>
</MENU>
<LI><FONT SIZE=2> LOGMAIL                     FALSE</FONT>
<MENU>
<LI><FONT SIZE=2>By default none of the mail generated by the
DQS, either to users or to the system's managers, is logged. Setting
this parameter to TRUE will cause the qmaster to create a mail
log file, where all system emails are recorded and time-stamped.</FONT>
</MENU>
<LI><FONT SIZE=2> DEFAULT_RERUN               FALSE</FONT>
<MENU>
<LI><FONT SIZE=2>It is our sincere hope to have the rerun feature
of DQS implemented in future versions. In DQS313 this parameter
is ignored.</FONT>
</MENU>
<LI><FONT SIZE=2> DEFAULT_SORT_SEQ_NO        FALSE</FONT>
<MENU>
<LI><FONT SIZE=2>During the qmaster's scheduling process two major
steps occur. First, the jobs themselves are sorted according to
their submitted priorities and internal policy criteria. Second,
all of the available queues are scanned to find one which suits
the needs of the first job to be scheduled. The ordering of this
queue scanning process can be changed by this parameter. When
this parameter is FALSE, all of the queue entries are sorted by
their host's usage data (as reported by
the dqs_execd), so that the first queue examined will be the least
&quot;busy&quot; queue, in an effort to spread the workload across
the system.</FONT>
<LI><FONT SIZE=2>If this parameter is set to TRUE, the queues are
examined in the order of the sequence number assigned by the
administrator in each queue configuration. Many sites use this
method to ensure that their most powerful hosts are scanned first,
by assigning very low sequence numbers to the corresponding
queues. </FONT>
</MENU>
<LI><FONT SIZE=2> SYNC_IO                     FALSE</FONT>
<MENU>
<LI><FONT SIZE=2>In multi-host systems utilizing NFS-mounted files
it is possible for I/O actions to become disordered in their results.
The ordering of lines of output sent to stdout or stderr can become
totally confused. DQS313 is supposed to have a feature in its
&quot;process shepherd&quot; to ensure that all stdout and stderr
output is properly time sequenced, even when multiple SLAVE processes
are involved. In the initial DQS313 release this feature is not
active.</FONT>
</MENU>
<LI><FONT SIZE=2> USER_ACCESS                 ACCESS_FREE</FONT>
<MENU>
<LI><FONT SIZE=2>This feature for differentiating levels of access
for users or classes of users is not implemented in DQS313.</FONT>
</MENU>
<LI><FONT SIZE=2> LOGFACILITY                 LOG_VIA_COMBO</FONT>
<MENU>
<LI><FONT SIZE=2>Many system messages are generated to aid in
the maintenance and diagnosis of DQS operation. Three files are
used for this activity: the &quot;err_file&quot;, the &quot;log_file&quot;
and the &quot;syslog_file&quot;. Depending on the level of attention
required, messages are directed to one of these files. All messages
with ERR, CRIT, or WARNING are always sent to the err_file. Messages
with levels of INFO, WARNING or NOTICE can be sent to the system
log or the normal activity log file. The normal mode is to use
both the system log and normal log file. In DQS313 the system
log has been disabled, so that all non-error messages are directed
to the &quot;log_file&quot;.</FONT>
</MENU>
<LI><FONT SIZE=2> LOGLEVEL                    LOG_INFO</FONT>
<MENU>
<LI><FONT SIZE=2>Information is logged depending on the level
assigned within the DQS. In increasing order they are LOG_INFO,
LOG_NOTICE, LOG_WARNING, LOG_ERR, LOG_CRIT, LOG_ALERT, LOG_EMERG.
Setting the LOGLEVEL parameter establishes the minimum level
of messages to be recorded. A parameter of LOG_INFO ensures that
all messages will appear in the &quot;log_file&quot;.</FONT>
</MENU>
<LI><FONT SIZE=2> MIN_UID                     10</FONT>
<LI><FONT SIZE=2> MIN_GID                     10</FONT>
<MENU>
<LI><FONT SIZE=2>For security reasons it is desirable to establish
a minimum user and group identifier (uid or gid) which will be permitted
in execution of any of the DQS utilities. The qmaster and dqs_execd,
of course, normally operate at root level. The recommended setting
is &quot;10&quot; for these parameter values, as most UNIX critical
processes run with uid and gid values below &quot;10&quot;. It
is strongly recommended that these default values be retained.
</FONT>
<LI><FONT SIZE=2>Attempts by root, or by any other account whose uid or gid
is below these limits, to run DQS utilities such as qsub, qalter,
qstat, etc. will fail if these default values are used, which
is the &quot;correct&quot;, albeit confusing (to new system managers),
behavior of DQS.</FONT>
</MENU>
<LI><FONT SIZE=2> MAXUJOBS                    10</FONT>
<MENU>
<LI><FONT SIZE=2>There are a number of DQS &quot;system policy&quot;
parameters available to the DQS313 administrator. One of these
is a system-wide limit on the total number of jobs a user may
have considered for scheduling at any one time. This is not a
limit on the total number of jobs which a user can have queued
up in the system, but it does instruct the qmaster not to consider
more than MAXUJOBS for a user during a scheduling pass. The effect
of this limit can become quite subtle. For example, if a limit
of 10 is established and the user submits 100 jobs, they will
be ordered in sequence of their priority and submission time.
If the first ten of these jobs require system resources not currently
available, they cannot be scheduled. Neither will any following
jobs, which may need some resource which is actually available.
An additional user limit can be found in each queue configuration.</FONT>
</MENU>
<LI><FONT SIZE=2> OUTPUT_HANDLING            LEAVE_OUTPUT_FILES</FONT>
<MENU>
<LI><FONT SIZE=2>A job started by the qmaster may
produce large stdout or stderr files. Writing these
files to a remote, NFS-mounted file system can have a negative
impact on system performance. In some cases, retaining these
files on a host's local filesystem could prevent network congestion
and minimize I/O delays for the running job. DQS313 provides three
options for handling these output files. The default, LEAVE_OUTPUT_FILES,
causes the stdout and stderr files to be left in the working directory
established by the user's &quot;qsub&quot; script.  </FONT>
<LI><FONT SIZE=2>This parameter can be changed to LINK_OUTPUT_FILES.
In this case the administrator must create a special file in one
or all the dqs_execd spool directories. The name of this file
is defaulted to &quot;netpath&quot; during the DQS &quot;config&quot;
step. This default name may be changed in the dqs.h file by the
administrator, if they are prepared to recompile the entire DQS313
system. The &quot;netpath&quot; file should contain one ASCII
line defining the fully qualified network path of the target directory
into which the stdout and stderr files are actually to be placed.</FONT>
<LI><FONT SIZE=2>If the parameter is set to COPY_OUTPUT_FILES
the DQS313 process &quot;shepherd&quot; creates temporary standard
output and standard error files local to the host executing the
job.  A special &quot;copy&quot; process is started which wakes
up periodically (set by the hard-wired COPY_FILE_DELAY in the
dqs.h file), and copies the current contents of those files to
their actual destination.</FONT>
</MENU>
<LI><FONT SIZE=2> ADDON_SCRIPT                NONE</FONT>
<MENU>
<LI><FONT SIZE=2>At the conclusion of a user's job, and in the
working space of that job, it is sometimes necessary to conduct
system cleanup tasks. This is particularly true of parallel processing
tasks, which may leave &quot;orphan&quot; daemons running
in the event of unplanned process termination. A system script
maintained within the DQS can be created and invoked at the conclusion
of EVERY user job. This parameter must then contain the fully
qualified path-name to this script file.</FONT>
</MENU>
<LI><FONT SIZE=2> ADDON_INFO                 NONE</FONT>
<MENU>
<LI><FONT SIZE=2>When OUTPUT_HANDLING is set to anything other
than LEAVE_OUTPUT_FILES, the system administrator may wish to
maintain a diagnostic awareness of the &quot;process shepherd&quot;
handling of the copying or linking of a user's stdout and stderr
files. If this parameter is set to something other than NONE,
the parameter string should be a fully qualified path to a file
containing an ASCII string to be appended to the &quot;stdout&quot;
file along with other job information. </FONT>
</MENU>
<LI><FONT SIZE=2> LOAD_LOG_TIME              30</FONT>
<MENU>
<LI><FONT SIZE=2>Upon startup the dqs_execd sets this parameter
(specified in seconds) as a minimum period for the dqs_execd to
deliver system usage statistics to the qmaster.  </FONT>
</MENU>
<LI><FONT SIZE=2> STAT_LOG_TIME             600</FONT>
<MENU>
<LI><FONT SIZE=2>Various system statistics, beyond the host usage
provided by the dqs_execd daemons, are gathered periodically,
based on the value of this parameter (specified in seconds).</FONT>
</MENU>
<LI><FONT SIZE=2> SCHEDULE_TIME               60</FONT>
<MENU>
<LI><FONT SIZE=2>The qmaster scans the cell's job queue after
every new job is submitted to the system or upon termination of
a running job. Absent these occurrences the  qmaster will trigger
a scheduling pass of the jobs based on this parameter (in seconds).</FONT>
</MENU>
<LI><FONT SIZE=2> MAX_UNHEARD                 90</FONT>
<MENU>
<LI><FONT SIZE=2>The qmaster does not poll other daemons for their
status. Instead it updates the queue status for each dqs_execd
which reports in. If a dqs_execd fails to report in to the qmaster
within this threshold (seconds), the qmaster will mark all queues
managed by the dqs_execd as &quot;status UNKNOWN&quot;. This status
is updated every interval, and can be changed from UNKNOWN to
UP if the dqs_execd has finally succeeded in updating the qmaster.</FONT>
</MENU>
<LI><FONT SIZE=2> ALARMS                      3</FONT>
<LI><FONT SIZE=2> ALARMM                      4</FONT>
<LI><FONT SIZE=2> ALARML                      5</FONT>
<MENU>
<LI><FONT SIZE=2>The admonition to &quot;avoid changing these
parameters&quot; in the installation is well founded. These parameters
control the amount of time permitted before the UNIX system interrupts
an attempt at inter-host communications.  The ALARMS value is
the time in seconds before a DQS utility such as qsub or qmod is
interrupted. The user will then see the message &quot;Alarm
Clock Shutdown&quot;, indicating that the utility could not contact
the qmaster within &quot;ALARMS&quot; seconds.  The ALARMM parameter
sets a similar limit on dqs_execd&lt;-&gt;qmaster communications
attempts. ALARML is the longest period established for inter-process
interchange attempts, and is used to control qmaster&lt;-&gt;qmaster
communications. </FONT>
<LI><FONT SIZE=2>In systems where the qmaster host is also running
other jobs, or where the network interconnect can become congested,
it is possible for one or more communications attempts to fail due
to an ALARM time-out. If the err_file contains frequent &quot;ALARM
CLOCK Shutdown&quot; warnings, or utility execution fails often
with similar error messages, the three ALARM parameters should
be increased. These values should be kept as small as practical
to prevent a failing DQS element from tying up the host's tcp/ip
interface.</FONT>
</MENU>
</MENU>
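<P>
<FONT SIZE=2>The conf_file entries listed above are simple name and value pairs, so a site script can read them with very little code. The following Python sketch is an illustration only and is not part of DQS: the functions parse_conf and check_ports are hypothetical, the port numbers are invented, and the sketch assumes that every entry sits on one line with the value separated from the name by white space. It also applies the rule given for RESERVED_PORT, that with FALSE the DQS service ports must be greater than 1024.</FONT>

```python
SAMPLE = """\
QMASTER_SPOOL_DIR   /usr/local/DQS/common/conf
EXECD_SPOOL_DIR     /usr/local/DQS/common/conf
DEFAULT_CELL        user-network
RESERVED_PORT       TRUE
DQS_EXECD_SERVICE   dqs313_dqs_execd
QMASTER_SERVICE     dqs313_qmaster
INTERCELL_SERVICE   dqs313_dqs_intercell
MAXUJOBS            10
"""

SERVICE_KEYS = ("DQS_EXECD_SERVICE", "QMASTER_SERVICE", "INTERCELL_SERVICE")


def parse_conf(text):
    """Parse 'NAME value' lines into a dict of strings."""
    settings = {}
    for line in text.splitlines():
        parts = line.split(None, 1)   # split name from value on white space
        if not parts:
            continue                  # skip empty lines
        settings[parts[0]] = parts[1].strip() if len(parts) == 2 else ""
    return settings


def check_ports(settings, service_ports):
    """Return service names whose port is too low for RESERVED_PORT FALSE.

    service_ports maps a service name to the port number found in
    /etc/services (looked up elsewhere; invented values in the demo).
    """
    if settings.get("RESERVED_PORT") != "FALSE":
        return []
    bad = []
    for key in SERVICE_KEYS:
        name = settings.get(key)
        port = service_ports.get(name)
        if port is not None and port <= 1024:
            bad.append(name)
    return bad


if __name__ == "__main__":
    conf = parse_conf(SAMPLE)
    print(conf["DEFAULT_CELL"], conf["MAXUJOBS"])
    conf["RESERVED_PORT"] = "FALSE"
    print(check_ports(conf, {"dqs313_qmaster": 601, "dqs313_dqs_execd": 6001}))
```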
<P>
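<P>
<FONT SIZE=2>The two queue-scanning orders selected by DEFAULT_SORT_SEQ_NO can be pictured with a short Python sketch. The field names seq_no and load and the function scan_order are hypothetical, and the sketch assumes that with FALSE the least busy host is examined first, as the description of that parameter states.</FONT>

```python
def scan_order(queues, sort_by_seq_no):
    """Return queue names in the order the scheduler would examine them."""
    if sort_by_seq_no:
        # DEFAULT_SORT_SEQ_NO TRUE: administrator-assigned sequence numbers,
        # lowest number first.
        ordered = sorted(queues, key=lambda q: q["seq_no"])
    else:
        # DEFAULT_SORT_SEQ_NO FALSE: least busy host first.
        ordered = sorted(queues, key=lambda q: q["load"])
    return [q["name"] for q in ordered]


QUEUES = [
    {"name": "queue-A1", "seq_no": 2, "load": 0.9},
    {"name": "queue-A2", "seq_no": 1, "load": 0.5},
    {"name": "queue_a3", "seq_no": 3, "load": 0.1},
]

if __name__ == "__main__":
    print(scan_order(QUEUES, False))
    print(scan_order(QUEUES, True))
```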
<FONT SIZE=2></FONT>
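<P>
<FONT SIZE=2>The subtle effect of MAXUJOBS described above (a user's first ten jobs blocking ninety others that could run) can be reproduced with a small model. This Python sketch is a simplification and not the DQS scheduler: the function first_pass and the resource names are hypothetical, and resources are treated as simply available or not.</FONT>

```python
def first_pass(jobs, available, maxujobs):
    """Return ids of jobs that could start in one scheduling pass.

    jobs is a list of (job_id, user, needed_resource) tuples, already in
    priority and submission order. Only the first maxujobs jobs of each
    user are considered at all.
    """
    considered = {}
    startable = []
    for job_id, user, need in jobs:
        seen = considered.get(user, 0)
        if seen >= maxujobs:
            continue                  # beyond the user's window: not examined
        considered[user] = seen + 1
        if need in available:
            startable.append(job_id)
    return startable


# One user submits 100 jobs; the first ten need a resource that is busy.
JOBS = [(i, "alice", "tape" if i < 10 else "cpu") for i in range(100)]

if __name__ == "__main__":
    print(first_pass(JOBS, {"cpu"}, 10))   # nothing can start
    print(first_pass(JOBS, {"cpu"}, 11))   # the eleventh job gets a look
```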
<P>
<I><B><FONT SIZE=2>resolve_file</FONT></B></I><FONT SIZE=2> --- This
file is also created during the DQS &quot;config&quot; process.
It is the equivalent of a combination of the UNIX &quot;resolv.conf&quot;
and &quot;hosts.equiv&quot; files for managing network security.
The default resolve_file is:</FONT>
<MENU>
<LI><FONT SIZE=2># NOTE! blank lines NOT permitted #</FONT>
<LI><FONT SIZE=2># NOTE! fields must be separated by one(1) AND
ONLY one space #</FONT>
<LI><FONT SIZE=2># 1st field = cell_name</FONT>
<LI><FONT SIZE=2># 2nd field = primary qmaster</FONT>
<LI><FONT SIZE=2># 3rd field = primary qmaster alias</FONT>
<LI><FONT SIZE=2># 4th field = secondary qmaster</FONT>
<LI><FONT SIZE=2># 5th field = secondary qmaster alias</FONT>
<LI><FONT SIZE=2>user-network QM-A0 QM-A0.user.com NONE NONE</FONT>
</MENU>
<P>
<FONT SIZE=2></FONT>
<P>
<FONT SIZE=2>The comment lines direct the DQS manager as to the
format of new entries or entry changes. Some aspects of
this file need further explanation.</FONT>
<OL>
<LI><FONT SIZE=2>The cell name appearing in the first field of
the first non-commented line MUST be identical to the name appearing
as the DEFAULT_CELL parameter in the conf_file.</FONT>
<LI><FONT SIZE=2>DQS313 does not yet support alternate qmasters,
and thus the last two fields of each non-commented line must be
&quot;NONE&quot; and &quot;NONE&quot;.</FONT>
<LI><FONT SIZE=2>Additional cells may be defined by adding lines
to the resolve_file following the primary cell entry. If a host
in one cell is permitted to contact a qmaster in another cell
(via a &quot;sneak path&quot;) then the cell name and qmaster
name for that other cell must appear in the source cell's resolve_file.</FONT>
</OL>
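<P>
<FONT SIZE=2>Because the resolve_file format is strict (no blank lines, exactly one space between fields, five fields per entry), it is easy to check a file before restarting the daemons. The Python sketch below is a hypothetical checker, not a DQS tool; the function name parse_resolve_file and the dictionary keys are invented for the example, and the rules it enforces are only those stated above.</FONT>

```python
SAMPLE = """\
# NOTE! blank lines NOT permitted #
# NOTE! fields must be separated by one(1) AND ONLY one space #
# 1st field = cell_name
# 2nd field = primary qmaster
# 3rd field = primary qmaster alias
# 4th field = secondary qmaster
# 5th field = secondary qmaster alias
user-network QM-A0 QM-A0.user.com NONE NONE
"""


def parse_resolve_file(text, default_cell):
    """Return the cell entries, raising ValueError on a format violation."""
    entries = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line == "":
            raise ValueError(f"line {number}: blank lines are not permitted")
        if line.startswith("#"):
            continue
        fields = line.split(" ")      # exactly one space between fields
        if len(fields) != 5 or "" in fields:
            raise ValueError(f"line {number}: need 5 single-spaced fields")
        cell, primary, alias, secondary, secondary_alias = fields
        if (secondary, secondary_alias) != ("NONE", "NONE"):
            raise ValueError(f"line {number}: alternate qmasters unsupported")
        entries.append({"cell": cell, "primary": primary, "alias": alias})
    # The first entry must name the cell given as DEFAULT_CELL in conf_file.
    if not entries or entries[0]["cell"] != default_cell:
        raise ValueError("first entry does not match DEFAULT_CELL")
    return entries


if __name__ == "__main__":
    print(parse_resolve_file(SAMPLE, "user-network"))
```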
<P>
<I><B><FONT SIZE=2>err_file</FONT></B></I><FONT SIZE=2> --- The
qmaster, dqs_execd and all DQS utilities may originate error messages
which are directed to a hard-wired filename &quot;err_file&quot;.
This name is created during the DQS &quot;config&quot; step and
implanted in the &quot;dqs.h&quot; include file in the ../DQS/SRC
directory. The installation process assumes that all DQS313 programs
will have write-access to the path name which appears as QMASTER_SPOOL_DIR
in the conf_file. If this path name is inappropriate for ALL DQS
programs the administrator may choose to change the  definition
of ERR_FILE in the include file &quot;dqs.h&quot;. This will require
recompilation of the entire DQS313 system.</FONT>
<P>
<FONT SIZE=2>As an alternative, the administrator may choose to
let each program write to its own &quot;err_file&quot; and gather
and collate all the files when it is necessary to examine error
information. In this case, however the path-name accessible by
each host must be identical to the QMASTER_SPOOL_DIR name.</FONT>
<P>
<I><B><FONT SIZE=2>log_file</FONT></B></I><FONT SIZE=2> --- The
qmaster, dqs_execd and all DQS utilities may originate log messages
which are directed to a hard-wired filename &quot;log_file&quot;.
This name is created during the DQS &quot;config&quot; step and
implanted in the &quot;dqs.h&quot; include file in the ../DQS/SRC
directory. The installation process assumes that all DQS313 programs
will have write-access to the path name which appears as QMASTER_SPOOL_DIR
in the conf_file. If this path name is inappropriate for ALL DQS
programs, the administrator may choose to change the definition
of this file name in the include file &quot;dqs.h&quot;. This will require
recompilation of the entire DQS313 system.</FONT>
<P>
<FONT SIZE=2>As an alternative, the administrator may choose to
let each program write to its own &quot;log_file&quot; and gather and collate all the
files when it is necessary to examine log information. In this
case, however, the path-name accessible by each host must be identical
to the QMASTER_SPOOL_DIR name.</FONT>
<H3>Qmaster</H3>
<P>
The qmaster directory contains a major sub-directory for each
qmaster registered in this cell. Each qmaster's directory contains
four sub-directories whose contents change constantly during DQS313
operation, and hence must permit write operations on all files.
There are also two files created by the qmaster, the pid_file
and stat_file. An additional, unwelcome file may appear here also.
In the event of a qmaster crash, its core file will be placed
in this directory.
<H4><FONT SIZE=2>common_dir</FONT></H4>
<P>
<FONT SIZE=2>This directory contains files common to the scheduling
and dispatching of jobs by the qmaster.</FONT>
<P>
<B><FONT SIZE=2>complex_file</FONT></B><FONT SIZE=2> --- This file
contains all of the definitions of complexes created by the add
complex command (qconf -ac).</FONT>
<P>
<B><FONT SIZE=2>consumables_file</FONT></B><FONT SIZE=2> --- This
file contains all of the definitions of consumable resources created
by the add consumable resource command (qconf -acons).</FONT>
<P>
<B><FONT SIZE=2>generic_queue</FONT></B><FONT SIZE=2> --- This file
is read by the qmaster each time the create queue command (qconf
-aq) is performed and no name is provided as a parameter following
the &quot;-aq&quot; option flag. The contents of this file form
the starting template presented in the editor for modification
by the administrator.</FONT>
<P>
<B><FONT SIZE=2>host_file</FONT></B><FONT SIZE=2> --- The host_file
is read at startup of the qmaster and contains a list of all
the hosts known to the qmaster (occasionally called &quot;trusted
hosts&quot;). Any program attempting to contact the qmaster must
have its host's name in this list or be rejected. On the initial
startup of the qmaster this file will not be present. The qmaster
will post a warning in the err_file and create the host_file.</FONT>
<P>
<B><FONT SIZE=2>man_file</FONT></B><FONT SIZE=2> --- This file contains
the login names of all individuals identified as cell &quot;managers&quot;.
A cell &quot;manager&quot; is given permission to access all DQS313
system files and to execute every option of every DQS313 utility.</FONT>
<P>
<B><FONT SIZE=2>op_file</FONT></B><FONT SIZE=2> --- This file
contains the login names of all individuals identified as cell
&quot;operators&quot;. A cell &quot;operator&quot; is given permission
to perform a number of system operations normally reserved to
the system manager, and prohibited to the standard system user.
The functions qdel, qmod, qmove, and qrls are permitted to operators.
Functions such as creating or deleting queues or adding and deleting
managers and operators are, of course, limited to cell managers.</FONT>
<P>
<B><FONT SIZE=2>seq_num_file</FONT></B><FONT SIZE=2> --- Jobs are
assigned an internal sequence number. The next number to be assigned
by the qmaster appears as a single binary value in this file.
It is thus not possible to manually reset sequence numbers, other
than to delete this file, forcing the numbering sequence to begin
over with &quot;1&quot;.<BR>
</FONT>
<P>
<B><FONT SIZE=2>acl_file</FONT></B><FONT SIZE=2> --- This file contains all of the access
control list &quot;acl&quot; names for all queues. This is actually
a list of lists. An &quot;acl&quot; is a list of names to be given
access to one or more queues. A queue definition can include these
individuals by naming the corresponding &quot;acl&quot; in its
&quot;user_acl&quot; parameter.</FONT>
<H4><FONT SIZE=2>job_dir</FONT></H4>
<P>
<FONT SIZE=2>This directory contains a file for each job currently
in the queuing system. Each file contains the submitted script
file along with tables and lists created by the qsub operation
and used to manage the job while it is in the queue awaiting assignment
to a host, as well as during actual job execution.</FONT>
<H4><FONT SIZE=2>queue_dir</FONT></H4>
<P>
<FONT SIZE=2>This directory contains a file for each queue. The
file name is, in fact, the name assigned to that queue. Each file
contains the queue configuration, encoded in binary form, along
with various tables which the queue manager utilizes to manage
the queues.</FONT>
<H4><FONT SIZE=2>tid_dir</FONT></H4>
<P>
<FONT SIZE=2>To maintain internal coherency during
system operation, in the face of multiple hosts executing multiple
processes, a unique identifier label is generated by the qmaster
and dqs_execd for every inter-host communication. This label
is called a &quot;task identifier&quot; or &quot;tid&quot;. An
empty file for each generated &quot;tid&quot; is created in this
directory. An acknowledgment by the receiving host for a transaction causes
the corresponding tid file to be deleted from this directory.
</FONT>
<P>
<FONT SIZE=2>In the event of aberrant behavior of a hardware or
DQS313 software element some &quot;orphan tid's&quot; may be found
in this directory. The administrator is cautioned NOT to
clear out tid files manually without careful analysis. This scheme
was created to ensure inter-host synchronization despite multiple
restarts of the qmaster or the dqs_execd.</FONT>
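<P>
<FONT SIZE=2>The tid bookkeeping described above can be pictured with a
small sketch. It is written in Python purely for illustration and is not
taken from the DQS sources; the function names and the use of the tid as
the file name are assumptions.</FONT>

```python
import os
import tempfile


def record_tid(tid_dir, tid):
    """Create an empty marker file for an outstanding transaction."""
    path = os.path.join(tid_dir, str(tid))
    with open(path, "w"):
        pass  # the file carries no data; its existence is the record
    return path


def acknowledge_tid(tid_dir, tid):
    """Remove the marker once the receiving host has acknowledged."""
    path = os.path.join(tid_dir, str(tid))
    if os.path.exists(path):
        os.remove(path)


def outstanding_tids(tid_dir):
    """Names still present correspond to unacknowledged transactions."""
    return sorted(os.listdir(tid_dir))
```

<P>
<FONT SIZE=2>In this picture an &quot;orphan tid&quot; is simply a name
that outstanding_tids keeps returning because no acknowledgment ever
arrived.</FONT>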
<P>
<I><B><FONT SIZE=2>pid_file</FONT></B></I><FONT SIZE=2>-This file
contains the process id of the running qmaster. This
is a &quot;canonical&quot; location where site procedures may
find this pid for system management actions.</FONT>
<P>
<I><FONT SIZE=2>Stat<U><I>_file</I></U></FONT></I><FONT SIZE=2>-Based
on the period defined as &quot;STAT_LOG_TIME&quot; the qmaster
records summary information about all the queues it is managing.
This data is time-stamped so that DQS managers might determine
when queue status changes occur inadvertently.</FONT>
<H3>dqs_execd</H3>
<P>
The dqs_execd directory contains major sub-directories for each
dqs_execd operating in this cell. Each dqs_execd directory contains
four sub-directories plus one file, the &quot;pid_file&quot; which
contains the process id of the dqs_execd. Of course there is also
the possibility of a core file being placed here in the event
of a dqs_execd crash.
<H4><FONT SIZE=2>exec_dir</FONT></H4>
<P>
<FONT SIZE=2>The exec_dir contains the actual job file for the
executing job. When the dqs_execd launches a job, the script file
is copied here and executed.</FONT>
<H4><FONT SIZE=2>job_dir</FONT></H4>
<P>
<FONT SIZE=2>The job_dir contains a file for each job which the
dqs_execd is managing (usually only  one). In addition to the
job's DQS script this file contains all the tables and information
necessary for the qmaster and the dqs_execd to manage this job.</FONT>
<H4><FONT SIZE=2>rusage_dir </FONT></H4>
<P>
<FONT SIZE=2>Upon job termination usage data  is collected and
formatted into a &quot;termination record&quot; to be sent to
the qmaster. This record is written to this directory and retained
until the qmaster has received and recorded this information.
The procedure is used to prevent vital data from being lost, particularly
from long-running jobs, in the event of an interruption of dqs_execd
or qmaster service.</FONT>
<H4><FONT SIZE=2>tid_dir</FONT></H4>
<P>
<FONT SIZE=2>To maintain internal coherency during
system operation, in the face of multiple hosts executing multiple
processes, a unique identifier label is generated by the qmaster
and dqs_execd for every inter-host communication. This label
is called a &quot;task identifier&quot; or &quot;tid&quot;. An
empty file for each generated &quot;tid&quot; is created in this
directory. An acknowledgment by the receiving host for a transaction causes
the corresponding tid file to be deleted from this directory.
</FONT>
<H3>Temporary Files</H3>
<P>
The dqs_execd creates and deletes a number of temporary files
in the &quot;/tmp&quot; directory of its host. These are deleted
after use, but if the dqs_execd has been shut down during job
launching and execution these files may be left in the &quot;/tmp&quot;
directory inadvertently. Since they are given unique names for
the job execution they will remain until removed by the system
manager.<BR>
<H2>The Queue Configuration</H2>
<P>
<FONT SIZE=2>The queue configuration was introduced during the
discussion of setting up an initial DQS313 cell and queue. The
queue configuration is the primary means of tailoring a DQS system
to a particular site's requirements. The queue configuration can
be changed dynamically by the DQS cell manager without requiring
a shutdown and restart of either the qmaster or dqs_execd, unlike
the more static &quot;conf_file&quot;. Changing the queue configuration
will not affect any jobs already in execution. The modified configuration
will be considered during the next scheduling pass of the qmaster
after the change has been completed. A description of each element
follows:</FONT>
<MENU>
<LI><FONT SIZE=2>Q_name            <U><I><B>QA1</B></I></U></FONT>
</MENU>
<P>
<FONT SIZE=2>Any ASCII string of numbers and letters may be used
in the queue name. It must be a unique queue name in a given
cell.</FONT>
<MENU>
<LI><FONT SIZE=2>hostname          <U><I><B>QA1_host</B></I></U></FONT>
</MENU>
<P>
<FONT SIZE=2>The hostname entered here may be any form of the
host's name which is used by the network members. DQS will convert
the entered name to the fully qualified host name and insert that
into the registered queue configuration.<BR>
</FONT>
<MENU>
<LI><FONT SIZE=2>seq_no               0</FONT>
</MENU>
<P>
<FONT SIZE=2>The seq_no is an arbitrary sequence number assigned
by the DQS administrator. It is ignored if the conf_file parameter
&quot;DEFAULT_SORT_SEQ_NO&quot; is set to FALSE. If &quot;DEFAULT_SORT_SEQ_NO&quot;
is set to TRUE the qmaster will scan the queue list in the order
of the sequence numbers starting with zero &quot;0&quot;.</FONT>
<P>
<FONT SIZE=2>The DQS administrator may choose one of several strategies
for assigning sequence numbers. At SCRI the lowest sequence number
is assigned to the most powerful computing engines, with less
powerful machines being assigned higher sequence numbers.</FONT>
<MENU>
<LI><FONT SIZE=2>load_masg         1</FONT>
</MENU>
<P>
<FONT SIZE=2>Each dqs_execd collects information about the state
of its host's overall computational and I/O load as reported by
the UNIX system through the &quot;rusage&quot; structure. A &quot;total
system load&quot; is provided as an integer value representing
the system load in hundredths. A value of 1 represents
a load of 0.01, a value of 10 represents a load of 0.10, and
a value of 100 represents a load of 1.0.</FONT>
<P>
<FONT SIZE=2>When DEFAULT_SORT_SEQ_NO is set FALSE the qmaster
attempts to assign jobs to the least loaded queues which meet
the resources requested by the job. The queues are sorted into
increasing order of the load average, weighted by multiplying
the reported load average by the &quot;massage factor&quot;
 (the load_masg value). The load_masg factor thus permits the
administrator to adjust the system wide relationships between
different hosts which may be necessitated by variations in usage
measurements or background task activity.</FONT>
<MENU>
<LI><FONT SIZE=2>load_alarm        175</FONT>
</MENU>
<P>
<FONT SIZE=2>A threshold value can be set beyond which a queue
will not be considered for scheduling by the qmaster. When a host
reports a load average greater than this threshold it is in an
&quot;ALARM&quot; state, and this flag is displayed in qstat output.
The default load_alarm represents a load average of 1.75.</FONT>
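<P>
<FONT SIZE=2>The interplay of seq_no, load_masg and load_alarm can be
sketched as follows. This is an illustrative Python model, not the code in
the qmaster; the Queue class and the flag sort_by_seq_no (a stand-in for
whichever ordering the conf_file selects) are assumptions made for the
example. Loads use the integer scale described above, where 100 means a
load average of 1.0.</FONT>

```python
from dataclasses import dataclass


@dataclass
class Queue:
    name: str
    seq_no: int
    load: int              # reported load, 100 == load average 1.00
    load_masg: int = 1     # "massage factor" applied to the load
    load_alarm: int = 175  # loads above this put the queue in ALARM


def order_queues(queues, sort_by_seq_no):
    """Drop queues in ALARM state, then order the remainder."""
    usable = [q for q in queues if q.load <= q.load_alarm]
    if sort_by_seq_no:
        return sorted(usable, key=lambda q: q.seq_no)
    # Otherwise the least loaded queue (after weighting) comes first.
    return sorted(usable, key=lambda q: q.load * q.load_masg)
```

<P>
<FONT SIZE=2>Raising load_masg on one host makes that host look busier
than its raw load suggests, which pushes it further down the list.</FONT>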
<MENU>
<LI><FONT SIZE=2>priority               0</FONT>
</MENU>
<P>
<FONT SIZE=2>This field may be confusing at this point because
jobs also possess a submission priority. The difference is that
the job priority determines only how it is ordered among other
jobs in competition for system resources. The job submission priority
has no influence on the UNIX priority with which that job is executed.</FONT>
<P>
<FONT SIZE=2>The queue priority field here IS the UNIX priority
assigned to any job executed in this queue and thus may range
from -19 (low) to +19(high).</FONT>
<MENU>
<LI><FONT SIZE=2>type                   batch</FONT>
</MENU>
<P>
<FONT SIZE=2>DQS was designed to support the scheduling and management
of batch and interactive jobs. DQS313 supports only batch queues.
This parameter is ignored.</FONT>
<MENU>
<LI><FONT SIZE=2>rerun                 FALSE</FONT>
</MENU>
<P>
<FONT SIZE=2>Automatic job rerun is not enabled in DQS313; this
field is ignored.<BR>
</FONT>
<MENU>
<LI><FONT SIZE=2>quantity           1</FONT>
</MENU>
<P>
<FONT SIZE=2>A DQS313 queue can manage more than one job in execution
at a time, though this is usually not a practical way to operate
a single cpu host.</FONT>
<MENU>
<LI><FONT SIZE=2>tmpdir               /tmp</FONT>
</MENU>
<P>
<FONT SIZE=2>During job startup and execution several temporary
files are created. This parameter should be the fully qualified
path name to the host's temporary directory.</FONT>
<MENU>
<LI><FONT SIZE=2>shell              /bin/csh</FONT>
</MENU>
<P>
<FONT SIZE=2>The default shell for executing jobs in this queue.
This default can be overridden by commands in the job script.
</FONT>
<MENU>
<LI><FONT SIZE=2>klog               /usr/local/bin/klog</FONT>
</MENU>
<P>
<FONT SIZE=2>The path name to the AFS klog executable.</FONT>
<MENU>
<LI><FONT SIZE=2>reauth_time       6000</FONT>
</MENU>
<P>
<FONT SIZE=2>The time period in milliseconds for performing an
AFS re-authentication of the executing job. </FONT>
<MENU>
<LI><FONT SIZE=2>last_user_delay   0</FONT>
</MENU>
<P>
<FONT SIZE=2>To prevent a single user from dominating the utilization
of a queue the administrator can set this time-out  value (seconds)
 during which a user's job will not be considered for scheduling
following termination of a previous job for that user.</FONT>
<MENU>
<LI><FONT SIZE=2>max_user_jobs     4</FONT>
</MENU>
<P>
<FONT SIZE=2>This is the second system parameter available for
implementing scheduling policies for DQS313 at a site. The MAXUJOBS
parameter in the conf_file limits the total number of jobs a user
can have considered for scheduling across the entire system. The
queue configuration &quot;max_user_jobs&quot;  establishes a limit
on the number of jobs a user can have queued which will be considered
for scheduling for this queue. See &quot;SCHEDULING&quot; for
a more complete discussion of this topic.</FONT>
<MENU>
<LI><FONT SIZE=2>notify             60</FONT>
</MENU>
<P>
<FONT SIZE=2>A user job may invoke the &quot;-notify&quot; option
which instructs the system to send the job a SIGUSR1 or SIGUSR2
signal as a warning in advance of a SIGSTOP or SIGTERM signal.
This &quot;notify&quot; parameter in the queue configuration establishes
the number of seconds between sending the warning signal and the
SIGTERM or SIGSTOP.</FONT>
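<P>
<FONT SIZE=2>A job that wants to make use of the warning period has to
catch the warning signals itself. The following Python sketch shows one way
a job process might do so on a POSIX host; the handler and list names are
hypothetical, and the text above does not say which of the two signals
precedes which action, so both are handled alike here.</FONT>

```python
import signal

warned = []  # signals received so far, newest last


def on_warning(signum, frame):
    # A real job would checkpoint or clean up here, within the
    # number of seconds allowed by the queue's notify parameter.
    warned.append(signum)


def install_warning_handlers():
    """Arrange for SIGUSR1 and SIGUSR2 to call on_warning."""
    signal.signal(signal.SIGUSR1, on_warning)
    signal.signal(signal.SIGUSR2, on_warning)
```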
<MENU>
<LI><FONT SIZE=2>owner_list        NONE</FONT>
</MENU>
<P>
<FONT SIZE=2>In addition to the DQS  manager and DQS operator
an individual can be designated a queue owner. A queue owner can
perform many system management tasks permitted to the managers
and operators but limited to this queue. Job deletion, queue suspension,
enabling and disabling are among those actions. One or more login
names can be entered for this parameter.</FONT>
<MENU>
<LI><FONT SIZE=2>user_acl          NONE</FONT>
</MENU>
<P>
<FONT SIZE=2>The administrator can create one or more access lists
using the &quot;qconf -au&quot; command. This command adds one
or more  users to a  named list. (This named list will be created
if it doesn't exist.) These named lists (of names) can be used
to include or exclude groups of users in access to a specific
queue. This queue configuration parameter &quot;user_acl&quot;
can contain a list of one or more acl_list names which will be
permitted to use the queue. (That is the parameter can itself
be a list of names of lists of names&#133; confused ?).<BR>
<BR>
<BR>
</FONT>
<MENU>
<LI><FONT SIZE=2>xuser_acl         NONE</FONT>
</MENU>
<P>
<FONT SIZE=2>The administrator can create one or more access lists
using the &quot;qconf -au&quot; command. That command adds one
or more  users to a  named list. (This named list will be created
if it doesn't exist.) These named lists (of names) can be used
to include or exclude groups of users in access to a specific
queue. This queue configuration parameter &quot;xuser_acl&quot;
can contain a list of one or more acl_list names which will be
excluded from access to the queue. </FONT>
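<P>
<FONT SIZE=2>Since both parameters are lists of names of lists of names,
a small model may help. The Python sketch below is an illustration only.
It assumes that an exclusion takes precedence over an inclusion and that
an empty user_acl (NONE) leaves the queue open to everyone; neither
assumption is confirmed by this manual, and the function name is
hypothetical.</FONT>

```python
def may_use_queue(user, user_acl, xuser_acl, acl_lists):
    """Decide whether a login name may use a queue.

    user_acl and xuser_acl are lists of acl names; acl_lists maps
    each acl name to the set of login names it contains.
    """
    def listed(acl_names):
        return any(user in acl_lists.get(name, set()) for name in acl_names)

    if listed(xuser_acl):
        return False  # assumption: exclusion wins over inclusion
    if not user_acl:
        return True   # assumption: NONE means unrestricted
    return listed(user_acl)
```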
<MENU>
<LI><FONT SIZE=2>subordinate_list  NONE</FONT>
</MENU>
<P>
<FONT SIZE=2>One or more DQS313 queues can be subordinated to
another queue. The queue specifying a list of subordinates with
this parameter is called the &quot;superior queue&quot;. A &quot;superior
queue&quot; can NOT be a subordinate queue to another. A queue
can only be subordinated to one other queue.  The &quot;subordinate_list&quot;
parameter can contain a list of one or more queue names in the
same cell as the queue defining this parameter. </FONT>
<P>
<FONT SIZE=2>Superior queues are analyzed for scheduling in the
same manner as all queues. If a job is assigned to a superior
queue, the qmaster will suspend the execution of jobs in all of
the queues in the superior queue's subordinate list. </FONT>
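<P>
<FONT SIZE=2>The two restrictions on subordination, and the effect of
assigning a job to a superior queue, can be expressed in a few lines. This
Python sketch is illustrative only; the dictionary layout and function
names are assumptions, not DQS data structures.</FONT>

```python
def check_subordination(subordinate_lists):
    """Return rule violations in a {superior: [subordinates]} mapping."""
    problems = []
    owner = {}  # subordinate queue -> the superior that first claimed it
    for superior, subs in subordinate_lists.items():
        for sub in subs:
            if sub in owner and owner[sub] != superior:
                problems.append(
                    f"{sub} is subordinate to both {owner[sub]} and {superior}")
            owner.setdefault(sub, superior)
    for superior, subs in subordinate_lists.items():
        if subs and superior in owner:
            problems.append(
                f"{superior} is both a superior and a subordinate queue")
    return problems


def queues_to_suspend(assigned_queue, subordinate_lists):
    """Queues whose jobs are suspended when a job lands on assigned_queue."""
    return list(subordinate_lists.get(assigned_queue, []))
```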
<MENU>
<LI><FONT SIZE=2>complex_list      NONE</FONT>
</MENU>
<P>
<FONT SIZE=2>This parameter can contain one or more names of complexes
defined by the &quot;add complex&quot; function of the qconf command
(qconf -ac). See &quot;Complexes and Consumables&quot;. Any complex
name can be preceded by the DQS reserved word &quot;REQUIRED&quot;
(must be all caps). This indicates that no job will be scheduled
for this queue UNLESS it requests a resource described in that
complex.</FONT>
<MENU>
<LI><FONT SIZE=2>consumables       NONE</FONT>
</MENU>
<P>
<FONT SIZE=2>This parameter can contain one or more names of consumable
resources defined by the &quot;add consumable &quot; function
of the qconf command ( qconf -acons). See &quot;Complexes and
Consumables&quot;. Any consumable name can be preceded by the
DQS reserved word &quot;REQUIRED&quot; (must be all caps). This
indicates that no job will be scheduled for this queue UNLESS
it requests a resource described in that consumable.<BR>
</FONT>
<MENU>
<LI><FONT SIZE=2>s_rt               7fffffff</FONT>
<LI><FONT SIZE=2>h_rt               7fffffff</FONT>
<LI><FONT SIZE=2>s_cpu             7fffffff</FONT>
<LI><FONT SIZE=2>h_cpu             7fffffff</FONT>
<LI><FONT SIZE=2>s_fsize           7fffffff</FONT>
<LI><FONT SIZE=2>h_fsize           7fffffff</FONT>
<LI><FONT SIZE=2>s_data            7fffffff</FONT>
<LI><FONT SIZE=2>h_data            7fffffff</FONT>
<LI><FONT SIZE=2>s_stack           7fffffff</FONT>
<LI><FONT SIZE=2>h_stack           7fffffff</FONT>
<LI><FONT SIZE=2>s_core            7fffffff</FONT>
<LI><FONT SIZE=2>h_core            7fffffff</FONT>
<LI><FONT SIZE=2>s_rss             7fffffff</FONT>
<LI><FONT SIZE=2>h_rss             7fffffff</FONT>
<LI><FONT SIZE=2>These parameters establish the &quot;hard&quot;
or &quot;soft&quot; limitations on a host's resource utilization
by a job executing under control of this queue. The &quot;hard&quot;
limits are transferred to the job's execution environment in the
hopes that the host operating system provides support for these
limits. Note, however, that if a host does support these limits
they apply only on a process-by-process basis!! If a job script
contains multiple invocations of processes, as in a FORTRAN  compilation
and execution, the limits apply to each individual step in the
job.</FONT>
</MENU>
<P>
<FONT SIZE=2><BR>
</FONT>
<MENU>
<LI><FONT SIZE=2>DQS313 does check the &quot;soft&quot; and &quot;hard&quot;
real-time limits (s_rt &amp; h_rt) and will terminate jobs based
on the values of those parameters. A job exceeding the &quot;soft&quot;
real-time limit is sent a SIGTERM signal which can be intercepted
by the job using the &quot;-notify&quot; option in the job script.
If the job exceeds the &quot;hard&quot; real-time limits it is
sent a SIGKILL signal which cannot be caught by the user job.</FONT>
</MENU>
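<P>
<FONT SIZE=2>The real-time limit check can be summarized by the sketch
below. It is a Python illustration, not DQS code. It assumes that s_rt and
h_rt are expressed in seconds and that the default value 7fffffff is a
hexadecimal number meaning, in effect, &quot;no limit&quot;; the function
name is hypothetical.</FONT>

```python
import signal

UNLIMITED = 0x7fffffff  # assumption: the default shown above, read as hex


def signal_for_elapsed(elapsed, s_rt, h_rt):
    """Return the signal due for a job that has run `elapsed` seconds."""
    if elapsed > h_rt:
        return signal.SIGKILL  # cannot be caught by the job
    if elapsed > s_rt:
        return signal.SIGTERM  # can be intercepted by the job
    return None
```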
<P>
<FONT SIZE=2 FACE="Arial Black"></FONT>
<P>
<FONT SIZE=2 FACE="Arial Black">Complexes &amp; Consumables</FONT>
<P>
<FONT SIZE=2>The most valuable aspect of DQS, and easily its most
confusing property is the ability to define and utilize a variety
of system &quot;resources&quot; which can then be requested in
a user's DQS job script. These resource requests are used to differentiate
and assign jobs to the variety of system capabilities found in
today's heterogeneous computing environments. Let us look at an
example of how and why resource definitions are created at a site.
The diagram shows five DQS hosts with different capabilities.
</FONT>
<P>
<FONT SIZE=2>Many users will  have created an application compiled
for one machine architecture, say AIX. In the pictured environment
the user could run their application on one of the AIX  machines
by specifying the queue name, say QN1. The negative aspect of
this simple approach is that the job may be kept waiting for QN1
because of a previous job on that machine while either QN4 or
QN5 might be available.<BR>
</FONT>
<P>
<IMG SRC="IMG00022.GIF">
<P>
<FONT SIZE=2>The solution for this situation is to create a resource
definition for all AIX machines in the cell and name it &quot;AIX1&quot;.
 Then the user can submit a job using the qsub command with the
&quot;-l&quot; option. The steps needed to accomplish
this are:</FONT>
<OL>
<LI><FONT SIZE=2>The DQS administrator creates a resource definition
called a &quot;complex&quot;. A complex is created by typing &quot;qconf -ac
AIX1&quot; (create a complex named AIX1).</FONT>
<LI><FONT SIZE=2>The default text editor is started and an empty
page displayed. The administrator enters an arbitrary string such
as &quot;our_AIX&quot;, then saves the results and closes the editor.</FONT>
<LI><FONT SIZE=2>Now that we have a complex defined (AIX1) we
can add that complex to a queue definition.</FONT>
<LI><FONT SIZE=2>Assuming that the queue has already been defined
we will modify it using the qconf command. Typing &quot;qconf
-mq QN1&quot; opens up another editor window with the complete
queue definition displayed.</FONT>
<LI><FONT SIZE=2>Replace the parameter entry for &quot;complex_list&quot;,
changing NONE to AIX1 (the name given to the complex definition,
NOT the contents of that definition).</FONT>
<LI><FONT SIZE=2>In the same manner add the complex name AIX1
to the queues QN4 and QN5.</FONT>
<LI><FONT SIZE=2>Advertise  the resource name  &quot;our_AIX&quot;
 to all users.</FONT>
<LI><FONT SIZE=2>A user can then direct their jobs to any one
of the AIX machines by including the resource request &quot;-l
our_AIX&quot; in their DQS job script.</FONT>
</OL>
<P>
<FONT SIZE=2>This simple example illustrates two key points. </FONT>
<OL>
<LI><FONT SIZE=2>The complex name is used by the administrator
to assist in designing and managing collections of resources and
queues. The complex name IS NOT USED by the user in resource requests.</FONT>
<LI><FONT SIZE=2>Resource requests in job submissions use the
descriptions within one or more complex definitions.</FONT>
</OL>
<P>
<FONT SIZE=2>Let us expand the example slightly and create a new
complex which cuts across machine architecture features, but shares
a different attribute:</FONT>
<OL>
<LI><FONT SIZE=2>Create a complex for systems supporting PVM by
typing &quot;qconf -ac PVM1&quot;.</FONT>
<LI><FONT SIZE=2>When the editor window opens enter a single line
&quot;our_PVM&quot;</FONT>
<LI><FONT SIZE=2>Save the file and close the editor and advertise
the resource name &quot;our_PVM&quot; to the users.</FONT>
<LI><FONT SIZE=2>Add the complex name PVM1 to the complex_list
parameter of queues QN2 and QN4. </FONT>
<LI><FONT SIZE=2>A user wishing to submit a job to a queue which
is running on an AIX machine which provides PVM support would
use a resource request &quot;-l our_AIX.and.our_PVM&quot;</FONT>
</OL>
<P>
<FONT SIZE=2>So far the sample resource definitions have been
a single string such as &quot;our_AIX&quot; or &quot;our_PVM&quot;.
 We could have used an alternative form for describing alternatives
such as AIX versus HPUX. This form would replace the
strings we entered in the complex files with arch=our_AIX and arch=our_HPUX.
The string &quot;arch&quot; is one created by the administrator
and could be any arbitrary name. A resource request would then
have the form &quot;-l arch=our_AIX&quot; or &quot;-l arch=our_HPUX&quot;.<BR>
</FONT>
<P>
<FONT SIZE=2>Resource definitions can contain numeric values and
the corresponding resource requests can perform numeric comparison
on these values to satisfy a criterion. A complex called BigMemory
could be defined containing the line &quot;mem=128&quot;. For
our example let QN1 and QN2 both be operating on hosts which have
128 megabytes each. The complex BigMemory would be added to
QN1 and QN2. A request for an AIX machine with at least 64 megabytes
of memory might be stated as &quot;-l our_AIX.and.mem.ge.64&quot;.</FONT>
<P>
<FONT SIZE=2>Resource definitions can possess more than the single
line examples in each named complex. A complex definition named
&quot;BIG_HUMMER&quot; might look like:</FONT>
<MENU>
<LI><FONT SIZE=2>AIX414</FONT>
<LI><FONT SIZE=2>mem=1028</FONT>
<LI><FONT SIZE=2>Horsepower=10</FONT>
<LI><FONT SIZE=2>IO_bandwidth=250</FONT>
</MENU>
<P>
<FONT SIZE=2></FONT>
<P>
<FONT SIZE=2>A resource request which needs a BIG_HUMMER host
would, in this case look like:</FONT>
<P>
<FONT SIZE=2>&quot;-l AIX414.and.mem.ge.1028.and.Horsepower.ge.10.and.IO_bandwidth.ge.250&quot;</FONT>
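<P>
<FONT SIZE=2>To make the matching of requests against complex definitions
concrete, here is a minimal Python sketch. It is not the DQS resource
matcher. It handles only the three forms used in the examples above (a
bare name, name=value, and name.ge.value joined by .and.), and the function
names are hypothetical.</FONT>

```python
def parse_complex(lines):
    """Turn complex definition lines into {name: value}.

    Bare names such as "AIX414" map to None.
    """
    resources = {}
    for line in lines:
        name, sep, value = line.strip().partition("=")
        resources[name] = value if sep else None
    return resources


def term_matches(term, resources):
    """Check one term of a request against the defined resources."""
    if ".ge." in term:
        name, _, wanted = term.partition(".ge.")
        have = resources.get(name)
        return have is not None and float(have) >= float(wanted)
    if "=" in term:
        name, _, wanted = term.partition("=")
        return resources.get(name) == wanted
    return term in resources


def request_matches(request, resources):
    """A request matches when every .and.-joined term matches."""
    return all(term_matches(t, resources) for t in request.split(".and."))
```

<P>
<FONT SIZE=2>Note that the request names the strings inside the complex
definition, never the name of the complex itself.</FONT>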
<P>
<FONT SIZE=2>There is one type of resource we have singled out
for special handling in DQS313. These are resources which are
not static during the operation of a DQS cell. While machine horsepower,
memory size, operating systems and compilers remain fixed for long periods
of time (on the order of days or weeks), shared memory multiprocessor
cpus will have varying amounts of shared memory available to them
as different jobs are executed on their other cpus. An increasingly
common resource situation is &quot;licensed software&quot; such
as compilers and data-base management systems. In many cases there
are fewer licenses available within a system than there are hosts
to execute the software. </FONT>
<P>
<FONT SIZE=2>This type of resource is called a &quot;consumable&quot;
in DQS313.  The definition of a consumable resource is somewhat
different than a DQS &quot;complex&quot;, in that the administrator
will describe the total amount of a resource which is available
in a system, and the amount of that resource consumed by a satisfied
resource request. In the case of a FORTRAN compiler license, a
site usually purchases a number of licenses for their system which
are managed by a &quot;license server&quot;. The consumable resource
manager in DQS313 does not supplant a license server nor can it
effectively mimic such a server. Instead it provides a mechanism
parallel to the license server which attempts to keep track of
how many licenses are in use by DQS clients.</FONT>
<P>
<FONT SIZE=2>The administrator  defines a consumable resource
by executing the command &quot;qconf -acons FORTRAN&quot; (using
the compiler as an example). The default editor will open a window
with the following template:</FONT>
<MENU>
<LI><FONT SIZE=2>Consumable         xlf</FONT>
<LI><FONT SIZE=2>Available = &lt;the amount of resources available&gt;</FONT>
<LI><FONT SIZE=2>Consume_by &lt;quantum by which resource is reduced by a request&gt;</FONT>
<LI><FONT SIZE=2>Current = &lt;currently available resources&gt;</FONT>
</MENU>
<P>
<FONT SIZE=2>The field for Available should be filled in with
the number of FORTRAN licenses authorized to this system. The
Consume_by will be 1 for software such as compilers. The Current
field will usually be equal to the Available field, unless there
are several licenses in use at the time this Consumable is being
defined. The Current field is also used to reset the DQS313 consumable
counter when DQS313 gets out of sync with  the actual license
manager.</FONT>
<P>
<FONT SIZE=2>Queues which must manage this consumable resource
should then have the consumable name added to the consumables
parameter list in the queue configuration. The user need not be
aware of the distinction between standard complexes and consumables.
Their resource requests are stated in the same way: &quot;-l our_AIX.and.mem.ge.64.and.xlf&quot;.
The qmaster will determine if an xlf license is available by examining
its internal counters (which may NOT match the license server's).
If the license and other resources are available the job will
be launched. At the time the job is started the consumable count
for the FORTRAN resource will be decremented. </FONT>
<P>
<FONT SIZE=2>Upon job termination this resource count will be
incremented. Obviously this is not a satisfactory situation for
a user who wishes to submit a job which does a quick FORTRAN compile
to produce an executable which then runs as a week long
job. The consumable count would remain decremented for the duration
of the job while the license manager will have had the license
&quot;token&quot; returned at the conclusion of the compilation.</FONT>
<P>
<FONT SIZE=2>For this situation the cooperation of the user is
required, to avoid breaking up jobs into compile-only and compute-only
separate jobs. The &quot;qalter&quot; command has been modified
to permit any user to execute the &quot;qalter&quot; command,
but only with the &quot;-rc&quot; (return consumable) option.
The user job would then have a script file which might look like:</FONT>
<MENU>
<LI><FONT SIZE=2>#!/bin/csh</FONT>
<LI><FONT SIZE=2>#$ -l xlf.and.our_AIX</FONT>
<LI><FONT SIZE=2>xlf my myprogram</FONT>
<LI><FONT SIZE=2>qalter -rc xlf 1</FONT>
<LI><FONT SIZE=2>myprogram mydata</FONT>
</MENU>
<P>
<FONT SIZE=2></FONT>
<P>
<FONT SIZE=2>The qalter command here specifies the name of the
resource being returned followed by the quantity being returned.
When resources such as high performance disk or shared memory
are being defined as a consumable resource often a &quot;quantum&quot;
of the resource is granted and recovered. An example might be
that a UNIX page is the minimum quantum or an integral number of
pages could be the &quot;quantum&quot;.  Where licenses are normally
doled out one at a time, memory might be allocated 1 MB at a time.
Hence the Consume_by field in the consumable definition.<BR>
</FONT>
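<P>
<FONT SIZE=2>The consumable counter can be modelled in a few lines. The
Python sketch below is an illustration of the Available, Consume_by and
Current fields, not the qmaster's implementation. The class and method
names are hypothetical, and clamping the counter at Available when
resources are returned is an assumption.</FONT>

```python
class Consumable:
    def __init__(self, name, available, consume_by=1, current=None):
        self.name = name
        self.available = available    # total amount in the system
        self.consume_by = consume_by  # amount taken by one request
        self.current = available if current is None else current

    def try_consume(self):
        """Called when a job requesting this resource is about to start."""
        if self.current < self.consume_by:
            return False  # nothing left; the job must wait
        self.current -= self.consume_by
        return True

    def give_back(self, quantity):
        """Called at job end, or earlier for a 'qalter -rc' style return."""
        # Assumption: the counter never exceeds the configured total.
        self.current = min(self.available, self.current + quantity)
```

<P>
<FONT SIZE=2>With two xlf licenses, two jobs may start, a third must wait,
and an early return by one of the running jobs lets the third
proceed.</FONT>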
<H3>REQUIRED Complexes and Consumables</H3>
<P>
A job submission may contain one or more resource requests (the
&quot;-l&quot; option). A job with no specific resource requests
is thus a candidate for assignment to any available queue. In
many installations some queues are best utilized by very specific
job configurations. An example might be a site which possesses
a heterogeneous collection of cpus with very wide differences
in computing capacity. The more robust computers should not be
assigned to &quot;tiny&quot; but persistent jobs in some cases.
DQS 3.1.3 provides a special keyword, &quot;REQUIRED&quot;, which
can precede any complex or consumable which a user MUST request
in order for that job to be considered for scheduling on that
queue.<BR>
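<P>
The effect of the REQUIRED keyword on queue eligibility can be sketched as
follows. This is an illustrative Python model only; the function name and
the representation of a queue's complexes as (resource names, required)
pairs are assumptions.

```python
def passes_required(requested, queue_complexes):
    """Check the REQUIRED rule for one queue.

    requested is the set of resource names the job asked for.
    queue_complexes is a list of (resource_names, required) pairs,
    one per complex or consumable attached to the queue.
    """
    for resource_names, required in queue_complexes:
        if required and not (requested & set(resource_names)):
            return False  # job did not ask for a REQUIRED resource
    return True
```

<P>
A job with no resource requests at all is therefore kept away from any
queue that marks one of its complexes as REQUIRED.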
<H2>Job Scheduling</H2>
<P>
<FONT SIZE=2>The crux of any resource allocation and management
system is its ability to provide resources in an &quot;efficient&quot;
and &quot;fair&quot; manner. &quot;Efficiency&quot; is usually
measured in terms of maximizing job throughput and effective utilization
of the available resources. &quot;Efficiency&quot; can be quantified
in ways usually referred to the hardware hosts in a system. &quot;Fairness&quot;
is less easily described, is often measured by perceptions and
is most often referred to the human users of a system. Further,
priorities for efficiency and fairness and their relative values
can vary widely from site to site. The burden of meeting these
objectives falls upon the system job scheduling mechanism.</FONT>
<P>
<FONT SIZE=2>Forty years of experience with attempts at creating
comprehensive job scheduling algorithms have demonstrated several
points:</FONT>
<OL>
<LI><FONT SIZE=2>It is virtually impossible to produce a &quot;one
size fits all&quot; algorithm which will satisfy the demands for
efficiency plus fairness at every site.</FONT>
<LI><FONT SIZE=2>Scheduling systems which attempt to provide a
'flexible' software solution do so by offering to the administrator
numerous parameters for adjusting the methods used for allocating
resources. The plethora  of variables presented is ultimately
confusing if not confounding.</FONT>
<LI><FONT SIZE=2>Most sites with complex requirements and knowledgeable
support personnel end up writing their own scheduling code or
modifying the code provided with the system.</FONT>
</OL>
<P>
<FONT SIZE=2>DQS313 therefore attempts to provide only a minimal
amount of job scheduling technology. Hopefully small sites will
be able to achieve a good level of balance in host usage and perceived
&quot;fairness&quot; with the system as it is delivered. As a
site develops experience with batch job management the staff will
experiment with the few parameters provided in DQS. At some point
the administrator will want to probe the module dqs_schedule.c
, adding or subtracting from its capabilities. To that end we
will describe the basic features of DQS scheduling and try to
illuminate the routines most likely to be modified.</FONT>
<P>
<FONT SIZE=2>A user job passes through two screening processes
before being considered by the qmaster for scheduling:</FONT>
<P>
<FONT SIZE=2>1. At the time of job submission a user job is checked
to see if it meets two system criteria:  </FONT>
<OL TYPE=a>
<LI><FONT SIZE=2>Are resources present in the system which meet
the requirements specified for the job (usually through the &quot;-l&quot;
parameter in a qsub script)?</FONT>
<LI><FONT SIZE=2>Is this user under the maximum threshold established
for using system resources?</FONT>
</OL>
<P>
<FONT SIZE=2>If a job fails these tests it is rejected at the
time of submission and an error message returned to the user submitting
the job. (In the event that a job is submitted in anticipation
of resources being added to the system, such as a new host architecture,
the user can choose to override the first test by using the &quot;force&quot;
option (&quot;-F&quot;) in the qsub command.)</FONT>
<P>
<FONT SIZE=2>2. Once a user job has been accepted into the system
it will be placed into the qmaster's job list where it will remain
until it has been executed or deleted. If a job's submission exceeds
the MAXUJOBS limit placed in the conf_file, it will remain in
the queue BUT it will not be considered during scheduling passes
by the qmaster.<BR>
</FONT>
<P>
<FONT SIZE=2>The qmaster conducts an examination (or &quot;pass&quot;)
over the job list :</FONT>
<OL>
<LI><FONT SIZE=2>Every time a job is added to the list</FONT>
<LI><FONT SIZE=2>Every time a job terminates</FONT>
<LI><FONT SIZE=2>If neither of these events occurs, the qmaster
will scan the list on a periodic basis based on the number of
seconds in the &quot;SCHEDULE_TIME&quot; parameter in the conf_file.</FONT>
</OL>
<P>
<FONT SIZE=2></FONT>
<P>
<FONT SIZE=2>The scanning process consists of sorting the jobs
according to their submitted priority (&quot;-p&quot; option),
then by an internally generated &quot;subpriority&quot; and finally
by the job sequence number (establishing its submission order).
After the jobs are sorted they are examined in order, testing
each available queue (ordered by load average or sequence
number) looking for the first one which matches the resources
requested by that job. If a match is found the job is dispatched
and the next job is examined.</FONT>
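<P>
<FONT SIZE=2>One scheduling pass, as just described, can be sketched in
Python. This is not the code in dqs_schedule.c. It assumes that a larger
submitted priority sorts first and a smaller subpriority sorts first, and
for simplicity that each queue accepts one job per pass; the function and
field names are hypothetical.</FONT>

```python
def schedule_pass(jobs, queues, matches):
    """Run one pass over the job list.

    jobs are dicts with "priority", "subpriority" and "seq" keys.
    queues is a list of queue names, already in scan order.
    matches(job, queue) reports whether the queue satisfies the job.
    Returns a list of (job sequence number, queue name) dispatches.
    """
    ordered = sorted(
        jobs, key=lambda j: (-j["priority"], j["subpriority"], j["seq"]))
    free = list(queues)
    dispatched = []
    for job in ordered:
        for q in free:
            if matches(job, q):
                dispatched.append((job["seq"], q))
                free.remove(q)  # assumption: one job per queue per pass
                break           # first matching queue wins
    return dispatched
```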
<P>
<FONT SIZE=2>Manipulation of a job's subpriority before the sorting
step is the easiest way to affect the basic scheduling algorithm.
In DQS313 this simply consists of increasing the subpriority field
of a job based on the number of previously submitted jobs (at
the same priority level) for that user. Thus two or more users
with several jobs queued at the same priority and for the same
system resource will have their jobs interleaved, so that no one
user can dominate a resource by submitting a large quantity of
jobs.</FONT>
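<P>
<FONT SIZE=2>The interleaving effect can be illustrated with a small
Python sketch. It assumes, hypothetically, that the subpriority is simply
the count of jobs the same user already has queued at the same priority
level and that a smaller subpriority is examined first; the actual DQS
computation may differ in detail.</FONT>

```python
from collections import defaultdict


def assign_subpriorities(submissions):
    """submissions: list of (seq, user, priority) tuples in submission order.

    Returns {seq: subpriority}, where the subpriority is the number of jobs
    the same user already has queued at the same priority level.
    """
    seen = defaultdict(int)
    subpriority = {}
    for seq, user, priority in submissions:
        subpriority[seq] = seen[(user, priority)]
        seen[(user, priority)] += 1
    return subpriority


def scan_order(submissions):
    """Order jobs by priority, then subpriority, then sequence number."""
    sub = assign_subpriorities(submissions)
    ranked = sorted(submissions, key=lambda s: (-s[2], sub[s[0]], s[0]))
    return [(seq, user) for seq, user, _ in ranked]
```

<P>
<FONT SIZE=2>With three jobs from one user followed by two from another,
all at the same priority, scan_order alternates between the two users
rather than running the first user's jobs back to back.</FONT>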
<P>
<FONT SIZE=2>The system administrator will probably experiment
with this subpriority computation as a first step in customizing
DQS. Flirting with the resource matching is considered to be a
riskier affair, as the side effects of such changes are harder
to predict or detect.<BR>
</FONT>
<H2>AFS Operation</H2>
<P>
<FONT SIZE=2>DQS313 provides a minimal AFS support capability.
The introduction of the &quot;process shepherd&quot; has made
the job re-authentication  in DQS conform to AFS security requirements.
The output file handling feature addresses the 'cross platform'
security problems of dealing with stdout and stderr. </FONT>
<H2>Multi-Cell Operation</H2>
<P>
<FONT SIZE=2>A limited multi-cell operation capability is provided
in DQS313. Jobs may be moved from cell to cell if they are not
yet in execution, and users authenticated in one cell can view
the status of the queues in another cell.</FONT>
<H2>Accounting</H2>
<P>
<FONT SIZE=2>Site accounting methods vary as widely as any aspect
of a batch processing system. DQS313 records as much information
as possible about a job's scheduling and execution in a single
ASCII line of text. These entries are preceded by an ASCII string
of the standard UNIX GMT time of the entry. </FONT>
<P>
<FONT SIZE=2>Extraction of the accounting information simply requires
using a structure definition for the act_file entries in one's
&quot;C&quot; extraction program. An example of this technique
may be found in the program acte.c in the ../DQS/tools
directory. Also included in the tools directory is a script &quot;dostats&quot;
which employs acte to create a series of system summary files
for the administrator.</FONT>
<H2>System Management</H2>
<P>
<FONT SIZE=2>The process of DQS system management first consists
of laying out the physical and logical structure of a cell. The
physical organization is described by adding hosts and assigning
them to queues. The logical organization consists of defining
resource &quot;complexes&quot; and consumable resources and assigning
these to their appropriate queue hosts. Finally setting system
parameters in the conf_file and each queue configuration establishes
the operating environment for DQS operation.</FONT>
<P>
<FONT SIZE=2>The ongoing management steps should include:</FONT>
<OL>
<LI><FONT SIZE=2>Review of the queue status information to spot
queues in UNKNOWN or ALARM state. (DQS313 will send email to the
administrator whenever possible, but a sudden crash of a daemon
may only be detected from the qstat command display.)</FONT>
<LI><FONT SIZE=2>Regular review of the err_file, log_file, stat_file
and act_file looking for operational anomalies. Some will be obvious,
such as dqs_execd's which have vanished or been restarted. One
key thing to look for is a sequence of jobs aborting on the same
host (a potential problem with DQS or the host) or a sequence
of jobs aborting for the same user (which may point to a problem with
the user's jobs or the user's permissions). Job aborts may be
detected by examining the exit_status of jobs in the act_file.</FONT>
<LI><FONT SIZE=2>Changing queue parameters, adding and deleting
jobs and performing queue suspend/unsuspend,  or disable/enable
operations as required.</FONT>
</OL>
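<P>
<FONT SIZE=2>The search for runs of aborted jobs lends itself to a
small script. The Python sketch below assumes the accounting entries
have already been extracted (for example with acte) into hypothetical
(host, user, exit_status) tuples in time order; it does not parse the
act_file itself, and the threshold of three is an arbitrary choice.</FONT>

```python
def abort_streaks(records, key_index, threshold=3):
    """Report keys (hosts or users) with `threshold` consecutive aborts.

    records: iterable of (host, user, exit_status) tuples in time order.
    key_index: 0 to group by host, 1 to group by user.
    A key is reported each time its run of non-zero exits reaches threshold.
    """
    current = {}      # key -> length of the current run of non-zero exits
    flagged = []
    for record in records:
        key, status = record[key_index], record[2]
        if status != 0:
            current[key] = current.get(key, 0) + 1
            if current[key] == threshold:
                flagged.append(key)
        else:
            current[key] = 0
    return flagged
```

<P>
<FONT SIZE=2>Calling abort_streaks(records, 0) points at suspect hosts,
while abort_streaks(records, 1) points at users whose jobs or permissions
may need attention.</FONT>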
<P>
<FONT SIZE=2>The majority of the DQS313 utility set and its
options are provided for the system management function. While
users may employ the qalter command, for example, to change the
characteristics of a submitted job, more often the administrator
will avail themselves of this function. A not-uncommon occurrence
is for the administrator to increase the submission priority of
a job to move it ahead of other jobs in the scheduling.</FONT>
<P>
<FONT SIZE=2>One utility should be highlighted here, the &quot;qidle&quot;
function. Many DQS hosts may actually reside on someone's desk
and serve as their personal workstation. At the same time these
machines are utilized for their computational capabilities in
a cell. To serve both functions, it must be possible for the workstation
user to have priority access to their machine and not suffer keyboard
and mouse response deficiencies because the host is being shared
with DQS. A first step is to make the &quot;owner&quot; of the
workstation also an &quot;owner&quot; of all queues assigned to
that host. Then when the workstation owner wishes to have exclusive
use of the machine they will have DQS permission to suspend any
queues on that machine.</FONT>
<P>
<FONT SIZE=2>Enter the &quot;qidle&quot; utility. This is an X-Windows
based program, since we presume that workstation users will be
operating with X-Windows. It can be started at any workstation
and performs the following functions on behalf of the workstation
&quot;owner&quot;, whom the administrator has also designated as a
queue &quot;owner&quot; in the queue configuration. </FONT>
<OL>
<LI><FONT SIZE=2>If the workstation mouse and keyboard are used
in some way (mouse movement, button clicks, keyboard typing),
all queues on that host are suspended.</FONT>
<LI><FONT SIZE=2>If the keyboard and mouse have not been used
for a period of time specified in the qidle command, then all
queue suspensions are removed.</FONT>
</OL>
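<P>
<FONT SIZE=2>The two rules above amount to a small state machine. The
Python sketch below replays a list of timestamped events rather than
talking to X-Windows or the qmaster; the event names &quot;input&quot;
and &quot;tick&quot; and the function itself are hypothetical and only
illustrate the suspend/unsuspend logic.</FONT>

```python
def qidle_actions(events, idle_limit):
    """Replay (time, kind) events and return the suspend/unsuspend actions.

    events: time-ordered list of (seconds, kind) where kind is "input"
            (mouse or keyboard activity) or "tick" (a periodic timer check).
    idle_limit: seconds without input before the suspensions are lifted.
    """
    actions = []
    suspended = False
    last_input = None
    for now, kind in events:
        if kind == "input":
            last_input = now
            if not suspended:
                suspended = True
                actions.append((now, "suspend"))
        elif suspended and now - last_input >= idle_limit:
            suspended = False
            actions.append((now, "unsuspend"))
    return actions
```

<P>
<FONT SIZE=2>Repeated input while the queues are already suspended only
resets the idle timer, so no redundant suspend requests are issued.</FONT>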
<P>
<FONT SIZE=2>What happens in the case where more than one user may
have access to a workstation? The &quot;system console&quot; is
an example where many users may be permitted to operate the keyboard
and mouse. Making all users &quot;owners&quot; of that station's
queues could result in an unmanageable list and is a potential
security problem, since a queue owner has privileges beyond queue
suspension actions. </FONT>
<P>
<FONT SIZE=2>The qidle in DQS313 has thus been modified from its
DQS 3.1.2.4 form. It is now a member of the DQS313 utilities group
and communicates directly with the qmaster rather than indirectly
through the qmod utility. It can be started on any workstation
by any user who has permission to login to that workstation. Once
started it performs the same functions described above.<BR>
</FONT>
<H2><U><FONT SIZE=4>Problem Solving</FONT></U></H2>
<H2><FONT SIZE=2>Solving Installation Problems</FONT></H2>
<P>
<FONT SIZE=2>Most installation difficulties can be divided into
the following categories (in order of probability):</FONT>
<OL>
<LI><FONT SIZE=2>One or more bugs remain in the DQS 3.1.3 installation
procedure. This release has not been tested on all available UNIX
platforms (hardware or software versions).</FONT>
<LI><FONT SIZE=2>The interactive interface has produced messages
or questions which may confuse the reader. Some of these are natural
warnings from the make process or compiler. A few will be labeled
&quot;error&quot; when they do not affect the installation process.
These often occur when an installation is being performed over
an old one and the target directories already exist.</FONT>
<LI><FONT SIZE=2>The administrator is running as non-root and
attempting operations not permitted in that mode.</FONT>
<LI><FONT SIZE=2>Host machines to be used for qmaster and/or dqs_execd
do not have uniform access (through NFS or AFS or DFS) to the
DQS binary files, or the spool directories defined during the
installation procedure.</FONT>
<LI><FONT SIZE=2>Attempts to use qstat313, qsub313, etc. receive
a message &quot;.. unable to contact qmaster&quot;. This is usually
due to a user trying to invoke one of the DQS utilities on a host
not known to the qmaster. The qmaster maintains a list of all
&quot;trusted hosts&quot; in the cell which it manages. Hosts
are added automatically when a queue is configured for them (&quot;qconf313
-aq&quot;) or by an explicit host addition (&quot;qconf313 -ah &lt;host
name&gt;&quot;).</FONT>
</OL>
<P>
<FONT SIZE=2>Identify the symptoms of the installation failure
and refer to one of the following sections:</FONT>
<H2>INSTALL fails during the make process of the &quot;config&quot;
program.</H2>
<P>
<FONT SIZE=2>The GNU configure program uses the &quot;Makefile.in&quot;
template in the DQS/CONFIG directory to produce the Makefile for
the DQS config utility. It is possible that a new configuration
of compilers or linkers can cause the GNU facility to create an
erroneous Makefile. Visually check the Makefile for correctness.</FONT>
<P>
<FONT SIZE=2>Although DQS313 installation has been tested on many
platforms, variants of the compiler or operating systems can create
WARNING messages during the compilation which we have not made
provision for. Even different versions of GNU &quot;C&quot; yield
different warning messages. If the error is fatal to the compilation
please contact the DQS313 support team for assistance.</FONT>
<H2>INSTALL  fails during the execution of the DQS config program.
</H2>
<P>
<FONT SIZE=2>During the config process the system attempts to
create a number of directories and sub-directories. The default
starting point for this process is the current working directory
of the user if running as non-root, or /usr/local/DQS if running
as root. If any of the directories exist, an error message is
displayed on stdout, but the config program continues. If the
user discovers that they have erroneously specified directory
names, config can be interrupted by typing CTRL-C. This will unwind
many aspects of the configuration process; however, NO DIRECTORIES
will be removed. The administrator will have to clean up any relevant
directories manually. After reviewing the &quot;directory already exists&quot;
messages the administrator can choose to ignore those which are
expected because the directories were previously created.<BR>
</FONT>
<H2>INSTALL fails during the &quot;make&quot; process.</H2>
<P>
<FONT SIZE=2>During the DQS config step, all of the target directories
are created except for the ones associated with the compiled output
object ('.o' files) and the interim executables (qmaster, dqs_execd&#133;).
If a previous installation occurred under a &quot;root&quot; user
and the current &quot;make&quot; is being done as a &quot;non-root&quot;
user, the attempt to create the ARCS sub-directories will fail for lack
of permissions. The solution is to perform the &quot;make&quot;
as root or change the owner of the ARCS sub-directories to the
user doing the installation of DQS313.</FONT>
<P>
<FONT SIZE=2>The GNU CC compiler is chosen as the default compiler
for the &quot;make&quot; process if it is available. Some sites
may experience a large number of &quot;gcc&quot; warning messages
if there have been local modifications to the GNU include files.
If this occurs, or if the site prefers to use the native &quot;C&quot;
compilers, then the following steps should be taken:</FONT>
<OL>
<LI><FONT SIZE=2>Stop the &quot;make&quot; operation. The GNU
configure program and the DQS config utility will have been executed
and all Makefile templates will contain the GCC default. Change
directory to &#133;DQS/SRC and edit the Makefile.proto file.</FONT>
<LI><FONT SIZE=2>Search the Makefile.proto for any lines which
match &quot;CC=gcc&quot; and replace the string &quot;gcc&quot;
with the native compiler name, (usually &quot;cc&quot; ).</FONT>
<LI><FONT SIZE=2>Change directory back to the base directory,
&#133;DQS and type &quot;make&quot; to restart the process.</FONT>
</OL>
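<P>
<FONT SIZE=2>The edit in step 2 can also be made mechanically. The
Python sketch below rewrites the compiler assignment in the text of a
makefile; the function name is hypothetical, and it deliberately
tolerates blanks around the &quot;=&quot; sign, which is an assumption
about how the assignment might be written.</FONT>

```python
import re


def use_native_compiler(makefile_text, compiler="cc"):
    """Rewrite every 'CC=gcc' assignment to name another compiler."""
    pattern = re.compile(r"^([ \t]*CC[ \t]*=[ \t]*)gcc\b", flags=re.MULTILINE)
    return pattern.sub(lambda m: m.group(1) + compiler, makefile_text)
```

<P>
<FONT SIZE=2>To apply it, read the contents of Makefile.proto, pass
them through use_native_compiler, and write the result back before
restarting &quot;make&quot;. Keeping a copy of the original file is
advisable.</FONT>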
<P>
<FONT SIZE=2>If only &quot;warning&quot; messages appear in the
stdout results you can feel reasonably secure with the installation.
However we will try to eliminate these in future releases and
would appreciate receiving information on these occurrences. If
an error fatal to the compilation occurs please contact the DQS
support staff.<BR>
</FONT>
<H2>INSTALL fails during the &quot;make installbin&quot; phase
</H2>
<P>
<FONT SIZE=2>Once the make process has created the temporary executables
in the ARCS directory they should be moved to their &quot;final
resting place&quot; as chosen during the DQS config step. For
operational installations this step should be performed as root.
If the INSTALL script was started as non-root and the target directory
requires root permissions the INSTALL process will fail at this
point.</FONT>
<P>
<FONT SIZE=2>If this occurs the administrator should switch to
&quot;root&quot;, change directory to &#133;./DQS and type &quot;make
installbin&quot;.</FONT>
<P>
<FONT SIZE=2>Since the DQS config process attempts to create the
BIN target directory, this phase may generate several warning
messages that &quot;directory already exists&quot;. Ignore these
warnings. If, however, the message is &quot;error, permission denied&quot;,
the process should be repeated in &quot;root&quot; mode.</FONT>
<P>
<FONT SIZE=2>To prevent confusion between DQS313 binaries and
previously installed versions we have appended the string &quot;313&quot;
to the binary names during the installbin process. The usual next step is to provide
soft-links in /usr/local/bin to these binaries, something of the
form:</FONT>
<P>
<FONT SIZE=2>&quot;ln -s /usr/local/DQS/bin/qmaster313  /usr/local/bin/qmaster&quot;
<BR>
</FONT>
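<P>
<FONT SIZE=2>Where many binaries need linking, the same step can be
scripted. The Python sketch below is a hypothetical helper, not part of
DQS: it creates one soft-link per file whose name ends in the suffix and
leaves any existing file or link untouched. The two directories are
passed in by the caller.</FONT>

```python
import os


def link_unsuffixed(bin_dir, link_dir, suffix="313"):
    """Create link_dir/NAME -> bin_dir/NAMEsuffix for each suffixed binary."""
    created = []
    for name in sorted(os.listdir(bin_dir)):
        if not name.endswith(suffix) or name == suffix:
            continue
        link = os.path.join(link_dir, name[: -len(suffix)])
        if os.path.lexists(link):
            continue              # never overwrite an existing file or link
        os.symlink(os.path.join(bin_dir, name), link)
        created.append(link)
    return created
```

<P>
<FONT SIZE=2>For the layout used in the example above the call would be
link_unsuffixed(&quot;/usr/local/DQS/bin&quot;, &quot;/usr/local/bin&quot;),
run with sufficient permissions to write to the link directory.</FONT>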
<H2>INSTALL fails during the &quot;make installconf&quot; phase
</H2>
<P>
<FONT SIZE=2>After the binaries have been installed in their directory
the &quot;resolve_file&quot; and &quot;conf_file&quot; will be moved
to their target directory (a possible default might be &quot;/usr/local/DQS/common/conf&quot;).
In our &quot;quick install example&quot; this process should
proceed automatically. If the INSTALL script was initiated by
a non-root user and the destination directory is restricted to
the root user, this step will fail with a &quot;permission denied&quot;
error message. However, when a series of different platform types
is being aggregated into a single cell, only one conf_file and
resolve_file need be moved to the common/conf directory. If this
has already been done then this step can be skipped.</FONT>
<H2>Startup of the qmaster fails.</H2>
<P>
<FONT SIZE=2>The principal reason for the qmaster not executing
during initial testing is the absence of the /etc/services entries
directed by the installation process. The err_file should be examined.
Warning messages about absent hosts, acl and complex files should
be ignored. Look for an entry &quot;Bad Service&quot; which points
to the /etc/services file.</FONT>
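<P>
<FONT SIZE=2>A quick way to confirm that an entry is present is to scan
the services file for it. The Python sketch below works on the text of
the file and is only an illustration; the service name to look for is
whatever the installation process directed, and the name used in any
test of this function is purely hypothetical.</FONT>

```python
def find_service(services_text, name):
    """Return (port, protocol) pairs listed for `name` in /etc/services text."""
    found = []
    for line in services_text.splitlines():
        fields = line.split("#", 1)[0].split()   # drop comments, then tokenize
        if len(fields) < 2 or "/" not in fields[1]:
            continue
        names = [fields[0]] + fields[2:]          # official name plus aliases
        if name in names:
            port, protocol = fields[1].split("/", 1)
            found.append((int(port), protocol))
    return found
```

<P>
<FONT SIZE=2>An empty result for the expected name on a given host is
consistent with the &quot;Bad Service&quot; entry in the err_file.</FONT>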
<P>
<FONT SIZE=2>An obvious error, but one which occurs often, is trying
to start the qmaster in user-mode while RESERVED_PORTS TRUE
appears in the conf_file.</FONT>
<P>
<FONT SIZE=2>If attempts at starting the qmaster fail after checking
root-mode and the /etc/services file, the administrator should
set the environment variable DEBUG to 1 and then restart the qmaster
as follows: &quot;qmaster313 &gt;&amp;debug.out &amp;&quot; (assuming
a C shell environment). After the qmaster crashes send the file
&quot;debug.out&quot; to the DQS support staff.<BR>
</FONT>
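<P>
<FONT SIZE=2>On systems where the C shell redirection syntax is not
available, the same capture can be arranged from a short script. The
Python sketch below is a hypothetical wrapper: it sets DEBUG to 1 for
the child process only and sends both stdout and stderr to one file.
Unlike the shell form above it waits for the command to exit rather than
running it in the background.</FONT>

```python
import os
import subprocess


def run_with_debug(command, log_path="debug.out"):
    """Run `command` with DEBUG=1, capturing stdout and stderr in one file."""
    env = dict(os.environ, DEBUG="1")
    with open(log_path, "wb") as log:
        completed = subprocess.run(command, env=env, stdout=log,
                                   stderr=subprocess.STDOUT)
    return completed.returncode
```

<P>
<FONT SIZE=2>A call such as run_with_debug([&quot;qmaster313&quot;])
leaves its output in &quot;debug.out&quot; in the current directory.</FONT>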
<H2>Startup of the dqs_execd fails</H2>
<P>
<FONT SIZE=2>The principal reason for the dqs_execd not executing
during initial testing is the absence of the /etc/services entries
on its host as directed by the installation process. The err_file
should be examined. Warning messages should be ignored. Look for
an entry &quot;Bad Service&quot; which points to the /etc/services
file.</FONT>
<P>
<FONT SIZE=2>An obvious error, but one which occurs often, is trying
to start the dqs_execd in user-mode while RESERVED_PORTS TRUE
appears in the conf_file.</FONT>
<P>
<FONT SIZE=2>If the dqs_execd is not able to check in with the
qmaster during dqs_execd startup, the daemon will shut down
(once executing, the dqs_execd will not shut down if the qmaster
is absent). Make sure the qmaster is running before attempting
to start the dqs_execd.</FONT>
<P>
<FONT SIZE=2>If attempts at starting the dqs_execd fail after
checking root-mode and the /etc/services file, the administrator
should set the environment variable DEBUG to 1 and then restart
the dqs_execd as follows: &quot;dqs_execd313 &gt;&amp;debug.out
&amp;&quot; (assuming a C shell environment). After the dqs_execd
crashes send the file &quot;debug.out&quot; to the DQS support
staff.</FONT>
<H2>Startup of qconf fails</H2>
<P>
<FONT SIZE=2>If the first attempt at using qconf produces error
messages and the qconf terminates there are several possible causes:</FONT>
<OL>
<LI><FONT SIZE=2>The user is attempting to execute qconf in root-mode
while the MIN_UID and MIN_GID are non zero. For security reasons
root users are not normally permitted to execute DQS utilities
unless the MIN_UID is set to zero. </FONT>
<LI><FONT SIZE=2>qconf is being started in user-mode but the utility
itself is NOT owned by root and does not have the permissions
for the owner set correctly. This can occur when a manager uses
a path to the ARCS directory for the utility rather than the BIN_DIR
target where installbin is supposed to put all DQS binaries.</FONT>
<LI><FONT SIZE=2>qconf is being started on a host not yet known
to the qmaster. Here we have a cart-and-horse situation. We need
to use the qconf function to add hosts, but cannot execute qconf
because its host is not &quot;legal&quot;.  The only solution
is to initiate qconf on the same host where the qmaster resides.</FONT>
</OL>
<H2>qstat display shows queue status as UNKNOWN</H2>
<P>
<FONT SIZE=2>During the initial test phase, the manager will have
created one queue using qconf. After it has been created, execution
of qstat should show the presence of a queue and a status of DISABLED.
An UNKNOWN status indicates a failure of the dqs_execd to contact
the qmaster in the time prescribed as MAX_UNHEARD in the conf_file.
Check the err_file for messages relating to the dqs_execd being
unable to contact the qmaster. Since the dqs_execd would not even
start if it could not check in with the qmaster, some new problem
must have developed. Check to see if the dqs_execd is still running.</FONT>
<H2>qsub fails to submit test job</H2>
<P>
<FONT SIZE=2>The test script should be accepted by the DQS system
at this point with no problem, since utility&lt;-&gt;qmaster interaction
has been operating successfully in the previous steps. The most
likely reason for a failure of this qsub test is represented by
a message of the form &quot;ALARM CLOCK shutdown&quot;. This
is due to the qmaster or the network interfaces being overburdened.
Often the host on which the qmaster is running may be executing
some non-DQS managed computational hog. If the ALARM message occurs
try increasing the ALARM values in the conf_file and re-executing
the qsub command. (Note that for this experiment the dqs_execd
and qmaster need not be restarted after changing the conf_file,
as the qsub is the only one complaining. However, if the new values
of the ALARM parameters prove satisfactory the daemons should be
restarted as soon as practicable.)</FONT>
<H2>Test job ends with no output</H2>
<P>
<FONT SIZE=2>If the permissions for the user submitting the test
script are not sufficient for the target host, the job launching
process will be terminated and a message sent to the err_file.
An accounting record will also be sent to the DQS act_file. Check
these files for information.</FONT>
<H2>Test script produces a non-zero length stderr file</H2>
<P>
<FONT SIZE=2>The test script should create two output files, one
containing stdout information and the other the stderr output.
If the stderr output is not zero length then some &quot;very unlikely&quot;
event occurred during the job execution. Examine this stderr
file and the err_file to determine what the cause was.</FONT>
<H2>Operational errors</H2>
<P>
<FONT SIZE=2>Once the system has succeeded in running the  test
script, the administrator will configure hosts, queues and resources
for its operational settings. A myriad of situations can then
occur which may appear to be, or in fact are, DQS system errors.
For this reason DQS produces a large number of informational,
warning and error messages which are posted to the system err_file.
</FONT>
<P>
<FONT SIZE=2>In the event that an operational aberration is detected
the err_file should be examined closely. If no explanations are
obvious, the DQS support staff should be contacted and sent a
relevant extraction from the err_file and act_file.<BR>
</FONT>
</BODY>
</HTML>