<HTML>
<HEAD>
<TITLE>1</TITLE>
<META NAME="GENERATOR" CONTENT="Internet Assistant for Microsoft Word 2.0z">
</HEAD>
<BODY>
<HR>
<P>
<CENTER><B><FONT SIZE=2>DISTRIBUTED QUEUEING </FONT></B></CENTER>
<P>
<CENTER><B><FONT SIZE=2>SYSTEM - 3.1.3</FONT></B></CENTER>
<H2><CENTER>August 28, 1996</CENTER></H2>
<H1><CENTER><FONT SIZE=4 COLOR=#FFFFFF>INSTALLATION AND MAINTENANCE
MANUAL</FONT></CENTER></H1>
<HR>
<P>
<B>SUPERCOMPUTER COMPUTATIONS RESEARCH INSTITUTE<BR>
<BR>
<BR>
<BR>
</B>
<H3><IMG SRC="IMG00018.GIF"></H3>
<H1><FONT SIZE=4 COLOR=#FFFFFF>Introduction</FONT></H1>
<HR>
<H2>The Distributed Queuing System (DQS)</H2>
<P>
<FONT SIZE=2>The Distributed Queuing System (DQS) is an experimental
batch queuing system which has been under development at the Supercomputer
Computations Research Institute (SCRI) at Florida State University
for the past seven years. The first years of this activity were funded
by Department of Energy Contract DE-FC0585ER250000. DQS is
freely distributed to all parties with the understanding that
it remains an evolving development system; no warranties
should be implied by this distribution.<BR>
</FONT>
<P>
<FONT SIZE=2>DQS is intended to provide a mechanism for managing
requests for the execution of batch jobs on one or more members
of a homogeneous or heterogeneous network of computers. Facilities
for load balancing, prioritization, and expediting of a wide variety
of computational jobs are included to help each site tailor
the behavior of the system to its particular environment.<BR>
</FONT>
<H2>SCRI support </H2>
<P>
<FONT SIZE=2>SCRI will make every effort, within its resources,
to ensure that DQS is suitable for operation as a batch queuing
system in as many site situations as possible. SCRI staff will
respond to requests for assistance from those who are using DQS,
as well as investigate bugs, incorporate repairs, and update
documentation. However, it is not possible at this time to
make a formal commitment to the long-term support and enhancement
of this system. Any user or organization which decides to adopt
DQS will be assuming all risks of that undertaking.<BR>
</FONT>
<P>
<FONT SIZE=2>With this release, DQS 3.1.3, the distribution and
support of the previous version, DQS 3.1.2.4, will be continued
for at least the balance of calendar year 1996. Depending on the
need for continued support and on SCRI resource availability, some
level of support may be continued beyond that time. We feel, however,
that since the DQS 3.1.3 system is based on the DQS 3.1.2.4 release,
most users will soon be using DQS 3.1.3 in preference to DQS 3.1.2.4.<BR>
</FONT>
<P>
<FONT SIZE=2>DQS 3.1.3 and future enhancements can be obtained
by Internet ftp from "ftp.scri.fsu.edu". <BR>
</FONT>
<P>
<FONT SIZE=2>Announcements of new releases and improvements will
be emailed to anyone who contacts SCRI to add their name to the
announcement list. To subscribe:</FONT>
<OL>
<LI><FONT SIZE=2>Send email to dqs-announce@scri.fsu.edu</FONT>
<LI><FONT SIZE=2>Leave the "subj:" field blank</FONT>
<LI><FONT SIZE=2>Send the one-line message: subscribe</FONT>
</OL>
<P>
<FONT SIZE=2>To remove a name from the announcement list:</FONT>
<OL>
<LI><FONT SIZE=2>Send email to dqs-announce@scri.fsu.edu</FONT>
<LI><FONT SIZE=2>Leave the "subj:" field blank</FONT>
<LI><FONT SIZE=2>Send the one-line message: unsubscribe</FONT>
</OL>
<P>
<FONT SIZE=2>Bug reports should be sent to dqs@scri.fsu.edu<BR>
</FONT>
<P>
<FONT SIZE=2>DQS user information exchange is provided by Rensselaer
Polytechnic Institute. To add your name and email address to this
list:</FONT>
<OL>
<LI><FONT SIZE=2>Send email to dqs-l@vm.its.rpi.edu</FONT>
<LI><FONT SIZE=2>Leave the "subj:" line blank</FONT>
<LI><FONT SIZE=2>Send the one-line message: SUBSCRIBE dqs-l Firstname Lastname</FONT>
</OL>
<P>
<FONT SIZE=2>To remove your name and email address:</FONT>
<OL>
<LI><FONT SIZE=2>Send email to dqs-l@vm.its.rpi.edu</FONT>
<LI><FONT SIZE=2>Leave the "subj:" line blank</FONT>
<LI><FONT SIZE=2>Send the one-line message: UNSUBSCRIBE dqs-l Firstname Lastname</FONT>
</OL>
<P>
<FONT SIZE=2>Here Firstname is the user's first name and Lastname
is the user's last name.<BR>
<BR>
<BR>
<BR>
<BR>
</FONT>
<P>
<FONT SIZE=2>With the release of DQS 3.1.3, the user intercommunication
through dqs_user@scri.fsu.edu will be re-instituted. All messages,
inquiries, and announcements from any user or from the DQS development
staff will be relayed automatically to all other users.<BR>
</FONT>
<H2>What's New in DQS 3.1.3</H2>
<P>
<FONT SIZE=2>The release of DQS 3.0 was a major departure for
the DQS evolution. It was based on several years' experience with
DQS 2.1 in a variety of computing environments. Although it retained
many features of the 2.1 version, DQS 3.0 was a major restructuring
and re-coding of the basic system with a major focus on supporting
parallel (clustered) computation on two or more UNIX-based hardware
platforms. The newly emerging Message Passing Interface (MPI) was
considered throughout the DQS 3.0 implementation.</FONT>
<P>
<FONT SIZE=2>In early 1995, DQS 3.0-3.1 was subjected to extensive
testing, and the contributions of numerous users were incorporated
to produce DQS 3.1.2, which was released in March and augmented
over a period of six months to become DQS 3.1.2.4. With the exception
of some minor "improvements," this system has been fairly stable
and in operational use for nine months.</FONT>
<P>
<FONT SIZE=2>Operational experience at SCRI and other large production
sites revealed several features that needed to be added or adapted
to make the system easier to use and to manage. Several sites provided
the DQS development team with valuable insight, advice, and code,
which have been incorporated into this new release. Although the
user interfaces have not changed (except where "enhanced"),
the internals of the system have undergone considerable change;
hence this release is named 3.1.3 rather than 3.1.2.5.
We took this opportunity to restructure the documentation (one
more time!) in response to numerous requests to make it easier
to access. In addition to numerous bug fixes for DQS 3.1.2.4 provided
by several very helpful sites (see "acknowledgments"),
a number of new features have been added to the system. </FONT>
<P>
<FONT SIZE=2>The "new" features of DQS 3.1.3 tend to
be somewhat invisible to the DQS user. The bulk of this effort
has been focused on further "bulletproofing" the system
to minimize, if not eliminate, the unreported termination of daemons,
utilities, and jobs. Some features are "semi-visible,"
such as the revised scheduling system. A few are quite evident
to all; for example, the "job pre-validation" feature returns
immediate feedback when a requested resource is entirely absent.
With this in mind, we list here the major changes that appear
in DQS 3.1.3:</FONT>
<H3>Job pre-validation</H3>
<P>
When a job is submitted to DQS using the QSUB utility, it is checked
to ensure that:
<OL>
<LI>The "fixed" resources requested as HARD are present
somewhere in an existing DQS complex. If a resource is in use
by another job, it is still considered "present" for
the purposes of pre-validation.
<LI>The "consumable" resources requested as HARD are
present in at least one DQS Consumables file. If a resource
is in use by another job, it is still considered "present"
for the purposes of pre-validation.
</OL>
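<P>
The two checks above can be sketched in a few lines of Python. This is a hedged illustration only, not DQS source code; the function, resource names, and data structures are hypothetical.

```python
# Hypothetical sketch of DQS-style job pre-validation: a HARD resource
# request passes if the resource exists anywhere in the complex (fixed
# resources) or in any Consumables file, even if another job is
# currently using it.

def prevalidate(hard_requests, fixed_resources, consumables):
    """Return the requested resources that exist nowhere in the complex."""
    missing = []
    for name in hard_requests:
        if name not in fixed_resources and name not in consumables:
            missing.append(name)
    return missing

# 'mem' exists as a fixed resource and 'fortran_lic' in a Consumables
# file, but 'cray_vector' exists nowhere, so the submission would be
# rejected with immediate feedback at QSUB time.
fixed = {"qty": 4, "mem": 128, "disk": 512}
consum = {"fortran_lic": 2}
print(prevalidate(["mem", "fortran_lic", "cray_vector"], fixed, consum))
```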
<H3>Consumable Resources</H3>
<P>
Many sites are confronted with the need to allocate scarce resources
to jobs during the scheduling process. Resources such as FORTRAN
compiler licenses, database licenses, shared memory, and disk
space can be assigned names and values by the DQS administrator.
Job scheduling then reconciles requests for Consumable Resources:
when a job is placed into execution, the available amount of
each requested resource is reduced until the job terminates or releases
the resource with a DQS system utility. Facilities for managing
the Consumable Resource reservoir have been added to the QCONF
utility.
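<P>
The accounting described above can be sketched as follows. This is a minimal illustration under assumed data structures, not the actual DQS or QCONF implementation.

```python
# Hypothetical sketch of consumable-resource accounting: the available
# amount is reduced when a job is placed into execution and restored
# when the job terminates or releases the resource.

class ConsumablePool:
    def __init__(self, amounts):
        self.available = dict(amounts)    # resource name -> units free

    def acquire(self, name, qty):
        """Reserve qty units at job launch; False if not enough remain."""
        if self.available.get(name, 0) < qty:
            return False
        self.available[name] -= qty
        return True

    def release(self, name, qty):
        """Return qty units to the reservoir at job termination."""
        self.available[name] = self.available.get(name, 0) + qty

pool = ConsumablePool({"fortran_lic": 2})
print(pool.acquire("fortran_lic", 1))   # first job gets a license
print(pool.acquire("fortran_lic", 2))   # second job waits: only 1 left
pool.release("fortran_lic", 1)          # first job terminates
print(pool.acquire("fortran_lic", 2))   # second job can now run
```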
<H3>qhold, qrls</H3>
<P>
The QHOLD and QRLS utilities have been implemented. These permit
a user or administrator to place a "hold" on an already
submitted job; the job remains held until it is released by the user
or the DQS administrator, or removed using the QDEL utility.
<H3>qmove (multi-cell job transfer)</H3>
<P>
The POSIX utility QMOVE has been partially implemented for this
release. In a single-cell system a queued job can be moved from
one queue to another using the QALTER utility. This involves
using the "-q" option to explicitly identify a target
queue. Alternatively, when queues are implicitly specified on the
basis of resource requests (the "-l" option), the QALTER
utility may be used to change the resource request.
<P>
In a multi-cell system the QMOVE utility must be used to initiate
the transfer of a job from consideration by one cell to another.
The QMOVE request is pre-validated like any other QSUB submission,
and the job will not be moved if it cannot pass this first-level
test.
<H3>"fair use" scheduling</H3>
<P>
The DQS scheduler has been rewritten... again. Of the many components
of an operating system, the scheduling process is the most perplexing
and complex feature to provide in an adequately general form.
The DQS 3.1.3 scheduler has been commented and the code blocked out
in a manner which we hope will make site modifications easier
and more comprehensible. The scheduling methodology now in use
at SCRI is provided as the default in this release. It attempts
to prevent one or two users from dominating the utilization of the
system resources, while keeping all hosts as busy as possible.
<P>
Those submitting massive quantities of jobs to the system at one
time will discover four levels at which their jobs are handled
by the scheduler. First, there is a limit on how many jobs will
be accepted at QSUB time. Second, there is a limit on how many
queued jobs for a single user will be considered by the
scheduler. Third, the user's jobs which are considered for
scheduling are assigned sub-priorities according to their
DQS sequence number and the number of jobs for that user preceding
them in the queue. Finally, a queue can be assigned a time delay
which is imposed between consecutive allocations of that queue
to the same user.
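<P>
The third level, sub-prioritization, can be illustrated with a short sketch. The weighting below is hypothetical, not the actual DQS scheduler formula; it simply shows how penalizing each job by the number of the same user's jobs ahead of it interleaves users.

```python
# Hypothetical illustration of "fair use" sub-priorities: within the
# jobs considered for scheduling, each job is ranked first by how many
# jobs of the same user precede it (fewer is better), then by its DQS
# sequence number. One user's flood of jobs no longer starves others.

def subprioritize(jobs):
    """jobs: list of (user, sequence_number). Return scheduling order."""
    jobs_by_seq = sorted(jobs, key=lambda j: j[1])
    placed = {}                        # user -> jobs already ranked
    ranked = []
    for user, seq in jobs_by_seq:
        penalty = placed.get(user, 0)  # this user's jobs ahead of it
        ranked.append((penalty, seq, user))
        placed[user] = penalty + 1
    ranked.sort()                      # low penalty first interleaves users
    return [(user, seq) for penalty, seq, user in ranked]

jobs = [("alice", 1), ("alice", 2), ("alice", 3), ("bob", 4)]
print(subprioritize(jobs))
# -> [('alice', 1), ('bob', 4), ('alice', 2), ('alice', 3)]
```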
<H3>FORTRAN / "C" resource requests</H3>
<P>
In DQS 3.1.2.4 resources are requested using the form "-l
qty.eq.1,mem.gt.32,disk.gt.64" (for example). DQS 3.1.3 retains
this format, but the user may now use either FORTRAN
or "C" syntax for these requests. The above example
could then appear as "-l qty==1&&mem>32&&disk>64"
or, alternatively, "-l qty.eq.1.AND.mem.gt.32.AND.disk.gt.64".
The logical operators ".NOT." (or "!") and
".OR." (or "||") may also be used, as well as parentheses
to increase readability. Future releases will permit more complex,
compound resource requests with the ability to specify alternative
resources which could satisfy the request. (This is different
from using HARD and SOFT classifications.) For the time being,
parentheses only assist readability, as in "-l qty==1&&(mem>32)&&(disk>64)".
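<P>
The equivalence between the two syntaxes can be demonstrated with a short sketch. This illustrates only the operator mapping described above; it is not the DQS request parser.

```python
# Hypothetical sketch translating FORTRAN-style DQS resource-request
# operators into their "C" equivalents, case-insensitively.

FORTRAN_TO_C = {
    ".eq.": "==", ".ne.": "!=",
    ".gt.": ">",  ".ge.": ">=",
    ".lt.": "<",  ".le.": "<=",
    ".and.": "&&", ".or.": "||", ".not.": "!",
}

def to_c_syntax(request):
    """Rewrite FORTRAN-style operators in a request string as C-style."""
    out, i = [], 0
    low = request.lower()
    while i < len(request):
        for fort, c_op in FORTRAN_TO_C.items():
            if low.startswith(fort, i):
                out.append(c_op)
                i += len(fort)
                break
        else:                       # no operator here: copy the character
            out.append(request[i])
            i += 1
    return "".join(out)

print(to_c_syntax("qty.eq.1.AND.mem.gt.32.AND.disk.gt.64"))
# -> qty==1&&mem>32&&disk>64
```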
<H3>subordinate queues</H3>
<P>
DQS 2.1 introduced a feature known as "subordinate queues"
which provided the capability to identify a queue as being subordinate
to another queue. If a job is running in the subordinate queue
and a job is launched in its "superior" queue, the subordinate
job is suspended until the "superior" job terminates.
This feature is particularly important when managing a system
where hosts can function both as single-processor and multiple-processor
platforms. DQS 3.1.3 provides a re-implementation of
this feature.
<H3>SMP AFS re-authentication</H3>
<P>
DQS 3.1.2.4 provided a simple facility for operating in an AFS
environment. Actual use of this system uncovered a number of problems.
The most significant of these was solved by Axel Brandes and Bodo
Bechenback and is incorporated in DQS 3.1.3. A key element of
their solution is the use of a temporary daemon which we
call the Process Leader and others call the "process sheep-herder".
<P>
The Process Leader is spawned by the dqs_execd and does the actual
job launching and cleanup. It can respond to system requests which
the job is not equipped to deal with, such as the AFS periodic
re-authentication task. This capability also makes it possible
to run multiple jobs for the same queue on the same host, and
to detach the job from the dqs_execd daemon in case that daemon
needs to be restarted, without killing the job.
<H3>qmaster<->dqs_execd synchronization</H3>
<P>
A glaring shortcoming in DQS 3.1.2.4 was the lack of synchronization
among the DQS daemons. Under some circumstances the queue status
maintained by the qmaster did not reflect the actual state of
jobs handed off to the dqs_execd. There was no mechanism for making
the two states congruent, other than the "clean queue"
(QCONF -cq) mechanism, which only affected the qmaster's view
of the system. DQS 3.1.3 implements auxiliary communications
between the qmaster and dqs_execd to provide automatic and
manual methods of re-synchronizing the system.
<P>
Programmed aborts of the dqs_execd using the system "abort"
or "exit" calls have been eliminated. Instead, all
dqs_execd errors previously considered fatal are now communicated
to the qmaster, which emails an urgent message to the DQS administrator
and pauses the dqs_execd until the administrator can intervene.
Note that if a job is running under Process Leader management,
it will continue execution, ignorant of the dqs_execd pause. (If
the dqs_execd error is due to a failure in the dqs_execd<->qmaster
interface, the dqs_execd independently mails the urgent cry for
help to the administrator.)
<H3>parallel job consistency and accounting</H3>
<P>
In DQS 3.1.2, parallel job scheduling handed off parallel jobs
when sufficient queues became available for the execution of the
requested number of processes. However, only the dqs_execd which
was managing the MASTER process was aware of the parallel job,
and the only accounting information obtained for the job came from
the MASTER host.
<P>
DQS 3.1.3 scheduling alerts all of the SLAVE queue managers to
the fact that they will be running one of the parallel job's processes.
When the parallel job is launched by the MASTER dqs_execd, each
of the SLAVE dqs_execds verifies that it is permitted to participate
in that job before the slave process is started. A Process Leader
is used to launch each of these slave processes, and at their termination
accounting information is gathered and sent to the qmaster. This
ensures that DQS is in charge of the execution of all parallel
job components. In the event that a LINDA parallel job is involved,
the Process Leader is initiated and waits for the LINDA process
to be started by the master process on the MASTER host. It then
attaches itself to this process (since it cannot launch the process
itself) in order to handle termination and accounting reporting.
<H3>qidle integration</H3>
<P>
In DQS 3.1.2.4 the QIDLE utility was part of the X-windows component
of the system and interfaced with DQS by invoking the QMOD utility
as a separate task. This created several problems, the principal
one being that at many sites the "system console" was
connected to a host which was also managing a DQS queue. Since
such a console usually has many users accessing it, there is no
single "owner" of the queue on that machine with
permission to suspend the queue in order to use
the console.
<P>
The QIDLE function in DQS 3.1.3 is now an authenticated system
utility like QMOD, QDEL, etc. It communicates with the qmaster
directly and can suspend queues on any host for which the QIDLE
function is permitted to run in an X-windows environment.
<H3>enhanced status displays</H3>
<P>
The somewhat cryptic symbols "a", "e", "r",
"u", and "s" in the QSTAT display have been replaced
with the more descriptive words ALARM, ENABLED, RUNNING, UNKNOWN,
and SUSPENDED. More importantly, the reasons why a job is residing
in the PENDING queue are now listed. Between the pre-validation
of jobs and this description of PENDING causes, DQS 3.1.3 should
have eliminated the most common cause of jobs never executing:
requests for non-existent or illogical combinations of resources.
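<P>
The correspondence between the old codes and the new display words is summarized in this small sketch; the mapping comes from the list above, while the code itself is merely illustrative.

```python
# The QSTAT one-letter state codes and the descriptive words that
# replace them in the DQS 3.1.3 display.

STATUS_WORDS = {
    "a": "ALARM",
    "e": "ENABLED",
    "r": "RUNNING",
    "u": "UNKNOWN",
    "s": "SUSPENDED",
}

def describe(code):
    """Map an old one-letter queue state to its descriptive word."""
    return STATUS_WORDS.get(code, "UNKNOWN")

print(describe("r"))  # -> RUNNING
```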
<H3>accounting tools</H3>
<P>
The DQS accounting information can play a key role in the management
and optimization of system resources. In the operational environment
at SCRI we have developed a small collection of tools for extracting,
summarizing and analyzing DQS accounting data. These have been
included in DQS 3.1.3 as a starting point for other sites to develop
their own methods.
<H3>"Streamlined Installation"</H3>
<P>
Many sites will find that the installation of DQS has been "streamlined,"
requiring less interaction to prepare a basic system for configuration
and testing. Sites which are already running DQS 3.1.2.4 and
need extensive local adaptations can use the more complex
"custom" installation process or the manual editing
of Makefile.proto, def.h, and dqs.h with which they are already
familiar. The new installation process is based on the GNU Autoconf
package.
<H3>job "wrapper" scripts</H3>
<P>
DQS 3.1.3 provides a mechanism for executing site-defined scripts
upon termination of the queued job. Such a script is executed by
the Process Leader and hence possesses root permissions, which can
be handy for specialized cleanup operations. This is important
for systems which support PVM and P4 daemons, which may have to
be stopped by the system when the MASTER process terminates abnormally.
<H3>elective linking or copying of output files during job execution
</H3>
<P>
DQS 3.1.3 supports the special handling of files on a host's
local disk during execution of a job, without the intervention
of the user. Options set in the DQS conf_file determine whether
the output files are to be left in place on the local disk, linked
to a site-defined file system or copied to a site-defined file
system.
<H3>Logging improvements</H3>
<P>
<FONT SIZE=2>All DQS log entries are now time-stamped with
the local time of the qmaster host system. The DEBUG and DEBUG_EXT
output is now written to a file (defined in def.h) instead of
stderr. This minimizes the jumbling of file output when several
processes attempt to write the file simultaneously. All error
messages are now numbered, and an appendix to this document lists
these error messages and suggests remedial actions when appropriate.</FONT>
<H2>Documentation</H2>
<P>
<FONT SIZE=2>The DQS 3.1.3 documentation has been reorganized...
again. The POSIX specification has been extracted from the document
body and is now an appendix. The reference manual pertains only
to the DQS 3.1.3 implementation, and all confusing references to
"standard" and "non-standard" options have been removed.
</FONT>
<P>
<FONT SIZE=2>The documentation consists of three principal chapters
and three appendices. The Installation and Maintenance Manual
is aimed primarily at the DQS system administrator. The User Guide
is targeted at the DQS user community. The Reference
Manual will be used by both users and administrators. Appendix
A contains a catalog of all DQS error messages with information
on methods for dealing with each error. Appendix B contains the
POSIX specification on which DQS 3.1.3 is based. Appendix C contains
several miscellaneous sections, including installation variants
and system-tuning guidelines.</FONT>
<P>
<FONT SIZE=2>The documentation is supplied in several forms:</FONT>
<OL>
<LI><FONT SIZE=2>Microsoft WORD (6.0 or 7.0 )</FONT>
<LI><FONT SIZE=2>PostScript</FONT>
<LI><FONT SIZE=2>HTML format (can be viewed with MOSAIC or any
of the commercial WEB browser products)</FONT>
</OL>
<H2>Installation</H2>
<P>
<FONT SIZE=2>DQS is designed to be installed on almost any existing
UNIX platform. The installation process must therefore cope with
the many differences and idiosyncrasies of varied hardware configurations
and operating systems. DQS 3.1.3 attempts to detect and resolve these
differences to minimize the need for operator actions, but even
the simplest installation will require some input
from the DQS administrator.</FONT>
<H2>Obtaining DQS 3.1.3</H2>
<P>
<FONT SIZE=2>DQS 3.1.3 can be obtained by ftp download from ftp.scri.fsu.edu/pub/dqs.
The README.313 file in that directory will indicate which version
should be downloaded. To reduce download bandwidth, improvements
and bug-fixes will be distributed on a file-by-file replacement
basis rather than requiring a complete download of the DQS 3.1.3
system. For this reason we do not envision distributing versions
such as DQS 3.1.3.1 ... DQS 3.1.3.n in the future. (But you
never know.) </FONT>
<H2>Setting up for installation</H2>
<P>
<FONT SIZE=2>DQS 3.1.3 is distributed as a compressed TAR file.
After this file is uncompressed it is recommended that the DQS
system be extracted (with TAR) into a directory which is accessible
by all operating systems for which DQS will be built. The DQS
installation process will create a separate directory in the sub-directory
../DQS/ARCS for each different architecture/operating system.</FONT>
<P>
<FONT SIZE=2>Once the DQS tree has been extracted the installation
process can be commenced by changing to ../DQS as the working
directory and typing "install". This UNIX script will
execute the system evaluation procedures and produce a description
of the system on which the installation is being done. Three choices
are offered to the administrator: "Quick Install", "Custom
Install" and "Quit Installation".</FONT>
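<P>
<FONT SIZE=2>Collected into one place, the setup steps above look roughly
like the following sketch. The tarball name DQS313.tar.Z is an assumption;
check README.313 for the actual file name:</FONT>

```shell
# Sketch of the setup steps; the tarball name is an assumption.
uncompress DQS313.tar.Z      # produces DQS313.tar
tar xf DQS313.tar            # extract into a directory visible to all build hosts
cd DQS                       # top of the extracted tree
./install                    # runs the system evaluation and offers the
                             # Quick Install / Custom Install / Quit choices
```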
<H2>"Quick" Install</H2>
<P>
<FONT SIZE=2>A very simple "Quick Install" feature
is provided to assist in an initial installation of DQS. For those
sites testing DQS for the first time we recommend using
this method. Choosing all of the defaults will result in an unrealistic
operating environment for DQS 3.1.3 but will offer a sample of
the system.</FONT>
<P>
<FONT SIZE=2>The choice of "Quick Install" produces
a list of defaults which will be used for the installation. The
user is asked to review this list to ensure that it meets their
requirements. The default-cell name and default initial queue
name are derived from the host-name of the machine on which the
installation process is being executed. If the installation
is being executed as "root" the system will be set up to
use "reserved" ports for communication; otherwise "non-reserved"
ports will be utilized.</FONT>
<P>
<FONT SIZE=2>The "quick" installation method is intended
for new DQS sites which wish to experiment with and evaluate DQS 3.1.3
and develop some experience on which to base an operational system
setup. If the installation parameters are acceptable the user
types "y" to accept them and begin the actual installation.
</FONT>
<P>
<FONT SIZE=2>The installation proceeds in six stages:</FONT>
<OL>
<LI><FONT SIZE=2>First the GNU configure program is used to determine
installation parameters for the host being used for the installation
process. One of the directories modified by the GNU configuration
program is the DQS CONFIG. Once it is updated, the DQS config
utility is built on that host platform.</FONT>
<LI><FONT SIZE=2>The DQS config program then asks the user to
provide a base directory to use for the installation of DQS binaries,
libraries and documentation, as well as the DQS configuration and
resolve files and directories. The default paths offered by the dialogue
are based on the current working directory (if running as non-root) or /usr/local/DQS
(when running as root). This latter path is commonly used at DQS
sites, as all hosts of a common architecture often share the path
"/usr/local". The simple install will only request one
starting point for building a DQS313 tree. If the administrator
wishes to differentiate the various components (binaries, libraries,
spool directories, etc.) they can type "CUSTOM" when
asked to enter an alternative base path. </FONT>
<LI><FONT SIZE=2>The next step invokes the make operation to
create all of the DQS 3.1.3 executables. The binaries are placed
in a subdirectory within the ../DQS/ARCS directory named for the
specific platform being built. This provides a separate repository
for each type of host system in the cluster. <U><B>NOTE </B></U>The
addition of "qidle" has created some installation problems
on SOLARIS platforms not using the GNU "C" compiler.
If error messages appear related to missing X Windows include
files or libraries the DQS administrator may have to add appropriate
compiler or linker directives to the Makefile.proto AFTER the
"configure" step is completed.</FONT>
<LI><FONT SIZE=2>The fourth step moves the binaries to the directory
from which they will be executed. This process renames the executables
by placing a tag "313" at the end of each name, to
differentiate these binaries from other DQS versions
which might have preceded DQS313. This step also moves the sample
conf_file and resolve file to the conf directory. </FONT>
<LI><FONT SIZE=2>The next step involves the addition of the three
DQS 3.1.3 entries to the /etc/services file on one or more hosts.
This step must be done with root permission and by someone with
UNIX system administration knowledge. While DQS attempts
to identify proper port numbers to be used in the /etc/services
file, local conditions may dictate another choice. Upon successful
completion of the installation the administrator can proceed to
"Testing the Installation". If error messages appear
and the installation is aborted the administrator should refer
to "Solving Installation Problems".</FONT>
<LI><FONT SIZE=2>Finally the administrator should proceed to the
step "Testing the DQS313 system".</FONT>
</OL>
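<P>
<FONT SIZE=2>For reference, /etc/services entries take the standard
name/port/protocol form. The service names and port numbers below are
purely illustrative assumptions; use the values suggested by the config
process (or your site's rational alternative):</FONT>

```
# Illustrative /etc/services entries for DQS -- names and ports are
# assumptions, not the values chosen by your config run.
dqs_qmaster    3608/tcp    # qmaster daemon
dqs_execd      3609/tcp    # execution daemon
dqs_qidle      3610/tcp    # qidle
```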
<H2>Custom Install<BR>
</H2>
<P>
<FONT SIZE=2>"Custom" Installation presents the administrator
with the same default configuration as the "Quick" install
process. Any of the parameters can be changed by the administrator
before the installation proceeds. Two choices are presented to
the administrator. The first initiates an interactive session
in which each parameter is displayed along with the proposed default and, if
a previous installation has been completed, the prior setup value.
The administrator may choose either of the displayed values or
enter their own parameter. </FONT>
<P>
<FONT SIZE=2>During this interactive exchange each parameter is
validated for consistency with the host system as well as DQS.
Upon completing the interactive setup the administrator may proceed
with the same installation steps as the "Quick" installation:</FONT>
<P>
<FONT SIZE=2>The installation proceeds in six stages; during
most of these the DQS administrator must make a selection as requested
by the config program.</FONT>
<OL>
<LI><FONT SIZE=2>First the GNU configure program is used to determine
installation parameters for the host being used for the installation
process. One of the directories modified by the GNU configuration
program is the DQS CONFIG. Once it is updated, the DQS config
utility is built on that host platform. The GNU configure
program will attempt to build all the Makefiles to use the GNU
"C" compiler "gcc". If the administrator wishes
to use an alternative compiler for any phase the following files
must be modified AFTER the GNU configure step is complete: CONFIG/Makefile.proto.in,
SRC/Makefile.proto.in and DQS/Makefile.proto.</FONT>
<LI><FONT SIZE=2>The DQS config program then asks the user to
provide a base directory to use for the installation of DQS binaries,
libraries and documentation, as well as the DQS configuration and
resolve files and directories. This base directory will then be
used to provide a "default path" for all items requiring
the entry of a file path. The default paths offered by the dialogue
are based on the current working directory (if running as non-root)
or /usr/local/DQS (when running as root); this latter path is
commonly used at DQS sites. At each interactive step, a default value is presented.
Typing a question mark "?" will provide a brief comment
about that entry (which is intended to be helpful). A more detailed
explanation of each item to be entered may be found in Appendix
C, Miscellaneous - "Key System Variables and Manual Installation".
</FONT>
<LI><FONT SIZE=2>The next step invokes the make operation to
create all of the DQS 3.1.3 executables. The binaries are placed
in a subdirectory within the ../DQS/ARCS directory named for the
specific platform being built. This provides a separate repository
for each type of host system in the cluster.</FONT>
<LI><FONT SIZE=2>The fourth step moves the binaries to the directory
from which they will be executed. The target binary directory is
prescribed by the administrator during the configure process. This
process renames the executables by placing a tag "313"
at the end of each name, to differentiate these binaries from other
DQS versions which might have preceded DQS313. This step also moves
the sample conf_file and resolve file to the conf directory. </FONT>
<LI><FONT SIZE=2>The next step involves the addition of the three
DQS 3.1.3 entries to the /etc/services file on one or more hosts.
This step must be done with root permission and by someone with
UNIX system administration knowledge. While DQS attempts
to identify proper port numbers to be used in the /etc/services
file, local conditions may dictate another choice. </FONT>
<LI><FONT SIZE=2>Upon successful completion of the installation
the administrator can proceed to "Testing the Installation".
If error messages appear and the installation is aborted the administrator
should refer to "Solving Installation Problems".</FONT>
</OL>
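<P>
<FONT SIZE=2>As a sketch of the compiler change described in step 1: the
edit below assumes the generated Makefile defines the compiler with a
conventional "CC = gcc" line (inspect your own Makefile.proto first). A
stand-in file is created here purely so the edit can be demonstrated:</FONT>

```shell
# Create a stand-in Makefile.proto (illustrative contents only).
printf 'CC = gcc\nCFLAGS = -O\n' > Makefile.proto

# Swap the GNU compiler for the vendor compiler AFTER the configure step.
sed 's/^CC = gcc$/CC = cc/' Makefile.proto > Makefile.proto.new
mv Makefile.proto.new Makefile.proto

grep '^CC' Makefile.proto
```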
<P>
<FONT SIZE=2>An optional approach is available to the knowledgeable
DQS administrator which omits all interaction. This requires
the editing of three DQS files used during the make process. Details
for this approach may be found in Appendix C, Miscellaneous
- "Key System Variables and Manual Installation".</FONT>
<H2>The Graphical Interface</H2>
<P>
<FONT SIZE=2>The X-Windows based DQS graphical interface is installed
as a separate step. Change directory to ../DQS/XSRC and read
the INSTALL script. The X-Windows interface is being restructured
and will be integrated fully in future DQS releases.</FONT>
<H2>Testing the installation</H2>
<P>
<IMG SRC="IMG00019.GIF">
<P>
<FONT SIZE=2>The installation process creates a series of directories
and subdirectories and two crucial files, the "conf_file"
(configuration file) and the "resolve_file". If the
system installation was completed correctly the conf_file will
contain information which will be read by every DQS binary
when it is started. This includes the DQS daemons, qmaster and
dqs_execd, and the DQS interface "utilities" qsub, qdel,
qmod, qconf, qstat, qrls, qhold and qmove. It is best that these
two files are accessible through an NFS/AFS/DFS file cross-mount.
If that is not possible then the administrator must ensure that
<U>identical </U>copies of these files are present on each host.</FONT>
<P>
<FONT SIZE=2>Once the binaries have been moved to their execution
directory (we will use the path /usr/local/DQS/bin for all
future examples), the qmaster can be started. If during the installation
process the administrator chose "FALSE (NO)" when asked
the question "Reserved ports?", then the /etc/services
file will have been updated (by a root user) with the three entries
suggested by the config process (or a rational alternative). The
conf_file will contain the names of these entries along with the
DEFAULT_CELL name, which must match the first entry on the first
(non-commented) line in the resolve file. The administrator should
make a visual check of these three crucial files, conf_file, resolve_file
and /etc/services, to make sure that they conform to these requirements.
</FONT>
</FONT>
<H4><FONT SIZE=2>QMASTER</FONT></H4>
<P>
<FONT SIZE=2><<I>The qmaster manages all resources for a single
DQS cell</I>.></FONT>
<P>
<FONT SIZE=2>Once satisfied that all is well, the qmaster can be
started by typing "/usr/local/DQS/bin/qmaster313". (We will use
the 313 appendage in all future discussions.) On this first occasion,
it would be useful to check that the process has actually started
by viewing the UNIX process status (ps). If the qmaster name does
not appear in the host's process list, the administrator should
check the "err_file" in the qmaster spool directory
(chosen during the DQS config stage; default: "/usr/local/DQS/common/conf").</FONT>
<P>
<FONT SIZE=2>If the qmaster appears to be operating, it can be
tested by executing the command "/usr/local/DQS/bin/qstat313
-f" on the same host where the qmaster313 is running. A normal
response to this command would be one or more lines of output
describing the status of the current queues. For brand new installations
this will be simply a header with no other lines. Error messages
may appear if things are not quite "in harmony"; refer to
"DQS Error Messages" and "Solving Installation
Problems" for assistance in this case.<BR>
</FONT>
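<P>
<FONT SIZE=2>The startup checks above can be collected into a short
session. This is a sketch only: the paths follow the examples in this
document, the ps options vary by platform, and the commands require a
completed DQS installation:</FONT>

```shell
# Start the qmaster daemon on the designated qmaster host.
/usr/local/DQS/bin/qmaster313

# Confirm that the daemon is actually running (ps flags vary by platform).
ps -e | grep qmaster313

# If it is missing, inspect the err_file in the qmaster spool directory,
# e.g. /usr/local/DQS/common/conf/err_file (path chosen during config).

# Otherwise, query the queue status; a new installation shows only a header.
/usr/local/DQS/bin/qstat313 -f
```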
<H4><FONT SIZE=2>DQS_EXECD</FONT></H4>
<P>
<FONT SIZE=2><<I>The dqs_execd is a DQS daemon which resides
on each host which has at least one queue and will be executing
DQS managed jobs</I>.></FONT>
<P>
<FONT SIZE=2>If the "qstat313" command succeeds, it
is time to start a dqs_execd, which actually manages a particular
queue. For this test, on the same host where the qmaster "dwelleth",
type the command "/usr/local/DQS/bin/dqs_execd313".
Again the UNIX process status should be examined (ps). If the
dqs_execd is not executing, refer to the err_file for significant
error messages. Consult "DQS Error Messages" and
"Solving Installation Problems" for assistance.</FONT>
<P>
<FONT SIZE=2>Executing the command "qconf -aq" (queue
configuration, add queue) will produce an edit session with the
default editor on that host. If the "qconf" command
yields an error message and shuts down, consult "Solving Installation
Problems". A queue "template" will be displayed
which can be modified using the editor commands. For this test
the queue name and queue host name should be changed to match
the name of the host on which the dqs_execd is executing. We
will deal with the remaining entries later (see "The Queue Configuration").</FONT>
<MENU>
<LI><FONT SIZE=2>Q_name <U><I><B>ibms30</B></I></U></FONT>
<LI><FONT SIZE=2>hostname <U><I><B>ibms30.scri.fsu.edu</B></I></U></FONT>
<LI><FONT SIZE=2>seq_no 0</FONT>
<LI><FONT SIZE=2>load_masg 1</FONT>
<LI><FONT SIZE=2>load_alarm 175</FONT>
<LI><FONT SIZE=2>priority 0</FONT>
<LI><FONT SIZE=2>type batch</FONT>
<LI><FONT SIZE=2>rerun FALSE</FONT>
<LI><FONT SIZE=2>quantity 1</FONT>
<LI><FONT SIZE=2>tmpdir /tmp</FONT>
<LI><FONT SIZE=2>shell /bin/csh</FONT>
<LI><FONT SIZE=2>klog /usr/local/bin/klog</FONT>
<LI><FONT SIZE=2>reauth_time 6000</FONT>
<LI><FONT SIZE=2>last_user_delay 0</FONT>
<LI><FONT SIZE=2>max_user_jobs 4</FONT>
<LI><FONT SIZE=2>notify 60</FONT>
<LI><FONT SIZE=2>owner_list NONE</FONT>
<LI><FONT SIZE=2>user_acl NONE</FONT>
<LI><FONT SIZE=2>xuser_acl NONE</FONT>
<LI><FONT SIZE=2>subordinate_list NONE</FONT>
<LI><FONT SIZE=2>complex_list NONE</FONT>
<LI><FONT SIZE=2>consumables NONE</FONT>
<LI><FONT SIZE=2>s_rt 7fffffff</FONT>
<LI><FONT SIZE=2>h_rt 7fffffff</FONT>
<LI><FONT SIZE=2>s_cpu 7fffffff</FONT>
<LI><FONT SIZE=2>h_cpu 7fffffff</FONT>
<LI><FONT SIZE=2>s_fsize 7fffffff</FONT>
<LI><FONT SIZE=2>h_fsize 7fffffff</FONT>
<LI><FONT SIZE=2>s_data 7fffffff</FONT>
<LI><FONT SIZE=2>h_data 7fffffff</FONT>
<LI><FONT SIZE=2>s_stack 7fffffff</FONT>
<LI><FONT SIZE=2>h_stack 7fffffff</FONT>
<LI><FONT SIZE=2>s_core 7fffffff</FONT>
<LI><FONT SIZE=2>h_core 7fffffff</FONT>
<LI><FONT SIZE=2>s_rss 7fffffff</FONT>
<LI><FONT SIZE=2>h_rss 7fffffff</FONT>
</MENU>
<P>
<FONT SIZE=2><BR>
</FONT>
<P>
<FONT SIZE=2>When the queue name and queue host name have been modified,
exit the editor in the normal manner (ESC-ZZ for vi or CTRL-X
CTRL-C for emacs). This will trigger the qconf utility to parse
the submitted definition and, if no syntactical errors are discovered,
create the requested queue.</FONT>
<MENU>
<LI><FONT SIZE=2>Queue Name Queue Type Quan Load
State</FONT>
<LI><FONT SIZE=2>---------- ---------- ----
---- -----</FONT>
<LI><FONT SIZE=2>ibms30 batch 0/1 0.10 dr
DISABLED</FONT>
</MENU>
<P>
<FONT SIZE=2>Note that the status entry in the right column of
the qstat output will display the word "DISABLED". All
new queues are initiated in DISABLED mode. To enable the queue
we need to invoke another DQS command "/usr/local/DQS/bin/qmod313
-e <queue name>" (modify queue, enable the queue name
given here as <queue name>).</FONT>
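<P>
<FONT SIZE=2>For the test queue created above this would be (queue name
"ibms30" taken from the template example; the command requires a running
DQS installation):</FONT>

```shell
# Enable the newly created queue; all new queues start out DISABLED.
/usr/local/DQS/bin/qmod313 -e ibms30
```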
<P>
<FONT SIZE=2>Again execute the "/usr/local/DQS/bin/qstat313
-f" command :</FONT>
<MENU>
<LI><FONT SIZE=2>Queue Name Queue Type Quan Load
State</FONT>
<LI><FONT SIZE=2>---------- ---------- ----
---- -----</FONT>
<LI><FONT SIZE=2>ibms30 batch 0/1 0.10 er
UP</FONT>
</MENU>
<H4><FONT SIZE=2>TEST SCRIPT</FONT></H4>
<P>
<FONT SIZE=2>Once the qmaster and at least one dqs_execd daemon
are running, a simple test can be performed. The directory ../DQS/tests
contains a collection of sample scripts. The entire contents of
this directory should be copied to a (non-root) user directory
owned by the administrator. As a first test, change directory to
this non-root directory and
type "/usr/local/DQS/bin/qsub313 dqs.sh". This will
submit the simple script to DQS:</FONT>
<MENU>
<LI><FONT SIZE=2>#!/bin/csh</FONT>
<LI><FONT SIZE=2>#$ -l qty.eq.1</FONT>
<LI><FONT SIZE=2>#$ -N UTESTJOB</FONT>
<LI><FONT SIZE=2>#$ -A dummy_account</FONT>
<LI><FONT SIZE=2>#$ -cwd</FONT>
<LI><FONT SIZE=2>echo 'we are now doing something else'</FONT>
<LI><FONT SIZE=2>printenv</FONT>
<LI><FONT SIZE=2>sleep 30</FONT>
<LI><FONT SIZE=2>echo 'end of script'</FONT>
</MENU>
<P>
<FONT SIZE=2> A message should appear in response to the qsub313
command:</FONT>
<P>
<FONT SIZE=2> "your job 1 has been submitted".</FONT>
<P>
<FONT SIZE=2>After 30 seconds the job should complete and in the
directory where the job was submitted two output files should
appear:</FONT>
<P>
<FONT SIZE=2> UTESTJOB.e1.25674 and UTESTJOB.o1.25674</FONT>
<P>
<FONT SIZE=2>The title UTESTJOB was established by the DQS
directive "#$ -N UTESTJOB". The next field (either
e1 or o1) contains the job number preceded by the type of file:
the stderr file for the job will have an "e" in that
position and the stdout file will have an "o". The UTESTJOB.e1.25674
file should be zero length; if not, examine its contents for the
cause of any error. The stdout file should begin with the line
'we are now doing something else', followed by a display of
the user's environment, and end with the line 'end of script'.</FONT>
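<P>
<FONT SIZE=2>The whole test session can be sketched as follows. The
source path for the sample scripts and the destination directory are
illustrative, and job and process numbers will differ per site:</FONT>

```shell
# Copy the sample scripts out of the DQS tree to a non-root directory.
cp -r /usr/local/DQS/tests ~/dqs-tests    # source path is illustrative
cd ~/dqs-tests

# Submit the sample job; DQS responds "your job 1 has been submitted".
/usr/local/DQS/bin/qsub313 dqs.sh

# After about 30 seconds the job completes; check for the output files.
ls UTESTJOB.o* UTESTJOB.e*
```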
<P>
<FONT SIZE=2></FONT>
<H2>COMPLETION OF INSTALLATION</H2>
<P>
<FONT SIZE=2>If the test script completes correctly, hosts can
be added, additional queues created and more complex job tests
submitted. If the "Quick Install" method was
chosen, the time has probably arrived to plan an operational cell
organization and set up resource files and queues. In order to
lay out an effective system it is important to understand how DQS
is constructed, the capabilities of its components and how they
may be tailored for a specific site.<BR>
</FONT>
<H2>System Topology and Operation<BR>
</H2>
<P>
<FONT SIZE=2>A basic DQS system consists of at least one computer
host which is running the qmaster program and at least one instantiation
of the dqs_execd daemon, which manages the actual execution of
jobs on the host which they 'inhabit'. All of the resources managed
and monitored by a qmaster are considered to be a "cell".
<BR>
<BR>
</FONT>
<P>
<IMG SRC="IMG00020.GIF"><BR>
<P>
<FONT SIZE=2>Within a cell there are three classes of programs
operating: the qmaster daemon, the dqs_execd daemon and the DQS
utilities, which include qsub, qstat, qmod, qconf, qdel, qhold
and qrls. </FONT>
<OL>
<LI><FONT SIZE=2>The qmaster maintains all of the critical files
and tables for a cell. There are actually two types of tables
managed by the qmaster which are called "queues". The
first is the job queue which is a linear, ordered list of all
jobs in the system. This list is sorted by job priority, an internal
job sub-priority (based on a site parameterized "fair use"
policy) and then by the order in which jobs have been submitted.
The second table type is a list of "execution queues",
where each potential target for running a job is defined by a
queue configuration for that target.</FONT>
<LI><FONT SIZE=2>The qmaster possesses a set of "auxiliary"
files which are used to maintain information for system security
and to parameterize DQS for specific site characteristics. Access
control lists, static and consumable resource definitions, and
a table of "trusted hosts" which are permitted to contact
the qmaster are "mirrored" in memory and on disk at all
times so that the qmaster can survive interruptions such as power-outages.</FONT>
<LI><FONT SIZE=2>The primary mode of operation of the qmaster
is "listening and waiting". The qmaster listens for
messages from other qmasters ( [e] which are managing their own
cells) , its own dqs_execd daemons [a] and the DQS utilities[b].
Periodically the qmaster examines the job list and attempts to
find an execution queue which can satisfy the requirements of
one or more jobs in the table.</FONT>
<LI><FONT SIZE=2>The basic operation of the dqs_execd is "sleep
through class... and wake up in time to answer a teacher's question
or hear the end-of-class bell". The "class bell"
in this case is a periodic event where the dqs_execd gathers information
on the health and state of the host machine on which it resides.
This period is defined in the "conf_file" and can be
varied by each site. At this point the "load average"
is sent to the qmaster [a] to provide the qmaster with information
to help it distribute jobs among the available hosts. (If the
conf_file parameter "DEFAULT_SORT_SEQ_NO" is set to
TRUE, the load average report is subservient to the sequence number
of a queue.)</FONT>
<LI><FONT SIZE=2>The "teacher's question" in this case
is a probe from the qmaster for a system integrity test or a
system request, usually to begin execution of a job [c]. At this
prodding the dqs_execd sets to work as we will see later.</FONT>
<LI><FONT SIZE=2>In a quiescent system, with no jobs queued, and
none executing, the qmaster and dqs_execd daemons continue their
"sleepy handshaking" described above. The term "sleepy"
was chosen because these programs have been designed to utilize
minimal system resources (memory and cpu cycles) on their hosts.
Thus both programs are either sleeping or performing the minor
handshaking indicated by the [a] in the diagram. In DQS313, a
qmaster in one cell does not poll or communicate with other qmasters
except to request an action such as moving a job from one queue
to another.</FONT>
<LI><FONT SIZE=2>Into the idle system described here, a user submits
a job from one of the "trusted hosts" in the system.
This could be a host in the cell which also houses the qmaster
or a dqs_execd, or a host with neither daemon which was
made a trusted host by the administrator using the "qconf
-ah … " command. Two validation steps occur upon invocation
of the "qsub" command. </FONT>
<OL>
<LI><FONT SIZE=2>The qsub command line and the script file are
scanned for DQS directives. DQS directives may occur in
either stream, but the scanning stops when a string is encountered
which is neither a comment nor a DQS directive. (The default flag
for a DQS directive is the character pair '#$'.) All DQS directives
are "parsed" for syntactical errors and rejected at
this point if problems are found.</FONT>
<LI><FONT SIZE=2>The syntactically verified command line and script
file are then sent to the qmaster (shown as [b] in the diagram).
The qmaster then performs a "semantic" validation of
the job request. By "semantic" here we mean "does
the request make sense in the context of this system at this time".
</FONT>
</OL>
</OL>
<P>
<FONT SIZE=2> The second test compares the user's request
for site-defined resources with those actually present
in the system at the moment. Unless the submitted job possesses
the DQS directive "-F" (force the acceptance of the
resource request), the request is rejected if one or more of the
requested resources do not exist. (Please note that this test verifies
that a resource is present in the system, not whether or not it is
in use by another job!)<BR>
</FONT>
<OL>
<LI><FONT SIZE=2>If the job request passes these tests it is placed
in the job queue [c]. This queue is "mirrored" on disk so
that it may be recovered after a system restart. When a new job
is placed in the list, the qmaster scheduler scans all the jobs
in the list and tries to find an execution queue which will satisfy
each entry's request for resources. This process does NOT begin
with the newly arrived job but begins at the top of the list,
so it is possible that the job submission may trigger the scheduling
of a previously submitted job and leave this job "awaiting
another time".</FONT>
<LI><FONT SIZE=2>At some point, motivated by the submission of
a new job, the termination of a running job or a period of seconds
defined by the "SCHEDULE_TIME" parameter in the conf_file,
the qmaster will scan the job list and find a job which meets
the resource requirements. The job description and script file
are "packaged up" and sent to the target dqs_execd [d].
The status information for the target queue is updated to indicate
the change of state and the identity of the job's host machine.
Where parallel jobs have been specified, the qmaster will assign
additional hosts and mark their status as running the selected
job. Slave processes, however, are initiated by the dqs_execd host
for the Master process, and not by the qmaster.</FONT>
<LI><FONT SIZE=2>The dqs_execd first records the job request information
in its own "mirror" disk file, so that it may be retrieved
in the event of a system restart while the job is executing. Then
the job is prepared for execution. This process consists of first
creating a separate UNIX process to monitor and manage the execution
of the job. In DQS313 this is called the "shepherd" process.
It is the presence of this "shepherd" which permits
a single dqs_execd to manage multiple job executions on the same
host and deals with the need for AFS re-authentication invisibly
to the executing jobs. </FONT>
<LI><FONT SIZE=2>The first step for the "shepherd"
is to establish an environment for the job which matches that
of the submitting user, modified by the parameters in the job
script and on the command line. Next the "shepherd"
determines how the system and the user wish to handle the stdout
and stderr files for the job. This is directed by DQS directives
and the system-wide parameters in the "conf_file". </FONT>
<LI><FONT SIZE=2>If one of the forms of parallel job execution
has been specified (the "-p" option in the DQS directives)
the Master dqs_execd will "remote-shell" the DQS task
"dsh" (distributed shell) to the target Slave processes.
The dqs_execd on each Slave host will start a process to manage
the SLAVE task. (In this release of DQS 3.1.3 this task is NOT
identical to the process shepherd and does not support AFS re-authentication
of the SLAVE process.)</FONT>
<LI><FONT SIZE=2>After the user's environment has been setup and
any SLAVE process managers started on other hosts, the DQS313
"process shepherd" sends the job startup notice (if
requested) and then launches the job.</FONT>
<LI><FONT SIZE=2>The "process shepherd" then enters
its own "sleep" loop, occasionally awakening to peek
at the running job and copy output files (as directed) to their
target directories. </FONT>
<LI><FONT SIZE=2>Upon job termination the "process shepherd"
executes a system-defined "add-on script" which usually
performs additional job-cleanup operations. The dqs_execd then
forms an accounting record, including job execution statistics,
which is sent to the qmaster, signaling the completion of all
activities related to the job [a]. Any SLAVE processes terminate
their own portion of a job independently. These SLAVE tasks are
usually shut down by their master process, according to the methodology
of the specific parallel paradigm: P4, MPI, TCGMSG or PVM.</FONT>
<LI><FONT SIZE=2>As with the "qsub" job submission program,
all DQS313 utilities interact only with the qmaster. The qmaster
rejects any requests if the originating computer is not in the
cell's host list. The qmaster then checks to see if the user has
permission to perform the actions. For example at most sites any
user can request a display of the queue status (qstat command),
while only a DQS administrator is permitted to add, delete or
disable queues.</FONT>
<LI><FONT SIZE=2>Thus in this system a valid request by a user
to delete one of their running jobs consists of the following
sequence: qdel sends the request to the qmaster; the qmaster validates
the request; the qmaster sends a job terminate message to the appropriate
dqs_execd; the qmaster sends an acknowledgment to the qdel utility;
qdel posts a message to the submitting user; the dqs_execd sends a
UNIX SIGKILL to the job; job termination triggers the dqs_execd to
gather usage data and send an end-job message to the qmaster; the
qmaster logs the accounting information, deletes all job information
and marks the queue as available for scheduling.</FONT>
</OL>
<H2>Cells, Hosts, Queues</H2>
<P>
<FONT SIZE=2>In the previous section a diagram of the elements
constituting a "cell" was displayed. A DQS313 site
may have several independent cells, or they may be aggregated
into a common operating environment:</FONT>
<P>
<IMG SRC="IMG00021.GIF">
<P>
<FONT SIZE=2>This example displays three cells, A, B and C, each
managed by its own qmaster: QM-A, QM-B or QM-C. The hosts are labeled
A1 and A2 for Cell-A, B1, B2 and B3 for Cell-B, and C1 and C2 for
Cell-C. For this discussion we will assign the qmasters to a separate
host in each cell. QM-A will thus be on host A0, QM-B on host
B0 and QM-C on host C0. </FONT>
<P>
<FONT SIZE=2>Communications among the various hosts in a cell
and between cells is structured by the inclusion of a host within
a qmaster's host list. In the above example qmaster QM-A has four
hosts in its table, A0(its own host), A1, A2 and B0(the qmaster
host for cell B). Instead of a completely symmetrical inter-cell
arrangement here we have chosen to not have QM-A linked with QM-C.
Thus neither of these qmasters will have the other cell's qmaster
host in its own hosts table.</FONT>
<P>
<FONT SIZE=2>An option, which is less secure, is to permit a
host from one cell to contact the qmaster in another cell (as
shown by path [c]). In this case host B3 could execute utilities
and perhaps launch jobs in Cell-C as well as Cell-B. Even without
this "sneak path", hosts in cells A and C can interrogate
the status of queues in Cell-B, if the user permissions allow
such an activity.</FONT>
<P>
<FONT SIZE=2>Note, once again, that a host in a cell may have
no queues assigned to it for execution, or it may have one or
more queues assigned to it. It is also quite common to have a
dqs_execd running on the same host as the qmaster daemon. The
DQS313 utilities can be executed on any host in a cell, regardless
of whether that host is running a dqs_execd daemon.</FONT>
<P>
<FONT SIZE=2>The first level of security within DQS is then a
"trust" relationship among a cell's hosts and between
each cell's qmasters. The next level of security is the level
of permissions established by a qmaster's "manager"
and "administrator" lists. The third level of security
is defined by specific user permissions or exclusions for each
queue. Certain activities are permitted to a DQS administrator
or manager which a queue owner may not invoke. Among them are
deleting the queue itself or changing its configuration. A queue
owner and the DQS managers may perform activities such as queue
suspension, which, of course, the average user is prohibited from
doing.<BR>
</FONT>
<H2>System Directories</H2>
<P>
<FONT SIZE=2>To manage system security, queues, jobs and user
access, a number of directories are created during the startup
process. The DQS administrator will normally not have to deal
with these directories nor their contents. However, when all DQS
files cannot or should not be cross-mounted, it is important that
the function of these elements is understood so that they can
be placed correctly in the system. </FONT>
<H3>Shared & Local</H3>
<P>
As indicated in the installation instructions, the easiest method
for managing a DQS system is to have all the system files and directories
mounted by NFS, AFS or DFS on all hosts. The one exception is
the directories containing the binaries for the DQS executables,
which, of course, should only be shared by hosts with identical
architecture and operating system configurations. A knowledgeable
administrator may wish to make changes directly to the contents
of one of these directories. Where appropriate a hint or two are
provided to assist the system manager. A typical installation
will possess a directory tree somewhat like the following (names
ending in "/" are directories, all other names are files):
<PRE>
/usr/local/DQS/
    common/
        bin/
        conf/
            resolve_file
            conf_file
            act_file
            log_file
            qmaster/
                QM-A/
                    common_dir/
                        complex_file
                        consumables_file
                        generic_queue
                        host_file
                        man_file
                        op_file
                        seq_num_file
                        acl_file
                    job_dir/
                        job1
                        job2
                        ..
                    queue_dir/
                        queue-A1
                        queue-A2
                        queue-A3
                        ..
                    tid_dir/
                        tid_#xxxx
                        tid_#xxxx
                        ..
                    pid_file
                    stat_file
                    core
            dqs_execd/
                host-A1/ ... host_An/
                    exec_dir/
                        script_file
                    job_dir/
                        job1
                        job2
                        ..
                    rusage_dir/
                        current_usage
                    tid_dir/
                        tid_#xxxx
                        tid_#xxxx
                        ..
                    pid_file
                    core
</PRE>
<P>
Four system files are classed as "should be shared by all
hosts, if at all possible". They are:
<P>
<I><B><FONT SIZE=2>conf_file</FONT></B></I><FONT SIZE=2> --- This
file is created during the DQS313 "config" step of the
installation or system update. This file contains the system-wide
configuration which is read by the qmaster, dqs_execd and all
DQS utilities when they start up. If it is necessary to make changes
to this file, the qmaster and all dqs_execd's should be shut down
and restarted after the changes are complete, so that they will
possess the latest configuration. Failure to observe this step
may often result in bizarre and unexplained behavior of the system,
if not an outright collapse. If this file cannot be cross-mounted
by all hosts, then an IDENTICAL COPY of this file needs to be
distributed to all hosts before restarting the qmaster or dqs_execd
daemons or any of the command-utilities. </FONT>
<P>
<FONT SIZE=2>The location from which this file is read is "hard-wired"
into the compiled DQS code based on the #define CONF_FILE statement
in the dqs.h file, which is also created by the DQS "config"
step. It is important to understand that the default installation
setup places the conf_file in the "/usr/local/DQS/common/conf"
directory, which is also used as the default location for the
qmaster and dqs_execd spool directories. While those directories
can be relocated by changing the conf_file and restarting the
daemons, the location of the resolve_file and conf_file can only
be changed by modifying "dqs.h" with an editor or by
re-executing the "config" program.</FONT>
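<P>
<FONT SIZE=2>For illustration only, the relevant lines written into
"dqs.h" by the "config" step might resemble the following;
the exact macro spellings and paths are version- and site-dependent:</FONT>
<PRE>
#define CONF_FILE    "/usr/local/DQS/common/conf/conf_file"
#define RESOLVE_FILE "/usr/local/DQS/common/conf/resolve_file"
</PRE>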
<P>
<FONT SIZE=2>The following are the initial entries in the conf
file with a description of each line's effect on the system.<BR>
</FONT>
<MENU>
<LI><FONT SIZE=2> QMASTER_SPOOL_DIR /usr/local/DQS/common/conf</FONT>
<MENU>
<LI><FONT SIZE=2>This parameter points to the starting directory
from which the qmaster's sub-directories are created. While at
some sites with several cells the resulting tree can be shared
by multiple qmasters, it is only necessary that the qmaster have
access to the sub-directories for itself. This tree appears above
as "…/qmaster/QM-A".</FONT>
</MENU>
<LI><FONT SIZE=2> EXECD_SPOOL_DIR /usr/local/DQS/common/conf</FONT>
<MENU>
<LI><FONT SIZE=2>This parameter points to the starting directory
from which all of the dqs_execd's in the cell will find their
individual queue management directories. In the default DQS setup
all dqs_execd's in a cell use this same directory tree, terminating
in their own specific set of sub-directories. This is illustrated
in the preceding diagram by "../dqs_execd/host-A1".</FONT>
</MENU>
<LI><FONT SIZE=2> DEFAULT_CELL user-network</FONT>
<MENU>
<LI><FONT SIZE=2>The system-wide, unique name for a given cell.
This can be any arbitrary ASCII string and is defaulted to the
qmaster's host domain name during the installation process. If
this name is changed the corresponding string in the "resolve_file"
must be changed accordingly… and vice-versa.</FONT>
</MENU>
<LI><FONT SIZE=2> RESERVED_PORT TRUE</FONT>
<MENU>
<LI><FONT SIZE=2>This parameter indicates that all daemons and
utilities in a cell will be using UNIX reserved ports for socket
communications. UNIX system port numbers from 0 to 1023 are designated
as "reserved". If this parameter is set to TRUE then
all of the DQS313 programs MUST execute with root ownership. If
this parameter is set to FALSE then the /etc/services port numbers
for DQS313 services must be 1024 or greater.</FONT>
</MENU>
<LI><FONT SIZE=2> DQS_EXECD_SERVICE dqs313_dqs_execd</FONT>
<MENU>
<LI><FONT SIZE=2>Any arbitrary ASCII string can be used to identify
the tcp port number to be used when the qmaster or the DQS utility
"dsh" is communicating with the dqs_execd. The only
requirement is that this name must be unique among all names in
the /etc/services file.</FONT>
</MENU>
<LI><FONT SIZE=2> QMASTER_SERVICE dqs313_qmaster</FONT>
<MENU>
<LI><FONT SIZE=2>Any arbitrary ASCII string can be used to identify
the tcp port number to be used when the dqs_execd or DQS utilities
are communicating with the qmaster. The only requirement is that
this name must be unique among all names in the /etc/services
file.</FONT>
</MENU>
<LI><FONT SIZE=2> INTERCELL_SERVICE dqs313_dqs_intercell</FONT>
<MENU>
<LI><FONT SIZE=2>Any arbitrary ASCII string can be used to identify
the tcp port number to be used when one qmaster is communicating
with another qmaster. The only requirement is that this name must
be unique among all names in the /etc/services file.</FONT>
</MENU>
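<LI><FONT SIZE=2>As an illustration, if RESERVED_PORT were set to
FALSE the three service names above might appear in /etc/services
as follows; the port numbers shown are arbitrary, site-chosen examples,
not defaults:</FONT>
<PRE>
dqs313_qmaster        2301/tcp
dqs313_dqs_execd      2302/tcp
dqs313_dqs_intercell  2303/tcp
</PRE>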
<LI><FONT SIZE=2> KLOG /usr/local/bin/klog</FONT>
<MENU>
<LI><FONT SIZE=2>The re-authentication process in AFS systems
will use the klog program. This entry is only used when AFS support
was selected during DQS installation.</FONT>
</MENU>
<LI><FONT SIZE=2> REAUTH_TIME 60</FONT>
<MENU>
<LI><FONT SIZE=2>If AFS has been selected, all daemons and executing
jobs will be re-authenticated at intervals of this number of
seconds.</FONT>
</MENU>
<LI><FONT SIZE=2> MAILER /bin/mail</FONT>
<MENU>
<LI><FONT SIZE=2>All jobs can select options to send brief "job
startup", "job end" and "job abort" messages
to one or more designated users. In addition the DQS313 system
will send mail messages to the administrator in the event of extraordinary
system events.</FONT>
</MENU>
<LI><FONT SIZE=2> DQS_BIN /usr/local/DQS/bin</FONT>
<MENU>
<LI><FONT SIZE=2>The qmaster, dqs_execd and all user initiated
utilities locate their binaries in the BIN_DIR established during
the "config" step of installation. This entry is set
by that step, and acts as a "place-holder" for that
target directory. This parameter is used, however, by the parallel
queue management system. If the administrator wishes, this parameter
can be changed to point to a different directory where PVM, P4, TCGMSG
and MPI support programs may reside. Doing so will not affect
the continued use of the BIN_DIR for the remaining DQS executables.</FONT>
</MENU>
<LI><FONT SIZE=2> ADMINISTRATOR admin@host_machine
</FONT>
<MENU>
<LI><FONT SIZE=2>On startup of the qmaster this entry is used
to identify the primary DQS administrator for this cell. This
also forms the email address used to send system error messages.</FONT>
</MENU>
<LI><FONT SIZE=2> DEFAULT_ACCOUNT GENERAL</FONT>
<MENU>
<LI><FONT SIZE=2>Any arbitrary ASCII string (without separator
characters such as blanks, periods, commas) can be used as an
account identifier. Each job submission can provide its own account
identifier, which overrides this default string. No validation
is performed on this or the user submitted account name string.
When a job terminates a record is created from hardware and software
usage data. The "account string" is appended and the
record is appended to the qmaster's "act_file".</FONT>
</MENU>
<LI><FONT SIZE=2> LOGMAIL FALSE</FONT>
<MENU>
<LI><FONT SIZE=2>By default, none of the mail generated by
DQS, either to users or the system's managers, is logged. Setting
this parameter to TRUE will cause the qmaster to create a mail
log file, where all system emails are recorded and time-stamped.</FONT>
</MENU>
<LI><FONT SIZE=2> DEFAULT_RERUN FALSE</FONT>
<MENU>
<LI><FONT SIZE=2>It is our sincere hope to have the rerun feature
of DQS implemented in future versions. In DQS313 this parameter
is ignored.</FONT>
</MENU>
<LI><FONT SIZE=2> DEFAULT_SORT_SEQ_NO FALSE</FONT>
<MENU>
<LI><FONT SIZE=2>During the qmaster's scheduling process two major
steps occur. First the jobs themselves are sorted according to
their submitted priorities and internal policy criteria. Second
all of the available queues are scanned to find one which suits
the needs of the first job to be scheduled. The ordering of this
queue scanning process can be changed by this parameter. When
this parameter is FALSE all of the queue entries are sorted in
increasing order of their host's usage data (as reported by
the dqs_execd). Thus the first queue examined will be the least
"busy" queue, in an effort to spread the workload across
the system.</FONT>
<LI><FONT SIZE=2>If this parameter is set to TRUE the queues are
examined in the order of the sequence number assigned by the
administrator in each queue configuration. Many sites use this
method to ensure that their most powerful hosts are scanned first,
by assigning very low sequence numbers to the queues on those
hosts. </FONT>
</MENU>
<LI><FONT SIZE=2> SYNC_IO FALSE</FONT>
<MENU>
<LI><FONT SIZE=2>In multi-host systems utilizing NFS-mounted files
it is possible for I/O actions to become disordered in their results.
The ordering of lines of output sent to stdout or stderr can become
totally confused. DQS313 is intended to have a feature in its
"process shepherd" to ensure that all stdout and stderr
output is properly time-sequenced, even when multiple SLAVE processes
are involved. In the initial DQS313 release this feature is not
active.</FONT>
</MENU>
<LI><FONT SIZE=2> USER_ACCESS ACCESS_FREE</FONT>
<MENU>
<LI><FONT SIZE=2>This feature for differentiating levels of access
for users or classes of users is not implemented in DQS313.</FONT>
</MENU>
<LI><FONT SIZE=2> LOGFACILITY LOG_VIA_COMBO</FONT>
<LI><FONT SIZE=2>Many system messages are generated to aid in
the maintenance and diagnosis of DQS operation. Three files are
used for this activity, the "err_file", the "log_file"
and the "syslog_file". Depending on the level of attention
required, messages are directed to one of these files. All messages
with levels of ERR, CRIT, or WARNING are always sent to the err_file.
Messages with levels of INFO, WARNING or NOTICE can be sent to the system
log or the normal activity log file. The normal mode is to use
both the system log and the normal log file. In DQS313 the system
log has been disabled, so that all non-error messages are directed
to the "log_file".</FONT>
<LI><FONT SIZE=2> LOGLEVEL LOG_INFO</FONT>
<MENU>
<LI><FONT SIZE=2>Information is logged depending on the level
assigned within the DQS. In increasing order they are LOG_INFO,
LOG_NOTICE, LOG_WARNING, LOG_ERR, LOG_CRIT, LOG_ALERT, LOG_EMERG.
Setting the LOGLEVEL parameter establishes the minimum level
of messages to be recorded. A parameter of LOG_INFO ensures that
all messages will appear in the "log_file".</FONT>
</MENU>
<LI><FONT SIZE=2> MIN_UID 10</FONT>
<LI><FONT SIZE=2> MIN_GID 10</FONT>
<MENU>
<LI><FONT SIZE=2>For security reasons it is desirable to establish
a minimum user and group identifier (uid or gid) which will be permitted
in execution of any of the DQS utilities. The qmaster and dqs_execd,
of course, normally operate at root level. The recommended setting
is "10" for these parameter values, as most UNIX critical
processes run with uid and gid values below "10". It
is strongly recommended that these default values be retained.
</FONT>
<LI><FONT SIZE=2>Attempts to run DQS utilities such as qsub, qalter,
qstat, etc. by users whose uid or gid falls below these values will
fail, which is the "correct", albeit confusing (to new system
managers), behavior of DQS.</FONT>
</MENU>
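<LI><FONT SIZE=2>The effect of these two parameters can be sketched
as follows; this is a simplified stand-in for the internal check,
not DQS's actual code:</FONT>

```shell
# Sketch of the MIN_UID/MIN_GID gate applied by the DQS utilities.
# A caller whose uid (or gid) is below the configured minimum is refused.
check_dqs_uid() {
    uid="$1"
    min_uid="${2:-10}"    # conf_file default MIN_UID
    if [ "$uid" -lt "$min_uid" ]; then
        echo "rejected"   # e.g. root (uid 0) may not run qsub
    else
        echo "accepted"
    fi
}

check_dqs_uid 0        # a system account: rejected
check_dqs_uid 1001     # an ordinary user: accepted
```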
<LI><FONT SIZE=2> MAXUJOBS 10</FONT>
<MENU>
<LI><FONT SIZE=2>There are a number of DQS "system policy"
parameters available to the DQS313 administrator. One of these
is a system-wide limit on the total number of jobs a user may
have considered for scheduling at any one time. This is not a
limit on the total number of jobs which a user can have queued
up in the system, but it does instruct the qmaster not to consider
more than MAXUJOBS for a user during a scheduling pass. The effect
of this limit can become quite subtle. For example, if a limit
of 10 is established and the user submits 100 jobs, they will
be ordered in sequence of their priority and submission time.
If the first ten of these jobs require system resources not currently
available, they cannot be scheduled. Neither will any following
jobs, which may need some resource which is actually available.
An additional user limit can be found in each queue configuration.</FONT>
</MENU>
<LI><FONT SIZE=2> OUTPUT_HANDLING LEAVE_OUTPUT_FILES</FONT>
<MENU>
<LI><FONT SIZE=2>When a job is started by the qmaster it may
produce large stdout or stderr files. The writing of these
files to a remote, NFS-mounted file system can have negative
impacts on system performance. In some cases, retaining these
files on a host's local filesystem could prevent network congestion
and minimize I/O delays for the running job. DQS313 provides three
options for handling these output files. The default LEAVE_OUTPUT_FILES
causes the stdout and stderr files to be left in the working directory
established by the user's "qsub" script. </FONT>
<LI><FONT SIZE=2>This parameter can be changed to LINK_OUTPUT_FILES.
In this case the administrator must create a special file in one
or all of the dqs_execd spool directories. The name of this file
is defaulted to "netpath" during the DQS "config"
step. This default name may be changed in the dqs.h file by the
administrator, if they are prepared to recompile the entire DQS313
system. The "netpath" file should contain one ASCII
line defining the fully qualified network path of the target directory
into which the stdout and stderr files are actually to be placed.</FONT>
<LI><FONT SIZE=2>If the parameter is set to COPY_OUTPUT_FILES
the DQS313 process "shepherd" creates temporary standard
output and standard error files local to the host executing the
job. A special "copy" process is started which wakes
up periodically (set by the hard-wired COPY_FILE_DELAY in the
dqs.h file), and copies the current contents of those files to
their actual destination.</FONT>
</MENU>
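<LI><FONT SIZE=2>As an illustration, a "netpath" file holds
a single ASCII line naming the target directory. The path below is
hypothetical; the exact form depends on the site's mounts:</FONT>
<PRE>
/nfs/bigserver/dqs_output
</PRE>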
<LI><FONT SIZE=2> ADDON_SCRIPT NONE</FONT>
<MENU>
<LI><FONT SIZE=2>At the conclusion of a user's job, and in the
working space of that job, it is sometimes necessary to conduct
system cleanup tasks. This is particularly true of parallel processing
tasks, which might leave "orphan" daemons running
in the event of unplanned process termination. A system script
maintained within the DQS can be created and invoked at the conclusion
of EVERY user job. This parameter must then contain the fully
qualified path-name of this script file.</FONT>
</MENU>
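<LI><FONT SIZE=2>A hypothetical add-on script might look like the
following; the daemon names and temporary file are illustrative of
PVM-style cleanup and are not part of DQS itself:</FONT>

```shell
# dqs_addon_cleanup: example end-of-job cleanup invoked after EVERY job.
# Removes leftover parallel-processing daemons belonging to this user.
dqs_addon_cleanup() {
    for d in pvmd3 p4_server; do           # illustrative daemon names
        pids=$(pgrep -u "$(id -un)" -x "$d" 2>/dev/null || true)
        if [ -n "$pids" ]; then
            kill $pids                     # terminate orphaned daemons
        fi
    done
    rm -f "/tmp/pvmd.$(id -u)"             # stale PVM daemon socket file
    echo "cleanup complete"
}
dqs_addon_cleanup
```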
<LI><FONT SIZE=2> ADDON_INFO NONE</FONT>
<MENU>
<LI><FONT SIZE=2>When OUTPUT_HANDLING is set to anything other
than LEAVE_OUTPUT_FILES, the system administrator may wish to
maintain a diagnostic awareness of the "process shepherd"
handling of the copying or linking of a user's stdout and stderr
files. If this parameter is set to something other than NONE,
the parameter string should be a fully qualified path to a file
containing an ASCII string to be appended to the "stdout"
file along with other job information. </FONT>
</MENU>
<LI><FONT SIZE=2> LOAD_LOG_TIME 30</FONT>
<MENU>
<LI><FONT SIZE=2>Upon startup the dqs_execd uses this parameter
(specified in seconds) as the minimum period at which it
delivers system usage statistics to the qmaster. </FONT>
</MENU>
<LI><FONT SIZE=2> STAT_LOG_TIME 600</FONT>
<MENU>
<LI><FONT SIZE=2>Various system statistics, beyond the host usage
provided by the dqs_execd daemons, are gathered periodically,
based on the value of this parameter (specified in seconds).</FONT>
</MENU>
<LI><FONT SIZE=2> SCHEDULE_TIME 60</FONT>
<MENU>
<LI><FONT SIZE=2>The qmaster scans the cell's job queue after
every new job is submitted to the system or upon termination of
a running job. Absent these occurrences the qmaster will trigger
a scheduling pass of the jobs based on this parameter (in seconds).</FONT>
</MENU>
<LI><FONT SIZE=2> MAX_UNHEARD 90</FONT>
<MENU>
<LI><FONT SIZE=2>The qmaster does not poll other daemons for their
status. Instead it updates the queue status for each dqs_execd
which reports in. If a dqs_execd fails to report in to the qmaster
within this threshold (seconds) the qmaster will mark all queues
managed by that dqs_execd as "status UNKNOWN". This status
is updated every interval, and can change from UNKNOWN to
UP once the dqs_execd has succeeded in updating the qmaster.</FONT>
</MENU>
<LI><FONT SIZE=2> ALARMS 3</FONT>
<LI><FONT SIZE=2> ALARMM 4</FONT>
<LI><FONT SIZE=2> ALARML 5</FONT>
<MENU>
<LI><FONT SIZE=2>The admonition to "avoid changing these
parameters" in the installation instructions is well founded.
These parameters control the amount of time permitted before the
UNIX system interrupts an attempt at inter-host communication.
The ALARMS value is the time in seconds before a DQS utility such
as qsub or qmod is interrupted. A message "Alarm Clock
Shutdown" will appear for the user, indicating that the utility
could not contact the qmaster within "ALARMS" seconds.
The ALARMM parameter sets a similar limit on dqs_execd<->qmaster
communications attempts. ALARML is the longest period established
for inter-process interchange attempts, and is used to control
qmaster<->qmaster communications. </FONT>
<LI><FONT SIZE=2>In systems where the qmaster host is also running
other jobs, or where the network interconnect can become congested,
it is possible for one or more communications attempts to fail due
to an ALARM time-out. If the err_file contains frequent "ALARM
CLOCK Shutdown" warnings, or utility execution fails often
with similar error messages, the three ALARM parameters should
be increased. These values should nonetheless be kept as small as
practical to prevent a failing DQS element from tying up the host's
tcp/ip interface.</FONT>
</MENU>
</MENU>
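<P>
<FONT SIZE=2>Collected together, the initial entries discussed above
form a conf_file similar to the following. The values shown are the
defaults from this section; paths and names will differ from site
to site, and the "config" step's own output should be taken
as authoritative for exact formatting:</FONT>
<PRE>
QMASTER_SPOOL_DIR /usr/local/DQS/common/conf
EXECD_SPOOL_DIR /usr/local/DQS/common/conf
DEFAULT_CELL user-network
RESERVED_PORT TRUE
DQS_EXECD_SERVICE dqs313_dqs_execd
QMASTER_SERVICE dqs313_qmaster
INTERCELL_SERVICE dqs313_dqs_intercell
KLOG /usr/local/bin/klog
REAUTH_TIME 60
MAILER /bin/mail
DQS_BIN /usr/local/DQS/bin
ADMINISTRATOR admin@host_machine
DEFAULT_ACCOUNT GENERAL
LOGMAIL FALSE
DEFAULT_RERUN FALSE
DEFAULT_SORT_SEQ_NO FALSE
SYNC_IO FALSE
USER_ACCESS ACCESS_FREE
LOGFACILITY LOG_VIA_COMBO
LOGLEVEL LOG_INFO
MIN_UID 10
MIN_GID 10
MAXUJOBS 10
OUTPUT_HANDLING LEAVE_OUTPUT_FILES
ADDON_SCRIPT NONE
ADDON_INFO NONE
LOAD_LOG_TIME 30
STAT_LOG_TIME 600
SCHEDULE_TIME 60
MAX_UNHEARD 90
ALARMS 3
ALARMM 4
ALARML 5
</PRE>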
<P>
<P>
<I><B><FONT SIZE=2>resolve_file</FONT></B></I><FONT SIZE=2> --- This
file is also created during the DQS "config" process.
It is the equivalent of a combination of the UNIX "resolv.conf"
and "hosts.equiv" files for managing network security.
The default resolve_file is:</FONT>
<PRE>
# NOTE! blank lines NOT permitted #
# NOTE! fields must be separated by one(1) AND ONLY one space #
# 1st field = cell_name
# 2nd field = primary qmaster
# 3rd field = primary qmaster alias
# 4th field = secondary qmaster
# 5th field = secondary qmaster alias
user-network QM-A0 QM-A0.user.com NONE NONE
</PRE>
<P>
<P>
<FONT SIZE=2>The comment lines direct the DQS manager as to the
format of new entries or entry changes. Some of the aspects of
this file need further explanation.</FONT>
<OL>
<LI><FONT SIZE=2>The cell name appearing in the first field of
the first non-commented line MUST be identical to the name appearing
as the DEFAULT_CELL parameter in the conf_file.</FONT>
<LI><FONT SIZE=2>DQS313 does not yet support alternate qmasters
and thus the last two fields of each non-commented line must be
"NONE" and "NONE"</FONT>
<LI><FONT SIZE=2>Additional cells may be defined by adding lines
to the resolve_file following the primary cell entry. If a host
in one cell is permitted to contact a qmaster in another cell
(via a "sneak path") then the cell name and qmaster
name for that other cell must appear in the source cell's resolve_file.</FONT>
</OL>
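<P>
<FONT SIZE=2>As an example (the second cell's names are hypothetical),
a resolve_file permitting contact with a qmaster in a second cell
would read:</FONT>
<PRE>
user-network QM-A0 QM-A0.user.com NONE NONE
other-network QM-B0 QM-B0.other.com NONE NONE
</PRE>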
<P>
<I><B><FONT SIZE=2>err_file</FONT></B></I><FONT SIZE=2> --- The
qmaster, dqs_execd and all DQS utilities may originate error messages
which are directed to a hard-wired filename "err_file".
This name is created during the DQS "config" step and
implanted in the "dqs.h" include file in the ../DQS/SRC
directory. The installation process assumes that all DQS313 programs
will have write-access to the path name which appears as QMASTER_SPOOL_DIR
in the conf_file. If this path name is inappropriate for ALL DQS
programs the administrator may choose to change the definition
of ERR_FILE in the include file "dqs.h". This will require
recompilation of the entire DQS313 system.</FONT>
<P>
<FONT SIZE=2>As an alternative, the administrator may choose to
let each program write to its own "err_file" and gather
and collate all the files when it is necessary to examine error
information. In this case, however the path-name accessible by
each host must be identical to the QMASTER_SPOOL_DIR name.</FONT>
<P>
<I><B><FONT SIZE=2>log_file</FONT></B></I><FONT SIZE=2> --- The
qmaster, dqs_execd and all DQS utilities may originate log messages
which are directed to a hard-wired filename "log_file".
This name is created during the DQS "config" step and
implanted in the "dqs.h" include file in the ../DQS/SRC
directory. The installation process assumes that all DQS313 programs
will have write-access to the path name which appears as QMASTER_SPOOL_DIR
in the conf_file. If this path name is inappropriate for ALL DQS
programs the administrator may choose to change the definition
of LOG_FILE in the include file "dqs.h". This will require
recompilation of the entire DQS313 system.</FONT>
<P>
<FONT SIZE=2>As an alternative, the administrator may choose to
let each program write to its own "log_file" and gather
and collate all the files when it is necessary to examine logged
information. In this case, however, the path-name accessible by
each host must be identical to the QMASTER_SPOOL_DIR name.</FONT>
<H3>Qmaster</H3>
<P>
The qmaster directory contains a major sub-directory for each
qmaster registered in this cell. Each qmaster's directory contains
four sub-directories whose contents change constantly during DQS313
operation, and hence must permit write operations on all files.
There are also two files created by the qmaster, the pid_file
and stat_file. An additional, unwelcome file may appear here as
well: in the event of a qmaster crash, its core file will be placed
in this directory.
<H4><FONT SIZE=2>common_dir</FONT></H4>
<P>
<FONT SIZE=2>This directory contains files common to the scheduling
and dispatching of jobs by the qmaster.</FONT>
<P>
<B><FONT SIZE=2>complex_file</FONT></B><FONT SIZE=2> --- This file
contains all of the definitions of complexes created by the add
complex command (qconf -ac).</FONT>
<P>
<B><FONT SIZE=2>consumables_file</FONT></B><FONT SIZE=2> --- This
file contains all of the definitions of consumable resources created
by the add consumable resource command (qconf -acons).</FONT>
<P>
<B><FONT SIZE=2>generic_queue</FONT></B><FONT SIZE=2> --- This file
is read by the qmaster each time the create queue command (qconf
-aq) is performed and no name is provided as a parameter following
the "-aq" option flag. The contents of this file form
the starting template presented in the editor for modification
by the administrator.</FONT>
<P>
<B><FONT SIZE=2>host_file</FONT></B><FONT SIZE=2> --- The host_file
is read at startup of the qmaster and contains a list of all
the hosts known to the qmaster, occasionally called "trusted
hosts". Any program attempting to contact the qmaster must
have its host's name in this list or be rejected. On the initial
startup of the qmaster this file will not be present; the qmaster
will post a warning in the err_file and create the host_file.</FONT>
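<P>
<FONT SIZE=2>Following the Cell-A example from the earlier discussion
(host names are hypothetical, and one fully qualified host name per
line is assumed here), a host_file might contain:</FONT>
<PRE>
A0.user.com
A1.user.com
A2.user.com
B0.user.com
</PRE>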
<P>
<B><FONT SIZE=2>man_file</FONT></B><FONT SIZE=2> --- This file contains
the login names of all individuals identified as cell "managers".
A cell "manager" is given permission to access all DQS313
system files and to execute every option of every DQS313 utility.</FONT>
<P>
<B><FONT SIZE=2>op_file</FONT></B><FONT SIZE=2> --- This file
contains the login names of all individuals identified as cell
"operators". A cell "operator" is given permission
to perform a number of system operations normally reserved to
the system manager, and prohibited to the standard system user.
The functions qdel, qmod, qmove, and qrls are permitted to operators.
Functions such as creating or deleting queues or adding and deleting
managers and operators are, of course, limited to cell managers.</FONT>
<P>
<B><FONT SIZE=2>seq_num_file</FONT></B><FONT SIZE=2> --- Jobs are
assigned an internal sequence number. The next number to be assigned
by the qmaster appears as a single binary value in this file.
It is thus not possible to manually reset sequence numbers, other
than to delete this file, forcing the numbering sequence to begin
over with "1".<BR>
</FONT>
<P>
<B><FONT SIZE=2>acl_file</FONT></B><FONT SIZE=2> --- This file
contains all of the access control list "acl" names for
all queues. This is actually a list of lists. An "acl"
is a list of names to be given access to one or more queues. A
queue definition can include these individuals by naming the
corresponding "acl" in its "user_acl" parameter.</FONT>
<H4><FONT SIZE=2>job_dir</FONT></H4>
<P>
<FONT SIZE=2>This directory contains a file for each job currently
in the queuing system. Each file contains the submitted script
file along with tables and lists created by the qsub operation
and used to manage the job while it is in the queue awaiting assignment
to a host, as well as during actual job execution.</FONT>
<H4><FONT SIZE=2>queue_dir</FONT></H4>
<P>
<FONT SIZE=2>This directory contains a file for each queue. The
file name is, in fact, the name assigned to that queue. Each file
contains the queue configuration, encoded in binary form, along
with various tables which the queue manager utilizes to manage
the queues.</FONT>
<H4><FONT SIZE=2>tid_dir</FONT></H4>
<P>
<FONT SIZE=2>To maintain internal coherency during
system operation, in the face of multiple hosts executing multiple
processes, a unique identifier label is generated by the qmaster
and dqs_execd for every inter-host communication. This label
is called a "task identifier" or "tid". An
empty file for each generated "tid" is created in this
directory. An acknowledgment by the receiving host for a transaction
causes the corresponding tid file to be deleted from this directory.
</FONT>
<P>
<FONT SIZE=2>In the event of aberrant behavior of a hardware or
DQS313 software element some "orphan tid's" may be found
in this directory, however the administrator is cautioned to NOT
clear out tid files manually without careful analysis. This scheme
was created to ensure inter-host synchronization despite multiple
restarting of the qmaster or the dqs_execd.</FONT>
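<P>
<FONT SIZE=2>The mechanism can be sketched in shell terms; this is
purely illustrative of the bookkeeping, not the code DQS actually
runs:</FONT>

```shell
# Sketch of the tid handshake: an empty marker file exists while a
# transaction awaits acknowledgment, and is removed on receipt.
tid_dir=$(mktemp -d)          # stand-in for the real tid_dir
tid='tid_#0001'

touch "$tid_dir/$tid"         # transaction sent: marker created
[ -e "$tid_dir/$tid" ] && echo "pending"

rm "$tid_dir/$tid"            # acknowledgment received: marker removed
[ -e "$tid_dir/$tid" ] || echo "acknowledged"
rmdir "$tid_dir"
```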
<P>
<I><B><FONT SIZE=2>pid_file</FONT></B></I><FONT SIZE=2> --- This file
contains the process id of the running qmaster. This
is a "canonical" location where site procedures may
find this pid for system management actions.</FONT>
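<P>
<FONT SIZE=2>A site procedure might read the pid as sketched below;
the spool path is an example, and the QMASTER_SPOOL_DIR tree actually
in use should be substituted:</FONT>

```shell
# qmaster_pid: print the qmaster's process id from its canonical pid_file.
qmaster_pid() {
    spool="${1:-/usr/local/DQS/common/conf/qmaster/QM-A}"  # example path
    if [ -r "$spool/pid_file" ]; then
        cat "$spool/pid_file"
    else
        echo "no pid_file in $spool" >&2
        return 1
    fi
}
```

A manager could then stop the daemon with, for example,
kill -TERM "$(qmaster_pid)".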
<P>
<I><B><FONT SIZE=2>stat_file</FONT></B></I><FONT SIZE=2> --- Based
on the period defined as "STAT_LOG_TIME" the qmaster
records summary information about all the queues it is managing.
This data is time-stamped so that DQS managers might determine
when inadvertent queue status changes occurred.</FONT>
<H3>dqs_execd</H3>
<P>
The dqs_execd directory contains a major sub-directory for each
dqs_execd operating in this cell. Each dqs_execd directory contains
four sub-directories plus one file, the "pid_file", which
contains the process id of the dqs_execd. Of course there is also
the possibility of a core file being placed here in the event
of a dqs_execd crash.
<H4><FONT SIZE=2>exec_dir</FONT></H4>
<P>
<FONT SIZE=2>The exec_dir contains the actual job file for the
executing job. When the dqs_execd launches a job, the script file
is copied here and executed.</FONT>
<H4><FONT SIZE=2>job_dir</FONT></H4>
<P>
<FONT SIZE=2>The job_dir contains a file for each job which the
dqs_execd is managing (usually only one). In addition to the
job's DQS script this file contains all the tables and information
necessary for the qmaster and the dqs_execd to manage this job.</FONT>
<H4><FONT SIZE=2>rusage_dir </FONT></H4>
<P>
<FONT SIZE=2>Upon job termination usage data is collected and
formatted into a "termination record" to be sent to
the qmaster. This record is written to this directory and retained
until the qmaster has received and recorded this information.
The procedure is used to prevent vital data from being lost, particularly
from long-running jobs, in the event of an interruption of dqs_execd
or qmaster service.</FONT>
<H4><FONT SIZE=2>tid_dir</FONT></H4>
<P>
<FONT SIZE=2>To maintain internal coherency during
system operation, in the face of multiple hosts executing multiple
processes, a unique identifier label is generated by the qmaster
and dqs_execd for every inter-host communication. This label
is called a "task identifier" or "tid". An
empty file for each generated "tid" is created in this
directory. An acknowledgment by the receiving host for a transaction
causes the corresponding tid file to be deleted from this directory.
</FONT>
<H3>Temporary Files</H3>
<P>
The dqs_execd creates and deletes a number of temporary files
in the "/tmp" directory of its host. These are deleted
after use, but if the dqs_execd has been shut down during job
launching and execution these files may be left in the "/tmp"
directory inadvertently. Since they are given names unique to
the job execution, they will remain until removed by the system
manager.<BR>
<H2>The Queue Configuration</H2>
<P>
<FONT SIZE=2>The queue configuration was introduced during the
discussion of setting up an initial DQS313 cell and queue. The
queue configuration is the primary means of tailoring a DQS system
to a particular site's requirements. The queue configuration can
be changed dynamically by the DQS cell manager without requiring
a shutdown and restart of either the qmaster or dqs_execd, unlike
the more static "conf_file". Changing the queue configuration
will not affect any jobs already in execution. The modified configuration
will be considered during the next scheduling pass of the qmaster
after the change has been completed. A description of each element
follows:</FONT>
<MENU>
<LI><FONT SIZE=2>Q_name <U><I><B>QA1</B></I></U></FONT>
</MENU>
<P>
<FONT SIZE=2>Any ASCII string of numbers and letters may be used
in the queue name. It must be a unique queue name in a given
cell.</FONT>
<MENU>
<LI><FONT SIZE=2>hostname <U><I><B>QA1_host</B></I></U></FONT>
</MENU>
<P>
<FONT SIZE=2>The hostname entered here may be any form of the
host's name which is used by the network members. DQS will convert
the entered name to the fully qualified host name and insert that
into the registered queue configuration.<BR>
</FONT>
<MENU>
<LI><FONT SIZE=2>seq_no 0</FONT>
</MENU>
<P>
<FONT SIZE=2>The seq_no is an arbitrary sequence number assigned
by the DQS administrator. It is ignored if the conf_file parameter
"DEFAULT_SORT_SEQ_NO" is set to FALSE. If "DEFAULT_SORT_SEQ_NO"
is set to TRUE the qmaster will scan the queue list in the order
of the sequence numbers, starting with zero ("0").</FONT>
<P>
<FONT SIZE=2>The DQS administrator may choose one of several strategies
for assigning sequence numbers. At SCRI the lowest sequence number
is assigned to the most powerful computing engines, with less
powerful machines being assigned higher sequence numbers.</FONT>
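<P>
<FONT SIZE=2>The sequence-number ordering can be sketched with standard
tools. The queue names and numbers below are hypothetical; the real scan
is performed inside the qmaster.</FONT>

```shell
# Hypothetical queue list: name, seq_no
cat > seq.txt <<'EOF'
QA1 0
QB1 10
QC1 5
EOF
# Scan order when DEFAULT_SORT_SEQ_NO is TRUE: increasing seq_no
sort -k2,2n seq.txt | awk '{ print $1 }'
```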
<MENU>
<LI><FONT SIZE=2>load_masg 1</FONT>
</MENU>
<P>
<FONT SIZE=2>Each dqs_execd collects information about the state
of its host's overall computational and I/O load as reported by
the UNIX system through the "rusage" structure. A "total
system load" is provided as an integer value representing
a fractional percentage of the system usage. A value of 1 represents
a load of 0.01, a value of 10 represents a load of 0.10 , and
a value of 100 represents a load of 1.0. </FONT>
<P>
<FONT SIZE=2>When DEFAULT_SORT_SEQ_NO is set TRUE the qmaster
attempts to assign jobs to the least loaded queues which meet
the resources requested by the job. The queues are sorted into
increasing order of the load average, weighted by multiplying
the reported load average by the "massage factor"
(the load_masg value). The load_masg factor thus permits the
administrator to adjust the system-wide relationships between
different hosts which may be necessitated by variations in usage
measurements or background task activity.</FONT>
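<P>
<FONT SIZE=2>The weighted ordering can be sketched with standard tools.
The queue names, loads and massage factors below are hypothetical; the
real computation lives in the qmaster.</FONT>

```shell
# Hypothetical queues: name, reported load (1 = 0.01), load_masg factor.
cat > queues.txt <<'EOF'
QA1 40 1
QA2 25 2
QA3 30 1
EOF
# Weighted load = reported load * massage factor; queues are then
# considered in increasing order of the weighted value.
awk '{ print $2 * $3, $1 }' queues.txt | sort -n | awk '{ print $2 }'
```

Note that doubling QA2's massage factor pushes it behind QA1 even though its raw load is lower.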
<MENU>
<LI><FONT SIZE=2>load_alarm 175</FONT>
</MENU>
<P>
<FONT SIZE=2>A threshold value can be set beyond which a queue
will not be considered for scheduling by the qmaster. When a host
reports a load average greater than this threshold it is in an
"ALARM" state, and this flag is displayed in qstat output.
The default load_alarm represents a load average of 1.75.</FONT>
<MENU>
<LI><FONT SIZE=2>priority 0</FONT>
</MENU>
<P>
<FONT SIZE=2>This field may be confusing at this point because
jobs also possess a submission priority. The difference is that
the job priority determines only how it is ordered among other
jobs in competition for system resources. The job submission priority
has no influence on the UNIX priority with which that job is executed.</FONT>
<P>
<FONT SIZE=2>The queue priority field here IS the UNIX priority
assigned to any job executed in this queue and thus may range
from -19 (low) to +19 (high).</FONT>
<MENU>
<LI><FONT SIZE=2>type batch</FONT>
</MENU>
<P>
<FONT SIZE=2>DQS was designed to support the scheduling and management
of batch and interactive jobs. DQS313 supports only batch queues.
This parameter is ignored.</FONT>
<MENU>
<LI><FONT SIZE=2>rerun FALSE</FONT>
</MENU>
<P>
<FONT SIZE=2>Automatic job rerun is not enabled in DQS313; this
field is ignored.<BR>
</FONT>
<MENU>
<LI><FONT SIZE=2>quantity 1</FONT>
</MENU>
<P>
<FONT SIZE=2>A DQS313 queue can manage more than one job in execution
at a time, though this is usually not a practical way to operate
a single cpu host.</FONT>
<MENU>
<LI><FONT SIZE=2>tmpdir /tmp</FONT>
</MENU>
<P>
<FONT SIZE=2>During job startup and execution several temporary
files are created. This parameter should be the fully qualified
path name to the host's temporary directory.</FONT>
<MENU>
<LI><FONT SIZE=2>shell /bin/csh</FONT>
</MENU>
<P>
<FONT SIZE=2>The default shell for executing jobs in this queue.
This default can be overridden by commands in the job script.
</FONT>
<MENU>
<LI><FONT SIZE=2>klog /usr/local/bin/klog</FONT>
</MENU>
<P>
<FONT SIZE=2>The path name to the AFS klog executable.</FONT>
<MENU>
<LI><FONT SIZE=2>reauth_time 6000</FONT>
</MENU>
<P>
<FONT SIZE=2>The time period in milliseconds for performing an
AFS re-authentication of the executing job. </FONT>
<MENU>
<LI><FONT SIZE=2>last_user_delay 0</FONT>
</MENU>
<P>
<FONT SIZE=2>To prevent a single user from dominating the utilization
of a queue the administrator can set this time-out value (seconds)
during which a user's job will not be considered for scheduling
following termination of a previous job for that user.</FONT>
<MENU>
<LI><FONT SIZE=2>max_user_jobs 4</FONT>
</MENU>
<P>
<FONT SIZE=2>This is the second system parameter available for
implementing scheduling policies for DQS313 at a site. The MAXUJOBS
parameter in the conf_file limits the total number of jobs a user
can have considered for scheduling across the entire system. The
queue configuration "max_user_jobs" establishes a limit
on the number of jobs a user can have queued which will be considered
for scheduling for this queue. See "SCHEDULING" for
a more complete discussion of this topic.</FONT>
<MENU>
<LI><FONT SIZE=2>notify 60</FONT>
</MENU>
<P>
<FONT SIZE=2>A user job may invoke the "-notify" option
which instructs the system to send the job a SIGUSR1 or SIGUSR2
signal as a warning in advance of a SIGSTOP or SIGTERM signal.
This "notify" parameter in the queue configuration establishes
the number of seconds between sending the warning signal and the
SIGTERM or SIGSTOP.</FONT>
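<P>
<FONT SIZE=2>A job script can take advantage of the warning signal with
an ordinary signal handler. The sketch below is a plain /bin/sh
illustration, not DQS code; it simulates the warning by signalling
itself.</FONT>

```shell
#!/bin/sh
# Install a handler for the advance-warning signal (SIGUSR1 here;
# SIGUSR2 may precede a SIGSTOP instead).
trap 'echo "warning received: checkpointing" > checkpoint.log' USR1

# Simulate the dqs_execd delivering the warning to this job.
kill -USR1 $$
sleep 1              # allow the trap to run before the job continues
cat checkpoint.log
```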
<MENU>
<LI><FONT SIZE=2>owner_list NONE</FONT>
</MENU>
<P>
<FONT SIZE=2>In addition to the DQS manager and DQS operator,
an individual can be designated a queue owner. A queue owner can
perform many system management tasks permitted to the managers
and operators but limited to this queue. Job deletion, queue suspension,
enabling and disabling are among those actions. One or more login
names can be entered for this parameter.</FONT>
<MENU>
<LI><FONT SIZE=2>user_acl NONE</FONT>
</MENU>
<P>
<FONT SIZE=2>The administrator can create one or more access lists
using the "qconf -au" command. This command adds one
or more users to a named list. (This named list will be created
if it doesn't exist.) These named lists (of names) can be used
to include or exclude groups of users in access to a specific
queue. This queue configuration parameter "user_acl"
can contain a list of one or more acl_list names which will be
permitted to use the queue. (That is, the parameter can itself
be a list of names of lists of names… confused?)<BR>
</FONT>
<MENU>
<LI><FONT SIZE=2>xuser_acl NONE</FONT>
</MENU>
<P>
<FONT SIZE=2>The administrator can create one or more access lists
using the "qconf -au" command. That command adds one
or more users to a named list. (This named list will be created
if it doesn't exist.) These named lists (of names) can be used
to include or exclude groups of users in access to a specific
queue. This queue configuration parameter "xuser_acl"
can contain a list of one or more acl_list names which will be
excluded from access to the queue. </FONT>
<MENU>
<LI><FONT SIZE=2>subordinate_list NONE</FONT>
</MENU>
<P>
<FONT SIZE=2>One or more DQS313 queues can be subordinated to
another queue. The queue specifying a list of subordinates with
this parameter is called the "superior queue". A "superior
queue" can NOT be a subordinate queue to another. A queue
can only be subordinated to one other queue. The "subordinate_list"
parameter can contain a list of one or more queue names in the
same cell as the queue defining this parameter. </FONT>
<P>
<FONT SIZE=2>Superior queues are analyzed for scheduling in the
same manner as all queues. If a job is assigned to a superior
queue, the qmaster will suspend the execution of jobs in all of
the queues in the superior queue's subordinate list. </FONT>
<MENU>
<LI><FONT SIZE=2>complex_list NONE</FONT>
</MENU>
<P>
<FONT SIZE=2>This parameter can contain one or more names of complexes
defined by the "add complex" function of the qconf command
(qconf -ac). See "Complexes and Consumables". Any complex
name can be preceded by the DQS reserved word "REQUIRED"
(must be all caps). This indicates that no job will be scheduled
for this queue UNLESS it requests a resource described in that
complex.</FONT>
<MENU>
<LI><FONT SIZE=2>consumables NONE</FONT>
</MENU>
<P>
<FONT SIZE=2>This parameter can contain one or more names of consumable
resources defined by the "add consumable " function
of the qconf command ( qconf -acons). See "Complexes and
Consumables". Any consumable name can be preceded by the
DQS reserved word "REQUIRED" (must be all caps). This
indicates that no job will be scheduled for this queue UNLESS
it requests a resource described in that consumable.<BR>
</FONT>
<MENU>
<LI><FONT SIZE=2>s_rt 7fffffff</FONT>
<LI><FONT SIZE=2>h_rt 7fffffff</FONT>
<LI><FONT SIZE=2>s_cpu 7fffffff</FONT>
<LI><FONT SIZE=2>h_cpu 7fffffff</FONT>
<LI><FONT SIZE=2>s_fsize 7fffffff</FONT>
<LI><FONT SIZE=2>h_fsize 7fffffff</FONT>
<LI><FONT SIZE=2>s_data 7fffffff</FONT>
<LI><FONT SIZE=2>h_data 7fffffff</FONT>
<LI><FONT SIZE=2>s_stack 7fffffff</FONT>
<LI><FONT SIZE=2>h_stack 7fffffff</FONT>
<LI><FONT SIZE=2>s_core 7fffffff</FONT>
<LI><FONT SIZE=2>h_core 7fffffff</FONT>
<LI><FONT SIZE=2>s_rss 7fffffff</FONT>
<LI><FONT SIZE=2>h_rss 7fffffff</FONT>
</MENU>
<P>
<FONT SIZE=2>These parameters establish the "hard"
or "soft" limitations on a host's resource utilization
by a job executing under control of this queue. The "hard"
limits are transferred to the job's execution environment in the
hope that the host operating system provides support for these
limits. Note, however, that if a host does support these limits
they apply only on a process-by-process basis. If a job script
contains multiple invocations of processes, as in a FORTRAN compilation
and execution, the limits apply to each individual step in the
job.</FONT>
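<P>
<FONT SIZE=2>The per-process character of these limits can be
illustrated with the shell's own ulimit built-in. This is an
illustration only, not how the dqs_execd sets limits; each subshell
below, like each step of a job script, carries its own soft CPU-time
limit.</FONT>

```shell
# Each parenthesized subshell is a separate process with its own limit,
# mirroring the way limits apply to each step of a multi-step job.
( ulimit -S -t 10; ulimit -S -t )   # first "step": its own limit
( ulimit -S -t 5;  ulimit -S -t )   # second "step": unaffected by the first
```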
<P>
<FONT SIZE=2>DQS313 does check the "soft" and "hard"
real-time limits (s_rt & h_rt) and will terminate jobs based
on the values of those parameters. A job exceeding the "soft"
real-time limit is sent a SIGTERM signal which can be intercepted
by the job using the "-notify" option in the job script.
If the job exceeds the "hard" real-time limit it is
sent a SIGKILL signal which cannot be caught by the user job.</FONT>
<H2>Complexes and Consumables</H2>
<P>
<FONT SIZE=2>The most valuable aspect of DQS, and easily its most
confusing property, is the ability to define and utilize a variety
of system "resources" which can then be requested in
a user's DQS job script. These resource requests are used to differentiate
and assign jobs to the variety of system capabilities found in
today's heterogeneous computing environments. Let us look at an
example of how and why resource definitions are created at a site.
The diagram shows five DQS hosts with different capabilities.
</FONT>
<P>
<FONT SIZE=2>Many users will have created an application compiled
for one machine architecture, say AIX. In the pictured environment
the user could run their application on one of the AIX machines
by specifying the queue name, say QN1. The negative aspect of
this simple approach is that the job may be kept waiting for QN1
because of a previous job on that machine while either QN3 or
QN5 might be available.<BR>
</FONT>
<P>
<IMG SRC="IMG00022.GIF">
<P>
<FONT SIZE=2>The solution for this situation is to create a resource
definition for all AIX machines in the cell and name it "AIX1".
Then the user can submit a job using the qsub command with the
"-l" option. The steps needed to accomplish
this are:</FONT>
<OL>
<LI><FONT SIZE=2>A complex is created by typing "qconf -ac
AIX1" (create a complex named AIX1) </FONT>
<LI><FONT SIZE=2>The default text editor is started and an empty
page displayed. The administrator enters an arbitrary string such
as "our_AIX", then saves the results and closes the editor.</FONT>
<LI><FONT SIZE=2>Now that we have a complex defined (AIX1) we
can add that complex to a queue definition.</FONT>
<LI><FONT SIZE=2>Assuming that the queue has already been defined
we will modify it using the qconf command. Typing "qconf
-mq QN1" opens up another editor window with the complete
queue definition displayed.</FONT>
<LI><FONT SIZE=2>Replace the parameter entry for "complex_list"
from NONE to AIX1 (the name given to the complex definition,
NOT the contents of that definition).</FONT>
<LI><FONT SIZE=2>In the same manner add the complex name AIX1
to the queues QN4 and QN5.</FONT>
<LI><FONT SIZE=2>Advertise the resource name "our_AIX"
to all users.</FONT>
<LI><FONT SIZE=2>A user can then direct their jobs to any one
of the AIX machines by including the resource request "-l
our_AIX" in their DQS job script.</FONT>
</OL>
<P>
<FONT SIZE=2>This simple example illustrates two key points. </FONT>
<OL>
<LI><FONT SIZE=2>The complex name is used by the administrator
to assist in designing and managing collections of resources and
queues. The complex name IS NOT USED by the user in resource requests.</FONT>
<LI><FONT SIZE=2>Resource requests in job submissions use the
descriptions within one or more complex definitions.</FONT>
</OL>
<P>
<FONT SIZE=2>Let us expand the example slightly and create a new
complex which cuts across machine architecture features, but shares
a different attribute:</FONT>
<OL>
<LI><FONT SIZE=2>Create a complex for systems supporting PVM by
typing "qconf -ac PVM1".</FONT>
<LI><FONT SIZE=2>When the editor window opens enter a single line
"our_PVM".</FONT>
<LI><FONT SIZE=2>Save the file and close the editor and advertise
the resource name "our_PVM" to the users.</FONT>
<LI><FONT SIZE=2>Add the complex name PVM1 to the complex_list
parameter of queues QN2 and QN4. </FONT>
<LI><FONT SIZE=2>A user wishing to submit a job to a queue which
is running on an AIX machine which provides PVM support would
use a resource request "-l our_AIX.and.our_PVM"</FONT>
</OL>
<P>
<FONT SIZE=2>So far the sample resource definitions have been
a single string such as "our_AIX" or "our_PVM".
We could have used an alternative form for describing alternatives,
such as AIX versus HPUX, by replacing the
string we entered in the complex files with arch=our_AIX and arch=our_HPUX.
The string "arch" is one created by the administrator
and could be any arbitrary name. A resource request would then
have the form "-l arch=our_AIX" or "-l arch=our_HPUX".<BR>
</FONT>
<P>
<FONT SIZE=2>Resource definitions can contain numeric values and
the corresponding resource requests can perform numeric comparisons
on these values to satisfy a criterion. A complex called BigMemory
could be defined containing the line "mem=128". For
our example let QN1 and QN2 both be operating on hosts which have
128 megabytes each. The complex BigMemory would be added to
QN1 and QN2. A request for an AIX machine with at least 64 megabytes
of memory might be stated as "-l our_AIX.and.mem.ge.64".</FONT>
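<P>
<FONT SIZE=2>The matching implied by such a request can be sketched
outside DQS. The queue attributes below are hypothetical; the real
comparison is performed by the qmaster.</FONT>

```shell
# Hypothetical per-queue attributes: queue, architecture string, mem (MB)
cat > complexes.txt <<'EOF'
QN1 our_AIX 128
QN2 our_HPUX 128
QN3 our_AIX 64
EOF
# Queues satisfying the request "-l our_AIX.and.mem.ge.64"
awk '$2 == "our_AIX" && $3 >= 64 { print $1 }' complexes.txt
```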
<P>
<FONT SIZE=2>Resource definitions can possess more than the single
line examples in each named complex. A complex definition named
"BIG_HUMMER" might look like:</FONT>
<MENU>
<LI><FONT SIZE=2>AIX414</FONT>
<LI><FONT SIZE=2>mem=1028</FONT>
<LI><FONT SIZE=2>Horsepower=10</FONT>
<LI><FONT SIZE=2>IO_bandwidth=250</FONT>
</MENU>
<P>
<FONT SIZE=2>A resource request which needs a BIG_HUMMER host
would, in this case, look like:</FONT>
<P>
<FONT SIZE=2>"-l AIX414.and.mem.ge.1028.and.Horsepower.ge.10.and.IO_bandwidth.ge.250"</FONT>
<P>
<FONT SIZE=2>There is one type of resource we have singled out
for special handling in DQS313. These are resources which are
not static during the operation of a DQS cell. While machine horsepower,
memory size, operating systems and compilers remain stable for long periods
of time (on the order of days or weeks), shared-memory multiprocessor
cpus will have varying amounts of shared memory available to them
as different jobs are executed on others of its cpus. An increasingly
common resource situation is "licensed software" such
as compilers and data-base management systems. In many cases there
are fewer licenses available within a system than there are hosts
to execute the software. </FONT>
<P>
<FONT SIZE=2>This type of resource is called a "consumable"
in DQS313. The definition of a consumable resource is somewhat
different than a DQS "complex", in that the administrator
will describe the total number of a resource which is available
in a system, and the number of those resources consumed by a satisfied
resource request. In the case of a FORTRAN compiler license, a
site usually purchases a number of licenses for their system which
are managed by a "license server". The consumable resource
manager in DQS313 does not supplant a license server nor can it
effectively mimic such a server. Instead it provides a mechanism
parallel to the license server which attempts to keep track of
how many licenses are in use by DQS clients.</FONT>
<P>
<FONT SIZE=2>The administrator defines a consumable resource
by executing the command "qconf -acons FORTRAN" (using
the compiler as an example). The default editor will open a window
with the following template:</FONT>
<P>
<FONT SIZE=2>Consumable xlf</FONT>
<P>
<FONT SIZE=2> Available = <the amount of resources
available></FONT>
<P>
<FONT SIZE=2> Consume_by = < quantum by which the resource is reduced
by a request ></FONT>
<P>
<FONT SIZE=2> Current = < currently available
resources ></FONT>
<P>
<FONT SIZE=2>The field for Available should be filled in with
the number of FORTRAN licenses authorized to this system. The
Consume_by will be 1 for software such as compilers. The Current
field will usually be equal to the Available field, unless there
are several licenses in use at the time this Consumable is being
defined. The Current field is also used to reset the DQS313 consumable
counter when DQS313 gets out of sync with the actual license
manager.</FONT>
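<P>
<FONT SIZE=2>The bookkeeping can be pictured as a simple counter,
decremented at job launch and incremented at termination. This is a
plain-shell illustration; the qmaster keeps these counts internally.</FONT>

```shell
# Consumable xlf with Available = 4; Current starts equal to Available.
echo 4 > xlf.count
# Job launch: Consume_by (1 for a compiler license) is subtracted.
echo $(( $(cat xlf.count) - 1 )) > xlf.count
# Job termination (or "qalter -rc xlf 1"): the license is returned.
echo $(( $(cat xlf.count) + 1 )) > xlf.count
cat xlf.count
```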
<P>
<FONT SIZE=2>Queues which must manage this consumable resource
should then have the consumable name added to the consumables
parameter list in the queue configuration. The user need not be
aware of the distinction between standard complexes and consumables.
Their resource requests are stated in the same way: "-l our_AIX.and.mem.ge.64.and.xlf".
The qmaster will determine if an xlf license is available by examining
its internal counters (which may NOT match the license server's).
If the license and other resources are available the job will
be launched. At the time the job is started the consumable count
for the FORTRAN resource will be decremented. </FONT>
<P>
<FONT SIZE=2>Upon job termination this resource count will be
incremented. Obviously this is not a satisfactory situation for
a user who wishes to submit a job which does a quick FORTRAN compile
which produces an executable which is then to run a week long
job. The consumable count would remain decremented for the duration
of the job while the license manager will have had the license
"token" returned at the conclusion of the compilation.</FONT>
<P>
<FONT SIZE=2>For this situation the cooperation of the user is
required, to avoid breaking jobs into separate compile-only and
compute-only jobs. The "qalter" command has been modified
to permit any user to execute it,
but only with the "-rc" (return consumable) option.
The user job would then have a script file which might look like:</FONT>
<MENU>
<LI><FONT SIZE=2>#!/bin/csh</FONT>
<LI><FONT SIZE=2>#$ -l xlf.and.our_AIX</FONT>
<LI><FONT SIZE=2>xlf myprogram</FONT>
<LI><FONT SIZE=2>qalter -rc xlf 1</FONT>
<LI><FONT SIZE=2>myprogram mydata</FONT>
</MENU>
<P>
<FONT SIZE=2>The qalter command here specifies the name of the
resource being returned followed by the quantity being returned.
When resources such as high performance disk or shared memory
are being defined as a consumable resource, often a "quantum"
of the resource is granted and recovered. For example, a UNIX page
might be the minimum quantum, or an integral number of
pages could be the "quantum". Where licenses are normally
doled out one at a time, memory might be allocated 1 MB at a time.
Hence the Consume_by field in the consumable definition.<BR>
</FONT>
<H3>REQUIRED Complexes and Consumables</H3>
<P>
A job submission may contain one or more resource requests (the
"-l" option). A job with no specific resource requests
is thus a candidate for assignment to any available queue. In
many installations some queues are best utilized by very specific
job configurations. An example might be a site which possesses
a heterogeneous collection of cpus with very wide differences
in computing capacity. The more robust computers should not be
assigned to "tiny" but persistent jobs in some cases.
DQS 3.1.3 provides a special keyword, "REQUIRED", which
can precede any complex or consumable which a user MUST request
in order for that job to be considered for scheduling on that
queue.<BR>
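<P>
For example, a queue reserved for large-memory work might carry a
hypothetical entry such as:

```
complex_list REQUIRED BigMemory
```

A job would then be scheduled to this queue only if its script requested
a resource described in the BigMemory complex.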
<H2>Job Scheduling</H2>
<P>
<FONT SIZE=2>The crux of any resource allocation and management
system is its ability to provide resources in an "efficient"
and "fair" manner. "Efficiency" is usually
measured in terms of maximizing job throughput and effective utilization
of the available resources. "Efficiency" can be quantified
in ways usually referred to the hardware hosts in a system. "Fairness"
is less easily described, is often measured by perceptions and
is most often referred to the human users of a system. Further,
priorities for efficiency and fairness and their relative values
can vary widely from site to site. The burden of meeting these
objectives falls upon the system's job scheduling mechanism.</FONT>
<P>
<FONT SIZE=2>Forty years of experience with attempts at creating
comprehensive job scheduling algorithms have demonstrated several
points:</FONT>
<OL>
<LI><FONT SIZE=2>It is virtually impossible to produce a "one
size fits all" algorithm which will satisfy the demands for
efficiency plus fairness at every site.</FONT>
<LI><FONT SIZE=2>Scheduling systems which attempt to provide a
'flexible' software solution do so by offering to the administrator
numerous parameters for adjusting the methods used for allocating
resources. The plethora of variables presented is ultimately
confusing if not confounding.</FONT>
<LI><FONT SIZE=2>Most sites with complex requirements and knowledgeable
support personnel end up writing their own scheduling code or
modifying the code provided with the system.</FONT>
</OL>
<P>
<FONT SIZE=2>DQS313 therefore attempts to provide only a minimal
amount of job scheduling technology. Hopefully small sites will
be able to achieve a good level of balance in host usage and perceived
"fairness" with the system as it is delivered. As a
site develops experience with batch job management the staff will
experiment with the few parameters provided in DQS. At some point
the administrator will want to probe the module dqs_schedule.c
, adding or subtracting from its capabilities. To that end we
will describe the basic features of DQS scheduling and try to
illuminate the routines most likely to be modified.</FONT>
<P>
<FONT SIZE=2>A user job passes through two screening processes
before being considered by the qmaster for scheduling:</FONT>
<P>
<FONT SIZE=2>1. At the time of job submission a user job is checked
to see if it meets two system criteria: </FONT>
<OL>
<LI><FONT SIZE=2>Are resources present in the system which meet
the requirements specified for the job (usually through the "-l"
parameter in a qsub script)? </FONT>
<LI><FONT SIZE=2>Is this user under the maximum threshold established
for using system resources? </FONT>
</OL>
<P>
<FONT SIZE=2>If a job fails these tests it is rejected at the
time of submission and an error message is returned to the user submitting
the job. (In the event that a job is submitted in anticipation
of resources being added to the system, such as a new host architecture,
the user can choose to override the first test by using the "force"
option ("-F") in the qsub command.)</FONT>
<P>
<FONT SIZE=2>2. Once a user job has been accepted into the system
it will be placed into the qmaster's job list where it will remain
until it has been executed or deleted. If a job's submission exceeds
the MAXUJOBS limit placed in the conf_file, it will remain in
the queue BUT it will not be considered during scheduling passes
by the qmaster.<BR>
</FONT>
<P>
<FONT SIZE=2>The qmaster conducts an examination (or "pass")
over the job list:</FONT>
<OL>
<LI><FONT SIZE=2>Every time a job is added to the list</FONT>
<LI><FONT SIZE=2>Every time a job terminates</FONT>
<LI><FONT SIZE=2>If neither of these events occurs, the qmaster
will scan the list periodically, based on the number of
seconds in the "SCHEDULE_TIME" parameter in the conf_file.</FONT>
</OL>
<P>
<FONT SIZE=2>The scanning process consists of sorting the jobs
according to their submitted priority ("-p" option),
then by an internally generated "subpriority" and finally
by the job sequence number (establishing its submission order).
After the jobs are sorted they are examined in order, testing
each available queue (each ordered by load average or sequence
number) looking for the first one which matches the resources
requested by that job. If a match is found the job is dispatched
and the next job is examined.</FONT>
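<P>
<FONT SIZE=2>The three-key ordering can be sketched with the sort
utility. The job data are hypothetical, and the sketch assumes that
larger "-p" values are scheduled first and smaller subpriorities
first.</FONT>

```shell
# Hypothetical job list: sequence number, priority (-p), subpriority.
cat > jobs.txt <<'EOF'
101 0 0
102 5 0
103 0 1
EOF
# Order: priority descending, then subpriority, then submission order.
sort -k2,2nr -k3,3n -k1,1n jobs.txt | awk '{ print $1 }'
```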
<P>
<FONT SIZE=2>Manipulation of a job's subpriority before the sorting
step is the easiest way to affect the basic scheduling algorithm.
In DQS313 this simply consists of increasing the subpriority field
of a job based on the number of previously submitted jobs (at
the same priority level) for that user. Thus two or more users
with several jobs queued at the same priority and for the same
system resource will have their jobs interleaved, so that no one
user can dominate a resource by submitting a large quantity of
jobs.</FONT>
<P>
<FONT SIZE=2>The system administrator will probably experiment
with this subpriority computation as a first step in customizing
DQS. Flirting with the resource matching is considered to be a
more risky affair as the side effects of such changes are harder
to predict or detect.<BR>
</FONT>
<H2>AFS Operation</H2>
<P>
<FONT SIZE=2>DQS313 provides a minimal AFS support capability.
The introduction of the "process shepherd" has made
the job re-authentication in DQS conform to AFS security requirements.
The output file handling feature addresses the 'cross platform'
security problems of dealing with stdout and stderr. </FONT>
<H2>Multi-Cell Operation</H2>
<P>
<FONT SIZE=2>A limited multi-cell operation capability is provided
in DQS313. Jobs may be moved from cell to cell if they are not
yet in execution, and users authenticated in one cell can view
the status of the queues in another cell.</FONT>
<H2>Accounting</H2>
<P>
<FONT SIZE=2>Site accounting methods vary as widely as any aspect
of a batch processing system. DQS313 records as much information
as possible about a job's scheduling and execution in a single
ASCII line of text. These entries are preceded by an ASCII string
of the standard UNIX GMT time of the entry. </FONT>
<P>
<FONT SIZE=2>Extraction of the accounting information simply requires
using a structure definition for the act_file entries in one's
"c" extraction program. An example of this technique
may be found in the program acte.c which can be found in the ../DQS/tools
directory. Included in the tools directory is a script "dostats"
which employs acte to create a series of system summary files
for the administrator.</FONT>
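<P>
<FONT SIZE=2>A shell pass over the accounting file can serve the same
purpose for quick checks. The record layout below is invented for
illustration; the authoritative layout is the structure used by
acte.c in the tools directory.</FONT>

```shell
# Two invented accounting records: GMT timestamp, then job fields.
cat > act_file <<'EOF'
Mon Jan  6 12:00:00 GMT 1997 job=17 user=alice exit_status=0
Mon Jan  6 12:05:00 GMT 1997 job=18 user=bob exit_status=137
EOF
# Report the job id of every record with a nonzero exit_status.
grep -v 'exit_status=0$' act_file | sed 's/.*job=\([0-9]*\).*/\1/'
```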
<H2>System Management</H2>
<P>
<FONT SIZE=2>The process of DQS system management first consists
of laying out the physical and logical structure of a cell. The
physical organization is described by adding hosts and assigning
them to queues. The logical organization consists of defining
resource "complexes" and consumable resources and assigning
these to their appropriate queue hosts. Finally setting system
parameters in the conf_file and each queue configuration establishes
the operating environment for DQS operation.</FONT>
<P>
<FONT SIZE=2>The ongoing management steps should include:</FONT>
<OL>
<LI><FONT SIZE=2>Review of the queue status information to spot
queues in UNKNOWN or ALARM state. (DQS313 will send email to the
administrator whenever possible, but a sudden crash of a daemon
may only be detected from the qstat command display.)</FONT>
<LI><FONT SIZE=2>Regular review of the err_file, log_file, stat_file
and act_file looking for operational anomalies. Some will be obvious,
such as dqs_execd's which have vanished or been restarted. One
key thing to look for is a sequence of jobs aborting on the same
host (a potential problem with DQS or the host) or a sequence
of jobs aborting for the same user (which may point to a problem with
the user's jobs or the user's permissions). Job aborts may be
detected by examining the exit_status of jobs in the act_file.</FONT>
<LI><FONT SIZE=2>Changing queue parameters, adding and deleting
jobs and performing queue suspend/unsuspend, or disable/enable
operations as required.</FONT>
</OL>
<P>
<FONT SIZE=2>The majority of the DQS313 utility set and its
options are provided for the system management function. While
users may employ the qalter command, for example, to change the
characteristics of a submitted job, more often the administrator
will avail themselves of this function. A not-uncommon occurrence
is for the administrator to increase the submission priority of
a job to move it ahead of other jobs in the scheduling.</FONT>
<P>
<FONT SIZE=2>One utility should be highlighted here, the "qidle"
function. Many DQS hosts may actually reside on someone's desk
and serve as their personal workstation. At the same time these
machines are utilized for their computational capabilities in
a cell. To serve both functions, it must be possible for the workstation
user to have priority access to their machine and not suffer keyboard
and mouse response deficiencies because the host is being shared
with DQS. A first step is to make the "owner" of the
workstation also an "owner" of all queues assigned to
that host. Then when the workstation owner wishes to have exclusive
use of the machine they will have DQS permission to suspend any
queues on that machine.</FONT>
<P>
<FONT SIZE=2>Enter the "qidle" utility. This is an X-Windows
based program, since we presume that workstation users will be
operating with X-Windows. It can be started at any workstation
and performs the following functions on behalf of the workstation
"owner" whom the administrator has also designated a
queue "owner" in the queue configuration. </FONT>
<OL>
<LI><FONT SIZE=2>If the workstation mouse and keyboard are used
in some way (mouse movement, button clicks, keyboard typing),
all queues on that host are suspended.</FONT>
<LI><FONT SIZE=2>If the keyboard and mouse have not been used
for a period of time specified in the qidle command, then all
queue suspensions are removed.</FONT>
</OL>
<P>
<FONT SIZE=2>What happens in the case where more than one user may
have access to a workstation? The "system console" is
an example where many users may be permitted to operate the keyboard
and mouse. Making all users "owners" of that station's
queues could result in an unmanageable list and is a potential
security problem, since a queue owner has privileges beyond queue
suspension actions. </FONT>
<P>
<FONT SIZE=2>The qidle in DQS313 has thus been modified from its
DQS 3.1.2.4 form. It is now a member of the DQS313 utilities group
and communicates directly with the qmaster rather than indirectly
through the qmod utility. It can be started on any workstation
by any user who has permission to log in to that workstation. Once
started it performs the same functions described above.<BR>
</FONT>
<H2><U><FONT SIZE=4>Problem Solving</FONT></U></H2>
<H2><FONT SIZE=2>Solving Installation Problems</FONT></H2>
<P>
<FONT SIZE=2>Most installation difficulties can be divided into
the following categories (in order of probability):</FONT>
<OL>
<LI><FONT SIZE=2>One or more bugs remain in the DQS 3.1.3 installation
procedure. This release has not been tested on all available UNIX
platforms (hardware or software versions).</FONT>
<LI><FONT SIZE=2>The interactive interface has produced messages
or questions which may confuse the reader. Some of these are natural
warnings from the make process or compiler. A few will be labeled
"error" when they do not affect the installation process.
These often occur when an installation is being performed over
an old one and the target directories already exist.</FONT>
<LI><FONT SIZE=2>The administrator is running as non-root and
attempting operations not permitted in that mode.</FONT>
<LI><FONT SIZE=2>Host machines to be used for qmaster and/or dqs_execd
do not have uniform access (through NFS or AFS or DFS) to the
DQS binary files, or the spool directories defined during the
installation procedure.</FONT>
<LI><FONT SIZE=2>Attempts to use qstat313, qsub313, etc. receive
a message ".. unable to contact qmaster". This is usually
due to a user trying to invoke one of the DQS utilities on a host
not known to the qmaster. The qmaster maintains a list of all
"trusted hosts" in the cell which it manages. Hosts
are added automatically when a queue is configured for them ("qconf313
-aq") or by an explicit host addition ("qconf313 -ah <host
name>").</FONT>
</OL>
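<P>
<FONT SIZE=2>For item 5, the remedy issued from a trusted host (for
example, the qmaster host itself) might look like the following;
the host name "wkstn12" is illustrative:</FONT>
<PRE>
% qconf313 -ah wkstn12    # add "wkstn12" to the trusted host list
% qstat313                # run on wkstn12; should now reach the qmaster
</PRE>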
<P>
<FONT SIZE=2>Identify the symptoms of the installation failure
and refer to one of the following sections:</FONT>
<H2>INSTALL fails during the make process of the "config"
program.</H2>
<P>
<FONT SIZE=2>The GNU configure program uses the "Makefile.in"
template in the DQS/CONFIG directory to produce the Makefile for
the DQS config utility. It is possible that a new configuration
of compilers or linkers can cause the GNU facility to create an
erroneous Makefile. Visually check the Makefile for correctness.</FONT>
<P>
<FONT SIZE=2>Although DQS313 installation has been tested on many
platforms, variants of the compiler or operating systems can create
WARNING messages during the compilation which we have not made
provision for. Even different versions of GNU "C" yield
different warning messages. If the error is fatal to the compilation
please contact the DQS313 support team for assistance.</FONT>
<H2>INSTALL fails during the execution of the DQS config program.
</H2>
<P>
<FONT SIZE=2>During the config process the system attempts to
create a number of directories and sub-directories. The default
starting point for this process is the current working directory
of the user if running as non-root or /usr/local/DQS if running
as root. If any of the directories exist, an error message is
displayed on stdout, but the config program continues. If the
user discovers that they have erroneously specified directory
names, config can be interrupted by typing CTRL-C. This will unwind
many aspects of the configuration process, however NO DIRECTORIES
will be removed. The administrator will have to clean up any relevant
directories manually. After reviewing the "directory already exists"
messages, the administrator can choose to ignore those which are
expected because the directories were previously created.<BR>
</FONT>
<H2>INSTALL fails during the "make" process.</H2>
<P>
<FONT SIZE=2>During the DQS config step, all of the target directories
are created except for the ones associated with the compiled output
object ('.o' files) and the interim executables (qmaster, dqs_execd…).
If a previous installation occurred under a "root" user
and the current "make" is being done as a "non-root"
the attempt to create the ARCS sub-directories will fail for lack
of permissions. The solution is to perform the "make"
as root or change the owner of the ARCS sub-directories to the
user doing the installation of DQS313.</FONT>
<P>
<FONT SIZE=2>The GNU CC compiler is chosen as the default compiler
for the "make" process if it is available. Some sites
may experience a large number of "gcc" warning messages
if there have been local modifications to the gnu include files.
If this occurs, or if the site prefers to use the native "C"
compilers, then the following steps should be taken:</FONT>
<OL>
<LI><FONT SIZE=2>Stop the "make" operation. The GNU
configure program and the DQS config utility will have been executed
and all Makefile templates will contain the GCC default. Change
directory to …DQS/SRC and edit the Makefile.proto file.</FONT>
<LI><FONT SIZE=2>Search the Makefile.proto for any lines which
match "CC=gcc" and replace the string "gcc"
with the native compiler name (usually "cc").</FONT>
<LI><FONT SIZE=2>Change directory back to the base directory,
…DQS and type "make" to restart the process.</FONT>
</OL>
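<P>
<FONT SIZE=2>The edit in step 2 can be performed with a one-line
"sed" command. The sketch below demonstrates it on a scratch
stand-in file; in practice apply the same sed command to the real
Makefile.proto in the SRC directory:</FONT>

```shell
# Demonstrated on a scratch stand-in; in practice apply the same sed
# command to the real Makefile.proto.
printf 'CC=gcc\nCFLAGS=-O\n' > Makefile.proto.demo
sed 's/^CC=gcc$/CC=cc/' Makefile.proto.demo > Makefile.proto.fixed
cat Makefile.proto.fixed
```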
<P>
<FONT SIZE=2>If only "warning" messages appear in the
stdout results, you can feel reasonably secure with the installation.
However we will try to eliminate these in future releases and
would appreciate receiving information on these occurrences. If
an error fatal to the compilation occurs please contact the DQS
support staff.<BR>
</FONT>
<H2>INSTALL fails during the "make installbin" phase
</H2>
<P>
<FONT SIZE=2>Once the make process has created the temporary executables
in the ARCS directory they should be moved to their "final
resting place" as chosen during the DQS config step. For
operational installations this step should be performed as root.
If the INSTALL script was started as non-root and the target directory
requires root permissions the INSTALL process will fail at this
point.</FONT>
<P>
<FONT SIZE=2>If this occurs the administrator should switch to
"root", change directory to …./DQS and type "make
installbin".</FONT>
<P>
<FONT SIZE=2>Since the DQS config process attempts to create the
BIN target directory, this phase may generate several warning
messages that "directory already exists". Ignore these
warnings. If, however, the message is "error, permission denied",
the process should be repeated in "root" mode.</FONT>
<P>
<FONT SIZE=2>To prevent confusion between DQS313 binaries and
previously installed versions we have appended the string "313"
during the installbin process. The usual next step is to provide
soft-links in /usr/local/bin to these binaries, something of the
form:</FONT>
<P>
<FONT SIZE=2>"ln -s /usr/local/DQS/bin/qmaster313 /usr/local/bin/qmaster
<BR>
</FONT>
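<P>
<FONT SIZE=2>The soft-link step can be repeated for each utility.
The sketch below demonstrates the loop in a scratch directory; for
a real installation substitute /usr/local/DQS/bin and /usr/local/bin
and run as root:</FONT>

```shell
# Scratch-directory demonstration of linking unversioned names to the
# "313" binaries; substitute the real directories in practice.
BIN_DIR=$(pwd)/demo/DQS/bin
LINK_DIR=$(pwd)/demo/bin
mkdir -p "$BIN_DIR" "$LINK_DIR"
for prog in qmaster313 dqs_execd313 qconf313 qsub313 qstat313; do
    touch "$BIN_DIR/$prog"                           # stand-in for the binary
    ln -sf "$BIN_DIR/$prog" "$LINK_DIR/${prog%313}"  # e.g. qmaster -> qmaster313
done
ls -l "$LINK_DIR"
```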
<H2>INSTALL fails during the "make installconf" phase
</H2>
<P>
<FONT SIZE=2>After the binaries have been installed in their directory
the "resolve_file" and "conf_file" will be moved
to their target directory (a possible default might be "/usr/local/DQS/common/conf").
In our "quick install example" this process should
proceed automatically. If the INSTALL script was initiated by
a non-root user and the destination directory is restricted to
a root-user this step will fail with a "permission denied"
error message. However when a series of different platform types
are being aggregated into a single cell, only one conf_file and
resolve_file need be moved to the common/conf directory. If this
has already been done then this step can be skipped.</FONT>
<H2>Startup of the qmaster fails.</H2>
<P>
<FONT SIZE=2>The principal reason for the qmaster not executing
during initial testing is the absence of the /etc/services entries
directed by the installation process. The err_file should be examined.
Warning messages about absent hosts, acl and complex files should
be ignored. Look for an entry "Bad Service" which points
to the /etc/services file.</FONT>
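<P>
<FONT SIZE=2>The missing entries resemble the fragment below. The
service names and port numbers shown are placeholders; use exactly
the names and ports directed by your installation procedure:</FONT>
<PRE>
# /etc/services additions for DQS (names and ports are placeholders)
dqs_qmaster     5000/tcp
dqs_execd       5001/tcp
</PRE>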
<P>
<FONT SIZE=2>An obvious error, but one which occurs often, is trying
to start the qmaster in user-mode while "RESERVED_PORTS TRUE"
appears in the conf_file.</FONT>
<P>
<FONT SIZE=2>If attempts at starting the qmaster still fail after
checking root-mode and the /etc/services file, the administrator should
set the environment variable DEBUG to 1 and then restart the qmaster
as follows : "qmaster313 >&debug.out &" (assuming
a C shell environment). After the qmaster crashes send the file
"debug.out" to the DQS support staff.<BR>
</FONT>
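<P>
<FONT SIZE=2>For administrators not running the C shell, a Bourne
shell equivalent of the capture commands is sketched below (we assume,
as the text above states, that the qmaster reads the DEBUG variable
from its environment):</FONT>
<PRE>
% setenv DEBUG 1                    # C shell, as in the text
% qmaster313 >&debug.out &

$ DEBUG=1; export DEBUG             # Bourne/Korn shell equivalent
$ qmaster313 > debug.out 2>&1 &
</PRE>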
<H2>Startup of the dqs_execd fails</H2>
<P>
<FONT SIZE=2>The principal reason for the dqs_execd not executing
during initial testing is the absence of the /etc/services entries
on its host as directed by the installation process. The err_file
should be examined. Warning messages should be ignored. Look for
an entry "Bad Service" which points to the /etc/services
file.</FONT>
<P>
<FONT SIZE=2>An obvious error, but one which occurs often, is trying
to start the dqs_execd in user-mode while "RESERVED_PORTS
TRUE" appears in the conf_file.</FONT>
<P>
<FONT SIZE=2>If the dqs_execd is not able to check in with the
qmaster during dqs_execd startup, the daemon will shut down (once
running, the dqs_execd will not shut down if the qmaster
is absent). Make sure the qmaster is running before attempting
to start the dqs_execd.</FONT>
<P>
<FONT SIZE=2>If attempts at starting the dqs_execd still fail after
checking root-mode and the /etc/services file, the administrator
should set the environment variable DEBUG to 1 and then restart
the dqs_execd as follows : "dqs_execd313 >&debug.out
&" (assuming a C shell environment). After the dqs_execd
crashes send the file "debug.out" to the DQS support
staff.</FONT>
<H2>Startup of qconf fails</H2>
<P>
<FONT SIZE=2>If the first attempt at using qconf produces error
messages and qconf terminates, there are several possible causes:</FONT>
<OL>
<LI><FONT SIZE=2>The user is attempting to execute qconf in root-mode
while the MIN_UID and MIN_GID are non-zero. For security reasons
root users are not normally permitted to execute DQS utilities
unless the MIN_UID is set to zero. </FONT>
<LI><FONT SIZE=2>qconf is being started in user-mode but the utility
itself is NOT owned by root and does not have the permissions
for the owner set correctly. This can occur when a manager uses
a path to the ARCS directory for the utility rather than the BIN_DIR
target where installbin is supposed to put all DQS binaries.</FONT>
<LI><FONT SIZE=2>qconf is being started on a host not yet known
to the qmaster. Here we have a cart-before-the-horse situation: we need
to use the qconf function to add hosts, but cannot execute qconf
because its host is not "legal". The only solution
is to initiate qconf on the same host where the qmaster resides.</FONT>
</OL>
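<P>
<FONT SIZE=2>Cause 2 can be checked, and corrected as root, along
these lines (the path is the quick-install default, and the exact
permission bits required may vary by site):</FONT>
<PRE>
$ ls -l /usr/local/DQS/bin/qconf313        # owner should be root
# as root, if the owner or mode is wrong:
# chown root /usr/local/DQS/bin/qconf313
# chmod 4755 /usr/local/DQS/bin/qconf313   # setuid bit; requirements may differ
</PRE>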
<H2>qstat display shows queue status as UNKNOWN</H2>
<P>
<FONT SIZE=2>During the initial test phase, the manager will have
created one queue using qconf. After it has been created, execution
of qstat should show the presence of a queue and a status of DISABLED.
An UNKNOWN status indicates a failure of the dqs_execd to contact
the qmaster in the time prescribed as MAX_UNHEARD in the conf_file.
Check the err_file for messages relating to the dqs_execd being
unable to contact the qmaster. Since the dqs_execd would not even
start if it could not check in with the qmaster, some new problem
must have developed. Check to see if the dqs_execd is still running.</FONT>
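<P>
<FONT SIZE=2>A quick way to check for the daemon (the bracketed
pattern keeps grep from matching its own command line):</FONT>

```shell
# Report whether a dqs_execd process is alive on this host.
if ps -ef | grep '[d]qs_execd' > /dev/null; then
    echo "dqs_execd is running"
else
    echo "dqs_execd is NOT running; restart it and watch the err_file"
fi
```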
<H2>qsub fails to submit test job</H2>
<P>
<FONT SIZE=2>The test script should be accepted by the DQS system
at this point with no problem, since utility<->qmaster interaction
has been operating successfully in the previous steps. The most
likely reason for a failure of this qsub test is represented by
a message of the form "ALARM CLOCK shutdown". This
is due to the qmaster or the network interfaces being overburdened.
Often the host on which the qmaster is running may be executing
some non-DQS managed computational hog. If the ALARM message occurs
try increasing the ALARM values in the conf file and re-executing
the qsub command. (Note that for this experiment the dqs_execd
and qmaster need not be restarted after changing the conf_file,
as the qsub is the only one complaining. However, if the new values
of the ALARM parameters prove satisfactory, the daemons should be
restarted as soon as practicable.)</FONT>
<H2>Test job ends with no output</H2>
<P>
<FONT SIZE=2>If the permissions for the user submitting the test
script are not sufficient for the target host, the job launching
process will be terminated and a message sent to the err_file.
An accounting record will also be sent to the DQS act_file. Check
these files for information.</FONT>
<H2>Test script produces a non-zero length stderr file</H2>
<P>
<FONT SIZE=2>The test script should create two output files, one
containing stdout information and the other the stderr output.
If the stderr output is not zero length then some "very unlikely"
event occurred during the job execution. Examine this stderr
file and the err_file to determine what the cause was.</FONT>
<H2>Operational errors</H2>
<P>
<FONT SIZE=2>Once the system has succeeded in running the test
script, the administrator will configure hosts, queues and resources
for its operational settings. A myriad of situations can then
occur which may appear to be, or in fact are, DQS system errors.
For this reason DQS produces a large number of informational,
warning and error messages which are posted to the system err_file.
</FONT>
<P>
<FONT SIZE=2>In the event that an operational aberration is detected,
the err_file should be examined closely. If no explanation is
obvious, the DQS support staff should be contacted and sent a
relevant extraction from the err_file and act_file.<BR>
</FONT>
</BODY>
</HTML>