################################################################################
#
# Copyright (C) 2016-2022 Advanced Micro Devices, Inc. All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
#
################################################################################
from . import Code
from . import Common
from .Common import globalParameters, CHeader, roundUp, Backup, print2
from .ReplacementKernels import ReplacementKernels
from .CustomKernels import isCustomKernelConfig
from .SolutionStructs import Solution
import abc
import collections
import os
import shutil
import subprocess
import copy
from math import ceil
################################################################################
# Kernel Writer
################################################################################
class KernelWriter(metaclass=abc.ABCMeta):
#__metaclass__=abc.ABCMeta
##############################################################################
# Init
##############################################################################
def __init__( self, kernelMinNaming, kernelSerialNaming ):
self.kernelMinNaming = kernelMinNaming
self.kernelSerialNaming = kernelSerialNaming
self.overflowedResources = 0
@property
def asmCaps(self):
"""
Assembler capabilities for the current ISA version.
"""
return globalParameters["AsmCaps"][self.version]
@property
def archCaps(self):
"""
Architectural capabilities for the current ISA version.
"""
return globalParameters["ArchCaps"][self.version]
@property
def globalParams(self):
"""
Global parameters for current configuration.
"""
return globalParameters
##############################################################################
# makeSchedule: Schedule work into iterations.
# Tensile uses a two-level scheduler. This is the first level, which
# schedules global reads, global incs, and local writes into iterations.
# Then makeSubIterSchedule schedules the instructions within the iteration.
#
# Inputs:
# localWriteEndIter: loop iteration where last writes should be inserted
# If scheduleLocalWrite=0, all writes will be placed in this iteration.
# If scheduleLocalWrite=1, the scheduler will work backwards from this
# iteration.
#
# Outputs:
# self.unrollLoopHeaderCode:
# - Code module that should be added into the unroll loop header
# In unscheduled code this contains global loads and global address increment
# self.perIterGlobalReadCode[], self.perIterLocalWriteCode[]
# - List indexed by unroll iteration.
# Each entry in the list is a code module that should be added into that iteration.
# May be None, indicating no extra code for that iteration
# self.grEndMfmaIndex
# self.lwStartMfmaIndex
# self.lwEndMfmaIndex
# self.barrierMfmaIndex
# self.numMfmaForNextLoopLR
# This routine is responsible for setting the schedule including determining
# that all necessary dependencies are met. The driver code in kernelBody
# blindly follows the plan set in unrollLoopHeaderCode and perIterCode
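#
# Illustrative example (hypothetical numbers): with LoopIters=4 and scheduled
# global reads / local writes enabled, two global loads might land in
# perIterGlobalReadCode[0] and [1], while the local writes are spread backwards
# from localWriteEndIter, e.g. across perIterLocalWriteCode[2] and [3].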
##############################################################################
def makeSchedule(self, kernel, tensorParametersA, tensorParametersB, localWriteEndIter, uDu=0, skipGlobalReadInc=False, firstIter=False, lastLoop=False, lastLc=False):
currentIsa = globalParameters["CurrentISA"]
maxVmcnt = globalParameters["AsmCaps"][currentIsa]["MaxVmcnt"]
self.unrollLoopHeaderCode = Code.Module()
# schedule of work for each local_read iteration:
self.perIterGlobalReadCode = [ Code.Module() for i in range (kernel["LoopIters"]) ]
self.perIterLocalWriteCode = [ Code.Module() for i in range (kernel["LoopIters"]) ]
if lastLc:
self.perIterLocalWriteCodeNGLL = [ Code.Module() for i in range (kernel["LoopIters"]) ]
self.perIterLocalWriteCanSkip = [ 0 for i in range (kernel["LoopIters"]) ]
self.perIterGlobalReadCodeDTV = [ Code.Module() for i in range (kernel["LoopIters"]) ] # global read for DirectToVgpr
assert([item.name for item in self.globalReadIncrements.itemList] == ['globalReadIncrementA', 'globalReadIncrementB'])
globalReadIncACode = self.globalReadIncrements.findNamedItem("globalReadIncrementA")
globalReadIncBCode = self.globalReadIncrements.findNamedItem("globalReadIncrementB")
if uDu < kernel["DepthULdsDivisor"] - 1 and kernel.enabledSplitLDS and kernel["PrefetchGlobalRead"] \
or skipGlobalReadInc:
globalReadIncACode = Code.Module()
globalReadIncBCode = Code.Module()
grBackup = None
if uDu != kernel["DepthULdsDivisor"] - 2 and kernel.enabledSplitLDS:
# hack RAII object for auto restore
# withhold issuing global read code until the 2nd-to-last subloop, i.e. we empty the code
# modules in the other subloops.
grBackup = Backup(self, globalReadACode = self.globalReadACode, globalReadBCode = self.globalReadBCode)
self.globalReadACode = Code.StructuredModule() # empty
self.globalReadBCode = Code.StructuredModule() # empty
numGlobalReadC = self.getNumberOfLoadCInForLoadC(kernel)
lastLoadIter = 0
PRECISION = 100
if kernel["EnableMatrixInstruction"] and kernel["ScheduleIterAlg"] == 3:
numMfmaPerIter = self.numMfmaPerIter
#########
# Get localWriteEnd
#########
# assign parameters
# 1. we calculate the number of mfma needed to prefetch localReads for the next loop
# 2. we put the barrier 1 mfma ahead of that
# 3. we put the last localWrite 1~2 mfma ahead of the barrier
# localReads are scheduled in the following sequence:
# ds_read[A][0], ds_read[B][0], ds_read[A][1:], ds_read[B][1:]
# NOTE: we need this sequence for new feature "breaking waitcnt"
# TODO: breaking waitcnt
self.numMfmaForLR = 1
latencyLeft = self.miLatencyLeft
miLatencyLeft = self.miLatencyLeft
# ds_read[A][0]
for i in range(self.numReadPerVectorA):
latencyLeft -= tensorParametersA["localReadInstruction"].IssueLatency*2
if latencyLeft < 0:
self.numMfmaForLR += 1
latencyLeft = max(miLatencyLeft - tensorParametersA["localReadInstruction"].IssueLatency*2,0)
# ds_read[B][0]
for i in range(self.numReadPerVectorB):
latencyLeft -= tensorParametersB["localReadInstruction"].IssueLatency*2
if latencyLeft < 0:
self.numMfmaForLR += 1
latencyLeft = max(miLatencyLeft - tensorParametersB["localReadInstruction"].IssueLatency*2,0)
# ds_read[A][1:]
for i in range(self.numReadsPerIterA-self.numReadPerVectorA):
latencyLeft -= tensorParametersA["localReadInstruction"].IssueLatency*2
if latencyLeft < 0:
self.numMfmaForLR += 1
latencyLeft = max(miLatencyLeft - tensorParametersA["localReadInstruction"].IssueLatency*2,0)
# ds_read[B][1:]
for i in range(self.numReadsPerIterB-self.numReadPerVectorB):
latencyLeft -= tensorParametersB["localReadInstruction"].IssueLatency*2
if latencyLeft < 0:
self.numMfmaForLR += 1
latencyLeft = max(miLatencyLeft - tensorParametersB["localReadInstruction"].IssueLatency*2,0)
# calculate the number of mfma we need to wait before data arrives from lds into vgpr.
# latency: 40 quad-cycles for 4 words, 20 quad-cycles for 2 words, 10 quad-cycles for 1 word / half word
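# Illustrative example (assumed numbers): a blockWidth-4 (4-word) ds_read has
# ~40 quad-cycles of latency; with 6 quad-cycles of latencyLeft remaining and an
# assumed miLatency of 8 quad-cycles, the loop below adds ceil((40-6-8)/8) = 4
# extra mfma before the prefetched data is safe to consume.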
self.numMfmaForNextLoopLR = self.numMfmaForLR
latencyForLR = roundUp(tensorParametersB["localReadInstruction"].blockWidth)*10
latencyForLR -= max(latencyLeft,0) # remaining latency in mfma
latencyForLR -= self.miLatency # last LR will have 1 mfma latency
while latencyForLR > 0:
latencyForLR -= self.miLatency
self.numMfmaForNextLoopLR += 1
# final index definition
self.numMfmaForNextLoopLR = min(self.numMfmaForNextLoopLR,numMfmaPerIter-1)
self.barrierMfmaIndex = numMfmaPerIter*(kernel["LoopIters"]-self.numItersPLR+1) - self.numMfmaForNextLoopLR - 1 if self.numItersPLR else 0
numMfmaBetweenLWandBarrier = 2 if kernel["MatrixInstM"] == 32 else 3
self.lwEndMfmaIndex = max(self.barrierMfmaIndex - numMfmaBetweenLWandBarrier,0) if self.numItersPLR else numMfmaPerIter*kernel["LoopIters"] - 1
if kernel["DirectToLds"] and kernel["PrefetchGlobalRead"] == 2:
# DirectToLds + PGR=2 case, lwEndMfmaIndex must be after the end of local read (excluding local reads for next iter)
lrEnd = min(self.barrierMfmaIndex - 1, self.numMfmaForLR * (kernel["LoopIters"] - self.numItersPLR))
if self.lwEndMfmaIndex < lrEnd:
self.lwEndMfmaIndex = lrEnd
#########
# Internally assign an optimized LWPM value for PGR2
#########
# the strategy is to spread LW/GR as widely as possible to avoid hitting the vmem FIFO
# LWPM = (LW_End - LW_Start) / numLW
if kernel["LocalWritePerMfma"] == -1:
#########
# Get localWriteStart
#########
if not kernel["1LDSBuffer"]:
# TODO: replace this with the real number of globalReadIncInst
numGRIncInst = 12 if not kernel["StaggerU"] else 18
numInstPerMfma = max(roundUp(self.miLatencyLeft/2),1)
numMfmaToSched = roundUp(numGRIncInst/numInstPerMfma)
lwStartMfmaIndex = 1 + numMfmaToSched
else:
# for 1LDSB, we have to issue localwrites after localreads
if self.numVgprBuffer == kernel["LoopIters"]:
if self.numReadPerVectorA != 1 or self.numReadPerVectorB !=1:
# fp16 or bf16: we read 1 element into vgprBuffer and the other element into tempVgpr.
# since each iteration shares the same tempVgpr, only the reads into vgprBuffer can
# be scheduled at the front of the loop.
# localwrites have to start after the last read into tempVgpr.
numHalfReads = (self.numReadPerVectorA//2)*kernel["InnerUnroll"]*kernel["MIWaveTileA"] + (self.numReadPerVectorB//2)*kernel["InnerUnroll"]*kernel["MIWaveTileB"]
numMfmaForHalfRead = 1
latencyLeft = self.miLatencyLeft
for i in range(numHalfReads):
latencyLeft -= 2
if latencyLeft < 0:
numMfmaForHalfRead += 1
latencyLeft = max(self.miLatencyLeft - 2, 0)
lwStartMfmaIndex = numMfmaPerIter * (kernel["LoopIters"] - 1 - self.numItersPLR) + numMfmaForHalfRead
else:
# we have enough vgprBuffer to schedule localReads at the front of the loop
numMfmaForCurrentLoopLR = 1
latencyLeft = self.miLatencyLeft
for u in range(kernel["LoopIters"] - self.numItersPLR):
doReadA = (u < kernel["LoopIters"] // self.numIterPerCoalescedReadA - self.numItersPLR)
doReadB = (u < kernel["LoopIters"] // self.numIterPerCoalescedReadB - self.numItersPLR)
# disable LocalRead if DirectToVgpr is enabled
doReadA = doReadA and (not kernel["DirectToVgprA"])
doReadB = doReadB and (not kernel["DirectToVgprB"])
# ds_read[A][0]
for i in range(self.numReadPerVectorA * doReadA):
latencyLeft -= tensorParametersA["localReadInstruction"].IssueLatency*2
if latencyLeft < 0:
numMfmaForCurrentLoopLR += 1
latencyLeft = max(self.miLatencyLeft - tensorParametersA["localReadInstruction"].IssueLatency*2,0)
# ds_read[B][0]
for i in range(self.numReadPerVectorB * doReadB):
latencyLeft -= tensorParametersB["localReadInstruction"].IssueLatency*2
if latencyLeft < 0:
numMfmaForCurrentLoopLR += 1
latencyLeft = max(self.miLatencyLeft - tensorParametersB["localReadInstruction"].IssueLatency*2,0)
# ds_read[A][1:]
for i in range((self.numReadsPerIterA - self.numReadPerVectorA) * doReadA):
latencyLeft -= tensorParametersA["localReadInstruction"].IssueLatency*2
if latencyLeft < 0:
numMfmaForCurrentLoopLR += 1
latencyLeft = max(self.miLatencyLeft - tensorParametersA["localReadInstruction"].IssueLatency*2,0)
# ds_read[B][1:]
for i in range((self.numReadsPerIterB - self.numReadPerVectorB) * doReadB):
latencyLeft -= tensorParametersB["localReadInstruction"].IssueLatency*2
if latencyLeft < 0:
numMfmaForCurrentLoopLR += 1
latencyLeft = max(self.miLatencyLeft - tensorParametersB["localReadInstruction"].IssueLatency*2,0)
lwStartMfmaIndex = numMfmaForCurrentLoopLR
else:
lwStartMfmaIndex = numMfmaPerIter * (kernel["LoopIters"] - 1 - self.numItersPLR) + self.numMfmaForLR
# calculate the number of mfma we need to wait before data arrives from lds into vgpr.
# latency: 40 quad-cycles for 4 words, 20 quad-cycles for 2 words, 10 quad-cycles for 1 word / half word
if self.numIterPerCoalescedReadB > self.numIterPerCoalescedReadA:
latencyForLR = roundUp(tensorParametersA["localReadInstruction"].blockWidth) * 10
else:
latencyForLR = roundUp(tensorParametersB["localReadInstruction"].blockWidth) * 10
latencyForLR -= max(latencyLeft,0) # remaining latency in mfma
while latencyForLR > 0:
latencyForLR -= self.miLatency
lwStartMfmaIndex += 1
#########
# Get LocalWritePerMfma
#########
if lwStartMfmaIndex > self.lwEndMfmaIndex:
lwStartMfmaIndex = self.lwEndMfmaIndex
numMfmaCanSched = self.lwEndMfmaIndex - lwStartMfmaIndex + 1
numLoadsA = kernel["DepthU"]*kernel["MacroTileA"]//kernel["GlobalLoadVectorWidthA"]//kernel["NumThreads"]
numLoadsB = kernel["DepthU"]*kernel["MacroTileB"]//kernel["GlobalLoadVectorWidthB"]//kernel["NumThreads"]
writesToSched = (numLoadsA + numLoadsB - 1) * PRECISION
# In StoreCInUnroll case, add StoreC code related code to writesToSched
if kernel["StoreCInUnroll"]:
numStoreCUnrollCode = len(list(self.StoreCUnrollCode.items()))
writesToSched += numStoreCUnrollCode * PRECISION
oldValue = 0
newValue = PRECISION
loop = 0
# 1. number of padded writesToSched is (numWrites - 1) * 100 + 1
# LW ---99--- LW ---99--- LW
# 2. we need to pad it to multiple of LWPM
# LW ---99--- LW ---99--- LW --?--
# | ------- multiple of LWPM ---- |
# 3. if LWPM is not a multiple of 100, we need extra empty instructions to schedule GR for PGR2
# LW ---99--- LW ---99--- LW --?-- --?--
# | ------- multiple of LWPM ---- |-LWPM-|
# 4. then we put GR into the padded writesToSched
# put GR after LW plus LWPM's worth of empty instructions, so that we can offset GR from LW by 1 mfma if possible
# Ex. LWPM = 0.25
# LW --24- GR ------74------ LW --24- GR ------74------ LW --24- GR --24-
# mfma--24-mfma--24-mfma--24-mfma--24-mfma--24-mfma--24-mfma--24-mfma--24-mfma
# the padding itself depends on LWPM, so to get a precise LWPM
# we iterate the formula up to 10 times until it converges
while oldValue != newValue and loop < 10:
loop += 1
oldValue = newValue
newValue = roundUp((writesToSched+1 + (oldValue - (writesToSched+1) % oldValue) + oldValue%PRECISION) / numMfmaCanSched)
numLocalWriteModPerMfma = newValue
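# Illustrative example (assumed numbers): with 4 global loads (writesToSched =
# 300 after PRECISION padding) and numMfmaCanSched = 4, the first pass already
# yields roundUp((301 + 99 + 0)/4) = 100, so the iteration converges to
# numLocalWriteModPerMfma = 100, i.e. LWPM = 1 local write slot per mfma.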
#####
# Assign GRPM and LWPM
#####
# HOW THIS WORKS
# pad each globalReadInstruction to 100 with empty instructions;
# each mfma then schedules GRPM*100 entries from the padded globalReadInstruction stream.
# Ex. GRPM = 0.5
# GR ---------99--------- GR --------99---------- GR
# mfma --49-- mfma --49-- mfma --49-- mfma --49-- mfma --49--
self.numGlobalReadInsPerMfma = roundUp(kernel["GlobalReadPerMfma"]*PRECISION)
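# e.g. (illustrative) GlobalReadPerMfma = 0.5 gives numGlobalReadInsPerMfma = 50,
# i.e. one global read scheduled every other mfma, matching the GRPM = 0.5 diagram above.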
# HOW THIS WORKS
# pad each localWriteInstruction to 100 with empty instructions;
# each mfma then schedules LWPM*100 entries from the padded localWriteInstruction stream.
# Ex. LWPM = 0.5
# LW ---------99--------- LW --------99---------- LW
# mfma --49-- mfma --49-- mfma --49-- mfma --49-- mfma --49--
if kernel["LocalWritePerMfma"] == -1:
if kernel["PrefetchGlobalRead"] == 1:
# In PGR1:
# Larger LWPM can provide more latency to hide global read
# However, larger LWPM may cause mfma bubbles
# we set LWPM to 1 unless it requires larger LWPM to enable 1LDSB
if kernel["1LDSBuffer"]:
self.numLocalWriteModPerMfma = max(numLocalWriteModPerMfma,PRECISION)
else:
self.numLocalWriteModPerMfma = PRECISION
else:
self.numLocalWriteModPerMfma = numLocalWriteModPerMfma
else:
self.numLocalWriteModPerMfma = roundUp(kernel["LocalWritePerMfma"]*PRECISION)
##################################
numGlobalReadInsPerIter = numMfmaPerIter * self.numGlobalReadInsPerMfma
numLocalWriteModPerIter = numMfmaPerIter * self.numLocalWriteModPerMfma
# if numGlobalReadInsPerMfma>1, we still want to schedule only 1 GlobalReadIncCode per mfma;
# insert empty CodeModules so that the generator schedules 1 GlobalReadIncCode plus 1 empty CodeModule when numGlobalReadInsPerMfma=2
numEmptyGlobalReadIncCode = self.numGlobalReadInsPerMfma - 1
# If numLocalWriteModPerMfma is not a multiple of 100,
# last globalread will be scheduled at lwEndMfmaIndex,
# and last localwrite will be scheduled at lwEndMfmaIndex - 1
# so we offset lwEndMfmaIndex by 1 mfma
if kernel["PrefetchGlobalRead"] == 2 and self.numLocalWriteModPerMfma % PRECISION != 0:
numMfmaBetweenLWandBarrier -= 1
def assignParamSplitLds(numMfmaBetweenLWandBarrier):
if not kernel.enabledSplitLDS:
return numMfmaBetweenLWandBarrier
# how many local reads in terms of mfma indices (height)
# total number of instructions (total) minus the instructions prefetched outside of the loop (spent), divided by the mfma bubble (width)
issueLatency = max(self.localReadInstructionA.IssueLatency, self.localReadInstructionB.IssueLatency) * 2
width = self.miLatencyLeft // issueLatency
width = max(width, 1)
spent = self.numItersPLR * (self.numReadsPerIterA + self.numReadsPerIterB)
total = kernel["LoopIters"]//self.numIterPerCoalescedReadA*self.numReadsPerIterA + \
kernel["LoopIters"]//self.numIterPerCoalescedReadB*self.numReadsPerIterB
height = int(ceil((total-spent)/width))
# how many local writes
localWritesToSched = self.localWriteACode.countType(Code.LocalWriteInst) + \
self.localWriteBCode.countType(Code.LocalWriteInst)
if kernel["StoreCInUnroll"]:
# in StoreCInUnroll case, add number of storeC related code here
# add store C related code to itemsLWToSched
numStoreCUnrollCode = len(list(self.StoreCUnrollCode.items()))
localWritesToSched += numStoreCUnrollCode
localWritesPerMfma = self.numLocalWriteModPerMfma / PRECISION # was scaled by PRECISION
# _numMfmaBetweenLastLWandBarrier: a function of 'spacing', which is num of mfma instructions until local write starts
_numMfmaBetweenLastLWandBarrier = lambda spacing : self.barrierMfmaIndex + 1 - ceil(localWritesToSched/localWritesPerMfma) - spacing
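# Illustrative example (assumed numbers): with barrierMfmaIndex = 11, 8 local
# writes at localWritesPerMfma = 1 and spacing = 1, the lambda gives
# 11 + 1 - ceil(8/1) - 1 = 3 mfma between the last local write and the barrier.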
addrIncToSched = sum(1 for codemod in [globalReadIncACode, globalReadIncBCode] if len(codemod.items()))
if uDu < kernel["DepthULdsDivisor"] - 1:
if kernel["1LDSBuffer"] and kernel["PrefetchLocalRead"] > 1:
# space the stream of local writes so that 1st local write is scheduled after last local read,
# but give it 2 mfma's worth of headroom
spacing = 2 + height
else:
# can start ds_write/buffer_load as soon as loop starts, but give it 1 mfma's worth of headroom
spacing = 1
else:
# query how much spacing we have by calling lambda(0), minus the original 'numMfmaBetweenLWandBarrier'
# we get the spacing that results in exactly 'numMfmaBetweenLWandBarrier' between last write and barrier
spacing = _numMfmaBetweenLastLWandBarrier(0) - numMfmaBetweenLWandBarrier + addrIncToSched - 1
return max(0, _numMfmaBetweenLastLWandBarrier(spacing))
numMfmaBetweenLWandBarrier = assignParamSplitLds(numMfmaBetweenLWandBarrier)
# In StoreCInUnroll + num of store > 1 case, reduce numMfmaBetweenLWandBarrier to 1
# because interval between local write and read is already added by StoreCInUnroll code
if kernel["StoreCInUnroll"] and self.getNumberOfStoreCInTemplate(kernel) > 1:
numMfmaBetweenLWandBarrier = min(numMfmaBetweenLWandBarrier, 1)
self.lwEndMfmaIndex = max(self.barrierMfmaIndex - numMfmaBetweenLWandBarrier,0) if self.numItersPLR else numMfmaPerIter*kernel["LoopIters"] - 1
# adjust lwEndMfmaIndex for the following cases
# 1) PGR=2 + DirectToVgpr(DTV)
# 2) last loop and StoreCInUnrollPostLoop enabled case
# In these cases, lwEndMfmaIndex needs to be < numMfmaPerIter * (kernel["LoopIters"] - 1)
# to schedule global read for DTV after lwEndMfmaIndex or execute PostLoop after StoreC in NoLoadLoop
# kernel["LoopIters"] has to be > 1 to make this logic work.
if kernel["LoopIters"] > 1 and \
((kernel["PrefetchGlobalRead"] == 2 and (kernel["DirectToVgprA"] or kernel["DirectToVgprB"])) or \
(lastLoop and kernel["StoreCInUnrollPostLoop"])):
self.lwEndMfmaIndex = min(self.lwEndMfmaIndex, numMfmaPerIter * (kernel["LoopIters"] - 1) - 1)
if kernel["DirectToLds"] and kernel["PrefetchGlobalRead"] == 2:
# DirectToLds + PGR=2 case, lwEndMfmaIndex must be after the end of local read (excluding local reads for next iter)
lrEnd = min(self.barrierMfmaIndex - 1, self.numMfmaForLR * (kernel["LoopIters"] - self.numItersPLR))
if self.lwEndMfmaIndex < lrEnd:
self.lwEndMfmaIndex = lrEnd
localWriteEndIter = self.lwEndMfmaIndex//numMfmaPerIter
localWriteEndIter = min(kernel["LoopIters"] - 1, localWriteEndIter)
assert localWriteEndIter < kernel["LoopIters"]
assert self.lwEndMfmaIndex < numMfmaPerIter*kernel["LoopIters"]
else:
numGlobalReadInsPerIter = roundUp(kernel["GlobalReadPerMfma"] * PRECISION) if kernel["GlobalReadPerMfma"] > 0 else PRECISION
numLocalWriteModPerIter = roundUp(kernel["LocalWritePerMfma"] * PRECISION) if kernel["LocalWritePerMfma"] > 0 else PRECISION
numEmptyGlobalReadIncCode = numGlobalReadInsPerIter - 1
numLocalWritesPerSched = numLocalWriteModPerIter if kernel["ScheduleIterAlg"] != 3 else self.numLocalWriteModPerMfma
if not self.scheduleGlobalRead:
# put everything in the header:
self.unrollLoopHeaderCode.addCode(self.dtlsM0UpdateACode)
self.unrollLoopHeaderCode.addCode(self.globalReadACode)
self.unrollLoopHeaderCode.addCode(self.dtlsM0UpdateBCode)
self.unrollLoopHeaderCode.addCode(self.globalReadBCode)
self.unrollLoopHeaderCode.addCode(globalReadIncACode)
self.unrollLoopHeaderCode.addCode(globalReadIncBCode)
if kernel["EnableMatrixInstruction"] and kernel["ScheduleIterAlg"] == 3:
self.grEndMfmaIndex = 0
itemsGRToSchedLater = []
else:
self.unrollLoopHeaderCode.addCode(self.globalReadACode.header)
self.unrollLoopHeaderCode.addCode(self.globalReadBCode.header)
# Add all loads from middle as individual schedulable items
# when using PGR2, put global read instruction right after corresponding localWrite instruction
if kernel["PrefetchGlobalRead"] == 2 or kernel.enabledSplitLDS:
itemsGRToSched = []
itemsGRToSchedLater = list(self.globalReadACode.middle.items()) + \
list(self.globalReadBCode.middle.items())
itemsGRToSchedLaterDTV = []
# PGR2 and DirectToVgpr case, schedule global read for DirectToVgpr separately after registers are used for mfma
if kernel["EnableMatrixInstruction"]:
if kernel["DirectToVgprA"] or kernel["DirectToVgprB"]:
itemsGRToSchedLater = list(self.globalReadACode.middle.items()) # not DirectToVgpr (A has non-DirectToVgpr load)
itemsGRToSchedLaterDTV = list(self.globalReadBCode.middle.items()) # DirectToVgpr (B has DirectToVgpr load)
# add to self.perIterGlobalReadCodeDTV to schedule DirectToVgpr
while itemsGRToSchedLaterDTV:
itemGR = itemsGRToSchedLaterDTV.pop(0)
self.perIterGlobalReadCodeDTV[kernel["LoopIters"] - 1].addCode(itemGR)
if kernel.enabledSetPrioSplitLDS and itemsGRToSchedLater:
itemsGRToSchedLater.insert(1, Code.Inst("s_setprio", "3", "top priority for load"))
itemsGRToSchedLater.insert(len(itemsGRToSchedLater), Code.Inst("s_setprio", "0", ""))
else:
itemsGRToSched = list(self.globalReadACode.middle.items()) + \
list(self.globalReadBCode.middle.items())
itemsGRToSchedLater = []
if kernel["StoreCInUnroll"]:
# in StoreCInUnroll case, add loadC code here (self.LoadCTemplate is empty for no loadC required case)
# The location to insert LoadC is decided based on DirectToLds and DirectToVgpr setting
# 1) No Lds write case (Both DirectToLds or DirectToVgpr enabled), insert Load C before Load A and B
if kernel["NoLdsWriteCode"]:
itemsGRToSched = list(list(self.LoadCUnrollCode.itemList) + self.globalReadACode.middle.items()) +\
list(self.globalReadBCode.middle.items())
# 2) DirectToVgpr only enabled case, insert Load C before Load for DirectToVgpr
elif kernel["DirectToVgprA"] or kernel["DirectToVgprB"]:
itemsGRToSched = list(self.globalReadACode.middle.items()) + list(self.LoadCUnrollCode.itemList) +\
list(self.globalReadBCode.middle.items())
# 3) no DirectToVgpr/Lds case, insert Load C after Load A,B
else:
itemsGRToSched += list(self.LoadCUnrollCode.itemList)
itemsGRToSchedTemp = []
for i in range(len(itemsGRToSched)):
itemsGRToSchedTemp.append(itemsGRToSched.pop(0))
for j in range(PRECISION-1):
itemsGRToSchedTemp.append(Code.Module())
itemsGRToSched = itemsGRToSchedTemp
itemsGRIncToSched = []
if kernel["EnableMatrixInstruction"] and kernel["ScheduleIterAlg"] == 3:
# for SIA3, we can break GlobalReadIncCode to avoid mfma bubbles
if kernel["PrefetchGlobalRead"] == 2:
# skip scheduling global read for the first mfma in PGR2
for i in range(numEmptyGlobalReadIncCode+1):
imod = Code.Module()
itemsGRIncToSched.append(imod)
numInst = globalReadIncACode.countType(Code.Inst) + globalReadIncBCode.countType(Code.Inst)
numInstPerMfma = max(roundUp(self.miLatencyLeft/2),1)
globalReadInc1 = globalReadIncACode.flatitems()
globalReadInc2 = globalReadIncBCode.flatitems()
if kernel["DirectToVgprA"]:
# swap the order of readInc for DTVA
globalReadInc1, globalReadInc2 = globalReadInc2, globalReadInc1
globalReadIncItems = globalReadInc1 + globalReadInc2
if kernel["StoreCInUnroll"] and kernel["PrefetchGlobalRead"] == 2:
# PGR=2 + StoreCInUnroll case: add the first LoadC after IncA, and the second LoadC (if it exists) after IncB
tmpList = list(self.LoadCUnrollCode.itemList)
dummyList = [ Code.Module() for i in range (numInstPerMfma - 1) ]
if len(tmpList) > 0:
# first LoadC
globalReadIncItems = globalReadInc1 + tmpList[0:1] + dummyList + globalReadInc2
if len(tmpList) > 1:
# second LoadC
globalReadIncItems += tmpList[1:]
# add len(LoadCUnrollCode.itemList) to numInst
numInst += len(tmpList)
numMfmaToSched = roundUp(numInst/numInstPerMfma)
for j in range(numMfmaToSched):
imod = Code.Module()
count = 0
while globalReadIncItems and count < numInstPerMfma:
tempInst = globalReadIncItems.pop(0)
imod.addCode(tempInst)
if tempInst.countType(Code.Inst):
count += 1
itemsGRIncToSched.append(imod)
for i in range(numEmptyGlobalReadIncCode):
imod = Code.Module()
itemsGRIncToSched.append(imod)
else:
itemsGRIncToSched.append(globalReadIncACode)
for i in range(numEmptyGlobalReadIncCode):
imod = Code.Module()
itemsGRIncToSched.append(imod)
itemsGRIncToSched.append(globalReadIncBCode)
for i in range(numEmptyGlobalReadIncCode):
imod = Code.Module()
itemsGRIncToSched.append(imod)
if kernel["EnableMatrixInstruction"] and kernel["ScheduleIterAlg"] == 3:
# Loop in PGR1: GlobalRead -> GlobalReadInc -> LocalWrite
# but GlobalReadInc shouldn't block LocalWrite so we count them out
# Loop in PGR2: GlobalReadInc -> LocalWrite/GlobalRead pair
# since LocalWrite/GlobalRead pair depends on GlobalReadInc, we count in only GlobalReadInc
if kernel["PrefetchGlobalRead"] == 2:
loadsToSched = len(itemsGRIncToSched)
else:
loadsToSched = len(itemsGRToSched)
# Here we adjust the schedule silently so that validation passes.
# A better way would be to use a larger globalReadPerMfma.
## schedule more instructions in the first iteration if there are not enough mfma to schedule globalRead
self.grEndMfmaIndex = max(0, roundUp(loadsToSched/self.numGlobalReadInsPerMfma) - 1)
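# Illustrative example (assumed numbers): 7 loads padded to 700 slots with
# numGlobalReadInsPerMfma = 100 (GRPM = 1) give grEndMfmaIndex =
# roundUp(700/100) - 1 = 6, clamped to lwEndMfmaIndex below if it overshoots.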
if self.grEndMfmaIndex > self.lwEndMfmaIndex:
schedNumForIter0 = numGlobalReadInsPerIter + (self.grEndMfmaIndex - self.lwEndMfmaIndex) * self.numGlobalReadInsPerMfma
self.grEndMfmaIndex = self.lwEndMfmaIndex
else:
schedNumForIter0 = numGlobalReadInsPerIter
if kernel["PrefetchGlobalRead"] == 1:
globalReadIncEndMfmaIndex = self.grEndMfmaIndex + roundUp(len(itemsGRIncToSched)/self.numGlobalReadInsPerMfma)
endIter = roundUp((globalReadIncEndMfmaIndex+1)/numMfmaPerIter)
else:
endIter = roundUp((self.grEndMfmaIndex+1)/numMfmaPerIter)
## schedule more instructions in the first iteration if there are not enough mfma to schedule globalRead + globalReadInc
if endIter > kernel["LoopIters"]:
endIter = kernel["LoopIters"]
if kernel["PrefetchGlobalRead"] == 1:
schedNumForIter0 += (globalReadIncEndMfmaIndex+1 - kernel["LoopIters"]*numMfmaPerIter) * self.numGlobalReadInsPerMfma
# SIA 1 or 2
# distribute the instructions in itemsGRToSched as evenly as possible across iterations: perIterGlobalReadCode[0,endIter)
# last one is perIterGlobalReadCode[endIter-1],
# Ideally: endIter <= localWriteEndIter,
# then put M0 updateCode (if any) and first 'schedNumForIter0' GR-inst in perIterGlobalReadCode[0]
# put every numGlobalReadInsPerIter GR-insts in perIterGlobalReadCode[1]~[endIter-1]
# corner case: endIter > localWriteEndIter; set endIter = localWriteEndIter, in which case schedNumForIter0 will be > 1
# and perIterGlobalReadCode[0] would need to schedule more instructions
else:
# reads and incs are scheduled in iters range(0..endIter)
endIter = roundUp((len(itemsGRToSched) + len(itemsGRIncToSched)) / numGlobalReadInsPerIter)
# FIXME:
# the formula above precisely counts the number of GR + GRInc,
# however it has a regression issue with tuned yamls that use the default GRPM.
# the formula below follows the old logic and adds 2 to the instruction count, so it may give a larger schedNumForIter0.
# we should use the formula above together with GRPM tuning for better performance.
# NOTE: both formulas pass the validation test
endIter = roundUp((len(itemsGRToSched) + len(itemsGRIncToSched) + 2*PRECISION) / numGlobalReadInsPerIter)
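# Illustrative example (assumed numbers): if len(itemsGRToSched) +
# len(itemsGRIncToSched) = 800 padded slots and numGlobalReadInsPerIter = 200,
# the precise formula gives endIter = 4 while this legacy formula gives
# roundUp((800 + 2*100)/200) = 5.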
if endIter > localWriteEndIter:
# Front-load some of the buffer loads if we don't have enough loop iters:
# could use a different/smarter algorithm to space out the loads?
schedNumForIter0 = (endIter-(localWriteEndIter) + 1) * numGlobalReadInsPerIter
endIter = localWriteEndIter
else:
# schedule b2b for readCnt > 2 (True for bigger TT)
schedNumForIter0 = numGlobalReadInsPerIter
# insert dtlsM0UpdateACode / dtlsM0UpdateBCode
if self.globalReadACode.middle.items():
self.globalReadACode.middle.items()[0].items().insert(0,self.dtlsM0UpdateACode)
if self.globalReadBCode.middle.items():
self.globalReadBCode.middle.items()[0].items().insert(0,self.dtlsM0UpdateBCode)
itemsGRToSched.extend(itemsGRIncToSched)
# append 'n' global loads at a time
# first append the number of global load(s) determined by schedNumForIter0
for item in itemsGRToSched[:schedNumForIter0]:
self.perIterGlobalReadCode[0].addCode(item)
itemsGRToSched = itemsGRToSched[schedNumForIter0:] # trim the scheduled GRs, do the rest in the following loop
for u in range(1, endIter):
# append itemPerIter GR for each iteration,
# and trim the scheduled ones at the end of loop
itemPerIter = 1 * numGlobalReadInsPerIter
try:
for item in itemsGRToSched[:itemPerIter]:
self.perIterGlobalReadCode[u].addCode(item)
lastLoadIter = u
itemsGRToSched = itemsGRToSched[itemPerIter:]
except IndexError:
break # itemsGRToSched is 0-length, no code left to schedule
assert not itemsGRToSched # should have scheduled everything already, itemsGRToSched should be empty
# adjustment for StoreCInUnroll
# lastLoop case: make the last perIterGlobalReadCode[] (LoopIters-1) empty,
# otherwise mixing global read inc code and StoreCInUnroll post code could cause a memory access issue
if kernel["StoreCInUnroll"] and lastLoop:
lastIter = kernel["LoopIters"] - 1
prevLastIter = max(0, lastIter - 1)
if prevLastIter < lastIter:
while self.perIterGlobalReadCode[lastIter].items():
self.perIterGlobalReadCode[prevLastIter].addCode(self.perIterGlobalReadCode[lastIter].items().pop(0))
self.perIterGlobalReadCode[endIter-1].addCode(self.globalReadACode.footer)
self.perIterGlobalReadCode[endIter-1].addCode(self.globalReadBCode.footer)
# Now schedule the writes:
if not self.scheduleLocalWrite:
# if no scheduleLocalWrite - just add all writes to localWriteEndIter
# If PGR=0, writes have to be done immediately following the loads - no opportunity to schedule
# so don't add to schedule, these will be added separately and before the first iter
if kernel["PrefetchGlobalRead"]:
# do we need a module here? That would prevent these from being scheduled
imod = self.perIterLocalWriteCode[localWriteEndIter].addCode(Code.Module())
if self.enable["Wait"]:
imod.addCode(
self.wait(kernel, tensorParametersA, tensorParametersB, 0, -1, -1, \
"1wait for global read"))
imod.addComment1("local write A")
imod.addCode(self.localWriteACode)
imod.addComment1("local write B")
imod.addCode(self.localWriteBCode)
if kernel["EnableMatrixInstruction"] and kernel["ScheduleIterAlg"] == 3:
self.lwStartMfmaIndex = self.lwEndMfmaIndex
else:
#################
# create a plan #
#################
itemsLWToSched = list(self.localWriteACode.items()) + list(self.localWriteBCode.items())
if kernel["PrefetchGlobalRead"] == 2:
# PrefetchGlobalRead + DirectToLds case, need to add dummy list to insert global read
tmpList = []
numItemsBeforeStoreC = 0 #if not kernel["StoreCInUnroll"] else self.numItemsBeforeStoreC
numDummy = 0
if kernel["DirectToLdsA"]:
numDummy += max(len(list(self.globalReadACode.middle.items())) - numItemsBeforeStoreC, 0)
if kernel["DirectToLdsB"]:
# DirectToVgprA case: the A/B global reads are swapped, so B's LDS load is actually in globalReadACode. Use the correct length
numReadB = len(list(self.globalReadACode.middle.items())) if kernel["DirectToVgprA"] else len(list(self.globalReadBCode.middle.items()))
numDummy += max(numReadB - numItemsBeforeStoreC, 0)
for i in range(numDummy):
tmpList.append(Code.Module())
# add dummy at the top of the list
itemsLWToSched = tmpList + itemsLWToSched
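# Note (descriptive): each dummy Module above reserves one local-write slot in the schedule for a
# DirectToLds global read (which writes LDS directly and has no explicit ds_write), so the
# sub-iter scheduler leaves room to interleave those loads.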
if kernel["StoreCInUnroll"]:
# in StoreCInUnroll case, add storeC related code here
# add store C related code to itemsLWToSched
tmpList = list(self.StoreCUnrollCode.items())
itemsLWToSched += tmpList
# extend localWrite by inserting empty Module
itemsLWToSchedTemp = []
for i in range(len(itemsLWToSched)-1):
itemsLWToSchedTemp.append(itemsLWToSched.pop(0))
for j in range(PRECISION-1):
itemsLWToSchedTemp.append(Code.Module())
if itemsLWToSched:
itemsLWToSchedTemp.append(itemsLWToSched.pop(0))
for i in range(numLocalWritesPerSched + numLocalWritesPerSched % PRECISION - len(itemsLWToSchedTemp) % numLocalWritesPerSched):
itemsLWToSchedTemp.append(Code.Module())
itemsLWToSched = itemsLWToSchedTemp
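# Note (descriptive): each real local-write item is now followed by PRECISION-1 empty Modules,
# so the list length is expressed in the same PRECISION-scaled units as numLocalWritesPerSched;
# this allows a fractional writes-per-scheduling-slot rate to be handled with integer arithmetic.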
# This counts the number of modules which contain a ds_write.
# The scheduler below keeps all writes in the same module in the same iteration,
# so this is a better match to what it is trying to do
# writesToSched = sum(1 for item in itemsLWToSched if item.countType(Code.LocalWriteInst))
writesToSched = len(itemsLWToSched)
# assign schedule index
if kernel["EnableMatrixInstruction"] and kernel["ScheduleIterAlg"] == 3:
self.lwStartMfmaIndex = self.lwEndMfmaIndex - max(1,roundUp(writesToSched/numLocalWritesPerSched)) + 1
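# Illustrative example (hypothetical numbers): lwEndMfmaIndex=15, writesToSched=6,
# numLocalWritesPerSched=2 -> lwStartMfmaIndex = 15 - max(1, roundUp(6/2)) + 1 = 13,
# i.e. the writes occupy the last 3 mfma slots up to and including lwEndMfmaIndex.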
if self.lwStartMfmaIndex < self.grEndMfmaIndex:
self.lwStartMfmaIndex = self.grEndMfmaIndex
# DirectToLds + PGR=2 case, lwStart must be after all local reads are done
if kernel["DirectToLds"] and kernel["PrefetchGlobalRead"] == 2:
lrEnd = min(self.lwEndMfmaIndex, self.numMfmaForLR * (kernel["LoopIters"] - self.numItersPLR))
if self.lwStartMfmaIndex < lrEnd:
self.lwStartMfmaIndex = lrEnd
startIter = self.lwStartMfmaIndex//numMfmaPerIter
assert startIter < localWriteEndIter+1 # startIter should be at or before the endIter
else:
startIter = localWriteEndIter - roundUp(writesToSched/numLocalWritesPerSched) + 1
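# Illustrative example (hypothetical numbers): localWriteEndIter=6, writesToSched=4,
# numLocalWritesPerSched=2 -> startIter = 6 - roundUp(4/2) + 1 = 5, so the writes are
# spread over iterations 5 and 6.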
# - can't move a write past the load it depends on
# as a simplification, don't move writes past any loads
if startIter < lastLoadIter:
startIter = lastLoadIter
readsToWait = len(list(self.localWriteACode.items())) + len(list(self.localWriteBCode.items()))
readsToWaitDTV = 0
# add waitcnt for DirectToVgpr. Delaying wait for DirectToVgpr global read
if kernel["DirectToVgprA"] or kernel["DirectToVgprB"]:
# DirectToVgprA case, actual A load is in self.globalReadBCode (due to swap).
# Need to check self.globalReadBCode
readsToWaitDTV += len(list(self.globalReadBCode.middle.items()))
# add waitcnt for StoreCInUnroll. Delaying wait for Load C
readsToWait += numGlobalReadC
readsToWaitNGLL = readsToWait
localwriteCnt = 0
for u in range(startIter, localWriteEndIter+1):
if u==(localWriteEndIter):
itemPerIter = len(itemsLWToSched) # schedule all remaining activity
else:
itemPerIter = numLocalWriteModPerIter
# if the number of local writes is not a multiple of numLocalWriteModPerIter, fill the last iteration first.
# make sure numLocalWriteModPerIter is enough to schedule the local writes
# TODO: if numLocalWriteModPerIter is not enough to schedule the local writes, we need a smarter way to distribute them
if u == startIter and kernel["ScheduleIterAlg"] == 3:
itemPerIter = numLocalWriteModPerIter - (self.lwStartMfmaIndex % numMfmaPerIter) * numLocalWritesPerSched
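# Illustrative example (hypothetical numbers): numLocalWriteModPerIter=8, numMfmaPerIter=4,
# numLocalWritesPerSched=2, lwStartMfmaIndex%numMfmaPerIter=2 -> itemPerIter = 8 - 2*2 = 4,
# since the first iteration only has the mfma slots at and after lwStartMfmaIndex available.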
for item in itemsLWToSched[:itemPerIter]:
# Use a module to ensure these pieces stay together in the sub-iter scheduler
imod = Code.Module("LocalWriteMod%u"%u)
imodNGLL = Code.Module("LocalWriteMod%u"%u)
writesPerItem = item.countType(Code.LocalWriteInst)
readsToWaitAdjustForStoreC = 0
if kernel["StoreCInUnroll"] and not firstIter and kernel["PrefetchGlobalRead"]==2:
# get number of StoreC in template
readsToWaitAdjustForStoreC += self.getNumberOfStoreCInTemplate(kernel)
if writesPerItem:
imod.addComment0("sched write - iter %u writesPerItem=%u"%(u,writesPerItem))
imodNGLL.addComment0("sched write - iter %u writesPerItem=%u"%(u,writesPerItem))
# if writesPerItem>1 this indicates multiple LocalWrites in the same module
# this happens in some transpose cases. Here the first write needs to wait
# for the associated global read to finish, then the remaining writes can flow
# TODO - can schedule these writes across iters, should figure this out above
readsToWait = readsToWait - 1
readsToWaitNGLL = readsToWaitNGLL - 1
if uDu < kernel["DepthULdsDivisor"]-1:
imod.addComment0("no vmcnt wait except in the last subLdsLoop")
else:
imod.addCode(Code.WaitCnt(self.version, -1, min(maxVmcnt, readsToWait + readsToWaitDTV + readsToWaitAdjustForStoreC), \
"wait for global read before writing to local"))
imodNGLL.addCode(Code.WaitCnt(self.version, -1, min(maxVmcnt, readsToWaitNGLL + readsToWaitDTV + readsToWaitAdjustForStoreC), \
"wait for global read before writing to local"))
if kernel["StoreCInUnroll"] or kernel["PrefetchGlobalRead"]==2:
if "s_waitcnt" in str(item) and "__placeholder__" in str(item):
# waitcnt adjustment for StoreCInUnroll
readsToWaitAdjust = readsToWait + readsToWaitDTV - numGlobalReadC
if kernel["PrefetchGlobalRead"]==2:
# PGR=2 special cases
if (kernel["AtomicAddC"] or not kernel["ProblemType"]["UseBeta"]):
# no Load C case
if not firstIter:
# PGR=2 and not firstIter case, __placeholder__ includes num of storeC from previous Iter
readsToWaitAdjust += readsToWaitAdjustForStoreC
else:
# Load C case
# adjustment for waitcnt for loadC
if kernel["StoreCInUnroll"] and self.StoreCUnrollLoadCWaitComment in str(item):
# readsToWaitDTV should not be added for loadC waitcnt
readsToWaitAdjust -= readsToWaitDTV
if kernel["NoLdsWriteCode"] and kernel["PrefetchGlobalRead"]!=2:
# DirectToLds or DirectToVgpr for both A and B case, use the number of global read for both A and B as vmcnt (only for PGR=1)
readsToWaitAdjust = len(list(self.globalReadACode.middle.items())) + len(list(self.globalReadBCode.middle.items()))
item = str(item).replace("__placeholder__", str(readsToWaitAdjust))
imod.addCode(item)
# schedule global instructions that need to be scheduled later
if localwriteCnt % PRECISION == (numLocalWritesPerSched % PRECISION):
reads = 0
while itemsGRToSchedLater:
itemGR = itemsGRToSchedLater[0]
readsInc = itemGR.countType(Code.GlobalReadInst)
if kernel["StoreCInUnroll"] and readsInc == 0:
# adjustment for StoreCInUnroll
# count buffer_load if it exists but was not counted yet
readsInc += str(itemGR).count("_buffer_load")
reads = reads + readsInc
if reads > 1:
break
if "s_waitcnt" in str(itemGR) and "__placeholder__" in str(itemGR):
itemGR2 = (str(itemGR).replace("__placeholder__", str(readsToWait)))
imod.addText(itemGR2)
else:
imod.addCode(itemGR)
readsToWait = readsToWait + readsInc # GR instruction increments vmcnt
itemsGRToSchedLater.pop(0)
localwriteCnt += 1
self.perIterLocalWriteCode[u].addCode(imod)
imodNGLL.addCode(copy.deepcopy(item))
if lastLc:
# local write code for NGLL should be updated at the last lc
# in the init-acc-opt case, the last inner loop generated is not for the last lc;
# in that case, the local write code for NGLL is not as expected.
self.perIterLocalWriteCodeNGLL[u].addCode(imodNGLL)
itemsLWToSched = itemsLWToSched[itemPerIter:]
# should never run out of items to schedule
assert not itemsLWToSched # should have scheduled everything already
if grBackup is not None:
del grBackup
##############################################################################
# Schedule work into each unroll loop iteration
# localReadCode is the local reads for this loop iteration
# (returned by localReadDo). The instructions in localReadCode
# will retain their relative order, but may be interleaved
# with instructions from otherCode.
# globalReadCode is the 'other' buffer loads and addr increments
# localWriteCode is the 'other' local writes
# to schedule in with the ds reads. The instructions
# will retain their relative order, but may be interleaved
# with instructions from localReadCode.
# pointerCode contains local pointer changes (if needed)
# waitCode contains s_waitcnt before macs.
# - Cannot be "" or None
# - may be an empty Module if no waiting is desired (perhaps for debug)
# - may be multiple instructions (ConservativeWaitCnt)
# - typically is a single Code.WaitCnt. This routine will
# modify the lgkmcnt to account for any scheduling decisions.
# If this is not desired, add the waitCnt to pointerCode and
# set waitCode to an empty module
# macIterCode contains the mac iters. May be a macro call.
#
# returns: a Module with the combined, optimally scheduled
# localReadCode + otherCode
##############################################################################
def makeSubIterSchedule(self, kernel, localReadCode, iteration, pointerLWCode, pointerLRCode, waitCode, macIterCode, \
waitLWCode = Code.Module(), syncCode = Code.Module(), packCode = Code.Module(), isDTVodd = False, NLLlast = False):
iterCode = Code.Module()
globalReadCode = copy.deepcopy(self.perIterGlobalReadCode[iteration])
globalReadCodeDTV = self.perIterGlobalReadCodeDTV[iteration]
origLenGlobalReadCodeDTV = len(list(self.perIterGlobalReadCodeDTV[iteration].items()))
localWriteCode = self.perIterLocalWriteCode[iteration]
isBarrier = kernel["LoopIters"] - self.numItersPLR
hasLocalRead = localReadCode.countType(Code.LocalReadInst)
# Default schedule is other, local reads, then local writes:
if self.scheduleIterAlg==0:
# simple schedule, just add the modules in-order
iterCode.addCode(globalReadCode)
iterCode.addCode(globalReadCodeDTV)
# pop out all items
while len(list(globalReadCodeDTV.items())):
globalReadCodeDTV.items().pop(0)
iterCode.addCode(waitLWCode)
iterCode.addCode(syncCode)
iterCode.addCode(localReadCode)
iterCode.addCode(localWriteCode)
iterCode.addCode(pointerLWCode)
iterCode.addCode(pointerLRCode)
iterCode.addCode(waitCode)
iterCode.addCode(packCode)
iterCode.addCode(macIterCode)
elif self.scheduleIterAlg == 1:
iterCode.addCode(waitLWCode)
iterCode.addCode(syncCode)
#import pdb
#pdb.set_trace()
# simple algorithm - do half the reads first:
readsToSchedule = localReadCode.countType(Code.LocalReadInst) // 2 # integer division so the "half the reads" countdown can reach zero
#localReadCode.prettyPrint()
readItems = localReadCode.flatitems()
while readItems:
item = readItems.pop(0)
#print "readsToSchedule=", readsToSchedule, "item=", item
iterCode.addCode(item)
readsThisItem = item.countType(Code.LocalReadInst)
if readsThisItem:
assert readsThisItem==1, "Scheduler assumes 1 read per item"
readsToSchedule = readsToSchedule - 1
if readsToSchedule == 0:
break
iterCode.addCode(globalReadCode)
iterCode.addCode(globalReadCodeDTV)
# pop out all items
while len(list(globalReadCodeDTV.items())):
globalReadCodeDTV.items().pop(0)
# add rest of the reads here
for item in readItems:
iterCode.addCode(item)
#move down write to be the last
iterCode.addCode(localWriteCode)
# tack on the pointer and mac code:
iterCode.addCode(pointerLWCode)
iterCode.addCode(pointerLRCode)
iterCode.addCode(waitCode)
iterCode.addCode(packCode)
iterCode.addCode(macIterCode)
elif self.scheduleIterAlg == 2:
# SIA2 uses only 1 iteration and separates compute and fetch by raising compute priority.
# Two workgroups interleave: while WG0/WG1 is doing compute, WG1/WG0 is doing fetch.
# EPS needs to be 1, or a VALU instruction will break the interleave
iterCode.addCode(globalReadCode)
iterCode.addCode(globalReadCodeDTV)
# pop out all items
while len(list(globalReadCodeDTV.items())):
globalReadCodeDTV.items().pop(0)
iterCode.addCode(waitLWCode)
iterCode.addCode(syncCode)
iterCode.addCode(localReadCode)
iterCode.addCode(waitCode)
# interleave pack code
# BF16 or FP16: each packCode is for one 32-bit reg, 1 packing inst: half-to-single x1
# INT8 : each packCode is for one 32-bit reg, 3 packing insts: byte-to-half x2 + half-to-single x1
if self.archCaps["HasEccHalf"]:
instPerRegPack = 1 / kernel["ProblemType"]["DataType"].numRegisters() - 1
else:
instPerRegPack = 1 if (kernel["ProblemType"]["DataType"].numRegisters() == 0.25) else 0
instPerPack = int(kernel["MIInputPerThread"] * kernel["ProblemType"]["DataType"].numRegisters() * instPerRegPack)
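# Illustrative example (assumed data-type parameters, for intuition only):
# FP16 with HasEccHalf: numRegisters=0.5 -> instPerRegPack = 1/0.5 - 1 = 1;
#   with MIInputPerThread=4, instPerPack = int(4*0.5*1) = 2.
# INT8 with HasEccHalf: numRegisters=0.25 -> instPerRegPack = 1/0.25 - 1 = 3;
#   with MIInputPerThread=8, instPerPack = int(8*0.25*3) = 6.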
packItems = []
for iui in range(kernel["InnerUnroll"]):
packINtems = [ [] for j in range(max(self.numReadsIterCoalescedA,self.numReadsIterCoalescedB)) ]
packA = packCode.findNamedItem("packA_I%s"%(iui))
packB = packCode.findNamedItem("packB_I%s"%(iui))
# In case localReadDo did not generate a pack Module,
# findNamedItem will return None.
# TODO: let all types have a pack Module
if not packA:
packA = Code.Module()
packAItems = packA.flatitems()
if not packB:
packB = Code.Module()
packBItems = packB.flatitems()
if packAItems:
for j in range(self.numReadsIterCoalescedA):
for n in range(instPerPack):
packINtems[j].append(packAItems.pop(0))
if packBItems:
for j in range(self.numReadsIterCoalescedB):
for n in range(instPerPack):
packINtems[j].append(packBItems.pop(0))
while packAItems:
for j in range(self.numReadsIterCoalescedA):
for n in range(instPerPack):
packINtems[j].append(packAItems.pop(0))
while packBItems:
for j in range(self.numReadsIterCoalescedB):
for n in range(instPerPack):
packINtems[j].append(packBItems.pop(0))
for j in range(max(self.numReadsIterCoalescedA,self.numReadsIterCoalescedB)):
packItems += packINtems.pop(0)
macIterItem = macIterCode.flatitems()
# pop the first code which is s_nop 1 for packing
item = macIterItem.pop(0)
numMfmaPerIter = self.numMfmaPerIter
curPackIdx = 0
packAIdx = 0
packBIdx = 0
for i in range(numMfmaPerIter):
if packItems:
# how many packs have to be done
# calculate the data index of this mfma used for A and B
# if i // kernel["MIWaveTile"][0]==0, mfma will use new A (need to take iu into account)
# if i % kernel["MIWaveTile"][0]==0, mfma will use new B
packAIdx += instPerPack if i//(kernel["MIWaveTileA"]+kernel["MIWaveTileA"]*kernel["MIWaveTileB"]*(i//(kernel["MIWaveTileA"]*kernel["MIWaveTileB"]))) == 0 else 0
packBIdx += instPerPack if i % kernel["MIWaveTileA"] == 0 else 0
# blockWidth < 1, means 0.5 or 0.25 (BF,H,Int8)
packAIdx = packAIdx if self.tPA["localReadInstruction"].blockWidth < 1 else 0
packBIdx = packBIdx if self.tPB["localReadInstruction"].blockWidth < 1 else 0
numPack = (packAIdx + packBIdx)
iterCode.addComment0("pack scheduling: packAIdx:%u, packBIdx:%u" %(packAIdx,packBIdx))
# we put 2 packs in each mfma ("2" means A & B)
if packItems:
for j in range(instPerPack):
iterCode.addCode(packItems.pop(0))
curPackIdx += 1
if packItems:
for j in range(instPerPack):
iterCode.addCode(packItems.pop(0))
curPackIdx += 1
# since a packed register needs to wait 2 quad cycles to finish packing,
# we insert a pack instruction if we can, or an s_nop
while curPackIdx < numPack+2:
if packItems:
for j in range(instPerPack):
iterCode.addCode(packItems.pop(0))
curPackIdx += 1
else:
iterCode.addInst("s_nop ","0","VALU packing writes to be consumed by matrix instruction")
curPackIdx += 1
if i == 0:
if not packItems:
tmpVgpr = self.vgprPool.checkOut(1)
iterCode.addInst("v_mov_b32 ","v%u"%(tmpVgpr),"0x0","valu operation to have different priority")
self.vgprPool.checkIn(tmpVgpr)
iterCode.addInst("s_setprio ","3","Raise priority while processing macs")
item = macIterItem.pop(0)
iterCode.addCode(item)
iterCode.addInst("s_setprio ","1","Lower priority after issuing the macs")
if kernel["1LDSBuffer"]:
barrier = Code.Module()
barrier.addComment0("1 LDS buffer: read-sync-write")
barrier.addInst("s_waitcnt lgkmcnt(0)","")
barrier.addInst("s_barrier","")
iterCode.addCode(barrier)
iterCode.addCode(localWriteCode)
iterCode.addCode(pointerLWCode)
iterCode.addCode(pointerLRCode)
iterCode.addInst("s_setprio ","2","Raise priority back after local write")
pass
elif self.scheduleIterAlg == 3:
iterCode.addComment0(" grEndMfmaIndex:%u, lwStartMfmaIndex:%u, lwEndMfmaIndex:%u " %(self.grEndMfmaIndex,self.lwStartMfmaIndex,self.lwEndMfmaIndex))
iterCode.addComment0(" numMfmaForLR:%u, barrierMfmaIndex:%u " %(self.numMfmaForNextLoopLR,self.barrierMfmaIndex))
#####
# Prepare and Assign parameter
####
if iteration == 0:
self.localReadsVacancy = []
self.localReadsWait = [ [] for j in range(kernel["LoopIters"])]
self.localReadsWait[iteration] = waitCode
numMfmaPerIter = self.numMfmaPerIter
isBarrier = kernel["LoopIters"] - self.numItersPLR
writeItems = list(localWriteCode.items())
macIterItems = macIterCode.flatitems()
skipLocalWriteWaitcnt = 0
localReadsWaitcnt = 0
curPackIdx = 0
packAIdx = 0
packBIdx = 0
#####
# Prepare localReadCode
####
localReadCodeAB = Code.Module()
for iui in range(kernel["InnerUnroll"]):
localReadCodeA = localReadCode.findNamedItem("LocalReadDoA_I%s"%(iui))
localReadCodeB = localReadCode.findNamedItem("LocalReadDoB_I%s"%(iui))
# In case localReadDo did not generate a localReadCode Module,
# findNamedItem will return None.
# TODO: make findNamedItem return Code.Module() if not found
if not localReadCodeA:
localReadCodeA = Code.Module()
if not localReadCodeB:
localReadCodeB = Code.Module()
if localReadCodeA.items():
localReadCodeAB.addCode(localReadCodeA.items().pop(0))
if localReadCodeB.items():
localReadCodeAB.addCode(localReadCodeB.items().pop(0))
while localReadCodeA.items():
localReadCodeAB.addCode(localReadCodeA.items().pop(0))
while localReadCodeB.items():
localReadCodeAB.addCode(localReadCodeB.items().pop(0))
localReadItems = localReadCodeAB.flatitems()
localReadItemsThisLoop = localReadItems if iteration < isBarrier else []
localReadItemsNextLoop = localReadItems if iteration >= isBarrier else []
#####
# Prepare pack Code for B:
# since the mfma reuses B first => for A: mfma[A][B],
# we need 1 vector of A and 1 vector of B for the first mfma,
# then we prepare the remaining A, then the remaining B
# BF16 or FP16: each packCode is for one 32-bit reg, 1 packing inst: half-to-single x1
# INT8 : each packCode is for one 32-bit reg, 3 packing insts: byte-to-half x2 + half-to-single x1
####
if self.archCaps["HasEccHalf"]:
instPerRegPack = 1 / kernel["ProblemType"]["DataType"].numRegisters() - 1
else:
instPerRegPack = 1 if (kernel["ProblemType"]["DataType"].numRegisters() == 0.25) else 0
instPerPack = int(kernel["MIInputPerThread"] * kernel["ProblemType"]["DataType"].numRegisters() * instPerRegPack)
packItems = []
for iui in range(kernel["InnerUnroll"]):
packINtems = [ [] for j in range(max(self.numReadsIterCoalescedA,self.numReadsIterCoalescedB)) ]
packA = packCode.findNamedItem("packA_I%s"%(iui))
packB = packCode.findNamedItem("packB_I%s"%(iui))
# In case localReadDo did not generate a pack Module,
# findNamedItem will return None.
# TODO: let all types have a pack Module
if not packA:
packA = Code.Module()
packAItems = packA.flatitems()
if not packB:
packB = Code.Module()
packBItems = packB.flatitems()
if packAItems:
for j in range(self.numReadsIterCoalescedA):
for n in range(instPerPack):
packINtems[j].append(packAItems.pop(0))
if packBItems:
for j in range(self.numReadsIterCoalescedB):
for n in range(instPerPack):
packINtems[j].append(packBItems.pop(0))
while packAItems:
for j in range(self.numReadsIterCoalescedA):
for n in range(instPerPack):
packINtems[j].append(packAItems.pop(0))
while packBItems:
for j in range(self.numReadsIterCoalescedB):
for n in range(instPerPack):
packINtems[j].append(packBItems.pop(0))
for j in range(max(self.numReadsIterCoalescedA,self.numReadsIterCoalescedB)):
packItems += packINtems.pop(0)
# remove s_nop for packing
# we will add s_nop if needed
if macIterItems:
macIterItems.pop(0)
####
# schedule local reads into previous iterations' vacancies
####
if self.numVgprBuffer >= kernel["LoopIters"]:
for vacancy in self.localReadsVacancy:
# {"items","latencyLeft","atIter","atMfmaIndex","noReadsAtThisIter"}
for localRead in list(localReadItemsThisLoop):
if vacancy["latencyLeft"] > localRead.IssueLatency * 2:
if not localRead.readToTempVgpr:
vacancy["latencyLeft"] -= localRead.IssueLatency * 2
vacancy["items"].addCode(localRead)
localReadItemsThisLoop.remove(localRead)
if vacancy["atMfmaIndex"] > self.lwStartMfmaIndex - 1 and kernel["1LDSBuffer"]:
self.overflowedResources = 5
# update waitCnt
if self.numItersPLR:
for readsIter in range(vacancy["atIter"], iteration + self.numItersPLR):
if (vacancy["atMfmaIndex"] % numMfmaPerIter == 0 or readsIter != vacancy["atIter"]) and \
(vacancy["noReadsAtThisIter"] or readsIter <= vacancy["atIter"] + self.numItersPLR):
if isinstance(self.localReadsWait[readsIter], Code.WaitCnt):
self.localReadsWait[readsIter].lgkmcnt += 1
else:
# make sure the localread sequence remains the same
vacancy["latencyLeft"] = 0
numReadsInst = len(localReadItemsThisLoop) if iteration < isBarrier else len(localReadItemsNextLoop)
for i in range(numMfmaPerIter):
mfmaIndex = iteration * numMfmaPerIter + i
lastMfmaIndex = kernel["LoopIters"] * numMfmaPerIter - 1
iterCode.addComment0(" mfmaIndex:%u " %(mfmaIndex))
####
# scheduled local read
####
readLeft = numReadsInst
latencyLeft = self.miLatencyLeft
# with PrefetchLocalRead, localreads can interleave with mfma
if self.numItersPLR and iteration < isBarrier:
# take ds_write into account to schedule ds_read, assume A and B localwrite have same width (TLDS=1)
if (mfmaIndex >= self.lwStartMfmaIndex) and not globalReadCode.countType(Code.GlobalReadInst):
for j in range(min(len(writeItems),self.numLocalWriteModPerMfma)):
if writeItems[j].countType(Code.LocalWriteInst):
latencyLeft -= (self.tPA["localWriteInstruction"].IssueLatency*2)
readLeftLROPT = 0
for j in range(len(localReadItemsThisLoop)):
latencyLeft -= localReadItemsThisLoop[j].IssueLatency*2
readLeftLROPT += 1 if latencyLeft >= 0 else 0
# at least 1 instruction
readLeftLROPT = max(readLeftLROPT,1)
# evenly schedule localread with each mfma
readLeftLREven = numReadsInst // numMfmaPerIter
if (numReadsInst % (numMfmaPerIter)) > i:
readLeftLREven += 1
# we want no localreads at first mfma
if (iteration == 0) and numMfmaPerIter != 1:
numMfmaForLR = numMfmaPerIter - 1
if i < numMfmaPerIter - numMfmaForLR:
readLeftLREven = 0
readLeftLROPT = 0
# the remaining mfmas help to schedule those localReads
else:
readLeftLREven = numReadsInst // (numMfmaPerIter-1)
if (numReadsInst % (numMfmaPerIter-1)) >= i:
readLeftLREven += 1
# if there are too many localreads, change strategy to even.
readLeft = max(readLeftLREven,readLeftLROPT)
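# Illustrative example (hypothetical numbers): numReadsInst=8, numMfmaPerIter=4 gives
# readLeftLREven=2 per mfma; if latency only allows 1 read per mfma then readLeftLROPT=1,
# and readLeft = max(2,1) = 2, i.e. the even strategy wins so reads do not pile up at the
# end of the iteration.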
if not self.numItersPLR and iteration < isBarrier:
for j in range(len(localReadItemsThisLoop)):
latencyLeft -= localReadItemsThisLoop[j].IssueLatency*2
# if we start to schedule local writes but still have local reads not scheduled yet,
# reject 1LDSBuffer, since it would write and read the same LDS buffer at the same time.
# TODO: force scheduling all remaining local reads before starting to schedule local writes.
if mfmaIndex >= self.lwStartMfmaIndex and mfmaIndex <= max(self.lwEndMfmaIndex,self.barrierMfmaIndex) and \
localReadItemsThisLoop and localWriteCode.countType(Code.LocalWriteInst) and kernel["1LDSBuffer"]:
self.overflowedResources = 5
# DirectToVgpr case: localReadItemsThisLoop and localWriteCode.countType(Code.LocalWriteInst) are not satisfied at the same time.
# However, it is still invalid if localReadItemsThisLoop exists when mfmaIndex > lwStartMfmaIndex
elif (kernel["DirectToVgprA"] or kernel["DirectToVgprB"]) and \
mfmaIndex > self.lwStartMfmaIndex and mfmaIndex <= max(self.lwEndMfmaIndex,self.barrierMfmaIndex) and \
localReadItemsThisLoop and kernel["1LDSBuffer"]:
self.overflowedResources = 5
for j in range(readLeft):
if localReadItemsThisLoop:
item = localReadItemsThisLoop.pop(0)
iterCode.addCode(item)
if (i == 0):
localReadsWaitcnt += 1
if not localReadItemsThisLoop and latencyLeft > 0 and iteration < isBarrier and \
not(mfmaIndex > self.lwStartMfmaIndex and kernel["1LDSBuffer"]):
item = Code.Module()
item.addComment0("localReadsVacancy: latencyLeft %d"%(latencyLeft))
iterCode.addCode(item)
self.localReadsVacancy.append({ "items": item, \
"latencyLeft": latencyLeft, \
"atIter": iteration, \
"atMfmaIndex": mfmaIndex, \
"noReadsAtThisIter": numReadsInst == 0, \
})
####
# scheduled global read
####
for j in range(self.numGlobalReadInsPerMfma):
if globalReadCode.items():
loadText = str(globalReadCode.items().pop(0))
if isDTVodd:
# need to swap Vgpr set for odd code
loadText = self.flipVregSetForDirectToVgprInGlobalRead(kernel, loadText)
iterCode.addText(loadText)
# schedule remaining globalReadInst
if mfmaIndex == self.grEndMfmaIndex:
while globalReadCode.items() and \
(globalReadCode.countType(Code.GlobalReadInst) or kernel["PrefetchGlobalRead"] == 2):
loadText = str(globalReadCode.items().pop(0))
if isDTVodd:
# need to swap Vgpr set for odd code
loadText = self.flipVregSetForDirectToVgprInGlobalRead(kernel, loadText)
iterCode.addText(loadText)
# schedule remaining globalReadIncInst
if i == numMfmaPerIter - 1:
while globalReadCode.items():
loadText = str(globalReadCode.items().pop(0))
if isDTVodd:
# need to swap Vgpr set for odd code
loadText = self.flipVregSetForDirectToVgprInGlobalRead(kernel, loadText)
iterCode.addText(loadText)
####
# scheduled local write
####
if kernel["1LDSBuffer"] and mfmaIndex == self.lwStartMfmaIndex - 1:
barrier = Code.Module()
barrier.addComment0("1 LDS buffer: read-sync-write")
barrier.addInst("s_waitcnt lgkmcnt(0)","")
barrier.addInst("s_barrier","")
iterCode.addCode(barrier)
if kernel["StorePriorityOpt"]:
flagInsert = False
if kernel["PrefetchGlobalRead"] == 2:
lwStartOffset = 0
if kernel["DirectToLds"]:
lwStartOffset = 2
# if (mfmaIndex == self.lwStartMfmaIndex or mfmaIndex == self.barrierMfmaIndex+2):
if (mfmaIndex == self.lwStartMfmaIndex + lwStartOffset or mfmaIndex == self.barrierMfmaIndex+1) :
flagInsert = True
elif kernel["PrefetchGlobalRead"] == 1 and numMfmaPerIter >= 4:
# this setting is good for fixed clock, but not good for auto clock
#if (mfmaIndex == self.grEndMfmaIndex or mfmaIndex == self.barrierMfmaIndex+1) :
withGL = ((not NLLlast) or (self.prefetchAcrossPersistent and kernel["PrefetchAcrossPersistentMode"] == 1))
withDTLload = kernel["DirectToLds"] and withGL
startIndex = 0 if withDTLload else 1
if (mfmaIndex == startIndex or withGL and mfmaIndex == self.barrierMfmaIndex+1):
flagInsert = True
if flagInsert:
iterCode.addInst("s_setprio 3","store optimization")
if (mfmaIndex >= self.lwStartMfmaIndex):
for j in range(self.numLocalWriteModPerMfma):
# in case there are local writes and global reads in the same iteration,
# we need to make sure the global reads come before the local writes
if writeItems and not globalReadCode.countType(Code.GlobalReadInst):
writeItem = writeItems.pop(0)
# check StoreCInUnrollLoopCodeStart
if kernel["StoreCInUnroll"]:
if self.StoreCUnrollStartComment in str(writeItem):
self.StoreCUnrollLoopCodeStarted = 1 # mark as started
if self.StoreCUnrollStoreStartComment in str(writeItem):
# generate all remaining pre code before the first Store C
while(len(self.StoreCUnrollPreCode.items()) > 0):
iterCode.addCode(self.StoreCUnrollPreCode.items().pop(0))
iterCode.addCode(writeItem)
# if there is localWrite at first mfma, need to skip it in waitcnt.
if i == 0:
skipLocalWriteWaitcnt += writeItem.countType(Code.LocalWriteInst)
if not localReadItemsThisLoop:
self.perIterLocalWriteCanSkip[iteration] += writeItem.countType(Code.LocalWriteInst)
if mfmaIndex == self.lwEndMfmaIndex:
while writeItems:
writeItem = writeItems.pop(0)
# generate all remaining pre code before the first Store C
if kernel["StoreCInUnroll"]:
if self.StoreCUnrollStoreStartComment in str(writeItem):
while(len(self.StoreCUnrollPreCode.items()) > 0):
iterCode.addCode(self.StoreCUnrollPreCode.items().pop(0))
iterCode.addCode(writeItem)
if i == 0:
skipLocalWriteWaitcnt += writeItem.countType(Code.LocalWriteInst)
if not localReadItemsThisLoop:
self.perIterLocalWriteCanSkip[iteration] += writeItem.countType(Code.LocalWriteInst)
####
# scheduled pointer
####
if mfmaIndex == self.lwEndMfmaIndex:
iterCode.addCode(pointerLWCode)
if i == numMfmaPerIter - 1:
iterCode.addCode(pointerLRCode)
####
# scheduled sync
####
if mfmaIndex == self.barrierMfmaIndex and self.numItersPLR:
iterCode.addCode(waitLWCode)
iterCode.addCode(syncCode)
####
# scheduled local read for next loop
# localReads for the next loop should come after the barrier
####
latencyLeft = self.miLatencyLeft
if self.numItersPLR and iteration >= isBarrier:
readLeftLROPT = 0
for j in range(len(localReadItemsNextLoop)):
latencyLeft -= localReadItemsNextLoop[j].IssueLatency*2
readLeftLROPT += 1 if latencyLeft >= 0 else 0
# at least 1 instruction
readLeftLROPT = max(readLeftLROPT,1)
# evenly schedule localread with each mfma
readLeftLREven = numReadsInst // numMfmaPerIter
if (numReadsInst % (numMfmaPerIter)) > i:
readLeftLREven += 1
# we want no localreads at barrier mfma
if (iteration == isBarrier) and numMfmaPerIter != 1:
numMfmaForLR = self.numMfmaForNextLoopLR
if i < numMfmaPerIter - numMfmaForLR:
readLeftLREven = 0
readLeftLROPT = 0
# the remaining mfmas help to schedule those localReads
else:
readLeftLREven = numReadsInst // (numMfmaPerIter-1)
if (numReadsInst % (numMfmaPerIter-1)) >= i:
readLeftLREven += 1
# if there are too many localreads, change strategy to even.
readLeft = max(readLeftLREven,readLeftLROPT)
for j in range(readLeft):
if localReadItemsNextLoop:
item = localReadItemsNextLoop.pop(0)
iterCode.addCode(item)
if (i == 0):
localReadsWaitcnt += 1
####
# scheduled wait localReads
####
if i == 0:
iterCode.addCode(waitCode)
####
# scheduled pack
####
if packItems:
# how many packs have to be done
# calculate the data index of this mfma used for A and B
# if i // kernel["MIWaveTile"][0]==0, mfma will use new A (need to take iu into account)
# if i % kernel["MIWaveTile"][0]==0, mfma will use new B
packAIdx += instPerPack if i//(kernel["MIWaveTileA"]+kernel["MIWaveTileA"]*kernel["MIWaveTileB"]*(i//(kernel["MIWaveTileA"]*kernel["MIWaveTileB"]))) == 0 else 0
packBIdx += instPerPack if i % kernel["MIWaveTileA"] == 0 else 0
# blockWidth < 1, means 0.5 or 0.25 (BF,H,Int8)
packAIdx = packAIdx if self.tPA["localReadInstruction"].blockWidth < 1 else 0
packBIdx = packBIdx if self.tPB["localReadInstruction"].blockWidth < 1 else 0
numPack = (packAIdx + packBIdx)
iterCode.addComment0("pack scheduling: packAIdx:%u, packBIdx:%u" %(packAIdx,packBIdx))
# we put 2 packs in each mfma
if packItems:
for j in range(instPerPack):
iterCode.addCode(packItems.pop(0))
curPackIdx += 1
if packItems:
for j in range(instPerPack):
iterCode.addCode(packItems.pop(0))
curPackIdx += 1
# since a packed register needs to wait 2 quad cycles to finish packing,
# we insert a pack instruction if we can, or an s_nop
while curPackIdx < numPack+2:
if packItems:
for j in range(instPerPack):
iterCode.addCode(packItems.pop(0))
curPackIdx += 1
else:
iterCode.addInst("s_nop ","0","VALU packing writes to be consumed by matrix instruction")
curPackIdx += 1
if i == numMfmaPerIter - 1:
while packItems:
iterCode.addCode(packItems.pop(0))
####
# scheduled StoreCInUnrollPreProcess
####
if kernel["StoreCInUnroll"]:
if self.StoreCUnrollLoopCodeStarted and len(list(self.StoreCUnrollPreCode.items())) > 0:
iterCode.addCode(self.StoreCUnrollPreCode.items().pop(0))
####
# scheduled mfma
####
iterCode.addCode(macIterItems.pop(0) if macIterItems else Code.Module())
####
# scheduled global read for DirectToVgpr (PGR=2 only)
####
numLoadVgpr = len(list(globalReadCodeDTV.items()))
if numLoadVgpr > 0:
interval = roundUp(numMfmaPerIter / origLenGlobalReadCodeDTV)
tileIndex = 0 if kernel["DirectToVgprA"] else 1
if (kernel["MIWaveTile"][tileIndex] // kernel["VectorWidth"]) > 1:
if kernel["ProblemType"]["DataType"].isComplex():
# adjustment for double complex
# limit the max of interval up to 4 if (kernel["MIWaveTile"][0] // kernel["VectorWidth"]) > 1
interval = min(4, interval)
else: #if kernel["ProblemType"]["DataType"].isDouble() or isSingle():
# adjustment for double
# in this case, interval must be 1 to avoid overwriting a vreg with a global read
interval = 1
# DirectToVgprA + TLU=False + VW > 1 case, need to use interval = 1
if kernel["DirectToVgprA"] and (not kernel["ProblemType"]["TLUA"]) and kernel["VectorWidth"] > 1:
interval = 1
# if the number of mfma after self.grEndMfmaIndex is smaller than numMfmaPerIter, we need to use a smaller interval to insert the DTV loads.
# this is to ensure the DTV loads are generated after lwStartMfmaIndex
intervalAfterGrEnd = kernel["LoopIters"] * numMfmaPerIter - self.lwStartMfmaIndex
intervalMfma = min(numMfmaPerIter, intervalAfterGrEnd)
numInstToInsert = roundUp(origLenGlobalReadCodeDTV / intervalMfma)
remainingTimesToInsert = roundUp(numLoadVgpr / numInstToInsert)
insertMfmaIndex = kernel["LoopIters"] * numMfmaPerIter - 1 - interval * (remainingTimesToInsert - 1)
# avoid insertMfmaIndex getting smaller than (kernel["LoopIters"] - 1) * numMfmaPerIter
insertMfmaIndex = max(insertMfmaIndex, (kernel["LoopIters"] - 1) * numMfmaPerIter)
# avoid insertMfmaIndex getting smaller than lwEndMfmaIndex (DTV loads must be generated after non DTV loads)
insertMfmaIndex = max(insertMfmaIndex, self.lwEndMfmaIndex)
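# In short (descriptive note): the remaining DTV loads are spread over the tail mfmas of the last
# iteration, numInstToInsert at a time, never before lwEndMfmaIndex or before
# (LoopIters-1)*numMfmaPerIter, and anything still left at the very last mfma is flushed there.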
# if mfmaIndex is the last index, insert all DTV loads
if mfmaIndex == lastMfmaIndex:
insertMfmaIndex = mfmaIndex
numInstToInsert = numLoadVgpr
if mfmaIndex == insertMfmaIndex:
for i in range(min(numLoadVgpr, numInstToInsert)):
loadDTVText = str(globalReadCodeDTV.items().pop(0))
if isDTVodd:
# need to swap Vgpr set for odd code
loadDTVText = self.flipVregSetForDirectToVgprInGlobalRead(kernel, loadDTVText)
iterCode.addText(loadDTVText)
####
# scheduled StoreCInUnrollPostProcess
####
if kernel["StoreCInUnroll"]:
numItems = len(self.StoreCUnrollPostCode.items())
# need to make sure all global read inc is already generated
# (iteration should be the last one)
if numItems > 0 and iteration == kernel["LoopIters"] - 1 and len(globalReadCode.items()) == 0:
totalMfma = kernel["LoopIters"] * numMfmaPerIter
interval = 1
numInstToInsert = roundUp(numItems / (totalMfma - mfmaIndex))
remainingTimesToInsert = roundUp(numItems / numInstToInsert)
insertMfmaIndex = totalMfma - 2 - interval * (remainingTimesToInsert - 1)
if mfmaIndex >= insertMfmaIndex:
for i in range(numInstToInsert):
iterCode.addCode(self.StoreCUnrollPostCode.items().pop(0))
if kernel["StorePriorityOpt"]:
flagInsert = False
if kernel["PrefetchGlobalRead"] == 2:
# if (mfmaIndex == self.barrierMfmaIndex or mfmaIndex == (kernel["LoopIters"] * numMfmaPerIter - 1)):
if (mfmaIndex == self.barrierMfmaIndex - 1 or (not NLLlast) and mfmaIndex == (kernel["LoopIters"] * numMfmaPerIter - 1)) :
flagInsert = True
elif kernel["PrefetchGlobalRead"] == 1 and numMfmaPerIter >= 4:
# this setting is good for fixed clock, but not good for auto clock
#if (mfmaIndex == mfmaIndex == self.barrierMfmaIndex - 1 or mfmaIndex == (kernel["LoopIters"] * numMfmaPerIter - 1)) :
insertPos1 = self.grEndMfmaIndex
if not kernel["NoLdsWriteCode"]:
insertPos1 = self.lwStartMfmaIndex - 1
withGL = ((not NLLlast) or (self.prefetchAcrossPersistent and kernel["PrefetchAcrossPersistentMode"] == 1))
if withGL and (mfmaIndex == insertPos1 or (not NLLlast) and mfmaIndex == (kernel["LoopIters"] * numMfmaPerIter - 1)) or \
(not withGL) and mfmaIndex == (kernel["LoopIters"] * numMfmaPerIter // 2 - 1):
flagInsert = True
if flagInsert:
iterCode.addInst("s_setprio 0","store optimization")
else:
assert 0, "Unsupported scheduleIterAlg=%u"%self.scheduleIterAlg
if isinstance(waitCode, Code.WaitCnt):
# Set the waitCount, based on the new iter schedule
lgkmcnt = waitCode.lgkmcnt
localReads = 0
localWrites = 0
if kernel["EnableMatrixInstruction"]:
# dataAtIter : the iteration at which the data we are waiting for was read
# numReadsIter : in this loop, the number of iterations we have read (data used in the current loop)
dataAtIterA = iteration//self.numIterPerCoalescedReadA - self.numItersPLR
dataAtIterB = iteration//self.numIterPerCoalescedReadB - self.numItersPLR
numReadsIterA = min(iteration+1, kernel["LoopIters"]//self.numIterPerCoalescedReadA - self.numItersPLR)
numReadsIterB = min(iteration+1, kernel["LoopIters"]//self.numIterPerCoalescedReadB - self.numItersPLR)
skipReadsIterA = numReadsIterA - dataAtIterA - 1 if not dataAtIterA < max(dataAtIterA,dataAtIterB) else 0
skipReadsIterB = numReadsIterB - dataAtIterB - 1 if not dataAtIterB < max(dataAtIterA,dataAtIterB) else 0
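# Illustrative example (hypothetical numbers): numItersPLR=1, numIterPerCoalescedReadA/B=1,
# LoopIters=4, iteration=2 -> dataAtIterA = dataAtIterB = 1, numReadsIterA = numReadsIterB = 3,
# skipReadsIterA = skipReadsIterB = 3 - 1 - 1 = 1, i.e. one iteration's worth of prefetched
# reads can be skipped in the waitcnt.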
# numPrefetchIter : in this loop, the number of prefetch iterations we have read (data used in the next loop)
# currently we have localReadA and localReadB if iteration >= isBarrier
# some cases will not have localReads, e.g. PGR=0 or NoLoadLoop
# known bug: wider localread + numItersPLR>1 may have a chance to fail.
numPrefetchIter = (iteration//(kernel["LoopIters"]-self.numItersPLR))*((iteration+1)-(kernel["LoopIters"]-self.numItersPLR)) if kernel["PrefetchGlobalRead"] else 0
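# Illustrative note: with LoopIters=4 and numItersPLR=1 (hypothetical), the formula above is 0
# for iterations 0..2 and becomes (3//3)*((3+1)-3) = 1 at iteration 3, i.e. prefetch reads for
# the next loop only start counting once the barrier iteration is reached.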
numPrefetchIter = 0 if iteration >= isBarrier and not hasLocalRead else numPrefetchIter
skipReadsIterA += numPrefetchIter
skipReadsIterB += numPrefetchIter
# here the reads are prefetches, so we can skip them in the waitcnt
# how many localreads we can skip is based on how many iterations we prefetch.
localReads += self.numReadsPerIterA * skipReadsIterA + self.numReadsPerIterB * skipReadsIterB
# some of localReads is interleaved after waitcnt in SIA3
if kernel["ScheduleIterAlg"] == 3 and self.numItersPLR and\
(iteration < numReadsIterA or iteration < numReadsIterB or numPrefetchIter) and \
self.enable["LocalRead"]:
if (iteration < numReadsIterA and not dataAtIterA < max(dataAtIterA,dataAtIterB)) or numPrefetchIter:
localReads -= self.numReadsPerIterA
if (iteration < numReadsIterB and not dataAtIterB < max(dataAtIterA,dataAtIterB)) or numPrefetchIter:
localReads -= self.numReadsPerIterB
localReads += localReadsWaitcnt
lgkmcnt += localReads
iterCode.addComment0("numPrefetchIter=%u" % numPrefetchIter)
iterCode.addComment0("dataAtIterA=%u numReadsIterA=%u skipReadsIterA=%u readsPerIterA=%u" % (dataAtIterA, numReadsIterA, skipReadsIterA, self.numReadsPerIterA))
iterCode.addComment0("dataAtIterB=%u numReadsIterB=%u skipReadsIterB=%u readsPerIterB=%u" % (dataAtIterB, numReadsIterB, skipReadsIterB, self.numReadsPerIterB))
if kernel["ScheduleIterAlg"] == 0 or kernel["ScheduleIterAlg"] == 1:
# adjust the initial value of loop counter for DirectToVgpr
adj = 1 if (kernel["DirectToVgprA"] or kernel["DirectToVgprB"]) else 0
for i in range (max(dataAtIterA,dataAtIterB)+adj,iteration+1):
localWrites += self.perIterLocalWriteCode[i].countType(Code.LocalWriteInst)
# ScheduleIterAlg=2: localwrite is after waitCnt, no need to count its current iteration.
if kernel["ScheduleIterAlg"] == 3:
for i in range (max(dataAtIterA,dataAtIterB)+1,iteration):
localWrites += self.perIterLocalWriteCode[i].countType(Code.LocalWriteInst)
if kernel["ScheduleLocalWrite"] > 0:
# current iteration localWrite count
localWrites += skipLocalWriteWaitcnt
# dataAtIter iteration localWrite count
if self.numItersPLR:
skipPreIterLW = self.perIterLocalWriteCanSkip[max(dataAtIterA,dataAtIterB)]
if kernel["PrefetchGlobalRead"] == 2 and kernel["LocalReadVectorWidth"] == 2 and \
(kernel["DirectToVgprA"] or kernel["DirectToVgprB"]):
# PGR==2 and LRVW==2 and DirectToVgpr enabled case, count local write before max(dataAtIterA,dataAtIterB)
# NOTE: This logic assumes that local write is scheduled after local read.
for up in range(max(dataAtIterA,dataAtIterB)):
skipPreIterLW += self.perIterLocalWriteCanSkip[up]
localWrites += skipPreIterLW
lgkmcnt += localWrites
else:
for item in list(iterCode.items()):
localReads = item.countType(Code.LocalReadInst)
localWrites = item.countType(Code.LocalWriteInst)
if self.numVgprBuffer:
# SQ: If PrefetchLocalRead = 1 and DepthU == LocalSplitU, then there is no double
# buffering and we must wait for all localReads but not localWrites.
# In that case, LoopIters == 1:
if kernel["LoopIters"] > 1:
# here the reads are prefetches, so we can skip them in the waitcnt
lgkmcnt += localReads
# and the writes are targeting another section of LDS and are
# synchronized through a different waitcnt than this one
# (which is always just before the macs)
lgkmcnt += localWrites
else:
# if UnrollLoopEfficiencyEnable == True use waitCode passed lgkmCnt
# else:
# we need to wait for all preceding reads before the macs
# so the only opportunity for optimization is if the writes are at the end
if globalParameters["UnrollLoopEfficiencyEnable"]:
lgkmcnt = waitCode.lgkmcnt
else:
if localReads:
lgkmcnt = 0 # reset to wait for all reads
else:
lgkmcnt = localWrites # this only survives if writes are at the end
waitCode.comment += " old=%u, new=%u newLW=%u newLR=%u" % (waitCode.lgkmcnt, lgkmcnt,localWrites,localReads)
waitCode.lgkmcnt = lgkmcnt
return iterCode
##############################################################################
# returns list of modules or text
# papIter indicates this is the setup for the "prefetchAcrossPersistent"
# (aka pap) iteration
##############################################################################
def setupNewTile(self, kernel, tensorParametersA, tensorParametersB, isPap, isOptNLL=False, forceNoTileCode=False, forceNoGRCode=False):
kl = []
if self.enable["PreLoop"]:
####################################
# Global Read Addresses
####################################
kl.append(self.comment3("Begin setupNewTile, isPap=%s") % isPap)
# work-group assignments
kl.append(self.comment("global read addresses: work-group"))
if not forceNoTileCode:
kl.append(self.graWorkGroup(kernel, isPap))
needShift = False
if (kernel["EdgeType"] == "ShiftPtr") and \
(not (kernel["BufferLoad"] and kernel["GuaranteeNoPartialA"])) or \
(not (kernel["BufferLoad"] and kernel["GuaranteeNoPartialB"])):
needShift = True
# in some cases (PAP), we don't have to append the code for duplicated calculations;
# only the calculations related to WorkGroupID need to be generated, otherwise they are just redundant.
# default dontAppendCode = False means we need to append the code
self.dontAppendCode = False
# 1. during isPap, this is actually not needed, so we can skip it,
#    but since some vgpr values are used in the later lwaFirstOffset (when not OptNLL, such as "lwoT"),
#    we still do this part when "isPap & not OptNLL"
# 2. if tile edge, then we still need to add all these codes even for isPap
self.dontAppendCode = isPap and kernel["PrefetchAcrossPersistentMode"] == 1 and ((not needShift) or self.useGlobalReadTileVgpr)
self.dontAppendCode = self.dontAppendCode or forceNoTileCode
# tile assignments
kl.append(self.comment("global read addresses: tile offset assignment a"))
kl.append(self.graTileAssignment(kernel, tensorParametersA))
kl.append(self.comment("global read addresses: tile offset assignment b"))
kl.append(self.graTileAssignment(kernel, tensorParametersB))
self.dontAppendCode = isPap and (not needShift)
self.dontAppendCode = self.dontAppendCode or forceNoTileCode
# unroll assignments
kl.append(self.comment("global read addresses: unroll assignment a"))
kl.append(self.graUnrollAssignment(kernel, tensorParametersA))
kl.append(self.comment("global read addresses: unroll assignment b"))
kl.append(self.graUnrollAssignment(kernel, tensorParametersB))
self.dontAppendCode = False
self.dontAppendCode = self.dontAppendCode or forceNoTileCode
# other free indices
if kernel["ProblemType"]["NumIndicesC"] > 2:
kl.append(self.comment("global read addresses: other free assignments"))
kl.append(self.graOtherFreeAssignments(kernel))
# other summation indices
if self.otherSummations:
kl.append(self.comment("global read addresses: other summation assignments"))
kl.append(self.graOtherSummationAssignments(kernel))
self.dontAppendCode = isPap and ((not needShift) or self.useGlobalReadTileVgpr)
self.dontAppendCode = self.dontAppendCode or forceNoTileCode
# tile offsets
kl.append(self.comment("global read addresses: tile offsets a"))
kl.append(self.graTileOffsets(kernel, tensorParametersA))
kl.append(self.comment("global read addresses: tile offsets b"))
kl.append(self.graTileOffsets(kernel, tensorParametersB))
# unroll offsets
kl.append(self.comment("global read addresses: unroll offsets a"))
kl.append(self.graUnrollOffsets(kernel, tensorParametersA))
kl.append(self.comment("global read addresses: unroll offsets b"))
kl.append(self.graUnrollOffsets(kernel, tensorParametersB))
self.dontAppendCode = False
self.dontAppendCode = self.dontAppendCode or forceNoTileCode
# tile edges
if kernel["EdgeType"] == "ShiftPtr":
# Shift here has two purposes:
# 1. Ensure the loads are in-bounds to prevent fault.
# BufferLoad uses the buffer limit hardware and does not require bounds checking for this case
# 2. Shift-left a wide vector load to ensure it is completely in-bounds.
# If this occurs we need to 'unshift' the C values (see shiftVectorComponents)
# BufferLoad does support this shifting, but if GuaranteeNoPartial=1 then
# it can be guaranteed that no shifting is required.
if not (kernel["BufferLoad"] and kernel["GuaranteeNoPartialA"]) and not forceNoTileCode:
kl.append(self.comment("global read addresses: shift a"))
kl.append(self.graShift(kernel, tensorParametersA))
if not (kernel["BufferLoad"] and kernel["GuaranteeNoPartialB"]) and not forceNoTileCode:
kl.append(self.comment("global read addresses: shift b"))
kl.append(self.graShift(kernel, tensorParametersB))
elif kernel["EdgeType"] == "Branch":
kl.append(self.comment("global read addresses: branch a"))
kl.append(self.graBranch(kernel, tensorParametersA))
kl.append(self.comment("global read addresses: branch b"))
kl.append(self.graBranch(kernel, tensorParametersB))
self.dontAppendCode = isPap and (not needShift)
self.dontAppendCode = self.dontAppendCode or forceNoTileCode
# final offsets
kl.append(self.comment("global read addresses: final offsets a"))
kl.append(self.graFinalOffsets(kernel, tensorParametersA))
kl.append(self.comment("global read addresses: final offsets b"))
kl.append(self.graFinalOffsets(kernel, tensorParametersB))
self.dontAppendCode = False
self.dontAppendCode = self.dontAppendCode or forceNoTileCode
# addresses
if not forceNoTileCode:
kl.append(self.comment("global read addresses: addresses a"))
kl.append(self.graAddresses(kernel, tensorParametersA, isPap))
kl.append(self.comment("global read addresses: addresses b"))
kl.append(self.graAddresses(kernel, tensorParametersB, isPap))
self.dontAppendCode = isPap
self.dontAppendCode = self.dontAppendCode or forceNoTileCode
# increments
kl.append(self.comment("global read addresses: increments a"))
for i in reversed(range(kernel["ProblemType"]["NumIndicesSummation"])):
kl.append(self.graIncrements(kernel, i, tensorParametersA))
kl.append(self.comment("global read addresses: increments b"))
for i in reversed(range(kernel["ProblemType"]["NumIndicesSummation"])):
kl.append(self.graIncrements(kernel, i, tensorParametersB))
self.dontAppendCode = False
self.dontAppendCode = self.dontAppendCode or forceNoTileCode
####################################
# Local Write Addresses
####################################
kl.append(self.comment3("Local Write Addresses"))
# tile assignments
kl.append(self.lwaTileAssignment(kernel, tensorParametersA))
kl.append(self.lwaTileAssignment(kernel, tensorParametersB))
# unroll assignments
kl.append(self.lwaUnrollAssignment(kernel, tensorParametersA))
kl.append(self.lwaUnrollAssignment(kernel, tensorParametersB))
# if PAP, no need to reset LWA, but if not OptNLL, we still do this (due to TailLoop)
self.dontAppendCode = isPap and kernel["PrefetchAcrossPersistentMode"] == 1
self.dontAppendCode = self.dontAppendCode or forceNoTileCode
# first offsets
kl.append(self.comment("local write addresses: first offset a"))
kl.append(self.lwaFirstOffset(kernel, tensorParametersA))
kl.append(self.comment("local write addresses: first offset b"))
kl.append(self.lwaFirstOffset(kernel, tensorParametersB))
self.dontAppendCode = False
self.dontAppendCode = self.dontAppendCode or forceNoTileCode
# final offsets
kl.append(self.lwaFinalOffsets(kernel, tensorParametersA))
kl.append(self.lwaFinalOffsets(kernel, tensorParametersB))
# declare addresses
kl.append(self.lwaDeclareAddresses(kernel, tensorParametersA))
kl.append(self.lwaDeclareAddresses(kernel, tensorParametersB))
# init pointers
kl.append(self.localWriteInitPointers(kernel, tensorParametersA))
kl.append(self.localWriteInitPointers(kernel, tensorParametersB))
###########################################################################
# summations loops: open
###########################################################################
# declare loop num iter
if not forceNoTileCode:
kl.append(self.comment1("declare loop num iterations"))
kl.append(self.declareLoopNumIter(kernel))
# perform initC in the shadow of the prefetch
# Prefetch occurs at start of unroll loop
# If we have multiple summation indices (otherSummationLoops>0),
# we can't init in shadow of this prefetch
# since that would initC inside the other summation loops
if self.doShadowInit != 2:
kl.append(self.initC(kernel))
# open non-unrolled summation loops
if not forceNoTileCode:
for i in range(kernel["ProblemType"]["NumIndicesSummation"]-1):
kl.append(self.comment("summation loop %u"%i))
kl.append(self.calculateLoopNumIter(kernel, i, isPap))
if self.actualSummationLoops>1:
kl.append(self.openLoop(kernel, i))
kl.append(self.calculateLoopNumIter(kernel, self.unrollIdx, isPap))
if not forceNoTileCode:
if self.staggerU:
kl.append(self.declareStaggerParms(kernel))
kl.append(self.calculateStagger(kernel, tensorParametersA))
kl.append(self.calculateStagger(kernel, tensorParametersB))
# isPap: don't init the read pointers - we want to continue to use the double-buffer
# LRO and LWA as assigned
if self.enable["PreLoop"] and not isPap:
# init lds read pointers before each unrolled loop
kl.append(self.comment1("local read addresses: init pointers a"))
kl.append(self.localReadInitPointers(kernel, tensorParametersA))
kl.append(self.comment1("local read addresses: init pointers b"))
kl.append(self.localReadInitPointers(kernel, tensorParametersB))
if isPap and not isOptNLL:
# init lds read pointers before each unrolled loop
kl.append(self.comment1("local read addresses: reset offset a"))
kl.append(self.localReadResetOffsets(kernel, tensorParametersA))
kl.append(self.comment1("local read addresses: reset offset b"))
kl.append(self.localReadResetOffsets(kernel, tensorParametersB))
####################################
# prefetch: unrolled loop prefix
####################################
if kernel["PrefetchGlobalRead"]:
pfi = 1
kl.append(self.comment("prefetch: global -> local"))
kl.append(self.openSumAtLeastUnroll(kernel, prefetch=True, isOptNLL=isOptNLL, isPap=isPap))
if isPap and isOptNLL:
# forceNoGRCode case: reset and do not generate global read A/B code
if self.enable["GlobalRead"] and (not forceNoGRCode):
# if DirectToVgprA is enabled, swap the order of global read (B->A)
tensorParameters1st = tensorParametersA
tensorParameters2nd = tensorParametersB
if kernel["DirectToVgprA"]:
tensorParameters1st, tensorParameters2nd = tensorParameters2nd, tensorParameters1st
self.dtlsM0UpdateACode = self.directToLdsM0Update(kernel, 0, tensorParameters1st, usePlaceHolder=isPap)
self.globalReadACode = self.globalReadDo(kernel, 0, tensorParameters1st, 0)
self.dtlsM0UpdateBCode = self.directToLdsM0Update(kernel, 0, tensorParameters2nd, usePlaceHolder=isPap)
self.globalReadBCode = self.globalReadDo(kernel, 0, tensorParameters2nd, 0)
else:
self.dtlsM0UpdateACode = Code.StructuredModule()
self.globalReadACode = Code.StructuredModule() # empty
self.dtlsM0UpdateBCode = Code.StructuredModule()
self.globalReadBCode = Code.StructuredModule() # empty
if self.enable["GlobalReadInc"]:
self.globalReadIncrements = self.globalReadIncrementAB(kernel, self.unrollIdx, pfi)
else:
self.globalReadIncrements = Code.Module()
self.globalReadIncrements.addCode(Code.Module("globalReadIncrementA"))
self.globalReadIncrements.addCode(Code.Module("globalReadIncrementB"))
else:
if self.enable["GlobalRead"]:
# if DirectToVgprA is enabled, swap the order of global read (B->A)
tensorParameters1st = tensorParametersA
tensorParameters2nd = tensorParametersB
if kernel["DirectToVgprA"]:
tensorParameters1st, tensorParameters2nd = tensorParameters2nd, tensorParameters1st
tmpStr = str(self.directToLdsM0Update(kernel, 0, tensorParameters1st, usePlaceHolder=isPap))
tmpStr = tmpStr.replace("__placeholder__", str(0))
kl.append(tmpStr)
kl.append(str(self.globalReadDo(kernel, 0, tensorParameters1st, 0)))
tmpStr = str(self.directToLdsM0Update(kernel, 0, tensorParameters2nd, usePlaceHolder=isPap))
tmpStr = tmpStr.replace("__placeholder__", str(0))
kl.append(tmpStr)
kl.append(str(self.globalReadDo(kernel, 0, tensorParameters2nd, 0)))
if self.enable["GlobalReadInc"]:
kl.append(self.globalReadIncrementAB(kernel, self.unrollIdx, pfi))
kl.append(self.comment3("End setupNewTile, isPap=%s") % isPap)
return kl
##############################################################################
# get conditions to skip local write wait
##############################################################################
def getConditionToSkipLocalWriteWait( self, kernel , isPap, u, lastU):
# do not generate wait code here if u == 0, u != lastU, and DirectToVgpr + DirectToLds is enabled
# (to remove a redundant wait; isPap case only)
# exception is PGR=2: the wait is necessary for u = 0 in the PGR=2 case
cond1 = not (isPap and u == 0 and u != lastU and kernel["PrefetchLocalRead"] != 0 and \
(kernel["DirectToVgprA"] and kernel["DirectToLdsB"] or kernel["DirectToVgprB"] and kernel["DirectToLdsA"])) \
or kernel["PrefetchGlobalRead"]==2
# no need for a local read wait if LocalReadVectorWidth==2 and u is odd.
# In that case, the prefetch local read covers both u = 0 and 1 (limited to MFMA+double+DirectToVgpr only)
# (The other side's numReadsIterCoalesced must be even to skip the local read wait)
condSkip = kernel["LocalReadVectorWidth"]==2 and (u%2 != 0) and kernel["EnableMatrixInstruction"] and \
kernel["ProblemType"]["DataType"].isDouble() and \
(kernel["DirectToVgprA"] and self.numReadsIterCoalescedB % 2 == 0 or \
kernel["DirectToVgprB"] and self.numReadsIterCoalescedA % 2 == 0)
return cond1 and (not condSkip)
##############################################################################
# No Load Loop Body
##############################################################################
def noLoadLoopBody( self, kernel, tensorParametersA, tensorParametersB, kl, pack, isOptNLL, isPap, isNGLL, NLLfirst, NLLlast, isDTVodd=False):
expand = kernel["ExpandPointerSwap"]
lastuIdx = False
pflr = self.numItersPLR
localWriteEndIter = kernel["LoopIters"] - self.numItersPLR - 1
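# Illustrative example (hypothetical numbers): LoopIters=4, numItersPLR=1 -> localWriteEndIter=2,
# so local writes must be finished by iteration 2, leaving the last iteration for the
# prefetch local reads of the next loop.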
for uIdx in range(0, kernel["LoopIters"]*kernel["DepthULdsDivisor"]):
u = uIdx % kernel["LoopIters"] # u: index in compute loop (in contrast to the notion of global read loop)
uDu = uIdx // kernel["LoopIters"] # uDu: index of compute loop
isLastLoop = (uDu == kernel["DepthULdsDivisor"] -1 ) and not isNGLL
if u == 0:
if uDu > 0:
if self.enable["GlobalRead"]:
assert len(self.globalReadACode.items()) > 0 and len(self.globalReadBCode.items()) > 0 # already issued in first uDu
self.globalReadACode = Code.StructuredModule() # empty
self.globalReadBCode = Code.StructuredModule() # empty
if self.enable["GlobalReadInc"]:
self.globalReadIncrements = Code.Module() # empty
self.globalReadIncrements.addCode(Code.Module("globalReadIncrementA"))
self.globalReadIncrements.addCode(Code.Module("globalReadIncrementB"))
if not isLastLoop:
self.localWriteACode = self.localWriteDo(kernel, tensorParametersA, (uDu+1)%kernel["DepthULdsDivisor"]) # local write in loopcnt N targets data for loopcnt N+1
self.localWriteBCode = self.localWriteDo(kernel, tensorParametersB, (uDu+1)%kernel["DepthULdsDivisor"])
else:
self.localWriteACode = Code.Module()
self.localWriteBCode = Code.Module()
# TODO schedule waitcnt/barrier in makeSubIterSchedule()
if kernel["PrefetchGlobalRead"] and kernel["LoopIters"] in [1, 2] and uDu > 0:
if self.enable["Wait"]:
kl.append(self.wait(kernel, tensorParametersA, tensorParametersB, 1, 0, -1, "wait for local write"))
if self.enable["Sync"]:
kl.append(self.syncThreads(kernel, "sync for local read after write"))
if not isNGLL or isPap:
# PAP would have GlobalRead and GlobalInc, but no localWrite
# Get the perIterGlobalReadCode code for PAP (if PAP=On), else would be empty
# NGLL (PGR=2) and isPap case, we do not need globalInc code. Set skip flag in that case
skipGlobalReadInc = isNGLL and isPap
self.makeSchedule(kernel, tensorParametersA, tensorParametersB, localWriteEndIter, uDu, skipGlobalReadInc=skipGlobalReadInc, lastLoop=NLLlast)
kl.append(str(self.unrollLoopHeaderCode))
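# note: makeSchedule above splits the (remaining) global read / local write code into
# per-iteration pieces (e.g. perIterLocalWriteCode) and fills unrollLoopHeaderCode;
# makeSubIterSchedule below then interleaves those pieces with the local reads and
# MFMA/MAC code of each iteration u.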
# which loop iteration to reset the LRO,
# note if PLR=0, isResetLroIter is False for all u
isResetLroIter = (u == localWriteEndIter)
isSwapAndResetLwoIter = isResetLroIter
isSwapLroIter = isResetLroIter
if kernel["ScheduleIterAlg"] == 3:
isSwapAndResetLwoIter = (u == self.lwEndMfmaIndex//(self.numMfmaPerIter))
extraComment = ""
if isLastLoop:
extraComment += " (last unrolled loop)"
else:
if kernel.enabledSplitLDS:
extraComment += f" (uDu={uDu}) "
if isResetLroIter:
extraComment += " (reset local read pointers iteration) "
if isSwapAndResetLwoIter:
extraComment += " (swap and reset local write pointers iteration) "
if isSwapLroIter:
extraComment += " (swap local read pointers iteration) "
kl.append(self.comment("iter %u%s"%(u,extraComment)))
plrIdx = ((u+pflr) % (self.numVgprBuffer+1)) % kernel["LoopIters"]
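# e.g. with numItersPLR=1, numVgprBuffer=1 and LoopIters>=2, plrIdx alternates 1,0,1,0,...
# so the local reads issued at iteration u land in the buffer that the MACs consume at
# iteration u+1 (luIdx = u % (numVgprBuffer+1) below).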
localReads = Code.Module()
pointerLWCode = Code.Module()
pointerLRCode = Code.Module()
waitCode = Code.Module() # may be overwritten (not added to) below
macIterCode = Code.Module()
waitLWCode = Code.Module()
syncCode = Code.Module()
if self.enable["LocalRead"]:
hasLiveLdsData = kernel["PrefetchGlobalRead"] or (uDu < kernel["DepthULdsDivisor"]-1)
hasLiveLdsData = hasLiveLdsData and not isLastLoop
# reads for current loop are done in previous iteration because of wider local read
doReadA = (u < kernel["LoopIters"]/self.numIterPerCoalescedReadA - self.numItersPLR)
doReadB = (u < kernel["LoopIters"]/self.numIterPerCoalescedReadB - self.numItersPLR)
# reads for next loop
doReadA = doReadA or (hasLiveLdsData and u > localWriteEndIter)
doReadB = doReadB or (hasLiveLdsData and u > localWriteEndIter)
# disable LocalRead if DirectToVgpr is enabled
doReadA = doReadA and (not kernel["DirectToVgprA"])
doReadB = doReadB and (not kernel["DirectToVgprB"])
for iui in range(0,kernel["InnerUnroll"]):
doReadA = doReadA and iui*self.numReadsIterCoalescedA < kernel["InnerUnroll"]
doReadB = doReadB and iui*self.numReadsIterCoalescedB < kernel["InnerUnroll"]
if doReadA:
localReads.addText(self.comment("local read a"))
localReadCodeA, packCodeA = self.localReadDo(kernel, plrIdx*self.numIterPerCoalescedReadA, iui*self.numReadsIterCoalescedA, 0, tensorParametersA)
localReads.addCode(localReadCodeA)
pack[plrIdx*self.numIterPerCoalescedReadA].addCode(packCodeA)
if doReadB:
localReads.addText(self.comment("local read b"))
localReadCodeB, packCodeB = self.localReadDo(kernel, plrIdx*self.numIterPerCoalescedReadB, iui*self.numReadsIterCoalescedB, 0, tensorParametersB)
localReads.addCode(localReadCodeB)
pack[plrIdx*self.numIterPerCoalescedReadB].addCode(packCodeB)
if (not isResetLroIter or iui != kernel["InnerUnroll"]-1):
if doReadA:
localReads.addText(self.comment("local read increment a"))
localReads.addText(self.localReadInc(kernel, iui, tensorParametersA))
if doReadB:
localReads.addText(self.comment("local read increment b"))
localReads.addText(self.localReadInc(kernel, iui, tensorParametersB))
if not isLastLoop:
if kernel["PrefetchGlobalRead"]:
# put barrier at localWriteEndIter+1
if u == localWriteEndIter+1 or (u == (localWriteEndIter+1)%kernel["LoopIters"] and kernel["ScheduleIterAlg"] == 2):
if self.enable["Wait"]:
# skip local write wait if DirectToVgpr + DirectToLds is enabled
if not kernel["NoLdsWriteCode"]:
waitLWCode.addCode(self.wait(kernel, tensorParametersA, tensorParametersB, -1, 0, -1, "3wait for local write"))
if (kernel["DirectToVgprA"] or kernel["DirectToVgprB"]) and (kernel["DirectToLdsA"] or kernel["DirectToLdsB"]):
# DirectToVgpr + DirectToLds case, add waitcnt vmcnt before s_barrier
# Except for PGR=2 and Load C (StoreCInUnroll) case. In that case, Load C is executed after necessary Load A and B.
# Wait for Load C is already done here in PGR=2 case.
needLoadC = kernel["StoreCInUnroll"] and (not kernel["AtomicAddC"]) and kernel["ProblemType"]["UseBeta"]
if not (kernel["PrefetchGlobalRead"]==2 and needLoadC):
retStr = self.getWaitcntCodeForDirectToVgpr(kernel, localWriteEndIter, u, firstIter=False, beforeBarrier=True)
waitLWCode.addCode(retStr)
if self.enable["Sync"]:
if kernel["PrefetchGlobalRead"]==2 and (kernel["DirectToLdsA"] and kernel["DirectToLdsB"]):
# PGR=2 and DTLA+B case, wait for global read needs to be added (wait is not generated with local write)
syncCode.addCode(self.wait(kernel, tensorParametersA, tensorParametersB, 0, -1, -1, "wait for global read with lds"))
syncCode.addCode(self.syncThreads(kernel))
if isSwapAndResetLwoIter: # ResetLroIter
if self.enable["LocalWrite"]:
# local write for next iter, used to have local writes here
pointerLWCode.addText(self.comment("local write swap offsets a"))
pointerLWCode.addText(self.localWriteSwapOffsets(kernel, expand, tensorParametersA))
pointerLWCode.addText(self.comment("local write swap offsets b"))
pointerLWCode.addText(self.localWriteSwapOffsets(kernel, expand, tensorParametersB))
pointerLWCode.addText(self.localWriteInitPointers(kernel, tensorParametersA))
pointerLWCode.addText(self.localWriteInitPointers(kernel, tensorParametersB))
if isSwapLroIter: # ResetLroIter
if self.enable["LocalRead"]:
# Swap, reset, or increment the LRO:
# force internalPointerSwap = False in NGLL case
internalPointerSwap = expand and not isNGLL
pointerLRCode.addText(self.comment("local read swap offsets a"))
pointerLRCode.addText(self.localReadSwapOffsets(kernel, internalPointerSwap, tensorParametersA))
pointerLRCode.addText(self.comment("local read swap offsets b"))
pointerLRCode.addText(self.localReadSwapOffsets(kernel, internalPointerSwap, tensorParametersB))
if isResetLroIter: # ResetLroIter
if self.enable["LocalRead"]:
pointerLRCode.addText(self.comment("local read init pointers a"))
pointerLRCode.addText(self.localReadInitPointers(kernel, tensorParametersA))
pointerLRCode.addText(self.comment("local read init pointers b"))
pointerLRCode.addText(self.localReadInitPointers(kernel, tensorParametersB))
# we initialize lgkmcnt to 0, then assign it the correct value in makeSubIterSchedule()
if self.enable["Wait"]:
if self.getConditionToSkipLocalWriteWait(kernel, isPap, u, kernel["LoopIters"] - 1):
waitCode = self.wait(kernel, tensorParametersA, tensorParametersB, \
-1, 0, 0, \
"wait for prior local read local write")
# DirectToVgpr case, wait for global read as well as local read/write
if kernel["DirectToVgprA"] or kernel["DirectToVgprB"]:
# do not generate the wait here if
# 1) the local write code in the previous u (u-1) already has a waitcnt vmcnt
prevVmcnt = False
prevLocalWrite = ""
if (u > 0):
prevLocalWrite = ' '.join([str(x) for x in self.perIterLocalWriteCode[u-1].flatitems()])
prevVmcnt = "vmcnt" in prevLocalWrite
if not prevVmcnt:
retStr = self.getWaitcntCodeForDirectToVgpr(kernel, localWriteEndIter, u, False, isPap or isNGLL, NLLlast=NLLlast)
kl.append(retStr)
# generate StoreCInUnroll post-loop code if it is enabled (PAP case only, not NGLL)
if u == localWriteEndIter+1:
if kernel["StoreCInUnrollPostLoop"] and isPap and not isNGLL:
kl.append(self.generateStoreInUnrollPostLoop(kernel, isOptNLL, isDTVodd))
luIdx = (u) % (self.numVgprBuffer+1) # local to use for MACs
if self.enable["MAC"]:
if kernel["EnableMatrixInstruction"]:
# select the DirectToVgpr vreg set read by the MFMAs: first set (0) in the NGLL case, second set (1) otherwise
setId = 0 if isNGLL else 1
# flip setId for the odd DTV body
if isDTVodd:
setId = 1 - setId
vregSetIdxMFMA = setId
if ((uIdx+1) == kernel["LoopIters"]*kernel["DepthULdsDivisor"]) and \
(kernel["StoreCInUnroll"]):
lastuIdx = (isOptNLL or self.enableSingleNLLOpt) and not isNGLL # do not apply lastuIdx for not isOptNLL case
macIterCode.addCode(self.mfmaIter(kernel, u, kernel["InnerUnroll"], vregSetIdxMFMA,lastuIdx))
else:
macIterCode.addCode(self.macIter(kernel, luIdx, kernel["InnerUnroll"], True ))
subIterCode = self.makeSubIterSchedule(kernel, localReads, \
u, pointerLWCode, pointerLRCode, waitCode, macIterCode, waitLWCode, syncCode, pack[luIdx], isDTVodd, NLLlast)
kl.append(subIterCode)
# vgpr.checkin for all the checked-out vgpr in LocalRead
for item in list(pack[luIdx].items()):
if item.tempVgpr != None:
self.vgprPool.checkIn(item.tempVgpr)
item.tempVgpr = None
pack[luIdx] = Code.Module()
##############################################################################
# noLoadLoop
# Create the no load loop (NLL)
#
# isOptNLL : the NLL is to be optimized for the alpha=1 and non-edge case
##############################################################################
def noLoadLoop( self, kernel, tensorParametersA, tensorParametersB, isOptNLL, isPap, isNGLL, pack ):
kl = []
if isNGLL:
LoopNameComment = "NoGlobalLoadLoop"
else:
LoopNameComment = "NoLoadLoop"
if isOptNLL:
PAPcomment = "Opt. %s %s PAP - Begin " % (LoopNameComment, "With" if isPap else "Without")
else:
PAPcomment = "Ord. %s - Begin " % (LoopNameComment)
kl.append(self.comment3("%s")%PAPcomment)
NLLfirst = True
NLLlast = True
if kernel["PrefetchGlobalRead"] == 2:
# PGR=2 case NoLoadLoop(NLL) is generated twice
# we need to distinguish them to generate proper code at each NLL
if isNGLL:
NLLlast = False
else:
# PGR=2 and not isNGLL means this is the second NoLoadLoop for PGR=2.
# Avoid generating duplicated code that was already generated in the NGLL (the first NoLoadLoop for PGR=2)
NLLfirst = False
if isNGLL:
self.perIterLocalWriteCode = self.perIterLocalWriteCodeNGLL
self.perIterLocalWriteCanSkip = [ 0 for i in range (kernel["LoopIters"]) ]
#else:
if not isNGLL or isPap:
self.dtlsM0UpdateACode = Code.StructuredModule()
self.globalReadACode = Code.StructuredModule() # empty
self.dtlsM0UpdateBCode = Code.StructuredModule()
self.globalReadBCode = Code.StructuredModule() # empty
self.globalReadIncrements = Code.Module()
self.globalReadIncrements.addCode(Code.Module("globalReadIncrementA"))
self.globalReadIncrements.addCode(Code.Module("globalReadIncrementB"))
self.localWriteACode = Code.Module()
self.localWriteBCode = Code.Module()
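# the prefetch code holders above are reset to empty modules so that no new global read /
# local write is scheduled inside this loop; the NLL only consumes the data already
# prefetched by the last unrolled iteration (it is a copy of the unroll loop with global
# loads and LDS writes removed).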
# the scheduled GlobalRead,Inc code of PAP is inside openSumAtLeastUnroll (if PAP=on)
isPapTmp = isPap
if kernel["PrefetchGlobalRead"]==2:
# PGR=2 case, set isPap only if isNGLL is True. This is to generate NewTile code at NGLL in PAP + PGR=2 case
isPapTmp = isPap and not isNGLL
kStrOpenSum = self.openSumAtLeastUnroll(kernel, prefetch=False, isOptNLL=isOptNLL, isPap=isPapTmp)
#if self.prefetchAcrossPersistent and kernel["PrefetchAcrossPersistentMode"] == 1 and isPap:
if self.prefetchAcrossPersistent and isPap:
#if self.prefetchAcrossPersistent and isPap \
# and (kernel["PrefetchAcrossPersistentMode"] == 0 or isOptNLL):
kStr = ""
#kStr += str(self.openPrefetchAcrossPersistent(kernel, isOptNLL=False, useBufferOOB=True))
# For PAPMode 1, using isOptNLL true to generate prefetch code
if kernel["PrefetchAcrossPersistentMode"] == 0:
# generate openSumAtLeastUnroll code here
kStr += kStrOpenSum
kStrOpenSum = "" # empty OpenSum str to avoid inserting it again
# isPap and kernel["PrefetchAcrossPersistentMode"] == 1 and isOptNLL==False,
# no need to append NewTile code because it is already generated in OptNLL code
# also, NGLL second NoLoadLoop case, we do not append code for NewTile
forceNoTileCode = False
if (isOptNLL==False or (not NLLfirst)):
forceNoTileCode = True
# PGR=2 and last loop case, we do not need GlobalRead code
forceNoGRCode = False
if kernel["PrefetchGlobalRead"] == 2 and NLLlast:
forceNoGRCode = True
newTileCodes = self.setupNewTile(kernel, self.tPA, self.tPB, isPap=True, isOptNLL=True, forceNoTileCode=forceNoTileCode, forceNoGRCode = forceNoGRCode)
codes = '\n'.join([str(x) for x in newTileCodes])
kStr += codes
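# the new-tile setup appended above belongs to the *next* persistent-kernel tile; the idea
# behind PrefetchAcrossPersistent is to overlap that tile's global reads with the tail of
# the current tile, subject to the forceNoTileCode/forceNoGRCode restrictions above.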
# openPrefetchAcrossPersistent should be after newTileCodes to set correct values to ShadowLimit
# also, NGLL second NoLoadLoop case, we do not append code for Open/Close PAP
if isOptNLL:
if NLLfirst:
kStr += str(self.openPrefetchAcrossPersistent(kernel, isOptNLL=False, useBufferOOB=True))
kStr += str(self.closePrefetchAcrossPersistent(kernel, isOptNLL=False, useBufferOOB=True))
kl.append(kStr)
# skip generating OpenSum code here for SingleNLLOpt
if not (isOptNLL and self.enableSingleNLLOpt):
kl.append(kStrOpenSum)
kStrOpenSum = "" # empty OpenSum str to avoid inserting it again
if not self.numItersPLR:
if self.enable["Wait"]:
if kernel["DirectToLdsA"] or kernel["DirectToLdsB"]:
kl.append(self.wait(kernel, tensorParametersA, tensorParametersB, 0, -1, -1, "10wait for global read"))
# TODO: need to check if we correctly checked-in the temp VGPR used for Int8 LocalWrite (uDu, PGR=2)
kl.append(self.wait(kernel, tensorParametersA, tensorParametersB, -1, 0, -1, "4wait for local write"))
if self.enable["Sync"]:
kl.append(self.syncThreads(kernel))
# if DirectToVgpr is enabled and ASEM is not a multiple of DepthU*2, generate noLoadLoopBody twice, for the odd and even exits separately
if ( kernel["DirectToVgprA"] or kernel["DirectToVgprB"]) and (kernel["AssertSummationElementMultiple"] % (kernel["DepthU"] * 2) != 0):
# generate additional No Load Loop Body code for odd case (to use the other Vreg set for DirectToVgpr)
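# e.g. with DepthU=16, this path triggers when the summation size is only asserted to be a
# multiple of 16 (not 32): the unroll loop may then exit after an odd number of iterations,
# so an odd-exit body and an even-exit body are both emitted, each reading the DirectToVgpr
# data from the corresponding vreg set.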
# 1. generate odd check
name = ""
if isNGLL:
name += "NoGlobalLoadLoop"
else:
name += "NoLoadLoop"
if isOptNLL:
name += "Opt"
else:
name += "Ord"
kl.append(self.openOddNoLoadLoopForDTV(kernel, isNGLL, name))
# 2. generate no Load Loop Body code for odd
# backup
self.saveLocalPointers(kernel)
# deepCopy packCode for OptNLL noLoadLoop
deepCopyPack = copy.deepcopy(pack)
# keep StoreCInUnroll related code for the next noLoadLoop
if kernel["StoreCInUnroll"]:
self.backupStoreCInUnrollRelatedCode()
self.noLoadLoopBody(kernel, tensorParametersA, tensorParametersB, kl, deepCopyPack, isOptNLL, isPap, isNGLL, NLLfirst, NLLlast, isDTVodd=True)
# restore
self.restoreLocalPointers(kernel)
# restore StoreCInUnroll related code
if kernel["StoreCInUnroll"]:
self.restoreStoreCInUnrollRelatedCode()
# 3. PAP enabled, isLast and odd code case: the last global load for DirectToVgpr went to the second reg set.
# Need to copy it to the first set for the next PK loop
if isPap and NLLlast:
kl.append(self.getWaitcntCodeForDirectToVgpr(kernel, 0, 0, False, oddLast=True))
kl.append(self.generateOddEndVgprCopyForDTV(kernel))
# 4. generate even start label
kl.append(self.closeOddNoLoadLoopForDTV(kernel, isNGLL, name))
# 5. generate no Load Loop Body code for even
# need to re-initialize perIterLocalWriteCanSkip to avoid having incorrect lgkmcnt
self.perIterLocalWriteCanSkip = [ 0 for i in range (kernel["LoopIters"]) ]
self.noLoadLoopBody(kernel, tensorParametersA, tensorParametersB, kl, pack, isOptNLL, isPap, isNGLL, NLLfirst, NLLlast)
# 6. generate even end label
kl.append(self.generateEvenEndLabeNoLoadLoopForDTV(kernel, isNGLL, name))
else:
# generate no Load Loop Body code
self.noLoadLoopBody(kernel, tensorParametersA, tensorParametersB, kl, pack, isOptNLL, isPap, isNGLL, NLLfirst, NLLlast)
if NLLlast and isPap:
# reset or swap local write offset
# If DirectToLds is True, the first LDS buffer is already used by the LDS global read and the offset already points to it,
# so swap/reset is not necessary.
# If DirectToLds is False, the first LDS buffer is not used yet and needs a reset.
if kernel["ExpandPointerSwap"]:
if not kernel["DirectToLdsA"]:
kl.append(self.comment("local write reset offsets a"))
kl.append(self.localWriteResetOffsets(kernel, False, tensorParametersA))
if not kernel["DirectToLdsB"]:
kl.append(self.comment("local write reset offsets b"))
kl.append(self.localWriteResetOffsets(kernel, False, tensorParametersB))
kl.append(self.localReadResetOffsets(kernel, tensorParametersA))
kl.append(self.localReadResetOffsets(kernel, tensorParametersB))
# add OpenSum code here if it is not empty
if kStrOpenSum != "":
kl.append(kStrOpenSum)
# Close code is necessary for both first and last (NGLL case(=NLLfirst) needs label)
kl.append(self.closeSumAtLeastUnroll(kernel, prefetch=False, isOptNLL=isOptNLL, isPap=isPap, isNGLL=isNGLL))
return kl
##############################################################################
# Loop Body
##############################################################################
def loopBody( self, kernel, tensorParametersA, tensorParametersB, kl, pack, lc, loopCopies, finalLoop, firstIter=False ):
expand = kernel["ExpandPointerSwap"]
# generate storeC code for StoreCInUnroll (this needs to be called even when StoreCInUnroll is disabled)
self.generateStoreCCodeInUnrollLoop(kernel, lc & 1)
# do not generate openLoop for firstIter
if not firstIter:
kl.append(self.comment3("Unrolled Loop %u/%u - Begin" % (lc+1, loopCopies)))
kl.append(self.openLoopCopy(kernel, lc))
if kernel["PrefetchGlobalRead"] and not self.numItersPLR and not kernel["ScheduleIterAlg"] == 2:
if self.enable["Wait"]:
if kernel["DirectToLdsA"] or kernel["DirectToLdsB"]:
kl.append(self.wait(kernel, tensorParametersA, tensorParametersB, 0, -1, -1, "11wait for global read"))
kl.append(self.wait(kernel, tensorParametersA, tensorParametersB, 1, 0, -1, "1wait for local write"))
if self.enable["Sync"]:
kl.append(self.syncThreads(kernel, "4sync for global read"))
kl.append(self.comment("Begin Each Unroll: Check VGPR.checkin for INT8 LW"))
if self.enable["GlobalRead"]:
# if DirectToVgprA is enabled, swap the order of global read (B->A)
tensorParameters1st = tensorParametersA
tensorParameters2nd = tensorParametersB
tc1 = 'A'
tc2 = 'B'
if kernel["DirectToVgprA"]:
tensorParameters1st, tensorParameters2nd = tensorParameters2nd, tensorParameters1st
tc1, tc2 = tc2, tc1
# unrolled loop: global read A, B
# M0 update for directToLds
vregSetIdxGR = 0
if (kernel["DirectToVgpr%s"%tc1]):
vregSetIdxGR = (kernel["PrefetchGlobalRead"] + lc ) % 2 # toggle vreg set for DirectToVgpr.
self.dtlsM0UpdateACode = self.directToLdsM0Update(kernel, 1, tensorParameters1st, usePlaceHolder=True)
self.globalReadACode = self.globalReadDo(kernel, 1, tensorParameters1st, vregSetIdxGR)
vregSetIdxGR = 0
if (kernel["DirectToVgpr%s"%tc2]):
vregSetIdxGR = (kernel["PrefetchGlobalRead"] + lc ) % 2 # toggle vreg set for DirectToVgpr.
self.dtlsM0UpdateBCode = self.directToLdsM0Update(kernel, 1, tensorParameters2nd, usePlaceHolder=True)
self.globalReadBCode = self.globalReadDo(kernel, 1, tensorParameters2nd, vregSetIdxGR)
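# e.g. with PrefetchGlobalRead=1: loop copy lc=0 loads into vreg set 1 and lc=1 loads into
# set 0 ((1+lc)%2), while the MFMAs of copy lc read set lc (vregSetIdxMFMA below), so the
# DirectToVgpr loads and the MFMAs ping-pong between the two register sets.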
else:
self.dtlsM0UpdateACode = Code.StructuredModule()
self.globalReadACode = Code.StructuredModule() # empty
self.dtlsM0UpdateBCode = Code.StructuredModule()
self.globalReadBCode = Code.StructuredModule() # empty
if self.enable["GlobalReadInc"]:
# unrolled loop: increment global read addresses
self.globalReadIncrements = self.globalReadIncrementAB(kernel, self.unrollIdx, 0)
else:
self.globalReadIncrements = Code.Module()
self.globalReadIncrements.addCode(Code.Module("globalReadIncrementA"))
self.globalReadIncrements.addCode(Code.Module("globalReadIncrementB"))
if self.enable["LocalWrite"] and not kernel["NoLdsWriteCode"]:
self.localWriteACode = self.localWriteDo(kernel, tensorParametersA)
self.localWriteBCode = self.localWriteDo(kernel, tensorParametersB)
else:
self.localWriteACode = Code.Module()
self.localWriteBCode = Code.Module()
# localWriteEndIter is used to determine in which iteration to put the sync
# if PGR=0, GR, LW, sync, and LR will be put at the front of the loop.
localWriteEndIter = kernel["LoopIters"] - self.numItersPLR - 1
# Schedule the global read, global read inc, and writes:
unrollLoopHeaderCodeScheduled = False
if not kernel["PrefetchGlobalRead"]:
unrollLoopHeaderCodeScheduled = True
self.makeSchedule(kernel, tensorParametersA, tensorParametersB, localWriteEndIter, firstIter=firstIter)
kl.append(str(self.unrollLoopHeaderCode))
# if not prefetch global, localWrite before mac's
if not kernel["PrefetchGlobalRead"]:
# unrolled loop: local write A, B
if self.enable["Wait"]:
kl.append(self.wait(kernel, tensorParametersA, tensorParametersB, 0, -1, -1, "5wait for global read"))
if self.enable["Sync"]:
kl.append(self.syncThreads(kernel, "PGR=0, prior iter done reading lds"))
if self.enable["LocalWrite"] and not kernel["NoLdsWriteCode"]:
kl.append(self.comment("local write a"))
tempLWCodeModA = self.localWriteDo(kernel, tensorParametersA)
kl.append(tempLWCodeModA)
kl.append(self.comment("local write b"))
tempLWCodeModB = self.localWriteDo(kernel, tensorParametersB)
kl.append(tempLWCodeModB)
if self.enable["Wait"]:
kl.append(self.wait(kernel, tensorParametersA, tensorParametersB, -1, 0, -1, "2prefetch wait for local write"))
if self.enable["Sync"]:
kl.append(self.syncThreads(kernel))
# debug Local state
"""
kl.append(" /* print Local state */" + self.endLine)
kl.append(" for (unsigned int i = serial; i < LDS_NUM_ELEMENTS; i+=NUM_THREADS) {%s" % self.endLine)
kl.append(" printf(\\\"localMemory[%%06u] = %%.0f\\\\n\\\", i, localMemory[i]);%s" )
% self.endLine
kl.append(" }" + self.endLine)
"""
# unrolled loop: prefetch local
if self.numItersPLR and not kernel["PrefetchGlobalRead"]:
if self.enable["LocalRead"]:
for plrIdx in range(0, self.numItersPLR):
pack[plrIdx] = Code.Module()
for iui in range(0,kernel["InnerUnroll"]):
if iui*self.numReadsIterCoalescedA < kernel["InnerUnroll"] and (not kernel["DirectToVgprA"]) : # no local read code if DirectToVgpr is enabled
kl.append(self.comment("prefetch local a"))
localReadCodeA, packCodeA = self.localReadDo(kernel, plrIdx*self.numIterPerCoalescedReadA, iui*self.numReadsIterCoalescedA, 0, tensorParametersA)
kl.append(localReadCodeA)
pack[plrIdx].addCode(packCodeA)
if iui*self.numReadsIterCoalescedB < kernel["InnerUnroll"] and (not kernel["DirectToVgprB"]) : # no local read code if DirectToVgpr is enabled
kl.append(self.comment("prefetch local b"))
localReadCodeB, packCodeB = self.localReadDo(kernel, plrIdx*self.numIterPerCoalescedReadB, iui*self.numReadsIterCoalescedB, 0, tensorParametersB)
kl.append(localReadCodeB)
pack[plrIdx].addCode(packCodeB)
if iui*self.numReadsIterCoalescedA < kernel["InnerUnroll"] and (not kernel["DirectToVgprA"]) : # no local read code if DirectToVgpr is enabled
kl.append(self.comment1("local read increment a"))
kl.append(self.localReadInc(kernel, iui, tensorParametersA))
if iui*self.numReadsIterCoalescedB < kernel["InnerUnroll"] and (not kernel["DirectToVgprB"]) : # no local read code if DirectToVgpr is enabled
kl.append(self.comment1("local read increment b"))
kl.append(self.localReadInc(kernel, iui, tensorParametersB))
kl.append(self.closeString(kernel))
kl.append(self.openString(kernel))
pflr = self.numItersPLR # how many pf already done above
############################################################################
# unrolled loop: mac iterations
############################################################################
# double/quadruple the number of compute loop for each DepthU's worth of data read
for uIdx in range(0, kernel["LoopIters"]*kernel["DepthULdsDivisor"]):
u = uIdx % kernel["LoopIters"] # u: index in compute loop (in contrast to the notion of global read loop)
uDu = uIdx // kernel["LoopIters"] # uDu: index of compute loop
if u==0: # if at start of subloop...
# ...update local write code
if self.enable["LocalWrite"] and not kernel["NoLdsWriteCode"]:
self.localWriteACode = self.localWriteDo(kernel, tensorParametersA, (uDu+1)%kernel["DepthULdsDivisor"]) # local write in loopcnt N targets data for loopcnt N+1
self.localWriteBCode = self.localWriteDo(kernel, tensorParametersB, (uDu+1)%kernel["DepthULdsDivisor"])
else:
self.localWriteACode = Code.Module()
self.localWriteBCode = Code.Module()
# TODO schedule waitcnt/barrier in makeSubIterSchedule()
if kernel["PrefetchGlobalRead"] and kernel["LoopIters"] in [1, 2] and uDu > 0:
if self.enable["Wait"]:
kl.append(self.wait(kernel, tensorParametersA, tensorParametersB, 1, 0, -1, "wait for local write"))
if self.enable["Sync"]:
kl.append(self.syncThreads(kernel, "sync for local read after write"))
if not unrollLoopHeaderCodeScheduled:
self.makeSchedule(kernel, tensorParametersA, tensorParametersB, localWriteEndIter, uDu, firstIter=firstIter, lastLoop=False, lastLc=(lc==loopCopies-1))
kl.append(str(self.unrollLoopHeaderCode))
# for PGR=0 where generator can't schedule the instructions (yet),
# we duplicate the local write codegen and append to string list directly
if not kernel["PrefetchGlobalRead"]:
doWrite = False
if uDu<kernel["DepthULdsDivisor"]-1 and u==kernel["LoopIters"]-self.numItersPLR:
doWrite = True
writeForNextLoop = 1
if uDu>0 and self.numItersPLR==0 and u==0:
assert doWrite==False # should be exclusive with the previous condition
doWrite = True
writeForNextLoop = 0
# unrolled loop: local write A, B
if doWrite:
if self.enable["Wait"]:
kl.append(self.wait(kernel, tensorParametersA, tensorParametersB, -1, -1, 0, "5wait for local read"))
if self.enable["Sync"]:
kl.append(self.syncThreads(kernel, "PGR=0, prior iter done reading lds"))
if self.enable["LocalWrite"] and not kernel["NoLdsWriteCode"]:
kl.append(self.comment("local write a"))
tempLWCodeModA = self.localWriteDo(kernel, tensorParametersA, (uDu+writeForNextLoop)%kernel["DepthULdsDivisor"])
kl.append(tempLWCodeModA)
kl.append(self.comment("local write b"))
tempLWCodeModB = self.localWriteDo(kernel, tensorParametersB, (uDu+writeForNextLoop)%kernel["DepthULdsDivisor"])
kl.append(tempLWCodeModB)
if self.enable["Wait"]:
kl.append(self.wait(kernel, tensorParametersA, tensorParametersB, -1, 0, -1, "2prefetch wait for local write"))
if self.enable["Sync"]:
kl.append(self.syncThreads(kernel))
# which loop iteration to reset the LRO,
# note if PLR=0, isResetLroIter is False for all u
isResetLroIter = (u == localWriteEndIter)
isSwapAndResetLwoIter = isResetLroIter
isSwapLroIter = isResetLroIter
if kernel["ScheduleIterAlg"] == 3:
isSwapAndResetLwoIter = (u == self.lwEndMfmaIndex//(self.numMfmaPerIter))
extraComment = ""
if kernel.enabledSplitLDS:
extraComment += f" (uDu={uDu}) "
if isResetLroIter:
extraComment += " (reset local read pointers iteration) "
if isSwapAndResetLwoIter:
extraComment += " (swap and reset local write pointers iteration) "
if isSwapLroIter:
extraComment += " (swap local read pointers iteration) "
kl.append(self.comment("iter %u%s"%(u,extraComment)))
plrIdx = ((u+pflr) % (self.numVgprBuffer+1)) % kernel["LoopIters"]
localReads = Code.Module()
localReadsA = Code.Module()
localReadsB = Code.Module()
pointerLWCode = Code.Module()
pointerLRCode = Code.Module()
waitCode = Code.Module() # may be overwritten (not added to) below
macIterCode = Code.Module()
waitLWCode = Code.Module()
syncCode = Code.Module()
if self.enable["LocalRead"]:
hasLiveLdsData = kernel["PrefetchGlobalRead"] or (uDu < kernel["DepthULdsDivisor"]-1)
# reads for current loop are done in previous iteration because of wider local read
doReadA = (u < kernel["LoopIters"]/self.numIterPerCoalescedReadA - self.numItersPLR)
doReadB = (u < kernel["LoopIters"]/self.numIterPerCoalescedReadB - self.numItersPLR)
# reads for next loop
doReadA = doReadA or (hasLiveLdsData and u > localWriteEndIter)
doReadB = doReadB or (hasLiveLdsData and u > localWriteEndIter)
# disable LocalRead if DirectToVgpr is enabled
doReadA = doReadA and (not kernel["DirectToVgprA"])
doReadB = doReadB and (not kernel["DirectToVgprB"])
# double the number of VgprValu if self.vgprValuDouble is true
plrIdxLR = plrIdx
if self.vgprValuDouble and (lc & 1) == 0:
# use the next buffer set (do not change the index of pack[])
plrIdxLR += 1
for iui in range(0,kernel["InnerUnroll"]):
doReadA = doReadA and iui*self.numReadsIterCoalescedA < kernel["InnerUnroll"]
doReadB = doReadB and iui*self.numReadsIterCoalescedB < kernel["InnerUnroll"]
if doReadA:
localReads.addText(self.comment("local read a"))
localReadCodeA, packCodeA = self.localReadDo(kernel, plrIdxLR*self.numIterPerCoalescedReadA, iui*self.numReadsIterCoalescedA, 0, tensorParametersA)
localReads.addCode(localReadCodeA)
localReadsA.addCode(localReadCodeA)
pack[plrIdx*self.numIterPerCoalescedReadA].addCode(packCodeA)
if doReadB:
localReads.addText(self.comment("local read b"))
localReadCodeB, packCodeB = self.localReadDo(kernel, plrIdxLR*self.numIterPerCoalescedReadB, iui*self.numReadsIterCoalescedB, 0, tensorParametersB)
localReads.addCode(localReadCodeB)
localReadsB.addCode(localReadCodeB)
pack[plrIdx*self.numIterPerCoalescedReadB].addCode(packCodeB)
# Don't increment the LRO if we are going to reset them below:
if not isResetLroIter or iui != kernel["InnerUnroll"]-1:
if doReadA:
localReads.addText(self.comment("local read increment a"))
localReads.addText(self.localReadInc(kernel, iui, tensorParametersA))
if doReadB:
localReads.addText(self.comment("local read increment b"))
localReads.addText(self.localReadInc(kernel, iui, tensorParametersB))
if kernel["PrefetchGlobalRead"]:
# wait code for DirectToVgpr
if kernel["DirectToVgprA"] or kernel["DirectToVgprB"]:
if self.enable["Wait"]:
# do not generate the wait here if
# 1) this is the first unroll iteration with self.canOptimizePreLoopLWVmcnt = True, or
# 2) the local write code in a previous u already has a waitcnt vmcnt
prevVmcnt = False
prevLocalWrite = ""
if (u > 0 and kernel["ScheduleIterAlg"] == 3):
for up in range(u):
prevLocalWrite += ' '.join([str(x) for x in self.perIterLocalWriteCode[up].flatitems()])
prevVmcnt = "vmcnt" in prevLocalWrite
if not (firstIter and u == 0 and self.canOptimizePreLoopLWVmcnt) and not prevVmcnt:
retStr = self.getWaitcntCodeForDirectToVgpr(kernel, localWriteEndIter, u, firstIter)
kl.append(retStr)
# put barrier at localWriteEndIter+1
if u == localWriteEndIter+1 or (u == (localWriteEndIter+1)%kernel["LoopIters"] and kernel["ScheduleIterAlg"] == 2):
if self.enable["Wait"]:
if kernel["DirectToLdsA"] or kernel["DirectToLdsB"]:
# skip generating the wait for global read again here in the DirectToVgpr case, or in the no-DirectToVgpr + PGR=2 case
# (in the no-DTV + PGR=2 case the wait is generated at the sync (barrier), which is before the next local read)
if not(kernel["DirectToVgprA"] or kernel["DirectToVgprB"]) and not kernel["PrefetchGlobalRead"]==2:
kl.append(self.wait(kernel, tensorParametersA, tensorParametersB, 0, -1, -1, "12wait for global read"))
else:
# DirectToVgpr + DirectToLds case, add waitcnt vmcnt before s_barrier
# Except for PGR=2 and Load C case. In that case, Load C is executed after necessary Load A and B.
# Wait for Load C is already done here in PGR=2 case.
needLoadC = kernel["StoreCInUnroll"] and (not kernel["AtomicAddC"]) and kernel["ProblemType"]["UseBeta"]
if not (kernel["PrefetchGlobalRead"]==2 and needLoadC):
retStr = self.getWaitcntCodeForDirectToVgpr(kernel, localWriteEndIter, u, firstIter, beforeBarrier=True)
waitLWCode.addCode(retStr)
# skip local write wait if DirectToVgpr + DirectToLds is enabled
# (no local write code. Global read wait for DirectToLds is already done)
if not kernel["NoLdsWriteCode"]:
waitLWCode.addCode(self.wait(kernel, tensorParametersA, tensorParametersB, -1, 0, -1, "3wait for local write"))
if self.enable["Sync"]:
if kernel["DirectToVgprA"] or kernel["DirectToVgprB"]:
# put only barrier for DirectToVgpr (to avoid generating waitcnt for global read)
syncCode.addCode("s_barrier" + self.endLine)
else:
syncCode.addCode(self.syncThreads(kernel))
if isSwapAndResetLwoIter: # ResetLroIter
if self.enable["LocalWrite"]:
# local write for next iter, used to have local writes here
pointerLWCode.addText(self.comment("local write swap offsets a"))
pointerLWCode.addText(self.localWriteSwapOffsets(kernel, expand, tensorParametersA))
pointerLWCode.addText(self.comment("local write swap offsets b"))
pointerLWCode.addText(self.localWriteSwapOffsets(kernel, expand, tensorParametersB))
pointerLWCode.addText(self.localWriteInitPointers(kernel, tensorParametersA))
pointerLWCode.addText(self.localWriteInitPointers(kernel, tensorParametersB))
if isSwapLroIter: # ResetLroIter
if self.enable["LocalRead"]:
# Swap, reset, or increment the LRO:
pointerLRCode.addText(self.comment("local read swap offsets a"))
pointerLRCode.addText(self.localReadSwapOffsets(kernel, expand, tensorParametersA))
pointerLRCode.addText(self.comment("local read swap offsets b"))
pointerLRCode.addText(self.localReadSwapOffsets(kernel, expand, tensorParametersB))
if isResetLroIter: # ResetLroIter
if self.enable["LocalRead"]:
pointerLRCode.addText(self.comment("local read init pointers a"))
pointerLRCode.addText(self.localReadInitPointers(kernel, tensorParametersA))
pointerLRCode.addText(self.comment("local read init pointers b"))
pointerLRCode.addText(self.localReadInitPointers(kernel, tensorParametersB))
# we initialize lgkmcnt to 0, then assign it the correct value in makeSubIterSchedule()
if self.enable["Wait"]:
if self.getConditionToSkipLocalWriteWait(kernel, True, u, kernel["LoopIters"] - 1):
waitCode = self.wait(kernel, tensorParametersA, tensorParametersB, \
-1, 0, 0, \
"wait for prior local read local write")
luIdx = (u) % (self.numVgprBuffer+1) # local to use for MACs
if self.enable["MAC"]:
if kernel["EnableMatrixInstruction"]:
vregSetIdxMFMA = lc
macIterCode.addCode(self.mfmaIter(kernel, u, kernel["InnerUnroll"], vregSetIdxMFMA, firstIter=firstIter and u == 0))
else:
macIterCode.addCode(self.macIter(kernel, luIdx, kernel["InnerUnroll"], True ))
###### unroll loop efficiency implementation######################################
# unroll loop efficiency implementation
## split A&B fetch&MAC code into multiple groups
## splitting strategy based on TT size
## 6x4 -> split MAC blob(s) into group of 8(s) and 16 FMA instructions.
## LDS fetch(es) into group of A(1-2),B(0) , A(3),B(1) (not implemented yet)
## 4x6 -> split MAC blob(s) into group of 8(s) and 16 FMA instructions.
## LDS fetch(es) into group of B(1-2),A(0) , B(3),A(1)
## 4x4 -> split into group of 8 and 8 MAC(s)
## 6x6 -> split into group of 12 MAC(s)
## 8x4/4x8 -> split into group of 16 and 16 MAC(s)
## 8x8 -> split into group of 16 MAC(s)
## supports only PLR=0
###############################################################################
if self.numItersPLR or (not globalParameters["UnrollLoopEfficiencyEnable"]):
subIterCode = self.makeSubIterSchedule(kernel, localReads, \
u, pointerLWCode, pointerLRCode, waitCode, macIterCode, waitLWCode, syncCode, pack[luIdx])
kl.append(subIterCode) # add scheduled "other", local reads, local writes
for item in list(pack[luIdx].items()):
if item.tempVgpr != None:
self.vgprPool.checkIn(item.tempVgpr)
item.tempVgpr = None
pack[luIdx] = Code.Module()
else:
macIterCode = Code.Module()
MacitemsReorder = []
if self.enable["MAC"]:
luIdx = (u) % (self.numVgprBuffer+1) # local to use for MACs
macIterCode.addCode(self.macCode(kernel, luIdx, kernel["InnerUnroll"] ))
MacIteritems = macIterCode.flatitems()
#remove the last and second entries from the list if AggressivePerfMode is set
# re-insert them later (s_setprio instructions are added around the grouped MACs below)
if (kernel["AggressivePerfMode"]):
MacIteritems = MacIteritems[:-1]
MacIteritems.pop(1)
#print("number MacItems\n",len(MacIteritems))
blockWidth = tensorParametersA["localReadInstruction"].blockWidth
numVectorsPerTileA = (kernel["ThreadTile%u"%tensorParametersA["tensorIdx"]]/kernel["VectorWidth"])
numReadsPerVectorA = (kernel["VectorWidth"] * tensorParametersA["bpe"] ) / (blockWidth*4)
numVectorsPerTileB = (kernel["ThreadTile%u"%tensorParametersB["tensorIdx"]]/kernel["VectorWidth"])
TotalnumLdsFetches = numVectorsPerTileA*numReadsPerVectorA + numVectorsPerTileB*numReadsPerVectorA
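# note: this fetch count reuses numReadsPerVectorA for the B term as well, i.e. it assumes
# A and B use the same reads-per-vector.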
## Rules for applying kernel["UnrollLoopEfficiencyEnable"]
## if A+B fetches <= 3 no split approach
if not TotalnumLdsFetches > 3:
subIterCode = self.makeSubIterSchedule(kernel, localReads, \
u, pointerLWCode, pointerLRCode, waitCode, macIterCode)
kl.append(subIterCode) # add scheduled "other", local reads, local writes
else:
if ((kernel["ThreadTile0"] == 6 and kernel["ThreadTile1"] == 4) or
(kernel["ThreadTile0"] == 4 and kernel["ThreadTile1"] == 6)):
numGroups = 2 #group0 = 8 MAC(s) #group1 = 16 MAC(s) (6x4 - 4x2)
# ldsItems for splitting lds(s)
ldsItems = ([[4,2],[2,2]]) if kernel["ThreadTile0"] == 6 else ([[2,4],[2,2]])
macItems = [8,16]
waitCntItems = [0,0]
elif (kernel["ThreadTile0"] == 4 and kernel["ThreadTile1"] == 4):
numGroups = 2 #group0 = 8 MAC(s), group1 = 8 MAC(s)
ldsItems = ([[4,2],[0,2]])
macItems = [8,8]
waitCntItems = [0,0]
elif (kernel["ThreadTile0"] == 6 and kernel["ThreadTile1"] == 6):
numGroups = 2 #group0 = 16 MAC(s), group1 = 20 MAC(s)
ldsItems = ([[4,4],[2,2]])
macItems = [16,20]
waitCntItems = [0,0]
elif ((kernel["ThreadTile0"] == 8 and kernel["ThreadTile1"] == 4) or
(kernel["ThreadTile0"] == 4 and kernel["ThreadTile1"] == 8)):
numGroups = 2 #group0 = 16 MAC(s), group1 = 16 MAC(s)
ldsItems = ([[4,4],[4,0]]) if kernel["ThreadTile0"] == 8 else ([[4,4],[0,4]])
macItems = [16,16]
waitCntItems = [0,0]
elif (kernel["ThreadTile0"] == 8 and kernel["ThreadTile1"] == 8):
numGroups = 2 #group0 = 16 MAC(s), group1 = 48 MAC(s)
#ldsItems = ([[4,4],[4,4]])
macItems = [16,48]
waitCntItems = [0,0]
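# sanity check on the groupings above: each macItems pair sums to ThreadTile0*ThreadTile1
# (the MAC count indexed by MacitemsReorder below), e.g. 6x4 -> 8+16 = 24, 6x6 -> 16+20 = 36,
# 8x8 -> 16+48 = 64.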
AitemsToReorder = localReadsA.flatitems()
BitemsToReorder = localReadsB.flatitems()
## reorder code based on LDS fetch
## works for 2 groups; needs a fix for more than 2 groups
for iter in range(0,numGroups):
endIdx = ldsItems[iter][0] if iter == 0 else kernel["ThreadTile%u"%tensorParametersA["tensorIdx"]]
startIdx = 0 if iter == 0 else ldsItems[iter-1][1]
for Bitems in range(startIdx, startIdx+ldsItems[iter][1]):
for Aitems in range(0, endIdx):
idx = Aitems+(kernel["ThreadTile%u"%tensorParametersA["tensorIdx"]]*Bitems)
MacitemsReorder.append(MacIteritems[idx])
if (iter != 0):
for Bitems in range(0, ldsItems[iter-1][1]):
for Aitems in range(ldsItems[iter-1][0], kernel["ThreadTile%u"%tensorParametersA["tensorIdx"]]):
MacitemsReorder.append(MacIteritems[Aitems+((kernel["ThreadTile%u"%tensorParametersA["tensorIdx"]])*Bitems)])
#print("Total number mac items A(%u)\n",len(MacitemsReorder))
#print("Total number ds items A(%u)\n"%(TotalnumLdsFetches))
#print("number ds items A_B(%u .. %u)\n"%(len(AitemsToReorder),len(BitemsToReorder)))
#reorder LDS fetches so the order of A+B fetches matches the MAC blob
#e.g. 8x4 original order in the DGEMM case: A[0-1] A[2-3] A[4-5] A[6-7] B[0-1] B[2-3]
#we want to re-order them into A[0-1] A[2-3] B[0-1] B[2-3]; in all types other than
#DGEMM the number of LDS fetches is <= 4, so no LDS re-order is needed
if self.enable["LocalRead"] and TotalnumLdsFetches > 4:
localReads = Code.Module()
for iter in range(0,numGroups):
if len(AitemsToReorder):
localReads.addText(self.comment("local read a"))
numLocalReads = roundUp((ldsItems[iter][0])/kernel["VectorWidth"])
##print("number ds items A(%u..%u)\n"%(iter,numLocalReads))
for idx in range(0,numLocalReads):
localReads.addCode(AitemsToReorder[0])
AitemsToReorder = AitemsToReorder[1:]
if len(BitemsToReorder):
numLocalReads = roundUp(ldsItems[iter][1]/kernel["VectorWidth"])
##print("number ds items B(%u..%u)\n"%(iter,numLocalReads))
localReads.addText(self.comment("local read b"))
for items in range(0,numLocalReads):
localReads.addCode(BitemsToReorder[0])
BitemsToReorder = BitemsToReorder[1:]
if iter == 0:
waitCntItems[iter] = TotalnumLdsFetches - ((ldsItems[iter][0])/kernel["VectorWidth"] + (ldsItems[iter][1])/kernel["VectorWidth"])
elif iter+1 != numGroups:
waitCntItems[iter] = TotalnumLdsFetches - ((ldsItems[iter][0])/kernel["VectorWidth"] + (ldsItems[iter][1])/kernel["VectorWidth"] + waitCntItems[iter-1])
else:
waitCntItems[iter] = 0
#print("Waitcnt(%u..%u)\n"%(iter,waitCntItems[iter]))
for iter in range(0,numGroups):
#Mac Code
#placeholder for future work: an Instruction class for generating MAC instructions
#FMAInstruction = MacInstruction(globalParameters["CurrentISA"])
subIterCode = Code.Module()
waitCode = Code.Module()
macIterCodeGrp = Code.Module()
doOnce = False
if self.enable["MAC"]:
numMacItems = macItems[iter]
for Items in range(0,numMacItems):
macItem = MacitemsReorder.pop(0)
macIterCodeGrp.addCode(macItem)
## add s_setprio 1 when AggressivePerfMode ==1 as second instruction for second-last blob macCode
if (kernel["AggressivePerfMode"] and not doOnce):
macIterCodeGrp.addInst("s_setprio ","1","Raise priority while processing macs")
doOnce = True
## add s_setprio 0 when AggressivePerfMode ==1 as last instruction
if (kernel["AggressivePerfMode"]):
macIterCodeGrp.addInst("s_setprio ","0","Reset priority after macs")
#print("ReadWaitcnt(%u..%u)\n"%(iter,waitCntItems[iter]))
#print("WriteCodeCount(%d..%u)\n",u,self.perIterLocalWriteCode[u].count())
if (iter == 0):
if self.enable["Wait"]:
#calculate lgkmcnt value including read+write for first iteration
waitCntVal = waitCntItems[iter] + 1 if (self.perIterLocalWriteCode[u].count()>0) else waitCntItems[iter]
# read + write instructions lgkmcnt (1=> for write)
# build waitCnt using new lgkmcnt
waitCode = Code.WaitCnt(self.version, waitCntVal,-1,"wait for prior local read")
subIterCode = self.makeSubIterSchedule(kernel, localReads, \
u, pointerLWCode, pointerLRCode, waitCode, macIterCodeGrp)
else:
#last group only pointer + localWrite Code
if self.enable["Wait"]:
waitCode = Code.WaitCnt(self.version, waitCntItems[iter],-1,"wait for prior local read & local writes")
subIterCode.addCode(waitCode)
subIterCode.addCode(macIterCodeGrp)
kl.append(subIterCode) # add scheduled "other", local reads, local writes
kl.append(self.closeString(kernel))
kl.append(self.openString(kernel))
# close unrolled loop
if expand:
if not finalLoop:
kl.append(self.comment3("Unrolled Loop - End %u/%u"%(lc+1, loopCopies)))
else:
kl.append(self.comment3("Unrolled Loop - End %u/%u (final)"%(lc+1, loopCopies)))
# add a wait for global read here if canOptimizePreLoopLWVmcnt is true and DirectToVgpr is enabled
# StoreCInUnroll does not require this wait because the wait code is generated at the top of the inner loop
if kernel["PrefetchGlobalRead"] and self.canOptimizePreLoopLWVmcnt and (kernel["DirectToVgprA"] or kernel["DirectToVgprB"]) \
and (not kernel["StoreCInUnroll"]):
retStr = self.getWaitcntCodeForDirectToVgpr(kernel, localWriteEndIter, u=0, firstIter=False)
kl.append(retStr)
else:
kl.append(self.comment3("Unrolled Loop - End"))
oddLabel = lc == 0
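# oddLabel marks the first of the loop copies (lc == 0); closeLoop receives it together with
# finalLoop, presumably so the loop-end branch/label emitted for this copy can be told apart
# from the other copy's when ExpandPointerSwap duplicates the loop body.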
kl.append(self.closeLoop(kernel, self.unrollIdx, finalLoop, loopCopies, oddLabel=oddLabel))
##############################################################################
# Kernel Body
##############################################################################
def kernelBody( self, kernel, tensorParametersA, tensorParametersB ):
expand = kernel["ExpandPointerSwap"]
####################################
# Begin String
kl = []
kl.append(self.openString(kernel))
####################################
# Function Prefix
kl.append(self.comment3("Function Prefix"))
kl.append(self.functionPrefix(kernel))
####################################
# Function Signature
####################################
kl.append(self.comment3("Begin Kernel"))
kl.append(self.functionSignaturePrefix(kernel))
beforeFunctionSignature = '\n'.join([str(x) for x in kl])
kl = []
kl.append(self.functionSignatureSuffix(kernel))
kl.append(self.functionBegin(kernel))
kl.append(self.comment3("Allocate Resources"))
kl.append(self.allocateResources(kernel))
if self.enable["PreLoop"]:
####################################
# Local Read Addresses
####################################
kl.append(self.comment3("Local Read Addresses"))
# tile assignments
kl.append(self.comment("local read addresses: tile assignments a/b"))
kl.append(self.lraTileAssignment(kernel, tensorParametersA, tensorParametersB))
# final offsets
kl.append(self.comment("local read addresses: final offsets a"))
kl.append(self.lraFinalOffset(kernel, tensorParametersA))
kl.append(self.comment("local read addresses: final offsets b"))
kl.append(self.lraFinalOffset(kernel, tensorParametersB))
# declare addresses
kl.append(self.comment("local read addresses: declare addresses a"))
kl.append(self.lraDeclareAddresses(kernel, tensorParametersA))
kl.append(self.comment("local read addresses: declare addresses b"))
kl.append(self.lraDeclareAddresses(kernel, tensorParametersB))
# doShadowInit performs initialization in the 'shadow' of the global mem prefetch
self.doShadowInit = 0
if kernel["PrefetchGlobalRead"]:
if self.actualSummationLoops == 1:
self.doShadowInit = 2 # 2 is both store setup and initC
else:
# can't do shadow initC with multiple summation since this resets the ValuC counters
# on each unroll iteration.
self.doShadowInit = 1 # 1 is just store setup
if self.prefetchAcrossPersistent:
# SrdC/D init before persistent loop
kl.append(self.globalWriteWorkGroupInitBeforePersistentLoop(kernel))
# init code for StoreCInUnroll (only once before persistent kernel loop)
# SrdC/D init has to be done beforehand
if self.storeCInUnroll:
kl.append(self.initStoreCInUnroll(kernel))
# first prefetch is outside persistent loop, subsequent prefetch will
# be integrated into no-load-loop
kl += self.setupNewTile(kernel, tensorParametersA, tensorParametersB, isPap=False, isOptNLL=False)
kl.append(self.openPersistentLoop(kernel))
else:
# prefetch is inside persistent loop
kl.append(self.openPersistentLoop(kernel))
kl += self.setupNewTile(kernel, tensorParametersA, tensorParametersB, isPap=False, isOptNLL=False)
pack = [ Code.Module() for i in range (self.numVgprBuffer+1) ]
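# one pack module per VGPR buffer (numVgprBuffer+1 of them): localReadDo returns extra
# pack/convert code for data types that need packing after the local read; it is collected
# here per buffer, emitted by makeSubIterSchedule, and its temp VGPRs are checked back in
# afterwards (see the tempVgpr check-in loops in the loop bodies above).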
self.preLoopLocalWriteCode = None
if kernel["PrefetchGlobalRead"]:
if self.doShadowInit:
kl.append(self.openShadowInit(kernel))
# init code for StoreCInUnroll per each persistent kernel loop iteration
# before generate new srdC/D (in globalWriteWorkGroupInit())
if self.storeCInUnroll:
kl.append(self.initStoreCInUnrollPerPersistentLoop(kernel))
kl.append(self.globalWriteWorkGroupInit(kernel))
# after generating new SrdC,D, swap with backup values so that the previous SrdC,D is used in the unroll loop for StoreCInUnroll
if self.storeCInUnroll:
kl.append(self.swapSrdCDandBackup(kernel))
if self.doShadowInit == 2:
kl.append(self.initC(kernel)) # initC while waiting for global reads
kl.append(self.closeShadowInit(kernel))
if self.enable["Wait"] and not self.canOptimizePreLoopLWVmcnt:
kl.append(self.wait(kernel, tensorParametersA, tensorParametersB, 0, -1, -1, "8wait for global read"))
# These cases loop back and run the prefetch loop again
# we need an extra barrier to ensure that the ds_reads (either for SR or MFMA) from previous iteration
# have finished before we generate the prefetch for the next summation index.
if kernel["PersistentKernel"] or self.actualSummationLoops>1:
kl.append( self.indent + self.syncStr + "// for PersistentKernel " + self.endLine )
if self.enable["LocalWrite"]:
# local write
self.preLoopLocalWriteCode = self.preLoopLocalWriteDo(kernel, tensorParametersA, tensorParametersB)
kl.append(self.preLoopLocalWriteCode)
# swap local ptrs
kl.append(self.comment("local write swap a"))
kl.append(self.localWriteSwapOffsets(kernel, expand, tensorParametersA))
kl.append(self.comment("local write swap b"))
kl.append(self.localWriteSwapOffsets(kernel, expand, tensorParametersB))
kl.append(self.localWriteInitPointers(kernel, tensorParametersA))
kl.append(self.localWriteInitPointers(kernel, tensorParametersB))
if kernel["PrefetchGlobalRead"] == 2:
kl.append(self.openPrefetchGlobalRead2(kernel))
if self.enable["GlobalRead"]:
# if DirectToVgprA is enabled, swap the order of global read (B->A)
tensorParameters1st = tensorParametersA
tensorParameters2nd = tensorParametersB
if kernel["DirectToVgprA"]:
tensorParameters1st, tensorParameters2nd = tensorParameters2nd, tensorParameters1st
kl.append(str(self.directToLdsM0Update(kernel, 1, tensorParameters1st)))
kl.append(str(self.globalReadDo(kernel, 0, tensorParameters1st, 1)))
kl.append(str(self.directToLdsM0Update(kernel, 1, tensorParameters2nd)))
kl.append(str(self.globalReadDo(kernel, 0, tensorParameters2nd, 1)))
# swap local ptrs again if DirectToLds is enabled
if kernel["DirectToLdsA"]:
kl.append(self.comment("local write swap a"))
kl.append(self.localWriteSwapOffsets(kernel, expand, tensorParametersA))
kl.append(self.localWriteInitPointers(kernel, tensorParametersA))
if kernel["DirectToLdsB"]:
kl.append(self.comment("local write swap b"))
kl.append(self.localWriteSwapOffsets(kernel, expand, tensorParametersB))
kl.append(self.localWriteInitPointers(kernel, tensorParametersB))
kl.append(self.closePrefetchGlobalRead2(kernel))
# prefetch-local
if self.numItersPLR:
# do not generate a wait for local write if LDS write code is not generated
if self.enable["Wait"] and not kernel["NoLdsWriteCode"]:
# TODO: need to check if we correctly checked-in the temp VGPR used for Int8 LocalWrite (uDu, PGR=2)
kl.append(self.wait(kernel, tensorParametersA, tensorParametersB, -1, 0, -1, "0prefetch wait for local write"))
if self.enable["Sync"]:
kl.append(self.syncThreads(kernel))
# in some cases need an extra copy of the LDS read with appropriate double buffer offsets
if self.enable["LocalRead"]:
for plrIdx in range(0, self.numItersPLR):
pack[plrIdx] = Code.Module()
# regardless of EPS or PAP, only prefetch local once per plrIdx
# for espi in range(0, (self.prefetchAcrossPersistent and kernel["ExpandPointerSwap"])+1):
for espi in range(0, 1):
for iui in range(0,kernel["InnerUnroll"]):
if iui*self.numReadsIterCoalescedA < kernel["InnerUnroll"] and (not kernel["DirectToVgprA"]) : # no local read code if DirectToVgpr is enabled
kl.append(self.comment("local read prefetch a"))
localReadCodeA, packCodeA = self.localReadDo(kernel, plrIdx*self.numIterPerCoalescedReadA, iui*self.numReadsIterCoalescedA, espi, tensorParametersA)
kl.append(localReadCodeA)
pack[plrIdx].addCode(packCodeA)
if iui*self.numReadsIterCoalescedB < kernel["InnerUnroll"] and (not kernel["DirectToVgprB"]) : # no local read code if DirectToVgpr is enabled
kl.append(self.comment("local read prefetch b"))
localReadCodeB, packCodeB = self.localReadDo(kernel, plrIdx*self.numIterPerCoalescedReadB, iui*self.numReadsIterCoalescedB, espi, tensorParametersB)
kl.append(localReadCodeB)
pack[plrIdx].addCode(packCodeB)
if iui*self.numReadsIterCoalescedA < kernel["InnerUnroll"] and (not kernel["DirectToVgprA"]) : # no local read code if DirectToVgpr is enabled
kl.append(self.comment("local read inc a"))
kl.append(self.localReadInc(kernel, iui, tensorParametersA))
if iui*self.numReadsIterCoalescedB < kernel["InnerUnroll"] and (not kernel["DirectToVgprB"]) : # no local read code if DirectToVgpr is enabled
kl.append(self.comment("local read inc b"))
kl.append(self.localReadInc(kernel, iui, tensorParametersB))
kl.append(self.closeSumAtLeastUnroll(kernel, prefetch=True, isOptNLL=False, isPap=False, isNGLL=False))
loopCopies = 2 if expand else 1
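# ExpandPointerSwap emits two copies of the unrolled loop (loopCopies = 2), one per LDS
# buffer, the intent being to resolve the pointer swap at code-generation time; the `expand`
# flag is passed to the localWrite/localReadSwapOffsets calls for the same reason.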
if self.useInitAccVgprOpt:
# generate first iteration code for init accvgpr opt
kl.append(self.comment3("First Unrolled Iter for InitAccVgprOpt - Begin"))
# open loop without Label
kl.append(self.openLoop(kernel, self.unrollIdx, noLabelGen=True))
self.loopBody( kernel, tensorParametersA, tensorParametersB, kl, pack, 0, loopCopies, False, firstIter=True )
# open unrolled summation loop
kl.append(self.comment3("Unrolled Loop(s) - Begin"))
# In StoreCInUnroll case, LoopCounter check code is already generated. We need only LoopBeginLabel
beginLabelOnly = kernel["StoreCInUnroll"]
kl.append(self.openLoop(kernel, self.unrollIdx, beginLabelOnly=beginLabelOnly))
lcStart = 0
if self.useInitAccVgprOpt:
lcStart = 1 if loopCopies == 2 else 0
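# e.g. useInitAccVgprOpt with loopCopies==2: the first unrolled iteration was already emitted
# above with copy 0, so the loop below starts from copy 1 and wraps around (loopIndex
# sequence: 1, 0).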
for lc in range(0, loopCopies):
loopIndex = lcStart + lc
if loopIndex >= loopCopies:
loopIndex -= loopCopies
# loop body code generation
finalLoop = lc == loopCopies - 1
self.loopBody( kernel, tensorParametersA, tensorParametersB, kl, pack, loopIndex, loopCopies, finalLoop )
kl.append(self.comment("Before NLL: Check VGPR.checkin for INT8 LW"))
# swap local write and read again before the noLoadLoop if PrefetchGlobalRead and DirectToLds are enabled
# In the DirectToLds case, the local write address is necessary for the prefetch global read (for m0).
# However, the even exit with DirectToLds will not pass with this code (limitation).
# So far, this code is to make the odd exit case (i.e. k is a multiple of 2*depthU) pass for DirectToVgpr
if not self.useInitAccVgprOpt and kernel["PrefetchGlobalRead"] and self.enable["LocalWrite"] and kernel["ExpandPointerSwap"]:
# local write for next iter, used to have local writes here
if(kernel["DirectToLdsA"]):
kl.append(self.comment("local write swap offsets a"))
kl.append(self.localWriteSwapOffsets(kernel, expand, tensorParametersA))
if(kernel["DirectToLdsB"]):
kl.append(self.comment("local write swap offsets b"))
kl.append(self.localWriteSwapOffsets(kernel, expand, tensorParametersB))
# swap local read point for self.useInitAccVgprOpt
if self.useInitAccVgprOpt and kernel["ExpandPointerSwap"]:
if self.enable["LocalRead"]:
kl.append(self.comment("local read swap offsets a"))
kl.append(self.localReadSwapOffsets(kernel, expand, tensorParametersA))
kl.append(self.comment("local read swap offsets b"))
kl.append(self.localReadSwapOffsets(kernel, expand, tensorParametersB))
if kernel["PrefetchGlobalRead"] == 2:
# re-generate store code for StoreCInUnroll (odd=0, isLast=False)
self.generateStoreCCodeInUnrollLoop(kernel, 0, isLast=False)
isOptNLL=False
isPap=False
if self.prefetchAcrossPersistent:
isOptNLL = True
isPap = True
kl += self.noLoadLoop(kernel, tensorParametersA, tensorParametersB, isOptNLL=isOptNLL, isPap=isPap, isNGLL=True, pack=pack)
# re-generate store code for StoreCInUnroll (no increment code (isLast=True))
# this should be after NGLL code for PGR=2
odd = 1
self.generateStoreCCodeInUnrollLoop(kernel, odd, isLast=True)
# This "NoLoad" loop is a copy of the unroll loop but with global loads + LDS writes removed
# doShadowInit is required since this pushes up the store SRD initialization before the NLL
# OptNLL only allowed for single summation index - for multiple summation we (currently)
# execute the NLL inside each unroll iteration not just once at the end.
if kernel["PrefetchGlobalRead"]:
if not kernel["SuppressNoLoadLoop"]:
firstNLLgenerated = False
if kernel["KernelLanguage"] == "Assembly" and kernel["OptNoLoadLoop"] and \
kernel["BufferLoad"] and kernel["BufferStore"] and self.doShadowInit and \
kernel["LocalSplitU"]==1 and kernel["GlobalSplitU"] == 1 and \
self.actualSummationLoops==1:
firstNLLgenerated = True
# two different noLoadLoops:
# 1. OptNLL & PAP global-read interleaved (only for PAP=ON)
# (2. OptNLL : No PAP global-read (For PAP=OFF, or PAP=ON but the last tile))
# -> this is unified with 1. global-read is invalidated at the last tile.
# 3. OrdinaryNLL (Not Opt.)
self.saveLocalPointers(kernel)
# deepCopy packCode for OptNLL noLoadLoop
deepCopyPack = copy.deepcopy(pack)
# keep StoreCInUnroll related code for the next noLoadLoop
if kernel["StoreCInUnroll"]:
self.backupStoreCInUnrollRelatedCode()
isPap = self.prefetchAcrossPersistent
kl += self.noLoadLoop(kernel, tensorParametersA, tensorParametersB, isOptNLL=True, isPap=isPap, isNGLL=False, pack=deepCopyPack)
self.restoreLocalPointers(kernel)
# restore StoreCInUnroll related code
if kernel["StoreCInUnroll"]:
self.restoreStoreCInUnrollRelatedCode()
# skip second NLL code if enableSingleNLLOpt
if not (self.enableSingleNLLOpt and firstNLLgenerated):
papMode = self.prefetchAcrossPersistent and kernel["PrefetchAcrossPersistentMode"] == 1
kl += self.noLoadLoop(kernel, tensorParametersA, tensorParametersB, isOptNLL=False, isPap=papMode, isNGLL=False, pack=pack)
else:
# generate PrefetchGlobalLastIterEnd label
kl.append(self.closeSumAtLeastUnroll(kernel, prefetch=False, isOptNLL=False, isPap=False, isNGLL=False))
if kernel["StoreCInUnroll"]:
# end process for StoreCInUnroll per PersistentLoop (NoOptNLL)
kl.append(self.endProcessPersistentLoopforStoreCInUnrollNoOptNLL(kernel))
# if PGR, the last few iterations will have PLR,
# and those PLR registers would not be used (not checked in) without the NoLoadLoop
else:
for i in range(self.numVgprBuffer):
for item in list(pack[i].items()):
if item.tempVgpr != None:
self.vgprPool.checkIn(item.tempVgpr)
item.tempVgpr = None
if self.staggerU and self.actualSummationLoops>1:
kl.append(self.comment("remove stagger offsets"))
kl.append(self.removeStagger(kernel, tensorParametersA))
kl.append(self.removeStagger(kernel, tensorParametersB))
if not self.noTailLoop:
########################################
# Tail Loop
# PackSummationDims=1 requires that the tile slice does not cross DepthU
# which means tail loop not needed.
########################################
self.inTailLoop = True
if kernel["LoopTail"] and not kernel["PackSummationDims"]:
kl.append(self.comment3("Tail Loop"))
# Update local write pointers in case the upcoming global reads are writing directly to LDS:
if self.enable["LocalWrite"]:
if kernel["PrefetchGlobalRead"]:
kl.append(self.comment("local write reset offsets a"))
kl.append(self.localWriteResetOffsets(kernel, kernel["ExpandPointerSwap"], tensorParametersA))
if kernel["ExpandPointerSwap"]:
# reset local write offset in asm code as well
kl.append(self.localWriteResetOffsets(kernel, False, tensorParametersA))
kl.append(self.comment("local write reset offsets b"))
kl.append(self.localWriteResetOffsets(kernel, kernel["ExpandPointerSwap"], tensorParametersB))
if kernel["ExpandPointerSwap"]:
# reset local write offset in asm code as well
kl.append(self.localWriteResetOffsets(kernel, False, tensorParametersB))
if self.enable["GlobalRead"]:
# tail: global read
kl.append(self.calculateLoopNumIter(kernel, -1, False))
if self.staggerU and self.actualSummationLoops==1:
kl.append(self.comment("remove stagger offsets for tail loop"))
kl.append(self.removeStagger(kernel, tensorParametersA))
kl.append(self.removeStagger(kernel, tensorParametersB))
# if DirectToVgprA is enabled, swap the order of global read (B->A)
tensorParameters1st = tensorParametersA
tensorParameters2nd = tensorParametersB
tc1 = 'a'
tc2 = 'b'
if kernel["DirectToVgprA"]:
tensorParameters1st, tensorParameters2nd = tensorParameters2nd, tensorParameters1st
tc1, tc2 = tc2, tc1
kl.append(self.comment("Update M0 for DTLDS"))
tmpStr = str(self.directToLdsM0Update(kernel, 1, tensorParameters1st))
tmpStr = tmpStr.replace("__placeholder__", str(0))
kl.append(tmpStr)
kl.append(self.comment("global read %s"%tc1))
vregSetIdx = 0
kl.append(str(self.globalReadDo(kernel, 2, tensorParameters1st, vregSetIdx)))
kl.append(self.comment("Update M0 for DTLDS"))
tmpStr = str(self.directToLdsM0Update(kernel, 1, tensorParameters2nd))
tmpStr = tmpStr.replace("__placeholder__", str(0))
kl.append(tmpStr)
kl.append(self.comment("global read %s"%tc2))
vregSetIdx = 0
kl.append(str(self.globalReadDo(kernel, 2, tensorParameters2nd, vregSetIdx)))
if self.enable["Wait"]:
kl.append(self.wait(kernel, tensorParametersA, tensorParametersB, 0, -1, -1, "2wait for global read"))
if self.enable["Sync"]:
kl.append(self.syncThreads(kernel))
# the following read/write addresses could be modified in recalcLocal(Read|Write)Addresses due to policy change
self.oriLraA = None # back up original local read address vgpr
self.oriLraB = None
self.oriLwaA = None # back up original local write address vgpr
self.oriLwaB = None
for uDu in range(0, kernel["DepthULdsDivisor"]):
if kernel.enabledSplitLDS:
# change local write policy from interleave-K to fractional because the tail loop
# iterates the LDS read address one unit of K at a time
kl.append(self.comment("Recalc local write offsets"))
kl.append(self.recalcLocalWriteAddresses(kernel, tensorParametersA, uDu))
kl.append(self.recalcLocalWriteAddresses(kernel, tensorParametersB, uDu))
if self.enable["Sync"]:
if uDu > 0:
kl.append(self.comment("sync before local write"))
kl.append(self.syncThreads(kernel))
if self.enable["LocalWrite"] and not kernel["NoLdsWriteCode"]:
# tail: local write
kl.append(self.localWriteInitPointers(kernel, tensorParametersA))
kl.append(self.localWriteInitPointers(kernel, tensorParametersB))
kl.append(self.comment("local write a"))
tempLWCodeModA = self.localWriteDo(kernel, tensorParametersA, None)
kl.append(tempLWCodeModA)
kl.append(self.comment("local write b"))
tempLWCodeModB = self.localWriteDo(kernel, tensorParametersB, None)
kl.append(tempLWCodeModB)
# change local read policy from wider local read to one unit of K at a time
kl.append(self.comment("Recalc local read offsets"))
kl.append(self.recalcLocalReadAddressesAB(kernel))
if self.enable["Wait"]:
# TODO: need to check if we correctly checked-in the temp VGPR used for Int8 LocalWrite (uDu, PGR=2)
kl.append(self.wait(kernel, tensorParametersA, tensorParametersB, -1, 0, -1, "5wait for local write"))
if self.enable["Sync"]:
kl.append(self.syncThreads(kernel))
#kl.append(self.dumpLds(kernel, 0, 8))
# tail: re-init local read addresses
if kernel["PrefetchGlobalRead"]:
kl.append(self.comment("local read reset offsets a"))
kl.append(self.localReadResetOffsets(kernel, tensorParametersA))
kl.append(self.comment("local read reset offsets b"))
kl.append(self.localReadResetOffsets(kernel, tensorParametersB))
kl.append(self.comment("local read init pointers a"))
kl.append(self.localReadInitPointers(kernel, tensorParametersA))
kl.append(self.comment("local read init pointers b"))
kl.append(self.localReadInitPointers(kernel, tensorParametersB))
# tail: macs
kl.append(self.comment("tail loop: macs"))
kl.append(self.openLoop(kernel, -1, uDu if kernel.enabledSplitLDS else None))
# Try to use InnerUnroll in the tail loop if allowed:
KinInnerUnroll = kernel["InnerUnroll"]
if kernel["EnableMatrixInstruction"]:
KinInnerUnroll *= kernel["MatrixInstK"]
tailLoopInnerUnroll = 1
if (kernel["AssertSummationElementMultiple"] % KinInnerUnroll == 0):
tailLoopInnerUnroll = kernel["InnerUnroll"]
elif (kernel["LocalDotLayout"] > 1) and (kernel["InnerUnroll"] == kernel["LocalDotLayout"]):
tailLoopInnerUnroll = kernel["InnerUnroll"]
# need to unroll tail loop for the following cases
mEnd = 1
if kernel["DirectToVgprA"] or kernel["DirectToVgprB"]:
mEnd = kernel["DepthU"]//KinInnerUnroll
# need to cover different local read inc values for the following DirectToLds case
elif kernel["DirectToLds"] and kernel["EnableMatrixInstruction"] and kernel["InnerUnroll"] == 1 and\
(kernel["GlobalLoadVectorWidthA"] * self.bpeAB > 4 or kernel["GlobalLoadVectorWidthB"] * self.bpeAB > 4
or kernel["ThreadSeparateGlobalReadA"] or kernel["ThreadSeparateGlobalReadB"]) and \
kernel["DepthU"] // kernel["MatrixInstK"] > 2:
mEnd = kernel["DepthU"] // (kernel["MatrixInstK"] * 2)
for mValue in range(mEnd):
pack[0] = Code.Module()
for iui in range(0, tailLoopInnerUnroll):
if self.enable["LocalRead"]:
doReadA = not kernel["DirectToVgprA"]
doReadB = not kernel["DirectToVgprB"]
if doReadA:
# Reading 16-bit data from LDS requires packing when ECC enabled
kl.append(self.comment("local read a"))
localReadCodeA, packCodeA = self.localReadDo(kernel, 0, iui, 0, tensorParametersA)
kl.append(localReadCodeA)
pack[0].addCode(packCodeA)
if doReadB:
kl.append(self.comment("local read b"))
localReadCodeB, packCodeB = self.localReadDo(kernel, 0, iui, 0, tensorParametersB)
kl.append(localReadCodeB)
pack[0].addCode(packCodeB)
# adjustment for DirectToLds case
iuiParam = iui + tailLoopInnerUnroll * mValue
if doReadA:
kl.append(self.comment("local read inc a"))
kl.append(self.localReadInc(kernel, iuiParam, tensorParametersA))
if doReadB:
kl.append(self.comment("local read inc b"))
kl.append(self.localReadInc(kernel, iuiParam, tensorParametersB))
if self.enable["Wait"]:
kl.append(self.wait(kernel, tensorParametersA, tensorParametersB, -1, -1, 0, "4wait for local read"))
if kernel["EnableMatrixInstruction"]:
kl.append(pack[0])
# vgpr.checkin for all the checked-out vgpr in LocalRead
for item in list(pack[0].items()):
if item.tempVgpr != None:
self.vgprPool.checkIn(item.tempVgpr)
item.tempVgpr = None
pack[0] = Code.Module()
if self.enable["MAC"]:
if kernel["EnableMatrixInstruction"]:
# DirectToVgpr is not applicable for tail loop
vregSetIdxMFMA = 0
kl.append(self.mfmaIter(kernel, 0, tailLoopInnerUnroll, vregSetIdxMFMA, False, True))
else:
kl.append(self.macIter(kernel, 0, tailLoopInnerUnroll, True, True))
finalLoop = mValue == mEnd - 1
kl.append(self.closeLoop(kernel, -1, finalLoop, loopCopies, uDu if kernel.enabledSplitLDS else None))
# always emit the skip-tail-loop label
kl.append(self.closeLoop(kernel, -1, None, loopCopies, emitEndLabelOnly=True))
# tail: close
self.inTailLoop = False
# extra summation loops: global increment and close
for i in reversed(range(self.otherSummationLoops)):
kl.append(self.comment("global read inc AB"))
kl.append(self.globalReadIncrementAB(kernel, i, 0))
kl.append(self.closeLoop(kernel, i, True, loopCopies))
if self.prefetchAcrossPersistent and kernel["PrefetchAcrossPersistentMode"] != 1:
kl.append(str(self.openPrefetchAcrossPersistent(kernel, isOptNLL=False)))
kl += self.setupNewTile(kernel, self.tPA, self.tPB, isPap=True, isOptNLL=False)
kl.append(str(self.closePrefetchAcrossPersistent(kernel, isOptNLL=False)))
kl.append(self.endSummation(kernel))
if self.enable["PostLoop"]:
if not self.doShadowInit:
kl.append(self.globalWriteWorkGroupInit(kernel))
####################################
# Shift Vector Components
####################################
if kernel["EdgeType"] == "ShiftPtr":
# GuaranteeNoPartial means each component in the vector loads is always valid. In this case we
# don't need the unshift code
# shift vector components d0
if not kernel["GuaranteeNoPartialA"] and self.readTileDimVectorA:
kl.append(self.comment("shift vector components d0"))
kl.append(self.shiftVectorComponents(kernel, tensorParametersA))
# shift vector components d1; for the MFMA version, B never enters this path
if not kernel["GuaranteeNoPartialB"] and self.readTileDimVectorB:
kl.append(self.comment("shift vector components d1"))
kl.append(self.shiftVectorComponents(kernel, tensorParametersB))
# complex declare tmp registers
kl.append(self.complexDeclareTmpRegisters(kernel))
####################################
# LocalSplitU reduction
####################################
#if kernel["NumThreads"]%kernel["MacroTile0"] == 0:
if kernel["LocalSplitU"] > 1:
kl.append(self.comment3("LocalSplitU Reduction"))
if self.enable["Sync"]:
kl.append(self.syncThreads(kernel))
# LocalSplitU: local write
kl.append(self.comment("LocalSplitU: local write"))
kl.append(self.localSplitULocalWrite(kernel))
# LocalSplitU: local read
kl.append(self.comment("LocalSplitU: local read"))
kl.append(self.localSplitULocalRead(kernel))
# LocalSplitU: reduction
kl.append(self.comment("LocalSplitU: reduction"))
kl.append(self.localSplitUReduction(kernel))
# LocalSplitU: global write indices
kl.append(self.comment("LocalSplitU: global write indices"))
kl.append(self.localSplitUGlobalWriteIndices(kernel))
# LocalSplitU: global write
kl.append(self.comment("LocalSplitU: global write"))
kl.append(self.localSplitUGlobalWrite(kernel))
else:
####################################
# NOT LocalSplitU
####################################
# global write indices
kl.append(self.comment("not-LocalSplitU: global write indices"))
kl.append(self.notLocalSplitUGlobalWriteIndices(kernel))
# global write
kl.append(self.comment("not-LocalSplitU: global write"))
kl.append(self.notLocalSplitUGlobalWrite(kernel))
# After we know the number of global-write instructions, we can go back and replace the pre-loop LW vmcnt.
# Note that currently this code replacement occurs only when PrefetchAcrossPersistent=True;
# otherwise nothing is changed
if self.preLoopLocalWriteCode != None:
self.replacePreLoopLWVmcnt(kernel)
# function suffix
kl.append(self.functionEnd(kernel, True))
kl.append(self.functionSuffix(kernel))
kl.append(self.closeString(kernel))
kStr = '\n'.join([str(x) for x in kl])
afterFunctionSignature = kStr
error = self.overflowedResources
# function signature last since it needs to know how many gprs were actually used
kStr = beforeFunctionSignature + self.functionSignature(kernel) + afterFunctionSignature
return (error,kStr)
##############################################################################
#
# Functions to Write Kernel Segments
#
##############################################################################
def comment1(self, text):
"""
single line comment
"""
s = ""
s += self.indent
s += self.commentPrefix
s += " %s " % text
s += self.commentSuffix
s += self.endLine
return s
def comment(self, text):
"""
comment with prior newline
"""
s = ""
s += self.endLine
s += self.comment1(text)
return s
def comment3(self, text):
"""
3-line comment
"""
s = ""
s += self.endLine
s += self.indent
s += self.commentPrefix
s += self.commentHR
s += self.commentSuffix
s += self.endLine
for line in text.split("\n"):
s += self.indent
s += self.commentPrefix
s += " %-38s " % line
s += self.commentSuffix
s += self.endLine
s += self.indent
s += self.commentPrefix
s += self.commentHR
s += self.commentSuffix
s += self.endLine
return s
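# Illustrative output of comment3("Tail Loop"), assuming commentPrefix="/*", commentSuffix="*/" and
# commentHR set to a row of '*' characters (the actual values are defined by the subclasses):
#   /*******************************************/
#   /* Tail Loop                                */
#   /*******************************************/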
##############################################################################
# Init Kernel
##############################################################################
@abc.abstractmethod
def initKernel(self, kernel, tensorParametersA, tensorParametersB ):
self.staggerU = kernel["StaggerU"] and (kernel["KernelLanguage"]=="Source" or kernel["BufferLoad"])
self.tPA = tensorParametersA
self.tPB = tensorParametersB
# Only assembly supports scheduling
self.canSchedule = (kernel["KernelLanguage"] == "Assembly")
if self.canSchedule:
self.scheduleGlobalRead = kernel["ScheduleGlobalRead"] \
and kernel["PrefetchGlobalRead"]
else:
self.scheduleGlobalRead = 0
if self.canSchedule:
self.scheduleLocalWrite = kernel["ScheduleLocalWrite"] \
and kernel["PrefetchGlobalRead"]
else:
self.scheduleLocalWrite = 0
if self.canSchedule:
self.scheduleIterAlg = kernel["ScheduleIterAlg"]
else:
self.scheduleIterAlg = 0
self.prefetchAcrossPersistent = kernel["PrefetchAcrossPersistent"]
self.storeCInUnroll = kernel["StoreCInUnroll"]
self.noTailLoop = kernel["NoTailLoop"]
self.actualSummationLoops = 1 if kernel["PackSummationDims"] else kernel["ProblemType"]["NumIndicesSummation"]
self.otherSummationLoops = self.actualSummationLoops-1
self.otherSummations = kernel["ProblemType"]["NumIndicesSummation"]-1 # not loops but summation vars
# If 0, the unroll loop counter is decremented by 1 each iteration and scaled by DEPTHU when the number
# of summation elements is required.
# If 1, the unroll loop counter starts at 0 and increments by DEPTHU; no scaling is required. This mode is
# required for packed summation dims, but can also be used independently, which is useful for isolation and testing.
self.unrollIncIsDepthU = kernel["UnrollIncIsDepthU"] or kernel["PackSummationDims"] \
or bool(kernel["ProblemType"]["ZeroPadA"]) or bool(kernel["ProblemType"]["ZeroPadB"])
# turn on parts of prefetchAcrossPersistent code for testing
self.prefetchAcrossPersistent0 = 0 or self.prefetchAcrossPersistent
self.canOptimizePreLoopLWVmcnt = kernel["OptPreLoopVmcnt"]
self.enable = {}
dkp = kernel["DisableKernelPieces"]
# These can be locally overridden by changing True to False, or
# use DisableKernelPieces for a quick search (see Common.py)
self.enable["PreLoop"] = True and not (dkp>0 and dkp >= 7) and not dkp == -7
self.enable["GlobalRead"] = True and not (dkp>0 and dkp >= 2) and not dkp == -2
self.enable["GlobalReadInc"] = True and not (dkp>0 and dkp >= 7) and not dkp == -7
self.enable["LocalWrite"] = True and not (dkp>0 and dkp >= 3) and not dkp == -3
self.enable["LocalRead"] = True and not (dkp>0 and dkp >= 4) and not dkp == -4
self.enable["Wait"] = True and not (dkp>0 and dkp >= 5) and not dkp == -5
self.enable["Sync"] = True and not (dkp>0 and dkp >= 5) and not dkp == -5
self.enable["MAC"] = True and not (dkp>0 and dkp >= 6) and not dkp == -6
self.enable["PostLoop"] = True and not (dkp>0 and dkp >= 1) and not dkp == -1
#if dkp:
# print "\nKernelWriter enable:", self.enable
if kernel["KernelLanguage"] == "Source":
self.language = globalParameters["RuntimeLanguage"]
else:
self.language = "ASM"
self.indexChars = []
for i in range(0, len(globalParameters["IndexChars"])):
self.indexChars.append(globalParameters["IndexChars"][i])
self.indexChars[kernel["ProblemType"]["Index0"]] \
= "0" + self.indexChars[kernel["ProblemType"]["Index0"]]
self.indexChars[kernel["ProblemType"]["Index1"]] \
= "1" + self.indexChars[kernel["ProblemType"]["Index1"]]
self.unrollIdx = kernel["ProblemType"]["NumIndicesSummation"]-1
self.unrollChar = \
self.indexChars[kernel["ProblemType"]["IndicesSummation"][\
self.unrollIdx]]
self.tileChar0 = self.indexChars[kernel["ProblemType"]["Index0"]]
self.tileChar1 = self.indexChars[kernel["ProblemType"]["Index1"]]
self.tileCharA = self.tileChar0 if (kernel["ProblemType"]["Tensor0"]==0) \
else self.tileChar1
self.tileCharB = self.tileChar0 if (kernel["ProblemType"]["Tensor0"]==1) \
else self.tileChar1
"""
if kernel["ProblemType"]["Tensor0"]==0:
kernel["ThreadTileA"] = kernel["ThreadTile0"]
kernel["ThreadTileB"] = kernel["ThreadTile1"]
kernel["SubGroupA"] = kernel["SubGroup0"]
kernel["SubGroupB"] = kernel["SubGroup1"]
kernel["MacroTileA"] = kernel["MacroTile0"]
kernel["MacroTileB"] = kernel["MacroTile1"]
else:
kernel["ThreadTileB"] = kernel["ThreadTile0"]
kernel["ThreadTileA"] = kernel["ThreadTile1"]
kernel["SubGroupB"] = kernel["SubGroup0"]
kernel["SubGroupA"] = kernel["SubGroup1"]
kernel["MacroTileB"] = kernel["MacroTile0"]
kernel["MacroTileA"] = kernel["MacroTile1"]
"""
########################################
# derive global-read-coalesce-group from local in config
"""
if kernel["ProblemType"]["TLUA"]:
self.globalReadCoalesceGroupA = kernel["LocalWriteCoalesceGroupA"]
else:
self.globalReadCoalesceGroupA = not kernel["LocalWriteCoalesceGroupA"]
if kernel["ProblemType"]["TLUB"]:
self.globalReadCoalesceGroupB = kernel["LocalWriteCoalesceGroupB"]
else:
self.globalReadCoalesceGroupB = not kernel["LocalWriteCoalesceGroupB"]
"""
self.globalReadCoalesceGroupA = kernel["GlobalReadCoalesceGroupA"]
self.globalReadCoalesceGroupB = kernel["GlobalReadCoalesceGroupB"]
"""
# original parameters
NumLoadsCoalesced -> NumLoadsPerpendicular
# new intermediate parameters
numReadsTile # nrt
numReadsUnroll # nru
numReadsTileVecComp # nrvt
numReadsUnrollVecComp # nrvu
numWritesCoal # nwc
numWritesPerp # nwp
numWritesCoalVecComp # nwvc
numWritesPerpVecComp # nwvp
readTileComponents (based on grcv)
readTileVector
"""
# TODO load sub-vector
vwa = kernel["GlobalLoadVectorWidthA"]
vwb = kernel["GlobalLoadVectorWidthB"]
# allow LocalReadVectorWidthB for TLUB + MatrixInstruction
self.allowLRVWBforTLUandMI = kernel["allowLRVWBforTLUandMI"]
self.numItersPLR = kernel["PrefetchLocalRead"]%kernel["LoopIters"]
self.numVgprBuffer = kernel["LoopIters"] if kernel["PrefetchLocalRead"] > kernel["LoopIters"] else kernel["PrefetchLocalRead"]
# merge N iterations' reads into 1 iteration if the reads cannot be coalesced
# e.g., A can coalesce its read, B cannot
# MergeRead 0: ds_readAx1 ds_readBx1 mfma | ds_readAx1 ds_readBx1 mfma | => ds_readAx2 ds_readBx1 mfma | ds_readBx1 mfma |
# MergeRead 1: ds_readAx1 ds_readBx1 mfma | ds_readAx1 ds_readAx1 mfma | => ds_readAx2 ds_readBx1 ds_readBx1 mfma | mfma |
MergeRead = 0
if not kernel["ProblemType"]["TLUA"] or MergeRead or self.allowLRVWBforTLUandMI:
if kernel["DirectToVgprA"]:
# DirectToVgprA case, ignore LocalReadVectorWidth and use GlobalLoadVectorWidth instead.
self.lrvwA = vwa
else:
self.lrvwA = kernel["LocalReadVectorWidth"]
else:
if kernel["EnableMatrixInstruction"]:
self.lrvwA = kernel["MIInputPerThread"]
else:
self.lrvwA = 1
if not kernel["ProblemType"]["TLUB"] or MergeRead or self.allowLRVWBforTLUandMI:
if kernel["DirectToVgprB"]:
# DirectToVgprB case, ignore LocalReadVectorWidth and use GlobalLoadVectorWidth instead.
self.lrvwB = vwb
else:
self.lrvwB = kernel["LocalReadVectorWidth"]
else:
if kernel["EnableMatrixInstruction"]:
self.lrvwB = kernel["MIInputPerThread"]
else:
self.lrvwB = 1
# DirectToVgprB + VW > 1 case: set lrvwB = VW
# In the DirectToVgprB case, global load data goes directly to Vgpr.
# If VW=2, then lrvwB is 2.
if kernel["DirectToVgprB"] and kernel["VectorWidth"] > 1:
self.lrvwB = kernel["VectorWidth"]
# DirectToVgpr + TLU=False case
# set lrvw = VW
self.vgprValuDouble = False
#if kernel["DirectToVgprA"] and kernel["PrefetchLocalRead"] > 1 and (not kernel["ProblemType"]["TLUA"]) and kernel["VectorWidth"] > 1:
if kernel["DirectToVgprA"] and (not kernel["ProblemType"]["TLUA"]) and (not kernel["ProblemType"]["TLUB"]) or \
kernel["DirectToVgprB"] and (not kernel["ProblemType"]["TLUB"]) and (not kernel["ProblemType"]["TLUA"]):
self.lrvwA = max(self.lrvwA, self.lrvwB)
self.lrvwB = self.lrvwA
if kernel["DepthU"] // kernel["MatrixInstK"] <= 2 and self.lrvwA > 1:
# need to double vgprValu to avoid local reads overwriting vgprValu registers
self.vgprValuDouble = True
# Wider LocalRead
if kernel["EnableMatrixInstruction"]:
self.numReadsIterCoalescedA = self.lrvwA // kernel["MIInputPerThread"]
self.numReadsIterCoalescedB = self.lrvwB // kernel["MIInputPerThread"]
if self.allowLRVWBforTLUandMI:
if kernel["ProblemType"]["TLUA"]:
self.numReadsIterCoalescedA = 1
self.numReadsIterCoalescedB = 1
else:
self.numReadsIterCoalescedA = 1
self.numReadsIterCoalescedB = 1
self.numIterPerCoalescedReadA = max(1,self.numReadsIterCoalescedA//kernel["InnerUnroll"])
self.numIterPerCoalescedReadB = max(1,self.numReadsIterCoalescedB//kernel["InnerUnroll"])
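# e.g. lrvwA=4 with MIInputPerThread=2 and InnerUnroll=1 gives numReadsIterCoalescedA=2 and
# numIterPerCoalescedReadA=2 (one wider local read covers two unroll iterations)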
if kernel["ScheduleIterAlg"] == 3 or kernel["ScheduleIterAlg"] == 2:
self.numMfmaPerIter = kernel["MIWaveTile"][0] * kernel["MIWaveTile"][1] * kernel["InnerUnroll"]
if kernel["ProblemType"]["DataType"].isComplex(): self.numMfmaPerIter *= 4
########################################
# read vectors or vector components
########################################
if kernel["ProblemType"]["TLUA"]: # NT no transpose
self.numReadsTileA = kernel["NumLoadsCoalescedA"]
self.numReadsUnrollA = kernel["NumLoadsPerpendicularA"]
if kernel["GlobalReadCoalesceVectorA"]: # read vectors
self.readTileDimComponentsA = False # Vector
self.readTileDimVectorA = True # Vector
self.readUnrollDimComponentsA = False # Scalar
self.readUnrollDimVectorA = False # Scalar
self.numReadsTileVecCompA = vwa
self.numReadsUnrollVecCompA = 1
else: # read components, write components
self.readTileDimComponentsA = False # Scalar
self.readTileDimVectorA = False # Scalar
self.readUnrollDimComponentsA = kernel["VectorWidth"] > 1 # Components
self.readUnrollDimVectorA = False # Components
self.numReadsTileVecCompA = 1
self.numReadsUnrollVecCompA = vwa
else: # TN yes transpose
self.numReadsTileA = kernel["NumLoadsPerpendicularA"]
self.numReadsUnrollA = kernel["NumLoadsCoalescedA"]
if kernel["GlobalReadCoalesceVectorA"]: # read vector
self.readTileDimComponentsA = False # Scalar
self.readTileDimVectorA = False # Scalar
self.readUnrollDimComponentsA = False # Vector
self.readUnrollDimVectorA = True # Vector
self.numReadsUnrollVecCompA = vwa
self.numReadsTileVecCompA = 1
else: # read components, write vectors
self.readTileDimComponentsA = kernel["VectorWidth"] > 1 # Components
self.readTileDimVectorA = False # Components
self.readUnrollDimComponentsA = False # Scalar
self.readUnrollDimVectorA = False # Scalar
# NEW
self.numReadsUnrollVecCompA = 1
self.numReadsTileVecCompA = vwa
########################################
# write vectors or vector components
########################################
if kernel["ProblemType"]["TLUA"] != kernel["UnrollMajorLDSA"]: # NT no transpose
self.numWritesCoalA = kernel["NumLoadsCoalescedA"]
if kernel["GlobalReadCoalesceVectorA"]: # read vectors, write vectors
self.writeUnrollDimComponentsA = False # Scalar
if kernel["LocalDotLayout"]>1:
self.writeTileDimComponentsA = kernel["GlobalReadVectorWidth"] > 1 # Components
writeCoal = False
else:
self.writeTileDimComponentsA = False # Vector
writeCoal = True
else: # read components, write components
self.writeTileDimComponentsA = False # Scalar
self.writeUnrollDimComponentsA = kernel["GlobalReadVectorWidth"] > 1 # Components
writeCoal = False
else: # TN yes transpose
self.numWritesCoalA = kernel["NumLoadsPerpendicularA"]
if kernel["GlobalReadCoalesceVectorA"]: # read vector, write components
self.writeUnrollDimComponentsA = False # Scalar
if kernel["LocalDotLayout"]>1:
self.writeTileDimComponentsA = kernel["GlobalReadVectorWidth"] > 1 # Components
# LDS writes with LDL>1 will never be coalesced
writeCoal = False
else:
self.writeTileDimComponentsA = kernel["GlobalReadVectorWidth"] > 1 # Components
writeCoal = False
else: # read components, write vectors
self.writeTileDimComponentsA = False # Vector
self.writeUnrollDimComponentsA = False # Scalar
writeCoal = True
# writeCoal indicates writes should be done in the coal dim
# else in perp
if writeCoal:
self.numWritesCoalVecCompA = vwa // kernel["DepthULdsDivisor"]
self.numWritesPerpVecCompA = 1
else:
self.numWritesCoalVecCompA = 1
self.numWritesPerpVecCompA = vwa
del writeCoal
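# e.g. coalesced writes with vwa=4 and DepthULdsDivisor=2 give numWritesCoalVecCompA=2,
# numWritesPerpVecCompA=1; non-coalesced writes give numWritesCoalVecCompA=1, numWritesPerpVecCompA=4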
self.numReadVectorComponentsA = kernel["GlobalLoadVectorWidthA"] \
if (self.readTileDimComponentsA \
or self.readUnrollDimComponentsA) else 1
# self.numWriteVectorComponentsA = kernel["GlobalLoadVectorWidthA"] \
# if (self.writeTileDimComponentsA \
# or self.writeUnrollDimComponentsA) else 1
# self.numReadTileVectorComponentsA = kernel["GlobalLoadVectorWidthA"] \
# if self.readTileDimComponentsA else 1 # for branches
# convert tile/unroll to para/perp
if kernel["ProblemType"]["TLUA"]:
self.numReadsCoalVecCompA = self.numReadsTileVecCompA
self.numReadsPerpVecCompA = self.numReadsUnrollVecCompA
# for asm
self.readCoalescedComponentsA = self.readTileDimComponentsA
# self.readCoalescedVectorA = self.readTileDimVectorA # Not Used
self.readPerpendicularComponentsA = self.readUnrollDimComponentsA
# self.readPerpendicularVectorA = self.readUnrollDimVectorA # Not Used
else:
self.numReadsCoalVecCompA = self.numReadsUnrollVecCompA
self.numReadsPerpVecCompA = self.numReadsTileVecCompA
# for asm
self.readCoalescedComponentsA = self.readUnrollDimComponentsA
# self.readCoalescedVectorA = self.readUnrollDimVectorA # Not Used
self.readPerpendicularComponentsA = self.readTileDimComponentsA
# self.readPerpendicularVectorA = self.readTileDimVectorA # Not Used
####################################
# read vectors or vector components b
####################################
if kernel["ProblemType"]["TLUB"]: # NT no transpose
self.numReadsTileB = kernel["NumLoadsCoalescedB"]
self.numReadsUnrollB = kernel["NumLoadsPerpendicularB"]
if kernel["GlobalReadCoalesceVectorB"]:
self.readTileDimComponentsB = False # Vector
self.readTileDimVectorB = True # Vector
self.readUnrollDimComponentsB = False # Scalar
self.readUnrollDimVectorB = False # Scalar
self.numReadsTileVecCompB = vwb
self.numReadsUnrollVecCompB = 1
else:
self.readTileDimComponentsB = False # Scalar
self.readTileDimVectorB = False # Scalar
self.readUnrollDimComponentsB = kernel["VectorWidth"] > 1 # Components
self.readUnrollDimVectorB = False # Components
# NEW
self.numReadsTileVecCompB = 1
self.numReadsUnrollVecCompB = vwb
else: # TN yes transpose
self.numReadsTileB = kernel["NumLoadsPerpendicularB"]
self.numReadsUnrollB = kernel["NumLoadsCoalescedB"]
if kernel["GlobalReadCoalesceVectorB"]:
self.readTileDimComponentsB = False # Scalar
self.readTileDimVectorB = False # Scalar
self.readUnrollDimComponentsB = False # Vector
self.readUnrollDimVectorB = True # Vector
self.numReadsUnrollVecCompB = vwb
self.numReadsTileVecCompB = 1
else:
self.readTileDimComponentsB = kernel["VectorWidth"] > 1 # Components
self.readTileDimVectorB = False # Components
self.readUnrollDimComponentsB = False # Scalar
self.readUnrollDimVectorB = False # Scalar
# NEW
self.numReadsUnrollVecCompB = 1
self.numReadsTileVecCompB = vwb
####################################
# write vectors or vector components b
####################################
if kernel["ProblemType"]["TLUB"] != kernel["UnrollMajorLDSB"]: # NT no transpose
self.numWritesCoalB = kernel["NumLoadsCoalescedB"]
if kernel["GlobalReadCoalesceVectorB"]:
self.writeUnrollDimComponentsB = False # Vector
if kernel["LocalDotLayout"]>1:
self.writeTileDimComponentsB = kernel["GlobalReadVectorWidth"] > 1 # Components
writeCoal = False
else:
self.writeTileDimComponentsB = False # Vector
writeCoal = True
else:
self.writeTileDimComponentsB = False # Scalar
self.writeUnrollDimComponentsB = kernel["GlobalReadVectorWidth"] > 1 # Components
# NEW
self.numWritesCoalVecCompB = 1
self.numWritesPerpVecCompB = vwb
else: # TN yes transpose
self.numWritesCoalB = kernel["NumLoadsPerpendicularB"]
if kernel["GlobalReadCoalesceVectorB"]:
self.writeUnrollDimComponentsB = False
if kernel["LocalDotLayout"]>1:
self.writeTileDimComponentsB = kernel["GlobalReadVectorWidth"] > 1 # Components
# LDS writes with LDL>1 will never be coalesced
writeCoal = False
else:
self.writeTileDimComponentsB = kernel["GlobalReadVectorWidth"] > 1 # Components
writeCoal = False
else:
self.writeTileDimComponentsB = False # Vector
self.writeUnrollDimComponentsB = False # Scalar
# NEW
self.numWritesCoalVecCompB = vwb
self.numWritesPerpVecCompB = 1
# writeCoal indicates writes should be done in the coal dim
# else in perp
if writeCoal:
self.numWritesCoalVecCompB = vwb // kernel["DepthULdsDivisor"]
self.numWritesPerpVecCompB = 1
else:
self.numWritesCoalVecCompB = 1
self.numWritesPerpVecCompB = vwb
del writeCoal
# numReadVectorComponentsB refers to global reads
self.numReadVectorComponentsB = kernel["GlobalLoadVectorWidthB"] \
if (self.readTileDimComponentsB \
or self.readUnrollDimComponentsB) else 1
# self.numWriteVectorComponentsB = kernel["GlobalLoadVectorWidthB"] \
# if (self.writeTileDimComponentsB \
# or self.writeUnrollDimComponentsB) else 1
# self.numReadTileVectorComponentsB = kernel["GlobalLoadVectorWidthB"] \
# if self.readTileDimComponentsB else 1 # for branches
# convert tile/unroll to para/perp
if kernel["ProblemType"]["TLUB"]:
self.numReadsCoalVecCompB = self.numReadsTileVecCompB
self.numReadsPerpVecCompB = self.numReadsUnrollVecCompB
# for asm
self.readCoalescedComponentsB = self.readTileDimComponentsB
# self.readCoalescedVectorB = self.readTileDimVectorB # Not Used
self.readPerpendicularComponentsB = self.readUnrollDimComponentsB
# self.readPerpendicularVectorB = self.readUnrollDimVectorB # Not Used
else:
self.numReadsCoalVecCompB = self.numReadsUnrollVecCompB
self.numReadsPerpVecCompB = self.numReadsTileVecCompB
# for asm
self.readCoalescedComponentsB = self.readUnrollDimComponentsB
# self.readCoalescedVectorB = self.readUnrollDimVectorB # Not Used
self.readPerpendicularComponentsB = self.readTileDimComponentsB
# self.readPerpendicularVectorB = self.readTileDimVectorB # Not Used
####################################
# load sizes
"""
if kernel["ProblemType"]["TLUA"]:
kernel["LSCA"] = kernel["MacroTileA"] \
/ kernel["NumLoadsCoalescedA"]
kernel["LSPA"] = kernel["DepthU"] / kernel["NumLoadsPerpendicularA"]
else:
kernel["LSCA"] = kernel["DepthU"] / kernel["NumLoadsCoalescedA"]
kernel["LSPA"] = kernel["MacroTileA"] \
/ kernel["NumLoadsPerpendicularA"]
if kernel["ProblemType"]["TLUB"]:
kernel["LSCB"] = kernel["MacroTileB"] \
/ kernel["NumLoadsCoalescedB"]
kernel["LSPB"] = kernel["DepthU"] / kernel["NumLoadsPerpendicularB"]
else:
kernel["LSCB"] = kernel["DepthU"] / kernel["NumLoadsCoalescedB"]
kernel["LSPB"] = kernel["MacroTileB"] \
/ kernel["NumLoadsPerpendicularB"]
kernel["LVCA"] = kernel["LSCA"] / kernel["GlobalLoadVectorWidthA"]
kernel["LVCB"] = kernel["LSCB"] / kernel["GlobalLoadVectorWidthB"]
kernel["LVPA"] = kernel["LSPA"] / kernel["GlobalLoadVectorWidthA"]
kernel["LVPB"] = kernel["LSPB"] / kernel["GlobalLoadVectorWidthB"]
"""
self.getTensorParameters(tensorParametersA, kernel, True)
self.getTensorParameters(tensorParametersB, kernel, False)
tensorParametersA["PackBatchDims"] = kernel["PackBatchDims"] if kernel["PackBatchDims"] & 0x1 else 0
tensorParametersB["PackBatchDims"] = kernel["PackBatchDims"] if kernel["PackBatchDims"] & 0x2 else 0
tensorParametersA["PackedIndices"] = kernel["PackedC%uIndicesX"%self.tPA["tile01Idx"]]
tensorParametersB["PackedIndices"] = kernel["PackedC%uIndicesX"%self.tPB["tile01Idx"]]
# condition(s) to enable init accvgpr opt (initialize only the last set of accvgpr instead of whole accvgpr)
self.useInitAccVgprOpt = False
# enable for the following conditions
if kernel["StoreCInUnroll"]:
self.useInitAccVgprOpt = True
if (kernel["DirectToVgprA"] or kernel["DirectToVgprB"]):
self.useInitAccVgprOpt = True
# force to disable for the following conditions
if self.useInitAccVgprOpt:
if kernel["PrefetchGlobalRead"] == 2:
# PGR=2 case: K > DepthU * 2 is necessary (if not noTailLoop, need > DepthU * 3),
# i.e. kernel["AssertSizeGreaterThan"][3] >= DepthU * 2 (or 3)
minDUnum = 2 if self.noTailLoop else 3
if not (3 in kernel["AssertSizeGreaterThan"].keys() and kernel["AssertSizeGreaterThan"][3] >= kernel["DepthU"] * minDUnum):
print2("InitAccVgprOpt is disabled because AssertSizeGreaterThan for K is not greater than DepthU * %u"%minDUnum)
self.useInitAccVgprOpt = False
if kernel["PrefetchGlobalRead"] == 1:
# PGR=1 case: K > DepthU * 1 is necessary (if not noTailLoop, need > DepthU * 2),
# i.e. kernel["AssertSizeGreaterThan"][3] >= DepthU * 1 (or 2)
minDUnum = 1 if self.noTailLoop else 2
if not (3 in kernel["AssertSizeGreaterThan"].keys() and kernel["AssertSizeGreaterThan"][3] >= kernel["DepthU"] * minDUnum):
print2("InitAccVgprOpt is disabled because AssertSizeGreaterThan for K is not greater than DepthU * %u"%minDUnum)
self.useInitAccVgprOpt = False
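# e.g. with PrefetchGlobalRead=2, DepthU=32 and a tail loop (noTailLoop=False), InitAccVgprOpt stays
# enabled only if the size-K assertion kernel["AssertSizeGreaterThan"][3] is >= 32 * 3 = 96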
# condition(s) to enable singleNLL opt
self.enableSingleNLLOpt = False
if self.noTailLoop and not (self.prefetchAcrossPersistent and kernel["PrefetchAcrossPersistentMode"] == 0):
if kernel["StoreCInUnroll"]:
self.enableSingleNLLOpt = True
# so far, not enabled for DirectToVgpr
# Performance is better with Tensile, but not with HPL
#if kernel["DirectToVgprA"] or kernel["DirectToVgprB"]:
# self.enableSingleNLLOpt = True
@staticmethod
def zpForSumIdx(sumIdx, zeroPad):
""" Returns zero-pad for specified sumIdx if it matches or None if not """
return next((zpi for zpi in zeroPad if zpi[1] == sumIdx), None)
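# Illustrative usage (the layout of a zero-pad entry beyond zpi[1]==sumIdx is an assumption here):
#   zpForSumIdx(1, [(0, 1, 8, 8)]) -> (0, 1, 8, 8)
#   zpForSumIdx(2, [(0, 1, 8, 8)]) -> None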
##############################################################################
# Open String
##############################################################################
@abc.abstractmethod
def openString(self, kernel):
return ""
##############################################################################
# Close String
##############################################################################
@abc.abstractmethod
def closeString(self, kernel):
return ""
##############################################################################
# Function Prefix
##############################################################################
@abc.abstractmethod
def functionPrefix(self, kernel):
return ""
##############################################################################
# Function Signature Prefix
##############################################################################
@abc.abstractmethod
def functionSignaturePrefix(self, kernel):
return ""
##############################################################################
# Function Signature
##############################################################################
@abc.abstractmethod
def functionSignature(self, kernel ):
return ""
##############################################################################
# Function Signature Suffix
##############################################################################
@abc.abstractmethod
def functionSignatureSuffix(self, kernel):
return ""
##############################################################################
# Function Begin
##############################################################################
@abc.abstractmethod
def functionBegin(self, kernel):
return ""
##############################################################################
# Allocate Resources
##############################################################################
@abc.abstractmethod
def allocateResources(self, kernel):
return ""
##############################################################################
# Open Persistent Loop
# init iteration counter, define loop target
##############################################################################
@abc.abstractmethod
def openPersistentLoop(self, kernel):
return ""
##############################################################################
# Global Read Addresses: Work-Group
##############################################################################
@abc.abstractmethod
def graWorkGroup(self, kernel, isPap):
return ""
##############################################################################
# Get Params For Tensor A/B
##############################################################################
def getTensorParameters(self, tP, kernel, tA):
tP["mirror"] = bool(kernel["ProblemType"]["MirrorDims%s" % ("A" if tA else "B")])
if tA: # A
tP["isA"] = True # is this tensor A
tP["isB"] = False # is this tensor B
tP["bpe"] = int(4*kernel["ProblemType"]["DataType"].numRegisters())
tP["tensorChar"] = "A" # tensor character A/B
tP["tensorIdx"] = 0 # tensor index A=0, B=1
tP["tileChar"] = self.tileCharA # tile char I0 or J1
tP["tileIdx"] = kernel["ProblemType"]["Index01A"] # is the tile dimension of A the 0th or 1th index, i.e. Aki, tileIdx=0
tP["tile01Idx"] = 1 if tP["tileIdx"] else 0
tP["lsc"] = "LSCA" # load size coalesced A, number of elements that get loaded along coalesced dimension with each load
tP["lsp"] = "LSPA" # load size perpendicular A, number of elements that get loaded along non-coalesced dimension with each load
tP["lvc"] = "LVCA" # "load size" in terms of number of short-vectors and not elements
tP["lvp"] = "LVPA" # "load size" in terms of number of short-vectors and not elements
tP["rtv"] = self.readTileDimVectorA # bool in the tile dimension, reads will read vectors
tP["rtc"] = self.readTileDimComponentsA # bool in the tile dimension, reads will read vector components
#tP["ruv"] = self.readUnrollDimVectorA
#tP["nlvc"] = self.numReadVectorComponentsA
#tP["nwvc"] = self.numWriteVectorComponentsA
tP["wg"] = "WorkGroup%u" % (tP["tile01Idx"])# these are storing the actual strong to lookup the number from kernel dictionary
tP["prevWg"] = "PrevWorkGroup0" # used for prefetch-across-persistent #NHWC TO-do
tP["sg"] = "SubGroup%u" % (tP["tile01Idx"])
tP["tt"] = "ThreadTile%u" % (tP["tile01Idx"])
tP["mt"] = "MacroTile%u" % (tP["tile01Idx"])
tP["grcg"] = self.globalReadCoalesceGroupA # global reads are coalesced along threads
tP["grcv"] = kernel["GlobalReadCoalesceVectorA"] # global reads are vector reads, and lds writes will be components if transposing
tP["tlu"] = kernel["ProblemType"]["TLUA"] # thread stride is less than unroll stride, i.e., not transposing matrix
tP["ia"] = kernel["ProblemType"]["IndexAssignmentsA"] # array of index assignments
#tP["nlc"] = kernel["NumLoadsCoalescedA"]
#tP["nlp"] = kernel["NumLoadsPerpendicularA"]
#tP["nlcv"] = self.numReadsCoalVecCompA
tP["nlpv"] = self.numReadsPerpVecCompA # num vector components perpendicular to coalesced; =1 or VW
# NEW
tP["nrt"] = self.numReadsTileA # number of reads along tile dimension
tP["nrtv"] = self.numReadsTileVecCompA # number of vector components along tile dimension; =1 or VW
tP["nru"] = self.numReadsUnrollA # number of reads along unroll dimension
tP["nruv"] = self.numReadsUnrollVecCompA # number of vector components along unroll dimension; =1 or VW
tP["nrc"] = kernel["NumLoadsCoalescedA"] # number of reads along coalesced dimension
tP["nrcv"] = self.numReadsCoalVecCompA # number of vector components along coalesced dimension
tP["nrp"] = kernel["NumLoadsPerpendicularA"] # number of reads along perpendicular dimension
tP["nrpv"] = self.numReadsPerpVecCompA # number of vector components along perpendicular dimension
tP["nwcv"] = self.numWritesCoalVecCompA # number of vector component writes along coalesced dimension
tP["nwpv"] = self.numWritesPerpVecCompA # number of vector component writes along perpendicular dimension
tP["glvw"] = kernel["GlobalLoadVectorWidthA"]
# asm
tP["rcc"] = self.readCoalescedComponentsA # read vector components along coalesced dimensions
# tP["rcv"] = self.readCoalescedVectorA # read vector along coalesced dimension
tP["rpc"] = self.readPerpendicularComponentsA # read vector components along perpendicular dimension
# tP["rpv"] = self.readPerpendicularVectorA # read vector along perpendicular dimension
tP["ruc"] = self.readUnrollDimComponentsA # read vector components along unroll dimension
tP["wtc"] = self.writeTileDimComponentsA # write vector components along tile dimension
tP["wuc"] = self.writeUnrollDimComponentsA # write vector components along unroll dimension
tP["idx"] = kernel["ProblemType"]["Index0"] # index 0 is tile dimension belonging to A. Note 'idx' may not be in tP['ia'].
tP["rc"] = kernel["ProblemType"]["IndexAssignmentsA"][0] \
in [kernel["ProblemType"]["Index01A"], \
kernel["ProblemType"]["IndexUnroll"]] # can read coalesced
tP["NonTemporal"] = kernel["NonTemporalA"] # non-temporal read type
else: # B
tP["isA"] = False
tP["isB"] = True
tP["bpe"] = int(4*kernel["ProblemType"]["DataType"].numRegisters())
tP["tensorChar"] = "B"
tP["tensorIdx"] = 1
tP["tileChar"] = self.tileCharB
tP["tileIdx"] = kernel["ProblemType"]["Index01B"]
tP["tile01Idx"] = 1 if tP["tileIdx"] else 0
tP["lsc"] = "LSCB"
tP["lsp"] = "LSPB"
tP["lvc"] = "LVCB"
tP["lvp"] = "LVPB"
tP["rtv"] = self.readTileDimVectorB
tP["rtc"] = self.readTileDimComponentsB
#tP["ruv"] = self.readUnrollDimVectorB
#tP["nlvc"] = self.numReadVectorComponentsB
#tP["nwvc"] = self.numWriteVectorComponentsB
tP["wg"] = "WorkGroup%u" % (tP["tile01Idx"])
tP["prevWg"] = "PrevWorkGroup1"
tP["sg"] = "SubGroup%u" % (tP["tile01Idx"])
tP["tt"] = "ThreadTile%u" % (tP["tile01Idx"])
tP["mt"] = "MacroTile%u" % (tP["tile01Idx"])
tP["grcg"] = self.globalReadCoalesceGroupB
tP["grcv"] = kernel["GlobalReadCoalesceVectorB"]
tP["tlu"] = kernel["ProblemType"]["TLUB"]
tP["ia"] = kernel["ProblemType"]["IndexAssignmentsB"]
# NEW
tP["nrt"] = self.numReadsTileB
tP["nrtv"] = self.numReadsTileVecCompB
tP["nru"] = self.numReadsUnrollB
tP["nruv"] = self.numReadsUnrollVecCompB
tP["nrc"] = kernel["NumLoadsCoalescedB"]
tP["nrcv"] = self.numReadsCoalVecCompB
tP["nrp"] = kernel["NumLoadsPerpendicularB"]
tP["nrpv"] = self.numReadsPerpVecCompB
tP["nwcv"] = self.numWritesCoalVecCompB
tP["nwpv"] = self.numWritesPerpVecCompB
tP["glvw"] = kernel["GlobalLoadVectorWidthB"]
# asm
tP["rcc"] = self.readCoalescedComponentsB
# tP["rcv"] = self.readCoalescedVectorB
tP["rpc"] = self.readPerpendicularComponentsB
# tP["rpv"] = self.readPerpendicularVectorB
tP["ruc"] = self.readUnrollDimComponentsB
tP["wtc"] = self.writeTileDimComponentsB
tP["wuc"] = self.writeUnrollDimComponentsB
tP["idx"] = kernel["ProblemType"]["Index1"]
tP["rc"] = kernel["ProblemType"]["IndexAssignmentsB"][0] \
in [kernel["ProblemType"]["Index01B"], \
kernel["ProblemType"]["IndexUnroll"]] # can read coalesced
tP["NonTemporal"] = kernel["NonTemporalB"]
##############################################################################
# Global Read Addresses: Tile Assignment A/B
##############################################################################
@abc.abstractmethod
def graTileAssignment(self, kernel, tP):
return ""
##############################################################################
# Global Read Addresses: Unroll Assignment A/B
##############################################################################
@abc.abstractmethod
def graUnrollAssignment(self, kernel, tP):
return ""
##############################################################################
# Global Read Addresses: Other Free Assignments
##############################################################################
@abc.abstractmethod
def graOtherFreeAssignments(self, kernel):
return ""
##############################################################################
# Global Read Addresses: Other Summation Assignments
##############################################################################
@abc.abstractmethod
def graOtherSummationAssignments(self, kernel):
return ""
##############################################################################
# Global Read Addresses: Tile Offsets A/B
##############################################################################
@abc.abstractmethod
def graTileOffsets(self, kernel, tP):
return ""
##############################################################################
# Global Read Addresses: Unroll Offsets A/B
##############################################################################
@abc.abstractmethod
def graUnrollOffsets(self, kernel, tP):
return ""
##############################################################################
# Global Read Addresses: Branch A/B
##############################################################################
@abc.abstractmethod
def graBranch(self, kernel, tP):
return ""
##############################################################################
# Global Read Addresses: Shift A/B
##############################################################################
@abc.abstractmethod
def graShift(self, kernel, tP):
return ""
##############################################################################
# Global Read Addresses: Final Offsets A/B
##############################################################################
@abc.abstractmethod
def graFinalOffsets(self, kernel, tP):
return ""
##############################################################################
# Global Read Addresses: Addresses A/B
##############################################################################
@abc.abstractmethod
def graAddresses(self, kernel, tP, isPap=False):
return ""
##############################################################################
# Global Read Addresses: Increments A/B
# This function declares the increments
##############################################################################
@abc.abstractmethod
def graIncrements(self, kernel, loopIdx, tP):
return ""
##############################################################################
# Local Write Addresses: Tile Assignment A/B
##############################################################################
@abc.abstractmethod
def lwaTileAssignment(self, kernel, tP):
return ""
##############################################################################
# Local Write Addresses: Unroll Assignment A/B
##############################################################################
@abc.abstractmethod
def lwaUnrollAssignment(self, kernel, tP):
return ""
##############################################################################
# Local Write Addresses: First Offset A/B
##############################################################################
@abc.abstractmethod
def lwaFirstOffset(self, kernel, tP, uDu):
return ""
##############################################################################
# Local Write Addresses: Final Offsets A/B
##############################################################################
@abc.abstractmethod
def lwaFinalOffsets(self, kernel, tP):
return ""
##############################################################################
# Local Write Addresses: Declare Addresses A/B
##############################################################################
@abc.abstractmethod
def lwaDeclareAddresses(self, kernel, tP):
return ""
##############################################################################
# Local Read Addresses: Tile Assignment
##############################################################################
@abc.abstractmethod
def lraTileAssignment(self, kernel, tPA, tPB):
return ""
##############################################################################
# Local Read Addresses: Final Offset A/B
##############################################################################
@abc.abstractmethod
def lraFinalOffset(self, kernel, tP):
return ""
##############################################################################
# Local Read Addresses for direct LDS : Final Offset A/B
##############################################################################
@abc.abstractmethod
def directToLdsLraOffset(self, kernel, finalVgpr, tmp1, tmp2, tP):
return ""
##############################################################################
# Local Read Addresses offset conversion for DTL + NLC > 1
##############################################################################
@abc.abstractmethod
def lraOffsetConversionForDTLandNLC(self, kernel, tP, offset_val, generateAsm=False, \
finalVgpr=None, tmp1=None, tmp2=None):
return ""
##############################################################################
# Local Read Addresses: Declare Addresses A/B
##############################################################################
@abc.abstractmethod
def lraDeclareAddresses(self, kernel, tP):
return ""
##############################################################################
# Recalculate local read addresses A/B
##############################################################################
@abc.abstractmethod
def recalcLocalReadAddressesAB(self, kernel):
return ""
##############################################################################
# Recalculate local write addresses A/B
##############################################################################
@abc.abstractmethod
def recalcLocalWriteAddresses(self, kernel, tP, uDu):
return ""
##############################################################################
# Declare Loop Num Iterations
##############################################################################
@abc.abstractmethod
def declareLoopNumIter(self, kernel):
return ""
##############################################################################
# Define stagger parms that will be used in calculateStagger
##############################################################################
@abc.abstractmethod
def declareStaggerParms(self, kernel):
return ""
##############################################################################
# Calculate and apply stagger offsets and edge
##############################################################################
@abc.abstractmethod
def calculateStagger(self, kernel, loopIdx):
return ""
##############################################################################
# Remove stagger offset (before tail loop)
##############################################################################
@abc.abstractmethod
def removeStagger(self, kernel, tP):
return ""
##############################################################################
# Calculate Loop Num Iter
##############################################################################
@abc.abstractmethod
def calculateLoopNumIter(self, kernel, loopIdx, isPap):
return ""
##############################################################################
# openShadowInit:
# Top of shadow init code
##############################################################################
@abc.abstractmethod
def openShadowInit(self, kernel):
return ""
##############################################################################
# closeShadowInit:
# Bottom of shadow init code
##############################################################################
@abc.abstractmethod
def closeShadowInit(self, kernel):
return ""
##############################################################################
# Initialize C
##############################################################################
@abc.abstractmethod
def initC(self, kernel):
return ""
##############################################################################
# Open Loop
# loopIdx<0 : tail loop
##############################################################################
@abc.abstractmethod
def openLoop(self, kernel, loopIdx, uDu, noLabelGen, beginLabelOnly):
return ""
##############################################################################
# Close Loop
##############################################################################
@abc.abstractmethod
def closeLoop(self, kernel, loopIdx, finalLoop, loopCopies, uDu, emitEndLabelOnly, oddLabel=False):
return ""
##############################################################################
# Open Loop Copy
##############################################################################
@abc.abstractmethod
def openLoopCopy(self, kernel, lc):
return ""
##############################################################################
# End Summation
##############################################################################
@abc.abstractmethod
def endSummation(self, kernel, label = None, isOptNLL = False):
return ""
##############################################################################
# MAC Iteration
# useMacro : if True, call the MAC* macro; if False, inline the MACs
##############################################################################
@abc.abstractmethod
def macIter(self, kernel, bufferIdx, iuiCount, useMacro, isTail=False):
return ""
##############################################################################
# At Least 1 Unroll
##############################################################################
@abc.abstractmethod
def openSumAtLeastUnroll(self, kernel, prefetch, isOptNLL, isPap):
return ""
@abc.abstractmethod
def closeSumAtLeastUnroll(self, kernel, prefetch, isOptNLL, isPap, isNGLL):
return ""
##############################################################################
# Global Read: Increment A/B
##############################################################################
@abc.abstractmethod
def globalReadIncrementAB(self, kernel, loopIdx, prefetchIndex, incs=1):
return ""
##############################################################################
# Global Read: Do It A/B
# mode: 0=prefetch, 1=unroll loop, 2=guardK
##############################################################################
@abc.abstractmethod
def globalReadDo(self, kernel, mode, tP, vregSetIdx=0):
return ""
##############################################################################
# directToLds m0 update: Do It A/B
# mode: 0=prefetch, 1=unroll loop, 2=guardK
##############################################################################
@abc.abstractmethod
def directToLdsM0Update(self, kernel, mode, tP, usePlaceHolder=False):
return ""
##############################################################################
# Local Write: Swap Offsets A/B
##############################################################################
@abc.abstractmethod
def localWriteSwapOffsets(self, kernel, internalPointerSwap, tP):
return ""
##############################################################################
# Local Write: Reset Offsets A/B
##############################################################################
@abc.abstractmethod
def localWriteResetOffsets(self, kernel, internalPointerSwap, tP):
return ""
##############################################################################
# Local Write: Init Pointers A/B
##############################################################################
@abc.abstractmethod
def localWriteInitPointers(self, kernel, tP):
return ""
##############################################################################
# Local Write in Prefetch Pass (PreLoop): Do It A/B
##############################################################################
@abc.abstractmethod
def preLoopLocalWriteDo(self, kernel, tPA, tPB):
return ""
##############################################################################
# Replace the determined vmcnt in PreLoop LocalWrite
##############################################################################
@abc.abstractmethod
def replacePreLoopLWVmcnt(self, kernel):
return ""
##############################################################################
# Local Write: Do It A/B
##############################################################################
@abc.abstractmethod
def localWriteDo(self, kernel, tP, uDu):
return ""
##############################################################################
# Local Read: Swap Offsets A/B
##############################################################################
@abc.abstractmethod
def localReadSwapOffsets(self, kernel, internalPointerSwap, tP):
return ""
##############################################################################
# Local Read: Reset Offsets A/B
##############################################################################
@abc.abstractmethod
def localReadResetOffsets(self, kernel, tP):
return ""
##############################################################################
# Local Read: Init Pointers A/B
##############################################################################
@abc.abstractmethod
def localReadInitPointers(self, kernel, tP):
return ""
##############################################################################
# Local Read: Increment A/B
##############################################################################
@abc.abstractmethod
def localReadInc(self, kernel, iui, tP):
return ""
##############################################################################
# Local Read: Do It A/B
##############################################################################
@abc.abstractmethod
def localReadDo(self, kernel, bufferIdx, innerUnrollIndex, epsi, tP):
return ""
##############################################################################
# Shift Vector Components d0/1
##############################################################################
@abc.abstractmethod
def shiftVectorComponents(self, kernel, tP):
return ""
##############################################################################
# Complex Declare Tmp Registers
##############################################################################
@abc.abstractmethod
def complexDeclareTmpRegisters(self, kernel):
return ""
##############################################################################
# LocalSplitU: Local Write
##############################################################################
@abc.abstractmethod
def localSplitULocalWrite(self, kernel):
return ""
##############################################################################
# LocalSplitU: Local Read
##############################################################################
@abc.abstractmethod
def localSplitULocalRead(self, kernel):
return ""
##############################################################################
# LocalSplitU: Reduction
##############################################################################
@abc.abstractmethod
def localSplitUReduction(self, kernel):
return ""
##############################################################################
# globalWriteWorkGroupInitBeforePersistentLoop:
##############################################################################
@abc.abstractmethod
def globalWriteWorkGroupInitBeforePersistentLoop(self, kernel):
return ""
##############################################################################
# globalWriteWorkGroupInit:
# Perform work-group granularity init
##############################################################################
@abc.abstractmethod
def globalWriteWorkGroupInit(self, kernel):
return ""
##############################################################################
# LocalSplitU: Global Write Indices
##############################################################################
@abc.abstractmethod
def localSplitUGlobalWriteIndices(self, kernel):
return ""
##############################################################################
# LocalSplitU: Global Write
##############################################################################
@abc.abstractmethod
def localSplitUGlobalWrite(self, kernel):
return ""
##############################################################################
# Not LocalSplitU: Global Write Indices
##############################################################################
@abc.abstractmethod
def notLocalSplitUGlobalWriteIndices(self, kernel):
return ""
##############################################################################
# Not LocalSplitU: Global Write
##############################################################################
@abc.abstractmethod
def notLocalSplitUGlobalWrite(self, kernel):
return ""
@abc.abstractmethod
def openPrefetchAcrossPersistent(self, kernel, isOptNLL=False, useBufferOOB=False):
return ""
@abc.abstractmethod
def closePrefetchAcrossPersistent(self, kernel, isOptNLL=False, useBufferOOB=False):
return ""
##############################################################################
# init for StoreCInUnroll
##############################################################################
@abc.abstractmethod
def initStoreCInUnroll(self, kernel):
return ""
##############################################################################
# init for StoreCInUnroll per Persistent Loop
##############################################################################
@abc.abstractmethod
def initStoreCInUnrollPerPersistentLoop(self, kernel):
return ""
##############################################################################
# init for StoreCInUnroll per Unroll Loop
##############################################################################
@abc.abstractmethod
def initStoreCInUnrollPerUnrollLoop(self, kernel, needInit):
return ""
##############################################################################
# swap SrdC and SrdCbackup, SrdD and SrdDbackup
##############################################################################
@abc.abstractmethod
def swapSrdCDandBackup(self, kernel):
return ""
##############################################################################
# C/D address increment value for StoreCInUnroll
##############################################################################
@abc.abstractmethod
def generateCorDaddrIncrementForStoreCInUnroll(self, kernel, CorD, odd, tmpSgprWork):
return ""
##############################################################################
# get address/gpr index increment frequency for StoreCInUnroll
##############################################################################
@abc.abstractmethod
def getAddrGprIdxIncrementFrequencyForStoreCInUnroll(self, kernel):
return ""
##############################################################################
# generate post process for StoreCInUnroll loop
##############################################################################
@abc.abstractmethod
def generatePostProcessForStoreCInUnrollLoop(self, kernel, needPost):
return ""
##############################################################################
# restore SrdCbackup and SrdDbackup
##############################################################################
@abc.abstractmethod
def restoreSrdCandDBackup(self, kernel):
return ""
##############################################################################
# reset storeC sync objects
##############################################################################
@abc.abstractmethod
def resetStoreCsyncObject(self, kernel):
return ""
##############################################################################
# set storeC sync objects
##############################################################################
@abc.abstractmethod
def setStoreCsyncObject(self, kernel):
return ""
##############################################################################
# end process for StoreCInUnroll per PersistentLoop (OptNLL)
##############################################################################
@abc.abstractmethod
def endProcessPersistentLoopforStoreCInUnrollOptNLL(self, kernel):
return ""
##############################################################################
# end process for StoreCInUnroll per PersistentLoop (NoOptNLL)
##############################################################################
@abc.abstractmethod
def endProcessPersistentLoopforStoreCInUnrollNoOptNLL(self, kernel):
return ""
##############################################################################
# number of storeC code in template for StoreCInUnroll
##############################################################################
@abc.abstractmethod
def getNumberOfStoreCInTemplate(self, kernel):
return ""
##############################################################################
# number of LoadC code in template for StoreCInUnroll
##############################################################################
@abc.abstractmethod
def getNumberOfLoadCInForLoadC(self, kernel):
return ""
##############################################################################
# generate storeCInUnroll post loop code
##############################################################################
@abc.abstractmethod
def generateStoreInUnrollPostLoop(self, kernel, isOptNLL, isDTVodd):
return ""
##############################################################################
# openOddNoLoadLoopForDTV
# generate open code for DirectToVgpr + odd exit case in noLoadLoop code
##############################################################################
@abc.abstractmethod
def openOddNoLoadLoopForDTV(self, kernel, isNGLL, name):
return ""
##############################################################################
# closeOddNoLoadLoopForDTV
# generate close code for DirectToVgpr + odd exit case in noLoadLoop code
##############################################################################
@abc.abstractmethod
def closeOddNoLoadLoopForDTV(self, kernel, isNGLL, name):
return ""
##############################################################################
# generateEvenEndLabeNoLoadLoopForDTV
# generate even end label for DirectToVgpr
##############################################################################
@abc.abstractmethod
def generateEvenEndLabeNoLoadLoopForDTV(self, kernel, isNGLL, name):
return ""
##############################################################################
# generateOddEndVgprCopyForDTV
# generate odd end vgpr copy for DirectToVgpr
##############################################################################
@abc.abstractmethod
def generateOddEndVgprCopyForDTV(self, kernel):
return ""
##############################################################################
# PrefetchGlobalRead2
##############################################################################
@abc.abstractmethod
def openPrefetchGlobalRead2(self, kernel):
return ""
@abc.abstractmethod
def closePrefetchGlobalRead2(self, kernel):
return ""
##############################################################################
# Function End
##############################################################################
@abc.abstractmethod
def functionEnd(self, kernel, addLabel=True):
return ""
##############################################################################
# Function Suffix
##############################################################################
@abc.abstractmethod
def functionSuffix(self, kernel):
return ""
##############################################################################
# Kernel Body Prefix
##############################################################################
@abc.abstractmethod
def kernelBodyPrefix(self, kernel, tPA, tPB ):
return ""
##############################################################################
# Kernel Body Suffix
##############################################################################
@abc.abstractmethod
def kernelBodySuffix(self, kernel, tPA, tPB ):
return ""
##############################################################################
# WaitCnt
##############################################################################
@abc.abstractmethod
def wait(self, kernel, tPA, tPB, globalRead, localWrite, localRead, comment):
return ""
##############################################################################
# SyncThreads
##############################################################################
@abc.abstractmethod
def syncThreads(self, kernel):
return self.indent + self.syncStr + self.endLine
##############################################################################
# MapAcctoArch
##############################################################################
@abc.abstractmethod
def MapAcctoArchRegs(self, kernel, option):
return ""
##############################################################################
# openmovaccVgpr
##############################################################################
@abc.abstractmethod
def openmovaccVgpr(self, kernel, backupSgpr):
return ""
##############################################################################
# getAccVgprCode
##############################################################################
@abc.abstractmethod
def getAccVgprCode(self, kernel, odd):
return ""
##############################################################################
# closemovaccVgpr
##############################################################################
@abc.abstractmethod
def closemovaccVgpr(self, kernel, backupSgpr):
return ""
##############################################################################
#
# Entry Functions
#
##############################################################################
##############################################################################
# get kernel name
##############################################################################
def getKernelFileBase(self, kernel):
if isCustomKernelConfig(kernel):
fileBase = kernel["CustomKernelName"]
elif globalParameters["ShortNames"]:
fileBase = Solution.getNameSerial(kernel, self.kernelSerialNaming)
else:
fileBase = self.shortenFileBase(kernel)
return fileBase
def getKernelName(self, kernel):
kernelName = Solution.getNameMin(kernel, self.kernelMinNaming)
return kernelName
def getKernelSource(self, kernel):
"""
Returns the source of the kernel, either C++ or assembly.
"""
fileString = ""
self.tPA = tensorParametersA = {}
self.tPB = tensorParametersB = {}
self.initKernel(kernel, tensorParametersA, tensorParametersB )
fileString += self.kernelBodyPrefix( kernel, tensorParametersA, \
tensorParametersB )
self.stringIdx = 0
(error, kb) = self.kernelBody( kernel, tensorParametersA, tensorParametersB)
fileString += kb
fileString += self.kernelBodySuffix( kernel, tensorParametersA, \
tensorParametersB )
if error != 0:
if globalParameters["ForceGenerateKernel"]:
print ("warning: Generating kernel source resulted in error {}, but ForceGenerateKernel=1 so saving source".format(error))
else:
raise RuntimeError("Generating kernel source resulted in error {}".format(error))
return fileString
def getAssemblyDirectory(self):
return Common.ensurePath(os.path.join(globalParameters["WorkingPath"], "assembly"))
def byteArrayScriptSource(self):
return """
#!/usr/bin/env python
fileString = ""
fileString += "/*******************************************************************************\\n"
fileString += "* Copyright (C) 2016 Advanced Micro Devices, Inc. All rights reserved.\\n"
fileString += "*\\n"
fileString += "* Permission is hereby granted, free of charge, to any person obtaining a copy\\n"
fileString += '* of this software and associated documentation files (the \"Software\"), to deal\\n'
fileString += "* in the Software without restriction, including without limitation the rights\\n"
fileString += "* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell cop-\\n"
fileString += "* ies of the Software, and to permit persons to whom the Software is furnished\\n"
fileString += "* to do so, subject to the following conditions:\\n"
fileString += "*\\n"
fileString += "* The above copyright notice and this permission notice shall be included in all\\n"
fileString += "* copies or substantial portions of the Software.\\n"
fileString += "*\\n"
fileString += '* THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IM-\\n'
fileString += "* PLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS\\n"
fileString += "* FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR\\n"
fileString += "* COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER\\n"
fileString += "* IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNE-\\n"
fileString += "* CTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.\\n"
fileString += "*******************************************************************************/\\n\\n"
fileString += "/**************************************************\\n"
fileString += "* This file was generated by Tensile: *\\n"
fileString += "* https://github.com/ROCmSoftwarePlatform/Tensile *\\n"
fileString += "**************************************************/\\n\\n\\n"
import os.path
fileString += '#include "Kernels.h"\\n\\n'
fileString += "/* code object byte array */\\n\\n"
codeObjectFileNames = [f for f in os.listdir(".") if (os.path.isfile(f) and f.endswith(".co"))]
for codeObjectFileName in codeObjectFileNames:
  print(codeObjectFileName)
  print("\\n")
  kernelName = os.path.splitext(codeObjectFileName)[0]
  codeObjectFile = open(codeObjectFileName, "rb")  # binary read so the byte array is exact
  codeObjectByteArray = bytearray(codeObjectFile.read())
  codeObjectFile.close()
  # write code object byte array for asm
  fileString += "const unsigned char %s_coba[%u] = {\\n" % (kernelName, len(codeObjectByteArray))
  for byteIdx in range(0, len(codeObjectByteArray)):
    byte = codeObjectByteArray[byteIdx]
    fileString += "0x%02x" % byte
    if byteIdx < len(codeObjectByteArray)-1:
      fileString += ","
    else:
      fileString += "};\\n"
    if byteIdx % 16 == 15:
      fileString += "\\n"
text_file = open("Kernels.cpp", "w")
text_file.write("%s" % fileString)
text_file.close()
"""
def writeByteArrayScript(self):
asmPath = self.getAssemblyDirectory()
bytearrayFileName = os.path.join(asmPath,"insert_byte_array.py")
if not os.path.isfile(bytearrayFileName):
with open(bytearrayFileName, 'w') as bytearrayFile:
bytearrayFile.write(self.byteArrayScriptSource())
os.chmod(bytearrayFileName, 0o777)
return bytearrayFileName
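# The helper script is only written once per assembly directory (guarded by the
# os.path.isfile check) and is marked executable so it can be run in place.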
def getReplacementKernelPath(self, kernel):
if not kernel["ReplacementKernel"] and not isCustomKernelConfig(kernel): #kernel["CustomKernelName"]:
return None
kernelName = self.getKernelName(kernel)
if isCustomKernelConfig(kernel):
return os.path.join(globalParameters["CustomKernelDirectory"], (kernelName + ".s"))
else: # Replacement kernel
return ReplacementKernels.Get(kernelName)
def shortenFileBase(self, kernel):
base = self.getKernelName(kernel)
if len(base) <= globalParameters["MaxFileName"]:
return base
import hashlib
import base64
pivot = globalParameters["MaxFileName"] * 3 // 4
firstPart = base[:pivot]
secondPart = base[pivot:]
secondHash = hashlib.sha256(secondPart.encode()).digest()
secondPart = base64.b64encode(secondHash, b'_-').decode()
return firstPart + secondPart
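# Illustrative sketch of the behavior (hypothetical values): with MaxFileName = 64
# the pivot is 48, so the first 48 characters of an over-long kernel name are kept
# verbatim and the remainder is replaced by the url-safe base64 encoding (using the
# '_' and '-' alternative characters) of its SHA-256 digest, giving a deterministic,
# collision-resistant suffix.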
def getKernelObjectAssemblyFile(self, kernel):
asmPath = self.getAssemblyDirectory()
# write assembly file to assembly directory
kernelName = self.getKernelFileBase(kernel)
fileBase = os.path.join(asmPath, kernelName )
assemblyFileName = "%s.s" % fileBase
replacementKernel = self.getReplacementKernelPath(kernel)
if replacementKernel is not None:
self.tPA = tensorParametersA = {}
self.tPB = tensorParametersB = {}
if isCustomKernelConfig(kernel):
kernelFoundMessage = "Custom kernel filename "
# ISA version, such as 803
self.kernel = kernel
self.language = "ASM"
self.version = globalParameters["CurrentISA"]
if "ISA" in kernel:
self.version = tuple(kernel["ISA"])
if not globalParameters["AsmCaps"][self.version]["SupportedISA"]:
defaultIsa = (9,0,0)
print("warning: ISA:", self.version, " is not supported; overriding with ", defaultIsa)
self.version = defaultIsa
else:
kernelFoundMessage = "replacement_assemblyFilename "
self.initKernel(kernel, tensorParametersA, tensorParametersB )
shutil.copyfile(replacementKernel, assemblyFileName)
if globalParameters["PrintLevel"] >= 1:
print(kernelFoundMessage + assemblyFileName)
else:
kernelSource = self.getKernelSource(kernel)
if globalParameters["PrintLevel"] >= 2:
print("write_assemblyFilename %s" % assemblyFileName)
with open(assemblyFileName, 'w') as assemblyFile:
assemblyFile.write(kernelSource)
return assemblyFileName
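# Two paths lead here: replacement/custom kernels are copied verbatim into the
# assembly directory, while ordinary kernels are generated via getKernelSource()
# and written out. Either way the caller receives the path of the .s file inside
# the assembly directory.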
def getAssembledKernelObjectFile(self, kernel):
assemblyFileName = self.getKernelObjectAssemblyFile(kernel)
base, ext = os.path.splitext(assemblyFileName)
objectFileName = base + '.o'
args = self.getCompileArgs(assemblyFileName, objectFileName)
if globalParameters["PrintCodeCommands"]:
print (' '.join(args), " && ")
# use check_output so that the Windows cmd block waits until the command finishes
try:
out = subprocess.check_output(args, stderr=subprocess.STDOUT, cwd=self.getAssemblyDirectory())
print2(out)
except subprocess.CalledProcessError as err:
print(err.output)
raise
return objectFileName
def getSingleCodeObjectFile(self, kernel):
objectFileName = self.getAssembledKernelObjectFile(kernel)
base, ext = os.path.splitext(objectFileName)
coFileName = base + '.co'
args = self.getLinkCodeObjectArgs([objectFileName], coFileName)
if globalParameters["PrintCodeCommands"]:
print (' '.join(args))
# use check_output so that the Windows cmd block waits until the command finishes
try:
out = subprocess.check_output(args, stderr=subprocess.STDOUT, cwd=self.getAssemblyDirectory())
print2(out)
except subprocess.CalledProcessError as err:
print(err.output)
raise
return coFileName
def getByteArrayCobaDefinition(self, varName, byteArray):
s = self.comment("code object byte array")
s += "const unsigned char %s_coba[%u] = {\n" % (varName, len(byteArray))
if len(byteArray) != 0:
s += "0x%02x" % byteArray[0]
for byteIdx, byte in enumerate(byteArray[1:]):
if byteIdx % 16 == 15:
s += ",\n0x%02x" % byte
else:
s += ",0x%02x" % byte
s += '};\n'
return s
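# Illustrative output for a 3-byte array named "foo" (hypothetical name and values):
#   const unsigned char foo_coba[3] = {
#   0x01,0x02,0x03};
# For longer arrays, line breaks are inserted via the byteIdx % 16 check to keep
# the generated lines short.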
def getFileCobaDefinition(self, varName, fileName):
with open(fileName, 'rb') as f:
byteArray = bytearray(f.read())
return self.getByteArrayCobaDefinition(varName, byteArray)
##############################################################################
def getSourceFileString(self, kernel):
"""
Returns a string suitable for placing in Kernels.cpp. This means the actual kernel source in the case
of a source kernel, or an assembled code object byte array definition in the case of an assembly kernel,
or an empty string in the case that CodeFromFiles is true.
In the case of an assembly kernel, this function has the side effect of creating the following files:
* An assembly source file
* An object file
* A code object file
* A Python script which can create byte array variable definitions.
"""
try:
if kernel["KernelLanguage"] == "Assembly":
# asmPath = self.getAssemblyDirectory()
# kernelName = self.getKernelName(kernel)
# Skip if the .o files will already have been built for this kernel (duplicate)
# @TODO remove need for this with better code organization
if kernel.duplicate:
self.language = "ASM"
return (0, "")
if globalParameters["GenerateSourcesAndExit"]:
# only create the assembly file.
self.getKernelObjectAssemblyFile(kernel)
return (0, "")
else:
self.writeByteArrayScript()
self.getSingleCodeObjectFile(kernel)
# In this case the code above only ensures that the code object file has been
# built; its contents are not embedded in the returned source string.
return (0, "")
# Old client debug option
# return (0, self.getFileCobaDefinition(kernelName, os.path.join(asmPath, coFile)))
else:
return (0, self.getKernelSource(kernel))
except subprocess.CalledProcessError as exc:
if isinstance(exc.cmd, (list, tuple)): # avoid collections.Sequence, removed in Python 3.10
print("Command: ")
print(' '.join(exc.cmd))
print("returned non-zero exit status ", exc.returncode)
else:
print(exc)
return (-1, "")
except RuntimeError as exc:
if globalParameters["PrintSolutionRejectionReason"]:
print(exc)
return (-2, "")
##############################################################################
# header file string
##############################################################################
def getHeaderFileString(self, kernel):
kernelName = self.getKernelName(kernel)
fileString = "" # CHeader
if not hasattr(self, "language"):
raise AttributeError(f"Error processing {kernelName}: language attribute not found!")
if self.language == "HIP" or self.language == "OCL":
if not globalParameters["MergeFiles"]:
fileString += CHeader
fileString += "#pragma once\n\n"
if self.language == "HIP":
fileString += "#include <hip/hip_runtime.h>\n"
fileString += "#include <hip/hip_fp16.h>\n"
fileString += "#include <KernelHeader.h>\n"
fileString += "\n"
else:
fileString += "#include <string>\n"
if self.language == "OCL":
fileString += "extern const char * const %s_src;\n" % kernelName
else:
fileString += self.functionSignature(kernel)
fileString += ";\n"
else:
if not globalParameters["MergeFiles"] or globalParameters["NumMergedFiles"] > 1:
fileString += "#pragma once\n\n"
if not globalParameters["CodeFromFiles"]:
fileString += "extern const unsigned char %s_coba[]; // code object byte array\n" % kernelName
return fileString
##############################################################################
# flip Vreg set for DirectToVgpr in global read
##############################################################################
def flipVregSetForDirectToVgprInGlobalRead(self, kernel, itemStr):
# need to swap VGPR register set for odd code
baseName = "G2LA" if kernel["DirectToVgprA"] else "G2LB" # only one of them is enabled
set0 = baseName + "0"
set1 = baseName + "1"
if set0 in itemStr:
# replace set0 with set1
itemStr = itemStr.replace(set0, set1)
elif set1 in itemStr:
# replace set1 with set0
itemStr = itemStr.replace(set1, set0)
return itemStr
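# Example: with DirectToVgprA enabled, an item string referencing "G2LA0" is
# rewritten to reference "G2LA1" (and vice versa), so the odd copy of the code
# reads from the other half of the double-buffered global-read VGPR set.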
##############################################################################
# return number of store instructions
##############################################################################
def getNumStoreInst(self, asmStr):
# count global store and atomic-add instructions in the given assembly string
ret = 0
ret += asmStr.count("_buffer_store")
ret += asmStr.count("_global_store")
ret += asmStr.count("buffer_atomic_add")
ret += asmStr.count("global_atomic_add")
return ret
##############################################################################
# return number of load instructions
##############################################################################
def getNumLoadInst(self, asmStr):
# count global load instructions in the given assembly string
ret = 0
ret += asmStr.count("_buffer_load")
ret += asmStr.count("_global_load")
return ret
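# Both helpers above simply count substrings in generated assembly text; they rely
# on each load/store macro name containing one of the listed patterns once per
# issued memory instruction.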
##############################################################################
# waitcnt code for DirectToVgpr
##############################################################################
def getWaitcntCodeForDirectToVgpr(self, kernel, localWriteEndIter, u, firstIter, isPap=True, beforeBarrier=False, NLLlast=False, oddLast=False):
retStr = ""
# generate wait
if (kernel["DirectToVgprA"] or kernel["DirectToVgprB"]):
if self.enable["Wait"]:
pgr2 = kernel["PrefetchGlobalRead"] == 2
numGlobalReadA = kernel["NumLoadsPerpendicularA"] * kernel["NumLoadsCoalescedA"] * self.numReadVectorComponentsA
numGlobalReadB = kernel["NumLoadsPerpendicularB"] * kernel["NumLoadsCoalescedB"] * self.numReadVectorComponentsB
numGlobalRead = numGlobalReadA if kernel["DirectToVgprA"] else numGlobalReadB
numGlobalReadAll = numGlobalReadA + numGlobalReadB
numGlobalStoreC = 0
numReadsIterCoalesced = self.numReadsIterCoalescedA if kernel["DirectToVgprA"] else self.numReadsIterCoalescedB
waitComment = "global read wait for DirectToVgpr"
# delay DirectToVgpr global read (from previous iteration) which is not referred yet
numRegsIn1set = (numGlobalRead // kernel["LoopIters"]) * numReadsIterCoalesced
numSet = (u + numReadsIterCoalesced) // numReadsIterCoalesced
numSetMod = (u + numReadsIterCoalesced) % numReadsIterCoalesced
if numSetMod > 0:
# if mod > 0, the wait for this set of global reads was already issued in the mod == 0 case; no need to wait again
return ""
needToWait = numGlobalRead - numSet * numRegsIn1set
if not isPap:
# not isPap case, no global load A, B in no load loop. Reset numGlobalReadAll and numGlobalRead
numGlobalReadAll = 0
numGlobalRead = 0
if pgr2:
# PGR=2 case, add numGlobalReadAll for second set of prefetch
needToWait += numGlobalReadAll
if u > 0:
# count number of global read for i < u
count = 0
for i in range(u):
globalReadStr = ' '.join([str(x) for x in self.perIterGlobalReadCode[i].flatitems()])
count += self.getNumLoadInst(globalReadStr)
# PGR=2 case, global read is in LocalWriteCode
localWriteStr = ' '.join([str(x) for x in self.perIterLocalWriteCode[i].flatitems()])
count += self.getNumLoadInst(localWriteStr)
needToWait += count
if u == localWriteEndIter + 1 and beforeBarrier:
# beforeBarrier case, reduce the amount of non-Vgpr global read
needToWait -= (numGlobalReadAll - numGlobalRead)
# adjustment for oddLast
# oddLast case or ScheduleIterAlg < 3 case, ignore all of above and set 0
if oddLast or kernel["ScheduleIterAlg"] < 3:
needToWait = 0
if kernel["StoreCInUnroll"]:
# In StoreCInUnroll case,
# 1) last iteration case (u == localWriteEndIter + 1)
# 1-1) if StoreC is already executed in the previous u, add number of executed buffer_store/atomic_add
# (global read C wait is already done in this case)
# 1-2) else, add number of global read C to numGlobalReadAll
# count number of StoreC in template
tmpStr = ' '.join([str(x) for x in self.StoreCUnrollCode.flatitems()])
numGlobalStoreCinTemplate = self.getNumStoreInst(tmpStr) # count store instructions
numGlobalStoreC = 0
if u == localWriteEndIter + 1:
if beforeBarrier:
# before barrier case (DirectToLds+DirectToVgpr), put waitcnt vmcnt just before barrier (before ds_read)
# In that case, StoreC is already done. Add number of store C from template to vmcnt.
numGlobalStoreC += numGlobalStoreCinTemplate
# It means LoadC wait is already done. Deduct the number of load C in template
# count number of Load in template
tmpStr = ' '.join([str(x) for x in self.LoadCUnrollCode.flatitems()])
numGlobalLoadCinTemplate = self.getNumLoadInst(tmpStr) # count load instructions
needToWait -= numGlobalLoadCinTemplate
else:
# check if store C is already in perIterLocalWriteCode
for i in range(u):
# scheduled storeC in unroll is in LocalWriteCode
localWriteStr = ' '.join([str(x) for x in self.perIterLocalWriteCode[i].flatitems()])
numGlobalStoreC += self.getNumStoreInst(localWriteStr)
# no LDS write (DirectToLds+DirectToVgpr) and not beforeBarrier and not firstIter case,
# no need to wait for StoreC in previous iteration
# Then, add the number of storeC in template
#if kernel["NoLdsWriteCode"] and not firstIter:
# numGlobalStoreC += numGlobalStoreCinTemplate
# 2) add number of store C from previous iter to needToWait
# 2-1) not firstIter and u < localWriteEndIter + 1 case
# 2-2) noLoadC and last NoLoadLoop
needLoadC = (not kernel["AtomicAddC"]) and kernel["ProblemType"]["UseBeta"]
if not firstIter and (u < localWriteEndIter + 1 or ((not needLoadC) and NLLlast)):
numGlobalStoreC += numGlobalStoreCinTemplate
# oddLast case, ignore all of above and set numGlobalStoreCinTemplate
if oddLast:
numGlobalStoreC = numGlobalStoreCinTemplate
# add numGlobalStoreC to needToWait
needToWait += numGlobalStoreC
waitComment = "global read/store wait for DirectToVgpr with StoreCInUnroll (StoreC=%u)"%(numGlobalStoreC)
# vmcnt should not go over MaxVmcnt
maxVmcnt = globalParameters["AsmCaps"][self.version]["MaxVmcnt"]
needToWait = min(needToWait, maxVmcnt)
retStr = "s_waitcnt vmcnt(%u) // %s\n"%(needToWait, waitComment)
return retStr
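# Rough illustration of the counting above (hypothetical numbers): if the
# DirectToVgpr tensor issues 8 global reads per unroll with LoopIters=4 and
# numReadsIterCoalesced=1, then numRegsIn1set=2, and at u=1 the code emits
# s_waitcnt vmcnt(8 - 2*2) = vmcnt(4), i.e. the 4 loads not yet consumed may stay
# in flight. PGR=2 and StoreCInUnroll adjustments are added on top, and the final
# count is clamped to the architecture's MaxVmcnt.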
##############################################################################
# Backup StoreCInUnroll related code
##############################################################################
def backupStoreCInUnrollRelatedCode(self):
# keep StoreCInUnrollPreCode, StoreCUnrollPostCode for the next noLoadLoop
self.StoreCUnrollPreCodeBackup = copy.deepcopy(self.StoreCUnrollPreCode)
self.StoreCUnrollPostCodeBackup = copy.deepcopy(self.StoreCUnrollPostCode)
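# deepcopy is used so that later additions to the live StoreCUnrollPre/PostCode
# modules do not leak into the saved copies restored by
# restoreStoreCInUnrollRelatedCode() below.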
##############################################################################
# Restore StoreCInUnroll related code
##############################################################################
def restoreStoreCInUnrollRelatedCode(self):
self.StoreCUnrollPreCode = self.StoreCUnrollPreCodeBackup
self.StoreCUnrollPostCode = self.StoreCUnrollPostCodeBackup
self.StoreCUnrollLoopCodeStarted = 0
##############################################################################
# generate storeC code in UnrollLoop
##############################################################################
def generateStoreCCodeInUnrollLoop(self, kernel, odd, isLast=False):
self.LoadCUnrollCode = Code.Module()
self.StoreCUnrollCode = Code.Module()
self.StoreCUnrollPreCode = Code.Module()
self.StoreCUnrollPostCode = Code.Module()
self.numItemsBeforeStoreC = 0
self.StoreCUnrollStartComment = "Start of StoreCInUnroll code"
self.StoreCUnrollStoreStartComment = "Start of StoreCInUnroll Store code"
self.StoreCUnrollLoopCodeStarted = 0 # 0:not StoreC code started, 1: started
if kernel["StoreCInUnroll"]:
needInit = not odd
needPost = odd
needInc = (not isLast) or kernel["StoreCInUnrollPostLoop"]
backupSgpr = self.getTmpSgpr(2).idx() # allocate all tmp register here
tmpSgprWork = backupSgpr + 1
needAddrC = (not kernel["AssertCEqualsD"]) and kernel["ProblemType"]["UseBeta"]
# init/inc code is necessary if inc frequency is 1
needInit = needInit or (self.getAddrGprIdxIncrementFrequencyForStoreCInUnroll(kernel) == 1)
needPost = needPost or (self.getAddrGprIdxIncrementFrequencyForStoreCInUnroll(kernel) == 1)
# generate init code for StoreCInUnroll per Unroll Loop
initPerUnrollCode = self.initStoreCInUnrollPerUnrollLoop(kernel, needInit)
# loadC
for x in self.LoadCTemplate.items():
# Load C case, insert Init per unroll code before Load C (setup vgpr offset for loadC and StoreC)
s = initPerUnrollCode + str(x)
initPerUnrollCode = "" # reset initPerUnrollCode so that it is not inserted again
self.LoadCUnrollCode.addText(s)
# Addr C increment code (no increment for isLast (and not PostLoop))
if needInc and needAddrC:
oddParam = needPost
kStr = self.generateCorDaddrIncrementForStoreCInUnroll(kernel, "C", oddParam, tmpSgprWork)
self.LoadCUnrollCode.addText(kStr)
if initPerUnrollCode != "":
# If init code is not inserted (no Load C case), insert it to the top of StoreC list (setup vgpr offset for StoreC)
self.StoreCUnrollPreCode.addText(initPerUnrollCode)
initPerUnrollCode = "" # reset initPerUnrollCode so that it is not inserted again
# these 3 items need to be in the same set
# open gpr indexing
# accVgpr (need gpr indexing)
# close gpr indexing
kStr = self.openmovaccVgpr(kernel, backupSgpr)
# odd case: use the gpr index advanced by one iteration; not needed when the index increment frequency is 1
oddGprIndex = odd and (self.getAddrGprIdxIncrementFrequencyForStoreCInUnroll(kernel) > 1)
kStr += self.getAccVgprCode(kernel, oddGprIndex)
first, second = self.closemovaccVgpr(kernel, backupSgpr)
kStr += first
self.StoreCUnrollPreCode.addText(kStr)
# put second part of close gpr indexing separately (for better scheduling)
self.StoreCUnrollPreCode.addText(second)
# Alpha
kStr = ""
for x in self.AlphaOpTemplate.items():
kStr += str(x)
if kStr != "":
self.StoreCUnrollPreCode.addText(kStr)
# count the number of items before StoreC (before beta)
self.numItemsBeforeStoreC = len(list(self.StoreCUnrollPreCode.items()))
# StoreC
# put marker comment to recognize start point of StoreC code
# this must be the first item in self.StoreCUnrollCode.
self.StoreCUnrollCode.addComment0(self.StoreCUnrollStartComment)
# add necessary dummy based on number of mfma instructions between local write items
# put enough interval (=3) for LocalWritePerMfma == -1 case
numMfma = 3 if kernel["LocalWritePerMfma"] == -1 else roundUp(1/kernel["LocalWritePerMfma"])
n = self.numItemsBeforeStoreC - numMfma # first numMfma items are inserted at the start comment and following mfmas
while n >= numMfma:
self.StoreCUnrollCode.addText("")
n -= numMfma
# insert items in postProcessList between StoreC/AtomicAdd (StoreVectorWidth=1 only)
imod = Code.Module()
imod.addComment0(self.StoreCUnrollStoreStartComment)
StartComment = str(imod)
# Beta
kStrBeta = ""
for x in self.BetaOpTemplate.items():
kStrBeta += str(x)
# double complex case or num of store == 1 case, put beta instruction separately
if kStrBeta != "" and (kernel["ProblemType"]["DestDataType"].isDoubleComplex() or self.getNumberOfStoreCInTemplate(kernel) == 1):
# combine beta code with first StoreC comment to avoid generating beta before alpha
self.StoreCUnrollCode.addText(kStrBeta + StartComment)
kStrBeta = ""
StartComment = ""
# number of instructions(items) of increment code between MFMAs
putCount = 1
postProcessListIndex = 0
# generate post process for StoreCInUnroll loop
# 1) increment gpr indexing (new value in tmp). Put this as separate item in StoreCUnrollCode
# 2-1) increment StoreC address (new value in tmp)
# 2-2) check enable count and apply new values when necessary
postProcessList = []
finalAddrIncList = []
if needInc:
postProcessList, finalAddrIncList = self.generatePostProcessForStoreCInUnrollLoop(kernel, needPost)
for x in self.StoreCTemplate.items():
kStr = ""
if x == self.StoreCTemplate.items()[0]:
kStr += kStrBeta + StartComment # combine beta code with first StoreC. first item case, add marker comment
StartComment = ""
strX = str(x)
kStr += strX
if x != self.StoreCTemplate.items()[-1]:
# not the last StoreC
# add postprocess code or empty between StoreC
self.StoreCUnrollCode.addCode(kStr)
end = kernel["StoreCInUnrollInterval"] - 1
for i in range(end):
if postProcessListIndex < len(postProcessList):
self.StoreCUnrollCode.addText(postProcessList[postProcessListIndex])
postProcessListIndex += 1
else:
self.StoreCUnrollCode.addText("") # add empty str to add interval between Store codes
else:
# last StoreC
if not (kernel["StoreCInUnrollPostLoop"] and isLast):
# last element and not StoreCInUnrollPostLoop+isLast case
self.StoreCUnrollCode.addCode(kStr)
# add remaining postprocess, finalAddrInc code in StoreCUnrollPostCode
count = 0
kStr = ""
for i in range(postProcessListIndex, len(postProcessList)):
kStr += postProcessList[i]
count+=1
if count == putCount:
self.StoreCUnrollPostCode.addText(kStr)
count = 0
kStr = ""
for i in range(len(finalAddrIncList)):
kStr += finalAddrIncList[i]
count+=1
if count == putCount:
self.StoreCUnrollPostCode.addText(kStr)
count = 0
kStr = ""
if count > 0:
self.StoreCUnrollPostCode.addText(kStr)
else:
# not last element or StoreCInUnrollPostLoop+isLast
# add all remaining items in postProcessList and finalAddrInc code after the last StoreC (in the same item)
for item in (postProcessList[postProcessListIndex:] + finalAddrIncList):
kStr += item
self.StoreCUnrollCode.addCode(kStr)
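# Resulting layout (summary): LoadCUnrollCode holds the per-unroll init plus the
# LoadC and C-address-increment code, StoreCUnrollPreCode holds the acc->vgpr copy
# and alpha scaling, StoreCUnrollCode interleaves the StoreC template with
# post-process items (padded with empty items so they can be spread across MFMAs),
# and StoreCUnrollPostCode carries any remaining post-process and final
# address-increment code.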