\documentclass[fleqn,12pt]{article}
%
\setlength{\oddsidemargin}{-0.25in}
\setlength{\textwidth}{7.0in}
\setlength{\topmargin}{-0.25in}
\setlength{\textheight}{9.5in}
%
\usepackage{algorithmic}
\newenvironment{algorithm}{\begin{quote} %\vspace{1em}
\begin{algorithmic}\samepage}{\end{algorithmic} %\vspace{1em}
\end{quote}}
\newcommand{\Real}{\mathop{\mathrm{Re}}}
\newcommand{\Imag}{\mathop{\mathrm{Im}}}
\begin{document}
\title{FFT Algorithms}
\author{Brian Gough, {\tt bjg@network-theory.co.uk}}
\date{May 1997}
\maketitle
\section{Introduction}
Fast Fourier Transforms (FFTs) are efficient algorithms for
calculating the discrete Fourier transform (DFT),
%
\begin{eqnarray}
h_a &=& \mathrm{DFT}(g_b) \\
&=& \sum_{b=0}^{N-1} g_b \exp(-2\pi i a b /N) \qquad 0 \leq a \leq N-1\\
&=& \sum_{b=0}^{N-1} g_b W_N^{ab} \qquad W_N= \exp(-2\pi i/N)
\end{eqnarray}
%
The DFT usually arises as an approximation to the continuous Fourier
transform when functions are sampled at discrete intervals in space or
time. The naive evaluation of the discrete Fourier transform is a
matrix-vector multiplication ${\mathbf W}\vec{g}$, and would take
$O(N^2)$ operations for $N$ data-points. The general principle of the
Fast Fourier Transform algorithms is to use a divide-and-conquer
strategy to factorize the matrix $W$ into smaller sub-matrices,
typically reducing the operation count to $O(N \sum f_i)$ if $N$ can
be factorized into smaller integers, $N=f_1 f_2 \dots f_n$.
This chapter explains the algorithms used in the GSL FFT routines
and provides some information on how to extend them. To learn more about
the FFT you should read the review article {\em Fast Fourier
Transforms: A Tutorial Review and A State of the Art} by Duhamel and
Vetterli~\cite{duhamel90}. There are several introductory books on the
FFT with example programs, such as {\em The Fast Fourier Transform} by
Brigham~\cite{brigham74} and {\em DFT/FFT and Convolution Algorithms}
by Burrus and Parks~\cite{burrus84}. In 1979 the IEEE published a
compendium of carefully-reviewed Fortran FFT programs in {\em Programs
for Digital Signal Processing}~\cite{committee79} which is a useful
reference for implementations of many different FFT algorithms. If you
are interested in using DSPs then the {\em Handbook of Real-Time Fast
Fourier Transforms}~\cite{smith95} provides detailed information on
the algorithms and hardware needed to design, build and test DSP
applications. Many FFT algorithms rely on results from number theory.
These results are covered in the books {\em Fast transforms:
algorithms, analyses, applications}, by Elliott and
Rao~\cite{elliott82}, {\em Fast Algorithms for Digital Signal
Processing} by Blahut~\cite{blahut} and {\em Number Theory in Digital
Signal Processing} by McClellan and Rader~\cite{mcclellan79}. There is
also an annotated bibliography of papers on the FFT and related topics
by Burrus~\cite{burrus-note}.
\section{Families of FFT algorithms}
%
There are two main families of FFT algorithms: the Cooley-Tukey
algorithm and the Prime Factor algorithm. These differ in the way they
map the full FFT into smaller sub-transforms. Of the Cooley-Tukey
algorithms there are two types of routine in common use: mixed-radix
(general-$N$) algorithms and radix-2 (power of 2) algorithms. Each
type of algorithm can be further classified by additional characteristics,
such as whether it operates in-place or uses additional scratch space,
whether its output is in a sorted or scrambled order, and whether it
uses decimation-in-time or -frequency iterations.
Mixed-radix algorithms work by factorizing the data vector into
shorter lengths. These can then be transformed by small-$N$ FFTs.
Typical programs include FFTs for small prime factors, such as 2, 3,
5, \dots which are highly optimized. The small-$N$ FFT modules act as
building blocks and can be multiplied together to make longer
transforms. By combining a reasonable set of modules it is possible to
compute FFTs of many different lengths. If the small-$N$ modules are
supplemented by an $O(N^2)$ general-$N$ module then an FFT of any
length can be computed, in principle. Of course, any length which
contains large prime factors would still perform only as $O(N^2)$.
Radix-2 algorithms, or ``power of two'' algorithms, are simplified
versions of the mixed-radix algorithm. They are restricted to lengths
which are a power of two. The basic radix-2 FFT module only involves
addition and subtraction, so the algorithms are very simple. Radix-2
algorithms have been the subject of much research into optimizing the
FFT. Many of the most efficient radix-2 routines are based on the
``split-radix'' algorithm. This is actually a hybrid which combines
the best parts of both radix-2 and radix-4 (``power of 4'')
algorithms~\cite{sorenson86,duhamel86}.
The prime factor algorithm (PFA) is an alternative form of general-$N$
algorithm based on a different way of recombining small-$N$ FFT
modules~\cite{temperton83pfa,temperton85}. It has a very simple indexing
scheme which makes it attractive. However it only works in the case
where all factors are mutually prime. This requirement makes it more
suitable as a specialized algorithm for given lengths.
\begin{subsection}{FFTs of prime lengths}
%
Large prime lengths cannot be handled efficiently by any of these
algorithms. However it may still be possible to compute a DFT by using
results from number theory. Rader showed that it is possible to
convert a length-$p$ FFT (where $p$ is prime) into a convolution of
length-($p-1$). There is a simple identity between the convolution of
length $N$ and the FFT of the same length, so if $p-1$ is easily
factorizable this allows the convolution to be computed efficiently
via the FFT. The idea is presented in the original paper by
Rader~\cite{raderprimes} (also reprinted in~\cite{mcclellan79}), but
for more details see the theoretical books mentioned earlier.
\end{subsection}
%\subsection{In-place algorithms vs algorithms using scratch space}
%%
%For simple algorithms, such as the Radix-2 algorithms, the FFT can be
%performed in-place, without additional memory. For this to be possible
%the index calculations must be simple enough that the calculation of
%the result to be stored in index $k$ can be carried out at the same
%time as the data in $k$ is used, so that no temporary storage is
%needed.
%The mixed-radix algorithm is too complicated for this. It requires an
%area of scratch space sufficient to hold intermediate copies of the
%input data.
%\subsection{Self-sorting algorithms vs scrambling algorithms}
%%
%A self-sorting algorithm At each iteration of the FFT internal permutations are included
%which leave the final iteration in its natural order.
%The mixed-radix algorithm can be made self-sorting. The additional
%scratch space helps here, although it is in fact possible to design
%self-sorting algorithms which do not require scratch
%space.
%The in-place radix-2 algorithm is not self-sorting. The data is
%permuted, and transformed into bit-reversed order. Note that
%bit-reversal only refers to the order of the array, not the values for
%each index which are of course not changed. More precisely, the data
%stored in location $[b_n \dots b_2 b_1 b_0]$ is moved to location
%$[b_0 b_1 b_2 \dots b_n]$, where $b_0 \dots b_n$ are the raw bits of a
%given index. Placing the data in bit reversed order is easily done,
%using right shifts to extract the least significant bits of the index
%and left-shifts to feed them into a register for the bit-reversed
%location. Simple radix-2 FFT usually recompute the bit reverse
%ordering in the naive way for every call. For repeated calls it might
%be worthwhile to precompute the permutation and cache it. There are
%also more sophisticated algorithms which reduce the number of
%operations needed to perform bit-reversal~\cite{bit-reversal}.
%\begin{subsection}{Decimation-in-time (DIT) vs Decimation-in-Frequency (DIF)}
%\end{subsection}
\subsection{Optimization}
%
There is no such thing as the single fastest FFT {\em algorithm}. FFT
algorithms involve a mixture of floating point calculations, integer
arithmetic and memory access. Each of these operations will have
different relative speeds on different platforms. The performance of
an algorithm is a function of the hardware it is implemented on. The
goal of optimization is thus to choose the algorithm best suited to
the characteristics of a given hardware platform.
For example, the Winograd Fourier Transform (WFTA) is an algorithm
which is designed to reduce the number of floating point
multiplications in the FFT. However, it does this at the expense of
using many more additions and data transfers than other algorithms.
As a consequence the WFTA might be a good candidate algorithm for
machines where data transfers occupy a negligible time relative to
floating point arithmetic. However on most modern machines, where the
speed of data transfers is comparable to or slower than floating point
operations, it would be outperformed by an algorithm which used a
better mix of operations (i.e. more floating point operations but
fewer data transfers).
For a study of this sort of effect in detail, comparing the different
algorithms on different platforms consult the paper {\it Effects of
Architecture Implementation on DFT Algorithm Performance} by Mehalic,
Rustan and Route~\cite{mehalic85}. The paper was written in the early
1980s and has data for super- and mini-computers which you are
unlikely to see today, except in a museum. However, the methodology is
still valid and it would be interesting to see similar results for
present day computers.
\section{FFT Concepts}
%
Factorization is the key principle of the mixed-radix FFT divide-and-conquer
strategy. If $N$ can be factorized into a product of $n_f$ integers,
%
\begin{equation}
N = f_1 f_2 ... f_{n_f} ,
\end{equation}
%
then the FFT itself can be divided into smaller FFTs for each factor.
More precisely, an FFT of length $N$ can be broken up into,
%
\begin{quote}
\begin{tabular}{l}
$(N/f_1)$ FFTs of length $f_1$, \\
$(N/f_2)$ FFTs of length $f_2$, \\
\dots \\
$(N/f_{n_f})$ FFTs of length $f_{n_f}$.
\end{tabular}
\end{quote}
%
The total number of operations for these sub-operations will be $O(
N(f_1 + f_2 + ... + f_{n_f}))$. When the factors of $N$ are all small
integers this will be substantially less than $O(N^2)$. For example,
when $N$ is a power of 2 an FFT of length $N=2^m$ can be reduced to $m
N/2$ FFTs of length 2, or $O(N\log_2 N)$ operations. Here is a
demonstration which shows this:
We start with the full DFT,
%
\begin{eqnarray}
h_a &=& \sum_{b=0}^{N-1} g_b W_N^{ab} \qquad W_N=\exp(-2\pi i/N)
\end{eqnarray}
%
and split the sum into even and odd terms,
%
\begin{eqnarray}
\phantom{h_a}
% &=& \left(\sum_{b=0,2,4,\dots} + \sum_{b=1,3,5,\dots}\right) g_b W_N^{ab}\\
&=& \sum_{b=0}^{N/2-1} g_{2b} W_N^{a(2b)} +
\sum_{b=0}^{N/2-1} g_{2b+1} W_N^{a(2b+1)}.
\end{eqnarray}
%
This converts the original DFT of length $N$ into two DFTs of length
$N/2$,
%
\begin{equation}
h_a = \sum_{b=0}^{N/2-1} g_{2b} W_{(N/2)}^{ab} +
W_N^a \sum_{b=0}^{N/2-1} g_{2b+1} W_{(N/2)}^{ab}
\end{equation}
%
The first term is a DFT of the even elements of $g$. The second term
is a DFT of the odd elements of $g$, premultiplied by an exponential
factor $W_N^a$ (known as a {\em twiddle factor}).
%
\begin{equation}
\mathrm{DFT}(h) = \mathrm{DFT}(g_{even}) + W_N^a \mathrm{DFT}(g_{odd})
\end{equation}
%
By splitting the DFT into its even and odd parts we have reduced the
operation count from $N^2$ (for a DFT of length $N$) to $2 (N/2)^2$
(for two DFTs of length $N/2$). The cost of the splitting is that we
need an additional $O(N)$ operations to multiply by the twiddle factor
$W_N^a$ and recombine the two sums.
We can repeat the splitting procedure recursively $\log_2 N$ times
until the full DFT is reduced to DFTs of single terms. The DFT of a
single value is just the identity operation, which costs nothing.
However since $O(N)$ operations were needed at each stage to recombine
the even and odd parts the total number of operations to obtain the
full DFT is $O(N \log_2 N)$. If we had used a length which was a
product of factors $f_1$, $f_2$, $\dots$ we could have split the sum
in a similar way. First we would split terms corresponding to the
factor $f_1$, instead of the even and odd terms corresponding to a
factor of two. Then we would repeat this procedure for the subsequent
factors. This would lead to a final operation count of $O(N \sum
f_i)$.
This procedure gives some motivation for why the number of operations
in a DFT can in principle be reduced from $O(N^2)$ to $O(N \sum f_i)$.
It does not give a good explanation of how to implement the algorithm
in practice which is what we shall do in the next section.
\section{Radix-2 Algorithms}
%
For radix-2 FFTs it is natural to write array indices in binary form
because the length of the data is a power of two. This is nicely
explained in the article {\em The FFT: Fourier Transforming One Bit at
a Time} by P.B. Visscher~\cite{visscher96}. A binary representation
for indices is the key to deriving the simplest efficient radix-2
algorithms.
We can write an index $b$ ($0 \leq b < 2^n$, where $N = 2^n$) in binary
representation like this,
%
\begin{equation}
b = [b_{n-1} \dots b_1 b_0] = 2^{n-1}b_{n-1} + \dots + 2 b_1 + b_0 .
\end{equation}
%
Each of the $b_0, b_1, \dots, b_{n-1}$ are the bits (either 0 or 1) of
$b$.
Using this notation the original definition of the DFT
can be rewritten as a sum over the bits of $b$,
%
\begin{equation}
h(a) = \sum_{b=0}^{N-1} g_b \exp(-2\pi i a b /N)
\end{equation}
%
to give an equivalent summation like this,
%
\begin{equation}
h([a_{n-1} \dots a_1 a_0]) =
\sum_{b_0=0}^{1}
\sum_{b_1=0}^{1}
\dots
\sum_{b_{n-1}=0}^{1}
g([b_{n-1} \dots b_1 b_0]) W_N^{ab}
\end{equation}
%
where the bits of $a$ are $a=[a_{n-1} \dots a_1 a_0]$.
To reduce the number of operations in the sum we will use the
periodicity of the exponential term,
%
\begin{equation}
W_N^{x+N} = W_N^{x}.
\end{equation}
%
Most of the products $ab$ in $W_N^{ab}$ are greater than $N$. By
making use of this periodicity they can all be collapsed down into the
range $0\dots N-1$. This allows us to reduce the number of operations
by combining common terms, modulo $N$. Using this idea we can derive
decimation-in-time or decimation-in-frequency algorithms, depending on
how we break the DFT summation down into common terms. We'll first
consider the decimation-in-time algorithm.
\subsection{Radix-2 Decimation-in-Time (DIT)}
%
To derive the decimation-in-time algorithm we start by separating
out the most significant bit of the index $b$,
%
\begin{equation}
[b_{n-1} \dots b_1 b_0] = 2^{n-1}b_{n-1} + [b_{n-2} \dots b_1 b_0]
\end{equation}
%
Now we can evaluate the innermost sum of the DFT without any
dependence on the remaining bits of $b$ in the exponential,
%
\begin{eqnarray}
h([a_{n-1} \dots a_1 a_0]) &=&
\sum_{b_0=0}^{1}
\sum_{b_1=0}^{1}
\dots
\sum_{b_{n-1}=0}^{1}
g(b)
W_N^{a(2^{n-1}b_{n-1}+[b_{n-2} \dots b_1 b_0])} \\
&=&
\sum_{b_0=0}^{1}
\sum_{b_1=0}^{1}
\dots
\sum_{b_{n-2}=0}^{1}
W_N^{a[b_{n-2} \dots b_1 b_0]}
\sum_{b_{n-1}=0}^{1}
g(b)
W_N^{a(2^{n-1}b_{n-1})}
\end{eqnarray}
%
Looking at the term $W_N^{a(2^{n-1}b_{n-1})}$ we see that we can also
remove most of the dependence on $a$, by using the periodicity of the
exponential,
%
\begin{eqnarray}
W_N^{a(2^{n-1}b_{n-1})} &=&
\exp(-2\pi i [a_{n-1} \dots a_1 a_0] 2^{n-1} b_{n-1}/ 2^n )\\
&=& \exp(-2\pi i [a_{n-1} \dots a_1 a_0] b_{n-1}/ 2 )\\
&=& \exp(-2\pi i ( 2^{n-2}a_{n-1} + \dots + a_1 + (a_0/2)) b_{n-1} )\\
&=& \exp(-2\pi i a_0 b_{n-1}/2 ) \\
&=& W_2^{a_0 b_{n-1}}
\end{eqnarray}
%
Thus the innermost exponential term simplifies so that it only
involves the highest order bit of $b$ and the lowest order bit of $a$,
and the sum can be reduced to,
%
\begin{equation}
h([a_{n-1} \dots a_1 a_0])
=
\sum_{b_0=0}^{1}
\sum_{b_1=0}^{1}
\dots
\sum_{b_{n-2}=0}^{1}
W_N^{a[b_{n-2} \dots b_1 b_0]}
\sum_{b_{n-1}=0}^{1}
g(b)
W_2^{a_0 b_{n-1}}.
\end{equation}
%
We can repeat this procedure for the next most significant bit of
$b$, $b_{n-2}$, using a similar identity,
%
\begin{eqnarray}
W_N^{a(2^{n-2}b_{n-2})}
&=& \exp(-2\pi i [a_{n-1} \dots a_1 a_0] 2^{n-2} b_{n-2}/ 2^n )\\
&=& W_4^{ [a_1 a_0] b_{n-2}} ,
\end{eqnarray}
%
to give a formula with even less dependence on the bits of $a$,
%
\begin{equation}
h([a_{n-1} \dots a_1 a_0])
=
\sum_{b_0=0}^{1}
\sum_{b_1=0}^{1}
\dots
\sum_{b_{n-3}=0}^{1}
W_N^{a[b_{n-3} \dots b_1 b_0]}
\sum_{b_{n-2}=0}^{1}
W_4^{[a_1 a_0] b_{n-2}}
\sum_{b_{n-1}=0}^{1}
g(b)
W_2^{a_0 b_{n-1}}.
\end{equation}
%
If we repeat the process for all the remaining bits we obtain a
simplified DFT formula which is the basis of the radix-2
decimation-in-time algorithm,
%
\begin{eqnarray}
h([a_{n-1} \dots a_1 a_0]) &=&
\sum_{b_0=0}^{1}
W_{N}^{[a_{n-1} \dots a_1 a_0]b_0}
%\sum_{b_1=0}^{1}
%W_{N/2}^{[a_{n-1} \dots a_1 a_0]b_1}
\dots
\sum_{b_{n-2}=0}^{1}
W_4^{ [a_1 a_0] b_{n-2}}
\sum_{b_{n-1}=0}^{1}
W_2^{a_0 b_{n-1}}
g(b)
\end{eqnarray}
%
To convert the formula to an algorithm we expand out the sum
recursively, evaluating each of the intermediate summations, which we
denote by $g_1$, $g_2$, \dots, $g_n$,
%
\begin{eqnarray}
g_1(a_0, b_{n-2}, b_{n-3}, \dots, b_1, b_0)
&=&
\sum_{b_{n-1}=0}^{1}
W_2^{a_0 b_{n-1}}
g([b_{n-1} b_{n-2} b_{n-3} \dots b_1 b_0]) \\
g_2(a_0, {}_{\phantom{-2}} a_{1}, b_{n-3}, \dots, b_1, b_0)
&=&
\sum_{b_{n-2}=0}^{1}
W_4^{[a_1 a_0] b_{n-2}}
g_1(a_0, b_{n-2}, b_{n-3}, \dots, b_1, b_0) \\
g_3(a_0, {}_{\phantom{-2}} a_{1}, {}_{\phantom{-3}} a_{2}, \dots, b_1, b_0)
&=&
\sum_{b_{n-3}=0}^{1}
W_8^{[a_2 a_1 a_0] b_{n-3}}
g_2(a_0, a_1, b_{n-3}, \dots, b_1, b_0) \\
\dots &=& \dots \\
g_n(a_0, a_{1}, a_{2}, \dots, a_{n-2}, a_{n-1})
&=&
\sum_{b_{0}=0}^{1}
W_N^{[a_{n-1} \dots a_1 a_0]b_0}
g_{n-1}(a_0, a_1, a_2, \dots, a_{n-2}, b_0)
\end{eqnarray}
%
After the final sum, we can obtain the transform $h$ from $g_n$,
%
\begin{equation}
h([a_{n-1} \dots a_1 a_0])
=
g_n(a_0, a_1, \dots, a_{n-1})
\end{equation}
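Before turning to storage questions, the recursion can be checked numerically.
The following Python sketch (illustrative only) evaluates the intermediate
sums $g_1$, $g_2$, $g_3$ for $N=8$, keeping the bits as explicit function
arguments exactly as in the equations above, and compares the result with the
DFT computed directly from its definition:

```python
import cmath

# Direct transcription of the intermediate sums g_1, g_2, g_3 for
# n = 3 (N = 8); dictionaries keyed by bit tuples leave the storage
# layout unspecified, as in the text.
N = 8
W = lambda M, k: cmath.exp(-2j * cmath.pi * k / M)
g = [complex(j + 1, 0) for j in range(N)]   # arbitrary sample data

g1 = {(a0, b1, b0): sum(W(2, a0 * b2) * g[4*b2 + 2*b1 + b0]
                        for b2 in (0, 1))
      for a0 in (0, 1) for b1 in (0, 1) for b0 in (0, 1)}
g2 = {(a0, a1, b0): sum(W(4, (2*a1 + a0) * b1) * g1[(a0, b1, b0)]
                        for b1 in (0, 1))
      for a0 in (0, 1) for a1 in (0, 1) for b0 in (0, 1)}
g3 = {(a0, a1, a2): sum(W(8, (4*a2 + 2*a1 + a0) * b0) * g2[(a0, a1, b0)]
                        for b0 in (0, 1))
      for a0 in (0, 1) for a1 in (0, 1) for a2 in (0, 1)}

# h([a2 a1 a0]) = g3(a0, a1, a2)
h = [g3[(a & 1, (a >> 1) & 1, (a >> 2) & 1)] for a in range(N)]

# compare with the DFT evaluated directly from its definition
direct = [sum(g[b] * W(N, a * b) for b in range(N)) for a in range(N)]
assert all(abs(x - y) < 1e-9 for x, y in zip(h, direct))
```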
%
Note that we left the storage arrangements of the intermediate sums
unspecified by using the bits as function arguments and not as an
index. The storage of intermediate sums is different for the
decimation-in-time and decimation-in-frequency algorithms.
Before deciding on the best storage scheme we'll show that the
computation of each stage, $g_1$, $g_2$, \dots, can be carried out
{\em in-place}. For example, in the case of $g_1$, the inputs are,
%
\begin{equation}
g([\underline{b_{n-1}} b_{n-2} b_{n-3} \dots b_1 b_0])
\end{equation}
%
for $b_{n-1}=(0,1)$, and the corresponding outputs are,
%
\begin{equation}
g_1(\underline{a_0},b_{n-2}, b_{n-3}, \dots, b_1, b_0)
\end{equation}
%
for $a_0=(0,1)$. It's clear that if we hold $b_{n-2}, b_{n-3}, \dots,
b_1, b_0$ fixed and compute the sum over $b_{n-1}$ in memory for both
values of $a_0 = 0,1$ then we can store the result for $a_0=0$ in the
location which originally had $b_{n-1}=0$ and the result for $a_0=1$ in
the location which originally had $b_{n-1}=1$. The two inputs and two
outputs are known as {\em dual node pairs}. At each stage of the
calculation the sums for each dual node pair are independent of the
others. It is this property which allows an in-place calculation.
So for an in-place pass our storage has to be arranged so that the two
outputs $g_1(a_0,\dots)$ overwrite the two input terms
$g([b_{n-1},\dots])$. Note that the order of $a$ is reversed from the
natural order of $b$, i.e.@: the least significant bit of $a$
replaces the most significant bit of $b$. This is inconvenient
because $a$ occurs in its natural order in all the exponentials,
$W^{ab}$. We could keep track of both $a$ and its bit-reversed
counterpart at all times but there is a neat trick
which avoids this: if we bit-reverse the order of the input data $g$
before we start the calculation we can also bit-reverse the order of
$a$ when storing intermediate results. Since the storage involving $a$
was originally in bit-reversed order the switch in the input $g$ now
allows us to use normal ordered storage for $a$, the same ordering
that occurs in the exponential factors.
This is complicated to explain, so here is an example of the 4 passes
needed for an $N=16$ decimation-in-time FFT, with the initial data
stored in bit-reversed order,
%
\begin{eqnarray}
g_1([b_0 b_1 b_2 a_0])
&=&
\sum_{b_3=0}^{1} W_2^{a_0 b_3} g([b_0 b_1 b_2 b_3])
\\
g_2([b_0 b_1 a_1 a_0])
&=&
\sum_{b_2=0}^{1} W_4^{[a_1 a_0] b_2} g_1([b_0 b_1 b_2 a_0])
\\
g_3([b_0 a_2 a_1 a_0])
&=&
\sum_{b_1=0}^{1} W_8^{[a_2 a_1 a_0] b_1} g_2([b_0 b_1 a_1 a_0])
\\
h(a) = g_4([a_3 a_2 a_1 a_0])
&=&
\sum_{b_0=0}^{1} W_{16}^{[a_3 a_2 a_1 a_0] b_0} g_3([b_0 a_2 a_1 a_0])
\end{eqnarray}
%
We compensate for the bit reversal of the input data by accessing $g$
with the bit-reversed form of $b$ in the first stage. This ensures
that we are still carrying out the same calculation, using the same
data, and not accessing different values. Only single bits of $b$ ever
occur in the exponential so we never need the bit-reversed form of
$b$.
Let's examine the third pass in detail,
%
\begin{equation}
g_3([b_0 a_2 a_1 a_0])
=
\sum_{b_1=0}^{1} W_8^{[a_2 a_1 a_0] b_1} g_2([b_0 b_1 a_1 a_0])
\end{equation}
%
First note that only one bit, $b_1$, varies in each summation. The
other bits of $b$ ($b_0$) and of $a$ ($a_1 a_0$) are essentially
``spectators'' -- we must loop over all combinations of these bits and
carry out the same basic calculation for each, remembering to update
the exponentials involving $W_8$ appropriately. If we are storing the
results in-place (with $g_3$ overwriting $g_2$) we will need to compute
the sums involving $b_1=0,1$ and $a_2=0,1$ simultaneously.
%
\begin{equation}
\left(
\begin{array}{c}
g_3([b_0 0 a_1 a_0]) \vphantom{W_8^{[]}} \\
g_3([b_0 1 a_1 a_0]) \vphantom{W_8^{[]}}
\end{array}
\right)
=
\left(
\begin{array}{c}
g_2([b_0 0 a_1 a_0]) + W_8^{[0 a_1 a_0]} g_2([b_0 1 a_1 a_0]) \\
g_2([b_0 0 a_1 a_0]) + W_8^{[1 a_1 a_0]} g_2([b_0 1 a_1 a_0])
\end{array}
\right)
\end{equation}
%
We can write this in a more symmetric form by simplifying the exponential,
%
\begin{equation}
W_8^{[a_2 a_1 a_0]}
= W_8^{4 a_2 + [a_1 a_0]}
= (-1)^{a_2} W_8^{[a_1 a_0]}
\end{equation}
%
\begin{equation}
\left(
\begin{array}{c}
g_3([b_0 0 a_1 a_0]) \vphantom{W_8^{[]}} \\
g_3([b_0 1 a_1 a_0]) \vphantom{W_8^{[]}}
\end{array}
\right)
=
\left(
\begin{array}{c}
g_2([b_0 0 a_1 a_0]) + W_8^{[a_1 a_0]} g_2([b_0 1 a_1 a_0]) \\
g_2([b_0 0 a_1 a_0]) - W_8^{[a_1 a_0]} g_2([b_0 1 a_1 a_0])
\end{array}
\right)
\end{equation}
%
The exponentials $W_8^{[a_1 a_0]}$ are referred to as {\em twiddle
factors}. The form of this calculation, a symmetric sum and
difference involving a twiddle factor, is called a {\em butterfly}.
It is often shown diagrammatically, and in the case $b_0=a_0=a_1=0$
would be drawn like this,
%
\begin{center}
\setlength{\unitlength}{1cm}
\begin{picture}(9,4)
%
%\put(0,0){\line(1,0){8}}
%\put(0,0){\line(0,1){4}}
%\put(8,4){\line(0,-1){4}}
%\put(8,4){\line(-1,0){8}}
%
\put(0,1){$g_2(4)$} \put(4.5,1){$g_3(4)=g_2(0) - W^a_8 g_2(4)$}
\put(0,3){$g_2(0)$} \put(4.5,3){$g_3(0)=g_2(0) + W^a_8 g_2(4)$}
\put(1,1){\vector(1,0){0.5}}
\put(1.5,1){\line(1,0){0.5}}
\put(1.5,0.5){$W^a_8$}
\put(1,3){\vector(1,0){0.5}}\put(1.5,3){\line(1,0){0.5}}
\put(2,1){\circle*{0.1}}
\put(2,3){\circle*{0.1}}
\put(2,1){\vector(1,1){2}}
\put(2,1){\vector(1,0){1}}
\put(3,1){\line(1,0){1}}
\put(3,0.5){$-1$}
\put(2,3){\vector(1,-1){2}}
\put(2,3){\vector(1,0){1}}
\put(3,3){\line(1,0){1}}
\put(4,1){\circle*{0.1}}
\put(4,3){\circle*{0.1}}
\end{picture}
\end{center}
%
The inputs are shown on the left and the outputs on the right. The
outputs are computed by multiplying the incoming lines by their
accompanying factors (shown next to the lines) and summing the results
at each node.
In general, denoting the bit for dual-node pairs by $\Delta$ and the
remaining bits of $a$ and $b$ by ${\hat a}$ and ${\hat b}$, the
butterfly is,
%
\begin{equation}
\left(
\begin{array}{c}
g({\hat b} + {\hat a}) \\
g({\hat b} + \Delta + {\hat a}) \\
\end{array}
\right)
\leftarrow
\left(
\begin{array}{c}
g({\hat b} + {\hat a}) + W_{2\Delta}^{\hat a} g({\hat b} + \Delta + {\hat a})\\
g({\hat b} + {\hat a}) - W_{2\Delta}^{\hat a} g({\hat b} + \Delta + {\hat a})
\end{array}
\right)
\end{equation}
%
where ${\hat a}$ runs from $0 \dots \Delta-1$ and ${\hat b}$ runs
through $0 \times 2\Delta$, $1\times 2\Delta$, $\dots$, $(N/2\Delta -
1)2\Delta$. The value of $\Delta$ is 1 on the first pass, 2 on the
second pass and $2^{n-1}$ on the $n$-th pass. Each pass requires
$N/2$ in-place computations, each involving two input locations and
two output locations.
In the example above $\Delta = [100] = 4$, ${\hat a} = [a_1 a_0]$ and
${\hat b} = [b_0 0 0 0]$.
This leads to the canonical radix-2 decimation-in-time FFT algorithm
for $2^n$ data points stored in the array $g(0) \dots g(2^n-1)$.
%
\begin{algorithm}
\STATE bit-reverse ordering of $g$
\STATE {$\Delta \Leftarrow 1$}
\FOR {$\mbox{pass} = 1 \dots n$}
\STATE {$W \Leftarrow \exp(-2 \pi i / 2\Delta)$}
\FOR {$(a = 0 ; a < \Delta ; a{++})$}
\FOR{$(b = 0 ; b < N ; b {+=} 2*\Delta)$}
\STATE{$t_0 \Leftarrow g(b+a) + W^a g(b+\Delta+a)$}
\STATE{$t_1 \Leftarrow g(b+a) - W^a g(b+\Delta+a)$}
\STATE{$g(b+a) \Leftarrow t_0$}
\STATE{$g(b+\Delta+a) \Leftarrow t_1$}
\ENDFOR
\ENDFOR
\STATE{$\Delta \Leftarrow 2\Delta$}
\ENDFOR
\end{algorithm}
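A direct transcription of this algorithm into Python (shown only as an
illustrative sketch; the implementation in the library itself is written in
C) looks like this:

```python
import cmath

def bit_reverse(g):
    """Return a copy of g with indices in bit-reversed order."""
    N = len(g)
    bits = N.bit_length() - 1
    out = [0j] * N
    for i in range(N):
        r = int(format(i, '0%db' % bits)[::-1], 2)
        out[r] = g[i]
    return out

def fft_dit(g):
    """Radix-2 decimation-in-time FFT; len(g) must be a power of two."""
    g = bit_reverse([complex(x) for x in g])
    N = len(g)
    delta = 1
    while delta < N:                 # one iteration per pass
        W = cmath.exp(-2j * cmath.pi / (2 * delta))
        for a in range(delta):
            Wa = W ** a
            for b in range(0, N, 2 * delta):
                t0 = g[b + a] + Wa * g[b + delta + a]
                t1 = g[b + a] - Wa * g[b + delta + a]
                g[b + a] = t0
                g[b + delta + a] = t1
        delta *= 2
    return g

# check against the DFT evaluated directly from its definition
data = [1, 2, 3, 4, 5, 6, 7, 8]
N = len(data)
direct = [sum(data[b] * cmath.exp(-2j * cmath.pi * a * b / N)
              for b in range(N)) for a in range(N)]
result = fft_dit(data)
assert all(abs(x - y) < 1e-9 for x, y in zip(result, direct))
```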
%
%This algorithm appears in Numerical Recipes as the routine {\tt
%four1}, where the implementation is attributed to N.M. Brenner.
%
\subsection{Details of the Implementation}
It is straightforward to implement a simple radix-2 decimation-in-time
routine from the algorithm above. Some optimizations can be made by
pulling the special case of $a=0$ out of the loop over $a$, to avoid
unnecessary multiplications when $W^a=1$,
%
\begin{algorithm}
\FOR{$(b = 0 ; b < N ; b {+=} 2*\Delta)$}
\STATE{$t_0 \Leftarrow g(b) + g(b+\Delta)$}
\STATE{$t_1 \Leftarrow g(b) - g(b+\Delta)$}
\STATE{$g(b) \Leftarrow t_0$}
\STATE{$g(b+\Delta) \Leftarrow t_1$}
\ENDFOR
\end{algorithm}
%
There are several algorithms for doing fast bit-reversal. We use the
Gold-Rader algorithm, which is simple and does not require any working
space,
%
\begin{algorithm}
\STATE {$j \Leftarrow 0$}
\FOR{$i = 0 \dots N - 2$}
\STATE {$ k = N / 2 $}
\IF {$i < j$}
\STATE {swap $g(i)$ and $g(j)$}
\ENDIF
\WHILE {$k \leq j$}
\STATE{$j \Leftarrow j - k$}
\STATE{$k \Leftarrow k / 2$}
\ENDWHILE
\STATE{$j \Leftarrow j + k$}
\ENDFOR
\end{algorithm}
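Here is a Python sketch of the Gold-Rader loop (illustrative only), checked
against a naive shift-based bit reversal:

```python
def gold_rader(g):
    """In-place Gold-Rader bit reversal of a list whose length N is a
    power of two; no scratch array is required."""
    N = len(g)
    j = 0
    for i in range(N - 1):           # i = 0 ... N-2
        k = N // 2
        if i < j:
            g[i], g[j] = g[j], g[i]  # swap each dual pair exactly once
        while k <= j:                # increment j in bit-reversed
            j -= k                   # arithmetic: propagate the carry
            k //= 2                  # from the high bit downwards
        j += k
    return g

# naive reversal by left- and right-shifts on the index, for comparison
def naive_reverse(i, bits):
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

data = list(range(16))
expected = [0] * 16
for i in range(16):
    expected[naive_reverse(i, 4)] = data[i]
assert gold_rader(list(range(16))) == expected
```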
%
The Gold-Rader algorithm is typically twice as fast as a naive
bit-reversal algorithm (where the bit reversal is carried out by
left-shifts and right-shifts on the index). The library also has a
routine for the Rodriguez bit reversal algorithm, which also does not
require any working space~\cite{rodriguez89}. There are faster bit
reversal algorithms, but they all use additional scratch
space~\cite{rosel89}.
Within the loop for $a$ we can compute $W^a$ using a trigonometric
recursion relation,
%
\begin{eqnarray}
W^{a+1} &=& W W^a \\
&=& (\cos(2\pi/2\Delta) - i \sin(2\pi/2\Delta)) W^a
\end{eqnarray}
%
This requires only $2 \log_2 N$ trigonometric calls, to compute the
initial values of $W = \exp(-2\pi i /2\Delta)$ for each pass.
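The recursion is simple enough to demonstrate in a few lines of Python
(illustrative only; $\Delta = 8$ is an arbitrary choice):

```python
import cmath
import math

# Generate the twiddle factors W^a for one pass by the recursion
# W^(a+1) = W * W^a, starting from a single pair of trig calls,
# and compare each value against a direct complex exponential.
delta = 8
W = complex(math.cos(math.pi / delta), -math.sin(math.pi / delta))
Wa = 1 + 0j                          # W^0
for a in range(delta):
    direct = cmath.exp(-2j * cmath.pi * a / (2 * delta))
    assert abs(Wa - direct) < 1e-12
    Wa *= W                          # advance the recursion
```

For large $N$ the rounding error of the recursion grows, which is one reason
a precomputed lookup table (as used later in the mixed-radix routine) can be
preferable.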
\subsection{Radix-2 Decimation-in-Frequency (DIF)}
%
To derive the decimation-in-frequency algorithm we start by separating
out the lowest order bit of the index $a$. Here is an example for the
decimation-in-frequency $N=16$ DFT.
%
\begin{eqnarray}
W_{16}^{[a_3 a_2 a_1 a_0][b_3 b_2 b_1 b_0]}
&=&
W_{16}^{[a_3 a_2 a_1 a_0][b_2 b_1 b_0]} W_{16}^{[a_3 a_2 a_1 a_0] [b_3
0 0 0]} \\
&=&
W_8^{[a_3 a_2 a_1][b_2 b_1 b_0]} W_{16}^{a_0 [b_2 b_1 b_0]} W_2^{a_0
b_3} \\
&=&
W_8^{[a_3 a_2 a_1][b_2 b_1 b_0]} W_{16}^{a_0 [b_2 b_1 b_0]} (-1)^{a_0 b_3}
\end{eqnarray}
%
By repeating the same type of expansion on the term,
%
\begin{equation}
W_8^{[a_3 a_2 a_1][b_2 b_1 b_0]}
\end{equation}
%
we can reduce the transform to an alternative simple form,
%
\begin{equation}
h(a) =
\sum_{b_0=0}^1 (-1)^{a_3 b_0} W_4^{a_2 b_0}
\sum_{b_1=0}^1 (-1)^{a_2 b_1} W_8^{a_1 [b_1 b_0]}
\sum_{b_2=0}^1 (-1)^{a_1 b_2} W_{16}^{a_0 [b_2 b_1 b_0]}
\sum_{b_3=0}^1 (-1)^{a_0 b_3} g(b)
\end{equation}
%
To implement this we can again write the sum recursively. In this case
we do not have the problem of the order of $a$ being bit reversed --
the calculation can be done in-place using the natural ordering of
$a$ and $b$,
%
\begin{eqnarray}
g_1([a_0 b_2 b_1 b_0])
&=&
W_{16}^{a_0 [b_2 b_1 b_0]}
\sum_{b_3=0}^1 (-1)^{a_0 b_3} g([b_3 b_2 b_1 b_0]) \\
g_2([a_0 a_1 b_1 b_0])
&=&
W_{8}^{a_1 [b_1 b_0]}
\sum_{b_2=0}^1 (-1)^{a_1 b_2} g_1([a_0 b_2 b_1 b_0]) \\
g_3([a_0 a_1 a_2 b_0])
&=&
W_{4}^{a_2 b_0}
\sum_{b_1=0}^1 (-1)^{a_2 b_1} g_2([a_0 a_1 b_1 b_0]) \\
h(a)
=
g_4([a_0 a_1 a_2 a_3])
&=&
\sum_{b_0=0}^1 (-1)^{a_3 b_0} g_3([a_0 a_1 a_2 b_0])
\end{eqnarray}
%
The final pass leaves the data for $h(a)$ in bit-reversed order, but
this is easily fixed by a final bit-reversal of the ordering.
The basic in-place calculation or butterfly for each pass is slightly
different from the decimation-in-time version,
%
\begin{equation}
\left(
\begin{array}{c}
g({\hat a} + {\hat b}) \\
g({\hat a} + \Delta + {\hat b}) \\
\end{array}
\right)
\leftarrow
\left(
\begin{array}{c}
g({\hat a} + {\hat b}) + g({\hat a} + \Delta + {\hat b})\\
W_{2\Delta}^{\hat b}
\left( g({\hat a} + {\hat b}) - g({\hat a} + \Delta + {\hat b}) \right)
\end{array}
\right)
\end{equation}
%
In each pass ${\hat b}$ runs from $0 \dots \Delta-1$ and ${\hat
a}$ runs through $0, 2\Delta, \dots, (N/2\Delta -1) 2\Delta$. On the first
pass we start with $\Delta=N/2$ (in the $N=16$ example above, $\Delta=8$),
and on subsequent passes $\Delta$ takes the values $N/4, N/8, \dots, 1$.
This leads to the canonical radix-2 decimation-in-frequency FFT
algorithm for $2^n$ data points stored in the array $g(0) \dots
g(2^n-1)$.
%
\begin{algorithm}
\STATE {$\Delta \Leftarrow 2^{n-1}$}
\FOR {$\mbox{pass} = 1 \dots n$}
\STATE {$W \Leftarrow \exp(-2 \pi i / 2\Delta)$}
\FOR {$(b = 0 ; b < \Delta ; b++)$}
\FOR{$(a = 0 ; a < N ; a += 2*\Delta)$}
\STATE{$t_0 \Leftarrow g(a+b) + g(a+\Delta+b)$}
\STATE{$t_1 \Leftarrow W^b \left( g(a+b) - g(a+\Delta+b) \right)$}
\STATE{$g(a+b) \Leftarrow t_0$}
\STATE{$g(a+\Delta+b) \Leftarrow t_1$}
\ENDFOR
\ENDFOR
\STATE{$\Delta \Leftarrow \Delta/2$}
\ENDFOR
\STATE bit-reverse ordering of $g$
\end{algorithm}
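As with the decimation-in-time case, the algorithm can be transcribed
directly into a short Python sketch (illustrative only); note that the bit
reversal now comes after the passes:

```python
import cmath

def fft_dif(g):
    """Radix-2 decimation-in-frequency FFT; len(g) must be a power of
    two.  The final bit reversal restores natural ordering."""
    g = [complex(x) for x in g]
    N = len(g)
    delta = N // 2
    while delta >= 1:                # one iteration per pass
        W = cmath.exp(-2j * cmath.pi / (2 * delta))
        for b in range(delta):
            Wb = W ** b
            for a in range(0, N, 2 * delta):
                t0 = g[a + b] + g[a + delta + b]
                t1 = Wb * (g[a + b] - g[a + delta + b])
                g[a + b] = t0
                g[a + delta + b] = t1
        delta //= 2
    # final bit reversal (the Gold-Rader algorithm could be used here)
    bits = N.bit_length() - 1
    return [g[int(format(i, '0%db' % bits)[::-1], 2)] for i in range(N)]

# check against the DFT evaluated directly from its definition
data = [1, 2, 3, 4, 5, 6, 7, 8]
N = len(data)
direct = [sum(data[b] * cmath.exp(-2j * cmath.pi * a * b / N)
              for b in range(N)) for a in range(N)]
result = fft_dif(data)
assert all(abs(x - y) < 1e-9 for x, y in zip(result, direct))
```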
%
\section{Self-Sorting Mixed-Radix Complex FFTs}
%
This section is based on the review article {\em Self-sorting
Mixed-Radix Fast Fourier Transforms} by Clive
Temperton~\cite{temperton83}. You should consult his article for full
details of all the possible algorithms (there are many
variations). Here I have annotated the derivation of the simplest
mixed-radix decimation-in-frequency algorithm.
For general-$N$ FFT algorithms the simple binary notation of radix-2
algorithms is no longer useful. The mixed-radix FFT has to be built
up using products of matrices acting on a data vector. The aim is to
take the full DFT matrix $W_N$ and factor it into a set of small,
sparse matrices corresponding to each factor of $N$.
We'll denote the components of matrices using either subscripts or
function notation,
%
\begin{equation}
M_{ij} = M(i,j)
\end{equation}
%
with (C-like) indices running from 0 to $N-1$. Matrix products will be
denoted using square brackets,
%
\begin{equation}
[AB]_{ij} = \sum_{k} A_{ik} B_{kj}
\end{equation}
%
%
Three special matrices will be needed in the mixed-radix factorization
of the DFT: the identity matrix, $I$, a permutation matrix, $P$, and a
matrix of twiddle factors, $D$, as well as the normal DFT matrices
$W_n$.
We write the identity matrix of order $r$ as $I_r(n,m)$,
%
\begin{equation}
I_r(n,m) = \delta_{nm}
\end{equation}
%
for $0 \leq n,m \leq r-1$.
We also need to define a permutation matrix $P^a_b$ that performs
digit reversal of the ordering of a vector. If the index of a vector
$j= 0\dots N-1$ is factorized into $j = la +m$, with $0 \leq l \leq
b-1$ and $0 \leq m \leq a-1$ then the operation of the matrix $P$ will
exchange positions $la+m$ and $mb+l$ in the vector (this sort of
digit-reversal is the generalization of bit-reversal to a mixed-radix
number system with radices $a$ and $b$).
In mathematical terms $P$ is a square matrix of size $ab \times ab$
with the property,
%
\begin{eqnarray}
P^a_b(j,k) &=& 1 ~\mbox{if}~ j=ra+s ~\mbox{and}~ k=sb+r,
~\mbox{with}~ 0 \leq s \leq a-1,~ 0 \leq r \leq b-1 \\
&=& 0 ~\mbox{otherwise}
\end{eqnarray}
%
Finally the FFT algorithm needs a matrix of twiddle factors, $D^a_b$,
for the trigonometric sums. $D^a_b$ is a diagonal square matrix of
size $ab \times ab$ with the definition,
%
\begin{eqnarray}
D^a_b(j,k) &=& \omega^{sr}_{ab} ~\mbox{if}~ j=k=sb+r,
~\mbox{with}~ 0 \leq s \leq a-1,~ 0 \leq r \leq b-1 \\
&=& 0 ~\mbox{otherwise}
\end{eqnarray}
%
where $\omega_{ab} = e^{-2\pi i/ab}$.
\subsection{The Kronecker Product}
The Kronecker matrix product plays an important role in all the
algorithms for combining operations on different subspaces. The
Kronecker product $A \otimes B$ of two square matrices $A$ and $B$, of
sizes $a \times a$ and $b \times b$ respectively, is a square matrix
of size $a b \times a b$, defined as,
%
\begin{equation}
[A \otimes B] (tb+u, rb+s) = A(t,r) B(u,s)
\end{equation}
%
where $0 \leq u,s < b$ and $0 \leq t,r < a$. Let's examine a specific
example. If we take a $2 \times 2$ matrix and a $3
\times 3$ matrix,
%
\begin{equation}
\begin{array}{ll}
A =
\left(
\begin{array}{cc}
a_{11} & a_{12} \\
a_{21} & a_{22}
\end{array}
\right)
&
B =
\left(
\begin{array}{ccc}
b_{11} & b_{12} & b_{13} \\
b_{21} & b_{22} & b_{23} \\
b_{31} & b_{32} & b_{33}
\end{array}
\right)
\end{array}
\end{equation}
%
then the Kronecker product $A \otimes B$ is,
%
\begin{eqnarray}
A \otimes B &= &
\left(
\begin{array}{cc}
a_{11} B & a_{12} B \\
a_{21} B & a_{22} B
\end{array}
\right) \\
&=&
\left(
\begin{array}{cccccc}
a_{11} b_{11} & a_{11} b_{12} & a_{11} b_{13} &
a_{12} b_{11} & a_{12} b_{12} & a_{12} b_{13} \\
a_{11} b_{21} & a_{11} b_{22} & a_{11} b_{23} &
a_{12} b_{21} & a_{12} b_{22} & a_{12} b_{23} \\
a_{11} b_{31} & a_{11} b_{32} & a_{11} b_{33} &
a_{12} b_{31} & a_{12} b_{32} & a_{12} b_{33} \\
a_{21} b_{11} & a_{21} b_{12} & a_{21} b_{13} &
a_{22} b_{11} & a_{22} b_{12} & a_{22} b_{13} \\
a_{21} b_{21} & a_{21} b_{22} & a_{21} b_{23} &
a_{22} b_{21} & a_{22} b_{22} & a_{22} b_{23} \\
a_{21} b_{31} & a_{21} b_{32} & a_{21} b_{33} &
a_{22} b_{31} & a_{22} b_{32} & a_{22} b_{33}
\end{array}
\right)
\end{eqnarray}
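The index layout of the definition can be checked against this example with
a few lines of pure Python (illustrative only; the entries of $A$ and $B$
are arbitrary numbers standing in for $a_{11}$, $b_{11}$, etc.):

```python
# Check the component definition of the Kronecker product,
# [A (x) B](tb+u, rb+s) = A(t,r) B(u,s), for a 2x2 A and a 3x3 B.
def kron(A, B):
    b = len(B)
    n = len(A) * b
    return [[A[i // b][j // b] * B[i % b][j % b] for j in range(n)]
            for i in range(n)]

A = [[1, 2],
     [3, 4]]
B = [[5, 6, 7],
     [8, 9, 10],
     [11, 12, 13]]
K = kron(A, B)

a, b = 2, 3
for t in range(a):
    for r in range(a):
        for u in range(b):
            for s in range(b):
                assert K[t*b + u][r*b + s] == A[t][r] * B[u][s]

# block structure: the top-left 3x3 block is a_11 * B
assert K[0][:3] == [5, 6, 7]
```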
%
When the Kronecker product $A \otimes B$ acts on a vector of length
$ab$, each matrix operates on a different subspace of the vector.
Writing the index $i$ as $i=t b + u$, with $0\leq u \leq b-1$
and $0\leq t\leq a-1$, we can see this explicitly by looking at components,
%
\begin{eqnarray}
[(A \otimes B) v]_{(tb+u)}
& = & \sum_{t'=0}^{a-1} \sum_{u'=0}^{b-1}
[A \otimes B]_{(tb+u,t'b+u')} v_{t'b+u'} \\
& = & \sum_{t'u'} A_{tt'} B_{uu'} v_{t'b+u'}
\end{eqnarray}
%
The matrix $B$ operates on the ``index'' $u'$, for all values of $t'$, and
the matrix $A$ operates on the ``index'' $t'$, for all values of $u'$.
%
The most important property needed for deriving the FFT factorization
is that the matrix product of two Kronecker products is the Kronecker
product of the two matrix products,
%
\begin{equation}
(A \otimes B)(C \otimes D) = (AC \otimes BD)
\end{equation}
%
This follows straightforwardly from the original definition of the
Kronecker product.
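This identity is easy to spot-check numerically. The following Python sketch
(illustrative only) compares the two sides for random $2\times 2$ and
$3\times 3$ real matrices:

```python
import random

# Spot-check of the identity (A (x) B)(C (x) D) = (AC) (x) (BD)
# for random small matrices with a = 2, b = 3.
random.seed(1)

def rand_mat(n):
    return [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]

def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def kron(A, B):
    b = len(B)
    n = len(A) * b
    return [[A[i // b][j // b] * B[i % b][j % b] for j in range(n)]
            for i in range(n)]

A, C = rand_mat(2), rand_mat(2)
B, D = rand_mat(3), rand_mat(3)
lhs = matmul(kron(A, B), kron(C, D))
rhs = kron(matmul(A, C), matmul(B, D))
assert all(abs(lhs[i][j] - rhs[i][j]) < 1e-12
           for i in range(6) for j in range(6))
```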
\subsection{Two factor case, $N=ab$}
%
First consider the simplest possibility, where the data length $N$ can
be divided into two factors, $N=ab$. The aim is to reduce the DFT
matrix $W_N$ into simpler matrices corresponding to each factor. To
make the derivation easier we will start from the known factorization
and verify it (the initial factorization can be guessed by
generalizing from simple cases). Here is the factorization we are
going to prove,
%
\begin{equation}
W_{ab} = (W_b \otimes I_a) P^a_b D^a_b (W_a \otimes I_b).
\end{equation}
%
We can check it by expanding the product into components,
%
\begin{eqnarray}
\lefteqn{[(W_b \otimes I_a) P^a_b D^a_b (W_a \otimes I_b)](la+m,rb+s)} \\
& = &
\sum_{u=0}^{b-1} \sum_{t=0}^{a-1}
[(W_b \otimes I_a)](la+m,ua+t) [P^a_b D^a_b (W_a \otimes I_b)](ua+t,rb+s)
\end{eqnarray}
%
where we have split the indices to match the Kronecker product, $0 \leq
m, r \leq a-1$, $0 \leq l, s \leq b-1$. The first term in the sum can
easily be reduced to its component form,
%
\begin{eqnarray}
[(W_b \otimes I_a)](la+m,ua+t)
&=& W_b(l,u) I_a(m,t) \\
&=& \omega_b^{lu} \delta_{mt}
\end{eqnarray}
%
The second term is more complicated. We can expand the Kronecker
product like this,
\begin{eqnarray}
(W_a \otimes I_b)(tb+u,rb+s)
&=& W_a(t,r) I_b(u,s) \\
&=& \omega_a^{tr} \delta_{us}
\end{eqnarray}
%
and use this term to build up the product, $P^a_b D^a_b (W_a \otimes
I_b)$. We first multiply by $D^a_b$,
%
\begin{equation}
[D^a_b (W_a \otimes I_b)](tb+u,rb+s)
=
\omega^{tu}_{ab} \omega^{tr}_{a} \delta_{su}
\end{equation}
%
and then apply the permutation matrix, $P^a_b$, which digit-reverses
the ordering of the first index, to obtain,
%
\begin{equation}
[P^a_b D^a_b (W_a \otimes I_b)](ua+t,rb+s)
=
\omega^{tu}_{ab} \omega^{tr}_{a} \delta_{su}
\end{equation}
%
Combining the two terms in the matrix product we can obtain the full
expansion in terms of the exponential $\omega$,
%
\begin{eqnarray}
[(W_b \otimes I_a) P^a_b D^a_b (W_a \otimes I_b)](la+m,rb+s)
&=&
\sum_{u=0}^{b-1} \sum_{t=0}^{a-1}
\omega_b^{lu} \delta_{mt} \omega^{tu}_{ab} \omega^{tr}_{a} \delta_{su}
\end{eqnarray}
%
If we evaluate this sum explicitly we can make the connection between
the product involving $W_a$ and $W_b$ (above) and the expansion of the
full DFT matrix $W_{ab}$,
%
\begin{eqnarray}
\sum_{u=0}^{b-1} \sum_{t=0}^{a-1}
\omega_b^{lu} \delta_{mt} \omega^{tu}_{ab} \omega^{tr}_{a} \delta_{su}
&=& \omega^{ls}_b \omega^{ms}_{ab} \omega^{mr}_a \\
&=& \omega^{als + ms +bmr}_{ab} \\
&=& \omega^{als + ms +bmr}_{ab} \omega^{lrab}_{ab} \quad\mbox{using~} \omega^{ab}_{ab} =1\\
&=& \omega^{(la+m)(rb+s)}_{ab} \\
&=& W_{ab}(la+m,rb+s)
\end{eqnarray}
%
The final line shows that the matrix product given above is identical
to the full two-factor DFT matrix, $W_{ab}$.
%
Thus the full DFT matrix $W_{ab}$ for two factors $a$, $b$ can be
broken down into a product of sub-transforms, $W_a$ and $W_b$, plus
permutations, $P$, and twiddle factors, $D$, according to the formula,
%
\begin{equation}
W_{ab} = (W_b \otimes I_a) P^a_b D^a_b (W_a \otimes I_b).
\end{equation}
%
This relation is the foundation of the general-$N$ mixed-radix FFT algorithm.
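The factorization can also be confirmed numerically. The following Python
sketch (illustrative only) builds $P^a_b$, $D^a_b$ and the Kronecker
products from their definitions for $a=2$, $b=3$, and compares the matrix
product with $W_6$ element by element:

```python
import cmath

# Check of the two-factor identity
#   W_ab = (W_b (x) I_a) P^a_b D^a_b (W_a (x) I_b)
# for a = 2, b = 3 (N = 6).
a, b = 2, 3
N = a * b
w = lambda M, k: cmath.exp(-2j * cmath.pi * k / M)

def W(n):
    return [[w(n, j * k) for k in range(n)] for j in range(n)]

def I(n):
    return [[1.0 if j == k else 0.0 for k in range(n)] for j in range(n)]

def kron(A, B):
    nb = len(B)
    n = len(A) * nb
    return [[A[i // nb][j // nb] * B[i % nb][j % nb] for j in range(n)]
            for i in range(n)]

def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# permutation P^a_b : P(r*a + s, s*b + r) = 1
P = [[0.0] * N for _ in range(N)]
for r in range(b):
    for s in range(a):
        P[r * a + s][s * b + r] = 1.0

# twiddle factors D^a_b : diagonal, D(s*b + r) = w_ab^(s*r)
D = [[0.0] * N for _ in range(N)]
for s in range(a):
    for r in range(b):
        D[s * b + r][s * b + r] = w(N, s * r)

product = matmul(kron(W(b), I(a)), matmul(P, matmul(D, kron(W(a), I(b)))))
assert all(abs(product[j][k] - w(N, j * k)) < 1e-12
           for j in range(N) for k in range(N))
```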
\subsection{Three factor case, $N=abc$}
%
The result for the two-factor expansion can easily be generalized to
three factors. We first consider $abc$ as being a product of two
factors $a$ and $(bc)$, and then further expand the product $(bc)$ into
$b$ and $c$. The first step of the expansion looks like this,
%
\begin{eqnarray}
W_{abc} &=& W_{a(bc)}\\
&=& (W_{bc} \otimes I_a) P^a_{bc} D^a_{bc} (W_a \otimes I_{bc}) .
\end{eqnarray}
%
And after using the two-factor result to expand out $W_{bc}$ we obtain
the factorization of $W_{abc}$,
%
\begin{eqnarray}
W_{abc} &=& (((W_c \otimes I_b) P^b_c D^b_c (W_b \otimes I_c)) \otimes I_a )
P^a_{bc} D^a_{bc} (W_a \otimes I_{bc}) \\
&=& (W_c \otimes I_{ab}) (P^b_c D^b_c \otimes I_a) (W_b \otimes I_{ac}) P^a_{bc} D^a_{bc} (W_a \otimes I_{bc})
\end{eqnarray}
%
We can write this factorization in a product form, with one term for
each factor,
%
\begin{equation}
W_{abc} = T_3 T_2 T_1
\end{equation}
%
where we read off $T_1$, $T_2$ and $T_3$,
%
\begin{eqnarray}
T_1 &=& P^a_{bc} D^a_{bc} (W_a \otimes I_{bc}) \\
T_2 &=& (P^b_c D^b_c \otimes I_a) (W_b \otimes I_{ac}) \\
T_3 &=& (W_c \otimes I_{ab} )
\end{eqnarray}
%
\subsection{General case, $N=f_1 f_2 \dots f_{n_f}$}
%
If we continue the procedure that we have used for two- and
three-factors then a general pattern begins to emerge in the
factorization of $W_{f_1 f_2 \dots f_{n_f}}$. To see the beginning of
the pattern we can rewrite the three factor case as,
%
\begin{eqnarray}
T_1 &=& (P^a_{bc} D^a_{bc} \otimes I_1) (W_a \otimes I_{bc}) \\
T_2 &=& (P^b_c D^b_c \otimes I_a) (W_b \otimes I_{ac}) \\
T_3 &=& (P^c_1 D^c_1 \otimes I_{ab}) (W_c \otimes I_{ab} )
\end{eqnarray}
%
using the special cases $P^c_1 = D^c_1 = I_c$.
%
In general, we can write the factorization of $W_N$ for $N= f_1 f_2
\dots f_{n_f}$ as,
%
\begin{equation}
W_N = T_{n_f} \dots T_2 T_1
\end{equation}
%
where the matrix factors $T_i$ are,
%
\begin{equation}
T_i = (P^{f_i}_{q_i} D^{f_i}_{q_i} \otimes I_{p_{i-1}}) ( W_{f_i}
\otimes I_{m_i})
\end{equation}
%
We have defined the following three additional variables $p$, $q$ and
$m$ to denote different partial products of the factors,
%
\begin{eqnarray}
p_i &=& f_1 f_2 \dots f_i \quad (p_0 = 1) \\
q_i &=& N / p_i \\
m_i &=& N / f_i
\end{eqnarray}
%
Note that the FFT modules $W$ are applied before the permutations $P$,
which makes this a decimation-in-frequency algorithm.
\subsection{Implementation}
%
Now to the implementation of the algorithm. We start with a vector of
data, $z$, as input and want to apply the transform,
%
\begin{eqnarray}
x &=& W_N z \\
&=& T_{n_f} \dots T_2 T_1 z
\end{eqnarray}
%
where $T_i = (P^{f_i}_{q_i} D^{f_i}_{q_i} \otimes I_{p_{i-1}}) (
W_{f_i} \otimes I_{m_i})$.
The outer structure of the implementation will be a loop over the
$n_f$ factors, applying each matrix $T_i$ to the vector in turn to
build up the complete transform.
%
\begin{algorithm}
\FOR{$(i = 1 \dots n_f)$}
\STATE{$v \Leftarrow T_i v $}
\ENDFOR
\end{algorithm}
%
The order of the factors is not important. Now we examine the iteration
$v \Leftarrow T_i v$, which we'll write as,
%
\begin{equation}
v' =
(P^{f_i}_{q_i} D^{f_i}_{q_i} \otimes I_{p_{i-1}}) ~
( W_{f_i} \otimes I_{m_i}) v
\end{equation}
%
There are two Kronecker product matrices in this iteration. The
rightmost matrix, which is the first to be applied, is a DFT of length
$f_i$ applied to $N/f_i$ subsets of the data. We'll call this $t$,
since it will be a temporary array,
%
\begin{equation}
t = (W_{f_i} \otimes I_{m_i}) v
\end{equation}
%
The second matrix applies a permutation and the exponential
twiddle-factors. We'll call this $v'$, since it is the result of the
full iteration on $v$,
%
\begin{equation}
v' = (P^{f_i}_{q_i} D^{f_i}_{q_i} \otimes I_{p_{i-1}}) t
\end{equation}
The effect of the matrix $(W_{f_i} \otimes I_{m_i})$ is best seen by
an example. Suppose the factor is $f_i = 3$, and the length of the FFT
is $N=6$, then the relevant Kronecker product is,
%
\begin{equation}
t = (W_3 \otimes I_2) v
\end{equation}
%
which expands out to,
%
\begin{equation}
\left(
\begin{array}{c}
t_0 \\
t_1 \\
t_2 \\
t_3 \\
t_4 \\
t_5
\end{array}
\right)
=
\left(
\begin{array}{cccccc}
W_3(0,0) & 0 & W_3(0,1) & 0 & W_3(0,2) & 0 \\
0 & W_3(0,0) & 0 & W_3(0,1) & 0 & W_3(0,2) \\
W_3(1,0) & 0 & W_3(1,1) & 0 & W_3(1,2) & 0 \\
0 & W_3(1,0) & 0 & W_3(1,1) & 0 & W_3(1,2) \\
W_3(2,0) & 0 & W_3(2,1) & 0 & W_3(2,2) & 0 \\
0 & W_3(2,0) & 0 & W_3(2,1) & 0 & W_3(2,2)
\end{array}
\right)
\left(
\begin{array}{c}
v_0 \\
v_1 \\
v_2 \\
v_3 \\
v_4 \\
v_5
\end{array}
\right)
\end{equation}
%
We can rearrange the components in a computationally convenient form,
\begin{equation}
\left(
\begin{array}{c}
t_0 \\
t_2 \\
t_4 \\
t_1 \\
t_3 \\
t_5
\end{array}
\right)
=
\left(
\begin{array}{cccccc}
W_3(0,0) & W_3(0,1) & W_3(0,2) & 0 & 0 & 0 \\
W_3(1,0) & W_3(1,1) & W_3(1,2) & 0 & 0 & 0 \\
W_3(2,0) & W_3(2,1) & W_3(2,2) & 0 & 0 & 0 \\
0 & 0 & 0 & W_3(0,0) & W_3(0,1) & W_3(0,2) \\
0 & 0 & 0 & W_3(1,0) & W_3(1,1) & W_3(1,2) \\
0 & 0 & 0 & W_3(2,0) & W_3(2,1) & W_3(2,2)
\end{array}
\right)
\left(
\begin{array}{c}
v_0 \\
v_2 \\
v_4 \\
v_1 \\
v_3 \\
v_5
\end{array}
\right)
\end{equation}
%
which clearly shows that we just need to apply the $3\times 3$ DFT
matrix $W_3$ twice, once to the sub-vector of elements $(v_0, v_2, v_4)$,
and independently to the remaining sub-vector $(v_1, v_3, v_5)$.
In the general case, if we index $t$ as $t_k = t(\lambda,\mu) =
t_{\lambda m + \mu}$ then $\lambda = 0 \dots f-1$ is an index within
each transform of length $f$ and $\mu = 0 \dots m-1$ labels the
independent subsets of data. We can see this by showing the
calculation with all indices present,
%
\begin{equation}
t = (W_f \otimes I_m) z
\end{equation}
%
becomes,
%
\begin{eqnarray}
t_{\lambda m + \mu} &=& \sum_{\lambda'=0}^{f-1} \sum_{\mu'=0}^{m-1}
(W_f \otimes I_m)_{(\lambda m + \mu)(\lambda' m + \mu')}
z_{\lambda'm + \mu'} \\
&=& \sum_{\lambda'\mu'} (W_f)_{\lambda\lambda'} \delta_{\mu\mu'}
z_{\lambda'm+\mu'} \\
&=& \sum_{\lambda'} (W_f)_{\lambda\lambda'} z_{\lambda'm+\mu}
\end{eqnarray}
%
The DFTs on the index $\lambda$ will be computed using
special optimized modules for each $f$.
To calculate the next stage,
%
\begin{equation}
v'=(P^f_q D^f_q \otimes I_{p_{i-1}}) t
\end{equation}
%
we note that the Kronecker product performs $p_{i-1}$ independent
multiplications of $PD$ on $p_{i-1}$ different subsets of $t$, each of
length $q_{i-1}$. To match the index $\lambda m + \mu$ of
$t(\lambda,\mu)$ to these subspaces we split $\mu$, which runs from 0
to $m-1$, into $\mu = a p_{i-1} + b$, where $a = 0 \dots q-1$ and
$b=0 \dots p_{i-1}-1$ (this is possible because $m=p_{i-1}q$),
%
\begin{eqnarray}
\lambda m + \mu &=& \lambda m + a p_{i-1} + b \\
&=& (\lambda q + a) p_{i-1} + b.
\end{eqnarray}
%
Now we can expand the second stage,
%
\begin{eqnarray}
v'&=& (P^f_q D^f_q \otimes I_{p_{i-1}}) t \\
v'_{\lambda m + \mu} &=& \sum_{\lambda' \mu'}
(P^f_q D^f_q \otimes I_{p_{i-1}})_{(\lambda m + \mu)(\lambda' m + \mu')}
t_{\lambda' m + \mu'} \\
v'_{(\lambda q + a) p_{i-1} + b} &=& \sum_{\lambda' a' b'}
(
P^f_q D^f_q \otimes I_{p_{i-1}}
)_{((\lambda q+ a)p_{i-1} + b)((\lambda' q+ a')p_{i-1} + b')}
t_{(\lambda' q + a')p_{i-1} +b'}
\end{eqnarray}
%
The first step in removing redundant indices is to take advantage of
the identity matrix $I$ and separate the subspaces of the Kronecker
product,
%
\begin{equation}
(
P^f_q D^f_q \otimes I_{p_{i-1}}
)_{((\lambda q+ a)p_{i-1} + b)((\lambda' q+ a')p_{i-1} + b')}
=
(P^f_q D^f_q)_{(\lambda q + a)(\lambda' q + a')}
\delta_{bb'}
\end{equation}
%
This eliminates one sum, leaving us with,
%
\begin{equation}
v'_{(\lambda q + a) p_{i-1} + b}
=
\sum_{\lambda' a' }
(P^f_q D^f_q)_{(\lambda q + a)(\lambda' q + a')} t_{(\lambda'q+a')p_{i-1} + b}
\end{equation}
%
We can insert the definition of $D^f_q$ to give,
%
\begin{equation}
\phantom{v'_{(\lambda q + a) p_{i-1} + b}}
= \sum_{\lambda'a'} (P^f_q)_{(\lambda q + a)(\lambda'q + a')}
\omega^{\lambda'a'}_{q_{i-1}} t_{(\lambda'q+a')p_{i-1}+b}
\end{equation}
%
Using the definition of $P^f_q$, which exchanges an index of $\lambda
q + a$ with $a f + \lambda$, we get a final result with no matrix
multiplication,
%
\begin{equation}
v'_{(a f + \lambda) p_{i-1} + b}
= \omega^{\lambda a}_{q_{i-1}} t_{(\lambda q + a)p_{i-1} + b}
\end{equation}
%
All we have to do is premultiply each element of the temporary vector
$t$ by an exponential twiddle factor and store the result in another
index location, according to the digit reversal permutation of $P$.
Here is the algorithm to implement the mixed-radix FFT,
%
\begin{algorithm}
\FOR{$i = 1 \dots n_f$}
\FOR{$a = 0 \dots q-1$}
\FOR{$b = 0 \dots p_{i-1} - 1$}
\FOR{$\lambda = 0 \dots f-1$}
\STATE{$t_\lambda \Leftarrow
\sum_{\lambda'=0}^{f-1} W_f(\lambda,\lambda') v_{b+\lambda'm+ap_{i-1}}$}
\COMMENT{DFT matrix-multiply module}
\ENDFOR
\FOR{$\lambda = 0 \dots f-1$}
\STATE{$v'_{(af+\lambda)p_{i-1}+b}
\Leftarrow \omega^{\lambda a}_{q_{i-1}} t_\lambda$}
\ENDFOR
\ENDFOR
\ENDFOR
\STATE{$v \Leftarrow v'$}
\ENDFOR
\end{algorithm}
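The complete algorithm is short enough to transcribe directly into Python
(an illustrative sketch only, using a scratch array for $v'$ on each pass as
described above; the library implementation is in C):

```python
import cmath

def fft_mixed_radix(v, factors):
    """Mixed-radix decimation-in-frequency FFT; the product of the
    factors must equal len(v).  Output is in natural order."""
    N = len(v)
    v = [complex(x) for x in v]
    w = lambda M, k: cmath.exp(-2j * cmath.pi * k / M)
    p = 1                                  # p_{i-1}, starting at p_0 = 1
    for f in factors:
        q = N // (p * f)                   # q_i = N / p_i
        m = N // f                         # m_i = N / f_i
        vp = [0j] * N                      # scratch array for v'
        for a in range(q):
            for b in range(p):
                # DFT matrix-multiply module of length f
                t = [sum(w(f, lam * lamp) * v[b + lamp * m + a * p]
                         for lamp in range(f)) for lam in range(f)]
                # permutation and twiddle factors (q_{i-1} = f * q)
                for lam in range(f):
                    vp[(a * f + lam) * p + b] = w(f * q, lam * a) * t[lam]
        v = vp
        p *= f                             # p_i for the next pass
    return v

# check against the direct DFT for N = 12 = 3 * 2 * 2
data = list(range(1, 13))
N = len(data)
direct = [sum(data[b] * cmath.exp(-2j * cmath.pi * a * b / N)
              for b in range(N)) for a in range(N)]
result = fft_mixed_radix(data, [3, 2, 2])
assert all(abs(x - y) < 1e-9 for x, y in zip(result, direct))
```

Because the permutations are built into each pass, the result appears in
natural order with no final digit-reversal, which is what is meant by a
self-sorting algorithm.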
%
\subsection{Details of the implementation}
%
First the function {\tt gsl\_fft\_complex\_wavetable\_alloc} allocates
$n$ elements of scratch space (to hold the vector $v'$ for each
iteration) and $n$ elements for a trigonometric lookup table of
twiddle factors.
Then the length $n$ must be factorized. There is a general
factorization function {\tt gsl\_fft\_factorize} which takes a list of
preferred factors. It first factors out the preferred factors and then
removes general remaining prime factors.
The algorithm used to generate the trigonometric lookup table is
%
\begin{algorithm}
\FOR {$a = 1 \dots n_f$}
\FOR {$b = 1 \dots f_a - 1$}
\FOR {$c = 1 \dots q_a$}
\STATE $\mbox{trig[k++]} = \exp(- 2\pi i b c p_{a-1}/N)$
\ENDFOR
\ENDFOR
\ENDFOR
\end{algorithm}
%
Note that $\sum_{1}^{n_f} \sum_{1}^{f_i-1} \sum_{1}^{q_i} 1 =
\sum_{1}^{n_f} (f_i-1)q_i = n - 1$ so $n$ elements are always
sufficient to store the lookup table. This layout is chosen because we need
to compute $\omega_{q_{i-1}}^{\lambda a} t_\lambda$ in
the FFT. In terms of the lookup table we can write this as,
%
\begin{eqnarray}
\omega_{q_{i-1}}^{\lambda a} t_\lambda
&=& \exp(-2\pi i \lambda a/q_{i-1}) t_\lambda \\
&=& \exp(-2\pi i \lambda a p_{i-1}/N) t_\lambda \\
&=& \left\{
\begin{array}{ll}
t_\lambda & a=0 \\
\mbox{trig}[\mbox{twiddle[i]}+\lambda q+(a-1)] t_\lambda & a\not=0
\end{array}
\right.
\end{eqnarray}
%
\begin{sloppypar}
\noindent
The array {\tt twiddle[i]} maintains a set of pointers into {\tt trig}
for the starting points for the outer loop. The core of the
implementation is {\tt gsl\_fft\_complex}. This function loops over
the chosen factors of $N$, computing the iteration $v'=T_i v$ for each
pass. When a dedicated small-$N$ module exists for a factor the
iteration is handed off to it, for example {\tt
gsl\_fft\_complex\_pass\_3} or {\tt
gsl\_fft\_complex\_pass\_5}. Factors without a dedicated module are
handled by the general-$N$ routine {\tt gsl\_fft\_complex\_pass\_n}. The
structure of one of the small-$N$ modules is a simple transcription of
the basic algorithm given above. Here is an example for {\tt
gsl\_fft\_complex\_pass\_3}. For a pass with a factor of 3 we have to
calculate the following expression,
\end{sloppypar}%
\begin{equation}
v'_{(a f + \lambda) p_{i-1} + b}
=
\sum_{\lambda' = 0,1,2}
\omega^{\lambda a}_{q_{i-1}} W^{\lambda \lambda'}_{3}
v_{b + \lambda' m + a p_{i-1}}
\end{equation}
%
for $b = 0 \dots p_{i-1} - 1$, $a = 0 \dots q_{i} - 1$ and $\lambda =
0, 1, 2$. This is implemented as,
%
\begin{algorithm}
\FOR {$a = 0 \dots q-1$}
\FOR {$b = 0 \dots p_{i-1} - 1$}
\STATE {$
\left(
\begin{array}{c}
t_0 \\ t_1 \\ t_2
\end{array}
\right)
=
\left(
\begin{array}{ccc}
W^{0}_3 & W^{0}_3 & W^{0}_3 \\
W^{0}_3 & W^{1}_3 & W^{2}_3 \\
W^{0}_3 & W^{2}_3 & W^{4}_3
\end{array}
\right)
\left(
\begin{array}{l}
v_{b + a p_{i-1}} \\
v_{b + a p_{i-1} + m} \\
v_{b + a p_{i-1} +2m}
\end{array}
\right)$}
\STATE {$ v'_{a p_{i} + b} = t_0$}
\STATE {$ v'_{a p_{i} + b + p_{i-1}} = \omega^{a}_{q_{i-1}} t_1$}
\STATE {$ v'_{a p_{i} + b + 2 p_{i-1}} = \omega^{2a}_{q_{i-1}} t_2$}
\ENDFOR
\ENDFOR
\end{algorithm}
%
In the code we use the variables {\tt from0}, {\tt from1}, {\tt from2}
to index the input locations,
%
\begin{eqnarray}
\mbox{\tt from0} &=& b + a p_{i-1} \\
\mbox{\tt from1} &=& b + a p_{i-1} + m \\
\mbox{\tt from2} &=& b + a p_{i-1} + 2m
\end{eqnarray}
%
and the variables {\tt to0}, {\tt to1}, {\tt to2} to index the output
locations in the scratch vector $v'$,
%
\begin{eqnarray}
\mbox{\tt to0} &=& b + a p_{i} \\
\mbox{\tt to1} &=& b + a p_{i} + p_{i-1} \\
\mbox{\tt to2} &=& b + a p_{i} + 2 p_{i-1}
\end{eqnarray}
%
The DFT matrix multiplication is computed using the optimized
sub-transform modules given in the next section. The twiddle factors
$\omega^a_{q_{i-1}}$ are taken out of the {\tt trig} array.
To compute the inverse transform we go back to the definition of the
Fourier transform and note that the inverse matrix is just the complex
conjugate of the forward matrix (with a factor of $1/N$),
%
\begin{equation}
W^{-1}_N = W^*_N / N
\end{equation}
%
Therefore we can easily compute the inverse transform by conjugating
all the complex elements of the DFT matrices and twiddle factors that
we apply. (An alternative strategy is to conjugate the input data,
take a forward transform, and then conjugate the output data).
\section{Fast Sub-transform Modules}
%
To implement the mixed-radix FFT we still need to compute the
small-$N$ DFTs for each factor. Fortunately many highly-optimized
small-$N$ modules are available, following the work of Winograd who
showed how to derive efficient small-$N$ sub-transforms by number
theoretic techniques.
The algorithms in this section all compute,\footnote{Erratum: due to a difference in sign convention, these transforms are actually for $x_a = \sum_{b=0}^{N-1} W_N^{-ab} z_b$. Thanks to Andrew Holme for the correction.}
%
\begin{equation}
x_a = \sum_{b=0}^{N-1} W_N^{ab} z_b
\end{equation}
%
The sub-transforms given here are the ones recommended by Temperton
and differ slightly from the canonical Winograd modules. According to
Temperton~\cite{temperton83} they are slightly more robust against
rounding errors and trade off some additions for multiplications.
%
For the $N=2$ DFT,
%
\begin{equation}
\begin{array}{ll}
x_0 = z_0 + z_1, &
x_1 = z_0 - z_1.
\end{array}
\end{equation}
%
For the $N=3$ DFT,
%
\begin{equation}
\begin{array}{lll}
t_1 = z_1 + z_2, &
t_2 = z_0 - t_1/2, &
t_3 = \sin(\pi/3) (z_1 - z_2),
\end{array}
\end{equation}
\begin{equation}
\begin{array}{lll}
x_0 = z_0 + t_1, &
x_1 = t_2 + i t_3, &
x_2 = t_2 - i t_3.
\end{array}
\end{equation}
%
The $N=4$ transform involves only additions and subtractions,
%
\begin{equation}
\begin{array}{llll}
t_1 = z_0 + z_2, &
t_2 = z_1 + z_3, &
t_3 = z_0 - z_2, &
t_4 = z_1 - z_3,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{llll}
x_0 = t_1 + t_2, &
x_1 = t_3 + i t_4, &
x_2 = t_1 - t_2, &
x_3 = t_3 - i t_4.
\end{array}
\end{equation}
%
For the $N=5$ DFT,
%
\begin{equation}
\begin{array}{llll}
t_1 = z_1 + z_4, &
t_2 = z_2 + z_3, &
t_3 = z_1 - z_4, &
t_4 = z_2 - z_3,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{llll}
t_5 = t_1 + t_2, &
t_6 = (\sqrt{5}/4) (t_1 - t_2), &
t_7 = z_0 - t_5/4, \\
\end{array}
\end{equation}
\begin{equation}
\begin{array}{llll}
t_8 = t_7 + t_6, &
t_9 = t_7 - t_6, \\
\end{array}
\end{equation}
\begin{equation}
\begin{array}{llll}
t_{10} = \sin(2\pi/5) t_3 + \sin(2\pi/10) t_4, &
t_{11} = \sin(2\pi/10) t_3 - \sin(2\pi/5) t_4,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{llll}
x_0 = z_0 + t_5,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{llll}
x_1 = t_8 + i t_{10}, &
x_2 = t_9 + i t_{11},
\end{array}
\end{equation}
\begin{equation}
\begin{array}{llll}
x_3 = t_9 - i t_{11}, &
x_4 = t_8 - i t_{10}.
\end{array}
\end{equation}
%
The DFT matrix for $N=6$ can be written as a combination of $N=3$ and
$N=2$ transforms with index permutations,
%
\begin{equation}
\left(
\begin{array}{c}
x_0 \\
x_4 \\
x_2 \\
\hline x_3 \\
x_1 \\
x_5
\end{array}
\right)
=
\left(
\begin{array}{ccc|ccc}
& & & & & \\
&W_3& & &W_3& \\
& & & & & \\
\hline & & & & & \\
&W_3& & &-W_3& \\
& & & & &
\end{array}
\right)
\left(
\begin{array}{c}
z_0 \\
z_2 \\
z_4 \\
\hline z_3 \\
z_5 \\
z_1
\end{array}
\right)
\end{equation}
%
This simplification is an example of the Prime Factor Algorithm, which
can be used because the factors 2 and 3 are mutually prime. For more
details consult one of the books on number theory for
FFTs~\cite{elliott82,blahut}. We can take advantage of the simple
indexing scheme of the PFA to write the $N=6$ DFT as,
%
\begin{equation}
\begin{array}{lll}
t_1 = z_2 + z_4, &
t_2 = z_0 - t_1/2, &
t_3 = \sin(\pi/3) (z_2 - z_4),
\end{array}
\end{equation}
\begin{equation}
\begin{array}{lll}
t_4 = z_5 + z_1, &
t_5 = z_3 - t_4/2, &
t_6 = \sin(\pi/3) (z_5 - z_1),
\end{array}
\end{equation}
\begin{equation}
\begin{array}{lll}
t_7 = z_0 + t_1, &
t_8 = t_2 + i t_3, &
t_9 = t_2 - i t_3,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{lll}
t_{10} = z_3 + t_4, &
t_{11} = t_5 + i t_6, &
t_{12} = t_5 - i t_6,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{lll}
x_0 = t_7 + t_{10}, &
x_4 = t_8 + t_{11}, &
x_2 = t_9 + t_{12},
\end{array}
\end{equation}
\begin{equation}
\begin{array}{lll}
x_3 = t_7 - t_{10}, &
x_1 = t_8 - t_{11}, &
x_5 = t_9 - t_{12}.
\end{array}
\end{equation}
For any remaining general factors we use Singleton's efficient method
for computing a DFT~\cite{singleton}. Although it is an $O(N^2)$
algorithm it does reduce the number of multiplications by a factor of
4 compared with a naive evaluation of the DFT. If we look at the
general structure of a DFT matrix, shown schematically below,
%
\begin{equation}
\left(
\begin{array}{c}
h_0 \\
h_1 \\
h_2 \\
\vdots \\
h_{N-2} \\
h_{N-1}
\end{array}
\right)
=
\left(
\begin{array}{cccccc}
1 & 1 & 1 & \cdots & 1 & 1 \\
1 & W & W & \cdots & W & W \\
1 & W & W & \cdots & W & W \\
\vdots & \vdots & \vdots & \cdots & \vdots & \vdots \\
1 & W & W & \cdots & W & W \\
1 & W & W & \cdots & W & W
\end{array}
\right)
\left(
\begin{array}{c}
g_0 \\
g_1 \\
g_2 \\
\vdots \\
g_{N-2} \\
g_{N-1}
\end{array}
\right)
\end{equation}
%
we see that the outer elements of the DFT matrix are all unity. We can
remove these trivial multiplications but we will still be left with an
$(N-1) \times (N-1)$ sub-matrix of complex entries, which would appear
to require $(N-1)^2$ complex multiplications. Singleton's method
uses symmetries of the DFT matrix to convert the complex
multiplications to an equivalent number of real multiplications. We
start with the definition of the DFT in component form,
%
\begin{equation}
a_k + i b_k = \sum_{j=0}^{f-1} (x_j+iy_j)(\cos(2\pi jk/f) - i\sin(2\pi jk/f))
\end{equation}
%
The zeroth component can be computed using only additions,
%
\begin{equation}
a_0 + i b_0 = \sum_{j=0}^{f-1} (x_j + i y_j)
\end{equation}
%
We can rewrite the remaining components as,
%
\begin{eqnarray}
a_k + i b_k & = x_0 + i y_0 & +
\sum_{j=1}^{(f-1)/2} (x_j + x_{f-j}) \cos(2\pi jk/f)
+ (y_j - y_{f-j}) \sin(2\pi jk/f) \\
& & + i\sum_{j=1}^{(f-1)/2} (y_j + y_{f-j}) \cos(2\pi jk/f)
- (x_j - x_{f-j}) \sin(2\pi jk/f)
\end{eqnarray}
%
by using the following trigonometric identities,
%
\begin{eqnarray}
\cos(2\pi(f-j)k/f) &=& \phantom{-}\cos(2\pi jk/f) \\
\sin(2\pi(f-j)k/f) &=& -\sin(2\pi jk/f)
\end{eqnarray}
%
These remaining components can all be computed using four partial
sums,
%
\begin{eqnarray}
a_k + i b_k & = & (a^+_k - a^-_k) + i (b^+_k + b^-_k) \\
a_{f-k} + i b_{f-k} & = & (a^+_k + a^-_k) + i (b^+_k - b^-_k)
\end{eqnarray}
%
for $k = 1, 2, \dots, (f-1)/2$, where,
%
\begin{eqnarray}
a^+_k &=& x_0 + \sum_{j=1}^{(f-1)/2} (x_j + x_{f-j}) \cos(2\pi jk/f) \\
a^-_k &=& \phantom{x_0} - \sum_{j=1}^{(f-1)/2} (y_j - y_{f-j}) \sin(2\pi jk/f) \\
b^+_k &=& y_0 + \sum_{j=1}^{(f-1)/2} (y_j + y_{f-j}) \cos(2\pi jk/f) \\
b^-_k &=& \phantom{y_0} - \sum_{j=1}^{(f-1)/2} (x_j - x_{f-j}) \sin(2\pi jk/f)
\end{eqnarray}
%
Note that the higher components $k'=f-k$ can be obtained directly
without further computation once $a^+$, $a^-$, $b^+$ and $b^-$ are
known. There are $4 \times (f-1)/2$ different sums, each involving
$(f-1)/2$ real multiplications, giving a total of $(f-1)^2$ real
multiplications instead of $(f-1)^2$ complex multiplications.
To implement Singleton's method we make use of the input and output
vectors $v$ and $v'$ as scratch space, copying data back and forth
between them to obtain the final result. First we use $v'$ to store
the terms of the symmetrized and anti-symmetrized vectors of the form
$x_j + x_{f-j}$ and $x_j - x_{f-j}$. Then we multiply these by the
appropriate trigonometric factors to compute the partial sums $a^+$,
$a^-$, $b^+$ and $b^-$, storing the results $a_k + i b_k$ and $a_{f-k}
+ i b_{f-k}$ back in $v$. Finally we multiply the DFT output by any
necessary twiddle factors and place the results in $v'$.
\section{FFTs for real data}
%
This section is based on the articles {\em Fast Mixed-Radix Real
Fourier Transforms} by Clive Temperton~\cite{temperton83real} and
{\em Real-Valued Fast Fourier Transform Algorithms} by Sorensen,
Jones, Heideman and Burrus~\cite{burrus87real}. The DFT of a real
sequence has a special symmetry, called a {\em conjugate-complex} or
{\em half-complex} symmetry,
%
\begin{equation}
h(a) = h(N-a)^*
\end{equation}
%
The element $h(0)$ is real, and when $N$ is even $h(N/2)$ is also
real. It is straightforward to prove the symmetry,
%
\begin{eqnarray}
h(a) &=& \sum W^{ab}_N g(b) \\
h(N-a)^* &=& \sum W^{-(N-a)b}_N g(b)^* \\
&=& \sum W^{-Nb}_N W^{ab}_N g(b) \qquad \mbox{($g(b)^* = g(b)$ for real data)} \\
&=& \sum W^{ab}_N g(b) \qquad{(W^{-Nb}_N=1)}
\end{eqnarray}
%
Real-valued data is very common in practice (perhaps more common than
complex data) so it is worth having efficient FFT routines for real
data. In principle an FFT for real data should need half the
operations of an FFT on the equivalent complex data (where the
imaginary parts are set to zero). There are two different strategies
for computing FFTs of real-valued data:
One strategy is to ``pack'' the real data (of length $N$) into a
complex array (of length $N/2$) by index transformations. A complex
FFT routine can then be used to compute the transform of that array.
By further index transformations the result can then be
``unpacked'' into the FFT of the original real data. It is also possible
to do two real FFTs simultaneously by packing one in the real part and
the other in the imaginary part of the complex array. These
techniques have some disadvantages. The packing and unpacking
procedures always add $O(N)$ operations, and packing a real array of
length $N$ into a complex array of length $N/2$ is only possible if
$N$ is even. In addition, if two unconnected datasets with very
different magnitudes are packed together in the same FFT there could
be ``cross-talk'' between them due to a loss of precision.
A more straightforward strategy is to start with an FFT algorithm,
such as the complex mixed-radix algorithm, and prune out all the
operations involving the zero imaginary parts of the initial data. The
FFT is linear so the imaginary part of the data can be decoupled from
the real part. This procedure leads to a dedicated FFT for real-valued
data which works for any length and does not perform any unnecessary
operations. It also allows us to derive a corresponding inverse FFT
routine which transforms a half-complex sequence back into real data.
\subsection{Radix-2 FFTs for real data}
%
Before embarking on the full mixed-radix real FFT we'll start with the
radix-2 case. It contains all the essential features of the
general-$N$ algorithm. To make it easier to see the analogy between
the two we will use the mixed-radix notation to describe the
factors. The factors are all 2,
%
\begin{equation}
f_1 = 2, f_2 = 2, \dots, f_{n_f} = 2
\end{equation}
%
and the products $p_i$ are powers of 2,
%
\begin{eqnarray}
p_0 & = & 1 \\
p_1 & = & f_1 = 2 \\
p_2 & = & f_1 f_2 = 4 \\
\dots &=& \dots \\
p_i & = & f_1 f_2 \dots f_i = 2^i
\end{eqnarray}
%
Using this notation we can rewrite the radix-2 decimation-in-time
algorithm as,
%
\begin{algorithm}
\STATE bit-reverse ordering of $g$
\FOR {$i = 1 \dots n$}
\FOR {$a = 0 \dots p_{i-1} - 1$}
\FOR{$b = 0 \dots q_i - 1$}
\STATE{$
\left(
\begin{array}{c}
g(b p_i + a) \\
g(b p_i + p_{i-1} + a)
\end{array}
\right)
=
\left(
\begin{array}{c}
g(b p_i + a) + W^a_{p_i} g(b p_i + p_{i-1} + a) \\
g(b p_i + a) - W^a_{p_i} g(b p_i + p_{i-1} + a)
\end{array}
\right) $}
\ENDFOR
\ENDFOR
\ENDFOR
\end{algorithm}
%
where we have used $p_i = 2 \Delta$, and factored $2 \Delta$ out of
the original definition of $b$ ($b \to b p_i$).
If we go back to the original recurrence relations we can see how to
write the intermediate results in a way which make the
real/half-complex symmetries explicit at each step. The first pass is
just a set of FFTs of length-2 on real values,
%
\begin{equation}
g_1([b_0 b_1 b_2 a_0]) = \sum_{b_3} W^{a_0 b_3}_2 g([b_0 b_1 b_2 b_3])
\end{equation}
%
Using the symmetry $FFT(x)_k = FFT(x)^*_{N-k}$ we have the reality
condition,
%
\begin{eqnarray}
g_1([b_0 b_1 b_2 0]) &=& \mbox{real} \\
g_1([b_0 b_1 b_2 1]) &=& \mbox{real'}
\end{eqnarray}
%
In the next pass we have a set of length-4 FFTs on the original data,
%
\begin{eqnarray}
g_2([b_0 b_1 a_1 a_0])
&=&
\sum_{b_2}\sum_{b_3}
W^{[a_1 a_0]b_2}_4 W^{a_0 b_3}_2
g([b_0 b_1 b_2 b_3]) \\
&=&
\sum_{b_2}\sum_{b_3}
W^{[a_1 a_0][b_3 b_2]}_4
g([b_0 b_1 b_2 b_3])
\end{eqnarray}
%
This time symmetry gives us the following conditions on the
transformed data,
%
\begin{eqnarray}
g_2([b_0 b_1 0 0]) &=& \mbox{real} \\
g_2([b_0 b_1 0 1]) &=& x + i y \\
g_2([b_0 b_1 1 0]) &=& \mbox{real'} \\
g_2([b_0 b_1 1 1]) &=& x - i y
\end{eqnarray}
%
We can see a pattern emerging here: the $i$-th pass computes a set of
independent length-$2^i$ FFTs on the original real data,
%
\begin{eqnarray}
g_i ( b p_i + a ) = \sum_{a' = 0}^{p_i-1} W_{p_i}^{aa'} g(b p_i + a')
\quad
\mbox{for $b = 0 \dots q_i - 1$}
\end{eqnarray}
%
As a consequence we can apply the symmetry for an FFT of real data
to all the intermediate results -- not just the final result.
In general after the $i$-th pass we will
have the symmetry,
%
\begin{eqnarray}
g_i(b p_i) &=& \mbox{real} \\
g_i(b p_i + a) &=& g_i(b p_i + p_i - a)^* \qquad a = 1 \dots p_{i}/2 - 1 \\
g_i(b p_i + p_{i}/2) &=& \mbox{real'}
\end{eqnarray}
%
In the next section we'll show that this is a general property of
decimation-in-time algorithms. The same is not true for the
decimation-in-frequency algorithm, which does not have any simple
symmetries in the intermediate results.
Since we can obtain the values of $g_i(b p_i + a)$ for $a > p_i/2$
from the values for $a < p_i/2$ we can cut our computation and
storage in half compared with the full-complex case.
%
We can easily rewrite the algorithm to show how the computation can be
halved, simply by limiting all terms to involve only values for $a
\leq p_{i}/2$. Whenever we encounter a term $g_i(b p_i + a)$ with $a >
p_{i}/2$ we rewrite it in terms of its complex symmetry partner,
$g_i(b p_i + a')^*$, where $a' = p_i - a$. The butterfly computes two
values for each value of $a$, $b p_i + a$ and $b p_i + p_{i-1} - a$,
so we actually only need to compute from $a = 0$ to $p_{i-1}/2$. This
gives the following algorithm,
%
\begin{algorithm}
\FOR {$a = 0 \dots p_{i-1}/2$}
\FOR{$b = 0 \dots q_i - 1$}
\STATE{$
\left(
\begin{array}{c}
g(b p_i + a) \\
g(b p_i + p_{i-1} - a)^*
\end{array}
\right)
=
\left(
\begin{array}{c}
g(b p_{i} + a) + W^a_{p_i} g(b p_i + p_{i-1} + a) \\
g(b p_{i} + a) - W^a_{p_i} g(b p_i + p_{i-1} + a)
\end{array}
\right) $}
\ENDFOR
\ENDFOR
\end{algorithm}
%
Although we have halved the number of operations we also need a
storage arrangement which will halve the memory requirement. The
algorithm above is still formulated in terms of a complex array $g$,
but the input to our routine will naturally be an array of $N$ real
values which we want to use in-place.
Therefore we need a storage scheme which lays out the real and
imaginary parts within the real array, in a natural way so that there
is no need for complicated index calculations. In the radix-2
algorithm we do not have any additional scratch space. The storage
scheme has to be designed to accommodate the in-place calculation
taking account of dual node pairs.
Here is a scheme which takes these restrictions into account: On the
$i$-th pass we store the real part of $g(b p_i + a)$ in location $b
p_i + a$. We store the imaginary part in location $b p_i + p_i -
a$. This is the redundant location which corresponds to the conjugate
term $g(b p_i + a)^* = g(b p_i + p_i -a)$, so it is not needed. When
the results are purely real (as in the cases $a = 0$ and $a = p_i/2$)
we store only the real part and drop the zero imaginary part.
This storage scheme has to work in-place, because the radix-2 routines
should not use any scratch space. We will now check the in-place
property for each butterfly. A crucial point is that the scheme is
pass-dependent. Namely, when we are computing the result for pass $i$
we are reading the results of pass $i-1$, and we must access them
using the scheme from the previous pass, i.e. we have to remember that
the results from the previous pass were stored using $b p_{i-1} + a$,
not $b p_i + a$, and the symmetry for these results will be $g_{i-1}(b
p_{i-1} + a) = g_{i-1}(b p_{i-1} + p_{i-1} - a)^*$. To take this into
account we'll write the right hand side of the iteration, $g_{i-1}$,
in terms of $p_{i-1}$. For example, instead of $b p_i$, which occurs
naturally in $g_i(b p_i + a)$ we will use $2 b p_{i-1}$, since $p_i =
2 p_{i-1}$.
Let's start with the butterfly for $a = 0$,
%
\begin{equation}
\left(
\begin{array}{c}
g(b p_i) \\
g(b p_i + p_{i-1})^*
\end{array}
\right)
=
\left(
\begin{array}{c}
g(2 b p_{i-1}) + g((2 b + 1) p_{i-1}) \\
g(2 b p_{i-1}) - g((2 b + 1) p_{i-1})
\end{array}
\right)
\end{equation}
%
By the symmetry $g_{i-1}(b p_{i-1} + a) = g_{i-1}(b p_{i-1} + p_{i-1}
- a)^*$ all the inputs are purely real. The input $g(2 b p_{i-1})$ is
read from location $2 b p_{i-1}$ and $g((2 b + 1) p_{i-1})$ is read
from the location $(2 b + 1) p_{i-1}$. Here is the full breakdown,
%
\begin{center}
\renewcommand{\arraystretch}{1.5}
\begin{tabular}{|l|lll|}
\hline Term & & Location & \\
\hline
$g(2 b p_{i-1})$
& real part & $2 b p_{i-1} $ &$= b p_i$ \\
& imag part & --- & \\
$g((2 b+1) p_{i-1})$
& real part & $(2 b + 1) p_{i-1} $&$= b p_i + p_{i-1} $ \\
& imag part & --- & \\
\hline
$g(b p_{i})$
& real part & $b p_i$ &\\
& imag part & --- & \\
$g(b p_{i} + p_{i-1})$
& real part & $b p_i + p_{i-1}$& \\
& imag part & --- & \\
\hline
\end{tabular}
\end{center}
%
The conjugation of the output term $g(b p_i + p_{i-1})^*$ is
irrelevant here since the results are purely real. The real results
are stored in locations $b p_i$ and $b p_i + p_{i-1}$, which
overwrites the inputs in-place.
The general butterfly for $a = 1 \dots p_{i-1}/2 - 1$ is,
%
\begin{equation}
\left(
\begin{array}{c}
g(b p_i + a) \\
g(b p_i + p_{i-1} - a)^*
\end{array}
\right)
=
\left(
\begin{array}{c}
g(2 b p_{i-1} + a) + W^a_{p_i} g((2 b + 1) p_{i-1} + a) \\
g(2 b p_{i-1} + a) - W^a_{p_i} g((2 b + 1) p_{i-1} + a)
\end{array}
\right)
\end{equation}
%
All the terms are complex. To store a conjugated term like $g(b' p_i +
a')^*$, where $a' < p_i/2$, we take the real part and store it in
location $b' p_i + a'$, then take the imaginary part, negate it, and
store the result in location $b' p_i + p_i - a'$.
Here is the full breakdown of the inputs and outputs from the
butterfly,
%
\begin{center}
\renewcommand{\arraystretch}{1.5}
\begin{tabular}{|l|lll|}
\hline Term & & Location & \\
\hline
$g(2 b p_{i-1} + a)$
& real part & $2 b p_{i-1} + a $ &$= b p_i + a$ \\
& imag part & $2 b p_{i-1} + p_{i-1} - a$ &$= b p_i + p_{i-1} - a$ \\
$g((2 b+1) p_{i-1} + a)$
& real part & $(2 b+1) p_{i-1} + a $&$= b p_i + p_{i-1} + a $ \\
& imag part & $(2 b+1) p_{i-1} + p_{i-1} - a $&$= b p_i + p_i - a$\\
\hline
$g(b p_{i} + a)$
& real part & $b p_i + a$ &\\
& imag part & $b p_i + p_i - a$& \\
$g(b p_{i} + p_{i-1} - a)$
& real part & $b p_i + p_{i-1} - a$& \\
& imag part & $b p_i + p_{i-1} + a$& \\
\hline
\end{tabular}
\end{center}
%
By comparing the input locations and output locations we can see
that the calculation is done in place.
The final butterfly for $a = p_{i-1}/2$ is,
%
\begin{equation}
\left(
\begin{array}{c}
g(b p_i + p_{i-1}/2) \\
g(b p_i + p_{i-1} - p_{i-1}/2)^*
\end{array}
\right)
=
\left(
\begin{array}{c}
g(2 b p_{i-1} + p_{i-1}/2) - i g((2 b + 1) p_{i-1} + p_{i-1}/2) \\
g(2 b p_{i-1} + p_{i-1}/2) + i g((2 b + 1) p_{i-1} + p_{i-1}/2)
\end{array}
\right)
\end{equation}
%
where we have substituted for the twiddle factor, $W^a_{p_i} = -i$,
%
\begin{eqnarray}
W^{p_{i-1}/2}_{p_i} &=& \exp(-2\pi i (p_{i-1}/2)/p_i) \\
&=& \exp(-2\pi i /4) \\
&=& -i
\end{eqnarray}
%
For this butterfly the second line is just the conjugate of the first,
because $p_{i-1} - p_{i-1}/2 = p_{i-1}/2$. Therefore we only need to
consider the first line. The breakdown of the inputs and outputs is,
%
\begin{center}
\renewcommand{\arraystretch}{1.5}
\begin{tabular}{|l|lll|}
\hline Term & & Location & \\
\hline
$g(2 b p_{i-1} + p_{i-1}/2)$
& real part & $2 b p_{i-1} + p_{i-1}/2 $ &$= b p_i + p_{i-1}/2$ \\
& imag part & --- & \\
$g((2 b + 1) p_{i-1} + p_{i-1}/2)$
& real part & $(2 b + 1) p_{i-1} + p_{i-1}/2 $&$= b p_i + p_{i} - p_{i-1}/2 $ \\
& imag part & --- & \\
\hline
$g(b p_{i} + p_{i-1}/2)$
& real part & $b p_i + p_{i-1}/2$ &\\
& imag part & $b p_i + p_i - p_{i-1}/2$& \\
\hline
\end{tabular}
\end{center}
%
By comparing the locations of the inputs and outputs with the
operations in the butterfly we find that this computation is very
simple: the effect of the butterfly is to negate the location $b p_i +
p_i - p_{i-1}/2$ and leave other locations unchanged. This is clearly
an in-place operation.
Here is the radix-2 algorithm for real data, in full, with the cases
of $a=0$, $a=1\dots p_{i-1}/2 - 1$ and $a = p_{i-1}/2$ in separate
blocks,
%
\begin{algorithm}
\STATE bit-reverse ordering of $g$
\FOR {$i = 1 \dots n$}
\FOR{$b = 0 \dots q_i - 1$}
\STATE{$\left(
\begin{array}{c}
g(b p_i) \\
g(b p_i + p_{i-1})
\end{array}
\right)
\Leftarrow
\left(
\begin{array}{c}
g(b p_{i}) + g(b p_{i} + p_{i-1}) \\
g(b p_{i}) - g(b p_{i} + p_{i-1})
\end{array}
\right)$}
\ENDFOR
\FOR {$a = 1 \dots p_{i-1}/2 - 1$}
\FOR{$b = 0 \dots q_i - 1$}
\STATE{$(\Real z_0, \Imag z_0) \Leftarrow
(g(b p_i + a), g(b p_i + p_{i-1} - a))$}
\STATE{$(\Real z_1, \Imag z_1) \Leftarrow
(g(b p_i + p_{i-1} + a), g(b p_i + p_{i} - a))$}
\STATE{$t_0 \Leftarrow z_0 + W^a_{p_i} z_1$}
\STATE{$t_1 \Leftarrow z_0 - W^a_{p_i} z_1$}
\STATE{$(g(b p_{i} + a), g(b p_{i} + p_i - a)) \Leftarrow
(\Real t_0, \Imag t_0)$}
\STATE{$(g(b p_{i} + p_{i-1} - a), g(b p_{i} + p_{i-1} + a))
\Leftarrow
(\Real t_1, -\Imag t_1)$}
\ENDFOR
\ENDFOR
\FOR{$b = 0 \dots q_i - 1$}
\STATE{$g(b p_{i} + p_{i} - p_{i-1}/2) \Leftarrow -g(b p_{i} + p_{i} - p_{i-1}/2)$}
\ENDFOR
\ENDFOR
\end{algorithm}
%
We split the loop over $a$ into three parts, $a=0$, $a=1\dots
p_{i-1}/2-1$ and $a = p_{i-1}/2$ for efficiency. When $a=0$ we have
$W^a_{p_i}=1$ so we can eliminate a complex multiplication within the
loop over $b$. When $a=p_{i-1}/2$ we have $W^a_{p_i} = -i$ which does
not require a full complex multiplication either.
\subsubsection{Calculating the Inverse}
%
The inverse FFT of complex data was easy to calculate, simply by
taking the complex conjugate of the DFT matrix. The input data and
output data were both complex and did not have any special
symmetry. For real data the inverse FFT is more complicated because
the half-complex symmetry of the transformed data is
different from the purely real input data.
We can compute an inverse by stepping backwards through the forward
transform. To simplify the inversion it's convenient to write the
forward algorithm with the butterfly in matrix form,
%
\begin{algorithm}
\FOR {$i = 1 \dots n$}
\FOR {$a = 0 \dots p_{i-1}/2$}
\FOR{$b = 0 \dots q_i - 1$}
\STATE{$
\left(
\begin{array}{c}
g(b p_i + a) \\
g(b p_i + p_{i-1} + a)
\end{array}
\right)
=
\left(
\begin{array}{cc}
1 & W^a_{p_{i}} \\
1 & -W^a_{p_{i}}
\end{array}
\right)
\left(
\begin{array}{c}
g(2 b p_{i-1} + a) \\
g((2 b + 1) p_{i-1} + a)
\end{array}
\right) $}
\ENDFOR
\ENDFOR
\ENDFOR
\end{algorithm}
%
To invert the algorithm we run the iterations backwards and invert the
matrix multiplication in the innermost loop,
%
\begin{algorithm}
\FOR {$i = n \dots 1$}
\FOR {$a = 0 \dots p_{i-1}/2$}
\FOR{$b = 0 \dots q_i - 1$}
\STATE{$
\left(
\begin{array}{c}
g(2 b p_{i-1} + a) \\
g((2 b + 1) p_{i-1} + a)
\end{array}
\right)
=
\left(
\begin{array}{cc}
1 & W^a_{p_{i}} \\
1 & -W^a_{p_{i}}
\end{array}
\right)^{-1}
\left(
\begin{array}{c}
g(b p_i + a) \\
g(b p_i + p_{i-1} + a)
\end{array}
\right) $}
\ENDFOR
\ENDFOR
\ENDFOR
\end{algorithm}
%
There is no need to reverse the loops over $a$ and $b$ because the
result is independent of their order. The inverse of the matrix that
appears is,
%
\begin{equation}
\left(
\begin{array}{cc}
1 & W^a_{p_{i}} \\
1 & -W^a_{p_{i}}
\end{array}
\right)^{-1}
=
{1 \over 2}
\left(
\begin{array}{cc}
1 & 1 \\
W^{-a}_{p_{i}} & -W^{-a}_{p_{i}}
\end{array}
\right)
\end{equation}
%
To save divisions we remove the factor of $1/2$ inside the loop. This
computes the unnormalized inverse, and the normalized inverse can be
retrieved by dividing the final result by $N = 2^n$.
Here is the radix-2 half-complex to real inverse FFT algorithm, taking
into account the radix-2 storage scheme,
%
\begin{algorithm}
\FOR {$i = n \dots 1$}
\FOR{$b = 0 \dots q_i - 1$}
\STATE{$\left(
\begin{array}{c}
g(b p_i) \\
g(b p_i + p_{i-1})
\end{array}
\right)
\Leftarrow
\left(
\begin{array}{c}
g(b p_{i}) + g(b p_{i} + p_{i-1}) \\
g(b p_{i}) - g(b p_{i} + p_{i-1})
\end{array}
\right)$}
\ENDFOR
\FOR {$a = 1 \dots p_{i-1}/2 - 1$}
\FOR{$b = 0 \dots q_i - 1$}
\STATE{$(\Real z_0, \Imag z_0)
\Leftarrow
(g(b p_i + a), g(b p_i + p_{i} - a))$}
\STATE{$(\Real z_1, \Imag z_1)
\Leftarrow
(g(b p_i + p_{i-1} - a), -g(b p_i + p_{i-1} + a))$}
\STATE{$t_0 \Leftarrow z_0 + z_1$}
\STATE{$t_1 \Leftarrow z_0 - z_1$}
\STATE{$(g(b p_{i} + a), g(b p_{i} + p_{i-1} - a))
\Leftarrow
(\Real t_0, \Imag t_0) $}
\STATE{$(g(b p_{i} + p_{i-1} + a),g(b p_{i} + p_{i} - a))
\Leftarrow
(\Real(W^{-a}_{p_i}t_1), \Imag(W^{-a}_{p_i}t_1))$}
\ENDFOR
\ENDFOR
\FOR{$b = 0 \dots q_i - 1$}
\STATE{$g(b p_{i} + p_{i-1}/2) \Leftarrow 2 g(b p_{i} + p_{i-1}/2)$}
\STATE{$g(b p_{i} + p_{i-1} + p_{i-1}/2) \Leftarrow -2 g(b p_{i} + p_{i-1} + p_{i-1}/2)$}
\ENDFOR
\ENDFOR
\STATE bit-reverse ordering of $g$
\end{algorithm}
\subsection{Mixed-Radix FFTs for real data}
%
As discussed earlier the radix-2 decimation-in-time algorithm had the
special property that its intermediate passes are interleaved Fourier
transforms of the original data, and this generalizes to the
mixed-radix algorithm. The complex mixed-radix algorithm that we
derived earlier was a decimation-in-frequency algorithm, but we can
obtain a decimation-in-time version by taking the transpose of the
decimation-in-frequency DFT matrix like this,
%
\begin{eqnarray}
W_N &=& W_N^T \\
&=& (T_{n_f} \dots T_2 T_1)^T \\
&=& T_1^T T_2^T \dots T_{n_f}^T
\end{eqnarray}
%
with,
%
\begin{eqnarray}
T_i^T &=& \left( (P^{f_i}_{q_i} D^{f_i}_{q_i} \otimes I_{p_{i-1}})
(W_{f_i} \otimes I_{m_i}) \right)^T \\
&=& (W_{f_i} \otimes I_{m_i})
( D^{f_i}_{q_i} (P^{f_i}_{q_i})^T \otimes I_{p_{i-1}}).
\end{eqnarray}
%
We have used the fact that $W$, $D$ and $I$ are symmetric and that the
permutation matrix $P$ obeys,
%
\begin{equation}
(P^a_b)^T = P^b_a.
\end{equation}
%
From the definitions of $D$ and $P$ we can derive the following identity,
%
\begin{equation}
D^a_b P^b_a = P^b_a D^b_a.
\end{equation}
%
This allows us to put the transpose into a simple form,
%
\begin{equation}
T_i^T = (W_{f_i} \otimes I_{m_i})
(P^{q_i}_{f_i} D^{q_i}_{f_i} \otimes I_{p_{i-1}}).
\end{equation}
%
The transposed matrix $T_i^T$ applies the digit-reversal $P$ before the
DFT $W$, giving the required decimation-in-time algorithm. The
transpose reverses the order of the factors --- $T_{n_f}$ is applied
first and $T_1$ is applied last. For convenience we can reverse the
order of the factors, $f_1 \leftrightarrow f_{n_f}$, $f_2
\leftrightarrow f_{n_f-1}$, \dots and make the corresponding
substitution $p_{i-1} \leftrightarrow q_i$. These substitutions give
us a decimation-in-time algorithm with the same ordering as the
decimation-in-frequency algorithm,
%
\begin{equation}
W_N = T_{n_f} \dots T_2 T_1
\end{equation}
%
\begin{equation}
T_i = (W_{f_i} \otimes I_{m_i})
(P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i} \otimes I_{q_i})
\end{equation}
%
where $p_i$, $q_i$ and $m_i$ now have the same meanings as before,
namely,
%
\begin{eqnarray}
p_i &=& f_1 f_2 \dots f_i \quad (p_0 = 1) \\
q_i &=& N / p_i \\
m_i &=& N / f_i
\end{eqnarray}
%
Now we would like to prove that the iteration for computing $x = W z =
T_{n_f} \dots T_2 T_1 z$ has the special interleaving
property. If we write the result of each intermediate pass as
$v^{(i)}$,
%
\begin{eqnarray}
v^{(0)} &=& z \\
v^{(1)} &=& T_1 v^{(0)} \\
v^{(2)} &=& T_2 v^{(1)} \\
\dots &=& \dots \\
v^{(i)} &=& T_i v^{(i-1)}
\end{eqnarray}
%
then we will show that the intermediate results $v^{(i)}$ on any pass
can be written as,
%
\begin{equation}
v^{(i)} = (W_{p_i} \otimes I_{q_i}) z
\end{equation}
%
Each intermediate stage will be a set of $q_i$ interleaved Fourier
transforms, each of length $p_i$. We can prove this result by
induction. First we assume that the result is true for $v^{(i-1)}$,
%
\begin{equation}
v^{(i-1)} = (W_{p_{i-1}} \otimes I_{q_{i-1}}) z \qquad \mbox{(assumption)}
\end{equation}
%
And then we examine the next iteration using this assumption,
%
\begin{eqnarray}
v^{(i)} &=& T_i v^{(i-1)} \\
&=& T_i (W_{p_{i-1}} \otimes I_{q_{i-1}}) z \\
&=& (W_{f_i} \otimes I_{m_i})
(P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i} \otimes I_{q_i})
(W_{p_{i-1}} \otimes I_{q_{i-1}}) z \label{dit-induction}
\end{eqnarray}
%
Using the relation $m_i = p_{i-1} q_i$, we can write $I_{m_i}$ as
$I_{p_{i-1} q_i}$ and $I_{q_{i-1}}$ as $I_{f_i q_i}$. By combining these
with the basic matrix identity,
%
\begin{equation}
I_{ab} = I_a \otimes I_b
\end{equation}
%
we can rewrite $v^{(i)}$ in the following form,
%
\begin{equation}
v^{(i)} = (((W_{f_i} \otimes I_{p_{i-1}})
(P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i})
(W_{p_{i-1}} \otimes I_{f_i})) \otimes I_{q_i}) z .
\end{equation}
%
The first part of this matrix product is the two-factor expansion of
$W_{ab}$, for $a = p_{i-1}$ and $b = f_i$,
%
\begin{equation}
W_{p_{i-1} f_i} = ((W_{f_i} \otimes I_{p_{i-1}})
(P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i})
(W_{p_{i-1}} \otimes I_{f_i})).
\end{equation}
%
If we substitute this result, remembering that $p_i = p_{i-1} f_i$, we
obtain,
%
\begin{equation}
v^{(i)} = (W_{p_i} \otimes I_{q_i}) z
\end{equation}
%
which is the desired result. The case $i=1$ can be verified
explicitly, and induction then shows that the result is true for all
$i$. As discussed for the radix-2 algorithm this result is important
because if the initial data $z$ is real then each intermediate pass is
a set of interleaved Fourier transforms of $z$, having half-complex
symmetries (appropriately applied in the subspaces of the Kronecker
product). Consequently only $N$ real numbers are needed to store the
intermediate and final results.
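This interleaving property is easy to check numerically. The following
sketch (illustrative Python, helper names ours) applies $W_p \otimes I_q$
to real data and verifies that each of the $q$ interleaved length-$p$
transforms has the half-complex symmetry:

```python
import cmath

def dft(x):
    """Direct O(n^2) DFT with the convention W_n[j][k] = exp(-2 pi i j k / n)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

# Apply (W_p (x) I_q) to real data z of length N = p * q: element (a q + b)
# mixes only the elements (a' q + b), giving q interleaved length-p DFTs.
N, p, q = 12, 4, 3
z = [float((7 * k) % 5) for k in range(N)]      # arbitrary real data
v = [0j] * N
for b in range(q):
    sub = dft([z[a * q + b] for a in range(p)]) # one interleaved transform
    for a in range(p):
        v[a * q + b] = sub[a]

# Half-complex symmetry within each subspace: v[aq+b] = conj(v[(p-a)q+b]),
# and the a = 0 component is purely real.
for b in range(q):
    assert abs(v[b].imag) < 1e-9
    for a in range(1, p):
        assert abs(v[a * q + b] - v[(p - a) * q + b].conjugate()) < 1e-9
```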
\subsection{Implementation}
%
The implementation of the mixed-radix real FFT algorithm follows the
same principles as the full complex transform. Some of the steps are
applied in the opposite order because we are dealing with a decimation
in time algorithm instead of a decimation in frequency algorithm, but
the basic outer structure of the algorithm is the same. We want to
apply the factorized version of the DFT matrix $W_N$ to the input
vector $z$,
%
\begin{eqnarray}
x &=& W_N z \\
&=& T_{n_f} \dots T_2 T_1 z
\end{eqnarray}
%
We loop over the $n_f$ factors, applying each matrix $T_i$ to the
vector in turn to build up the complete transform,
%
\begin{algorithm}
\FOR{$(i = 1 \dots n_f)$}
\STATE{$v \Leftarrow T_i v $}
\ENDFOR
\end{algorithm}
%
In this case the definition of $T_i$ is different because we have
taken the transpose,
%
\begin{equation}
T_i =
(W_{f_i} \otimes I_{m_i})
(P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i} \otimes I_{q_i}).
\end{equation}
%
We'll define a temporary vector $t$ to denote the results of applying the
rightmost matrix,
%
\begin{equation}
t = (P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i} \otimes I_{q_i}) v
\end{equation}
%
If we expand this out into individual components, as before, we find a
similar simplification,
%
\begin{eqnarray}
t_{aq+b}
&=&
\sum_{a'b'}
(P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i} \otimes I_{q_i})_{(aq+b)(a'q+b')}
v_{a'q+b'} \\
&=&
\sum_{a'}
(P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i})_{a a'}
v_{a'q+b}
\end{eqnarray}
%
We have factorized the indices into the form $aq+b$, with $0 \leq a <
p_{i}$ and $0 \leq b < q$. Just as in the decimation in frequency
algorithm we can split the index $a$ to remove the matrix
multiplication completely. We have to write $a$ as $\mu f + \lambda$,
where $0 \leq \mu < p_{i-1}$ and $0 \leq \lambda < f$,
%
\begin{eqnarray}
t_{(\mu f + \lambda)q+b}
&=&
\sum_{\mu'\lambda'}
(P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i})_{(\mu f + \lambda)(\mu' f + \lambda')}
v_{(\mu' f + \lambda')q+b}
\\
&=&
\sum_{\mu'\lambda'}
(P^{p_{i-1}}_{f_i})_{(\mu f + \lambda)(\mu' f + \lambda')}
\omega^{\mu'\lambda'}_{p_{i}}
v_{(\mu' f + \lambda')q+b}
\end{eqnarray}
%
The matrix $P^{p_{i-1}}_{f_i}$ exchanges an index of $(\mu f +
\lambda) q + b$ with $(\lambda p_{i-1} + \mu) q + b$, giving a final
result of,
%
\begin{eqnarray}
t_{(\lambda p_{i-1} + \mu) q + b}
&=&
\omega^{\mu\lambda}_{p_i} v_{(\mu f + \lambda)q +b}
\end{eqnarray}
%
To calculate the next stage,
%
\begin{equation}
v' = (W_{f_i} \otimes I_{m_i}) t,
\end{equation}
%
we temporarily rearrange the index of $t$ to separate the $m_{i}$
independent DFTs in the Kronecker product,
%
\begin{equation}
v'_{(\lambda p_{i-1} + \mu) q + b}
=
\sum_{\lambda' \mu' b'}
(W_{f_i} \otimes I_{m_i})_{
((\lambda p_{i-1} + \mu) q + b)
((\lambda' p_{i-1} + \mu') q + b')}
t_{(\lambda' p_{i-1} + \mu') q + b'}
\end{equation}
%
If we use the identity $m_i = p_{i-1} q$ to rewrite the index of $t$
like this,
%
\begin{equation}
t_{(\lambda p_{i-1} + \mu) q + b} = t_{\lambda m + (\mu q + b)}
\end{equation}
%
we can split the Kronecker product,
%
\begin{eqnarray}
v'_{(\lambda p_{i-1} + \mu) q + b}
&=&
\sum_{\lambda' \mu' b'}
(W_{f_i} \otimes I_{m_i})_{
((\lambda p_{i-1} + \mu) q + b)
((\lambda' p_{i-1} + \mu') q + b')}
t_{(\lambda' p_{i-1} + \mu') q + b'}\\
&=&
\sum_{\lambda'}
(W_{f_i})_{\lambda \lambda'}
t_{\lambda' m_i + (\mu q + b)}
\end{eqnarray}
%
If we switch back to the original form of the index in the last line we obtain,
%
\begin{eqnarray}
\phantom{v'_{(\lambda p_{i-1} + \mu) q + b}}
&=&
\sum_{\lambda'}
(W_{f_i})_{\lambda \lambda'}
t_{(\lambda' p_{i-1} + \mu) q + b}
\end{eqnarray}
%
which allows us to substitute our previous result for $t$,
%
\begin{equation}
v'_{(\lambda p_{i-1} + \mu) q + b}
=
\sum_{\lambda'}
(W_{f_i})_{\lambda \lambda'}
\omega^{\mu\lambda'}_{p_i} v_{(\mu f + \lambda')q + b}
\end{equation}
%
This gives us the basic decimation-in-time mixed-radix algorithm for
complex data which we will be able to specialize to real data,
%
\begin{algorithm}
\FOR{$i = 1 \dots n_f$}
\FOR{$\mu = 0 \dots p_{i-1} - 1$}
\FOR{$b = 0 \dots q-1$}
\FOR{$\lambda' = 0 \dots f-1$}
\STATE{$t_{\lambda'} \Leftarrow
\omega^{\mu\lambda'}_{p_{i}} v_{(\mu f + \lambda')q + b}$}
\ENDFOR
\FOR{$\lambda = 0 \dots f-1$}
\STATE{$v'_{(\lambda p_{i-1} + \mu)q + b} =
\sum_{\lambda'=0}^{f-1} W_f(\lambda,\lambda') t_{\lambda'}$}
\COMMENT{DFT matrix-multiply module}
\ENDFOR
\ENDFOR
\ENDFOR
\STATE{$v \Leftarrow v'$}
\ENDFOR
\end{algorithm}
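This pseudocode translates directly into a short complex-arithmetic
implementation, which can be checked against a direct $O(N^2)$ DFT. The
sketch below is illustrative (the function names are ours):

```python
import cmath

def dft(x):
    """Direct O(n^2) DFT, forward convention exp(-2 pi i j k / n)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def mixed_radix_dit(z, factors):
    """Decimation-in-time passes above, for complex data in natural order."""
    N = len(z)
    v = list(z)
    p = 1
    for f in factors:               # pass i uses factor f = f_i
        p_prev, p = p, p * f        # p_{i-1} and p_i
        q = N // p                  # q_i
        vnew = [0j] * N
        for mu in range(p_prev):
            for b in range(q):
                # t_lambda' = omega_{p_i}^{mu lambda'} v[(mu f + lambda') q + b]
                t = [cmath.exp(-2j * cmath.pi * mu * lp / p)
                     * v[(mu * f + lp) * q + b] for lp in range(f)]
                # v'[(lambda p_{i-1} + mu) q + b] = sum_lambda' W_f(lambda, lambda') t_lambda'
                for lam in range(f):
                    vnew[(lam * p_prev + mu) * q + b] = sum(
                        cmath.exp(-2j * cmath.pi * lam * lp / f) * t[lp]
                        for lp in range(f))
        v = vnew
    return v

z = [complex(k % 3, (2 * k) % 5) for k in range(12)]
assert all(abs(a - b) < 1e-9
           for a, b in zip(mixed_radix_dit(z, [3, 2, 2]), dft(z)))
```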
%
We are now at the point where we can convert an algorithm formulated
in terms of complex variables to one in terms of real variables by
choosing a suitable storage scheme. We will adopt the FFTPACK storage
convention. FFTPACK uses a scheme where individual FFTs are
contiguous, not interleaved, and real-imaginary pairs are stored in
neighboring locations. This has better locality than was possible for
the radix-2 case.
The interleaving of the intermediate FFTs results from the Kronecker
product, $W_p \otimes I_q$. The FFTs can be made contiguous if we
reorder the Kronecker product on the intermediate passes,
%
\begin{equation}
W_p \otimes I_q \Rightarrow I_q \otimes W_p
\end{equation}
%
This can be implemented by a simple change in indexing. On pass $i$
we store element $v_{a q_i + b}$ in location $v_{b p_i+a}$. We
compensate for this change by making the same transposition when
reading the data. Note that this only affects the indices of the
intermediate passes. The transposition has no effect on the input of
the first pass, because $p_0 = 1$, and none on the output of the last
pass, which has $q_{n_f} = 1$. This is how the algorithm above
looks after this index transformation,
%
\begin{algorithm}
\FOR{$i = 1 \dots n_f$}
\FOR{$\mu = 0 \dots p_{i-1} - 1$}
\FOR{$b = 0 \dots q-1$}
\FOR{$\lambda' = 0 \dots f-1$}
\STATE{$t_{\lambda'} \Leftarrow
\omega^{\mu\lambda'}_{p_{i}} v_{(\lambda'q + b)p_{i-1} + \mu}$}
\ENDFOR
\FOR{$\lambda = 0 \dots f-1$}
\STATE{$v'_{b p + (\lambda p_{i-1} + \mu)} =
\sum_{\lambda'=0}^{f-1} W_f(\lambda,\lambda') t_{\lambda'}$}
\COMMENT{DFT matrix-multiply module}
\ENDFOR
\ENDFOR
\ENDFOR
\STATE{$v \Leftarrow v'$}
\ENDFOR
\end{algorithm}
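With the transposed indexing the implementation changes only in where each
pass reads and writes; the first pass still reads, and the last pass still
writes, in natural order, so the final result is unchanged. An illustrative
sketch (names ours):

```python
import cmath

def dft(x):
    """Direct O(n^2) DFT, forward convention exp(-2 pi i j k / n)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def mixed_radix_dit_contiguous(z, factors):
    """As before, but with the FFTPACK-style index transposition, so the
    intermediate transforms are contiguous rather than interleaved."""
    N = len(z)
    v = list(z)
    p = 1
    for f in factors:
        p_prev, p = p, p * f
        q = N // p
        vnew = [0j] * N
        for mu in range(p_prev):
            for b in range(q):
                # transposed read: v[(lambda' q + b) p_{i-1} + mu]
                t = [cmath.exp(-2j * cmath.pi * mu * lp / p)
                     * v[(lp * q + b) * p_prev + mu] for lp in range(f)]
                # transposed write: v'[b p_i + (lambda p_{i-1} + mu)]
                for lam in range(f):
                    vnew[b * p + lam * p_prev + mu] = sum(
                        cmath.exp(-2j * cmath.pi * lam * lp / f) * t[lp]
                        for lp in range(f))
        v = vnew
    return v

z = [complex((5 * k) % 7, k % 4) for k in range(12)]
assert all(abs(a - b) < 1e-9
           for a, b in zip(mixed_radix_dit_contiguous(z, [2, 3, 2]), dft(z)))
```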
%
We transpose the input terms by writing the index in the form $a
q_{i-1} + b$, to take account of the pass-dependence of the scheme,
%
\begin{equation}
v_{(\mu f + \lambda')q + b} = v_{\mu q_{i-1} + \lambda'q + b}
\end{equation}
%
We used the identity $q_{i-1} = f q$ to split the index. Note that in
this form $\lambda'q + b$ runs from 0 to $q_{i-1} - 1$ as $b$ runs
from 0 to $q-1$ and $\lambda'$ runs from 0 to $f-1$. The transposition
for the input terms then gives,
%
\begin{equation}
v_{\mu q_{i-1} + \lambda'q + b} \Rightarrow v_{(\lambda'q + b) p_{i-1} + \mu}
\end{equation}
%
In the FFTPACK scheme the intermediate output data have the same
half-complex symmetry as the radix-2 example, namely,
%
\begin{equation}
v^{(i)}_{b p + a} = v^{(i)*}_{b p + (p - a)}
\end{equation}
%
and the input data from the previous pass has the symmetry,
%
\begin{equation}
v^{(i-1)}_{(\lambda' q + b) p_{i-1} + \mu} = v^{(i-1)*}_{(\lambda' q +
b) p_{i-1} + (p_{i-1} - \mu)}
\end{equation}
%
Using these symmetries we can halve the storage and computation
requirements for each pass. Compared with the radix-2 algorithm we
have more freedom because the computation does not have to be done in
place. The storage scheme adopted by FFTPACK places elements
sequentially with real and imaginary parts in neighboring
locations. Imaginary parts which are known to be zero are not
stored. Here are the full details of the scheme,
%
\begin{center}
\renewcommand{\arraystretch}{1.5}
\begin{tabular}{|l|lll|}
\hline Term & & Location & \\
\hline
$g(b p_i)$
& real part & $b p_{i} $ & \\
& imag part & --- & \\
\hline
$g(b p_i + a)$
& real part & $b p_{i} + 2a - 1 $& for $a = 1 \dots p_i/2 - 1$ \\
& imag part & $b p_{i} + 2a$ & \\
\hline
$g(b p_{i} + p_{i}/2)$
& real part & $b p_i + p_{i} - 1$ & if $p_i$ is even\\
& imag part & --- & \\
\hline
\end{tabular}
\end{center}
%
The real element for $a=0$ is stored in location $bp$. The real parts
for $a = 1 \dots p/2 - 1$ are stored in locations $bp + 2a -1$ and the
imaginary parts are stored in locations $b p + 2 a$. When $p$ is even
the term for $a = p/2$ is purely real and we store it in location $bp
+ p - 1$. The zero imaginary part is not stored.
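A sketch of this packing rule in Python (illustrative; `pack_halfcomplex`
is our name), applied to the DFT of a short real sequence:

```python
import cmath

def dft(x):
    """Direct O(n^2) DFT, forward convention exp(-2 pi i j k / n)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def pack_halfcomplex(g, p, q):
    """Pack q contiguous half-complex transforms of length p, following the
    table above: g[b*p + a] holds term a of transform b."""
    out = [0.0] * (p * q)
    for b in range(q):
        out[b * p] = g[b * p].real                  # a = 0: real, imag not stored
        for a in range(1, (p - 1) // 2 + 1):        # a = 1 ... ceil(p/2) - 1
            out[b * p + 2 * a - 1] = g[b * p + a].real
            out[b * p + 2 * a] = g[b * p + a].imag
        if p % 2 == 0:                              # a = p/2: real, imag not stored
            out[b * p + p - 1] = g[b * p + p // 2].real
    return out

g = dft([1.0, 2.0, 3.0, 4.0])          # [10, -2+2j, -2, -2-2j]
packed = pack_halfcomplex(g, p=4, q=1)
expected = [10.0, -2.0, 2.0, -2.0]     # re(g0), re(g1), im(g1), re(g2)
assert all(abs(u - w) < 1e-9 for u, w in zip(packed, expected))
```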
When we compute the basic iteration,
%
\begin{equation}
v^{(i)}_{b p + (\lambda p_{i-1} + \mu)} = \sum_{\lambda'}
W^{\lambda \lambda'}_f
\omega^{\mu\lambda'}_{p_i} v^{(i-1)}_{(\lambda' q + b)p_{i-1} + \mu}
\end{equation}
%
we eliminate the redundant conjugate terms with $a > p_{i}/2$ as we
did in the radix-2 algorithm. Whenever we need to store a term with $a
> p_{i}/2$ we consider instead the corresponding conjugate term with
$a' = p - a$. Similarly when reading data we replace terms with $\mu >
p_{i-1}/2$ with the corresponding conjugate term for $\mu' = p_{i-1} -
\mu$.
Since the input data on each stage has half-complex symmetry we only
need to consider the range $\mu=0 \dots p_{i-1}/2$. We now consider
the best way to implement the basic iteration for each value of $\mu$
in this range.
In the first case, $\mu=0$, we access elements which are the zeroth
components of the independent transforms $W_{p_{i-1}} \otimes
I_{q_{i-1}}$, and are purely real.
%
We can code the pass with $\mu=0$ as a special case for real input
data, and conjugate-complex output. When $\mu=0$ the twiddle factors
$\omega^{\mu\lambda'}_{p_i}$ are all unity, giving a further saving.
We can obtain small-$N$ real-data DFT modules by removing the
redundant operations from the complex modules.
%
For example the $N=3$ module was,
%
\begin{equation}
\begin{array}{lll}
t_1 = z_1 + z_2, &
t_2 = z_0 - t_1/2, &
t_3 = \sin(\pi/3) (z_1 - z_2),
\end{array}
\end{equation}
\begin{equation}
\begin{array}{lll}
x_0 = z_0 + t_1, &
x_1 = t_2 + i t_3, &
x_2 = t_2 - i t_3.
\end{array}
\end{equation}
%
In the complex case all the operations were complex, for complex input
data $z_0$, $z_1$, $z_2$. In the real case $z_0$, $z_1$ and $z_2$ are
all real. Consequently $t_1$, $t_2$ and $t_3$ are also real, and the
symmetry $x_1 = t_2 + i t_3 = x^*_2$ means that we do not have to
compute $x_2$ once we have computed $x_1$.
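A sketch of this real-input module in Python (the helper name is ours).
Since this excerpt does not pin down the sign convention of $W$, the check
below accepts $x_1$ up to complex conjugation:

```python
import cmath, math

def real_radix3_module(z0, z1, z2):
    # the N = 3 module above, specialized to real z0, z1, z2
    t1 = z1 + z2
    t2 = z0 - t1 / 2
    t3 = math.sin(math.pi / 3) * (z1 - z2)
    x0 = z0 + t1                   # purely real
    x1 = complex(t2, t3)           # x2 = conj(x1) need not be computed
    return x0, x1

z = (1.0, 2.0, 4.0)
x0, x1 = real_radix3_module(*z)
w = cmath.exp(-2j * cmath.pi / 3)
direct = [sum(zk * w ** (j * k) for k, zk in enumerate(z)) for j in range(3)]
assert abs(x0 - direct[0]) < 1e-12
# up to the transform's sign convention, x1 is direct[1] or its conjugate
assert min(abs(x1 - direct[1]), abs(x1 - direct[1].conjugate())) < 1e-12
```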
For subsequent passes $\mu = 1 \dots p_{i-1}/2 - 1$ the input data is
complex and we have to compute full complex DFTs using the same
modules as in the complex case. Note that the inputs are all of the
form $v_{(\lambda q + b) p_{i-1} + \mu}$ with $\mu < p_{i-1}/2$ so we
never need to use the symmetry to access the conjugate elements with
$\mu > p_{i-1}/2$.
If $p_{i-1}$ is even then we reach the halfway point $\mu=p_{i-1}/2$,
which is another special case. The input data in this case is purely
real because $\mu = p_{i-1} - \mu$ for $\mu = p_{i-1}/2$. We can code
this as a special case, using real inputs and real-data DFT modules as
we did for $\mu=0$. However, for $\mu = p_{i-1}/2$ the twiddle factors
are not unity,
%
\begin{eqnarray}
\omega^{\mu\lambda'}_{p_i} &=& \omega^{(p_{i-1}/2)\lambda'}_{p_i} \\
&=& \exp(-i\pi\lambda'/f_i)
\end{eqnarray}
%
These twiddle factors introduce an additional phase which modifies the
symmetry of the outputs. Instead of the conjugate-complex symmetry
which applied for $\mu=0$ there is a shifted conjugate-complex
symmetry,
%
\begin{equation}
t_\lambda = t^*_{f-(\lambda+1)}
\end{equation}
%
This is easily proved,
%
\begin{eqnarray}
t_\lambda
&=&
\sum_{\lambda'} e^{-2\pi i \lambda\lambda'/f_i} e^{-i\pi \lambda'/f_i} r_{\lambda'} \\
t_{f - (\lambda + 1)}
&=&
\sum_{\lambda'} e^{-2\pi i (f-\lambda-1)\lambda'/f_i} e^{-i\pi \lambda'/f_i} r_{\lambda'} \\
&=&
\sum_{\lambda'} e^{2\pi i \lambda\lambda'/f_i} e^{i\pi \lambda'/f_i} r_{\lambda'} \\
&=& t^*_\lambda
\end{eqnarray}
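The shifted symmetry can also be confirmed numerically; this sketch
(illustrative) evaluates the twiddled transform for arbitrary real inputs
with $f=5$:

```python
import cmath

f = 5
r = [1.0, -2.0, 0.5, 3.0, 4.0]      # arbitrary real inputs
# t_lambda = sum_{lambda'} exp(-2 pi i lambda lambda'/f) exp(-i pi lambda'/f) r[lambda']
t = [sum(cmath.exp(-2j * cmath.pi * lam * lp / f)
         * cmath.exp(-1j * cmath.pi * lp / f) * r[lp]
         for lp in range(f))
     for lam in range(f)]
for lam in range(f):
    assert abs(t[lam] - t[f - (lam + 1)].conjugate()) < 1e-9
assert abs(t[2].imag) < 1e-9        # the middle term is purely real for f = 5
```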
%
The symmetry of the output means that we only need to compute half of
the output terms, the remaining terms being conjugates or zero
imaginary parts. For example, when $f=4$ the outputs are $(x_0 + i
y_0, x_1 + i y_1, x_1 - i y_1, x_0 - i y_0)$. For $f=5$ the outputs
are $(x_0 + i y_0, x_1 + i y_1, x_2, x_1 - i y_1, x_0 - i y_0)$. By
combining the twiddle factors and DFT matrix we can make a combined
module which applies both at the same time. By starting from the
complex DFT modules and bringing in twiddle factors we can derive
optimized modules. Here are the modules given by Temperton for $z = W
\Omega x$ where $x$ is real and $z$ has the shifted conjugate-complex
symmetry. The matrix of twiddle factors, $\Omega$, is given by,
%
\begin{equation}
\Omega = \mathrm{diag}(1, e^{-i\pi/f}, e^{-2\pi i/f}, \dots, e^{-i\pi(f-1)/f})
\end{equation}
%
We write $z$ in terms of two real vectors $z = a + i b$.
%
For $N=2$,
%
\begin{equation}
\begin{array}{ll}
a_0 = x_0, &
b_0 = - x_1.
\end{array}
\end{equation}
%
For $N=3$,
%
\begin{equation}
\begin{array}{l}
t_1 = x_1 - x_2,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
a_0 = x_0 + t_1/2, & b_0 = - \sin(\pi/3) (x_1 + x_2),
\end{array}
\end{equation}
\begin{equation}
\begin{array}{l}
a_1 = x_0 - t_1
\end{array}
\end{equation}
%
For $N=4$,
%
\begin{equation}
\begin{array}{ll}
t_1 = (x_1 - x_3)/\sqrt{2}, & t_2 = (x_1 + x_3)/\sqrt{2},
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
a_0 = x_0 + t_1, & b_0 = -x_2 - t_2,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
a_1 = x_0 - t_1, & b_1 = x_2 - t_2.
\end{array}
\end{equation}
%
For $N=5$,
%
\begin{equation}
\begin{array}{ll}
t_1 = x_1 - x_4, & t_2 = x_1 + x_4,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
t_3 = x_2 - x_3, & t_4 = x_2 + x_3,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
t_5 = t_1 - t_3, & t_6 = x_0 + t_5 / 4,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
t_7 = (\sqrt{5}/4)(t_1 + t_3) &
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
a_0 = t_6 + t_7, & b_0 = -\sin(2\pi/10) t_2 - \sin(2\pi/5) t_4,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
a_1 = t_6 - t_7, & b_1 = -\sin(2\pi/5) t_2 + \sin(2\pi/10) t_4,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
a_2 = x_0 - t_5 &
\end{array}
\end{equation}
%
For $N=6$,
%
\begin{equation}
\begin{array}{ll}
t_1 = \sin(\pi/3)(x_5 - x_1), & t_2 = \sin(\pi/3) (x_2 + x_4),
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
t_3 = x_2 - x_4, & t_4 = x_1 + x_5,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
t_5 = x_0 + t_3 / 2, & t_6 = -x_3 - t_4 / 2,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
a_0 = t_5 - t_1, & b_0 = t_6 - t_2,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
a_1 = x_0 - t_3, & b_1 = x_3 - t_4,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
a_2 = t_5 + t_1, & b_2 = t_6 + t_2
\end{array}
\end{equation}
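These modules can be checked against the defining product $z = W \Omega x$.
Here is an illustrative Python sketch for the $N=5$ module (names ours):

```python
import cmath, math

def shifted_real_module5(x):
    """The N = 5 module above for z = W Omega x with real input x."""
    x0, x1, x2, x3, x4 = x
    t1, t2 = x1 - x4, x1 + x4
    t3, t4 = x2 - x3, x2 + x3
    t5 = t1 - t3
    t6 = x0 + t5 / 4
    t7 = (math.sqrt(5) / 4) * (t1 + t3)
    a = [t6 + t7, t6 - t7, x0 - t5]
    b = [-math.sin(2 * math.pi / 10) * t2 - math.sin(2 * math.pi / 5) * t4,
         -math.sin(2 * math.pi / 5) * t2 + math.sin(2 * math.pi / 10) * t4]
    return a, b

f = 5
x = [1.0, 2.0, -1.0, 0.5, 3.0]
# direct evaluation of z = W Omega x
z = [sum(cmath.exp(-2j * cmath.pi * lam * lp / f)
         * cmath.exp(-1j * cmath.pi * lp / f) * x[lp] for lp in range(f))
     for lam in range(f)]
a, b = shifted_real_module5(x)
assert abs(complex(a[0], b[0]) - z[0]) < 1e-9
assert abs(complex(a[1], b[1]) - z[1]) < 1e-9
assert abs(a[2] - z[2]) < 1e-9       # z_2 is purely real
```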
\section{Computing the mixed-radix inverse for real data}
%
To compute the inverse of the mixed-radix FFT on real data we step
through the algorithm in reverse and invert each operation.
This gives the following algorithm using FFTPACK indexing,
%
\begin{algorithm}
\FOR{$i = n_f \dots 1$}
\FOR{$\mu = 0 \dots p_{i-1} - 1$}
\FOR{$b = 0 \dots q-1$}
\FOR{$\lambda = 0 \dots f-1$}
\STATE{$t_{\lambda} \Leftarrow
\sum_{\lambda'=0}^{f-1} W^{-1}_f(\lambda,\lambda')
v_{b p + (\lambda' p_{i-1} + \mu)}$}
\COMMENT{inverse DFT matrix-multiply module}
\ENDFOR
\FOR{$\lambda = 0 \dots f-1$}
\STATE{$v'_{(\lambda q + b)p_{i-1} + \mu} \Leftarrow
\omega^{-\mu\lambda}_{p_{i}} t_\lambda$}
\ENDFOR
\ENDFOR
\ENDFOR
\STATE{$v \Leftarrow v'$}
\ENDFOR
\end{algorithm}
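The reversed pass structure can be sketched in Python and checked as a
round trip (illustrative code, names ours; the inverse is unnormalized, so
running it after the forward passes returns $N z$):

```python
import cmath

def forward_passes(z, factors):
    """FFTPACK-ordered decimation-in-time passes: returns W_N z."""
    N = len(z)
    v = list(z)
    p = 1
    for f in factors:
        p_prev, p = p, p * f
        q = N // p
        vnew = [0j] * N
        for mu in range(p_prev):
            for b in range(q):
                t = [cmath.exp(-2j * cmath.pi * mu * lp / p)
                     * v[(lp * q + b) * p_prev + mu] for lp in range(f)]
                for lam in range(f):
                    vnew[b * p + lam * p_prev + mu] = sum(
                        cmath.exp(-2j * cmath.pi * lam * lp / f) * t[lp]
                        for lp in range(f))
        v = vnew
    return v

def inverse_passes(x, factors):
    """Run the passes in reverse, inverting each operation; unnormalized,
    so the result is N * z."""
    N = len(x)
    v = list(x)
    p = N
    for f in reversed(factors):
        p_prev = p // f
        q = N // p
        vnew = [0j] * N
        for mu in range(p_prev):
            for b in range(q):
                # inverse DFT module (conjugated W_f, no 1/f normalization)
                t = [sum(cmath.exp(2j * cmath.pi * lam * lp / f)
                         * v[b * p + lp * p_prev + mu] for lp in range(f))
                     for lam in range(f)]
                # conjugated twiddle factors, transposed write
                for lam in range(f):
                    vnew[(lam * q + b) * p_prev + mu] = (
                        cmath.exp(2j * cmath.pi * mu * lam / p) * t[lam])
        v = vnew
        p = p_prev
    return v

z = [complex(k % 4, (3 * k) % 7) for k in range(12)]
back = inverse_passes(forward_passes(z, [3, 2, 2]), [3, 2, 2])
assert all(abs(b - 12 * a) < 1e-9 for a, b in zip(z, back))
```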
%
When $\mu=0$ we are applying an inverse DFT to half-complex data,
giving a real result. The twiddle factors are all unity. We can code
this as a special case, just as we did for the forward routine. We
start with complex module and eliminate the redundant terms. In this
case it is the final result which has the zero imaginary part, and we
eliminate redundant terms by using the half-complex symmetry of the
input data.
When $\mu=1 \dots p_{i-1}/2 - 1$ we have full complex transforms on
complex data, so no simplification is possible.
When $\mu = p_{i-1}/2$ (which occurs only when $p_{i-1}$ is even) we
have a combination of twiddle factors and DFTs on data with the
shifted half-complex symmetry which give a real result. We implement
this as a special module, essentially by inverting the system of
equations given for the forward case. We use the modules given by
Temperton, appropriately modified for our version of the algorithm. He
uses a slightly different convention which differs by factors of two
for some terms (consult his paper for details~\cite{temperton83real}).
For $N=2$,
%
\begin{equation}
\begin{array}{ll}
x_0 = 2 a_0, & x_1 = - 2 b_0 .
\end{array}
\end{equation}
%
For $N=3$,
%
\begin{equation}
\begin{array}{ll}
t_1 = a_0 - a_1, & t_2 = \sqrt{3} b_0, \\
\end{array}
\end{equation}
\begin{equation}
\begin{array}{lll}
x_0 = 2 a_0 + a_1, & x_1 = t_1 - t_2, & x_2 = - t_1 - t_2
\end{array}
\end{equation}
%
For $N=4$,
\begin{equation}
\begin{array}{ll}
t_1 = \sqrt{2} (b_0 + b_1), & t_2 = \sqrt{2} (a_0 - a_1),
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
x_0 = 2(a_0 + a_1), & x_1 = t_2 - t_1 ,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
x_2 = 2(b_1 - b_0), & x_3 = -(t_2 + t_1).
\end{array}
\end{equation}
%
For $N=5$,
%
\begin{equation}
\begin{array}{ll}
t_1 = 2 (a_0 + a_1), & t_2 = t_1 / 4 - a_2,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
t_3 = (\sqrt{5}/2) (a_0 - a_1),
\end{array}
\end{equation}
\begin{equation}
\begin{array}{l}
t_4 = 2(\sin(2\pi/10) b_0 + \sin(2\pi/5) b_1),
\end{array}
\end{equation}
\begin{equation}
\begin{array}{l}
t_5 = 2(\sin(2\pi/5) b_0 - \sin(2\pi/10) b_1),
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
t_6 = t_3 + t_2, & t_7 = t_3 - t_2,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
x_0 = t_1 + a_2, & x_1 = t_6 - t_4 ,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
x_2 = t_7 - t_5, & x_3 = - t_7 - t_5,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
x_4 = -t_6 - t_4.
\end{array}
\end{equation}
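As a consistency check, the $N=5$ forward and inverse modules can be paired
in a short sketch (illustrative Python, names ours; with the unnormalized
convention the round trip scales the data by $f = 5$):

```python
import math

S1, S2 = math.sin(2 * math.pi / 10), math.sin(2 * math.pi / 5)

def forward5(x):
    """Forward N = 5 module, z = W Omega x (outputs a + i b, plus real a2)."""
    x0, x1, x2, x3, x4 = x
    t1, t2, t3, t4 = x1 - x4, x1 + x4, x2 - x3, x2 + x3
    t5 = t1 - t3
    t6 = x0 + t5 / 4
    t7 = (math.sqrt(5) / 4) * (t1 + t3)
    return ([t6 + t7, t6 - t7, x0 - t5],
            [-S1 * t2 - S2 * t4, -S2 * t2 + S1 * t4])

def inverse5(a, b):
    """Inverse N = 5 module; unnormalized, so the output is 5x."""
    t1 = 2 * (a[0] + a[1])
    t2 = t1 / 4 - a[2]
    t3 = (math.sqrt(5) / 2) * (a[0] - a[1])
    t4 = 2 * (S1 * b[0] + S2 * b[1])
    t5 = 2 * (S2 * b[0] - S1 * b[1])
    t6, t7 = t3 + t2, t3 - t2
    return [t1 + a[2], t6 - t4, t7 - t5, -t7 - t5, -t6 - t4]

x = [1.0, 2.0, -1.0, 0.5, 3.0]
a, b = forward5(x)
assert all(abs(yi - 5 * xi) < 1e-9 for xi, yi in zip(x, inverse5(a, b)))
```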
\section{Conclusions}
%
We have described the basic algorithms for one-dimensional radix-2 and
mixed-radix FFTs. It would be nice to have a pedagogical explanation
of the split-radix FFT algorithm, which is faster than the simple
radix-2 algorithm we used. We could also have a whole chapter on
multidimensional FFTs.
%
%\section{Multidimensional FFTs}
%\section{Testing FFTs, Numerical Analysis}
%\nocite{*}
\bibliographystyle{unsrt}
\bibliography{fftalgorithms}
\end{document}