\documentclass{article}
\usepackage{noweb}
\usepackage{amsmath}
\usepackage{fancyvrb}
\usepackage{graphicx}
\addtolength{\textwidth}{1in}
\addtolength{\oddsidemargin}{-.5in}
\setlength{\evensidemargin}{\oddsidemargin}
\newcommand{\myfig}[1]{\includegraphics[width=\textwidth]{figures/#1.pdf}}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\xbar}{\overline{x}}
\newcommand{\sign}{{\rm sign}}
\noweboptions{breakcode}
\title{Survival Package Functions}
\author{Terry Therneau}
\begin{document}
\maketitle
\tableofcontents
\section{Introduction}
\begin{quotation}
Let us change our traditional attitude to the construction of programs.
Instead of imagining that our main task is to instruct a \emph{computer}
what to do, let us concentrate rather on explaining to \emph{humans}
what we want the computer to do. (Donald E. Knuth, 1984).
\end{quotation}
This is the definition of a coding style called
\emph{literate programming}.
I first made use of it in the \emph{coxme} library and have become a full
convert. For the survival library only selected objects are documented in
this way; as I make updates and changes I am slowly converting the source
code.
The first motivation for this is to make the code easier for me, both to
create and to maintain. As to maintenance, I have found that whenever I
need to update code I spend a lot of time in the ``what was I doing in these
x lines?'' stage. The code never has enough documentation, even for the
author. (The survival library is already better than the majority of packages
in R, whose comment level is abysmal.
In the pre-noweb source code about 1 line in 6
has a comment; for the noweb document the documentation/code ratio is 2:1.)
I also find it helps in creating new code to have the real documentation of
intent --- formulas with integrals and such --- closely integrated.
The second motivation is to leave code that is well enough explained that
someone else can take it over.
The source code is structured using \emph{noweb}, one of the simpler literate
programming environments.
The source code files look remarkably like Sweave, and the .Rnw mode of
emacs works perfectly for them. This is not too surprising since Sweave
was also based on noweb. Sweave is not sufficient to process the files,
however, since it has a different intention: it is designed to
\emph{execute} the code and make the results into a report, while noweb
is designed to \emph{explain} the code. We do this using the \code{noweb}
library in R, which contains the \code{noweave} and \code{notangle} functions.
(It would in theory be fairly simple to extend \code{knitr} to do this task,
which is a topic for further exploration one day. A downside to noweb is
that like Sweave it depends on latex, which has an admittedly steep learning
curve, and markdown is thus attractive.)
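For concreteness, here is a minimal sketch of the tangle/weave cycle with
the \code{noweb} package; the file name is hypothetical and the argument
lists are abbreviated, so consult the package help pages before relying
on the details.
\begin{verbatim}
> library(noweb)
> notangle("coxph.Rnw")    # extract the code chunks
> noweave("coxph.Rnw")     # create the LaTeX documentation
\end{verbatim}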
\section{Cox Models}
\subsection{Coxph}
The \code{coxph} routine is the underlying basis for all the models.
The source was converted to noweb when adding time-transform terms.
The call starts out with the basic building of a model frame
and proceeds from there.
A cluster term in the model is an exception. The variable mentioned is
never part of the formal model, and so it is not kept as part of the saved
terms structure.
The aeqSurv function is used to adjudicate near ties in the time
variable, the numerical precision issues that occur when users base
calculations on days/365.25 instead of days.
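A small illustration of the idea (times chosen arbitrarily): two event
times that differ only at round-off level are declared to be tied.
\begin{verbatim}
> y <- Surv(c(0.5013005, 0.5013005 + 1e-10), c(1, 1))
> aeqSurv(y)    # the two times become identical, hence true ties
\end{verbatim}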
The analysis for multi-state data is a bit more complex.
\begin{itemize}
\item If the formula statement is a list, we preprocess this to find out
any potential extra variables, and create a new global formula which
will be used to create the data frame.
\item In the above case missing value processing needs
to be deferred, since some covariates may apply only to select
transitions.
\item After the data frame is constructed, the transitions matrix can be
used to check that all the state names actually exist, construct the
cmap matrix, and do missing value removal.
\end{itemize}
<<coxph>>=
coxph <- function(formula, data, weights, subset, na.action,
init, control, ties= c("efron", "breslow", "exact"),
singular.ok =TRUE, robust,
model=FALSE, x=FALSE, y=TRUE, tt, method=ties,
id, cluster, istate, statedata, nocenter=c(-1, 0, 1), ...) {
missing.ties <- missing(ties) & missing(method) #see later multistate sect
ties <- match.arg(ties)
Call <- match.call()
if (missing(formula)) stop("a formula argument is required")
## We want to pass any ... args to coxph.control, but not pass things
## like "dats=mydata" where someone just made a typo. The use of ...
## is simply to allow things like "eps=1e6" with easier typing
extraArgs <- list(...)
if (length(extraArgs)) {
controlargs <- names(formals(coxph.control)) #legal arg names
indx <- pmatch(names(extraArgs), controlargs, nomatch=0L)
if (any(indx==0L))
stop(gettextf("Argument %s not matched",
names(extraArgs)[indx==0L]), domain = NA)
}
# Gather any leftover arguments into a coxph.control call
# If there is a control argument, force a call to coxph.control to both
# fill it out with all the elements and do sanity checks
if (missing(control)) control <- coxph.control(...)
else if (is.list(control)) control <- do.call(coxph.control, control)
else stop("control argument must be a list")
# make Surv(), strata() etc in a formula resolve to the survival namespace
if (is.list(formula)) {
newform <- removeDoubleColonSurv(formula[[1]])
if (!is.null(newform)) {
formula[[1]] <- newform$formula
if (newform$newcall) Call$formula <- newform$formula
}
} else {
newform <- removeDoubleColonSurv(formula)
if (!is.null(newform)) {
formula <- newform$formula
if (newform$newcall) Call$formula <- formula #save the nicer version
}
}
# Move any cluster() term out of the formula, and make it an argument
# instead. This makes everything easier. But, I can only do that with
# a local copy, doing otherwise messes up future use of update() on
# the model object for a user stuck in "+ cluster()" mode.
ss <- "cluster"
if (is.list(formula))
Terms <- if (missing(data)) terms(formula[[1]], specials=ss) else
terms(formula[[1]], specials=ss, data=data)
else Terms <- if (missing(data)) terms(formula, specials=ss) else
terms(formula, specials=ss, data=data)
tcl <- attr(Terms, 'specials')$cluster
if (length(tcl) > 1) stop("a formula cannot have multiple cluster terms")
if (length(tcl) > 0) { # there is one
factors <- attr(Terms, 'factors')
if (any(factors[tcl,] >1)) stop("cluster() cannot be in an interaction")
if (attr(Terms, "response") ==0)
stop("formula must have a Surv response")
if (is.null(Call$cluster))
Call$cluster <- attr(Terms, "variables")[[1+tcl]][[2]]
else warning("cluster appears both in a formula and as an argument, formula term ignored")
# [.terms is broken at least through R 4.1; use our
# local drop.special() function instead.
Terms <- drop.special(Terms, tcl)
formula <- Call$formula <- formula(Terms)
}
# create a call to model.frame() that contains the formula (required)
# and any other of the relevant optional arguments
# but don't evaluate it just yet
indx <- match(c("formula", "data", "weights", "subset", "na.action",
"cluster", "id", "istate"),
names(Call), nomatch=0)
tform <- Call[c(1,indx)] # only keep the arguments we wanted
tform[[1L]] <- quote(stats::model.frame) # change the function called
# if the formula is a list, do the first level of processing on it.
if (is.list(formula)) {
<<coxph-multiform1>>
}
else {
multiform <- FALSE # formula is not a list of expressions
covlist <- NULL
dformula <- formula
}
# add specials to the formula
special <- c("strata", "tt", "frailty", "ridge", "pspline")
tform$formula <- if(missing(data)) terms(formula, special) else
terms(formula, special, data=data)
# okay, now evaluate the formula
mf <- eval(tform, parent.frame())
Terms <- terms(mf)
# Grab the response variable, and deal with Surv2 objects
n <- nrow(mf)
Y <- model.response(mf)
isSurv2 <- inherits(Y, "Surv2")
if (isSurv2) {
# this is Surv2 style data
# if there were any obs removed due to missing, remake the model frame
if (length(attr(mf, "na.action"))) {
tform$na.action <- na.pass
mf <- eval.parent(tform)
}
if (!is.null(attr(Terms, "specials")$cluster))
stop("cluster() cannot appear in the model statement")
new <- surv2data(mf)
mf <- new$mf
istate <- new$istate
id <- new$id
Y <- new$y
n <- nrow(mf)
}
else {
if (!is.Surv(Y)) stop("Response must be a survival object")
id <- model.extract(mf, "id")
istate <- model.extract(mf, "istate")
}
if (n==0) stop("No (non-missing) observations")
if (length(id) >0) n.id <- length(unique(id))
type <- attr(Y, "type")
multi <- FALSE
if (type=="mright" || type == "mcounting") multi <- TRUE
else if (type!='right' && type!='counting')
stop(paste("Cox model doesn't support \"", type,
"\" survival data", sep=''))
data.n <- nrow(Y) #remember this before any time transforms
if (!multi && multiform)
stop("formula is a list but the response is not multi-state")
if (multi) {
if (length(attr(Terms, "specials")$frailty) >0)
stop("multi-state models do not currently support frailty terms")
if (length(attr(Terms, "specials")$pspline) >0)
stop("multi-state models do not currently support pspline terms")
if (length(attr(Terms, "specials")$ridge) >0)
stop("multi-state models do not currently support ridge penalties")
if (missing.ties) method <- ties <- "breslow"
}
# the code was never designed for multiple frailty terms, but of course
# someone tried it
if (length(attr(Terms, "specials")$frailty) >1)
stop("multiple frailty terms are not supported")
if (control$timefix) Y <- aeqSurv(Y)
<<coxph-bothsides>>
# The time transform will expand the data frame mf. To do this
# it needs Y and the strata. Everything else (cluster, offset, weights)
# should be extracted after the transform
#
strats <- attr(Terms, "specials")$strata
hasinteractions <- FALSE
dropterms <- NULL
if (length(strats)) {
stemp <- untangle.specials(Terms, 'strata', 1)
if (length(stemp$vars)==1) strata.keep <- mf[[stemp$vars]]
else strata.keep <- strata(mf[,stemp$vars], shortlabel=TRUE)
istrat <- as.integer(strata.keep)
for (i in stemp$vars) { #multiple strata terms are allowed
# The factors attr has one row for each variable in the frame, one
# col for each term in the model. Pick rows for each strata
# var, and find if it participates in any interactions.
if (any(attr(Terms, 'order')[attr(Terms, "factors")[i,] >0] >1))
hasinteractions <- TRUE
}
if (!hasinteractions) dropterms <- stemp$terms
} else istrat <- NULL
timetrans <- attr(Terms, "specials")$tt
if (missing(tt)) tt <- NULL
if (length(timetrans)) {
if (multi || isSurv2) stop("the tt() transform is not implemented for multi-state or Surv2 models")
# begin tt() preprocessing
<<coxph-transform>>
# end tt() preprocessing
}
xlevels <- .getXlevels(Terms, mf)
# grab the cluster, if present. Using cluster() in a formula is no
# longer encouraged
cluster <- model.extract(mf, "cluster")
weights <- model.weights(mf)
# The user can call with cluster, id, robust, or any combination
# Default for robust: if cluster or any id with > 1 event or
# any weights that are not 0 or 1, then TRUE
# If only id, treat it as the cluster too
has.cluster <- !(missing(cluster) || length(cluster)==0)
has.id <- !(missing(id) || length(id)==0)
has.rwt<- (!is.null(weights) && any(weights != floor(weights)))
#has.rwt<- FALSE # we are rethinking this
has.robust <- (!missing(robust) && !is.null(robust)) # arg present
if (has.id) id <- as.factor(id)
if (missing(robust) || is.null(robust)) {
if (has.cluster || has.rwt ||
(has.id && (multi || anyDuplicated(id[Y[,ncol(Y)]==1]))))
robust <- TRUE else robust <- FALSE
}
if (!is.logical(robust)) stop("robust must be TRUE/FALSE")
if (has.cluster) {
if (!robust) {
warning("cluster specified with robust=FALSE, cluster ignored")
ncluster <- 0
clname <- NULL
}
else {
if (is.factor(cluster)) {
clname <- levels(cluster)
cluster <- as.integer(cluster)
} else {
clname <- sort(unique(cluster))
cluster <- match(cluster, clname)
}
ncluster <- length(clname)
}
} else {
if (robust && has.id) {
# treat the id as both identifier and clustering
clname <- levels(id)
cluster <- as.integer(id)
ncluster <- length(clname)
}
else {
ncluster <- 0 # has neither
}
}
# if the user said "robust", (time1,time2) data, and no cluster or
# id, complain about it
if (robust && is.null(cluster)) {
if (ncol(Y) ==2 || !has.robust) cluster <- seq.int(1, nrow(mf))
else stop("one of cluster or id is needed")
}
contrast.arg <- NULL #due to shared code with model.matrix.coxph
attr(Terms, "intercept") <- 1 # always have a baseline hazard
if (multi) {
<<coxph-multiform2>>
}
<<coxph-make-X>>
<<coxph-setup>>
if (multi) {
<<coxph-multi-X>>
}
# infinite covariates are not screened out by the na.omit routines
# But this needs to be done after the multi-X part
if (!all(is.finite(X)))
stop("data contains an infinite predictor")
# init is checked after the final X matrix has been made
if (missing(init)) init <- NULL
else {
if (length(init) != ncol(X)) stop("wrong length for init argument")
temp <- X %*% init - sum(colMeans(X) * init) + offset
# it's okay to have a few underflows, but if all of them are too
# small we get all zeros
if (any(exp(temp) > .Machine$double.xmax) || all(exp(temp)==0))
stop("initial values lead to overflow or underflow of the exp function")
}
<<coxph-penal>>
<<coxph-compute>>
<<coxph-finish>>
}
@
Multi-state models have a multi-state response; optionally they have a
formula that is a list.
If the formula is a list then the first element is the default formula
with a survival response and covariates on the right.
Further elements are of the form from/to ~ covariates / options and
specify other covariates for all from:to transitions.
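A schematic call, with hypothetical data set and variable names:
\begin{verbatim}
> fit <- coxph(list(Surv(time1, time2, state) ~ age,
                    1:2 ~ sex),
               data=mydata, id=subject)
\end{verbatim}
Here age applies to every transition while sex applies only to the
1:2 transition.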
Steps in processing such a formula are
\begin{enumerate}
\item Gather all the variables that appear on a right-hand side, and
create a master formula y ~ all of them. This is used to create the
model.frame. We also need to defer missing value processing, since
some covariates might appear for only some transitions.
\item Get the data. The response, id, and statedata variables can now
be checked for consistency with the formulas.
\item After X has been formed, expand it.
\end{enumerate}
Here is code for the first step.
<<coxph-multiform1>>=
multiform <- TRUE
dformula <- formula[[1]] # the default formula for transitions
if (missing(statedata)) covlist <- parsecovar1(formula[-1])
else {
if (!inherits(statedata, "data.frame"))
stop("statedata must be a data frame")
if (is.null(statedata$state))
stop("statedata data frame must contain a 'state' variable")
covlist <- parsecovar1(formula[-1], names(statedata))
}
# create the master formula, used for model.frame
# the term.labels + reformulate + environment trio is used in [.terms;
# if it's good enough for base R it's good enough for me
tlab <- unlist(lapply(covlist$rhs, function(x)
attr(terms.formula(x$formula), "term.labels")))
tlab <- c(attr(terms.formula(dformula), "term.labels"), tlab)
newform <- reformulate(tlab, dformula[[2]])
environment(newform) <- environment(dformula)
formula <- newform
tform$na.action <- na.pass # defer any missing value work to later
@
<<coxph-multiform2>>=
# check for consistency of the states, and create a transition
# matrix
if (length(id)==0)
stop("an id statement is required for multi-state models")
mcheck <- survcheck2(Y, id, istate)
# error messages here
if (mcheck$flag["overlap"] > 0)
stop("data set has overlapping intervals for one or more subjects")
transitions <- mcheck$transitions
istate <- mcheck$istate
states <- mcheck$states
# build tmap, which has one row per term, one column per transition
if (missing(statedata))
covlist2 <- parsecovar2(covlist, NULL, dformula= dformula,
Terms, transitions, states)
else covlist2 <- parsecovar2(covlist, statedata, dformula= dformula,
Terms, transitions, states)
tmap <- covlist2$tmap
if (!is.null(covlist)) {
<<coxph-missing>>
}
@
For multi-state models we can't tell what observations should be removed until
any extra formulas have been processed.
There may be rows that are missing some of the covariates but
are okay for \emph{some} transitions, i.e., a covariate that is used in
only some transitions. Other rows may be NA for every transition.
Those rows can be removed from the model frame before creating the X matrix.
Partially used rows, ones where the necessary covariates are
present for some of the possible transitions but not all, will be resolved
later by the stacker function.
Observations with missing response, id, weight, istate or cluster are always
removed.
<<coxph-missing>>=
miss0 <- is.na(Y) | is.na(id) # id and Y are required
if (!is.null(weights)) miss0 <- miss0 | is.na(weights)
if (!is.null(istate)) miss0 <- miss0 | is.na(istate)
if (!is.null(cluster)) miss0 <- miss0 | is.na(cluster)
# first vector will be true if there is at least 1 transition for which all
# covariates are present, second if there is at least 1 for which some are not
good.tran <- bad.tran <- rep(FALSE, nrow(Y))
# If someone has a term like sex:trt in the model but no main effect for trt
# then we have no choice but to expand tmap to a 'per col of mf' form
tmap2 <- matrix(0, ncol(mf), ncol(tmap)) # will have 1 for "mf col was used"
temp <- sapply(strsplit(rownames(tmap), ":"),
function(x) match(x, colnames(mf)))
for (i in seq(along.with=temp)[-1]) { # skip the (Baseline) row
tmap2[temp[[i]], tmap[i,] >0] <- 1
}
# create a missing indicator for each term
termiss <- matrix(0L, nrow(mf), ncol(mf))
for (i in 1:ncol(mf)) {
xx <- is.na(mf[[i]])
# spline terms have multiple columns; treat any missing as missing
if (is.matrix(xx)) termiss[,i] <- apply(xx, 1, any)
else termiss[,i] <- xx
}
for (i in 1:ncol(tmap2)) { # for each transition
rindex <- which(as.integer(istate) == covlist2$mapid[i,1]) #relevant obs
j <- which(tmap[,i] >0) # which cols of mf used in this transition
anymiss <- apply(termiss[rindex, j, drop=FALSE],1, any)
bad.tran[rindex] <- (bad.tran[rindex] | anymiss) #failed for this trans
good.tran[rindex]<- (good.tran[rindex] | !anymiss) # success
}
# the value below was useful during testing, but isn't used directly
n.partially.used <- sum(good.tran & bad.tran & !miss0)
omit <- (!good.tran & bad.tran) | miss0
if (all(omit)) stop("all observations deleted due to missing values")
temp <- setNames(seq(omit)[omit], attr(mf, "row.names")[omit])
attr(temp, "class") <- "omit"
mf <- mf[!omit,, drop=FALSE]
attr(mf, "na.action") <- temp
Y <- Y[!omit]
id <- id[!omit]
if (length(istate)) istate <- istate[!omit] # istate can be NULL
n <- data.n <- nrow(mf) # reset n
@
For a multi-state model, create the expanded X matrix; sometimes it is
much larger than the original.
The first step is to create the cmap matrix from tmap by expanding terms;
factors turn into multiple columns for instance.
If tmap has rows (terms) for strata, then we have to deal with the complication
that a strata might be applied to some transitions and not to others.
<<coxph-multi-X>>=
if (length(strats) >0) {
# tmap starts with a "(Baseline)" row, which we want
# strats is indexed off the data frame, which includes the response, so
# turns out to be correct for the remaining rows of tmap
smap <- tmap[c(1L, strats),]
smap[-1,] <- ifelse(smap[-1,] >0, 1L, 0L)
}
else smap <- tmap[1,,drop=FALSE]
@
Also create the initial values vector.
The stacker function will create a separate block of observations for every
unique value in \code{smap}.
Now say that two transitions A:B and A:C share the same baseline hazard.
Then either a B or a C outcome will be an ``event'' in that stratum; they
would only be distinguished by perhaps having different covariates.
The first thing we do with the result is to rebuild the transitions matrix:
the working version was created before removing missings and can
seriously overstate the number of transitions available.
Then set up the data.
<<coxph-multi-X>>=
cmap <- parsecovar3(tmap, colnames(X), attr(X, "assign"), covlist2$phbaseline)
xstack <- stacker(cmap, smap, as.integer(istate), X, Y, mf = mf,
states=states)
rkeep <- unique(xstack$rindex)
transitions <- survcheck2(Y[rkeep,], id[rkeep], istate[rkeep])$transitions
Xsave <- X # the originals may be needed later
Ysave <- Y
X <- xstack$X
Y <- xstack$Y
istrat <- xstack$strata
if (length(offset)) offset <- offset[xstack$rindex]
if (length(weights)) weights <- weights[xstack$rindex]
if (length(cluster)) cluster <- cluster[xstack$rindex]
@
The next step for multi X is to remake the assign attribute.
It is a list with one element per term, and needs to be expanded in the
same way as \code{tmap}, which has one row per term (+ an intercept row).
For \code{predict, type='terms'} to work, no label can be repeated in the
final assign object.
If a variable `fred' were common across all the states we would want to
use that as the label, but if it appears twice, as separate terms for
two different transitions, then we label it as fred\_x:y where x:y is the
transition.
<<coxph-multi-X>>=
t2 <- tmap[-c(1, strats),,drop=FALSE] # remove the intercept row and strata rows
r2 <- row(t2)[!duplicated(as.vector(t2)) & t2 !=0]
c2 <- col(t2)[!duplicated(as.vector(t2)) & t2 !=0]
a2 <- lapply(seq(along.with=r2), function(i) {cmap[assign[[r2[i]]], c2[i]]})
# which elements are unique?
tab <- table(r2)
count <- tab[r2]
names(a2) <- ifelse(count==1, row.names(t2)[r2],
paste(row.names(t2)[r2], colnames(cmap)[c2], sep="_"))
assign <- a2
@
An increasingly common error is for users to put the time variable on
both sides of the formula, in the mistaken belief that this will
deal with a failure of proportional hazards.
Add a test for such models, but don't bail out. There will be cases where
someone has the stop variable in an expression on the right hand side,
to create current age, say.
The \code{variables} attribute of the Terms object is the expression form
of a list that contains the response variable followed by the predictors.
Subscripting this, element 1 is the call to ``list'' itself so we always
retain it. My \code{innerterms} function works only with formula
objects.
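For example, a call of this form (lung is shipped with the package)
triggers the warning constructed below:
\begin{verbatim}
> fit <- coxph(Surv(time, status) ~ age + time, data=lung)
> # warning: a variable appears on both the left and right sides
\end{verbatim}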
<<coxph-bothsides>>=
if (length(attr(Terms, 'variables')) > 2) { # a ~1 formula has length 2
ytemp <- innerterms(formula[1:2])
suppressWarnings(z <- as.numeric(ytemp)) # are any of the elements numeric?
ytemp <- ytemp[is.na(z)] # toss numerics, e.g. Surv(t, 1-s)
xtemp <- innerterms(formula[-2])
if (any(!is.na(match(xtemp, ytemp))))
warning("a variable appears on both the left and right sides of the formula")
}
@
At this point we deal with any time transforms.
The model frame is expanded to a ``fake'' data set that has a
separate stratum for each unique event-time/strata combination,
and any tt() terms in the formula are processed.
The first step is to create the index vector [[tindex]] and
new strata [[.strata.]]. This last is included in a model.frame call
(for others to use), internally the code simply replaces the \code{istrat}
variable.
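For orientation, a typical user-level call with a time transform, in
the style of the coxph help page:
\begin{verbatim}
> fit <- coxph(Surv(time, status) ~ ph.ecog + tt(age), data=lung,
               tt = function(x, t, ...) x * log(t + 20))
\end{verbatim}
Each tt() term in the formula is paired with one function from the tt
argument (a single function is replicated as needed).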
A (modestly) fast C-routine first counts up and indexes the observations.
We start out with error checks; since the computation can be slow we want
to complain early.
<<coxph-transform>>=
timetrans <- untangle.specials(Terms, 'tt')
ntrans <- length(timetrans$terms)
if (is.null(tt)) {
tt <- function(x, time, riskset, weights){ #default to O'Brien's logit rank
obrien <- function(x) {
r <- rank(x)
(r-.5)/(.5+length(r)-r)
}
unlist(tapply(x, riskset, obrien))
}
}
if (is.function(tt)) tt <- list(tt) #single function becomes a list
if (is.list(tt)) {
if (any(!sapply(tt, is.function)))
stop("The tt argument must contain function or list of functions")
if (length(tt) != ntrans) {
if (length(tt) ==1) {
temp <- vector("list", ntrans)
for (i in 1:ntrans) temp[[i]] <- tt[[1]]
tt <- temp
}
else stop("Wrong length for tt argument")
}
}
else stop("The tt argument must contain a function or list of functions")
if (ncol(Y)==2) {
if (length(strats)==0) {
sorted <- order(-Y[,1], Y[,2])
newstrat <- rep.int(0L, nrow(Y))
newstrat[1] <- 1L
}
else {
sorted <- order(istrat, -Y[,1], Y[,2])
#newstrat marks the first obs of each strata
newstrat <- as.integer(c(1, 1*(diff(istrat[sorted])!=0)))
}
if (storage.mode(Y) != "double") storage.mode(Y) <- "double"
counts <- .Call(Ccoxcount1, Y[sorted,],
as.integer(newstrat))
tindex <- sorted[counts$index]
}
else {
if (length(strats)==0) {
sort.end <- order(-Y[,2], Y[,3])
sort.start<- order(-Y[,1])
newstrat <- c(1L, rep(0, nrow(Y) -1))
}
else {
sort.end <- order(istrat, -Y[,2], Y[,3])
sort.start<- order(istrat, -Y[,1])
newstrat <- c(1L, as.integer(diff(istrat[sort.end])!=0))
}
if (storage.mode(Y) != "double") storage.mode(Y) <- "double"
counts <- .Call(Ccoxcount2, Y,
as.integer(sort.start -1L),
as.integer(sort.end -1L),
as.integer(newstrat))
tindex <- counts$index
}
@
The C routine has returned a list with 4 elements
\begin{description}
\item[nrisk] a vector containing the number at risk at each event time
\item[time] the vector of event times
\item[status] a vector of status values
\item[index] a vector containing the set of subjects at risk for event time
1, followed by those at risk at event time 2, those at risk at event time 3,
etc.
\end{description}
The new data frame is then a simple creation.
The subtle part below is a desire to retain transformation information
so that a downstream call to \code{termplot} will work.
The tt function supplied by the user often finishes with a call to
\code{pspline} or \code{ns}. If the returned value of the \code{tt}
call has a class for which a \code{makepredictcall} method exists then
we need to do 2 things:
\begin{enumerate}
\item Construct a fake call, e.g., ``pspline(age)'', then feed it and
the result of tt as arguments to \code{makepredictcall}
\item Replace that component in the predvars attribute of the terms.
\end{enumerate}
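A small stand-alone illustration of the first step, using \code{ns}
from the splines package (the data values are arbitrary):
\begin{verbatim}
> library(splines)
> x <- runif(20)
> newtt <- ns(x, df=3)                        # mimic a tt() result
> makepredictcall(newtt, quote(ns(x, df=3)))  # fake call + result
\end{verbatim}
The value is a call with the knots and boundary knots filled in, which
is exactly the information that predvars needs to retain.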
The \code{timetrans\$terms} value counts terms of the right hand side of
the formula. Some objects in the terms structure are unevaluated calls
that include y; this adds 2 to the count (the call to ``list'' and
the response).
<<coxph-transform>>=
Y <- Surv(rep(counts$time, counts$nrisk), counts$status)
type <- 'right' # new Y is right censored, even if the old was (start, stop]
mf <- mf[tindex,]
istrat <- rep(1:length(counts$nrisk), counts$nrisk)
weights <- model.weights(mf)
if (!is.null(weights) && any(!is.finite(weights)))
stop("weights must be finite")
id <- model.extract(mf, "id") # update the id and/or cluster, if present
cluster <- model.extract(mf, "cluster")
tcall <- attr(Terms, 'variables')[timetrans$terms+2]
pvars <- attr(Terms, 'predvars')
pmethod <- sub("makepredictcall.", "", as.vector(methods("makepredictcall")))
for (i in 1:ntrans) {
newtt <- (tt[[i]])(mf[[timetrans$var[i]]], Y[,1], istrat, weights)
mf[[timetrans$var[i]]] <- newtt
nclass <- class(newtt)
if (any(nclass %in% pmethod)) { # It has a makepredictcall method
dummy <- as.call(list(as.name(class(newtt)[1]), tcall[[i]][[2]]))
ptemp <- makepredictcall(newtt, dummy)
pvars[[timetrans$terms[i]+2]] <- ptemp
}
}
attr(Terms, "predvars") <- pvars
@
This is the C code for time-transformation.
For the first case it expects y to contain time and status sorted from
longest time to shortest, and strata=1 for the first observation of
each strata.
<<coxcount1>>=
#include "survS.h"
/*
** Count up risk sets and identify who is in each
*/
SEXP coxcount1(SEXP y2, SEXP strat2) {
int ntime, nrow;
int i, j, n;
int stratastart=0; /* start row for this strata */
int nrisk=0; /* number at risk (=0 to stop -Wall complaint)*/
double *time, *status;
int *strata;
double dtime;
SEXP rlist, rlistnames, rtime, rn, rindex, rstatus;
int *rrindex, *rrstatus;
n = nrows(y2);
time = REAL(y2);
status = time +n;
strata = INTEGER(strat2);
/*
** First pass: count the total number of death times (risk sets)
** and the total number of rows in the new data set.
*/
ntime=0; nrow=0;
for (i=0; i<n; i++) {
if (strata[i] ==1) nrisk =0;
nrisk++;
if (status[i] ==1) {
ntime++;
dtime = time[i];
/* walk across tied times, if any */
for (j=i+1; j<n && time[j]==dtime && status[j]==1 && strata[j]==0;
j++) nrisk++;
i = j-1;
nrow += nrisk;
}
}
<<coxcount-alloc-memory>>
/*
** Pass 2, fill them in
*/
ntime=0;
for (i=0; i<n; i++) {
if (strata[i] ==1) stratastart =i;
if (status[i]==1) {
dtime = time[i];
for (j=stratastart; j<i; j++) *rrstatus++=0; /*non-deaths */
*rrstatus++ =1; /* this death */
/* tied deaths */
for(j= i+1; j<n && status[j]==1 && time[j]==dtime && strata[j]==0;
j++) *rrstatus++ =1;
i = j-1;
REAL(rtime)[ntime] = dtime;
INTEGER(rn)[ntime] = i +1 -stratastart;
ntime++;
for (j=stratastart; j<=i; j++) *rrindex++ = j+1;
}
}
<<coxcount-list-return>>
}
@
The start-stop case is a bit more work.
The set of subjects still at risk is an arbitrary set so we have to
keep an index vector [[atrisk]].
At each new death time we write out the set of those at risk, with the
deaths last.
I toyed with the idea of a binary tree then realized it was not useful:
at each death we need to list out all the subjects at risk into the index
vector which is an $O(n)$ process, tree or not.
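As a reminder of the at-risk definition used here, a two line check
on a toy data set (values arbitrary):
\begin{verbatim}
> df <- data.frame(start=c(0, 0, 2, 5), stop=c(3, 5, 6, 7))
> dtime <- 5
> which(df$start < dtime & dtime <= df$stop)
[1] 2 3
\end{verbatim}
Subject 1 has already left the risk set and subject 4 enters exactly
at dtime, so neither is at risk for this event.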
<<coxcount1>>=
#include "survS.h"
/* count up risk sets and identify who is in each, (start,stop] version */
SEXP coxcount2(SEXP y2, SEXP isort1, SEXP isort2, SEXP strat2) {
int ntime, nrow;
int i, j, istart, n;
int nrisk=0, *atrisk;
double *time1, *time2, *status;
int *strata;
double dtime;
int iptr, jptr;
SEXP rlist, rlistnames, rtime, rn, rindex, rstatus;
int *rrindex, *rrstatus;
int *sort1, *sort2;
n = nrows(y2);
time1 = REAL(y2);
time2 = time1+n;
status = time2 +n;
strata = INTEGER(strat2);
sort1 = INTEGER(isort1);
sort2 = INTEGER(isort2);
/*
** First pass: count the total number of death times (risk sets)
** and the total number of rows in the new data set
*/
ntime=0; nrow=0;
istart =0; /* walks along the sort1 vector (start times) */
for (i=0; i<n; i++) {
iptr = sort2[i];
if (strata[i]==1) nrisk=0;
nrisk++;
if (status[iptr] ==1) {
ntime++;
dtime = time2[iptr];
for (; istart <i && time1[sort1[istart]] >= dtime; istart++)
nrisk--;
for(j= i+1; j<n; j++) {
jptr = sort2[j];
if (status[jptr]==1 && time2[jptr]==dtime && strata[jptr]==0)
nrisk++;
else break;
}
i= j-1;
nrow += nrisk;
}
}
<<coxcount-alloc-memory>>
atrisk = (int *)R_alloc(n, sizeof(int)); /* marks who is at risk */
/*
** Pass 2, fill them in
*/
ntime=0; nrisk=0;
j=0; /* pointer to time1 */
istart=0;
for (i=0; i<n; ) {
iptr = sort2[i];
if (strata[i] ==1) {
nrisk=0;
for (j=0; j<n; j++) atrisk[j] =0;
}
nrisk++;
if (status[iptr]==1) {
dtime = time2[iptr];
for (; istart<i && time1[sort1[istart]] >=dtime; istart++) {
atrisk[sort1[istart]]=0;
nrisk--;
}
for (j=1; j<nrisk; j++) *rrstatus++ =0;
for (j=0; j<n; j++) if (atrisk[j]) *rrindex++ = j+1;
atrisk[iptr] =1;
*rrstatus++ =1;
*rrindex++ = iptr +1;
for (j=i+1; j<n; j++) {
jptr = sort2[j];
if (time2[jptr]==dtime && status[jptr]==1 && strata[jptr]==0){
atrisk[jptr] =1;
*rrstatus++ =1;
*rrindex++ = jptr +1;
nrisk++;
}
else break;
}
i = j;
REAL(rtime)[ntime] = dtime;
INTEGER(rn)[ntime] = nrisk;
ntime++;
}
else {
atrisk[iptr] =1;
i++;
}
}
<<coxcount-list-return>>
}
@
<<coxcount-alloc-memory>>=
/*
** Allocate memory
*/
PROTECT(rtime = allocVector(REALSXP, ntime));
PROTECT(rn = allocVector(INTSXP, ntime));
PROTECT(rindex=allocVector(INTSXP, nrow));
PROTECT(rstatus=allocVector(INTSXP,nrow));
rrindex = INTEGER(rindex);
rrstatus= INTEGER(rstatus);
@
<<coxcount-list-return>>=
/* return the list */
PROTECT(rlist = allocVector(VECSXP, 4));
SET_VECTOR_ELT(rlist, 0, rn);
SET_VECTOR_ELT(rlist, 1, rtime);
SET_VECTOR_ELT(rlist, 2, rindex);
SET_VECTOR_ELT(rlist, 3, rstatus);
PROTECT(rlistnames = allocVector(STRSXP, 4));
SET_STRING_ELT(rlistnames, 0, mkChar("nrisk"));
SET_STRING_ELT(rlistnames, 1, mkChar("time"));
SET_STRING_ELT(rlistnames, 2, mkChar("index"));
SET_STRING_ELT(rlistnames, 3, mkChar("status"));
setAttrib(rlist, R_NamesSymbol, rlistnames);
unprotect(6);
return(rlist);
@
We now return to the original thread of the program, though perhaps
with new data, and build the $X$ matrix.
Creation of the $X$ matrix for a Cox model requires just a bit of
trickery.
The baseline hazard for a Cox model plays the role of an intercept,
but does not appear in the $X$ matrix.
However, to create the columns of $X$ for factor variables correctly,
we need to call the model.matrix routine in such a way that it \emph{thinks}
there is an intercept, and so we set the intercept attribute to 1 in
the terms object before calling model.matrix, ignoring any -1 term the
user may have added.
One simple way to handle all this is to call model.matrix on the original
formula and then remove the terms we don't need.
However,
\begin{enumerate}
\item The cluster() term, if any, could lead to thousands of extraneous
``intercept'' columns which are never needed.
\item Likewise, nested case-control models can have thousands of strata,
again leading to many intercepts we never need. They never have strata by
covariate interactions, however.
\item If there are strata by covariate interactions in the model,
the dummy intercepts-per-strata columns are necessary information for the
model.matrix routine to correctly compute other columns of $X$.
\end{enumerate}
On later reflection \code{cluster} should never have been in the model
statement in the first place, something that became painfully apparent
with addition of multi-state models.
In the future we will discourage it.
For reason 2 above the usual plan is to also remove strata
terms from the ``Terms'' object \emph{before} calling model.matrix,
unless there are strata by covariate interactions in which case we remove
them after.
If anything is pre-dropped, for documentation purposes we want the
returned assign attribute to match the Terms structure that we will
hand back. (Do we ever use it?)
In particular, the numbers therein correspond to the column names in
\code{attr(Terms, 'factors')}.
This requires a shift. The cluster and strata terms are seen as main
effects, so appear early in that list.
We have found a case where terms get relabeled:
<<relabel>>=
t1 <- terms( ~(x1 + x2):x3 + strata(x4))
t2 <- terms( ~(x1 + x2):x3)
t3 <- t1[-1]
colnames(attr(t1, "factors"))
colnames(attr(t2, "factors"))
colnames(attr(t3, "factors"))
@
In t1 the strata term appears first, as it is the only thing that looks like
a main effect, and the column labels are strata(x4), x1:x3, x2:x3.
In t3 the column labels are x1:x3 and x3:x2 --- note left-right swap of
the second. This means that using match() on the labels is not a reliable
approach.
We instead assume that nothing is reordered and do a shift.
<<coxph-make-X>>=
if (length(dropterms)) {
Terms2 <- Terms[-dropterms]
X <- model.matrix(Terms2, mf, contrasts.arg=contrast.arg)
# we want to number the terms wrt the original model matrix
temp <- attr(X, "assign")
shift <- sort(dropterms)
for (i in seq(along.with=shift))
temp <- temp + 1*(shift[i] <= temp)
attr(X, "assign") <- temp
}
else X <- model.matrix(Terms, mf, contrasts.arg=contrast.arg)
# drop the intercept after the fact, and also drop strata if necessary
Xatt <- attributes(X)
if (hasinteractions) adrop <- c(0, untangle.specials(Terms, "strata")$terms)
else adrop <- 0
xdrop <- Xatt$assign %in% adrop #columns to drop (always the intercept)
X <- X[, !xdrop, drop=FALSE]
attr(X, "assign") <- Xatt$assign[!xdrop]
attr(X, "contrasts") <- Xatt$contrasts
Xmeans <- colMeans(X) # do this before expanding a multistate model
@
Finish the setup. If someone includes an init statement or offset, make sure
that it does not lead to instant code failure due to overflow/underflow.
The mean offset is added back to the linear predictors at the end, to maintain
consistency with \code{predict.coxph(fit, newdata= originaldata)}.
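A two line illustration of why the centering matters:
\begin{verbatim}
> exp(c(799, 800, 801))         # overflows: exp(x) is Inf above ~710
[1] Inf Inf Inf
> exp(c(799, 800, 801) - 800)   # the centered version is harmless
[1] 0.3678794 1.0000000 2.7182818
\end{verbatim}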
<<coxph-setup>>=
offset <- model.offset(mf)
if (is.null(offset) || all(offset==0)) {
offset <- rep(0., nrow(mf))
meanoffset <- 0
} else if (any(!is.finite(offset) | !is.finite(exp(offset))))
stop("offsets must lead to a finite risk score")
else {
meanoffset <- mean(offset)
offset <- offset - meanoffset # this can help stability of exp()
}
weights <- model.weights(mf)
if (!is.null(weights) && any(!is.finite(weights)))
stop("weights must be finite")
assign <- attrassign(X, Terms)
contr.save <- attr(X, "contrasts")
<<coxph-zeroevent>>
@
Check for a rare edge case: a data set with no events. In this case the
return structure is simple.
The coefficients will all be NA, since they can't be estimated.
The variance matrix is all zeros, in line with the usual rule to zero out
any row and col corresponding to an NA coef.
The loglik is the sum of zero terms, which we set to zero like the usual
R result for sum(numeric(0)).
An overall idea is to return something that won't blow up later code.
<<coxph-zeroevent>>=
if (sum(Y[, ncol(Y)]) == 0) {
# No events in the data!
ncoef <- ncol(X)
ctemp <- rep(NA, ncoef)
names(ctemp) <- colnames(X)
concordance= c(concordant=0, discordant=0, tied.x=0, tied.y=0, tied.xy=0,
concordance=NA, std=NA, timefix=FALSE)
rval <- list(coefficients= ctemp,
var = matrix(0.0, ncoef, ncoef),
loglik=c(0,0),
score =0,
iter =0,
linear.predictors = offset,
residuals = rep(0.0, data.n),
means = colMeans(X), method=method,
n = data.n, nevent=0, terms=Terms, assign=assign,
concordance=concordance, wald.test=0.0,
y = Y, call=Call)
class(rval) <- "coxph"
return(rval)
}
@
Check for penalized terms in the model, and set up infrastructure for
the fitting routines to deal with them.
<<coxph-penal>>=
pterms <- sapply(mf, inherits, 'coxph.penalty')
if (any(pterms)) {
pattr <- lapply(mf[pterms], attributes)
pname <- names(pterms)[pterms]
if (robust) {
warning("the robust variance is not defined for a penalized model, option ignored")
robust <- FALSE
}
#
# Check the order of any penalty terms
ord <- attr(Terms, "order")[match(pname, attr(Terms, 'term.labels'))]
if (any(ord>1)) stop ('Penalty terms cannot be in an interaction')
pcols <- assign[match(pname, names(assign))]
fit <- coxpenal.fit(X, Y, istrat, offset, init=init,
control,
weights=weights, method=method,
row.names(mf), pcols, pattr, assign,
nocenter= nocenter)
}
@
<<coxph-compute>>=
else {
rname <- row.names(mf)
if (multi) rname <- rname[xstack$rindex]
if( method=="breslow" || method =="efron") {
if (grepl('right', type))
fit <- coxph.fit(X, Y, istrat, offset, init, control,
weights=weights, method=method,
rname, nocenter=nocenter)
else fit <- agreg.fit(X, Y, istrat, offset, init, control,
weights=weights, method=method,
rname, nocenter=nocenter)
}
else if (method=='exact') {
if (type== "right")
fit <- coxexact.fit(X, Y, istrat, offset, init, control,
weights=weights, method=method,
rname, nocenter=nocenter)
else fit <- agexact.fit(X, Y, istrat, offset, init, control,
weights=weights, method=method,
rname, nocenter=nocenter)
}
else stop(paste ("Unknown method to ties", method))
}
@
<<coxph-finish>>=
if (is.character(fit)) {
fit <- list(fail=fit)
class(fit) <- 'coxph'
}
else {
if (!is.null(fit$coefficients) && any(is.na(fit$coefficients))) {
vars <- (1:length(fit$coefficients))[is.na(fit$coefficients)]
msg <-paste("X matrix deemed to be singular; variable",
paste(vars, collapse=" "))
if (!singular.ok) stop(msg)
# else warning(msg) # stop being chatty
}
fit$n <- data.n
fit$nevent <- sum(Y[,ncol(Y)])
if (length(id)>0) fit$n.id <- n.id
fit$terms <- Terms
fit$assign <- assign
class(fit) <- fit$class
fit$class <- NULL
# don't compute a robust variance if there are no coefficients
if (robust && !is.null(fit$coefficients) && !all(is.na(fit$coefficients))) {
fit$naive.var <- fit$var
# a little sneaky here: by calling resid before adding the
# na.action method, I avoid having missings re-inserted
# I also make sure that it doesn't have to reconstruct X and Y
fit2 <- c(fit, list(x=X, y=Y, weights=weights))
if (length(istrat)) fit2$strata <- istrat
if (length(cluster)) {
temp <- residuals.coxph(fit2, type='dfbeta', collapse=cluster,
weighted=TRUE)
# get score for null model
if (is.null(init))
fit2$linear.predictors <- 0*fit$linear.predictors
else fit2$linear.predictors <- c(X %*% init)
temp0 <- residuals.coxph(fit2, type='score', collapse=cluster,
weighted=TRUE)
}
else {
temp <- residuals.coxph(fit2, type='dfbeta', weighted=TRUE)
fit2$linear.predictors <- 0*fit$linear.predictors
temp0 <- residuals.coxph(fit2, type='score', weighted=TRUE)
}
fit$var <- crossprod(temp)
u <- apply(as.matrix(temp0), 2, sum)
fit$rscore <- coxph.wtest(t(temp0)%*%temp0, u, control$toler.chol)$test
}
#Wald test
if (length(fit$coefficients) && is.null(fit$wald.test)) {
#not for intercept only models, or if test is already done
nabeta <- !is.na(fit$coefficients)
# The init vector might be longer than the betas, for a sparse term
if (is.null(init)) temp <- fit$coefficients[nabeta]
else temp <- (fit$coefficients -
init[1:length(fit$coefficients)])[nabeta]
fit$wald.test <- coxph.wtest(fit$var[nabeta,nabeta], temp,
control$toler.chol)$test
}
# Concordance. Done here so that we can use cluster if it is present
# The returned value is a subset of the full result, partly because it
is all we need, but more for backward compatibility with survConcordance.fit
if (length(cluster))
temp <- concordancefit(Y, fit$linear.predictors, istrat, weights,
cluster=cluster, reverse=TRUE,
timefix= FALSE)
else temp <- concordancefit(Y, fit$linear.predictors, istrat, weights,
reverse=TRUE, timefix= FALSE)
if (is.matrix(temp$count))
fit$concordance <- c(colSums(temp$count), concordance=temp$concordance,
std=sqrt(temp$var))
else fit$concordance <- c(temp$count, concordance=temp$concordance,
std=sqrt(temp$var))
na.action <- attr(mf, "na.action")
if (length(na.action)) fit$na.action <- na.action
if (model) {
if (length(timetrans)) {
stop("'model=TRUE' not supported for models with tt terms")
}
fit$model <- mf
}
if (x) {
if (multi) fit$x <- Xsave else fit$x <- X
if (length(timetrans)) fit$strata <- istrat
else if (length(strats)) fit$strata <- strata.keep
}
if (y) {
if (multi) fit$y <- Ysave else fit$y <- Y
}
fit$timefix <- control$timefix # remember this option
}
@
If any of the weights were not 1, save them in the fit object.
Add names to the means component, which are occasionally
useful to survfit.coxph.
Other objects below are used when we need to recreate a
model frame.
A multi-state model will have a matrix of linear predictors and of residuals.
Each has a column for each transition and a row for each subject.
The rows are with respect to the starting X and Y, not the expanded ones which
were used to compute the coefficients.
The expanded linear predictor is easy: Xbeta where beta is the matrix form of
the coefficients.
Residuals are a bit more of a nuisance: if an observation was at risk for an a:b
transition, it will appear in the a:b strata of the expanded X matrix, and that
residual fills in the appropriate row/col. If it was not at risk for said
transition, the residual is zero.
There is, however, a further problem. Any transitions for which there are no
covariates were not sent across as a strata to the fitting routine --- they
would create a stratum where all covariates = 0, which would be computation
for no purpose.
But that also means that the martingale residuals are not computed for those
rows of the data.
<<coxph-finish>>=
if (!is.null(weights) && any(weights!=1)) fit$weights <- weights
if (multi) {
fit$transitions <- transitions
fit$states <- states
fit$cmap <- cmap
fit$smap <- smap # why not 'stratamap'? Confusion with fit$strata
nonzero <- which(colSums(cmap)!=0)
fit$rmap <- cbind(row=xstack$rindex, transition= nonzero[xstack$transition])
# add a suffix to each coefficient name. Those that map to multiple transitions
# get the first transition they map to
single <- apply(cmap, 1, function(x) all(x %in% c(0, max(x)))) #only 1 coef
cindx <- col(cmap)[match(1:length(fit$coefficients), cmap)]
rindx <- row(cmap)[match(1:length(fit$coefficients), cmap)]
suffix <- ifelse(single[rindx], "", paste0("_", colnames(cmap)[cindx]))
newname <- paste0(names(fit$coefficients), suffix)
if (any(covlist2$phbaseline > 0)) {
# for proportional baselines, use a better name
base <- colnames(tmap)[covlist2$phbaseline]
child <- colnames(tmap)[which(covlist2$phbaseline >0)]
indx <- 1 + length(newname) - length(base):1 # coefs are the last ones
newname[indx] <- paste0("ph(", child, "/", base, ")")
phrow <- apply(cmap, 1, function(x) all(x[x>0] %in% indx))
matcoef <- cmap[!phrow,,drop=FALSE ] # ph() terms excluded
}
else matcoef <- cmap
names(fit$coefficients) <- newname
if (FALSE) {
# an idea that was tried, then paused: make the linear predictors
# and residuals into matrices with one column per transition
# It leads to a much larger fit object, so we do this expansion in
# predict/residuals instead.
matcoef[matcoef>0] <- fit$coefficients[matcoef]
temp <- Xsave %*% matcoef
colnames(temp) <- colnames(cmap)
fit$linear.predictors <- temp
temp <- matrix(0., nrow=nrow(Xsave), ncol=ncol(fit$cmap))
temp[cbind(xstack$rindex, xstack$transition)] <- fit$residuals
# if there are any transitions with no covariates, residuals have not
# yet been calculated for those.
if (any(colSums(cmap) ==0)) {
from.state <- as.numeric(sub(":.*$", "", colnames(cmap)))
to.state <- as.numeric(sub("^.*:", "", colnames(cmap)))
# warning("no covariate residuals not filled in")
}
fit$residuals <- temp
}
class(fit) <- c("coxphms", class(fit))
}
names(fit$means) <- names(fit$coefficients)
fit$formula <- formula(Terms)
if (length(xlevels) >0) fit$xlevels <- xlevels
fit$contrasts <- contr.save
if (meanoffset !=0) fit$linear.predictors <- fit$linear.predictors + meanoffset
if (x & any(offset !=0)) fit$offset <- offset
fit$call <- Call
fit
@
The model.matrix and model.frame routines are called after a Cox model to
reconstruct those portions.
Much of their code is shared with the coxph routine.
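Typical use, with the lung data shipped in the package:
\begin{verbatim}
> fit <- coxph(Surv(time, status) ~ age + strata(sex), data=lung)
> X <- model.matrix(fit)   # rebuilt from the call, since x=FALSE above
\end{verbatim}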
<<model.matrix.coxph>>=
# In internal use "data" will often be an already derived model frame.
# We detect this via it having a terms attribute.
model.matrix.coxph <- function(object, data=NULL,
contrast.arg=object$contrasts, ...) {
#
# If the object has an "x" component, return it, unless a new
# data set is given
if (is.null(data) && !is.null(object[['x']]))
return(object[['x']]) #don't match "xlevels"
Terms <- delete.response(object$terms)
if (is.null(data)) mf <- stats::model.frame(object)
else {
if (is.null(attr(data, "terms")))
mf <- stats::model.frame(Terms, data, xlev=object$xlevels)
else mf <- data #assume "data" is already a model frame
}
cluster <- attr(Terms, "specials")$cluster
if (length(cluster)) {
temp <- untangle.specials(Terms, "cluster")
dropterms <- temp$terms
}
else dropterms <- NULL
strats <- attr(Terms, "specials")$strata
hasinteractions <- FALSE
if (length(strats)) {
stemp <- untangle.specials(Terms, 'strata', 1)
if (length(stemp$vars)==1) strata.keep <- mf[[stemp$vars]]
else strata.keep <- strata(mf[,stemp$vars], shortlabel=TRUE)
istrat <- as.integer(strata.keep)
for (i in stemp$vars) { #multiple strata terms are allowed
# The factors attr has one row for each variable in the frame, one
# col for each term in the model. Pick rows for each strata
# var, and find if it participates in any interactions.
if (any(attr(Terms, 'order')[attr(Terms, "factors")[i,] >0] >1))
hasinteractions <- TRUE
}
if (!hasinteractions) dropterms <- c(dropterms, stemp$terms)
else if (multi) stop("multistate does not allow strata by covariate interactions")
} else istrat <- NULL
<<coxph-make-X>>
X
}
@
In parallel is the model.frame routine, which reconstructs the model frame.
This routine currently doesn't do all that we want. To wit, the following code
fails:
\begin{verbatim}
> tfun <- function(formula, ndata) {
fit <- coxph(formula, data=ndata)
model.frame(fit)
}
> tfun(Surv(time, status) ~ age, lung)
Error: ndata not found
\end{verbatim}
The genesis of this problem is hard to unearth, but has to do with the
nonstandard evaluation rules used by model.frame.default. In essence it pays attention to
the environment of the formula, but the enclos argument of eval appears to be
ignored. I've not yet found a solution, other than to tell users to set x=TRUE
when calling coxph inside a subroutine.
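A sketch of a closely related workaround: \code{model=TRUE} saves the
model frame in the fit, and the code below then returns it without
re-evaluating the call.
\begin{verbatim}
> tfun2 <- function(formula, ndata) {
    fit <- coxph(formula, data=ndata, model=TRUE)
    model.frame(fit)   # returns fit$model, ndata is never looked up
}
> tfun2(Surv(time, status) ~ age, lung)
\end{verbatim}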
<<model.matrix.coxph>>=
model.frame.coxph <- function(formula, ...) {
dots <- list(...)
nargs <- dots[match(c("data", "na.action", "subset", "weights",
"id", "cluster", "istate"),
names(dots), 0)]
# If nothing has changed and the coxph object had a model component,
# simply return it.
if (length(nargs) ==0 && !is.null(formula$model)) return(formula$model)
else {
# Rebuild the original call to model.frame
Terms <- terms(formula)
fcall <- formula$call
indx <- match(c("formula", "data", "weights", "subset", "na.action",
"cluster", "id", "istate"),
names(fcall), nomatch=0)
if (indx[1] ==0) stop("The coxph call is missing a formula!")
temp <- fcall[c(1,indx)] # only keep the arguments we wanted
temp[[1]] <- quote(stats::model.frame) # change the function called
temp$xlev <- formula$xlevels # this will turn strings to factors
temp$formula <- Terms #keep the predvars attribute
# Now, any arguments that were on this call overtake the ones that
# were in the original call.
if (length(nargs) >0)
temp[names(nargs)] <- nargs
# Make "tt" visible for coxph formulas,
if (!is.null(attr(temp$formula, "specials")$tt)) {
coxenv <- new.env(parent= environment(temp$formula))
assign("tt", function(x) x, envir=coxenv)
environment(temp$formula) <- coxenv
}
# The documentation for model.frame implies that the environment arg
# to eval will be ignored, but if we omit it there is a problem.
if (is.null(environment(formula$terms)))
mf <- eval(temp, parent.frame())
else mf <- eval(temp, environment(formula$terms), parent.frame())
if (!is.null(attr(formula$terms, "dataClasses")))
.checkMFClasses(attr(formula$terms, "dataClasses"), mf)
if (is.null(attr(Terms, "specials")$tt)) return(mf)
else {
# Do time transform
tt <- eval(formula$call$tt)
Y <- aeqSurv(model.response(mf))
strats <- attr(Terms, "specials")$strata
if (length(strats)) {
stemp <- untangle.specials(Terms, 'strata', 1)
if (length(stemp$vars)==1) strata.keep <- mf[[stemp$vars]]
else strata.keep <- strata(mf[,stemp$vars], shortlabel=TRUE)
istrat <- as.numeric(strata.keep)
}
<<coxph-transform>>
mf[[".strata."]] <- istrat
return(mf)
}
}
}
@
\subsection{Exact partial likelihood}
Let $r_i = \exp(X_i\beta)$ be the risk score for observation $i$.
For one of the time points assume that there are $d$
tied deaths among $n$ subjects at risk.
For convenience we will index them as $i= 1,\ldots,d$ in the $n$ at risk.
Then for the exact partial likelihood, the contribution at this time point
is
\begin{align*}
L &= \sum_{i=1}^d \log(r_i) - \log(D) \\
\frac{\partial L}{\partial \beta_j} &= x_{ij} - (1/D)
\frac{\partial D}{\partial \beta_j} \\
\frac{\partial^2 L}{\partial \beta_j \partial \beta_k} &=
(1/D^2)\left[D\frac{\partial^2D}{\partial \beta_j \partial \beta_k} -
\frac{\partial D}{\partial \beta_j}\frac{\partial D}{\partial \beta_k}
\right]
\end{align*}
The hard part of this computation is $D$, which is a sum
\begin{equation*}
D = \sum_{S(d,n)} r_{s_1}r_{s_2} \ldots r_{s_d}
\end{equation*}
where $S(d,n)$ is the set of all possible subsets of size $d$ from $n$
objects, and $s_1, s_2, \ldots$ indexes the current selection.
So if $n=6$ and $d=2$ we would have the 15 pairs 12, 13, \ldots, 56;
for $n=5$ and $d=3$ there would be 10 triples 123, 124, 125, \ldots, 345.
The brute force computation of all subsets can take a very long time.
Gail et al \cite{Gail81} show simple recursion formulas that speed
this up considerably. Let $D(d,n)$ be the denominator with $d$
deaths and $n$ subjects. Then
\begin{align}
D(d,n) &= r_nD(d-1, n-1) + D(d, n-1) \label{d0}\\
\frac{\partial D(d,n)}{\partial \beta_j} &=
\frac{\partial D(d, n-1)}{\partial \beta_j} +
r_n \frac{\partial D(d-1, n-1)}{\partial \beta_j} +
x_{nj}r_n D(d-1, n-1) \label{d1}\\
\frac{\partial^2 D(d,n)}{\partial \beta_j \partial \beta_k} &=
\frac{\partial^2D(d,n-1)}{\partial \beta_j \partial \beta_k} +
r_n\frac{\partial^2D(d-1,n-1)}{\partial \beta_j \partial \beta_k} +
x_{nj}r_n\frac{\partial D(d-1, n-1)}{\partial \beta_k} + \nonumber \\
& x_{nk}r_n\frac{\partial D(d-1, n-1)}{\partial \beta_j} +
x_{nj}x_{nk}r_n D(d-1, n-1) \label{d2}
\end{align}
The above recursion is captured in the three routines below.
The first calculates $D$.
It is called with $d$, $n$, an array that will contain all the
values of $D(d,n)$ computed so far, and the first dimension of the array.
The initial condition $D(0,n)=1$ is important to all three routines.
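As a sanity check, the recursion for $D(d,n)$ can be written directly
in R (illustrative only; the memoization used by the C routines is
omitted):
\begin{verbatim}
> D <- function(d, n, r) {     # r = vector of risk scores
    if (d == 0) return(1)      # the initial condition
    if (d > n)  return(0)
    r[n] * D(d-1, n-1, r) + D(d, n-1, r)
}
> r <- exp(runif(5))
> brute <- sum(apply(combn(5, 3), 2, function(i) prod(r[i])))
> all.equal(D(3, 5, r), brute)
[1] TRUE
\end{verbatim}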
<<excox-recur>>=
#define NOTDONE -1.1
double coxd0(int d, int n, double *score, double *dmat,
int dmax) {
double *dn;
if (d==0) return(1.0);
dn = dmat + (n-1)*dmax + d -1; /* pointer to dmat[d,n] */
if (*dn == NOTDONE) { /* still to be computed */
*dn = score[n-1]* coxd0(d-1, n-1, score, dmat, dmax);
if (d<n) *dn += coxd0(d, n-1, score, dmat, dmax);
}
return(*dn);
}
@
The next routine calculates the derivative with respect to a particular
coefficient. It will be called once for each covariate with d1 pointing to
the work array for that covariate.
The second derivative calculation is per pair of variables; the
\texttt{d1j} and \texttt{d1k} arrays are the appropriate first derivative
arrays of saved values.
It is possible for the first derivative to be exactly 0 (if all values
of the covariate are identical for instance) in which case we may recalculate the
derivative for a particular (d,n) case multiple times unnecessarily,
since we are using value=0 as a marker for
``not yet computed''.
This case is essentially nonexistent in real data, however.
Later update: user feedback about an ``infinite computation'' proved that the
case most definitely does exist: in one strata the first 65 rows had x=0 for
one of the variables. Not actually infinite compute time, but close enough.
One solution is to pick a value that will never occur as the first derivative.
That is impossible, but actually anything other than 0 should never be the
first derivative for more than a single (d,n) combination.
We use a negative
number for the constant NOTDONE since d0 must be positive, and thus no
issues arise there.
<<excox-recur>>=
double coxd1(int d, int n, double *score, double *dmat, double *d1,
double *covar, int dmax) {
int indx;
indx = (n-1)*dmax + d -1; /*index to the current array member d1[d.n]*/
if (d1[indx] == NOTDONE) { /* still to be computed */
d1[indx] = score[n-1]* covar[n-1]* coxd0(d-1, n-1, score, dmat, dmax);
if (d<n) d1[indx] += coxd1(d, n-1, score, dmat, d1, covar, dmax);
if (d>1) d1[indx] += score[n-1]*
coxd1(d-1, n-1, score, dmat, d1, covar, dmax);
}
return(d1[indx]);
}
double coxd2(int d, int n, double *score, double *dmat, double *d1j,
double *d1k, double *d2, double *covarj, double *covark,
int dmax) {
int indx;
indx = (n-1)*dmax + d -1; /*index to the current array member d1[d,n]*/
if (d2[indx] == NOTDONE) { /*still to be computed */
d2[indx] = coxd0(d-1, n-1, score, dmat, dmax)*score[n-1] *
covarj[n-1]* covark[n-1];
if (d<n) d2[indx] += coxd2(d, n-1, score, dmat, d1j, d1k, d2, covarj,
covark, dmax);
if (d>1) d2[indx] += score[n-1] * (
coxd2(d-1, n-1, score, dmat, d1j, d1k, d2, covarj, covark, dmax) +
covarj[n-1] * coxd1(d-1, n-1, score, dmat, d1k, covark, dmax) +
covark[n-1] * coxd1(d-1, n-1, score, dmat, d1j, covarj, dmax));
}
return(d2[indx]);
}
@
Now for the main body. Start with the dull part of the code:
declarations.
I use \code{maxiter2} for the
S structure and \code{maxiter} for the variable within it, and
so on for the other input arguments.
All the input arguments except strata are read-only.
The output beta vector starts as a copy of ibeta.
<<coxexact>>=
#include <math.h>
#include "survS.h"
#include "survproto.h"
#include <R_ext/Utils.h>
<<excox-recur>>
SEXP coxexact(SEXP maxiter2, SEXP y2,
SEXP covar2, SEXP offset2, SEXP strata2,
SEXP ibeta, SEXP eps2, SEXP toler2) {
int i,j,k;
int iter, notfinite;
double **covar, **imat; /*ragged arrays */
double *time, *status; /* input data */
double *offset;
int *strata;
int sstart; /* starting obs of current strata */
double *score;
double *oldbeta;
double zbeta;
double newlk=0;
double temp;
int halving; /*are we doing step halving at the moment? */
int nrisk =0; /* number of subjects in the current risk set */
int dsize, /* memory needed for one coxc0, coxc1, or coxd2 array */
dmemtot, /* amount needed for all arrays */
ndeath; /* number of deaths at the current time point */
double maxdeath; /* max tied deaths within a strata */
double dtime; /* time value under current examination */
double *dmem0, **dmem1, *dmem2; /* pointers to memory */
double *dtemp; /* used for zeroing the memory */
double *d1; /* current first derivatives from coxd1 */
double d0; /* global sum from coxc0 */
/* copies of scalar input arguments */
int nused, nvar, maxiter;
double eps, toler;
/* returned objects */
SEXP imat2, beta2, u2, loglik2;
double *beta, *u, *loglik;
SEXP rlist, rlistnames;
int nprotect; /* number of protect calls I have issued */
<<excox-setup>>
<<excox-strata>>
<<excox-iter0>>
<<excox-iter>>
}
@
Setup is ordinary. Grab S objects and assign others.
I use \verb!R_alloc! for temporary ones since it is released automatically on
return.
<<excox-setup>>=
nused = LENGTH(offset2);
nvar = ncols(covar2);
maxiter = asInteger(maxiter2);
eps = asReal(eps2); /* convergence criteria */
toler = asReal(toler2); /* tolerance for cholesky */
/*
** Set up the ragged array pointer to the X matrix,
** and pointers to time and status
*/
covar= dmatrix(REAL(covar2), nused, nvar);
time = REAL(y2);
status = time +nused;
strata = INTEGER(PROTECT(duplicate(strata2)));
offset = REAL(offset2);
/* temporary vectors */
score = (double *) R_alloc(nused+nvar, sizeof(double));
oldbeta = score + nused;
/*
** create output variables
*/
PROTECT(beta2 = duplicate(ibeta));
beta = REAL(beta2);
PROTECT(u2 = allocVector(REALSXP, nvar));
u = REAL(u2);
PROTECT(imat2 = allocVector(REALSXP, nvar*nvar));
imat = dmatrix(REAL(imat2), nvar, nvar);
PROTECT(loglik2 = allocVector(REALSXP, 5)); /* loglik(2), sctest, flag, iter */
loglik = REAL(loglik2);
nprotect = 5;
@
The data passed to us has been sorted by strata, and
reverse time within strata (longest subject first).
The variable [[strata]] will be 1 at the start of each new strata.
Separate strata are completely separate computations: time 10 in
one strata and time 10 in another are not commingled.
Compute the largest product (size of strata)*
(max tied deaths in strata) for allocating scratch space.
When computing $D$ it is advantageous to create all the intermediate
values of $D(d,n)$ in an array since they will be used in the
derivative calculation. Likewise, the first derivatives are used
in calculating the second.
Even more importantly, say we have a large data set. It will
be sorted with the longest times first.
If there is a death with 30 at risk and another with 40 at
risk, the intermediate sums we computed for the n=30 case
are part of the computation for n=40. To make this
work we need to index our matrices, within any strata,
by the maximum number of tied deaths in the strata.
We save this in the strata variable: the first obs of a new
strata holds the maximum number of tied deaths for that strata.
And what if a strata had 0 events? We mark it with a 1.
Note that the maxdeath variable is floating point: someone once called this
routine with a data set whose (tied deaths) * (number at risk) product
overflows integer arithmetic.
We now check for that below and fail with a message;
such a run would take longer than forever to complete even if the integer
subscripts did not overflow.
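In R terms, the per strata quantity being stored is just the maximum number
of tied deaths at any single time, with a floor of 1, as in this small
sketch (illustration only; the function and argument names are hypothetical):
\begin{verbatim}
maxtied <- function(time, status, grp) {
    sapply(split(seq_along(time), grp), function(i)
        max(1, tapply(status[i], time[i], sum)))
}
\end{verbatim}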
<<excox-strata>>=
strata[0] =1; /* in case the parent forgot (e.g., no strata case)*/
temp = 0; /* temp variable for dsize */
maxdeath =0;
j=0; /* first obs of current stratum */
ndeath=0; nrisk=0;
for (i=0; i<nused;) {
if (strata[i]==1) { /* first obs of a new strata */
if (i>0) {
/* assign data for the prior stratum, just finished */
/* If maxdeath <2 leave the strata alone at its current value of 1 */
if (maxdeath >1) strata[j] = maxdeath;
j = i;
if (maxdeath*nrisk > temp) temp = maxdeath*nrisk;
}
maxdeath =0; /* max tied deaths at any time in this strata */
nrisk=0;
ndeath =0;
}
dtime = time[i];
ndeath =0; /*number tied here */
while (time[i] ==dtime) {
nrisk++;
ndeath += status[i];
i++;
if (i>=nused || strata[i] >0) break; /* don't cross strata */
}
if (ndeath > maxdeath) maxdeath = ndeath;
}
/* data for the final stratum */
if (maxdeath*nrisk > temp) temp = maxdeath*nrisk;
if (maxdeath >1) strata[j] = maxdeath;
/* Now allocate memory for the scratch arrays
Each per-variable slice is of size dsize
*/
dsize = temp;
temp = temp * ((nvar*(nvar+1))/2 + nvar + 1);
dmemtot = dsize * ((nvar*(nvar+1))/2 + nvar + 1);
if (temp != dmemtot) { /* the subscripts will overflow */
error("(number at risk) * (number tied deaths) is too large");
}
dmem0 = (double *) R_alloc(dmemtot, sizeof(double)); /*pointer to memory */
dmem1 = (double **) R_alloc(nvar, sizeof(double*));
dmem1[0] = dmem0 + dsize; /*points to the first derivative memory */
for (i=1; i<nvar; i++) dmem1[i] = dmem1[i-1] + dsize;
d1 = (double *) R_alloc(nvar, sizeof(double)); /*first deriv results */
@
Here is a standard iteration step. Walk forward to a new time,
then through all the ties with that time.
If there are any deaths, the contributions to the log-likelihood, the first
derivative, and minus the second derivative (the information matrix) at
this time point are
\begin{align}
L &= \left(\sum_{i \in deaths} X_i\beta\right) - \log(D) \\
\frac{\partial L}{\partial \beta_j} &= \left(\sum_{i \in deaths} X_{ij} \right) -
\frac{\partial D(d,n)}{\partial \beta_j} D^{-1}(d,n) \\
-\frac{\partial^2 L }{\partial \beta_j \partial \beta_k} &=
\frac{\partial^2 D(d,n) }{\partial \beta_j \partial \beta_k} D^{-1}(d,n) -
\frac{\partial D(d,n)}{\partial \beta_j}
\frac{\partial D(d,n)}{\partial \beta_k} D^{-2}(d,n)
\end{align}
Even the efficient calculation can be computationally intense, so check for
user interrupt requests on a regular basis.
<<excox-addup>>=
sstart =0; /* a line to make gcc stop complaining */
for (i=0; i<nused; ) {
if (strata[i] >0) { /* first obs of a new strata */
maxdeath= strata[i];
dtemp = dmem0;
for (j=0; j<dmemtot; j++) *dtemp++ = NOTDONE;
sstart =i;
nrisk =0;
}
dtime = time[i]; /*current unique time */
ndeath =0;
while (time[i] == dtime) {
zbeta= offset[i];
for (j=0; j<nvar; j++) zbeta += covar[j][i] * beta[j];
score[i] = exp(zbeta);
if (status[i]==1) {
newlk += zbeta;
for (j=0; j<nvar; j++) u[j] += covar[j][i];
ndeath++;
}
nrisk++;
i++;
if (i>=nused || strata[i] >0) break;
}
/* We have added up over the death time, now process it */
if (ndeath >0) { /* Add to the loglik */
d0 = coxd0(ndeath, nrisk, score+sstart, dmem0, maxdeath);
R_CheckUserInterrupt();
newlk -= log(d0);
dmem2 = dmem0 + (nvar+1)*dsize; /*start for the second deriv memory */
for (j=0; j<nvar; j++) { /* for each covariate */
d1[j] = coxd1(ndeath, nrisk, score+sstart, dmem0, dmem1[j],
covar[j]+sstart, maxdeath) / d0;
if (ndeath > 3) R_CheckUserInterrupt();
u[j] -= d1[j];
for (k=0; k<= j; k++) { /* second derivative*/
temp = coxd2(ndeath, nrisk, score+sstart, dmem0, dmem1[j],
dmem1[k], dmem2, covar[j] + sstart,
covar[k] + sstart, maxdeath);
if (ndeath > 5) R_CheckUserInterrupt();
imat[k][j] += temp/d0 - d1[j]*d1[k];
dmem2 += dsize;
}
}
}
}
@
Do the first iteration of the solution. The first iteration is
different in 3 ways: it sets the initial log-likelihood,
it computes the score test, and
it pays no attention to convergence criteria or diagnostics.
(I expect it not to converge in one iteration.)
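The score test itself is the familiar quadratic form $U' \mathcal{I}^{-1} U$
at the initial estimates. A one-line R sketch (illustration only, with
hypothetical names [[u0]] and [[imat0]] for the iteration 0 score vector and
information matrix):
\begin{verbatim}
sctest <- function(u0, imat0) sum(u0 * solve(imat0, u0))
\end{verbatim}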
<<excox-iter0>>=
/*
** do the initial iteration step
*/
newlk =0;
for (i=0; i<nvar; i++) {
u[i] =0;
for (j=0; j<nvar; j++)
imat[i][j] =0 ;
}
<<excox-addup>>
loglik[0] = newlk; /* save the loglik for iteration zero */
loglik[1] = newlk; /* and it is our current best guess */
/*
** update the betas and compute the score test
*/
for (i=0; i<nvar; i++) /*use 'd1' as a temp to save u0, for the score test*/
d1[i] = u[i];
loglik[3] = cholesky2(imat, nvar, toler);
chsolve2(imat,nvar, u); /* u replaced by u *inverse(imat) */
loglik[2] =0; /* score test stored here */
for (i=0; i<nvar; i++)
loglik[2] += u[i]*d1[i];
if (maxiter==0 || isfinite(loglik[0])==0) { /* give up on overflow */
iter =0; /*number of iterations */
<<excox-finish>>
}
/*
** Never, never complain about convergence on the first step. That way,
** if they need to, a user can force one iteration at a time.
*/
for (i=0; i<nvar; i++) {
oldbeta[i] = beta[i];
beta[i] = beta[i] + u[i];
}
@
Now the main loop. This has code for convergence and step halving.
Be careful about order. For our current guess at the solution
beta:
\begin{enumerate}
\item Compute the loglik, first, and second derivatives
\item If the loglik has converged, return beta and information
just computed for this beta (loglik, derivatives, etc).
Don't update beta. %'
\item If not converged
\begin{itemize}
\item If the loglik got worse try beta= (beta + oldbeta)/2
\item Otherwise update beta
\end{itemize}
\end{enumerate}
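A minimal R sketch of this update decision (illustration only; all names
are hypothetical, with [[bestlk]] the last accepted log-likelihood):
\begin{verbatim}
nextbeta <- function(beta, oldbeta, newlk, bestlk, u, imat) {
    if (newlk < bestlk) (oldbeta + beta)/2    # step halving
    else beta + solve(imat, u)                # Newton-Raphson update
}
\end{verbatim}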
<<excox-iter>>=
halving =0 ; /* =1 when in the midst of "step halving" */
for (iter=1; iter<=maxiter; iter++) {
newlk =0;
for (i=0; i<nvar; i++) {
u[i] =0;
for (j=0; j<nvar; j++)
imat[i][j] =0;
}
<<excox-addup>>
/* am I done?
** update the betas and test for convergence
*/
loglik[3] = cholesky2(imat, nvar, toler);
notfinite = 0;
for (i=0; i<nvar; i++) {
if (isfinite(u[i]) ==0) notfinite=2; /* infinite score stat */
for (j=0; j<nvar; j++) {
if (isfinite(imat[i][j]) ==0) notfinite =3; /*infinite imat */
}
}
if (isfinite(newlk) ==0) notfinite =4;
if (notfinite==0 && fabs(1-(loglik[1]/newlk))<= eps && halving==0) {
/* all done */
loglik[1] = newlk;
<<excox-finish>>
}
if (iter==maxiter) break; /*skip the step halving and etc */
if (notfinite > 0 || newlk < loglik[1]) { /*it is not converging ! */
halving =1;
for (i=0; i<nvar; i++)
beta[i] = (oldbeta[i] + beta[i]) /2; /*half of old increment */
}
else {
halving=0;
loglik[1] = newlk;
chsolve2(imat,nvar,u);
for (i=0; i<nvar; i++) {
oldbeta[i] = beta[i];
beta[i] = beta[i] + u[i];
}
}
} /* return for another iteration */
/*
** We end up here only if we ran out of iterations
** recompute the last good version of the loglik and imat
** If maxiter =0 or 1, though, leave well enough alone.
*/
if (maxiter > 1) {
for (i=0; i< nvar; i++) beta[i] = oldbeta[i];
newlk =0;
for (i=0; i<nvar; i++) {
u[i] =0;
for (j=0; j<nvar; j++)
imat[i][j] =0;
}
<<excox-addup>>
}
loglik[1] = newlk;
loglik[3] = 1000; /* signal no convergence */
<<excox-finish>>
@
The common code for finishing. Invert the information matrix, copy it
to be symmetric, and put together the output structure.
<<excox-finish>>=
loglik[4] = iter;
chinv2(imat, nvar);
for (i=1; i<nvar; i++)
for (j=0; j<i; j++) imat[i][j] = imat[j][i];
/* assemble the return objects as a list */
PROTECT(rlist= allocVector(VECSXP, 4));
SET_VECTOR_ELT(rlist, 0, beta2);
SET_VECTOR_ELT(rlist, 1, u2);
SET_VECTOR_ELT(rlist, 2, imat2);
SET_VECTOR_ELT(rlist, 3, loglik2);
/* add names to the list elements */
PROTECT(rlistnames = allocVector(STRSXP, 4));
SET_STRING_ELT(rlistnames, 0, mkChar("coef"));
SET_STRING_ELT(rlistnames, 1, mkChar("u"));
SET_STRING_ELT(rlistnames, 2, mkChar("imat"));
SET_STRING_ELT(rlistnames, 3, mkChar("loglik"));
setAttrib(rlist, R_NamesSymbol, rlistnames);
unprotect(nprotect+2);
return(rlist);
@
\subsection{Andersen-Gill fits}
When the survival data set has (start, stop] data a couple of computational
issues are added.
A primary one is how to do the computation efficiently.
At each event time we need to compute 3 quantities, each of them added up
over the current risk set.
\begin{itemize}
\item The weighted sum of the risk scores $\sum w_i r_i$ where
$r_i = \exp(\eta_i)$ and $\eta_i = x_{i1}\beta_1 + x_{i2}\beta_2 +\ldots$
is the current linear predictor.
\item The weighted mean of the covariates $x$, with weight $w_i r_i$.
\item The weighted variance-covariance matrix of $x$.
\end{itemize}
The current risk set at some event time $t$ is the set of all (start, stop]
intervals that overlap $t$, and are part of the same strata.
The round/square brackets in the prior sentence are important: for an event time
$t=20$ the interval $(5,20]$ is considered to overlap $t$ and the interval
$(20,55]$ does not overlap $t$.
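In R terms the bracket convention is simply the following (a sketch, not
package code):
\begin{verbatim}
atrisk <- function(t, start, stop) (start < t) & (t <= stop)
atrisk(20, start=c(5, 20), stop=c(20, 55))   # TRUE FALSE
\end{verbatim}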
Our routine for the simple right censored Cox model computes these efficiently
by keeping a cumulative sum: starting with the longest survival time we move
backwards through time, adding and subtracting subjects from the sums as
we go.
The code below creates two sort indices, one orders the data by reverse stop
time and the other by reverse start time, each within strata.
The fit routine is called by the coxph function with arguments
\begin{description}
\item[x] matrix of covariates
\item[y] three column matrix containing the start time, stop time, and event
for each observation
\item[strata] for stratified fits, the strata of each subject
\item[offset] the offset, usually a vector of zeros
\item[init] initial estimate for the coefficients
\item[control] results of the coxph.control function
\item[weights] case weights, often a vector of ones.
\item[method] how ties are handled: 1=Breslow, 2=Efron
\item[rownames] used to label the residuals
\end{description}
If the data set has any observations whose (start, stop] interval does not
overlap any death times, those rows of data play no role in the computation,
and we push them to the end of the sort order and report a smaller $n$ to
the C routine.
The reason for this has less to do with efficiency than with safety: one user,
for example, created a data set with a time*covariate interaction, to be
used for testing proportional hazards with an \code{x:ns(time, df=4)} term.
They had cut the data up by day using survSplit, there was a long
no-event stretch of time before the last censor, and this generated some large
outliers in the extrapolated spline --- large enough to force an exp() overflow.
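The spanning test used below deserves a tiny worked example. An interval
contributes only if at least one unique death time lies inside it, which
\code{findInterval} detects by counting death times at or below each
endpoint (plain R, made-up numbers):
\begin{verbatim}
dtime <- c(3, 7)                       # unique death times
y1 <- c(0, 4, 8); y2 <- c(2, 9, 10)    # three (start, stop] intervals
findInterval(y1, dtime)                # 0 1 2
findInterval(y2, dtime)                # 0 2 2
# only interval 2, i.e. (4, 9], spans a death time
\end{verbatim}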
<<agreg.fit>>=
agreg.fit <- function(x, y, strata, offset, init, control,
weights, method, rownames, resid=TRUE, nocenter=NULL)
{
nvar <- ncol(x)
event <- y[,3]
if (all(event==0)) stop("Can't fit a Cox model with 0 failures")
if (missing(offset) || is.null(offset)) offset <- rep(0.0, nrow(y))
if (missing(weights)|| is.null(weights))weights<- rep(1.0, nrow(y))
else if (any(weights<=0)) stop("Invalid weights, must be >0")
else weights <- as.vector(weights)
# Find rows to be ignored. We have to match within strata: a
# value that spans a death in another stratum, but not in its
# own, should be removed. Hence the per stratum delta
if (length(strata) ==0) {y1 <- y[,1]; y2 <- y[,2]}
else {
if (is.numeric(strata)) strata <- as.integer(strata)
else strata <- as.integer(as.factor(strata))
delta <- strata* (1+ max(y[,2]) - min(y[,1]))
y1 <- y[,1] + delta
y2 <- y[,2] + delta
}
event <- y[,3] > 0
dtime <- sort(unique(y2[event]))
indx1 <- findInterval(y1, dtime)
indx2 <- findInterval(y2, dtime)
# indx1 != indx2 for any obs that spans an event time
ignore <- (indx1 == indx2)
nused <- sum(!ignore)
# Sort the data (or rather, get a list of sorted indices)
# For both stop and start times, the indices go from last to first
if (length(strata)==0) {
sort.end <- order(ignore, -y[,2]) -1L #indices start at 0 for C code
sort.start<- order(ignore, -y[,1]) -1L
strata <- rep(0L, nrow(y))
}
else {
sort.end <- order(ignore, strata, -y[,2]) -1L
sort.start<- order(ignore, strata, -y[,1]) -1L
}
if (is.null(nvar) || nvar==0) {
# A special case: Null model. Just return obvious stuff
# To keep the C code to a small set, we call the usual routines, but
# with a dummy X matrix and 0 iterations
nvar <- 1
x <- matrix(as.double(1:nrow(y)), ncol=1) #keep the .C call happy
maxiter <- 0
nullmodel <- TRUE
if (length(init) !=0) stop("Wrong length for inital values")
init <- 0.0 #dummy value to keep a .C call happy (doesn't like 0 length)
}
else {
nullmodel <- FALSE
maxiter <- control$iter.max
if (is.null(init)) init <- rep(0., nvar)
if (length(init) != nvar) stop("Wrong length for inital values")
}
# 2021 change: pass in per covariate centering. This gives
# us more freedom to experiment. Default is to leave 0/1 variables alone
if (is.null(nocenter)) zero.one <- rep(FALSE, ncol(x))
else zero.one <- apply(x, 2, function(z) all(z %in% nocenter))
# the returned value of agfit$coef starts as a copy of init, so make sure
# it is a vector and not a matrix; as.double suffices.
# Solidify the storage mode of other arguments
storage.mode(y) <- storage.mode(x) <- "double"
storage.mode(offset) <- storage.mode(weights) <- "double"
agfit <- .Call(Cagfit4, nused,
y, x, strata, weights,
offset,
as.double(init),
sort.start, sort.end,
as.integer(method=="efron"),
as.integer(maxiter),
as.double(control$eps),
as.double(control$toler.chol),
ifelse(zero.one, 0L, 1L))
# agfit4 centers variables within strata, so does not return a vector
# of means. Use a fill in consistent with other coxph routines
agmeans <- ifelse(zero.one, 0, colMeans(x))
<<agreg-fixup>>
<<agreg-finish>>
rval
}
@
Upon return we need to clean up three simple things.
The first is the case that the agfit routine failed.
Such cases are rare, usually involve an overflow or underflow, and
we encourage users to let us have a copy of the data when one occurs.
(They end up in the \code{fail} directory of the library.)
The second is that if any of the covariates were redundant then this
will be marked by zeros on the diagonal of the variance matrix.
Replace these coefficients and their variances with NA.
The last is to post a warning message about possible infinite coefficients.
The algorithm for determining this is unreliable, unfortunately.
Sometimes coefficients are marked as infinite when the solution is not tending
to infinity (usually associated with a very skewed covariate), and sometimes
one that is tending to infinity is not marked. Que sera sera.
Don't complain if the user asked for only one iteration; they will already
know that it has not converged.
<<agreg-fixup>>=
vmat <- agfit$imat
coef <- agfit$coef
if (agfit$flag[1] < nvar) which.sing <- diag(vmat)==0
else which.sing <- rep(FALSE,nvar)
if (maxiter >1) {
infs <- abs(agfit$u %*% vmat)
if (any(!is.finite(coef)) || any(!is.finite(vmat)))
stop("routine failed due to numeric overflow.",
"This should never happen. Please contact the author.")
if (agfit$flag[4] > 0)
warning("Ran out of iterations and did not converge")
else {
infs <- (!is.finite(agfit$u) |
infs > control$toler.inf*(1+ abs(coef)))
if (any(infs))
warning(paste("Loglik converged before variable ",
paste((1:nvar)[infs],collapse=","),
"; beta may be infinite. "))
}
}
@
The last of the code is very standard. Compute residuals and package
up the results.
One design decision is that we return all $n$ residuals and predicted
values, even though the model fit ignored useless observations.
(All those obs have a residual of 0).
<<agreg-finish>>=
lp <- as.vector(x %*% coef + offset - sum(coef * agmeans))
if (resid) {
if (any(lp > log(.Machine$double.xmax))) {
# prevent a failure message due to overflow
# this occurs with near-infinite coefficients
temp <- lp + log(.Machine$double.xmax) - (1 + max(lp))
score <- exp(temp)
} else score <- exp(lp)
residuals <- .Call(Cagmart3, nused,
y, score, weights,
strata,
sort.start, sort.end,
as.integer(method=='efron'))
names(residuals) <- rownames
}
# The if-then-else below is a real pain in the butt, but the tccox
# package's test suite assumes that the ORDER of elements in a coxph
# object will never change.
#
if (nullmodel) {
rval <- list(loglik=agfit$loglik[2],
linear.predictors = offset,
method= method,
class = c("coxph.null", 'coxph') )
if (resid) rval$residuals <- residuals
}
else {
names(coef) <- dimnames(x)[[2]]
if (maxiter > 0) coef[which.sing] <- NA # always leave iter=0 alone
flag <- agfit$flag
names(flag) <- c("rank", "rescale", "step halving", "convergence")
if (resid) {
rval <- list(coefficients = coef,
var = vmat,
loglik = agfit$loglik,
score = agfit$sctest,
iter = agfit$iter,
linear.predictors = as.vector(lp),
residuals = residuals,
means = agmeans,
first = agfit$u,
info = flag,
method= method,
class = "coxph")
} else {
rval <- list(coefficients = coef,
var = vmat,
loglik = agfit$loglik,
score = agfit$sctest,
iter = agfit$iter,
linear.predictors = as.vector(lp),
means = agmeans,
first = agfit$u,
info = flag,
method = method,
class = "coxph")
}
rval
}
@
The details of the C code contain the more challenging part of the
computations.
It starts with the usual dull stuff.
My standard coding style for a variable zed is to use
[[zed2]] as the variable name for the R object, and [[zed]] for
the pointer to the contents of the object, i.e., what the
C code will manipulate.
For the matrix objects I make use of ragged arrays; this
allows for reference to the i,j element as \code{cmat[i][j]}
and makes for more readable code.
<<agfit4>>=
#include <math.h>
#include "survS.h"
#include "survproto.h"
SEXP agfit4(SEXP nused2, SEXP surv2, SEXP covar2, SEXP strata2,
SEXP weights2, SEXP offset2, SEXP ibeta2,
SEXP sort12, SEXP sort22, SEXP method2,
SEXP maxiter2, SEXP eps2, SEXP tolerance2,
SEXP doscale2) {
int i,j,k, person;
int indx1, istrat, p, p1;
int nrisk, nr;
int nused, nvar;
int rank=0, rank2, fail; /* =0 to keep -Wall happy */
double **covar, **cmat, **imat; /*ragged array versions*/
double *a, *oldbeta;
double *scale;
double *a2, **cmat2;
double *eta;
double denom, zbeta, risk;
double dtime =0; /* initial value to stop a -Wall message */
double temp, temp2;
double newlk =0;
int halving; /*are we doing step halving at the moment? */
double tol_chol, eps;
double meanwt;
int deaths;
double denom2, etasum;
double recenter;
/* inputs */
double *start, *tstop, *event;
double *weights, *offset;
int *sort1, *sort2, maxiter;
int *strata;
double method; /* saving this as double forces some double arithmetic */
int *doscale;
/* returned objects */
SEXP imat2, beta2, u2, loglik2;
double *beta, *u, *loglik;
SEXP sctest2, flag2, iter2;
double *sctest;
int *flag, *iter;
SEXP rlist;
static const char *outnames[]={"coef", "u", "imat", "loglik",
"sctest", "flag", "iter", ""};
int nprotect; /* number of protect calls I have issued */
/* get sizes and constants */
nused = asInteger(nused2);
nvar = ncols(covar2);
nr = nrows(covar2); /*nr = number of rows, nused = how many we use */
method= asInteger(method2);
eps = asReal(eps2);
tol_chol = asReal(tolerance2);
maxiter = asInteger(maxiter2);
doscale = INTEGER(doscale2);
/* input arguments */
start = REAL(surv2);
tstop = start + nr;
event = tstop + nr;
weights = REAL(weights2);
offset = REAL(offset2);
sort1 = INTEGER(sort12);
sort2 = INTEGER(sort22);
strata = INTEGER(strata2);
/*
** scratch space
** nvar: a, a2, oldbeta, scale
** nvar*nvar: cmat, cmat2
** nr: eta
*/
eta = (double *) R_alloc(nr + 4*nvar + 2*nvar*nvar, sizeof(double));
a = eta + nr;
a2= a + nvar;
scale = a2 + nvar;
oldbeta = scale + nvar;
/*
** Set up the ragged arrays
** covar2 might not need to be duplicated, even though
** we are going to modify it, due to the way this routine was
** called. But check to be safe.
*/
PROTECT(imat2 = allocMatrix(REALSXP, nvar, nvar));
nprotect =1;
if (MAYBE_REFERENCED(covar2)) {
PROTECT(covar2 = duplicate(covar2));
nprotect++;
}
covar= dmatrix(REAL(covar2), nr, nvar);
imat = dmatrix(REAL(imat2), nvar, nvar);
cmat = dmatrix(oldbeta+ nvar, nvar, nvar);
cmat2= dmatrix(oldbeta+ nvar + nvar*nvar, nvar, nvar);
/*
** create the output structures
*/
PROTECT(rlist = mkNamed(VECSXP, outnames));
nprotect++;
beta2 = SET_VECTOR_ELT(rlist, 0, duplicate(ibeta2));
beta = REAL(beta2);
u2 = SET_VECTOR_ELT(rlist, 1, allocVector(REALSXP, nvar));
u = REAL(u2);
SET_VECTOR_ELT(rlist, 2, imat2);
loglik2 = SET_VECTOR_ELT(rlist, 3, allocVector(REALSXP, 2));
loglik = REAL(loglik2);
sctest2 = SET_VECTOR_ELT(rlist, 4, allocVector(REALSXP, 1));
sctest = REAL(sctest2);
flag2 = SET_VECTOR_ELT(rlist, 5, allocVector(INTSXP, 4));
flag = INTEGER(flag2);
for (i=0; i<4; i++) flag[i]=0;
iter2 = SET_VECTOR_ELT(rlist, 6, allocVector(INTSXP, 1));
iter = INTEGER(iter2);
/*
** Subtract the mean from each covar, as this makes the variance
** computation more stable. The mean is taken per stratum,
** the scaling is overall.
*/
for (i=0; i<nvar; i++) {
if (doscale[i] == 0) scale[i] =1; /* skip this variable */
else {
istrat = strata[sort2[0]]; /* the current stratum */
k = 0; /* first obs of current one */
temp =0; temp2=0;
for (person=0; person< nused; person++) {
p = sort2[person];
if (strata[p] == istrat) {
temp += weights[p] * covar[i][p];
temp2 += weights[p];
}
else { /* new stratum */
temp /= temp2; /* mean for this covariate, this stratum */
for (; k< person; k++) covar[i][sort2[k]] -=temp;
temp = weights[p] * covar[i][p]; /* start the sums for the new stratum */
temp2 = weights[p];
istrat = strata[p];
}
}
temp /= temp2; /* mean for the last stratum */
for (; k< nused; k++) covar[i][sort2[k]] -= temp;
/* this cannot be done per stratum */
temp =0;
temp2 =0;
for (person=0; person<nused; person++) {
p = sort2[person];
temp += weights[p] * fabs(covar[i][p]);
temp2 += weights[p];
}
if (temp >0) temp = temp2/temp; /* 1/scale */
else temp = 1.0; /* rare case of a constant covariate */
scale[i] = temp;
for (person=0; person<nused; person++) {
covar[i][sort2[person]] *= temp;
}
}
}
for (i=0; i<nvar; i++) beta[i] /= scale[i]; /* rescale initial betas */
<<agfit4-iter>>
<<agfit4-finish>>
}
@
As we walk through the risk sets, observations are both added to and
removed from a set of running totals.
We have 6 running totals:
\begin{itemize}
\item sum of the weights, denom = $\sum w_i r_i$
\item totals for each covariate a[j] = $\sum w_ir_i x_{ij}$
\item totals for each covariate pair cmat[j,k]= $\sum w_ir_i x_{ij} x_{ik}$
\item the same three quantities, but only for times that are exactly
tied with the current death time, named denom2, a2, cmat2.
This allows for easy computation of the Efron approximation for ties.
\end{itemize}
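For the rows currently at risk these totals are easy to state in R (a
sketch with hypothetical names: [[w]] = case weights, [[r]] = exp(eta),
[[X]] = covariate matrix, all restricted to the current risk set):
\begin{verbatim}
totals <- function(w, r, X) {
    wr <- w * r
    list(denom = sum(wr),
         a     = colSums(wr * X),      # one entry per covariate
         cmat  = t(X) %*% (wr * X))    # one entry per covariate pair
}
\end{verbatim}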
At one point I spent a lot of time worrying about $r_i$ values that are too
large, but it turns out that the overall scale of the weights does not
really matter since they always appear as a ratio.
(Assuming we avoid exponential overflow and underflow, of course.)
What does get the code in trouble is when there are large and small
weights and we get an update of (large + small) - large.
For example suppose a data set has a time dependent covariate which grows
with time and the data has values like below:
\begin{center}
\begin{tabular}{ccccc}
time1 & time2 & status & x \\ \hline
0 & 90 & 1 & 1 \\
0 & 105 & 0 & 2 \\
100 & 120 & 1 & 50 \\
100 & 124 & 0 & 51
\end{tabular}
\end{center}
The code moves from large times to small, so the first risk set has
subjects 3 and 4, the second has 1 and 2.
The original code would do removals only when necessary, i.e., at the
event times of 120 and 90, and additions as they came along.
This leads to adding in subjects 1 and 2 before the update at time 90,
when observations 3 and 4 are removed;
for a coefficient greater than about .6 the result is a loss of all of
the significant digits.
The defense is to remove subjects from the risk set as early
as possible, and defer additions for as long as possible.
Every time we hit a new (unique) death time, and only then,
update the totals: first remove any
old observations no longer in the risk set and then add any new ones.
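The cancellation is easy to reproduce at the R prompt. With a coefficient
of 0.7 the risk scores for $x=50$ versus $x=1$ differ by a factor of
$\exp(34.3)$, close to the relative precision of a double:
\begin{verbatim}
big <- exp(.7 * 50); small <- exp(.7 * 1)
(big + small) - big   # 2, not 2.0137...: most digits of 'small' are gone
\end{verbatim}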
One interesting edge case is observations that are not part of any risk
set. (A call to survSplit with too fine a partition can create these, or
using a subset of data that excluded some of the deaths.)
Observations that are not part of any risk set add unnecessary noise since
they will be added and then subtracted from all the totals, but the
intermediate values are never used. If said observation had a large risk
score this could be exceptionally bad.
The parent routine has already dealt with such observations: their indices
never appear in the sort1 or sort2 vector.
The three primary quantities for the Cox model are the log-likelihood $L$,
the score vector $U$ and the Hessian matrix $H$.
\begin{align*}
L &= \sum_i w_i \delta_i \left[\eta_i - \log(d(t)) \right] \\
d(t) &= \sum_j w_j r_j Y_j(t) \\
U_k &= \sum_i w_i \delta_i \left[ (X_{ik} - \mu_k(t_i)) \right] \\
\mu_k(t) &= \frac{\sum_j w_j r_j Y_j(t) X_{jk}} {d(t)} \\
H_{kl} &= \sum_i w_i \delta_i V_{kl}(t_i) \\
V_{kl}(t) &= \frac{\sum_j w_j r_j Y_j(t) [X_{jk} - \mu_k(t)]
[X_{jl}- \mu_l(t)]} {d(t)} \\
&= \frac{\sum_j w_j r_j Y_j(t) X_{jk}X_{jl}} {d(t)}
- \mu_k(t) \mu_l(t)
\end{align*}
In the above $\delta_i =1$ for an event and 0 otherwise, $w_i$ is the per
subject weight, $\eta_i$ is the current linear predictor $X\beta$ for the
subject, $r_i = \exp(\eta_i)$ is the risk score
and $Y_i(t)$ is 1 if observation $i$ is at risk at time $t$.
The vector $\mu(t)$ is the weighted mean of the covariates at time $t$
using a weight of $w r Y(t)$ for each subject, and $V(t)$ is the weighted
variance matrix of $X$ at time $t$.
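At any single time these are ordinary weighted moments, e.g., in R (a
sketch; [[wr]] holds $w_j r_j$ for the risk set and [[X]] the matching
covariate rows):
\begin{verbatim}
moments <- function(wr, X) {
    d  <- sum(wr)
    mu <- colSums(wr * X)/d                  # weighted covariate means
    V  <- t(X) %*% (wr * X)/d - outer(mu, mu) # weighted variance matrix
    list(d = d, mu = mu, V = V)
}
\end{verbatim}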
Tied deaths and the Efron approximation add a small complication to the
formula. Say there are three tied deaths at some particular time $t$.
When calculating the denominator $d(t)$, mean $\mu(t)$ and variance
$V(t)$ at that time the inclusion value $Y_i(t)$ is 0 or 1 for all other
subjects, as usual, but for the three tied deaths Y(t) is taken to
be 1 for the first death, 2/3 for the second, and 1/3 for the third.
The idea is that if the tied death times were randomly broken by adding
a small random amount then each of these three would be in the first risk set,
have 2/3 chance of being in the second, and 1/3 chance of being in the risk
set for the third death.
In the code this means that at a death time we add the \code{denom2},
\code{a2} and \code{cmat2} portions in a little at a time:
for three tied deaths the code will add in 1/3, update totals,
add in another 1/3, update totals, then the last 1/3, and update totals.
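Schematically, for $d$ tied deaths the loop produces $d$ log-denominator
terms; since they are summed, their order does not matter. A one-line R
equivalent (illustration only; [[denom]] and [[denom2]] as in the C code):
\begin{verbatim}
efron_logsum <- function(denom, denom2, d)
    sum(log(denom + denom2 * seq_len(d)/d))
\end{verbatim}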
The variance formula is stable if $\mu$ is small relative to
the total variance. This is guaranteed by having a working estimate $m$
of the mean along with the formula:
\begin{align*}
(1/n) \sum w_ir_i(x_i- \mu)^2 &= (1/n)\sum w_ir_i(x-m)^2 -
(\mu -m)^2 \\
\mu &= m + (1/n) \sum w_ir_i (x_i -m)\\
n &= \sum w_ir_i
\end{align*}
A refinement of this is to scale the covariates, since the Cholesky
decomposition can lose precision when variables are on vastly different
scales. We do this centering and scaling once at the beginning of the
calculation.
Centering is done per strata --- what if someone had two strata and
a covariate with mean 0 in the first but mean one million in the second?
(Users do amazing things). Scaling is required to be a single
value for each covariate, however. For a univariate model scaling
does not add any precision.
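A sketch of the combined transformation in R (illustration only, for one
covariate, ignoring the 0/1 \code{nocenter} exemption; the C code also
guards against a constant covariate):
\begin{verbatim}
centerscale <- function(x, w, strat) {
    for (s in unique(strat)) {         # center within each stratum
        i <- strat == s
        x[i] <- x[i] - weighted.mean(x[i], w[i])
    }
    x * sum(w)/sum(w * abs(x))  # scale so the weighted mean of |x| is 1
}
\end{verbatim}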
Weighted sums can still be unstable if the weights get out of hand.
Because of the exponential $r_i = \exp(\eta_i)$
the original centering of the $X$ matrix may not be enough.
A particular example was a data set on hospital adverse events with
``number of nurse shift changes to date'' as a time dependent covariate.
At any particular time point the covariate varied only by $\pm 3$ between
subjects (weekends often use 12 hour nurse shifts instead of 8 hour). The
regression coefficient was around 1 and the data duration was 11 weeks
(about 200 shifts) so that $\eta$ values could be over 100 even after
centering. We keep a time dependent average of $\eta$ and use it to update
a recentering constant as necessary.
A case like this should be rare, but it is not as unusual as one might
think.
The last numerical problem is when one or more coefficients get too
large, leading to a huge weight $\exp(\eta)$.
This usually happens when a coefficient is tending to infinity, but can
also be due to a bad step in the intermediate Newton-Raphson path.
In the infinite coefficient case the
log-likelihood trends to an asymptote and there is a race between three
conditions: convergence of the loglik, singularity of the variance matrix,
or an invalid log-likelihood. The first of these wins the race most of
the time, especially if the data set is small, and is the simplest case.
The last occurs when the denominator becomes $<0$ due to
round off so that log(denom) is undefined, the second when extreme weights
cause the second derivative to lose precision.
In all three cases we revert to step halving, since a bad Newton-Raphson step can
cause the same issues to arise.
The next section of code adds up the totals for a given iteration.
This is the workhorse.
For a given death time all of the events tied at
that time must be handled together, hence the main loop below proceeds in
batches:
\begin{enumerate}
\item Find the time of the next death. Whenever crossing a stratum
boundary, zero certain intermediate sums.
\item Remove all observations in the stratum with time1 $>$ dtime.
When survSplit was used to create a data set, this will often remove everyone.
If so we can rezero the temporaries and regain precision.
\item Add new observations to the risk set and to the death counts.
\end{enumerate}
<<agfit4-addup>>=
for (person=0; person<nused; person++) {
p = sort2[person];
zbeta = 0; /* form the term beta*z (vector mult) */
for (i=0; i<nvar; i++)
zbeta += beta[i]*covar[i][p];
eta[p] = zbeta + offset[p];
}
/*
** 'person' walks through the data from 1 to nused,
** sort2[0] points to the largest stop time, sort2[1] the next, ...
** 'dtime' is a scratch variable holding the time of current interest
** 'indx1' walks through the start times.
*/
newlk =0;
for (i=0; i<nvar; i++) {
u[i] =0;
for (j=0; j<nvar; j++) imat[i][j] =0;
}
person =0;
indx1 =0;
/* this next set is rezeroed at the start of each stratum */
recenter =0;
denom=0;
nrisk=0;
etasum =0;
for (i=0; i<nvar; i++) {
a[i] =0;
for (j=0; j<nvar; j++) cmat[i][j] =0;
}
/* end of the per-stratum set */
istrat = strata[sort2[0]]; /* initial stratum */
while (person < nused) {
/* find the next death time */
for (k=person; k< nused; k++) {
p = sort2[k];
if (strata[p] != istrat) {
/* hit a new stratum; reset temporary sums */
istrat= strata[p];
denom = 0;
nrisk = 0;
etasum =0;
for (i=0; i<nvar; i++) {
a[i] =0;
for (j=0; j<nvar; j++) cmat[i][j] =0;
}
person =k; /* skip to end of stratum */
indx1 =k;
}
if (event[p] == 1) {
dtime = tstop[p];
break;
}
}
if (k == nused) break; /* no more deaths to be processed */
/* remove any subjects no longer at risk */
<<agreg-remove>>
/*
** add any new subjects who are at risk
** denom2, a2, cmat2, meanwt and deaths count only the deaths
*/
denom2= 0;
meanwt =0;
deaths=0;
for (i=0; i<nvar; i++) {
a2[i]=0;
for (j=0; j<nvar; j++) {
cmat2[i][j]=0;
}
}
for (; person <nused; person++) {
p = sort2[person];
if (strata[p] != istrat || tstop[p] < dtime) break;/*no more to add*/
nrisk++;
etasum += eta[p];
<<fixeta>>
risk = exp(eta[p] - recenter) * weights[p];
if (event[p] ==1 ){
deaths++;
denom2 += risk;
meanwt += weights[p];
newlk += weights[p]* (eta[p] - recenter);
for (i=0; i<nvar; i++) {
u[i] += weights[p] * covar[i][p];
a2[i]+= risk*covar[i][p];
for (j=0; j<=i; j++)
cmat2[i][j] += risk*covar[i][p]*covar[j][p];
}
}
else {
denom += risk;
for (i=0; i<nvar; i++) {
a[i] += risk*covar[i][p];
for (j=0; j<=i; j++)
cmat[i][j] += risk*covar[i][p]*covar[j][p];
}
}
}
<<breslow-efron>>
} /* end of accumulation loop */
@
The last step in the above loop adds terms to the loglik, score and
information matrices. Assume that there were 3 tied deaths.
The difference between the Efron and Breslow approximations is that for the
Efron the three tied subjects enter the successive death denominators with
weights of 1/3, 2/3, and 3/3 (the order is immaterial, since the three log
terms are summed); for the Breslow they get 3/3 for all of them.
Note that \code{imat} is symmetric, and that the cholesky routine will
utilize the upper triangle of the matrix as input, using the lower part for
its own purposes. The inverse from \code{chinv} is also in the upper
triangle.
<<breslow-efron>>=
/*
** Add results into u and imat for all events at this time point
*/
if (method==0 || deaths ==1) { /*Breslow */
denom += denom2;
newlk -= meanwt*log(denom); /* sum of death weights*/
for (i=0; i<nvar; i++) {
a[i] += a2[i];
temp = a[i]/denom; /*mean covariate at this time */
u[i] -= meanwt*temp;
for (j=0; j<=i; j++) {
cmat[i][j] += cmat2[i][j];
imat[j][i] += meanwt*((cmat[i][j]- temp*a[j])/denom);
}
}
}
else {
meanwt /= deaths;
for (k=0; k<deaths; k++) {
denom += denom2/deaths;
newlk -= meanwt*log(denom);
for (i=0; i<nvar; i++) {
a[i] += a2[i]/deaths;
temp = a[i]/denom;
u[i] -= meanwt*temp;
for (j=0; j<=i; j++) {
cmat[i][j] += cmat2[i][j]/deaths;
imat[j][i] += meanwt*((cmat[i][j]- temp*a[j])/denom);
}
}
}
}
@
Code to process the removals:
<<agreg-remove>>=
/*
** subtract out the subjects whose start time is to the right
** If everyone is removed reset the totals to zero. (This happens when
** the survSplit function is used, so it is worth checking).
*/
for (; indx1<nused; indx1++) {
p1 = sort1[indx1];
if (start[p1] < dtime || strata[p1] != istrat) break;
nrisk--;
if (nrisk ==0) {
etasum =0;
denom =0;
for (i=0; i<nvar; i++) {
a[i] =0;
for (j=0; j<=i; j++) cmat[i][j] =0;
}
}
else {
etasum -= eta[p1];
risk = exp(eta[p1] - recenter) * weights[p1];
denom -= risk;
for (i=0; i<nvar; i++) {
a[i] -= risk*covar[i][p1];
for (j=0; j<=i; j++)
cmat[i][j] -= risk*covar[i][p1]*covar[j][p1];
}
}
}
@
The next bit of code exists for the sake of rather rare data sets.
Assume that there is a time dependent covariate that rapidly climbs
in such a way that the eta gets large but the range of eta stays
modest. An example would be something like ``payments made to date'' for
a portfolio of loans. Then even though the data has been centered and
the global mean is fine, the current values of eta are outrageous with
respect to the exp function.
Since replacing eta with (eta -c) for any c does not change the likelihood,
do it. Unfortunately, we can't do this once and for all: this is a step that
will occur at least twice per iteration for those rare cases, e.g., eta is
too small at early time points and too large at late ones.
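A two-line demonstration of the invariance (plain R):
\begin{verbatim}
eta <- c(100, 101, 103)
exp(eta)/sum(exp(eta))              # works, but overflows once eta > 709
exp(eta - 100)/sum(exp(eta - 100))  # identical ratios, safe range
\end{verbatim}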
<<fixeta>>=
/*
** We must avoid overflow in the exp function (~709 on Intel)
** and want to act well before that, but not take action very often.
** One of the case-cohort papers suggests an offset of -100 meaning
** that etas of 50-100 can occur in "ok" data, so make it larger
** than this.
** If the range of eta is more than log(1e16) = 37 then the data is
** hopeless: some observations will have effectively 0 weight. Keeping
** the mean sensible has sufficed to keep the max in check.
*/
if (fabs(etasum/nrisk - recenter) > 200) {
flag[1]++; /* a count, for debugging/profiling purposes */
temp = etasum/nrisk - recenter;
recenter = etasum/nrisk;
if (denom > 0) {
/* we can skip this if there is no one at risk */
if (fabs(temp) > 709) error("exp overflow due to covariates\n");
temp = exp(-temp); /* the change in scale, for all the weights */
denom *= temp;
for (i=0; i<nvar; i++) {
a[i] *= temp;
for (j=0; j<nvar; j++) {
cmat[i][j]*= temp;
}
}
}
}
@
Now, I'm finally ready to do the actual iteration steps.
The Cox model calculation rarely gets into numerical difficulty, and when it
does step halving has always been sufficient.
Let $\beta^{(0)}$, $\beta^{(1)}$, etc be the iteration steps in the search
for the maximum likelihood solution $\hat \beta$.
The flow of the algorithm is
\begin{enumerate}
\item Iteration 0 computes the loglik etc. for the initial estimates.
At the end of that iteration, calculate a score test. If the user
asked for 0 iterations, then don't do any singularity or infinity checks,
just give them the results.
\item For the $k$th iteration, start with the new trial estimate
$\beta^{(k)}$. This new estimate is [[beta]] in the code and the
most recent successful estimate is [[oldbeta]].
\item For this new trial estimate, compute the log-likelihood, and the
first and second derivatives.
\item Test whether the log-likelihood is finite, has converged, \emph{and}
the last estimate
was not generated by step-halving. In the latter case the algorithm may
\emph{appear} to have converged but the solution is not assured.
An infinite loglik is very rare; it arises when denom $<0$ due to catastrophic
loss of significant digits when range(eta) is too large.
\begin{itemize}
\item if converged return beta and the other information
\item if this was the last iteration, return the best beta found so
far (perhaps beta, more likely oldbeta), the other information,
and a warning flag.
\item otherwise, compute the next guess and return to the top
\begin{itemize}
\item if our latest trial guess [[beta]] made things worse use step
halving: $\beta^{(k+1)}$ = oldbeta + (beta-oldbeta)/2.
The assumption is that the current trial step was in the right
direction; it just went too far.
\item otherwise take a Newton-Raphson step
\end{itemize}
\end{itemize}
\end{enumerate}
I am particularly careful not to make a mistake that I have seen in several
other Cox model programs. All the hard work is to calculate the first
and second derivatives $U$ (u) and $H$ (imat), once we have them the next
Newton-Raphson update $UH^{-1}$ is just a little bit more. Many programs
succumb to the temptation of this ``one more for free'' idea, and as a
consequence return $\beta^{(k+1)}$ along with the log-likelihood and
variance matrix for $\beta^{(k)}$.
If a user has specified
for instance only 1 or 2 iterations the answers can be seriously
out of joint.
If iteration has gone to completion they will differ by only a gnat's
eyelash, so what's the utility of the ``free'' update?
<<agfit4-iter>>=
/* main loop */
halving =0 ; /* =1 when in the midst of "step halving" */
fail =0;
for (*iter=0; *iter<= maxiter; (*iter)++) {
R_CheckUserInterrupt(); /* be polite -- did the user hit cntrl-C? */
<<agfit4-addup>>
if (*iter==0) {
loglik[0] = newlk;
loglik[1] = newlk;
/* compute the score test, but don't corrupt u */
for (i=0; i<nvar; i++) a[i] = u[i];
rank = cholesky2(imat, nvar, tol_chol);
chsolve2(imat,nvar,a); /* a replaced by u *inverse(i) */
*sctest=0;
for (i=0; i<nvar; i++) {
*sctest += u[i]*a[i];
}
if (maxiter==0) break;
fail = isnan(newlk) + isinf(newlk);
/* it almost takes malice to give a starting estimate with infinite
** loglik. But if so, just give up now */
if (fail>0) break;
for (i=0; i<nvar; i++) {
oldbeta[i] = beta[i];
beta[i] += a[i];
}
}
else {
fail =0;
for (i=0; i<nvar; i++)
if (isfinite(imat[i][i]) ==0) fail++;
rank2 = cholesky2(imat, nvar, tol_chol);
fail = fail + isnan(newlk) + isinf(newlk) + abs(rank-rank2);
if (fail ==0 && halving ==0 &&
fabs(1-(loglik[1]/newlk)) <= eps) break; /* success! */
if (*iter == maxiter) { /* failed to converge */
flag[3] = 1;
if (maxiter>1 && ((newlk -loglik[1])/ fabs(loglik[1])) < -eps) {
/*
** "Once more unto the breach, dear friends, once more; ..."
**The last iteration above was worse than one of the earlier ones,
** by more than roundoff error.
** We need to use beta and imat at the last good value, not the
** last attempted value. We have tossed the old imat away, so
** recompute it.
** It will happen very rarely that we run out of iterations, and
** even less often that it is right in the middle of halving.
*/
for (i=0; i<nvar; i++) beta[i] = oldbeta[i];
<<agfit4-addup>>
rank2 = cholesky2(imat, nvar, tol_chol);
}
break;
}
if (fail >0 || newlk < loglik[1]) {
/*
** The routine has not made progress past the last good value.
*/
halving++; flag[2]++;
for (i=0; i<nvar; i++)
beta[i] = (oldbeta[i]*halving + beta[i]) /(halving +1.0);
}
else {
halving=0;
loglik[1] = newlk; /* best so far */
chsolve2(imat,nvar,u);
for (i=0; i<nvar; i++) {
oldbeta[i] = beta[i];
beta[i] = beta[i] + u[i];
}
}
}
} /*return for another iteration */
@
Save away the final bits, compute the inverse of imat and symmetrize it,
release memory and return.
If the routine did not converge (iter== maxiter), then the cholesky
routine will not have been called.
<<agfit4-finish>>=
flag[0] = rank;
loglik[1] = newlk;
chinv2(imat, nvar);
for (i=0; i<nvar; i++) {
beta[i] *= scale[i]; /* return to original scale */
u[i] /= scale[i];
imat[i][i] *= scale[i] * scale[i];
for (j=0; j<i; j++) {
imat[j][i] *= scale[i] * scale[j];
imat[i][j] = imat[j][i];
}
}
UNPROTECT(nprotect);
return(rlist);
@
\subsection{Predicted survival}
The \code{survfit} method for a Cox model produces individual survival
curves. As might be expected these have much in common with
ordinary survival curves, and share many of the same methods.
The primary differences are first that a predicted curve always refers
to a particular set of covariate values.
It is often the case that a user wants multiple values at once, in
which case the result will be a matrix of survival curves with a row
for each time and a column for each covariate set.
The second is that the computations are somewhat more difficult.
The input arguments are
\begin{description}
\item[formula] a fitted object of class `coxph'. The argument name of
`formula' is historic, from when the survfit function was not a generic
and only did Kaplan-Meier type curves.
\item[newdata] contains the data values for which curves should be
produced, one per row
\item[se.fit] TRUE/FALSE, should standard errors be computed.
\item[individual] a particular option for time-dependent covariates
\item[stype] survival type for the formula 1=direct 2= exp
\item[ctype] cumulative hazard, 1=Nelson-Aalen, 2= corrected for ties
\item[censor] if FALSE, remove any times that have no events from the
output. This is for
backwards compatibility with older versions of the code.
\item[id] replacement and extension for the individual argument
\item[start.time] Start a curve at a later timepoint than zero.
\item[influence] whether to return the influence matrix
\end{description}
All the other arguments are common to all the methods, refer to the
help pages.
Other survival routines have id and cluster options; this routine inherits
those variables from coxph. If coxph did a robust variance, this routine
will do one also.
<<survfit.coxph>>=
survfit.coxph <-
function(formula, newdata, se.fit=TRUE, conf.int=.95, individual=FALSE,
stype=2, ctype,
conf.type=c("log", "log-log", "plain", "none", "logit", "arcsin"),
censor=TRUE, start.time, id, influence=FALSE,
na.action=na.pass, type, time0= FALSE,...) {
Call <- match.call()
Call[[1]] <- as.name("survfit") #nicer output for the user
object <- formula #'formula' because it has to match survfit
<<survfit.coxph-setup1>>
<<survfit.coxph-setup2>>
<<survfit.coxph-setup2b>>
<<survfit.coxph-setup2c>>
<<survfit.coxph-setup3>>
if (missing(newdata)) {
if (inherits(formula, "coxphms"))
stop ("newdata is required for multi-state models")
risk2 <- 1
}
else {
if (length(object$means))
risk2 <- exp(c(x2 %*% beta) + offset2 - xcenter)
else risk2 <- exp(offset2 - xcenter)
}
<<survfit.coxph-result>>
<<survfit.coxph-finish>>
}
@
The third line \code{as.name('survfit')} causes the printout to say
`survfit' instead of `survfit.coxph'. %'
The setup has three main phases: first of course to sort out the
options the user has given us, second to rebuild the
data frame, X matrix, etc from the original Cox model, and third to
create variables from the new data set.
In the code below x2, y2, strata2, id2, etc. are variables from the
new data, X, Y, strata etc from the old. One exception to the pattern
is the id argument: oldid = id from the original data, id2 = id from the new.
If the newdata argument is missing we use \code{object\$means} as the
default value. This choice has lots of statistical shortcomings,
particularly in a stratified model, but is common in other
packages and a historic option here.
If stype is missing we use the standard approach of exp(cumulative hazard),
and ctype is pulled from the Cox model.
That is, the \code{coxph} computation used for \code{ties='breslow'} is
the same as the Nelson-Aalen hazard estimate, and
the Efron approximation the tie-corrected hazard.
One particular special case (that gave me fits for a while) is when there
are non-hierarchical models, for example \code{~ age + age:sex}.
The fit of such a model will \emph{not} be the same using the variable
\code{age2 <- age-50}; I originally thought it was a flaw induced by my
subtraction.
The routine simply cannot give a sensible curve for a model like this.
The issue continued to surprise me each time I rediscovered it,
leading to an error message for my own protection. I'm
not convinced at this time that there is a sensible survival curve
that \emph{could} be calculated for such a model.
A model with \code{age + age:strata(sex)} will be ok, because the
coxph routine treats this last term as though it had a * in it, i.e.,
fits a stratified model.
<<survfit.coxph-setup1>>=
Terms <- terms(object)
robust <- !is.null(object$naive.var) # did the coxph model use robust var?
if (!is.null(attr(object$terms, "specials")$tt))
stop("The survfit function can not process coxph models with a tt term")
if (!missing(type)) { # old style argument
if (!missing(stype) || !missing(ctype))
warning("type argument ignored")
else {
temp1 <- c("kalbfleisch-prentice", "aalen", "efron",
"kaplan-meier", "breslow", "fleming-harrington",
"greenwood", "tsiatis", "exact")
survtype <- match(match.arg(type, temp1), temp1)
stype <- c(1,2,2,1,2,2,2,2,2)[survtype]
if (stype!=1) ctype <-c(1,1,2,1,1,2,1,1,1)[survtype]
}
}
if (missing(ctype)) {
# Use the appropriate one from the model
temp1 <- match(object$method, c("exact", "breslow", "efron"))
ctype <- c(1,1,2)[temp1]
}
else if (!(ctype %in% 1:2)) stop ("ctype must be 1 or 2")
if (!(stype %in% 1:2)) stop("stype must be 1 or 2")
if (!se.fit) conf.type <- "none"
else conf.type <- match.arg(conf.type)
tfac <- attr(Terms, 'factors')
temp <- attr(Terms, 'specials')$strata
has.strata <- !is.null(temp)
if (has.strata) {
stangle = untangle.specials(Terms, "strata") #used multiple times, later
# Toss out strata terms in tfac before doing the test 1 line below, as
# strata end up in the model with age:strata(grp) terms or *strata() terms
# (There might be more than one strata term)
for (i in temp) tfac <- tfac[,tfac[i,] ==0] # toss out strata terms
}
if (any(tfac >1))
stop("not able to create a curve for models that contain an interaction without the lower order effect")
Terms <- object$terms
n <- object$n[1]
if (!has.strata) strata <- NULL
else strata <- object$strata
if (!missing(individual)) warning("the `id' option supersedes `individual'")
missid <- missing(id) # I need this later, and setting id below makes
# "missing(id)" always false
if (!missid) individual <- TRUE
else if (missid && individual) id <- rep(0L,n) #dummy value
else id <- NULL
if (individual & missing(newdata)) {
stop("the id option only makes sense with new data")
}
@
In two places below we need to know if there are strata by covariate
interactions, which requires looking at attributes of the terms
object.
The factors attribute will have a row for the strata variable, or
maybe more than one (multiple strata terms are legal). If it has
a 1 in a column that corresponds to something of order 2 or
greater, that is a strata by covariate interaction.
<<survfit.coxph-setup1>>=
if (has.strata) {
temp <- attr(Terms, "specials")$strata
factors <- attr(Terms, "factors")[temp,]
strata.interaction <- any(t(factors)*attr(Terms, "order") >1)
}
@
I need to retrieve a copy of the original data.
We always need the $X$ matrix and $y$, both of which might be found in
the data object.
If the fit was a multistate model,
if the original call included strata, offset, weights, or id,
or if either $x$ or $y$ is missing from the \code{coxph} object,
then the model frame will need to be reconstructed.
We have to use \code{object['x']} instead of \texttt{object\$x} since
the latter will
pick off the \code{xlevels} component if the \code{x} component is missing
(which is the default).
<<survfit.coxph-setup1>>=
coxms <- inherits(object, "coxphms")
if (coxms || is.null(object$y) || is.null(object[['x']]) ||
!is.null(object$call$weights) || !is.null(object$call$id) ||
(has.strata && is.null(object$strata)) ||
!is.null(attr(object$terms, 'offset'))) {
mf <- model.frame(object)
}
else mf <- NULL #useful for if statements later
@
For a single state model we can grab
the X matrix off the model frame; for multistate some more work
needs to be done.
We have to repeat some lines from coxph, but to do that we need some
further material.
We prefer \code{object\$y} to model.response, since the former will have been
passed through aeqSurv with the options the user specified.
For a multi-state model, however, we do have to recreate it, since the
saved y has been expanded.
In that case observe the saved value of timefix; old saved objects
might not have that element, and if it is missing we assume TRUE.
<<survfit.coxph-setup2>>=
position <- NULL
Y <- object[['y']]
if (is.null(mf)) {
weights <- object$weights # let offsets/weights be NULL until needed
offset <- NULL
offset.mean <- 0
X <- object[['x']]
}
else {
weights <- model.weights(mf)
offset <- model.offset(mf)
if (is.null(offset)) offset.mean <- 0
else {
if (is.null(weights)) offset.mean <- mean(offset)
else offset.mean <- sum(offset * (weights/sum(weights)))
}
X <- model.matrix.coxph(object, data=mf)
if (is.null(Y) || coxms) {
Y <- model.response(mf)
if (is.null(object$timefix) || object$timefix) Y <- aeqSurv(Y)
}
oldid <- model.extract(mf, "id")
if (length(oldid) && ncol(Y)==3) position <- survflag(Y, oldid)
else position <- NULL
if (!coxms && (nrow(Y) != object$n[1]))
stop("Failed to reconstruct the original data set, wrong number of rows")
if (has.strata) {
if (length(strata)==0) {
if (length(stangle$vars) ==1) strata <- mf[[stangle$vars]]
else strata <- strata(mf[, stangle$vars], shortlabel=TRUE)
}
}
}
@
If a model frame was created, then it is trivial to grab \code{y}
from the new frame and compare it to \code{object\$y} from the
original one. This is to avoid nonsense results that arise
when someone changes the data set under our feet.
We can only check the size: with the addition of aeqSurv other packages
were being flagged for tiny discrepancies.
Later note: this check does not work for multi-state models, and we don't
\emph{have} to have it. Removed by using if (FALSE) so as to preserve
the code for future consideration.
<<survfit.coxph-setup2b>>=
if (FALSE) {
if (!is.null(mf)){
y2 <- object[['y']]
if (!is.null(y2)) {
if (ncol(y2) != ncol(Y) || length(y2) != length(Y))
stop("Could not reconstruct the y vector")
}
}
}
type <- attr(Y, 'type')
if (!type %in% c("right", "counting", "mright", "mcounting"))
stop("Cannot handle \"", type, "\" type survival data")
if (missing(start.time)) t0 <- min(c(0, Y[,-ncol(Y)]))
else {
if (!is.numeric(start.time) || length(start.time) > 1)
stop("start.time must be a single numeric value")
t0 <- start.time
# Start the curves after start.time
# To do so, remove any rows of the data with an endpoint before that
# time.
if (ncol(Y)==3) {
keep <- Y[,2] >= start.time
# Y[keep,1] <- pmax(Y[keep,1], start.time) # removed 2/2022
}
else keep <- Y[,1] >= start.time
if (!any(Y[keep, ncol(Y)]==1))
stop("start.time argument has removed all endpoints")
Y <- Y[keep,,drop=FALSE]
X <- X[keep,,drop=FALSE]
if (!is.null(offset)) offset <- offset[keep]
if (!is.null(weights)) weights <- weights[keep]
if (!is.null(strata)) strata <- strata[keep]
if (length(id) >0 ) id <- id[keep]
if (length(position) >0) position <- position[keep]
n <- nrow(Y)
}
@
In the above code we see id twice. The first, kept as \code{oldid}, is the
identifier variable for subjects in the original data set, and is needed
whenever it contained subjects with more than one row.
The second is the user variable of this call, and is used to define multiple
rows for a new subject. The latter usage should be rare but we need to
allow for it.
If a variable is deemed redundant the \code{coxph} routine will have set its
coefficient to NA as a marker.
We want to ignore that coefficient: treating it as a zero has the
desired effect.
Another special case is a null model, having either ~1 or only an offset
on the right hand side. In that case we create a dummy covariate to
allow the rest of the code to work without special if/else.
The last special case is a model with a sparse frailty term. We treat
the frailty coefficients as 0 variance (in essence as an offset).
The frailty is removed from the model variables but kept in the risk score.
This isn't statistically very defensible, but it is backwards compatible. %'
A non-sparse frailty does not need special code and works out like any
other variable.
Center the risk scores by subtracting $ \overline x \hat\beta$ from each.
The reason for this is to avoid huge values when calculating $\exp(X\beta)$;
this would happen if someone had a variable with a mean of 1000 and a
variance of 1.
Any constant can be subtracted, mathematically the results are identical as long
as the same values are subtracted from the old and new $X$ data.
The mean is used because it is handy, we just need to get $X\beta$ in the
neighborhood of zero.
<<survfit.coxph-setup2c>>=
if (length(object$means) ==0) { # a model with only an offset term
# Give it a dummy X so the rest of the code goes through
# (This case is really rare)
# se.fit <- FALSE
X <- matrix(0., nrow=n, ncol=1)
if (is.null(offset)) offset <- rep(0, n)
xcenter <- offset.mean
coef <- 0.0
varmat <- matrix(0.0, 1, 1)
risk <- rep(exp(offset- offset.mean), length=n)
}
else {
varmat <- object$var
beta <- ifelse(is.na(object$coefficients), 0, object$coefficients)
xcenter <- sum(object$means * beta) + offset.mean
if (!is.null(object$frail)) {
keep <- !grepl("frailty(", dimnames(X)[[2]], fixed=TRUE)
X <- X[,keep, drop=F]
}
if (is.null(offset)) risk <- c(exp(X%*% beta - xcenter))
else risk <- c(exp(X%*% beta + offset - xcenter))
}
@
The \code{risk} vector and \code{x} matrix come from the original data, and are
the raw data for the survival curve and its variance.
We also need the risk score $\exp(X\beta)$ for the target subject(s).
\begin{itemize}
\item For predictions with time-dependent covariates the user will have
either included an \code{id} statement (newer style) or specified the
\code{individual=TRUE} option. If the latter, then \code{newdata} is
presumed to contain only a single individual represented by multiple
rows. If the former then the \code{id} variable marks separate individuals.
In either case we need to retrieve
the covariates, strata, and response from the new data set.
\item For ordinary predictions only the covariates are needed.
\item If newdata is not present we assume that this is the ordinary case, and
use the value of \code{object\$means} as the default covariate set. This is
not ideal statistically since many users view this as an
``average'' survival curve, which it is not.
\end{itemize}
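A usage sketch of the two modes; \code{fit}, \code{fit2}, \code{d2}, and
\code{subject} are hypothetical objects:
\begin{verbatim}
# ordinary predictions: one curve per row of newdata
survfit(fit, newdata = data.frame(age = c(50, 60)))
# time-dependent covariates: d2 has several rows per subject, and the
# id variable marks which rows belong to whom (replaces individual=TRUE)
survfit(fit2, newdata = d2, id = subject)
\end{verbatim}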
When grabbing \code{newdata} we want to use model.frame processing, both to
handle missing values correctly and, perhaps more importantly, to correctly
map any factor variables between the original fit and the new data. (The
new data will often have only one of the original levels represented.)
Also, we want to correctly handle data-dependent nonlinear terms such as
ns and pspline.
However, the simple call found in predict.lm, say,
\code{model.frame(Terms, data=newdata, ...)}, isn't used here
for a few reasons.
The first is a decision on our part that the user should not have
to include unused terms in the newdata: sometimes we don't need the
response and sometimes we do.
Second, if there are strata, the user may or may not
have included strata variables in their data set and we need to
act accordingly.
The third is that we might have an \code{id} statement in this
call, which is another variable to be fetched.
At one time we dealt with cluster() terms in the formula, but the coxph
routine has already removed those for us.
Finally, note that there is no ability to use sparse frailties and newdata together;
it is a hard case and so rare as to not be worth it.
First, remove unnecessary terms from the original model formula.
If \code{individual} is false then the response variable can go.
The dataClasses and predvars attributes, if present, have elements
in the same order as the first dimension of the ``factors'' attribute
of the terms.
Subscripting the terms object does not preserve dataClasses or
predvars, however; compare the factors attribute before and after
the subscripting to determine which of their elements to keep.
The predvars component is a call object with one element for each
variable in the formula, so y ~ age + ns(height) would lead to a predvars
of length 4: element 1 is the call itself, element 2 is y, and so on.
The dataClasses object is a simple list.
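A quick way to see this attribute, using the built-in \code{cars} data
(the printed knot values below are approximate):
\begin{verbatim}
library(splines)
mf <- model.frame(dist ~ speed + ns(speed, df = 2), data = cars)
attr(attr(mf, "terms"), "predvars")
# list(dist, speed, ns(speed, knots = 15, Boundary.knots = c(4, 25),
#      intercept = FALSE))
\end{verbatim}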
<<survfit.coxph-setup3>>=
if (missing(newdata)) {
# If the model has interactions, print out a long warning message.
# People may hate it, but I don't see another way to stamp out these
# bad curves without backwards incompatibility.
# I probably should complain about factors too (but never in a strata
# or cluster term).
if (any(attr(Terms, "order") > 1) )
warning("the model contains interactions; the default curve based on column means of the X matrix is almost certainly not useful. Consider adding a newdata argument.")
if (length(object$means)) {
mf2 <- as.list(object$means) #create a dummy newdata
names(mf2) <- names(object$coefficients)
mf2 <- as.data.frame(mf2)
x2 <- matrix(object$means, 1)
}
else { # nothing but an offset
mf2 <- data.frame(X=0)
x2 <- 0
}
offset2 <- 0
found.strata <- FALSE
}
else {
if (!is.null(object$frail))
stop("Newdata cannot be used when a model has frailty terms")
Terms2 <- Terms
if (!individual) {
Terms2 <- delete.response(Terms)
y2 <- NULL # a dummy to carry along, for the call to coxsurv.fit
}
<<survfit.coxph-newdata2>>
}
@
For backwards compatibility, I allow someone to give an ordinary vector
instead of a data frame (when only one curve is required). In this case
I also need to verify that the elements have names.
Then turn it into a data frame, like it should have been from the beginning.
(Documentation of this ability has been suppressed, however; I'm hoping
people forget it ever existed.)
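That is, an old-style call like the first line below still works, and is
converted to the second (\code{fit} is a hypothetical two-covariate fit):
\begin{verbatim}
survfit(fit, newdata = c(age = 50, sex = 1))           # old style
survfit(fit, newdata = data.frame(age = 50, sex = 1))  # what it becomes
\end{verbatim}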
<<survfit.coxph-newdata2>>=
if (is.vector(newdata, "numeric")) {
if (individual) stop("newdata must be a data frame")
if (is.null(names(newdata))) {
stop("Newdata argument must be a data frame")
}
newdata <- data.frame(as.list(newdata), stringsAsFactors=FALSE)
} else if (is.list(newdata)) newdata <- as.data.frame(newdata)
@
Finally get my new model frame mf2.
We allow the
user to leave out any strata() variables if they so desire,
\emph{if} there are no strata by covariate interactions.
How does one check if the strata variables are or are not available in
the call?
My first attempt at this was to wrap the call in a try() construct and
see if it failed. This doesn't work.
\begin{itemize}
\item What if there is no strata variable in newdata, but the user has,
by bad luck, a variable of the same name in the global environment?
\item It would seem like changing the environment to NULL would be wise,
so that we don't find variables anywhere but in the data argument,
a sort of sandboxing. Not wise: you then won't find functions like ``log''.
\item We don't dare modify the environment of the formula at all.
It is needed for the sneaky caller who uses his own function
inside the formula, 'mycosine' say, and that function can only be
found if we retain the environment.
\end{itemize}
One way out of this is to evaluate each of the strata terms
(there can be more than one), one at a time, in an environment that knows
nothing except \code{list}, a fake definition of \code{strata}, and newdata.
Variables that live in the global environment won't be found.
I even watch out for the case where either \code{strata} or \code{list} is
the name of the stratification variable, which causes my fake strata
function to return a function when said variable is not in newdata. The
variable found.strata is true only if ALL of the strata are found; it is
set to false if any are missing.
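To illustrate the sandbox, here is the evaluation by hand with a
hypothetical newdata that lacks the \code{sex} variable:
\begin{verbatim}
tempenv <- new.env(parent = emptyenv())
assign("strata", function(..., na.group, shortlabel, sep) list(...),
       envir = tempenv)
assign("list", list, envir = tempenv)
newdata <- data.frame(age = 50)   # no 'sex' to be found
temp <- try(eval(parse(text = "strata(sex)"), newdata, tempenv),
            silent = TRUE)
inherits(temp, "try-error")       # TRUE: sex was not found anywhere
\end{verbatim}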
<<survfit.coxph-newdata2>>=
if (has.strata) {
found.strata <- TRUE
tempenv <- new.env(, parent=emptyenv())
assign("strata", function(..., na.group, shortlabel, sep)
list(...), envir=tempenv)
assign("list", list, envir=tempenv)
for (svar in stangle$vars) {
temp <- try(eval(parse(text=svar), newdata, tempenv),
silent=TRUE)
if (!is.list(temp) ||
any(unlist(lapply(temp, class))== "function"))
found.strata <- FALSE
}
if (!found.strata) {
ss <- untangle.specials(Terms2, "strata")
Terms2 <- Terms2[-ss$terms]
}
}
tcall <- Call[c(1, match(c('id', "na.action"),
names(Call), nomatch=0))]
tcall$data <- newdata
tcall$formula <- Terms2
tcall$xlev <- object$xlevels[match(attr(Terms2,'term.labels'),
names(object$xlevels), nomatch=0)]
tcall$na.action <- na.omit # do not allow missing values
tcall[[1L]] <- quote(stats::model.frame)
mf2 <- eval(tcall)
if (nrow(mf2) ==0)
stop("all rows of newdata have missing values")
@
Now, finally, extract the \code{x2} matrix from the just-created frame.
<<survfit.coxph-setup3>>=
if (has.strata && found.strata) { #pull them off
temp <- untangle.specials(Terms2, 'strata')
strata2 <- strata(mf2[temp$vars], shortlabel=TRUE)
strata2 <- factor(strata2, levels=levels(strata))
if (any(is.na(strata2)))
stop("New data set has strata levels not found in the original")
# An expression like age:strata(sex) will have temp$vars= "strata(sex)"
# and temp$terms = integer(0). This does not work as a subscript
if (length(temp$terms) >0) Terms2 <- Terms2[-temp$terms]
}
else strata2 <- factor(rep(0, nrow(mf2)))
if (!robust) cluster <- NULL
if (individual) {
if (missing(newdata))
stop("The newdata argument must be present when individual=TRUE")
if (!missid) { #grab the id variable
id2 <- model.extract(mf2, "id")
if (is.null(id2)) stop("id=NULL is an invalid argument")
}
else id2 <- rep(1, nrow(mf2))
x2 <- model.matrix(Terms2, mf2)[,-1, drop=FALSE] #no intercept
if (length(x2)==0) stop("Individual survival but no variables")
offset2 <- model.offset(mf2)
if (length(offset2) ==0) offset2 <- 0
y2 <- model.extract(mf2, 'response')
if (attr(y2,'type') != type)
stop("Survival type of newdata does not match the fitted model")
if (attr(y2, "type") != "counting")
stop("Individual=TRUE is only valid for counting process data")
y2 <- y2[,1:2, drop=F] #throw away status, it's never used
}
else if (missing(newdata)) {
if (has.strata && strata.interaction)
stop ("Models with strata by covariate interaction terms require newdata")
offset2 <- 0
if (length(object$means)) {
x2 <- matrix(object$means, nrow=1, ncol=ncol(X))
} else {
# model with only an offset and no new data: very rare case
x2 <- matrix(0.0, nrow=1, ncol=1) # make a dummy x2
}
} else {
offset2 <- model.offset(mf2)
if (length(offset2)==0 ) offset2 <- 0
# a model with only an offset, but newdata containing a value for it
if (length(object$means)==0) x2 <- 0
else x2 <- model.matrix(Terms2, mf2)[,-1, drop=FALSE] #no intercept
}
@
<<survfit.coxph-result>>=
if (individual) {
result <- coxsurv.fit(ctype, stype, se.fit, varmat, cluster,
Y, X, weights, risk, position, strata, oldid,
y2, x2, risk2, strata2, id2)
}
else {
result <- coxsurv.fit(ctype, stype, se.fit, varmat, cluster,
Y, X, weights, risk, position, strata, oldid,
y2, x2, risk2)
if (has.strata && found.strata) {
<<newstrata-fixup>>
}
}
@
The final bit of work. If the newdata arg contained strata then the
user should not get a matrix of survival curves containing
every newdata obs * strata combination, but rather a vector
of curves, each with the appropriate stratum.
It was faster to compute them all, however, than to use the
individual=TRUE logic; so now pick off the bits we want.
The names of the curves will be the rownames of the newdata arg,
if they exist.
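A small worked example of the indexing, with invented sizes:
\begin{verbatim}
# two strata with 3 and 2 time points; newdata rows fall in "b" then "a"
result.strata <- c(a = 3, b = 2)
strata2 <- factor(c("b", "a"), levels = c("a", "b"))
indx1 <- split(1:5, rep(1:2, result.strata))  # $`1` = 1:3, $`2` = 4:5
rows  <- indx1[as.numeric(strata2)]           # 4:5 for row 1, 1:3 for row 2
unlist(rows)                                  # 4 5 1 2 3
\end{verbatim}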
<<newstrata-fixup>>=
if (is.matrix(result$surv)) nr <- nrow(result$surv)
else nr <- length(result$surv) # if newdata had only one row
indx1 <- split(1:nr, rep(1:length(result$strata), result$strata))
rows <- indx1[as.numeric(strata2)] #the rows for each curve
indx2 <- unlist(rows) #index for time, n.risk, n.event, n.censor
indx3 <- as.integer(strata2) #index for n and strata
if (is.matrix(result$surv)) {
for(i in 2:length(rows)) rows[[i]] <- rows[[i]]+ (i-1)*nr #linear subscript
indx4 <- unlist(rows) #index for surv and std.err
} else indx4 <- indx2
temp <- result$strata[indx3]
names(temp) <- row.names(mf2)
new <- list(n = result$n[indx3],
time= result$time[indx2],
n.risk= result$n.risk[indx2],
n.event=result$n.event[indx2],
n.censor=result$n.censor[indx2],
strata = temp,
surv= result$surv[indx4],
cumhaz = result$cumhaz[indx4])
if (se.fit) new$std.err <- result$std.err[indx4]
result <- new
@
Finally, the last (somewhat boring) part of the code.
First, if given the argument \code{censor=FALSE} we need to
remove all the time points from the output at which there
was only censoring activity. This action is mostly for
backwards compatibility with older releases that never
returned censoring times.
Second, add
in the variance and the confidence intervals to the result.
The code is nearly identical to that in survfitKM.
<<survfit.coxph-finish>>=
if (!censor) {
kfun <- function(x, keep){ if (is.matrix(x)) x[keep,,drop=F]
else if (length(x)==length(keep)) x[keep]
else x}
keep <- (result$n.event > 0)
if (!is.null(result$strata)) {
temp <- factor(rep(names(result$strata), result$strata),
levels=names(result$strata))
result$strata <- c(table(temp[keep]))
}
result <- lapply(result, kfun, keep)
}
if (se.fit) {
result$logse = TRUE # this will migrate to the solution
# In this particular case, logse=T and they are the same
# Other cases await addition of code
if (stype==2) result$std.chaz <- result$std.err
}
if (se.fit && conf.type != "none") {
ci <- survfit_confint(result$surv, result$std.err, logse=result$logse,
conf.type, conf.int)
result <- c(result, list(lower=ci$lower, upper=ci$upper,
conf.type=conf.type, conf.int=conf.int))
}
if (!missing(start.time)) result$start.time <- start.time
if (!missing(newdata)) result$newdata <- newdata
result$call <- Call
class(result) <- c('survfitcox', 'survfit')
result
@
\subsubsection{Multi-state models}
Survival curves after a multi-state Cox model are more challenging,
particularly the variance.
<<survfit.coxphms>>=
survfit.coxphms <-
function(formula, newdata, se.fit=FALSE, conf.int=.95, individual=FALSE,
stype=2, ctype,
conf.type=c("log", "log-log", "plain", "none", "logit", "arcsin"),
censor=TRUE, start.time, id, influence=FALSE,
na.action=na.pass, type, p0=NULL, time0=FALSE, ...) {
Call <- match.call()
Call[[1]] <- as.name("survfit") #nicer output for the user
object <- formula #'formula' because it has to match survfit
se.fit <- FALSE #still to do
if (missing(newdata))
stop("multi-state survival requires a newdata argument")
if (!missing(id))
stop("using a covariate path is not supported for multi-state")
temp <- object$smap["(Baseline)",]
baselinecoef <- rbind(temp, coef= 1.0)
phbase <- rep(FALSE, nrow(object$cmap))
if (any(duplicated(temp))) {
# We have shared hazards
# Any rows of cmap with names like ph(1:4) are special. The coefs they
# point to should be copied over to the baselinecoef vector.
# There might not be such rows, by the way.
pattern <- "^ph\\([0-9]+:[0-9]+\\)$"
cname <- rownames(object$cmap)
phbase <- grepl(pattern, cname) # this row points to a "ph" coef
for (i in which(phbase)) {
# Say that this row (i) of cmap had label ph(1:4), and contains
# elements 0,0,0,0,0, 8,9.
# This means that coefs 8 and 9 are special. They should be
# plugged into a matching element of baselinecoef.
# The columns names of smap and cmap are identical, and tell us
# where to put them.
j <- object$cmap[i,]
baselinecoef[2, j>0] <- exp(object$coef[j])
}
}
# process options, set up Y and the model frame for the original data
<<survfit.coxph-setup1>>
<<survfit.coxph-setup2>>
istate <- model.extract(mf, "istate")
#deal with start time, by throwing out observations that end before then
if (!missing(start.time)) {
if (!is.numeric(start.time) || length(start.time) !=1
|| !is.finite(start.time))
stop("start.time must be a single numeric value")
toss <- which(Y[,ncol(Y)-1] <= start.time)
if (length(toss)) {
n <- nrow(Y)
if (length(toss)==n) stop("start.time has removed all observations")
Y <- Y[-toss,,drop=FALSE]
X <- X[-toss,,drop=FALSE]
weights <- weights[-toss]
oldid <- oldid[-toss]
istate <- istate[-toss]
}
}
# expansion of the X matrix with stacker, set up shared hazards
<<survfit.coxphms-setupa>>
# risk scores, mf2, and x2
<<survfit.coxph-setup2c>>
<<survfit.coxph-setup3>>
<<survfit.coxphms-setup3b>>
<<survfit.coxphms-result>>
cifit$call <- Call
class(cifit) <- c("survfitcoxms", "survfitms", "survfit")
cifit
}
@
The third line \code{as.name('survfit')} causes the printout to say
`survfit' instead of `survfit.coxph'. %'
Notice that setup is almost completely shared with survival curves for
single state models. The major change is that we use survfitAJ (non-Cox)
to do all of the legwork with respect to the tabulation values (number at
risk, etc.), while for the computation proper it is easier to make use of
the same expanded data set that coxph used for a multi-state fit.
<<survfit.coxphms-setupa>>=
# Rebuild istate using the survcheck routine, as a double check
# that the data set hasn't been modified
mcheck <- survcheck2(Y, oldid, istate)
transitions <- mcheck$transitions
if (!identical(object$states, mcheck$states))
stop("failed to rebuild the data set")
if (is.null(istate)) istate <- mcheck$istate
else {
# if istate has unused levels, mcheck$istate won't have them so they
# need to be dropped.
istate <- factor(istate, object$states)
# a new level in state should only happen if someone has mucked up the
# data set used in the coxph fit
if (any(is.na(istate))) stop("unrecognized initial state, data changed?")
}
# Let the survfitAJ routine do the work of creating the
# overall counts (n.risk, etc). The rest of this code then
# replaces the surv and hazard components.
if (missing(start.time)) start.time <- min(Y[,2], 0)
if (is.null(weights)) weights <- rep(1.0, nrow(Y))
if (is.null(strata)) tempstrat <- rep(1L, nrow(Y))
else tempstrat <- strata
cifit <- survfitAJ(as.factor(tempstrat), Y, weights,
id= oldid, istate = istate, se.fit=FALSE,
start.time=start.time, p0=p0, time0= time0)
# For computing the actual estimates it is easier to work with an
# expanded data set.
# Replicate actions found in the coxph-multi-X chunk
# Note the dropzero=FALSE argument: if there is a transition with no
# covariates we still need it expanded; this differs from coxph.
# A second difference is tstrata: force stacker to think that every
# transition is a unique hazard, so that it does proper expansion.
cluster <- model.extract(mf, "cluster")
tstrata <- object$smap
tstrata[1,] <- 1:ncol(tstrata)
xstack <- stacker(object$cmap, tstrata, as.integer(istate), X, Y,
mf=mf, states= object$states, dropzero=FALSE)
if (length(position) >0)
position <- position[xstack$rindex] # id was required by coxph
X <- xstack$X
Y <- xstack$Y
strata <- strata[xstack$rindex] # strat in the model, other than transitions
transition <- xstack$transition
istrat <- xstack$strata
if (length(offset)) offset <- offset[xstack$rindex]
if (length(weights)) weights <- weights[xstack$rindex]
if (length(cluster)) cluster <- cluster[xstack$rindex]
oldid <- oldid[xstack$rindex]
if (robust & length(cluster)==0) cluster <- oldid
@
Fix up the X matrix to avoid huge values. In the single state case this
is fairly straightforward: use $(X-1m')\beta = X\beta - m'\beta$ where
$m$ is the vector of centering constants found in the
\code{object\$means} component.
However, in multi-state there will often be covariates that are part of one
transition but not another, and if one of them is wild we will want different
centering for each transition.
(Not yet implemented).
<<survfit.coxph-setup2d>>=
if (length(object$means) ==0) { # a model with only an offset term
# Give it a dummy X so the rest of the code goes through
# (This case is really rare)
# se.fit <- FALSE
X <- matrix(0., nrow=n, ncol=1)
if (is.null(offset)) offset <- rep(0, n)
xcenter <- mean(offset)
coef <- 0.0
varmat <- matrix(0.0, 1, 1)
risk <- rep(exp(offset- mean(offset)), length=n)
}
else {
varmat <- object$var
beta <- ifelse(is.na(object$coefficients), 0, object$coefficients)
if (is.null(offset)) xcenter <- sum(object$means * beta)
else xcenter <- sum(object$means * beta)+ mean(offset)
if (!is.null(object$frail)) {
keep <- !grepl("frailty(", dimnames(X)[[2]], fixed=TRUE)
X <- X[,keep, drop=F]
}
if (is.null(offset)) risk <- c(exp(X%*% beta - xcenter))
else risk <- c(exp(X%*% beta + offset - xcenter))
}
@
The survfit.coxph-setup3 chunk, shared with single state Cox models, has created
an mf2 model frame and an x2 matrix.
For multi-state, we ignore any strata variables in mf2.
Create a matrix of risk scores, number of subjects by number of transitions.
Different transitions often have different coefficients, so there is a risk
score vector per transition.
<<survfit.coxphms-setup3b>>=
if (has.strata && any(stangle$vars %in% names(mf2))){
mf2 <- mf2[is.na(match(names(mf2), stangle$vars))]
mf2 <- unique(mf2)
x2 <- unique(x2)
}
temp <- coef(object, matrix=TRUE)[!phbase,,drop=FALSE] # ignore missing coefs
# temp will be a matrix of coefficients, with ncol = number of transitions
# and nrow = the covariate set of a "normal" Cox model.
# x2 will have one row per desired curve and one col per 'normal' covariate.
risk2 <- exp(x2 %*% ifelse(is.na(temp), 0, temp) - xcenter)
# risk2 has a risk score with rows= curve and cols= transition
@
At this point we have several parts to keep straight. The data set has been
expanded into a new X and Y.
\begin{itemize}
\item \code{strata} contains any strata that were specified by the user
in the original fit. We do completely separate computations for each
stratum: the time scale starts over, nrisk, etc. Each has a separate
call to the multihaz function.
\item \code{transition} contains the transition to which each observation
applies
\item \code{istrat} comes from the xstack routine, and marks each
strata * baseline hazard combination.
\item \code{baselinecoef} maps from baseline hazards to transitions. It
has one column per transition, giving the baseline hazard that the
transition points to and a multiplier. Most multipliers will be 1.
\item \code{hfill} is constructed below. It contains the row/column to which
each column of baselinecoef is mapped, within the H matrix used to compute
P(state).
\end{itemize}
The coxph routine fits all strata and transitions at once, since the loglik is
a sum over strata. This routine does each stratum separately.
<<survfit.coxphms-result>>=
# make the expansion map.
# The H matrices we will need are nstate by nstate, at each time, with
# elements that are non-zero only for observed transitions.
states <- object$states
nstate <- length(states)
from <- as.numeric(sub(":.*$", "", colnames(object$smap)))
to <- as.numeric(sub("^.*:", "", colnames(object$smap)))
hfill <- cbind(from, to)
if (individual) {
stop("time dependent survival curves are not supported for multistate")
}
ny <- ncol(Y)
if (is.null(strata)) {
fit <- multihaz(Y, X, position, weights, risk, istrat, ctype, stype,
baselinecoef, hfill, x2, risk2, varmat, nstate, se.fit,
cifit$p0, cifit$time)
cifit$pstate <- fit$pstate
cifit$cumhaz <- fit$cumhaz
}
else {
if (is.factor(strata)) ustrata <- levels(strata)
else ustrata <- sort(unique(strata))
nstrata <- length(cifit$strata)
itemp <- rep(1:nstrata, cifit$strata)
timelist <- split(cifit$time, itemp)
ustrata <- names(cifit$strata)
tfit <- vector("list", nstrata)
for (i in 1:nstrata) {
indx <- which(strata== ustrata[i]) # divides the data
tfit[[i]] <- multihaz(Y[indx,,drop=F], X[indx,,drop=F],
position[indx], weights[indx], risk[indx],
istrat[indx], ctype, stype, baselinecoef, hfill,
x2, risk2, varmat, nstate, se.fit,
cifit$p0[i,], timelist[[i]])
}
# do.call(rbind) doesn't work for arrays, it loses a dimension
ntime <- length(cifit$time)
cifit$pstate <- array(0., dim=c(ntime, dim(tfit[[1]]$pstate)[2:3]))
cifit$cumhaz <- array(0., dim=c(ntime, dim(tfit[[1]]$cumhaz)[2:3]))
rtemp <- split(seq(along=cifit$time), itemp)
for (i in 1:nstrata) {
cifit$pstate[rtemp[[i]],,] <- tfit[[i]]$pstate
cifit$cumhaz[rtemp[[i]],,] <- tfit[[i]]$cumhaz
}
}
cifit$newdata <- newdata
@
Finally, a routine that does all the actual work.
\begin{itemize}
\item The first six arguments are for the data set that the Cox model was
built on: y, x, position, weight, risk score, istrat.
Position is a flag for each obs: is it the first of a connected string
such as (10, 12) (12,19) (19,21), the last of such a string, both,
or neither, coded as 1*first + 2*last. This affects whether an obs is labeled
as censored or not in user printout, nothing else. (That part has actually
already been done via the survfitAJ call.)
\item x2 and risk2 are the covariates and risk scores for the predicted
values. These do not involve any ph(a:b) coefficients.
\item baselinecoef encodes shared hazards
\item hfill controls the mapping from fitted hazards to
transitions and probabilities
\item p0 will be NULL if the user did not specify it.
\item vmat is only needed for standard errors
\item utime is the set of time points desired
\end{itemize}
The cn matrix below contains all the subtotals we need.
Say that transitions 4, 5, and 6 have a shared hazard, with bcoef[2,] values
of 1, 1.3, .4 (the first coef is always 1).
Then the underlying hazard will be base = (events[4] + events[5] + events[6])/
(nrisk[4] + 1.3*nrisk[5] + .4*nrisk[6]),
and the 3 individual hazards are 1*base, 1.3*base and .4*base.
If there are no shared hazards this can of course be computed more simply.
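In code, the shared-hazard case of the description above looks like this
(counts invented for illustration):
\begin{verbatim}
mult   <- c(1, 1.3, 0.4)      # bcoef[2,] for the three shared transitions
events <- c(2, 1, 1)          # event counts at one time point
nrisk  <- c(20, 15, 10)       # weighted numbers at risk
base   <- sum(events) / sum(mult * nrisk)
hazard <- mult * base         # hazards for transitions 4, 5 and 6
\end{verbatim}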
<<survfit.coxphms>>=
# Compute the hazard and survival functions
multihaz <- function(y, x, position, weight, risk, istrat, ctype, stype,
bcoef, hfill, x2, risk2, vmat, nstate, se.fit, p0, utime) {
ny <- ncol(y)
sort2 <- order(istrat, y[,ny-1L]) -1L
ntime <- length(utime)
storage.mode(weight) <- "double" #failsafe
# this returns all of the counts we might desire.
if (ny ==2) {
fit <- .Call(Ccoxsurv1, utime, y, weight, sort2, istrat, x, risk)
cn <- fit$count
dim(cn) <- c(length(utime), fit$ntrans, 10)
}
else {
sort1 <- order(istrat, y[,1]) -1L
fit <- .Call(Ccoxsurv2, utime, y, weight, sort1, sort2, position,
istrat, x, risk)
cn <- fit$count
dim(cn) <- c(length(utime), fit$ntrans, 12)
}
# cn is returned as a matrix since there is an allocMatrix C macro, but
# no allocArray macro. So we first reset the dimensions.
# The first dimension is time
# Second is the transition, same order as columns of bcoef
# Third is the count type: 1-3 = at risk (unweighted, with case weights,
# with casewt * risk wt), 4-6 = events (unweighted, case, risk),
# 7-8 = censored events, 9-10 = censored, 11-12 = Efron
# We will use events/(at risk) = cn[,,5]/cn[,,3] a few lines below; avoid 0/0
# If there is no one at risk there are no events, obviously.
# cn[,,1] is the safer check since it is an integer, but if there are weights
# and a subject with weight=0 were the only one at risk, we need cn[,,2]
# (Users should never use weights of 0, but someone, somewhere, will do it.)
none.atrisk <- (cn[,,1]==0 | cn[,,2]==0)
if (ctype ==1) {
denom1 <- ifelse(none.atrisk, 1, cn[,,3]) # avoid a later 0/0
denom2 <- ifelse(none.atrisk, 1, cn[,,3]^2)
} else {
denom1 <- ifelse(none.atrisk, 1, cn[,,9])
denom2 <- ifelse(none.atrisk, 1, cn[,,10])
}
# We want to avoid 0/0. If there is no one at risk (denominator) then
# by definition there will be no events (numerator), and that element of
# the hazard is by definition also 0.
if (any(duplicated(bcoef[1,]))) {
# there are shared hazards: we have to collapse and then expand
if (all(bcoef[1,] == bcoef[1,1])) design <- matrix(1, nrow=ncol(bcoef))
else design <- model.matrix(~factor(zed) -1, data.frame(zed=bcoef[1,]))
colnames(design) <- 1:ncol(design) # easier to read when debugging
events <- cn[,,5] %*% design
if (ctype==1) atrisk <- cn[,,3] %*% design
else atrisk <- cn[,,9] %*% design
basehaz <- events/ifelse(atrisk<=0, 1, atrisk)
hazard <- basehaz[,bcoef[1,]] * rep(bcoef[2,], each=nrow(basehaz))
}
else {
if (ctype==1) hazard <- cn[,,5]/ifelse(cn[,,3]<=0, 1, cn[,,3])
else hazard <- cn[,,5]/ifelse(cn[,,9] <=0, 1, cn[,,9])
}
# Expand the result, one "hazard set" for each row of x2
nx2 <- nrow(x2)
h2 <- array(0, dim=c(nrow(hazard), nx2, ncol(hazard)))
S <- double(nstate) # survival at the current time
S2 <- array(0, dim=c(nrow(hazard), nx2, nstate))
H <- matrix(0, nstate, nstate)
if (stype==2) {
H[hfill] <- colMeans(hazard) # dummy H to drive esetup
diag(H) <- diag(H) -rowSums(H)
esetup <- survexpmsetup(H)
}
for (i in 1:nx2) {
h2[,i,] <- apply(hazard * rep(risk2[i,], each=ntime), 2, cumsum)
if (FALSE) { # if (se.fit) eventually
d1 <- fit$xbar - rep(x[i,], each=nrow(fit$xbar))
d2 <- apply(d1*hazard, 2, cumsum)
d3 <- rowSums((d2%*% vmat) * d2)
v2[jj,] <- (apply(varhaz[jj,],2, cumsum) + d3) * (risk2[i])^2
}
S <- p0
for (j in 1:ntime) {
if (any(hazard[j,] > 0)) { # hazard =0 for censoring times
H[,] <- 0.0
H[hfill] <- hazard[j,] *risk2[i,]
if (stype==1) {
diag(H) <- pmax(0, 1.0 - rowSums(H))
S <- as.vector(S %*% H) # don't keep any names
}
else {
diag(H) <- 0.0 - rowSums(H)
#S <- as.vector(S %*% expm(H)) # dgeMatrix issue
S <- as.vector(S %*% survexpm(H, 1, esetup))
}
}
S2[j,i,] <- S
}
}
rval <- list(time=utime, xgrp=rep(1:nx2, each=nrow(hazard)),
pstate=S2, cumhaz=h2)
#if (se.fit) rval$varhaz <- v2
rval
}
@
\section{The Fine-Gray model}
For competing risks with ending states 1, 2, \ldots $k$,
the Fine-Gray approach turns these into a set of simple 2-state
Cox models:
\begin{itemize}
\item (not yet in state 1) $\longrightarrow$ state 1
\item (not yet in state 2) $\longrightarrow$ state 2
\item \ldots
\end{itemize}
Each of these is now a simple Cox model, assuming that we are willing
to make a proportional hazards assumption.
There is one added complication:
when estimating the first model, one wants to use the data set that
would have occurred if the subjects being followed for state 1 had
not had an artificial censoring, that is, had continued to be followed
for event 1 even after event 2 occurred.
Sometimes this can be filled in directly, e.g., if we knew the enrollment
dates for each subject along with the date that follow-up for the
study was terminated, and there was no loss to follow-up (only administrative
censoring).
An example is the mgus2 data set, where follow-up for death continued
after the occurrence of plasma cell malignancy.
In practice what is done is to estimate the overall censoring distribution and
give subjects artificial follow-up.
The function below creates a data set that can then be used with coxph.
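A usage sketch, mirroring the example from the package documentation for
the mgus2 data:
\begin{verbatim}
etime <- with(mgus2, ifelse(pstat == 0, futime, ptime))
event <- with(mgus2, ifelse(pstat == 0, 2 * death, 1))
event <- factor(event, 0:2, labels = c("censor", "pcm", "death"))
pdata <- finegray(Surv(etime, event) ~ ., data = mgus2, etype = "pcm")
# fgstart, fgstop, fgstatus, fgwt are the columns added by finegray
fgfit <- coxph(Surv(fgstart, fgstop, fgstatus) ~ age + sex,
               weight = fgwt, data = pdata)
\end{verbatim}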
<<finegray>>=
finegray <- function(formula, data, weights, subset, na.action= na.pass,
etype, prefix="fg", count="", id, timefix=TRUE) {
Call <- match.call()
indx <- match(c("formula", "data", "weights", "subset", "id"),
names(Call), nomatch=0)
if (indx[1] ==0) stop("A formula argument is required")
temp <- Call[c(1,indx)] # only keep the arguments we wanted
temp$na.action <- na.action
temp[[1L]] <- quote(stats::model.frame) # change the function called
special <- c("strata", "cluster")
temp$formula <- if(missing(data)) terms(formula, special)
else terms(formula, special, data=data)
mf <- eval(temp, parent.frame())
if (nrow(mf) ==0) stop("No (non-missing) observations")
Terms <- terms(mf)
Y <- model.extract(mf, "response")
if (!inherits(Y, "Surv")) stop("Response must be a survival object")
type <- attr(Y, "type")
if (type!='mright' && type!='mcounting')
stop("Fine-Gray model requires a multi-state survival")
nY <- ncol(Y)
states <- attr(Y, "states")
# The next line is a response to github issue 316
if (length(states) < 2) stop("survival time has only a single state")
if (timefix) Y <- aeqSurv(Y)
strats <- attr(Terms, "specials")$strata
if (length(strats)) {
stemp <- untangle.specials(Terms, 'strata', 1)
if (length(stemp$vars)==1) strata <- mf[[stemp$vars]]
else strata <- survival::strata(mf[,stemp$vars], shortlabel=TRUE)
istrat <- as.numeric(strata)
mf[stemp$vars] <- NULL
}
else istrat <- rep(1, nrow(mf))
id <- model.extract(mf, "id")
if (!is.null(id)) mf["(id)"] <- NULL # don't leave it in result
user.weights <- model.weights(mf)
if (is.null(user.weights)) user.weights <- rep(1.0, nrow(mf))
cluster<- attr(Terms, "specials")$cluster
if (length(cluster)) {
stop("a cluster() term is not valid")
}
# If there is start-stop data, then there needs to be an id
# also check that this is indeed a competing risks form of data.
# Mark the first and last obs of each subject, as we need it later.
# Observations may not be in time order within a subject
delay <- FALSE # is there delayed entry?
if (type=="mcounting") {
if (is.null(id)) stop("(start, stop] data requires a subject id")
else {
index <- order(id, Y[,2]) # by time within id
sorty <- Y[index,]
first <- which(!duplicated(id[index]))
last <- c(first[-1] -1, length(id))
if (any(sorty[-last, 3] != 0))
stop("a subject has a transition before their last time point")
delta <- c(sorty[-1,1], 0) - sorty[,2]
if (any(delta[-last] !=0))
stop("a subject has gaps in time")
if (any(Y[first,1] > min(Y[,2]))) delay <- TRUE
temp1 <- temp2 <- rep(FALSE, nrow(mf))
temp1[index[first]] <- TRUE
temp2[index[last]] <- TRUE
first <- temp1 #used later
last <- temp2
}
} else last <- rep(TRUE, nrow(mf))
if (missing(etype)) enum <- 1 #generate a data set for which endpoint?
else {
index <- match(etype, states)
if (any(is.na(index)))
stop ("etype argument has a state that is not in the data")
enum <- index[1]
if (length(index) > 1) warning("only the first endpoint was used")
}
# make sure count, if present is syntactically valid
if (!missing(count)) count <- make.names(count) else count <- NULL
oname <- paste0(prefix, c("start", "stop", "status", "wt"))
<<finegray-censor>>
<<finegray-build>>
}
@
The censoring and truncation distributions are
\begin{align*}
G(t) &= \prod_{s \le t} \left(1 - \frac{c(s)}{r_c(s)} \right ) \\
H(t) &= \prod_{s > t} \left(1 - \frac{e(s)}{r_e(s)} \right )
\end{align*}
where $c(t)$ is the number of subjects censored at time $t$, $e(t)$ is the
number who enter at time $t$, and $r$ is the size of the relevant risk set.
These are equations 5 and 6 of Geskus (Biometrics 2011).
Note that both $G$ and $H$ are right continuous functions.
For tied times the assumption is that event $<$ censor $<$ entry.
For $G$ we use a modified Kaplan-Meier where any events at censoring time
$t$ are removed from the risk set just before time $t$.
To avoid issues with times that are nearly identical (but not quite) we first
convert to an integer time scale, and then move events backwards by .2.
Since this is a competing risks data set any non-censored observation for a
subject is their last, so this time shift does not goof up the alignment
of start, stop data.
For the truncation distribution it is the subjects with times
at or before time $t$ that
are in the risk set $r_e(t)$ for truncation at (or before) $t$.
$H$ can be calculated using an ordinary KM on the reverse time scale.
When there is (start, stop) data and hence multiple observations per subject,
calculation of $G$ needs to use a status that is 1 only for the \emph{last}
row of a censored subject.
<<finegray-censor>>=
if (ncol(Y) ==2) {
temp <- min(Y[,1], na.rm=TRUE)
if (temp >0) zero <- 0
else zero <- 2*temp -1 # a value less than any observed y
Y <- cbind(zero, Y) # add a start column
}
utime <- sort(unique(c(Y[,1:2]))) # all the unique times
newtime <- matrix(findInterval(Y[,1:2], utime), ncol=2)
status <- Y[,3]
newtime[status !=0, 2] <- newtime[status !=0,2] - .2
Gsurv <- survfit(Surv(newtime[,1], newtime[,2], last & status==0) ~ istrat,
se.fit=FALSE)
@
The calculation for $H$ is also done on the integer scale.
Otherwise we will someday be clobbered by times that differ only in
round off error. The only nuisance is the status variable, which is
1 for the first row of each subject, since the data set may not
be in sorted order. The offset of .2 used above is not needed, but due
to the underlying integer scale it doesn't harm anything either.
Reversal of the time scale leads to a left continuous function which
we fix up later.
<<finegray-censor>>=
if (delay)
Hsurv <- survfit(Surv(-newtime[,2], -newtime[,1], first) ~ istrat,
se.fit =FALSE)
@
Consider the following data set:
\begin{itemize}
\item Events of type 1 at times 1, 4, 5, 10
\item Events of type 2 at times 2, 5, 8
\item Censors at times 3, 4, 4, 6, 8, 9, 12
\end{itemize}
The censoring distribution will have the following shape:
\begin{center}
\begin{tabular}{rcccccc}
interval& (0,3]& (3,4] & (4,6] & (6,8] & (8,12] & 12+\\
C(t) & 1 &11/12 & (11/12)(8/10) & (11/15)(5/6)& (11/15)(5/6)(3/4)& 0 \\
& 1.0000 & .9167 & .7333 & .6111 & .4583 & 0
\end{tabular}
\end{center}
Notice that the event at time 4 is not counted in the risk set at time 4,
so the jump is 8/10 rather than 8/11.
Likewise at time 8 the risk set has 4 instead of 5: censors occur after deaths.
When creating the data set for event type 1, subjects who have an event of
type 2 get extended out using this censoring distribution. The event at
time 2, for instance, appears as a censored observation with time dependent
weights of $G(t)$. The type 2 event at time 5 has weight 1 up through time 5,
then weights of $G(t)/C(5)$ for the remainder.
This means a weight of 1 over (5,6], 5/6 over (6,8], (5/6)(3/4) over (8,12],
and so on.
Though there are 6 unique censoring intervals,
in the created data set for event type 1 we only need to know case
weights at times 1, 4, 5, and 10; the information from the (4,6] and (6,8]
intervals will never be used.
To create a minimal sized data set we can leave those intervals out.
$G(t)$ only drops to zero if the largest time(s) are censored observations, so
by definition no events lie in an interval with $G(t)=0$.
If there is delayed entry, then the set of intervals is larger due to a merge
with the jumps in Hsurv.
The truncation distribution Hsurv ($H$) will become 0 at the first entry time;
it is a left continuous function whereas Gsurv ($G$) is right continuous.
We can slide $H$ one point to the left and merge them at the jump points.
<<finegray-build>>=
status <- Y[, 3]
# Do computations separately for each stratum
stratfun <- function(i) {
keep <- (istrat ==i)
times <- sort(unique(Y[keep & status == enum, 2])) #unique event times
if (length(times)==0) return(NULL) #no events in this stratum
tdata <- mf[keep, -1, drop=FALSE]
maxtime <- max(Y[keep, 2])
Gtemp <- Gsurv[i]
if (delay) {
Htemp <- Hsurv[i]
dtime <- rev(-Htemp$time[Htemp$n.event > 0])
dprob <- c(rev(Htemp$surv[Htemp$n.event > 0])[-1], 1)
ctime <- Gtemp$time[Gtemp$n.event > 0]
cprob <- c(1, Gtemp$surv[Gtemp$n.event > 0])
temp <- sort(unique(c(dtime, ctime))) # these will all be integers
index1 <- findInterval(temp, dtime)
index2 <- findInterval(temp, ctime)
ctime <- utime[temp]
cprob <- dprob[index1] * cprob[index2+1] # G(t)H(t), eq 11 Geskus
}
else {
ctime <- utime[Gtemp$time[Gtemp$n.event > 0]]
cprob <- Gtemp$surv[Gtemp$n.event > 0]
}
ct2 <- c(ctime, maxtime)
cp2 <- c(1.0, cprob)
index <- findInterval(times, ct2, left.open=TRUE)
index <- sort(unique(index)) # the intervals that were actually seen
# times before the first ctime get index 0, those between 1 and 2 get 1
ckeep <- rep(FALSE, length(ct2))
ckeep[index] <- TRUE
expand <- (Y[keep, 3] !=0 & Y[keep,3] != enum & last[keep]) #which rows to expand
split <- .Call(Cfinegray, Y[keep,1], Y[keep,2], ct2, cp2, expand,
c(TRUE, ckeep))
tdata <- tdata[split$row,,drop=FALSE]
tstat <- ifelse((status[keep])[split$row]== enum, 1, 0)
tdata[[oname[1]]] <- split$start
tdata[[oname[2]]] <- split$end
tdata[[oname[3]]] <- tstat
tdata[[oname[4]]] <- split$wt * user.weights[split$row]
if (!is.null(count)) tdata[[count]] <- split$add
tdata
}
if (max(istrat) ==1) result <- stratfun(1)
else {
tlist <- lapply(1:max(istrat), stratfun)
result <- do.call("rbind", tlist)
}
rownames(result) <- NULL #remove all the odd labels that R adds
attr(result, "event") <- states[enum]
result
@
\subsection{The predict method}
The \code{predict.coxph} function
produces various types of predicted values from a Cox model.
The arguments are
\begin{description}
\item [object] The result of a call to \code{coxph}.
\item [newdata] Optionally, a new data set for which prediction is
desired. If this is absent, predictions are for the observations used to
fit the model.
\item[type] The type of prediction
\begin{itemize}
\item lp = the linear predictor for each observation
\item risk = the risk score $\exp(lp)$ for each observation
\item expected = the expected number of events
\item survival = predicted survival = exp(-expected)
\item terms = a matrix with one row per subject and one column for
each term in the model.
\end{itemize}
\item[se.fit] Whether or not to return standard errors of the predictions.
\item[na.action] What to do with missing values \emph{if} there is new
data.
\item[terms] The terms that are desired. This option is almost never used,
so rarely in fact that it's hard to justify keeping it.
\item[collapse] An optional vector of subject identifiers, over which to
sum or `collapse' the results
\item[reference] the reference context for centering the results
\item[\ldots] All predict methods need to have a \ldots argument; we make
no use of it however.
\end{description}
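A usage sketch of the main options, using the lung data:
\begin{verbatim}
fit <- coxph(Surv(time, status) ~ age + ph.ecog, data = lung)
predict(fit, type = "lp")                    # centered linear predictor
predict(fit, type = "risk", se.fit = TRUE)   # exp(lp), with standard errors
predict(fit, type = "terms", terms = "age")  # a one column matrix
\end{verbatim}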
\paragraph{Setup}
The first task of the routine is to reconstruct necessary data elements
that were not saved as a part of the \code{coxph} fit.
We will need the following components:
\begin{itemize}
\item for type=`expected' residuals we need the original survival y. This
is saved in coxph objects by default so will only need to be fetched in
the highly unusual case that a user specified
\code{y=FALSE} in the original call.
\item for any call with either newdata, standard errors, or type='terms'
the original $X$ matrix, weights, strata, and offset.
When checking for the existence of a saved $X$ matrix we can't %'
use \code{object\$x}
since that will also match the \code{xlevels} component.
\item the new data matrix, if any
\end{itemize}
<<predict.coxph>>=
predict.coxph <- function(object, newdata,
type=c("lp", "risk", "expected", "terms", "survival"),
se.fit=FALSE, na.action=na.pass,
terms=names(object$assign), collapse,
reference=c("strata", "sample", "zero"), ...) {
<<pcoxph-init>>
<<pcoxph-getdata>>
if (type=="expected") {
<<pcoxph-expected>>
}
else {
<<pcoxph-simple>>
<<pcoxph-terms>>
}
<<pcoxph-finish>>
}
@
We start of course with basic argument checking.
Then retrieve the model parameters: does it have a strata
statement, offset, etc.
The \code{Terms2} object is a model statement without the strata or cluster terms,
appropriate for recreating the matrix of covariates $X$.
For type=expected the response variable needs to be kept; if not, we remove
it as well since the user's newdata might not contain one. %'
Type=survival is treated the same as type expected.
<<pcoxph-init>>=
if (!inherits(object, 'coxph'))
stop("Primary argument much be a coxph object")
Call <- match.call()
type <-match.arg(type)
if (type=="survival") {
survival <- TRUE
type <- "expected" # survival and expecte have nearly the same code path
}
else survival <- FALSE
if (type == "expected") reference <- "sample" # a common ref is easiest
n <- object$n
Terms <- object$terms
if (!missing(terms)) {
if (is.numeric(terms)) {
if (any(terms != floor(terms) |
terms > length(object$assign) |
terms <1)) stop("Invalid terms argument")
}
else if (any(is.na(match(terms, names(object$assign)))))
stop("a name given in the terms argument not found in the model")
}
# I will never need the cluster argument, if present delete it.
# Terms2 are terms I need for the newdata (if present), y is only
# needed there if type == 'expected'
if (length(attr(Terms, 'specials')$cluster)) {
temp <- untangle.specials(Terms, 'cluster', 1)
Terms <- drop.special(Terms, attr(Terms, "specials")$cluster)
}
if (type != 'expected') Terms2 <- delete.response(Terms)
else Terms2 <- Terms
has.strata <- !is.null(attr(Terms, 'specials')$strata)
has.offset <- !is.null(attr(Terms, 'offset'))
has.weights <- any(names(object$call) == 'weights')
na.action.used <- object$na.action
n <- length(object$residuals)
if (missing(reference) && type=="terms") reference <- "sample"
else reference <- match.arg(reference)
@
The next task of the routine is to reconstruct necessary data elements
that were not saved as a part of the \code{coxph} fit.
We will need the following components:
\begin{itemize}
\item for type=`expected' residuals we need the original survival y. This %'`
is saved in coxph objects by default so will only need to be fetched in
the highly unusual case that a user specified \code{y=FALSE} in the original
call. We also need the strata in this case. Grabbing it is the same
amount of work as grabbing X, so gets lumped with that case in the
code.
\item for any call with either standard errors, reference strata,
or type=`terms'
the original $X$ matrix, weights, strata, and offset.
When checking for the existence of a saved $X$ matrix we can't %'
use \code{object\$x}
since that will also match the \code{xlevels} component.
\item the new data matrix, if present, along with offset and strata.
\end{itemize}
For the case that none of the above are needed, we can use the
\code{linear.predictors} component of the fit. The variable \code{use.x} signals
this case, which takes up almost none of the code but is common in usage.
The check below that nrow(mf)==n is to avoid data sets that change under our
feet. A fit was based on data set ``x'', and when we reconstruct the data
frame it is a different size! This means someone changed the data between
the model fit and the extraction of residuals.
One other non-obvious case is that coxph treats the model \code{age:strata(grp)}
as though it were \code{age:strata(grp) + strata(grp)}.
The untangle.specials function will return
\code{vars= strata(grp), terms=integer(0)}; the first shows a strata to extract
and the second that there is nothing to remove from the terms structure.
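The claim in the last sentence is easy to verify directly (terms() is
symbolic, so the variables need not exist):
\begin{verbatim}
tt <- terms(Surv(time, status) ~ age:strata(grp), specials = "strata")
untangle.specials(tt, "strata")
# $vars
# [1] "strata(grp)"
# $terms
# integer(0)
\end{verbatim}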
<<pcoxph-getdata>>=
have.mf <- FALSE
if (type == "expected") {
y <- object[['y']]
if (is.null(y)) { # very rare case
mf <- stats::model.frame(object)
y <- model.extract(mf, 'response')
have.mf <- TRUE #for the logic a few lines below, avoid double work
}
}
# This will be needed if there are strata, and is cheap to compute
strat.term <- untangle.specials(Terms, "strata")
if (se.fit || type=='terms' || (!missing(newdata) && type=="expected") ||
(has.strata && (reference=="strata") || type=="expected") ||
(reference=="zero" && any(object[["means"]] !=0))) {
use.x <- TRUE
if (is.null(object[['x']]) || has.weights || has.offset ||
(has.strata && is.null(object$strata))) {
# I need the original model frame
if (!have.mf) mf <- stats::model.frame(object)
if (nrow(mf) != n)
stop("Data is not the same size as it was in the original fit")
x <- model.matrix(object, data=mf)
if (has.strata) {
if (!is.null(object$strata)) oldstrat <- object$strata
else {
if (length(strat.term$vars)==1) oldstrat <- mf[[strat.term$vars]]
else oldstrat <- strata(mf[,strat.term$vars], shortlabel=TRUE)
}
}
else oldstrat <- rep(0L, n)
weights <- model.weights(mf)
if (is.null(weights)) weights <- rep(1.0, n)
offset <- model.offset(mf)
if (is.null(offset)) offset <- 0
}
else {
x <- object[['x']]
if (has.strata) oldstrat <- object$strata
else oldstrat <- rep(0L, n)
weights <- rep(1.,n)
offset <- 0
}
}
else {
# I won't need strata in this case either
if (has.strata) {
Terms2 <- drop.special(Terms2, attr(Terms2, "specials")$strata)
has.strata <- FALSE #remaining routine never needs to look
}
oldstrat <- rep(0L, n)
offset <- 0
use.x <- FALSE
}
@
Now grab data from the new data set. We want to use model.frame
processing, in order to correctly expand factors and such.
We don't need weights, however, and don't want to make the user
include them in their new dataset. Thus we build the call up
the way it is done in coxph itself, but only keeping the newdata
argument. Note that Terms2 may have fewer variables than the
original model: no cluster term, and if type != expected no response.
If the original model had strata, but newdata does not, we need to
remove the strata from xlev to stop a spurious warning message.
<<pcoxph-getdata>>=
if (!missing(newdata)) {
use.x <- TRUE #we do use an X matrix later
tcall <- Call[c(1, match(c("newdata", "collapse"), names(Call), nomatch=0))]
names(tcall)[2] <- 'data' #rename newdata to data
tcall$formula <- Terms2 #version with no response
tcall$na.action <- na.action #always present, since there is a default
tcall[[1L]] <- quote(stats::model.frame) # change the function called
if (!is.null(attr(Terms, "specials")$strata) && !has.strata) {
temp.lev <- object$xlevels
temp.lev[strat.term$vars] <- NULL
tcall$xlev <- temp.lev
}
else tcall$xlev <- object$xlevels
mf2 <- eval(tcall, parent.frame())
collapse <- model.extract(mf2, "collapse")
n2 <- nrow(mf2)
if (has.strata) {
if (length(strat.term$vars)==1) newstrat <- mf2[[strat.term$vars]]
else newstrat <- strata(mf2[,strat.term$vars], shortlabel=TRUE)
if (any(is.na(match(levels(newstrat), levels(oldstrat)))))
stop("New data has a strata not found in the original model")
else newstrat <- factor(newstrat, levels=levels(oldstrat)) #give it all
if (length(strat.term$terms))
newx <- model.matrix(Terms2[-strat.term$terms], mf2,
contr=object$contrasts)[,-1,drop=FALSE]
else newx <- model.matrix(Terms2, mf2,
contr=object$contrasts)[,-1,drop=FALSE]
}
else {
newx <- model.matrix(Terms2, mf2,
contr=object$contrasts)[,-1,drop=FALSE]
newstrat <- rep(0L, nrow(mf2))
}
newoffset <- model.offset(mf2)
if (is.null(newoffset)) newoffset <- 0
if (type== 'expected') {
newy <- model.response(mf2)
if (attr(newy, 'type') != attr(y, 'type'))
stop("New data has a different survival type than the model")
}
na.action.used <- attr(mf2, 'na.action')
}
else n2 <- n
@
%\subsection{Expected hazard}
When we do not need standard errors the computation of expected
hazard is very simple since
the martingale residual is defined as status - expected. The 0/1
status is saved as the last column of $y$.
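A minimal check of this identity, assuming the default \code{y=TRUE} was
used in the fit:
\begin{verbatim}
fit <- coxph(Surv(time, status) ~ age, data = lung)
all.equal(unname(predict(fit, type = "expected")),
          unname(fit$y[, 2] - fit$residuals))   # TRUE
\end{verbatim}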
<<pcoxph-expected>>=
if (missing(newdata))
pred <- y[,ncol(y)] - object$residuals
if (!missing(newdata) || se.fit) {
<<pcoxph-expected2>>
}
if (survival) { #it actually was type= survival, do one more step
if (se.fit) se <- se * exp(-pred)
pred <- exp(-pred) # probablility of being in state 0
}
@
The more general case makes use of the \code{agsurv} routine to calculate
a survival curve for each stratum. The routine is defined in the
section on individual Cox survival curves, and the code here closely matches
that. The routine only returns values at the death times, so we use
\code{findInterval} to fill in the other time points.
One non-obvious, but careful, choice is to use the residuals for the
predicted value instead of the computation below, whenever operating on the
original
data set. This is a consequence of the Efron approx. When someone in
a new data set has exactly the same time as one of the death times in the
original data set, the code below implicitly makes them the ``last'' death
in the set of tied times.
The Efron approx puts a tie somewhere in the middle of the pack. This is
way too hard to work out in the code below, but thankfully the original
Cox model already did it. However, it does mean that a different answer will
arise if you set newdata = the original coxph data set.
Standard errors have the same issue, but 1. they are hardly used and 2. the
original coxph doesn't do that calculation. So we do what's easiest.
<<pcoxph-expected2>>=
ustrata <- unique(oldstrat)
risk <- exp(object$linear.predictors)
x <- x - rep(object$means, each=nrow(x)) #subtract from each column
if (missing(newdata)) #se.fit must be true
se <- double(n)
else {
pred <- se <- double(nrow(mf2))
newx <- newx - rep(object$means, each=nrow(newx))
newrisk <- exp(c(newx %*% object$coef) + newoffset)
# This was added in May 2024, and removed a few weeks later
# For (time1, time2) type survival estimates P(dead at t2 | alive at t1),
# which I saw no use case for. But a user did. Added notes to .Rd file
#if (ncol(y) ==3 && survival) {
# t0 <- unname(min(y[,1])) # the start of the survival curve
# simpler is all(newy[,1] == t0), but
# use of all.equal allows for roundoff error in newdata
# if (!isTRUE(all.equal(as.vector(newy[,1]), rep(t0, nrow(newy)))))
# stop("predicted survival must be from the start of the curve")
#}
}
survtype<- ifelse(object$method=='efron', 3,2)
for (i in ustrata) {
indx <- which(oldstrat == i)
afit <- agsurv(y[indx,,drop=F], x[indx,,drop=F],
weights[indx], risk[indx],
survtype, survtype)
xbar <- apply(afit$xbar, 2, cumsum)
afit.n <- length(afit$time)
if (missing(newdata)) {
# In this case we need se.fit, nothing else
j1 <- findInterval(y[indx,1], afit$time) # time (ny=2) or time1 (ny=3)
chaz <- c(0, afit$cumhaz)[j1 +1]
varh <- c(0, cumsum(afit$varhaz))[j1 +1]
xbar2 <- rbind(0, xbar)[j1+1,,drop=F]
if (ncol(y)==2) {
dt <- (chaz * x[indx,]) - xbar2
se[indx] <- sqrt(varh + rowSums((dt %*% object$var) *dt)) *
risk[indx]
}
else {
j2 <- findInterval(y[indx,2], afit$time) # time2
chaz2 <- c(0, afit$cumhaz)[j2 +1L]
varh2 <- c(0, cumsum(afit$varhaz))[j2 +1L]
xbar3 <- rbind(0, xbar)[j2+ 1L,,drop=F]
dt <- (chaz * x[indx,]) - xbar2
v1 <- varh + rowSums((dt %*% object$var) *dt)
dt2 <- (chaz2 * x[indx,]) - xbar3
v2 <- varh2 + rowSums((dt2 %*% object$var) *dt2)
se[indx] <- sqrt(v2-v1)* risk[indx]
}
}
else {
#there is new data
use.x <- TRUE
indx2 <- which(newstrat == i)
j1 <- findInterval(newy[indx2,1], afit$time)
chaz <-c(0, afit$cumhaz)[j1+1]
pred[indx2] <- chaz * newrisk[indx2]
if (se.fit) {
varh <- c(0, cumsum(afit$varhaz))[j1+1]
xbar2 <- rbind(0, xbar)[j1+1,,drop=F]
}
if (ncol(y)==2) {
if (se.fit) {
dt <- (chaz * newx[indx2,]) - xbar2
se[indx2] <- sqrt(varh + rowSums((dt %*% object$var) *dt)) *
newrisk[indx2]
}
}
else {
j2 <- findInterval(newy[indx2,2], afit$time)
chaz2 <-c(0, afit$cumhaz)[j2+1L]
pred[indx2] <- (chaz2 - chaz) * newrisk[indx2]
if (se.fit) {
varh2 <- c(0, cumsum(afit$varhaz))[j2 +1L]
xbar3 <- rbind(0, xbar)[j2 + 1L,,drop=F]
dt <- (chaz * newx[indx2,]) - xbar2
dt2 <- (chaz2 * newx[indx2,]) - xbar3
v2 <- varh2 + rowSums((dt2 %*% object$var) *dt2)
v1 <- varh + rowSums((dt %*% object$var) *dt)
se[indx2] <- sqrt(v2-v1)* newrisk[indx2]
}
}
}
}
@
%\subsection{Linear predictor, risk, and terms}
For these three options what is returned is a \emph{relative} prediction
which compares each observation to the average for the data set.
Partly this is practical. Say for instance that a treatment covariate
was coded as 0=control and 1=treatment.
If the model were refit using a new coding of 3=control 4=treatment, the
results of the Cox model would be exactly the same with respect to
coefficients, variance, tests, etc.
The raw linear predictor $X\beta$ however would change, increasing by
a value of $3\beta$.
The relative predictor
\begin{equation}
\eta_i = X_i\beta - (1/n)\sum_j X_j\beta
\label{eq:eta}
\end{equation}
will stay the same.
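A quick sketch of the invariance, using the lung data (differences are zero
up to convergence tolerance):
\begin{verbatim}
fit1 <- coxph(Surv(time, status) ~ age, data = lung)
fit2 <- coxph(Surv(time, status) ~ I(age + 100), data = lung)
range(predict(fit1, type = "lp") - predict(fit2, type = "lp"))  # ~ 0
\end{verbatim}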
The second reason for doing this is that the Cox model is a
relative risks model rather than an absolute risks model,
and thus relative predictions are almost certainly what the
user was thinking of.
When the fit was for a stratified Cox model more care is needed.
For instance assume that we had a fit that was stratified by sex with
covariate $x$, and a second data set were created where for the
females $x$ is replaced
by $x+3$. The Cox model results will be unchanged for the two
models, but the `normalized' linear predictors $(x - \overline x)'\beta$ %`
will not be the same.
This reflects a more fundamental issue: for a stratified
Cox model, relative risks are well defined only \emph{within} a
stratum, i.e., for subject pairs that share a common baseline
hazard.
The example above is artificial, but the problem arises naturally
whenever the model includes a strata by covariate interaction.
So for a stratified Cox model the predictions should be forced to
sum to zero within each stratum, or equivalently be made relative
to the weighted mean of the stratum.
Unfortunately, this important issue was not realized until late in 2009
when a puzzling query was sent to the author involving the results
from such an interaction.
Note that this issue did not arise with type='expected', which
has a natural scaling.
An offset variable, if specified, is treated like any other covariate
with respect to centering.
The logic for this choice is not as compelling, but it seemed the
best that I could do.
Note that offsets play no role whatever in predicted terms, only in
the lp and risk.
Start with the simple ones
<<pcoxph-simple>>=
if (is.null(object$coefficients))
coef<-numeric(0)
else {
# Replace any NA coefs with 0, to stop NA in the linear predictor
coef <- ifelse(is.na(object$coefficients), 0, object$coefficients)
}
if (missing(newdata)) {
offset <- offset - mean(offset)
if (has.strata && any(is.na(oldstrat))) is.na(newx) <- is.na(oldstrat)
if (has.strata && reference=="strata") {
# We can't use as.integer(oldstrat) as an index, if oldstrat is
# a factor variable with unrepresented levels as.integer could
# give 1,2,5 for instance.
xmeans <- rowsum(x*weights, oldstrat)/c(rowsum(weights, oldstrat))
newx <- x - xmeans[match(oldstrat,row.names(xmeans)),]
}
else if (use.x) {
if (reference == "zero") newx <- x
else newx <- x - rep(object$means, each=nrow(x))
}
}
else {
offset <- newoffset - mean(offset)
if (has.strata && any(is.na(newstrat))) is.na(newx) <- is.na(newstrat)
if (has.strata && reference=="strata") {
xmeans <- rowsum(x*weights, oldstrat)/c(rowsum(weights, oldstrat))
newx <- newx - xmeans[match(newstrat, row.names(xmeans)),]
}
else if (reference!= "zero")
newx <- newx - rep(object$means, each=nrow(newx))
}
if (type=='lp' || type=='risk') {
if (use.x) pred <- drop(newx %*% coef) + offset
else pred <- object$linear.predictors
if (se.fit) se <- sqrt(rowSums((newx %*% object$var) *newx))
if (type=='risk') {
pred <- exp(pred)
if (se.fit) se <- se * sqrt(pred) # standard Taylor series approx
}
}
@
The type=terms residuals are a bit more work.
In Splus this code used the Build.terms function, which was essentially
the code from predict.lm extracted out as a separate function.
As of March 2010 (today) a check of the Splus function and the R code
for predict.lm revealed no important differences.
A lot of the bookkeeping in both is to work around any possible NA
coefficients resulting from a singularity.
The basic formula is to
\begin{enumerate}
\item If the model has an intercept, then sweep the column means
out of the X matrix. We've already done this.
\item For each term separately, get the list of coefficients that
belong to that term; call this list \code{tt}.
\item Restrict $X$, $\beta$ and $V$ (the variance matrix) to that
subset, then the linear predictor is $X\beta$ with variance
matrix $X V X'$. The standard errors are the square root of
the diagonal of this latter matrix. This can be computed,
as colSums((X %*% V) * X)).
\end{enumerate}
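The identity used in step 3 can be checked with random matrices:
\begin{verbatim}
X <- matrix(rnorm(6), 2, 3)
V <- crossprod(matrix(rnorm(9), 3, 3))   # a symmetric 'variance' matrix
all.equal(diag(X %*% V %*% t(X)), rowSums((X %*% V) * X))   # TRUE
\end{verbatim}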
Note that the \code{assign} component of a coxph object is the same
as that found in Splus models (a list); most R models retain a numeric
vector which contains the same information but it is not as easily used.
The first part of predict.lm in R rebuilds the list form as its \code{asgn}
variable. I can skip this step since it is already done.
<<pcoxph-terms>>=
else if (type=='terms') {
    asgn <- object$assign
    nterms <- length(asgn)
    pred <- matrix(ncol=nterms, nrow=NROW(newx))
    dimnames(pred) <- list(rownames(newx), names(asgn))
    if (se.fit) se <- pred

    for (i in 1:nterms) {
        tt <- asgn[[i]]
        tt <- tt[!is.na(object$coefficients[tt])]
        xtt <- newx[, tt, drop=F]
        pred[,i] <- xtt %*% object$coefficients[tt]
        if (se.fit)
            se[,i] <- sqrt(rowSums((xtt %*% object$var[tt,tt]) * xtt))
    }
    pred <- pred[, terms, drop=F]
    if (se.fit) se <- se[, terms, drop=F]
    attr(pred, 'constant') <- sum(object$coefficients*object$means, na.rm=T)
}
@
To finish up we need to first expand out any missings in the result
based on the na.action, and optionally collapse the results within
a subject.
What should we do about the standard errors when collapse is specified?
We assume that the individual pieces are
independent and thus var(sum) = sum(variances).
The statistical justification of this is quite solid for the linear predictor,
risk and terms type of prediction due to independent increments in a martingale.
For expecteds the individual terms are positively correlated, so the se will
be too small. One solution would be to refuse to return an se in this
case, but the bias should usually be small,
and besides, it would be unkind to the user.
Prediction of type='terms' is expected to always return a matrix, or
the R termplot() function gets unhappy.
<<pcoxph-finish>>=
if (type != 'terms') {
    pred <- drop(pred)
    if (se.fit) se <- drop(se)
}

if (!is.null(na.action.used)) {
    pred <- napredict(na.action.used, pred)
    if (is.matrix(pred)) n <- nrow(pred)
    else n <- length(pred)
    if (se.fit) se <- napredict(na.action.used, se)
}

if (!missing(collapse) && !is.null(collapse)) {
    if (length(collapse) != n2) stop("Collapse vector is the wrong length")
    pred <- rowsum(pred, collapse)  # in R, rowsum is a matrix, always
    if (se.fit) se <- sqrt(rowsum(se^2, collapse))
    if (type != 'terms') {
        pred <- drop(pred)
        if (se.fit) se <- drop(se)
    }
}

if (se.fit) list(fit=pred, se.fit=se)
else pred
@
\section{Expected Survival}
The expected survival routine creates the overall survival curve for a
\emph{group} of people. It is possible to take the set of expected
survival curves for each individual and average them, which is the
\code{Ederer} method below, but this is not always the wisest choice:
the Hakulinen and conditional methods average in other ways, both of
which deal with censoring in a more sophisticated manner.
The individual curves are derived either from population rate tables, such
as the US annual life tables from the National Center for Health Statistics
or the larger multi-national collection at mortality.org, or by using a
previously fitted Cox model as the table.
The arguments for [[survexp]] are
\begin{description}
\item[formula] The model formula. The right hand side consists of grouping
variables, identically to [[survfit]] and an optional [[ratetable]]
directive. The ``response'' varies by method:
\begin{itemize}
\item for the Hakulinen method it is a vector of censoring times. This is
the actual censoring time for censored subjects, and what the
censoring time `would have been' for each subject who died. %'
\item for the conditional method it is the usual Surv(time, status) response
\item for the Ederer method no response is needed
\end{itemize}
\item[data, weights, subset, na.action] as usual
\item[rmap] an optional mapping for rate table variables, see more below.
\item[times] An optional vector of time points at which to compute the
curves. For the Hakulinen and conditional methods the program uses the
vector of unique y values if this is missing. For the Ederer method this
argument is required.
\item[method] The method used for the calculation. Choices are individual
survival, or the Ederer, Hakulinen, or conditional methods for cohort
survival.
\item[cohort, conditional] Older arguments that were used to select the
method.
\item[ratetable] the population rate table to use as a reference. This can
either be a ratetable object or a previously fitted Cox model
\item[scale] Scale the resulting output times, e.g., 365.25 to turn days into
years.
\item[se.fit] This has been deprecated.
\item[model, x, y] usual
\end{description}
The output of survexp contains a subset of the elements in a [[survfit]]
object, so many of the survfit methods can be applied. The result
has a class of [[c('survexp', 'survfit')]].
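Here is a sketch of the three cohort methods in use; the data set
\code{mydata} and the variables futime, status, censtime, entry.dt, and sex
within it are hypothetical.
\begin{verbatim}
# Ederer: no response, but a times vector is required
fit1 <- survexp(~ sex, data=mydata, ratetable=survival::survexp.us,
                rmap=list(age=age*365.25, year=entry.dt),
                times=365.25*(1:5))
# Hakulinen: the response is the (potential) censoring time for everyone
fit2 <- survexp(censtime ~ sex, data=mydata,
                ratetable=survival::survexp.us,
                rmap=list(age=age*365.25, year=entry.dt))
# conditional: the response is the usual survival outcome
fit3 <- survexp(Surv(futime, status) ~ sex, data=mydata,
                ratetable=survival::survexp.us,
                rmap=list(age=age*365.25, year=entry.dt),
                conditional=TRUE)
\end{verbatim}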
<<survexp>>=
survexp <- function(formula, data,
weights, subset, na.action, rmap, times,
method=c("ederer", "hakulinen", "conditional", "individual.h",
"individual.s"),
cohort=TRUE, conditional=FALSE,
ratetable=survival::survexp.us, scale=1, se.fit,
model=FALSE, x=FALSE, y=FALSE) {
<<survexp-setup>>
<<survexp-compute>>
<<survexp-format>>
<<survexp-finish>>
}
@
The first few lines are standard. Keep a copy of the call, then manufacture
a call to [[model.frame]] that contains only the arguments relevant to that
function.
<<survexp-setup>>=
Call <- match.call()
# keep the first element (the call), and the following selected arguments
indx <- match(c('formula', 'data', 'weights', 'subset', 'na.action'),
names(Call), nomatch=0)
if (indx[1] ==0) stop("A formula argument is required")
tform <- Call[c(1,indx)] # only keep the arguments we wanted
tform[[1L]] <- quote(stats::model.frame) # change the function called
Terms <- if(missing(data)) terms(formula)
else terms(formula, data=data)
@
The function works with two data sets: the user's data on an actual set of %'
subjects, and the reference ratetable.
This leads to a particular nuisance: the variable names in the data
set may not match those in the ratetable.
For instance the United States overall death rate table [[survexp.us]] expects
3 variables, as shown by [[summary(survexp.us)]]
\begin{itemize}
\item age = age in days for each subject at the start of follow-up
\item sex = sex of the subject, ``male'' or ``female'' (the routine accepts
any unique abbreviation and is case insensitive)
\item year = date of the start of follow-up
\end{itemize}
In earlier versions of the code, the mapping between variables in the data
set and the ratetable was managed by a ratetable() term in the formula.
For instance
\begin{verbatim}
survexp( ~ sex + ratetable(age=age*365.25, sex=sex,
year=entry.dt),
data=mydata, ratetable=survexp.us)
\end{verbatim}
In this case the user's data set has a variable `age' containing age in years,
along with sex and an entry date.
This had to be changed for several reasons, but still exists in some old user
level code, and also in the relsurv package. As of 1/2023 the code has stopped
supporting it.
The new process adds the [[rmap]] argument, an example would be
\code{rmap=list(age =age*365.25, year=entry.dt)}.
Any variables in the ratetable that are not found in \code{rmap} are assumed
not to need a mapping; \code{sex} plays that role in the above example.
For backwards compatibility we allow the old style argument, converting it
into the new style.
The \code{rmap} argument needs to be examined without evaluating it; we then add
the appropriate extra variables into a temporary formula so that the model
frame has all that is required, \emph{before} calling model.frame.
The ratetable variables then can be retrieved from the model frame.
The \code{pyears} routine uses the same rmap argument; this segment of the
code is given its own name so that it can be included there as well.
<<survexp-setup>>=
<<survexp-setup-rmap>>
mf <- eval(tform, parent.frame())
@
<<survexp-setup-rmap>>=
if (!missing(rmap)) {
    rcall <- substitute(rmap)
    if (!is.call(rcall) || rcall[[1]] != as.name('list'))
        stop("Invalid rmap argument")
}
else rcall <- NULL  # A ratetable, but no rmap argument

# Check that there are no illegal names in rcall, then expand it
# to include all the names in the ratetable
if (is.ratetable(ratetable)) {
    varlist <- names(dimnames(ratetable))
    if (is.null(varlist)) varlist <- attr(ratetable, "dimid") # older style
}
else if (inherits(ratetable, "coxph") && !inherits(ratetable, "coxphms")) {
    # Remove "log" and such things, to get just the list of
    #  variable names
    varlist <- all.vars(delete.response(ratetable$terms))
}
else stop("Invalid rate table")

temp <- match(names(rcall)[-1], varlist)  # 2,3,... are the argument names
if (any(is.na(temp)))
    stop("Variable not found in the ratetable: ",
         paste((names(rcall)[-1])[is.na(temp)], collapse=", "))

if (any(!(varlist %in% names(rcall)))) {
    to.add <- varlist[!(varlist %in% names(rcall))]
    temp1 <- paste(paste(to.add, to.add, sep='='), collapse=',')
    if (is.null(rcall)) rcall <- parse(text=paste("list(", temp1, ")"))[[1]]
    else {
        temp2 <- deparse(rcall)
        rcall <- parse(text=paste("c(", temp2, ",list(", temp1, "))"))[[1]]
    }
}
@
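For instance, with survexp.us (variables age, sex, and year) and the rmap
example from above, the paste/parse dance expands the unexamined call as
follows; this is only a sketch of the code above, run by hand:
\begin{verbatim}
rcall <- quote(list(age=age*365.25, year=entry.dt))
temp1 <- paste(paste("sex", "sex", sep='='), collapse=',')
parse(text=paste("c(", deparse(rcall), ",list(", temp1, "))"))[[1]]
## c(list(age = age * 365.25, year = entry.dt), list(sex = sex))
\end{verbatim}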
The formula below is used only in the call to [[model.frame]] to ensure
that the frame has both the formula and the ratetable variables.
We don't want to modify the original formula, since we use it to create
the $X$ matrix and the response variable.
The non-obvious bit of code is the addition of an environment to the
formula. The [[model.frame]] routine has non-standard evaluation: it
uses the environment of the formula, along with the [[data]] argument,
rather than the parent.frame() seen below, to look up variables.
If a formula is long enough deparse() will give two lines, hence the
extra paste call to re-collapse it into one.
<<survexp-setup-rmap>>=
# Create a temporary formula, used only in the call to model.frame
newvar <- all.vars(rcall)
if (length(newvar) > 0) {
    temp <- paste(paste(deparse(Terms), collapse=""),
                  paste(newvar, collapse='+'), sep='+')
    tform$formula <- as.formula(temp, environment(Terms))
}
@
If the user data has 0 rows, e.g. from a mistaken [[subset]] statement
that eliminated all subjects, we need to stop early. Otherwise the
.C code fails in a nasty way.
<<survexp-setup>>=
n <- nrow(mf)
if (n==0) stop("Data set has 0 rows")
if (!missing(se.fit) && se.fit)
warning("se.fit value ignored")
weights <- model.extract(mf, 'weights')
if (length(weights) ==0) weights <- rep(1.0, n)
if (inherits(ratetable, 'ratetable') && any(weights !=1))
warning("weights ignored")
if (any(attr(Terms, 'order') >1))
stop("Survexp cannot have interaction terms")
if (!missing(times)) {
if (any(times<0)) stop("Invalid time point requested")
if (length(times) >1 )
if (any(diff(times)<0)) stop("Times must be in increasing order")
}
@
If a response variable was given, we only need the times and not the
status. To be correct,
computations need to be done for each of the times given in
the [[times]] argument as well as for each of the unique y values.
This ends up as the vector [[newtime]]. If a [[times]] argument was
given we will subset down to only those values at the end.
For a population rate table and the Ederer method the times argument is
required.
<<survexp-setup>>=
Y <- model.extract(mf, 'response')
no.Y <- is.null(Y)
if (no.Y) {
if (missing(times)) {
if (is.ratetable(ratetable))
stop("either a times argument or a response is needed")
}
else newtime <- times
}
else {
if (is.matrix(Y)) {
if (is.Surv(Y) && attr(Y, 'type')=='right') Y <- Y[,1]
else stop("Illegal response value")
}
if (any(Y<0)) stop ("Negative follow up time")
# if (missing(npoints)) temp <- unique(Y)
# else temp <- seq(min(Y), max(Y), length=npoints)
temp <- unique(Y)
if (missing(times)) newtime <- sort(temp)
else newtime <- sort(unique(c(times, temp[temp<max(times)])))
}
if (!missing(method)) method <- match.arg(method)
else {
# the historical defaults and older arguments
if (!missing(conditional) && conditional) method <- "conditional"
else {
if (no.Y) method <- "ederer"
else method <- "hakulinen"
}
if (!missing(cohort) && !cohort) method <- "individual.s"
}
if (no.Y && (method!="ederer"))
stop("a response is required in the formula unless method='ederer'")
@
The next step is to check out the ratetable.
For a population rate table a set of consistency checks is done by the
[[match.ratetable]] function, giving a set of sanitized indices [[R]].
This function wants characters turned to factors.
For a Cox model [[R]] will be a model matrix whose covariates are coded
in exactly the same way that variables were coded in the original
Cox model. We call the model.matrix.coxph function to avoid repeating the
steps found there (remove cluster statements, etc).
We also need to use the [[mf]] argument of the function, otherwise
it will call model.frame internally and fail when it can't find the
response variable (which we don't need).
Note that for a population rate table the standard error of the expected
is by definition 0 (the population rate table is based on a huge sample).
For a Cox model rate table, an se formula is currently only available for
the Ederer method.
<<survexp-compute>>=
ovars <- attr(Terms, 'term.labels')
# rdata contains the variables matching the ratetable
rdata <- data.frame(eval(rcall, mf), stringsAsFactors=TRUE)
if (is.ratetable(ratetable)) {
israte <- TRUE
if (no.Y) {
Y <- rep(max(times), n)
}
rtemp <- match.ratetable(rdata, ratetable)
R <- rtemp$R
}
else if (inherits(ratetable, 'coxph')) {
israte <- FALSE
Terms <- ratetable$terms
}
else if (inherits(ratetable, "coxphms"))
stop("survexp not defined for multi-state coxph models")
else stop("Invalid ratetable")
@
Now for some calculation. If cohort is false, then any covariates on the
right hand side (other than the rate table) are irrelevant; the function
returns a vector of expected values rather than survival curves.
<<survexp-compute>>=
if (substring(method, 1, 10) == "individual") { #individual survival
if (no.Y) stop("for individual survival an observation time must be given")
if (israte)
temp <- survexp.fit (1:n, R, Y, max(Y), TRUE, ratetable)
else {
rmatch <- match(names(data), names(rdata))
if (any(is.na(rmatch))) rdata <- cbind(rdata, data[,is.na(rmatch)])
temp <- survexp.cfit(1:n, rdata, Y, 'individual', ratetable)
}
if (method == "individual.s") xx <- temp$surv
else xx <- -log(temp$surv)
names(xx) <- row.names(mf)
na.action <- attr(mf, "na.action")
if (length(na.action)) return(naresid(na.action, xx))
else return(xx)
}
@
Now for the more commonly used case: returning a survival curve.
First see if there are any grouping variables.
The results of the [[tcut]] function are often used in person-years
analysis, which is somewhat related to expected survival. However
tcut results aren't relevant here and we put in a check for the %'
confused user.
The strata command creates a single factor incorporating all the
variables.
<<survexp-compute>>=
if (length(ovars)==0) X <- rep(1,n)  # no categories
else {
    odim <- length(ovars)
    for (i in 1:odim) {
        temp <- mf[[ovars[i]]]
        if (inherits(temp, 'tcut'))
            stop("Can't use tcut variables in expected survival")
    }
    X <- strata(mf[ovars])
}
#do the work
if (israte)
temp <- survexp.fit(as.numeric(X), R, Y, newtime,
method=="conditional", ratetable)
else {
temp <- survexp.cfit(as.numeric(X), rdata, Y, method, ratetable, weights)
newtime <- temp$time
}
@
Now we need to package up the curves properly.
All the results can
be returned as a single matrix of survival values with a common vector of
times.
If there was a times argument we need to subset to the selected rows of the
computation.
<<survexp-format>>=
if (missing(times)) {
n.risk <- temp$n
surv <- temp$surv
}
else {
if (israte) keep <- match(times, newtime)
else {
# The result is from a Cox model, and its list of
# times won't match the list requested in the user's call
# Interpolate the step function, giving survival of 1
# for requested points that precede the Cox fit's
# first downward step. The code is like summary.survfit.
n <- length(temp$time)
keep <- approx(temp$time, 1:n, xout=times, yleft=0,
method='constant', f=0, rule=2)$y
}
if (is.matrix(temp$surv)) {
surv <- (rbind(1,temp$surv))[keep+1,,drop=FALSE]
n.risk <- temp$n[pmax(1,keep),,drop=FALSE]
}
else {
surv <- (c(1,temp$surv))[keep+1]
n.risk <- temp$n[pmax(1,keep)]
}
newtime <- times
}
newtime <- newtime/scale
if (is.matrix(surv)) {
dimnames(surv) <- list(NULL, levels(X))
out <- list(call=Call, surv= drop(surv), n.risk=drop(n.risk),
time=newtime)
}
else {
out <- list(call=Call, surv=c(surv), n.risk=c(n.risk),
time=newtime)
}
@
Last do the standard things: add the model, x, or y components to the output
if the user asked for them. (For this particular routine I can't think of %'
a reason they ever would.) Copy across summary information from the
rate table computation if present, and add the method and class to the
output.
<<survexp-finish>>=
if (model) out$model <- mf
else {
if (x) out$x <- X
if (y) out$y <- Y
}
if (israte && !is.null(rtemp$summ)) out$summ <- rtemp$summ
if (no.Y) out$method <- 'Ederer'
else if (conditional) out$method <- 'conditional'
else out$method <- 'cohort'
class(out) <- c('survexp', 'survfit')
out
@
\subsection{Parsing the covariates list}
For a multi-state Cox model we allow a list of formulas to take the place
of the \code{formula} argument.
The first element of the list is the default formula; later elements
are of the form \code{transitions ~ formula/options}, where the left hand side
denotes one or more transitions, and the right hand side is used to augment
the basic formula with respect to those transitions.
Step 1 is to break the formula into parts. There will be a list of left sides,
a list of right sides, and a list of options.
From this we can create a single ``pseudo formula'' that is used to drive
the model.frame process, which ensures that all of the variables we need
will be found in the model frame.
Further processing has to wait until after the model frame has been constructed:
if a left hand side referred to state ``deathh'', that might be a real state
or a typing mistake, and we can't know which until the data is in hand.
Should we walk the parse tree of the formula, or convert it to character and use
string manipulations? The latter looks promising until you see a fragment
like this:
\code{entry:death ~ age/sex + ns(weight/height, df=4) / common}
Walking the parse tree is a bit more subtle, but we then can take advantage of
all the knowledge built into the R parser.
A formula is a 3 element list of ``~'', leftside, rightside, or 2 elements if
it has only a right hand side. Legal ones for coxph have both left and right.
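A short demonstration of this structure:
\begin{verbatim}
f <- time ~ age + sex
length(f)            # 3
f[[1]]               # the ~ operator
f[[2]]               # time, the left hand side
f[[3]]               # age + sex, the right hand side
length(~ age + sex)  # 2: no left hand side
\end{verbatim}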
<<parsecovar>>=
parsecovar1 <- function(flist, statedata) {
    if (any(sapply(flist, function(x) !inherits(x, "formula"))))
        stop("an element of the formula list is not a formula")
    if (any(sapply(flist, length) != 3))
        stop("all formulas must have a left and right side")

    # split the formulas into a right hand and a left hand side
    lhs <- lapply(flist, function(x) x[-3])   # keep the ~
    rhs <- lapply(flist, function(x) x[[3]])  # don't keep the ~
    rhs <- parse_rightside(rhs)
    <<parse-leftside>>
    list(rhs = rhs, lhs = lterm)
}
@
\begin{figure}
\includegraphics{figures/fig1.pdf}
\caption{The parse tree for the formula
\code{1:3 +2:3 ~ strata(sex)/(age + trt) + ns(weight/ht, df=4) / common + shared}}
\label{figparse}
\end{figure}
Figure \ref{figparse} shows the parse tree for a complex formula.
The following function splits the formula at the rightmost slash, ignoring the
inside of any function call or parenthesised phrase.
Recursive functions like this are almost impossible to read, but luckily
this one is short.
The function recurses on the left and right sides of +, *, :, and \%in\%, and on
binary - (but not on unary -).
<<parsecovar>>=
rightslash <- function(x) {
    if (!inherits(x, 'call')) return(x)
    else {
        if (x[[1]] == as.name('/')) return(list(x[[2]], x[[3]]))
        else if (x[[1]]==as.name('+') || (x[[1]]==as.name('-') && length(x)==3) ||
                 x[[1]]==as.name('*') || x[[1]]==as.name(':') ||
                 x[[1]]==as.name('%in%')) {
            temp <- rightslash(x[[3]])
            if (is.list(temp)) {
                x[[3]] <- temp[[1]]
                return(list(x, temp[[2]]))
            } else {
                temp <- rightslash(x[[2]])
                if (is.list(temp)) {
                    x[[2]] <- temp[[2]]
                    return(list(temp[[1]], x))
                } else return(x)
            }
        }
        else return(x)
    }
}
@
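A quick check of the splitter on a piece of the right hand side from
Figure \ref{figparse}: the slash inside the ns() call is never examined,
and of the two top level slashes only the rightmost one is used for the
split.
\begin{verbatim}
temp <- rightslash(quote(strata(sex)/(age + trt) +
                         ns(weight/ht, df=4) / common))
temp[[1]]   # strata(sex)/(age + trt) + ns(weight/ht, df = 4)
temp[[2]]   # common
\end{verbatim}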
There are three possible options: common, shared, and init.
The first two appear as bare words; the last should have a set of
values attached, which become the \code{ival} vector.
There will, of course, one day be a user with a variable named \code{common}
who wants a nested term \code{x/common}. Since we don't look inside
parentheses they will be able to use \code{1:3 ~ (x/common)}.
<<parsecovar>>=
parse_rightside <- function(rhs) {
    parts <- lapply(rhs, rightslash)
    new <- lapply(parts, function(opt) {
        tform <- ~ x   # a skeleton, the "x" will be replaced
        if (!is.list(opt)) {  # no options for this line
            tform[[2]] <- opt
            list(formula = tform, ival = NULL, common = FALSE,
                 shared = FALSE)
        }
        else {
            # treat the option list as though it were a formula
            temp <- ~ x
            temp[[2]] <- opt[[2]]
            optterms <- terms(temp)
            ff <- rownames(attr(optterms, "factors"))
            index <- match(ff, c("common", "shared", "init"))
            if (any(is.na(index)))
                stop("option not recognized in a covariates formula: ",
                     paste(ff[is.na(index)], collapse=", "))
            common <- any(index==1)
            shared <- any(index==2)
            if (any(index==3)) {
                optatt <- attributes(optterms)
                j <- optatt$variables[1 + which(index==3)]
                j[[1]] <- as.name("list")
                ival <- unlist(eval(j, parent.frame()))
            }
            else ival <- NULL
            tform[[2]] <- opt[[1]]
            list(formula= tform, ival= ival, common= common, shared= shared)
        }
    })
    new
}
@
The left hand side of each formula specifies the set of transitions to which
the covariates apply, and is more complex.
Say, for instance, that we had 7 states and the following statedata
data set.
\begin{center}
\begin{tabular}{cccc}
state & A& N& death \\ \hline
A-N- & 0& 0 & 0\\
A+N- & 1& 0 & 0\\
A-N1 & 0& 1 & 0\\
A+N1 & 1& 1 & 0\\
A-N2 & 0& 2 & 0\\
A+N2 & 1& 2 & 0\\
Death& NA & NA& 1
\end{tabular}
\end{center}
Here are some valid transitions; a sketch of their use in a coxph call
follows the list.
\begin{enumerate}
\item 0:state('A+N+'), any transition to the A+N+ state
\item state('A-N-'):death(0), a transition from A-N-, but not to death
\item A(0):A(1), any of the 4 changes that start with A=0 and end with A=1
\item N(0):N(1,2) + N(1):N(2), an upward change of N
\item 'A-N-':c('A-N+','A+N-'); if there is no variable then the
overall state is assumed
\item 1:3 + 2:3; we can refer to states by number, and we can have multiples
\end{enumerate}
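In use, such a list replaces the usual formula argument of coxph.
A hypothetical sketch (the data set and variable names are invented):
\begin{verbatim}
fit <- coxph(list(Surv(tstart, tstop, state) ~ age + sex,
                  1:3 + 2:3 ~ trt / common),
             data=mydata, id=subject)
\end{verbatim}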
<<parse-leftside>>=
# deal with the left hand side of the formula
# the next routine cuts at '+' signs
pcut <- function(form) {
    if (length(form)==3) {
        if (form[[1]] == '+')
            c(pcut(form[[2]]), pcut(form[[3]]))
        else if (form[[1]] == '~') pcut(form[[2]])
        else list(form)
    }
    else list(form)
}
lcut <- lapply(lhs, function(x) pcut(x[[2]]))
@
We now have one list per formula; each list contains one element per term,
with multiple terms arising from + signs (case 4 above).
To make evaluation easier, create functions that append their
name to a list of values.
I have not yet found a way to do this without eval(parse()), which
always seems clumsy.
A use for the labels without an argument will arise later, hence the
double environments.
Repeating the list above, this is what we want to end with
\begin{itemize}
\item a list with one element per formula in the covariates list
\item each element is a list, with one element per term: multiple
a:b terms are allowed separated by + signs
\item each of these level 3 elements is a list with two elements
``left'' and ``right'', for the two sides of the : operator
\item left and right will be one of 3 forms: a simple vector,
a one element list containing the stateid, or a two element list
containing the stateid and the values.
Any word that doesn't match one of the
column names of statedata ends up as a vector.
\end{itemize}
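As a sketch of what the eval(parse()) construction below manufactures for
a statedata column named \code{A}:
\begin{verbatim}
tfun <- eval(parse(text=
          "function(...) list(stateid='A', values=c(...))"))
tfun(0, 1)   # list(stateid="A", values=c(0,1))
\end{verbatim}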
<<parse-leftside>>=
env1 <- new.env(parent= parent.frame(2))
env2 <- new.env(parent= env1)
if (missing(statedata)) {
    assign("state", function(...) list(stateid= "state",
                                       values=c(...)), env1)
    assign("state", list(stateid="state"), env2)
}
else {
    for (i in statedata) {
        assign(i, eval(list(stateid=i)), env2)
        tfun <- eval(parse(text=paste0("function(...) list(stateid='",
                                       i, "', values=c(...))")))
        assign(i, tfun, env1)
    }
}
lterm <- lapply(lcut, function(x) {
    lapply(x, function(z) {
        if (length(z)==1) eval(z, envir= env2)  # a bare word or value
        else if (length(z)==3 && z[[1]]==':')
            list(left = eval(z[[2]], envir=env2),
                 right= eval(z[[3]], envir=env2))
        else stop("invalid term: ", deparse(z))
    })
})
@
The second routine, parsecovar2, builds tmap, the terms map.
Its arguments are the results from the first pass, the statedata data frame,
the default formula, the terms structure from the full formula,
and the transitions count.
One nuisance is that the terms function sometimes inverts things. For
example, in
\code{terms(~ x1 + x1:iage + x2 + x2:iage)} the label for the second
interaction becomes \code{iage:x2}.
I'm guessing this is because the variables first appear in the order x1, iage, x2,
and the labels make use of that order.
But when we look at the formula fragment \code{~ x2 + x2:iage} the term
label will be in the other order.
A way out of this is to use the simple \code{termmatch} function below,
which keys off of the factors attribute instead of the names.
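The inversion is easy to demonstrate:
\begin{verbatim}
attr(terms(~ x1 + x1:iage + x2 + x2:iage), "term.labels")
## "x1"  "x2"  "x1:iage"  "iage:x2"
attr(terms(~ x2 + x2:iage), "term.labels")
## "x2"  "x2:iage"
\end{verbatim}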
<<parsecovar>>=
termmatch <- function(f1, f2) {
# look for f1 in f2, each the factors attribute of a terms object
if (length(f1)==0) return(NULL) # a formula with only ~1
irow <- match(rownames(f1), rownames(f2))
if (any(is.na(irow))) stop ("termmatch failure 1")
hashfun <- function(j) sum(ifelse(j==0, 0, 2^(seq(along.with=j))))
hash1 <- apply(f1, 2, hashfun)
hash2 <- apply(f2[irow,,drop=FALSE], 2, hashfun)
index <- match(hash1, hash2)
if (any(is.na(index))) stop("termmatch failure 2")
index
}
parsecovar2 <- function(covar1, statedata, dformula, Terms, transitions,states) {
if (is.null(statedata))
statedata <- data.frame(state = states, stringsAsFactors=FALSE)
else {
if (is.null(statedata$state))
stop("the statedata data set must contain a variable 'state'")
indx1 <- match(states, statedata$state, nomatch=0)
if (any(indx1==0))
stop("statedata does not contain all the possible states: ",
states[indx1==0])
statedata <- statedata[indx1,] # put it in order
}
# Statedata might have rows for states that are not in the data set,
# for instance if the coxph call had used a subset argument. Any of
# those were eliminated above.
# Likewise, the formula list might have rules for transitions that are
# not present. Don't worry about it at this stage.
allterm <- attr(Terms, 'factors')
nterm <- ncol(allterm)
# create a map for every transition, even ones that are not used.
# at the end we will thin it out
# It has an extra first row for intercept (baseline)
# Fill it in with the default formula
nstate <- length(states)
tmap <- array(0L, dim=c(nterm+1, nstate, nstate))
dmap <- array(seq_len(length(tmap)), dim=c(nterm+1, nstate, nstate)) #unique values
dterm <- termmatch(attr(terms(dformula), "factors"), allterm)
dterm <- c(1L, 1L+ dterm) # add intercept
tmap[dterm,,] <- dmap[dterm,,]
inits <- NULL
if (!is.null(covar1)) {
<<parse-tmap>>
}
<<parse-finish>>
}
@
Now go through the formulas one by one. The left hand side tells us which
state:state transitions to fill in, the right hand side tells the variables.
The code block below goes through lhs element(s) for a single formula.
That element is itself a list which has an entry for each term, and that
entry can have left and right portions.
<<parse-lmatch>>=
state1 <- state2 <- NULL
for (x in lhs) {
    # x is one term
    if (!is.list(x) || is.null(x$left))
        stop("term found without a ':' ", deparse(x))

    # left of the colon
    if (!is.list(x$left) && length(x$left)==1 && x$left==0)
        temp1 <- 1:nrow(statedata)
    else if (is.numeric(x$left)) {
        temp1 <- as.integer(x$left)
        if (any(temp1 != x$left)) stop("non-integer state number")
        if (any(temp1 < 1 | temp1 > nstate))
            stop("numeric state is out of range")
    }
    else if (is.list(x$left) && names(x$left)[1] == "stateid") {
        if (is.null(x$left$value))
            stop("state variable with no list of values: ", x$left$stateid)
        else {
            k <- is.na(match(x$left$stateid, names(statedata)))
            if (any(k))
                stop(x$left$stateid[k], ": state variable not found")
            zz <- statedata[[x$left$stateid]]
            k <- is.na(match(x$left$value, zz))
            if (any(k)) stop(x$left$value[k], ": state value not found")
            temp1 <- which(zz %in% x$left$value)
        }
    }
    else {
        k <- match(x$left, statedata$state)
        if (any(is.na(k))) stop(x$left[is.na(k)], ": state not found")
        temp1 <- which(statedata$state %in% x$left)
    }

    # right of the colon
    if (!is.list(x$right) && length(x$right)==1 && x$right==0)
        temp2 <- 1:nrow(statedata)
    else if (is.numeric(x$right)) {
        temp2 <- as.integer(x$right)
        if (any(temp2 != x$right)) stop("non-integer state number")
        if (any(temp2 < 1 | temp2 > nstate))
            stop("numeric state is out of range")
    }
    else if (is.list(x$right) && names(x$right)[1] == "stateid") {
        if (is.null(x$right$value))
            stop("state variable with no list of values: ", x$right$stateid)
        else {
            k <- is.na(match(x$right$stateid, names(statedata)))
            if (any(k))
                stop(x$right$stateid[k], ": state variable not found")
            zz <- statedata[[x$right$stateid]]
            k <- is.na(match(x$right$value, zz))
            if (any(k)) stop(x$right$value[k], ": state value not found")
            temp2 <- which(zz %in% x$right$value)
        }
    }
    else {
        k <- match(x$right, statedata$state)
        if (any(is.na(k))) stop(x$right[is.na(k)], ": state not found")
        temp2 <- which(statedata$state %in% x$right)
    }

    state1 <- c(state1, rep(temp1, length(temp2)))
    state2 <- c(state2, rep(temp2, each=length(temp1)))
}
@
At the end this has created two vectors, state1 and state2, listing all
the pairs of states that are indicated.
The init clause (initial values) is gathered but not checked:
we don't yet know how many columns a term will expand into.
tmap is a 3-way array (term, state1, state2) containing coefficient numbers
and zeros.
<<parse-tmap>>=
for (i in 1:length(covar1$rhs)) {
rhs <- covar1$rhs[[i]]
lhs <- covar1$lhs[[i]] # one rhs and one lhs per formula
<<parse-lmatch>>
npair <- length(state1) # number of state:state pairs for this line
# update tmap for this set of transitions
# first, what variables are mentioned, and check for errors
rterm <- terms(rhs$formula)
rindex <- 1L + termmatch(attr(rterm, "factors"), allterm)
# the update.formula function is good at identifying changes
# formulas that start with "- x" have to be pasted on carefully
temp <- substring(deparse(rhs$formula, width.cutoff=500), 2)
if (substring(temp, 1,1) == '-') dummy <- formula(paste("~ .", temp))
else dummy <- formula(paste("~. +", temp))
rindex1 <- termmatch(attr(terms(dformula), "factors"), allterm)
rindex2 <- termmatch(attr(terms(update(dformula, dummy)), "factors"),
allterm)
dropped <- 1L + rindex1[is.na(match(rindex1, rindex2))] # remember the intercept
if (length(dropped) >0) {
for (k in 1:npair) tmap[dropped, state1[k], state2[k]] <- 0
}
# grab initial values
if (length(rhs$ival))
inits <- c(inits, list(term=rindex, state1=state1,
state2= state2, init= rhs$ival))
# adding -1 to the front is a trick, to check if there is a "+1" term
dummy <- ~ -1 + x
dummy[[2]][[3]] <- rhs$formula
if (attr(terms(dummy), "intercept") ==1) rindex <- c(1L, rindex)
# an update of "- sex" won't generate anything to add
# dmap is simply an indexed set of unique values to pull from, so that
# no number is used twice
if (length(rindex) > 0) { # rindex = things to add
if (rhs$common) {
j <- dmap[rindex, state1[1], state2[1]]
for(k in 1:npair) tmap[rindex, state1[k], state2[k]] <- j
}
else {
for (k in 1:npair)
tmap[rindex, state1[k], state2[k]] <- dmap[rindex, state1[k], state2[k]]
}
}
# Deal with the shared argument, using - for a separate coef
if (rhs$shared && npair>1) {
j <- dmap[1, state1[1], state2[1]]
for (k in 2:npair)
tmap[1, state1[k], state2[k]] <- -j
}
}
@
Fold the 3-dimensional tmap into a matrix with terms as rows
and one column for each transition that actually occurred.
``Actually occurred'' is on its face a simple task: look at the transitions
matrix and find all the non-zero entries.
Shared hazards create a nuisance though.
Suppose 1:death and 2:death have shared hazard, no state 1 obs actually die,
but there are state 1 subjects at risk, i.e., there is a nonzero row for
state 1 in the transitions matrix. (The death row is normally all zero).
The 1:death transition certainly needs to appear in the final smap object.
Shared transitions can be found in the [1,,] element of tmap; use that to
put sums into the t2 matrix below.
This isn't perfect, e.g., if there was a single state 1 subject who is censored
before anything happens, then the 1:death state is never actually part of a
risk set and could be omitted from cmap and smap.
A more complex case shows up when we divide a covariate into groups in order
to deal with time dependent covariates. Say we have states A, B and death,
and two covariates x1 and x2 with 3 levels each.
This leads to a 19 state model: A11, A12, \ldots, A33, B11, \ldots, B33, death.
If covariates change slowly we might never see an A11 to B33 transition, ever.
If the user used statedata, the model statement might be
\code{A(1:9):B(1:9) ~ x1 + x2 + 1/common}, collapsing all 81 possibilities
into a single stratum with shared coefficients and baseline.
Without due care one could end up with 9 copies of each subject in the A:B
transition's risk set. This routine passes the buck to stacker to deal with it.
Later addition: for the real data cases we have seen so far, it is best to
assume that any transition that isn't observed won't occur. Given that, it
is easier if we don't mark extra shared hazards as possible in the
returned object. An example was states of not demented, demented, and death,
with the first 2 divided by the presence of 0--7 cardiometabolic
comorbidities (CMC).
It is easy to declare all 8*8 ND:dementia transitions as `shared', but because
CMC cannot go backwards a lot of these are impossible (each condition x is
coded as ``any history of x''). Because CMC changes slowly, many others are
effectively impossible, such as ND0 to dem7. We don't want to estimate a
positive hazard for such transitions.
<<parse-finish>>=
t2 <- transitions[rowSums(transitions) > 0,, drop=FALSE]
i <- match("(censored)", colnames(transitions), nomatch=0)
if (i>0) t2 <- t2[,-i, drop=FALSE] # transitions to 'censor' don't count
indx1 <- match(rownames(t2), states)
indx2 <- match(colnames(t2), states)
# check shared hazards
# Commented out per discussion in the noweb file: in more complex shared hazard
# models such as multiple time-dependent covariates, assuming that all the
# transitions implied by the user's model statement should be counted can lead
# to including a *lot* of state combinations that are improbable or impossible.
# So we no longer extend the state space.
# But keep the code here just in case we change our mind
#temp <- matrix(tmap[1,indx1,indx2], nrow=nrow(t2))
#for (i in unique(temp)) {
# if (sum(temp==i) > 1) { #shared hazard
# j <- cbind(row(temp)[temp==i], col(temp)[temp==i])
# t2[j] <- sum(t2[j]) # credit all with all the events
# }
#}
tmap2 <- matrix(0L, nrow= 1+nterm, ncol= sum(t2>0))
trow <- row(t2)[t2>0]
tcol <- col(t2)[t2>0]
for (i in 1:nrow(tmap2)) {
for (j in 1:ncol(tmap2))
tmap2[i,j] <- tmap[i, indx1[trow[j]], indx2[tcol[j]]]
}
# Remember which hazards had ph
# tmap2[1,] is the 'intercept' row
# If the hazard for column 6 is proportional to the hazard for column 2,
#  then tmap2[1,6] = -tmap2[1,2], and phbaseline[6] = 2
temp <- tmap2[1,]
indx <- which(temp > 0)
tmap2[1,] <- indx[match(abs(temp), temp[indx])]
phbaseline <- ifelse(temp<0, tmap2[1,], 0) # remembers column numbers
tmap2[1,] <- match(tmap2[1,], unique(tmap2[1,])) # unique strata 1,2, ...
if (nrow(tmap2) > 1)
tmap2[-1,] <- match(tmap2[-1,], unique(c(0L, tmap2[-1,]))) -1L
dimnames(tmap2) <- list(c("(Baseline)", colnames(allterm)),
paste(indx1[trow], indx2[tcol], sep=':'))
# mapid gives the from,to for each realized state
list(tmap = tmap2, inits=inits, mapid= cbind(from=indx1[trow], to=indx2[tcol]),
phbaseline = phbaseline)
@
Last is a helper routine that converts tmap, which has one row per term,
into cmap, which has one row per coefficient. Both have one column per
transition. (A transition with no covariates could be removed from cmap;
the code to do so is currently commented out at the bottom of the routine.)
It uses the assign attribute of the X matrix along with the column names.
Consider the model \code{~ x1 + strata(x2) + factor(x3)} where x3 has 4 levels.
The Xassign vector will be 1, 3, 3, 3, since it refers to terms and there are 3
columns of X for term number 3.
If there were an intercept the first column of X
would be a 1 and Xassign would be 0, 1, 3, 3, 3.
Let's say that there were 3 transitions and tmap looks like this:
\begin{tabular}{rccc}
& 1:2 & 1:3 & 2:3 \\
(Baseline) & 1 & 2 & 3 \\
x1 & 1 & 4 & 4 \\
strata(x2) & 2 & 5 & 6 \\
factor(x3) & 3 & 3 & 7
\end{tabular}
The cmap matrix will ignore rows 1 and 3 since they do not correspond to
coefficients in the model.
Proportional baseline hazards add another wrinkle: say that the 1:3 and 2:3
hazards were proportional, and the user had \code{1:3 + 2:3 / shared} in their
call. Then the phbaseline vector will be (0, 0, 2) and
cmap will gain an extra row with label ph(1:3), holding a coefficient
for the 2:3 transition.
If the user typed \code{2:3 + 1:3 / shared} then the phbaseline vector will
be (0, 3, 0) and 2:3 is the reference level.
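Working the example through parsecovar3 by hand (assuming no intercept, so
Xassign = 1, 3, 3, 3, the default factor column names, and no proportional
baselines), the renumbered cmap should come out as

\begin{tabular}{rccc}
 & 1:2 & 1:3 & 2:3 \\
x1 & 1 & 5 & 5 \\
factor(x3)2 & 2 & 2 & 6 \\
factor(x3)3 & 3 & 3 & 7 \\
factor(x3)4 & 4 & 4 & 8
\end{tabular}

Row x1 of tmap was (1, 4, 4), so x1 gets one coefficient for the 1:2
transition and a single shared coefficient for 1:3 and 2:3; the factor row
was (3, 3, 7), so its three coefficients are shared by 1:2 and 1:3 with a
separate set for 2:3.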
<<parsecovar>>=
parsecovar3 <- function(tmap, Xcol, Xassign, phbaseline=NULL) {
    # sometimes X will have an intercept, sometimes not; cmap never does
    hasintercept <- (Xassign[1] == 0)
    ph.coef <- (phbaseline != 0)  # any proportional baselines?
    ph.rows <- length(unique(phbaseline[ph.coef])) # extra rows to add to cmap
    cmap <- matrix(0L, length(Xcol) + ph.rows - hasintercept, ncol(tmap))
    uterm <- unique(Xassign[Xassign != 0L])  # terms that will have coefficients

    xcount <- table(factor(Xassign, levels=1:max(Xassign)))
    mult <- 1L + max(xcount)  # temporary scaling

    ii <- 0
    for (i in uterm) {
        k <- seq_len(xcount[i])
        for (j in 1:ncol(tmap))
            cmap[ii+k, j] <- if (tmap[i+1,j]==0) 0L else tmap[i+1,j]*mult + k
        ii <- ii + max(k)
    }

    if (ph.rows > 0) {
        temp <- phbaseline[ph.coef]  # where each points
        for (i in unique(temp)) {
            # for each baseline that forms a reference
            j <- which(phbaseline == i) # the others that are proportional to it
            k <- seq_len(length(j))
            ii <- ii + 1                  # row of cmap for this baseline
            cmap[ii, j] <- max(cmap) + k  # fill in elements
        }
        newname <- paste0("ph(", colnames(tmap)[unique(temp)], ")")
    } else newname <- NULL

    # renumber coefs as 1, 2, 3, ...
    cmap[,] <- match(cmap, sort(unique(c(0L, cmap)))) - 1L

    colnames(cmap) <- colnames(tmap)
    if (hasintercept) rownames(cmap) <- c(Xcol[-1], newname)
    else rownames(cmap) <- c(Xcol, newname)

    # nonzero <- colSums(cmap) > 0 # there is at least one covariate
    # if (!all(nonzero)) cmap <- cmap[, nonzero, drop=FALSE]
    cmap
}
@
\section{Person years}
The person years routine and the expected survival code are the
two parts of the survival package that make use of external
rate tables, of which the United States mortality tables \code{survexp.us}
and \code{survexp.usr} are examples contained in the package.
The arguments for pyears are
\begin{description}
\item[formula] The model formula. The right hand side consists of grouping
variables and is treated essentially as in [[survfit]]; the result of the
model will be a table of results with dimensions determined by the
right hand variables. The formula can include an optional [[ratetable]]
directive, but this style has been superseded by the [[rmap]] argument.
\item [data, weights, subset, na.action] as usual
\item[rmap] an optional mapping for rate table variables, see more below.
\item[ratetable] the population rate table to use as a reference. This can
either be a ratetable object or a previously fitted Cox model
\item[scale] Scale the resulting output times, e.g., 365.25 to turn days into
years.
\item[expect] Should the output table include the expected number of
events, or the expected number of person-years of observation?
\item[model, x, y] as usual
\item[data.frame] if true the result is returned as a data frame, if false
as a set of tables.
\end{description}
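A hypothetical call, with invented variable names; the tcut variable lets
a subject contribute person-years to successive age groups as they age
through them:
\begin{verbatim}
mydata$agegrp <- tcut(mydata$age*365.25, 365.25*c(40, 50, 60, 70),
                      labels=c("40-50", "50-60", "60-70"))
pfit <- pyears(Surv(futime, status) ~ agegrp + sex, data=mydata,
               rmap=list(age=age*365.25, year=entry.dt),
               ratetable=survival::survexp.us, scale=365.25)
pfit$event / pfit$expected   # observed/expected events in each cell
\end{verbatim}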
<<pyears>>=
pyears <- function(formula, data,
weights, subset, na.action, rmap,
ratetable, scale=365.25, expect=c('event', 'pyears'),
model=FALSE, x=FALSE, y=FALSE, data.frame=FALSE) {
<<pyears-setup>>
<<pyears-compute>>
<<pyears-finish>>
}
@
Start out with the standard model processing, which involves making a copy
of the input call, but keeping only the arguments we want.
We then process the special argument [[rmap]]. This is discussed in the
section on the [[survexp]] function so we need not repeat the
explanation here.
<<pyears-setup>>=
expect <- match.arg(expect)
Call <- match.call()
# create a call to model.frame() that contains the formula (required)
# and any other of the relevant optional arguments
# then evaluate it in the proper frame
indx <- match(c("formula", "data", "weights", "subset", "na.action"),
names(Call), nomatch=0)
if (indx[1] ==0) stop("A formula argument is required")
tform <- Call[c(1,indx)] # only keep the arguments we wanted
tform[[1L]] <- quote(stats::model.frame) # change the function called
Terms <- if(missing(data)) terms(formula)
else terms(formula, data=data)
if (any(attr(Terms, 'order') >1))
stop("Pyears cannot have interaction terms")
if (!missing(rmap) || !missing(ratetable)) {
has.ratetable <- TRUE
if (missing(ratetable)) stop("No rate table specified")
<<survexp-setup-rmap>>
}
else has.ratetable <- FALSE
mf <- eval(tform, parent.frame())
Y <- model.extract(mf, 'response')
if (is.null(Y)) stop ("Follow-up time must appear in the formula")
if (!is.Surv(Y)){
if (any(Y <0)) stop ("Negative follow up time")
Y <- as.matrix(Y)
if (ncol(Y) >2) stop("Y has too many columns")
}
else {
stype <- attr(Y, 'type')
if (stype == 'right') {
if (any(Y[,1] <0)) stop("Negative survival time")
nzero <- sum(Y[,1]==0 & Y[,2] ==1)
if (nzero >0)
warning(paste(nzero,
"observations with an event and 0 follow-up time,",
"any rate calculations are statistically questionable"))
}
else if (stype != 'counting')
stop("Only right-censored and counting process survival types are supported")
}
n <- nrow(Y)
if (is.null(n) || n==0) stop("Data set has 0 observations")
weights <- model.extract(mf, 'weights')
if (is.null(weights)) weights <- rep(1.0, n)
@
The next step is to check out the ratetable.
For a population rate table a set of consistency checks is done by the
[[match.ratetable]] function, giving a set of sanitized indices [[R]].
This function wants characters turned to factors.
For a Cox model [[R]] will be a model matrix whose covariates are coded
in exactly the same way that variables were coded in the original
Cox model. We call the model.matrix.coxph function so as not to have to
repeat the steps found there (remove cluster statements, etc).
<<pyears-setup>>=
# rdata contains the variables matching the ratetable
if (has.ratetable) {
rdata <- data.frame(eval(rcall, mf), stringsAsFactors=TRUE)
if (is.ratetable(ratetable)) {
israte <- TRUE
rtemp <- match.ratetable(rdata, ratetable)
R <- rtemp$R
}
else if (inherits(ratetable, 'coxph') && !inherits(ratetable, "coxphms")) {
israte <- FALSE
Terms <- ratetable$terms
if (!is.null(attr(Terms, 'offset')))
stop("Cannot deal with models that contain an offset")
strats <- attr(Terms, "specials")$strata
if (length(strats))
stop("pyears cannot handle stratified Cox models")
R <- model.matrix.coxph(ratetable, data=rdata)
}
else stop("Invalid ratetable")
}
@
Now we process the non-ratetable variables.
Those of class [[tcut]] set up time-dependent categories. For
these the cutpoints attribute sets the intervals: if there
were 4 cutpoints of 1, 5, 6, and 10, the 3 intervals will be 1--5,
5--6 and 6--10, and odims will be 3.
All other variables are treated as factors.
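A small sketch of such a variable:
\begin{verbatim}
temp <- tcut(c(2, 5.5, 8), c(1, 5, 6, 10))
attr(temp, 'cutpoints')   # 1 5 6 10
attr(temp, 'labels')      # the 3 interval labels
\end{verbatim}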
<<pyears-setup>>=
ovars <- attr(Terms, 'term.labels')
if (length(ovars)==0) {
# no categories!
X <- rep(1,n)
ofac <- odim <- odims <- ocut <- 1
}
else {
odim <- length(ovars)
ocut <- NULL
odims <- ofac <- double(odim)
X <- matrix(0, n, odim)
outdname <- vector("list", odim)
names(outdname) <- attr(Terms, 'term.labels')
for (i in 1:odim) {
temp <- mf[[ovars[i]]]
if (inherits(temp, 'tcut')) {
X[,i] <- temp
temp2 <- attr(temp, 'cutpoints')
odims[i] <- length(temp2) -1
ocut <- c(ocut, temp2)
ofac[i] <- 0
outdname[[i]] <- attr(temp, 'labels')
}
else {
temp2 <- as.factor(temp)
X[,i] <- temp2
temp3 <- levels(temp2)
odims[i] <- length(temp3)
ofac[i] <- 1
outdname[[i]] <- temp3
}
}
}
@
Now do the computations.
The code above has separated out the variables into 3 groups:
\begin{itemize}
\item The variables in the rate table. These determine where we
\emph{start} in the rate table with respect to retrieving the relevant
death rates. For the US table [[survexp.us]] this will be the date of
study entry, age (in days) at study entry, and sex of each subject.
\item The variables on the right hand side of the model. These are
interpreted almost identically to a call to [[table]], with special
treatment for those of class \emph{tcut}.
\item The response variable, which tells the number of days of follow-up
and optionally the status at the end of follow-up.
\end{itemize}
Start with the rate table variables.
There is an oddity about US rate tables: the entry for age (year=1970,
age=55) contains the daily rate for anyone who turns 55 in that year,
from their birthday forward for 365 days. So if your birthday is on
Oct 2, the 1970 table applies from 2 Oct 1970 to 1 Oct 1971. The
underlying C code wants to make the 1970 rate table apply from 1 Jan
1970 to 31 Dec 1970. The easiest way to finesse this is to fudge
everyone's enter-the-study date. If you were born in March but
entered in April, make it look like you entered in February; that way
you get the first 11 months at the entry year's rates, etc. The birth
date is entry date - age in days (based on 1/1/1970).
The other aspect of the rate tables is that ``older style'' tables, those that
have the factor attribute, contained only decennial data which the C code would
interpolate on the fly. The value of [[atts$factor]] was 10 indicating that
there are 10 years in the interpolation interval. The newer tables do not
do this and the C code is passed a 0/1 for continuous (age and year) versus
discrete (sex, race).
<<pyears-compute>>=
ocut <-c(ocut,0) #just in case it were of length 0
osize <- prod(odims)
if (has.ratetable) { #include expected
atts <- attributes(ratetable)
datecheck <- function(x)
inherits(x, c("Date", "POSIXt", "date", "chron"))
cuts <- lapply(attr(ratetable, "cutpoints"), function(x)
if (!is.null(x) & datecheck(x)) ratetableDate(x) else x)
if (is.null(atts$type)) {
# old style table
rfac <- atts$factor
us.special <- (rfac >1)
}
else {
rfac <- 1*(atts$type ==1)
us.special <- (atts$type==4)
}
if (any(us.special)) { #special handling for US pop tables
if (sum(us.special) > 1) stop("more than one type=4 in a rate table")
# Someone born in June of 1945, say, gets the 1945 US rate until their
# next birthday. But the underlying logic of the code would change
# them to the 1946 rate on 1/1/1946, which is the cutpoint in the
# rate table. We fudge by faking their enrollment date back to their
# birth date.
#
# The cutpoint for year has been converted to days since 1/1/1970 by
# the ratetableDate function. (Date objects in R didn't exist when
# rate tables were conceived.)
if (is.null(atts$dimid)) dimid <- names(atts$dimnames)
else dimid <- atts$dimid
cols <- match(c("age", "year"), dimid)
if (any(is.na(cols)))
stop("ratetable does not have expected shape")
# The format command works for Dates, use it to get an offset
bdate <- as.Date("1970-01-01") + (R[,cols[2]] - R[,cols[1]])
byear <- format(bdate, "%Y")
offset <- as.numeric(bdate - as.Date(paste0(byear, "-01-01")))
R[,cols[2]] <- R[,cols[2]] - offset
# Doctor up "cutpoints" - only needed for (very) old style rate tables
# for which the C code does interpolation on the fly
if (any(rfac >1)) {
temp <- which(us.special)
nyear <- length(cuts[[temp]])
nint <- rfac[temp] #intervals to interpolate over
cuts[[temp]] <- round(approx(nint*(1:nyear), cuts[[temp]],
nint:(nint*nyear))$y - .0001)
}
}
docount <- is.Surv(Y)
temp <- .C(Cpyears1,
as.integer(n),
as.integer(ncol(Y)),
as.integer(is.Surv(Y)),
as.double(Y),
as.double(weights),
as.integer(length(atts$dim)),
as.integer(rfac),
as.integer(atts$dim),
as.double(unlist(cuts)),
as.double(ratetable),
as.double(R),
as.integer(odim),
as.integer(ofac),
as.integer(odims),
as.double(ocut),
as.integer(expect=='event'),
as.double(X),
pyears=double(osize),
pn =double(osize),
pcount=double(if(docount) osize else 1),
pexpect=double(osize),
offtable=double(1))[18:22]
}
else { #no expected
docount <- as.integer(ncol(Y) >1)
temp <- .C(Cpyears2,
as.integer(n),
as.integer(ncol(Y)),
as.integer(docount),
as.double(Y),
as.double(weights),
as.integer(odim),
as.integer(ofac),
as.integer(odims),
as.double(ocut),
as.double(X),
pyears=double(osize),
pn =double(osize),
pcount=double(if (docount) osize else 1),
offtable=double(1)) [11:14]
}
@
Create the output object.
<<pyears-finish>>=
has.tcut <- any(sapply(mf, function(x) inherits(x, 'tcut')))
if (data.frame) {
# Create a data frame as the output, rather than a set of
# rate tables
if (length(ovars) ==0) { # no variables on the right hand side
keep <- TRUE
df <- data.frame(pyears= temp$pyears/scale,
n = temp$n)
}
else {
keep <- (temp$pyears >0) # what rows to keep in the output
# grab prototype rows from the model frame, this preserves class
# (unless it is a tcut variable, then we know what to do)
tdata <- lapply(1:length(ovars), function(i) {
temp <- mf[[ovars[i]]]
if (inherits(temp, "tcut")) { #if levels are numeric, return numeric
if (is.numeric(outdname[[i]])) outdname[[i]]
else factor(outdname[[i]], outdname[[i]]) # else factor
}
else temp[match(outdname[[i]], temp)]
})
tdata$stringsAsFactors <- FALSE # argument for expand.grid
df <- do.call("expand.grid", tdata)[keep,,drop=FALSE]
names(df) <- ovars
df$pyears <- temp$pyears[keep]/scale
df$n <- temp$pn[keep]
}
row.names(df) <- NULL # toss useless 'creation history'
if (has.ratetable) df$expected <- temp$pexpect[keep]
if (expect=='pyears') df$expected <- df$expected/scale
if (docount) df$event <- temp$pcount[keep]
# if any of the predictors were factors, make them factors in the output
for (i in seq_along(ovars)) {
if (is.factor( mf[[ovars[i]]]))
df[[ovars[i]]] <- factor(df[[ovars[i]]], levels( mf[[ovars[i]]]))
}
out <- list(call=Call,
data= df, offtable=temp$offtable/scale,
tcut=has.tcut)
if (has.ratetable && !is.null(rtemp$summ))
out$summary <- rtemp$summ
}
else if (prod(odims) ==1) { #don't make it an array
out <- list(call=Call, pyears=temp$pyears/scale, n=temp$pn,
offtable=temp$offtable/scale, tcut = has.tcut)
if (has.ratetable) {
out$expected <- temp$pexpect
if (expect=='pyears') out$expected <- out$expected/scale
if (!is.null(rtemp$summ)) out$summary <- rtemp$summ
}
if (docount) out$event <- temp$pcount
}
else {
out <- list(call = Call,
pyears= array(temp$pyears/scale, dim=odims, dimnames=outdname),
n = array(temp$pn, dim=odims, dimnames=outdname),
offtable = temp$offtable/scale, tcut=has.tcut)
if (has.ratetable) {
out$expected <- array(temp$pexpect, dim=odims, dimnames=outdname)
if (expect=='pyears') out$expected <- out$expected/scale
if (!is.null(rtemp$summ)) out$summary <- rtemp$summ
}
if (docount)
out$event <- array(temp$pcount, dim=odims, dimnames=outdname)
}
out$observations <- nrow(mf)
out$terms <- Terms
na.action <- attr(mf, "na.action")
if (length(na.action)) out$na.action <- na.action
if (model) out$model <- mf
else {
if (x) out$x <- X
if (y) out$y <- Y
}
class(out) <- 'pyears'
out
@
\subsection{Print and summary}
The print function for pyears gives a very abbreviated
printout: just a few lines.
It works with pyears objects with or without a data component.
<<print.pyears>>=
print.pyears <- function(x, ...) {
if (!is.null(cl<- x$call)) {
cat("Call:\n")
dput(cl)
cat("\n")
}
if (is.null(x$data)) {
if (!is.null(x$event))
cat("Total number of events:", format(sum(x$event)), "\n")
cat ( "Total number of person-years tabulated:",
format(sum(x$pyears)),
"\nTotal number of person-years off table:",
format(x$offtable), "\n")
}
else {
if (!is.null(x$data$event))
cat("Total number of events:", format(sum(x$data$event)), "\n")
cat ( "Total number of person-years tabulated:",
format(sum(x$data$pyears)),
"\nTotal number of person-years off table:",
format(x$offtable), "\n")
}
if (!is.null(x$summary)) {
cat("Matches to the chosen rate table:\n ",
x$summary)
}
cat("Observations in the data set:", x$observations, "\n")
if (!is.null(x$na.action))
cat(" (", naprint(x$na.action), ")\n", sep='')
cat("\n")
invisible(x)
}
@
The summary function attempts to create output that looks like a
pandoc table, which in turn makes it mesh nicely with Rstudio.
Pandoc has 4 types of tables: with and without vertical bars and
with single or multiple rows per cell.
If the pyears object has only a single dimension then our output will
be a simple table with a row or column for each of the output
types (see the vertical argument).
The result will be a simple table or a ``pipe'' table depending on the
vline argument.
For two or more dimensions the output follows the usual R strategy for printing
an array, but with each ``cell'' containing all of the summaries for that
combination of predictors, thus giving
either a ``multiline'' or ``grid'' table.
The default values of no vertical lines makes the tables
appropriate for non-pandoc output such as a terminal session.
<<print.pyears>>=
summary.pyears <- function(object, header=TRUE, call=header,
n= TRUE, event=TRUE, pyears=TRUE,
expected = TRUE, rate = FALSE, rr = expected,
ci.r = FALSE, ci.rr = FALSE, totals=FALSE,
legend=TRUE, vline = FALSE, vertical = TRUE,
nastring=".", conf.level=0.95,
scale= 1, ...) {
# Usual checks
if (!inherits(object, "pyears"))
stop("input must be a pyears object")
temp <- c(is.logical(header), is.logical(call), is.logical(n),
is.logical(event) , is.logical(pyears), is.logical(expected),
is.logical(rate), is.logical(ci.r), is.logical(rr),
is.logical(ci.rr), is.logical(vline), is.logical(vertical),
is.logical(legend), is.logical(totals))
tname <- c("header", "call", "n", "event", "pyears", "expected",
"rate", "ci.r", "rr", "ci.rr", "vline", "vertical",
"legend", "totals")
if (any(!temp) || length(temp) != 14 || any(is.na(temp))) {
stop("the ", paste(tname[!temp], collapse=", "),
" argument(s) must be single logical values")
}
if (!is.numeric(conf.level) || length(conf.level) != 1 ||
is.na(conf.level) || conf.level <= 0 || conf.level >= 1)
stop("conf.level must be a single numeric between 0 and 1")
if (!is.numeric(scale) || length(scale) != 1 || is.na(scale) || scale <= 0)
stop("scale must be a single value > 0")
vname <- attr(terms(object), "term.labels") #variable names
if (!is.null(object$data)) {
# Extra work: restore the tables which had been unpacked into a df
# All of the categories are factors in this case
tdata <- object$data[vname] # the conditioning variables
dname <- lapply(tdata, function(x) {
if (is.factor(x)) levels(x) else sort(unique(x))}) # dimnames
dd <- sapply(dname, length) # dim of arrays
index <- tapply(tdata[,1], tdata)
restore <- c('n', 'event', 'pyears', 'expected') #do these, if present
restore <- restore[restore %in% names(object$data)]
new <- lapply(object$data[restore],
function(x) {
temp <- array(0L, dim=dd, dimnames=dname)
temp[index] <- x
temp} )
object <- c(object, new)
}
if (is.null(object$expected)) {
expected <- FALSE
rr <- FALSE
ci.rr <- FALSE
}
if (is.null(object$event)) {
event <- FALSE
rate <- FALSE
ci.r <- FALSE
rr <- FALSE
ci.rr <- FALSE
}
# print out the front matter
if (call && !is.null(object$call)) {
cat("Call: ")
dput(object$call)
cat("\n")
}
if (header) {
cat("number of observations =", object$observations)
if (length(object$na.action))  # pyears stores its na.action under this name
cat(" (", naprint(object$na.action), ")\n", sep="")
else cat("\n")
if (object$offtable > 0)
cat(" Total time lost (off table)", format(object$offtable), "\n")
cat("\n")
}
# Add in totals if requested
if (totals) {
# if the pyear object was based on any time dependent cuts, then
# the "n" component cannot be totaled up.
tcut <- if (is.null(object$tcut)) TRUE else object$tcut
object$n <- pytot(object$n, na=tcut)
object$pyears <- pytot(object$pyears)
if (event) object$event <- pytot(object$event)
if (expected) object$expected <- pytot(object$expected)
}
dd <- dim(object$n)
<<pyears-list>>
if (length(dd) ==1) {
# 1 dimensional table
<<pyears-table1>>
} else {
# more than 1 dimension
<<pyears-table2>>
}
invisible(object)
}
<<pyears-charfun>>
@
<<pyears-list>>=
# Put the elements to be printed onto a list
pname <- (tname[3:6])[c(n, event, pyears, expected)]
plist <- object[pname]
if (rate) {
pname <- c(pname, "rate")
plist$r <- scale* object$event/object$pyears
}
if (ci.r) {
pname <- c(pname, "ci.r")
plist$ci.r <- cipoisson(object$event, object$pyears, p=conf.level) *scale
}
if (rr) {
pname <- c(pname, "rr")
plist$rr <- object$event/object$expected
}
if (ci.rr) {
pname <- c(pname, "ci.rr")
plist$ci.rr <- cipoisson(object$event, object$expected, p=conf.level)
}
rname <- c(n = "N", event="Events",
pyears= "Time", expected= "Expected events",
rate = "Event rate", ci.r = "CI (rate)",
rr= "Obs/Exp", ci.rr= "CI (O/E)")
rname <- rname[pname]
@
If there is only one dimension to the table we can forgo the top legend
and use the object names as one of the margins.
If \code{vertical=TRUE} the output types are vertical, otherwise they
are horizontal. Format each element of the output separately.
<<pyears-table1>>=
cname <- names(object$n) #category names
if (vertical) {
# The person-years objects list across the top, categories up and down
# This makes columns line up in a standard "R" way
# The first column label is the variable name, content is the categories
plist <- lapply(plist, pformat, nastring, ...) # make it character
pcol <- sapply(plist, function(x) nchar(x[1])) #width of each one
colwidth <- pmax(pcol, nchar(rname)) +2
for (i in 1:length(plist))
plist[[i]] <- strpad(plist[[i]], colwidth[i])
colwidth <- c(max(nchar(vname), nchar(cname)) +2, colwidth)
leftcol <- list(strpad(cname, colwidth[1]))
header <- strpad(c(vname, rname), colwidth)
}
else {
# in this case each column will have different types of objects in it
# alignment is the nuisance
newmat <- pybox(plist, length(plist[[1]]), nastring, ...)
colwidth <- pmax(nchar(cname), apply(nchar(newmat), 1, max)) +2
# turn the list sideways
plist <- split(newmat, row(newmat))
for (i in 1:length(plist))
plist[[i]] <- strpad(plist[[i]], colwidth[i])
colwidth <- c(max(nchar(vname), nchar(rname)) +2, colwidth)
leftcol <- list(strpad(rname, colwidth[1]))
header <- strpad(c(vname, cname), colwidth)
}
# Now print it
if (vline) { # use a pipe table
cat(paste(header, collapse = "|"), "\n")
cat(paste(strpad("-", colwidth, "-"), collapse="|"), "\n")
temp <- do.call("paste", c(leftcol, plist, list(sep ="|")))
cat(temp, sep= '\n')
}
else {
cat(paste(header, collapse = " "), "\n")
cat(paste(strpad("-", colwidth, "-"), collapse=" "), "\n")
temp <- do.call("paste", c(leftcol, plist, list(sep =" ")))
cat(temp, sep='\n')
}
@
When the pyears object has more than one dimension we use a special
layout: each `cell' of the printed table contains all of the requested
statistics.
<<pyears-table2>>=
if (header) {
# the header is itself a table
width <- max(nchar(rname))
if (vline) {
cat('+', strpad('-', width, '-'), "+\n", sep="")
cat(paste0('|',strpad(rname, width), '|'), sep='\n')
cat('+', strpad('-', width, '-'), "+\n\n", sep="")
} else {
cat(strpad('-', width, '-'), "\n")
cat(strpad(rname, width), sep='\n')
cat(strpad('-', width, '-'), "\n\n")
}
}
tname <- vname[1:2] #names for the row and col
rowname <- dimnames(object$n)[[1]]
colname <- dimnames(object$n)[[2]]
if (length(dd) > 2)
newmat <- pybox(plist, c(dd[1],dd[2], prod(dd[-(1:2)])),
nastring, ...)
else newmat <- pybox(plist, dd, nastring, ...)
if (length(dd) > 2) {
newmat <- pybox(plist, c(dd[1],dd[2], prod(dd[-(1:2)])),
nastring, ...)
outer.label <- do.call("expand.grid", dimnames(object$n)[-(1:2)])
temp <- names(outer.label)
for (i in 1:nrow(outer.label)) {
# first the caption, then data
cat(paste(":", paste(temp, outer.label[i,], sep="=")), '\n')
pyshow(newmat[,,i,], tname, rowname, colname, vline)
}
}
else {
newmat <- pybox(plist, dd, nastring, ...)
pyshow(newmat, tname, rowname, colname, vline)
}
@
Here are some character manipulation functions. The stringi package has
more elegant versions of the pad function, but we don't need the speed.
No one is going to print out thousands of lines.
<<pyears-charfun>>=
strpad <- function(x, width, pad=' ') {
# x = the string(s) to be padded out
# width = width of desired string.
nc <- nchar(x)
added <- width - nc
left <- pmax(0, floor(added/2)) # can't add negative space
right <- pmax(0, width - (nc + left)) # right will be >= left
if (all(right <=0)) {
if (length(x) >= length(width)) x # nothing needs to be done
else rep(x, length=length(width))
}
else {
# Each pad could be a different length.
# Make a long string from which we can take a portion
longpad <- paste(rep(pad, max(right)), collapse='')
paste0(substring(longpad, 1, left), x, substring(longpad,1, right))
}
}
pformat <- function(x, nastring, ...) {
# This is only called for single index tables, in vertical mode
# Any matrix will be a confidence interval
if (is.matrix(x))
ret <- paste(ifelse(is.na(x[,1]), nastring,
format(x[,1], ...)), "-",
ifelse(is.na(x[,2]), nastring,
format(x[,2], ...)))
else ret <- ifelse(is.na(x), nastring, format(x, ...))
ret
}
@
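A quick illustration of the centering behavior (illustration only, not part
of the tangled source):
<<example-strpad>>=
strpad(c("a", "bb"), 6)   # "  a   "  "  bb  "
strpad("-", 5, pad="-")   # "-----", used to build the table rules
@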
Create formatted boxes. We want all the decimal points to line up,
so the format calls are in 3 parts: integer, real, and confidence interval.
If there are confidence intervals, format their values and then paste
together the left-right ends.
The intermediate form \code{final} is a matrix with one column per statistic.
At the end, reformat it as an array whose last dimension is the components.
<<pyears-charfun>>=
pybox <- function(plist, dd, nastring, ...) {
ci <- (substring(names(plist), 1,3) == "ci.") # the CI components
int <- sapply(plist, function(x) all(x == floor(x) | is.na(x)))
int <- (!ci & int)
real<- (!ci & !int)
nc <- prod(dd)
final <- matrix("", nrow=nc, ncol=length(ci))
if (any(int)) { # integers
if (any(sapply(plist[int], length) != nc))
stop("programming length error, notify package author")
temp <- unlist(plist[int])
final[,int] <- ifelse(is.na(temp), nastring, format(temp))
}
if (any(real)) { # floating point
if (any(sapply(plist[real], length) != nc))
stop("programming length error, notify package author")
temp <- unlist(plist[real])
final[,real] <- ifelse(is.na(temp), nastring,
format(temp, ...))
}
if (any(ci)) {
if (any(sapply(plist[ci], length) != nc*2))
stop("programming length error, notify package author")
temp <- unlist(plist[ci])
temp <- array(ifelse(is.na(temp), nastring,
format(temp, ...)),
dim=c(nc, 2, sum(ci)))
final[,ci] <- paste(temp[,1,], temp[,2,], sep='-')
}
array(final, dim=c(dd, length(ci)))
}
@
This function prints out a box table. Each cell contains the full set of
statistics that were requested. Most of the work is the creation of
the appropriate spacing and special characters to create a valid
pandoc table.
<<pyears-charfun>>=
pyshow <- function(dmat, labels, rowname, colname, vline) {
# Every column is the same width, except the first
colwidth <- c(max(nchar(rowname), nchar(labels[1])),
rep(max(nchar(dmat[1,1,]), nchar(colname)), length(colname)))
colwidth[2] <- max(colwidth[2], nchar(labels[2]))
ncol <- length(colwidth)
dd <- dim(dmat) # vector of length 3, third dim is the statistics
rline <- ceiling(dd[3]/2) #which line to put the row label on.
if (vline) { # use a grid table
cat("+", paste(strpad('-', colwidth, pad='-'), collapse='+'), "+\n",
sep='')
temp <- rep(' ', ncol); temp[2] <- labels[2]
cat("|", paste(strpad(temp, colwidth), collapse="|"), "|\n",
sep='')
cat("|", paste(strpad(c(labels[1], colname), colwidth), collapse="|"),
"|\n", sep='')
cat("+", paste(strpad('=', colwidth, pad='='), collapse="+"), "+\n",
sep='')
for (i in 1:dd[1]) {
for (j in 1:dd[3]) { #one printout line per stat
if (j==rline) temp <- c(rowname[i], dmat[i,,j])
else temp <- c("", dmat[i,,j])
cat("|", paste(strpad(temp, colwidth), collapse='|'), "|\n",
sep='')
}
cat("+", paste(strpad('-', colwidth, '-'), collapse='+'), "+\n",
sep='')
}
}
else { # use a multiline table
cat(paste(strpad('-', colwidth, '-'), collapse='-'), "\n")
temp <- rep(' ', ncol); temp[2] <- labels[2]
cat(paste(strpad(temp, colwidth), collapse=" "), "\n")
cat(paste(strpad(c(labels[1], colname), colwidth), collapse=" "),
"\n")
cat(paste(strpad('-', colwidth, pad='-'), collapse=" "), "\n")
for (i in 1:dd[1]) {
for (j in 1:dd[3]) { #one printout line per stat
if (j==rline) temp <- c(rowname[i], dmat[i,,j])
else temp <- c("", dmat[i,,j])
cat(paste(strpad(temp, colwidth), collapse=' '), "\n")
}
if (i< dd[1]) cat(" \n") #blank line
}
cat(paste(strpad('-', colwidth, '-'), collapse='-'), "\n")
}
}
@
This function adds a totals row to the data, for either the first
or first and second dimensions.
The ``n'' component can't be totaled, so we turn that into NA.
<<pyears-charfun>>=
pytot <- function(x, na=FALSE) {
dd <- dim(x)
if (length(dd) ==1) {
if (na) array(c(x, NA), dim= length(x) +1,
dimnames=list(c(dimnames(x)[[1]], "Total")))
else array(c(x, sum(x)), dim= length(x) +1,
dimnames=list(c(dimnames(x)[[1]], "Total")))
}
else if (length(dd) ==2) {
if (na) new <- rbind(cbind(x, NA), NA)
else {
new <- rbind(x, colSums(x))
new <- cbind(new, rowSums(new))
}
array(new, dim=dim(x) + c(1,1),
dimnames=list(c(dimnames(x)[[1]], "Total"),
c(dimnames(x)[[2]], "Total")))
}
else {
# The general case
index <- 1:length(dd)
if (na) sum1 <- sum2 <- sum3 <- NA
else {
sum1 <- apply(x, index[-1], sum) # row sums
sum2 <- apply(x, index[-2], sum) # col sums
sum3 <- apply(x, index[-(1:2)], sum) # total sums
}
# create a new matrix and then fill it in
d2 <- dd
d2[1:2] <- dd[1:2] +1
dname <- dimnames(x)
dname[[1]] <- c(dname[[1]], "Total")
dname[[2]] <- c(dname[[2]], "Total")
new <- array(x[1], dim=d2, dimnames=dname)
# say dim(x) =(5,8,4); we want new[6,-9,] <- sum1; new[-6,9,] <- sum2
# and new[6,9,] <- sum3
# if dim is longer, we need to add more commas
commas <- rep(',', length(dd) -2)
eval(parse(text=paste("new[1:dd[1], 1:dd[2]", commas, "] <- x")))
eval(parse(text=paste("new[ d2[1],-d2[2]", commas, "] <- sum1")))
eval(parse(text=paste("new[-d2[1], d2[2]", commas, "] <- sum2")))
eval(parse(text=paste("new[ d2[1], d2[2]", commas, "] <- sum3")))
new
}
}
@
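For example (illustration only, not tangled into the package):
<<example-pytot>>=
x <- array(1:6, dim=c(2,3),
           dimnames=list(sex=c("f","m"), group=c("a","b","c")))
pytot(x)           # appends a "Total" row and column
pytot(x, na=TRUE)  # margins are NA, as used for the "n" component
@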
\section{Residuals for survival curves}
\subsection{R-code}
For all the more complex cases, the variance of a survival curve is based on
the infinitesimal jackknife:
$$
D_i(t) = \frac{\partial S(t)}{\partial w_i}
$$
evaluated at the observed vector of weights. The variance at a given
time is then $D'WD$ where $W$ is a diagonal matrix of the case weights.
When there are multiple states $S$ is replaced by the vector $p(t)$, with
one element per state, and the formula gets a bit more complex.
The predicted curve from a Cox model is the most complex case.
Realizing that we need to return the matrix $D$ to the user, in order to compute
the variance of derived quantities like the restricted mean time in state,
the code has been changed from a primarily internal focus (compute within the
survfit routine) to an external one.
The underlying C code is very similar to that in survfitkm.c.
One major difference in the routines is that this code is designed to return
values at a fixed set of time points; it is an error if the user does not
provide them. This allows the result to be presented as a matrix or array.
Computational differences will be discussed later.
The method argument is for debugging; for multi-state curves it selects
either the C code or the alternate R implementation.
The double call below is because we want residuals to return a simple matrix,
but the pseudo function needs to get back a little bit more.
<<residuals.survfit>>=
# residuals for a survfit object
residuals.survfit <- function(object, times,
type= "pstate",
collapse, weighted=FALSE, method=1, ...){
if (!inherits(object, "survfit"))
stop("argument must be a survfit object")
if (object$type=="interval") {
# trial code to support it
# reconstruct the data set
# create dummy time/status for all interval or left censored
# over the span of jump points in S, non-censored obs with
# weights proportional to the jumps
# combine dummy + (exact, right) from original, compute KM
# get pseudo for this new KM
# collapse dummy obs back to a single
stop("residuals for interval-censored data are not available")
}
if (!is.null(object$oldstates))
stop("residuals not available for a subscripted survfit object")
if (missing(times)) stop("the times argument is required")
# allow a set of aliases
temp <- c("pstate", "cumhaz", "sojourn", "survival",
"chaz", "rmst", "rmts", "auc")
type <- match.arg(casefold(type), temp)
itemp <- c(1,2,3,1,2,3,3,3)[match(type, temp)]
type <- c("pstate", "cumhaz", "auc")[itemp]
if (missing(collapse))
fit <- survresid.fit(object, times, type, weighted=weighted,
method= method)
else fit <- survresid.fit(object, times, type, collapse= collapse,
weighted= weighted, method= method)
fit$residuals
}
survresid.fit <- function(object, times,
type= "pstate",
collapse, weighted=FALSE, method=1) {
if (object$type=="interval") stop("interval censored not yet supported")
survfitms <- inherits(object, "survfitms")
coxsurv <- inherits(object, "survfitcox") # should never be true, as there
# is a residuals.survfitcox
timefix <- (is.null(object$timefix) || object$timefix)
start.time <- object$start.time
if (is.null(start.time)) start.time <- min(c(0, object$time))
# check input arguments
if (missing(times))
stop ("the times argument is required")
else {
if (!is.numeric(times)) stop("times must be a numeric vector")
times <- sort(unique(times))
if (timefix) times <- aeqSurv(Surv(times))[,1]
}
# get the data
<<rsurvfit-data>>
if (missing(collapse)) collapse <- (!(is.null(id)) && any(duplicated(id)))
if (collapse && is.null(id)) stop("collapse argument requires an id or cluster argument in the survfit call")
ny <- ncol(newY)
if (collapse && any(X != X[1])) {
# If the same id shows up in multiple curves, we just can't deal
# with it.
temp <- unlist(lapply(split(id, X), unique))
if (any(duplicated(temp)))
stop("same id appears in multiple curves, cannot collapse")
}
timelab <- signif(times, 3) # used for dimnames
# What type of survival curve?
stype <- Call$stype
if (is.null(stype)) stype <- 1
ctype <- Call$ctype
if (is.null(ctype)) ctype <- 1
if (!survfitms) {
resid <- rsurvpart1(newY, X, casewt, times,
type, stype, ctype, object)
if (collapse) {
resid <- rowsum(resid, id, reorder=FALSE)
dimnames(resid) <- list(id= unique(id), times=timelab)
curve <- (as.integer(X))[!duplicated(id)] #which curve for each
}
else {
if (length(id) >0) dimnames(resid) <- list(id=id, times=timelab)
curve <- as.integer(X)
}
}
else { # multi-state
if (!collapse) {
if (length(id) > 0) d1name <- id else d1name <- NULL
cluster <- d1name
curve <- as.integer(X)
}
else {
d1name <- unique(id)
cluster <- match(id, d1name)
curve <- (as.integer(X))[!duplicated(id)]
}
resid <- rsurvpart2(newY, X, casewt, istate, times, cluster,
type, object, method=method, collapse=collapse)
if (type == "cumhaz") {
ntemp <- colnames(object$cumhaz)
if (length(dim(resid)) ==3)
dimnames(resid) <- list(id=d1name, times=timelab,
cumhaz= ntemp)
else dimnames(resid) <- list(id=d1name, cumhaz=ntemp)
}
else {
ntemp <- object$states
if (length(dim(resid)) ==3)
dimnames(resid) <- list(id=d1name, times=timelab,
state= ntemp)
else dimnames(resid) <- list(id=d1name, state= ntemp)
}
}
if (weighted && any(casewt !=1)) resid <- resid*casewt
list(residuals= resid, curve= curve, id= id, idname=idname)
}
@
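As a usage sketch (assuming the functions above have been loaded, along with
the survival package and its aml data set), the returned matrix leads
directly to the IJ variance of the curve:
<<example-survfitresid>>=
library(survival)
fit <- survfit(Surv(time, status) ~ 1, data = aml)
D <- residuals(fit, times = c(12, 24))   # one row per subject
sqrt(colSums(D^2))              # IJ standard errors of S(12) and S(24)
summary(fit, times = c(12, 24))$std.err  # Greenwood values, for comparison;
                                         # close but not identical
@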
The first part of the work is to retrieve the data set. This is done in
multiple places in the survival code, all essentially the same.
If I gave up (like lm) and forced the model frame to be saved this would
be easier, of course.
<<rsurvfit-data>>=
Call <- object$call
Terms <- object$terms
# remember the name of the id variable, if present.
# but we don't try to parse it: id= mydata$clinic becomes NULL
idname <- Call$id
if (is.name(idname)) idname <- as.character(idname)
else idname <- NULL
# I always need the model frame
mf <- model.frame(object)
if (is.null(object$y)) Y <- model.response(mf)
else Y <- object$y
formula <- formula(object)
# the chunk below is shared with survfit.formula
na.action <- getOption("na.action")
if (is.character(na.action))
na.action <- get(na.action) # a hack to allow the shared code
<<survfit.formula-getdata>>
# end of shared code
xlev <- levels(X)
# Deal with ties
if (is.null(Call$timefix) || Call$timefix) newY <- aeqSurv(Y) else newY <- Y
@
This code has 3 primary sections: single state survival, multi-state survival,
and post-Cox survival.
A motivating idea in all of them is to avoid an $O(nd)$ calculation that
involves the increment to each subject's leverage at each of the $d$
event times. Since $d$ often grows with $n$ this can get very slow. This
routine is designed for the case where the number of time points in the
output matrix is modest, so we aim for $O(n)$ processes that repeat for
each output time.
\subsection{Simple survival}
The Nelson-Aalen estimate of cumulative hazard is a simple sum
\begin{align}
H(t) &= H(t-) + h(t) \nonumber \\
\frac{\partial H(t)}{\partial w_i} &= \frac{\partial H(t-)}{\partial w_i} +
[dN_i(t) - Y_i(t)h(t)]/r(t) \nonumber \\
&= \sum_{d_j \le t} dN_i(d_j)/r(d_j) - Y_i(d_j)h(d_j)/r(d_j)
\label{NAderiv}
\end{align}
where $H$ the cumulative hazard,
$h$ is the increment to the cumulative hazard, $Y_i$ is 1 when a
subject is at risk, and $dN_i$ marks an event for the subject.
Our basic strategy for the NA estimate is to use a two stage estimate.
First, compute three vectors, each with one element per event time.
\begin{itemize}
\item term1 = $1/r(d_j)$ is the increment to the derivative for any
observation with an event at event time $d_j$
\item term2 = $-h(d_j)/r(d_j)$ is the increment for any observation that is at
risk at time $d_j$
\item term3 = cumulative sum of term2
\end{itemize}
For any given observation $i$ whose follow-up interval is $(s_i, t_i)$, their
derivative at time $z$ is the sum of
\begin{itemize}
\item term3(min($z$, $t_i$)) - term3(min($z$, $s_i$))
\item term1($t_i$) if $t_i \le z$ and observation $i$ is an event
\end{itemize}
The computation of term1 and term3 is $O(d)$, where $d$ is the number of
events, and accumulating the residuals is $O(2n)$: each observation gets one
addition when it enters the risk set and another when it leaves. This
accomplishes our goal of not updating every member of the risk set at every
event.
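To make the bookkeeping concrete, here is a small self-contained sketch
(illustration only, with made-up data; none of these objects are part of the
package) that builds the three vectors for a right-censored sample and checks
the result against a numerical derivative of the weighted Nelson-Aalen
estimate:
<<example-NAresid>>=
tt   <- c(1, 2, 2, 4, 5)    # follow-up times
stat <- c(1, 1, 0, 1, 0)    # 1 = event, 0 = censored
z    <- 4                   # a single reporting time
dtime  <- sort(unique(tt[stat==1]))               # event times
nrisk  <- sapply(dtime, function(d) sum(tt >= d))
nevent <- sapply(dtime, function(d) sum(tt == d & stat==1))
term1 <- 1/nrisk                  # increment for an event at d_j
term3 <- cumsum(nevent/nrisk^2)   # running sum of -term2
last <- findInterval(pmin(tt, z), dtime)  # last event time <= min(t_i, z)
D <- ifelse(stat==1 & tt <= z, c(0, term1)[1+last], 0) -
     c(0, term3)[1+last]
# numerical derivative of the weighted Nelson-Aalen, for comparison
Hfun <- function(w) {
    d <- dtime[dtime <= z]
    sum(sapply(d, function(x) sum(w[tt==x & stat==1])) /
        sapply(d, function(x) sum(w[tt >= x])))
}
eps <- 1e-7
Dnum <- sapply(seq_along(tt), function(i) {
    w <- rep(1, length(tt)); w[i] <- 1 + eps
    (Hfun(w) - Hfun(rep(1, length(tt))))/eps })
all.equal(D, Dnum, tolerance=1e-5)  # TRUE
@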
The Fleming-Harrington estimate of survival is
\begin{align*}
S(t) &= e^{-H(t)} \\
\frac{\partial S(t)}{\partial w_i} &= -S(t)\frac{\partial H(t)}{\partial w_i}
\end{align*}
so it uses exactly the same computation, with a multiplication at the end.
<<residuals.survfit>>=
rsurvpart1 <- function(Y, X, casewt, times,
type, stype, ctype, fit) {
ntime <- length(times)
etime <- (fit$n.event >0)
ny <- ncol(Y)
event <- (Y[,ny] >0)
status <- Y[,ny]
#
# Create a list whose first element contains the location of
# the death times in curve 1, second element the death times for curve 2,
#
if (is.null(fit$strata)) {
fitrow <- list(which(etime))
}
else {
temp1 <- cumsum(fit$strata)
temp2 <- c(1, temp1+1)
fitrow <- lapply(1:length(fit$strata), function(i) {
indx <- seq(temp2[i], temp1[i])
indx[etime[indx]] # keep the death times
})
}
ff <- unlist(fitrow)
# for each time x, the index of the last death time which is <=x.
# 0 if x is before the first death time in the fit object.
# The result is an index to the survival curve
matchfun <- function(x, fit, index) {
dtime <- fit$time[index] # subset to this curve
i2 <- findInterval(x, dtime, left.open=FALSE)
c(0, index)[i2 +1]
}
# output matrix D will have one row per observation, one col for each
# reporting time. tindex and yindex have the same dimension as D.
# tindex points to the last death time in fit which
# is <= the reporting time. (If there is only 1 curve, each col of
# tindex will be a repeat of the same value.)
tindex <- matrix(0L, nrow(Y), length(times))
for (i in 1:length(fitrow)) {
yrow <- which(as.integer(X) ==i)
temp <- matchfun(times, fit, fitrow[[i]])
tindex[yrow, ] <- rep(temp, each= length(yrow))
}
tindex[,] <- match(tindex, c(0,ff)) -1L # the [,] preserves dimensions
# repeat the indexing for Y onto fit$time. Each row of yindex points
# to the last row of fit with death time <= Y[,ny]
ny <- ncol(Y)
yindex <- matrix(0L, nrow(Y), length(times))
event <- (Y[,ny] >0)
if (ny==3) startindex <- yindex
for (i in 1:length(fitrow)) {
yrow <- (as.integer(X) ==i) # rows of Y for this curve
temp <- matchfun(Y[yrow,ny-1], fit, fitrow[[i]])
yindex[yrow,] <- rep(temp, ncol(yindex))
if (ny==3) {
temp <- matchfun(Y[yrow,1], fit, fitrow[[i]])
startindex[yrow,] <- rep(temp, ncol(yindex))
}
}
yindex[,] <- match(yindex, c(0,ff)) -1L
if (ny==3) {
startindex[,] <- match(startindex, c(0,ff)) -1L
# no subtractions for report times before subject's entry
startindex <- pmin(startindex, tindex)
}
# Now do the work
if (type=="cumhaz" || stype==2) { # result based on hazards
if (ctype==1) {
<<residpart1-nelson>>
} else {
<<residpart1-fleming>>
}
} else { # not hazard based
<<residpart1-AJ>>
}
D
}
@
The Nelson-Aalen is the simplest case.
We don't have to worry about case weights of the data, since that has
already been accounted for by the survfit function.
<<residpart1-nelson>>=
death <- (yindex <= tindex & rep(event, ntime)) # an event occurred at <= t
term1 <- 1/fit$n.risk[ff]
term2 <- lapply(fitrow, function(i) fit$n.event[i]/fit$n.risk[i]^2)
term3 <- unlist(lapply(term2, cumsum))
sum1 <- c(0, term1)[ifelse(death, 1+yindex, 1)]
sum2 <- c(0, term3)[1 + pmin(yindex, tindex)]
if (ny==3) sum3 <- c(0, term3)[1 + pmin(startindex, tindex)]
if (ny==2) D <- matrix(sum1 - sum2, ncol=ntime)
else D <- matrix(sum1 + sum3 - sum2, ncol=ntime)
# survival is exp(-H) so the derivative is a simple transform of D
if (type== "pstate") D <- -D* c(1,fit$surv[ff])[1+ tindex]
else if (type == "auc") {
<<auctrick>>
}
@
The sojourn time is the area under the survival curve. Let $x_j$ be the
widths of the rectangles under the curve from event time $d_j$ to
$\min(d_{j+1}, t)$, zero if $t \le d_j$, or $t-d_m$ if $t$ is after the last
event time.
\begin{align}
A(0,t) &= \sum_{j=1}^m x_j S(d_j) \nonumber \\
\frac{\partial A(0,t)}{\partial w_i} &=
\sum_{j=1}^m -x_j S(d_j) \frac{\partial H(d_j)}{\partial w_i} \nonumber \\
&= \sum_{j=1}^m -x_jS(d_j) \sum_{k \le j} \frac{\partial h(d_k)}{\partial w_i}
\nonumber \\
&= \sum_{k=1}^m \frac{\partial h(d_k)}{\partial w_i}
\left(\sum_{j\ge k} -x_j S(d_j) \right) \nonumber \\
&= \sum_{k=1}^m -A(d_k, t) \frac{\partial h(d_k)}{\partial w_i}
\label{eq:auctrick}
\end{align}
For an observation at risk over the interval $(a,b)$ we have exactly the same
calculus as the cumulative hazard with respect to which $h(d_k)$ terms
are counted for the observation, but now they are weighted sums. The weights
are different for each output time, so we set them up as a matrix.
We need the AUC at each event time $d_k$, and the AUC at the output times.
Matrix subscripts are a little-used feature of R. If y is a matrix of
values and x is a 2-column matrix containing m (row, col) pairs, then
y[x] is a vector of length m that plucks out the [x[1,1], x[1,2]]
element of y, then the [x[2,1], x[2,2]] element, and so on.
They are rarely needed, but very handy in the few cases where they apply.
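A tiny illustration of the feature (not part of the package source):
<<example-matsub>>=
# matrix subscripts: one value returned per (row, col) pair
y <- matrix(1:12, nrow=3)
x <- cbind(c(1, 3, 2), c(1, 2, 4))  # the pairs (1,1), (3,2), (2,4)
y[x]                                # 1 6 11
@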
<<auctrick>>=
auc1 <- lapply(fitrow, function(i) {
if (length(i) <=1) 0
else c(0, cumsum(diff(fit$time[i]) * (fit$surv[i])[-length(i)]))
}) # AUC at each event time
auc2 <- lapply(fitrow, function(i) {
if (length(i) <=1) 0
else {
xx <- sort(unique(c(fit$time[i], times))) # all the times
yy <- (fit$surv[i])[findInterval(xx, fit$time[i])]
auc <- cumsum(c(diff(xx),0) * yy)
c(0, auc)[match(times, xx)]
}}) # AUC at the output times
# Most often this function is called with a single curve, so make that case
# faster. (Or I presume so: mapply and do.call may be more efficient than
# I think for lists of length 1).
if (length(fitrow)==1) { # simple case, most common to ask for auc
wtmat <- pmin(outer(auc1[[1]], -auc2[[1]], '+'),0)
term1 <- term1 * wtmat
term2 <- unlist(term2) * wtmat
term3 <- apply(term2, 2, cumsum)
}
else { #more than one curve, compute weighted cumsum per curve
wtmat <- mapply(function(x, y) pmin(outer(x, -y, "+"), 0), auc1, auc2,
SIMPLIFY=FALSE)
term1 <- term1 * do.call(rbind, wtmat)
temp <- mapply(function(x, y) apply(x*y, 2, cumsum), term2, wtmat,
SIMPLIFY=FALSE)
term3 <- do.call(rbind, temp)
}
sum1 <- sum2 <- matrix(0, nrow(yindex), ntime)
if (ny ==3) sum3 <- sum1
for (i in 1:ntime) {
sum1[,i] <- c(0, term1[,i])[ifelse(death[,i], 1 + yindex[,i], 1)]
sum2[,i] <- c(0, term3[,i])[1 + pmin(yindex[,i], tindex[,i])]
if (ny==3) sum3[,i] <- c(0, term3[,i])[1 + pmin(startindex[,i], tindex[,i])]
}
# Perhaps a bit faster(?), but harder to read. And for AUC people usually only
# ask for one time point
#sum1 <- rbind(0, term1)[cbind(c(ifelse(death, 1+yindex, 1)), c(col(yindex)))]
#sum2 <- rbind(0, term3)[cbind(c(1 + pmin(yindex, tindex)), c(col(yindex)))]
#if (ny==3) sum3 <-
# rbind(0, term3)[c(cbind(1 + pmin(startindex, tindex)),
# c(col(yindex)))]
if (ny==2) D <- matrix(sum1 - sum2, ncol=ntime)
else D <- matrix(sum1 + sum3 - sum2, ncol=ntime)
@
\paragraph{Fleming-Harrington}
For the Fleming-Harrington estimator the calculation at a tied time differs
slightly.
If there were 10 at risk and 3 tied events, the Nelson-Aalen has an increment
of 3/10, while the FH has an increment of (1/10 + 1/9 + 1/8). The underlying
idea is that the true time values are continuous and we observe ties due to
coarsening of the data. The derivative will have 3 terms as well. In this
case the needed value cannot be pulled directly from the survfit object.
Computationally, the number of distinct times at which a tie occurs is normally
quite small and the for loop below will not be too expensive.
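A quick numeric check of the 10 at risk, 3 events example above
(illustration only):
<<example-FHties>>=
nrisk <- 10; nevent <- 3
denom <- nrisk - nevent*(0:(nevent-1))/nevent  # 10 9 8
sum(1/denom)            # the FH increment 1/10 + 1/9 + 1/8
nevent * mean(1/denom)  # the same value, in the form the loop below uses
@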
<<residpart1-fleming>>=
stop("residuals function still imcomplete, for FH estimate")
if (any(casewt != casewt[1])) {
# Have to reconstruct the number of obs with an event, the curve only
# contains the weighted sum
nevent <- unlist(lapply(seq(along.with=levels(X)), function(i) {
keep <- which(as.numeric(X) ==i)
counts <- table(Y[keep, ny-1], status[keep])
as.vector(counts[, ncol(counts)])
}))
} else nevent <- fit$n.event
n2 <- fit$n.risk
risk2 <- 1/fit$n.risk
ltemp <- risk2^2
for (i in which(nevent>1)) { # assume not too many ties
denom <- fit$n.risk[i] - fit$n.event[i]*(0:(nevent[i]-1))/nevent[i]
risk2[i] <- mean(1/denom) # multiplier for the event
ltemp[i] <- mean(1/denom^2)
n2[i] <- mean(denom)
}
death <- (yindex <= tindex & rep(event, ntime))
term1 <- risk2[ff]
term2 <- lapply(fitrow, function(i) nevent[i]*ltemp[i])
term3 <- unlist(lapply(term2, cumsum))
sum1 <- c(0, term1)[ifelse(death, 1+yindex, 1)]
sum2 <- c(0, term3)[1 + pmin(yindex, tindex)]
if (ny==3) sum3 <- c(0, term3)[1 + pmin(startindex, tindex)]
if (ny==2) D <- matrix(sum1 - sum2, ncol=ntime)
else D <- matrix(sum1 + sum3 - sum2, ncol=ntime)
if (type=="pstate") D <- -D* c(0,fit$surv[ff])[1+ tindex]
else if (type=="auc") {
<<auctrick>>
}
@
\paragraph{Kaplan-Meier}
For the Kaplan-Meier (a special case of the Aalen-Johansen) the underlying
algorithm is multiplicative, but we can turn it into an additive
algorithm with a sleight of hand.
\begin{align*}
S(t) &= \prod_{d_j\le t} (1- h(d_j)) \\
&= \exp \left(\sum_{d_j\le t} \log(1- h(d_j)) \right) \\
&= \exp \left(\sum_{d_j\le t} \log(r(d_j) - dN(d_j)) - \log(r(d_j)) \right) \\
\frac{\partial S(t)}{\partial w_i} &=
S(t) \sum_{d_j\le t} \frac{Y_i(d_j) - dN_i(d_j)}{r(d_j) - dN(d_j)} -
\frac{Y_i(d_j)}{ r(d_j)}
\end{align*}
The addend for term2 is now $e/\{n(n-e)\}$ where $e$ is the number of events
at the time, i.e., the same term as in the Greenwood variance, and the
addend for term1 is $1/(n-e)$.
The jumps in the KM curve are just a bit larger than the jumps in a FH
estimate, so it makes sense that these increments are just a bit larger.
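A quick numeric check of the additive form, using the released
\code{survfit} routine (illustration only, not tangled into the package):
<<example-KMadd>>=
# the KM product rewritten as the exponential of an additive sum
library(survival)
tt <- c(1, 2, 2, 4, 5); stat <- c(1, 1, 0, 1, 0)
km <- survfit(Surv(tt, stat) ~ 1)
logS <- cumsum(log(km$n.risk - km$n.event) - log(km$n.risk))
all.equal(exp(logS), km$surv)   # TRUE
@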
<<residpart1-AJ>>=
death <- (yindex <= tindex & rep(event, ntime))
# dtemp avoids 1/0. (When this occurs the influence is 0, since
# the curve has dropped to zero; and this avoids Inf in term1 and term2).
dtemp <- ifelse(fit$n.risk==fit$n.event, 0, 1/(fit$n.risk- fit$n.event))
term1 <- dtemp[ff]
term2 <- lapply(fitrow, function(i) dtemp[i]*fit$n.event[i]/fit$n.risk[i])
term3 <- unlist(lapply(term2, cumsum))
add1 <- c(0, term1)[ifelse(death, 1+yindex, 1)]
add2 <- c(0, term3)[1 + pmin(yindex, tindex)]
if (ny==3) add3 <- c(0, term3)[1 + pmin(startindex, tindex)]
if (ny==2) D <- matrix(add1 - add2, ncol=ntime)
else D <- matrix(add1 + add3 - add2, ncol=ntime)
# survival is exp(-H) so the derivative is a simple transform of D
if (type== "pstate") D <- -D* c(1,fit$surv[ff])[1+ tindex]
else if (type == "auc") {
<<auctrick>>
}
@
\subsection{Multi-state Aalen-Johansen estimate}
For multi-state models a correction for ties of similar spirit to the
Efron approximation in a Cox model (the ctype=2 argument for \code{survfit})
is difficult: the `right' answer depends on the study.
Thus the ctype argument is not present.
Both stype 1 and 2 are feasible, but currently only \code{stype=1} is
supported.
This makes the code somewhat simpler, but this is more than offset by the
multi-state nature.
With multiple states we also need to account for influence on the starting
state $p(0)$.
One thing that can make this code slow is data that has been divided into a
very large number of intervals, giving a large number of observations for
each cluster. We first deal with that by collapsing adjacent observations.
<<residuals.survfit>>=
rsurvpart2 <- function(Y, X, casewt, istate, times, cluster, type, fit,
method, collapse) {
ny <- ncol(Y)
ntime <- length(times)
nstate <- length(fit$states)
# ensure that Y, istate, and fit all use the same set of states
states <- fit$states
if (!identical(attr(Y, "states"), fit$states)) {
map <- match(attr(Y, "states"), fit$states)
Y[,ny] <- c(0, map)[1+ Y[,ny]] # 0 = censored
attr(Y, "states") <- fit$states
}
if (is.null(istate)) istate <- rep(1L, nrow(Y)) #everyone starts in s0
else {
if (is.character(istate)) istate <- factor(istate)
if (is.factor(istate)) {
if (!identical(levels(istate), fit$states)) {
map <- match(levels(istate), fit$states)
if (any(is.na(map))) stop ("invalid levels in istate")
istate <- map[istate]
}
} # istate is numeric, we take what we get and hope it is right
}
# collapse redundant rows in Y, for efficiency
# a redundant row is a censored obs in the middle of a chain of times
# If the user wants individual obs, however, we would just have to
# expand it again
if (ny==3 && collapse && any(duplicated(cluster))) {
ord <- order(cluster, Y[,1]) # time within subject
cfit <- .Call(Ccollapse, Y, X, istate, cluster, casewt, ord -1L)
if (nrow(cfit) < .8*length(X)) {
# shrinking the data by 20 percent is worth it
temp <- Y[ord,]
Y <- cbind(temp[cfit[,1], 1], temp[cfit[,2], 2:3])
X <- (X[ord])[cfit[,1]]
istate <- (istate[ord])[cfit[,1]]
cluster <- (cluster[ord])[cfit[,1]]
}
}
# Compute the initial leverage
inf0 <- NULL
if (is.null(fit$call$p0) && any(istate != istate[1])) {
#p0 was not supplied by the user, and the intitial states vary
inf0 <- matrix(0., nrow=nrow(Y), ncol=nstate)
i0fun <- function(i, fit, inf0) {
# reprise algorithm in survfitCI
p0 <- fit$p0
t0 <- fit$time[1]
if (ny==2) at.zero <- which(as.numeric(X) ==i)
else
at.zero <- which(as.numeric(X) ==i &
(Y[,1] < t0 & Y[,2] >= t0))
for (j in 1:nstate) {
inf0[at.zero, j] <- (ifelse(istate[at.zero]==states[j], 1, 0) -
p0[j])/sum(casewt[at.zero])
}
inf0
}
if (is.null(fit$strata)) inf0 <- i0fun(1, fit, inf0)
else for (i in 1:length(levels(X)))
inf0 <- i0fun(i, fit[i], inf0) # each iteration fills in some rows
}
p0 <- fit$p0 # needed for method==1, type != cumhaz
fit <- survfit0(fit) # package the initial state into the picture
start.time <- fit$time[1]
# This next block is identical to the one in rsurvpart1, more comments are
# there
etime <- (rowSums(fit$n.event) >0)
event <- (Y[,ny] >0)
#
# Create a list whose first element contains the location of
# the death times in curve 1, second element for curve 2, etc.
#
if (is.null(fit$strata)) fitrow <- list(which(etime))
else {
temp1 <- cumsum(fit$strata)
temp2 <- c(1, temp1+1)
fitrow <- lapply(1:length(fit$strata), function(i) {
indx <- seq(temp2[i], temp1[i])
indx[etime[indx]] # keep the death times
})
}
ff <- unlist(fitrow)
# for each time x, the index of the last death time which is <=x.
# 0 if x is before the first death time
matchfun <- function(x, fit, index) {
dtime <- fit$time[index] # subset to this curve
i2 <- findInterval(x, dtime, left.open=FALSE)
c(0, index)[i2 +1]
}
if (type== "cumhaz") {
<<residpart2CH>>
} else {
<<residpart2AJ>>
}
# since we may have done a partial collapse (removing redundant rows), the
# parent routine can't collapse the data
if (collapse & any(duplicated(cluster))) {
if (length(dim(D)) ==2)
D <- rowsum(D, cluster, reorder=FALSE)
else { # rowsum has to be fooled
dd <- dim(D)
temp <- rowsum(matrix(D, nrow=dd[1]), cluster)
D <- array(temp, dim=c(nrow(temp), dd[2:3]))
}
}
D
}
@
\paragraph{Nelson-Aalen}
The multi-state Nelson-Aalen estimate of the cumulative hazard at time $t$
is a vector with one element for each observed transition pair. If there
were $k$ states there are potentially $k(k-1)$ transition pairs, though
normally only a small number will occur in a given fit.
We ignore transitions from state $j$ to state $j$.
Let $r(t)$ be the weighted number at risk at time $t$, in each state.
When some subject makes a $j:k$ transition, the $j:k$ transition will
have an increment of $w_i/r_j(t)$.
This is precisely the same increment as the ordinary Nelson estimate.
The only change then is that we loop over the set of possible transitions,
creating a large output object.
<<residpart2CH>>=
# output matrix D will have one row per observation, one col for each
# reporting time. tindex and yindex have the same dimension as D.
# tindex points to the last death time in fit which
# is <= the reporting time. (If there is only 1 curve, each col of
# tindex will be a repeat of the same value.)
tindex <- matrix(0L, nrow(Y), length(times))
for (i in 1:length(fitrow)) {
yrow <- which(as.integer(X) ==i)
temp <- matchfun(times, fit, fitrow[[i]])
tindex[yrow, ] <- rep(temp, each= length(yrow))
}
tindex[,] <- match(tindex, c(0,ff)) -1L # the [,] preserves dimensions
# repeat the indexing for Y onto fit$time. Each row of yindex points
# to the last row of fit with death time <= Y[,ny]
ny <- ncol(Y)
yindex <- matrix(0L, nrow(Y), length(times))
event <- (Y[,ny] >0)
if (ny==3) startindex <- yindex
for (i in 1:length(fitrow)) {
yrow <- (as.integer(X) ==i) # rows of Y for this curve
temp <- matchfun(Y[yrow,ny-1], fit, fitrow[[i]])
yindex[yrow,] <- rep(temp, ncol(yindex))
if (ny==3) {
temp <- matchfun(Y[yrow,1], fit, fitrow[[i]])
startindex[yrow,] <- rep(temp, ncol(yindex))
}
}
yindex[,] <- match(yindex, c(0,ff)) -1L
if (ny==3) {
startindex[,] <- match(startindex, c(0, ff)) -1L
# no subtractions for report times before subject's entry
startindex <- pmin(startindex, tindex)
}
dstate <- Y[,ncol(Y)]
istate <- as.integer(istate)
ntrans <- ncol(fit$cumhaz) # the number of possible transitions
D <- array(0, dim=c(nrow(Y), ntime, ntrans))
scount <- table(istate[dstate!=0], dstate[dstate!=0]) # observed transitions
state1 <- row(scount)[scount>0]
state2 <- col(scount)[scount>0]
temp <- paste(rownames(scount)[state1],
colnames(scount)[state2], sep='.')
if (!identical(temp, colnames(fit$cumhaz))) stop("setup error")
for (k in seq_along(state1)) {
e2 <- Y[,ny] == state2[k]
add1 <- (yindex <= tindex & rep(e2, ntime))
lsum <- unlist(lapply(fitrow, function(i)
cumsum(fit$n.event[i,k]/fit$n.risk[i,k]^2)))
term1 <- c(0, 1/fit$n.risk[ff,k])[ifelse(add1, 1+yindex, 1)]
term2 <- c(0, lsum)[1+pmin(yindex, tindex)]
if (ny==3) term3 <- c(0, lsum)[1 + startindex]
if (ny==2) D[,,k] <- matrix(term1 - term2, ncol=ntime)
else D[,,k] <- matrix(term1 + term3 - term2, ncol=ntime)
}
@
\paragraph{Aalen-Johansen}
The multi-state AJ estimate is more complex. Let $p(t)$ be the vector
of probability in state at time $t$.
Then
\begin{align}
p(t) &= p(t-) [I+ A(t)] \nonumber \\
\frac{\partial p(t)}{\partial w_i} &= \frac{\partial p(t-)}{\partial w_i}
[I+ A(t)]
+ p(t-) \frac{\partial A(t)}{\partial w_i} \nonumber\\
&= U_i(t-) [I+ A(t)] + p(t-) \frac{\partial A(t)}{\partial w_i}
\label{ajresidx2}
\end{align}
When we expand the left hand portion of \eqref{ajresidx2} to include all
observations it becomes simple matrix multiplication, not so with
the right hand portion.
Each individual subject $i$ has a subject-specific
nstate * nstate derivative matrix $dA$, which will be non-zero only for the
state (row) $j$ that the subject occupies at time $t-$.
The $j$th row of $p(t-)\, dA$ is added to each subject's derivative.
The $A$ matrix at time $t$ has off diagonal elements and derivative
\begin{align}
A(t)_{jk} &= \frac{\sum_i w_i Y_{ij}(t)\, dN_{ik}(t)}
{\sum_i w_i Y_{ij}(t)} \nonumber \\
&= \lambda_{jk}(t) \\
\frac{\partial A(t)_{jk}}{\partial w_i} &= \frac{dN_{ik}(t) - \lambda_{jk}(t)}
{\sum_i w_i Y_{ij}(t)} \label{Aderiv}
\end{align}
This is the standard counting process notation: $Y_{ij}(t)$ is 1 if subject $i$
is in state $j$ and at risk at time $t-$, and $dN_{ik}(t)$ is a transition to
state $k$ at time $t$.
Each observation at risk appears in at most 1 row of $A(t)$, since they can
only be in one state.
The diagonal elements of $A$ are set so that each row sums to 0.
If there are no transitions out of state $j$ at some time point, then that
row of $A$ is zero.
Since the row sums are constant, the sum of the derivatives for each row
must be zero.
If we evaluate equation \eqref{ajresidx2} directly there will be
$O(nk^2)$ operations at each death time for the matrix product, and another
$O(nk)$ to add in the new increment. For a large data set $d$ is often
of the same order as $n$, which makes this an expensive calculation.
But, this is what the C-code version currently does, because I have code that
actually works.
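In miniature, one step of the recursion looks like the following
(illustration only; the states and transition rates are arbitrary):
<<example-AJstep>>=
# one step of p(t) = p(t-) %*% (I + A(t)) for a 3-state process
p <- c(1, 0, 0)                # everyone starts in state 1
A <- rbind(c(-0.3, 0.2, 0.1),  # each row of A sums to zero
           c( 0,   0,   0  ),
           c( 0,   0,   0  ))
p <- p %*% (diag(3) + A)       # p is now (0.7, 0.2, 0.1)
@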
<<residpart2AJ>>=
if (method==1) {
# Compute the result using the direct method, in C code
# the routine is called separately for each curve, data in sorted order
#
is1 <- as.integer(istate) -1L # 0 based subscripts for C
if (is.null(inf0)) inf0 <- matrix(0, nrow=nrow(Y), ncol=nstate)
if (all(as.integer(X) ==1)) { # only one curve
if (ny==2) asort1 <- 0L else asort1 <- order(Y[,1], Y[,2]) -1L
asort2 <- order(Y[,ny-1]) -1L
tfit <- .Call(Csurvfitresid, Y, asort1, asort2, is1,
casewt, p0, inf0, times, start.time,
type== "auc")
if (ntime==1) {
if (type=="auc") D <- tfit[[2]] else D <- tfit[[1]]
}
else {
if (type=="auc") D <- array(tfit[[2]], dim=c(nrow(Y), nstate, ntime))
else D <- array(tfit[[1]], dim=c(nrow(Y), nstate, ntime))
}
}
else { # one curve at a time
ix <- as.numeric(X) # 1, 2, etc
if (ntime==1) D <- matrix(0, nrow(Y), nstate)
else D <- array(0, dim=c(nrow(Y), nstate, ntime))
for (curve in 1:max(ix)) {
j <- which(ix==curve)
ytemp <- Y[j,,drop=FALSE]
if (ny==2) asort1 <- 0L
else asort1 <- order(ytemp[,1], ytemp[,2]) -1L
asort2 <- order(ytemp[,ny-1]) -1L
# call with a subset of the data
j <- which(ix== curve)
tfit <- .Call(Csurvfitresid, ytemp, asort1, asort2, is1[j],
casewt[j], p0[curve,], inf0[j,], times,
start.time, type=="auc")
if (ntime==1) {
if (type=="auc") D[j,] <- tfit[[2]] else D[j,] <- tfit[[1]]
} else {
if (type=="auc") D[j,,] <- tfit[[2]] else D[j,,] <- tfit[[1]]
}
}
}
# the C code makes time the last dimension, we want it to be second
if (ntime > 1) D <- aperm(D, c(1,3,2))
}
else {
# method 2
<<residpart2AJ2>>
}
@
Can we speed this up?
An alternate is to look at the direct expansion.
\begin{align}
p(t) &= p(0) \prod_{d_j \le t} [I+ A(d_j)] \nonumber \\
\frac{\partial p(t)}{\partial w_i} &=
\frac{\partial p(0)}{\partial w_i} \prod_{d_j \le t} [I+ A(d_j)] \\
& + p(0)\sum_{d_j \le t} \left( \prod_{k<j}[I+ A(d_k)]
\frac{\partial A(d_j)}{\partial w_i}
\prod_{j<k, d_k\le t}[I+ A(d_k)] \right)\nonumber \\
&= \frac{\partial p(0)}{\partial w_i} \prod_{d_j \le t} [I+ A(d_j)] +
\sum_{d_j \le t} p(d_{j-1}) \frac{\partial A(d_j)}{\partial w_i}
\prod_{j<k, d_k\le t}[I+ A(d_k)] \label{ajresidy}
\end{align}
We cannot insert an $(I+ A(d_j))/(I + A(d_j))$ term and rearrange the last
equation so as to factor out $p(t)$, as was done in the KM case,
since matrix products do not commute.
Instead think of accumulating the terms sequentially.
Let $B^{(j)}(t)$ be the nstate by nstate derivative matrix whose row $j$
contains $\lambda_{jk}(t)/n_j(t)$, with zeros in all of the other
rows, i.e., term 2 of equation \eqref{Aderiv} for someone in state $j$.
(This is the part of the derivative that is common to all subjects at
risk.) Let $B(t)$ be the sum of these matrices, i.e., all states filled.
Now, here is the trick. The product $B^{(j)}(t)[I + A(t)]$ also is
zero for all but the $jth$ row, and is in fact equal to the $j$th
row of $B(t)[I + A(t)]$.
Further, $p(t-)B^{(j)}(t)[I + A(t)]$ is the $j$th row of
${\rm diag}(p(t-))B(t)[I + A(t)]$.
The key computation is based on a matrix of matrices. Start with the following
definitions. $T_{jk}$ is the $j$th term in the expansion, at
death time $k$. $T_{jk}=0$ whenever $k=0$ or $j>k$.
Let $D(x)$ be the diagonal matrix with diagonal elements $x$.
\begin{align}
T_{01} &= D(p'(0))[I+ A(d_1)] & T_{02} &= T_{01}[I + A(d_2)] &
T_{03} &= T_{02} [I + A(d_3)] & \ldots \\
T_{11} &= D(p(d_1)) B(d_1) & T_{12} &= T_{11}[I + A(d_2)] &
T_{13} &= T_{12}[I + A(d_3)] & \ldots \\
T_{21} &= 0 & T_{22} &= D(p(d_2)) B(d_2) & T_{23} &= T_{22}[I+ A(d_2)] & \ldots \\
T_{31} &= 0 & T_{32}&=0 & T_{33} &= D(p(d_3)) B(d_3) &\ldots
\end{align}
(According to the latex guide the above should be nicely spaced, but I get
equations that are touching. Why?)
If $p(0)$ is a fixed value specified by the user then $p'(0)$ =0.
Otherwise $p(0)$ is the empirical distribution of the initial states, just
before the first death time $d_1$. Let $n_0$ be the (weighted) count of
subjects who are at risk at that time.
The $j$th row of $p'(0)$ is defined as the derivative wrt $w_i$ for a subject
who starts in state $j$.
If no one starts in state $j$ that row of the matrix will be 0; otherwise
it contains $(1-p_j(0))/n_0$ in the $j$th element and $-p_k(0)/n_0$ in
element $k \ne j$.
Define the matrix $W_{jk} = \sum_{l=1}^j T_{lk}$, with $W_{j0}=0$.
Then for someone who enters at time $s$ such that $d_a < s \le d_{a+1}$,
is censored or has an event at time $t$ such that $d_b \le t <d_{b+1}$,
reporting at time $r$ such that $d_c \le r < d_{c+1}$, the first portion of
the contribution for an observation in state $j$ will be the
$j$th row of $- (W_{br}-W_{ar})$.
The second contribution is the effect of the $dN$ term in the derivative.
An observation that has a j:k transition at time $d_i$ will have an
additional term of $c \prod_{k=i+1}^r [I + A(t_k)]$ where $c$ is a vector
with
\begin{align*}
c_j &= -1/n_j(d_i) \\
c_k &= 1/n_j(d_i) \\
c &= 0 \;\mbox{otherwise}
\end{align*}
If there are multiple reporting times, it is currently simplest to do each
one separately (at least for now), having computed and stored the sets of
matrices $A(d_i)$ and $p(d_i)B(d_i)$ once at the start.
If there are multiple strata in a curve, this is done separately per stratum.
<<residpart2AJ2>>=
Yold <- Y
utime <- fit$time[fit$time <= max(times) & etime] # unique death times
ndeath <- length(utime) # number of unique event times
delta <- diff(c(start.time, utime))
# Expand Y
if (ny==2) split <- .Call(Csurvsplit, rep(0., nrow(Y)), Y[,1], times)
else split <- .Call(Csurvsplit, Y[,1], Y[,2], times)
X <- X[split$row]
casewt <- casewt[split$row]
istate <- istate[split$row]
Y <- cbind(split$start, split$end,
ifelse(split$censor, 0, Y[split$row,ny]))
ny <- 3
# Create a vector containing the index of each end time into the fit object
yindex <- ystart <- double(nrow(Y))
for (i in 1:length(fitrow)) {
yrow <- (as.integer(X) ==i) # rows of Y for this curve
yindex[yrow] <- matchfun(Y[yrow, 2], fit, fitrow[[i]])
ystart[yrow] <- matchfun(Y[yrow, 1], fit, fitrow[[i]])
}
# And one indexing the reporting times into fit
tindex <- matrix(0L, nrow=length(fitrow), ncol=ntime)
for (i in 1:length(fitrow)) {
tindex[i,] <- matchfun(times, fit, fitrow[[i]])
}
yindex <- match(yindex, c(0,ff)) -1L
tindex[,] <- match(tindex, c(0,ff)) -1L  # the [,] preserves dimensions
ystart <- pmin(match(ystart, c(0,ff)) -1L, tindex)
# Create the array of C matrices
cmat <- array(0, dim=c(nstate, nstate, ndeath)) # max(i2) = ndeath, by design
Hmat <- cmat
# We only care about observations that had a transition; any transitions
# after the last reporting time are not relevant
transition <- (Y[,ny] !=0 & Y[,ny] != istate &
Y[,ny-1] <= max(times)) # obs that had a transition
i2 <- match(yindex, sort(unique(yindex))) # which C matrix this obs goes to
i2 <- i2[transition]
from <- as.numeric(istate[transition]) # from this state
to <- Y[transition, ny] # to this state
nrisk <- fit$n.risk[cbind(yindex[transition], from)] # number at risk
wt <- casewt[transition]
for (i in seq(along.with =from)) {
j <- c(from[i], to[i])
haz <- wt[i]/nrisk[i]
cmat[from[i], j, i2[i]] <- cmat[from[i], j, i2[i]] + c(-haz, haz)
}
for (i in 1:ndeath) Hmat[,,i] <- cmat[,,i] + diag(nstate)
# The transformation matrix H(t) at time t is cmat[,,t] + I
# Create the set of W and V matrices.
#
dindex <- which(etime & fit$time <= max(times))
Wmat <- Vmat <- array(0, dim=c(nstate, nstate, ndeath))
for (i in ndeath:1) {
j <- match(dindex[i], tindex, nomatch=0)
if (j > 0) {
# this death matches one of the reporting times
Wmat[,,i] <- diag(nstate)
Vmat[,,i] <- matrix(0, nstate, nstate)
}
else {
Wmat[,,i] <- Hmat[,,i+1] %*% Wmat[,,i+1]
Vmat[,,i] <- delta[i] + Hmat[,,i+1] %*% Wmat[,,i+1]
}
}
@
The above code has created the Wmat array for all reporting times and
for all the curves (if more than one).
Each of them reaches forward to the next reporting time.
Now work forward in time.
<<residpart2AJ2>>=
iterm <- array(0, dim=c(nstate, nstate, ndeath)) # term in equation
itemp <- vtemp <- matrix(0, nstate, nstate) # cumulative sum, temporary
isum <- isum2 <- iterm # cumulative sum
vsum <- vsum2 <- vterm <- iterm
for (i in 1:ndeath) {
j <- dindex[i]
n0 <- ifelse(fit$n.risk[j,] ==0, 1, fit$n.risk[j,]) # avoid 0/0
iterm[,,i] <- ((fit$pstate[j-1,]/n0) * cmat[,,i]) %*% Wmat[,,i]
vterm[,,i] <- ((fit$pstate[j-1,]/n0) * cmat[,,i]) %*% Vmat[,,i]
itemp <- itemp + iterm[,,i]
vtemp <- vtemp + vterm[,,i]
isum[,,i] <- itemp
vsum[,,i] <- vtemp
j <- match(dindex[i], tindex, nomatch=0)
if (j>0) itemp <- vtemp <- matrix(0, nstate, nstate) # reset
isum2[,,i] <- itemp
vsum2[,,i] <- vtemp
}
# We want to add isum[state,, entry time] - isum[state,, exit time] for
# each subject, and for those with an a:b transition there will be an
# additional vector with -1, 1 in the a and b position.
i1 <- match(ystart, sort(unique(yindex)), nomatch=0) # start at 0 gives 0
i2 <- match(yindex, sort(unique(yindex)))
D <- matrix(0., nrow(Y), nstate)
keep <- (Y[,2] <= max(times)) # any intervals after the last reporting time
# will have 0 influence
for (i in which(keep)) {
if (Y[i,3] !=0 && istate[i] != Y[i,3]) {
z <- fit$pstate[yindex[i]-1, istate[i]]/fit$n.risk[yindex[i], istate[i]]
temp <- double(nstate)
temp[istate[i]] = -z
temp[Y[i,3]] = z
temp <- temp %*% Wmat[,,i2[i]] - isum[istate[i],,i2[i]]
if (i1[i] >0) temp <- temp + isum2[istate[i],, i1[i]]
D[i,] <- temp
}
else {
if (i1[i] >0) D[i,] = isum2[istate[i],,i1[i]] - isum[istate[i],, i2[i]]
else D[i,] = -isum[istate[i],, i2[i]]
}
}
@
By design, each row of $Y$, and hence each row of $D$, corresponds to a unique
curve, and also to a unique period in the reporting intervals.
(Any Y intervals after the last reporting time will have D=0 for the row.)
If there are multiple reporting intervals, create an array with one
n by nstate slice for each.
If a row lies in the first interval, $D$ currently contains its influence
on that interval. Its influence on the second interval is the vector times
$\prod H(d_k)$ where $k$ is the set of event times $>$ the first reporting time
and $\le$ the second one.
<<residpart2AJ2>>=
Dsave <- D
if (!is.null(inf0)) {
# add in the initial influence, to the first row of each obs
# (inf0 was created on unsplit data)
j <- which(!duplicated(split$row))
D[j,] <- D[j,] + (inf0%*% Hmat[,,1] %*% Wmat[,,1])
}
if (ntime > 1) {
interval <- findInterval(yindex, tindex, left.open=TRUE)
D2 <- array(0., dim=c(dim(D), ntime))
D2[interval==0,,1] <- D[interval==0,]
for (i in 1:(ntime-1)) {
D2[interval==i,,i+1] = D[interval==i,]
j <- tindex[i]
D2[,,i+1] = D2[,,i+1] + D2[,,i] %*% (Hmat[,,j] %*% Wmat[,,j])
}
D <- D2
}
# undo any artificial split
if (any(duplicated(split$row))) {
if (ntime==1) D <- rowsum(D, split$row)
else {
# rowsum has to be fooled
temp <- rowsum(matrix(D, ncol=(nstate*ntime)), split$row)
# then undo it
D <- array(temp, dim=c(nrow(temp), nstate, ntime))
}
}
@
\subsection{Cox model case}
The code for a simple Cox model has a lot of overlap with the simple
Nelson-Aalen case, leading to overlap between this section and
the rsurvpart1 routine.
We only support the exponential form (Breslow estimate), however.
At time $t$ the increment to the hazard function will be
\begin{align*}
h(t;z) &= \frac{\sum w_i dN_i(t)}{\sum Y_i(t) w_i \exp((X_i-z)\beta)}\\
&= \frac{\sum w_i dN_i(t)}{d(t;z)} \\
H(t;z) &= \int_0^t h(s;z)\, ds
\end{align*}
where $z$ is the covariate vector for the predicted curve.
If $\beta=0$ then this reduces to the ordinary Nelson-Aalen.
The increment to the IJ for some subject $k$ turns out to be
\begin{align}
\frac{\partial h(t;z)}{\partial w_k} &= A + B \\
A &= \frac{dN_k(t) - Y_k(t)\exp((X_k-z)\beta)\, h(t;z)}
{\sum Y_i(t) w_i \exp((X_i-z)\beta)} \label{eq:residij1}\\
&= \frac{dM_k(t)}{d(t;z)} \\
B &= -D_{k.} (\overline{x}(t)- z)' h(t;z)\label{eq:residij2}
\end{align}
where $D_{k.}$ is row $k$ of the dfbeta matrix, which gives the influence
of each subject (row) on the coefficients of $\hat\beta$.
$D$ and $M$ do not involve $z$.
Term A is a near clone of the Nelson-Aalen and can use nearly the
same code, adding the risk weights $\exp((X_i-z)\beta)$, while term B is new.
The user may request curves for more than one covariate set $z$, in that case
the survival curve found below within \code{object} will be a matrix,
one column for each target, and the returned matrix from this routine will
be an array of dimensions (subject, time, z).
The survival curves are
\begin{align*}
S(t; z) &= \exp(-H(t;z)) \\
\frac{\partial \log S(t;z)}{\partial w_k} &=
-S(t;z) \frac{\partial H(t;z)}{\partial w_k} \\
\end{align*}
thus the survival or pstate derivative is a simple multiple of the derivative
for the cumulative hazard.
As shown in the earlier in equation \eqref{eq:auctrick}, if $A(s,t; z)$ is the
area under the curve from $s$ to $t$, then
$$
\frac{\partial A(0,t;z)}{\partial w_i} =
\sum_{k=1}^m -A(d_k, t;z) \frac{\partial h(d_k;z)}{\partial w_i}
$$
where $d_k$ are the event times.
Note that \emph{all} the weights change for a new reporting time.
However, since $A(0,t) = A(0,d_k) + A(d_k, t)$ the values can be obtained
efficiently.
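As a usage sketch (assuming the methods defined below have been loaded;
the covariate values are arbitrary):
<<example-coxresid>>=
# IJ residuals for a predicted curve from a Cox model, at two times
library(survival)
cfit <- coxph(Surv(time, status) ~ age + sex, data = lung)
sfit <- survfit(cfit, newdata = data.frame(age = 60, sex = 1))
r <- residuals(sfit, times = c(180, 365), type = "pstate")
dim(r)   # one row per observation, one column per reporting time
@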
<<residuals.survfitcox>>=
residuals.survfitcoxms <- function(object, times, type="pstate", collapse= TRUE,
weighted= FALSE, ...) {
stop("residuals for survival curves from a multistate PH model are not yet available")
}
residuals.survfitcox <- function(object, times, type="pstate", collapse= TRUE,
weighted= FALSE, ...) {
# residuals for a single state Cox model survival curve
if (!inherits(object, "survfitcox"))
stop("argument must be a survfit object created from a coxph model")
if (missing(times)) stop("the times argument is required")
ntime <- length(times)
if (is.matrix(object$surv)) nz <- ncol(object$surv)
else nz <- 1 # number of z vectors that were used
fit <- object # the fitted survival
# allow a set of aliases
temp <- c("pstate", "cumhaz", "sojourn", "survival",
"chaz", "rmst", "rmts", "auc")
type <- match.arg(casefold(type), temp)
itemp <- c(1,2,3,1,2,3,3,3)[match(type, temp)]
type <- c("pstate", "cumhaz", "auc")[itemp]
# retrieve the underlying Cox model, and then the data
Call <- object$call
coxfit <- eval(Call$formula)
cdata <- coxph.getdata(coxfit, id=collapse, cluster=collapse)
id <- cdata$id
Y <- cdata$y
X <- cdata$x
ny <- ncol(Y)
n <- nrow(Y)
strata <- cdata$strata
if (is.null(strata)) strata <- integer(n)
nstrat <- length(unique(strata))
wt <- cdata$weight
risk <- exp(coxfit$linear.predictors)
xcurve <- object$xcurve # the predictors for each curve in the object
ncurve <- nrow(xcurve)
# Deal with the rare case of a redundant covariate
if (any(is.na(coxfit$coefficients))) {
keep <- which(!is.na(coxfit$coefficients))
X <- X[,keep, drop=FALSE]
vmat <- coxfit$var[keep,keep, drop=FALSE]
xcurve <- xcurve[,keep, drop=FALSE]
temp <- xcurve - rep(coxfit$means[keep], each=nrow(xcurve))
scale <- drop(exp(temp %*% coef(coxfit)[keep]))
} else {
vmat <- coxfit$var
temp <- xcurve - rep(coxfit$means, each=nrow(xcurve))
scale <- drop(exp(temp %*% coef(coxfit))) # 1/exp((xbar -z)' beta)
}
# The coxsurv routines return all the pieces that we need
if (ny==2) {
sort2 <- order(strata, Y[,1])
cfit <- .Call(Ccoxsurv3, Y, X, strata, risk, wt, sort2- 1L,
as.integer(coxfit$method=="efron"))
} else {
sort2 <- order(strata, Y[,2])
sort1 <- order(strata, Y[,1])
cfit <- .Call(Ccoxsurv4, Y, wt, sort1, sort2, strata,
X, fit$linear.predictor)
}
if (is.null(object$start.time)) start.time <- min(0, Y[,1])
else start.time <- object$start.time
if (!is.null(object$start.time) && any(cfit$time < object$start.time)) {
# trim out information before the first time
keep <- which(cfit$time >= object$start.time)
cfit$time <- cfit$time[keep]
cfit$strata <- cfit$strata[keep]
cfit$count <- cfit$count[keep,, drop=FALSE]
cfit$xbar <- cfit$xbar[keep,, drop=FALSE]
}
<<residuals.survfitcox2>>
}
@
The coxsurv routines have returned the score residuals $r$;
the dfbeta residuals are $D= r V$ where
$V$ is the variance matrix from the coxph fit.
The product $r V (\xbar(t) - z)'$ is an $n,p$ matrix times $p,p$ matrix
times $p,d$, where $d$ is the number of unique event times.
This is a big matrix multiplication, $O(np^2) + O(npd)$.
To make this routine reasonably fast, we want to avoid anything that is
$O(nd)$. The key idea is that the final result will only be \emph{reported} at
a small number of event times $m$ = length(times).
Look for an algorithm whose dominating term is $O(nm) + O(d)$.
For the cumulative hazard we have the cumulative sum of $dM_i(t)/d(t; z)$;
the numerator does not depend on $z$.
The hazard portion is exp(linear.predictor[i]) times the
cumulative sum of the hazard $h(t; x_0)$, where $x_0$ is the means component
of the coxph fit.
\begin{itemize}
\item terma1 = $1/d(t;x_0)$ is the increment to the derivative for any
observation with an event at event time $t$
\item terma2 = $h(t;x_0)/d(t;x_0) = dN(t)/d^2(t;x_0)$ is the scaled increment
to the hazard at time $t$
\item terma3 = cumulative sum of terma2
\end{itemize}
The \code{cfit\$count} matrix has $dN$ in column 4 and $d$ in columns 3 and 7,
the latter has an Efron correction (identical to column 3 if ties= breslow).
Scaling term a1 to a given $z$ involves division by $\exp((z- x_0)\beta)$;
there is no additional per-subject correction. Term a2 has an additional
per-subject multiplier of exp(linear.predictor) to give the per-subject
martingale $M_i$.
Assume an observation over interval $(t_1, t_2)$ and a reporting time $s$.
For the $dM_i$ term in the IJ, our sum goes over the interval
$(\min(t_1,s), \min(t_2,s)]$, open on the left and closed on the right.
Term 1 applies for any death that falls into the interval; add (term3 at
$\min(t_2,s)$ minus term3 at $\min(t_1,s)$) times the risk score.
This gives the first ``$dM_i$'' term of the IJ residual for the observation.
If there are time-dependent covariates the risk score for a subject may differ
from row to row, so defer the collapse on id until a final step.
Think of term B as a long matrix product
$J (R (\xbar(t)-z)' \mathrm{diag}(h(t;z))) K$.
The per-subject risk scores do not appear here.
The inner portion has the large $(n,p)$ by $(p,m)$ matrix multiplication that we
wish to avoid, while $J$ and $K$ are design matrices.
The first adds all rows for a subject, while $K$ gives cumulative sums up to
each of the reporting times. Simply changing the grouping to
$(J R) [(\xbar(t)-z)' \mathrm{diag}(h(t;z))\, K]$
gives an interior multiplication that is the size of the final report.
Strata are a nuisance, since they are stacked end to end in the cfit object.
They can't be packaged as an array since each stratum will usually have a
different number of events. The final result, however, will be an array with
dimensions of subject, reporting times, and z.
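The payoff of the regrouping is easy to see in miniature (illustration only;
the dimensions are arbitrary toy values):
<<example-grouping>>=
# regrouping a matrix product to avoid the large n x d intermediate
set.seed(1)
n <- 200; p <- 3; d <- 150; m <- 4
J <- matrix(rbinom(50*n, 1, .05), 50, n)  # collapse n rows to 50 subjects
R <- matrix(rnorm(n*p), n, p)
Z <- matrix(rnorm(p*d), p, d)
K <- matrix(rbinom(d*m, 1, .5), d, m)     # cumulate to m reporting times
a <- J %*% (R %*% Z) %*% K    # forms an n x d intermediate
b <- (J %*% R) %*% (Z %*% K)  # intermediates are only 50 x p and p x m
all.equal(a, b)               # TRUE, with far less work for b
@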
<<residuals.survfitcox2>>=
# index1 = the index in cfit of the largest event time <= min(t,s), in the same
# strata. The result might be 0 (someone censored before the first event)
index1 <- sapply(times, function(x)
neardate(strata, cfit$strata, pmin(Y[,ny-1], x), cfit$time, best="prior",
nomatch= 0))
index1 <- matrix(index1, n) # helps debug, but doesn't change computation
# index0 = index1 or 0: 0 if the interval does not contain a death for subject
# i. If nonzero it will be the interval (in cfit) in which that death falls.
index0 <- ifelse(Y[,ny] & (Y[,ny-1] <= c(0, times)[1L + index1]), index1, 0)
# The function below gets called twice in the AUC case, once for others,
# h will be a matrix
s2addup <- function(h, scale) {
H2 <- residcsum(h/cfit$count[,7], cfit$strata)
# Terms for the cumhaz
term1a <- outer(c(0, 1/cfit$count[,7])[index0 +1L], scale, '*')
if (ny ==2) term1b <- risk * rbind(0, H2)[index1 + 1L,]
else {
index2 <- sapply(times, function(x)
neardate(strata, cfit$strata, pmin(Y[,1], x), cfit$time,
best="prior", nomatch= 0))
term1b <- risk*( rbind(0,H2)[index1 + 1L,] - rbind(0,H2)[index2 + 1L,])
}
term1 <- term1a - term1b
# Now term 2, the effect of each obs on beta
# By definition we won't have any reporting times before
nvar <- ncol(cfit$xbar)
ustrat <- unique(strata)
indx3 <- neardate(rep(unique(cfit$strata), each=ntime), cfit$strata,
times, cfit$time, best= "prior", nomatch=0)
term2 <- array(0., dim= c(n, ntime, ncurve))
for (k in 1:ncurve) {
# Can't do this part all at once (though it might be possible)
temp <- residcsum((cfit$xbar - rep(xcurve[k,], each= nrow(cfit$xbar)))*
h[,k], cfit$strata)
term2[,,k] <- cfit$sresid %*% (vmat %*% t(rbind(0, temp)[indx3 + 1L,]))
}
if (ncurve >1) array(c(term1) - c(term2), dim=c(n, ntime, ncurve))
else matrix(c(term1) - c(term2), n)
}
haz <- outer(cfit$count[,4]/cfit$count[,7], scale) # hazard for each curve
IJ <- s2addup(haz, scale) # IJ for the cumulative hazard
if (type == "pstate") {
# Each residual has to be multiplied by the appropriate survival value from
# the survival curve.
# First find the row in object$surv
if (nstrat == 1) {
srow <- findInterval(times, object$time, left.open=FALSE)
# IJ[,k,,,] is multiplied by srow[k], so replicate as needed
srow <- rep(srow, each= dim(IJ)[1])
} else {
srow <- neardate(rep(ustrat, each=length(times)),
rep(ustrat, object$strata), times, object$time,
best= "prior")
# srow has the indices for strata 1, then strata 2, ...
temp <- matrix(srow, ncol=ntime, byrow=TRUE)
srow <- c(temp[strata,]) # each row of IJ matched to the right strata
}
if (ncurve==1) surv = object$surv[srow]
else surv <- c(object$surv[srow,])
IJ <- -surv * IJ # if an obs increases the hazard, it decreases survival
}
else if (type=="auc") {
events <- (object$n.event > 0) # ignore censored rows in the survival curve
# create the AUC weighted hazard, using the survival curve
if (nstrat ==1) delta <- diff(c(start.time, object$time[events]))
else delta <- unlist(lapply(1:nstrat, function(i) {
temp <- object[i]
diff(c(start.time, temp$time[temp$n.event>0]))
}))
auc <- residcsum(delta*object$surv[events], strata)
# weighted hazard
wthaz <- residcsum(auc* haz, strata)
IJ2 <- s2addup(wthaz, h2)
# I need the AUC at each reporting time, which may not match any of the
# event times
}
# Now, collapse the rows to be one per subject per strata
# (the rowsum function is fast, so use it)
if (collapse && !is.null(id) && any(duplicated(cbind(id, strata)))) {
temp <- matrix(IJ, nrow= dim(IJ)[1]) # make it appear to be a matrix
if (nstrat ==1) temp <- rowsum(temp, id, reorder=FALSE)
else {
uid <- unique(id)
dummy <- match(id, uid) + (1 + length(uid))* match(strata, ustrat)
temp <- rowsum(temp, dummy, reorder= FALSE)
}
IJ <- array(temp, dim= c(nrow(temp), dim(IJ)[-1]))
if (nstrat >1)
attr(IJ, "strata") <- strata[!duplicated(cbind(id, strata))]
idx <- id[!duplicated(cbind(id, strata))]
} else {
if (is.null(id)) idx <- seq.int(dim(IJ)[1]) else idx <- id
}
if (is.matrix(IJ)) dimnames(IJ) <- list(id= idx, time= times)
else dimnames(IJ) <- list(id= idx, time=times, NULL)
IJ
@
\section{Accelerated Failure Time models}
The [[survreg]] function fits parametric failure time models.
This includes accelerated failure time models using the Weibull, log-normal,
and log-logistic distributions.
It also fits censored linear regression; with left censoring
this is referred to in economics as \emph{Tobit} regression.
\subsection{Residuals}
The residuals for a [[survreg]] model are one of several types
\begin{description}
\item[response] residual [[y]] value on the scale of the original data
\item[deviance] an approximate deviance residual. A very bad idea
statistically, retained for the sake of backwards compatibility.
\item[dfbeta] a matrix with one row per observation and one column per
parameter showing the approximate influence of each observation on
the final parameter value
\item[dfbetas] the dfbeta residuals scaled by the standard error of
each coefficient
\item[working] residuals on the scale of the linear predictor
\item[ldcase] likelihood displacement wrt case weights
\item[ldresp] likelihood displacement wrt response changes
\item[ldshape] likelihood displacement wrt changes in shape
\item[matrix] matrix of derivatives of the log-likelihood wrt parameters
\end{description}
The other parameters are
\begin{description}
\item[rsigma] whether the scale parameters should be included in the
result for dfbeta results. I can think of no reason why one would not
want them --- unless of course the scale was fixed by the user, in
which case there is no parameter.
\item[collapse] optional vector of subject identifiers. This is for the
case where a subject has multiple observations in a data set, and one
wants to have residuals per subject rather than residuals per observation.
\item[weighted] whether the residuals should be multiplied by the case
weights. The sum of weighted residuals will be zero.
\end{description}
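As a quick usage sketch of the types and arguments above (an illustration
only, not part of the package code), consider dfbeta residuals from a
Weibull fit to the \code{lung} data that ships with the package:
<<example-survreg-dfbeta>>=
library(survival)
fit <- survreg(Surv(time, status) ~ age + ph.ecog, data= lung)
db <- residuals(fit, type= "dfbeta")
dim(db)      # one row per observation; columns = coefficients + log(scale)
colSums(db)  # each column sums to approximately zero at the solution
@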
The routine starts with standard bookkeeping, checking arguments for
validity and the like.
The two cases of response or working residuals require
much less computation and are the most common calls, so they are
taken care of first.
<<residuals.survreg>>=
#
# Residuals for survreg objects
residuals.survreg <- function(object, type=c('response', 'deviance',
'dfbeta', 'dfbetas', 'working', 'ldcase',
'ldresp', 'ldshape', 'matrix'),
rsigma =TRUE, collapse=FALSE, weighted=FALSE, ...) {
type <-match.arg(type)
n <- length(object$linear.predictors)
Terms <- object$terms
if(!inherits(Terms, "terms"))
stop("invalid terms component of object")
# If the variance wasn't estimated then it has no error
if (nrow(object$var) == length(object$coefficients)) rsigma <- FALSE
# If there was a cluster directive in the model statement then remove
# it. It does not correspond to a coefficient, and would just confuse
# things later in the code.
cluster <- untangle.specials(Terms,"cluster")$terms
if (length(cluster) >0 )
Terms <- Terms[-cluster]
strata <- attr(Terms, 'specials')$strata
intercept <- attr(Terms, "intercept")
response <- attr(Terms, "response")
weights <- object$weights
if (is.null(weights)) weighted <- FALSE
<<rsr-data>>
<<rsr-dist>>
<<rsr-resid>>
<<rsr-finish>>
}
@
First retrieve the distribution, which is used multiple times.
The common case is a character string pointing to some element of
[[survreg.distributions]]; the alternative is a user-supplied
list of the same form.
Some distributions are defined as the transform of another in which
case we need to set [[itrans]] and [[dtrans]] and follow the link,
otherwise the transformation and its inverse are the identity.
<<rsr-dist>>=
if (is.character(object$dist))
dd <- survreg.distributions[[object$dist]]
else dd <- object$dist
ytype <- attr(y, "type")
if (is.null(dd$itrans)) {
itrans <- dtrans <-function(x)x
# reprise the work done in survreg to create a transformed y
if (ytype=='left') y[,2] <- 2- y[,2]
else if (ytype=='interval' && all(y[,3]<3)) y <- y[,c(1,3)]
}
else {
itrans <- dd$itrans
dtrans <- dd$dtrans
# reprise the work done in survreg to create a transformed y
tranfun <- dd$trans
exactsurv <- y[,ncol(y)] ==1
if (any(exactsurv)) logcorrect <-sum(log(dd$dtrans(y[exactsurv,1])))
if (ytype=='interval') {
if (any(y[,3]==3))
y <- cbind(tranfun(y[,1:2]), y[,3])
else y <- cbind(tranfun(y[,1]), y[,3])
}
else if (ytype=='left')
y <- cbind(tranfun(y[,1]), 2-y[,2])
else y <- cbind(tranfun(y[,1]), y[,2])
}
if (!is.null(dd$dist)) dd <- survreg.distributions[[dd$dist]]
deviance <- dd$deviance
dens <- dd$density
@
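The Weibull entry of [[survreg.distributions]] illustrates the pattern just
described; a sketch for interactive exploration (print the list itself to
verify the exact contents):
<<example-weibull-dist>>=
library(survival)
names(survreg.distributions$weibull) # includes trans, dtrans, itrans, dist
survreg.distributions$weibull$dist   # "extreme": the link that is followed
@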
The next task is to decide what data we need. The response
is always needed, but is normally saved as a part of the
model. If it is a transformed distribution such as the
Weibull (a transform of the extreme value) the saved object
[[y]] is the transformed data, so we need to replicate that
part of the survreg() code.
(Why did I even allow for y=F in survreg? Because I was
mimicking the lm function --- oh the long, long consequences of
a design decision.)
The covariate matrix [[x]] will be needed for all but
response, deviance, and working residuals.
If the model
included a strata() term then there will be multiple scales,
and the strata variable needs to be recovered.
The variable [[sigma]] is set to a scalar if there are no
strata, but otherwise to a vector with [[n]] elements containing
the appropriate scale for each subject.
The leverage type residuals all need the second derivative
matrix. If there was a [[cluster]] statement in the model this
will be found in [[naive.var]], otherwise in the [[var]]
component.
<<rsr-data>>=
if (is.null(object$naive.var)) vv <- object$var
else vv <- object$naive.var
need.x <- is.na(match(type, c('response', 'deviance', 'working')))
if (is.null(object$y) || !is.null(strata) || (need.x & is.null(object[['x']])))
mf <- stats::model.frame(object)
if (is.null(object$y)) y <- model.response(mf)
else y <- object$y
if (!is.null(strata)) {
temp <- untangle.specials(Terms, 'strata', 1)
Terms2 <- Terms[-temp$terms]
if (length(temp$vars)==1) strata.keep <- mf[[temp$vars]]
else strata.keep <- strata(mf[,temp$vars], shortlabel=TRUE)
strata <- as.numeric(strata.keep)
nstrata <- max(strata)
sigma <- object$scale[strata]
}
else {
Terms2 <- Terms
nstrata <- 1
sigma <- object$scale
}
if (need.x) {
x <- object[['x']] #don't grab xlevels component
if (is.null(x))
x <- model.matrix(Terms2, mf, contrasts.arg=object$contrasts)
}
@
The most common residual is type response, which requires almost
no more work; for the others we need to create the matrix of
derivatives before proceeding.
We use the [[center]] component from the deviance function for the
distribution, which returns the data point [[y]] itself for an
exact, left, or right censored observation, and an appropriate
midpoint for interval censored ones.
<<rsr-resid>>=
if (type=='response') {
yhat0 <- deviance(y, sigma, object$parms)
rr <- itrans(yhat0$center) - itrans(object$linear.predictors)
}
else {
<<rtr-deriv>>
<<rtr-resid2>>
}
@
The matrix of derivatives is used in all of the other cases.
The starting point is the [[density]] function of the distribution,
which returns a matrix with columns of
$F(x)$, $1-F(x)$, $f(x)$, $f'(x)/f(x)$ and $f''(x)/f(x)$. %'
The matrix type residual contains columns for each of
$$
L_i \quad \frac{\partial L_i}{\partial \eta_i}
\quad \frac{\partial^2 L_i}{\partial \eta_i^2}
\quad \frac{\partial L_i}{\partial \log(\sigma)}
\quad \frac{\partial^2 L_i}{\partial \log(\sigma)^2}
\quad \frac{\partial^2 L_i}{\partial \eta \partial\log(\sigma)}
$$
where $L_i$ is the contribution to the log-likelihood from each
individual.
Note that if there are multiple scales, i.e. a strata() term in the
model, then terms 3--6 are the derivatives for that subject with
respect to their \emph{particular} scale factor; derivatives with
respect to all the other scales are zero for that subject.
The log-likelihood can be written as
\begin{align*}
L &= \sum_{exact}\left[ \log(f(z_i)) -\log(\sigma_i) \right] +
\sum_{censored} \log \left( \int_{z_i^l}^{z_i^u} f(u)du \right) \\
&\equiv \sum_{exact}\left[g_1(z_i) -\log(\sigma_i) \right] +
\sum_{censored} \log(g_2(z_i^l, z_i^u)) \\
z_i &= (y_i - \eta_i)/ \sigma_i
\end{align*}
For the interval censored observations we have a $z$ defined at both the
lower and upper endpoints.
The linear predictor is $\eta = X\beta$.
The derivatives are shown below.
Note that $f(-\infty) = f(\infty) = F(-\infty)=0$,
$F(\infty)=1$, $z^u = \infty$ for a right censored observation
and $z^l = -\infty$ for a left censored one.
\begin{align*}
\frac{\partial g_1}{\partial \eta} &= - \frac{1}{\sigma}
\left[\frac{f'(z)}{f(z)} \right] \\ %'
\frac{\partial g_2}{\partial \eta} &= - \frac{1}{\sigma} \left[
\frac{f(z^u) - f(z^l)}{F(z^u) - F(z^l)} \right] \\
\frac{\partial^2 g_1}{\partial \eta^2} &= \frac{1}{\sigma^2}
\left[ \frac{f''(z)}{f(z)} \right]
- (\partial g_1 / \partial \eta)^2 \\
\frac{\partial^2 g_2}{\partial \eta^2} &= \frac{1}{\sigma^2} \left[
\frac{f'(z^u) - f'(z^l)}{F(z^u) - F(z^l)} \right]
- (\partial g_2 / \partial \eta)^2 \\
\frac{\partial g_1}{\partial \log\sigma} &= - \left[
\frac{zf'(z)}{f(z)} \right] \\
\frac{\partial g_2}{\partial \log\sigma} &= - \left[
\frac{z^uf(z^u) - z^lf(z^l)}{F(z^u) - F(z^l)} \right] \\
\frac{\partial^2 g_1}{\partial (\log\sigma)^2} &= \left[
\frac{z^2 f''(z) + zf'(z)}{f(z)} \right]
- (\partial g_1 / \partial \log\sigma)^2 \\
\frac{\partial^2 g_2}{\partial (\log\sigma)^2} &= \left[
\frac{(z^u)^2 f'(z^u) - (z^l)^2f'(z^l) }
{F(z^u) - F(z^l)} \right]
- \partial g_2 /\partial \log\sigma \,(1+\partial g_2 / \partial \log\sigma) \\
\frac{\partial^2 g_1}{\partial \eta \partial \log\sigma} &=
\frac{zf''(z)}{\sigma f(z)}
-\partial g_1/\partial \eta (1 + \partial g_1/\partial \log\sigma) \\
\frac{\partial^2 g_2}{\partial \eta \partial \log\sigma} &=
\frac{z^uf'(z^u) - z^lf'(z^l)}{\sigma [F(z^u) - F(z^l)]}
-\partial g_2/\partial \eta (1 + \partial g_2/\partial \log\sigma) \\
\end{align*}
In the code [[z]] is the relevant point for exact, left, or right
censored data, and [[z2]] the upper endpoint for an interval censored one.
The variable [[tdenom]] contains the denominator for each subject (which
is the same for all derivatives for that subject).
For an interval censored observation we try to avoid numeric cancellation
by using the appropriate tail of the distribution.
For instance with $(z^l, z^u) = (12,15)$ the value of $F(x)$ will be very
near 1 and it is better to subtract two upper tail values $(1-F)$ than
two lower tail ones $F$.
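The cancellation is easy to demonstrate; the sketch below uses the normal
distribution as a stand-in, and the survreg densities should behave
similarly:
<<example-tails>>=
pnorm(15) - pnorm(12)            # 0: both lower tails round to exactly 1
pnorm(12, lower.tail= FALSE) -
    pnorm(15, lower.tail= FALSE) # about 1.8e-33: precision is retained
@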
<<rtr-deriv>>=
status <- y[,ncol(y)]
eta <- object$linear.predictors
z <- (y[,1] - eta)/sigma
dmat <- dens(z, object$parms)
dtemp<- dmat[,3] * dmat[,4] #f'
if (any(status==3)) {
z2 <- (y[,2] - eta)/sigma
dmat2 <- dens(z2, object$parms)
}
else {
dmat2 <- dmat #dummy values
z2 <- 0
}
tdenom <- ((status==0) * dmat[,2]) + #right censored
((status==1) * 1 ) + #exact
((status==2) * dmat[,1]) + #left
((status==3) * ifelse(z>0, dmat[,2]-dmat2[,2],
dmat2[,1] - dmat[,1])) #interval
g <- log(ifelse(status==1, dmat[,3]/sigma, tdenom)) #loglik
tdenom <- 1/tdenom
dg <- -(tdenom/sigma) *(((status==0) * (0-dmat[,3])) + #dg/ eta
((status==1) * dmat[,4]) +
((status==2) * dmat[,3]) +
((status==3) * (dmat2[,3]- dmat[,3])))
ddg <- (tdenom/sigma^2) *(((status==0) * (0- dtemp)) + #ddg/eta^2
((status==1) * dmat[,5]) +
((status==2) * dtemp) +
((status==3) * (dmat2[,3]*dmat2[,4] - dtemp)))
ds <- ifelse(status<3, dg * sigma * z,
tdenom*(z2*dmat2[,3] - z*dmat[,3]))
dds <- ifelse(status<3, ddg* (sigma*z)^2,
tdenom*(z2*z2*dmat2[,3]*dmat2[,4] -
z * z*dmat[,3] * dmat[,4]))
dsg <- ifelse(status<3, ddg* sigma*z,
tdenom *(z2*dmat2[,3]*dmat2[,4] - z*dtemp))
deriv <- cbind(g, dg, ddg=ddg- dg^2,
ds = ifelse(status==1, ds-1, ds),
dds=dds - ds*(1+ds),
dsg=dsg - dg*(1+ds))
@
Now we can calculate the actual residuals case by case.
For the dfbetas there will be one column per coefficient,
so if there are strata column 4 of the deriv matrix needs
to be \emph{un}collapsed into a matrix with nstrata columns.
The same manipulation is needed for the ld residuals.
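The uncollapse step is a one-liner via R's matrix-index assignment; here is
a small sketch with made-up values:
<<example-uncollapse>>=
n <- 5; nstrata <- 2
strata <- c(1, 2, 1, 2, 2)       # stratum for each observation
d4 <- matrix(0., nrow=n, ncol=nstrata)
d4[cbind(1:n, strata)] <- c(.1, .2, .3, .4, .5)  # stand-in for deriv[,4]
d4   # one nonzero entry per row, in that subject's stratum column
@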
<<rtr-resid2>>=
if (type=='deviance') {
yhat0 <- deviance(y, sigma, object$parms)
rr <- (-1)*deriv[,2]/deriv[,3] #working residuals
rr <- sign(rr)* sqrt(2*(yhat0$loglik - deriv[,1]))
}
else if (type=='working') rr <- (-1)*deriv[,2]/deriv[,3]
else if (type=='dfbeta' || type== 'dfbetas' || type=='ldcase') {
score <- deriv[,2] * x # score residuals
if (rsigma) {
if (nstrata > 1) {
d4 <- matrix(0., nrow=n, ncol=nstrata)
d4[cbind(1:n, strata)] <- deriv[,4]
score <- cbind(score, d4)
}
else score <- cbind(score, deriv[,4])
}
rr <- score %*% vv
# cause column names to be retained
# old: if (type=='dfbetas') rr[] <- rr %*% diag(1/sqrt(diag(vv)))
if (type=='dfbetas') rr <- rr * rep(1/sqrt(diag(vv)), each=nrow(rr))
if (type=='ldcase') rr<- rowSums(rr*score)
}
else if (type=='ldresp') {
rscore <- deriv[,3] * (x * sigma)
if (rsigma) {
if (nstrata >1) {
d6 <- matrix(0., nrow=n, ncol=nstrata)
d6[cbind(1:n, strata)] <- deriv[,6]*sigma
rscore <- cbind(rscore, d6)
}
else rscore <- cbind(rscore, deriv[,6] * sigma)
}
temp <- rscore %*% vv
rr <- rowSums(rscore * temp)
}
else if (type=='ldshape') {
sscore <- deriv[,6] *x
if (rsigma) {
if (nstrata >1) {
d5 <- matrix(0., nrow=n, ncol=nstrata)
d5[cbind(1:n, strata)] <- deriv[,5]
sscore <- cbind(sscore, d5)
}
else sscore <- cbind(sscore, deriv[,5])
}
temp <- sscore %*% vv
rr <- rowSums(sscore * temp)
}
else { #type = matrix
rr <- deriv
}
@
Finally the two optional steps of adding case weights and
collapsing over subject id.
<<rsr-finish>>=
#case weights
if (weighted) rr <- rr * weights
#Expand out the missing values in the result
if (!is.null(object$na.action)) {
rr <- naresid(object$na.action, rr)
if (is.matrix(rr)) n <- nrow(rr)
else n <- length(rr)
}
# Collapse if desired
if (!missing(collapse)) {
if (length(collapse) !=n) stop("Wrong length for 'collapse'")
rr <- drop(rowsum(rr, collapse))
}
rr
@
\section{Survival curves}
The survfit function was set up as a method so that we could apply the
function to both formulas (to compute the Kaplan-Meier) and to coxph
objects.
The downside to this is that the manual pages get a little odd, but from
a programming perspective it was a good idea.
At one time, long long ago, we allowed the function to be called with
``Surv(time, status)'' as the formula, i.e., without a tilde. That was
a bad idea, now abandoned.
A note on times: one of the things that drove me nuts was the problem of
``tied but not quite tied'' times.
As an example consider two values of 24173 = 23805 + 368. These are values from
an actual study with times in days.
However, the user chose to use age in years and saved those values out
in a CSV file; the left hand side of the above equation becomes
66.18206708000000 while the right hand side addition yields 66.18206708000001.
The R phrase \code{unique(x)} sees these two values as distinct, but
\code{table(x)} and \code{tapply} see them as a single value since they
first apply \code{factor} to the values, and that in turn uses
\code{as.character}.
A transition through CSV is not necessary to create the problem:
<<test>>=
tfun <- function(start, gap) {
as.numeric(start)/365.25 - as.numeric(start + gap)/365.25
}
test <- logical(200)
for (i in 1:200) {
test[i] <- tfun(as.Date("2010/01/01"), 29) ==
tfun(as.Date("2010/01/01") + i, 29)
}
table(test)
@
The number of FALSE entries in the table depends on machine, compiler,
and a host of other issues.
There is discussion of this general issue in the R FAQ: ``why doesn't R
think these numbers are equal''.
The Kaplan-Meier and Cox model both pay careful attention to ties, and
so both now use the \code{aeqSurv} routine to first preprocess
the time data. It uses the same rules as \code{all.equal} to
adjudicate ties and near ties. See the vignette on tied times for more
detail.
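Here is a small sketch of the repair, reusing the day counts from above;
the FALSE comparison holds on most platforms, though it is not guaranteed
on all of them:
<<example-aeqsurv>>=
library(survival)
t1 <- 24173/365.25               # the same day count, two computation paths
t2 <- 23805/365.25 + 368/365.25
t1 == t2                         # FALSE on most platforms
y <- aeqSurv(Surv(c(t1, t2), c(1, 1)))
y[1, 1] == y[2, 1]               # TRUE: the near tie is now a true tie
@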
<<survfit>>=
survfit <- function(formula, ...) {
UseMethod("survfit")
}
<<survfit-formula>>
<<survfit-subscript>>
<<survfit-Surv>>
@
The result of a survival curve will have a \code{surv} or \code{pstate}
component that is a vector or a matrix, and an optional strata component.
From a user's point of view this is an object with [strata, newdata, state]
as dimensions, where only 1, 2 or all three of these may appear.
The first is always present, and is essentially the number of distinct
curves created by the right-hand side of the equation (or by the strata in
a coxph model).
The newdata portion appears for survival curves from a Cox model, when curves
for multiple covariate patterns were requested;
the state portion only from a multi-state model; or both for a multi-state
Cox model.
The rows of the \code{surv} component correspond to the time points of the
first stratum, then the second, third, etc., stacked one above the other.
As with R matrices, if only 1 subscript is given for an array or matrix of
curves, we treat the collection of curves as a vector of curves.
We need to make sure that the new object has all the elements of the returned
object in the same order as the original --- users count on this.
The dimension of a survival curve is closely tied to the number of rows in
newdata, but isn't exactly that. The most common mismatch is when newdata has
only 1 row: the curves omit that dimension.
A newdata with one row per stratum is another exception.
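A usage sketch of the dimension and subscript methods that follow, using
the \code{lung} data (ph.ecog has four levels, hence four curves):
<<example-subscript>>=
library(survival)
fit <- survfit(Surv(time, status) ~ ph.ecog, data= lung)
dim(fit)        # strata: 4
fit2 <- fit[2]  # keep only the second curve
dim(fit2)       # NULL: a single unstratified curve drops the dimension
@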
<<survfit-subscript>>=
dim.survfit <- function(x) {
d1name <- "strata"
d2name <- "data"
d3name <- "states"
if (is.null(x$strata)) {d1 <- d1name <- NULL} else d1 <- length(x$strata)
# d3 is present for a survfitms object, null otherwise
if (is.null(x$states)) {
d3 <- d3name <- NULL
if (is.matrix(x$surv)) d2 <- ncol(x$surv)
else {d2 <- d2name <- NULL}
} else {
d3 <- length(x$states)
dp <- dim(x$pstate)
if (length(dp) ==3) d2 <- dp[2]
else {d2 <- d2name <- NULL}
}
dd <- c(d1, d2, d3)
names(dd) <- c(d1name, d2name, d3name)
dd
}
# there is a separate subscript function for survfitms objects
"[.survfit" <- function(x, ... , drop=TRUE) {
nmatch <- function(indx, target) {
# This function lets R worry about character, negative, or
# logical subscripts.
# It always returns a set of positive integer indices
temp <- 1:length(target)
names(temp) <- target
temp[indx]
}
if (!inherits(x, "survfit")) stop("[.survfit called on non-survfit object")
ndots <- ...length() # the simplest, but not avail in R 3.4
# ndots <- length(list(...))# fails if any are missing, e.g. fit[,2]
# ndots <- if (missing(drop)) nargs()-1 else nargs()-2 # a workaround
dd <- dim(x)
# for dd=NULL, an object with only one curve, x[1] is always legal
if (is.null(dd)) dd <- c(strata=1L) # survfit object with only one curve
dtype <- match(names(dd), c("strata", "data", "states"))
if (ndots >0 && !missing(..1)) i <- ..1 else i <- NULL
if (ndots> 1 && !missing(..2)) j <- ..2 else j <- NULL
if (ndots > length(dd))
stop("incorrect number of dimensions")
if (length(dtype) > 2) stop("invalid survfit object") # should never happen
if (is.null(i) && is.null(j)) {
# called with no subscripts given -- return x untouched
return(x)
}
# Code below is easier if "i" is always the strata
if (dtype[1] !=1) {
dtype <- c(1, dtype)
j <- i; i <- NULL
dd <- c(1, dd)
ndots <- ndots +1
}
# We need to make a new one
newx <- vector("list", length(x))
names(newx) <- names(x)
for (k in c("logse", "version", "conf.int", "conf.type", "type", "call"))
if (!is.null(x[[k]])) newx[[k]] <- x[[k]]
class(newx) <- class(x)
if (ndots== 1 && length(dd)==2) {
# one subscript given for a two dimensional object
# If one of the dimensions is 1, it is easier for me to fill in i and j
if (dd[1]==1) {j <- i; i<- 1}
else if (dd[2]==1) j <- 1
else {
# the user has a mix of rows/cols
index <- 1:prod(dd)
itemp <- matrix(index, nrow=dd[1])
keep <- itemp[i] # illegal subscripts will generate an error
if (length(keep) == length(index) && all(keep==index)) return(x)
ii <- row(itemp)[keep]
jj <- col(itemp)[keep]
# at this point we have a matrix subscript of (ii, jj)
# expand into a long pair of rows and cols
temp <- split(seq(along.with=x$time),
rep(1:length(x$strata), x$strata))
indx1 <- unlist(temp[ii]) # rows of the surv object
indx2 <- rep(jj, x$strata[ii])
# return with each curve as a separate strata
newx$n <- x$n[ii]
for (k in c("time", "n.risk", "n.event", "n.censor", "n.enter"))
if (!is.null(x[[k]])) newx[[k]] <- (x[[k]])[indx1]
k <- cbind(indx1, indx2)
for (j in c("surv", "std.err", "upper", "lower", "cumhaz",
"std.chaz", "influence.surv", "influence.chaz"))
if (!is.null(x[[j]])) newx[[j]] <- (x[[j]])[k]
temp <- x$strata[ii]
names(temp) <- 1:length(ii)
newx$strata <- temp
return(newx)
}
}
# irow will be the rows that need to be taken
# j the columns (if present)
if (is.null(x$strata)) {
if (is.null(i) || all(i==1)) irow <- seq(along.with=x$time)
else stop("subscript out of bounds")
newx$n <- x$n
}
else {
if (is.null(i)) indx <- seq(along.with= x$strata)
else indx <- nmatch(i, names(x$strata)) #strata to keep
if (any(is.na(indx)))
stop(paste("strata",
paste(i[is.na(indx)], collapse=' '),
'not matched'))
# Now, indx may not be in order: some can use curve[3:2] to reorder
# The list/unlist construct will reorder the data
temp <- split(seq(along.with =x$time),
rep(1:length(x$strata), x$strata))
irow <- unlist(temp[indx])
newx$n <- x$n[indx]
if (length(indx) <= 1 && drop) newx$strata <- NULL
else newx$strata <- x$strata[indx]
}
if (!is.matrix(x[["surv"]])) { # no j dimension
for (k in c("time", "n.risk", "n.event", "n.censor", "n.enter",
"surv", "std.err", "cumhaz", "std.chaz", "upper", "lower",
"influence.surv", "influence.chaz"))
if (!is.null(x[[k]])) newx[[k]] <- (x[[k]])[irow]
}
else { # 2 dimensional object
if (is.null(j)) j <- seq.int(ncol(x$surv))
# If the curve has been selected by strata and keep has only
# one row, we don't want to lose the second subscript too
if (length(irow)==1) drop <- FALSE
for (k in c("time", "n.risk", "n.event", "n.censor", "n.enter"))
if (!is.null(x[[k]])) newx[[k]] <- (x[[k]])[irow]
for (k in c("surv", "std.err", "cumhaz", "std.chaz", "upper", "lower",
"influence.surv", "influence.chaz"))
if (!is.null(x[[k]])) newx[[k]] <- (x[[k]])[irow, j, drop=drop]
# for a survfit.coxph object, newdata is a data frame whose rows match j
if (!is.null(x[["newdata"]])) newx[["newdata"]] <- x[["newdata"]][j,]
}
newx
}
@
\subsection{Kaplan-Meier}
The most common use of the survfit function is with a formula as the first
argument, and the most common outcome of such a call is a Kaplan-Meier
curve.
The id argument is from an older version of the competing risks code; most
people will use [[cluster(id)]] in the formula instead.
The istate argument only applies to competing risks, but don't print
an error message if it is accidentally there.
<<survfit-formula>>=
survfit.formula <- function(formula, data, weights, subset,
na.action, stype=1, ctype=1,
id, cluster, robust, istate,
timefix=TRUE, etype, model=FALSE, error, ...) {
Call <- match.call()
Call[[1]] <- as.name('survfit') #make nicer printout for the user
<<survfit.formula-getdata>>
# Deal with the near-ties problem
if (!is.logical(timefix) || length(timefix) > 1)
stop("invalid value for timefix option")
if (timefix) newY <- aeqSurv(Y) else newY <- Y
if (missing(robust)) robust <- NULL
# Call the appropriate helper function
if (attr(Y, 'type') == 'left' || attr(Y, 'type') == 'interval')
temp <- survfitTurnbull(X, newY, casewt, cluster= cluster,
robust= robust, ...)
else if (attr(Y, 'type') == "right" || attr(Y, 'type')== "counting")
temp <- survfitKM(X, newY, casewt, stype=stype, ctype=ctype, id=id,
cluster=cluster, robust=robust, ...)
else if (attr(Y, 'type') == "mright" || attr(Y, "type")== "mcounting")
temp <- survfitCI(X, newY, weights=casewt, stype=stype, ctype=ctype,
id=id, cluster=cluster, robust=robust,
istate=istate, ...)
else {
# This should never happen
stop("unrecognized survival type")
}
# If a stratum had no one beyond start.time, the length 0 gives downstream
# failure, e.g., there is no sensible printout for summary(fit, time= 100)
# for such a curve
temp$strata <- temp$strata[temp$strata >0]
if (is.null(temp$states)) class(temp) <- 'survfit'
else class(temp) <- c("survfitms", "survfit")
if (!is.null(attr(mf, 'na.action')))
temp$na.action <- attr(mf, 'na.action')
if (model) temp$model <- mf
temp$call <- Call
temp
}
@
This chunk of code is shared with resid.survfit
<<survfit.formula-getdata>>=
# create a copy of the call that has only the arguments we want,
# and use it to call model.frame()
indx <- match(c('formula', 'data', 'weights', 'subset','na.action',
'istate', 'id', 'cluster', "etype"), names(Call), nomatch=0)
# It's very hard to trigger the next error message other than by malice,
#  e.g., survfit(wt=Surv(time, status) ~1)
if (indx[1]==0) stop("a formula argument is required")
temp <- Call[c(1, indx)]
temp[[1L]] <- quote(stats::model.frame)
mf <- eval.parent(temp)
Terms <- terms(formula, c("strata", "cluster"))
ord <- attr(Terms, 'order')
if (length(ord) & any(ord !=1))
stop("Interaction terms are not valid for this function")
n <- nrow(mf)
Y <- model.response(mf)
if (inherits(Y, "Surv2")) {
# this is Surv2 style data
# if there are any obs removed due to missing, remake the model frame
if (length(attr(mf, "na.action"))) {
temp$na.action <- na.pass
mf <- eval.parent(temp)
}
if (!is.null(attr(Terms, "specials")$cluster))
stop("cluster() cannot appear in the model statement")
new <- surv2data(mf)
mf <- new$mf
istate <- new$istate
id <- new$id
Y <- new$y
if (anyNA(mf[-1])) { #ignore the response variable still found there
if (missing(na.action)) temp <- get(getOption("na.action"))(mf[-1])
else temp <- na.action(mf[-1])
omit <- attr(temp, "na.action")
mf <- mf[-omit,]
Y <- Y[-omit]
id <- id[-omit]
istate <- istate[-omit]
}
n <- nrow(mf)
}
else {
if (!is.Surv(Y)) stop("Response must be a survival object")
id <- model.extract(mf, "id")
istate <- model.extract(mf, "istate")
}
if (n==0) stop("data set has no non-missing observations")
casewt <- model.extract(mf, "weights")
if (is.null(casewt)) casewt <- rep(1.0, n)
else {
if (!is.numeric(casewt)) stop("weights must be numeric")
if (any(!is.finite(casewt))) stop("weights must be finite")
if (any(casewt <0)) stop("weights must be non-negative")
casewt <- as.numeric(casewt) # transform integer to numeric
}
if (!is.null(attr(Terms, 'offset'))) warning("Offset term ignored")
cluster <- model.extract(mf, "cluster")
temp <- untangle.specials(Terms, "cluster")
if (length(temp$vars)>0) {
if (length(cluster) >0) stop("cluster appears as both an argument and a model term")
if (length(temp$vars) > 1) stop("can not have two cluster terms")
cluster <- mf[[temp$vars]]
Terms <- Terms[-temp$terms]
}
ll <- attr(Terms, 'term.labels')
if (length(ll) == 0) X <- factor(rep(1,n)) # ~1 on the right
else X <- strata(mf[ll])
# Backwards support for the now-deprecated etype argument
etype <- model.extract(mf, "etype")
if (!is.null(etype)) {
if (attr(Y, "type") == "mcounting" ||
attr(Y, "type") == "mright")
stop("cannot use both the etype argument and mstate survival type")
if (length(istate))
stop("cannot use both the etype and istate arguments")
status <- Y[,ncol(Y)]
etype <- as.factor(etype)
temp <- table(etype, status==0)
if (all(rowSums(temp==0) ==1)) {
# The user had a unique level of etype for the censors
newlev <- levels(etype)[order(-temp[,2])] #censors first
}
else newlev <- c(" ", levels(etype)[temp[,1] >0])
status <- factor(ifelse(status==0,0, as.numeric(etype)),
labels=newlev)
if (attr(Y, 'type') == "right")
Y <- Surv(Y[,1], status, type="mstate")
else if (attr(Y, "type") == "counting")
Y <- Surv(Y[,1], Y[,2], status, type="mstate")
else stop("etype argument incompatable with survival type")
}
@
Once upon a time I allowed survfit to be called without the
`\textasciitilde 1' portion of the formula.
This was a mistake for multiple reasons, but the biggest problem is timing.
If the call has a data argument but the first argument is not a formula,
R needs to evaluate Surv(t,s) to know that it is a survival object,
but it also needs to know that this is a survival object before evaluation
in order to dispatch the correct method.
The method below helps give a useful error message in some cases.
<<survfit-Surv>>=
survfit.Surv <- function(formula, ...)
stop("the survfit function requires a formula as its first argument")
@
The last piece in this file is the function to create confidence
intervals. It is called from multiple different places, so it is well to
have one copy.
If $p$ is the survival probability and $s(p)$ its standard error,
we can do confidence intervals on the simple scale of
$ p \pm 1.96 s(p)$, but that does not have very good properties.
Instead use a transformation $y = f(p)$ for which the standard error is
$s(p) f'(p)$, leading to the confidence interval
\begin{equation*}
f^{-1}\left(f(p) \pm 1.96\, s(p)f'(p) \right)
\end{equation*}
Here are the supported transformations.
\begin{center}
\begin{tabular}{rccc}
&$f$& $f'$ & $f^{-1}$ \\ \hline
log & $\log(p)$ & $1/p$ & $ \exp(y)$ \\
log-log & $\log(-\log(p))$ & $1/\left[ p \log(p) \right]$ &
$\exp(-\exp(y)) $ \\
logit & $\log(p/(1-p))$ & $1/[p (1-p)]$ & $1- 1/\left[1+ \exp(y)\right]$ \\
arcsin & $\arcsin(\sqrt{p})$ & $1/(2 \sqrt{p(1-p)})$ &$\sin^2(y)$ \\
\end{tabular} \end{center}
Plain intervals can give limits outside of (0,1); we truncate them when this
happens. The log intervals can give an upper limit greater than 1, but the
lower limit is always valid; the log-log and logit limits are always in
range. The arcsin formula requires truncation in its middle.
In all cases we return NA as the CI for survival=0: it makes the graphs look
better.
Some of the underlying routines compute the standard error of $p$ and some
the standard error of $\log(p)$. The \code{selow} argument is used for the
modified lower limits of Dorey and Korn. When this is used for cumulative
hazards the ulimit arg will be FALSE: no upper limit of 1.
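Here is a numeric sketch of the log-log case, mirroring the code below with
scale of 1 and a standard error already on the $\log(p)$ scale:
<<example-loglog>>=
p <- .8; se <- .1            # se of log(p)
zval <- qnorm(.975)
se2 <- zval* se/log(p)       # negative, since log(p) < 0
c(lower= exp(-exp(log(-log(p)) - se2)),
  upper= exp(-exp(log(-log(p)) + se2)))  # roughly .58 and .91
@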
<<survfit>>=
survfit_confint <- function(p, se, logse=TRUE, conf.type, conf.int,
selow, ulimit=TRUE) {
zval <- qnorm(1- (1-conf.int)/2, 0,1)
if (missing(selow)) scale <- 1.0
else scale <- ifelse(selow==0, 1.0, selow/se) # avoid 0/0 at the origin
if (!logse) se <- ifelse(se==0, 0, se/p) # convert to the se of log(p)
if (conf.type=='plain') {
se2 <- se* p * zval # matches equation 4.3.1 in Klein & Moeschberger
if (ulimit) list(lower= pmax(p -se2*scale, 0), upper = pmin(p + se2, 1))
else list(lower= pmax(p -se2*scale, 0), upper = p + se2)
}
else if (conf.type=='log') {
#avoid some "log(0)" messages
xx <- ifelse(p==0, NA, p)
se2 <- zval* se
temp1 <- exp(log(xx) - se2*scale)
temp2 <- exp(log(xx) + se2)
if (ulimit) list(lower= temp1, upper= pmin(temp2, 1))
else list(lower= temp1, upper= temp2)
}
else if (conf.type=='log-log') {
xx <- ifelse(p==0 | p==1, NA, p)
se2 <- zval * se/log(xx)
temp1 <- exp(-exp(log(-log(xx)) - se2*scale))
temp2 <- exp(-exp(log(-log(xx)) + se2))
list(lower = temp1 , upper = temp2)
}
else if (conf.type=='logit') {
xx <- ifelse(p==0, NA, p) # avoid log(0) messages
se2 <- zval * se *(1 + xx/(1-xx))
temp1 <- 1- 1/(1+exp(log(p/(1-p)) - se2*scale))
temp2 <- 1- 1/(1+exp(log(p/(1-p)) + se2))
list(lower = temp1, upper=temp2)
}
else if (conf.type=="arcsin") {
xx <- ifelse(p==0, NA, p)
se2 <- .5 *zval*se * sqrt(xx/(1-xx))
list(lower= (sin(pmax(0, asin(sqrt(xx)) - se2*scale)))^2,
upper= (sin(pmin(pi/2, asin(sqrt(xx)) + se2)))^2)
}
else stop("invalid conf.int type")
}
@
\subsubsection{C-code}
(This is set up as a separate file in the source code directory since
it is easier to make emacs stay in C-mode if the file has a .nw
extension.)
<<survfitci>>=
#include "survS.h"
#include "survproto.h"
#include <math.h>
SEXP survfitci(SEXP ftime2, SEXP sort12, SEXP sort22, SEXP ntime2,
SEXP status2, SEXP cstate2, SEXP wt2, SEXP id2,
SEXP p2, SEXP i02, SEXP sefit2) {
<<survfitci-declare>>
<<survfitci-compute>>
<<survfitci-return>>
}
@
Arguments to the routine are the following.
For an R object ``zed'' I use the convention of [[zed2]] to refer to the
object and [[zed]] to the contents of the object.
\begin{description}
\item[ftime] A two column matrix containing the entry and exit times
for each subject.
\item[sort1] Order vector for the entry times. The first element of sort1
points to the first entry time, etc.
\item[sort2] Order vector for the event times.
\item[ntime] Number of unique event time values. This fixes the size of
the output arrays.
\item[status] Status for each observation. 0= censored
\item[cstate] The initial state for each subject, which will be
updated during computation to always be the current state.
\item[wt] Case weight for each observation.
\item[id] The subject id for each observation.
\item[p] The initial distribution of states. This will be updated during
computation to be the current distribution.
\item[i0] The initial influence matrix, number of subjects by number of states
\item[sefit] If 1 then do the se computation, if 2 also return the full
influence matrix upon which it is based, if 0 the se is not needed.
\end{description}
Note that code is called with id and not cluster: there is a basic premise that
each id is a single subject and thus has a unique "current state" at any
given time point. The history of this is that before the survcheck routine,
we did not have a good way for a user to normalize the 'current state' variable
for a subject, so this routine takes care of that tracking process.
When multi-state Cox models were added we became more formal about this, and
users can now have data sets with quite odd patterns of transitions and current
state, ones that survcheck calls a teleport. At some point this routine should
be updated as well. Cumulative hazard estimates make at least some sense
when a subject has a hole, though P(state |t) curves do not.
Declare all of the variables.
<<survfitci-declare>>=
int i, j, k, kk; /* generic loop indices */
int ck, itime, eptr; /*specific indices */
double ctime; /*current time of interest, in the main loop */
int oldstate, newstate; /*when changing state */
double temp, *temp2; /* scratch double, and vector of length nstate */
double *dptr; /* reused in multiple contexts */
double *p; /* current prevalence vector */
double **hmat; /* hazard matrix at this time point */
double **umat=0; /* per subject leverage at this time point */
int *atrisk; /* 1 if the subject is currently at risk */
int *ns; /* number currently in each state */
int *nev; /* number of events at this time, by state */
double *ws; /* weighted count of number state */
double *wtp; /* case weights indexed by subject */
double wevent; /* weighted number of events at current time */
int nstate; /* number of states */
int n, nperson; /*number of obs, subjects*/
double **chaz; /* cumulative hazard matrix */
/* pointers to the R variables */
int *sort1, *sort2; /*sort index for entry time, event time */
double *entry,* etime; /*entry time, event time */
int ntime; /* number of unique event time values */
int *status; /*0=censored, 1,2,... new states */
int *cstate; /* current state for each subject */
int *dstate; /* the next state, =cstate if not an event time */
double *wt; /* weight for each observation */
double *i0; /* initial influence */
int *id; /* for each obs, which subject is it */
int sefit;
/* returned objects */
SEXP rlist; /* the returned list and variable names of same */
const char *rnames[]= {"nrisk","nevent","ncensor", "p",
"cumhaz", "std", "influence.pstate", ""};
SEXP setemp;
double **pmat, **vmat=0, *cumhaz, *usave=0; /* =0 to silence -Wall warning */
int *ncensor, **nrisk, **nevent;
@
Now set up pointers for all of the R objects sent to us.
The two that will be updated need to be replaced by duplicates.
<<survfitci-declare>>=
ntime= asInteger(ntime2);
nperson = LENGTH(cstate2); /* number of unique subjects */
n = LENGTH(sort12); /* number of observations in the data */
PROTECT(cstate2 = duplicate(cstate2));
cstate = INTEGER(cstate2);
entry= REAL(ftime2);
etime= entry + n;
sort1= INTEGER(sort12);
sort2= INTEGER(sort22);
status= INTEGER(status2);
wt = REAL(wt2);
id = INTEGER(id2);
PROTECT(p2 = duplicate(p2)); /*copy of initial prevalence */
p = REAL(p2);
nstate = LENGTH(p2); /* number of states */
i0 = REAL(i02);
sefit = asInteger(sefit2);
/* allocate space for the output objects
** Ones that are put into a list do not need to be protected
*/
PROTECT(rlist=mkNamed(VECSXP, rnames));
setemp = SET_VECTOR_ELT(rlist, 0, allocMatrix(INTSXP, ntime, nstate));
nrisk = imatrix(INTEGER(setemp), ntime, nstate); /* time by state */
setemp = SET_VECTOR_ELT(rlist, 1, allocMatrix(INTSXP, ntime, nstate));
nevent = imatrix(INTEGER(setemp), ntime, nstate); /* time by state */
setemp = SET_VECTOR_ELT(rlist, 2, allocVector(INTSXP, ntime));
ncensor = INTEGER(setemp); /* total at each time */
setemp = SET_VECTOR_ELT(rlist, 3, allocMatrix(REALSXP, ntime, nstate));
pmat = dmatrix(REAL(setemp), ntime, nstate);
setemp = SET_VECTOR_ELT(rlist, 4, allocMatrix(REALSXP, nstate*nstate, ntime));
cumhaz = REAL(setemp);
if (sefit >0) {
setemp = SET_VECTOR_ELT(rlist, 5, allocMatrix(REALSXP, ntime, nstate));
vmat= dmatrix(REAL(setemp), ntime, nstate);
}
if (sefit >1) {
/* the max space is larger for a matrix than a vector
** This is pure sneakiness: if I allocate a vector then n*nstate*(ntime+1)
** may overflow, as it is an integer argument. Using the rows and cols of
** a matrix neither overflows. But once allocated, I can treat setemp
** like a vector since usave is a pointer to double, which is bigger than
** integer and won't overflow. */
setemp = SET_VECTOR_ELT(rlist, 6, allocMatrix(REALSXP, n*nstate, ntime+1));
usave = REAL(setemp);
}
/* allocate space for scratch vectors */
ws = (double *) R_alloc(2*nstate, sizeof(double)); /*weighted number in state */
temp2 = ws + nstate;
ns = (int *) R_alloc(2*nstate, sizeof(int));
nev = ns + nstate;
atrisk = (int *) R_alloc(2*nperson, sizeof(int));
dstate = atrisk + nperson;
wtp = (double *) R_alloc(nperson, sizeof(double));
hmat = (double**) dmatrix((double *)R_alloc(nstate*nstate, sizeof(double)),
nstate, nstate);
chaz = (double**) dmatrix((double *)R_alloc(nstate*nstate, sizeof(double)),
nstate, nstate);
if (sefit >0)
umat = (double**) dmatrix((double *)R_alloc(nperson*nstate, sizeof(double)),
nstate, nperson);
/* R_alloc does not zero allocated memory */
for (i=0; i<nstate; i++) {
ws[i] =0;
ns[i] =0;
nev[i] =0;
for (j=0; j<nstate; j++) {
hmat[i][j] =0;
chaz[i][j] =0;
}
}
for (i=0; i<nperson; i++) {
atrisk[i] =0;
wtp[i] = 0.0;
dstate[i] = cstate[i]; /* cstate starts as the initial state */
}
@
Copy over the initial influence data, which was computed in R.
<<survfitci-declare>>=
if (sefit ==1) {
dptr = i0;
for (j=0; j<nstate; j++) {
for (i=0; i<nperson; i++) umat[i][j] = *dptr++;
}
}
else if (sefit>1) {
/* copy influence, and save it */
dptr = i0;
for (j=0; j<nstate; j++) {
for (i=0; i<nperson; i++) {
umat[i][j] = *dptr;
*usave++ = *dptr++; /* save in the output */
}
}
}
@
The primary loop of the program walks along the \code{sort2}
vector, with one pass through the interior of the for loop for each unique
event time.
Observations are at risk in the interval (entry, event]: note
the round and square brackets, so a row must satisfy
\code{entry < ctime <= event} to be at risk,
where \code{ctime} is the unique event time of current interest.
The basic loop is to add new subjects to the risk set, compute,
save results, then remove expired ones from the risk set.
The \code{ns} and \code{ws} vectors keep track of the number of subjects
currently in each state and the weighted number currently in each
state.
There are four indexing patterns in play which may be confusing.
\begin{itemize}
\item The output matrices, indexed by unique event time \code{itime}
and state.
\item The \code{n} observations (variables entry, event, sort1, sort2, status,
wt, id)
\item The \code{nperson} individual subjects (variables cstate, atrisk)
\item The \code{nstate} states (variables hmat, p)
\end{itemize}
In the code below \code{i} steps through the exit times and \code{eptr} the
entry time. The \code{atrisk} variable keeps track of \emph{subjects} who are
at risk.
<<survfitci-compute>>=
itime =0; /*current time index, for output arrays */
eptr = 0; /*index to sort1, the entry times */
for (i=0; i<n; ) {
ck = sort2[i];
ctime = etime[ck]; /* current time value of interest */
/* Add subjects whose entry time is < ctime into the counts */
for (; eptr<n; eptr++) {
k = sort1[eptr];
if (entry[k] < ctime) {
kk = cstate[id[k]]; /*current state of the addition */
ns[kk]++;
ws[kk] += wt[k];
wtp[id[k]] = wt[k];
atrisk[id[k]] =1; /* mark them as being at risk */
}
else break;
}
<<survfitci-compute-matrices>>
<<survfitci-compute-update>>
/* Take the current events and censors out of the risk set */
for (; i<n; i++) {
j= sort2[i];
if (etime[j] == ctime) {
oldstate = cstate[id[j]]; /*current state */
ns[oldstate]--;
ws[oldstate] -= wt[j];
if (status[j] >0) cstate[id[j]] = status[j]-1; /*new state */
atrisk[id[j]] =0;
}
else break;
}
itime++;
}
@
The key variables for the computation are the matrix $H$ and the
current prevalence vector $P$.
$H$ is created anew at each unique time point.
Row $j$ of $H$ concerns everyone in state $j$ just before the time point,
and contains the transitions at that time point.
So the $jk$ element is the (weighted) fraction who change from state $j$
to state $k$, and the $jj$ element the fraction who stay put.
Each row of $H$ by definition sums to 1.
If no one is in the state then the $jj$ element is set to 1.
A second version, which we call H2, has 1 subtracted from each diagonal so
that the row sums are 0; we go back and
forth depending on which is needed at the moment.
If there are no events at this time point $P$ and $U$ do not update.
<<survfitci-compute-matrices>>=
for (j=0; j<nstate; j++) {
for (k=0; k<nstate; k++) {
hmat[j][k] =0;
}
}
/* Count up the number of events and censored at this time point */
for (k=0; k<nstate; k++) nev[k] =0;
ncensor[itime] =0;
wevent =0;
for (j=i; j<n; j++) {
k = sort2[j];
if (etime[k] == ctime) {
if (status[k] >0) {
newstate = status[k] -1; /* 0 based subscripts */
oldstate = cstate[id[k]];
if (oldstate != newstate) {
/* A "move" to the same state does not count */
dstate[id[k]] = newstate;
nev[newstate]++;
wevent += wt[k];
hmat[oldstate][newstate] += wt[k];
}
}
else ncensor[itime]++;
}
else break;
}
if (wevent > 0) { /* there was at least one move with weight > 0 */
/* finish computing H */
for (j=0; j<nstate; j++) {
if (ns[j] >0) {
temp =0;
for (k=0; k<nstate; k++) {
temp += hmat[j][k];
hmat[j][k] /= ws[j]; /* events/n */
}
hmat[j][j] =1 -temp/ws[j]; /*rows sum to one */
}
else hmat[j][j] =1.0;
}
if (sefit >0) {
<<survfitci-compute-U>>
}
<<survfitci-compute-P>>
}
@
The most complicated part of the code is the update of the
per subject influence matrix $U$.
The influence for a subject is the derivative of the current
estimates wrt the case weight of that subject. Since $p$ is a
vector the influence $U$ is easily represented as a matrix with one row
per subject and one column per state.
Refer to equation \eqref{ci} for the derivation.
Let $m$ and $n$ be the old and new states for subject $i$, and
$n_m$ the sum of weights for all subjects at risk in state $m$.
Then
\begin{equation*}
U_{ij}(t) = \sum_k \left[ U_{ik}(t-)H_{kj}\right] + p_m(t-)(I_{n=j} - H_{mj})/ n_m
\end{equation*}
\begin{enumerate}
\item The first term above is simple matrix multiplication.
\item The second adds a vector with mean zero.
\end{enumerate}
If standard errors are not needed we can skip this calculation.
<<survfitci-compute-U>>=
/* Update U, part 1 U = U %*% H -- matrix multiplication */
for (j=0; j<nperson; j++) { /* row of U */
for (k=0; k<nstate; k++) { /* column of U */
temp2[k]=0;
for (kk=0; kk<nstate; kk++)
temp2[k] += umat[j][kk] * hmat[kk][k];
}
for (k=0; k<nstate; k++) umat[j][k] = temp2[k];
}
/* step 2, add in dH term */
for (j=0; j<nperson; j++) {
if (atrisk[j]==1) {
oldstate = cstate[j];
for (k=0; k<nstate; k++)
umat[j][k] -= hmat[oldstate][k]* p[oldstate]/ ws[oldstate];
umat[j][dstate[j]] += p[oldstate]/ws[oldstate];
}
}
@
Now update the cumulative hazard by adding H2 to it, and
update $p$ to $pH$.
<<survfitci-compute-P>>=
/* Finally, update chaz and p. */
for (j=0; j<nstate; j++) {
for (k=0; k<nstate; k++) chaz[j][k] += hmat[j][k];
chaz[j][j] -=1; /* Update using H2 */
temp2[j] =0;
for (k=0; k<nstate; k++)
temp2[j] += p[k] * hmat[k][j];
}
for (j=0; j<nstate; j++) p[j] = temp2[j];
@
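A one-line R check of the $p$ to $pH$ step, with a made-up transition
matrix: since each row of H sums to 1, the update conserves total
probability.
<<example-pH>>=
H <- matrix(c(.90, .08, .02,
              .05, .90, .05,
              0,   0,   1), 3, 3, byrow= TRUE) # each row sums to 1
p <- c(.7, .2, .1)
sum(p %*% H)   # 1: probability mass is conserved at each update
@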
<<survfitci-compute-update>>=
/* store into the matrices that will be passed back */
for (j=0; j<nstate; j++) {
pmat[j][itime] = p[j];
nrisk[j][itime] = ns[j];
nevent[j][itime] = nev[j];
for (k=0; k<nstate; k++) *cumhaz++ = chaz[k][j];
if (sefit >0) {
temp =0;
for (k=0; k<nperson; k++)
temp += wtp[k]* wtp[k]*umat[k][j]*umat[k][j];
vmat[j][itime] = sqrt(temp);
}
if (sefit > 1)
for (k=0; k<nperson; k++) *usave++ = umat[k][j];
}
@
<<survfitci-return>>=
/* return a list */
UNPROTECT(3);
return(rlist);
@
\section{State space figures}
The statefig function was written to do ``good enough'' state space figures
quickly and easily. There are certainly figures it can't draw and
many figures that can be drawn better, but it accomplishes its purpose.
The key argument \code{layout}, the first, is a vector of numbers.
The value (1,3,4,2) for instance has a single state, then a column with 3
states, then a column with 4, then a column with 2.
If \code{layout} is instead a 1 column matrix then do the same from top
down. If it is a 2 column matrix then they provided their own spacing.
<<statefig>>=
statefig <- function(layout, connect, margin=.03, box=TRUE,
cex=1, col=1, lwd=1, lty=1, bcol= col,
acol=col, alwd = lwd, alty= lty, offset=0) {
# set up an empty canvas
frame(); # new environment
par(usr=c(0,1,0,1))
if (!is.numeric(layout))
stop("layout must be a numeric vector or matrix")
if (!is.matrix(connect) || nrow(connect) != ncol(connect))
stop("connect must be a square matrix")
nstate <- nrow(connect)
dd <- dimnames(connect)
if (!is.null(dd[[1]])) statenames <- dd[[1]]
else if (is.null(dd[[2]]))
stop("connect must have the state names as dimnames")
else statenames <- dd[[2]]
# expand out all of the graphical parameters. This lets users
# use a vector of colors, line types, etc
narrow <- sum(connect!=0)
acol <- rep(acol, length=narrow)
alwd <- rep(alwd, length=narrow)
alty <- rep(alty, length=narrow)
bcol <- rep(bcol, length=nstate)
lty <- rep(lty, length=nstate)
lwd <- rep(lwd, length=nstate)
col <- rep(col, length=nstate) # text colors
<<statefig-layout>>
<<statefig-text>>
<<statefig-arrows>>
dimnames(cbox) <- list(statenames, c("x", "y"))
invisible(cbox)
}
<<statefig-fun>>
@
The drawing region is always (0,1) by (0,1).
A user can enter their own matrix of coordinates.
Otherwise the free space is divided with one portion
on each end and 2 portions between boxes. If there were 3 columns for
instance they will have x coordinates of 1/6, 1/6 + 1/3, 1/6 + 2/3. Ditto
for dividing up the y coordinate. The primary nuisance is that we want to
count down from the top instead of up from the bottom. A 1 by 1 matrix is
treated as a column matrix.
<<statefig-layout>>=
if (is.matrix(layout) && ncol(layout)==2 && nrow(layout) > 1) {
# the user provided their own
if (any(layout <0) || any(layout >1))
stop("layout coordinates must be between 0 and 1")
if (nrow(layout) != nstate)
stop("layout matrix should have one row per state")
cbox <- layout
}
else {
if (any(layout <=0 | layout != floor(layout)))
stop("non-integer number of states in layout argument")
space <- function(n) (1:n -.5)/n # centers of the boxes
if (sum(layout) != nstate) stop("number of boxes != number of states")
cbox <- matrix(0, ncol=2, nrow=nstate) #coordinates will be here
n <- length(layout)
ix <- rep(seq(along=layout), layout)
if (is.vector(layout) || ncol(layout)> 1) { #left to right
cbox[,1] <- space(n)[ix]
for (i in 1:n) cbox[ix==i,2] <- 1 -space(layout[i])
} else { # top to bottom
cbox[,2] <- 1- space(n)[ix]
for (i in 1:n) cbox[ix==i,1] <- space(layout[i])
}
}
@
Write the text out. Compute the width and height of each box.
Then compute the margin. The only tricky thing here is that we want
the area around the text to \emph{look} the same left-right and up-down,
which depends on the geometry of the plotting region.
<<statefig-text>>=
text(cbox[,1], cbox[,2], statenames, cex=cex, col=col) # write the labels
textwd <- strwidth(statenames, cex=cex)
textht <- strheight(statenames, cex=cex)
temp <- par("pin") #plot region in inches
dx <- margin * temp[2]/mean(temp) # extra to add in the x dimension
dy <- margin * temp[1]/mean(temp) # extra to add in y
if (box) {
drawbox <- function(x, y, dx, dy, lwd, lty, col) {
lines(x+ c(-dx, dx, dx, -dx, -dx),
y+ c(-dy, -dy, dy, dy, -dy), lwd=lwd, lty=lty, col=col)
}
for (i in 1:nstate)
drawbox(cbox[i,1], cbox[i,2], textwd[i]/2 + dx, textht[i]/2 + dy,
col=bcol[i], lwd=lwd[i], lty=lty[i])
dx <- 2*dx; dy <- 2*dy # move arrows out from the box
}
@
Now for the hard part, which is drawing the arrows.
The entries in the connection matrix are 0= no connection or $1+d$ for
$-1 < d < 1$. The connection is an arc that passes from the center of
box 1 to the center of box 2, and through a point that is $dz$ units above
the midpoint of the line from box 1 to box 2, where $2z$ is the length
of that line.
For $d=1$ we get a half circle to the right (with respect to traversing the
line from A to B) and for $d= -1$ we get a half circle to the left.
If $d=0$ it is a straight line.
If A and B are the starting and ending points then AB is the chord of a
circle. Draw radii from the center to A, B, and through the midpoint $c$ of
AB. This last has length $dz$ above the chord and $r- dz$ below where $r$
is the radius. Then we have
\begin{align*}
r^2 & = z^2 + (r-dz)^2 \\
2rdz &= z^2 + (dz)^2 \\
r &= \left[z (1+ d^2) \right ]/ 2d
\end{align*}
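A quick numeric check of the algebra, for a few arbitrary values of $d$:
<<example-radius>>=
z <- 2                            # half the chord length
for (d in c(1, .5, .25)) {
    r <- z*(1 + d^2)/(2*d)
    print(all.equal(r^2, z^2 + (r - d*z)^2))  # TRUE each time
}
@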
Be careful with negative $d$, which is used to denote left-hand arcs.
The angle $\theta$ from A to B is the arctan of $B-A$,
and the center of the circle is at
$C = (A+B)/2 + (r - dz)(\sin \theta, -\cos \theta)$.
We then need to draw the arc $C + r(\cos \phi, \sin \phi)$ for some range
of angles $\phi$.
The angles to the centers of the boxes are $\arctan(A-C)$ and $\arctan(B-C)$,
but we want to start and end outside the box.
It turned out that this is more subtle than I thought.
The solution below uses two helper functions \code{statefigx} and
\code{statefigy}.
The first accepts $C$, $r$, the range of $\phi$ values, and a target
$x$ value. It returns the angles, within the range, such that the
endpoint of the arc has horizontal coordinate $x$, or an empty
vector if none such exists. For an arc there are sometimes two
solutions.
First calculate the angles for which the arc will strike the horizontal
line. If the arc is too short to reach the line then there is no
intersection.
Then return the legal angles.
<<statefig-fun>>=
statefigx <- function(x, C, r, a1, a2) {
temp <-(x - C[1])/r
if (abs(temp) >1) return(NULL) # no intersection of the arc and x
phi <- acos(temp) # this will be from 0 to pi
pi <- 3.1415926545898 # in case someone has a variable "pi"
if (x > C[1]) phi <- c(phi, pi - phi)
else phi <- -c(phi, pi - phi)
# Add reflection about the X axis, in both forms
phi <- c(phi, -phi, 2*pi - phi)
amax <- max(a1, a2)
amin <- min(a1, a2)
phi[phi<amax & phi > amin]
}
statefigy <- function(y, C, r, a1, a2) {
pi <- 3.1415926545898 # in case someone has a variable named "pi"
amax <- max(a1, a2)
amin <- min(a1, a2)
temp <-(y - C[2])/r
if (abs(temp) >1) return(NULL) # no intersection of the arc and y
phi <- asin(temp) # will be from -pi/2 to pi/2
phi <- c(phi, sign(phi)*pi -phi) # reflect about the vertical
phi <- c(phi, phi + 2*pi)
phi[phi<amax & phi > amin]
}
@
<<statefig-fun>>=
phi <- function(x1, y1, x2, y2, d, delta1, delta2) {
# d = height above the line
theta <- atan2(y2-y1, x2-x1) # angle from center to center
if (abs(d) < .001) d=.001 # a really small arc looks like a line
z <- sqrt((x2-x1)^2 + (y2 - y1)^2) /2 # half length of chord
ab <- c((x1 + x2)/2, (y1 + y2)/2) # center of chord
r <- abs(z*(1 + d^2)/ (2*d))
if (d >0) C <- ab + (r - d*z)* c(-sin(theta), cos(theta)) # center of arc
else C <- ab + (r + d*z)* c( sin(theta), -cos(theta))
a1 <- atan2(y1-C[2], x1-C[1]) # starting angle
a2 <- atan2(y2-C[2], x2-C[1]) # ending angle
if (abs(a2-a1) > pi) {
# a1= 3 and a2=-3, we don't want to include 0
# nor for a1=-3 and a2=3
if (a1>0) a2 <- a2 + 2 *pi
else a1 <- a1 + 2*pi
}
if (d > 0) { #counterclockwise
phi1 <- min(statefigx(x1 + delta1[1], C, r, a1, a2),
statefigx(x1 - delta1[1], C, r, a1, a2),
statefigy(y1 + delta1[2], C, r, a1, a2),
statefigy(y1 - delta1[2], C, r, a1, a2), na.rm=TRUE)
phi2 <- max(statefigx(x2 + delta2[1], C, r, a1, a2),
statefigx(x2 - delta2[1], C, r, a1, a2),
statefigy(y2 + delta2[2], C, r, a1, a2),
statefigy(y2 - delta2[2], C, r, a1, a2), na.rm=TRUE)
}
else { # clockwise
phi1 <- max(statefigx(x1 + delta1[1], C, r, a1, a2),
statefigx(x1 - delta1[1], C, r, a1, a2),
statefigy(y1 + delta1[2], C, r, a1, a2),
statefigy(y1 - delta1[2], C, r, a1, a2), na.rm=TRUE)
phi2 <- min(statefigx(x2 + delta2[1], C, r, a1, a2),
statefigx(x2 - delta2[1], C, r, a1, a2),
statefigy(y2 + delta2[2], C, r, a1, a2),
statefigy(y2 - delta2[2], C, r, a1, a2), na.rm=TRUE)
}
list(center=C, angle=c(phi1, phi2), r=r)
}
@
Now draw the arrows, one at a time. I arbitrarily declare that 20
segments is enough for a smooth curve.
<<statefig-arrows>>=
arrow2 <- function(...) arrows(..., angle=20, length=.1)
doline <- function(x1, x2, d, delta1, delta2, lwd, lty, col) {
if (d==0 && x1[1] ==x2[1]) { # vertical line
if (x1[2] > x2[2]) # downhill
arrow2(x1[1], x1[2]- delta1[2], x2[1], x2[2] + delta2[2],
lwd=lwd, lty=lty, col=col)
else arrow2(x1[1], x1[2]+ delta1[2], x2[1], x2[2] - delta2[2],
lwd=lwd, lty=lty, col=col)
}
else if (d==0 && x1[2] == x2[2]) { # horizontal line
if (x1[1] > x2[1]) # right to left
arrow2(x1[1]-delta1[1], x1[2], x2[1] + delta2[1], x2[2],
lwd=lwd, lty=lty, col=col)
else arrow2(x1[1]+delta1[1], x1[2], x2[1] - delta2[1], x2[2],
lwd=lwd, lty=lty, col=col)
}
else {
temp <- phi(x1[1], x1[2], x2[1], x2[2], d, delta1, delta2)
if (d==0) {
arrow2(temp$center[1] + temp$r*cos(temp$angle[1]),
temp$center[2] + temp$r*sin(temp$angle[1]),
temp$center[1] + temp$r*cos(temp$angle[2]),
temp$center[2] + temp$r*sin(temp$angle[2]),
lwd=lwd, lty=lty, col=col)
}
else {
# approx the curve with 21 segments
# arrowhead on the last one
phi <- seq(temp$angle[1], temp$angle[2], length=21)
lines(temp$center[1] + temp$r*cos(phi),
temp$center[2] + temp$r*sin(phi), lwd=lwd, lty=lty, col=col)
arrow2(temp$center[1] + temp$r*cos(phi[20]),
temp$center[2] + temp$r*sin(phi[20]),
temp$center[1] + temp$r*cos(phi[21]),
temp$center[2] + temp$r*sin(phi[21]),
lwd=lwd, lty=lty, col=col)
}
}
}
@
The last arrow bit is the offset. If offset $\ne 0$ and there is a
bidirectional
arrow between two boxes, and the arc for both of them is identical,
then move each arrow just a bit, orthogonal to a segment connecting the middle
of the two boxes.
If the line goes from (x1, y1) to (x2, y2), then the normal to the line at
(x1, y1) is (y2-y1, x1-x2), normalized to length 1.
The -1 below (\code{-offset}) makes the shift obey a left-hand rule: looking
down a line segment towards the arrow head, we shift to the left.
This makes two horizontal arrows stack in the normal typographical order
for chemical reactions, the right-facing one above the left-facing one.
A user can use a negative value for offset to reverse this if they wish.
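As a hypothetical usage sketch (not part of the package source), two boxes
connected in both directions show the stacking; the right-facing arrow is
drawn above the left-facing one.
\begin{verbatim}
# hypothetical example: bidirectional straight arrows between two states
library(survival)
states <- c("Well", "Ill")
connect <- matrix(0, 2, 2, dimnames = list(states, states))
connect[1,2] <- 1    # straight arrow, left to right
connect[2,1] <- 1    # straight arrow, right to left
statefig(c(1, 1), connect, offset = .02)
\end{verbatim}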
<<statefig-arrows>>=
k <- 1
for (j in 1:nstate) {
for (i in 1:nstate) {
if (i != j && connect[i,j] !=0) {
if (connect[i,j] == 2-connect[j,i] && offset!=0) {
#add an offset
toff <- c(cbox[j,2] - cbox[i,2], cbox[i,1] - cbox[j,1])
toff <- -offset *toff/sqrt(sum(toff^2))
doline(cbox[i,]+toff, cbox[j,]+toff, connect[i,j]-1,
delta1 = c(textwd[i]/2 + dx, textht[i]/2 + dy),
delta2 = c(textwd[j]/2 + dx, textht[j]/2 + dy),
lty=alty[k], lwd=alwd[k], col=acol[k])
}
else doline(cbox[i,], cbox[j,], connect[i,j]-1,
delta1 = c(textwd[i]/2 + dx, textht[i]/2 + dy),
delta2 = c(textwd[j]/2 + dx, textht[j]/2 + dy),
lty=alty[k], lwd=alwd[k], col=acol[k])
k <- k +1
}
}
}
@
\section{Linear models and contrasts}
The primary contrast function is \code{yates}.
This function does both simple and population contrasts; the name is a nod
to the ``Yates weighted means'' method, the first population contrast that
I know of.
A second reason for the name is that
the word ``contrast'' is already overused in the S/R lexicon.
Both \code{yates} and \code{cmatrix} can be used with any model that returns
the necessary
portions, e.g., lm, coxph, or glm.
They were written because I became embroiled in the ``type III'' controversy,
and made it a goal to figure out what exactly it is that SAS does.
Had I known that the quest would take multiple years, I would
perhaps never have started.
Population contrasts can result in some head scratching.
It is easy to create the predicted value for any hypothetical
subject from a model.
A population prediction holds some data values constant and lets the
others range over a population, giving a mean predicted value or
population average.
Population predictions for two treatments are the familiar g-estimates
of causal models.
We can take sums or differences of these predictions as well, e.g. to
ask if they are significantly different.
What can't be done is to work backwards from one of these contrasts to the
populations, at least for continuous variables.
If someone asks for an x contrast of 15-5, is this the sum of two population
estimates at 15 and -5, or the difference of estimates at 15 and 5?
It is always hard to guess the mind of a user.
Therefore what is needed is a fitted model, the term (covariate) of interest,
levels of that covariate, a desired comparison, and a population.
First is the cmatrix routine, which is called by users to create a contrast
matrix for a model; users can also construct their own contrast matrices.
The result has two parts: the definition of a set of predicted values and
a set of contrasts between those values.
The routine requires a fit and a formula. The formula is simply a way to
get a set of variable names: all those variables are the fixed ones in
the population contrast, and all others form the ``population''.
The result will be a matrix or list that has a label
attribute containing the name of the term; this is used in printouts in the
obvious way.
Suppose that our model was \code{coxph(Surv(time, status) ~ age*sex + ph.ecog)}.
Someone might want the population matrix for age, sex, ph.ecog, or age + sex.
For the last it doesn't matter if they say age+sex, age*sex, or age:sex.
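A minimal usage sketch, assuming the lung data from the package (the
particular ages chosen below are arbitrary):
\begin{verbatim}
fit <- coxph(Surv(time, status) ~ age * sex + ph.ecog, data = lung)
cmatrix(fit, ~ age, levels = c(50, 60, 70))    # global test by default
cmatrix(fit, "age + sex", levels = list(age = c(50, 70), sex = 1:2))
\end{verbatim}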
<<yates>>=
cmatrix <- function(fit, term,
test =c("global", "trend", "pairwise", "mean"),
levels, assign) {
# Make sure that "fit" is present and isn't missing any parts.
if (missing(fit)) stop("a fit argument is required")
Terms <- try(terms(fit), silent=TRUE)
if (inherits(Terms, "try-error"))
stop("the fit does not have a terms structure")
else Terms <- delete.response(Terms) # y is not needed
Tatt <- attributes(Terms)
# a flaw in delete.response: it doesn't subset dataClasses
Tatt$dataClasses <- Tatt$dataClasses[row.names(Tatt$factors)]
test <- match.arg(test)
if (missing(term)) stop("a term argument is required")
if (is.character(term)) term <- formula(paste("~", term))
else if (is.numeric(term)) {
if (all(term == floor(term) & term > 0 & term <= length(Tatt$term.labels)))
term <- formula(paste("~",
paste(Tatt$term.labels[term], collapse='+')))
else stop("a numeric term must be an integer between 1 and max terms in the fit")
}
else if (!inherits(term, "formula"))
stop("the term must be a formula or integer")
fterm <- delete.response(terms(term))
fatt <- attributes(fterm)
user.name <- fatt$term.labels # what the user called it
termname <- all.vars(fatt$variables)
indx <- match(termname, all.vars(Tatt$variables))
if (any(is.na(indx)))
stop("variable ", termname[is.na(indx)], " not found in the formula")
# What kind of term is being tested? It can be categorical, continuous,
# an interaction of only categorical terms, interaction of only continuous
# terms, or a mixed interaction.
# Key is a trick to get "zed" from ns(zed, df= dfvar)
key <- sapply(Tatt$variables[-1], function(x) all.vars(x)[1])
parts <- names(Tatt$dataClasses)[match(termname, key)]
types <- Tatt$dataClasses[parts]
iscat <- as.integer(types=="factor" | types=="character")
if (length(iscat)==1) termtype <- iscat
else termtype <- 2 + any(iscat) + all(iscat)
# Were levels specified? If so we either simply accept them (continuous),
# or double check them (categorical)
if (missing(levels)) {
temp <- fit$xlevels[match(parts, names(fit$xlevels), nomatch=0)]
if (length(temp) < length(parts))
stop("continuous variables require the levels argument")
levels <- do.call(expand.grid, c(temp, stringsAsFactors=FALSE))
}
else { #user supplied
if (is.list(levels)) {
if (is.null(names(levels))) {
if (length(termname)==1) names(levels) <- termname
else stop("levels list requires named elements")
}
}
if (is.data.frame(levels) || is.list(levels)) {
index1 <- match(termname, names(levels), nomatch=0)
# Grab the cols from levels that are needed (we allow it to have
# extra, unused columns)
levels <- as.list(levels[index1])
# now, levels = the set of ones that the user supplied (which might
# be none, if names were wrong)
if (length(levels) < length(termname)) {
# add on the ones we don't have, using fit$xlevels as defaults
temp <- fit$xlevels[parts[index1==0]]
if (length(temp) > 0) {
names(temp) <- termname[index1 ==0]
levels <- c(levels, temp)
}
}
index2 <- match(termname, names(levels), nomatch=0)
if (any(index2==0))
stop("levels information not found for: ", termname[index2==0])
levels <- expand.grid(levels[index2], stringsAsFactors=FALSE)
if (any(duplicated(levels))) stop("levels data frame has duplicates")
}
else if (is.matrix(levels)) {
if (ncol(levels) != length(parts))
stop("levels matrix has the wrong number of columns")
if (!is.null(dimnames(levels)[[2]])) {
index <- match(termname, dimnames(levels)[[2]], nomatch=0)
if (any(index==0))
stop("matrix column names do not match the variable list")
else levels <- levels[,index, drop=FALSE]
} else if (ncol(levels) > 1)
stop("multicolumn levels matrix requires column names")
if (any(duplicated(levels)))
stop("levels matrix has duplicated rows")
levels <- data.frame(levels, stringsAsFactors=FALSE)
names(levels) <- termname
}
else if (length(parts) > 1)
stop("levels should be a data frame or matrix")
else {
levels <- data.frame(x=unique(levels), stringsAsFactors=FALSE)
names(levels) <- termname
}
}
# check that any categorical levels are legal
for (i in which(iscat==1)) {
xlev <- fit$xlevels[[parts[i]]]
if (is.null(xlev))
stop("xlevels attribute not found for ", termname[i])
temp <- match(levels[[i]], xlev)
if (any(is.na(temp)))
stop("invalid level for term ", termname[i])
}
rval <- list(levels=levels, termname=termname)
# Now add the contrast matrix between the levels, if needed
if (test=="global") {
<<cmatrix-build-default>>
}
else if (test=="pairwise") {
<<cmatrix-build-pairwise>>
}
else if (test=="mean") {
<<cmatrix-build-mean>>
}
else {
<<cmatrix-build-linear>>
}
# the user can say "age" when the model has "ns(age)", but we need
# the more formal label going forward
rval <- list(levels=levels, termname=parts, cmat=cmat, iscat=iscat)
class(rval) <- "cmatrix"
rval
}
@
The default contrast matrix is a simple test of equality if there is only
one term.
If the term is the interaction of multiple categorical variables
then we do an anova type decomposition.
In other cases we currently fail.
<<cmatrix-build-default>>=
if (TRUE) {
#if (length(parts) ==1) {
cmat <- diag(nrow(levels))
cmat[, nrow(cmat)] <- -1 # all equal to the last
cmat <- cmat[-nrow(cmat),, drop=FALSE]
}
else if (termtype== 4) { # anova type
stop("not yet done 1")
}
else stop("not yet done 2")
@
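For example, with three levels the default matrix compares each level to
the last:
\begin{verbatim}
#      [,1] [,2] [,3]
# [1,]    1    0   -1     level 1 vs level 3
# [2,]    0    1   -1     level 2 vs level 3
\end{verbatim}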
The \code{pairwise} option creates a set of contrast matrices for all pairs
of a factor.
<<cmatrix-build-pairwise>>=
nlev <- nrow(levels) # this is the number of groups being compared
if (nlev < 2) stop("pairwise tests need at least 2 groups")
npair <- nlev*(nlev-1)/2
if (npair==1) cmat <- matrix(c(1, -1), nrow=1)
else {
cmat <- vector("list", npair)
k <- 1
cname <- rep("", npair)
for (i in 1:(nlev-1)) {
temp <- double(nlev)
temp[i] <- 1
for (j in (i+1):nlev) {
temp[j] <- -1
cmat[[k]] <- matrix(temp, nrow=1)
temp[j] <- 0
cname[k] <- paste(i, "vs", j)
k <- k+1
}
}
names(cmat) <- cname
}
@
The mean option compares each to the overall mean.
<<cmatrix-build-mean>>=
ntest <- nrow(levels)
cmat <- vector("list", ntest)
for (k in 1:ntest) {
temp <- rep(-1/ntest, ntest)
temp[k] <- (ntest-1)/ntest
cmat[[k]] <- matrix(temp, nrow=1)
}
names(cmat) <- paste(1:ntest, "vs mean")
@
The \code{trend} option is of interest for terms that have more than one
column; the two most common cases are a factor variable or a spline.
It forms a pair of tests, one for the linear and one
for the nonlinear part. For non-linear functions such as splines we need
some notion of the range of the data, since we want to be linear over the
entire range.
<<cmatrix-build-linear>>=
cmat <- vector("list", 2)
cmat[[1]] <- matrix(1:ntest, 1, ntest)
cmat[[2]] <- diag(ntest)
attr(cmat, "nested") <- TRUE
if (is.null(levels[[1]])) {
# a continuous variable, and the user didn't give levels for the test
# look up the call and use the knots
tcall <- Tatt$predvars[[indx + 1]] # skip the 'call'
if (tcall[[1]] == as.name("pspline")) {
bb <- tcall[["Boundary.knots"]]
levels[[1]] <- seq(bb[1], bb[2], length=ntest)
}
else if (tcall[[1]] %in% c("ns", "bs")) {
bb <- c(tcall[["Boundary.knots"]], tcall[["knots"]])
levels[[1]] <- sort(bb)
}
else stop("don't know how to do a linear contrast for this term")
}
@
Here are some helper routines.
The formulas are from chapter 5 of Searle. The sums of squares only make
sense within a linear model.
<<yates>>=
gsolve <- function(mat, y, eps=sqrt(.Machine$double.eps)) {
# solve using a generalized inverse
# this is very similar to the ginv function of MASS
temp <- svd(mat, nv=0)
dpos <- (temp$d > max(temp$d[1]*eps, 0))
dd <- ifelse(dpos, 1/temp$d, 0)
# all the parentheses save a tiny bit of time if y is a vector
if (all(dpos)) x <- drop(temp$u %*% (dd*(t(temp$u) %*% y)))
else if (!any(dpos)) x <- drop(temp$u %*% (0*y)) # extremely rare
else x <-drop(temp$u[,dpos] %*%(dd[dpos] * (t(temp$u[,dpos, drop=FALSE]) %*% y)))
attr(x, "df") <- sum(dpos)
x
}
qform <- function(var, beta) { # quadratic form b' (V-inverse) b
temp <- gsolve(var, beta)
list(test= sum(beta * temp), df=attr(temp, "df"))
}
@
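A small illustration of the generalized inverse at work, on a hypothetical
singular variance matrix:
\begin{verbatim}
V <- matrix(1, 2, 2)   # singular: the two coefficients are identical
b <- c(1, 1)
qform(V, b)            # test = 1 on df = 1: only one real dimension
\end{verbatim}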
The next functions do the work. Some bookkeeping is needed for
a missing value in beta: we leave that coefficient out of the linear
predictor.
If there are missing coefs then the variance matrix will not have those
columns in any case.
The nafun function asks if a linear combination is NA. It treats
0*NA as 0.
<<yates>>=
estfun <- function(cmat, beta, varmat) {
nabeta <- is.na(beta)
if (any(nabeta)) {
k <- which(!nabeta) #columns to keep
estimate <- drop(cmat[,k, drop=FALSE] %*% beta[k]) # vector of predictions
evar <- cmat[,k, drop=FALSE] %*% varmat %*% t(cmat[,k, drop=FALSE])
list(estimate = estimate, var=evar)
}
else {
list(estimate = drop(cmat %*% beta),
var = cmat %*% varmat %*% t(cmat))
}
}
testfun <- function(cmat, beta, varmat, sigma2) {
nabeta <- is.na(beta)
if (any(nabeta)) {
k <- which(!nabeta) #columns to keep
estimate <- drop(cmat[,k, drop=FALSE] %*% beta[k]) # vector of predictions
temp <- qform(cmat[,k] %*% varmat %*% t(cmat[,k,drop=FALSE]), estimate)
rval <- c(chisq=temp$test, df=temp$df)
}
else {
estimate <- drop(cmat %*% beta)
temp <- qform(cmat %*% varmat %*% t(cmat), estimate)
rval <- c(chisq=temp$test, df=temp$df)
}
if (!is.null(sigma2)) rval <- c(rval, ss= unname(rval[1]) * sigma2)
rval
}
nafun <- function(cmat, est) {
used <- apply(cmat, 2, function(x) any(x != 0))
any(used & is.na(est))
}
@
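For instance, with hypothetical values:
\begin{verbatim}
est <- c(2, NA)
nafun(matrix(c(1, 0), nrow = 1), est)  # FALSE: the NA gets a 0 multiplier
nafun(matrix(c(1, 1), nrow = 1), est)  # TRUE:  the NA coefficient is used
\end{verbatim}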
Now for the primary function.
The user may have a list of tests, or a single term.
The first part of the function does the usual work of grabbing the arguments
and then checking them.
The fit object has to have the standard stuff: terms, assign, xlevels
and contrasts.
Attributes of the terms are used often enough that we copy them
to \code{Tatt} to save typing.
We will almost certainly need the model frame and/or model matrix as
well.
In the discussion below I use x1 to refer to the covariates/terms that are
the target, e.g., \code{term='Mask'} to get the mean population values for
each level of the Mask variable in the solder data set, and x2 to refer to
all the other terms in the model, the ones that we average over.
These are also referred to as U and V in the vignette.
<<yates>>=
yates <- function(fit, term, population=c("data", "factorial", "sas"),
levels, test =c("global", "trend", "pairwise"),
predict="linear", options, nsim=200,
method=c("direct", "sgtt")) {
Call <- match.call()
if (missing(fit)) stop("a fit argument is required")
Terms <- try(terms(fit), silent=TRUE)
if (inherits(Terms, "try-error"))
stop("the fit does not have a terms structure")
else Terms <- delete.response(Terms) # y is not needed
Tatt <- attributes(Terms)
# a flaw in delete.response: it doesn't subset dataClasses
Tatt$dataClasses <- Tatt$dataClasses[row.names(Tatt$factors)]
if (inherits(fit, "coxphms")) stop("multi-state coxph not yet supported")
if (is.list(predict) || is.function(predict)) {
# someone supplied their own
stop("user written prediction functions are not yet supported")
}
else { # call the method
indx <- match(c("fit", "predict", "options"), names(Call), nomatch=0)
temp <- Call[c(1, indx)]
temp[[1]] <- quote(yates_setup)
mfun <- eval(temp, parent.frame())
}
if (is.null(mfun)) predict <- "linear"
# we will need the original model frame and X matrix
mframe <- fit$model
if (is.null(mframe)) mframe <- model.frame(fit)
Xold <- model.matrix(fit)
if (is.null(fit$assign)) { # glm models don't save assign
xassign <- attr(Xold, "assign")
}
else xassign <- fit$assign
nvar <- length(xassign)
nterm <- length(Tatt$term.labels)
termname <- rownames(Tatt$factors)
iscat <- sapply(Tatt$dataClasses,
function(x) x %in% c("character", "factor"))
method <- match.arg(casefold(method), c("direct", "sgtt")) #allow SGTT
if (method=="sgtt" && missing(population)) population <- "sas"
if (inherits(population, "data.frame")) popframe <- TRUE
else if (is.character(population)) {
popframe <- FALSE
population <- match.arg(tolower(population[1]),
c("data", "factorial", "sas",
"empirical", "yates"))
if (population=="empirical") population <- "data"
if (population=="yates") population <- "factorial"
}
else stop("the population argument must be a data frame or character")
test <- match.arg(test)
if (popframe || population != "data") weight <- NULL
else {
weight <- model.extract(mframe, "weights")
if (is.null(weight)) {
id <- model.extract(mframe, "id")
if (!is.null(id)) { # each id gets the same weight
count <- c(table(id))
weight <- 1/count[match(id, names(count))]
}
}
}
if (method=="sgtt" && (population !="sas" || predict != "linear"))
stop("sgtt method only applies if population = sas and predict = linear")
beta <- coef(fit, complete=TRUE)
nabeta <- is.na(beta) # undetermined coefficients
vmat <- vcov(fit, complete=FALSE)
if (nrow(vmat) > sum(!nabeta)) {
# a vcov method that does not obey the complete argument
vmat <- vmat[!nabeta, !nabeta]
}
# grab the dispersion, needed for the writing an SS in linear models
if (class(fit)[1] =="lm") sigma <- summary(fit)$sigma
else sigma <- NULL # don't compute an SS column
# process the term argument and check its legality
if (missing(levels))
contr <- cmatrix(fit, term, test, assign= xassign)
else contr <- cmatrix(fit, term, test, assign= xassign, levels = levels)
x1data <- as.data.frame(contr$levels) # labels for the PMM values
# Make the list of X matrices that drive everything: xmatlist
# (Over 1/2 the work of the whole routine)
xmatlist <- yates_xmat(Terms, Tatt, contr, population, mframe, fit,
iscat)
# check rows of xmat for estimability
<<yates-estim-setup>>
# Drop missing coefficients, and use xmatlist to compute the results
beta <- beta[!nabeta]
if (predict == "linear" || is.null(mfun)) {
# population averages of the simple linear predictor
<<yates-linear>>
}
else {
<<yates-nonlinear>>
}
result$call <- Call
class(result) <- "yates"
result
}
@
Models with factor variables may often lead to population predictions that
involve non-estimable functions, particularly if there are interactions
and the user specifies a factorial population.
If there are any missing coefficients we have to do formal checking for
this: any given row of the new $X$ matrix, for prediction, must be in the
row space of the original $X$ matrix.
If this is true then a regression of a new row on the old $X$ will have
residuals of zero.
It is not possible to derive this from the pattern of NA coefficients alone.
Set up a function that returns a true/false vector of whether each row of
a matrix is estimable. This test isn't relevant if population=none.
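A standalone sketch of the idea, with a made-up rank-deficient design
(not package code):
\begin{verbatim}
X <- cbind(1, c(0,0,1,1), c(1,1,0,0))  # column 3 = column 1 - column 2
X.qr <- qr(t(unique(X)))               # row space of X
new1 <- c(1, 2, -1)                    # -1*(row 1) + 2*(row 2): estimable
new2 <- c(1, 2, 0)                     # not in the row space
all(abs(qr.resid(X.qr, new1)) < 1e-8)  # TRUE
all(abs(qr.resid(X.qr, new2)) < 1e-8)  # FALSE
\end{verbatim}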
<<yates-estim-setup>>=
if (any(is.na(beta)) && (popframe || population != "none")) {
Xu <- unique(Xold) # we only need unique rows, saves time to do so
if (inherits(fit, "coxph")) X.qr <- qr(t(cbind(1.0,Xu)))
else X.qr <- qr(t(Xu)) # QR decomposition of the row space
estimcheck <- function(x, eps= sqrt(.Machine$double.eps)) {
temp <- abs(qr.resid(X.qr, t(x)))
# apply(abs(temp), 1, function(x) all(x < eps)) # each row estimable
all(temp < eps)
}
estimable <- sapply(xmatlist, estimcheck)
} else estimable <- rep(TRUE, length(xmatlist))
@
When the prediction target is $X\beta$ there is a four step
process: build the reference population, create the list of X matrices
(one prediction matrix for each x1 value),
take the column means of each X to form the rows of the
contrast matrix Cmat, and then use Cmat to get the pmm values and
tests of the pmm values.
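The reason column means suffice: for a linear predictor the average of the
predictions equals the prediction at the averaged X row. A toy check with
made-up values:
\begin{verbatim}
Xpop <- matrix(rnorm(20), nrow = 5)   # hypothetical population X, 5 x 4
beta <- rnorm(4)
all.equal(mean(Xpop %*% beta), sum(colMeans(Xpop) * beta))  # TRUE
\end{verbatim}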
<<yates-linear>>=
#temp <- match(contr$termname, colnames(Tatt$factors))
#if (any(is.na(temp)))
# stop("term '", contr$termname[is.na(temp)], "' not found in the model")
meanfun <- if (is.null(weight)) colMeans else function(x) {
colSums(x*weight)/ sum(weight)}
Cmat <- t(sapply(xmatlist, meanfun))[,!nabeta]
# coxph model: the X matrix is built as though an intercept were there (the
# baseline hazard plays that role), but then drop it from the coefficients
# before computing estimates and tests. If there was a strata * covariate
# interaction there will be many more columns to drop.
if (inherits(fit, "coxph")) {
nkeep <- length(fit$means) # number of non-intercept columns
col.to.keep <- seq(to=ncol(Cmat), length= nkeep)
Cmat <- Cmat[,col.to.keep, drop=FALSE]
offset <- -sum(fit$means[!nabeta] * beta) # recenter the predictions too
}
else offset <- 0
# Get the PMM estimates, but only for estimable ones
estimate <- cbind(x1data, pmm=NA, std=NA)
if (any(estimable)) {
etemp <- estfun(Cmat[estimable,,drop=FALSE], beta, vmat)
estimate$pmm[estimable] <- etemp$estimate + offset
estimate$std[estimable] <- sqrt(diag(etemp$var))
}
# Now do tests on the PMM estimates, one by one
if (method=="sgtt") {
<<yates-sgtt>>
}
else {
if (is.list(contr$cmat)) {
test <- t(sapply(contr$cmat, function(x)
testfun(x %*% Cmat, beta, vmat, sigma^2)))
natest <- sapply(contr$cmat, nafun, estimate$pmm)
}
else {
test <- testfun(contr$cmat %*% Cmat, beta, vmat, sigma^2)
test <- matrix(test, nrow=1,
dimnames=list("global", names(test)))
natest <- nafun(contr$cmat, estimate$pmm)
}
if (any(natest)) test[natest,] <- NA
}
if (any(estimable)){
# Cmat[!estimable,] <- NA
result <- list(estimate=estimate, test=test, mvar=etemp$var, cmat=Cmat)
}
else result <- list(estimate=estimate, test=test, mvar=NA)
if (method=="sgtt") result$SAS <- Smat
@
In the non-linear case the mfun object is either a single function
or a list containing two functions \code{predict} and \code{summary}.
The predict function is handed a vector $\eta = X\beta$ along with
the $X$ matrix, though most methods don't use $X$.
The result of predict can be a vector or a matrix.
For coxph models we add on an ``intercept coef'' that will center the
predictions.
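As a usage sketch (assuming the solder data from the package), a glm fit
with predict='response' is one way to reach this non-linear path:
\begin{verbatim}
gfit <- glm(skips ~ Opening * Mask + Solder, data = solder,
            family = poisson)
yates(gfit, ~ Opening, predict = "response")  # pmm on the count scale
\end{verbatim}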
<<yates-nonlinear>>=
xall <- do.call(rbind, xmatlist)[,!nabeta, drop=FALSE]
if (inherits(fit, "coxph")) {
xall <- xall[,-1, drop=FALSE] # remove the intercept
eta <- xall %*% beta -sum(fit$means[!nabeta]* beta)
}
else eta <- xall %*% beta
n1 <- nrow(xmatlist[[1]]) # all of them are the same size
index <- rep(1:length(xmatlist), each = n1)
if (is.function(mfun)) predfun <- mfun
else { # double check the object
if (!is.list(mfun) ||
any(is.na(match(c("predict", "summary"), names(mfun)))) ||
!is.function(mfun$predict) || !is.function(mfun$summary))
stop("the prediction should be a function, or a list with two functions")
predfun <- mfun$predict
sumfun <- mfun$summary
}
pmm <- predfun(eta, xall)
n2 <- length(eta)
if (!(is.numeric(pmm)) || !(length(pmm)==n2 || nrow(pmm)==n2))
stop("prediction function should return a vector or matrix")
pmm <- rowsum(pmm, index, reorder=FALSE)/n1
pmm[!estimable,] <- NA
# get a sample of coefficients, in order to create a variance
# this is lifted from the mvtnorm code (can't include a non-recommended
# package in the dependencies)
tol <- sqrt(.Machine$double.eps)
if (!isSymmetric(vmat, tol=tol, check.attributes=FALSE))
stop("variance matrix of the coefficients is not symmetric")
ev <- eigen(vmat, symmetric=TRUE)
if (!all(ev$values >= -tol* abs(ev$values[1])))
warning("variance matrix is numerically not positive definite")
Rmat <- t(ev$vectors %*% (t(ev$vectors) * sqrt(ev$values)))
bmat <- matrix(rnorm(nsim*ncol(vmat)), nrow=nsim) %*% Rmat
bmat <- bmat + rep(beta, each=nsim) # add the mean
# Now use this matrix of noisy coefficients to get a set of predictions
# and use those to create a variance matrix
# For a Cox model we need to recenter each run
sims <- array(0., dim=c(nsim, nrow(pmm), ncol(pmm)))
if (inherits(fit, 'coxph')) offset <- bmat %*% fit$means[!nabeta]
else offset <- rep(0., nsim)
for (i in 1:nsim)
sims[i,,] <- rowsum(predfun(xall %*% bmat[i,] - offset[i]), index,
reorder=FALSE)/n1
mvar <- var(sims[,,1]) # this will be used for the tests
estimate <- cbind(x1data, pmm=unname(pmm[,1]), std= sqrt(diag(mvar)))
# Now do the tests, on the first column of pmm only
if (is.list(contr$cmat)) {
test <- t(sapply(contr$cmat, function(x)
testfun(x, pmm[,1], mvar[estimable, estimable], NULL)))
natest <- sapply(contr$cmat, nafun, pmm[,1])
}
else {
test <- testfun(contr$cmat, pmm[,1], mvar[estimable, estimable], NULL)
test <- matrix(test, nrow=1,
dimnames=list(contr$termname, names(test)))
natest <- nafun(contr$cmat, pmm[,1])
}
if (any(natest)) test[natest,] <- NA
if (any(estimable))
result <- list(estimate=estimate,test=test, mvar=mvar)
else result <- list(estimate=estimate, test=test, mvar=NA)
# If there were multiple columns from predfun, compute the matrix of
# results and variances
if (ncol(pmm) > 1 && any(estimable)){
pmm <- apply(sims, 2:3, mean)
mvar2 <- apply(sims, 2:3, var)
# Call the summary function, if present
if (is.list(mfun)) result$summary <- sumfun(pmm, mvar2)
else {
result$pmm <- pmm
result$mvar2 <- mvar2
}
}
@
Build the population data set.
If the user provided a data set as the population then the task is
fairly straightforward: we manipulate the data set and then call
model.frame followed by model.matrix in the usual way.
The primary task in that
case is to verify that the data has all the needed variables.
Otherwise we have to be subtle.
\begin{enumerate}
\item We have ready access to a model frame, but not to the data.
Consider a spline term for instance --- it's not always possible
to go backwards and get the data.
\item We need to manipulate this model frame, e.g., make everyone
treatment=A, then repeat with everyone treatment B.
\item We need to do it in a way that makes the frame still look
like a correct model frame to R. This requires care.
\end{enumerate}
For population= factorial we create a population data set that has all
the combinations. If there are three adjusters z1, z2 and z3 with
2, 3, and 5 levels, respectively, the new data set will have 30
rows.
If the primary model didn't have any z1*z2*z3 terms in it we
likely could get by with less, but it's not worth the programming effort
to figure that out: predicted values are normally fairly cheap.
For population=sas we need a mixture: categoricals are factorial and the
others come from the data. Say there were categoricals with 3 and 5 levels,
so the factorial data set has 15 obs, while the overall n is 50. We need a
data set of 15*50 observations to pair every combination of the two
categoricals with each row of continuous values.
An issue with data vs model is names. Suppose the original model was
\code{lm(y \textasciitilde ns(age,4) + factor(ph.ecog))}.
In the data set the variable name is ph.ecog; in the model frame,
the xlevels list, and the terms structure it is factor(ph.ecog).
The data frame has individual columns for each of the variables, while the
model frame is a list with 3 elements, one of which is named
``ns(age, 4)'': notice the extra space before the 4 compared to what was typed.
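The factorial population is essentially expand.grid on the adjusters'
levels; a minimal sketch with hypothetical factors z1, z2, z3:
\begin{verbatim}
xlev <- list(z1 = c("a", "b"), z2 = c("low", "mid", "high"),
             z3 = letters[1:5])
pdata <- expand.grid(xlev, stringsAsFactors = TRUE)
nrow(pdata)   # 30 = 2 * 3 * 5, one row per combination
\end{verbatim}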
<<yates>>=
yates_xmat <- function(Terms, Tatt, contr, population, mframe, fit,
iscat, weight) {
# which variables(s) are in x1 (variables of interest)
# First a special case of strata(grp):x, which causes strata(grp) not to
# appear as a column
if (any(is.na(match(contr$termname, colnames(Tatt$factors))))) {
# this is rare
if (length(contr$termname) > 1) stop("incomplete code 1")
x1indx <- (contr$termname== rownames(Tatt$factors))
names(x1indx) <- rownames(Tatt$factors)
if (!any(x1indx)) stop(paste("variable", contr$termname, "not found"))
} else x1indx <- apply(Tatt$factors[,contr$termname,drop=FALSE] >0, 1, any)
x2indx <- !x1indx # adjusters
if (inherits(population, "data.frame")) pdata <- population #user data
else if (population=="data") pdata <- mframe #easy case
else if (population=="factorial")
pdata <- yates_factorial_pop(mframe, Terms, x2indx, fit$xlevels)
else if (population=="sas") {
if (all(iscat[x2indx]))
pdata <- yates_factorial_pop(mframe, Terms, x2indx, fit$xlevels)
else if (!any(iscat[x2indx])) pdata <- mframe # no categoricals
else { # mixed population
pdata <- yates_factorial_pop(mframe, Terms, x2indx & iscat,
fit$xlevels)
n2 <- nrow(pdata)
pdata <- pdata[rep(1:nrow(pdata), each=nrow(mframe)), ]
row.names(pdata) <- 1:nrow(pdata)
# fill in the continuous
k <- rep(1:nrow(mframe), n2)
for (i in which(x2indx & !iscat)) {
j <- names(x1indx)[i]
if (is.matrix(mframe[[j]]))
pdata[[j]] <- mframe[[j]][k,, drop=FALSE]
else pdata[[j]] <- (mframe[[j]])[k]
attributes(pdata[[j]]) <- attributes(mframe[[j]])
}
}
}
else stop("unknown population") # this should have been caught earlier
# Now create the x1 data set, the unique rows we want to test
<<yates-x1mat>>
xmatlist
}
@
Build a factorial data set from a model frame.
<<yates>>=
yates_factorial_pop <- function(mframe, terms, x2indx, xlevels) {
x2name <- names(x2indx)[x2indx]
dclass <- attr(terms, "dataClasses")[x2name]
if (!all(dclass %in% c("character", "factor")))
stop("population=factorial only applies if all the adjusting terms are categorical")
nvar <- length(x2name)
n2 <- sapply(xlevels[x2name], length) # number of levels for each
n <- prod(n2) # total number of rows needed
pdata <- mframe[rep(1, n), -1] # toss the response
row.names(pdata) <- NULL # throw away funny names
n1 <- 1
for (i in 1:nvar) {
j <- rep(rep(1:n2[i], each=n1), length=n)
xx <- xlevels[[x2name[i]]]
if (dclass[i] == "factor")
pdata[[x2name[i]]] <- factor(j, 1:n2[i], labels= xx)
else pdata[[x2name[i]]] <- xx[j]
n1 <- n1 * n2[i]
}
attr(pdata, "terms") <- terms
pdata
}
@
The next section builds a set of X matrices, one for each level of the
x1 combination.
The following was learned by reading the source code for
model.matrix:
\begin{itemize}
\item If pdata has no terms attribute then model.matrix will call model.frame
first, otherwise not. The xlev argument is passed forward to model.frame
but is otherwise unused.
\item If necessary, it will reorder the columns of pdata to match the terms,
though I try to avoid that.
\item Toss out the response variable, if present.
\item Any character variables are turned into factors. The dataClass attribute
of the terms object is not consulted.
\item For each column that is a factor
\begin{itemize}
\item if it already has a contrasts attribute, it is left alone.
\item otherwise a contrasts attribute is added using a matching
element from contrasts.arg, if present, otherwise the global default
\item contrasts.arg must be a list, but it does not have to contain all
factors
\end{itemize}
\item Then call the internal C code
\end{itemize}
If pdata already is a model frame we want to leave it as one, so as to
avoid recreating the raw data.
If x1data comes from the user, though, we need to do that portion of
model.frame processing ourselves in order to get it into the right
form. Always turn characters into factors, since individual elements
of \code{xmatlist} will have only a subset of the x1 variables.
One nuisance is name matching. Say the model had
\code{factor(ph.ecog)} as a term; then \code{fit\$xlevels} will have
`factor(ph.ecog)' as a name but the user will likely have created a
data set using `ph.ecog' as the name.
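A quick way to see the mismatch, using the lung data as a sketch:
\begin{verbatim}
library(splines)
fit <- lm(wt.loss ~ ns(age, 4) + factor(ph.ecog), data = lung)
names(fit$xlevels)  # "factor(ph.ecog)"
names(fit$model)    # "wt.loss" "ns(age, 4)" "factor(ph.ecog)"
\end{verbatim}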
<<yates-x1mat>>=
if (is.null(contr$levels)) stop("levels are missing for this contrast")
x1data <- as.data.frame(contr$levels) # in case it is a list
x1name <- names(x1indx)[x1indx]
for (i in 1:ncol(x1data)) {
if (is.character(x1data[[i]])) {
if (is.null(fit$xlevels[[x1name[i]]]))
x1data[[i]] <- factor(x1data[[i]])
else x1data[[i]] <- factor(x1data[[i]], fit$xlevels[[x1name[i]]])
}
}
xmatlist <- vector("list", nrow(x1data))
if (is.null(attr(pdata, "terms"))) {
np <- nrow(pdata)
k <- match(x1name, names(pdata), nomatch=0)
if (any(k>0)) pdata <- pdata[, -k, drop=FALSE] # toss out yates var
for (i in 1:nrow(x1data)) {
j <- rep(i, np)
tdata <- cbind(pdata, x1data[j,,drop=FALSE]) # new data set
xmatlist[[i]] <- model.matrix(Terms, tdata, xlev=fit$xlevels,
contrasts.arg= fit$contrasts)
}
} else {
# pdata is a model frame, convert x1data
# if the name and the class agree we go forward simply
index <- match(names(x1data), names(pdata), nomatch=0)
if (all(index >0) &&
identical(lapply(x1data, class), lapply(pdata, class)[index]) &
identical(sapply(x1data, ncol) , sapply(pdata, ncol)[index]))
{ # everything agrees
for (i in 1:nrow(x1data)) {
j <- rep(i, nrow(pdata))
tdata <- pdata
tdata[,names(x1data)] <- x1data[j,]
xmatlist[[i]] <- model.matrix(Terms, tdata,
contrasts.arg= fit$contrasts)
}
}
else {
# create a subset of the terms structure, for x1 only
# for instance the user had age=c(75, 75, 85) and the term was ns(age)
# then call model.frame to fix it up
x1term <- Terms[which(x1indx)]
x1name <- names(x1indx)[x1indx]
attr(x1term, "dataClasses") <- Tatt$dataClasses[x1name] # R bug
x1frame <- model.frame(x1term, x1data, xlev=fit$xlevels[x1name])
for (i in 1:nrow(x1data)) {
j <- rep(i, nrow(pdata))
tdata <- pdata
tdata[,names(x1frame)] <- x1frame[j,]
xmatlist[[i]] <- model.matrix(Terms, tdata, xlev=fit$xlevels,
contrasts.arg= fit$contrasts)
}
}
}
@
The decomposition-based algorithm for SAS type 3 tests.
Ignore the set of contrasts cmat since the algorithm can only
do a global test.
We mostly mimic the SAS GLM algorithm.
For the generalized Cholesky decomposition $LDL' = X'X$, where $L$ is
lower triangular with $L_{ii}=1$ and $D$ is diagonal, the set of contrasts
$L'\beta$ gives the type I sequential sums of squares, partitioning the
rows of $L$ into those for term 1, term 2, etc.
If $X$ is the design matrix for a balanced factorial design then it is
also true that $L_{ij}=0$ unless term $j$ includes term $i$, e.g., x1:x2
includes x1. These blocks of zeros mean that changing the order of the terms
in the model simply rearranges $L$, and individual tests are unchanged.
This is precisely the definition of a type III contrast in SAS.
With a bit of reading between the lines the ``four types of estimable
functions'' document suggests the following algorithm:
\begin{enumerate}
\item Start with an $X$ matrix in standard order of intercept, main effects,
first order interactions, etc. Code any categorical variable with $k$ levels
as $k$ 0/1 columns. An interaction of two categoricals with $k$ and $l$
levels will have $kl$ columns, etc.
\item Create the dependency matrix $D = (X'X)^-(X'X)$. If column $i$ of $X$
can be written as a linear combination of prior columns, then column $i$ of
$D$ contains that combination. Other columns of $D$ match the identity
matrix.
\item Initialize $L = D$.
\item For any rows $i$ and $j$ such that term $i$ is contained in term $j$,
make $L_i$ orthogonal to $L_j$.
\end{enumerate}
The algorithm appears to work in almost all cases; an exception is when the
type 3 test has fewer degrees of freedom than we would expect.
Continuous variables are not orthogonalized in the SAS type III approach,
nor is any interaction that contains a continuous variable as one of its parts.
To find the nested terms first note which rows of \code{factors} refer
to categorical variables (the \code{iscat} variable);
columns of \code{factors} that are non-zero only
in categorical rows are the ``categorical'' columns.
A term represented by one column in \code{factors} ``contains'' the term
represented in some other column iff its non-zero elements are a superset.
We have to build a new X matrix that is the expanded SAS coding, and are only
able to do that for models that have an intercept and use contr.treatment
or contr.SAS coding.
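A toy version of step 2 of the algorithm, using the same lm trick as the
code below (made-up data):
\begin{verbatim}
X <- cbind(1, c(1,1,0,0), c(0,0,1,1))  # column 3 = column 1 - column 2
D <- coef(lm(X ~ I(X) - 1))            # aliased columns come back as NA
D[is.na(D)] <- 0
D[, 3]                                 # c(1, -1, 0): col 3 via cols 1-2
\end{verbatim}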
<<yates-sgtt>>=
# It would be simplest to have the contrasts.arg to be a list of function names.
# However, model.matrix plays games with the calling sequence, and any function
# defined at this level will not be seen. Instead create a list of contrast
# matrices.
temp <- sapply(fit$contrasts, function(x) (is.character(x) &&
x %in% c("contr.SAS", "contr.treatment")))
if (!all(temp))
stop("yates sgtt method can only handle contr.SAS or contr.treatment")
temp <- vector("list", length(fit$xlevels))
names(temp) <- names(fit$xlevels)
for (i in 1:length(fit$xlevels)) {
cmat <- diag(length(fit$xlevels[[i]]))
dimnames(cmat) <- list(fit$xlevels[[i]], fit$xlevels[[i]])
if (i>1 || Tatt$intercept==1) {
if (fit$contrasts[[i]] == "contr.treatment")
cmat <- cmat[, c(2:ncol(cmat), 1)]
}
temp[[i]] <- cmat
}
sasX <- model.matrix(formula(fit), data=mframe, xlev=fit$xlevels,
contrasts.arg=temp)
sas.assign <- attr(sasX, "assign")
# create the dependency matrix D. The lm routine is unhappy if it thinks
# the right hand and left hand sides are the same, fool it with I().
# We do this using the entire X matrix even though only categoricals will
# eventually be used; if a continuous variable made it NA we need to know.
D <- coef(lm(sasX ~ I(sasX) -1))
dimnames(D)[[1]] <- dimnames(D)[[2]] # get rid of the I() names
zero <- is.na(D[,1]) # zero rows, we'll get rid of these later
D <- ifelse(is.na(D), 0, D)
# make each row orthogonal to rows for other terms that contain it
# Containing blocks, if any, will always be below
# this is easiest to do with the transposed matrix
# Only do this if both row i and j are for a categorical variable
if (!all(iscat)) {
# iscat marks variables in the model frame as categorical
# tcat marks terms as categorical. For x1 + x2 + x1:x2 iscat has
# 2 entries and tcat has 3.
tcat <- (colSums(Tatt$factors[!iscat,,drop=FALSE]) == 0)
}
else tcat <- rep(TRUE, max(sas.assign)) # all vars are categorical
B <- t(D)
dimnames(B)[[2]] <- paste0("L", 1:ncol(B)) # for the user
if (ncol(Tatt$factors) > 1) {
share <- t(Tatt$factors) %*% Tatt$factors
nc <- ncol(share)
for (i in which(tcat[-nc])) {
j <- which(share[i,] > 0 & tcat)
k <- j[j>i] # terms that I need to regress out
if (length(k)) {
indx1 <- which(sas.assign ==i)
indx2 <- which(sas.assign %in% k)
B[,indx1] <- resid(lm(B[,indx1] ~ B[,indx2]))
}
}
}
# Cut B back down to the non-missing coefs of the original fit
Smat <- t(B)[!zero, !zero]
Sassign <- xassign[!nabeta]
@
Although the SGTT does test for all terms, we only want to print out the
ones that were asked for.
<<yates-sgtt>>=
keep <- match(contr$termname, colnames(Tatt$factors))
if (length(keep) > 1) { # more than 1 term in the model
test <- t(sapply(keep, function(i)
testfun(Smat[Sassign==i,,drop=FALSE], beta, vmat, sigma^2)))
rownames(test) <- contr$termname
} else {
test <- testfun(Smat[Sassign==keep,, drop=FALSE], beta, vmat, sigma^2)
test <- matrix(test, nrow=1,
dimnames=list(contr$termname, names(test)))
}
@
The print routine places the population predicted values (PPV) alongside the
tests on those values. Defaults are copied from printCoefmat.
<<yates>>=
print.yates <- function(x, digits = max(3, getOption("digits") -2),
dig.tst = max(1, min(5, digits-1)),
eps=1e-8, ...) {
temp1 <- x$estimate
temp1$pmm <- format(temp1$pmm, digits=digits)
temp1$std <- format(temp1$std, digits=digits)
# the spaces help separate the two parts of the printout
temp2 <- cbind(test= paste(" ", rownames(x$test)),
data.frame(x$test), stringsAsFactors=FALSE)
row.names(temp2) <- NULL
temp2$Pr <- format.pval(pchisq(temp2$chisq, temp2$df, lower.tail=FALSE),
eps=eps, digits=dig.tst)
temp2$chisq <- format(temp2$chisq, digits= dig.tst)
temp2$df <- format(temp2$df)
if (!is.null(temp2$ss)) temp2$ss <- format(temp2$ss, digits=digits)
if (nrow(temp1) > nrow(temp2)) {
dummy <- temp2[1,]
dummy[1,] <- ""
temp2 <- rbind(temp2, dummy[rep(1, nrow(temp1)-nrow(temp2)),])
}
if (nrow(temp2) > nrow(temp1)) {
# get rid of any factors before padding
for (i in which(sapply(temp1, is.factor)))
temp1[[i]] <- as.character(temp1[[i]])
dummy <- temp1[1,]
dummy[1,] <- ""
temp1 <- rbind(temp1, dummy[rep(1, nrow(temp2)- nrow(temp1)),])
}
print(cbind(temp1, temp2), row.names=FALSE)
invisible(x)
}
@
Routines to allow yates to interact with other models.
Each is called with the fitted model and the type of prediction.
It should return NULL when the type is a linear predictor, since the
parent routine has a very efficient approach in that case.
Otherwise it returns a function that will be applied to each value
$\eta$, from each row of a prediction matrix.
<<yates>>=
yates_setup <- function(fit, ...)
UseMethod("yates_setup", fit)
yates_setup.default <- function(fit, type, ...) {
if (!missing(type) && !(type %in% c("linear", "link")))
warning("no yates_setup method exists for a model of class ",
class(fit)[1], " and estimate type ", type,
", linear predictor estimate used by default")
NULL
}
yates_setup.glm <- function(fit, predict = c("link", "response", "terms",
"linear"), ...) {
type <- match.arg(predict)
if (type == "link" || type== "linear") NULL # same as linear
else if (type == "response") {
finv <- family(fit)$linkinv
function(eta, X) finv(eta)
}
else if (type == "terms")
stop("type terms not yet supported")
}
@
For the coxph routine we make use of R's lexical scoping: the baseline
hazard is defined first, and then the predict and summary functions,
which therefore have access to it.
<<yates>>=
yates_setup.coxph <- function(fit, predict = c("lp", "risk", "expected",
"terms", "survival", "linear"),
options, ...) {
type <- match.arg(predict)
if (type=="lp" || type == "linear") NULL
else if (type=="risk") function(eta, X) exp(eta)
else if (type == "survival") {
# If there are strata we need to do extra work
# if there is an interaction we want to suppress a spurious warning
suppressWarnings(baseline <- survfit(fit, censor=FALSE))
if (missing(options) || is.null(options$rmean))
rmean <- max(baseline$time) # max death time
else rmean <- options$rmean
if (!is.null(baseline$strata))
stop("stratified models not yet supported")
cumhaz <- c(0, baseline$cumhaz)
tt <- c(diff(c(0, pmin(rmean, baseline$time))), 0)
predict <- function(eta, ...) {
c2 <- outer(exp(drop(eta)), cumhaz) # matrix of values
surv <- exp(-c2)
meansurv <- apply(rep(tt, each=nrow(c2)) * surv, 1, sum)
cbind(meansurv, surv)
}
summary <- function(surv, var) {
bsurv <- t(surv[,-1])
std <- t(sqrt(var[,-1]))
chaz <- -log(bsurv)
zstat <- -qnorm((1-baseline$conf.int)/2)
baseline$lower <- exp(-(chaz + zstat*std))
baseline$upper <- exp(-(chaz - zstat*std))
baseline$surv <- bsurv
baseline$std.err <- std/bsurv
baseline$cumhaz <- chaz
baseline
}
list(predict=predict, summary=summary)
}
else stop("type expected is not supported")
}
@
\section{The cox.zph function}
The simplest test of proportional hazards is to use a time dependent
coefficient $\beta(t) = a + bt$.
Then $\beta(t) x = ax + b(tx)$, and the extended coefficients $a$ and $b$
can be obtained from a Cox model with an extra 'fake' covariate $tx$.
More generally, replace $t$ with some function $g(t)$, which gives rise to
an entire family of tests.
An efficient assessment of this extended model can be done using a score
test.
\begin{itemize}
\item Augment the original variables $x_1, \ldots x_k$ with $k$ new ones
$g(t)x_1, \ldots, g(t)x_k$
\item Compute the first and second derivatives $U$ and $H$ of the Cox model
at the starting estimate of $(\hat\beta, 0)$; prior covariates at their
prior values, and the new covariates at 0. No iteration is done.
This can be done efficiently with a modified version of the primary C routines
for coxph.
\item By design, the first $k$ elements of $U$ will be zero. Thus the
first iteration of the new coefficients, and the score tests for them, are
particularly easy.
\end{itemize}
The information or Hessian matrix for a Cox model is
$$ \sum_{j \in deaths} V(t_j) = \sum_jV_j$$
where $V_j$ is the variance matrix of the weighted covariate values, over
all subjects at risk at time $t_j$.
Then the expanded information matrix for the score test is
\begin{align*}
H &= \left(\begin{array}{cc} H_1 & H_2 \\ H_2' & H_3 \end{array} \right) \\
H_1 &= \sum V(t_j) \\
H_2 &= \sum V(t_j) g(t_j) \\
H_3 &= \sum V(t_j) g^2(t_j)
\end{align*}
The inverse of the matrix will be more numerically stable if $g(t)$ is centered
at zero, and this does not change the test statistic.
In the usual case $V(t)$ is close to constant in time --- the variance of
$X$ does not change rapidly --- and then $H_2$ is approximately zero.
The original cox.zph used an approximation: assume that
$V(t)$ is exactly constant.
In that case $H_2=0$ and
$H_3= \left(\sum_j V(t_j)\right) \left(\sum_j g^2(t_j)\right)/d$,
with $d$ the number of events, and the test is particularly easy to compute.
This assumption of identical components can fail badly for models with a
covariate by strata interaction, and for some models with covariate
dependent censoring.
Multi-state models finally forced a change.
The newer version of the routine has two separate tracks: for the formal test
and another for the residuals.
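For orientation, a typical call, sketched with the lung data:
\begin{verbatim}
fit <- coxph(Surv(time, status) ~ age + wt.loss, data = lung)
zp <- cox.zph(fit, transform = "km")  # "km" is the default transform
zp           # chisq, df, p for each term plus a GLOBAL row
plot(zp[1])  # scaled Schoenfeld residuals for the first term
\end{verbatim}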
<<cox.zph>>=
cox.zph <- function(fit, transform='km', terms=TRUE, singledf =FALSE,
global=TRUE) {
Call <- match.call()
if (!inherits(fit, "coxph") && !inherits(fit, "coxme"))
stop ("argument must be the result of Cox model fit")
if (inherits(fit, "coxph.null"))
stop("there are no score residuals for a Null model")
if (!is.null(attr(terms(fit), "specials")[["tt"]]))
stop("function not defined for models with tt() terms")
if (inherits(fit, "coxme")) {
# drop all mention of the random effects, before getdata
fit$formula <- fit$formula$fixed
fit$call$formula <- fit$formula
}
cget <- coxph.getdata(fit, y=TRUE, x=TRUE, stratax=TRUE, weights=TRUE)
y <- cget$y
ny <- ncol(y)
event <- (y[,ny] ==1)
if (length(cget$strata))
istrat <- as.integer(cget$strata) - 1L # number from 0 for C
else istrat <- rep(0L, nrow(y))
# if terms==FALSE the singledf argument is moot, but setting a value
# leads to a simpler path through the code
if (!terms) singledf <- FALSE
<<zph-setup>>
<<zph-transform>>
<<zph-terms>>
<<zph-schoen>>
rval$transform <- tname
rval$call <- Call
class(rval) <- "cox.zph"
return(rval)
}
print.cox.zph <- function(x, digits = max(options()$digits - 4, 3),
signif.stars=FALSE, ...) {
invisible(printCoefmat(x$table, digits=digits, signif.stars=signif.stars,
P.values=TRUE, has.Pvalue=TRUE, ...))
}
@
The user can use $t$ or $g(t)$ as the multiplier of the covariates.
The default is to use the KM, only because that seems to be best at
avoiding edge cases.
<<zph-transform>>=
times <- y[,ny-1]
if (is.character(transform)) {
tname <- transform
ttimes <- switch(transform,
'identity'= times,
'rank' = rank(times),
'log' = log(times),
'km' = {
temp <- survfitKM(factor(rep(1L, nrow(y))),
y, se.fit=FALSE)
# A nuisance to do left continuous KM
indx <- findInterval(times, temp$time, left.open=TRUE)
1.0 - c(1, temp$surv)[indx+1]
},
stop("Unrecognized transform"))
}
else {
tname <- deparse(substitute(transform))
if (length(tname) >1) tname <- 'user'
ttimes <- transform(times)
}
gtime <- ttimes - mean(ttimes[event])
# Now get the U, information, and residuals
if (ny==2) {
ord <- order(istrat, y[,1]) -1L
resid <- .Call(Czph1, gtime, y, X, eta,
cget$weights, istrat, fit$method=="efron", ord)
}
else {
ord1 <- order(-istrat, -y[,1]) -1L # reverse time for zph2
ord <- order(-istrat, -y[,2]) -1L
resid <- .Call(Czph2, gtime, y, X, eta,
cget$weights, istrat, fit$method=="efron",
ord1, ord)
}
@
The result has a score vector of length $2p$ where $p$ is the number of
variables and an information matrix that is $2p$ by $2p$.
This is done with C code that
is a simple variation on iteration 1 for a coxph model.
If \code{singledf} is TRUE then treat each term as a single degree of
freedom test, otherwise as a multi-degree of freedom.
If terms=FALSE test each covariate individually.
If all the variables are univariate this is a moot point.
The survival routines return Splus style assign components, that is a list
with one element per term, each element an integer vector of coefficient
indices.
The asgn vector is our main workhorse: loop over asgn to process term by
term.
\begin{itemize}
\item if terms=FALSE, make a new asgn with one coefficient per term
\item if a coefficient is NA, remove it from the relevant asgn vector
\item frailties and penalized coxme coefficients are ignored: remove
their element from the asgn list
\end{itemize}
For random effects models, including both frailty and coxme results, the
random effect is included in the linear.predictors component of the
fit. This allows us to do score tests for the other terms while effectively
holding the random effect fixed.
If there are any NA coefficients these are redundant variables. It's
easiest to simply get rid of them at the start by fixing up X, varnames,
asgn, nvar, and fcoef. The variance matrix won't have the NA columns in any case.
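To see the Splus-style assign list that this code starts from, a sketch
with the lung data:
\begin{verbatim}
fit <- coxph(Surv(time, status) ~ ph.ecog + factor(sex), data = lung)
fit$assign
# $ph.ecog
# [1] 1
#
# $`factor(sex)`
# [1] 2
\end{verbatim}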
<<zph-setup>>=
eta <- fit$linear.predictors
X <- cget$x
varnames <- names(fit$coefficients)
nvar <- length(varnames)
if (!terms) {
# create a fake asgn that has one value per coefficient
asgn <- as.list(1:nvar)
names(asgn) <- names(fit$coefficients)
}
else if (inherits(fit, "coxme")) {
asgn <- attrassign(cget$x, terms(fit))
# allow for a spelling inconsistency in coxme, later fixed
if (is.null(fit$linear.predictors))
eta <- fit$linear.predictor
fit$df <- NULL # don't confuse later code
}
else asgn <- fit$assign
if (!is.list(asgn)) stop ("unexpected assign component")
frail <- grepl("frailty(", names(asgn), fixed=TRUE) |
grepl("frailty.gamma(", names(asgn), fixed = TRUE) |
grepl("frailty.gaussian(", names(asgn), fixed = TRUE)
if (any(frail)) {
dcol <- unlist(asgn[frail]) # remove these columns from X
X <- X[, -dcol, drop=FALSE]
asgn <- asgn[!frail]
# frailties don't appear in the varnames, so no change there
}
nterm <- length(asgn)
termname <- names(asgn)
fcoef <- fit$coefficients
if (any(is.na(fcoef))) {
keep <- !is.na(fcoef)
varnames <- varnames[keep]
X <- X[,keep]
fcoef <- fcoef[keep]
# fix up assign
new <- unname(unlist(asgn))[keep] # the ones to keep
asgn <- sapply(asgn, function(x) {
i <- match(x, new, nomatch=0)
i[i>0]})
asgn <- asgn[sapply(asgn, length)>0] # drop any that were lost
termname <- names(asgn)
nterm <- length(asgn) # asgn will be a list
nvar <- length(new)
}
@
The zph1 and zph2 functions do not consider penalties, so we need to add
those back in after the call.
Nothing needs to be done wrt the first derivative: we already ignore the
first ncoef elements of the returned first derivative (u) vector, which would
have had a penalty. The second portion of u is for beta=0, and all of the
penalties that currently are implemented have first derivative 0 at 0.
For the second derivative, the current penalties (frailty, ridge, pspline)
have a second derivative penalty that is independent of beta-hat.
The coxph result contains the numeric value of the penalty at the solution,
and we use a score test that would penalize the new time*pspline() term in
the same way as the pspline term was penalized.
If no coefficients were missing then the retained coefficient index will be
1:n, otherwise it will have holes.
<<zph-terms>>=
test <- double(nterm+1)
df <- rep(1L, nterm+1)
u0 <- rep(0, nvar)
if (!is.null(fit$coxlist2)) { # there are penalized terms
pmat <- matrix(0., 2*nvar, 2*nvar) # second derivative penalty
pmat[1:nvar, 1:nvar] <- fit$coxlist2$second
pmat[1:nvar + nvar, 1:nvar + nvar] <- fit$coxlist2$second
imatr <- resid$imat + pmat
}
else imatr <- resid$imat
for (ii in 1:nterm) {
jj <- asgn[[ii]]
kk <- c(1:nvar, jj+nvar)
imat <- imatr[kk, kk]
u <- c(u0, resid$u[jj+nvar])
if (singledf && length(jj) >1) {
vv <- solve(imat)[-(1:nvar), -(1:nvar)]
t1 <- sum(fcoef[jj] * resid$u[jj+nvar])
test[ii] <- t1^2 * (fcoef[jj] %*% vv %*% fcoef[jj])
df[ii] <- 1
}
else {
test[ii] <- drop(solve(imat,u) %*% u)
if (is.null(fit$df)) df[ii] <- length(jj)
else df[ii] <- fit$df[ii]
}
}
#Global test
if (global) {
u <- c(u0, resid$u[-(1:nvar)])
test[nterm+1] <- solve(imatr, u) %*% u
if (is.null(fit$df)) df[nterm+1] <- nvar
else df[nterm+1] <- sum(fit$df)
tbl <- cbind(test, df, pchisq(test, df, lower.tail=FALSE))
dimnames(tbl) <- list(c(termname, "GLOBAL"), c("chisq", "df", "p"))
}
else {
tbl <- cbind(test, df, pchisq(test, df, lower.tail=FALSE))[1:nterm,, drop=FALSE]
dimnames(tbl) <- list(termname, c("chisq", "df", "p"))
}
# The x, y, residuals part is sorted by time within strata; this is
# what the C routine zph1 and zph2 return
indx <- if (ny==2) ord +1 else rev(ord) +1 # return to 1 based subscripts
indx <- indx[event[indx]] # only keep the death times
rval <- list(table=tbl, x=unname(ttimes[indx]), time=unname(y[indx, ny-1]))
if (length(cget$strata)) rval$strata <- cget$strata[indx]
@
The matrix of scaled Schoenfeld residuals is created one stratum at a
time.
The ideal for the residual $r(t_i)$, contributed by an event for subject
$i$ at time $t_i$ is to use $r_iV^{-1}(t_i)$, the inverse of the variance
matrix of $X$ at that time and for the relevant stratum.
What is returned as \code{resid\$imat} is $\sum_i V(t_i)$.
One option would have been to return all the individual $\hat V_i$ matrices,
but that falls over when the number at risk is too small and it cannot
be inverted.
Option 2 would be to use a per-stratum average of the $V_i$, but that falls
flat for models with a large number of strata, a nested case-control model
for instance.
We take a different average that may not be the best, but seems to be
good enough and doesn't seem to fail.
\begin{enumerate}
\item The \code{resid\$used} matrix contains the number of deaths for
each stratum (row) that contributed to the sum for each variable (column).
The value is either 0 or the number of events in the stratum, zero for those
variables that are constant within the stratum. From this we can get the
number of events that contributed to each element of the \code{imat} total.
Dividing by this gives a per-element average \code{vmean}.
\item For a given stratum, some of the covariates may have been unused. For
any of those set the scaled Schoenfeld residual to NA, and use the other
rows/columns of the \code{vmean} matrix to scale the rest.
\end{enumerate}
Now if some variable $x_1$ has a large variance at some time points and a
small variance at others, or a large variance in one stratum and a small
variance in another, the above smoothing won't catch that subtlety.
However we expect such an issue to be rare.
The common problem of strata*covariate interactions is the target of the
above manipulations.
<<zph-schoen>>=
# Watch out for a particular edge case: there is a factor, and one of the
# strata happens to not use one of its levels. The element of resid$used will
be zero, but it really should not be.
used <-resid$used
for (i in asgn) {
if (length(i) > 1 && any(used[,i] ==0))
used[,i] <- apply(used[,i,drop=FALSE], 1, max)
}
# Make the weight matrix
wtmat <- matrix(0, nvar, nvar)
for (i in 1:nrow(used))
wtmat <- wtmat + outer(used[i,], used[i,], pmin)
# with strata*covariate interactions (multi-state models for instance) the
# imatr matrix will be block diagonal. Don't divide these off diagonal zeros
# by a wtmat value of zero.
vmean <- imatr[1:nvar, 1:nvar, drop=FALSE]/ifelse(wtmat==0, 1, wtmat)
sresid <- resid$schoen
if (terms && any(sapply(asgn, length) > 1)) { # collapse multi-column terms
temp <- matrix(0, ncol(sresid), nterm)
for (i in 1:nterm) {
j <- asgn[[i]]
if (length(j) ==1) temp[j, i] <- 1
else temp[j, i] <- fcoef[j]
}
sresid <- sresid %*% temp
vmean <- t(temp) %*% vmean %*% temp
used <- used[, sapply(asgn, function(x) x[1]), drop=FALSE]
}
dimnames(sresid) <- list(signif(rval$time, 4), termname)
# for each stratum, rescale the Schoenfeld residuals in that stratum
sgrp <- rep(1:nrow(used), apply(used, 1, max))
for (i in 1:nrow(used)) {
k <- which(used[i,] > 0)
if (length(k) >0) { # there might be no deaths in the stratum
j <- which(sgrp==i)
if (length(k) ==1) sresid[j,k] <- sresid[j,k]/vmean[k,k]
else sresid[j, k] <- t(solve(vmean[k, k], t(sresid[j, k, drop=FALSE])))
sresid[j, -k] <- NA
}
}
# Add in beta-hat. For a term with multiple columns we are testing zph for
# the linear predictor X\beta, which always has a coefficient of 1
for (i in 1:nterm) {
j <- asgn[[i]]
if (length(j) ==1) sresid[,i] <- sresid[,i] + fcoef[j]
else sresid[,i] <- sresid[,i] +1
}
rval$y <- sresid
rval$var <- solve(vmean)
@
<<cox.zph>>=
"[.cox.zph" <- function(x, ..., drop=FALSE) {
i <- ..1
if (is.logical(i)) i <- which(i)
else if (is.character(i))
i <- match(i, colnames(x$y))
if (any(is.na(i) | i> ncol(x$y))) stop ("invalid variable requested")
if (!is.null(x$strata)) {
y2 <- x$y[,i,drop=FALSE]
ymiss <- apply(is.na(y2), 1, all)
if (any(ymiss)) {
# some deaths played no role in these coefficients
# due to a strata * covariate interaction, drop unneeded rows
z<- list(table=x$table[i,,drop=FALSE], x=x$x[!ymiss],
time= x$time[!ymiss],
strata = x$strata[!ymiss],
y = y2[!ymiss,,drop=FALSE],
var=x$var[i,i, drop=FALSE],
transform=x$transform, call=x$call)
}
else z<- list(table=x$table[i,,drop=FALSE], x=x$x, time= x$time,
strata = x$strata,
y = y2, var=x$var[i,i, drop=FALSE],
transform=x$transform, call=x$call)
}
else
z<- list(table=x$table[i,,drop=FALSE], x=x$x, time= x$time,
y = x$y[,i,drop=FALSE],
var=x$var[i,i, drop=FALSE],
transform=x$transform, call=x$call)
class(z) <- class(x)
z
}
@
\bibliographystyle{plain}
\bibliography{refer}
\end{document}
|