1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411
|
# regalloc2 Design Overview
This document describes the basic architecture of the regalloc2
register allocator. It describes the externally-visible interface
(input CFG, instructions, operands, with their invariants; meaning of
various parts of the output); core data structures; and the allocation
pipeline, or series of algorithms that compute an allocation. It ends
with a description of future work and expectations, as well as an
appendix that notes design influences and similarities to the
IonMonkey backtracking allocator.
# API, Input IR and Invariants
The toplevel API to regalloc2 consists of a single entry point `run()`
that takes a register environment, which specifies all physical
registers, and the input program. The function returns either an error
or an `Output` struct that provides allocations for each operand and a
vector of additional instructions (moves, loads, stores) to insert.
## Register Environment
The allocator takes a `MachineEnv` which specifies, for each of the
two register classes `Int` and `Float`, a vector of `PReg`s by index. A
`PReg` is nothing more than the class and index within the class; the
allocator does not need to know anything more.
The `MachineEnv` provides a vector of preferred and non-preferred
physical registers per class. Any register not in either vector will
not be allocated. Usually, registers that do not need to be saved in
the prologue if used (i.e., caller-save registers) are given in the
"preferred" vector. The environment also provides exactly one scratch
register per class. This register must not be in the preferred or
non-preferred vectors, and is used whenever a set of moves that need
to occur logically in parallel have a cycle (for a simple example,
consider a swap `r0, r1 := r1, r0`).
With some more work, we could potentially remove the need for the
scratch register by requiring support for an additional edit type from
the client ("swap"), but we have not pursued this.
## CFG and Instructions
The allocator operates on an input program that is in a standard CFG
representation: the function body is a sequence of basic blocks, and
each block has a sequence of instructions and zero or more
successors. The allocator also requires the client to provide
predecessors for each block, and these must be consistent with the
successors.
Instructions are opaque to the allocator except for a few important
bits: (1) `is_ret` (is a return instruction); (2) `is_branch` (is a
branch instruction); and (3) a vector of Operands, covered below.
Every block must end in a return or branch.
Both instructions and blocks are named by indices in contiguous index
spaces. A block's instructions must be a contiguous range of
instruction indices, and block i's first instruction must come
immediately after block i-1's last instruction.
The CFG must have *no critical edges*. A critical edge is an edge from
block A to block B such that A has more than one successor *and* B has
more than one predecessor. For this definition, the entry block has an
implicit predecessor, and any block that ends in a return has an
implicit successor.
Note that there are *no* requirements related to the ordering of
blocks, and there is no requirement that the control flow be
reducible. Some *heuristics* used by the allocator will perform better
if the code is reducible and ordered in reverse postorder (RPO),
however: in particular, (1) this interacts better with the
contiguous-range-of-instruction-indices live range representation that
we use, and (2) the "approximate loop depth" metric will actually be
exact if both these conditions are met.
## Operands and VRegs
Every instruction operates on values by way of `Operand`s. An operand
consists of the following fields:
- VReg, or virtual register. *Every* operand mentions a virtual
register, even if it is constrained to a single physical register in
practice. This is because we track liveranges uniformly by vreg.
- Policy, or "constraint". Every reference to a vreg can apply some
constraint to the vreg at that point in the program. Valid policies are:
- Any location;
- Any register of the vreg's class;
- Any stack slot;
- A particular fixed physical register; or
- For a def (output), a *reuse* of an input register.
- The "kind" of reference to this vreg: Def, Use, Mod. A def
(definition) writes to the vreg, and disregards any possible earlier
value. A mod (modify) reads the current value then writes a new
one. A use simply reads the vreg's value.
- The position: before or after the instruction.
- Note that to have a def (output) register available in a way that
does not conflict with inputs, the def should be placed at the
"before" position. Similarly, to have a use (input) register
available in a way that does not conflict with outputs, the use
should be placed at the "after" position.
VRegs, or virtual registers, are specified by an index and a register
class (Float or Int). The classes are not given separately; they are
encoded on every mention of the vreg. (In a sense, the class is an
extra index bit, or part of the register name.) The input function
trait does require the client to provide the exact vreg count,
however.
Implementation note: both vregs and operands are bit-packed into
u32s. This is essential for memory-efficiency. As a result of the
operand bit-packing in particular (including the policy constraints!),
the allocator supports up to 2^21 (2M) vregs per function, and 2^6
(64) physical registers per class. Later we will also see a limit of
2^20 (1M) instructions per function. These limits are considered
sufficient for the anticipated use-cases (e.g., compiling Wasm, which
also has function-size implementation limits); for larger functions,
it is likely better to use a simpler register allocator in any case.
## Reuses and Two-Address ISAs
Some instruction sets primarily have instructions that name only two
registers for a binary operator, rather than three: both registers are
inputs, and the result is placed in one of the registers, clobbering
its original value. The most well-known modern example is x86. It is
thus imperative that we support this pattern well in the register
allocator.
This instruction-set design is somewhat at odds with an SSA
representation, where a value cannot be redefined.
Thus, the allocator supports a useful fiction of sorts: the
instruction can be described as if it has three register mentions --
two inputs and a separate output -- and neither input will be
clobbered. The output, however, is special: its register-placement
policy is "reuse input i" (where i == 0 or 1). The allocator
guarantees that the register assignment for that input and the output
will be the same, so the instruction can use that register as its
"modifies" operand. If the input is needed again later, the allocator
will take care of the necessary copying.
We will see below how the allocator makes this work by doing some
preprocessing so that the core allocation algorithms do not need to
worry about this constraint.
## SSA
regalloc2 takes an SSA IR as input, where the usual definitions apply:
every vreg is defined exactly once, and every vreg use is dominated by
its one def. (Using blockparams means that we do not need additional
conditions for phi-nodes.)
## Block Parameters
Every block can have *block parameters*, and a branch to a block with
block parameters must provide values for those parameters via
operands. When a branch has more than one successor, it provides
separate operands for each possible successor. These block parameters
are equivalent to phi-nodes; we chose this representation because they
are in many ways a more consistent representation of SSA.
To see why we believe block parameters are a slightly nicer design
choice than use of phi nodes, consider: phis are special
pseudoinstructions that must come first in a block, are all defined in
parallel, and whose uses occur on the edge of a particular
predecessor. All of these facts complicate any analysis that scans
instructions and reasons about uses and defs. It is much closer to the
truth to actually put those uses *in* the predecessor, on the branch,
and put all the defs at the top of the block as a separate kind of
def. The tradeoff is that a vreg's def now has two possibilities --
ordinary instruction def or blockparam def -- but this is fairly
reasonable to handle.
## Output
The allocator produces two main data structures as output: an array of
`Allocation`s and a sequence of edits. Some other data, such as
stackmap slot info, is also provided.
### Allocations
The allocator provides an array of `Allocation` values, one per
`Operand`. Each `Allocation` has a kind and an index. The kind may
indicate that this is a physical register or a stack slot, and the
index gives the respective register or slot. All allocations will
conform to the constraints given, and will faithfully preserve the
dataflow of the input program.
### Inserted Moves
In order to implement the necessary movement of data between
allocations, the allocator needs to insert moves at various program
points.
The vector of inserted moves contains tuples that name a program point
and an "edit". The edit is either a move, from one `Allocation` to
another, or else a kind of metadata used by the checker to know which
VReg is live in a given allocation at any particular time. The latter
sort of edit can be ignored by a backend that is just interested in
generating machine code.
Note that the allocator will never generate a move from one stackslot
directly to another, by design. Instead, if it needs to do so, it will
make use of the scratch register. (Sometimes such a move occurs when
the scratch register is already holding a value, e.g. to resolve a
cycle of moves; in this case, it will allocate another spillslot and
spill the original scratch value around the move.)
Thus, the single "edit" type can become either a register-to-register
move, a load from a stackslot into a register, or a store from a
register into a stackslot.
# Data Structures
We now review the data structures that regalloc2 uses to track its
state.
## Program-Derived Alloc-Invariant Data
There are a number of data structures that are computed in a
deterministic way from the input program and then subsequently used
only as read-only data during the core allocation procedure.
### Livein/Liveout Bitsets
The livein and liveout bitsets (`liveins` and `liveouts` on the `Env`)
are allocated one per basic block and record, per block, which vregs
are live entering and leaving that block. They are computed using a
standard backward iterative dataflow analysis and are exact; they do
not over-approximate (this turns out to be important for performance,
and is also necessary for correctness in the case of stackmaps).
### Blockparam Vectors: Source-Side and Dest-Side
The initialization stage scans the input program and produces two
vectors that represent blockparam flows from branches to destination
blocks: `blockparam_ins` and `blockparam_outs`.
These two vectors are the first instance we will see of a recurring
pattern: the vectors contain tuples that are carefully ordered in a
way such that their sort-order is meaningful. "Build a vector lazily
then sort" is a common idiom: it batches the O(n log n) cost into one
operation that the stdlib has aggressively optimized, it provides
dense storage, and it allows for a scan in a certain order that often
lines up with a scan over the program.
In this particular case, we will build vectors of (vreg, block) points
that are meaningful either at the start or end of a block, so that
later, when we scan over a particular vreg's allocations in block
order, we can generate another vector of allocations. One side (the
"outs") also contains enough information that it can line up with the
other side (the "ins") in a later sort.
To make this work, `blockparam_ins` contains a vector of (to-vreg,
to-block, from-block) tuples, and has an entry for every blockparam of
every block. Note that we can compute this without actually observing
from-blocks; we only need to iterate over `block_preds` at any given
block.
Then, `blockparam_outs` contains a vector of (from-vreg, from-block,
to-block, to-vreg), and has an entry for every parameter on every
branch that ends a block. There is exactly one "out" tuple for every
"in" tuple. As mentioned above, we will later scan over both to
generate moves.
## Core Allocation State: Ranges, Uses, Bundles, VRegs, PRegs
We now come to the core data structures: live-ranges, bundles, virtual
registers and their state, and physical registers and their state.
First we must define a `ProgPoint` precisely: a `ProgPoint` is an
instruction index and a `Before` or `After` suffix. We pack the
before/after suffix into the LSB of a `u32`, so a `ProgPoint` can be
incremented and compared as a simple integer.
A live-range is a contiguous range of program points (half-open,
i.e. including `from` and excluding `to`) for which a particular vreg
is live with a value.
A live-range contains a vector of uses. Each use contains four parts:
the Operand word (directly copied, so there is no need to dereference
it); the ProgPoint at which the use occurs; the operand slot on that
instruction, if any, that the operand comes from, and the use's
'weight". (It's possible to have "ghost uses" that do not derive from
any slot on the instruction.) These four parts are packed into three
`u32`s: the slot can fit in 8 bits, and the weight in 16.
The live-range carries its program-point range, uses, vreg index,
bundle index (see below), and some metadata: spill weight and
flags. The spill weight is the sum of weights of each use. The flags
set currently carries one flag only: whether the live-range starts at
a Def-kind operand. (This is equivalent to whether the range consumes
a value at its start or not.)
Uses are owned only by live-ranges and have no separate identity, but
live-ranges live in a toplevel array and are known by `LiveRangeIndex`
values throughout the allocator. New live-ranges can be created
(e.g. during splitting); old ones are not cleaned up, but rather, all
state is bulk-freed at the end.
Live-ranges are aggregated into "bundles". A bundle is a collection of
ranges that does not overlap. Each bundle carries: a vector (inline
SmallVec) of (range, live-range index) tuples, an allocation (starts
as "none"), a "spillset" (more below), and some metadata, including a
spill weight (sum of ranges' weights), a priority (sum of ranges'
lengths), and three property flags: "minimal", "contains fixed
constraints", "contains stack constraints".
VRegs also contain their vectors of live-ranges, in the same form as a
bundle does (inline SmallVec that has inline (from, to) range bounds
and range indices).
There are two important overlap invariants: (i) no liveranges within a
bundle overlap, and (ii) no liveranges within a vreg overlap. These
are extremely important and we rely on them implicitly in many places.
The live-range vectors in bundles and vregs, and use-vectors in ranges,
have various sorting invariants as well. These invariants differ
according to the phase of the allocator's computation. First, during
live-range construction, live-ranges are placed into vregs in reverse
order (because the computation is a reverse scan) and uses into ranges
in reverse order; these are sorted into forward order at the end of
live-range computation. When bundles are first constructed, their
range vectors are sorted, and they remain so for the rest of allocation,
as we need for interference testing. However, as ranges are created
and split, sortedness of vreg ranges is *not* maintained; they are
sorted once more, in bulk, when allocation is done and we start to
resolve moves.
Finally, we have physical registers. The main data associated with
each is the allocation map. This map is a standard BTree, indexed by
ranges (`from` and `to` ProgPoints) and yielding a LiveRange for each
location range. The ranges have a custom comparison operator defined
that compares equal for any overlap.
This comparison operator allows us to determine whether a range is
free, i.e. has no overlap with a particular range, in one probe -- the
btree will not contain a match. However, it makes iteration over *all*
overlapping ranges somewhat tricky to get right. Notably, Rust's
BTreeMap does not guarantee that the lookup result will be the *first*
equal key, if multiple keys are equal to the probe key. Thus, when we
want to enumerate all overlapping ranges, we probe with a range that
consists of the single program point *before* the start of the actual
query range, using the API that returns an iterator over a range in
the BTree, and then iterate through the resulting iterator to gather
all overlapping ranges (which will be contiguous).
## Spill Bundles
It is worth describing "spill bundles" separately. Every spillset (see
below; a group of bundles that originated from one bundle) optionally
points to a single bundle that we designate the "spill bundle" for
that spillset. Contrary to the name, this bundle is not
unconditionally spilled. Rather, one can see it as a sort of fallback:
it is where liveranges go when we give up on processing them via the
normal backtracking loop, and will only process them once more in the
"second-chance" stage.
This fallback behavior implies that the spill bundle must always be
able to accept a spillslot allocation, i.e., it cannot require a
register. This invariant is what allows spill bundles to be processed
in a different way, after backtracking has completed.
The spill bundle acquires liveranges in two ways. First, as we split
bundles, we will trim the split pieces in certain ways so that some
liveranges are immediately placed in the spill bundle. Intuitively,
the "empty" regions that just carry a value, but do not satisfy any
operands, should be in the spill bundle: it is better to have a single
consistent location for the value than to move it between lots of
different split pieces without using it, as moves carry a cost.
Second, the spill bundle acquires the liveranges of a bundle that has
no requirement to be in a register when that bundle is processed, but
only if the spill bundle already exists. In other words, we won't
create a second-chance spill bundle just for a liverange with an "Any"
use; but if it was already forced into existence by splitting and
trimming, then we might as well use it.
Note that unlike other bundles, a spill bundle's liverange vector
remains unsorted until we do the second-chance allocation. This allows
quick appends of more liveranges.
## Allocation Queue
The allocation queue is simply a priority queue (built with a binary
max-heap) of (prio, bundle-index) tuples.
## Spillsets and Spillslots
Every bundle contains a reference to a spillset. Spillsets are used to
assign spillslots near the end of allocation, but before then, they
are also a convenient place to store information that is common among
*all bundles* that share the spillset. In particular, spillsets are
initially assigned 1-to-1 to bundles after all bundle-merging is
complete; so spillsets represent in some sense the "original bundles",
and as splitting commences, the smaller bundle-pieces continue to
refer to their original spillsets.
We stash some useful information on the spillset because of this: a
register hint, used to create some "stickiness" between pieces of an
original bundle that are assigned separately after splitting; the
spill bundle; the common register class of all vregs in this bundle;
the vregs whose liveranges are contained in this bundle; and then some
information actually used if this is spilled to the stack (`required`
indicates actual stack use; `size` is the spillslot count; `slot` is
the actual stack slot).
Spill *sets* are later allocated to spill *slots*. Multiple spillsets
can be assigned to one spillslot; the only constraint is that
spillsets assigned to a spillslot must not overlap. When we look up
the allocation for a bundle, if the bundle is not given a specific
allocation (its `alloc` field is `Allocation::none()`), this means it
is spilled, and we traverse to the spillset then spillslot.
## Other: Fixups, Stats, Debug Annotations
There are a few fixup vectors that we will cover in more detail
later. Of particular note is the "multi-fixed-reg fixup vector": this
handles instructions that constrain the same input vreg to multiple,
different, fixed registers for different operands at the same program
point. The only way to satisfy such a set of constraints is to
decouple all but one of the inputs (make them no longer refer to the
vreg) and then later insert copies from the first fixed use of the
vreg to the other fixed regs.
The `Env` also carries a statistics structure with counters that are
incremented, which can be useful for evaluating the effects of
changes; and a "debug annotations" hashmap from program point to
arbitrary strings that is filled out with various useful diagnostic
information if enabled, so that an annotated view of the program with
its liveranges, bundle assignments, inserted moves, merge and split
decisions, etc. can be viewed.
# Allocation Pipeline
We now describe the pipeline that computes register allocations.
## Live-range Construction
The first step in performing allocation is to analyze the input
program to understand its dataflow: that is, the ranges during which
virtual registers must be assigned to physical registers. Computing
these ranges is what allows us to do better than a trivial "every vreg
lives in a different location, always" allocation.
We compute precise liveness first using an iterative dataflow
algorithm with BitVecs. (See below for our sparse chunked BitVec
description.) This produces the `liveins` and `liveouts` vectors of
BitVecs per block.
We then perform a single pass over blocks in reverse order, and scan
instructions in each block in reverse order. Why reverse order? We
must see instructions within a block in reverse to properly compute
liveness (a value is live backward from an use to a def). Because we
want to keep liveranges in-order as we build them, to enable
coalescing, we visit blocks in reverse order as well, so overall this
is simply a scan over the whole instruction index space in reverse
order.
For each block, we perform a scan with the following state:
- A liveness bitvec, initialized at the start from `liveouts`.
- A vector of live-range indices, with one entry per vreg, initially
"invalid" (this vector is allocated once and reused at each block).
- In-progress vector of live-range indices per vreg in the vreg state,
in *reverse* order (we will reverse it when we're done).
A vreg is live at the current point in the scan if its bit is set in
the bitvec; its entry in the vreg-to-liverange vec may be stale, but
if the bit is not set, we ignore it.
We initially create a liverange for all vregs that are live out of the
block, spanning the whole block. We will trim this below if it is
locally def'd and does not pass through the block.
For each instruction, we process its effects on the scan state:
- For all clobbers (which logically happen at the end of the
instruction), add a single-program-point liverange to each clobbered
preg.
- For each program point [after, before], for each operand at
this point(\*):
- if a def:
- if not currently live, this is a dead def; create an empty LR.
- set the start of the LR for this vreg to this point.
- set as dead.
- if a use:
- create LR if not live, with start at beginning of block.
(\*) an instruction operand's effective point is adjusted in a few
cases. If the instruction is a branch, its uses (which are
blockparams) are extended to the "after" point. If there is a reused
input, all *other* inputs are extended to "after": this ensures proper
interference (as we explain more below).
We then treat blockparams as defs at the end of the scan (beginning of
the block), and create the "ins" tuples. (The uses for the other side
of the edge are already handled as normal uses on a branch
instruction.)
### Handling Reused Inputs
Reused inputs are also handled a bit specially. We have already
described how we essentially translate the idiom so that the output's
allocation is used for input and output, and there is a move just
before the instruction that copies the actual input (which will not be
clobbered) to the output. Together with an attempt to merge the
bundles for the two, to elide the move if possible, this works
perfectly well as long as we ignore all of the other inputs.
But we can't do that: we have to ensure that other inputs' allocations
are correct too. Note that using the output's allocation as the input
is actually potentially incorrect if the output is at the After point
and the input is at the Before: the output might share a register with
one of the *other* (normal, non-reused) inputs if that input's vreg
were dead afterward. This will mean that we clobber the other input.
So, to get the interference right, we *extend* all other (non-reused)
inputs of an instruction with a reused input to the After point. This
ensures that the other inputs are *not* clobbered by the slightly
premature use of the output register.
The source has a link to a comment in IonMonkey that implies that it
uses a similar solution to this problem, though it's not entirely
clear.
(This odd dance, like many of the others above and below, is "written
in fuzzbug failures", so to speak. It's not entirely obvious until one
sees the corner case where it's necessary!)
## Bundle Merging
Once we have built the liverange vectors for every vreg, we can reverse
these vectors (recall, they were built in strict reverse order) and
initially assign one bundle per (non-pinned) vreg. We then try to
merge bundles together as long as find pairs of bundles that do not
overlap and that (heuristically) make sense to merge.
Note that this is the only point in the allocation pipeline where
bundles get larger. We initially merge as large as we dare (but not
too large, because then we'll just cause lots of conflicts and
splitting later), and then try out assignments, backtrack via
eviction, and split continuously to chip away at the problem until we
have a working set of allocation assignments.
We attempt to merge two kinds of bundle pairs: reused-input to
corresponding output; and across blockparam assignments.
To merge two bundles, we traverse over both their sorted liverange
vectors at once, checking for overlaps. Note that we can do this without
pointer-chasing to the liverange data; the (from, to) range is in the
liverange vector itself.
We also check whether the merged bundle would have conflicting
requirements (see below for more on requirements). We do a coarse
check first, checking 1-bit flags that indicate whether either bundle
has any fixed-reg constraints or stack-only constraints. If so, we
need to do a detailed check by actually computing merged requirements
on both sides, merging, and checking for Conflict (the lattice bottom
value). If no conflict, we merge.
A performance note: merging is extremely performance-sensitive, and it
turns out that a mergesort-like merge of the liverange vectors is too
expensive, partly because it requires allocating a separate result
vector (in-place merge in mergesort is infamously complex). Instead,
we simply append one vector onto the end of the other and invoke
Rust's builtin sort. We could special-case "one bundle is completely
before the other", but we currently don't do that (performance idea!).
Once all bundles are merged as far as they will go, we compute cached
bundle properties (priorities and weights) and enqueue them on the
priority queue for allocation.
## Recurring: Bundle Property Computation
The core allocation loop is a recurring iteration of the following: we
take the highest-priority bundle from the allocation queue; we compute
its requirements; we try to find it a register according to those
requirements; if no fit, we either evict some other bundle(s) from
their allocations and try again, or we split the bundle and put the
parts back on the queue. We record all the information we need to make
the evict-or-split decision (and where to split) *during* the physical
register allocation-map scans, so we don't need to go back again to
compute that.
Termination is nontrivial to see, because of eviction. How do we
guarantee we don't get into an infinite loop where two bundles fight
over a register forever? In fact, this can easily happen if there is a
bug; we fixed many fuzzbugs like this, and we have a check for
"infinite loop" based on an upper bound on iterations. But if the
allocator is correct, it should never happen.
Termination is guaranteed because (i) bundles always get smaller, (ii)
eviction only occurs when a bundle is *strictly* higher weight (not
higher-or-equal), and (iii) once a bundle gets down to its "minimal"
size, it has an extremely high weight that is guaranteed to evict any
non-minimal bundle. A minimal bundle is one that covers only one
instruction. As long as the input program does not have impossible
constraints that require more than one vreg to exist in one preg, an
allocation problem of all minimal bundles will always have a solution.
## Bundle Processing
Let's now talk about what happens when we take a bundle off the
allocation queue. The three basic outcomes are: allocate; split and
requeue; or evict and try again immediately (and eventually allocate
or split/requeue).
### Properties: Weight, Priority, and Requirements
To process a bundle, we have to compute a few properties. In fact we
will have already computed a few of these beforehand, but we describe
them all here.
- Priority: a bundle's priority determines the order in which it is
considered for allocation. RA2 defines as the sum of the lengths (in
instruction index space) of each liverange. This causes the
allocator to consider larger bundles first, when the allocation maps
are generally more free; they can always be evicted and split later.
- Weight: a bundle's weight indicates how important (in terms of
runtime) its uses/register mentions are. In an approximate sense,
inner loop bodies create higher-weight uses. Fixed register
constraints add some weight, and defs add some weight. Finally,
weight is divided by priority, so a very large bundle that happens
to have a few important uses does not unformly exert its weight
across its entire range. This has the effect of causing bundles to
be more important (more likely to evict others) the more they are
split.
- Requirement: a bundle's requirement is a value in a lattice that we
have defined, where top is "Unknown" and bottom is
"Conflict". Between these two, we have: any register (of a class);
any stackslot (of a class); a particular register. "Any register"
can degrade to "a particular register", but any other pair of
different requirements meets to Conflict. Requirements are derived
from the operand constraints for all uses in all liveranges in a
bundle, and then merged with the lattice meet-function.
The lattice is as follows (diagram simplified to remove multiple
classes and multiple fixed registers which parameterize nodes; any two
differently-parameterized values are unordered with respect to each
other):
```plain
Any(rc)
/ \
FixedReg(reg) FixedStack(reg)
\ /
Conflict
```
Once we have the Requirement for a bundle, we can decide what to do.
### No-Register-Required Cases
If the requirement indicates that no register is needed (`Unknown` or
`Any`, i.e. a register or stack slot would be OK), *and* if the spill
bundle already exists for this bundle's spillset, then we move all the
liveranges over to the spill bundle, as described above.
If the requirement indicates a conflict, we immediately split and
requeue the split pieces. This split is performed at the point at
which the conflict is first introduced, i.e. just before the first use
whose requirement, when merged into the requirement for all prior uses
combined, goes to `Conflict`. In this way, we always guarantee forward
progress. Note also that a bundle can reach this stage with a
conflicting requirement only if the original liverange had conflicting
uses (e.g., a liverange from a def in a register to a use on stack, or
a liverange between two different fixed-reg-constrained operands); our
bundle merging logic explicitly avoids merging two bundles if it would
create a conflict.
### Allocation-Map Probing
If we did not immediately dispose of the bundle as described above,
then we *can* use a register (either `Any`, which accepts a register
as one of several options, or `Reg`, which must have one, or `Fixed`,
which must have a particular one).
We determine which physical registers whose allocation maps we will
probe, and in what order. If a particular fixed register is required,
we probe only that register. Otherwise, we probe all registers in the
required class.
The order in which we probe, if we are not constrained to a single
register, is carefully chosen. First, if there is a hint register from
the spillset (this is set by the last allocation into a register of
any other bundle in this spillset), we probe that. Then, we probe all
preferred registers; then all non-preferred registers.
For each of the preferred and non-preferred register sequences, we
probe in an *offset* manner: we start at some index partway through
the sequence, determined by some heuristic number that is random and
well-distributed. (In practice, we use the sum of the bundle index and
the instruction index of the start of the first range in the bundle.)
We then march through the sequence and wrap around, stopping before we
hit our starting point again.
The purpose of this offset is to distribute the contention and speed
up the allocation process. In the common case where there are enough
registers to hold values without spilling (for small functions), we
are more likely to choose a free register right away if we throw the
dart at random than if we start *every* probe at register 0, in
order. This has a large allocation performance impact in practice.
For each register in probe order, we probe the allocation map, and
gather, simultaneously, several results: (i) whether the entire range
is free; (ii) if not, the vector of all conflicting bundles, *and* the
highest weight among those bundles; (iii) if not, the *first* conflict
point.
We do this by iterating over all liveranges in the preg's btree that
overlap with each range in the current bundle. This iteration is
somewhat subtle due to multiple "equal" keys (see above where we
describe the use of the btree). It is also adaptive for performance
reasons: it initially obtains an iterator into the btree corresponding
to the start of the first range in the bundle, and concurrently
iterates through both the btree and the bundle. However, if there is a
large gap in the bundle, this might require skipping many irrelevant
entries in the btree. So, if we skip too many entries (heuristically,
16, right now), we do another lookup from scratch in the btree for the
start of the next range in the bundle. This balances between the two
cases: dense bundle, where O(1) iteration through the btree is faster,
and sparse bundle, where O(log n) lookup for each entry is better.
### Decision: Allocate, Evict, or Split
First, the "allocate" case is easy: if, during our register probe
loop, we find a physical register whose allocations do not overlap
this bundle, then we allocate this register; done!
If not, then we need to decide whether to evict some conflicting
bundles and retry, or to split the current bundle into smaller pieces
that may have better luck.
A bit about our split strategy first: contrary to the IonMonkey
allocator which inspired much of our design, we do *not* have a list
of split strategies that split one bundle into many pieces at
once. Instead, each iteration of the allocation loop splits at most
*once*. This simplifies the splitting code greatly, but also turns out
to be a nice heuristic: we split at the point that the bundle first
encounters a conflict for a particular preg assignment, then we hint
that preg for the first (pre-conflict) piece when we retry. In this
way, we always make forward progress -- one piece of the bundle is
always allocated -- and splits are informed by the actual situation at
hand, rather than best guesses. Also note that while this may appear
at first to be a greedy algorithm, it still allows backtracking: the
first half of the split bundle, which we *can* now assign to a preg,
does not necessarily remain on that preg forever (it can still be
evicted later). It is just a split that is known to make at least one
part of the allocation problem solvable.
To determine whether to split or evict, we track our best options: as
we probe, we track the "lowest cost eviction option", which is a set
of bundles and the maximum weight in that set of bundles. We also
track the "lowest cost split option", which is the cost (more below),
the point at which to split, and the register for this option.
For each register we probe, if there is a conflict but none of the
conflicts are fixed allocations, we receive a vector of bundles that
conflicted, and also separately, the first conflicting program
point. We update the lowest-cost eviction option if the cost (max
weight) of the conflicting bundles is less than the current best. We
update the lowest-cost split option if the cost is less as well,
according to the following definition of cost: a split's cost is the
cost of its move, as defined by the weight of a normal def operand at
the split program point, plus the cost of all bundles beyond the split
point (which will still be conflicts even after the split).
If there is a conflict with a fixed allocation, then eviction is not
an option, but we can still compute the candidate split point and cost
in the same way as above.
Finally, as an optimization, we pass in the current best cost to the
btree probe inner loop; if, while probing, we have already exceeded
the best cost, we stop early (this improves allocation time without
affecting the result).
Once we have the best cost for split and evict options, we split if
(i) the bundle is not already a minimal bundle, and (ii) we've already
evicted once in this toplevel iteration without success, or the weight
of the current bundle is less than the eviction cost. We then requeue
*both* resulting halves of the bundle with the preg that resulted in
this option as the register hint. Otherwise, we evict all conflicting
bundles and try again.
Note that the split cost does not actually play into the above (split
vs. evict) decision; it is only used to choose *which* split is
best. This is equivalent to saying: we never evict if the current
bundle is less important than the evicted bundles, even if the split
is more expensive still. This is important for forward progress, and
the case where the split would be even more expensive should be very
very rare (it would have to come from a costly move in the middle of
an inner loop).
### How to Split
The actual split procedure is fairly simple. We are given a bundle and
a split-point. We create a new bundle to take on the second half
("rest") of the original. We find the point in the liverange vector
that corresponds to the split, and distribute appropriately. If the
split-point lands in the middle of a liverange, then we split that
liverange as well.
In the case that a new liverange is created, we add the liverange to
the corresponding vreg liverange vector as well. Note that, as described
above, the vreg's liverange vector is unsorted while splitting is
occurring (because we do not need to traverse it or do any lookups
during this phase); so we just append.
The splitting code also supports a "minimal split", in which it simply
peels off the first use. This is used to ensure forward progress when
a bundle has conflicting requirements within it (see above).
#### Spill Bundle and Splitting
Once a split occurs, however, it turns out that we can improve results
by doing a little cleanup. Once we distribute a bundle's liveranges
across two half-bundles, we postprocess by trimming a bit.
In particular, if we see that the "loose ends" around the split point
extend beyond uses, we will create and move ranges to a spill
bundle. That is: if the last liverange in the first-half bundle
extends beyond its last use, we trim that part off into an empty (no
uses) liverange and place that liverange in the spill
bundle. Likewise, if the first liverange in the second-half bundle
starts before its first use, we trim that part off into an empty
liverange and place it in the spill bundle.
This is, empirically, an improvement: it reduces register contention
and makes splitting more effective. The intuition is twofold: (i) it
is better to put all of the "flow-through" parts of a vreg's liveness
into one bundle that is never split, and can be spilled to the stack
if needed, to avoid unnecessary moves; and (ii) if contention is high
enough to cause splitting, it is more likely there will be an actual
stack spill, and if this is the case, it is better to do the store
just after the last use and reload just before the first use of the
respective bundles.
## Second-Chance Allocation: Spilled Bundles
Once the main allocation loop terminates, when all bundles have either
been allocated or punted to the "spilled bundles" vector, we do
second-chance allocation. This is a simpler loop that never evicts and
never splits. Instead, each bundle gets one second chance, in which it
can probe pregs and attempt to allocate. If it fails, it will actually
live on the stack.
This is correct because we are careful to only place bundles on the
spilled-bundles vector that are *allowed* to live on the
stack. Specifically, only the canonical spill bundles (which will
contain only empty ranges) and other bundles that have an "any" or
"unknown" requirement are placed here (but *not* "stack" requirements;
those *must* be on the stack, so do not undergo second-chance
allocation).
At the end of this process, we have marked spillsets as required
whenever at least one bundle in the spillset actually requires a stack
slot. We can then allocate slots to the spillsets.
## Spillslot Allocation
We must allocate space on the stack, denoted by an abstract index
space, to each spillset that requires it, and for the liveranges in
which it requires it.
To facilitate this, we keep a btree per spillslot in the same way we
do per preg. We will allocate spillsets to slots in a way that avoids
interference.
Note that we actually overapproximate the required ranges for each
spillset in order to improve the behavior of a later phase (redundant
move elimination). Specifically, when we allocate a slot for a
spillset, we reserve that slot for *all* of the liveranges of *every*
vreg that is assigned to that spillset (due to merging rules that
initially merge one-vreg bundles into final merged bundles, there will
be no overlaps here). In other words, we rule out interleaving of
completely different values in the same slot, though bundle merging
does mean that potentially many (non-interfering) vregs may share
it. This provides the important property that if a vreg has been
reloaded, but not modified, its spillslot *still contains the
up-to-date value* (because the slot is reserved for all liveranges of
the vreg). This enables us to avoid another store to the spillslot
later if there is another spilled range.
We perform probing in a way that is somewhat different than for
registers, because the spillslot space is conceptually infinite. We
can thus optimize for slightly better allocation performance by giving
up and allocating a new slot at any time.
For each size class, we keep a linked list of slots. When we need to
allocate a spillset to a slot, we traverse down the list and try a
fixed number of slots. If we find one that fits the spillset's ranges,
we allocate, and we remove the slot from its current place in the list
and append to the end. In this way, it is deprioritized from probing
"for a while", which tends to reduce contention. This is a simple way
to round-robin between slots. If we don't find one that fits after a
fixed number of probes, we allocate a new slot.
And with that, we have valid allocations for all vregs for all points
that they are live! Now we just need to modify the program to reify
these choices.
## Allocation Assignment
The first step in reifying the allocation is to iterate through all
mentions of a vreg and fill in the resulting `Allocation` array with
the appropriate allocations. We do this by simply traversing
liveranges per vreg, looking up the allocation by observing the bundle
(and spillset if no specific allocation for the bundle), and for each
use, filling in the slot according to the saved progpoint/slot info in
the use data.
## Move Generation
The more difficult half of the reification step is generating the
*moves* that will put the values in the right spots.
There are two sources of moves that we must generate. The first are
moves between different ranges of the same vreg, as the split pieces
of that vreg's original bundle may have been assigned to different
locations. The second are moves that result from move semantics in the
input program: assignments from blockparam args on branches to the
target block's params.
Moves are tricky to handle efficiently because they join two
potentially very different locations in the program (in the case of
control-flow-edge moves). In order to avoid the need for random
lookups, which are a cache-locality nightmare even if we have O(log n)
lookups, we instead take a scan-sort-scan approach.
First, we scan over each vreg's liveranges, find the allocation for
each, and for each move that comes *to* or *from* this liverange,
generate a "half-move". The key idea is that we generate a record for
each "side" of the move, and these records are keyed in a way that
after a sort, the "from" and "to" ends will be consecutive. We can
sort the vector of halfmoves once (this is expensive, but not as
expensive as many separate pointer-chasing lookups), then scan it
again to actually generate the move instructions.
To enable the sort to work, half-moves are sorted by a key that is
equivalent to the tuple (from-block, to-block, to-vreg, kind), where
`kind` is "source" or "dest". For each key, the payload is an
allocation. The fields in this tuple are carefully chosen: we know all
of them at every location we generate a halfmove, without expensive
lookups, and sorting by this key will make the source and all dests
(there can be more than one) contiguous in the final order.
Half-moves are generated for several situations. First, at the start
of every block covered by a liverange, we can generate "dest"
half-moves for blockparams, and at the end of every block covered by a
liverange, we can generate "source" half-moves for blockparam args on
branches. Incidentally, this is the reason that `blockparam_ins` and
`blockparam_outs` are sorted tuple-vectors whose tuples begin with
(vreg, block, ...): this is the order in which we do the toplevel scan
over allocations.
Second, at every block edge, if the vreg is live in any pred (at
block-start) or succ (at block-end), we generate a half-move to
transfer the vreg to its own location in the connected block.
This completes the "edge-moves". We sort the half-move array and then
have all of the alloc-to-alloc pairs on a given (from-block, to-block)
edge.
Next, when a live-range ends and another begins for the same vreg in
the same block (i.e., a split in the middle of a block), we know both
sides of the move immediately (because it is the same vreg and we can
look up the adjacent allocation easily), and we can generate that
move.
Finally, we generate moves to fix up multi-fixed-reg-constraint
situations, and make reused inputs work, as described earlier.
## Move Resolution
During this whole discussion, we have described "generating moves",
but we have not said what that meant. Note that in many cases, there
are several moves at a particular program point that semantically
happen *in parallel*. For example, if multiple vregs change
allocations between two instructions, all of those moves happen as
part of one parallel permutation. Similarly, blockparams have
parallel-assignment semantics. We thus enqueue all the moves that we
generate at program points and resolve them into sequences of
sequential moves that can actually be lowered to move instructions in
the machine code.
First, a word on *move priorities*. There are different kinds of moves
that are generated between instructions, and we have to ensure that
some happen before others, i.e., *not* in parallel. For example, a
vreg might change allocation (due to a split) before an instruction,
then be copied to an output register for an output with a reused-input
policy. The latter move must happen *after* the vreg has been moved
into its location for this instruction.
To enable this, we define "move priorities", which are a logical
extension of program points (i.e., they are sub-points) that enable
finer-grained ordering of moves. We currently have the following
priorities:
- In-edge moves, to place edge-moves before the first instruction in a
block.
- Regular, used for vreg movement between allocations.
- Multi-fixed-reg, used for moves that handle the
single-vreg-in-multiple-fixed-pregs constraint case.
- Reused-input, used for implementing outputs with reused-input policies.
- Out-edge moves, to place edge-moves after the last instruction
(prior to the branch) in a block.
Every move is statically given one of these priorities by the code
that generates it.
We collect moves with (prog-point, prio) keys, and we sort by those
keys. We then have, for each such key, a set of moves that
semantically happen in parallel.
We then resolve those moves using a parallel-move resolver, as we now
describe.
### Parallel-Move Resolver
The fundamental issue that arises when resolving parallel moves to
sequential moves is *overlap*: some of the moves may overwrite
registers that other moves use as sources. We must carefully order
moves so that this does not clobber values incorrectly.
We first check if such overlap occurs. If it does not (this is
actually the most common case), the sequence of parallel moves can be
emitted as sequential moves directly. Done!
Otherwise, we have to order the moves carefully. Furthermore, if there
is a *cycle* anywhere among the moves, we will need a scratch
register. (Consider, e.g., t0 := t1 and t1 := t0 in parallel: with
only move instructions and no direct "exchange" instruction, we cannot
reify this without a third register.)
We first compute a mapping from each move instruction to the move
instruction, if any, that it must precede. Note that there can be only
one such move for a given move, because each destination can be
written only once; so a move might be constrained only before the one
move that overwrites its source. (This will be important in a bit!)
Our task is now to find an ordering of moves that respects these
dependencies. To do so, we perform a depth-first search on the graph
induced by the dependencies, which will generate a sequence of
sequential moves in reverse order. We keep a stack of moves; we start
with any move that has not been visited yet; in each iteration, if the
top-of-stack has no out-edge to another move (does not need to come
before any others), then push it to a result vector, followed by all
others on the stack (in popped order). If it does have an out-edge and
the target is already visited and not on the stack anymore (so already
emitted), likewise, emit this move and the rest on the stack. If it
has an out-edge to a move not yet visited, push on the stack and
continue. Otherwise, if out-edge to a move currently on the stack, we
have found a cycle. In this case, we emit the moves on the stack with
a modification: the first move writes to a scratch register, and we
emit an additional move that moves from the scratch to the first
move's dest. This breaks the cycle.
The astute reader may notice that this sounds like a canonical
application of Tarjan's algorithm for finding SCCs (strongly-connected
components). Why don't we have the full complexity of that algorithm?
In particular, *why* can we emit the cycle *right away* once we find
it, rather than ensuring that we have gotten all of the SCC first?
The answer is that because there is only *one* out-edge at most (a
move can only require preceding *one* other move), all SCCs must be
simple cycles. This means that once we have found a cycle, no other
nodes (moves) can be part of the SCC, because every node's single
out-edge is already accounted for. This is what allows us to avoid a
fully general SCC algorithm.
Once the vector of moves in-reverse has been constructed, we reverse
it and return.
Note that this "move resolver" is fuzzed separately with a simple
symbolic move simulator (the `moves` fuzz-target).
### Stack-to-Stack Moves
There is one potentially difficult situation that could arise from the
move-resolution logic so far: if a vreg moves from one spillslot to
another, this implies a memory-to-memory move, which most machine
architectures cannot handle natively. It would be much nicer if we
could ensure within the regalloc that this never occurs.
This is in fact possible to do in a postprocessing step. We iterate
through the sequential moves, tracking whether the scratch register is
in use (has been written). When we see a stack-to-stack move: (i) if
the scratch register is not in use, generate a stack-to-scratch move
and scratch-to-stack move; otherwise, (ii) if the scratch register is
in use, allocate an "extra spillslot" if one has not already been
allocated, move the scratch reg to that, do the above stack-to-scratch
/ scratch-to-stack sequence, then reload the scratch reg from the
extra spillslot.
## Redundant-Spill/Load Elimination
As a final step before returning the vector of program edits to the
client, we perform one optimization: redundant-spill/load elimination.
To understand the need for this, consider what will occur when a vreg
is (i) defined once, (ii) used many times, and (iii) spilled multiple
times between some of the uses: with the design described above, we
will move the value from the preg to the stack after every segment of
uses, and then reload it when the next use occurs. However, only the
first spill is actually needed; as we noted above, we allocate
spillslots so that the slot that corresponded to the vreg at the first
spill will always be reserved for that vreg as long as it is live. If
no other defs or mods occur, the value in the slot can be reloaded,
and need not be written back every time.
This inefficiency is a result of our invariant that a vreg lives in
exactly one place at a time, and these locations are joined by
moves. This is a simple and effective design to use for most of the
allocation pipeline, but falls flat here. It is especially inefficient
when the unnecessary spill occurs in an inner loop. (E.g.: value
defined at top of function is spilled, then used once in the middle of
an inner loop body.)
The opposite case can also sometimes occur, though it is rarer: a
value is loaded into a register, spilled, and then reloaded into the
same register. This can happen when hinting is successful at getting
several segments of a vreg to use the same preg, but splitting has
trimmed part of the liverange between uses and put it in the spill
bundle, and the spill bundle did not get a reg.
In order to resolve this inefficiency, we implement a general
redundant-spill/load elimination pass (an even more general solution
would be a full redundant-move elimination pass, but we focus on moves
that are spills/loads to contain the complexity for now). This pass
tracks, for every allocation (reg or spillslot), whether it is a copy
of another allocation. This state is invalidated whenever either that
allocation or the allocation of which it is a copy is
overwritten. When we see a move instruction, if the destination is
already a copy of the source, we elide the move. (There are some
additional complexities to preserve checker metadata which we do not
describe here.)
Note that this could, in principle, be done as a fixpoint analysis
over the CFG; it must be, if we try to preserve state across
blocks. This is because a location is only a copy of another if that
is true on every incoming edge. However, to avoid the cost and
complexity of doing such an analysis, we instead take the much simpler
approach of doing only an intra-block analysis. This turns out to be
sufficient to remove most redundant moves, especially in the common
case of a single use of an otherwise-spilled value.
Note that there is an opportunity to do better: as we only accept SSA
code we would know that a value could not be redefined once written.
# Future Plans
## Better Split Heuristics
We have spent quite some effort trying to improve splitting behavior,
and it is now generally decent, but more work could be done here,
especially with regard to the interaction between splits and the loop
nest.
# Appendix: Comparison to IonMonkey Allocator
There are a number of differences between the [IonMonkey
allocator](https://searchfox.org/mozilla-central/source/js/src/jit/BacktrackingAllocator.cpp)
and this one. While this allocator initially began as an attempt to
clone IonMonkey's, it has drifted significantly as we optimized the
design (especially after we built the regalloc.rs shim and had to
adapt to its code style); it is easier at this point to name the
similarities than the differences.
* The core abstractions of "liverange", "bundle", "vreg", "preg", and
"operand" (with policies/constraints) are the same.
* The overall allocator pipeline is the same, and the top-level
structure of each stage should look similar. Both allocators begin
by computing liveranges, then merging bundles, then handling bundles
and splitting/evicting as necessary, then doing second-chance
allocation, then reifying the decisions.
* The cost functions are very similar, though the heuristics that make
decisions based on them are not.
Several notable high-level differences are:
* There are [fuzz/fuzz_targets/](many different fuzz targets) that
exercise the allocator, including a full symbolic checker
(`ion_checker` target) based on the [symbolic checker in
regalloc.rs](https://cfallin.org/blog/2021/03/15/cranelift-isel-3/)
and, e.g., a targetted fuzzer for the parallel move-resolution
algorithm (`moves`) and the SSA generator used for generating cases
for the other fuzz targets (`ssagen`).
* The data-structure invariants are simplified. While the IonMonkey
allocator allowed for LiveRanges and Bundles to overlap in certain
cases, this allocator sticks to a strict invariant: ranges do not
overlap in bundles, and bundles do not overlap. There are other
examples too: e.g., the definition of minimal bundles is very simple
and does not depend on scanning the code at all. In general, we
should be able to state simple invariants and see by inspection (as
well as fuzzing -- see above) that they hold.
* The data structures themselves are simplified. Where IonMonkey uses
linked lists in many places, this allocator stores simple inline
smallvecs of liveranges on bundles and vregs, and smallvecs of uses
on liveranges. We also (i) find a way to construct liveranges
in-order immediately, without any need for splicing, unlike
IonMonkey, and (ii) relax sorting invariants where possible to allow
for cheap append operations in many cases.
* The splitting heuristics are significantly reworked. Whereas
IonMonkey has an all-at-once approach to splitting an entire bundle,
and has a list of complex heuristics to choose where to split, this
allocator does conflict-based splitting, and tries to decide whether
to split or evict and which split to take based on cost heuristics.
* The liverange computation is exact, whereas IonMonkey approximates
using a single-pass algorithm that makes vregs live across entire
loop bodies. We have found that precise liveness improves allocation
performance and generated code quality, even though the liveness
itself is slightly more expensive to compute.
* Many of the algorithms in the IonMonkey allocator are built with
helper functions that do linear scans. These "small quadratic" loops
are likely not a huge issue in practice, but nevertheless have the
potential to be in corner cases. As much as possible, all work in
this allocator is done in linear scans.
* There are novel schemes for solving certain interesting design
challenges. One example: in IonMonkey, liveranges are connected
across blocks by, when reaching one end of a control-flow edge in a
scan, doing a lookup of the allocation at the other end. This is in
principle a linear lookup (so quadratic overall). We instead
generate a vector of "half-moves", keyed on the edge and from/to
vregs, with each holding one of the allocations. By sorting and then
scanning this vector, we can generate all edge moves in one linear
scan. There are a number of other examples of simplifications: for
example, we handle multiple conflicting
physical-register-constrained uses of a vreg in a single instruction
by recording a copy to do in a side-table, then removing constraints
for the core regalloc. Ion instead has to tweak its definition of
minimal bundles and create two liveranges that overlap (!) to
represent the two uses.
* Using block parameters rather than phi-nodes significantly
simplifies handling of inter-block data movement. IonMonkey had to
special-case phis in many ways because they are actually quite
weird: their uses happen semantically in other blocks, and their
defs happen in parallel at the top of the block. Block parameters
naturally and explicitly reprsent these semantics in a direct way.
* The allocator supports irreducible control flow and arbitrary block
ordering (its only CFG requirement is that critical edges are
split).
* The allocator supports non-SSA code, and has native support for
handling program moves specially.
# Appendix: Performance-Tuning Lessons
In the course of optimizing the allocator's performance, we found a
number of general principles:
* We got substantial performance speedups from using vectors rather
than linked lists everywhere. This is well-known, but nevertheless,
it took some thought to work out how to avoid the need for any
splicing, and it turns out that even when our design is slightly
less efficient asymptotically (e.g., apend-and-re-sort rather than
linear-time merge of two sorted liverange lists when merging
bundles), it is faster.
* We initially used a direct translation of IonMonkey's splay tree as
an allocation map for each PReg. This turned out to be significantly
(!) less efficient than Rust's built-in BTree data structures, for
the usual cache-efficiency vs. pointer-chasing reasons.
* We initially used dense bitvecs, as IonMonkey does, for
livein/liveout bits. It turned out that a chunked sparse design (see
below) was much more efficient.
* Precise liveness significantly improves performance because it
reduces the size of liveranges (i.e., interference), and probing
registers with liveranges is the most significant hot inner
loop. Paying a fraction of a percent runtime for the iterative
dataflow algorithm to get precise bitsets is more than worth it.
* The randomized probing of registers was a huge win: as above, the
probing is very expensive, and reducing the average number of probes
it takes to find a free register is very important.
* In general, single-pass algorithms and design of data structures to
enable them are important. For example, the half-move technique
avoids the need to do any O(log n) search at all, and is relatively
cache-efficient. As another example, a side-effect of the precise
liveness was that we could then process operands within blocks in
actual instruction order (in reverse), which allowed us to simply
append liveranges to in-progress vreg liverange vectors and then
reverse at the end. The expensive part is a single pass; only the
bitset computation is a fixpoint loop.
* Sorts are better than always-sorted data structures (like btrees):
they amortize all the comparison and update cost to one phase, and
this phase is much more cache-friendly than a bunch of spread-out
updates.
* Take care of basic data structures and their operator definitions!
We initially used the auto-derived comparator on ProgPoint, and let
ProgPoint be a normal struct (with a u32 inst index and a
Befor/After enum). The comparator for this, used in many sorting
inner loops, was a compound thing with conditionals. Instead, pack
them in a u32 and do a simple compare (and save half the memory as
well). Likewise, the half-move key is a single value packed in a
u64; this is far more efficient than the tuple comparator on a
4-tuple, and the half-move sort (which can be a few percent or more
of total allocation time) became multiple times cheaper.
# Appendix: Data Structure: Chunked Sparse BitVec
We use a "chunked sparse bitvec" to store liveness information, which
is just a set of VReg indices. The design is fairly simple: the
toplevel is a HashMap from "chunk" to a `u64`, and each `u64`
represents 64 contiguous indices.
The intuition is that while the vreg sets are likely sparse overall,
they will probably be dense within small regions of the index
space. For example, in the Nth block in a function, the values that
flow from block N-1 will largely be almost-contiguous vreg indices, if
vregs are allocated in sequence down the function body. Or, at least,
they will be some local vregs together with a few defined at the top
of the function; two separate chunks will cover that.
We tried a number of other designs as well. Initially we used a simple
dense bitvec, but this was prohibitively expensive: O(n^2) space when
the real need is closer to O(n) (i.e., a classic sparse matrix). We
also tried a hybrid scheme that kept a vector of indices when small
and used either a bitvec or a hashset when large. This did not perform
as well because (i) it was less memory-efficient (the chunking helps
with this) and (ii) insertions are more expensive when they always
require a full hashset/hashmap insert.
# Appendix: Fuzzing
We have five fuzz targets: `ssagen`, `domtree`, `moves`, `ion`, and
`ion_checker`.
## SSAGen
The SSAGen target tests our SSA generator, which generates cases for
the full allocator fuzz targets. The SSA generator is careful to
always generate a valid CFG, with split critical edges, and valid SSA,
so that we never have to throw out a test input before we reach the
allocator itself. (An alternative fuzzing approach randomly generates
programs and then throws out those that do not meet certain conditions
before using them as legitimate testcases; this is much simpler, but
less efficient.)
To generate a valid CFG, with no unreachable blocks and with no
critical edges, the generator (i) glues together units of either one
or three blocks (A->B, A->C), forming either a straight-through
section or a conditional. These are glued together into a "spine", and
the conditionals (the "C" block), where they exist, are then linked to
a random target block chosen among the main blocks of these one- or
three-block units. The targets are chosen either randomly, for
potentially irreducible CFGs, or in a way that ensures proper nesting
of loop backedges, if a structured CFG is requested.
SSA is generated by first choosing which vregs will be defined in each
block, and which will be defined as blockparams vs. instruction
defs. Instructions are then generated, with operands chosen among the
"available" vregs: those defined so far in the current block and all
of those in any other block that dominates this one.
The SSAGen fuzz target runs the above code generator against an SSA
validator, and thus ensures that it will only generate valid SSA code.
## Domtree
The `domtree` fuzz target computes dominance using the algorithm that
we use elsewhere in our CFG analysis, and then walks a
randomly-generated path through the CFG. It checks that the dominance
definition ("a dom b if any path from entry to b must pass through a")
is consistent with this particular randomly-chosen path.
## Moves
The `moves` fuzz target tests the parallel move resolver. It generates
a random sequence of parallel moves, careful to ensure that each
destination is written only once. It then runs the parallel move
resolver, and then *abstractly interprets* the resulting sequential
series of moves, thus determining which inputs flow to which
outputs. This must match the original set of parallel moves.
## Ion and Ion-checker
The `ion` fuzz target runs the allocator over test programs generated
by SSAGen. It does not validate the output; it only tests that the
allocator runs to completion and does not panic. This was used mainly
during development, and is now less useful than the checker-based
target.
The `ion_checker` fuzz target runs the allocator's result through a
symbolic checker, which is adapted from the one developed for
regalloc.rs (see [this blog
post](https://cfallin.org/blog/2021/01/22/cranelift-isel-2/) for more
details). This is the most useful fuzz target in the fuzzing suite,
and has found many bugs in development.
|