#+TITLE: MoarVM JIT compiler overview
* Introduction
This document attempts to describe the structure of the MoarVM JIT
compiler and document some of the design decisions that shape it.
* Components
** Dynamic optimizer
This part is noteworthy because it is not, in fact, part of the JIT
compiler at all. It is a separate subsystem that handles all of the
type logging, graph construction and rewriting, call inlining - in
short, all of the interesting optimizations of a dynamic language
system. The JIT (legacy and 'expression') is a code-generation backend
for this system. MoarVM is rather unusual in this respect, since in
most systems the order is reversed.
In contrast to what might be called 'language-level' optimizations
implemented by spesh, the objective of the JIT is to remove
interpreter-level inefficiencies.
** 'Lego' JIT compiler
The legacy or 'lego' JIT compiler was the first to be developed back
in 2014. It has this name because it works by emitting independent
fragments of machine code that are stitched together. These fragments
are mapped directly from the 'spesh graph' (representing a MoarVM
subroutine or 'frame') that is the input to the compiler. There are
very few optimizations applied by the legacy compiler (one notable
optimization is the inlining of a polymorphic REPR function pointer if
the type of the object is known).
The lego JIT produces a structure called the 'JIT graph', which, while
a technically correct name, is a bit of a misnomer since it is really
a singly-linked list. It uses labels rather than an explicit control
flow graph (like spesh does) to indicate internal structure. The JIT
graph is then compiled to JIT code. There is in fact little reason to
maintain this two-phase setup; the legacy JIT could work just as well
in a single phase.
Even for the new JIT, the legacy JIT provides the necessary
scaffolding to integrate with the interpreter (such as a proper
function prologue and epilogue, support for exceptions and invokish
operators, and the various bits and pieces required for correct
deoptimization).
** DynASM assembler
DynASM is a third-party library from the [[http://www.luajit.org/][luajit]] project that we've
modified to support register addressing on the x86-64 architecture. It
allows you to interleave the assembly code you'd like to emit with the
C code that does the regular compiling business. DynASM has two parts:
+ A Lua script to preprocess a 'dasc' file (which is C interleaved
with assembly code) and output a C file.
+ A C header to #include into the generated C file, containing the
functions to assemble and link the machine code at runtime.
DynASM source files are architecture-specific.
** Expression trees
The expression 'tree' is the intermediate format of the new
'expression' JIT compiler. The name 'tree' is technically a misnomer,
since the structure actually represents a graph.
The expression tree is built from a spesh graph by mapping MoarVM
instructions to templates. These templates are written in a textual
format by a human (or perhaps a script) and /precompiled/ to a header
file during MoarVM compilation. Note that while the expression
templates look a bit like a high-level language some may be familiar
with, the concepts they express are decidedly 'low-level' and map
more-or-less directly to the machine code of a hypothetical RISC CPU.
*** Syntax
The syntax used for expression templates is derived from LISP.
+ Lists are the basic constructs of the language. A list is delimited
by opening ='('= and closing =')'= parentheses. Lists can be
nested. Items in the lists are separated by whitespace.
+ A word is anything that matches the perl regular expression =[^\s\(\)#"']+=.
+ A word that consists only of alphanumeric symbols (and possibly
underline) is called a /symbol/: =sp_p6oget_o=. The first symbol
in a list is generally its /operator/.
+ Suffixed by =:= is generally a /keyword/, e.g. =let:=
+ Prefixed by =$= is a /reference/, either to a MoarVM instruction
operand (=$1=, =$2=) or to a declared name (=$val=).
+ Prefixed by =^= invokes a /list macro/, which can be declared with
the =macro:= keyword.
+ Macro /parameters/ are prefixed by =,= (e.g. =,obj=), and they are
replaced with whatever symbol or list is 'bound' to them when the
macro is invoked.
+ Parameter macros are indicated by the prefix =&=,
e.g. =&sizeof=. They are always substituted with a function-like
macro call: =(&sizeof foo)= becomes =sizeof(foo)=. These /must/ form
constant expressions at compilation time.
+ A string is anything delimited by ="= - but note that /escape
sequences/ are not supported, so ="foo \" bar"= doesn't do what you
might think it does.
An example of a template definition would be as follows:
#+BEGIN_SRC scheme
(template: sp_p6oget_o
(let: (($val (load (add (^p6obody $1) $2) ptr_sz)))
(if (nz $val) $val (^vmnull))))
#+END_SRC
The =let:= keyword is not an operator but a special construct for the
template precompiler. It allows the declaration of names for values
that are computed and reused in further operators.
And an example of a macro definition would be:
#+BEGIN_SRC scheme
(macro: ^getf (,object ,type ,field)
(load (addr ,object (&offsetof ,type ,field))
(&SIZEOF_MEMBER ,type ,field)))
#+END_SRC
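Macro invocation amounts to a recursive rewrite of the nested-list
body, substituting whatever the call site binds to each =,parameter=.
The Python below is purely illustrative (the actual precompiler is
written in Perl), and the =$1=/=MVMObject=/=st= bindings are a made-up
example:

```python
# Sketch of list-macro parameter substitution: a recursive rewrite of a
# nested-list body, replacing ',param' references with whatever the call
# site binds to them. Illustrative Python; the real precompiler is Perl.

def expand_macro(body, bindings):
    """Recursively substitute macro parameters in a nested-list body."""
    if isinstance(body, list):
        return [expand_macro(item, bindings) for item in body]
    if isinstance(body, str) and body.startswith(','):
        return bindings[body]   # parameter reference like ',object'
    return body                 # plain symbol, keyword or constant

# Body of the ^getf macro above, as a nested list:
getf_body = ['load', ['addr', ',object', ['&offsetof', ',type', ',field']],
             ['&SIZEOF_MEMBER', ',type', ',field']]

expanded = expand_macro(getf_body, {',object': '$1',
                                    ',type': 'MVMObject',
                                    ',field': 'st'})
print(expanded)
```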
*** Operators
These are all the operators that are known by the expression language
as of today (<2018-10-23 Tue>).
| Operator | Shape | Type | Semantics |
|-----------+----------------------------------+----------+-------------------------------------------------------------------------------------------------------|
| =LOAD= | =(load reg $size)= | =reg= | Load a value from a memory address |
| =STORE= | =(store reg reg $size)= | =void= | Store a value to a memory address |
| =CONST= | =(const $value $size)= | =reg= | Load a constant value to a register |
| =ADDR= | =(addr reg $ofs)= | =reg= | Add a constant offset to a pointer |
| =IDX= | =(idx reg reg $scale)= | =reg= | Compute a pointer for an index in an array |
| =COPY= | =(copy reg)= | =reg= | Copy this value (opaque to tiling) |
|-----------+----------------------------------+----------+-------------------------------------------------------------------------------------------------------|
| =LT= | =(lt reg reg)= | =flag= | First operand is smaller than second |
| =LE= | =(le reg reg)= | =flag= | First operand is smaller than or equal to second |
| =EQ= | =(eq reg reg)= | =flag= | First operand is equal to second |
| =NE= | =(ne reg reg)= | =flag= | First operand is not equal to second |
| =GE= | =(ge reg reg)= | =flag= | First operand is larger than or equal to second |
| =GT= | =(gt reg reg)= | =flag= | First operand is larger than second |
| =NZ= | =(nz reg)= | =flag= | Operand is nonzero |
| =ZR= | =(zr reg)= | =flag= | Operand equals zero |
|-----------+----------------------------------+----------+-------------------------------------------------------------------------------------------------------|
| =SCAST= | =(scast reg $to $from $sign)= | =reg= | Convert a signed value from smaller to larger representation |
| =UCAST= | =(ucast reg $to $from $sign)= | =reg= | Convert an unsigned value from one representation to another (larger to smaller or smaller to larger) |
| =FLAGVAL= | =(flagval flag)= | =reg= | Binary value of comparison operator |
| =DISCARD= | =(discard reg)= | =void= | Discard value of child operator |
|-----------+----------------------------------+----------+-------------------------------------------------------------------------------------------------------|
| =ADD= | =(add reg reg)= | =reg= | Add two integer values |
| =SUB= | =(sub reg reg)= | =reg= | Subtract second operand from first |
| =MUL= | =(mul reg reg)= | =reg= | Multiply two integer values |
| =AND= | =(and reg reg)= | =reg= | Binary AND (intersection) of two operands |
| =OR= | =(or reg reg)= | =reg= | Binary OR (union) of two operands |
| =XOR= | =(xor reg reg)= | =reg= | Binary XOR of two operands |
| =NOT= | =(not reg)= | =reg= | Binary negation of operand |
|-----------+----------------------------------+----------+-------------------------------------------------------------------------------------------------------|
| =ALL= | =(all flag+)= | =flag= | Variadic short-circuit logical AND |
| =ANY= | =(any flag+)= | =flag= | Variadic short-circuit logical OR |
|-----------+----------------------------------+----------+-------------------------------------------------------------------------------------------------------|
| =DO= | =(do void* reg)= | =reg= | Execute multiple statements and return last expression |
| =DOV=     | =(dov void+)=                    | =void=   | Execute multiple statements                                                                             |
| =WHEN= | =(when flag void)= | =void= | Execute statement if condition is met |
| =IF= | =(if flag reg reg)= | =reg= | Yield value of conditional expression |
| =IFV= | =(ifv flag void void)= | =void= | Conditionally execute one of two statements |
|-----------+----------------------------------+----------+-------------------------------------------------------------------------------------------------------|
| =BRANCH= | =(branch reg)= | =void= | Jump to code position |
| =LABEL= | =(label $label)= | =reg= | Load position of code |
| =MARK= | =(mark (label $label))= | =void= | Mark this position |
|-----------+----------------------------------+----------+-------------------------------------------------------------------------------------------------------|
| =CALL= | =(call reg (arglist ...) $size)= | =reg= | Call a function and return a value |
| =CALLV=   | =(callv reg (arglist ...))=      | =void=   | Call a function without a return value                                                                  |
| =ARGLIST= | =(arglist (carg reg $type)+)= | =c_args= | Setup function call arguments |
| =CARG= | =(carg reg $type)= | =void= | Annotate value with 'parameter type' |
|-----------+----------------------------------+----------+-------------------------------------------------------------------------------------------------------|
| =GUARD= | =(guard void $before $after)= | =void= | Wrap a statement with code before and after |
|-----------+----------------------------------+----------+-------------------------------------------------------------------------------------------------------|
| =TC= | =(tc)= | =reg= | Refer to MoarVM thread context |
| =CU= | =(cu)= | =reg= | Refer to current compilation unit |
| =LOCAL= | =(local)= | =reg= | Refer to current frame working memory |
| =STACK= | =(stack)= | =reg= | Refer to top of C frame stack |
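To make the semantics above concrete, here is a toy evaluator for a
handful of these operators. This is purely illustrative Python - the
compiler never interprets trees, it compiles them - and the memory
contents are a made-up example:

```python
# Toy evaluator for a tiny subset of the expression operators above,
# purely to illustrate their semantics. 'memory' maps addresses to
# values; sizes in the nodes are carried but ignored here.

def evaluate(node, memory):
    op, *args = node
    if op == 'const':                      # (const $value $size)
        return args[0]
    if op == 'add':                        # (add reg reg)
        return evaluate(args[0], memory) + evaluate(args[1], memory)
    if op == 'addr':                       # (addr reg $ofs)
        return evaluate(args[0], memory) + args[1]
    if op == 'load':                       # (load reg $size)
        return memory[evaluate(args[0], memory)]
    if op == 'nz':                         # (nz reg)
        return evaluate(args[0], memory) != 0
    if op == 'if':                         # (if flag reg reg)
        flag, then, orelse = args
        return evaluate(then if evaluate(flag, memory) else orelse, memory)
    raise NotImplementedError(op)

memory = {0x10: 42}
tree = ['if', ['nz', ['load', ['addr', ['const', 0x8, 8], 0x8], 8]],
        ['const', 1, 8], ['const', 0, 8]]
print(evaluate(tree, memory))   # loads memory[0x10] == 42, nonzero -> 1
```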
*** Tree iteration and manipulation
** Instruction selection (tiler)
Next to the register allocator, the instruction selection algorithm
(or the 'tiler') is the most complex part of the JIT. It is
fortunately fairly stable. The goal of tiling is to match the
intermediate representation (the expression tree/graph) to the
cheapest sequence of instructions on the target architecture (x86-64)
that implements the semantics of the expression tree. The
implementation is heavily based on the paper by [[http://dl.acm.org/citation.cfm?id=75700][Aho et al]] - and in
fact, the expression tree IR was designed mostly to accommodate it.
*** How tiling works
The following is a necessarily incomplete description of the actual
process - much better described by either the paper linked above or
the source code that implements it.
A 'tile' in the expression JIT describes a number of overlapping
concepts, for which I rely on the reader to disambiguate by
context. (Humans are good at that).
+ A /pattern/ that can be matched against an expression tree and which
is defined textually, much like the expression templates.
+ An /object/ for the JIT compiler to represent a machine instruction,
and
+ A /function/ that emits the machine code to the compiler
buffer, with parameters substituted
The textual 'tile definition' contains the name of the function that
implements it (in the example below =store_addr= and
=test_addr_const=), the /pattern/ that the tile matches, the /symbol/
that is substituted for the pattern in a successful match, and the
/cost/ of doing so (in terms of compiled code).
#+BEGIN_SRC scheme
(tile: store_addr
(store (addr reg $ofs) reg $size) void 5)
(tile: test_addr_const
(nz (and (load (addr reg $ofs) $size) (const $val $size))) flag 4)
#+END_SRC
Conceptually, tiling is a process of /reducing/ the tree structure by
replacing nodes by the /symbols/ defined by the tiles. The following
example may serve as an illustration. Given the following set of
tiles:
| Tile | Pattern | Symbol | Assembly code |
|---------+-------------------------+--------+--------------------------|
| =local= | =(local)= | =reg= | =rbx= |
| =const= | =(const $value)= | =reg= | =mov reg, $value= |
| =addr= | =(addr reg $offset)= | =reg= | =lea reg, [reg+$offset]= |
| =load= | =(load reg $size)= | =reg= | =mov out, [reg]= |
| =add= | =(add reg reg)= | =reg= | =add reg1, reg2= |
| =store= | =(store reg reg $size)= | =void= | =mov [reg1], reg2= |
We can reduce a tree and generate code as follows (note that the order
of reducing is bottom-up / postorder from left-to-right):
| Tree | Tile | Code |
|----------------------------------------------+----------------+----------------------|
| =(add (load (addr (local) 0x8)) (const 17))= | =local -> reg= | |
| =(add (load (addr reg 0x8)) (const 17))= | =addr -> reg= | =lea rax, [rbx+0x8]= |
| =(add (load reg) (const 17))= | =load -> reg= | =mov rcx, [rax]= |
| =(add reg (const 17))= | =const -> reg= | =mov rdx, 17= |
| =(add reg reg)= | =add -> reg= | =add rcx, rdx= |
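The reduction above can be mimicked with a small postorder matcher.
This Python sketch uses the toy tile set from the table (sizes
omitted for brevity) and is illustrative only - the real tiler is
driven by tables precomputed at build time:

```python
# Bottom-up (postorder) tiling with the toy tile set from the table
# above. Each flat pattern maps to the symbol it reduces to; '$'
# stands for a constant operand. Illustrative only.

TILES = {
    ('local',):            'reg',
    ('const', '$'):        'reg',
    ('addr', 'reg', '$'):  'reg',
    ('load', 'reg'):       'reg',
    ('add', 'reg', 'reg'): 'reg',
}

def tile(node, order):
    """Reduce a subtree to a symbol, recording fired tiles in 'order'."""
    op, *children = node
    key = tuple([op] + [tile(c, order) if isinstance(c, tuple) else '$'
                        for c in children])
    order.append(op)          # postorder: children are reduced first
    return TILES[key]

order = []
tree = ('add', ('load', ('addr', ('local',), 0x8)), ('const', 17))
print(tile(tree, order), order)
# reg ['local', 'addr', 'load', 'const', 'add']
```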
Tiling /never/ takes constant parameters into account, which is a
severe limitation - some operators are radically different between
8-bit and 64-bit operand sizes on x86-64. Maybe we can implement
architecture-specific templates to optimize our way out of this.
*** Picking tiles
The differences between the model above and the real implementation
are:
+ A tile can cover more complex expression trees (see the example
tiles above)
+ A given subtree may be reduced by different sets of tiles, and
+ We'd like to choose the cheapest such set, and we'd like to do that
efficiently.
Before going further, be warned: the following is rather complex and
even now it makes my head hurt. Feel free to skip this section.
In order to figure out whether a certain tile can be applied to a
given tree, we need to know whether its pattern matches. The pattern
matches if its structure matches /and/ if the expressions 'below' it
can be matched to the symbols at its leaves. Because the tiler uses
postorder iteration, we can ensure that symbols have been assigned to
the leaves when we consider the head operator of the tile.
To avoid having to traverse the tree to find out whether the leaves
match the pattern, during precompilation the tile patterns are
'flattened' and the nested lists replaced with 'synthetic symbols'.
From each sublist a new (partial) tile pattern is generated that
reduces to this generated symbol and compiles to nothing. See the
table below for some examples. Because the resulting patterns are
flat, they can be matched by inspecting only the symbols of the direct
children of the operator.
| Original | Flat |
|--------------------------------------+-----------------------------|
| =(load (addr reg)) -> reg= | =(load @sym1) -> reg= |
| | =(addr reg) -> @sym1= |
| =(store (addr reg) reg) -> void= | =(store @sym2 reg) -> void= |
| | =(addr reg) -> @sym2= |
| =(add reg (const)) -> reg=           | =(add reg @sym3) -> reg=    |
| | =(const) -> @sym3= |
| =(add reg (load (addr reg))) -> reg= | =(add reg @sym4) -> reg= |
| | =(load @sym5) -> @sym4= |
| | =(addr reg) -> @sym5= |
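The flattening transformation itself can be sketched as follows. This
is illustrative Python with its own symbol numbering (so it differs
from the table above); the actual precompiler is Perl:

```python
# Flattening nested tile patterns: each nested list is replaced by a
# fresh 'synthetic symbol', and a partial rule reducing to that symbol
# is emitted alongside. Sketch of the idea only.

from itertools import count

fresh = count(1)

def flatten(pattern, rules):
    """Flatten one pattern, appending generated partial rules to 'rules'."""
    flat = []
    for item in pattern:
        if isinstance(item, list):                   # nested sub-pattern
            sym = '@sym%d' % next(fresh)
            rules.append((flatten(item, rules), sym))
            flat.append(sym)
        else:
            flat.append(item)
    return tuple(flat)

rules = []
head = flatten(['add', 'reg', ['load', ['addr', 'reg']]], rules)
print(head)                    # ('add', 'reg', '@sym1')
for pattern, sym in rules:
    print(pattern, '->', sym)  # ('addr', 'reg') -> @sym2
                               # ('load', '@sym2') -> @sym1
```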
From the table it is possible to see that a single tile list pattern
(like '(addr reg)') can be reduced to many different symbols. In fact,
given a set of tiles and their associated costs, we can compute all
possible combinations of symbols (symbol sets) that a tile pattern can
be reduced to /and/ we can also compute which tile would be the most
efficient implementation of that operator given the symbol sets of its
children. By precomputing this information, selecting the optimal tile
for an operator can be reduced to a table lookup at runtime.
For completeness I'll try to describe how the precomputation process
works. The central concept is that of 'symbol sets'
(symsets). We begin by initializing a map of all symbol sets, which
are initially just the individual symbols that are generated by all
tile patterns (called =@rules=).
#+BEGIN_SRC perl
$sets{$_->{sym}} = [$_->{sym}] for @rules;
#+END_SRC
And a reverse lookup table is also generated, from all symbols within
a set, to all symbol sets that they occur in. Again, initially this is
just the symbol itself.
#+BEGIN_SRC perl
while (my ($k, $v) = each %sets) {
# Use a nested hash for set semantics
$lookup{$_}{$k} = 1 for @$v;
}
#+END_SRC
Then for each tile pattern, we lookup all symbols that could be
combined with the symbols in the patterns, and we store the symbol
that this reduces to. The name '%trie' is also somewhat inaccurate,
but it gets the picture across: this is a nested associative lookup
table. The purpose is to /combine/ all tiles (and their
associated symbols) that map to the *same symbol sets* together -
because that means that these tiles can cover the same tree
structures.
#+BEGIN_SRC perl
for my $s_k1 (keys %{$lookup{$sym1}}) {
for my $s_k2 (keys %{$lookup{$sym2}}) {
$trie{$head, $s_k1, $s_k2}{$rule_nr} = $rule->{sym};
}
}
#+END_SRC
Having done that, we generate new symbol sets from this lookup table -
all the values of the generated hashes are symbols that can be
produced by tiles that map to the same trees:
#+BEGIN_SRC perl
my %new_sets;
for my $gen (values %trie) {
my @set = sort(main::uniq(values %$gen));
my $key = join(':', @set);
$new_sets{$key} = [@set];
}
#+END_SRC
In general, we run this process for multiple iterations with the new
sets of symbols, because the symbol sets that exist influence which
tile patterns are considered equivalent. For instance, in the table
above the tile patterns generating =@sym1=, =@sym2= and =@sym5= are
equivalent, and so after a single iteration there will be only a
single set containing those symbols. This means that the patterns
=(load @sym1)= and =(load @sym5)= are equivalent, and hence that
=@sym4= is always combined with =reg=. So this process has to iterate
until there are no more changes in the sets, at which point it can
read (from the same table) all possible combinations of sets. Having
this, it is a small matter to compute the cheapest tile of the set to
cover a given tree [fn:cost].
#+BEGIN_SRC perl
my (%seen, @rulesets);
for my $symset (values %trie) {
my @rule_nrs = main::sortn(keys %$symset);
my $key = join $;, @rule_nrs;
push @rulesets, [@rule_nrs] unless $seen{$key}++;
}
return @rulesets;
#+END_SRC
** Register allocation
The legacy JIT compiler did not require register allocation, because
it never persisted values in registers between fragments. The
expression JIT compiler does; that is its whole purpose.
*** Earlier attempts
There have been three attempts at implementing a register allocator
for the expression JIT. Ironically, all of them have been based on the
same algorithm, which is [[https://c9x.me/compile/bib/Wimmer10a.pdf][linear scan register allocation on SSA form]].
The first attempt tried to do register allocation 'online' with code
generation, i.e. during tree walking. This does not work, because you
need to know the live range of the values you're allocating;
otherwise you don't know when a register can be freed, and you'll
either run out of registers or free them incorrectly. Furthermore,
you'll need to manage /aliases/, i.e. values that are the same
(created by =COPY= operators) and nodes that 'merge' (created by =IF=
operators). Never mind the case where we compile a =CALL= node in a
conditional block.
The second attempt improved on that in two ways:
+ It made a real attempt to compute the live range extents of
values. Initially based on tree iteration order; however, after
introducing the tile list, based on tile list position (which is
what we use now).
+ It used a 'cross-linked-list' to identify values that either shared
a location or a definition. This was used to deal with aliases and
merged values.
However, this proved still not quite sufficient. It had inherited from
the earlier online algorithm the design of /allocating/ a register for
a value at the first instance it is encountered, and /assigning/ that
register to instructions that use the value when they'd be
encountered. Hence it required the ability to determine the 'current'
position for a value, including if that value is spilled or not. This
is somewhat complicated when a value is represented by multiple
different things (due to aliasing and merging, in a cross-linked
list). What is more, it would only try to resolve aliases at
register-assignment time, i.e. when it'd encounter the relevant =COPY=
or =IF= operator. So that would actually change the extent of the live
range while it was 'under operation', which further complicates
several key functionalities such as deciding when a register can be
freed.
*** Current implementation
The third attempt is the current one and departs from the previous two
attempts in the following crucial ways:
+ Rather than have a *value* be synonymous with its *definition* (the
instruction and operator that created it) it represents a *set* of
definitions and *uses*, joined together by aliases and merges. It
uses a separate phase and a [[https://en.wikipedia.org/wiki/Disjoint-set_data_structure][union-find]] data structure to
implement these values (consistently called 'live ranges'). This
ensures that all uses of the 'same' value use the same register
location.
+ During the allocation phase, it iterates over the *values* in
ascending order (i.e. by first definition), rather than over the
/instructions/ as the second version did or over the /tree/ as the
first. A register can be reused only after its value has expired,
which is when the algorithm iterates past its last use. This means
that at a single point in the program, no two values can be
allocated to the same register, which is /the/ essential correctness
property for a register allocator.
+ Whenever a register is allocated, it is directly /assigned/ to the
instructions that use its value. When that register is changed for
whatever reason, we update it as well. Thus we never have to query
'what is the current location of the value of $x', which gets tricky
to answer in case of spilling, and the register assigned to an
instruction is always 'up to date'.
+ It is sometimes necessary to 'spill' a value from a register to
memory, either because we need more registers or because the ABI
forces us to - this is true for all register allocation algorithms.
At any point in the program where it is necessary to spill a value,
all instructions that define or use that value are wrapped with
instructions to move that value between register and memory. (This
ensures that the value is spilled to the same position in all
possible code paths). The single live range that covered all these
uses is then split into 'atomic' live ranges that cover only the
instruction to load or store the value and the original instruction
that would use or define it. The upshot is that a live range that is
processed always refers to a value that resides in a register.
+ Because of value merging and aliasing, it is sometimes possible for
a value to be undefined within its own live range. This can be
demonstrated by the code below. In it, =$a= is first defined as 10,
then conditionally redefined as the result of =$a * 2 - 3=. So between
the definition of =$b= and the assignment of the 'new' value of
=$a=, the 'old' value of =$a = 10= is no longer valid. Hence, it
should *not* be stored to memory at that point. Such undefined
ranges are called 'live range holes', and they are an example of the
'complexity waterbed' of compiling - SSA form code doesn't have
them, but makes it really complicated to ensure the same register is
allocated for all the separate uses of a value.
#+BEGIN_SRC perl
my $a = 10;
if (rand < 0.5) {
my $b = $a * 2;
$a = $b - 3;
}
printf "\$a = %d\n", $a;
#+END_SRC
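The union-find phase mentioned above is the textbook disjoint-set
structure. A minimal Python sketch (the actual implementation is in
C, and the value numbering here is a made-up example):

```python
# Union-find ('disjoint set') over value definitions: COPY aliases and
# IF merges union their operands' live ranges, so every use of the
# 'same' value ends up in one set - and hence one register location.

class LiveRanges:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

# values: 0 = initial def of $a, 1 = (copy ...) of $a,
#         2 = conditional redefinition merged by an IF
lr = LiveRanges(3)
lr.union(0, 1)    # COPY: value 1 aliases value 0
lr.union(0, 2)    # IF: both branches must share a location
print(lr.find(1) == lr.find(2))   # True: a single live range
```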
*** Function call ABI
It is (unfortunately) necessary to have the register allocator handle
the C function call ABI. This is because certain function arguments
are supposed to be passed in specific registers (meaning they have to
be placed there) and some registers can be overwritten by the called
function (in fact, all of them). So in the general
case, the JIT needs to:
+ Place values in the correct registers and/or stack positions, making
sure not to overwrite values that are still necessary for the call.
+ Load values from memory if they are not present in registers. Note
that a function call can have more arguments than registers, so it
is not always possible to load all values into registers prior to
the call. So we cannot rely on the normal register allocator
mechanisms to ensure a value is present.
+ Ensure all values that 'survive' the call are spilled to memory.
This is currently implemented by a function =prepare_arglist_and_call=
that is over 144 lines long and implements topological sort,
cycle-breaking and register-conflict resolution with values used by
the =CALL= operator itself (e.g. in dynamic dispatch).
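At its core, the register-shuffling part of this is a 'parallel
moves' problem: emit =mov dst, src= instructions in an order that
never clobbers a still-needed source, breaking cycles with a spare
register. The Python below sketches only that general algorithm; the
register names are illustrative, and the real code additionally
handles stack arguments and spilled values:

```python
# Sequencing a parallel register move: pick 'leaf' destinations (ones
# no pending move still reads) first; when only cycles remain, park
# one source in the spare register and redirect its readers there.
# General-algorithm sketch only.

def sequence_moves(moves, spare='rax'):
    """moves: dict dst -> src; returns an ordered list of (dst, src)."""
    order, pending = [], dict(moves)
    while pending:
        leaf = next((d for d in pending if d not in pending.values()), None)
        if leaf is not None:
            order.append((leaf, pending.pop(leaf)))
        else:
            victim = next(iter(pending.values()))   # break a cycle
            order.append((spare, victim))
            pending = {d: (spare if s == victim else s)
                       for d, s in pending.items()}
    return order

# Swapping two argument registers requires the spare register:
print(sequence_moves({'rdi': 'rsi', 'rsi': 'rdi'}))
# [('rax', 'rsi'), ('rsi', 'rdi'), ('rdi', 'rax')]
```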
I feel like this functionality - rearranging registers to deal with
register requirements of specific operators - could be generalized,
but I'm not sure it is worth it. The x86-64 instruction set has
plenty of 'irregular' instructions with specific register
requirements; however, virtually all of them use just the single
=rax= register, which I've decided to keep as the spare register
anyway. So I'm not sure what the value of generalizing would be, and
I'm fairly sure it would introduce even more complexity.
*** Challenges
What is still necessary for completeness is the ability to allocate
multiple register /classes/, specifically floating point
registers. The suggested way of handling that is by giving out
different number ranges per class, and letting the tiles sort out the
details of getting the correct number.
Another thing that is not necessary but certainly nice-to-have is a
way to ensure that the register assignment actually meets the
constraints of the target architecture. Specifically, the expression
JIT assumes a three-operand system that operates as follows:
#+BEGIN_EXAMPLE
a = operator(b, c)
#+END_EXAMPLE
What we actually have on x86-64 is a two-operand form, more like:
#+BEGIN_EXAMPLE
a = operator(a, b)
#+END_EXAMPLE
The situation is actually considerably more complex due to the many
addressing modes that are possible (for instance, indexed memory
access uses a third register 'c'). Currently we handle that in tiles
by moving values around, but it would be much nicer if the register
allocator could enforce this constraint. (The allocator keeps a
'scratch' register free at all times for this purpose).
Finally, the live range hole finding algorithm iterates over the
basic blocks of the JIT code in backwards linear order. This is fine
as long as there are no loops (the tile list control flow always
flows forward), which is true as long as we only ever compile single
basic blocks - but that is an assumption I'm explicitly trying to
break. So I'll need to find a better, more general iteration order.
** Logging
To aid debugging, the JIT compiler logs to a file. This logging is ad
hoc and line-based (for the most part), and not currently useful for
general consumption or instrumentation. In addition, the linear scan
allocator has numerous debug logging statements that are
conditionally compiled in.
It would probably be a nice improvement to write 'structural' logging
information to the JIT or spesh log (that could be useful for
instrumentation), and use conditionally compiled 'ad hoc' logging
statements solely for debugging. It would be nicer still to have
conditional compilation on the section of JIT output that you'd really
be interested in (e.g. expression building or register allocation).
The JIT log is set up at initialization time in =src/moar.c=, based on
the environment variable =MVM_JIT_LOG=. This is arguably not a good
design for embedding; then again, for embedding, this is not a file
you'd want to expose anyway.
Expression trees and tile lists have their own loggers, which
implement a more structured format. An expression tree log represents
the tree as a directed graph in the 'dot' language. There's probably
still some formatting we could apply to that, but on the whole I'm
satisfied with that approach. The tile list logger represents each
tile with a debug string, which is stored when the tiler table
generator processes the tile pattern definitions. It also lists the
basic block structure of the program.
* Tools
* Notes
** Memory management
The JIT uses three different strategies for memory management:
+ Spesh region allocation for the JIT graph.
+ Dynamically allocated vectors for the expression tree, tile lists
and related structures.
+ Anonymous memory pages for the generated machine code.
*** Spesh region allocation
The =spesh= dynamic optimization framework contains a region
allocator that is bound to the spesh graph.
*** Dynamic vectors
*** Memory pages
Operating systems that try to provide some measure of security to C
programs generally do not allow execution of heap-allocated memory.
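Hence JIT code must be placed in pages explicitly mapped with execute
permission - via =mmap()= on Unix-like systems or =VirtualAlloc()= on
Windows. A Unix-only Python sketch of the idea (the 'ret' byte is a
made-up payload):

```python
# Allocate an anonymous page with read/write/execute permission and
# copy machine code into it. Unix-only sketch; a JIT written in C does
# the same with mmap(). A hardened (W^X) system would instead map
# read+write first and mprotect() the page to read+execute before
# jumping into it.

import mmap

page = mmap.mmap(-1, mmap.PAGESIZE,           # anonymous mapping
                 prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
page.write(bytes([0xC3]))                     # x86-64 'ret' instruction
print(page[0] == 0xC3)                        # True: code is in place
```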
[fn:cost] Actually, it is not very simple at all, and this is mostly
because we've 'flattened' the tile list, so we somehow have to
'reconstruct' the cost of a tile set that is equivalent to the tree
matched by the original. This part really needs to be revisited at
some point.