File: overview.org

package info (click to toggle)
moarvm 2020.12%2Bdfsg-1
  • links: PTS, VCS
  • area: main
  • in suites: bullseye
  • size: 18,652 kB
  • sloc: ansic: 268,178; perl: 8,186; python: 1,316; makefile: 768; sh: 287
file content (634 lines) | stat: -rw-r--r-- 34,610 bytes parent folder | download | duplicates (4)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
#+TITLE: MoarVM JIT compiler overview

* Introduction

This document attempts to describe the structure of the MoarVM JIT
compiler and document some of the design decisions that shape it.

* Components

** Dynamic optimizer

This part is noteworthy because it is not, in fact, part of the JIT
compiler at all. It is a separate subsystem that handles all of the
type logging, graph construction and rewriting, call inlining - in
short, all of the interesting optimizations of a dynamic language
system. The JIT (legacy and 'expression') is a code-generation backend
for this system. MoarVM is rather unique in this since in most
systems, the order is reversed.

In contrast to what might be called 'language-level' optimizations
implemented by spesh, the objective of the JIT is to remove
intepreter-level inefficiencies.

** 'Lego' JIT compiler

The legacy or 'lego' JIT compiler was the first to be developed back
in 2014.  It has this name because it works by emitting independent
fragments of machine code that are stitched together. These fragments
are mapped directly from the 'spesh graph' (representing a MoarVM
subroutine or 'frame') that is the input to the compiler. There are
very few optimizations applied by the legacy compiler (one notable
optimization is the inlining of a polymorphic REPR function pointer if
the type of the object is known).

The lego JIT produces a structure called the 'JIT graph', which, while
a technically correct name, is a bit of a misnomer since it is really
a singly-linked list. It uses labels rather than an explicit control
flow graph (like spesh does) to indicate internal structure. The JIT
graph is then compiled to JIT code. There is in fact little reason to
maintain this two-phase setup; the legacy JIT could work just as well
in a single phase.

Even for the new JIT, the legacy JIT provides the necessary
scaffolding to integrate with the interpreter (such as a proper
function prologue and epilogue, support for exceptions and invokish
operators, and the various bits and pieces required for correct
deoptimization.

** DynASM assembler

DynASM is a third-party library from the [[http://www.luajit.org/][luajit]] project that we've
modified to support register addressing on the x86-64 architecture. It
allows you to interleave the assembly code you'd like to emit with the
C code that does the regular compiling business. DynASM has two parts:

+ A lua script to preprocess a 'dasc' (which is C interleaved with
  assembly code) file and output a C file.
+ A C header to #include into the generated C file that contains all
  the functions to assemble and link the machine code at runtime

DynASM source files are architecture-specific.

** Expression trees

The expression 'tree' is the intermediate format of the new
'expression' JIT compiler. The tree is a technically incorrect
misnomer - it actually represents a graph.

The expression tree is built from a spesh graph by mapping MoarVM
instructions to templates. These templates are written in a textual
format by a human (or perhaps a script), that is /precompiled/ to a
header file during MoarVM compilation. Take note that while the
expression templates look a bit like a high-level language some may be
familiar with, the concepts it expresses are decidedly 'low-level' and
can be mapped more-or-less directly to the machine code of a
hypothetical RISC CPU.

*** Syntax

The syntax used for expression templates is derived from LISP.

+ Lists are the basic constructs of the language. A list is delimited
  by opening ='('= and closing =')'= parentheses. Lists can be
  nested. Items in the lists are separated by whitespace.
+ A word is anything that matches the perl regular expression of =[^\s\(\)#"']=.
  + A word that consists only of alphanumeric symbols (and possibly
    underline) is called a /symbol/: =sp_p6oget_o=. The first symbol
    in a list is generally it's /operator/.
  + Suffixed by =:= is generally a /keyword/, e.g. =let:=
  + Prefixed by =$= is a /reference/, either to a MoarVM instruction
    operand (=$1=, =$2=) or to a declared name (=$val=).
  + Prefixed by =^= invokes a /list macro/, which can be declared with
    the =macro:= keyword.
  + Macro /parameters/ are prefixed by =,= (e.g. =,obj=), and they are
    replaced with whatever symbol or list is 'bound' to them when the
    macro is invoked.
  + Parameter macro's are indicated by a prefix =&=,
    e.g. =&sizeof=. They are always substituted with a function macro:
    =(&sizeof foo)= becomes =sizeof(foo)=. These /must/ form constant
    expressions at compilation time.
+ A string is anything delimited by ="= - but note that /escape
  sequences/ are not supported, so ="foo \" bar"= doesn't do what you
  might think it does

An example of a template definition would be as follows:

#+BEGIN_SRC scheme
(template: sp_p6oget_o
    (let: (($val (load (add (^p6obody $1) $2) ptr_sz)))
           (if (nz $val) $val (^vmnull))))
#+END_SRC

The =let:= keyword is not an operator but a special construct for the
template precompiler. It allows the declaration of names for values
that are computed and reused in further operators.

And an example of a macro definition would be:

#+BEGIN_SRC scheme
(macro: ^getf (,object ,type ,field)
   (load (addr ,object (&offsetof ,type ,field))
         (&SIZEOF_MEMBER ,type ,field)))
#+END_SRC



*** Operators

These are all the operators that are known by the expression language
as of today (<2018-10-23 Tues>).

| Operator  | Shape                            | Type     | Semantics                                                                                             |
|-----------+----------------------------------+----------+-------------------------------------------------------------------------------------------------------|
| =LOAD=    | =(load reg $size)=               | =reg=    | Load a value from a memory address                                                                    |
| =STORE=   | =(store reg reg $size)=          | =void=   | Store a value to a memory address                                                                     |
| =CONST=   | =(const $value $size)=           | =reg=    | Load a constant value to a register                                                                   |
| =ADDR=    | =(addr reg $ofs)=                | =reg=    | Add a constant offset to a pointer                                                                    |
| =IDX=     | =(idx reg reg $scale)=           | =reg=    | Compute a pointer for an index in an array                                                            |
| =COPY=    | =(copy reg)=                     | =reg=    | Copy this value (opaque to tiling)                                                                    |
|-----------+----------------------------------+----------+-------------------------------------------------------------------------------------------------------|
| =LT=      | =(lt reg reg)=                   | =flag=   | First operand is smaller than second                                                                  |
| =LE=      | =(le reg reg)=                   | =flag=   | First operand is smaller than or equal to second                                                      |
| =EQ=      | =(eq reg reg)=                   | =flag=   | First operand is equal to second                                                                      |
| =NE=      | =(ne reg reg)=                   | =flag=   | First operand is not equal to second                                                                  |
| =GE=      | =(ge reg reg)=                   | =flag=   | First operand is larger than or equal to second                                                       |
| =GT=      | =(gt reg reg)=                   | =flag=   | First operand is larger than second                                                                   |
| =NZ=      | =(nz reg)=                       | =flag=   | Operand is nonzero                                                                                    |
| =ZR=      | =(zr reg)=                       | =flag=   | Operand equals zero                                                                                   |
|-----------+----------------------------------+----------+-------------------------------------------------------------------------------------------------------|
| =SCAST=   | =(scast reg $to $from $sign)=    | =reg=    | Convert a signed value from smaller to larger representation                                          |
| =UCAST=   | =(ucast reg $to $from $sign)=    | =reg=    | Convert an unsigned value from one representation to another (larger to smaller or smaller to larger) |
| =FLAGVAL= | =(flagval flag)=                 | =reg=    | Binary value of comparison operator                                                                   |
| =DISCARD= | =(discard reg)=                  | =void=   | Discard value of child operator                                                                       |
|-----------+----------------------------------+----------+-------------------------------------------------------------------------------------------------------|
| =ADD=     | =(add reg reg)=                  | =reg=    | Add two integer values                                                                                |
| =SUB=     | =(sub reg reg)=                  | =reg=    | Subtract second operand from first                                                                    |
| =MUL=     | =(mul reg reg)=                  | =reg=    | Multiply two integer values                                                                           |
| =AND=     | =(and reg reg)=                  | =reg=    | Binary AND (intersection) of two operands                                                             |
| =OR=      | =(or reg reg)=                   | =reg=    | Binary OR (union) of two operands                                                                     |
| =XOR=     | =(xor reg reg)=                  | =reg=    | Binary XOR of two operands                                                                            |
| =NOT=     | =(not reg)=                      | =reg=    | Binary negation of operand                                                                            |
|-----------+----------------------------------+----------+-------------------------------------------------------------------------------------------------------|
| =ALL=     | =(all flag+)=                    | =flag=   | Variadic short-circuit logical AND                                                                    |
| =ANY=     | =(any flag+)=                    | =flag=   | Variadic short-circuit logical OR                                                                     |
|-----------+----------------------------------+----------+-------------------------------------------------------------------------------------------------------|
| =DO=      | =(do void* reg)=                 | =reg=    | Execute multiple statements and return last expression                                                |
| =DOV=     | =(do void+)=                     | =void=   | Execute multiple statements                                                                           |
| =WHEN=    | =(when flag void)=               | =void=   | Execute statement if condition is met                                                                 |
| =IF=      | =(if flag reg reg)=              | =reg=    | Yield value of conditional expression                                                                 |
| =IFV=     | =(ifv flag void void)=           | =void=   | Conditionally execute one of two statements                                                           |
|-----------+----------------------------------+----------+-------------------------------------------------------------------------------------------------------|
| =BRANCH=  | =(branch reg)=                   | =void=   | Jump to code position                                                                                 |
| =LABEL=   | =(label $label)=                 | =reg=    | Load position of code                                                                                 |
| =MARK=    | =(mark (label $label))=          | =void=   | Mark this position                                                                                    |
|-----------+----------------------------------+----------+-------------------------------------------------------------------------------------------------------|
| =CALL=    | =(call reg (arglist ...) $size)= | =reg=    | Call a function and return a value                                                                    |
| =CALLV=   | =(call reg (arglist ...))=       | =void=   | Call a function without a return value                                                                |
| =ARGLIST= | =(arglist (carg reg $type)+)=    | =c_args= | Setup function call arguments                                                                         |
| =CARG=    | =(carg reg $type)=               | =void=   | Annotate value with 'parameter type'                                                                  |
|-----------+----------------------------------+----------+-------------------------------------------------------------------------------------------------------|
| =GUARD=   | =(guard void $before $after)=    | =void=   | Wrap a statement with code before and after                                                           |
|-----------+----------------------------------+----------+-------------------------------------------------------------------------------------------------------|
| =TC=      | =(tc)=                           | =reg=    | Refer to MoarVM thread context                                                                        |
| =CU=      | =(cu)=                           | =reg=    | Refer to current compilation unit                                                                     |
| =LOCAL=   | =(local)=                        | =reg=    | Refer to current frame working memory                                                                 |
| =STACK=   | =(stack)=                        | =reg=    | Refer to top of C frame stack                                                                         |


*** Tree iteration and manipulation


** Instruction selection (tiler)

Next to the register allocator, the instruction selection algorithm
(or the 'tiler') is the most complex part of the JIT. It is
fortunately fairly stable. The goal of tiling is to match the
intermediate representation (the expression tree/graph) to the
cheapest sequence of instructions on the target architecture (x86-64)
that implements the semantics of the expression tree. The
implementation is heavily based on the paper by [[http://dl.acm.org/citation.cfm?id=75700][Aho et al]] - and in
fact, the expression tree IR was designed mostly to accomodate it.

*** How tiling works

The following is a necessarily incomplete description of the actual
process - much better described by either the paper linked above or
the source code that implements it.

A 'tile' in the expression JIT describes a number of overlapping
concepts, for which I rely on the reader to disambiguate by
context. (Humans are good at that).

+ A /pattern/ that can be matched against an expression tree and which
  is defined textually, much like the expression templates.
+ An /object/ for the JIT compiler to represent a machine instruction,
  and
+ A /function/ that emits the machine code to the compiler
  buffer, with parameters substituted

The textual 'tile definition' contains the name of the function that
implements it (in the example below =store_addr= and
=test_addr_const=), the /pattern/ that the tile matches, the /symbol/
that is substituted for the pattern in a succesful match, and the
/cost/ of doing so (in terms of compiled code).

#+BEGIN_SRC scheme
(tile: store_addr 
   (store (addr reg $ofs) reg $size) void 5)
(tile: test_addr_const  
   (nz (and (load (addr reg $ofs) $size) (const $val $size))) flag 4)
#+END_SRC

Conceptually, tiling is a process of /reducing/ the tree structure by
replacing nodes by the /symbols/ defined by the tiles. The following
example may serve as an illustration. Given the following set of
tiles:

| Tile    | Pattern                 | Symbol | Assembly code            |
|---------+-------------------------+--------+--------------------------|
| =local= | =(local)=               | =reg=  | =rbx=                    |
| =const= | =(const $value)=        | =reg=  | =mov reg, $value=        |
| =addr=  | =(addr reg $offset)=    | =reg=  | =lea reg, [reg+$offset]= |
| =load=  | =(load reg $size)=      | =reg=  | =mov out, [reg]=         |
| =add=   | =(add reg reg)=         | =reg=  | =add reg1, reg2=         |
| =store= | =(store reg reg $size)= | =void= | =mov [reg1], reg2=       |

We can reduce a tree and generate code as follows (note that the order
of reducing is bottom-up / postorder from left-to-right):

| Tree                                         | Tile           | Code                 |
|----------------------------------------------+----------------+----------------------|
| =(add (load (addr (local) 0x8)) (const 17))= | =local -> reg= |                      |
| =(add (load (addr reg 0x8)) (const 17))=     | =addr -> reg=  | =lea rax, [rbx+0x8]= |
| =(add (load reg) (const 17))=                | =load -> reg=  | =mov rcx, [rax]=     |
| =(add reg (const 17))=                       | =const -> reg= | =mov rdx, 17=        |
| =(add reg reg)=                              | =add -> reg=   | =add rcx, rdx=       |

Tiling /never/ takes constant parameters into account, which is a
severe limitation - some operators are radically different between
8-bit and 64 bit sizes on x86-64. Maybe we can implement
architecture-specific templates to optimize our way out of this.


*** Picking tiles

The difference between the model above and the real implementation
are:

+ A tile can cover more complex expression trees (see the example
  tiles above)
+ A given subtree may be reduced by different sets of tiles, and
+ We'd like to choose the cheapest such set, and we'd like to do that
  efficiently.

Before going further, be warned: the following is rather complex and
even now it makes my head hurt. Feel free to skip this section.

In order to figure out if a certain tile can be applied to a given
tree, we'd need to know if it's pattern matches. The pattern matches
if it's structure matches /and/ if the expressions 'below' it can be
matched to the symbols at its leafs. Because the tiler uses postorder
iteration, we can ensure that symbols have been assigned to the leafs
when we consider the head operator of the tile.

To avoid having to traverse the tree to find out if the leafs match to
the pattern, during precompilation the tile patterns are 'flattened'
and the nested lists replaced with 'synthetic symbols'. From the
sublist a new (partial) tile pattern is generated that reduces to this
generated symbol and compiles to nothing). See for some examples the
table below. Because the resulting patterns are flat, they can be
matched by inspecting only the symbols of the direct children of the
operator.

| Original                             | Flat                        |
|--------------------------------------+-----------------------------|
| =(load (addr reg)) -> reg=           | =(load @sym1) -> reg=       |
|                                      | =(addr reg) -> @sym1=       |
| =(store (addr reg) reg) -> void=     | =(store @sym2 reg) -> void= |
|                                      | =(addr reg) -> @sym2=       |
| =(add reg (const)) -> reg=           | =(add reg @sym3)=           |
|                                      | =(const) -> @sym3=          |
| =(add reg (load (addr reg))) -> reg= | =(add reg @sym4) -> reg=    |
|                                      | =(load @sym5) -> @sym4=     |
|                                      | =(addr reg) -> @sym5=       |

From the table it is possible to see that a single tile list pattern
(like '(addr reg)') can be reduced to many different symbols. In fact,
given a set of tiles and their associated costs, we can compute all
possible combinations of symbols (symbol sets) that a tile pattern can
be reduced to /and/ we can also compute which tile would be the most
efficient implementation of that operator given the symbol sets of
it's children. By precomputing this information selecting the optimal
tile for an operator can be reduced to a table lookup at runtime.

For completeness I'll try to describe how the precomputation process
works. The central concept it is based on is that of 'symbol sets'
(symsets). We begin by initializing a map of all symbol sets, which
are initially just the individual symbols that are generated by all
tile patterns (called =@rules=).

#+BEGIN_SRC perl
$sets{$_->{sym}} = [$_->{sym}] for @rules;
#+END_SRC

And a reverse lookup table is also generated, from all symbols within
a set, to all symbol sets that they occur in. Again, initially this is
just the symbol itself.

#+BEGIN_SRC perl
while (my ($k, $v) = each %sets) {
   # Use a nested hash for set semantics
   $lookup{$_}{$k} = 1 for @$v;
}
#+END_SRC

Then for each tile pattern, we lookup all symbols that could be
combined with the symbols in the patterns, and we store the symbol
that this reduces to. The name '%trie' is also kind of inaccurate, but
it gets the picture accross, which is that this is a nested associated
lookup table. The purpose is to /combine/ all tiles (and their
associated symbols) that map to the *same symbol sets* together -
because that means that these tiles can cover the same tree
structures.

#+BEGIN_SRC perl
for my $s_k1 (keys %{$lookup{$sym1}}) {
   for my $s_k2 (keys %{$lookup{$sym2}}) {
        $trie{$head, $s_k1, $s_k2}{$rule_nr} = $rule->{sym};
    }
}
#+END_SRC

Having done that, we generate new symbol sets from this lookup table -
all the values of the generated hashes are symbols that can be
produced by tiles that map to the same trees:

#+BEGIN_SRC perl
my %new_sets;
for my $gen (values %trie) {
    my @set = sort(main::uniq(values %$gen));
    my $key = join(':', @set);
    $new_sets{$key} = [@set];
}
#+END_SRC

In general, we run this process multiple iterations with the new sets
of symbols, because the symbolsets that exist influence the tile
patterns that are considered to be equivalent. For instance, in the
table above the tile patterns generating =@sym1=, =@sym2= and =@sym5=
are equivalent, and so after a single iteration there will be only a
single set containing those symbols. This means that the patterns of
=(load @sym1)= and =(load @sym5)= are equivalent, and hence that
=@sym4= is always combined with =reg=. So this process has to iterate
until there are no more change in the sets, at which point it can read
(from the same table) the all possible combinations of sets. Having
this, it is a small matter to compute the cheapest tile of the set to
cover a given tree [fn:cost].

#+BEGIN_SRC perl
my (%seen, @rulesets);
for my $symset (values %trie) {
    my @rule_nrs = main::sortn(keys %$symset);
    my $key = join $;, @rule_nrs;
    push @rulesets, [@rule_nrs] unless $seen{$key}++;
}
return @rulesets;
#+END_SRC



** Register allocation

The legacy JIT compiler did not require register allocation, because
it never persisted values in registers between fragments. The
expression JIT compiler does, that's the whole purpose of it.

*** Earlier attempts

There have been three attempts at implementing a register allocator
for the expression JIT. Ironically, all of them have been based on the
same algorithm, which is [[https://c9x.me/compile/bib/Wimmer10a.pdf][linear scan register allocation on SSA form]].

The first attempt tried to do register allocation 'online' with code
generation, i.e. during tree walking. This does not work because
you'll need to know the live range of the values your allocating,
otherwise you don't know when a register can be freed, and you'll
either run out or you might be incorrect when freeing
them. Furthermore, you'll need to manage /aliases/, i.e. values that
are the same (created by =COPY= operators) and nodes that 'merge'
(created by =IF= opreators). Never mind the case where we compile a
=CALL= node in a conditional block.

The second attempt improved on that in two ways:

+ It made a real attempt to compute the live range extents of
  values. Initally based on tree iteration order; however, after
  introducing the tile list, based on tile list position (which is
  what we use now).
+ It used a 'cross-linked-list' to identify values that either shared
  a location or a definition. This was used to deal with aliases and
  merged values.

However, this proved still not quite sufficient. It had inherited from
the earlier online algorithm the design of /allocating/ a register for
a value at the first instance it is encountered, and /assigning/ that
register to instructions that use the value when they'd be
encountered. Hence it required the ability to determine the 'current'
position for a value, including if that value is spilled or not. This
is somewhat complicated when a value is represented by multiple
different things (due to aliasing and merging, in a cross-linked
list). What is more, it would only try to resolve aliases at
register-assignment time, i.e. when it'd encounter the relevant =COPY=
or =IF= operator. So that would actually change the extent of the live
range while it was 'under operation', which further complicates
several key functionalities such as deciding when a register can be
freed.


*** Current implementation

The third attempt is the current one and departs from the previous two
attempts in the following crucial ways:

+ Rather than have a *value* be synonymous with its *definition* (the
  instruction and operator that created it) it represents a *set* of
  definitions and *uses* , that are joined together by aliases and
  merges. It uses a separate phase and a [[https://en.wikipedia.org/wiki/Disjoint-set_data_structure][union-find]] data structure to
  implement these values (which are called 'live ranges'
  consistently). This ensures that all uses of the 'same' value use
  the same register location for it.

+ During the allocation phase, it iterates over the *values* in
  ascending order (i.e. by first definition), rather than over the
  /instructions/ as the second version did or over the /tree/ as the
  first. A register can be reused only after its value has expired,
  which is when the algorithm iterates past its last use. This means
  that at a single point in the program, no two values can be
  allocated to the same register, which is /the/ essential correctness
  property for a register allocator.

+ Whenever a register is allocated, it is directly /assigned/ to the
  instructions that use its value. When that register is changed for
  whatever reason, we update it as well. Thus we never have to query
  'what is the current location of the value of $x', which gets tricky
  to answer in case of spilling, and the register assigned to an
  instruction is always 'up to date'.

+ It is sometimes necessary to 'spill' a value from a register to
  memory, either because we need more registers or because the ABI
  forces us to - this is true for all register allocation algorithms.
  At any point in the program where it is necessary to spil a value,
  all instructions that define or use that value are wrapped with
  instructions to move that value between register and memory. (This
  ensures that the value is spilled to the same position in all
  possible code paths). The single live range that covered all these
  uses is then split into 'atomic' live ranges that cover only the
  instruction to load or store the value and the original instruction
  that would use or define it. The upshot is that a live range that is
  processed always refers to a value that resides in a register.

+ Because of value merging and aliasing, it is sometimes possible for
  a value to be undefined within its own life range. This can be
  demonstrated by code below. In it, =$a= is first defined as 10, then
  conditionally redefined as the result of =$a * 2 - 3=. So between
  the definition of =$b= and the assignment of the 'new' value of
  =$a=, the 'old' value of =$a = 10= is no longer valid. Hence, it
  should *not* be stored to memory at that point. Such undefined
  ranges are called 'live range holes', and they are an of example of
  the 'complexity waterbed' of compiling - SSA form code doesn't have
  them, but makes it really complicated to ensure the same register is
  allocated for all the separate uses of a value.

#+BEGIN_SRC perl
my $a = 10;
if (rand < 0.5) {
    my $b = $a * 2;
    $a = $b - 3;
}
printf "\$a = %d\n", $a;
#+END_SRC
  

*** Function call ABI

It is (unfortunately) necessary to have the register allocator handle
the C function call ABI. This is because certain function arguments
are supposed to be passed by in specific registers (meaning they have
to be placed there) and some registers can be overwritten by the
function that is called (in fact, all of them). So in the general
case, the JIT needs to:

+ Place values in the correct registers and/or stack positions, making
  sure not to overwrite values that are still necessary for the call.
+ Load values from memory if they are not present in registers. Note
  that a function call can have more arguments than registers, so it
  is not always possible to load all values into registers prior to
  the call. So we cannot rely on the normal register allocator
  mechanisms to ensure a value is present.
+ Ensure all values that 'survive' the call are spilled to memory.

This is currently implemented by a function =prepare_arglist_and_call=
that is over 144 lines long and implements topological sort,
cycle-breaking and register-conflict resolution with values used by
the =CALL= operator itself (e.g. in dynamic dispatch).

I feel like this functionality - rearranging registers to deal with
register requirements of specific operators - could be generalized,
but I'm not sure if it is worth it. The x86-64 instruction set has
plenty of 'irregular' instructions that have specific register
requirements however, virtually all of them use just the single =rax=
register, which I've decided to keep as the spare register anyway. So
I'm not sure what the value of generalizing would be, and I'm fairly
sure it would introduce even more complexity.

*** Challenges

What is still necessary for completeness is the ability to allocate
multiple register /classes/, specifically floating point
registers. The suggested way of handling that is by giving out
different number ranges per class, and letting the tiles sort out the
details of getting the correct number.

Another thing that is not necessary but certainly nice-to-have is a
way to ensure that the register assignment actually meets the
constraints of the target architecture. Specifically, while the
expression JIT assumes a system that operates as follows:

#+BEGIN_EXAMPLE
a = operator(b, c)
#+END_EXAMPLE

What we actually have is on x86-64 is more like:

#+BEGIN_EXAMPLE
a = operator(a, b)
#+END_EXAMPLE

The situation is actually considerably more complex due to the many
addressing modes that are possible. (For instance, indexed memory
access uses a third register 'c'.). Currently we handle that in tiles
by moving values arround, but it would be much nicer if the register
allocator could enforce this constraint. (The allocator keeps a
'scratch' register free at all times for this purpose).

Finally, the live range hole finding algorithm iterates over the basic
blocks of the JIT code in backwards linear order, which is fine so
long as there are no loops (the tile list control flow always flows
forward), which is true so long as we only ever try to compile single
basic blocks, but that is an assumption I'm explicitly trying to
break. So I'll need to find a better, more general iteration order.

** Logging

To aid debugging, the JIT compiler logs to file. This logging is adhoc
and line-based (for the most part). It is not currently useful for
general consumption or instrumentation. Adding to that, the linear
scan allocator has numerous debug logging statements that are
conditionally compiled in.

It would probably be a nice improvement to write 'structural' logging
information to the JIT or spesh log (that could be useful for
instrumentation), and use conditionally compiled 'adhoc' logging
statements solely for debugging. And it'd be nicer still to have
conditional compilation on the section of JIT output that you'd really
be interested in (e.g. expression building or register allocation).

The JIT log is setup at initialization time in =src/moar.c=, based on
the environment variable =MVM_JIT_LOG=. This is arguably not a good
design for embedding; then again, for embedding, this is not a file
you'd want to expose anyway.

Expression trees and tile lists have their own loggers, which
implement a more structured format. An expression tree log represents
the tree as a directed graph in the 'dot' language. There's probably
still some formatting we could apply to that, but on the whole I'm
satisfied with that approach. The tile list logger represents each
tile with a debug string, which is stored when the tiler table
generator processes the tile pttern definitions. It also lists the
basic block structure of the program.

* Tools

* Notes

** Memory management

The JIT uses three different strategies for memory management:

+ Spesh region allocation for the JIT graph.
+ Dynamically allocated vectors for the expression tree, tile lists
  and related structures.
+ Anonymous memory pages for the generated machine code.

*** Spesh region allocation

The =spesh= dynamic optimization framework contains a region
allocator that is bound to the spesh graph.

*** Dynamic vectors



*** Memory pages

Operating systems that try to provide some measure of security to C
programs generally do not allow execution of heap-allocated memory.



[fn:cost] Actually, it is not very simple at all, and this is mostly
because we've 'flattened' the tile list, so we somehow have to
'reconstruct' the cost of a tile set that is equivalent to the tree
matched by the original. This part really needs to be revisited at
some point.