=========================
PyPy's assembler backends
=========================
Draft notes about the organization of assembler backends in the PyPy JIT, in 2016
=================================================================================
Input: a linear sequence of instructions, called a "trace".
A trace is a sequence of instructions in SSA form. Most instructions
correspond to one or a few CPU-level instructions. There are a few
meta-instructions like `label` and debugging stuff. All branching is
done with guards, which are instructions that check that a condition is
true and exit the trace if not. A failing guard can have a new trace
added to it later, called a "bridge". A patched guard becomes a direct
`Jcond` instruction going to the bridge, with no indirection, no
register spilling, etc.
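As an illustration, a tiny trace might look like the following (the
operation names follow PyPy's trace IR; the descrs and variable numbers
are schematic):

```
label(i0, descr=TargetToken1)
i1 = int_add(i0, 1)
i2 = int_lt(i1, 100)
guard_true(i2, descr=<Guard1>) [i1]     # exit here if i2 is false
jump(i1, descr=TargetToken1)
```

If `guard_true` fails often enough, a bridge is traced and compiled from
it, and the guard is patched into a direct `Jcond` to that bridge.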
A trace ends with either a `return` or a `jump to label`. The target
label is either inside the same trace, or in some older one. For
historical reasons we call a "loop" a trace that is not a bridge. The
machine code that we generate is organized as a forest of trees; the
trunk of the tree is a "loop", and the branches are all bridges
(branching off the trunk or off another branch).
* every trunk or branch that ends in a `jump to label` can target a
label from a different tree, too.
* the whole process of assembling a loop or a branch is basically
single-threaded, so no synchronization issue there (including to patch
older generated instructions).
* the generated assembler has a "frame" in %rbp, which is actually
not on the stack at all, but is a GC object (called a "jitframe").
Spilling goes there.
* the guards are `Jcond` to a very small piece of generated code, which
is basically pushing a couple of constants on the stack and then
jumping to the general guard-recovery code. That code will save the
registers into the jitframe and then exit the whole generated
function. The caller of that generated function checks how it
finished: if it finished by hitting a guard, then the caller is
responsible for calling the "blackhole interpreter". This is the part
of the front-end that recovers from failing guards and finishes
running the frame (including, possibly, by jumping again into
generated assembler).
Details about the JITting process:
* front-end and optimization pass
* rewrite (includes GC-related transformations as well as simplifications)
* assembler generation
Front-end and optimization pass
-------------------------------
Not discussed here in detail. This produces loops and bridges using an
instruction set that is "high-level" in some sense: it contains
instructions like "new"/"new_array", and
"setfield"/"setarrayitem"/"setinteriorfield" which describe the action
of storing a value in a precise field of the structure or array. For
example, the "setfield" action might require implicitly a GC write
barrier. This is the high-level trace that we send to the following
step.
Rewrite
-------
A mostly but not completely CPU-independent phase: lowers some
instructions. For example, the variants of "new" are lowered to
"malloc" and a few "gc_store": it bumps the pointer of the GC and then
sets a few fields explicitly in the newly allocated structure. The
"setfield" is replaced with a "cond_gc_wb_call" (conditional call to the
write barrier) if needed, followed by a "gc_store".
The "gc_store" instruction can be encoded in a single MOV assembler
instruction, but is not as flexible as a MOV. The address is always
specified as "some GC pointer + an offset". We don't have the notion of
interior pointer for GC objects.
A different instruction, "gc_store_indexed", offers additional operands,
which can be mapped to a single MOV instruction using forms like
`[rax+8*rcx+24]`.
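A sketch of this lowering (operation names as above; the type id, the
offsets and the sizes are made up for the example):

```
# before rewrite:
p0 = new(descr=SomeSizeDescr)
setfield_gc(p0, i1, descr=FieldXDescr)

# after rewrite:
p0 = call_malloc_nursery(24)            # bump the nursery pointer
gc_store(p0, 0, ConstInt(tid), 8)       # write the type id word
gc_store(p0, 16, i1, 8)                 # the former setfield_gc
```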
Some other complex instructions pass through to the backend, which must
deal with them: for example, "card marking" in the GC. (Writing an
object pointer inside an array would require walking the whole array
later to find "young" references. Instead of that, we flip a bit for
every range of 128 entries. This is a common GC optimization.) Setting
the card bit of a GC object requires a sequence of assembler
instructions that depends too much on the target CPU to be expressed
explicitly here (moreover, it contains a few branches, which are hard to
express at this level).
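The card-marking logic itself can be sketched in Python (the flat
byte-array layout here is invented for the example; only the 128-entry
granularity comes from the text, and the real GC stores the card bits
differently):

```python
CARD_GRANULARITY = 128   # one card bit covers 128 array entries

def set_card_bit(cards, array_index):
    """Mark, in the bytearray `cards`, the card covering `array_index`."""
    card = array_index // CARD_GRANULARITY
    cards[card // 8] |= 1 << (card % 8)

cards = bytearray(4)       # enough for 4 * 8 * 128 = 4096 entries
set_card_bit(cards, 300)   # entry 300 -> card 2 -> byte 0, bit 2
assert cards[0] == 0b100
```

The backend must emit the equivalent shift/or sequence inline, including
the conditional branches around it.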
Assembly
--------
No fancy code generation technique: a greedy forward pass that tries
to avoid some pitfalls.
Handling instructions
~~~~~~~~~~~~~~~~~~~~~
* One by one (forward direction). Each instruction asks the register
allocator to ensure that some arguments are in registers (not in the
jitframe); asks for a register to put its result into; and asks for
additional scratch registers that will be freed at the end of the
instruction. There is a special case for boolean variables: they are
stored in the condition code flags instead of being materialized as a
0/1 value. (They are materialized later, except in the common case
where they are only used by the next `guard_false` or `guard_true` and
then forgotten.)
* Instruction arguments are loaded into a register on demand. This
makes the backend quite easy to write, but leads to some bad
decisions.
Linear scan register allocation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Although it's always a linear trace that we consider, we don't use
advanced techniques for register allocation: we do forward, on-demand
allocation as the backend produces the assembler. When it asks for a
register to put some value into, we give it any free register, without
consideration for what will be done with it later. We compute the
longevity of all variables, but only use it when choosing which register
to spill (we spill the variable with the longest longevity).
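A minimal sketch of the spill heuristic (function and structure names
are invented for the example; the real allocator is considerably more
involved):

```python
def choose_var_to_spill(in_registers, longevity):
    """Pick the variable whose last use lies farthest in the future."""
    return max(in_registers, key=lambda v: longevity[v])

# longevity: position of the last use of each variable in the trace
longevity = {'i0': 10, 'i1': 50, 'i2': 12}
assert choose_var_to_spill(['i0', 'i1', 'i2'], longevity) == 'i1'
```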
This works to some extent because it is well integrated with the earlier
optimization pass. Loops are unrolled once by the optimization pass to
allow more powerful optimizations---the optimization pass itself is the
place that benefits the most, but it also has benefits here in the
assembly pass. These are:
* The first peeling initializes the register bindings on first use.
* This means the trace loop proper is entered with registers already
allocated.
* The same applies to the registers allocated when exiting bridges.
[Try to better allocate registers to match the ABI (minor to no benefit
in the current state).]
More complex mappings
~~~~~~~~~~~~~~~~~~~~~
Some instructions generate more complex code. These are either or both of:
* complex instructions generating some local control flow, like
"cond_gc_wb_call" (for write barriers), "call_assembler" (a call
followed by a few checks).
* instructions that invoke custom assembler helpers, like the slow-path
of write barriers or the slow-path of allocations. These slow-paths
are typically generated too, so that we are not constrained by the
usual calling conventions.
GC pointers
~~~~~~~~~~~
Around most CALL instructions, we need to record a description of where
the GC pointers are (registers and stack frame). This is needed in case
the CALL invokes a garbage collection. The GC pointers can move; the
pointers in the registers and stack frame are updated by the GC. That's
a reason for why we don't have explicit interior pointers.
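A schematic encoding of such a description, as a simple register
bitmask (the function names are invented; the real gcmap also covers
jitframe slots):

```python
def encode_gcmap(gc_pointer_regs):
    """Bit i is set iff register number i holds a GC pointer at the call."""
    mask = 0
    for reg in gc_pointer_regs:
        mask |= 1 << reg
    return mask

def decode_gcmap(mask):
    """Recover the register numbers from the bitmask."""
    return [i for i in range(mask.bit_length()) if mask >> i & 1]

assert decode_gcmap(encode_gcmap([0, 3, 5])) == [0, 3, 5]
```

The GC walks such maps during collection and rewrites the corresponding
registers and frame slots in place.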
GC pointers can appear as constants in the trace. We are busy changing
that to use a constant table and `MOV REG, (%RIP+offset)`. The
"constant" in the table is actually updated by the GC if the object
moves.
Vectorization
~~~~~~~~~~~~~
An optimization developed to use SIMD instructions for trace loops. The
primary idea was to use it as an optimization for micro-numpy. It makes
several passes over the already-optimized trace.
Briefly: it builds dependencies for an unrolled trace loop, gathers
pairs/packs of operations that could be executed in parallel, and
finally schedules the operations.
What did it add to the code base:
* Dependencies can be constructed
* Code motion of guards to relax dependencies
* Scheduler to reorder trace
* Array bound check removal (especially for unrolled traces)
What can it do:
* Transform vector loops (element wise operations)
* Accumulation (`reduce([...], operator, 0)`); requires the operation to
be associative and commutative
* SSE 4.1 as "vector backend"
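The pair-gathering step can be sketched as follows (operations are
schematic `(array, offset)` tuples; the real packs are built from the
dependency graph, not by this naive scan):

```python
ELEM_SIZE = 8   # assume 8-byte elements

def find_pairs(loads):
    """Pair raw loads from the same array at adjacent offsets."""
    pairs = []
    for a in loads:
        for b in loads:
            if a[0] == b[0] and b[1] - a[1] == ELEM_SIZE:
                pairs.append((a, b))
    return pairs

# two iterations of an unrolled loop reading p0[0] and p0[8]
loads = [('p0', 0), ('p0', 8)]
assert find_pairs(loads) == [(('p0', 0), ('p0', 8))]
```

Each such pair can then be scheduled as one 128-bit SSE load instead of
two scalar ones.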
We do not
~~~~~~~~~
* Keep tracing data around to reoptimize the trace tree. (Once a trace
is compiled, minimal data is kept.) This is one reason (there are
others in the front-end) for the following result: JIT-compiling a
small loop with two common paths ends up as one "loop" and one bridge
assembled, and the bridge-following path is slightly less efficient.
This is notably because this bridge is assembled with two constraints:
the input registers are fixed (from the guard), and the output
registers are fixed (from the jump target); usually these two sets of
fixed registers are different, and copying around is needed.
* We don't join trace tails: we only assemble *trees*.
* We don't do any reordering (neither of trace instructions nor of
individual assembler instructions)
* We don't do any cross-instruction optimization that makes sense only
for the backend and can't easily be expressed at a higher level. I'm
sure there are tons of examples of that, but e.g. loading a large
constant in a register that will survive for several instructions;
moving out of loops *parts* of some instruction like the address
calculation; etc. etc.
* Other optimization opportunities I can think about: look at the
function prologue/epilogue; look at the overhead (small but not zero)
at the start of a bridge. Also check if the way guards are
implemented makes sense. Also, we generate large-ish sequences of
assembler instructions with tons of `Jcond` that are almost never
followed; any optimization opportunity there? (They all go forward,
if it changes anything.) In theory we could also replace some of
these with a signal handler on segfault (e.g. `guard_nonnull_class`).
A GCC or LLVM backend?
~~~~~~~~~~~~~~~~~~~~~~
At least for comparison we'd like a JIT backend that emits its code
using GCC or LLVM (irrespective of the time it would take). But it's
hard to map reasonably well the guards to the C language or to LLVM IR.
The problems are: (1) we have many guards, we would like to avoid having
many paths that each do a full
saving-all-local-variables-that-are-still-alive; (2) it's hard to patch
a guard when a bridge is compiled from it; (3) an instruction like a
CALL needs to expose the local variables that are GC pointers, and
CALL_MAY_FORCE needs to expose *all* local variables for optional
off-line reconstruction of the interpreter state.