=========================
PyPy's assembler backends
=========================
Draft notes about the organization of assembler backends in the PyPy JIT, in 2016
=================================================================================
Input: a linear sequence of instructions, called a "trace".
A trace is a sequence of instructions in SSA form. Most instructions
correspond to one or a few CPU-level instructions. There are a few
meta-instructions like `label` and debugging stuff. All branching is
done with guards, which are instructions that check that a condition is
true and exit the trace if not. A failing guard can have a new trace
added to it later, called a "bridge". A patched guard becomes a direct
`Jcond` instruction going to the bridge, with no indirection, no
register spilling, etc.
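As an illustration, a tiny trace might look like the following (the
operation names follow PyPy's trace IR; the descrs and variable numbers
are schematic):

```
label(i0, descr=TargetToken1)
i1 = int_add(i0, 1)
i2 = int_lt(i1, 100)
guard_true(i2, descr=<Guard1>) [i1]     # exit here if i2 is false
jump(i1, descr=TargetToken1)
```

If `guard_true` fails often enough, a bridge is traced and compiled from
it, and the guard is patched into a direct `Jcond` to that bridge.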
A trace ends with either a `return` or a `jump to label`. The target
label is either inside the same trace, or in some older one. For
historical reasons we call a "loop" a trace that is not a bridge. The
machine code that we generate is organized as a forest of trees; the
trunk of the tree is a "loop", and the branches are all bridges
(branching off the trunk or off another branch).
* every trunk or branch that ends in a `jump to label` can target a
label from a different tree, too.
* the whole process of assembling a loop or a branch is basically
single-threaded, so no synchronization issue there (including to patch
older generated instructions).
* the generated assembler has a "frame" in %rbp, which is actually
not on the stack at all, but is a GC object (called a "jitframe").
Spilling goes there.
* the guards are `Jcond` to a very small piece of generated code, which
is basically pushing a couple of constants on the stack and then
jumping to the general guard-recovery code. That code will save the
registers into the jitframe and then exit the whole generated
function. The caller of that generated function checks how it
finished: if it finished by hitting a guard, then the caller is
responsible for calling the "blackhole interpreter". This is the part
of the front-end that recovers from failing guards and finishes
running the frame (including, possibly, by jumping again into
generated assembler).
Details about the JITting process:
* front-end and optimization pass
* rewrite (includes GC-related transformations as well as simplifications)
* assembler generation
Front-end and optimization pass
-------------------------------
Not discussed here in detail. This produces loops and bridges using an
instruction set that is "high-level" in some sense: it contains
instructions like "new"/"new_array", and
"setfield"/"setarrayitem"/"setinteriorfield" which describe the action
of storing a value in a precise field of the structure or array. For
example, the "setfield" action might require implicitly a GC write
barrier. This is the high-level trace that we send to the following
step.
Rewrite
-------
A mostly but not completely CPU-independent phase: lowers some
instructions. For example, the variants of "new" are lowered to
"malloc" and a few "gc_store": it bumps the pointer of the GC and then
sets a few fields explicitly in the newly allocated structure. The
"setfield" is replaced with a "cond_gc_wb_call" (conditional call to the
write barrier) if needed, followed by a "gc_store".
The "gc_store" instruction can be encoded in a single MOV assembler
instruction, but is not as flexible as a MOV. The address is always
specified as "some GC pointer + an offset". We don't have the notion of
interior pointer for GC objects.
A different instruction, "gc_store_indexed", offers additional operands,
which can be mapped to a single MOV instruction using forms like
`[rax+8*rcx+24]`.
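A sketch of this lowering (operation names as above; the type id, the
offsets and the sizes are made up for the example):

```
# before rewrite:
p0 = new(descr=SomeSizeDescr)
setfield_gc(p0, i1, descr=FieldXDescr)

# after rewrite:
p0 = call_malloc_nursery(24)            # bump the nursery pointer
gc_store(p0, 0, ConstInt(tid), 8)       # write the type id word
gc_store(p0, 16, i1, 8)                 # the former setfield_gc
```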
Some other complex instructions pass through to the backend, which must
deal with them: for example, "card marking" in the GC. (Writing an
object pointer inside an array would require walking the whole array
later to find "young" references. Instead of that, we flip a bit for
every range of 128 entries. This is a common GC optimization.) Setting
the card bit of a GC object requires a sequence of assembler
instructions that depends too much on the target CPU to be expressed
explicitly here (moreover, it contains a few branches, which are hard to
express at this level).
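The card-marking logic itself can be sketched in Python (the flat
byte-array layout here is invented for the example; only the 128-entry
granularity comes from the text, and the real GC stores the card bits
differently):

```python
CARD_GRANULARITY = 128   # one card bit covers 128 array entries

def set_card_bit(cards, array_index):
    """Mark, in the bytearray `cards`, the card covering `array_index`."""
    card = array_index // CARD_GRANULARITY
    cards[card // 8] |= 1 << (card % 8)

cards = bytearray(4)       # enough for 4 * 8 * 128 = 4096 entries
set_card_bit(cards, 300)   # entry 300 -> card 2 -> byte 0, bit 2
assert cards[0] == 0b100
```

The backend must emit the equivalent shift/or sequence inline, including
the conditional branches around it.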
Assembly
--------
No fancy code generation technique: a greedy forward pass that tries
to avoid some pitfalls.
Handling instructions
~~~~~~~~~~~~~~~~~~~~~
* One by one (forward direction). Each instruction asks the register
allocator to ensure that some arguments are in registers (not in the
jitframe); asks for a register to put its result into; and asks for
additional scratch registers that will be freed at the end of the
instruction. There is a special case for boolean variables: they are
stored in the condition code flags instead of being materialized as a
0/1 value. (They are materialized later, except in the common case
where they are only used by the next `guard_false` or `guard_true` and
then forgotten.)
* Instruction arguments are loaded into a register on demand. This
makes the backend quite easy to write, but leads to some bad
decisions.
Linear scan register allocation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Although it's always a linear trace that we consider, we don't use
advanced techniques for register allocation: we do forward, on-demand
allocation as the backend produces the assembler. When it asks for a
register to put some value into, we give it any free register, without
consideration for what will be done with it later. We compute the
longevity of all variables, but only use it when choosing which register
to spill (we spill the variable with the longest longevity).
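A minimal sketch of the spill heuristic (function and structure names
are invented for the example; the real allocator is considerably more
involved):

```python
def choose_var_to_spill(in_registers, longevity):
    """Pick the variable whose last use lies farthest in the future."""
    return max(in_registers, key=lambda v: longevity[v])

# longevity: position of the last use of each variable in the trace
longevity = {'i0': 10, 'i1': 50, 'i2': 12}
assert choose_var_to_spill(['i0', 'i1', 'i2'], longevity) == 'i1'
```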
This works to some extent because it is well integrated with the earlier
optimization pass. Loops are unrolled once by the optimization pass to
allow more powerful optimizations---the optimization pass itself is the
place that benefits the most, but it also has benefits here in the
assembly pass. These are:
* The first peeling initializes the register bindings on first use.
* This means the trace loop proper is entered with registers already
allocated.
* The same applies to the registers allocated when exiting bridges.
[Try to better allocate registers to match the ABI (minor to no benefit
in the current state).]
More complex mappings
~~~~~~~~~~~~~~~~~~~~~
Some instructions generate more complex code. These are either or both of:
* complex instructions generating some local control flow, like
"cond_gc_wb_call" (for write barriers), "call_assembler" (a call
followed by a few checks).
* instructions that invoke custom assembler helpers, like the slow-path
of write barriers or the slow-path of allocations. These slow-paths
are typically generated too, so that we are not constrained by the
usual calling conventions.
GC pointers
~~~~~~~~~~~
Around most CALL instructions, we need to record a description of where
the GC pointers are (registers and stack frame). This is needed in case
the CALL invokes a garbage collection. The GC pointers can move; the
pointers in the registers and stack frame are updated by the GC. That's
a reason for why we don't have explicit interior pointers.
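A schematic encoding of such a description, as a simple register
bitmask (the function names are invented; the real gcmap also covers
jitframe slots):

```python
def encode_gcmap(gc_pointer_regs):
    """Bit i is set iff register number i holds a GC pointer at the call."""
    mask = 0
    for reg in gc_pointer_regs:
        mask |= 1 << reg
    return mask

def decode_gcmap(mask):
    """Recover the register numbers from the bitmask."""
    return [i for i in range(mask.bit_length()) if mask >> i & 1]

assert decode_gcmap(encode_gcmap([0, 3, 5])) == [0, 3, 5]
```

The GC walks such maps during collection and rewrites the corresponding
registers and frame slots in place.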
GC pointers can appear as constants in the trace. We are busy changing
that to use a constant table and `MOV REG, (%RIP+offset)`. The
"constant" in the table is actually updated by the GC if the object
moves.
Vectorization
~~~~~~~~~~~~~
An optimization developed to use SIMD instructions for trace loops. The
primary idea was to use it as an optimization for micro-numpy. It makes
several passes over the already-optimized trace.
Briefly: it builds dependencies for an unrolled trace loop, gathers
pairs/packs of operations that could be executed in parallel, and
finally schedules the operations.
What did it add to the code base:
* Dependencies can be constructed
* Code motion of guards to relax dependencies
* Scheduler to reorder trace
* Array bound check removal (especially for unrolled traces)
What can it do:
* Transform vector loops (element wise operations)
* Accumulation (`reduce([...], operator, 0)`); requires the operation to
be associative and commutative
* SSE 4.1 as "vector backend"
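The pair-gathering step can be sketched as follows (operations are
schematic `(array, offset)` tuples; the real packs are built from the
dependency graph, not by this naive scan):

```python
ELEM_SIZE = 8   # assume 8-byte elements

def find_pairs(loads):
    """Pair raw loads from the same array at adjacent offsets."""
    pairs = []
    for a in loads:
        for b in loads:
            if a[0] == b[0] and b[1] - a[1] == ELEM_SIZE:
                pairs.append((a, b))
    return pairs

# two iterations of an unrolled loop reading p0[0] and p0[8]
loads = [('p0', 0), ('p0', 8)]
assert find_pairs(loads) == [(('p0', 0), ('p0', 8))]
```

Each such pair can then be scheduled as one 128-bit SSE load instead of
two scalar ones.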
We do not
~~~~~~~~~
* Keep tracing data around to reoptimize the trace tree. (Once a trace
is compiled, minimal data is kept.) This is one reason (there are
others in the front-end) for the following result: JIT-compiling a
small loop with two common paths ends up as one "loop" and one bridge
assembled, and the bridge-following path is slightly less efficient.
This is notably because this bridge is assembled with two constraints:
the input registers are fixed (from the guard), and the output
registers are fixed (from the jump target); usually these two sets of
fixed registers are different, and copying around is needed.
* We don't join trace tails: we only assemble *trees*.
* We don't do any reordering (neither of trace instructions nor of
individual assembler instructions)
* We don't do any cross-instruction optimization that makes sense only
for the backend and can't easily be expressed at a higher level. I'm
sure there are tons of examples of that, but e.g. loading a large
constant in a register that will survive for several instructions;
moving out of loops *parts* of some instruction like the address
calculation; etc. etc.
* Other optimization opportunities I can think about: look at the
function prologue/epilogue; look at the overhead (small but not zero)
at the start of a bridge. Also check if the way guards are
implemented makes sense. Also, we generate large-ish sequences of
assembler instructions with tons of `Jcond` that are almost never
followed; any optimization opportunity there? (They all go forward,
if it changes anything.) In theory we could also replace some of
these with a signal handler on segfault (e.g. `guard_nonnull_class`).
A GCC or LLVM backend?
~~~~~~~~~~~~~~~~~~~~~~
At least for comparison we'd like a JIT backend that emits its code
using GCC or LLVM (irrespective of the time it would take). But it's
hard to map reasonably well the guards to the C language or to LLVM IR.
The problems are: (1) we have many guards, we would like to avoid having
many paths that each do a full
saving-all-local-variables-that-are-still-alive; (2) it's hard to patch
a guard when a bridge is compiled from it; (3) an instruction like a
CALL needs to expose the local variables that are GC pointers, and
CALL_MAY_FORCE needs to expose *all* local variables for optional
off-line reconstruction of the interpreter state.