<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>GXemul: Dynamic Translation Internals</title>
<meta name="robots" content="noarchive,nofollow,noindex">
</head>
<body style="font-family : sans-serif;">
<!-- 10 lines header. -->
<h1>GXemul: Dynamic Translation Internals</h1>
<p>
<a href="./">Back to the index.</a>
<!--
Copyright (C) 2005-2021 Anders Gavare. All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
3. The name of the author may not be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
SUCH DAMAGE.
-->
<p><br>
This page describes the internal workings of the GXemul dynamic translation
system (usually referred to as just "dyntrans").
<p>
<ul>
<li><a href="#staticvsdynamic">Static vs. dynamic</a>
<li><a href="#ir">Executable Intermediate Representation</a>
<li><a href="#loadsandstores">Emulation of Loads and Stores</a>
<li><a href="#performance">Performance</a>
<li><a href="#instrcomb">Instruction Combinations</a>
<li><a href="#native">Native Code Generation Back-ends</a>
</ul>
<p><br>
<a name="staticvsdynamic"></a>
<h3>Static vs. dynamic:</h3>
<p>In order to support guest operating systems, it is important to take into
account the fact that code pages in memory can be overwritten with new code,
e.g. when loading new programs inside the guest OS.
A static, one-time translation of the code will thus not work; it is
necessary to translate the code <i>dynamically</i> <sup>*</sup>.
Self-modifying code and Just-in-Time compilers running inside
the emulator are other things that would not work with a static
translator. GXemul is a dynamic translator. However, it does not
necessarily translate into <a href="#native">native code</a>, like some other emulators.
<p><small>* Strictly speaking, translation is not necessary at all: instead of
translating the code ahead of execution, the instructions could be interpreted
one at a time. But that is slow.</small>
<p><br>
<a name="ir"></a>
<h3>Executable Intermediate Representation:</h3>
<p>Dynamic translators usually translate from the emulated architecture
(e.g. MIPS) into a kind of <i>intermediate representation</i> (IR), and then
to native code (e.g. AMD64 or x86 code) which can be executed by the host.
Since one of the main goals for GXemul is to keep everything as portable as
possible, the IR is something which can be executed regardless of whether
the final step (translation from IR to native code) has been implemented
or not.
<p>The IR in GXemul consists of <i>arrays of pointers to functions</i>, and a few
arguments which are passed along to those functions. This is usually referred
to as <a href="http://en.wikipedia.org/wiki/Threaded_code">threaded code</a>.
The functions are implemented either as hand-written C code or as
generated C code.
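<p>As a rough, self-contained sketch of the idea (simplified for illustration;
these are not the exact GXemul type or field names):
<pre>
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

struct cpu;

<font color="#0000ff">/*  One "instruction call": a function pointer plus pre-decoded arguments:  */</font>
struct instr_call {
        void    (*f)(struct cpu *, struct instr_call *);
        size_t  arg[3];
};

struct cpu {
        uint64_t           pc;
        struct instr_call *next_ic;      <font color="#0000ff">/*  next slot to execute  */</font>
        struct instr_call *cur_ic_page;  <font color="#0000ff">/*  current page of slots  */</font>
};

<font color="#0000ff">/*  Executing translated code is simply a series of indirect calls:  */</font>
static void run_slots(struct cpu *cpu, int n)
{
        while (n-- > 0) {
                struct instr_call *ic = cpu->next_ic ++;
                ic->f(cpu, ic);
        }
}
</pre>
<p>Decoding a page of emulated machine code thus means filling in its slots once;
executing it afterwards is nothing more than repeated indirect function calls.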
<p>Here is a simplified diagram of how these arrays work.
<p><center><img src="simplified_dyntrans.png"></center>
<p>There is one instruction call slot for every possible program counter
location. For example, in the traditional MIPS case, instruction words are 32
bits long, and pages are 4 KB large, resulting in 1024 instructions per page.
In GXemul, this is represented using 1025 or 1026 <i>instruction call slots</i>.
<p>After the last of the regular instruction call slots, there are one or two
additional slots, which are special "end of page" slots. These do not count as
executed instructions. Instead, they jump to the first instruction
on the next virtual page (which may cause exceptions, etc.).
<p>(For architectures with 32-bit long instructions and 4 KB page size
<b>without</b> a delay slot, there are 1025 entries, and for those <b>with</b>
a delay slot there are 1026 entries.)
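<p>In terms of the simplified types sketched above, an "end of page" slot could
be implemented roughly as follows. (This is an illustration, not the actual
GXemul source; <tt>lookup_or_translate_page()</tt> is a made-up helper standing
for the code that finds, or translates, the next page's instruction calls.)
<pre>
#define IC_ENTRIES_PER_PAGE     1024    <font color="#0000ff">/*  4 KB pages, 4-byte instructions  */</font>
#define INSTR_ALIGNMENT_SHIFT   2

void lookup_or_translate_page(struct cpu *cpu);  <font color="#0000ff">/*  hypothetical helper  */</font>

static void end_of_page(struct cpu *cpu, struct instr_call *ic)
{
        <font color="#0000ff">/*  Advance the PC to the start of the next virtual page:  */</font>
        cpu->pc &= ~((IC_ENTRIES_PER_PAGE &lt;&lt; INSTR_ALIGNMENT_SHIFT) - 1);
        cpu->pc += IC_ENTRIES_PER_PAGE &lt;&lt; INSTR_ALIGNMENT_SHIFT;

        <font color="#0000ff">/*  Point cur_ic_page/next_ic at the new page's slots; this may
            involve translating that page, and may cause exceptions:  */</font>
        lookup_or_translate_page(cpu);

        <font color="#0000ff">/*  Note: the end-of-page slot itself does not count as an
            executed instruction.  */</font>
}
</pre>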
<p>The core dyntrans loop, which executes the instructions in the
instruction call slots, looks something like this:
<pre>
<font color="#0000ff">/* The normal instruction execution core: */</font>
#define I ic = cpu->cd.DYNTRANS_ARCH.next_ic ++; ic->f(cpu, ic);
...
n_instrs = 0;

for (;;) {
        struct DYNTRANS_IC *ic;

        I; I; I; I; I; I; I; I; I; I;
        I; I; I; I; I; I; I; I; I; I;
        I; I; I; I; I; I; I; I; I; I;
        I; I; I; I; I; I; I; I; I; I;
        I; I; I; I; I; I; I; I; I; I;
        I; I; I; I; I; I; I; I; I; I;
        I; I; I; I; I; I; I; I; I; I;
        I; I; I; I; I; I; I; I; I; I;
        I; I; I; I; I; I; I; I; I; I;
        I; I; I; I; I; I; I; I; I; I;
        I; I; I; I; I; I; I; I; I; I;
        I; I; I; I; I; I; I; I; I; I;

        cpu->n_translated_instrs += 120;

        if (cpu->n_translated_instrs >= N_SAFE_DYNTRANS_LIMIT)
                break;
}
</pre>
<p>Why the number 120? The answer is that this number is evenly divisible by
1, 2, 3, 4, 5, 6, 8, 10, and 12 (among other numbers). Emulated instruction
loops of those lengths will thus result in the same host branch being taken from
the same place multiple times, which fits well with the branch prediction
mechanisms of some modern CPUs.
<p>There are two important pointers: <tt>next_ic</tt> points to the next instruction
call, and <tt>cur_ic_page</tt> points to the whole page of such instruction calls.
When the main loop exits, we can check what <tt>next_ic</tt> points to. If it points
to one of the slots inside the current page (<tt>cur_ic_page</tt>), then the program
counter's lowest bits are adjusted to be in sync with the <tt>next_ic</tt>
pointer. It can also happen that <tt>next_ic</tt> points to something else:
a special case is when one wants to break out of the dyntrans loop, e.g.
because an error occurred. In that case, <tt>next_ic</tt> does not point to any of
the instruction slots on that page, and the program counter's lowest bits are
not synced from the <tt>next_ic</tt> value.
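<p>Continuing the simplified sketch from earlier, that synchronization could
look roughly like this (illustrative only):
<pre>
static void sync_pc_from_next_ic(struct cpu *cpu)
{
        ptrdiff_t slot = cpu->next_ic - cpu->cur_ic_page;

        if (slot >= 0 && slot &lt; IC_ENTRIES_PER_PAGE) {
                <font color="#0000ff">/*  next_ic is within the current page: recompute the
                    PC's lowest bits from the slot index:  */</font>
                cpu->pc &= ~((IC_ENTRIES_PER_PAGE &lt;&lt; INSTR_ALIGNMENT_SHIFT) - 1);
                cpu->pc |= (uint64_t)slot &lt;&lt; INSTR_ALIGNMENT_SHIFT;
        }

        <font color="#0000ff">/*  Otherwise, next_ic points outside the page (e.g. after breaking
            out due to an error), and the PC is left untouched here.  */</font>
}
</pre>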
<p>The complexity of individual instructions varies. A simple example of
what an instruction can look like is the MIPS <tt>addu</tt> instruction:
<pre>
<font color="#0000ff">/*
* 3-register arithmetic instructions:
*
* arg[0] = ptr to rs
* arg[1] = ptr to rt
* arg[2] = ptr to rd
*/</font>
X(addu)
<font color="#404040">{</font>
reg(ic->arg[2]) = (int32_t)(reg(ic->arg[0]) + reg(ic->arg[1]));
<font color="#404040">}</font>
</pre>
<p>It stores the result of a 32-bit addition of the register at <tt>arg[0]</tt>
with the register at <tt>arg[1]</tt> into the register at <tt>arg[2]</tt>.
If the emulated CPU is a 64-bit CPU, then this will store a correctly
sign-extended value. If it is a 32-bit CPU, then only the lowest 32 bits will
be stored, and the high part ignored. <tt>X(addu)</tt> is expanded to
<tt>mips_instr_addu</tt> in the 64-bit case, and <tt>mips32_instr_addu</tt>
in the 32-bit case. Both are compiled into the GXemul executable; no code
is created during run-time.
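<p>The name expansion itself is done by the preprocessor, by compiling the same
instruction definitions once per mode with a different expansion of the
<tt>X()</tt> macro. A hedged sketch of how such a prefix macro can be set up
(the exact macro and symbol names used in the GXemul source may differ):
<pre>
#ifdef MODE32
#define X(n)   void mips32_instr_ ## n (struct cpu *cpu, struct mips_instr_call *ic)
#else
#define X(n)   void mips_instr_ ## n (struct cpu *cpu, struct mips_instr_call *ic)
#endif
</pre>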
<p>Another reasonably simple example is the M88K <tt>bsr</tt> (branch to subroutine)
instruction. This is a branch which also sets the return address register to the
instruction following the branch. Here is the implementation for "samepage" offsets,
i.e. branch offsets which are inside the same dyntrans page:
<pre>
X(bsr_samepage)
<font color="#404040">{</font>
        cpu->cd.m88k.r[M88K_RETURN_REG] = (cpu->pc &
            ~((M88K_IC_ENTRIES_PER_PAGE-1) &lt;&lt; M88K_INSTR_ALIGNMENT_SHIFT))
            + ic->arg[2];
        cpu->cd.m88k.next_ic = (struct m88k_instr_call *) ic->arg[0];
<font color="#404040">}</font>
</pre>
<p>
Don't be scared by the bitwise operators.
Here, the M88K <tt>r1</tt> register (the return address register) is set to the
return address, which is calculated by taking the current program counter,
clearing its lowest bits, and adding <tt>arg[2]</tt>, which is the offset within
the page of the instruction following the branch.
Then the next instruction call to execute is set to <tt>arg[0]</tt>, which
needs to be on the same page (since <tt>cur_ic_page</tt> is not updated).
<p>
The compiled code may look like this (example from an AMD64 host):
<pre>000000000036dac0 &lt;m88k_instr_bsr_samepage&gt;:
  36dac0:       b8 03 f0 ff ff          mov    $0xfffff003,%eax
  36dac5:       23 87 a8 00 00 00       and    0xa8(%rdi),%eax
  36dacb:       03 46 18                add    0x18(%rsi),%eax
  36dace:       89 87 e4 00 00 00       mov    %eax,0xe4(%rdi)
  36dad4:       48 8b 46 08             mov    0x8(%rsi),%rax
  36dad8:       48 89 87 e0 03 00 00    mov    %rax,0x3e0(%rdi)
  36dadf:       c3                      retq
</pre>
<p><br>
<a name="loadsandstores"></a>
<h3>Emulation of Loads and Stores:</h3>
<p>The emulation of memory access instructions differs slightly between
the various emulated architectures, but most of the principles in this
section apply to all of them.
When emulating a MIPS processor, the emulation of loads and stores consists
of a shorter "happy case" code path, and a fallback to a more general case if
the preconditions for the happy case are not met.
<p>An overview of the short path, with the approximate host-side overhead of each step:
<p><table border="1">
<tr>
<td><b>Step:</b></td>
<td><b>Overhead:</b></td>
</tr>
<tr>
<td><b>1.</b> Calculate the final virtual address.</td>
<td>3 host memory accesses: read the Base and Offset arguments.</td>
</tr>
<tr>
<td><b>2.</b> Preconditions check: a) find a corresponding host RAM
page, if there is one, and b) check address alignment.</td>
<td>1 host memory access for 32-bit emulation, 3 for 64-bit emulation.</td>
</tr>
<tr>
<td><b>3.</b> Perform native load or store.</td>
<td>1 host memory access (or multiple if doing byte-order-swap).</td>
</tr>
<tr>
<td><b>4.</b> "Write-back" to emulated CPU's destination register (for loads).</td>
<td>2 host memory accesses.</td>
</tr>
</table>
<p><b>Step 1</b> calculates the actual virtual address (usually a base
register plus an immediate offset).
<p><b>Step 2</b> looks up whether there is a corresponding RAM page in the host.
If there is, we can potentially load or store directly from that page in the
host's RAM.
<p>However, before going ahead with the load or store, we also need to check the
alignment of the address. Different host architectures behave differently here:
on (some?) ARM hosts, the lower bits of an unaligned address may simply be
ignored, so using the address as-is would access the wrong location; on an
x86-based host, unaligned loads and stores are silently accepted, even though
they should cause an exception on the hardware being emulated (in this case
MIPS). Checking the lowest bits of the final address is thus important for
correct emulation, so that exceptions can be generated, etc.
<p>For emulation of 32-bit address spaces, the lookup from emulated virtual
address to host page is a single fetch from an array. For emulation of 64-bit
address spaces, it is a three-level lookup.
In either case, there are separate lookups for loads and stores. A page which
is write-protected has just a mapping for load, no mapping for store. A page
which is readable and writable has both a mapping for load and one for store.
<p>If the alignment was ok, and there was a direct mapping to host page,
we can go ahead (<b>step 3</b>) and perform a native load or store, possibly
followed by byte order swap (which may be implemented by using multiple
memory accesses) if the emulated CPU is of different endianness than the host.
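<p>Putting the steps together, a hedged sketch of the 32-bit happy case for a
load-word instruction might look as follows. This is illustrative pseudo-C in
the style of the real code, not the actual contents of
<tt>src/cpus/cpu_mips_instr_loadstore.cc</tt>; <tt>general_load_word()</tt>,
<tt>bswap32()</tt> and <tt>SWAP_BYTE_ORDER</tt> are made-up names.
<pre>
X(load_word)
<font color="#404040">{</font>
        <font color="#0000ff">/*  Step 1: final virtual address = base register + immediate offset:  */</font>
        uint32_t addr = reg(ic->arg[0]) + (int32_t)ic->arg[2];

        <font color="#0000ff">/*  Step 2a: look up a host page for loads (4 KB pages, so the index
            is the top 20 address bits); NULL means no direct mapping:  */</font>
        unsigned char *page = cpu->cd.mips.host_load[addr >> 12];

        <font color="#0000ff">/*  Step 2b: the address must be word-aligned:  */</font>
        if (page == NULL || (addr & 3) != 0) {
                <font color="#0000ff">/*  Fallback: the general routine handles TLB lookups,
                    exceptions, and memory-mapped devices:  */</font>
                general_load_word(cpu, ic, addr);
                return;
        }

        <font color="#0000ff">/*  Step 3: native load, plus a byte order swap if the host and the
            emulated CPU differ in endianness:  */</font>
        uint32_t word;
        memcpy(&word, page + (addr & 0xfff), sizeof(word));
#ifdef SWAP_BYTE_ORDER
        word = bswap32(word);
#endif

        <font color="#0000ff">/*  Step 4: write-back, sign-extended on 64-bit emulated CPUs:  */</font>
        reg(ic->arg[1]) = (int32_t)word;
<font color="#404040">}</font>
</pre>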
<p>(This mechanism is currently hardcoded for a 4 KB base page size. The 1 KB
page size of the VR41xx CPUs, for example, does not work, and neither would e.g.
the VAX's 512-byte pages. As long as the guest OS does not use page sizes
smaller than 4 KB, it should work even when emulating such CPUs.)
<p>If any of the checks in <b>step 2</b> fails, the code bails out and calls a
more general memory access routine, which looks up addresses via the emulated
MIPS TLB and, depending on the outcome, may add host pages to the
virtual-to-host arrays, cause exceptions on the emulated CPU, or call out
to memory-mapped device routines.
<p>This implementation is obviously not intended to result in any world speed
records, but it strikes a reasonable balance between emulation correctness and
host portability.
<p>The actual code in <tt>src/cpus/cpu_mips_instr_loadstore.cc</tt> looks
rather horrible, because it was originally written in C with the intention of
mimicking C++ templates. If it were rewritten from scratch using C++ templates,
it would likely be more readable.
<p>When the TLB in the emulated CPU is instructed to change a mapping,
for example if virtual address 0x1000 pointed to physical address 0x1234000
but is changed to point to 0x2345000, then the virtual-to-physical-to-host
mappings in the emulator must be updated to reflect this. This is known as
invalidating the translation lookup tables.
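<p>For the simple 32-bit case, such an invalidation essentially amounts to the
following (a sketch with illustrative names, not the actual code):
<pre>
static void invalidate_addr_translation(struct cpu *cpu, uint32_t vaddr)
{
        uint32_t index = vaddr >> 12;   <font color="#0000ff">/*  4 KB pages  */</font>

        cpu->cd.mips.host_load[index]  = NULL;
        cpu->cd.mips.host_store[index] = NULL;

        <font color="#0000ff">/*  The next access to this virtual page falls back to the general
            routine, which re-validates the mapping via the emulated TLB.  */</font>
}
</pre>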
<p>Also, when performing a store to a page for which there are code
translations (the IR described above), all such translated instruction slots
are reset.
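<p>Conceptually, resetting those slots means pointing them back at a handler
which re-translates the emulated instruction the next time the slot is reached,
roughly like this (again a sketch; <tt>to_be_translated</tt> here is a made-up
name for that handler):
<pre>
void to_be_translated(struct cpu *cpu, struct instr_call *ic);

static void invalidate_code_translation(struct instr_call *ic_page)
{
        int i;

        for (i = 0; i &lt; IC_ENTRIES_PER_PAGE; i++)
                ic_page[i].f = to_be_translated;
}
</pre>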
<p><br>
<a name="performance"></a>
<h3>Performance:</h3>
<p>The performance of using this kind of executable IR is obviously lower
than what can be achieved by emulators using native code generation, but
can be significantly higher than using a naive fetch-decode-execute
interpretation loop. Using threaded code as the IR is
an interesting compromise.
<p>The overhead per emulated instruction is usually around 10 host
instructions or fewer. This depends heavily on the host architecture and on
the compiler and compiler switches used.
On top of this comes, of course, the C code used to implement each specific
emulated instruction.
<p><br>
<a name="instrcomb"></a>
<h3>Instruction Combinations:</h3>
<p>Short, common instruction sequences can sometimes be replaced by a
"combined" instruction. An example could be a compare instruction followed
by a conditional branch instruction. The advantages of instruction
combinations are that
<ul>
<li>the amortized overhead per instruction is slightly reduced, and
<p>
<li>the host's compiler can do a good job of optimizing the common
instruction sequence.
</ul>
<p>The special cases where instruction combinations give the most gain
are in the cores of string/memory manipulation functions such as
<tt>memset()</tt> or <tt>strlen()</tt>. The core loop can then (at least
to some extent) be replaced by a native call to the equivalent function.
<p>The implementations of combined instructions still keep track of the
number of executed instructions, etc. When single-stepping, these
translations are invalidated and replaced by normal instruction calls
(one per emulated instruction).
<p>An example of what such an instruction combination can look like is a
3-instruction <tt>memset</tt> loop on MIPS. The pattern to detect is a
sequence of <tt>addiu</tt>, <tt>bne</tt>, and <tt>sw</tt> with certain
registers. The attempt to identify the sequence is only made when certain
<tt>sw</tt> instructions are translated; for other instructions, no pattern
matching is attempted. (Note: The <tt>sw</tt> instruction is in the
<a href="http://en.wikipedia.org/wiki/Delay_slot">delay slot</a> of the
<tt>bne</tt> instruction.)
<p>If the sequence is detected, the instruction call slot with
<tt>addiu</tt> is changed to point to the instruction combination.
<pre>
<u>Original sequence:</u>                      <u>Changed to:</u>

slot 4:  ...                            slot 4:  ...
slot 5:  addiu r5, r3, 24               slot 5:  addiu r5, r3, 24
<font color="#8000ff"><i>slot 6:</i></font>  <b>addiu</b> r9, r9, 4                <font color="#8000ff"><i>slot 6:</i></font>  <b>sw_loop</b> r9, r9, 4
<font color="#8000ff"><i>slot 7:</i></font>  bne r12, r9, slot 6            <font color="#8000ff"><i>slot 7:</i></font>  bne r12, r9, slot 6
<font color="#8000ff"><i>slot 8:</i></font>  sw r6, -4(r9)                  <font color="#8000ff"><i>slot 8:</i></font>  sw r6, -4(r9)
slot 9:  subu r2, r3, 32                slot 9:  subu r2, r3, 32    <i>&lt;--- (*)</i>
slot 10: ...                            slot 10: ...
</pre>
<p>
<i>(*) If the sw_loop code is executed, then execution continues here (<tt>&ic[3]</tt>).</i>
<p>The <tt>sw_loop</tt> is then implemented as follows:
<pre>
<font color="#0000ff">/*
* sw_loop:
*
* s: addiu rX,rX,4 rX = arg[0] and arg[1]
* bne rY,rX,s (or rX,rY,s) rt=arg[1], rs=arg[0]
* sw rZ,-4(rX) rt=arg[0], rs=arg[1]
*/</font>
X(sw_loop)
<font color="#404040">{</font>
MODE_uint_t rX = reg(ic->arg[0]), rZ = reg(ic[2].arg[0]);
uint64_t *rYp = (uint64_t *) ic[1].arg[0];
MODE_uint_t rY, bytes_to_write;
unsigned char *page;
int partial = 0;
page = cpu->cd.mips.host_store[rX >> 12];
<font color="#0000ff">/* Fallback: */</font>
if (cpu->delay_slot || page == NULL || (rX & 3) != 0 || rZ != 0) <font color="#404040">{</font>
instr(addiu)(cpu, ic);
return;
<font color="#404040">}</font>
if (rYp == (uint64_t *) ic->arg[0])
rYp = (uint64_t *) ic[1].arg[1];
rY = reg(rYp);
bytes_to_write = rY - rX;
if ((rX & 0xfff) + bytes_to_write > 0x1000) <font color="#404040">{</font>
bytes_to_write = 0x1000 - (rX & 0xfff);
partial = 1;
<font color="#404040">}</font>
<b>memset(page + (rX & 0xfff), 0, bytes_to_write);</b>
reg(ic->arg[0]) = rX + bytes_to_write;
cpu->n_translated_instrs += bytes_to_write / 4 * 3 - 1;
cpu->cd.mips.next_ic = partial?
(struct mips_instr_call *) &ic[0] :
(struct mips_instr_call *) &ic[3];
<font color="#404040">}</font>
</pre>
<p>For large memsets, calling the native/system <tt>memset</tt> implementation will most
likely be a lot faster than emulating the MIPS <tt>memset</tt> one instruction at a time.
<p>The fallback check makes sure that the rest of the code in <tt>sw_loop</tt>
will be able to run. The implementation only supports zero-fill memsets, so
if <tt>rZ</tt> is non-zero (or some other dangerous condition is present),
then we fall back to <tt>addiu</tt>. This is done by explicitly calling
<tt>instr(addiu)(cpu, ic);</tt> and then returning.
<p>(The example above is 32-bit specific, and only works if a direct page
address can be obtained via <tt>cpu->cd.mips.host_store</tt>.)
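<p>The detection itself happens at translation time: when one of the candidate
<tt>sw</tt> instructions is translated, the preceding slots are examined, and if
they match the pattern, the <tt>addiu</tt> slot's function pointer is simply
redirected. A hedged sketch of the idea (not the actual GXemul detection code;
the register and offset checks are only hinted at):
<pre>
static void maybe_combine_sw_loop(struct mips_instr_call *ic)
{
        <font color="#0000ff">/*  ic points at the sw slot; the bne is one slot back, and the
            addiu two slots back:  */</font>
        if (ic[-2].f == instr(addiu) &&
            ic[-1].f == instr(bne_samepage)
            <font color="#0000ff">/*  ... plus checks that the registers and offsets really form
                the memset-style loop shown above ...  */</font>) {
                <font color="#0000ff">/*  Redirect the addiu slot to the combined instruction:  */</font>
                ic[-2].f = instr(sw_loop);
        }
}
</pre>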
<p>To find out which instruction combinations are worth implementing, real-world
code (such as a complete guest operating system with suitable user applications)
should be executed with special instruction statistics gathering enabled, and
the most common instruction sequences can then be extracted from those statistics.
<p><br>
<a name="native"></a>
<h3>Native Code Generation Back-ends:</h3>
<p>Native code generation was included in GXemul before the 0.4.x series, but only for
i386 and Alpha hosts. That did not feel clean and portable enough to work
as a long-term solution. Time has proved this right: i386 has been superseded
by amd64, and the Alpha architecture has unfortunately been discontinued.
<p>
The current implementation instead focuses on having an intermediate
representation that can be executed regardless of what the native architecture
is.
<p>In theory, it would be possible to add native code generation again,
as long as the generated code conforms to the C ABI on the host.
</body>
</html>