<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>GXemul: Dynamic Translation Internals</title>
<meta name="robots" content="noarchive,nofollow,noindex">
</head>
<body style="font-family : sans-serif;">
<!-- 10 lines header. -->
<h1>GXemul: Dynamic Translation Internals</h1>
<p>
<a href="./">Back to the index.</a>
<!--
Copyright (C) 2005-2021 Anders Gavare. All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
3. The name of the author may not be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
SUCH DAMAGE.
-->
<p><br>
This page describes the internal workings of the GXemul dynamic translation
system (usually referred to as just "dyntrans").
<p>
<ul>
<li><a href="#staticvsdynamic">Static vs. dynamic</a>
<li><a href="#ir">Executable Intermediate Representation</a>
<li><a href="#loadsandstores">Emulation of Loads and Stores</a>
<li><a href="#performance">Performance</a>
<li><a href="#instrcomb">Instruction Combinations</a>
<li><a href="#native">Native Code Generation Back-ends</a>
</ul>
<p><br>
<a name="staticvsdynamic"></a>
<h3>Static vs. dynamic:</h3>
<p>In order to support guest operating systems, it is important to take into
account the fact that code pages in memory can be overwritten with new code,
e.g. when loading new programs inside the guest OS.
A static, one-time translation of the code will thus not work; it is
necessary to translate the code <i>dynamically</i> <sup>*</sup>.
Self-modifying code and Just-in-Time compilers running inside
the emulator are other things that would not work with a static
translator. GXemul is a dynamic translator. However, it does not
necessarily translate into <a href="#native">native code</a>, like some other emulators.
<p><small>* Strictly speaking, translation is not necessary at all: instead of
translating the code ahead of execution, the instructions could be interpreted
one at a time. But that is slow.</small>
<p><br>
<a name="ir"></a>
<h3>Executable Intermediate Representation:</h3>
<p>Dynamic translators usually translate from the emulated architecture
(e.g. MIPS) into a kind of <i>intermediate representation</i> (IR), and then
to native code (e.g. AMD64 or x86 code) which can be executed by the host.
Since one of the main goals for GXemul is to keep everything as portable as
possible, the IR is something which can be executed regardless of whether
the final step (translation from IR to native code) has been implemented
or not.
<p>The IR in GXemul consists of <i>arrays of pointers to functions</i>, and a few
arguments which are passed along to those functions. This is usually referred
to as <a href="http://en.wikipedia.org/wiki/Threaded_code">threaded code</a>.
The functions are implemented either as hand-written C code or as
generated C code.
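<p>As a rough, self-contained sketch of the idea (simplified for illustration;
these are not the exact GXemul type or field names):
<pre>
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

struct cpu;

<font color="#0000ff">/*  One "instruction call": a function pointer plus pre-decoded arguments:  */</font>
struct instr_call {
        void    (*f)(struct cpu *, struct instr_call *);
        size_t  arg[3];
};

struct cpu {
        uint64_t           pc;
        struct instr_call *next_ic;      <font color="#0000ff">/*  next slot to execute  */</font>
        struct instr_call *cur_ic_page;  <font color="#0000ff">/*  current page of slots  */</font>
};

<font color="#0000ff">/*  Executing translated code is simply a series of indirect calls:  */</font>
static void run_slots(struct cpu *cpu, int n)
{
        while (n-- > 0) {
                struct instr_call *ic = cpu->next_ic ++;
                ic->f(cpu, ic);
        }
}
</pre>
<p>Decoding a page of emulated machine code thus means filling in its slots once;
executing it afterwards is nothing more than repeated indirect function calls.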
<p>Here is a simplified diagram of how these arrays work.
<p><center><img src="simplified_dyntrans.png"></center>
<p>There is one instruction call slot for every possible program counter
location. For example, in the traditional MIPS case, instruction words are 32
bits long, and pages are 4 KB large, resulting in 1024 instructions per page.
In GXemul, this is represented using 1025 or 1026 <i>instruction call slots</i>.
<p>After the last of the regular instruction call slots, there are one or two
additional slots, which are special "end of page" slots. These do not count as
executed instructions. Instead, they jump to the first instruction
on the next virtual page (which may cause exceptions, etc.).
<p>(For architectures with 32-bit long instructions and 4 KB page size
<b>without</b> a delay slot, there are 1025 entries, and for those <b>with</b>
a delay slot there are 1026 entries.)
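<p>In terms of the simplified types sketched above, an "end of page" slot could
be implemented roughly as follows. (This is an illustration, not the actual
GXemul source; <tt>lookup_or_translate_page()</tt> is a made-up helper standing
for the code that finds, or translates, the next page's instruction calls.)
<pre>
#define IC_ENTRIES_PER_PAGE     1024    <font color="#0000ff">/*  4 KB pages, 4-byte instructions  */</font>
#define INSTR_ALIGNMENT_SHIFT   2

void lookup_or_translate_page(struct cpu *cpu);  <font color="#0000ff">/*  hypothetical helper  */</font>

static void end_of_page(struct cpu *cpu, struct instr_call *ic)
{
        <font color="#0000ff">/*  Advance the PC to the start of the next virtual page:  */</font>
        cpu->pc &= ~((IC_ENTRIES_PER_PAGE &lt;&lt; INSTR_ALIGNMENT_SHIFT) - 1);
        cpu->pc += IC_ENTRIES_PER_PAGE &lt;&lt; INSTR_ALIGNMENT_SHIFT;

        <font color="#0000ff">/*  Point cur_ic_page/next_ic at the new page's slots; this may
            involve translating that page, and may cause exceptions:  */</font>
        lookup_or_translate_page(cpu);

        <font color="#0000ff">/*  Note: the end-of-page slot itself does not count as an
            executed instruction.  */</font>
}
</pre>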
<p>The core dyntrans loop, which executes the instructions in the
instruction call slots, looks something like this:
<pre>
<font color="#0000ff">/* The normal instruction execution core: */</font>
#define I ic = cpu->cd.DYNTRANS_ARCH.next_ic ++; ic->f(cpu, ic);
...
n_instrs = 0;

for (;;) {
        struct DYNTRANS_IC *ic;

        I; I; I; I; I; I; I; I; I; I;
        I; I; I; I; I; I; I; I; I; I;
        I; I; I; I; I; I; I; I; I; I;
        I; I; I; I; I; I; I; I; I; I;
        I; I; I; I; I; I; I; I; I; I;
        I; I; I; I; I; I; I; I; I; I;
        I; I; I; I; I; I; I; I; I; I;
        I; I; I; I; I; I; I; I; I; I;
        I; I; I; I; I; I; I; I; I; I;
        I; I; I; I; I; I; I; I; I; I;
        I; I; I; I; I; I; I; I; I; I;
        I; I; I; I; I; I; I; I; I; I;

        cpu->n_translated_instrs += 120;

        if (cpu->n_translated_instrs >= N_SAFE_DYNTRANS_LIMIT)
                break;
}
</pre>
<p>Why the number 120? The answer is that this number is evenly divisible by
1, 2, 3, 4, 5, 6, 8, 10, and 12 (among other numbers). Emulated instruction
loops of those lengths will thus result in the same host branch being taken from
the same place multiple times, which fits well with the branch prediction
mechanisms of some modern CPUs.
<p>There are two important pointers: <tt>next_ic</tt> points to the next instruction
call, and <tt>cur_ic_page</tt> points to the whole page of such instruction calls.
When the main loop exits, we can check what <tt>next_ic</tt> points to. If it points
to one of the slots inside the current page (<tt>cur_ic_page</tt>), then the program
counter's lowest bits are adjusted to be in sync with the <tt>next_ic</tt>
pointer. It can also happen that <tt>next_ic</tt> points to something else:
a special case is when one wants to break out of the dyntrans loop, e.g.
because an error occurred. In that case, <tt>next_ic</tt> does not point to any of
the instruction slots on that page, and the program counter's lowest bits are
not synced from the <tt>next_ic</tt> value.
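<p>Continuing the simplified sketch from earlier, that synchronization could
look roughly like this (illustrative only):
<pre>
static void sync_pc_from_next_ic(struct cpu *cpu)
{
        ptrdiff_t slot = cpu->next_ic - cpu->cur_ic_page;

        if (slot >= 0 && slot &lt; IC_ENTRIES_PER_PAGE) {
                <font color="#0000ff">/*  next_ic is within the current page: recompute the
                    PC's lowest bits from the slot index:  */</font>
                cpu->pc &= ~((IC_ENTRIES_PER_PAGE &lt;&lt; INSTR_ALIGNMENT_SHIFT) - 1);
                cpu->pc |= (uint64_t)slot &lt;&lt; INSTR_ALIGNMENT_SHIFT;
        }

        <font color="#0000ff">/*  Otherwise, next_ic points outside the page (e.g. after breaking
            out due to an error), and the PC is left untouched here.  */</font>
}
</pre>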
<p>The complexity of individual instructions varies. A simple example of
what an instruction can look like is the MIPS <tt>addu</tt> instruction:
<pre>
<font color="#0000ff">/*
* 3-register arithmetic instructions:
*
* arg[0] = ptr to rs
* arg[1] = ptr to rt
* arg[2] = ptr to rd
*/</font>
X(addu)
<font color="#404040">{</font>
reg(ic->arg[2]) = (int32_t)(reg(ic->arg[0]) + reg(ic->arg[1]));
<font color="#404040">}</font>
</pre>
<p>It stores the result of a 32-bit addition of the register at <tt>arg[0]</tt>
with the register at <tt>arg[1]</tt> into the register at <tt>arg[2]</tt>.
If the emulated CPU is a 64-bit CPU, then this will store a correctly
sign-extended value. If it is a 32-bit CPU, then only the lowest 32 bits will
be stored, and the high part ignored. <tt>X(addu)</tt> is expanded to
<tt>mips_instr_addu</tt> in the 64-bit case, and <tt>mips32_instr_addu</tt>
in the 32-bit case. Both are compiled into the GXemul executable; no code
is created during run-time.
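<p>The name expansion itself is done by the preprocessor, by compiling the same
instruction definitions once per mode with a different expansion of the
<tt>X()</tt> macro. A hedged sketch of how such a prefix macro can be set up
(the exact macro and symbol names used in the GXemul source may differ):
<pre>
#ifdef MODE32
#define X(n)   void mips32_instr_ ## n (struct cpu *cpu, struct mips_instr_call *ic)
#else
#define X(n)   void mips_instr_ ## n (struct cpu *cpu, struct mips_instr_call *ic)
#endif
</pre>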
<p>Another reasonably simple example is the M88K <tt>bsr</tt> (branch to subroutine)
instruction. This is a branch which also sets the return address register to the
instruction following the branch. Here is the implementation for "samepage" offsets,
i.e. branch offsets which are inside the same dyntrans page:
<pre>
X(bsr_samepage)
<font color="#404040">{</font>
        cpu->cd.m88k.r[M88K_RETURN_REG] = (cpu->pc &
            ~((M88K_IC_ENTRIES_PER_PAGE-1) &lt;&lt; M88K_INSTR_ALIGNMENT_SHIFT))
            + ic->arg[2];
        cpu->cd.m88k.next_ic = (struct m88k_instr_call *) ic->arg[0];
<font color="#404040">}</font>
</pre>
<p>
Don't be scared by the bitwise operators.
Here, the M88K <tt>r1</tt> register (the return address register) is set to the
return address, which is calculated by taking the current program counter,
clearing its lowest bits, and adding <tt>arg[2]</tt>, which is the offset within
the page of the instruction following the branch.
Then the next instruction call to execute is set to <tt>arg[0]</tt>, which
needs to be on the same page (since <tt>cur_ic_page</tt> is not updated).
<p>
The compiled code may look like this (example from an AMD64 host):
<pre>000000000036dac0 &lt;m88k_instr_bsr_samepage&gt;:
  36dac0:       b8 03 f0 ff ff          mov    $0xfffff003,%eax
  36dac5:       23 87 a8 00 00 00       and    0xa8(%rdi),%eax
  36dacb:       03 46 18                add    0x18(%rsi),%eax
  36dace:       89 87 e4 00 00 00       mov    %eax,0xe4(%rdi)
  36dad4:       48 8b 46 08             mov    0x8(%rsi),%rax
  36dad8:       48 89 87 e0 03 00 00    mov    %rax,0x3e0(%rdi)
  36dadf:       c3                      retq
</pre>
<p><br>
<a name="loadsandstores"></a>
<h3>Emulation of Loads and Stores:</h3>
<p>The emulation of memory access instructions differs slightly between
the various emulated architectures, but most of the principles in this
section apply to all of them.
When emulating a MIPS processor, the emulation of loads and stores consists
of a shorter "happy case" code path, and a fallback to a more general case if
the preconditions for the happy case are not met.
<p>An overview of the short path, with the approximate host-side overhead of each step:
<p><table border="1">
<tr>
<td><b>Step:</b></td>
<td><b>Overhead:</b></td>
</tr>
<tr>
<td><b>1.</b> Calculate the final virtual address.</td>
<td>3 host memory accesses: read the Base and Offset arguments.</td>
</tr>
<tr>
<td><b>2.</b> Preconditions check: a) find a corresponding host RAM
page, if there is one, and b) check address alignment.</td>
<td>1 host memory access for 32-bit emulation, 3 for 64-bit emulation.</td>
</tr>
<tr>
<td><b>3.</b> Perform native load or store.</td>
<td>1 host memory access (or multiple if doing byte-order-swap).</td>
</tr>
<tr>
<td><b>4.</b> "Write-back" to emulated CPU's destination register (for loads).</td>
<td>2 host memory accesses.</td>
</tr>
</table>
<p><b>Step 1</b> calculates the actual virtual address (usually a base
register plus an immediate offset).
<p><b>Step 2</b> looks up whether there is a corresponding RAM page in the host.
If there is, we can potentially load or store directly from that page in the
host's RAM.
<p>However, before going ahead with the load or store, we also need to check the
alignment of the address. Different host architectures behave differently here:
on (some?) ARM hosts, the lower bits of an unaligned address may simply be
ignored, so using the address as-is would access the wrong location; on an
x86-based host, unaligned loads and stores are silently accepted, even though
they should cause an exception on the hardware being emulated (in this case
MIPS). Checking the lowest bits of the final address is thus important for
correct emulation, so that exceptions can be generated, etc.
<p>For emulation of 32-bit address spaces, the lookup from emulated virtual
address to host page is a single fetch from an array. For emulation of 64-bit
address spaces, it is a three-level lookup.
In either case, there are separate lookups for loads and stores. A page which
is write-protected has just a mapping for load, no mapping for store. A page
which is readable and writable has both a mapping for load and one for store.
<p>If the alignment was ok, and there was a direct mapping to host page,
we can go ahead (<b>step 3</b>) and perform a native load or store, possibly
followed by byte order swap (which may be implemented by using multiple
memory accesses) if the emulated CPU is of different endianness than the host.
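<p>Putting the steps together, a hedged sketch of the 32-bit happy case for a
load-word instruction might look as follows. This is illustrative pseudo-C in
the style of the real code, not the actual contents of
<tt>src/cpus/cpu_mips_instr_loadstore.cc</tt>; <tt>general_load_word()</tt>,
<tt>bswap32()</tt> and <tt>SWAP_BYTE_ORDER</tt> are made-up names.
<pre>
X(load_word)
<font color="#404040">{</font>
        <font color="#0000ff">/*  Step 1: final virtual address = base register + immediate offset:  */</font>
        uint32_t addr = reg(ic->arg[0]) + (int32_t)ic->arg[2];

        <font color="#0000ff">/*  Step 2a: look up a host page for loads (4 KB pages, so the index
            is the top 20 address bits); NULL means no direct mapping:  */</font>
        unsigned char *page = cpu->cd.mips.host_load[addr >> 12];

        <font color="#0000ff">/*  Step 2b: the address must be word-aligned:  */</font>
        if (page == NULL || (addr & 3) != 0) {
                <font color="#0000ff">/*  Fallback: the general routine handles TLB lookups,
                    exceptions, and memory-mapped devices:  */</font>
                general_load_word(cpu, ic, addr);
                return;
        }

        <font color="#0000ff">/*  Step 3: native load, plus a byte order swap if the host and the
            emulated CPU differ in endianness:  */</font>
        uint32_t word;
        memcpy(&word, page + (addr & 0xfff), sizeof(word));
#ifdef SWAP_BYTE_ORDER
        word = bswap32(word);
#endif

        <font color="#0000ff">/*  Step 4: write-back, sign-extended on 64-bit emulated CPUs:  */</font>
        reg(ic->arg[1]) = (int32_t)word;
<font color="#404040">}</font>
</pre>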
<p>(This mechanism is currently hardcoded for a 4 KB base page size. The 1 KB
page size of the VR41xx CPUs, for example, does not work, and neither would e.g.
the VAX's 512-byte pages. As long as the guest OS does not use page sizes
smaller than 4 KB, it should work even when emulating such CPUs.)
<p>If any of the checks in <b>step 2</b> fails, the code bails out and calls a
more general memory access routine, which looks up addresses via the emulated
MIPS TLB and, depending on the outcome, may add host pages to the
virtual-to-host arrays, cause exceptions on the emulated CPU, or call out
to memory-mapped device routines.
<p>This implementation is obviously not intended to result in any world speed
records, but it strikes a reasonable balance between emulation correctness and
host portability.
<p>The actual code in <tt>src/cpus/cpu_mips_instr_loadstore.cc</tt> looks
rather horrible, because it was originally written in C with the intention of
mimicking C++ templates. If it were rewritten from scratch using C++ templates,
it would likely be more readable.
<p>When the TLB in the emulated CPU is instructed to change a mapping,
for example if virtual address 0x1000 pointed to physical address 0x1234000
but is changed to point to 0x2345000, then the virtual-to-physical-to-host
mappings in the emulator must be updated to reflect this. This is known as
invalidating the translation lookup tables.
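<p>For the simple 32-bit case, such an invalidation essentially amounts to the
following (a sketch with illustrative names, not the actual code):
<pre>
static void invalidate_addr_translation(struct cpu *cpu, uint32_t vaddr)
{
        uint32_t index = vaddr >> 12;   <font color="#0000ff">/*  4 KB pages  */</font>

        cpu->cd.mips.host_load[index]  = NULL;
        cpu->cd.mips.host_store[index] = NULL;

        <font color="#0000ff">/*  The next access to this virtual page falls back to the general
            routine, which re-validates the mapping via the emulated TLB.  */</font>
}
</pre>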
<p>Also, when performing a store to a page for which there are code
translations (the IR described above), all such translated instruction slots
are reset.
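<p>Conceptually, resetting those slots means pointing them back at a handler
which re-translates the emulated instruction the next time the slot is reached,
roughly like this (again a sketch; <tt>to_be_translated</tt> here is a made-up
name for that handler):
<pre>
void to_be_translated(struct cpu *cpu, struct instr_call *ic);

static void invalidate_code_translation(struct instr_call *ic_page)
{
        int i;

        for (i = 0; i &lt; IC_ENTRIES_PER_PAGE; i++)
                ic_page[i].f = to_be_translated;
}
</pre>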
<p><br>
<a name="performance"></a>
<h3>Performance:</h3>
<p>The performance of using this kind of executable IR is obviously lower
than what can be achieved by emulators using native code generation, but
can be significantly higher than using a naive fetch-decode-execute
interpretation loop. Using threaded code as the IR is
an interesting compromise.
<p>The overhead per emulated instruction is usually around 10 host
instructions or fewer. This depends heavily on the host architecture and on
the compiler and compiler switches used.
On top of this comes, of course, the C code used to implement each specific
emulated instruction.
<p><br>
<a name="instrcomb"></a>
<h3>Instruction Combinations:</h3>
<p>Short, common instruction sequences can sometimes be replaced by a
"combined" instruction. An example could be a compare instruction followed
by a conditional branch instruction. The advantages of instruction
combinations are that
<ul>
<li>the amortized overhead per instruction is slightly reduced, and
<p>
<li>the host's compiler can do a good job of optimizing the common
instruction sequence.
</ul>
<p>The special cases where instruction combinations give the most gain
are in the cores of string/memory manipulation functions such as
<tt>memset()</tt> or <tt>strlen()</tt>. The core loop can then (at least
to some extent) be replaced by a native call to the equivalent function.
<p>The implementations of combined instructions still keep track of the
number of executed instructions, etc. When single-stepping, these
translations are invalidated and replaced by normal instruction calls
(one per emulated instruction).
<p>An example of what such an instruction combination can look like is a
3-instruction <tt>memset</tt> loop on MIPS. The pattern to detect is a
sequence of <tt>addiu</tt>, <tt>bne</tt>, and <tt>sw</tt> with certain
registers. The attempt to identify the sequence is only made when certain
<tt>sw</tt> instructions are translated; for other instructions, no pattern
matching is attempted. (Note: The <tt>sw</tt> instruction is in the
<a href="http://en.wikipedia.org/wiki/Delay_slot">delay slot</a> of the
<tt>bne</tt> instruction.)
<p>If the sequence is detected, the instruction call slot with
<tt>addiu</tt> is changed to point to the instruction combination.
<pre>
<u>Original sequence:</u>                      <u>Changed to:</u>

slot 4:  ...                            slot 4:  ...
slot 5:  addiu r5, r3, 24               slot 5:  addiu r5, r3, 24
<font color="#8000ff"><i>slot 6:</i></font>  <b>addiu</b> r9, r9, 4                <font color="#8000ff"><i>slot 6:</i></font>  <b>sw_loop</b> r9, r9, 4
<font color="#8000ff"><i>slot 7:</i></font>  bne r12, r9, slot 6            <font color="#8000ff"><i>slot 7:</i></font>  bne r12, r9, slot 6
<font color="#8000ff"><i>slot 8:</i></font>  sw r6, -4(r9)                  <font color="#8000ff"><i>slot 8:</i></font>  sw r6, -4(r9)
slot 9:  subu r2, r3, 32                slot 9:  subu r2, r3, 32    <i>&lt;--- (*)</i>
slot 10: ...                            slot 10: ...
</pre>
<p>
<i>(*) If the sw_loop code is executed, then execution continues here (<tt>&ic[3]</tt>).</i>
<p>The <tt>sw_loop</tt> is then implemented as follows:
<pre>
<font color="#0000ff">/*
* sw_loop:
*
* s: addiu rX,rX,4 rX = arg[0] and arg[1]
* bne rY,rX,s (or rX,rY,s) rt=arg[1], rs=arg[0]
* sw rZ,-4(rX) rt=arg[0], rs=arg[1]
*/</font>
X(sw_loop)
<font color="#404040">{</font>
MODE_uint_t rX = reg(ic->arg[0]), rZ = reg(ic[2].arg[0]);
uint64_t *rYp = (uint64_t *) ic[1].arg[0];
MODE_uint_t rY, bytes_to_write;
unsigned char *page;
int partial = 0;
page = cpu->cd.mips.host_store[rX >> 12];
<font color="#0000ff">/* Fallback: */</font>
if (cpu->delay_slot || page == NULL || (rX & 3) != 0 || rZ != 0) <font color="#404040">{</font>
instr(addiu)(cpu, ic);
return;
<font color="#404040">}</font>
if (rYp == (uint64_t *) ic->arg[0])
rYp = (uint64_t *) ic[1].arg[1];
rY = reg(rYp);
bytes_to_write = rY - rX;
if ((rX & 0xfff) + bytes_to_write > 0x1000) <font color="#404040">{</font>
bytes_to_write = 0x1000 - (rX & 0xfff);
partial = 1;
<font color="#404040">}</font>
<b>memset(page + (rX & 0xfff), 0, bytes_to_write);</b>
reg(ic->arg[0]) = rX + bytes_to_write;
cpu->n_translated_instrs += bytes_to_write / 4 * 3 - 1;
cpu->cd.mips.next_ic = partial?
(struct mips_instr_call *) &ic[0] :
(struct mips_instr_call *) &ic[3];
<font color="#404040">}</font>
</pre>
<p>For large memsets, calling the native/system <tt>memset</tt> implementation will most
likely be a lot faster than emulating the MIPS <tt>memset</tt> one instruction at a time.
<p>The fallback check makes sure that the rest of the code in <tt>sw_loop</tt>
will be able to run. The implementation only supports zero-fill memsets, so
if <tt>rZ</tt> is non-zero (or some other dangerous condition is present),
then we fall back to <tt>addiu</tt>. This is done by explicitly calling
<tt>instr(addiu)(cpu, ic);</tt> and then returning.
<p>(The example above is 32-bit specific, and only works if a direct page
address can be obtained via <tt>cpu->cd.mips.host_store</tt>.)
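<p>The detection itself happens at translation time: when one of the candidate
<tt>sw</tt> instructions is translated, the preceding slots are examined, and if
they match the pattern, the <tt>addiu</tt> slot's function pointer is simply
redirected. A hedged sketch of the idea (not the actual GXemul detection code;
the register and offset checks are only hinted at):
<pre>
static void maybe_combine_sw_loop(struct mips_instr_call *ic)
{
        <font color="#0000ff">/*  ic points at the sw slot; the bne is one slot back, and the
            addiu two slots back:  */</font>
        if (ic[-2].f == instr(addiu) &&
            ic[-1].f == instr(bne_samepage)
            <font color="#0000ff">/*  ... plus checks that the registers and offsets really form
                the memset-style loop shown above ...  */</font>) {
                <font color="#0000ff">/*  Redirect the addiu slot to the combined instruction:  */</font>
                ic[-2].f = instr(sw_loop);
        }
}
</pre>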
<p>To find out which instruction combinations are worth implementing, real-world
code (such as a complete guest operating system with suitable user applications)
should be executed with special instruction statistics gathering enabled, and
the most common instruction sequences can then be extracted from those statistics.
<p><br>
<a name="native"></a>
<h3>Native Code Generation Back-ends:</h3>
<p>Native code generation was included in GXemul before the 0.4.x series, but only for
i386 and Alpha hosts. That did not feel clean and portable enough to work
as a long-term solution. Time has proved this right: i386 has been superseded
by amd64, and the Alpha architecture has unfortunately been discontinued.
<p>
The current implementation instead focuses on having an intermediate
representation that can be executed regardless of what the native architecture
is.
<p>In theory, it would be possible to add native code generation again,
as long as the generated code conforms to the C ABI on the host.
</body>
</html>