1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240
|
The current implementation of the Tcl-byte-code engine is built around
a large switch statement (the technique is called switch-threading).
It is actually quite hard to understand and to de-mangle, since there
are many goto-operations for optimization, etc. There exists several
other implementation techniques for execution engines coming from the
field of direct threaded code. In particular i invested into
differences between switch-threading, call-threading and label
threading. implementation mechanisms around, namely call-threading and
label-threading (direct threading).
First of all, threaded.c is a short introductory example, showing in a
small concise program the differences of these techniques. On my
machine i got the following results:
1: 0 seconds, 94591 usec
2: 0 seconds, 84749 usec
3: 0 seconds, 59811 usec
[1] switch threading
[2] call threading
[3] label threading
This experiment shows that the implementation of switch-threading
is the slowest, followed by call threading and label threading.
However, the implementation of label-threading depends on a
non-standard C extension (label address operator) needed for
computed gotos. However, the label address operator is supported
by at least three different compilers (supported on GCC,
clang and IBM's XL C/C++).
Based on this study i built an experimental new code-engine for tcl
(integrated in nsf/nx) based on the following ideas:
- Use Tcl's basic implementation elements as base elements of the
execution-engine: Tcl_Obj, Tcl_ObjCmdProc (objProcs).
- Instead of defining a "byte code", implement the code essentially as
an array of objProcs, were as well the argument vectors of the
objProcs are built as far as possible at compile time. This makes it
possible to implement quickly an execution engine for all of Tcl.
- By using objProcs as basic mechanism, every tcl-command
implementation will become effectively an "instruction" of the code
engine. By defining new objProcs, the command set of the engine will
be extended automatically. Specialized instructions with the same
format can be registered depending on syntactic properties of the
source (e.g. not all argument test will be required, part of the
syntax errors can be caught at compile time, on could implement
specialized versions for Tcl's "set" for setting and reading
variables, etc.).
- The code engine is designed for allowing different resolving
strategies for cmds (objProcs): come commands could be resolved at
compile-time, some commands must be resolved at runtime. Similar
experiments can be made in the object oriented case for objects and
methods. It is not completely clear, if the binding time flexibility
should be an option for a production version of the execution
engine, since this would cause differences in the semantics of Tcl
(when commands are redefined). However, this option provides some
good evidence for potential speedups.
- The current byte code engine performs double dispatches on object
oriented calls (one dispatch on the object, a second dispatch on the
method). The new execution engine allows direct dispatch on methods.
- Exploit typing information: NSF/NX provides "optionally typing"
information (e.g. x:int). This optional typing information can be
used at runtime as well for optimizing the code.
- Provide an intermediate representation ("Tcl assembler") instead of
"hard wiring" instructions to Tcl commands as implemented in
Tcl. The Tcl assembler makes it easier to implement high-level
optimizations to provide different implementations for e.g. a
sequence of tcl instructions. Note that a Tcl compiler (a program
translating Tcl source into Tcl assembler) can be implemented in Tcl
as well. Furthermore, for some high-performance requirements, it is
possible to simply write Tcl assembly code by hand (this is at least
an alternative to C-coding). Note that the Tcl assembler is not
machine dependent, the instructions are those of the "abstract Tcl
machine". Currently, the syntax of the Tcl assembler is a nested Tcl
list.
- As explained above, label-threading is the fastest implementation
technique of the compared approaches, but it is not supported by all
c-compilers. Implemented execution engines are sufficiently
different to make maintenance of two (or more) execution engines
hard. In order to address this issue, i implemented an assembly code
generator and an execution code generator in NX from a common,
declarative source. The assembly code and the execution engines can
be included in the source. Depending on the C-compiler, the "right"
include can be chosen. Note that it is certainly possible to
implement some more different (optimized) execution engines as well.
- Replacing for Tcl the complete byte-code engine is quite a large
task. In order to experiment with the new execution engines i
implemented asm-proc and asm-method as counterparts of "proc" and
"method", which will take in their body the tcl assembly
language. When such a command is called, the appropriate execution
engine is called. This way, the new execution engine can co-exist
with Tcl's classical byte-code engine, it should not be as well
quite easy to follow this approach in jimtcl.
- In cases where the new tcl-compilation and assembly generation
fails, one could fall back to the basic implementation (e.g. tcl
byte code engine).
Some preliminary results:
- If one executes just the objProcs (e.g. for "set" Tcl_SetObjCmd),
the old code engine is faster, but the new ones "work" as well. It
is certainly possible to provide new optimized instructions for
certain tasks (accessing/setting variables, increment operations
etc.) using the same format.
- Using such instructions, both new execution engines (for
call-threading and label-threading) show to be much faster than
Tcl's current bytecode engine.
- The current implementation based on call-threading is for the
example below up to about three times faster than the
Tcl-implementation, the label-threading variant is about 5 times
faster.
Current shortcomings / work in progress:
- The current implementation is not re-entrant (no stack
integration, proc local variables are currently stored in the
assembly proc structure, not on the stack).
- For allowing "uplevels" and "upvars" and compiled locals, we will
need in Tcl the proc call frames, which will certainly cost some
performance. For high-performance situations, these frames could
be made optional (these are not available if one would code the
procs straightforward in C).
Note that the example below with the for loop and incr, reentrancy
and stack frames are not required/needed at all (since the code
does neither call a proc (or eval) and the code is not recursive -
it is not clear, how often such optimization would be possible).
- The instruction set is just sufficient for the about 30 examples
in my current regression test set. The instruction set is just
experimental, and not sufficiently thoughtfully designed.
- The assembly language is very primitive: one has a set of "slot"
declarations which are constants or variables. Later references
address these slots via their indices (starting with
0). Similarly, the instructions are numbered as well (the code is
a C array of instructions). A "goto" simply addresses the next
instruction via the array index. All "declarations" of local vars
have to be written before the "instructions".
- It is not clear, how much Tcl compatibility is needed or wanted
(interface to compiled locals, variable-/call-traces, etc.)
- No effort so far on compiling Tcl source into Tcl assembly.
Below is a short example in Tcl (and the Tcl byte code engine) and in
Tcl assembly in two implementation variants, one with Tcl_Objs, one
with integers (in case the typing information is available or
inferable from the Tcl source).
The timing results based on clang under macOS:
Call Threading:
asm/t.001: 6.64
asm/t.002: 4.04
asm/t.003: 2.46
Label Threading:
asm/t.001: 6.58
asm/t.002: 2.74
asm/t.003: 1.33
I'll push the version over the holidays to our public git repository.
All the best
-gustaf
==================================================
package req nx::test
nx::Test parameter count 100000
proc sum10.tcl {} {
set sum 0
for {set i 0} {$i < 100} {incr i} {
incr sum $i
}
return $sum
}
# Implementation in Tcl assembly, using tcl-objs for "sum", "i" and
# the constants
nsf::asm::proc sum10.asm1 {} {
{obj sum}
{obj i}
{obj 0}
{obj 1}
{obj 100}
{obj 0}
{var obj 0}
{var obj 1}
{duplicateObj slot 6 obj 2}
{duplicateObj slot 7 obj 5}
{leIntObj slot 4 slot 7}
{jumpTrue instruction 7}
{incrObj slot 6 slot 7}
{incrObj slot 7 slot 3}
{jump instruction 2}
{setResult slot 6}
}
# Implementation in Tcl assembly, using integers for "sum", "i" and
# the constants
nsf::asm::proc sum10.asm2 {} {
{obj sum}
{obj i}
{integer int 1}
{integer int 100}
{integer int 0}
{integer int 0}
{setInt slot 4 int 0}
{setInt slot 5 int 0}
{leInt slot 3 slot 5}
{jumpTrue instruction 7}
{incrInt slot 4 slot 5}
{incrInt slot 5 slot 2}
{jump instruction 2}
{setResultInt slot 4}
}
? {sum10.tcl} "4950"
? {sum10.asm1} "4950"
? {sum10.asm2} "4950"
|