PERFORMANCE
===========
*** These are some notes/ideas for improving performance in plex86.
If the guest SS spans the full 32-bit space, we can eliminate switching
to the r3h SS in tcode. Add the guest SS descriptor to the meta page info
and compare.
The timer handling needs work. Currently, the guest is likely getting
hammered with interrupts too frequently.
In the fault handler, we need to be able to tell whether the 1st
instruction is virtualized. If so, emulate it and save a round trip.
Finer invalidation granularity for code cache page writes. Right now,
we're dumping the entire page even if data access does not
conflict with translated instructions.
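One way the finer granularity could look, as a rough sketch only: keep a
small bitmap per code cache page recording which 32-byte granules hold
translated instructions, and only dump translations when a write lands
on a marked granule. The names (code_page_map_t, map_mark,
map_write_conflicts) and the 32-byte granule size are assumptions for
illustration, not existing plex86 code.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE      4096u
#define GRANULE_SHIFT  5u                            /* 32-byte granules (assumed) */
#define GRANULES       (PAGE_SIZE >> GRANULE_SHIFT)  /* 128 granules per page */
#define MAP_WORDS      (GRANULES / 32u)              /* 4 words of 32 bits */

/* Hypothetical per-page record of which granules contain translated code. */
typedef struct {
    uint32_t translated[MAP_WORDS];
} code_page_map_t;

static void map_clear(code_page_map_t *m)
{
    memset(m, 0, sizeof *m);
}

/* Record that guest bytes [off, off+len) of this page were translated. */
static void map_mark(code_page_map_t *m, uint32_t off, uint32_t len)
{
    if (len == 0 || off >= PAGE_SIZE)
        return;
    if (len > PAGE_SIZE - off)          /* clamp the range to this page */
        len = PAGE_SIZE - off;
    uint32_t first = off >> GRANULE_SHIFT;
    uint32_t last  = (off + len - 1u) >> GRANULE_SHIFT;
    for (uint32_t g = first; g <= last; g++)
        m->translated[g / 32u] |= (uint32_t)1 << (g % 32u);
}

/* Return 1 if a write to [off, off+len) touches a translated granule. */
static int map_write_conflicts(const code_page_map_t *m,
                               uint32_t off, uint32_t len)
{
    if (len == 0 || off >= PAGE_SIZE)
        return 0;
    if (len > PAGE_SIZE - off)
        len = PAGE_SIZE - off;
    uint32_t first = off >> GRANULE_SHIFT;
    uint32_t last  = (off + len - 1u) >> GRANULE_SHIFT;
    for (uint32_t g = first; g <= last; g++)
        if (m->translated[g / 32u] & ((uint32_t)1 << (g % 32u)))
            return 1;
    return 0;
}
```

A write that returns 0 here could be let through without touching the
translated code for that page. Writes that span two pages would need one
check per page.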
Optimize 16-bit code complementing so that only instructions
which require it are actually complemented.
Make DT usage bitlists 32 bits wide instead of 8.
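A sketch of why the wider word helps, assuming the bitlists are scanned
for a free entry (the note does not say exactly what they track): with
32-bit words a full word is skipped with one compare, instead of four
compares with 8-bit words. usage_bitlist_t, USAGE_BITS and the function
names are made up for this example.

```c
#include <stdint.h>

#define USAGE_BITS   256u                 /* number of tracked entries (assumed) */
#define USAGE_WORDS  (USAGE_BITS / 32u)

/* Hypothetical usage bitlist built from 32-bit words. */
typedef struct {
    uint32_t w[USAGE_WORDS];
} usage_bitlist_t;

static void usage_set(usage_bitlist_t *b, unsigned i)
{
    b->w[i >> 5] |= (uint32_t)1 << (i & 31u);
}

static void usage_clear(usage_bitlist_t *b, unsigned i)
{
    b->w[i >> 5] &= ~((uint32_t)1 << (i & 31u));
}

static int usage_test(const usage_bitlist_t *b, unsigned i)
{
    return (int)((b->w[i >> 5] >> (i & 31u)) & 1u);
}

/* Return the index of the lowest clear bit, or -1 if every bit is set. */
static int usage_find_free(const usage_bitlist_t *b)
{
    for (unsigned wi = 0; wi < USAGE_WORDS; wi++) {
        uint32_t free_mask = ~b->w[wi];
        if (free_mask == 0)
            continue;                     /* 32 entries ruled out at once */
        unsigned bit = 0;
        while ((free_mask & 1u) == 0) {   /* locate the lowest free bit */
            free_mask >>= 1;
            bit++;
        }
        return (int)(wi * 32u + bit);
    }
    return -1;
}
```

Callers are expected to pass indices below USAGE_BITS; the sketch does
no range checking.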
There are several points marked with 'xxx' in the kernel/dt/
code which talk about potential optimizations.
The PLEX86_PHYMEM_MOD ioctl interface for invalidating translated
code pages due to user space writes needs to be modified
to be more efficient. Currently, we dump all translated code
for _any_ write. Ouch! The floppy/DMA interface from user
space requires this. So for now, any floppy access will drag
down performance.
Optimize functions in util-nexus.c: mon_memzero, mon_memcpy, mon_memset.
They could be done a lot more efficiently.
Perhaps make a mon_memzero function specifically for pages.
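A page-only zeroing routine might look roughly like the following. It
can drop the length argument and the byte-at-a-time tail handling
because the size is fixed and the buffer is assumed to be at least
4-byte aligned. mon_zero_page and MON_PAGE_SIZE are hypothetical names,
not the existing util-nexus.c interface.

```c
#include <stddef.h>
#include <stdint.h>

#define MON_PAGE_SIZE 4096u

/* Hypothetical page-only variant of mon_memzero.
 * Assumes 'page' points to MON_PAGE_SIZE bytes, aligned to 4 bytes. */
static void mon_zero_page(void *page)
{
    uint32_t *p = (uint32_t *)page;
    size_t words = MON_PAGE_SIZE / sizeof(uint32_t);   /* 1024 stores */

    /* Unrolled by 8: fewer loop iterations, no per-byte work. */
    for (size_t i = 0; i < words; i += 8) {
        p[i + 0] = 0;
        p[i + 1] = 0;
        p[i + 2] = 0;
        p[i + 3] = 0;
        p[i + 4] = 0;
        p[i + 5] = 0;
        p[i + 6] = 0;
        p[i + 7] = 0;
    }
}
```

Whether this beats a "rep stosl" style inline assembly version would
need measuring; the sketch only shows the fixed-size, word-wide idea.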
Some important and/or heavily used components of hardware emulation
could be optionally moved to monitor space:
{DMA, disk, floppy, video, timer, etc...}
Optionally, let ring3 guest code run without DT (SIV) intervention.
Need a set of criteria for when this is possible, and should
be controlled by user conf file setting.
Remove levels of indirection posed by plugin.c.
Use a better messaging system between user space and monitor.
Possibly queue more than one message at a time. Could use
memory mapped page(s).
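As an illustration of queuing more than one message in a shared page, a
simple single-producer/single-consumer ring could be used. Everything
here is an assumption: the mon_msg_t layout, the slot count, and the
premise that only one side runs at a time so no locking or memory
barriers are shown.

```c
#include <stdint.h>

#define MSG_SLOTS 16u   /* must be a power of two */

/* Hypothetical message format; the real one would differ. */
typedef struct {
    uint32_t type;
    uint32_t arg[3];
} mon_msg_t;

/* Ring that would live in a page mapped by both user space and monitor.
 * 8 + 16 * 16 = 264 bytes, so it fits in one 4K page with room to spare. */
typedef struct {
    uint32_t  head;             /* next slot to write (producer side) */
    uint32_t  tail;             /* next slot to read (consumer side) */
    mon_msg_t slot[MSG_SLOTS];
} msg_ring_t;

/* Queue one message. Returns 1 on success, 0 if the ring is full. */
static int ring_put(msg_ring_t *r, const mon_msg_t *m)
{
    if (r->head - r->tail == MSG_SLOTS)
        return 0;
    r->slot[r->head % MSG_SLOTS] = *m;
    r->head++;
    return 1;
}

/* Dequeue the oldest message. Returns 1 on success, 0 if empty. */
static int ring_get(msg_ring_t *r, mon_msg_t *out)
{
    if (r->head == r->tail)
        return 0;
    *out = r->slot[r->tail % MSG_SLOTS];
    r->tail++;
    return 1;
}
```

With this shape, one side could post several requests before a switch
and the other side could drain them all in one pass, rather than paying
one round trip per message.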
Different x86 modes could have their own opcode virtualization map.
Pseudo devices and special guest-OS specific device drivers for
disk/network/video/etc and an associated architecture. This would
let us pass data more quickly and prevent a lot of emulation overhead.
The real device emulation could plug into the same architecture as
the pseudo devices.
============= OLD NOTES ======================================
*** These are notes from the 1st generation of software instruction
*** virtualization (SBE). Some of them still apply. I'll sort through
*** them later.
Fix extra CR3 reload in nexus.S
Handle string IO, N operations at a time, rather than
1 at a time. This is really bogging down performance.
Have to consider paging and page boundaries, segment limits,
timing issues, and debug single stepping etc here.
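A sketch of how the batch size N might be chosen for a forward (DF=0)
REP INS/OUTS: take the smallest of the remaining count, the elements
that fit before the next page boundary, and the elements that fit below
the segment limit. string_io_batch and its parameters are hypothetical,
an expand-up segment is assumed, and timing and single stepping are not
handled at all.

```c
#include <stdint.h>

#define PAGE_SIZE 4096u

/* How many elements can be transferred in one batch.
 *   count     - remaining iterations (guest (E)CX)
 *   elem_size - 1, 2 or 4 bytes
 *   offset    - segment offset of the first element
 *   seg_limit - highest valid offset in the (expand-up) segment
 *   laddr     - linear address of the first element
 * Returns 0 when not even one whole element fits; the caller would then
 * fall back to the existing one-at-a-time path. */
static uint32_t string_io_batch(uint32_t count, uint32_t elem_size,
                                uint32_t offset, uint32_t seg_limit,
                                uint32_t laddr)
{
    if (count == 0 || elem_size == 0)
        return 0;

    /* Whole elements before the end of the current page. */
    uint32_t to_page_end = PAGE_SIZE - (laddr & (PAGE_SIZE - 1u));
    uint32_t by_page = to_page_end / elem_size;

    /* Whole elements before running past the segment limit. */
    uint32_t by_limit = 0;
    if (offset <= seg_limit) {
        uint64_t bytes_left = (uint64_t)seg_limit - offset + 1u;
        uint64_t n = bytes_left / elem_size;
        by_limit = (n > UINT32_MAX) ? UINT32_MAX : (uint32_t)n;
    }

    uint32_t n = count;
    if (by_page < n)
        n = by_page;
    if (by_limit < n)
        n = by_limit;
    return n;
}
```

For example, 512 word-sized transfers starting 256 bytes before a page
boundary would be split into a first batch of 128 elements, after which
the next page is translated and the computation repeats.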
Could software breakpoints (INT3) invoke the handler directly,
rather than generating #GP because of lack of permission from
the guest at ring3?
Use the COSIMULATE macro in kernel/ to trim code when it's
not used.
If we virtualize a near branch instruction only because
we are at the maximum recursion level, then we could
unvirtualize it later on another pass, when the current
level is lower?
Optimize multi-byte reads/writes to VGA
Alignment of routines in mon-fault.c
Keep multiple monitor GDTs customized for certain CPU modes?