WARNING! This manual (English version only) may contain lexical and
grammatical errors. The author does not claim a perfect command of
technical English. If any part of this document is unclear, please
contact the author. For his part, the author has made every effort to
make this document comprehensible and has tried to use the most common
wording so that readers can understand it unambiguously.
(Suggestions, feedback and bug reports are gladly accepted.)
BIEW Internals
=============================
This manual documents the internal architecture of BIEW and contains
development notes. Its main goal is to explain how to extend the
project's functionality and to help you understand it. This manual is
written with the assumption that you are familiar with the C
programming language and basic programming concepts.
Table of Contents
=================
0. Preamble
1. Hierarchical structure
1.1. How it works
1.2. Understanding plugins and addons
1.3. Why biewlib
2. How to expand or port the project
2.1. Creating plugins and addons
2.2. Project porting
3. Source code notes
3.1. Source location
3.2. Source layout
3.3. GNU Makefile
4. Optimization notes
4.1. Few words about calling conventions
4.2. Source code optimization notes
4.3. Performance tests
5. Final chapter
0. Preamble
===========
BIEW is a modular project built on plugins and addons. Any plugin or
addon can easily be added to the project or removed from it. To
interact with the OS and the computer, BIEW uses its own library named
biewlib. There are two reasons why biewlib exists:
- The project was started in a 16-bit DOS environment with a poor
  development kit
- Portability to non-POSIX systems
1. Hierarchical structure
=========================
                                    Plugins of auto level:
                                    +--------------------+
                               +----| All files in       |
          +-----------+        |    | plugins/bin        |
          | Various   |        |    +--------------------+
          | addons    |        |           ^
          +-----------+        |           :             Plugins of II level:
                |              |           v               +---------------+
                |              |    Plugins of I level:    |plugins/nls    |
 biew lib:      | Base level:  |    +-------------+      +-|??? in feature |
+-------------+ +-*============*    | binmode.c   |      | +---------------+
| OS and CPU  |===# biew.c     #----| hexmode.c   |      | +----------------+
| depended    |===# mainloop.c #    | textmode.c  |------+ | various        |
| library     |   *============*    | disasm.c    |--------| disassemblers  |
+-------------+         |           +-------------+        | plugins/disasm |
                        |                                  +----------------+
                        |
             +----------------------+
             | biew utilities:      |
             | biewutil, bin_util   |
             | bconsole, biewhelp,  |
             | events, ...          |
             +----------------------+
1.1. How it works
-----------------
If you want to get acquainted with the details of how the sources
interact, install DOXYGEN; the comments in the project sources are
written for it. The interaction in short:
- At program start, control is passed to the main function, which is
  defined in biew.c. Here the program initializes itself, parses the
  command line, reads the .ini file, creates the general windows,
  initializes plugins and addons, and passes control to mainloop.c.
- The module mainloop.c defines the basic message loop of the program.
  It works similarly to the "GetMessage - DispatchMessage" loop from
  the Win3.1 SDK, with an implemented callback function.
- After receiving EXIT, the message loop returns control to the main
  function. The main function then deinitializes the program, saves
  variables to the .ini file, disconnects plugins and addons, destroys
  all global objects and terminates.
All other modules are auxiliary to this loop.
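The fetch-and-dispatch pattern described above can be sketched in a few
lines of C. This is only an illustration of the idea, not the code from
mainloop.c: the event codes, the event_queue type and every function
name below are hypothetical.

```c
#include <stddef.h>

/* Hypothetical event codes; the real identifiers live in the BIEW sources. */
enum { EV_NONE = 0, EV_KEY = 1, EV_EXIT = 2 };

/* Callback type: receives one event and does the work for it. */
typedef void (*event_callback)(int event, void *user);

/* Stand-in for the event source: hands out queued events, then EV_EXIT. */
typedef struct {
    const int *events;
    size_t count;
    size_t pos;
} event_queue;

static int get_event(event_queue *q)
{
    return q->pos < q->count ? q->events[q->pos++] : EV_EXIT;
}

/* Fetch-and-dispatch loop: runs until EV_EXIT arrives and returns the
   number of dispatched events, so the caller (main) can go on with
   deinitialization. */
static int message_loop(event_queue *q, event_callback cb, void *user)
{
    int dispatched = 0;
    for (;;) {
        int ev = get_event(q);
        if (ev == EV_EXIT)
            break;
        cb(ev, user);
        dispatched++;
    }
    return dispatched;
}

/* Example callback: counts key events in the int that `user` points to. */
static void count_keys(int event, void *user)
{
    if (event == EV_KEY)
        ++*(int *)user;
}
```

The loop itself knows nothing about what an event means; all behavior
sits in the callback, which is what lets the other modules stay
auxiliary to it.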
1.2. Understanding plugins and addons
-------------------------------------
There are three classes of plugins:
- autolevel
- I level
- II level
Autolevel plugins are basically designed to support a given file
format. Plugins of this class are automatic. This means:
- you cannot see or find them in menus, dialogs, etc.
- you cannot activate such plugins on demand.
During initialization BIEW queries the autolevel plugins about the
format of the given file and makes the first plugin that understands
this format the server for the file. This plugin becomes active; all
other plugins of this class become inactive. The only way to see which
autolevel plugin is currently active is to call the "File information"
utility and look at the detected file format.
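The "first plugin that understands the format wins" rule can be
sketched as follows. The format_plugin structure, the check functions
and the plugin names are assumptions made for this illustration; the
real interface is declared in the project headers and may look
different.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical autolevel plugin descriptor (not the real BIEW one). */
typedef struct {
    const char *name;
    /* Returns nonzero if the plugin recognizes the file contents. */
    int (*check_format)(const unsigned char *data, size_t size);
} format_plugin;

static int is_mz(const unsigned char *d, size_t n)
{
    return n >= 2 && d[0] == 'M' && d[1] == 'Z';
}

static int is_elf(const unsigned char *d, size_t n)
{
    return n >= 4 && memcmp(d, "\177ELF", 4) == 0;
}

/* Fallback that accepts anything, in the spirit of a plain binary plugin. */
static int is_any(const unsigned char *d, size_t n)
{
    (void)d;
    (void)n;
    return 1;
}

/* Order matters: the first plugin that answers "yes" becomes the server. */
static const format_plugin plugins[] = {
    { "MZ",     is_mz  },
    { "ELF",    is_elf },
    { "Binary", is_any }
};

static const format_plugin *detect_format(const unsigned char *data,
                                          size_t size)
{
    size_t i;
    for (i = 0; i < sizeof plugins / sizeof plugins[0]; i++)
        if (plugins[i].check_format(data, size))
            return &plugins[i];
    return NULL; /* unreachable while a catch-all entry is registered */
}
```

Because the catch-all entry is last, a specific format always gets the
chance to claim the file before the generic one does.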
Plugins of I level are the "workhorses" of the project. They provide
the main view modes of a file. Plugins of this class (like those of II
level) are not automatic. This means:
- you can see and find them in menus, dialogs, etc.
- you can activate such plugins on demand.
During initialization BIEW queries the I-level plugins to find out which
one is able to handle the given file. The main goal of this query is to
separate text files from binary files. In any case, I-level plugins can
be selected and activated via the menu. When an I-level plugin is
activated by the program or the user, it queries and initializes
II-level plugins, which communicate only with I-level plugins.
Eventually, III-level plugins that communicate with II-level plugins
can be added, and so on ad infinitum.
An example to illustrate the differences between autolevel, I-level and
II-level plugins:
1. We have opened a file in Windows PE format.
2. The plugin for the PE format is accepted as the server for the binary
structure of this file. (In principle, we should be able to view this
file as NE or any other format, but a user who knows about this can
edit the signature of the file and restart the program or reread the
file. So basically such a feature is meaningless.)
3. How to view this file. The program detects the file as binary and
automatically assigns the binary mode viewer. But the user must be able
to change the viewing mode, and he can do this via the menu.
4. The user selects disassembler mode. In general, the PE format is
designed for the Intel-32 architecture; the I-level disassembler plugin
finds this out by querying the active autolevel plugin (PE format) and
assigns the ix86 disassembler to this file by default. But the user
must be able to change the disassembler to any other, and he can do
this via the menu. In this case, however, all such work is performed by
the I-level plugin (the disassembler plugin).
5. The user wants to resolve all internal references in the viewed file
and get a full disassembler dump of the file. The I-level plugin does
this job. It communicates with the autolevel plugin, which knows this
format. In this case the I-level plugin mediates between the II-level
plugin and the autolevel plugin. By the way, the II-level plugin is
also aware of reference resolving, but it sends all its requests to the
I-level plugin.
6. The user wants to save the result to a file. The file utility also
sends all its requests to the I-level plugin, and this plugin does all
the work of connecting the plugins of other levels, but outputs the
result to a file stream instead of the screen.
Summary: The role of I-level plugins is to implement a common interface
between the program and plugins of other levels, and to contain common
functions for plugins of the next level, which isolate such plugins
from some system specific features.
In C++ terms, a plugin of level I provides a base class interface for
plugins of the next level.
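In plain C, such a "base class" relationship is commonly expressed with
structures of function pointers. The sketch below assumes a made-up
level1_services structure and a trivial II-level decoder; none of these
names come from the BIEW headers.

```c
#include <stdio.h>
#include <stddef.h>

/* Services a hypothetical I-level plugin offers to its II-level plugins. */
typedef struct {
    /* Returns the byte of the viewed file at `offset`, or -1 past the end. */
    int (*read_byte)(size_t offset);
} level1_services;

/* Hypothetical II-level plugin: it sees the file only through the
   services of its I-level host. */
typedef struct {
    const char *name;
    /* Writes a text form of the byte at `offset` into `out` and returns
       the number of bytes consumed (0 at end of data). */
    size_t (*decode)(const level1_services *host, size_t offset,
                     char *out, size_t out_size);
} level2_plugin;

/* Stand-in for the viewed file. */
static const unsigned char file_image[] = { 0x90, 0xC3 };

static int host_read_byte(size_t offset)
{
    return offset < sizeof file_image ? file_image[offset] : -1;
}

/* A "null" decoder that only dumps bytes. */
static size_t null_decode(const level1_services *host, size_t offset,
                          char *out, size_t out_size)
{
    int b = host->read_byte(offset);
    if (b < 0)
        return 0;
    snprintf(out, out_size, "db %02Xh", (unsigned)b);
    return 1;
}

static const level1_services services = { host_read_byte };
static const level2_plugin null_disassembler = { "null", null_decode };
```

The II-level plugin never touches the file or the OS directly, which is
the isolation from system specific features that the summary mentions.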
In contrast to plugins, addons are modules that do not change the
properties of the program and only add useful features to it. They are
always selected via the menu at the user's discretion. The programming
interface of addons is quite simple and can be understood from the
source code of the program.
1.3. Why biewlib
----------------
As said above, there are two reasons why biewlib exists:
- The project was started in a 16-bit DOS environment with a poor
  development kit
- Portability to non-POSIX systems
Biewlib does not claim to replace libc or other libraries.
Unfortunately, many functions in biewlib were created to work around
bugs in standard libraries and discrepancies between the behavior of
their functions and the documentation. Some parts of biewlib were
created to revive dead or abandoned libraries. There were historical
reasons for this. When the project was started, there were many
companies with popular products that are unknown today. Some of them
are gone now, others have been acquired. So, biewlib consists of:
- bbio     - a facility for caching binary streams. (The family of
             functions working with FILE * streams contains errors
             when working with binary files in many IDEs, at least
             in early versions.)
- biewlib  - contains functions which are implemented differently in
             POSIX and MS-compatible libraries, and a small (but
             growing) kit of useful functions that are absent from
             all C libraries.
- file_ini - a unique library for handling .ini files.
- pmalloc  - implements some useful mechanisms for working with
             memory in the preemptive mode. (Maybe I am wrong, but I
             did not find any facility in the standard libraries
             that signals the program about a lack of memory in the
             system.)
- twin     - a window library that is functionally compatible with
             the window library of TopSpeed C JPI (window.h), which
             the project targeted from the beginning. (It seemed to
             me easier to create a full analogue of this library
             than to port the project to another window library. In
             addition, I had my own window library; I have expanded
             it vastly and made it portable.)
- sysdep/  - everything located in this subdirectory is needed to
             implement the low-level interface to the operating
             system. I hope everything here is clear. There may be
             only one question: why fileio.c? Strange, but even the
             port of gcc to OS/2, emx-0.9c(d), has errors when
             working with files through open, read, write and close.
             Consequently fileio (like all the others) guarantees
             stable work of the project independently of the IDE.
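As an illustration of the pmalloc idea (telling the program about a
lack of memory), here is one possible reading of it: an allocation
wrapper that calls a registered hook when memory runs out and retries
if the hook freed something. All names are hypothetical, and the real
pmalloc may work quite differently.

```c
#include <stdlib.h>

/* Hypothetical hook type: returns nonzero if it managed to free memory. */
typedef int (*low_memory_hook)(size_t needed);

/* The raw allocator is replaceable so that failures can be simulated. */
typedef void *(*raw_allocator)(size_t size);

static low_memory_hook hook;
static raw_allocator raw_alloc = malloc;

static void set_low_memory_hook(low_memory_hook h)
{
    hook = h;
}

/* Allocates memory; on failure notifies the program through the hook and
   retries for as long as the hook reports that it released something. */
static void *notify_malloc(size_t size)
{
    void *p = raw_alloc(size);
    while (p == NULL && hook != NULL && hook(size))
        p = raw_alloc(size);
    return p;
}

/* Demo part: an emergency reserve that the hook gives back on demand,
   and an allocator that fails a fixed number of times. */
static void *reserve;
static int failures_left;

static void *flaky_alloc(size_t size)
{
    if (failures_left > 0) {
        failures_left--;
        return NULL;
    }
    return malloc(size);
}

static int release_reserve(size_t needed)
{
    (void)needed;
    if (reserve == NULL)
        return 0;
    free(reserve);
    reserve = NULL;
    return 1;
}
```

The point of the hook is that the program, not the library, decides
what can be sacrificed when memory is short.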
Certainly, BIEW uses standard functions from C libraries, but in
general these are ANSI-compatible functions (ISO is too narrow for such
a project, and POSIX is implemented with bugs in some versions of C
libraries). In any case, using this subset of functions in BIEW does
not cause problems when recompiling it under MSC, Watcom, TopSpeed and
many ports of gcc. However, if some function does cause trouble, it is
better to take its implementation from a stable open source project
(example: qsort and lfind, which are taken from djgpp) or to find and
fix the error manually than to reject the development system.
2. How to expand or port the project
====================================
In general, the project can be extended in three ways:
- Adding and registering a new plugin.
- Adding and registering a new addon.
- Porting the project to a new computing architecture and/or OS.
All other cases amount to rewriting the project ideology and/or fixing
bugs.
2.1. Creating plugins and addons
--------------------------------
To create a new plugin or addon, you need to create an object whose
interface is implemented through the corresponding structure. Although
this is not the ideal way to extend the features of such objects in the
future, this scheme was chosen as the means of interaction between
program pieces. In addition, it does not exclude implementing objects
as external modules in the manner of shared (dynamically linked)
libraries, etc.
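A structure-based interface of this kind might look roughly like the
sketch below for an addon. The addon_reg layout, its field names and
the sample addon are invented for illustration; the authoritative
declarations are the ones in the project headers.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical addon descriptor (the real one is declared by BIEW). */
typedef struct {
    const char *menu_name;   /* text shown in the menu */
    void (*init)(void);      /* optional, may be NULL */
    int  (*action)(int arg); /* runs when the user selects the addon */
    void (*destroy)(void);   /* optional, may be NULL */
} addon_reg;

/* Sample addon: counts the set bits of its argument. */
static int bit_count(int value)
{
    unsigned u = (unsigned)value;
    int n = 0;
    while (u != 0) {
        n += (int)(u & 1u);
        u >>= 1;
    }
    return n;
}

static const addon_reg bit_counter = { "Bit counter", NULL, bit_count, NULL };

/* Registering an addon means adding its descriptor to a table. */
static const addon_reg *const addon_table[] = { &bit_counter };

/* Looks up an addon by its menu name; returns NULL if it is not registered. */
static const addon_reg *find_addon(const char *menu_name)
{
    size_t i;
    for (i = 0; i < sizeof addon_table / sizeof addon_table[0]; i++)
        if (strcmp(addon_table[i]->menu_name, menu_name) == 0)
            return addon_table[i];
    return NULL;
}
```

Since the program touches an addon only through its descriptor, the
same table could later be filled from dynamically loaded modules.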
To create plugins of any level you can always start from the empty
template that is provided for each level:
- autolevel: plugins/bin/bin.c
- level I: plugins/binmode.c
- level II:
  - disassembler: plugins/disasm/null_da.c
  - national language support: plugins/nls/russian.c
Practically all of these files (except russian.c) contain the minimum
level of functionality that is allowed for objects of their class.
The file reg_form.h contains the declarations of the interfaces, with
full descriptions, for plugins (except level II and above) and addons.
The interface for level II plugins is described in the corresponding
header files of the level I plugins.
As a rule, all files that implement interfaces must be located in the
corresponding thematic directories and contain no more than one
interface per file. In the future this will allow them to be detached
into separate modules.
After writing the source code, one last thing remains: add your file(s)
to the corresponding makefile(s) so that your code gets built.
2.2. Project porting
--------------------
Porting the project is one of the easiest tasks. All system dependent
parts of the project are located in the directory 'biewlib/sysdep'. Its
subdirectories are structured as CPU/OS. The exception to this rule is
the 'generic' subdirectory, which holds code that can be used on any
platform. This subdirectory also contains the subdirectories 'posix'
and 'unix'. 'posix' contains an implementation of functions that are
common to all completely POSIX-compatible operating systems. POSIX
cannot be completely implemented in terms of itself, so a compilation
using just TARGET_OS=posix will not complete. The directory 'unix'
contains an implementation of all functions for fully UNIX-compatible
operating systems. Most UNIX systems are fundamentally very similar.
Unfortunately, no unified standard like POSIX exists today for
implementing console (or, for that matter, graphics) applications; and
even if one appears, many existing operating systems (or at least
development systems) will remain incompatible with it. For this reason,
it was necessary to implement the invented standard separately for each
operating system. (If an OS is fully UNIX-compatible, the project
hardly needs porting, though a "native" port could perform much
better.)
All system dependent functions are well documented in the corresponding
header files, which are located in biewlib and its subdirectories.
If during porting some file can be taken from an existing
implementation, this is easily accomplished by including that file from
a newly created one.
Example:
/biewlib/sysdep/ia32/linux/fileio.c contains a single line:
#include "biewlib/sysdep/generic/linux/fileio.c"
This is how "symbolic links" that are portable to any file system are
implemented.
After writing the source code, one last thing remains: add your file(s)
to the corresponding makefile(s) so that your code gets built.
3. Source code notes
====================
The history of the project begins in the DOS environment. The project
was born in DOS and existed there for a long time. For this reason, the
following rules of source code development should be followed:
- All sources of the project are written in the ibm-866 code page
  (equivalent to ibm-437 plus Russian letters).
- All file names must follow the 8.3 model, with no symlinks, etc. (the
  project builds in the DOS environment).
3.1. Source location
--------------------
The directory where the sources are placed does not matter. All paths
used in the project are relative.
3.2. Source layout
------------------
The hierarchy of the source tree is very simple. The top level is the
directory where the source code is located. The following picture shows
the source tree layout:
/            - top level; contains the entry point routine and common
               utilities
  addons     - contains various addons
    sys      - contains system related addons
    tools    - contains general addons
  biewlib    - contains the library named biewlib
    sysdep   - contains all system and OS dependent files, structured
               as CPU/OS
  bin_rc     - contains various ready to use binary and text files
  doc        - contains documentation
  hlp        - contains project help and some help sources
  mk_files   - contains various makefiles for non-GNU make utilities
  plugins    - contains various plugins
    bin      - contains plugins of autolevel
    disasm   - contains various disassemblers
    nls      - contains national language support
  testlab    - contains various test routines and files
  tools      - contains auxiliary utilities
3.3. GNU Makefile
-----------------
The build process is driven by a makefile that uses features of the GNU
make utility. The makefile is not very complex, and you will probably
want to try to understand it. All rules are defined in the file
makefile.inc. The makefile includes this file and does not itself
contain any OS or CPU specific information. All you have to do when
porting the project is to add OS and CPU specific sections to
makefile.inc, using the existing sections as a template. Then you can
change TARGET_OS and TARGET_PLATFORM to the preferred values.
4. Optimization notes
=====================
Optimizing any program requires a separate analysis. Each program has
its own bottlenecks which require special optimization for a noticeable
speedup.
Of course, BIEW is an interactive program, so exchanging information
with the console is a bottleneck for it. In any case, the best tool for
finding such places in each particular case is a profiler, but some
functions are already known as potential "brakes" for the project, and
optimizing them can speed up the project considerably:
- __MsGetPos
- __vioGet(Set)CursorPos (it is cached when using twGet(Set)CursorPos)
Thus, the best implementation of these and other console functions uses
internal flags that signal a change of state of the objects observed by
the function, which change their state asynchronously.
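The flag-based caching described above can be sketched as follows. The
slow_* functions stand in for expensive console calls, and all names
and the call counter are made up for this illustration; this is not the
actual twin code.

```c
/* Pretend console state and a counter of expensive queries (demo only). */
static int console_x = 3, console_y = 7;
static int slow_calls;

/* Stand-ins for slow low-level console functions. */
static void slow_get_cursor(int *x, int *y)
{
    slow_calls++;
    *x = console_x;
    *y = console_y;
}

static void slow_set_cursor(int x, int y)
{
    console_x = x;
    console_y = y;
}

/* Cached position plus a flag that says whether the cache can be trusted. */
static int cached_x, cached_y;
static int cache_valid;

/* Must be called by whatever may move the cursor behind our back. */
static void cursor_invalidate(void)
{
    cache_valid = 0;
}

static void cursor_set(int x, int y)
{
    if (cache_valid && x == cached_x && y == cached_y)
        return; /* already there, skip the slow call */
    slow_set_cursor(x, y);
    cached_x = x;
    cached_y = y;
    cache_valid = 1;
}

static void cursor_get(int *x, int *y)
{
    if (!cache_valid) {
        slow_get_cursor(&cached_x, &cached_y);
        cache_valid = 1;
    }
    *x = cached_x;
    *y = cached_y;
}
```

The cache is only as good as its invalidation: every code path that
changes the console state asynchronously has to clear the flag.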
Optimizing the non-interactive parts of the program requires special
measurements with a profiler.
See also 4.3
4.1. Few words about calling conventions
------------------------------------------
For some reason the majority of programmers think that the cdecl
calling convention is the destiny of the C language. This is perfectly
understandable: K&R created the C language at the earliest stage of the
computer industry's evolution. The accepted calling convention is built
around supporting a variable number of arguments (...). But first, the
share of such functions in a program is very small, and second, this
was the first edition of the standard, which was later properly
extended by the ANSI committee up to call modifiers. I do not want to
consider these questions for non-Intel architectures, which may have
their own particularities, though I am inclined to think that the
fastcall convention can be used on those platforms as well. But since
the main working platforms for the project are the ia16 and ia32
architectures (at the time of writing), this kind of optimization is
reasonable. One could certainly object that, with the appearance of
branch prediction, pipelines, etc. in new processors, the effect of
such changes is lost. But first, this does not mean that the program
will not be run on earlier processor models; second, the smaller code
size (resulting from such optimization) always makes more effective use
of processor caches; and third, it is not evident that pipelines and
prediction reduce the optimization effect to zero: they minimize the
cost of cdecl, but some difference still remains.
Be that as it may, the project defines the macro __FASTCALL__, which is
used in a large part of the project. Although (theoretically) it can be
redefined to any value, it is better to use it when developing new
functions.
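For illustration, a macro of this kind could be defined along the lines
below. This is a sketch under assumptions, not the definition used by
BIEW: the compiler checks and the chosen attributes are examples, and
the function add3 is hypothetical.

```c
/* Sketch only: one way a project-wide calling convention macro could be
   defined. The real BIEW definition may differ. */
#ifndef __FASTCALL__
#  if defined(__GNUC__) && defined(__i386__)
     /* GCC on ia32: pass up to three integer arguments in registers. */
#    define __FASTCALL__ __attribute__((regparm(3)))
#  elif defined(_MSC_VER) && defined(_M_IX86)
     /* Microsoft compilers on ia32. */
#    define __FASTCALL__ __fastcall
#  else
     /* Unknown compiler or CPU: fall back to the default convention. */
#    define __FASTCALL__
#  endif
#endif

/* New functions simply carry the macro in their declaration. */
static int __FASTCALL__ add3(int a, int b, int c)
{
    return a + b + c;
}
```

Keeping the convention behind one macro means a port to a platform
without register-based calls only has to define it as empty.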
What the Watcom manual says about calling conventions:
__cdecl:
Defines the calling convention used by Microsoft compilers.
Notes:
1. All symbols are preceded by an underscore character.
2. Arguments are pushed on the stack from right to left. That
is, the last argument is pushed first. The calling routine
will remove the arguments from the stack.
3. Floating-point values are returned in the same way as
structures. When a structure is returned, the called routine
returns a pointer in register AX/EAX to the return value
which is stored in the data segment (DGROUP).
(NK: In 32-bit version floating-point values are returned in
80x87 register ST(0)).
4. For the 16-bit compiler, registers AX, BX, CX and DX, and
segment register ES are not saved and restored when a call is
made.
5. For the 32-bit compiler, registers EAX, ECX and EDX are not
saved and restored when a call is made.
__stdcall:
(32-bit only) The __stdcall keyword may be used with function
definitions, and indicates that the 32-bit Win32 calling convention
is to be used.
Notes:
1. All symbols are preceded by an underscore character.
2. All C symbols (extern "C" symbols in C++) are suffixed by
"@nnn" where "nnn" is the sum of the argument sizes (each
size is rounded up to a multiple of 4 bytes so that char and
short are size 4). When the argument list contains "...", the
"@nnn" suffix is omitted.
3. Arguments are pushed on the stack from right to left. That
is, the last argument is pushed first. The called routine
will remove the arguments from the stack.
4. When a structure is returned, the caller allocates space on
the stack. The address of the allocated space will be pushed
on the stack immediately before the call instruction. Upon
returning from the call, register EAX will contain address of
the space allocated for the return value. Floating-point
values are returned in 80x87 register ST(0).
5. Registers EAX, ECX and EDX are not saved and restored when a
call is made.
__syscall:
(32-bit only) The __syscall keyword may be used with function
definitions, and indicates that the calling convention used is
compatible with functions provided by 32-bit OS/2.
Notes:
1. Symbol names are not modified, that is, they are not adorned
with leading or trailing underscores.
2. Arguments are pushed on the stack from right to left. That
is, the last argument is pushed first. The calling routine
will remove the arguments from the stack.
3. When a structure is returned, the caller allocates space on
the stack. The address of the allocated space will be pushed
on the stack immediately before the call instruction. Upon
returning from the call, register EAX will contain address of
the space allocated for the return value. Floating-point
values are returned in 80x87 register ST(0).
4. Registers EAX, ECX and EDX are not saved and restored when a
call is made.
__pascal:
(16-bit only) Defines the calling convention used by OS/2 1.x and
Windows 3.x API functions.
Notes:
1. All symbols are mapped to upper case.
2. Arguments are pushed on the stack in reverse order. That is,
the first argument is pushed first, the second argument is
pushed next, and so on. The routine being called will remove
the arguments from the stack.
3. Floating-point values are returned in the same way as
structures. When a structure is returned, the caller
allocates space on the stack. The address of the allocated
space will be pushed on the stack immediately before the call
instruction. Upon returning from the call, register AX will
contain address of the space allocated for the return value.
4. Registers AX, BX, CX and DX, and segment register ES are not
saved and restored when a call is made.
Author's notes:
__fastcall:
Different compilers implement it differently.
Notes:
1. Function names are either not modified or are adorned with a
trailing underscore.
2. Arguments are passed through registers (E)AX, (E)BX, (E)CX
and (E)DX. If the number of arguments is greater than the number
of registers, the remaining arguments are pushed on the stack
from right to left. That is, the last argument is pushed
first. The called routine will remove the arguments from the
stack. In some implementations floating-point values are
passed through registers (ST(0)-ST(2(5))) of the 80x87
coprocessor.
3. In some implementations small structures are returned through
processor registers. Floating-point values are returned in
80x87 register ST(0). When a big structure is returned, the
caller allocates space on the stack. The address of the
allocated space will be pushed on the stack immediately
before the call instruction. Upon returning from the call,
register AX will contain the address of the space allocated
for the return value.
4. All registers, except those which contain return values,
are saved and do not require restoring after the call is made.
4.2. Source code optimization notes
-----------------------------------
The material below can be interpreted in different ways: as a manual on
how not to code, or as an ultimate guideline (while reading it, it
seemed very strange to me that this work is left to the programmer and
not to the processor). In any case, these are official recommendations
from the leaders of the processor industry for personal computers, and
they should be well known.
Intel P-III manual says: Athlon-K7 manual says:
(System Programming Guide, Order (Publication # 22007 Rev: D
Number 243192 (pages 443 and Issue Date: August 1999 (pages 21 and
below)): below)):
[14.1.1. General Code Optimization C Source Level Optimizations:
Guidelines]
Write code that can be optimized by This chapter details C programming
the compiler. For example: practices for optimizing code for the
AMD Athlon processor:
*******************************************************************************
- Minimize the use of global Avoid frequently de-referencing pointer
variables, pointers, and complex arguments inside a function. Since the
control flow statements. compiler has no knowledge of whether
aliasing exists between the pointers,
such de-referencing can not be optimized
away by the compiler. This prevents data
from being kept in registers and
significantly increases memory traffic.
- Do not use the "register" modifier. --------------[ N/A ]----------------
- Use the "const" modifier. Use the "const" type qualifier as much as
possible. This optimization makes code
more robust and may enable higher
performance code to be generated due to
the additional information available to
the compiler. For example, the C standard
allows compilers to not allocate storage
for objects that are declared const, if
their address is never taken.
- Do not defeat the typing system. --------------[ N/A ]----------------
- Do not make indirect calls. --------------[ N/A ]----------------
- Sign Extension is usually quite If possible, use unsigned integer types
expensive. over signed integer types. The unsigned
types convey to the compiler that data
cannot be negative, which allows some
optimizations not possible with signed
and potentially negative data.
--------------[ N/A ]---------------- Optimize Switch Statements:
recommended to sort the cases of a switch
statement according to the probability of
occurrences, with the most probable first.
int days_in_month, short_months,
normal_months, long_months;
switch (days_in_month) {
case 31: long_months++; break;
case 30: normal_months++; break;
case 28:
case 29: short_months++; break;
default: printf ("month has fewer"
"than 28 or more than 31 days\n");
}
--------------[ N/A ]---------------- Declare Local Functions as Static:
Functions that are not used outside the
file in which they are defined should
always be declared static, which forces
internal linkage. Otherwise, such
functions default to external linkage,
which might inhibit certain optimizations
with some compilers for example,
aggressive inlining.
--------------[ N/A ]---------------- Use Prototypes for All Functions:
Prototypes can convey additional
information to the compiler that might
enable more aggressive optimizations.
- For best performance, make sure C Language Structure Component Considerations
that data structures and arrays - Sort structure members according to
greater than 32 bytes, are 32-byte their base type size, declaring members
aligned, and that access patterns with a larger base type size ahead of
to data structures and arrays do members with a smaller base type size.
not break the alignment rules. - Pad the structure to a multiple of the
largest base type size of any member:
struct {
double x;
long k;
char a[5];
char pad[7]; } baz;
- ALIGNMENT OF DATA ON THE STACK Sort Local Variables According to Base
Use static variables instead of Type Size:
dynamic (stack) variables. When a compiler allocates local variables
On the Pentium processor, accessing in the same order in which they are
64-bit variables that are not 8-byte declared in the source code, it can be
aligned will cost an extra 3 clocks. helpful to declare local variables in
On the P6 family processors, such a manner that variables with a
accessing a 64-bit variable will larger base type size are declared ahead
cause a data cache split. of the variables with smaller base type
size:
double z[3];
double x, y;
long foo, bar;
float baz;
short ga, gu, gi;
- Use minimum sizes for integer and Use 32-bit data types for integer code.
floating-point data types, to Compiler implementations vary, but
enable SIMD parallelism. typically the following data types are
included: int, signed, signed int,
unsigned, unsigned int, long, signed long,
long int, signed long int, unsigned long,
and unsigned long int.
--------------[ N/A ]---------------- Avoid Unnecessary Integer Division:
Integer division is the slowest of all
integer arithmetic operations and should
be avoided wherever possible.
- Unroll all very short loops. Loops Complete unrolling reduces register
that execute for less than 2 clocks pressure by removing the loop counter.
waste loop overhead. To completely unroll a loop, remove the
loop control and replicate the loop body
N times. In addition, completely
unrolling a loop increases scheduling
opportunities. Only unrolling very large
code loops can result in the inefficient
use of the L1 instruction cache.
--------------[ N/A ]---------------- Always Inline Functions with Fewer Than
25 Machine Instructions
- Pay attention to the branch Place branch targets at or near the
prediction algorithm for the target beginning of 16-byte aligned code windows.
processor. This optimization is This technique helps to maximize the number
particularly important for P6 family of instructions that are filled into the
processors. Code that optimizes instruction-byte queue.
branch predict-ability will spend
fewer clocks fetching instructions.
- Take advantage of the SIMD Use 3DNow! Instructions:
capabilities of MMX(TM) technology and Unless accuracy requirements dictate
Streaming SIMD Extensions. otherwise, perform floating-point
computations using the 3DNow! instructions
instead of x87 instructions. The SIMD
nature of 3DNow! instructions achieves
twice the number of FLOPs that are
achieved through x87 instructions. 3DNow!
instructions also provide for a flat
register file instead of the stack-based
approach of x87 instructions.
- Avoid partial register stalls: Avoid Partial Register Reads and Writes:
On P6 family processors, when a In order to handle partial register
large (32-bit) general-purpose writes, the AMD Athlon processor
register is read immediately after execution core implements a data-merging
a small register (8- or 16-bit) scheme. In the execution unit, an
that is contained in the large instruction writing a partial register
register has been written, the read merges the modified portion with the
is stalled until the write retires current state of the remainder of the
(a minimum of 7 clocks). Consider register. Therefore, the dependency
the example below: hardware can potentially force a false
MOV AX, 8 dependency on the most recent instruction
ADD ECX, EAX ;Partial stall that writes to any part of the register.
;occurs on access of Example 1 (Avoid):
;the EAX register MOV AL, 10 ;inst 1
Here, the first instruction moves MOV AH, 12 ;inst 2.
the value 8 into the small register Inst 2 has a false dependency on inst 1.
AX. The next instruction accesses the Inst 2 merges new AH with current EAX
large register EAX. This code sequence register value forwarded by inst 1. In
results in a partial register stall.
addition, an instruction that has a read
Pentium(R) and Intel486(TM) processors dependency on any part of a given
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ architectural register has a read
do not generate this stall. dependency on the most recent instruction
^^^^^^^^^^^^^^^^^^^^^^^^^^^ that modifies any part of the same
architectural register.
Example 2 (Avoid):
MOV BX, 12h ;inst 1
MOV BL, DL ;inst 2
MOV BH, CL ;inst 3
MOV AL, BL ;inst 4
Inst 2 has a false dependency on completion
of inst 1. Inst 3 has a false dependency on
completion of inst 2. Inst 4 depends on
completion of inst 2.
- Align all data. Avoid Memory Size Mismatches:
- Align 8-bit data on any boundary - Align 8-bit data on any boundary
- Align 16-bit data to be contained - WORD accesses are aligned if they
within an aligned 4-byte word. access an address divisible by 2.
- Align 32-bit data on any boundary - DWORD accesses are aligned if they
that is a multiple of 4. access an address divisible by 4.
- Align 64-bit data on any boundary - QWORD accesses are aligned if they
that is a multiple of 8. access an address divisible by 8.
- Align 80-bit and 128-bit data on - TBYTE accesses are aligned if they
a 128-bit boundary (that is, any access an address divisible by 16.
boundary that is a multiple of 16
bytes).
- A loop entry label should be --------------[ N/A ]----------------
16-byte aligned when it is less
than 8 bytes away from that
boundary.
- A label that follows a conditional --------------[ N/A ]----------------
branch should not be aligned.
- A label that follows an --------------[ N/A ]----------------
unconditional branch or function
call should be 16-byte aligned when
it is less than 8 bytes away from
that boundary.
Alignment penalties: Avoid misaligned data references. A
- On a Pentium(R) processor, a misaligned store or load operation
misaligned access costs 3 clocks. suffers a minimum 1-cycle penalty in
- On a P6 family processor, a the AMD Athlon processor load/store
misaligned access that crosses a pipeline.
cache line boundary costs 6 to 9
clocks.
- On a P6 family processor,
unaligned accesses that cause a
data cache split stall the
processor. A data cache split is
a memory access that crosses a
32-byte cache line boundary.
- Dynamic Allocation Using MALLOC Dynamic Memory Allocation Consideration:
When using dynamic allocation, check Dynamic memory allocation (malloc in C
that the compiler aligns doubleword language) should always return a pointer
or quadword values on 8-byte that is suitably aligned for the largest
boundaries. If the compiler does not base type (quadword alignment). Where
implement this alignment, then use this aligned pointer cannot be guaranteed,
the following technique to align use the technique shown in the following
doublewords and quadwords for code to make the pointer quadword aligned,
optimum code execution: if needed. This code assumes the pointer
1. Allocate memory equal to the size can be cast to a long. Example:
of the array or structure plus 4
bytes.
2. Use a bitwise AND to make sure
that the array is aligned, for
example:
double a[5]; double* p;
double *p, *newp; double* np;
p = (double*)malloc p = (double *)malloc
((sizeof(double)*5)+4) (sizeof(double)*number_of_doubles+7L);
newp = (p+4) & (~7) np = (double *)((((long)(p))+7L)&(-8L));
- Organize code to minimize !!!!!!!! [SAME ] !!!!!!!!!!!!!!!!!!!!!
instruction cache misses and
optimize instruction prefetches.
- Avoid prefixed opcodes other than --------------[ N/A ]----------------
0FH.
- Use software pipelining. !!!!!!!! [SAME ] !!!!!!!!!!!!!!!!!!!!!
- Always pair CALL and RET (return) --------------[ N/A ]----------------
instructions.
- Avoid self-modifying code. !!!!!!!! [SAME ] !!!!!!!!!!!!!!!!!!!!!
- Do not place data in the code !!!!!!!! [SAME ] !!!!!!!!!!!!!!!!!!!!!
segment.
- Avoid instructions that contain 4 or Use Short Instruction Lengths:
more micro-ops or instructions that Assemblers and compilers should generate
are more than 7 bytes long. If the tightest code possible to optimize
^^^^^^^^^^^^^^^^^^^^^^^ use of the I-Cache and increase average
possible, use instructions that decode rate. Wherever possible, use
require 1 micro-op. instructions with shorter lengths.
Pentium(R) processors without MMX(TM) Using shorter instructions increases the
technology do not execute a set of number of instructions that can fit into
paired instructions if either the instruction-byte queue.
instruction is longer than 7 bytes;
Pentium(R) processors with MMX(TM) avoid:
technology do not execute a set of 81 C3 FB FF FF FF: add ebx, -5
paired instructions if the first preferred:
instruction is longer than 11 bytes 83 C3 FB: add ebx, -5
or the second instruction is longer
than 7 bytes. Prefixes are not
^^^^^^^^
counted.
The P6 family processors have 3
decoders that translate Intel
Architecture macro instructions into
micro operations (micro-ops, also
called "uops"). The decoder
limitations are as follows:
The first decoder (decoder 0) can
decode instructions up to 7 bytes in
^^^^^^^^^^^^^
length and with up to 4 micro-ops in
one clock cycle. The second two
decoders (decoders 1 and 2) can
decode instructions that are 1 micro-
op instructions, and these
instructions will also be decoded in
one clock cycle.
So, for best performance on all
Intel processors, use simple
instructions that are less than 8
bytes in length.
[14.1.2. Guidelines for Optimizing
MMX(TM) Code]
- Do not intermix MMX(TM) instructions There is no penalty for switching
and floating-point instructions. between x87 FPU and 3DNow!/MMX
instructions in the AMD Athlon processor.
- Use the opcode reg, mem instruction Avoid Address Generation Interlocks:
format whenever possible. This It is advantageous to schedule loads and
format helps to free registers and stores that can calculate their addresses
reduce clocks without generating quickly, ahead of loads and stores that
unnecessary loads. require the resolution of a long
dependency chain in order to generate
their addresses.
- Put an EMMS instruction at the end The FEMMS instruction should be used to
of all MMX(TM) code sections that you ensure the same code also runs optimally
know will transition to floating- on the AMD-K6 processor. The FEMMS
point code. instruction is supported for backward
compatibility with the AMD-K6 processor,
and is aliased to the EMMS instruction.
[14.1.3. Guidelines for Optimizing
Floating-Point Code]
- Understand how the compiler handles - Use Multiplies Rather Than Divides
floating-point code. Look at the - Use FFREEP to Pop One Register from the
assembly dump and see what FPU Stack
transforms are already performed on - For branches that are dependent on
the program. Study the loop nests in floating-point comparisons, use the
the application that dominate the FCOMI/FCOMIP/FUCOMI/FUCOMIP instructions.
execution time. Determine why the These instructions are much faster than
compiler is not creating the fastest the classical approach using FSTSW,
code. For example, look for because FSTSW is essentially a
dependences that can be resolved by serializing instruction on the AMD
rearranging code Athlon processor. When FSTSW cannot
be avoided (for example, backward
compatibility of code with older
processors), no FPU instruction should
occur between an FCOM[P], FICOM[P],
FUCOM[P], or FTST and a dependent
FSTSW. This optimization allows the use
of a fast forwarding mechanism for the
FPU condition codes internal to the AMD
Athlon processor FPU and increases
performance.
- Look for and correct situations Ensure All FPU Data is Aligned:
known to cause slow execution of
floating-point code, such as:
- Large memory bandwidth Misaligned memory accesses reduce the
requirements. available memory bandwidth.
- Poor cache locality.
- Long-latency floating-point
arithmetic operations.
- Do not use more precision than is Avoid Using Extended-Precision Data:
necessary. Single precision Store data as either single-precision or
(32-bits) is faster on some double-precision quantities. Loading and
operations and consumes only half storing extended-precision data is
the memory space as double precision comparatively slower.
(64-bits) or double extended
(80-bits).
- Use a library that provides fast Minimize Floating-Point-to-Integer
floating-point to integer routines. Conversions
Many library routines do more work
than is necessary.
- Ensure whenever possible that --------------[ N/A ]----------------
computations stay in range. Out of
range numbers cause very high
overhead.
- Schedule code in assembly language Use the FXCH instruction rather than
using the FXCH instruction. When FST/FLD pairs:
possible, unroll loops and pipeline Using FXCH is preferred over the use of
code. FST/FLD pairs, even if the FST/FLD pair
works on a register. An FST/FLD pair adds
two cycles of latency and consists of two
OPs.
- Perform transformations to improve --------------[ N/A ]----------------
memory access patterns. Use loop
fusion or compression to keep as
much of the computation in the cache
as possible.
- Break dependency chains. --------------[ N/A ]----------------
[14.6.3. Write Allocation Effects]
P6 family processors have a "write --------------[ N/A ]----------------
allocate by read-for-ownership" cache,
whereas the Pentium(R) processor has a
"no-write-allocate; write through on
write miss" cache.
boolean array[max];
/* In this example "boolean" is a "char" array. It may well be
   better to make the "boolean" array into an array of bits,
   thereby reducing the size of the array, which in turn
   reduces the number of cache line fetches. */
for(i=2;i<max;i++) {
    array[i] = 1;
}
for(i=2;i<max;i++) {
    if( array[i] ) {
        for(j=2*i;j<max;j+=i) {
            /* Check whether the value is already zero before
               writing; this reduces the number of writes to
               memory (dirty cache lines). */
            if( array[j] != 0 ) {
                array[j] = 0;
            }
        }
    }
}
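The comment in the sieve above suggests packing the flags into bits so that fewer cache lines are touched. A minimal sketch of that idea follows, combined with the check-before-write test. The names count_primes_below, bit_test and bit_clear are hypothetical and do not come from the quoted manuals; the function returns 0 both for tiny inputs and on allocation failure, which is a simplification.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helpers: one flag per bit instead of one per char. */
static int bit_test(const unsigned char *bits, size_t n)
{
    return (bits[n / CHAR_BIT] >> (n % CHAR_BIT)) & 1;
}

static void bit_clear(unsigned char *bits, size_t n)
{
    bits[n / CHAR_BIT] &= (unsigned char)~(1u << (n % CHAR_BIT));
}

/* Counts the primes below max with a bit-packed sieve.
   Returns 0 if max <= 2 or if the allocation fails. */
size_t count_primes_below(size_t max)
{
    size_t nbytes = (max + CHAR_BIT - 1) / CHAR_BIT;
    size_t i, j, count = 0;
    unsigned char *bits;

    if (max <= 2)
        return 0;
    bits = malloc(nbytes);
    if (bits == NULL)
        return 0;
    memset(bits, 0xFF, nbytes);          /* mark every number as a candidate */

    for (i = 2; i < max; i++) {
        if (!bit_test(bits, i))
            continue;
        count++;
        for (j = 2 * i; j < max; j += i) {
            /* Write only if the flag is still set, so cache lines
               that are already correct are not dirtied again. */
            if (bit_test(bits, j))
                bit_clear(bits, j);
        }
    }
    free(bits);
    assert(count >= 1);                  /* 2 is always counted when max > 2 */
    return count;
}
```

With one bit per flag the array is CHAR_BIT times smaller than the char version, so the inner loop touches correspondingly fewer cache lines.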
4.3. Performance tests
======================
When development of the 5.3 branch began, my main goal was to speed up
the project. I ran many different tests to find the "thin" spots
(bottlenecks) in the program, and the following comparison table is
the result.
TEST:
All tests were performed on the same computer (AMD K6-200, 128 MB). I
was interested in relative rather than absolute numbers (it would have
been better to run such tests on an i386-40 machine, but I was unable
to find such a rarity).
For the test I took kernel32.dll and used the following disassembler
modes: Reference prediction (Ctrl-F8), Local offsets (Ctrl-F6),
Save as... (Shift-F10) mode of assembler (F2), put structures (F4).
+-----------------------------------------------------------------+
| Operating system      | 5.2.0 mode  | MMF        | MMF + ^Break |
+-----------------------+-------------+------------+--------------+
| Linux-2.2.17-pre.14   | 0 m 58 sec  | 0 m 14 sec | 0 m 07 sec   |
| WinNT4.0+SP4          | 1 m 07 sec  | 0 m 28 sec | 0 m 07 sec   |
| OS/2(WSeB+fp1+UNI_060)| 4 m 57 sec  | 3 m 42 sec | 0 m 08(19)sec|
| DOS32 on WinNT        | 2 m 30 sec  | N/A        | N/A          |
| DOS32 on OS/2         | 3 m 13 sec  | N/A        | N/A          |
+-----------------------------------------------------------------+
Abbreviations:
5.2.0 mode - the mode present in version 5.2.0.
MMF        - Memory Mapped Files, a technique for accessing files
             like ordinary RAM (also known as mmap).
^Break     - Version 5.3 implements a new keyboard polling scheme. In
             5.2.0 the user pressed Escape to interrupt the current
             operation (for example, a dump sequence); in 5.3 Escape
             was replaced with Ctrl+Break, because several operating
             systems deliver it as an asynchronous signal rather than
             through the usual keyboard functions.
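To illustrate the MMF entry, here is a minimal sketch of reading a file through a memory mapping, assuming a POSIX system (open, fstat, mmap). It is not the program's own code; the function name count_byte_mmap is hypothetical and the task (counting one byte value) is chosen only to show that the file is scanned like an ordinary array.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical example: counts how often `value` occurs in the file at
   `path` by mapping it into memory. Returns -1 on any error. */
long count_byte_mmap(const char *path, unsigned char value)
{
    struct stat st;
    const unsigned char *data;
    long count = 0;
    off_t i;
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {               /* a zero-length mapping is invalid */
        close(fd);
        return 0;
    }
    data = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                           /* the mapping stays valid after close */
    if (data == MAP_FAILED)
        return -1;

    /* The file is now addressed like ordinary RAM: no read() calls,
       the OS pages the data in on demand. */
    for (i = 0; i < st.st_size; i++)
        if (data[i] == value)
            count++;

    munmap((void *)data, (size_t)st.st_size);
    return count;
}
```

Other systems offer equivalent facilities under different names, so a port would need its own variant of this routine.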
The numbers in a few words:
Up to 5.3.0-pre.2 the OS/2 port of the program was the slowest, but
with the new techniques in place it takes a respectable position in
the comparison. The number in parentheses is the run time at the first
start: it does not matter how long the OS has been running before the
program is started, only that the program is launched for the first
time. The problem is that on the second start the output file is
already cached. The 8-second figure is interesting for performance
probing, but in practice a user performs such an operation only once.
Conclusion: whatever was said above about OS/2 (though it is no longer
a serious product), this example shows that optimizing a single
program block does not always improve performance enough, even if
that block obviously needs optimization. Here the effect came from
optimizing two program blocks; neither of them alone gave such a
strong result (37 times faster between the extreme values).
So, when porting the project to other operating systems, remember
that all functions located in the system-dependent subfolders need
careful attention during implementation, since they can cause serious
performance degradation (for example, it would be wrong to emulate
the asynchronous Ctrl+Break signal through keyboard functions).
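The difference between polling the keyboard and reacting to an asynchronous signal can be sketched as follows. This is only an illustration in standard C, not the program's own code: SIGINT is used as a stand-in because the signal that Ctrl+Break maps to is platform-specific, and the names break_requested, install_break_handler and dump_items are hypothetical.

```c
#include <assert.h>
#include <signal.h>

/* Set by the signal handler, read by the worker loop. */
static volatile sig_atomic_t break_requested = 0;

static void on_break(int signo)
{
    (void)signo;
    break_requested = 1;     /* only set a flag; do no real work here */
}

/* Installs the handler. Returns 0 on success, -1 on failure. */
int install_break_handler(void)
{
    return signal(SIGINT, on_break) == SIG_ERR ? -1 : 0;
}

/* Processes up to `total` items and returns how many were completed
   before an interrupt was requested. */
long dump_items(long total)
{
    long done;

    assert(total >= 0);
    for (done = 0; done < total; done++) {
        /* Testing a flag is far cheaper than calling a keyboard
           function on every iteration. */
        if (break_requested)
            break;
        /* ... write one line of the dump here ... */
    }
    return done;
}
```

The loop never calls into the keyboard layer, so the cost of supporting an interrupt is one memory read per iteration.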
SOME REMARKS ON GCC: I ran the same test with the fastcall calling
convention, but the file dumps produced by the fastcall and cdecl
builds still differ. By the way, the cdecl-optimized build produces
correct dumps (the same as builds made with other compilers). I am
inclined to consider this a gcc bug, since I have no other
explanation.
5. Final chapter
================
That's all!!!