WARNING! This manual (English version only) may contain lexical and
grammatical errors. The author does not claim a perfect command of
technical English. If any part of this document is unclear, please
contact the author. The author has made every effort to keep the
document comprehensible and to use the most common terms and
expressions so that readers can understand it unambiguously.
(Suggestions, feedback and bug reports are gladly accepted.)

                         BIEW Internals
                =============================

     This manual documents the internal architecture of BIEW and
contains development notes. Its main goal is to explain how to extend
the project and to help you understand it. The manual assumes that you
are familiar with the C programming language and basic programming
concepts.


Table of Contents
=================
0. Preamble
1. Hierarchical structure
1.1. How it works
1.2. Understanding of plugins and addons
1.3. Why biewlib
2. How to expand or port the project
2.1. Creating plugins and addons
2.2. Project porting
3. Source code notes
3.1. Placing of sources
3.2. Source layout
3.3. GNU Makefile
4. Optimization notes
4.1. A few words about calling conventions
4.2. Source code optimization notes
5. Final chapter

0. Preamble
===========
   BIEW is a modular project based on plugins and addons. Any plugin
or addon can easily be added to or removed from the project. To
interact with the OS and the hardware, BIEW uses its own library named
biewlib. There are two reasons why biewlib exists:
- The project was born in a 16-bit DOS environment with a poor
  development kit
- Portability to non-POSIX systems

1. Hierarchical structure
=========================
                                    Plugins of auto level:
                                    +--------------------+
                               +----| All files in       |
           +-----------+       |    | plugins/bin        |
           | Various   |       |    +--------------------+
           | addons    |       |             ^
           +-----------+       |             :             Plugins of II level:
                 |             |             v              +---------------+
                 |             |     Plugins of I level:    |plugins/nls    |
   biew lib:     | Base level: |     +-------------+      +-|??? in future  |
 +-------------+ +-*============*    |  binmode.c  |      | +---------------+
 | OS and CPU  |===# biew.c     #----|  hexmode.c  |      | +----------------+
 | dependent   |===# mainloop.c #    |  textmode.c |------+ | various        |
 | library     |   *============*    |  disasm.c   |--------| disassemblers  |
 +-------------+         |           +-------------+        | plugins/disasm |
                         |                                  +----------------+
                         |   
             +----------------------+
             |   biew utilities:    |
             | biewutil, bin_util   |
             | bconsole, biewhelp,  |
             | events, ...          |
             +----------------------+

1.1. How it works
-----------------
   If you want to study in detail how the sources interact, install
DOXYGEN: the comments in the project sources are written for it. In
short, the interaction is as follows:

- At program start, control reaches the main function, which is
  defined in biew.c. Here the program initializes itself, parses the
  command line, reads the .ini file, creates the main windows,
  initializes plugins and addons and passes control to mainloop.c.
- The module mainloop.c defines the basic message loop of the program.
  It works like the "GetMessage - DispatchMessage" loop from the
  Win3.1 SDK, with a callback function that handles each message.
- After receiving the EXIT message, the loop returns control to the
  main function. The main function then deinitializes the program,
  saves variables to the .ini file, disconnects plugins and addons,
  destroys all global objects and terminates execution.

All other modules are auxiliary to this loop.
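
The loop described above can be sketched roughly as follows. This is
only an illustration of the "get message, dispatch to callback, stop
on EXIT" idea: the event codes, the type names and run_main_loop are
hypothetical and are not taken from mainloop.c.

```c
#include <stddef.h>

/* Hypothetical event codes; the real ones are defined by BIEW. */
enum event { EV_KEY, EV_REDRAW, EV_EXIT };

/* Hypothetical event source ("GetMessage") and callback. */
typedef enum event (*event_source)(void *ctx);
typedef void (*event_handler)(enum event ev, void *ctx);

/* Runs until the source delivers EV_EXIT.
   Returns the number of events passed to the callback. */
size_t run_main_loop(event_source next, event_handler handle, void *ctx)
{
    size_t dispatched = 0;
    for (;;) {
        enum event ev = next(ctx);   /* wait for the next event     */
        if (ev == EV_EXIT)
            break;                   /* control returns to main()   */
        handle(ev, ctx);             /* "DispatchMessage" step      */
        dispatched++;
    }
    return dispatched;
}

/* Demo context: a scripted queue of events instead of a keyboard. */
struct script {
    const enum event *events;
    size_t pos;       /* next event to deliver        */
    size_t handled;   /* events seen by the callback  */
};

enum event script_next(void *ctx)
{
    struct script *s = ctx;
    return s->events[s->pos++];
}

void script_handle(enum event ev, void *ctx)
{
    struct script *s = ctx;
    (void)ev;
    s->handled++;
}
```

In this sketch main would call run_main_loop after initialization and
perform the cleanup steps once it returns.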

1.2. Understanding of plugins and addons
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    The project currently has three classes of plugins:
- autolevel
- I level
- II level

   Autolevel plugins are designed mainly for file format support.
Plugins of this class are automatic. This means:
- You cannot see or find them in menus, prompts, etc.
- You cannot activate such plugins yourself.
During initialization BIEW polls the autolevel plugins about the
format of the file being viewed and assigns the first plugin that
recognizes the format as the server of the opened file. This plugin is
switched to the active state. All other plugins of this class are
switched to the unplugged state. The only way to see which autolevel
plugin is active is to call the "File information" utility and read
the information about the detected file format.
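
A minimal sketch of such a poll, assuming each autolevel plugin
exposes a format check through a structure of function pointers. The
structure, the function names and the three sample formats are
invented for illustration; the real interface is declared in
reg_form.h.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor of an autolevel plugin. */
struct bin_plugin {
    const char *name;
    /* Returns nonzero if the plugin recognizes the file header. */
    int (*check_format)(const unsigned char *hdr, size_t len);
};

static int is_mz(const unsigned char *h, size_t n)
{
    return n >= 2 && h[0] == 'M' && h[1] == 'Z';
}

static int is_elf(const unsigned char *h, size_t n)
{
    return n >= 4 && memcmp(h, "\177ELF", 4) == 0;
}

/* Assumed fallback: a plain binary plugin that accepts anything. */
static int is_raw(const unsigned char *h, size_t n)
{
    (void)h; (void)n;
    return 1;
}

/* Order matters: the first plugin that answers "yes" wins. */
static const struct bin_plugin plugins[] = {
    { "MZ executable", is_mz  },
    { "ELF",           is_elf },
    { "binary",        is_raw },
};

/* Polls the plugins in order and returns the first match,
   or NULL if no plugin recognizes the header. */
const struct bin_plugin *detect_format(const unsigned char *hdr, size_t len)
{
    size_t i;
    for (i = 0; i < sizeof plugins / sizeof plugins[0]; i++)
        if (plugins[i].check_format(hdr, len))
            return &plugins[i];
    return NULL;
}
```

The plugin returned by detect_format would become the active one; all
others stay unplugged.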

  Plugins of I level are the workhorses of the project. They provide
the main view modes of a file. Plugins of this class, like those of II
level, are not automatic. This means:
- You can see and find them in menus, prompts, etc.
- You can activate such plugins yourself.

During initialization BIEW polls the plugins of I level to find the
first one that can handle the file being viewed. The main goal of this
poll is to try to separate text files from binary files. In any case,
plugins of I level can be selected and activated through the menu.
When a plugin of I level is activated by the program or by the user,
it polls and initializes the plugins of II level, which are connected
only to plugins of I level.

In principle, plugins of level III could be added in the future; they
would be connected only to plugins of II level, and so on ad
infinitum.

An example that illustrates the differences between plugins of
autolevel, I level and II level:
1. We have opened a file in Windows PE format.
2. The plugin for the PE format is accepted as the server for the
binary structure of this file. (In principle, it should be possible to
view this file as NE or any other format, but a user who needs this
can edit the signature of the file and restart the program or reread
the file. In practice such a feature is of little use.)
3. How should the file be displayed? The program detects the file as
binary and automatically assigns the binary mode viewer. The viewing
mode must be changeable, however, and the user can change it through
the menu.
4. The user selects disassembler mode. The PE format is generally
designed for the Intel 32-bit architecture. The disassembler plugin of
I level learns this by querying the active autolevel plugin (PE
format) and assigns the ix86 disassembler to this file by default. It
must still be possible to switch to any other disassembler, and the
user can do this through the menu. In that case too, all the work is
performed by the plugin of I level (the disassembler plugin).
5. The user wants to resolve all internal references in the file and
get a full disassembly dump. The plugin of I level performs this work.
It communicates with the autolevel plugin, which knows the format. In
this case the plugin of I level mediates between the plugin of II
level and the autolevel plugin. The plugin of II level is also aware
of reference resolving, but it sends all its requests to the plugin of
I level.
6. The user wants to save the result to a file. The file utility also
sends all its requests to the plugin of I level, and this plugin does
all the work of connecting the plugins of the other levels, but sends
the result to a file stream instead of the screen.
   Summary: the role of plugins of I level is to implement a common
interface between the program and plugins of other levels, and to
contain common functions for plugins of the next level that abstract
those plugins from media-specific details.
   In C++ terms, a plugin of I level provides the base class interface
for plugins of the next level.
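
The "base class" role can be pictured in plain C as a structure of
function pointers that the plugin of I level drives. The sketch below
is hypothetical: struct disasm_plugin, null_decode and disasm_dump are
invented names, and the toy decoder merely prints every byte as a
"db" pseudo-instruction. It shows how the plugin of I level owns the
output, so the same plugin of II level can feed the screen or a file.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical interface that a plugin of II level implements. */
struct disasm_plugin {
    const char *name;
    /* Decodes one instruction into out; returns its length in
       bytes, or 0 if nothing could be decoded. */
    size_t (*decode)(const unsigned char *code, size_t len,
                     char *out, size_t outsz);
};

/* Toy plugin of II level: every byte is a "db" pseudo-instruction. */
static size_t null_decode(const unsigned char *code, size_t len,
                          char *out, size_t outsz)
{
    if (len == 0)
        return 0;
    snprintf(out, outsz, "db 0x%02X", (unsigned)code[0]);
    return 1;
}

static const struct disasm_plugin null_da = { "null", null_decode };

/* Plugin of I level: walks the code, asks the active plugin of II
   level to decode, and writes each line to the given sink (screen
   or file). Returns the number of lines written. */
size_t disasm_dump(const struct disasm_plugin *da,
                   const unsigned char *code, size_t len, FILE *sink)
{
    char line[64];
    size_t off = 0, lines = 0;

    while (off < len) {
        size_t n = da->decode(code + off, len - off, line, sizeof line);
        if (n == 0)
            break;
        fprintf(sink, "%08lX: %s\n", (unsigned long)off, line);
        off += n;
        lines++;
    }
    return lines;
}
```

Passing stdout or a FILE * opened for writing as the sink corresponds
to steps 5 and 6 of the example.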

  In contrast to plugins, addons are modules that do not change the
properties of the program and only add useful features to it. They
are always selected by the user through the menu. The programming
interface of addons is very simple and can be understood from the
source code of the program.

1.3. Why biewlib
~~~~~~~~~~~~~~~~~~
   As described above, there are two reasons why biewlib exists:
- The project was born in a 16-bit DOS environment with a poor
  development kit
- Portability to non-POSIX systems
  Biewlib does not claim to replace the C library or the many other
existing libraries. Unfortunately, many parts of biewlib were born as
a way of working around errors in standard libraries and mismatches
between the behaviour of their functions and the documentation. Some
parts of biewlib were born to revive dead or abandoned libraries.
There are historical reasons for this. When the project was born there
were many companies with popular products that are unknown today. Some
of them have vanished and others have been acquired. Divided into
parts, biewlib consists of the following:
- bbio     - a facility for caching binary streams. (The family of
             functions that work with FILE * streams contains errors
             in many IDEs, at least in early versions, when working
             with binary files.)
- biewlib  - contains functions that are implemented differently in
             POSIX and MS-compatible libraries, and a small but
             growing kit of useful utilities that are absent from all
             C libraries.
- file_ini - a unique library for working with .ini files.
- pmalloc  - implements some useful mechanisms for working with
             memory in preemptive mode. (I may be wrong, but I did not
             find any facility in the standard libraries that signals
             to the program that the system is short of memory.)
- twin     - a window library that is functionally compatible with
             the window library of TopSpeed C JPI (window.h), for
             which the project was designed from the beginning. (It
             seemed much easier to me to create a full analogue of
             this library than to port the project to another window
             library. In addition, I had my own window library, which
             I greatly extended and made portable.)
- sysdep/  - everything located in this subdirectory is needed to
             implement the low-level interface to the operating
             system. I hope everything here is clear. Perhaps only one
             question remains: why fileio.c? Strange as it is, even
             the port of gcc to OS/2, emx-0.9c(d), has an error when
             working with files through open, read, write, close.
             Consequently fileio (like all the others) guarantees
             stable operation of the project independently of the
             IDE.
   Certainly, biew uses standard functions from C libraries, but
these are mostly ANSI-compatible functions (ISO is too narrow for a
project such as biew, and POSIX is implemented with bugs in many
versions of C libraries). In any case, the subset of functions used in
biew causes no problems when recompiling it with MSC, Watcom, TopSpeed
and many ports of gcc. Of course, if the behaviour of some function is
suspect, it is better to take its implementation from a stable open
source project (for example qsort and lfind, which were taken from the
djgpp project) than to reject the development system or to search for
and fix the error by hand.


2. How to expand or port the project
====================================

    In general the project can be extended in three ways:
- Adding and registering a new plugin.
- Adding and registering a new addon.
- Porting the project to a new computing architecture and/or OS.
    Anything else is either a change to the design of the project or a
fix for mistakes and errors.

2.1. Creating plugins and addons
--------------------------------

   To create a new plugin or addon you must create an object whose
interface is implemented through the corresponding structure. This
approach is not ideal for future extension of such objects, but it was
chosen as the way the parts of the program interact. It also does not
rule out implementing objects as external modules, such as dynamically
linked libraries.
  To create a plugin of any level you can always start from the empty
template that is provided for each level:
- autolevel: plugins/bin/bin.c
- level I: plugins/binmode.c
- level II:
  - disassembler:              plugins/disasm/null_da.c
  - national language support: plugins/nls/russian.c
  Practically all of these files (except russian.c) contain the
minimum level of functionality that is allowed for objects of their
class.
  The file reg_form.h contains the declarations of the interfaces for
plugins (except those of level II and above) and addons, together with
their full description.
  The interface for plugins of II level is described in the
corresponding header files of the plugins of level I.
  By convention, all files that implement interfaces must be located
in the corresponding thematic directories and must contain no more
than one interface per file. In the future this will make it possible
to detach them into separate modules.
  After writing the source code there is one last thing to do: update
the makefiles that will be used to build the project.
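
As a rough illustration of "an object whose interface is implemented
through a structure", the sketch below defines a made-up addon and
registers it in a table. None of the names (struct addon, doubler,
addon_table, find_addon) come from reg_form.h; the real structure has
its own fields and should be taken from that header.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical addon interface; see reg_form.h for the real one. */
struct addon {
    const char *menu_name;    /* text shown in the menu             */
    void (*init)(void);       /* optional, may be NULL              */
    int  (*action)(int arg);  /* runs when the user picks the addon */
    void (*term)(void);       /* optional, may be NULL              */
};

/* The addon itself: by convention one interface per source file. */
static int doubler_action(int arg)
{
    return arg * 2;
}

static const struct addon doubler = {
    "Doubler", NULL, doubler_action, NULL
};

/* Registration: a new addon adds one entry to this table. */
static const struct addon *const addon_table[] = { &doubler };

/* Looks an addon up by its menu name; returns NULL if not found. */
const struct addon *find_addon(const char *menu_name)
{
    size_t i;
    for (i = 0; i < sizeof addon_table / sizeof addon_table[0]; i++)
        if (strcmp(addon_table[i]->menu_name, menu_name) == 0)
            return addon_table[i];
    return NULL;
}
```

Because the program only sees the structure, the same object could
later be moved into a dynamically linked module without changing its
callers.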

2.2. Project porting
--------------------

   Porting the project is one of the easiest tasks. All system
dependent parts of the project are located in the directory
'biewlib/sysdep'. The subdirectories are structured as CPU/OS. The
exception to this rule is the 'generic' subdirectory, which holds code
that can be used on any platform. This subdirectory also contains the
subdirectories 'posix' and 'unix'. The directory 'posix' contains the
implementation of functions that are common to all fully
POSIX-compatible operating systems. POSIX cannot be completely
implemented in terms of itself, so a build that uses just
TARGET_OS=posix cannot be complete. The directory 'unix' contains the
implementation of all functions for fully Unix-compatible operating
systems. Most Unix systems are fundamentally very similar.
Unfortunately, no unified standard like POSIX exists today for
implementing console (or, for that matter, graphical) applications,
and even if one appears, many existing operating systems (or at least
development systems) will not be compatible with it. For this reason
the invented standard had to be implemented separately for each
operating system. (If the OS is fully Unix-compatible, the project
hardly needs porting, though hardly anybody will be happy to wait for
the program to respond in this model.)
   All system dependent functions are well documented in the
corresponding header files, which are located in biewlib and its
subdirectories.
   If some file can be taken from an existing implementation during
porting, this is easily done by including that file in a newly created
one.
Example:
The file /biewlib/sysdep/ia32/linux/fileio.c contains a single line:
#include "biewlib/sysdep/generic/linux/fileio.c"
   In this way "symbolic links" that are portable to any file system
are implemented. After writing the source code there is one last thing
to do: update the makefiles that will be used to build the project.

3. Source code notes
====================

   The project began in a DOS environment and existed only there for
a long time. For this reason the following source code rules are
accepted:
- All sources of the project are written in the ibm-866 code page
  (i.e. ibm-437 + Russian letters).
- All file names must follow the 8.3 model, with no symlinks etc. (the
  project is built in a DOS environment).

3.1. Placing of sources
-----------------------

  The directory where the sources are placed does not matter. All
paths used in the project are relative.

3.2. Source layout
------------------

  The hierarchy of the source tree is very simple. The top level is
the directory where the source code is located. The following listing
illustrates the source tree layout:

/          - top level contains entry point routine and common utilities
addons     - contains various addons
 sys       - contains system related addons
 tools     - contains general addons
biewlib    - contains library named biewlib
 sysdep    - contains all system and OS dependent files, structured as CPU/OS
bin_rc     - contains various ready to use binary and text files
doc        - contains documentation
hlp        - contains project help and some help sources
mk_files   - contains various makefiles for non GNU make utilities
plugins    - contains various plugins
 bin       - contains plugins of autolevel
 disasm    - contains various disassemblers
 nls       - contains national language support
testlab    - contains various test routines and files
tools      - contains auxiliary utilities

3.3. GNU Makefile
-----------------

  The build process is driven by a makefile that uses features of the
GNU make utility. The makefile is not very complex, and you will
probably want to try to understand it. All rules are defined in the
file makefile.inc. The makefile includes this file and does not itself
contain any OS or CPU specific information. All you have to do when
porting the project is add OS and CPU specific sections to
makefile.inc, using the existing sections as a template. After that
you can set TARGET_OS and TARGET_PLATFORM to the preferred values.

4. Optimization notes
=====================

   Optimizing any program requires a separate analysis. Each program
has its own bottlenecks that need special optimization for a
noticeable speedup.
   Of course biew is an interactive program, so exchanging information
with the console is a bottleneck for it. In any case, the best tool
for finding such places in each particular case is a profiler, but
some functions are already known as possible brakes on the project,
and optimizing them can speed it up considerably:
- __MsGetPos
- __vioGet(Set)CursorPos (it is cached if twGet(Set)CursorPos is used
  instead)
  Thus the best implementation of these and other console functions
would use internal flags that signal a change in the state of the
objects observed by the function, since those objects change their
state asynchronously.
   Optimizing the non-interactive part of the program requires
dedicated measurements with a profiler.
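
A minimal sketch of the flag idea, assuming a slow low-level query
that is called only when a validity flag says the cached value may be
stale. All names here are hypothetical stand-ins and not the biewlib
API; slow_get_cursor plays the role of a function such as
__vioGetCursorPos.

```c
/* Stand-in for the real console state and the slow query. */
static int hw_x = 3, hw_y = 7;
static unsigned slow_calls;        /* counts expensive queries */

static void slow_get_cursor(int *x, int *y)
{
    slow_calls++;
    *x = hw_x;
    *y = hw_y;
}

/* Cached copy and the flag that guards it. */
static int cached_x, cached_y;
static int cache_valid;            /* 0 = state may have changed */

/* Called by whatever changes the cursor asynchronously. */
void cursor_invalidate(void)
{
    cache_valid = 0;
}

/* Setting the position updates the cache as a side effect. */
void cursor_set(int x, int y)
{
    hw_x = x;
    hw_y = y;
    cached_x = x;
    cached_y = y;
    cache_valid = 1;
}

/* Queries the console only when the flag says the cache is stale. */
void cursor_get(int *x, int *y)
{
    if (!cache_valid) {
        slow_get_cursor(&cached_x, &cached_y);
        cache_valid = 1;
    }
    *x = cached_x;
    *y = cached_y;
}

/* Exposed so the saving can be observed. */
unsigned cursor_slow_calls(void)
{
    return slow_calls;
}
```

Repeated cursor_get calls then cost one slow query until something
calls cursor_invalidate.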

4.1. A few words about calling conventions
------------------------------------------

   How did it happen that most programmers consider the cdecl calling
convention to be the destiny of the C language? It is perfectly
understandable that K&R created the C language at the earliest stage
of the computer industry's evolution. The basic idea behind the
accepted calling convention is support for a variable number of
arguments (...). But first, the proportion of such functions in a
program is very small, and second, this was the first edition of the
standard, which was later extended by the ANSI committee up to call
modifiers. I do not want to consider these questions for non-Intel
architectures. They may have their own particularities, though I am
convinced that a fast calling convention can be used on those
platforms too. But since the working platforms of the program at the
time of writing are the ia16 and ia32 architectures, this kind of
optimization is meaningful. One can certainly object that with the
appearance of branch prediction, pipelines, etc. in the newest
processors the effect of such changes is lost. But first, this does
not mean that the program will not run on earlier processor models.
Second, the smaller code size produced by this optimization always
makes better use of processor caches. Third, it is not a fact that
pipelines and prediction reduce the effect of the optimization to
zero: they minimize the cost of cdecl, but a difference remains in any
case.
   Be that as it may, the code defines the macro __FASTCALL__, with
which a large part of the project is annotated. In principle it can be
redefined to any value, but it is better to use it when developing new
functions.
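
One possible way to define and use such a macro is sketched below. The
compiler tests and the chosen expansions are assumptions made for this
example; the definition used by the project may differ. On compilers
or targets that are not listed the macro expands to nothing, which
leaves the default calling convention in place.

```c
/* Hypothetical definition; not copied from the BIEW sources. */
#ifndef __FASTCALL__
#  if defined(__GNUC__) && defined(__i386__)
     /* GCC on ia32: pass up to three arguments in registers. */
#    define __FASTCALL__ __attribute__((regparm(3)))
#  elif defined(_MSC_VER) && defined(_M_IX86)
#    define __FASTCALL__ __fastcall
#  else
     /* Unknown compiler or target: keep the default convention. */
#    define __FASTCALL__
#  endif
#endif

/* The macro goes between the return type and the function name,
   in the prototype as well as in the definition. */
int __FASTCALL__ add3(int a, int b, int c);

int __FASTCALL__ add3(int a, int b, int c)
{
    return a + b + c;
}
```

Callers do not change: add3(1, 2, 3) is written the same way whichever
convention the macro selects.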

What the Watcom manual says about calling conventions:

__cdecl:
Defines the calling convention used by Microsoft compilers.

Notes:
1.      All symbols are preceded by an underscore character.
2.      Arguments are pushed on the stack from  right  to  left.  That
        is, the last argument is pushed  first.  The  calling  routine
        will remove the arguments from the stack.
3.      Floating-point  values  are  returned  in  the  same  way   as
        structures. When a structure is returned, the  called  routine
        returns a pointer in  register  AX/EAX  to  the  return  value
        which is stored in the data segment (DGROUP).
        (NK: In 32-bit version floating-point values are  returned  in
             80x87 register ST(0)).
4.      For the 16-bit compiler, registers AX,  BX,  CX  and  DX,  and
        segment register ES are not saved and restored when a call  is
        made.
5.      For the 32-bit compiler, registers EAX, ECX and  EDX  are  not
        saved and restored when a call is made.

__stdcall:
(32-bit  only)  The  __stdcall  keyword  may  be  used  with  function
definitions, and indicates that the 32-bit  Win32  calling  convention
is to be used.

Notes:

1.      All symbols are preceded by an underscore character.
2.      All C symbols (extern "C" symbols  in  C++)  are  suffixed  by
        "@nnn" where "nnn" is the sum  of  the  argument  sizes  (each
        size is rounded up to a multiple of 4 bytes so that  char  and
        short are size 4). When the argument list contains "...",  the
        "@nnn" suffix is omitted.
3.      Arguments are pushed on the stack from  right  to  left.  That
        is, the last argument is  pushed  first.  The  called  routine
        will remove the arguments from the stack.
4.      When a structure is returned, the caller  allocates  space  on
        the stack. The address of the allocated space will  be  pushed
        on the stack immediately before  the  call  instruction.  Upon
        returning from the call, register EAX will contain address  of
        the space  allocated  for  the  return  value.  Floating-point
        values are returned in 80x87 register ST(0).
5.      Registers EAX, ECX and EDX are not saved and restored  when  a
        call is made.

__syscall:
(32-bit  only)  The  __syscall  keyword  may  be  used  with  function
definitions,  and  indicates  that  the  calling  convention  used  is
compatible with functions provided by 32-bit OS/2.

Notes:
1.      Symbols names are not modified, that is, they are not  adorned
        with leading or trailing underscores.
2.      Arguments are pushed on the stack from  right  to  left.  That
        is, the last argument is pushed  first.  The  calling  routine
        will remove the arguments from the stack.
3.      When a structure is returned, the caller  allocates  space  on
        the stack. The address of the allocated space will  be  pushed
        on the stack immediately before  the  call  instruction.  Upon
        returning from the call, register EAX will contain address  of
        the space  allocated  for  the  return  value.  Floating-point
        values are returned in 80x87 register ST(0).
4.      Registers EAX, ECX and EDX are not saved and restored  when  a
        call is made.

__pascal:
(16-bit only) Defines the calling convention  used  by  OS/2  1.x  and
Windows 3.x API functions.

Notes:
1.      All symbols are mapped to upper case.
2.      Arguments are pushed on the stack in reverse order.  That  is,
        the first argument is pushed first,  the  second  argument  is
        pushed next, and so on. The routine being called  will  remove
        the arguments from the stack.
3.      Floating-point  values  are  returned  in  the  same  way   as
        structures.  When  a  structure  is   returned,   the   caller
        allocates space on the stack. The  address  of  the  allocated
        space will be pushed on the stack immediately before the  call
        instruction. Upon returning from the call,  register  AX  will
        contain address of the space allocated for the return value.
4.      Registers AX, BX, CX and DX, and segment register ES  are  not
        saved and restored when a call is made.

Author's notes:

__fastcall:
Different compilers implement it differently.

Notes:
1.      Function names are either not modified or are adorned with a
        trailing underscore.
2.      Arguments are passed through registers (E)AX, (E)BX, (E)CX
        and (E)DX. If the number of arguments is greater than the
        number of registers, the remaining arguments are pushed on
        the stack from right to left. That is, the last argument is
        pushed first. The called routine will remove the arguments
        from the stack. In some implementations floating-point values
        are passed through registers (ST(0)-ST(2(5))) of the 80x87
        coprocessor.
3.      In some implementations small structures are returned through
        processor registers. Floating-point values are returned in
        80x87 register ST(0). When a big structure is returned, the
        caller allocates space on the stack. The address of the
        allocated space will be pushed on the stack immediately
        before the call instruction. Upon returning from the call,
        register AX will contain the address of the space allocated
        for the return value.
4.      All registers, except those that contain return values, are
        preserved across the call and do not need to be restored
        after it.
 
4.2. Source code optimization notes
-----------------------------------

   The material below can be interpreted in different ways: as a
manual on how one should never program, or as guidelines to follow.
(While reading it I found it very strange that this work has to be
done by the programmer and not by the processor.) In any case, these
are official recommendations from the leaders of the PC processor
industry and they should be well known.

Intel P-III manual says:               Athlon-K7 manual says:

(System Programming Guide, Order       (Publication # 22007 Rev: D
Number 243192 (pages 443 and           Issue Date: August 1999 (pages 21 and
below)):                               below)):

 [14.1.1. General Code Optimization    C Source Level Optimizations:
 Guidelines]

Write code that can be optimized by    This chapter details C programming 
the compiler. For example:             practices for optimizing code for the 
                                       AMD Athlon  processor:
*******************************************************************************
- Minimize the use of global           Avoid frequently de-referencing pointer 
  variables, pointers, and complex     arguments inside a function. Since the 
  control flow statements.             compiler has no knowledge of whether 
                                       aliasing exists between the pointers, 
                                       such de-referencing can not be optimized
                                       away by the compiler. This prevents data
                                       from being kept in registers and 
                                       significantly increases memory traffic.

- Do not use the "register" modifier.  --------------[ N/A ]----------------

- Use the "const" modifier.            Use the "const" type qualifier as much as
                                       possible. This optimization makes code
                                       more robust and may enable higher
                                       performance code to be generated due to
                                       the additional information available to
                                       the compiler. For example, the C standard
                                       allows compilers to not allocate storage
                                       for objects that are declared const, if
                                       their address is never taken.

- Do not defeat the typing system.     --------------[ N/A ]----------------

- Do not make indirect calls.          --------------[ N/A ]----------------

- Sign Extension is usually quite      If possible, use unsigned integer types
  expensive.                           over signed integer types. The unsigned
                                       types convey to the compiler that data
                                       cannot be negative, which allows some
                                       optimizations not possible with signed
                                       and potentially negative data.

--------------[ N/A ]----------------  Optimize Switch Statements:
                                       recommended to sort the cases of a switch
                                       statement according to the probability of
                                       occurrences, with the most probable first.
                                        int days_in_month, short_months,
                                            normal_months, long_months;
                                         switch (days_in_month) {
                                            case 31: long_months++; break;
                                            case 30: normal_months++; break;
                                            case 28:
                                            case 29: short_months++; break;
                                            default: printf ("month has fewer "
                                            "than 28 or more than 31 days\n");
                                         }

--------------[ N/A ]----------------  Declare Local Functions as Static:
                                       Functions that are not used outside the
                                       file in which they are defined should
                                       always be declared static, which forces
                                       internal linkage. Otherwise, such
                                       functions default to external linkage,
                                       which might inhibit certain optimizations
                                       with some compilers, for example,
                                       aggressive inlining.

--------------[ N/A ]----------------  Use Prototypes for All Functions:
                                       Prototypes can convey additional
                                       information to the compiler that might
                                       enable more aggressive optimizations.

- For best performance, make sure      C Language Structure Component Considerations
  that data structures and arrays      - Sort structure members according to
  greater than 32 bytes, are 32-byte     their base type size, declaring members
  aligned, and that access patterns      with a larger base type size ahead of
  to data structures and arrays do       members with a smaller base type size.
  not break the alignment rules.       - Pad the structure to a multiple of the
                                         largest base type size of any member:
                                         struct {
                                           double x;
                                           long k;
                                           char a[5];
                                           char pad[7]; } baz;
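
The effect of member ordering can be checked with sizeof and offsetof.
The sketch below uses two hypothetical structures; the exact sizes
depend on the compiler and target ABI, so none are stated here:

```c
#include <stddef.h>

/* Hypothetical layout with members in arbitrary order: the compiler
   typically inserts padding before 'd' and after 's'. */
struct unsorted_rec {
    char   c;
    double d;
    short  s;
};

/* Same members, largest base type first: padding is normally needed
   only at the end of the structure. */
struct sorted_rec {
    double d;
    short  s;
    char   c;
};

/* Number of bytes saved by sorting (may be 0 on some ABIs). */
size_t bytes_saved(void)
{
    return sizeof(struct unsorted_rec) - sizeof(struct sorted_rec);
}
```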

- ALIGNMENT OF DATA ON THE STACK       Sort Local Variables According to Base
  Use static variables instead of      Type Size:
  dynamic (stack) variables.           When a compiler allocates local variables
  On the Pentium processor, accessing  in the same order in which they are
  64-bit variables that are not 8-byte declared in the source code, it can be
  aligned will cost an extra 3 clocks. helpful to declare local variables in
  On the P6 family processors,         such a manner that variables with a
  accessing a 64-bit variable will     larger base type size are declared ahead
  cause a data cache split.            of the variables with smaller base type
                                       size:
                                        double z[3];
                                        double x, y;
                                        long foo, bar;
                                        float baz;
                                        short ga, gu, gi;

- Use minimum sizes for integer and    Use 32-bit data types for integer code. 
  floating-point data types, to        Compiler implementations vary, but 
  enable SIMD parallelism.             typically the following data types are
                                       included: int, signed, signed int,
                                       unsigned, unsigned int, long, signed long,
                                       long int, signed long int, unsigned long,
                                       and unsigned long int.

--------------[ N/A ]----------------  Avoid Unnecessary Integer Division:
                                       Integer division is the slowest of all
                                       integer arithmetic operations and should
                                       be avoided wherever possible.
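
Two common ways to avoid a division can be sketched as follows. The
function names are hypothetical; the second rewrite is valid only for
unsigned operands and only if b * c does not overflow:

```c
/* Hypothetical helpers, for illustration only. */

/* Division and remainder by a power of two become a shift and a mask. */
unsigned div8(unsigned x) { return x >> 3; }
unsigned mod8(unsigned x) { return x & 7u; }

/* (a / b) / c needs two divisions; a / (b * c) needs one division and
   one (much cheaper) multiplication. */
unsigned div_twice(unsigned a, unsigned b, unsigned c)
{
    return a / (b * c);
}
```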

- Unroll all very short loops. Loops   Complete unrolling reduces register
  that execute for less than 2 clocks  pressure by removing the loop counter.
  waste loop overhead.                 To completely unroll a loop, remove the
                                       loop control and replicate the loop body
                                       N times. In addition, completely
                                       unrolling a loop increases scheduling
                                       opportunities. However, unrolling very
                                       large loops can result in inefficient
                                       use of the L1 instruction cache.
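
As an illustration of complete unrolling, here is a short fixed-count
loop and its unrolled equivalent. The function names are hypothetical;
an optimizing compiler may perform this transformation by itself:

```c
/* Rolled version: loop counter, compare and branch on every pass. */
void add3_rolled(float *dst, const float *a, const float *b)
{
    int i;
    for (i = 0; i < 3; i++)
        dst[i] = a[i] + b[i];
}

/* Completely unrolled version: loop control removed, body replicated
   three times. */
void add3_unrolled(float *dst, const float *a, const float *b)
{
    dst[0] = a[0] + b[0];
    dst[1] = a[1] + b[1];
    dst[2] = a[2] + b[2];
}
```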

--------------[ N/A ]----------------  Always Inline Functions with Fewer Than
                                       25 Machine Instructions

- Pay attention to the branch          Place branch targets at or near the
  prediction algorithm for the target  beginning of 16-byte aligned code windows.
  processor. This optimization is      This technique helps to maximize the number
  particularly important for P6 family of instructions that are filled into the
  processors. Code that optimizes      instruction-byte queue.
  branch predictability will spend
  fewer clocks fetching instructions.

- Take advantage of the SIMD           Use 3DNow! Instructions:
  capabilities of MMX technology and   Unless accuracy requirements dictate
  Streaming SIMD Extensions.           otherwise, perform floating-point 
                                       computations using the 3DNow! instructions
                                       instead of x87 instructions. The SIMD 
                                       nature of 3DNow! instructions achieves
                                       twice the number of FLOPs that are
                                       achieved through x87 instructions. 3DNow!
                                       instructions also provide for a flat
                                       register file instead of the stack-based
                                       approach of x87 instructions.

- Avoid partial register stalls:       Avoid Partial Register Reads and Writes:
  On P6 family processors, when a      In order to handle partial register
  large (32-bit) general-purpose       writes, the AMD Athlon processor
  register is read immediately after   execution core implements a data-merging
  a small register (8- or 16-bit)      scheme. In the execution unit, an
  that is contained in the large       instruction writing a partial register
  register has been written, the read  merges the modified portion with the
  is stalled until the write retires   current state of the remainder of the
  (a minimum of 7 clocks). Consider    register. Therefore, the dependency 
  the example below:                   hardware can potentially force a false 
    MOV AX, 8                          dependency on the most recent instruction
    ADD ECX, EAX ;Partial stall        that writes to any part of the register.
                 ;occurs on access of  Example 1 (Avoid): 
                 ;the EAX register       MOV AL, 10 ;inst 1
  Here, the first instruction moves      MOV AH, 12 ;inst 2.
  the value 8 into the small register  Inst 2 has a false dependency on inst 1.
  AX. The next instruction accesses    Inst 2 merges new AH with current EAX
  the large register EAX. This code    register value forwarded by inst 1. In
  sequence results in a partial        addition, an instruction that has a read
  register stall.                      dependency on any part of a given
                                       architectural register has a read
  Pentium and Intel486 processors      dependency on the most recent instruction
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^      that modifies any part of the same
  do not generate this stall.          architectural register.
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                       Example 2 (Avoid):
                                         MOV BX, 12h ;inst 1
                                         MOV BL, DL  ;inst 2
                                         MOV BH, CL  ;inst 3
                                         MOV AL, BL  ;inst 4
                                       Inst 2 false dependency on completion of
                                       inst 1. Inst 3, false dependency on 
                                       completion of inst 2. Inst 4, depends on
                                       completion of inst 2.
     
- Align all data.                      Avoid Memory Size Mismatches:
 - Align 8-bit data on any boundary    - Align 8-bit data on any boundary
 - Align 16-bit data to be contained   - WORD accesses are aligned if they
   within an aligned 4-byte word.        access an address divisible by 2.
 - Align 32-bit data on any boundary   - DWORD accesses are aligned if they
   that is a multiple of 4.              access an address divisible by 4.
 - Align 64-bit data on any boundary   - QWORD accesses are aligned if they 
   that is a multiple of 8.              access an address divisible by 8.
 - Align 80-bit and 128-bit data on    - TBYTE accesses are aligned if they
   a 128-bit boundary (that is, any      access an address divisible by 16.
   boundary that is a multiple of 16 
   bytes).                             

 - A loop entry label should be        --------------[ N/A ]----------------
   16-byte aligned when it is less
   than 8 bytes  away from that
   boundary.

 - A label that follows a conditional  --------------[ N/A ]----------------
   branch should not be aligned.

 - A label that follows an             --------------[ N/A ]----------------
   unconditional branch or function
   call should be 16-byte aligned when
   it is less than 8 bytes away from
   that boundary.

 Alignment penalties:                  Avoid misaligned data references. A
 - On a Pentium processor, a           misaligned store or load operation
   misaligned access costs 3 clocks.   suffers a minimum 1-cycle penalty in
 - On a P6 family processor, a         the AMD Athlon processor load/store
   misaligned access that crosses a    pipeline.
   cache line boundary costs 6 to 9
   clocks.
 - On a P6 family processor,
   unaligned accesses that cause a
   data cache split stall the
   processor. A data cache split is
   a memory access that crosses a
   32-byte cache line boundary.
     
- Dynamic Allocation Using MALLOC      Dynamic Memory Allocation Consideration:
  When using dynamic allocation, check Dynamic memory allocation ( malloc in C
  that the compiler aligns doubleword  language) should always return a pointer
  or quadword values on 8-byte         that is suitably aligned for the largest
  boundaries. If the compiler does not base type (quadword alignment). Where
  implement this alignment, then use   this aligned pointer cannot be guaranteed,
  the following technique to align     use the technique shown in the following
  doublewords and quadwords for        code to make the pointer quadword aligned,
  optimum code execution:              if needed. This code assumes the pointer
  1. Allocate memory equal to the size can be cast to a long. Example: 
     of the array or structure plus 4
     bytes.
  2. Use "bitwise" and to make sure
     that the array is aligned, for
     example:
       double a[5];                     double* p;
       double *p, *newp;                double* np;
       p = (double*)malloc              p = (double *)malloc
         ((sizeof(double)*5)+4)           (sizeof(double)*number_of_doubles+7L);
       newp = (double*)                 np = (double *)((((long)(p))+7L)&(-8L));
         (((long)p+4) & ~7L);
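
The same technique can be sketched as a self-contained helper. This is
an illustration, not code from BIEW: the function name is hypothetical,
and it assumes a C99 compiler that provides uintptr_t in <stdint.h>
(instead of casting the pointer to long). The raw pointer must be kept,
because only it may be passed to free():

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocates room for 'count' doubles and returns
   a pointer aligned on an 8-byte boundary. '*raw' receives the
   pointer that must later be passed to free(). */
double *alloc_doubles_aligned8(size_t count, void **raw)
{
    uintptr_t addr;

    /* 7 extra bytes are enough to reach the next 8-byte boundary. */
    *raw = malloc(count * sizeof(double) + 7);
    if (*raw == NULL)
        return NULL;

    /* Round the address up to a multiple of 8. */
    addr = ((uintptr_t)*raw + 7) & ~(uintptr_t)7;
    return (double *)addr;
}
```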

- Organize code to minimize            !!!!!!!! [SAME ] !!!!!!!!!!!!!!!!!!!!!
  instruction cache misses and
  optimize instruction prefetches.

- Avoid prefixed opcodes other than    --------------[ N/A ]----------------
  0FH.

- Use software pipelining.             !!!!!!!! [SAME ] !!!!!!!!!!!!!!!!!!!!!

- Always pair CALL and RET (return)    --------------[ N/A ]----------------
  instructions.

- Avoid self-modifying code.           !!!!!!!! [SAME ] !!!!!!!!!!!!!!!!!!!!!

- Do not place data in the code        !!!!!!!! [SAME ] !!!!!!!!!!!!!!!!!!!!!
  segment.

- Avoid instructions that contain 4 or Use Short Instruction Lengths:
  more micro-ops or instructions that  Assemblers and compilers should generate
  are more than 7 bytes long. If       the tightest code possible to optimize
      ^^^^^^^^^^^^^^^^^^^^^^^          use of the I-Cache and increase average
  possible, use instructions that      decode rate. Wherever possible, use
  require 1 micro-op.                  instructions with shorter lengths.
  Pentium processors without MMX       Using shorter instructions increases the
  technology do not execute a set of   number of instructions that can fit into
  paired instructions if either        the instruction-byte queue.
  instruction is longer than 7 bytes;
  Pentium processors with MMX          avoid:
  technology do not execute a set of   81 C3 FB FF FF FF:  add ebx, -5
  paired instructions if the first     preferred:
  instruction is longer than 11 bytes  83 C3 FB:           add ebx, -5
  or the second instruction is longer
  than 7 bytes. Prefixes are not 
       ^^^^^^^^
  counted.
  The P6 family processors have 3
  decoders that translate Intel
  Architecture macro instructions into
  micro operations (micro-ops, also
  called "uops"). The decoder
  limitations are as follows:
  The first decoder (decoder 0) can
  decode instructions up to 7 bytes in
                      ^^^^^^^^^^^^^
  length and with up to 4 micro-ops in
  one clock cycle. The second two
  decoders (decoders 1 and 2) can
  decode instructions that are 1 micro-
  op instructions, and these
  instructions will also be decoded in
  one clock cycle.

  So, for best performance on all
  Intel processors, use simple
  instructions that are less than 8
  bytes in length.

 [14.1.2. Guidelines for Optimizing
 MMX Code]

- Do not intermix MMX instructions     There is no penalty for switching
  and floating-point instructions.     between x87 FPU and 3DNow!/MMX
                                       instructions in the AMD Athlon processor.

- Use the opcode reg, mem instruction  Avoid Address Generation Interlocks:
  format whenever possible. This       It is advantageous to schedule loads and
  format helps to free registers and   stores that can calculate their addresses
  reduce clocks without generating     quickly, ahead of loads and stores that
  unnecessary loads.                   require the resolution of a long
                                       dependency chain in order to generate
                                       their addresses.

- Put an EMMS instruction at the end   FEMMS instruction should be used to
  of all MMX code sections that you    ensure the same code also runs optimally
  know will transition to floating-    on the AMD-K6 processor. The FEMMS
  point code.                          instruction is supported for backward
                                       compatibility with the AMD-K6 processor,
                                       and is aliased to the EMMS instruction.

 [14.1.3. Guidelines for Optimizing
 Floating-Point Code]

- Understand how the compiler handles  - Use Multiplies Rather Than Divides
  floating-point code. Look at the     - Use FFREEP to Pop One Register from the 
  assembly dump and see what             FPU Stack
  transforms are already performed on  - For branches that are dependent on
  the program. Study the loop nests in   floating-point comparisons, use the
  the application that dominate the      FCOMI/FCOMIP/FUCOMI/FUCOMIP instructions.
  execution time. Determine why the      These instructions are much faster than
  compiler is not creating the fastest   the classical approach using FSTSW,
  code. For example, look for            because FSTSW is essentially a 
  dependences that can be resolved by    serializing instruction on the AMD
  rearranging code                       Athlon processor. When FSTSW cannot
                                         be avoided (for example, backward
                                         compatibility of code with older
                                         processors), no FPU instruction should
                                         occur between an FCOM[P], FICOM[P],
                                         FUCOM[P], or FTST and a dependent
                                         FSTSW. This optimization allows the use
                                         of a fast forwarding mechanism for the
                                         FPU condition codes internal to the AMD
                                         Athlon processor FPU and increases
                                         performance.
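
The "Use Multiplies Rather Than Divides" advice can be sketched in C as
follows. The function name is hypothetical. Note that multiplying by a
reciprocal may differ from a true division in the last bit of the
result, so this applies only where accuracy requirements allow it:

```c
/* Hypothetical helper: divides every element of v[] by d using one
   division and n multiplications instead of n divisions. */
void scale_by_reciprocal(float *v, unsigned n, float d)
{
    unsigned i;
    float r = 1.0f / d;     /* the only division */

    for (i = 0; i < n; i++)
        v[i] *= r;
}
```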

- Look for and correct situations      Ensure All FPU Data is Aligned:
  known to cause slow execution of
  floating-point code, such as:
  - Large memory bandwidth             Misaligned memory accesses reduce the
    requirements.                      available memory bandwidth.
  - Poor cache locality.
  - Long-latency floating-point
    arithmetic operations.

- Do not use more precision than is    Avoid Using Extended-Precision Data:
  necessary. Single precision          Store data as either single-precision or
  (32-bits) is faster on some          double-precision quantities. Loading and
  operations and consumes only half    storing extended-precision data is
  the memory space as double precision comparatively slower.
  (64-bits) or double extended
  (80-bits).

- Use a library that provides fast     Minimize Floating-Point-to-Integer
  floating-point to integer routines.  Conversions
  Many library routines do more work
  than is necessary.

- Ensure whenever possible that        --------------[ N/A ]----------------
  computations stay in range. Out of
  range numbers cause very high
  overhead.

- Schedule code in assembly language   Use the FXCH instruction rather than
  using the FXCH instruction. When     FST/FLD pairs:
  possible, unroll loops and pipeline  Using FXCH is preferred over the use of
  code.                                FST/FLD pairs, even if the FST/FLD pair
                                       works on a register. An FST/FLD pair adds
                                       two cycles of latency and consists of two
                                       OPs.

- Perform transformations to improve   --------------[ N/A ]----------------
  memory access patterns. Use loop
  fusion or compression to keep as
  much of the computation in the cache
  as possible.

- Break dependency chains.             --------------[ N/A ]----------------

 [14.6.3. Write Allocation Effects]
P6 family processors have a "write     --------------[ N/A ]----------------
allocate by read-for-ownership" cache,
whereas the Pentium processor has a
"no-write-allocate; write through on
write miss" cache.

 boolean array[max];
  /* "boolean" in this example is a
     "char" type. It may well be
     better to turn the "boolean"
     array into an array of bits,
     thereby reducing the size of the
     array, which in turn reduces the
     number of cache line fetches. */
 for(i=2;i<max;i++) {
   array[i] = 1;
 }
 for(i=2;i<max;i++) {
   if( array[i] ) {
     /* start at the first multiple
        of i above i itself */
     for(j=i+i;j<max;j+=i) {
       /* Check whether the value is
          already 0 before writing it,
          thereby reducing the number
          of writes to memory (dirty
          cache lines). */
       if( array[j] != 0 ) {
         array[j] = 0;
       }
     }
   }
 }
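
The bit-array variant suggested in the comment of the example could look
roughly like this. It is only a sketch: SIEVE_MAX, sieve_bits and the
helper names are hypothetical, and the array is eight times smaller
than the "char" version at the cost of a shift and a mask per access:

```c
#include <string.h>

#define SIEVE_MAX 1000u   /* hypothetical limit, for illustration */

/* One bit per number instead of one char per number. */
static unsigned char sieve_bits[(SIEVE_MAX + 7u) / 8u];

static int bit_get(unsigned i)
{
    return (sieve_bits[i >> 3] >> (i & 7u)) & 1;
}

static void bit_clear(unsigned i)
{
    sieve_bits[i >> 3] &= (unsigned char)~(1u << (i & 7u));
}

/* Returns the number of primes below SIEVE_MAX. */
unsigned count_primes(void)
{
    unsigned i, j, count = 0;

    memset(sieve_bits, 0xFF, sizeof sieve_bits);
    for (i = 2; i < SIEVE_MAX; i++) {
        if (!bit_get(i))
            continue;
        count++;
        for (j = i + i; j < SIEVE_MAX; j += i) {
            if (bit_get(j))     /* write only if the bit changes */
                bit_clear(j);
        }
    }
    return count;
}
```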

5. Final chapter
================

  That's all!!!