HOW-TO use rt-app and workgen to emulate a workload ?

**** rt-app and workgen ****
rt-app/workgen is a tool that can be used to emulate a use case. Not only the
sleep and run pattern can be emulated but also the dependency between tasks
like accessing same critical resources, creating sequencial wake up or syncing
the wake up of threads. The use case is described in a json like file which is
first parsed by workgen then rt-app.

workgen is a python script that will parse, check and update the json like file
in order to strictly match the json grammar before calling rt-app that will
execute the sequence described in it.

Even if the json protocol is used in rt-app, workgen enables some freedom
compared to stricter json grammar rules. For examples,
- You don't have to ensure uniqueness of key in json object, workgen will check
the file and ensure that each key is unique by appending an index if necessary.
The feature is quite useful when you describe a sequence made of same kind of
events e.g. a sequence of multiple sleep and run events.
- Some values can be omitted and workgen will add them when it parses the file
before starting the use case.

**** Compiling and cross-compiling rt-app ****
rt-app uses libjson-c to parse the .json file
You can either install the libjson-c2 and libjson-c-dev deb packages or compile
from source code available here: https://github.com/json-c/json-c

The following sequence can be used to cross compile rt-app with static linkage
for aarch64:

For libjson-c:

export ac_cv_func_malloc_0_nonnull=yes
export ac_cv_func_realloc_0_nonnull=yes
./autogen.sh
./configure --host=aarch64-linux-gnu --disable-shared --enable-static
make

For rt-app:

export ac_cv_lib_json_c_json_object_from_file=yes
./autogen.sh
./configure --host=aarch64-linux-gnu LDFLAGS=" --static	-L<path to parent of json repo>/json-c/. libs/" CFLAGS="-I<path to parent of json repo>" --with-deadline
make

e.g, with a directory structure like the following:

$ tree -d
.
├── autom4te.cache
├── build
├── build-aux
├── doc
│   └── examples
│       ├── cpufreq_governor_efficiency
│       ├── merge
│       └── tutorial
├── json-c
│   ├── autom4te.cache
│   ├── doc
│   │   └── html
│   └── tests
├── libdl
├── m4
└── src

the configure step becomes

./configure --host=aarch64-linux-gnu LDFLAGS="--static -L./../json-c/." CFLAGS="-I./.." --with-deadline

**** json file skeleton ****

The json file that describes a workload is made on 3 main objects: tasks,
resources and global objects. Only tasks object is mandatory; default value will
be used if global object is not defined and resources object is kept for
backward compatibility with old version of rt-app but it doesn't give any
benefit to declare it as the resources are now dynamically created by rt-app
when it parses the file.

**** global object ****

The global object defines parameters for the whole use case:

* duration : Integer. Duration of the use case in seconds. All the threads will
be killed once the duration has elapsed. if -1 has been set, the use case will
run indefinitly until all threads kill themselves (as an example if a finite
number of loop has been defined in their running pattern) or if a signal
is received to stop the use case.

* calibration : String or Integer: A String defines the CPU that will be used
to calibrate the ns per loop value. "CPU0" is the default value (see run event
section for details about ns per loop value). A integer skips the calibration
step and uses the integer value as ns per loop.

* default_policy : String. Default scheduling policy of threads. Default
value is SCHED_OTHER.

* pi_enabled: Boolean. Enable the prority inheritance of mutex. Default value
is False.

* lock_pages : Boolean. Lock the mem page in RAM. Locking the page in RAM
ensures that your RT thread will not be stalled until a page is moved from swap
to RAM.  The lock of the page is only possible for non CFS tasks. Default value
is False. 

* logdir : String. Path to store the various log files. The default path is
the current directory (./).

* log_basename : String. Prefix used for all log files of the use case.
"rt-app-" is used by default.

* log_size : String or Integer. A Integer defines a fix size in MB of the
temporary buffer (size per thread) that will be used to store the log data
before saving them in a file. This temporary buffer is used as a cicular
buffer so the oldest data will be lost in case of overflow. A string is used
to set a predifined behavior:
  - "file" will be used to store the log data directly in the file without
	using a temporary buffer.
  - "Disable" will disable the log mecahnism.
  - "Auto" will let rt-app compute the buffer size to not overflow the latter 
	during the use case.
The use of a temporary buffer prevents the threads of unexpected wait during
io access. The "Auto" mode is not implemented yet and fallback to "file" mode
for the moment.

* ftrace: Boolean. If enable, rt-app logs in ftrace the main events of the use
case. Default value is False.

* gnuplot : Boolean. if True, it will create a gnu plot compatible file for
each threads (see gnuplot section for more details). Default value is False.

"io_device" : Text. Path to the file which will be written to create IO-bounded
busy loop. Specify it carefully since it might damage the specified file.
Default value is "/dev/null".

"mem_buffer_size" : Integer. The size of per-thread memory buffer in byte being
used to create IO-bounded and memory-bounded busy loop. Default value is
4194304(4MB).

* cumulative_slack : Boolean. Accumulate slack (see below) measured during
  successive timer events in a phase. Default value is False (time between the
  end of last event and the end of the phase).

*** default global object:
	"global" : {
		"duration" : -1,
		"calibration" : "CPU0",
		"default_policy" : "SCHED_OTHER",
		"pi_enabled" : false,
		"lock_pages" : false,
		"logdir" : "./",
		"log_size" : "file",
		"log_basename" : "rt-app",
		"ftrace" : false,
		"gnuplot" : false,
		"io_device" : "/dev/null"
		"mem_buffer_size" : 4194304,
		"cumulative_slack" : false
	}

**** tasks object ****

 The tasks object is made of sub-objects that describe threads. No other kind
 of object than thread object is allowed. The key value of each object will be
 used as thread's name.

"tasks" : {
	"thread_name1" : {
		...
	},
	"thread_name2" : {
		...
	},
	"thread_name3" : {
		...
	},
	...
}

*** thread object ***

Thread object describes the behavior of one kind of thread which means that
several threads can be created with the same object (see instance parameter
below).

* instance : Integer. Define the number of threads that will be created with
the properties of this thread object. Default value is 1.

*** policy : String: Define the scheduling policy of the thread. default_policy
is used if not defined

* priority : Integer. Define the scheduling priority of the thread. The value
must be aligned with the range allowed by the policy. The default priority is
0 for sched_other class and 10 for other classes

* dl-runtime : Integer: Define the runtime budget for deadline scheduling class.
Default value is 0. The unit is usec.

* dl-period : Integer. Define the period duration for deadline scheduling class.
Default value is runtime. The unit is usec.

* dl-deadline : Integer. Define the deadline parameter for deadline scheduling
class. Default value is period. The unit is usec.

* cpus: Array of Integer. Define the CPU affinity of the thread. Default
value is all CPUs of the system. An example :  "cpus" : [0, 2, 3]

* delay: Integer. Initial delay before a thread starts execution. The unit
is usec.

* phases: Object. The phases object described TBF

If there is only 1 phase, the sequence of events can be directly put in the
thread object instead of creating a phases object. As an example, the 2 objects
below are equals:

"task" : {
	"thread1" : {
		"phases" : {
			"phase1" : {
				"run" : 10,
				"sleep" : 10
			}
		}
	}
}

"task" : {
	"thread1" : {
		"run" : 10,
		"sleep" : 10
	}
}

* loop: Integer. Define the number of times the parent object must be run.
The parent object can be a phase or a thread. For phase object, the loop
defines the number of time the phase will be executed before starting the next
phase. For thread object, the loop defines the number of times that the
complete "phases" object will be executed. The default value is -1.

*** events ***

events are simple action that will be performed by the thread or on the
thread. They have to be listed by execution order. 

* run : Integer. Emulate the execution of a load. The duration is defined in
usec but the run event will effectively run a number of time a loop that waste
cpu cycles. When the run event is executed, the duration is transformed in a
number of loop thanks to the ns per loop value (see calibration). This way of
working enables to emulate a load with a duration that will vary with the
frequency or the compute capacity of the CPU.

* runtime : Integer.  The duration is define in usec.  Similar to the
run event, it emulates the execution of a load.  Unlike run, runtime
runs for a specific amount of time irrespective of the compute
capacity of the CPU or the frequency.

* sleep : Integer. Emulate the sleep of a task. The duration is defined in
usec.

* mem : Integer. Emulate the memory write operation. The value defines the size
in byte to be written into the memory buffer. The size of the memory buffer is
defined by "mem_buffer_size" in "global" object.

* iorun : Integer. Emulate the IO write operation. The value defined the size
in byte to be write into the IO device specified by "io_device" in "global"
object.

* timer : Object. Emulate the wake up of the thread by a timer. Timer differs
from sleep event by the start time of the timer duration. Sleep duration starts
at the beginning of the sleep event whereas timer duration starts at the end of
the last use of the timer. So Timer event are immunized against preemption,
frequency scaling and computing capacity of a CPU. The initial starting time of
the timer is set during the 1st use of the latter.

          |<------19---------->|<------19---------->|<------19---------->|
task A ...|run 5|timer 19      |------run 5|timer 19|run 5|timer 19      |
task B                      |run 10  |   


As an example to point out the difference between sleep and timer, let consider
a task A that run 5 and then sleep 10. The period of task A should be 15.
Let's add a task B that runs 5 and use a timer to wakes up every 19. 

         |<------15------>|<------15------>|<------19---------->|<------15------>|<------16------->|<------19---------->|<------19---------->|<------19---------->|<------19---------->|<------19---------->|
taskA ...|run 5|sleep 10  |run 5|sleep 10  |----run 5|sleep 10  |run 5|sleep 10  |-run 5|sleep 10  |----run 5|sleep 10  |----run 5|sleep 10  |----run 5|sleep 10  |----run 5|sleep 10  |----run 5|sleep 10  |
taskB ...|------run 5|timer 19|--run 5|timer 19    |run 5|timer 19      |run 5|timer 19      |run 5|timer 19      |run 5|timer 19      |run 5|timer 19      |run 5|timer 19      |run 5|timer 19      |
         |<------19---------->|<------19---------->|<------19---------->|<------19---------->|<------19---------->|<------19---------->|<------19---------->|<------19---------->|<------19---------->|

We can see that task B period stays to 19 even if the run event is delayed
because of scheduling preemption whereas the period of task A starts at 15 but
increases to 19 because of the scheduling delay.

"unique" timer device: When a thread that uses a timer is instanciated, the
timer will be shared across the thread instance which disturbs the original
sequence. In order to have 1 timer per instance, you can use the "unique"
prefix so a new timer will be created for each instance. Uniqueness
applies to the thread boundary which means that using the same unique
name in the sequence of a thread will refer to the same timer like the example
below:
	"phases" : {
		"light" : {
			"loop" : 10,
			"run" :   1000,
			"timer" : { "ref" : "unique", "period" : 30000 }
		},
		"heavy" : {
			"loop" : 10,
			"run" :   4000,
			"timer" : { "ref" : "unique", "period" : 30000 }
		}
	}

If you want to refer to different timer you must use different name like
below:
	"phases" : {
		"light" : {
			"loop" : 10,
			"run" :   1000,
			"timer" : { "ref" : "uniqueA", "period" : 30000 }
		},
		"heavy" : {
			"loop" : 10,
			"run" :   4000,
			"timer" : { "ref" : "uniqueB", "period" : 400000 }
		}
	}

Timers can work with a "relative" or an "absolute" reference. By default they
work in "relative" mode, but this mode can also be explicity specified as the
following:

	"phases" : {
		"phase0" : {
			"loop" : 10,
			"run0" :  10000,
			"timer0" : { "ref" : "unique", "period" : 20000, "mode" : "relative" },
	}

"relative" mode means that the reference for setting the next timer event is
relative to the end of the current phase. This in turn means that if, for some
reason (i.e., clock frequency was too low), events in a certain phase took too
long to execute and the timer of that phase couldn't actually fire at all, the
next phase won't be affected. For example:

  +---- +     +-----+     +-------------------+-----+     +---
  |r0   |     |r0   |     |r0                 |r0   |     |r0
  |     |     |     |     |                   |     |     |
  o-----------o-----------o-------------------o-----------o------->
  0    10    20    30    40    50    60    70    80   100   120
              ^           ^           ^                   ^
              |           |           | MISS!             |
              +           +           +                   +
            Timer0      Timer0       Timer0              Timer0

In this example character "o" denotes when phases finish/start. Third
activation of Timer0 is missed, since r0 executed for more that 20ms. However
the next phase is not affected as Timer0 was set considering the instant of
time when the misbehaving r0 finished executing.

"absolute" mode is specified as the following:

	"phases" : {
		"phase0" : {
			"loop" : 10,
			"run0" :  10000,
			"timer0" : { "ref" : "unique", "period" : 20000, "mode" : "absolute" },
	}

"absolute" mode means that the reference for setting the next timer event is
fixed and always consider the starting time of the first phase. This means that
if, for some reason (i.e., clock frequency was too low), events in a certain
phase took too long to execute and the timer of that phase couldn't actually
fire at all, the next phase (and potentially other subsequent phases) _will_ be
affected. For example, considering again the example above:

  +---- +     +-----+     +-------------------+-----+-----+   +---
  |r0   |     |r0   |     |r0                 |r0   |r0   |   |r0
  |     |     |     |     |                   |     |     |   |
  o-----------o-----------o-------------------o-----o---------o---->
  0    10    20    30    40    50    60    70    80   100   120
              ^           ^           ^           ^           ^
              |           |           | MISS!     | MISS!     |
              +           +           +           +           +
            Timer0      Timer0       Timer0      Timer0      Timer0

Third activation of Timer0 is missed, since r0 executed for more that 20ms.
Even if 4th activation of r0 executes for 10ms (as specified in the
configuration), 4th Timer0 is still missed because the reference didn't change.
In this example 5th activation of r0 then managed to recover, but in general it
depends on how badly a certain phase misbehaves.

* lock : String. Lock the mutex defined by the string value.

* unlock : String. Unlock the mutex defined by the string value.

* wait : Object {"ref" : String, "mutex" : String }. Block the calling thread
until another thread sends a wake up signal to the resource defined by "ref".
The mutex defined by "mutex" is used during the block sequence: See
pthread_cond_wait() for more details about the sequence (especially regarding
the use of the mutex).

* signal : String. Send a wake up signal to the resource defined by the
string. The 1st thread in the wait list will be wokn up. See
pthread_cond_signal() for more details.

* broad : String. Send a wake up signal to the resource defined by the
string. All threads that are blocked on the resource wil wake up. See
pthread_cond_broadcast() for more details.

* sync : Object {"ref" : String, "mutex" : String }. Atomically wakes 
up a blocked thread and then blocks the calling thread on the condition. 
taskA ...|run 5|wait|------------|run 5| wait |----------------
taskB ...-------------------|sync|-------- 

The sync event "sync" : {"ref" : "CondA", "mutex" : "mutexA" } generates the
following sequence:
{
	"lock" : "mutexA",
	"signal" : "CondA",
	"wait" : { "ref" : "condA", "mutex" : "mutexA" },
	"unlock" : "mutexA"
}

* suspend : String. Block the calling thread until another thread wakes it up
with resume. The String can be let empty as it will be filled by workgen with
the right thread's name before starting the use case.

* resume : String. Wake up the thread defined by the string.

taskA ...|run 5|suspend |----------------|run 5|suspend |----------------
taskB ...-------------------|resume taskA|run 10    |-------------------- 

* yield: String. Calls pthread_yield(), freeing the CPU for other tasks. This has a 
special meaning for SCHED_DEADLINE tasks. String can be empty.

**** Trace and Log ****

Some traces and log hooks have been added to ease the debug and monitor various
metrics of the use cases

*** Trace ***

A trace can be inserted into ftrace buffer for each event in order to sync
kernel events like sched_switch with the use case's sequence.

A simple example of ftrace log:

         thread0-23345 [003] 115591.560884: tracing_mark_write: [0] starts
         thread0-23345 [003] 115591.560890: tracing_mark_write: [0] begins loop 0 phase 0 step 0
         thread0-23345 [003] 115591.560894: tracing_mark_write: [0] executing 0
         thread0-23345 [003] 115591.580212: tracing_mark_write: [0] executing 1
         thread0-23345 [003] 115591.580217: sched_switch: prev_comm=thread0 prev_pid=23345 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
          <idle>-0     [003] 115591.670198: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=thread0 next_pid=23345 next_prio=120
         thread0-23345 [003] 115591.670243: tracing_mark_write: [0] end loop 0 phase 0 step 0
         thread0-23345 [003] 115591.670251: tracing_mark_write: [0] begins loop 0 phase 0 step 1
         thread0-23345 [003] 115591.670254: tracing_mark_write: [0] executing 0
         thread0-23345 [003] 115591.688081: tracing_mark_write: [0] executing 1
         thread0-23345 [003] 115591.688085: sched_switch: prev_comm=thread0 prev_pid=23345 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
          <idle>-0     [003] 115591.778063: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=thread0 next_pid=23345 next_prio=120
         thread0-23345 [003] 115591.778108: tracing_mark_write: [0] end loop 0 phase 0 step 1
         thread0-23345 [003] 115591.778116: tracing_mark_write: [0] begins loop 0 phase 0 step 2
         thread0-23345 [003] 115591.778119: tracing_mark_write: [0] executing 0
         thread0-23345 [003] 115591.794619: tracing_mark_write: [0] executing 1
         thread0-23345 [003] 115591.794623: sched_switch: prev_comm=thread0 prev_pid=23345 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
          <idle>-0     [003] 115591.884607: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=thread0 next_pid=23345 next_prio=120
         thread0-23345 [003] 115591.884652: tracing_mark_write: [0] end loop 0 phase 0 step 2
		 ...
         thread0-23345 [003] 115593.394022: tracing_mark_write: [0] begins loop 0 phase 0 step 17
         thread0-23345 [003] 115593.394025: tracing_mark_write: [0] executing 0
         thread0-23345 [003] 115593.410583: tracing_mark_write: [0] executing 1
         thread0-23345 [003] 115593.410594: sched_switch: prev_comm=thread0 prev_pid=23345 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
          <idle>-0     [003] 115593.500567: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=thread0 next_pid=23345 next_prio=120
         thread0-23345 [003] 115593.500612: tracing_mark_write: [0] end loop 0 phase 0 step 17
         thread0-23345 [003] 115593.500620: tracing_mark_write: [0] begins loop 0 phase 0 step 18
         thread0-23345 [003] 115593.500623: tracing_mark_write: [0] executing 0
         thread0-23345 [003] 115593.520198: tracing_mark_write: [0] executing 1
         thread0-23345 [003] 115593.520202: sched_switch: prev_comm=thread0 prev_pid=23345 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
          <idle>-0     [003] 115593.610185: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=thread0 next_pid=23345 next_prio=120
         thread0-23345 [003] 115593.610230: tracing_mark_write: [0] end loop 0 phase 0 step 18
         thread0-23345 [003] 115593.610258: tracing_mark_write: [0] exiting

The main events are :

* "tracing_mark_write: [0] starts" indicates the start of the thread with index 0

* "tracing_mark_write: [0] begins loop A phase B loop C" indicates that the
thread with index 0 starts to execute the loop C of the phase B of the main
loop A

* "tracing_mark_write: [0] executing X": indicates that the thread with index 0
execute the event X of the current phase
-"tracing_mark_write: [0] end loop A phase B step C" indicates that the
thread with index 0 ends to execute the event C of the phase B of the loop A

*"tracing_mark_write: [0] exiting": indiscates that the trhead with index 0 has
finished to execute its events.

*** Log and gnuplot ***

You can log some metrics for each phase that is executed by a thread. These
metrics are:
- perf: number of loop that has been executed by run events
- run: time spent by the thread to execute the run events
- period: duration to execute the complte phase
- start/end : absolute start and end time of a phase. Same time base is used in
  ftrace
- rel_st: start time of a phase relatively to the beg of the use case
- slack: if global option "cumulative_slack" (see above) is false, time between
  the end of last event and the end of the phase, e.g.

  taskA ...|-- run5 --|- sleep5 -|-- run5--|..timer20.|-- run5 --|- sleep5 -|-- run6 --|.timer20.|
            <--------------- period 20 --------------> <--------------- period 20 -------------->
                                            <-slack5->                                  <slack4->

  it can also be negative if the execution of a phases' events overshoots the
  current period, e.g.

  taskA ...|-- run5 --|- sleep5 -|------- run30 ------xxxxxxxxxx|
            <--------------- period 20 -------------->
                                                       <slack-5>

  if global option "cumulative_slack" is true, all the intermediate slacks of a
  phase with multiple timers are accumulated and reported when the phase
  completes

- c_duration: sum of the configured duration of run/runtime events
- c_period: sum of the timer(s) period(s)
- wu_lat: sum of wakeup latencies after timer events

Below is an extract of a log:

# Policy : SCHED_OTHER priority : 0
#idx     perf      run   period           start             end          rel_st      slack c_duration   c_period     wu_lat
   0    92164    19935    98965    504549567051    504549666016            2443      78701      20000     100000        266
   0    92164    19408    99952    504549666063    504549766015          101455      80217      20000     100000        265
   0    92164    19428    99952    504549766062    504549866014          201454      80199      20000     100000        264
   0    92164    19438    99955    504549866060    504549966015          301452      80190      20000     100000        265
   0    92164    19446    99952    504549966061    504550066013          401453      80093      20000     100000        264
   0    92164    19415    99953    504550066060    504550166013          501452      80215      20000     100000        263
   0    92164    19388    99954    504550166059    504550266013          601451      80242      20000     100000        264
   0    92164    19444    99956    504550266060    504550366015          701452      80185      20000     100000        265

Some gnuplot files are also created to generate charts based on the log files 
for each thread and for each kind of metrics. The format of the chart that
will be generated by gnuplot is eps (text art has been used for the chart
below)

                  Measured thread0 Loop stats
                                                Load [nb loop]
  120000 ++--+----+---+----+---+---+----+---+----+--++ 506000     load  ******
         +   +    +   +    +   +   +    +   +    +   +             run  ######
         |$$ :    :   :    :   $$$ :    :   :    :   |          period  $$$$$$
  110000 ++.$$$........$$$$...$...$..........$$$.$$$$+
         |   : $$$$$$$$    $$ $:   $ $$$$$$$$   $:   $
         |   :    :   :    : $ :   :$   :   :    :  ++ 504000
         |   :    :   :    :   :   :    :   :    :   |
  100000 ++.........................................++
         |   :    :   :    :   :   :    :   :    :   |
         |   :    :   :    :   :   :    :   :    :   |
   90000 ++.........................................++
         |   :    :   :    :   :   :    :   :    :  ++ 502000
         |   :    :   :    :   :   :    :   :    :   |
   80000 ++.........................................++
         |   :    :   :    :   :   :    :   :    :   |
         |   :    :   :    :   :   :    :   :    :   |
   70000 +******************************************** 500000
         |   :    :   :    :   :   :    :   :    :   |
         |   :    :   :    :   :   :    :   :    :   |
         |   :    :   :    :   :   :    :   :    :   |
   60000 ++.........................................++
         |   :    :   :    :   :   :    :   :    :   |
         |   :    :   :    :   :   :    :   :    :  ++ 498000
   50000 ++.........................................++
         |   :    :   :    :   :   :    :   :    :   |
         |   :    :   :    :   :   :    :   :    :   |
   40000 ++.........................................++
         |   :    :   :    :   :   :    :   :    :  ++ 496000
         |#  :    :   :    :   :   :    :   :    :   |
         | # :    :   :    :   ### :    :   :    :   |
   30000 ++.###.###....####...#...#..........###.####+
         |   : #  :####    ### :   # ########   #:   #
         +   +    +   +    +   +   +#   +   +    +   +
   20000 ++--+----+---+----+---+---+----+---+----+--++ 494000
       17560556057560556057560656065606756065606756075607
                    Loop start time [msec]