File: profiling.tex

\sec{Profiling}{profiling}{Profiling.html}

With {\mlton} and {\tt mlprof}, you can profile your program along
different measures of performance.  To profile your program, compile
with {\tt -profile {\em kind}}.  Then, run the executable, which, when
it finishes, will write an {\tt mlmon.out} file.  You can then run
{\tt mlprof} on the executable and the {\tt mlmon.out} file to see the
gathered performance data.

%
\subsec{Profiling time}{proftime}
You can compile your program with {\tt -profile time} to find out how
much time is spent by each source function over an entire run of the
program.  Here is an example of time profiling, run from within the
{\tt examples/profiling} directory.
\begin{verbatim}
% mlton -profile time tak.sml
% ./tak
% mlprof tak mlmon.out
6.00 seconds of CPU time (0.00 seconds GC)
function     cur 
------------- -----
Tak.tak1.tak2 75.8%
Tak.tak1      24.2%
\end{verbatim}
This is a contrived example with two mutually recursive copies of the
{\tt tak} function, defined within the {\tt Tak} structure.  This
example shows how {\tt mlprof} indicates lexical nesting via a
sequence of period-separated names indicating the structures and
functions in which a function definition is nested.  The profiling
data shows that roughly three-quarters of the time is spent in the
{\tt Tak.tak1.tak2} function, while the rest is spent in {\tt
Tak.tak1}.
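
For reference, here is a sketch of the shape of such a program.  The
actual {\tt tak.sml} in {\tt examples/profiling} may differ in detail;
the point is the nesting of {\tt tak2} within {\tt tak1} within {\tt
Tak} that produces the period-separated names above.
\begin{verbatim}
structure Tak =
   struct
      fun tak1 (x, y, z) =
         let
            fun tak2 (x, y, z) =
               if y >= x
                  then z
               else tak1 (tak2 (x - 1, y, z),
                          tak2 (y - 1, z, x),
                          tak2 (z - 1, x, y))
         in
            if y >= x
               then z
            else tak2 (tak1 (x - 1, y, z),
                       tak1 (y - 1, z, x),
                       tak1 (z - 1, x, y))
         end
   end
\end{verbatim}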

You can display raw counts in addition to percentages with {\tt -raw
true}.
\begin{verbatim}
% mlprof -raw true tak mlmon.out
6.00 seconds of CPU time (0.00 seconds GC)
  function     cur    raw  
------------- ----- -------
Tak.tak1.tak2 75.8% (4.55s)
Tak.tak1      24.2% (1.45s)
\end{verbatim}
%
You can display the file name and line number for each function in
addition to its name with {\tt -show-line true}.
\begin{verbatim}
% mlprof -show-line true tak mlmon.out
6.00 seconds of CPU time (0.00 seconds GC)
        function           cur 
------------------------- -----
Tak.tak1.tak2  tak.sml: 5 75.8%
Tak.tak1  tak.sml: 3      24.2%
\end{verbatim}

Time profiling is designed to have a very small performance impact.
You can also compile with {\tt -profile time -profile-branch true} to
find out how much time is spent in each branch of a function; see
\secref{profcounts} for more details.

\subsec{Profiling allocation}{profalloc}
You can compile your program with {\tt -profile alloc} to find out how
many bytes each function allocates.  Here is an example run from
within the {\tt examples/profiling} directory.
\begin{verbatim}
% mlton -profile alloc list-rev.sml
% ./list-rev
% mlprof -show-line true list-rev mlmon.out
6,030,136 bytes allocated (108,336 bytes by GC)
       function          cur 
----------------------- -----
append  list-rev.sml: 1 97.6%
<gc>                     1.8%
<main>                   0.4%
rev  list-rev.sml: 6     0.2%
\end{verbatim}
%
The data shows that most of the allocation is done by the {\tt append}
function defined on line 1 of {\tt list-rev.sml}.  The table also
shows how special functions like {\tt gc} and {\tt main} are handled:
they are printed with surrounding brackets.  C functions are displayed
similarly.  Here, as is usually the case, the allocation done by the
garbage collector is due to stack growth.
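
The program being profiled is the usual quadratic list reversal.  A
sketch of what {\tt list-rev.sml} might contain (the file in {\tt
examples/profiling} may differ in detail) is:
\begin{verbatim}
fun append (l1, l2) =
   case l1 of
      [] => l2
    | x :: l1 => x :: append (l1, l2)

fun rev l =
   case l of
      [] => []
    | x :: l => append (rev l, [x])
\end{verbatim}
Each call to {\tt append} copies its entire first argument, which is
why nearly all of the allocation is attributed to it.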

The performance impact of allocation profiling is noticeable, because
it inserts additional C calls for object allocation.  You can also
compile with {\tt -profile alloc -profile-branch true} to find out how
much allocation is done in each branch of a function; see
\secref{profcounts} for more details.

%
\subsec{Profiling counts}{profcounts}
You can compile your program with {\tt -profile count -profile-branch
true} to find out how many times each function is called and how many
times each branch is taken.  Here's an example.

\begin{verbatim}
% mlton -profile count -profile-branch true tak.sml
% ./tak
% mlprof -raw true -show-line true tak mlmon.out
623,610,002 ticks
            function               cur       raw     
--------------------------------- ----- -------------
Tak.tak1.tak2  tak.sml: 5         38.2% (238,530,000)
Tak.tak1.tak2.<true>  tak.sml: 7  27.5% (171,510,000)
Tak.tak1  tak.sml: 3              10.7%  (67,025,000)
Tak.tak1.<true>  tak.sml: 14      10.7%  (67,025,000)
Tak.tak1.tak2.<false>  tak.sml: 9 10.7%  (67,020,000)
Tak.tak1.<false>  tak.sml: 16      2.0%  (12,490,000)
f  tak.sml: 23                     0.0%       (5,001)
f.<branch>  tak.sml: 25            0.0%       (5,000)
f.<branch>  tak.sml: 23            0.0%           (1)
uncalled  tak.sml: 29              0.0%           (0)
f.<branch>  tak.sml: 24            0.0%           (0)
\end{verbatim}

Branches are displayed with the usual lexical nesting followed by {\tt
<branch>} where the function name would normally be, or {\tt <true>}
or {\tt <false>} for if-expressions.  Hence, it is best to run {\tt
mlprof} with {\tt -show-line true} to help identify the branch.

One use of {\tt -profile count} is as a code-coverage tool, to help
find code in your program that hasn't been tested.  For this reason,
{\tt mlprof} displays functions and branches that have a count of
zero.  As the above output shows, the branch on line 24 was never
taken and the function defined on line 29 was never called.  To see
zero counts, it is best to run {\tt mlprof} with {\tt -raw true},
since some code (e.g. the branch on line 23 above) will show up with
{\tt 0.0\%} but may still have been executed and hence have a nonzero
raw count.
%
\subsec{Profiling the stack}{profstack}

For all forms of profiling, you can use {\tt -profile-stack true} to
count the time spent (or bytes allocated) while a function is on the
stack.  Here is an example.

\begin{verbatim}
% mlton -profile alloc -profile-stack true list-rev.sml
% ./list-rev
% mlprof -show-line true list-rev mlmon.out
6,030,136 bytes allocated (108,336 bytes by GC)
       function          cur  stack  GC 
----------------------- ----- ----- ----
append  list-rev.sml: 1 97.6% 97.6% 1.4%
<gc>                     1.8%  0.0% 1.8%
<main>                   0.4% 98.2% 1.8%
rev  list-rev.sml: 6     0.2% 97.6% 1.8%
\end{verbatim}
%
In the above table, we see that {\tt rev}, defined on line 6 of {\tt
list-rev.sml}, is only responsible for 0.2\% of the allocation, but is
on the stack while 97.6\% of the allocation is done by the user
program and while 1.8\% of the allocation is done by the garbage
collector.

The performance impact of {\tt -profile-stack true} can be noticeable
since there is some extra bookkeeping at every nontail call and
return.
%
\subsection{Call graphs}

For easier visualization of profiling data, {\tt mlprof} can create a
call graph of the program in dot format, from which you can use the
\htmladdnormallink{{\tt graphviz}}
		  {http://www.research.att.com/sw/tools/graphviz/}
software package to create a postscript graph.  For example, {\tt
mlprof -call-graph foo.dot foo mlmon.out} will create {\tt foo.dot}
with a complete call graph.  For each source function, there will be
one node in the graph that contains the function name (and source
position with {\tt -show-line true}), as well as the percentage of
ticks.  If you want to create a call graph for your program without
any profiling data, you can simply call {\tt mlprof} without any {\tt
mlmon.out} files, e.g. {\tt mlprof -call-graph foo.dot foo}.

Because SML has higher-order functions, the call graph is dependent
on {\mlton}'s analysis of which functions call each other.  This
analysis depends on many implementation details and might display
spurious edges that a human could conclude are impossible.  However,
in practice, the call graphs tend to be very accurate.

Because call graphs can get big, {\tt mlprof} provides the {\tt -keep}
option to specify the nodes that you would like to see.  This
option also controls which functions appear in the table that {\tt
mlprof} prints.  The argument to {\tt -keep} is an expression
describing a set of source functions (i.e. graph nodes), taken from
the following grammar.

\begin{latexonly}
\begin{center}
\begin{tabular}{lcl}
{\it e} & ::= & {\tt all} \\
& $\alt$ & {\tt "{\it s}"} \\
& $\alt$ & {\tt (and {\it e} ...)} \\
& $\alt$ & {\tt (from {\it e})} \\
& $\alt$ & {\tt (not {\it e})} \\
& $\alt$ & {\tt (or {\it e} ...)} \\
& $\alt$ & {\tt (pred {\it e})} \\
& $\alt$ & {\tt (succ {\it e})} \\
& $\alt$ & {\tt (to {\it e})} \\
& $\alt$ & {\tt (thresh {\it x})} \\
& $\alt$ & {\tt (thresh-gc {\it x})} \\
& $\alt$ & {\tt (thresh-stack {\it x})} \\
\end{tabular}
\end{center}
\end{latexonly}
\begin{htmlonly}
\begin{center}
\begin{tabular}{lcl}
{\it e} & ::= & {\tt all} \\
& | & {\tt "{\it s}"} \\
& | & {\tt (and {\it e} ...)} \\
& | & {\tt (from {\it e})} \\
& | & {\tt (not {\it e})} \\
& | & {\tt (or {\it e} ...)} \\
& | & {\tt (pred {\it e})} \\
& | & {\tt (succ {\it e})} \\
& | & {\tt (to {\it e})} \\
& | & {\tt (thresh {\it x})} \\
& | & {\tt (thresh-gc {\it x})} \\
& | & {\tt (thresh-stack {\it x})} \\
\end{tabular}
\end{center}
\end{htmlonly}

In the grammar, {\tt all} denotes the set of all nodes.  {\tt "{\it
s}"} is a regular expression denoting the set of functions whose name
(followed by a space and the source position) has a prefix matching
the regexp.  The {\tt and}, {\tt not}, and {\tt or} expressions denote
intersection, complement, and union, respectively.  The {\tt pred} and
{\tt succ} expressions add the set of immediate predecessors or
successors to their argument, respectively.  The {\tt from} and {\tt
to} expressions denote the set of nodes that have paths from or to the
set of nodes denoted by their arguments, respectively.  Finally, {\tt
thresh}, {\tt thresh-gc}, and {\tt thresh-stack} denote the set of
nodes whose percentage of ticks, gc ticks, or stack ticks,
respectively, is greater than or equal to the real number {\it x}.

For example, if you want to see the entire call graph for a program,
you can use {\tt -keep all} (this is the default).  If you want to see
all nodes reachable from function {\tt foo} in your program, you would
use {\tt -keep '(from "foo")'}.  Or, if you want to see all the
functions defined in subdirectory {\tt bar} of your project that used
at least 1\% of the ticks, you would use {\tt -keep '(and ".*/bar/"
(thresh 1.0))'}.  To see all functions with ticks above a threshold,
you can also use {\tt -thresh x}, which is an abbreviation for {\tt
-keep '(thresh x)'}.  You cannot use multiple {\tt -keep} arguments
or both {\tt -keep} and {\tt -thresh}.  When you use {\tt -keep} to
display a subset of the functions, {\tt mlprof} will add dashed edges
to the call graph to indicate a path in the original call graph from
one function to another.

When compiling with {\tt -profile-stack true}, you can use {\tt mlprof
-gray true} to make the nodes darker or lighter depending on whether
their stack percentage is higher or lower.

{\mlton}'s optimizer may duplicate source functions for any of a
number of reasons (functor duplication, monomorphisation,
polyvariance, inlining).  By default, all duplicates of a function are
treated as one.  If you would like to treat the duplicates separately,
you can use {\tt mlprof -split} {\it regexp}, which will cause all
duplicates of functions whose name has a prefix matching the regular
expression to be treated separately.  This can be especially useful
for higher-order utility functions like {\tt General.o}.
%
\subsection{Using {\tt MLton.Profile}}

To profile individual portions of your program, you can use the {\tt
MLton.Profile} structure (see \secref{profile-structure}).  This
allows you to create many units of profiling data (essentially,
mappings from functions to counts) during a run of a program, to
switch between them while the program is running, and to output
multiple {\tt mlmon.out} files.  Executing {\tt mlprof} with multiple
{\tt mlmon.out} files sums the profiling data to produce the output
profiling information.

Here is an example, run from within the {\tt examples/profiling}
directory, showing how to profile the executions of the {\tt fib} and
{\tt tak} functions separately.

\begin{verbatim}
% mlton -profile time fib-tak.sml
% ./fib-tak
% mlprof fib-tak mlmon.fib.out
5.77 seconds of CPU time (0.00 seconds GC)
function   cur 
--------- -----
fib       96.9%
<unknown>  3.1%
% mlprof fib-tak mlmon.tak.out
0.68 seconds of CPU time (0.00 seconds GC)
function  cur  
-------- ------
tak      100.0%
% mlprof fib-tak mlmon.fib.out mlmon.tak.out mlmon.out
6.45 seconds of CPU time (0.00 seconds GC)
function   cur 
--------- -----
fib       86.7%
tak       10.5%
<unknown>  2.8%
\end{verbatim}
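
The separate {\tt mlmon.fib.out} and {\tt mlmon.tak.out} files come
from switching the current unit of profiling data around each
computation.  Assuming {\tt fib} and {\tt tak} are already defined
(the argument values below are made up), and using the names from
\secref{profile-structure}, {\tt fib-tak.sml} might do something like
the following sketch.
\begin{verbatim}
structure Profile = MLton.Profile

val fibData = Profile.Data.malloc ()
val takData = Profile.Data.malloc ()

(* Run each computation with its own unit of profiling data
 * current, restoring the previous unit afterwards. *)
val _ = Profile.withData (fibData, fn () => ignore (fib 38))
val _ = Profile.withData (takData, fn () => ignore (tak (18, 12, 6)))

val _ = Profile.Data.write (fibData, "mlmon.fib.out")
val _ = Profile.Data.write (takData, "mlmon.tak.out")
\end{verbatim}
Work done outside the two {\tt withData} calls is attributed to the
default unit of profiling data, which is written to {\tt mlmon.out}
when the program exits; that accounts for the third file in the
combined run above.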
%
\subsection{Profiling details}

Technically speaking, {\tt mlprof} produces a call-stack graph rather
than a call graph, because it describes the set of possible call
stacks.  The difference is in how tail calls are displayed.  For
example if {\tt f} nontail calls {\tt g} and {\tt g} tail calls {\tt
h}, then the call-stack graph has edges from {\tt f} to {\tt g} and
{\tt f} to {\tt h}, while the call graph has edges from {\tt f} to {\tt
g} and {\tt g} to {\tt h}.  That is, a tail call from {\tt g} to {\tt
h} removes {\tt g} from the call stack and replaces it with {\tt h}.

Conceptually, both allocation and time profiling work in the same way.
The compiler produces information that maps machine code positions to
source functions that the profiler uses while the program is running
to find out the current source function.  With {\tt -profile-stack
true}, the profiler also keeps track of which functions are on the
stack.  The profiler associates counters (either clock ticks or byte
counts) with source functions.  For allocation profiling, the compiler
inserts code whenever an object is allocated to call a C function to
increment the appropriate counter.  For time profiling, the profiler
catches the {\tt SIGPROF} signal 100 times per second and increments
the appropriate counter.  When the program finishes, the profiler
writes the counts out to the {\tt mlmon.out} file.  Then, {\tt mlprof}
uses source information stored in the executable to associate the
counts in the {\tt mlmon.out} file with source functions.

With {\tt -profile time}, use of the following in your program will
cause a run-time error, since they would interfere with the profiler
signal handler.\\
\begin{tabular}{l}
\tt MLton.Itimer.set (MLton.Itimer.Prof, ...)\\
\tt MLton.Signal.setHandler (MLton.Signal.prof, ...)
\end{tabular}\\
Also, because of the random sampling used to implement {\tt -profile
time}, it is best to have a long running program (at least tens of
seconds) in order to get reasonable time data.

There may be a few missed clock ticks or bytes allocated at the
very end of the program after the data is written.

Profiling has not been tested with signals or threads.  In particular,
stack profiling may behave strangely.