File: timer.lsp

package info (click to toggle)
acl2 8.6%2Bdfsg-3
  • links: PTS
  • area: main
  • in suites: forky, sid
  • size: 1,138,276 kB
  • sloc: lisp: 17,818,294; java: 125,359; python: 28,122; javascript: 23,458; cpp: 18,851; ansic: 11,569; perl: 7,678; xml: 5,591; sh: 3,978; makefile: 3,840; ruby: 2,633; yacc: 1,126; ml: 763; awk: 295; csh: 233; lex: 197; php: 178; tcl: 49; asm: 23; haskell: 17
file content (281 lines) | stat: -rw-r--r-- 12,685 bytes parent folder | download | duplicates (8)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
; Memoize Library
; Copyright (C) 2013 Centaur Technology
;
; Contact:
;   Centaur Technology Formal Verification Group
;   7600-C N. Capital of Texas Highway, Suite 300, Austin, TX 78731, USA.
;   http://www.centtech.com/
;
; License: (An MIT/X11-style license)
;
;   Permission is hereby granted, free of charge, to any person obtaining a
;   copy of this software and associated documentation files (the "Software"),
;   to deal in the Software without restriction, including without limitation
;   the rights to use, copy, modify, merge, publish, distribute, sublicense,
;   and/or sell copies of the Software, and to permit persons to whom the
;   Software is furnished to do so, subject to the following conditions:
;
;   The above copyright notice and this permission notice shall be included in
;   all copies or substantial portions of the Software.
;
;   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
;   IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
;   FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
;   AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
;   LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
;   FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
;   DEALINGS IN THE SOFTWARE.
;
; Original author: Jared Davis <jared@centtech.com>
;
; This library is a descendant of the memoization scheme developed by Bob Boyer
; and Warren A. Hunt, Jr. which was incorporated into the HONS version of ACL2,
; sometimes called ACL2(h).

(in-package "MEMOIZE")

; We now develop a simple timing mechanism that uses RDTSC for high-performance
; timing on CCL, or else uses whatever timing utilities are provided by the
; Common Lisp implementation.
;
; The interface here is just:
;
;   (ticks)
;     - similar to get-internal-real-time in Common Lisp, but guaranteed
;       to return an mfixnum
;
;   (ticks-per-second)
;     - similar to internal-time-units-per-second in Common Lisp, but
;       guaranteed to return a float
;
; It would be nice to have a cross-platform, high-performance way to time
; things, and it would be nice if Lisp took care of this for us.  But
; high-performance timing turns out to be a very complicated and subtle matter.
;
; On CCL at least, (get-internal-real-time) is implemented with gettimeofday(),
; which seems basically reasonable but involves a system call and is slow.
; FWIW, gettimeofday() is actually marked as obsolete in Posix.1-2008, with
; clock_gettime() as the recommended replacement.  It also seems like there are
; some efforts to speed this up, e.g., there seems to be some user-space
; equivalent called vgettimeofday(), but I haven't yet found any coherent
; documentation about how to use it.  At any rate, maybe gettimeofday() will be
; faster some day.
;
; In the meantime, clock_gettime() seems to provide a sort of higher-level
; interface to the high-performance counters (e.g., CLOCK_PROCESS_CPUTIME_ID),
; but these would appear to have the same problems as RDTSC.  It would probably
; make sense to switch our code below to use clock_gettime() instead of RDTSC
; if it can be conveniently added to CCL.
;
; But for now we use RDTSC (read timestamp counter).  This is a very fast way
; to get something resembling the current time, but has its own set of
; problems:
;
;  - RDTSC doesn't serialize, so it may read the timestamp before previous
;    instructions have completed.  (This may actually be a feature in that it
;    suggests adding profiling instructions may not slow things down too badly;
;    alternately we could consider using RDTSCP which does serialize, or else
;    the typical trick seems to be to put a CPUID instruction in front of the
;    RDTSC call to force serialization).
;
;  - On some systems, CPU frequency scaling may cause the counter to advance at
;    different rates over time.  This is apparently a particlar problem on the
;    Pentium M.  But on newer processors (Core 2, Phenom, and presumably later)
;    this seems to have been fixed by instead basing the RDTSC on the FSB clock
;    frequency instead of the CPU clock.  So, allegedly on newer systems the
;    RDTSC ticks at a reliable rate even when the clock frequency is scaled
;    back to save power.
;
;  - On a multi-core system, the timestamps on the various cores may not agree.
;    This might cause wildly inaccurate results if the operating system
;    relocates our process from one core to another.  It seems like the whole
;    point of RDTSCP is to let you detect this by also figuring out which CPU
;    you are on, but this seems to require some support from the operating
;    system and it's not clear to me how it's all supposed to work.
;
; On CCL it seems that RDTSC is about 7x faster than (get-internal-real-time).
; CCL actually gives us two ways to use RDTSC:
;
;    (ccl::rdtsc)   -- just returns a fixnum's worth of bits of rdtsc
;    (ccl::rdtsc64) -- returns the full 64 bits (possibly a bignum)
;
; Either of these compiles down to just a few machine instructions instead of
; having to do system calls.  On FV-1, these loops finish in 1.61, 1.57, 11.16,
; and 42.55 seconds, respectively:

;; (progn
;;  (time (loop for i fixnum from 1 to 100000000 do (ccl::rdtsc)))
;;  (time (loop for i fixnum from 1 to 100000000 do (ccl::rdtsc64)))
;;  (time (loop for i fixnum from 1 to 100000000 do (get-internal-real-time)))
;;  (time (loop for i fixnum from 1 to 100000000 do (get-internal-run-time))))

; This works out to about 9 million calls of (get-internal-real-time) per
; second, versus 62 million calls of (ccl::rdtsc).  The better speed of RDTSC
; is pretty appealing since, e.g., the faster we can do timing, the more
; accurately we can handle nested timing.
;
; I wrote a little code to investigate whether these frequency-scaling problems
; have really been resolved.  I first tweaked some loops to provide one
; second's worth of computation.  Here are the loops I came up with for fv-1
; (Xeon e5450) and prime (i5-750).  The timings on fv-1 are very reliable, for
; instance 1.000538, 1.000906, 1.000748, .999290, etc.  The timings on prime
; are much less reliable, often as much as 2 seconds on the first run, but they
; settle down to nearly 1 second each time when it is run in a loop.

;; (defun one-second-on-fv-1 ()
;;   (let ((k 0))
;;     (declare (fixnum k))
;;     (loop for i fixnum from 1 to 9017500 do
;;           (loop for j fixnum from 1 to 153 do
;;                 (incf k)))))
;;
;; (defun one-second-on-prime ()
;;   (let ((k 0))
;;     (declare (fixnum k))
;;     (loop for i fixnum from 1 to 8120000 do
;;           (loop for j fixnum from 1 to 161 do
;;                 (incf k)))))

; I then tried to compare how fast the RDTSC increased while running these
; loops versus sleeping.

;; (let (start end)
;;   (one-second-on-prime)
;;   (setq start (ccl::rdtsc))
;;   (loop for i fixnum from 1 to 100 do (one-second-on-prime))
;;   (setq end (ccl::rdtsc))
;;   (- end start))
;;
;; (let (start end)
;;   (sleep 10)
;;   (setq start (ccl::rdtsc))
;;   (sleep 100)
;;   (setq end (ccl::rdtsc))
;;   (- end start))

; The idea behind this test is that the CPU should be running at high frequency
; when running the first loop, but at low frequency when sleeping.  Even on
; prime (which aggressively throttles), the two forms gave results that were
; within 1.5%.  So this is encouraging -- it seems like indeed frequency
; scaling is unlikely to lead to bad results on these modern processors.
;
; It seems more difficult to try to deal with unreliable timings due to
; changing from one core to another.  If our process is relocated from core A
; to core B, there are two possibilities:
;
;   (1) A's timer is behind B's.  Here, any previous start times we recorded
;       will seem to be farther back in time than they really were, and we may
;       over-estimate how much time something has taken.
;
;   (2) B's timer is behind A's.  Here, any previous start times we recorded
;       will seem either closer in time than they really were, or will seem to
;       be in the future.  We can probably detect these future cases (and do
;       what?  ignore whatever time has elapsed?) but in the other cases we
;       will be silently under-estimating how much time has been taken.
;
; BOZO we should try to investigate how often these relocations occur.  It may
; be that we can detect these cases with something like RDTSCP and at least
; tell the user when the timings we've collected are unreliable.  Alternately,
; we might suggest that the user run "taskset" to set the CPU affinity for
; their ACL2 process if they want more accurate timings.  Hopefully this just
; doesn't happen very often and the timings we collect will be okay.

#+(and Clozure x86_64)
(eval-when
 (:execute :compile-toplevel :load-toplevel)
 (when (fboundp 'ccl::rdtsc)
   (pushnew :memoize-use-rdtsc *features*)))

#-memoize-use-rdtsc
(progn

  (defabbrev ticks ()
    (the mfixnum (logand (get-internal-real-time)
                         acl2::most-positive-mfixnum)))

  (defun ticks-per-second ()
    internal-time-units-per-second))

#+memoize-use-rdtsc
(progn

  (defabbrev ticks ()
    (the mfixnum (ccl::rdtsc)))

  (defparameter *ticks-per-second*
    ;; Cached estimate of how many ticks are in a second.  NIL means we haven't
    ;; estimated this yet, otherwise it will be a float.
    ;;
    ;; We estimate this the first time we need it, then record our guess.  We
    ;; do this at runtime, rather than at compile time, because ACL2 is often
    ;; used in networked environments where the same ACL2 image may be running
    ;; on several different machines, each of which have their own tick rates.
    nil)

  (defun estimate-ticks-per-second (sleep-time)
    ;; Simply estimate the number of RDTSC ticks per second by sleeping for a
    ;; little bit and seeing how many RDTSC ticks have elapsed.
    (let* ((start (ccl::rdtsc))
           (end   (progn (sleep sleep-time) (ccl::rdtsc)))
           (ticks (coerce (- end start) 'float)))
      (/ ticks sleep-time)))

; We want to sleep for the shortest time that will give us a reasonably
; accurate guess as to the true number of ticks per second.  As an experiment,
; I estimated the "true" amount of ticks per second on fv-1 by sleeping for 20
; seconds.  Then, I tried using much smaller sleep times, and compared how
; close my guesses were.  Here is the code I used:

; I later repeated this experiment on the machine "k", an AMD FX-8350,
; running FreeBSD.  The results were much less consistent!

#||
(defparameter *true-ticks-per-second*
  (estimate-ticks-per-second 20))

(defun max-error-ticks (times sleep-time)
  (if (= times 0)
      0
    (let* ((est1 (estimate-ticks-per-second sleep-time))
	   (diff (abs (- est1 *true-ticks-per-second*))))
      (max diff (max-error-ticks (1- times) sleep-time)))))

(defun max-error-pct (times sleep-time)
  (* 100.0 (/ (max-error-ticks times sleep-time) *true-ticks-per-second*)))

                          ;; fv-1          k
(max-error-pct 100 .0001) ;; 888.60%   20426.45%
(max-error-pct 100 .001)  ;;  98.80%    1904.23%
(max-error-pct 100 .01)   ;;   9.89%     105.13%
(max-error-pct 100 .05)   ;;   1.97%      19.97%
(max-error-pct 100 .08)   ;;   1.23%      12.60%
(max-error-pct 100 .1)    ;;   0.98%       9.98%
(max-error-pct 100 .12)   ;;   0.81%       8.38%
(max-error-pct 100 .15)   ;;   0.67%       6.64%
(max-error-pct 100 .2)    ;;   0.49%       4.99%
(max-error-pct 100 .3)    ;;   0.32%       3.31% (but 16.62% once)
(max-error-pct 100 .5)    ;;   0.19%       1.98%
||#

; In short, it seems like sleeping for .1 seconds gives an estimate
; that is within 1% of the true ticks per second on fv-1 (an Intel
; Core2 generation Xeon running Linux), which is very reasonable, but
; that's still not very good on the AMD/FreeBSD box.  So, I guess as a
; compromise, I'll have it sleep for .2 seconds.  Blah.

  (defun carefully-estimate-ticks-per-second ()
    ;; Of course, if RDTSC is wacky, maybe this doesn't give us a good number.
    ;; So, try to make sure that we have a least a 1 KHz processor. :)
    (let* ((sleep-time 0.2))
      (loop do
            (setq *ticks-per-second*
                  (estimate-ticks-per-second sleep-time))
            (if (< *ticks-per-second* 1000.0)
                ;; Probably a bogus result, try again, maybe sleeping a bit more.
                (setq sleep-time (* sleep-time 2))
              (loop-finish)))
      *ticks-per-second*))

  (defun ticks-per-second ()
    (or *ticks-per-second*
        (carefully-estimate-ticks-per-second))))