1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156
|
==============================================================================
Cycle counts and throughput for low-level routines in GNU MP as currently
implemented.
A range means that the timing is data-dependent. The slower number of such
an interval is usually the best performance estimate.
The throughput value, measured in Gb/s (gigabits per second) has a meaning
only for comparison between CPUs.
A star before a line means that all values on that line are estimates. A
star before a number means that that number is an estimate. A `p' before a
number means that the code is not complete, but the timing is believed to be
accurate.
| mpn_lshift mpn_add_n mpn_mul_1 mpn_addmul_1
| mpn_rshift mpn_sub_n mpn_submul_1
------------+-----------------------------------------------------------------
DEC/Alpha |
EV4 | 4.75 cycles/64b 7.75 cycles/64b 42 cycles/64b 42 cycles/64b
200MHz | 2.7 Gb/s 1.65 Gb/s 20 Gb/s 20 Gb/s
EV5 old code| 4.0 cycles/64b 5.5 cycles/64b 18 cycles/64b 18 cycles/64b
267MHz | 4.27 Gb/s 3.10 Gb/s 61 Gb/s 61 Gb/s
417MHz | 6.67 Gb/s 4.85 Gb/s 95 Gb/s 95 Gb/s
EV5 tuned | 3.25 cycles/64b 4.75 cycles/64b
267MHz | 5.25 Gb/s 3.59 Gb/s as above
417MHz | 8.21 Gb/s 5.61 Gb/s
------------+-----------------------------------------------------------------
Sun/SPARC |
SPARC v7 | 14.0 cycles/32b 8.5 cycles/32b 37-54 cycl/32b 37-54 cycl/32b
SuperSPARC | 3 cycles/32b 2.5 cycles/32b 8.2 cycles/32b 10.8 cycles/32b
50MHz | 0.53 Gb/s 0.64 Gb/s 6.2 Gb/s 4.7 Gb/s
**SuperSPARC| tuned addmul and submul will take: 9.25 cycles/32b
MicroSPARC2 | ? 6.65 cycles/32b 30 cycles/32b 31.5 cycles/32b
110MHz | ? 0.53 Gb/s 3.75 Gb/s 3.58 Gb/s
SuperSPARC2 | ? ? ? ?
Ultra/32 (4)| 2.5 cycles/32b 6.5 cycles/32b 13-27 cyc/32b 16-30 cyc/32b
182MHz | 2.33 Gb/s 0.896 Gb/s 14.3-6.9 Gb/s
Ultra/64 (5)| 2.5 cycles/64b 10 cycles/64b 40-70 cyc/64b 46-76 cyc/64b
182MHz | 4.66 Gb/s 1.16 Gb/s 18.6-11 Gb/s
HalSPARC64 | ? ? ? ?
------------+-----------------------------------------------------------------
SGI/MIPS |
R3000 | 6 cycles/32b 9.25 cycles/32b 16 cycles/32b 16 cycles/32b
40MHz | 0.21 Gb/s 0.14 Gb/s 2.56 Gb/s 2.56 Gb/s
R4400/32 | 8.6 cycles/32b 10 cycles/32b 16-18 19-21
200MHz | 0.74 Gb/s 0.64 Gb/s 13-11 Gb/s 11-9.6 Gb/s
*R4400/64 | 8.6 cycles/64b 10 cycles/64b 22 cycles/64b 22 cycles/64b
*200MHz | 1.48 Gb/s 1.28 Gb/s 37 Gb/s 37 Gb/s
R4600/32 | 6 cycles/64b 9.25 cycles/32b 15 cycles/32b 19 cycles/32b
134MHz | 0.71 Gb/s 0.46 Gb/s 9.1 Gb/s 7.2 Gb/s
R4600/64 | 6 cycles/64b 9.25 cycles/64b ? ?
134MHz | 1.4 Gb/s 0.93 Gb/s ? ?
R8000/64 | 3 cycles/64b 4.6 cycles/64b 8 cycles/64b 8 cycles/64b
75MHz | 1.6 Gb/s 1.0 Gb/s 38 Gb/s 38 Gb/s
*R10000/64 | 2 cycles/64b 3 cycles/64b 11 cycles/64b 11 cycles/64b
*200MHz | 6.4 Gb/s 4.27 Gb/s 74 Gb/s 74 Gb/s
*250MHz | 8.0 Gb/s 5.33 Gb/s 93 Gb/s 93 Gb/s
------------+-----------------------------------------------------------------
Motorola |
MC68020 | ? 24 cycles/32b 62 cycles/32b 70 cycles/32b
MC68040 | ? 6 cycles/32b 24 cycles/32b 25 cycles/32b
MC88100 | >5 cycles/32b 4.6 cycles/32b 16/21 cyc/32b p 18/23 cyc/32b
MC88110 wt | ? 3.75 cycles/32b 6 cycles/32b 8.5 cyc/32b
*MC88110 wb | ? 2.25 cycles/32b 4 cycles/32b 5 cycles/32b
------------+-----------------------------------------------------------------
HP/PA-RISC |
PA7000 | 4 cycles/32b 5 cycles/32b 9 cycles/32b 11 cycles/32b
67MHz | 0.53 Gb/s 0.43 Gb/s 7.6 Gb/s 6.2 Gb/s
PA7100 | 3.25 cycles/32b 4.25 cycles/32b 7 cycles/32b 8 cycles/32b
99MHz | 0.97 Gb/s 0.75 Gb/s 14 Gb/s 12.8 Gb/s
PA7100LC | ? ? ? ?
PA7200 (3) | 3 cycles/32b 4 cycles/32b 7 cycles/32b 6.5 cycles/32b
100MHz | 1.07 Gb/s 0.80 14 Gb/s 15.8 Gb/s
PA7300LC | ? ? ? ?
*PA8000 | 3 cycles/64b 4 cycles/64b 7 cycles/64b 6.5 cycles/64b
180MHz | 3.84 Gb/s 2.88 Gb/s 105 Gb/s 113 Gb/s
------------+-----------------------------------------------------------------
Intel/x86 |
386DX | 20 cycles/32b 17 cycles/32b 41-70 cycl/32b 50-79 cycl/32b
16.7MHz | 0.027 Gb/s 0.031 Gb/s 0.42-0.24 Gb/s 0.34-0.22 Gb/s
486DX | ? ? ? ?
486DX4 | 9.5 cycles/32b 9.25 cycles/32b 17-23 cycl/32b 20-26 cycl/32b
100MHz | 0.34 Gb/s 0.35 Gb/s 6.0-4.5 Gb/s 5.1-3.9 Gb/s
Pentium | 2/6 cycles/32b 2.5 cycles/32b 13 cycles/32b 14 cycles/32b
167MHz | 2.7/0.89 Gb/s 2.1 Gb/s 13.1 Gb/s 12.2 Gb/s
Pentium Pro | 2.5 cycles/32b 3.5 cycles/32b 6 cycles/32b 9 cycles/32b
200MHz | 2.6 Gb/s 1.8 Gb/s 34 Gb/s 23 Gb/s
------------+-----------------------------------------------------------------
IBM/POWER |
RIOS 1 | 3 cycles/32b 4 cycles/32b 11.5-12.5 c/32b 14.5/15.5 c/32b
RIOS 2 | 2 cycles/32b 2 cycles/32b 7 cycles/32b 8.5 cycles/32b
------------+-----------------------------------------------------------------
PowerPC |
PPC601 (1) | 3 cycles/32b 6 cycles/32b 11-16 cycl/32b 14-19 cycl/32b
PPC601 (2) | 5 cycles/32b 6 cycles/32b 13-22 cycl/32b 16-25 cycl/32b
67MHz (2) | 0.43 Gb/s 0.36 Gb/s 5.3-3.0 Gb/s 4.3-2.7 Gb/s
PPC603 | ? ? ? ?
*PPC604 | 2 3 2 3
*167MHz | 57 Gb/s
PPC620 | ? ? ? ?
------------+-----------------------------------------------------------------
Tege |
Model 1 | 2 cycles/64b 3 cycles/64b 2 cycles/64b 3 cycles/64b
250MHz | 8 Gb/s 5.3 Gb/s 500 Gb/s 340 Gb/s
500MHz | 16 Gb/s 11 Gb/s 1000 Gb/s 680 Gb/s
____________|_________________________________________________________________
(1) Using POWER and PowerPC instructions
(2) Using only PowerPC instructions
(3) Actual timing for shift/add/sub depends on code alignment. PA7000 code
is smaller and therefore often faster on this CPU.
(4) Multiplication routines modified for bogus UltraSPARC early-out
optimization. Smaller operand is put in rs1, not rs2 as it should
according to the SPARC architecture manuals.
(5) Preliminary timings, since there is no stable 64-bit environment.
(6) Use mulu.d at least for mpn_lshift. With mak/extu/or, we can only get
to 2 cycles/32b.
=============================================================================
Estimated theoretical asymptotic cycle counts for low-level routines:
| mpn_lshift mpn_add_n mpn_mul_1 mpn_addmul_1
| mpn_rshift mpn_sub_n mpn_submul_1
------------+-----------------------------------------------------------------
DEC/Alpha |
EV4 | 3 cycles/64b 5 cycles/64b 42 cycles/64b 42 cycles/64b
EV5 | 3 cycles/64b 4 cycles/64b 18 cycles/64b 18 cycles/64b
------------+-----------------------------------------------------------------
Sun/SPARC |
SuperSPARC | 2.5 cycles/32b 2 cycles/32b 8 cycles/32b 9 cycles/32b
------------+-----------------------------------------------------------------
SGI/MIPS |
R4400/32 | 5 cycles/64b 8 cycles/64b 16 cycles/64b 16 cycles/64b
R4400/64 | 5 cycles/64b 8 cycles/64b 22 cycles/64b 22 cycles/64b
R4600 |
------------+-----------------------------------------------------------------
HP/PA-RISC |
PA7100 | 3 cycles/32b 4 cycles/32b 6.5 cycles/32b 7.5 cycles/32b
PA7100LC |
------------+-----------------------------------------------------------------
Motorola |
MC88110 | 1.5 cyc/32b (6) 1.5 cycle/32b 1.5 cycles/32b 2.25 cycles/32b
------------+-----------------------------------------------------------------
Intel/x86 |
486DX4 |
Pentium P5x | 5 cycles/32b 2 cycles/32b 11.5 cycles/32b 13 cycles/32b
Pentium Pro | 2 cycles/32b 3 cycles/32b 4 cycles/32b 6 cycles/32b
------------+-----------------------------------------------------------------
IBM/POWER |
RIOS 1 | 3 cycles/32b 4 cycles/32b
RIOS 2 | 1.5 cycles/32b 2 cycles/32b 4.5 cycles/32b 5.5 cycles/32b
------------+-----------------------------------------------------------------
PowerPC |
PPC601 (1) | 3 cycles/32b ?4 cycles/32b
PPC601 (2) | 4 cycles/32b ?4 cycles/32b
____________|_________________________________________________________________
|