<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html xmlns:o>
<head>
<title></title>
<meta name="GENERATOR" content="Microsoft Visual Studio .NET 7.1">
<meta name="vs_targetSchema" content="http://schemas.microsoft.com/intellisense/ie5">
</head>
<body>
<H1>Data and Instruction Range Restrictions in PAPI</H1>
<H2>Introduction</H2>
<P>Performance instrumentation of data structures, as opposed to code segments, is
not widely supported across platforms. One platform that does support it is
the Itanium 2. In fact, event counting on Itanium 2 can be qualified by a
number of conditions, including instruction address, opcode matching, and data
address. We have implemented a generalized PAPI interface for data structure
and instruction range performance instrumentation, also referred to as data
and instruction range specification, and applied that interface to the
Itanium 2 platform to demonstrate its viability. This feature is introduced
for the first time in the PAPI 3.5 release.
</P>
<H2>The PAPI Interface</H2>
<P>Since PAPI is a platform-independent library, care must be taken when extending
its feature set so as not to disrupt the existing interface or to clutter the
API with calls to functionality that is not available on a large subset of the
supported platforms. To that end, we elected to extend an existing PAPI call, <SPAN style="FONT-FAMILY: 'Courier New'">
PAPI_set_opt()</SPAN>, with the capability of specifying starting and
ending addresses of data structures or instructions to be instrumented. The <SPAN style="FONT-FAMILY: 'Courier New'">
PAPI_set_opt()</SPAN> call previously supported setting a
variety of optional capabilities in the PAPI interface, including debug levels,
multiplexing of eventsets, and the
scope of counting domains. This call was extended with two new cases to support
instruction and data address range specification: <SPAN style="FONT-FAMILY: 'Courier New'">
PAPI_INSTR_ADDRESS </SPAN>and<SPAN style="FONT-FAMILY: 'Courier New'"> PAPI_DATA_ADDRESS</SPAN>.
To access these options, a user initializes a simple option-specific data
structure and calls <SPAN style="FONT-FAMILY: 'Courier New'">PAPI_set_opt()</SPAN>
as illustrated in the code fragment below:</P>
<P class="MsoNormal" style="mso-layout-grid-align: none"><SPAN style="FONT-FAMILY: 'Courier New'"><SPAN style="mso-spacerun: yes">
...<BR>
</SPAN></SPAN><SPAN style="FONT-FAMILY: 'Courier New'">option.addr.eventset
= EventSet;
<o:p></o:p></SPAN><BR>
<SPAN style="FONT-FAMILY: 'Courier New'"> option.addr.start =
(caddr_t)array;
<o:p></o:p></SPAN><BR>
<SPAN style="FONT-FAMILY: 'Courier New'"><SPAN style="mso-spacerun: yes"> </SPAN>
option.addr.end = (caddr_t)(array + size_array);
<o:p></o:p></SPAN><BR>
<SPAN style="FONT-FAMILY: 'Courier New'"><SPAN style="mso-spacerun: yes"> </SPAN>
retval = PAPI_set_opt(PAPI_DATA_ADDRESS, &option);<BR>
...</SPAN></P>
<P style="mso-layout-grid-align: none">The user creates a PAPI eventset and
determines the starting and ending addresses of the data to be monitored. The
call to <SPAN style="FONT-FAMILY: 'Courier New'">PAPI_set_opt</SPAN> then
prepares the interface to count events that occur on accesses to data in that
range. The specific events to be monitored can be added to the eventset either
before or after the data range is specified. In a similar fashion, an
instruction range can be set using the <SPAN style="FONT-FAMILY: 'Courier New'">PAPI_INSTR_ADDRESS
</SPAN>option. If this option is supported on the platform in use, the data is
transferred to the platform-specific implementation and handled appropriately.
If the option is not supported, the call returns an error code.</P>
<P style="mso-layout-grid-align: none">It may not always be possible to exactly
specify the address range of interest. If this is the case, it is important
that the user have some way to know what approximations have been made, so that
appropriate corrective action can be taken. For instance, to isolate a
specific data structure completely, it may be necessary to pad memory before
and after the structure with dummy structures that are never accessed. To
facilitate this, <FONT face="Courier New">PAPI_set_opt()</FONT> returns the
offsets from the requested starting and ending addresses as they were actually
programmed into the hardware. If the addresses were mapped exactly, these
values are zero. An example of this is shown below:</P>
<P class="MsoNormal" style="mso-layout-grid-align: none"><SPAN style="FONT-FAMILY: 'Courier New'"><SPAN style="mso-spacerun: yes">
...<BR>
</SPAN></SPAN><SPAN style="FONT-FAMILY: 'Courier New'"><SPAN style="mso-spacerun: yes">
</SPAN>retval = PAPI_set_opt(PAPI_DATA_ADDRESS, &option);<BR>
<SPAN style="FONT-FAMILY: 'Courier New'"> actual.start = (caddr_t)array
- option.addr.start_off;
<o:p></o:p></SPAN><BR>
<SPAN style="FONT-FAMILY: 'Courier New'"><SPAN style="mso-spacerun: yes">
actual.end = (caddr_t)(array + size_array) + </SPAN>option.addr.end_off;
<o:p></o:p></SPAN><BR>
...</SPAN></P>
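<P style="mso-layout-grid-align: none">The offset arithmetic above can be checked
in isolation. The following is a minimal sketch, not PAPI code: the types
<FONT face="Courier New">range_request_t</FONT> and <FONT face="Courier New">range_t</FONT>
and both functions are hypothetical names introduced for illustration, and the
structure only mirrors the four values discussed here. It assumes the start
offset is subtracted from the requested start and the end offset is added to
the requested end, as in the fragment above.</P>

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the address-range values discussed above.
 * This is NOT the PAPI option structure, only a mirror of four fields. */
typedef struct {
    uint64_t start;     /* requested starting address */
    uint64_t end;       /* requested ending address */
    int      start_off; /* bytes the monitored range begins before start */
    int      end_off;   /* bytes the monitored range extends past end */
} range_request_t;

typedef struct {
    uint64_t start; /* first address actually monitored */
    uint64_t end;   /* end address actually monitored */
} range_t;

/* Reconstruct the range that was actually programmed into the hardware. */
range_t actual_range(const range_request_t *req)
{
    range_t actual;

    assert(req->start <= req->end);
    assert(req->start_off >= 0 && req->end_off >= 0);

    actual.start = req->start - (uint64_t)req->start_off;
    actual.end   = req->end + (uint64_t)req->end_off;
    return actual;
}

/* Bytes monitored outside the requested range; zero for an exact mapping. */
uint64_t extra_bytes(const range_request_t *req)
{
    return (uint64_t)req->start_off + (uint64_t)req->end_off;
}
```

<P style="mso-layout-grid-align: none">A nonzero result from
<FONT face="Courier New">extra_bytes()</FONT> is the signal that padding around
the structure may be needed before the counts can be trusted.</P>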
<H2 class="MsoNormal" style="mso-layout-grid-align: none">Itanium Idiosyncrasies</H2>
<P style="mso-layout-grid-align: none">There are roughly 475 native events
available on Itanium 2. Of these, 160 are memory related and can be counted with
data address specification in place, and 283 can be counted using instruction
address specification. Every event in an eventset with data or instruction range
specification in place must be one of these supported events. Further
restrictions also apply to the use of data and instruction range specification,
as described below. Data addresses can only be specified in coarse mode.
Four independent pairs of data address registers exist in the hardware, which
would suggest that four disjoint address regions can be monitored
simultaneously, but the Intel documentation strongly discourages doing so.
Further, the underlying software takes advantage of these register pairs to tune
the range of addresses that is actually monitored. See the discussion under
<strong>Data Address Ranges</strong> for further detail.</P>
<H3 style="mso-layout-grid-align: none">Instruction Address Ranges</H3>
<P style="mso-layout-grid-align: none">Instruction ranges can be specified in one
of two ways: coarse and fine. In fine mode, addresses can be specified exactly,
but both the start and end addresses must exist on the same 4K byte page. In
other words, the address range must be less than 4K bytes, and the addresses
can only differ in the bottom 12 bits. If fine mode cannot be used, the
underlying perfmon library automatically switches to coarse address
specification. Four pairs of registers are available to specify coarse
instruction address ranges. The restrictions to coarse address specification
are discussed below.</P>
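<P style="mso-layout-grid-align: none">The fine mode condition described above
can be expressed as a one-line test. This is a minimal sketch under the
assumption that the rule is exactly as stated here (both addresses on the same
4K byte page); <FONT face="Courier New">fits_fine_mode()</FONT> is a
hypothetical helper, not part of PAPI or perfmon, and the real library may
apply further checks.</P>

```c
#include <assert.h>
#include <stdint.h>

#define FINE_MODE_PAGE_BITS 12 /* 4K byte page, as stated in the text */

/* Return 1 when start and end lie on the same 4K byte page, that is,
 * when they differ only in the bottom 12 bits; return 0 otherwise. */
int fits_fine_mode(uint64_t start, uint64_t end)
{
    assert(start <= end);
    return (start >> FINE_MODE_PAGE_BITS) == (end >> FINE_MODE_PAGE_BITS);
}
```

<P style="mso-layout-grid-align: none">Note that a short range can still fail
this test: a 512 byte range that straddles a page boundary differs above bit 11
and would fall back to coarse mode.</P>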
<H3 style="mso-layout-grid-align: none">Data Address Ranges</H3>
<P style="mso-layout-grid-align: none">Data addresses can only be specified in
coarse mode. As with instruction ranges, four pairs of registers are available
to specify the data address ranges. Use of coarse mode addressing for
either instruction or data address specification can cause some anomalous
results. The Intel documentation points out that starting and ending addresses
cannot be specified exactly, since the hardware representation relies on
powers-of-two bitmasks. The perfmon library tries to optimize the alignment of
these power-of-two regions to cover the addresses requested as effectively as
possible with the four sets of registers available. Perfmon first finds the
largest power-of-two address region completely contained within the requested
addresses. Then it finds successively smaller power-of-two regions to cover the
errors on the high and low end of the requested address range. The net effect
is that the range actually programmed always completely contains the requested
range, may be larger than it, and occupies from one to four
pairs of address registers. In some cases this can result in significant
overcounts of the events of interest, especially if two active data structures
are located in close proximity to each other. The developer may then need to
insert padding structures before and/or after a particular
structure of interest to guarantee accurate counts.</P>
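<P style="mso-layout-grid-align: none">The first step of the covering described
above, finding the largest power-of-two region inside the requested range, can
be sketched as follows. This is an illustration only and is not taken from the
perfmon sources: <FONT face="Courier New">pow2_block_t</FONT> and
<FONT face="Courier New">largest_block_inside()</FONT> are hypothetical names,
the range is treated as half-open, and each region is assumed to be naturally
aligned (its base is a multiple of its size), which is what a power-of-two
bitmask implies. The later steps that cover the low and high ends are not
shown.</P>

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t base; /* naturally aligned start of the block */
    uint64_t size; /* block length in bytes, a power of two; 0 if none fits */
} pow2_block_t;

/* Find the largest naturally aligned power-of-two block that lies
 * entirely within the half-open range [start, end). */
pow2_block_t largest_block_inside(uint64_t start, uint64_t end)
{
    pow2_block_t best = { 0, 0 };
    uint64_t size;

    assert(start <= end);

    /* Try block sizes from large to small; the first one that fits wins. */
    for (size = (uint64_t)1 << 62; size != 0; size >>= 1) {
        /* Round start up to the next multiple of size. */
        uint64_t base = (start + size - 1) & ~(size - 1);

        /* base >= start also rejects the case where rounding wrapped. */
        if (base >= start && base <= end && end - base >= size) {
            best.base = base;
            best.size = size;
            break;
        }
    }
    return best;
}
```

<P style="mso-layout-grid-align: none">For a 64K byte array that is not aligned
on a 64K boundary, the largest block that fits inside is only 32K bytes, which
is why additional, smaller regions are needed and why the monitored range can
spill over both ends.</P>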
<H2 style="mso-layout-grid-align: none">Supporting Software</H2>
<P class="MsoNormal" style="mso-layout-grid-align: none">To make this new PAPI
feature more accessible and easier to use, a test case was developed both to
provide a coding example and to exercise the functionality of the data
ranging features of the Itanium 2. In addition, the <FONT face="Courier New">papi_native_avail</FONT> utility
was modified to make it easier to identify events that support these features.</P>
<H3 style="mso-layout-grid-align: none">The <FONT face="Courier New">data_range.c</FONT> Test
Case</H3>
<P style="mso-layout-grid-align: none">A test case called <FONT face="Courier New">data_range</FONT> was
developed to measure memory load and store events on three different types
of data structures. Three static arrays of 16,384 <EM>ints</EM> were declared
in the program, and three dynamic arrays of 16,384 <EM>ints</EM> were malloc'd.
The data range was set, in turn, to the starting and ending
addresses of each of:</P>
<UL>
<LI>
<DIV style="mso-layout-grid-align: none">the pointers to the malloc'd arrays;</DIV>
<LI>
<DIV style="mso-layout-grid-align: none">the malloc'd arrays themselves;</DIV>
<LI>
<DIV style="mso-layout-grid-align: none">the statically declared arrays.</DIV>
</LI>
</UL>
<P style="mso-layout-grid-align: none">The work done in each case consisted of
storing an initialization value into each element of each array, and then
summing the values of all elements. This should produce 16,384 stores and 16,384
loads on each array.</P>
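<P style="mso-layout-grid-align: none">The workload just described can be
sketched as two loops. This is not the source of
<FONT face="Courier New">data_range.c</FONT>; the function names are
hypothetical, and the <FONT face="Courier New">volatile</FONT> qualifier is an
assumption added here so that a compiler does not remove or merge the memory
accesses being counted.</P>

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_LEN 16384 /* elements per array, as in the test case */

/* Store an initialization value into every element: one store each. */
void init_array(volatile int *a, size_t n, int value)
{
    size_t i;

    assert(a != NULL);
    for (i = 0; i < n; i++)
        a[i] = value;
}

/* Sum every element: one load each. */
long sum_array(const volatile int *a, size_t n)
{
    size_t i;
    long sum = 0;

    assert(a != NULL);
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```

<P style="mso-layout-grid-align: none">Calling
<FONT face="Courier New">init_array()</FONT> and then
<FONT face="Courier New">sum_array()</FONT> on one array of
<FONT face="Courier New">ARRAY_LEN</FONT> elements performs 16,384 stores
followed by 16,384 loads on that array.</P>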
<P style="mso-layout-grid-align: none">For the pointers, the size was 8 bytes and
the starting and ending addresses could be specified exactly. Output is shown
below:</P>
<P style="mso-layout-grid-align: none"><FONT face="Courier New">Measure loads and
stores on the pointers to the allocated arrays
<BR>
Expected loads: 32768; Expected stores: 0
<BR>
These loads result from accessing the pointers to compute array addresses.
<BR>
They will likely disappear with higher levels of optimization.
<BR>
Requested Start Address: 0x6000000000011640; Start Offset: 0x 0; Actual Start
Address: 0x6000000000011640
<BR>
Requested End Address: 0x6000000000011648; End Offset: 0x 0; Actual End
Address: 0x6000000000011648
<BR>
loads_retired: 32768
<BR>
stores_retired: 0
<BR>
Requested Start Address: 0x6000000000011628; Start Offset: 0x 0; Actual Start
Address: 0x6000000000011628
<BR>
Requested End Address: 0x6000000000011630; End Offset: 0x 0; Actual End
Address: 0x6000000000011630
<BR>
loads_retired: 32768
<BR>
stores_retired: 0
<BR>
Requested Start Address: 0x6000000000011638; Start Offset: 0x 0; Actual Start
Address: 0x6000000000011638
<BR>
Requested End Address: 0x6000000000011640; End Offset: 0x 0; Actual End
Address: 0x6000000000011640
<BR>
loads_retired: 32768
<BR>
stores_retired: 0 </FONT>
</P>
<P style="mso-layout-grid-align: none">For the allocated arrays, small offsets were
introduced in each case, and the resulting error in the loads and stores is
exactly what would be predicted by the activity in the adjacent memory
locations:</P>
<P style="mso-layout-grid-align: none"><FONT face="Courier New">Measure loads and
stores on the allocated arrays themselves
<BR>
Expected loads: 16384; Expected stores: 16384
<BR>
Requested Start Address: 0x6000000004044010; Start Offset: 0x 10; Actual Start
Address: 0x6000000004044000
<BR>
Requested End Address: 0x6000000004054010; End Offset: 0x 0; Actual End
Address: 0x6000000004054010
<BR>
loads_retired: 16384
<BR>
stores_retired: 16384
<BR>
Requested Start Address: 0x6000000004054020; Start Offset: 0x 20; Actual Start
Address: 0x6000000004054000
<BR>
Requested End Address: 0x6000000004064020; End Offset: 0x 0; Actual End
Address: 0x6000000004064020
<BR>
loads_retired: 16388
<BR>
stores_retired: 16388
<BR>
Requested Start Address: 0x6000000004064030; Start Offset: 0x 30; Actual Start
Address: 0x6000000004064000
<BR>
Requested End Address: 0x6000000004074030; End Offset: 0x 10; Actual End
Address: 0x6000000004074040
<BR>
loads_retired: 16392
<BR>
stores_retired: 16392 </FONT>
</P>
<P style="mso-layout-grid-align: none">For the static arrays, the locations of the
arrays resulted in significant offsets, and hence significant errors. The most
interesting case is the second one, in which the starting offset can be seen to
force the inclusion of all three pointers to the malloc'd arrays. Because of
this, the loads retired count exceeds the stores retired count by 98310, almost
exactly 3 * 32768 = 98304:</P>
<P style="mso-layout-grid-align: none"><FONT face="Courier New">Measure loads and
stores on the static arrays
<BR>
</FONT><FONT face="Courier New">These values will differ from the expected values
by the size of the offsets.
<BR>
Expected loads: 16384; Expected stores: 16384
<BR>
</FONT><FONT face="Courier New">Requested Start Address: 0x60000000000218cc; Start
Offset: 0x 18cc; Actual Start Address: 0x6000000000020000
<BR>
Requested End Address: 0x60000000000318cc; End Offset: 0x 734; Actual End
Address: 0x6000000000032000
<BR>
loads_retired: 18432
<BR>
stores_retired: 18432
<BR>
Requested Start Address: 0x60000000000118cc; Start Offset: 0x 18cc; Actual
Start Address: 0x6000000000010000
<BR>
Requested End Address: 0x60000000000218cc; End Offset: 0x 734; Actual End
Address: 0x6000000000022000
<BR>
loads_retired: 115155
<BR>
stores_retired: 16845
<BR>
Requested Start Address: 0x60000000000318cc; Start Offset: 0x 18cc; Actual
Start Address: 0x6000000000030000
<BR>
Requested End Address: 0x60000000000418cc; End Offset: 0x 734; Actual End
Address: 0x6000000000042000
<BR>
loads_retired: 17971
<BR>
stores_retired: 17971 </FONT>
</P>
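<P style="mso-layout-grid-align: none">The relationship between the offsets and
the counts in the static array listing can be worked through numerically. The
sketch below uses hypothetical helper names and assumes 4 byte
<EM>ints</EM> (16,384 elements span 0x10000 bytes in the listing) and that
every neighbouring element inside an offset is accessed once. Under those
assumptions, both offsets of the first case add 2048 elements, giving 18432;
the end offset alone gives the 16845 stores of the second case; and the start
offset alone gives the 17971 of the third case. Which neighbours are active in
each case is an inference from these numbers, not something the output states.</P>

```c
#include <assert.h>

#define INT_BYTES 4 /* assumed element size; matches the listing above */

/* Number of whole int elements contained in an offset of off_bytes bytes. */
long ints_in_offset(long off_bytes)
{
    assert(off_bytes >= 0);
    return off_bytes / INT_BYTES;
}

/* Predicted count when each neighbouring element inside the offsets is
 * accessed once: the expected count plus the elements in both offsets. */
long predicted_count(long expected, long start_off, long end_off)
{
    return expected + ints_in_offset(start_off) + ints_in_offset(end_off);
}
```

<P style="mso-layout-grid-align: none">The same arithmetic does not account for
the remaining loads in the second case, which come from the pointer accesses
rather than from neighbouring array elements.</P>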
<H2 style="mso-layout-grid-align: none">The <FONT face="Courier New">papi_native_avail</FONT>
Utility</H2>
<P style="mso-layout-grid-align: none">To use the instruction and data
address range specification feature for Itanium 2 effectively, one must know which of the
roughly 475 available native events support these features. In addition, there
are other qualifiers to Itanium 2 native events that are valuable to inspect.
For these reasons, the <FONT face="Courier New">papi_native_avail</FONT> utility
was enhanced to make it possible to filter the list of native events by these
qualifiers. A help feature was added to this utility to make it easier to
remember the Itanium-specific options:</P>
<P style="mso-layout-grid-align: none"><FONT face="Courier New">> papi_native_avail
--help
<BR>
This is the PAPI native avail program.
<BR>
It provides availability and detail information
<BR>
for PAPI native events. Usage:
<BR>
papi_native_avail [options]
<BR>
Options:
<BR>
<BR>
-h, --help print this help message
<BR>
--darr display Itanium events that support
Data Address Range Restriction
<BR>
--dear display Itanium Data Event Address
Register events only
<BR>
--iarr display Itanium events that support
Instruction Address Range Restriction
<BR>
--iear display Itanium Instruction Event
Address Register events only
<BR>
--opcm display Itanium events that support
OpCode Matching
<BR>
NOTE: The last five options are
mutually exclusive.</FONT></P>
<P style="mso-layout-grid-align: none">If any of these options are specified on the
command line, only those events that support that option are displayed. Even
so, the list can be extensive, with roughly 160 events supporting data address
ranging, and even more supporting instruction address ranging.</P>
</body>
</html>