<!---======================= begin_copyright_notice ============================
Copyright (C) 2020-2022 Intel Corporation
SPDX-License-Identifier: MIT
============================= end_copyright_notice ==========================-->
## Opcode
DPAS = 0x83
## Format
| | | | | | | |
| --- | --- | --- | --- | --- | --- | --- |
| 0x83(DPAS) | Exec_size | Dst | Src0 | Src1 | Src2 | W |
| | A | SD | RC | | | |
## Semantics
```
DPAS is a matrix multiply-add operation as follows:
    D = C + A x B
    where
        D (Dst)  : MxN
        C (Src0) : MxN
        A (Src2) : MxK
        B (Src1) : KxN
        M : repeat count
        N : fixed execution size, either 8 or 16
        K : depth * OPS_PER_CHAN
    OPS_PER_CHAN
        1 : for TF32
        2 : for 16-bit precision (BF, HF)
        4 : for 8-bit precision (FP8, UB, B)
        8 : for less-than-8-bit precision (U4/S4, U2/S2)
    If depth is 8, K would be 8, 16, 32, or 64 (based on OPS_PER_CHAN).
Note that Src2 is A, and Src1 is B.
Conceptually, the DPAS instruction is simple. What makes it a little complicated to understand is that
matrix B must have a special layout in GRF. The following might help in understanding it.
If we view GRFs as a 1-D memory space in increasing order of GRF numbers, Dst, Src0, and Src2
are laid out in row-major order in this 1-D memory space. Src1, however, needs a special layout that is
neither row-major nor column-major. To visualize the layout of Src1, we can view the entire GRF file as a 2-D
memory space in which each row is exactly one whole GRF. For example, 128 GRFs of 16 DWs each
can be viewed as a 128x16 array of DWs (it would be 128x8 if the GRF size is 8 DWs), which means N must be 16.
Under this view, each of the 16 columns of Src1 (matrix B) is packed into the corresponding
column of this 2-D memory space: column 0 of Src1 goes to column 0 of the 2-D memory space, column 1
of Src1 to column 1, and so on. Take 8-bit precision as an example, for which K is 32 (with depth 8). The
32 8-bit elements of a column are packed into 8 DWs (4 elements per DW). For all 16 columns, this
takes 8 rows of the 2-D memory space, that is, 8 GRFs.
With this in mind, the following detailed description should not be hard to follow.
1) The semantics for integer DPAS and float (bf/hf) DPAS:
The semantics are described using the following notation:
Let dot4(X.a[3:0], Y.b[3:0]) be a 4-element dot product of X and Y, starting
at element offsets a and b respectively. Similarly, let dot2(X.a[1:0], Y.b[1:0])
be a 2-element dot product of X and Y, and dot8(X.a[7:0], Y.b[7:0]) be an 8-element
dot product of X and Y. Note that the element sizes of X and Y may differ.
For example, if X's precision is 4 bits and Y's is 8 bits, X's element
size is 4 bits and Y's is 8 bits. Also, X.DW[j] denotes
the (j+1)-th DW of X, and X.R[j] denotes the (j+1)-th GRF. Similarly, X.R[1].DW[2]
refers to the 3rd DW of the 2nd GRF of X. Note that indices always start from zero.
The GRF size is 8 DWs before PVC and 16 DWs in PVC or later.
// 1) Src1PrecisionInBits and Src2PrecisionInBits are 16 for both bfloat (bf) and half (hf) dpas.
// 2) Element size for Src0 and Dst is 4 bytes (either int or fp32).
// 3) Element size for Src1 and Src2 can be 2 bits, 4 bits, 8 bits, or 16 bits (bf/hf),
//    as given by their precisions.
// OPS_PER_CHAN
if (Src1PrecisionInBits == 16) {
    // BF/HF dpas. Src2PrecisionInBits must be 16
    OPS_PER_CHAN = 2;
} else if (Src1PrecisionInBits == 8 || Src2PrecisionInBits == 8) {
    // integer dpas : dot4 per DW
    OPS_PER_CHAN = 4;
} else {
    // integer dpas : dot8 per DW
    OPS_PER_CHAN = 8;
}
// Src1's DW of each channel might be used in more than one depth (based on its precision).
// For example, for an integer dpas with src1's precision being 2 bits, one DW has data for
// 4 depths (assuming dot4 per DW: each dot4 consumes 8 bits, so a DW can feed 4 dot4 operations).
// Under SD=8, the first four depths use the entire DW of the first GRF holding src1,
// and the second four depths use the entire DW of the second GRF. If src1's precision is
// 8 bits (dot4 per DW), each DW is used in exactly one depth, and the next depth
// uses the DW of the next GRF. The following variable is used in the pseudo code:
//   SRC1_OPERANDS_PER_CHAN :
//     how many operands each DW holds for src1 (each channel is DW-wide).
//     Each operand is the data used for a single dot4/dot8 (int) or dot2 (bf/hf).
//     This means that for float DPAS, SRC1_OPERANDS_PER_CHAN = 1; and for int DPAS,
//       if OPS_PER_CHAN = 4
//         SRC1_OPERANDS_PER_CHAN = 1 (8-bit precision), 2 (4-bit), or 4 (2-bit)
//       else // OPS_PER_CHAN = 8
//         SRC1_OPERANDS_PER_CHAN = 1 (4-bit), or 2 (2-bit)
SRC1_OPERANDS_PER_CHAN = 32 / (OPS_PER_CHAN * Src1PrecisionInBits);
Exec_size = isPrePVC ? 8 : 16;
k = 0;  // element offset into Src2
for (r = 0; r < RC; ++r)
{
    temp = Src0.R[r];
    for (d = 0; d < SD; ++d)
    {
        m = d / SRC1_OPERANDS_PER_CHAN;                   // to select GRF
        n = (d % SRC1_OPERANDS_PER_CHAN) * OPS_PER_CHAN;  // in unit of element
        for (i = 0; i < Exec_size; i++)
        {   // for each channel
            if (OPS_PER_CHAN == 4)        // int dpas
                temp.DW[i] += dot4(Src1.R[m].DW[i].n[3:0], Src2.k[3:0]);
            else if (OPS_PER_CHAN == 8)   // int dpas
                temp.DW[i] += dot8(Src1.R[m].DW[i].n[7:0], Src2.k[7:0]);
            else                          // float DPAS
                temp.F[i] += dot2(Src1.R[m].DW[i].n[1:0], Src2.k[1:0]);
        }
        k += OPS_PER_CHAN;
    }
    // update dst
    Dst.R[r] = temp;
}
Dst and Src0 advance by one GRF per repeat; Src2 advances by 8*OPS_PER_CHAN elements
per repeat (SD*OPS_PER_CHAN with SD = 8). Src1 stays the same across repeats.
```
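The loop nest above can be cross-checked against an ordinary matrix multiply-add. The following is a minimal Python sketch, not the hardware definition: it works on already-unpacked element values, covers only the integer and bf/hf cases of the pseudocode, and ignores overflow and rounding. The names `ops_per_chan`, `pack_b`, and `dpas_ref` are hypothetical helpers introduced for illustration.

```python
def ops_per_chan(src1_bits, src2_bits):
    """OPS_PER_CHAN as selected by the pseudocode (bf/hf and integer cases only)."""
    if src1_bits == 16:
        return 2                      # bf/hf: dot2 per DW
    if src1_bits == 8 or src2_bits == 8:
        return 4                      # integer: dot4 per DW
    return 8                          # integer: dot8 per DW


def pack_b(b, src1_bits, exec_size):
    """Arrange a K x N matrix B in the Src1 view: column i of B fills channel i.

    Returns src1[m][i] = list of unpacked elements held by DW i of GRF m.
    Assumes K is a multiple of the number of elements per DW.
    """
    elems_per_dw = 32 // src1_bits
    n_grf = len(b) // elems_per_dw
    return [[[b[m * elems_per_dw + e][i] for e in range(elems_per_dw)]
             for i in range(exec_size)]
            for m in range(n_grf)]


def dpas_ref(src0, src1, src2, src1_bits, src2_bits, sd, rc, exec_size):
    """Element-level model of the DPAS loop nest (no overflow or rounding modeled)."""
    ops = ops_per_chan(src1_bits, src2_bits)
    operands_per_chan = 32 // (ops * src1_bits)   # SRC1_OPERANDS_PER_CHAN
    dst = []
    k = 0                                          # element offset into Src2
    for r in range(rc):
        temp = list(src0[r])
        for d in range(sd):
            m = d // operands_per_chan             # selects the Src1 GRF
            n = (d % operands_per_chan) * ops      # element offset inside the DW
            for i in range(exec_size):             # one DW-wide channel at a time
                dw = src1[m][i]
                temp[i] += sum(dw[n + j] * src2[k + j] for j in range(ops))
            k += ops
        dst.append(temp)
    return dst
```

With Src2 given as matrix A flattened in row-major order and Src1 produced by `pack_b`, `dpas_ref` should return the same values as computing `C + A x B` directly, which is the property the layout description relies on.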
## Description
Integer DPAS is an element-wise multiply-add-and-accumulate operation over multiple elements
in a systolic pipeline with low-precision (<= 8 bits) inputs. Src1 is also referred to
as Weights, and Src2 as Activation. The element types of Src1 and Src2 are defined by
the precision fields W and A, respectively. Src1 is divided into elements along each 32-bit SIMD channel.
Float DPAS is the same, except that its elements are always 16 bits, so each 32-bit channel holds
exactly two elements. The element type is either half float (hf) or bfloat16 (bf).
The operand precision refers to the type of the elements that make up an operand. Generally,
it is no larger than 8 bits in size for integer, and 16 bits for both bf and hf.
Currently, precision is used only in the DPAS instruction, and the operand that a precision applies
to must be of either D or UD type. The following table lists all possible precisions with their text
and binary formats:
.. table:: The following table lists all operand precisions:
:align: center
+-------------------+-------------+--------+--------+
| Operand Precision | Range       | Binary | Text   |
|                   |             | Format | Format |
+-------------------+-------------+--------+--------+
| unused            |             | 0000b  |        |
+-------------------+-------------+--------+--------+
| Unsigned 1-bit    | [0, 1]      | 0001b  | u1     |
+-------------------+-------------+--------+--------+
| Signed 1-bit      | [-1, 0]     | 0010b  | s1     |
+-------------------+-------------+--------+--------+
| Unsigned 2-bits   | [0, 3]      | 0011b  | u2     |
+-------------------+-------------+--------+--------+
| Signed 2-bits     | [-2, 1]     | 0100b  | s2     |
+-------------------+-------------+--------+--------+
| Unsigned 4-bits   | [0, 15]     | 0101b  | u4     |
+-------------------+-------------+--------+--------+
| Signed 4-bits     | [-8, 7]     | 0110b  | s4     |
+-------------------+-------------+--------+--------+
| Unsigned 8-bits   | [0, 255]    | 0111b  | u8     |
+-------------------+-------------+--------+--------+
| Signed 8-bits     | [-128, 127] | 1000b  | s8     |
+-------------------+-------------+--------+--------+
| bfloat            | bfloat16    | 1001b  | bf     |
+-------------------+-------------+--------+--------+
| half              | fp16        | 1010b  | hf     |
+-------------------+-------------+--------+--------+
| tf32              | tf32        | 1100b  | tf32   |
+-------------------+-------------+--------+--------+
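As an illustration of how the binary encodings and integer ranges in the table relate, here is a small Python sketch. The dictionary simply transcribes the table; the element width recorded for tf32 (one element per 32-bit DW) is inferred from OPS_PER_CHAN = 1 and is an assumption, and `int_range` is a hypothetical helper.

```python
# binary encoding -> (text format, bits per element); transcribed from the table.
# The 32-bit width for tf32 is an assumption (one element per DW).
PRECISIONS = {
    0b0001: ("u1", 1),
    0b0010: ("s1", 1),
    0b0011: ("u2", 2),
    0b0100: ("s2", 2),
    0b0101: ("u4", 4),
    0b0110: ("s4", 4),
    0b0111: ("u8", 8),
    0b1000: ("s8", 8),
    0b1001: ("bf", 16),
    0b1010: ("hf", 16),
    0b1100: ("tf32", 32),
}


def int_range(code):
    """Return (min, max) for an integer precision, or None for float precisions."""
    text, bits = PRECISIONS[code]
    if text.startswith("u"):
        return (0, 2**bits - 1)
    if text.startswith("s"):
        return (-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return None
```

The signed ranges follow two's complement, which is why the signed 1-bit precision covers [-1, 0].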
bfloat16 is a 16-bit float type (E8M7: 1-bit sign, 8-bit exponent, 7-bit mantissa; that is, an IEEE 754
single-precision 32-bit float truncated to its upper 16 bits), and fp16 is the IEEE 754 half-precision type.
TF32 is a 19-bit tensor float type (E8M10) with a 1-bit sign, an 8-bit exponent, and a 10-bit mantissa.
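The relationship between these formats and a 32-bit float can be sketched by masking the float32 bit pattern. This Python example only illustrates the bit layouts described above; it truncates (rounds toward zero), whereas actual conversions may round differently, and the helper names are hypothetical.

```python
import struct


def f32_bits(x):
    """Return the IEEE 754 single-precision bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bf16_truncate(x):
    """Keep the upper 16 bits of the float32 encoding: sign, 8 exponent, 7 mantissa bits."""
    return f32_bits(x) >> 16


def bf16_to_float(b):
    """Widen a bfloat16 bit pattern back to a Python float by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


def tf32_truncate(x):
    """Keep the upper 19 bits (sign, 8 exponent, 10 mantissa) in a 32-bit container."""
    return f32_bits(x) & 0xFFFFE000
```

Because bfloat16 keeps the full 8-bit exponent of float32, widening a bfloat16 value back to float32 is exact; only the narrowing direction loses mantissa bits.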
For integer types, src1 and src2 may have different precisions. For float types, src1
and src2 must have the same precision (either bf or hf, not mixed). For the df type, dst, src0, src1, and src2
must all have df precision.
Dst and Src0 take a regular integer type (e.g. D or UD) or a float type that is 32 bits in size.
Src0 is the accumulator to which the products are added. The SD parameter is the systolic depth of the operation,
meaning that a sequence of these operations is performed, advancing over successive inputs. The output of each stage
is a dword (integer or float) or a df, which is the accumulated input to the next systolic stage. The first stage's
accumulation input is Src0. The last stage's accumulated output is written to Dst.
Not all combinations of operand types are allowed. The following table gives all the legal combinations.
Note that a particular platform might impose additional restrictions, and this spec follows those
restrictions.
.. table:: All legal combinations of types and precisions.
:align: center
+--------+---------+-----------+-----------+
| Dst    | Src0    | Src1      | Src2      |
|        |         | Precision | Precision |
+--------+---------+-----------+-----------+
| UD, D  | UD, D   | int       | int       |
+--------+---------+-----------+-----------+
| F, BF  | F, BF   | BF        | BF        |
+--------+---------+-----------+-----------+
| F, HF  | F, HF   | HF        | HF        |
+--------+---------+-----------+-----------+
| F      | F       | TF32      | TF32      |
+--------+---------+-----------+-----------+
| F      | F       | BF8, HF8  | BF8, HF8  |
+--------+---------+-----------+-----------+
SD can be encoded as 1, 2, 4, and 8. XEHP+ only supports a systolic depth of 8.
- **Exec_size(ub):** Execution size
  - Bit[2..0]: size of the region for source and destination operands
    - {XEHP}0b011: 8 elements
    - {PVC}0b100: 16 elements
  - Bit[7..4]: execution mask (explicit control over the enabled channels)
    - 0b0000: M1
    - 0b0001: M2
    - 0b0010: M3
    - 0b0011: M4
    - 0b0100: M5
    - 0b0101: M6
    - 0b0110: M7
    - 0b0111: M8
    - 0b1000: M1_NM
    - 0b1001: M2_NM
    - 0b1010: M3_NM
    - 0b1011: M4_NM
    - 0b1100: M5_NM
    - 0b1101: M6_NM
    - 0b1110: M7_NM
    - 0b1111: M8_NM
- **Dst(raw_operand):** The destination operand. Must have type D,UD,F,BF,HF,DF
- **Src0(raw_operand):** The source 0 operand. It could be a null operand, meaning it is zero. Must have type D,UD,F,BF,HF,DF
- **Src1(raw_operand):** The field **W** further defines its element precision. Must have type D,UD,DF
- **Src2(vec_operand):** The field **A** defines its element precision. Must have type D,UD,DF. Operand class: general
- **W(ub):** The precision of Src1, shown in the table above
- **A(ub):** The precision of Src2, shown in the table above
- **SD(ub):** Systolic depth
- **RC(ub):** Repeat Count
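
The Exec_size byte described in the operand list packs two fields. A minimal Python sketch of decoding it, assuming only the two size encodings listed for DPAS are valid and that bit 3 is ignored (`decode_exec_size` is a hypothetical helper, not part of any tool):

```python
# Bit[2..0] encodings listed for DPAS; any other value is rejected in this sketch.
SIZE_BY_CODE = {0b011: 8, 0b100: 16}


def decode_exec_size(byte):
    """Split the Exec_size byte into (element count, execution mask name)."""
    size_code = byte & 0b111
    if size_code not in SIZE_BY_CODE:
        raise ValueError(f"unsupported DPAS execution size encoding: {size_code:#05b}")
    mask_code = (byte >> 4) & 0b1111
    # The low three bits of the mask select M1..M8; the top bit selects the _NM variant.
    mask = f"M{(mask_code & 0b111) + 1}"
    if mask_code & 0b1000:
        mask += "_NM"
    return SIZE_BY_CODE[size_code], mask
```

For example, `decode_exec_size(0x94)` yields 16 elements with mask `M2_NM`.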
#### Properties
- **Supported Types:** D,F,UD
- **Source Modifier:** false
## Text
```
DPAS.W.A.SD.RC (Exec_size) <dst> <src0> <src1> <src2>
```
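To make the field order in the text form concrete, the following Python sketch splits a `DPAS.W.A.SD.RC` mnemonic into its parts. It is an illustrative parser written for this page, not the assembler's actual implementation, and it performs no validation of the precision names.

```python
def parse_dpas_mnemonic(text):
    """Split 'DPAS.W.A.SD.RC' into its fields (hypothetical helper)."""
    parts = text.strip().split(".")
    if len(parts) != 5 or parts[0].upper() != "DPAS":
        raise ValueError(f"not a DPAS mnemonic: {text!r}")
    _, w, a, sd, rc = parts
    # W is the Src1 precision, A is the Src2 precision.
    return {"W": w, "A": a, "SD": int(sd), "RC": int(rc)}
```

For `DPAS.u4.s8.8.8` this returns W = `u4` (Src1), A = `s8` (Src2), a systolic depth of 8, and a repeat count of 8.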
## Notes
```
- **Register region:** No
- **Alignment:** Dst, Src0, and Src1 are GRF aligned; for integer or float (bf, hf) DPAS, Src2 is
(SD/(32/(Src2PrecisionInBits * OPS_PER_CHAN))) DWORD aligned. That is, for SD=8 and OPS_PER_CHAN=4,
Src2 would be 8 DWORD aligned for s8/u8 precision, 4 DWORD aligned for s4/u4, and 2 DWORD aligned
for s2/u2.
The Src2 type should be consistent with Src1's: if Src1 is an integer type, Src2 must be an integer
type; if Src1 is bf, Src2 must be bf; if Src1 is hf, Src2 must be hf too.
Here are some examples:
- DPAS.u4.s8.8.8 (Exec_size) <dst> <src0> <src1> <src2> // int DPAS with u4 for src1 and s8 for src2
- DPAS.bf.bf.8.8 (Exec_size) <dst> <src0> <src1> <src2> // float DPAS with bfloat as element type
- DPAS.hf.hf.8.8 (Exec_size) <dst> <src0> <src1> <src2> // float DPAS with half as element type
Also note that Src2 should be uniform, meaning the entire Src2 is used by every SIMD lane.
Exec_size is 16 for PVC and 8 otherwise for int or float DPAS.
```
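The Src2 alignment rule in the notes can be evaluated directly. A minimal Python sketch of the formula as written, with `src2_alignment_dwords` as a hypothetical helper name:

```python
def src2_alignment_dwords(sd, src2_bits, ops_per_chan):
    """Src2 alignment in DWORDs: SD / (32 / (Src2PrecisionInBits * OPS_PER_CHAN))."""
    operands_per_dw = 32 // (src2_bits * ops_per_chan)
    return sd // operands_per_dw
```

With SD = 8 and OPS_PER_CHAN = 4 this gives 8, 4, and 2 DWORDs for 8-bit, 4-bit, and 2-bit Src2 precisions, matching the values listed in the notes.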