1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
|
TODO: Port the ARMv7 (32bit) less-than-8-bit GEMM kernel to (ARMv8 64bit)
Platforms: ARM NEON
Coding time: M
Experimentation time: M
Skill required: M
Prerequisite reading:
doc/kernels.txt
doc/packing.txt
Model to follow/adapt:
internal/kernel_neon.h
In internal/kernel_neon.h, for ARMv7 (32bit), we have a kernel
specifically designed to take advantage of smaller operands ranges
to use 16-bit local accumulators to achieve higher arithmetic throughput:
NEON_32_Kernel12x4Depth2Assuming12BitProducts
This is the kernel used with BitDepthSetting::L7R5 and is what allows
this bit depth setting to outperform L8R8.
This TODO item is about porting it to ARMv8 (64bit) assembly. It can
be approached in two parts:
1. Make a trivial port of the existing ARMv7 assembly code in
NEON_32_Kernel12x4Depth2Assuming12BitProducts to ARMv8 assembly.
2. Consider ways to make use of the larger register space availble on ARMv8:
there are 32 128-bit vector registers, instead of 16 on ARMv7.
A simple way, and quite possibly the best, would be to take the same
approach already implemented in NEON_64_Kernel12x8Depth2:
When porting a 12x4 kernel from ARMv7 to ARMv8, the extra register
space can be put to good use by doubling the RHS kernel width, from
4 to 8, thus changing the 12x4 kernel size to 12x8. Since cells
are of width 4, this means switching from 1 RHS cell to 2 RHS cells.
Since everything else remains unchanged, this should be a rather
simple change to implement. Compare NEON_64_Kernel12x8Depth2
to NEON_32_Kernel12x4Depth2.
|