BLAKE2b-SIMD
============
Pure Go implementation of BLAKE2b using SIMD optimizations.
Introduction
------------
This package was initially based on the pure Go [BLAKE2b](https://github.com/dchest/blake2b) implementation by Dmitry Chestnykh and merged with the (`cgo`-dependent) AVX-optimized [BLAKE2](https://github.com/codahale/blake2) implementation (which in turn is based on the [official implementation](https://github.com/BLAKE2/BLAKE2)). It uses [Go's Assembler](https://golang.org/doc/asm) on amd64 architectures, with a Go-only fallback for other architectures.
In addition to AVX, there is also support for AVX2 and SSE. Best performance is obtained with AVX2, which gives roughly a **4X** performance increase, approaching hashing speeds of **1GB/sec** on a single core.
Benchmarks
----------
This is a summary of the speedups over the pure Go implementation when hashing 128K inputs. Full details are shown below.
| Technology | 128K |
| ---------- |:-----:|
| AVX2 | 3.94x |
| AVX | 3.28x |
| SSE | 2.85x |
asm2plan9s
----------
In order to be able to work more easily with AVX2/AVX instructions, a separate tool was developed to convert AVX2/AVX instructions into the corresponding BYTE sequence as accepted by Go assembly. See [asm2plan9s](https://github.com/minio/asm2plan9s) for more information.
bt2sum
------
[bt2sum](https://github.com/s3git/bt2sum) is a utility that takes advantage of the BLAKE2b SIMD optimizations to compute checksums using the BLAKE2 Tree hashing mode in so-called 'unlimited fanout' mode.
Technical details
-----------------
BLAKE2b is a hashing algorithm that operates on 64-bit integer values. The AVX2 version uses the 256-bit wide YMM registers to process four operations in parallel. The AVX and SSE versions use the 128-bit wide XMM registers, which hold two 64-bit values each (two operations in parallel). Below are excerpts from `compressAvx2_amd64.s`, `compressAvx_amd64.s`, and `compress_generic.go` respectively.
```
VPADDQ YMM0,YMM0,YMM1 /* v0 += v4, v1 += v5, v2 += v6, v3 += v7 */
```
```
VPADDQ XMM0,XMM0,XMM2 /* v0 += v4, v1 += v5 */
VPADDQ XMM1,XMM1,XMM3 /* v2 += v6, v3 += v7 */
```
```
v0 += v4
v1 += v5
v2 += v6
v3 += v7
```
Detailed benchmarks
-------------------
Example performance metrics were generated on an Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz (6 physical cores, 12 logical cores) running Ubuntu GNU/Linux with kernel version 4.4.0-24-generic (vanilla with no optimizations).
### AVX2
```
$ benchcmp go.txt avx2.txt
benchmark old ns/op new ns/op delta
BenchmarkHash64-12 1481 849 -42.67%
BenchmarkHash128-12 1428 746 -47.76%
BenchmarkHash1K-12 6379 2227 -65.09%
BenchmarkHash8K-12 37219 11714 -68.53%
BenchmarkHash32K-12 140716 35935 -74.46%
BenchmarkHash128K-12 561656 142634 -74.60%
benchmark old MB/s new MB/s speedup
BenchmarkHash64-12 43.20 75.37 1.74x
BenchmarkHash128-12 89.64 171.35 1.91x
BenchmarkHash1K-12 160.52 459.69 2.86x
BenchmarkHash8K-12 220.10 699.32 3.18x
BenchmarkHash32K-12 232.87 911.85 3.92x
BenchmarkHash128K-12 233.37 918.93 3.94x
```
### AVX2: Comparison to other hashing techniques
```
$ go test -bench=Comparison
BenchmarkComparisonMD5-12 1000 1726121 ns/op 607.48 MB/s
BenchmarkComparisonSHA1-12 500 2005164 ns/op 522.94 MB/s
BenchmarkComparisonSHA256-12 300 5531036 ns/op 189.58 MB/s
BenchmarkComparisonSHA512-12 500 3423030 ns/op 306.33 MB/s
BenchmarkComparisonBlake2B-12 1000 1232690 ns/op 850.64 MB/s
```
Benchmarks below were generated on a MacBook Pro with a 2.7 GHz Intel Core i7.
### AVX
```
$ benchcmp go.txt avx.txt
benchmark old ns/op new ns/op delta
BenchmarkHash64-8 813 458 -43.67%
BenchmarkHash128-8 766 401 -47.65%
BenchmarkHash1K-8 4881 1763 -63.88%
BenchmarkHash8K-8 36127 12273 -66.03%
BenchmarkHash32K-8 140582 43155 -69.30%
BenchmarkHash128K-8 567850 173246 -69.49%
benchmark old MB/s new MB/s speedup
BenchmarkHash64-8 78.63 139.57 1.78x
BenchmarkHash128-8 166.98 318.73 1.91x
BenchmarkHash1K-8 209.76 580.68 2.77x
BenchmarkHash8K-8 226.76 667.46 2.94x
BenchmarkHash32K-8 233.09 759.29 3.26x
BenchmarkHash128K-8 230.82 756.56 3.28x
```
### SSE
```
$ benchcmp go.txt sse.txt
benchmark old ns/op new ns/op delta
BenchmarkHash64-8 813 478 -41.21%
BenchmarkHash128-8 766 411 -46.34%
BenchmarkHash1K-8 4881 1870 -61.69%
BenchmarkHash8K-8 36127 12427 -65.60%
BenchmarkHash32K-8 140582 49512 -64.78%
BenchmarkHash128K-8 567850 199040 -64.95%
benchmark old MB/s new MB/s speedup
BenchmarkHash64-8 78.63 133.78 1.70x
BenchmarkHash128-8 166.98 311.23 1.86x
BenchmarkHash1K-8 209.76 547.37 2.61x
BenchmarkHash8K-8 226.76 659.20 2.91x
BenchmarkHash32K-8 233.09 661.81 2.84x
BenchmarkHash128K-8 230.82 658.52 2.85x
```
License
-------
Released under the Apache License v2.0. You can find the complete text in the file LICENSE.
Contributing
------------
Contributions are welcome; please send PRs for any enhancements.