1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395

Strong (welldistributed and unpredictable) hashes:
* Portable implementation of
[SipHash](https://www.131002.net/siphash/siphash.pdf)
* HighwayHash, a 5x faster SIMD hash with [security
claims](https://arxiv.org/abs/1612.06257)
## Quick Start
To build on a Linux or Mac platform, simply run `make`. For Windows, we provide
a Visual Studio 2015 project in the `msvc` subdirectory.
Run `benchmark` for speed measurements. `sip_hash_test` and `highwayhash_test`
ensure the implementations return knowngood values for a given set of inputs.
64bit SipHash for any CPU:
#include "highwayhash/sip_hash.h"
using namespace highwayhash;
const HH_U64 key2[2] HH_ALIGNAS(16) = {1234, 5678};
char in[8] = {1};
return SipHash(key2, in, 8);
64, 128 or 256 bit HighwayHash for the CPU determined by compiler flags:
#include "highwayhash/highwayhash.h"
using namespace highwayhash;
const HHKey key HH_ALIGNAS(32) = {1, 2, 3, 4};
char in[8] = {1};
HHResult64 result; // or HHResult128 or HHResult256
HHStateT<HH_TARGET> state(key);
HighwayHashT(&state, in, 8, &result);
64, 128 or 256 bit HighwayHash for the CPU on which we're currently running:
#include "highwayhash/highwayhash_target.h"
#include "highwayhash/instruction_sets.h"
using namespace highwayhash;
const HHKey key HH_ALIGNAS(32) = {1, 2, 3, 4};
char in[8] = {1};
HHResult64 result; // or HHResult128 or HHResult256
InstructionSets::Run<HighwayHash>(key, in, 8, &result);
Ccallable 64bit HighwayHash for the CPU on which we're currently running:
#include "highwayhash/c_bindings.h"
const uint64_t key[4] = {1, 2, 3, 4};
char in[8] = {1};
return HighwayHash64(key, in, 8);
Printing a 256bit result in a hexadecimal format similar to sha1sum:
HHResult256 result;
printf("%016"PRIx64"%016"PRIx64"%016"PRIx64"%016"PRIx64"\n",
result[3], result[2], result[1], result[0]);
## Introduction
Hash functions are widely used, so it is desirable to increase their speed and
security. This package provides two 'strong' (welldistributed and
unpredictable) hash functions: a faster version of SipHash, and an even faster
algorithm we call HighwayHash.
SipHash is a fast but 'cryptographically strong' pseudorandom function by
Aumasson and Bernstein [https://www.131002.net/siphash/siphash.pdf].
HighwayHash is a new way of mixing inputs which may inspire new
cryptographically strong hashes. Large inputs are processed at a rate of 0.24
cycles per byte, and latency remains low even for small inputs. HighwayHash is
faster than SipHash for all input sizes, with 5 times higher throughput at 1
KiB. We discuss design choices and provide statistical analysis and preliminary
cryptanalysis in https://arxiv.org/abs/1612.06257.
## Applications
Unlike prior strong hashes, these functions are fast enough to be recommended
as safer replacements for weak hashes in many applications. The additional CPU
cost appears affordable, based on profiling data indicating C++ hash functions
account for less than 0.25% of CPU usage.
Hashbased selection of random subsets is useful for A/B experiments and similar
applications. Such random generators are idempotent (repeatable and
deterministic), which is helpful for parallel algorithms and testing. To avoid
bias, it is important that the hash function be unpredictable and
indistinguishable from a uniform random generator. We have verified the bit
distribution and avalanche properties of SipHash and HighwayHash.
64bit hashes are also useful for authenticating shortlived messages such as
network/RPC packets. This requires that the hash function withstand
differential, length extension and other attacks. We have published a formal
security analysis for HighwayHash. New cryptanalysis tools may still need to be
developed for further analysis.
Strong hashes are also important parts of methods for protecting hash tables
against unacceptable worstcase behavior and denial of service attacks
(see "hash flooding" below).
128 and 256bit hashes can be useful for verifying data integrity (checksums).
## SipHash
Our SipHash implementation is a fast and portable dropin replacement for
the reference C code. Outputs are identical for the given test cases (messages
between 0 and 63 bytes).
Interestingly, it is about twice as fast as a SIMD implementation using SSE4.1
(https://goo.gl/80GBSD). This is presumably due to the lack of SIMD bit rotate
instructions prior to AVX512.
SipHash13 is a faster but weaker variant with one mixing round per update and
three during finalization.
We also provide a dataparallel 'tree hash' variant that enables efficient SIMD
while retaining safety guarantees. This is about twice as fast as SipHash, but
does not return the same results.
## HighwayHash
We have devised a new way of mixing inputs with SIMD multiply and permute
instructions. The multiplications are 32x32 > 64 bits and therefore infeasible
to reverse. Permuting equalizes the distribution of the resulting bytes.
The internal state is quite large (1024 bits) but fits within SIMD registers.
Due to limitations of the AVX2 instruction set, the registers are partitioned
into two 512bit halves that remain independent until the reduce phase. The
algorithm outputs 64 bit digests or up to 256 bits at no extra cost.
In addition to high throughput, the algorithm is designed for low finalization
cost. The result is more than twice as fast as SipTreeHash.
We also provide an SSE4.1 version (80% as fast for large inputs and 95% as fast
for short inputs), an implementation for VSX on POWER and a portable version
(10% as fast). A thirdparty ARM implementation is referenced below.
Statistical analyses and preliminary cryptanalysis are given in
https://arxiv.org/abs/1612.06257.
## Versioning and stability
Now that 21 months have elapsed since their initial release, we have declared
all (64/128/256 bit) variants of HighwayHash frozen, i.e. unchanging forever.
SipHash and HighwayHash are 'fingerprint functions' whose input > hash
mapping will not change. This is important for applications that write hashes to
persistent storage.
## Speed measurements
To measure the CPU cost of a hash function, we can either create an artificial
'microbenchmark' (easier to control, but probably not representative of the
actual runtime), or insert instrumentation directly into an application (risks
influencing the results through observer overhead). We provide novel variants of
both approaches that mitigate their respective disadvantages.
profiler.h uses software writecombining to stream program traces to memory
with minimal overhead. These can be analyzed offline, or when memory is full,
to learn how much time was spent in each (possibly nested) zone.
nanobenchmark.h enables cycleaccurate measurements of very short functions.
It uses CPU fences and robust statistics to minimize variability, and also
avoids unrealistic branch prediction effects.
We compile the 64bit C++ implementations with a patched GCC 4.9 and run on a
single idle core of a Xeon E52690 v3 clocked at 2.6 GHz. CPU cost is measured
as cycles per byte for various input sizes:
Algorithm  8  31  32  63  64  1024
            
HighwayHashAVX2  7.34  1.81  1.71  1.04  0.95  0.24
HighwayHashSSE41  8.00  2.11  1.75  1.13  0.96  0.30
SipTreeHash  16.51  4.57  4.09  2.22  2.29  0.57
SipTreeHash13  12.33  3.47  3.06  1.68  1.63  0.33
SipHash  8.13  2.58  2.73  1.87  1.93  1.26
SipHash13  6.96  2.09  2.12  1.32  1.33  0.68
SipTreeHash is slower than SipHash for small inputs because it processes blocks
of 32 bytes. AVX2 and SSE4.1 HighwayHash are faster than SipHash for all input
sizes due to their highly optimized handling of partial vectors.
Note that previous measurements included the initialization of their input,
which dramatically increased timings especially for small inputs.
## CPU requirements
SipTreeHash(13) requires an AVX2capable CPU (e.g. Haswell). HighwayHash
includes a dispatcher that chooses the implementation (AVX2, SSE4.1, VSX or
portable) at runtime, as well as a directly callable function template that can
only run on the CPU for which it was built. SipHash(13) and
ScalarSipTreeHash(13) have no particular CPU requirements.
### AVX2 vs SSE4
When both AVX2 and SSE4 are available, the decision whether to use AVX2 is
nonobvious. AVX2 vectors are twice as wide, but require a higher power license
(integer multiplications count as 'heavy' instructions) and can thus reduce the
clock frequency of the core or entire socket(!) on Haswell systems. This
partially explains the observed 1.25x (not 2x) speedup over SSE4. Moreover, it
is inadvisable to only sporadically use AVX2 instructions because there is also
a ~56K cycle warmup period during which AVX2 operations are slower, and Haswell
can even stall during this period. Thus, we recommend avoiding AVX2 for
infrequent hashing if the rest of the application is also not using AVX2. For
any input larger than 1 MiB, it is probably worthwhile to enable AVX2.
### SIMD implementations
Our x86 implementations use custom vector classes with overloaded operators
(e.g. `const V4x64U a = b + c`) for typesafety and improved readability vs.
compiler intrinsics (e.g. `const __m256i a = _mm256_add_epi64(b, c)`).
The VSX implementation uses builtin vector types alongside Altivec intrinsics.
A highperformance thirdparty ARM implementation is mentioned below.
### Dispatch
Our instruction_sets dispatcher avoids running newer instructions on older CPUs
that do not support them. However, intrinsics, and therefore also any vector
classes that use them, require (on GCC < 4.9 or Clang < 3.9) a compiler flag
that also allows the compiler to generate code for that CPU. This means the
intrinsics must be placed in separate translation units that are compiled with
the required flags. It is important that these source files and their headers
not define any inline functions, because that might break the one definition
rule and cause crashes.
To minimize dispatch overhead when hashes are computed often (e.g. in a loop),
we can inline the hash function into its caller using templates. The dispatch
overhead will only be paid once (e.g. before the loop). The template mechanism
also avoids duplicating code in each CPUspecific implementation.
## Defending against hash flooding
To mitigate hash flooding attacks, we need to take both the hash function and
the data structure into account.
We wish to defend (web) services that utilize hash sets/maps against
denialofservice attacks. Such data structures assign attackercontrolled
input messages `m` to a hash table bin `b` by computing the hash `H(s, m)`
using a hash function `H` seeded by `s`, and mapping it to a bin with some
narrowing function `b = R(h)`, discussed below.
Attackers may attempt to trigger 'flooding' (excessive work in insertions or
lookups) by finding multiple `m` that map to the same bin. If the attacker has
local access, they can do far worse, so we assume the attacker can only issue
remote requests. If the attacker is able to send large numbers of requests,
they can already deny service, so we need only ensure the attacker's cost is
sufficiently large compared to the service's provisioning.
If the hash function is 'weak', attackers can easily generate 'hash collisions'
(inputs mapping to the same hash values) that are independent of the seed. In
other words, certain input messages will cause collisions regardless of the seed
value. The author of SipHash has published C++ programs to generate such
'universal (keyindependent) multicollisions' for CityHash and Murmur. Similar
'differential' attacks are likely possible for any hash function consisting only
of reversible operations (e.g. addition/multiplication/rotation) with a constant
operand. `n` requests with such inputs cause `n^2` work for an unprotected hash
table, which is unacceptable.
By contrast, 'strong' hashes such as SipHash or HighwayHash require infeasible
attacker effort to find a hash collision (an expected 2^32 guesses of `m` per
the birthday paradox) or recover the seed (2^63 requests). These security claims
assume the seed is secret. It is reasonable to suppose `s` is initially unknown
to attackers, e.g. generated on startup or even perconnection. A timing attack
by Wool/BarYosef recovers 13bit seeds by testing all 8K possibilities using
millions of requests, which takes several days (even assuming unrealistic 150 us
roundtrip times). It appears infeasible to recover 64bit seeds in this way.
However, attackers are only looking for multiple `m` mapping to the same bin
rather than identical hash values. We assume they know or are able to discover
the hash table size `p`. It is common to choose `p = 2^i` to enable an efficient
`R(h) := h & (p  1)`, which simply retains the lower hash bits. It may be
easier for attackers to compute partial collisions where only the lower `i` bits
match. This can be prevented by choosing a prime `p` so that `R(h) := h % p`
incorporates all hash bits. The costly modulo operation can be avoided by
multiplying with the inverse (https://goo.gl/l7ASm8). An interesting alternative
suggested by Kyoung Jae Seo chooses a random subset of the `h` bits. Such an `R`
function can be computed in just 3 cycles using PEXT from the BMI2 instruction
set. This is expected to defend against SATsolver attacks on the hash bits at a
slightly lower cost than the multiplicative inverse method, and still allows
poweroftwo table sizes.
Summary thus far: given a strong hash function and secret seed, it appears
infeasible for attackers to generate hash collisions because `s` and/or `R` are
unknown. However, they can still observe the timings of data structure
operations for various `m`. With typical table sizes of 2^10 to 2^17 entries,
attackers can detect some 'bin collisions' (inputs mapping to the same bin).
Although this will be costly for the attacker, they can then send many instances
of such inputs, so we need to limit the resulting work for our data structure.
Hash tables with separate chaining typically store bin entries in a linked list,
so worstcase inputs lead to unacceptable lineartime lookup cost. We instead
seek optimal asymptotic worstcase complexity for each operation (insertion,
deletion and lookups), which is a constant factor times the logarithm of the
data structure size. This naturally leads to a treelike data structure for each
bin. The Java8 HashMap only replaces its linked list with trees when needed.
This leads to additional cost and complexity for deciding whether a bin is a
list or tree.
Our first proposal (suggested by Github user funnyfalcon) avoids this overhead
by always storing one tree per bin. It may also be worthwhile to store the first
entry directly in the bin, which avoids allocating any tree nodes in the common
case where bins are sparsely populated. What kind of tree should be used?
Given SipHash and HighwayHash provide high quality randomness, depending on
expecting attack surface simple nonbalancing binary search tree could perform
reasonably well. [Wikipedia says](https://en.wikipedia.org/wiki/Binary_search_tree#Definition)
> After a long intermixed sequence of random insertion and deletion, the
> expected height of the tree approaches square root of the number of keys, √n,
> which grows much faster than log n.
While `O(√n)` is much larger than `O(log n)`, it is still much smaller than `O(n)`.
And it will certainly complicate the timing attack, since the time of operation
on collisioned bin will grow slower.
If stronger safety guarantees are needed, then a balanced tree should be used.
Scapegoat and splay trees only offer amortized complexity guarantees, whereas
treaps require an entropy source and have higher constant factors in practice.
Selfbalancing structures such as 23 or redblack trees require additional
bookkeeping information. We can hope to reduce rebalancing cost by realizing
that the output bits of strong `H` functions are uniformly distributed. When
using them as keys instead of the original message `m`, recent relaxed balancing
schemes such as leftleaning redblack or weak AVL trees may require fewer tree
rotations to maintain their invariants. Note that `H` already determines the
bin, so we should only use the remaining bits. 64bit hashes are likely
sufficient for this purpose, and HighwayHash generates up to 256 bits. It seems
unlikely that attackers can craft inputs resulting in worst cases for both the
bin index and tree key without being able to generate hash collisions, which
would contradict the security claims of strong hashes. Even if they succeed, the
relaxed tree balancing still guarantees an upper bound on height and therefore
the worstcase operation cost. For the AVL variant, the constant factors are
slightly lower than for redblack trees.
The second proposed approach uses augmented/deamortized cuckoo hash tables
(https://goo.gl/PFwwkx). These guarantee worstcase `log n` bounds for all
operations, but only if the hash function is 'indistinguishable from random'
(uniformly distributed regardless of the input distribution), which is claimed
for SipHash and HighwayHash but certainly not for weak hashes.
Both alternatives retain good average case performance and defend against
flooding by limiting the amount of extra work an attacker can cause. The first
approach guarantees an upper bound of `log n` additional work even if the hash
function is compromised.
In summary, a strong hash function is not, by itself, sufficient to protect a
chained hash table from flooding attacks. However, strong hash functions are
important parts of two schemes for preventing denial of service. Using weak hash
functions can slightly accelerate the bestcase and averagecase performance of
a service, but at the risk of greatly reduced attack costs and worstcase
performance.
## Thirdparty implementations / bindings
Thanks to Damian Gryski and Frank Wessels for making us aware of these
thirdparty implementations or bindings. Please feel free to get in touch or
raise an issue and we'll add yours as well.
By  Language  URL
    
Damian Gryski  Go and x64 assembly  https://github.com/dgryski/gohighway/
Lovell Fuller  node.js bindings  https://github.com/lovell/highwayhash
Vinzent Steinberg  Rust bindings  https://github.com/vks/highwayhashrs
Frank Wessels & Andreas Auernhammer  Go and ARM assembly  https://github.com/minio/highwayhash
Phil Demetriou  Python 3 bindings  https://github.com/kpdemetriou/highwayhashcffi
> **_NOTE:_** For highwayhashcffi, please note an [issue](https://github.com/kpdemetriou/highwayhashcffi/issues/1)
has been reported ([merge request](https://github.com/kpdemetriou/highwayhashcffi/pull/2)).
## Modules
### Hashes
* c_bindings.h declares Ccallable versions of SipHash/HighwayHash.
* sip_hash.cc is the compatible implementation of SipHash, and also provides
the final reduction for sip_tree_hash.
* sip_tree_hash.cc is the faster but incompatible SIMD jlanes tree hash.
* scalar_sip_tree_hash.cc is a nonSIMD version.
* state_helpers.h simplifies the implementation of the SipHash variants.
* highwayhash.h is our new, fast hash function.
* hh_{avx2,sse41,vsx,portable}.h are its various implementations.
* highwayhash_target.h chooses the best available implementation at runtime.
### Infrastructure
* arch_specific.h offers byte swapping and CPUID detection.
* compiler_specific.h defines some compilerdependent language extensions.
* data_parallel.h provides a C++11 ThreadPool and PerThread (similar to
OpenMP).
* instruction_sets.h and targets.h enable efficient CPUspecific dispatching.
* nanobenchmark.h measures elapsed times with < 1 cycle variability.
* os_specific.h sets thread affinity and priority for benchmarking.
* profiler.h is a lowoverhead, deterministic hierarchical profiler.
* tsc_timer.h obtains highresolution timestamps without CPU reordering.
* vector256.h and vector128.h contain wrapper classes for AVX2 and SSE4.1.
By Jan Wassenberg <jan.wassenberg@gmail.com> and Jyrki Alakuijala
<jyrki.alakuijala@gmail.com>, updated 20181002
This is not an official Google product.
