Cryptoxide perf (SHA2 / Blake2)

Some notes related to the engine rewrite and the SSE, AVX and AVX2 CPU optimisations I did last year on cryptoxide.

History of cryptoxide§

Cryptoxide is a fork of the initial rust-crypto one-stop cryptography package that went unmaintained.

In 2018, we needed a pure Rust implementation to build rust-wasm based web applications, back when this use case was in its infancy. rust-crypto was an interesting starting point: all its algorithms were written in pure Rust, and it was easier to build on than the exploded version, which would have required a lot more time to port.

Many other cryptographic packages are now wasm-friendly as well.

Benchmarks setup§

  • cpu: 3.6 GHz 8-Core Intel Core i9 (I9-9900K)
  • rust compiler: stable 1.49
  • cryptoxide: 0.3.0
  • rust-crypto: blake2 0.9.1, sha2 0.9.1
  • ring: 0.16.19

The benchmark code itself consists of running the main costly part of each algorithm a few times over a 10-megabyte array and taking the average of the runs. It's possible that the reported numbers are buggy, but they should be consistently buggy, so here we're more interested in the relative values than in the absolute values.
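The measurement described above can be sketched roughly as follows. This is a minimal illustration, not the actual benchmark code: `toy_checksum` is a hypothetical stand-in for a real hash function, the run count is arbitrary, and both the use of a population standard deviation and of 1 megabyte = 1,000,000 bytes are assumptions.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Summary of one benchmark: mean and standard deviation of the run time
/// in milliseconds, plus throughput in megabytes per second.
struct Stats {
    avg_ms: f64,
    std_dev_ms: f64,
    speed_mb_s: f64,
}

/// Reduce per-run timings (in ms) over `bytes` bytes of input to summary statistics.
/// Assumption: population standard deviation, 1 MB = 1_000_000 bytes.
fn summarize(samples_ms: &[f64], bytes: usize) -> Stats {
    let n = samples_ms.len() as f64;
    let avg_ms = samples_ms.iter().sum::<f64>() / n;
    let variance = samples_ms.iter().map(|s| (s - avg_ms).powi(2)).sum::<f64>() / n;
    // Megabytes processed divided by seconds taken per run.
    let speed_mb_s = (bytes as f64 / 1_000_000.0) / (avg_ms / 1_000.0);
    Stats { avg_ms, std_dev_ms: variance.sqrt(), speed_mb_s }
}

/// Time `f` over `data` for `runs` iterations and summarize the timings.
fn bench<F: FnMut(&[u8]) -> u64>(mut f: F, data: &[u8], runs: usize) -> Stats {
    let mut samples = Vec::with_capacity(runs);
    for _ in 0..runs {
        let start = Instant::now();
        // black_box keeps the optimiser from discarding the work.
        black_box(f(black_box(data)));
        samples.push(start.elapsed().as_secs_f64() * 1_000.0);
    }
    summarize(&samples, data.len())
}

/// Hypothetical workload standing in for a hash function such as Sha256.
fn toy_checksum(data: &[u8]) -> u64 {
    data.iter().fold(0u64, |acc, &b| acc.rotate_left(5) ^ u64::from(b))
}

fn main() {
    let data = vec![0x5au8; 10_000_000];
    let stats = bench(toy_checksum, &data, 5);
    println!(
        "avg {:.2} ms, std dev {:.3} ms, speed {:.0} MB/s",
        stats.avg_ms, stats.std_dev_ms, stats.speed_mb_s
    );
}
```

Because every crate is measured by the same harness, a systematic error in the timing affects all rows equally, which is why the relative values remain meaningful.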

This benchmark also only looks at the functions I was interested in, so it only compares Sha256, Sha512, Blake2b and Blake2s.

Finally, benchmarks should always be taken with a grain of salt, as different CPUs and environments can lead to different results.

To play with the benchmark on your own machine, have a look at rcc

Raw numbers§

Let's start with the raw numbers in release mode. These show the average time (lower is better), the standard deviation (lower is better for the reliability of the benchmark), and the processing speed (higher is better):

Using the default target_cpu:

Algorithm  Crate       Avg (ms)  Std Dev (ms)  Speed (MB/s)
blake2b    cryptoxide     10.18         0.174           981
blake2b    blake2         10.28         0.260           972
blake2s    cryptoxide     15.97         0.264           625
blake2s    blake2         17.07         0.150           585
sha256     cryptoxide     30.51         0.220           327
sha256     sha2           35.66         0.277           280
sha256     ring           19.17         0.293           521
sha512     cryptoxide     20.86         0.319           479
sha512     sha2           21.10         0.422           473
sha512     ring           13.29         0.296           752

Using the native target_cpu (target_cpu=native):

Algorithm  Crate       Avg (ms)  Std Dev (ms)  Speed (MB/s)
blake2b    cryptoxide      6.72         0.229          1486
blake2b    blake2          9.95         0.388          1004
blake2s    cryptoxide     11.27         0.232           886
blake2s    blake2         17.23         0.136           580
sha256     cryptoxide     20.71         0.243           482
sha256     sha2           28.31         0.365           353
sha256     ring           19.74         0.283           506
sha512     cryptoxide     17.13         0.184           583
sha512     sha2           17.50         0.339           571
sha512     ring           13.17         0.133           759
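The difference between a default and a native build is what the compiler is allowed to assume about the CPU. The sketch below shows how Rust code can observe this at compile time with `cfg!(target_feature = ...)`, and how it can also ask the CPU at runtime. It is only an illustration of the mechanism; the function names are made up for this example, and it is not meant to describe how any of the benchmarked crates selects its implementation.

```rust
/// Highest SIMD level (among the ones checked here) that the compiler was
/// allowed to assume for this build. On x86_64 a default build typically
/// reports "sse2"; a native build on an AVX2-capable machine reports "avx2".
fn compile_time_level() -> &'static str {
    if cfg!(target_feature = "avx2") {
        "avx2"
    } else if cfg!(target_feature = "avx") {
        "avx"
    } else if cfg!(target_feature = "sse2") {
        "sse2"
    } else {
        "baseline"
    }
}

/// Whether the CPU we are running on reports AVX2 support at runtime.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn runtime_has_avx2() -> bool {
    is_x86_feature_detected!("avx2")
}

/// On other architectures there is no AVX2 to detect.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn runtime_has_avx2() -> bool {
    false
}

fn main() {
    println!("compiled for: {}", compile_time_level());
    println!("cpu has avx2: {}", runtime_has_avx2());
}
```

Code selected at compile time only gets faster when the build enables the feature, whereas code that detects features at runtime behaves the same in both builds.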

In Graphs§

Putting this in graphical form, comparing the default and native runs:

Sha256§

SHA512§

BLAKE2B§

BLAKE2S§

Conclusion§

Ring is the uncontested winner in terms of performance (and probably safety). Most or all of its algorithms are implemented in assembly and use the best level of optimisation all the time, which explains why default and native are virtually identical.

For Sha256, with native optimisation cryptoxide gets very close to the highly optimised ring implementation (20.71 ms versus 19.74 ms) and is noticeably faster than the pervasive sha2 crate (28.31 ms).

For Sha512, there's no significant difference between cryptoxide and sha2, which is not particularly surprising considering that I didn't take the time to write a SIMD-optimised version of Sha512 in cryptoxide.

Both SHA256 and SHA512 algorithms are only partially optimisable with SIMD.
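One way to see why only part of SHA-2 lends itself to SIMD: the message schedule below (written from the FIPS 180-4 definition, not taken from cryptoxide) depends only on the message block, so it can be computed ahead of the rounds and is the usual target for vectorisation. The 64 compression rounds, by contrast, each need the state produced by the previous round. The function names are chosen for this sketch.

```rust
/// SHA-256 "small sigma 0" function (FIPS 180-4).
fn small_sigma0(x: u32) -> u32 {
    x.rotate_right(7) ^ x.rotate_right(18) ^ (x >> 3)
}

/// SHA-256 "small sigma 1" function (FIPS 180-4).
fn small_sigma1(x: u32) -> u32 {
    x.rotate_right(17) ^ x.rotate_right(19) ^ (x >> 10)
}

/// Expand the 16 words of one message block into the 64-word schedule.
/// No hash state is involved here, only message words.
fn message_schedule(block: &[u32; 16]) -> [u32; 64] {
    let mut w = [0u32; 64];
    w[..16].copy_from_slice(block);
    for t in 16..64 {
        w[t] = small_sigma1(w[t - 2])
            .wrapping_add(w[t - 7])
            .wrapping_add(small_sigma0(w[t - 15]))
            .wrapping_add(w[t - 16]);
    }
    w
}

fn main() {
    let mut block = [0u32; 16];
    block[0] = 1;
    let w = message_schedule(&block);
    println!("w[16] = {:#x}, w[18] = {:#x}", w[16], w[18]);
}
```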

For Blake2b and Blake2s, performance at the default level is mostly equivalent; the real difference appears at the AVX/AVX2 level, where cryptoxide gets a massive boost compared to the blake2 crate (6.72 ms versus 9.95 ms for Blake2b, 11.27 ms versus 17.23 ms for Blake2s). This is enabled by the really nice design of BLAKE2.

Time permitting, the next step is to add more SIMD optimisations to other algorithms and, as new architectures reach tier 1 and wide support in Rust, hopefully other types of SIMD optimisations.
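To illustrate what makes the BLAKE2 design SIMD-friendly, here is a sketch of the BLAKE2b column step written from RFC 7693, not taken from cryptoxide. The four applications of the mixing function G in a column step touch disjoint words of the 4x4 state, so the same step can be written as whole-row operations where each row maps naturally to a 4-lane vector register (four 64-bit lanes fit one AVX2 register). The `Row` type and the helper names are assumptions made for this example, and plain arrays stand in for real SIMD registers.

```rust
/// Scalar BLAKE2b mixing function G (RFC 7693) on four words of the state.
fn g(v: &mut [u64; 16], a: usize, b: usize, c: usize, d: usize, x: u64, y: u64) {
    v[a] = v[a].wrapping_add(v[b]).wrapping_add(x);
    v[d] = (v[d] ^ v[a]).rotate_right(32);
    v[c] = v[c].wrapping_add(v[d]);
    v[b] = (v[b] ^ v[c]).rotate_right(24);
    v[a] = v[a].wrapping_add(v[b]).wrapping_add(y);
    v[d] = (v[d] ^ v[a]).rotate_right(16);
    v[c] = v[c].wrapping_add(v[d]);
    v[b] = (v[b] ^ v[c]).rotate_right(63);
}

/// Column step, scalar form: G applied to each of the four columns in turn.
fn column_step_scalar(v: &mut [u64; 16], x: [u64; 4], y: [u64; 4]) {
    for i in 0..4 {
        g(v, i, 4 + i, 8 + i, 12 + i, x[i], y[i]);
    }
}

/// One row of the 4x4 state; stands in for a 4-lane SIMD register.
type Row = [u64; 4];

/// Lane-wise wrapping addition.
fn add(a: Row, b: Row) -> Row {
    std::array::from_fn(|i| a[i].wrapping_add(b[i]))
}

/// Lane-wise XOR followed by a right rotation by `n` bits.
fn xor_rotr(a: Row, b: Row, n: u32) -> Row {
    std::array::from_fn(|i| (a[i] ^ b[i]).rotate_right(n))
}

/// Column step, row form: the same arithmetic as G, on four lanes at once.
fn column_step_rows(v: &mut [u64; 16], x: Row, y: Row) {
    let mut a: Row = [v[0], v[1], v[2], v[3]];
    let mut b: Row = [v[4], v[5], v[6], v[7]];
    let mut c: Row = [v[8], v[9], v[10], v[11]];
    let mut d: Row = [v[12], v[13], v[14], v[15]];
    a = add(add(a, b), x);
    d = xor_rotr(d, a, 32);
    c = add(c, d);
    b = xor_rotr(b, c, 24);
    a = add(add(a, b), y);
    d = xor_rotr(d, a, 16);
    c = add(c, d);
    b = xor_rotr(b, c, 63);
    v[0..4].copy_from_slice(&a);
    v[4..8].copy_from_slice(&b);
    v[8..12].copy_from_slice(&c);
    v[12..16].copy_from_slice(&d);
}

fn main() {
    // Arbitrary test state and message words, chosen only for the comparison.
    let mut s1: [u64; 16] =
        std::array::from_fn(|i| (i as u64 + 1).wrapping_mul(0x9E37_79B9_7F4A_7C15));
    let mut s2 = s1;
    column_step_scalar(&mut s1, [1, 2, 3, 4], [5, 6, 7, 8]);
    column_step_rows(&mut s2, [1, 2, 3, 4], [5, 6, 7, 8]);
    println!("scalar and row forms agree: {}", s1 == s2);
}
```

The diagonal step follows the same pattern once rows b, c and d are rotated by one, two and three lanes, which is a single shuffle per row on SIMD hardware.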