Improve performance of generic SHA256 implementation
* cipher/sha256.c (R): Let caller do variable shuffling. (Chro, Maj, Sum0, Sum1): Convert from inline functions to macros. (W, I): New. (transform_blk): Unroll round loop; inline message expansion to rounds to make message expansion buffer smaller.
Benchmark on Cortex-A8 (armv6, 1008 Mhz):
Before:
| nanosecs/byte mebibytes/sec cycles/byte SHA256 | 27.63 ns/B 34.52 MiB/s 27.85 c/B
After (1.31x faster):
| nanosecs/byte mebibytes/sec cycles/byte SHA256 | 20.97 ns/B 45.48 MiB/s 21.13 c/B
Benchmark on Cortex-A8 (armv7, 1008 Mhz):
Before:
| nanosecs/byte mebibytes/sec cycles/byte SHA256 | 24.18 ns/B 39.43 MiB/s 24.38 c/B
After (1.13x faster):
| nanosecs/byte mebibytes/sec cycles/byte SHA256 | 21.28 ns/B 44.82 MiB/s 21.45 c/B
Benchmark on Intel Core i5-4570 (i386, 3.2 Ghz):
Before:
| nanosecs/byte mebibytes/sec cycles/byte SHA256 | 5.78 ns/B 164.9 MiB/s 18.51 c/B
After (1.06x faster)
| nanosecs/byte mebibytes/sec cycles/byte SHA256 | 5.41 ns/B 176.1 MiB/s 17.33 c/B
- Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>