Add AMD64 assembly implementation of Salsa20
5a3d43485efd (Unpublished)

Not On Permanent Ref: This commit is not an ancestor of any permanent ref.

Description

Add AMD64 assembly implementation of Salsa20

* cipher/Makefile.am: Add 'salsa20-amd64.S'.
* cipher/salsa20-amd64.S: New.
* cipher/salsa20.c (USE_AMD64): New macro.
[USE_AMD64] (_gcry_salsa20_amd64_keysetup, _gcry_salsa20_amd64_ivsetup)
(_gcry_salsa20_amd64_encrypt_blocks): New prototypes.
[USE_AMD64] (salsa20_keysetup, salsa20_ivsetup, salsa20_core): New.
[!USE_AMD64] (salsa20_core): Change 'src' to non-constant, update block
counter in 'salsa20_core' and return burn stack depth.
[!USE_AMD64] (salsa20_keysetup, salsa20_ivsetup): New.
(salsa20_do_setkey): Move generic key setup to 'salsa20_keysetup'.
(salsa20_setkey): Fix burn stack depth.
(salsa20_setiv): Move generic IV setup to 'salsa20_ivsetup'.
(salsa20_do_encrypt_stream) [USE_AMD64]: Process large buffers in AMD64
implementation.
(salsa20_do_encrypt_stream): Move stack burning to this function...
(salsa20_encrypt_stream, salsa20r12_encrypt_stream): ...from these
functions.
* configure.ac [x86-64]: Add 'salsa20-amd64.lo'.
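
The split described above for 'salsa20_do_encrypt_stream' (large buffers go
to the bulk routine, the generic core handles the remainder) can be sketched
roughly as follows. This is a minimal illustration, not the code from the
patch: 'toy_ctx', 'toy_block', 'generic_core', 'bulk_encrypt_blocks' and
'encrypt_stream' are hypothetical names, and the keystream generator is a
placeholder rather than Salsa20.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 64  /* Salsa20 produces keystream in 64-byte blocks */

/* Hypothetical context; field names are illustrative only. */
typedef struct {
  uint64_t counter;          /* number of the next keystream block */
  uint8_t pad[BLOCK_SIZE];   /* keystream of the last partially used block */
  unsigned int unused;       /* keystream bytes still unused at end of pad */
  unsigned int bulk_calls;   /* instrumentation for this sketch only */
} toy_ctx;

/* Placeholder keystream generator. This is NOT Salsa20. */
static void toy_block(uint8_t ks[BLOCK_SIZE], uint64_t counter)
{
  size_t i;
  for (i = 0; i < BLOCK_SIZE; i++)
    ks[i] = (uint8_t)(counter * 131u + i * 7u + 1u);
}

/* Generic path: produce one block into the pad and advance the counter. */
static void generic_core(toy_ctx *ctx)
{
  toy_block(ctx->pad, ctx->counter++);
}

/* Stand-in for an optimized bulk routine: XOR 'nblocks' whole blocks. */
static void bulk_encrypt_blocks(toy_ctx *ctx, uint8_t *dst,
                                const uint8_t *src, size_t nblocks)
{
  uint8_t ks[BLOCK_SIZE];
  size_t i;
  ctx->bulk_calls++;
  while (nblocks-- > 0) {
    toy_block(ks, ctx->counter++);
    for (i = 0; i < BLOCK_SIZE; i++)
      dst[i] = (uint8_t)(src[i] ^ ks[i]);
    dst += BLOCK_SIZE;
    src += BLOCK_SIZE;
  }
}

static void encrypt_stream(toy_ctx *ctx, uint8_t *dst,
                           const uint8_t *src, size_t len)
{
  /* 1. Use up keystream left over from a previous call. */
  while (len > 0 && ctx->unused > 0) {
    *dst++ = (uint8_t)(*src++ ^ ctx->pad[BLOCK_SIZE - ctx->unused]);
    ctx->unused--;
    len--;
  }

  /* 2. Hand all remaining whole blocks to the bulk routine in one call. */
  if (len >= BLOCK_SIZE) {
    size_t nblocks = len / BLOCK_SIZE;
    bulk_encrypt_blocks(ctx, dst, src, nblocks);
    dst += nblocks * BLOCK_SIZE;
    src += nblocks * BLOCK_SIZE;
    len -= nblocks * BLOCK_SIZE;
  }

  /* 3. Partial tail: generate one block, keep the rest for the next call. */
  if (len > 0) {
    size_t i;
    generic_core(ctx);
    for (i = 0; i < len; i++)
      dst[i] = (uint8_t)(src[i] ^ ctx->pad[i]);
    ctx->unused = BLOCK_SIZE - (unsigned int)len;
  }
}
```

With a zero-initialized 'toy_ctx', encrypting a buffer in one call or in
several smaller calls yields the same output, because both paths advance the
same block counter.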

This patch adds a fast AMD64 assembly implementation of Salsa20. The
implementation is based on public domain code by D. J. Bernstein, available
at http://cr.yp.to/snuffle.html (amd64-xmm6). It gains extra speed by
processing four blocks in parallel with the help of SSE2 instructions.
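
To illustrate the four-blocks-in-parallel idea, here is a portable C sketch
in which a 4-element array stands in for one 128-bit SSE2 register: lane l
holds the same state word of block l, so every add/rotate/xor step is applied
to four consecutive blocks at once. This is not the assembly from the patch;
'salsa20_core_1', 'salsa20_core_4', 'vstep' and the 'qr' index table are
names made up for this example, and the state is assumed to keep its 64-bit
block counter in words 8 and 9.

```c
#include <stdint.h>
#include <string.h>

#define ROTL32(v, n) (((v) << (n)) | ((v) >> (32 - (n))))
#define LANES 4  /* one 128-bit register holds four 32-bit words */

/* Word indices (a, b, c, d) of the eight quarter-rounds in a double round. */
static const unsigned char qr[8][4] = {
  {0, 4, 8, 12}, {5, 9, 13, 1}, {10, 14, 2, 6}, {15, 3, 7, 11}, /* columns */
  {0, 1, 2, 3}, {5, 6, 7, 4}, {10, 11, 8, 9}, {15, 12, 13, 14}  /* rows */
};

/* Scalar reference: one 16-word block. */
static void salsa20_core_1(uint32_t out[16], const uint32_t in[16],
                           int rounds)
{
  uint32_t x[16];
  int r, q;

  memcpy(x, in, sizeof x);
  for (r = 0; r < rounds; r += 2)
    for (q = 0; q < 8; q++) {
      const unsigned char *i = qr[q];
      x[i[1]] ^= ROTL32(x[i[0]] + x[i[3]], 7);
      x[i[2]] ^= ROTL32(x[i[1]] + x[i[0]], 9);
      x[i[3]] ^= ROTL32(x[i[2]] + x[i[1]], 13);
      x[i[0]] ^= ROTL32(x[i[3]] + x[i[2]], 18);
    }
  for (q = 0; q < 16; q++)
    out[q] = x[q] + in[q];
}

/* One "vector" step: dst ^= rotl(x + y, n) in all four lanes. */
static void vstep(uint32_t dst[LANES], const uint32_t x[LANES],
                  const uint32_t y[LANES], int n)
{
  int l;
  for (l = 0; l < LANES; l++)
    dst[l] ^= ROTL32(x[l] + y[l], n);
}

/* Four consecutive blocks at once; lane l uses block counter + l. */
static void salsa20_core_4(uint32_t out[LANES][16], const uint32_t in[16],
                           int rounds)
{
  uint32_t s[16][LANES], v[16][LANES];
  int r, q, w, l;

  /* Transpose: word w of block l goes to lane l of vector w. */
  for (l = 0; l < LANES; l++) {
    uint64_t ctr = (((uint64_t)in[9] << 32) | in[8]) + (uint64_t)l;
    for (w = 0; w < 16; w++)
      s[w][l] = in[w];
    s[8][l] = (uint32_t)ctr;
    s[9][l] = (uint32_t)(ctr >> 32);
  }
  memcpy(v, s, sizeof v);

  for (r = 0; r < rounds; r += 2)
    for (q = 0; q < 8; q++) {
      const unsigned char *i = qr[q];
      vstep(v[i[1]], v[i[0]], v[i[3]], 7);
      vstep(v[i[2]], v[i[1]], v[i[0]], 9);
      vstep(v[i[3]], v[i[2]], v[i[1]], 13);
      vstep(v[i[0]], v[i[3]], v[i[2]], 18);
    }

  /* Feed-forward and transpose back to one 16-word block per lane. */
  for (l = 0; l < LANES; l++)
    for (w = 0; w < 16; w++)
      out[l][w] = v[w][l] + s[w][l];
}
```

Calling 'salsa20_core_4 (out, state, 20)' should give the same four blocks as
four 'salsa20_core_1' calls with the counter incremented between them. In a
real SSE2 version each 'vstep' would map to a few packed add, shift and xor
instructions on xmm registers.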

Benchmark results on Intel Core i5-4570 (3.2 GHz):

Before:
SALSA20    |  nanosecs/byte   mebibytes/sec   cycles/byte
STREAM enc |      3.88 ns/B     246.0 MiB/s     12.41 c/B
STREAM dec |      3.88 ns/B     246.0 MiB/s     12.41 c/B

SALSA20R12 |  nanosecs/byte   mebibytes/sec   cycles/byte
STREAM enc |      2.46 ns/B     387.9 MiB/s      7.87 c/B
STREAM dec |      2.46 ns/B     387.7 MiB/s      7.87 c/B

After:
SALSA20    |  nanosecs/byte   mebibytes/sec   cycles/byte
STREAM enc |     0.985 ns/B     967.8 MiB/s      3.15 c/B
STREAM dec |     0.987 ns/B     966.5 MiB/s      3.16 c/B

SALSA20R12 |  nanosecs/byte   mebibytes/sec   cycles/byte
STREAM enc |     0.636 ns/B    1500.5 MiB/s      2.03 c/B
STREAM dec |     0.636 ns/B    1499.2 MiB/s      2.04 c/B
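
The three columns in the tables above are related by simple arithmetic, and
the before/after ratio follows directly from them. The helpers below are a
small sketch for checking the figures; the function names are made up for
this example and are not part of the benchmarking tool.

```c
/* cycles/byte = (ns/byte) * (cycles/ns), and cycles/ns equals the GHz figure. */
static double cycles_per_byte(double ns_per_byte, double clock_ghz)
{
  return ns_per_byte * clock_ghz;
}

/* Throughput in MiB/s from ns/byte (1 MiB = 1024 * 1024 bytes). */
static double mib_per_sec(double ns_per_byte)
{
  return 1e9 / ns_per_byte / (1024.0 * 1024.0);
}

/* Speedup factor from two cost figures (ns/byte or cycles/byte). */
static double speedup(double before, double after)
{
  return before / after;
}
```

For example, 'cycles_per_byte (3.88, 3.2)' comes out at about 12.4, in line
with the SALSA20 "Before" row, and 'speedup (12.41, 3.15)' is roughly 3.9.
Small deviations from the table are expected because the printed ns/B values
are rounded.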

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

Details

Provenance
jukivili authored on Oct 26 2013, 2:00 PM
Parents
rCe214e8392671: Add new benchmarking utility, bench-slope
Branches
Unknown
Tags
Unknown

Event Timeline

Jussi Kivilinna <jussi.kivilinna@iki.fi> committed rC5a3d43485efd: Add AMD64 assembly implementation of Salsa20 (authored by Jussi Kivilinna <jussi.kivilinna@iki.fi>). Oct 28 2013, 3:12 PM