aria-avx: small optimization for aria_ark_8way

Description

* cipher/aria-aesni-avx-amd64.S (aria_ark_8way): Use 'vmovd' for
loading key material and 'vpshufb' for broadcasting from byte
locations 3, 2, 1 and 0.
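
The load-and-broadcast pattern named above can be modelled in a few lines of Python. This is only a sketch of the instruction semantics, not the assembly from cipher/aria-aesni-avx-amd64.S: the helper names (vmovd, vpshufb, vpxor, broadcast_key_bytes) are hypothetical stand-ins for the x86 instructions, and the final XOR step assumes that aria_ark_8way performs the round-key addition on byte-sliced state registers.

```python
def vmovd(mem: bytes, offset: int) -> bytes:
    """Model of 'vmovd m32, xmm': 4 bytes go to the low dword, upper 12 bytes are zeroed."""
    return mem[offset:offset + 4] + bytes(12)


def vpshufb(src: bytes, mask: bytes) -> bytes:
    """Model of 128-bit 'vpshufb': dst[i] = 0 if mask[i] has bit 7 set, else src[mask[i] & 15]."""
    return bytes(0 if m & 0x80 else src[m & 0x0F] for m in mask)


def vpxor(a: bytes, b: bytes) -> bytes:
    """Model of 'vpxor': byte-wise XOR of two 16-byte registers."""
    return bytes(x ^ y for x, y in zip(a, b))


def broadcast_key_bytes(round_key: bytes, offset: int) -> list:
    """One 32-bit load, then four shuffles that broadcast bytes 3, 2, 1 and 0."""
    t0 = vmovd(round_key, offset)
    # A mask made of sixteen copies of k selects source byte k for every lane.
    return [vpshufb(t0, bytes([k]) * 16) for k in (3, 2, 1, 0)]


# Illustrative key material only (not a real ARIA round key).
rk = bytes.fromhex("00112233445566778899aabbccddeeff")
lanes = broadcast_key_bytes(rk, 0)

# Assumed use: XOR each broadcast key byte into one byte-sliced state register.
state = [bytes(16)] * 4
state = [vpxor(x, k) for x, k in zip(state, lanes)]
print([s.hex() for s in state])
```

In this model a single 32-bit load feeds four independent shuffles, each driven by a constant mask.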

Benchmark on AMD Ryzen 9 7900X (zen4, turbo-freq off):

Before (GFNI/AVX):
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz

ECB enc |     0.516 ns/B      1847 MiB/s      2.43 c/B      4700
ECB dec |     0.519 ns/B      1839 MiB/s      2.44 c/B      4700
CTR enc |     0.517 ns/B      1846 MiB/s      2.43 c/B      4700
CTR dec |     0.518 ns/B      1843 MiB/s      2.43 c/B      4700

After (GFNI/AVX, ~5% faster):
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz

ECB enc |     0.490 ns/B      1947 MiB/s      2.30 c/B      4700
ECB dec |     0.490 ns/B      1946 MiB/s      2.30 c/B      4700
CTR enc |     0.493 ns/B      1935 MiB/s      2.32 c/B      4700
CTR dec |     0.493 ns/B      1934 MiB/s      2.32 c/B      4700

Benchmark on Intel Core i3-1115G4 (tiger-lake, turbo-freq off):

Before (GFNI/AVX):
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz

ECB enc |     0.967 ns/B     986.6 MiB/s      2.89 c/B      2992
ECB dec |     0.966 ns/B     987.1 MiB/s      2.89 c/B      2992
CTR enc |     0.972 ns/B     980.8 MiB/s      2.91 c/B      2993
CTR dec |     0.971 ns/B     982.5 MiB/s      2.90 c/B      2993

After (GFNI/AVX, ~6% faster):
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz

ECB enc |     0.908 ns/B      1050 MiB/s      2.72 c/B      2992
ECB dec |     0.903 ns/B      1056 MiB/s      2.70 c/B      2992
CTR enc |     0.913 ns/B      1045 MiB/s      2.73 c/B      2992
CTR dec |     0.910 ns/B      1048 MiB/s      2.72 c/B      2992

Benchmark on AMD Ryzen 7 5800X (zen3, turbo-freq off):

Before (AESNI/AVX):
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz

ECB enc |     0.921 ns/B      1035 MiB/s      3.50 c/B      3800
ECB dec |     0.922 ns/B      1034 MiB/s      3.50 c/B      3800
CTR enc |     0.923 ns/B      1033 MiB/s      3.51 c/B      3800
CTR dec |     0.923 ns/B      1033 MiB/s      3.51 c/B      3800

After (AESNI/AVX, ~6% faster):
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz

ECB enc |     0.862 ns/B      1106 MiB/s      3.28 c/B      3800
ECB dec |     0.862 ns/B      1106 MiB/s      3.28 c/B      3800
CTR enc |     0.865 ns/B      1102 MiB/s      3.29 c/B      3800
CTR dec |     0.865 ns/B      1103 MiB/s      3.29 c/B      3800

Benchmark on AMD EPYC 7642 (zen2):

Before (AESNI/AVX):
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz

ECB enc |      1.22 ns/B     784.5 MiB/s      4.01 c/B      3298
ECB dec |      1.22 ns/B     784.8 MiB/s      4.00 c/B      3292
CTR enc |      1.22 ns/B     780.1 MiB/s      4.03 c/B      3299
CTR dec |      1.22 ns/B     779.1 MiB/s      4.04 c/B      3299

After (AESNI/AVX, ~13% faster):
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz

ECB enc |      1.07 ns/B     888.3 MiB/s      3.54 c/B      3299
ECB dec |      1.08 ns/B     885.3 MiB/s      3.55 c/B      3299
CTR enc |      1.07 ns/B     888.7 MiB/s      3.54 c/B      3298
CTR dec |      1.07 ns/B     887.4 MiB/s      3.55 c/B      3299

Benchmark on Intel Core i5-6500 (skylake):

Before (AESNI/AVX):
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz

ECB enc |      1.24 ns/B     766.6 MiB/s      4.48 c/B      3598
ECB dec |      1.25 ns/B     764.9 MiB/s      4.49 c/B      3598
CTR enc |      1.25 ns/B     761.7 MiB/s      4.50 c/B      3598
CTR dec |      1.25 ns/B     761.6 MiB/s      4.51 c/B      3598

After (AESNI/AVX, ~2% faster):
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz

ECB enc |      1.22 ns/B     780.0 MiB/s      4.40 c/B      3598
ECB dec |      1.22 ns/B     779.6 MiB/s      4.40 c/B      3598
CTR enc |      1.23 ns/B     776.6 MiB/s      4.42 c/B      3598
CTR dec |      1.23 ns/B     776.6 MiB/s      4.42 c/B      3598

Benchmark on Intel Core i5-2450M (sandy-bridge, turbo-freq off):

Before (AESNI/AVX):
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz

ECB enc |      2.11 ns/B     452.7 MiB/s      5.25 c/B      2494
ECB dec |      2.10 ns/B     454.5 MiB/s      5.23 c/B      2494
CTR enc |      2.10 ns/B     453.2 MiB/s      5.25 c/B      2494
CTR dec |      2.10 ns/B     453.2 MiB/s      5.25 c/B      2494

After (AESNI/AVX, ~4% faster):
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz

ECB enc |      2.00 ns/B     475.8 MiB/s      5.00 c/B      2494
ECB dec |      2.00 ns/B     476.4 MiB/s      4.99 c/B      2494
CTR enc |      2.01 ns/B     474.7 MiB/s      5.01 c/B      2494
CTR dec |      2.01 ns/B     473.9 MiB/s      5.02 c/B      2494
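
The "~N% faster" figures in the headings are the author's rounded values. As a rough cross-check, the sketch below derives a throughput gain from the ECB enc MiB/s column of the tables above. The function name speedup_percent is hypothetical, and the result can differ by about a point from the headings depending on which column is used and how it is rounded.

```python
# (before, after) ECB enc throughput in MiB/s, copied from the tables above.
ecb_enc_mib_per_sec = {
    "AMD Ryzen 9 7900X (zen4)": (1847, 1947),
    "Intel Core i3-1115G4 (tiger-lake)": (986.6, 1050),
    "AMD Ryzen 7 5800X (zen3)": (1035, 1106),
    "AMD EPYC 7642 (zen2)": (784.5, 888.3),
    "Intel Core i5-6500 (skylake)": (766.6, 780.0),
    "Intel Core i5-2450M (sandy-bridge)": (452.7, 475.8),
}


def speedup_percent(before: float, after: float) -> float:
    """Relative throughput gain in percent (higher MiB/s is better)."""
    return (after / before - 1.0) * 100.0


for cpu, (before, after) in ecb_enc_mib_per_sec.items():
    print(f"{cpu}: {speedup_percent(before, after):.1f}% faster")
```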

Cc: Taehee Yoo <ap420073@gmail.com>

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

Details

Provenance
jukivili authored on Feb 18 2023, 10:13 AM
Parents
rC45351e6474cb: aria: add x86_64 GFNI/AVX512 accelerated implementation