rijndael: use more compact look-up tables and add table prefetching
2374753938df (Unpublished)

Description

rijndael: use more compact look-up tables and add table prefetching

* cipher/rijndael-internal.h (rijndael_prefetchfn_t): New.
(RIJNDAEL_context): Add 'prefetch_enc_fn' and 'prefetch_dec_fn'.
* cipher/rijndael-tables.h (S, T1, T2, T3, T4, T5, T6, T7, T8, S5, U1)
(U2, U3, U4): Remove.
(encT, dec_tables, decT, inv_sbox): Add.
* cipher/rijndael.c (_gcry_aes_amd64_encrypt_block)
(_gcry_aes_amd64_decrypt_block, _gcry_aes_arm_encrypt_block)
(_gcry_aes_arm_decrypt_block): Add parameter for passing table pointer
to assembly implementation.
(prefetch_table, prefetch_enc, prefetch_dec): New.
(do_setkey): Set up context prefetch functions depending on selected
rijndael implementation; use new tables for key setup.
(prepare_decryption): Use new tables for decryption key setup.
(do_encrypt_aligned): Rename to...
(do_encrypt_fn): ... to this, change to use new compact tables,
handle unaligned input and unroll rounds loop by two.
(do_encrypt): Remove handling of unaligned input/output; pass table
pointer to assembly implementations.
(rijndael_encrypt, _gcry_aes_cfb_enc, _gcry_aes_cbc_enc)
(_gcry_aes_ctr_enc, _gcry_aes_cfb_dec): Prefetch encryption tables
before encryption.
(do_decrypt_aligned): Rename to...
(do_decrypt_fn): ... to this, change to use new compact tables,
handle unaligned input and unroll rounds loop by two.
(do_decrypt): Remove handling of unaligned input/output; pass table
pointer to assembly implementations.
(rijndael_decrypt, _gcry_aes_cbc_dec): Prefetch decryption tables
before decryption.
* cipher/rijndael-amd64.S: Use 1+1.25 KiB tables for
encryption+decryption; remove tables from assembly file.
* cipher/rijndael-arm.S: Ditto.
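
The table prefetching named in the entries above can be illustrated with a
minimal sketch. This is not the code from the commit: the function name
prefetch_table_sketch, the 32-byte stride and the returned checksum are
assumptions made for illustration only. The idea is to read one byte from
every cache line of a look-up table before the data-dependent look-ups begin,
so that all of the table is likely to be in cache when they run.

```c
#include <stddef.h>

/* Hypothetical sketch, not the commit's prefetch_table().
   Read one byte per 32-byte step so that each cache line covered by the
   table is loaded.  The 32-byte stride is an assumption; it is chosen to be
   no larger than the cache line size of common CPUs.  The 'volatile'
   qualifier keeps the compiler from discarding the otherwise unused reads.
   The sum of the bytes read is returned only to make the sketch testable. */
static unsigned int
prefetch_table_sketch (const volatile unsigned char *tab, size_t len)
{
  unsigned int sum = 0;
  size_t i;

  for (i = 0; i < len; i += 32)
    sum += tab[i];

  /* The last byte may sit in a line the strided loop did not reach when the
     table is not aligned to the stride, so read it explicitly. */
  if (len > 0)
    sum += tab[len - 1];

  return sum;
}
```

A caller would pass the table cast to a byte pointer together with its size
in bytes, once before processing a buffer.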

The patch replaces the 4+4.25 KiB look-up tables of the generic
implementation, the 8+8 KiB tables of the AMD64 implementation and the
2+2 KiB tables of the ARM implementation with shared 1+1.25 KiB
(encryption+decryption) look-up tables, and adds prefetching of these tables.
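
How a single 1 KiB table can stand in for four 1 KiB tables can be sketched
as follows. This is an illustration under stated assumptions, not the table
layout used by the commit: the names sbox_sketch, enc_table_sketch and the
helper functions are hypothetical, and the little-endian packing of a column
into a 32-bit word is an assumption. The sketch builds one combined
SubBytes+MixColumns table and obtains the other three by rotating its entries,
which is where the additional rotation instructions come from. The 1.25 KiB
figure for decryption presumably corresponds to one such 1 KiB table plus a
256-byte inverse S-box.

```c
#include <stdint.h>

static uint8_t sbox_sketch[256];        /* AES S-box, computed below */
static uint32_t enc_table_sketch[256];  /* 256 * 4 bytes = 1 KiB */

/* Multiply by x (i.e. 2) in GF(2^8) modulo the AES polynomial 0x11b. */
static uint8_t xtime (uint8_t x)
{
  return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1b : 0x00));
}

/* General GF(2^8) multiplication by shift-and-add. */
static uint8_t gf_mul (uint8_t a, uint8_t b)
{
  uint8_t r = 0;
  while (b)
    {
      if (b & 1)
        r ^= a;
      a = xtime (a);
      b >>= 1;
    }
  return r;
}

static uint8_t rol8 (uint8_t v, unsigned int n)   /* n in 1..7 */
{
  return (uint8_t)((v << n) | (v >> (8 - n)));
}

static uint32_t rol32 (uint32_t v, unsigned int n)  /* n in 1..31 */
{
  return (v << n) | (v >> (32 - n));
}

/* Fill the S-box and the single combined table.  Entry x holds the column
   (2*S[x], S[x], S[x], 3*S[x]) packed with byte 0 in the low bits. */
static void build_tables_sketch (void)
{
  unsigned int x, y;

  for (x = 0; x < 256; x++)
    {
      uint8_t inv = 0, s;

      for (y = 1; y < 256 && x != 0; y++)   /* brute-force inverse */
        if (gf_mul ((uint8_t)x, (uint8_t)y) == 1)
          {
            inv = (uint8_t)y;
            break;
          }
      s = (uint8_t)(inv ^ rol8 (inv, 1) ^ rol8 (inv, 2) ^ rol8 (inv, 3)
                    ^ rol8 (inv, 4) ^ 0x63);
      sbox_sketch[x] = s;
      enc_table_sketch[x] = (uint32_t)gf_mul (s, 2) | ((uint32_t)s << 8)
                            | ((uint32_t)s << 16)
                            | ((uint32_t)gf_mul (s, 3) << 24);
    }
}

/* SubBytes+MixColumns of one column using only the 1 KiB table. */
static uint32_t column_compact (uint8_t a0, uint8_t a1, uint8_t a2, uint8_t a3)
{
  return enc_table_sketch[a0] ^ rol32 (enc_table_sketch[a1], 8)
         ^ rol32 (enc_table_sketch[a2], 16) ^ rol32 (enc_table_sketch[a3], 24);
}

/* The same column computed directly from the definition, for comparison. */
static uint32_t column_reference (uint8_t a0, uint8_t a1, uint8_t a2, uint8_t a3)
{
  uint8_t s0 = sbox_sketch[a0], s1 = sbox_sketch[a1];
  uint8_t s2 = sbox_sketch[a2], s3 = sbox_sketch[a3];
  uint8_t o0 = (uint8_t)(gf_mul (s0, 2) ^ gf_mul (s1, 3) ^ s2 ^ s3);
  uint8_t o1 = (uint8_t)(s0 ^ gf_mul (s1, 2) ^ gf_mul (s2, 3) ^ s3);
  uint8_t o2 = (uint8_t)(s0 ^ s1 ^ gf_mul (s2, 2) ^ gf_mul (s3, 3));
  uint8_t o3 = (uint8_t)(gf_mul (s0, 3) ^ s1 ^ s2 ^ gf_mul (s3, 2));

  return (uint32_t)o0 | ((uint32_t)o1 << 8) | ((uint32_t)o2 << 16)
         | ((uint32_t)o3 << 24);
}
```

After build_tables_sketch() has run, column_compact() and column_reference()
return the same word for any four input bytes. The trade-off is three
rotations per column in exchange for a table a quarter of the size.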

The AMD64 assembly is slower than before because of the additional rotation
instructions. The generic C implementation is now better optimized and
faster than before.

Benchmark results on Intel i5-4570 (turbo off) (64-bit, AMD64 assembly):

tests/bench-slope --disable-hwf intel-aesni --cpu-mhz 3200 cipher aes

Old:
AES | nanosecs/byte mebibytes/sec cycles/byte

ECB enc |      3.10 ns/B     307.5 MiB/s      9.92 c/B
ECB dec |      3.15 ns/B     302.5 MiB/s     10.09 c/B
CBC enc |      3.46 ns/B     275.5 MiB/s     11.08 c/B
CBC dec |      3.19 ns/B     299.2 MiB/s     10.20 c/B
CFB enc |      3.48 ns/B     274.4 MiB/s     11.12 c/B
CFB dec |      3.23 ns/B     294.8 MiB/s     10.35 c/B
OFB enc |      3.29 ns/B     290.2 MiB/s     10.52 c/B
OFB dec |      3.31 ns/B     288.3 MiB/s     10.58 c/B
CTR enc |      3.64 ns/B     261.7 MiB/s     11.66 c/B
CTR dec |      3.65 ns/B     261.6 MiB/s     11.67 c/B

New:
AES | nanosecs/byte mebibytes/sec cycles/byte

ECB enc |      4.21 ns/B     226.7 MiB/s     13.46 c/B
ECB dec |      4.27 ns/B     223.2 MiB/s     13.67 c/B
CBC enc |      4.15 ns/B     229.8 MiB/s     13.28 c/B
CBC dec |      3.85 ns/B     247.8 MiB/s     12.31 c/B
CFB enc |      4.16 ns/B     229.1 MiB/s     13.32 c/B
CFB dec |      3.88 ns/B     245.9 MiB/s     12.41 c/B
OFB enc |      4.38 ns/B     217.8 MiB/s     14.01 c/B
OFB dec |      4.36 ns/B     218.6 MiB/s     13.96 c/B
CTR enc |      4.30 ns/B     221.6 MiB/s     13.77 c/B
CTR dec |      4.30 ns/B     221.7 MiB/s     13.76 c/B
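
The three columns of these tables are related by simple arithmetic, which the
following sketch makes explicit. The function names are hypothetical and this
is not code from tests/bench-slope; the relations are inferred from the
figures above and the --cpu-mhz value on the command line. The printed columns
appear to be rounded independently, so values recomputed from the rounded
ns/B column can differ in the last digit.

```c
/* cycles/byte from nanoseconds/byte and the clock rate given in MHz. */
static double cycles_per_byte (double ns_per_byte, double cpu_mhz)
{
  return ns_per_byte * cpu_mhz / 1000.0;
}

/* Throughput in MiB/s (1 MiB = 1024 * 1024 bytes) from nanoseconds/byte. */
static double mib_per_sec (double ns_per_byte)
{
  return 1e9 / ns_per_byte / (1024.0 * 1024.0);
}

/* Relative change in time per byte, in percent; positive means slower. */
static double percent_change (double old_ns, double new_ns)
{
  return (new_ns - old_ns) / old_ns * 100.0;
}
```

For the ECB enc rows above, cycles_per_byte (3.10, 3200) gives about 9.92,
and percent_change (3.10, 4.21) gives about 36, that is, the new AMD64
assembly takes roughly 36% more time per byte in that mode.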

Benchmark on Intel i5-4570 (turbo off) (32-bit mingw, generic C):

tests/bench-slope.exe --disable-hwf intel-aesni --cpu-mhz 3200 cipher aes

Old:
AES | nanosecs/byte mebibytes/sec cycles/byte

ECB enc |      6.03 ns/B     158.2 MiB/s     19.29 c/B
ECB dec |      5.81 ns/B     164.1 MiB/s     18.60 c/B
CBC enc |      6.22 ns/B     153.4 MiB/s     19.90 c/B
CBC dec |      5.91 ns/B     161.3 MiB/s     18.92 c/B
CFB enc |      6.25 ns/B     152.7 MiB/s     19.99 c/B
CFB dec |      6.24 ns/B     152.8 MiB/s     19.97 c/B
OFB enc |      6.33 ns/B     150.6 MiB/s     20.27 c/B
OFB dec |      6.33 ns/B     150.7 MiB/s     20.25 c/B
CTR enc |      6.28 ns/B     152.0 MiB/s     20.08 c/B
CTR dec |      6.28 ns/B     151.7 MiB/s     20.11 c/B

New:
AES | nanosecs/byte mebibytes/sec cycles/byte

ECB enc |      5.02 ns/B     190.0 MiB/s     16.06 c/B
ECB dec |      5.33 ns/B     178.8 MiB/s     17.07 c/B
CBC enc |      4.64 ns/B     205.4 MiB/s     14.86 c/B
CBC dec |      4.95 ns/B     192.7 MiB/s     15.84 c/B
CFB enc |      4.75 ns/B     200.7 MiB/s     15.20 c/B
CFB dec |      4.74 ns/B     201.1 MiB/s     15.18 c/B
OFB enc |      5.29 ns/B     180.3 MiB/s     16.93 c/B
OFB dec |      5.29 ns/B     180.3 MiB/s     16.93 c/B
CTR enc |      4.77 ns/B     200.0 MiB/s     15.26 c/B
CTR dec |      4.77 ns/B     199.8 MiB/s     15.27 c/B

Benchmark on Cortex-A8 (ARM assembly):

tests/bench-slope --cpu-mhz 1008 cipher aes

Old:
AES | nanosecs/byte mebibytes/sec cycles/byte

ECB enc |     21.84 ns/B     43.66 MiB/s     22.02 c/B
ECB dec |     22.35 ns/B     42.67 MiB/s     22.53 c/B
CBC enc |     22.97 ns/B     41.53 MiB/s     23.15 c/B
CBC dec |     23.48 ns/B     40.61 MiB/s     23.67 c/B
CFB enc |     22.72 ns/B     41.97 MiB/s     22.90 c/B
CFB dec |     23.41 ns/B     40.74 MiB/s     23.59 c/B
OFB enc |     23.65 ns/B     40.32 MiB/s     23.84 c/B
OFB dec |     23.67 ns/B     40.29 MiB/s     23.86 c/B
CTR enc |     23.24 ns/B     41.03 MiB/s     23.43 c/B
CTR dec |     23.23 ns/B     41.05 MiB/s     23.42 c/B

New:
AES | nanosecs/byte mebibytes/sec cycles/byte

ECB enc |     26.03 ns/B     36.64 MiB/s     26.24 c/B
ECB dec |     26.97 ns/B     35.36 MiB/s     27.18 c/B
CBC enc |     23.21 ns/B     41.09 MiB/s     23.39 c/B
CBC dec |     23.36 ns/B     40.83 MiB/s     23.54 c/B
CFB enc |     23.02 ns/B     41.42 MiB/s     23.21 c/B
CFB dec |     23.67 ns/B     40.28 MiB/s     23.86 c/B
OFB enc |     27.86 ns/B     34.24 MiB/s     28.08 c/B
OFB dec |     27.87 ns/B     34.21 MiB/s     28.10 c/B
CTR enc |     23.47 ns/B     40.63 MiB/s     23.66 c/B
CTR dec |     23.49 ns/B     40.61 MiB/s     23.67 c/B

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

Details

Provenance
jukivili authored on Dec 23 2014, 11:35 AM
Parents
rCad50e360ef48: build: Add configure option --disable-doc.

Event Timeline

Jussi Kivilinna <jussi.kivilinna@iki.fi> committed rC2374753938df: rijndael: use more compact look-up tables and add table prefetching (authored by Jussi Kivilinna <jussi.kivilinna@iki.fi>). Dec 23 2014, 11:37 AM