rijndael/ppc: re-implement single-block mode, and implement OCB block cipher
Needs Review, Public

Authored by slandden on Fri, Jul 5, 7:38 PM.

Details

Summary

VERY impressive speed wins over the cryptogams version (see the benchmarks below).

It is also easier to maintain than an assembly version.

8x was only marginally faster than 6x. It could probably be sped up
further with a vector gather instruction.

See D490, D491, D492, and D493.

Before:
 AES     |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 ECB enc |      2.84 ns/B     336.1 MiB/s      5.38 c/B      1895
 ECB dec |      2.89 ns/B     330.6 MiB/s      5.47 c/B      1895
 CBC enc |      1.05 ns/B     908.3 MiB/s      1.99 c/B      1895
 CBC dec |     0.221 ns/B      4315 MiB/s     0.419 c/B      1895
 CFB enc |      4.41 ns/B     216.4 MiB/s      8.35 c/B      1895
 CFB dec |      4.88 ns/B     195.3 MiB/s      9.26 c/B      1895
 OFB enc |      5.06 ns/B     188.4 MiB/s      9.59 c/B      1895
 OFB dec |      5.07 ns/B     188.2 MiB/s      9.60 c/B      1895
 CTR enc |     0.218 ns/B      4374 MiB/s     0.413 c/B      1895
 CTR dec |     0.219 ns/B      4349 MiB/s     0.416 c/B      1895
 XTS enc |     0.681 ns/B      1400 MiB/s      1.29 c/B      1895
 XTS dec |     0.687 ns/B      1387 MiB/s      1.30 c/B      1895
 CCM enc |      4.21 ns/B     226.4 MiB/s      5.32 c/B      1264
 CCM dec |      4.21 ns/B     226.7 MiB/s      5.32 c/B      1264
CCM auth |      3.99 ns/B     239.2 MiB/s      5.04 c/B      1264
 EAX enc |      4.20 ns/B     227.2 MiB/s      5.30 c/B      1264
 EAX dec |      4.21 ns/B     226.5 MiB/s      5.32 c/B      1264
EAX auth |      3.97 ns/B     239.9 MiB/s      5.02 c/B      1264
 GCM enc |     19.81 ns/B     48.14 MiB/s     25.03 c/B      1264
 GCM dec |     19.79 ns/B     48.18 MiB/s     25.01 c/B      1264
GCM auth |     19.55 ns/B     48.78 MiB/s     24.71 c/B      1264
 OCB enc |     17.53 ns/B     54.41 MiB/s     14.77 c/B     842.4
 OCB dec |     13.89 ns/B     68.67 MiB/s     17.55 c/B      1263
OCB auth |      9.14 ns/B     104.4 MiB/s     11.54 c/B      1264

After:
 AES     |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 ECB enc |      1.98 ns/B     482.6 MiB/s      3.75 c/B      1895 <=====
 ECB dec |      1.80 ns/B     529.3 MiB/s      3.42 c/B      1895 <=====
 CBC enc |      1.05 ns/B     907.7 MiB/s      1.99 c/B      1895
 CBC dec |     0.221 ns/B      4317 MiB/s     0.419 c/B      1895
 CFB enc |      1.65 ns/B     578.5 MiB/s      3.12 c/B      1895
 CFB dec |      1.03 ns/B     925.9 MiB/s      1.95 c/B      1895
 OFB enc |      2.34 ns/B     408.2 MiB/s      3.83 c/B      1638
 OFB dec |      2.33 ns/B     410.1 MiB/s      3.81 c/B      1638
 CTR enc |     0.216 ns/B      4416 MiB/s     0.409 c/B      1895
 CTR dec |     0.216 ns/B      4422 MiB/s     0.409 c/B      1895
 XTS enc |     0.557 ns/B      1712 MiB/s      1.06 c/B      1895
 XTS dec |     0.561 ns/B      1701 MiB/s      1.06 c/B      1895
 CCM enc |      1.87 ns/B     509.9 MiB/s      3.54 c/B      1895
 CCM dec |      1.87 ns/B     509.8 MiB/s      3.55 c/B      1895
CCM auth |      1.65 ns/B     576.4 MiB/s      3.14 c/B      1895
 EAX enc |      1.87 ns/B     510.3 MiB/s      3.54 c/B      1895
 EAX dec |      1.87 ns/B     510.0 MiB/s      3.54 c/B      1895
EAX auth |      1.65 ns/B     576.9 MiB/s      3.13 c/B      1895
 GCM enc |      3.55 ns/B     268.7 MiB/s      6.73 c/B      1895
 GCM dec |      3.55 ns/B     268.7 MiB/s      6.73 c/B      1895
GCM auth |      3.33 ns/B     286.2 MiB/s      6.32 c/B      1895
 OCB enc |     0.426 ns/B      2241 MiB/s     0.807 c/B      1895 <====
 OCB dec |     0.409 ns/B      2333 MiB/s     0.775 c/B      1895 <====
OCB auth |      1.23 ns/B     772.7 MiB/s      2.34 c/B      1895
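
The columns in these tables are related by simple arithmetic, which makes it easy to sanity-check a row or to work out a speedup. Below is a minimal sketch in C, assuming that a MiB is 2^20 bytes and that cycles/byte is ns/byte multiplied by the clock in GHz. The function names are hypothetical and are not part of libgcrypt or its benchmark tool.

```c
#include <assert.h>

/* Throughput in MiB/s from a cost in ns/byte (assumes 1 MiB = 2^20 bytes). */
double mib_per_sec(double ns_per_byte)
{
    assert(ns_per_byte > 0.0);
    return 1e9 / ns_per_byte / (1024.0 * 1024.0);
}

/* Cycles per byte from ns/byte and the clock in MHz
 * (the "auto Mhz" column). */
double cycles_per_byte(double ns_per_byte, double mhz)
{
    return ns_per_byte * mhz / 1000.0;
}

/* Speedup factor between a "before" and an "after" ns/byte figure. */
double speedup(double before_ns, double after_ns)
{
    assert(after_ns > 0.0);
    return before_ns / after_ns;
}
```

For example, speedup(17.53, 0.426) gives roughly 41 for OCB enc, and speedup(2.84, 1.98) gives roughly 1.43 for ECB enc. Values recomputed this way can differ slightly from the printed MiB/s column because the ns/B figures are rounded.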
Test Plan

Integrates into existing tests, which all pass.

Diff Detail

Lint
Lint Skipped
Unit
Unit Tests Skipped
slandden created this revision. Fri, Jul 5, 7:38 PM
slandden edited the summary of this revision.

@gcwilson Can you notify the performance team of this new patch?

It turns out that OpenSSL's main loop for single-block mode was not optimized.

Thanks. I really like this AltiVec intrinsic approach. I might reimplement the rest of the bulk block cipher functions this way later (if I ever get PPC HW access).

Here are generic review comments for the full series:

I find Phabricator differential interface is quite horrible to use. I have not been able to fetch these patches cleanly and needed to manually edit the diff files to get them to import into git. It would probably be better if you sent patches to the mailing list, or left a git pull location at the parent task T4531.

jukivili edited reviewers, added: jukivili; removed: jwilk. Mon, Jul 8, 4:02 PM

and cryptogam wrapper functions

I will leave these in the main file, as they might benefit from "static", and I do not want to rely on LTO for that.
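
The point about "static" can be illustrated with a small sketch. A function with internal linkage that is defined in the same file as its callers is visible to the compiler at each call site, so it can be inlined without link-time optimization. The names below are hypothetical, and the XOR body is only a placeholder for a real cipher routine.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder for an externally defined routine (for example,
 * hand-written assembly). It just XORs so the sketch is self-contained. */
uint32_t sketch_external_encrypt(uint32_t block, uint32_t key)
{
    return block ^ key;
}

/* 'static' gives the wrapper internal linkage. Because its definition
 * is in the same translation unit as its callers, the compiler may
 * inline it without any help from LTO. */
static uint32_t sketch_encrypt_wrapper(uint32_t block, uint32_t key)
{
    assert(key != 0);   /* illustrative precondition only */
    return sketch_external_encrypt(block, key);
}

/* A caller in the same file: each wrapper call can collapse into a
 * direct call to the external routine. */
uint32_t sketch_encrypt_twice(uint32_t block, uint32_t key)
{
    return sketch_encrypt_wrapper(sketch_encrypt_wrapper(block, key), key);
}
```

If the wrapper lived in a separate source file, it could not be static, and the compiler would only be able to inline it across files with LTO enabled.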

(if I ever get PPC HW access).

https://minicloud.parqtec.unicamp.br/

click "request access". It is *free*.

In D494#4450, @slandden wrote:

I will leave these in the main file, as they might benefit from "static", and I do not want to rely on LTO for that.

OK. rijndael.c needs some clean-up work, and I'll make this change as part of that. Bulk functions are mainly called through 'hd->bulk.<mode>_enc/dec' indirect function calls, and the 'ctx->use_<hwf>' code paths in the '_gcry_aes_<mode>_enc/dec' functions are only really used by the 'selftest_<mode>_128' functions at the end of rijndael.c. So I'm planning to remove those code paths from the '_gcry_aes_<mode>_enc/dec' functions and to handle HW feature detection and picking the correct bulk function in each of those selftest functions (i.e., split HW feature detection out of setkey into a new function that can also be used by the selftest functions).
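
The planned split can be sketched roughly as follows. This is only an illustration under stated assumptions: every name prefixed with sketch_ is hypothetical, the flag values are made up, and the "cipher" bodies just copy data. It is not the actual rijndael.c code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical feature flags, not libgcrypt's real HWF constants. */
enum { SKETCH_HWF_NONE = 0, SKETCH_HWF_PPC_VCRYPTO = 1 };

typedef void (*sketch_bulk_fn)(void *ctx, unsigned char *out,
                               const unsigned char *in, size_t nblocks);

struct sketch_ctx {
    unsigned int use_ppc_crypto;  /* stands in for ctx->use_<hwf> */
    int last_impl;                /* records which implementation ran */
};

struct sketch_bulk_ops {
    sketch_bulk_fn ctr_enc;       /* stands in for hd->bulk.<mode>_enc */
};

/* Placeholder bodies: they copy the input instead of encrypting it. */
static void ctr_enc_generic(void *c, unsigned char *out,
                            const unsigned char *in, size_t nblocks)
{
    struct sketch_ctx *ctx = c;
    memcpy(out, in, nblocks * 16);
    ctx->last_impl = SKETCH_HWF_NONE;
}

static void ctr_enc_ppc(void *c, unsigned char *out,
                        const unsigned char *in, size_t nblocks)
{
    struct sketch_ctx *ctx = c;
    memcpy(out, in, nblocks * 16);
    ctx->last_impl = SKETCH_HWF_PPC_VCRYPTO;
}

/* Feature detection split out of setkey so that selftests can reuse it. */
void sketch_select_bulk(struct sketch_ctx *ctx, struct sketch_bulk_ops *ops,
                        unsigned int hwf)
{
    ctx->use_ppc_crypto = (hwf & SKETCH_HWF_PPC_VCRYPTO) != 0;
    ops->ctr_enc = ctx->use_ppc_crypto ? ctr_enc_ppc : ctr_enc_generic;
}

/* A selftest picks the bulk function itself, then calls it indirectly,
 * so the mode function needs no per-feature code path of its own. */
int sketch_selftest_ctr(unsigned int hwf)
{
    struct sketch_ctx ctx = { 0, -1 };
    struct sketch_bulk_ops ops = { NULL };
    unsigned char in[16] = { 0 }, out[16];

    sketch_select_bulk(&ctx, &ops, hwf);
    assert(ops.ctr_enc != NULL);
    ops.ctr_enc(&ctx, out, in, 1);
    return ctx.last_impl;
}
```

With this shape, a setkey function would call the same selection helper, so the selftests and the normal path share one piece of detection logic.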

I find Phabricator differential interface is quite horrible to use.

Me too.

I find autotools to be EXTREMELY difficult to use. I tried to put the rijndael.c stuff in its own file
(it needs these CFLAGS: -mabi=altivec -maltivec -mvsx -mpower8-vector),
but every time I try I get cryptic error messages from autotools/libtool that I don't understand,
and learning autotools just seems like a pointless thing to do.

autotools is so automatic that you NEVER know what is going on.

Managed to get the build correct. (patches in 1 sec)

Putting the cryptogams stuff in its own folder requires some autotools wizardry,
as the straightforward method I tried requires enabling the "subdir-objects" option, which
other parts of the libgcrypt code base say caused problems.