mpi/amd64: optimize add_n and sub_n
* mpi/amd64/mpih-add1.S (_gcry_mpih_add_n): New implementation with 4x unrolled fast-path loop.
* mpi/amd64/mpih-sub1.S (_gcry_mpih_sub_n): Likewise.
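For illustration only, the shape of a 4x unrolled multi-limb addition can be sketched in portable C. This is not the assembly from this commit: the names limb_t, add_limb and sketch_add_n are hypothetical, and the real routine presumably keeps the carry in the CPU carry flag rather than in a variable. The sketch assumes 64-bit limbs, handles the n % 4 leftover limbs first, and then processes four limbs per loop iteration.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t; /* hypothetical stand-in for the MPI limb type */

/* Add one limb pair plus an incoming carry (0 or 1).
   Stores the sum in *out and returns the outgoing carry (0 or 1). */
static inline limb_t add_limb(limb_t a, limb_t b, limb_t cy, limb_t *out)
{
  limb_t s = a + b;
  limb_t c1 = s < a;   /* carry out of a + b */
  limb_t r = s + cy;
  limb_t c2 = r < s;   /* carry out of adding the incoming carry */
  *out = r;
  return c1 | c2;      /* at most one of the two can be set */
}

/* Illustrative n-limb addition: res = s1 + s2, returns the final carry. */
limb_t sketch_add_n(limb_t *res, const limb_t *s1, const limb_t *s2, size_t n)
{
  limb_t cy = 0;
  size_t i = 0;
  size_t head = n & 3; /* limbs that do not fill a group of four */

  /* Slow path: process the n % 4 leftover limbs one at a time. */
  for (; i < head; i++)
    cy = add_limb(s1[i], s2[i], cy, &res[i]);

  /* Fast path: the remaining count is a multiple of four. */
  for (; i < n; i += 4)
    {
      cy = add_limb(s1[i + 0], s2[i + 0], cy, &res[i + 0]);
      cy = add_limb(s1[i + 1], s2[i + 1], cy, &res[i + 1]);
      cy = add_limb(s1[i + 2], s2[i + 2], cy, &res[i + 2]);
      cy = add_limb(s1[i + 3], s2[i + 3], cy, &res[i + 3]);
    }

  return cy;
}
```

Unrolling reduces loop-counter and branch overhead per limb; a subtraction variant would follow the same structure with a borrow in place of the carry.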
Benchmark on AMD Ryzen 9 7900X:
Before:
     |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 add |     0.035 ns/B     27559 MiB/s     0.163 c/B      4700
 sub |     0.034 ns/B     28332 MiB/s     0.158 c/B      4700
After (~26% faster):
     |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 add |     0.027 ns/B     35271 MiB/s     0.127 c/B      4700
 sub |     0.027 ns/B     35206 MiB/s     0.127 c/B      4700
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>