UltrafastSecp256k1

History


shrec 83f6d033d6 perf(opencl): optimize kernels — unrolled field_mul/sqr, addition chain field_inv, wNAF scalar_mul

Optimizations:
- field_mul: fully unrolled 4x4 schoolbook (16 explicit mul64_full, no loops)
- field_sqr: fully unrolled off-diagonal + diagonal computation
- field_inv: Fermat addition chain (~260 ops, was ~448 naive binary exp)
- scalar_mul: wNAF window-5 with 8-entry precomputed table (was double-and-add)
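To illustrate the scalar_mul item above, here is a minimal Python sketch of width-5 wNAF scalar multiplication. It is not the OpenCL kernel from this commit: the helper names (`wnaf`, `point_add`, `point_neg`, `scalar_mul_wnaf`, `scalar_mul_naive`) are hypothetical, points are kept in affine coordinates for readability, and the field inverse uses Python's built-in `pow(v, P - 2, P)` (Fermat) rather than an addition chain. It shows why a window of 5 needs only 8 table entries: the nonzero digits are odd values in [-15, 15], so the table holds 1P, 3P, ..., 15P and negative digits reuse the same entries with the y coordinate negated.

```python
# secp256k1 domain parameters (standard curve constants).
P = 2**256 - 2**32 - 977
N = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141
G = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)


def wnaf(k, w=5):
    """Width-w NAF digits of k, least significant first."""
    digits = []
    while k > 0:
        if k & 1:
            d = k & ((1 << w) - 1)       # k mod 2^w
            if d >= 1 << (w - 1):
                d -= 1 << w              # map into the signed range
            k -= d                       # clears the low w bits of k
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


def point_neg(pt):
    return None if pt is None else (pt[0], -pt[1] % P)


def point_add(a, b):
    """Affine addition; None stands for the point at infinity."""
    if a is None:
        return b
    if b is None:
        return a
    (x1, y1), (x2, y2) = a, b
    if x1 == x2:
        if (y1 + y2) % P == 0:
            return None                  # a == -b
        lam = 3 * x1 * x1 * pow(2 * y1 % P, P - 2, P) % P   # doubling
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P, P - 2, P) % P
    x3 = (lam * lam - x1 - x2) % P
    return (x3, (lam * (x1 - x3) - y1) % P)


def scalar_mul_wnaf(k, pt, w=5):
    # table[i] = (2*i + 1) * pt, i.e. 8 odd multiples for w = 5.
    twice = point_add(pt, pt)
    table = [pt]
    for _ in range((1 << (w - 2)) - 1):
        table.append(point_add(table[-1], twice))
    acc = None
    for d in reversed(wnaf(k, w)):
        acc = point_add(acc, acc)        # one doubling per digit
        if d > 0:
            acc = point_add(acc, table[d >> 1])
        elif d < 0:
            acc = point_add(acc, point_neg(table[(-d) >> 1]))
    return acc


def scalar_mul_naive(k, pt):
    """Plain double-and-add, used here only as a reference."""
    acc = None
    while k:
        if k & 1:
            acc = point_add(acc, pt)
        pt = point_add(pt, pt)
        k >>= 1
    return acc
```

Both routines perform one doubling per scalar bit. The saving comes from additions: in a width-5 NAF any two nonzero digits are separated by at least four zeros, so far fewer additions are needed than in plain double-and-add, where every set bit costs one.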

Infrastructure:
- Added batch dispatch: batch_field_add/sub/mul/sqr, batch_point_double/add
- Rewrote benchmark to use batch throughput (same methodology as CUDA)
- Created embed_kernels.cmake (was missing from repo)

Results (RTX 5060 Ti, batch=65536):
- Field Mul: 12.2 ns/op (82 M/s)
- Field Inv: 44.8 ns/op (22 M/s)
- Scalar Mul: 419 ns/op (2.39 M/s), about 1.4x faster than CUDA's 591 ns (1.69 M/s)
- 32/32 correctness tests pass
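The two figures on each line above are reciprocals of one another, which can be checked with a short Python sketch. The function names are hypothetical, and the formula for deriving ns/op from a timed batch is an assumption about what "batch throughput" means here (total elapsed time divided by the number of operations); the commit does not show the benchmark code.

```python
def ns_per_op(elapsed_seconds, batch_size, launches=1):
    """Average time per operation when `launches` batches of
    `batch_size` operations take `elapsed_seconds` in total."""
    return elapsed_seconds * 1e9 / (batch_size * launches)


def mops_per_second(ns):
    """Convert ns/op to millions of operations per second."""
    return 1e3 / ns


if __name__ == "__main__":
    # ns/op figures quoted in the commit message.
    reported = {
        "Field Mul": 12.2,
        "Field Inv": 44.8,
        "Scalar Mul (OpenCL)": 419.0,
        "Scalar Mul (CUDA)": 591.0,
    }
    for name, ns in reported.items():
        print(f"{name}: {ns} ns/op = {mops_per_second(ns):.2f} M/s")
    print(f"OpenCL vs CUDA scalar_mul: {591.0 / 419.0:.2f}x")
```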

Verified: opencl_test --nvidia → ALL TESTS PASSED

2026-02-14 14:48:11 +00:00

embed_kernels.cmake

perf(opencl): optimize kernels — unrolled field_mul/sqr, addition chain field_inv, wNAF scalar_mul

2026-02-14 14:48:11 +00:00