UltrafastSecp256k1/opencl/cmake
shrec 83f6d033d6 perf(opencl): optimize kernels — unrolled field_mul/sqr, addition chain field_inv, wNAF scalar_mul
Optimizations:
- field_mul: fully unrolled 4x4 schoolbook (16 explicit mul64_full, no loops)
- field_sqr: fully unrolled off-diagonal + diagonal computation
- field_inv: Fermat addition chain (~260 ops, was ~448 naive binary exp)
- scalar_mul: wNAF window-5 with 8-entry precomputed table (was double-and-add)
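
The wNAF window-5 item above can be sketched as a plain Python reference model. This is not the OpenCL kernel: it uses affine coordinates and Python big integers for readability, and the helper names (`wnaf`, `add`, `neg`, `scalar_mul_wnaf`, `scalar_mul_naive`) are hypothetical. It only illustrates why width 5 needs 8 table entries (the odd multiples 1P, 3P, ..., 15P) and checks the result against double-and-add.

```python
# Reference model only: affine secp256k1 arithmetic with Python integers.
P = 2**256 - 2**32 - 977  # field prime
N = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141  # group order
G = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)

def add(a, b):
    """Affine addition on y^2 = x^3 + 7; None is the point at infinity."""
    if a is None:
        return b
    if b is None:
        return a
    (x1, y1), (x2, y2) = a, b
    if x1 == x2 and (y1 + y2) % P == 0:
        return None                                   # a + (-a)
    if a == b:
        lam = 3 * x1 * x1 * pow(2 * y1 % P, -1, P) % P  # tangent slope
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P  # chord slope
    x3 = (lam * lam - x1 - x2) % P
    return (x3, (lam * (x1 - x3) - y1) % P)

def neg(a):
    return None if a is None else (a[0], -a[1] % P)

def wnaf(k, w=5):
    """Width-w NAF digits of k >= 0, least significant first."""
    digits = []
    while k > 0:
        if k & 1:
            d = k % (1 << w)
            if d >= 1 << (w - 1):
                d -= 1 << w        # signed odd digit with |d| < 2^(w-1)
            k -= d                 # clears the low w bits of k
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits

def scalar_mul_wnaf(k, pt, w=5):
    # Precompute odd multiples 1P, 3P, ..., (2^(w-1) - 1)P: 8 entries for w = 5.
    twice = add(pt, pt)
    table = [pt]
    for _ in range((1 << (w - 2)) - 1):
        table.append(add(table[-1], twice))
    acc = None
    for d in reversed(wnaf(k, w)):
        acc = add(acc, acc)
        if d > 0:
            acc = add(acc, table[d >> 1])
        elif d < 0:
            acc = add(acc, neg(table[-d >> 1]))
    return acc

def scalar_mul_naive(k, pt):
    """Left-to-right double-and-add, the baseline the commit replaced."""
    acc = None
    for bit in bin(k)[2:]:
        acc = add(acc, acc)
        if bit == "1":
            acc = add(acc, pt)
    return acc

k = 0x1234567890ABCDEF
assert scalar_mul_wnaf(k, G) == scalar_mul_naive(k, G)
```

Both loops perform one doubling per bit; the saving comes from the number of additions, since after each nonzero wNAF digit the next four digits are zero.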

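For the field_inv item, a minimal Python sketch of a Fermat inversion by addition chain follows. It assumes the widely used secp256k1 chain built from runs of 2, 3, 6, 9, 11, 22, 44, 88, 176, 220 and 223 one-bits; the chain in the actual kernel may differ, so the operation count printed here (255 squarings plus 15 multiplications) need not match the commit's approximate figure. The names `field_inv`, `sqr_n` and `mul` are hypothetical.

```python
P = 2**256 - 2**32 - 977  # secp256k1 field prime

def field_inv(a):
    """Return (a^(P-2) mod P, squarings, multiplications)."""
    ops = {"sqr": 0, "mul": 0}

    def sqr_n(x, n):
        # n successive squarings: x -> x^(2^n)
        for _ in range(n):
            x = x * x % P
        ops["sqr"] += n
        return x

    def mul(x, y):
        ops["mul"] += 1
        return x * y % P

    # xK denotes a^(2^K - 1), i.e. an exponent made of K one-bits.
    x2 = mul(sqr_n(a, 1), a)
    x3 = mul(sqr_n(x2, 1), a)
    x6 = mul(sqr_n(x3, 3), x3)
    x9 = mul(sqr_n(x6, 3), x3)
    x11 = mul(sqr_n(x9, 2), x2)
    x22 = mul(sqr_n(x11, 11), x11)
    x44 = mul(sqr_n(x22, 22), x22)
    x88 = mul(sqr_n(x44, 44), x44)
    x176 = mul(sqr_n(x88, 88), x88)
    x220 = mul(sqr_n(x176, 44), x44)
    x223 = mul(sqr_n(x220, 3), x3)

    # P - 2 in binary: 223 ones, a zero, 22 ones, then 0000101101.
    t = mul(sqr_n(x223, 23), x22)
    t = mul(sqr_n(t, 5), a)
    t = mul(sqr_n(t, 3), x2)
    t = mul(sqr_n(t, 2), a)
    return t, ops["sqr"], ops["mul"]

inv, n_sqr, n_mul = field_inv(12345)
assert inv * 12345 % P == 1
print(n_sqr, n_mul)
```

The squaring count is fixed by the bit length of the exponent, so the chain saves work only on multiplications, by reusing the long runs of one-bits in P - 2.
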
Infrastructure:
- Added batch dispatch: batch_field_add/sub/mul/sqr, batch_point_double/add
- Rewrote benchmark to use batch throughput (same methodology as CUDA)
- Created embed_kernels.cmake (was missing from repo)

Results (RTX 5060 Ti, batch=65536):
- Field Mul: 12.2 ns/op (82 M/s)
- Field Inv: 44.8 ns/op (22 M/s)
- Scalar Mul: 419 ns/op (2.39 M/s), about 1.4x faster than CUDA's 591 ns/op (1.69 M/s)
- 32/32 correctness tests pass
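
The ns/op and M/s columns above are two views of the same batch measurement. A small Python check of that arithmetic, using only the figures listed in the results (the names `mops`, `results_ns` and `CUDA_SCALAR_MUL_NS` are hypothetical):

```python
def mops(ns_per_op):
    """Convert latency per operation (ns) into millions of operations per second."""
    return 1e3 / ns_per_op  # (1e9 ns per second / ns_per_op) / 1e6

# Figures copied from the results list (batch=65536).
results_ns = {"Field Mul": 12.2, "Field Inv": 44.8, "Scalar Mul": 419.0}
CUDA_SCALAR_MUL_NS = 591.0

for name, ns in results_ns.items():
    print(f"{name}: {ns} ns/op = {mops(ns):.2f} M/s")

# Ratio of CUDA time to OpenCL time for scalar multiplication.
speedup = CUDA_SCALAR_MUL_NS / results_ns["Scalar Mul"]
print(f"OpenCL vs CUDA scalar_mul: {speedup:.2f}x")
```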

Verified: opencl_test --nvidia → ALL TESTS PASSED
2026-02-14 14:48:11 +00:00
embed_kernels.cmake 2026-02-14 14:48:11 +00:00