shrec
83f6d033d6
perf(opencl): optimize kernels — unrolled field_mul/sqr, addition chain field_inv, wNAF scalar_mul
Optimizations:
- field_mul: fully unrolled 4x4 schoolbook (16 explicit mul64_full, no loops)
- field_sqr: fully unrolled off-diagonal + diagonal computation
- field_inv: Fermat inversion via addition chain (~260 ops, down from ~448 with naive binary exponentiation)
- scalar_mul: wNAF window-5 with 8-entry precomputed table (replaces double-and-add)
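
For reviewers unfamiliar with the wNAF change, here is a minimal host-side sketch in C of width-5 recoding and the 8-entry table of odd multiples. It is not the kernel code from this commit: `wnaf_recode` and `wnaf_mul` are hypothetical names, the scalar is limited to 64 bits, and plain `uint64_t` addition (mod 2^64) stands in for elliptic-curve point addition so the result can be checked against `k * p`.

```c
#include <assert.h>
#include <stdint.h>

#define WNAF_TABLE_SIZE 8   /* odd multiples 1P, 3P, ..., 15P for width 5 */
#define WNAF_MAX_DIGITS 65  /* upper bound on digits for a 64-bit scalar */

/* Recode k into width-5 NAF digits, least significant first.
 * Every nonzero digit is odd and lies in [-15, 15].
 * Assumption for this sketch: k < 2^63, so adding back a negative
 * digit cannot overflow. Returns the number of digits written. */
static int wnaf_recode(uint64_t k, int8_t *digits) {
    assert(k < (1ULL << 63));
    int n = 0;
    while (k != 0) {
        int d = 0;
        if (k & 1) {
            d = (int)(k & 31u);        /* k mod 2^5 */
            if (d >= 16) d -= 32;      /* map to the signed range */
            if (d < 0) k += (uint64_t)(-d);
            else       k -= (uint64_t)d;
        }
        digits[n++] = (int8_t)d;
        k >>= 1;
    }
    return n;
}

/* Stand-in "scalar multiplication": the group is uint64_t under
 * addition mod 2^64, so "doubling" is acc + acc and "negation" is
 * subtraction. The control flow mirrors a wNAF point multiply. */
static uint64_t wnaf_mul(uint64_t k, uint64_t p) {
    uint64_t table[WNAF_TABLE_SIZE];
    int8_t digits[WNAF_MAX_DIGITS];
    uint64_t two_p = p + p;

    table[0] = p;                              /* 1P */
    for (int j = 1; j < WNAF_TABLE_SIZE; j++)
        table[j] = table[j - 1] + two_p;       /* 3P, 5P, ..., 15P */

    int n = wnaf_recode(k, digits);
    uint64_t acc = 0;
    for (int i = n - 1; i >= 0; i--) {         /* most significant first */
        acc += acc;                            /* one "double" per digit */
        int d = digits[i];
        if (d > 0)      acc += table[(d - 1) / 2];
        else if (d < 0) acc -= table[(-d - 1) / 2];
    }
    return acc;
}
```

Because each nonzero digit is followed by at least four zeros, a scalar needs roughly one addition per six doublings on average, whereas plain double-and-add needs about one per two.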
Infrastructure:
- Added batch dispatch: batch_field_add/sub/mul/sqr, batch_point_double/add
- Rewrote benchmark to use batch throughput (same methodology as CUDA)
- Added embed_kernels.cmake (previously missing from the repo)
Results (RTX 5060 Ti, batch=65536):
- Field Mul: 12.2 ns/op (82 M/s)
- Field Inv: 44.8 ns/op (22 M/s)
- Scalar Mul: 419 ns/op (2.39 M/s), versus 591 ns/op (1.69 M/s) for CUDA, about 1.4x the throughput
- 32/32 correctness tests pass
Verified: opencl_test --nvidia → ALL TESTS PASSED
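
The ns/op and M/s figures in the results are two views of the same batch measurement. A small sketch in C of the conversion, assuming the benchmark times one whole batch and divides by the batch size; `ns_per_op` and `mops_per_sec` are hypothetical helper names, not functions from this repo.

```c
/* Average latency per operation, given the wall time for one batch. */
static double ns_per_op(double elapsed_ns, unsigned long batch) {
    return elapsed_ns / (double)batch;
}

/* Throughput in millions of operations per second.
 * 1 s = 1e9 ns, so ops/s = 1e9 / ns_per_op, and M ops/s = 1e3 / ns_per_op. */
static double mops_per_sec(double ns_op) {
    return 1000.0 / ns_op;
}
```

For example, `mops_per_sec(12.2)` is about 82 and `mops_per_sec(419.0)` is about 2.39, matching the Field Mul and Scalar Mul lines.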