UltrafastSecp256k1/opencl/benchmarks
shrec 6d0703d65c
bench(cuda): BENCH_MULTI=20 in full benchmark loops — matches autotuner throughput
Full-benchmark GLV+LUT loops previously launched 1 kernel per CudaTimer
interval, so every sample included single-dispatch overhead (~4.5 ns).
The autotuner already used sample_repeats=20. Both now measure the same way:

  Before: GPU+LUT 95.2 ns / 10.50 M/s  (1 kernel/timer)
  After:  GPU+LUT 90.6 ns / 11.04 M/s  (20 kernels/timer, divided by 20)

Update the CUDA reference figure in the OpenCL benchmark: 95.2 → 90.6 ns.
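
The measurement change described above can be sketched on the host side as follows. This is a minimal illustration, not the repository's code: `kBenchMulti`, `timed_ns_per_op`, and `mops_per_second` are hypothetical names, and `std::chrono` stands in for the CUDA event timing that a `CudaTimer` would presumably use. The idea is to run several launches inside one timer interval and divide the elapsed time by the launch count, so per-dispatch overhead is amortized.

```cpp
#include <chrono>
#include <cmath>

// Hypothetical constant mirroring BENCH_MULTI=20 from the commit message.
constexpr int kBenchMulti = 20;

// Convert a per-operation latency in nanoseconds to millions of ops per second.
inline double mops_per_second(double ns_per_op) {
    return 1000.0 / ns_per_op;
}

// Run `launches` back-to-back invocations of `launch` inside ONE timer
// interval and return the average nanoseconds per operation.
// Assumption: `launch` blocks until its work is done; a real GPU benchmark
// would record events around the loop and synchronize before reading them.
template <typename Launch>
double timed_ns_per_op(Launch&& launch, int launches, long long ops_per_launch) {
    using clock = std::chrono::steady_clock;
    const auto t0 = clock::now();
    for (int i = 0; i < launches; ++i) {
        launch();
    }
    const auto t1 = clock::now();
    const double total_ns =
        std::chrono::duration<double, std::nano>(t1 - t0).count();
    // Divide by the number of launches (the "divided by 20" step) and by
    // the work done per launch.
    return total_ns /
           (static_cast<double>(launches) * static_cast<double>(ops_per_launch));
}
```

With this conversion, 95.2 ns per op corresponds to roughly 10.50 M/s and 90.6 ns to roughly 11.04 M/s, which agrees with the before and after figures quoted above.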
2026-03-21 23:06:23 +00:00
bench_bip352_opencl.cpp bench(cuda): BENCH_MULTI=20 in full benchmark loops — matches autotuner throughput 2026-03-21 23:06:23 +00:00
bench_opencl.cpp Merge remote-tracking branch 'origin/release/v3.22.0' into dev 2026-03-21 14:14:49 +00:00