Full-benchmark GLV+LUT loops previously fired 1 kernel per CudaTimer interval (single-dispatch overhead ~4.5 ns). Autotuner already used sample_repeats=20. Now both measure the same way:

  Before: GPU+LUT 95.2 ns / 10.50 M/s (1 kernel/timer)
  After:  GPU+LUT 90.6 ns / 11.04 M/s (20 kernels/timer, divided by 20)

Update OpenCL CUDA reference: 95.2 → 90.6 ns.

| File |
|---|
| bench_bip352_opencl.cpp |
| bench_opencl.cpp |
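
The measurement change in the commit message (several launches per timer interval, total divided by the repeat count) can be sketched on the host side roughly as follows. This is only an illustration under stated assumptions: `mean_ns_per_call` and `launch` are hypothetical names, and `std::chrono` stands in for the repository's `CudaTimer`, whose interface is not shown here.

```cpp
#include <chrono>

// Hypothetical helper, not the repository's CudaTimer.
// Runs `repeats` back-to-back calls inside ONE timer interval and returns
// the mean time per call in nanoseconds, so the fixed cost of starting and
// stopping the timer is spread over all calls instead of charged to one.
template <typename Fn>
double mean_ns_per_call(Fn&& launch, int repeats) {
    using clock = std::chrono::steady_clock;

    const auto start = clock::now();
    for (int i = 0; i < repeats; ++i) {
        launch();  // assumption: one call corresponds to one kernel dispatch
    }
    const auto stop = clock::now();

    const double total_ns =
        std::chrono::duration<double, std::nano>(stop - start).count();
    return total_ns / repeats;  // the "divided by 20" step when repeats == 20
}
```

With `repeats = 1` this corresponds to the old 1 kernel/timer measurement, and with `repeats = 20` to the new one. In a real GPU benchmark the launches are asynchronous, so the interval would also need to end with a synchronization before the stop timestamp is read.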