opencl+metal: wire bulletproof_verify_batch — close last parity gap

- opencl/kernels/secp256k1_zk.cl: remove #if 0 guard (881 lines of bulletproof
  code re-enabled); fix range_verify_full_impl address-space qualifiers:
  bp_G / bp_H now __global const AffinePoint*; per-iteration private copy
  (AffinePoint g_pt = bp_G[i]) before passing to scalar_mul_impl.

- gpu/src/gpu_backend_opencl.cpp: replace Unsupported stub with real dispatch
  via range_proof_poly_batch kernel; add bp_poly_batch_ member + cleanup;
  update ensure_zk_kernels() to register the new kernel; parse 324-byte
  wire format (4x65-byte uncompressed + 2x32-byte scalars) into
  RangeProofPolyOCL GPU layout.

- gpu/src/gpu_backend_metal.mm: replace Unsupported stub with real dispatch
  via range_proof_poly_batch kernel; build RangeProofPolyMetal (320B) from
  324-byte wire format using be32_to_metal_fe / be32_to_metal_scalar helpers.

- docs/BACKEND_ASSURANCE_MATRIX.md: bulletproof row stub->Y for OpenCL+Metal;
  parity tracking now shows zero remaining stubs.

- CHANGELOG.md: document bulletproof parity closure.

All three backends (CUDA, OpenCL, Metal) now implement bulletproof_verify_batch.
Zero Unsupported stubs remain in the GPU backend surface.
This commit is contained in:
shrec 2026-03-24 21:59:56 +00:00
parent d329026a01
commit 849c9df309
No known key found for this signature in database
5 changed files with 203 additions and 27 deletions

View File

@ -13,8 +13,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- **ClusterFuzzLite expanded to 5 targets**: added `cpu/fuzz/fuzz_ecdsa.cpp` (ECDSA sign→verify invariant, wrong-msg false-positive check, parse_compact_strict robustness) and `cpu/fuzz/fuzz_schnorr.cpp` (BIP-340 sign→verify, adversarial from_bytes verify, wrong-msg check).
### GPU Backend
- **OpenCL parity**: wired `zk_knowledge_verify_batch`, `zk_dleq_verify_batch`, `bip324_aead_encrypt_batch`, `bip324_aead_decrypt_batch` in `gpu_backend_opencl.cpp` — 4 new kernels matching CUDA surface. Closes assurance gap **#5/#6** (partial; bulletproof batch remains PARITY-EXCEPTION, documented in `BACKEND_ASSURANCE_MATRIX.md`).
- **Metal parity**: wired `zk_knowledge_verify_batch`, `zk_dleq_verify_batch`, `bip324_aead_encrypt_batch`, `bip324_aead_decrypt_batch` in `gpu_backend_metal.mm` — all four Metal kernels were already present in `secp256k1_kernels.metal`; dispatch code now connected. Also fixed `zk_knowledge_verify_batch` Metal kernel: was incorrectly treating pubkey buffer as a scalar (scalar×G); corrected to `lift_x` to recover the full point from x-coordinate. Full CUDA ↔ OpenCL ↔ Metal parity on all ZK and BIP-324 batch ops. `bulletproof_verify_batch` remains PARITY-EXCEPTION on all GPU backends.
- **Bulletproof parity (OpenCL + Metal)**: resolved the last remaining PARITY-EXCEPTION. OpenCL: removed `#if 0` guard in `secp256k1_zk.cl`; fixed `range_verify_full_impl` address-space qualifiers (`__global const AffinePoint*` for `bp_G`/`bp_H`, with per-iteration private copy before `scalar_mul_impl`); wired `bulletproof_verify_batch` host dispatch via `range_proof_poly_batch` kernel (matches CUDA poly-check behavior). Metal: wired `bulletproof_verify_batch` host dispatch via `range_proof_poly_batch` kernel; host converts 324-byte proof wire format to `RangeProofPolyMetal` GPU structs. Full CUDA ↔ OpenCL ↔ Metal parity — zero `Unsupported` stubs remaining.
- **OpenCL parity**: wired `zk_knowledge_verify_batch`, `zk_dleq_verify_batch`, `bip324_aead_encrypt_batch`, `bip324_aead_decrypt_batch` in `gpu_backend_opencl.cpp` — 4 new kernels matching CUDA surface.
- **Metal parity**: wired `zk_knowledge_verify_batch`, `zk_dleq_verify_batch`, `bip324_aead_encrypt_batch`, `bip324_aead_decrypt_batch` in `gpu_backend_metal.mm` — all four Metal kernels were already present in `secp256k1_kernels.metal`; dispatch code now connected. Also fixed `zk_knowledge_verify_batch` Metal kernel: was incorrectly treating pubkey buffer as a scalar (scalar×G); corrected to `lift_x` to recover the full point from x-coordinate.
- **CUDA 13 compatibility**: replaced deprecated `cudaDeviceProp::clockRate` / `::memoryClockRate` fields (removed in CUDA 13) with `cudaDeviceGetAttribute(cudaDevAttrClockRate/MemoryClockRate)` under `#if CUDART_VERSION >= 13000` guard. Backward-compatible with CUDA 12. Reported by @craigraw compiling with CUDA 13 on RTX 5080.

View File

@ -21,7 +21,7 @@
| ZK proofs | Y | Y | Y | Y | Y |
| ZK knowledge verify batch | - | - | Y | Y | Y |
| ZK DLEQ verify batch | - | - | Y | Y | Y |
| Bulletproof verify batch | - | - | Y | stub | stub |
| Bulletproof verify batch | - | - | Y | Y | Y |
| BIP-324 AEAD encrypt batch | - | - | Y | Y | Y |
| BIP-324 AEAD decrypt batch | - | - | Y | Y | Y |
| Multi-scalar mul | Y | - | Y | Y | Y |
@ -44,11 +44,13 @@
| Operation | Backend | Tracking note |
|-----------|---------|---------------|
| `bulletproof_verify_batch` | OpenCL, Metal | CUDA implemented; OpenCL/Metal blocked by address-space qualifier issues in kernel — tracked as PARITY-EXCEPTION |
| *(none — all parity gaps resolved)* | — | — |
> All other ZK and BIP-324 batch operations (`zk_knowledge_verify_batch`,
> `zk_dleq_verify_batch`, `bip324_aead_encrypt_batch`, `bip324_aead_decrypt_batch`)
> are now **fully implemented on all three GPU backends** (CUDA, OpenCL, Metal).
> All ZK and BIP-324 batch operations, including `bulletproof_verify_batch`, are now
> **fully implemented on all three GPU backends** (CUDA, OpenCL, Metal).
> OpenCL kernel `#if 0` guard removed; address-space qualifier fix applied to
> `range_verify_full_impl` (added `__global` to `bp_G`/`bp_H`, local copy in loop).
> Metal host dispatch wired via `range_proof_poly_batch` kernel.
> Resolved 2026-03-24.
### Current permanent exceptions

View File

@ -840,12 +840,82 @@ public:
return GpuError::Ok;
}
// PARITY-EXCEPTION(Metal): bulletproof kernel blocked upstream; OpenCL also #if 0.
// See docs/BACKEND_ASSURANCE_MATRIX.md row "bulletproof_verify_batch".
GpuError bulletproof_verify_batch(
const uint8_t*, const uint8_t*, const uint8_t*,
size_t, uint8_t*) override
{ return set_error(GpuError::Unsupported, "bulletproof: kernel pending address-space fixes"); }
const uint8_t* proofs324, const uint8_t* commitments65,
const uint8_t* H_generator65, size_t count,
uint8_t* out_results) override
{
if (!is_ready()) return set_error(GpuError::Device, "context not initialised");
if (count == 0) { clear_error(); return GpuError::Ok; }
if (!proofs324 || !commitments65 || !H_generator65 || !out_results)
return set_error(GpuError::NullArg, "NULL buffer");
auto err = ensure_library();
if (err != GpuError::Ok) return err;
/* Convert big-endian 32 bytes → MetalFieldElem (same format as scalar). */
auto be32_to_metal_fe = [](const uint8_t be[32]) -> MetalFieldElem {
MetalFieldElem fe;
for (int i = 0; i < 8; i++) {
int base = (7 - i) * 4;
fe.limbs[i] = ((uint32_t)be[base] << 24) |
((uint32_t)be[base+1] << 16) |
((uint32_t)be[base+2] << 8) |
((uint32_t)be[base+3]);
}
return fe;
};
/* Parse uncompressed point (65 bytes: 04 || x[32] || y[32]) → MetalAffinePoint */
auto parse_pt65 = [&be32_to_metal_fe](const uint8_t pt65[65]) -> MetalAffinePoint {
return { be32_to_metal_fe(pt65 + 1), be32_to_metal_fe(pt65 + 33) };
};
/* Build GPU-layout RangeProofPolyGPU structs (320 bytes each):
* 4 x MetalAffinePoint (A, S, T1, T2) + 2 x MetalScalar256 (tau_x, t_hat)
* Wire format per proof (324 bytes): 4 x 65-byte uncompressed + 2 x 32-byte scalars */
struct RangeProofPolyMetal {
MetalAffinePoint A, S, T1, T2;
MetalScalar256 tau_x, t_hat;
};
static_assert(sizeof(RangeProofPolyMetal) == 320, "struct layout mismatch");
auto buf_proofs = runtime_->alloc_buffer_shared(count * sizeof(RangeProofPolyMetal));
auto* proofs_out = static_cast<RangeProofPolyMetal*>(buf_proofs.contents());
for (size_t i = 0; i < count; ++i) {
const uint8_t* p = proofs324 + i * 324;
proofs_out[i].A = parse_pt65(p);
proofs_out[i].S = parse_pt65(p + 65);
proofs_out[i].T1 = parse_pt65(p + 130);
proofs_out[i].T2 = parse_pt65(p + 195);
proofs_out[i].tau_x = be32_to_metal_scalar(p + 260);
proofs_out[i].t_hat = be32_to_metal_scalar(p + 292);
}
auto buf_commits = runtime_->alloc_buffer_shared(count * sizeof(MetalAffinePoint));
auto* commits_out = static_cast<MetalAffinePoint*>(buf_commits.contents());
for (size_t i = 0; i < count; ++i)
commits_out[i] = parse_pt65(commitments65 + i * 65);
auto buf_hgen = runtime_->alloc_buffer_shared(sizeof(MetalAffinePoint));
*static_cast<MetalAffinePoint*>(buf_hgen.contents()) = parse_pt65(H_generator65);
auto buf_res = runtime_->alloc_buffer_shared(count * sizeof(uint32_t));
uint32_t n32 = (uint32_t)count;
auto buf_n = runtime_->alloc_buffer_shared(sizeof(uint32_t));
std::memcpy(buf_n.contents(), &n32, sizeof(n32));
auto pipe = runtime_->make_pipeline("range_proof_poly_batch");
runtime_->dispatch_sync(pipe, (uint32_t)count, 64u,
{&buf_proofs, &buf_commits, &buf_hgen, &buf_res, &buf_n});
const auto* res = static_cast<const uint32_t*>(buf_res.contents());
for (size_t i = 0; i < count; ++i)
out_results[i] = res[i] ? 1 : 0;
clear_error();
return GpuError::Ok;
}
GpuError bip324_aead_encrypt_batch(
const uint8_t* keys32, const uint8_t* nonces12,

View File

@ -144,6 +144,7 @@ public:
frost_init_attempted_ = false;
if (zk_knowledge_verify_) { clReleaseKernel(zk_knowledge_verify_); zk_knowledge_verify_ = nullptr; }
if (zk_dleq_verify_) { clReleaseKernel(zk_dleq_verify_); zk_dleq_verify_ = nullptr; }
if (bp_poly_batch_) { clReleaseKernel(bp_poly_batch_); bp_poly_batch_ = nullptr; }
if (zk_program_) { clReleaseProgram(zk_program_); zk_program_ = nullptr; }
zk_init_attempted_ = false;
if (bip324_aead_encrypt_) { clReleaseKernel(bip324_aead_encrypt_); bip324_aead_encrypt_ = nullptr; }
@ -1063,13 +1064,106 @@ public:
return GpuError::Ok;
}
// PARITY-EXCEPTION(OpenCL): bulletproof kernel is #if 0 blocked in secp256k1_zk.cl
// pending address-space qualifier fixes (__global/__private params).
// See docs/BACKEND_ASSURANCE_MATRIX.md row "bulletproof_verify_batch".
GpuError bulletproof_verify_batch(
const uint8_t*, const uint8_t*, const uint8_t*,
size_t, uint8_t*) override
{ return set_error(GpuError::Unsupported, "bulletproof: OpenCL kernel pending address-space fixes"); }
const uint8_t* proofs324, const uint8_t* commitments65,
const uint8_t* H_generator65, size_t count,
uint8_t* out_results) override
{
if (!is_ready()) return set_error(GpuError::Device, "context not initialised");
if (count == 0) { clear_error(); return GpuError::Ok; }
if (!proofs324 || !commitments65 || !H_generator65 || !out_results)
return set_error(GpuError::NullArg, "NULL buffer");
auto err = ensure_zk_kernels();
if (err != GpuError::Ok) return err;
auto* cl_ctx = static_cast<cl_context>(ctx_->native_context());
auto* queue = static_cast<cl_command_queue>(ctx_->native_queue());
cl_int clerr;
/* Parse 324-byte proofs into host-side RangeProofPolyGPU struct.
* Wire layout per proof: 4 × 65-byte uncompressed points (A, S, T1, T2)
* + 2 × 32-byte BE scalars (tau_x, t_hat) = 324 bytes.
* GPU struct layout: 4 × AffinePoint(64B) + 2 × Scalar(32B) = 320 bytes. */
struct RangeProofPolyOCL {
secp256k1::opencl::AffinePoint A, S, T1, T2;
secp256k1::opencl::Scalar tau_x, t_hat;
};
std::vector<RangeProofPolyOCL> h_proofs(count);
for (size_t i = 0; i < count; ++i) {
const uint8_t* p = proofs324 + i * 324;
if (!pubkey65_to_affine(p, &h_proofs[i].A)) return set_error(GpuError::BadKey, "invalid proof A");
if (!pubkey65_to_affine(p + 65, &h_proofs[i].S)) return set_error(GpuError::BadKey, "invalid proof S");
if (!pubkey65_to_affine(p + 130, &h_proofs[i].T1)) return set_error(GpuError::BadKey, "invalid proof T1");
if (!pubkey65_to_affine(p + 195, &h_proofs[i].T2)) return set_error(GpuError::BadKey, "invalid proof T2");
bytes_to_scalar(p + 260, &h_proofs[i].tau_x);
bytes_to_scalar(p + 292, &h_proofs[i].t_hat);
}
std::vector<secp256k1::opencl::AffinePoint> h_commits(count);
for (size_t i = 0; i < count; ++i) {
if (!pubkey65_to_affine(commitments65 + i * 65, &h_commits[i]))
return set_error(GpuError::BadKey, "invalid commitment");
}
secp256k1::opencl::AffinePoint h_gen;
if (!pubkey65_to_affine(H_generator65, &h_gen))
return set_error(GpuError::BadKey, "invalid H generator");
cl_mem d_proofs = clCreateBuffer(cl_ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
sizeof(RangeProofPolyOCL) * count, h_proofs.data(), &clerr);
if (clerr != CL_SUCCESS) return set_error(GpuError::Memory, "bp proof buffer");
cl_mem d_commits = clCreateBuffer(cl_ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
sizeof(secp256k1::opencl::AffinePoint) * count,
h_commits.data(), &clerr);
if (clerr != CL_SUCCESS) {
clReleaseMemObject(d_proofs);
return set_error(GpuError::Memory, "bp commit buffer");
}
cl_mem d_hgen = clCreateBuffer(cl_ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
sizeof(secp256k1::opencl::AffinePoint), &h_gen, &clerr);
if (clerr != CL_SUCCESS) {
clReleaseMemObject(d_commits); clReleaseMemObject(d_proofs);
return set_error(GpuError::Memory, "bp h-gen buffer");
}
cl_mem d_res = clCreateBuffer(cl_ctx, CL_MEM_WRITE_ONLY,
sizeof(int) * count, nullptr, &clerr);
if (clerr != CL_SUCCESS) {
clReleaseMemObject(d_hgen); clReleaseMemObject(d_commits); clReleaseMemObject(d_proofs);
return set_error(GpuError::Memory, "bp result buffer");
}
cl_uint cl_count = static_cast<cl_uint>(count);
clSetKernelArg(bp_poly_batch_, 0, sizeof(cl_mem), &d_proofs);
clSetKernelArg(bp_poly_batch_, 1, sizeof(cl_mem), &d_commits);
clSetKernelArg(bp_poly_batch_, 2, sizeof(cl_mem), &d_hgen);
clSetKernelArg(bp_poly_batch_, 3, sizeof(cl_mem), &d_res);
clSetKernelArg(bp_poly_batch_, 4, sizeof(cl_uint), &cl_count);
size_t global = count;
clerr = clEnqueueNDRangeKernel(queue, bp_poly_batch_, 1, nullptr,
&global, nullptr, 0, nullptr, nullptr);
if (clerr != CL_SUCCESS) {
clReleaseMemObject(d_res); clReleaseMemObject(d_hgen);
clReleaseMemObject(d_commits); clReleaseMemObject(d_proofs);
return set_error(GpuError::Launch, "bp_poly_batch kernel launch failed");
}
clFinish(queue);
std::vector<int> h_res(count);
clEnqueueReadBuffer(queue, d_res, CL_TRUE, 0,
sizeof(int) * count, h_res.data(), 0, nullptr, nullptr);
for (size_t i = 0; i < count; ++i)
out_results[i] = h_res[i] ? 1 : 0;
clReleaseMemObject(d_res); clReleaseMemObject(d_hgen);
clReleaseMemObject(d_commits); clReleaseMemObject(d_proofs);
clear_error();
return GpuError::Ok;
}
/* -- BIP-324 AEAD batch operations (OpenCL via secp256k1_bip324.cl) ----- */
@ -1279,6 +1373,7 @@ private:
cl_program zk_program_ = nullptr;
cl_kernel zk_knowledge_verify_ = nullptr;
cl_kernel zk_dleq_verify_ = nullptr;
cl_kernel bp_poly_batch_ = nullptr; /* range_proof_poly_batch */
bool zk_init_attempted_ = false;
/* BIP-324 AEAD kernel handles (lazy-loaded via secp256k1_bip324.cl) */
@ -1578,7 +1673,7 @@ private:
/* -- Lazy-load ZK proof OpenCL program --------------------------------- */
GpuError ensure_zk_kernels() {
if (zk_knowledge_verify_ && zk_dleq_verify_) return GpuError::Ok;
if (zk_knowledge_verify_ && zk_dleq_verify_ && bp_poly_batch_) return GpuError::Ok;
if (zk_init_attempted_)
return set_error(GpuError::Launch, "ZK kernel init previously failed");
zk_init_attempted_ = true;
@ -1642,6 +1737,14 @@ private:
return set_error(GpuError::Launch, "zk_dleq_verify_batch kernel not found");
}
bp_poly_batch_ = clCreateKernel(zk_program_, "range_proof_poly_batch", &err);
if (err != CL_SUCCESS) {
clReleaseKernel(zk_dleq_verify_); zk_dleq_verify_ = nullptr;
clReleaseKernel(zk_knowledge_verify_); zk_knowledge_verify_ = nullptr;
clReleaseProgram(zk_program_); zk_program_ = nullptr;
return set_error(GpuError::Launch, "range_proof_poly_batch kernel not found");
}
return GpuError::Ok;
}

View File

@ -595,9 +595,6 @@ __kernel void zk_dleq_verify_batch(
// =============================================================================
// 3. Bulletproof Range Proof (64-bit)
// =============================================================================
// Note: Bulletproof section requires OpenCL address-space fixes before enabling.
// Knowledge and DLEQ proofs above are fully functional.
#if 0 // Bulletproof: pending address-space fixes for __global/__private params
// Full Bulletproof range proof verification on OpenCL.
// Ported from CUDA implementation (commit 02ac59d).
//
@ -937,8 +934,8 @@ inline int range_verify_full_impl(
const RangeProofGPU* proof,
const AffinePoint* commitment,
const AffinePoint* H_gen,
const AffinePoint* bp_G, // 64 G_i generators
const AffinePoint* bp_H, // 64 H_i generators
__global const AffinePoint* bp_G, // 64 G_i generators (__global: too large for private)
__global const AffinePoint* bp_H, // 64 H_i generators (__global: too large for private)
const ZKTagMidstate* bp_ip_midstate)
{
// ---- Fiat-Shamir: recompute y, z, x ----
@ -1154,8 +1151,10 @@ inline int range_verify_full_impl(
scalar_mul_mod_n_impl(&proof->a, &s_coeff[i], &a_si);
scalar_sub_mod_n_impl(&neg_z, &a_si, &g_coeff);
/* Copy generator point from __global to __private before scalar_mul_impl */
AffinePoint g_pt = bp_G[i];
JacobianPoint g_term;
scalar_mul_impl(&g_term, &g_coeff, &bp_G[i]);
scalar_mul_impl(&g_term, &g_coeff, &g_pt);
point_add_impl(&msm_acc, &msm_acc, &g_term);
// H_i: (z + z2*2^i*y_inv^i) - b*s_inv[i]*y_inv^i
@ -1169,8 +1168,10 @@ inline int range_verify_full_impl(
scalar_mul_mod_n_impl(&b_si, &y_inv_powers[i], &b_si_yi);
scalar_sub_mod_n_impl(&h_pcheck, &b_si_yi, &h_coeff);
/* Copy generator point from __global to __private before scalar_mul_impl */
AffinePoint h_pt = bp_H[i];
JacobianPoint h_term;
scalar_mul_impl(&h_term, &h_coeff, &bp_H[i]);
scalar_mul_impl(&h_term, &h_coeff, &h_pt);
point_add_impl(&msm_acc, &msm_acc, &h_term);
}
@ -1478,4 +1479,3 @@ __kernel void pedersen_verify_sum(
if (z_bytes[i] != 0) z_zero = 0;
*result = (sum.infinity || z_zero);
}
#endif // Bulletproof: pending address-space fixes