opencl+metal: wire bulletproof_verify_batch — close last parity gap

- opencl/kernels/secp256k1_zk.cl: remove #if 0 guard (881 lines of bulletproof code re-enabled); fix range_verify_full_impl address-space qualifiers: bp_G / bp_H now __global const AffinePoint*; per-iteration private copy (AffinePoint g_pt = bp_G[i]) before passing to scalar_mul_impl. - gpu/src/gpu_backend_opencl.cpp: replace Unsupported stub with real dispatch via range_proof_poly_batch kernel; add bp_poly_batch_ member + cleanup; update ensure_zk_kernels() to register the new kernel; parse 324-byte wire format (4x65-byte uncompressed + 2x32-byte scalars) into RangeProofPolyOCL GPU layout. - gpu/src/gpu_backend_metal.mm: replace Unsupported stub with real dispatch via range_proof_poly_batch kernel; build RangeProofPolyMetal (320B) from 324-byte wire format using be32_to_metal_fe / be32_to_metal_scalar helpers. - docs/BACKEND_ASSURANCE_MATRIX.md: bulletproof row stub->Y for OpenCL+Metal; parity tracking now shows zero remaining stubs. - CHANGELOG.md: document bulletproof parity closure. All three backends (CUDA, OpenCL, Metal) now implement bulletproof_verify_batch. Zero Unsupported stubs remain in the GPU backend surface.
2026-03-24 21:59:56 +00:00 · 2026-03-24 21:59:56 +00:00 · 849c9df309
commit 849c9df309
parent d329026a01
5 changed files with 203 additions and 27 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@ -13,8 +13,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - **ClusterFuzzLite expanded to 5 targets**: added `cpu/fuzz/fuzz_ecdsa.cpp` (ECDSA sign→verify invariant, wrong-msg false-positive check, parse_compact_strict robustness) and `cpu/fuzz/fuzz_schnorr.cpp` (BIP-340 sign→verify, adversarial from_bytes verify, wrong-msg check).

 ### GPU Backend
- **OpenCL parity**: wired `zk_knowledge_verify_batch`, `zk_dleq_verify_batch`, `bip324_aead_encrypt_batch`, `bip324_aead_decrypt_batch` in `gpu_backend_opencl.cpp` — 4 new kernels matching CUDA surface. Closes assurance gap **#5/#6** (partial; bulletproof batch remains PARITY-EXCEPTION, documented in `BACKEND_ASSURANCE_MATRIX.md`).
- **Metal parity**: wired `zk_knowledge_verify_batch`, `zk_dleq_verify_batch`, `bip324_aead_encrypt_batch`, `bip324_aead_decrypt_batch` in `gpu_backend_metal.mm` — all four Metal kernels were already present in `secp256k1_kernels.metal`; dispatch code now connected. Also fixed `zk_knowledge_verify_batch` Metal kernel: was incorrectly treating pubkey buffer as a scalar (scalar×G); corrected to `lift_x` to recover the full point from x-coordinate. Full CUDA ↔ OpenCL ↔ Metal parity on all ZK and BIP-324 batch ops. `bulletproof_verify_batch` remains PARITY-EXCEPTION on all GPU backends.
+- **Bulletproof parity (OpenCL + Metal)**: resolved the last remaining PARITY-EXCEPTION. OpenCL: removed `#if 0` guard in `secp256k1_zk.cl`; fixed `range_verify_full_impl` address-space qualifiers (`__global const AffinePoint*` for `bp_G`/`bp_H`, with per-iteration private copy before `scalar_mul_impl`); wired `bulletproof_verify_batch` host dispatch via `range_proof_poly_batch` kernel (matches CUDA poly-check behavior). Metal: wired `bulletproof_verify_batch` host dispatch via `range_proof_poly_batch` kernel; host converts 324-byte proof wire format to `RangeProofPolyMetal` GPU structs. Full CUDA ↔ OpenCL ↔ Metal parity — zero `Unsupported` stubs remaining.
+- **OpenCL parity**: wired `zk_knowledge_verify_batch`, `zk_dleq_verify_batch`, `bip324_aead_encrypt_batch`, `bip324_aead_decrypt_batch` in `gpu_backend_opencl.cpp` — 4 new kernels matching CUDA surface.
+- **Metal parity**: wired `zk_knowledge_verify_batch`, `zk_dleq_verify_batch`, `bip324_aead_encrypt_batch`, `bip324_aead_decrypt_batch` in `gpu_backend_metal.mm` — all four Metal kernels were already present in `secp256k1_kernels.metal`; dispatch code now connected. Also fixed `zk_knowledge_verify_batch` Metal kernel: was incorrectly treating pubkey buffer as a scalar (scalar×G); corrected to `lift_x` to recover the full point from x-coordinate.
 - **CUDA 13 compatibility**: replaced deprecated `cudaDeviceProp::clockRate` / `::memoryClockRate` fields (removed in CUDA 13) with `cudaDeviceGetAttribute(cudaDevAttrClockRate/MemoryClockRate)` under `#if CUDART_VERSION >= 13000` guard. Backward-compatible with CUDA 12. Reported by @craigraw compiling with CUDA 13 on RTX 5080.


--- a/docs/BACKEND_ASSURANCE_MATRIX.md
+++ b/docs/BACKEND_ASSURANCE_MATRIX.md
@ -21,7 +21,7 @@
 | ZK proofs | Y | Y | Y | Y | Y |
 | ZK knowledge verify batch | - | - | Y | Y | Y |
 | ZK DLEQ verify batch | - | - | Y | Y | Y |
-| Bulletproof verify batch | - | - | Y | stub | stub |
+| Bulletproof verify batch | - | - | Y | Y | Y |
 | BIP-324 AEAD encrypt batch | - | - | Y | Y | Y |
 | BIP-324 AEAD decrypt batch | - | - | Y | Y | Y |
 | Multi-scalar mul | Y | - | Y | Y | Y |
@ -44,11 +44,13 @@

 | Operation | Backend | Tracking note |
 |-----------|---------|---------------|
-| `bulletproof_verify_batch` | OpenCL, Metal | CUDA implemented; OpenCL/Metal blocked by address-space qualifier issues in kernel — tracked as PARITY-EXCEPTION |
+| *(none — all parity gaps resolved)* | — | — |

-> All other ZK and BIP-324 batch operations (`zk_knowledge_verify_batch`,
-> `zk_dleq_verify_batch`, `bip324_aead_encrypt_batch`, `bip324_aead_decrypt_batch`)
-> are now **fully implemented on all three GPU backends** (CUDA, OpenCL, Metal).
+> All ZK and BIP-324 batch operations, including `bulletproof_verify_batch`, are now
+> **fully implemented on all three GPU backends** (CUDA, OpenCL, Metal).
+> OpenCL kernel `#if 0` guard removed; address-space qualifier fix applied to
+> `range_verify_full_impl` (added `__global` to `bp_G`/`bp_H`, local copy in loop).
+> Metal host dispatch wired via `range_proof_poly_batch` kernel.
 > Resolved 2026-03-24.

 ### Current permanent exceptions
--- a/gpu/src/gpu_backend_metal.mm
+++ b/gpu/src/gpu_backend_metal.mm
@ -840,12 +840,82 @@ public:
        return GpuError::Ok;
    }

-    // PARITY-EXCEPTION(Metal): bulletproof kernel blocked upstream; OpenCL also #if 0.
-    // See docs/BACKEND_ASSURANCE_MATRIX.md row "bulletproof_verify_batch".
    GpuError bulletproof_verify_batch(
-        const uint8_t*, const uint8_t*, const uint8_t*,
-        size_t, uint8_t*) override
-    { return set_error(GpuError::Unsupported, "bulletproof: kernel pending address-space fixes"); }
+        const uint8_t* proofs324, const uint8_t* commitments65,
+        const uint8_t* H_generator65, size_t count,
+        uint8_t* out_results) override
+    {
+        if (!is_ready()) return set_error(GpuError::Device, "context not initialised");
+        if (count == 0) { clear_error(); return GpuError::Ok; }
+        if (!proofs324 || !commitments65 || !H_generator65 || !out_results)
+            return set_error(GpuError::NullArg, "NULL buffer");
+
+        auto err = ensure_library();
+        if (err != GpuError::Ok) return err;
+
+        /* Convert big-endian 32 bytes → MetalFieldElem (same format as scalar). */
+        auto be32_to_metal_fe = [](const uint8_t be[32]) -> MetalFieldElem {
+            MetalFieldElem fe;
+            for (int i = 0; i < 8; i++) {
+                int base = (7 - i) * 4;
+                fe.limbs[i] = ((uint32_t)be[base]   << 24) |
+                              ((uint32_t)be[base+1]  << 16) |
+                              ((uint32_t)be[base+2]  << 8)  |
+                              ((uint32_t)be[base+3]);
+            }
+            return fe;
+        };
+
+        /* Parse uncompressed point (65 bytes: 04 || x[32] || y[32]) → MetalAffinePoint */
+        auto parse_pt65 = [&be32_to_metal_fe](const uint8_t pt65[65]) -> MetalAffinePoint {
+            return { be32_to_metal_fe(pt65 + 1), be32_to_metal_fe(pt65 + 33) };
+        };
+
+        /* Build GPU-layout RangeProofPolyGPU structs (320 bytes each):
+         *   4 x MetalAffinePoint (A, S, T1, T2) + 2 x MetalScalar256 (tau_x, t_hat)
+         * Wire format per proof (324 bytes): 4 x 65-byte uncompressed + 2 x 32-byte scalars */
+        struct RangeProofPolyMetal {
+            MetalAffinePoint A, S, T1, T2;
+            MetalScalar256 tau_x, t_hat;
+        };
+        static_assert(sizeof(RangeProofPolyMetal) == 320, "struct layout mismatch");
+
+        auto buf_proofs = runtime_->alloc_buffer_shared(count * sizeof(RangeProofPolyMetal));
+        auto* proofs_out = static_cast<RangeProofPolyMetal*>(buf_proofs.contents());
+        for (size_t i = 0; i < count; ++i) {
+            const uint8_t* p = proofs324 + i * 324;
+            proofs_out[i].A    = parse_pt65(p);
+            proofs_out[i].S    = parse_pt65(p + 65);
+            proofs_out[i].T1   = parse_pt65(p + 130);
+            proofs_out[i].T2   = parse_pt65(p + 195);
+            proofs_out[i].tau_x = be32_to_metal_scalar(p + 260);
+            proofs_out[i].t_hat = be32_to_metal_scalar(p + 292);
+        }
+
+        auto buf_commits = runtime_->alloc_buffer_shared(count * sizeof(MetalAffinePoint));
+        auto* commits_out = static_cast<MetalAffinePoint*>(buf_commits.contents());
+        for (size_t i = 0; i < count; ++i)
+            commits_out[i] = parse_pt65(commitments65 + i * 65);
+
+        auto buf_hgen = runtime_->alloc_buffer_shared(sizeof(MetalAffinePoint));
+        *static_cast<MetalAffinePoint*>(buf_hgen.contents()) = parse_pt65(H_generator65);
+
+        auto buf_res = runtime_->alloc_buffer_shared(count * sizeof(uint32_t));
+        uint32_t n32 = (uint32_t)count;
+        auto buf_n   = runtime_->alloc_buffer_shared(sizeof(uint32_t));
+        std::memcpy(buf_n.contents(), &n32, sizeof(n32));
+
+        auto pipe = runtime_->make_pipeline("range_proof_poly_batch");
+        runtime_->dispatch_sync(pipe, (uint32_t)count, 64u,
+                                {&buf_proofs, &buf_commits, &buf_hgen, &buf_res, &buf_n});
+
+        const auto* res = static_cast<const uint32_t*>(buf_res.contents());
+        for (size_t i = 0; i < count; ++i)
+            out_results[i] = res[i] ? 1 : 0;
+
+        clear_error();
+        return GpuError::Ok;
+    }

    GpuError bip324_aead_encrypt_batch(
        const uint8_t* keys32, const uint8_t* nonces12,
--- a/gpu/src/gpu_backend_opencl.cpp
+++ b/gpu/src/gpu_backend_opencl.cpp
@ -144,6 +144,7 @@ public:
        frost_init_attempted_ = false;
        if (zk_knowledge_verify_) { clReleaseKernel(zk_knowledge_verify_); zk_knowledge_verify_ = nullptr; }
        if (zk_dleq_verify_)      { clReleaseKernel(zk_dleq_verify_);     zk_dleq_verify_      = nullptr; }
+        if (bp_poly_batch_)       { clReleaseKernel(bp_poly_batch_);       bp_poly_batch_       = nullptr; }
        if (zk_program_)          { clReleaseProgram(zk_program_);         zk_program_          = nullptr; }
        zk_init_attempted_ = false;
        if (bip324_aead_encrypt_) { clReleaseKernel(bip324_aead_encrypt_); bip324_aead_encrypt_ = nullptr; }
@ -1063,13 +1064,106 @@ public:
        return GpuError::Ok;
    }

-    // PARITY-EXCEPTION(OpenCL): bulletproof kernel is #if 0 blocked in secp256k1_zk.cl
-    // pending address-space qualifier fixes (__global/__private params).
-    // See docs/BACKEND_ASSURANCE_MATRIX.md row "bulletproof_verify_batch".
    GpuError bulletproof_verify_batch(
-        const uint8_t*, const uint8_t*, const uint8_t*,
-        size_t, uint8_t*) override
-    { return set_error(GpuError::Unsupported, "bulletproof: OpenCL kernel pending address-space fixes"); }
+        const uint8_t* proofs324, const uint8_t* commitments65,
+        const uint8_t* H_generator65, size_t count,
+        uint8_t* out_results) override
+    {
+        if (!is_ready()) return set_error(GpuError::Device, "context not initialised");
+        if (count == 0) { clear_error(); return GpuError::Ok; }
+        if (!proofs324 || !commitments65 || !H_generator65 || !out_results)
+            return set_error(GpuError::NullArg, "NULL buffer");
+
+        auto err = ensure_zk_kernels();
+        if (err != GpuError::Ok) return err;
+
+        auto* cl_ctx = static_cast<cl_context>(ctx_->native_context());
+        auto* queue  = static_cast<cl_command_queue>(ctx_->native_queue());
+        cl_int clerr;
+
+        /* Parse 324-byte proofs into host-side RangeProofPolyGPU struct.
+         * Wire layout per proof: 4 × 65-byte uncompressed points (A, S, T1, T2)
+         *                      + 2 × 32-byte BE scalars (tau_x, t_hat) = 324 bytes.
+         * GPU struct layout: 4 × AffinePoint(64B) + 2 × Scalar(32B) = 320 bytes. */
+        struct RangeProofPolyOCL {
+            secp256k1::opencl::AffinePoint A, S, T1, T2;
+            secp256k1::opencl::Scalar tau_x, t_hat;
+        };
+        std::vector<RangeProofPolyOCL> h_proofs(count);
+        for (size_t i = 0; i < count; ++i) {
+            const uint8_t* p = proofs324 + i * 324;
+            if (!pubkey65_to_affine(p,       &h_proofs[i].A))  return set_error(GpuError::BadKey, "invalid proof A");
+            if (!pubkey65_to_affine(p + 65,  &h_proofs[i].S))  return set_error(GpuError::BadKey, "invalid proof S");
+            if (!pubkey65_to_affine(p + 130, &h_proofs[i].T1)) return set_error(GpuError::BadKey, "invalid proof T1");
+            if (!pubkey65_to_affine(p + 195, &h_proofs[i].T2)) return set_error(GpuError::BadKey, "invalid proof T2");
+            bytes_to_scalar(p + 260, &h_proofs[i].tau_x);
+            bytes_to_scalar(p + 292, &h_proofs[i].t_hat);
+        }
+
+        std::vector<secp256k1::opencl::AffinePoint> h_commits(count);
+        for (size_t i = 0; i < count; ++i) {
+            if (!pubkey65_to_affine(commitments65 + i * 65, &h_commits[i]))
+                return set_error(GpuError::BadKey, "invalid commitment");
+        }
+
+        secp256k1::opencl::AffinePoint h_gen;
+        if (!pubkey65_to_affine(H_generator65, &h_gen))
+            return set_error(GpuError::BadKey, "invalid H generator");
+
+        cl_mem d_proofs = clCreateBuffer(cl_ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
+                                         sizeof(RangeProofPolyOCL) * count, h_proofs.data(), &clerr);
+        if (clerr != CL_SUCCESS) return set_error(GpuError::Memory, "bp proof buffer");
+
+        cl_mem d_commits = clCreateBuffer(cl_ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
+                                          sizeof(secp256k1::opencl::AffinePoint) * count,
+                                          h_commits.data(), &clerr);
+        if (clerr != CL_SUCCESS) {
+            clReleaseMemObject(d_proofs);
+            return set_error(GpuError::Memory, "bp commit buffer");
+        }
+
+        cl_mem d_hgen = clCreateBuffer(cl_ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
+                                       sizeof(secp256k1::opencl::AffinePoint), &h_gen, &clerr);
+        if (clerr != CL_SUCCESS) {
+            clReleaseMemObject(d_commits); clReleaseMemObject(d_proofs);
+            return set_error(GpuError::Memory, "bp h-gen buffer");
+        }
+
+        cl_mem d_res = clCreateBuffer(cl_ctx, CL_MEM_WRITE_ONLY,
+                                      sizeof(int) * count, nullptr, &clerr);
+        if (clerr != CL_SUCCESS) {
+            clReleaseMemObject(d_hgen); clReleaseMemObject(d_commits); clReleaseMemObject(d_proofs);
+            return set_error(GpuError::Memory, "bp result buffer");
+        }
+
+        cl_uint cl_count = static_cast<cl_uint>(count);
+        clSetKernelArg(bp_poly_batch_, 0, sizeof(cl_mem),  &d_proofs);
+        clSetKernelArg(bp_poly_batch_, 1, sizeof(cl_mem),  &d_commits);
+        clSetKernelArg(bp_poly_batch_, 2, sizeof(cl_mem),  &d_hgen);
+        clSetKernelArg(bp_poly_batch_, 3, sizeof(cl_mem),  &d_res);
+        clSetKernelArg(bp_poly_batch_, 4, sizeof(cl_uint), &cl_count);
+
+        size_t global = count;
+        clerr = clEnqueueNDRangeKernel(queue, bp_poly_batch_, 1, nullptr,
+                                        &global, nullptr, 0, nullptr, nullptr);
+        if (clerr != CL_SUCCESS) {
+            clReleaseMemObject(d_res);   clReleaseMemObject(d_hgen);
+            clReleaseMemObject(d_commits); clReleaseMemObject(d_proofs);
+            return set_error(GpuError::Launch, "bp_poly_batch kernel launch failed");
+        }
+        clFinish(queue);
+
+        std::vector<int> h_res(count);
+        clEnqueueReadBuffer(queue, d_res, CL_TRUE, 0,
+                             sizeof(int) * count, h_res.data(), 0, nullptr, nullptr);
+        for (size_t i = 0; i < count; ++i)
+            out_results[i] = h_res[i] ? 1 : 0;
+
+        clReleaseMemObject(d_res);    clReleaseMemObject(d_hgen);
+        clReleaseMemObject(d_commits); clReleaseMemObject(d_proofs);
+        clear_error();
+        return GpuError::Ok;
+    }

    /* -- BIP-324 AEAD batch operations (OpenCL via secp256k1_bip324.cl) ----- */

@ -1279,6 +1373,7 @@ private:
    cl_program zk_program_            = nullptr;
    cl_kernel  zk_knowledge_verify_   = nullptr;
    cl_kernel  zk_dleq_verify_        = nullptr;
+    cl_kernel  bp_poly_batch_         = nullptr;  /* range_proof_poly_batch */
    bool       zk_init_attempted_     = false;

    /* BIP-324 AEAD kernel handles (lazy-loaded via secp256k1_bip324.cl) */
@ -1578,7 +1673,7 @@ private:

    /* -- Lazy-load ZK proof OpenCL program --------------------------------- */
    GpuError ensure_zk_kernels() {
-        if (zk_knowledge_verify_ && zk_dleq_verify_) return GpuError::Ok;
+        if (zk_knowledge_verify_ && zk_dleq_verify_ && bp_poly_batch_) return GpuError::Ok;
        if (zk_init_attempted_)
            return set_error(GpuError::Launch, "ZK kernel init previously failed");
        zk_init_attempted_ = true;
@ -1642,6 +1737,14 @@ private:
            return set_error(GpuError::Launch, "zk_dleq_verify_batch kernel not found");
        }

+        bp_poly_batch_ = clCreateKernel(zk_program_, "range_proof_poly_batch", &err);
+        if (err != CL_SUCCESS) {
+            clReleaseKernel(zk_dleq_verify_);     zk_dleq_verify_     = nullptr;
+            clReleaseKernel(zk_knowledge_verify_); zk_knowledge_verify_ = nullptr;
+            clReleaseProgram(zk_program_); zk_program_ = nullptr;
+            return set_error(GpuError::Launch, "range_proof_poly_batch kernel not found");
+        }
+
        return GpuError::Ok;
    }

--- a/opencl/kernels/secp256k1_zk.cl
+++ b/opencl/kernels/secp256k1_zk.cl
@ -595,9 +595,6 @@ __kernel void zk_dleq_verify_batch(
 // =============================================================================
 // 3. Bulletproof Range Proof (64-bit)
 // =============================================================================
-// Note: Bulletproof section requires OpenCL address-space fixes before enabling.
-// Knowledge and DLEQ proofs above are fully functional.
-#if 0  // Bulletproof: pending address-space fixes for __global/__private params
 // Full Bulletproof range proof verification on OpenCL.
 // Ported from CUDA implementation (commit 02ac59d).
 //
@ -937,8 +934,8 @@ inline int range_verify_full_impl(
    const RangeProofGPU* proof,
    const AffinePoint* commitment,
    const AffinePoint* H_gen,
-    const AffinePoint* bp_G,     // 64 G_i generators
-    const AffinePoint* bp_H,     // 64 H_i generators
+    __global const AffinePoint* bp_G,   // 64 G_i generators (__global: too large for private)
+    __global const AffinePoint* bp_H,   // 64 H_i generators (__global: too large for private)
    const ZKTagMidstate* bp_ip_midstate)
 {
    // ---- Fiat-Shamir: recompute y, z, x ----
@ -1154,8 +1151,10 @@ inline int range_verify_full_impl(
        scalar_mul_mod_n_impl(&proof->a, &s_coeff[i], &a_si);
        scalar_sub_mod_n_impl(&neg_z, &a_si, &g_coeff);

+        /* Copy generator point from __global to __private before scalar_mul_impl */
+        AffinePoint g_pt = bp_G[i];
        JacobianPoint g_term;
-        scalar_mul_impl(&g_term, &g_coeff, &bp_G[i]);
+        scalar_mul_impl(&g_term, &g_coeff, &g_pt);
        point_add_impl(&msm_acc, &msm_acc, &g_term);

        // H_i: (z + z2*2^i*y_inv^i) - b*s_inv[i]*y_inv^i
@ -1169,8 +1168,10 @@ inline int range_verify_full_impl(
        scalar_mul_mod_n_impl(&b_si, &y_inv_powers[i], &b_si_yi);
        scalar_sub_mod_n_impl(&h_pcheck, &b_si_yi, &h_coeff);

+        /* Copy generator point from __global to __private before scalar_mul_impl */
+        AffinePoint h_pt = bp_H[i];
        JacobianPoint h_term;
-        scalar_mul_impl(&h_term, &h_coeff, &bp_H[i]);
+        scalar_mul_impl(&h_term, &h_coeff, &h_pt);
        point_add_impl(&msm_acc, &msm_acc, &h_term);
    }

@ -1478,4 +1479,3 @@ __kernel void pedersen_verify_sum(
        if (z_bytes[i] != 0) z_zero = 0;
    *result = (sum.infinity || z_zero);
 }
-#endif  // Bulletproof: pending address-space fixes