UltrafastSecp256k1/docs/GPU_VALIDATION_MATRIX.md

# GPU Validation Matrix

Unified view of GPU backend validation coverage in UltrafastSecp256k1.

The machine-readable companion for backend status, publishability, and artifact requirements is `docs/GPU_BACKEND_EVIDENCE.json`. When this document and the JSON diverge, treat the JSON as the enforcement source.

This document answers four practical questions for each backend:

1. Do we have correctness tests?
2. Do we have a unified self-audit runner?
3. Do we have benchmark coverage?
4. Do we have host-side integration tests?

It is intended as an engineering checklist, not a marketing page.

---

## C ABI Ops -- Per-Backend Status

The C ABI layer (`ufsecp_gpu.h`) currently exposes 13 backend-neutral GPU batch
operations. CUDA, OpenCL, and Metal all implement that stable public surface.
`UFSECP_ERR_GPU_UNSUPPORTED` (104) remains part of the ABI for unsupported
backend selection, missing device/runtime capability, or invalid execution
context, not as a standing parity gap for the stable 13-op surface.

| Operation | CUDA | OpenCL | Metal | Data Class |
|-----------|------|--------|-------|------------|
| `generator_mul_batch` | implemented | implemented | implemented | PUBLIC |
| `ecdsa_verify_batch` | implemented | implemented | implemented | PUBLIC |
| `schnorr_verify_batch` | implemented | implemented | implemented | PUBLIC |
| `ecdh_batch` | implemented | implemented | implemented | SECRET |
| `hash160_pubkey_batch` | implemented | implemented | implemented | PUBLIC |
| `msm` | implemented | implemented | implemented | PUBLIC |
| `frost_verify_partial_batch` | implemented | implemented | implemented | PUBLIC |
| `ecrecover_batch` | implemented | implemented | implemented | PUBLIC |
| `zk_knowledge_verify_batch` | implemented | implemented | implemented | PUBLIC |
| `zk_dleq_verify_batch` | implemented | implemented | implemented | PUBLIC |
| `bulletproof_verify_batch` | implemented | implemented | implemented | PUBLIC |
| `bip324_aead_encrypt_batch` | implemented | implemented | implemented | PUBLIC |
| `bip324_aead_decrypt_batch` | implemented | implemented | implemented | SECRET |
| **Total (unified GPU C ABI)** | **13/13** | **13/13** | **13/13** | |

### Expansion Roadmap

- Unified 13/13 GPU C ABI parity is closed across CUDA, OpenCL, and Metal.
- The five ZK/BIP-324 batch ops are implemented on all three backends and exposed through the stable C ABI.
- Remaining GPU governance work is no longer backend parity; it is hardware-backed publishability, artifact retention, and cross-device reproducibility.

### C ABI Test Coverage

| Test | Scope | Guard |
|------|-------|-------|
| `gpu_abi_gate` | ABI surface, error codes, discovery, lifecycle, NULL handling | GPU host + ufsecp |
| `gpu_ops_equivalence` | GPU vs CPU reference for the stable public op surface; unsupported paths remain negative-test coverage | GPU host + ufsecp |
| `gpu_host_api_negative` | NULL ptrs, count=0, invalid backend/device, error strings | GPU host + ufsecp |
| `gpu_backend_matrix` | Backend enumeration, device info, per-backend op probing | GPU host + ufsecp |

### CI and Local Verification

| Environment | CUDA | OpenCL | Metal | Tests |
|-------------|------|--------|-------|-------|
| **Local (dev machine)** | [OK] RTX 5060 Ti | [OK] RTX 5060 Ti | N/A (Linux) | All 49 tests pass including gpu_abi_gate, gpu_ops_equivalence, gpu_host_api_negative, gpu_backend_matrix |
| **Local Docker parity** | [OK]* host GPU passthrough | [OK]* host OpenCL/runtime passthrough | N/A (Linux) | Same Linux CI toolchain via `docker-compose.ci.yml` / `Dockerfile.local-ci`; GPU slices remain host-hardware dependent |
| **GitHub Actions CI** | N/A (no GPU runners) | N/A (no GPU runners) | [OK] macOS (lifecycle) | Metal discovery + lifecycle via macOS job |

> **Note**: GitHub Actions standard runners do not have NVIDIA GPUs or OpenCL devices, but that is not the only reproducible Linux path. The repository ships a containerized local CI stack in `Dockerfile.local-ci`, `docker-compose.ci.yml`, and `docs/LOCAL_CI.md`, so contributors can reproduce the same Linux toolchain in Docker on their own machines. GPU validation still requires the host GPU driver/runtime stack and appropriate device passthrough into the container or local process.

---

## Summary

| Backend | Correctness Tests | Unified Audit | Unified Bench | Host / Integration | Notes |
|--------|-------------------|---------------|---------------|--------------------|-------|
| CUDA | [OK] | [OK] | [OK] | [OK] | Strongest GPU validation path today |
| ROCm/HIP | [!] Planned / Source-Shared | [!] Planned / Source-Shared | [!] Planned / Source-Shared | [!] Planned / Source-Shared | Optional future backend expansion; current absence of AMD hardware does not invalidate existing audited backends |
| OpenCL | [OK] | [OK] | [OK] | [OK] | Good coverage, entry points are more distributed |
| Metal | [OK] | [OK] | [OK] | [OK] | Good coverage on Apple platforms |

ROCm/HIP reuses the CUDA/HIP source tree and runners, but real AMD GPU validation is still pending.

That pending state is intentionally non-blocking for current audit validity.
CPU, CUDA, OpenCL, and Metal assurance claims remain grounded in their own
existing evidence paths; ROCm/HIP only becomes relevant once the project
chooses to publish real AMD-backed claims.

This status is intentionally fail-closed in `docs/GPU_BACKEND_EVIDENCE.json`: ROCm/HIP remains non-publishable until AMD hardware-backed benchmark and audit artifacts exist.

---

## CUDA / ROCm

### Main Entry Points

- Benchmark: [gpu_bench_unified.cu](/home/shrek/Secp256K1/Secp256K1fast/libs/UltrafastSecp256k1/cuda/src/gpu_bench_unified.cu)
- Audit runner: [gpu_audit_runner.cu](/home/shrek/Secp256K1/Secp256K1fast/libs/UltrafastSecp256K1/cuda/src/gpu_audit_runner.cu)
- Full test suite: [test_suite.cu](/home/shrek/Secp256K1/Secp256K1fast/libs/UltrafastSecp256K1/cuda/src/test_suite.cu)
- CT smoke: [test_ct_smoke.cu](/home/shrek/Secp256K1/Secp256K1fast/libs/UltrafastSecp256K1/cuda/src/test_ct_smoke.cu)
- Specialized benches:
  - [bench_bip352.cu](/home/shrek/Secp256K1/Secp256K1fast/libs/UltrafastSecp256K1/cuda/src/bench_bip352.cu)
  - [bench_zk.cu](/home/shrek/Secp256K1/Secp256K1fast/libs/UltrafastSecp256K1/cuda/src/bench_zk.cu)
  - [bench_cuda.cu](/home/shrek/Secp256K1/Secp256K1fast/libs/UltrafastSecp256K1/cuda/src/bench_cuda.cu)

### Coverage

| Area | Status | Notes |
|------|--------|-------|
| Field arithmetic | [OK] | Included in selftest + audit runner + unified bench |
| Scalar arithmetic | [OK] | Included in unified bench and audit runner |
| Point operations | [OK] | Add/double/kG/kP covered |
| ECDSA | [OK] | Sign/verify in bench + audit |
| Schnorr | [OK] | Sign/verify in bench + audit |
| ECDH | [OK] | Present in audit runner |
| Recovery | [OK] | Present in audit runner |
| Batch verify | [OK] | Included in audit runner |
| BIP32 | [OK] | Present in audit runner |
| CT GPU path | [OK] | Bench + CT smoke present |
| Real workload benches | [OK] | BIP-352 and ZK present |

### Current Strength

CUDA is the most unified backend today. If someone asks, "Which GPU backend has the cleanest validation story?" the answer is CUDA.

For Linux reproducibility, that story is not limited to the original developer workstation: the same local CI environment can be recreated through the repo's Docker-based local parity infrastructure, then paired with host GPU access for CUDA/OpenCL runs.

### Remaining Engineering Gaps

- ROCm/HIP should not be treated as validated until tested on real AMD hardware.
- Keep cross-device reproducibility artifacts organized by GPU model and driver version.
- Keep backend-specific regression logs together with benchmark JSON/TXT artifacts.

---

## OpenCL

### Main Entry Points

- Audit runner: [opencl_audit_runner.cpp](/home/shrek/Secp256K1/Secp256K1fast/libs/UltrafastSecp256K1/opencl/src/opencl_audit_runner.cpp)
- Selftest: [opencl_selftest.cpp](/home/shrek/Secp256K1/Secp256K1fast/libs/UltrafastSecp256K1/opencl/src/opencl_selftest.cpp)
- Extended test + bench: [opencl_extended_test.cpp](/home/shrek/Secp256K1/Secp256K1fast/libs/UltrafastSecp256K1/opencl/tests/opencl_extended_test.cpp)
- Basic test harness: [test_opencl.cpp](/home/shrek/Secp256K1/Secp256K1fast/libs/UltrafastSecp256K1/opencl/tests/test_opencl.cpp)
- Benchmark app: [bench_opencl.cpp](/home/shrek/Secp256K1/Secp256K1fast/libs/UltrafastSecp256K1/opencl/benchmarks/bench_opencl.cpp)

### Coverage

| Area | Status | Notes |
|------|--------|-------|
| Field arithmetic | [OK] | Covered by selftest + extended test |
| Point operations | [OK] | Covered by selftest + extended test |
| Scalar / hash / ECDSA / Schnorr / ECDH / recovery / MSM | [OK] | Covered via extended kernel set + host test |
| Audit report generation | [OK] | `opencl_audit_runner` exists |
| Benchmark coverage | [OK] | `opencl_benchmark` + extended test bench mode |
| Host integration | [OK] | Dedicated host-side extended test |

### Current Strength

OpenCL has broad native validation coverage already and now matches CUDA and Metal on the stable public GPU C ABI surface.

The repo's local Docker parity environment also helps standardize the Linux-side toolchain for OpenCL validation; the remaining variable is the host OpenCL ICD/device stack, not the surrounding build/test container.

### Remaining Engineering Gaps

- Entry points are more fragmented than CUDA.
- A single "OpenCL unified benchmark" story should stay easy to discover in docs.
- Cross-vendor reports should be organized clearly: NVIDIA OpenCL, AMD OpenCL, Intel OpenCL.

---

## Metal

### Main Entry Points

- Audit runner: [metal_audit_runner.mm](/home/shrek/Secp256K1/Secp256K1fast/libs/UltrafastSecp256K1/metal/src/metal_audit_runner.mm)
- Extended test + bench: [metal_extended_test.mm](/home/shrek/Secp256K1/Secp256K1fast/libs/UltrafastSecp256K1/metal/tests/metal_extended_test.mm)
- Host test: [test_metal_host.cpp](/home/shrek/Secp256K1/Secp256K1fast/libs/UltrafastSecp256K1/metal/tests/test_metal_host.cpp)
- App bench/test: [metal_test.mm](/home/shrek/Secp256K1/Secp256K1fast/libs/UltrafastSecp256K1/metal/app/metal_test.mm)
- Metal bench app: [bench_metal.mm](/home/shrek/Secp256K1/Secp256K1fast/libs/UltrafastSecp256K1/metal/app/bench_metal.mm)

### Coverage

| Area | Status | Notes |
|------|--------|-------|
| Field arithmetic | [OK] | Covered in tests and app bench |
| Point operations | [OK] | Covered in tests and app bench |
| Extended crypto ops | [OK] | Covered by extended test |
| Audit report generation | [OK] | `metal_audit_runner` exists |
| Benchmark coverage | [OK] | Bench mode and app bench exist |
| Host integration | [OK] | Dedicated host test present |

### Current Strength

Metal has a reasonably complete validation stack and is already beyond "demo backend" level.

### Remaining Engineering Gaps

- Keep Apple GPU model coverage explicit in benchmark docs.
- Keep shader/library build steps easy to reproduce from CI and local machines.

---

## Recommended Backend Checklist

Use this checklist before calling a GPU backend "fully validated" for a release candidate:

- [ ] Backend selftest passes
- [ ] Backend audit runner passes
- [ ] Unified benchmark runs and emits report
- [ ] Host-side integration test passes
- [ ] One real-device benchmark artifact is saved
- [ ] One real-device audit artifact is saved
- [ ] Driver/toolkit version is recorded
- [ ] JSON + TXT reports are archived

## ROCm/HIP Promotion Checklist

ROCm/HIP is source-shared with the CUDA/HIP portability layer, but it is not
promoted by source compatibility alone. Promotion from `planned` to
`validated` and `publishable: true` requires all of the following on real AMD
hardware:

1. `docs/GPU_BACKEND_EVIDENCE.json` is updated so `rocm-hip` becomes `validated`, `hardware_backed: true`, and `publishable: true` in the same change.
2. Benchmark artifacts exist in both JSON and text form for the recorded AMD device.
3. Audit runner output exists for the same hardware and is reproducible from the recorded command path.
4. Driver/runtime metadata is archived, including ROCm driver version and AMD device model.
5. Any published numbers identify the exact device class and do not reuse CUDA labels or NVIDIA-only evidence.
6. `python3 scripts/preflight.py --gpu-evidence` and `python3 scripts/validate_assurance.py` pass after the evidence update.

Until those conditions are met, ROCm/HIP remains deliberately fail-closed for
public benchmark and validation claims. This is a containment rule for future
AMD-specific claims, not a defect in the validity of the current audit and
validation surfaces.

This checklist is enforceable through:

```bash
python3 scripts/preflight.py --gpu-evidence
python3 scripts/validate_assurance.py
```

---

## Practical Reading

If the goal is day-to-day engineering confidence:

- Start with CUDA as the reference GPU backend.
- Treat OpenCL and Metal as validated but separately operationalized backends.
- Treat ROCm/HIP as source-compatible with CUDA, but not promotion-eligible until the AMD hardware-backed checklist above is complete.