GPU Testing & Benchmark Guide
UltrafastSecp256k1 -- OpenCL / CUDA / Metal
This document describes how to test and benchmark all GPU backends after switching to a Linux or Apple machine.
Scope note: this guide includes backend kernel inventory and internal capability checks that are broader than the stable public GPU C ABI in include/ufsecp/ufsecp_gpu.h. Do not treat every internal kernel listed here as a supported secret-bearing public interface. For security decisions, the stable ABI contract in ufsecp_gpu.h is the source of truth.
Reproducibility note: Linux contributors are not limited to ad-hoc host setups. The repository also ships docker-compose.ci.yml, Dockerfile.local-ci, and docs/LOCAL_CI.md so the surrounding build/test environment can be recreated in Docker. GPU execution still requires host driver/runtime support and device passthrough into that local environment.
1. File Inventory (What Was Created)
CUDA (reference -- already complete)
- cuda/include/hash160.cuh -- SHA-256 + RIPEMD-160 + Hash160
- cuda/include/ecdsa.cuh -- ECDSA sign/verify
- cuda/include/schnorr.cuh -- Schnorr BIP-340
- cuda/include/ecdh.cuh -- ECDH shared secret
- cuda/include/recovery.cuh -- Key recovery
- cuda/include/msm.cuh -- Multi-scalar multiplication
- cuda/src/test_suite.cu -- Full test suite
OpenCL
- opencl/kernels/secp256k1_field.cl -- Field arithmetic (4x64-bit)
- opencl/kernels/secp256k1_point.cl -- EC point operations
- opencl/kernels/secp256k1_batch.cl -- Batch operations
- opencl/kernels/secp256k1_affine.cl -- Affine conversions
- opencl/kernels/secp256k1_extended.cl -- Scalar, SHA-256, HMAC, RFC6979, ECDSA, Schnorr, ECDH, Recovery, MSM (~1370 lines)
- opencl/kernels/secp256k1_hash160.cl -- NEW -- SHA-256 one-shot + RIPEMD-160 + Hash160
- opencl/tests/opencl_extended_test.cpp -- NEW -- Host-side test+bench
- opencl/src/opencl_selftest.cpp -- Existing 40-test suite (field/point)
Metal
- metal/shaders/secp256k1_field.h -- Field arithmetic (8x32-bit)
- metal/shaders/secp256k1_point.h -- EC point operations
- metal/shaders/secp256k1_affine.h -- Affine conversions
- metal/shaders/secp256k1_bloom.h -- Bloom filter (external -- not part of this project)
- metal/shaders/secp256k1_extended.h -- Scalar, SHA-256, HMAC, RFC6979, ECDSA, Schnorr, ECDH, Recovery, MSM (~680 lines)
- metal/shaders/secp256k1_hash160.h -- NEW -- SHA-256 one-shot + RIPEMD-160 + Hash160
- metal/shaders/secp256k1_kernels.metal -- UPDATED -- Now includes extended.h + hash160.h, 18 kernels total
- metal/tests/metal_extended_test.mm -- NEW -- Host-side test+bench
- metal/src/metal_runtime.mm -- Existing Metal runtime
2. Internal Backend Capability Matrix
This matrix tracks backend/kernel coverage, not just the stable public GPU ABI.
Rows such as RFC 6979 or ECDSA sign/verify indicate internal implementation
or test coverage; they do not, by themselves, mean that a stable public
secret-bearing GPU C ABI exists for those operations.
| Feature | CUDA | OpenCL | Metal | Notes |
|---|---|---|---|---|
| Field add/sub/mul | [OK] | [OK] | [OK] | |
| Field inv/sqr | [OK] | [OK] | [OK] | |
| Field sqrt | [OK] | [OK] | [OK] | |
| Point add/double | [OK] | [OK] | [OK] | |
| Scalar mul (4-bit) | [OK] | [OK] | [OK] | |
| Batch inverse | [OK] | [OK] | [OK] | |
| Affine convert | [OK] | [OK] | [OK] | |
| Scalar mod-n ops | [OK] | [OK] | [OK] | |
| GLV endomorphism | [OK] | [OK] | [OK] | |
| SHA-256 streaming | [OK] | [OK] | [OK] | |
| SHA-256 one-shot | [OK] | [OK] | [OK] | For Hash160 |
| HMAC-SHA256 | [OK] | [OK] | [OK] | |
| RFC 6979 | [OK] | [OK] | [OK] | |
| ECDSA sign/verify | [OK] | [OK] | [OK] | |
| Schnorr BIP-340 | [OK] | [OK] | [OK] | |
| ECDH | [OK] | [OK] | [OK] | |
| Key Recovery | [OK] | [OK] | [OK] | |
| MSM / Pippenger | [OK] | [OK] | [OK] | |
| RIPEMD-160 | [OK] | [OK] | [OK] | |
| Hash160 | [OK] | [OK] | [OK] | |
| Bloom filter | [OK] | [FAIL] | [OK]* | *External, not part of project |
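The "Scalar mul (4-bit)" row reads like a 4-bit fixed-window method. As a CPU-side reference for checking GPU output, here is a minimal Python sketch of that technique. The function names, the affine representation, and the most-significant-window-first scan are illustrative assumptions, not the kernels' actual code.

```python
# CPU reference sketch: 4-bit fixed-window scalar multiplication on secp256k1.
# Affine coordinates and Python big ints are used for clarity, not speed.
P = 2**256 - 2**32 - 977  # secp256k1 field prime
G = (0x79be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798,
     0x483ada7726a3c4655da4fbfc0e1108a8fd17b448a68554199c47d08ffb10d4b8)

def point_add(a, b):
    """Add two affine points; None represents the point at infinity."""
    if a is None:
        return b
    if b is None:
        return a
    (x1, y1), (x2, y2) = a, b
    if x1 == x2 and (y1 + y2) % P == 0:
        return None  # a == -b
    if a == b:
        lam = 3 * x1 * x1 * pow(2 * y1 % P, -1, P) % P      # tangent slope
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P     # chord slope
    x3 = (lam * lam - x1 - x2) % P
    return (x3, (lam * (x1 - x3) - y1) % P)

def scalar_mul_w4(k, pt):
    """Compute k*pt for 0 <= k < 2**256 by scanning k in 4-bit windows."""
    table = [None]                      # table[i] = i*pt for i in 0..15
    for _ in range(15):
        table.append(point_add(table[-1], pt))
    acc = None
    for shift in range(252, -1, -4):    # 64 windows cover a 256-bit scalar
        for _ in range(4):
            acc = point_add(acc, acc)   # acc = 16 * acc
        acc = point_add(acc, table[(k >> shift) & 0xF])
    return acc
```

Comparing `scalar_mul_w4(k, G)` against a GPU result for a handful of scalars is a cheap sanity check before running the full suites.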
3. Linux Testing -- CUDA
Prerequisites
# NVIDIA driver + CUDA toolkit
nvidia-smi # Verify GPU
nvcc --version # Verify CUDA
Build
cd libs/UltrafastSecp256k1
cmake -S Secp256K1fast -B Secp256K1fast/build_rel -G Ninja -DCMAKE_BUILD_TYPE=Release
cmake --build Secp256K1fast/build_rel -j
Test
ctest --test-dir Secp256K1fast/build_rel --output-on-failure
Expected Results
- All CUDA tests pass (P0 scalar/field, P1 ECDSA, P2 Schnorr/ECDH/Recovery/MSM)
- Hash160 test vectors:
  - Hash160(compressed key=1) = 751e76e8199196d454941c45d1b3a323f1433bd6
  - Hash160(uncompressed key=1) = 91b24bf9f5288532960ac687abb035127b1d28a5
4. Linux Testing -- OpenCL
Prerequisites
# Install OpenCL ICD + headers
sudo apt install ocl-icd-opencl-dev opencl-headers
# For NVIDIA GPU:
sudo apt install nvidia-opencl-dev
# Or for Intel:
sudo apt install intel-opencl-icd
# Verify:
clinfo | head -20
Build Test
cd libs/UltrafastSecp256k1/opencl
# Compile the test (standalone)
g++ -std=c++17 -O2 \
-I kernels/ \
tests/opencl_extended_test.cpp \
-lOpenCL \
-o opencl_extended_test
Run Tests
# Copy kernels next to the binary
cp kernels/*.cl .
# Run tests
./opencl_extended_test --verbose
# Run benchmarks
./opencl_extended_test --bench --count 131072
Build Existing Self-Test (field/point)
g++ -std=c++17 -O2 \
-I . \
src/opencl_selftest.cpp src/opencl_context.cpp \
src/opencl_field.cpp src/opencl_point.cpp src/opencl_batch.cpp \
-lOpenCL \
-o opencl_selftest
./opencl_selftest
Expected Test Results
Hash160(compressed key=1): 751e76e8199196d454941c45d1b3a323f1433bd6
Hash160(uncompressed key=1): 91b24bf9f5288532960ac687abb035127b1d28a5
All 40 existing field/point tests: PASS
Troubleshooting
- If kernel build fails: check -cl-std=CL2.0 support, try removing it
- If ulong not available: device doesn't support 64-bit int -- unusual for GPUs
- Include path issues: ensure -I kernels/ or place all .cl files in CWD
5. Apple Metal Testing
Prerequisites
- macOS 12+ with Apple Silicon (M1/M2/M3) or Intel Mac with Metal support
- Xcode Command Line Tools: xcode-select --install
Build Metal Library
cd libs/UltrafastSecp256k1/metal
# Compile shader to .air
xcrun -sdk macosx metal -c shaders/secp256k1_kernels.metal \
-o secp256k1.air \
-I shaders/
# Link to .metallib
xcrun -sdk macosx metallib secp256k1.air -o secp256k1.metallib
Build Test
# Compile test + runtime
clang++ -std=c++17 -O2 -fobjc-arc \
-framework Metal -framework Foundation \
tests/metal_extended_test.mm \
src/metal_runtime.mm \
-I src/ -I shaders/ \
-o metal_extended_test
Run Tests
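# Make sure the metallib or .metal source is accessible. After the build step above,
# secp256k1.metallib is already in the current directory; if it is missing,
# the test will compile from source
ls secp256k1.metallib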
# Run tests
./metal_extended_test --verbose
# Run benchmarks (default: 65536 items)
./metal_extended_test --bench --count 131072
Expected Results
Hash160(compressed key=1): 751e76e8199196d454941c45d1b3a323f1433bd6
Hash160(uncompressed key=1): 91b24bf9f5288532960ac687abb035127b1d28a5
field_mul(2, 3) = 6: PASS
1*G = G: PASS
Metal Kernel List (18 kernels in secp256k1_kernels.metal)
- search_kernel -- Batch ECC search
- scalar_mul_batch -- Batch P*k
- generator_mul_batch -- Batch G*k
- field_mul_bench -- Benchmark
- field_sqr_bench -- Benchmark
- field_add_bench -- Benchmark
- field_sub_bench -- Benchmark
- field_inv_bench -- Benchmark
- batch_inverse -- Chunked Montgomery
- point_add_kernel -- Testing
- point_double_kernel -- Testing
- ecdsa_sign_batch -- Batch ECDSA sign
- ecdsa_verify_batch -- Batch ECDSA verify
- schnorr_sign_batch -- Batch Schnorr sign
- schnorr_verify_batch -- Batch Schnorr verify
- ecdh_batch -- Batch ECDH
- hash160_batch -- Batch Hash160
- ecrecover_batch -- Batch key recovery
- sha256_bench -- SHA-256 benchmark
- hash160_bench -- Hash160 benchmark
- ecdsa_bench -- ECDSA sign+verify benchmark
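The batch_inverse kernel is described as "Chunked Montgomery". For reference, Montgomery's trick replaces n field inversions with one inversion plus roughly 3n multiplications. The Python sketch below shows the unchunked idea only; it assumes all inputs are nonzero, and how the kernel actually splits work into chunks is not covered here.

```python
P = 2**256 - 2**32 - 977  # secp256k1 field prime

def batch_inverse(values):
    """Invert every element of `values` mod P using a single modular inversion.

    Assumes all inputs are nonzero mod P.
    """
    prefix = []          # prefix[i] = values[0] * ... * values[i]
    acc = 1
    for v in values:
        acc = acc * v % P
        prefix.append(acc)
    inv = pow(acc, -1, P)                  # the only real inversion
    out = [0] * len(values)
    for i in range(len(values) - 1, 0, -1):
        out[i] = inv * prefix[i - 1] % P   # isolates 1/values[i]
        inv = inv * values[i] % P          # now inv = 1/prefix[i-1]
    if values:
        out[0] = inv
    return out
```

A result can be validated without any further inversions by checking that `values[i] * out[i] % P == 1` for every index.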
Troubleshooting (Metal)
- "Function not found" -- Add #include "secp256k1_extended.h" to kernels.metal (already done)
- Compile error on 64-bit int -- Metal uses 8x32-bit limbs, no ulong needed
- MTLGPUFamilyApple9 error -- Update Xcode or use @available(macOS 14.0, *)
6. Benchmark Comparison Template
Run on each platform and fill in:
| Operation | CUDA (RTX) | OpenCL (GPU) | Metal (M-series) |
|---|---|---|---|
| Field mul | | | |
| Field inv | | | |
| Field sqr | | | |
| Generator mul (k*G) | | | |
| Scalar mul (P*k) | | | |
| Batch inverse | | | |
| SHA-256 | | | |
| Hash160 | | | |
| ECDSA sign | | | |
| ECDSA verify | | | |
| Schnorr sign | | | |
| Schnorr verify | | | |
| ECDH | | | |
| Key recovery | | | |
Units: ops/sec (batch size = 131072)
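To keep the table consistent across platforms, throughput can be derived the same way everywhere: item count divided by wall-clock seconds. The helpers below are a small hypothetical sketch for turning a measured time into a table row; the names are illustrative and the test binaries may already print ops/sec themselves.

```python
def ops_per_sec(count, elapsed_seconds):
    """Throughput for `count` operations completed in `elapsed_seconds`."""
    return count / elapsed_seconds

def table_row(operation, cuda=None, opencl=None, metal=None):
    """Format one row of the comparison table; missing results stay blank."""
    cells = ["" if v is None else f"{v:,.0f}" for v in (cuda, opencl, metal)]
    return "| " + " | ".join([operation] + cells) + " |"
```

For example, `table_row("Field mul", cuda=ops_per_sec(131072, elapsed))` fills only the CUDA column, where `elapsed` is the time you measured for the batch.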
7. Test Vectors (Cross-Platform Verification)
Hash160
Input: 0279be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798
Output: 751e76e8199196d454941c45d1b3a323f1433bd6
Input: 0479be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798
483ada7726a3c4655da4fbfc0e1108a8fd17b448a68554199c47d08ffb10d4b8
Output: 91b24bf9f5288532960ac687abb035127b1d28a5
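Both Hash160 inputs are standard SEC1 encodings of the generator point: prefix 02 (even Y) or 03 (odd Y) followed by X for the compressed form, and prefix 04 followed by X and Y for the uncompressed form. A minimal Python sketch that rebuilds them from the coordinates (the function name is illustrative):

```python
GX = 0x79be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798
GY = 0x483ada7726a3c4655da4fbfc0e1108a8fd17b448a68554199c47d08ffb10d4b8

def serialize_pubkey(x, y, compressed=True):
    """SEC1-encode an affine point as 33 (compressed) or 65 (uncompressed) bytes."""
    xb = x.to_bytes(32, "big")
    if compressed:
        prefix = b"\x03" if y & 1 else b"\x02"   # parity of Y selects the prefix
        return prefix + xb
    return b"\x04" + xb + y.to_bytes(32, "big")
```

This is useful when a backend reports a mismatch: it separates serialization bugs from hashing bugs.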
Generator Point (1*G)
X: 79be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798
Y: 483ada7726a3c4655da4fbfc0e1108a8fd17b448a68554199c47d08ffb10d4b8
2*G
X: c6047f9441ed7d6d3045406e95c07cd85c778e4b8cef3ca7abac09b95c709ee5
Y: 1ae168fea63dc339a3c58419466ceaeef7f632653266d0e1236431a950cfe52a
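The G and 2*G vectors can be cross-checked on the CPU with the curve equation y^2 = x^3 + 7 and the affine doubling formula. This Python sketch is an independent check, not the project's test code; the helper names are illustrative.

```python
P = 2**256 - 2**32 - 977  # secp256k1 field prime
G = (0x79be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798,
     0x483ada7726a3c4655da4fbfc0e1108a8fd17b448a68554199c47d08ffb10d4b8)
TWO_G = (0xc6047f9441ed7d6d3045406e95c07cd85c778e4b8cef3ca7abac09b95c709ee5,
         0x1ae168fea63dc339a3c58419466ceaeef7f632653266d0e1236431a950cfe52a)

def on_curve(pt):
    """Check y^2 == x^3 + 7 (mod P)."""
    x, y = pt
    return (y * y - (x * x * x + 7)) % P == 0

def point_double(pt):
    """Affine doubling: slope of the tangent is 3x^2 / 2y."""
    x, y = pt
    lam = 3 * x * x * pow(2 * y % P, -1, P) % P
    x3 = (lam * lam - 2 * x) % P
    y3 = (lam * (x - x3) - y) % P
    return (x3, y3)
```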
SHA-256("abc")
ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
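The SHA-256 vector can be reproduced with Python's standard hashlib, which gives a backend-independent value to compare GPU digests against. A minimal sketch (the wrapper name is illustrative):

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of `data`, computed on the CPU."""
    return hashlib.sha256(data).hexdigest()
```

The same helper can hash the 33- and 65-byte public keys listed under Hash160 to check the first stage of that pipeline separately.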
8. Quick Command Reference
Linux (CUDA + OpenCL)
# CUDA tests
ctest --test-dir Secp256K1fast/build_rel --output-on-failure
# OpenCL tests
cd libs/UltrafastSecp256k1/opencl
g++ -std=c++17 -O2 -I kernels/ tests/opencl_extended_test.cpp -lOpenCL -o opencl_extended_test
cp kernels/*.cl .
./opencl_extended_test --bench --count 131072
Apple (Metal)
cd libs/UltrafastSecp256k1/metal
xcrun -sdk macosx metal -c shaders/secp256k1_kernels.metal -o secp256k1.air -I shaders/
xcrun -sdk macosx metallib secp256k1.air -o secp256k1.metallib
clang++ -std=c++17 -O2 -fobjc-arc -framework Metal -framework Foundation \
tests/metal_extended_test.mm src/metal_runtime.mm -I src/ -I shaders/ -o metal_extended_test
./metal_extended_test --bench --count 131072
9. Architecture Notes
Limb Sizes
- CUDA: 4x uint64_t (native 64-bit, PTX mul.hi.u64)
- OpenCL: 4x ulong (64-bit, mul_hi())
- Metal: 8x uint32_t (the Metal backend does not use 64-bit integers on Apple GPUs)
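The two layouts hold the same 256-bit value, only sliced differently. The Python sketch below splits a value into 4x64-bit and 8x32-bit limbs and shows what a high-half multiply (mul.hi.u64, mul_hi()) returns. Little-endian limb order is an assumption made for illustration; check the kernel headers for the actual order.

```python
MASK64 = (1 << 64) - 1

def to_limbs(value, limb_bits, count):
    """Split an integer into `count` limbs of `limb_bits` bits, least significant first."""
    mask = (1 << limb_bits) - 1
    return [(value >> (limb_bits * i)) & mask for i in range(count)]

def from_limbs(limbs, limb_bits):
    """Reassemble an integer from least-significant-first limbs."""
    return sum(limb << (limb_bits * i) for i, limb in enumerate(limbs))

def mul_wide64(a, b):
    """Full 64x64 -> 128-bit product as (lo, hi); hi is the mul_hi result."""
    prod = a * b
    return prod & MASK64, prod >> 64
```

Under this ordering, each 64-bit limb is the pair of 32-bit limbs at indices 2i and 2i+1, which makes it straightforward to convert test vectors between backends.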
Key Differences
- Metal avoids 64-bit integers on the GPU -> 8x32-bit limbs with carry chains
- Metal uses constant instead of __constant
- Metal uses thread qualifier for private pointers
- Metal uses [[buffer(N)]] for buffer bindings
- OpenCL uses _impl suffix convention for inline functions
- CUDA has __device__ __forceinline__ qualifiers
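A carry chain over 32-bit limbs propagates the overflow of each limb addition into the next limb. The Python sketch below models that for a plain 256-bit add; it assumes little-endian limb order and does no modular reduction, so it illustrates the pattern rather than the shader's field addition.

```python
MASK32 = 0xFFFFFFFF

def add_8x32(a, b):
    """Add two 256-bit values stored as 8 little-endian 32-bit limbs.

    Returns (limbs, carry_out); no modular reduction is performed.
    """
    out = []
    carry = 0
    for x, y in zip(a, b):
        s = x + y + carry          # at most 33 bits
        out.append(s & MASK32)     # low 32 bits stay in this limb
        carry = s >> 32            # overflow moves to the next limb
    return out, carry
```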
Hash160 Pipeline
pubkey (33 or 65 bytes)
-> SHA-256 (one-shot, big-endian output, 32 bytes)
-> RIPEMD-160 (two parallel chains, little-endian output, 20 bytes)
= Hash160 (20 bytes)
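The pipeline above can be reproduced on the CPU with hashlib as a reference for the GPU output. This is a sketch under one caveat: RIPEMD-160 in hashlib depends on the local OpenSSL build and may be unavailable, so the code probes for it first. The helper names are illustrative; `md_blocks` counts the 64-byte blocks each stage processes after padding.

```python
import hashlib

def ripemd160_available():
    """True if this Python/OpenSSL build exposes RIPEMD-160 through hashlib."""
    try:
        hashlib.new("ripemd160")
        return True
    except ValueError:
        return False

def hash160(pubkey: bytes) -> bytes:
    """RIPEMD-160(SHA-256(pubkey)); raises ValueError if RIPEMD-160 is unavailable."""
    sha = hashlib.sha256(pubkey).digest()            # 32 bytes
    return hashlib.new("ripemd160", sha).digest()    # 20 bytes

def md_blocks(message_len: int) -> int:
    """64-byte blocks after padding (one 0x80 byte plus an 8-byte length field)."""
    return (message_len + 9 + 63) // 64
```

By this count a 33-byte compressed key fits in one SHA-256 block, a 65-byte uncompressed key needs two, and the 32-byte digest fed to RIPEMD-160 always fits in one.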
Reminder: Bloom filters are NOT part of this project -- they should be external.