UltrafastSecp256k1/cuda
Craig Raw c7c2aad9f8 fix: remove stray brace in secp256k1.cuh (#192), guard opencl benchmarks (#193)
remove extra closing brace after muladd64 in secp256k1.cuh that
prematurely closed namespace cuda, breaking the 64-bit limb build.

guard opencl benchmark targets with SECP256K1_BUILD_BENCH to prevent
build failures when benchmarks are disabled and CL/cl.h is not on
the include path (e.g. windows ci).
2026-03-26 11:30:25 +02:00
..
include fix: remove stray brace in secp256k1.cuh (#192), guard opencl benchmarks (#193) 2026-03-26 11:30:25 +02:00
src cuda: fix deprecated prop.clockRate/memoryClockRate for CUDA 13 2026-03-24 21:24:25 +00:00
CMakeLists.txt bench: add CUDA BIP-324 transport benchmark — batch parallel ChaCha20-Poly1305 2026-03-22 04:00:14 +00:00
README.md fix: MSVC C2026 string limit (#173), OpenCL batch-inv kernels, source graph tooling 2026-03-19 16:43:55 +00:00

Secp256k1 CUDA -- GPU ECC Library

English summary: Full secp256k1 ECC library for NVIDIA GPUs -- header-only core with PTX inline assembly. Supports CUDA and ROCm/HIP (via gpu_compat.h abstraction layer). Priority: maximum throughput for batch operations. Not side-channel resistant (research/development use). See docs/API_REFERENCE.md for the full API and docs/BUILDING.md for build instructions.

Full secp256k1 ECC library for NVIDIA GPUs -- header-only core with PTX inline assembly.

Priority: Maximum throughput for batch operations. Not side-channel resistant (research/dev use).


Architecture

All code resides in the secp256k1::cuda namespace. The core is header-only -- secp256k1.cuh contains all device functions. Data types are interoperable with the CPU library (secp256k1/types.hpp POD structs).

Compile-Time Configuration (3 backends)

Macro Default Description
SECP256K1_CUDA_USE_HYBRID_MUL ON 32-bit Comba mul + 64-bit reduction (1.10x faster)
SECP256K1_CUDA_USE_MONTGOMERY OFF Montgomery residue domain (mont_reduce_512)
SECP256K1_CUDA_LIMBS_32 OFF Full 8x32-bit limbs (separate backend)

Default path (64-bit hybrid): field_mul -> field_mul_hybrid -> 32-bit Comba PTX -> reduce_512_to_256


Functionality

Field Arithmetic (Fp)

  • add/sub: PTX inline asm with carry chains (ADDC.CC/SUBC.CC)
  • mul: 32-bit Comba hybrid -> 64-bit secp256k1 fast reduction (P = 2^2⁵⁶ - 2^3^2 - 977)
  • sqr: Optimized squaring (cross-product doubling)
  • inverse: Fermat chain a^{p-2} (255 sqr + 16 mul)
  • mul_small: Multiplication by uint32 (for reduction constants)
  • Montgomery: field_to_mont, field_from_mont, mont_reduce_512 (optional backend)

Scalar Arithmetic (Fn)

  • add/sub: Modular arithmetic mod curve order N
  • bit extraction: Fast bit access for scalar processing

Point Operations (Jacobian coordinates)

  • doubling: dbl-2001-b (3M+4S, a=0 curves)
  • mixed addition: 6 variants optimized for different scenarios:
    • jacobian_add_mixed -- madd-2007-bl (7M+4S) general
    • jacobian_add_mixed_h -- madd-2004-hmv (8M+3S), H output for batch inversion
    • jacobian_add_mixed_h_z1 -- Z=1 specialized (5M+2S), first step
    • jacobian_add_mixed_const -- branchless (8M+3S), constant-point
    • jacobian_add_mixed_const_7m4s -- branchless 7M+4S + 2H output
  • general add: jacobian_add (11M+5S, Jacobian + Jacobian)
  • GLV endomorphism: apply_endomorphism phi(x,y) = (beta*x, y)

Scalar Multiplication

  • double-and-add: Simple, register-efficient (wNAF is expensive on GPU due to register pressure)
  • Batch kernels: scalar_mul_batch_kernel, generator_mul_batch_kernel

Batch Inversion

  • Montgomery trick: prefix/suffix scan (default, one inversion for N elements)
  • Fermat: a^{p-2} for each element (fallback)
  • naive: Direct GCD (debug/reference)

Hash160 (SHA-256 + RIPEMD-160)

  • hash160_pubkey_kernel -- pubkey -> Hash160 device-side

Bloom Filter

  • DeviceBloom -- FNV-1a + SplitMix hashing
  • test / add device functions + batch kernels

File Structure

cuda/
+-- CMakeLists.txt                              # Build: lib + test + bench
+-- README.md
+-- include/
|   +-- secp256k1.cuh                           # Core -- field/point/scalar device functions (1800+ lines)
|   +-- ptx_math.cuh                            # PTX inline asm (256x256->512 Comba multiply)
|   +-- secp256k1_32.cuh                        # Alternative: 8x32-bit limbs + Montgomery backend
|   +-- secp256k1_32_hybrid_final.cuh           # 32-bit Comba mul -> 64-bit reduction (default mul path)
|   +-- batch_inversion.cuh                     # Montgomery trick / Fermat / naive batch inverse
|   +-- bloom.cuh                               # Device-side Bloom filter (FNV-1a + SplitMix)
|   +-- hash160.cuh                             # SHA-256 + RIPEMD-160 -> Hash160
|   +-- host_helpers.cuh                        # Host-side wrappers (1-thread kernels, test-only)
|   +-- gpu_compat.h                            # CUDA <-> HIP (ROCm) compatibility layer
+-- src/
|   +-- secp256k1.cu                            # Kernel definitions (thin wrappers)
|   +-- test_suite.cu                           # 30 vector tests
|   +-- bench_cuda.cu                           # Benchmark harness

Build

# Via parent CMakeLists.txt (or standalone)
cmake -S cuda -B cuda/build -DCMAKE_CUDA_ARCHITECTURES=89
cmake --build cuda/build -j

# Tests
./cuda/build/secp256k1_cuda_test

# Benchmark
./cuda/build/secp256k1_cuda_bench

Build Options

Option Default Description
CMAKE_CUDA_ARCHITECTURES 89 (Ada) NVIDIA GPU architecture (75/80/86/89/90)
SECP256K1_CUDA_USE_MONTGOMERY OFF Montgomery domain
SECP256K1_CUDA_LIMBS_32 OFF 8x32-bit limb backend
SECP256K1_BUILD_ROCM OFF AMD ROCm/HIP build (portable math)
CMAKE_HIP_ARCHITECTURES -- AMD GPU architectures (gfx906/gfx1030/gfx1100/...)

Requirements

  • NVIDIA: CUDA Toolkit 12.0+, GPU Compute Capability 7.0+ (Volta+), CMake 3.18+
  • AMD: ROCm 5.0+ (HIP SDK), CMake 3.21+, gfx9/gfx10/gfx11 GPU

ROCm/HIP Build (AMD GPU)

# With ROCm Docker or native installation
cmake -S . -B build-rocm -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DSECP256K1_BUILD_ROCM=ON \
  -DCMAKE_HIP_ARCHITECTURES="gfx1030;gfx1100"

cmake --build build-rocm -j
./build-rocm/cuda_rocm/secp256k1_cuda_test

Note

: In ROCm builds, PTX inline asm is automatically replaced with portable __int128 fallbacks (gpu_compat.h -> SECP256K1_USE_PTX=0). The 32-bit hybrid mul backend (PTX-dependent) is automatically disabled on HIP.


Usage

Device Functions

#include "secp256k1.cuh"

__global__ void my_kernel(const Scalar* scalars, JacobianPoint* results, int n) {
    using namespace secp256k1::cuda;
    
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    
    // G * k -- GENERATOR_JACOBIAN is embedded at compile time
    JacobianPoint G = GENERATOR_JACOBIAN;
    scalar_mul(&G, &scalars[idx], &results[idx]);
}

Batch Processing

#include "secp256k1.cuh"

const int N = 1 << 20;  // ~1M operations
Scalar* d_scalars;
JacobianPoint* d_results;

cudaMalloc(&d_scalars, N * sizeof(Scalar));
cudaMalloc(&d_results, N * sizeof(JacobianPoint));

// Generator multiplication batch
int block = 256;
int grid = (N + block - 1) / block;
generator_mul_batch_kernel<<<grid, block>>>(d_scalars, d_results, N);
cudaDeviceSynchronize();

Tests

30 vector tests in test_suite.cu:

  • Field arithmetic: identity, inverse, commutativity, associativity
  • Scalar arithmetic: add, sub, boundary
  • Point operations: doubling, mixed addition, identity
  • Scalar multiplication: known vectors, generator mul
  • GLV endomorphism: phi(phi(P)) + P = -phi(P)
  • Batch inversion: Montgomery trick correctness
  • Cross-backend: CPU <-> CUDA result comparison

CPU <-> CUDA Compatibility

Data types share layout via secp256k1/types.hpp:

static_assert(sizeof(FieldElement) == 32);
static_assert(sizeof(Scalar) == 32);
static_assert(sizeof(AffinePoint) == 64);
static_assert(offsetof(FieldElement, limbs) == 0);

CPU-computed data transfers directly to GPU via cudaMemcpy (little-endian, same POD layout).


Cross-Platform Benchmarks

Android ARM64 -- RK3588 (Cortex-A55/A76), ARM64 inline ASM (MUL/UMULH)

Operation Time
field_mul (a*b mod p) 68.3 ns
field_sqr (a^2 mod p) 50 ns
field_add (a+b mod p) 8 ns
field_inverse 2 us
fast scalar_mul (k*G) 15.27 us
fast scalar_mul (k*P) 130.33 us
ECDSA sign 22.22 us
Schnorr sign (precomputed) 16.67 us
ECDSA verify 150.13 us

Backend: ARM64 inline assembly (MUL/UMULH). Latest rerun kept the ARMv8 SHA2 dispatch win for signing-heavy paths on RK3588.

Latest RTX 5060 Ti Refresh

  • CUDA local rerun via gpu_bench_unified: k*G = 129.5 ns at TPB 256 on batch 65536.
  • OpenCL retained revalidation: kG (batch=65536) = 115.1 ns, kP (batch=65536) = 263.1 ns, kG (kernel) = 98.7 ns.
  • CUDA TPB 512 was not retained as a default because the same harness produced invalid CT timings while only marginally improving k*G.

See ../docs/BENCHMARKS.md for the current cross-platform benchmark matrix and retained-vs-rejected rerun notes.


License

MIT -- see LICENSE


Credits

Port: Direct CUDA adaptation of the C++ library
Focus: Maximum throughput for batch ECC operations
Philosophy: Speed > Security (research/development)


UltrafastSecp256k1 v3.0.0 -- CUDA/ROCm GPU Library