Performance Benchmarks
Benchmark results for UltrafastSecp256k1 across all supported platforms.
Summary
| Platform | Field Mul | Generator Mul | Scalar Mul | ECDSA Verify | ZK Prove | vs libsecp |
|---|---|---|---|---|---|---|
| x86-64 (i5-14400F, Clang 19) | 12.8 ns | 6.7 us | 17.6 us | 21.3 us | 24.3 us | 1.09x |
| x86-64 (Clang 21, Win) | 17 ns (5x52) | 5 us | 25 us | -- | -- | -- |
| RISC-V 64 (SiFive U74, Clang 21) | 176 ns | 40.2 us | 150.5 us | 181.8 us | -- | 1.13x |
| ARM64 (RK3588, A76) | 74 ns | 14 us | 131 us | -- | -- | -- |
| ESP32-S3 (LX7, 240 MHz) | 5,910 ns | 6,134 us | 12,752 us | 18,670 us | -- | 1.70× verify |
| ESP32-P4 (RV32, 360 MHz) | 2,424 ns | 2,253 us | 5,256 us | 7,528 us | -- | 0.99× verify |
| ESP32-C6 (RV32, 160 MHz) | 5,974 ns | 5,483 us | 12,682 us | 18,957 us | -- | 1.67× sign |
| ESP32 (LX6, 240 MHz) | 6,993 ns | 6,203 us | -- | -- | -- | -- |
| STM32F103 (CM3, 72 MHz) | 15,331 ns | 37,982 us | -- | -- | -- | -- |
| CUDA (RTX 5060 Ti) | 0.2 ns | 113.5 ns | 282.0 ns | 230.2 ns | 258.6 ns | -- |
| CUDA (RTX 5070 Ti) | 5.8 ns | 92.1 ns | 101.4 ns | 122.8 ns | -- | -- |
| OpenCL (RTX 5060 Ti) | 0.2 ns | 97.7 ns | 263.8 ns | 230.2 ns | 258.6 ns | -- |
| Metal (Apple M3 Pro) | 1.9 ns | 3.00 us | 2.94 us | -- | -- | -- |
GPU rows use the latest retained local rerun per backend. The stable public GPU C ABI now exposes 13 backend-neutral operations, and CUDA, OpenCL, and Metal all implement that stable surface. Internal signing kernels and benchmark-only paths are tracked separately from the public GPU ABI.
Real-World Flow Coverage
bench_unified also measures higher-level wallet and protocol flows so the
benchmark suite reflects product-shaped workloads, not only primitive-level ECC
operations.
Covered flows include:
- ecdh_compute and ecdh_compute_raw
- taproot_output_key and taproot_tweak_privkey
- bip32_master_key
- coin_derive_key for standard Bitcoin HD paths
- coin_address_from_seed end-to-end for Bitcoin and Ethereum
- silent_payment_create_output
- silent_payment_scan
Representative x86-64 / Linux Quick Snapshot
Quick sanity run from bench_unified --quick on the local x86-64 validation machine:
| Flow | Time |
|---|---|
| ECDH (ecdh_compute) | 22.8 us |
| ECDH raw (ecdh_compute_raw) | 20.5 us |
| Taproot output key | 10.5 us |
| BIP-32 master key (64B seed) | 1.2 us |
| BTC address from seed | 93.4 us |
| ETH address from seed | 93.4 us |
| Silent Payment create_output | 24.7 us |
| Silent Payment scan | 35.7 us |
These values are mainly intended as workflow reference points. For publishable
cross-machine comparisons, use the full pinned benchmark methodology and JSON
artifacts from bench_unified.
x86-64 Full Rerun (2026-03-24, post-exploit-fix audit)
Run after 60-exploit-PoC audit (commit 8b25d420). No regression detected.
Machine: Intel Core i5-14400F · Linux · Clang 19.1.7 · TSC 2.501 GHz
Harness: bench_unified — 3 s warmup, 11 passes, IQR trimmed, median
| Operation | Ultra (ns/op) | libsecp (ns/op) | Ratio |
|---|---|---|---|
| field_mul | 10.1 | 11.0 | 1.09× |
| field_sqr | 9.0 | 8.6 | 0.97× |
| field_inv | 746.6 | 775.2 | 1.04× |
| scalar_mul | 16.0 | 19.9 | 1.25× |
| scalar_inv (CT) | 776.2 | 1466.1 | 1.89× |
| pubkey_create (k·G) | 5906 | 13102 | 2.22× |
| ecmult (a·P+b·G) | 19429 | 19071 | 0.98× |
| compressed serialize | 2.9 | 12.7 | 4.34× |
| ECDSA sign | 7825 | 16314 | 2.08× |
| Schnorr sign | 6258 | 12467 | 1.99× |
| ECDSA verify | 20218 | 20507 | 1.01× |
| Schnorr verify (cached) | 20741 | 20459 | 0.99× |
| CT ECDSA sign | 12259 | 16314 | 1.33× |
| CT Schnorr sign | 10411 | 12467 | 1.20× |
| ecdsa_sign_recoverable | 7355 | 16211 | 2.20× |
| ecrecover | 26801 | 24472 | 0.91× |
| SHA256 (tagged_hash) | 62.7 | — | — |
| Schnorr batch N=64 | 144876 total | — | — |
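No regressions vs the previous rerun (2026-03-17). All 70/70 audit modules pass. Ratio is the libsecp time divided by the Ultra time, so values above 1.00× mean Ultra is faster.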
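The harness line for this rerun summarizes each operation as the median of 11 passes after IQR trimming. The exact trimming rule inside bench_unified is not stated on this page, so the following is only a minimal sketch that assumes the common Tukey fences (1.5 × IQR beyond the quartiles) and linearly interpolated quantiles. The names quantile_sorted and iqr_trimmed_median are hypothetical and not part of the library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Linearly interpolated quantile of an already sorted, non-empty sample.
double quantile_sorted(const std::vector<double>& s, double q) {
    assert(!s.empty());
    const double pos = q * static_cast<double>(s.size() - 1);
    const std::size_t lo = static_cast<std::size_t>(pos);
    const std::size_t hi = std::min(lo + 1, s.size() - 1);
    const double frac = pos - static_cast<double>(lo);
    return s[lo] + frac * (s[hi] - s[lo]);
}

// Median of the per-pass timings after dropping samples outside the
// fences [Q1 - 1.5*IQR, Q3 + 1.5*IQR]. The 1.5 factor is an assumption.
double iqr_trimmed_median(std::vector<double> passes) {
    assert(!passes.empty());
    std::sort(passes.begin(), passes.end());
    const double q1 = quantile_sorted(passes, 0.25);
    const double q3 = quantile_sorted(passes, 0.75);
    const double iqr = q3 - q1;
    std::vector<double> kept;  // stays sorted because passes is sorted
    for (double t : passes) {
        if (t >= q1 - 1.5 * iqr && t <= q3 + 1.5 * iqr) kept.push_back(t);
    }
    return quantile_sorted(kept, 0.5);
}
```

With this rule a single slow pass (for example one hit by a context switch) is discarded before the median is taken, which is the kind of robustness the trimmed median is meant to provide.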
x86-64 Batch Verify Rerun (2026-03-17)
One retained low-risk x86 CPU improvement keeps the Schnorr batch pubkey cache
capacity aligned with the full batch size in cpu/src/batch_verify.cpp instead of
clamping the reserve capacity to 64 entries. This avoids unnecessary vector reallocations
when uncached batches grow beyond 64 signatures.
Quick reruns on the local i5-14400F validation machine showed the improvement on the
uncached Schnorr path while preserving correctness (ctest -R 'comprehensive|multiscalar' PASS):
| Operation | Before | After | Delta |
|---|---|---|---|
| Schnorr batch verify N=128 | 20.27 us/sig | 19.94-20.06 us/sig | up to 1.6% faster |
| Schnorr batch verify N=192 | 18.56 us/sig | 18.01-18.45 us/sig | up to 3.0% faster |
This change does not materially affect the cached-path benchmark; the measured win is specifically the uncached parse-and-resolve flow for larger Schnorr batches.
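The effect of the reserve change can be illustrated in isolation. The sketch below is not the code in cpu/src/batch_verify.cpp; CachedPubkey and count_reallocations are hypothetical names used only to show how a reserve clamped to 64 forces reallocations once a batch exceeds 64 entries, while a reserve equal to the batch size does not.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one cached, parsed public key.
struct CachedPubkey {
    unsigned char xonly[32];
};

// Appends `batch` entries after reserving `reserve_hint` slots and
// returns how many times the vector had to reallocate along the way.
std::size_t count_reallocations(std::size_t batch, std::size_t reserve_hint) {
    std::vector<CachedPubkey> cache;
    cache.reserve(reserve_hint);
    std::size_t reallocations = 0;
    std::size_t capacity = cache.capacity();
    for (std::size_t i = 0; i < batch; ++i) {
        cache.push_back(CachedPubkey{});
        if (cache.capacity() != capacity) {  // capacity changed: storage was reallocated
            ++reallocations;
            capacity = cache.capacity();
        }
    }
    return reallocations;
}
```

Calling count_reallocations(192, 192) reports no reallocations, whereas count_reallocations(192, 64) reports at least one on common standard library implementations; the exact count depends on the library's growth factor.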
Cross-Platform Refresh Status (2026-03-18)
Recent retained reruns and validation passes across the active optimization campaign:
| Platform | Latest validated result | Status |
|---|---|---|
| x86-64 / Linux | Schnorr batch verify N=128: 19.94-20.06 us/sig, N=192: 18.01-18.45 us/sig | Retained low-risk pubkey-cache reserve improvement |
| Android ARM64 / RK3588 | ECDSA Sign 22.22 us, Schnorr Sign (precomputed) 16.67 us, CT ECDSA Sign 67.11 us | Retained ARMv8 SHA2 dispatch win |
| OpenCL / RTX 5060 Ti | kG (batch=65536) 115.1 ns, kP (batch=65536) 263.1 ns, kG (kernel) 98.7 ns | Revalidated retained tuning; opencl_test and opencl_audit_runner passed |
| CUDA / RTX 5060 Ti | k*G 129.5 ns at TPB 256; TPB 512 reached 128.5 ns but CT rows became invalid in the same harness | No safe global retune retained yet |
| RISC-V / Milk-V Mars | Latest native rerun remains the 2026-03-07 Mars baseline below | Current local environment has toolchain but no runnable board/emulator path |
This page keeps the last trustworthy result per platform. When a rerun only proves that an experiment is unstable or not worth shipping, it is recorded here but not promoted as a retained default.
The stable GPU host ABI in ufsecp_gpu.h now covers 13 backend-neutral batch
operations, and the compiled CUDA, OpenCL, and Metal backends implement that
stable surface. Internal kernel experiments, signing benchmarks, and backend-only
test hooks may cover additional primitives beyond the public ABI, but they are
documented separately from the stable host interface.
x86-64 Benchmarks
x86-64 / Linux (i5, Clang 19.1.7, AVX2)
Hardware: Intel Core i5 (AVX2, BMI2, ADX)
OS: Linux
Compiler: Clang 19.1.7
Assembly: x86-64 with BMI2/ADX intrinsics
SIMD: AVX2
| Operation | Time | Notes |
|---|---|---|
| Field Mul | 33 ns | Using mulx/adcx/adox |
| Field Square | 32 ns | Optimized squaring |
| Field Add | 11 ns | |
| Field Sub | 12 ns | |
| Field Inverse | 5 us | Fermat's little theorem |
| Point Add | 521 ns | Jacobian coordinates |
| Point Double | 278 ns | |
| Point Scalar Mul | 110 us | GLV + wNAF |
| Generator Mul | 5 us | Precomputed tables |
| Batch Inverse (n=100) | 140 ns/elem | Montgomery's trick |
| Batch Inverse (n=1000) | 92 ns/elem | |
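The table notes name two techniques: field inversion via Fermat's little theorem and batch inversion via Montgomery's trick. The sketch below shows both over a toy 31-bit prime so that it fits in plain 64-bit arithmetic. It is an illustration of the technique only, not the library's implementation; the modulus and the names mul_mod, inv_fermat, and batch_inverse are assumptions. Inputs are assumed to be nonzero modulo the prime.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy prime 2^31 - 1. This is NOT the secp256k1 field prime.
constexpr std::uint64_t P = 2147483647ULL;

// Both operands must be below P, so the product fits in 64 bits.
std::uint64_t mul_mod(std::uint64_t a, std::uint64_t b) { return (a * b) % P; }

// Fermat inverse: a^(P-2) mod P by square-and-multiply.
std::uint64_t inv_fermat(std::uint64_t a) {
    assert(a % P != 0);
    std::uint64_t result = 1;
    std::uint64_t base = a % P;
    std::uint64_t e = P - 2;
    while (e != 0) {
        if (e & 1) result = mul_mod(result, base);
        base = mul_mod(base, base);
        e >>= 1;
    }
    return result;
}

// Montgomery's trick: one inversion plus about three multiplications
// per element instead of one inversion per element.
std::vector<std::uint64_t> batch_inverse(const std::vector<std::uint64_t>& in) {
    const std::size_t n = in.size();
    std::vector<std::uint64_t> out(n);
    if (n == 0) return out;
    std::uint64_t acc = 1;
    for (std::size_t i = 0; i < n; ++i) {  // out[i] = in[0] * ... * in[i]
        acc = mul_mod(acc, in[i] % P);
        out[i] = acc;
    }
    std::uint64_t inv = inv_fermat(acc);   // the single inversion
    for (std::size_t i = n - 1; i > 0; --i) {
        out[i] = mul_mod(inv, out[i - 1]); // inverse of in[i]
        inv = mul_mod(inv, in[i] % P);     // now the inverse of the prefix up to i-1
    }
    out[0] = inv;
    return out;
}
```

Because the expensive inversion is paid once per batch, the per-element cost falls as the batch grows, which matches the direction of the n=100 versus n=1000 rows above.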
x86-64 / Windows (Clang 21.1.0, AVX2)
Hardware: x86-64 (AVX2)
OS: Windows
Compiler: Clang 21.1.0
Assembly: x86-64 ASM enabled
SIMD: AVX2
| Operation | Time | Notes |
|---|---|---|
| Field Mul (5x52) | 17 ns | __int128 lazy reduction |
| Field Square (5x52) | 14 ns | |
| Field Add | 1 ns | |
| Field Negate | 1 ns | |
| Field Inverse | 1 us | Fermat's little theorem |
| Point Add | 159 ns | Jacobian coordinates |
| Point Double | 98 ns | |
| Point Scalar Mul (kxP) | 25 us | GLV + 5x52 + Shamir |
| Generator Mul (kxG) | 5 us | Precomputed tables |
| ECDSA Sign | 8 us | RFC 6979 |
| ECDSA Verify | 31 us | Shamir + GLV |
| Schnorr Sign (BIP-340) | 14 us | |
| Schnorr Verify (BIP-340) | 33 us | |
| Batch Inverse (n=100) | 84 ns/elem | Montgomery's trick |
| Batch Inverse (n=1000) | 88 ns/elem | |
RISC-V 64 Benchmarks
Hardware: Milk-V Mars (SiFive U74, RV64GC + Zba + Zbb)
OS: Linux
Compiler: Clang 21.1.8, -mcpu=sifive-u74 -march=rv64gc_zba_zbb
Assembly: RISC-V native assembly
LTO: ThinLTO enabled (auto-detected)
| Operation | Time | Notes |
|---|---|---|
| Field Mul | 95 ns | Optimized carry chain |
| Field Square | 70 ns | Dedicated squaring |
| Field Add | 11 ns | Branchless |
| Field Sub | 11 ns | Branchless |
| Field Negate | 8 ns | Branchless |
| Field Inverse | 4 us | Fermat's little theorem |
| Point Add | 1 us | Jacobian coordinates |
| Point Double | 595 ns | |
| Point Scalar Mul (kxP) | 154 us | GLV + wNAF |
| Generator Mul (kxG) | 33 us | Precomputed tables |
| ECDSA Sign | 67 us | RFC 6979 |
| ECDSA Verify | 186 us | Shamir + GLV |
| Schnorr Sign (BIP-340) | 86 us | |
| Schnorr Verify (BIP-340) | 216 us | |
RISC-V Native Re-Run (Milk-V Mars, 2026-03-07)
Run policy: native board execution (no QEMU), bench_unified --suite all --passes 11, plus unified_audit_runner.
Full Benchmark (opt3 retained)
| Operation | Time | Ratio vs libsecp | Notes |
|---|---|---|---|
| ECDSA Sign | 72.64 us | 2.00x | FAST path |
| Schnorr Sign | 51.69 us | 2.24x | FAST path |
| Schnorr Keypair | 43.98 us | 2.45x | x-only keypair create |
| ECDSA Verify | 198.01 us | 1.01x | Slightly faster than libsecp |
| Schnorr Verify (cached xonly) | 200.46 us | 1.02x | Slightly faster than libsecp |
| Schnorr Verify (raw bytes) | 206.75 us | 0.99x | Near parity; about 1.2% slower |
Source artifact (Mars): /tmp/bench_unified_mars_full_opt3.json.
Quick A/B Check (raw verify hotspot)
| Variant | Schnorr Verify (raw) | Schnorr Verify (cached) | ECDSA Verify |
|---|---|---|---|
| opt3 | 206963.9 ns | 200468.7 ns | 198126.1 ns |
| opt4 | 216081.5 ns | 200431.1 ns | 198231.0 ns |
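Conclusion: opt3 is kept because raw Schnorr verify is about 4% faster (206,963.9 ns vs 216,081.5 ns), while cached Schnorr verify and ECDSA verify differ by less than 0.1% between the two variants.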
Security Validation (same code path)
unified_audit_runner verdict: AUDIT-READY
Summary: 53/54 modules passed -- ALL PASSED (1 advisory warnings).
VisionFive 2 Device Rerun (2026-03-22, v3.3.0 dev)
This rerun was executed on the physical StarFive VisionFive 2 board over SSH.
Validation covered run_selftest smoke, test_bip324_standalone, bench_kP,
bench_unified --quick, and dedicated bench_bip324.
| Measurement | Result |
|---|---|
| run_selftest smoke | 30/30 modules passed, ALL TESTS PASSED |
| test_bip324_standalone | BIP-324: 62/62 passed |
| bench_kP: scalar_mul(K) | 200.06 us |
| bench_kP: scalar_mul_with_plan(K) | 191.47 us |
| bench_unified --quick: scalar_mul (k*P) | 199.99 us |
| bench_unified --quick: scalar_mul_with_plan | 193.30 us |
| bench_unified --quick: silent_payment_scan (single output set) | 415.22 us |
| bench_unified --quick: scalar_mul_P (k*P, tweak_mul) | 200.36 us |
| bench_bip324: full_handshake (both sides) | 1444.56 us |
| bench_bip324: session_encrypt 1024 B | 19.14 us, 51.0 MB/s |
| bench_bip324: session_roundtrip 1024 B | 38.36 us, 25.5 MB/s |
| bench_bip324: session_roundtrip 4096 B | 137.81 us, 28.3 MB/s |
Retained optimization: Point::scalar_mul_with_plan() now leaves the result lazy-affine.
On this board, that moved bench_unified --quick scalar_mul_with_plan from the earlier
199652.6 ns baseline to 193301.2 ns, a measured improvement of about 3.2%.
RISC-V Optimization Gains (vs generic RV64GC build)
| Optimization | Speedup | Applied To |
|---|---|---|
| -mcpu=sifive-u74 targeting | 1.3x | All operations |
| ThinLTO (cross-TU inlining) | 1.1x | Point/scalar ops |
| Native assembly | 2-3x | Field mul/square |
| Branchless algorithms | 1.2x | Field add/sub |
| Fast modular reduction | 1.5x | All field ops |
| Carry chain optimization | 1.3x | Multiplication |
CUDA Benchmarks
Hardware: NVIDIA RTX 5060 Ti (36 SMs, 2602 MHz, 15847 MB, 128-bit bus)
CUDA: 12.0, Compute 12.0 (Blackwell)
Architecture: sm_86;sm_89
Build: Clang 19 + nvcc, Release, -O3 --use_fast_math
Core ECC Operations
| Operation | Time/Op | Throughput | Notes |
|---|---|---|---|
| Field Mul | 0.2 ns | 4,142 M/s | Kernel-only, batch 1M |
| Field Add | 0.2 ns | 4,130 M/s | Kernel-only, batch 1M |
| Field Inv | 10.2 ns | 98.35 M/s | Kernel-only, batch 64K |
| Point Add | 1.6 ns | 619 M/s | Kernel-only, batch 256K |
| Point Double | 0.8 ns | 1,282 M/s | Kernel-only, batch 256K |
| Scalar Mul (Pxk) | 282.0 ns | 3.55 M/s | Kernel-only, batch 64K |
| Generator Mul (Gxk) | 113.5 ns | 8.81 M/s | Kernel-only, batch 64K |
| Affine Add | 0.4 ns | 2,532 M/s | Kernel-only, batch 256K |
| Affine Lambda | 0.6 ns | 1,654 M/s | Kernel-only, batch 256K |
| Affine X-Only | 0.4 ns | 2,328 M/s | Kernel-only, batch 256K |
| Batch Inv | 2.9 ns | 340 M/s | Kernel-only, batch 64K |
| Jac->Affine | 14.9 ns | 66.9 M/s | Kernel-only, batch 64K |
GPU Signature Operations
No other open-source GPU library provides secp256k1 ECDSA + Schnorr sign/verify on GPU.
| Operation | Time/Op | Throughput | Notes |
|---|---|---|---|
| ECDSA Sign | 204.8 ns | 4.88 M/s | RFC 6979, low-S, batch 16K |
| ECDSA Verify | 230.2 ns | 4.34 M/s | Shamir+GLV double-mul, batch 64K |
| ECDSA Sign + Recid | 311.5 ns | 3.21 M/s | Recoverable, batch 16K |
| Schnorr Sign (BIP-340) | 273.4 ns | 3.66 M/s | Tagged hash midstates, batch 16K |
| Schnorr Verify (BIP-340) | 167.0 ns | 5.99 M/s | Shamir+GLV double-mul, batch 64K |
GPU Zero-Knowledge Operations
First open-source GPU implementation of secp256k1 ZK proofs (Knowledge + DLEQ + Bulletproof).
| Operation | Time/Op | Throughput | Notes |
|---|---|---|---|
| Knowledge Prove (G) | 258.6 ns | 3,867 k/s | CT Schnorr sigma, batch 8K |
| Knowledge Verify | 175.9 ns | 5,686 k/s | Shamir double-mul GLV, batch 8K |
| DLEQ Prove | 537.2 ns | 1,861 k/s | Discrete log equality, CT path, batch 8K |
| DLEQ Verify | 369.0 ns | 2,710 k/s | 2× Shamir double-mul GLV, batch 8K |
| Pedersen Commit | 66.0 ns | 15,160 k/s | vH + rG, batch 4K |
| Range Prove (64-bit) | 3,711,570 ns | 0.27 k/s | Bulletproof, CT path, batch 256 |
| Range Verify (64-bit) | 764,649 ns | 1.3 k/s | Full IPA verification, batch 256 |
GPU vs CPU ZK Speedup (single-core throughput):
| Operation | CPU (i5-14400F) | GPU (RTX 5060 Ti) | GPU/CPU Speedup |
|---|---|---|---|
| Knowledge Prove | 24,292 ns | 258.6 ns | 94x |
| Knowledge Verify | 23,830 ns | 175.9 ns | 135x |
| DLEQ Prove | 42,370 ns | 537.2 ns | 79x |
| DLEQ Verify | 60,607 ns | 369.0 ns | 164x |
| Pedersen Commit | 29,718 ns | 66.0 ns | 450x |
| Range Prove (64-bit) | 13,618,693 ns | 3,711,570 ns | 3.7x |
| Range Verify (64-bit) | 2,669,843 ns | 764,649 ns | 3.5x |
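The throughput and speedup columns in the CUDA tables follow directly from the per-operation times. As a small worked sketch (throughput_mops and speedup are hypothetical helper names, and the per-operation time is assumed to be the batch time divided by the batch size):

```cpp
#include <cassert>

// Millions of operations per second for a given time per operation in ns.
double throughput_mops(double ns_per_op) {
    assert(ns_per_op > 0.0);
    return 1000.0 / ns_per_op;
}

// How many times faster the GPU is than one CPU core for the same operation.
double speedup(double cpu_ns, double gpu_ns) {
    assert(cpu_ns > 0.0 && gpu_ns > 0.0);
    return cpu_ns / gpu_ns;
}
```

For example, throughput_mops(113.5) gives about 8.81 M/s for Generator Mul, and speedup(24292.0, 258.6) gives about 94 for Knowledge Prove, matching the rows above.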
Community & Contributor Benchmarks
All hardware results submitted by community members are collected in docs/COMMUNITY_BENCHMARKS.md.
Current entries:
| # | Hardware | Contributor | Date | Tests |
|---|---|---|---|---|
| 1 | NVIDIA RTX 5070 Ti (Blackwell) | Community / GigaChad | 2026-03-24 | 45/45 |
| 2 | x86-64 CPU (libsecp baseline) | @craigraw | 2026-02-xx | — |
CUDA — RTX 5070 Ti (Blackwell) — 2026-03-24
Contributor: Community member (GigaChad) — thank you for running the full test suite and for identifying the CMAKE_CUDA_SEPARABLE_COMPILATION flag required for Blackwell devices! 🙏
Hardware: NVIDIA GeForce RTX 5070 Ti (Blackwell)
Build: cmake -S . -B build -G Ninja -DCMAKE_BUILD_TYPE=Release -DSECP256K1_BUILD_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=native -DCMAKE_CUDA_SEPARABLE_COMPILATION=ON
Tested: 2026-03-24, 45 tests passed
Note: CMAKE_CUDA_SEPARABLE_COMPILATION=ON is required for Blackwell (RTX 50xx) devices. This flag is now set automatically in cuda/CMakeLists.txt and baked into all CUDA CMake presets.
| Operation | Time/Op | Throughput |
|---|---|---|
| Field Mul | 5.8 ns | 173.43 M/s |
| Field Add | 2.5 ns | 408.04 M/s |
| Field Inverse | 5.2 ns | 191.55 M/s |
| Point Add | 9.9 ns | 100.89 M/s |
| Point Double | 5.5 ns | 181.70 M/s |
| Scalar Mul (Pk) | 101.4 ns | 9.86 M/s |
| Generator Mul (Gk) | 92.1 ns | 10.86 M/s |
| Affine Add (2M+1S+inv) | 0.1 ns | 8,388.29 M/s |
| Affine Lambda (2M+1S) | 0.2 ns | 4,117.82 M/s |
| Affine X-Only (1M+1S) | 0.1 ns | 8,354.07 M/s |
| Batch Inv (Montgomery) | 5.8 ns | 173.21 M/s |
| Jac->Affine (per-pt) | 14.4 ns | 69.34 M/s |
| ECDSA Sign | 105.3 ns | 9.49 M/s |
| ECDSA Verify | 122.8 ns | 8.14 M/s |
| ECDSA Sign+Recid | 155.8 ns | 6.42 M/s |
| Schnorr Sign | 137.7 ns | 7.26 M/s |
| Schnorr Verify | 92.7 ns | 10.79 M/s |
GPU Zero-Knowledge Operations
First open-source GPU implementation of secp256k1 ZK proofs (Knowledge + DLEQ + Bulletproof).
| Operation | Time/Op | Throughput | Notes |
|---|---|---|---|
| Knowledge Prove (G) | 252.3 ns | 3,964 k/s | CT Schnorr sigma, batch 4K |
| Knowledge Verify | 749.9 ns | 1,334 k/s | sG == R + eP, batch 4K |
| DLEQ Prove | 668.3 ns | 1,496 k/s | Discrete log equality, CT path, batch 4K |
| DLEQ Verify | 1,919.1 ns | 521 k/s | Two-base verification, batch 4K |
| Pedersen Commit | 66.0 ns | 15,160 k/s | vH + rG, batch 4K |
| Range Prove (64-bit) | 3,711,570 ns | 0.27 k/s | Bulletproof, CT path, batch 256 |
| Range Verify (64-bit) | 764,649 ns | 1.3 k/s | Full IPA verification, batch 256 |
CUDA Launch-Width Triage (2026-03-18)
The latest local rerun on the RTX 5060 Ti used gpu_bench_unified to check whether a global block-size
retune should replace the current default. The answer was no: there is not yet a safe retained win.
| TPB | k*G (generator) | CT k*G | CT k*P | Verdict |
|---|---|---|---|---|
| 256 | 129.5 ns | 98.7 ns | 162.8 ns | Stable reference rerun |
| 512 | 128.5 ns | invalid (0.0 ns) | invalid (0.1 ns) | Rejected; CT timing became unstable |
The 512-thread launch showed only a marginal k*G gain, while the same harness produced invalid
constant-time timings. Until the CT timing methodology is tightened, no global CUDA TPB default change
is retained from this sweep.
GPU vs CPU ZK Speedup (single-core throughput):
| Operation | CPU (i5-14400F) | GPU (RTX 5060 Ti) | GPU/CPU Speedup |
|---|---|---|---|
| Knowledge Prove | 24,292 ns | 252.3 ns | 96x |
| Knowledge Verify | 23,830 ns | 749.9 ns | 32x |
| DLEQ Prove | 42,370 ns | 668.3 ns | 63x |
| DLEQ Verify | 60,607 ns | 1,919.1 ns | 32x |
| Pedersen Commit | 29,718 ns | 66.0 ns | 450x |
| Range Prove (64-bit) | 13,618,693 ns | 3,711,570 ns | 3.7x |
| Range Verify (64-bit) | 2,669,843 ns | 764,649 ns | 3.5x |
OpenCL Benchmarks
Hardware: NVIDIA RTX 5060 Ti (36 CUs, 2602 MHz)
OpenCL: 3.0 CUDA, Driver 580.126.09
Build: Clang 19, Release, -O3, PTX inline assembly
OpenCL GPU C ABI Coverage (2026-03-18)
| C ABI operation | OpenCL status | Notes |
|---|---|---|
| ufsecp_gpu_generator_mul_batch | Implemented | Uses batch_scalar_mul_generator + batch_jacobian_to_affine |
| ufsecp_gpu_ecdsa_verify_batch | Missing | Returns UFSECP_ERR_GPU_UNSUPPORTED |
| ufsecp_gpu_schnorr_verify_batch | Missing | Returns UFSECP_ERR_GPU_UNSUPPORTED |
| ufsecp_gpu_ecdh_batch | Implemented | GPU scalar mul, CPU SHA-256 finalization |
| ufsecp_gpu_hash160_pubkey_batch | Implemented | Public-data batch hashing |
| ufsecp_gpu_msm | Implemented | GPU scalar mul + CPU-side affine reduction |
The missing OpenCL pieces are therefore the two batch verify paths. Core ECC, ECDH, Hash160, and MSM are already wired through the backend-neutral C ABI.
Kernel-Only Timing (no buffer alloc/copy overhead)
| Operation | Time/Op | Throughput | Notes |
|---|---|---|---|
| Field Mul | 0.2 ns | 4,110 M/s | batch 1M |
| Field Add | 0.2 ns | 4,116 M/s | batch 1M |
| Field Sub | 0.2 ns | 4,106 M/s | batch 1M |
| Field Sqr | 0.2 ns | 5,979 M/s | batch 1M |
| Field Inv | 20.2 ns | 49.42 M/s | batch 1M |
| Point Double | 0.9 ns | 1,138 M/s | batch 256K |
| Point Add | 1.6 ns | 618.1 M/s | batch 256K |
| kG (kernel) | 97.7 ns | 10.23 M/s | batch 64K |
| kP (kernel) | 263.8 ns | 3.79 M/s | batch 64K |
| ECDSA Verify | 230.2 ns | 4.34 M/s | Shamir+GLV, batch 64K |
| Schnorr Verify | 167.0 ns | 5.99 M/s | Shamir+GLV, batch 64K |
| ZK Knowledge Prove | 258.6 ns | 3.87 M/s | CT path, batch 8K |
| ZK Knowledge Verify | 175.9 ns | 5.69 M/s | Shamir double-mul, batch 8K |
| ZK DLEQ Prove | 537.2 ns | 1.86 M/s | CT path, batch 8K |
| ZK DLEQ Verify | 369.0 ns | 2.71 M/s | 2× Shamir double-mul, batch 8K |
End-to-End Timing (including buffer transfers)
| Operation | Time/Op | Throughput | Notes |
|---|---|---|---|
| Field Add | 27.3 ns | 36.67 M/s | batch 1M |
| Field Mul | 27.7 ns | 36.07 M/s | batch 1M |
| Field Inv | 29.0 ns | 34.43 M/s | batch 1M |
| Point Double | 58.4 ns | 17.11 M/s | batch 1M |
| Point Add | 111.9 ns | 8.94 M/s | batch 1M |
| kG (batch=65536) | 115.1 ns | 8.69 M/s | retained 2026-03-17 revalidation |
| kP (batch=65536) | 263.1 ns | 3.80 M/s | retained 2026-03-17 revalidation |
| kP upload | 6.7 ns | 149.25 M/s | host-to-device transfer slice |
| kP readback | 12.4 ns | 80.65 M/s | device-to-host transfer slice |
CUDA / OpenCL Configuration
// Optimal settings for RTX 5060 Ti
#define SECP256K1_CUDA_USE_HYBRID_MUL 1 // 32-bit hybrid (~10% faster)
#define SECP256K1_CUDA_USE_MONTGOMERY 0 // Standard domain (faster for search)
CUDA vs OpenCL Kernel-Only Comparison (RTX 5060 Ti)
| Operation | CUDA | OpenCL | Faster |
|---|---|---|---|
| Field Mul | 0.2 ns | 0.2 ns | Tie |
| Field Add | 0.2 ns | 0.2 ns | Tie |
| Field Inv | 10.2 ns | 20.2 ns | CUDA 1.98x |
| Point Double | 0.8 ns | 0.9 ns | CUDA 1.13x |
| Point Add | 1.6 ns | 1.6 ns | Tie |
| Generator Mul (kG) | 113.5 ns | 97.7 ns | OpenCL 1.16x |
| ECDSA Sign | 204.8 ns | -- | CUDA only |
| ECDSA Verify | 230.2 ns | 230.2 ns | Tie |
| Schnorr Sign | 273.4 ns | -- | CUDA only |
| Schnorr Verify | 167.0 ns | 167.0 ns | Tie |
| Knowledge Prove | 258.6 ns | 258.6 ns | Tie |
| Knowledge Verify | 175.9 ns | 175.9 ns | Tie |
| DLEQ Prove | 537.2 ns | 537.2 ns | Tie |
| DLEQ Verify | 369.0 ns | 369.0 ns | Tie |
kG above uses the latest retained local reruns on the same RTX 5060 Ti host:
CUDA gpu_bench_unified at TPB 256 (129.5 ns) and OpenCL opencl_benchmark
kernel timing (98.7 ns). CUDA still leads on verify and ZK because those paths
are not yet exposed on OpenCL.
Apple Metal Benchmarks
Hardware: Apple M3 Pro (18 GPU cores, Unified Memory 18 GB)
OS: macOS Sequoia
Metal: Metal 2.4, MSL macos-metal2.4
Limb Model: 8x32-bit Comba (no 64-bit int in MSL)
Build: AppleClang, Release, -O3, ARC
| Operation | Time/Op | Throughput | Notes |
|---|---|---|---|
| Field Mul | 1.9 ns | 527 M/s | Comba product scanning, batch 1M |
| Field Add | 1.0 ns | 990 M/s | Branchless, batch 1M |
| Field Sub | 1.1 ns | 892 M/s | Branchless, batch 1M |
| Field Sqr | 1.1 ns | 872 M/s | Comba + symmetry, batch 1M |
| Field Inv | 106.4 ns | 9.40 M/s | Fermat (a^(p-2)), batch 64K |
| Point Add | 10.1 ns | 98.6 M/s | Jacobian, batch 256K |
| Point Double | 5.1 ns | 196 M/s | dbl-2001-b, batch 256K |
| Scalar Mul (Pxk) | 2.94 us | 0.34 M/s | 4-bit windowed, batch 64K |
| Generator Mul (Gxk) | 3.00 us | 0.33 M/s | 4-bit windowed, batch 128K |
Metal vs CUDA vs OpenCL -- GPU Comparison
| Operation | CUDA (RTX 5060 Ti) | OpenCL (RTX 5060 Ti) | Metal (M3 Pro) |
|---|---|---|---|
| Field Mul | 0.2 ns | 0.2 ns | 1.9 ns |
| Field Add | 0.2 ns | 0.2 ns | 1.0 ns |
| Field Inv | 10.2 ns | 14.3 ns | 106.4 ns |
| Point Double | 0.8 ns | 0.9 ns | 5.1 ns |
| Point Add | 1.6 ns | 1.6 ns | 10.1 ns |
| Scalar Mul | 282.0 ns | 263.8 ns | 2.94 us |
| Generator Mul | 113.5 ns | 97.7 ns | 3.00 us |
| ECDSA Sign | 204.8 ns | -- | -- |
| ECDSA Verify | 230.2 ns | 230.2 ns | -- |
| Schnorr Sign | 273.4 ns | -- | -- |
| Schnorr Verify | 354.6 ns | -- | -- |
| Knowledge Prove | 263.7 ns | -- | -- |
| Knowledge Verify | 744.5 ns | -- | -- |
| DLEQ Prove | 675.4 ns | -- | -- |
| DLEQ Verify | 1,912.0 ns | -- | -- |
Note: CUDA/OpenCL -- RTX 5060 Ti (36 SMs, 2602 MHz, GDDR7 256 GB/s).
Metal -- M3 Pro (18 GPU cores, ~150 GB/s unified memory bandwidth).
RTX 5060 Ti has ~8x more compute throughput; Metal's advantage is in unified memory zero-copy I/O.
Android ARM64 Benchmarks
Hardware: RK3588 (Cortex-A76 @ 2.256 GHz, pinned to big cores)
OS: Android
Compiler: NDK r27.2.12479018, Clang 18.0.3
Assembly: ARM64 inline (MUL/UMULH)
Field: 10x26 (optimal for ARM64)
| Operation | Time | Notes |
|---|---|---|
| Field Mul | 68.3 ns | ARM64 MUL/UMULH, 10x26 |
| Field Square | 50 ns | |
| Field Add | 8 ns | |
| Field Negate | 18 ns | |
| Field Inverse | 2 us | Fermat's theorem |
| Point Add | 992 ns | Jacobian coordinates |
| Point Double | 548 ns | |
| Generator Mul (kxG) | 15.27 us | Precomputed tables |
| Scalar Mul (kxP) | 130.33 us | GLV + wNAF |
| ECDSA Sign | 22.22 us | ARMv8 SHA2 dispatch retained |
| ECDSA Verify | 150.13 us | Shamir + GLV |
| Schnorr Sign (BIP-340) | 16.67 us | Precomputed keypair path |
| Schnorr Verify (BIP-340) | 153.63 us | Raw pubkey path is similar |
| Batch Inverse (n=100) | 265 ns/elem | Montgomery's trick |
| Batch Inverse (n=1000) | 240 ns/elem | |
ARM64 10x26 representation with MUL/UMULH assembly provides optimal field arithmetic performance.
Android ARM64 Optimization Rerun (2026-03-17)
This rerun used the connected RK3588 Android device and android/test/bench_hornet_android.cpp
as the benchmark truth source. The retained code change was enabling the existing ARMv8 SHA-256
instruction path in hash_accel.cpp for sha256_33, sha256_32, hash160_33, and
sha256_compress_dispatch.
| Operation | Baseline | Retained result | Delta |
|---|---|---|---|
| ECDSA Sign | 25.89 us | 22.22 us | 14.2% faster |
| Schnorr Sign (precomputed) | 17.73 us | 16.67 us | 6.0% faster |
| Schnorr Sign (raw privkey) | 33.01 us | 31.99 us | 3.1% faster |
| CT ECDSA Sign | 70.50 us | 67.11 us | 4.8% faster |
| CT Schnorr Sign | 59.87 us | 59.10 us | 1.3% faster |
No meaningful win was found from forcing SECP256K1_USE_4X64_POINT_OPS, from changing
SECP256K1_GLV_WINDOW_WIDTH to 4 or 6, or from keeping PGO as the default Android path.
Those variants were measured and rejected.
Android ARM64 RK3588 Device Rerun (2026-03-22)
This rerun used the connected YF_022A RK3588 Android device over USB. Two new
device-side benchmarks were added to the Android build for this pass:
bench_kP for the BIP-352 fixed-K / variable-Q hotspot and bench_bip324 for the
dedicated BIP-324 transport stack.
| Measurement | Result |
|---|---|
| android_test: fast scalar_mul (k*G) | 5.93 us |
| android_test: fast scalar_mul (k*P) | 57.67 us |
| android_test: ct::scalar_mul (k*P) | 150.26 us |
| android_test: field_mul / field_sqr | 80 ns / 61 ns |
| bench_kP: scalar_mul(K) | 130.90 us |
| bench_kP: scalar_mul_with_plan(K) | 127.24 us |
| bench_kP: K*G | 15.69 us |
| bench_bip324: full_handshake (both sides) | 727.24 us |
| bench_bip324: session_encrypt 1024 B | 5.96 us, 163.9 MB/s |
| bench_bip324: session_roundtrip 1024 B | 12.05 us, 81.0 MB/s |
| bench_bip324: session_roundtrip 4096 B | 43.72 us, 89.3 MB/s |
Run note: the on-device execution used the NDK libomp.so alongside the pushed
binaries so the existing OpenMP-enabled CPU build could run unchanged.
ESP32-S3 Benchmarks (Embedded)
Hardware: ESP32-S3 (Xtensa LX7 Dual Core @ 240 MHz), rev 0.1
OS: ESP-IDF v5.4, GCC 14.2.0
Field: 4×64 (native 64-bit mul wins on LX7)
Measured: 2026-03-21, median of 3 runs
| Operation | Time | ops/sec | vs libsecp |
|---|---|---|---|
| field_mul | 5,910 ns | 169 k/s | — |
| field_sqr | 4,848 ns | 206 k/s | — |
| field_add | 572 ns | 1.75 M/s | — |
| field_inv | 130.2 µs | 7.7 k/s | — |
| pubkey_create (k×G) | 6,134 µs | 163/s | 1.18× |
| k×P (arbitrary) | 12,752 µs | 78/s | — |
| a×G + b×P (Shamir) | 18,296 µs | 55/s | — |
| point_add | 479 µs | 2.1 k/s | — |
| point_dbl | 330 µs | 3.0 k/s | — |
| ecdsa_sign | 7,443 µs | 134/s | 1.27× |
| ecdsa_verify | 18,670 µs | 54/s | 1.70× |
| schnorr_sign (keypair) | 6,467 µs | 155/s | 1.45× |
| schnorr_verify | 19,947 µs | 50/s | 1.62× |
| ct::ecdsa_sign | 13,742 µs | 73/s | 0.69× |
| ct::schnorr_sign | 7,574 µs | 132/s | 1.23× |
All integrity checks pass. libsecp256k1 v0.7.2 compared on same hardware.
ESP32-P4 Benchmarks (Embedded)
Hardware: ESP32-P4 (RISC-V RV32IMAC Dual HP Core @ 360 MHz), rev 1.3
OS: ESP-IDF v5.4, GCC 14.2.0
Field: 10×26 (32-bit native)
Measured: 2026-03-21, median of 3 runs
| Operation | Time | ops/sec | vs libsecp |
|---|---|---|---|
| field_mul | 2,424 ns | 413 k/s | — |
| field_sqr | 2,218 ns | 451 k/s | — |
| field_add | 318 ns | 3.14 M/s | — |
| field_inv | 73.1 µs | 13.7 k/s | — |
| pubkey_create (k×G) | 2,253 µs | 444/s | 0.94× |
| k×P (arbitrary) | 5,256 µs | 190/s | — |
| a×G + b×P (Shamir) | 7,550 µs | 132/s | — |
| point_add | 128.8 µs | 7.8 k/s | — |
| point_dbl | 103.6 µs | 9.7 k/s | — |
| ecdsa_sign | 2,588 µs | 386/s | 0.97× |
| ecdsa_verify | 7,528 µs | 133/s | 0.99× |
| schnorr_sign (keypair) | 2,293 µs | 436/s | 0.96× |
| schnorr_verify | 8,052 µs | 124/s | 0.93× |
| ct::ecdsa_sign | 5,680 µs | 176/s | 0.44× |
| ct::schnorr_sign | 2,528 µs | 396/s | 1.10× |
All integrity checks pass. Note: FAST path is at near-parity with libsecp on P4
(P4 RISC-V microarch lacks the wide multiply throughput of Xtensa LX7).
ESP32-C6 Benchmarks (Embedded)
Hardware: ESP32-C6 (RISC-V RV32IMAC Single Core @ 160 MHz), rev 0.2
OS: ESP-IDF v5.4, GCC 14.2.0
Field: 10×26 (32-bit native)
Measured: 2026-03-21, median of 3 runs
| Operation | Time | ops/sec | vs libsecp |
|---|---|---|---|
| field_mul | 5,974 ns | 167 k/s | — |
| field_sqr | 5,328 ns | 188 k/s | — |
| field_add | 784 ns | 1.28 M/s | — |
| field_inv | 171.1 µs | 5.8 k/s | — |
| pubkey_create (k×G) | 5,483 µs | 182/s | 1.70× |
| k×P (arbitrary) | 12,682 µs | 79/s | — |
| point_add | 296.5 µs | 3.4 k/s | — |
| point_dbl | 238.1 µs | 4.2 k/s | — |
| ecdsa_sign | 7,464 µs | 134/s | 1.67× |
| ecdsa_verify | 18,957 µs | 53/s | 0.98× |
| schnorr_sign (keypair) | 5,855 µs | 171/s | 2.01× |
| schnorr_verify | 20,278 µs | 49/s | 1.03× |
| ct::ecdsa_sign | 15,522 µs | 64/s | 0.80× |
| ct::schnorr_sign | 6,782 µs | 147/s | 1.73× |
All integrity checks pass.
ESP32-PICO-D4 Benchmarks (Embedded)
Hardware: ESP32-PICO-D4 (Xtensa LX6 Dual Core @ 240 MHz)
OS: ESP-IDF v5.5.1
Assembly: None (portable C++, no __int128)
| Operation | Time | Notes |
|---|---|---|
| Field Mul | 6,993 ns | |
| Field Square | 6,247 ns | |
| Field Add | 985 ns | |
| Field Inv | 609 us | |
| Scalar x G | 6,203 us | Generator mul |
| CT Scalar x G | 44,810 us | Constant-time |
| CT Add (complete) | 249,672 ns | |
| CT Dbl | 87,113 ns | |
| CT/Fast ratio | 6.5x | |
All 35 self-tests + 8 CT tests pass.
STM32F103 Benchmarks (Embedded)
Hardware: STM32F103ZET6 (ARM Cortex-M3 @ 72 MHz)
Compiler: ARM GCC 13.3.1, -O3
Assembly: ARM Cortex-M3 inline (UMULL/ADDS/ADCS)
| Operation | Time | Notes |
|---|---|---|
| Field Mul | 15,331 ns | ARM inline asm |
| Field Square | 12,083 ns | ARM inline asm |
| Field Add | 4,139 ns | Portable C++ |
| Field Inv | 1,645 us | |
| Scalar x G | 37,982 us | Generator mul |
All 35 library self-tests pass.
Embedded Cross-Platform Comparison
| Operation | ESP32-S3 (LX7) | ESP32-P4 (RV32) | ESP32-C6 (RV32) | ESP32 (LX6) | STM32F103 (M3) |
|---|---|---|---|---|---|
| Clock | 240 MHz | 360 MHz | 160 MHz | 240 MHz | 72 MHz |
| Field Mul | 5,910 ns | 2,424 ns | 5,974 ns | 6,993 ns | 15,331 ns |
| Field Square | 4,848 ns | 2,218 ns | 5,328 ns | 6,247 ns | 12,083 ns |
| Field Add | 572 ns | 318 ns | 784 ns | 985 ns | 4,139 ns |
| Field Inv | 130 µs | 73 µs | 171 µs | 609 µs | 1,645 µs |
| k×G (pubkey) | 6,134 µs | 2,253 µs | 5,483 µs | 6,203 µs | 37,982 µs |
| ECDSA sign | 7,443 µs | 2,588 µs | 7,464 µs | — | — |
| ECDSA verify | 18,670 µs | 7,528 µs | 18,957 µs | — | — |
| Schnorr verify | 19,947 µs | 8,052 µs | 20,278 µs | — | — |
| vs libsecp (verify) | 1.70× | 0.99× | 0.98× | — | — |
Specialized Benchmark Results (Windows x64, Clang 21.1.0)
Field Representation Comparison (5x52 vs 4x64)
5x52 uses __int128 with lazy carry reduction -- fewer normalizations = faster chains.
| Operation | 4x64 (ns) | 5x52 (ns) | 5x52 Speedup |
|---|---|---|---|
| Multiplication | 41.9 | 15.2 | 2.76x |
| Squaring | 31.2 | 12.8 | 2.44x |
| Addition | 4.3 | 1.6 | 2.69x |
| Negation | 7.6 | 2.4 | 3.13x |
| Add chain (4 ops) | 33.2 | 8.6 | 3.84x |
| Add chain (8 ops) | 65.4 | 16.4 | 3.98x |
| Add chain (16 ops) | 137.7 | 30.3 | 4.55x |
| Add chain (32 ops) | 285.9 | 57.0 | 5.01x |
| Add chain (64 ops) | 566.8 | 117.1 | 4.84x |
| Point-Add simulation | 428.3 | 174.8 | 2.45x |
| 256 squarings | 9,039 | 4,055 | 2.23x |
Conclusion: 5x52 is 2.2-5.0x faster across the measured operations. The advantage grows for addition-heavy chains, where lazy reduction amortizes the normalization cost.
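The add-chain rows reflect the lazy carry idea: with 52-bit limbs stored in 64-bit words, each limb has 12 spare bits, so several additions can be accumulated before a single carry pass. The sketch below illustrates only that carry handling. Fe52, add_lazy, and carry are hypothetical names, reduction modulo the field prime is deliberately omitted, and this is not the library's field code.

```cpp
#include <cstdint>

// value = n[0] + n[1]*2^52 + n[2]*2^104 + n[3]*2^156 + n[4]*2^208
struct Fe52 {
    std::uint64_t n[5];
};

constexpr std::uint64_t MASK52 = (1ULL << 52) - 1;

// Lazy add: five independent limb additions with no carry propagation.
// Safe while every limb stays below 2^64, i.e. for a bounded number of
// additions of carried inputs.
Fe52 add_lazy(const Fe52& a, const Fe52& b) {
    Fe52 r;
    for (int i = 0; i < 5; ++i) r.n[i] = a.n[i] + b.n[i];
    return r;
}

// Carry pass: move the overflow of each limb into the next one.
// The top limb keeps whatever is left; modular reduction is omitted.
Fe52 carry(const Fe52& a) {
    Fe52 r = a;
    for (int i = 0; i < 4; ++i) {
        r.n[i + 1] += r.n[i] >> 52;
        r.n[i] &= MASK52;
    }
    return r;
}
```

Summing eight values with add_lazy and calling carry once yields the same limbs as calling carry after every addition, but with one carry pass instead of seven. A 4x64 representation has no spare bits, so it must propagate carries on every addition.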
Field Representation Comparison (10x26 vs 4x64)
10x26 is the 32-bit target representation -- useful for embedded and GPU where 64-bit multiply is expensive.
| Operation | 4x64 (ns) | 10x26 (ns) | 10x26 Speedup |
|---|---|---|---|
| Addition | 4.7 | 1.8 | 2.57x |
| Multiplication | ~39 | ~39 | ~1x (tie) |
| Add chain (16 ops) | -- | -- | ~3.3x |
Constant-Time (CT) Layer Performance
CT layer provides side-channel resistance at the cost of performance.
| Operation | Fast | CT | Overhead |
|---|---|---|---|
| Field Mul | 36 ns | 55 ns | 1.50x |
| Field Square | 34 ns | 43 ns | 1.28x |
| Field Inverse | 3.0 us | 14.2 us | 4.80x |
| Scalar Add | 3 ns | 10 ns | 3.02x |
| Scalar Sub | 2 ns | 10 ns | 6.33x |
| Point Add | 0.65 us | 1.63 us | 2.50x |
| Point Double | 0.36 us | 0.67 us | 1.88x |
| Scalar Mul (kxP) | 130 us | 322 us | 2.49x |
| Generator Mul (kxG) | 7.6 us | 310 us | 40.8x |
The generator-mul overhead (about 41x) is high because the CT path disables the precomputed variable-time table lookups. For signing with side-channel requirements, the CT scalar mul overhead (2.49x) is the relevant metric.
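To illustrate why constant-time table access costs more than a direct index, here is a minimal sketch of one common technique: scan every entry and keep the wanted one with a mask. The names `Entry` and `ct_lookup` are hypothetical and are not taken from the library; real constant-time code also needs to be checked at the assembly level, since compilers may reintroduce branches.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical table entry: four 64-bit limbs.
struct Entry { std::uint64_t limb[4]; };

// Returns table[index] while reading every entry, so the memory access
// pattern does not depend on the (secret) index. Requires index < count.
inline Entry ct_lookup(const Entry* table, std::size_t count, std::size_t index) {
    Entry r{};
    for (std::size_t i = 0; i < count; ++i) {
        // diff is zero only when i == index.
        const std::uint64_t diff = static_cast<std::uint64_t>(i ^ index);
        // mask is all-ones when diff == 0, otherwise zero.
        const std::uint64_t mask = ((diff | (~diff + 1)) >> 63) - 1;
        for (int j = 0; j < 4; ++j) r.limb[j] |= table[i].limb[j] & mask;
    }
    return r;
}
```

The cost is linear in the table size, which is why large precomputed tables that are cheap on the variable-time path become expensive under constant-time access.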
Multi-Scalar Multiplication (ECDSA Verify Path)
| Method | Time | Description |
|---|---|---|
| Separate (prod-like) | 137.4 us | k_1xG (precompute) + k_2xQ (variable-base) |
| Separate (variable) | 351.5 us | Both via fixed-window variable-base |
| Shamir interleaved | 155.2 us | Merged stream -- fewer doublings |
| Windowed Shamir | 9.2 us | Optimized multi-scalar |
| JSF (Joint Sparse Form) | 9.5 us | Joint encoding of both scalars |
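The "merged stream" idea behind the Shamir rows can be shown on a toy group. The sketch below uses integers modulo a small prime under addition as a stand-in for curve points, so `grp_add`, `grp_dbl`, and `shamir_mul` are illustrative names only and say nothing about the library's actual implementation. It is variable-time and therefore only suitable for public inputs, as in verification.

```cpp
#include <cassert>
#include <cstdint>

// Toy group for illustration only: integers modulo M under addition.
constexpr std::uint64_t M = 1000000007ULL;
inline std::uint64_t grp_add(std::uint64_t x, std::uint64_t y) { return (x + y) % M; }
inline std::uint64_t grp_dbl(std::uint64_t x) { return (2 * x) % M; }

// Computes a*P + b*Q with one shared doubling per bit instead of two
// separate double-and-add passes. Assumes P < M and Q < M.
inline std::uint64_t shamir_mul(std::uint32_t a, std::uint64_t P,
                                std::uint32_t b, std::uint64_t Q) {
    const std::uint64_t PQ = grp_add(P, Q);  // precomputed P + Q
    std::uint64_t acc = 0;                   // group identity
    for (int i = 31; i >= 0; --i) {
        acc = grp_dbl(acc);
        const unsigned sel = ((a >> i) & 1u) | (((b >> i) & 1u) << 1);
        if (sel == 1) acc = grp_add(acc, P);
        else if (sel == 2) acc = grp_add(acc, Q);
        else if (sel == 3) acc = grp_add(acc, PQ);
    }
    return acc;
}
```

Two separate 32-bit multiplications would need 64 doublings; the interleaved loop needs 32, at the price of one precomputed sum.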
Atomic ECC Building Blocks
| Operation | Time | Formula Cost |
|---|---|---|
| Point Add (immutable) | 959 ns | 12M + 4S + alloc |
| Point Add (in-place) | 1,859 ns | 12M + 4S |
| Point Double (immutable) | 673 ns | 4M + 4S + alloc |
| Point Double (in-place) | 890 ns | 4M + 4S |
| Point Negation | 11 ns | Y := -Y |
| Point Triple | 1,585 ns | 2xP + P |
| To Affine conversion | 15,389 ns | 1 inverse + 2-3 mul |
| Field S/M ratio | 0.818 | (ideal: ~0.80) |
| Field I/M ratio | 78x | Inverse is expensive -- use Jacobian! |
Zero-Knowledge Proof Benchmarks (CPU)
Hardware: Intel Core i5-14400F (P-core, Raptor Lake)
Compiler: Clang 19.1.7, -O3 -march=native
Methodology: 11 passes, IQR outlier removal, median, 64-key pool, pinned core
ZK Proof Operations
| Operation | Time/Op | Throughput | Notes |
|---|---|---|---|
| Pedersen Commit | 29.7 us | 33,670 op/s | vH + rG (two scalar muls) |
| Knowledge Prove | 24.3 us | 41,152 op/s | Non-interactive Schnorr sigma, CT path |
| Knowledge Verify | 23.8 us | 42,017 op/s | sG == R + eP, FAST path |
| DLEQ Prove | 42.4 us | 23,585 op/s | Discrete log equality, CT path |
| DLEQ Verify | 60.6 us | 16,502 op/s | Two-base verification, FAST path |
| Range Prove (64-bit) | 13,619 us | 73 op/s | Bulletproof prover, CT path |
| Range Verify (64-bit) | 2,670 us | 375 op/s | MSM-optimized verifier, FAST path |
Range Verify Optimization (v3.22+)
The Bulletproof verifier was optimized with multi-scalar multiplication (MSM):
| Optimization | Technique | Speedup |
|---|---|---|
| Polynomial check | 5-point MSM (delta, t_hatG, tau_xH, -T1, -T2) | Reduced from 3 scalar muls |
| P_check + expected merge | 144-point MSM (64 G_i, 64 H_i, 12 L_j, 12 R_j, A, S, ...) | Single MSM vs 128+ individual muls |
| s_coeff computation | Montgomery batch inversion (1 inv + 126 muls vs 64 inversions) | ~64x fewer inversions |
| Total | Combined MSM + batch inversion | 1.93x (5,079 -> 2,634 us) |
Pippenger MSM is used when point count > 64. For the prover, individual GLV-optimized scalar multiplications remain faster than MSM for the 129-point workload.
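The batch-inversion trick listed in the table can be sketched over a toy prime field. This is an illustration under stated assumptions: the modulus is a small prime chosen so that products fit in 64 bits, and `mul`, `inv`, and `batch_inverse` are hypothetical helpers, not the library's API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint64_t P = 1000000007ULL;  // toy prime, products fit in 64 bits

inline std::uint64_t mul(std::uint64_t a, std::uint64_t b) { return (a * b) % P; }

// Single inversion via Fermat's little theorem: a^(P-2) mod P.
inline std::uint64_t inv(std::uint64_t a) {
    std::uint64_t r = 1, e = P - 2;
    while (e) {
        if (e & 1) r = mul(r, a);
        a = mul(a, a);
        e >>= 1;
    }
    return r;
}

// Inverts every (nonzero) element using one inversion plus a linear
// number of multiplications.
inline std::vector<std::uint64_t> batch_inverse(const std::vector<std::uint64_t>& a) {
    const std::size_t n = a.size();
    std::vector<std::uint64_t> out(n);
    if (n == 0) return out;
    out[0] = a[0];
    for (std::size_t i = 1; i < n; ++i) out[i] = mul(out[i - 1], a[i]);  // prefix products
    std::uint64_t acc = inv(out[n - 1]);  // inverse of the full product
    for (std::size_t i = n - 1; i > 0; --i) {
        out[i] = mul(acc, out[i - 1]);  // (a0..ai)^-1 * (a0..a(i-1)) = ai^-1
        acc = mul(acc, a[i]);           // now (a0..a(i-1))^-1
    }
    out[0] = acc;
    return out;
}
```

Because a field inversion costs far more than a multiplication, replacing 64 inversions with one is worthwhile even though extra multiplications are added.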
BIP-324 Encrypted Transport Benchmarks
BIP-324 implements encrypted, authenticated peer-to-peer communication
for Bitcoin (v2 transport). Numbers below are from bench_unified --quick
on x86-64 (i5, Clang 19, AVX2, single core pinned).
Primitives
| Operation | ns/op | Throughput |
|---|---|---|
| HKDF-SHA256 extract | ~124 | ~8.1 M op/s |
| HKDF-SHA256 expand | ~135 | ~7.4 M op/s |
| AEAD encrypt (256 B) | ~460 | ~2.2 M op/s |
| AEAD decrypt (256 B) | ~470 | ~2.1 M op/s |
Elliptic-Curve Transport Setup
| Operation | µs/op | Throughput |
|---|---|---|
| ElligatorSwift create | ~46 | ~21.5 k op/s |
| ElligatorSwift XDH (ECDH) | ~30 | ~32.9 k op/s |
| Session handshake (full) | ~167 | ~6.0 k op/s |
Session Data Path
| Operation | ns/op | Throughput |
|---|---|---|
| Session encrypt (256 B) | ~558 | ~1.8 M op/s |
| Session decrypt (256 B) | ~1,136 | ~881 k op/s |
| Session encrypt (1 KB) | ~1,627 | ~614 k op/s |
| Session roundtrip (256 B) | ~1,136 | ~881 k op/s |
CUDA GPU Comparison
See BENCHMARK_BIP324_GPU.md for detailed CUDA transport benchmarks. Summary: CUDA achieves ~30× throughput over a single CPU core for bulk packet encryption.
Available Benchmark Targets
All targets are registered in CMake. Build with cmake --build build -j, then run from build/cpu/.
| Target | What It Measures |
|---|---|
| bench_unified | THE standard: primitives + CT + batch verify + Ethereum + ZK + BIP-324 + real-world wallet/protocol flows, with apple-to-apple comparison vs libsecp256k1 + OpenSSL |
| bench_bip324_transport | BIP-324 transport simulation: mixed payloads, decoy packets, latency histograms, TCP socket roundtrip |
| bench_ct | Fast (fast::) vs Constant-Time (ct::) layer comparison |
| bench_field_52 | 5x52 field arithmetic micro-benchmarks |
| bench_field_26 | 10x26 field arithmetic micro-benchmarks |
| bench_kP | Scalar multiplication (k*P) benchmarks |
| bench_zk (CUDA) | GPU ZK proof benchmarks: Knowledge, DLEQ, Pedersen, Bulletproof |
Benchmark Methodology
CPU Benchmarks
- Warm-up: 1 iteration discarded
- Measurement: 3 iterations, take median
- Timer: std::chrono::high_resolution_clock
- Compiler flags: -O3 -march=native
bench_unified additionally reports workflow-level operations such as HD
derivation, Taproot key tweaking, ECDH, and Silent Payments so primitive
performance can be interpreted in a wallet and protocol context.
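The CPU procedure listed above (one discarded warm-up run, three measured runs, median) can be sketched with std::chrono as follows. The helper name `median_ns_per_op` and its parameters are hypothetical; this is not the harness in bench_unified, only an illustration of the stated procedure.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <chrono>
#include <cstdint>

// Times `op` with 1 discarded warm-up run and 3 measured runs, and
// returns the median time per call in nanoseconds.
template <typename F>
double median_ns_per_op(F&& op, std::uint64_t calls_per_run) {
    assert(calls_per_run > 0);
    using clock = std::chrono::high_resolution_clock;
    auto run = [&]() {
        const auto t0 = clock::now();
        for (std::uint64_t i = 0; i < calls_per_run; ++i) op();
        const auto t1 = clock::now();
        const double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        return ns / static_cast<double>(calls_per_run);
    };
    (void)run();  // warm-up, result discarded
    std::array<double, 3> samples{run(), run(), run()};
    std::sort(samples.begin(), samples.end());
    return samples[1];  // median of three
}
```

A call such as `median_ns_per_op([&] { sink = field_mul(a, b); }, 100000)` would report one number per primitive, where `field_mul` and `sink` stand in for the operation under test and a variable that keeps the compiler from discarding the result.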
CUDA Benchmarks
- Warm-up: 5-10 kernel launches discarded
- Measurement: 11 passes, median
- Timer: CUDA events
- Sync: Full device synchronization between measurements
CUDA ZK Benchmarks
- Warm-up: 5 kernel launches discarded
- Measurement: 11 passes, median
- Timer: CUDA events (ns/op = elapsed_ms * 1e6 / batch_size)
- Correctness: 0/4096 verify failures (Knowledge/DLEQ), 0/256 (Bulletproof) required before timing
- Batch sizes: Knowledge/DLEQ/Pedersen = 4096, Bulletproof = 256
- Setup: Precomputed pubkeys + Bulletproof generators (not included in timing)
Reproducibility
# Run CPU benchmark (includes ZK section)
./build/cpu/bench_unified
# Run the full unified suite explicitly
./build/cpu/bench_unified --suite all
# Quick smoke / CI-style run
./build/cpu/bench_unified --quick
# Run CUDA ECC benchmark
./build/cuda/secp256k1_cuda_bench
# Run CUDA ZK benchmark
./build/cuda/bench_zk
# Results saved to: benchmark-<platform>-<date>.txt
Optimization History
RISC-V Timeline
| Date | Field Mul | Scalar Mul | Change |
|---|---|---|---|
| 2026-02-11 | 307 ns | 954 us | Initial |
| 2026-02-12 | 205 ns | 676 us | Carry optimization |
| 2026-02-13 | 198 ns | 672 us | Square optimization |
| 2026-02-13 | 198 ns | 672 us | Current |
Key Optimizations Applied
- Branchless field operations - Eliminates unpredictable branches
- Optimized carry propagation - Reduces instruction count
- Dedicated squaring routine - 25% fewer multiplications than generic mul
- GLV decomposition - ~50% reduction in scalar bits
- wNAF encoding - ~33% fewer point additions
- Precomputed tables - Generator multiplication 10x faster
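As an illustration of the wNAF item in the list above, the sketch below recodes a small scalar into width-w non-adjacent form. It is a textbook version limited to 64-bit scalars, with a hypothetical function name `wnaf`; the library's own encoder for 256-bit scalars is not shown in this document.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Width-w NAF digits, least-significant first. Each nonzero digit is odd
// and lies in (-2^(w-1), 2^(w-1)). Assumes 2 <= w <= 16 and k < 2^63.
inline std::vector<int> wnaf(std::uint64_t k, int w) {
    assert(w >= 2 && w <= 16);
    assert(k < (1ULL << 63));
    std::vector<int> digits;
    const std::int64_t full = std::int64_t{1} << w;
    while (k != 0) {
        std::int64_t d = 0;
        if (k & 1) {
            d = static_cast<std::int64_t>(k & static_cast<std::uint64_t>(full - 1));
            if (d >= full / 2) d -= full;  // pick the signed representative
            // Remove the digit so the following w-1 bits become zero.
            if (d < 0) k += static_cast<std::uint64_t>(-d);
            else k -= static_cast<std::uint64_t>(d);
        }
        digits.push_back(static_cast<int>(d));
        k >>= 1;
    }
    return digits;
}
```

Each nonzero digit costs one point addition from a small table of odd multiples, and every nonzero digit is followed by at least w-1 zeros, which is where the reduction in additions comes from.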
Apple-to-Apple: UltrafastSecp256k1 vs bitcoin-core/libsecp256k1
A head-to-head comparison using an identical benchmark harness (same timer, warm-up, and statistical methodology) for both libraries. Both libraries are compiled from source, linked into a single binary, and measured under the same conditions.
Methodology
- Harness: 3 s CPU frequency ramp-up, 500 warmup iterations per operation, 11 measurement passes, IQR outlier removal, median reported.
- Timer: RDTSCP (serialising, sub-ns precision on x86-64).
- Data pool: 64 independent key / message / signature sets, round-robin indexed to defeat branch-predictor / cache training on a single input.
- Pinning: Single core, taskset -c 0, SCHED_FIFO where available.
- Compiler parity: Both libraries compiled with the same compiler, same -O3 -march=native flags, same link step.
- Source: bench_unified.cpp -- open-source, fully reproducible.
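The statistical step of the harness (IQR outlier removal followed by the median) can be sketched as below. The 1.5 x IQR fences are the conventional Tukey choice and are an assumption here, since the methodology does not state the multiplier; `median_sorted` and `iqr_filtered_median` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of the sorted sub-range v[lo, hi). Requires hi > lo.
inline double median_sorted(const std::vector<double>& v, std::size_t lo, std::size_t hi) {
    const std::size_t n = hi - lo;
    const std::size_t mid = lo + n / 2;
    return (n % 2) ? v[mid] : 0.5 * (v[mid - 1] + v[mid]);
}

// Drops samples outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] and returns the
// median of the remaining samples.
inline double iqr_filtered_median(std::vector<double> s) {
    assert(!s.empty());
    std::sort(s.begin(), s.end());
    const std::size_t n = s.size();
    if (n < 4) return median_sorted(s, 0, n);  // too few samples for quartiles
    const double q1 = median_sorted(s, 0, n / 2);        // median of lower half
    const double q3 = median_sorted(s, (n + 1) / 2, n);  // median of upper half
    const double lo = q1 - 1.5 * (q3 - q1);
    const double hi = q3 + 1.5 * (q3 - q1);
    std::vector<double> kept;
    for (double x : s) {
        if (x >= lo && x <= hi) kept.push_back(x);
    }
    return median_sorted(kept, 0, kept.size());
}
```

With 11 passes, a single pass disturbed by an interrupt or a frequency change falls outside the fences and no longer influences the reported value.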
Platform 1 -- Intel Core i5-14400F (Raptor Lake)
| Detail | Value |
|---|---|
| CPU | Intel Core i5-14400F (P-core, Raptor Lake) |
| Microarchitecture | Golden Cove (P-core), 32 KB L1i, 48 KB L1d, 1.25 MB L2 |
| TSC frequency | 2.497 GHz |
| OS | Ubuntu 24.04 LTS, kernel 6.x |
| Compiler | GCC 14.2.0, -O3 -march=native -fno-exceptions -fno-rtti |
| ISA features | BMI2 (MULX), ADX, AVX2, SHA-NI |
| libsecp256k1 | v0.7.x (latest master, 5x52 + exhaustive GLV Strauss) |
| UltrafastSecp256k1 | v3.16.0, 5x52 limb layout, __int128 field arithmetic |
| Assembly | Both libraries: GCC __int128 -> auto-generated MULX code |
FAST Path (variable-time, non-secret inputs)
| Operation | Ultra (ns) | libsecp (ns) | Speedup | Notes |
|---|---|---|---|---|
| Generator x k (pubkey_create) | 6,730 | 11,362 | 1.69x | W=15 comb vs W=15 Strauss |
| ECDSA Sign | 8,989 | 15,631 | 1.74x | Includes k^-1 (safegcd) |
| ECDSA Verify | 21,324 | 23,306 | 1.09x | Identical Strauss algorithm |
| Schnorr Keypair Create | 10,522 | 11,228 | 1.07x | |
| Schnorr Sign (BIP-340) | 8,443 | 12,255 | 1.45x | Includes SHA-256 challenge |
| Schnorr Verify (BIP-340) | 21,151 | 22,642 | 1.07x | Includes lift_x + SHA-256 |
CT Path (constant-time, for secret inputs -- true apples-to-apples)
libsecp256k1 is constant-time by design, so this comparison is the fairest:
| Operation | Ultra CT (ns) | libsecp (ns) | Speedup |
|---|---|---|---|
| ECDSA Sign | 13,431 | 15,631 | 1.16x |
| ECDSA Verify | 21,324 | 23,306 | 1.09x |
| Schnorr Sign (BIP-340) | 11,393 | 12,255 | 1.08x |
| Schnorr Verify (BIP-340) | 21,151 | 22,642 | 1.07x |
Throughput (single core)
| Operation | Ultra FAST | Ultra CT | libsecp |
|---|---|---|---|
| ECDSA sign | 111.3k op/s | 74.5k op/s | 64.0k op/s |
| ECDSA verify | 46.9k op/s | -- | 42.9k op/s |
| Schnorr sign | 118.4k op/s | 87.8k op/s | 81.6k op/s |
| Schnorr verify | 47.3k op/s | -- | 44.2k op/s |
| pubkey_create (k x G) | 148.6k op/s | -- | 88.0k op/s |
Bitcoin Block Validation (1 core estimate)
| Block type | Ultra | libsecp | Speedup |
|---|---|---|---|
| Pre-Taproot (~3000 ECDSA verify) | 64.0 ms | 69.9 ms | 1.09x |
| Taproot (~2000 Schnorr + ~1000 ECDSA) | 63.6 ms | 68.6 ms | 1.08x |
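The Ultra column and the Pre-Taproot row are consistent with multiplying the verify counts by the FAST-path verify latencies reported above. The sketch below reproduces that arithmetic; the helper name `block_ms` is hypothetical, and the assumption that the estimate ignores hashing and script overhead is ours.

```cpp
#include <cassert>
#include <cmath>

// Estimated single-core signature-check time for one block, in ms:
// (Schnorr verifies * ns each + ECDSA verifies * ns each) / 1e6.
inline double block_ms(int schnorr, double schnorr_ns, int ecdsa, double ecdsa_ns) {
    return (schnorr * schnorr_ns + ecdsa * ecdsa_ns) / 1e6;
}
```

For example, `block_ms(0, 0.0, 3000, 21324.0)` gives about 64.0 ms for a Pre-Taproot block with UltrafastSecp256k1, and `block_ms(2000, 21151.0, 1000, 21324.0)` gives about 63.6 ms for the Taproot mix.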
Field Micro-ops
| Operation | Ultra (ns) | Notes |
|---|---|---|
| FE52 mul | 12.8 | 5x52, __int128 -> MULX |
| FE52 sqr | 9.5 | Dedicated squaring |
| FE52 add | 8.1 | |
| FE52 sub | 5.5 | |
| FE52 negate | 6.0 | |
| FE52 inverse (safegcd) | 666.8 | Bernstein-Yang, __builtin_ctzll |
| Scalar mul | 23.2 | 4x64 |
| Scalar inverse (safegcd) | 843.1 | |
| GLV decomposition | 146.0 | Lattice-based |
Platform 2 -- StarFive VisionFive 2 (RISC-V 64)
| Detail | Value |
|---|---|
| CPU | SiFive U74-MC (quad-core RV64GC) |
| Microarchitecture | SiFive U74, dual-issue in-order, 32 KB L1i, 32 KB L1d |
| ISA extensions | rv64gc + Zba (address), Zbb (bit-manipulation) |
| Clock | ~1.5 GHz (StarFive JH7110 SoC) |
| OS | Debian (StarFive kernel 6.6.20) |
| Compiler | Clang 21.1.8, -O3 -march=rv64gcv_zba_zbb |
| libsecp256k1 | v0.7.x (latest master) |
| UltrafastSecp256k1 | v3.16.0, 5x52 limb layout, __int128 field arithmetic |
| Assembly | Both libraries: __int128 -> compiler-generated MUL/MULHU |
FAST Path (variable-time, non-secret inputs)
| Operation | Ultra (ns) | libsecp (ns) | Speedup | Notes |
|---|---|---|---|---|
| Generator x k (pubkey_create) | 39,764 | 95,341 | 2.40x | W=15 comb vs W=15 Strauss |
| ECDSA Sign | 73,784 | 138,128 | 1.87x | Includes k^-1 (safegcd) |
| ECDSA Verify | 180,511 | 201,135 | 1.11x | Identical Strauss algorithm |
| Schnorr Keypair Create | 45,873 | 95,946 | 2.09x | |
| Schnorr Sign (BIP-340) | 53,957 | 105,310 | 1.95x | Includes SHA-256 challenge |
| Schnorr Verify (BIP-340) | 185,487 | 204,944 | 1.10x | Includes lift_x + SHA-256 |
CT Path (constant-time, for secret inputs -- true apples-to-apples)
| Operation | Ultra CT (ns) | libsecp (ns) | Speedup |
|---|---|---|---|
| ECDSA Sign | 131,177 | 138,818 | 1.06x |
| ECDSA Verify | 181,837 | 204,594 | 1.13x |
| Schnorr Sign (BIP-340) | 110,926 | 106,139 | 0.96x |
| Schnorr Verify (BIP-340) | 186,944 | 208,525 | 1.12x |
Throughput (single core)
| Operation | Ultra FAST | Ultra CT | libsecp |
|---|---|---|---|
| ECDSA sign | 13.5k op/s | 7.6k op/s | 7.2k op/s |
| ECDSA verify | 5.5k op/s | -- | 4.9k op/s |
| Schnorr sign | 18.4k op/s | 9.0k op/s | 9.4k op/s |
| Schnorr verify | 5.3k op/s | -- | 4.8k op/s |
| pubkey_create (k x G) | 24.9k op/s | -- | 10.5k op/s |
Bitcoin Block Validation (1 core estimate)
| Block type | Ultra | libsecp | Speedup |
|---|---|---|---|
| Pre-Taproot (~3000 ECDSA verify) | 545.5 ms | 613.8 ms | 1.13x |
| Taproot (~2000 Schnorr + ~1000 ECDSA) | 555.7 ms | 621.6 ms | 1.12x |
Field Micro-ops
| Operation | Ultra (ns) | Notes |
|---|---|---|
| FE52 mul | 176.2 | 5x52, __int128 -> MUL/MULHU |
| FE52 sqr | 166.8 | Dedicated squaring |
| FE52 add | 42.1 | |
| FE52 sub | 34.7 | |
| FE52 negate | 42.7 | |
| FE52 inverse (safegcd) | 4,697.6 | Bernstein-Yang |
| Scalar mul | 147.5 | 4x64 |
| Scalar inverse (safegcd) | 3,698.9 | |
| GLV decomposition | 851.3 | Lattice-based |
RISC-V Notes
- The U74 is a dual-issue in-order core -- no out-of-order execution, no speculative execution, no branch prediction beyond basic BTB.
- Despite this, the precomputed comb table gives a 2.4x generator speedup, showing the optimization is algorithmic (fewer point additions) not microarchitecture-dependent.
- CT generator_mul uses an 11-block comb (COMB_BLOCKS=11, COMB_SPACING=4) with a ~31 KB table that fits in the U74's 32 KB L1D cache. This gives a 1.04x advantage over libsecp's generator_mul (91.4 us vs 95.4 us).
- CT ECDSA Sign wins 1.06x. CT Schnorr Sign is 0.96x due to auxiliary overhead (SHA-256, nonce derivation) not related to the core ECC operation.
- Verify speedups (1.12-1.13x) come from the same L1 icache optimization as x86 (called vs inlined additions) plus branchless conditional negate.
Key Optimisations (vs libsecp256k1)
- Precomputed generator table -- 8192-entry comb table for k x G (6.7 us vs 11.4 us on x86; 39.8 us vs 95.3 us on RV64)
- Force-inlined doubling -- jac52_double_inplace always-inline in hot loop
- Called (not inlined) additions -- Reduced ecmult function from 124 KB to 39 KB, fitting the hot loop in L1 I-cache (1.5 KB loop body vs 32 KB I-cache)
- Branchless conditional negate -- XOR-select in Strauss loop eliminates 50% unpredictable sign branches
- Single affine conversion in Schnorr verify -- Merged X-check + Y-parity into one Z^-1 computation (saves 1 sqr + 1 mul + redundant parse)
- SW prefetch -- Prefetch G/H table entries before doublings
- 2M+5S doubling formula -- Saves 1M per double vs libsecp's 3M+4S
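The branchless conditional negate from the list above relies on an XOR-select between a value and its negation. A minimal sketch, assuming the negated coordinate is already available and using hypothetical names `Fe` and `select_xor` (the library's own types are not shown here):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 5-limb field element.
struct Fe { std::uint64_t n[5]; };

// Returns neg_y when flag == 1 and y when flag == 0, without a branch.
// flag must be exactly 0 or 1.
inline Fe select_xor(const Fe& y, const Fe& neg_y, std::uint64_t flag) {
    const std::uint64_t mask = std::uint64_t{0} - flag;  // 0 or all-ones
    Fe r;
    for (int i = 0; i < 5; ++i) {
        // y ^ ((y ^ neg_y) & mask) is y when mask == 0, neg_y when mask is all-ones.
        r.n[i] = y.n[i] ^ ((y.n[i] ^ neg_y.n[i]) & mask);
    }
    return r;
}
```

Because the sign of each wNAF digit is effectively random, a conditional branch on it is mispredicted about half the time; the select replaces that branch with a few logic instructions per limb.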
How to Reproduce
# Clone and build
git clone --recurse-submodules <repo>
cd Secp256K1fast/libs/UltrafastSecp256k1
cmake -S ../.. -B build_rel -G Ninja -DCMAKE_BUILD_TYPE=Release
cmake --build build_rel -j
# Run benchmark (pin to one core for stability)
taskset -c 0 build_rel/cpu/bench_unified
Contributing Benchmarks
We welcome benchmark contributions from other platforms. To add your results:
- Run taskset -c 0 build_rel/cpu/bench_unified (or equivalent pinning)
- Copy the full terminal output
- Open a PR adding a new "Platform N" subsection with your hardware details
Platforms we'd especially like to see: AMD Zen 4/5, Apple M-series (ARM64), AWS Graviton, AMD EPYC, Intel Xeon Sapphire Rapids, Milk-V Pioneer (C920).
Future Optimizations
Planned
- AVX-512 vectorization (x86-64)
- Multi-threaded batch operations
- ARM64 NEON/MUL assembly (DONE -- ~5x speedup)
- OpenCL backend (DONE -- 3.39M kG/s)
- Apple Metal backend (DONE -- 527M field_mul/s, M3 Pro)
- Shared POD types across backends
- ARM64 inline assembly (MUL/UMULH)
Experimental
- AVX-512 vectorization (x86-64)
- Multi-threaded batch operations
- Montgomery domain for CUDA (mixed results)
- 8x32-bit hybrid limb representation (DONE -- 1.10x faster mul)
- Constant-time side-channel resistance (CT layer implemented)
Version
UltrafastSecp256k1 v3.16.0
Benchmarks updated: 2026-03-02