UltrafastSecp256k1/docs/BENCHMARKS.md
2026-03-25 14:36:36 +00:00

52 KiB
Raw Permalink Blame History

Performance Benchmarks

Benchmark results for UltrafastSecp256k1 across all supported platforms.


Summary

Platform Field Mul Generator Mul Scalar Mul ECDSA Verify ZK Prove vs libsecp
x86-64 (i5-14400F, Clang 19) 12.8 ns 6.7 us 17.6 us 21.3 us 24.3 us 1.09x
x86-64 (Clang 21, Win) 17 ns (5x52) 5 us 25 us -- -- --
RISC-V 64 (SiFive U74, Clang 21) 176 ns 40.2 us 150.5 us 181.8 us -- 1.13x
ARM64 (RK3588, A76) 74 ns 14 us 131 us -- -- --
ESP32-S3 (LX7, 240 MHz) 5,910 ns 6,134 us 12,752 us 18,670 us -- 1.70× verify
ESP32-P4 (RV32, 360 MHz) 2,424 ns 2,253 us 5,256 us 7,528 us -- 1.01× verify
ESP32-C6 (RV32, 160 MHz) 5,974 ns 5,483 us 12,682 us 18,957 us -- 1.67× sign
ESP32 (LX6, 240 MHz) 6,993 ns 6,203 us -- -- -- --
STM32F103 (CM3, 72 MHz) 15,331 ns 37,982 us -- -- -- --
CUDA (RTX 5060 Ti) 0.2 ns 113.5 ns 97.7 ns 230.2 ns 258.6 ns --
CUDA (RTX 5070 Ti) 5.8 ns 92.1 ns 101.4 ns 122.8 ns -- --
OpenCL (RTX 5060 Ti) 0.2 ns 113.5 ns 97.7 ns 230.2 ns 258.6 ns --
Metal (Apple M3 Pro) 1.9 ns 3.00 us 2.94 us -- -- --

GPU rows use the latest retained local rerun per backend. The stable public GPU C ABI now exposes 13 backend-neutral operations, and CUDA, OpenCL, and Metal all implement that stable surface. Internal signing kernels and benchmark-only paths are tracked separately from the public GPU ABI.


Real-World Flow Coverage

bench_unified also measures higher-level wallet and protocol flows so the benchmark suite reflects product-shaped workloads, not only primitive-level ECC operations.

Covered flows include:

  • ecdh_compute and ecdh_compute_raw
  • taproot_output_key and taproot_tweak_privkey
  • bip32_master_key
  • coin_derive_key for standard Bitcoin HD paths
  • coin_address_from_seed end-to-end for Bitcoin and Ethereum
  • silent_payment_create_output
  • silent_payment_scan

Representative x86-64 / Linux Quick Snapshot

Quick sanity run from bench_unified --quick on the local x86-64 validation machine:

Flow Time
ECDH (ecdh_compute) 22.8 us
ECDH raw (ecdh_compute_raw) 20.5 us
Taproot output key 10.5 us
BIP-32 master key (64B seed) 1.2 us
BTC address from seed 93.4 us
ETH address from seed 93.4 us
Silent Payment create_output 24.7 us
Silent Payment scan 35.7 us

These values are mainly intended as workflow reference points. For publishable cross-machine comparisons, use the full pinned benchmark methodology and JSON artifacts from bench_unified.

x86-64 Full Rerun (2026-03-24, post-exploit-fix audit)

Run after 60-exploit-PoC audit (commit 8b25d420). No regression detected.
Machine: Intel Core i5-14400F · Linux · Clang 19.1.7 · TSC 2.501 GHz
Harness: bench_unified — 3 s warmup, 11 passes, IQR trimmed, median

Operation Ultra (ns/op) libsecp (ns/op) Ratio
field_mul 10.1 11.0 1.09×
field_sqr 9.0 8.6 0.97×
field_inv 746.6 775.2 1.04×
scalar_mul 16.0 19.9 1.25×
scalar_inv (CT) 776.2 1466.1 1.89×
pubkey_create (k·G) 5906 13102 2.22×
ecmult (a·P+b·G) 19429 19071 0.98×
compressed serialize 2.9 12.7 4.34×
ECDSA sign 7825 16314 2.08×
Schnorr sign 6258 12467 1.99×
ECDSA verify 20218 20507 1.01×
Schnorr verify (cached) 20741 20459 0.99×
CT ECDSA sign 12259 16314 1.33×
CT Schnorr sign 10411 12467 1.20×
ecdsa_sign_recoverable 7355 16211 2.20×
ecrecover 26801 24472 0.91×
SHA256 (tagged_hash) 62.7
Schnorr batch N=64 144876 total

No regressions vs previous rerun (2026-03-17). All 70/70 audit modules pass.

x86-64 Batch Verify Rerun (2026-03-17)

A retained low-risk x86 CPU improvement was keeping the Schnorr batch pubkey cache capacity aligned with the full batch size in cpu/src/batch_verify.cpp instead of clamping reserve capacity to 64 entries. This avoids avoidable vector reallocations when uncached batches grow beyond 64 signatures.

Quick reruns on the local i5-14400F validation machine showed the improvement on the uncached Schnorr path while preserving correctness (ctest -R 'comprehensive|multiscalar' PASS):

Operation Before After Delta
Schnorr batch verify N=128 20.27 us/sig 19.94-20.06 us/sig up to 1.6% faster
Schnorr batch verify N=192 18.56 us/sig 18.01-18.45 us/sig up to 3.0% faster

This change does not materially affect the cached-path benchmark; the measured win is specifically the uncached parse-and-resolve flow for larger Schnorr batches.

Cross-Platform Refresh Status (2026-03-18)

Recent retained reruns and validation passes across the active optimization campaign:

Platform Latest validated result Status
x86-64 / Linux Schnorr batch verify N=128: 19.94-20.06 us/sig, N=192: 18.01-18.45 us/sig Retained low-risk pubkey-cache reserve improvement
Android ARM64 / RK3588 ECDSA Sign 22.22 us, Schnorr Sign (precomputed) 16.67 us, CT ECDSA Sign 67.11 us Retained ARMv8 SHA2 dispatch win
OpenCL / RTX 5060 Ti kG (batch=65536) 115.1 ns, kP (batch=65536) 263.1 ns, kG (kernel) 98.7 ns Revalidated retained tuning; opencl_test and opencl_audit_runner passed
CUDA / RTX 5060 Ti k*G 129.5 ns at TPB 256; TPB 512 reached 128.5 ns but CT rows became invalid in the same harness No safe global retune retained yet
RISC-V / Milk-V Mars Latest native rerun remains the 2026-03-07 Mars baseline below Current local environment has toolchain but no runnable board/emulator path

This page keeps the last trustworthy result per platform. When a rerun only proves that an experiment is unstable or not worth shipping, it is recorded here but not promoted as a retained default.

The stable GPU host ABI in ufsecp_gpu.h now covers 13 backend-neutral batch operations, and the compiled CUDA, OpenCL, and Metal backends implement that stable surface. Internal kernel experiments, signing benchmarks, and backend-only test hooks may cover additional primitives beyond the public ABI, but they are documented separately from the stable host interface.


x86-64 Benchmarks

x86-64 / Linux (i5, Clang 19.1.7, AVX2)

Hardware: Intel Core i5 (AVX2, BMI2, ADX)
OS: Linux
Compiler: Clang 19.1.7
Assembly: x86-64 with BMI2/ADX intrinsics
SIMD: AVX2

Operation Time Notes
Field Mul 33 ns Using mulx/adcx/adox
Field Square 32 ns Optimized squaring
Field Add 11 ns
Field Sub 12 ns
Field Inverse 5 us Fermat's little theorem
Point Add 521 ns Jacobian coordinates
Point Double 278 ns
Point Scalar Mul 110 us GLV + wNAF
Generator Mul 5 us Precomputed tables
Batch Inverse (n=100) 140 ns/elem Montgomery's trick
Batch Inverse (n=1000) 92 ns/elem

x86-64 / Windows (Clang 21.1.0, AVX2)

Hardware: x86-64 (AVX2)
OS: Windows
Compiler: Clang 21.1.0
Assembly: x86-64 ASM enabled
SIMD: AVX2

Operation Time Notes
Field Mul (5x52) 17 ns __int128 lazy reduction
Field Square (5x52) 14 ns
Field Add 1 ns
Field Negate 1 ns
Field Inverse 1 us Fermat's little theorem
Point Add 159 ns Jacobian coordinates
Point Double 98 ns
Point Scalar Mul (kxP) 25 us GLV + 5x52 + Shamir
Generator Mul (kxG) 5 us Precomputed tables
ECDSA Sign 8 us RFC 6979
ECDSA Verify 31 us Shamir + GLV
Schnorr Sign (BIP-340) 14 us
Schnorr Verify (BIP-340) 33 us
Batch Inverse (n=100) 84 ns/elem Montgomery's trick
Batch Inverse (n=1000) 88 ns/elem

RISC-V 64 Benchmarks

Hardware: Milk-V Mars (SiFive U74, RV64GC + Zba + Zbb)
OS: Linux
Compiler: Clang 21.1.8, -mcpu=sifive-u74 -march=rv64gc_zba_zbb
Assembly: RISC-V native assembly
LTO: ThinLTO enabled (auto-detected)

Operation Time Notes
Field Mul 95 ns Optimized carry chain
Field Square 70 ns Dedicated squaring
Field Add 11 ns Branchless
Field Sub 11 ns Branchless
Field Negate 8 ns Branchless
Field Inverse 4 us Fermat's little theorem
Point Add 1 us Jacobian coordinates
Point Double 595 ns
Point Scalar Mul (kxP) 154 us GLV + wNAF
Generator Mul (kxG) 33 us Precomputed tables
ECDSA Sign 67 us RFC 6979
ECDSA Verify 186 us Shamir + GLV
Schnorr Sign (BIP-340) 86 us
Schnorr Verify (BIP-340) 216 us

RISC-V Native Re-Run (Milk-V Mars, 2026-03-07)

Run policy: native board execution (no QEMU), bench_unified --suite all --passes 11, plus unified_audit_runner.

Full Benchmark (opt3 retained)

Operation Time Ratio vs libsecp Notes
ECDSA Sign 72.64 us 2.00x FAST path
Schnorr Sign 51.69 us 2.24x FAST path
Schnorr Keypair 43.98 us 2.45x x-only keypair create
ECDSA Verify 198.01 us 1.01x Slightly faster than libsecp
Schnorr Verify (cached xonly) 200.46 us 1.02x Slightly faster than libsecp
Schnorr Verify (raw bytes) 206.75 us 0.99x Near parity; about 1.2% slower

Source artifact (Mars): /tmp/bench_unified_mars_full_opt3.json.

Quick A/B Check (raw verify hotspot)

Variant Schnorr Verify (raw) Schnorr Verify (cached) ECDSA Verify
opt3 206963.9 ns 200468.7 ns 198126.1 ns
opt4 216081.5 ns 200431.1 ns 198231.0 ns

Conclusion: opt3 is kept because it is measurably faster in raw verify.

Security Validation (same code path)

unified_audit_runner verdict: AUDIT-READY
Summary: 53/54 modules passed -- ALL PASSED (1 advisory warnings).

VisionFive 2 Device Rerun (2026-03-22, v3.3.0 dev)

This rerun was executed on the physical StarFive VisionFive 2 board over SSH. Validation covered run_selftest smoke, test_bip324_standalone, bench_kP, bench_unified --quick, and dedicated bench_bip324.

Measurement Result
run_selftest smoke 30/30 modules passed, ALL TESTS PASSED
test_bip324_standalone BIP-324: 62/62 passed
bench_kP: scalar_mul(K) 200.06 us
bench_kP: scalar_mul_with_plan(K) 191.47 us
bench_unified --quick: scalar_mul (k*P) 199.99 us
bench_unified --quick: scalar_mul_with_plan 193.30 us
bench_unified --quick: silent_payment_scan (single output set) 415.22 us
bench_unified --quick: scalar_mul_P (k*P, tweak_mul) 200.36 us
bench_bip324: full_handshake (both sides) 1444.56 us
bench_bip324: session_encrypt 1024 B 19.14 us, 51.0 MB/s
bench_bip324: session_roundtrip 1024 B 38.36 us, 25.5 MB/s
bench_bip324: session_roundtrip 4096 B 137.81 us, 28.3 MB/s

Retained optimization: Point::scalar_mul_with_plan() now leaves the result lazy-affine. On this board, that moved bench_unified --quick scalar_mul_with_plan from the earlier 199652.6 ns baseline to 193301.2 ns, a measured improvement of about 3.2%.

RISC-V Optimization Gains (vs generic RV64GC build)

Optimization Speedup Applied To
-mcpu=sifive-u74 targeting 1.3x All operations
ThinLTO (cross-TU inlining) 1.1x Point/scalar ops
Native assembly 2-3x Field mul/square
Branchless algorithms 1.2x Field add/sub
Fast modular reduction 1.5x All field ops
Carry chain optimization 1.3x Multiplication

CUDA Benchmarks

Hardware: NVIDIA RTX 5060 Ti (36 SMs, 2602 MHz, 15847 MB, 128-bit bus)
CUDA: 12.0, Compute 12.0 (Blackwell)
Architecture: sm_86;sm89
Build: Clang 19 + nvcc, Release, -O3 --use_fast_math

Core ECC Operations

Operation Time/Op Throughput Notes
Field Mul 0.2 ns 4,142 M/s Kernel-only, batch 1M
Field Add 0.2 ns 4,130 M/s Kernel-only, batch 1M
Field Inv 10.2 ns 98.35 M/s Kernel-only, batch 64K
Point Add 1.6 ns 619 M/s Kernel-only, batch 256K
Point Double 0.8 ns 1,282 M/s Kernel-only, batch 256K
Scalar Mul (Pxk) 282.0 ns 3.55 M/s Kernel-only, batch 64K
Generator Mul (Gxk) 113.5 ns 8.81 M/s Kernel-only, batch 64K
Affine Add 0.4 ns 2,532 M/s Kernel-only, batch 256K
Affine Lambda 0.6 ns 1,654 M/s Kernel-only, batch 256K
Affine X-Only 0.4 ns 2,328 M/s Kernel-only, batch 256K
Batch Inv 2.9 ns 340 M/s Kernel-only, batch 64K
Jac->Affine 14.9 ns 66.9 M/s Kernel-only, batch 64K

GPU Signature Operations

No other open-source GPU library provides secp256k1 ECDSA + Schnorr sign/verify on GPU.

Operation Time/Op Throughput Notes
ECDSA Sign 204.8 ns 4.88 M/s RFC 6979, low-S, batch 16K
ECDSA Verify 230.2 ns 4.34 M/s Shamir+GLV double-mul, batch 64K
ECDSA Sign + Recid 311.5 ns 3.21 M/s Recoverable, batch 16K
Schnorr Sign (BIP-340) 273.4 ns 3.66 M/s Tagged hash midstates, batch 16K
Schnorr Verify (BIP-340) 167.0 ns 5.99 M/s Shamir+GLV double-mul, batch 64K

GPU Zero-Knowledge Operations

First open-source GPU implementation of secp256k1 ZK proofs (Knowledge + DLEQ + Bulletproof).

Operation Time/Op Throughput Notes
Knowledge Prove (G) 258.6 ns 3,867 k/s CT Schnorr sigma, batch 8K
Knowledge Verify 175.9 ns 5,686 k/s Shamir double-mul GLV, batch 8K
DLEQ Prove 537.2 ns 1,861 k/s Discrete log equality, CT path, batch 8K
DLEQ Verify 369.0 ns 2,710 k/s 2× Shamir double-mul GLV, batch 8K
Pedersen Commit 66.0 ns 15,160 k/s vH + rG, batch 4K
Range Prove (64-bit) 3,711,570 ns 0.27 k/s Bulletproof, CT path, batch 256
Range Verify (64-bit) 764,649 ns 1.3 k/s Full IPA verification, batch 256

GPU vs CPU ZK Speedup (single-core throughput):

Operation CPU (i5-14400F) GPU (RTX 5060 Ti) GPU/CPU Speedup
Knowledge Prove 24,292 ns 258.6 ns 94x
Knowledge Verify 23,830 ns 175.9 ns 135x
DLEQ Prove 42,370 ns 537.2 ns 79x
DLEQ Verify 60,607 ns 369.0 ns 164x
Pedersen Commit 29,718 ns 66.0 ns 450x
Range Prove (64-bit) 13,618,693 ns 3,711,570 ns 3.7x
Range Verify (64-bit) 2,669,843 ns 764,649 ns 3.5x

Community & Contributor Benchmarks

All hardware results submitted by community members are collected in docs/COMMUNITY_BENCHMARKS.md.

Current entries:

# Hardware Contributor Date Tests
1 NVIDIA RTX 5070 Ti (Blackwell) Community / GigaChad 2026-03-24 45/45
2 x86-64 CPU (libsecp baseline) @craigraw 2026-02-xx

CUDA — RTX 5070 Ti (Blackwell) — 2026-03-24

Contributor: Community member (GigaChad) — thank you for running the full test suite and for identifying the CMAKE_CUDA_SEPARABLE_COMPILATION flag required for Blackwell devices! 🙏
Hardware: NVIDIA GeForce RTX 5070 Ti (Blackwell)
Build: cmake -S . -B build -G Ninja -DCMAKE_BUILD_TYPE=Release -DSECP256K1_BUILD_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=native -DCMAKE_CUDA_SEPARABLE_COMPILATION=ON
Tested: 2026-03-24, 45 tests passed
Note: CMAKE_CUDA_SEPARABLE_COMPILATION=ON is required for Blackwell (RTX 50xx) devices. This flag is now set automatically in cuda/CMakeLists.txt and baked into all CUDA CMake presets.

Operation Time/Op Throughput
Field Mul 5.8 ns 173.43 M/s
Field Add 2.5 ns 408.04 M/s
Field Inverse 5.2 ns 191.55 M/s
Point Add 9.9 ns 100.89 M/s
Point Double 5.5 ns 181.70 M/s
Scalar Mul (Pk) 101.4 ns 9.86 M/s
Generator Mul (Gk) 92.1 ns 10.86 M/s
Affine Add (2M+1S+inv) 0.1 ns 8,388.29 M/s
Affine Lambda (2M+1S) 0.2 ns 4,117.82 M/s
Affine X-Only (1M+1S) 0.1 ns 8,354.07 M/s
Batch Inv (Montgomery) 5.8 ns 173.21 M/s
Jac->Affine (per-pt) 14.4 ns 69.34 M/s
ECDSA Sign 105.3 ns 9.49 M/s
ECDSA Verify 122.8 ns 8.14 M/s
ECDSA Sign+Recid 155.8 ns 6.42 M/s
Schnorr Sign 137.7 ns 7.26 M/s
Schnorr Verify 92.7 ns 10.79 M/s

GPU Zero-Knowledge Operations

First open-source GPU implementation of secp256k1 ZK proofs (Knowledge + DLEQ + Bulletproof).

Operation Time/Op Throughput Notes
Knowledge Prove (G) 252.3 ns 3,964 k/s CT Schnorr sigma, batch 4K
Knowledge Verify 749.9 ns 1,334 k/s sG == R + eP, batch 4K
DLEQ Prove 668.3 ns 1,496 k/s Discrete log equality, CT path, batch 4K
DLEQ Verify 1,919.1 ns 521 k/s Two-base verification, batch 4K
Pedersen Commit 66.0 ns 15,160 k/s vH + rG, batch 4K
Range Prove (64-bit) 3,711,570 ns 0.27 k/s Bulletproof, CT path, batch 256
Range Verify (64-bit) 764,649 ns 1.3 k/s Full IPA verification, batch 256

CUDA Launch-Width Triage (2026-03-18)

The latest local rerun on the RTX 5060 Ti used gpu_bench_unified to check whether a global block-size retune should replace the current default. The answer was no: there is not yet a safe retained win.

TPB k*G (generator) CT k*G CT k*P Verdict
256 129.5 ns 98.7 ns 162.8 ns Stable reference rerun
512 128.5 ns invalid (0.0 ns) invalid (0.1 ns) Rejected; CT timing became unstable

The 512-thread launch showed only a marginal k*G gain, while the same harness produced invalid constant-time timings. Until the CT timing methodology is tightened, no global CUDA TPB default change is retained from this sweep.

GPU vs CPU ZK Speedup (single-core throughput):

Operation CPU (i5-14400F) GPU (RTX 5060 Ti) GPU/CPU Speedup
Knowledge Prove 24,292 ns 252.3 ns 96x
Knowledge Verify 23,830 ns 749.9 ns 32x
DLEQ Prove 42,370 ns 668.3 ns 63x
DLEQ Verify 60,607 ns 1,919.1 ns 32x
Pedersen Commit 29,718 ns 66.0 ns 450x
Range Prove (64-bit) 13,618,693 ns 3,711,570 ns 3.7x
Range Verify (64-bit) 2,669,843 ns 764,649 ns 3.5x

OpenCL Benchmarks

Hardware: NVIDIA RTX 5060 Ti (36 CUs, 2602 MHz)
OpenCL: 3.0 CUDA, Driver 580.126.09
Build: Clang 19, Release, -O3, PTX inline assembly

OpenCL GPU C ABI Coverage (2026-03-18)

C ABI operation OpenCL status Notes
ufsecp_gpu_generator_mul_batch Implemented Uses batch_scalar_mul_generator + batch_jacobian_to_affine
ufsecp_gpu_ecdsa_verify_batch Missing Returns UFSECP_ERR_GPU_UNSUPPORTED
ufsecp_gpu_schnorr_verify_batch Missing Returns UFSECP_ERR_GPU_UNSUPPORTED
ufsecp_gpu_ecdh_batch Implemented GPU scalar mul, CPU SHA-256 finalization
ufsecp_gpu_hash160_pubkey_batch Implemented Public-data batch hashing
ufsecp_gpu_msm Implemented GPU scalar mul + CPU-side affine reduction

The missing OpenCL pieces are therefore the two batch verify paths. Core ECC, ECDH, Hash160, and MSM are already wired through the backend-neutral C ABI.

Kernel-Only Timing (no buffer alloc/copy overhead)

Operation Time/Op Throughput Notes
Field Mul 0.2 ns 4,110 M/s batch 1M
Field Add 0.2 ns 4,116 M/s batch 1M
Field Sub 0.2 ns 4,106 M/s batch 1M
Field Sqr 0.2 ns 5,979 M/s batch 1M
Field Inv 20.2 ns 49.42 M/s batch 1M
Point Double 0.9 ns 1,138 M/s batch 256K
Point Add 1.6 ns 618.1 M/s batch 256K
kG (kernel) 97.7 ns 10.23 M/s batch 64K
kP (kernel) 263.8 ns 3.79 M/s batch 64K
ECDSA Verify 230.2 ns 4.34 M/s Shamir+GLV, batch 64K
Schnorr Verify 167.0 ns 5.99 M/s Shamir+GLV, batch 64K
ZK Knowledge Prove 258.6 ns 3.87 M/s CT path, batch 8K
ZK Knowledge Verify 175.9 ns 5.69 M/s Shamir double-mul, batch 8K
ZK DLEQ Prove 537.2 ns 1.86 M/s CT path, batch 8K
ZK DLEQ Verify 369.0 ns 2.71 M/s 2× Shamir double-mul, batch 8K

End-to-End Timing (including buffer transfers)

Operation Time/Op Throughput Notes
Field Add 27.3 ns 36.67 M/s batch 1M
Field Mul 27.7 ns 36.07 M/s batch 1M
Field Inv 29.0 ns 34.43 M/s batch 1M
Point Double 58.4 ns 17.11 M/s batch 1M
Point Add 111.9 ns 8.94 M/s batch 1M
kG (batch=65536) 115.1 ns 8.69 M/s retained 2026-03-17 revalidation
kP (batch=65536) 263.1 ns 3.80 M/s retained 2026-03-17 revalidation
kP upload 6.7 ns 149.25 M/s host-to-device transfer slice
kP readback 12.4 ns 80.65 M/s device-to-host transfer slice

CUDA / OpenCL Configuration

// Optimal settings for RTX 5060 Ti
#define SECP256K1_CUDA_USE_HYBRID_MUL 1  // 32-bit hybrid (~10% faster)
#define SECP256K1_CUDA_USE_MONTGOMERY 0  // Standard domain (faster for search)

CUDA vs OpenCL Kernel-Only Comparison (RTX 5060 Ti)

Operation CUDA OpenCL Faster
Field Mul 0.2 ns 0.2 ns Tie
Field Add 0.2 ns 0.2 ns Tie
Field Inv 10.2 ns 20.2 ns CUDA 1.98x
Point Double 0.8 ns 0.9 ns CUDA 1.13x
Point Add 1.6 ns 1.6 ns Tie
Scalar Mul (kG) 113.5 ns 97.7 ns OpenCL 1.16x
ECDSA Sign 204.8 ns -- CUDA only
ECDSA Verify 230.2 ns 230.2 ns Tie
Schnorr Sign 273.4 ns -- CUDA only
Schnorr Verify 167.0 ns 167.0 ns Tie
Knowledge Prove 258.6 ns 258.6 ns Tie
Knowledge Verify 175.9 ns 175.9 ns Tie
DLEQ Prove 537.2 ns 537.2 ns Tie
DLEQ Verify 369.0 ns 369.0 ns Tie

kG above uses the latest retained local reruns on the same RTX 5060 Ti host: CUDA gpu_bench_unified at TPB 256 (129.5 ns) and OpenCL opencl_benchmark kernel timing (98.7 ns). CUDA still leads on verify and ZK because those paths are not yet exposed on OpenCL.


Apple Metal Benchmarks

Hardware: Apple M3 Pro (18 GPU cores, Unified Memory 18 GB)
OS: macOS Sequoia
Metal: Metal 2.4, MSL macos-metal2.4
Limb Model: 8x32-bit Comba (no 64-bit int in MSL)
Build: AppleClang, Release, -O3, ARC

Operation Time/Op Throughput Notes
Field Mul 1.9 ns 527 M/s Comba product scanning, batch 1M
Field Add 1.0 ns 990 M/s Branchless, batch 1M
Field Sub 1.1 ns 892 M/s Branchless, batch 1M
Field Sqr 1.1 ns 872 M/s Comba + symmetry, batch 1M
Field Inv 106.4 ns 9.40 M/s Fermat (a^(p-2)), batch 64K
Point Add 10.1 ns 98.6 M/s Jacobian, batch 256K
Point Double 5.1 ns 196 M/s dbl-2001-b, batch 256K
Scalar Mul (Pxk) 2.94 us 0.34 M/s 4-bit windowed, batch 64K
Generator Mul (Gxk) 3.00 us 0.33 M/s 4-bit windowed, batch 128K

Metal vs CUDA vs OpenCL -- GPU Comparison

Operation CUDA (RTX 5060 Ti) OpenCL (RTX 5060 Ti) Metal (M3 Pro)
Field Mul 0.2 ns 0.2 ns 1.9 ns
Field Add 0.2 ns 0.2 ns 1.0 ns
Field Inv 10.2 ns 14.3 ns 106.4 ns
Point Double 0.8 ns 0.9 ns 5.1 ns
Point Add 1.6 ns 1.6 ns 10.1 ns
Scalar Mul 282.0 ns 263.8 ns 2.94 us
Generator Mul 113.5 ns 97.7 ns 3.00 us
ECDSA Sign 204.8 ns -- --
ECDSA Verify 230.2 ns 230.2 ns --
Schnorr Sign 273.4 ns -- --
Schnorr Verify 354.6 ns -- --
Knowledge Prove 263.7 ns -- --
Knowledge Verify 744.5 ns -- --
DLEQ Prove 675.4 ns -- --
DLEQ Verify 1,912.0 ns -- --

Note: CUDA/OpenCL -- RTX 5060 Ti (36 SMs, 2602 MHz, GDDR7 256 GB/s).
Metal -- M3 Pro (18 GPU cores, ~150 GB/s unified memory bandwidth).
RTX 5060 Ti has ~8x more compute throughput; Metal's advantage is in unified memory zero-copy I/O.


Android ARM64 Benchmarks

Hardware: RK3588 (Cortex-A76 @ 2.256 GHz, pinned to big cores)
OS: Android
Compiler: NDK r27.2.12479018, Clang 18.0.3
Assembly: ARM64 inline (MUL/UMULH)
Field: 10x26 (optimal for ARM64)

Operation Time Notes
Field Mul 68.3 ns ARM64 MUL/UMULH, 10x26
Field Square 50 ns
Field Add 8 ns
Field Negate 18 ns
Field Inverse 2 us Fermat's theorem
Point Add 992 ns Jacobian coordinates
Point Double 548 ns
Generator Mul (kxG) 15.27 us Precomputed tables
Scalar Mul (kxP) 130.33 us GLV + wNAF
ECDSA Sign 22.22 us ARMv8 SHA2 dispatch retained
ECDSA Verify 150.13 us Shamir + GLV
Schnorr Sign (BIP-340) 16.67 us Precomputed keypair path
Schnorr Verify (BIP-340) 153.63 us Raw pubkey path is similar
Batch Inverse (n=100) 265 ns/elem Montgomery's trick
Batch Inverse (n=1000) 240 ns/elem

ARM64 10x26 representation with MUL/UMULH assembly provides optimal field arithmetic performance.

Android ARM64 Optimization Rerun (2026-03-17)

This rerun used the connected RK3588 Android device and android/test/bench_hornet_android.cpp as the benchmark truth source. The retained code change was enabling the existing ARMv8 SHA-256 instruction path in hash_accel.cpp for sha256_33, sha256_32, hash160_33, and sha256_compress_dispatch.

Operation Baseline Retained result Delta
ECDSA Sign 25.89 us 22.22 us 14.2% faster
Schnorr Sign (precomputed) 17.73 us 16.67 us 6.0% faster
Schnorr Sign (raw privkey) 33.01 us 31.99 us 3.1% faster
CT ECDSA Sign 70.50 us 67.11 us 4.8% faster
CT Schnorr Sign 59.87 us 59.10 us 1.3% faster

No meaningful win was found from forcing SECP256K1_USE_4X64_POINT_OPS, from changing SECP256K1_GLV_WINDOW_WIDTH to 4 or 6, or from keeping PGO as the default Android path. Those variants were measured and rejected.

Android ARM64 RK3588 Device Rerun (2026-03-22)

This rerun used the connected YF_022A RK3588 Android device over USB. Two new device-side benchmarks were added to the Android build for this pass: bench_kP for the BIP-352 fixed-K / variable-Q hotspot and bench_bip324 for the dedicated BIP-324 transport stack.

Measurement Result
android_test: fast scalar_mul (k*G) 5.93 us
android_test: fast scalar_mul (k*P) 57.67 us
android_test: ct::scalar_mul (k*P) 150.26 us
android_test: field_mul / field_sqr 80 ns / 61 ns
bench_kP: scalar_mul(K) 130.90 us
bench_kP: scalar_mul_with_plan(K) 127.24 us
bench_kP: K*G 15.69 us
bench_bip324: full_handshake (both sides) 727.24 us
bench_bip324: session_encrypt 1024 B 5.96 us, 163.9 MB/s
bench_bip324: session_roundtrip 1024 B 12.05 us, 81.0 MB/s
bench_bip324: session_roundtrip 4096 B 43.72 us, 89.3 MB/s

Run note: the on-device execution used the NDK libomp.so alongside the pushed binaries so the existing OpenMP-enabled CPU build could run unchanged.


ESP32-S3 Benchmarks (Embedded)

Hardware: ESP32-S3 (Xtensa LX7 Dual Core @ 240 MHz), rev 0.1
OS: ESP-IDF v5.4, GCC 14.2.0
Field: 4×64 (native 64-bit mul wins on LX7)
Measured: 2026-03-21, median of 3 runs

Operation Time ops/sec vs libsecp
field_mul 5,910 ns 169 k/s
field_sqr 4,848 ns 206 k/s
field_add 572 ns 1.75 M/s
field_inv 130.2 µs 7.7 k/s
pubkey_create (k×G) 6,134 µs 163/s 1.18×
k×P (arbitrary) 12,752 µs 78/s
a×G + b×P (Shamir) 18,296 µs 55/s
point_add 479 µs 2.1 k/s
point_dbl 330 µs 3.0 k/s
ecdsa_sign 7,443 µs 134/s 1.27×
ecdsa_verify 18,670 µs 54/s 1.70×
schnorr_sign (keypair) 6,467 µs 155/s 1.45×
schnorr_verify 19,947 µs 50/s 1.62×
ct::ecdsa_sign 13,742 µs 73/s 0.69×
ct::schnorr_sign 7,574 µs 132/s 1.23×

All integrity checks pass. libsecp256k1 v0.7.2 compared on same hardware.


ESP32-P4 Benchmarks (Embedded)

Hardware: ESP32-P4 (RISC-V RV32IMAC Dual HP Core @ 360 MHz), rev 1.3
OS: ESP-IDF v5.4, GCC 14.2.0
Field: 10×26 (32-bit native)
Measured: 2026-03-21, median of 3 runs

Operation Time ops/sec vs libsecp
field_mul 2,424 ns 413 k/s
field_sqr 2,218 ns 451 k/s
field_add 318 ns 3.14 M/s
field_inv 73.1 µs 13.7 k/s
pubkey_create (k×G) 2,253 µs 444/s 0.94×
k×P (arbitrary) 5,256 µs 190/s
a×G + b×P (Shamir) 7,550 µs 132/s
point_add 128.8 µs 7.8 k/s
point_dbl 103.6 µs 9.7 k/s
ecdsa_sign 2,588 µs 386/s 0.97×
ecdsa_verify 7,528 µs 133/s 0.99×
schnorr_sign (keypair) 2,293 µs 436/s 0.96×
schnorr_verify 8,052 µs 124/s 0.93×
ct::ecdsa_sign 5,680 µs 176/s 0.44×
ct::schnorr_sign 2,528 µs 396/s 1.10×

All integrity checks pass. Note: FAST path is at near-parity with libsecp on P4
(P4 RISC-V microarch lacks the wide multiply throughput of Xtensa LX7).


ESP32-C6 Benchmarks (Embedded)

Hardware: ESP32-C6 (RISC-V RV32IMAC Single Core @ 160 MHz), rev 0.2
OS: ESP-IDF v5.4, GCC 14.2.0
Field: 10×26 (32-bit native)
Measured: 2026-03-21, median of 3 runs

Operation Time ops/sec vs libsecp
field_mul 5,974 ns 167 k/s
field_sqr 5,328 ns 188 k/s
field_add 784 ns 1.28 M/s
field_inv 171.1 µs 5.8 k/s
pubkey_create (k×G) 5,483 µs 182/s 1.70×
k×P (arbitrary) 12,682 µs 79/s
point_add 296.5 µs 3.4 k/s
point_dbl 238.1 µs 4.2 k/s
ecdsa_sign 7,464 µs 134/s 1.67×
ecdsa_verify 18,957 µs 53/s 0.98×
schnorr_sign (keypair) 5,855 µs 171/s 2.01×
schnorr_verify 20,278 µs 49/s 1.03×
ct::ecdsa_sign 15,522 µs 64/s 0.80×
ct::schnorr_sign 6,782 µs 147/s 1.73×

All integrity checks pass.


ESP32-PICO-D4 Benchmarks (Embedded)

Hardware: ESP32-PICO-D4 (Xtensa LX6 Dual Core @ 240 MHz)
OS: ESP-IDF v5.5.1
Assembly: None (portable C++, no __int128)

Operation Time Notes
Field Mul 6,993 ns
Field Square 6,247 ns
Field Add 985 ns
Field Inv 609 us
Scalar x G 6,203 us Generator mul
CT Scalar x G 44,810 us Constant-time
CT Add (complete) 249,672 ns
CT Dbl 87,113 ns
CT/Fast ratio 6.5x

All 35 self-tests + 8 CT tests pass.


STM32F103 Benchmarks (Embedded)

Hardware: STM32F103ZET6 (ARM Cortex-M3 @ 72 MHz)
Compiler: ARM GCC 13.3.1, -O3
Assembly: ARM Cortex-M3 inline (UMULL/ADDS/ADCS)

Operation Time Notes
Field Mul 15,331 ns ARM inline asm
Field Square 12,083 ns ARM inline asm
Field Add 4,139 ns Portable C++
Field Inv 1,645 us
Scalar x G 37,982 us Generator mul

All 35 library self-tests pass.


Embedded Cross-Platform Comparison

Operation ESP32-S3 (LX7) ESP32-P4 (RV32) ESP32-C6 (RV32) ESP32 (LX6) STM32F103 (M3)
240 MHz 360 MHz 160 MHz 240 MHz 72 MHz
Field Mul 5,910 ns 2,424 ns 5,974 ns 6,993 ns 15,331 ns
Field Square 4,848 ns 2,218 ns 5,328 ns 6,247 ns 12,083 ns
Field Add 572 ns 318 ns 784 ns 985 ns 4,139 ns
Field Inv 130 µs 73 µs 171 µs 609 µs 1,645 µs
k×G (pubkey) 6,134 µs 2,253 µs 5,483 µs 6,203 µs 37,982 µs
ECDSA sign 7,443 µs 2,588 µs 7,464 µs
ECDSA verify 18,670 µs 7,528 µs 18,957 µs
Schnorr verify 19,947 µs 8,052 µs 20,278 µs
vs libsecp (verify) 1.70× 0.99× 0.98×

Specialized Benchmark Results (Windows x64, Clang 21.1.0)

Field Representation Comparison (5x52 vs 4x64)

5x52 uses __int128 with lazy carry reduction -- fewer normalizations = faster chains.

Operation 4x64 (ns) 5x52 (ns) 5x52 Speedup
Multiplication 41.9 15.2 2.76x
Squaring 31.2 12.8 2.44x
Addition 4.3 1.6 2.69x
Negation 7.6 2.4 3.13x
Add chain (4 ops) 33.2 8.6 3.84x
Add chain (8 ops) 65.4 16.4 3.98x
Add chain (16 ops) 137.7 30.3 4.55x
Add chain (32 ops) 285.9 57.0 5.01x
Add chain (64 ops) 566.8 117.1 4.84x
Point-Add simulation 428.3 174.8 2.45x
256 squarings 9,039 4,055 2.23x

Conclusion: 5x52 is 2.0-5.0x faster across all operations. The advantage grows for addition-heavy chains (lazy reduction amortizes normalization cost).

Field Representation Comparison (10x26 vs 4x64)

10x26 is the 32-bit target representation -- useful for embedded and GPU where 64-bit multiply is expensive.

Operation 4x64 (ns) 10x26 (ns) 10x26 Speedup
Addition 4.7 1.8 2.57x
Multiplication ~39 ~39 ~1x (tie)
Add chain (16 ops) wide 3.3x faster --

Constant-Time (CT) Layer Performance

CT layer provides side-channel resistance at the cost of performance.

Operation Fast CT Overhead
Field Mul 36 ns 55 ns 1.50x
Field Square 34 ns 43 ns 1.28x
Field Inverse 3.0 us 14.2 us 4.80x
Scalar Add 3 ns 10 ns 3.02x
Scalar Sub 2 ns 10 ns 6.33x
Point Add 0.65 us 1.63 us 2.50x
Point Double 0.36 us 0.67 us 1.88x
Scalar Mul (kxP) 130 us 322 us 2.49x
Generator Mul (kxG) 7.6 us 310 us 40.8x

Generator mul overhead (40x) is high because CT disables precomputed variable-time table lookups. For signing with side-channel requirements, CT scalar mul (2.49x overhead) is the relevant metric.

Multi-Scalar Multiplication (ECDSA Verify Path)

Method Time Description
Separate (prod-like) 137.4 us k_1xG (precompute) + k_2xQ (variable-base)
Separate (variable) 351.5 us Both via fixed-window variable-base
Shamir interleaved 155.2 us Merged stream -- fewer doublings
Windowed Shamir 9.2 us Optimized multi-scalar
JSF (Joint Sparse Form) 9.5 us Joint encoding of both scalars

Atomic ECC Building Blocks

Operation Time Formula Cost
Point Add (immutable) 959 ns 12M + 4S + alloc
Point Add (in-place) 1,859 ns 12M + 4S
Point Double (immutable) 673 ns 4M + 4S + alloc
Point Double (in-place) 890 ns 4M + 4S
Point Negation 11 ns Y := -Y
Point Triple 1,585 ns 2xP + P
To Affine conversion 15,389 ns 1 inverse + 2-3 mul
Field S/M ratio 0.818 (ideal: ~0.80)
Field I/M ratio 78x Inverse is expensive -- use Jacobian!

Zero-Knowledge Proof Benchmarks (CPU)

Hardware: Intel Core i5-14400F (P-core, Raptor Lake) Compiler: Clang 19.1.7, -O3 -march=native Methodology: 11 passes, IQR outlier removal, median, 64-key pool, pinned core

ZK Proof Operations

Operation Time/Op Throughput Notes
Pedersen Commit 29.7 us 33,670 op/s vH + rG (two scalar muls)
Knowledge Prove 24.3 us 41,152 op/s Non-interactive Schnorr sigma, CT path
Knowledge Verify 23.8 us 42,017 op/s sG == R + eP, FAST path
DLEQ Prove 42.4 us 23,585 op/s Discrete log equality, CT path
DLEQ Verify 60.6 us 16,502 op/s Two-base verification, FAST path
Range Prove (64-bit) 13,619 us 73 op/s Bulletproof prover, CT path
Range Verify (64-bit) 2,670 us 375 op/s MSM-optimized verifier, FAST path

Range Verify Optimization (v3.22+)

The Bulletproof verifier was optimized with multi-scalar multiplication (MSM):

Optimization Technique Speedup
Polynomial check 5-point MSM (delta, t_hatG, tau_xH, -T1, -T2) Reduced from 3 scalar muls
P_check + expected merge 144-point MSM (64 G_i, 64 H_i, 12 L_j, 12 R_j, A, S, ...) Single MSM vs 128+ individual muls
s_coeff computation Montgomery batch inversion (1 inv + 126 muls vs 64 inversions) ~64x fewer inversions
Total Combined MSM + batch inversion 1.93x (5,079 -> 2,634 us)

Pippenger MSM is used when point count > 64. For the prover, individual GLV-optimized scalar multiplications remain faster than MSM for the 129-point workload.


BIP-324 Encrypted Transport Benchmarks

BIP-324 implements encrypted, authenticated peer-to-peer communication for Bitcoin (v2 transport). Numbers below are from bench_unified --quick on x86-64 (i5, Clang 19, AVX2, single core pinned).

Primitives

Operation ns/op Throughput
HKDF-SHA256 extract ~124 ~8.1 M op/s
HKDF-SHA256 expand ~135 ~7.4 M op/s
AEAD encrypt (256 B) ~460 ~2.2 M op/s
AEAD decrypt (256 B) ~470 ~2.1 M op/s

Elliptic-Curve Transport Setup

Operation µs/op Throughput
ElligatorSwift create ~46 ~21.5 k op/s
ElligatorSwift XDH (ECDH) ~30 ~32.9 k op/s
Session handshake (full) ~167 ~6.0 k op/s

Session Data Path

Operation ns/op Throughput
Session encrypt (256 B) ~558 ~1.8 M op/s
Session decrypt (256 B) ~1,136 ~881 k op/s
Session encrypt (1 KB) ~1,627 ~614 k op/s
Session roundtrip (256 B) ~1,136 ~881 k op/s

CUDA GPU Comparison

See BENCHMARK_BIP324_GPU.md for detailed CUDA transport benchmarks. Summary: CUDA achieves ~30× throughput over a single CPU core for bulk packet encryption.


Available Benchmark Targets

All targets registered in CMake. Build with cmake --build build -j then run from build/cpu/.

Target What It Measures
bench_unified THE standard: primitives + CT + batch verify + Ethereum + ZK + BIP-324 + real-world wallet/protocol flows, with apple-to-apple comparison vs libsecp256k1 + OpenSSL
bench_bip324_transport BIP-324 transport simulation: mixed payloads, decoy packets, latency histograms, TCP socket roundtrip
bench_ct Fast (fast::) vs Constant-Time (ct::) layer comparison
bench_field_52 5x52 field arithmetic micro-benchmarks
bench_field_26 10x26 field arithmetic micro-benchmarks
bench_kP Scalar multiplication (k*P) benchmarks
bench_zk (CUDA) GPU ZK proof benchmarks: Knowledge, DLEQ, Pedersen, Bulletproof

Benchmark Methodology

CPU Benchmarks

  1. Warm-up: 1 iteration discarded
  2. Measurement: 3 iterations, take median
  3. Timer: std::chrono::high_resolution_clock
  4. Compiler flags: -O3 -march=native

bench_unified additionally reports workflow-level operations such as HD derivation, Taproot key tweaking, ECDH, and Silent Payments so primitive performance can be interpreted in a wallet and protocol context.

CUDA Benchmarks

  1. Warm-up: 5-10 kernel launches discarded
  2. Measurement: 11 passes, median
  3. Timer: CUDA events
  4. Sync: Full device synchronization between measurements

CUDA ZK Benchmarks

  1. Warm-up: 5 kernel launches discarded
  2. Measurement: 11 passes, median
  3. Timer: CUDA events (ns/op = elapsed_ms * 1e6 / batch_size)
  4. Correctness: 0/4096 verify failures (Knowledge/DLEQ), 0/256 (Bulletproof) required before timing
  5. Batch sizes: Knowledge/DLEQ/Pedersen = 4096, Bulletproof = 256
  6. Setup: Precomputed pubkeys + Bulletproof generators (not included in timing)

Reproducibility

# Run CPU benchmark (includes ZK section)
./build/cpu/bench_unified

# Run the full unified suite explicitly
./build/cpu/bench_unified --suite all

# Quick smoke / CI-style run
./build/cpu/bench_unified --quick

# Run CUDA ECC benchmark
./build/cuda/secp256k1_cuda_bench

# Run CUDA ZK benchmark
./build/cuda/bench_zk

# Results saved to: benchmark-<platform>-<date>.txt

Optimization History

RISC-V Timeline

Date Field Mul Scalar Mul Change
2026-02-11 307 ns 954 us Initial
2026-02-12 205 ns 676 us Carry optimization
2026-02-13 198 ns 672 us Square optimization
2026-02-13 198 ns 672 us Current

Key Optimizations Applied

  1. Branchless field operations - Eliminates unpredictable branches
  2. Optimized carry propagation - Reduces instruction count
  3. Dedicated squaring routine - 25% fewer multiplications than generic mul
  4. GLV decomposition - ~50% reduction in scalar bits
  5. wNAF encoding - ~33% fewer point additions
  6. Precomputed tables - Generator multiplication 10x faster

Apple-to-Apple: UltrafastSecp256k1 vs bitcoin-core/libsecp256k1

Rigorous head-to-head comparison using identical benchmark harness (same timer, warmup, statistical methodology) for both libraries. Both libraries are compiled from source, linked into a single binary, and measured under the exact same conditions.

Methodology

  • Harness: 3 s CPU frequency ramp-up, 500 warmup iterations per operation, 11 measurement passes, IQR outlier removal, median reported.
  • Timer: RDTSCP (serialising, sub-ns precision on x86-64).
  • Data pool: 64 independent key / message / signature sets, round-robin indexed to defeat branch-predictor / cache training on a single input.
  • Pinning: Single core, taskset -c 0, SCHED_FIFO where available.
  • Compiler parity: Both libraries compiled with the same compiler, same -O3 -march=native flags, same link step.
  • Source: bench_unified.cpp -- open-source, fully reproducible.

Platform 1 -- Intel Core i5-14400F (Raptor Lake)

Detail Value
CPU Intel Core i5-14400F (P-core, Raptor Lake)
Microarchitecture Golden Cove (P-core), 32 KB L1i, 48 KB L1d, 1.25 MB L2
TSC frequency 2.497 GHz
OS Ubuntu 24.04 LTS, kernel 6.x
Compiler GCC 14.2.0, -O3 -march=native -fno-exceptions -fno-rtti
ISA features BMI2 (MULX), ADX, AVX2, SHA-NI
libsecp256k1 v0.7.x (latest master, 5x52 + exhaustive GLV Strauss)
UltrafastSecp256k1 v3.16.0, 5x52 limb layout, __int128 field arithmetic
Assembly Both libraries: GCC __int128 -> auto-generated MULX code

FAST Path (variable-time, non-secret inputs)

Operation Ultra (ns) libsecp (ns) Speedup Notes
Generator x k (pubkey_create) 6,730 11,362 1.69x W=15 comb vs W=15 Strauss
ECDSA Sign 8,989 15,631 1.74x Includes k^-1 (safegcd)
ECDSA Verify 21,324 23,306 1.09x Identical Strauss algorithm
Schnorr Keypair Create 10,522 11,228 1.07x
Schnorr Sign (BIP-340) 8,443 12,255 1.45x Includes SHA-256 challenge
Schnorr Verify (BIP-340) 21,151 22,642 1.07x Includes lift_x + SHA-256

CT Path (constant-time, for secret inputs -- true apples-to-apples)

libsecp256k1 is constant-time by design, so this comparison is the fairest:

Operation Ultra CT (ns) libsecp (ns) Speedup
ECDSA Sign 13,431 15,631 1.16x
ECDSA Verify 21,324 23,306 1.09x
Schnorr Sign (BIP-340) 11,393 12,255 1.08x
Schnorr Verify (BIP-340) 21,151 22,642 1.07x

Throughput (single core)

Ultra FAST Ultra CT libsecp
ECDSA sign 111.3k op/s 74.5k op/s 64.0k op/s
ECDSA verify 46.9k op/s -- 42.9k op/s
Schnorr sign 118.4k op/s 87.8k op/s 81.6k op/s
Schnorr verify 47.3k op/s -- 44.2k op/s
pubkey_create (k x G) 148.6k op/s -- 88.0k op/s

Bitcoin Block Validation (1 core estimate)

Block type Ultra libsecp Speedup
Pre-Taproot (~3000 ECDSA verify) 64.0 ms 69.9 ms 1.09x
Taproot (~2000 Schnorr + ~1000 ECDSA) 63.6 ms 67.9 ms 1.07x

Field Micro-ops

Operation Ultra (ns) Notes
FE52 mul 12.8 5x52, __int128 -> MULX
FE52 sqr 9.5 Dedicated squaring
FE52 add 8.1
FE52 sub 5.5
FE52 negate 6.0
FE52 inverse (safegcd) 666.8 Bernstein-Yang, __builtin_ctzll
Scalar mul 23.2 4x64
Scalar inverse (safegcd) 843.1
GLV decomposition 146.0 Lattice-based

Platform 2 -- StarFive VisionFive 2 (RISC-V 64)

Detail Value
CPU SiFive U74-MC (quad-core RV64GC)
Microarchitecture SiFive U74, dual-issue in-order, 32 KB L1i, 32 KB L1d
ISA extensions rv64gc + Zba (address), Zbb (bit-manipulation)
Clock ~1.5 GHz (StarFive JH7110 SoC)
OS Debian (StarFive kernel 6.6.20)
Compiler Clang 21.1.8, -O3 -march=rv64gcv_zba_zbb
libsecp256k1 v0.7.x (latest master)
UltrafastSecp256k1 v3.16.0, 5x52 limb layout, __int128 field arithmetic
Assembly Both libraries: __int128 -> compiler-generated MUL/MULHU

FAST Path (variable-time, non-secret inputs)

Operation Ultra (ns) libsecp (ns) Speedup Notes
Generator x k (pubkey_create) 39,764 95,341 2.40x W=15 comb vs W=15 Strauss
ECDSA Sign 73,784 138,128 1.87x Includes k^-1 (safegcd)
ECDSA Verify 180,511 201,135 1.11x Identical Strauss algorithm
Schnorr Keypair Create 45,873 95,946 2.09x
Schnorr Sign (BIP-340) 53,957 105,310 1.95x Includes SHA-256 challenge
Schnorr Verify (BIP-340) 185,487 204,944 1.10x Includes lift_x + SHA-256

CT Path (constant-time, for secret inputs -- true apples-to-apples)

Operation Ultra CT (ns) libsecp (ns) Speedup
ECDSA Sign 131,177 138,818 1.06x
ECDSA Verify 181,837 204,594 1.13x
Schnorr Sign (BIP-340) 110,926 106,139 0.96x
Schnorr Verify (BIP-340) 186,944 208,525 1.12x

Throughput (single core)

Ultra FAST Ultra CT libsecp
ECDSA sign 13.5k op/s 7.6k op/s 7.2k op/s
ECDSA verify 5.5k op/s -- 4.9k op/s
Schnorr sign 18.4k op/s 9.0k op/s 9.4k op/s
Schnorr verify 5.3k op/s -- 4.8k op/s
pubkey_create (k x G) 24.9k op/s -- 10.5k op/s

Bitcoin Block Validation (1 core estimate)

Block type Ultra libsecp Speedup
Pre-Taproot (~3000 ECDSA verify) 545.5 ms 613.8 ms 1.13x
Taproot (~2000 Schnorr + ~1000 ECDSA) 555.7 ms 621.6 ms 1.12x

Field Micro-ops

Operation Ultra (ns) Notes
FE52 mul 176.2 5x52, __int128 -> MUL/MULHU
FE52 sqr 166.8 Dedicated squaring
FE52 add 42.1
FE52 sub 34.7
FE52 negate 42.7
FE52 inverse (safegcd) 4,697.6 Bernstein-Yang
Scalar mul 147.5 4x64
Scalar inverse (safegcd) 3,698.9
GLV decomposition 851.3 Lattice-based

RISC-V Notes

  • The U74 is a dual-issue in-order core -- no out-of-order execution, no speculative execution, no branch prediction beyond basic BTB.
  • Despite this, the precomputed comb table gives a 2.4x generator speedup, showing the optimization is algorithmic (fewer point additions) not microarchitecture-dependent.
  • CT generator_mul uses an 11-block comb (COMB_BLOCKS=11, COMB_SPACING=4) with a ~31 KB table that fits in the U74's 32 KB L1D cache. This gives a 1.04x advantage over libsecp's generator_mul (91.4 us vs 95.4 us).
  • CT ECDSA Sign wins 1.06x. CT Schnorr Sign is 0.96x due to auxiliary overhead (SHA-256, nonce derivation) not related to the core ECC operation.
  • Verify speedups (1.12-1.13x) come from the same L1 icache optimization as x86 (called vs inlined additions) plus branchless conditional negate.

Key Optimisations (vs libsecp256k1)

  1. Precomputed generator table -- 8192-entry comb table for k x G (6.7 us vs 11.4 us on x86; 39.8 us vs 95.3 us on RV64)
  2. Force-inlined doubling -- jac52_double_inplace always-inline in hot loop
  3. Called (not inlined) additions -- Reduced ecmult function from 124 KB to 39 KB, fitting the hot loop in L1 I-cache (1.5 KB loop body vs 32 KB I-cache)
  4. Branchless conditional negate -- XOR-select in Strauss loop eliminates 50% unpredictable sign branches
  5. Single affine conversion in Schnorr verify -- Merged X-check + Y-parity into one Z^-1 computation (saves 1 sqr + 1 mul + redundant parse)
  6. SW prefetch -- Prefetch G/H table entries before doublings
  7. 2M+5S doubling formula -- Saves 1M per double vs libsecp's 3M+4S

How to Reproduce

# Clone and build
git clone --recurse-submodules <repo>
cd Secp256K1fast/libs/UltrafastSecp256k1
cmake -S ../.. -B build_rel -G Ninja -DCMAKE_BUILD_TYPE=Release
cmake --build build_rel -j

# Run benchmark (pin to one core for stability)
taskset -c 0 build_rel/cpu/bench_unified

Contributing Benchmarks

We welcome benchmark contributions from other platforms. To add your results:

  1. Run taskset -c 0 build_rel/cpu/bench_unified (or equivalent pinning)
  2. Copy the full terminal output
  3. Open a PR adding a new "Platform N" subsection with your hardware details

Platforms we'd especially like to see: AMD Zen 4/5, Apple M-series (ARM64), AWS Graviton, AMD EPYC, Intel Xeon Sapphire Rapids, Milk-V Pioneer (C920).


Future Optimizations

Planned

  • AVX-512 vectorization (x86-64)
  • Multi-threaded batch operations
  • ARM64 NEON/MUL assembly (DONE -- ~5x speedup)
  • OpenCL backend (DONE -- 3.39M kG/s)
  • Apple Metal backend (DONE -- 527M field_mul/s, M3 Pro)
  • Shared POD types across backends
  • ARM64 inline assembly (MUL/UMULH)

Experimental

  • AVX-512 vectorization (x86-64)
  • Multi-threaded batch operations
  • Montgomery domain for CUDA (mixed results)
  • 8x32-bit hybrid limb representation (DONE -- 1.10x faster mul)
  • Constant-time side-channel resistance (CT layer implemented)

Version

UltrafastSecp256k1 v3.16.0
Benchmarks updated: 2026-03-02