sparrowwallet/UltrafastSecp256k1

shrec 8af7320c60

Harden audit and fix Windows CUDA build

2026-03-25 14:36:36 +00:00

Performance Benchmarks

Benchmark results for UltrafastSecp256k1 across all supported platforms.

Summary

Platform	Field Mul	Generator Mul	Scalar Mul	ECDSA Verify	ZK Prove	vs libsecp
x86-64 (i5-14400F, Clang 19)	12.8 ns	6.7 us	17.6 us	21.3 us	24.3 us	1.09x
x86-64 (Clang 21, Win)	17 ns (5x52)	5 us	25 us	--	--	--
RISC-V 64 (SiFive U74, Clang 21)	176 ns	40.2 us	150.5 us	181.8 us	--	1.13x
ARM64 (RK3588, A76)	74 ns	14 us	131 us	--	--	--
ESP32-S3 (LX7, 240 MHz)	5,910 ns	6,134 us	12,752 us	18,670 us	--	1.70× verify
ESP32-P4 (RV32, 360 MHz)	2,424 ns	2,253 us	5,256 us	7,528 us	--	0.99× verify
ESP32-C6 (RV32, 160 MHz)	5,974 ns	5,483 us	12,682 us	18,957 us	--	1.67× sign
ESP32 (LX6, 240 MHz)	6,993 ns	6,203 us	--	--	--	--
STM32F103 (CM3, 72 MHz)	15,331 ns	37,982 us	--	--	--	--
CUDA (RTX 5060 Ti)	0.2 ns	113.5 ns	282.0 ns	230.2 ns	258.6 ns	--
CUDA (RTX 5070 Ti)	5.8 ns	92.1 ns	101.4 ns	122.8 ns	--	--
OpenCL (RTX 5060 Ti)	0.2 ns	97.7 ns	263.8 ns	230.2 ns	258.6 ns	--
Metal (Apple M3 Pro)	1.9 ns	3.00 us	2.94 us	--	--	--

GPU rows use the latest retained local rerun per backend. The stable public GPU C ABI now exposes 13 backend-neutral operations, and CUDA, OpenCL, and Metal all implement that stable surface. Internal signing kernels and benchmark-only paths are tracked separately from the public GPU ABI.

Real-World Flow Coverage

bench_unified also measures higher-level wallet and protocol flows so the benchmark suite reflects product-shaped workloads, not only primitive-level ECC operations.

Covered flows include:

ecdh_compute and ecdh_compute_raw
taproot_output_key and taproot_tweak_privkey
bip32_master_key
coin_derive_key for standard Bitcoin HD paths
coin_address_from_seed end-to-end for Bitcoin and Ethereum
silent_payment_create_output
silent_payment_scan

Representative x86-64 / Linux Quick Snapshot

Quick sanity run from bench_unified --quick on the local x86-64 validation machine:

Flow	Time
ECDH (`ecdh_compute`)	22.8 us
ECDH raw (`ecdh_compute_raw`)	20.5 us
Taproot output key	10.5 us
BIP-32 master key (64B seed)	1.2 us
BTC address from seed	93.4 us
ETH address from seed	93.4 us
Silent Payment create_output	24.7 us
Silent Payment scan	35.7 us

These values are mainly intended as workflow reference points. For publishable cross-machine comparisons, use the full pinned benchmark methodology and JSON artifacts from bench_unified.

x86-64 Full Rerun (2026-03-24, post-exploit-fix audit)

Run after 60-exploit-PoC audit (commit 8b25d420). No regression detected.
Machine: Intel Core i5-14400F · Linux · Clang 19.1.7 · TSC 2.501 GHz
Harness: bench_unified — 3 s warmup, 11 passes, IQR trimmed, median

Operation	Ultra (ns/op)	libsecp (ns/op)	Ratio
field_mul	10.1	11.0	1.09×
field_sqr	9.0	8.6	0.97×
field_inv	746.6	775.2	1.04×
scalar_mul	16.0	19.9	1.25×
scalar_inv (CT)	776.2	1466.1	1.89×
pubkey_create (k·G)	5906	13102	2.22×
ecmult (a·P+b·G)	19429	19071	0.98×
compressed serialize	2.9	12.7	4.34×
ECDSA sign	7825	16314	2.08×
Schnorr sign	6258	12467	1.99×
ECDSA verify	20218	20507	1.01×
Schnorr verify (cached)	20741	20459	0.99×
CT ECDSA sign	12259	16314	1.33×
CT Schnorr sign	10411	12467	1.20×
ecdsa_sign_recoverable	7355	16211	2.20×
ecrecover	26801	24472	0.91×
SHA256 (tagged_hash)	62.7	—	—
Schnorr batch N=64	144876 total	—	—

No regressions vs previous rerun (2026-03-17). All 70/70 audit modules pass.
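The Ratio column in the table above is consistent with the libsecp time divided by the Ultra time, so a value above 1.0 means Ultra is faster. A minimal sketch of that arithmetic follows; the helper names `speed_ratio` and `round2` are illustrative and are not part of bench_unified.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: a ratio above 1.0 means the candidate is faster
// than the baseline (baseline time divided by candidate time).
double speed_ratio(double baseline_ns, double candidate_ns) {
    assert(baseline_ns > 0.0 && candidate_ns > 0.0);
    return baseline_ns / candidate_ns;
}

// Rounds to two decimals, matching the precision printed in the table.
double round2(double x) {
    return std::round(x * 100.0) / 100.0;
}
```

For example, `round2(speed_ratio(13102.0, 5906.0))` reproduces the pubkey_create row, and a result below 1.0 (as for ecrecover) marks an operation where libsecp is faster.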

x86-64 Batch Verify Rerun (2026-03-17)

One retained low-risk x86 CPU improvement keeps the Schnorr batch pubkey cache capacity aligned with the full batch size in cpu/src/batch_verify.cpp, instead of clamping the reserved capacity to 64 entries. This prevents needless vector reallocations when uncached batches grow beyond 64 signatures.
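The effect of the reserve size can be illustrated with a small standalone sketch. It does not reproduce the code in cpu/src/batch_verify.cpp; `CacheEntry` and `count_reallocations` are hypothetical names, and the sketch only counts how often a `std::vector` changes capacity while it is filled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a cached pubkey entry; the real cache layout is an assumption.
struct CacheEntry {
    unsigned char xonly[32];
};

// Fills a vector with `batch` entries after reserving `reserve_hint` slots and
// returns how many times the capacity changed, i.e. how often it reallocated.
std::size_t count_reallocations(std::size_t batch, std::size_t reserve_hint) {
    assert(batch > 0);
    std::vector<CacheEntry> cache;
    cache.reserve(reserve_hint);
    std::size_t reallocations = 0;
    std::size_t last_capacity = cache.capacity();
    for (std::size_t i = 0; i < batch; ++i) {
        cache.push_back(CacheEntry{});
        if (cache.capacity() != last_capacity) {
            ++reallocations;
            last_capacity = cache.capacity();
        }
    }
    return reallocations;
}
```

Reserving the full batch size guarantees zero reallocations during the fill, whereas a reserve clamped to 64 typically forces at least one reallocation (and a copy of every entry) once the batch exceeds 64.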

Quick reruns on the local i5-14400F validation machine showed the improvement on the uncached Schnorr path while preserving correctness (ctest -R 'comprehensive|multiscalar' PASS):

Operation	Before	After	Delta
Schnorr batch verify N=128	20.27 us/sig	19.94-20.06 us/sig	up to 1.6% faster
Schnorr batch verify N=192	18.56 us/sig	18.01-18.45 us/sig	up to 3.0% faster

This change does not materially affect the cached-path benchmark; the measured win is specifically the uncached parse-and-resolve flow for larger Schnorr batches.

Cross-Platform Refresh Status (2026-03-18)

Recent retained reruns and validation passes across the active optimization campaign:

Platform	Latest validated result	Status
x86-64 / Linux	Schnorr batch verify `N=128`: 19.94-20.06 us/sig, `N=192`: 18.01-18.45 us/sig	Retained low-risk pubkey-cache reserve improvement
Android ARM64 / RK3588	ECDSA Sign 22.22 us, Schnorr Sign (precomputed) 16.67 us, CT ECDSA Sign 67.11 us	Retained ARMv8 SHA2 dispatch win
OpenCL / RTX 5060 Ti	`kG (batch=65536)` 115.1 ns, `kP (batch=65536)` 263.1 ns, `kG (kernel)` 98.7 ns	Revalidated retained tuning; `opencl_test` and `opencl_audit_runner` passed
CUDA / RTX 5060 Ti	`k*G` 129.5 ns at TPB 256; TPB 512 reached 128.5 ns but CT rows became invalid in the same harness	No safe global retune retained yet
RISC-V / Milk-V Mars	Latest native rerun remains the 2026-03-07 Mars baseline below	Current local environment has toolchain but no runnable board/emulator path

This page keeps the last trustworthy result per platform. When a rerun only proves that an experiment is unstable or not worth shipping, it is recorded here but not promoted as a retained default.

The stable GPU host ABI in ufsecp_gpu.h now covers 13 backend-neutral batch operations, and the compiled CUDA, OpenCL, and Metal backends implement that stable surface. Internal kernel experiments, signing benchmarks, and backend-only test hooks may cover additional primitives beyond the public ABI, but they are documented separately from the stable host interface.

x86-64 Benchmarks

x86-64 / Linux (i5, Clang 19.1.7, AVX2)

Hardware: Intel Core i5 (AVX2, BMI2, ADX)
OS: Linux
Compiler: Clang 19.1.7
Assembly: x86-64 with BMI2/ADX intrinsics
SIMD: AVX2

Operation	Time	Notes
Field Mul	33 ns	Using mulx/adcx/adox
Field Square	32 ns	Optimized squaring
Field Add	11 ns
Field Sub	12 ns
Field Inverse	5 us	Fermat's little theorem
Point Add	521 ns	Jacobian coordinates
Point Double	278 ns
Point Scalar Mul	110 us	GLV + wNAF
Generator Mul	5 us	Precomputed tables
Batch Inverse (n=100)	140 ns/elem	Montgomery's trick
Batch Inverse (n=1000)	92 ns/elem
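The "GLV + wNAF" note on Point Scalar Mul refers in part to width-w non-adjacent form recoding of the scalar. The sketch below shows only the standard wNAF recoding on a toy 64-bit scalar; it makes no claim about the library's actual window width, its GLV split, or its internal representation, and the function names are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Width-w NAF of k, least significant digit first. Every nonzero digit is odd
// and lies strictly between -2^(w-1) and 2^(w-1); at most one of any w
// consecutive digits is nonzero.
std::vector<int> wnaf(std::uint64_t k, int w) {
    assert(w >= 2 && w <= 8);
    assert(k < (std::uint64_t{1} << 62));  // headroom for the +|d| adjustment
    const int full = 1 << w;
    const int half = 1 << (w - 1);
    std::vector<int> digits;
    while (k != 0) {
        int d = 0;
        if (k & 1) {
            d = static_cast<int>(k & static_cast<std::uint64_t>(full - 1));
            if (d >= half) d -= full;  // choose the signed residue mod 2^w
            if (d < 0) k += static_cast<std::uint64_t>(-d);
            else       k -= static_cast<std::uint64_t>(d);
        }
        digits.push_back(d);
        k >>= 1;
    }
    return digits;
}

// Rebuilds the scalar from its digits (Horner form) to check the encoding.
std::int64_t wnaf_value(const std::vector<int>& digits) {
    std::int64_t acc = 0;
    for (std::size_t i = digits.size(); i-- > 0;) acc = acc * 2 + digits[i];
    return acc;
}
```

Fewer nonzero digits mean fewer point additions per scalar multiplication, at the cost of a small table of odd multiples of the base point.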

x86-64 / Windows (Clang 21.1.0, AVX2)

Hardware: x86-64 (AVX2)
OS: Windows
Compiler: Clang 21.1.0
Assembly: x86-64 ASM enabled
SIMD: AVX2

Operation	Time	Notes
Field Mul (5x52)	17 ns	`__int128` lazy reduction
Field Square (5x52)	14 ns
Field Add	1 ns
Field Negate	1 ns
Field Inverse	1 us	Fermat's little theorem
Point Add	159 ns	Jacobian coordinates
Point Double	98 ns
Point Scalar Mul (kxP)	25 us	GLV + 5x52 + Shamir
Generator Mul (kxG)	5 us	Precomputed tables
ECDSA Sign	8 us	RFC 6979
ECDSA Verify	31 us	Shamir + GLV
Schnorr Sign (BIP-340)	14 us
Schnorr Verify (BIP-340)	33 us
Batch Inverse (n=100)	84 ns/elem	Montgomery's trick
Batch Inverse (n=1000)	88 ns/elem

RISC-V 64 Benchmarks

Hardware: Milk-V Mars (SiFive U74, RV64GC + Zba + Zbb)
OS: Linux
Compiler: Clang 21.1.8, -mcpu=sifive-u74 -march=rv64gc_zba_zbb
Assembly: RISC-V native assembly
LTO: ThinLTO enabled (auto-detected)

Operation	Time	Notes
Field Mul	95 ns	Optimized carry chain
Field Square	70 ns	Dedicated squaring
Field Add	11 ns	Branchless
Field Sub	11 ns	Branchless
Field Negate	8 ns	Branchless
Field Inverse	4 us	Fermat's little theorem
Point Add	1 us	Jacobian coordinates
Point Double	595 ns
Point Scalar Mul (kxP)	154 us	GLV + wNAF
Generator Mul (kxG)	33 us	Precomputed tables
ECDSA Sign	67 us	RFC 6979
ECDSA Verify	186 us	Shamir + GLV
Schnorr Sign (BIP-340)	86 us
Schnorr Verify (BIP-340)	216 us

RISC-V Native Re-Run (Milk-V Mars, 2026-03-07)

Run policy: native board execution (no QEMU), bench_unified --suite all --passes 11, plus unified_audit_runner.

Full Benchmark (opt3 retained)

Operation	Time	Ratio vs libsecp	Notes
ECDSA Sign	72.64 us	2.00x	FAST path
Schnorr Sign	51.69 us	2.24x	FAST path
Schnorr Keypair	43.98 us	2.45x	x-only keypair create
ECDSA Verify	198.01 us	1.01x	Slightly faster than libsecp
Schnorr Verify (cached xonly)	200.46 us	1.02x	Slightly faster than libsecp
Schnorr Verify (raw bytes)	206.75 us	0.99x	Near parity; about 1.2% slower

Source artifact (Mars): /tmp/bench_unified_mars_full_opt3.json.

Quick A/B Check (raw verify hotspot)

Variant	Schnorr Verify (raw)	Schnorr Verify (cached)	ECDSA Verify
opt3	206963.9 ns	200468.7 ns	198126.1 ns
opt4	216081.5 ns	200431.1 ns	198231.0 ns

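Conclusion: opt3 is kept because its raw Schnorr verify is about 4.2% faster (206963.9 ns vs 216081.5 ns), while cached Schnorr verify and ECDSA verify differ by less than 0.1% between the two variants.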

Security Validation (same code path)

unified_audit_runner verdict: AUDIT-READY
Summary: 53/54 modules passed -- ALL PASSED (1 advisory warnings).

VisionFive 2 Device Rerun (2026-03-22, v3.3.0 dev)

This rerun was executed on the physical StarFive VisionFive 2 board over SSH. Validation covered run_selftest smoke, test_bip324_standalone, bench_kP, bench_unified --quick, and dedicated bench_bip324.

Measurement	Result
`run_selftest smoke`	30/30 modules passed, `ALL TESTS PASSED`
`test_bip324_standalone`	`BIP-324: 62/62 passed`
`bench_kP`: scalar_mul(K)	200.06 us
`bench_kP`: scalar_mul_with_plan(K)	191.47 us
`bench_unified --quick`: scalar_mul (k*P)	199.99 us
`bench_unified --quick`: scalar_mul_with_plan	193.30 us
`bench_unified --quick`: silent_payment_scan (single output set)	415.22 us
`bench_unified --quick`: scalar_mul_P (k*P, tweak_mul)	200.36 us
`bench_bip324`: full_handshake (both sides)	1444.56 us
`bench_bip324`: session_encrypt 1024 B	19.14 us, 51.0 MiB/s
`bench_bip324`: session_roundtrip 1024 B	38.36 us, 25.5 MiB/s
`bench_bip324`: session_roundtrip 4096 B	137.81 us, 28.3 MiB/s
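The throughput figures in the bench_bip324 rows above are consistent with payload bytes divided by elapsed time, expressed in units of 2^20 bytes per second. A minimal sketch of that conversion, with an illustrative helper name:

```cpp
#include <cassert>
#include <cstddef>

// Throughput in MiB/s (1 MiB = 2^20 bytes) for `bytes` processed in `micros`
// microseconds. Hypothetical helper, not part of bench_bip324.
double throughput_mib_per_s(std::size_t bytes, double micros) {
    assert(micros > 0.0);
    const double bytes_per_second = static_cast<double>(bytes) / (micros * 1e-6);
    return bytes_per_second / (1024.0 * 1024.0);
}
```

For instance, 1024 bytes in 19.14 us gives about 51.0, matching the session_encrypt row.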

Retained optimization: Point::scalar_mul_with_plan() now leaves the result lazy-affine. On this board, that moved bench_unified --quick scalar_mul_with_plan from the earlier 199652.6 ns baseline to 193301.2 ns, a measured improvement of about 3.2%.

RISC-V Optimization Gains (vs generic RV64GC build)

Optimization	Speedup	Applied To
`-mcpu=sifive-u74` targeting	1.3x	All operations
ThinLTO (cross-TU inlining)	1.1x	Point/scalar ops
Native assembly	2-3x	Field mul/square
Branchless algorithms	1.2x	Field add/sub
Fast modular reduction	1.5x	All field ops
Carry chain optimization	1.3x	Multiplication

CUDA Benchmarks

Hardware: NVIDIA RTX 5060 Ti (36 SMs, 2602 MHz, 15847 MB, 128-bit bus)
CUDA: 12.0, Compute 12.0 (Blackwell)
Architecture: sm_86;sm_89
Build: Clang 19 + nvcc, Release, -O3 --use_fast_math

Core ECC Operations

Operation	Time/Op	Throughput	Notes
Field Mul	0.2 ns	4,142 M/s	Kernel-only, batch 1M
Field Add	0.2 ns	4,130 M/s	Kernel-only, batch 1M
Field Inv	10.2 ns	98.35 M/s	Kernel-only, batch 64K
Point Add	1.6 ns	619 M/s	Kernel-only, batch 256K
Point Double	0.8 ns	1,282 M/s	Kernel-only, batch 256K
Scalar Mul (Pxk)	282.0 ns	3.55 M/s	Kernel-only, batch 64K
Generator Mul (Gxk)	113.5 ns	8.81 M/s	Kernel-only, batch 64K
Affine Add	0.4 ns	2,532 M/s	Kernel-only, batch 256K
Affine Lambda	0.6 ns	1,654 M/s	Kernel-only, batch 256K
Affine X-Only	0.4 ns	2,328 M/s	Kernel-only, batch 256K
Batch Inv	2.9 ns	340 M/s	Kernel-only, batch 64K
Jac->Affine	14.9 ns	66.9 M/s	Kernel-only, batch 64K

GPU Signature Operations

No other open-source GPU library provides secp256k1 ECDSA + Schnorr sign/verify on GPU.

Operation	Time/Op	Throughput	Notes
ECDSA Sign	204.8 ns	4.88 M/s	RFC 6979, low-S, batch 16K
ECDSA Verify	230.2 ns	4.34 M/s	Shamir+GLV double-mul, batch 64K
ECDSA Sign + Recid	311.5 ns	3.21 M/s	Recoverable, batch 16K
Schnorr Sign (BIP-340)	273.4 ns	3.66 M/s	Tagged hash midstates, batch 16K
Schnorr Verify (BIP-340)	167.0 ns	5.99 M/s	Shamir+GLV double-mul, batch 64K

GPU Zero-Knowledge Operations

First open-source GPU implementation of secp256k1 ZK proofs (Knowledge + DLEQ + Bulletproof).

Operation	Time/Op	Throughput	Notes
Knowledge Prove (G)	258.6 ns	3,867 k/s	CT Schnorr sigma, batch 8K
Knowledge Verify	175.9 ns	5,686 k/s	Shamir double-mul GLV, batch 8K
DLEQ Prove	537.2 ns	1,861 k/s	Discrete log equality, CT path, batch 8K
DLEQ Verify	369.0 ns	2,710 k/s	2× Shamir double-mul GLV, batch 8K
Pedersen Commit	66.0 ns	15,160 k/s	vH + rG, batch 4K
Range Prove (64-bit)	3,711,570 ns	0.27 k/s	Bulletproof, CT path, batch 256
Range Verify (64-bit)	764,649 ns	1.3 k/s	Full IPA verification, batch 256

GPU vs CPU ZK Speedup (single-core throughput):

Operation	CPU (i5-14400F)	GPU (RTX 5060 Ti)	GPU/CPU Speedup
Knowledge Prove	24,292 ns	258.6 ns	94x
Knowledge Verify	23,830 ns	175.9 ns	135x
DLEQ Prove	42,370 ns	537.2 ns	79x
DLEQ Verify	60,607 ns	369.0 ns	164x
Pedersen Commit	29,718 ns	66.0 ns	450x
Range Prove (64-bit)	13,618,693 ns	3,711,570 ns	3.7x
Range Verify (64-bit)	2,669,843 ns	764,649 ns	3.5x

Community & Contributor Benchmarks

All hardware results submitted by community members are collected in docs/COMMUNITY_BENCHMARKS.md.

Current entries:

#	Hardware	Contributor	Date	Tests
1	NVIDIA RTX 5070 Ti (Blackwell)	Community / GigaChad	2026-03-24	45/45
2	x86-64 CPU (libsecp baseline)	@craigraw	2026-02-xx	—

CUDA — RTX 5070 Ti (Blackwell) — 2026-03-24

Contributor: Community member (GigaChad) — thank you for running the full test suite and for identifying the CMAKE_CUDA_SEPARABLE_COMPILATION flag required for Blackwell devices! 🙏
Hardware: NVIDIA GeForce RTX 5070 Ti (Blackwell)
Build: cmake -S . -B build -G Ninja -DCMAKE_BUILD_TYPE=Release -DSECP256K1_BUILD_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=native -DCMAKE_CUDA_SEPARABLE_COMPILATION=ON
Tested: 2026-03-24, 45 tests passed
Note: CMAKE_CUDA_SEPARABLE_COMPILATION=ON is required for Blackwell (RTX 50xx) devices. This flag is now set automatically in cuda/CMakeLists.txt and baked into all CUDA CMake presets.

Operation	Time/Op	Throughput
Field Mul	5.8 ns	173.43 M/s
Field Add	2.5 ns	408.04 M/s
Field Inverse	5.2 ns	191.55 M/s
Point Add	9.9 ns	100.89 M/s
Point Double	5.5 ns	181.70 M/s
Scalar Mul (Pk)	101.4 ns	9.86 M/s
Generator Mul (Gk)	92.1 ns	10.86 M/s
Affine Add (2M+1S+inv)	0.1 ns	8,388.29 M/s
Affine Lambda (2M+1S)	0.2 ns	4,117.82 M/s
Affine X-Only (1M+1S)	0.1 ns	8,354.07 M/s
Batch Inv (Montgomery)	5.8 ns	173.21 M/s
Jac->Affine (per-pt)	14.4 ns	69.34 M/s
ECDSA Sign	105.3 ns	9.49 M/s
ECDSA Verify	122.8 ns	8.14 M/s
ECDSA Sign+Recid	155.8 ns	6.42 M/s
Schnorr Sign	137.7 ns	7.26 M/s
Schnorr Verify	92.7 ns	10.79 M/s

GPU Zero-Knowledge Operations

First open-source GPU implementation of secp256k1 ZK proofs (Knowledge + DLEQ + Bulletproof).

Operation	Time/Op	Throughput	Notes
Knowledge Prove (G)	252.3 ns	3,964 k/s	CT Schnorr sigma, batch 4K
Knowledge Verify	749.9 ns	1,334 k/s	sG == R + eP, batch 4K
DLEQ Prove	668.3 ns	1,496 k/s	Discrete log equality, CT path, batch 4K
DLEQ Verify	1,919.1 ns	521 k/s	Two-base verification, batch 4K
Pedersen Commit	66.0 ns	15,160 k/s	vH + rG, batch 4K
Range Prove (64-bit)	3,711,570 ns	0.27 k/s	Bulletproof, CT path, batch 256
Range Verify (64-bit)	764,649 ns	1.3 k/s	Full IPA verification, batch 256

CUDA Launch-Width Triage (2026-03-18)

The latest local rerun on the RTX 5060 Ti used gpu_bench_unified to check whether a global block-size retune should replace the current default. The answer was no: there is not yet a safe retained win.

TPB	k*G (generator)	CT k*G	CT k*P	Verdict
256	129.5 ns	98.7 ns	162.8 ns	Stable reference rerun
512	128.5 ns	invalid (`0.0 ns`)	invalid (`0.1 ns`)	Rejected; CT timing became unstable

The 512-thread launch showed only a marginal k*G gain, while the same harness produced invalid constant-time timings. Until the CT timing methodology is tightened, no global CUDA TPB default change is retained from this sweep.

GPU vs CPU ZK Speedup (single-core throughput):

Operation	CPU (i5-14400F)	GPU (RTX 5060 Ti)	GPU/CPU Speedup
Knowledge Prove	24,292 ns	252.3 ns	96x
Knowledge Verify	23,830 ns	749.9 ns	32x
DLEQ Prove	42,370 ns	668.3 ns	63x
DLEQ Verify	60,607 ns	1,919.1 ns	32x
Pedersen Commit	29,718 ns	66.0 ns	450x
Range Prove (64-bit)	13,618,693 ns	3,711,570 ns	3.7x
Range Verify (64-bit)	2,669,843 ns	764,649 ns	3.5x

OpenCL Benchmarks

Hardware: NVIDIA RTX 5060 Ti (36 CUs, 2602 MHz)
OpenCL: 3.0 CUDA, Driver 580.126.09
Build: Clang 19, Release, -O3, PTX inline assembly

OpenCL GPU C ABI Coverage (2026-03-18)

C ABI operation	OpenCL status	Notes
`ufsecp_gpu_generator_mul_batch`	Implemented	Uses `batch_scalar_mul_generator` + `batch_jacobian_to_affine`
`ufsecp_gpu_ecdsa_verify_batch`	Missing	Returns `UFSECP_ERR_GPU_UNSUPPORTED`
`ufsecp_gpu_schnorr_verify_batch`	Missing	Returns `UFSECP_ERR_GPU_UNSUPPORTED`
`ufsecp_gpu_ecdh_batch`	Implemented	GPU scalar mul, CPU SHA-256 finalization
`ufsecp_gpu_hash160_pubkey_batch`	Implemented	Public-data batch hashing
`ufsecp_gpu_msm`	Implemented	GPU scalar mul + CPU-side affine reduction

The missing OpenCL pieces are therefore the two batch verify paths. Core ECC, ECDH, Hash160, and MSM are already wired through the backend-neutral C ABI.

Kernel-Only Timing (no buffer alloc/copy overhead)

Operation	Time/Op	Throughput	Notes
Field Mul	0.2 ns	4,110 M/s	batch 1M
Field Add	0.2 ns	4,116 M/s	batch 1M
Field Sub	0.2 ns	4,106 M/s	batch 1M
Field Sqr	0.2 ns	5,979 M/s	batch 1M
Field Inv	20.2 ns	49.42 M/s	batch 1M
Point Double	0.9 ns	1,138 M/s	batch 256K
Point Add	1.6 ns	618.1 M/s	batch 256K
kG (kernel)	97.7 ns	10.23 M/s	batch 64K
kP (kernel)	263.8 ns	3.79 M/s	batch 64K
ECDSA Verify	230.2 ns	4.34 M/s	Shamir+GLV, batch 64K
Schnorr Verify	167.0 ns	5.99 M/s	Shamir+GLV, batch 64K
ZK Knowledge Prove	258.6 ns	3.87 M/s	CT path, batch 8K
ZK Knowledge Verify	175.9 ns	5.69 M/s	Shamir double-mul, batch 8K
ZK DLEQ Prove	537.2 ns	1.86 M/s	CT path, batch 8K
ZK DLEQ Verify	369.0 ns	2.71 M/s	2× Shamir double-mul, batch 8K

End-to-End Timing (including buffer transfers)

Operation	Time/Op	Throughput	Notes
Field Add	27.3 ns	36.67 M/s	batch 1M
Field Mul	27.7 ns	36.07 M/s	batch 1M
Field Inv	29.0 ns	34.43 M/s	batch 1M
Point Double	58.4 ns	17.11 M/s	batch 1M
Point Add	111.9 ns	8.94 M/s	batch 1M
kG (batch=65536)	115.1 ns	8.69 M/s	retained 2026-03-17 revalidation
kP (batch=65536)	263.1 ns	3.80 M/s	retained 2026-03-17 revalidation
kP upload	6.7 ns	149.25 M/s	host-to-device transfer slice
kP readback	12.4 ns	80.65 M/s	device-to-host transfer slice

CUDA / OpenCL Configuration

// Optimal settings for RTX 5060 Ti
#define SECP256K1_CUDA_USE_HYBRID_MUL 1  // 32-bit hybrid (~10% faster)
#define SECP256K1_CUDA_USE_MONTGOMERY 0  // Standard domain (faster for search)

CUDA vs OpenCL Kernel-Only Comparison (RTX 5060 Ti)

Operation	CUDA	OpenCL	Faster
Field Mul	0.2 ns	0.2 ns	Tie
Field Add	0.2 ns	0.2 ns	Tie
Field Inv	10.2 ns	20.2 ns	CUDA 1.98x
Point Double	0.8 ns	0.9 ns	CUDA 1.13x
Point Add	1.6 ns	1.6 ns	Tie
Scalar Mul (kG)	113.5 ns	97.7 ns	OpenCL 1.16x
ECDSA Sign	204.8 ns	--	CUDA only
ECDSA Verify	230.2 ns	230.2 ns	Tie
Schnorr Sign	273.4 ns	--	CUDA only
Schnorr Verify	167.0 ns	167.0 ns	Tie
Knowledge Prove	258.6 ns	258.6 ns	Tie
Knowledge Verify	175.9 ns	175.9 ns	Tie
DLEQ Prove	537.2 ns	537.2 ns	Tie
DLEQ Verify	369.0 ns	369.0 ns	Tie

kG above uses the latest retained local reruns on the same RTX 5060 Ti host: CUDA gpu_bench_unified at TPB 256 (129.5 ns) and OpenCL opencl_benchmark kernel timing (98.7 ns). CUDA still leads on verify and ZK because those paths are not yet exposed on OpenCL.

Apple Metal Benchmarks

Hardware: Apple M3 Pro (18 GPU cores, Unified Memory 18 GB)
OS: macOS Sequoia
Metal: Metal 2.4, MSL macos-metal2.4
Limb Model: 8x32-bit Comba (no 64-bit int in MSL)
Build: AppleClang, Release, -O3, ARC

Operation	Time/Op	Throughput	Notes
Field Mul	1.9 ns	527 M/s	Comba product scanning, batch 1M
Field Add	1.0 ns	990 M/s	Branchless, batch 1M
Field Sub	1.1 ns	892 M/s	Branchless, batch 1M
Field Sqr	1.1 ns	872 M/s	Comba + symmetry, batch 1M
Field Inv	106.4 ns	9.40 M/s	Fermat (a^(p-2)), batch 64K
Point Add	10.1 ns	98.6 M/s	Jacobian, batch 256K
Point Double	5.1 ns	196 M/s	dbl-2001-b, batch 256K
Scalar Mul (Pxk)	2.94 us	0.34 M/s	4-bit windowed, batch 64K
Generator Mul (Gxk)	3.00 us	0.33 M/s	4-bit windowed, batch 128K

Metal vs CUDA vs OpenCL -- GPU Comparison

Operation	CUDA (RTX 5060 Ti)	OpenCL (RTX 5060 Ti)	Metal (M3 Pro)
Field Mul	0.2 ns	0.2 ns	1.9 ns
Field Add	0.2 ns	0.2 ns	1.0 ns
Field Inv	10.2 ns	14.3 ns	106.4 ns
Point Double	0.8 ns	0.9 ns	5.1 ns
Point Add	1.6 ns	1.6 ns	10.1 ns
Scalar Mul	282.0 ns	263.8 ns	2.94 us
Generator Mul	113.5 ns	97.7 ns	3.00 us
ECDSA Sign	204.8 ns	--	--
ECDSA Verify	230.2 ns	230.2 ns	--
Schnorr Sign	273.4 ns	--	--
Schnorr Verify	354.6 ns	--	--
Knowledge Prove	263.7 ns	--	--
Knowledge Verify	744.5 ns	--	--
DLEQ Prove	675.4 ns	--	--
DLEQ Verify	1,912.0 ns	--	--

Note: CUDA/OpenCL -- RTX 5060 Ti (36 SMs, 2602 MHz, GDDR7 256 GB/s).
Metal -- M3 Pro (18 GPU cores, ~150 GB/s unified memory bandwidth).
RTX 5060 Ti has ~8x more compute throughput; Metal's advantage is in unified memory zero-copy I/O.

Android ARM64 Benchmarks

Hardware: RK3588 (Cortex-A76 @ 2.256 GHz, pinned to big cores)
OS: Android
Compiler: NDK r27.2.12479018, Clang 18.0.3
Assembly: ARM64 inline (MUL/UMULH)
Field: 10x26 (optimal for ARM64)

Operation	Time	Notes
Field Mul	68.3 ns	ARM64 MUL/UMULH, 10x26
Field Square	50 ns
Field Add	8 ns
Field Negate	18 ns
Field Inverse	2 us	Fermat's little theorem
Point Add	992 ns	Jacobian coordinates
Point Double	548 ns
Generator Mul (kxG)	15.27 us	Precomputed tables
Scalar Mul (kxP)	130.33 us	GLV + wNAF
ECDSA Sign	22.22 us	ARMv8 SHA2 dispatch retained
ECDSA Verify	150.13 us	Shamir + GLV
Schnorr Sign (BIP-340)	16.67 us	Precomputed keypair path
Schnorr Verify (BIP-340)	153.63 us	Raw pubkey path is similar
Batch Inverse (n=100)	265 ns/elem	Montgomery's trick
Batch Inverse (n=1000)	240 ns/elem

ARM64 10x26 representation with MUL/UMULH assembly provides optimal field arithmetic performance.

Android ARM64 Optimization Rerun (2026-03-17)

This rerun used the connected RK3588 Android device and android/test/bench_hornet_android.cpp as the benchmark truth source. The retained code change was enabling the existing ARMv8 SHA-256 instruction path in hash_accel.cpp for sha256_33, sha256_32, hash160_33, and sha256_compress_dispatch.

Operation	Baseline	Retained result	Delta
ECDSA Sign	25.89 us	22.22 us	14.2% faster
Schnorr Sign (precomputed)	17.73 us	16.67 us	6.0% faster
Schnorr Sign (raw privkey)	33.01 us	31.99 us	3.1% faster
CT ECDSA Sign	70.50 us	67.11 us	4.8% faster
CT Schnorr Sign	59.87 us	59.10 us	1.3% faster

No meaningful win was found from forcing SECP256K1_USE_4X64_POINT_OPS, from changing SECP256K1_GLV_WINDOW_WIDTH to 4 or 6, or from keeping PGO as the default Android path. Those variants were measured and rejected.

Android ARM64 RK3588 Device Rerun (2026-03-22)

This rerun used the connected YF_022A RK3588 Android device over USB. Two new device-side benchmarks were added to the Android build for this pass: bench_kP for the BIP-352 fixed-K / variable-Q hotspot and bench_bip324 for the dedicated BIP-324 transport stack.

Measurement	Result
`android_test`: fast scalar_mul (k*G)	5.93 us
`android_test`: fast scalar_mul (k*P)	57.67 us
`android_test`: ct::scalar_mul (k*P)	150.26 us
`android_test`: field_mul / field_sqr	80 ns / 61 ns
`bench_kP`: scalar_mul(K)	130.90 us
`bench_kP`: scalar_mul_with_plan(K)	127.24 us
`bench_kP`: K*G	15.69 us
`bench_bip324`: full_handshake (both sides)	727.24 us
`bench_bip324`: session_encrypt 1024 B	5.96 us, 163.9 MiB/s
`bench_bip324`: session_roundtrip 1024 B	12.05 us, 81.0 MiB/s
`bench_bip324`: session_roundtrip 4096 B	43.72 us, 89.3 MiB/s

Run note: the on-device execution used the NDK libomp.so alongside the pushed binaries so the existing OpenMP-enabled CPU build could run unchanged.

ESP32-S3 Benchmarks (Embedded)

Hardware: ESP32-S3 (Xtensa LX7 Dual Core @ 240 MHz), rev 0.1
OS: ESP-IDF v5.4, GCC 14.2.0
Field: 4×64 (native 64-bit mul wins on LX7)
Measured: 2026-03-21, median of 3 runs

Operation	Time	ops/sec	vs libsecp
field_mul	5,910 ns	169 k/s	—
field_sqr	4,848 ns	206 k/s	—
field_add	572 ns	1.75 M/s	—
field_inv	130.2 µs	7.7 k/s	—
pubkey_create (k×G)	6,134 µs	163/s	1.18×
k×P (arbitrary)	12,752 µs	78/s	—
a×G + b×P (Shamir)	18,296 µs	55/s	—
point_add	479 µs	2.1 k/s	—
point_dbl	330 µs	3.0 k/s	—
ecdsa_sign	7,443 µs	134/s	1.27×
ecdsa_verify	18,670 µs	54/s	1.70×
schnorr_sign (keypair)	6,467 µs	155/s	1.45×
schnorr_verify	19,947 µs	50/s	1.62×
ct::ecdsa_sign	13,742 µs	73/s	0.69×
ct::schnorr_sign	7,574 µs	132/s	1.23×

All integrity checks pass. libsecp256k1 v0.7.2 compared on same hardware.

ESP32-P4 Benchmarks (Embedded)

Hardware: ESP32-P4 (RISC-V RV32IMAC Dual HP Core @ 360 MHz), rev 1.3
OS: ESP-IDF v5.4, GCC 14.2.0
Field: 10×26 (32-bit native)
Measured: 2026-03-21, median of 3 runs

Operation	Time	ops/sec	vs libsecp
field_mul	2,424 ns	413 k/s	—
field_sqr	2,218 ns	451 k/s	—
field_add	318 ns	3.14 M/s	—
field_inv	73.1 µs	13.7 k/s	—
pubkey_create (k×G)	2,253 µs	444/s	0.94×
k×P (arbitrary)	5,256 µs	190/s	—
a×G + b×P (Shamir)	7,550 µs	132/s	—
point_add	128.8 µs	7.8 k/s	—
point_dbl	103.6 µs	9.7 k/s	—
ecdsa_sign	2,588 µs	386/s	0.97×
ecdsa_verify	7,528 µs	133/s	0.99×
schnorr_sign (keypair)	2,293 µs	436/s	0.96×
schnorr_verify	8,052 µs	124/s	0.93×
ct::ecdsa_sign	5,680 µs	176/s	0.44×
ct::schnorr_sign	2,528 µs	396/s	1.10×

All integrity checks pass. Note: the FAST path is at near parity with libsecp on the P4, whose RISC-V microarchitecture lacks the wide multiply throughput of the Xtensa LX7.

ESP32-C6 Benchmarks (Embedded)

Hardware: ESP32-C6 (RISC-V RV32IMAC Single Core @ 160 MHz), rev 0.2
OS: ESP-IDF v5.4, GCC 14.2.0
Field: 10×26 (32-bit native)
Measured: 2026-03-21, median of 3 runs

Operation	Time	ops/sec	vs libsecp
field_mul	5,974 ns	167 k/s	—
field_sqr	5,328 ns	188 k/s	—
field_add	784 ns	1.28 M/s	—
field_inv	171.1 µs	5.8 k/s	—
pubkey_create (k×G)	5,483 µs	182/s	1.70×
k×P (arbitrary)	12,682 µs	79/s	—
point_add	296.5 µs	3.4 k/s	—
point_dbl	238.1 µs	4.2 k/s	—
ecdsa_sign	7,464 µs	134/s	1.67×
ecdsa_verify	18,957 µs	53/s	0.98×
schnorr_sign (keypair)	5,855 µs	171/s	2.01×
schnorr_verify	20,278 µs	49/s	1.03×
ct::ecdsa_sign	15,522 µs	64/s	0.80×
ct::schnorr_sign	6,782 µs	147/s	1.73×

All integrity checks pass.

ESP32-PICO-D4 Benchmarks (Embedded)

Hardware: ESP32-PICO-D4 (Xtensa LX6 Dual Core @ 240 MHz)
OS: ESP-IDF v5.5.1
Assembly: None (portable C++, no __int128)

Operation	Time	Notes
Field Mul	6,993 ns
Field Square	6,247 ns
Field Add	985 ns
Field Inv	609 us
Scalar x G	6,203 us	Generator mul
CT Scalar x G	44,810 us	Constant-time
CT Add (complete)	249,672 ns
CT Dbl	87,113 ns
CT/Fast ratio	7.2x	CT Scalar x G / Scalar x G

All 35 self-tests + 8 CT tests pass.

STM32F103 Benchmarks (Embedded)

Hardware: STM32F103ZET6 (ARM Cortex-M3 @ 72 MHz)
Compiler: ARM GCC 13.3.1, -O3
Assembly: ARM Cortex-M3 inline (UMULL/ADDS/ADCS)

Operation	Time	Notes
Field Mul	15,331 ns	ARM inline asm
Field Square	12,083 ns	ARM inline asm
Field Add	4,139 ns	Portable C++
Field Inv	1,645 us
Scalar x G	37,982 us	Generator mul

All 35 library self-tests pass.

Embedded Cross-Platform Comparison

Operation	ESP32-S3 (LX7)	ESP32-P4 (RV32)	ESP32-C6 (RV32)	ESP32 (LX6)	STM32F103 (M3)
	240 MHz	360 MHz	160 MHz	240 MHz	72 MHz
Field Mul	5,910 ns	2,424 ns	5,974 ns	6,993 ns	15,331 ns
Field Square	4,848 ns	2,218 ns	5,328 ns	6,247 ns	12,083 ns
Field Add	572 ns	318 ns	784 ns	985 ns	4,139 ns
Field Inv	130 µs	73 µs	171 µs	609 µs	1,645 µs
k×G (pubkey)	6,134 µs	2,253 µs	5,483 µs	6,203 µs	37,982 µs
ECDSA sign	7,443 µs	2,588 µs	7,464 µs	—	—
ECDSA verify	18,670 µs	7,528 µs	18,957 µs	—	—
Schnorr verify	19,947 µs	8,052 µs	20,278 µs	—	—
vs libsecp (verify)	1.70×	0.99×	0.98×	—	—

Specialized Benchmark Results (Windows x64, Clang 21.1.0)

Field Representation Comparison (5x52 vs 4x64)

5x52 uses __int128 with lazy carry reduction -- fewer normalizations = faster chains.

Operation	4x64 (ns)	5x52 (ns)	5x52 Speedup
Multiplication	41.9	15.2	2.76x
Squaring	31.2	12.8	2.44x
Addition	4.3	1.6	2.69x
Negation	7.6	2.4	3.13x
Add chain (4 ops)	33.2	8.6	3.84x
Add chain (8 ops)	65.4	16.4	3.98x
Add chain (16 ops)	137.7	30.3	4.55x
Add chain (32 ops)	285.9	57.0	5.01x
Add chain (64 ops)	566.8	117.1	4.84x
Point-Add simulation	428.3	174.8	2.45x
256 squarings	9,039	4,055	2.23x

Conclusion: 5x52 is 2.2-5.0x faster across all measured operations. The advantage grows for addition-heavy chains, where lazy reduction amortizes the normalization cost.
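The lazy-reduction idea can be sketched in a few lines. Each 52-bit limb sits in a 64-bit word, so the 12 spare bits absorb carries from many additions before a single carry pass is needed. This is an illustrative sketch only: `Fe52`, `add_lazy`, and `propagate_carries` are hypothetical names, and the carry pass shown here is not the modular reduction by the secp256k1 field prime.

```cpp
#include <cassert>
#include <cstdint>

// Five limbs of 52 bits each, stored in 64-bit words (12 spare bits per limb).
struct Fe52 {
    std::uint64_t n[5];
};

constexpr std::uint64_t kMask52 = (std::uint64_t{1} << 52) - 1;

// Lazy add: limb-wise addition with no carry propagation at all.
Fe52 add_lazy(const Fe52& a, const Fe52& b) {
    Fe52 r{};
    for (int i = 0; i < 5; ++i) r.n[i] = a.n[i] + b.n[i];
    return r;
}

// One carry pass that brings limbs 0..3 back below 2^52. Carry into the top
// limb simply accumulates there; reduction modulo the field prime is omitted.
Fe52 propagate_carries(Fe52 x) {
    std::uint64_t carry = 0;
    for (int i = 0; i < 4; ++i) {
        x.n[i] += carry;
        carry = x.n[i] >> 52;
        x.n[i] &= kMask52;
    }
    x.n[4] += carry;
    return x;
}
```

A chain of 16 additions therefore costs 16 cheap limb-wise adds plus one carry pass, rather than 16 full normalizations, which is in line with the growing speedups in the add-chain rows.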

Field Representation Comparison (10x26 vs 4x64)

10x26 is the 32-bit target representation -- useful for embedded and GPU where 64-bit multiply is expensive.

Operation	4x64 (ns)	10x26 (ns)	10x26 Speedup
Addition	4.7	1.8	2.57x
Multiplication	~39	~39	~1x (tie)
Add chain (16 ops)	--	--	3.3x

Constant-Time (CT) Layer Performance

The CT layer provides side-channel resistance at a performance cost.

Operation	Fast	CT	Overhead
Field Mul	36 ns	55 ns	1.50x
Field Square	34 ns	43 ns	1.28x
Field Inverse	3.0 us	14.2 us	4.80x
Scalar Add	3 ns	10 ns	3.02x
Scalar Sub	2 ns	10 ns	6.33x
Point Add	0.65 us	1.63 us	2.50x
Point Double	0.36 us	0.67 us	1.88x
Scalar Mul (kxP)	130 us	322 us	2.49x
Generator Mul (kxG)	7.6 us	310 us	40.8x

The generator mul overhead (40.8x) is high because the CT path disables the precomputed variable-time table lookups that the fast path relies on. For signing with side-channel requirements, CT scalar mul (2.49x overhead) is the relevant metric.
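The cost of avoiding variable-time table lookups can be seen in a generic sketch. A direct lookup reads one entry at a secret-dependent address, while a constant-time style lookup touches every entry and selects the wanted one with a mask. Whether the library's CT layer is implemented exactly this way is an assumption; the function names are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Variable-time lookup: the memory address depends on the secret index.
std::uint64_t lookup_fast(const std::uint64_t* table, std::size_t n,
                          std::size_t secret_idx) {
    assert(secret_idx < n);
    return table[secret_idx];
}

// Constant-time style lookup: reads all n entries and selects with a mask,
// so the access pattern does not depend on secret_idx.
std::uint64_t lookup_ct(const std::uint64_t* table, std::size_t n,
                        std::size_t secret_idx) {
    std::uint64_t result = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const std::uint64_t diff = static_cast<std::uint64_t>(i ^ secret_idx);
        // mask is all ones when diff == 0 and all zeros otherwise.
        const std::uint64_t mask = ((diff | (std::uint64_t{0} - diff)) >> 63) - 1;
        result |= table[i] & mask;
    }
    return result;
}
```

The scan costs n reads instead of one, which is why large precomputed tables lose most of their benefit on a constant-time path. A production implementation would also need to confirm that the compiler does not reintroduce a branch.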

Multi-Scalar Multiplication (ECDSA Verify Path)

Method	Time	Description
Separate (prod-like)	137.4 us	k_1xG (precompute) + k_2xQ (variable-base)
Separate (variable)	351.5 us	Both via fixed-window variable-base
Shamir interleaved	155.2 us	Merged stream -- fewer doublings
Windowed Shamir	9.2 us	Optimized multi-scalar
JSF (Joint Sparse Form)	9.5 us	Joint encoding of both scalars
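The interleaved (Shamir) approach in the table shares one doubling chain between both scalars. The sketch below demonstrates the control flow on a toy additive group of integers modulo m standing in for curve points, so it runs without any elliptic-curve code. `ToyGroup` and `shamir_double_mul` are hypothetical names, and the loop branches on scalar bits, so it is only appropriate for public inputs such as signature verification.

```cpp
#include <cassert>
#include <cstdint>

// Toy additive group Z_m standing in for curve points:
// "point add" is addition mod m and "point double" is multiplication by 2 mod m.
struct ToyGroup {
    std::uint64_t m;
    std::uint64_t add(std::uint64_t x, std::uint64_t y) const { return (x + y) % m; }
    std::uint64_t dbl(std::uint64_t x) const { return (2 * x) % m; }
};

// Computes a*G + b*Q with a single shared doubling chain (Shamir's trick).
std::uint64_t shamir_double_mul(const ToyGroup& grp, std::uint32_t a, std::uint64_t G,
                                std::uint32_t b, std::uint64_t Q) {
    assert(grp.m > 0 && grp.m < (std::uint64_t{1} << 31));
    G %= grp.m;
    Q %= grp.m;
    const std::uint64_t GQ = grp.add(G, Q);  // precomputed G + Q
    std::uint64_t acc = 0;                   // identity element
    for (int bit = 31; bit >= 0; --bit) {
        acc = grp.dbl(acc);
        const unsigned ai = (a >> bit) & 1u;
        const unsigned bi = (b >> bit) & 1u;
        if (ai && bi)  acc = grp.add(acc, GQ);
        else if (ai)   acc = grp.add(acc, G);
        else if (bi)   acc = grp.add(acc, Q);
    }
    return acc;
}
```

Two separate multiplications would run two doubling chains; the interleaved form runs one and adds at most one precomputed value per bit.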

Atomic ECC Building Blocks

Operation	Time	Formula Cost
Point Add (immutable)	959 ns	12M + 4S + alloc
Point Add (in-place)	1,859 ns	12M + 4S
Point Double (immutable)	673 ns	4M + 4S + alloc
Point Double (in-place)	890 ns	4M + 4S
Point Negation	11 ns	Y := -Y
Point Triple	1,585 ns	2xP + P
To Affine conversion	15,389 ns	1 inverse + 2-3 mul
Field S/M ratio	0.818	(ideal: ~0.80)
Field I/M ratio	78x	Inverse is expensive -- use Jacobian!

Zero-Knowledge Proof Benchmarks (CPU)

Hardware: Intel Core i5-14400F (P-core, Raptor Lake)
Compiler: Clang 19.1.7, -O3 -march=native
Methodology: 11 passes, IQR outlier removal, median, 64-key pool, pinned core

ZK Proof Operations

Operation	Time/Op	Throughput	Notes
Pedersen Commit	29.7 us	33,670 op/s	vH + rG (two scalar muls)
Knowledge Prove	24.3 us	41,152 op/s	Non-interactive Schnorr sigma, CT path
Knowledge Verify	23.8 us	42,017 op/s	sG == R + eP, FAST path
DLEQ Prove	42.4 us	23,585 op/s	Discrete log equality, CT path
DLEQ Verify	60.6 us	16,502 op/s	Two-base verification, FAST path
Range Prove (64-bit)	13,619 us	73 op/s	Bulletproof prover, CT path
Range Verify (64-bit)	2,670 us	375 op/s	MSM-optimized verifier, FAST path

Range Verify Optimization (v3.22+)

The Bulletproof verifier was optimized with multi-scalar multiplication (MSM):

Optimization	Technique	Speedup
Polynomial check	5-point MSM (delta, t_hatG, tau_xH, -T1, -T2)	Reduced from 3 scalar muls
P_check + expected merge	144-point MSM (64 G_i, 64 H_i, 12 L_j, 12 R_j, A, S, ...)	Single MSM vs 128+ individual muls
s_coeff computation	Montgomery batch inversion (1 inv + 126 muls vs 64 inversions)	~64x fewer inversions
Total	Combined MSM + batch inversion	1.93x (5,079 -> 2,634 us)

Pippenger MSM is used when point count > 64. For the prover, individual GLV-optimized scalar multiplications remain faster than MSM for the 129-point workload.
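Montgomery batch inversion, referenced in the s_coeff row above, replaces n field inversions with one inversion plus a handful of multiplications per element. The sketch below works over a small toy prime so that plain 64-bit arithmetic suffices; it is not the library's field code, and all names are illustrative. This formulation uses 3(n-1) multiplications, so its multiplication count differs from the figure quoted in the table.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint64_t kP = 1000000007ULL;  // toy prime, NOT the secp256k1 field prime

std::uint64_t mul_mod(std::uint64_t a, std::uint64_t b) { return (a * b) % kP; }

// Single inversion via Fermat's little theorem: a^(p-2) mod p.
std::uint64_t inv_mod(std::uint64_t a) {
    std::uint64_t result = 1, base = a % kP, e = kP - 2;
    while (e != 0) {
        if (e & 1) result = mul_mod(result, base);
        base = mul_mod(base, base);
        e >>= 1;
    }
    return result;
}

// Inverts every (nonzero) element using exactly one inv_mod call.
std::vector<std::uint64_t> batch_inverse(const std::vector<std::uint64_t>& a) {
    const std::size_t n = a.size();
    std::vector<std::uint64_t> out(n);
    if (n == 0) return out;
    std::vector<std::uint64_t> prefix(n);  // prefix[i] = a[0] * ... * a[i]
    assert(a[0] % kP != 0);
    prefix[0] = a[0] % kP;
    for (std::size_t i = 1; i < n; ++i) {
        assert(a[i] % kP != 0);
        prefix[i] = mul_mod(prefix[i - 1], a[i] % kP);
    }
    std::uint64_t inv = inv_mod(prefix[n - 1]);  // inverse of the full product
    for (std::size_t i = n - 1; i > 0; --i) {
        out[i] = mul_mod(inv, prefix[i - 1]);    // peel off a[i]^-1
        inv = mul_mod(inv, a[i] % kP);           // now inverse of prefix[i-1]
    }
    out[0] = inv;
    return out;
}
```

Because one field inversion costs on the order of tens of multiplications (the I/M ratio reported earlier), trading 63 of 64 inversions for multiplications is a large net win.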

BIP-324 Encrypted Transport Benchmarks

BIP-324 implements encrypted, authenticated peer-to-peer communication for Bitcoin (v2 transport). Numbers below are from bench_unified --quick on x86-64 (i5, Clang 19, AVX2, single core pinned).

Primitives

Operation	ns/op	Throughput
HKDF-SHA256 extract	~124	~8.1 M op/s
HKDF-SHA256 expand	~135	~7.4 M op/s
AEAD encrypt (256 B)	~460	~2.2 M op/s
AEAD decrypt (256 B)	~470	~2.1 M op/s

Elliptic-Curve Transport Setup

Operation	µs/op	Throughput
ElligatorSwift create	~46	~21.5 k op/s
ElligatorSwift XDH (ECDH)	~30	~32.9 k op/s
Session handshake (full)	~167	~6.0 k op/s

Session Data Path

Operation	ns/op	Throughput
Session encrypt (256 B)	~558	~1.8 M op/s
Session decrypt (256 B)	~1,136	~881 k op/s
Session encrypt (1 KB)	~1,627	~614 k op/s
Session roundtrip (256 B)	~1,136	~881 k op/s

CUDA GPU Comparison

See BENCHMARK_BIP324_GPU.md for detailed CUDA transport benchmarks. Summary: CUDA achieves ~30× throughput over a single CPU core for bulk packet encryption.

Available Benchmark Targets

All targets are registered in CMake. Build with cmake --build build -j, then run from build/cpu/.

Target	What It Measures
`bench_unified`	THE standard: primitives + CT + batch verify + Ethereum + ZK + BIP-324 + real-world wallet/protocol flows, with apples-to-apples comparison vs libsecp256k1 + OpenSSL
`bench_bip324_transport`	BIP-324 transport simulation: mixed payloads, decoy packets, latency histograms, TCP socket roundtrip
`bench_ct`	Fast (`fast::`) vs Constant-Time (`ct::`) layer comparison
`bench_field_52`	5x52 field arithmetic micro-benchmarks
`bench_field_26`	10x26 field arithmetic micro-benchmarks
`bench_kP`	Scalar multiplication (k*P) benchmarks
`bench_zk` (CUDA)	GPU ZK proof benchmarks: Knowledge, DLEQ, Pedersen, Bulletproof

Benchmark Methodology

CPU Benchmarks

Warm-up: 1 iteration discarded
Measurement: 3 iterations, take median
Timer: std::chrono::high_resolution_clock
Compiler flags: -O3 -march=native

bench_unified additionally reports workflow-level operations such as HD derivation, Taproot key tweaking, ECDH, and Silent Payments so primitive performance can be interpreted in a wallet and protocol context.

CUDA Benchmarks

Warm-up: 5-10 kernel launches discarded
Measurement: 11 passes, median
Timer: CUDA events
Sync: Full device synchronization between measurements

CUDA ZK Benchmarks

Warm-up: 5 kernel launches discarded
Measurement: 11 passes, median
Timer: CUDA events (ns/op = elapsed_ms * 1e6 / batch_size)
Correctness: 0/4096 verify failures (Knowledge/DLEQ), 0/256 (Bulletproof) required before timing
Batch sizes: Knowledge/DLEQ/Pedersen = 4096, Bulletproof = 256
Setup: Precomputed pubkeys + Bulletproof generators (not included in timing)
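The timer formula above is a unit conversion: CUDA event timing (`cudaEventElapsedTime`) reports milliseconds, and one millisecond is 1e6 nanoseconds. A small host-side sketch, with hypothetical helper names, shows the conversion and the matching throughput figure:

```cpp
#include <cassert>
#include <cstddef>

// elapsed_ms: elapsed time for the whole batch, in milliseconds.
// Returns the average cost of one operation in nanoseconds.
double ns_per_op(double elapsed_ms, std::size_t batch_size) {
    assert(batch_size > 0);
    return elapsed_ms * 1e6 / static_cast<double>(batch_size);
}

// Converts a per-operation latency in nanoseconds to operations per second.
double ops_per_second(double ns_per_operation) {
    assert(ns_per_operation > 0.0);
    return 1e9 / ns_per_operation;
}
```

For example, a 4096-element batch that takes 1.0 ms corresponds to `ns_per_op(1.0, 4096)`, about 244 ns/op.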

Reproducibility

# Run CPU benchmark (includes ZK section)
./build/cpu/bench_unified

# Run the full unified suite explicitly
./build/cpu/bench_unified --suite all

# Quick smoke / CI-style run
./build/cpu/bench_unified --quick

# Run CUDA ECC benchmark
./build/cuda/secp256k1_cuda_bench

# Run CUDA ZK benchmark
./build/cuda/bench_zk

# Results saved to: benchmark-<platform>-<date>.txt

Optimization History

RISC-V Timeline

Date	Field Mul	Scalar Mul	Change
2026-02-11	307 ns	954 us	Initial
2026-02-12	205 ns	676 us	Carry optimization
2026-02-13	198 ns	672 us	Square optimization
2026-02-13	198 ns	672 us	Current

Key Optimizations Applied

Branchless field operations - Eliminates unpredictable branches
Optimized carry propagation - Reduces instruction count
Dedicated squaring routine - 25% fewer multiplications than generic mul
GLV decomposition - ~50% reduction in scalar bits
wNAF encoding - ~33% fewer point additions
Precomputed tables - Generator multiplication 10x faster
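To make the wNAF item concrete, here is a minimal encoder for machine-word scalars. It is a sketch under stated assumptions, not the library's routine: it works on a `uint64_t` below 2^63 instead of a 256-bit scalar, and the function name `wnaf` and the accepted window range are choices made for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Width-w non-adjacent form of k, least significant digit first.
// Every nonzero digit is odd, lies strictly between -2^(w-1) and 2^(w-1),
// and is followed by at least w-1 zero digits.
std::vector<int> wnaf(std::uint64_t k, unsigned w) {
    assert(w >= 2 && w <= 8);
    assert(k < (std::uint64_t{1} << 63));  // headroom for the k += |digit| step
    const std::int64_t window = std::int64_t{1} << w;  // 2^w
    std::vector<int> digits;
    while (k != 0) {
        int digit = 0;
        if (k & 1) {
            // Signed residue of k modulo 2^w.
            std::int64_t r = static_cast<std::int64_t>(k & (window - 1));
            if (r >= window / 2) r -= window;
            digit = static_cast<int>(r);
            // Subtract the digit so the low w bits of k become zero
            // (unsigned wraparound turns a negative r into an addition).
            k -= static_cast<std::uint64_t>(r);
        }
        digits.push_back(digit);
        k >>= 1;
    }
    return digits;
}
```

Each nonzero digit costs one point addition in a double-and-add loop, so spacing the nonzero digits out is what reduces the number of additions. For instance, `wnaf(7, 2)` yields the digits -1, 0, 0, 1 (that is, 8 - 1) in place of the three set bits of binary 111.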

Apples-to-Apples: UltrafastSecp256k1 vs bitcoin-core/libsecp256k1

Rigorous head-to-head comparison using an identical benchmark harness (same timer, warm-up, and statistical methodology) for both libraries. Both libraries are compiled from source, linked into a single binary, and measured under the exact same conditions.

Methodology

Harness: 3 s CPU frequency ramp-up, 500 warmup iterations per operation, 11 measurement passes, IQR outlier removal, median reported.
Timer: RDTSCP (serialising, sub-ns precision on x86-64).
Data pool: 64 independent key / message / signature sets, round-robin indexed to defeat branch-predictor / cache training on a single input.
Pinning: Single core, taskset -c 0, SCHED_FIFO where available.
Compiler parity: Both libraries compiled with the same compiler, same -O3 -march=native flags, same link step.
Source: bench_unified.cpp -- open-source, fully reproducible.
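The "IQR outlier removal, median reported" step can be sketched as follows. The details here are assumptions: the 1.5 x IQR fence is the common Tukey convention, and the linear-interpolated quartile is one of several definitions, so the actual harness in bench_unified.cpp may differ. Function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of an already sorted, non-empty vector.
double median_sorted(const std::vector<double>& sorted) {
    assert(!sorted.empty());
    const std::size_t n = sorted.size();
    return (n % 2 == 1) ? sorted[n / 2]
                        : 0.5 * (sorted[n / 2 - 1] + sorted[n / 2]);
}

// Linear-interpolated quantile (q in [0, 1]) of a sorted, non-empty vector.
double quantile_sorted(const std::vector<double>& sorted, double q) {
    assert(!sorted.empty());
    const double pos = q * static_cast<double>(sorted.size() - 1);
    const std::size_t lo = static_cast<std::size_t>(pos);
    const std::size_t hi = std::min(lo + 1, sorted.size() - 1);
    const double frac = pos - static_cast<double>(lo);
    return sorted[lo] + frac * (sorted[hi] - sorted[lo]);
}

// Drops samples outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], then returns the
// median of the samples that remain.
double robust_median(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const double q1 = quantile_sorted(samples, 0.25);
    const double q3 = quantile_sorted(samples, 0.75);
    const double iqr = q3 - q1;
    const double low = q1 - 1.5 * iqr;
    const double high = q3 + 1.5 * iqr;

    std::vector<double> kept;
    for (double s : samples) {
        if (s >= low && s <= high) kept.push_back(s);
    }
    return median_sorted(kept);
}
```

With 11 passes, a single pass disturbed by an interrupt or frequency change is discarded before the median is taken, so it cannot shift the reported figure.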

Platform 1 -- Intel Core i5-14400F (Raptor Lake)

Detail	Value
CPU	Intel Core i5-14400F (P-core, Raptor Lake)
Microarchitecture	Golden Cove (P-core), 32 KB L1i, 48 KB L1d, 1.25 MB L2
TSC frequency	2.497 GHz
OS	Ubuntu 24.04 LTS, kernel 6.x
Compiler	GCC 14.2.0, `-O3 -march=native -fno-exceptions -fno-rtti`
ISA features	BMI2 (MULX), ADX, AVX2, SHA-NI
libsecp256k1	v0.7.x (latest master, 5x52 + exhaustive GLV Strauss)
UltrafastSecp256k1	v3.16.0, 5x52 limb layout, `__int128` field arithmetic
Assembly	Both libraries: GCC `__int128` -> auto-generated MULX code

FAST Path (variable-time, non-secret inputs)

Operation	Ultra (ns)	libsecp (ns)	Speedup	Notes
Generator x k (pubkey_create)	6,730	11,362	1.69x	W=15 comb vs W=15 Strauss
ECDSA Sign	8,989	15,631	1.74x	Includes k^-1 (safegcd)
ECDSA Verify	21,324	23,306	1.09x	Identical Strauss algorithm
Schnorr Keypair Create	10,522	11,228	1.07x
Schnorr Sign (BIP-340)	8,443	12,255	1.45x	Includes SHA-256 challenge
Schnorr Verify (BIP-340)	21,151	22,642	1.07x	Includes lift_x + SHA-256

CT Path (constant-time, for secret inputs -- true apples-to-apples)

libsecp256k1 is constant-time by design, so this comparison is the fairest:

Operation	Ultra CT (ns)	libsecp (ns)	Speedup
ECDSA Sign	13,431	15,631	1.16x
ECDSA Verify	21,324	23,306	1.09x
Schnorr Sign (BIP-340)	11,393	12,255	1.08x
Schnorr Verify (BIP-340)	21,151	22,642	1.07x

Throughput (single core)

	Ultra FAST	Ultra CT	libsecp
ECDSA sign	111.3k op/s	74.5k op/s	64.0k op/s
ECDSA verify	46.9k op/s	--	42.9k op/s
Schnorr sign	118.4k op/s	87.8k op/s	81.6k op/s
Schnorr verify	47.3k op/s	--	44.2k op/s
pubkey_create (k x G)	148.6k op/s	--	88.0k op/s

Bitcoin Block Validation (1 core estimate)

Block type	Ultra	libsecp	Speedup
Pre-Taproot (~3000 ECDSA verify)	64.0 ms	69.9 ms	1.09x
Taproot (~2000 Schnorr + ~1000 ECDSA)	63.6 ms	68.6 ms	1.08x
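The Ultra figures in this table are consistent with multiplying the per-signature verify latency from the tables above by the signature count. A sketch of that arithmetic, with a hypothetical function name, assuming signature checks dominate and ignoring script and I/O overhead:

```cpp
#include <cstdint>

// Estimated single-core signature-check time for one block, in milliseconds.
// Per-signature costs are given in nanoseconds.
double block_verify_ms(std::uint32_t ecdsa_count, double ecdsa_verify_ns,
                       std::uint32_t schnorr_count, double schnorr_verify_ns) {
    const double total_ns = ecdsa_count * ecdsa_verify_ns +
                            schnorr_count * schnorr_verify_ns;
    return total_ns / 1e6;  // 1 ms = 1e6 ns
}
```

For example, `block_verify_ms(3000, 21324.0, 0, 0.0)` gives about 64.0 ms, and `block_verify_ms(1000, 21324.0, 2000, 21151.0)` gives about 63.6 ms.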

Field Micro-ops

Operation	Ultra (ns)	Notes
FE52 mul	12.8	5x52, `__int128` -> MULX
FE52 sqr	9.5	Dedicated squaring
FE52 add	8.1
FE52 sub	5.5
FE52 negate	6.0
FE52 inverse (safegcd)	666.8	Bernstein-Yang, `__builtin_ctzll`
Scalar mul	23.2	4x64
Scalar inverse (safegcd)	843.1
GLV decomposition	146.0	Lattice-based

Platform 2 -- StarFive VisionFive 2 (RISC-V 64)

Detail	Value
CPU	SiFive U74-MC (quad-core RV64GC)
Microarchitecture	SiFive U74, dual-issue in-order, 32 KB L1i, 32 KB L1d
ISA extensions	rv64gc + Zba (address), Zbb (bit-manipulation)
Clock	~1.5 GHz (StarFive JH7110 SoC)
OS	Debian (StarFive kernel 6.6.20)
Compiler	Clang 21.1.8, `-O3 -march=rv64gc_zba_zbb`
libsecp256k1	v0.7.x (latest master)
UltrafastSecp256k1	v3.16.0, 5x52 limb layout, `__int128` field arithmetic
Assembly	Both libraries: `__int128` -> compiler-generated MUL/MULHU

FAST Path (variable-time, non-secret inputs)

Operation	Ultra (ns)	libsecp (ns)	Speedup	Notes
Generator x k (pubkey_create)	39,764	95,341	2.40x	W=15 comb vs W=15 Strauss
ECDSA Sign	73,784	138,128	1.87x	Includes k^-1 (safegcd)
ECDSA Verify	180,511	201,135	1.11x	Identical Strauss algorithm
Schnorr Keypair Create	45,873	95,946	2.09x
Schnorr Sign (BIP-340)	53,957	105,310	1.95x	Includes SHA-256 challenge
Schnorr Verify (BIP-340)	185,487	204,944	1.10x	Includes lift_x + SHA-256

CT Path (constant-time, for secret inputs -- true apples-to-apples)

Operation	Ultra CT (ns)	libsecp (ns)	Speedup
ECDSA Sign	131,177	138,818	1.06x
ECDSA Verify	181,837	204,594	1.13x
Schnorr Sign (BIP-340)	110,926	106,139	0.96x
Schnorr Verify (BIP-340)	186,944	208,525	1.12x

Throughput (single core)

	Ultra FAST	Ultra CT	libsecp
ECDSA sign	13.5k op/s	7.6k op/s	7.2k op/s
ECDSA verify	5.5k op/s	--	4.9k op/s
Schnorr sign	18.4k op/s	9.0k op/s	9.4k op/s
Schnorr verify	5.3k op/s	--	4.8k op/s
pubkey_create (k x G)	24.9k op/s	--	10.5k op/s

Bitcoin Block Validation (1 core estimate)

Block type	Ultra	libsecp	Speedup
Pre-Taproot (~3000 ECDSA verify)	545.5 ms	613.8 ms	1.13x
Taproot (~2000 Schnorr + ~1000 ECDSA)	555.7 ms	621.6 ms	1.12x

Field Micro-ops

Operation	Ultra (ns)	Notes
FE52 mul	176.2	5x52, `__int128` -> MUL/MULHU
FE52 sqr	166.8	Dedicated squaring
FE52 add	42.1
FE52 sub	34.7
FE52 negate	42.7
FE52 inverse (safegcd)	4,697.6	Bernstein-Yang
Scalar mul	147.5	4x64
Scalar inverse (safegcd)	3,698.9
GLV decomposition	851.3	Lattice-based

RISC-V Notes

The U74 is a dual-issue in-order core -- no out-of-order execution, no speculative execution, no branch prediction beyond basic BTB.
Despite this, the precomputed comb table gives a 2.4x generator speedup, which indicates the gain is algorithmic (fewer point additions) rather than microarchitecture-dependent.
CT generator_mul uses an 11-block comb (COMB_BLOCKS=11, COMB_SPACING=4) with a ~31 KB table that fits in the U74's 32 KB L1D cache. This gives a 1.04x advantage over libsecp's generator_mul (91.4 us vs 95.4 us).
CT ECDSA Sign wins 1.06x. CT Schnorr Sign is 0.96x due to auxiliary overhead (SHA-256, nonce derivation) not related to the core ECC operation.
Verify speedups (1.12-1.13x) come from the same L1 icache optimization as x86 (called vs inlined additions) plus branchless conditional negate.

Key Optimizations (vs libsecp256k1)

Precomputed generator table -- 8192-entry comb table for k x G (6.7 us vs 11.4 us on x86; 39.8 us vs 95.3 us on RV64)
Force-inlined doubling -- jac52_double_inplace always-inline in hot loop
Called (not inlined) additions -- Reduced ecmult function from 124 KB to 39 KB, fitting the hot loop in L1 I-cache (1.5 KB loop body vs 32 KB I-cache)
Branchless conditional negate -- XOR-select in the Strauss loop eliminates sign branches that are taken about 50% of the time and are therefore unpredictable
Single affine conversion in Schnorr verify -- Merged X-check + Y-parity into one Z^-1 computation (saves 1 sqr + 1 mul + redundant parse)
SW prefetch -- Prefetch G/H table entries before doublings
2M+5S doubling formula -- Saves 1M per double vs libsecp's 3M+4S
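The XOR-select idea behind the branchless conditional negate can be sketched on a plain limb array. This is an illustration, not the library's code: the `Limbs` alias and `select` function are hypothetical, and the caller is assumed to have the negated value already available (for example from a field negate), so the sign only chooses between two ready operands.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Five 64-bit limbs, mirroring the 5x52 layout named in this document.
using Limbs = std::array<std::uint64_t, 5>;

// Returns `b` when flag is 1 and `a` when flag is 0, without branching on
// flag. The mask is all-ones or all-zeros, so the XOR either swaps in b or
// leaves a unchanged.
Limbs select(const Limbs& a, const Limbs& b, std::uint64_t flag) {
    const std::uint64_t mask = std::uint64_t{0} - (flag & 1);
    Limbs out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = a[i] ^ ((a[i] ^ b[i]) & mask);
    }
    return out;
}
```

A conditional negate of a Y coordinate then becomes `select(y, neg_y, sign)`, which executes the same instructions for either sign and leaves no data-dependent branch for the predictor to miss.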

How to Reproduce

# Clone and build
git clone --recurse-submodules <repo>
cd Secp256K1fast/libs/UltrafastSecp256k1
cmake -S ../.. -B build_rel -G Ninja -DCMAKE_BUILD_TYPE=Release
cmake --build build_rel -j

# Run benchmark (pin to one core for stability)
taskset -c 0 build_rel/cpu/bench_unified

Contributing Benchmarks

We welcome benchmark contributions from other platforms. To add your results:

Run taskset -c 0 build_rel/cpu/bench_unified (or equivalent pinning)
Copy the full terminal output
Open a PR adding a new "Platform N" subsection with your hardware details

Platforms we'd especially like to see: AMD Zen 4/5, Apple M-series (ARM64), AWS Graviton, AMD EPYC, Intel Xeon Sapphire Rapids, Milk-V Pioneer (C920).

Future Optimizations

Planned

AVX-512 vectorization (x86-64)
Multi-threaded batch operations
ARM64 NEON/MUL assembly (DONE -- ~5x speedup)
OpenCL backend (DONE -- 3.39M kG/s)
Apple Metal backend (DONE -- 527M field_mul/s, M3 Pro)
Shared POD types across backends
ARM64 inline assembly (MUL/UMULH)

Experimental

AVX-512 vectorization (x86-64)
Multi-threaded batch operations
Montgomery domain for CUDA (mixed results)
8x32-bit hybrid limb representation (DONE -- 1.10x faster mul)
Constant-time side-channel resistance (CT layer implemented)

Version

UltrafastSecp256k1 v3.16.0
Benchmarks updated: 2026-03-02
