docs: add COMMUNITY_BENCHMARKS.md for community-submitted hardware results

- New docs/COMMUNITY_BENCHMARKS.md: dedicated page for community and
  contributor benchmark entries with submission instructions
- Entry #1: NVIDIA RTX 5070 Ti (Blackwell) by community / GigaChad
  (45/45 tests, CMAKE_CUDA_SEPARABLE_COMPILATION fix credited)
- Entry #2: x86-64 CPU libsecp comparison by @craigraw (bench_bip352)
- README: replaced inline community tables with a concise link to the
  new page; added 'Community Benchmarks' to quick links
- docs/BENCHMARKS.md: RTX 5070 Ti section now links to the dedicated
  page with contributor credit
- README acknowledgments: added GigaChad community entry

2026-03-24 21:00:18 +00:00

5.0 KiB

Raw Permalink Blame History

Community & Contributor Benchmarks

This page collects benchmark results submitted by community members running UltrafastSecp256k1 on their own hardware. Each entry includes the exact build command, hardware details, date, and test count so results are reproducible.

Want to contribute? Run the benchmark suite on your hardware and share your results via a GitHub Discussion or open an issue with the output of bench_unified --suite all (or gpu_bench_unified for GPU). We'll add your entry here and credit you.

Tip: Include your GPU/CPU model, OS, driver/CUDA/compiler version, the exact cmake command you used, and the full benchmark table.

Index

#	Contributor	Hardware	Date	Tests
1	Community (GigaChad)	NVIDIA RTX 5070 Ti (Blackwell)	2026-03-24	45/45
2	@craigraw	x86-64 CPU (libsecp baseline)	2026-02-xx	—

Entry #1 — NVIDIA RTX 5070 Ti

Contributor: Community member (Discord: GigaChad) — thank you for running the full test suite and for identifying the CMAKE_CUDA_SEPARABLE_COMPILATION flag required for Blackwell devices! 🙏

Field	Value
GPU	NVIDIA GeForce RTX 5070 Ti
Architecture	Blackwell (SM 12.0)
OS	Linux
CUDA	(native)
Tests passed	45 / 45
Date	2026-03-24

Build command:

cmake -S . -B build -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DSECP256K1_BUILD_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=native \
  -DCMAKE_CUDA_SEPARABLE_COMPILATION=ON

Note: CMAKE_CUDA_SEPARABLE_COMPILATION=ON is required on RTX 50xx (Blackwell) devices. This flag is now set automatically inside cuda/CMakeLists.txt and baked into all CUDA CMake presets — you only need to pass it explicitly if you invoke CMake without a preset.

Core ECC Operations

Operation	Time/Op	Throughput
Field Mul	5.8 ns	173.43 M/s
Field Add	2.5 ns	408.04 M/s
Field Inverse	5.2 ns	191.55 M/s
Point Add	9.9 ns	100.89 M/s
Point Double	5.5 ns	181.70 M/s
Scalar Mul (Pk)	101.4 ns	9.86 M/s
Generator Mul (Gk)	92.1 ns	10.86 M/s
Affine Add (2M+1S+inv)	0.1 ns	8,388.29 M/s
Affine Lambda (2M+1S)	0.2 ns	4,117.82 M/s
Affine X-Only (1M+1S)	0.1 ns	8,354.07 M/s
Batch Inv (Montgomery)	5.8 ns	173.21 M/s
Jac→Affine (per-pt)	14.4 ns	69.34 M/s

Signature Operations

Operation	Time/Op	Throughput
ECDSA Sign	105.3 ns	9.49 M/s
ECDSA Verify	122.8 ns	8.14 M/s
ECDSA Sign+Recid	155.8 ns	6.42 M/s
Schnorr Sign	137.7 ns	7.26 M/s
Schnorr Verify	92.7 ns	10.79 M/s

Entry #2 — CPU libsecp256k1 Comparison (BIP-352 Standalone)

Contributor: @craigraw (Sparrow Wallet) — thank you for creating the standalone bench_bip352 benchmark and for the independent reproducible comparison! 🙏

Field	Value
CPU	x86-64
Compiler	GCC 12.4, `-O3 -march=native`, `USE_ASM_X86_64=1`
Benchmark tool	bench_bip352
Mode	Single-threaded, 10K points, 11 passes, median

Full Pipeline

Backend	Median	ns/op	Ratio
libsecp256k1	545.2 ms	54,519 ns	1.00×
UltrafastSecp256k1	456.1 ms	45,615 ns	1.20× faster

Per-Operation Breakdown (1K points, 11 passes, median)

Operation	libsecp256k1	UltrafastSecp256k1	Ratio
k×P (scalar mul)	37,975 ns	26,460 ns	1.44× faster
Serialize compressed (1st)	36 ns	15 ns	2.4× faster
Tagged SHA-256	744 ns	65 ns	11.4× faster
k×G (generator mul)	17,460 ns	8,559 ns	2.04× faster
Point addition	2,250 ns	2,457 ns	0.92×
Serialize compressed (2nd)	23 ns	21 ns	1.1× faster

Point addition is slightly slower because both inputs have Z=1 (affine), so UltrafastSecp256k1 uses direct affine addition with a field inversion to return an affine result — this eliminates the separate inversion step inside serialization.

How to Submit Your Benchmark

Build the project for your target (CPU or GPU).
Run bench_unified --suite all --passes 11 (CPU) or gpu_bench_unified (CUDA/OpenCL/Metal).
Post results in GitHub Discussions → Benchmarks with:
- Hardware model (CPU/GPU/embedded)
- OS, compiler, driver/CUDA version
- Exact cmake command
- Full benchmark output (or paste the table)
We'll add your entry here and list you in the contributor acknowledgments in README.md.

Last updated: 2026-03-24

5.0 KiB Raw Permalink Blame History Unescape Escape