docs: Wave 1 sync -- version 3.3.0→3.4.0, CHANGELOG w8+infinity fix, ROADMAP/SECURITY/bench updates

- CHANGELOG [Unreleased]: add CUDA w8 signing migration (220.9→198.3 ns) and
  jacobian_add_mixed_unchecked infinity flag bug fix (52/52 tests pass)
- ROADMAP.md: date 2026-03-22, Phase II marked ACTIVE
- SECURITY.md: version tag corrected (v3.3.0 dev, next v3.4.0)
- CODEOWNERS: add ZK layer paths (cpu/src/zk*, cuda/include/zk*, ct_zk*)
  and benchmark/docs paths
- cuda/include/secp256k1.cuh: NOTE comment on scalar_mul_generator_const
  directing signing code to scalar_mul_generator_w8 (w=8, ~198 ns)
- include/ufsecp/ufsecp_impl.cpp: replace CT recoverable TODO with explicit
  arch decision comment (recid computation is non-CT by design)
- benchmarks/comparison/cuda_vs_opencl_rtx5060ti.md: add Signing Operations
  section with w=8 numbers (CUDA 198.3 ns vs OpenCL 211.3 ns)
shrec 2026-03-22 16:19:21 +00:00
parent a129bafe89
commit f565bc1207
7 changed files with 62 additions and 10 deletions

.github/CODEOWNERS

@@ -45,3 +45,13 @@
# Bindings (C ABI, Java, Python, etc.)
/bindings/ @shrec
/include/ufsecp/ @shrec
# Zero-Knowledge primitives -- security-critical proof layer
/cpu/src/zk* @shrec
/cpu/include/zk* @shrec
/cuda/include/zk* @shrec
/cuda/include/ct_zk* @shrec
# Benchmarks and performance docs
/benchmarks/ @shrec
/docs/BENCHMARK* @shrec


@@ -5,9 +5,26 @@ All notable changes to UltrafastSecp256k1 are documented here.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [Unreleased] (dev branch)
## [Unreleased] (dev branch — heading to v3.4.0)
> **Development: post-v3.22.0** | Unified Wallet API, multi-chain address formats, message signing
> **Development:** v3.3.x dev | Unified Wallet API, multi-chain address formats, message signing, CUDA w8 signing performance
### Fixed
- **CUDA `jacobian_add_mixed_unchecked` infinity flag**: a missing `r->infinity = false` assignment
in the normal (non-infinity-input) code path left generator table entries `table[3..15]`, built
by `build_generator_table`, with uninitialized infinity flags. Scalars with many `0xF` nibbles
(e.g. `n-1` or an all-`0xF` pattern) hit `table[15]` heavily and produced wrong public
keys. All 52 CUDA signing tests now pass.
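
The failure mode described above can be reproduced in a host-side toy model. This is a minimal sketch, not the library's code: `ToyPoint`, `toy_add_mixed_unchecked`, `toy_build_table`, and `toy_count_bad_flags` are hypothetical names, the "point" `k*G` is modelled by the integer `k`, and stale memory is simulated by pre-filling the table with `infinity = true` so the behaviour is deterministic.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Toy stand-in for a curve point: k*G is modelled by the integer k.
// Hypothetical type; not the library's JacobianPoint.
struct ToyPoint {
    int k;
    bool infinity;
};

// Toy "unchecked" mixed addition r = p + q. The caller guarantees p is not
// the point at infinity, so the flag of p is never read. With set_flag ==
// false this mimics the bug: the result value is written, but r->infinity
// keeps whatever was in memory before.
inline void toy_add_mixed_unchecked(const ToyPoint* p, int q, ToyPoint* r, bool set_flag) {
    assert(p != nullptr && r != nullptr);
    r->k = p->k + q;
    if (set_flag) {
        r->infinity = false;  // the one-line fix
    }
}

// Build table[i] = i*G for i in 0..15. Entries start with a stale
// infinity=true flag to simulate uninitialized memory.
inline std::array<ToyPoint, 16> toy_build_table(bool set_flag) {
    std::array<ToyPoint, 16> table;
    table.fill(ToyPoint{0, true});   // stale contents
    table[1] = ToyPoint{1, false};   // G
    table[2] = ToyPoint{2, false};   // 2G, assumed to come from a doubling that sets the flag
    for (std::size_t i = 3; i < table.size(); ++i) {
        toy_add_mixed_unchecked(&table[i - 1], 1, &table[i], set_flag);
    }
    return table;
}

// Count entries in table[1..15] that are wrongly marked as infinity.
inline int toy_count_bad_flags(const std::array<ToyPoint, 16>& table) {
    int bad = 0;
    for (std::size_t i = 1; i < table.size(); ++i) {
        if (table[i].infinity) ++bad;
    }
    return bad;
}
```

In this model, building the table without the assignment leaves entries 3 through 15 flagged as infinity even though their values are correct, so any lookup that trusts the flag treats them as the identity.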
### Changed
- **CUDA signing paths: `scalar_mul_generator_const` → `scalar_mul_generator_w8`** across all
signing kernels (`ecdsa.cuh`, `schnorr.cuh`, `bip32.cuh`, `pedersen.cuh`, `zk.cuh`).
w=8 uses 32 windows of 8-bit lookups instead of the 64 windows of 4-bit lookups used by w=4:
- ECDSA Sign: **220.9 → 198.3 ns/op** (10.2% lower latency; faster than OpenCL at 211.3 ns/op)
- Schnorr Sign: comparable speedup, since it shares the generator-multiplication hotspot
- `scalar_mul_generator_const` (w=4) retained for audit/benchmark comparisons.
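
The window counts quoted above follow directly from splitting a 256-bit scalar into fixed-width digits. The sketch below is a host-side illustration only: `split_windows` and `toy_fixed_window_mul` are hypothetical helpers, the group is the integers modulo a small `m` rather than secp256k1, and it assumes a textbook fixed-window loop (w doublings followed by one table addition per window), which may differ from how `scalar_mul_generator_w8` is actually structured.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Split a 256-bit big-endian scalar into fixed windows of w bits (w = 4 or 8),
// most significant window first. Yields 256 / w digits.
inline std::vector<unsigned> split_windows(const std::array<std::uint8_t, 32>& k, unsigned w) {
    assert(w == 4 || w == 8);
    std::vector<unsigned> digits;
    digits.reserve(256 / w);
    for (std::uint8_t byte : k) {
        if (w == 8) {
            digits.push_back(static_cast<unsigned>(byte));
        } else {
            digits.push_back(static_cast<unsigned>(byte >> 4));    // high nibble
            digits.push_back(static_cast<unsigned>(byte & 0x0F));  // low nibble
        }
    }
    return digits;
}

// Toy fixed-window evaluation in the additive group Z_m with generator 1:
// "doubling" is multiplication by 2, and the table entry for digit d is d.
// m must stay below 2^62 so the intermediate products cannot overflow.
inline std::uint64_t toy_fixed_window_mul(const std::vector<unsigned>& digits,
                                          unsigned w, std::uint64_t m) {
    std::uint64_t acc = 0;
    for (unsigned d : digits) {
        for (unsigned i = 0; i < w; ++i) {
            acc = (acc * 2) % m;  // w doublings per window
        }
        acc = (acc + d) % m;      // one table addition per window
    }
    return acc;
}
```

Both window widths evaluate to the same result; the wider window halves the number of table additions (32 instead of 64) at the cost of a 256-entry table instead of a 16-entry one.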
### Added


@@ -1,6 +1,6 @@
# UltrafastSecp256k1 -- Project Roadmap
> Last updated: 2026-03-04
> Last updated: 2026-03-22
> Covers: March 2026 - February 2027
This roadmap describes what the project intends to do -- and explicitly not do -- over the next 12 months. It is organized into four phases.
@@ -25,7 +25,7 @@ This roadmap describes what the project intends to do -- and explicitly not do -
---
## Phase II: Protocol & Production Hardening (Q3-Q4 2026)
## Phase II: Protocol & Production Hardening (Q3-Q4 2026) — ACTIVE
**Goal**: Harden advanced protocols, expand fuzzing, prepare for production deployments.
@@ -135,7 +135,7 @@ These items are **intentionally out of scope** for the 2026-2027 roadmap:
| Phase | Status | Key Milestone |
|-------|--------|---------------|
| **Phase I** -- Core Assurance | **COMPLETE** | 49+ audit modules, 4-layer CT verification, SafeGCD, 1.2M+ automated checks |
| **Phase II** -- Protocol Hardening | **In Progress** | MuSig2/FROST tests exist, bindings + SBOM remaining |
| **Phase II** -- Protocol Hardening | **ACTIVE (Q2 2026)** | CUDA w8 signing complete, infinity flag fixed; MuSig2/FROST tests exist, bindings + SBOM remaining |
| **Phase III** -- Platform Parity | **Planned (Q4 2026)** | CUDA has ECDSA/Schnorr; OpenCL/Metal/WASM need parity |
| **Phase IV** -- Bug Bounty & Audit | **Planned (Q1-Q2 2027)** | Bug bounty first, then external audit engagement |


@@ -65,7 +65,7 @@ For auditors and security researchers, the following documents are available:
| Document | Purpose |
|----------|---------|
| [AUDIT_GUIDE.md](AUDIT_GUIDE.md) | **Start here** -- Auditor navigation, checklist, reproduction commands |
| [AUDIT_REPORT.md](AUDIT_REPORT.md) | Internal audit report (v3.9.0 baseline; test suite restructured since -- see below) |
| [AUDIT_REPORT.md](AUDIT_REPORT.md) | Internal audit report (v3.9.0 baseline; test suite significantly restructured since -- see below) |
| [THREAT_MODEL.md](THREAT_MODEL.md) | Layer-by-layer risk + attack surface analysis |
| [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) | Technical architecture for auditors |
| [docs/CT_VERIFICATION.md](docs/CT_VERIFICATION.md) | Constant-time methodology, dudect, known limitations |
@@ -280,4 +280,4 @@ We appreciate responsible disclosure. Contributors who report valid security iss
---
*UltrafastSecp256k1 v3.17.0 -- Security Policy*
*UltrafastSecp256k1 v3.3.0 dev (next release: v3.4.0) -- Security Policy*


@@ -65,3 +65,21 @@
| Intel/AMD GPU | OpenCL (only option) |
| Portable research/verification | OpenCL |
| Production search workload | CUDA (field ops dominate) |
---
## Signing Operations (w=8 generator table, updated 2026-03-22)
All signing paths were migrated from `scalar_mul_generator_const` (w=4, 64 windows) to
`scalar_mul_generator_w8` (w=8, 32 windows). Hardware: RTX 5060 Ti (sm_89), batch=65536.
| Operation | CUDA ns/op | OpenCL ns/op | CUDA vs OpenCL |
|-----------|-----------|-------------|----------------|
| ECDSA Sign (w=8) | **198.3** | 211.3 | CUDA 6.2% lower latency |
| ECDSA Sign (w=4, retired) | 220.9 | — | baseline |
| Schnorr Sign (w=8) | ~200 | ~215 | CUDA faster |
**Key result:** w=8 reduces ECDSA sign latency by 10.2% on CUDA (220.9 → 198.3 ns/op),
making CUDA signing faster than OpenCL (211.3 ns/op) for the first time.
`scalar_mul_generator_const` (w=4) is retained in the codebase for audit and benchmark
reference only; all production signing uses `scalar_mul_generator_w8`.
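
The percentages in this section are relative latency reductions computed from the ns/op figures. A minimal sketch of that arithmetic follows; `latency_reduction_pct` and `ops_per_second` are hypothetical helper names, and the throughput figure assumes the ns/op values are amortized over the batch.

```cpp
#include <cassert>

// Percentage by which per-operation latency drops when going from
// old_ns to new_ns: (old - new) / old * 100.
inline double latency_reduction_pct(double old_ns, double new_ns) {
    assert(old_ns > 0.0);
    return (old_ns - new_ns) / old_ns * 100.0;
}

// Throughput implied by an amortized per-operation latency in nanoseconds.
inline double ops_per_second(double ns_per_op) {
    assert(ns_per_op > 0.0);
    return 1.0e9 / ns_per_op;
}
```

With the numbers above, `latency_reduction_pct(220.9, 198.3)` is about 10.2 (w=4 to w=8 on CUDA), `latency_reduction_pct(211.3, 198.3)` is about 6.2 (OpenCL to CUDA w=8), and `ops_per_second(198.3)` is roughly 5.04 million signatures per second.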


@@ -3551,6 +3551,10 @@ __device__ __constant__ static const AffinePoint GENERATOR_TABLE_AFFINE[16] = {
// Uses GENERATOR_TABLE_AFFINE in __constant__ memory (no build_generator_table needed).
// Fixed-window w=4: 252 doublings + <=64 mixed additions.
// Saves shared-memory allocation and __syncthreads() compared to runtime table.
//
// NOTE: For signing paths prefer scalar_mul_generator_w8 (w=8, 32 windows, ~198 ns/op).
// This function (w=4, 64 windows, ~220 ns/op) is retained for audit and benchmark
// comparisons that need the original reference implementation.
__device__ inline void scalar_mul_generator_const(const Scalar* k, JacobianPoint* r) {
r->infinity = true;
field_set_zero(&r->x);


@@ -722,9 +722,12 @@ ufsecp_error_t ufsecp_ecdsa_sign_recoverable(ufsecp_ctx* ctx,
return ctx_set_err(ctx, UFSECP_ERR_BAD_KEY, "privkey is zero or >= n");
}
// NOTE: No ct::ecdsa_sign_recoverable exists yet. Using fast path with
// zeroization. Recovery signing is inherently non-CT due to recid computation.
// TODO: Implement ct::ecdsa_sign_recoverable when CT recovery is needed.
// ARCH DECISION: No ct::ecdsa_sign_recoverable exists because recovery signing is
// inherently non-constant-time: the recid value (0..3) depends on the R point's
// coordinates, so computing it can leak timing. We use the FAST path
// (secp256k1::ecdsa_sign_recoverable) and explicitly zeroize the private-key scalar
// immediately after use. If a CT recovery path is needed in future, it must accept
// a fixed recid hint from the caller and branch only on public data.
auto rsig = secp256k1::ecdsa_sign_recoverable(msg, sk);
secp256k1::detail::secure_erase(&sk, sizeof(sk));
auto normalized = rsig.sig.normalize();
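
For context on the recid discussion above, the recovery id in ECDSA is conventionally derived from two facts about the nonce point R, then adjusted if low-s normalization negated s. The sketch below shows that convention only; `RecidInputs` and `compute_recid` are hypothetical names, and how `secp256k1::ecdsa_sign_recoverable` computes the value internally is not shown in this commit.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical inputs describing the nonce point R = k*G and the final s value.
struct RecidInputs {
    bool r_y_is_odd;      // parity of R.y
    bool r_x_overflowed;  // true if R.x >= n, so r = R.x mod n wrapped around
    bool s_was_negated;   // true if low-s normalization replaced s with n - s
};

// Conventional recovery id: bit 0 is the parity of R.y, bit 1 records whether
// R.x overflowed the group order. Negating s corresponds to recovering from -R,
// which has the opposite y parity, so bit 0 is flipped in that case.
inline std::uint8_t compute_recid(const RecidInputs& in) {
    std::uint8_t recid = static_cast<std::uint8_t>(
        (in.r_x_overflowed ? 2 : 0) | (in.r_y_is_odd ? 1 : 0));
    if (in.s_was_negated) {
        recid ^= 1;
    }
    assert(recid <= 3);
    return recid;
}
```

This is why a signature normalized after signing needs its recid updated in the same step: the two values only make sense as a pair.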