- test_exploit_ct_recov: rewrite Phase C to simulate CT branchless overflow
check and verify it agrees with reference for all 400 inputs
- test_musig2_bip327_vectors: update Test 2 assertions from != to == since
canonical sort makes key aggregation order-independent (expected behavior)
- test_musig2_frost: update Test 2 from 'Ordering Matters' to 'Order
Independence'; CHECK(same) now that canonical sort is applied
- test_musig2_frost_advanced: fix test_musig2_key_coefficient_binding to
account for the BIP-327 second-unique-key coeff=1 optimization; assert that
Q_x differs (the real security property) and check coeff only when pk_a
does not take the second-unique shortcut in both groups
- ct_field.cpp field_add: add pointer value_barrier to prevent LTO from
propagating fe_zero limbs into add256, reducing |t| from ~12 to ~4.5
- ct_field.cpp field_sqr: extend RISC-V FE52 limb barrier to all GCC/clang
platforms (was RISC-V-only); drops |t| from ~12 to ~0.8
- ct_field.cpp field_mul: add FE52 limb barriers to both operands
(same rationale as field_sqr)
Result: unified_audit_runner 70/70 PASS -- AUDIT-READY
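
The order independence that the updated MuSig2 tests now assert can be
illustrated with a minimal sketch. PubKey33, canonical_sorted and
second_unique_index are hypothetical names for this example, not the
library's API; the sketch assumes the canonical order is plain
lexicographic order over the 33-byte compressed encoding.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative alias for a compressed public key (assumption, not library code).
using PubKey33 = std::array<std::uint8_t, 33>;

// Returns a lexicographically sorted copy; the caller's vector is not modified.
std::vector<PubKey33> canonical_sorted(std::vector<PubKey33> keys) {
    std::sort(keys.begin(), keys.end());  // std::array compares lexicographically
    return keys;
}

// Index of the first key that differs from sorted[0], or sorted.size() if all
// keys are equal. BIP-327 assigns this "second unique" key the coefficient 1.
std::size_t second_unique_index(const std::vector<PubKey33>& sorted) {
    assert(!sorted.empty());
    for (std::size_t i = 1; i < sorted.size(); ++i) {
        if (sorted[i] != sorted[0]) return i;
    }
    return sorted.size();
}
```

Because both the key list hash and the second-unique lookup run on the
sorted copy, any permutation of the same input keys yields the same
inputs to aggregation.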
🔴 ct_sign.cpp: CT violation — if-branch on R.y parity replaced with
branchless mask (recid = limbs[0] & 1u). Early-break loop on r_bytes
vs ORDER replaced with constant-time MSB accumulation; overflow bit set
via branchless shift-OR. Both paths now leak zero timing information
about the secret nonce.
🔴 ecies.cpp (Android): /dev/urandom early-boot weakness — switched to
arc4random_buf() which blocks until the kernel entropy pool is seeded.
🟡 ecies.cpp (KDF): SHA-512(x) replaced with HKDF-SHA256(IKM=shared_x,
salt="secp256k1-ecies-v1", info=eph_pubkey_compressed[33]). Provides
domain separation, context binding, and resistance to related-key attacks.
Removed redundant local hmac_sha256; use secp256k1::hmac_sha256 from hkdf.hpp.
🟡 bip32.cpp: depth uint8_t silent wrap — added depth == 0xFF guard that
returns {ExtendedKey{}, false} instead of wrapping to depth 0.
🟡 musig2.cpp: pubkeys unsorted → wrong aggregate key — sort a canonical
copy before computing L so the aggregate key is identical regardless of
caller input order. second_unique key detection uses the sorted list.
🟠 message_signing.cpp: intermediate hash1 and message buf not zeroed —
added secure_erase on hash1 and the stack/heap buffer before return.
Added explicit #include for secure_erase.hpp.
🟠 frost.cpp: derive_scalar missing BIP-340 tag prefix — added
SHA256(tag) || SHA256(tag) double-tag prefix so domain separation
matches BIP-340 and prevents cross-protocol hash collisions.
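
For the ct_sign.cpp item above, a minimal sketch of a comparison without
an early break and a parity read without an if-branch. ct_ge_be32 and
ct_parity are hypothetical helpers written for this note, not the code in
ct_sign.cpp, and a compiler may still introduce branches, so the
generated assembly would need checking.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns 1 if a >= b (both 32-byte big-endian), else 0. Every byte is visited
// and no branch depends on the data.
std::uint32_t ct_ge_be32(const std::uint8_t a[32], const std::uint8_t b[32]) {
    std::uint32_t gt = 0;  // set once a more significant byte decided a > b
    std::uint32_t lt = 0;  // set once a more significant byte decided a < b
    for (std::size_t i = 0; i < 32; ++i) {
        const std::uint32_t x = a[i];
        const std::uint32_t y = b[i];
        const std::uint32_t undecided = 1u ^ (gt | lt);
        // (y - x) >> 31 is 1 exactly when x > y, because both values fit in 8 bits.
        gt |= undecided & ((y - x) >> 31);
        lt |= undecided & ((x - y) >> 31);
    }
    return 1u ^ lt;  // a >= b unless "a < b" was decided
}

// Parity bit taken with a mask instead of an if-branch on the low limb.
std::uint32_t ct_parity(std::uint64_t low_limb) {
    return static_cast<std::uint32_t>(low_limb & 1u);
}
```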
Lower STACK_BUCKETS from 256 to 64. Window sizes c<=6 (n<=384) keep
64 Points on stack (~6.7 KB); c>=7 routes to the existing heap path.
Previous 256-bucket stack allocated 26-35 KB depending on Point size.
61/61 CPU tests pass.
musig2_partial_sign now takes MuSig2SecNonce& (non-const) and zeroizes
both k1 and k2 before returning. This prevents secret-nonce reuse at
the C++ API level -- a critical property for MuSig2, where reusing
(k1,k2) with a different message leaks the private key.
Also update bench_unified.cpp: pregenerate per-iteration nonce pool for
the partial_sign bench loop so it doesn't call partial_sign on a
consumed (zeroed) nonce.
The C ABI (ufsecp_musig2_partial_sign) already zeroized its local
MuSig2SecNonce after calling musig2_partial_sign; that remains as
belt-and-suspenders redundancy.
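
A minimal sketch of the consume-on-use pattern, under stated assumptions:
DemoSecNonce, demo_secure_erase and demo_partial_sign are hypothetical
stand-ins, the "signature" is a placeholder XOR, and rejecting an
already-zeroed nonce is an assumption of this sketch rather than
documented behavior of musig2_partial_sign.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical two-part secret nonce; not the library's MuSig2SecNonce.
struct DemoSecNonce {
    std::uint8_t k1[32];
    std::uint8_t k2[32];
};

// Writes zeros through a volatile pointer so the stores are not elided.
void demo_secure_erase(void* p, std::size_t n) {
    volatile std::uint8_t* v = static_cast<volatile std::uint8_t*>(p);
    for (std::size_t i = 0; i < n; ++i) v[i] = 0;
}

// True when every byte of both halves is zero.
bool demo_is_consumed(const DemoSecNonce& n) {
    std::uint8_t acc = 0;
    for (std::size_t i = 0; i < 32; ++i) {
        acc = static_cast<std::uint8_t>(acc | n.k1[i] | n.k2[i]);
    }
    return acc == 0;
}

// Takes the nonce by non-const reference: it is read once, then destroyed.
bool demo_partial_sign(DemoSecNonce& nonce, std::uint8_t msg_byte, std::uint8_t* out) {
    assert(out != nullptr);
    if (demo_is_consumed(nonce)) return false;  // assumption: refuse a zeroed nonce
    *out = static_cast<std::uint8_t>(nonce.k1[0] ^ nonce.k2[0] ^ msg_byte);  // placeholder math
    demo_secure_erase(&nonce, sizeof nonce);
    return true;
}
```

The non-const reference makes the side effect visible in the signature,
so a caller cannot pass a const nonce and keep a usable copy by accident.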
53/53 tests pass. Audit: AUDIT-READY (69/70, 1 advisory CT warning).
frost_sign now takes FrostNonce& (non-const) and zeroizes both
hiding_nonce and binding_nonce before returning. This prevents nonce
reuse at the C++ API level -- a critical security property for FROST,
where reusing (d,e) with a different message leaks the signing share.
Also update:
- bench_unified.cpp: pregenerate nonce pool per iteration so the bench
loop does not call frost_sign on a consumed (zeroed) nonce.
- test_frost_kat.cpp: move RFC9591 Invariant-7 nonce-commitment check
(D == d*G) to before frost_sign consumes the nonce, per the new policy.
The C ABI (ufsecp_frost_sign) already zeroized its local FrostNonce after
calling frost_sign; this remains as belt-and-suspenders redundancy.
53/53 tests pass. Audit: AUDIT-READY (69/70, 1 advisory CT warning).
Critical (remotely exploitable timing side-channel):
- ellswift.cpp: Replace variable-time scalar_mul with ct::generator_mul
and ct::scalar_mul for BIP-324 handshake (privkey timing leak)
- ellswift.cpp: Erase ECDH point after use
HIGH — nonce/privkey stack leaks:
- adaptor.cpp: Erase sk_bytes, hash in adaptor_nonce; k, sk in
schnorr_adaptor_sign; k, k_inv in ecdsa_adaptor_sign; t/t_inv
in adapt functions (6 findings)
- bip32.cpp: Erase HMAC k_buf/ipad/opad; data[37] with privkey in
derive_child; I/IL in derive_child and bip32_master_key (4 findings)
- wallet.cpp: Erase privkey bytes in export_private_key EVM/Tron paths
- zk.cpp: Erase secret bytes in derive_nonce, nonce k in
knowledge_prove_base/dleq_prove, massive secret state in range_prove
(4 findings)
HIGH — buffer overflow:
- taproot.cpp: Bounds check merkle_root_len <= 32 in taproot_tweak_hash
- taproot.cpp: Bounds check input_index < input_count in tap_sighash_common
MEDIUM — secret state leaks:
- taproot.cpp: Erase private key scalar d in taproot_tweak_privkey
- bip39.cpp: Erase PBKDF2 result/u intermediates, word indices vectors,
salt_str in mnemonic_to_seed (3 findings)
All 68 tests pass.
Fixes all 32 open alerts from https://github.com/shrec/UltrafastSecp256k1/security/code-scanning:
clang-analyzer-core.CallAndMessage (#8235):
ufsecp_impl.cpp: replace structured binding auto[master,ok] with explicit
std::pair decomposition so the analyzer can track that master is used only
when ok is true
readability-braces-around-statements (#8259-#8263):
ufsecp_gpu_impl.cpp: add {} around all multi-line null-guard if bodies
misc-const-correctness (#8232-#8236, #8239-#8240, #8244-#8247, #8249-#8254, #8257-#8258):
ufsecp_impl.cpp: const size_t payload_tag_len
ufsecp_gpu_impl.cpp: const uint32_t count
test_gpu_abi_gate.cpp: const count, n, dcount (4 sites)
test_gpu_host_api_negative.cpp: const n, dcount, recid, gpu_codes[], code (5 sites)
test_adversarial_protocol.cpp: const wrong_recid, rc_bad
bench_bip324_transport.cpp: const sz, t0, mem_ns
cert-err33-c (#8255):
test_gpu_host_api_negative.cpp: cast snprintf return value to (void)
clang-analyzer-core.NonNullParamChecker / nullPointerRedundantCheck (#8241, #8242, #8256):
test_gpu_host_api_negative.cpp: guard strcmp(str,...) with if(str!=nullptr)
bugprone-implicit-widening-of-multiplication-result (#8248):
test_gpu_abi_gate.cpp: 1024*1024 -> 1024ULL*1024ULL
cpp/trivial-switch (#8243):
gpu_registry.cpp: replace switch with preprocessor-gated if chain to
avoid trivial switch when no GPU backends are compiled in
All tests pass: 544/544 adversarial protocol tests.
- init.hpp: replace non-thread-safe 'static bool tested' with std::call_once
in ensure_library_integrity(). Fixes c_abi_thread_stress crash on macOS where
concurrent ufsecp_ctx_create calls raced into Selftest, causing
'Digit index out of range during accumulation' via g_context.reset().
- ufsecp.h: wrap _Static_assert(sizeof(ufsecp_bip32_key)==82,...) in
__STDC_VERSION__ >= 201112L guard. Fixes MSVC C89 build failure in JNI binding.
- source_graph.py: add bodygrep command for searching string literals inside
indexed function bodies via SQL LIKE. Enhance find command with body search
fallback after FTS5 miss.
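
For the init.hpp item above, a minimal sketch of the std::call_once
pattern. The demo namespace and its names are hypothetical; the real
ensure_library_integrity() body is not reproduced here.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

namespace demo {

std::once_flag g_selftest_once;
std::atomic<int> g_selftest_runs{0};

// Placeholder for the expensive selftest that mutates global state.
void run_selftest() { g_selftest_runs.fetch_add(1); }

// Safe from any number of threads: run_selftest executes exactly once, and
// every caller returns only after that single run has finished.
void ensure_integrity() { std::call_once(g_selftest_once, run_selftest); }

// Spawns `threads` concurrent callers and reports how often the selftest ran.
int stress(int threads) {
    assert(threads > 0);
    std::vector<std::thread> pool;
    for (int i = 0; i < threads; ++i) pool.emplace_back(ensure_integrity);
    for (std::thread& t : pool) t.join();
    return g_selftest_runs.load();
}

}  // namespace demo
```

A plain 'static bool tested' lets two threads both observe false and both
enter the selftest; std::call_once closes that window without a
hand-written lock.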
-Werror fixes:
- audit/audit_ct_namespace.cpp: assign fread() return value and check
for short reads (fixes -Werror=unused-result with GCC 13)
- audit/test_kat_all_operations.cpp: remove dead hex_to_bytes() static
function (fixes -Werror=unused-function)
MSan fix:
- cpu/src/pippenger.cpp: guard aggregation loop with used[b] check
before reading from stack_buckets[], preventing MSan uninitialized-
read reports on the first window where untouched bucket slots have
not been written yet. Functionally equivalent: untouched slots are
conceptually Point::infinity() (identity), so skipping the add is
correct and matches the original algorithm.
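
A minimal sketch of the guarded aggregation, with integers under addition
standing in for curve points. aggregate_buckets is a hypothetical name
and the layout (buckets indexed 1..nb) is an assumption of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Computes sum over b = 1..nb of b * bucket[b] with the running-sum trick.
// Slots whose used[] flag is false are treated as the identity and never read.
std::int64_t aggregate_buckets(const std::int64_t* bucket, const bool* used, std::size_t nb) {
    assert(bucket != nullptr && used != nullptr);
    std::int64_t running = 0;  // sum of buckets b..nb seen so far
    std::int64_t total = 0;
    for (std::size_t b = nb; b >= 1; --b) {
        if (used[b]) running += bucket[b];  // untouched slot: skip the read
        total += running;
    }
    return total;
}
```

Skipping the add for an untouched slot gives the same result as adding
the identity, which is why the guard does not change the output.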
GPU parity (OpenCL + Metal):
- opencl/kernels/secp256k1_frost.cl: new full FROST partial-verify
kernel implementing R_i = D_i + rho_i*E_i and comparing lhs = z_i*G
against rhs = R_i + (lambda_i*e)*Y_i
- gpu/src/gpu_backend_opencl.cpp: replace frost stub with full 9-buffer
GPU dispatch via ensure_frost_kernel()
- metal/shaders/secp256k1_kernels.metal: add Kernel 20 for FROST
partial batch verify
- gpu/src/gpu_backend_metal.mm: full rewrite implementing all 7 GPU
operations (gen_mul, ecdsa_verify, schnorr_verify, ecdh, hash160,
frost, msm) on Metal
1. fe_batch_inverse: handle zero inputs gracefully by substituting ones
during forward accumulation and restoring zeros in output. Prevents
undefined behavior when callers pass zero-valued field elements.
Added test_batch_inverse_zero_safe covering mixed, all-zero, and
single-zero cases. (CT paths in ct_point.cpp unchanged — documented
preconditions only.)
2. 4-stream WNAF (ESP32/STM32): fixed phi(G) sign — use k2_neg directly
instead of k1_neg XOR k2_neg. G tables are precomputed without any
sign baked in, unlike P tables where k1_neg is absorbed into P_base.
Re-enabled the previously disabled code path.
3. OpenMP: added conditional OpenMP support for fe_h_based_inversion_batched.
find_package(OpenMP QUIET) in CMakeLists.txt with ESP32/WASM exclusion.
Static libgomp.a resolution for ARM64 cross-compilation.
4. MuSig2 key aggregation: validate ALL pubkeys upfront before computing
anything. Previously, invalid pubkeys were silently skipped via continue,
enabling potential rogue key attacks. Now returns empty ctx (Q=infinity)
if any pubkey is invalid.
Tested on x86_64 (25/25), ARM64 RK3588 (25/25), RISC-V VisionFive2 (25/25).
No benchmark regressions detected.
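
For item 1 above, a minimal sketch of zero-safe Montgomery batch
inversion over a toy prime field. kP, mul_mod, inv_mod and
batch_inverse_zero_safe are hypothetical names for this example, inputs
are assumed already reduced below kP, and the sketch branches on zero, so
it is not constant-time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint64_t kP = 2147483647ULL;  // 2^31 - 1, toy prime modulus

std::uint64_t mul_mod(std::uint64_t a, std::uint64_t b) { return (a * b) % kP; }

// Fermat inversion: a^(p-2) mod p, valid for a != 0.
std::uint64_t inv_mod(std::uint64_t a) {
    assert(a % kP != 0);
    std::uint64_t result = 1, base = a % kP, e = kP - 2;
    while (e) {
        if (e & 1) result = mul_mod(result, base);
        base = mul_mod(base, base);
        e >>= 1;
    }
    return result;
}

// In-place batch inversion with a single inv_mod call. Zero inputs count as 1
// during accumulation and are left as 0 in the output.
void batch_inverse_zero_safe(std::vector<std::uint64_t>& v) {
    const std::size_t n = v.size();
    if (n == 0) return;
    std::vector<std::uint64_t> prefix(n);  // prefix[i] = product of v[0..i], zeros as 1
    std::uint64_t acc = 1;
    for (std::size_t i = 0; i < n; ++i) {
        acc = mul_mod(acc, v[i] == 0 ? 1 : v[i]);
        prefix[i] = acc;
    }
    std::uint64_t inv = inv_mod(acc);  // product is never zero after substitution
    for (std::size_t i = n; i-- > 0;) {
        if (v[i] == 0) continue;  // zero stays zero; it contributed a factor of 1
        const std::uint64_t before = (i == 0) ? 1 : prefix[i - 1];
        const std::uint64_t vi = v[i];
        v[i] = mul_mod(inv, before);  // inverse of the original v[i]
        inv = mul_mod(inv, vi);       // now the inverse of the product of v[0..i-1]
    }
}
```

Without the substitution a single zero would make the accumulated product
zero and poison every output in the batch.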
armsha::sha256_compress passed the already-updated abcd to
vsha256h2q_u32 instead of the pre-update value. Per the ARMv8
Cryptographic Extensions spec, SHA256H2 requires the original
ABCD as its Qn operand.
This caused wrong SHA-256 digests on macOS ARM64 (Apple Silicon),
breaking NIST vectors, BIP-340, BIP-39, and RFC-6979 tests.
Linux x86_64 was unaffected (uses SHA-NI or scalar path).
* perf: batch ops 17-67x faster via all-affine fast path; pippenger touched-bucket + window tuning
## Performance (N=64 batch, LTO Release build)
- batch_normalize /pt: 144.7 ns → 8.2 ns (17.6x faster)
- batch_to_compressed /pt: 134.6 ns → 2.0 ns (67x faster)
- batch_x_only_bytes /pt: 97.4 ns → 1.9 ns (51x faster)
- scalar_mul (k*P): 17012 ns → 17155 ns (no regression)
## Changes
### cpu/src/point.cpp
- batch_normalize: all-affine fast path — when all z_one_==true, skip batch
inversion and read x_/y_ directly
- batch_to_compressed: same fast path + parity via limbs()[0] (avoids
full serialization just for one bit)
- batch_x_only_bytes: same fast path using store_b32_prenorm / to_bytes()
### cpu/src/pippenger.cpp
- Window thresholds retuned: n<=72→c=5, n<=384→c=6, n<=768→c=7, etc.
(was n<=32→c=5, n<=64→c=6)
- Strauss/Pippenger crossover: n<48 (was n<=64) — avoids Pippenger
overhead for small MSM sizes
- Stack allocation for buckets/touched/used (STACK_BUCKETS=256): eliminates
heap alloc for common window sizes; unique_ptr fallback for larger
- Pre-extracted digits: all n*num_windows digits in a flat u16[] before
main loop — avoids redundant extract_window_bits calls
- all_affine scatter: uses add_mixed52_inplace/from_affine52 instead of
full Jacobian add
- touched[] tracking: only reset touched buckets (O(k) not O(2^c))
- max_touched_digit: aggregate loop starts from highest used bucket
### cpu/src/batch_add_affine.cpp
- PrecomputeBuffers struct: stack arrays for count<=64, heap fallback
- y-parity via limbs()[0] instead of to_bytes()[31]
## Tests
- test_point_edge_cases_standalone: 53/53 PASS
- test_ecc_properties_standalone: 89/89 PASS
- test_edge_cases_standalone: 60/60 PASS
- test_comprehensive_standalone: 12023/12023 PASS
- test_batch_add_affine_standalone: 548/548 PASS
* fix: precompute_point_multiples stack alloc; ASan timeout 300→600s -j4
- cpu/src/batch_add_affine.cpp: precompute_point_multiples now uses
PrecomputeBuffers (same pattern as precompute_g_multiples from PR #169)
Eliminates 3 heap allocations per b*P table build for count <= 64
(stack path; the heap remains the fallback for larger counts)
- docker/run_ci.sh: fix ASan+UBSan flaky timeout (root cause: selftest
~159s normally → 300-480s under ASan + CPU contention from -j$NPROC)
Fix: --timeout 600 and cap parallelism at min(4, NPROC) for asan job
* perf: keep schnorr batch verify on fast path through N=64
Current measurements still show the randomized MSM path losing to
per-signature schnorr_verify at N=64, even after earlier MSM work.
Keep batches through 64 entries on the existing GLV Strauss + fixed-base
path and defer the MSM path to larger batches, matching the measured
crossover on this machine.
* perf: reduce schnorr batch setup passes
Build the non-generator MSM inputs in one pass during large-batch
Schnorr verification instead of materializing temporary weights,
challenges, and lifted-point vectors and then refilling scalars/points.
This cuts setup memory traffic and improves a local N=128 large-batch
harness by about 5% in the one-shot path.
Also add an audit case for malformed x-only pubkeys with the same xor
fingerprint as a valid one so future lift-caching changes keep the batch
path robust under collisions and repeated invalid inputs.
* bench: add larger batch verify sizes
Extend the unified benchmark to measure Schnorr and ECDSA batch
verification at N=128 and N=192 in addition to 4/16/64.
Scale iteration counts for the larger sizes so the benchmark stays
practical while exposing the real crossover behavior of the large-batch
path.
* perf: retune schnorr batch crossover
Keep Schnorr batch verification on the per-signature GLV Strauss path
through N=128, and switch to the randomized MSM path above that.
Current official benchmark data on this CPU shows the public batch path
is still slower than individual verification at 64 and 128 entries, while
192 entries is close enough to justify keeping the large-batch path active
there for further tuning.
* perf: cache repeated x-only pubkeys in large schnorr batches
Large Schnorr batches in the official benchmark reuse x-only pubkeys from
a 64-entry pool, so avoid re-lifting the same pubkey multiple times inside
one batch. Reuse parsed SchnorrXonlyPubkey points for duplicates and stream
challenge hashing directly instead of building a temporary 96-byte buffer.
This keeps 64/128 on the per-signature path and improves the N=192 batch
path enough to beat individual verification again on the current CPU.
Also add audit coverage for batches that intentionally reuse the same
x-only pubkey across many signatures.
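
A minimal sketch of reusing a lift for duplicate x-only pubkeys within
one batch. LiftCache and its members are hypothetical, the "lift" is a
placeholder computation, and the sketch keys on the full 32 bytes, so it
says nothing about how the library fingerprints keys or caches failed
lifts.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

using XOnly = std::array<std::uint8_t, 32>;  // illustrative alias

struct LiftCache {
    std::map<XOnly, int> lifted;  // full-key lookup: no fingerprint collisions
    int lift_calls = 0;

    // Placeholder for the expensive x-only -> curve point lift.
    int expensive_lift(const XOnly& pk) {
        ++lift_calls;
        return pk[0] * 2;
    }

    // Returns the cached result for pk, lifting only on first sight.
    int get(const XOnly& pk) {
        auto it = lifted.find(pk);
        if (it != lifted.end()) return it->second;
        const int value = expensive_lift(pk);
        lifted.emplace(pk, value);
        return value;
    }

    // Sums the lifted values of a batch; duplicates hit the cache.
    int lift_all(const std::vector<XOnly>& batch) {
        assert(!batch.empty());
        int sum = 0;
        for (const XOnly& pk : batch) sum += get(pk);
        return sum;
    }
};
```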
* perf: trim field batch inversion scratch overhead
Use indexed scratch storage in fe_batch_inverse instead of push_back on the
hot path, and route small batches through a fixed stack scratch buffer.
This keeps the change local to field inversion while shaving overhead from
runtime callers that use small and medium Montgomery batch inversions.
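
A minimal sketch of indexed scratch storage with a fixed stack buffer for
small batches. Scratch, kStackScratch and last_prefix_product are
hypothetical; the cutoff of 32 is an assumption of this example, not the
value used by fe_batch_inverse.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

constexpr std::size_t kStackScratch = 32;  // hypothetical cutoff

// Scratch for n elements: a fixed in-object array when n is small, one heap
// block otherwise. Elements are written by index; nothing grows incrementally.
struct Scratch {
    std::uint64_t stack_buf[kStackScratch];
    std::unique_ptr<std::uint64_t[]> heap_buf;
    std::uint64_t* data;
    bool on_heap;

    explicit Scratch(std::size_t n) : on_heap(n > kStackScratch) {
        if (on_heap) {
            heap_buf.reset(new std::uint64_t[n]);
            data = heap_buf.get();
        } else {
            data = stack_buf;
        }
    }
    Scratch(const Scratch&) = delete;             // data may point into *this
    Scratch& operator=(const Scratch&) = delete;
};

// Example caller: running products stored by index, as in the forward pass of
// a batch inversion. Arithmetic wraps modulo 2^64, which is fine for a demo.
std::uint64_t last_prefix_product(const std::uint64_t* in, std::size_t n) {
    assert(in != nullptr);
    if (n == 0) return 1;
    Scratch s(n);
    std::uint64_t acc = 1;
    for (std::size_t i = 0; i < n; ++i) {
        acc *= in[i];
        s.data[i] = acc;  // indexed store instead of push_back
    }
    return s.data[n - 1];
}
```

Sizing the storage once up front removes the capacity check and possible
reallocation that push_back performs on every element.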
* Add cached schnorr batch path and preflight coverage fixes
* Benchmark cached schnorr batch verification
* Reuse scratch buffers in schnorr batch verify
* Cache x-only lifts in schnorr parse path
* Trim schnorr batch seed serialization overhead
* Tune schnorr batch cutoff for N=128
* Reuse SHA256 base for schnorr batch weights
* Harden ECIES zero-ephemeral cleanup
* harden ABI secret cleanup paths
* harden wallet seed-to-address cleanup
* optimize coin HD fixed-path derivation
* optimize silent payment scan invariants
* optimize OpenCL generator nibble lookup
* optimize OpenCL GLV generator phi table
* Optimize CUDA BIP352 benchmark and enrich project graph
---------
Co-authored-by: shrec <shrec@users.noreply.github.com>