* docs(arm64): Add comprehensive ARM64 audit, benchmark, and gap analysis - ARM64_AUDIT_BENCHMARK.md: Platform certification with full audit results * Apple Silicon M1/M2/M3 native dudect verification (48/49 modules) * Android ARM64 (Cortex-A55) real hardware benchmarks * Linux ARM64 cross-compile validation * Performance comparison: ARM64 vs x86-64 vs RISC-V * CI workflows: ct-arm64.yml, release.yml Android NDK - ARM64_GAPS_ANALYSIS.md: Testing and optimization opportunities * Testing gaps: Native Linux ARM64 CI, Android device farm * Optimization roadmap: NEON batch ops (2-3x speedup priority) * Stack round-trip elimination, Metal GPU analysis * Decision matrix with effort/impact/recommendations - README.md: Add ARM64 docs to Documentation section Addresses issue #87 ARM audit request * fix: stabilize dev point.cpp and reduce clang-tidy hotspots (#106) * fix(clang-tidy): clear repeated const/init/braces warnings hotspots * fix(point): repair broken preprocessor branches in point ops (#104) Co-authored-by: shrec <shrec@users.noreply.github.com> --------- Co-authored-by: shrec <shrec@users.noreply.github.com> * riscv: speed up schnorr raw verify parsing/cache path * docs: add native mars riscv benchmark and audit results * ci(ct-arm64): avoid forcing -fuse-ld=lld on Apple * ci(mobile): avoid forcing -fuse-ld=lld on Android toolchains * ci(android/ios): fix arm target detection, asm gating, xcframework lto * fix(windows): correct BE parse byteswap under MSVC * sync: ESP32 bench hornet + benchmark data + CI scripts - fix(esp32): guard thread_local 36KB lift_x cache on embedded platforms - add: ESP32-S3 bench_hornet raw serial output + README - add: x86-64 bench_unified JSON/TXT results + validation reports - add: cross-platform comparison README updates - add: local CI scripts (ci-local.sh, docker-compose.ci.yml) - add: RISC-V Mars SSH helper, LOCAL_CI docs, RELEASE_PROCESS - update: ESP32 sdkconfig defaults for bench_hornet Addresses benchmark data preservation and ESP32 boot crash fix. --------- Co-authored-by: shrec <shrec@users.noreply.github.com>
13 KiB
OPTIMIZATION_ARCHITECTURE.md
UltrafastSecp256k1 -- Optimization Architecture Reference
This document is the technical reference for every optimization layer in UltrafastSecp256k1. It is generated from code inspection and is the companion to the benchmark framework (bench_unified).
0. Design Philosophy
Algorithm > Architecture > Optimization > Hardware
Memory traffic is more expensive than arithmetic
Correctness is absolute; performance is earned
No abstraction is free at scale
Hot paths must be explicit, boring, and predictable
1. Field Representations
The library maintains multiple field element representations, each optimised for a specific limb width and platform. Conversions are explicit and documented.
| Name | File | Layout | Headroom | Use Case |
|---|---|---|---|---|
FieldElement |
field.hpp |
4x uint64_t | 0 (normalized) | I/O, serialization, from_limbs (DB loads) |
FieldElement52 |
field_52.hpp |
5x uint64_t | 12 bit/limb | ECC point ops on 64-bit; lazy reduction |
FieldElement26 |
field_26.hpp |
10x uint32_t | 6 bit/limb | 32-bit targets (ESP32 Xtensa, Cortex-M) |
MidFieldElement |
field.hpp |
8x uint32_t | reinterpret | Zero-cost 32-bit view of 4x64 |
Lazy Reduction
FieldElement52 allows ~4096 additions without normalization due to 12-bit
headroom. FieldElement26 allows ~64 additions. This avoids reduction in
inner loops (point addition chains), reducing to full mod p only at
serialization or comparison boundaries.
Hybrid Conversion
FieldElement52::from_fe() and to_fe() convert between 4x64 and 5x52.
Bit mapping:
4x64: [0..63][64..127][128..191][192..255]
5x52: [0..51][52..103][104..155][156..207][208..255]
2. Scalar Arithmetic
256-bit scalar field (mod n) using 4x uint64_t limbs.
Scalar::mul-- schoolbook 4x4 with__int128accumulator, followed by Barrett or Montgomery reductionScalar::inverse-- SafeGCD (Bernstein-Yang), 590 divsteps, constant-timeScalar::from_bytes-- big-endian 32-byte input, reduction mod n
3. GLV Endomorphism
Splits 256-bit scalar k into two ~128-bit halves (k1, k2) such that
k*P = k1*P + k2*phi(P), where phi(x,y) = (beta*x, y).
Constants
lambda = 0x5363ad4cc05c30e0a5261c028812645a122e22ea20816678df02967c1b23bd72
beta = 0x7ae96a2b657c07106e64479eac3434e99cf0497512f58995c1396c28719501ee
Decomposition (Babai Nearest-Plane)
- Compute
c1 = round(k * g1 / 2^384),c2 = round(k * g2 / 2^384)via 256x256 Comba multiply (16__int128muls on 64-bit, 64uint32_tmuls on 32-bit) + right-shift + rounding bit k2 = c1*(-b1) + c2*(-b2) (mod n)- Pick shorter representation (
k2vs-k2) by bitlength k1 = k - lambda*k2 (mod n), again pick shorter
Lattice basis vectors (128-bit):
A1 = 0x3086d221a7d46bcde86c90e49284eb15
-B1 = 0xe4437ed6010e88286f547fa90abfe4c3
A2 = 0xe4437ed6010e88286f547fa90abfe4c4
B2 = 0x3086d221a7d46bcde86c90e49284eb15
4. Point Multiplication Strategies
4.1 FAST Path (fast:: namespace)
Variable-time. Not for secret scalars.
- k*G (generator): Precomputed comb (Lim-Lee) or precomputed windows
- Comb:
teeth-parameterized,spacing = ceil(256/teeth), table =2^teethaffine points - Default
comb_width=6: 64-entry table, ~176 KB, 43 dbl + 43 add = 86 ops - Aggressive
teeth=11: 2048-entry, ~3 MB (L2), 24 dbl + 24 add = 48 ops
- Comb:
- k*P (arbitrary): wNAF + GLV decomposition, width-5 (16 precomputed points)
- aG + bP (dual): Shamir's trick -- interleaved double-scalar multiplication
4.2 CT Path (ct:: namespace)
Constant-time for all secret-dependent operations.
-
ct::generator_mul(k): Hamburg signed-digit comb encoding
- 4-bit windows, 8-entry precomputed affine table (odd multiples)
- 64 mixed-affine additions + 64 signed table lookups, NO doublings
- Table lookup scans ALL entries (CT:
affine_table_lookup_signed) - ~3x faster than generic CT scalar_mul
-
ct::scalar_mul(P, k): Hamburg signed-digit + GLV
- GROUP_SIZE=5, 16 odd multiples per sub-curve
- 125 dbl + 52 unified_add + 52 signed lookups(16)
4.3 CT Primitives
| Function | Cost | Notes |
|---|---|---|
point_add_complete |
11M + 6S | Brier-Joye unified, handles all edge cases branchless |
point_add_mixed_complete |
7M + 5S | Brier-Joye, Jac+Affine (Z2=1) |
point_add_mixed_unified |
7M + 5S | Brier-Joye, precondition: b != infinity |
point_add_mixed_unified_into |
7M + 5S | In-place variant, avoids 128-byte copy |
point_dbl |
3M + 4S + h | Libsecp-style, handles identity via cmov |
point_dbl_n_inplace |
N*(3M+4S+h) | Batch N doublings in-place |
point_cmov / point_select |
O(n) | CT conditional move / table scan |
affine_table_lookup_signed |
O(n) | Hamburg signed-digit encoding, scans all entries |
point_endomorphism |
1M | phi(P) = (beta*X, Y, Z) |
4.4 Batch Verification
- Schnorr batch: Single MSM
sum(a_i*s_i)*G + sum(-a_i*e_i*P_i) + sum(-a_i*R_i) = O - ECDSA batch: Montgomery batch inversion + per-sig Shamir's trick
Measured speedup (N=64): ~2.5-3.5x vs individual verification.
5. Assembly and Intrinsics
| File | Target | Operations |
|---|---|---|
field_asm.cpp |
x64 BMI2 | mul_4x4_bmi2, square_4_bmi2, mont_reduce |
field_asm_x64_gas.S |
x64 GAS | field_mul_full_asm, field_sqr_full_asm |
field_asm52_x64_gas.S |
x64 BMI2 | 5x52 MULX assembly (SECP256K1_USE_ASM52_X64) |
field_asm_arm64.cpp |
AArch64 | field_mul_arm64, field_sqr_arm64, add, sub |
decomposition_optimized.hpp |
x64 BMI2 | 16x _mulx_u64 for GLV decomposition |
ct_field.cpp |
portable | SafeGCD divsteps, 5x52 mul/sqr with __int128 |
glv.cpp |
portable | 4x4 Comba with __int128 |
Runtime Detection
BMI2 availability is detected at runtime via CPUID (field_asm.cpp).
When BMI2 is available, _mulx_u64 provides carry-free multiplication;
otherwise, falls back to portable C++ with __int128.
6. Build Feature Gates
| Option | Default | Effect |
|---|---|---|
SECP256K1_USE_ASM |
ON | Enable inline assembly (x64/RISC-V, 2-5x) |
SECP256K1_USE_FAST_REDUCTION |
ON | Fast mod-p reduction (RISC-V asm, x64 BMI2) |
SECP256K1_USE_ASM52_X64 |
OFF | Hand-tuned 5x52 MULX asm (~8ns field_mul) |
SECP256K1_USE_RISCV_FE52_ASM |
OFF | RISC-V 5x52 asm (slower on in-order U74) |
SECP256K1_RISCV_USE_VECTOR |
ON | RISC-V Vector Extension (RVV) |
SECP256K1_RISCV_USE_PREFETCH |
ON | Cache prefetch hints (RISC-V) |
SECP256K1_USE_LTO |
OFF | Link Time Optimization |
SECP256K1_USE_PGO_GEN |
OFF | PGO: generate profile |
SECP256K1_USE_PGO_USE |
OFF | PGO: use profile |
SECP256K1_SPEED_FIRST |
ON | Aggressive unsafe-speed optimizations |
SECP256K1_VERBOSE_DEBUG |
OFF | Compile-time gated verbose GPU logging |
7. Performance Model
Cost Model (x86-64 / i5-14400F, Clang 19)
Measured with bench_unified (11 passes, IQR, median, RDTSCP):
| Operation | ns/op | Cycles (est.) |
|---|---|---|
| field_mul (4x64) | ~18 | ~45 |
| field_sqr (4x64) | ~14 | ~35 |
| field_inv | ~730 | ~1840 |
| scalar_mul | ~18 | ~45 |
| scalar_inv (SafeGCD) | ~790 | ~1990 |
| pubkey_create (k*G) | ~5,600 | ~14,100 |
| scalar_mul (k*P) | ~18,800 | ~47,400 |
| dual_mul (aG+bP) | ~19,900 | ~50,200 |
| ECDSA sign (FAST) | ~8,500 | ~21,400 |
| ECDSA verify | ~21,100 | ~53,200 |
| Schnorr sign (FAST) | ~7,100 | ~17,900 |
| Schnorr verify | ~21,700 | ~54,700 |
| ct::ecdsa_sign | ~12,100 | ~30,500 |
| ct::schnorr_sign | ~10,800 | ~27,200 |
Apple-to-Apple Ratios (Ultra / libsecp256k1)
| Operation | FAST path | CT path |
|---|---|---|
| ECDSA Sign | 1.9x | 1.3x |
| ECDSA Verify | 1.0x | 1.0x |
| Schnorr Sign | 1.8x | 1.2x |
| Schnorr Verify | 1.0x | 1.0x |
| Generator * k | 2.1x | -- |
Notes:
- Verify uses only public data (no CT needed); same path for both
- libsecp256k1 signing is always CT; FAST path comparison is unfair for signing
- CT-vs-CT is the true apples-to-apples for signing operations
8. Benchmark Framework
Source of Truth
bench_unified is the single canonical benchmark. All other bench_*
targets measure specific subsystems (field representations, CT overhead)
but bench_unified is what gets reported.
Kept Benchmark Targets
| Target | Source | Purpose |
|---|---|---|
bench_unified |
bench_unified.cpp + libsecp_provider.c |
Full apple-to-apple vs libsecp256k1 |
bench_ct |
bench_ct.cpp |
CT layer overhead measurement |
bench_field_52 |
bench_field_52.cpp |
FE52 vs FE64 regression testing |
bench_field_26 |
bench_field_26.cpp |
FE26 vs FE64 (32-bit targets) |
Infrastructure
- Timer: RDTSCP on x86 (sub-ns), chrono::high_resolution_clock elsewhere
- Statistics: IQR-based outlier removal, median of N passes (default 11)
- Anti-DCE:
DoNotOptimize()/ClobberMemory()compiler barriers - Pinning: Thread pinned to core 0, priority elevated (Linux/Windows)
- Data pool: 64 independent key/msg/sig sets, cycled to defeat caching
CLI
bench_unified [OPTIONS]
--json <file> Write structured JSON report
--suite <name> core | extended | all (default: all)
--passes <N> Override passes (default: 11, min: 3)
--quick CI smoke (3 passes, 1/5 iterations)
--no-warmup Skip CPU frequency ramp-up
JSON Schema
{
"metadata": {
"cpu": "...", "compiler": "...", "arch": "...",
"timer": "RDTSCP", "tsc_ghz": 2.524,
"passes": 11, "warmup": 500, "pool_size": 64
},
"results": [
{"section": "FIELD ARITHMETIC (Ultra)", "name": "field_mul", "ns": 18.07},
{"section": "FAST path (Ultra FAST vs libsecp)", "name": "ECDSA Sign", "ratio": 1.91}
]
}
Cross-Platform Reporting
# Run on each platform
./bench/scripts/run_bench.sh --out-dir results/
# Merge all reports into comparison table
python3 bench/scripts/merge_reports.py --dir results/ -o comparison.md
9. Platform-Specific Notes
x86-64
- Primary target. BMI2 intrinsics (
_mulx_u64) + GAS assembly - RDTSCP timer provides sub-nanosecond precision
- CPU frequency warmup (3s heavy crypto load) stabilises turbo before measurement
AArch64
- Full inline assembly for field ops (mul, sqr, add, sub)
- chrono fallback timer
- No BMI2, uses
__int128for wide multiplies
RISC-V 64
- Tested on StarFive VisionFive 2 (SiFive U74, in-order)
- Optional RVV (Vector Extension) support
- 5x52 RISC-V asm available but noted as slower on in-order cores
- Prefetch hints configurable
ESP32-S3 (Xtensa LX7)
- 32-bit target: uses
FieldElement26(10x26) - Separate benchmark binary:
examples/esp32_bench_hornet/ - Same apple-to-apple ratio table format as bench_unified
- Measured: ECDSA Verify 1.78x, Schnorr Verify 1.65x
10. Versioning
This document describes the optimization architecture as of the
dev/optimize-verify-gap branch. Generated from code inspection;
update when algorithms or parameters change.