* perf: replace std::endl with newline char (performance-avoid-endl)
Replace all 133 occurrences of std::endl with '\n' across 6 files.
std::endl forces a stream flush on every call; '\n' does not.
Files: precompute.cpp, test_comprehensive.cpp, test_arithmetic_correctness.cpp,
test_exhaustive.cpp, bench_adaptive_glv.cpp, bench_glv_decomp_profile.cpp
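
For reference, a minimal sketch of the difference, using hypothetical helper names that are not taken from the changed files. Both functions produce the same bytes; the first flushes after every line, the second flushes once.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: every std::endl writes '\n' and then flushes the stream.
inline void write_lines_endl(std::ostream& os, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        os << "line " << i << std::endl;
    }
}

// Hypothetical helper: same output with '\n', flushed once after the loop.
inline void write_lines_newline(std::ostream& os, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        os << "line " << i << '\n';
    }
    os.flush();
}
```

On an in-memory stream the flush is cheap, but on a file or pipe each flush typically costs a write call, which is where the per-line overhead comes from.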
* fix: mass clang-tidy auto-fix (const, init, braces, auto, endl)
Run clang-tidy --fix on all 98 source files using .clang-tidy config:
- misc-const-correctness: add const to unmodified locals
- cppcoreguidelines-init-variables: initialize variables at decl
- readability-braces-around-statements: add braces to single-statement bodies
- modernize-use-auto: use auto where type is obvious
- performance-avoid-endl: replace remaining endl with newline
Manual fixes for 4 false-positive const additions:
- ct_point.cpp: y3/z3/one52 used as fe52_cmov output params
- selftest.cpp: path used as _dupenv_s output param
- precompute.cpp: end used as strtoul output param
- test_fuzz_address_bip32_ffi.cpp: out used as ufsecp_ctx_clone output
Build: 0 errors. Tests: 25/25 pass.
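
The output-parameter cases above share one shape, sketched here with std::strtoul. The wrapper parse_ulong is hypothetical and is not the code in precompute.cpp; it only shows why the local must stay non-const even though the function body never assigns to it directly.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical wrapper: parse a decimal number and report whether the
// whole string was consumed.
inline bool parse_ulong(const char* text, unsigned long& value) {
    assert(text != nullptr);
    // 'end' is never assigned in this function body, but strtoul writes
    // through &end. Declaring it 'char* const end' would not compile,
    // because 'char* const*' does not convert to 'char**'.
    char* end = nullptr;
    value = std::strtoul(text, &end, 10);
    return end != text && *end == '\0';
}
```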
* fix: clang-tidy manual fixes for headers + test widening cast
- ct_utils.hpp: const-qualify w0a-w3a load variables
- field_optimal.hpp: give the FieldTier enum a uint8_t underlying type (performance-enum-size)
- test_comprehensive.cpp: fix widening cast order (cast before add)
Build: 0 errors. Tests: 25/25 pass.
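
A small sketch of the two header/test fixes, assuming a platform where unsigned int is 32 bits. The function names and the ExampleTier enumerators are hypothetical; the real FieldTier values are not shown in this message.

```cpp
#include <cassert>
#include <cstdint>

// Cast after add: the 32-bit sum wraps first, then the wrapped value is widened.
inline std::uint64_t sum_cast_after(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::uint64_t>(a + b);
}

// Cast before add: one operand is widened first, so the sum is computed in 64 bits.
inline std::uint64_t sum_cast_before(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::uint64_t>(a) + b;
}

// Hypothetical stand-in for an enum with only a few values: a uint8_t
// underlying type makes each value occupy one byte.
enum class ExampleTier : std::uint8_t { Small, Medium, Large };
static_assert(sizeof(ExampleTier) == 1, "enum should occupy one byte");

// Usage: the two sums differ exactly when the 32-bit addition overflows.
inline void check_widening_example() {
    assert(sum_cast_after(1u, 2u) == sum_cast_before(1u, 2u));
    assert(sum_cast_after(0xFFFFFFFFu, 1u) != sum_cast_before(0xFFFFFFFFu, 1u));
}
```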
Core changes (ct_point.cpp):
- Created always_inline template core versions of hot functions:
table_lookup_core<NORMALIZE_Y>, unified_add_core<CHECK_INFINITY>,
point_dbl_n_core(). Public API wrappers delegate to cores.
- scalar_mul/generator_mul main loops call cores directly with
CHECK_INFINITY=false (saves 3 fe52_cmov + 1 fe52_is_zero per add)
and NORMALIZE_Y=false (skips unnecessary normalize_weak on table
entries already at magnitude 1).
- Added fe52_normalizes_to_zero(): cheaper CT zero check using
normalize_weak + overflow reduce + dual representation check
(~40 ops vs ~64 for full fe52_is_zero). Matches libsecp approach.
- Added #pragma clang loop unroll(disable) on main loops to prevent
code bloat from inlined 66KB function body.
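
The core/wrapper split described above can be sketched on a toy type. Everything here (ToyPoint, toy_add_core, toy_sum, EXAMPLE_ALWAYS_INLINE) is hypothetical and far simpler than the real point arithmetic in ct_point.cpp; it only illustrates a bool template parameter that compiles a check in or out, a public wrapper that delegates to the core, and the unroll pragma on the hot loop. It assumes C++17 for `if constexpr`.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__clang__) || defined(__GNUC__)
#define EXAMPLE_ALWAYS_INLINE inline __attribute__((always_inline))
#else
#define EXAMPLE_ALWAYS_INLINE inline
#endif

// Toy stand-in for a curve point: one coordinate plus an infinity flag (0 or 1).
struct ToyPoint {
    std::uint64_t x;
    std::uint64_t infinity;
};

// Core: the infinity handling exists only in the CHECK_INFINITY=true instantiation.
template <bool CHECK_INFINITY>
EXAMPLE_ALWAYS_INLINE ToyPoint toy_add_core(ToyPoint p, ToyPoint q) {
    ToyPoint r{p.x + q.x, 0};
    if constexpr (CHECK_INFINITY) {
        // Branch-free selects: each mask is all ones when its flag is 1.
        const std::uint64_t p_inf = std::uint64_t{0} - p.infinity;
        const std::uint64_t q_inf = std::uint64_t{0} - q.infinity;
        r.x = (r.x & ~p_inf) | (q.x & p_inf);  // p infinite -> take q
        r.x = (r.x & ~q_inf) | (p.x & q_inf);  // q infinite -> take p
        r.infinity = p.infinity & q.infinity;
    }
    return r;
}

// Public wrapper keeps the safe behaviour and delegates to the core.
inline ToyPoint toy_add(ToyPoint p, ToyPoint q) {
    return toy_add_core<true>(p, q);
}

// Hot loop: the caller guarantees no infinite inputs, so it uses the cheaper
// instantiation and asks clang not to unroll the inlined body.
inline std::uint64_t toy_sum(const ToyPoint* pts, int n) {
    ToyPoint acc{0, 0};
#if defined(__clang__)
#pragma clang loop unroll(disable)
#endif
    for (int i = 0; i < n; ++i) {
        acc = toy_add_core<false>(acc, pts[i]);
    }
    return acc.x;
}
```

The design point is that the unchecked instantiation is only reachable from code that can prove the precondition, while external callers still go through the checked wrapper.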
Benchmark improvements (bench_ct.cpp, bench_atomic_operations.cpp):
- Use pool of 32 random 256-bit scalars to prevent branch-predictor
warming and ensure realistic workload measurements.
- Use full 256-bit field elements/scalars instead of trivially small
values in atomic operation benchmarks.
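
A rough sketch of the pooled-input idea, not the code in bench_ct.cpp. Scalar256, make_scalar_pool and run_with_pool are hypothetical names, and the random limbs are not reduced modulo the group order, which a real benchmark may need to do.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>

// Hypothetical stand-in for a 256-bit scalar: four 64-bit limbs.
using Scalar256 = std::array<std::uint64_t, 4>;
constexpr std::size_t kPoolSize = 32;

// Fill a fixed pool with full-width pseudo-random values from a seeded generator.
inline std::array<Scalar256, kPoolSize> make_scalar_pool(std::uint64_t seed) {
    std::mt19937_64 rng(seed);
    std::array<Scalar256, kPoolSize> pool{};
    for (auto& scalar : pool) {
        for (auto& limb : scalar) {
            limb = rng();
        }
    }
    return pool;
}

// Call op on a different pool entry each iteration so the measured code does
// not see the same input repeatedly. The XOR sink keeps results observable.
template <typename Op>
inline std::uint64_t run_with_pool(const std::array<Scalar256, kPoolSize>& pool,
                                   std::size_t iterations, Op op) {
    assert(iterations > 0);
    std::uint64_t sink = 0;
    for (std::size_t i = 0; i < iterations; ++i) {
        sink ^= op(pool[i % kPoolSize]);
    }
    return sink;
}
```

A timing harness would wrap the run_with_pool call with its own clock reads; that part is left out here.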
Results (i5-14400F, clang-19, -O3 -march=native):
- CT scalar_mul: 23.3us -> 22.0us (~5% improvement, 1.25x -> 1.18x vs libsecp)
- CT generator_mul: 11.0us -> 9.6us (~13% improvement, 0.99x -> 0.87x vs libsecp)
- All CT audit tests pass (120,652 checks, 0 failures)