================================================================================
  UltrafastSecp256k1 -- Cross-Platform Performance Comparison
  bench_hornet (Bitcoin Consensus CPU Benchmark)
  Target: Hornet Node (hornetnode.org)
================================================================================

  Platform A:  Intel i7-11700 @ 2.50 GHz  (x86-64, BMI2/ADX)
               Clang 21.1.0, 4x64 limbs, Montgomery, TSC timer
  Platform B:  ESP32-S3 (Xtensa LX7) @ 240 MHz  (32-bit, no __int128)
               GCC 14.2.0, 10x26 limbs, esp_timer
  Platform C:  ARM Cortex-A55 (YF_022A) (aarch64, NEON, crypto)
               Clang 18.0.1, 10x26 limbs, clock_gettime
  Platform D:  SiFive U74-MC (Milk-V Mars) @ 1.5 GHz (rv64gc_zba_zbb)
               GCC 13.3.0, 4x64 limbs, Montgomery, chrono timer

  Clock ratio (x86/ESP32): 2500 / 240 = 10.4x
  Clock ratio (x86/RISC-V): 2500 / 1500 = 1.67x

================================================================================
  1. HIGH-LEVEL OPERATIONS (Bitcoin-critical)
================================================================================

  +---------------------------+-----------+-----------+-----------+-----------+-----------+
  |          Operation        |  x86 (us) | ARM64(us) |RISCV (us) | ESP32(us) |x86/RISCV  |
  +---------------------------+-----------+-----------+-----------+-----------+-----------+
  | ecdsa_sign                |    10.18  |    27.98  |    81.25  |   7599.8  |    8.0x   |
  | ecdsa_verify              |    31.31  |   146.95  |   235.50  |  18446.2  |    7.5x   |
  | schnorr_sign (keypair)    |     8.37  |    20.11  |    56.37  |   6640.4  |    6.7x   |
  | schnorr_verify (xonly)    |    33.77  |   167.15  |   265.88  |  20606.2  |    7.9x   |
  | pubkey_create (k*G)       |     5.95  |    17.46  |    40.60  |   6272.6  |    6.8x   |
  +---------------------------+-----------+-----------+-----------+-----------+-----------+

  x86 vs ESP32:   ~750x average ratio
  x86 vs RISC-V:  ~7.4x average ratio
  x86 vs ARM64:   ~3.5x average ratio
  ARM64 vs RISC-V: ~2.1x average ratio
  RISC-V vs ESP32: ~103x average ratio
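
  The averages above can be re-derived from the table. The sketch below is
  one way to do it, assuming "average ratio" means the arithmetic mean of the
  per-operation slowdowns; the report does not state which averaging
  convention was used, and LATENCY_US / average_ratio are illustrative names.

```python
from statistics import mean

# Latencies in microseconds, copied from the table in section 1.
LATENCY_US = {
    "ecdsa_sign":     {"x86": 10.18, "arm64": 27.98,  "riscv": 81.25,  "esp32": 7599.8},
    "ecdsa_verify":   {"x86": 31.31, "arm64": 146.95, "riscv": 235.50, "esp32": 18446.2},
    "schnorr_sign":   {"x86": 8.37,  "arm64": 20.11,  "riscv": 56.37,  "esp32": 6640.4},
    "schnorr_verify": {"x86": 33.77, "arm64": 167.15, "riscv": 265.88, "esp32": 20606.2},
    "pubkey_create":  {"x86": 5.95,  "arm64": 17.46,  "riscv": 40.60,  "esp32": 6272.6},
}


def average_ratio(fast, slow):
    """Arithmetic mean of per-operation slowdowns (slow latency / fast latency).

    Assumption: this is the averaging convention; a geometric mean or a
    ratio of means would give slightly different figures.
    """
    return mean(row[slow] / row[fast] for row in LATENCY_US.values())


if __name__ == "__main__":
    for fast, slow in [("x86", "esp32"), ("x86", "riscv"), ("x86", "arm64")]:
        print(f"{fast} vs {slow}: ~{average_ratio(fast, slow):.1f}x")
```

  Under this convention the x86 vs RISC-V and x86 vs ARM64 means round to the
  7.4x and 3.5x quoted above; the ESP32 mean lands close to the quoted ~750x.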

================================================================================
  2. POINT ARITHMETIC
================================================================================

  +---------------------------+-----------+-----------+-----------+-----------+-----------+
  |          Operation        |  x86 (us) | ARM64(us) |RISCV (us) | ESP32(us) |x86/RISCV  |
  +---------------------------+-----------+-----------+-----------+-----------+-----------+
  | scalar_mul (k*P)          |    26.46  |   131.65  |   191.40  |  13342.6  |    7.2x   |
  | dual_mul (a*G+b*P)        |    31.75  |   145.37  |   227.01  |  18649.6  |    7.2x   |
  | point_add                 |     0.27  |     4.42  |     2.35  |    576.0  |    8.7x   |
  | point_dbl                 |     0.10  |     3.66  |     0.89  |    526.5  |    8.9x   |
  +---------------------------+-----------+-----------+-----------+-----------+-----------+

================================================================================
  3. FIELD ARITHMETIC (innermost hot loop)
================================================================================

  +---------------------------+-----------+-----------+-----------+-----------+-----------+
  |          Operation        | x86 (ns)  |ARM64 (ns) |RISCV (ns) | ESP32(ns) |x86/RISCV  |
  +---------------------------+-----------+-----------+-----------+-----------+-----------+
  | field_mul                 |     26.4  |     69.9  |    182.2  |     5910  |    6.9x   |
  | field_sqr                 |     23.6  |     50.4  |    174.2  |     4848  |    7.4x   |
  | field_inv                 |   1087.2  |   2823.3  |   4430.1  |   130150  |    4.1x   |
  | field_add                 |      4.5  |     12.5  |     46.0  |      798  |   10.2x   |
  | field_sub                 |      3.3  |      9.1  |     38.7  |      810  |   11.7x   |
  +---------------------------+-----------+-----------+-----------+-----------+-----------+
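
  The "4x64" and "10x26" limb layouts from the platform list can be
  illustrated with a short sketch. This is not the library's implementation;
  to_limbs / from_limbs are hypothetical helpers that only show how a 256-bit
  field element is split, and why a schoolbook multiply costs more partial
  products in the narrow layout.

```python
# secp256k1 field prime: p = 2^256 - 2^32 - 977
P = 2**256 - 2**32 - 977


def to_limbs(value, limb_bits, limb_count):
    """Split a non-negative integer into little-endian limbs of limb_bits each."""
    if value < 0 or value >> (limb_bits * limb_count):
        raise ValueError("value does not fit in the requested limb layout")
    mask = (1 << limb_bits) - 1
    return [(value >> (limb_bits * i)) & mask for i in range(limb_count)]


def from_limbs(limbs, limb_bits):
    """Recombine little-endian limbs into a single integer."""
    return sum(limb << (limb_bits * i) for i, limb in enumerate(limbs))


if __name__ == "__main__":
    x = P - 1
    wide = to_limbs(x, 64, 4)     # 4x64: four full 64-bit words
    narrow = to_limbs(x, 26, 10)  # 10x26: ten 26-bit limbs, each in a 32-bit word
    assert from_limbs(wide, 64) == x and from_limbs(narrow, 26) == x
    # A schoolbook product needs limb_count**2 partial products.
    print("4x64 partial products: ", len(wide) ** 2)
    print("10x26 partial products:", len(narrow) ** 2)
```

  The 10x26 layout leaves spare bits in each 32-bit word, which lets carries
  be deferred on targets without a 64x64->128 multiply (such as the ESP32-S3
  above), at the cost of many more partial products per multiplication.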

================================================================================
  4. SCALAR ARITHMETIC
================================================================================

  +---------------------------+-----------+-----------+-----------+-----------+-----------+
  |          Operation        | x86 (ns)  |ARM64 (ns) |RISCV (ns) | ESP32(ns) |x86/RISCV  |
  +---------------------------+-----------+-----------+-----------+-----------+-----------+
  | scalar_mul                |     31.5  |    107.9  |    182.2  |    18886  |    5.8x   |
  | scalar_inv                |   1065.7  |   2864.2  |   4983.9  |   132950  |    4.7x   |
  | scalar_add                |      4.2  |      8.9  |     56.7  |      998  |   13.5x   |
  +---------------------------+-----------+-----------+-----------+-----------+-----------+
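
  The field and scalar timings above can be normalized by clock speed to
  separate frequency from per-cycle efficiency. The sketch below assumes the
  nominal clocks from the platform list (the x86 part may boost above
  2.50 GHz, and the Cortex-A55 clock is not stated, so ARM64 is omitted);
  the resulting cycle counts are estimates, not measurements.

```python
# Nominal clocks in GHz from the platform list. Assumption: cores ran at
# these frequencies during the benchmark; turbo boost would change x86.
CLOCK_GHZ = {"x86": 2.5, "riscv": 1.5, "esp32": 0.24}

# Latencies in nanoseconds, copied from sections 3 and 4.
LATENCY_NS = {
    "field_mul":  {"x86": 26.4, "riscv": 182.2, "esp32": 5910},
    "scalar_mul": {"x86": 31.5, "riscv": 182.2, "esp32": 18886},
}


def cycles(op, platform):
    """Approximate cycle count: nanoseconds x GHz = cycles."""
    return LATENCY_NS[op][platform] * CLOCK_GHZ[platform]


def per_clock_ratio(op, slow, fast="x86"):
    """How many times more cycles `slow` needs than `fast` for one operation."""
    return cycles(op, slow) / cycles(op, fast)


if __name__ == "__main__":
    for op in LATENCY_NS:
        for platform in CLOCK_GHZ:
            print(f"{op:10s} {platform:6s} ~{cycles(op, platform):8.1f} cycles")
        print(f"{op:10s} RISC-V/x86 per-clock: {per_clock_ratio(op, 'riscv'):.1f}x")
```

  Read this way, part of the x86 lead over RISC-V comes from the 1.67x clock
  ratio and the rest from work done per cycle.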

================================================================================
  5. CONSTANT-TIME OVERHEAD
================================================================================

  +---------------------------+-----------+-----------+-----------+-----------+
  |          Metric           |   x86     |  ARM64    |  RISC-V   |   ESP32   |
  +---------------------------+-----------+-----------+-----------+-----------+
  | CT overhead (ECDSA)       |   1.77x   |   2.57x   |   1.96x   |   1.05x   |
  | CT overhead (Schnorr)     |   2.03x   |   3.18x   |   2.37x   |   1.06x   |
  +---------------------------+-----------+-----------+-----------+-----------+

  ESP32 has lowest CT overhead: in-order Xtensa LX7, no speculative execution.
  RISC-V U74 (dual-issue in-order) CT overhead is close to x86 (~2x).
  x86 CT overhead improved in v3.16.0 (ECDSA 1.94x -> 1.77x; Schnorr now 2.03x).
  ARM64 Cortex-A55 has the highest CT overhead despite being in-order --
  possibly due to memory/cache pressure from a larger working set.

================================================================================
  6. BITCOIN BLOCK VALIDATION ESTIMATES
================================================================================

  Pre-Taproot block (3000 ECDSA):
  +---------------------------+-----------+-----------+-----------+-----------+-----------+
  |          Metric           |   x86     |  ARM64    |  RISC-V   |   ESP32   |x86/RISCV  |
  +---------------------------+-----------+-----------+-----------+-----------+-----------+
  | Wall time                 |  93.9 ms  |  440.9 ms |  706.5 ms |   55.3 s  |    7.5x   |
  | Blocks/sec                |    10.6   |     2.27  |     1.4   |    0.02   |    7.5x   |
  +---------------------------+-----------+-----------+-----------+-----------+-----------+

  Taproot block (2000 Schnorr + 1000 ECDSA):
  +---------------------------+-----------+-----------+-----------+-----------+-----------+
  |          Metric           |   x86     |  ARM64    |  RISC-V   |   ESP32   |x86/RISCV  |
  +---------------------------+-----------+-----------+-----------+-----------+-----------+
  | Wall time                 |  98.8 ms  |  481.3 ms |  767.3 ms |   59.7 s  |    7.8x   |
  | Blocks/sec                |    10.1   |     2.08  |     1.3   |    0.02   |    7.8x   |
  +---------------------------+-----------+-----------+-----------+-----------+-----------+

  TX throughput (1 core):
  +---------------------------+-----------+-----------+-----------+-----------+-----------+
  |          Metric           |   x86     |  ARM64    |  RISC-V   |   ESP32   |x86/RISCV  |
  +---------------------------+-----------+-----------+-----------+-----------+-----------+
  | ECDSA tx/sec              |  31,939   |   6,805   |   4,246   |      54   |    7.5x   |
  | Schnorr tx/sec            |  29,614   |   5,983   |   3,761   |      49   |    7.9x   |
  +---------------------------+-----------+-----------+-----------+-----------+-----------+
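
  The estimates in this section are consistent with a simple model: wall time
  is the signature count times the single-signature verify latency from
  section 1. The sketch below makes that model explicit. It assumes one
  signature per transaction, one core, and no cost for hashing, script
  execution, or UTXO lookups; the function names are illustrative.

```python
# Verify latencies in microseconds, copied from the table in section 1.
VERIFY_US = {
    "x86":   {"ecdsa": 31.31,   "schnorr": 33.77},
    "arm64": {"ecdsa": 146.95,  "schnorr": 167.15},
    "riscv": {"ecdsa": 235.50,  "schnorr": 265.88},
    "esp32": {"ecdsa": 18446.2, "schnorr": 20606.2},
}


def block_time_ms(platform, ecdsa=0, schnorr=0):
    """Estimated signature-verification time for one block, in milliseconds."""
    lat = VERIFY_US[platform]
    return (ecdsa * lat["ecdsa"] + schnorr * lat["schnorr"]) / 1000.0


def blocks_per_sec(platform, ecdsa=0, schnorr=0):
    """Blocks validated per second on one core under the same model."""
    return 1000.0 / block_time_ms(platform, ecdsa, schnorr)


def tx_per_sec(platform, kind):
    """Single-core throughput, assuming one signature per transaction."""
    return 1_000_000 / VERIFY_US[platform][kind]


if __name__ == "__main__":
    for platform in VERIFY_US:
        pre = block_time_ms(platform, ecdsa=3000)
        tap = block_time_ms(platform, ecdsa=1000, schnorr=2000)
        print(f"{platform:6s} pre-Taproot {pre:9.1f} ms   Taproot {tap:9.1f} ms")
```

  Because the model counts signature checks only, the figures are best read
  as an upper bound on single-core validation throughput.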

================================================================================
  7. vs libsecp256k1 (apples-to-apples, same hardware)
================================================================================

  A) FAST path vs libsecp256k1:
  +---------------------------+-----------+-----------+-----------+-----------+
  |          Operation        |   x86     |  ARM64    |  RISC-V   |   ESP32   |
  +---------------------------+-----------+-----------+-----------+-----------+
  | Generator * k             |   2.87x   |   3.62x   |   3.08x   |   1.15x   |
  | ECDSA Sign                |   2.46x   |   2.73x   |   2.02x   |   1.25x   |
  | ECDSA Verify              |   0.91x * |   1.01x   |   0.94x * |   1.59x   |
  | Schnorr Keypair           |   2.11x   |   3.58x   |   2.71x   |   1.16x   |
  | Schnorr Sign              |   2.12x   |   3.23x   |   2.36x   |   1.42x   |
  | Schnorr Verify            |   1.11x   |   1.01x   |   0.94x * |   1.43x   |
  +---------------------------+-----------+-----------+-----------+-----------+

  * libsecp256k1 wins: x86 ECDSA Verify (1.10x), RISC-V Verify ops (1.06x)

  x86 FAST:    wins 5/6 ops (1.11x-2.87x); loses ECDSA Verify
  ARM64 FAST:  wins 6/6 ops (1.01x-3.62x)
  RISC-V FAST: wins 4/6 ops (2.02x-3.08x); loses both Verify ops
  ESP32 FAST:  wins 6/6 ops (1.15x-1.59x)

  B) CT-vs-CT FAIR comparison (signing ops constant-time vs constant-time):
     libsecp256k1 signing and keygen are ALWAYS constant-time, so comparing
     them against the FAST path is UNFAIR. CT-vs-CT is the like-for-like
     comparison for secret-key ops.
     Verify uses only public inputs -- no CT needed, so the Verify rows
     repeat the figures from table A.

  +---------------------------+-----------+-----------+-----------+-----------+
  |          Operation        |   x86     |  ARM64    |  RISC-V   |   ESP32   |
  +---------------------------+-----------+-----------+-----------+-----------+
  | ECDSA Sign (CT)           |   1.39x   |   1.06x   |   1.03x   |   1.20x   |
  | ECDSA Verify              |   0.91x * |   1.01x   |   0.94x * |   1.59x   |
  | Schnorr Sign (CT)         |   1.04x   |   1.02x   |   1.00x   |   1.36x   |
  | Schnorr Verify            |   1.11x   |   1.01x   |   0.94x * |   1.43x   |
  +---------------------------+-----------+-----------+-----------+-----------+

  * libsecp256k1 wins in CT-vs-CT:
    x86: ECDSA Verify (0.91x)
    RISC-V: both Verify ops (0.94x)

  x86 CT:    wins 3/4 ops (1.04x-1.39x); libsecp256k1 wins ECDSA Verify (1.10x)
  ARM64 CT:  wins 4/4 ops (1.01x-1.06x)
  RISC-V CT: tied on Sign (1.00x-1.03x); loses Verify (0.94x)
  ESP32 CT:  wins 4/4 ops (1.20x-1.59x)
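
  The CT-vs-CT signing figures in table B can be approximately reproduced
  from table A and section 5. The sketch below assumes that "CT overhead"
  is the library's own CT signing time divided by its FAST signing time, and
  that the Schnorr overhead applies to Schnorr Sign; the report does not
  spell this out, so treat the relation as an inference. The second helper
  inverts a sub-1.0 ratio to express a libsecp256k1 win.

```python
# Speedups over libsecp256k1 on the FAST path (table A).
FAST_SPEEDUP = {
    "ecdsa_sign":   {"x86": 2.46, "arm64": 2.73, "riscv": 2.02, "esp32": 1.25},
    "schnorr_sign": {"x86": 2.12, "arm64": 3.23, "riscv": 2.36, "esp32": 1.42},
}

# CT overhead factors (section 5). Assumption: CT time / FAST time.
CT_OVERHEAD = {
    "ecdsa_sign":   {"x86": 1.77, "arm64": 2.57, "riscv": 1.96, "esp32": 1.05},
    "schnorr_sign": {"x86": 2.03, "arm64": 3.18, "riscv": 2.37, "esp32": 1.06},
}


def ct_speedup(op, platform):
    """Estimated CT-vs-CT speedup: FAST speedup divided by CT overhead."""
    return FAST_SPEEDUP[op][platform] / CT_OVERHEAD[op][platform]


def libsecp_advantage(speedup):
    """Invert a ratio below 1.0 to get libsecp256k1's advantage."""
    return 1.0 / speedup


if __name__ == "__main__":
    for op in FAST_SPEEDUP:
        row = "  ".join(f"{p}={ct_speedup(op, p):.2f}x" for p in FAST_SPEEDUP[op])
        print(f"{op:13s} {row}")
    print(f"0.91x for Ultra -> {libsecp_advantage(0.91):.2f}x for libsecp256k1")
```

  The estimates agree with table B to within a few hundredths; the small
  differences are expected because the inputs are already rounded.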

================================================================================
  KEY INSIGHTS
================================================================================

  1. x86 is ~3.5x faster than ARM64, ~7.4x faster than RISC-V,
     and ~590-1050x faster than ESP32 for high-level ECC operations

  2. RISC-V U74 @ 1.5 GHz delivers 1.4 blocks/sec pre-Taproot
     -- borderline viable for a lightweight Bitcoin node; needs multi-core.
     ARM64 Cortex-A55 delivers 2.3 blocks/sec -- viable.

  3. UltrafastSecp256k1 FAST gains are largest on ARM64 and RISC-V:
     - ARM64:  Generator*k 3.62x, Schnorr Keypair 3.58x, Schnorr Sign 3.23x
     - RISC-V: Generator*k 3.08x, Schnorr Keypair 2.71x, Schnorr Sign 2.36x
     - x86:    Generator*k 2.87x, Schnorr Keypair 2.11x, Schnorr Sign 2.12x

  4. CT-vs-CT tells the REAL story (v3.16.0 CT improvements!):
     - x86: Ultra NOW WINS Sign ops (1.04x-1.39x) -- v3.16.0 CT dramatically improved
     - ARM64: Ultra wins (1.02x-1.06x) -- CT overhead is manageable
     - RISC-V: essentially tied on Sign (1.00x-1.03x), loses Verify (0.94x)
     - ESP32: Ultra wins decisively (1.20x-1.59x) -- lowest CT overhead

  5. Verify ops are the closest FAST-path race on x86, ARM64, and RISC-V;
     ESP32 is the exception:
     - x86: 0.91x-1.11x (libsecp256k1 wins ECDSA Verify, Ultra wins Schnorr Verify)
     - ARM64: 1.01x (tied)
     - RISC-V: 0.94x (libsecp256k1 wins both Verify ops)
     - ESP32: 1.43x-1.59x (Ultra wins)

  6. Field multiply: x86 and RISC-V use 4x64 Montgomery;
     ARM64 and ESP32 use 10x26 schoolbook.
     RISC-V 4x64 field_mul is 6.9x slower than x86 (182.2 ns vs 26.4 ns;
     no BMI2/ADX, single mul unit)

  7. CT overhead ranking (ECDSA): ESP32 (1.05x) < x86 (1.77x) ~= RISC-V (1.96x) < ARM64 (2.57x)
     v3.16.0 reduced x86 ECDSA CT overhead from 1.94x to 1.77x; x86 Schnorr is now 2.03x
     x86 is now LOWER than RISC-V for ECDSA CT overhead

  8. RISC-V point_add (2.35 us) is 1.9x FASTER than ARM64 (4.42 us)
     despite ARM64's higher clock -- 4x64 limbs appear to win over 10x26
     for point ops, even though ARM64 field_mul (69.9 ns) beats RISC-V (182.2 ns)

================================================================================
  i7-11700 (x86-64) vs Cortex-A55 (ARM64) vs U74-MC (RISC-V) vs ESP32-S3
  UltrafastSecp256k1 v3.16.0
================================================================================
