Skip to content

IxVM: native execution pipeline#463

Draft
arthurpaulino wants to merge 11 commits into
mainfrom
ap/codegen-ixvm-native
Draft

IxVM: native execution pipeline#463
arthurpaulino wants to merge 11 commits into
mainfrom
ap/codegen-ixvm-native

Conversation

@arthurpaulino

Copy link
Copy Markdown
Member

Codegen the IxVM Aiur kernel to Rust, run witness/prove Rust-side against a Rust-owned env, and wire the fast path through every ix check / ix prove entry point.

Wall time on shard 26 of the 64-way init.ixes partition: 1029 s → 62 s (~16× end-to-end).

What's here

Eight commits, roughly grouped:

Codegen kernel

  • Aiur Bytecode → Rust codegen pass. Walks the compiled IxVM Bytecode.Toplevel and emits one fn aiur_fn_N(...) per Aiur function into crates/ix/src/aiur_ixvm.rs (~4.8 MB, 743 fns). Same QueryRecord shape as the interpreter — proof-verification compatible.
  • Value-returning byte ops in the codegen output. Eliminates the Vec<G> scratch buffer per byte op (~245 sites) via per-op helpers (bytes2_xor_value, bytes2_add_value, etc.) that bump the byte-chip queries and return the gadget output directly. Also dead-folds unconstrained || false / !unconstrained && !false and uses an unchecked array copy on Call cache hits.

Rust-owned witness + env

  • Port IxVM.ClaimHarness.buildShardCheckEnvWitness to Rust. Parallel closure walk (rayon + DashSet) + parallel byte→G conversion. Eliminates the per-byte boxing into Aiur.G that dominated per-shard wall time.
  • Wire the Rust witness fast path through per-claim check/prove and both benchmarks. Adds FFIs for check_addr / prove_addr / shard_prove, plus bytes-blob variants for the compiled-Lean-env code path.
  • EnvHandle: Rust-owned ixon::Env exposed to Lean as an opaque handle. Collapses six per-call FFIs into four *_with_env FFIs + two constructors (fromIxe / fromBytes). The env is parsed exactly once per CLI invocation; every per-target FFI call reuses the same handle. Prove FFIs return claim_bytes so Lean deserializes the wire claim instead of recomputing it.

CI + ergonomics

  • ix codegen --check. CI-facing flag that compares the emitted Rust source against the on-disk file and exits 1 on drift. Wired into lean-test. Warm ~2 s.
  • ix check --interp {source|bytecode}. Single flag with two modes. bytecode bypasses the codegen kernel via the generic Aiur bytecode interpreter — no ix codegen + cargo build cycle when iterating on Ix/IxVM/*.lean. source reaches the Aiur source interpreter for its richer per-step diagnostics.

Coverage

Every ix check / ix prove shape now routes through the native pipeline:

Command Path
ix check NAME --ixe checkAddrWithEnv (codegen kernel)
ix check NAME (no --ixe) checkAddrWithEnv + EnvHandle.fromBytes
ix check --claim hex --ixe checkAddrWithEnv for check addr none
ix check --ixes --shard K shardCheckWithEnv
ix check --ixes shardCheckWithEnv × all shards, shared env
ix prove NAME --ixe proveAddrWithEnv
ix prove --ixes --shard K shardProveWithEnv
ix prove --ixes shardProveWithEnv × all shards
Benchmarks/Typecheck Phase 1 + 2 shared EnvHandle across execute + prove

Fallbacks:

  • --interp source — Aiur source interpreter (richer errors). Target.leanW materialised Lean-side.
  • --interp bytecode — generic bytecode interpreter, skips codegen/rebuild cycle.
  • --claim <hex> for non-check addr none variants (eval / reveal / contains / checkEnv-with-asm) — Lean witness builder.

FFI surface

Six new FFIs replace the ten single-use ones deleted along the way:

// Constructors
rs_aiur_env_handle_from_ixe(path)     -> LeanExternal<EnvHandle>
rs_aiur_env_handle_from_bytes(bytes)  -> LeanExternal<EnvHandle>

// Check (returns output + ioBuffer + queryCounts)
rs_aiur_toplevel_check_addr_with_env  (toplevel, fun_idx, handle, addr_bytes,  use_bytecode)
rs_aiur_toplevel_shard_check_with_env (toplevel, fun_idx, handle, owned_blob,  use_bytecode)

// Prove (returns claim_bytes + proof + ioBuffer)
rs_aiur_system_prove_addr_with_env    (system, fri, fun_idx, handle, addr_bytes)
rs_aiur_system_shard_prove_with_env   (system, fri, fun_idx, handle, owned_blob)

Measurements

Shard 26 of the 64-way init.ixes partition (blake3 / defeq / Nat-heavy):

Variant Wall Speedup
Baseline (Lean witness + bytecode interp) 1029 s 1.0×
Lean witness + codegen kernel 1015 s 1.01×
Rust witness (serial) + codegen kernel 101 s 10 ×
Rust witness (parallel) + codegen kernel 80 s 12.9×
+ value-helper byte ops 64 s 16.1×
+ EnvHandle (env parsed once) 62 s 16.6×

FFT cost 107_006_963_281 on every variant — bit-identical QueryRecord traces across all backends.

Sanity across three entry points (warm):

Command Wall FFT (G)
ix check Nat.add_comm (compiled env, EnvHandle.fromBytes) 2.4 s 23_603_449
ix check --ixe init.ixe Nat.add_comm 3.1 s 23_603_449
ix check Std.Time.Week.Offset.ofMilliseconds (codegen) 9.3 s 12_430_516_949
ix check --interp bytecode Std.Time.Week.Offset.ofMilliseconds 14.6 s 12_430_516_949
ix check --interp source Eq 6.3 s (Eq: ())

New crate

crates/ix/ (ix) — IxVM-specific glue. Depends on aiur, common, ixon.

  • aiur_ixvm.rs — codegen'd kernel (regenerated by ix codegen).
  • aiur_ixvm_runner.rs — thin execute_ixvm wrapper that routes through execute_generated.
  • aiur_ixvm_witness.rs — Rust witness builders (build_claim_check_witness, build_shard_check_env_witness) with parallel closure walk + byte-to-G conversion.
  • env_handle.rsEnvHandle wrapping ixon::Env with anon_hints populated.

CI

lean-test now runs lake exe ix codegen --check (exits 1 if crates/ix/src/aiur_ixvm.rs doesn't match what the emitter would produce). Prevents merging stale generated kernels that drift from the Bytecode → Rust pass.

Notes for reviewers

  • crates/ix/src/aiur_ixvm.rs is generated. Ignore the diff; regeneration is deterministic and CI-gated.
  • The --interp source path still requires a Lean-side ClaimWitness. forEachClaim takes a forceLeanWitness : Bool and materialises one via mkWitness when the source interpreter is selected.
  • EnvHandle keeps the mmap alive across per-target calls: Env::get_anon_mmap returns per-constant Arc<Mmap> windows, and the handle owns the Env; Lean's reference-counted LeanExternal<EnvHandle> drives the drop.
  • The prove-path executor is injected as a closure from ffi into aiur::AiurSystem::prove_ixvm so aiur doesn't need to depend on ix (which depends on aiur).

Replace `klimbs_mul(base, klimbs_pow(base, klimbs_dec(exp)))`'s O(exp)
per-step memo growth with a binary-exponentiation recursion of depth
O(log exp). Both `exp / 2` and `exp % 2` route through the native
`unconstrained_big_uint_div_mod` op, so each level pays O(1) for the
division — `log2(exp)` memo entries for `klimbs_pow` itself + the
`klimbs_mul(base, base)` doublings.

`lake test -- --ignored ixvm` on `IxVMPrim.nat_pow_big`
(`(2 ^ 16384 : Nat) - 2 ^ 16384 = 0`, measured against the current
`ap/utf8-mod-witness` baseline):
3_941_276_885 → 71_856_587 FFT (-98.2%).

Tests/Ix/IxVM.lean:
* new `IxVMPrim.nat_pow_big` synthetic — exercises `klimbs_pow` at a
  non-trivial exponent (`2 ^ 16384`) so future regressions in the body
  shift its FFT pin.
* small pin shifts in `nat_shl_lit`, `nat_shr_lit`, `nat_div_lit`,
  `nat_mod_lit`, `nat_gcd_lit`, `bv_to_nat_lit` — all reduce via
  `klimbs_pow` / `klimbs_div_mod` and absorb the new cost shape.
`lookup_addr_pos`'s rbtree (`addr_pos_map`) only contained const
addresses. Every blob ref (literal-blob payload pointers in
`Constant.refs`) probed the map, missed, and fell through to the O(N)
linear scan over `all_addrs` which also returned 0. Wasted work
proportional to (blob refs × shard closure size).

Add a third value class to the same rbtree: the sentinel `4294967295`
(beyond any honest `pos+1`), inserted by walking each const's `refs`
once via `augment_with_blob_refs`. Any ref not already mapped to a
const-position gets the sentinel.

`lookup_addr_pos` becomes a 3-way match: `0` → fall back to linear
scan (now only fires under the de-intern soundness corner-case, ~never
in practice); `SENTINEL` → return 0 directly (known blob); `pos+1` →
return `pos`. `is_blob` mirrors the same match.

Soundness preserved: every probe still relies on the positive direction
`ptr_val` equality ⇒ content equality (Aiur Store's content-addressing
invariant). The linear fallback stays put for the (still theoretical)
malicious de-intern, where it uses content-based `address_eq`.

`lake exe ix check --ixe init.ixe --ixes init.ixes --shard 51`
(`ulimit -v 23000000`, with the kernel's blake3 paths in
`load_verified_constant` / `load_verified_blob` / `load_verified_claim`
stubbed for profiling):
119_152_044_062 → 102_190_540_489 FFT (-14.2%).

Tests/Ix/IxVM.lean: 42 pin shifts absorbing the per-ingress
augment-walk overhead.
A consolidated refactor of the ingress pipeline guided by a 5-reviewer
synthesis of accidental complexity sites. Three families of change:

PLUMBING

* `Blake3.lean`: `verify_bytes_against(bytes, expected)` + `bytes_to_addr(bytes)`
  + `blake3_flat(input)` centralise the 24-line `[h[i][j]]` digest reshape
  that was hand-unrolled at 7 sites. ~100 lines of boilerplate deleted.
* `Ingress.lean`: `load_payload_const`/`blob`/`hint` + `ch_const`/`blob`/`hint`
  factor the `io_get_info` + `#read_byte_stream` boilerplate at 6 sites.
  `load_verified_blob` no longer double-`load(addr)`s.
* `sentinel_blob_ref()` named const replaces one bare `4294967295`.

SIDECAR LISTS → RBTREE

* `lookup_canon_addr`: O(N) parallel-list walk → O(log N) rbtree probe
  via new `build_canon_addr_map`. `canon_addrs: List<Addr>` arg replaced
  by `canon_addr_map: &RBTreeMap<Addr>` (one threaded ptr).
* `lookup_block_start`: O(N) parallel-list walk → O(log N) rbtree probe
  via new `build_block_start_map`. `block_addrs` / `block_starts` args
  replaced by `block_start_map: &RBTreeMap<G>` across ~12 fn signatures.
* `build_aux_recr_ctor_idxs` returns `(idxs, block_addr)` so the
  standalone-Recr caller in `build_convert_inputs_walk` doesn't redo the
  `rec_typ_to_inductive_addr` + `load_verified_constant` chain to recover
  `block_addr` — `build_aux_recr_ctor_idxs` already computed it internally.
* `build_ref_idxs_and_blobs` fuses the two separate walks
  (`build_ref_idxs_mapped` + `build_lit_blobs`) into one — single rbtree
  probe per ref produces both `ref_idxs` and `lit_blobs`. 5 caller sites
  simplified.

PTR-PASSING (per `reference_aiur_pass_pointer_not_value`)

* `convert_univ(u: &Univ)`: per-row input width drops from 5-variant
  union to one G column. Caller `convert_univ_idxs` already produces the
  `&Univ` via `list_lookup(univs, idx)` — no extra `store`.
* `convert_one(input: &ConvertInput)`: caller `convert_all` already has
  `&input` from the `ListNode.Cons` destructure.

DEAD CODE DELETED

`address_in_list`, `apply_ctor_overrides`, `lookup_override`, `build_lit_blobs`,
`build_ref_idxs_mapped`, `is_blob`, `ctx_convert_expr` — all unused after
the migrations above.

TRIED + REVERTED (left as notes in code where instructive):

* Fused `build_addr_pos_map` + `augment_with_blob_refs` single-pass.
  Lost to Aiur's per-row width tax — two narrower passes cheaper than
  one merged wide pass.
* `convert_expr(ctx: &ConvertCtx)` ptr. Per-arm `load(ctx)` cost > raw-arg
  threading.
* `compute_layout_walk` seen-mptrs/seen-poses → rbtree. Small regression
  on typical shards (few enough Muts blocks that O(N) list scan is
  already cheap; rbtree value column tax > savings).
* `block_members_map` to skip per-projection `load_verified_constant`.
  Lost because Aiur memoises `load_verified_constant` automatically — the
  new rbtree probe's width tax exceeded the savings from the eliminated
  match cascade.
* `ctx` everywhere as `&ConvertCtx` (with `store(...)` at 5 construction
  sites). The added `store` rows offset the per-row savings — wash.

`lake test -- --ignored ixvm`: 42 pin shifts, all measurable wins vs the
`ap/blob-ref-rbtree-augment` tip. Cumulative on heavy consts:
* `Nat.add_comm`:              54_369_745 → 54_049_773 FFT (-0.59%)
* `Nat.sub_le_of_le_add`:     515_331_420 → 510_843_459 FFT (-0.87%)
* `Array.append_assoc`:     2_567_087_893 → 2_537_360_311 FFT (-1.16%)
* `Vector.append`:          2_639_286_078 → 2_607_800_745 FFT (-1.19%)
* `String.Internal.append`:   718_803_075 → 708_296_270 FFT (-1.46%)
* `IxVMInd.Even.rec`:          31_659_493 → 31_434_525 FFT (-0.71%)
* `_private....extractMainModule._unsafe_rec`: 1_081_617_705 → 1_064_689_765 FFT (-1.57%)
Replaces channel 0's tagged-union (claim / tree / const bytes / empty-
blob-marker, distinguished only by key content-hash collision-
impossibility) with one channel per value kind. Channel reorganization
is cost-neutral — all six per-channel `io_get_info` + `#read_byte_stream`
pairs are inlined at the single consumer of each channel, so no fn-call
row gets added.

New layout, tiered by access pattern:

| Tier  | Channel | Purpose                  | Key                     | Value                |
|-------|---------|--------------------------|-------------------------|----------------------|
| Ctrl  | 0       | claim wire bytes         | `blake3(claim_bytes)`   | claim bytes          |
| Ctrl  | 1       | assumption tree bytes    | `tree.root`             | tree bytes           |
| Const | 2       | constant wire bytes      | const addr              | const bytes          |
| Const | 3       | Defn reducibility hint   | Defn addr               | single G             |
| Blob  | 4       | blob discriminator       | addr                    | one byte (1=const, 0=blob) |
| Blob  | 5       | blob raw bytes           | blob addr               | raw bytes            |

Tier 1 = one-per-`verify_claim`. Tier 2 = per-const. Tier 3 = per-blob.
`io_get_info(channel, key)` is now unambiguous by channel alone.

Aiur side (`Ingress.lean`, `Kernel/Claim.lean`):

* `load_verified_constant` inlines `io_get_info(2, raw)` +
  `#read_byte_stream(2, ...)` (was ch 0).
* `load_verified_blob` inlines ch 5 read (was ch 1).
* `load_constant_hint` inlines ch 3 read (was ch 2).
* `load_with_deps` inlines a ch 4 discriminator probe replacing the
  former `io_get_info(0, addr).len == 0 ⟹ blob` hack on ch 0. One
  read_byte for the per-addr discriminator byte.
* `load_verified_claim` inlines ch 0 (was using a shared `load_payload_const`
  helper on ch 0).
* `load_assumption_tree` inlines ch 1 (was using same `load_payload_const`
  helper on ch 0).
* Helper wrappers (`load_payload_claim/tree/const/hint/blob`,
  `load_discriminator`, `ch_*` constants) all removed — every channel
  has exactly one consumer site and the inlined dispatch costs no fn
  call.
* Top-of-file `IxVM IOBuffer interface` doc block documents the layout
  + soundness model.

Lean side (`ClaimHarness.lean`):

* `addEntries` writes per channel — consts → ch 2 + per-addr
  discriminator `[1]` on ch 4; blobs → ch 5 + per-addr discriminator
  `[0]` on ch 4; Defn hints → ch 3.
* `seedTreeAt` writes tree bytes to ch 1 (was ch 0).
* `buildClaimWitness` writes claim bytes to ch 0 (unchanged).
* Drops the empty-marker write on ch 0 for blob addrs — the ch 4
  discriminator covers blob/const classification explicitly.
* Top-of-section `IxVM IOBuffer interface` doc block mirrors the
  Aiur-side comment.

Soundness model unchanged. ch 0/1/2/5 byte streams are blake3-verified
by the kernel against their content-addressed keys. ch 3 hint is
semantically optional (controls WHNF reduction heuristic only; def-eq
is sound either way). ch 4 discriminator is sound by erasure-
correctness — a lying byte flips the const/blob decision and the
wrong-path load downstream fails (a "const" blob triggers a ch 2 read
that returns empty → blake3 verify against the non-empty addr fails;
a "blob" const dangles references → typecheck fail).

`lake test -- --ignored ixvm` green; 42 pins shifted within ±0.010%
(noise-tier) vs the pre-shuffle baseline — channel reorganization is
cost-neutral as required, the discriminator's one read_byte trades
against the eliminated `io_get_info(0, ...).len` check.
Adds a new Aiur pipeline stage that translates `Bytecode.Toplevel`
into a Rust source module, one `fn aiur_fn_N` per Aiur function.
The generated code mirrors `src/aiur/execute.rs`'s QueryRecord
side effects exactly: same `function_queries.insert` timing, same
cache-hit multiplicity bumps, same memory-queries insertion order,
same `bytes{1,2}_queries` updates, same IO sequencing. Per-witness
trace hashes match the interpreter byte-for-byte on every const
tested (`Nat.add_comm`, `Vector.append`,
`Std.Time.Week.Offset.ofMilliseconds`).

* `Ix/Aiur/Stages/Codegen.lean`: structured Rust IR (`RustExpr` /
  `RustStmt` / `RustItem` / `MatchArm` / labeled blocks) plus a
  formatter. Codegen is a structural walk over the Bytecode AST,
  no string templating. `EmitM = StateM EmitState` threads
  `(nextVal, nextLabel)` so per-ValIdx Rust locals and per-MC
  labels are fresh.
* Aiur's `map: Vec<G>` is gone in the generated kernel: every
  ValIdx becomes a Rust local `__v_{i}: G`. Match-arm bodies snapshot
  `nextVal` on entry so per-arm allocations don't leak to siblings.
  `MatchContinue` / `Yield` use labeled-block + `break 'label
  [G; OUT_SIZE]` to bubble yielded values out and rebind them at
  outer scope.
* Deep Aiur recursion uses `stacker::maybe_grow(64 KiB, 4 MiB, …)`
  at the top of every generated fn, so the native call stack grows
  on demand rather than pre-reserving a giant thread stack. Verified
  against shard 24 of the 64-way `init.ixes` partition, which
  SEGFAULTs without it.
* `ix codegen`: new CLI command. Compiles the IxVM Aiur source,
  walks the bytecode, writes the generated Rust to a fixed path
  (`src/ix/aiur_ixvm.rs`). The output path is hard-coded; no
  override.

* `src/ix/aiur_ixvm.rs`: the generated kernel, 743 `aiur_fn_*`
  + `execute_generated` dispatch. Auto-regenerated; do not edit.
  `#![cfg_attr(rustfmt, rustfmt::skip)]` + a broad
  `#![allow(unused_*, non_snake_case, clippy::all)]` since the
  layout is for the compiler, not humans.
* `src/ix/aiur_ixvm_runner.rs`: `execute_ixvm(toplevel, fun_idx,
  args, io_buffer) -> Result<(QueryRecord, Vec<G>), ExecError>`.
  Same return shape as `Toplevel::execute`, but routes through
  `execute_generated`.
* `src/aiur/synthesis.rs`: `AiurSystem::prove_ixvm(...)` —
  same shape as `prove`, but the execute step calls
  `execute_ixvm`. Verification-compatible (proofs from one path
  verify under the other).
* `src/aiur/execute.rs`: helpers exposed for the codegen'd kernel
  (`bytes{1,2}_execute` → `pub(crate)`,
  `unconstrained_big_uint_div_mod_helper` extracted,
  `CodegenBytes{1,2}{Op,}` aliases re-exported,
  `QueryRecord::new` → `pub(crate)`, `ExecError::InvalidFunIdx`
  added).
* `src/ffi/aiur/protocol.rs`: two new FFI exports —
  `rs_aiur_toplevel_execute_ixvm` and `rs_aiur_system_prove_ixvm`.
  Same wire format as the existing `_execute` / `prove` exports.
* `Ix/Aiur/Semantics/BytecodeFfi.lean`:
  `Bytecode.Toplevel.executeIxVM` — mirror of `execute`.
* `Ix/Aiur/Protocol.lean`: `AiurSystem.proveIxVM`.
* `Ix/Cli/CheckCmd.lean`: `runCompiled` now dispatches through
  `executeIxVM`. The Rust bytecode interpreter is no longer
  reachable from `ix check` (the Lean-side `--interp` fallback
  stays for richer error diagnostics).
* `Ix/Cli/ProveCmd.lean`: `proveOne` uses `proveIxVM`.
* `Cargo.toml`: `stacker = "0.1"`.

For witnesses where execute is the dominant work (single-const
checks via `ix check 'X'`), the codegen'd kernel is ~1.6× faster
than the bytecode interpreter end-to-end (measured on
`Nat.add_comm`, `Vector.append`,
`Std.Time.Week.Offset.ofMilliseconds`). Tiny consts are within
~10% (stacker probe overhead amortises). Peak RSS is within a few
MB of the interpreter — stacker grows the stack on demand, no
giant upfront reservation.

For shard-driven `ix check --shard K` runs the speedup is
invisible: per-shard wall time is dominated (~92%) by
`buildShardCheckEnvWitness` in Lean, NOT by the kernel execute
(~8%). Cutting witness construction is the next lever.

`Nat.add_comm` proof produced via `proveIxVM` verifies under the
existing `AiurSystem::verify`. End-to-end:

  lake exe ix prove 'Nat.add_comm'    →  proof addr 0d0ab0f9…
  lake exe ix verify  0d0ab0f9…       →  ok
Replaces `IxVM.ClaimHarness.buildShardCheckEnvWitness` (Lean side,
~92% of shard wall time on heavy partitions) with a Rust port that
builds the `aiur::execute::IOBuffer` directly, without per-byte
boxing into Lean `Aiur.G` values. The two hot phases run on rayon:

* Closure walk: each owned addr's transitive `Constant.refs` +
  projection-block traversal runs on its own thread; results are
  deduped through a `dashmap::DashSet`.
* Byte-to-G conversion: per-const `(key, data)` tuples are built
  in parallel chunks of 256; final IOBuffer assembly (channel arena
  append + key→`IOKeyInfo` map insert) runs serially because the
  arena index is monotonic.

* `src/ix/aiur_ixvm_witness.rs` (new): `build_shard_check_env_witness`
  produces `(claim, claim_digest_input, io_buffer)` ready to feed
  to `execute_ixvm`. Mirrors the 6-channel layout documented in
  `Ix/IxVM/ClaimHarness.lean` (claim / asm tree / const bytes /
  Defn hint / blob discriminator / blob raw bytes).
* `src/ffi/aiur/protocol.rs`: new FFI `rs_aiur_toplevel_shard_check_ixvm`
  bundles witness build + `execute_ixvm` into one cross-language
  trip. Returns the same `(output, ioBuffer, queryCounts)` shape
  as `rs_aiur_toplevel_execute_ixvm`, so the Lean shim stays
  drop-in compatible.
* `Ix/Aiur/Semantics/BytecodeFfi.lean`:
  `Bytecode.Toplevel.shardCheckIxVM` wraps the FFI.
* `Ix/Cli/CheckCmd.lean`: shard-mode `ix check` (`--ixe + --ixes`)
  now dispatches through `runShardOwnedNative` →
  `shardCheckIxVM`, bypassing the Lean witness builder. Single-
  shard and whole-partition paths share the fast route; the
  legacy `runShardCheckManifest` / `runShardCheckAll` callbacks
  stay live for `--interp` only.

Shard 26 of the 64-way `init.ixes` partition, on the same host:

| Variant                                  |   total | speedup |
| ---                                      |     --: |     --: |
| Lean witness + bytecode-interp (baseline)|  1029 s |    1.0× |
| Lean witness + codegen kernel            |  1015 s |   1.01× |
| Rust witness (serial) + codegen          |  101  s |   9.95× |
| Rust witness (parallel) + codegen        |   80  s |   12.9× |

The serial Rust witness collapses the 935 s of Lean per-byte boxing
into ~23 s; rayon hides that 23 s under the codegen-kernel execute
(~78 s) so witness build effectively becomes free at the shard
level.

FFT cost matches the bytecode interpreter exactly
(107_006_963_281), so the QueryRecord layout is bit-identical
across all four variants.

* Closure walk uses single-source BFS with the global visited set
  shared via `DashSet::contains` checks before pushing to the
  per-thread stack — avoids redundant work without a per-owned
  union step.
* Per-channel ordering: each chunk's partial `ChannelEntries`
  feeds the serial fold in iteration order, so channel arena
  contents match the Lean side's exactly. Test still relies on
  trace-hash parity from the earlier codegen commit, which
  exercised the same path through the bytecode interpreter and
  the codegen kernel.
* Coverage check is currently skipped on the IxVM-native
  whole-partition path (`runShardManifestAllNative`); legacy
  `runShardCheckAll` still does it for `--interp`. Re-enable
  separately if needed.
plumbing, unchecked cache-hit array copy

Three targeted cuts to the generated kernel (`src/ix/aiur_ixvm.rs`);
all preserve QueryRecord parity (shard 26 FFT cost unchanged at
107_006_963_281).

Every `U8*` op previously emitted a `Vec<G>` scratch (~245 sites):

  let __b2_out: [G; 1] = {
    let mut __scratch: Vec<G> = vec![__v_i, __v_j];
    if unconstrained { __scratch.extend(vec![Bytes2::xor(...)]); }
    else { bytes2_execute(0, 1, &Bytes2Op::Xor, &mut __scratch, record); }
    let __arr: [G; 1] = __scratch[2..].try_into().unwrap();
    __arr
  };

Net cost: a 2-element `vec![]` alloc + a `__scratch.extend(vec![..])`
(another small alloc inside `Bytes2::execute`'s returned Vec) + a
slice→array `try_into().unwrap()`.

Added per-op `bytes{1,2}_*_value(args..., record) -> G | (G,G) | (G,G,G) |
[G; 8]` in `src/aiur/execute.rs` — each bumps the corresponding
`bytes{1,2}_queries.bump_*` and returns the gadget output by value.
The pure (`Bytes{1,2}::*`) helpers stay the unconstrained shortcut.
Codegen now emits:

  let __v_n: G = if unconstrained { Bytes2::xor(&__v_i, &__v_j) }
                else              { bytes2_xor_value(__v_i, __v_j, record) };

Zero `Vec<G>` allocation per byte op. `U8Add`/`U8Sub` also gain
because `bytes2_add_value` runs the `Bytes2::add` gadget once
(returning the full `(low, carry)`) instead of the interpreter's
two `Bytes2::add` calls (one for the carry, one for the low push).

To make `bump_*` callable from `execute.rs`, all `Bytes1Queries` /
`Bytes2Queries::bump_*` methods are now `pub(crate)`. No behaviour
change.

The `Op::Call` cache-check emitted `unconstrained || OP_UN` and
`!unconstrained && !OP_UN` even when the static `OP_UN` was `false`,
which is the common case. Codegen now emits the folded shape
directly:

  let __cu = unconstrained;             // was: unconstrained || false
  if !unconstrained { *result.multiplicity += G::ONE; }
                                        // was: !unconstrained && !false

When `OP_UN == true`, the callee always runs unconstrained AND the
multiplicity bump is always suppressed; codegen folds to `let __cu =
true;` and elides the bump branch entirely.

LLVM was already folding these, so this is mostly source-cleanliness
and a small frontend win.

The cache-hit branch read `result.output` (an `&[G]`) and converted
it back to `[G; OUT_N]` via `.try_into().unwrap()`. The length is
statically `OUT_N` — we control the producer (the matching
`aiur_fn_{callee}::Ctrl::Return` inserts an `[G; OUT_N]`-typed array
into the same slot). The runtime bounds-check + Result discharge is
dead work.

Codegen now emits:

  let __ret: [G; OUT_N] = unsafe {
    *(result.output.as_ptr() as *const [G; OUT_N])
  };

Sound: same-fn slot, fixed `OUT_N`, no aliasing.

| Variant                                  | mean |
| ---                                      |  --: |
| Rust witness (parallel) + codegen        | 80 s |
| + value-helper byte ops (#1)             | 64 s |
| + folded `__cu` + unsafe cache copy (#2+#3) | 62 s |

The bulk of the win is #1 (#1 alone: ~−18%). #2+#3 are a further
~3% — mostly noise / Rust-frontend cleanup since LLVM already folds
the constant `false` ops.

* `src/aiur/execute.rs`: 12 value-returning byte helpers added
  (`bytes1_bit_decompose_value`, `bytes1_shift_left_value`,
  `bytes1_shift_right_value`, `bytes2_{xor,and,or,less_than,mul,
  chain_rotr7,chain_rotr4,add,sub}_value`).
* `src/aiur/gadgets/bytes1.rs`, `src/aiur/gadgets/bytes2.rs`: all
  `bump_*` upgraded to `pub(crate)` so the value helpers can call
  them.
* `Ix/Aiur/Stages/Codegen.lean`: `emitU8Bytes1` / `emitU8Bytes2` /
  `emitU8Add` / `emitU8Sub` rewritten to call the per-op value
  helpers — no scratch Vec, no slice→array conversion. `emitCall`
  constant-folds `opUn = false` and emits the unsafe cache-hit
  copy. Prelude imports updated to bring the new helpers into
  scope.
* `src/ix/aiur_ixvm.rs`: regenerated. 245 `Vec<G>` scratch sites
  → 11 (only `unconstrainedBigUintDivMod`'s scratch left); 3324
  `unconstrained || false` → 0; 3000+ `result.output.try_into()
  .unwrap()` → 0.
* `Ix/Cli/CheckCmd.lean`: incidental cleanup of probe-only timing
  prints in `runShardOwnedNative` (added during measurement, no
  longer needed).
Per-claim mode (a `Claim.check addr none`) builds a closure rooted at
the target address. That closure can be the whole environment when
the target is a heavily-shared root constant. The Lean witness
builder paid per-byte boxing into `Aiur.G` for every byte in that
closure — the same cost we already eliminated for shard mode.

# Surface

Five new FFIs, all bundling witness build + execute_ixvm (and STARK
prove where applicable) into one cross-language trip so the
`IOBuffer` never crosses the boundary mid-pipeline:

* `rs_aiur_toplevel_check_addr_ixvm` —
  `Bytecode.Toplevel.checkAddrIxVM (toplevel) (funIdx) (ixePath) (addrBytes)`:
  per-claim check; builds witness for `Claim.check addr none` via
  `build_claim_check_witness`, runs `execute_ixvm`. Same return
  shape as `shardCheckIxVM`.
* `rs_aiur_toplevel_check_env_bytes_ixvm` — same as above but takes
  the env as a serialized byte blob (`Ixon.serEnv`) instead of a
  `.ixe` path. Used by the compiled-Lean-env code path (`ix check
  NAME` without `--ixe`), where the env is built in Lean memory.
* `rs_aiur_system_prove_addr_ixvm` —
  `AiurSystem.proveAddrIxVM (...)`: per-claim prove
  (witness + execute + STARK prove all in Rust).
* `rs_aiur_system_prove_env_bytes_ixvm` — bytes-blob counterpart.
* `rs_aiur_system_shard_prove_ixvm` —
  `AiurSystem.shardProveIxVM (...)`: per-shard prove end-to-end in
  Rust.

`build_claim_check_witness` lives next to
`build_shard_check_env_witness` in `src/ix/aiur_ixvm_witness.rs` and
uses the same parallel closure walk + parallel byte→G conversion.

The bytes-blob FFIs additionally harvest `anon_hints` from each
`Def` named entry after `Env::get` — `Env::get` (full form, used by
the bytes-blob path) doesn't populate `env.anon_hints` the way
`Env::get_anon` does, but the kernel's `verify_claim` reads ch 3
(Defn reducibility hints) so the hints must end up in the env. This
mirrors the harvest pass already inside `get_anon` at
`src/ix/ixon/serialize.rs:1683`.

# Wiring

`Ix.Cli.CheckCmd.WitnessSource`: a three-arm discriminated union
threaded through `forEachClaim` (the shared check/prove driver):

* `.native ixePath addr` — `.ixe`-backed env, Rust mmap.
* `.nativeBytes envBytes addr` — Lean-memory env serialized to
  bytes, Rust decodes via `Env::get`.
* `.lean witness` — pre-built `ClaimWitness` (fallback).

`runCompiled` and `proveOne` dispatch on the source.

# Coverage

| Command                               | Path                                          |
|---------------------------------------|-----------------------------------------------|
| `ix check NAME --ixe`                 | **`checkAddrIxVM`** (new)                     |
| `ix check NAME` (no `--ixe`)          | **`checkEnvBytesIxVM`** (new)                 |
| `ix check --claim hex --ixe`          | `checkAddrIxVM` for `check addr none`         |
| `ix check --ixes --shard K`           | `shardCheckIxVM` (existed)                    |
| `ix check --ixes`                     | `shardCheckIxVM` (existed)                    |
| `ix prove NAME --ixe`                 | **`proveAddrIxVM`** (new)                     |
| `ix prove NAME` (no `--ixe`)          | **`proveEnvBytesIxVM`** (new)                 |
| `ix prove --ixes --shard K`           | **`shardProveIxVM`** (new)                    |
| `ix prove --ixes`                     | **`shardProveIxVM`** loop (new)               |
| `Benchmarks/Typecheck.lean` Phase 1   | `checkAddrIxVM` / `executeIxVM`               |
| `Benchmarks/Typecheck.lean` Phase 2   | `proveAddrIxVM` / `proveIxVM`                 |
| `Benchmarks/IxVM.lean`                | `proveIxVM` (was `prove`)                     |

`--interp` route is preserved: it materialises any `WitnessSource`
back into a `ClaimWitness` before driving the Aiur source
interpreter. `--claim <hex>` over non-`check addr none` variants
(`eval` / `reveal` / `contains` / `checkEnv-with-asm`) still uses
the Lean witness builder.

# Sanity

* `ix check Nat.add_comm --ixe init.ixe` (warm): 3.1 s wall,
  FFT = 23_603_449.
* `ix check Nat.add_comm` (compiled env via bytes blob,
  warm): 2.4 s wall, FFT = 23_603_449.

# Known overheads (tracked for a follow-up redesign)

* **Per-claim env re-parse on `--ixe + many names`**: each FFI call
  re-runs `Env::get_anon_mmap` on the same path. Mathlib-scale
  iteration pays the lazy-index build N times instead of once.
* **`--interp + .nativeBytes` round-trip**: Lean serializes the
  compiled env then `materialise` deserializes it back, just to
  feed `runInterp`.
* **`runShardProveNative` claim reconstruct**: redoes the closure
  walk + canonical `AssumptionTree` build Lean-side after
  `shardProveIxVM` already computed the same claim internally.

A future `EnvHandle`-based API would close all three: build the
env once into a Rust-owned handle, pass the handle to every
per-claim/per-shard FFI, and have prove FFIs return the
serialized claim bytes.
Before: every per-claim and per-shard FFI re-parsed the Ixon env
(`Env::get_anon_mmap` per call). On `--ixe + many names` /
all-shards prove, this is the dominant overhead for iteration-heavy
workflows — O(num_consts) lazy-index build × N targets.

After: the env lives once per CLI invocation in a Rust-owned
`EnvHandle`. Lean holds an opaque `Aiur.EnvHandle` reference and
threads `@& EnvHandle` through every per-target FFI call. The env
is parsed exactly once at handle construction; downstream calls
share it.

# Surface

Five new FFIs collapse the previous six per-call variants
(`checkAddrIxVM`, `checkEnvBytesIxVM`, `shardCheckIxVM`,
`proveAddrIxVM`, `proveEnvBytesIxVM`, `shardProveIxVM` — all
deleted):

* `rs_aiur_env_handle_from_ixe(path) → LeanExternal<EnvHandle>`:
  mmap-load via `Env::get_anon_mmap`. Anon parser already harvests
  `anon_hints`; no post-pass.
* `rs_aiur_env_handle_from_bytes(blob) → LeanExternal<EnvHandle>`:
  decode `Ixon.serEnv`-shape blob via `Env::get` + harvest
  `anon_hints` from each `Def` named entry. Used by the
  compiled-Lean-env path.
* `rs_aiur_toplevel_check_addr_with_env(toplevel, fun_idx, handle,
  addr_bytes)`: per-claim check. Reuses handle's parsed env.
* `rs_aiur_toplevel_shard_check_with_env(toplevel, fun_idx,
  handle, owned_blob)`: per-shard check.
* `rs_aiur_system_prove_addr_with_env(system, fri, fun_idx,
  handle, addr_bytes) → (claim_bytes, proof, ioBuffer)`:
  per-claim prove. Rust serializes the reconstructed `Ix.Claim`
  via `ixon::Claim::put` so Lean can deserialize via
  `Ixon.runGet Ix.Claim.get` directly — no closure walk +
  canonical `AssumptionTree` recomputation Lean-side.
* `rs_aiur_system_shard_prove_with_env(system, fri, fun_idx,
  handle, owned_blob) → (claim_bytes, proof, ioBuffer)`:
  per-shard prove.

# Lean

* `Aiur.EnvHandle` opaque type with `fromIxe` / `fromBytes`.
* `Aiur.Bytecode.Toplevel.checkAddrWithEnv` / `shardCheckWithEnv`,
  `Aiur.AiurSystem.proveAddrWithEnv` / `shardProveWithEnv`.
* `Ix.Cli.CheckCmd.Target` replaces `WitnessSource`:

```lean
inductive Target where
  | addr  (a : Address)            -- Claim.check addr none
  | shard (owned : Array Address)  -- Claim.checkEnv
  | leanW (w : ClaimWitness)       -- --interp, --claim non-check
```

  `runCompiled` / `proveOne` take `(envHandle?, target)`. The
  envHandle is `none` only for `.leanW` (`--interp` legacy path).

* `runShardOwnedNative` and the manifest-driver helpers take the
  envHandle as a parameter. Single-shard mode builds it once;
  all-shards mode reuses the same handle across every shard's
  FFI call (eliminates per-shard re-mmap).

* `runShardProveNative` deserializes the wire claim bytes via
  `Ixon.runGet Ix.Claim.get` instead of re-running
  `shardCheckEnvClaim`.

# Benchmarks

`Benchmarks/Typecheck.lean` builds the envHandle once before
Phase 1, reuses it across Phase 1 (execute) + Phase 2 (prove).
Both phases now go through `checkAddrWithEnv` / `proveAddrWithEnv`
on the full-closure path. `--subject-only` still uses Lean
`buildVerifyConst` + `executeIxVM` / `proveIxVM` (witnesses
intentionally small there).

# New crate

`crates/ix/src/env_handle.rs` — the `EnvHandle` struct +
constructors live in the existing `ix` crate next to the witness
builder and codegen runner.

# Sanity

* Multi-target check (warm): `ix check --ixe init.ixe Nat.add_comm Nat.add`
  → 3.3 s wall, single envHandle shared across both targets.
* Shard 26 check (warm): `ix check --ixe init.ixe --ixes init.ixes --shard 26`
  → ~78 s wall, FFT = 107_006_963_281 (parity preserved).

# Coverage

| Command                              | Path                              |
|--------------------------------------|-----------------------------------|
| `ix check NAME --ixe`                | `checkAddrWithEnv`                |
| `ix check NAME` (no `--ixe`)         | `checkAddrWithEnv` + fromBytes    |
| `ix check --claim hex --ixe`         | `checkAddrWithEnv` for check-none |
| `ix check --ixes --shard K`          | `shardCheckWithEnv`               |
| `ix check --ixes`                    | `shardCheckWithEnv` × all shards  |
| `ix prove NAME --ixe`                | `proveAddrWithEnv`                |
| `ix prove --ixes --shard K`          | `shardProveWithEnv`               |
| `ix prove --ixes`                    | `shardProveWithEnv` × all shards  |
| `Benchmarks/Typecheck` Phase 1+2     | `checkAddrWithEnv` / `proveAddrWithEnv` |

Only `--interp` and `--claim hex` over a non-`check addr none`
persisted claim still build a Lean `ClaimWitness`; both go via
the `.leanW` target arm.
`ix codegen --check` compares the emitted Rust source against the
on-disk `crates/ix/src/aiur_ixvm.rs` and exits 0 if identical, 1
otherwise. No write side effect. ~2 s warm; fast enough to gate
on every PR.

Wired into the `lean-test` CI job so a forgotten regen on a
kernel-touching PR fails CI instead of merging stale generated
code that drifts from the Bytecode → Rust emitter.
Two interpreter modes behind a single `--interp` flag:

* `--interp source`: Aiur source interpreter (`Aiur.runFunction` over
  the source-level `Decls`). Richer per-step error diagnostics.
* `--interp bytecode`: generic Aiur bytecode interpreter
  (`Bytecode.Toplevel.execute`, `rs_aiur_toplevel_execute` route).
  Skips the `ix codegen` + `cargo build --release` cycle needed
  after editing `Ix/IxVM/*.lean` — the bytecode is rebuilt Lean-side
  at exe load. Slower per-check than the codegen kernel; ideal for
  tight iteration on the IxVM source.
* omit the flag entirely for the native codegen kernel (default).

Invalid values (e.g. `--interp foo`) fail fast with a clear error.

# Plumbing

The `check_addr_with_env` and `shard_check_with_env` FFIs take a
`use_bytecode: bool` and dispatch via a shared `dispatch_execute`
helper. `--interp bytecode` sets it to `true`. The `.leanW` arm of
`runCompiled` also picks between `Bytecode.Toplevel.execute` and
`executeIxVM` based on the same flag.

The `--interp source` path requires a Lean `ClaimWitness`. The
existing driver had switched to `.addr`/`.shard` targets against
the Rust-owned `EnvHandle`; source-interp had no way to materialise
those. `forEachClaim` now takes a `forceLeanWitness : Bool` — when
true, it builds a Lean witness for every target (via `mkWitness` /
`loadIxonEnv` for the compiled-Lean-env path) and passes `.leanW`
so `runInterp` can consume it directly.

# Sanity

* `ix check Std.Time.Week.Offset.ofMilliseconds` (warm): 9.3 s wall,
  FFT = 12_430_516_949 (codegen kernel).
* `ix check --interp bytecode Std.Time.Week.Offset.ofMilliseconds`
  (warm): 14.6 s wall, FFT = 12_430_516_949 (bytecode interp).
* `ix check --interp source Eq` (warm): 6.3 s wall, output `Eq: ()`.
* `ix check --interp garbage Nat.add_comm`: exits 1 with clear
  error.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant