Skip to content

IxVM ingress: blob-ref index, sidecar→rbtree, IOBuffer interface#457

Draft
arthurpaulino wants to merge 4 commits into
mainfrom
ap/ingress-refactor
Draft

IxVM ingress: blob-ref index, sidecar→rbtree, IOBuffer interface#457
arthurpaulino wants to merge 4 commits into
mainfrom
ap/ingress-refactor

Conversation

@arthurpaulino

Copy link
Copy Markdown
Member

Three commits on top of main that cut ingress FFT cost on heavy shards and land a documented IxVM IOBuffer interface.

Measurement methodology — read first

Shard 51 with unstubbed blake3 currently OOMs at all practical ulimit settings on the dev machine. To get past memory exhaustion and produce a comparative measurement, the kernel's blake3 verification paths in load_verified_constant / load_verified_blob / load_verified_claim were stubbed to no-ops for profiling. The stubs are not part of any committed code in this PR — they were applied locally during measurement and reverted.

What this means for the numbers below:

  • The delta between two stubbed runs is real for the non-blake3 portion of the trace. Both endpoints elide the same verification work, so the comparison isolates the effect of the kernel changes in this PR.
  • The absolute stubbed cost (e.g. "93.94G FFT") is not the production shard-51 typecheck cost. Production includes blake3 work (~16–20% of the trace per earlier profiles) that the stubs omit.

Every shard-51 number below is a "blake3-stubbed trace" measurement. Per-const numbers from lake test -- --ignored ixvm (no stubs) are unaffected and reflect real production cost.

Headline

Metric Before After Δ
Shard 51 (blake3-stubbed trace) 119_152_044_062 93_939_967_063 −21.1% on the non-blake3 portion
Vector.append (production, unstubbed) 2_661_244_845 2_607_925_682 −2.00%
Array.append_assoc (production) 2_588_157_538 2_537_478_644 −1.96%
_private...extractMainModule._unsafe_rec (production) 1_091_354_031 1_064_762_809 −2.44%

Per-const wins are end-to-end on the real (unstubbed) kernel; the shard-51 delta is real on the non-blake3 portion of the trace.

Commits

1. 120001d — IxVM kernel: augment addr_pos_map with blob-ref sentinels

lookup_addr_pos's rbtree only contained const addresses. Every blob ref (literal-blob payload pointers in Constant.refs) probed the map, missed, and fell through to the O(N) linear scan over all_addrs which also returned 0. Wasted work proportional to (blob refs × shard closure size).

This commit adds a third value class to the same rbtree: a sentinel 4294967295 (beyond any honest pos+1), inserted by walking each const's refs once via augment_with_blob_refs. lookup_addr_pos becomes a 3-way match: 0 → fall back to linear scan (now only fires under the de-intern soundness corner-case, ~never in practice); SENTINEL → return 0 directly (known blob); pos+1 → return pos.

Shard 51 (blake3-stubbed): 119_152_044_062 → 102_190_540_489 — −14.2% on the non-blake3 portion.

address_eq row count drops from ~10.3M to ~3.4M (−67%); lookup_addr_pos_linear falls out of the top-25 cost table. The augment-walk's added rbtree probes (~430M FFT in the stubbed trace) are paid back many times over by the ~17B FFT saved on lookups.

Soundness: every probe still relies on the positive direction ptr_val equality ⇒ content equality (Aiur Store content-addressing invariant). The linear fallback stays for the (still theoretical) malicious de-intern, where it uses content-based address_eq.

2. 9acbd58 — IxVM ingress: plumbing helpers + sidecar→rbtree migrations + ptr-passing

A 5-reviewer synthesis of ingress accidental complexity drove this consolidated refactor. Three families of change.

Plumbing helpers:

  • verify_bytes_against(bytes, expected) + bytes_to_addr(bytes) + blake3_flat(input) centralise the 24-line [h[i][j]] digest reshape that was hand-unrolled at 7 sites. ~100 lines of boilerplate deleted.
  • Channel I/O helpers factor the io_get_info + #read_byte_stream boilerplate at 6 sites. load_verified_blob no longer double-load(addr)s.
  • sentinel_blob_ref() named const replaces one bare 4294967295.

Sidecar lists → rbtree:

  • lookup_canon_addr: O(N) parallel-list walk → O(log N) rbtree probe via new build_canon_addr_map. canon_addrs: List<Addr> arg replaced by canon_addr_map: &RBTreeMap<Addr> (one threaded ptr).
  • lookup_block_start: O(N) parallel-list walk → O(log N) rbtree probe via new build_block_start_map. block_addrs / block_starts args replaced by block_start_map: &RBTreeMap<G> across ~12 fn signatures.
  • build_aux_recr_ctor_idxs returns (idxs, block_addr) so the standalone-Recr caller doesn't redo the rec_typ_to_inductive_addr + load_verified_constant chain to recover block_addr.
  • build_ref_idxs_and_blobs fuses the two separate walks (build_ref_idxs_mapped + build_lit_blobs) into one — single rbtree probe per ref produces both ref_idxs and lit_blobs. 5 caller sites simplified.

Pointer-passing (per Aiur's inputSize→width cost model):

  • convert_univ(u: &Univ): per-row input width drops from 5-variant union to one G column. Caller convert_univ_idxs already produces the &Univ via list_lookup(univs, idx) — no extra store.
  • convert_one(input: &ConvertInput): caller convert_all already has &input from the ListNode.Cons destructure.

Dead code deleted: address_in_list, apply_ctor_overrides, lookup_override, build_lit_blobs, build_ref_idxs_mapped, is_blob, ctx_convert_expr.

Shard 51 (blake3-stubbed): 102_190_540_489 → 93_939_967_063 — −8.07% on the non-blake3 portion on top of 120001d.

Per-const wins on heavy production targets (unstubbed lake test):

Constant Before After Δ
Nat.add_comm 54_369_745 54_049_773 −0.59%
Nat.sub_le_of_le_add 515_331_420 510_843_459 −0.87%
Nat.decLe 191_471_719 189_723_325 −0.91%
Array.append_assoc 2_567_087_893 2_537_360_311 −1.16%
Vector.append 2_639_286_078 2_607_800_745 −1.19%
IxVMInd.Even.rec 31_659_493 31_434_525 −0.71%
IxVMInd.Odd.rec 31_658_598 31_433_622 −0.71%
String.Internal.append 718_803_075 708_296_270 −1.46%
_private...extractMainModule._unsafe_rec 1_081_617_705 1_064_689_765 −1.57%

lake test -- --ignored ixvm green; 42 pins re-pinned.

3. 9c2eb23 — IxVM: per-kind IOBuffer channels with documented interface

The pre-existing IOBuffer multiplexed claim bytes, assumption tree bytes, constant wire bytes, and a per-blob-addr empty-marker on a single channel 0, distinguished only by blake3 content-hash collision-impossibility. A new contributor reading load_payload_const(key) couldn't tell from the helper whether they were getting a const, a claim, or a tree — only the caller knew the intent.

This commit splits channel 0 into per-kind channels and lands a documented IxVM IOBuffer interface block (mirrored on both Aiur and Lean sides).

Tier Channel Purpose Key Value
Ctrl 0 claim wire bytes blake3(claim_bytes) claim bytes
Ctrl 1 assumption tree bytes tree.root tree bytes
Const 2 constant wire bytes const addr const bytes
Const 3 Defn reducibility hint Defn addr single G
Blob 4 blob discriminator addr one byte (1=const, 0=blob)
Blob 5 blob raw bytes blob addr raw bytes

Tier 1 fires once per verify_claim. Tier 2 fires per constant traversed during load_with_deps. Tier 3 fires per blob ref encountered during build_ref_idxs_and_blobs.

The blob discriminator (ch 4) replaces the previous io_get_info(0, addr).len == 0 ⟹ blob hack inside load_with_deps with an explicit per-addr one-byte payload. No more len-as-meaning overload.

Every per-channel io_get_info + #read_byte_stream pair is inlined at the single consumer of each channel — no fn-call row gets added for the reorganization. Channel reorganization is cost-neutral by design, confirmed by measurement: per-const pin shifts vs 9acbd58 land within ±0.010% (noise-tier; the ch-4 discriminator's one read_byte trades against the eliminated io_get_info(0, addr).len check).

The win here is review-time clarity, not FFT — io_get_info(channel, key) now has one value shape per channel and a doc block both kernel and harness reference.

Soundness: ch 0/1/2/5 byte streams are blake3-verified by the kernel against their content-addressed keys. ch 3 hint is semantically optional (controls WHNF reduction heuristic only; def-eq is sound either way). ch 4 discriminator is sound by erasure-correctness — a lying byte flips the const/blob decision and the wrong-path load downstream fails (a "const" blob triggers a ch 2 read that returns empty → blake3 verify against the non-empty addr fails; a "blob" const dangles references → typecheck fail).

Cumulative measurement (vs main)

Constant Before After Δ
Nat.add_comm 54_504_714 (approx.) 54_047_004 ~−0.84%
Nat.decLe 191_354_793 189_722_204 −0.85%
Nat.sub_le_of_le_add 515_056_158 510_870_092 −0.81%
Array.append_assoc 2_588_157_538 2_537_478_644 −1.96%
Vector.append 2_661_244_845 2_607_925_682 −2.00%
String.Internal.append 725_415_461 708_332_683 −2.36%
_private...extractMainModule._unsafe_rec 1_091_354_031 1_064_762_809 −2.44%
Shard 51 (blake3-stubbed trace, non-blake3 portion) 119_152_044_062 93_939_967_063 −21.1%

Test plan

  • lake build clean
  • lake test -- --ignored ixvm green at every commit
  • Shard 51 comparative measurement with blake3 stubs applied locally to both endpoints (stubs reverted; not part of any committed code in this PR)

Replace `klimbs_mul(base, klimbs_pow(base, klimbs_dec(exp)))`'s O(exp)
per-step memo growth with a binary-exponentiation recursion of depth
O(log exp). Both `exp / 2` and `exp % 2` route through the native
`unconstrained_big_uint_div_mod` op, so each level pays O(1) for the
division — `log2(exp)` memo entries for `klimbs_pow` itself + the
`klimbs_mul(base, base)` doublings.

`lake test -- --ignored ixvm` on `IxVMPrim.nat_pow_big`
(`(2 ^ 16384 : Nat) - 2 ^ 16384 = 0`, measured against the current
`ap/utf8-mod-witness` baseline):
3_941_276_885 → 71_856_587 FFT (-98.2%).

Tests/Ix/IxVM.lean:
* new `IxVMPrim.nat_pow_big` synthetic — exercises `klimbs_pow` at a
  non-trivial exponent (`2 ^ 16384`) so future regressions in the body
  shift its FFT pin.
* small pin shifts in `nat_shl_lit`, `nat_shr_lit`, `nat_div_lit`,
  `nat_mod_lit`, `nat_gcd_lit`, `bv_to_nat_lit` — all reduce via
  `klimbs_pow` / `klimbs_div_mod` and absorb the new cost shape.
`lookup_addr_pos`'s rbtree (`addr_pos_map`) only contained const
addresses. Every blob ref (literal-blob payload pointers in
`Constant.refs`) probed the map, missed, and fell through to the O(N)
linear scan over `all_addrs` which also returned 0. Wasted work
proportional to (blob refs × shard closure size).

Add a third value class to the same rbtree: the sentinel `4294967295`
(beyond any honest `pos+1`), inserted by walking each const's `refs`
once via `augment_with_blob_refs`. Any ref not already mapped to a
const-position gets the sentinel.

`lookup_addr_pos` becomes a 3-way match: `0` → fall back to linear
scan (now only fires under the de-intern soundness corner-case, ~never
in practice); `SENTINEL` → return 0 directly (known blob); `pos+1` →
return `pos`. `is_blob` mirrors the same match.

Soundness preserved: every probe still relies on the positive direction
`ptr_val` equality ⇒ content equality (Aiur Store's content-addressing
invariant). The linear fallback stays put for the (still theoretical)
malicious de-intern, where it uses content-based `address_eq`.

`lake exe ix check --ixe init.ixe --ixes init.ixes --shard 51`
(`ulimit -v 23000000`, with the kernel's blake3 paths in
`load_verified_constant` / `load_verified_blob` / `load_verified_claim`
stubbed for profiling):
119_152_044_062 → 102_190_540_489 FFT (-14.2%).

Tests/Ix/IxVM.lean: 42 pin shifts absorbing the per-ingress
augment-walk overhead.
A consolidated refactor of the ingress pipeline guided by a 5-reviewer
synthesis of accidental complexity sites. Three families of change:

PLUMBING

* `Blake3.lean`: `verify_bytes_against(bytes, expected)` + `bytes_to_addr(bytes)`
  + `blake3_flat(input)` centralise the 24-line `[h[i][j]]` digest reshape
  that was hand-unrolled at 7 sites. ~100 lines of boilerplate deleted.
* `Ingress.lean`: `load_payload_const`/`blob`/`hint` + `ch_const`/`blob`/`hint`
  factor the `io_get_info` + `#read_byte_stream` boilerplate at 6 sites.
  `load_verified_blob` no longer double-`load(addr)`s.
* `sentinel_blob_ref()` named const replaces one bare `4294967295`.

SIDECAR LISTS → RBTREE

* `lookup_canon_addr`: O(N) parallel-list walk → O(log N) rbtree probe
  via new `build_canon_addr_map`. `canon_addrs: List<Addr>` arg replaced
  by `canon_addr_map: &RBTreeMap<Addr>` (one threaded ptr).
* `lookup_block_start`: O(N) parallel-list walk → O(log N) rbtree probe
  via new `build_block_start_map`. `block_addrs` / `block_starts` args
  replaced by `block_start_map: &RBTreeMap<G>` across ~12 fn signatures.
* `build_aux_recr_ctor_idxs` returns `(idxs, block_addr)` so the
  standalone-Recr caller in `build_convert_inputs_walk` doesn't redo the
  `rec_typ_to_inductive_addr` + `load_verified_constant` chain to recover
  `block_addr` — `build_aux_recr_ctor_idxs` already computed it internally.
* `build_ref_idxs_and_blobs` fuses the two separate walks
  (`build_ref_idxs_mapped` + `build_lit_blobs`) into one — single rbtree
  probe per ref produces both `ref_idxs` and `lit_blobs`. 5 caller sites
  simplified.

PTR-PASSING (per `reference_aiur_pass_pointer_not_value`)

* `convert_univ(u: &Univ)`: per-row input width drops from 5-variant
  union to one G column. Caller `convert_univ_idxs` already produces the
  `&Univ` via `list_lookup(univs, idx)` — no extra `store`.
* `convert_one(input: &ConvertInput)`: caller `convert_all` already has
  `&input` from the `ListNode.Cons` destructure.

DEAD CODE DELETED

`address_in_list`, `apply_ctor_overrides`, `lookup_override`, `build_lit_blobs`,
`build_ref_idxs_mapped`, `is_blob`, `ctx_convert_expr` — all unused after
the migrations above.

TRIED + REVERTED (left as notes in code where instructive):

* Fused `build_addr_pos_map` + `augment_with_blob_refs` single-pass.
  Lost to Aiur's per-row width tax — two narrower passes cheaper than
  one merged wide pass.
* `convert_expr(ctx: &ConvertCtx)` ptr. Per-arm `load(ctx)` cost > raw-arg
  threading.
* `compute_layout_walk` seen-mptrs/seen-poses → rbtree. Small regression
  on typical shards (few enough Muts blocks that O(N) list scan is
  already cheap; rbtree value column tax > savings).
* `block_members_map` to skip per-projection `load_verified_constant`.
  Lost because Aiur memoises `load_verified_constant` automatically — the
  new rbtree probe's width tax exceeded the savings from the eliminated
  match cascade.
* `ctx` everywhere as `&ConvertCtx` (with `store(...)` at 5 construction
  sites). The added `store` rows offset the per-row savings — wash.

`lake test -- --ignored ixvm`: 42 pin shifts, all measurable wins vs the
`ap/blob-ref-rbtree-augment` tip. Cumulative on heavy consts:
* `Nat.add_comm`:              54_369_745 → 54_049_773 FFT (-0.59%)
* `Nat.sub_le_of_le_add`:     515_331_420 → 510_843_459 FFT (-0.87%)
* `Array.append_assoc`:     2_567_087_893 → 2_537_360_311 FFT (-1.16%)
* `Vector.append`:          2_639_286_078 → 2_607_800_745 FFT (-1.19%)
* `String.Internal.append`:   718_803_075 → 708_296_270 FFT (-1.46%)
* `IxVMInd.Even.rec`:          31_659_493 → 31_434_525 FFT (-0.71%)
* `_private....extractMainModule._unsafe_rec`: 1_081_617_705 → 1_064_689_765 FFT (-1.57%)
Replaces channel 0's tagged-union (claim / tree / const bytes / empty-
blob-marker, distinguished only by key content-hash collision-
impossibility) with one channel per value kind. Channel reorganization
is cost-neutral — all six per-channel `io_get_info` + `#read_byte_stream`
pairs are inlined at the single consumer of each channel, so no fn-call
row gets added.

New layout, tiered by access pattern:

| Tier  | Channel | Purpose                  | Key                     | Value                |
|-------|---------|--------------------------|-------------------------|----------------------|
| Ctrl  | 0       | claim wire bytes         | `blake3(claim_bytes)`   | claim bytes          |
| Ctrl  | 1       | assumption tree bytes    | `tree.root`             | tree bytes           |
| Const | 2       | constant wire bytes      | const addr              | const bytes          |
| Const | 3       | Defn reducibility hint   | Defn addr               | single G             |
| Blob  | 4       | blob discriminator       | addr                    | one byte (1=const, 0=blob) |
| Blob  | 5       | blob raw bytes           | blob addr               | raw bytes            |

Tier 1 = one-per-`verify_claim`. Tier 2 = per-const. Tier 3 = per-blob.
`io_get_info(channel, key)` is now unambiguous by channel alone.

Aiur side (`Ingress.lean`, `Kernel/Claim.lean`):

* `load_verified_constant` inlines `io_get_info(2, raw)` +
  `#read_byte_stream(2, ...)` (was ch 0).
* `load_verified_blob` inlines ch 5 read (was ch 1).
* `load_constant_hint` inlines ch 3 read (was ch 2).
* `load_with_deps` inlines a ch 4 discriminator probe replacing the
  former `io_get_info(0, addr).len == 0 ⟹ blob` hack on ch 0. One
  read_byte for the per-addr discriminator byte.
* `load_verified_claim` inlines ch 0 (was using a shared `load_payload_const`
  helper on ch 0).
* `load_assumption_tree` inlines ch 1 (was using same `load_payload_const`
  helper on ch 0).
* Helper wrappers (`load_payload_claim/tree/const/hint/blob`,
  `load_discriminator`, `ch_*` constants) all removed — every channel
  has exactly one consumer site and the inlined dispatch costs no fn
  call.
* Top-of-file `IxVM IOBuffer interface` doc block documents the layout
  + soundness model.

Lean side (`ClaimHarness.lean`):

* `addEntries` writes per channel — consts → ch 2 + per-addr
  discriminator `[1]` on ch 4; blobs → ch 5 + per-addr discriminator
  `[0]` on ch 4; Defn hints → ch 3.
* `seedTreeAt` writes tree bytes to ch 1 (was ch 0).
* `buildClaimWitness` writes claim bytes to ch 0 (unchanged).
* Drops the empty-marker write on ch 0 for blob addrs — the ch 4
  discriminator covers blob/const classification explicitly.
* Top-of-section `IxVM IOBuffer interface` doc block mirrors the
  Aiur-side comment.

Soundness model unchanged. ch 0/1/2/5 byte streams are blake3-verified
by the kernel against their content-addressed keys. ch 3 hint is
semantically optional (controls WHNF reduction heuristic only; def-eq
is sound either way). ch 4 discriminator is sound by erasure-
correctness — a lying byte flips the const/blob decision and the
wrong-path load downstream fails (a "const" blob triggers a ch 2 read
that returns empty → blake3 verify against the non-empty addr fails;
a "blob" const dangles references → typecheck fail).

`lake test -- --ignored ixvm` green; 42 pins shifted within ±0.010%
(noise-tier) vs the pre-shuffle baseline — channel reorganization is
cost-neutral as required, the discriminator's one read_byte trades
against the eliminated `io_get_info(0, ...).len` check.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant