The MeG4 Thesis — Technical Deep Dive
MeG4 · technical thesis deep dive · july 2026 · confidential
MeG4 — how it actually works

Structure is derived, never designed. It survives only while it stays green.

MeG4 is a universal agentic harness — an engine that turns natural-language intent into verified work. Its single invariant: every structure the system produces or contains is derived from intent expressed in plain English, and is allowed to exist only while it passes an executable falsifier. This document explains the machinery behind that sentence: contracts, the layered architecture, the tiered model economy, roles as swappable adapters, the verification stack, and the loop by which the system improves itself under the same rules it imposes on its output.

Section 1

The thesis

Modern AI coding assistants compete on the model. MeG4's bet is that the durable advantage lives one level up, in the harness — the system that decides what the model is asked, how its work is checked, when a cheaper model suffices, and what happens when it fails. Recent literature has converged on the same conclusion: harness design decisions, not model choice, dominate end-to-end outcomes [1] [4].

The thesis has three commitments:

  • Falsification over trust. No output, no plan, no self-modification is accepted on plausibility. Everything is judged by an executable check that was written before the work and proven capable of failing.
  • Natural language is the only human layer. Humans express intent in English; the system derives and refines its own substrate (configs, prompts, structures, even proposals about itself) below that line.
  • Personal and local. The harness runs on hardware the owner controls, with open models, and it specializes per user and per role — it is a private engine, not a shared cloud service.

Section 2

The contract: specs with teeth

The unit of truth in MeG4 is the contract: a markdown note with structured frontmatter that couples a human-readable intent to one or more executable acceptance checks, collectively the falsifier. A spec without a falsifier is decoration; it gets violated silently. A contract cannot be violated silently: the moment the work drifts from the intent, its falsifier goes red.

# .meg4/contracts/exec.gate.md — a real contract's skeleton
name: exec.gate
description: the out-of-band gate that decides pass/fail
status: active # open = falsifier red, roadmap · active = gated green
acceptance:
  - name: gate-honest
    kind: exec
    check: cargo test -p meg4-exec --test falsify_gate -- --exact done_gate
    expect_exit: 0
ledger: # who authored, who implemented, who verified — append-only

Three disciplines make contracts trustworthy rather than theatrical:

  • Proven winnable, proven losable. Before a falsifier is accepted, it must pass against a known-good reference and fail against a known-broken mutation. A check that cannot go red is not a check.
  • Author ≠ arbiter. The agent that writes the falsifier must not be the agent that satisfies it — independence is recorded in the contract's ledger and audited.
  • Out-of-band gating. The gate runs outside the model's control loop. The model cannot mark its own homework.
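The first two disciplines above can be sketched as an admission test for a falsifier. This is a minimal illustration, not MeG4's actual code: `Verdict`, `AdmissionError`, and `admit_falsifier` are hypothetical names, and agent identity is reduced to a plain string.

```rust
/// Outcome of running a falsifier against some candidate work.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Green,
    Red,
}

/// Why a proposed falsifier was refused admission (hypothetical names).
#[derive(Debug, PartialEq, Eq)]
pub enum AdmissionError {
    /// The check fails even on the known-good reference: not winnable.
    NotWinnable,
    /// The check passes even on the known-broken mutation: not losable.
    NotLosable,
    /// The same agent wrote the check and the work it judges.
    AuthorIsArbiter,
}

/// Admit a falsifier only if it can go both green and red, and only if
/// its author differs from the implementer of the work it will judge.
pub fn admit_falsifier<T>(
    check: impl Fn(&T) -> Verdict,
    known_good: &T,
    known_broken: &T,
    author: &str,
    implementer: &str,
) -> Result<(), AdmissionError> {
    if author == implementer {
        return Err(AdmissionError::AuthorIsArbiter);
    }
    if check(known_good) != Verdict::Green {
        return Err(AdmissionError::NotWinnable);
    }
    if check(known_broken) != Verdict::Red {
        return Err(AdmissionError::NotLosable);
    }
    Ok(())
}
```

A check that always returns `Green` is rejected with `NotLosable`, which is the "cannot go red" case the text rules out.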

Agents communicate through the same discipline. Any tier, at any step, emits exactly one of three words — the trit:

trit ∈ { accept · reject · contract }
accept — the work stands · reject — it does not · contract — here is the commitment (id, intent, acceptance checks) under which I will act

The trit is a sum type enforced at compile time and constrained at decode time. It removes the ambiguity that lets agent systems drift: every hand-off is either a verdict or a falsifiable promise.
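As a rough sketch of what "a sum type enforced at compile time" could look like in Rust: the variant names follow the three words above, and the `Commitment` fields follow the (id, intent, acceptance checks) triple. `describe` and `parse_trit` are hypothetical helpers, and the decode-time constraint is reduced here to rejecting anything that is not one of the three words.

```rust
/// The payload of the `contract` word: id, intent, acceptance checks.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Commitment {
    pub id: String,
    pub intent: String,
    pub acceptance: Vec<String>,
}

/// Every hand-off is exactly one of these three variants.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Trit {
    Accept,
    Reject,
    Contract(Commitment),
}

/// Exhaustive match: adding a fourth variant would fail to compile here.
pub fn describe(t: &Trit) -> String {
    match t {
        Trit::Accept => "the work stands".to_string(),
        Trit::Reject => "the work does not stand".to_string(),
        Trit::Contract(c) => {
            format!("commitment {} with {} check(s)", c.id, c.acceptance.len())
        }
    }
}

/// Decode-time constraint, sketched: only the three words parse, and
/// only `contract` may (and must) carry a commitment.
pub fn parse_trit(word: &str, payload: Option<Commitment>) -> Result<Trit, String> {
    match (word, payload) {
        ("accept", None) => Ok(Trit::Accept),
        ("reject", None) => Ok(Trit::Reject),
        ("contract", Some(c)) => Ok(Trit::Contract(c)),
        (other, _) => Err(format!("not a valid trit hand-off: {other}")),
    }
}
```

A free-text reply such as "maybe" or "looks fine" has no representation in this type, which is the ambiguity the trit is meant to remove.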

The field caught up: the 2026 spec-driven-development wave and the SpecOps workshop series now argue specifications should be living, executable, lifecycle-spanning artifacts [5][6]. MeG4 was built on that premise from the first commit — including for its own internals.

Section 3

Architecture: five layers, one direction of dependency

L4 · Substrate descent: gate-verified training data → per-role adapters → owned substrate. Parked, honestly, where evidence is insufficient.
L3 · Verticals: Dev (first), Research, edge. Thin overlays on the same core. A vertical is a config, not a codebase.
L2 · Self-building loop: mine failures from the ledger → propose changes to the harness itself → validate on frozen held-out tasks → promote or reject.
L1 · Native executor: model client (streaming, fallbacks), confined tools, agent loop, relay/gateway for remote clients. A single Rust binary.
L0 · Pure core: Contract · Ledger · Oracle · Router · Trit · Roster. Zero I/O; purity is CI-gated. Never names a model; scale is config.
Fig. 1 — The L0–L4 stack. Dependencies point down only; the core is a leaf.

Two properties matter more than the layer count:

  • The core is pure and model-agnostic. L0 contains the algebra of the system — contracts, verdicts, routing — with no I/O and no model names. A CI gate fails the build if impurity leaks in. Swapping every model in the system is a one-word config change (the backends: map), not a refactor.
  • Autonomy is bounded by contract, not by trust. The field is actively debating how much structure agent collectives need — recent work shows self-organizing agents can beat designed hierarchies at the frontier, while models below a capability threshold still benefit from imposed structure [7]. MeG4 runs local, mid-scale models and sells accountability, so it takes the conservative side of that trade: the planner emits a contract; everything downstream is free within it.
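One way to picture a core that "never names a model" is sketched below. It is an assumption-laden illustration: the `Seat` enum, the `Backends` alias, and `resolve` are hypothetical, and the real `backends:` map may be shaped differently. The point is only that model identifiers arrive as data, so the core's types contain none.

```rust
use std::collections::BTreeMap;

/// Seats the core knows about. No model name appears in this type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Seat {
    Planner,
    Mid,
    Worker,
}

/// Hypothetical in-memory form of the `backends:` map: seat -> model id.
pub type Backends = BTreeMap<Seat, String>;

/// Pure resolution: no I/O, just a lookup in config handed to the core.
pub fn resolve(backends: &Backends, seat: Seat) -> Option<&str> {
    backends.get(&seat).map(String::as_str)
}
```

Swapping a model is then a change to one map value; nothing that depends on `Seat` needs to be touched.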

Section 4

The three-tier engine: intelligence where it pays

One model doing everything is the most expensive possible design. MeG4 splits every job across three seats, each holding only what it needs:

PLANNER: sees the request, emits the distilled contract (intent, stack, acceptance checks). Frontier-grade reasoning, few tokens.
MID: holds the entire context, verifies the contract, repairs it if malformed, and always checks the result. The cheap seat that reads everything.
WORKER: executes: edits files, runs tools, builds. Local model on our own GPUs; escalates to a stronger seat only when it stalls.
GATE: out-of-band falsifier run. Pass → ship & record. Fail → retry with the failure as evidence, or an honest red.
Fig. 2 — One work order through the tiers. ~86% of context lives in the cheap mid seat.
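The worker-to-gate part of this flow can be sketched as a small loop. This is a simplified illustration under stated assumptions: `GateResult`, `Outcome`, and `run_work_order` are hypothetical names, the retry budget `max_attempts` is invented for the example, and escalation to a stronger seat is left out.

```rust
/// Result of one out-of-band gate run (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum GateResult {
    Pass,
    Fail { evidence: String },
}

/// Final outcome of a work order.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Shipped { attempts: u32 },
    HonestRed { last_evidence: String },
}

/// Run the worker, then the gate; on failure feed the evidence back in.
/// `worker` receives the previous failure (if any) and returns an artifact.
/// `gate` runs outside the worker, which never sets its own verdict.
pub fn run_work_order<A>(
    mut worker: impl FnMut(Option<&str>) -> A,
    gate: impl Fn(&A) -> GateResult,
    max_attempts: u32,
) -> Outcome {
    let mut evidence: Option<String> = None;
    for attempt in 1..=max_attempts {
        let artifact = worker(evidence.as_deref());
        match gate(&artifact) {
            GateResult::Pass => return Outcome::Shipped { attempts: attempt },
            GateResult::Fail { evidence: e } => evidence = Some(e),
        }
    }
    // Budget exhausted: report the red instead of shipping unverified work.
    Outcome::HonestRed { last_evidence: evidence.unwrap_or_default() }
}
```

The only way out of the loop with a `Shipped` outcome is through `gate`, so the separation between doing the work and judging it is visible in the function signature.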

The economics follow from the separation: the expensive model reads little and writes less; the model that reads everything is cheap; the model that does the labor is local and effectively free at the margin. Measured on external benchmarks against a frontier model doing the whole job alone, the orchestrated stack lands in the same quality band at roughly an order of magnitude lower cost — and the honest claim is cost-optimal orchestration, never "a better model."

The equal-model invariant. Every benchmark MeG4 reports holds the models constant and varies only the harness. Comparing our stack with model A against a competitor running model B measures nothing. This invariant is what makes our numbers falsifiable rather than marketing.

Section 5

Roles as hot-swappable LoRAs

A role in MeG4 — planner, worker, judge, analyst, and eventually Support, Sales, Ops — is not a hard-coded agent. It is:

  • a stable name, addressable in config;
  • a versioned prompt-spec — the role's charter, rules, and evidence standards, kept in the repo like any other contract;
  • an adapter seat: every tier can mount a LoRA (adapter: / adapters:) on top of the shared base model, hot-swappable per request.

The consequence: "many agents" does not mean many models. One base model on one GPU serves dozens of role personalities, each a lightweight adapter trained from that role's own gate-verified history. Today's prompt-spec is tomorrow's LoRA training set — the role is specified in language first, then distilled into weights once its ledger holds enough verified examples. Distilling a prompted persona into weights is established technique (context distillation, prompt baking), and per-role training runs have been exercised on our own cluster — this path is engineering, not research.
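The three-part definition of a role maps naturally onto a small data structure. The sketch below is illustrative only: `Role`, `Mount`, and `mount` are hypothetical names, and the adapter identifiers are placeholders, not real artifacts.

```rust
/// A role as described above: stable name, versioned prompt-spec, adapter seat.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Role {
    pub name: String,
    pub prompt_spec_version: u32,
    /// LoRA identifier, present once one has been trained for this role.
    pub adapter: Option<String>,
}

/// What a single request mounts on the shared base model.
#[derive(Debug, PartialEq, Eq)]
pub struct Mount<'a> {
    pub base_model: &'a str,
    pub adapter: Option<&'a str>,
}

/// Per-request mounting: the base model is shared, only the adapter varies.
pub fn mount<'a>(base_model: &'a str, role: &'a Role) -> Mount<'a> {
    Mount {
        base_model,
        adapter: role.adapter.as_deref(),
    }
}
```

Two roles mounted this way share one `base_model` value and differ only in `adapter`, which is the sense in which many agents need not mean many models.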

This is also the personalization story: a company's MeG4 Support learns that company's tone and playbook as an adapter — private, swappable, and owned by the customer.

Section 6

The verification stack: three gates and a vetted inventory

Gate | Question it answers | How
L1 · build | Does it compile / lint / typecheck? | The stack's native toolchain, exit-code pure.
L2 · tests | Does it behave as specified? | Native test suites pinned to exact test names, so checks cannot be gamed by renaming or masking.
L3 · agentic QA | Does it actually work, in a real browser, for a real user? | An independent agent drives the built artifact against a checklist-as-contract (functional flows, dark mode, mobile, contrast, auth-gating), judged by a local vision model, with a reproduction gate to kill false positives.

Feeding the gates is the vetted stack registry: one pinned technology stack per use case (SSR web, SEO web, full-stack business, Rust service, CLI, script, ML/Python…), each entry carrying its own L2 check and QA recipe. The planner chooses a key, the scaffold delivers a known-good starting point, and the worker composes rather than invents. This mirrors how the strongest commercial generators win — pinned, opinionated stacks [3] — while the falsifier stays stack-agnostic: it measures outcomes, not technology choices.
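A registry of this kind could be modeled as a keyed map from use case to a pinned entry. The sketch below is a guess at the shape, not the actual registry: `StackEntry`, `Registry`, and `choose` are hypothetical, and treating an unknown key as a hard error is an assumption made for the example.

```rust
use std::collections::BTreeMap;

/// One registry entry: a pinned stack plus the checks it carries.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct StackEntry {
    /// Pinned technology stack (placeholder label).
    pub stack: String,
    /// Command for the entry's own L2 test gate.
    pub l2_check: String,
    /// Name of the agentic QA checklist for this stack.
    pub qa_recipe: String,
}

/// Use-case key -> vetted entry.
pub type Registry = BTreeMap<String, StackEntry>;

/// The planner chooses a key. Assumption: an unknown key is an error,
/// so the worker never improvises an unvetted stack.
pub fn choose<'a>(reg: &'a Registry, key: &str) -> Result<&'a StackEntry, String> {
    reg.get(key)
        .ok_or_else(|| format!("no vetted stack for use case: {key}"))
}
```

Because each entry carries its own `l2_check`, picking a stack also picks how its output will be judged.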

Section 7

Memory: the contract graph is the knowledge graph

Contracts are markdown notes with wiki-links, so the portfolio is a typed graph: epics link to contracts, contracts to their falsifiers and ledger events, roles to their prompt-specs. The same graph renders in the console, on the web, and feeds retrieval — institutional memory is not a side database, it is the working substrate itself, in a format both humans and agents read natively. Long-term archive lives in a local RAG; the direction of travel is one unified graph where a link is a link everywhere.
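To make "notes with wiki-links form a graph" concrete, here is a minimal sketch that extracts `[[target]]` links from note bodies and turns them into directed edges. It assumes the common `[[target]]` and `[[target|label]]` link syntax; the note names are invented, and edge typing (epic, contract, role) is omitted.

```rust
/// Extract wiki-link targets from a markdown note body.
/// Handles `[[target]]` and `[[target|label]]`; an unclosed link ends the scan.
pub fn wiki_links(body: &str) -> Vec<String> {
    let mut out = Vec::new();
    let mut rest = body;
    while let Some(start) = rest.find("[[") {
        let after = &rest[start + 2..];
        match after.find("]]") {
            Some(end) => {
                // Keep only the target, dropping any `|label` suffix.
                let target = after[..end].split('|').next().unwrap_or("").trim();
                if !target.is_empty() {
                    out.push(target.to_string());
                }
                rest = &after[end + 2..];
            }
            None => break,
        }
    }
    out
}

/// Build directed edges (note -> linked note) from (name, body) pairs.
pub fn edges(notes: &[(&str, &str)]) -> Vec<(String, String)> {
    notes
        .iter()
        .flat_map(|(name, body)| {
            wiki_links(body)
                .into_iter()
                .map(move |target| (name.to_string(), target))
        })
        .collect()
}
```

The same edge list can feed a renderer or a retrieval index, which is the sense in which the graph is the working substrate rather than a side database.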

Section 8

Sovereign inference: our models, our metal

  • Own hardware. The engine's labor runs on a private GPU cluster (Blackwell-class, unified-memory workstations) — the brain orchestrates from a separate always-on machine. No per-token dependency on any AI vendor for the core work.
  • Native model-access layer. Routing, health, and failover are MeG4 code, not a third-party proxy: every seat declares primary and fallback backends across providers, so the system is never out of fuel — it degrades deliberately, and if everything is dry it stops with an honest model_error rather than silently benchmarking garbage.
  • Deployment symmetry. The same binary and config grammar runs hosted on our hardware or on-premise in a customer's office — privacy as a deployment choice, not a product tier.
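The primary-and-fallback behavior described above can be sketched as an ordered walk over declared backends. This is an illustration, not the actual model-access layer: `ModelError` here is a hypothetical struct standing in for the `model_error` outcome, and `call` abstracts away health checks and the network.

```rust
/// Raised when every declared backend is dry (hypothetical error type).
#[derive(Debug, PartialEq, Eq)]
pub struct ModelError {
    pub tried: Vec<String>,
}

/// Try the primary backend, then each fallback in declared order.
/// `call` returns None when a backend is unhealthy or exhausted.
/// On success, returns (backend that answered, completion text).
pub fn complete(
    backends: &[&str],
    call: impl Fn(&str) -> Option<String>,
) -> Result<(String, String), ModelError> {
    let mut tried = Vec::new();
    for backend in backends {
        match call(backend) {
            Some(text) => return Ok((backend.to_string(), text)),
            None => tried.push(backend.to_string()),
        }
    }
    // Nothing answered: stop with an explicit error, never a fake result.
    Err(ModelError { tried })
}
```

Returning the name of the backend that answered keeps degradation visible to the caller, so a run served by a fallback is never mistaken for one served by the primary.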

Section 9

The self-building loop — under its own rules

L2 closes the loop: the system mines its own failure ledger, proposes changes to its own configuration and prompts, and validates each candidate on a frozen held-out set it never trains against. Promotion requires paired statistical significance (McNemar over per-task outcomes), not vibes; efficiency regressions veto.

MINE: cluster failures from the ledger into signatures (evidence)
PROPOSE: candidate edits to the harness's own config/prompts; open surface, with memory of what was already tried (diversity)
VALIDATE: frozen held-out, N sized to detect the effect, paired stats (honesty)
PROMOTE / REJECT: champion only changes on significance; every decision audited (ratchet)
Fig. 3 — The improvement ratchet. The same contract discipline, aimed inward.
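The promotion rule can be illustrated with the exact (binomial) form of McNemar's test over paired per-task outcomes. This is a sketch under assumptions: the source names McNemar but not the variant, the threshold `alpha` is a parameter chosen by the caller, and the function names are hypothetical. The floating-point accumulation is adequate for modest held-out set sizes only.

```rust
/// Count discordant pairs from paired per-task pass/fail outcomes.
/// b = tasks only the candidate passes, c = tasks only the champion passes.
pub fn discordant(champion: &[bool], candidate: &[bool]) -> (u64, u64) {
    assert_eq!(champion.len(), candidate.len(), "outcomes must be paired per task");
    let mut b = 0;
    let mut c = 0;
    for (&ch, &ca) in champion.iter().zip(candidate) {
        if ca && !ch {
            b += 1
        } else if ch && !ca {
            c += 1
        }
    }
    (b, c)
}

/// Two-sided exact McNemar p-value: under the null hypothesis each
/// discordant pair is a fair coin, so min(b, c) ~ Binomial(b + c, 0.5).
pub fn mcnemar_exact_p(b: u64, c: u64) -> f64 {
    let n = b + c;
    if n == 0 {
        return 1.0;
    }
    let k = b.min(c);
    let mut term = 0.5f64.powi(n as i32); // C(n, 0) * 0.5^n
    let mut tail = term;
    for i in 0..k {
        term *= (n - i) as f64 / (i + 1) as f64; // C(n, i+1) from C(n, i)
        tail += term;
    }
    (2.0 * tail).min(1.0)
}

/// Promote only on a significant improvement; an efficiency regression vetoes.
pub fn promote(b: u64, c: u64, alpha: f64, efficiency_regressed: bool) -> bool {
    !efficiency_regressed && b > c && mcnemar_exact_p(b, c) < alpha
}
```

For example, a candidate that wins 6 discordant tasks and loses 1 has an exact two-sided p-value of 0.125, so at a threshold of 0.05 the champion stays in place despite the apparent improvement.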

Where does this stand? Honestly: the machinery is built and gated; the improvement slope is not yet demonstrated above noise. Our own audit found the early loop was flat by construction (a memoryless proposal generator), found a "credible win" announcement that failed its own falsifier, and found masked checks in the portfolio — and the system's response was to file each finding as a red contract and repair the falsification substrate first. Independent 2026 results show a harness improving itself by 14–21 points of held-out pass rate (e.g. 40.5%→61.9% on Terminal-Bench 2.0) with no stronger model, using exactly this mining→proposal→held-out-validation shape [2] — the loop is the right bet; our differentiator is that we refuse to claim it before our own gate goes green.

Self-honesty is a feature, not a posture: the harness audits itself with the same severity it applies to its output — doctor --self checks contract validity, falsifier strength, staleness, and author-independence; the whole portfolio re-executes on schedule; a red is a work item, never a secret.

Section 10

Relation to the field

Each pillar of MeG4 now has independent validation in the 2026 literature — the combination is what nobody else ships:

  [1] arXiv:2604.25850. Agent-Harness Engineering: falsifiable contracts + ledger + held-out validation; harness > model. Validates the core thesis.
  [2] arXiv:2606.09498. Self-Harness: harnesses that improve themselves (weakness mining → proposals → regression-tested validation); +14–21pp held-out on Terminal-Bench 2.0 across three models. Validates L2's shape.
  [3] arXiv:2604.03515. Source-code taxonomy of 13 coding-agent scaffolds; convergence where constraints dominate. Grounds the pinned-stack registry and scaffold design.
  [4] arXiv:2604.18071. Architectural design decisions in AI agent harnesses. Harness decisions dominate outcomes.
  [5] SpecOps @ SPLASH/ISSTA 2026. Specifications as living, executable, lifecycle-spanning artifacts. The contract, arrived at independently.
  [6] arXiv:2605.01160. Specification-driven governance for AI-augmented development. Spec-as-source-of-truth, industrialized.
  [7] arXiv:2603.28990. Drop the Hierarchy and Roles: self-organizing agents outperform designed structures at the frontier; models below a capability threshold still benefit from imposed structure. We take the conservative side for local models + accountability.

What remains uniquely ours: the trit hand-off discipline, proven-losable falsifiers with author-independence applied to the harness itself, roles as hot-swappable per-user adapters over one local base model, and the equal-model invariant as a reporting standard — one engine carrying all four, running on hardware we own.

Section 11

Status & direction

Pillar | State
contracts & gates | Live: 55-contract portfolio, machine-audited; falsification substrate hardened after our own adversarial self-audit (July 2026).
three-tier engine | Live: end-to-end work orders with remote guest access, billing, and out-of-band gates; local worker on own GPUs.
vetted stack registry | Live for the core stacks; registry being promoted from doctrine to code (single source of truth, mirror contracts).
agentic browser QA | Live for web verticals (functional + visual + auth matrix); expanding per-stack.
roles → LoRA | Pattern live (versioned prompt-specs + adapter seats); per-role fine-tunes gated green in training runs.
self-improvement slope | Machinery gated; honest measurement in progress. This is the one number we refuse to announce early.

The direction is unchanged since the first line of the core: an engine that staffs companies with verified, personalized, locally-run agent roles — and that improves itself under the same contract it applies to everything it ships.