System description of MeG4 and a dual adversarial self-audit case study
Author list to be finalized · meg4.dev · July 2, 2026
Agent-harness engineering — the discipline of designing the system around a language
model — is emerging as the dominant lever on end-to-end agent performance
[1,4,13,31]. We describe MeG4, a Rust agentic
harness organized around a single invariant: every structure the system produces
or contains is derived from natural-language intent and is permitted to exist only
while it passes an executable falsifier (a contract). MeG4 integrates a three-word
inter-agent alphabet ({accept, reject, contract}) enforced as an algebraic data type and by
constrained decoding; a tiered planner/mid/worker economy over locally-served open models;
roles expressed as versioned prompt-specifications with hot-swappable LoRA seats; and — the
part we believe is new — a reflexive audit (doctor --self) that applies
falsifier-strength requirements (demonstrated losability, author-independence, staleness
bounds) to the harness's own governing contracts as a blocking gate. We report a case study in
which the system was audited twice in parallel — once by itself, once by an independent
48-agent adversarial workflow — over the same evidence pack. The external audit confirmed 37
findings with zero refutations, including that 40 of 55 contract falsifiers were maskable by
construction (reproduced empirically: a deliberately broken test suite yielded a passing
falsifier), and that a self-improvement result the system had announced as a win failed its
own contract falsifier (n=19<20; z=1.59; McNemar p=0.125). Every finding was converted into
a red executable contract and the falsification substrate was repaired first. We argue the
case study demonstrates both the failure mode this architecture exists to catch — green-but-hollow
verification — and the honest-by-construction response it enables. We are explicit about what
is not yet demonstrated: the self-improvement slope itself remains unproven above noise, and
our unit-economics figures are a model, not a measurement.
Through 2025–2026 the field converged on an uncomfortable observation: among comparable frontier models, the harness — the scaffold of prompts, tools, verification, routing, and memory around the model — explains more end-to-end variance than the model itself. Analyses of deployed agent systems find recurring architectural dimensions that dominate outcomes [4]; controlled studies show orchestration topology alone worth 12–23% over static baselines with identical models [13]; position papers now argue agent comparisons are meaningless without harness disclosure [31]; and harnesses that automatically evolve themselves post double-digit gains with frozen models [1,2].
MeG4 is a bet placed on that observation before it was fashionable, with one extra demand that, to our knowledge, remains rare: the harness must be governed by the same verification discipline it imposes on its outputs. Concretely, MeG4 commits to:
This paper makes four claims of contribution, each stated with the framing that survived our own adversarial prior-art review (§2):
We explicitly do not claim novelty for equal-model evaluation; we adopt it as a hard reporting invariant following [31,32,33].
Harness engineering and self-evolution. Rombaut's source-code taxonomy of 13 coding agents identifies five composable loop primitives and finds most agents combine several [3]; Wei catalogs the architectural decisions of 70 agent systems [4]. Observability-driven harness evolution raises Terminal-Bench 2 pass@1 from 69.7% to 77.0% with the model frozen [1]. Self-Harness closes the loop — weakness mining from execution traces, minimal modification proposals, regression-tested validation — improving held-out Terminal-Bench 2.0 pass rates by 14–21 points across three models (e.g., 40.5%→61.9%) [2]. SICA [21], Meta-Harness [22], SIA [23], and the Darwin Gödel Machine [24] demonstrate agents editing their own scaffolds, all validated by downstream task performance. MeG4's L2 loop matches this shape; its distinguishing element is the portfolio-shape audit of §4 and the refusal to promote without paired significance.
Specifications as executable artifacts. The 2026 spec-driven-development wave treats specifications as living, lifecycle-spanning drivers rather than documentation: the SpecOps workshop states this as its founding vision [12]; Farrag argues specification discipline, not model capability, is the binding constraint on AI-assisted dependability [6]; structured specs measurably improve repository-level generation [7]; and LLMs can synthesize formal verification annotations from natural-language specs at high success rates [8]. MeG4's contract is this idea implemented end-to-end — including for the harness's own internals.
Verification disciplines we inherit. The requirement that a falsifier be demonstrated able to fail is classical: mutation testing [16], industrialized with LLMs at Meta [17], and the rotten-green-tests literature. Author-independence is classical IV&V (IEEE 1012 [18]), re-motivated in the LLM era by measured self-preference bias in LLM judges [19,20]. Spec staleness/drift is a named problem with emerging tooling [34]. MeG4 systematizes these into machine-checked, per-contract requirements enforced reflexively.
Multi-agent structure and communication. The trit descends from Contract-Net's accept/reject/propose performatives [14]. On how much imposed structure agent collectives need, the evidence is genuinely mixed: Dochkina finds self-organizing agents outperform designed hierarchies by 14% at the frontier, while models below a capability threshold still benefit from rigid structure [5]; AdaptOrch finds topology choice dominates once models converge [13]. MeG4 runs local mid-scale models and sells accountability, so it deliberately takes the conservative side: fix the commitment (the contract), grant freedom inside it.
Constrained generation of UIs and code. Vetted-inventory generation — an LLM plans while a deterministic assembler composes from approved components — is the design argument of Portal UX Agent [9] and SpecifyUI [10], and the de-facto strategy of commercial app generators. MeG4's pinned stack registry (§3.5) applies the same principle at project scale, with the falsifier kept stack-agnostic. Multi-turn correctness/security benchmarking [11] motivates our gate-per-turn design.
Personalization via adapters. S-LoRA-style multi-adapter serving [25] is shipped practice (e.g., per-feature on-device adapters in Apple Intelligence [35]); activated LoRAs frame adapters as agentic roles [26]; OPPU trains one PEFT per user [27], Profile-to-PEFT generates them on the fly [28]; context distillation and prompt baking compile prompted personas into weights [29,30]. §3.4 composes these into a spec-governed lifecycle.
The unit of truth is the contract: a markdown note whose frontmatter couples intent to
one or more executable acceptance checks with expected exit codes. Status is binary and honest:
open (falsifier red — roadmap) or active (gated green). An append-only
ledger records who authored, implemented, and verified. Inter-agent communication is restricted
to the trit — {accept, reject, contract} — enforced
three ways at once: as a Rust sum type (illegal states unrepresentable), as a constrained-decoding
schema at inference time, and as the only legal hand-off between tiers. Every exchange is
therefore either a verdict or a falsifiable promise; there is no third kind of message for drift
to hide in.
Five layers with one dependency direction. L0, the pure core (contracts,
ledger, oracle, router, trit, roster), contains no I/O and no model names; purity is CI-gated and
scale is configuration — swapping every model in the system is a one-word change in a
backends: map. L1 is the native executor: streaming model client
with cross-provider fallback, confined tools, the agent loop, and a relay/gateway for remote
clients — one Rust binary. L2 is the self-building loop (§3.6).
L3 is verticals (software development first) as thin config overlays.
L4, substrate descent toward owned ternary weights, is parked: its
post-training quantization route was falsified internally and we do not build on unfalsified
ground.
The economics follow from the separation: the expensive model reads little; the model that reads everything is cheap; the laborer is local. All performance claims follow the equal-model invariant [31]: benchmarks vary only the harness.
A role (planner, worker, judge, analyst; eventually customer-facing roles) is (i) a stable name in configuration, (ii) a version-controlled prompt-specification — its charter, rules, and evidence standards, and (iii) an adapter seat: any tier can mount LoRAs over the shared local base, hot-swapped per request [25]. When a role's ledger holds enough gate-verified episodes, its spec is compiled into an adapter — context distillation with a paper trail [29,30]: the spec's version history is the adapter's lineage, and the spec's falsifier is the adapter's acceptance test. Per-user personalization composes the same way [27].
Three gates: L1 native build; L2 native tests pinned to exact test identities; L3 agentic browser QA — an independent agent drives the built artifact against a checklist-as-contract (functional flows, dark mode, mobile, contrast, auth-gating) judged by a local vision model, with a reproduction gate against the false-positive rates documented for agentic web QA. Upstream of the gates, a registry pins one technology stack per use case; the planner selects a key, a scaffold provides a known-good start, the worker composes rather than invents [9,10]. The falsifier is stack-agnostic — it measures outcomes.
L2 mines failure signatures from the ledger, proposes edits to the harness's own configuration and prompts, and validates each candidate on a frozen held-out set, promoting only on paired per-task significance (exact McNemar) with an efficiency veto. This matches the published shape [2]. Its current honest status is the subject of §5.
doctor --self
Self-improving harnesses ask "did my edit help?" MeG4 additionally asks a prior question:
"are my own contracts capable of telling me the truth?" The doctor --self
gate audits the portfolio's shape:
active is a claim about now, not
about the day of implementation: the entire portfolio's checks re-run on schedule and any
red blocks.
We stress the epistemics: this gate does not make the harness good; it makes the harness's claims about itself falsifiable. §5 shows both why that matters and how it can still fail — and what failing loudly buys.
On July 2, 2026 the operator commissioned two parallel, mutually-blind analyses of the system — its thesis, its stack registry, and its structure — over an identical evidence pack (repository metrics; verified 2026 sources):
Audit B produced 42 findings: 37 confirmed, 5 directionally-correct with imprecise evidence, 0 refuted. Selected confirmed findings, all reproducible from the audit artifacts:
| Finding | Evidence |
|---|---|
| 40/55 contract falsifiers maskable by construction. The prevailing check pattern piped test output to grep, returning grep's exit status; any earlier passing suite satisfies the match. | Empirical reproduction: a deliberately broken integration test (test runner exit 101) yielded a passing falsifier. The masked pattern violated a hard lesson recorded in the project's own architecture notes. |
| 35/47 checks filtered tests by name substring, unpinned — deleting the real test leaves the check green. | Static count over the portfolio; per-contract examples in the audit report. |
| The independence check was vacuous. It compared self-reported ledger actor strings; 40/55 contracts had no doer entries at all and passed trivially. | Ledger histogram across all 55 contracts; the state document itself listed independence as unresolved while the audit gate showed green. |
| A self-improvement win had been declared against a red falsifier. An internal result file announced "credible positive slope → flip to green" on a run with held-out n=19 where the contract requires n≥20; z = Δ/SE = 1.59 < the loop's own 2σ bar; best-case exact McNemar (13-vs-9/19; b=4, c=0) p=0.125. | Executing the contract's falsifier on the committed artifact: exit 1. The paired gate built to catch exactly this had never been exercised: 0/23 audit-log decisions carried a McNemar field; both prior promotions used a legacy aggregate rule. |
| The improvement loop was flat by construction. Proposal generation was memoryless with a closed candidate pool; the open LLM generator silently fell back to canned strings in 6/6 recorded rounds. | One identical rejected proposal was re-measured in three separate runs at 2.27M, 5.82M, and 3.74M tokens each — the cost of having no memory between runs. |
| Resolution was insufficient for the claim. The loop's own statistics module computes a minimum detectable effect of ≈0.40 (≈8 tasks) at n=19; the celebrated Δ was 0.211 (4 tasks). | Direct evaluation of the project's min_detectable_delta on the run's parameters. |
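
The exact McNemar figure in Table 1, carried through as an added worked derivation (not part of the original paper). With b and c the discordant task counts and m = b + c, the two-sided exact p-value is

$$p = \min\!\left(1,\; 2\sum_{k=0}^{\min(b,c)} \binom{m}{k}\left(\tfrac{1}{2}\right)^{m}\right).$$

For the best case reported, b = 4, c = 0:

$$p = 2\binom{4}{0}\left(\tfrac{1}{2}\right)^{4} = \tfrac{2}{16} = 0.125.$$

Whenever c = 0 this reduces to $p = 2^{1-m}$, so m = 5 gives 0.0625 and m = 6 gives 0.03125. At a conventional α = 0.05 (an assumption; the page states the loop's bar only as 2σ), at least six one-directional discordant tasks are needed, so the four-task best case cannot reach it.
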
Audit A — the system on itself — agreed with Audit B on the core diagnosis (the unmeasured improvement slope; oversized files; untested product surface; sound crate topology) but found none of Table 1: it did not descend from documentation to code, its report contained duplicated blocks and factual slips, and it passed its acceptance gate anyway. That gate checked form — required sections, evidence-reference counts, minimum length — not truth. This is the same disease Table 1 documents in the contract portfolio, recursing one level up: a winnable falsifier is not yet a falsifier with teeth. We record this as the case study's central result: form-gates are cheap and necessary, but strength must itself be audited — which is precisely the reflexive requirement of §4, now extended to analysis deliverables.
Every confirmed finding became a red executable contract in the portfolio; the merged report
became the governing roadmap; and repair was sequenced falsification-first: migrate all checks
to pinned, exit-code-pure form (each proven winnable, with honest reds recorded); harden the
gate executor against exit-masking; introduce portfolio re-execution and falsifier-strength
linting into doctor --self; only then resume feature work. The repair wave ran as
supervised multi-agent workflows with adversarial review on every change and mutation-injection
verification that migrated checks actually fail on broken code. At the time of writing the wave
is in progress; its numbers will be reported when its own gates are green — consistent with the
discipline this paper describes.
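
The check shapes named in Table 1 and in the repair plan above, side by side, as an added example (not code from the MeG4 repository). `fake_suite`, the test names and the three check functions are constructed for illustration; only the runner exit status 101 comes from the page.

```shell
#!/bin/sh
# Constructed stand-in for a test runner (not the project's suite). $1 picks the state:
#   healthy  both tests pass, exit 0
#   broken   the integration test fails, runner exits 101
#   deleted  the integration test no longer exists; what remains passes, exit 0
fake_suite() {
    echo "test unit::gate_parses_frontmatter ... ok"
    case "$1" in
        healthy) echo "test integration::gate_blocks_red ... ok" ;;
        broken)  echo "test integration::gate_blocks_red ... FAILED"; return 101 ;;
        deleted) ;;
    esac
}

# Masked form: a pipeline's status is its last command's, so the runner's 101
# is discarded and any earlier passing line satisfies the match.
masked_check() { fake_suite "$1" | grep -q "ok"; }

# Unpinned form, modelled as "nothing whose name contains 'gate' failed":
# it honours the runner's status but cannot tell that the real test is gone.
substring_check() {
    out=$(fake_suite "$1") || return 1
    ! printf '%s\n' "$out" | grep "gate" | grep -q "FAILED"
}

# Exit-code-pure and pinned: the runner's own status decides first, then the
# exact test identity must be present and passing.
PINNED="test integration::gate_blocks_red ... ok"
pinned_check() {
    out=$(fake_suite "$1") || return 1
    printf '%s\n' "$out" | grep -Fqx "$PINNED"
}

# Mutation-style verification: run every check against every suite state.
verdict() { if "$1" "$2"; then echo green; else echo red; fi; }

for state in healthy broken deleted; do
    for check in masked_check substring_check pinned_check; do
        printf '%-16s %-8s %s\n' "$check" "$state" "$(verdict "$check" "$state")"
    done
done
```

Only the pinned form is red on both mutations, which is the property a migrated check has to demonstrate before it counts as a falsifier.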
We state unit economics with the same honesty budget. On the project's internal pricing model (representative task of 290k input / 30k output tokens; June 2026 list prices), the tiered stack costs ≈$0.094 per task with the local worker carrying the volume, prices at ≈$0.39 per blended Mtok, and lands roughly 7–15× below flagship-API list pricing for the same token volume. The worker's capital cost (two compact GPU workstations, ≈$7k) amortizes at ~$120–195/month with modeled break-even near 5–6k worker-tasks/month against a serving capacity of roughly 25k. A result cache in the gateway was implemented and live-verified (identical deterministic request served at zero marginal cost), scaling effective cost by (1−hit-rate). These figures derive from a documented model whose token-split is explicitly declared illustrative; the equal-model head-to-head (~15–20% cheaper at equal pass rate) is a stated target with no committed run behind it yet, and we refuse to promote it until one exists.
MeG4 operationalizes a simple, harsh idea: a claim without an executable falsifier is decoration — and that must include the system's claims about itself. The dual self-audit shows why: a portfolio that looked disciplined (55 contracts, green dashboards) was quietly maskable at scale, and the system had already declared one victory its own falsifier rejects. The same architecture that produced the failure also produced the correction: independent adversarial verification, findings as red contracts, falsification repaired before features. What remains is the experiment this entire design exists to win honestly — a self-improvement slope, measured at adequate resolution, promoted by paired significance, on a substrate whose checks can no longer lie. We will report it when its gate is green, and not before.
Artifacts. The audit reports, evidence pack, contrast
document, and governing roadmap cited in §5 are internal repository artifacts
(runs/self_analysis/, docs/NORTE_REBUILD.md); numbers in Table 1
are reproducible from them. External-source claims were verified against the live web on
July 2, 2026; two of our own prior citations were corrected in the process (one inverted
reading, one inflated effect size) — that correction pass is itself an instance of the
discipline this paper argues for.