preprint · meg4.dev · july 2026 · shared by invitation

Contracts All the Way Down: A Self-Auditing Agentic Harness
and What It Found When It Audited Itself

System description of MeG4 and a dual adversarial self-audit case study

The MeG4 Project*

*author list to be finalized · meg4.dev · july 2, 2026

Abstract

Agent-harness engineering — the discipline of designing the system around a language model — is emerging as the dominant lever on end-to-end agent performance [1,4,13,31]. We describe MeG4, a Rust agentic harness organized around a single invariant: every structure the system produces or contains is derived from natural-language intent and is permitted to exist only while it passes an executable falsifier (a contract). MeG4 integrates a three-word inter-agent alphabet ({accept, reject, contract}) enforced as an algebraic data type and by constrained decoding; a tiered planner/mid/worker economy over locally-served open models; roles expressed as versioned prompt-specifications with hot-swappable LoRA seats; and — the part we believe is new — a reflexive audit (doctor --self) that applies falsifier-strength requirements (demonstrated losability, author-independence, staleness bounds) to the harness's own governing contracts as a blocking gate. We report a case study in which the system was audited twice in parallel — once by itself, once by an independent 48-agent adversarial workflow — over the same evidence pack. The external audit confirmed 37 findings with zero refutations, including that 40 of 55 contract falsifiers were maskable by construction (reproduced empirically: a deliberately broken test suite yielded a passing falsifier), and that a self-improvement result the system had announced as a win failed its own contract falsifier (n=19<20; z=1.59; McNemar p=0.125). Every finding was converted into a red executable contract and the falsification substrate was repaired first. We argue the case study demonstrates both the failure mode this architecture exists to catch — green-but-hollow verification — and the honest-by-construction response it enables. We are explicit about what is not yet demonstrated: the self-improvement slope itself remains unproven above noise, and our unit-economics figures are a model, not a measurement.

1 Introduction

Through 2025–2026 the field converged on an uncomfortable observation: among comparable frontier models, the harness — the scaffold of prompts, tools, verification, routing, and memory around the model — explains more end-to-end variance than the model itself. Analyses of deployed agent systems find recurring architectural dimensions that dominate outcomes [4]; controlled studies show orchestration topology alone worth 12–23% over static baselines with identical models [13]; position papers now argue agent comparisons are meaningless without harness disclosure [31]; and harnesses that automatically evolve themselves post double-digit gains with frozen models [1,2].

MeG4 is a bet placed on that observation before it was fashionable, with one extra demand that, to our knowledge, remains rare: the harness must be governed by the same verification discipline it imposes on its outputs. Concretely, MeG4 commits to:

Falsification over trust. No output, plan, or self-modification is accepted on plausibility; each is judged by an executable check written before the work and demonstrated able to fail.
Natural language as the only human layer. Humans state intent in English; the system derives and refines everything below that line.
Local and personal. Inference runs on hardware the operator controls, with open models, specialized per role and per user.

This paper makes four claims of contribution, each stated with the framing that survived our own adversarial prior-art review (§2):

An integrated contract discipline. No individual component is new — the {accept, reject, contract} alphabet is a deliberate reduction of FIPA Contract-Net performatives [14], and type-level enforcement of message shapes is standard practice. The contribution is the integration: one three-valued hand-off (the trit) enforced simultaneously as a compile-time sum type, a constrained-decoding grammar, and the only legal inter-tier message (§3.1).
Reflexive falsifier-strength auditing as a blocking gate. Self-improving harnesses validate their edits by task performance [2,21,22,23,24]. MeG4 additionally audits the shape of its own contract portfolio — falsifier presence, demonstrated losability, staleness TTL, author≠arbiter attestation — as a gate that blocks work (§4). We found no published or shipped equivalent; the nearest system applies adversarial verification to outputs and calibrates its harness by failure classification, not by portfolio-shape audit [15].
Roles as versioned prompt-specs with adapter lineage. Multi-LoRA serving over one base [25,26], per-user adapters [27,28], and distilling prompts into weights [29,30] all exist separately. The contribution is the lifecycle: a role's charter is a version-controlled contract whose history defines the training lineage — and the falsifier — of the adapter that will replace it (§3.4).
A dual adversarial self-audit case study with findings-as-contracts. Agent self-audit exists [21,2]; independent multi-reviewer auditing is shipped practice. We contribute the merged governance loop — the system audits itself and is audited by an independent multi-agent workflow over identical evidence; the reports are contrasted; the merge becomes the governing roadmap; and every confirmed finding becomes a red executable contract — together with the resulting numbers, which we publish unflattering parts included (§5).

We explicitly do not claim novelty for equal-model evaluation; we adopt it as a hard reporting invariant following [31,32,33].

2 Related work

Harness engineering and self-evolution. Rombaut's source-code taxonomy of 13 coding agents identifies five composable loop primitives and finds most agents combine several [3]; Wei catalogs the architectural decisions of 70 agent systems [4]. Observability-driven harness evolution raises Terminal-Bench 2 pass@1 from 69.7% to 77.0% with the model frozen [1]. Self-Harness closes the loop — weakness mining from execution traces, minimal modification proposals, regression-tested validation — improving held-out Terminal-Bench 2.0 pass rates by 14–21 points across three models (e.g., 40.5%→61.9%) [2]. SICA [21], Meta-Harness [22], SIA [23], and the Darwin Gödel Machine [24] demonstrate agents editing their own scaffolds, all validated by downstream task performance. MeG4's L2 loop matches this shape; its distinguishing element is the portfolio-shape audit of §4 and the refusal to promote without paired significance.

Specifications as executable artifacts. The 2026 spec-driven-development wave treats specifications as living, lifecycle-spanning drivers rather than documentation: the SpecOps workshop states this as its founding vision [12]; Farrag argues specification discipline, not model capability, is the binding constraint on AI-assisted dependability [6]; structured specs measurably improve repository-level generation [7]; and LLMs can synthesize formal verification annotations from natural-language specs at high success rates [8]. MeG4's contract is this idea implemented end-to-end — including for the harness's own internals.

Verification disciplines we inherit. The requirement that a falsifier be demonstrated able to fail is classical: mutation testing [16], industrialized with LLMs at Meta [17], and the rotten-green-tests literature. Author-independence is classical IV&V (IEEE 1012 [18]), re-motivated in the LLM era by measured self-preference bias in LLM judges [19,20]. Spec staleness/drift is a named problem with emerging tooling [34]. MeG4 systematizes these into machine-checked, per-contract requirements enforced reflexively.

Multi-agent structure and communication. The trit descends from Contract-Net's accept/reject/propose performatives [14]. On how much imposed structure agent collectives need, the evidence is genuinely mixed: Dochkina finds self-organizing agents outperform designed hierarchies by 14% at the frontier, while models below a capability threshold still benefit from rigid structure [5]; AdaptOrch finds topology choice dominates once models converge [13]. MeG4 runs local mid-scale models and sells accountability, so it deliberately takes the conservative side: fix the commitment (the contract), grant freedom inside it.

Constrained generation of UIs and code. Vetted-inventory generation — an LLM plans while a deterministic assembler composes from approved components — is the design argument of Portal UX Agent [9] and SpecifyUI [10], and the de-facto strategy of commercial app generators. MeG4's pinned stack registry (§3.5) applies the same principle at project scale, with the falsifier kept stack-agnostic. Multi-turn correctness/security benchmarking [11] motivates our gate-per-turn design.

Personalization via adapters. S-LoRA-style multi-adapter serving [25] is shipped practice (e.g., per-feature on-device adapters in Apple Intelligence [35]); activated LoRAs frame adapters as agentic roles [26]; OPPU trains one PEFT per user [27], Profile-to-PEFT generates them on the fly [28]; context distillation and prompt baking compile prompted personas into weights [29,30]. §3.4 composes these into a spec-governed lifecycle.

3 The MeG4 system

3.1 Contracts and the trit

The unit of truth is the contract: a markdown note whose frontmatter couples intent to one or more executable acceptance checks with expected exit codes. Status is binary and honest: open (falsifier red — roadmap) or active (gated green). An append-only ledger records who authored, implemented, and verified. Inter-agent communication is restricted to the trit — {accept, reject, contract} — enforced three ways at once: as a Rust sum type (illegal states unrepresentable), as a constrained-decoding schema at inference time, and as the only legal hand-off between tiers. Every exchange is therefore either a verdict or a falsifiable promise; there is no third kind of message for drift to hide in.
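As a minimal sketch of how the trit and a contract's binary status could look as Rust types: the variant payloads, field names, and the `status` and `describe` helpers are illustrative assumptions, not MeG4's actual definitions.

```rust
/// The three legal inter-tier messages. Payload fields are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub enum Trit {
    /// Verdict: the work passed its falsifier.
    Accept,
    /// Verdict: the work failed; the failure is carried as evidence.
    Reject { evidence: String },
    /// A falsifiable promise: intent plus the checks that will judge it.
    Contract(ContractNote),
}

/// One executable acceptance check with its expected exit code.
#[derive(Debug, Clone, PartialEq)]
pub struct Check {
    pub command: String,
    pub expected_exit: i32,
}

/// Hypothetical in-memory form of a contract note's frontmatter.
#[derive(Debug, Clone, PartialEq)]
pub struct ContractNote {
    pub intent: String,
    pub checks: Vec<Check>,
}

/// Binary status: open (falsifier red) or active (gated green).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Open,
    Active,
}

/// Derives status from observed exit codes, one per check, in order.
/// A contract with no checks, or with missing results, stays open.
pub fn status(contract: &ContractNote, observed: &[i32]) -> Status {
    let all_green = !contract.checks.is_empty()
        && contract.checks.len() == observed.len()
        && contract
            .checks
            .iter()
            .zip(observed)
            .all(|(check, &code)| check.expected_exit == code);
    if all_green {
        Status::Active
    } else {
        Status::Open
    }
}

/// Exhaustive match with no wildcard arm: a fourth message kind
/// would be a compile error here until it is handled.
pub fn describe(msg: &Trit) -> &'static str {
    match msg {
        Trit::Accept => "verdict: accept",
        Trit::Reject { .. } => "verdict: reject",
        Trit::Contract(_) => "promise: contract",
    }
}
```

Under these assumptions, a contract whose single check expects exit 0 is `Active` only when the observed code is 0; a missing or mismatched result leaves it `Open`.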

3.2 Architecture

Five layers with one dependency direction. L0, the pure core (contracts, ledger, oracle, router, trit, roster), contains no I/O and no model names; purity is CI-gated and scale is configuration: swapping every model in the system is a one-word change in a `backends:` map. L1 is the native executor, shipped as one Rust binary: a streaming model client with cross-provider fallback, confined tools, the agent loop, and a relay/gateway for remote clients. L2 is the self-building loop (§3.6). L3 is verticals (software development first) as thin config overlays. L4, substrate descent toward owned ternary weights, is parked: its post-training quantization route was falsified internally, and we do not build on ground that failed its own falsifier.
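One way to read "no model names in the core" is sketched below. The two-map layout, the `seats` field, and the backend keys are hypothetical; only the existence of a backends map comes from the description above.

```rust
use std::collections::HashMap;

/// Hypothetical configuration shape: roles name a backend key, never a model.
pub struct Config {
    /// Role name to backend key, for example "worker" to "local".
    pub seats: HashMap<String, String>,
    /// Backend key to concrete model identifier.
    /// In this sketch it is the only place a model is named.
    pub backends: HashMap<String, String>,
}

impl Config {
    /// Resolves a role to its model identifier by following both maps.
    pub fn model_for(&self, role: &str) -> Option<&str> {
        let key = self.seats.get(role)?;
        self.backends.get(key).map(String::as_str)
    }
}
```

In this layout, replacing the value stored under one backend key changes the model for every role that points at that key, and the core code never sees the change.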

3.3 The tiered engine

PLANNER: emits the distilled contract (intent, stack key, acceptance checks). Frontier-grade seat, few tokens.

→

MID: holds the entire context; verifies or repairs the contract; always checks the result. Cheap seat, reads everything.

→

WORKER: executes with confined tools on local GPUs; escalates only on stall.

→

GATE: out-of-band falsifier run. Pass → ship and record; fail → retry with failure as evidence, or an honest red.

Figure 1 — one work order through the tiers; the gate runs outside the model's control loop.

The economics follow from the separation: the expensive model reads little; the model that reads everything is cheap; the laborer is local. All performance claims follow the equal-model invariant [31]: benchmarks vary only the harness.
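The gate step of Figure 1 can be sketched as a retry loop in which the gate is a separate function the worker cannot influence. This is an illustration under assumptions: `run_work_order`, `Outcome`, and the attempt budget are hypothetical names, and escalation on stall is not modeled.

```rust
/// Result of one work order after the gate has had the final word.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    /// The gate passed; the artifact ships and is recorded.
    Shipped { attempts: u32 },
    /// The attempt budget ran out; the last failure is kept as evidence.
    HonestRed { attempts: u32, last_failure: String },
}

/// Runs a worker against an out-of-band gate.
/// `worker` receives the previous failure (if any) as evidence.
/// `gate` judges the artifact and returns the failure text on red.
pub fn run_work_order<W, G>(mut worker: W, gate: G, max_attempts: u32) -> Outcome
where
    W: FnMut(Option<&str>) -> String,
    G: Fn(&str) -> Result<(), String>,
{
    let mut evidence: Option<String> = None;
    for attempt in 1..=max_attempts {
        let artifact = worker(evidence.as_deref());
        match gate(&artifact) {
            Ok(()) => return Outcome::Shipped { attempts: attempt },
            // Feed the failure back to the next attempt.
            Err(failure) => evidence = Some(failure),
        }
    }
    Outcome::HonestRed {
        attempts: max_attempts,
        last_failure: evidence.unwrap_or_default(),
    }
}
```

The design point is that the worker only ever sees the gate's verdict as input to its next attempt; it has no code path that can mark its own work as passed.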

3.4 Roles as versioned prompt-specs with adapter seats

A role (planner, worker, judge, analyst; eventually customer-facing roles) is (i) a stable name in configuration, (ii) a version-controlled prompt-specification — its charter, rules, and evidence standards, and (iii) an adapter seat: any tier can mount LoRAs over the shared local base, hot-swapped per request [25]. When a role's ledger holds enough gate-verified episodes, its spec is compiled into an adapter — context distillation with a paper trail [29,30]: the spec's version history is the adapter's lineage, and the spec's falsifier is the adapter's acceptance test. Per-user personalization composes the same way [27].
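The lifecycle above can be sketched as data: a role carries its spec history and falsifier, and an adapter mounted in its seat inherits both. All type and field names are hypothetical, the episode threshold is a placeholder parameter, and the training step itself is elided.

```rust
/// An adapter that may occupy a role's seat.
#[derive(Debug, Clone, PartialEq)]
pub struct Adapter {
    /// Spec versions the adapter was distilled from, oldest first.
    pub lineage: Vec<String>,
    /// Acceptance test inherited from the spec.
    pub falsifier: String,
}

/// A role: stable name, versioned prompt-spec, and an adapter seat.
pub struct Role {
    pub name: String,
    pub spec_versions: Vec<String>,
    pub falsifier: String,
    pub seat: Option<Adapter>,
}

impl Role {
    /// Mounts an adapter once enough gate-verified episodes exist.
    /// Returns false, leaving the seat untouched, when the evidence is
    /// insufficient or the spec has no recorded history.
    pub fn try_compile(&mut self, verified_episodes: usize, min_episodes: usize) -> bool {
        if verified_episodes < min_episodes || self.spec_versions.is_empty() {
            return false;
        }
        self.seat = Some(Adapter {
            lineage: self.spec_versions.clone(),
            falsifier: self.falsifier.clone(),
        });
        true
    }
}
```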

3.5 The verification stack and the pinned-stack registry

Three gates, numbered independently of the architecture layers in §3.2: gate L1 is the native build; gate L2 is native tests pinned to exact test identities; gate L3 is agentic browser QA, in which an independent agent drives the built artifact against a checklist-as-contract (functional flows, dark mode, mobile, contrast, auth-gating) judged by a local vision model, with a reproduction gate that guards against the false-positive rates documented for agentic web QA. Upstream of the gates, a registry pins one technology stack per use case; the planner selects a key, a scaffold provides a known-good start, and the worker composes rather than invents [9,10]. The falsifier is stack-agnostic: it measures outcomes.

3.6 The self-building loop

L2 mines failure signatures from the ledger, proposes edits to the harness's own configuration and prompts, and validates each candidate on a frozen held-out set, promoting only on paired per-task significance (exact McNemar) with an efficiency veto. This matches the published shape [2]. Its current honest status is the subject of §5.
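The promotion rule can be illustrated with a two-sided exact McNemar test on the discordant pairs, where b counts tasks only the candidate passes and c counts tasks only the baseline passes. This is a sketch, not the project's statistics module: the `decide` function, the significance level, and the cost-ratio veto threshold are assumptions supplied by the caller.

```rust
/// Binomial coefficient C(n, k) as f64; exact while the value stays below 2^53.
/// Callers must ensure k <= n.
fn choose(n: u64, k: u64) -> f64 {
    let k = k.min(n - k);
    (0..k).fold(1.0, |acc, i| acc * (n - i) as f64 / (i + 1) as f64)
}

/// Two-sided exact McNemar p-value from the discordant counts b and c.
/// Under the null hypothesis each discordant pair is a fair coin flip.
pub fn exact_mcnemar_p(b: u64, c: u64) -> f64 {
    let n = b + c;
    if n == 0 {
        return 1.0;
    }
    let tail: f64 = (0..=b.min(c)).map(|i| choose(n, i)).sum();
    (2.0 * tail * 0.5f64.powi(n as i32)).min(1.0)
}

#[derive(Debug, PartialEq)]
pub enum Decision {
    Promote,
    Hold(&'static str),
}

/// One paired comparison of candidate against baseline on a held-out set.
pub struct PairedRun {
    pub n_tasks: u64,
    pub b: u64,
    pub c: u64,
    /// Candidate cost divided by baseline cost (assumed efficiency measure).
    pub cost_ratio: f64,
}

/// Promotes only when every condition holds; thresholds are caller-supplied.
pub fn decide(run: &PairedRun, min_n: u64, alpha: f64, max_cost_ratio: f64) -> Decision {
    if run.n_tasks < min_n {
        return Decision::Hold("held-out set smaller than the contract requires");
    }
    if run.b <= run.c {
        return Decision::Hold("candidate does not win more tasks than it loses");
    }
    if exact_mcnemar_p(run.b, run.c) >= alpha {
        return Decision::Hold("paired difference not significant");
    }
    if run.cost_ratio > max_cost_ratio {
        return Decision::Hold("efficiency veto");
    }
    Decision::Promote
}
```

With b=4 and c=0 the function returns 2 × (1/2)^4 = 0.125, the value reported for the run discussed in §5; with n_tasks=19 and min_n=20 the size check alone already holds the promotion.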

4 The reflexive audit: `doctor --self`

Self-improving harnesses ask "did my edit help?" MeG4 additionally asks a prior question: "are my own contracts capable of telling me the truth?" The `doctor --self` gate audits the portfolio's shape:

Falsifier presence and strength — every non-done contract carries an executable check; checks matching mask-prone patterns (exit-code-swallowing pipes, unpinned name filters) are rejected; losability must be demonstrated against a known-broken mutation [16,17].
Author ≠ arbiter — the falsifier's author may not be its satisfier; attestation is recorded in the ledger and audited with actor-alias normalization [18,19].
Staleness — attestations expire (30-day TTL); a contract nobody has re-touched is not evidence [34].
Portfolio re-execution — active is a claim about now, not about the day of implementation: the entire portfolio's checks re-run on schedule and any red blocks.
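The first three requirements above can be sketched as a per-contract lint. Everything here is a simplified assumption: the record fields, the violation names, the string heuristics, and the choice of `cargo test` with libtest's `--exact` flag as the pinned form. Actor normalization is reduced to trimming and lowercasing, and portfolio re-execution is not modeled.

```rust
#[derive(Debug, PartialEq)]
pub enum Violation {
    MissingCheck,
    ExitMaskingPipe,
    UnpinnedNameFilter,
    LosabilityNotDemonstrated,
    MissingAttestation,
    AuthorIsArbiter,
    StaleAttestation,
}

/// Hypothetical flattened view of one contract and its ledger entries.
pub struct ContractRecord {
    pub check: Option<String>,
    pub losability_demonstrated: bool,
    pub author: String,
    pub verifier: Option<String>,
    pub attestation_age_days: u32,
}

/// Naive alias normalization: trim and lowercase.
fn normalize(actor: &str) -> String {
    actor.trim().to_lowercase()
}

/// True when the command contains a single `|` pipe (logical `||` is ignored).
fn pipes_output(cmd: &str) -> bool {
    cmd.replace("||", "").contains('|')
}

/// True when `cargo test` is given a name filter without `--exact`.
fn has_unpinned_filter(cmd: &str) -> bool {
    let tokens: Vec<&str> = cmd.split_whitespace().collect();
    let start = match tokens.windows(2).position(|w| w[0] == "cargo" && w[1] == "test") {
        Some(pos) => pos + 2,
        None => return false,
    };
    let rest = &tokens[start..];
    let has_filter = rest
        .iter()
        .take_while(|t| **t != "--")
        .any(|t| !t.starts_with('-'));
    has_filter && !rest.contains(&"--exact")
}

/// Returns every violation found; an empty vector means the contract passes.
pub fn audit(c: &ContractRecord, ttl_days: u32) -> Vec<Violation> {
    let mut out = Vec::new();
    match &c.check {
        None => out.push(Violation::MissingCheck),
        Some(cmd) => {
            if pipes_output(cmd) {
                out.push(Violation::ExitMaskingPipe);
            }
            if has_unpinned_filter(cmd) {
                out.push(Violation::UnpinnedNameFilter);
            }
            if !c.losability_demonstrated {
                out.push(Violation::LosabilityNotDemonstrated);
            }
        }
    }
    match &c.verifier {
        // A missing verifier is a violation, not a trivial pass.
        None => out.push(Violation::MissingAttestation),
        Some(v) if normalize(v) == normalize(&c.author) => out.push(Violation::AuthorIsArbiter),
        Some(_) => {}
    }
    if c.attestation_age_days > ttl_days {
        out.push(Violation::StaleAttestation);
    }
    out
}
```

Treating an absent verifier as a violation is a deliberate choice in this sketch: an independence check that passes when no attestation exists cannot fail for the contracts that most need it.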

We stress the epistemics: this gate does not make the harness good; it makes the harness's claims about itself falsifiable. §5 shows both why that matters and how it can still fail — and what failing loudly buys.

5 Case study: the dual adversarial self-audit

On July 2, 2026 the operator commissioned two parallel, mutually-blind analyses of the system — its thesis, its stack registry, and its structure — over an identical evidence pack (repository metrics; verified 2026 sources):

Audit A (the system on itself): a dedicated analyst role (versioned prompt-spec per §3.4) running on the system's own tiered engine. Its report passed its acceptance gate in one iteration: 918k tokens in, 17.5k out, 991.6 s.
Audit B (independent): a 48-agent external workflow — six specialized finders (thesis, stacks, structure, contracts, product, self-improvement), each finding then attacked by an adversarial verifier with repository access.

Audit B produced 42 findings: 37 confirmed, 5 directionally-correct with imprecise evidence, 0 refuted. Selected confirmed findings, all reproducible from the audit artifacts:

Table 1 — principal confirmed findings of the dual self-audit (July 2, 2026).
Finding	Evidence
40/55 contract falsifiers maskable by construction. The prevailing check pattern piped test output to `grep`, returning grep's exit status; any earlier passing suite satisfies the match.	Empirical reproduction: a deliberately broken integration test (test runner exit 101) yielded a passing falsifier. The masked pattern violated a hard lesson recorded in the project's own architecture notes.
35/47 checks filtered tests by name substring, unpinned — deleting the real test leaves the check green.	Static count over the portfolio; per-contract examples in the audit report.
The independence check was vacuous. It compared self-reported ledger actor strings; 40/55 contracts had no doer entries at all and passed trivially.	Ledger histogram across all 55 contracts; the state document itself listed independence as unresolved while the audit gate showed green.
A self-improvement win had been declared against a red falsifier. An internal result file announced "credible positive slope → flip to green" on a run with held-out n=19 where the contract requires n≥20; z = Δ/SE = 1.59 < the loop's own 2σ bar; best-case exact McNemar (13-vs-9/19; b=4, c=0) p=0.125.	Executing the contract's falsifier on the committed artifact: exit 1. The paired gate built to catch exactly this had never been exercised: 0/23 audit-log decisions carried a McNemar field; both prior promotions used a legacy aggregate rule.
The improvement loop was flat by construction. Proposal generation was memoryless with a closed candidate pool; the open LLM generator silently fell back to canned strings in 6/6 recorded rounds.	One identical rejected proposal was re-measured in three separate runs at 2.27M, 5.82M, and 3.74M tokens each — the cost of having no memory between runs.
Resolution was insufficient for the claim. The loop's own statistics module computes a minimum detectable effect of ≈0.40 (≈8 tasks) at n=19; the celebrated Δ was 0.211 (4 tasks).	Direct evaluation of the project's `min_detectable_delta` on the run's parameters.
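The masking mechanism in the first row of Table 1 can be reproduced in miniature. The sketch assumes a POSIX `sh` and `grep` on the path; the runner command is a stand-in that prints one passing line and then exits 101, not the project's actual check string, and the pattern is interpolated into the script without escaping.

```rust
use std::process::Command;

/// Masked form: output is piped to grep, so the shell reports grep's exit
/// status and the runner's own exit code is discarded.
pub fn masked_check(runner: &str, pattern: &str) -> bool {
    let script = format!("( {} ) 2>&1 | grep -q '{}'", runner, pattern);
    Command::new("sh")
        .arg("-c")
        .arg(script)
        .status()
        .map(|s| s.success())
        .unwrap_or(false)
}

/// Exit-code-pure form: the runner's exit code is judged first, and the
/// output match is an additional condition rather than a replacement.
pub fn pure_check(runner: &str, pattern: &str, expected_exit: i32) -> bool {
    match Command::new("sh").arg("-c").arg(runner).output() {
        Ok(out) => {
            out.status.code() == Some(expected_exit)
                && String::from_utf8_lossy(&out.stdout).contains(pattern)
        }
        Err(_) => false,
    }
}
```

For a runner such as `echo 'test unit_suite ... ok'; exit 101`, the masked form reports green because an earlier passing line satisfies the match, while the pure form reports red because the exit code is not the expected 0.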

5.1 The meta-lesson

Audit A — the system on itself — agreed with Audit B on the core diagnosis (the unmeasured improvement slope; oversized files; untested product surface; sound crate topology) but found none of Table 1: it did not descend from documentation to code, its report contained duplicated blocks and factual slips, and it passed its acceptance gate anyway. That gate checked form — required sections, evidence-reference counts, minimum length — not truth. This is the same disease Table 1 documents in the contract portfolio, recursing one level up: a winnable falsifier is not yet a falsifier with teeth. We record this as the case study's central result: form-gates are cheap and necessary, but strength must itself be audited — which is precisely the reflexive requirement of §4, now extended to analysis deliverables.

5.2 Governance response

Every confirmed finding became a red executable contract in the portfolio; the merged report became the governing roadmap; and repair was sequenced falsification-first: migrate all checks to pinned, exit-code-pure form (each proven winnable, with honest reds recorded); harden the gate executor against exit-masking; introduce portfolio re-execution and falsifier-strength linting into doctor --self; only then resume feature work. The repair wave ran as supervised multi-agent workflows with adversarial review on every change and mutation-injection verification that migrated checks actually fail on broken code. At the time of writing the wave is in progress; its numbers will be reported when its own gates are green — consistent with the discipline this paper describes.
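Mutation-injection verification of a migrated check reduces to running it twice, once on the clean artifact and once on a deliberately broken one. The sketch below models a check as a closure over source text; the `Strength` names and the `classify` helper are hypothetical.

```rust
/// What two runs of a check reveal about its strength.
#[derive(Debug, PartialEq)]
pub enum Strength {
    /// Green on clean code and red on the mutant: winnable and losable.
    Proven,
    /// Red on clean code: recorded as an honest red, not hidden.
    HonestRed,
    /// Green on the mutant: the check cannot detect the breakage.
    Toothless,
}

/// Classifies a check by running it against a clean and a mutated artifact.
/// `check` returns true when the check would report green.
pub fn classify<F>(check: F, clean: &str, mutant: &str) -> Strength
where
    F: Fn(&str) -> bool,
{
    if check(mutant) {
        Strength::Toothless
    } else if check(clean) {
        Strength::Proven
    } else {
        Strength::HonestRed
    }
}
```

The mutant is tested first on purpose: a check that stays green on broken code is rejected regardless of how it behaves on clean code.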

6 Economics (a model, not a measurement)

We state unit economics with the same honesty budget. On the project's internal pricing model (representative task of 290k input / 30k output tokens; June 2026 list prices), the tiered stack costs ≈$0.094 per task with the local worker carrying the volume, prices at ≈$0.39 per blended Mtok, and lands roughly 7–15× below flagship-API list pricing for the same token volume. The worker's capital cost (two compact GPU workstations, ≈$7k) amortizes at ~$120–195/month with modeled break-even near 5–6k worker-tasks/month against a serving capacity of roughly 25k. A result cache in the gateway was implemented and live-verified (identical deterministic request served at zero marginal cost), scaling effective cost by (1−hit-rate). These figures derive from a documented model whose token-split is explicitly declared illustrative; the equal-model head-to-head (~15–20% cheaper at equal pass rate) is a stated target with no committed run behind it yet, and we refuse to promote it until one exists.
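Two of the relations above are simple enough to write down. The sketch assumes straight-line amortization, and the 36-month and 60-month horizons used in the note that follows are our assumption for illustration; the source states only the capital cost and the monthly range. Function names are hypothetical.

```rust
/// Straight-line monthly amortization of a capital cost.
/// `months` must be greater than zero.
pub fn monthly_amortization(capex_usd: f64, months: u32) -> f64 {
    capex_usd / months as f64
}

/// Effective per-task cost when a fraction of requests is served from cache
/// at zero marginal cost. The hit rate is clamped to the range [0, 1].
pub fn effective_cost(cost_per_task_usd: f64, hit_rate: f64) -> f64 {
    cost_per_task_usd * (1.0 - hit_rate.clamp(0.0, 1.0))
}
```

Under the assumed horizons, $7,000 amortizes to about $194 per month over 36 months and about $117 per month over 60, which brackets the quoted ~$120–195 range; a 50% cache hit rate would halve the modeled $0.094 per task to $0.047.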

7 Limitations and honest status

The central claim of the thesis is not yet demonstrated. The self-improvement slope remains statistically indistinguishable from zero at current held-out resolution; the one positive artifact failed its own falsifier (§5). The honest re-run requires n≥40, ≥3 repetitions, and the paired gate active.
Self-improvement is not yet human-free. In one gated self-edit the local worker produced correct logic but corrupted file integrity; a human removed the duplicated artifacts before the independent falsifier passed. The honest claim is "self-improvement with human editorial supervision."
Economics are modeled, not measured (§6). An external-benchmark cost figure circulating in project memory (~12×) does not exist in the repository and is excluded here.
Per-role adapter results are claimed, not committed. The training-run metric lives outside the repository, and the specific worker-adapter line was cancelled when the base model changed; the lifecycle of §3.4 is design-plus-precedent, not a reported result.
Quality of the quantized local worker was not re-scored after its 3.3× serving speedup (67 vs ~20 tok/s), and single-run worker variance is large (pass 7↔10/19 across repetitions), which bounds what any small study can conclude.
Single site, single operator. All numbers come from one deployment; no external replication yet.

8 Conclusion

MeG4 operationalizes a simple, harsh idea: a claim without an executable falsifier is decoration — and that must include the system's claims about itself. The dual self-audit shows why: a portfolio that looked disciplined (55 contracts, green dashboards) was quietly maskable at scale, and the system had already declared one victory its own falsifier rejects. The same architecture that produced the failure also produced the correction: independent adversarial verification, findings as red contracts, falsification repaired before features. What remains is the experiment this entire design exists to win honestly — a self-improvement slope, measured at adequate resolution, promoted by paired significance, on a substrate whose checks can no longer lie. We will report it when its gate is green, and not before.

References

[1] J. Lin, S. Liu, C. Pan, et al. Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses. arXiv:2604.25850, Apr 2026.
[2] H. Zhang, S. Zhang, K. Li, et al. Self-Harness: Harnesses That Improve Themselves. arXiv:2606.09498, Jun 2026.
[3] B. Rombaut. Inside the Scaffold: A Source-Code Taxonomy of Coding Agent Architectures. arXiv:2604.03515, Apr 2026.
[4] H. Wei. Architectural Design Decisions in AI Agent Harnesses. arXiv:2604.18071, Apr 2026.
[5] V. Dochkina. Drop the Hierarchy and Roles: How Self-Organizing LLM Agents Outperform Designed Structures. arXiv:2603.28990, Mar 2026.
[6] S. E. Farrag. The Productivity-Reliability Paradox: Specification-Driven Governance for AI-Augmented Software Development. arXiv:2605.01160, May 2026.
[7] S. Feng, B. Chen, B. H. Meyer, G. Mussbacher. LLM-Assisted Repository-Level Generation with Structured Spec-Driven Engineering. arXiv:2605.02455, May 2026.
[8] J. P. Faria, E. Trigo, V. Honorato, R. Abreu. Automatic Generation of Formal Specification and Verification Annotations Using LLMs and Test Oracles. arXiv:2601.12845, Jan 2026.
[9] X. Li, N. Jiang, J. Selvaraj. Portal UX Agent — A Plug-and-Play Engine for Rendering UIs from Natural Language Specifications. arXiv:2511.00843, Nov 2025.
[10] Y. Chen, C. Shi, L. Chen. SpecifyUI: Supporting Iterative UI Design Intent Expression through Structured Specifications and Generative AI. arXiv:2509.07334, Sep 2025.
[11] R. Rawal, J. Y. F. Chiang, C. Shen, et al. Benchmarking Correctness and Security in Multi-Turn Code Generation. arXiv:2510.13859, Oct 2025.
[12] SpecOps 2026 — 1st International Workshop on Specification-Driven Development Life Cycle. Co-located with SPLASH/ISSTA 2026, Oakland, CA. conf.researchr.org/home/splash-issta-2026/specops-2026.
[13] AdaptOrch: Task-Adaptive Multi-Agent Orchestration in the Era of LLM Performance Convergence. arXiv:2602.16873, Feb 2026.
[14] R. G. Smith. The Contract Net Protocol. IEEE Trans. Computers, 1980; FIPA Contract Net Interaction Protocol, SC00029, 2002.
[15] Meta-Engineering Harnesses for AI-Native Software Production. arXiv:2605.25665, May 2026.
[16] R. A. DeMillo, R. J. Lipton, F. G. Sayward. Hints on Test Data Selection: Help for the Practicing Programmer. IEEE Computer, 1978.
[17] Mutation-Guided LLM-Based Test Generation at Meta. arXiv:2501.12862, FSE 2025.
[18] IEEE Std 1012 — System, Software, and Hardware Verification and Validation (independent V&V).
[19] Self-Preference Bias in LLM-as-a-Judge. arXiv:2410.21819, 2024.
[20] Quantifying and Mitigating Self-Preference Bias in LLM Evaluators. arXiv:2604.22891, Apr 2026.
[21] SICA: A Self-Improving Coding Agent. arXiv:2504.15228, 2025.
[22] Meta-Harness. arXiv:2603.28052, Mar 2026.
[23] SIA: Self-Improving Agents. arXiv:2605.27276, May 2026.
[24] J. Zhang, et al. The Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents. 2025.
[25] Y. Sheng, et al. S-LoRA: Serving Thousands of Concurrent LoRA Adapters. arXiv:2311.03285, 2023.
[26] Activated LoRA (aLoRA): agents as switchable adapters. arXiv:2512.17910, Dec 2025.
[27] Z. Tan, et al. Democratizing Large Language Models via Personalized Parameter-Efficient Fine-tuning (OPPU). arXiv:2402.04401, 2024.
[28] Profile-to-PEFT: hypernetwork-generated per-user adapters. arXiv:2510.16282, Oct 2025.
[29] C. Snell, D. Klein, R. Zhong. Learning by Distilling Context. arXiv:2209.15189, 2022.
[30] Prompt Baking. arXiv:2409.13697, 2024.
[31] Stop Comparing LLM Agents Without Disclosing the Harness. arXiv:2605.23950, May 2026.
[32] HAL: The Holistic Agent Leaderboard. arXiv:2510.11977, Oct 2025.
[33] Cost-Controlled Evaluation of AI Agents. arXiv:2407.01502, 2024.
[34] GitHub Spec Kit — spec-driven development tooling (drift-detection quality gates). github.com/github/spec-kit.
[35] Apple. Introducing Apple Foundation Models — per-feature on-device LoRA adapters. machinelearning.apple.com, 2024.

Artifacts. The audit reports, evidence pack, contrast document, and governing roadmap cited in §5 are internal repository artifacts (runs/self_analysis/, docs/NORTE_REBUILD.md); numbers in Table 1 are reproducible from them. External-source claims were verified against the live web on July 2, 2026; two of our own prior citations were corrected in the process (one inverted reading, one inflated effect size) — that correction pass is itself an instance of the discipline this paper argues for.
