
Axion Engine: Adversarial R&D Operating System

Domain-agnostic R&D pipeline where three models attack each other's output across CS, clinical medicine, and IoT firmware.

Bottom Line

Three-model adversarial pipeline across 3 domains (CS, clinical, IoT). 152 production sessions, zero fluff. Trade-off: 3x inference cost for cross-vendor debate.

// system_metrics
production_sessions: 152
active_domains: 3
adversarial_stages: 6
fluff_score: 0%

The Problem

Single-model pipelines produce documentation that sounds authoritative but isn’t

Standard single-vendor pipelines default to safe, shallow output. For R&D documentation — distributed systems references, clinical research protocols, production firmware specs — shallow content actively misleads. One model can’t catch its own hallucinations, and confirmation bias compounds over hundreds of sections.

  • Confirmation bias: same-vendor review agrees 2x more than cross-vendor review
  • No domain isolation: one pipeline can’t serve CS, medicine, and IoT without rewriting
  • Context amnesia: session 100 has no memory of what failed in sessions 1-99
  • Zero quality gates: sending structurally broken output to expensive reviewers wastes tokens

The Architecture

Fig 1: Adversarial review pipeline with meta-reflection (Producer draft → AST linter → Skeptical CTO critique → Reviewer verdict → meta-reflection feedback loop)

Cross-vendor adversarial pipeline with deterministic linting and self-evolving intelligence

Axion Engine is a 5,859 LOC Python system that orchestrates three agents in an adversarial loop:

The Producer drafts deep technical material with extended reasoning and full context. The Skeptical CTO attacks the draft as a cynical staff engineer — finding hallucinations, missing mechanisms, and unsupported claims. The Reviewer sees both draft and critique, issuing ACCEPT/REVISE/REJECT verdicts.
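In pseudocode, one adversarial round looks roughly like this. The verdict strings come from the text above; the agent interfaces (draft, attack, judge, revise) and the revision cap are assumptions for illustration, not the engine's actual API.

from dataclasses import dataclass

@dataclass
class Verdict:
    decision: str  # "ACCEPT" | "REVISE" | "REJECT"
    notes: str

def adversarial_round(spec, producer, cto, reviewer, max_revisions=3):
    # Producer drafts with extended reasoning and full context.
    draft = producer.draft(spec)
    for _ in range(max_revisions):
        # Skeptical CTO attacks the draft: hallucinations, missing
        # mechanisms, unsupported claims.
        critique = cto.attack(draft)
        # Reviewer sees both the draft and the critique before ruling.
        verdict = reviewer.judge(draft, critique)
        if verdict.decision == "ACCEPT":
            return draft
        if verdict.decision == "REJECT":
            draft = producer.draft(spec)  # start over from the spec
        else:  # REVISE
            draft = producer.revise(draft, critique, verdict.notes)
    raise RuntimeError("Section failed adversarial review")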

A deterministic linter gate runs before stochastic review — Python AST validation, D2 diagram checks, caption enforcement, and constraint satisfaction. Structurally broken output is rejected instantly at zero cost, saving reviewer tokens.
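The gate itself is almost entirely standard-library work. A minimal sketch, assuming each section object exposes its embedded Python snippets, D2 diagrams, and constraints; the section model and attribute names are illustrative, not the engine's real schema.

import ast

def lint_section(section) -> list[str]:
    errors = []
    # Hard syntax check on embedded Python: deterministic, instant, free.
    for snippet in section.python_snippets:
        try:
            ast.parse(snippet)
        except SyntaxError as exc:
            errors.append(f"python snippet fails AST parse: {exc}")
    # Structural check on diagrams: every D2 diagram needs a caption.
    for diagram in section.d2_diagrams:
        if not diagram.caption:
            errors.append(f"diagram {diagram.name} missing caption")
    # Constraint satisfaction, e.g. required headings per section type.
    for heading in section.constraints.get("required_headings", []):
        if heading not in section.headings:
            errors.append(f"missing required heading: {heading}")
    return errors  # any entry => instant reject, no reviewer inference spent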

Domain decoupling: one engine, three domains

The engine is domain-agnostic. Domain-specific behavior comes from YAML configs, prompt templates, and knowledge bases. Adding a new domain means adding a folder — not rewriting code.

Domain                                   Quality Bar                          Adversary Persona
Computer Science (distributed systems)   Kleppmann-level reference standard   Cynical distributed-systems CTO
Clinical Medicine                        NEJM / Lancet standard               Cynical journal editor
Production Firmware (IoT)                AWS Well-Architected                 N/A (firmware export)
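Concretely, a domain is a folder. A hypothetical layout and config for the clinical domain (the file names and YAML keys are illustrative, not the engine's actual schema):

domains/clinical_medicine/
    config.yaml         # quality bar, persona, linter rules
    prompts/            # Producer, CTO, and Reviewer templates
    knowledge_base/     # domain reference material

# config.yaml (illustrative keys)
quality_bar: "NEJM / Lancet standard"
adversary_persona: "Cynical journal editor"
linter:
  require_citations: true
  diagram_captions: true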

Self-evolving intelligence: the Singularity Loop

After each adversarial loop, a Meta-Reflection stage analyzes recurring failures, hallucination patterns, and protocol gaps. Observations accumulate in three registries — Signal Tracker, Pattern Registry, and Trait Registry. Session 152 is measurably smarter than session 1 because the engine encodes 152 sessions of failure data.
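A minimal sketch of the accumulation step, assuming each registry is an append-only JSON file. The registry names come from the text; the record schema and paths are assumptions.

import json
from pathlib import Path

REGISTRIES = {
    "signal": "signal_tracker.json",
    "pattern": "pattern_registry.json",
    "trait": "trait_registry.json",
}

def record_observation(kind: str, observation: dict,
                       root: Path = Path("registries")) -> None:
    root.mkdir(exist_ok=True)
    path = root / REGISTRIES[kind]
    entries = json.loads(path.read_text()) if path.exists() else []
    entries.append(observation)  # e.g. a recurring hallucination pattern
    path.write_text(json.dumps(entries, indent=2))

Later sessions assemble their prompts with these registries in context, which is what makes session 152 behave differently from session 1.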

Results

  • 152 production sessions logged with full provenance chains
  • 3 active domains from a single codebase with zero domain-specific engine code
  • 6-stage adversarial pipeline per section (Producer → Draft → Linter → CTO → Reviewer → Meta-Reflection)
  • 0% fluff score: tested against banned-word and specificity validators
  • 348 structured documents and 61 D2 architectural diagrams across all domains
  • Cross-vendor critique catches issues that single-vendor review misses: models from different vendors do not share blind spots
  • Linter gate rejects 15-20% of outputs before expensive reviewer calls
  • Crash recovery via JSON session state — no lost work on API timeouts (see the sketch below)
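The crash-recovery mechanic in the last bullet can be as small as an atomic write plus a resume check. A sketch, assuming one JSON state file per session; the path and field names are illustrative.

import json
from pathlib import Path

STATE = Path("sessions/current.json")

def checkpoint(state: dict) -> None:
    STATE.parent.mkdir(exist_ok=True)
    tmp = STATE.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(STATE)  # rename is atomic on the same filesystem,
                        # so a mid-write crash never corrupts state

def resume() -> dict:
    if STATE.exists():
        return json.loads(STATE.read_text())  # continue after an API timeout
    return {"completed_sections": [], "stage": "producer"}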

Architecture Trade-offs

Gain

Cross-vendor adversarial review catches significantly more issues than single-vendor pipelines. Blind spots that Claude misses, Gemini finds — and vice versa.

Cost

3x inference cost per section. Three model calls (Producer + CTO + Reviewer) instead of one. Accepted because the alternative — human subject-matter review — costs 100x more and takes days instead of minutes.

Gain

Deterministic linter gate rejects 15-20% of outputs at zero cost. AST validation, D2 diagram checks, and constraint satisfaction catch structural failures before expensive reviewer inference.

Cost

Rigid format constraints limit creative output. The linter enforces section structure, citation format, and diagram presence. For R&D documentation this is a feature. For creative writing it would be a liability.

Technology Stack

What we built with

Claude · Gemini · Multi-Agent Orchestration · Adversarial Review · Python
Similar challenge?

Deploy this architecture

Submit your requirements. We'll review your constraints, identify bottlenecks, and scope the path to production.

[ SUBMIT SPECS ]

No SDRs. A Principal Engineer reviews every submission.

From the team behind Production-Ready AI Agents (Amazon, 2025)