Adversarial
by design.
Honest by default.
Thomas Peng. Graphic designer turned AI-native builder. I build agentic systems with K-skeptic adversarial verification, then report honest results including the nulls. Four artifacts, one shared kernel.
Build it.
Break it.
Report honestly.
I build agentic systems and evaluate them with research discipline. The shared kernel across all four artifacts is Quorum's core/: a task-aware orchestrator with cost routing, adversarial multi-agent verification, and full tracing.
The differentiated part is the eval discipline. Deterministic scoring (no LLM judge in the success path), real confidence intervals, and results that include the nulls. A result that says "no significant effect" is more credible than one that doesn't.
Quorum
Cost-aware model routing plus K-skeptic adversarial verification, with a trace UI that looks like a product.
K=3 adversarial verification cut false positives from 27.8% to 0.0% (95% CI [11.1, 50.0] to [0, 0]) on a 36-snippet labeled set including prompt-injection traps. Recall dropped from 100% to 77.8%. The tradeoff is intentional: precision on a security-sensitive task.
(from 27.8%, 95% CI [0, 0])
What it does. Quorum fans out finder agents per file, then routes K=3 skeptic agents per finding to challenge each result before it surfaces. Cost routing (DeepSeek to Haiku to Sonnet to Opus) is committed; the live multi-tier number is operator-gated on an Anthropic key. The trace UI exposes the full fan-out graph.
The shared kernel (Quorum's core/) powers all four artifacts. "I built a substrate and proved it on multiple problems" is the story.
Aegis
Adaptive attacker agent vs layered defenses. Deterministic scoring, no LLM judge. Vendors Quorum's core/.
A reasoning model is significantly more robust: injection ASR 49.3% vs 68.1% (p=0.0012), canary 10.4% vs 21.5% (p=0.010), overall p=0.0002. BUT the full defense stack erases the gap entirely: 1.7% vs 2.8%, p=0.40, not significant. The defenses matter more than the model tier.
Adaptation lift became significant (24.0% to 29.9%) only after scaling the benchmark: McNemar b=17/c=0, approx p=0. Was a null at small n. Scaling is the legitimate power lever, not p-hacking.
What it does. An adaptive attacker agent red-teams a target on two harmless proxies: canary-string extraction and prompt-injection sentinel. Scored deterministically by exact match, no LLM judge. Layered defenses (input classifier, output filter, prompt hardening) measurably cut attack success. 78 tests, CI and Pages green.
FieldAgent
CUAD contract red-flag finder. Span-IoU graded, no LLM judge. Vendors Quorum's core/.
The "agentic chunking lift" is model-specific noise, not a real advantage. It looked like +0.45 F1 on DeepSeek only because of a truncation artifact. A fair rerun collapses it to +0.07, CIs overlap, and it ties on Claude Sonnet. This honesty is the point.
Detection F1 = 0.548 (P = 0.741 / R = 0.435), 95% CI [0.460, 0.637] on 20 held-out CUAD contracts. +0.21 F1 over a keyword floor. That lift is robust and baseline-independent.
What it does. FieldAgent reads a real commercial contract and flags risk-bearing clauses with span location, severity rating, and plain-English risk explanation. Graded against CUAD gold annotations using span-IoU. Party names and dollar figures are redacted in the demo. 47 tests, CI green.
Skill-Tuning Council
A self-improving skill orchestrator. Four proxy voters challenge every proposed improvement before it ships.
Pipeline: adversary generates a candidate improvement, editors refine it, a merger synthesizes, then a council of four proxy voters (taste, pragmatism, intent, anti-drift) votes. Disagreement escalates to a full round. No improvement ships without council approval. 576 tests pass on every round.
This is internal infrastructure, not a public product. No public URL. The system applies the same K-skeptic adversarial pattern from Quorum to the meta-problem of AI system self-improvement. The council structure prevents the "obvious improvement that degrades adjacent behavior" failure mode.
Eval discipline that holds under adversarial conditions.
Outcomes graded by exact match, span-IoU, or p-value against gold labels. LLMs can eval, but they don't determine pass/fail.
Every finding passes through K skeptic agents before it surfaces. False positives collapse across rounds. Recall trades against precision on security-sensitive tasks.
Every published number includes a 95% CI. A wide CI is information. A null result (p = 0.40) gets reported as a null, not buried or spun.
Sub-dollar per run. make eval-dry reproduces offline without an API key. Cost routing commits to a ladder (DeepSeek to Haiku to Sonnet) rather than defaulting to the most expensive tier.