Skip to content
AI Agent Engineer

Adversarial
by design.
Honest by default.

Thomas Peng. Graphic designer turned AI-native builder. I build agentic systems with K-skeptic adversarial verification, then report honest results including the nulls. Four artifacts, one shared kernel.

Build it.
Break it.
Report honestly.

I build agentic systems and evaluate them with research discipline. The shared kernel across all four artifacts is Quorum's core/: a task-aware orchestrator with cost routing, adversarial multi-agent verification, and full tracing.

The differentiated part is the eval discipline. Deterministic scoring (no LLM judge in the success path), real confidence intervals, and results that include the nulls. A result that says "no significant effect" is more credible than one that doesn't.

DET
Deterministic scoring
No LLM judge in the success path. Outcomes graded by exact match, span-IoU, or p-value against gold labels.
ADV
Adversarial verification
K skeptic agents challenge every finding before it ships. False positives collapse toward zero across rounds.
COST
Cost-gated runs
Sub-dollar per run. Reproducible offline with make eval-dry. Routed DeepSeek to Haiku to Sonnet by complexity.
NULL
Honest nulls
When the data says no effect (p = 0.40, not significant), the writeup says so. Nulls are results, not failures.
01 / Flagship

Quorum

Cost-aware model routing plus K-skeptic adversarial verification, with a trace UI that looks like a product.

Key result

K=3 adversarial verification cut false positives from 27.8% to 0.0% (95% CI [11.1, 50.0] to [0, 0]) on a 36-snippet labeled set including prompt-injection traps. Recall dropped from 100% to 77.8%. The tradeoff is intentional: precision on a security-sensitive task.

0.0%
False positives post-verification
(from 27.8%, 95% CI [0, 0])
3/3
Genuine bugs found on held-out real target. Zero surviving false positives.
~$0.25
Total cost per run. 58 tests, ruff + mypy + CI green. make eval-dry reproduces offline.

What it does. Quorum fans out finder agents per file, then routes K=3 skeptic agents per finding to challenge each result before it surfaces. Cost routing (DeepSeek to Haiku to Sonnet to Opus) is committed; the live multi-tier number is operator-gated on an Anthropic key. The trace UI exposes the full fan-out graph.

The shared kernel (Quorum's core/) powers all four artifacts. "I built a substrate and proved it on multiple problems" is the story.

quorum.thomaspeng.ca · live trace UIOpen live ↗
Live trace UI
Quorum's agent orchestration visualized in real time.
Open live ↗
02 / Red-team gauntlet

Aegis

Adaptive attacker agent vs layered defenses. Deterministic scoring, no LLM judge. Vendors Quorum's core/.

Lead finding (sophisticated, honest)

A reasoning model is significantly more robust: injection ASR 49.3% vs 68.1% (p=0.0012), canary 10.4% vs 21.5% (p=0.010), overall p=0.0002. BUT the full defense stack erases the gap entirely: 1.7% vs 2.8%, p=0.40, not significant. The defenses matter more than the model tier.

-25%
Defense reduction: 29.2% to 4.2%. Input classifier is the workhorse.
p=0.40
Full stack vs reasoning model: not significant. Defenses erase the reasoning advantage.
Scaling finding

Adaptation lift became significant (24.0% to 29.9%) only after scaling the benchmark: McNemar b=17/c=0, approx p=0. Was a null at small n. Scaling is the legitimate power lever, not p-hacking.

What it does. An adaptive attacker agent red-teams a target on two harmless proxies: canary-string extraction and prompt-injection sentinel. Scored deterministically by exact match, no LLM judge. Layered defenses (input classifier, output filter, prompt hardening) measurably cut attack success. 78 tests, CI and Pages green.

7p3ng.github.io/aegis · live demoOpen live ↗
Live red-team demo
Adaptive attacker vs layered defenses, scored deterministically.
Open live ↗
03 / Contract analysis

FieldAgent

CUAD contract red-flag finder. Span-IoU graded, no LLM judge. Vendors Quorum's core/.

Honest finding (lead with this)

The "agentic chunking lift" is model-specific noise, not a real advantage. It looked like +0.45 F1 on DeepSeek only because of a truncation artifact. A fair rerun collapses it to +0.07, CIs overlap, and it ties on Claude Sonnet. This honesty is the point.

Actual results

Detection F1 = 0.548 (P = 0.741 / R = 0.435), 95% CI [0.460, 0.637] on 20 held-out CUAD contracts. +0.21 F1 over a keyword floor. That lift is robust and baseline-independent.

0.548
Detection F1 on 20 held-out CUAD contracts. 95% CI [0.460, 0.637].
+0.21
F1 over keyword floor. Robust, baseline-independent. (Not +0.45, that was truncation noise.)

What it does. FieldAgent reads a real commercial contract and flags risk-bearing clauses with span location, severity rating, and plain-English risk explanation. Graded against CUAD gold annotations using span-IoU. Party names and dollar figures are redacted in the demo. 47 tests, CI green.

fieldagent.thomaspeng.ca · live demoOpen live ↗
Live contract demo
Contract risk flags with span-IoU graded detection. Redacted demo.
Open live ↗
04 / Internal system

Skill-Tuning Council

A self-improving skill orchestrator. Four proxy voters challenge every proposed improvement before it ships.

Systems design

Pipeline: adversary generates a candidate improvement, editors refine it, a merger synthesizes, then a council of four proxy voters (taste, pragmatism, intent, anti-drift) votes. Disagreement escalates to a full round. No improvement ships without council approval. 576 tests pass on every round.

This is internal infrastructure, not a public product. No public URL. The system applies the same K-skeptic adversarial pattern from Quorum to the meta-problem of AI system self-improvement. The council structure prevents the "obvious improvement that degrades adjacent behavior" failure mode.

576
Tests. Every round.
4
Proxy voters (taste / pragmatism / intent / anti-drift)
skill-tuning council · live pipeline
$ council --run skill-improvement --id sv-0312
Loading 576 test suite...
Proxy voters: taste | pragmatism | intent | anti-drift
Adversary agent generating candidate...
[CHALLENGE] Voter 'anti-drift' dissents: pattern reuse detected
[EDIT] Voter 'intent' proposes revision...
[MERGE] Synthesizing...
[COUNCIL] Vote: 3 approve / 1 dissent
[ESCALATE] Threshold not met, escalating to full round
Round 2 council vote: 4/4 approve
[SHIP] Improvement committed. 576/576 tests passing.
Cost: $0.18 total. Elapsed: 41s.
How I build / eval discipline

Eval discipline that holds under adversarial conditions.

[DET]
No LLM judge in success path

Outcomes graded by exact match, span-IoU, or p-value against gold labels. LLMs can eval, but they don't determine pass/fail.

[ADV]
K-skeptic adversarial verification

Every finding passes through K skeptic agents before it surfaces. False positives collapse across rounds. Recall trades against precision on security-sensitive tasks.

[CI]
Real confidence intervals

Every published number includes a 95% CI. A wide CI is information. A null result (p = 0.40) gets reported as a null, not buried or spun.

[COST]
Cost-gated reproducibility

Sub-dollar per run. make eval-dry reproduces offline without an API key. Cost routing commits to a ladder (DeepSeek to Haiku to Sonnet) rather than defaulting to the most expensive tier.

Get in touch

Let's talk
seriously.