I built a methodology called Directed Intelligence — one strategist directing multiple AI specialists to produce work that no single mind could reach alone. I’ve been running it for clients. I believe in it.
Tonight I decided to break it.
The Setup
Three AI models from three different corporations — Anthropic, OpenAI, and Google — each given the same task: stress-test the core claim of Directed Intelligence. No cross-talk. No shared context. Each working independently.
The independence mattered. If three systems sharing no runtime context converge on the same critique, that’s signal. If they diverge, that’s signal too. This tribunal was Directed Intelligence in action — not a test run before the real thing, but the methodology eating its own premise.
The claim they were asked to attack:
“A skilled human operator directing specialized AI agents produces systematically better output than either the human or the AI working alone.”
Their instructions were simple: be a skeptic. Be specific. Find what’s wrong.
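For anyone who wants to run the same experiment: mechanically, Round 1 is one prompt sent to three fresh contexts. Here is a minimal sketch in Python, assuming a hypothetical query_model helper in place of each vendor's SDK; the model names are placeholders, not real identifiers.

```python
# Round 1 fan-out: same prompt, three providers, zero shared context.
# `query_model` and the model names are hypothetical stand-ins; swap in
# the real vendor SDK calls of your choice.

CLAIM = (
    "A skilled human operator directing specialized AI agents produces "
    "systematically better output than either the human or the AI working alone."
)

SKEPTIC_PROMPT = (
    "Be a skeptic. Be specific. Find what is wrong with this claim:\n\n" + CLAIM
)

MODELS = ["anthropic-model", "openai-model", "google-model"]  # placeholders

def query_model(model: str, prompt: str) -> str:
    """Hypothetical single-call wrapper around one provider's API."""
    raise NotImplementedError

def round_one() -> dict[str, str]:
    # Each call starts a fresh conversation: no cross-talk, no shared history.
    return {model: query_model(model, SKEPTIC_PROMPT) for model in MODELS}
```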
Round 1 — Independent Attack
Each model came back with its own critique. No coordination. The findings converged anyway.
Anthropic identified the comparison as rigged. The relevant baseline isn’t “directed AI vs. cold AI” — it’s “this operator’s method vs. another competent strategist using AI their own way.” Against that baseline, I had zero evidence.
OpenAI flagged the lack of controlled studies and the claim’s dependence on operator style — even if the effect were real, it wouldn’t be systematically generalizable without more evidence.
Google went hardest. Called the mechanism — “direction changes what the system is willing to risk” — a category error. AI doesn’t have risk tolerance. The operator is changing prompts, not psychology. And the most dangerous possibility: the operator could be a confirmation bias engine, using AI to find sophisticated justifications for pre-existing beliefs, cloaked in machine objectivity.
Three companies. No coordination. Same structural critique. Worth noting: these models were trained on overlapping corpora, so their agreement isn't three fully independent data points. But the critiques attacked from different angles and named different failure mechanisms, which suggests something real was found.
Round 2 — Cross-Critique
Then each model read the other two’s attacks and revised their position. The revision round is where you find out if the first critiques were surface-level or structural.
These were structural.
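Mechanically, the cross-critique round is a second fan-out: each model gets the other two attacks appended to its context and is asked to revise. A sketch under the same assumptions as the Round 1 code:

```python
def query_model(model: str, prompt: str) -> str:
    """Hypothetical provider call, same stand-in as the Round 1 sketch."""
    raise NotImplementedError

def round_two(critiques: dict[str, str]) -> dict[str, str]:
    # Cross-critique: each model reads the other two attacks, then revises.
    revised = {}
    for model, own in critiques.items():
        others = "\n\n".join(text for m, text in critiques.items() if m != model)
        prompt = (
            "Your original critique:\n\n" + own
            + "\n\nTwo independent critiques of the same claim:\n\n" + others
            + "\n\nRevise your position. Keep what holds up; drop what does not."
        )
        revised[model] = query_model(model, prompt)
    return revised
```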
Anthropic revised: The phenomenon survives, but “experienced consultant using tools well gets good results” is true, uncontroversial, and not a discovery. The real question — bold-and-right vs. bold-and-lucky — can’t be answered in a blind test. Only in deployment over time.
OpenAI revised: Named the actual mechanism. Not “risk tolerance” but exploration-exploitation management — routing a problem to multiple agents simultaneously, decomposing tasks, adjudicating conflicting outputs under a fixed budget. The edge is real but conditional: it scales with problem ambiguity, shrinks on clean tasks, and can be matched by automated pipelines.
Google revised: And then landed the hardest blow. Every analyst in the tribunal — including the other two models — had accepted “skilled operator” as a given. None of them questioned what that means. If the skill can’t be defined, transferred, or taught, it’s not a methodology. It’s personal talent wearing a framework as a costume.
That one hit differently.
The Verdict
Partially True.
The phenomenon is real. The explanation was wrong. The comparison was rigged.
What Survived
Ensemble management works. A skilled operator decomposing ambiguous problems, routing to specialized agents, adjudicating conflicting outputs, and pushing past conservative defaults does produce better results on messy, ill-posed problems than a single unmanaged AI pass.
The mechanism isn’t mystical. It’s exploration-exploitation management: sampling more of the output distribution and applying human judgment where models default to safe. That’s defensible.
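That mechanism is concrete enough to sketch. Assuming a hypothetical sample_model call and an operator_score function standing in for operator judgment, the loop is: sample wide, then adjudicate.

```python
def sample_model(task: str, temperature: float) -> str:
    """Hypothetical model call; higher temperature samples more of the
    output distribution."""
    raise NotImplementedError

def operator_score(candidate: str) -> float:
    """Stand-in for operator judgment: whatever rubric the problem demands
    (correctness, novelty, fit). This is the part that has not been shown
    to transfer."""
    raise NotImplementedError

def ensemble_pass(task: str, k: int = 3) -> str:
    # Exploration: widen the solution space with k independent samples
    # instead of accepting one conservative pass.
    candidates = [sample_model(task, temperature=0.9) for _ in range(k)]
    # Exploitation: apply judgment where the model defaults to safe.
    return max(candidates, key=operator_score)
```

On a well-defined task the k samples converge and the edge shrinks, which is exactly the condition the tribunal attached.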
What Didn’t Survive
“Direction changes what the system is willing to risk” — AI has no risk tolerance. This was prompt engineering dressed up as psychology.
“Systematically better” — No controls, no resource parity, no external evaluation. You can’t claim systematic superiority from self-evaluated work.
“Than either working alone” — The real baseline is another competent strategist using AI their way. Against that baseline, there’s no evidence. The surviving claim beats single-pass AI. Not skilled alternatives.
Transferability — If “skilled operator” can’t be defined or taught, it’s not a methodology. It’s talent. That’s the next honest problem to solve.
The Reframe
The thesis doesn’t collapse. It sharpens.
The defensible claim:
On complex, ambiguous problems, a skilled operator directing an ensemble of AI agents — decomposing tasks, routing to specialized models, adjudicating conflicting outputs — can outperform single-pass AI. Not because the system takes risks. Because ensemble management widens the solution space and applies human judgment where models default to conservative.
The conditions: this edge scales with problem ambiguity, shrinks on well-defined tasks, and depends on operator skill that hasn’t been shown to transfer.
Why Publish This
Most people in this space would bury this. Run the test, read the results, quietly adjust the messaging.
I’m publishing it because the methodology demands it. If the system’s value is stress-testing ideas until only the truth survives, the first idea that has to survive is the system itself.
The operator is the moat. Not the method. Not the tools. The judgment that knows which question to ask and which output to trust. When to break your own thesis.
That’s what Directed Intelligence actually is. Now it’s honest about what it isn’t.
Three models. Three corporations. Two rounds. Six critiques. One verdict. This is what survives.