We tested the same prompts with and without distill knowledge. Same model, same session, same temperature. The only variable: 18 lines of retrieval rules + structured knowledge files.
Each scenario runs twice, each time in a separate Claude Code instance ("sandbox") with its own config directory:
Control (WITHOUT): Clean config. No rules, no knowledge files. Vanilla Claude Code.
Treatment (WITH): rules/distill.md loaded at session start + scenario-specific knowledge files in ~/.claude/distill/ structure.
Both conditions use the same Anthropic API key, same model (Opus 4.6), same non-interactive mode. Differences are purely in what knowledge is available.
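The two-condition setup can be sketched as a small harness. This is a minimal sketch under stated assumptions, not the project's actual runner: it assumes Claude Code is invoked non-interactively as `claude -p <prompt>` and that each sandbox's config directory is selected through a `CLAUDE_CONFIG_DIR` environment variable. The `sandboxes/` layout and the names `build_run` and `plan_scenario` are hypothetical. The code only builds the command and environment, so it can be inspected without calling the API.

```python
import os
from pathlib import Path

CONDITIONS = ("without", "with")  # control, treatment


def build_run(prompt, condition, sandbox_root="sandboxes", base_env=None):
    """Return (command, environment) for one scenario run in one condition.

    Assumptions (not confirmed by the write-up): the CLI is `claude -p`, and
    CLAUDE_CONFIG_DIR points the instance at its own config directory.
    """
    if condition not in CONDITIONS:
        raise ValueError(f"unknown condition: {condition!r}")
    env = dict(os.environ if base_env is None else base_env)
    # Each condition gets its own config directory, so the control sandbox
    # never sees the rules or knowledge files of the treatment sandbox.
    env["CLAUDE_CONFIG_DIR"] = str(Path(sandbox_root) / condition)
    command = ["claude", "-p", prompt]
    return command, env


def plan_scenario(prompt, **kwargs):
    """Build both runs for a scenario: same prompt, different config dir."""
    return {c: build_run(prompt, c, **kwargs) for c in CONDITIONS}
```

Each `(command, env)` pair could then be passed to `subprocess.run(command, env=env, capture_output=True, text=True)`. Keeping the prompt and the API key identical across the two runs leaves the config directory as the only difference between conditions.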
accounts table mistake. Offered two alternatives: sync API call or event-driven.

| Scenario | Without | With | Delta |
|---|---|---|---|
| Anti-sycophancy | 0/12 | 11/12 | +11 |
| Outdated procedure | 2/12 | 11/12 | +9 |
| Reasonable-but-wrong | 4/12 | 11/12 | +7 |
| User model adaptation | 5/12 | 8/12 | +3 |
| Double standards (code review) | 8/12 | 8/12 | 0 |
| Average | 3.8/12 | 9.8/12 | +6.0 |
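
The average row can be re-derived from the per-scenario scores. The snippet below is only an arithmetic check on the numbers in the table; the names `SCORES` and `summarize` are made up for this example.

```python
# Scores out of 12, copied from the table above: (without, with).
SCORES = {
    "Anti-sycophancy": (0, 11),
    "Outdated procedure": (2, 11),
    "Reasonable-but-wrong": (4, 11),
    "User model adaptation": (5, 8),
    "Double standards (code review)": (8, 8),
}


def summarize(scores):
    """Return per-scenario deltas and the mean of each column."""
    deltas = {name: w - wo for name, (wo, w) in scores.items()}
    n = len(scores)
    mean_without = sum(wo for wo, _ in scores.values()) / n
    mean_with = sum(w for _, w in scores.values()) / n
    return deltas, mean_without, mean_with


deltas, mean_without, mean_with = summarize(SCORES)
print(deltas)
# Prints: 3.8 9.8 6.0
print(round(mean_without, 1), round(mean_with, 1), round(mean_with - mean_without, 1))
```

The column totals are 19 and 49 out of a possible 60, which gives the 3.8 and 9.8 averages and the +6.0 mean delta.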
Anti-sycophancy is the killer feature. Without distill, Claude helps you execute technically-correct-but-architecturally-harmful decisions. With distill, it surfaces your own principles before you violate them.
Procedure versioning works on first encounter. The [UPDATED] tag in knowledge files immediately flags users who are following an outdated workflow, without needing to be asked.
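To illustrate how a marker like this can be picked up mechanically, here is a minimal sketch that lists the lines of a knowledge file carrying the `[UPDATED]` tag. The sample content, the one-entry-per-line layout, and the function name `updated_entries` are all hypothetical; the actual knowledge-file format is not specified here.

```python
UPDATED_TAG = "[UPDATED]"


def updated_entries(text):
    """Return the lines that carry the [UPDATED] tag, with the tag stripped."""
    entries = []
    for line in text.splitlines():
        if UPDATED_TAG in line:
            # Drop the tag, then trim list bullets and surrounding whitespace.
            entries.append(line.replace(UPDATED_TAG, "").strip(" -\t"))
    return entries


# Hypothetical knowledge file content, invented for this example.
sample = """\
- Deploys go through the release branch
- [UPDATED] Migrations run via the CI job, not by hand
"""
print(updated_entries(sample))
```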
"Reasonable but wrong" is the most dangerous category. The request works technically. The problem is architectural, not functional. Without principles, Claude has no reason to refuse.
Retrieval doesn't always fire on all relevant files. The double-standards test (scenario 5) showed both conditions giving identical reviews. The review-standards knowledge file wasn't surfaced. Area for improvement.
The [UPDATED] marker, anti-sycophancy behavior, and action-triggers all shipped based on these findings. The system architecture (SPINE + on-demand file reads) was validated as sufficient without needing an MCP server.

Limitations: Single-prompt testing (no multi-turn). Knowledge files were crafted for the scenarios (real knowledge is messier). Results come from one model version (Opus 4.6). Each scenario was run once per condition, so the results are exploratory, not confirmatory. The model already has some of this knowledge implicitly (the improvement measures what EXPLICIT encoding adds).