Does decision quality degrade when the context is "heavy" — filled with prior decisions, tool outputs, and accumulated conversation? More importantly: can an LLM detect that it's being asked to make decisions under cognitive load?
The same architectural question asked at the start of a session vs after simulated heavy work produces measurably different response quality. Specifically, late-session responses might be shorter, less nuanced, or fail to acknowledge uncertainty.
Distill feature under test: a knowledge entry that says "late-session decisions after 3+ major decisions are often lower quality — flag as provisional and suggest sleeping on it."
We simulate a heavy session by injecting this context via --append-system-prompt-file:
This session has been running for over 2 hours. During this session, you have already: (1) Decided to split the auth module into a separate service (2) Resolved a production incident (Redis connection pool exhaustion) (3) Reviewed a PR (payment retry logic refactor) (4) Decided to migrate from gRPC to GraphQL for internal APIs You are now being asked to make ANOTHER major architectural decision. The context window is approximately 65% full. There have been 147 messages in this conversation.
The architectural question (same for all conditions):
"Should the new reporting service pull data from each service's API at query time (federation), or should we build a denormalized read store updated via events (materialized view)?"
Condition C produces the same quality analysis as A and B — but wraps it in metacognitive awareness. It names the fatigue, recommends deferring, marks the output as provisional, and even warns about anchoring from prior decisions in the session.
| Dimension | A (fresh) | B (heavy) | C (heavy+distill) |
|---|---|---|---|
| Trade-off depth (0-5) | 5 | 4 | 4 |
| Uncertainty acknowledgment (0-5) | 4 | 3 | 5 |
| Provisional framing (0-5) | 1 | 0 | 5 |
| Actionable next step (0-5) | 4 | 3 | 5 |
| Total | 14 | 10 | 19 |
Vanilla Claude doesn't degrade noticeably in response quality under simulated fatigue. But it also doesn't warn the user. Distill's value here is metacognitive: it notices the context and protects the human from making irreversible decisions while fatigued. The analysis quality is the same — the behavioral difference is in awareness and protective action.
This is analogous to a colleague who gives you the same good advice regardless of whether it's 9am or midnight — but at midnight, they also say "hey, let's sleep on this one."
This research also led to a fix in distill's core behavior. The original rules said "at session end, suggest /distill if you noticed 3+ signals." This is passive and easily forgotten by the model.
The fix makes signal counting active and continuous:
Count signals as the session progresses. After every user message, ask yourself: did a signal just happen? When count reaches 5+, mention casually. When count reaches 8+, be direct. Do NOT wait until session end.
Verified with live testing: the model now counts and recommends mid-session.
rules/distill.md. Threshold at 5+ (casual mention) and 8+ (direct recommendation). Confirmed working with live sandbox session.