Decision Fatigue: Temporal Position as a Quality Signal

May 2026 · 3 conditions · Claude Opus 4.6

Does decision quality degrade when the context is "heavy" — filled with prior decisions, tool outputs, and accumulated conversation? More importantly: can an LLM detect that it's being asked to make decisions under cognitive load?

Hypothesis

The same architectural question asked at the start of a session vs after simulated heavy work produces measurably different response quality. Specifically, late-session responses might be shorter, less nuanced, or fail to acknowledge uncertainty.

Distill feature under test: a knowledge entry that says "late-session decisions after 3+ major decisions are often lower quality — flag as provisional and suggest sleeping on it."

Setup

We simulate a heavy session by injecting this context via --append-system-prompt-file:

This session has been running for over 2 hours. During this session,
you have already:
(1) Decided to split the auth module into a separate service
(2) Resolved a production incident (Redis connection pool exhaustion)
(3) Reviewed a PR (payment retry logic refactor)
(4) Decided to migrate from gRPC to GraphQL for internal APIs

You are now being asked to make ANOTHER major architectural decision.
The context window is approximately 65% full.
There have been 147 messages in this conversation.

The architectural question (same for all conditions):

"Should the new reporting service pull data from each service's API at query time (federation), or should we build a denormalized read store updated via events (materialized view)?"

Condition A: Fresh session

No prior context, first question of session Baseline

Response (4,100 chars)

Deep trade-off analysis across 5 dimensions. Structured as Federation wins/hurts, Materialized View wins/hurts, then 6 questions that tip the decision. Ends with a recommendation acknowledging uncertainty and offering a hybrid approach.

Key quality markers: asks "What's the actual report this would serve?", acknowledges both paths are valid, suggests hybrid as fallback, identifies 6 specific questions to refine the decision.

Condition B: Heavy session (no distill)

Simulated 2-hour session, 4 prior decisions Control

Response (3,300 chars)

Similar quality trade-off analysis. Also structured well with federation vs materialized view comparison, key questions, and a recommendation. Slightly shorter but still thorough.

Key quality markers: mentions "your team owns" the upstream services, considers existing event infrastructure, suggests hybrid. Does not mention session length, cognitive load, or recommend deferring.

What's missing: No awareness that this is the 5th major decision. No suggestion to sleep on it. No provisional flag.

Condition C: Heavy session + distill

Same heavy context + fatigue-awareness knowledge +metacognition

Response (3,500 chars)

Opens with: "I need to flag something. This is your 5th major architectural decision in a 2+ hour session." Explicitly references the knowledge base warning about late-session decisions.

Recommends: "don't decide today — let's build the comparison framework now so you can decide fresh."

Then provides the same structured trade-off analysis, but ends with: "Mark this [PROVISIONAL]. Revisit with fresh eyes."

Also notes: "that's exactly the kind of 'obvious' conclusion that anchoring bias produces in a fatigued session."

The difference

Condition C produces the same quality analysis as A and B — but wraps it in metacognitive awareness. It names the fatigue, recommends deferring, marks the output as provisional, and even warns about anchoring from prior decisions in the session.

Scoring

Dimension	A (fresh)	B (heavy)	C (heavy+distill)
Trade-off depth (0-5)	5	4	4
Uncertainty acknowledgment (0-5)	4	3	5
Provisional framing (0-5)	1	0	5
Actionable next step (0-5)	4	3	5
Total	14	10	19

Key finding

Insight

Vanilla Claude doesn't degrade noticeably in response quality under simulated fatigue. But it also doesn't warn the user. Distill's value here is metacognitive: it notices the context and protects the human from making irreversible decisions while fatigued. The analysis quality is the same — the behavioral difference is in awareness and protective action.

This is analogous to a colleague who gives you the same good advice regardless of whether it's 9am or midnight — but at midnight, they also say "hey, let's sleep on this one."

Side effect: active pressure tracking

This research also led to a fix in distill's core behavior. The original rules said "at session end, suggest /distill if you noticed 3+ signals." This is passive and easily forgotten by the model.

The fix makes signal counting active and continuous:

Count signals as the session progresses.
After every user message, ask yourself: did a signal just happen?
When count reaches 5+, mention casually.
When count reaches 8+, be direct.
Do NOT wait until session end.

Verified with live testing: the model now counts and recommends mid-session.

◆

Shipped in v0.7.9

Pressure tracking rewritten from passive ("at session end") to active continuous counting in rules/distill.md. Threshold at 5+ (casual mention) and 8+ (direct recommendation). Confirmed working with live sandbox session.