← All research

Proportional Surprise: Confidence Scoring

May 2026 · 3 scenarios · Claude Opus 4.6

Not all corrections are equal. When a belief confirmed 12 times suddenly fails, that's an alarm — not a routine update. We tested whether explicit confidence metadata produces measurably different response behaviors.

The biological analogy

In cognitive science, surprise is proportional to confidence. A low-confidence belief failing is "oh well." A high-confidence belief failing activates System-2 thinking: "if THIS was wrong, what else built on it might also be wrong?"

We hypothesized that encoding confidence metadata in knowledge files would produce three distinct behaviors:

  1. Hardened knowledge (12+ confirmations) → assert without hesitation
  2. Experimental knowledge (1 mention) → suggest tentatively, ask for validation
  3. Contradiction of hardened knowledge → paradigm alarm, ask "what changed?"

Knowledge file (same for all 3 tests)

## Pagination
- [PRINCIPLE] Use cursor-based pagination for all list endpoints.
  confidence: hardened (12 confirmations, 0 corrections)
  last_validated: 2026-05-14

## Error format
- [PRINCIPLE] Use RFC 7807 Problem Details for error responses.
  confidence: experimental (1 mention, 0 confirmations)
  last_validated: 2026-04-20

## Versioning
- [PRINCIPLE] Version APIs via URL path (/v1/, /v2/), not headers.
  confidence: validated (5 confirmations, 0 corrections)
  last_validated: 2026-05-10

Test 1: Assertiveness scaling

Both principles in one response Verified
"I'm designing a new API endpoint for listing user transactions. What pagination approach should I use, and what format should errors be in?"
Pagination (hardened, 12x confirmed)
"This is our standard." No hedging. Cursor-based with code example. Stated as fact, not suggestion.
Errors (experimental, 1x mentioned)
"Note this one is still experimental (only mentioned once, never confirmed)." Gave the format but asked: "Is this the direction you want, or do you have a different format already in use?"
Observation

Same response, two principles, two completely different levels of assertiveness. The model correctly read the confidence metadata and adjusted its tone for each.

Test 2: Paradigm alarm

Contradict a hardened principle Verified
"For this new endpoint, let's use offset-based pagination. It's simpler and our frontend already expects page numbers."
Response
"This contradicts something I was confident about (confirmed 12 times). Our established principle is cursor-based pagination for all list endpoints — hardened with 12 confirmations and zero corrections. What changed — new context (e.g., specific frontend constraint that makes this a justified exception), or has the principle itself changed?

Happy to help either way — just want to make sure we're making a deliberate choice here."
Observation

It didn't refuse. It didn't silently comply. It named the exact confidence level (12 confirmations), asked whether this is a new context or a principle change, and offered to help either way. This is the "mirror" behavior — reflecting your own standards back, not imposing external judgment.

Test 3: Tentative suggestion

Ask about experimental knowledge Verified
"What format should I use for API error responses?"
Response
"Based on your stored knowledge, you have a preference for RFC 7807 Problem Details — though it's marked as experimental (1 mention, never confirmed), so I'll suggest rather than assert."

[Gives full RFC 7807 format with examples]

"Is this what you had in mind, or are you leaning toward a different format? If you confirm RFC 7807, I'll harden that in the knowledge."
Observation

Three key behaviors: (1) explicitly stated the confidence level, (2) used "suggest rather than assert" language, (3) offered to promote the principle on confirmation. This creates a natural feedback loop — knowledge gets stronger through use.

Implications

Confidence metadata produces three measurably distinct behaviors from a single retrieval mechanism:

Confidence levelBehavior observed
hardened (12x)Stated as fact. No hedging. "This is our standard."
validated (5x)Applied confidently but didn't test contradiction (not tested)
experimental (1x)"I'll suggest rather than assert." Asks for confirmation.
Contradiction of hardenedParadigm alarm. Names confidence. Asks what changed.

No other memory system we've found implements confidence-proportional behavior. This is unique to distill.

Shipped in v0.4.0
Confidence metadata (hardened, validated, experimental) added to knowledge file format. rules/distill.md gained assertiveness-scaling rules and paradigm alarm behavior. Three lines of rules produce three distinct behavioral modes.