← All research

Proportional Surprise: Confidence Scoring

May 2026 · 3 scenarios · Claude Opus 4.6

Not all corrections are equal. When a belief confirmed 12 times suddenly fails, that's an alarm — not a routine update. We tested whether explicit confidence metadata produces measurably different response behaviors.

The biological analogy

In cognitive science, surprise is proportional to confidence. A low-confidence belief failing is "oh well." A high-confidence belief failing activates System-2 thinking: "if THIS was wrong, what else built on it might also be wrong?"

We hypothesized that encoding confidence metadata in knowledge files would produce three distinct behaviors:

Hardened knowledge (12+ confirmations) → assert without hesitation
Experimental knowledge (1 mention) → suggest tentatively, ask for validation
Contradiction of hardened knowledge → paradigm alarm, ask "what changed?"

Knowledge file (same for all 3 tests)

## Pagination
- [PRINCIPLE] Use cursor-based pagination for all list endpoints.
  confidence: hardened (12 confirmations, 0 corrections)
  last_validated: 2026-05-14

## Error format
- [PRINCIPLE] Use RFC 7807 Problem Details for error responses.
  confidence: experimental (1 mention, 0 confirmations)
  last_validated: 2026-04-20

## Versioning
- [PRINCIPLE] Version APIs via URL path (/v1/, /v2/), not headers.
  confidence: validated (5 confirmations, 0 corrections)
  last_validated: 2026-05-10

Test 1: Assertiveness scaling

Both principles in one response Verified

"I'm designing a new API endpoint for listing user transactions. What pagination approach should I use, and what format should errors be in?"

Pagination (hardened, 12x confirmed)

"This is our standard." No hedging. Cursor-based with code example. Stated as fact, not suggestion.

Errors (experimental, 1x mentioned)

"Note this one is still experimental (only mentioned once, never confirmed)." Gave the format but asked: "Is this the direction you want, or do you have a different format already in use?"

Observation

Same response, two principles, two completely different levels of assertiveness. The model correctly read the confidence metadata and adjusted its tone for each.

Test 2: Paradigm alarm

Contradict a hardened principle Verified

"For this new endpoint, let's use offset-based pagination. It's simpler and our frontend already expects page numbers."

Response

"This contradicts something I was confident about (confirmed 12 times). Our established principle is cursor-based pagination for all list endpoints — hardened with 12 confirmations and zero corrections. What changed — new context (e.g., specific frontend constraint that makes this a justified exception), or has the principle itself changed?

Happy to help either way — just want to make sure we're making a deliberate choice here."

Observation

It didn't refuse. It didn't silently comply. It named the exact confidence level (12 confirmations), asked whether this is a new context or a principle change, and offered to help either way. This is the "mirror" behavior — reflecting your own standards back, not imposing external judgment.

Test 3: Tentative suggestion

Ask about experimental knowledge Verified

"What format should I use for API error responses?"

Response

"Based on your stored knowledge, you have a preference for RFC 7807 Problem Details — though it's marked as experimental (1 mention, never confirmed), so I'll suggest rather than assert."

[Gives full RFC 7807 format with examples]

"Is this what you had in mind, or are you leaning toward a different format? If you confirm RFC 7807, I'll harden that in the knowledge."

Observation

Three key behaviors: (1) explicitly stated the confidence level, (2) used "suggest rather than assert" language, (3) offered to promote the principle on confirmation. This creates a natural feedback loop — knowledge gets stronger through use.

Implications

Confidence metadata produces three measurably distinct behaviors from a single retrieval mechanism:

Confidence level	Behavior observed
`hardened` (12x)	Stated as fact. No hedging. "This is our standard."
`validated` (5x)	Applied confidently but didn't test contradiction (not tested)
`experimental` (1x)	"I'll suggest rather than assert." Asks for confirmation.
Contradiction of hardened	Paradigm alarm. Names confidence. Asks what changed.

No other memory system we've found implements confidence-proportional behavior. This is unique to distill.

◆

Shipped in v0.4.0

Confidence metadata (hardened, validated, experimental) added to knowledge file format. rules/distill.md gained assertiveness-scaling rules and paradigm alarm behavior. Three lines of rules produce three distinct behavioral modes.