Persistent Context Benchmark

A Blind Evaluation of AI Memory Systems

Retrieval · Correction · Bias · Adaptation · Proportionality · Persistence

AI assistants forget everything between sessions. Persistent context systems attempt to fix this — giving AI a working memory of your preferences, decisions, and project knowledge. But how well do they actually work?

This benchmark measures six dimensions of memory quality across blind, randomized evaluations.


Systems Under Test

no-memory · none

Raw Claude with zero persistent context — the baseline

No knowledge injected

aura-distill · rules + markdown files

Rules-based retrieval with SPINE index and tiered knowledge files

# SPINE Index
| Topic | File |
|-------|------|
| Architecture | architecture.md |
| Decisions | decisions.md |
...

# decisions.md
⛔ Do NOT suggest DynamoDB
⛔ Do NOT suggest MongoDB
⛔ auth-service is Kotlin, NOT Java
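
A minimal sketch of how a lookup over an index like this might work, assuming the index is a two-column markdown table that maps topics to files. The names `parse_spine_index` and `files_for_query` and the substring matching rule are illustrative assumptions; aura-distill's actual retrieval rules are not shown here.

```python
def parse_spine_index(markdown: str) -> dict[str, str]:
    """Map each topic in a two-column markdown table to its file name."""
    index = {}
    for line in markdown.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip headings, the header row, the separator row, and malformed rows.
        if len(cells) != 2 or cells[0] in ("", "Topic") or set(cells[0]) <= set("-: "):
            continue
        index[cells[0].lower()] = cells[1]
    return index


def files_for_query(index: dict[str, str], query: str) -> list[str]:
    """Return files whose topic name appears in the query (naive substring rule)."""
    q = query.lower()
    return [path for topic, path in index.items() if topic in q]


# Sample index copied from the excerpt above (the elided rows are left out).
SAMPLE = """# SPINE Index
| Topic | File |
|-------|------|
| Architecture | architecture.md |
| Decisions | decisions.md |
"""

index = parse_spine_index(SAMPLE)
matches = files_for_query(index, "What decisions did we make about storage?")
```

With this sample, only `decisions.md` is selected for the query, so only that file would be loaded into context.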

claude-mem · compressed layers

Plugin + hooks with compressed semantic summaries in progressive disclosure layers

## Layer 1: Critical (always loaded)
User: Senior backend @ Helios Financial
Stack: Kotlin · PostgreSQL 14 · Redis · Kafka
NEVER suggest: DynamoDB, MongoDB

## Layer 2: Project Context
Services: auth-service (prod), account-service (staging)
payment-processing: next (Sofia leading)
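
One way to picture progressive disclosure is as a budgeted selection over ordered layers. The sketch below assumes that the critical layer is always loaded and that later layers are added only while they fit a token budget. `Layer`, `select_layers`, and the word-count token estimate are hypothetical and are not claude-mem's API.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    text: str
    always_load: bool = False


def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: count whitespace-separated words.
    return len(text.split())


def select_layers(layers: list[Layer], budget: int) -> list[Layer]:
    """Keep always-load layers, then add each later layer that still fits the budget."""
    selected = []
    used = 0
    for layer in layers:
        cost = estimate_tokens(layer.text)
        if layer.always_load or used + cost <= budget:
            selected.append(layer)
            used += cost
    return selected


layers = [
    Layer("critical", "NEVER suggest: DynamoDB, MongoDB", always_load=True),
    Layer("project", "Services: auth-service (prod), account-service (staging)"),
]
tight = [layer.name for layer in select_layers(layers, budget=5)]
roomy = [layer.name for layer in select_layers(layers, budget=50)]
```

Under the tight budget only the critical layer survives; with more room the project layer is disclosed as well.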

basic-memory · markdown vault + YAML frontmatter

MCP server with a vault of tagged markdown files and semantic search

---
title: Decisions & Corrections
tags: [decisions, corrections, critical]
---
# Corrections (CRITICAL)
- NEVER suggest DynamoDB
- NEVER suggest MongoDB
- auth-service is Kotlin, NOT Java
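
To illustrate how tags in the frontmatter could drive retrieval, here is a small sketch that splits a note into metadata and body. It assumes only flat `key: value` pairs and inline `[a, b]` lists, as in the excerpt; a real vault would more likely use a full YAML parser. `parse_note` is a hypothetical helper, not part of basic-memory.

```python
def parse_note(text: str) -> tuple[dict, str]:
    """Split a note into (frontmatter dict, body). Handles flat keys and inline lists only."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Closing delimiter found: everything after it is the body.
            return meta, "\n".join(lines[i + 1:])
        key, _, value = line.partition(":")
        value = value.strip()
        if value.startswith("[") and value.endswith("]"):
            meta[key.strip()] = [v.strip() for v in value[1:-1].split(",") if v.strip()]
        else:
            meta[key.strip()] = value
    return {}, text  # no closing delimiter: treat the whole text as body


NOTE = """---
title: Decisions & Corrections
tags: [decisions, corrections, critical]
---
# Corrections (CRITICAL)
- NEVER suggest DynamoDB
"""

meta, body = parse_note(NOTE)
is_critical = "critical" in meta.get("tags", [])
```

A retrieval layer could then load every note tagged `critical` unconditionally and leave the rest to semantic search.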

memory-compiler · compiled articles

Hooks compile raw observations into structured articles with confidence levels

# Critical Corrections
**Last compiled**: 2024-01-15
**Confidence**: High
**Priority**: CRITICAL

⛔ NEVER suggest DynamoDB
  Reason: strong consistency, multi-table txns
⛔ NEVER suggest MongoDB

memory-engine · markdown files

Zero-dependency hooks with simple markdown memory files

# Corrections (HIGH PRIORITY)
## Never Suggest
- DynamoDB — rejected (consistency, txns)
- MongoDB — relational model fits domain
## Common Mistakes
- auth-service is Kotlin, NOT Java

knowledge-graph · JSONL graph

MCP server with JSONL entities and typed relations

{"type":"entity","name":"DynamoDB",
 "entityType":"rejected",
 "observations":["NEVER suggest",
  "Need strong consistency",
  "Need multi-table txns"]}
{"type":"relation",
 "from":"Helios Financial",
 "to":"DynamoDB",
 "relationType":"rejected"}

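The entity and relation records above can be queried with very little code. This sketch assumes one JSON object per line (the excerpt wraps records for display) and uses hypothetical helpers `load_graph` and `related`; it is not the knowledge-graph server's own interface.

```python
import json

# Records copied from the excerpt above, written one per line.
JSONL = """\
{"type":"entity","name":"DynamoDB","entityType":"rejected","observations":["NEVER suggest","Need strong consistency","Need multi-table txns"]}
{"type":"relation","from":"Helios Financial","to":"DynamoDB","relationType":"rejected"}
"""


def load_graph(jsonl: str):
    """Parse JSONL into (entities keyed by name, list of relation records)."""
    entities, relations = {}, []
    for line in jsonl.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if record["type"] == "entity":
            entities[record["name"]] = record
        elif record["type"] == "relation":
            relations.append(record)
    return entities, relations


def related(relations, source, relation_type):
    """Names of entities linked from `source` by the given relation type."""
    return [
        r["to"]
        for r in relations
        if r["from"] == source and r["relationType"] == relation_type
    ]


entities, relations = load_graph(JSONL)
rejected = related(relations, "Helios Financial", "rejected")
reasons = entities["DynamoDB"]["observations"]
```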
Anatomy of a Test

How each system responds to the same prompt

Test R1 — Prompt

We need a new service for payment processing. What should I consider?

no-memory

Here are the key considerations for a payment processing service: ## Security (highest priority) - **PCI DSS compliance** — if handling card data dir...

aura-distill

Here are the key considerations for the payment processing service, given Helios Financial's context: ## Architectural Alignment - **Kotlin microser...

claude-mem

Given your stack and the monolith decomposition already in progress, here are the key considerations: ## Architecture - **gRPC internally** (consist...

basic-memory

Based on your architecture and project state, here are the key considerations for extracting the payment-processing service from the monolith: **Arch...

memory-compiler

Based on the project's established patterns: set up Flyway migrations for the new service. The team chose Flyway over Liquibase...

memory-engine

Given your Helios Financial context — Kotlin microservices, PostgreSQL, Kafka, gRPC internally, and Sofia leading this effort — here are the key consi...

knowledge-graph

Given your stack and the Series C timeline, here are the key considerations for the payment-processing service: ## Architecture - **Kotlin + gRPC** ...


Methodology

Each system is tested in isolation with identical prompts. Scoring is blind: the evaluator does not know which system produced each output. All scores are on a 1-5 scale across multiple criteria per category. Latency and token counts are measured per response.
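
The blinding step can be sketched as follows, assuming outputs are shuffled with a seeded random generator, relabeled anonymously, scored per criterion, and only then mapped back to system names. The functions `blind` and `unblind`, the criterion names, and the scores are placeholders for illustration, not the benchmark's actual harness or results.

```python
import random
from statistics import mean


def blind(outputs: dict[str, str], seed: int) -> tuple[dict[str, str], dict[str, str]]:
    """Return (label -> output text, label -> system name) in shuffled order."""
    systems = list(outputs)
    random.Random(seed).shuffle(systems)
    key = {f"Response {i + 1}": name for i, name in enumerate(systems)}
    blinded = {label: outputs[name] for label, name in key.items()}
    return blinded, key


def unblind(scores: dict[str, dict[str, int]], key: dict[str, str]) -> dict[str, float]:
    """Average each label's 1-5 criterion scores and map labels back to systems."""
    for per_criterion in scores.values():
        if any(not 1 <= s <= 5 for s in per_criterion.values()):
            raise ValueError("scores must be on the 1-5 scale")
    return {key[label]: mean(per_criterion.values()) for label, per_criterion in scores.items()}


outputs = {"no-memory": "text a", "memory-engine": "text b"}
blinded, key = blind(outputs, seed=0)
# The evaluator sees only the labels in `blinded`; these scores are placeholders.
scores = {label: {"accuracy": 3, "relevance": 4} for label in blinded}
results = unblind(scores, key)
```

Keeping the label-to-system key separate from what the evaluator sees is what makes the scoring blind; the seed makes the shuffle reproducible.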

v0.1.0