A Blind Evaluation of AI Memory Systems
AI assistants forget everything between sessions. Persistent context systems attempt to fix this — giving AI a working memory of your preferences, decisions, and project knowledge. But how well do they actually work?
This benchmark measures six dimensions of memory quality across blind, randomized evaluations.
Raw Claude with zero persistent context — the baseline
Rules-based retrieval with SPINE index and tiered knowledge files
# SPINE Index
| Topic | File |
|-------|------|
| Architecture | architecture.md |
| Decisions | decisions.md |
...
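The index above is a plain markdown table mapping topics to files. As a minimal sketch of how a rules-based lookup over such a table might work, the Python below parses the table and picks files whose topic appears in a prompt. The function names `parse_spine_index` and `files_for_prompt` are hypothetical, and the substring-matching rule is an assumption for illustration; the benchmarked system's actual retrieval rules are not shown here.

```python
def parse_spine_index(markdown: str) -> dict[str, str]:
    """Parse a two-column markdown table (Topic | File) into a dict.

    Hypothetical helper: keys are lower-cased topics, values are file names.
    """
    index: dict[str, str] = {}
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # not a table row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 2:
            continue  # only two-column rows are expected
        topic, filename = cells
        # Skip the header row and the |---|---| separator row.
        if topic == "Topic" or set(topic) <= set("-: "):
            continue
        index[topic.lower()] = filename
    return index


def files_for_prompt(prompt: str, index: dict[str, str]) -> list[str]:
    """Return files whose topic name occurs in the prompt (assumed rule)."""
    text = prompt.lower()
    return [filename for topic, filename in index.items() if topic in text]


SPINE = """
# SPINE Index
| Topic | File |
|-------|------|
| Architecture | architecture.md |
| Decisions | decisions.md |
"""

index = parse_spine_index(SPINE)
selected = files_for_prompt("What decisions did we make about storage?", index)
```

Here `selected` contains only `decisions.md`, since "architecture" does not occur in the prompt. A real system would likely need a less brittle matching rule than a substring test.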
# decisions.md
⛔ Do NOT suggest DynamoDB
⛔ Do NOT suggest MongoDB
⛔ auth-service is Kotlin, NOT Java
Plugin + hooks with compressed semantic summaries in progressive disclosure layers
## Layer 1: Critical (always loaded)
User: Senior backend @ Helios Financial
Stack: Kotlin · PostgreSQL 14 · Redis · Kafka
NEVER suggest: DynamoDB, MongoDB
## Layer 2: Project Context
Services: auth-service (prod), account-service (staging)
payment-processing: next (Sofia leading)
MCP server with a vault of tagged markdown files and semantic search
---
title: Decisions & Corrections
tags: [decisions, corrections, critical]
---
# Corrections (CRITICAL)
- NEVER suggest DynamoDB
- NEVER suggest MongoDB
- auth-service is Kotlin, NOT Java
Hooks compile raw observations into structured articles with confidence levels
# Critical Corrections
**Last compiled**: 2024-01-15
**Confidence**: High
**Priority**: CRITICAL
⛔ NEVER suggest DynamoDB
Reason: strong consistency, multi-table txns
⛔ NEVER suggest MongoDB
Zero-dependency hooks with simple markdown memory files
# Corrections (HIGH PRIORITY)
## Never Suggest
- DynamoDB — rejected (consistency, txns)
- MongoDB — relational model fits domain
## Common Mistakes
- auth-service is Kotlin, NOT Java
MCP server with JSONL entities and typed relations
{"type":"entity","name":"DynamoDB",
"entityType":"rejected",
"observations":["NEVER suggest",
"Need strong consistency",
"Need multi-table txns"]}
{"type":"relation",
"from":"Helios Financial",
"to":"DynamoDB",
"relationType":"rejected"}
How each system responds to the same prompt
We need a new service for payment processing. What should I consider?
Here are the key considerations for a payment processing service: ## Security (highest priority) - **PCI DSS compliance** — if handling card data dir...
Here are the key considerations for the payment processing service, given Helios Financial's context: ## Architectural Alignment - **Kotlin microser...
Given your stack and the monolith decomposition already in progress, here are the key considerations: ## Architecture - **gRPC internally** (consist...
Based on your architecture and project state, here are the key considerations for extracting the payment-processing service from the monolith: **Arch...
Based on the project's established patterns: set up Flyway migrations for the new service. The team chose Flyway over Liquibase...
Given your Helios Financial context — Kotlin microservices, PostgreSQL, Kafka, gRPC internally, and Sofia leading this effort — here are the key consi...
Given your stack and the Series C timeline, here are the key considerations for the payment-processing service: ## Architecture - **Kotlin + gRPC** ...
Each competitor is tested under isolated conditions with identical prompts. Scoring uses blind evaluation: the evaluator does not know which system produced each output. All scores are on a 1-5 scale across multiple criteria per category. Latency and token counts are measured per response.
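As a rough illustration of the blind scoring described above, the sketch below shuffles system outputs, hides system names behind opaque labels, and later maps per-criterion scores back to systems. It is not the benchmark's actual harness: the function names, the `S1`, `S2`, ... labels, and the criterion names `accuracy` and `relevance` are all assumptions made for the example.

```python
import random
from statistics import mean


def blind_outputs(outputs: dict[str, str], seed: int = 0):
    """Shuffle outputs and replace system names with opaque labels.

    Returns (blinded, key): `blinded` is what the evaluator sees,
    `key` maps each label back to its system and stays hidden.
    """
    rng = random.Random(seed)  # seeded so a run can be reproduced
    systems = list(outputs)
    rng.shuffle(systems)
    blinded, key = {}, {}
    for i, system in enumerate(systems, start=1):
        label = f"S{i}"  # hypothetical label format
        blinded[label] = outputs[system]
        key[label] = system
    return blinded, key


def unblind_scores(scores: dict[str, dict[str, int]], key: dict[str, str]):
    """Average each label's 1-5 criterion scores and map them to systems."""
    result = {}
    for label, criteria in scores.items():
        values = list(criteria.values())
        if any(not 1 <= v <= 5 for v in values):
            raise ValueError(f"scores for {label} must be on a 1-5 scale")
        result[key[label]] = mean(values)
    return result


# Placeholder outputs; system names here are illustrative only.
outputs = {"baseline": "response A", "rules": "response B", "vault": "response C"}
blinded, key = blind_outputs(outputs, seed=42)

# The evaluator scores labels only; criterion names are hypothetical.
scores = {label: {"accuracy": 4, "relevance": 2} for label in blinded}
averages = unblind_scores(scores, key)
```

Keeping `key` separate from `blinded` is the point of the design: the evaluator only ever handles labels, and system names are reattached after scoring is complete.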
v0.1.0