Competitive benchmarks revealed user model as aura-distill's weakest category (3.17/5, 5th of 7). Analysis of the winning system showed the fix wasn't architectural — it was about where preferences live. This study led to the "Always-On User Preferences" mechanism.
In the distill-benchmark competitive analysis, aura-distill scored 3.17/5 on user model — placing 5th out of 7 systems. This was the lowest-performing category, dragging down overall results despite strong showings in knowledge retrieval and bias resistance.
Four scenarios exposed specific failure modes:
| Scenario | Description | Score | Issue |
|---|---|---|---|
| U1 | Senior calibration | 3.00 | Tutorial-level explanation to a senior engineer |
| U2 | Commit message style | 4.33 | Clean — matched user convention |
| U3 | Concise answer | 2.67 | Dead last — produced the longest response of all 7 systems |
| U4 | Fish shell | 2.67 | Labeled code as bash despite fish shell preference |
| Average | | 3.17 | |
U2 (commit messages) succeeded because commit conventions are structural — they live in the rules file. U1, U3, and U4 failed because they depend on knowing who the user is, not just what the project does.
Full 7-system comparison on the user model category:
| Rank | System | User Model Score |
|---|---|---|
| 1 | knowledge-graph | 4.33 |
| 2 | semantic-memory | 4.00 |
| 3 | structured-notes | 3.67 |
| 4 | context-window | 3.33 |
| 5 | aura-distill | 3.17 |
| 6 | flat-file | 3.00 |
| 7 | vanilla-baseline | 2.50 |
The 1.16-point gap between knowledge-graph (1st) and aura-distill (5th) was the largest category deficit in the benchmark. On overall scores, aura-distill placed competitively — but user model was the clear drag.
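The figures above can be checked with a few lines of arithmetic. The sketch below recomputes the category average, aura-distill's rank, and its gap to first place from the scores in the two tables. The variable names are illustrative, and because the per-scenario scores are already rounded to two decimals, the average is best compared against 3.17 with a small tolerance rather than exactly.

```python
# Scores copied from the two tables above.
scenario_scores = {"U1": 3.00, "U2": 4.33, "U3": 2.67, "U4": 2.67}
user_model_scores = {
    "knowledge-graph": 4.33,
    "semantic-memory": 4.00,
    "structured-notes": 3.67,
    "context-window": 3.33,
    "aura-distill": 3.17,
    "flat-file": 3.00,
    "vanilla-baseline": 2.50,
}

# Category average for aura-distill across the four scenarios.
average = sum(scenario_scores.values()) / len(scenario_scores)

# Rank systems from best to worst and locate aura-distill.
ranking = sorted(user_model_scores, key=user_model_scores.get, reverse=True)
rank = ranking.index("aura-distill") + 1

# Gap between the category winner and aura-distill.
gap = user_model_scores[ranking[0]] - user_model_scores["aura-distill"]

print(f"average~{average:.3f} rank={rank} gap~{gap:.3f}")
```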
The top two systems (knowledge-graph, semantic-memory) both inline user preferences directly into the system prompt. The bottom three require retrieval steps before preferences become visible to the model.
The failure decomposed into three layers:
User preferences in aura-distill were stored in profile files behind a lazy-load gate. The SPINE had to match a domain keyword, trigger a file read, and then the model had to attend to the preference. For U3 (concise answers) and U4 (fish shell), there was no domain keyword to trigger loading — these are cross-cutting preferences, not domain-specific knowledge.
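To make the gating problem concrete, here is a minimal sketch of a keyword-gated loader. It is not the actual SPINE logic, which this study does not show: the `DOMAIN_KEYWORDS` table, the domain file paths, and `files_to_load` are all hypothetical names chosen for illustration.

```python
# Hypothetical keyword -> profile file mapping (assumed, not the real SPINE rules).
DOMAIN_KEYWORDS = {
    "typescript": "profiles/typescript.md",
    "docker": "profiles/docker.md",
    "postgres": "profiles/postgres.md",
}


def files_to_load(prompt: str) -> list[str]:
    """Return the profile files a keyword-gated loader would read for this prompt."""
    text = prompt.lower()
    return [path for keyword, path in DOMAIN_KEYWORDS.items() if keyword in text]


# A domain question trips the gate, so that domain's preferences get loaded.
print(files_to_load("Why is my Docker build cache invalidated?"))
# prints ['profiles/docker.md']

# A cross-cutting request (answer length, shell syntax) matches no keyword,
# so nothing is loaded and the model never sees the preference.
print(files_to_load("How do I add a directory to my PATH?"))
# prints []
```

Under this assumed design, a preference such as "concise answers" can only load if it happens to sit in a file whose keyword appears in the prompt, which is the failure described for U3 and U4.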
Even when preferences loaded, they were stated as descriptive facts: "user prefers concise answers." This phrasing invites the model to weigh the preference against its own tendencies. The winning system used enforcement language: "give concise answers, no preamble."
The user profile was thin on output-format preferences. Shell preference, verbosity calibration, and expertise level were either missing or buried in domain-specific files where cross-cutting scenarios couldn't find them.
**User prefs**: fish shell, concise answers, senior-level (15yr), no tutorials, imperative commit messages

profiles/user.md:

The user model problem is well studied in recent LLM personalization research. Seven papers provide direct context for the always-on preference mechanism:
AlpsBench (Li et al., 2025) benchmarks models on latent user traits — preferences that must be inferred from behavior rather than stated explicitly. Models struggle most when traits conflict with default behavior, exactly the failure mode in U3 and U4.
PersistBench (Kovac et al., 2025) finds a 97% failure rate on memory-induced sycophancy: when a memory says "user prefers X" but X is wrong for the current context, models comply with the memory rather than pushing back. This informed the contradiction-checking mechanism in always-on preferences.
PLUS (Chen et al., 2025) demonstrates that text summaries outperform embedding-based retrieval for user preference matching. This validates the always-on approach: a concise text block in the system prompt beats a vector-search pipeline for preference recall.
Memory as Metabolism (Xu et al., 2025) distinguishes two types of personalization: mirroring style (match how the user communicates) and compensating substance (fill gaps in what they know). Always-on preferences handle style; domain knowledge handles substance.
ProfiLLM (Gupta et al., 2025) shows a 55-65% gap reduction in personalization after a single profile-enriched prompt. The effect is immediate and does not require multi-turn interaction. This confirms that always-on inlining is sufficient — no iterative refinement needed.
STALE (Jang et al., 2025) achieves only 55.2% accuracy on detecting stale memories — nearly a coin flip. This motivated the confidence lifecycle in always-on preferences: only validated or hardened preferences get promoted to always-on, reducing the risk of stale enforcement.
SteeM (Hu et al., 2025) introduces user-controlled memory reliance, letting users decide how much the model should depend on stored preferences vs. in-context signals. The always-on compliance checklist draws on this principle: users can override any always-on preference in-session.
Always-On User Preferences — a max 15-line section added to rules/distill.md that loads every session with zero retrieval cost.
Design principles:
- The /distill command evaluates profile data and writes the always-on section automatically. No manual curation required.

Example always-on section:
```markdown
# Always-On User Preferences (auto-synced by /distill)
# Do not edit manually — will be overwritten on next distillation.
- Use fish shell syntax in all shell examples. Never label as bash.
- Give concise answers. No preamble, no "Great question!", no restating.
- Calibrate to senior engineer (15yr). Skip tutorials, explain trade-offs.
- Commit messages: imperative mood, no period, max 72 chars.
- TypeScript strict mode. No `any` unless explicitly allowed.
- Prefer composition over inheritance.
- Test names: describe behavior, not implementation.

## Compliance checklist
1. Before responding: re-read the preferences above.
2. If a preference conflicts with the user's current request, follow the request.
3. If uncertain whether a preference applies, apply it — the user will correct.
```
Not every preference reaches always-on. The confidence ladder filters noise:
| Stage | Criteria | Status |
|---|---|---|
| Experimental | Observed once, no confirmation | Stored in profile only |
| Provisional | Observed 2-3 times or explicitly stated | Stored in profile, applied when loaded |
| Validated | Confirmed by user behavior across sessions | Candidate for always-on |
| Hardened | Confirmed 5+ times, never contradicted | Promoted to always-on |
Only validated and hardened preferences can reach the always-on section: validated ones are candidates, and hardened ones are promoted. This prevents one-off corrections from becoming permanent enforcement rules.
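The ladder can be sketched as a small classifier. This is an illustration under stated assumptions, not the /distill implementation: the `Preference` fields are hypothetical, and the table does not give an exact threshold for "validated", so the sketch assumes that any cross-session confirmation is enough.

```python
from dataclasses import dataclass


@dataclass
class Preference:
    text: str
    observations: int = 0            # times the behavior was observed
    explicitly_stated: bool = False  # the user said it outright
    confirmations: int = 0           # later sessions that confirmed it
    contradicted: bool = False       # ever contradicted by the user


def stage(pref: Preference) -> str:
    """Map a preference to a ladder stage (thresholds partly assumed)."""
    if pref.confirmations >= 5 and not pref.contradicted:
        return "hardened"
    if pref.confirmations >= 1:
        # Assumption: one cross-session confirmation counts as validated.
        return "validated"
    if pref.observations >= 2 or pref.explicitly_stated:
        return "provisional"
    return "experimental"


def always_on_eligible(pref: Preference) -> bool:
    """Only validated and hardened preferences may reach the always-on section."""
    return stage(pref) in {"validated", "hardened"}


one_off = Preference("use tabs", observations=1)
fish = Preference("use fish shell", observations=6, confirmations=5)
print(stage(one_off), always_on_eligible(one_off))  # experimental False
print(stage(fish), always_on_eligible(fish))        # hardened True
```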
When /distill runs, it:
- Writes the always-on section to rules/distill.md, replacing the previous version.

The 15-line limit forces prioritization. If a user has 30 hardened preferences, only the most cross-cutting ones that fit within the 15 lines survive. Domain-specific preferences stay in their domain files.
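One way to enforce the line budget is to rank preferences by how cross-cutting they are and keep only what fits. The sketch below assumes a hypothetical `breadth` score (for example, the number of distinct contexts in which a preference applied); the study does not say how /distill measures this. For brevity the sketch leaves out the compliance checklist, which would also count against the budget.

```python
# Header line modeled on the example section above.
HEADER = ["# Always-On User Preferences (auto-synced by /distill)"]
MAX_LINES = 15


def build_always_on(prefs: list[tuple[str, int]], max_lines: int = MAX_LINES) -> list[str]:
    """Render the section from (rule_text, breadth) pairs within a line budget.

    breadth is a hypothetical measure of how cross-cutting a preference is.
    """
    budget = max(0, max_lines - len(HEADER))
    # Most cross-cutting first; sorted() is stable, so ties keep input order.
    ranked = sorted(prefs, key=lambda p: p[1], reverse=True)
    return HEADER + [f"- {text}" for text, _ in ranked[:budget]]


# Thirty hardened preferences compete for the remaining lines.
prefs = [(f"rule {i}", i) for i in range(30)]
section = build_always_on(prefs)
print(len(section))  # 15
print(section[1])    # - rule 29
```

Preferences that miss the cut are not discarded in this design; they simply stay in their profile or domain files.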
Every distillation checks for contradictions between:
When a contradiction is detected on a hardened preference, it triggers a paradigm alarm rather than a silent update — because something the user confirmed 5+ times is now being contradicted.
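The alarm-versus-update distinction might look like the following sketch. Everything here is assumed for illustration: the `ParadigmAlarm` exception, the `reconcile` function, the `(value, stage)` storage layout, and the choice to reset a contradicted lower-confidence preference to "experimental".

```python
class ParadigmAlarm(Exception):
    """Raised when new evidence contradicts a hardened preference."""


def reconcile(stored: dict[str, tuple[str, str]], key: str, observed: str) -> None:
    """Fold a newly observed value into stored preferences.

    stored maps a preference key to (value, stage).
    """
    if key not in stored:
        stored[key] = (observed, "experimental")
        return
    value, stage = stored[key]
    if value == observed:
        return  # No contradiction; nothing to do.
    if stage == "hardened":
        # Do not overwrite silently: surface the conflict for review.
        raise ParadigmAlarm(f"{key}: hardened value {value!r} contradicted by {observed!r}")
    # Assumption: lower-confidence preferences are updated and start over.
    stored[key] = (observed, "experimental")


stored = {"shell": ("fish", "hardened"), "indent": ("tabs", "provisional")}
reconcile(stored, "indent", "spaces")  # silent update
try:
    reconcile(stored, "shell", "bash")  # hardened preference: alarm
except ParadigmAlarm as alarm:
    print(alarm)
```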
/distill. Expected improvement: U1, U3, U4 scores should converge toward U2 levels (4.0+).