Intro

Not Everything Deserves to Be Remembered

2 min read

Memory without judgment is noise. If you store every fact, every exchange, every passing mention with equal weight, you end up with a memory store that tells the AI nothing — because everything is equally important, which means nothing stands out.

Importance scoring is the mechanism that gives memory structure. It assigns each stored fact a value based on how relevant, recent, and frequently used it is. High-scoring memories get injected. Low-scoring memories get archived or discarded. The AI works with what matters, not with everything.

Designing a scoring function is an engineering decision with real consequences. Score too aggressively and you lose context you needed. Score too loosely and the injection block grows until it crowds out the work. This module teaches you to make that tradeoff deliberately.

Portfolio artifact

Skill

A documented importance scoring function — specifying the signals used, their weights, the scoring formula, and the threshold rules for surface, archive, and discard decisions.

By the end of this module, you will:

Explain the three core scoring signals and what each one measures
Design a weighted scoring formula appropriate for a specific memory store
Define threshold rules that determine when a memory is surfaced, archived, or discarded
Identify edge cases where standard scoring breaks down and how to handle them
Apply NIST MEASURE principles to memory quality tracking

Scenario

The Overloaded Store

3 min read

Six months after building a memory system, a developer runs a count: 4,200 stored memory entries. Everything the AI was told, every decision logged, every preference captured. The system is technically working — facts are being written. They're just not being read.

When the injection logic runs, it pulls the most recent 50 entries. Most of them are session scraps: "user wanted response in bullet form," "user mentioned they're in a hurry," "user asked about database performance." These were true in the moment. None of them are useful now.

Meanwhile, the actual important facts — "this project uses PostgreSQL not MySQL," "the team decided against Redis for caching in March," "the naming convention changed to camelCase in all new modules" — are buried in a store with no priority. They get injected sometimes. They get missed other times. The AI behaves inconsistently because the memory system has no idea which facts matter.

The problem isn't storage. It's scoring. Without a function that assigns meaningful importance to each stored fact, retrieval is a lottery. Every fact competes equally, which means critical context loses to noise on a coin flip.

A scoring function changes that. It gives the memory system a way to answer the question: given everything stored, what should the AI know right now?

Lesson

The Scoring Triad

3 min read

The importance scoring functions in this module are built from three signals: recency, frequency, and relevance. Each measures something different. Together they produce a score that reflects whether a memory should be in front of the AI right now.

Recency

How recently was this memory created or last accessed? Recent memories reflect the current state of a project. Old memories reflect past states that may no longer be accurate. Recency decays over time — a memory written yesterday scores higher on this signal than one written six months ago, all else being equal. The decay rate should match the pace at which your domain changes.
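One way to make the decay concrete is an exponential half-life. The sketch below is illustrative only: the function name `recency_score` and the 30-day default half-life are assumptions, not values this module prescribes. A fast-moving domain would use a shorter half-life, a slow-moving one a longer half-life.

```python
def recency_score(age_days: float, half_life_days: float = 30.0) -> float:
    """Exponential decay: 1.0 for a brand-new memory, 0.5 after one half-life.

    half_life_days is a hypothetical tuning knob; pick it to match how
    quickly facts in your domain go stale.
    """
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if half_life_days <= 0:
        raise ValueError("half_life_days must be positive")
    return 0.5 ** (age_days / half_life_days)


# A memory from yesterday versus one from six months ago.
yesterday = recency_score(1)
six_months = recency_score(180)
```

With these assumed defaults, the six-month-old memory keeps only a small fraction of the recency score that yesterday's memory has, which is the "all else being equal" ordering described above.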

Frequency

How often has this memory been accessed, retrieved, or reinforced? A memory that comes up in every session is probably load-bearing. A memory accessed once and never again is probably a one-off detail. Frequency is a proxy for ongoing relevance — facts that keep mattering keep getting used. Track access counts and let them influence score.
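A minimal sketch of turning an access count into a bounded signal, assuming a logarithmic scale so that the first few accesses matter more than the hundredth. The name `frequency_score` and the `saturation_count` parameter are illustrative assumptions.

```python
import math


def frequency_score(access_count: int, saturation_count: int = 50) -> float:
    """Log-scaled access count in [0, 1].

    saturation_count (hypothetical) is the number of accesses at which
    the signal is considered maxed out.
    """
    if access_count < 0:
        raise ValueError("access_count must be non-negative")
    if saturation_count < 1:
        raise ValueError("saturation_count must be at least 1")
    return min(1.0, math.log1p(access_count) / math.log1p(saturation_count))


one_off = frequency_score(1)        # accessed once and never again
load_bearing = frequency_score(40)  # comes up in almost every session
```

The log scale is a design choice, not a requirement: a linear count would let a handful of very hot memories dwarf everything else.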

Relevance

How semantically similar is this memory to the current session context? This is the hardest signal to compute cheaply. A simple approach: tag memories with topics at write time and match tags to the current session's topic. A more powerful approach: embed memories and score by cosine similarity to the current prompt. The right choice depends on your compute budget and how important semantic precision is for your use case.
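Both approaches can be sketched in a few lines. The tag version below uses Jaccard overlap, which is one assumed way to "match tags"; the embedding version computes plain cosine similarity over vectors you would obtain from whatever embedding model you use. Function names are illustrative.

```python
import math
from typing import Sequence, Set


def tag_relevance(memory_tags: Set[str], session_tags: Set[str]) -> float:
    """Cheap approach: Jaccard overlap between memory tags and session tags."""
    if not memory_tags or not session_tags:
        return 0.0
    return len(memory_tags & session_tags) / len(memory_tags | session_tags)


def cosine_relevance(memory_vec: Sequence[float], prompt_vec: Sequence[float]) -> float:
    """More powerful approach: cosine similarity between two embedding vectors."""
    if len(memory_vec) != len(prompt_vec):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(memory_vec, prompt_vec))
    norm = math.sqrt(sum(a * a for a in memory_vec)) * math.sqrt(
        sum(b * b for b in prompt_vec)
    )
    if norm == 0:
        return 0.0  # a zero vector carries no direction to compare
    return dot / norm


overlap = tag_relevance({"database", "postgresql"}, {"database", "caching"})
```

Cosine similarity can be negative; if the result feeds a score that is expected to stay in [0, 1], clamp it with `max(0.0, value)`.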

Combining the signals

A weighted sum is the most common formula: score = (w₁ × recency) + (w₂ × frequency) + (w₃ × relevance). The weights are the design decision. A system where architectural decisions matter more than recency should weight frequency heavily. A system where current task context dominates should weight relevance heavily. There is no universal correct weighting — you set it based on what your use case needs to prioritize.
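The weighted sum can be sketched as follows. It assumes each signal has already been normalized to [0, 1], and it requires the weights to sum to 1 so the final score stays in the same range. That constraint is an assumption of this sketch, not part of the formula above. The two weight profiles are hypothetical examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    recency: float
    frequency: float
    relevance: float

    def __post_init__(self) -> None:
        values = (self.recency, self.frequency, self.relevance)
        if any(w < 0 for w in values):
            raise ValueError("weights must be non-negative")
        if abs(sum(values) - 1.0) > 1e-9:
            raise ValueError("weights must sum to 1")


def importance_score(recency: float, frequency: float, relevance: float,
                     weights: Weights) -> float:
    """score = (w1 x recency) + (w2 x frequency) + (w3 x relevance)."""
    return (weights.recency * recency
            + weights.frequency * frequency
            + weights.relevance * relevance)


# Hypothetical profiles: one favors long-lived decisions, one favors the current task.
decision_heavy = Weights(recency=0.2, frequency=0.5, relevance=0.3)
task_heavy = Weights(recency=0.2, frequency=0.2, relevance=0.6)

# An old, frequently used, moderately relevant memory under each profile.
old_but_used = dict(recency=0.1, frequency=0.9, relevance=0.6)
score_a = importance_score(**old_but_used, weights=decision_heavy)
score_b = importance_score(**old_but_used, weights=task_heavy)
```

The same memory ranks higher under the decision-heavy profile than under the task-heavy one, which is the sense in which the weights are the design decision.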

Governance — NIST AI RMF

NIST MEASURE — Evaluating Memory System Quality

NIST's MEASURE function calls for quantitative and qualitative evaluation of AI system behavior. An importance scoring function is a measurable component: you can track whether high-scoring memories are actually being used in responses, whether discarded memories are ever missed, and whether the score distribution is healthy (not everything clustered at the top, not everything drifting toward archive). Define these metrics before you need them.
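Two of these metrics can be sketched directly. The code below assumes you can record which injected memories a response actually drew on; how you detect "used" is left open and is a hypothetical input here. The tier cutoffs are also assumed values.

```python
from typing import Dict, Sequence, Set


def injection_hit_rate(injected_ids: Sequence[str], used_ids: Set[str]) -> float:
    """Fraction of injected memories that the response actually used."""
    if not injected_ids:
        return 0.0
    hits = sum(1 for memory_id in injected_ids if memory_id in used_ids)
    return hits / len(injected_ids)


def tier_shares(scores: Sequence[float], surface_at: float = 0.6,
                discard_below: float = 0.3) -> Dict[str, float]:
    """Share of the store in each tier; a quick health check on the distribution."""
    if not scores:
        return {"surface": 0.0, "archive": 0.0, "discard": 0.0}
    total = len(scores)
    surface = sum(1 for s in scores if s >= surface_at)
    discard = sum(1 for s in scores if s < discard_below)
    return {
        "surface": surface / total,
        "archive": (total - surface - discard) / total,
        "discard": discard / total,
    }


hit_rate = injection_hit_rate(["m1", "m2", "m3", "m4"], {"m1", "m3"})
shares = tier_shares([0.9, 0.7, 0.5, 0.2])
```

A hit rate that stays low suggests the top of the ranking is noise; a surface share near 1.0 suggests everything is clustered at the top.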

NIST MANAGE — Scoring as a Lifecycle Decision

Scoring determines which memories persist and which are discarded. That is a lifecycle management decision with consequences for system behavior over time. NIST MANAGE asks: is there a defined process for reviewing and adjusting the scoring function as the system evolves? Weights that were correct at launch may be wrong at scale.

Context

Three Traps in Scoring Design

2 min read

Most scoring functions fail in predictable ways. These are the three traps you need to actively avoid in your design.

1. The recency bias trap

A scoring function that weights recency too heavily will discard old-but-critical facts. An architectural decision made six months ago can matter more than anything said in the latest session, yet a recency-heavy score ranks it below yesterday's session scraps. You need a way to protect high-stakes facts from decay: either flag them as protected (never discard regardless of score) or give them a score floor that decay can't push them below. Consider explicitly: which facts in your store should be immune to recency decay?
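Both protections fit in a small sketch. The `Memory` record, the `protected` flag, and the floor and threshold values below are all illustrative assumptions rather than a prescribed schema.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    score: float
    protected: bool = False  # hypothetical flag set at write time for high-stakes facts


def effective_score(memory: Memory, floor: float = 0.6) -> float:
    """Protected memories never rank below the floor, however far they have decayed."""
    return max(memory.score, floor) if memory.protected else memory.score


def may_discard(memory: Memory, discard_below: float = 0.2) -> bool:
    """Protected memories are never discarded regardless of score."""
    return not memory.protected and memory.score < discard_below


decision = Memory("Project uses PostgreSQL, not MySQL", score=0.05, protected=True)
scrap = Memory("User mentioned they're in a hurry", score=0.05)
```

Both memories have decayed to the same raw score, but only the session scrap is eligible for discard.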

2. The flat distribution trap

If every fact scores similarly — because the weights are balanced and there's little variance in the signals — the scoring function provides no ranking. The top 50 entries are indistinguishable from entries 51–100. Examine your score distribution. You want meaningful spread at the top, a clear middle tier, and a bottom tier that can be safely archived. If your distribution is flat, adjust weights or add a signal.
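A rough way to examine the distribution, assuming scores in [0, 1]: compare the average of the top 50 against the next 50 and look at the overall standard deviation. What counts as "too flat" is a judgment call, so this sketch only reports the numbers.

```python
import statistics
from typing import Dict, Sequence


def spread_report(scores: Sequence[float], top_k: int = 50) -> Dict[str, float]:
    """Summarize how well the scores separate the top tier from what follows."""
    if not scores:
        raise ValueError("scores must not be empty")
    ranked = sorted(scores, reverse=True)
    top = ranked[:top_k]
    following = ranked[top_k:2 * top_k]
    gap = statistics.mean(top) - statistics.mean(following) if following else 0.0
    return {
        "stdev": statistics.pstdev(ranked),
        "top_vs_next_gap": gap,
    }


flat = spread_report([0.5] * 100)                 # every fact scores the same
separated = spread_report([0.9] * 50 + [0.3] * 50)  # clear top tier
```

In the flat case both numbers are zero, meaning entries 1–50 are indistinguishable from entries 51–100; in the separated case the gap makes the cutoff defensible.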

3. The threshold creep trap

If the discard threshold is set too low out of caution, almost nothing is ever discarded and the archive grows indefinitely. If it's set too high, useful facts get lost. Thresholds need to be tested empirically, not set by intuition. Define a process for evaluating threshold performance: sample discarded memories monthly and check whether any of them would have been needed. Adjust based on evidence, not fear.
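Threshold rules and the monthly audit can be sketched together. The cutoff values and the sample size are placeholders to be tuned from evidence; the function names are illustrative.

```python
import random
from typing import Iterable, List, Optional


def decide(score: float, surface_at: float = 0.6, discard_below: float = 0.2) -> str:
    """Map a score to one of the three lifecycle decisions."""
    if score >= surface_at:
        return "surface"
    if score < discard_below:
        return "discard"
    return "archive"


def sample_for_audit(discarded_ids: Iterable[str], sample_size: int = 25,
                     seed: Optional[int] = None) -> List[str]:
    """Pick a random sample of discarded memories for manual review."""
    ids = list(discarded_ids)
    rng = random.Random(seed)
    return rng.sample(ids, min(sample_size, len(ids)))


decisions = [decide(s) for s in (0.8, 0.4, 0.1)]
audit_batch = sample_for_audit([f"m{i}" for i in range(100)], sample_size=25, seed=1)
```

Raising `discard_below` discards more and risks losing useful facts; lowering it keeps more in the archive. If reviewers keep finding needed facts in the audit batch, that is evidence the threshold is too high.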

You'll test your design against all three traps in the lab: designing a scoring function and defending your choices against each failure mode.

⚙ Skill Lab

Scoring Function Designer

~20 minutes · 5 exchanges

What you're doing

You'll design an importance scoring function for a described memory store. I'll walk you through the signals, challenge your weights, and test your threshold rules against the three failure traps.

Roles

📐

You: Scoring Designer
You design the function: signals, weights, formula, thresholds.

🔬

AI: Systems Engineer
I'll pressure-test your design against real failure modes. I won't accept vague weights.

Framework — apply to your design

Recency, Frequency, Relevance: the scoring triad

Weighted sum: score = (w₁×R) + (w₂×F) + (w₃×Rel)

Protect high-stakes facts from recency decay

Test for flat distribution — you need meaningful spread

NIST MEASURE: define metrics to evaluate scoring quality

Success criteria

Produce a scoring function spec: three signals with defined weights, a formula, threshold rules for surface/archive/discard, and a plan for measuring effectiveness.


✓ Module Complete

You've completed Module 2 of 8. Your scoring function spec is in your portfolio.
