AESOP AI Academy · Confidential Review Document

Curriculum Review Rubric

This rubric is provided to each LLM reviewer as the evaluation framework for the AESOP AI Academy curriculum. Read the full rubric before beginning your review. Report your scores using the format specified at the end of this document.
Your Assigned Role
Reviewer Assignment

See assignment below

Reviewer | Primary Lens | What to Emphasize
Claude | Narrative & Curriculum Integrity | Is story doing pedagogical work? Is the curriculum architecture sound?
Gemini | Technical & Factual Accuracy | Are AI concepts, definitions, and cited cases correct and current?
ChatGPT | Learner Experience & Accessibility | Can a real learner at each level navigate, understand, and complete the work?
Perplexity | Real-World Alignment & Currency | Does content reflect the current AI landscape? Are sources and cases current?

Program Philosophy — Read Before Evaluating

The AESOP Model: Story Is How Humans Learn

The AESOP AI Academy is built on one foundational principle: story is how humans learn — not how they are entertained. This is not a stylistic choice. Every lesson is designed so the narrative creates the problem, and the concept section names what the story already demonstrated.

The curriculum is structured across three proficiency levels (Intro, Basic, Advanced) and five modules. The master content lives at the Advanced level; Intro and Basic are intentional subsets — not simplified versions of Advanced content, but curated selections delivered through a completely different lens appropriate to that level.

The highest-priority question in every evaluation is: After completing this lesson, can the learner actually DO something they could not do before? Not recall a definition — perform a task, make a judgment, or run an experiment.

Proficiency Levels
Intro (Ages 5–8)

100% story-driven. Concrete, sensory language. The question being answered is: "What does it do?" No abstraction. No technical vocabulary.

Basic (Ages 9–12)

~50–65% narrative. Relational/logical language. The question being answered is: "How does it work?" Concepts are named but explained through analogy.

Advanced (Ages 13–18)

~20–35% narrative. Technical/systemic language. The question being answered is: "Why does it matter and what are the stakes?" Real documented cases. Full technical vocabulary.

Scoring Model

How to Score

Score | Label | Meaning
0 | Absent | Not present / completely fails
1 | Poor | Attempted but misses the mark
2 | Partial | Present but incomplete
3 | Adequate | Meets expectations
4–5 | Strong | Exceeds expectations

Score every criterion on the raw 0–5 scale. Criteria marked ×2 are weighted: their raw score is doubled, so each contributes up to 10 points instead of 5. Total possible score per unit: 100 points. You score ALL five dimensions regardless of your primary role; your role only defines where to apply the most critical attention.
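
The weighting arithmetic can be sketched as follows. This is a minimal illustration, assuming reviewers record raw 0–5 scores and that ×2 criteria are doubled when totaled. The names `weighted_points`, `d1_weights`, and `d1_raw` are hypothetical and not part of any AESOP tooling, and the raw scores are made up for the example.

```python
MAX_RAW = 5  # every criterion is scored on a raw 0-5 scale


def weighted_points(raw_score: int, weight: int = 1) -> int:
    """Points contributed by one criterion: raw 0-5 score times its weight."""
    if not 0 <= raw_score <= MAX_RAW:
        raise ValueError(f"raw score must be 0-{MAX_RAW}, got {raw_score}")
    return raw_score * weight


# Dimension 1 has one x2 criterion followed by three unweighted ones.
d1_weights = [2, 1, 1, 1]
d1_raw = [4, 3, 5, 4]  # illustrative raw scores, not real review data

d1_points = sum(weighted_points(r, w) for r, w in zip(d1_raw, d1_weights))
d1_max = sum(weighted_points(MAX_RAW, w) for w in d1_weights)

print(d1_points, "/", d1_max)  # 20 / 25
```

The same calculation applied to all five dimensions yields the 100-point maximum stated above.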

The Five Evaluation Dimensions
1. Narrative Integrity (25 pts)

Primary role: Claude  ·  All reviewers score

Does the narrative do actual pedagogical work — or is it decoration? Story must create the problem that the concept section answers.

Criterion | Evaluating Question | Max / Weight
Story Creates the Problem | Does the narrative create the exact problem or question that the concept section then answers? Or does the story feel disconnected? | 10 pts (×2)
Learner Lands the Insight | Does the protagonist arrive at the insight themselves through the story, or does an adult/narrator explain it to them? | 5 pts
Narrative Density Match | Is the story-to-concept ratio calibrated correctly for this level? (Intro = high story; Advanced = scenario hooks + dense concept) | 5 pts
Character Consistency | Are established characters used consistently? Does the narrative feel like a continuous experience? | 5 pts

0–1: Poor

Story is decoration. Concepts are text blocks with a thin narrative wrapper. Learner is told, not shown.

2–3: Adequate

Story is present and related, but the protagonist doesn't earn the insight — an adult explains it. Density roughly right but drifts.

4–5: Strong

The story creates a genuine problem. The learner character works it out. You couldn't remove the story without destroying the lesson.

2. Concept Accuracy (20 pts)

Primary role: Gemini  ·  All reviewers score

Wrong technical content creates confident misconceptions. Every definition must be correct at the depth appropriate to the level.

Criterion | Evaluating Question | Max / Weight
Definition Accuracy | Are AI terms (tokens, RLHF, hallucination, emergence, transformer, etc.) correctly defined at the appropriate depth for this level? | 10 pts (×2)
Real-World Case Fidelity | (Advanced only) Are cited cases (Lemoine/LaMDA, Schwartz attorney, NYT v. OpenAI, etc.) described accurately and without distortion? | 5 pts
Misconception Prevention | Does the content actively avoid and counter common AI misconceptions? (AI "thinks," AI "knows," AI is "neutral," AI is "magic") | 5 pts

0–1: Poor

Definitions vague or wrong. Common misconceptions reinforced. Real cases absent or misrepresented.

2–3: Adequate

Core concepts roughly correct but lack precision. Misconceptions not reinforced but not corrected either.

4–5: Strong

Definitions precise and level-appropriate. Misconceptions named and countered. Advanced cases accurate with correct attribution and stakes.

3. Level Appropriateness (20 pts)

Primary role: ChatGPT  ·  All reviewers score

Each level (Intro / Basic / Advanced) must feel purposefully designed for that learner — not adapted from another level. Subset-not-simplification: Intro covers fewer concepts with a completely different framing, not watered-down Advanced content.

Criterion | Evaluating Question | Max / Weight
Vocabulary Calibration | Is vocabulary genuinely matched to the level? Intro = concrete/sensory; Basic = relational/logical; Advanced = technical/systemic. Not just simpler words. | 5 pts
Cognitive Framing | Is the right question being asked for this level? Intro = "What does it do?"; Basic = "How does it work?"; Advanced = "Why does it matter and what are the stakes?" | 10 pts (×2)
Subset Integrity | Does Intro/Basic content feel purposefully curated, or like truncated Advanced content? | 5 pts

0–1: Poor

Feels like stripped-down Advanced. Vocabulary condescendingly oversimplified or accidentally too complex. "Dumbed down" rather than genuinely designed for the level.

2–3: Adequate

Vocabulary roughly calibrated but inconsistent. Framing appropriate in some sections but slips in others.

4–5: Strong

Each level feels written for that learner specifically. Intro feels naturally concrete; Advanced feels naturally systemic. Every concept in a lower level serves that level's complete learning outcome.

4. Delivery Architecture (15 pts)

All reviewers · Equal weight

Does the structure of how content is delivered serve the learner? Navigation, pacing, and layout are only meaningful insofar as they help or hinder the learner's experience. This is not a technical audit.

Criterion | Evaluating Question | Max / Weight
Learner Orientation | Can a learner immediately understand where they are, how much remains, and what comes next, without needing instructions? | 5 pts
Pacing Support | Does the delivery structure support the learner moving at a natural pace? Are there clear rest/break points? Does it feel rushed or padded? | 5 pts
Story-Concept Flow | Does the layout make the transition from story → concept → lab → quiz feel natural and progressive, or jarring and arbitrary? | 5 pts

0–1: Poor

A learner could not proceed without external guidance. Section transitions feel arbitrary.

2–3: Adequate

Learner can navigate but requires effort. Story-to-concept transitions work but feel mechanical.

4–5: Strong

Delivery feels invisible — learner is never thinking about navigation, only about the lesson. Story → concept → lab → quiz feel like one continuous experience.

5. Applied Outcome: Can the Learner DO Something? (20 pts)

⬆ HIGHEST PRIORITY DIMENSION · All reviewers

Override Rule: If Dimension 5 scores below 8/20, the unit cannot pass regardless of total score. A lesson where learners cannot do anything after completing it has failed its core purpose.
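
The override can be expressed as a simple check. This is a minimal sketch, assuming the Dimension 5 score is the weighted total out of 20. `blocked_by_override` is a hypothetical name. The rubric does not state a passing total, so the sketch covers only the override itself.

```python
D5_MAX = 20
D5_OVERRIDE_THRESHOLD = 8  # a score "below 8/20" blocks a pass


def blocked_by_override(d5_score: int) -> bool:
    """Return True when the Dimension 5 override rule prevents a pass."""
    if not 0 <= d5_score <= D5_MAX:
        raise ValueError(f"D5 score must be 0-{D5_MAX}, got {d5_score}")
    return d5_score < D5_OVERRIDE_THRESHOLD


# A unit scoring 7/20 on Dimension 5 cannot pass, whatever its total.
print(blocked_by_override(7))  # True
print(blocked_by_override(8))  # False
```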

The entire AESOP philosophy collapses if learners walk away with facts but no capability. After completing this lesson, can the learner actually DO something they could not do before? Not recall a definition — perform a task, make a judgment, or run an experiment.

Criterion | Evaluating Question | Max / Weight
Lab Executability | Can the story lab actually be performed right now, by this learner, using the stated tools (AESOP or an LLM)? Is the task clearly defined and completable? | 10 pts (×2)
Quiz Tests Judgment | Do quiz questions require the learner to apply, evaluate, or decide, rather than simply recall a definition or fact they just read? | 5 pts
Clear Capability Delta | Can you complete this sentence: "After this lesson, this learner can ___"? Is that capability something real and meaningful, not just "knows what X is"? | 5 pts

0–1: Poor

Labs are vague. Quizzes test recall. At the end of the lesson, the learner knows more facts but has no new capability.

2–3: Adequate

Lab has a real task but is underspecified. Quiz has some application but mostly recall. Capability delta exists but is fuzzy.

4–5: Strong

Lab is fully defined — learner knows what to do, where, what the output looks like, and what "done" means. Quiz forces judgment. You can state the capability delta in one sentence with an action verb.

Required Output Format

How to Report Your Scores

After completing your evaluation, report your scores in the following format. One block per lesson unit reviewed. This format feeds directly into the scoring dashboard.

REVIEWER: [Your name — Claude / Gemini / ChatGPT / Perplexity]
UNIT: [Module number and age group — e.g. "Module 1 – Basic (Ages 9–12)"]
LEVEL: [Intro / Basic / Advanced]
DATE: [Today's date]

DIMENSION 1 — NARRATIVE INTEGRITY (max 25)
  Story Creates the Problem:    [0–5]
  Learner Lands the Insight:    [0–5]
  Narrative Density Match:      [0–5]
  Character Consistency:        [0–5]
  D1 Notes: [Brief notes on findings]

DIMENSION 2 — CONCEPT ACCURACY (max 20)
  Definition Accuracy:          [0–5]
  Real-World Case Fidelity:     [0–5]
  Misconception Prevention:     [0–5]
  D2 Notes: [Brief notes on findings]

DIMENSION 3 — LEVEL APPROPRIATENESS (max 20)
  Vocabulary Calibration:       [0–5]
  Cognitive Framing:            [0–5]
  Subset Integrity:             [0–5]
  D3 Notes: [Brief notes on findings]

DIMENSION 4 — DELIVERY ARCHITECTURE (max 15)
  Learner Orientation:          [0–5]
  Pacing Support:               [0–5]
  Story-Concept Flow:           [0–5]
  D4 Notes: [Brief notes on findings]

DIMENSION 5 — APPLIED OUTCOME (max 20) *** HIGHEST PRIORITY ***
  Lab Executability:            [0–5]
  Quiz Tests Judgment:          [0–5]
  Clear Capability Delta:       [0–5]
  D5 Notes: Complete this sentence — "After this lesson, this learner can ___"

TOTAL SCORE: [weighted sum] / 100
D5 SCORE:    [weighted sum] / 20
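
A minimal sketch of how the two summary lines could be derived from the per-criterion entries, assuming the [0–5] fields hold raw scores and the totals apply the ×2 weights from the criterion tables. `DIMENSIONS` and `summary_lines` are hypothetical names, not part of the scoring dashboard. The sketch does not address how Real-World Case Fidelity should be entered for non-Advanced units, which the rubric leaves open.

```python
# Hypothetical layout: (dimension title, [(criterion, weight), ...]) in report order.
DIMENSIONS = [
    ("NARRATIVE INTEGRITY", [
        ("Story Creates the Problem", 2),
        ("Learner Lands the Insight", 1),
        ("Narrative Density Match", 1),
        ("Character Consistency", 1),
    ]),
    ("CONCEPT ACCURACY", [
        ("Definition Accuracy", 2),
        ("Real-World Case Fidelity", 1),
        ("Misconception Prevention", 1),
    ]),
    ("LEVEL APPROPRIATENESS", [
        ("Vocabulary Calibration", 1),
        ("Cognitive Framing", 2),
        ("Subset Integrity", 1),
    ]),
    ("DELIVERY ARCHITECTURE", [
        ("Learner Orientation", 1),
        ("Pacing Support", 1),
        ("Story-Concept Flow", 1),
    ]),
    ("APPLIED OUTCOME", [
        ("Lab Executability", 2),
        ("Quiz Tests Judgment", 1),
        ("Clear Capability Delta", 1),
    ]),
]


def summary_lines(raw: dict) -> list:
    """Build the TOTAL and D5 lines from raw 0-5 scores keyed by criterion name."""
    totals = [
        sum(raw[name] * weight for name, weight in criteria)
        for _title, criteria in DIMENSIONS
    ]
    return [
        f"TOTAL SCORE: {sum(totals)} / 100",
        f"D5 SCORE:    {totals[4]} / 20",
    ]


# Illustrative input: every criterion scored 3 (Adequate).
example_raw = {
    name: 3 for _title, criteria in DIMENSIONS for name, _weight in criteria
}
for line in summary_lines(example_raw):
    print(line)
# TOTAL SCORE: 60 / 100
# D5 SCORE:    12 / 20
```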