Before: rate 48 products × 7 criteria = 336 judgments by hand. Tedious, slow, inconsistent. After: ~20 minutes of conversation with an AI agent. One rated Excel file out.
This repo is a playbook your AI coding agent (Claude Code, Cursor, Codex, etc.) reads and follows. It walks you through designing a rubric grounded in real copywriting books, builds one specialist rater per criterion, and runs the evaluation. You drive the choices; the agent does the work.
Works out of the box for Assignment 01 of the GenAI and LLM Applications course at Google-Reichman Tech School. Adapts to any similar assignment.
git clone https://github.com/herman181920/llm-eval-framework.git
cd llm-eval-framework
bash setup.sh

Then open the repo in your AI agent and say:
"Help me evaluate my LLM outputs."
That's it. The agent reads CLAUDE.md → AGENT.md and walks you through 8 phases, one question at a time. validate.py runs automatically in Phase 1 after you provide the paths.
The framework asks one question per turn with lettered choices. You pick, it proceeds.
You: Help me evaluate my LLM outputs.
Agent: Q1 — What are we working with?
(A) My actual assignment — I have a notebook and CSV / eval sheet
(B) Try the example data first (examples/rated_output.xlsx)
(C) Something else — I'll describe it
You: A
Agent: Q2 — Paste the paths to your notebook (.ipynb) and product data (CSV
or eval sheet if descriptions are already generated).
You: ~/Desktop/assignment.ipynb ~/Desktop/products.csv
Agent: [reads files, runs validate.py, builds summary]
Notebook: 7 criteria detected. Input: 48 rows. Descriptions: already generated.
Q3 — Does this look right?
(A) Correct — move to Phase 2
(B) Column mapping is wrong — I'll fix it
(C) Descriptions aren't generated yet — generate them now
You: A
Agent: Q4 — Which rubric source?
(A) Pre-extracted expert_rubric.md (fastest, default)
(B) Re-query NotebookLM with the 7 books (~5 min)
(C) Add my own PDFs first
You: A
... Phases 3–6 walk through each criterion, agent architecture, model choice, final review ...
Agent: Phase 7 — running 48 products × 7 criteria...
[15/48] sku=KC-001 → fluency=good grammar=good tone=ok grounding=good
length=good latency=good cost=good → PASS
...
Agent: Done. 41/48 passed (85%). Excel saved to outputs/evaluation_sheet.xlsx.
Most failures were Grounding on products with "NOT" constraints.
Q16 — Next step? (A) Finish (B) Re-run a criterion (C) Save agent prompts
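The PASS decision shown in the transcript above can be sketched in a few lines. This is an illustrative assumption, not the framework's actual code: it assumes a product passes when every criterion is rated "good" or "ok".

```python
# Hypothetical sketch of the per-product PASS rule from the transcript.
# The labels ("good"/"ok") and the all-criteria rule are assumptions.

PASSING_LABELS = {"good", "ok"}  # assumed passing labels

def product_passes(scores: dict[str, str]) -> bool:
    """True when every criterion received a passing label."""
    return all(label in PASSING_LABELS for label in scores.values())

def pass_rate(products: list[dict[str, str]]) -> float:
    """Fraction of products whose every criterion passed."""
    return sum(product_passes(p) for p in products) / len(products)

# Example mirroring sku=KC-001 from the transcript
kc_001 = {"fluency": "good", "grammar": "good", "tone": "ok",
          "grounding": "good", "length": "good", "latency": "good",
          "cost": "good"}
```

Under this rule, 41 passing products out of 48 gives the 85% pass rate reported in the transcript.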
Want to skip the rubric design? Say "fast mode" and the agent will accept all defaults, confirm once, and run the whole evaluation in ~20 minutes.
Will this cost money? No, if you use the default "agent" rater (your AI coding agent does the rating). The framework only costs money if you explicitly pick an API rater (OpenAI / Anthropic) in Phase 5.
Do I need to know Python? No. The agent runs everything. You just review and approve decisions at each phase.
What if my assignment or CSV is different from the default?
Phase 1 reads your actual notebook and CSV, then asks you to confirm the column mapping. Nothing is hardcoded. Running python validate.py --csv your.csv --notebook your.ipynb catches mismatches early.
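The kind of early check validate.py performs can be sketched like this. The column names below ("sku", "description") are illustrative assumptions; the real required columns come from your own notebook and CSV:

```python
# Minimal sketch of a header-mismatch check, in the spirit of validate.py.
# REQUIRED_COLUMNS is an assumption for illustration, not the real schema.
import csv
import io

REQUIRED_COLUMNS = {"sku", "description"}  # assumed; confirmed in Phase 1

def missing_columns(csv_text: str) -> set[str]:
    """Return the required columns absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return REQUIRED_COLUMNS - set(reader.fieldnames or [])

sample = "sku,description\nKC-001,A sturdy kitchen chair\n"
```

Catching a missing or renamed column here is what lets Phase 1 ask you to fix the mapping before any ratings run.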
Does it work offline? Mostly yes. Ratings via "agent" or Ollama need no network. Only Phase 2's fresh knowledge-base extraction via NotebookLM needs internet (and the pre-extracted rubric is included so you can skip that entirely).
How long does it take? ~20-30 minutes end-to-end for 48 products, most of which is the model rating each description. Setup + rubric design is 5-10 minutes.
My notebook uses 10 criteria, not 7. Can I customize? Yes. Phase 3 walks you through each criterion with three options: keep default, tweak, or generate a new definition from the included books.
- AGENT.md — the 8-phase playbook the agent follows
- knowledge_base/books/ — 7 books on copywriting and conversion (Bly, Hopkins, Blanks & Jesson, Handley, Krug, Yifrah, Podmajersky)
- knowledge_base/expert_rubric.md — pre-extracted expert knowledge for Fluency, Grammar, Tone, Grounding
- config.yaml — defaults for Assignment 01
- agents/TEMPLATE.md — schema for the specialist rater agents
- utils/ — reusable helpers (scoring, deterministic ratings, NotebookLM setup)
- examples/rated_output.xlsx — a real output from this framework
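For orientation, config.yaml might look roughly like the fragment below. Every key name and value here is an assumption for illustration, not the file's actual contents; only the facts (48 products, the 7 criteria, the agent-rater default, the output path) come from this README:

```yaml
# Hypothetical sketch of Assignment 01 defaults; the real config.yaml may differ.
products: 48
criteria: [fluency, grammar, tone, length, grounding, latency, cost]
rater: agent                          # default: your AI coding agent rates; no API cost
output: outputs/evaluation_sheet.xlsx
```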
Required:
- An AI coding agent (Claude Code / Cursor / Codex / Aider / Continue / etc.)
- Python 3.10+
Optional (only if you pick them in Phase 5):
- Ollama (free local models)
- OpenAI / Anthropic API key (paid API ratings)
- Google account (only if you want to re-extract the rubric in Phase 2; not needed otherwise)
GenAI and LLM Applications course at Google-Reichman Tech School (a joint program between Google and Reichman), Assignment 01: generate product descriptions for 48 products and evaluate them on 7 criteria (Fluency, Grammar, Tone, Length, Grounding, Latency, Cost). examples/rated_output.xlsx is a real output from this framework.