LLM Eval Framework

Before: rate 48 products × 7 criteria = 336 judgments by hand. Tedious, slow, inconsistent. After: ~20 minutes of conversation with an AI agent. One rated Excel file out.

This repo is a playbook your AI coding agent (Claude Code, Cursor, Codex, etc.) reads and follows. It walks you through designing a rubric grounded in real copywriting books, builds one specialist rater per criterion, and runs the evaluation. You drive the choices; the agent does the work.

Works out of the box for Assignment 01 of the GenAI and LLM Applications course at Google-Reichman Tech School. Adapts to any similar assignment.

Quickstart

git clone https://github.com/herman181920/llm-eval-framework.git
cd llm-eval-framework
bash setup.sh

Then open the repo in your AI agent and say:

"Help me evaluate my LLM outputs."

That's it. The agent reads CLAUDE.md / AGENT.md and walks you through 8 phases, one question at a time. validate.py runs automatically in Phase 1 after you provide the paths.

What a run looks like

The framework asks one question per turn with lettered choices. You pick, it proceeds.

You: Help me evaluate my LLM outputs.

Agent: Q1 — What are we working with?
       (A) My actual assignment — I have a notebook and CSV / eval sheet
       (B) Try the example data first (examples/rated_output.xlsx)
       (C) Something else — I'll describe it
You: A

Agent: Q2 — Paste the paths to your notebook (.ipynb) and product data (CSV
       or eval sheet if descriptions are already generated).
You: ~/Desktop/assignment.ipynb  ~/Desktop/products.csv

Agent: [reads files, runs validate.py, builds summary]
       Notebook: 7 criteria detected.  Input: 48 rows.  Descriptions: already generated.
       Q3 — Does this look right?
       (A) Correct — move to Phase 2
       (B) Column mapping is wrong — I'll fix it
       (C) Descriptions aren't generated yet — generate them now
You: A

Agent: Q4 — Which rubric source?
       (A) Pre-extracted expert_rubric.md (fastest, default)
       (B) Re-query NotebookLM with the 7 books (~5 min)
       (C) Add my own PDFs first
You: A

... Phases 3–6 walk through each criterion, agent architecture, model choice, final review ...

Agent: Phase 7 — running 48 products × 7 criteria...
       [15/48] sku=KC-001 → fluency=good grammar=good tone=ok grounding=good
               length=good latency=good cost=good → PASS
...
Agent: Done. 41/48 passed (85%). Excel saved to outputs/evaluation_sheet.xlsx.
       Most failures were Grounding on products with "NOT" constraints.
       Q16 — Next step? (A) Finish  (B) Re-run a criterion  (C) Save agent prompts

Want to skip the rubric design? Say "fast mode" and the agent will accept all defaults, confirm once, and run the whole evaluation in ~20 minutes.
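The pass/fail roll-up in the transcript above can be sketched in a few lines of Python. This is an illustration only, not the framework's actual scoring code (which lives in utils/): the rating labels ("good"/"ok"/"bad") and the rule that every criterion must pass are assumptions.

```python
# Illustrative sketch of the per-product roll-up shown in the transcript.
# NOTE: the labels and the pass rule here are assumptions, not the
# framework's actual scoring code (see utils/ in the repo).

PASSING = {"good", "ok"}  # hypothetical: any criterion rated "bad" fails the product

def product_passes(ratings: dict[str, str]) -> bool:
    """A product passes only if every criterion got a passing label."""
    return all(label in PASSING for label in ratings.values())

def summarize(results: dict[str, dict[str, str]]) -> str:
    """Format the overall pass count like the transcript's summary line."""
    passed = sum(product_passes(r) for r in results.values())
    total = len(results)
    return f"{passed}/{total} passed ({passed / total:.0%})"

results = {
    "KC-001": {"fluency": "good", "grammar": "good", "tone": "ok",
               "grounding": "good", "length": "good", "latency": "good",
               "cost": "good"},
    "KC-002": {"fluency": "good", "grammar": "good", "tone": "good",
               "grounding": "bad", "length": "good", "latency": "good",
               "cost": "good"},
}
print(summarize(results))  # → 1/2 passed (50%)
```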

FAQ

Will this cost money? No, if you use the default "agent" rater (your AI coding agent does the rating). The framework only costs money if you explicitly pick an API rater (OpenAI / Anthropic) in Phase 5.

Do I need to know Python? No. The agent runs everything. You just review and approve decisions at each phase.

What if my assignment or CSV is different from the default? Phase 1 reads your actual notebook and CSV, then asks you to confirm column names. Nothing is hardcoded. Running python validate.py --csv your.csv --notebook your.ipynb catches mismatches early.
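For context, the kind of column-mapping check validate.py performs can be sketched as follows. This is a hedged illustration of the idea, not validate.py itself; the expected column names are hypothetical.

```python
import csv
import io

# Hypothetical expected columns -- your assignment's actual names may differ,
# which is exactly the mismatch such a check is meant to catch early.
EXPECTED = {"sku", "product_name", "description"}

def check_columns(csv_text: str) -> set[str]:
    """Return the expected columns missing from the CSV header (case-insensitive)."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return EXPECTED - {h.strip().lower() for h in header}

sample = "SKU,Product_Name,Price\nKC-001,Kettle,19.99\n"
print(check_columns(sample))  # → {'description'}
```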

Does it work offline? Mostly yes. Ratings via "agent" or Ollama need no network. Only Phase 2's fresh knowledge-base extraction via NotebookLM needs internet (and the pre-extracted rubric is included so you can skip that entirely).

How long does it take? ~20-30 minutes end-to-end for 48 products, most of which is the model rating each description. Setup + rubric design is 5-10 minutes.

My notebook uses 10 criteria, not 7. Can I customize? Yes. Phase 3 walks you through each criterion with three options: keep default, tweak, or generate a new definition from the included books.

What's included

  • AGENT.md — the 8-phase playbook the agent follows
  • knowledge_base/books/ — 7 books on copywriting and conversion (Bly, Hopkins, Blanks & Jesson, Handley, Krug, Yifrah, Podmajersky)
  • knowledge_base/expert_rubric.md — pre-extracted expert knowledge for Fluency, Grammar, Tone, Grounding
  • config.yaml — defaults for Assignment 01
  • agents/TEMPLATE.md — schema for the specialist rater agents
  • utils/ — reusable helpers (scoring, deterministic ratings, NotebookLM setup)
  • examples/rated_output.xlsx — a real output from this framework

Requirements

Required:

  • An AI coding agent (Claude Code / Cursor / Codex / Aider / Continue / etc.)
  • Python 3.10+

Optional (only if you pick them in Phase 5):

  • Ollama (free local models)
  • OpenAI / Anthropic API key (paid API ratings)
  • Google account (only if you want to re-extract the rubric in Phase 2; not needed otherwise)

The assignment this was built for

GenAI and LLM Applications course at Google-Reichman Tech School (a joint program between Google and Reichman), Assignment 01: generate product descriptions for 48 products and evaluate them on 7 criteria (Fluency, Grammar, Tone, Length, Grounding, Latency, Cost). examples/rated_output.xlsx is a real output from this framework.

About

Agent-first framework for evaluating LLM outputs. Point your AI coding agent at this repo for an interactive 8-phase workflow.
