LLM Eval Framework

Before: rate 48 products × 7 criteria = 336 judgments by hand. Tedious, slow, inconsistent. After: ~20 minutes of conversation with an AI agent. One rated Excel file out.

This repo is a playbook your AI coding agent (Claude Code, Cursor, Codex, etc.) reads and follows. It walks you through designing a rubric grounded in real copywriting books, builds one specialist rater per criterion, and runs the evaluation. You drive the choices; the agent does the work.

Works out of the box for Assignment 01 of the GenAI and LLM Applications course at Google-Reichman Tech School. Adapts to any similar assignment.

Quickstart

git clone https://github.com/herman181920/llm-eval-framework.git
cd llm-eval-framework
bash setup.sh

Then open the repo in your AI agent and say:

"Help me evaluate my LLM outputs."

That's it. The agent reads CLAUDE.md / AGENT.md and walks you through 8 phases, one question at a time. validate.py runs automatically in Phase 1 after you provide the paths.

What a run looks like

The framework asks one question per turn with lettered choices. You pick, it proceeds.

You: Help me evaluate my LLM outputs.

Agent: Q1 — What are we working with?
       (A) My actual assignment — I have a notebook and CSV / eval sheet
       (B) Try the example data first (examples/rated_output.xlsx)
       (C) Something else — I'll describe it
You: A

Agent: Q2 — Paste the paths to your notebook (.ipynb) and product data (CSV
       or eval sheet if descriptions are already generated).
You: ~/Desktop/assignment.ipynb  ~/Desktop/products.csv

Agent: [reads files, runs validate.py, builds summary]
       Notebook: 7 criteria detected.  Input: 48 rows.  Descriptions: already generated.
       Q3 — Does this look right?
       (A) Correct — move to Phase 2
       (B) Column mapping is wrong — I'll fix it
       (C) Descriptions aren't generated yet — generate them now
You: A

Agent: Q4 — Which rubric source?
       (A) Pre-extracted expert_rubric.md (fastest, default)
       (B) Re-query NotebookLM with the 7 books (~5 min)
       (C) Add my own PDFs first
You: A

... Phases 3–6 walk through each criterion, agent architecture, model choice, final review ...

Agent: Phase 7 — running 48 products × 7 criteria...
       [15/48] sku=KC-001 → fluency=good grammar=good tone=ok grounding=good
               length=good latency=good cost=good → PASS
...
Agent: Done. 41/48 passed (85%). Excel saved to outputs/evaluation_sheet.xlsx.
       Most failures were Grounding on products with "NOT" constraints.
       Q16 — Next step? (A) Finish  (B) Re-run a criterion  (C) Save agent prompts

Want to skip the rubric design? Say "fast mode" and the agent will accept all defaults, confirm once, and run the whole evaluation in ~20 minutes.
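The pass/fail roll-up in the transcript above can be sketched in a few lines of Python. This is an illustration only, not the framework's actual scoring code (which lives in utils/): the rating labels ("good"/"ok"/"bad") and the rule that every criterion must pass are assumptions.

```python
# Illustrative sketch of the per-product roll-up shown in the transcript.
# NOTE: the labels and the pass rule here are assumptions, not the
# framework's actual scoring code (see utils/ in the repo).

PASSING = {"good", "ok"}  # hypothetical: any criterion rated "bad" fails the product

def product_passes(ratings: dict[str, str]) -> bool:
    """A product passes only if every criterion got a passing label."""
    return all(label in PASSING for label in ratings.values())

def summarize(results: dict[str, dict[str, str]]) -> str:
    """Format the overall pass count like the transcript's summary line."""
    passed = sum(product_passes(r) for r in results.values())
    total = len(results)
    return f"{passed}/{total} passed ({passed / total:.0%})"

results = {
    "KC-001": {"fluency": "good", "grammar": "good", "tone": "ok",
               "grounding": "good", "length": "good", "latency": "good",
               "cost": "good"},
    "KC-002": {"fluency": "good", "grammar": "good", "tone": "good",
               "grounding": "bad", "length": "good", "latency": "good",
               "cost": "good"},
}
print(summarize(results))  # → 1/2 passed (50%)
```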

FAQ

Will this cost money? No, if you use the default "agent" rater (your AI coding agent does the rating). The framework only costs money if you explicitly pick an API rater (OpenAI / Anthropic) in Phase 5.

Do I need to know Python? No. The agent runs everything. You just review and approve decisions at each phase.

What if my assignment or CSV is different from the default? Phase 1 reads your actual notebook and CSV, then asks you to confirm column names. Nothing is hardcoded. Running python validate.py --csv your.csv --notebook your.ipynb catches mismatches early.
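For context, the kind of column-mapping check validate.py performs can be sketched as follows. This is a hedged illustration of the idea, not validate.py itself; the expected column names are hypothetical.

```python
import csv
import io

# Hypothetical expected columns -- your assignment's actual names may differ,
# which is exactly the mismatch such a check is meant to catch early.
EXPECTED = {"sku", "product_name", "description"}

def check_columns(csv_text: str) -> set[str]:
    """Return the expected columns missing from the CSV header (case-insensitive)."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return EXPECTED - {h.strip().lower() for h in header}

sample = "SKU,Product_Name,Price\nKC-001,Kettle,19.99\n"
print(check_columns(sample))  # → {'description'}
```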

Does it work offline? Mostly yes. Ratings via "agent" or Ollama need no network. Only Phase 2's fresh knowledge-base extraction via NotebookLM needs internet (and the pre-extracted rubric is included so you can skip that entirely).

How long does it take? ~20-30 minutes end-to-end for 48 products, most of which is the model rating each description. Setup + rubric design is 5-10 minutes.

My notebook uses 10 criteria, not 7. Can I customize? Yes. Phase 3 walks you through each criterion with three options: keep default, tweak, or generate a new definition from the included books.

What's included

  • AGENT.md — the 8-phase playbook the agent follows
  • knowledge_base/books/ — 7 books on copywriting and conversion (Bly, Hopkins, Blanks & Jesson, Handley, Krug, Yifrah, Podmajersky)
  • knowledge_base/expert_rubric.md — pre-extracted expert knowledge for Fluency, Grammar, Tone, Grounding
  • config.yaml — defaults for Assignment 01
  • agents/TEMPLATE.md — schema for the specialist rater agents
  • utils/ — reusable helpers (scoring, deterministic ratings, NotebookLM setup)
  • examples/rated_output.xlsx — a real output from this framework

Requirements

Required:

  • An AI coding agent (Claude Code / Cursor / Codex / Aider / Continue / etc.)
  • Python 3.10+

Optional (only if you pick them in Phase 5):

  • Ollama (free local models)
  • OpenAI / Anthropic API key (paid API ratings)
  • Google account (only if you want to re-extract the rubric in Phase 2; not needed otherwise)

The assignment this was built for

GenAI and LLM Applications course at Google-Reichman Tech School (a joint program between Google and Reichman), Assignment 01: generate product descriptions for 48 products and evaluate them on 7 criteria (Fluency, Grammar, Tone, Length, Grounding, Latency, Cost). examples/rated_output.xlsx is a real output from this framework.

About

Agent-first framework for evaluating LLM outputs. Point your AI coding agent at this repo for an interactive 8-phase workflow.
