FreeCite: A Judge-Free Benchmark for Granular Citation Evaluation in Large Language Models
Updated Feb 22, 2026 - Python
A realistic RL environment for training LLM agents on enterprise email triage, featuring multi-step decision making, ambiguity handling, tool usage, and deterministic evaluation.
Deterministic offline ComtradeBench judge for evaluating agent robustness under pagination, retries, duplicates, page drift, and totals traps.
Public, fully local PoCs for counterfactually auditable lifecycle certification: exact paired replay, drift monitoring, post-drift replanning, and bridge-aware ledger control on synthetic tasks.