deterministic-evaluation

Here are 4 public repositories matching this topic...

flozxwer / FreeCite

FreeCite: A Judge-Free Benchmark for Granular Citation Evaluation in Large Language Models

citation teacher-forcing deterministic-evaluation freecite context-conditional-citation-prediction

Updated Feb 22, 2026
Python

OjasD07 / scaler-openenv-hackathon

A realistic RL environment for training LLM agents on enterprise email triage—featuring multi-step decision making, ambiguity handling, tool usage, and deterministic evaluation.

Updated May 18, 2026
Python

yonghongzhang-io / green-comtrade-bench-v2

Deterministic offline ComtradeBench judge for evaluating agent robustness under pagination, retries, duplicates, page drift, and totals traps.

benchmark comtrade data-quality agentbeats deterministic-evaluation api-faults robust-agents

Updated Mar 23, 2026
Python

kadubon / agent-lifecycle-certification-poc

Public, fully local PoCs for counterfactually auditable lifecycle certification: exact paired replay, drift monitoring, post-drift replanning, and bridge-aware ledger control on synthetic tasks.

ai poc autonomous-agents synthetic-data drift-monitoring llm scientific-reproducibility deterministic-evaluation counterfactual-auditing paired-replay

Updated Mar 19, 2026
Python
