Switch to Gemini 2.5 Pro model and add evaluation tools #10
mayurrrrr wants to merge 1 commit into tillioss:main from
Conversation
Updated LLM model references in curriculum_gateway.py, gateway.py, and guardrails.py to use 'models/gemini-2.5-pro' for improved output quality. Added a comprehensive evaluation framework, experimentation guide, batch and API test scripts, realistic test data, and supporting files to enable systematic testing and prompt improvement for SEAL project endpoints.
Pull request overview
This pull request updates the SEAL project to use the Gemini 2.5 Pro model and adds a comprehensive evaluation framework for testing and improving prompt outputs. The changes include model reference updates across three core files and the addition of extensive testing, evaluation, and data generation tools.
Key Changes:
- Updated LLM model from 'gemini-1.5-flash-002' to 'models/gemini-2.5-pro' in gateway files and guardrails
- Added comprehensive evaluation framework with automated and human evaluation capabilities
- Introduced realistic test data generation based on educational research patterns
- Created extensive documentation and experimentation guides
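The model switch itself amounts to a one-line change per file. A hypothetical sketch of the pattern (the constant name and surrounding wiring are assumptions; the actual gateway files may structure this differently):

```python
# Hypothetical sketch of the model reference update described in this PR.
# The constant name is illustrative, not the project's actual identifier.
OLD_MODEL = "gemini-1.5-flash-002"
NEW_MODEL = "models/gemini-2.5-pro"

def resolve_model_name() -> str:
    """Return the fully qualified Gemini model identifier used by the gateways."""
    return NEW_MODEL
```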
Reviewed changes
Copilot reviewed 17 out of 18 changed files in this pull request and generated 18 comments.
Summary per file:
| File | Description |
|---|---|
| app/llm/gateway.py | Updated Gemini model version to 2.5 Pro |
| app/llm/curriculum_gateway.py | Updated Gemini model version to 2.5 Pro |
| app/safety/guardrails.py | Updated Gemini model version to 2.5 Pro |
| test/evaluation_framework.py | Added automated evaluation framework with quality scoring |
| test/realistic_data_generator.py | Added realistic test data generation with educational patterns |
| test/human_evaluation_framework.py | Added HTML-based human evaluation interface |
| test/run_comprehensive_evaluation.py | Added comprehensive evaluation runner integrating all components |
| test/quick_start_evaluation.py | Added quick start tools and test case generation |
| test/manual_prompt_runner.py | Added manual prompt testing utility |
| test/batch_prompt_runner.py | Added batch processing for test cases |
| test/api_test_script.py | Added API endpoint testing script |
| test/quick_test_cases.json | Added sample test cases for EMT and curriculum |
| test/results/batch_results.csv | Added batch test results with sample responses |
| test/data/realistic_profiles.json | Added realistic class profile data |
| test/data/curriculum_test_cases.json | Added curriculum intervention test cases |
| test/EXPERIMENTATION_GUIDE.md | Added comprehensive experimentation guide |
| SEAL_EVALUATION_FRAMEWORK_SUMMARY.md | Added framework summary documentation |
```python
print("\n" + "=" * 50)

# Load test cases
test_cases = create_simple_test_cases()
```
The function create_simple_test_cases() is called but never defined in this file. This will cause a NameError at runtime when main() is executed.
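One way to resolve the `NameError` is to define the helper in the same file. A minimal sketch, assuming the downstream code only iterates over dict-shaped cases (the field names here are illustrative, not the project's actual schema):

```python
def create_simple_test_cases():
    """Minimal stand-in for the missing helper.

    Field names are illustrative assumptions; the real test cases in
    quick_test_cases.json may use a different schema.
    """
    return [
        {"id": "emt_basic", "prompt_type": "emt", "num_students": 19},
        {"id": "curriculum_basic", "prompt_type": "curriculum", "num_students": 25},
    ]
```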
```json
"num_students": 19,
"emt_scores": {
    "EMT1": [
        100,
```
In realistic_profiles.json, line 1088 shows a value of 100, which should be written as 100.0 for consistency with the other float values in the EMT scores arrays throughout the file.
```diff
- 100,
+ 100.0,
```
```json
67.1,
94.1,
67.6,
100,
```
This line shows a value of 100, which should be written as 100.0 for consistency with the other float values in the EMT scores arrays throughout the file.
```diff
- 100,
+ 100.0,
```
```python
def create_test_cases(self) -> List[TestCase]:
    """Create comprehensive test cases for both prompt types"""
    test_cases = []
```
The variable `test_cases` is assigned but never used.
```diff
- test_cases = []
```
```python
test_cases = evaluator.generate_realistic_test_data(args.num_classes)

# Step 2: Run automated evaluation
automated_results = evaluator.run_automated_evaluation(test_cases)
```
The variable `automated_results` is assigned but never used.
```diff
- automated_results = evaluator.run_automated_evaluation(test_cases)
+ evaluator.run_automated_evaluation(test_cases)
```
```python
Simplified script to get Mayur started with prompt testing
"""

import os
```
The import of `os` is not used.
```diff
- import os
```
```python
import random
import json
from typing import Dict, List, Any, Tuple
```
The import of `Tuple` is not used.
```diff
- from typing import Dict, List, Any, Tuple
+ from typing import Dict, List, Any
```
```python
from typing import Dict, List, Any, Tuple
from dataclasses import dataclass
from pathlib import Path
import numpy as np
```
The import of `numpy as np` is not used.
```diff
- import numpy as np
```
```python
Integrates automated testing, realistic data generation, and human evaluation
"""

import os
```
The import of `os` is not used.
```diff
- import os
```
```python
parser.add_argument("--out", default=None, help="Output CSV path (optional)")
args = parser.parse_args()

cases = json.load(open(args.input, "r", encoding="utf-8"))
```
The file is opened but never closed; use a context manager so the handle is released even if `json.load` raises.
```diff
- cases = json.load(open(args.input, "r", encoding="utf-8"))
+ with open(args.input, "r", encoding="utf-8") as f:
+     cases = json.load(f)
```
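The context-manager form closes the handle deterministically even on error. A self-contained sketch of the pattern (the helper name and the temporary file stand in for `args.input`, which is not shown here):

```python
import json
import os
import tempfile

def load_cases(path: str):
    """Load test cases from a JSON file, closing the handle deterministically."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

# Example usage: a temporary file stands in for the batch runner's input path.
tmp = tempfile.NamedTemporaryFile("w", suffix=".json", delete=False, encoding="utf-8")
json.dump([{"id": "case-1"}], tmp)
tmp.close()
cases = load_cases(tmp.name)
os.unlink(tmp.name)
```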