Switch to Gemini 2.5 Pro model and add evaluation tools #10
mayurrrrr wants to merge 1 commit into tillioss:main from
Conversation
Updated LLM model references in curriculum_gateway.py, gateway.py, and guardrails.py to use 'models/gemini-2.5-pro' for improved output quality. Added a comprehensive evaluation framework, experimentation guide, batch and API test scripts, realistic test data, and supporting files to enable systematic testing and prompt improvement for SEAL project endpoints.
Pull request overview
This pull request updates the SEAL project to use the Gemini 2.5 Pro model and adds a comprehensive evaluation framework for testing and improving prompt outputs. The changes include model reference updates across three core files and the addition of extensive testing, evaluation, and data generation tools.
Key Changes:
- Updated LLM model from 'gemini-1.5-flash-002' to 'models/gemini-2.5-pro' in gateway files and guardrails
- Added comprehensive evaluation framework with automated and human evaluation capabilities
- Introduced realistic test data generation based on educational research patterns
- Created extensive documentation and experimentation guides
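The model switch itself amounts to a one-line change per file. A hypothetical sketch of the pattern (the constant name and surrounding wiring are assumptions; the actual gateway files may structure this differently):

```python
# Hypothetical sketch of the model reference update described in this PR.
# The constant name is illustrative, not the project's actual identifier.
OLD_MODEL = "gemini-1.5-flash-002"
NEW_MODEL = "models/gemini-2.5-pro"

def resolve_model_name() -> str:
    """Return the fully qualified Gemini model identifier used by the gateways."""
    return NEW_MODEL
```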
Reviewed changes
Copilot reviewed 17 out of 18 changed files in this pull request and generated 18 comments.
Summary per file:
| File | Description |
|---|---|
| app/llm/gateway.py | Updated Gemini model version to 2.5 Pro |
| app/llm/curriculum_gateway.py | Updated Gemini model version to 2.5 Pro |
| app/safety/guardrails.py | Updated Gemini model version to 2.5 Pro |
| test/evaluation_framework.py | Added automated evaluation framework with quality scoring |
| test/realistic_data_generator.py | Added realistic test data generation with educational patterns |
| test/human_evaluation_framework.py | Added HTML-based human evaluation interface |
| test/run_comprehensive_evaluation.py | Added comprehensive evaluation runner integrating all components |
| test/quick_start_evaluation.py | Added quick start tools and test case generation |
| test/manual_prompt_runner.py | Added manual prompt testing utility |
| test/batch_prompt_runner.py | Added batch processing for test cases |
| test/api_test_script.py | Added API endpoint testing script |
| test/quick_test_cases.json | Added sample test cases for EMT and curriculum |
| test/results/batch_results.csv | Added batch test results with sample responses |
| test/data/realistic_profiles.json | Added realistic class profile data |
| test/data/curriculum_test_cases.json | Added curriculum intervention test cases |
| test/EXPERIMENTATION_GUIDE.md | Added comprehensive experimentation guide |
| SEAL_EVALUATION_FRAMEWORK_SUMMARY.md | Added framework summary documentation |
```python
print("\n" + "=" * 50)

# Load test cases
test_cases = create_simple_test_cases()
```
The function create_simple_test_cases() is called but never defined in this file. This will cause a NameError at runtime when main() is executed.
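One way to resolve the `NameError` is to define the helper in the same file. A minimal sketch, assuming the downstream code only iterates over dict-shaped cases (the field names here are illustrative, not the project's actual schema):

```python
def create_simple_test_cases():
    """Minimal stand-in for the missing helper.

    Field names are illustrative assumptions; the real test cases in
    quick_test_cases.json may use a different schema.
    """
    return [
        {"id": "emt_basic", "prompt_type": "emt", "num_students": 19},
        {"id": "curriculum_basic", "prompt_type": "curriculum", "num_students": 25},
    ]
```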
```json
"num_students": 19,
"emt_scores": {
    "EMT1": [
        100,
```
In realistic_profiles.json, line 1088 shows a value of 100, which should be written as 100.0 for consistency with the other float values in the EMT scores arrays throughout the file.
```diff
- 100,
+ 100.0,
```
```json
67.1,
94.1,
67.6,
100,
```
This line shows a value of 100, which should be written as 100.0 for consistency with the other float values in the EMT scores arrays throughout the file.
```diff
- 100,
+ 100.0,
```
```python
def create_test_cases(self) -> List[TestCase]:
    """Create comprehensive test cases for both prompt types"""
    test_cases = []
```
The variable `test_cases` is assigned but never used.
```diff
- test_cases = []
```
```python
test_cases = evaluator.generate_realistic_test_data(args.num_classes)

# Step 2: Run automated evaluation
automated_results = evaluator.run_automated_evaluation(test_cases)
```
The variable `automated_results` is assigned but never used.
```diff
- automated_results = evaluator.run_automated_evaluation(test_cases)
+ evaluator.run_automated_evaluation(test_cases)
```
```python
Simplified script to get Mayur started with prompt testing
"""

import os
```
The import of `os` is not used.
```diff
- import os
```
```python
import random
import json
from typing import Dict, List, Any, Tuple
```
The import of `Tuple` is not used.
```diff
- from typing import Dict, List, Any, Tuple
+ from typing import Dict, List, Any
```
```python
from typing import Dict, List, Any, Tuple
from dataclasses import dataclass
from pathlib import Path
import numpy as np
```
The import of `numpy as np` is not used.
```diff
- import numpy as np
```
```python
Integrates automated testing, realistic data generation, and human evaluation
"""

import os
```
The import of `os` is not used.
```diff
- import os
```
```python
parser.add_argument("--out", default=None, help="Output CSV path (optional)")
args = parser.parse_args()

cases = json.load(open(args.input, "r", encoding="utf-8"))
```
The file is opened but never closed; use a context manager so the handle is released even if `json.load` raises.
```diff
- cases = json.load(open(args.input, "r", encoding="utf-8"))
+ with open(args.input, "r", encoding="utf-8") as f:
+     cases = json.load(f)
```
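The context-manager form closes the handle deterministically even on error. A self-contained sketch of the pattern (the helper name and the temporary file stand in for `args.input`, which is not shown here):

```python
import json
import os
import tempfile

def load_cases(path: str):
    """Load test cases from a JSON file, closing the handle deterministically."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

# Example usage: a temporary file stands in for the batch runner's input path.
tmp = tempfile.NamedTemporaryFile("w", suffix=".json", delete=False, encoding="utf-8")
json.dump([{"id": "case-1"}], tmp)
tmp.close()
cases = load_cases(tmp.name)
os.unlink(tmp.name)
```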