📊 Tools: 63 Specialized Evaluation Tools
👨💻 Author: Mihai Criveti
An MCP server providing a comprehensive AI evaluation platform: 63 specialized tools across 14 categories for complete AI system assessment, using LLM-as-a-judge techniques combined with rule-based metrics.
- 🤖 4 Judge Tools - LLM-as-a-judge evaluation with bias mitigation
- 📝 4 Prompt Tools - Clarity, consistency, completeness analysis
- 🛠️ 4 Agent Tools - Tool usage, reasoning, task completion assessment
- 🔍 3 Quality Tools - Factuality, coherence, toxicity detection
- 🔗 8 RAG Tools - Retrieval relevance, context utilization, grounding verification
- ⚖️ 6 Bias & Fairness - Demographic bias, representation equity, intersectional analysis
- 🛡️ 5 Robustness Tools - Adversarial testing, injection resistance, stability analysis
- 🔒 4 Safety & Alignment - Harmful content detection, instruction adherence, value alignment
- 🌍 4 Multilingual Tools - Translation quality, cross-lingual consistency, cultural adaptation
- ⚡ 4 Performance Tools - Latency tracking, efficiency metrics, throughput scaling
- 🔐 8 Privacy Tools - PII detection, data minimization, compliance, anonymization
- 🔄 3 Workflow Tools - Evaluation suites, parallel execution, results comparison
- 📊 2 Calibration Tools - Judge agreement testing, rubric optimization
- 🏥 4 Server Tools - Health monitoring, cache statistics, system management
- 🤖 LLM-as-a-Judge - GPT-4 and Azure OpenAI judges with position bias mitigation
- 📈 Statistical Rigor - Confidence intervals, significance testing, correlation analysis
- 🎪 Multi-Modal Assessment - Pattern matching + LLM evaluation + rule-based metrics
- 🏗️ Extensible Architecture - Configurable rubrics, custom criteria, plugin system
# 🎯 One-command setup
pip install -e ".[dev]"
# 🔥 Launch MCP server for Claude Desktop, MCP clients
python -m mcp_eval_server.server
# or
make dev
# 🏥 Health check (automatic on port 8080)
curl http://localhost:8080/health # ✅ Liveness probe
curl http://localhost:8080/ready # 🎯 Readiness probe
curl http://localhost:8080/metrics # 📊 Performance metrics
# 🚀 Launch REST API server with FastAPI
python -m mcp_eval_server.rest_server --port 8080 --host 0.0.0.0
# or
make serve-rest
# 📚 Interactive API documentation
open http://localhost:8080/docs
# 🧪 Quick API test
curl http://localhost:8080/health
curl http://localhost:8080/tools/categories
# 🌍 MCP protocol over HTTP with Server-Sent Events
make serve-http
# 📡 Access via JSON-RPC over HTTP on port 9000
curl -X POST -H 'Content-Type: application/json' \
-d '{"jsonrpc": "2.0", "id": 1, "method": "tools/list", "params": {}}' \
http://localhost:9000/
# 🎯 Start REST API server first
make serve-rest
# 🔗 Launch MCP wrapper (FastMCP around REST API)
make serve-wrapper
# 📡 Access via MCP protocol over HTTP/SSE on port 9001
# Endpoint: http://localhost:9001/mcp
# Headers: Accept: application/json, text/event-stream
# Protocol: Streamable HTTP (SSE) with session management
# 🧪 Test the wrapper
python test_sse_client.py
# 🚀 Deploy to AWS App Runner
make deploy-apprunner
# 🌐 Access deployed MCP wrapper
# URL: https://6xaate4xrt.us-east-1.awsapprunner.com/mcp
# Protocol: Streamable HTTP (SSE)
# Headers: Accept: application/json, text/event-stream
# 🧪 Test deployed wrapper
curl -X POST "https://6xaate4xrt.us-east-1.awsapprunner.com/mcp" \
-H "Content-Type: application/json" \
-H "Accept: application/json, text/event-stream" \
-d '{"jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {"protocolVersion": "2024-11-05", "capabilities": {}, "clientInfo": {"name": "test-client", "version": "1.0.0"}}}'- 🎯 Single Response Evaluation: Customizable criteria with weighted scoring and confidence metrics
- ⚖️ Pairwise Comparison: Head-to-head analysis with automatic position bias mitigation
- 🏆 Multi-Response Ranking: Tournament, round-robin, and scoring-based ranking algorithms
- 📊 Reference-Based Evaluation: Gold standard comparison for factuality, completeness, and style
- 🤝 Multi-Judge Consensus: Ensemble evaluation with agreement analysis and confidence weighting
- 🔍 Clarity Analysis: Rule-based ambiguity detection + LLM semantic analysis with improvement recommendations
- 🔄 Consistency Testing: Multi-run variance analysis across temperature settings with outlier detection
- ✅ Completeness Measurement: Component coverage analysis with visual heatmap generation
- 🎯 Relevance Assessment: Semantic alignment using TF-IDF vectorization with drift analysis
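The relevance assessment above relies on TF-IDF vectors rather than heavy embedding models. A minimal sketch of that approach with scikit-learn (illustrative only; `relevance_score` is a hypothetical helper, not the server's internal API):

```python
# Minimal sketch of TF-IDF based relevance scoring (illustrative;
# not the server's internal implementation).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def relevance_score(prompt: str, response: str) -> float:
    """Cosine similarity between TF-IDF vectors of prompt and response."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform([prompt, response])
    return float(cosine_similarity(matrix[0], matrix[1])[0, 0])

print(relevance_score("Explain TCP handshakes", "A TCP handshake has three steps..."))
```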
- ⚙️ Tool Usage Evaluation: Selection accuracy, sequence optimization, parameter validation with efficiency scoring
- ✅ Task Completion Analysis: Multi-criteria success evaluation with partial credit and failure analysis
- 🧠 Reasoning Assessment: Decision-making quality, logical coherence, and hallucination detection
- 📈 Performance Benchmarking: Comprehensive capability testing across skill levels with baseline comparison
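To make the tool-usage metrics concrete, here is a minimal sketch of selection-accuracy scoring, assuming expected and actual tool calls are simple name lists (`tool_selection_accuracy` is a hypothetical helper, not the server's implementation):

```python
# Minimal sketch of tool-selection accuracy (illustrative; the server's
# agent.evaluate_tool_use tool computes richer metrics).
def tool_selection_accuracy(expected: list[str], actual: list[str]) -> float:
    """Fraction of expected tools the agent actually invoked."""
    if not expected:
        return 1.0
    hits = sum(1 for tool in expected if tool in actual)
    return hits / len(expected)

print(tool_selection_accuracy(["search", "calculator"], ["search", "code_executor"]))  # 0.5
```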
- ✅ Factuality Checking: Claims verification against knowledge bases with confidence scoring and evidence tracking
- 🧩 Coherence Analysis: Logical flow assessment, contradiction detection, and structural analysis
- 🛡️ Toxicity Detection: Multi-category harmful content identification with bias pattern analysis
- 📊 Retrieval Relevance: Semantic similarity assessment with LLM judge validation and configurable thresholds
- 🎯 Context Utilization: Analysis of how well retrieved context is integrated into generated responses
- ⚓ Answer Groundedness: Claim verification against supporting context with strictness controls
- 🚨 Hallucination Detection: Contradiction identification between responses and source context
- 🎯 Retrieval Coverage: Topic completeness assessment and information gap analysis
- 📝 Citation Accuracy: Reference validation and citation quality scoring across multiple formats
- 🧩 Chunk Relevance: Individual document segment evaluation with ranking and scoring
- 🏆 Retrieval Benchmarking: Comparative analysis using standard IR metrics (precision, recall, MRR, NDCG)
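The IR metrics named above can be sketched in a few lines. A minimal, binary-relevance version (illustrative; the `rag.benchmark_retrieval_systems` tool computes these internally with more options):

```python
import math

# Minimal sketches of MRR and NDCG with binary relevance (illustrative only).
def mrr(ranked_ids: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant document."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg(ranked_ids: list[str], relevant: set[str]) -> float:
    """Binary-relevance NDCG over the returned ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked_ids, start=1)
              if doc_id in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), len(ranked_ids)) + 1))
    return dcg / ideal if ideal else 0.0

print(mrr(["d3", "d1", "d7"], {"d1"}))   # 0.5
print(ndcg(["d3", "d1", "d7"], {"d1"}))  # ~0.63
```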
- 🎯 Demographic Bias Detection: Pattern matching and LLM assessment for protected group bias
- 📊 Representation Fairness: Balanced representation analysis across contexts and groups
- ⚖️ Outcome Equity: Disparate impact analysis across protected attributes
- 🌍 Cultural Sensitivity: Cross-cultural appropriateness and awareness evaluation
- 🗣️ Linguistic Bias Detection: Language-based discrimination and dialect bias identification
- 🔗 Intersectional Fairness: Compound bias effects across multiple identity dimensions
- ⚔️ Adversarial Testing: Malicious prompt resistance and attack vector evaluation
- 🔄 Input Sensitivity: Response stability testing under input variations and perturbations
- 🛡️ Prompt Injection Resistance: Security defense evaluation against injection attacks
- 📈 Distribution Shift: Performance degradation analysis on out-of-domain data
- 🎯 Consistency Under Perturbation: Output stability measurement across input modifications
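A minimal sketch of the consistency-under-perturbation idea, assuming a hypothetical `generate_response` callable standing in for the system under test:

```python
import difflib

# Minimal sketch of output stability across input perturbations (illustrative).
def perturb(text: str) -> list[str]:
    """A few trivial surface perturbations of the input."""
    return [text.lower(), text.upper(), text + " ", text.replace(",", "")]

def consistency(text: str, generate_response) -> float:
    """Mean pairwise similarity of outputs across perturbed inputs."""
    outputs = [generate_response(v) for v in [text] + perturb(text)]
    pairs = [(a, b) for i, a in enumerate(outputs) for b in outputs[i + 1:]]
    sims = [difflib.SequenceMatcher(None, a, b).ratio() for a, b in pairs]
    return sum(sims) / len(sims)
```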
- ⚠️ Harmful Content Detection: Multi-category risk assessment across safety dimensions
- 📋 Instruction Following: Constraint adherence and safety instruction compliance
- 🚫 Refusal Appropriateness: Evaluation of appropriate system refusal behavior
- 💎 Value Alignment: Human values and ethical principles alignment assessment
- 🔄 Translation Quality: Accuracy, fluency, and completeness assessment across languages
- 🔗 Cross-Lingual Consistency: Consistency evaluation across multiple language versions
- 🎭 Cultural Adaptation: Localization quality and cultural appropriateness evaluation
- 🔀 Language Mixing Detection: Inappropriate code-switching and language mixing identification
- ⏱️ Response Latency: Generation speed tracking with statistical analysis and percentiles
- 💻 Computational Efficiency: Resource usage monitoring and efficiency metrics
- 📈 Throughput Scaling: Concurrent request handling and scaling behavior analysis
- 💾 Memory Monitoring: Memory consumption pattern tracking and leak detection
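A minimal sketch of latency percentile tracking, assuming a hypothetical `call_model` callable for the generation call being measured:

```python
import time
import statistics

# Minimal sketch of latency measurement with percentiles (illustrative;
# the performance.measure_response_latency tool adds timeout tracking).
def latency_percentiles(call_model, prompt: str, runs: int = 20) -> dict:
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call_model(prompt)
        samples.append(time.perf_counter() - start)
    cuts = statistics.quantiles(samples, n=100)  # 99 percentile cut points
    return {"mean_s": statistics.mean(samples), "p50_s": cuts[49], "p95_s": cuts[94]}
```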
- 🔍 PII Detection: Personally identifiable information detection with configurable sensitivity
- 📊 Data Minimization: Evaluation of data collection necessity and purpose alignment
- 📋 Consent Compliance: Privacy regulation compliance assessment (GDPR, CCPA, COPPA, HIPAA)
- 🎭 Anonymization Effectiveness: Re-identification risk analysis and utility preservation
- 🚨 Data Leakage Detection: Unintended data exposure and inference leakage identification
- 📖 Consent Clarity: Readability and comprehensibility assessment of privacy notices
- 🗃️ Data Retention Compliance: Retention policy alignment and regulatory adherence
- 🏗️ Privacy-by-Design: System-level privacy implementation and design principle evaluation
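A minimal sketch of the pattern-matching half of PII detection (illustrative; the server's `privacy.detect_pii_exposure` tool adds sensitivity levels and context analysis):

```python
import re

# Minimal sketch of regex-based PII detection (illustrative only).
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def detect_pii(text: str) -> dict[str, list[str]]:
    """Return matches per PII category found in the text."""
    return {name: pattern.findall(text)
            for name, pattern in PII_PATTERNS.items()
            if pattern.findall(text)}

print(detect_pii("Contact jane@example.com or 555-867-5309"))
```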
- 🎛️ Evaluation Suites: Customizable multi-step pipelines with weighted criteria and success thresholds
- ⚡ Parallel/Sequential Execution: Optimized processing with configurable concurrency and resource management
- 📊 Results Comparison: Statistical analysis with trend detection, significance testing, and regression analysis
- 🤝 Agreement Testing: Inter-judge correlation analysis with human baseline comparison
- 🎯 Rubric Optimization: Automatic tuning using machine learning for improved human alignment
- 📋 Judge Management: Available model listing, capability assessment, configuration validation
- 💾 Results Storage: Comprehensive evaluation history with metadata and statistical reporting
- ⚡ Cache Management: Multi-level caching statistics and performance optimization
- 🔍 Health Monitoring: System status checks and performance metrics
- Position Bias Mitigation: Automatic response position randomization for fair comparisons
- Chain-of-Thought Integration: Step-by-step reasoning for enhanced evaluation quality
- Confidence Calibration: Self-assessment metrics for evaluation reliability
- Multiple Judge Consensus: Ensemble methods with disagreement analysis
- Human Alignment: Regular calibration against ground truth evaluations
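A minimal sketch of position bias mitigation for pairwise judging, assuming a hypothetical `ask_judge` callable that returns `"first"` or `"second"` for a pair presented in a given order:

```python
# Minimal sketch of position-bias mitigation (illustrative; the server also
# randomizes presentation order before judging).
def debiased_comparison(response_a: str, response_b: str, ask_judge) -> str:
    verdict_ab = ask_judge(response_a, response_b)  # A shown first
    verdict_ba = ask_judge(response_b, response_a)  # positions swapped
    if verdict_ab == "first" and verdict_ba == "second":
        return "A"  # A wins regardless of position
    if verdict_ab == "second" and verdict_ba == "first":
        return "B"  # B wins regardless of position
    return "tie"  # verdict followed position, not content: treat as a tie
```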
- Lightweight Dependencies: Uses standard libraries (scikit-learn, numpy) instead of heavy ML frameworks
- Smart Caching: Multi-level caching (memory + disk) with TTL and invalidation
- Async Processing: Non-blocking evaluation execution with configurable concurrency
- Batch Operations: Efficient multi-item processing with progress tracking
- Resource Management: Memory and CPU optimization with automatic scaling
- Fast Startup: Quick initialization without loading large pre-trained models
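A minimal sketch of the TTL caching idea mentioned above (illustrative; the server's caching layer adds a disk tier and explicit invalidation):

```python
import time

# Minimal sketch of a TTL-based in-memory cache (illustrative only).
class TTLCache:
    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self.store: dict[str, tuple[float, object]] = {}

    def get(self, key: str):
        entry = self.store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:  # entry expired: drop it
            del self.store[key]
            return None
        return value

    def set(self, key: str, value: object) -> None:
        self.store[key] = (time.monotonic() + self.ttl, value)
```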
- Cryptographic Random: Secure random number generation for bias mitigation
- API Key Management: Secure credential handling with environment variable integration
- Input Validation: Comprehensive parameter validation and sanitization
- Error Isolation: Graceful failure handling with detailed error reporting
- Audit Trail: Complete evaluation history with compliance reporting
- Statistical Analysis: Correlation analysis, significance testing, trend detection
- Performance Metrics: Latency tracking, throughput monitoring, success rate analysis
- Quality Dashboards: Real-time evaluation quality monitoring with alerting
- Comparative Analysis: A/B testing capabilities with regression detection
- Predictive Analytics: Performance trend forecasting and anomaly detection
The MCP Wrapper is a FastMCP-based server that wraps your existing REST API and exposes it as a Model Context Protocol (MCP) server. This allows MCP clients (like Claude Desktop) to access all your evaluation tools through the MCP protocol while keeping your REST API unchanged.
- 🔄 Dual Protocol Support: Access the same tools via both REST API and MCP protocol
- 🌐 HTTP/SSE Transport: Modern streamable HTTP with Server-Sent Events
- 🔗 REST API Integration: Seamlessly wraps existing REST endpoints
- 📡 Session Management: Automatic session handling with `mcp-session-id` headers
- ⚡ Real-time Communication: Bidirectional communication over a single HTTP connection
- 🛡️ Protocol Compliance: Full MCP protocol compliance with proper initialization
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ MCP Client │ │ MCP Wrapper │ │ REST API │
│ (Claude Desktop)│◄──►│ (FastMCP + SSE) │◄──►│ (FastAPI) │
│ │ │ │ │ │
│ • stdio │ │ • /mcp/ endpoint │ │ • /judge/* │
│ • HTTP/SSE │ │ • Session mgmt │ │ • /quality/* │
│ • JSON-RPC │ │ • Tool wrapping │ │ • /agent/* │
└─────────────────┘ └──────────────────┘ └─────────────────┘
- Transport: Streamable HTTP (SSE) with session management
- Endpoint: `http://localhost:9001/mcp/`
- Headers: `Accept: application/json, text/event-stream`
- Session: Automatic via `mcp-session-id` header
- Protocol: Full MCP protocol with initialization and notifications
- Tools: All 63 evaluation tools exposed as MCP tools
# Start both servers
make serve-rest # REST API on port 8080
make serve-wrapper # MCP wrapper on port 9001
# Test the wrapper
python test_sse_client.py
make test-wrapper
# Clone and install (lightweight dependencies only)
cd mcp-servers/python/mcp_eval_server
pip install -e ".[dev]"
# Set up API keys (optional - rule-based judge works without them)
export OPENAI_API_KEY="sk-your-key-here"
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
export AZURE_OPENAI_API_KEY="your-azure-api-key"
# Configure health check endpoints (optional)
export HEALTH_CHECK_PORT=8080 # Default: 8080
export HEALTH_CHECK_HOST=0.0.0.0 # Default: 0.0.0.0
# Note: No heavy ML dependencies required!
# Uses efficient TF-IDF + scikit-learn instead of transformers
{
"command": "python",
"args": ["-m", "mcp_eval_server.server"],
"cwd": "/path/to/mcp-servers/python/mcp_eval_server"
}
Protocol: stdio (Model Context Protocol)
Transport: Standard input/output (no HTTP port needed)
Tools Available: 63 specialized evaluation tools
{
"url": "http://localhost:9001/mcp/",
"headers": {
"Accept": "application/json, text/event-stream"
}
}
Protocol: Streamable HTTP (SSE)
Transport: HTTP with Server-Sent Events
Session Management: Automatic via mcp-session-id header
Tools Available: 63 specialized evaluation tools (via REST API wrapper)
The server automatically starts health check HTTP endpoints for monitoring:
# Health endpoints (started automatically with the MCP server)
curl http://localhost:8080/health # Liveness probe
curl http://localhost:8080/ready # Readiness probe
curl http://localhost:8080/metrics # Basic metrics
curl http://localhost:8080/ # Service info
# Kubernetes-style endpoints
curl http://localhost:8080/healthz # Alternative health
curl http://localhost:8080/readyz # Alternative readiness
Health Check Response Example:
{
"status": "healthy",
"timestamp": 1698765432.123,
"uptime_seconds": 45.67,
"service": "mcp-eval-server",
"version": "0.1.0",
"checks": {
"server_running": true,
"uptime_ok": true
}
}
Readiness Check Response Example:
{
"status": "ready",
"timestamp": 1698765432.123,
"service": "mcp-eval-server",
"version": "0.1.0",
"checks": {
"server_initialized": true,
"judge_tools_loaded": true,
"storage_initialized": true
}
}
# Build container
make build
# Run with environment
make run
# Or use docker-compose
make compose-up
# 1. Prerequisites: AWS CLI configured, Docker installed
aws configure
# 2. Create IAM role for App Runner
aws iam create-role --role-name AppRunnerInstanceRole --assume-role-policy-document file://trust-policy.json
aws iam attach-role-policy --role-name AppRunnerInstanceRole --policy-arn arn:aws:iam::aws:policy/service-role/AppRunnerServicePolicyForECRAccess
# 3. Set environment variable
export INSTANCE_ROLE_ARN="arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):role/AppRunnerInstanceRole"
# 4. Deploy to AWS App Runner
make deploy-apprunner
# 5. Set API keys in App Runner console (OPENAI_API_KEY, etc.)
🌐 Live Deployed Endpoints:
- MCP Wrapper: https://6xaate4xrt.us-east-1.awsapprunner.com/mcp
- Health Check: https://6xaate4xrt.us-east-1.awsapprunner.com/health
- Service Info: https://6xaate4xrt.us-east-1.awsapprunner.com/
📚 Detailed Guide: See AWS_APP_RUNNER_DEPLOYMENT.md for complete deployment instructions.
The MCP wrapper is now live and fully functional! Here's how to test it:
curl -X POST "https://6xaate4xrt.us-east-1.awsapprunner.com/mcp" \
-H "Content-Type: application/json" \
-H "Accept: application/json, text/event-stream" \
-i \
-d '{
"jsonrpc": "2.0",
"id": 1,
"method": "initialize",
"params": {
"protocolVersion": "2024-11-05",
"capabilities": {},
"clientInfo": {
"name": "test-client",
"version": "1.0.0"
}
}
}'
# Use the session ID from Step 1
curl -X POST "https://6xaate4xrt.us-east-1.awsapprunner.com/mcp" \
-H "Content-Type: application/json" \
-H "Accept: application/json, text/event-stream" \
-H "mcp-session-id: <SESSION_ID>" \
-d '{
"jsonrpc": "2.0",
"method": "notifications/initialized"
}'
curl -X POST "https://6xaate4xrt.us-east-1.awsapprunner.com/mcp" \
-H "Content-Type: application/json" \
-H "Accept: application/json, text/event-stream" \
-H "mcp-session-id: <SESSION_ID>" \
-d '{
"jsonrpc": "2.0",
"id": 2,
"method": "tools/list",
"params": {}
}'
curl -X POST "https://6xaate4xrt.us-east-1.awsapprunner.com/mcp" \
-H "Content-Type: application/json" \
-H "Accept: application/json, text/event-stream" \
-H "mcp-session-id: <SESSION_ID>" \
-d '{
"jsonrpc": "2.0",
"id": 3,
"method": "tools/call",
"params": {
"name": "judge_evaluate",
"arguments": {
"response": "Paris is the capital of France.",
"criteria": [
{
"name": "accuracy",
"description": "Factual accuracy",
"scale": "1-5",
"weight": 1.0
}
],
"rubric": {
"criteria": [],
"scale_description": {
"1": "Wrong",
"5": "Correct"
}
},
"judge_model": "rule-based"
}
}
}'
# Install development dependencies
make dev-install
# Run development server
make dev
# Run tests
make test
# Check code quality
make lint
# Multi-criteria evaluation with MCP client
result = await mcp_client.call_tool("judge.evaluate_response", {
    "response": "Detailed technical explanation...",
    "criteria": [
        {"name": "technical_accuracy", "description": "Correctness of technical details", "scale": "1-5", "weight": 0.4},
        {"name": "clarity", "description": "Explanation clarity", "scale": "1-5", "weight": 0.3},
        {"name": "completeness", "description": "Coverage of key points", "scale": "1-5", "weight": 0.3}
    ],
    "rubric": {
        "criteria": [],
        "scale_description": {
            "1": "Severely lacking",
            "2": "Below expectations",
            "3": "Meets basic requirements",
            "4": "Exceeds expectations",
            "5": "Outstanding quality"
        }
    },
    "judge_model": "gpt-4",
    "use_cot": True
})
# Evaluate response via REST API
curl -X POST http://localhost:8080/judge/evaluate \
-H "Content-Type: application/json" \
-d '{
"response": "Paris is the capital of France",
"criteria": [
{
"name": "accuracy",
"description": "Factual accuracy",
"scale": "1-5",
"weight": 1.0
}
],
"rubric": {
"criteria": [],
"scale_description": {
"1": "Wrong",
"5": "Correct"
}
},
"judge_model": "gpt-4o-mini"
}'
# Python REST API client
import httpx
import asyncio

async def evaluate_via_rest():
    async with httpx.AsyncClient() as client:
        response = await client.post("http://localhost:8080/judge/evaluate", json={
            "response": "Technical explanation...",
            "criteria": [
                {"name": "quality", "description": "Overall quality", "scale": "1-5", "weight": 1.0}
            ],
            "rubric": {
                "criteria": [],
                "scale_description": {"1": "Poor", "5": "Excellent"}
            },
            "judge_model": "gpt-4o-mini"
        })
        result = response.json()
        return result

# Run evaluation
result = asyncio.run(evaluate_via_rest())
print(f"Overall score: {result['overall_score']}")
# Head-to-head comparison with bias mitigation
comparison = await mcp_client.call_tool("judge.pairwise_comparison", {
    "response_a": "Technical solution A with implementation details...",
    "response_b": "Alternative solution B with different approach...",
    "criteria": [
        {"name": "innovation", "description": "Novelty and creativity", "scale": "1-5", "weight": 0.4},
        {"name": "feasibility", "description": "Implementation practicality", "scale": "1-5", "weight": 0.3},
        {"name": "efficiency", "description": "Resource optimization", "scale": "1-5", "weight": 0.3}
    ],
    "context": "Solutions for enterprise-scale data processing challenge",
    "position_bias_mitigation": True,
    "judge_model": "gpt-4-turbo"
})
# Full agent performance assessment
benchmark_result = await mcp_client.call_tool("agent.benchmark_performance", {
    "benchmark_suite": "advanced_skills",
    "agent_config": {
        "model": "gpt-4",
        "temperature": 0.7,
        "tools_enabled": ["search", "calculator", "code_executor"]
    },
    "baseline_comparison": {
        "name": "GPT-3.5 Baseline",
        "scores": {"accuracy": 0.75, "efficiency": 0.68, "reliability": 0.72}
    },
    "metrics_focus": ["accuracy", "efficiency", "reliability", "creativity"]
})
# Create sophisticated evaluation pipeline
suite = await mcp_client.call_tool("workflow.create_evaluation_suite", {
    "suite_name": "comprehensive_ai_assessment",
    "description": "Full-spectrum AI capability evaluation",
    "evaluation_steps": [
        {
            "tool": "prompt.evaluate_clarity",
            "weight": 0.15,
            "parameters": {"target_model": "gpt-4", "domain_context": "technical"}
        },
        {
            "tool": "judge.evaluate_response",
            "weight": 0.25,
            "parameters": {
                "criteria": [
                    {"name": "technical_depth", "description": "Technical sophistication", "scale": "1-5", "weight": 0.4},
                    {"name": "practical_utility", "description": "Real-world applicability", "scale": "1-5", "weight": 0.6}
                ],
                "judge_model": "gpt-4"
            }
        },
        {
            "tool": "quality.evaluate_factuality",
            "weight": 0.20
        },
        {
            "tool": "quality.measure_coherence",
            "weight": 0.15
        },
        {
            "tool": "quality.assess_toxicity",
            "weight": 0.10
        },
        {
            "tool": "agent.analyze_reasoning",
            "weight": 0.15,
            "parameters": {"judge_model": "gpt-4-turbo"}
        }
    ],
    "success_thresholds": {
        "overall": 0.85,
        "quality.evaluate_factuality": 0.90,
        "quality.assess_toxicity": 0.95
    },
    "weights": {
        "accuracy": 0.4,
        "safety": 0.3,
        "utility": 0.3
    }
})

# Execute comprehensive evaluation
results = await mcp_client.call_tool("workflow.run_evaluation", {
    "suite_id": suite["suite_id"],
    "test_data": {
        "response": "Complex AI system response...",
        "context": "Enterprise deployment scenario...",
        "reasoning_trace": [...],
        "agent_trace": {...}
    },
    "parallel_execution": True,
    "max_concurrent": 5
})
The MCP Eval Server supports complete customization of judge models, allowing you to:
- Configure custom API endpoints and deployments
- Set provider-specific parameters and capabilities
- Create domain-specific model configurations
- Use custom environment variable names
# Use custom model configuration
export MCP_EVAL_MODELS_CONFIG="./my-custom-models.yaml"
export DEFAULT_JUDGE_MODEL="my-custom-judge"
# Copy default config for customization
make copy-config # Copies to ./custom-config/
make show-config # Show current configuration status
make validate-config # Validate configuration syntax
models:
  azure:
    my-enterprise-gpt4:
      provider: "azure"
      deployment_name: "my-gpt4-deployment"
      model_name: "gpt-4"
      api_base_env: "AZURE_OPENAI_ENDPOINT"
      api_key_env: "AZURE_OPENAI_API_KEY"
      api_version_env: "AZURE_OPENAI_API_VERSION"
      deployment_name_env: "AZURE_DEPLOYMENT_NAME"
      default_temperature: 0.1   # Custom temperature
      max_tokens: 3000           # Custom token limit
      capabilities:
        supports_cot: true
        supports_pairwise: true
        supports_ranking: true
        supports_reference: true
        max_context_length: 8192
        optimal_temperature: 0.1
        consistency_level: "very_high"
      metadata:
        purpose: "production_evaluation"
        cost_tier: "premium"
  ollama:
    my-local-llama:
      provider: "ollama"
      model_name: "llama3:70b"
      base_url_env: "OLLAMA_BASE_URL"
      default_temperature: 0.3
      max_tokens: 2000
      request_timeout: 120   # Longer timeout for large models

# Custom defaults
defaults:
  primary_judge: "my-enterprise-gpt4"
  fallback_judge: "my-local-llama"

# Custom recommendations
recommendations:
  production: ["my-enterprise-gpt4"]
  development: ["my-local-llama"]
rubrics:
  technical_excellence:
    name: "Technical Excellence Assessment"
    criteria:
      - name: "code_quality"
        description: "Code structure, efficiency, and best practices"
        scale: "1-10"
        weight: 0.3
      - name: "innovation"
        description: "Novel approaches and creative solutions"
        scale: "1-10"
        weight: 0.25
      - name: "scalability"
        description: "System scalability and performance considerations"
        scale: "1-10"
        weight: 0.25
      - name: "maintainability"
        description: "Code maintainability and documentation quality"
        scale: "1-10"
        weight: 0.2
    scale_description:
      "1-2": "Severely deficient, requires major rework"
      "3-4": "Below standards, significant improvements needed"
      "5-6": "Meets basic requirements, minor improvements possible"
      "7-8": "Exceeds expectations, high quality work"
      "9-10": "Exceptional excellence, industry-leading quality"
benchmarks:
  enterprise_readiness:
    name: "Enterprise Readiness Assessment"
    category: "production"
    tasks:
      - name: "security_analysis"
        description: "Security vulnerability assessment and mitigation"
        difficulty: "advanced"
        expected_tools: ["security_scanner", "vulnerability_analyzer", "mitigation_planner"]
        evaluation_metrics: ["threat_identification", "risk_assessment", "solution_quality"]
      - name: "performance_optimization"
        description: "System performance analysis and optimization"
        difficulty: "advanced"
        expected_tools: ["profiler", "optimizer", "benchmarker"]
        evaluation_metrics: ["performance_gain", "resource_efficiency", "scalability_impact"]
- Correlation Analysis: Pearson, Spearman, Cohen's Kappa for agreement measurement
- Significance Testing: Statistical validation of evaluation differences
- Trend Analysis: Performance trajectory analysis with volatility assessment
- Outlier Detection: Anomaly identification in evaluation results
- Confidence Intervals: Uncertainty quantification for evaluation scores
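A minimal sketch of the agreement statistics above using scipy and scikit-learn, with made-up example scores (illustrative; the `calibration.test_judge_agreement` tool wraps similar computations):

```python
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import cohen_kappa_score

# Example scores from two judges on the same six items (made-up data).
judge_a = [4, 3, 5, 2, 4, 5]
judge_b = [4, 2, 5, 3, 4, 4]

print("Pearson r:", pearsonr(judge_a, judge_b)[0])
print("Spearman rho:", spearmanr(judge_a, judge_b)[0])
print("Cohen's kappa:", cohen_kappa_score(judge_a, judge_b))
```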
- Judge Calibration: Systematic bias detection and correction algorithms
- Rubric Evolution: Machine learning-powered rubric optimization
- Meta-Evaluation: Evaluation of evaluation quality itself
- Human Alignment: Continuous calibration against expert human judgments
- Cross-Validation: K-fold validation for evaluation reliability
- Technical Content: Code quality, architecture assessment, security analysis
- Creative Writing: Originality, engagement, style consistency evaluation
- Academic Work: Research quality, citation analysis, argument strength
- Customer Service: Helpfulness, politeness, problem resolution effectiveness
- Educational Content: Learning objective achievement, instructional clarity
- Multi-Judge Runtime: Supports OpenAI, Azure OpenAI, and rule-based evaluation engines
- Caching Layer: Redis-compatible distributed caching with automatic invalidation
- Results Database: SQLite/PostgreSQL storage with comprehensive indexing
- API Gateway: RESTful endpoints with authentication and rate limiting
- Monitoring System: Prometheus metrics with Grafana dashboards
- Container Deployment: Production-ready Docker/Podman containers with security hardening
- Kubernetes Support: Helm charts with auto-scaling and service mesh integration
- Cloud Integration: AWS ECS, Azure Container Instances, Google Cloud Run compatibility
- Edge Deployment: Lightweight containers for edge computing scenarios
- Development Mode: Hot-reload development server with debugging capabilities
- Enterprise Security: OAuth 2.0, JWT tokens, API key rotation
- Data Privacy: Encryption at rest and in transit, PII detection and filtering
- Audit Logging: Comprehensive audit trails with tamper detection
- Compliance Ready: SOC 2, GDPR, HIPAA compliance frameworks supported
- Vulnerability Management: Continuous security scanning and automated patching
🏆 MCP EVALUATION SERVER - 63 SPECIALIZED TOOLS 🏆
═══════════════════════════════════════════════════════════
📊 CORE EVALUATION SUITE (15 tools)
├── 🤖 Judge Tools (4) ────── LLM-as-a-judge evaluation
├── 📝 Prompt Tools (4) ───── Clarity, consistency, optimization
├── 🛠️ Agent Tools (4) ────── Performance, reasoning, benchmarking
└── 🔍 Quality Tools (3) ──── Factuality, coherence, toxicity
🔬 ADVANCED ASSESSMENT SUITE (39 tools)
├── 🔗 RAG Tools (8) ──────── Retrieval relevance, grounding, citations
├── ⚖️ Bias & Fairness (6) ── Demographic bias, intersectional analysis
├── 🛡️ Robustness (5) ──────── Adversarial testing, injection resistance
├── 🔒 Safety & Alignment (4) Harmful content, value alignment
├── 🌍 Multilingual (4) ────── Translation, cultural adaptation
├── ⚡ Performance (4) ──────── Latency, efficiency, scaling
└── 🔐 Privacy (8) ───────── PII detection, compliance, anonymization
🔧 SYSTEM MANAGEMENT (9 tools)
├── 🔄 Workflow Tools (3) ─── Evaluation suites, parallel execution
├── 📊 Calibration (2) ────── Judge agreement, rubric optimization
└── 🏥 Server Tools (4) ───── Health monitoring, system management
🎯 TOTAL: 63 TOOLS ACROSS 14 CATEGORIES 🎯
| Tool | Description | Key Features |
|---|---|---|
| `judge.evaluate_response` | Single response evaluation | Customizable criteria, weighted scoring, confidence metrics |
| `judge.pairwise_comparison` | Two-response comparison | Position bias mitigation, criterion-level analysis |
| `judge.rank_responses` | Multi-response ranking | Tournament/scoring algorithms, consistency measurement |
| `judge.evaluate_with_reference` | Reference-based evaluation | Gold standard comparison, similarity scoring |

| Tool | Description | Key Features |
|---|---|---|
| `prompt.evaluate_clarity` | Clarity assessment | Rule-based + LLM analysis, ambiguity detection |
| `prompt.test_consistency` | Consistency testing | Multi-run analysis, temperature variance |
| `prompt.measure_completeness` | Completeness analysis | Component coverage, heatmap visualization |
| `prompt.assess_relevance` | Relevance measurement | TF-IDF semantic alignment, drift analysis |

| Tool | Description | Key Features |
|---|---|---|
| `agent.evaluate_tool_use` | Tool usage analysis | Selection accuracy, sequence optimization |
| `agent.measure_task_completion` | Task success evaluation | Multi-criteria assessment, partial credit |
| `agent.analyze_reasoning` | Reasoning quality assessment | Logic analysis, hallucination detection |
| `agent.benchmark_performance` | Performance benchmarking | Multi-domain testing, baseline comparison |

| Tool | Description | Key Features |
|---|---|---|
| `quality.evaluate_factuality` | Factual accuracy checking | Claims verification, confidence scoring |
| `quality.measure_coherence` | Logical flow analysis | Coherence scoring, contradiction detection |
| `quality.assess_toxicity` | Harmful content detection | Multi-category analysis, bias detection |

| Tool | Description | Key Features |
|---|---|---|
| `rag.evaluate_retrieval_relevance` | Document relevance assessment | Semantic similarity, LLM validation |
| `rag.measure_context_utilization` | Context usage analysis | Word overlap, sentence integration |
| `rag.assess_answer_groundedness` | Claim verification | Context support, strictness control |
| `rag.detect_hallucination_vs_context` | Contradiction detection | Statement verification, confidence scoring |
| `rag.evaluate_retrieval_coverage` | Topic completeness check | Information gap analysis, coverage scoring |
| `rag.assess_citation_accuracy` | Reference validation | Citation quality, format support |
| `rag.measure_chunk_relevance` | Document segment scoring | Individual chunk analysis, ranking |
| `rag.benchmark_retrieval_systems` | System comparison | IR metrics, performance analysis |

| Tool | Description | Key Features |
|---|---|---|
| `bias.detect_demographic_bias` | Protected group bias detection | Pattern matching, LLM assessment, sensitivity control |
| `bias.measure_representation_fairness` | Balanced representation analysis | Context evaluation, fairness metrics |
| `bias.evaluate_outcome_equity` | Disparate impact assessment | Outcome analysis, equity scoring |
| `bias.assess_cultural_sensitivity` | Cultural appropriateness evaluation | Cross-cultural awareness, sensitivity dimensions |
| `bias.detect_linguistic_bias` | Language-based discrimination | Dialect bias, formality assessment |
| `bias.measure_intersectional_fairness` | Multi-dimensional bias analysis | Compound effects, intersectional metrics |

| Tool | Description | Key Features |
|---|---|---|
| `robustness.test_adversarial_inputs` | Malicious prompt testing | Attack vectors, injection resistance |
| `robustness.measure_input_sensitivity` | Perturbation stability testing | Input variations, sensitivity thresholds |
| `robustness.evaluate_prompt_injection_resistance` | Security defense evaluation | Injection strategies, resistance scoring |
| `robustness.assess_distribution_shift` | Out-of-domain performance | Domain adaptation, degradation analysis |
| `robustness.measure_consistency_under_perturbation` | Output stability measurement | Perturbation consistency, variance analysis |

| Tool | Description | Key Features |
|---|---|---|
| `safety.detect_harmful_content` | Harmful content identification | Multi-category risk assessment, severity classification |
| `safety.assess_instruction_following` | Constraint adherence evaluation | Instruction parsing, compliance scoring |
| `safety.evaluate_refusal_appropriateness` | Refusal behavior assessment | Decision accuracy, precision/recall metrics |
| `safety.measure_value_alignment` | Human values alignment | Ethical principles, weighted assessment |

| Tool | Description | Key Features |
|---|---|---|
| `multilingual.evaluate_translation_quality` | Translation assessment | Accuracy, fluency, cultural adaptation |
| `multilingual.measure_cross_lingual_consistency` | Multi-language consistency | Semantic preservation, factual alignment |
| `multilingual.assess_cultural_adaptation` | Localization evaluation | Cultural dimensions, adaptation scoring |
| `multilingual.detect_language_mixing` | Code-switching detection | Language purity, mixing appropriateness |

| Tool | Description | Key Features |
|---|---|---|
| `performance.measure_response_latency` | Latency measurement | Statistical analysis, percentiles, timeout tracking |
| `performance.assess_computational_efficiency` | Resource usage monitoring | CPU/memory efficiency, per-token metrics |
| `performance.evaluate_throughput_scaling` | Scaling behavior analysis | Concurrency testing, bottleneck detection |
| `performance.monitor_memory_usage` | Memory consumption tracking | Usage patterns, leak detection, threshold monitoring |

| Tool | Description | Key Features |
|---|---|---|
| `privacy.detect_pii_exposure` | PII detection and analysis | Pattern matching, sensitivity levels, context analysis |
| `privacy.assess_data_minimization` | Data collection necessity | Purpose alignment, minimization scoring |
| `privacy.evaluate_consent_compliance` | Regulatory compliance assessment | GDPR/CCPA/COPPA/HIPAA standards, gap analysis |
| `privacy.measure_anonymization_effectiveness` | Anonymization quality evaluation | Re-identification risk, utility preservation |
| `privacy.detect_data_leakage` | Data exposure identification | Direct/inference leakage, unexpected data flow |
| `privacy.assess_consent_clarity` | Consent readability analysis | Grade level, accessibility, comprehension |
| `privacy.evaluate_data_retention_compliance` | Retention policy adherence | Policy-practice alignment, regulatory requirements |
| `privacy.assess_privacy_by_design` | System privacy implementation | Design principles, control effectiveness |

| Tool | Description | Key Features |
|---|---|---|
| `workflow.create_evaluation_suite` | Evaluation pipeline creation | Multi-step workflows, weighted criteria |
| `workflow.run_evaluation` | Suite execution | Parallel processing, progress tracking |
| `workflow.compare_evaluations` | Results comparison | Statistical analysis, trend detection |

| Tool | Description | Key Features |
|---|---|---|
| `calibration.test_judge_agreement` | Judge agreement testing | Correlation analysis, bias detection |
| `calibration.optimize_rubrics` | Rubric optimization | ML-powered tuning, human alignment |

| Tool | Description | Key Features |
|---|---|---|
| `server.get_available_judges` | List available judges | Model capabilities, status checking |
| `server.get_evaluation_suites` | List evaluation suites | Suite management, configuration viewing |
| `server.get_evaluation_results` | Retrieve results | History browsing, filtering, pagination |
| `server.get_cache_stats` | Cache statistics | Performance monitoring, optimization |
- Model Comparison Studies: Systematic evaluation of different LLM architectures
- Prompt Engineering Research: Large-scale prompt effectiveness analysis
- Agent Behavior Studies: Comprehensive agent decision-making research
- Bias Detection Research: Systematic bias pattern analysis across models
- Evaluation Methodology: Meta-research on evaluation techniques themselves
- Quality Assurance: Automated content quality control in production systems
- A/B Testing: Systematic comparison of different AI configurations
- Performance Monitoring: Continuous evaluation of deployed AI systems
- Compliance Reporting: Automated generation of evaluation compliance reports
- Cost Optimization: Evaluation-driven optimization of AI system costs
- Student Assessment: Automated evaluation of student AI projects
- Curriculum Development: Assessment-driven AI curriculum optimization
- Research Training: Tools for training researchers in evaluation methodologies
- Benchmark Creation: Development of new evaluation benchmarks
- Peer Review: AI-assisted peer review systems for academic work
| Mode | Command | Protocol | Port | Auth | Use Case |
|---|---|---|---|---|---|
| MCP Server | `make dev` | stdio | none | none | Claude Desktop, MCP clients |
| REST API | `make serve-rest` | HTTP REST | 8080 | none | Direct HTTP API integration |
| REST Public | `make serve-rest-public` | HTTP REST | 8080 | none | Public REST API access |
| HTTP Bridge | `make serve-http` | JSON-RPC/HTTP | 9000 | none | MCP over HTTP, local testing |
| HTTP Public | `make serve-http-public` | JSON-RPC/HTTP | 9000 | none | MCP over HTTP, remote access |
| MCP Wrapper | `make serve-wrapper` | Streamable HTTP (SSE) | 9001 | none | FastMCP wrapper around REST API |
| MCP Wrapper Public | `make serve-wrapper-public` | Streamable HTTP (SSE) | 9001 | none | Public MCP wrapper access |
| Container | `make run` | HTTP | 8080 | none | Docker deployment |
| AWS App Runner | `make deploy-apprunner` | MCP Wrapper (SSE) | 8080 | none | Cloud deployment on AWS (Live) |
# 1. Run MCP server (for Claude Desktop, etc.)
make dev # Shows connection info + starts server
# 2. Test basic functionality
make example # Run evaluation example
make test-mcp # Test MCP protocol
# 1. Run native REST API server
make serve-rest # Starts on http://localhost:8080
# 2. Test REST API endpoints
make test-rest # Test all REST endpoints
# 3. View interactive documentation
open http://localhost:8080/docs # Swagger UI
open http://localhost:8080/redoc # ReDoc
# 4. Get connection info
make rest-info # Show complete REST API guide
# 1. Run MCP protocol over HTTP
make serve-http # Starts on http://localhost:9000
# 2. Test HTTP endpoints
make test-http # Test MCP JSON-RPC endpoints
# 3. Get connection info
make http-info # Show complete HTTP bridge guide
# 1. Start REST API server first
make serve-rest # Starts on http://localhost:8080
# 2. Start MCP wrapper (FastMCP around REST API)
make serve-wrapper # Starts on http://localhost:9001
# 3. Test wrapper functionality
python test_sse_client.py # Test SSE client
make test-wrapper # Test wrapper endpoints
# 4. Connection details
# Endpoint: http://localhost:9001/mcp/
# Protocol: Streamable HTTP (SSE)
# Headers: Accept: application/json, text/event-stream
# Session management: Automatic via mcp-session-id header
# Build and deploy
make build && make run
# 1. Set up AWS credentials and IAM role
aws configure
export INSTANCE_ROLE_ARN="arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):role/AppRunnerInstanceRole"
# 2. Deploy to AWS App Runner
make deploy-apprunner
# 3. Test locally first (optional)
make test-docker-apprunner
# Basic MCP integration
from mcp import Client
client = Client("mcp-eval-server")
# Evaluate any AI output
result = await client.call_tool("judge.evaluate_response", {
    "response": "Your AI output here",
    "criteria": [{"name": "quality", "description": "Overall quality", "scale": "1-5", "weight": 1.0}],
    "rubric": {"criteria": [], "scale_description": {"1": "Poor", "5": "Excellent"}}
})
# Start REST API server
make serve-rest
# Check server health
curl http://localhost:8080/health
# List tool categories
curl http://localhost:8080/tools/categories
# Evaluate response directly via REST
curl -X POST http://localhost:8080/judge/evaluate \
-H "Content-Type: application/json" \
-d '{
"response": "Paris is the capital of France.",
"criteria": [
{
"name": "accuracy",
"description": "Factual accuracy",
"scale": "1-5",
"weight": 1.0
}
],
"rubric": {
"criteria": [],
"scale_description": {"1": "Wrong", "5": "Correct"}
},
"judge_model": "rule-based"
}'
# Start HTTP bridge server
make serve-http
# List available tools (JSON-RPC)
curl -X POST \
-H "Content-Type: application/json" \
-d '{"jsonrpc": "2.0", "id": 1, "method": "tools/list", "params": {}}' \
http://localhost:9000/
# Evaluate response via HTTP bridge (JSON-RPC)
curl -X POST \
-H "Content-Type: application/json" \
-d '{
"jsonrpc": "2.0",
"id": 2,
"method": "tools/call",
"params": {
"name": "judge.evaluate_response",
"arguments": {
"response": "Paris is the capital of France.",
"criteria": [{"name": "accuracy", "description": "Factual accuracy", "scale": "1-5", "weight": 1.0}],
"rubric": {"criteria": [], "scale_description": {"1": "Wrong", "5": "Correct"}},
"judge_model": "rule-based"
}
}
}' \
http://localhost:9000/
# Start REST API server first
make serve-rest
# Start MCP wrapper
make serve-wrapper
# Initialize MCP session
curl -X POST -H "Content-Type: application/json" \
-H "Accept: application/json, text/event-stream" \
-d '{"jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {"protocolVersion": "2024-11-05", "capabilities": {}, "clientInfo": {"name": "test-client", "version": "1.0.0"}}}' \
http://localhost:9001/mcp
# Send initialized notification (note: returns 202 for notifications)
curl -X POST -H "Content-Type: application/json" \
-H "Accept: application/json, text/event-stream" \
-d '{"jsonrpc": "2.0", "method": "notifications/initialized"}' \
http://localhost:9001/mcp
# List available tools via MCP wrapper
curl -X POST -H "Content-Type: application/json" \
-H "Accept: application/json, text/event-stream" \
-d '{"jsonrpc": "2.0", "id": 2, "method": "tools/list", "params": {}}' \
http://localhost:9001/mcp
# Evaluate response via MCP wrapper
curl -X POST -H "Content-Type: application/json" \
-H "Accept: application/json, text/event-stream" \
-d '{
"jsonrpc": "2.0",
"id": 3,
"method": "tools/call",
"params": {
"name": "judge_evaluate",
"arguments": {
"response": "Paris is the capital of France.",
"criteria": [{"name": "accuracy", "description": "Factual accuracy", "scale": "1-5", "weight": 1.0}],
"rubric": {"criteria": [], "scale_description": {"1": "Wrong", "5": "Correct"}},
"judge_model": "rule-based"
}
}
}' \
http://localhost:9001/mcp
🌐 Live Deployment Testing:
Replace `http://localhost:9001/mcp` with `https://6xaate4xrt.us-east-1.awsapprunner.com/mcp` in the above commands to test the live deployment!
import httpx
import asyncio

async def evaluate_via_rest_api():
    """Example using native REST API endpoints."""
    async with httpx.AsyncClient() as client:
        base_url = "http://localhost:8080"

        # Check health
        health = await client.get(f"{base_url}/health")
        print(f"Server status: {health.json()['status']}")

        # List tool categories
        categories = await client.get(f"{base_url}/tools/categories")
        print(f"Available categories: {len(categories.json()['categories'])}")

        # Evaluate response using REST endpoint
        evaluation = await client.post(f"{base_url}/judge/evaluate", json={
            "response": "Your AI response here",
            "criteria": [
                {"name": "quality", "description": "Overall quality", "scale": "1-5", "weight": 1.0}
            ],
            "rubric": {
                "criteria": [],
                "scale_description": {"1": "Poor", "5": "Excellent"}
            },
            "judge_model": "rule-based"
        })
        result = evaluation.json()
        print(f"Evaluation score: {result['overall_score']}")

        # Check content toxicity
        toxicity = await client.post(f"{base_url}/quality/toxicity", json={
            "content": "This is a test message",
            "toxicity_categories": ["profanity", "hate_speech"],
            "sensitivity_level": "moderate",
            "judge_model": "rule-based"
        })
        result = toxicity.json()
        print(f"Toxicity detected: {result['toxicity_detected']}")

# Run evaluation
asyncio.run(evaluate_via_rest_api())
import httpx
import asyncio
async def evaluate_via_http_bridge():
    """Example using MCP over HTTP bridge."""
    async with httpx.AsyncClient() as client:
        base_url = "http://localhost:9000"

        # List tools via JSON-RPC
        tools_request = {
            "jsonrpc": "2.0",
            "id": 1,
            "method": "tools/list",
            "params": {}
        }
        response = await client.post(base_url, json=tools_request)
        result = response.json()
        tools = result.get("result", [])
        print(f"Available tools: {len(tools)}")

        # Evaluate response via JSON-RPC
        eval_request = {
            "jsonrpc": "2.0",
            "id": 2,
            "method": "tools/call",
            "params": {
                "name": "judge.evaluate_response",
                "arguments": {
                    "response": "Your AI response here",
                    "criteria": [{"name": "quality", "description": "Overall quality", "scale": "1-5", "weight": 1.0}],
                    "rubric": {"criteria": [], "scale_description": {"1": "Poor", "5": "Excellent"}},
                    "judge_model": "rule-based"
                }
            }
        }
        response = await client.post(base_url, json=eval_request)
        result = response.json()
        print(f"Evaluation result: {result}")

# Run evaluation
asyncio.run(evaluate_via_http_bridge())
import httpx
import asyncio
import json
import re
class MCPWrapperClient:
    """Client for MCP wrapper with SSE support."""

    def __init__(self, base_url: str):
        self.base_url = base_url
        self.session_id = None
        self.client = httpx.AsyncClient()

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        await self.client.aclose()

    async def send_request(self, request: dict) -> dict:
        """Send a JSON-RPC request and parse the SSE response."""
        headers = {"Accept": "application/json, text/event-stream"}
        if self.session_id:
            headers["mcp-session-id"] = self.session_id
        response = await self.client.post(
            self.base_url,
            json=request,
            headers=headers,
            timeout=30.0
        )
        if response.status_code not in [200, 202]:
            raise Exception(f"HTTP {response.status_code}: {response.text}")

        # Extract session ID from response headers
        if "mcp-session-id" in response.headers:
            self.session_id = response.headers["mcp-session-id"]

        # For notifications (202), return empty result
        if response.status_code == 202:
            return {}

        # Parse SSE response
        content = response.text
        data_match = re.search(r'data:\s*(\{.*\})', content)
        if data_match:
            json_str = data_match.group(1)
            return json.loads(json_str)
        else:
            raise Exception(f"Could not parse SSE response: {content}")


async def evaluate_via_mcp_wrapper():
    """Example using MCP wrapper with SSE."""
    async with MCPWrapperClient("http://localhost:9001/mcp/") as client:
        # Initialize MCP session
        init_request = {
            "jsonrpc": "2.0",
            "id": 1,
            "method": "initialize",
            "params": {
                "protocolVersion": "2024-11-05",
                "capabilities": {},
                "clientInfo": {"name": "test-client", "version": "1.0.0"}
            }
        }
        result = await client.send_request(init_request)
        print(f"Initialized: {result.get('result', {}).get('serverInfo', {}).get('name', 'Unknown')}")

        # Send initialized notification
        await client.send_request({"jsonrpc": "2.0", "method": "notifications/initialized"})

        # List tools
        tools_request = {"jsonrpc": "2.0", "id": 2, "method": "tools/list", "params": {}}
        result = await client.send_request(tools_request)
        tools = result.get('result', {}).get('tools', [])
        print(f"Available tools: {len(tools)}")

        # Evaluate response
        eval_request = {
            "jsonrpc": "2.0",
            "id": 3,
            "method": "tools/call",
            "params": {
                "name": "judge_evaluate",
                "arguments": {
                    "response": "Paris is the capital of France.",
                    "criteria": [{"name": "accuracy", "description": "Factual accuracy", "scale": "1-5", "weight": 1.0}],
                    "rubric": {"criteria": [], "scale_description": {"1": "Wrong", "5": "Correct"}},
                    "judge_model": "rule-based"
                }
            }
        }
        result = await client.send_request(eval_request)
        if 'result' in result:
            tool_result = result['result']
            print(f"Evaluation successful: {tool_result}")
        else:
            print(f"Evaluation failed: {result}")

# Run evaluation
asyncio.run(evaluate_via_mcp_wrapper())