📊 Tools: 63 Specialized Evaluation Tools
👨💻 Author: Mihai Criveti
An MCP server providing a comprehensive AI evaluation platform: 63 specialized tools across 14 categories for complete AI system assessment, using LLM-as-a-judge techniques combined with rule-based metrics.
- 🤖 4 Judge Tools - LLM-as-a-judge evaluation with bias mitigation
- 📝 4 Prompt Tools - Clarity, consistency, completeness analysis
- 🛠️ 4 Agent Tools - Tool usage, reasoning, task completion assessment
- 🔍 3 Quality Tools - Factuality, coherence, toxicity detection
- 🔗 8 RAG Tools - Retrieval relevance, context utilization, grounding verification
- ⚖️ 6 Bias & Fairness - Demographic bias, representation equity, intersectional analysis
- 🛡️ 5 Robustness Tools - Adversarial testing, injection resistance, stability analysis
- 🔒 4 Safety & Alignment - Harmful content detection, instruction adherence, value alignment
- 🌍 4 Multilingual Tools - Translation quality, cross-lingual consistency, cultural adaptation
- ⚡ 4 Performance Tools - Latency tracking, efficiency metrics, throughput scaling
- 🔐 8 Privacy Tools - PII detection, data minimization, compliance, anonymization
- 🔄 3 Workflow Tools - Evaluation suites, parallel execution, results comparison
- 📊 2 Calibration Tools - Judge agreement testing, rubric optimization
- 🏥 4 Server Tools - Health monitoring, cache statistics, system management
- 🤖 LLM-as-a-Judge - GPT-4 and Azure OpenAI judges with position bias mitigation
- 📈 Statistical Rigor - Confidence intervals, significance testing, correlation analysis
- 🎪 Multi-Modal Assessment - Pattern matching + LLM evaluation + rule-based metrics
- 🏗️ Extensible Architecture - Configurable rubrics, custom criteria, plugin system
# 🎯 One-command setup
pip install -e ".[dev]"
# 🔥 Launch MCP server for Claude Desktop, MCP clients
python -m mcp_eval_server.server
# or
make dev
# 🏥 Health check (automatic on port 8080)
curl http://localhost:8080/health # ✅ Liveness probe
curl http://localhost:8080/ready # 🎯 Readiness probe
curl http://localhost:8080/metrics # 📊 Performance metrics
# 🚀 Launch REST API server with FastAPI
python -m mcp_eval_server.rest_server --port 8080 --host 0.0.0.0
# or
make serve-rest
# 📚 Interactive API documentation
open http://localhost:8080/docs
# 🧪 Quick API test
curl http://localhost:8080/health
curl http://localhost:8080/tools/categories
# 🌍 MCP protocol over HTTP with Server-Sent Events
make serve-http
# 📡 Access via JSON-RPC over HTTP on port 9000
curl -X POST -H 'Content-Type: application/json' \
-d '{"jsonrpc": "2.0", "id": 1, "method": "tools/list", "params": {}}' \
http://localhost:9000/
# 🎯 Start REST API server first
make serve-rest
# 🔗 Launch MCP wrapper (FastMCP around REST API)
make serve-wrapper
# 📡 Access via MCP protocol over HTTP/SSE on port 9001
# Endpoint: http://localhost:9001/mcp
# Headers: Accept: application/json, text/event-stream
# Protocol: Streamable HTTP (SSE) with session management
# 🧪 Test the wrapper
python test_sse_client.py
# 🚀 Deploy to AWS App Runner
make deploy-apprunner
# 🌐 Access deployed MCP wrapper
# URL: https://6xaate4xrt.us-east-1.awsapprunner.com/mcp
# Protocol: Streamable HTTP (SSE)
# Headers: Accept: application/json, text/event-stream
# 🧪 Test deployed wrapper
curl -X POST "https://6xaate4xrt.us-east-1.awsapprunner.com/mcp" \
-H "Content-Type: application/json" \
-H "Accept: application/json, text/event-stream" \
-d '{"jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {"protocolVersion": "2024-11-05", "capabilities": {}, "clientInfo": {"name": "test-client", "version": "1.0.0"}}}'- 🎯 Single Response Evaluation: Customizable criteria with weighted scoring and confidence metrics
- ⚖️ Pairwise Comparison: Head-to-head analysis with automatic position bias mitigation
- 🏆 Multi-Response Ranking: Tournament, round-robin, and scoring-based ranking algorithms
- 📊 Reference-Based Evaluation: Gold standard comparison for factuality, completeness, and style
- 🤝 Multi-Judge Consensus: Ensemble evaluation with agreement analysis and confidence weighting
- 🔍 Clarity Analysis: Rule-based ambiguity detection + LLM semantic analysis with improvement recommendations
- 🔄 Consistency Testing: Multi-run variance analysis across temperature settings with outlier detection
- ✅ Completeness Measurement: Component coverage analysis with visual heatmap generation
- 🎯 Relevance Assessment: Semantic alignment using TF-IDF vectorization with drift analysis
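The relevance assessment above relies on TF-IDF vectors rather than heavy embedding models. A minimal sketch of that approach with scikit-learn (illustrative only; `relevance_score` is a hypothetical helper, not the server's internal API):

```python
# Minimal sketch of TF-IDF based relevance scoring (illustrative;
# not the server's internal implementation).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def relevance_score(prompt: str, response: str) -> float:
    """Cosine similarity between TF-IDF vectors of prompt and response."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform([prompt, response])
    return float(cosine_similarity(matrix[0], matrix[1])[0, 0])

print(relevance_score("Explain TCP handshakes", "A TCP handshake has three steps..."))
```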
- ⚙️ Tool Usage Evaluation: Selection accuracy, sequence optimization, parameter validation with efficiency scoring
- ✅ Task Completion Analysis: Multi-criteria success evaluation with partial credit and failure analysis
- 🧠 Reasoning Assessment: Decision-making quality, logical coherence, and hallucination detection
- 📈 Performance Benchmarking: Comprehensive capability testing across skill levels with baseline comparison
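To make the tool-usage metrics concrete, here is a minimal sketch of selection-accuracy scoring, assuming expected and actual tool calls are simple name lists (`tool_selection_accuracy` is a hypothetical helper, not the server's implementation):

```python
# Minimal sketch of tool-selection accuracy (illustrative; the server's
# agent.evaluate_tool_use tool computes richer metrics).
def tool_selection_accuracy(expected: list[str], actual: list[str]) -> float:
    """Fraction of expected tools the agent actually invoked."""
    if not expected:
        return 1.0
    hits = sum(1 for tool in expected if tool in actual)
    return hits / len(expected)

print(tool_selection_accuracy(["search", "calculator"], ["search", "code_executor"]))  # 0.5
```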
- ✅ Factuality Checking: Claims verification against knowledge bases with confidence scoring and evidence tracking
- 🧩 Coherence Analysis: Logical flow assessment, contradiction detection, and structural analysis
- 🛡️ Toxicity Detection: Multi-category harmful content identification with bias pattern analysis
- 📊 Retrieval Relevance: Semantic similarity assessment with LLM judge validation and configurable thresholds
- 🎯 Context Utilization: Analysis of how well retrieved context is integrated into generated responses
- ⚓ Answer Groundedness: Claim verification against supporting context with strictness controls
- 🚨 Hallucination Detection: Contradiction identification between responses and source context
- 🎯 Retrieval Coverage: Topic completeness assessment and information gap analysis
- 📝 Citation Accuracy: Reference validation and citation quality scoring across multiple formats
- 🧩 Chunk Relevance: Individual document segment evaluation with ranking and scoring
- 🏆 Retrieval Benchmarking: Comparative analysis using standard IR metrics (precision, recall, MRR, NDCG)
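The IR metrics named above can be sketched in a few lines. A minimal, binary-relevance version (illustrative; the `rag.benchmark_retrieval_systems` tool computes these internally with more options):

```python
import math

# Minimal sketches of MRR and NDCG with binary relevance (illustrative only).
def mrr(ranked_ids: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant document."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg(ranked_ids: list[str], relevant: set[str]) -> float:
    """Binary-relevance NDCG over the returned ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked_ids, start=1)
              if doc_id in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), len(ranked_ids)) + 1))
    return dcg / ideal if ideal else 0.0

print(mrr(["d3", "d1", "d7"], {"d1"}))   # 0.5
print(ndcg(["d3", "d1", "d7"], {"d1"}))  # ~0.63
```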
- 🎯 Demographic Bias Detection: Pattern matching and LLM assessment for protected group bias
- 📊 Representation Fairness: Balanced representation analysis across contexts and groups
- ⚖️ Outcome Equity: Disparate impact analysis across protected attributes
- 🌍 Cultural Sensitivity: Cross-cultural appropriateness and awareness evaluation
- 🗣️ Linguistic Bias Detection: Language-based discrimination and dialect bias identification
- 🔗 Intersectional Fairness: Compound bias effects across multiple identity dimensions
- ⚔️ Adversarial Testing: Malicious prompt resistance and attack vector evaluation
- 🔄 Input Sensitivity: Response stability testing under input variations and perturbations
- 🛡️ Prompt Injection Resistance: Security defense evaluation against injection attacks
- 📈 Distribution Shift: Performance degradation analysis on out-of-domain data
- 🎯 Consistency Under Perturbation: Output stability measurement across input modifications
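A minimal sketch of the consistency-under-perturbation idea, assuming a hypothetical `generate_response` callable standing in for the system under test:

```python
import difflib

# Minimal sketch of output stability across input perturbations (illustrative).
def perturb(text: str) -> list[str]:
    """A few trivial surface perturbations of the input."""
    return [text.lower(), text.upper(), text + " ", text.replace(",", "")]

def consistency(text: str, generate_response) -> float:
    """Mean pairwise similarity of outputs across perturbed inputs."""
    outputs = [generate_response(v) for v in [text] + perturb(text)]
    pairs = [(a, b) for i, a in enumerate(outputs) for b in outputs[i + 1:]]
    sims = [difflib.SequenceMatcher(None, a, b).ratio() for a, b in pairs]
    return sum(sims) / len(sims)
```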
- ⚠️ Harmful Content Detection: Multi-category risk assessment across safety dimensions
- 📋 Instruction Following: Constraint adherence and safety instruction compliance
- 🚫 Refusal Appropriateness: Evaluation of appropriate system refusal behavior
- 💎 Value Alignment: Human values and ethical principles alignment assessment
- 🔄 Translation Quality: Accuracy, fluency, and completeness assessment across languages
- 🔗 Cross-Lingual Consistency: Consistency evaluation across multiple language versions
- 🎭 Cultural Adaptation: Localization quality and cultural appropriateness evaluation
- 🔀 Language Mixing Detection: Inappropriate code-switching and language mixing identification
- ⏱️ Response Latency: Generation speed tracking with statistical analysis and percentiles
- 💻 Computational Efficiency: Resource usage monitoring and efficiency metrics
- 📈 Throughput Scaling: Concurrent request handling and scaling behavior analysis
- 💾 Memory Monitoring: Memory consumption pattern tracking and leak detection
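A minimal sketch of latency percentile tracking, assuming a hypothetical `call_model` callable for the generation call being measured:

```python
import time
import statistics

# Minimal sketch of latency measurement with percentiles (illustrative;
# the performance.measure_response_latency tool adds timeout tracking).
def latency_percentiles(call_model, prompt: str, runs: int = 20) -> dict:
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call_model(prompt)
        samples.append(time.perf_counter() - start)
    cuts = statistics.quantiles(samples, n=100)  # 99 percentile cut points
    return {"mean_s": statistics.mean(samples), "p50_s": cuts[49], "p95_s": cuts[94]}
```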
- 🔍 PII Detection: Personally identifiable information detection with configurable sensitivity
- 📊 Data Minimization: Evaluation of data collection necessity and purpose alignment
- 📋 Consent Compliance: Privacy regulation compliance assessment (GDPR, CCPA, COPPA, HIPAA)
- 🎭 Anonymization Effectiveness: Re-identification risk analysis and utility preservation
- 🚨 Data Leakage Detection: Unintended data exposure and inference leakage identification
- 📖 Consent Clarity: Readability and comprehensibility assessment of privacy notices
- 🗃️ Data Retention Compliance: Retention policy alignment and regulatory adherence
- 🏗️ Privacy-by-Design: System-level privacy implementation and design principle evaluation
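A minimal sketch of the pattern-matching half of PII detection (illustrative; the server's `privacy.detect_pii_exposure` tool adds sensitivity levels and context analysis):

```python
import re

# Minimal sketch of regex-based PII detection (illustrative only).
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def detect_pii(text: str) -> dict[str, list[str]]:
    """Return matches per PII category found in the text."""
    return {name: pattern.findall(text)
            for name, pattern in PII_PATTERNS.items()
            if pattern.findall(text)}

print(detect_pii("Contact jane@example.com or 555-867-5309"))
```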
- 🎛️ Evaluation Suites: Customizable multi-step pipelines with weighted criteria and success thresholds
- ⚡ Parallel/Sequential Execution: Optimized processing with configurable concurrency and resource management
- 📊 Results Comparison: Statistical analysis with trend detection, significance testing, and regression analysis
- 🤝 Agreement Testing: Inter-judge correlation analysis with human baseline comparison
- 🎯 Rubric Optimization: Automatic tuning using machine learning for improved human alignment
- 📋 Judge Management: Available model listing, capability assessment, configuration validation
- 💾 Results Storage: Comprehensive evaluation history with metadata and statistical reporting
- ⚡ Cache Management: Multi-level caching statistics and performance optimization
- 🔍 Health Monitoring: System status checks and performance metrics
- Position Bias Mitigation: Automatic response position randomization for fair comparisons
- Chain-of-Thought Integration: Step-by-step reasoning for enhanced evaluation quality
- Confidence Calibration: Self-assessment metrics for evaluation reliability
- Multiple Judge Consensus: Ensemble methods with disagreement analysis
- Human Alignment: Regular calibration against ground truth evaluations
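A minimal sketch of position bias mitigation for pairwise judging, assuming a hypothetical `ask_judge` callable that returns `"first"` or `"second"` for a pair presented in a given order:

```python
# Minimal sketch of position-bias mitigation (illustrative; the server also
# randomizes presentation order before judging).
def debiased_comparison(response_a: str, response_b: str, ask_judge) -> str:
    verdict_ab = ask_judge(response_a, response_b)  # A shown first
    verdict_ba = ask_judge(response_b, response_a)  # positions swapped
    if verdict_ab == "first" and verdict_ba == "second":
        return "A"  # A wins regardless of position
    if verdict_ab == "second" and verdict_ba == "first":
        return "B"  # B wins regardless of position
    return "tie"  # verdict followed position, not content: treat as a tie
```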
- Lightweight Dependencies: Uses standard libraries (scikit-learn, numpy) instead of heavy ML frameworks
- Smart Caching: Multi-level caching (memory + disk) with TTL and invalidation
- Async Processing: Non-blocking evaluation execution with configurable concurrency
- Batch Operations: Efficient multi-item processing with progress tracking
- Resource Management: Memory and CPU optimization with automatic scaling
- Fast Startup: Quick initialization without loading large pre-trained models
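A minimal sketch of the TTL caching idea mentioned above (illustrative; the server's caching layer adds a disk tier and explicit invalidation):

```python
import time

# Minimal sketch of a TTL-based in-memory cache (illustrative only).
class TTLCache:
    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self.store: dict[str, tuple[float, object]] = {}

    def get(self, key: str):
        entry = self.store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:  # entry expired: drop it
            del self.store[key]
            return None
        return value

    def set(self, key: str, value: object) -> None:
        self.store[key] = (time.monotonic() + self.ttl, value)
```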
- Cryptographic Random: Secure random number generation for bias mitigation
- API Key Management: Secure credential handling with environment variable integration
- Input Validation: Comprehensive parameter validation and sanitization
- Error Isolation: Graceful failure handling with detailed error reporting
- Audit Trail: Complete evaluation history with compliance reporting
- Statistical Analysis: Correlation analysis, significance testing, trend detection
- Performance Metrics: Latency tracking, throughput monitoring, success rate analysis
- Quality Dashboards: Real-time evaluation quality monitoring with alerting
- Comparative Analysis: A/B testing capabilities with regression detection
- Predictive Analytics: Performance trend forecasting and anomaly detection
The MCP Wrapper is a FastMCP-based server that wraps your existing REST API and exposes it as a Model Context Protocol (MCP) server. This allows MCP clients (like Claude Desktop) to access all your evaluation tools through the MCP protocol while keeping your REST API unchanged.
- 🔄 Dual Protocol Support: Access the same tools via both REST API and MCP protocol
- 🌐 HTTP/SSE Transport: Modern streamable HTTP with Server-Sent Events
- 🔗 REST API Integration: Seamlessly wraps existing REST endpoints
- 📡 Session Management: Automatic session handling with `mcp-session-id` headers
- ⚡ Real-time Communication: Bidirectional communication over a single HTTP connection
- 🛡️ Protocol Compliance: Full MCP protocol compliance with proper initialization
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ MCP Client │ │ MCP Wrapper │ │ REST API │
│ (Claude Desktop)│◄──►│ (FastMCP + SSE) │◄──►│ (FastAPI) │
│ │ │ │ │ │
│ • stdio │ │ • /mcp/ endpoint │ │ • /judge/* │
│ • HTTP/SSE │ │ • Session mgmt │ │ • /quality/* │
│ • JSON-RPC │ │ • Tool wrapping │ │ • /agent/* │
└─────────────────┘ └──────────────────┘ └─────────────────┘
- Transport: Streamable HTTP (SSE) with session management
- Endpoint: `http://localhost:9001/mcp/`
- Headers: `Accept: application/json, text/event-stream`
- Session: Automatic via `mcp-session-id` header
- Protocol: Full MCP protocol with initialization and notifications
- Tools: All 63 evaluation tools exposed as MCP tools
# Start both servers
make serve-rest # REST API on port 8080
make serve-wrapper # MCP wrapper on port 9001
# Test the wrapper
python test_sse_client.py
make test-wrapper
# Clone and install (lightweight dependencies only)
cd mcp-servers/python/mcp_eval_server
pip install -e ".[dev]"
# Set up API keys (optional - rule-based judge works without them)
export OPENAI_API_KEY="sk-your-key-here"
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
export AZURE_OPENAI_API_KEY="your-azure-api-key"
# Configure health check endpoints (optional)
export HEALTH_CHECK_PORT=8080 # Default: 8080
export HEALTH_CHECK_HOST=0.0.0.0 # Default: 0.0.0.0
# Note: No heavy ML dependencies required!
# Uses efficient TF-IDF + scikit-learn instead of transformers
{
"command": "python",
"args": ["-m", "mcp_eval_server.server"],
"cwd": "/path/to/mcp-servers/python/mcp_eval_server"
}
Protocol: stdio (Model Context Protocol)
Transport: Standard input/output (no HTTP port needed)
Tools Available: 63 specialized evaluation tools
{
"url": "http://localhost:9001/mcp/",
"headers": {
"Accept": "application/json, text/event-stream"
}
}
Protocol: Streamable HTTP (SSE)
Transport: HTTP with Server-Sent Events
Session Management: Automatic via mcp-session-id header
Tools Available: 63 specialized evaluation tools (via REST API wrapper)
The server automatically starts health check HTTP endpoints for monitoring:
# Health endpoints (started automatically with the MCP server)
curl http://localhost:8080/health # Liveness probe
curl http://localhost:8080/ready # Readiness probe
curl http://localhost:8080/metrics # Basic metrics
curl http://localhost:8080/ # Service info
# Kubernetes-style endpoints
curl http://localhost:8080/healthz # Alternative health
curl http://localhost:8080/readyz # Alternative readiness
Health Check Response Example:
{
"status": "healthy",
"timestamp": 1698765432.123,
"uptime_seconds": 45.67,
"service": "mcp-eval-server",
"version": "0.1.0",
"checks": {
"server_running": true,
"uptime_ok": true
}
}
Readiness Check Response Example:
{
"status": "ready",
"timestamp": 1698765432.123,
"service": "mcp-eval-server",
"version": "0.1.0",
"checks": {
"server_initialized": true,
"judge_tools_loaded": true,
"storage_initialized": true
}
}
# Build container
make build
# Run with environment
make run
# Or use docker-compose
make compose-up
# 1. Prerequisites: AWS CLI configured, Docker installed
aws configure
# 2. Create IAM role for App Runner
aws iam create-role --role-name AppRunnerInstanceRole --assume-role-policy-document file://trust-policy.json
aws iam attach-role-policy --role-name AppRunnerInstanceRole --policy-arn arn:aws:iam::aws:policy/service-role/AppRunnerServicePolicyForECRAccess
# 3. Set environment variable
export INSTANCE_ROLE_ARN="arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):role/AppRunnerInstanceRole"
# 4. Deploy to AWS App Runner
make deploy-apprunner
# 5. Set API keys in App Runner console (OPENAI_API_KEY, etc.)
🌐 Live Deployed Endpoints:
- MCP Wrapper: https://6xaate4xrt.us-east-1.awsapprunner.com/mcp
- Health Check: https://6xaate4xrt.us-east-1.awsapprunner.com/health
- Service Info: https://6xaate4xrt.us-east-1.awsapprunner.com/
📚 Detailed Guide: See AWS_APP_RUNNER_DEPLOYMENT.md for complete deployment instructions.
The MCP wrapper is now live and fully functional! Here's how to test it:
curl -X POST "https://6xaate4xrt.us-east-1.awsapprunner.com/mcp" \
-H "Content-Type: application/json" \
-H "Accept: application/json, text/event-stream" \
-i \
-d '{
"jsonrpc": "2.0",
"id": 1,
"method": "initialize",
"params": {
"protocolVersion": "2024-11-05",
"capabilities": {},
"clientInfo": {
"name": "test-client",
"version": "1.0.0"
}
}
}'
# Use the session ID from Step 1
curl -X POST "https://6xaate4xrt.us-east-1.awsapprunner.com/mcp" \
-H "Content-Type: application/json" \
-H "Accept: application/json, text/event-stream" \
-H "mcp-session-id: <SESSION_ID>" \
-d '{
"jsonrpc": "2.0",
"method": "notifications/initialized"
}'
curl -X POST "https://6xaate4xrt.us-east-1.awsapprunner.com/mcp" \
-H "Content-Type: application/json" \
-H "Accept: application/json, text/event-stream" \
-H "mcp-session-id: <SESSION_ID>" \
-d '{
"jsonrpc": "2.0",
"id": 2,
"method": "tools/list",
"params": {}
}'
curl -X POST "https://6xaate4xrt.us-east-1.awsapprunner.com/mcp" \
-H "Content-Type: application/json" \
-H "Accept: application/json, text/event-stream" \
-H "mcp-session-id: <SESSION_ID>" \
-d '{
"jsonrpc": "2.0",
"id": 3,
"method": "tools/call",
"params": {
"name": "judge_evaluate",
"arguments": {
"response": "Paris is the capital of France.",
"criteria": [
{
"name": "accuracy",
"description": "Factual accuracy",
"scale": "1-5",
"weight": 1.0
}
],
"rubric": {
"criteria": [],
"scale_description": {
"1": "Wrong",
"5": "Correct"
}
},
"judge_model": "rule-based"
}
}
}'
# Install development dependencies
make dev-install
# Run development server
make dev
# Run tests
make test
# Check code quality
make lint
# Multi-criteria evaluation with MCP client
result = await mcp_client.call_tool("judge.evaluate_response", {
    "response": "Detailed technical explanation...",
    "criteria": [
        {"name": "technical_accuracy", "description": "Correctness of technical details", "scale": "1-5", "weight": 0.4},
        {"name": "clarity", "description": "Explanation clarity", "scale": "1-5", "weight": 0.3},
        {"name": "completeness", "description": "Coverage of key points", "scale": "1-5", "weight": 0.3}
    ],
    "rubric": {
        "criteria": [],
        "scale_description": {
            "1": "Severely lacking",
            "2": "Below expectations",
            "3": "Meets basic requirements",
            "4": "Exceeds expectations",
            "5": "Outstanding quality"
        }
    },
    "judge_model": "gpt-4",
    "use_cot": True
})
# Evaluate response via REST API
curl -X POST http://localhost:8080/judge/evaluate \
-H "Content-Type: application/json" \
-d '{
"response": "Paris is the capital of France",
"criteria": [
{
"name": "accuracy",
"description": "Factual accuracy",
"scale": "1-5",
"weight": 1.0
}
],
"rubric": {
"criteria": [],
"scale_description": {
"1": "Wrong",
"5": "Correct"
}
},
"judge_model": "gpt-4o-mini"
}'
# Python REST API client
import httpx
import asyncio

async def evaluate_via_rest():
    async with httpx.AsyncClient() as client:
        response = await client.post("http://localhost:8080/judge/evaluate", json={
            "response": "Technical explanation...",
            "criteria": [
                {"name": "quality", "description": "Overall quality", "scale": "1-5", "weight": 1.0}
            ],
            "rubric": {
                "criteria": [],
                "scale_description": {"1": "Poor", "5": "Excellent"}
            },
            "judge_model": "gpt-4o-mini"
        })
        result = response.json()
        return result

# Run evaluation
result = asyncio.run(evaluate_via_rest())
print(f"Overall score: {result['overall_score']}")
# Head-to-head comparison with bias mitigation
comparison = await mcp_client.call_tool("judge.pairwise_comparison", {
    "response_a": "Technical solution A with implementation details...",
    "response_b": "Alternative solution B with different approach...",
    "criteria": [
        {"name": "innovation", "description": "Novelty and creativity", "scale": "1-5", "weight": 0.4},
        {"name": "feasibility", "description": "Implementation practicality", "scale": "1-5", "weight": 0.3},
        {"name": "efficiency", "description": "Resource optimization", "scale": "1-5", "weight": 0.3}
    ],
    "context": "Solutions for enterprise-scale data processing challenge",
    "position_bias_mitigation": True,
    "judge_model": "gpt-4-turbo"
})
# Full agent performance assessment
benchmark_result = await mcp_client.call_tool("agent.benchmark_performance", {
    "benchmark_suite": "advanced_skills",
    "agent_config": {
        "model": "gpt-4",
        "temperature": 0.7,
        "tools_enabled": ["search", "calculator", "code_executor"]
    },
    "baseline_comparison": {
        "name": "GPT-3.5 Baseline",
        "scores": {"accuracy": 0.75, "efficiency": 0.68, "reliability": 0.72}
    },
    "metrics_focus": ["accuracy", "efficiency", "reliability", "creativity"]
})
# Create sophisticated evaluation pipeline
suite = await mcp_client.call_tool("workflow.create_evaluation_suite", {
    "suite_name": "comprehensive_ai_assessment",
    "description": "Full-spectrum AI capability evaluation",
    "evaluation_steps": [
        {
            "tool": "prompt.evaluate_clarity",
            "weight": 0.15,
            "parameters": {"target_model": "gpt-4", "domain_context": "technical"}
        },
        {
            "tool": "judge.evaluate_response",
            "weight": 0.25,
            "parameters": {
                "criteria": [
                    {"name": "technical_depth", "description": "Technical sophistication", "scale": "1-5", "weight": 0.4},
                    {"name": "practical_utility", "description": "Real-world applicability", "scale": "1-5", "weight": 0.6}
                ],
                "judge_model": "gpt-4"
            }
        },
        {
            "tool": "quality.evaluate_factuality",
            "weight": 0.20
        },
        {
            "tool": "quality.measure_coherence",
            "weight": 0.15
        },
        {
            "tool": "quality.assess_toxicity",
            "weight": 0.10
        },
        {
            "tool": "agent.analyze_reasoning",
            "weight": 0.15,
            "parameters": {"judge_model": "gpt-4-turbo"}
        }
    ],
    "success_thresholds": {
        "overall": 0.85,
        "quality.evaluate_factuality": 0.90,
        "quality.assess_toxicity": 0.95
    },
    "weights": {
        "accuracy": 0.4,
        "safety": 0.3,
        "utility": 0.3
    }
})

# Execute comprehensive evaluation
results = await mcp_client.call_tool("workflow.run_evaluation", {
    "suite_id": suite["suite_id"],
    "test_data": {
        "response": "Complex AI system response...",
        "context": "Enterprise deployment scenario...",
        "reasoning_trace": [...],
        "agent_trace": {...}
    },
    "parallel_execution": True,
    "max_concurrent": 5
})
The MCP Eval Server supports complete customization of judge models, allowing you to:
- Configure custom API endpoints and deployments
- Set provider-specific parameters and capabilities
- Create domain-specific model configurations
- Use custom environment variable names
# Use custom model configuration
export MCP_EVAL_MODELS_CONFIG="./my-custom-models.yaml"
export DEFAULT_JUDGE_MODEL="my-custom-judge"
# Copy default config for customization
make copy-config # Copies to ./custom-config/
make show-config # Show current configuration status
make validate-config # Validate configuration syntax
models:
  azure:
    my-enterprise-gpt4:
      provider: "azure"
      deployment_name: "my-gpt4-deployment"
      model_name: "gpt-4"
      api_base_env: "AZURE_OPENAI_ENDPOINT"
      api_key_env: "AZURE_OPENAI_API_KEY"
      api_version_env: "AZURE_OPENAI_API_VERSION"
      deployment_name_env: "AZURE_DEPLOYMENT_NAME"
      default_temperature: 0.1   # Custom temperature
      max_tokens: 3000           # Custom token limit
      capabilities:
        supports_cot: true
        supports_pairwise: true
        supports_ranking: true
        supports_reference: true
        max_context_length: 8192
        optimal_temperature: 0.1
        consistency_level: "very_high"
      metadata:
        purpose: "production_evaluation"
        cost_tier: "premium"
  ollama:
    my-local-llama:
      provider: "ollama"
      model_name: "llama3:70b"
      base_url_env: "OLLAMA_BASE_URL"
      default_temperature: 0.3
      max_tokens: 2000
      request_timeout: 120   # Longer timeout for large models

# Custom defaults
defaults:
  primary_judge: "my-enterprise-gpt4"
  fallback_judge: "my-local-llama"

# Custom recommendations
recommendations:
  production: ["my-enterprise-gpt4"]
  development: ["my-local-llama"]
rubrics:
  technical_excellence:
    name: "Technical Excellence Assessment"
    criteria:
      - name: "code_quality"
        description: "Code structure, efficiency, and best practices"
        scale: "1-10"
        weight: 0.3
      - name: "innovation"
        description: "Novel approaches and creative solutions"
        scale: "1-10"
        weight: 0.25
      - name: "scalability"
        description: "System scalability and performance considerations"
        scale: "1-10"
        weight: 0.25
      - name: "maintainability"
        description: "Code maintainability and documentation quality"
        scale: "1-10"
        weight: 0.2
    scale_description:
      "1-2": "Severely deficient, requires major rework"
      "3-4": "Below standards, significant improvements needed"
      "5-6": "Meets basic requirements, minor improvements possible"
      "7-8": "Exceeds expectations, high quality work"
      "9-10": "Exceptional excellence, industry-leading quality"
benchmarks:
  enterprise_readiness:
    name: "Enterprise Readiness Assessment"
    category: "production"
    tasks:
      - name: "security_analysis"
        description: "Security vulnerability assessment and mitigation"
        difficulty: "advanced"
        expected_tools: ["security_scanner", "vulnerability_analyzer", "mitigation_planner"]
        evaluation_metrics: ["threat_identification", "risk_assessment", "solution_quality"]
      - name: "performance_optimization"
        description: "System performance analysis and optimization"
        difficulty: "advanced"
        expected_tools: ["profiler", "optimizer", "benchmarker"]
        evaluation_metrics: ["performance_gain", "resource_efficiency", "scalability_impact"]
- Correlation Analysis: Pearson, Spearman, Cohen's Kappa for agreement measurement
- Significance Testing: Statistical validation of evaluation differences
- Trend Analysis: Performance trajectory analysis with volatility assessment
- Outlier Detection: Anomaly identification in evaluation results
- Confidence Intervals: Uncertainty quantification for evaluation scores
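A minimal sketch of the agreement statistics above using scipy and scikit-learn, with made-up example scores (illustrative; the `calibration.test_judge_agreement` tool wraps similar computations):

```python
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import cohen_kappa_score

# Example scores from two judges on the same six items (made-up data).
judge_a = [4, 3, 5, 2, 4, 5]
judge_b = [4, 2, 5, 3, 4, 4]

print("Pearson r:", pearsonr(judge_a, judge_b)[0])
print("Spearman rho:", spearmanr(judge_a, judge_b)[0])
print("Cohen's kappa:", cohen_kappa_score(judge_a, judge_b))
```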
- Judge Calibration: Systematic bias detection and correction algorithms
- Rubric Evolution: Machine learning-powered rubric optimization
- Meta-Evaluation: Evaluation of evaluation quality itself
- Human Alignment: Continuous calibration against expert human judgments
- Cross-Validation: K-fold validation for evaluation reliability
- Technical Content: Code quality, architecture assessment, security analysis
- Creative Writing: Originality, engagement, style consistency evaluation
- Academic Work: Research quality, citation analysis, argument strength
- Customer Service: Helpfulness, politeness, problem resolution effectiveness
- Educational Content: Learning objective achievement, instructional clarity
- Multi-Judge Runtime: Supports OpenAI, Azure OpenAI, and rule-based evaluation engines
- Caching Layer: Redis-compatible distributed caching with automatic invalidation
- Results Database: SQLite/PostgreSQL storage with comprehensive indexing
- API Gateway: RESTful endpoints with authentication and rate limiting
- Monitoring System: Prometheus metrics with Grafana dashboards
- Container Deployment: Production-ready Docker/Podman containers with security hardening
- Kubernetes Support: Helm charts with auto-scaling and service mesh integration
- Cloud Integration: AWS ECS, Azure Container Instances, Google Cloud Run compatibility
- Edge Deployment: Lightweight containers for edge computing scenarios
- Development Mode: Hot-reload development server with debugging capabilities
- Enterprise Security: OAuth 2.0, JWT tokens, API key rotation
- Data Privacy: Encryption at rest and in transit, PII detection and filtering
- Audit Logging: Comprehensive audit trails with tamper detection
- Compliance Ready: SOC 2, GDPR, HIPAA compliance frameworks supported
- Vulnerability Management: Continuous security scanning and automated patching
🏆 MCP EVALUATION SERVER - 63 SPECIALIZED TOOLS 🏆
═══════════════════════════════════════════════════════════
📊 CORE EVALUATION SUITE (15 tools)
├── 🤖 Judge Tools (4) ────── LLM-as-a-judge evaluation
├── 📝 Prompt Tools (4) ───── Clarity, consistency, optimization
├── 🛠️ Agent Tools (4) ────── Performance, reasoning, benchmarking
└── 🔍 Quality Tools (3) ──── Factuality, coherence, toxicity
🔬 ADVANCED ASSESSMENT SUITE (39 tools)
├── 🔗 RAG Tools (8) ──────── Retrieval relevance, grounding, citations
├── ⚖️ Bias & Fairness (6) ── Demographic bias, intersectional analysis
├── 🛡️ Robustness (5) ──────── Adversarial testing, injection resistance
├── 🔒 Safety & Alignment (4) Harmful content, value alignment
├── 🌍 Multilingual (4) ────── Translation, cultural adaptation
├── ⚡ Performance (4) ──────── Latency, efficiency, scaling
└── 🔐 Privacy (8) ───────── PII detection, compliance, anonymization
🔧 SYSTEM MANAGEMENT (9 tools)
├── 🔄 Workflow Tools (3) ─── Evaluation suites, parallel execution
├── 📊 Calibration (2) ────── Judge agreement, rubric optimization
└── 🏥 Server Tools (4) ───── Health monitoring, system management
🎯 TOTAL: 63 TOOLS ACROSS 14 CATEGORIES 🎯
| Tool | Description | Key Features |
|---|---|---|
| `judge.evaluate_response` | Single response evaluation | Customizable criteria, weighted scoring, confidence metrics |
| `judge.pairwise_comparison` | Two-response comparison | Position bias mitigation, criterion-level analysis |
| `judge.rank_responses` | Multi-response ranking | Tournament/scoring algorithms, consistency measurement |
| `judge.evaluate_with_reference` | Reference-based evaluation | Gold standard comparison, similarity scoring |

| Tool | Description | Key Features |
|---|---|---|
| `prompt.evaluate_clarity` | Clarity assessment | Rule-based + LLM analysis, ambiguity detection |
| `prompt.test_consistency` | Consistency testing | Multi-run analysis, temperature variance |
| `prompt.measure_completeness` | Completeness analysis | Component coverage, heatmap visualization |
| `prompt.assess_relevance` | Relevance measurement | TF-IDF semantic alignment, drift analysis |

| Tool | Description | Key Features |
|---|---|---|
| `agent.evaluate_tool_use` | Tool usage analysis | Selection accuracy, sequence optimization |
| `agent.measure_task_completion` | Task success evaluation | Multi-criteria assessment, partial credit |
| `agent.analyze_reasoning` | Reasoning quality assessment | Logic analysis, hallucination detection |
| `agent.benchmark_performance` | Performance benchmarking | Multi-domain testing, baseline comparison |

| Tool | Description | Key Features |
|---|---|---|
| `quality.evaluate_factuality` | Factual accuracy checking | Claims verification, confidence scoring |
| `quality.measure_coherence` | Logical flow analysis | Coherence scoring, contradiction detection |
| `quality.assess_toxicity` | Harmful content detection | Multi-category analysis, bias detection |

| Tool | Description | Key Features |
|---|---|---|
| `rag.evaluate_retrieval_relevance` | Document relevance assessment | Semantic similarity, LLM validation |
| `rag.measure_context_utilization` | Context usage analysis | Word overlap, sentence integration |
| `rag.assess_answer_groundedness` | Claim verification | Context support, strictness control |
| `rag.detect_hallucination_vs_context` | Contradiction detection | Statement verification, confidence scoring |
| `rag.evaluate_retrieval_coverage` | Topic completeness check | Information gap analysis, coverage scoring |
| `rag.assess_citation_accuracy` | Reference validation | Citation quality, format support |
| `rag.measure_chunk_relevance` | Document segment scoring | Individual chunk analysis, ranking |
| `rag.benchmark_retrieval_systems` | System comparison | IR metrics, performance analysis |

| Tool | Description | Key Features |
|---|---|---|
| `bias.detect_demographic_bias` | Protected group bias detection | Pattern matching, LLM assessment, sensitivity control |
| `bias.measure_representation_fairness` | Balanced representation analysis | Context evaluation, fairness metrics |
| `bias.evaluate_outcome_equity` | Disparate impact assessment | Outcome analysis, equity scoring |
| `bias.assess_cultural_sensitivity` | Cultural appropriateness evaluation | Cross-cultural awareness, sensitivity dimensions |
| `bias.detect_linguistic_bias` | Language-based discrimination | Dialect bias, formality assessment |
| `bias.measure_intersectional_fairness` | Multi-dimensional bias analysis | Compound effects, intersectional metrics |

| Tool | Description | Key Features |
|---|---|---|
| `robustness.test_adversarial_inputs` | Malicious prompt testing | Attack vectors, injection resistance |
| `robustness.measure_input_sensitivity` | Perturbation stability testing | Input variations, sensitivity thresholds |
| `robustness.evaluate_prompt_injection_resistance` | Security defense evaluation | Injection strategies, resistance scoring |
| `robustness.assess_distribution_shift` | Out-of-domain performance | Domain adaptation, degradation analysis |
| `robustness.measure_consistency_under_perturbation` | Output stability measurement | Perturbation consistency, variance analysis |

| Tool | Description | Key Features |
|---|---|---|
| `safety.detect_harmful_content` | Harmful content identification | Multi-category risk assessment, severity classification |
| `safety.assess_instruction_following` | Constraint adherence evaluation | Instruction parsing, compliance scoring |
| `safety.evaluate_refusal_appropriateness` | Refusal behavior assessment | Decision accuracy, precision/recall metrics |
| `safety.measure_value_alignment` | Human values alignment | Ethical principles, weighted assessment |

| Tool | Description | Key Features |
|---|---|---|
| `multilingual.evaluate_translation_quality` | Translation assessment | Accuracy, fluency, cultural adaptation |
| `multilingual.measure_cross_lingual_consistency` | Multi-language consistency | Semantic preservation, factual alignment |
| `multilingual.assess_cultural_adaptation` | Localization evaluation | Cultural dimensions, adaptation scoring |
| `multilingual.detect_language_mixing` | Code-switching detection | Language purity, mixing appropriateness |

| Tool | Description | Key Features |
|---|---|---|
| `performance.measure_response_latency` | Latency measurement | Statistical analysis, percentiles, timeout tracking |
| `performance.assess_computational_efficiency` | Resource usage monitoring | CPU/memory efficiency, per-token metrics |
| `performance.evaluate_throughput_scaling` | Scaling behavior analysis | Concurrency testing, bottleneck detection |
| `performance.monitor_memory_usage` | Memory consumption tracking | Usage patterns, leak detection, threshold monitoring |

| Tool | Description | Key Features |
|---|---|---|
| `privacy.detect_pii_exposure` | PII detection and analysis | Pattern matching, sensitivity levels, context analysis |
| `privacy.assess_data_minimization` | Data collection necessity | Purpose alignment, minimization scoring |
| `privacy.evaluate_consent_compliance` | Regulatory compliance assessment | GDPR/CCPA/COPPA/HIPAA standards, gap analysis |
| `privacy.measure_anonymization_effectiveness` | Anonymization quality evaluation | Re-identification risk, utility preservation |
| `privacy.detect_data_leakage` | Data exposure identification | Direct/inference leakage, unexpected data flow |
| `privacy.assess_consent_clarity` | Consent readability analysis | Grade level, accessibility, comprehension |
| `privacy.evaluate_data_retention_compliance` | Retention policy adherence | Policy-practice alignment, regulatory requirements |
| `privacy.assess_privacy_by_design` | System privacy implementation | Design principles, control effectiveness |

| Tool | Description | Key Features |
|---|---|---|
| `workflow.create_evaluation_suite` | Evaluation pipeline creation | Multi-step workflows, weighted criteria |
| `workflow.run_evaluation` | Suite execution | Parallel processing, progress tracking |
| `workflow.compare_evaluations` | Results comparison | Statistical analysis, trend detection |

| Tool | Description | Key Features |
|---|---|---|
| `calibration.test_judge_agreement` | Judge agreement testing | Correlation analysis, bias detection |
| `calibration.optimize_rubrics` | Rubric optimization | ML-powered tuning, human alignment |

| Tool | Description | Key Features |
|---|---|---|
| `server.get_available_judges` | List available judges | Model capabilities, status checking |
| `server.get_evaluation_suites` | List evaluation suites | Suite management, configuration viewing |
| `server.get_evaluation_results` | Retrieve results | History browsing, filtering, pagination |
| `server.get_cache_stats` | Cache statistics | Performance monitoring, optimization |
- Model Comparison Studies: Systematic evaluation of different LLM architectures
- Prompt Engineering Research: Large-scale prompt effectiveness analysis
- Agent Behavior Studies: Comprehensive agent decision-making research
- Bias Detection Research: Systematic bias pattern analysis across models
- Evaluation Methodology: Meta-research on evaluation techniques themselves
- Quality Assurance: Automated content quality control in production systems
- A/B Testing: Systematic comparison of different AI configurations
- Performance Monitoring: Continuous evaluation of deployed AI systems
- Compliance Reporting: Automated generation of evaluation compliance reports
- Cost Optimization: Evaluation-driven optimization of AI system costs
- Student Assessment: Automated evaluation of student AI projects
- Curriculum Development: Assessment-driven AI curriculum optimization
- Research Training: Tools for training researchers in evaluation methodologies
- Benchmark Creation: Development of new evaluation benchmarks
- Peer Review: AI-assisted peer review systems for academic work
| Mode | Command | Protocol | Port | Auth | Use Case |
|---|---|---|---|---|---|
| MCP Server | `make dev` | stdio | none | none | Claude Desktop, MCP clients |
| REST API | `make serve-rest` | HTTP REST | 8080 | none | Direct HTTP API integration |
| REST Public | `make serve-rest-public` | HTTP REST | 8080 | none | Public REST API access |
| HTTP Bridge | `make serve-http` | JSON-RPC/HTTP | 9000 | none | MCP over HTTP, local testing |
| HTTP Public | `make serve-http-public` | JSON-RPC/HTTP | 9000 | none | MCP over HTTP, remote access |
| MCP Wrapper | `make serve-wrapper` | Streamable HTTP (SSE) | 9001 | none | FastMCP wrapper around REST API |
| MCP Wrapper Public | `make serve-wrapper-public` | Streamable HTTP (SSE) | 9001 | none | Public MCP wrapper access |
| Container | `make run` | HTTP | 8080 | none | Docker deployment |
| AWS App Runner | `make deploy-apprunner` | MCP Wrapper (SSE) | 8080 | none | Cloud deployment on AWS (Live) |
# 1. Run MCP server (for Claude Desktop, etc.)
make dev # Shows connection info + starts server
# 2. Test basic functionality
make example # Run evaluation example
make test-mcp # Test MCP protocol
# 1. Run native REST API server
make serve-rest # Starts on http://localhost:8080
# 2. Test REST API endpoints
make test-rest # Test all REST endpoints
# 3. View interactive documentation
open http://localhost:8080/docs # Swagger UI
open http://localhost:8080/redoc # ReDoc
# 4. Get connection info
make rest-info # Show complete REST API guide
# 1. Run MCP protocol over HTTP
make serve-http # Starts on http://localhost:9000
# 2. Test HTTP endpoints
make test-http # Test MCP JSON-RPC endpoints
# 3. Get connection info
make http-info # Show complete HTTP bridge guide
# 1. Start REST API server first
make serve-rest # Starts on http://localhost:8080
# 2. Start MCP wrapper (FastMCP around REST API)
make serve-wrapper # Starts on http://localhost:9001
# 3. Test wrapper functionality
python test_sse_client.py # Test SSE client
make test-wrapper # Test wrapper endpoints
# 4. Connection details
# Endpoint: http://localhost:9001/mcp/
# Protocol: Streamable HTTP (SSE)
# Headers: Accept: application/json, text/event-stream
# Session management: Automatic via mcp-session-id header
# Build and deploy
make build && make run
# 1. Set up AWS credentials and IAM role
aws configure
export INSTANCE_ROLE_ARN="arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):role/AppRunnerInstanceRole"
# 2. Deploy to AWS App Runner
make deploy-apprunner
# 3. Test locally first (optional)
make test-docker-apprunner
# Basic MCP integration
from mcp import Client
client = Client("mcp-eval-server")
# Evaluate any AI output
result = await client.call_tool("judge.evaluate_response", {
    "response": "Your AI output here",
    "criteria": [{"name": "quality", "description": "Overall quality", "scale": "1-5", "weight": 1.0}],
    "rubric": {"criteria": [], "scale_description": {"1": "Poor", "5": "Excellent"}}
})
# Start REST API server
make serve-rest
# Check server health
curl http://localhost:8080/health
# List tool categories
curl http://localhost:8080/tools/categories
# Evaluate response directly via REST
curl -X POST http://localhost:8080/judge/evaluate \
-H "Content-Type: application/json" \
-d '{
"response": "Paris is the capital of France.",
"criteria": [
{
"name": "accuracy",
"description": "Factual accuracy",
"scale": "1-5",
"weight": 1.0
}
],
"rubric": {
"criteria": [],
"scale_description": {"1": "Wrong", "5": "Correct"}
},
"judge_model": "rule-based"
}'
# Start HTTP bridge server
make serve-http
# List available tools (JSON-RPC)
curl -X POST \
-H "Content-Type: application/json" \
-d '{"jsonrpc": "2.0", "id": 1, "method": "tools/list", "params": {}}' \
http://localhost:9000/
# Evaluate response via HTTP bridge (JSON-RPC)
curl -X POST \
-H "Content-Type: application/json" \
-d '{
"jsonrpc": "2.0",
"id": 2,
"method": "tools/call",
"params": {
"name": "judge.evaluate_response",
"arguments": {
"response": "Paris is the capital of France.",
"criteria": [{"name": "accuracy", "description": "Factual accuracy", "scale": "1-5", "weight": 1.0}],
"rubric": {"criteria": [], "scale_description": {"1": "Wrong", "5": "Correct"}},
"judge_model": "rule-based"
}
}
}' \
http://localhost:9000/
# Start REST API server first
make serve-rest
# Start MCP wrapper
make serve-wrapper
# Initialize MCP session
curl -X POST -H "Content-Type: application/json" \
-H "Accept: application/json, text/event-stream" \
-d '{"jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {"protocolVersion": "2024-11-05", "capabilities": {}, "clientInfo": {"name": "test-client", "version": "1.0.0"}}}' \
http://localhost:9001/mcp
# Send initialized notification (note: returns 202 for notifications)
curl -X POST -H "Content-Type: application/json" \
-H "Accept: application/json, text/event-stream" \
-d '{"jsonrpc": "2.0", "method": "notifications/initialized"}' \
http://localhost:9001/mcp
# List available tools via MCP wrapper
curl -X POST -H "Content-Type: application/json" \
-H "Accept: application/json, text/event-stream" \
-d '{"jsonrpc": "2.0", "id": 2, "method": "tools/list", "params": {}}' \
http://localhost:9001/mcp
# Evaluate response via MCP wrapper
curl -X POST -H "Content-Type: application/json" \
-H "Accept: application/json, text/event-stream" \
-d '{
"jsonrpc": "2.0",
"id": 3,
"method": "tools/call",
"params": {
"name": "judge_evaluate",
"arguments": {
"response": "Paris is the capital of France.",
"criteria": [{"name": "accuracy", "description": "Factual accuracy", "scale": "1-5", "weight": 1.0}],
"rubric": {"criteria": [], "scale_description": {"1": "Wrong", "5": "Correct"}},
"judge_model": "rule-based"
}
}
}' \
http://localhost:9001/mcp
🌐 Live Deployment Testing:
Replace `http://localhost:9001/mcp` with `https://6xaate4xrt.us-east-1.awsapprunner.com/mcp` in the above commands to test the live deployment!
import httpx
import asyncio

async def evaluate_via_rest_api():
    """Example using native REST API endpoints."""
    async with httpx.AsyncClient() as client:
        base_url = "http://localhost:8080"

        # Check health
        health = await client.get(f"{base_url}/health")
        print(f"Server status: {health.json()['status']}")

        # List tool categories
        categories = await client.get(f"{base_url}/tools/categories")
        print(f"Available categories: {len(categories.json()['categories'])}")

        # Evaluate response using REST endpoint
        evaluation = await client.post(f"{base_url}/judge/evaluate", json={
            "response": "Your AI response here",
            "criteria": [
                {"name": "quality", "description": "Overall quality", "scale": "1-5", "weight": 1.0}
            ],
            "rubric": {
                "criteria": [],
                "scale_description": {"1": "Poor", "5": "Excellent"}
            },
            "judge_model": "rule-based"
        })
        result = evaluation.json()
        print(f"Evaluation score: {result['overall_score']}")

        # Check content toxicity
        toxicity = await client.post(f"{base_url}/quality/toxicity", json={
            "content": "This is a test message",
            "toxicity_categories": ["profanity", "hate_speech"],
            "sensitivity_level": "moderate",
            "judge_model": "rule-based"
        })
        result = toxicity.json()
        print(f"Toxicity detected: {result['toxicity_detected']}")

# Run evaluation
asyncio.run(evaluate_via_rest_api())
import httpx
import asyncio
async def evaluate_via_http_bridge():
    """Example using MCP over HTTP bridge."""
    async with httpx.AsyncClient() as client:
        base_url = "http://localhost:9000"

        # List tools via JSON-RPC
        tools_request = {
            "jsonrpc": "2.0",
            "id": 1,
            "method": "tools/list",
            "params": {}
        }
        response = await client.post(base_url, json=tools_request)
        result = response.json()
        tools = result.get("result", [])
        print(f"Available tools: {len(tools)}")

        # Evaluate response via JSON-RPC
        eval_request = {
            "jsonrpc": "2.0",
            "id": 2,
            "method": "tools/call",
            "params": {
                "name": "judge.evaluate_response",
                "arguments": {
                    "response": "Your AI response here",
                    "criteria": [{"name": "quality", "description": "Overall quality", "scale": "1-5", "weight": 1.0}],
                    "rubric": {"criteria": [], "scale_description": {"1": "Poor", "5": "Excellent"}},
                    "judge_model": "rule-based"
                }
            }
        }
        response = await client.post(base_url, json=eval_request)
        result = response.json()
        print(f"Evaluation result: {result}")

# Run evaluation
asyncio.run(evaluate_via_http_bridge())
import httpx
import asyncio
import json
import re
class MCPWrapperClient:
    """Client for MCP wrapper with SSE support."""

    def __init__(self, base_url: str):
        self.base_url = base_url
        self.session_id = None
        self.client = httpx.AsyncClient()

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        await self.client.aclose()

    async def send_request(self, request: dict) -> dict:
        """Send a JSON-RPC request and parse the SSE response."""
        headers = {"Accept": "application/json, text/event-stream"}
        if self.session_id:
            headers["mcp-session-id"] = self.session_id
        response = await self.client.post(
            self.base_url,
            json=request,
            headers=headers,
            timeout=30.0
        )
        if response.status_code not in [200, 202]:
            raise Exception(f"HTTP {response.status_code}: {response.text}")

        # Extract session ID from response headers
        if "mcp-session-id" in response.headers:
            self.session_id = response.headers["mcp-session-id"]

        # For notifications (202), return empty result
        if response.status_code == 202:
            return {}

        # Parse SSE response
        content = response.text
        data_match = re.search(r'data:\s*(\{.*\})', content)
        if data_match:
            json_str = data_match.group(1)
            return json.loads(json_str)
        else:
            raise Exception(f"Could not parse SSE response: {content}")


async def evaluate_via_mcp_wrapper():
    """Example using MCP wrapper with SSE."""
    async with MCPWrapperClient("http://localhost:9001/mcp/") as client:
        # Initialize MCP session
        init_request = {
            "jsonrpc": "2.0",
            "id": 1,
            "method": "initialize",
            "params": {
                "protocolVersion": "2024-11-05",
                "capabilities": {},
                "clientInfo": {"name": "test-client", "version": "1.0.0"}
            }
        }
        result = await client.send_request(init_request)
        print(f"Initialized: {result.get('result', {}).get('serverInfo', {}).get('name', 'Unknown')}")

        # Send initialized notification
        await client.send_request({"jsonrpc": "2.0", "method": "notifications/initialized"})

        # List tools
        tools_request = {"jsonrpc": "2.0", "id": 2, "method": "tools/list", "params": {}}
        result = await client.send_request(tools_request)
        tools = result.get('result', {}).get('tools', [])
        print(f"Available tools: {len(tools)}")

        # Evaluate response
        eval_request = {
            "jsonrpc": "2.0",
            "id": 3,
            "method": "tools/call",
            "params": {
                "name": "judge_evaluate",
                "arguments": {
                    "response": "Paris is the capital of France.",
                    "criteria": [{"name": "accuracy", "description": "Factual accuracy", "scale": "1-5", "weight": 1.0}],
                    "rubric": {"criteria": [], "scale_description": {"1": "Wrong", "5": "Correct"}},
                    "judge_model": "rule-based"
                }
            }
        }
        result = await client.send_request(eval_request)
        if 'result' in result:
            tool_result = result['result']
            print(f"Evaluation successful: {tool_result}")
        else:
            print(f"Evaluation failed: {result}")

# Run evaluation
asyncio.run(evaluate_via_mcp_wrapper())