
# MMTEB MCP

MCP endpoint: https://mmteb-api.jina.ai/mcp

REST API + MCP server for cached MTEB benchmark results (526 models, 1319 tasks, 68 benchmarks). Built with FastAPI + Cloudflare Workers.

Data source: the `cached-data` branch of embeddings-benchmark/results, updated daily.

## Usage (MCP)

The recommended way to use this API is via the Model Context Protocol (MCP) server. Add this to your MCP client config:

```json
{
  "mcpServers": {
    "mmteb": {
      "url": "https://mmteb-api.jina.ai/mcp"
    }
  }
}
```

No API key required. The MCP server exposes 10 tools:

| Tool | Description |
| --- | --- |
| `list_benchmarks` | List all available embedding benchmarks |
| `get_benchmark_rankings` | Get model rankings for a benchmark |
| `get_model_weaknesses` | Find tasks where a model performs worst |
| `get_benchmark_gap_to_top` | Show the gap between a model and the top performers |
| `list_models` | List/search all models with benchmark results |
| `get_model_tasks` | Get all task results for a model |
| `get_model_rank` | Get a model's rank in a specific benchmark |
| `compare_models` | Compare models head-to-head on a benchmark |
| `list_tasks` | List all evaluation tasks |
| `get_task_rankings` | Get model rankings for a specific task |

Server info: GET https://mmteb-api.jina.ai/mcp

## Features

- **Official-aligned rankings:** uses per-task-type mean scoring (the mean of each task type's average, not a simple mean over all tasks) to match the official MTEB leaderboard. Models must cover ≥80% of a benchmark's tasks to be ranked.
- **Auto task-type inference:** tasks with missing type metadata are classified from name patterns (e.g. `*Classification` → Classification, `*HardNegatives` → Retrieval).
- **Autoresearch signals:** weaknesses, gap-to-top, neighborhood, and better-than endpoints for competitive analysis and model optimization.
- **Hot reload:** `POST /refresh` re-downloads the data and swaps it in atomically, with no downtime.
- **Fast cold start:** a pre-built pickle baked into the Docker image gives ~5 s startup on Cloud Run.
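The scoring rule above can be illustrated with a short sketch. Function names and data shapes here are illustrative, not the service's actual code; only the rule itself (mean of per-type averages, 80% coverage cutoff) comes from this README:

```python
from collections import defaultdict

def benchmark_score(task_scores, task_types, coverage_threshold=0.8):
    """Mean-of-type-averages score, or None if the model covers
    fewer than 80% of the benchmark's tasks."""
    covered = [t for t in task_types if t in task_scores]
    if len(covered) < coverage_threshold * len(task_types):
        return None  # not enough coverage to be ranked
    # Group scores by task type and average within each type...
    by_type = defaultdict(list)
    for task in covered:
        by_type[task_types[task]].append(task_scores[task])
    type_means = [sum(v) / len(v) for v in by_type.values()]
    # ...then average the per-type means (not a plain mean over tasks).
    return sum(type_means) / len(type_means)

# Two Retrieval tasks and one Classification task: a plain task mean
# would weight Retrieval 2x; the type mean weights each type equally.
scores = {"T1": 60.0, "T2": 40.0, "T3": 90.0}
types = {"T1": "Retrieval", "T2": "Retrieval", "T3": "Classification"}
print(benchmark_score(scores, types))  # 70.0: mean of (50.0, 90.0)
```

On this toy input a plain task mean would give 63.3, so the two rules genuinely rank models differently.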

## Setup

```shell
uv sync
uv run uvicorn src.main:app --reload
```

Data loads in the background on startup (~5 s from the pre-built pickle, ~20 s from JSON). `/health` returns `{"status":"loading"}` until ready.
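Since `/health` reports `loading` until the background load completes, a client can poll before issuing queries. A minimal sketch, assuming only the endpoint shape described above; `fetch` is injectable so the loop can be exercised without a running server:

```python
import json
import time
import urllib.request

def wait_until_ready(base_url, fetch=None, timeout=60, interval=1.0):
    """Poll {base_url}/health until "status" is no longer "loading".

    Returns the final status string ("ok", or "error" if loading failed).
    `fetch` maps a URL to a parsed JSON dict; the default uses urllib.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url) as resp:
                return json.load(resp)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch(base_url + "/health").get("status")
        if status != "loading":
            return status
        time.sleep(interval)
    raise TimeoutError(f"server still loading after {timeout}s")
```

For a local run this would be called as `wait_until_ready("http://localhost:8000")` (8000 being uvicorn's default port).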

## API Reference

Base URL: https://mmteb-api.jina.ai

Model names in URLs use `__` instead of `/` (e.g. `jinaai__jina-embeddings-v5-text-small`). All scores are percentages (0-100).
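A small helper for building request URLs under these conventions. The function names are illustrative; `quote` with a relaxed `safe` set reproduces the `MTEB(eng,%20v2)` form used in the curl examples:

```python
from urllib.parse import quote

BASE = "https://mmteb-api.jina.ai"

def model_path(model_id):
    # API paths use "__" in place of "/" in Hugging Face-style model ids
    return model_id.replace("/", "__")

def rankings_url(benchmark, top_n=None):
    # Benchmark names contain spaces, so percent-encode them; keep
    # parentheses and commas literal to match the README's URL style
    url = f"{BASE}/benchmarks/{quote(benchmark, safe='(),')}/rankings"
    return url + (f"?top_n={top_n}" if top_n is not None else "")

print(model_path("jinaai/jina-embeddings-v5-text-small"))
# jinaai__jina-embeddings-v5-text-small
print(rankings_url("MTEB(eng, v2)", top_n=10))
# https://mmteb-api.jina.ai/benchmarks/MTEB(eng,%20v2)/rankings?top_n=10
```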

### Health & Stats

| Method | Path | Description |
| --- | --- | --- |
| GET | `/health` | Returns `ok`/`loading`/`error` plus model/task counts |
| GET | `/stats` | Total models, tasks, benchmarks |

### Models

| Method | Path | Description |
| --- | --- | --- |
| GET | `/models` | List models. Filters: `name` (substring), `modality`, `min_tasks` |
| GET | `/models/{model}/tasks` | All task scores for a model, sorted by score descending |
| GET | `/models/{model}/tasks/{task}` | Detailed scores per subset/split for a task |
| GET | `/models/{model}/rank` | Model's rank + percentile on every benchmark it has results for |

### Benchmarks

| Method | Path | Description |
| --- | --- | --- |
| GET | `/benchmarks` | List all benchmarks |
| GET | `/benchmarks/{bench}/rankings` | Leaderboard. Optional `top_n` |
| GET | `/benchmarks/{bench}/models/{model}` | Model's per-task scores on a benchmark |
| GET | `/benchmarks/{bench}/by_type/{model}` | Scores grouped by task type (average, count, task list) |
| GET | `/benchmarks/{bench}/weaknesses/{model}` | Model's weakest tasks by percentile rank |
| GET | `/benchmarks/{bench}/gap_to_top/{model}` | Gap to the top-N models, broken down by task type |

### Tasks

| Method | Path | Description |
| --- | --- | --- |
| GET | `/tasks` | List tasks. Filters: `type`, `language` |
| GET | `/tasks/{task}/rankings` | Model leaderboard for a task. Optional `top_n` |
| GET | `/tasks/{task}/info` | Description, type, domains, score distribution (min/max/mean/p25/p75) |
| GET | `/tasks/{task}/better_than/{model}` | Models scoring higher, with the gap. Optional `top_n` |
| GET | `/tasks/{task}/neighborhood/{model}` | Models ranked around yours. Optional `radius` (default 5) |

### Compare & Admin

| Method | Path | Description |
| --- | --- | --- |
| GET | `/compare?models=m1,m2&benchmark=X` | Side-by-side comparison of multiple models |
| POST | `/refresh` | Re-download data and swap it in atomically (safe while serving) |
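The "safe while serving" property of `/refresh` follows the snapshot-swap pattern: build the new dataset completely off to the side, then replace a single reference, so in-flight requests keep reading the old snapshot. A minimal sketch of the pattern, with illustrative class and method names (not the project's actual code):

```python
import threading

class DataStore:
    """Holds the current data snapshot; reads never block on a refresh."""

    def __init__(self, data):
        self._data = data
        self._lock = threading.Lock()  # serializes concurrent refreshes only

    def get(self):
        # Readers grab whichever snapshot is current; no lock needed,
        # since the swap below is a single reference assignment.
        return self._data

    def refresh(self, load_fn):
        new_data = load_fn()       # slow part: download + rebuild, off to the side
        with self._lock:
            self._data = new_data  # the "atomic swap"

store = DataStore({"models": 526})
store.refresh(lambda: {"models": 527})
print(store.get())  # {'models': 527}
```

Requests that started before the swap finish against the old snapshot; requests after it see the new one, which is why a refresh needs no downtime.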

## Usage Examples

### Quick model overview

```shell
# Where does my model rank across all benchmarks?
curl "https://mmteb-api.jina.ai/models/jinaai__jina-embeddings-v5-text-small/rank"
```

### Autoresearch KPI workflow

```shell
# 1. Find weakest tasks on MTEB(eng, v2)
curl "https://mmteb-api.jina.ai/benchmarks/MTEB(eng,%20v2)/weaknesses/jinaai__jina-embeddings-v5-text-small?top_n=5"

# 2. See gap to top-3 models, broken down by task type
curl "https://mmteb-api.jina.ai/benchmarks/MTEB(eng,%20v2)/gap_to_top/jinaai__jina-embeddings-v5-text-small?top_n=3"

# 3. Scores grouped by task type
curl "https://mmteb-api.jina.ai/benchmarks/MTEB(eng,%20v2)/by_type/jinaai__jina-embeddings-v5-text-small"

# 4. Who beats you on a specific task, and by how much?
curl "https://mmteb-api.jina.ai/tasks/FiQA2018/better_than/jinaai__jina-embeddings-v5-text-small?top_n=5"

# 5. See your competitive neighborhood on a task
curl "https://mmteb-api.jina.ai/tasks/FiQA2018/neighborhood/jinaai__jina-embeddings-v5-text-small?radius=3"

# 6. Understand what a task measures
curl "https://mmteb-api.jina.ai/tasks/FiQA2018/info"

# 7. Compare two models head-to-head
curl "https://mmteb-api.jina.ai/compare?models=jinaai__jina-embeddings-v5-text-small,jinaai__jina-embeddings-v5-text-nano&benchmark=MTEB(eng,%20v2)"

# 8. After publishing new results, refresh data
curl -X POST "https://mmteb-api.jina.ai/refresh"
```

## Deployment

Deploys to GCP Cloud Run (`jinaai-public` project, `us-central1`) via GitHub Actions on push to `main`.

Custom domain: mmteb-api.jina.ai (Cloudflare DNS → GCP domain mapping)

- Memory: 4Gi
- CPU: 2 (with startup CPU boost, no throttling)
- Instances: 0-1 (scales to zero)
- Cold start: ~5 s (pre-built pickle baked into the Docker image)

## Tests

```shell
uv sync --all-groups
uv run pytest tests/ -v
```
