feat(rankings): multimodal support for Cohere ranking endpoint#896

Open
jzakrzew wants to merge 4 commits into
ai-dynamo:mainfrom
jzakrzew:cohere-rerank-endpoint-multimodal

Conversation

@jzakrzew
Contributor

@jzakrzew jzakrzew commented May 7, 2026

Add multimodal input support to the cohere_rankings endpoint for vLLM’s vision rerank API, including structured text, image, and video document payload formatting. See:
https://docs.vllm.ai/en/latest/examples/pooling/score/#vision-rerank-api-online

We already have text-only Cohere rerank support. This keeps the existing text-only payload shape unchanged, while switching to structured Cohere documents when media is present.

Changes

  • Extend the shared rankings base endpoint to pass media inputs into endpoint-specific payload builders.
  • Add multimodal payload support to cohere_rankings for text, image, and video rerank documents.
  • Add synthetic multimodal ranking dataset generation.
  • Add validation, mock server support, tests, and documentation for multimodal rankings inputs.
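
To illustrate the payload shape change described above, here is a minimal sketch, not the PR's actual code. The helper name `build_documents` is hypothetical, and the `content` / `image_url` / `video_url` field layout is an assumption modeled on the vLLM vision rerank documentation linked above; the PR diff is the authority on the real structure. It assumes media counts have already been validated against the passage count.

```python
from collections.abc import Sequence
from typing import Any


def build_documents(
    passages: Sequence[str],
    images: Sequence[str] = (),
    videos: Sequence[str] = (),
) -> list[Any]:
    """Hypothetical helper: plain strings when text-only, structured otherwise."""
    if not images and not videos:
        # Text-only requests keep the existing payload shape.
        return list(passages)

    documents = []
    for i, text in enumerate(passages):
        content: list[dict[str, Any]] = [{"type": "text", "text": text}]
        # Media is paired with passages by index (field names are assumed).
        if images:
            content.append({"type": "image_url", "image_url": {"url": images[i]}})
        if videos:
            content.append({"type": "video_url", "video_url": {"url": videos[i]}})
        documents.append({"content": content})
    return documents
```

With no media the function returns the passages unchanged, which mirrors the stated goal of leaving text-only requests untouched.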

Example usage

Set up a vLLM server:

vllm serve nvidia/llama-nemotron-rerank-vl-1b-v2 \
  --runner pooling \
  --trust-remote-code \
  --chat-template "$(curl -fsSL https://raw.githubusercontent.com/vllm-project/vllm/main/examples/pooling/score/template/nemotron-vl-rerank.jinja)"

Run AIPerf with synthetic multimodal rankings inputs:

aiperf profile \
      -m nvidia/llama-nemotron-rerank-vl-1b-v2 \
      --endpoint-type cohere_rankings \
      --custom-endpoint /rerank \
      --url localhost:8000 \
      --request-count 10 \
      --rankings-passages-mean 4 \
      --rankings-passages-stddev 0 \
      --rankings-passages-prompt-token-mean 32 \
      --rankings-passages-prompt-token-stddev 0 \
      --rankings-query-prompt-token-mean 16 \
      --rankings-query-prompt-token-stddev 0 \
      --image-width-mean 224 \
      --image-width-stddev 0 \
      --image-height-mean 224 \
      --image-height-stddev 0 \
      --image-batch-size 1

Summary by CodeRabbit

  • New Features

    • Multimodal reranking: Cohere Rankings now supports ranking with text, images, and videos (index-aligned across modalities) and synthetic multimodal dataset generation.
  • Documentation

    • Added vLLM multimodal reranking guidance, CLI examples, and instructions for JSONL inputs with base64 image data.
  • Tests

    • Expanded unit and integration tests to cover multimodal payloads, dataset composition, and endpoint metadata.
  • Bug Fixes / Validation

    • Added media count validations and explicit rejection of unsupported audio inputs.

jzakrzew and others added 3 commits May 7, 2026 14:43
Co-authored-by: OpenAI Codex <codex@openai.com>
Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com>
Co-authored-by: OpenAI Codex <codex@openai.com>
Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com>
Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented May 7, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the feat label May 7, 2026
@github-actions

github-actions Bot commented May 7, 2026

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@17968c9298a63a47b928b20975eee427c2da4702

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@17968c9298a63a47b928b20975eee427c2da4702

Last updated for commit: 17968c9

Comment thread src/aiperf/endpoints/base_rankings_endpoint.py
@codecov

codecov Bot commented May 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@coderabbitai

coderabbitai Bot commented May 7, 2026

Walkthrough

This PR extends AIPerf's Cohere Rankings endpoint to support multimodal requests containing image and video content alongside text. The implementation includes enhanced dataset composition, index-aligned payload construction, metadata-driven validation, and comprehensive test coverage across integration and unit scopes.
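
The index-aligned construction mentioned above implies that every modality supplied must have one entry per document. A minimal sketch of such a count check follows; the function name `validate_document_counts` and its error message are hypothetical and may differ from the PR's `_validate_document_counts` helper.

```python
from collections.abc import Sequence


def validate_document_counts(
    passages: Sequence[str],
    images: Sequence[str] = (),
    videos: Sequence[str] = (),
) -> int:
    """Hypothetical check: every non-empty modality must have the same length."""
    counts = {
        name: len(items)
        for name, items in (
            ("passages", passages),
            ("images", images),
            ("videos", videos),
        )
        if items
    }
    if len(set(counts.values())) > 1:
        raise ValueError(f"Mismatched rankings input counts: {counts}")
    # Number of documents the request will carry (0 if nothing was supplied).
    return next(iter(counts.values()), 0)
```

Failing at this point keeps a count mismatch from surfacing later as an index error during payload construction.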

Changes

Multimodal Cohere Rankings Feature

| Layer / File(s) | Summary |
| --- | --- |
| Data Contracts & Signatures: src/aiperf/endpoints/base_rankings_endpoint.py, src/aiperf/endpoints/cohere_rankings.py, src/aiperf/endpoints/hf_tei_rankings.py, src/aiperf/endpoints/nim_rankings.py, src/aiperf/plugin/plugins.yaml, tests/aiperf_mock_server/models.py | BaseRankingsEndpoint.build_payload abstract signature expanded to accept optional images, videos, audios keyword parameters; CohereRankingsEndpoint introduces multimodal parameters and document helpers; HF TEI and NIM accept multimodal params for interface consistency; plugin metadata declares supports_images and supports_videos; mock server broadens CohereRerankRequest.documents to accept multimodal structures. |
| Cohere Multimodal Logic: src/aiperf/endpoints/cohere_rankings.py | build_payload rewritten to accept multimodal inputs and reject audio; new _build_documents, _validate_document_counts, and _document_count helpers construct index-paired document objects with content arrays combining text and optional image/video URL references. |
| Base Endpoint Extraction & Validation: src/aiperf/endpoints/base_rankings_endpoint.py | format_payload refactored with new extraction helpers (_extract_rankings_texts, _extract_media_contents, _select_query_text, _warn_if_no_documents, _validate_media_support) to handle media-aware turn parsing, metadata-driven validation, and error handling. |
| Text-Only Endpoint Stubs: src/aiperf/endpoints/hf_tei_rankings.py, src/aiperf/endpoints/nim_rankings.py | Payload construction logic remains unchanged while accepting multimodal parameters for interface consistency. |
| Synthetic Dataset Composition: src/aiperf/dataset/composer/synthetic_rankings.py | _create_turn extended to conditionally generate and append Image and Video payloads via new helpers; include_image and include_video properties determine generation based on batch size and dimension configuration. |
| Request/Response Handling: tests/aiperf_mock_server/models.py | CohereRerankRequest.passage_texts property now parses multimodal documents field, extracting text and media URL representations via new _document_to_text and _media_url_text helpers. |
| Integration & Unit Tests: tests/integration/utils.py, tests/component_integration/endpoints/test_rankings_endpoint.py, tests/unit/dataset/composer/test_synthetic_rankings_composer.py, tests/unit/endpoints/test_cohere_rankings_endpoint.py, tests/unit/endpoints/test_hf_tei_rankings_endpoint.py, tests/unit/endpoints/test_nim_rankings_endpoint.py, tests/unit/server/test_models.py | New create_multimodal_rankings_dataset utility; integration tests run multimodal profiling against Cohere endpoint; unit tests validate synthetic media generation per passage, payload formatting with mixed modalities, count mismatches, unsupported media rejection, and mock server parsing. |
| Documentation: docs/tutorials/rankings.md | "Profile vLLM Vision Rerank Models" section added covering multimodal request payload structure, index alignment requirements, synthetic input generation, and AIPerf CLI usage examples with custom multimodal-rankings.jsonl datasets. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I hopped through mocks and payload streams,
Images and videos joining ranking dreams.
Per-index pairing kept each item true,
Text and vision stitched into the view.
A little rabbit cheers the multimodal crew!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding multimodal support to the Cohere ranking endpoint, which is the core objective of this PR. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 87.50% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/aiperf/endpoints/nim_rankings.py (1)

23-29: ⚡ Quick win

Prefer explicit rejection of unsupported media instead of silently discarding it.

At Line 28, media args are ignored. Failing fast here prevents accidental data loss when build_payload is called directly and keeps behavior explicit.

Proposed fix
     def build_payload(
         self,
         query_text: str,
         passages: Sequence[str],
         model_name: str,
         *,
         images: Sequence[str] = (),
         videos: Sequence[str] = (),
         audios: Sequence[str] = (),
     ) -> dict[str, Any]:
         """Build payload to match NIM rankings API schema."""
-        _ = images, videos, audios
+        if images or videos or audios:
+            raise ValueError(
+                "NIM rankings does not support image, video, or audio input."
+            )
         payload = {
             "model": model_name,
             "query": {"text": query_text},
             "passages": [{"text": p} for p in passages],
         }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/aiperf/endpoints/nim_rankings.py` around lines 23 - 29, The build_payload
function currently ignores the media arguments (images, videos, audios) by
assigning them to _ and silently discarding any input; change this to fail-fast
by validating those parameters at the start of build_payload and raising a clear
exception (e.g., ValueError) if any of images, videos, or audios is non-empty,
mentioning which unsupported media was passed so callers know why the call
failed.
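
The prompt above asks for an error that names which unsupported media was passed, which the proposed fix does not do. A standalone sketch of that variant follows; `build_nim_payload` is a hypothetical free function standing in for the endpoint method, and the error wording is illustrative rather than taken from the PR.

```python
from collections.abc import Sequence
from typing import Any


def build_nim_payload(
    query_text: str,
    passages: Sequence[str],
    model_name: str,
    *,
    images: Sequence[str] = (),
    videos: Sequence[str] = (),
    audios: Sequence[str] = (),
) -> dict[str, Any]:
    """Standalone sketch of a fail-fast NIM rankings payload builder."""
    # Collect the names of any media kinds that were actually supplied.
    unsupported = [
        name
        for name, items in (("images", images), ("videos", videos), ("audios", audios))
        if items
    ]
    if unsupported:
        raise ValueError(
            f"NIM rankings does not support media input: {', '.join(unsupported)}"
        )
    return {
        "model": model_name,
        "query": {"text": query_text},
        "passages": [{"text": p} for p in passages],
    }
```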
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a61ee8c9-460f-40fb-866e-9d7012727fdf

📥 Commits

Reviewing files that changed from the base of the PR and between 1393442 and 3a66f4e.

📒 Files selected for processing (15)
  • docs/tutorials/rankings.md
  • src/aiperf/dataset/composer/synthetic_rankings.py
  • src/aiperf/endpoints/base_rankings_endpoint.py
  • src/aiperf/endpoints/cohere_rankings.py
  • src/aiperf/endpoints/hf_tei_rankings.py
  • src/aiperf/endpoints/nim_rankings.py
  • src/aiperf/plugin/plugins.yaml
  • tests/aiperf_mock_server/models.py
  • tests/component_integration/endpoints/test_rankings_endpoint.py
  • tests/integration/utils.py
  • tests/unit/dataset/composer/test_synthetic_rankings_composer.py
  • tests/unit/endpoints/test_cohere_rankings_endpoint.py
  • tests/unit/endpoints/test_hf_tei_rankings_endpoint.py
  • tests/unit/endpoints/test_nim_rankings_endpoint.py
  • tests/unit/server/test_models.py

Comment on lines +110 to +117
def _generate_video_payloads(self, count: int) -> Video:
    """Generate one synthetic video per ranking passage."""
    video = Video(name="video_url")
    for _ in range(count):
        data = self.video_generator.generate()
        if data:
            video.contents.append(data)
    return video

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Avoid silently dropping generated videos; fail fast on generation miss.

At Line 115, falsy data is skipped, which can produce fewer videos than passages and trigger downstream count-mismatch errors far from the source. Raise immediately (or append a placeholder) to keep per-passage alignment deterministic.

Proposed fix
 def _generate_video_payloads(self, count: int) -> Video:
     """Generate one synthetic video per ranking passage."""
     video = Video(name="video_url")
     for _ in range(count):
         data = self.video_generator.generate()
-        if data:
-            video.contents.append(data)
+        if not data:
+            raise ValueError(
+                "Video generation returned empty content while multimodal rankings are enabled."
+            )
+        video.contents.append(data)
     return video
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/aiperf/dataset/composer/synthetic_rankings.py` around lines 110 - 117,
The _generate_video_payloads function currently skips falsy results from
video_generator.generate(), which can silently reduce the number of videos and
break passage alignment; change it to fail fast by checking the result of
video_generator.generate() and raising a clear exception (e.g., RuntimeError or
ValueError) that includes context (count requested and index) when data is
falsy, or alternatively append a deterministic placeholder object to
Video.contents to preserve one-to-one correspondence; update references in
_generate_video_payloads, Video (contents), and video_generator.generate
accordingly.
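
The prompt above also asks for the requested count and the failing index in the error. A minimal standalone sketch of that behavior, assuming a hypothetical `generate` callable in place of `self.video_generator.generate` and a plain list in place of `Video.contents`:

```python
from collections.abc import Callable


def generate_video_contents(generate: Callable[[], str], count: int) -> list[str]:
    """Collect exactly `count` videos, raising on the first empty result."""
    contents: list[str] = []
    for index in range(count):
        data = generate()
        if not data:
            # Report where generation failed so the mismatch is traceable.
            raise ValueError(
                f"Video generation returned empty content for passage {index} "
                f"of {count} requested."
            )
        contents.append(data)
    return contents
```

Raising here keeps the list length equal to `count` on every successful return, so per-passage alignment cannot drift silently.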

Co-authored-by: OpenAI Codex <codex@openai.com>
Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (2)
tests/unit/dataset/composer/test_synthetic_rankings_composer.py (2)

128-130: ⚡ Quick win

Avoid hard-coded passage multiplier in expected call count

expected_media_count is tied to a magic number (* 3). This makes the test brittle if passage generation defaults/config change. Derive the expected count from generated turns (sum of passage counts) or from the configured mean variable in the test setup.

Proposed test hardening
-    expected_media_count = synthetic_config.input.conversation.num_dataset_entries * 3
-    assert generate_image.call_count == expected_media_count
-    assert generate_video.call_count == expected_media_count
+    expected_media_count = sum(
+        len(conversation.turns[0].texts[1].contents) for conversation in dataset
+    )
+    assert generate_image.call_count == expected_media_count
+    assert generate_video.call_count == expected_media_count
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/dataset/composer/test_synthetic_rankings_composer.py` around lines
128 - 130, Replace the hard-coded multiplier used to compute
expected_media_count with a calculation derived from the actual passages
produced by the composer or from the test's configured mean; specifically,
compute expected_media_count by summing the number of passages across the
generated turns (or reading the configured passages-per-turn mean used in the
test setup) rather than using "* 3", then assert generate_image.call_count and
generate_video.call_count against that computed value (references:
expected_media_count, generate_image, generate_video,
synthetic_config.input.conversation.num_dataset_entries).
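
A self-contained sketch of the idea, using toy stand-ins rather than AIPerf's classes: `ToyTurn` and `compose` are hypothetical, and only `unittest.mock.MagicMock` is a real API. The expected call count is computed from the passages actually produced.

```python
from dataclasses import dataclass, field
from unittest.mock import MagicMock


@dataclass
class ToyTurn:
    """Stand-in for a rankings turn: passages plus one image per passage."""

    passages: list[str]
    images: list[str] = field(default_factory=list)


def compose(passage_counts: list[int], generate_image) -> list[ToyTurn]:
    """Toy composer: calls the generator once per passage."""
    turns = []
    for n in passage_counts:
        passages = [f"passage {i}" for i in range(n)]
        turns.append(ToyTurn(passages, [generate_image() for _ in passages]))
    return turns


generate_image = MagicMock(return_value="data:image/png;base64,AAAA")
dataset = compose([2, 5, 3], generate_image)

# Derive the expectation from what was generated, not from a fixed multiplier.
expected_media_count = sum(len(turn.passages) for turn in dataset)
assert generate_image.call_count == expected_media_count
```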

162-181: ⚡ Quick win

Assert generators are not called when media batch size is zero

This test validates output shape ([]) but not the “disabled generation” behavior. Patch both generators and assert zero calls to prevent regressions where media is still generated then discarded.

Proposed coverage extension
-    composer = SyntheticRankingsDatasetComposer(synthetic_config, mock_tokenizer)
-    dataset = composer.create_dataset()
+    composer = SyntheticRankingsDatasetComposer(synthetic_config, mock_tokenizer)
+    with (
+        patch.object(composer.image_generator, "generate") as generate_image,
+        patch.object(composer.video_generator, "generate") as generate_video,
+    ):
+        dataset = composer.create_dataset()
 
     for conversation in dataset:
         turn = conversation.turns[0]
         assert turn.images == []
         assert turn.videos == []
+    generate_image.assert_not_called()
+    generate_video.assert_not_called()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/dataset/composer/test_synthetic_rankings_composer.py` around lines
162 - 181, Extend the test to patch/mock the media generator functions used by
SyntheticRankingsDatasetComposer (the image and video generator callables used
internally when producing turns) before calling
SyntheticRankingsDatasetComposer.create_dataset, then assert those mocks were
not called when synthetic_config.input.image.batch_size and ...video.batch_size
are 0; this ensures generation is disabled (reference
SyntheticRankingsDatasetComposer and its create_dataset path that invokes the
image/video generators) and prevents regressions where media is produced then
discarded.
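
The pattern can be sketched in isolation as follows. `ToyGenerator` and `ToyComposer` are hypothetical stand-ins for the real generator and composer; `patch.object` and `assert_not_called` are the standard `unittest.mock` APIs used in the proposed change.

```python
from unittest.mock import patch


class ToyGenerator:
    """Stand-in media generator returning a fixed data URL."""

    def generate(self) -> str:
        return "data:image/png;base64,AAAA"


class ToyComposer:
    """Stand-in composer that only generates media when batch_size > 0."""

    def __init__(self, batch_size: int) -> None:
        self.batch_size = batch_size
        self.image_generator = ToyGenerator()

    def create_turn(self, passages: list[str]) -> dict[str, list[str]]:
        images: list[str] = []
        if self.batch_size > 0:
            images = [self.image_generator.generate() for _ in passages]
        return {"passages": passages, "images": images}


composer = ToyComposer(batch_size=0)
with patch.object(composer.image_generator, "generate") as generate_image:
    turn = composer.create_turn(["a", "b"])

# Checking the output shape alone would miss "generate then discard" bugs.
assert turn["images"] == []
generate_image.assert_not_called()
```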

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1bb54f89-c4c2-40b9-825f-97343ed6d4b2

📥 Commits

Reviewing files that changed from the base of the PR and between 3a66f4e and 17968c9.

📒 Files selected for processing (2)
  • tests/unit/dataset/composer/test_synthetic_rankings_composer.py
  • tests/unit/endpoints/test_cohere_rankings_endpoint.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/unit/endpoints/test_cohere_rankings_endpoint.py
