feat(rankings): multimodal support for Cohere ranking endpoint#896
feat(rankings): multimodal support for Cohere ranking endpoint#896jzakrzew wants to merge 4 commits into
Conversation
Co-authored-by: OpenAI Codex <codex@openai.com> Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com>
Co-authored-by: OpenAI Codex <codex@openai.com> Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com>
Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com>
Try out this PRQuick install: pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@17968c9298a63a47b928b20975eee427c2da4702Recommended with virtual environment (using uv): uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@17968c9298a63a47b928b20975eee427c2da4702Last updated for commit: |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
WalkthroughThis PR extends AIPerf's Cohere Rankings endpoint to support multimodal requests containing image and video content alongside text. The implementation includes enhanced dataset composition, index-aligned payload construction, metadata-driven validation, and comprehensive test coverage across integration and unit scopes. ChangesMultimodal Cohere Rankings Feature
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/aiperf/endpoints/nim_rankings.py (1)
23-29: ⚡ Quick winPrefer explicit rejection of unsupported media instead of silently discarding it.
At Line 28, media args are ignored. Failing fast here prevents accidental data loss when
build_payloadis called directly and keeps behavior explicit.Proposed fix
def build_payload( self, query_text: str, passages: Sequence[str], model_name: str, *, images: Sequence[str] = (), videos: Sequence[str] = (), audios: Sequence[str] = (), ) -> dict[str, Any]: """Build payload to match NIM rankings API schema.""" - _ = images, videos, audios + if images or videos or audios: + raise ValueError( + "NIM rankings does not support image, video, or audio input." + ) payload = { "model": model_name, "query": {"text": query_text}, "passages": [{"text": p} for p in passages], }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/aiperf/endpoints/nim_rankings.py` around lines 23 - 29, The build_payload function currently ignores the media arguments (images, videos, audios) by assigning them to _ and silently discarding any input; change this to fail-fast by validating those parameters at the start of build_payload and raising a clear exception (e.g., ValueError) if any of images, videos, or audios is non-empty, mentioning which unsupported media was passed so callers know why the call failed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/aiperf/dataset/composer/synthetic_rankings.py`:
- Around line 110-117: The _generate_video_payloads function currently skips
falsy results from video_generator.generate(), which can silently reduce the
number of videos and break passage alignment; change it to fail fast by checking
the result of video_generator.generate() and raising a clear exception (e.g.,
RuntimeError or ValueError) that includes context (count requested and index)
when data is falsy, or alternatively append a deterministic placeholder object
to Video.contents to preserve one-to-one correspondence; update references in
_generate_video_payloads, Video (contents), and video_generator.generate
accordingly.
---
Nitpick comments:
In `@src/aiperf/endpoints/nim_rankings.py`:
- Around line 23-29: The build_payload function currently ignores the media
arguments (images, videos, audios) by assigning them to _ and silently
discarding any input; change this to fail-fast by validating those parameters at
the start of build_payload and raising a clear exception (e.g., ValueError) if
any of images, videos, or audios is non-empty, mentioning which unsupported
media was passed so callers know why the call failed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a61ee8c9-460f-40fb-866e-9d7012727fdf
📒 Files selected for processing (15)
docs/tutorials/rankings.mdsrc/aiperf/dataset/composer/synthetic_rankings.pysrc/aiperf/endpoints/base_rankings_endpoint.pysrc/aiperf/endpoints/cohere_rankings.pysrc/aiperf/endpoints/hf_tei_rankings.pysrc/aiperf/endpoints/nim_rankings.pysrc/aiperf/plugin/plugins.yamltests/aiperf_mock_server/models.pytests/component_integration/endpoints/test_rankings_endpoint.pytests/integration/utils.pytests/unit/dataset/composer/test_synthetic_rankings_composer.pytests/unit/endpoints/test_cohere_rankings_endpoint.pytests/unit/endpoints/test_hf_tei_rankings_endpoint.pytests/unit/endpoints/test_nim_rankings_endpoint.pytests/unit/server/test_models.py
| def _generate_video_payloads(self, count: int) -> Video: | ||
| """Generate one synthetic video per ranking passage.""" | ||
| video = Video(name="video_url") | ||
| for _ in range(count): | ||
| data = self.video_generator.generate() | ||
| if data: | ||
| video.contents.append(data) | ||
| return video |
There was a problem hiding this comment.
Avoid silently dropping generated videos; fail fast on generation miss.
At Line 115, falsy data is skipped, which can produce fewer videos than passages and trigger downstream count-mismatch errors far from the source. Raise immediately (or append a placeholder) to keep per-passage alignment deterministic.
Proposed fix
def _generate_video_payloads(self, count: int) -> Video:
"""Generate one synthetic video per ranking passage."""
video = Video(name="video_url")
for _ in range(count):
data = self.video_generator.generate()
- if data:
- video.contents.append(data)
+ if not data:
+ raise ValueError(
+ "Video generation returned empty content while multimodal rankings are enabled."
+ )
+ video.contents.append(data)
return video🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/aiperf/dataset/composer/synthetic_rankings.py` around lines 110 - 117,
The _generate_video_payloads function currently skips falsy results from
video_generator.generate(), which can silently reduce the number of videos and
break passage alignment; change it to fail fast by checking the result of
video_generator.generate() and raising a clear exception (e.g., RuntimeError or
ValueError) that includes context (count requested and index) when data is
falsy, or alternatively append a deterministic placeholder object to
Video.contents to preserve one-to-one correspondence; update references in
_generate_video_payloads, Video (contents), and video_generator.generate
accordingly.
Co-authored-by: OpenAI Codex <codex@openai.com> Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com>
There was a problem hiding this comment.
🧹 Nitpick comments (2)
tests/unit/dataset/composer/test_synthetic_rankings_composer.py (2)
128-130: ⚡ Quick winAvoid hard-coded passage multiplier in expected call count
expected_media_countis tied to a magic number (* 3). This makes the test brittle if passage generation defaults/config change. Derive the expected count from generated turns (sum of passage counts) or from the configured mean variable in the test setup.Proposed test hardening
- expected_media_count = synthetic_config.input.conversation.num_dataset_entries * 3 - assert generate_image.call_count == expected_media_count - assert generate_video.call_count == expected_media_count + expected_media_count = sum( + len(conversation.turns[0].texts[1].contents) for conversation in dataset + ) + assert generate_image.call_count == expected_media_count + assert generate_video.call_count == expected_media_count🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/dataset/composer/test_synthetic_rankings_composer.py` around lines 128 - 130, Replace the hard-coded multiplier used to compute expected_media_count with a calculation derived from the actual passages produced by the composer or from the test's configured mean; specifically, compute expected_media_count by summing the number of passages across the generated turns (or reading the configured passages-per-turn mean used in the test setup) rather than using "* 3", then assert generate_image.call_count and generate_video.call_count against that computed value (references: expected_media_count, generate_image, generate_video, synthetic_config.input.conversation.num_dataset_entries).
162-181: ⚡ Quick winAssert generators are not called when media batch size is zero
This test validates output shape (
[]) but not the “disabled generation” behavior. Patch both generators and assert zero calls to prevent regressions where media is still generated then discarded.Proposed coverage extension
- composer = SyntheticRankingsDatasetComposer(synthetic_config, mock_tokenizer) - dataset = composer.create_dataset() + composer = SyntheticRankingsDatasetComposer(synthetic_config, mock_tokenizer) + with ( + patch.object(composer.image_generator, "generate") as generate_image, + patch.object(composer.video_generator, "generate") as generate_video, + ): + dataset = composer.create_dataset() for conversation in dataset: turn = conversation.turns[0] assert turn.images == [] assert turn.videos == [] + generate_image.assert_not_called() + generate_video.assert_not_called()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/dataset/composer/test_synthetic_rankings_composer.py` around lines 162 - 181, Extend the test to patch/mock the media generator functions used by SyntheticRankingsDatasetComposer (the image and video generator callables used internally when producing turns) before calling SyntheticRankingsDatasetComposer.create_dataset, then assert those mocks were not called when synthetic_config.input.image.batch_size and ...video.batch_size are 0; this ensures generation is disabled (reference SyntheticRankingsDatasetComposer and its create_dataset path that invokes the image/video generators) and prevents regressions where media is produced then discarded.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/unit/dataset/composer/test_synthetic_rankings_composer.py`:
- Around line 128-130: Replace the hard-coded multiplier used to compute
expected_media_count with a calculation derived from the actual passages
produced by the composer or from the test's configured mean; specifically,
compute expected_media_count by summing the number of passages across the
generated turns (or reading the configured passages-per-turn mean used in the
test setup) rather than using "* 3", then assert generate_image.call_count and
generate_video.call_count against that computed value (references:
expected_media_count, generate_image, generate_video,
synthetic_config.input.conversation.num_dataset_entries).
- Around line 162-181: Extend the test to patch/mock the media generator
functions used by SyntheticRankingsDatasetComposer (the image and video
generator callables used internally when producing turns) before calling
SyntheticRankingsDatasetComposer.create_dataset, then assert those mocks were
not called when synthetic_config.input.image.batch_size and ...video.batch_size
are 0; this ensures generation is disabled (reference
SyntheticRankingsDatasetComposer and its create_dataset path that invokes the
image/video generators) and prevents regressions where media is produced then
discarded.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1bb54f89-c4c2-40b9-825f-97343ed6d4b2
📒 Files selected for processing (2)
tests/unit/dataset/composer/test_synthetic_rankings_composer.pytests/unit/endpoints/test_cohere_rankings_endpoint.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/unit/endpoints/test_cohere_rankings_endpoint.py
Add multimodal input support to the
cohere_rankingsendpoint for vLLM’s vision rerank API, including structured text, image, and video document payload formatting. See:https://docs.vllm.ai/en/latest/examples/pooling/score/#vision-rerank-api-online
We already have text-only Cohere rerank support. This keeps the existing text-only payload shape unchanged, while switching to structured Cohere documents when media is present.
Changes
cohere_rankingsfor text, image, and video rerank documents.Example usage
Set up a vLLM server:
vllm serve nvidia/llama-nemotron-rerank-vl-1b-v2 \ --runner pooling \ --trust-remote-code \ --chat-template "$(curl -fsSL https://raw.githubusercontent.com/vllm-project/vllm/main/examples/pooling/score/template/nemotron-vl-rerank.jinja)"Run AIPerf with synthetic multimodal rankings inputs:
aiperf profile \ -m nvidia/llama-nemotron-rerank-vl-1b-v2 \ --endpoint-type cohere_rankings \ --custom-endpoint /rerank \ --url localhost:8000 \ --request-count 10 \ --rankings-passages-mean 4 \ --rankings-passages-stddev 0 \ --rankings-passages-prompt-token-mean 32 \ --rankings-passages-prompt-token-stddev 0 \ --rankings-query-prompt-token-mean 16 \ --rankings-query-prompt-token-stddev 0 \ --image-width-mean 224 \ --image-width-stddev 0 \ --image-height-mean 224 \ --image-height-stddev 0 \ --image-batch-size 1Summary by CodeRabbit
New Features
Documentation
Tests
Bug Fixes / Validation