Multi-checkpoint inference for pipelined training (RFC #513) #515

corbt · 2026-01-17T02:00:35Z

Summary

- Add multi-checkpoint inference support across the Tinker, Unsloth (LocalBackend), and Serverless backends using the name@step convention.
- Track and route inference to specific checkpoints; keep multiple checkpoints available concurrently.
- Update the vLLM integration so that dynamically added LoRAs are visible in /v1/models.
- Add integration tests with real training loops per backend, plus unit tests covering naming and routing.
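
For readers unfamiliar with the name@step convention, here is a minimal sketch of how such names could be built and parsed. The helper names `format_inference_name` and `parse_inference_name` are hypothetical and are not the functions added in this PR; treating a bare name as "latest checkpoint" is also an assumption.

```python
from typing import Optional, Tuple


def format_inference_name(name: str, step: Optional[int] = None) -> str:
    """Return `name` when no step is pinned, or `name@step` for a pinned step.

    Hypothetical helper for illustration; not the PR's implementation.
    """
    if step is None:
        return name
    if step < 0:
        raise ValueError(f"step must be non-negative, got {step}")
    return f"{name}@{step}"


def parse_inference_name(model: str) -> Tuple[str, Optional[int]]:
    """Split `name@step` into (name, step); a bare name yields (name, None)."""
    # Split on the last "@" so that only the trailing suffix is read as a step.
    base, sep, suffix = model.rpartition("@")
    if not sep:
        return model, None
    if not base or not suffix.isdecimal():
        raise ValueError(f"malformed checkpoint name: {model!r}")
    return base, int(suffix)
```

Under these assumptions, parsing is the inverse of formatting, so a server can recover the requested step from the model field of an incoming request.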

Context

Implements RFC #513.

Justification: metrics for an earlier validation step can still be computed and submitted after training has advanced to newer steps, because the older checkpoint stays available for inference.

Details

- Model.get_inference_name(step) and litellm_completion_params(step) support the name@step convention.
- TinkerService stores multiple sampling clients keyed by step; the OpenAI endpoint parses the @step suffix.
- UnslothService keeps multiple LoRAs loaded (max_loras=2 by default), uses step-based LoRA names, and updates vLLM's model registry when a LoRA is added.
- ServerlessBackend builds W&B artifact names with a :step{N} suffix.
- When a training step is skipped, the local backend still advances the step and registers the new checkpoint for inference.
- Integration tests cover backend training loops; tests skip cleanly when the required credentials are missing.
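
The step-keyed storage described above can be pictured as a small bounded registry. This is only a sketch under stated assumptions: the class name `CheckpointRegistry` is hypothetical, and the eviction policy (drop the oldest-registered checkpoint once capacity is exceeded) is an assumption, since the PR only states that max_loras defaults to 2.

```python
from collections import OrderedDict
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


class CheckpointRegistry(Generic[T]):
    """Keeps at most `capacity` checkpoint handles, keyed by training step.

    Hypothetical illustration; eviction order is an assumption, not the PR's.
    """

    def __init__(self, capacity: int = 2) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._handles: "OrderedDict[int, T]" = OrderedDict()

    def register(self, step: int, handle: T) -> Optional[int]:
        """Add a checkpoint and return the evicted step, if one was dropped."""
        self._handles[step] = handle
        self._handles.move_to_end(step)
        if len(self._handles) > self._capacity:
            # Drop the entry that was registered longest ago.
            evicted_step, _ = self._handles.popitem(last=False)
            return evicted_step
        return None

    def resolve(self, step: Optional[int] = None) -> T:
        """Return the handle for `step`, or the most recently registered one."""
        if not self._handles:
            raise LookupError("no checkpoints registered")
        if step is None:
            return next(reversed(self._handles.values()))
        try:
            return self._handles[step]
        except KeyError:
            raise LookupError(f"checkpoint for step {step} is not loaded") from None
```

With a capacity of 2, registering steps 1, 2, and 3 in order would leave steps 2 and 3 resolvable, while a request pinned to step 1 fails with a clear error instead of silently falling back to a newer checkpoint.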

Tests

uv run pytest tests/unit/test_multi_checkpoint_inference.py -v
uv run pytest tests/integration/test_multi_checkpoint_training.py -v -s
- LocalBackend test verified end-to-end.
- Tinker/Serverless tests require TINKER_API_KEY / WANDB_API_KEY.
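
One way credential-gated tests can skip cleanly is to check the environment before running. The sketch below is an assumption about the pattern, not the test code in this PR, and `missing_credentials` is a hypothetical helper name.

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_credentials(
    required: Iterable[str],
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the names of required variables that are unset or empty.

    Hypothetical helper; `env` defaults to the process environment.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]
```

In a pytest suite, the result could feed a skip condition such as `pytest.mark.skipif(bool(missing_credentials(["TINKER_API_KEY"])), reason="TINKER_API_KEY not set")`, so the test is reported as skipped rather than failed.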

Cursor Bot added 3 commits January 16, 2026 20:36

Use training_step for W&B x-axis to allow out-of-order logging

f9a403c

Implement multi-checkpoint inference for pipelined training

092b461

Fix formatting and typing issues

67ea0d9

corbt requested a review from bradhilton January 17, 2026 02:15

bradhilton approved these changes Jan 17, 2026

corbt changed the base branch from training-step to main January 19, 2026 19:39

corbt merged commit 913b35d into main Jan 19, 2026
1 check passed

corbt mentioned this pull request Jan 19, 2026

[RFC] Multi-Checkpoint Inference Support for Pipelined Training #513

Closed

Kovbo mentioned this pull request Jan 21, 2026

Fix model registration #529

Merged
