
Conversation

corbt (Contributor) commented on Jan 17, 2026

Summary

  • Add multi-checkpoint inference support across Tinker, Unsloth (LocalBackend), and Serverless backends using the name@step convention (see the parsing sketch after this list).
  • Track and route inference to specific checkpoints; keep multiple checkpoints available concurrently.
  • Update vLLM integration so dynamically added LoRAs are visible in /v1/models.
  • Add integration tests with real training loops per backend, plus unit tests covering naming and routing.
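
As a rough illustration of the name@step convention: a checkpoint-qualified model name is a base name with an optional @<step> suffix. The helper below is a minimal sketch under that assumption; parse_inference_name is a hypothetical name, not this PR's actual API.

```python
# Minimal sketch of the name@step convention; parse_inference_name is a
# hypothetical helper, not the PR's actual API.
def parse_inference_name(inference_name: str) -> tuple[str, int | None]:
    """Split "my-model@42" into ("my-model", 42); a bare name has no step."""
    base, sep, step = inference_name.rpartition("@")
    if not sep:
        return inference_name, None  # no "@step" suffix present
    return base, int(step)

assert parse_inference_name("my-model@42") == ("my-model", 42)
assert parse_inference_name("my-model") == ("my-model", None)
```

Splitting on the last @ keeps the parse unambiguous even if a base name itself contains an @.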

Context

Implements RFC #513.

Justification: this makes it possible to submit metrics for older validation steps even after training has advanced to newer ones.

Details

  • Model.get_inference_name(step) and litellm_completion_params(step) support the name@step convention.
  • TinkerService stores multiple sampling clients keyed by step; the OpenAI-compatible endpoint parses the @step suffix (a routing sketch follows this list).
  • UnslothService keeps multiple LoRAs loaded (default max_loras=2), uses step-based LoRA names, and updates vLLM’s model registry when a new LoRA is added.
  • ServerlessBackend builds W&B artifact names with a :step{N} suffix.
  • Local backend advances steps on skipped training and registers the new checkpoint for inference.
  • Integration tests cover backend training loops; tests skip cleanly without required credentials.
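
To make the routing concrete, here is a sketch of keeping several checkpoints available concurrently and resolving an @step request to the right one, in the spirit of TinkerService's step-keyed sampling clients. CheckpointRegistry and its methods are illustrative names, not classes from this PR.

```python
# Illustrative sketch of step-keyed checkpoint routing; CheckpointRegistry
# is a hypothetical name, not a class from this PR.
class CheckpointRegistry:
    def __init__(self) -> None:
        self._clients: dict[int, object] = {}  # step -> sampling client
        self._latest_step: int | None = None

    def register(self, step: int, client: object) -> None:
        # Keep earlier checkpoints available instead of replacing them.
        self._clients[step] = client
        if self._latest_step is None or step > self._latest_step:
            self._latest_step = step

    def resolve(self, requested_step: int | None) -> object:
        # A "name@step" request pins a checkpoint; a bare name gets the latest.
        step = self._latest_step if requested_step is None else requested_step
        if step not in self._clients:
            raise KeyError(f"no checkpoint registered for step {step}")
        return self._clients[step]
```

Bounding the number of concurrently loaded checkpoints (as UnslothService does via max_loras) would additionally evict the oldest entry in register; that eviction policy is omitted here.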

Tests

  • uv run pytest tests/unit/test_multi_checkpoint_inference.py -v
  • uv run pytest tests/integration/test_multi_checkpoint_training.py -v -s
    • LocalBackend test verified end-to-end.
    • Tinker/Serverless tests require TINKER_API_KEY / WANDB_API_KEY and skip cleanly when those are unset (see the skip-pattern sketch below).
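
The credential-gated skips can be expressed with pytest's standard skipif marker; the sketch below is one plausible shape, not necessarily the test file's exact code, and the test name is illustrative.

```python
# One plausible shape for the credential-gated skips (test name illustrative).
import os

import pytest

requires_tinker = pytest.mark.skipif(
    not os.environ.get("TINKER_API_KEY"),
    reason="TINKER_API_KEY is not set",
)

@requires_tinker
def test_tinker_multi_checkpoint_training():
    ...  # real training loop against the Tinker backend
```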

corbt requested a review from bradhilton on Jan 17, 2026 at 02:15
corbt changed the base branch from training-step to main on Jan 19, 2026 at 19:39
corbt merged commit 913b35d into main on Jan 19, 2026 (1 check passed)
Kovbo mentioned this pull request on Jan 21, 2026