Merged
9 changes: 9 additions & 0 deletions .pre-commit-config.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
repos:
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.15.11
hooks:
- id: ruff-check
types: [python]
args: [--fix]
- id: ruff-format
types: [python]
4 changes: 2 additions & 2 deletions docs/ADRs/finetuning/011-finetuning-service.md
@@ -29,11 +29,11 @@ Several constraints shaped this design:

### Considered approaches for client interface

Because fine-tuning access cannot be universally granted, and because the fine-tuning schema does not map cleanly onto the OpenAI fine-tuning spec (which requires a `file.id` to be submitted, mandating a file upload step we have rejected), we need a purpose-built client interface rather than relying on the OpenAI SDK alone. The design of that interface is covered in ADR-012.
Because fine-tuning access cannot be universally granted, and because the fine-tuning schema does not map cleanly onto the OpenAI fine-tuning spec (which requires a `file.id` to be submitted, mandating a file upload step we have rejected), we need a purpose-built client interface rather than relying on the OpenAI SDK alone. The design of that interface is covered in [?].

## Decision

We implement a fine-tuning service as a custom Docker container added to the `llm-service` stack. Users interact with it via a Splinter SDK (see ADR-012), which handles authentication against LiteLLM API keys and a separate fine-tuning access whitelist.
We implement a fine-tuning service as a custom Docker container added to the `llm-service` stack. Users interact with it via a Splinter SDK, which handles authentication against LiteLLM API keys and a separate fine-tuning access whitelist.

The service exposes:

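The two-step authorisation described in the Decision above (a valid LiteLLM API key plus membership of a separate fine-tuning whitelist) can be sketched as a pure function. The names below are illustrative, not the actual Splinter SDK API:

```python
def check_finetuning_access(
    key_is_valid: bool, user_id: str, whitelist: set[str]
) -> tuple[bool, str]:
    """Return (allowed, reason) for a fine-tuning request.

    A request must pass BOTH gates: the LiteLLM API key must be
    valid, and the caller must be on the fine-tuning whitelist.
    """
    if not key_is_valid:
        return False, "invalid LiteLLM API key"
    if user_id not in whitelist:
        return False, "user not on fine-tuning whitelist"
    return True, "ok"
```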
23 changes: 23 additions & 0 deletions docs/ADRs/finetuning/012-finetuning-service-skeleton.md
@@ -0,0 +1,23 @@
# ADR-012. Fine-Tuning Service Skeleton

Date: 2026-04-22
Status: Proposed

## Context

With the high-level service design established in ADR-011, the skeleton implementation required a set of concrete technical decisions: stack layout, base image, and service framework.

## Decision

**Separate stack.** The fine-tuning service lives in `stacks/finetuning-service/` rather than extending the llm-service stack. This mirrors the monitoring stack pattern and allows the service to be brought up and down independently.

**Base image.** `axolotlai/axolotl-uv:main-py3.12-cu130-2.10.0`, pinned to a specific tag; updates are a deliberate decision, identical to our LiteLLM versioning. The `-uv` variant is consistent with the team's preference for uv across the project.

**FastAPI** for the service framework, with **SQLite** backed by a named Docker volume for job queue state. This is sufficient for a single-worker serialised queue and avoids a dependency on the existing PostgreSQL instance.

**Networking** between the fine-tuning service and LiteLLM (for future API key validation) uses `host.docker.internal` rather than joining the llm-service Docker network as an external network. This keeps the stacks decoupled.

## Consequences

- Hyperparameter configuration for training jobs is deferred; the job submission schema carries only the fields needed to identify the job.
- SQLite is sufficient now but migration to PostgreSQL remains possible if cross-service visibility is needed later.
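For the future API key validation mentioned in the networking decision above, the fine-tuning service would call LiteLLM across the Docker host boundary via `host.docker.internal`. A minimal sketch of building such a request, assuming LiteLLM's `/key/info` management endpoint and Bearer authentication with the master key (both details are assumptions, not verified against the deployed version):

```python
import os
from urllib.parse import urlencode


def build_key_info_request(raw_key: str) -> tuple[str, dict[str, str]]:
    """Build the URL and headers for validating a key via LiteLLM.

    Assumes LiteLLM's /key/info endpoint; the service reaches the
    llm-service stack through host.docker.internal rather than a
    shared Docker network, keeping the stacks decoupled.
    """
    port = os.environ.get("LITELLM_PORT", "4000")
    base = f"http://host.docker.internal:{port}"
    url = f"{base}/key/info?{urlencode({'key': raw_key})}"
    # Management calls authenticate with the LiteLLM master key.
    master = os.environ.get("LITELLM_MASTER_KEY", "")
    headers = {"Authorization": f"Bearer {master}"}
    return url, headers
```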
19 changes: 18 additions & 1 deletion pyproject.toml
@@ -3,6 +3,7 @@ name = "splinter"
version = "0.1.0"
description = "Add your description here"
requires-python = ">=3.12"

dependencies = [
"ipykernel>=7.2.0",
"matplotlib>=3.10.8",
@@ -12,5 +13,21 @@ dependencies = [
"numpy>=1.26",
"openai>=2.21.0",
"rich>=14.3.2",
"vllm>=0.15.1",
"vllm>=0.15.1; sys_platform == 'linux'",
]

[dependency-groups]
dev = [
"pre-commit>=4.6.0",
"ruff>=0.15.11",
]

[tool.ruff]
line-length = 79

[tool.ruff.lint]
select = ["E", "W", "F", "I", "D", "ANN"]
ignore = ["ANN401"]

[tool.ruff.lint.pydocstyle]
convention = "google"
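The `D` (pydocstyle) and `ANN` (annotations) rule groups combined with `convention = "google"` mean every public function needs full type annotations and a Google-style docstring. A hypothetical function that satisfies this configuration:

```python
def scale(values: list[float], factor: float) -> list[float]:
    """Scale each value by a constant factor.

    Args:
        values: Numbers to scale.
        factor: Multiplier applied to each element.

    Returns:
        A new list with every element multiplied by factor.
    """
    return [v * factor for v in values]
```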
27 changes: 27 additions & 0 deletions stacks/finetuning-service/.env.example
@@ -0,0 +1,27 @@
# =============================================================================
# Environment Configuration for Fine-Tuning Service
# =============================================================================
#
# IMPORTANT:
# - Copy this to .env and fill in real values
# - Never commit .env to git (it should be in the .gitignore)
# - Keep this .env.example as a template
#
# =============================================================================

# -----------------------------------------------------------------------------
# Fine-Tuning Service
# -----------------------------------------------------------------------------
FINETUNING_PORT=8005
MAX_JOB_DURATION_HOURS=4

# -----------------------------------------------------------------------------
# LiteLLM (for API key validation)
# -----------------------------------------------------------------------------
LITELLM_PORT=4000
LITELLM_MASTER_KEY=sk-CHANGE_ME_TO_SOMETHING_SECURE

# -----------------------------------------------------------------------------
# Axolotl
# -----------------------------------------------------------------------------
AXOLOTL_IMAGE=axolotlai/axolotl-uv:main-py3.12-cu130-2.10.0
11 changes: 11 additions & 0 deletions stacks/finetuning-service/Dockerfile
@@ -0,0 +1,11 @@
# A build arg must be declared before FROM for the substitution to work
ARG AXOLOTL_IMAGE
FROM ${AXOLOTL_IMAGE}

WORKDIR /app

# Install service dependencies on top of Axolotl's environment
COPY requirements.txt .
RUN uv pip install --system -r requirements.txt

COPY app/ ./app/

CMD ["sh", "-c", "uvicorn app.main:app --host 0.0.0.0 --port ${FINETUNING_PORT}"]
1 change: 1 addition & 0 deletions stacks/finetuning-service/app/__init__.py
@@ -0,0 +1 @@
"""Application for the Splinter fine-tuning service."""
188 changes: 188 additions & 0 deletions stacks/finetuning-service/app/database.py
@@ -0,0 +1,188 @@
"""SQLite database setup and job queue operations."""

import json
import sqlite3
import uuid
from contextlib import contextmanager
from datetime import datetime, timezone
from pathlib import Path
from typing import Generator, Optional

from .models import JobResponse, JobStatus, JobSubmitRequest

DB_PATH = Path("/data/jobs.db")


@contextmanager
def get_connection() -> Generator[sqlite3.Connection, None, None]:
"""Yield a database connection with row factory configured.

Yields:
An open SQLite connection.
"""
conn = sqlite3.connect(DB_PATH)
conn.row_factory = sqlite3.Row
try:
yield conn
conn.commit()
finally:
conn.close()


def _row_to_job(row: sqlite3.Row) -> JobResponse:
"""Convert a database row to a JobResponse.

Args:
row: A row from the jobs table.

Returns:
A populated JobResponse instance.
"""
return JobResponse(
id=row["id"],
status=JobStatus(row["status"]),
model=row["model"],
hf_dataset=row["hf_dataset"],
suffix=row["suffix"],
created_at=datetime.fromisoformat(row["created_at"]),
started_at=(
datetime.fromisoformat(row["started_at"])
if row["started_at"]
else None
),
completed_at=(
datetime.fromisoformat(row["completed_at"])
if row["completed_at"]
else None
),
error_message=row["error_message"],
)


def init_db() -> None:
"""Initialise the database schema if it does not exist."""
with get_connection() as conn:
conn.execute("""
CREATE TABLE IF NOT EXISTS jobs (
id TEXT PRIMARY KEY,
status TEXT NOT NULL,
model TEXT NOT NULL,
hf_dataset TEXT NOT NULL,
suffix TEXT,
config TEXT NOT NULL,
created_at TEXT NOT NULL,
started_at TEXT,
completed_at TEXT,
error_message TEXT
)
""")


def recover_running_jobs() -> None:
"""Mark any jobs left running at startup as failed.

A job in 'running' state when the service exits will never
complete. This is called at startup to mark such jobs as
failed rather than leaving them hanging indefinitely.
"""
with get_connection() as conn:
conn.execute(
"UPDATE jobs SET status = ? WHERE status = ?",
(JobStatus.FAILED, JobStatus.RUNNING),
)


def create_job(request: JobSubmitRequest) -> JobResponse:
"""Insert a new job record and return it in queued status.

Args:
request: The validated job submission request.

Returns:
The newly created job.
"""
job_id = str(uuid.uuid4())
now = datetime.now(timezone.utc).isoformat()
config = json.dumps(
{
# TODO: Add hyperparameters/config keys
}
)
with get_connection() as conn:
conn.execute(
"""
INSERT INTO jobs (
id, status, model, hf_dataset,
suffix, config, created_at
) VALUES (?, ?, ?, ?, ?, ?, ?)
""",
(
job_id,
JobStatus.QUEUED,
request.model,
request.hf_dataset,
request.suffix,
config,
now,
),
)
return JobResponse(
id=job_id,
status=JobStatus.QUEUED,
model=request.model,
hf_dataset=request.hf_dataset,
suffix=request.suffix,
created_at=datetime.fromisoformat(now),
)


def get_job(job_id: str) -> Optional[JobResponse]:
"""Fetch a single job by ID.

Args:
job_id: The UUID of the job to retrieve.

Returns:
The job if found, otherwise None.
"""
with get_connection() as conn:
row = conn.execute(
"SELECT * FROM jobs WHERE id = ?", (job_id,)
).fetchone()
return _row_to_job(row) if row else None


def list_jobs() -> list[JobResponse]:
"""Return all jobs ordered by creation time descending.

Returns:
All job records.
"""
with get_connection() as conn:
rows = conn.execute(
"SELECT * FROM jobs ORDER BY created_at DESC"
).fetchall()
return [_row_to_job(row) for row in rows]


def cancel_job(job_id: str) -> Optional[JobResponse]:
"""Cancel a queued job.

Only jobs in 'queued' status are affected. Jobs already
running must be stopped via the process termination path.

Args:
job_id: The UUID of the job to cancel.

Returns:
The updated job if found, otherwise None.
"""
with get_connection() as conn:
conn.execute(
"""
UPDATE jobs SET status = ?
WHERE id = ? AND status = ?
""",
(JobStatus.CANCELLED, job_id, JobStatus.QUEUED),
)
return get_job(job_id)
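The guarded `UPDATE` in `cancel_job` is the core of the cancel semantics: only rows still in `queued` status are touched, so a running job passes through unchanged. A standalone sketch of the same guard against an in-memory SQLite database (schema trimmed to the two fields involved):

```python
import sqlite3

# Two jobs: one still queued, one already running.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id TEXT PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?)",
    [("a", "queued"), ("b", "running")],
)


def cancel(job_id: str) -> str:
    """Cancel job_id only if it is still queued; return its status."""
    # The status predicate makes the cancel a no-op for running jobs.
    conn.execute(
        "UPDATE jobs SET status = 'cancelled' "
        "WHERE id = ? AND status = 'queued'",
        (job_id,),
    )
    row = conn.execute(
        "SELECT status FROM jobs WHERE id = ?", (job_id,)
    ).fetchone()
    return row[0]
```

The queued job flips to `cancelled`, while the running job keeps its status, matching the "stopped via the process termination path" behaviour described in the docstring above.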
44 changes: 44 additions & 0 deletions stacks/finetuning-service/app/main.py
@@ -0,0 +1,44 @@
"""Fine-tuning service entry point."""

from contextlib import asynccontextmanager
from typing import AsyncGenerator

from fastapi import FastAPI

from .database import init_db, recover_running_jobs
from .routes import router


@asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncGenerator[None, None]:
"""Manage application startup and shutdown.

Initialises the database and recovers any jobs that were
left in a running state from a previous crash.

Args:
app: The FastAPI application instance.

Yields:
None
"""
init_db()
recover_running_jobs()
yield


app = FastAPI(
title="Splinter Fine-Tuning Service",
lifespan=lifespan,
)
app.include_router(router, prefix="/v1/fine_tuning")


@app.get("/health")
async def health() -> dict[str, str]:
"""Health check endpoint.

Returns:
A dictionary with status ok.
"""
return {"status": "ok"}