Merged
9 changes: 9 additions & 0 deletions .pre-commit-config.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
repos:
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.15.11
hooks:
- id: ruff-check
types: [python]
args: [--fix]
- id: ruff-format
types: [python]
4 changes: 2 additions & 2 deletions docs/ADRs/finetuning/011-finetuning-service.md
@@ -29,11 +29,11 @@ Several constraints shaped this design:

### Considered approaches for client interface

Because fine-tuning access cannot be universally granted, and because the fine-tuning schema does not map cleanly onto the OpenAI fine-tuning spec (which requires a `file.id` to be submitted, mandating a file upload step we have rejected), we need a purpose-built client interface rather than relying on the OpenAI SDK alone. The design of that interface is covered in ADR-012.
Because fine-tuning access cannot be universally granted, and because the fine-tuning schema does not map cleanly onto the OpenAI fine-tuning spec (which requires a `file.id` to be submitted, mandating a file upload step we have rejected), we need a purpose-built client interface rather than relying on the OpenAI SDK alone. The design of that interface is covered in [?].

## Decision

We implement a fine-tuning service as a custom Docker container added to the `llm-service` stack. Users interact with it via a Splinter SDK (see ADR-012), which handles authentication against LiteLLM API keys and a separate fine-tuning access whitelist.
We implement a fine-tuning service as a custom Docker container added to the `llm-service` stack. Users interact with it via a Splinter SDK, which handles authentication against LiteLLM API keys and a separate fine-tuning access whitelist.

The service exposes:

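The two-step authorisation described in the Decision above (a valid LiteLLM API key plus membership of a separate fine-tuning whitelist) can be sketched as a pure function. The names below are illustrative, not the actual Splinter SDK API:

```python
def check_finetuning_access(
    key_is_valid: bool, user_id: str, whitelist: set[str]
) -> tuple[bool, str]:
    """Return (allowed, reason) for a fine-tuning request.

    A request must pass BOTH gates: the LiteLLM API key must be
    valid, and the caller must be on the fine-tuning whitelist.
    """
    if not key_is_valid:
        return False, "invalid LiteLLM API key"
    if user_id not in whitelist:
        return False, "user not on fine-tuning whitelist"
    return True, "ok"
```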
23 changes: 23 additions & 0 deletions docs/ADRs/finetuning/012-finetuning-service-skeleton.md
@@ -0,0 +1,23 @@
# ADR-012. Fine-Tuning Service Skeleton

Date: 2026-04-22
Status: Proposed

## Context

With the high-level service design established in ADR-011, the skeleton implementation required a set of concrete technical decisions: stack layout, base image, and service framework.

## Decision

**Separate stack.** The fine-tuning service lives in `stacks/finetuning-service/` rather than extending the llm-service stack. This mirrors the monitoring stack pattern and allows the service to be brought up and down independently.

**Base image.** `axolotlai/axolotl-uv:main-py3.12-cu130-2.10.0`, pinned to a specific tag; updates are a deliberate decision, identical to our LiteLLM versioning. The `-uv` variant is consistent with the team's preference for uv across the project.

**FastAPI** for the service framework, with **SQLite** backed by a named Docker volume for job queue state. This is sufficient for a single-worker serialised queue and avoids a dependency on the existing PostgreSQL instance.

**Networking** between the fine-tuning service and LiteLLM (for future API key validation) uses `host.docker.internal` rather than joining the llm-service Docker network as an external network. This keeps the stacks decoupled.

## Consequences

- Hyperparameter configuration for training jobs is deferred; the job submission schema carries only the fields needed to identify the job.
- SQLite is sufficient now but migration to PostgreSQL remains possible if cross-service visibility is needed later.
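For the future API key validation mentioned in the networking decision above, the fine-tuning service would call LiteLLM across the Docker host boundary via `host.docker.internal`. A minimal sketch of building such a request, assuming LiteLLM's `/key/info` management endpoint and Bearer authentication with the master key (both details are assumptions, not verified against the deployed version):

```python
import os
from urllib.parse import urlencode


def build_key_info_request(raw_key: str) -> tuple[str, dict[str, str]]:
    """Build the URL and headers for validating a key via LiteLLM.

    Assumes LiteLLM's /key/info endpoint; the service reaches the
    llm-service stack through host.docker.internal rather than a
    shared Docker network, keeping the stacks decoupled.
    """
    port = os.environ.get("LITELLM_PORT", "4000")
    base = f"http://host.docker.internal:{port}"
    url = f"{base}/key/info?{urlencode({'key': raw_key})}"
    # Management calls authenticate with the LiteLLM master key.
    master = os.environ.get("LITELLM_MASTER_KEY", "")
    headers = {"Authorization": f"Bearer {master}"}
    return url, headers
```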
19 changes: 18 additions & 1 deletion pyproject.toml
@@ -3,6 +3,7 @@ name = "splinter"
version = "0.1.0"
description = "Add your description here"
requires-python = ">=3.12"

dependencies = [
"ipykernel>=7.2.0",
"matplotlib>=3.10.8",
@@ -12,5 +13,21 @@ dependencies = [
"numpy>=1.26",
"openai>=2.21.0",
"rich>=14.3.2",
"vllm>=0.15.1",
"vllm>=0.15.1; sys_platform == 'linux'",
]

[dependency-groups]
dev = [
"pre-commit>=4.6.0",
"ruff>=0.15.11",
]

[tool.ruff]
line-length = 79

[tool.ruff.lint]
select = ["E", "W", "F", "I", "D", "ANN"]
ignore = ["ANN401"]

[tool.ruff.lint.pydocstyle]
convention = "google"
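The `D` (pydocstyle) and `ANN` (annotations) rule groups combined with `convention = "google"` mean every public function needs full type annotations and a Google-style docstring. A hypothetical function that satisfies this configuration:

```python
def scale(values: list[float], factor: float) -> list[float]:
    """Scale each value by a constant factor.

    Args:
        values: Numbers to scale.
        factor: Multiplier applied to each element.

    Returns:
        A new list with every element multiplied by factor.
    """
    return [v * factor for v in values]
```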
27 changes: 27 additions & 0 deletions stacks/finetuning-service/.env.example
@@ -0,0 +1,27 @@
# =============================================================================
# Environment Configuration for Fine-Tuning Service
# =============================================================================
#
# IMPORTANT:
# - Copy this to .env and fill in real values
# - Never commit .env to git (it should be in the .gitignore)
# - Keep this .env.example as a template
#
# =============================================================================

# -----------------------------------------------------------------------------
# Fine-Tuning Service
# -----------------------------------------------------------------------------
FINETUNING_PORT=8005
MAX_JOB_DURATION_HOURS=4

# -----------------------------------------------------------------------------
# LiteLLM (for API key validation)
# -----------------------------------------------------------------------------
LITELLM_PORT=4000
LITELLM_MASTER_KEY=sk-CHANGE_ME_TO_SOMETHING_SECURE

# -----------------------------------------------------------------------------
# Axolotl
# -----------------------------------------------------------------------------
AXOLOTL_IMAGE=axolotlai/axolotl-uv:main-py3.12-cu130-2.10.0
11 changes: 11 additions & 0 deletions stacks/finetuning-service/Dockerfile
@@ -0,0 +1,11 @@
# A build arg must be declared before FROM for the substitution to work
ARG AXOLOTL_IMAGE
FROM ${AXOLOTL_IMAGE}

WORKDIR /app

# Install service dependencies on top of Axolotl's environment
COPY requirements.txt .
RUN uv pip install --system -r requirements.txt

COPY app/ ./app/

CMD ["sh", "-c", "uvicorn app.main:app --host 0.0.0.0 --port ${FINETUNING_PORT}"]
1 change: 1 addition & 0 deletions stacks/finetuning-service/app/__init__.py
@@ -0,0 +1 @@
"""Application for the Splinter fine-tuning service."""
188 changes: 188 additions & 0 deletions stacks/finetuning-service/app/database.py
@@ -0,0 +1,188 @@
"""SQLite database setup and job queue operations."""

import json
import sqlite3
import uuid
from contextlib import contextmanager
from datetime import datetime, timezone
from pathlib import Path
from typing import Generator, Optional

from .models import JobResponse, JobStatus, JobSubmitRequest

DB_PATH = Path("/data/jobs.db")


@contextmanager
def get_connection() -> Generator[sqlite3.Connection, None, None]:
"""Yield a database connection with row factory configured.

Yields:
An open SQLite connection.
"""
conn = sqlite3.connect(DB_PATH)
conn.row_factory = sqlite3.Row
try:
yield conn
conn.commit()
finally:
conn.close()


def _row_to_job(row: sqlite3.Row) -> JobResponse:
"""Convert a database row to a JobResponse.

Args:
row: A row from the jobs table.

Returns:
A populated JobResponse instance.
"""
return JobResponse(
id=row["id"],
status=JobStatus(row["status"]),
model=row["model"],
hf_dataset=row["hf_dataset"],
suffix=row["suffix"],
created_at=datetime.fromisoformat(row["created_at"]),
started_at=(
datetime.fromisoformat(row["started_at"])
if row["started_at"]
else None
),
completed_at=(
datetime.fromisoformat(row["completed_at"])
if row["completed_at"]
else None
),
error_message=row["error_message"],
)


def init_db() -> None:
"""Initialise the database schema if it does not exist."""
with get_connection() as conn:
conn.execute("""
CREATE TABLE IF NOT EXISTS jobs (
id TEXT PRIMARY KEY,
status TEXT NOT NULL,
model TEXT NOT NULL,
hf_dataset TEXT NOT NULL,
suffix TEXT,
config TEXT NOT NULL,
created_at TEXT NOT NULL,
started_at TEXT,
completed_at TEXT,
error_message TEXT
)
""")


def recover_running_jobs() -> None:
"""Mark any jobs left running at startup as failed.

A job in 'running' state when the service exits will never
complete. This is called at startup to mark such jobs as
failed rather than leaving them hanging indefinitely.
"""
with get_connection() as conn:
conn.execute(
"UPDATE jobs SET status = ? WHERE status = ?",
(JobStatus.FAILED, JobStatus.RUNNING),
)


def create_job(request: JobSubmitRequest) -> JobResponse:
"""Insert a new job record and return it in queued status.

Args:
request: The validated job submission request.

Returns:
The newly created job.
"""
job_id = str(uuid.uuid4())
now = datetime.now(timezone.utc).isoformat()
config = json.dumps(
{
# TODO: Add hyperparameters/config keys
}
)
with get_connection() as conn:
conn.execute(
"""
INSERT INTO jobs (
id, status, model, hf_dataset,
suffix, config, created_at
) VALUES (?, ?, ?, ?, ?, ?, ?)
""",
(
job_id,
JobStatus.QUEUED,
request.model,
request.hf_dataset,
request.suffix,
config,
now,
),
)
return JobResponse(
id=job_id,
status=JobStatus.QUEUED,
model=request.model,
hf_dataset=request.hf_dataset,
suffix=request.suffix,
created_at=datetime.fromisoformat(now),
)


def get_job(job_id: str) -> Optional[JobResponse]:
"""Fetch a single job by ID.

Args:
job_id: The UUID of the job to retrieve.

Returns:
The job if found, otherwise None.
"""
with get_connection() as conn:
row = conn.execute(
"SELECT * FROM jobs WHERE id = ?", (job_id,)
).fetchone()
return _row_to_job(row) if row else None


def list_jobs() -> list[JobResponse]:
"""Return all jobs ordered by creation time descending.

Returns:
All job records.
"""
with get_connection() as conn:
rows = conn.execute(
"SELECT * FROM jobs ORDER BY created_at DESC"
).fetchall()
return [_row_to_job(row) for row in rows]


def cancel_job(job_id: str) -> Optional[JobResponse]:
"""Cancel a queued job.

Only jobs in 'queued' status are affected. Jobs already
running must be stopped via the process termination path.

Args:
job_id: The UUID of the job to cancel.

Returns:
The updated job if found, otherwise None.
"""
with get_connection() as conn:
conn.execute(
"""
UPDATE jobs SET status = ?
WHERE id = ? AND status = ?
""",
(JobStatus.CANCELLED, job_id, JobStatus.QUEUED),
)
return get_job(job_id)
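The guarded `UPDATE` in `cancel_job` is the core of the cancel semantics: only rows still in `queued` status are touched, so a running job passes through unchanged. A standalone sketch of the same guard against an in-memory SQLite database (schema trimmed to the two fields involved):

```python
import sqlite3

# Two jobs: one still queued, one already running.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id TEXT PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?)",
    [("a", "queued"), ("b", "running")],
)


def cancel(job_id: str) -> str:
    """Cancel job_id only if it is still queued; return its status."""
    # The status predicate makes the cancel a no-op for running jobs.
    conn.execute(
        "UPDATE jobs SET status = 'cancelled' "
        "WHERE id = ? AND status = 'queued'",
        (job_id,),
    )
    row = conn.execute(
        "SELECT status FROM jobs WHERE id = ?", (job_id,)
    ).fetchone()
    return row[0]
```

The queued job flips to `cancelled`, while the running job keeps its status, matching the "stopped via the process termination path" behaviour described in the docstring above.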
44 changes: 44 additions & 0 deletions stacks/finetuning-service/app/main.py
@@ -0,0 +1,44 @@
"""Fine-tuning service entry point."""

from contextlib import asynccontextmanager
from typing import AsyncGenerator

from fastapi import FastAPI

from .database import init_db, recover_running_jobs
from .routes import router


@asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncGenerator[None, None]:
"""Manage application startup and shutdown.

Initialises the database and recovers any jobs that were
left in a running state from a previous crash.

Args:
app: The FastAPI application instance.

Yields:
None
"""
init_db()
recover_running_jobs()
yield


app = FastAPI(
title="Splinter Fine-Tuning Service",
lifespan=lifespan,
)
app.include_router(router, prefix="/v1/fine_tuning")


@app.get("/health")
async def health() -> dict[str, str]:
"""Health check endpoint.

Returns:
A dictionary with status ok.
"""
return {"status": "ok"}