33 commits
04bebc9
feat(docs): add initial draft ADR for fine-tuning service
fingriffin Apr 20, 2026
c326f49
feat(docs): update ADR-011 with SDK decision
fingriffin Apr 22, 2026
b2a9294
refactor(docs): create dedicated ADRs folder for fine-tuning service
fingriffin Apr 22, 2026
432f4ca
chore(docs): remove mention of SDK ADR
fingriffin Apr 22, 2026
6b845fe
feat(stacks): add lightweight docker compose for fine-tuning service
fingriffin Apr 22, 2026
5991c90
feat(stacks): add Dockerfile for fine-tuning service build using axol…
fingriffin Apr 22, 2026
539dcce
feat(stacks): add fine-tuning service FastAPI application
fingriffin Apr 22, 2026
b7cf684
feat(docs): add ADR for fine-tuning service skeleton
fingriffin Apr 22, 2026
0f6f80b
Merge pull request #26 from acceleratescience/feature/ft-service-skel…
fingriffin Apr 22, 2026
04a9d52
chore: fix typo on endpoint name
fingriffin Apr 22, 2026
211ac93
chore: migrate to hardcoded axolotl base image
fingriffin Apr 23, 2026
76afb95
fix(docker): override system python guard for requirements installation
fingriffin Apr 23, 2026
b3c7943
feat(stacks): fallback to plain python axolotl base image
fingriffin Apr 23, 2026
b3e21ab
feat(stacks): add lightweight user config for fine-tuning service and…
fingriffin Apr 23, 2026
c2c4683
feat(stacks): add axolotl subprocess with separate worker
fingriffin Apr 23, 2026
a499968
chore: add chat template to test dataset and initialise db from axolo…
fingriffin Apr 23, 2026
f91ff8e
feat(stacks): let device for fine-tuning service be a environment var…
fingriffin Apr 24, 2026
b6f4eef
feat(deps): add FA4
fingriffin Apr 24, 2026
95e0393
fix: pin FA4 version with cu13 wheels, increase default sequence_len,…
fingriffin Apr 24, 2026
d44b181
feat(stacks): add do_eval as key in fine-tuning service
fingriffin Apr 27, 2026
3d7e2f3
feat(stacks): add wandb tracking
fingriffin Apr 27, 2026
97394f6
feat(docs): add ADR for axolotl implementation
fingriffin Apr 27, 2026
e05ebb9
Merge pull request #27 from acceleratescience/feature/axolotl
fingriffin Apr 27, 2026
e86dcc1
feat(stacks): add auth layer to FastAPI for fine-tuning service
fingriffin Apr 27, 2026
e3bb938
feat(stacks): add docker network between llm and fine-tuning service
fingriffin Apr 28, 2026
68d7351
feat(stacks): add user id whitelist for fine-tuning service
fingriffin Apr 28, 2026
d010f8c
fix: migrate to gitignored whitelist for user auth
fingriffin Apr 28, 2026
46f5ef7
fix: fix lite llm port typo
fingriffin Apr 28, 2026
07d53d9
Merge pull request #28 from acceleratescience/feature/auth
fingriffin Apr 28, 2026
4272903
feat(docs): add ADR for authentication
fingriffin Apr 29, 2026
9240d20
chore(docs): fix typo in adr title
fingriffin Apr 29, 2026
8fdac36
feat(nginx): add fine-tuning service to nginx config
fingriffin Apr 29, 2026
4deef0d
fix: use python.urllib for health check
fingriffin May 1, 2026
4 changes: 3 additions & 1 deletion .gitignore
@@ -14,4 +14,6 @@ postgres_data/
.DS_Store
Thumbs.db

.idea/
.idea/

whitelist.txt
9 changes: 9 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,9 @@
repos:
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.15.11
hooks:
- id: ruff-check
types: [python]
args: [--fix]
- id: ruff-format
types: [python]
105 changes: 105 additions & 0 deletions docs/ADRs/finetuning/011-finetuning-service.md
@@ -0,0 +1,105 @@
# ADR-011. Fine-Tuning Service

Date: 2026-04-22
Status: Proposed

## Context

The PRD (FR-FT-01, FR-FT-02) identifies LoRA/QLoRA fine-tuning as a "Should have" for Phase 2. The primary motivation is to allow researchers to adapt open-weight models to domain-specific tasks (e.g. biology, materials science) without requiring cloud API access or local GPU hardware.

Several constraints shaped this design:

- **One GPU available**: At the time of writing, all four H100s are allocated to inference services. One GPU can be provisioned for fine-tuning, but it must be treated as a shared, serialised resource (concurrent training jobs are not feasible).
- **No file upload infrastructure**: The existing Nginx configuration restricts `client_max_body_size` to 10MB and there is no file storage layer. Building one would add significant operational complexity and security risk for little gain.
- **No per-user adapter serving**: Serving a fine-tuned adapter for a single user would require either dedicated GPU capacity or a hot-swap mechanism on a shared inference instance. Neither is practical at this scale. Adapters must be returned to users via Hugging Face rather than hosted.
- **Access control**: LiteLLM has no mechanism to restrict fine-tuning access to a subset of users. A separate whitelist is required; fine-tuning cannot simply inherit the existing API key auth layer without granting all key holders access to submit training jobs.
- **Security posture**: User-supplied credentials (HuggingFace tokens) must not be persisted in logs or the database beyond what is strictly necessary.

### Considered approaches for dataset delivery

**File upload via `/v1/files`**: LiteLLM routes this endpoint and it reaches the service (returning a 500 indicating `files_settings` is not configured, rather than a 404). However, enabling file storage requires configuring a backing store, raises questions about retention and quotas, and adds attack surface. Rejected.

**HuggingFace Hub reference**: Users provide a HF dataset repository path and a scoped HF token. The training service pulls the dataset directly from the Hub at job start. This avoids any file storage infrastructure on our side and is well-matched to how researchers already manage data. Selected.

### Considered approaches for adapter delivery

**Serve locally via vLLM LoRA**: vLLM supports dynamic LoRA adapter loading via `--enable-lora` and a `/v1/load_lora_adapter` endpoint. However, this means hosting a persistent model endpoint for one user's adapter, which is not a scalable use of GPU memory. Rejected.

**Push to user's HuggingFace Hub**: The training service uses the user's HF token (which must be write-scoped) to push the completed adapter back to their Hub. The user then has full ownership of the artifact and can load it however they choose. Selected.

### Considered approaches for client interface

Because fine-tuning access cannot be universally granted, and because the fine-tuning schema does not map cleanly onto the OpenAI fine-tuning spec (which requires a `file.id` to be submitted, mandating a file upload step we have rejected), we need a purpose-built client interface rather than relying on the OpenAI SDK alone. The design of that interface is covered in [?].

## Decision

We implement a fine-tuning service as a custom Docker container added to the `llm-service` stack. Users interact with it via a Splinter SDK, which handles authentication against LiteLLM API keys and a separate fine-tuning access whitelist.

The service exposes:

```
POST /v1/fine_tuning/jobs — submit job, returns job ID + queued status
GET /v1/fine_tuning/jobs/{id} — poll status
GET /v1/fine_tuning/jobs — list user's jobs
POST /v1/fine_tuning/jobs/{id}/cancel — cancel a queued or running job
```

Job submissions return immediately with a job ID and `queued` status. All training is asynchronous.
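
As a rough illustration of the request flow (the payload fields shown here are assumptions for illustration, not the final schema; in practice the Splinter SDK wraps this call):

```python
import requests

# Hypothetical submission; field names are illustrative only.
resp = requests.post(
    "https://splinter.example.org/v1/fine_tuning/jobs",
    headers={"Authorization": "Bearer <litellm-key>"},
    json={
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "dataset": "my-org/my-dataset",   # HF Hub dataset repo path
        "hf_token": "hf_xxx",             # read (dataset) + write (adapter) scope
    },
    timeout=30,
)
job = resp.json()
print(job["id"], job["status"])  # returns immediately with a queued job
```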

### Training framework

[Axolotl](https://docs.axolotl.ai/) is used as the training framework, invoked as a subprocess by the service. It provides LoRA and QLoRA support, handles model loading from the HF Hub, and has stable support for the model families we serve (Qwen).

### Job queue and state

The service maintains a job queue backed by a SQLite database on a named Docker volume. This provides:

- Serialisation of jobs against the single available GPU
- Crash recovery: jobs that were `running` at startup are marked `failed` on restart, rather than hanging indefinitely
- Status polling without in-memory state

A future migration to the existing PostgreSQL instance is possible if cross-service visibility becomes a requirement.
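
A minimal sketch of what the DB-backed state and startup recovery could look like (table and column names are assumptions, not the service's actual schema):

```python
import sqlite3

DB_PATH = "/data/jobs.db"  # on the named Docker volume (path illustrative)

def init_db() -> sqlite3.Connection:
    """Open the job DB and recover from any previous crash."""
    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               id TEXT PRIMARY KEY,
               user_id TEXT,
               status TEXT,          -- queued | running | succeeded | failed | cancelled
               submitted_at TEXT,
               error TEXT
           )"""
    )
    # Crash recovery: anything still 'running' at startup was orphaned by a
    # restart and can never complete, so mark it failed rather than let it hang.
    conn.execute(
        "UPDATE jobs SET status = 'failed', error = 'interrupted by restart' "
        "WHERE status = 'running'"
    )
    conn.commit()
    return conn
```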

### Training time limits

Each job has a configurable maximum wall-clock duration (default: 4 hours). The service enforces this by terminating the Axolotl subprocess once the limit is reached and marking the job as `failed`. This prevents a single user from monopolising the GPU indefinitely.
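
One way the limit could be enforced around the subprocess (a sketch under the assumption that the worker uses `subprocess.run` with a timeout; names are illustrative):

```python
import os
import subprocess

MAX_SECONDS = int(os.environ.get("MAX_JOB_DURATION_HOURS", "4")) * 3600

def run_training(config_path: str) -> bool:
    """Run Axolotl and enforce the wall-clock limit; True means success."""
    try:
        proc = subprocess.run(["axolotl", "train", config_path], timeout=MAX_SECONDS)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child on timeout; the caller marks the job failed
        return False
```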

### HF token handling

The HF token is used at job execution time to pull the dataset and push the completed adapter. It is:

- Not written to disk beyond what the HF Hub client requires transiently
- Not persisted to the job state database after the job completes

Users are responsible for supplying a token with appropriate scope (read access to the dataset repository, write access to the adapter destination). The service validates token validity at job submission time and fails fast if the token is invalid or insufficiently scoped.
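
A sketch of the fail-fast check at submission time (not the service's actual validation; verifying write scope up front is harder and omitted here):

```python
from huggingface_hub import HfApi
from huggingface_hub.utils import HfHubHTTPError

def validate_hf_token(token: str, dataset_repo: str) -> None:
    """Reject the job at submission if the token is invalid or cannot read the dataset."""
    api = HfApi(token=token)
    try:
        api.whoami()                    # token is valid at all
        api.dataset_info(dataset_repo)  # token can read the dataset repository
    except HfHubHTTPError as exc:
        raise ValueError(f"HF token validation failed: {exc}") from exc
```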

Note: if requests pass through LiteLLM, the HF token will appear in its PostgreSQL request log. This is an acceptable risk given that the PostgreSQL instance is not externally accessible and HF tokens are revocable.

### Monitoring

The service exports Prometheus metrics on a `/metrics` endpoint, scraped by the existing Prometheus instance:

- Job queue depth (by status: `queued`, `running`, `failed`, `succeeded`)
- Job duration (histogram)
- GPU utilisation during training (via DCGM, already instrumented)
- HF pull and push durations
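
Metric names below are illustrative assumptions, shown with `prometheus_client` to make the shape of the instrumentation concrete:

```python
from prometheus_client import Gauge, Histogram

JOB_QUEUE_DEPTH = Gauge(
    "finetuning_jobs", "Number of jobs by status", labelnames=["status"]
)
JOB_DURATION = Histogram(
    "finetuning_job_duration_seconds", "Wall-clock duration of completed jobs"
)
HUB_TRANSFER_DURATION = Histogram(
    "finetuning_hub_transfer_seconds",
    "HF dataset pull / adapter push duration",
    labelnames=["direction"],
)
```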

### GPU allocation

The fine-tuning service is allocated one H100 via `CUDA_VISIBLE_DEVICES`. The specific GPU to allocate is TBD pending a review of current GPU utilisation across the embedding, speech, and image generation services.

## Consequences

**Benefits:**

- Researchers can fine-tune domain-adapted models without cloud API access or local hardware, fulfilling FR-FT-01 and FR-FT-02.
- No file storage infrastructure required: datasets live on Hugging Face, adapters are returned there. The service itself is stateless with respect to artifacts.
- The existing auth layer (LiteLLM API keys, Bearer token validation, fail2ban) covers the request path; fine-tuning-specific access control is handled via the whitelist in the Splinter SDK layer.
- Crash recovery via DB-backed job state prevents ghost jobs.
- Training time limits protect the shared GPU from runaway jobs.

**Tradeoffs and limitations:**

- Single GPU, serialised queue: a busy period could mean significant wait times for users who submit large jobs. We have no current mechanism for estimating or communicating queue wait time to users. This is a known gap.
- HF tokens appear in LiteLLM's PostgreSQL request log.
- Adapter serving is out of scope. Users who want to run inference against their fine-tuned model must load it themselves, or wait for a future self-service model onboarding workflow (Phase 3 of the PRD).
23 changes: 23 additions & 0 deletions docs/ADRs/finetuning/012-finetuning-service-skeleton.md
@@ -0,0 +1,23 @@
# ADR-012. Fine-Tuning Service Skeleton

Date: 2026-04-22
Status: Proposed

## Context

With the high-level service design established in ADR-011, the skeleton implementation required a set of concrete technical decisions: stack layout, base image, and service framework.

## Decision

**Separate stack.** The fine-tuning service lives in `stacks/finetuning-service/` rather than extending the llm-service stack. This mirrors the monitoring stack pattern and allows the service to be brought up and down independently.

**Base image.** `axolotlai/axolotl:main-py3.12-cu130-2.10.0` is used, pinned to a specific tag; updates are a deliberate decision, identical to our LiteLLM versioning. The `-uv` variant was considered for consistency with the team's preference for uv, but it locks down its Python installation and prevents packages from being installed on top of it, which is exactly what we need to do to add FastAPI and uvicorn. The standard image with `pip` is used instead.

**FastAPI** for the service framework, with **SQLite** backed by a named Docker volume for job queue state. This is sufficient for a single-worker serialised queue and avoids a dependency on the existing PostgreSQL instance.

**Networking** between the fine-tuning service and LiteLLM (for future API key validation) uses `host.docker.internal` rather than joining the llm-service Docker network as an external network. This keeps the stacks decoupled.

## Consequences

- Hyperparameter configuration for training jobs is deferred; the job submission schema carries only the fields needed to identify the job.
- SQLite is sufficient now but migration to PostgreSQL remains possible if cross-service visibility is needed later.
60 changes: 60 additions & 0 deletions docs/ADRs/finetuning/013-axolotl-training-implementation.md
@@ -0,0 +1,60 @@
# ADR-013. Axolotl Training Implementation

Date: 2026-04-27
Status: Proposed

## Context

With the service skeleton established in ADR-012, the next decisions concerned the actual training pipeline: how Axolotl is invoked, how GPU utilisation is maximised on H100 hardware, how evaluation is handled and how optional integrations (Weights & Biases) are exposed to users.

## Decision

### Axolotl invocation

Axolotl is invoked as a subprocess (`axolotl train <config.yaml>`) rather than imported as a library. This keeps Axolotl's CUDA environment self-contained.

A per-job Axolotl config is generated at runtime by merging a service-level base config (`config.yaml`, mounted read-only) with the user's job parameters. The result is written to a temporary directory (`/tmp/finetuning/{job_id}/`) and cleaned up unconditionally in a `finally` block after the subprocess exits.
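
A sketch of the merge-and-clean-up flow (paths match the text; the merge strategy and function name are assumptions):

```python
import shutil
import subprocess
from pathlib import Path

import yaml

BASE_CONFIG = Path("/app/config.yaml")  # mounted read-only

def run_job(job_id: str, user_params: dict) -> int:
    """Generate a per-job config, run Axolotl, and always clean up."""
    job_dir = Path(f"/tmp/finetuning/{job_id}")
    job_dir.mkdir(parents=True, exist_ok=True)
    config_path = job_dir / "config.yaml"
    try:
        merged = {**yaml.safe_load(BASE_CONFIG.read_text()), **user_params}
        config_path.write_text(yaml.safe_dump(merged))
        return subprocess.run(["axolotl", "train", str(config_path)]).returncode
    finally:
        shutil.rmtree(job_dir, ignore_errors=True)  # unconditional cleanup
```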

### Worker as a separate container

The worker runs in its own container using the Axolotl base image (`axolotlai/axolotl:main-py3.12-cu130-2.10.0`). The API uses a separate lightweight Python image. Merging them would require the API to carry the full Axolotl image (several GB) for no benefit.

### Flash Attention 4

The Axolotl image ships Flash Attention 2 (FA2). On CUDA 13 / H100 hardware, FA2 produced a `CUBLAS_STATUS_INVALID_VALUE` error in the RoPE computation during evaluation, crashing jobs before training began. Installing Flash Attention 4 (`flash-attn-4[cu13]==4.0.0b10`) resolved this. FA4 is the architecturally correct choice for Hopper GPUs (H100) and is explicitly recommended by Axolotl in its startup logs for this hardware. The `[cu13]` extra selects the CUDA 13 wheel.

FA4 is pinned to a specific beta version. Upgrading is a deliberate decision identical to how we pin the Axolotl base image.

### Sample packing

Sample packing is enabled by default (`sample_packing: true`) and disabled for evaluation (`eval_sample_packing: false`). Without it, sequences are padded individually to `sequence_len`, resulting in approximately 55% padding waste on typical instruction-tuning datasets. With sample packing, multiple conversations are packed end-to-end into each sequence slot, using Flash Attention masking to prevent cross-conversation attention. This improved trainable token density to ~65% and GPU throughput by ~9x on our test dataset.

Sample packing is not applied during evaluation: the eval set is usually small and packing adds complexity without meaningfully improving evaluation speed.

### Default sequence length

The default `sequence_len` is 2048. The original default of 512 dropped approximately 37% of training samples from our representative dataset (max sequence length ~1716 tokens). 2048 retains all sequences and, combined with sample packing, allows more conversations per packed slot. Users may override this per job.

### Evaluation control

Evaluation is opt-in. Jobs default to `do_eval: false`, which injects `eval_strategy: "no"` into the generated Axolotl config, overriding the base config's `eval_strategy: epoch`. When `do_eval: true`, the `validation` split of the user's dataset is used and evaluation runs at the end of each epoch.

This is opt-in rather than opt-out because many Hugging Face datasets do not include a `validation` split; silently failing a job because the split is absent is a worse experience than requiring users to explicitly request evaluation. SDK documentation will specify that users must include a `validation` split if they set `do_eval: true`.
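
Roughly, the key injected on top of the base config (a sketch; the real generation logic may differ):

```python
def eval_overrides(do_eval: bool) -> dict:
    """Evaluation-related keys merged into the generated Axolotl config."""
    # The base config already sets eval_strategy: epoch; opted-out jobs override it.
    return {} if do_eval else {"eval_strategy": "no"}
```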

### Weights & Biases integration

Optional wandb logging is supported via three fields: `wandb_token`, `wandb_project`, and `wandb_entity`. The token is handled identically to the HF token: stored in the database, passed to the Axolotl subprocess as `WANDB_API_KEY` and cleared to `NULL` on job completion.

Validation rules:
- `wandb_project` and `wandb_token` must be provided together.
- `wandb_entity` (a team/organisation name) may be omitted; wandb defaults to the user's personal account.
- Providing `wandb_entity` without `wandb_project` is rejected.

Omitting `wandb_entity` alone is therefore the only partially specified combination that is permitted, reflecting wandb's own defaults.
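
The rules could be expressed as a Pydantic validator along these lines (a sketch; the field names match the text but the class itself is illustrative):

```python
from pydantic import BaseModel, model_validator

class WandbSettings(BaseModel):
    """Validation of the three optional wandb fields on job submission."""
    wandb_token: str | None = None
    wandb_project: str | None = None
    wandb_entity: str | None = None

    @model_validator(mode="after")
    def check_combinations(self) -> "WandbSettings":
        if (self.wandb_token is None) != (self.wandb_project is None):
            raise ValueError("wandb_token and wandb_project must be provided together")
        if self.wandb_entity is not None and self.wandb_project is None:
            raise ValueError("wandb_entity requires wandb_project")
        return self
```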

## Consequences

- Training jobs on H100 / CUDA 13 hardware work with FA4.
- Sample packing significantly improves GPU utilisation but compresses the effective number of training steps per epoch. For small datasets, users should be aware that a large `micro_batch_size` relative to the dataset size can result in very few optimiser steps per epoch (see the arithmetic sketch after this list).
- Evaluation requires users to know their dataset structure. No automatic detection of available splits is performed.
- Wandb tokens are treated as sensitive credentials and cleared after use, consistent with the HF token policy established in ADR-011.
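
A back-of-envelope example of the packing/steps interaction referenced above (all numbers are illustrative):

```python
import math

packed_sequences = 120              # small dataset after packing into 2048-token slots
micro_batch_size = 8
gradient_accumulation_steps = 4

steps_per_epoch = math.ceil(
    packed_sequences / (micro_batch_size * gradient_accumulation_steps)
)
print(steps_per_epoch)  # 4 optimiser steps per epoch, i.e. very few updates
```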
36 changes: 36 additions & 0 deletions docs/ADRs/finetuning/014-finetuning-service-auth.md
@@ -0,0 +1,36 @@
# ADR-014. Fine-Tuning Service Authentication

Date: 2026-04-28
Status: Proposed

## Context

The fine-tuning service exposes HTTP endpoints for submitting and managing training jobs. Without authentication, any user on the network could submit jobs, consuming GPU time and potentially exfiltrating model adapters. Splinter already operates a LiteLLM proxy that issues and manages API keys for all users, making it the natural authority for key validation.

## Decision

### LiteLLM key validation

All `/v1/fine_tuning/*` endpoints require a Bearer token. On each request the service calls LiteLLM's `/key/info` endpoint using the service's own master key to verify the token. A 200 response means the key is valid and active; anything else returns 401 to the caller. The `/health` endpoint is left unauthenticated.

This avoids maintaining a second credential store. Users present the same API key they already use for inference.

### FastAPI dependency injection

The auth logic is implemented as a FastAPI dependency (`verify_litellm_key`) applied at the router level rather than per-route. This ensures new routes are protected by default without any per-route ceremony.
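
A sketch of the dependency (the exact request shape of LiteLLM's `/key/info` call is an assumption; names otherwise follow the text):

```python
import os

import httpx
from fastapi import APIRouter, Depends, HTTPException
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer

LITELLM_URL = "http://litellm-proxy:4000"
MASTER_KEY = os.environ["LITELLM_MASTER_KEY"]
bearer = HTTPBearer()

async def verify_litellm_key(
    creds: HTTPAuthorizationCredentials = Depends(bearer),
) -> str:
    """Check the caller's Bearer token against LiteLLM's /key/info endpoint."""
    async with httpx.AsyncClient() as client:
        resp = await client.get(
            f"{LITELLM_URL}/key/info",
            params={"key": creds.credentials},
            headers={"Authorization": f"Bearer {MASTER_KEY}"},
        )
    if resp.status_code != 200:
        raise HTTPException(status_code=401, detail="Invalid API key")
    return creds.credentials

# Applied once at the router level; every route added to this router is protected.
router = APIRouter(dependencies=[Depends(verify_litellm_key)])
```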

### Inter-container networking

The API container reaches LiteLLM via a shared external Docker network (`finetuning_default`) rather than `host.docker.internal`. LiteLLM is bound to `127.0.0.1` on the host (loopback only), so `host.docker.internal` (which resolves to the Docker bridge gateway, not loopback) cannot reach it. Joining both containers to a shared network allows the API to address LiteLLM directly by container name (`http://litellm-proxy:4000`), the same pattern used by the monitoring stack.

### User allowlist

An optional `whitelist.txt` file (one LiteLLM user ID per line) can be placed alongside the compose file to restrict access to specific users. The user ID is extracted from the `/key/info` response. If `whitelist.txt` is absent, any valid LiteLLM key is accepted. The file is gitignored and never committed; `whitelist.txt.example` is committed as a template.

Reading the allowlist on every request (rather than at startup) means the list can be updated without restarting the service.
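
A sketch of the per-request allowlist check (the location of `whitelist.txt` inside the container is an assumption):

```python
from pathlib import Path

WHITELIST = Path("/app/whitelist.txt")

def user_allowed(user_id: str) -> bool:
    """Re-read the allowlist on every call so edits apply without a restart."""
    if not WHITELIST.exists():
        return True  # no whitelist file: any valid LiteLLM key is accepted
    allowed = {
        line.strip() for line in WHITELIST.read_text().splitlines() if line.strip()
    }
    return user_id in allowed
```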

## Consequences

- Users must include `Authorization: Bearer <litellm-key>` on all fine-tuning requests.
- Each authenticated request incurs one additional HTTP call to LiteLLM. This is acceptable given that fine-tuning job submission is infrequent and not latency sensitive.
- The LiteLLM master key must be present in the finetuning service `.env` as `LITELLM_MASTER_KEY`.
19 changes: 18 additions & 1 deletion pyproject.toml
@@ -3,6 +3,7 @@ name = "splinter"
version = "0.1.0"
description = "Add your description here"
requires-python = ">=3.12"

dependencies = [
"ipykernel>=7.2.0",
"matplotlib>=3.10.8",
@@ -12,5 +13,21 @@ dependencies = [
"numpy>=1.26",
"openai>=2.21.0",
"rich>=14.3.2",
"vllm>=0.15.1",
"vllm>=0.15.1; sys_platform == 'linux'",
]

[dependency-groups]
dev = [
"pre-commit>=4.6.0",
"ruff>=0.15.11",
]

[tool.ruff]
line-length = 79

[tool.ruff.lint]
select = ["E", "W", "F", "I", "D", "ANN"]
ignore = ["ANN401"]

[tool.ruff.lint.pydocstyle]
convention = "google"
24 changes: 24 additions & 0 deletions stacks/finetuning-service/.env.example
@@ -0,0 +1,24 @@
# =============================================================================
# Environment Configuration for Fine-Tuning Service
# =============================================================================
#
# IMPORTANT:
# - Copy this to .env and fill in real values
# - Never commit .env to git (it should be in the .gitignore)
# - Keep this .env.example as a template
#
# =============================================================================

# -----------------------------------------------------------------------------
# Fine-Tuning Service
# -----------------------------------------------------------------------------
DEVICE=3
# Multi-gpu not currently supported
FINETUNING_PORT=8005
MAX_JOB_DURATION_HOURS=4

# -----------------------------------------------------------------------------
# LiteLLM (for API key validation)
# -----------------------------------------------------------------------------
LITELLM_PORT=4000
LITELLM_MASTER_KEY=sk-CHANGE_ME_TO_SOMETHING_SECURE
19 changes: 19 additions & 0 deletions stacks/finetuning-service/Dockerfile.api
@@ -0,0 +1,19 @@
# =============================================================================
# Dockerfile.api - Fine-Tuning Service API
# =============================================================================
#
# Lightweight image for the FastAPI service only.
# The Axolotl worker uses a separate image (Dockerfile.worker).
#
# =============================================================================

FROM python:3.12-slim

WORKDIR /app

COPY requirements.api.txt .
RUN pip install -r requirements.api.txt

COPY app/ ./app/

CMD ["sh", "-c", "uvicorn app.main:app --host 0.0.0.0 --port ${FINETUNING_PORT}"]
22 changes: 22 additions & 0 deletions stacks/finetuning-service/Dockerfile.worker
@@ -0,0 +1,22 @@
# =============================================================================
# Dockerfile.worker - Fine-Tuning Worker
# =============================================================================
#
# Axolotl-based image for the queue worker that runs training jobs.
# The FastAPI service uses a separate lightweight image (Dockerfile.api).
#
# Update the tag deliberately when upgrading Axolotl.
#
# =============================================================================

FROM axolotlai/axolotl:main-py3.12-cu130-2.10.0

WORKDIR /app

COPY requirements.worker.txt .
RUN pip install -r requirements.worker.txt

COPY app/ ./app/
COPY worker.py .

CMD ["python", "worker.py"]
1 change: 1 addition & 0 deletions stacks/finetuning-service/app/__init__.py
@@ -0,0 +1 @@
"""Application for the Splinter fine-tuning service."""