33 commits
04bebc9
feat(docs): add initial draft ADR for fine-tuning service
fingriffin Apr 20, 2026
c326f49
feat(docs): update ADR-011 with SDK decision
fingriffin Apr 22, 2026
b2a9294
refactor(docs): create dedicated ADRs folder for fine-tuning service
fingriffin Apr 22, 2026
432f4ca
chore(docs): remove mention of SDK ADR
fingriffin Apr 22, 2026
6b845fe
feat(stacks): add lightweight docker compose for fine-tuning service
fingriffin Apr 22, 2026
5991c90
feat(stacks): add Dockerfile for fine-tuning service build using axol…
fingriffin Apr 22, 2026
539dcce
feat(stacks): add fine-tuning service FastAPI application
fingriffin Apr 22, 2026
b7cf684
feat(docs): add ADR for fine-tuning service skeleton
fingriffin Apr 22, 2026
0f6f80b
Merge pull request #26 from acceleratescience/feature/ft-service-skel…
fingriffin Apr 22, 2026
04a9d52
chore: fix typo on endpoint name
fingriffin Apr 22, 2026
211ac93
chore: migrate to hardcoded axolotl base image
fingriffin Apr 23, 2026
76afb95
fix(docker): override system python guard for requirements installation
fingriffin Apr 23, 2026
b3c7943
feat(stacks): fallback to plain python axolotl base image
fingriffin Apr 23, 2026
b3e21ab
feat(stacks): add lightweight user config for fine-tuning service and…
fingriffin Apr 23, 2026
c2c4683
feat(stacks): add axolotl subprocess with separate worker
fingriffin Apr 23, 2026
a499968
chore: add chat template to test dataset and initialise db from axolo…
fingriffin Apr 23, 2026
f91ff8e
feat(stacks): let device for fine-tuning service be a environment var…
fingriffin Apr 24, 2026
b6f4eef
feat(deps): add FA4
fingriffin Apr 24, 2026
95e0393
fix: pin FA4 version with cu13 wheels, increase default sequence_len,…
fingriffin Apr 24, 2026
d44b181
feat(stacks): add do_eval as key in fine-tuning service
fingriffin Apr 27, 2026
3d7e2f3
feat(stacks): add wandb tracking
fingriffin Apr 27, 2026
97394f6
feat(docs): add ADR for axolotl implementation
fingriffin Apr 27, 2026
e05ebb9
Merge pull request #27 from acceleratescience/feature/axolotl
fingriffin Apr 27, 2026
e86dcc1
feat(stacks): add auth layer to FastAPI for fine-tuning service
fingriffin Apr 27, 2026
e3bb938
feat(stacks): add docker network between llm and fine-tuning service
fingriffin Apr 28, 2026
68d7351
feat(stacks): add user id whitelist for fine-tuning service
fingriffin Apr 28, 2026
d010f8c
fix: migrate to gitignored whitelist for user auth
fingriffin Apr 28, 2026
46f5ef7
fix: fix lite llm port typo
fingriffin Apr 28, 2026
07d53d9
Merge pull request #28 from acceleratescience/feature/auth
fingriffin Apr 28, 2026
4272903
feat(docs): add ADR for authentication
fingriffin Apr 29, 2026
9240d20
chore(docs): fix typo in adr title
fingriffin Apr 29, 2026
8fdac36
feat(nginx): add fine-tuning service to nginx config
fingriffin Apr 29, 2026
4deef0d
fix: use python.urllib for health check
fingriffin May 1, 2026
4 changes: 3 additions & 1 deletion .gitignore
@@ -14,4 +14,6 @@ postgres_data/
.DS_Store
Thumbs.db

.idea/
.idea/

whitelist.txt
9 changes: 9 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,9 @@
repos:
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.15.11
hooks:
- id: ruff-check
types: [python]
args: [--fix]
- id: ruff-format
types: [python]
105 changes: 105 additions & 0 deletions docs/ADRs/finetuning/011-finetuning-service.md
@@ -0,0 +1,105 @@
# ADR-011. Fine-Tuning Service

Date: 2026-04-22
Status: Proposed

## Context

The PRD (FR-FT-01, FR-FT-02) identifies LoRA/QLoRA fine-tuning as a "Should have" for Phase 2. The primary motivation is to allow researchers to adapt open-weight models to domain-specific tasks (e.g. biology, materials science) without requiring cloud API access or local GPU hardware.

Several constraints shaped this design:

- **One GPU available**: At the time of writing, all four H100s are allocated to inference services. One GPU can be provisioned for fine-tuning, but it must be treated as a shared, serialised resource (concurrent training jobs are not feasible).
- **No file upload infrastructure**: The existing Nginx configuration restricts `client_max_body_size` to 10MB and there is no file storage layer. Building one would add significant operational complexity and security risk for little gain.
- **No per-user adapter serving**: Serving a fine-tuned adapter for a single user would require either dedicated GPU capacity or a hot-swap mechanism on a shared inference instance. Neither is practical at this scale. Adapters must be returned to users via Hugging Face rather than hosted.
- **Access control**: LiteLLM has no mechanism to restrict fine-tuning access to a subset of users. A separate whitelist is required; fine-tuning cannot simply inherit the existing API key auth layer without granting all key holders access to submit training jobs.
- **Security posture**: User-supplied credentials (HuggingFace tokens) must not be persisted in logs or the database beyond what is strictly necessary.

### Considered approaches for dataset delivery

**File upload via `/v1/files`**: LiteLLM routes this endpoint and it reaches the service (returning a 500 indicating `files_settings` is not configured, rather than a 404). However, enabling file storage requires configuring a backing store, raises questions about retention and quotas, and adds attack surface. Rejected.

**HuggingFace Hub reference**: Users provide a HF dataset repository path and a scoped HF token. The training service pulls the dataset directly from the Hub at job start. This avoids any file storage infrastructure on our side and is well-matched to how researchers already manage data. Selected.

### Considered approaches for adapter delivery

**Serve locally via vLLM LoRA**: vLLM supports dynamic LoRA adapter loading via `--enable-lora` and a `/v1/load_lora_adapter` endpoint. However, this means hosting a persistent model endpoint for one user's adapter, which is not a scalable use of GPU memory. Rejected.

**Push to user's HuggingFace Hub**: The training service uses the user's HF token (which must be write-scoped) to push the completed adapter back to their Hub. The user then has full ownership of the artifact and can load it however they choose. Selected.

### Considered approaches for client interface

Because fine-tuning access cannot be universally granted, and because the fine-tuning schema does not map cleanly onto the OpenAI fine-tuning spec (which requires a `file.id` to be submitted, mandating a file upload step we have rejected), we need a purpose-built client interface rather than relying on the OpenAI SDK alone. The design of that interface is covered in [?].

## Decision

We implement a fine-tuning service as a custom Docker container added to the `llm-service` stack. Users interact with it via a Splinter SDK, which handles authentication against LiteLLM API keys and a separate fine-tuning access whitelist.

The service exposes:

```
POST /v1/fine_tuning/jobs — submit job, returns job ID + queued status
GET /v1/fine_tuning/jobs/{id} — poll status
GET /v1/fine_tuning/jobs — list user's jobs
POST /v1/fine_tuning/jobs/{id}/cancel — cancel a queued or running job
```

Job submissions return immediately with a job ID and `queued` status. All training is asynchronous.
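
As a rough illustration of the request flow (the payload fields shown here are assumptions for illustration, not the final schema; in practice the Splinter SDK wraps this call):

```python
import requests

# Hypothetical submission; field names are illustrative only.
resp = requests.post(
    "https://splinter.example.org/v1/fine_tuning/jobs",
    headers={"Authorization": "Bearer <litellm-key>"},
    json={
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "dataset": "my-org/my-dataset",   # HF Hub dataset repo path
        "hf_token": "hf_xxx",             # read (dataset) + write (adapter) scope
    },
    timeout=30,
)
job = resp.json()
print(job["id"], job["status"])  # returns immediately with a queued job
```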

### Training framework

[Axolotl](https://docs.axolotl.ai/) is used as the training framework, invoked as a subprocess by the service. It provides LoRA and QLoRA support, handles model loading from the HF Hub, and has stable support for the model families we serve (Qwen).

### Job queue and state

The service maintains a job queue backed by a SQLite database on a named Docker volume. This provides:

- Serialisation of jobs against the single available GPU
- Crash recovery: jobs that were `running` at startup are marked `failed` on restart, rather than hanging indefinitely
- Status polling without in-memory state

A future migration to the existing PostgreSQL instance is possible if cross-service visibility becomes a requirement.
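
A minimal sketch of what the DB-backed state and startup recovery could look like (table and column names are assumptions, not the service's actual schema):

```python
import sqlite3

DB_PATH = "/data/jobs.db"  # on the named Docker volume (path illustrative)

def init_db() -> sqlite3.Connection:
    """Open the job DB and recover from any previous crash."""
    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               id TEXT PRIMARY KEY,
               user_id TEXT,
               status TEXT,          -- queued | running | succeeded | failed | cancelled
               submitted_at TEXT,
               error TEXT
           )"""
    )
    # Crash recovery: anything still 'running' at startup was orphaned by a
    # restart and can never complete, so mark it failed rather than let it hang.
    conn.execute(
        "UPDATE jobs SET status = 'failed', error = 'interrupted by restart' "
        "WHERE status = 'running'"
    )
    conn.commit()
    return conn
```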

### Training time limits

Each job has a configurable maximum wall-clock duration (default: 4 hours). The service enforces this by terminating the Axolotl subprocess once the limit is reached and marking the job as `failed`. This prevents a single user from monopolising the GPU indefinitely.
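
One way the limit could be enforced around the subprocess (a sketch under the assumption that the worker uses `subprocess.run` with a timeout; names are illustrative):

```python
import os
import subprocess

MAX_SECONDS = int(os.environ.get("MAX_JOB_DURATION_HOURS", "4")) * 3600

def run_training(config_path: str) -> bool:
    """Run Axolotl and enforce the wall-clock limit; True means success."""
    try:
        proc = subprocess.run(["axolotl", "train", config_path], timeout=MAX_SECONDS)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child on timeout; the caller marks the job failed
        return False
```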

### HF token handling

The HF token is used at job execution time to pull the dataset and push the completed adapter. It is:

- Not written to disk beyond what the HF Hub client requires transiently
- Not persisted to the job state database after the job completes

Users are responsible for supplying a token with appropriate scope (read access to the dataset repository, write access to the adapter destination). The service validates token validity at job submission time and fails fast if the token is invalid or insufficiently scoped.
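
A sketch of the fail-fast check at submission time (not the service's actual validation; verifying write scope up front is harder and omitted here):

```python
from huggingface_hub import HfApi
from huggingface_hub.utils import HfHubHTTPError

def validate_hf_token(token: str, dataset_repo: str) -> None:
    """Reject the job at submission if the token is invalid or cannot read the dataset."""
    api = HfApi(token=token)
    try:
        api.whoami()                    # token is valid at all
        api.dataset_info(dataset_repo)  # token can read the dataset repository
    except HfHubHTTPError as exc:
        raise ValueError(f"HF token validation failed: {exc}") from exc
```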

Note: if requests pass through LiteLLM, the HF token will appear in its PostgreSQL request log. This is an acceptable risk given that the PostgreSQL instance is not externally accessible and HF tokens are revocable.

### Monitoring

The service exports Prometheus metrics on a `/metrics` endpoint, scraped by the existing Prometheus instance:

- Job queue depth (by status: `queued`, `running`, `failed`, `succeeded`)
- Job duration (histogram)
- GPU utilisation during training (via DCGM, already instrumented)
- HF pull and push durations
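
Metric names below are illustrative assumptions, shown with `prometheus_client` to make the shape of the instrumentation concrete:

```python
from prometheus_client import Gauge, Histogram

JOB_QUEUE_DEPTH = Gauge(
    "finetuning_jobs", "Number of jobs by status", labelnames=["status"]
)
JOB_DURATION = Histogram(
    "finetuning_job_duration_seconds", "Wall-clock duration of completed jobs"
)
HUB_TRANSFER_DURATION = Histogram(
    "finetuning_hub_transfer_seconds",
    "HF dataset pull / adapter push duration",
    labelnames=["direction"],
)
```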

### GPU allocation

The fine-tuning service is allocated one H100 via `CUDA_VISIBLE_DEVICES`. The specific GPU to allocate is TBD pending a review of current GPU utilisation across the embedding, speech, and image generation services.

## Consequences

**Benefits:**

- Researchers can fine-tune domain-adapted models without cloud API access or local hardware, fulfilling FR-FT-01 and FR-FT-02.
- No file storage infrastructure required: datasets live on Hugging Face, adapters are returned there. The service itself is stateless with respect to artifacts.
- The existing auth layer (LiteLLM API keys, Bearer token validation, fail2ban) covers the request path; fine-tuning-specific access control is handled via the whitelist in the Splinter SDK layer.
- Crash recovery via DB-backed job state prevents ghost jobs.
- Training time limits protect the shared GPU from runaway jobs.

**Tradeoffs and limitations:**

- Single GPU, serialised queue: a busy period could mean significant wait times for users who submit large jobs. We have no current mechanism for estimating or communicating queue wait time to users. This is a known gap.
- HF tokens appear in LiteLLM's PostgreSQL request log.
- Adapter serving is out of scope. Users who want to run inference against their fine-tuned model must load it themselves, or wait for a future self-service model onboarding workflow (Phase 3 of the PRD).
23 changes: 23 additions & 0 deletions docs/ADRs/finetuning/012-finetuning-service-skeleton.md
@@ -0,0 +1,23 @@
# ADR-012. Fine-Tuning Service Skeleton

Date: 2026-04-22
Status: Proposed

## Context

With the high-level service design established in ADR-011, the skeleton implementation required a set of concrete technical decisions: stack layout, base image, and service framework.

## Decision

**Separate stack.** The fine-tuning service lives in `stacks/finetuning-service/` rather than extending the llm-service stack. This mirrors the monitoring stack pattern and allows the service to be brought up and down independently.

**Base image.** `axolotlai/axolotl:main-py3.12-cu130-2.10.0` is used, pinned to a specific tag; updates are a deliberate decision, identical to our LiteLLM versioning. The `-uv` variant was considered for consistency with the team's preference for uv, but it locks down its Python installation and prevents packages from being installed on top of it, which is exactly what we need to do to add FastAPI and uvicorn. The standard image with `pip` is used instead.

**FastAPI** for the service framework, with **SQLite** backed by a named Docker volume for job queue state. This is sufficient for a single-worker serialised queue and avoids a dependency on the existing PostgreSQL instance.

**Networking** between the fine-tuning service and LiteLLM (for future API key validation) uses `host.docker.internal` rather than joining the llm-service Docker network as an external network. This keeps the stacks decoupled.

## Consequences

- Hyperparameter configuration for training jobs is deferred; the job submission schema carries only the fields needed to identify the job.
- SQLite is sufficient now but migration to PostgreSQL remains possible if cross-service visibility is needed later.
60 changes: 60 additions & 0 deletions docs/ADRs/finetuning/013-axolotl-training-implementation.md
@@ -0,0 +1,60 @@
# ADR-013. Axolotl Training Implementation

Date: 2026-04-27
Status: Proposed

## Context

With the service skeleton established in ADR-012, the next decisions concerned the actual training pipeline: how Axolotl is invoked, how GPU utilisation is maximised on H100 hardware, how evaluation is handled and how optional integrations (Weights & Biases) are exposed to users.

## Decision

### Axolotl invocation

Axolotl is invoked as a subprocess (`axolotl train <config.yaml>`) rather than imported as a library. This keeps Axolotl's CUDA environment self-contained.

A per-job Axolotl config is generated at runtime by merging a service-level base config (`config.yaml`, mounted read-only) with the user's job parameters. The result is written to a temporary directory (`/tmp/finetuning/{job_id}/`) and cleaned up unconditionally in a `finally` block after the subprocess exits.
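
A sketch of the merge-and-clean-up flow (paths match the text; the merge strategy and function name are assumptions):

```python
import shutil
import subprocess
from pathlib import Path

import yaml

BASE_CONFIG = Path("/app/config.yaml")  # mounted read-only

def run_job(job_id: str, user_params: dict) -> int:
    """Generate a per-job config, run Axolotl, and always clean up."""
    job_dir = Path(f"/tmp/finetuning/{job_id}")
    job_dir.mkdir(parents=True, exist_ok=True)
    config_path = job_dir / "config.yaml"
    try:
        merged = {**yaml.safe_load(BASE_CONFIG.read_text()), **user_params}
        config_path.write_text(yaml.safe_dump(merged))
        return subprocess.run(["axolotl", "train", str(config_path)]).returncode
    finally:
        shutil.rmtree(job_dir, ignore_errors=True)  # unconditional cleanup
```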

### Worker as a separate container

The worker runs in its own container using the Axolotl base image (`axolotlai/axolotl:main-py3.12-cu130-2.10.0`). The API uses a separate lightweight Python image. Merging them would require the API to carry the full Axolotl image (several GB) for no benefit.

### Flash Attention 4

The Axolotl image ships Flash Attention 2 (FA2). On CUDA 13 / H100 hardware, FA2 produced a `CUBLAS_STATUS_INVALID_VALUE` error in the RoPE computation during evaluation, crashing jobs before training began. Installing Flash Attention 4 (`flash-attn-4[cu13]==4.0.0b10`) resolved this. FA4 is the architecturally correct choice for Hopper GPUs (H100) and is explicitly recommended by Axolotl in its startup logs for this hardware. The `[cu13]` extra selects the CUDA 13 wheel.

FA4 is pinned to a specific beta version. Upgrading is a deliberate decision identical to how we pin the Axolotl base image.

### Sample packing

Sample packing is enabled by default (`sample_packing: true`) and disabled for evaluation (`eval_sample_packing: false`). Without it, sequences are padded individually to `sequence_len`, resulting in approximately 55% padding waste on typical instruction-tuning datasets. With sample packing, multiple conversations are packed end-to-end into each sequence slot, using Flash Attention masking to prevent cross-conversation attention. This improved trainable token density to ~65% and GPU throughput by ~9x on our test dataset.

Sample packing is not applied during evaluation: the eval set is usually small and packing adds complexity without meaningfully improving evaluation speed.

### Default sequence length

The default `sequence_len` is 2048. The original default of 512 dropped approximately 37% of training samples from our representative dataset (max sequence length ~1716 tokens). 2048 retains all sequences and, combined with sample packing, allows more conversations per packed slot. Users may override this per job.

### Evaluation control

Evaluation is opt-in. Jobs default to `do_eval: false`, which injects `eval_strategy: "no"` into the generated Axolotl config, overriding the base config's `eval_strategy: epoch`. When `do_eval: true`, the `validation` split of the user's dataset is used and evaluation runs at the end of each epoch.

This is opt-in rather than opt-out because many Hugging Face datasets do not include a `validation` split; silently failing a job because the split is absent is a worse experience than requiring users to explicitly request evaluation. SDK documentation will specify that users must include a `validation` split if they set `do_eval: true`.
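
Roughly, the key injected on top of the base config (a sketch; the real generation logic may differ):

```python
def eval_overrides(do_eval: bool) -> dict:
    """Evaluation-related keys merged into the generated Axolotl config."""
    # The base config already sets eval_strategy: epoch; opted-out jobs override it.
    return {} if do_eval else {"eval_strategy": "no"}
```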

### Weights & Biases integration

Optional wandb logging is supported via three fields: `wandb_token`, `wandb_project`, and `wandb_entity`. The token is handled identically to the HF token: stored in the database, passed to the Axolotl subprocess as `WANDB_API_KEY` and cleared to `NULL` on job completion.

Validation rules:
- `wandb_project` and `wandb_token` must be provided together.
- `wandb_entity` (a team/organisation name) may be omitted; wandb defaults to the user's personal account.
- Providing `wandb_entity` without `wandb_project` is rejected.

Omitting `wandb_entity` alone is therefore the only partially specified combination that is permitted, reflecting wandb's own defaults.
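
The rules could be expressed as a Pydantic validator along these lines (a sketch; the field names match the text but the class itself is illustrative):

```python
from pydantic import BaseModel, model_validator

class WandbSettings(BaseModel):
    """Validation of the three optional wandb fields on job submission."""
    wandb_token: str | None = None
    wandb_project: str | None = None
    wandb_entity: str | None = None

    @model_validator(mode="after")
    def check_combinations(self) -> "WandbSettings":
        if (self.wandb_token is None) != (self.wandb_project is None):
            raise ValueError("wandb_token and wandb_project must be provided together")
        if self.wandb_entity is not None and self.wandb_project is None:
            raise ValueError("wandb_entity requires wandb_project")
        return self
```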

## Consequences

- Training jobs on H100 / CUDA 13 hardware work with FA4.
- Sample packing significantly improves GPU utilisation but compresses the effective number of training steps per epoch. For small datasets, users should be aware that a large `micro_batch_size` relative to the dataset size can result in very few optimiser steps per epoch (see the arithmetic sketch after this list).
- Evaluation requires users to know their dataset structure. No automatic detection of available splits is performed.
- Wandb tokens are treated as sensitive credentials and cleared after use, consistent with the HF token policy established in ADR-011.
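
A back-of-envelope example of the packing/steps interaction referenced above (all numbers are illustrative):

```python
import math

packed_sequences = 120              # small dataset after packing into 2048-token slots
micro_batch_size = 8
gradient_accumulation_steps = 4

steps_per_epoch = math.ceil(
    packed_sequences / (micro_batch_size * gradient_accumulation_steps)
)
print(steps_per_epoch)  # 4 optimiser steps per epoch, i.e. very few updates
```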
36 changes: 36 additions & 0 deletions docs/ADRs/finetuning/014-finetuning-service-auth.md
@@ -0,0 +1,36 @@
# ADR-014. Fine-Tuning Service Authentication

Date: 2026-04-28
Status: Proposed

## Context

The fine-tuning service exposes HTTP endpoints for submitting and managing training jobs. Without authentication, any user on the network could submit jobs, consuming GPU time and potentially exfiltrating model adapters. Splinter already operates a LiteLLM proxy that issues and manages API keys for all users, making it the natural authority for key validation.

## Decision

### LiteLLM key validation

All `/v1/fine_tuning/*` endpoints require a Bearer token. On each request the service calls LiteLLM's `/key/info` endpoint using the service's own master key to verify the token. A 200 response means the key is valid and active; anything else returns 401 to the caller. The `/health` endpoint is left unauthenticated.

This avoids maintaining a second credential store. Users present the same API key they already use for inference.

### FastAPI dependency injection

The auth logic is implemented as a FastAPI dependency (`verify_litellm_key`) applied at the router level rather than per-route. This ensures new routes are protected by default without any per-route ceremony.
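
A sketch of the dependency (the exact request shape of LiteLLM's `/key/info` call is an assumption; names otherwise follow the text):

```python
import os

import httpx
from fastapi import APIRouter, Depends, HTTPException
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer

LITELLM_URL = "http://litellm-proxy:4000"
MASTER_KEY = os.environ["LITELLM_MASTER_KEY"]
bearer = HTTPBearer()

async def verify_litellm_key(
    creds: HTTPAuthorizationCredentials = Depends(bearer),
) -> str:
    """Check the caller's Bearer token against LiteLLM's /key/info endpoint."""
    async with httpx.AsyncClient() as client:
        resp = await client.get(
            f"{LITELLM_URL}/key/info",
            params={"key": creds.credentials},
            headers={"Authorization": f"Bearer {MASTER_KEY}"},
        )
    if resp.status_code != 200:
        raise HTTPException(status_code=401, detail="Invalid API key")
    return creds.credentials

# Applied once at the router level; every route added to this router is protected.
router = APIRouter(dependencies=[Depends(verify_litellm_key)])
```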

### Inter-container networking

The API container reaches LiteLLM via a shared external Docker network (`finetuning_default`) rather than `host.docker.internal`. LiteLLM is bound to `127.0.0.1` on the host (loopback only), so `host.docker.internal` (which resolves to the Docker bridge gateway, not loopback) cannot reach it. Joining both containers to a shared network allows the API to address LiteLLM directly by container name (`http://litellm-proxy:4000`), the same pattern used by the monitoring stack.

### User allowlist

An optional `whitelist.txt` file (one LiteLLM user ID per line) can be placed alongside the compose file to restrict access to specific users. The user ID is extracted from the `/key/info` response. If `whitelist.txt` is absent, any valid LiteLLM key is accepted. The file is gitignored and never committed; `whitelist.txt.example` is committed as a template.

Reading the allowlist on every request (rather than at startup) means the list can be updated without restarting the service.
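
A sketch of the per-request allowlist check (the location of `whitelist.txt` inside the container is an assumption):

```python
from pathlib import Path

WHITELIST = Path("/app/whitelist.txt")

def user_allowed(user_id: str) -> bool:
    """Re-read the allowlist on every call so edits apply without a restart."""
    if not WHITELIST.exists():
        return True  # no whitelist file: any valid LiteLLM key is accepted
    allowed = {
        line.strip() for line in WHITELIST.read_text().splitlines() if line.strip()
    }
    return user_id in allowed
```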

## Consequences

- Users must include `Authorization: Bearer <litellm-key>` on all fine-tuning requests.
- Each authenticated request incurs one additional HTTP call to LiteLLM. This is acceptable given that fine-tuning job submission is infrequent and not latency sensitive.
- The LiteLLM master key must be present in the finetuning service `.env` as `LITELLM_MASTER_KEY`.
19 changes: 18 additions & 1 deletion pyproject.toml
@@ -3,6 +3,7 @@ name = "splinter"
version = "0.1.0"
description = "Add your description here"
requires-python = ">=3.12"

dependencies = [
"ipykernel>=7.2.0",
"matplotlib>=3.10.8",
@@ -12,5 +13,21 @@ dependencies = [
"numpy>=1.26",
"openai>=2.21.0",
"rich>=14.3.2",
"vllm>=0.15.1",
"vllm>=0.15.1; sys_platform == 'linux'",
]

[dependency-groups]
dev = [
"pre-commit>=4.6.0",
"ruff>=0.15.11",
]

[tool.ruff]
line-length = 79

[tool.ruff.lint]
select = ["E", "W", "F", "I", "D", "ANN"]
ignore = ["ANN401"]

[tool.ruff.lint.pydocstyle]
convention = "google"
24 changes: 24 additions & 0 deletions stacks/finetuning-service/.env.example
@@ -0,0 +1,24 @@
# =============================================================================
# Environment Configuration for Fine-Tuning Service
# =============================================================================
#
# IMPORTANT:
# - Copy this to .env and fill in real values
# - Never commit .env to git (it should be in the .gitignore)
# - Keep this .env.example as a template
#
# =============================================================================

# -----------------------------------------------------------------------------
# Fine-Tuning Service
# -----------------------------------------------------------------------------
DEVICE=3
# Multi-gpu not currently supported
FINETUNING_PORT=8005
MAX_JOB_DURATION_HOURS=4

# -----------------------------------------------------------------------------
# LiteLLM (for API key validation)
# -----------------------------------------------------------------------------
LITELLM_PORT=4000
LITELLM_MASTER_KEY=sk-CHANGE_ME_TO_SOMETHING_SECURE
19 changes: 19 additions & 0 deletions stacks/finetuning-service/Dockerfile.api
@@ -0,0 +1,19 @@
# =============================================================================
# Dockerfile.api - Fine-Tuning Service API
# =============================================================================
#
# Lightweight image for the FastAPI service only.
# The Axolotl worker uses a separate image (Dockerfile.worker).
#
# =============================================================================

FROM python:3.12-slim

WORKDIR /app

COPY requirements.api.txt .
RUN pip install -r requirements.api.txt

COPY app/ ./app/

CMD ["sh", "-c", "uvicorn app.main:app --host 0.0.0.0 --port ${FINETUNING_PORT}"]
22 changes: 22 additions & 0 deletions stacks/finetuning-service/Dockerfile.worker
@@ -0,0 +1,22 @@
# =============================================================================
# Dockerfile.worker - Fine-Tuning Worker
# =============================================================================
#
# Axolotl-based image for the queue worker that runs training jobs.
# The FastAPI service uses a separate lightweight image (Dockerfile.api).
#
# Update the tag deliberately when upgrading Axolotl.
#
# =============================================================================

FROM axolotlai/axolotl:main-py3.12-cu130-2.10.0

WORKDIR /app

COPY requirements.worker.txt .
RUN pip install -r requirements.worker.txt

COPY app/ ./app/
COPY worker.py .

CMD ["python", "worker.py"]
1 change: 1 addition & 0 deletions stacks/finetuning-service/app/__init__.py
@@ -0,0 +1 @@
"""Application for the Splinter fine-tuning service."""