Reference implementations and tooling for production LLM deployment.
Deploying fine-tuned LLMs in production with low latency, output quality safeguards, and high reliability requires a set of well-tested patterns that most teams end up rebuilding from scratch. This library provides configuration schemas, a hallucination guardrail pipeline, a latency profiler, quantization configs, and a thread-safe metrics collector — covering the key concerns in taking an LLM from training to production.
Based on patterns from deploying Llama-based LLMs at Intuit (Listen4U system, sub-600ms response time, 96% test coverage) and the Intuit-wide LLM deployment documentation adopted by 3+ ML teams.

| Module | Purpose |
|---|---|
| `llm_deploy.finetune` | Pydantic configs for LoRA/QLoRA fine-tuning jobs |
| `llm_deploy.serve` | Serving configs with health checks and autoscaling policies |
| `llm_deploy.guard` | Guardrail pipeline for output validation and confidence scoring |
| `llm_deploy.optimize` | Latency profiler and quantization configuration |
| `llm_deploy.monitor` | Thread-safe metrics collector for LLM serving workloads |

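
To make "thread-safe metrics collector" concrete, here is a minimal, standalone sketch of the idea. It is not the implementation in `llm_deploy.monitor`: the class name `SimpleCollector` and its methods are hypothetical, and the nearest-rank percentile rule is an assumption chosen for simplicity. It only illustrates guarding shared state with a lock and summarizing latencies as p50/p95/p99.

```python
import math
import threading
from concurrent.futures import ThreadPoolExecutor


class SimpleCollector:
    """Hypothetical stand-in, NOT llm_deploy.monitor's MetricsCollector."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._latencies_ms: list[float] = []

    def record(self, latency_ms: float) -> None:
        # The lock serializes appends coming from concurrent request handlers.
        with self._lock:
            self._latencies_ms.append(latency_ms)

    def percentile(self, pct: float) -> float:
        # Take a sorted snapshot under the lock, then compute outside it.
        with self._lock:
            data = sorted(self._latencies_ms)
        if not data:
            raise ValueError("no samples recorded")
        # Nearest-rank percentile (an assumption; other definitions interpolate).
        rank = max(1, math.ceil(pct * len(data) / 100))
        return data[rank - 1]


collector = SimpleCollector()
with ThreadPoolExecutor(max_workers=8) as pool:
    # Record synthetic latencies of 1..100 ms from several worker threads.
    list(pool.map(collector.record, [float(i) for i in range(1, 101)]))

p50, p95, p99 = (collector.percentile(p) for p in (50, 95, 99))
print(f"p50: {p50:.0f}ms | p95: {p95:.0f}ms | p99: {p99:.0f}ms")
# p50: 50ms | p95: 95ms | p99: 99ms
```

Sorting a snapshot instead of holding the lock during the percentile computation keeps `record` calls from blocking on reads for longer than the copy takes; a production collector would likely also bound memory, for example with a fixed-size window.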
Install with pip:

```bash
pip install llm-deploy-kit
```

Or with uv:

```bash
uv add llm-deploy-kit
```

Define a fine-tuning job:

```python
from llm_deploy.finetune.config import FineTuneConfig, LoRAConfig, TrainingConfig

config = FineTuneConfig(
    base_model="meta-llama/Llama-3-8B",
    method="qlora",
    dataset_path="data/train.jsonl",
    lora=LoRAConfig(r=16, alpha=32),
    training=TrainingConfig(num_epochs=3, learning_rate=2e-4),
)
```

Validate model output with the guardrail pipeline:

```python
from llm_deploy.guard.validators import OutputValidator
from llm_deploy.guard.pipeline import GuardrailPipeline
from llm_deploy.guard.confidence import ConfidenceScorer

pipeline = GuardrailPipeline([
    OutputValidator(max_length=500, banned_phrases=["I'm not sure"]),
])

# model_output is the generated output you want to check.
result = pipeline.validate(model_output)
if not result.passed:
    print(result.issues)
```

Profile generation latency:

```python
from llm_deploy.optimize.profiler import LatencyProfiler

# model and test_prompts are your own model object and prompt list.
profiler = LatencyProfiler(fn=model.generate)
profile = profiler.run(inputs=test_prompts)
print(profile.summary())
# Latency Profile
# p50: 145ms | p95: 312ms | p99: 487ms
```

Record and aggregate request metrics:

```python
from llm_deploy.monitor.metrics import MetricsCollector, RequestMetrics

collector = MetricsCollector()
collector.record(RequestMetrics(
    request_id="req_001",
    time_to_first_token_ms=42.0,
    total_latency_ms=145.0,
    input_tokens=128,
    output_tokens=64,
    guardrail_passed=True,
    fallback_triggered=False,
))
print(collector.aggregate().summary())
```

For development, install dependencies, run the tests, and format the code:

```bash
uv sync --all-extras
uv run pytest tests/ -v --cov=src
uv run isort src/ tests/ && uv run black src/ tests/
```

License: Apache 2.0