llm-deploy-kit

Reference implementations and tooling for production LLM deployment.

Overview

Deploying fine-tuned LLMs in production with low latency, output quality safeguards, and high reliability takes a set of well-tested patterns that most teams end up rebuilding from scratch. This library provides configuration schemas, a hallucination guardrail pipeline, a latency profiler, quantization configs, and a thread-safe metrics collector, covering the main concerns in taking an LLM from training to production.

The patterns are drawn from deploying Llama-based LLMs at Intuit (the Listen4U system: sub-600ms response time, 96% test coverage) and from the Intuit-wide LLM deployment documentation adopted by 3+ ML teams.

Modules

Module	Purpose
`llm_deploy.finetune`	Pydantic configs for LoRA/QLoRA fine-tuning jobs
`llm_deploy.serve`	Serving configs with health checks and autoscaling policies
`llm_deploy.guard`	Guardrail pipeline for output validation and confidence scoring
`llm_deploy.optimize`	Latency profiler and quantization configuration
`llm_deploy.monitor`	Thread-safe metrics collector for LLM serving workloads

Installation

pip install llm-deploy-kit

Or with uv:

uv add llm-deploy-kit

Quick Start

Fine-tuning configuration

from llm_deploy.finetune.config import FineTuneConfig, LoRAConfig, TrainingConfig

config = FineTuneConfig(
    base_model="meta-llama/Llama-3-8B",
    method="qlora",
    dataset_path="data/train.jsonl",
    lora=LoRAConfig(r=16, alpha=32),
    training=TrainingConfig(num_epochs=3, learning_rate=2e-4),
)
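For readers new to LoRA, the sketch below illustrates what `r=16, alpha=32` implies under the standard LoRA formulation: the low-rank update is scaled by `alpha / r`, and each adapted weight matrix of shape `d_out x d_in` gains `r * (d_in + d_out)` trainable parameters. `LoRASketch` is a hypothetical stand-in written for this illustration and is not the library's `LoRAConfig`; the 4096 x 4096 projection shape is likewise an assumed example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoRASketch:
    """Hypothetical stand-in for illustration; not the library's LoRAConfig."""

    r: int      # rank of the low-rank update
    alpha: int  # scaling numerator

    def __post_init__(self) -> None:
        if self.r <= 0 or self.alpha <= 0:
            raise ValueError("r and alpha must be positive")

    @property
    def scaling(self) -> float:
        # Standard LoRA scales the update B @ A by alpha / r.
        return self.alpha / self.r

    def trainable_params(self, d_in: int, d_out: int) -> int:
        # A has shape (r, d_in) and B has shape (d_out, r).
        return self.r * (d_in + d_out)


sketch = LoRASketch(r=16, alpha=32)
print(sketch.scaling)                       # 2.0
print(sketch.trainable_params(4096, 4096))  # 131072
```

With these values, one adapted 4096 x 4096 matrix trains 131,072 parameters instead of the 16,777,216 in the full matrix.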

Guardrail pipeline

from llm_deploy.guard.validators import OutputValidator
from llm_deploy.guard.pipeline import GuardrailPipeline
from llm_deploy.guard.confidence import ConfidenceScorer

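# Text produced by your model; a placeholder value is used here.
model_output = "The refund was issued on March 3."

pipeline = GuardrailPipeline([
    OutputValidator(max_length=500, banned_phrases=["I'm not sure"]),
])

# Run every validator; `issues` lists what failed.
result = pipeline.validate(model_output)
if not result.passed:
    print(result.issues)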
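To make the idea of a validation pipeline concrete, here is a minimal standalone sketch using only the standard library. It is not the library's implementation: `SketchResult`, `make_output_check`, and `run_checks` are hypothetical names, and it assumes `max_length` counts characters and that banned phrases are matched case-insensitively, neither of which is confirmed for `OutputValidator`.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# A check takes the model output and returns a list of issue descriptions.
Check = Callable[[str], List[str]]


@dataclass
class SketchResult:
    """Hypothetical result type: passed is True only when no issues were found."""

    passed: bool
    issues: List[str] = field(default_factory=list)


def make_output_check(max_length: int, banned_phrases: List[str]) -> Check:
    """Build a check for length (in characters, an assumption) and banned phrases."""

    def check(text: str) -> List[str]:
        issues = []
        if len(text) > max_length:
            issues.append(f"output length {len(text)} exceeds {max_length}")
        lowered = text.lower()
        for phrase in banned_phrases:
            if phrase.lower() in lowered:
                issues.append(f"banned phrase found: {phrase!r}")
        return issues

    return check


def run_checks(text: str, checks: List[Check]) -> SketchResult:
    """Run every check and collect all issues rather than stopping at the first."""
    issues = [issue for check in checks for issue in check(text)]
    return SketchResult(passed=not issues, issues=issues)


checks = [make_output_check(max_length=500, banned_phrases=["I'm not sure"])]
ok = run_checks("Your refund was issued.", checks)
bad = run_checks("I'm not sure, but it may have been issued.", checks)
print(ok.passed)   # True
print(bad.passed)  # False
print(bad.issues)
```

Collecting every issue instead of failing fast makes the result more useful for logging and for deciding whether to trigger a fallback response.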

Latency profiling

from llm_deploy.optimize.profiler import LatencyProfiler

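# `model` and `test_prompts` are placeholders for your own model object and prompt list.
profiler = LatencyProfiler(fn=model.generate)
profile = profiler.run(inputs=test_prompts)
print(profile.summary())
# Example output:
# Latency Profile
#   p50: 145ms | p95: 312ms | p99: 487ms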
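For context on the p50/p95/p99 figures, the sketch below shows one common way to derive them: time each call, sort the samples, and take nearest-rank percentiles. It is an illustration only. `percentile` and `profile_latency` are hypothetical helpers, and `LatencyProfiler` may use a different percentile definition or add warm-up runs.

```python
import math
import time
from typing import Callable, Dict, Iterable, List


def percentile(sorted_ms: List[float], pct: float) -> float:
    """Nearest-rank percentile of an ascending list of samples."""
    if not sorted_ms:
        raise ValueError("no samples")
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(sorted_ms) / 100))
    return sorted_ms[rank - 1]


def profile_latency(fn: Callable, inputs: Iterable) -> Dict[str, float]:
    """Call fn once per input and report p50/p95/p99 latency in milliseconds."""
    samples = []
    for item in inputs:
        start = time.perf_counter()
        fn(item)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}


# Percentiles on a known distribution: the values 1.0 .. 100.0.
data = [float(i) for i in range(1, 101)]
print(percentile(data, 50), percentile(data, 95), percentile(data, 99))  # 50.0 95.0 99.0

# Timing a trivial stand-in function in place of a real model call.
stats = profile_latency(lambda prompt: prompt.upper(), ["hello"] * 50)
print(sorted(stats))  # ['p50', 'p95', 'p99']
```

Tail percentiles such as p99 need enough samples to be meaningful; with fewer than 100 inputs, the nearest-rank p99 is simply the slowest call.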

Metrics collection

from llm_deploy.monitor.metrics import MetricsCollector, RequestMetrics

collector = MetricsCollector()
collector.record(RequestMetrics(
    request_id="req_001",
    time_to_first_token_ms=42.0,
    total_latency_ms=145.0,
    input_tokens=128,
    output_tokens=64,
    guardrail_passed=True,
    fallback_triggered=False,
))
print(collector.aggregate().summary())
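As a rough illustration of what "thread-safe" means for a collector like this, the sketch below guards a shared list with a `threading.Lock` and aggregates over a snapshot. It is not the library's `MetricsCollector`: `RequestSample` and `LockedCollector` are hypothetical, use a reduced set of fields, and return a plain dict rather than an object with `summary()`.

```python
import threading
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RequestSample:
    """Hypothetical, reduced version of a per-request metrics record."""

    total_latency_ms: float
    output_tokens: int
    guardrail_passed: bool


class LockedCollector:
    """Hypothetical collector: every access to shared state holds the lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._samples: List[RequestSample] = []

    def record(self, sample: RequestSample) -> None:
        with self._lock:
            self._samples.append(sample)

    def aggregate(self) -> Dict[str, float]:
        # Copy under the lock, then compute outside it so writers are not blocked.
        with self._lock:
            snapshot = list(self._samples)
        if not snapshot:
            return {"count": 0}
        count = len(snapshot)
        return {
            "count": count,
            "mean_latency_ms": sum(s.total_latency_ms for s in snapshot) / count,
            "guardrail_pass_rate": sum(s.guardrail_passed for s in snapshot) / count,
        }


sketch_collector = LockedCollector()


def worker() -> None:
    for _ in range(1000):
        sketch_collector.record(RequestSample(145.0, 64, True))


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

summary = sketch_collector.aggregate()
print(summary["count"])  # 8000
```

Taking a snapshot before aggregating keeps the lock held only briefly, which matters when request threads record metrics on the hot path.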

Development

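# Install the project with all optional extras
uv sync --all-extras
# Run the test suite with coverage
uv run pytest tests/ -v --cov=src
# Sort imports and format the code
uv run isort src/ tests/ && uv run black src/ tests/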

License

Apache 2.0
