Kernel-ML/llm-deploy-kit

llm-deploy-kit

Reference implementations and tooling for production LLM deployment.

Overview

Deploying fine-tuned LLMs in production with low latency, output quality safeguards, and high reliability requires a set of well-tested patterns that most teams end up rebuilding from scratch. This library provides configuration schemas, a hallucination guardrail pipeline, a latency profiler, quantization configs, and a thread-safe metrics collector. Together they cover the main concerns in taking an LLM from training to production.

These patterns come from deploying Llama-based LLMs at Intuit (the Listen4U system: sub-600ms response time, 96% test coverage) and from the Intuit-wide LLM deployment documentation adopted by at least three ML teams.

Modules

| Module | Purpose |
| --- | --- |
| llm_deploy.finetune | Pydantic configs for LoRA/QLoRA fine-tuning jobs |
| llm_deploy.serve | Serving configs with health checks and autoscaling policies |
| llm_deploy.guard | Guardrail pipeline for output validation and confidence scoring |
| llm_deploy.optimize | Latency profiler and quantization configuration |
| llm_deploy.monitor | Thread-safe metrics collector for LLM serving workloads |

Installation

pip install llm-deploy-kit

Or with uv:

uv add llm-deploy-kit

Quick Start

Fine-tuning configuration

from llm_deploy.finetune.config import FineTuneConfig, LoRAConfig, TrainingConfig

config = FineTuneConfig(
    base_model="meta-llama/Llama-3-8B",
    method="qlora",
    dataset_path="data/train.jsonl",
    lora=LoRAConfig(r=16, alpha=32),
    training=TrainingConfig(num_epochs=3, learning_rate=2e-4),
)
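
The README does not show what validation these configs perform. As a rough illustration of the idea only, the sketch below uses standard-library dataclasses rather than Pydantic, and every name in it (`LoRASketch`, `FineTuneSketch`, `scaling`) is hypothetical and not part of `llm_deploy.finetune`. It assumes the config rejects unknown methods and non-positive LoRA ranks, and it exposes the usual LoRA scaling factor `alpha / r`.

```python
from dataclasses import dataclass, field


# Hypothetical stand-ins for illustration; NOT the llm_deploy.finetune classes.
@dataclass(frozen=True)
class LoRASketch:
    r: int = 16
    alpha: int = 32

    def __post_init__(self) -> None:
        # Assumed rule: rank and alpha must both be positive.
        if self.r <= 0:
            raise ValueError("r must be a positive integer")
        if self.alpha <= 0:
            raise ValueError("alpha must be a positive integer")

    @property
    def scaling(self) -> float:
        # Standard LoRA scaling applied to the low-rank update.
        return self.alpha / self.r


@dataclass(frozen=True)
class FineTuneSketch:
    base_model: str
    method: str
    dataset_path: str
    lora: LoRASketch = field(default_factory=LoRASketch)

    def __post_init__(self) -> None:
        # Assumed rule: only the two methods named in the README are accepted.
        if self.method not in {"lora", "qlora"}:
            raise ValueError(f"unsupported method: {self.method!r}")
```

With `r=16` and `alpha=32`, as in the example above, the scaling factor works out to 2.0. Failing at construction time means a bad config is caught before a training job is launched.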

Guardrail pipeline

from llm_deploy.guard.validators import OutputValidator
from llm_deploy.guard.pipeline import GuardrailPipeline
from llm_deploy.guard.confidence import ConfidenceScorer  # not used in this minimal example

# Build a pipeline from one or more validators.
pipeline = GuardrailPipeline([
    OutputValidator(max_length=500, banned_phrases=["I'm not sure"]),
])

# model_output is the text your model returned for a single request.
result = pipeline.validate(model_output)
if not result.passed:
    print(result.issues)
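
To make the pattern concrete, here is a minimal standalone sketch of a validator chain. It is not the `llm_deploy.guard` implementation. The names `max_length_check`, `banned_phrase_check`, `SketchResult`, and `run_checks` are hypothetical, and it assumes that length is measured in characters and that banned phrases match case-insensitively.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# A check takes the model output and returns a list of issue messages.
Check = Callable[[str], List[str]]


def max_length_check(limit: int) -> Check:
    def check(text: str) -> List[str]:
        # Assumption: the limit is in characters, not tokens.
        if len(text) > limit:
            return [f"output length {len(text)} exceeds limit {limit}"]
        return []
    return check


def banned_phrase_check(phrases: List[str]) -> Check:
    lowered = [p.lower() for p in phrases]

    def check(text: str) -> List[str]:
        # Assumption: matching is a case-insensitive substring search.
        haystack = text.lower()
        return [f"banned phrase found: {p!r}" for p in lowered if p in haystack]
    return check


@dataclass
class SketchResult:
    issues: List[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        # The output passes only when no check reported an issue.
        return not self.issues


def run_checks(text: str, checks: List[Check]) -> SketchResult:
    issues: List[str] = []
    for check in checks:
        issues.extend(check(text))  # collect every issue, do not stop early
    return SketchResult(issues)
```

Collecting all issues instead of stopping at the first failure gives the caller a complete picture of why an output was rejected, which is useful when deciding whether to retry or fall back.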

Latency profiling

from llm_deploy.optimize.profiler import LatencyProfiler

# model is your loaded model; test_prompts is a list of representative inputs.
profiler = LatencyProfiler(fn=model.generate)
profile = profiler.run(inputs=test_prompts)
print(profile.summary())
# Example output:
# Latency Profile
#   p50: 145ms | p95: 312ms | p99: 487ms
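
For readers unfamiliar with how p50/p95/p99 figures are derived, the sketch below times a callable over a set of inputs and reports nearest-rank percentiles. It is an illustration under stated assumptions, not the `LatencyProfiler` implementation: the helper names `percentile` and `profile_latency` are hypothetical, and the library may use a different percentile method, warm-up runs, or repeated trials.

```python
import math
import time
from typing import Callable, Dict, Iterable, List


def percentile(sorted_values: List[float], pct: int) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    if not sorted_values:
        raise ValueError("no samples")
    # Rank is 1-based; clamp to 1 so very small percentiles stay in range.
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def profile_latency(fn: Callable[[str], object], inputs: Iterable[str]) -> Dict[str, float]:
    """Call fn once per input and return p50/p95/p99 latency in milliseconds."""
    samples_ms: List[float] = []
    for item in inputs:
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn(item)
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    samples_ms.sort()
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}
```

Tail percentiles need enough samples to be meaningful: with fewer than 100 inputs, p99 under the nearest-rank method is simply the slowest observed call.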

Metrics collection

from llm_deploy.monitor.metrics import MetricsCollector, RequestMetrics

collector = MetricsCollector()
collector.record(RequestMetrics(
    request_id="req_001",
    time_to_first_token_ms=42.0,
    total_latency_ms=145.0,
    input_tokens=128,
    output_tokens=64,
    guardrail_passed=True,
    fallback_triggered=False,
))
print(collector.aggregate().summary())
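
The README describes the collector as thread-safe without showing how. One common way to get that property is to guard shared state with a lock, sketched below. This is an assumption about the general approach rather than the `MetricsCollector` implementation, and the names `RequestSample` and `LockedCollector` are hypothetical.

```python
import threading
from dataclasses import dataclass
from typing import Dict, List


# Hypothetical, trimmed-down record; the real RequestMetrics has more fields.
@dataclass(frozen=True)
class RequestSample:
    total_latency_ms: float
    output_tokens: int
    guardrail_passed: bool


class LockedCollector:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._samples: List[RequestSample] = []

    def record(self, sample: RequestSample) -> None:
        # Serialize writers so concurrent request handlers cannot interleave.
        with self._lock:
            self._samples.append(sample)

    def aggregate(self) -> Dict[str, float]:
        # Copy under the lock, then compute outside it to keep the
        # critical section short.
        with self._lock:
            snapshot = list(self._samples)
        count = len(snapshot)
        if count == 0:
            return {"count": 0}
        return {
            "count": count,
            "mean_latency_ms": sum(s.total_latency_ms for s in snapshot) / count,
            "total_output_tokens": sum(s.output_tokens for s in snapshot),
            "guardrail_pass_rate": sum(s.guardrail_passed for s in snapshot) / count,
        }
```

Taking a snapshot before aggregating means a slow summary computation never blocks request threads that are trying to record new samples.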

Development

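# Install the project with all optional extras
uv sync --all-extras
# Run the test suite with coverage
uv run pytest tests/ -v --cov=src
# Sort imports and format the code
uv run isort src/ tests/ && uv run black src/ tests/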

License

Apache 2.0
