# HelixLLM

Enterprise-grade distributed LLM system built in Go with Gin Gonic: a single binary with a mode system that enables flexible deployment, from single-host development to multi-host production clusters.
HelixLLM provides fully compatible OpenAI and Anthropic APIs, local LLM inference via llama.cpp, a RAG knowledge pipeline, a ReAct agent system with tool calling, and a control plane for multi-host cluster management -- all served over HTTP/3 with automatic HTTP/2 fallback.
## Features

- OpenAI and Anthropic compatible APIs -- any existing SDK client works without modification
- Local LLM inference via llama.cpp with CUDA, Metal, and ROCm support
- Multi-provider fallback chain -- auto-discovers free models from 7+ cloud providers (Chutes, OpenRouter, HuggingFace, Nvidia, Cerebras, SambaNova, Together), scores them via LLMsVerifier, routes through the ranked chain with automatic 429/5xx failover, llama.cpp as guaranteed last resort
- RAG knowledge pipeline -- document ingestion, chunking, embedding, vector search
- ReAct agent system with tool calling, conversation sessions, and RAG integration
- HTTP/3 (QUIC) server with automatic HTTP/2 fallback and TLS 1.3
- Multi-host distribution -- SSH-based probing, scheduling, and container deployment
- Mode system -- run as `full` (all-in-one), `gateway`, `brain`, `knowledge`, `agents`, or `control`
- Brotli and gzip compression with automatic content negotiation
- SSE streaming matching the OpenAI/Anthropic `text/event-stream` format
- API key and JWT authentication with rate limiting
- Prometheus metrics and OpenTelemetry tracing
- 43 Go submodules providing production-grade infrastructure
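The SSE streaming mentioned above follows OpenAI's chunked `text/event-stream` convention, where each chunk arrives as a `data: {...}` line and the stream ends with a `data: [DONE]` sentinel. A minimal client-side parsing sketch in Go (illustrative only, not code from the HelixLLM codebase):

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// chunk mirrors the subset of an OpenAI-style streaming chunk we care about.
type chunk struct {
	Choices []struct {
		Delta struct {
			Content string `json:"content"`
		} `json:"delta"`
	} `json:"choices"`
}

// collectSSE reads an OpenAI-style event stream and concatenates the
// content deltas until the "[DONE]" sentinel.
func collectSSE(stream string) (string, error) {
	var b strings.Builder
	sc := bufio.NewScanner(strings.NewReader(stream))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, "data:") {
			continue // ignore blank separator lines and comments
		}
		payload := strings.TrimSpace(strings.TrimPrefix(line, "data:"))
		if payload == "[DONE]" {
			break
		}
		var c chunk
		if err := json.Unmarshal([]byte(payload), &c); err != nil {
			return "", err
		}
		for _, ch := range c.Choices {
			b.WriteString(ch.Delta.Content)
		}
	}
	return b.String(), sc.Err()
}

func main() {
	sample := "data: {\"choices\":[{\"delta\":{\"content\":\"Hel\"}}]}\n\n" +
		"data: {\"choices\":[{\"delta\":{\"content\":\"lo\"}}]}\n\n" +
		"data: [DONE]\n"
	text, err := collectSSE(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(text) // Hello
}
```

In a real client the reader would wrap the HTTP response body instead of a string.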
## Quick Start

```shell
# Clone with submodules
git clone --recurse-submodules https://github.com/HelixDevelopment/HelixLLM.git
cd HelixLLM

# Copy and edit configuration
cp .env.example .env

# Generate TLS certificates and start in full mode
make dev
```

The server starts on `https://localhost:8443` with all subsystems active.
## API Endpoints

### OpenAI-compatible

| Method | Endpoint | Description |
|---|---|---|
| POST | `/v1/chat/completions` | Chat completions (SSE streaming with `stream: true`) |
| POST | `/v1/completions` | Text completions |
| GET | `/v1/models` | List available models |
| GET | `/v1/models/:id` | Get model details |
| POST | `/v1/embeddings` | Generate embeddings |
### Anthropic-compatible

| Method | Endpoint | Description |
|---|---|---|
| POST | `/v1/messages` | Messages API (SSE streaming with `stream: true`) |
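Requests to this endpoint follow Anthropic's Messages API schema, in which `max_tokens` is required and the system prompt sits outside the messages array. An illustrative body (the model name is a placeholder):

```json
{
  "model": "claude-3-5-sonnet-latest",
  "max_tokens": 1024,
  "system": "You are a concise assistant.",
  "messages": [
    {"role": "user", "content": "Hello, HelixLLM!"}
  ],
  "stream": true
}
```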
### Agents

| Method | Endpoint | Description |
|---|---|---|
| POST | `/v1/agents/chat` | Run agent loop with optional session tracking |
| GET | `/v1/agents/tools` | List available tools |
### Knowledge (internal)

| Method | Endpoint | Description |
|---|---|---|
| POST | `/internal/knowledge/ingest` | Ingest documents into vector store |
| POST | `/internal/knowledge/query` | Query knowledge base |
| GET | `/internal/knowledge/collections` | List collections |
| GET | `/internal/knowledge/stats` | Knowledge base statistics |
### Cluster (internal)

| Method | Endpoint | Description |
|---|---|---|
| GET | `/internal/cluster/status` | Cluster health and deployment status |
| POST | `/internal/cluster/probe` | Probe all configured hosts |
| POST | `/internal/cluster/deploy` | Schedule and deploy services |
| POST | `/internal/cluster/rebalance` | Rebalance service placement |
### Health

| Method | Endpoint | Description |
|---|---|---|
| GET | `/internal/health` | Aggregated health check |
## Mode System

HelixLLM compiles to a single binary that operates in one of six modes:
| Mode | Role |
|---|---|
| `full` | All-in-one, single process (development and single-host production) |
| `gateway` | API surface: HTTP/3, OpenAI/Anthropic compatibility, auth, streaming |
| `brain` | LLM coordination: routing, llama.cpp RPC, cloud providers |
| `knowledge` | RAG pipeline: retrieval, embeddings, vector store, ingestion |
| `agents` | Agent system: ReAct loop, tools, conversation context |
| `control` | Cluster management: host probing, scheduling, deployment, monitoring |
In `full` mode, all layers communicate via direct Go function calls with zero network overhead. In distributed mode, the same binary runs on multiple hosts in different modes, communicating via gRPC, SSE, and Kafka.
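The mode switch can be pictured as a small enum parsed from `HELIX_MODE`. The sketch below is illustrative only and is not the project's actual `internal/mode` package:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Mode selects which subsystems a HelixLLM process activates.
type Mode string

const (
	ModeFull      Mode = "full"
	ModeGateway   Mode = "gateway"
	ModeBrain     Mode = "brain"
	ModeKnowledge Mode = "knowledge"
	ModeAgents    Mode = "agents"
	ModeControl   Mode = "control"
)

// ParseMode validates a HELIX_MODE value, defaulting to "full" when empty.
func ParseMode(s string) (Mode, error) {
	s = strings.ToLower(strings.TrimSpace(s))
	if s == "" {
		return ModeFull, nil
	}
	switch m := Mode(s); m {
	case ModeFull, ModeGateway, ModeBrain, ModeKnowledge, ModeAgents, ModeControl:
		return m, nil
	}
	return "", fmt.Errorf("unknown mode %q", s)
}

func main() {
	m, err := ParseMode(os.Getenv("HELIX_MODE"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("starting in mode:", m)
}
```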
## Provider Fallback Chain

HelixLLM routes requests through a scored chain of free cloud providers with automatic failover:
- Auto-discovery -- discovers available models from all configured providers (Chutes, OpenRouter, HuggingFace, Nvidia, Cerebras, SambaNova, Together)
- Scoring -- ranks providers using LLMsVerifier scores (refreshed every 5 minutes)
- Fallback -- on rate limit (429) or server error (5xx), automatically rotates to the next provider
- Local fallback -- llama.cpp is always the last resort, guaranteed to be available
- Rate limit tracking -- parses response headers to proactively skip providers approaching limits
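The failover behaviour described above can be sketched as a loop over a score-ranked provider list, rotating on 429/5xx and ending at a local provider that never fails. A simplified illustration (assumes the chain is already sorted by score; not the actual brain implementation):

```go
package main

import (
	"errors"
	"fmt"
)

// Provider is anything that can answer a completion request.
type Provider struct {
	Name string
	Call func(prompt string) (string, int, error) // returns body, HTTP status, error
}

// retryable reports whether a status code should trigger failover.
func retryable(status int) bool {
	return status == 429 || status >= 500
}

// complete walks the ranked chain, failing over on 429/5xx; the caller
// appends a local llama.cpp provider as a guaranteed last resort.
func complete(chain []Provider, prompt string) (string, error) {
	for _, p := range chain {
		body, status, err := p.Call(prompt)
		if err != nil || retryable(status) {
			continue // rotate to the next-ranked provider
		}
		return body, nil
	}
	return "", errors.New("all providers exhausted")
}

func main() {
	chain := []Provider{
		{Name: "cloud-a", Call: func(string) (string, int, error) { return "", 429, nil }},
		{Name: "cloud-b", Call: func(string) (string, int, error) { return "", 503, nil }},
		{Name: "local", Call: func(p string) (string, int, error) { return "local: " + p, 200, nil }},
	}
	out, err := complete(chain, "hello")
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // local: hello
}
```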
Set API keys for any number of providers in `.env`:

```shell
HELIX_LLM_CHUTES_KEY=your-key
HELIX_LLM_OPENROUTER_KEY=your-key
HELIX_LLM_HUGGINGFACE_KEY=your-key
HELIX_LLM_NVIDIA_KEY=your-key
HELIX_LLM_CEREBRAS_KEY=your-key
HELIX_LLM_SAMBANOVA_KEY=your-key
HELIX_LLM_TOGETHER_KEY=your-key
```

The chain automatically discovers and ranks available models; no manual model configuration is needed. OpenRouter models with the `:free` suffix are automatically filtered.
## Configuration

Configuration is loaded from environment variables with sensible defaults. Copy `.env.example` to `.env` and customize:

```shell
HELIX_MODE=full                      # Operating mode
HELIX_PORT=8443                      # Server port
HELIX_LLM_DEFAULT_PROVIDER=local     # local | openai | anthropic | auto
HELIX_LLM_OPENAI_KEY=sk-...          # OpenAI API key (optional)
HELIX_LLM_ANTHROPIC_KEY=sk-ant-...   # Anthropic API key (optional)
HELIX_LLM_CHUTES_KEY=...             # Free cloud providers (optional)
HELIX_LLM_OPENROUTER_KEY=...
HELIX_HOSTS=nezha.local              # Comma-separated cluster hosts
```

See `docs/user-guide/configuration.md` for the full reference.
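For a distributed layout, the same variables select a different mode per host. A hedged two-host sketch, assuming each host reads its own `.env` and that the control host lists cluster members via `HELIX_HOSTS` (hostnames are placeholders):

```shell
# Host A (.env) -- public API surface
HELIX_MODE=gateway
HELIX_PORT=8443

# Host B (.env) -- cluster control plane
HELIX_MODE=control
HELIX_PORT=8443
HELIX_HOSTS=gateway-host.local,brain-host.local
```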
## Development

```shell
# Build the binary
make build

# Run unit tests with coverage
make test-unit

# Run integration tests
make test-integration

# Run all tests
make test-all

# Check coverage meets threshold (85%)
make coverage

# Lint
make lint

# Format code
make fmt

# Build container image (auto-detects Podman/Docker)
make container

# Update submodule dependencies
make deps
```

## Project Layout

```
helixllm/
  cmd/helixllm/      CLI entry point and mode routing
  internal/
    gateway/         API layer (OpenAI/Anthropic endpoints, auth, streaming)
    brain/           LLM coordination (routing, llama.cpp, cloud providers)
    knowledge/       RAG pipeline (embeddings, vector store, chunking)
    agents/          Agent system (ReAct loop, tools, conversation context)
    control/         Cluster management (probing, scheduling, deployment)
    mode/            Mode enum and parsing
    server/          HTTP/3 + HTTP/2 server with middleware
    shared/          Cross-cutting (config, events, health, logging, observability)
  pkg/
    api/             Public request/response types
    types/           Shared type definitions
  submodules/        43 Go modules (vasic-digital ecosystem)
  container/         Containerfiles for Podman/Docker
  deploy/            Compose files for full stack
  tests/             Integration and unit tests
  challenges/        Challenge banks for testing
  docs/
    user-guide/      End-user documentation
    manual/          Developer and operator documentation
```
## Documentation

- User Guide: Getting Started | Configuration | API Reference | Models | RAG Knowledge | Agents | Multi-Host Setup | Monitoring | Troubleshooting
- Manual: Architecture | Development | Testing | Security | Operations | Modules

## License

All rights reserved. See LICENSE for details.