oxidize is a Rust workspace for local LLM tooling:
- oxidize-core: model loading, quantization, tensor/sampling primitives, and optional WASM support
- oxidize-cli: local CLI for prompt runs, chat mode, model planning, and profiling hooks
- oxidize-server: OpenAI-compatible HTTP API surface
- oxidize-quantize: file quantization utility
- oxidize-py: Python bindings built with pyo3
- Rust toolchain (rustup, cargo) with edition 2024 support
- make
- Optional for WASM builds: wasm-bindgen-cli
git clone https://github.com/Zapdev-labs/oxidize.git oxidize
cd oxidize
make build
make test
make lint
make fmt
make check

Today we are announcing oxidize 0.1.0, the first stable workspace release for local-first LLM workflows in Rust.
This release brings together a complete core-to-interface stack:
- oxidize-core for model loading, quantization primitives, and generation
- oxidize-cli for prompt and chat runs with profiling hooks
- oxidize-server for OpenAI-compatible HTTP endpoints
- oxidize-py for Python integration
- oxidize-quantize for offline model conversion
What this means for early users:
- Start quickly with one workspace and consistent commands (make build, make test, make lint)
- Deploy the same inference behavior across CLI, server, and Python surfaces
- Tune memory and latency tradeoffs using quantization targets that fit your hardware
Thank you to everyone testing early builds and sharing feedback. 0.1.0 is our stability baseline, and future releases will focus on performance, platform parity, and better developer ergonomics.
cargo run -p oxidize-cli -- --prompt "hello"
cargo run -p oxidize-cli -- --chat
cargo run -p oxidize-cli -- --model /path/to/model.gguf --n-gpu-layers 20 --gpus 2 --parallelism pipeline
cargo run -p oxidize-server -- --host 127.0.0.1 --port 8080

Health checks:
curl http://127.0.0.1:8080/healthz
curl http://127.0.0.1:8080/openapi.json

cargo run -p oxidize-quantize -- \
--input /path/to/input.bin \
--output /path/to/output.bin \
--source F32 \
--target F16

- Start from a floating-point model file (F32 or F16) and pick a target that matches your latency/quality tradeoff.
- Use F16 for a low-risk size reduction, or lower-bit targets (Q8_0, Q4_0, Q4_1, Q5_0, Q5_1) for stronger memory savings.
- Quantize to a new output path and keep the source model unchanged so you can benchmark both variants.
- Run inference/perplexity checks on representative prompts before promoting the quantized model.
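
To get a rough feel for the memory savings before running a conversion, the sketch below estimates weight storage per target. It assumes the target names follow the GGML block layouts of the same name (32 weights per block plus a per-block scale, and an extra offset for the `_1` variants), which this README does not confirm. `BITS_PER_WEIGHT` and `estimate_size_gib` are illustrative names, not part of oxidize.

```python
# Approximate bits per weight, ASSUMING GGML-style block layouts:
# 32 weights per block plus a 16-bit scale (and a 16-bit offset for *_1).
BITS_PER_WEIGHT = {
    "F32": 32.0,
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q5_1": 6.0,
    "Q5_0": 5.5,
    "Q4_1": 5.0,
    "Q4_0": 4.5,
}


def estimate_size_gib(n_params: int, target: str) -> float:
    """Estimate weight storage in GiB, ignoring metadata and unquantized tensors."""
    bits = BITS_PER_WEIGHT[target]
    return n_params * bits / 8 / 2**30


if __name__ == "__main__":
    n_params = 7_000_000_000  # hypothetical 7B-parameter model
    for target in BITS_PER_WEIGHT:
        print(f"{target:>5}: {estimate_size_gib(n_params, target):6.2f} GiB")
```

Treat the numbers as a planning aid only; the actual output file size depends on how oxidize-quantize lays out tensors and metadata.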
Example:
cargo run -p oxidize-quantize -- \
--input /models/model-f32.bin \
--output /models/model-q4_0.bin \
--source F32 \
--target Q4_0

cargo run -p oxidize-cli -- \
--model /path/to/model.gguf \
--prompt "Summarize Rust ownership in one paragraph."

cargo run -p oxidize-cli -- \
--model /path/to/model.gguf \
--chat

curl -N http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"messages": [{"role":"user","content":"Write a short poem about systems programming."}],
"stream": true
}'

cargo run -p oxidize-cli -- \
--model /path/to/model.gguf \
--batch-size 4 \
--prompt "Classify: Rust is memory-safe."

cargo run -p oxidize-cli -- \
--model /path/to/model.gguf \
--prompt "Generate a release note title." \
--temperature 0.7 \
--top-p 0.9 \
--top-k 40

curl http://127.0.0.1:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"input": ["Rust is fast.", "Rust is memory-safe."]
}'

make wasm

WASM artifacts are written to dist/wasm.
Use this loop for predictable, low-risk tuning:
- Start with a stable baseline and record tokens/sec and latency:
  cargo run -p oxidize-cli -- --model /path/to/model.gguf --prompt "benchmark prompt"
- Profile one run to find bottlenecks:
  cargo run -p oxidize-cli -- --model /path/to/model.gguf --prompt "benchmark prompt" --profile perf
- Increase GPU offload gradually (--n-gpu-layers) and compare throughput after each step.
- If using multiple GPUs, test both parallel strategies and keep the faster one for your hardware:
  cargo run -p oxidize-cli -- --model /path/to/model.gguf --gpus 2 --n-gpu-layers 20 --parallelism pipeline
  cargo run -p oxidize-cli -- --model /path/to/model.gguf --gpus 2 --n-gpu-layers 20 --parallelism tensor
- Quantize only after offload strategy is stable; then re-run the same benchmark prompts and check quality.
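
One way to make the loop above repeatable is a small timing harness. The sketch below is a generic wrapper rather than an oxidize feature: `time_command` is an illustrative name, and it measures wall-clock latency only, since this README does not describe the CLI's output format for token counts.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd: list[str], runs: int = 3) -> float:
    """Run cmd `runs` times and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises if the command exits non-zero, so failed runs
        # never end up in the timing samples.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere. For a real measurement,
    # swap in the invocation from the steps above, for example:
    # ["cargo", "run", "-p", "oxidize-cli", "--",
    #  "--model", "/path/to/model.gguf", "--prompt", "benchmark prompt"]
    cmd = [sys.executable, "-c", "pass"]
    print(f"median latency: {time_command(cmd):.3f}s")
```

Because cargo run rebuilds when sources change, do one untimed warm-up run first so compile time does not leak into the baseline.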
Practical tuning priorities:
- Largest speed gains first: GPU layer offload and multi-GPU strategy.
- Memory pressure next: quantization target (F16, Q8_0, Q5_*, Q4_*).
- Stability before peak speed: benchmark with representative prompts, not one short prompt.
- Measure every change: keep a small log of config -> tokens/sec -> latency -> quality notes.
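
The log suggested in the last item can be as simple as a CSV file. The following is a minimal sketch with hypothetical names (`RunRecord`, `write_log`, `fastest`); token counts and latencies are values you record yourself, and the numbers in the demo are placeholders, not measurements.

```python
import csv
import io
import sys
from dataclasses import asdict, dataclass, fields


@dataclass
class RunRecord:
    config: str            # e.g. "--n-gpu-layers 20 --parallelism pipeline"
    tokens_generated: int  # counted by you from the run's output
    latency_s: float       # wall-clock seconds for the run
    quality_notes: str = ""

    @property
    def tokens_per_sec(self) -> float:
        return self.tokens_generated / self.latency_s


def write_log(records, out) -> None:
    """Write records as CSV, adding a derived tokens_per_sec column."""
    names = [f.name for f in fields(RunRecord)] + ["tokens_per_sec"]
    writer = csv.DictWriter(out, fieldnames=names)
    writer.writeheader()
    for r in records:
        writer.writerow({**asdict(r), "tokens_per_sec": round(r.tokens_per_sec, 2)})


def fastest(records):
    """Return the record with the highest throughput."""
    return max(records, key=lambda r: r.tokens_per_sec)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    runs = [
        RunRecord("cpu baseline", 128, 16.0, "reference output"),
        RunRecord("--n-gpu-layers 20", 128, 8.0, "matches baseline"),
    ]
    write_log(runs, sys.stdout)
    print("fastest:", fastest(runs).config)
```

Keeping quality notes next to throughput makes it harder to promote a configuration that is fast but produces worse output.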
- Model path errors: verify the model file exists and is readable, then rerun with an absolute path to avoid shell-relative path mistakes.
- Slow or no GPU acceleration: increase --n-gpu-layers gradually and compare throughput after each change; if speed does not improve, fall back to CPU and confirm baseline correctness.
- Server auth failures (401): set OXIDIZE_API_KEY before starting oxidize-server, then send the same value with x-api-key or Authorization: Bearer <key>.
- WASM build failures: install the wasm target (rustup target add wasm32-unknown-unknown) and ensure wasm-bindgen is available, then run make wasm again.
- Unexpected output quality after quantization: keep the original model, benchmark both variants on representative prompts, and move to a less aggressive target if quality regresses.
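
For the model path item, a quick preflight check outside oxidize can rule out path mistakes before a run. This is a sketch using only the Python standard library; `resolve_model_path` is a hypothetical helper, not something the workspace provides.

```python
import os
import sys
from pathlib import Path


def resolve_model_path(raw: str) -> Path:
    """Return an absolute path to a readable model file, or raise."""
    path = Path(raw).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"model file not found: {path}")
    if not os.access(path, os.R_OK):
        raise PermissionError(f"model file is not readable: {path}")
    return path


if __name__ == "__main__":
    # Pass a model path as the first argument; without one, the demo falls
    # back to the Python interpreter itself just to show a successful check.
    raw = sys.argv[1] if len(sys.argv) > 1 else sys.executable
    print(resolve_model_path(raw))
```

The printed absolute path can then be passed to --model unchanged.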
- Build workspace crates (release): make build
- Test all targets: make test
- Lint with denied warnings: make lint
- Format check: make fmt
- Full CI-equivalent check + build: make ci
- OXIDIZE_API_KEY: optional API key for oxidize-server /v1/* routes. Supports x-api-key or Authorization: Bearer <key>.
- OXIDIZE_PROFILE_CHILD: internal flag used by the oxidize-cli profiling flow.
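
From a client's point of view, the two header styles accepted for OXIDIZE_API_KEY could be used as sketched below. The helper `build_chat_request` is hypothetical; the host, port, route, and request fields are taken from the examples earlier in this README, and the request is only built here, not sent.

```python
import json
import os
import urllib.request
from typing import Optional

BASE_URL = "http://127.0.0.1:8080"  # host/port from the server example


def build_chat_request(prompt: str, api_key: Optional[str],
                       use_bearer: bool = False) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/chat/completions."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        if use_bearer:
            headers["Authorization"] = f"Bearer {api_key}"
        else:
            headers["x-api-key"] = api_key
    body = json.dumps({
        "model": "default",
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=body, headers=headers, method="POST",
    )


if __name__ == "__main__":
    req = build_chat_request("hello", os.environ.get("OXIDIZE_API_KEY"))
    print(req.get_method(), req.full_url)
    # To send it against a running oxidize-server:
    # with urllib.request.urlopen(req) as resp:
    #     print(json.load(resp))
```

If the server was started without OXIDIZE_API_KEY, passing `None` simply omits both headers.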
oxidize is organized as a layered workspace:
- Core compute layer (oxidize-core): owns GGUF parsing, tensor + quantization primitives, model loading, token generation loop, and backend-specific execution paths (CPU, CUDA, Metal, WASM).
- Interface layer (oxidize-cli, oxidize-server, oxidize-py): exposes core capabilities through a CLI, OpenAI-compatible HTTP routes, and Python bindings without duplicating inference logic.
- Utility layer (oxidize-quantize): handles offline model weight conversion and quantization workflows.
At runtime, request flow is: input prompt -> interface crate -> oxidize-core model/session setup -> token generation + sampling -> streamed or buffered output to the caller.
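
On the caller's side, the streamed case of that flow might be consumed as in the sketch below. It assumes the server follows OpenAI-style server-sent events framing (`data:` lines carrying JSON chunks with `choices[].delta.content`, ended by `data: [DONE]`), which this README implies but does not spell out. `iter_stream_text` is an illustrative name, and the sample lines are hand-written, not captured server output.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style SSE lines (assumed framing)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


if __name__ == "__main__":
    sample = [
        b'data: {"choices":[{"delta":{"content":"Hel"}}]}\n',
        b"\n",
        b'data: {"choices":[{"delta":{"content":"lo"}}]}\n',
        b"data: [DONE]\n",
    ]
    print("".join(iter_stream_text(sample)))  # prints "Hello"
```

A response object from `urllib.request.urlopen` iterates over lines, so it can be passed to `iter_stream_text` directly when the request sets "stream": true.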
Design goals:
- Keep inference and model logic centralized in oxidize-core so all frontends share the same behavior.
- Keep transport/UI concerns at the edge crates (oxidize-cli, oxidize-server, oxidize-py) for maintainability.
- Support multiple acceleration targets behind stable core APIs to keep feature parity across platforms.
MIT