imageCLI

Unified CLI for local image generation — FLUX.2-klein, FLUX.1-dev (+ PuLID face lock), FLUX.1-schnell, and SD3.5 Large Turbo backends via HuggingFace Diffusers.

Why

Running image generation locally means no API costs, no rate limits, and full control over prompts and outputs. But wiring up HuggingFace Diffusers — quantization, VRAM management, preflight checks — is repetitive boilerplate.

imageCLI wraps all of that into a single command. Point it at a text prompt or a .md file and get an image. Swap backends with -e. Batch an entire directory. Import it as a Python library.

Requirements

Python 3.11–3.12
CUDA GPU (16GB VRAM recommended; 10GB minimum for fp8 engines)
CUDA 13.0 (cu130) for NVFP4 engine; CUDA 12.8+ for all others
uv package manager

Install

From source (development)

git clone <repo-url> && cd imageCLI
uv sync
cp imagecli.example.toml ~/imagecli.toml   # optional — customize defaults

Run via uv run imagecli <command> from the project directory.

Global install (run from any directory)

imagecli ships a [project.scripts] entry so it installs as a proper system command.

uv tool (preferred) — run from the project directory so uv picks up the custom PyTorch cu128 index from pyproject.toml automatically:

cd /path/to/imageCLI
uv tool install .

pipx (alternative) — pass the PyTorch index explicitly using --index-url (replaces PyPI entirely, avoiding dependency confusion for torch/torchvision):

pipx install -e /path/to/imageCLI --pip-args="--index-url https://download.pytorch.org/whl/cu128"

Note: pip does not read [tool.uv.sources] from pyproject.toml. This means the git-pinned diffusers source and index overrides used by uv are ignored — pip resolves all packages from the pytorch wheel index above. For the exact tested dependency graph, prefer uv tool install ..

After either install, imagecli is on your PATH and works from any directory:

imagecli info                          # verify install + GPU detection
imagecli generate "a cat in space"     # generate from anywhere

Models are downloaded automatically on first use from HuggingFace (~13GB for FLUX.2-klein).

Quick Start

# Generate with default engine (FLUX.2-klein)
imagecli generate "a cat in space"

# Use a specific engine
imagecli generate "a cat in space" -e flux1-dev

# Generate from a .md prompt file
imagecli generate images/prompts_in/lyra-v1-cinematic.md

# Batch all prompts in a directory
imagecli batch images/prompts_in/

# List available engines
imagecli engines

# Show active config and GPU info
imagecli info

How it works

A prompt (text or .md file) is dispatched to an engine. The engine lazy-loads its model on the first call, runs preflight VRAM/RAM checks, generates the image, and saves it — then frees GPU memory.

flowchart LR
  A["prompt / .md file"] --> B["engine registry"]
  B --> C["preflight_check\nVRAM + RAM"]
  C --> D["_load()\nlazy model load"]
  D --> E["inference\ntorch.inference_mode()"]
  E --> F["image.save()"]
  F --> G["cleanup()\nVRAM freed"]

Engines stay alive across batch runs — the model loads once per session. torch.compile fuses CUDA kernels on the first generation, then runs at full speed for all subsequent images.

Library API

imageCLI is also a Python library. Install it as a dependency and call it directly — no subprocess, no HTTP server.

# From another project
uv add --editable /path/to/imageCLI

from pathlib import Path
from imagecli import generate, list_engines, preflight_check

# Generate an image — returns the path to the saved file
out: Path = generate(
    "a serene mountain landscape at sunset",
    engine="flux2-klein",       # optional, defaults to imagecli.toml
    width=1280,
    height=720,
    seed=42,                    # reproducible output
    output_dir="images/out/",
)
print(out)   # Path("images/out/image_20260312_143201.png")

# List available engines
for engine in list_engines():
    print(engine)   # "flux2-klein", "flux1-dev", …

# Check GPU / resource availability before loading a model
from imagecli import get_engine
eng = get_engine("flux2-klein")
preflight_check(eng)   # raises InsufficientResourcesError if VRAM too low

Public API (__all__): generate, get_engine, list_engines, preflight_check, load_config, parse_prompt_file, ImageEngine, InsufficientResourcesError, PromptDoc.

Engines

Engine	Model	VRAM	Steps	Speed	Notes
`flux2-klein`	FLUX.2-klein-4B	~8GB	50	~7s/step	Default. FP8 quanto + Marlin GEMM. Single: CPU offload. Batch: all-on-GPU (~12 GB) or `--two-phase` (~8 GB). Fastest.
`flux2-klein-fp8`	FLUX.2-klein-4B FP8	~13GB	50	~4s/step	torchao FP8 weight-only. No quanto, torch.compile compatible. ~40% slower than quanto (dequant overhead). `uv sync --group fp8`
`flux2-klein-fp4`	FLUX.2-klein-4B NVFP4	~11.5GB	50	~6.8s/step	Blackwell FP4 via comfy-kitchen. ~2 GB transformer. Requires sm_120+ and cu130. `uv sync --group fp4`
`pulid-flux2-klein`	FLUX.2-klein-4B + PuLID	~9–10GB	50	~25s	Face identity lock (Klein). Requires `face_image` in frontmatter. `uv sync --extra pulid`
`pulid-flux2-klein-fp4`	FLUX.2-klein-4B + PuLID NVFP4	~8.5GB	50	~36–37s	Face identity lock + NVFP4 quantization. ~26% faster than BF16 PuLID. Requires sm_120+ and cu130. `uv sync --extra pulid --group fp4`
`pulid-flux1-dev`	FLUX.1-dev + PuLID v0.9.1	~10GB	24	~42s	Face identity lock (recommended). GGUF Q5_K_S + PuLID. Clean 1024×1024, no banding. `uv sync --extra pulid`
`flux1-dev`	FLUX.1-dev fp8	~10GB	50	~30s	Maximum quality, longer prompts. fp8 via optimum-quanto
`flux1-schnell`	FLUX.1-schnell fp8	~10GB	4	~5s	Fastest. Apache 2.0, ungated. fp8 via optimum-quanto
`sd35`	SD3.5 Large Turbo	~14GB	20	~8s	CFG-free, realistic photos. T5 encoder quantized to int8

Benchmarks (RTX 5070 Ti, 512x512, 28 steps, April 2026)

Single image (cold start — includes model load):

Engine	Total wall-clock	Steady it/s	Peak VRAM
`flux2-klein` (quanto FP8)	28.5s	7.05	7.84 GB
`flux2-klein-fp4` (NVFP4)	15.2s	6.8	11.46 GB

Batch (model already loaded):

Engine	Mode	Steady it/s	Gen VRAM	Free VRAM
`flux2-klein`	all-on-GPU	7.0	12.26 GB	3.2 GB
`flux2-klein`	--two-phase	6.95	4.93 GB	10.5 GB
`flux2-klein-fp4`	all-on-GPU	6.78	11.14 GB	4.3 GB
`flux2-klein-fp4`	--two-phase	6.72	3.56 GB	11.9 GB

Multi-image per step (batch_size 2-4) provides zero speedup — GPU is already compute-saturated at 512x512.

Resolution scaling:

Resolution	quanto FP8 best	NVFP4 best	Notes
512x512	7.05 it/s (all-on-GPU)	6.78 it/s	quanto fastest at low res
1024x1024	2.02 it/s (two-phase)	2.13 it/s (two-phase)	NVFP4 pulls ahead — smaller transformer leaves VRAM for latents
2048x2048	OOM (all modes)	0.41 it/s (two-phase only)	NVFP4 is the only engine that fits — 8.82 GB peak

Which engine to pick:

Scenario	Best pick	Why
1 image (any res)	NVFP4	Half the cold start (15s vs 28s) — no runtime quantization
Batch 512x512	quanto FP8 all-on-GPU	Fastest steady-state (7.0 it/s)
Batch 1024x1024	NVFP4 --two-phase	Faster (2.13 vs 2.02 it/s) and less VRAM (7.44 vs 8.30 GB)
Batch 2048x2048	NVFP4 --two-phase	Only option that fits in 16 GB (8.82 GB peak)
500+ sharing GPU	NVFP4 --two-phase	Only 3.56 GB during gen at 512x512, 12 GB free for other processes

Commands

`generate` — Single image

imagecli generate "a cat in space"
imagecli generate "a cat in space" -e flux1-schnell
imagecli generate prompt.md                          # from .md file with frontmatter
imagecli generate prompt.md -e flux1-dev             # override engine from frontmatter
imagecli generate "prompt" -W 1280 -H 720            # custom dimensions
imagecli generate "prompt" --steps 30 --guidance 4.0
imagecli generate "prompt" --seed 42                 # reproducible output
imagecli generate "prompt" -n "blurry, ugly"         # negative prompt
imagecli generate "prompt" -o my_image.png           # explicit output path
imagecli generate "prompt" --output-dir ./out        # custom output directory
imagecli generate "prompt" --no-compile              # skip torch.compile for one-offs
imagecli generate "prompt" --lora ./lora.safetensors               # LoRA weights
imagecli generate "prompt" --lora ./lora.safetensors --lora-scale 1.5  # boosted LoRA

Flag	Short	Description	Default
`--engine`	`-e`	Engine name (`flux2-klein`, `flux2-klein-fp8`, `flux2-klein-fp4`, `pulid-flux2-klein`, `pulid-flux2-klein-fp4`, `pulid-flux1-dev`, `flux1-dev`, `flux1-schnell`, `sd35`)	`flux2-klein`
`--width`	`-W`	Output width in pixels (multiple of 64)	`1024`
`--height`	`-H`	Output height in pixels (multiple of 64)	`1024`
`--steps`	`-s`	Inference steps	`50`
`--guidance`	`-g`	Guidance scale (CFG)	`4.0`
`--seed`		Fixed seed for reproducibility	random
`--negative`	`-n`	Negative prompt	`""`
`--output`	`-o`	Output file path	auto-generated
`--output-dir`		Output directory (`~` expands to `$HOME`)	`~/.roxabi/imagecli/out`
`--format`		Output format: `png`, `jpg`, `webp`	`png`
`--no-compile`		Skip torch.compile (faster startup, slower generation)	off
`--lora`		Path to LoRA weights (`.safetensors`). `flux2-klein` and `flux2-klein-fp8` only.	none
`--lora-scale`		LoRA adapter scale	`1.0`

`batch` — All prompts in a directory

imagecli batch images/prompts_in/
imagecli batch images/prompts_in/ -e flux1-dev
imagecli batch images/prompts_in/ --output-dir ./results
imagecli batch images/prompts_in/ --no-compile
imagecli batch images/prompts_in/ --lora ./lora.safetensors       # LoRA for all images

Processes every .md file in the directory in sorted order.

Batch modes (flux2-klein engines): When all files use the same engine, batch mode picks the fastest strategy:

All-on-GPU (default): Encoder + transformer + VAE all on GPU at once (~12 GB). No model shuffling. Fastest.
Two-phase (--two-phase): Phase 1 loads text encoder (~8 GB), encodes all prompts, offloads. Phase 2 loads transformer + VAE (~4 GB), generates all images. Lower VRAM, useful when running alongside other GPU processes.

Flag	Short	Description	Default
`--engine`	`-e`	Override engine for all files	per-file frontmatter
`--output-dir`		Output directory for all images (`~` expands)	`~/.roxabi/imagecli/out`
`--no-compile`		Skip torch.compile	off
`--two-phase`		Force 2-phase batch (lower VRAM, slower)	off
`--lora`		Path to LoRA weights (`.safetensors`). `flux2-klein` and `flux2-klein-fp8` only.	none
`--lora-scale`		LoRA adapter scale	`1.0`

`engines` — List engines

imagecli engines

Prints a table of all registered engines with VRAM requirements, model ID, and description.

`info` — Config and GPU

imagecli info

Shows the active imagecli.toml values and detected GPU name + total VRAM.

Markdown Prompt Format

Write prompts as .md files with YAML frontmatter. Store them in images/prompts_in/ (tracked in git).

---
engine: flux2-klein          # flux2-klein | flux2-klein-fp8 | flux2-klein-fp4 | pulid-flux2-klein | pulid-flux2-klein-fp4 | pulid-flux1-dev | flux1-dev | flux1-schnell | sd35
width: 1024                  # pixels, must be a multiple of 64
height: 1024
steps: 50                    # inference steps (sd35 is fixed at 20)
guidance: 4.0                # CFG scale (sd35 is fixed at 1.0, flux1-schnell at 0.0)
seed: 42                     # optional — omit for a random result
negative_prompt: "blurry, low quality, watermark"
format: png                  # png | jpg | webp
face_image: /path/to/ref.png # pulid engines only — single reference face (abs or relative to .md)
face_images:                 # pulid engines only — multiple reference faces (takes precedence over face_image)
  - /path/to/ref1.png        #   ArcFace + EVA-CLIP embeddings are averaged across all refs before id_former
  - /path/to/ref2.png        #   gives a centroid identity more robust to pose/lighting variation
pulid_strength: 0.6          # pulid engines only — identity lock strength (default 0.6, 0.8 for flux1-dev)
lora_path: /path/to/lora.safetensors  # LoRA weights — flux2-klein and flux2-klein-fp8 only
lora_scale: 1.0              # LoRA adapter scale (default 1.0, try 1.5 for stronger identity)
---

Your prompt text here. Can span multiple paragraphs.
The entire body below the frontmatter is used as the prompt.

All frontmatter fields are optional. Missing values fall back to imagecli.toml, then to hardcoded defaults.

User Config (`imagecli.toml`)

Optional global config file. imagecli walks upward from CWD to $HOME looking for imagecli.toml. Place it at ~/imagecli.toml for a global default, or in a project parent directory to scope it narrowly.

cp imagecli.example.toml ~/imagecli.toml   # then edit to taste

[defaults]
engine     = "flux2-klein"       # default engine
width      = 1024                # output width in pixels
height     = 1024                # output height in pixels
steps      = 50                  # inference steps
guidance   = 4.0                 # CFG scale
output_dir = "~/.roxabi/imagecli/out" # where to save generated images (`~` → $HOME)
format     = "png"               # png | jpg | webp
quality    = 95                  # JPEG/WebP quality (ignored for PNG)

Priority: CLI flag > .md frontmatter > imagecli.toml > hardcoded default

Performance Optimizations

All optimizations are applied automatically when a model loads. The only manual choice is --no-compile.

torch.compile

The transformer (or UNet) can be compiled with torch.compile(mode="max-autotune-no-cudagraphs") on first load, fusing CUDA kernels for a 20-40% speedup after a one-time warmup.

Note: For the default flux2-klein engine, torch.compile is blocked by the quanto QLinear contiguity patch. The flux2-klein-fp8 engine (torchao) supports compile, but is already 40% slower than quanto's Marlin FP8 GEMM kernels even without compile. In practice, quanto's hardware-native FP8 matmul is faster than anything torch.compile can offer. Compile remains useful for non-Klein engines (flux1-dev, flux1-schnell, sd35).

imagecli generate "quick test" --no-compile  # skip compile for faster startup

VAE tiling and slicing

VAE tiling (enable_tiling()) and slicing (enable_slicing()) are always enabled. Tiling decodes the latent in tiles instead of all at once, which reduces peak VRAM for large images. Slicing processes the batch dimension one element at a time.

TF32 matmul precision

torch.set_float32_matmul_precision("high") is set globally at model load time. On Blackwell GPUs (RTX 5070 Ti and newer), this routes float32 matrix multiplications through tensor cores using TF32 precision, giving roughly a 10–15% speedup with negligible quality impact.

inference_mode

All generation runs inside torch.inference_mode(), which disables autograd tracking and reduces memory overhead compared to torch.no_grad().

Memory Safety

Preflight check

Before loading a model, preflight_check() reads current GPU and system memory and aborts early with a clear message if resources are insufficient:

VRAM: compares torch.cuda.mem_get_info() free VRAM against the engine's vram_gb requirement. If insufficient, it names the shortfall and suggests closing other GPU processes (e.g. ollama).
System RAM: reads /proc/meminfo and requires at least 4GB free. Override with IMAGECLI_MIN_FREE_RAM_GB=8 if needed.

Environment variables

Variable	Default	Purpose
`IMAGECLI_MIN_FREE_RAM_GB`	`4.0`	Minimum free system RAM (GB) before `preflight_check` aborts a load.
`IMAGECLI_NATS_OUTPUT_DIR`	`~/.roxabi/imagecli/nats_out`	Override directory used by the NATS satellite when delivering via `output_mode: "file"`. Useful for daemons whose CWD makes the config-file walk-up unreliable.

GPU memory cap

torch.cuda.set_per_process_memory_fraction(0.85) caps imagecli's VRAM usage at 85% of total, leaving headroom for the OS and other processes.

Cleanup after generation

engine.cleanup() is called in a finally block after every generation. It runs torch.cuda.empty_cache() and gc.collect() to release cached allocations back to the system promptly.

Project Structure

imagecli.example.toml     — copy to ~/imagecli.toml and customize
images/
  prompts_in/             — .md prompt files (tracked in git)
                           — generated images live at ~/.roxabi/imagecli/out/ (CLI)
                             and ~/.roxabi/imagecli/nats_out/ (NATS satellite),
                             outside the repo and Syncthing-replicated M₁↔M₂
src/imagecli/
  cli.py                  — Typer app: generate, batch, engines, info
  config.py               — TOML config loader (walks up from CWD to $HOME)
  engine.py               — Abstract ImageEngine base class + registry
  markdown.py             — YAML frontmatter parser for .md prompt files
  engines/
    flux2_klein.py        — FLUX.2-klein-4B engine (default, quanto FP8)
    flux2_klein_fp8.py    — FLUX.2-klein-4B torchao FP8 (40% slower than quanto, torch.compile compatible)
    flux2_klein_fp4.py    — FLUX.2-klein-4B NVFP4 (Blackwell FP4 via comfy-kitchen)
    pulid_flux2_klein.py     — FLUX.2-klein-4B + PuLID face identity lock (optional extra)
    pulid_flux2_klein_fp4.py — FLUX.2-klein-4B + PuLID NVFP4 (Blackwell, ~8.5 GB, optional extra + fp4 group)
    pulid_flux1_dev.py       — FLUX.1-dev GGUF + PuLID v0.9.1 face identity lock
    flux1_dev.py          — FLUX.1-dev fp8 quantized engine
    flux1_schnell.py      — FLUX.1-schnell fp8 quantized engine
    sd35.py               — SD3.5 Large Turbo engine

Output files never overwrite existing ones. If image.png already exists, the next run saves image_1.png, then image_2.png, and so on.

Upgrading from an earlier checkout? The default output directory moved from images/images_out/ (in-repo) to ~/.roxabi/imagecli/out/ (under the project-wide data convention, Syncthing-replicated M₁↔M₂). Either copy any existing images/images_out/ files over manually, or pin the old layout by setting output_dir = "images/images_out/" in your imagecli.toml.

Contributing

See CONTRIBUTING.md for dev setup, code conventions, and how to add a new engine. All PRs target staging and require ruff + pytest to pass.

Documentation

Doc	Description
CONTRIBUTING.md	How to contribute

Model Research (March 2026)

FLUX.2-klein-4B — Black Forest Labs, Jan 2026. 4B parameter model distilled from FLUX.2-dev. Three quantization options: quanto FP8 (default), official pre-quantized FP8, and NVFP4 (Blackwell-native).
FLUX.2-klein-4B torchao FP8 — torchao weight-only FP8 quantization (no quanto). Enables torch.compile but ~40% slower than quanto's Marlin FP8 GEMM kernels. Useful if you need compile or want to avoid quanto. Requires uv sync --group fp8.
FLUX.2-klein-4B NVFP4 — Official BFL NVFP4 weights for Blackwell GPUs. ~2 GB transformer via comfy-kitchen FP4 kernel. Requires sm_120+ and CUDA 13.0 (cu130).
FLUX.1-dev — Black Forest Labs. Still competitive for detailed, complex prompts. Non-commercial license. fp8 quantized via optimum-quanto to fit in 10GB.
FLUX.1-schnell — Black Forest Labs. Apache 2.0, ungated. 4-step distilled model for fast iteration. fp8 quantized.
SD3.5 Large Turbo — Stability AI. 20-step CFG-free generation. Strong for photorealistic subjects. T5 encoder quantized to int8.
FLUX.2-dev (32B) — Not included: requires 24GB+ VRAM even in GGUF format.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 144 Commits
.claude-plugin		.claude-plugin
.claude		.claude
.github		.github
docs		docs
images/prompts_in		images/prompts_in
scripts		scripts
skills/image-generate		skills/image-generate
src/imagecli		src/imagecli
supervisor		supervisor
tests		tests
tools		tools
.env.example		.env.example
.gitignore		.gitignore
.importlinter		.importlinter
.license-policy.json		.license-policy.json
.pre-commit-config.yaml		.pre-commit-config.yaml
CLAUDE.md		CLAUDE.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
SECURITY.md		SECURITY.md
imagecli.example.toml		imagecli.example.toml
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

imageCLI

Why

Requirements

Install

From source (development)

Global install (run from any directory)

Quick Start

How it works

Library API

Engines

Benchmarks (RTX 5070 Ti, 512x512, 28 steps, April 2026)

Commands

generate — Single image

batch — All prompts in a directory

engines — List engines

info — Config and GPU

Markdown Prompt Format

User Config (imagecli.toml)

Performance Optimizations

torch.compile

VAE tiling and slicing

TF32 matmul precision

inference_mode

Memory Safety

Preflight check

Environment variables

GPU memory cap

Cleanup after generation

Project Structure

Contributing

Documentation

Model Research (March 2026)

License

About

Resources

License

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`generate` — Single image

`batch` — All prompts in a directory

`engines` — List engines

`info` — Config and GPU

User Config (`imagecli.toml`)

Packages