AINode

Turn any NVIDIA GPU into a local AI platform.
Inference + fine-tuning in your browser. One container to install. Add nodes, they find each other.

ainode.dev · docs · Getting Started · Screenshots · What Works / What Doesn't

What AINode is

AINode is a self-hosted AI appliance for NVIDIA GB10 (DGX Spark, ASUS GX10) and any NVIDIA GPU box. It ships as one container that bundles:

A modern web UI (chat, cluster topology, server console, downloads, training)
An OpenAI-compatible API (/v1/chat/completions, /v1/completions, /v1/embeddings)
A GB10-patched vLLM with Ray for cross-node tensor/pipeline parallel
UDP node discovery for automatic clustering
NFS-shared model storage so you download once and use everywhere
Scripted fine-tuning (LoRA, QLoRA, full FT, DPO, distributed DDP)

One docker pull, one systemd unit per box, done. No host Python venv, no source-built vLLM, no fragile runtime wiring.

curl -fsSL https://ainode.dev/install | bash

Screenshots

Cluster view — 2 nodes, 244 GB aggregated VRAM, distributed inference

The "MASTER" node (head) runs the API and orchestrates. The smaller orbiting node (member) has its GPU reserved for a Ray worker that the head placed. The instance card shows DISTRIBUTED · TP=2 — the model is sharded across both GPUs.

Chat

Full-featured chat with streaming tokens, prompt history, code highlighting, per-message metrics (TTFT, tokens/sec, total tokens). Works against whatever model the cluster has loaded — solo or sharded.

Server — API console (LM Studio style)

Live developer console: which models are loaded on which node, OpenAI-/LM-Studio-/Anthropic-compatible endpoints, per-request logs with status codes and latency, eject-model buttons, copyable cURL snippets.

Downloads — live HF catalog

Browse trending HuggingFace models, with AVAILABLE / FITS GPU badges computed from your cluster's aggregate VRAM. Queue downloads to the shared NFS cache; any node can load them instantly.

Training — overview

Three quick-start paths: LoRA (lightweight, most users), Distributed DDP (multi-node fine-tuning), Full fine-tune (single large-memory node). Track active + completed runs, GPU-hours, and jump into dataset management.

Training — templates

Starter recipes for instruction tuning (Alpaca), chat fine-tuning (ShareGPT), classification heads, DPO / preference learning, and multi-node DDP. Each template ships a working dataset schema so you can start training in minutes.

Config — cluster

Pin the node's role (auto / master / worker), set a shared cluster_id so only matching nodes see each other, and inspect the current member list with per-node role, address, and last-seen.

Getting Started — step by step

Single node (solo mode)

Install Docker + NVIDIA container toolkit on your Linux box.

Pull the image and wire up the systemd unit:

curl -fsSL https://ainode.dev/install | bash

That one-liner performs:

docker pull ghcr.io/getainode/ainode:0.4.0
systemctl enable --now ainode.service

Open the UI at http://<your-ip>:3000. First-run onboarding walks you through picking a model. Click a model card → click Launch → chat.

Upgrade is docker pull + systemctl restart ainode.

Two nodes (distributed mode)

For models that don't fit on one GPU — e.g. a 70B-class model sharded across two DGX Sparks:

Wire a clean high-speed link between the two nodes (direct QSFP cable on its own /24, or a dedicated switch port). See Networking requirements — this matters.
Install AINode on both (step 1 above).

On the peer, set member mode in ~/.ainode/config.json:

{
  "distributed_mode": "member",
  "cluster_interface": "enp1s0f0np0",
  "ssh_user": "sem"
}

sudo systemctl restart ainode.

On the head, set head mode and add passwordless SSH to the peer:

{
  "distributed_mode": "head",
  "peer_ips": ["10.0.0.2"],
  "cluster_interface": "enp1s0f0np0",
  "ssh_user": "sem"
}

ssh-copy-id sem@10.0.0-2 && sudo systemctl restart ainode

Open the head UI — you should see both nodes, aggregated VRAM ("2 nodes · 244 GB · 2 GPUs"), and the instance badged as DISTRIBUTED · TP=2.

Want to do it from the browser instead? Open the Launch Instance panel, pick the model, set Minimum Nodes=2, click Tensor → LAUNCH. The UI writes the config and hot-swaps the engine for you.

Features

Feature	Status
One-command install	✅
Unified container image (UI + engine)	✅ v0.4.0
Auto-detect GPU and memory	✅
Chat UI in your browser	✅
OpenAI-compatible API	✅
Embeddings endpoint (`/v1/embeddings`)	✅
Live HF model catalog with trending + download manager	✅
NFS-shared model storage across cluster	✅
Multi-node auto-discovery (UDP broadcast)	✅
Distributed tensor-parallel inference across nodes	✅ (2-node verified)
Cluster topology UI (members, VRAM aggregate, instance badges)	✅
Browser-based fine-tuning (LoRA / QLoRA / DDP templates)	⚠️ UI scaffolded, real-GPU validation in progress
Prometheus metrics endpoint	🔜
4+ node sharding on dedicated switch	🔜 (next milestone)

State of Distributed Inference (April 2026)

We owe readers the honest picture, not a checkmark-soup. Here's what's really running on our hardware.

What works today (verified)

Single-node inference on any NVIDIA GB10 box (DGX Spark, ASUS GX10).
Two-node tensor-parallel (TP=2) with one GPU per node on a direct-connect QSFP /24. Both GPUs show ~61 GB of ray::RayWorkerWrapper memory; NCCL chose NET/IB RoCE @ 200 Gb/s.
One-container-per-node install (docker pull ghcr.io/getainode/ainode → systemd → running).
Shared model storage over NFS from an NVMe-oF-backed master.
UDP cluster discovery on port 5679 with real peer-IP capture.
Inference throughput: ~35 tok/s for a warmed-up 1.5B model over the RoCE fabric.

What doesn't work yet

Three-node TP=3 on our current network topology — see the "Why 3 nodes is harder than 2" section below.
One-click "add me to the cluster" UX for members — you still hand-edit config.json on the peer. Next UI slice.
Ray over Tailscale — use physical cables or a dedicated switch.

Lessons learned the hard way

Single NIC per cluster subnet. Multi-NIC ambiguity breaks NCCL ring setup silently; NCCL_SOCKET_IFNAME only tells NCCL which address to listen on, not which source the kernel picks for outbound traffic.
Ray placement groups outlive SIGKILL. Hung vLLM doesn't release the reservation; Ray's GCS still thinks the GPU is busy. Always docker rm -f the full chain before retrying.
Block-level shared storage is unsafe for multi-writer. NVMe-oF + ext4 mounted on two hosts corrupts under concurrent writes. Put NFS on top of a single-host mount.
The patched NCCL in eugr/spark-vllm-docker (dgxspark-3node-ring branch) is the only variant we've seen reliably handle GB10 unified-memory topologies. Our base image inherits it.
SSH from a root container into a host user fails silently when keys are mounted read-only from the host. Our entrypoint copies /host-ssh → /root/.ssh with correct perms and injects User <ssh_user> for peer IPs.

Why 3 nodes is harder than 2 (and why 4 is probably easier)

Two nodes: a single direct-connect cable on one /24. One cable, one subnet, one candidate interface per host. NCCL can't get confused. TP=2 splits the weights evenly. Solved problem.

Three nodes: no simple physical topology. Options:

Triangle mesh (A↔B, B↔C, A↔C) with each link on its own /30 — community tooling assumes this, nobody autoconfigures it.
Dedicated cluster switch with one NIC per node on an isolated subnet — easier, but a hardware purchase.
Star topology — asymmetric latency, not recommended.

If your three nodes just share a regular LAN, you hit multi-NIC routing ambiguity (lesson #1). We did. NCCL ring setup succeeded; data never flowed.

Four nodes: paradoxically simpler once you commit to a switch, which is the only practical option for 4+. One NIC per node on a fresh /24, TP=4 lines up with vLLM's defaults, and the community has published recipes (eugr's recipes/4x-spark-cluster/, NVIDIA's internal 4× Spark reference setups).

Our hypothesis: the difficulty is not N nodes — it's how you wire N nodes. Two is forced (one cable). Three forces a topology decision. Four+ forces a switch, which is what the community tools expect. Stick to 2 now; buy the switch; jump straight to 4.

Networking requirements

AINode relies on NCCL for cross-node tensor-parallel, and NCCL works best when it owns a clean link.

Passwordless SSH from the head's host user to every peer.
Single active NIC per cluster subnet on every node. Multiple interfaces on the same /24 breaks the NCCL ring.
No VPN between nodes for cluster traffic. Tailscale is fine for laptop→cluster SSH; not fine as the NCCL transport.
Consistent MTU across the cluster subnet.

Recommended topologies

Cluster size	Topology	Notes
2 nodes	Direct QSFP cable, each end on its own IP in a fresh `/24`	Simplest, verified
3 nodes	Triangle direct-connect (3 cables, each on a `/30`) or dedicated switch	Mesh is finicky; switch is easier
4+ nodes	Dedicated QSFP switch on its own `/24`, one NIC per node	The only practical option

Diagnostic commands

# Confirm RDMA is live on your ConnectX-7
ibstat mlx5_0 | grep -E "State|Rate"         # expect "Active" + "Rate: 200"

# Confirm exactly one interface has an IP on the cluster subnet
ip -br -4 addr | grep 10.0.0                 # expect one line per node

# Confirm cross-node reach on the cluster subnet (not Tailscale)
ping -c 2 10.0.0.2
traceroute 10.0.0.2                          # 1 hop = right link

# Passwordless SSH works
ssh sem@10.0.0.2 true && echo OK

# After launching distributed: confirm NCCL uses RoCE, not Socket
docker exec vllm_node bash -c 'grep -E "Using network|NET/IB.*RoCE" \
  /tmp/ray/session_latest/logs/worker-*-01000000-*.out | head -5'
# Expect: "Using network IB" + "NET/IB ... mlx5_0:1/RoCE ... speed=200000"

Optional: GPU Direct RDMA (GDR)

Without GDR, traffic goes GPU → CPU → NIC → NIC → CPU → GPU. With GDR it bypasses the CPU hop. On GB10 the unified-memory CPU hop is cheap, so the win is smaller than on discrete GPUs but still measurable.

# Load the peermem module on each host (not the container)
sudo modprobe nvidia_peermem
echo nvidia_peermem | sudo tee -a /etc/modules-load.d/nvidia-peermem.conf

# Verify NCCL picks it up next launch
docker exec vllm_node bash -c 'grep "GPU Direct RDMA" \
  /tmp/ray/session_latest/logs/worker-*-01000000-*.out'
# Expect: "GPU Direct RDMA Enabled"

Shared model storage across a cluster

Downloading a 70 GB model three times on a three-node cluster is wasteful. AINode supports a shared models_dir so every node pulls from the same cache.

Block-level shared storage (NVMe-oF, iSCSI, Fibre Channel) is fast but unsafe for multiple Linux kernels writing simultaneously — ext4 / XFS have no distributed lock manager. Layer NFS on top:

  Storage array (NVMe-oF, SAN, local NVMe)
         │
         ▼
  MASTER NODE  ← ext4/XFS mounted here, owns the disk
    │   │
    │   └── NFS server exports /mnt/ai-models
    ▼
  WORKERS      ← mount the NFS share at /mnt/ai-shared

NFS over a 100G fabric gives 3–8 GB/s — vLLM model loading is a one-shot sequential read, so you won't notice. For 100 GB+ models where load time hurts, add an rsync-to-local staging step.

CLI reference

ainode start                 # Start AINode (inference + web UI)
ainode stop                  # Stop AINode
ainode status                # Show cluster status
ainode models                # List available models
ainode service install       # Install the systemd unit
ainode service status        # Show systemd state + recent journal
ainode config                # Show current configuration

API

AINode exposes an OpenAI-compatible API. Drop it into any tool that speaks OpenAI:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)

Works with Open WebUI, LiteLLM, LangChain, llama.cpp clients, and anything else that speaks OpenAI.

Requirements

OS: Ubuntu 22.04+ (DGX Spark OS works out of the box)
GPU: NVIDIA GB10 (DGX Spark, ASUS GX10) — or any NVIDIA GPU with CUDA 13 drivers
Memory: 8 GB+ GPU for small models, 128 GB recommended for the big ones, 240 GB+ aggregated for real sharded work
Disk: 20 GB for the container image; more for models
Docker: 24.0+ with the NVIDIA Container Toolkit

Why AINode?

	Cloud AI	AINode
Monthly cost	$100–10,000+	$0 (you own the hardware)
Data privacy	Your data on their servers	Your data stays local
Rate limits	Yes	None
Latency	200–2000 ms	10–50 ms
Fine-tuning	Limited, expensive	Unlimited, free
Internet required	Yes	No
Models available	Their choice	Your choice

Roadmap

Contributing

AINode is Apache-2.0 and welcomes contributions. See CONTRIBUTING.md — and please run the test suite (pytest tests/) before opening a PR.

License

Apache 2.0 — use it however you want.

_{crafted with ♥ by Jason Brashear · powered by argentos.ai}

Name		Name	Last commit message	Last commit date
Latest commit History 137 Commits
.github/workflows		.github/workflows
ainode		ainode
docs/images		docs/images
ops		ops
scripts		scripts
tests		tests
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

AINode

What AINode is

Screenshots

Cluster view — 2 nodes, 244 GB aggregated VRAM, distributed inference

Chat

Server — API console (LM Studio style)

Downloads — live HF catalog

Training — overview

Training — templates

Config — cluster

Getting Started — step by step

Single node (solo mode)

Two nodes (distributed mode)

Features

State of Distributed Inference (April 2026)

What works today (verified)

What doesn't work yet

Lessons learned the hard way

Why 3 nodes is harder than 2 (and why 4 is probably easier)

Networking requirements

Recommended topologies

Diagnostic commands

Optional: GPU Direct RDMA (GDR)

Shared model storage across a cluster

CLI reference

API

Requirements

Why AINode?

Roadmap

Contributing

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages