ESNODE-Core features a professional, cloud-console-grade Terminal User Interface for real-time infrastructure monitoring:
✱ ESNODE Power-Aware AI Infrastructure ● ONLINE
───────────────────────────────────────────────────────────────────────
Navigation │
▶ Overview │ [CPU, Memory, Load Averages & Network Stats]
GPU & Power │
Network & Disk │
Efficiency & MCP │
Orchestrator │
Metrics Profiles │
Agent Status │
AIOps Intelligence │ [Autonomous RCA & Predictive Maintenance]
F5: Refresh | Arrow Keys: Navigate | Q/F3: Quit
Key Features:
- 🎨 Enterprise-grade dark navy theme
- 📊 Real-time gauges, tables, and status indicators
- 🧠 AIOps Intelligence Dashboard (Autonomous RCA & Risk Prediction)
- ⌨️ Intuitive keyboard navigation
- 🎯 Color-coded health warnings (green/amber/red)
- 📡 Auto-refresh every 5 seconds
Launch the TUI:
./esnode-core cli

This repository contains the source, build tooling, and documentation for the ESNODE-Core Agent.
- Ubuntu Server (primary, ~60–70% of AI fleet; best CUDA/driver/toolchain support)
- RHEL-compatible: RHEL / Rocky Linux / AlmaLinux (enterprise compliance, FIPS-ready)
- NVIDIA DGX OS (Ubuntu-based, pre-tuned for DGX appliances)
- SUSE Linux Enterprise Server (SLES) (enterprise/HPC niches)
- Debian (research/custom environments)
Packaging is provided as tar.gz plus optional deb/rpm; Windows zips are included for hybrid labs. On Linux, install scripts set up binaries under /usr/local/bin and create systemd units so esnode-core runs without extra PATH tweaks.
ESNODE-Core is a GPU-aware host metrics exporter for Linux nodes. It exposes CPU, memory, disk, network, and GPU telemetry at /metrics in Prometheus text format so observability stacks can scrape a single endpoint per node. Agents run in standalone mode.
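As a rough illustration of consuming the `/metrics` endpoint, the sketch below parses Prometheus text-format lines into a dictionary. The function name `parse_prometheus_text` and the sample values are hypothetical; `example_metric` is a placeholder, not a metric the agent is known to emit. It handles only the simple `name{labels} value` shape and is not a full parser.

```python
from typing import Dict, Tuple


def parse_prometheus_text(body: str) -> Dict[Tuple[str, str], float]:
    """Parse Prometheus text exposition lines into {(name, labels): value}.

    Simplified: ignores HELP/TYPE comments and optional timestamps.
    """
    samples: Dict[Tuple[str, str], float] = {}
    for raw in body.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name, rest = line.split("{", 1)
            labels, rest = rest.rsplit("}", 1)
        else:
            name, _, rest = line.partition(" ")
            labels = ""
        # The first token after the name/labels is the sample value.
        samples[(name.strip(), labels)] = float(rest.split()[0])
    return samples


# Illustrative payload only; real output comes from the agent.
SAMPLE = """\
# TYPE esnode_degradation_score gauge
esnode_degradation_score 12
example_metric{gpu="0"} 61.5
"""

if __name__ == "__main__":
    for key, value in parse_prometheus_text(SAMPLE).items():
        print(key, value)
```

Against a running agent with default settings, the body could be fetched from `http://127.0.0.1:9100/metrics` with `urllib.request.urlopen` and passed to the same function.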
- GPU metrics require the `gpu` feature (enabled by default). MIG metrics additionally require building with the `gpu-nvml-ffi` feature and enabling `enable_gpu_mig` in config; otherwise MIG metrics will remain zero.
- `gpu_visible_devices` (or `NVIDIA_VISIBLE_DEVICES`) filters which GPUs are scraped; empty/`all` scrapes all.
- `mig_config_devices` (or `NVIDIA_MIG_CONFIG_DEVICES`) further filters which GPUs are considered for MIG scraping when `enable_gpu_mig` is true.
- `k8s_mode` emits a small set of legacy-compatible GPU metrics with a single `gpu` label (suffix `_compat`) using Kubernetes/CDI resource-style names (`nvidia.com/gpu`, `nvidia.com/gpu-<mig-profile>-mig`) alongside the UUID+index labeled metrics to reduce dashboard breakage.
- MIG compatibility labels are only emitted when `k8s_mode` is enabled; MIG info/metrics still require `gpu-nvml-ffi` + `enable_gpu_mig`.
- `enable_gpu_events` starts an NVML event loop for XID/ECC/power/clock events (best-effort). The loop is non-blocking with a short poll and may miss very bursty streams; counters are monotonic but not guaranteed exact.
- `gpu-nvml-ffi-ext` is an optional feature gate for extra NVML FFI (PCIe field counters, etc.). These are best-effort and unverified without suitable hardware; placeholders remain zero when unsupported.
- NVSwitch/copy-engine clocks/power-cap reason codes are exposed as gauges but rely on NVML support; many remain zero on hardware/driver versions that do not surface them.
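The device filter described above can be pictured with a small sketch. It assumes the filter value is a comma-separated list of GPU indices or UUIDs (the usual `NVIDIA_VISIBLE_DEVICES` convention); the exact formats the agent accepts are not stated here, and `select_gpus` is a hypothetical helper rather than the agent's code.

```python
from typing import List, Optional


def select_gpus(visible: Optional[str], gpus: List[dict]) -> List[dict]:
    """Return the GPUs to scrape for a gpu_visible_devices-style filter.

    An unset, empty, or "all" filter keeps every GPU, as described above.
    Assumption: other values are comma-separated indices or UUIDs.
    """
    if visible is None or visible.strip().lower() in ("", "all"):
        return list(gpus)
    wanted = {tok.strip() for tok in visible.split(",") if tok.strip()}
    return [
        g for g in gpus
        if str(g["index"]) in wanted or g["uuid"] in wanted
    ]


# Hypothetical inventory; real UUIDs come from NVML.
GPUS = [
    {"index": 0, "uuid": "GPU-aaaa"},
    {"index": 1, "uuid": "GPU-bbbb"},
]

if __name__ == "__main__":
    print(select_gpus("1", GPUS))
```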
- Single binary with zero-config defaults (`0.0.0.0:9100`, 5s interval).
- Collectors: CPU, memory, disk, network, GPU (NVML-based; gracefully disabled if unavailable).
- Power-aware: optional power collector reads RAPL/hwmon/BMC paths for CPU/package/node power; GPU power via NVML.
- Self-observability: scrape duration + error counters per collector.
- Health endpoint at `/healthz`.
- JSON status endpoint at `/status` (`/v1/status` alias) with node load, power, temps, GPU snapshot (identity/health/MIG tree), last scrape/errors; SSE stream at `/events` for near-real-time loops.
- Degradation signals: disk busy/latency, network drops/retrans, swap spikes, and GPU throttle/ECC flags roll up into `esnode_degradation_score`; surfaced in `/status` and the TUI.
- Local TSDB defaults to a user-writable XDG path so non-root runs no longer fail on `/var/lib/esnode/tsdb`; override with `local_tsdb_path` when you want `/var/lib`.
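For the SSE stream at `/events`, a client needs to group `data:` lines into events. The sketch below follows the general Server-Sent Events framing (fields separated by a colon, events terminated by a blank line); the payload shape the agent sends is not documented here, so the sample lines are purely illustrative and `iter_sse_events` is a hypothetical helper.

```python
from typing import Dict, Iterable, Iterator, List


def iter_sse_events(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Group Server-Sent Events lines into dicts, one per dispatched event."""
    fields: Dict[str, str] = {}
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the event; events without data are dropped.
            if data:
                fields["data"] = "\n".join(data)
                yield fields
            fields, data = {}, []
            continue
        if line.startswith(":"):
            continue  # comment / keep-alive line
        name, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is not part of the value
        if name == "data":
            data.append(value)
        else:
            fields[name] = value


# Illustrative stream only; the agent's real payload may differ.
SAMPLE_LINES = [
    "event: status\n",
    'data: {"ok": true}\n',
    "\n",
    ": keep-alive\n",
    "data: a\n",
    "data: b\n",
    "\n",
]

if __name__ == "__main__":
    for event in iter_sse_events(SAMPLE_LINES):
        print(event)
```

With a running agent on the default port, the lines could come from iterating over `urllib.request.urlopen("http://127.0.0.1:9100/events")` and decoding each one as UTF-8.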
- Thermal Management: Automatically shifts workloads away from overheating devices (>85°C) using real-time thermal telemetry.
- Energy Efficiency: Scoring algorithm prefers devices with better performance-per-watt metrics.
- Local Control Plane: Autonomous decision making (preemption, bin-packing) runs directly on the node without external dependencies.
- Model Awareness: App collector integration allows for custom application metrics (e.g., tokens/sec) to influence scheduling decisions.
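
To make the thermal and efficiency rules above concrete, here is a minimal sketch of one possible placement rule: skip devices hotter than 85°C, then prefer the best performance-per-watt. The field names (`temp_c`, `power_w`, `throughput`) and the function `pick_device` are assumptions for illustration; the agent's actual scoring algorithm is not described in this document.

```python
from typing import List, Optional


def pick_device(devices: List[dict], max_temp_c: float = 85.0) -> Optional[dict]:
    """Pick the coolest-enough device with the best throughput per watt.

    Returns None when every device is over the thermal limit.
    """
    eligible = [
        d for d in devices
        if d["temp_c"] <= max_temp_c and d["power_w"] > 0
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda d: d["throughput"] / d["power_w"])


# Hypothetical telemetry snapshot.
DEVICES = [
    {"name": "gpu0", "temp_c": 70.0, "power_w": 250.0, "throughput": 100.0},
    {"name": "gpu1", "temp_c": 90.0, "power_w": 200.0, "throughput": 200.0},
    {"name": "gpu2", "temp_c": 60.0, "power_w": 150.0, "throughput": 90.0},
]

if __name__ == "__main__":
    print(pick_device(DEVICES))
```

Here `gpu1` has the highest raw efficiency but is excluded for running above the limit, so the choice falls to `gpu2`.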
- Declarative Profiles: Define efficiency policies in YAML (e.g., "Throttle if > 82°C", "Alert if tokens/watt < 0.5").
- Continuous Enforcement: Agent continuously monitors and enforces policies in the background.
- Safety Indicators: Built-in flap detection prevents rapid toggling of enforcement actions.
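
One common way to avoid flapping is to combine a hysteresis band with a minimum hold time. The sketch below applies that idea to the "Throttle if > 82°C" example; the 78°C release point, the 30-second hold, and the class `ThrottlePolicy` are assumptions chosen for illustration, not the agent's documented behavior.

```python
from typing import Optional


class ThrottlePolicy:
    """Threshold policy with hysteresis and a minimum hold time."""

    def __init__(self, threshold_c: float = 82.0, clear_below_c: float = 78.0,
                 min_hold_s: float = 30.0) -> None:
        self.threshold_c = threshold_c
        self.clear_below_c = clear_below_c
        self.min_hold_s = min_hold_s
        self.active = False
        self._last_change: Optional[float] = None

    def update(self, temp_c: float, now: float) -> bool:
        """Feed one temperature sample; return whether throttling is active."""
        want = self.active
        if not self.active and temp_c > self.threshold_c:
            want = True
        elif self.active and temp_c < self.clear_below_c:
            want = False
        if want != self.active:
            held = None if self._last_change is None else now - self._last_change
            if held is not None and held < self.min_hold_s:
                return self.active  # too soon after the last change: suppress
            self.active = want
            self._last_change = now
        return self.active


if __name__ == "__main__":
    policy = ThrottlePolicy()
    for t, temp in [(0, 80), (1, 83), (5, 77), (40, 77)]:
        print(t, temp, policy.update(temp, t))
```

The hold time means a brief dip below the release point does not immediately lift the throttle, which is the toggling the safety bullet is meant to prevent.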
- Automated RCA: Correlates GPU performance dips with Kubernetes pod events, network packet loss, and thermal throttling.
- Predictive Maintenance: Real-time failure risk scoring based on ECC Deep-Dive (Corrected/Uncorrected), thermal stress history, and retired memory pages.
- AIOps TUI Dashboard: Dedicated real-time visualization console (jump with hotkey '8') for all automated detections.
- Observability: Prometheus metrics (`esnode_rca_detections_total`, `esnode_gpu_failure_risk_score`) track all autonomous insights.
- Cluster Federation: Connect multiple independent nodes via Gossip protocol.
- Global Optimization: Cross-node workload migration for rack-level power capping.
- Source-available under ESNODE BUSL-1.1 (see `LICENSE`).
- Trademarks governed by `docs/TRADEMARK_POLICY.md`; no rebranding or redistribution of binaries.
- Contributions require agreement to `docs/CLA.md`.
- Official binaries and commercial terms are controlled solely by Estimatedstocks AB.
- `esnode-core`: per-node collector exposing Prometheus `/metrics`, JSON `/status` (`/v1/status`), and SSE `/events`.
See docs/architecture.md and docs/platform-matrix.md for topology and build targets.
cargo build --workspace --release
./target/release/esnode-core

- Cross-compiling on macOS for `x86_64-unknown-linux-gnu` / `aarch64-unknown-linux-gnu` requires the corresponding GNU toolchains (e.g., `brew install x86_64-unknown-linux-gnu`).
Configuration precedence: CLI flags > env vars > esnode.toml > defaults. See docs/quickstart.md for full examples.
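
The precedence order can be sketched as a layered lookup in which the first layer that defines a key wins. The key names `listen_address` and `scrape_interval_seconds` are hypothetical stand-ins for the documented defaults (`0.0.0.0:9100`, 5s), and type coercion of environment strings is deliberately left out.

```python
from collections import ChainMap
from typing import Any, Dict, Mapping

# Hypothetical key names; only the default values come from this README.
DEFAULTS: Dict[str, Any] = {
    "listen_address": "0.0.0.0:9100",
    "scrape_interval_seconds": 5,
}


def resolve_config(cli: Mapping[str, Any], env: Mapping[str, Any],
                   file_cfg: Mapping[str, Any],
                   defaults: Mapping[str, Any] = DEFAULTS) -> Dict[str, Any]:
    """Merge layers so that CLI flags > env vars > esnode.toml > defaults."""
    layers = [
        # Drop unset (None) values so a lower-priority layer can supply them.
        {k: v for k, v in layer.items() if v is not None}
        for layer in (cli, env, file_cfg, defaults)
    ]
    # ChainMap returns the value from the first mapping that has the key.
    return dict(ChainMap(*layers))


if __name__ == "__main__":
    print(resolve_config(
        cli={"k8s_mode": True},
        env={"enable_gpu_mig": "true", "k8s_mode": "false"},
        file_cfg={"enable_gpu_mig": False, "scrape_interval_seconds": 10},
    ))
```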
- Config flags of interest:
  - `enable_gpu_mig` (default false) – turn on MIG scraping when built with `gpu-nvml-ffi`.
  - `enable_gpu_events` (default false) – run NVML event loop (best-effort) for XID/ECC/clock/power events.
  - `enable_gpu_amd` (default false) – experimental AMD/ROCm collector scaffolding; emits no metrics unless rsmi/rocm-smi support is added.
  - `k8s_mode` (default false) – emit compatibility labels using Kubernetes/CDI resource names alongside UUID/index labels.
  - `gpu_visible_devices` / `NVIDIA_VISIBLE_DEVICES` – filter visible GPUs.
  - `mig_config_devices` / `NVIDIA_MIG_CONFIG_DEVICES` – filter MIG-capable GPUs when `enable_gpu_mig` is true.
- Optional `gpu-nvml-ffi-ext` feature enables additional NVML field-based counters (PCIe/etc.), best-effort only.
- `enable_app` + `app_metrics_url` – app/model metrics collector uses a 2s HTTP timeout; slow or hung endpoints are skipped for that scrape without blocking other collectors.
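
The skip-on-timeout behavior of the app collector can be approximated as follows. This is a sketch of the pattern, not the agent's code: `fetch_app_metrics` is a hypothetical helper, and `urllib`'s `timeout` bounds each blocking socket operation rather than the whole request, so it is only an approximation of a strict 2-second budget.

```python
import http.client
import urllib.request
from typing import Optional

APP_TIMEOUT_SECONDS = 2.0  # matches the 2s timeout described above


def fetch_app_metrics(url: str, timeout: float = APP_TIMEOUT_SECONDS) -> Optional[str]:
    """Fetch app metrics, returning None so the caller can skip this scrape."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (OSError, http.client.HTTPException, ValueError):
        # Timeouts, refused connections, HTTP errors, and malformed URLs all
        # mean "no data this round"; other collectors are unaffected.
        return None


if __name__ == "__main__":
    # Placeholder URL; point this at the value of app_metrics_url.
    body = fetch_app_metrics("http://127.0.0.1:8000/metrics")
    print("skipped" if body is None else f"{len(body)} bytes")
```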
Local TSDB path (default): when enable_local_tsdb is true, the agent now resolves local_tsdb_path to $XDG_DATA_HOME/esnode/tsdb or ~/.local/share/esnode/tsdb so non-root runs don’t fail on /var/lib. Set ESNODE_LOCAL_TSDB_PATH or the config key if you want /var/lib/esnode/tsdb and ensure the directory is writable by the agent user.
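
The resolution order described above can be written out as a short function. `resolve_tsdb_path` is a hypothetical helper; it assumes the environment variable takes priority over the config key, in line with the stated precedence of env vars over `esnode.toml`.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_tsdb_path(env: Mapping[str, str], configured: Optional[str] = None,
                      home: Optional[Path] = None) -> Path:
    """Resolve the TSDB directory: explicit override, then XDG, then ~/.local/share."""
    override = env.get("ESNODE_LOCAL_TSDB_PATH") or configured
    if override:
        return Path(override)
    xdg = env.get("XDG_DATA_HOME")
    base = Path(xdg) if xdg else (home or Path.home()) / ".local" / "share"
    return base / "esnode" / "tsdb"


if __name__ == "__main__":
    print(resolve_tsdb_path(os.environ))
```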
- Tagging `vX.Y.Z` on the default branch triggers `.github/workflows/release.yml`, which:
  - Runs `cargo test --workspace --locked --target <triple>` on Linux (x86_64), macOS (aarch64), and Windows (x86_64).
  - Builds release binaries with default features for the same triples.
  - Packages artifacts as tar.gz (Linux/macOS) or zip (Windows) and attaches them to the GitHub Release created from the tag.
- Artifact names:
  - `esnode-core-linux-x86_64.tar.gz`
  - `esnode-core-macos-aarch64.tar.gz`
  - `esnode-core-windows-x86_64.zip`
- Binaries are built with default features; MIG metrics still require `gpu-nvml-ffi` and `enable_gpu_mig` when running on MIG-capable hosts.
- For additional targets or feature builds, run `cargo build --release --locked --target <triple>` locally and publish as needed.
- Manual packaging: `scripts/dist/esnode-core-release.sh` (optionally with `ESNODE_VERSION=X.Y.Z`) builds and collects Linux tar/deb/rpm bundles (and Windows zip if toolchain available) under `public/distribution/releases/`. Push a tag `vX.Y.Z` after verification to publish GitHub release artifacts automatically.
Community & policies:
- Contribution guidelines: see `CONTRIBUTING.md`.
- Code of Conduct: see `CODE_OF_CONDUCT.md`.
- Security reporting: see `SECURITY.md`.
- Support & upgrade policy: see `docs/support-policy.md`.
- Metric label migration (UUID-first): see `docs/migrations.md`.
- Sponsorship: see `docs/sponsorship.md` (GitHub Sponsors for ESNODE).
- Containers: see `docs/container.md` for distroless build/run instructions.
- TUI surfaces degradation flags/score on Node Overview, Network & Disk, Agent Status; orchestrator screen shows token/loopback exposure.
- App collector uses a 2s timeout; slow endpoints are skipped per scrape to avoid blocking other collectors.
- TSDB: defaults to XDG (`~/.local/share/esnode/tsdb`); opt into `/var/lib/esnode/tsdb` explicitly and pre-create with correct perms.
- Orchestrator: keep loopback-only unless `allow_public=true` and `token` is set; audit logs appear under tracing target `audit`.

- `cargo test --workspace` includes a TUI render smoke test using ratatui’s test backend (no PTY required).
- New metrics live in `docs/metrics-list.md`; gap tracking in `docs/gap-logbook.md`.
- Local HTTP defaults avoid privileged paths; adjust in `crates/agent-core/src/config.rs` if changing defaults.
Fastest path (recommended):
curl -fsSL https://raw.githubusercontent.com/ESNODE/ESNODE-Core/main/public/install.sh | sh

Notes:
- Installs the `esnode-core` binary under `/usr/local/bin` and (on Linux) enables a `systemd` service with a default `/etc/esnode/esnode.toml`.
- To pin a version or avoid systemd setup, pass args via `sh -s --`:

      curl -fsSL https://raw.githubusercontent.com/ESNODE/ESNODE-Core/main/public/install.sh | sh -s -- --version 0.1.0 --no-service
Example commands (adjust version/OS paths):
- Ubuntu/Debian (`.deb`):

      wget -O esnode-core_0.1.0_amd64.deb https://esnode.co/downloads/esnode-core_0.1.0_amd64.deb
      sudo dpkg -i esnode-core_0.1.0_amd64.deb
      sudo systemctl enable esnode-core && sudo systemctl start esnode-core

- RHEL/CentOS (`.rpm`):

      wget -O esnode-core-0.1.0-1.x86_64.rpm https://esnode.co/downloads/esnode-core-0.1.0-1.x86_64.rpm
      sudo rpm -i esnode-core-0.1.0-1.x86_64.rpm
      sudo systemctl enable esnode-core && sudo systemctl start esnode-core

- Generic Linux (`tar.gz`):

      wget -O esnode-core-0.1.0-linux-amd64.tar.gz https://esnode.co/downloads/esnode-core-0.1.0-linux-amd64.tar.gz
      tar xvf esnode-core-0.1.0-linux-amd64.tar.gz
      sudo mv esnode-core /usr/local/bin/
      esnode-core --version

- Windows/macOS artifacts will follow the same public repo layout when available.
- Docker: `Dockerfile` (builds from `public/distribution/releases/linux-amd64/esnode-core-0.1.0-linux-amd64.tar.gz`)
- systemd: `deploy/systemd/esnode-core.service`
- Kubernetes DaemonSet: `deploy/k8s/daemonset.yaml`
- Build/pull image: `docker build -t myregistry/esnode-core:0.1.0 -f Dockerfile .` (uses `public/distribution/releases/linux-amd64/esnode-core-0.1.0-linux-amd64.tar.gz`).
- Apply manifests (headless service + ConfigMap + DaemonSet):

      kubectl apply --dry-run=client -f deploy/k8s/esnode-configmap.yaml
      kubectl apply --dry-run=client -f deploy/k8s/esnode-service.yaml
      kubectl apply --dry-run=client -f deploy/k8s/esnode-daemonset.yaml
      kubectl apply -f deploy/k8s/

- ConfigMap (`esnode.toml`) uses loopback-only orchestrator by default, enables TSDB at `/var/lib/esnode/tsdb` (hostPath volume), and keeps collectors on.
- DaemonSet runs hostNetwork+hostPID, privileged for NVML access, and mounts `/dev` plus TSDB hostPath. Probes hit `/healthz`; port 9100 is exposed via headless Service for scraping.
- Override `image:` and namespace as needed; set `local_tsdb_path` to match your volume; set `orchestrator.token` and `allow_public` only when intentionally exposing the control API.
- If building multi-arch images, supply the matching tarball for each arch (e.g., `linux-arm64`); current Dockerfile targets `linux-amd64`.
helm upgrade --install esnode-core ./deploy/helm/esnode-core \
--set image.repository=myregistry/esnode-core \
--set image.tag=0.1.0 \
--set tsdb.hostPath=/var/lib/esnode/tsdb \
--set config.orchestrator.allowPublic=false \
--set config.orchestrator.token=""

Adjust hostPath, namespace (`-n`), tolerations/nodeSelector, and orchestrator/token as needed.
- Quickstart: `docs/quickstart.md`
- Metrics reference: `docs/metrics-list.md`
- Monitoring examples: `docs/monitoring-examples.md`
- Architecture: `docs/architecture.md`
- Platform matrix: `docs/platform-matrix.md`
- Dashboards & alerts: `docs/dashboards/grafana-esnode-core.json` and `docs/dashboards/alerts.yaml` (import into Grafana/Prometheus)
- Smoke test script: `scripts/smoke.sh` (builds, runs core locally, curls endpoints)
ESNODE-Core is designed to run as a DaemonSet on Kubernetes, providing monitoring for each node.
- Docker Image: `myregistry/esnode-core:0.1.0` (replace `myregistry` with your actual registry)
- Release Tarball: esnode-core-0.1.0-linux-amd64.tar.gz
- Checksum: `public/distribution/releases/linux-amd64/esnode-core-0.1.0-linux-amd64.tar.gz.sha256`
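
Before building an image from the tarball, it may be worth checking it against the published checksum. The sketch below assumes the `.sha256` file is either a bare hex digest or in `sha256sum` format (`<hex>  <filename>`); that layout is an assumption, and `verify_checksum` is a hypothetical helper.

```python
import hashlib
from pathlib import Path


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(tarball: str, checksum_file: str) -> bool:
    """Compare a file's SHA-256 with the first token of its checksum file."""
    expected = Path(checksum_file).read_text().split()[0].lower()
    return sha256_of(tarball) == expected


if __name__ == "__main__":
    base = "public/distribution/releases/linux-amd64/"
    name = "esnode-core-0.1.0-linux-amd64.tar.gz"
    print(verify_checksum(base + name, base + name + ".sha256"))
```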
- Configure: Edit `deploy/k8s/esnode-configmap.yaml` to adjust `esnode.toml` settings (e.g., specific orchestrator URL).
- Apply Manifests:

      kubectl apply -f deploy/k8s/esnode-configmap.yaml
      kubectl apply -f deploy/k8s/esnode-daemonset.yaml
      kubectl apply -f deploy/k8s/esnode-service.yaml

The DaemonSet automatically requests privileges for hardware monitoring:
- `privileged: true` security context
- `hostPID: true` for process monitoring
- `hostNetwork: true` (optional, but recommended for agent connectivity)
- `/dev` and `/proc` mounts
Ensure your nodes support these capabilities. For GPU monitoring, the DaemonSet sets NVIDIA_VISIBLE_DEVICES=all.
