CloudBalancer

A distributed task scheduling platform on Java 21, Spring Boot 4, Kafka, and PostgreSQL/TimescaleDB. The dispatcher accepts tasks over REST, schedules them onto worker containers that per-host agents spawn through Docker, and streams logs and events back to a React dashboard. Executors cover shell, Docker, Python, and simulated workloads. The cluster auto-scales on CPU and queue pressure, retries failed tasks, dead-letters poison pills, and exposes chaos controls for fault-tolerance experiments.

Architecture

                                        +--------------------+
                                        |  Web Dashboard     |
                                        |  (React / Vite)    |
                                        +---------+----------+
                                                  | HTTPS + WSS
                                                  v
Client / CLI  --REST/JWT-->   +---------------------+
                              |   Dispatcher :8080  |
                              |  REST, scheduling,  |
                              |  auto-scaling, WS   |
                              +----+----+------+----+
                                   |    |      |
                          JDBC     |    |Kafka |Kafka (REMOTE :29094, SASL)
                                   v    v      v
                       +-----------------+    +--------------------+
                       | PostgreSQL /    |    | Worker-Agent (per  |
                       | TimescaleDB     |    | host) -- Docker    |
                       +--------+--------+    | orchestrator       |
                                ^             +----------+---------+
                                |                        |
                                |                        | spawns
                                |                        v
                    +-----------+------------+    +------+-------+
                    | Metrics Aggregator     |<---+  Worker(s)   |
                    | :8081 (Kafka consumer) |Kafka| (executors) |
                    +------------------------+    +--------------+

Dispatcher — REST API, task scheduling, worker/agent registry, auto-scaling, WebSocket log + event streaming, JWT auth.
Worker-Agent — runs on each host, registers with the dispatcher using a pre-issued token, receives StartWorkerCommand/StopWorkerCommand over Kafka, and spawns/kills worker containers via the local Docker socket. Publishes heartbeats with per-host capacity.
Worker — executes tasks via pluggable executors, publishes heartbeat/metrics/results over Kafka, supports drain-aware shutdown.
Metrics Aggregator — consumes worker metrics from Kafka, persists time-series data in TimescaleDB, exposes metrics REST API.
Web Dashboard — React 19 + TypeScript + Vite SPA for task submission, worker/agent monitoring, live metrics, scaling and chaos controls.
Common — shared models, events, executor framework, agent protocol (no Spring dependency).

Per-module notes and the indexed wiki live under docs/ and wiki/.

Prerequisites

JDK 21 (auto-provisioned via Gradle toolchain)
Docker + Docker Compose (required for Testcontainers and the dev stack)
Node.js 20+ and npm (only if running the web dashboard locally)
No Gradle install needed — the gradlew wrapper handles it

Quick Start

Option A: Full Stack via Docker Compose

Build the backend JARs, then start everything:

./gradlew build -x test

cd docker
docker compose -f docker-compose.dev.yml up --build

The dev stack brings up:

Service	Purpose	Host port
`kafka`	Apache Kafka 3.9 (KRaft, SASL REMOTE listener)	`9092`, `29094`
`postgres`	TimescaleDB on PostgreSQL 16	`5432`
`nginx`	Reverse proxy fronting dispatcher + metrics	`80`
`dispatcher`	Control plane (REST + WebSocket)	via nginx `:80`
`metrics-aggregator`	Time-series metrics API	via nginx `:80`
`worker`	Seed worker container (not agent-spawned)	—
`worker-agent-1`	Host agent, capacity 8 CPU / 16 GB	—
`worker-agent-2`	Host agent, capacity 4 CPU / 8 GB	—

The compose file sets SCALING_RUNTIME_MODE=AGENT, so the dispatcher asks worker-agents to spawn new worker containers instead of running Docker itself. The other mode, DOCKER (the dispatcher default), drives the local Docker socket directly and only makes sense on a single host.

Option B: Infrastructure Only, Services via Gradle

Start Kafka + Postgres in Docker, run JVM services locally:

cd docker
docker compose -f docker-compose.dev.yml up kafka postgres

Then in separate terminals:

# Terminal 1: Dispatcher (:8080)
./gradlew :dispatcher:bootRun

# Terminal 2: Worker (default ID: worker-1)
./gradlew :worker:bootRun

# Terminal 3: Second worker with custom ID
WORKER_ID=worker-2 ./gradlew :worker:bootRun

# Terminal 4: Metrics Aggregator (:8081)
./gradlew :metrics-aggregator:bootRun

# Terminal 5 (optional): Worker-Agent (Docker orchestrator)
AGENT_ID=agent-1 ./gradlew :worker-agent:bootRun

Option C: Web Dashboard Dev Server

The dashboard is a Vite SPA. From web-dashboard/:

cd web-dashboard
npm install
npm run dev      # http://localhost:5173

# Other scripts
npm run build    # production build to dist/
npm run test     # Vitest unit tests
npm run lint

Point it at the backend with VITE_API_URL, VITE_METRICS_URL, and VITE_WS_URL. The defaults target http://localhost and ws://localhost, i.e. the compose nginx.

First Login

The dispatcher seeds a default admin on startup: admin / admin.

# Log in and keep the JWT access token (the response also includes a refresh token)
TOKEN=$(curl -s -X POST http://localhost:8080/api/auth/login \
  -H 'Content-Type: application/json' \
  -d '{"username":"admin","password":"admin"}' | jq -r .accessToken)

# Use the token for authenticated requests
curl -H "Authorization: Bearer $TOKEN" http://localhost:8080/api/tasks

API Reference

Dispatcher (`:8080`, proxied at `:80` via nginx)

Authentication (`/api/auth`)

Method	Path	Auth	Description
POST	`/login`	public	Login, returns access + refresh JWTs
POST	`/refresh`	public	Rotate refresh token, issue new access token
POST	`/logout`	authenticated	Revoke all refresh tokens for the caller

Tasks (`/api/tasks`)

Method	Path	Auth	Description
POST	(base path)	ADMIN, OPERATOR, API_CLIENT	Submit a new task
GET	(base path)	authenticated	List tasks with pagination + filters (`offset`, `limit`, `status`, `priority`, `executorType`, `workerId`, `since`)
GET	`/{id}`	authenticated	Get task by ID
GET	`/{id}/logs`	authenticated	Stored stdout / stderr for a finished task
GET	`/{id}/artifacts/{name}`	authenticated	Download a task artifact
POST	`/bulk/cancel`	ADMIN, OPERATOR	Cancel a batch of tasks
POST	`/bulk/retry`	ADMIN, OPERATOR	Retry a batch of failed tasks
POST	`/bulk/reprioritize`	ADMIN, OPERATOR	Update priority for many tasks

Workers upload artifacts through POST /internal/tasks/{taskId}/artifacts/{name} (unauthenticated, octet-stream body, not under /api/tasks).

Workers (`/api/workers`)

Method	Path	Auth	Description
GET	(base path)	authenticated	List workers with health state and active task counts

Agent Registration (`/api/agents`)

Method	Path	Auth	Description
POST	`/register`	public (token)	Agent self-registration; dispatcher validates SHA-256 token hash, returns Kafka SASL credentials

Admin — Agents (`/api/admin/agents`, ADMIN)

Method	Path	Description
GET	(base path)	List live agents with capacity and host info
GET	`/{agentId}`	Agent detail
GET	`/{agentId}/workers`	Workers currently running on the agent

Admin — Agent Tokens (`/api/admin/agent-tokens`, ADMIN)

Method	Path	Description
POST	(base path)	Mint a new registration token (returned once, plaintext)
GET	(base path)	List token metadata (label, created, last-used, revoked)
POST	`/{id}/revoke`	Revoke a token

Admin — Scheduling (`/api/admin`, ADMIN)

Method	Path	Description
GET	`/strategy`	Active scheduling strategy + weights
PUT	`/strategy`	Set strategy (`ROUND_ROBIN`, `WEIGHTED_ROUND_ROBIN`, `LEAST_CONNECTIONS`, `RESOURCE_FIT`, `CUSTOM`)
PUT	`/workers/{id}/tags`	Update the tag set for a worker (used by `TaskConstraints.requiredTags`)

Auto-Scaling (`/api/scaling`, ADMIN)

Method	Path	Description
GET	`/status`	Current worker counts, policy, cooldown state
PUT	`/policy`	Update policy (`minWorkers`, `maxWorkers`, `cooldownSeconds`, `scaleUpStep`, `scaleDownStep`, `drainTimeSeconds`)
POST	`/trigger`	Manually scale up/down by N
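
As an illustration of how the policy fields above might interact, the following Java sketch clamps a scale step to the minWorkers/maxWorkers bounds and checks a cooldown window. Only the field names come from the PUT /policy description; the record name ScalingPolicy, its methods, and the validation rules are assumptions made for this example, and the dispatcher's real logic may differ.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical holder for the documented policy fields. */
record ScalingPolicy(int minWorkers, int maxWorkers, int cooldownSeconds,
                     int scaleUpStep, int scaleDownStep, int drainTimeSeconds) {

    // Assumed sanity checks; the real service may validate differently.
    ScalingPolicy {
        if (minWorkers < 0 || maxWorkers < minWorkers) {
            throw new IllegalArgumentException("need 0 <= minWorkers <= maxWorkers");
        }
        if (scaleUpStep < 1 || scaleDownStep < 1) {
            throw new IllegalArgumentException("steps must be >= 1");
        }
    }

    /** True while fewer than cooldownSeconds have passed since the last scaling action. */
    boolean inCooldown(Instant lastScaleAt, Instant now) {
        return lastScaleAt != null
                && Duration.between(lastScaleAt, now).getSeconds() < cooldownSeconds;
    }

    /** Worker count after one scale-up step, never above maxWorkers. */
    int scaleUpTarget(int currentWorkers) {
        return Math.min(maxWorkers, currentWorkers + scaleUpStep);
    }

    /** Worker count after one scale-down step, never below minWorkers. */
    int scaleDownTarget(int currentWorkers) {
        return Math.max(minWorkers, currentWorkers - scaleDownStep);
    }
}
```

With new ScalingPolicy(1, 5, 60, 2, 1, 30), scaleUpTarget(4) is capped at 5 and scaleDownTarget(1) stays at 1.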

Chaos (`/api/admin/chaos`, ADMIN)

Method	Path	Description
POST	`/kill-worker`	Terminate a (random or named) worker
POST	`/fail-task`	Force-fail a running task
POST	`/latency`	Inject artificial latency into a component for N seconds

WebSocket

Path	Description
`ws://localhost:8080/api/tasks/{id}/logs/stream?token=JWT`	Real-time stdout/stderr streaming for a task
`ws://localhost:8080/api/ws/events?token=JWT`	Dashboard event bus: task updates, worker state, scaling events, alerts, and initial snapshot
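
Both endpoints take the JWT as a token query parameter. A small Java helper for building these URLs could look like the sketch below. The class and method names are hypothetical, and the task ID is assumed to be URL-safe already, so only the token is encoded.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that builds the two documented WebSocket URLs. */
class StreamUrls {

    /** Log stream for one task, e.g. wsBase = "ws://localhost:8080". */
    static URI taskLogs(String wsBase, String taskId, String jwt) {
        return URI.create(wsBase + "/api/tasks/" + taskId + "/logs/stream?token=" + encode(jwt));
    }

    /** Dashboard event bus. */
    static URI events(String wsBase, String jwt) {
        return URI.create(wsBase + "/api/ws/events?token=" + encode(jwt));
    }

    // JWTs are base64url segments joined by dots, which pass through unchanged;
    // encoding is kept as a guard against unexpected characters.
    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }
}
```

For example, StreamUrls.taskLogs("ws://localhost:8080", "42", "a.b.c") produces ws://localhost:8080/api/tasks/42/logs/stream?token=a.b.c, which can then be handed to any WebSocket client.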

Metrics Aggregator (`:8081`)

Method	Path	Auth	Description
GET	`/api/metrics/workers`	authenticated	Latest metrics for all workers
GET	`/api/metrics/workers/{id}/history`	authenticated	Time-series metrics for one worker
GET	`/api/metrics/cluster`	authenticated	Cluster-wide aggregates

Submitting a Task

curl -X POST http://localhost:8080/api/tasks \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "hello-world",
    "executorType": "SHELL",
    "payload": {
      "command": "echo Hello from CloudBalancer && date"
    },
    "resourceProfile": {
      "cpuCores": 1,
      "memoryMB": 256,
      "diskMB": 100,
      "gpuRequired": false,
      "gpuMemoryMB": 0,
      "networkAccessRequired": false
    },
    "executionPolicy": {
      "timeoutSeconds": 30,
      "maxRetries": 2,
      "retryBackoffStrategy": "EXPONENTIAL"
    },
    "priority": "NORMAL"
  }'

Executors: SIMULATED, SHELL, DOCKER, PYTHON. Shell commands run through an allow/deny list; Python runs inside a per-task venv with optional pip requirements and network-namespace isolation on Linux; Docker executes with hardened defaults (dropped capabilities, no-new-privileges, memory/CPU limits, optional read-only rootfs).
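
To make the allow/deny idea concrete, here is a deliberately naive Java sketch that checks the binary name of each chained command. This README does not describe how the shell executor actually matches commands, so the record name, the splitting rule, and the deny-wins precedence are all assumptions. A production filter would also need to handle subshells, quoting, and redirection.

```java
import java.util.Set;

/** Hypothetical allow/deny check; not the project's actual implementation. */
record ShellCommandPolicy(Set<String> allowed, Set<String> denied) {

    /** Permits a command line only if every chained command starts with an allowed binary. */
    boolean permits(String commandLine) {
        // Split on &&, ||, ; and | so each chained command is checked separately.
        for (String part : commandLine.split("&&|\\|\\||;|\\|")) {
            String trimmed = part.trim();
            if (trimmed.isEmpty()) {
                return false; // empty segment: reject rather than guess
            }
            String binary = trimmed.split("\\s+")[0];
            if (denied.contains(binary)) {
                return false; // assumed precedence: deny wins over allow
            }
            if (!allowed.contains(binary)) {
                return false;
            }
        }
        return true;
    }
}
```

With allowed = {echo, date} and denied = {rm}, the payload from the example above (echo Hello from CloudBalancer && date) is permitted, while echo hi; rm -rf / is rejected.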

Task states: SUBMITTED → VALIDATED → QUEUED → ASSIGNED → PROVISIONING → RUNNING → POST_PROCESSING → COMPLETED, with failure and cancellation branches CANCELLED, TIMED_OUT, FAILED, and DEAD_LETTERED. FAILED and TIMED_OUT tasks are re-queued until the maxRetries budget is exhausted, after which they move to DEAD_LETTERED.
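
The lifecycle can be pictured as a small state machine. The Java sketch below encodes the happy path and the retry loop described here. The exact set of allowed transitions (for instance, which states may be cancelled or may fail) is not spelled out in this README, so the table in next() is an assumption, as are the method names.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical model of the task lifecycle; transitions beyond the happy path are assumed. */
enum TaskState {
    SUBMITTED, VALIDATED, QUEUED, ASSIGNED, PROVISIONING, RUNNING,
    POST_PROCESSING, COMPLETED, CANCELLED, TIMED_OUT, FAILED, DEAD_LETTERED;

    /** States assumed to be reachable directly from this one. */
    Set<TaskState> next() {
        return switch (this) {
            case SUBMITTED -> EnumSet.of(VALIDATED, CANCELLED);
            case VALIDATED -> EnumSet.of(QUEUED, CANCELLED);
            case QUEUED -> EnumSet.of(ASSIGNED, CANCELLED);
            case ASSIGNED -> EnumSet.of(PROVISIONING, CANCELLED);
            case PROVISIONING -> EnumSet.of(RUNNING, FAILED, CANCELLED);
            case RUNNING -> EnumSet.of(POST_PROCESSING, FAILED, TIMED_OUT, CANCELLED);
            case POST_PROCESSING -> EnumSet.of(COMPLETED, FAILED);
            // Retry loop: back to the queue, or dead-lettered once the budget is spent.
            case FAILED, TIMED_OUT -> EnumSet.of(QUEUED, DEAD_LETTERED);
            case COMPLETED, CANCELLED, DEAD_LETTERED -> EnumSet.noneOf(TaskState.class);
        };
    }

    boolean canTransitionTo(TaskState target) {
        return next().contains(target);
    }

    /** Where a FAILED or TIMED_OUT task goes, given how many retries it has already used. */
    static TaskState afterFailure(int retriesUsed, int maxRetries) {
        return retriesUsed < maxRetries ? QUEUED : DEAD_LETTERED;
    }
}
```

With maxRetries set to 2 as in the submission example, TaskState.afterFailure(1, 2) returns QUEUED and TaskState.afterFailure(2, 2) returns DEAD_LETTERED.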

Scheduling & Scaling

Strategies: ROUND_ROBIN, WEIGHTED_ROUND_ROBIN, LEAST_CONNECTIONS, RESOURCE_FIT, CUSTOM (weighted blend of WorkerScorer components). Hot-swappable via PUT /api/admin/strategy.
Scaling triggers: REACTIVE (CPU high/low windows), QUEUE_PRESSURE (queued-vs-active ratio), MANUAL (admin-initiated). Evaluated every 30 s by default.
Runtime modes: DOCKER (dispatcher default — it talks to its own Docker socket on a single host) or AGENT (compose default — worker-agents spawn containers across hosts). Set via SCALING_RUNTIME_MODE.
Worker constraints: tasks may specify requiredTags, whitelistedWorkers, blacklistedWorkers via TaskConstraints.
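
As a rough illustration of how constraints and a strategy could combine, the Java sketch below first filters workers by the three TaskConstraints fields and then applies a least-connections choice. The WorkerInfo record, the LeastConnectionsPicker class, the tie-break on worker ID, and the rule that an empty whitelist means "no whitelist" are assumptions for this example and are not taken from the dispatcher's code.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Hypothetical view of a worker as seen by the scheduler. */
record WorkerInfo(String id, Set<String> tags, int activeTasks) {}

/** Field names follow the README; the matching rules are assumed. */
record TaskConstraints(Set<String> requiredTags,
                       Set<String> whitelistedWorkers,
                       Set<String> blacklistedWorkers) {

    boolean allows(WorkerInfo worker) {
        if (blacklistedWorkers.contains(worker.id())) {
            return false;
        }
        // Assumption: an empty whitelist places no restriction.
        if (!whitelistedWorkers.isEmpty() && !whitelistedWorkers.contains(worker.id())) {
            return false;
        }
        return worker.tags().containsAll(requiredTags);
    }
}

class LeastConnectionsPicker {
    /** Eligible worker with the fewest active tasks; ties broken by ID for determinism. */
    static Optional<WorkerInfo> pick(List<WorkerInfo> workers, TaskConstraints constraints) {
        return workers.stream()
                .filter(constraints::allows)
                .min(Comparator.comparingInt(WorkerInfo::activeTasks)
                        .thenComparing(WorkerInfo::id));
    }
}
```

An empty Optional means no worker currently satisfies the constraints. What the dispatcher does with such a task is not covered here.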

Fault Tolerance

Retries with FIXED, EXPONENTIAL, or EXPONENTIAL_WITH_JITTER backoff, selected per task through executionPolicy.retryBackoffStrategy.
Dead-letter queue — tasks exceeding maxRetries transition to DEAD_LETTERED.
Circuit breakers (Resilience4j) around the Kafka producer.
Rate limiting (Bucket4j) on authenticated endpoints.
Drain-aware shutdown — workers accept DrainCommand with a grace period to finish in-flight tasks.
Kafka idempotency guards on result and assignment listeners.
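
The three backoff strategies can be sketched in a few lines of Java. The base delay, the cap, and the choice of "full jitter" (a uniform draw between zero and the exponential delay) are assumptions made for illustration. This README names the strategies but does not document their parameters or exact formulas.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

enum BackoffStrategy { FIXED, EXPONENTIAL, EXPONENTIAL_WITH_JITTER }

/** Hypothetical delay calculator; baseDelay and maxDelay are assumed parameters. */
class RetryBackoff {

    /** Delay before retry number {@code attempt} (1-based). */
    static Duration delayFor(BackoffStrategy strategy, int attempt,
                             Duration baseDelay, Duration maxDelay) {
        if (attempt < 1) {
            throw new IllegalArgumentException("attempt must be >= 1");
        }
        long capMs = maxDelay.toMillis();
        long fixedMs = Math.min(baseDelay.toMillis(), capMs);

        // base * 2^(attempt - 1), doubling step by step and stopping at the cap.
        long expMs = fixedMs;
        for (int i = 1; i < attempt && expMs < capMs; i++) {
            expMs = expMs > capMs / 2 ? capMs : expMs * 2;
        }

        return switch (strategy) {
            case FIXED -> Duration.ofMillis(fixedMs);
            case EXPONENTIAL -> Duration.ofMillis(expMs);
            // Full jitter: uniform in [0, expMs], which spreads out simultaneous retries.
            case EXPONENTIAL_WITH_JITTER ->
                    Duration.ofMillis(ThreadLocalRandom.current().nextLong(expMs + 1));
        };
    }
}
```

With a 1 s base and a 60 s cap, EXPONENTIAL gives 1 s, 2 s, and 4 s for attempts 1 to 3, and stays at 60 s from attempt 7 onward.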

Configuration

Key environment variables (defaults shown):

Variable	Default	Scope	Description
`KAFKA_BOOTSTRAP_SERVERS`	`localhost:9092`	all JVM services	Internal Kafka bootstrap
`KAFKA_EXTERNAL_BOOTSTRAP`	`localhost:29094`	dispatcher	REMOTE (SASL) bootstrap advertised to remote agents
`KAFKA_AGENT_USERNAME` / `KAFKA_AGENT_PASSWORD`	`cloudbalancer-agent` / `changeme`	dispatcher + remote agents	SASL_PLAINTEXT creds for REMOTE listener
`SPRING_DATASOURCE_URL`	`jdbc:postgresql://localhost:5432/cloudbalancer`	dispatcher, metrics	JDBC URL
`SPRING_DATASOURCE_USERNAME` / `_PASSWORD`	`postgres` / `postgres`	dispatcher, metrics	DB creds
`SCALING_RUNTIME_MODE`	`DOCKER` (dispatcher default), `AGENT` (compose)	dispatcher	How scale-ups are materialised
`WORKER_ID`	`worker-1`	worker	Unique worker identifier
`DISPATCHER_URL`	`http://localhost:8080`	worker, agent	Dispatcher base URL
`AGENT_ID`	`agent-1`	worker-agent	Unique agent identifier
`AGENT_CPU_CORES`	`8`	worker-agent	Declared host capacity
`AGENT_MEMORY_MB`	`16384`	worker-agent	Declared host capacity
`AGENT_REGISTRATION_TOKEN`	—	worker-agent	Token from `POST /api/admin/agent-tokens`
`DOCKER_WORKER_IMAGE`	`docker-worker`	worker-agent	Image spawned for new workers
`DOCKER_NETWORK_NAME`	`docker_default`	worker-agent	Docker network joined by spawned workers
`VITE_API_URL` / `VITE_METRICS_URL` / `VITE_WS_URL`	`http://localhost` / `ws://localhost`	dashboard	Backend endpoints

Scaling defaults live under cloudbalancer.dispatcher.scaling.* in dispatcher/src/main/resources/application.yml. Agent capacity and heartbeat cadence live under cloudbalancer.agent.* in worker-agent/src/main/resources/application.yml.
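
For readers who want to see the fallback behaviour spelled out, the following plain-Java sketch resolves a few of the worker-agent variables with the defaults from the table. It is only an illustration: the services themselves are Spring Boot applications configured through application.yml, and the AgentSettings record is a hypothetical name.

```java
import java.util.Map;

/** Hypothetical holder for a subset of the worker-agent settings listed above. */
record AgentSettings(String agentId, int cpuCores, int memoryMb, String dispatcherUrl) {

    /** Reads each variable from the given environment, falling back to the documented default. */
    static AgentSettings from(Map<String, String> env) {
        return new AgentSettings(
                env.getOrDefault("AGENT_ID", "agent-1"),
                Integer.parseInt(env.getOrDefault("AGENT_CPU_CORES", "8")),
                Integer.parseInt(env.getOrDefault("AGENT_MEMORY_MB", "16384")),
                env.getOrDefault("DISPATCHER_URL", "http://localhost:8080"));
    }
}
```

Calling AgentSettings.from(System.getenv()) picks up whatever is set in the process environment. Passing an explicit map keeps the lookup easy to test.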

Remote Agent Deployment

To attach an agent running on a different host:

Mint a registration token on the dispatcher:

curl -X POST http://dispatcher-host/api/admin/agent-tokens \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"label":"edge-host-1"}'

The response contains the plaintext token. It is shown only once, so record it before moving on.

On the target host, copy docker/docker-compose.agent.yml and a .env file defining AGENT_ID, AGENT_REGISTRATION_TOKEN, DISPATCHER_URL, and KAFKA_REMOTE_HOST.

Start the agent:

docker compose -f docker-compose.agent.yml up -d

The agent POSTs to /api/agents/register, receives SASL Kafka credentials in the response, and begins publishing heartbeats on agents.heartbeat. The dispatcher's AgentRegistry then schedules new workers onto it based on reported capacity.
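
Because the dispatcher validates a SHA-256 hash of the token, it can show the plaintext only once. The Java sketch below demonstrates that general pattern: mint a random token, keep only its hash, and compare hashes when an agent registers. The class name, token length, and encoding are assumptions, and the dispatcher's actual token format is not described in this README.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;
import java.util.Base64;
import java.util.HexFormat;

/** Hypothetical token helper illustrating "store the hash, show the plaintext once". */
class AgentTokens {
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Returns a fresh random token (32 random bytes, base64url without padding). */
    static String mint() {
        byte[] raw = new byte[32];
        RANDOM.nextBytes(raw);
        return Base64.getUrlEncoder().withoutPadding().encodeToString(raw);
    }

    /** Hex-encoded SHA-256 of the token; this is what would be persisted. */
    static String sha256Hex(String token) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(digest.digest(token.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is mandatory on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    /** Compares the presented token against a stored hash in constant time. */
    static boolean matches(String presentedToken, String storedHashHex) {
        byte[] presented = HexFormat.of().parseHex(sha256Hex(presentedToken));
        byte[] stored = HexFormat.of().parseHex(storedHashHex);
        return MessageDigest.isEqual(presented, stored);
    }
}
```

A registration handler written this way would persist sha256Hex(token) at mint time and call matches(...) with the token the agent presents.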

Project Structure

cloudbalancer/
  common/                  Shared models, events, agent protocol, executor framework
  dispatcher/              REST API, scheduling, auto-scaling, WebSocket, agent control plane
  worker/                  Task execution, heartbeat + metrics reporting, drain
  worker-agent/            Per-host Docker orchestrator, Kafka command listener
  metrics-aggregator/      Kafka consumers, TimescaleDB persistence, metrics REST API
  web-dashboard/           React 19 + Vite + shadcn/ui operational dashboard
  docker/                  docker-compose.dev.yml, docker-compose.agent.yml, nginx.conf
  docs/                    Architecture diagrams and design documents

Running Tests

# Full backend build with tests (requires Docker for Testcontainers)
./gradlew build

# Per-module
./gradlew :common:test
./gradlew :dispatcher:test
./gradlew :worker:test
./gradlew :worker-agent:test
./gradlew :metrics-aggregator:test

# Dashboard
cd web-dashboard
npm run test          # Vitest unit + component tests
npx playwright test   # E2E (expects the dev stack running)

Backend tests use Testcontainers to spin up Kafka and PostgreSQL automatically — Docker must be running.
