A distributed task scheduling platform on Java 21, Spring Boot 4, Kafka, and PostgreSQL/TimescaleDB. The dispatcher accepts tasks over REST, schedules them onto worker containers that per-host agents spawn through Docker, and streams logs and events back to a React dashboard. Executors cover shell, Docker, Python, and simulated workloads. The cluster auto-scales on CPU and queue pressure, retries failed tasks, dead-letters poison pills, and exposes chaos controls for fault-tolerance experiments.
```
                              +--------------------+
                              |   Web Dashboard    |
                              |   (React / Vite)   |
                              +---------+----------+
                                        | HTTPS + WSS
                                        v
Client / CLI --REST/JWT--> +---------------------+
                           |  Dispatcher :8080   |
                           |  REST, scheduling,  |
                           |  auto-scaling, WS   |
                           +----+----+------+----+
                                |    |      |
                           JDBC |    |Kafka |Kafka (REMOTE :29094, SASL)
                                v    v      v
     +-----------------+                +--------------------+
     | PostgreSQL /    |                | Worker-Agent (per  |
     | TimescaleDB     |                | host) -- Docker    |
     +--------+--------+                | orchestrator       |
              ^                         +----------+---------+
              |                                    |
              |                                    | spawns
              |                                    v
  +------------------------+                +--------------+
  | Metrics Aggregator     |<----Kafka------|  Worker(s)   |
  | :8081 (Kafka consumer) |                | (executors)  |
  +------------------------+                +--------------+
```
- Dispatcher — REST API, task scheduling, worker/agent registry, auto-scaling, WebSocket log + event streaming, JWT auth.
- Worker-Agent — runs on each host, registers with the dispatcher using a pre-issued token, receives `StartWorkerCommand`/`StopWorkerCommand` over Kafka, and spawns/kills worker containers via the local Docker socket. Publishes heartbeats with per-host capacity.
- Worker — executes tasks via pluggable executors, publishes heartbeat/metrics/results over Kafka, supports drain-aware shutdown.
- Metrics Aggregator — consumes worker metrics from Kafka, persists time-series data in TimescaleDB, exposes metrics REST API.
- Web Dashboard — React 19 + TypeScript + Vite SPA for task submission, worker/agent monitoring, live metrics, scaling and chaos controls.
- Common — shared models, events, executor framework, agent protocol (no Spring dependency).
Per-module notes and the indexed wiki live under docs/ and wiki/.
- JDK 21 (auto-provisioned via Gradle toolchain)
- Docker + Docker Compose (required for Testcontainers and the dev stack)
- Node.js 20+ and `npm` (only if running the web dashboard locally)
- No Gradle install needed — the `gradlew` wrapper handles it
Build the backend JARs, then start everything:
```bash
./gradlew build -x test
cd docker
docker compose -f docker-compose.dev.yml up --build
```

The dev stack brings up:
| Service | Purpose | Host port |
|---|---|---|
| `kafka` | Apache Kafka 3.9 (KRaft, SASL REMOTE listener) | 9092, 29094 |
| `postgres` | TimescaleDB on PostgreSQL 16 | 5432 |
| `nginx` | Reverse proxy fronting dispatcher + metrics | 80 |
| `dispatcher` | Control plane (REST + WebSocket) | via nginx :80 |
| `metrics-aggregator` | Time-series metrics API | via nginx :80 |
| `worker` | Seed worker container (not agent-spawned) | — |
| `worker-agent-1` | Host agent, capacity 8 CPU / 16 GB | — |
| `worker-agent-2` | Host agent, capacity 4 CPU / 8 GB | — |
The compose file sets SCALING_RUNTIME_MODE=AGENT, so the dispatcher asks worker-agents to spawn new worker containers instead of running Docker itself. The other mode, DOCKER (the dispatcher default), drives the local Docker socket directly and only makes sense on a single host.
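To try the single-host mode instead, the same variable can be set when launching the dispatcher directly — a minimal sketch, assuming Kafka and Postgres are already up:

```bash
# Run the dispatcher against its local Docker socket instead of asking agents
# (single-host only; SCALING_RUNTIME_MODE is listed in the configuration table below)
SCALING_RUNTIME_MODE=DOCKER ./gradlew :dispatcher:bootRun
```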
Start Kafka + Postgres in Docker, run JVM services locally:
```bash
cd docker
docker compose -f docker-compose.dev.yml up kafka postgres
```

Then in separate terminals:
```bash
# Terminal 1: Dispatcher (:8080)
./gradlew :dispatcher:bootRun

# Terminal 2: Worker (default ID: worker-1)
./gradlew :worker:bootRun

# Terminal 3: Second worker with custom ID
WORKER_ID=worker-2 ./gradlew :worker:bootRun

# Terminal 4: Metrics Aggregator (:8081)
./gradlew :metrics-aggregator:bootRun

# Terminal 5 (optional): Worker-Agent (Docker orchestrator)
AGENT_ID=agent-1 ./gradlew :worker-agent:bootRun
```

The dashboard is a Vite SPA. From `web-dashboard/`:
```bash
cd web-dashboard
npm install
npm run dev      # http://localhost:5173

# Other scripts
npm run build    # production build to dist/
npm run test     # Vitest unit tests
npm run lint
```

Point it at the backend with `VITE_API_URL`, `VITE_METRICS_URL`, `VITE_WS_URL` (defaults target http://localhost / ws://localhost, i.e. the compose nginx).
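When running the backend services directly on localhost instead of through nginx, a `.env.local` next to `package.json` can point the SPA at the individual ports — a sketch, assuming the dispatcher on :8080 and the metrics aggregator on :8081 as in the local-dev instructions above:

```bash
# web-dashboard/.env.local — picked up by Vite at dev-server startup
# Dispatcher REST API
VITE_API_URL=http://localhost:8080
# Metrics aggregator REST API
VITE_METRICS_URL=http://localhost:8081
# Dispatcher WebSocket endpoints
VITE_WS_URL=ws://localhost:8080
```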
The dispatcher seeds a default admin on startup: admin / admin.
```bash
# Get a JWT access + refresh token
TOKEN=$(curl -s -X POST http://localhost:8080/api/auth/login \
  -H 'Content-Type: application/json' \
  -d '{"username":"admin","password":"admin"}' | jq -r .accessToken)

# Use the token for authenticated requests
curl -H "Authorization: Bearer $TOKEN" http://localhost:8080/api/tasks
```

Auth endpoints, under /api/auth:

| Method | Path | Auth | Description |
|---|---|---|---|
| POST | `/login` | public | Login, returns access + refresh JWTs |
| POST | `/refresh` | public | Rotate refresh token, issue new access token |
| POST | `/logout` | authenticated | Revoke all refresh tokens for the caller |
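A refresh-cycle sketch, assuming the controller is mounted at /api/auth like the login call above and that the refresh endpoint accepts a `refreshToken` field (the field name is an assumption — check the dispatcher's auth DTOs):

```bash
# Capture both tokens at login (refreshToken field name assumed)
LOGIN=$(curl -s -X POST http://localhost:8080/api/auth/login \
  -H 'Content-Type: application/json' \
  -d '{"username":"admin","password":"admin"}')
REFRESH=$(echo "$LOGIN" | jq -r .refreshToken)

# Rotate the refresh token and obtain a fresh access token
curl -s -X POST http://localhost:8080/api/auth/refresh \
  -H 'Content-Type: application/json' \
  -d "{\"refreshToken\":\"$REFRESH\"}" | jq .
```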
Task endpoints, under /api/tasks:

| Method | Path | Auth | Description |
|---|---|---|---|
| POST | `` | ADMIN, OPERATOR, API_CLIENT | Submit a new task |
| GET | `` | authenticated | List tasks with pagination + filters (offset, limit, status, priority, executorType, workerId, since) |
| GET | `/{id}` | authenticated | Get task by ID |
| GET | `/{id}/logs` | authenticated | Stored stdout / stderr for a finished task |
| GET | `/{id}/artifacts/{name}` | authenticated | Download a task artifact |
| POST | `/bulk/cancel` | ADMIN, OPERATOR | Cancel a batch of tasks |
| POST | `/bulk/retry` | ADMIN, OPERATOR | Retry a batch of failed tasks |
| POST | `/bulk/reprioritize` | ADMIN, OPERATOR | Update priority for many tasks |
Workers upload artifacts through POST /internal/tasks/{taskId}/artifacts/{name} (unauthenticated, octet-stream body, not under /api/tasks).
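For example, pulling an artifact from a finished task, and the raw upload a worker performs internally — the task ID and file name are placeholders:

```bash
TASK_ID="<finished-task-id>"   # placeholder

# Download an artifact from a finished task (authenticated)
curl -H "Authorization: Bearer $TOKEN" \
  -o results.csv \
  "http://localhost:8080/api/tasks/$TASK_ID/artifacts/results.csv"

# What a worker does internally: raw octet-stream upload, no auth header
curl -X POST \
  -H 'Content-Type: application/octet-stream' \
  --data-binary @results.csv \
  "http://localhost:8080/internal/tasks/$TASK_ID/artifacts/results.csv"
```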
Worker endpoints:

| Method | Path | Auth | Description |
|---|---|---|---|
| GET | `` | authenticated | List workers with health state and active task counts |
Agent registration, under /api/agents:

| Method | Path | Auth | Description |
|---|---|---|---|
| POST | `/register` | public (token) | Agent self-registration; dispatcher validates the SHA-256 token hash, returns Kafka SASL credentials |
Agent inventory:

| Method | Path | Description |
|---|---|---|
| GET | `` | List live agents with capacity and host info |
| GET | `/{agentId}` | Agent detail |
| GET | `/{agentId}/workers` | Workers currently running on the agent |
Agent registration tokens, under /api/admin/agent-tokens:

| Method | Path | Description |
|---|---|---|
| POST | `` | Mint a new registration token (returned once, plaintext) |
| GET | `` | List token metadata (label, created, last-used, revoked) |
| POST | `/{id}/revoke` | Revoke a token |
Scheduling controls, under /api/admin:

| Method | Path | Description |
|---|---|---|
| GET | `/strategy` | Active scheduling strategy + weights |
| PUT | `/strategy` | Set strategy (ROUND_ROBIN, WEIGHTED_ROUND_ROBIN, LEAST_CONNECTIONS, RESOURCE_FIT, CUSTOM) |
| PUT | `/workers/{id}/tags` | Update the tag set for a worker (used by TaskConstraints.requiredTags) |
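A runtime strategy swap might look like the sketch below — the PUT /api/admin/strategy path is the one listed above, but the request-body field name is an assumption:

```bash
# Switch the scheduler to RESOURCE_FIT (body shape assumed; adjust to the actual DTO)
curl -X PUT http://localhost:8080/api/admin/strategy \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"strategy":"RESOURCE_FIT"}'
```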
Auto-scaling controls:

| Method | Path | Description |
|---|---|---|
| GET | `/status` | Current worker counts, policy, cooldown state |
| PUT | `/policy` | Update policy (minWorkers, maxWorkers, cooldownSeconds, scaleUpStep, scaleDownStep, drainTimeSeconds) |
| POST | `/trigger` | Manually scale up/down by N |
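A policy-update sketch — the base path used here (/api/admin/scaling) is an assumption, since the table lists relative paths; the body simply mirrors the fields above:

```bash
# Raise the ceiling and shorten the cooldown (base path and body shape assumed)
curl -X PUT http://localhost:8080/api/admin/scaling/policy \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"minWorkers":1,"maxWorkers":10,"cooldownSeconds":60,"scaleUpStep":2,"scaleDownStep":1,"drainTimeSeconds":30}'
```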
Chaos controls:

| Method | Path | Description |
|---|---|---|
| POST | `/kill-worker` | Terminate a (random or named) worker |
| POST | `/fail-task` | Force-fail a running task |
| POST | `/latency` | Inject artificial latency into a component for N seconds |
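For example, killing a worker mid-run during a fault-tolerance experiment — both the base path used here (/api/admin/chaos) and the empty body are assumptions:

```bash
# Kill one worker at random (base path and payload are assumptions)
curl -X POST http://localhost:8080/api/admin/chaos/kill-worker \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{}'
```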
WebSocket streams:

| Path | Description |
|---|---|
| `ws://localhost:8080/api/tasks/{id}/logs/stream?token=JWT` | Real-time stdout/stderr streaming for a task |
| `ws://localhost:8080/api/ws/events?token=JWT` | Dashboard event bus: task updates, worker state, scaling events, alerts, and initial snapshot |
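Any WebSocket client can tail these streams; a sketch using wscat (a third-party CLI, not part of this repo), reusing the $TOKEN from the login example:

```bash
# Tail the dashboard event bus from a terminal
npx wscat -c "ws://localhost:8080/api/ws/events?token=$TOKEN"

# Stream live stdout/stderr for a running task (task ID is a placeholder)
TASK_ID="<running-task-id>"
npx wscat -c "ws://localhost:8080/api/tasks/$TASK_ID/logs/stream?token=$TOKEN"
```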
Metrics API (served by the metrics aggregator):

| Method | Path | Auth | Description |
|---|---|---|---|
| GET | `/api/metrics/workers` | authenticated | Latest metrics for all workers |
| GET | `/api/metrics/workers/{id}/history` | authenticated | Time-series metrics for one worker |
| GET | `/api/metrics/cluster` | authenticated | Cluster-wide aggregates |
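For example, against a locally running metrics aggregator on :8081 (swap in the nginx host when using the compose stack):

```bash
# Latest snapshot for every worker
curl -H "Authorization: Bearer $TOKEN" http://localhost:8081/api/metrics/workers | jq .

# Cluster-wide aggregates
curl -H "Authorization: Bearer $TOKEN" http://localhost:8081/api/metrics/cluster | jq .
```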
Submitting a task:

```bash
curl -X POST http://localhost:8080/api/tasks \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "hello-world",
    "executorType": "SHELL",
    "payload": {
      "command": "echo Hello from CloudBalancer && date"
    },
    "resourceProfile": {
      "cpuCores": 1,
      "memoryMB": 256,
      "diskMB": 100,
      "gpuRequired": false,
      "gpuMemoryMB": 0,
      "networkAccessRequired": false
    },
    "executionPolicy": {
      "timeoutSeconds": 30,
      "maxRetries": 2,
      "retryBackoffStrategy": "EXPONENTIAL"
    },
    "priority": "NORMAL"
  }'
```

Executors: SIMULATED, SHELL, DOCKER, PYTHON.
Shell commands run through an allow/deny list; Python runs inside a per-task venv with optional pip requirements and network-namespace isolation on Linux; Docker executes with hardened defaults (dropped capabilities, no-new-privileges, memory/CPU limits, optional read-only rootfs).
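For a non-shell executor, a PYTHON submission might look like the sketch below — the payload keys (`script`, `requirements`) are assumptions rather than confirmed field names; everything else mirrors the SHELL example above:

```bash
# Submit a Python task (payload field names are hypothetical — check the executor docs)
curl -X POST http://localhost:8080/api/tasks \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "py-demo",
    "executorType": "PYTHON",
    "payload": {
      "script": "import sys; print(sys.version)",
      "requirements": []
    },
    "resourceProfile": {"cpuCores": 1, "memoryMB": 256, "diskMB": 100, "gpuRequired": false, "gpuMemoryMB": 0, "networkAccessRequired": false},
    "executionPolicy": {"timeoutSeconds": 60, "maxRetries": 1, "retryBackoffStrategy": "FIXED"},
    "priority": "NORMAL"
  }'
```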
Task states: SUBMITTED → VALIDATED → QUEUED → ASSIGNED → PROVISIONING → RUNNING → POST_PROCESSING → COMPLETED, with terminal branches CANCELLED, TIMED_OUT, FAILED, DEAD_LETTERED. FAILED / TIMED_OUT re-queue until the maxRetries budget is exhausted.
- Strategies: `ROUND_ROBIN`, `WEIGHTED_ROUND_ROBIN`, `LEAST_CONNECTIONS`, `RESOURCE_FIT`, `CUSTOM` (weighted blend of `WorkerScorer` components). Hot-swappable via `PUT /api/admin/strategy`.
- Scaling triggers: `REACTIVE` (CPU high/low windows), `QUEUE_PRESSURE` (queued-vs-active ratio), `MANUAL` (admin-initiated). Evaluated every 30 s by default.
- Runtime modes: `DOCKER` (dispatcher default — it talks to its own Docker socket on a single host) or `AGENT` (compose default — worker-agents spawn containers across hosts). Set via `SCALING_RUNTIME_MODE`.
- Worker constraints: tasks may specify `requiredTags`, `whitelistedWorkers`, `blacklistedWorkers` via `TaskConstraints` (see the sketch after this list).
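A constraint sketch, pinning a task to tagged workers — the nested field names come from the bullet above, but the top-level `constraints` key and the minimal SIMULATED payload are assumptions:

```bash
# Submit a task that may only run on workers tagged "gpu"
# (top-level "constraints" key and payload shape are assumptions)
curl -X POST http://localhost:8080/api/tasks \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "tagged-task",
    "executorType": "SIMULATED",
    "payload": {},
    "constraints": {
      "requiredTags": ["gpu"],
      "whitelistedWorkers": [],
      "blacklistedWorkers": []
    },
    "priority": "NORMAL"
  }'
```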
- Retries with `FIXED`, `EXPONENTIAL`, or `EXPONENTIAL_WITH_JITTER` backoff per-task.
- Dead-letter queue — tasks exceeding `maxRetries` transition to `DEAD_LETTERED` (see the bulk-retry sketch after this list).
- Circuit breakers (Resilience4j) around the Kafka producer.
- Rate limiting (Bucket4j) on authenticated endpoints.
- Drain-aware shutdown — workers accept `DrainCommand` with a grace period to finish in-flight tasks.
- Kafka idempotency guards on result and assignment listeners.
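Failed or dead-lettered work can be pushed back through the queue with the bulk endpoints listed earlier; a sketch, assuming the body is a plain list of task IDs (the exact DTO is not documented here):

```bash
# Re-queue a batch of failed tasks (body shape assumed; IDs are placeholders)
curl -X POST http://localhost:8080/api/tasks/bulk/retry \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"taskIds":["<task-id-1>","<task-id-2>"]}'
```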
Key environment variables (defaults shown):
| Variable | Default | Scope | Description |
|---|---|---|---|
| `KAFKA_BOOTSTRAP_SERVERS` | `localhost:9092` | all JVM services | Internal Kafka bootstrap |
| `KAFKA_EXTERNAL_BOOTSTRAP` | `localhost:29094` | dispatcher | REMOTE (SASL) bootstrap advertised to remote agents |
| `KAFKA_AGENT_USERNAME` / `KAFKA_AGENT_PASSWORD` | `cloudbalancer-agent` / `changeme` | dispatcher + remote agents | SASL_PLAINTEXT creds for the REMOTE listener |
| `SPRING_DATASOURCE_URL` | `jdbc:postgresql://localhost:5432/cloudbalancer` | dispatcher, metrics | JDBC URL |
| `SPRING_DATASOURCE_USERNAME` / `_PASSWORD` | `postgres` / `postgres` | dispatcher, metrics | DB creds |
| `SCALING_RUNTIME_MODE` | `DOCKER` (dispatcher default), `AGENT` (compose) | dispatcher | How scale-ups are materialised |
| `WORKER_ID` | `worker-1` | worker | Unique worker identifier |
| `DISPATCHER_URL` | `http://localhost:8080` | worker, agent | Dispatcher base URL |
| `AGENT_ID` | `agent-1` | worker-agent | Unique agent identifier |
| `AGENT_CPU_CORES` | `8` | worker-agent | Declared host capacity |
| `AGENT_MEMORY_MB` | `16384` | worker-agent | Declared host capacity |
| `AGENT_REGISTRATION_TOKEN` | — | worker-agent | Token from POST /api/admin/agent-tokens |
| `DOCKER_WORKER_IMAGE` | `docker-worker` | worker-agent | Image spawned for new workers |
| `DOCKER_NETWORK_NAME` | `docker_default` | worker-agent | Docker network joined by spawned workers |
| `VITE_API_URL` / `VITE_METRICS_URL` / `VITE_WS_URL` | `http://localhost` / `ws://localhost` | dashboard | Backend endpoints |
Scaling defaults live under cloudbalancer.dispatcher.scaling.* in dispatcher/src/main/resources/application.yml. Agent capacity and heartbeat cadence live under cloudbalancer.agent.* in worker-agent/src/main/resources/application.yml.
To attach an agent running on a different host:
1. Mint a registration token on the dispatcher:

   ```bash
   curl -X POST http://dispatcher-host/api/admin/agent-tokens \
     -H "Authorization: Bearer $TOKEN" \
     -H 'Content-Type: application/json' \
     -d '{"label":"edge-host-1"}'
   ```

   The response contains the plaintext token — it is only shown once.

2. On the target host, copy `docker/docker-compose.agent.yml` and a `.env` file defining `AGENT_ID`, `AGENT_REGISTRATION_TOKEN`, `DISPATCHER_URL`, and `KAFKA_REMOTE_HOST` (a sample `.env` is sketched after these steps).

3. Start the agent:

   ```bash
   docker compose -f docker-compose.agent.yml up -d
   ```
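A sample `.env` for step 2 — the variable names match the configuration table above; every value is a placeholder for your own hosts and the minted token:

```bash
# .env next to docker-compose.agent.yml on the edge host
AGENT_ID=edge-host-1
AGENT_REGISTRATION_TOKEN=paste-the-plaintext-token-here
DISPATCHER_URL=http://dispatcher-host
KAFKA_REMOTE_HOST=dispatcher-host
```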
The agent will POST to /api/agents/register, receive SASL Kafka credentials in response, and begin publishing heartbeats on agents.heartbeat. The dispatcher's AgentRegistry will schedule new workers onto it based on reported capacity.
```
cloudbalancer/
  common/              Shared models, events, agent protocol, executor framework
  dispatcher/          REST API, scheduling, auto-scaling, WebSocket, agent control plane
  worker/              Task execution, heartbeat + metrics reporting, drain
  worker-agent/        Per-host Docker orchestrator, Kafka command listener
  metrics-aggregator/  Kafka consumers, TimescaleDB persistence, metrics REST API
  web-dashboard/       React 19 + Vite + shadcn/ui operational dashboard
  docker/              docker-compose.dev.yml, docker-compose.agent.yml, nginx.conf
  docs/                Architecture diagrams and design documents
```
```bash
# Full backend build with tests (requires Docker for Testcontainers)
./gradlew build

# Per-module
./gradlew :common:test
./gradlew :dispatcher:test
./gradlew :worker:test
./gradlew :worker-agent:test
./gradlew :metrics-aggregator:test

# Dashboard
cd web-dashboard
npm run test          # Vitest unit + component tests
npx playwright test   # E2E (expects the dev stack running)
```

Backend tests use Testcontainers to spin up Kafka and PostgreSQL automatically — Docker must be running.