A lightweight, multithreaded workload manager for LAN clusters written in Go. This manager runs on each node in the cluster and distributes jobs based on available thread capacity.
Originally created to support running distributed workloads for the Boltzmannomics project (https://github.com/ianfr/economic-simulation).
Note that this tool is a work in progress and should only be run in trusted environments: authentication and encryption have not been added yet, only IP whitelisting, which is not secure.
- Thread-aware scheduling: Jobs are queued and executed based on thread availability
- Distributed job management: Head node delegates jobs to worker nodes
- REST API: Submit and monitor jobs via HTTP endpoints
- Standalone mode: Can run on a single node without clustering
- Web-Based Monitoring GUI: Simple UI to monitor cluster health and job status
A screenshot of the monitoring UI with the workload manager running in standalone mode
The workload manager operates in three modes:
- Head node: Accepts job submissions, runs jobs locally, and delegates to worker nodes
- Worker node: Receives jobs from the head node and executes them locally
- Standalone: Runs independently without clustering
cd golang-scheduler
# Build with default architecture
go build -o workload-manager
cd monitor && go build -o monitor
# Compile for ARM64
GOARCH=arm64 go build -o workload-manager
cd monitor && GOARCH=arm64 go build -o monitor

Note: While not strictly necessary, a shared filesystem is strongly recommended to facilitate centralized job logging.
Copy the executable to the shared filesystem for the cluster:
# Example using scp
scp workload-manager user@node1:/mnt/md0/cluster
scp monitor/monitor user@node1:/mnt/md0/cluster

Create a configuration file for each node. Head node:
{
"role": "head",
"listen_port": 8080,
"max_threads": 16,
"worker_nodes": [
"192.168.1.101:8080",
"192.168.1.102:8080",
"192.168.1.103:8080"
]
}

Worker node:

{
"role": "worker",
"listen_port": 8080,
"max_threads": 16,
"head_node_address": "192.168.1.100:8080"
}

Standalone:

{
"role": "standalone",
"listen_port": 8080,
"max_threads": 8
}

- role: "head", "worker", or "standalone"
- listen_port: Port for the HTTP API (default: 8080)
- max_threads: Maximum threads this node can use (default: number of CPU cores)
- head_node_address: Address of the head node (required for workers)
- worker_nodes: List of worker node addresses (required for head node)
On the head node:
./workload-manager -config config-head.json

On worker nodes:
./workload-manager -config config-worker.json

Standalone:
./workload-manager -config config-standalone.json

Submit a job via the REST API using curl or any HTTP client:
curl -X POST http://head-node:8080/api/v1/jobs/submit \
-H "Content-Type: application/json" \
-d '{
"command": "sleep 10 && echo Hello World",
"threads": 2,
"stdout_path": "/mnt/md0/cluster/jobs/job1.stdout",
"stderr_path": "/mnt/md0/cluster/jobs/job1.stderr"
}'

Response:
{
"job_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"message": "Job submitted successfully",
"status": "queued"
}

- command: Shell command to execute (required)
- threads: Number of threads the job will use (required)
- stdout_path: Path where stdout will be written (required)
- stderr_path: Path where stderr will be written (required)
Get information about a specific job:
curl http://head-node:8080/api/v1/jobs/a1b2c3d4-e5f6-7890-abcd-ef1234567890

Response:
{
"id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"command": "sleep 10 && echo Hello World",
"threads": 2,
"stdout_path": "/mnt/md0/cluster/jobs/job1.stdout",
"stderr_path": "/mnt/md0/cluster/jobs/job1.stderr",
"status": "running",
"node_address": "192.168.1.101:8080",
"submitted_at": "2025-10-02T10:30:00Z",
"started_at": "2025-10-02T10:30:05Z"
}

Get the status of a node:
curl http://node:8080/api/v1/status

Response:
{
"address": "",
"max_threads": 16,
"used_threads": 4,
"available_threads": 12,
"queued_jobs": 2,
"running_jobs": 2,
"completed_jobs": 15,
"last_heartbeat": "2025-10-02T10:35:00Z",
"is_healthy": true
}

Get the status of the entire cluster:
curl http://head-node:8080/api/v1/cluster/status

Response:
{
"head_node": {
"address": "localhost:8080",
"max_threads": 16,
"used_threads": 2,
"available_threads": 14,
"queued_jobs": 0,
"running_jobs": 1,
"completed_jobs": 10,
"last_heartbeat": "2025-10-02T10:35:00Z",
"is_healthy": true
},
"worker_nodes": [
{
"address": "192.168.1.101:8080",
"max_threads": 16,
"used_threads": 8,
"available_threads": 8,
"queued_jobs": 3,
"running_jobs": 4,
"completed_jobs": 20,
"last_heartbeat": "2025-10-02T10:35:00Z",
"is_healthy": true
}
],
"total_jobs": 45
}

Health check:

curl http://node:8080/api/v1/health

| Endpoint | Method | Description |
|---|---|---|
| /api/v1/jobs/submit | POST | Submit a new job |
| /api/v1/jobs/{id} | GET | Get job information |
| /api/v1/jobs | GET | List all jobs on this node |
| /api/v1/status | GET | Get node status |
| /api/v1/health | GET | Health check |
| /api/v1/cluster/status | GET | Get cluster status (head only) |
- Job Submission: Jobs are submitted to the head node via the REST API
- Node Selection: The head node selects the best worker based on:
- Available thread capacity
- Current queue length
- Node health status
- Execution: Jobs run immediately if threads are available, otherwise they're queued
- Completion: Job outputs are written to the specified paths on the shared filesystem
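The node-selection step can be pictured as filtering out unhealthy or over-committed workers, then preferring more free capacity and shorter queues. A simplified sketch; the real scheduler's exact scoring is not documented here, and the struct below reuses names from the status response above:

```go
package main

import "fmt"

// NodeStatus holds the fields the head node considers when delegating;
// field names follow the /api/v1/status response.
type NodeStatus struct {
	Address          string
	AvailableThreads int
	QueuedJobs       int
	IsHealthy        bool
}

// pickWorker returns a healthy node with enough free threads,
// preferring more capacity and, on ties, shorter queues. Illustrative only.
func pickWorker(nodes []NodeStatus, threads int) *NodeStatus {
	var best *NodeStatus
	for i := range nodes {
		n := &nodes[i]
		if !n.IsHealthy || n.AvailableThreads < threads {
			continue // excluded: unhealthy or insufficient capacity
		}
		if best == nil ||
			n.AvailableThreads > best.AvailableThreads ||
			(n.AvailableThreads == best.AvailableThreads && n.QueuedJobs < best.QueuedJobs) {
			best = n
		}
	}
	return best
}

func main() {
	nodes := []NodeStatus{
		{"192.168.1.101:8080", 8, 3, true},
		{"192.168.1.102:8080", 12, 0, true},
		{"192.168.1.103:8080", 16, 0, false}, // unhealthy, skipped
	}
	if n := pickWorker(nodes, 4); n != nil {
		fmt.Println("delegating to", n.Address) // prints the .102 worker
	}
}
```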
- Each job specifies the number of threads it will use
- Jobs only execute when sufficient thread capacity is available
- The scheduler uses a FIFO queue for pending jobs
- Thread capacity is released when jobs complete
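The thread accounting above can be sketched as a counter guarded by a mutex plus a FIFO slice of pending jobs. This is a simplified model; the real implementation may use channels or condition variables instead:

```go
package main

import (
	"fmt"
	"sync"
)

// Scheduler tracks thread capacity and a FIFO queue of pending jobs.
type Scheduler struct {
	mu         sync.Mutex
	maxThreads int
	used       int
	queue      []int // thread counts of queued jobs, FIFO order
}

// Submit starts the job immediately if capacity allows, else queues it.
func (s *Scheduler) Submit(threads int) (started bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.used+threads <= s.maxThreads {
		s.used += threads
		return true
	}
	s.queue = append(s.queue, threads)
	return false
}

// Release frees capacity when a job completes, then starts queued
// jobs in FIFO order while they fit.
func (s *Scheduler) Release(threads int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.used -= threads
	for len(s.queue) > 0 && s.used+s.queue[0] <= s.maxThreads {
		s.used += s.queue[0]
		s.queue = s.queue[1:]
	}
}

func main() {
	s := &Scheduler{maxThreads: 8}
	fmt.Println(s.Submit(6)) // true: runs immediately
	fmt.Println(s.Submit(4)) // false: queued, only 2 threads free
	s.Release(6)             // frees capacity; the queued job starts
	fmt.Println(s.used)      // 4
}
```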
- Head node performs health checks every 5 seconds
- Unhealthy workers are excluded from job delegation
- Jobs fail over to other nodes if submission fails
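Failover can be as simple as trying candidate nodes in order until one accepts the job. A sketch, where `submitTo` is a hypothetical stand-in for the HTTP submission call (stubbed here so the example runs offline):

```go
package main

import (
	"errors"
	"fmt"
)

// submitTo stands in for the real HTTP job submission; it is stubbed
// to simulate one unreachable node.
func submitTo(addr string) error {
	if addr == "192.168.1.101:8080" {
		return errors.New("connection refused")
	}
	return nil
}

// submitWithFailover tries each candidate node in turn, returning the
// first address that accepts the job.
func submitWithFailover(candidates []string) (string, error) {
	var lastErr error
	for _, addr := range candidates {
		if err := submitTo(addr); err != nil {
			lastErr = err
			continue // this node failed; try the next one
		}
		return addr, nil
	}
	return "", fmt.Errorf("all nodes failed: %w", lastErr)
}

func main() {
	addr, err := submitWithFailover([]string{"192.168.1.101:8080", "192.168.1.102:8080"})
	fmt.Println(addr, err) // the .102 node accepts after .101 refuses
}
```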
Create a script to submit multiple jobs:
#!/bin/bash
for i in {1..10}; do
curl -X POST http://192.168.1.100:8080/api/v1/jobs/submit \
-H "Content-Type: application/json" \
-d "{
\"command\": \"python3 process_data.py --input data_$i.txt\",
\"threads\": 4,
\"stdout_path\": \"/mnt/md0/cluster/jobs/job_$i.stdout\",
\"stderr_path\": \"/mnt/md0/cluster/jobs/job_$i.stderr\"
}"
echo "Submitted job $i"
done

Monitor the cluster status continuously:
watch -n 5 'curl -s http://192.168.1.100:8080/api/v1/cluster/status | jq .'

- Verify worker nodes are accessible from the head node
- Check worker node health: curl http://worker:8080/api/v1/health
- Ensure worker addresses in head config match actual IPs/hostnames
- Check thread availability: curl http://node:8080/api/v1/status
- Verify no jobs are consuming all threads
- Check if the max_threads configuration is appropriate
- Ensure all nodes have the same mount point for shared storage
- Verify write permissions on stdout/stderr paths
- Test file creation: touch /mnt/md0/cluster/jobs/test.txt
