Cluster Workload Manager

A lightweight, multithreaded workload manager for LAN clusters, written in Go as a lighter-weight alternative to schedulers like Slurm or PBS. The manager runs on each node in the cluster and distributes jobs based on available thread capacity.

Originally created to support running distributed workloads for the Boltzmannomics project (https://github.com/ianfr/economic-simulation).

Note that this tool is a work in progress and should only be run in trusted environments; authentication and encryption have not been added yet, only IP whitelisting, which is not secure on its own.

Features

  • Thread-aware scheduling: Jobs are queued and executed based on thread availability
  • Distributed job management: Head node delegates jobs to worker nodes
  • REST API: Submit and monitor jobs via HTTP endpoints
  • Standalone mode: Can run on a single node without clustering
  • Web-based monitoring GUI: Simple UI to monitor cluster health and job status

A screenshot of the monitoring UI with the workload manager running in standalone mode

Architecture

The workload manager operates in three modes:

  1. Head node: Accepts job submissions, runs jobs locally, and delegates to worker nodes
  2. Worker node: Receives jobs from the head node and executes them locally
  3. Standalone: Runs independently without clustering

Installation

Build the executable

cd golang-scheduler

# Build with the default architecture
go build -o workload-manager
(cd monitor && go build -o monitor)

# Or cross-compile for ARM64
GOARCH=arm64 go build -o workload-manager
(cd monitor && GOARCH=arm64 go build -o monitor)

Deploy to cluster nodes

Note: While not strictly necessary, a shared filesystem is strongly recommended to facilitate centralized job logging.

Copy the executable to the shared filesystem for the cluster:

# Example using scp
scp workload-manager user@node1:/mnt/md0/cluster
scp monitor/monitor user@node1:/mnt/md0/cluster

Configuration

Create a configuration file for each node:

Head Node Configuration (config-head.json)

{
  "role": "head",
  "listen_port": 8080,
  "max_threads": 16,
  "worker_nodes": [
    "192.168.1.101:8080",
    "192.168.1.102:8080",
    "192.168.1.103:8080"
  ]
}

Worker Node Configuration (config-worker.json)

{
  "role": "worker",
  "listen_port": 8080,
  "max_threads": 16,
  "head_node_address": "192.168.1.100:8080"
}

Standalone Configuration (config-standalone.json)

{
  "role": "standalone",
  "listen_port": 8080,
  "max_threads": 8
}

Configuration Options

  • role: "head", "worker", or "standalone"
  • listen_port: Port for the HTTP API (default: 8080)
  • max_threads: Maximum threads this node can use (default: number of CPU cores)
  • head_node_address: Address of the head node (required for workers)
  • worker_nodes: List of worker node addresses (required for head node)
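
For reference, a configuration like the ones above can be unmarshalled into a Go struct keyed on the JSON fields listed here. This is a minimal sketch based only on the documented keys; the struct and field names are illustrative and may not match the ones used in the source.

package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Config mirrors the JSON keys documented above; names are illustrative.
type Config struct {
	Role            string   `json:"role"`              // "head", "worker", or "standalone"
	ListenPort      int      `json:"listen_port"`       // HTTP API port
	MaxThreads      int      `json:"max_threads"`       // thread capacity of this node
	HeadNodeAddress string   `json:"head_node_address"` // required for workers
	WorkerNodes     []string `json:"worker_nodes"`      // required for the head node
}

func main() {
	data, err := os.ReadFile("config-head.json")
	if err != nil {
		panic(err)
	}
	var cfg Config
	if err := json.Unmarshal(data, &cfg); err != nil {
		panic(err)
	}
	fmt.Printf("role=%s port=%d threads=%d\n", cfg.Role, cfg.ListenPort, cfg.MaxThreads)
}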

Usage

Starting the Manager

On the head node:

./workload-manager -config config-head.json

On worker nodes:

./workload-manager -config config-worker.json

Standalone:

./workload-manager -config config-standalone.json

Submitting a Job

Submit a job via the REST API using curl or any HTTP client:

curl -X POST http://head-node:8080/api/v1/jobs/submit \
  -H "Content-Type: application/json" \
  -d '{
    "command": "sleep 10 && echo Hello World",
    "threads": 2,
    "stdout_path": "/mnt/md0/cluster/jobs/job1.stdout",
    "stderr_path": "/mnt/md0/cluster/jobs/job1.stderr"
  }'

Response:

{
  "job_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "message": "Job submitted successfully",
  "status": "queued"
}

Job Submission Fields

  • command: Shell command to execute (required)
  • threads: Number of threads the job will use (required)
  • stdout_path: Path where stdout will be written (required)
  • stderr_path: Path where stderr will be written (required)
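
Jobs can also be submitted programmatically instead of with curl. Below is a minimal Go sketch that posts the four required fields to the submit endpoint; it assumes only the request shape and path documented above, and keeps error handling to a minimum.

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// JobRequest carries the four required submission fields described above.
type JobRequest struct {
	Command    string `json:"command"`
	Threads    int    `json:"threads"`
	StdoutPath string `json:"stdout_path"`
	StderrPath string `json:"stderr_path"`
}

func main() {
	req := JobRequest{
		Command:    "sleep 10 && echo Hello World",
		Threads:    2,
		StdoutPath: "/mnt/md0/cluster/jobs/job1.stdout",
		StderrPath: "/mnt/md0/cluster/jobs/job1.stderr",
	}
	body, _ := json.Marshal(req)

	// Replace head-node with the head node's hostname or IP.
	resp, err := http.Post("http://head-node:8080/api/v1/jobs/submit",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out map[string]any
	json.NewDecoder(resp.Body).Decode(&out)
	fmt.Println(out["job_id"], out["status"])
}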

Checking Job Status

Get information about a specific job:

curl http://head-node:8080/api/v1/jobs/a1b2c3d4-e5f6-7890-abcd-ef1234567890

Response:

{
  "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "command": "sleep 10 && echo Hello World",
  "threads": 2,
  "stdout_path": "/mnt/md0/cluster/jobs/job1.stdout",
  "stderr_path": "/mnt/md0/cluster/jobs/job1.stderr",
  "status": "running",
  "node_address": "192.168.1.101:8080",
  "submitted_at": "2025-10-02T10:30:00Z",
  "started_at": "2025-10-02T10:30:05Z"
}
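
To block until a job finishes, the status endpoint can simply be polled. The sketch below treats anything other than "queued" or "running" (the statuses shown in the examples above) as terminal; the exact terminal status strings used by the manager are not documented here, so adjust as needed.

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

func main() {
	jobID := "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
	url := "http://head-node:8080/api/v1/jobs/" + jobID

	for {
		resp, err := http.Get(url)
		if err != nil {
			panic(err)
		}
		var job struct {
			Status string `json:"status"`
		}
		json.NewDecoder(resp.Body).Decode(&job)
		resp.Body.Close()

		fmt.Println("status:", job.Status)
		// "queued" and "running" appear in the examples above; anything
		// else is treated here as a terminal state.
		if job.Status != "queued" && job.Status != "running" {
			break
		}
		time.Sleep(5 * time.Second)
	}
}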

Checking Node Status

Get the status of a node:

curl http://node:8080/api/v1/status

Response:

{
  "address": "",
  "max_threads": 16,
  "used_threads": 4,
  "available_threads": 12,
  "queued_jobs": 2,
  "running_jobs": 2,
  "completed_jobs": 15,
  "last_heartbeat": "2025-10-02T10:35:00Z",
  "is_healthy": true
}

Checking Cluster Status (Head Node Only)

Get the status of the entire cluster:

curl http://head-node:8080/api/v1/cluster/status

Response:

{
  "head_node": {
    "address": "localhost:8080",
    "max_threads": 16,
    "used_threads": 2,
    "available_threads": 14,
    "queued_jobs": 0,
    "running_jobs": 1,
    "completed_jobs": 10,
    "last_heartbeat": "2025-10-02T10:35:00Z",
    "is_healthy": true
  },
  "worker_nodes": [
    {
      "address": "192.168.1.101:8080",
      "max_threads": 16,
      "used_threads": 8,
      "available_threads": 8,
      "queued_jobs": 3,
      "running_jobs": 4,
      "completed_jobs": 20,
      "last_heartbeat": "2025-10-02T10:35:00Z",
      "is_healthy": true
    }
  ],
  "total_jobs": 45
}

Health Check

curl http://node:8080/api/v1/health

API Endpoints

Endpoint                Method  Description
/api/v1/jobs/submit     POST    Submit a new job
/api/v1/jobs/{id}       GET     Get job information
/api/v1/jobs            GET     List all jobs on this node
/api/v1/status          GET     Get node status
/api/v1/health          GET     Health check
/api/v1/cluster/status  GET     Get cluster status (head node only)

How It Works

Job Scheduling

  1. Job Submission: Jobs are submitted to the head node via the REST API
  2. Node Selection: The head node selects the best worker based on:
    • Available thread capacity
    • Current queue length
    • Node health status
  3. Execution: Jobs run immediately if threads are available, otherwise they're queued
  4. Completion: Job outputs are written to the specified paths on the shared filesystem
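
One way the node-selection step could be expressed is a simple score over the status fields each node reports (free threads, queue length, health). The sketch below illustrates the idea only; it is not the selection logic actually used in the source, and the pickNode function is hypothetical.

package main

import "fmt"

// NodeStatus holds the fields relevant to scheduling, matching the
// /api/v1/status response shown earlier.
type NodeStatus struct {
	Address          string
	AvailableThreads int
	QueuedJobs       int
	IsHealthy        bool
}

// pickNode returns a healthy node with enough free threads for the job,
// preferring more free capacity and a shorter queue. Illustrative only.
func pickNode(nodes []NodeStatus, jobThreads int) *NodeStatus {
	var best *NodeStatus
	bestScore := -1 << 30
	for i := range nodes {
		n := &nodes[i]
		if !n.IsHealthy || n.AvailableThreads < jobThreads {
			continue
		}
		score := n.AvailableThreads - n.QueuedJobs
		if score > bestScore {
			best, bestScore = n, score
		}
	}
	return best
}

func main() {
	nodes := []NodeStatus{
		{"192.168.1.101:8080", 8, 3, true},
		{"192.168.1.102:8080", 12, 0, true},
		{"192.168.1.103:8080", 16, 1, false},
	}
	if n := pickNode(nodes, 4); n != nil {
		fmt.Println("delegate to", n.Address)
	}
}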

Thread Management

  • Each job specifies the number of threads it will use
  • Jobs only execute when sufficient thread capacity is available
  • The scheduler uses a FIFO queue for pending jobs
  • Thread capacity is released when jobs complete
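
The bookkeeping above can be pictured as a thread counter guarded by a mutex plus a FIFO slice of pending jobs. The following is a sketch of that idea under those assumptions, not the manager's actual implementation.

package main

import (
	"fmt"
	"sync"
)

type job struct {
	ID      string
	Threads int
}

// scheduler tracks free threads and keeps a FIFO queue of jobs that do
// not currently fit. Illustrative bookkeeping only.
type scheduler struct {
	mu        sync.Mutex
	available int
	queue     []job
}

// submit starts the job if enough threads are free, otherwise queues it.
func (s *scheduler) submit(j job) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if j.Threads <= s.available {
		s.available -= j.Threads
		fmt.Println("running", j.ID)
	} else {
		s.queue = append(s.queue, j)
		fmt.Println("queued", j.ID)
	}
}

// complete releases the job's threads and starts queued jobs that now fit,
// strictly in FIFO order.
func (s *scheduler) complete(j job) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.available += j.Threads
	for len(s.queue) > 0 && s.queue[0].Threads <= s.available {
		next := s.queue[0]
		s.queue = s.queue[1:]
		s.available -= next.Threads
		fmt.Println("running", next.ID)
	}
}

func main() {
	s := &scheduler{available: 8}
	s.submit(job{"a", 6})
	s.submit(job{"b", 4}) // queued: only 2 threads free
	s.complete(job{"a", 6})
}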

Worker Health Monitoring

  • Head node performs health checks every 5 seconds
  • Unhealthy workers are excluded from job delegation
  • If submitting a job to a worker fails, the job fails over to another node
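
The health-check cycle amounts to hitting each worker's /api/v1/health endpoint on a timer; the 5-second interval below is taken from the description above, but the loop itself is only a sketch of the idea.

package main

import (
	"fmt"
	"net/http"
	"time"
)

// checkWorkers probes each worker's health endpoint once and reports
// which workers look healthy enough to receive delegated jobs.
func checkWorkers(workers []string) {
	client := &http.Client{Timeout: 2 * time.Second}
	for _, w := range workers {
		resp, err := client.Get("http://" + w + "/api/v1/health")
		healthy := err == nil && resp.StatusCode == http.StatusOK
		if resp != nil {
			resp.Body.Close()
		}
		fmt.Printf("%s healthy=%v\n", w, healthy)
	}
}

func main() {
	workers := []string{"192.168.1.101:8080", "192.168.1.102:8080"}
	ticker := time.NewTicker(5 * time.Second) // interval from the description above
	defer ticker.Stop()
	for range ticker.C { // runs until interrupted
		checkWorkers(workers)
	}
}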

Example: Running a Batch of Jobs

Create a script to submit multiple jobs:

#!/bin/bash

for i in {1..10}; do
  curl -X POST http://192.168.1.100:8080/api/v1/jobs/submit \
    -H "Content-Type: application/json" \
    -d "{
      \"command\": \"python3 process_data.py --input data_$i.txt\",
      \"threads\": 4,
      \"stdout_path\": \"/mnt/md0/cluster/jobs/job_$i.stdout\",
      \"stderr_path\": \"/mnt/md0/cluster/jobs/job_$i.stderr\"
    }"
  echo "Submitted job $i"
done

Monitoring

Monitor the cluster status continuously:

watch -n 5 'curl -s http://192.168.1.100:8080/api/v1/cluster/status | jq .'

Troubleshooting

Jobs not being delegated to workers

  • Verify worker nodes are accessible from the head node
  • Check worker node health: curl http://worker:8080/api/v1/health
  • Ensure worker addresses in head config match actual IPs/hostnames

Jobs stuck in queue

  • Check thread availability: curl http://node:8080/api/v1/status
  • Verify no jobs are consuming all threads
  • Check if max_threads configuration is appropriate

Cannot access shared filesystem paths

  • Ensure all nodes have the same mount point for shared storage
  • Verify write permissions on stdout/stderr paths
  • Test file creation: touch /mnt/md0/cluster/jobs/test.txt
