diff --git a/tutorial4/README.md b/tutorial4/README.md
index deb8480..59710b9 100644
--- a/tutorial4/README.md
+++ b/tutorial4/README.md
@@ -27,6 +27,16 @@
 1. [Prerequisites](#prerequisites)
 1. [Head Node Configuration (Server)](#head-node-configuration-server)
 1. [Compute Node Configuration (Clients)](#compute-node-configuration-clients)
+1. [Integration of Slurm Cluster Monitoring with Grafana](#integration-of-slurm-cluster-monitoring-with-grafana)
+    1. [Weekly Implementation Plan](#weekly-implementation-plan)
+    1. [Cluster Architecture](#cluster-architecture)
+    1. [Prerequisites & Dependencies](#prerequisites--dependencies)
+    1. [Week 1: Cluster Foundation](#week-1-cluster-foundation)
+    1. [Week 2: Slurm Cluster Setup](#week-2-slurm-cluster-setup)
+    1. [Week 3: Monitoring Stack](#week-3-monitoring-stack)
+    1. [Week 4: Slurm Exporter & Integration](#week-4-slurm-exporter--integration)
+    1. [Week 5: Grafana Dashboards and Alerts](#week-5-grafana-dashboards-and-alerts)
+    1. [Troubleshooting Guide](#troubleshooting-guide)
 1. [GROMACS Application Benchmark](#gromacs-application-benchmark)
     1. [Protein Visualization](#protein-visualization)
     1. [Benchmark 2 (1.5M Water)](#benchmark-2-15m-water)
@@ -1247,6 +1257,1095 @@ sinfo -alN
 The `S:C:T` column means "sockets, cores, threads" and your numbers for your compute node should match the settings that you made in the `slurm.conf` file.

# Integration of Slurm Cluster Monitoring with Grafana
Document Purpose & Scope
This document provides a complete, step-by-step guide for setting up a Slurm HPC cluster with Prometheus monitoring, organized by weekly milestones and based on real-world deployment experiences across Rocky Linux, Ubuntu, and Arch Linux environments.

This guide is designed for system administrators and HPC practitioners who need to deploy a production-ready High Performance Computing cluster with comprehensive monitoring capabilities. 
It combines theoretical best practices with hard-earned practical knowledge from actual deployments.

### Weekly Implementation Plan
Weekly Breakdown & Strategic Approach
This section outlines a phased five-week implementation strategy to systematically build your HPC cluster, ensuring each layer is properly tested before proceeding to the next.

Week 1: Cluster Foundation - Establishes the basic operational environment, including time synchronization, secure communication, and user management

Week 2: Slurm Cluster Setup - Implements the job scheduling system with proper authentication and resource management

Week 3: Monitoring Stack - Deploys the core monitoring infrastructure for system-level metrics

Week 4: Slurm Exporter & Integration - Adds HPC-specific monitoring and completes the full integration

Week 5: Grafana Dashboards and Alerts - Builds dashboards and email alerting on top of the collected metrics

This phased approach minimizes complexity and ensures each component is validated before integration, reducing troubleshooting overhead.

## Cluster Architecture

### System Design & Component Relationships
**This section defines the physical and logical layout of your HPC cluster, showing how different components interact and communicate.**

### Final System Architecture (Sebowa OpenStack Example)
This table represents a typical production deployment showing service distribution and network configuration:

| **Role** | **VM Hostname** | **IP Address** | **Ports** | **Services** |
|----------|-----------------|----------------|-----------|--------------|
| **Prometheus Server** | head-node | localhost | 9090 | prometheus.service |
| **Slurm Exporter** | head-node | localhost | 9341 | prometheus-slurm-exporter.service |
| **Node Exporter (Host)** | head-node | localhost | 9100 | node_exporter.service |
| **Compute Node 1** | rocky-com-node | - | 9100 | node_exporter.service |
| **Compute Node 2** | ubuntu-com-node | - | 9100 | node_exporter.service |
| **Compute Node 3** | arch-com-node | - | 9100 | node_exporter.service |


Key 
Architecture Notes:
- Prometheus and Slurm Exporter co-located on the head node for simplified management
- Node Exporters deployed on all systems for comprehensive hardware monitoring
- Standardized ports ensure consistent firewall and security configurations
- The head-node configuration is the same for all distros

---

## Prerequisites & Dependencies

### Software Requirements & Package Management
This section covers all required software packages and dependencies for Rocky Linux, Ubuntu, and Arch Linux environments, ensuring compatibility and proper functionality.

### Essential Packages
These packages form the foundation of your HPC cluster and must be installed before proceeding:

**Rocky Linux:**
```bash
sudo dnf install epel-release -y
sudo dnf install chrony pdsh pdsh-rcmd-ssh munge slurm slurm-slurmctld slurm-slurmd wget -y
```

**Ubuntu:**
```bash
sudo apt update
sudo apt install -y chrony pdsh munge libmunge-dev slurm-wlm slurmctld slurmd golang-go git make build-essential libssl-dev libpam0g-dev python3 apt-transport-https software-properties-common wget
```

**Arch Linux:**
```bash
sudo pacman -Syu --noconfirm
sudo pacman -Sy --noconfirm chrony pdsh munge go git make base-devel openssl pam python wget
```

---

## Week 1: Cluster Foundation

### Core Infrastructure Establishment
This week focuses on building the fundamental cluster infrastructure that enables reliable communication, synchronization, and management across all nodes.

### Time Synchronization (Chrony)
Time synchronization is CRITICAL for Slurm operation - mismatched clocks cause job failures and authentication issues.
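Before configuring chrony, it can help to quantify the current skew. The sketch below is a quick check rather than part of the official setup: `node2`/`node3` are placeholder hostnames, and it assumes SSH access to each node (it only flags skew of a second or more).

```bash
#!/bin/bash
# Quick clock-skew check across cluster nodes.
# node2/node3 are placeholder hostnames - adjust for your cluster.

# Absolute difference between two epoch timestamps, in seconds.
skew() {
  local a=$1 b=$2
  echo $(( a > b ? a - b : b - a ))
}

# Report each node's offset from the local (head node) clock.
check_skew() {
  for host in "$@"; do
    remote=$(ssh "$host" date +%s)
    echo "$host: $(skew "$remote" "$(date +%s)")s skew"
  done
}

# Usage: check_skew node2 node3
```

Any node reporting more than a second or two of skew is worth fixing before Slurm or MUNGE is installed.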
#### Enable chrony before configuration
```bash
sudo systemctl enable chronyd --now   # the unit is named chrony on Ubuntu
```

#### Configuration (Master Node - node1)
The head node serves as the time source for the entire cluster.
Edit /etc/chrony.conf:
```bash
allow 192.168.1.0/24                 # Permit cluster subnet to sync
bindaddress 192.168.1.10             # Bind to cluster network interface
server 0.centos.pool.ntp.org iburst  # External time sources
server 1.centos.pool.ntp.org iburst
```

#### Client Configuration (node2, node3)
Compute nodes synchronize with the head node.
Edit /etc/chrony.conf:
```bash
server node1 iburst  # Use head node as primary time source
```

#### Verification
```bash
sudo systemctl restart chronyd
chronyc tracking    # Check synchronization status
chronyc sources -v  # Verify time sources
```

### Parallel Command Execution (pdsh)
Enables simultaneous command execution across multiple nodes, essential for efficient cluster management.

#### Configure PDSH
```bash
# Make SSH the default transport (secure alternative to rsh)
echo 'export PDSH_RCMD_TYPE=ssh' >> ~/.bashrc
source ~/.bashrc

# Example: query the munge user on com1 and com2 over SSH
pdsh -w com[1-2] -R ssh getent passwd munge
```

#### SSH Key Setup
Establish passwordless SSH for automated cluster management:
```bash
ssh-keygen -t rsa  # Generate key pair
ssh-copy-id node1  # Distribute to head node
ssh-copy-id node2  # Distribute to compute nodes
ssh-copy-id node3
```

#### Usage Examples
```bash
pdsh -w node[1-3] hostname                          # Check node connectivity
pdsh -w node[1-3] uptime                            # System status across cluster
pdsh -w node[1-3] "sudo systemctl restart chronyd"  # Service management
pdcp -w node[1-3] myfile /tmp/                      # Distributed file copy
```

### User & Permission Management
**Consistent user and permission configuration is ESSENTIAL for proper Slurm and filesystem operation.**

#### Critical Requirements
- **Consistent UID/GID across all nodes** - Slurm and shared filesystems use numeric IDs, not usernames
- **Strict SSH permissions** - Required for passwordless authentication and security
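The UID/GID requirement above can be checked mechanically before it causes trouble. A minimal sketch (placeholder hostnames; assumes the passwordless SSH configured earlier):

```bash
#!/bin/bash
# Verify that a user has the same UID on every node.
# Node hostnames are passed as arguments.

# Print OK when two IDs match, MISMATCH otherwise.
compare_ids() {
  [ "$1" = "$2" ] && echo OK || echo MISMATCH
}

# Compare a user's UID on each remote node against the local value.
check_user() {
  local user=$1; shift
  local ref
  ref=$(id -u "$user")
  for host in "$@"; do
    echo "$user@$host: $(compare_ids "$ref" "$(ssh "$host" id -u "$user")")"
  done
}

# Usage: check_user munge node1 node2 node3
#        check_user slurm node1 node2 node3
```

Run it for every service account (munge, slurm) and your login users; any MISMATCH should be resolved before continuing.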
#### SSH Permission Fix
SSH requires specific permissions for security:
```bash
# On remote nodes
chmod go-w ~                      # Home directory not group/world-writable
chmod 700 ~/.ssh                  # SSH directory owner-only access
chmod 600 ~/.ssh/authorized_keys  # Keys file owner read/write only

# SELinux fix (Rocky/RHEL)
sudo restorecon -R -v ~/.ssh      # Reset SELinux contexts
```

#### Passwordless Sudo
Required for pdsh to execute privileged commands.
On all compute nodes, run `sudo visudo` and add:
```bash
username ALL=(ALL) NOPASSWD: ALL
```
## NFS Server Setup Summary

### Installation & Service Management
```bash
sudo pacman -Syu nfs-utils               ## all arch nodes
sudo dnf install nfs-utils               ## all rocky nodes
sudo apt install nfs-kernel-server       ## ubuntu headnode
sudo apt install nfs-common              ## ubuntu comnodes

sudo systemctl enable nfs-server         ## arch & rocky headnode
sudo systemctl start nfs-server

sudo systemctl enable nfs-kernel-server  ## ubuntu headnode
sudo systemctl start nfs-kernel-server
```

### NFS Export Configuration
The `/etc/exports` configuration (replace 192.168.0.0/28 with your private network address):
```
/home 192.168.0.0/28(rw,async,no_subtree_check,no_root_squash)
```

**Options explained:**
- `rw`: Read-write access
- `async`: Better performance but slightly less safe
- `no_subtree_check`: Improves reliability
- `no_root_squash`: Allows root user access (use with caution)

### Applying Changes
```bash
sudo exportfs -ra  # Re-export all
sudo exportfs -v   # Verify exports
```

## Mounting NFS Shares
```bash
sudo mount -t nfs 192.168.0.12:/home /home
```

## SSH Configuration

### Hosts File (/etc/hosts) Option 1
```
192.168.0.12 headnode
192.168.0.13 com1
```

### SSH Config (~/.ssh/config) Option 2
```ssh-config
Host headnode
    Hostname 192.168.0.12
    User arch
    IdentityFile ~/.ssh/id_ed25519

Host com1
    Hostname 192.168.0.13
    User arch
    IdentityFile ~/.ssh/id_ed25519
```

## Persistent Mounts
For automatic 
mounting at boot, add to `/etc/fstab`:
```
192.168.0.12:/home /home nfs defaults 0 0
```

This setup creates a seamless distributed environment where the home directory is shared across all nodes, and SSH access is simplified through the shared configuration.

## Firewall Configuration

**This is the configuration for Arch Linux; a similar software configuration was done for all other nodes.**
### Improved iptables Configuration Script
This script opens the ports for the following services: ssh, icmp, nfs, ntp
```bash
#!/bin/bash

# Flush existing rules
sudo iptables -F

# Set default policies
sudo iptables -P INPUT DROP
sudo iptables -P FORWARD DROP
sudo iptables -P OUTPUT ACCEPT

# Allow loopback
sudo iptables -A INPUT -i lo -j ACCEPT

# Allow established connections
sudo iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT

# Allow ICMP (ping)
sudo iptables -A INPUT -p icmp --icmp-type echo-request -j ACCEPT

# SSH with rate limiting
sudo iptables -A INPUT -p tcp --dport 22 -m conntrack --ctstate NEW -m limit --limit 3/min --limit-burst 3 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 22 -m conntrack --ctstate NEW -j DROP

# NFS ports (111 = rpcbind, 2049 = nfsd, 20048 = mountd)
sudo iptables -A INPUT -p tcp --dport 111 -j ACCEPT
sudo iptables -A INPUT -p udp --dport 111 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 2049 -j ACCEPT
sudo iptables -A INPUT -p udp --dport 2049 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 20048 -j ACCEPT
sudo iptables -A INPUT -p udp --dport 20048 -j ACCEPT

# NTP
sudo iptables -A INPUT -p udp --dport 123 -j ACCEPT

# Save rules (the redirection must also run as root, hence sh -c)
sudo mkdir -p /etc/iptables
sudo sh -c 'iptables-save > /etc/iptables/iptables.rules'
```

### Verification Commands
```bash
# Check current rules
sudo iptables -L -v

# Check with line numbers (for management)
sudo iptables -L -v --line-numbers

# Test NFS connectivity from compute nodes
showmount -e headnode
```
## 5. 
Management Tips

### To insert a rule at a specific position:
```bash
sudo iptables -I INPUT 5 -p tcp --dport 80 -j ACCEPT
```

### To delete a rule:
```bash
sudo iptables -D INPUT 3
```

### Temporarily disable:
```bash
sudo systemctl stop iptables
```

## Week 2: Slurm Cluster Setup

### Job Scheduler Implementation
This week focuses on deploying Slurm, the workload manager that schedules and manages computational jobs across the cluster.

### MUNGE Authentication Setup
**MUNGE provides the authentication layer for Slurm - it MUST be perfectly configured across all nodes.**

#### User Synchronization
The munge user must have an identical UID/GID on ALL nodes.

**Problem:** A UID/GID mismatch across nodes causes authentication failures.
```bash
# Stop the service first (required for user modification)
sudo systemctl stop munge

# Standardize the UID/GID to match the head node
sudo usermod -u 993 munge
sudo groupmod -g 990 munge

# Fix conflicting groups if needed (common issue)
grep ':990:' /etc/group              # Find which group already uses GID 990
sudo groupmod -g 1500 fwupd-refresh  # Move the conflicting group
sudo groupmod -g 990 munge           # Now assign the GID to munge

# Reassign files still owned by the old IDs (CRITICAL step)
# Replace 112/113 with the munge user's previous UID/GID
sudo find / -user 112 -exec chown -h munge {} \;
sudo find / -group 113 -exec chgrp -h munge {} \;
```

#### Key Distribution
The munge.key must be identical on all nodes - secure distribution method:
```bash
# Copy munge.key to all nodes using the secure pipe method
sudo cat /etc/munge/munge.key | ssh rocky@com1 "sudo tee /etc/munge/munge.key > /dev/null"

# Fix ownership and permissions on the remote node
ssh rocky@com1 "sudo chown munge:munge 
/etc/munge/munge.key && sudo chmod 400 /etc/munge/munge.key"
```

#### Verification
Test the complete MUNGE authentication chain:
```bash
# Start MUNGE (all nodes)
sudo systemctl enable munge
sudo systemctl start munge

# Test MUNGE authentication between nodes
munge -n | ssh com2 unmunge

# Verify key consistency across the cluster
sudo md5sum /etc/munge/munge.key
ssh com2 "sudo md5sum /etc/munge/munge.key"
```

#### Slurm Installation
Install Slurm components on the appropriate nodes:
```bash
# Rocky Linux
sudo dnf install -y slurm slurm-slurmctld slurm-slurmd

# Ubuntu
sudo apt install -y slurm-wlm slurmctld slurmd

# Arch Linux (Slurm ships as slurm-llnl in the AUR)
yay -S slurm-llnl   # or another AUR helper

# Create the slurm user (all nodes)
sudo useradd -r -s /sbin/nologin slurm

# Create directories (all nodes)
sudo mkdir -p /var/spool/slurmctld /var/spool/slurmd /var/log/slurm
sudo chown -R slurm:slurm /var/spool/slurmctld /var/spool/slurmd /var/log/slurm
```

### Slurm Configuration

#### Example slurm.conf
Main Slurm configuration file - it must be identical on all nodes:
```bash
ClusterName=ubuntu-hpc
SlurmctldHost=headnode                  # Controller hostname
SlurmUser=slurm                         # Dedicated Slurm user
StateSaveLocation=/var/spool/slurmctld  # State persistence
SlurmdSpoolDir=/var/spool/slurmd        # Compute node spool

AuthType=auth/munge                     # MUNGE authentication
CryptoType=crypto/munge                 # MUNGE encryption
MpiDefault=none                         # No MPI by default
ProctrackType=proctrack/cgroup          # Process tracking
ReturnToService=2                       # Return failed nodes to service when healthy

SlurmctldPort=6817                      # Controller port
SlurmdPort=6818                         # Daemon port

# Logging (paths match the /var/log/slurm directory created above)
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmSchedLogFile=/var/log/slurm/slurm_sched.log

# Scheduler
SchedulerType=sched/backfill            # Backfill scheduling
SelectType=select/cons_tres             # Resource selection
SelectTypeParameters=CR_Core            # Core-based scheduling

# Nodes
NodeName=node[1-3] CPUs=8 State=UNKNOWN # Compute node definitions 
PartitionName=debug Nodes=node[1-3] Default=YES MaxTime=00:30:00 State=UP
```

#### Service Management
Start and enable the Slurm services:
```bash
# Head node (controller)
sudo systemctl enable slurmctld
sudo systemctl start slurmctld

# Compute nodes (daemons)
sudo systemctl enable slurmd
sudo systemctl start slurmd
```

#### Configuration Distribution
Distribute a consistent configuration to all nodes:
```bash
# Copy slurm.conf to all nodes using the secure pipe method
sudo cat /etc/slurm/slurm.conf | ssh rocky@node1 "sudo tee /etc/slurm/slurm.conf > /dev/null"
```

### Verification
Comprehensive testing of Slurm functionality:
```bash
sinfo                # View node states
scontrol show nodes  # Detailed node information
scontrol ping        # Test controller connectivity

# Test job submission
srun hostname        # Interactive job
sbatch test_job.sh   # Batch job
squeue               # Check queue
```

---
## Week 3: Monitoring Stack

### Infrastructure Monitoring Deployment
This week implements the core monitoring infrastructure to track system health, resource utilization, and performance metrics. 
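Throughout this week the same sanity check recurs: each service must answer TCP on its advertised port. A small probe sketch using bash's `/dev/tcp` (the hostnames and ports below are the ones from the architecture table; adjust as needed):

```bash
#!/bin/bash
# Probe the monitoring ports that this week's stack will expose.

# Print "open" if a TCP connection can be established, "closed" otherwise.
probe() {
  local host=$1 port=$2
  if timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "$host:$port open"
  else
    echo "$host:$port closed"
  fi
}

# probe headnode 9090   # Prometheus
# probe headnode 3000   # Grafana
# probe node1 9100      # Node Exporter
```

A port that stays "closed" after a service is started usually points at a firewall or security-group rule rather than the service itself.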
### Prometheus Installation on the headnode
Prometheus serves as the central metrics collection and storage system.

#### Create User & Directories
Dedicated user for security and a proper directory structure:
```bash
sudo useradd --no-create-home --shell /bin/false prometheus
sudo mkdir /etc/prometheus /var/lib/prometheus
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
```

#### Download & Install
Install from the official binaries for version control:
```bash
wget https://github.com/prometheus/prometheus/releases/download/v2.37.0/prometheus-2.37.0.linux-amd64.tar.gz
tar -xvf prometheus-2.37.0.linux-amd64.tar.gz
sudo cp prometheus-2.37.0.linux-amd64/prometheus /usr/local/bin/
sudo cp prometheus-2.37.0.linux-amd64/promtool /usr/local/bin/
# The service below references the console templates shipped in the tarball
sudo cp -r prometheus-2.37.0.linux-amd64/consoles /etc/prometheus/
sudo cp -r prometheus-2.37.0.linux-amd64/console_libraries /etc/prometheus/
```

#### Systemd Service
Create a service file for proper process management.
Create /etc/systemd/system/prometheus.service:
```ini
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file /etc/prometheus/prometheus.yml \
    --storage.tsdb.path /var/lib/prometheus/ \
    --web.console.templates=/etc/prometheus/consoles \
    --web.console.libraries=/etc/prometheus/console_libraries

[Install]
WantedBy=multi-user.target
```
#### Prometheus Config (/etc/prometheus/prometheus.yml)
Configure Prometheus to scrape metrics from all nodes:
```yaml
global:
  scrape_interval: 15s  # How often to scrape metrics

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']  # Monitor itself

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['node1:9100', 'node2:9100', 'node3:9100']  # All nodes
```
#### Start Prometheus
Enable and start the service:
```bash
sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
```


### Node Exporter Installation on all nodes
Node Exporter collects system-level metrics 
from each machine. The binary alone has no unit file, so create a node_exporter user and a systemd service (following the same pattern as the Prometheus unit above, with ExecStart=/usr/local/bin/node_exporter) before enabling it:
```bash
wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
tar -xvf node_exporter-1.3.1.linux-amd64.tar.gz
sudo cp node_exporter-1.3.1.linux-amd64/node_exporter /usr/local/bin/
sudo useradd --no-create-home --shell /bin/false node_exporter
# Create /etc/systemd/system/node_exporter.service first (see the Prometheus unit above)
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
```

### Grafana Installation on the headnode
Grafana provides the visualization interface for monitoring data.
```bash
# Ubuntu (apt-key is deprecated on newer releases; use a signed-by keyring there)
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
sudo add-apt-repository "deb https://packages.grafana.com/oss/deb stable main"
sudo apt update
sudo apt install grafana

# Rocky
sudo dnf install grafana

# Arch
sudo pacman -Syu grafana

# Start the Grafana server (the unit is named grafana on Arch)
sudo systemctl enable grafana-server
sudo systemctl start grafana-server
```

#### Firewall Rules
Open the required ports for the monitoring services:
```bash
# Example for Ubuntu
sudo ufw allow 9090  # Prometheus web interface
sudo ufw allow 3000  # Grafana web interface
sudo ufw allow 9100  # Node Exporter metrics
```

#### Verification
Test the complete monitoring stack:
```bash
# Test Prometheus scraping
curl http://headnode:9090/targets

# Verify services are running
sudo systemctl status prometheus
sudo systemctl status node_exporter
sudo systemctl status grafana-server
```

---

## Week 4: Slurm Exporter & Integration

### HPC-Specific Monitoring
This week adds Slurm-specific monitoring to track job statistics, queue states, and scheduler performance.

### Slurm Exporter Installation on the head node
Slurm Exporter extracts metrics directly from Slurm utilities. 
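For intuition about what the exporter reports: `sinfo -h -o %C` prints cluster-wide CPU counts in `allocated/idle/other/total` form, and a scrape essentially parses fields like this into Prometheus metrics. A rough illustration (the metric names below are illustrative, not the exporter's exact names - check its /metrics endpoint for the real ones):

```bash
#!/bin/bash
# Parse sinfo's %C output ("allocated/idle/other/total") into
# one "metric value" line per CPU state.
cpus_to_metrics() {
  echo "$1" | awk -F'/' '{
    printf "slurm_cpus_allocated %s\n", $1
    printf "slurm_cpus_idle %s\n",      $2
    printf "slurm_cpus_other %s\n",     $3
    printf "slurm_cpus_total %s\n",     $4
  }'
}

# On a live cluster:   cpus_to_metrics "$(sinfo -h -o %C)"
# With a sample value: cpus_to_metrics "12/52/0/64"
```

This is why the exporter's service file must have the Slurm binaries on its PATH: without `sinfo`/`squeue` it has nothing to parse.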
#### Build from Source
Compile from source for the latest features and compatibility:
```bash
sudo apt install -y golang git make
git clone https://github.com/vpenso/prometheus-slurm-exporter.git
cd prometheus-slurm-exporter
make
# The built binary may land in bin/ depending on the exporter version; adjust the path if needed
sudo cp bin/prometheus-slurm-exporter /usr/local/bin/slurm_exporter
```

#### Systemd Service
Create a service with proper dependencies and environment (note that systemd does not allow trailing comments on directive lines).
Create /etc/systemd/system/slurm_exporter.service:
```ini
[Unit]
Description=Prometheus Slurm Exporter
Wants=network-online.target
# Requires Slurm to be up first
After=network-online.target slurmctld.service

[Service]
User=root
Group=root
Type=simple
ExecStart=/usr/local/bin/slurm_exporter
Restart=always
# PATH must include the Slurm binaries (sinfo, squeue, ...)
Environment="PATH=/usr/bin:/usr/local/bin:/opt/slurm/bin"

[Install]
WantedBy=multi-user.target
```

#### Start Slurm Exporter
```bash
sudo systemctl daemon-reload
sudo systemctl enable slurm_exporter
sudo systemctl start slurm_exporter
```

### Configuration Updates

#### Updated Prometheus Config
Add the Slurm exporter to Prometheus scraping.
Edit /etc/prometheus/prometheus.yml:
```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['node1:9100', 'node2:9100', 'node3:9100']

  - job_name: 'slurm_exporter'
    static_configs:
      - targets: ['headnode:8080']  # or 9341, depending on the exporter's actual port
```

#### Additional Firewall Rules
```bash
sudo ufw allow 9341  # Slurm Exporter port
```

### Verification

#### Verify Exporter Metrics
Test that the Slurm Exporter is providing metrics:
```bash
curl http://localhost:8080/metrics
# or
curl http://localhost:9341/metrics
```

#### Test Prometheus Integration
Ensure Prometheus is scraping the Slurm metrics:
```bash
# Restart Prometheus to load the new config
sudo systemctl restart prometheus

# Check the targets endpoint
curl http://localhost:9090/api/v1/targets

# Test Slurm 
metrics in the Prometheus UI:
# http://headnode:9090/graph
```

#### Grafana Configuration
Connect Grafana to visualize the Slurm metrics:
1. Access Grafana at http://headnode:3000
2. Add Prometheus as a data source: http://localhost:9090
3. Import HPC monitoring dashboards
4. Verify that Slurm metrics are visible

#### Final Integration Check
End-to-end validation of the complete system:
```bash
# Complete cluster status
sinfo
scontrol show nodes

# Monitoring stack status
sudo systemctl status prometheus node_exporter slurm_exporter grafana-server

# Test end-to-end monitoring
srun hostname
# Verify the job appears in the Slurm exporter metrics
```

## Week 5: Grafana Dashboards and Alerts

### Project Overview
Integrate Prometheus data into Grafana and create comprehensive dashboards for SLURM cluster monitoring.


## SLURM Monitoring Dashboard Setup

### Overview
This guide walks you through setting up a Grafana dashboard for monitoring SLURM workload manager activity using Prometheus metrics.

### Prerequisites
- Grafana instance installed and running
- Prometheus data source configured in Grafana
- SLURM metrics being exported to Prometheus

### Dashboard Installation

#### Step 1: Create New Dashboard
1. Navigate to the **Dashboards** section in Grafana
2. Click on **"New"**
3. Select **"Import"** from the dropdown menu

#### Step 2: Import Dashboard
1. In the import screen, enter the Grafana dashboard ID: **`4323`**
2. Click **"Load"** to load the dashboard configuration

#### Step 3: Configure Data Source
1. Select **Prometheus** as your data source from the dropdown menu
2. Click **"Import"** to complete the installation

### Verification Steps
1. Execute SLURM jobs in your cluster
2. Monitor the dashboard graphs for spikes in activity
3. 
Refer to the images in the project folder for expected visualizations + +### Dashboard Features +- Real-time monitoring of SLURM job activity +- Resource utilization metrics +- Queue status and job statistics +- Performance indicators and alerts + +## Dashboard Visualizations + +**Graph 1: Backfill Scheduler Cycles** +Monitor backfill scheduler performance metrics +![Backfill Scheduler Cycles](https://github.com/user-attachments/assets/7f9660e6-56c4-4e1e-861e-1a989ba7017a) + +**Graph 2: Job Status Overview** +Track job states across the cluster +![Job Status Overview](https://github.com/user-attachments/assets/fcd1df1e-71d6-45be-a843-bc5ae79e9040) + +**Graph 3: Scheduler Cycle Performance** +Monitor overall scheduler performance +![Scheduler Cycle Performance](https://github.com/user-attachments/assets/fbe1c982-9296-427e-a4f5-5ceafa6fed20) + +**Graph 4: Detailed Job Statistics** +Detailed view of job distribution and trends +![Detailed Job Statistics](https://github.com/user-attachments/assets/6718de08-2def-48e2-b482-3e40516adf9e) + +**Note**: Ensure your Prometheus instance is properly scraping SLURM metrics before expecting data in the dashboard. + +# SMTP Setup with Gmail for Grafana Alerts + +## Overview +Configure Gmail SMTP to enable email notifications for Grafana alerts in your SLURM monitoring setup. + +## Prerequisites +- Gmail account created for the group +- 2-step verification enabled on Gmail account +- Grafana running in Docker container + +## Gmail App Password Setup + +### Step 1: Generate App Password +1. Navigate to: [Google App Passwords](https://support.google.com/accounts/answer/185833?hl=en) +2. Log into your Gmail account +![Grafana email setup](https://github.com/user-attachments/assets/cda73165-2c42-48c3-82db-0827b5b8fda4) +3. Provide an app name: **"Grafana"** +![App name for email](https://github.com/user-attachments/assets/1c0a3e26-7795-4803-addd-96550377f5ca) +4. 
Copy the generated 16-character password for later use

### Security Notes
- The Gmail app password is different from your account password
- Keep the app password secure and regenerate it if compromised
- Regularly review active app passwords in your Google Account settings

## Grafana SMTP Configuration

### Configuration File Setup
Configure SMTP via the `grafana.ini` file. For the package installation used earlier this lives at `/etc/grafana/grafana.ini` on the host; if Grafana runs in a Docker container instead, edit the `grafana.ini` mounted into the container:

1. **Access the configuration file**:
   ```bash
   nano /etc/grafana/grafana.ini
   ```

2. **Locate and configure the SMTP section**:
![Grafana INI Configuration](https://github.com/user-attachments/assets/670a749b-4ff6-4676-9748-43020d9736bc)

   ```ini
   [smtp]
   enabled = true
   host = smtp.gmail.com:587
   user = your-email@gmail.com
   password = your-generated-app-password
   from_address = your-email@gmail.com
   from_name = Grafana Alerts
   startTLS_policy = OpportunisticStartTLS
   ```

## Grafana Alerting Configuration

### Step 1: Add Prometheus Data Source
1. Go to **Home** → **Connections** → **Data sources**
2. Click **"Add Data Source"**
3. Search for and select **Prometheus**
4. Configure the connection:
   - **Prometheus server URL**: `http://localhost:9090`
5. Click **"Save & Test"** to verify a successful connection

### Step 2: Create Contact Point

#### Add Email Contact Point
1. Navigate to **Alerting** → **Contact points**
2. Click **"Add contact point"**
3. Configure the settings:
   - **Name**: `Node Down`
   - **Integration**: `Email`
   - **Addresses**: `dcdaggers01@gmail.com`

4. **Test the configuration**:
   - Click **"Test"** → **"Send test notification"**
   ![Test email](https://github.com/user-attachments/assets/240bf1c5-b497-46d4-9a7d-dd9266f93e93)
   - Verify receipt in your email inbox
   ![Email inbox](https://github.com/user-attachments/assets/962d0597-1879-41f1-bbe4-b7829c406eba)
   - Click **"Save contact point"** after a successful test

### Step 3: Create Alert Rule

#### Configure Alert Rule
1. 
**Basic Information**:
   - **Rule name**: `Node Down`

2. **Query Configuration**:
![Alert Rule section A](https://github.com/user-attachments/assets/0dd67695-48bb-4565-abe0-b554fcddda8a)
   - **Query A**:
     ```promql
     up{job="node_exporter"} == 0
     ```
   - **Query B** (example for a custom job-status metric):
     ```promql
     job_success{job="myjob"} == 0
     ```

3. **Evaluation Settings**:
   - **Evaluate every**: `1m`

4. **Organization**:
   - **Folder**: Create a new folder `Node Down Alerts`
   - Configure appropriate labels
   ![Alert Label](https://github.com/user-attachments/assets/71971b57-db95-476b-82a4-f20bb61d129a)

5. **Evaluation Behavior**:
![Alert Section 3 & 4](https://github.com/user-attachments/assets/35abcb5c-6205-4252-b67e-190cb0b8f3ac)
   - **Evaluation group name**: `Evaluation Group`
   - **Pending period**: `1m`

6. **Notifications**:
   - Add the previously created contact point as the recipient

7. **Notification Message** (Optional):
   ```text
   Node {{ $labels.instance }} is DOWN
   ```

8. **Save** the alert rule


## Additional Grafana Management Tips

The systemd unit is named `grafana-server` on Ubuntu and Rocky, and `grafana` on Arch - substitute accordingly below.

### Check current config paths:
```bash
grafana-server -h
```

### Check which configuration the service actually loads:
```bash
systemctl cat grafana-server
```

### Useful Grafana commands:
```bash
# Enable auto-start on boot
sudo systemctl enable grafana-server

# View logs for debugging
sudo journalctl -u grafana-server -f

# Test a configuration explicitly
sudo -u grafana grafana-server -config /etc/grafana/grafana.ini cfg:default.paths.logs=/var/log/grafana
```


---

**Next Steps**: Monitor alert triggers and refine notification messages based on your SLURM cluster's specific requirements. 
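Before relying on the Grafana rule, the alert expression can be dry-run directly against Prometheus's HTTP query API (assumes Prometheus on localhost:9090, as configured above):

```bash
#!/bin/bash
# Dry-run an alert expression against the Prometheus HTTP API.

# Build the instant-query endpoint URL for a given Prometheus base URL.
query_url() {
  echo "$1/api/v1/query"
}

# On the head node:
#   curl -s "$(query_url http://localhost:9090)" \
#        --data-urlencode 'query=up{job="node_exporter"} == 0'
# An empty "result" array means the expression currently matches nothing,
# i.e. no node_exporter target is down.
```

Stopping node_exporter on one compute node and re-running the query is a convenient way to confirm the expression fires before wiring it to email.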
### Final Notes & Best Practices

- **Time Synchronization:** Critical for Slurm operation - use Chrony exclusively on Rocky Linux
- **UID/GID Consistency:** Essential for shared filesystems and MUNGE authentication
- **Firewall Configuration:** Ensure all required ports are open across nodes
- **Regular Verification:** Use the weekly checklists to ensure progress
- **Documentation:** Keep configuration files and procedures documented for future maintenance

**This comprehensive weekly guide combines lessons learned from multiple real-world deployments and provides a structured approach to building a fully monitored HPC cluster with Slurm.**

## Troubleshooting Guide

### Problem Resolution Reference
This section provides solutions to common issues encountered during HPC cluster deployment, based on real-world troubleshooting experiences.

### Common Issues & Solutions

#### 1. Prometheus Service Fails to Start
**Symptom:** Connection refused on port 9090

**Solution:**
```bash
# Check the YAML syntax using the official tool
promtool check config /etc/prometheus/prometheus.yml

# Fix any indentation errors in prometheus.yml, then restart
sudo systemctl restart prometheus
```

#### 2. Slurm Exporter Port Issues
**Symptom:** The slurm_exporter target in Prometheus shows "down"

**Solution:**
```bash
# Check the actual port from the service logs
sudo journalctl -u prometheus-slurm-exporter.service

# Update prometheus.yml with the correct port (usually 9341, not 8080)
```

#### 3. Slurm Nodes Show as idle*
**Solution:**
```bash
sudo systemctl restart slurmd
scontrol ping
```

#### 4. Jobs Stuck in "Configuring" State
**Solution:**
```bash
sudo systemctl restart slurmctld
ping node1  # Ensure hostname resolution works
```

#### 5. 
Munge Authentication Failures
**Symptom:** unmunge: Error: Invalid credential

**Solution:**
- Verify consistent UID/GID for all users across nodes
- Check munge.key consistency with md5sum
- Ensure time synchronization
- Verify socket permissions in /run/munge/

#### 6. Slurmd Service Failures
**Common Errors & Fixes:**

**Directory missing:**
```bash
sudo mkdir -p /var/spool/slurm
sudo chown slurm:slurm /var/spool/slurm
```

**Hardware definition mismatch:**
```bash
# Get the correct hardware configuration
slurmd -C

# Update slurm.conf with the correct NodeName line
sudo nano /etc/slurm/slurm.conf
```

**Slurm user mismatch:**
```bash
# Ensure the slurm user exists on all nodes with an identical UID/GID
sudo groupadd -g 64030 slurm
sudo useradd -u 64030 -g 64030 -r -c "Slurm User" -s /sbin/nologin slurm
```

#### 7. Node Exporters Not Scraping
**Symptom:** "context deadline exceeded" in Prometheus targets

**Solution:**
- Add inbound firewall rules for port 9100
- Verify security groups in OpenStack/cloud environments
- Test connectivity: `curl http://node1:9100/metrics`

#### 8. 
Slurm Exporter Shows No Metrics
**Solution:**
```bash
# Ensure the Slurm binaries are in the exporter's PATH
echo $PATH
which scontrol
which squeue
```

Add to the service file if needed:
```ini
Environment="PATH=/usr/bin:/usr/local/bin:/opt/slurm/bin"
```

## Grafana Configuration Fix Summary

### The Problem
- Grafana was looking for its config at `/etc/grafana.ini` by default
- Actual config file location: `/etc/grafana/grafana.ini`

### The Solution
Use a systemd drop-in file to override the service configuration:

```bash
sudo systemctl edit grafana
```

**Content added:**
```ini
[Service]
ExecStart=
ExecStart=/usr/bin/grafana server --config=/etc/grafana/grafana.ini --homepath=/usr/share/grafana
```

### Verification Commands
```bash
# Reload systemd
sudo systemctl daemon-reload

# Restart Grafana
sudo systemctl restart grafana

# Verify the override
systemctl cat grafana

# Check the service status
sudo systemctl status grafana
```

## Key Points Explained

### 1. **systemctl edit** Behavior
- Creates: `/etc/systemd/system/grafana.service.d/override.conf`
- This is the proper way to modify systemd services without editing the original files

### 2. **ExecStart=** Clearing
- The empty `ExecStart=` line is crucial - it clears the existing command
- Without it, you would get a duplicate `ExecStart` directives error

### 3. **Alternative Approaches**

**Option A: Symlink (quick fix)**
```bash
sudo ln -s /etc/grafana/grafana.ini /etc/grafana.ini
```

**Option B: Environment variable**
```bash
sudo systemctl edit grafana
```
```ini
[Service]
Environment=GF_PATHS_CONFIG=/etc/grafana/grafana.ini
```
## Common Grafana Issues on Arch

1. **Permission issues**: Ensure the `grafana` user owns the data/log directories
2. **Database path**: Check the `data` path in `grafana.ini`
3. 
**Port conflicts**: Default port 3000 might be in use

The systemd drop-in is the recommended approach on Arch Linux, as it preserves the package manager's files while providing the necessary customization. Because drop-ins live outside the packaged unit file, the changes also survive package updates.

## Other Grafana Issues

- **Gmail Authentication**: Ensure 2-step verification is enabled and the app password is 16 characters
- **SMTP Issues**: Verify that port 587 is open and the credentials are correct
- **Prometheus Connection**: Confirm Prometheus is running on port 9090
- **Docker Network**: If running in Docker, ensure the container can reach external SMTP servers



 # GROMACS Application Benchmark
 You will now be extending some of your earlier work from [Tutorial 3](../tutorial3/README.md#gromacs-adh-cubic).