A modular Python pipeline for analysing High-Performance Computing (HPC) scheduler logs.
It transforms raw job exports into a clean dataset, computes machine-wide and per-queue statistics, and produces publication-ready plots and detailed text reports.
- 🧹 Automated data cleaning & validation
- 📊 Temporal workload characterisation
- 📦 Job size & walltime distributions
- 🔥 Job size × walltime heatmaps
- 🧮 Per-queue core-hour accounting
- 🖥 Machine utilization over time
- 📅 Optional yearly / monthly breakdowns
- 📄 Structured text summaries for reporting
The workflow is composed of two stages:
`src/preprocessor.py`
Transforms raw scheduler exports into a clean, analysis-ready dataset.
- Reads raw HPC scheduler CSV logs
- Normalises column names
- Parses timestamps
- Computes derived metrics (runtime, wait time, core-hours)
- Filters corrupted / anomalous records
- Removes duplicate jobs
- Outputs a compressed dataset (`*_preprocessed.csv.gz`)
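The preprocessing steps can be sketched in pandas; this is an illustration using the required column names from the input schema, not the actual `preprocessor.py` code:

```python
import pandas as pd

def add_derived_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the normalisation and derived-metric steps (illustrative only)."""
    df = df.copy()
    # Normalise column names to lowercase
    df.columns = [c.strip().lower() for c in df.columns]
    # Parse timestamps
    for col in ("queued_timestamp", "start_timestamp", "end_timestamp"):
        df[col] = pd.to_datetime(df[col])
    # Derived metrics used throughout the analysis
    df["runtime_s"] = (df["end_timestamp"] - df["start_timestamp"]).dt.total_seconds()
    df["wait_s"] = (df["start_timestamp"] - df["queued_timestamp"]).dt.total_seconds()
    return df

jobs = pd.DataFrame({
    "QUEUED_TIMESTAMP": ["2024-06-01 10:00:00"],
    "START_TIMESTAMP": ["2024-06-01 10:30:00"],
    "END_TIMESTAMP": ["2024-06-01 11:30:00"],
})
out = add_derived_metrics(jobs)
```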
`src/main.py`
Generates figures, statistics, and structured reports.
- Hourly submission distribution
- Weekday distribution
- Day-of-year distribution (seasonality)
- Job-size distribution (node-count bins)
- Walltime distribution (custom bins)
- Job-size × Walltime heatmap (log colour scale)
- Timeline scatter plot
  - x-axis: submission date
  - y-axis: walltime (log scale)
  - marker size/colour: node count
- Machine utilization
  - Time-series of node usage (%)
  - Rolling averages
  - LOWESS trend estimation
- Stacked bar charts (core-hours by queue)
- Stacked cumulative area charts
- Pie charts (overall share)
- Summary statistics (mean / median / P5 / P95) for:
  - Nodes
  - Walltime
  - Runtime
  - Wait time
  - Core-hours
  - Efficiency
All analyses can be repeated:

- Per year (`--yearly`)
- Per month (`--monthly`)
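Per-year repetition amounts to grouping on the queued timestamp's year; a minimal sketch (illustrative, not the pipeline's own code):

```python
import pandas as pd

# Toy job table: submission dates and core-hours
queued = pd.to_datetime(pd.Series(["2022-09-01", "2023-03-15", "2023-07-01"]))
jobs = pd.DataFrame({"queued": queued, "core_hours": [10.0, 20.0, 30.0]})

# Aggregate per calendar year (per month would use .dt.to_period("M"))
per_year = jobs.groupby(jobs["queued"].dt.year)["core_hours"].sum()
```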
```
HPC-workload-analysis/
├── README.md                          # This file
├── LICENSE
├── data/                              # Input data (one sub-folder per machine)
│   ├── Polaris/
│   │   ├── *.csv.gz                   # Raw job log files (from scheduler)
│   │   ├── jobs_preprocessed.csv.gz   # Merged preprocessed output
│   │   ├── job_dimension.csv          # Node-count bin definitions
│   │   ├── walltime.csv               # Walltime bin definitions
│   │   ├── max_nodes.csv              # Machine capacity (single value)
│   │   └── queue_names.csv            # Valid queue definitions
│   └── Aurora/
│       └── ...                        # Same structure
├── src/                               # Source code (main pipeline)
│   ├── main.py                        # CLI entry point & top-level driver
│   ├── orchestrator.py                # Central coordinator between data & plots
│   ├── preprocessor.py                # Raw → preprocessed CSV converter
│   ├── plotting.py                    # All visualisation functions
│   ├── report.py                      # Text report generator
│   ├── single_queue_analysis.py       # Per-queue core-hour breakdown
│   ├── system_utilization.py          # Node-utilization computation & plots
│   └── utils.py                       # Shared helpers, CLI parser, config loader
└── analysis_output/                   # Generated output (one sub-folder per machine)
    ├── Polaris/
    │   ├── *.png                      # Full-range plots
    │   ├── 2024_analysis/             # Per-year plots
    │   │   ├── *.png
    │   │   └── 6_analysis/            # Per-month plots (if --monthly)
    │   └── queue_analysis/            # Queue breakdown (if --queue-analysis)
    └── Aurora/
        └── ...
```
| What | URL |
|---|---|
| Job log CSVs | https://reports.alcf.anl.gov/data/ |
| Polaris queue definitions | https://docs.alcf.anl.gov/polaris/running-jobs/ |
| Aurora queue definitions | https://docs.alcf.anl.gov/aurora/running-jobs-aurora/ |
- Download the raw job log files from the ALCF Reports portal.
- Use the Polaris / Aurora documentation pages to populate `queue_names.csv` (queue names, node limits, walltime limits) and `job_dimension.csv` (node-count bins matching the queue boundaries).
- The walltime bins in `walltime.csv` were derived from the queue walltime limits listed in the documentation; set them to whatever boundaries make sense for your analysis.
Each machine folder under `data/` must contain one or more raw CSV files exported from the job scheduler (plain or gzip-compressed).
The following columns are required (case-insensitive):
| Column | Type | Description |
|---|---|---|
| `JOB_NAME` | string | Human-readable job identifier |
| `USERNAME_GENID` | string | Anonymised user identifier |
| `PROJECT_NAME_GENID` | string | Anonymised project identifier |
| `QUEUE_NAME` | string | Queue the job was submitted to |
| `QUEUED_TIMESTAMP` | datetime | When the job was submitted |
| `START_TIMESTAMP` | datetime | When the job started executing |
| `END_TIMESTAMP` | datetime | When the job finished |
| `WALLTIME_SECONDS` | float | Requested walltime (seconds) |
| `RUNTIME_SECONDS` | float | Actual runtime (seconds) |
| `NODES_REQUESTED` | int | Number of nodes the user asked for |
| `NODES_USED` | int | Number of nodes actually allocated |
| `USED_CORE_HOURS` | float | Core-hours consumed by the job |
| `EXIT_CODE` | int | Job exit code (0 = success) |
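A case-insensitive check against this schema might look like the following sketch (illustrative; the real validation lives in `preprocessor.py`):

```python
import pandas as pd

# Required columns from the input schema, lowercased for comparison
REQUIRED = {
    "job_name", "username_genid", "project_name_genid", "queue_name",
    "queued_timestamp", "start_timestamp", "end_timestamp",
    "walltime_seconds", "runtime_seconds", "nodes_requested",
    "nodes_used", "used_core_hours", "exit_code",
}

def missing_columns(df: pd.DataFrame) -> set[str]:
    """Return required columns absent from df, ignoring case."""
    present = {c.strip().lower() for c in df.columns}
    return REQUIRED - present

df = pd.DataFrame(columns=["JOB_NAME", "Queue_Name"])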
Four small CSV files must sit alongside the job data in the same machine folder:
`job_dimension.csv` defines how jobs are grouped by size (number of nodes).
```
name,min,max
tiny,1,10
small,11,24
medium,25,99
large,100,496
```

| Column | Description |
|---|---|
| `name` | Human-readable label for the bin |
| `min` | Minimum node count (inclusive) |
| `max` | Maximum node count (inclusive); use `infinity` for unbounded |
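Binning jobs by node count against these definitions can be done with `pd.cut`; a sketch using the example bins above (not the pipeline's own code; edge handling follows the documented inclusive `min`/`max` convention):

```python
import pandas as pd

# Bin definitions as they appear in job_dimension.csv
bins = pd.DataFrame({
    "name": ["tiny", "small", "medium", "large"],
    "min": [1, 11, 25, 100],
    "max": [10, 24, 99, 496],
})

# pd.cut wants edges rather than (min, max) pairs; with the default
# right=True, each interval (lo, hi] makes the bin's max inclusive.
edges = [bins["min"].iloc[0] - 1] + list(bins["max"])

nodes = pd.Series([1, 12, 50, 496])
labels = pd.cut(nodes, bins=edges, labels=bins["name"])
```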
`walltime.csv` defines how jobs are grouped by requested walltime (in seconds).
```
name,min,max
shortest,1,60
short,61,600
medium-short,601,1800
medium,1801,3600
medium-long,3601,7200
long,7201,21600
very-long,21601,43200
super-extreme,43201,64800
ultra,64801,86400
mega,86401,259200
extra-infinity,259201,infinity
```

| Column | Description |
|---|---|
| `name` | Label |
| `min` | Minimum walltime in seconds (inclusive) |
| `max` | Maximum walltime in seconds (inclusive); use `infinity` for unbounded |
`max_nodes.csv` is a single-value file declaring the total number of compute nodes.

```
max_nodes
560
```

`queue_names.csv` lists the queues to include in the queue-analysis breakdown.
Only used when `--queue-analysis` is enabled.
Populate this file from the official ALCF documentation:
- Polaris: https://docs.alcf.anl.gov/polaris/running-jobs/
- Aurora: https://docs.alcf.anl.gov/aurora/running-jobs-aurora/
```
queue_name,min_nodes,max_nodes,min_walltime,max_walltime
debug,1,2,00:05:00,01:00:00
small,10,24,00:05:00,03:00:00
medium,25,99,00:05:00,06:00:00
large,100,496,00:05:00,24:00:00
```

| Column | Description |
|---|---|
| `queue_name` | Queue identifier (must match values in the job log) |
| `min_nodes` | Minimum node allocation for this queue |
| `max_nodes` | Maximum node allocation for this queue |
| `min_walltime` | Minimum walltime (HH:MM:SS) |
| `max_walltime` | Maximum walltime (HH:MM:SS) |
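The `HH:MM:SS` limits need converting to seconds before they can be compared against `WALLTIME_SECONDS`; a small helper sketch (hypothetical, not part of the pipeline):

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS walltime limit (as in queue_names.csv) to seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s
```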
```
data/Polaris/
├── ANL-ALCF-DJC-POLARIS_20220809_20221231.csv.gz   # Raw data (year 1)
├── ANL-ALCF-DJC-POLARIS_20230101_20231231.csv.gz   # Raw data (year 2)
├── jobs_preprocessed.csv.gz                        # Output of preprocessor (merged)
├── job_dimension.csv                               # Node bins
├── walltime.csv                                    # Walltime bins
├── max_nodes.csv                                   # Machine capacity
└── queue_names.csv                                 # Queue definitions
```
Run before analysis:

```shell
cd src/

# Single file
python preprocessor.py --path ../data/Polaris/file.csv.gz --single

# All files in directory
python preprocessor.py --path ../data/Polaris/ --all
```

The preprocessor removes rows with:

- `runtime ≥ 1.5 × walltime`
- Negative runtimes or walltimes
- Negative core-hours
- Duplicate `job_name`
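The filter rules above can be sketched as a boolean mask (an illustration of the documented rules, not the exact `preprocessor.py` implementation):

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows that pass all documented validity checks, then dedupe."""
    keep = (
        (df["runtime_seconds"] >= 0)
        & (df["walltime_seconds"] >= 0)
        & (df["used_core_hours"] >= 0)
        & (df["runtime_seconds"] < 1.5 * df["walltime_seconds"])
    )
    return df[keep].drop_duplicates(subset="job_name")

df = pd.DataFrame({
    "job_name": ["a", "a", "b", "c"],
    "runtime_seconds": [100, 100, 200, -5],     # b exceeds 1.5x, c is negative
    "walltime_seconds": [120, 120, 100, 120],
    "used_core_hours": [1.0, 1.0, 2.0, 3.0],
})
cleaned = clean(df)
```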
```shell
python main.py \
  --path ../data/Polaris/jobs_preprocessed.csv.gz \
  --machine-name Polaris
```

With all optional analyses enabled:

```shell
python main.py \
  --path ../data/Polaris/jobs_preprocessed.csv.gz \
  --machine-name Polaris \
  --queue-analysis \
  --machine-utilization \
  --yearly \
  --monthly \
  --full-queue-analysis
```

Output is written to `analysis_output/<machine-name>/`.
- Hourly distribution
- Weekday distribution
- Day-of-year seasonality
- Job-size distribution
- Walltime distribution
- Job-size × Walltime heatmap
- Timeline scatter plot
- Machine utilization time-series
- Queue stacked bar / area / pie charts
- Detailed per-queue statistical summary
- Successful-jobs-only summary
- Job count
- Core-hours
- Share %
- Efficiency
- Mean / median / P5 / P95 / min / max
- Unique users & projects
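The per-queue summary statistics can be produced with a single grouped aggregation; a minimal sketch using the input schema's column names (not the report generator's actual code):

```python
import pandas as pd

# Toy job table with two queues
jobs = pd.DataFrame({
    "queue_name": ["debug"] * 4 + ["large"] * 4,
    "nodes_used": [1, 1, 2, 2, 100, 200, 300, 400],
})

# Named aggregation: one column per statistic (P5/P95 via quantiles)
summary = jobs.groupby("queue_name")["nodes_used"].agg(
    mean="mean",
    median="median",
    p5=lambda s: s.quantile(0.05),
    p95=lambda s: s.quantile(0.95),
)
```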
| Module | Purpose |
|---|---|
| `main.py` | CLI entry point |
| `orchestrator.py` | Coordinates computations |
| `preprocessor.py` | Cleans raw CSV logs |
| `plotting.py` | Generates visualisations |
| `report.py` | Builds text summaries |
| `single_queue_analysis.py` | Queue core-hour breakdown |
| `system_utilization.py` | Node utilization computation |
| `utils.py` | Shared helpers |
Tested on Python 3.10+.

Install:

```shell
pip install -r requirements.txt
```

Dependencies: `pandas`, `numpy`, `matplotlib`, `seaborn`, `statsmodels`, `colour-science`
Developed as part of PhD research on HPC workload characterisation and resource utilization modelling.