The gpudash command displays a text-based GPU utilization dashboard (no graphics) for the last hour. The dashboard can also be generated for a specific user.
The gpudash command is part of the Jobstats platform. Here is the help menu:
usage: gpudash [-h] [--me] [-u NETID] [-n]

GPU utilization dashboard for the last hour

optional arguments:
  -h, --help       show this help message and exit
  --me             create dashboard for current user
  -u NETID         create dashboard for a single user
  -n, --no-legend  flag to hide the legend

Utilization is the percentage of time during a sampling window (< 1 second) that
a kernel was running on the GPU. The format of each entry in the dashboard is
username:utilization (e.g., aturing:90). Utilization varies between 0 and 100%.

Examples:

  Show dashboard for all users:
    $ gpudash

  Show dashboard for the current user:
    $ gpudash --me

  Show dashboard for the user aturing:
    $ gpudash -u aturing

  Show dashboard for all users without displaying legend:
    $ gpudash -n
The gpudash command builds on the Jobstats platform. Running the software requires Python 3.6+ and version 1.17+ of the Python blessed package.
The query_prometheus.sh script below makes three queries to Prometheus and removes old files. It then calls the extract.py Python script, which extracts the data and writes the column files that gpudash reads.
$ cat query_prometheus.sh
#!/bin/bash
printf -v SECS '%(%s)T' -1
DATA='/path/to/gpudash/data'
PROM_QUERY='http://vigilant2.sn17:8480/api/v1/query?'
curl -s ${PROM_QUERY}query=nvidia_gpu_duty_cycle > ${DATA}/util.${SECS}
curl -s ${PROM_QUERY}query=nvidia_gpu_jobUid > ${DATA}/uid.${SECS}
curl -s ${PROM_QUERY}query=nvidia_gpu_jobId > ${DATA}/jobid.${SECS}
# remove any files that are more than 70 minutes old
find ${DATA} -type f -mmin +70 -exec rm -f {} \;
# extract the data from the Prometheus queries, convert UIDs to usernames, write column files
python3 /path/to/extract.py
Be sure to customize nodelist in extract.py for the given system. The above Bash script will generate column files with the format:
$ head -n 5 column.1
{"timestamp": "1678144802", "host": "comp-g1", "index": "0", "user": "ft130", "util": "92", "jobid": "46034275"}
{"timestamp": "1678144802", "host": "comp-g1", "index": "1", "user": "ft130", "util": "99", "jobid": "46015684"}
{"timestamp": "1678144802", "host": "comp-g1", "index": "2", "user": "ft130", "util": "99", "jobid": "46015684"}
{"timestamp": "1678144802", "host": "comp-g1", "index": "3", "user": "kt415", "util": "44", "jobid": "46048505"}
{"timestamp": "1678144802", "host": "comp-g2", "index": "0", "user": "kt415", "util": "82", "jobid": "46015407"}
The column files are read by gpudash to generate the dashboard.
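The extract.py script itself is not shown here. The following is only a rough sketch of the extraction step, assuming the standard Prometheus instant-query JSON response; the label names (instance, minor_number), file paths, and timestamps are assumptions that may differ on a given system:

import json
import time

# hypothetical sketch of the extraction step (not the actual extract.py);
# the label names below depend on the GPU exporter and are assumed
def parse_query(path):
    # map (host, GPU index) -> metric value from a saved Prometheus query
    with open(path) as f:
        reply = json.load(f)
    out = {}
    for item in reply["data"]["result"]:
        labels = item["metric"]
        host = labels["instance"].split(":")[0]  # strip the port from host:port
        index = labels["minor_number"]           # GPU index label (assumed name)
        out[(host, index)] = item["value"][1]    # value is [timestamp, "string"]
    return out

# file names carry the epoch timestamp appended by query_prometheus.sh
util  = parse_query("/path/to/gpudash/data/util.1678144802")
uids  = parse_query("/path/to/gpudash/data/uid.1678144802")
jobid = parse_query("/path/to/gpudash/data/jobid.1678144802")

# join the three queries and write one JSON record per GPU
stamp = str(int(time.time()))
with open("column.1", "w") as f:
    for (host, index), value in util.items():
        f.write(json.dumps({"timestamp": stamp, "host": host, "index": index,
                            "user": uids.get((host, index), ""),  # raw UID here; see the uid2user lookup below
                            "util": value,
                            "jobid": jobid.get((host, index), "")}) + "\n")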
The extract.py script converts UIDs to usernames using the file uid2user.csv. Here is a sample of the file:
$ head -n 5 uid2user.csv
153441,ft130
150919,lc455
224256,sh235
329819,bb274
347117,kt415
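Within extract.py, converting a UID to a username can then be a simple dictionary lookup; here is a minimal sketch (the file path is a placeholder):

import csv

# minimal sketch: load the UID-to-username mapping into a dictionary
uid2user = {}
with open("/path/to/uid2user.csv") as f:
    for uid, user in csv.reader(f):
        uid2user[uid] = user

print(uid2user["153441"])  # prints ft130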
The uid2user.csv file can be generated by running the following command:
$ getent passwd | awk -F":" '{print $3","$1}' > /path/to/uid2user.csv
Both commands are designed to be run periodically via cron. Here are the two crontab entries:
0,10,20,30,40,50 * * * * /path/to/query_prometheus.sh > /dev/null 2>&1
0 6 * * 1 getent passwd | awk -F":" '{print $3","$1}' > /path/to/uid2user.csv 2> /dev/null
The first entry above calls the script that queries the Prometheus server every 10 minutes. The second entry recreates the CSV file of UIDs and usernames every Monday at 6 AM.
gpudash is written in pure Python. Its only dependency is the blessed Python package. On Ubuntu Linux, this can be installed with:
$ apt-get install python3-blessed
Then put gpudash in a location like /usr/local/bin:
$ cd /usr/local/bin
$ wget https://raw.githubusercontent.com/PrincetonUniversity/gpudash/main/gpudash
$ chmod 755 gpudash
Next, edit gpudash by replacing cluster1 with the beginning of the login node name. Modify all_nodes to generate a list of the compute node names. Lastly, set SBASE to the path containing the column files produced by extract.py, and make sure that the shebang line at the very top points to python3.
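For illustration only, the edited lines might look something like the following; all_nodes and SBASE are names from the text, while the values and the cluster variable are placeholders for a hypothetical system:

#!/usr/bin/env python3
# illustrative placeholder values; the shipped script contains the string
# "cluster1", which should be replaced with the start of your login node name
cluster = "mycluster"
all_nodes = ["comp-g%d" % i for i in range(1, 9)]  # generate the compute node names
SBASE = "/path/to/gpudash/data"                    # path containing the column files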
With these steps in place, you can use the gpudash command:
$ gpudash
The choice was made to enter the node names directly in the script (i.e., all_nodes) as opposed to reading the Prometheus configuration file or using the output of the sinfo command. The code looks for data on each of the specified nodes and only updates the values for a given node when data is found. Calling sinfo has the disadvantage that no node names are available at all if the command fails, and one would also have to specify partitions. Reading the Prometheus server configuration file(s) is reasonable, but changes would be required if Prometheus were swapped for an alternative.
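A short sketch of that update rule, assuming records are keyed by hostname as in the column files above:

import json

all_nodes = ["comp-g1", "comp-g2"]  # hypothetical node list
latest = {node: None for node in all_nodes}

# a node's value is only updated when data for it is actually found;
# hosts that are not in all_nodes are ignored
with open("column.1") as f:
    for line in f:
        record = json.loads(line)
        if record["host"] in latest:
            latest[record["host"]] = record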
The two most common problems are (1) setting the correct paths throughout the procedure and (2) installing the Python blessed package.
If you encounter problems or have questions, please post an issue to this repo. Extensions to the code are welcome via pull requests.

