[RFC] GPU support in CLI #1200

The goal of this RFC is to discuss how we can expose the GPU support that was recently added to containerd by @crosbymichael in this PR: https://github.com/containerd/containerd/pull/2330

Since there are multiple ways of implementing this, I think it makes sense to start with the user experience, i.e. the CLI. If we can agree on the format, my next step will be a PR for the API changes in the `github.com/docker/docker/api/types` package. The rationale is that if we decide not to allow some configuration options in the CLI, it would be useless for `api/types` to expose them. As a concrete example, if we don't allow passing options on a per-GPU basis, that shouldn't be part of the API.

When I discussed this with @crosbymichael, he advised limiting support to GPUs. Other types of specialized hardware, such as FPGAs, InfiniBand or ASICs, are excluded from this RFC.

This RFC introduces some vocabulary, then presents the approach I find most compelling.

## GPU Vendors
* While the containerd PR is only for NVIDIA GPUs, the CLI should be able to support multiple vendors.
* For simplicity, we should require the GPU vendor to be a lowercase string with no special characters, e.g. `nvidia`, `intel` or `amd`.
* Reverse-DNS notation (e.g. `com.nvidia`) is too verbose and should probably be avoided for a CLI.
* The list of supported GPU vendors **could** be hardcoded in the CLI and checked, since it's unlikely to change quickly, but it is not a requirement.
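
As a rough illustration of the vendor rules above, here is a minimal Python sketch (the CLI itself is written in Go, so this is only a model of the check). The exact character rule, the function name `validate_vendor` and the `KNOWN_VENDORS` list are assumptions, not part of any existing API.

```python
import re

# Assumption: "no special characters" means ASCII lowercase letters and digits,
# starting with a letter. The RFC does not spell out the exact rule.
VENDOR_RE = re.compile(r"[a-z][a-z0-9]*")

# Hypothetical hardcoded list; the RFC says checking against one is optional.
KNOWN_VENDORS = frozenset({"nvidia", "intel", "amd"})


def validate_vendor(name, check_known=False):
    """Return name if it is a well-formed vendor string, else raise ValueError."""
    if not VENDOR_RE.fullmatch(name):
        raise ValueError(f"invalid GPU vendor name: {name!r}")
    if check_known and name not in KNOWN_VENDORS:
        raise ValueError(f"unknown GPU vendor: {name!r}")
    return name
```

With `check_known` left off, a new vendor name is accepted as long as it is well formed, which matches the "not a requirement" wording above. Reverse-DNS names such as `com.nvidia` are rejected by the format rule alone.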

## GPU identifiers
* GPU support should not be a boolean flag like `--with-gpus`. Users should be able to select a subset of all GPUs, similarly to `--cpuset-cpus`: ```nvidia=0,1,2,3```
* We should not assume a numerical identifier for GPUs (unlike `--cpuset-cpus`). GPUs can have a string representation, for instance a UUID: ```nvidia=GPU-fef8089b-4820-abfc-e83e-94318197576e```
* The CLI can only verify the basic format of GPU identifiers. It can't detect whether they are ultimately valid (e.g. whether the GPU exists); that check will be done later and will be vendor-specific.
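
To make the "basic format only" point concrete, here is a small Python sketch of how a `vendor=id,id,...` value could be split. The function name `parse_gpus` and the specific format checks (non-empty identifiers, no whitespace) are assumptions made for illustration. Identifiers stay strings so that indices and UUIDs are treated the same way.

```python
def parse_gpus(value):
    """Split 'vendor=id1,id2,...' into (vendor, [ids]) with format checks only.

    Whether an identifier refers to a GPU that exists is deliberately not
    checked here; per the RFC that is vendor-specific and happens later.
    """
    vendor, sep, ids = value.partition("=")
    if not sep or not vendor:
        raise ValueError(f"expected VENDOR=ID[,ID...], got {value!r}")
    identifiers = ids.split(",")
    for ident in identifiers:
        # Hypothetical format rule: non-empty and free of whitespace.
        if not ident or any(ch.isspace() for ch in ident):
            raise ValueError(f"malformed GPU identifier in {value!r}")
    return vendor, identifiers
```

For example, `parse_gpus("nvidia=0,1,2,3")` yields the vendor `nvidia` and the identifiers `"0"` through `"3"` as strings, while `parse_gpus("nvidia=")` is rejected.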

## GPU options
* In addition to selecting GPUs, the user should be able to add options to modify the runtime behavior. For the containerd PR it would mean having a way to modify this struct from the CLI: https://github.com/containerd/containerd/blob/master/contrib/nvidia/nvidia.go#L88-L96
* The baseline would be to have options that apply to all GPUs or to the whole system: ```nvidia.capabilities = compute```.
* An extension would be to have options per-GPU, for instance to enforce that GPU 0 can only use 2 GB of memory, whereas GPU 1 can use 6 GB: ```limit_mem[0] = 2GB limit_mem[1] = 6GB```.
* As with identifiers, the CLI can't actually validate the options passed by the user, since validation is vendor-specific.
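
A possible shape for parsing one global option is sketched below in Python. It assumes the compact `vendor.option=value` spelling with no spaces around `=`; the function name `parse_gpu_opt` is hypothetical. The option name and value are passed through untouched, since only the vendor-specific layer can say whether they mean anything.

```python
def parse_gpu_opt(value):
    """Split 'vendor.option=value' into (vendor, option, value).

    Only the overall shape is checked. The option name and its value are
    not interpreted, because their validity is vendor-specific.
    """
    key, sep, val = value.partition("=")
    vendor, dot, name = key.partition(".")
    if not sep or not dot or not vendor or not name:
        raise ValueError(f"expected VENDOR.OPTION=VALUE, got {value!r}")
    return vendor, name, val
```

Splitting on the first `=` only means that a value which itself contains `=` is kept intact.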

## Suggested approach
Add two CLI options: 
* `--gpus` for specifying a list of GPUs (like `--cpuset-cpus`)
* `--gpus-opt` for specifying a list of GPU options (like `--storage-opt`)
```
docker run --gpus nvidia=0,1,2,3 --gpus nvidia=GPU-abcd --gpus-opt nvidia.capabilities=compute --gpus-opt nvidia.kmods=true
```
The important element is the vendor name, which serves as the key for specifying both identifiers and options.
I think this is easy to read since you don't have to pack everything in a single argument.
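
To show how the vendor name can act as the key when both flags are repeated, here is a hedged Python sketch that merges the values from the example command into one structure per vendor. The function name `group_by_vendor` and the shape of the resulting dictionary are assumptions. Input validation is left out to keep the merging logic visible.

```python
from collections import defaultdict


def group_by_vendor(gpus_flags, opt_flags):
    """Merge repeated --gpus and --gpus-opt values into one entry per vendor."""
    config = defaultdict(lambda: {"ids": [], "opts": {}})
    for flag in gpus_flags:
        # 'nvidia=0,1,2,3' -> vendor 'nvidia', ids ['0', '1', '2', '3']
        vendor, _, ids = flag.partition("=")
        config[vendor]["ids"].extend(ids.split(","))
    for flag in opt_flags:
        # 'nvidia.kmods=true' -> vendor 'nvidia', option 'kmods', value 'true'
        key, _, value = flag.partition("=")
        vendor, _, name = key.partition(".")
        config[vendor]["opts"][name] = value
    return dict(config)


merged = group_by_vendor(
    ["nvidia=0,1,2,3", "nvidia=GPU-abcd"],
    ["nvidia.capabilities=compute", "nvidia.kmods=true"],
)
```

Here `merged["nvidia"]` collects all five identifiers in the order given and both options. A second vendor passed on the same command line would simply get its own entry.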

If we want per-GPU options, we can introduce some form of indexing in `--gpus-opt`, for example with one of these two syntaxes:
```
--gpus-opt nvidia.0.limit_mem=2GB
--gpus-opt nvidia[0].limit_mem=2GB
```
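
Both candidate syntaxes can be recognized with a short pattern each, as in the Python sketch below. The patterns, the function name `parse_per_gpu_opt` and the allowed characters are assumptions. Global options such as `nvidia.capabilities=compute` are out of scope for this sketch and are rejected.

```python
import re

# 'nvidia.0.limit_mem': vendor, GPU identifier and option name separated by dots.
DOTTED = re.compile(r"(?P<vendor>[a-z0-9]+)\.(?P<gpu>[^.\[\]=]+)\.(?P<name>[^.=]+)")
# 'nvidia[0].limit_mem': GPU identifier in brackets after the vendor.
BRACKETED = re.compile(r"(?P<vendor>[a-z0-9]+)\[(?P<gpu>[^\]]+)\]\.(?P<name>[^.=]+)")


def parse_per_gpu_opt(value):
    """Return (vendor, gpu, option, value) for either per-GPU syntax."""
    key, sep, val = value.partition("=")
    if not sep:
        raise ValueError(f"missing '=' in {value!r}")
    for pattern in (BRACKETED, DOTTED):
        match = pattern.fullmatch(key)
        if match:
            return match["vendor"], match["gpu"], match["name"], val
    raise ValueError(f"not a per-GPU option: {value!r}")
```

One design consideration this exposes: the dotted form only works if option names never contain a dot, otherwise `nvidia.0.limit_mem` cannot be told apart from a global option with a dotted name. The bracketed form does not have that ambiguity, but brackets may need quoting in some shells.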

## Discarded approaches
I think the approach above is the best, but here are some thoughts on other possibilities.

### Single CSV argument
Similar to `--mount`, add a new argument `--gpus` which expects a CSV value.
```
docker run --gpus vendor=nvidia,options=capabilities=compute,options=kmods=true,0,1,2,3,GPU-abcd
```
This is explicit and everything is contained in one option, but it's also awkward: the equal signs are nested, and the list of GPU identifiers has to be expanded into multiple fields to avoid complicated shell quoting.

A follow-up question would then be: can the same vendor be specified multiple times? If yes, what if the listed options don't match?

### Similar to `--device`
Add a new argument `--gpus` which expects 2 or 3 strings separated by `:`
```
docker run --gpus nvidia:0,1,2,3,GPU-abcd:capabilities=compute,kmods=true
```
Less problematic than the approach above, but hard to read.

### NVIDIA specific
Same as the suggested approach, but without being generic:
```
docker run --nvidia-gpus 0,1,2,3 --nvidia-gpus GPU-abcd --nvidia-opt capabilities=compute --nvidia-opt kmods=true
```
It removes the need to use the vendor name as the key and is fine for now, but it is likely to be removed in the future if we add more vendors. Also, it doesn't match the approach used by the containerd PR.

