Contact emails
paul@run.house, donny@run.house
Project summary
A Fast, Pythonic Interface for Running ML Workloads on Kubernetes
Project description
Kubernetes has emerged as the default compute foundation for ML (outside of pre-training on HPC-style Slurm clusters). However, the development experience is terrible. Making a minor code change to your distributed training workload can mean a 10-30 minute wait before the first training batch runs. And a single OOM or node/pod pre-emption causes the entire workload to fall over, with no recourse for faults.
Kubetorch's first benefit is extremely fast iteration on distributed Torch training: all compute, artifacts, and state are retained between iterations, so a code change can be re-executed on the cluster in under 2 seconds. Second, it offers Pythonic APIs that are friendly and familiar to practitioners, with no need to delve into YAML or Dockerfiles too early; yet it builds on purely Kubernetes-native primitives, so it is easy to manage as part of the regular software platform and plugs into the rest of the cloud-native ecosystem. Finally, it offers strong programmatic fault tolerance: the driver/controller Python process making the call lives outside the process group doing the training, so if you hit a CUDA OOM or lose a pod to pre-emption, control simply returns to the driver, which can decide how to continue (reduce the batch size, or reform the process group with a smaller world size, respectively); a sketch of this pattern follows the code snippet below.
Kubetorch is named partly as an analogy to how Torch makes GPUs usable in Python with the .to() API: in the same way, Kubetorch uses .to() to take regular Python classes (e.g., a Trainer) and functions (your training entrypoint), launch them on Kubernetes, and return a callable that can be invoked locally (from within CI or production orchestration).
A brief code snippet showing the APIs to launch a distributed training run:
import kubetorch as kt
import torch
from torch.nn.parallel import DistributedDataParallel as DDP


def train(epochs, batch_size=32):
    # Standard PyTorch DDP training loop; runs inside each worker pod
    torch.distributed.init_process_group(backend="nccl")
    ...
    model = DDP(model, device_ids=[device_id])
    for epoch in range(epochs):
        ...


if __name__ == "__main__":
    # Define GPU compute with an NVIDIA PyTorch image, distributed over 4 PyTorch workers
    gpus = kt.Compute(
        gpus=8,
        image=kt.Image(image_id="nvcr.io/nvidia/pytorch:23.10-py3"),
        launch_timeout=600,
    ).distribute("pytorch", workers=4)
    # Send the local function to the cluster; the returned callable runs remotely
    train_ddp = kt.fn(train).to(gpus)
    results = train_ddp(epochs=args.epochs, batch_size=args.batch_size)  # args parsed earlier (elided)
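To illustrate the fault-tolerance point above, here is a minimal driver-side sketch, reusing train_ddp and args from the snippet; it assumes a worker CUDA OOM surfaces in the driver as a Python exception (the exact exception type shown is an assumption, not documented Kubetorch behavior).
# Minimal sketch of driver-side fault handling (exception behavior assumed)
batch_size = args.batch_size
while True:
    try:
        results = train_ddp(epochs=args.epochs, batch_size=batch_size)
        break
    except RuntimeError as err:
        if "CUDA out of memory" in str(err) and batch_size > 1:
            batch_size //= 2  # worker OOM returned control to the driver; retry smaller
        else:
            raise  # e.g. a pod pre-emption could instead trigger re-forming the process group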
Are there any other projects in the PyTorch Ecosystem similar to yours? If, yes, what are they?
The closest analogue is Ray Core; you can think of Kubetorch as Kubernetes-native actor deployments, where the controller is the driver process itself (launching compute and making calls) and each actor is one or more pods.
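For illustration, a hedged sketch of this actor-style usage, deploying a regular Python class as described in the project description; the kt.cls entrypoint is assumed here by analogy with kt.fn and is not confirmed API.
import kubetorch as kt


class Trainer:
    # Regular Python class; deployed onto pods like an actor
    def __init__(self, lr=1e-3):
        self.lr = lr

    def step(self, batch):
        # One training step, executed inside the remote pod(s)
        ...


compute = kt.Compute(
    gpus=1,
    image=kt.Image(image_id="nvcr.io/nvidia/pytorch:23.10-py3"),
)
# kt.cls is an assumed class counterpart to kt.fn; the local object is a proxy,
# and each method call executes remotely, much like calling a Ray actor.
trainer = kt.cls(Trainer).to(compute)
trainer.step(batch={"inputs": [1, 2, 3]})  # placeholder batch for illustration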
Project repo URL
https://github.com/run-house/kubetorch
Additional repos in scope of the application
https://github.com/run-house/kubetorch-examples
Project license
Apache 2.0
GitHub handles of the project maintainer(s)
@CarolineChen @jlewitt1 @dongreenberg @BelSasha @mkandler @py-rh
Is there a corporate or academic entity backing this project? If so, please provide the name and URL of the entity.
Runhouse (https://run.house)
Website URL
https://run.house
Documentation
https://www.run.house/kubetorch/introduction
https://www.youtube.com/@runhouse_
How do you build and test the project today (continuous integration)? Please describe.
Tests run in GitHub:
https://github.com/run-house/kubetorch/tree/main/python_client/tests
Version of PyTorch
PyTorch 1.0+ all work; the library is not opinionated about the PyTorch version.
Components of PyTorch
We don't use PyTorch as a dependency, but our primary goal is to launch and run PyTorch-based workloads.
How long do you expect to maintain the project?
Forever
Additional information
No response