[Ecosystem] Kubetorch #48

@py-rh

Description

Contact emails

paul@run.house, donny@run.house

Project summary

A Fast, Pythonic Interface for Running ML Workloads on Kubernetes

Project description

Kubernetes has emerged as the default compute foundation for ML (outside of pre-training on HPC-style Slurm clusters). However, the development experience is terrible. A minor code change to a distributed training workload can mean waiting 10-30 minutes before the first training batch runs. And the smallest CUDA OOM or node/pod preemption causes the entire workload to fall over, with no recourse for faults.

Kubetorch's first benefit is extremely fast iteration on distributed Torch training: all compute, artifacts, and state are retained between runs, so an iteration on a distributed training job takes <2 seconds. Second, it provides Pythonic APIs that are friendly and familiar to practitioners, with no need to delve into YAML or Dockerfiles too early. At the same time, it builds on purely Kubernetes-native primitives, so it is easy to manage as part of the regular software platform and plugs into the rest of the cloud-native ecosystem. Finally, it offers strong programmatic fault tolerance: the driver/controller Python process making the call lives outside of the process group doing the training, so if you hit a CUDA OOM or lose a pod to preemption, control is simply returned to the driver, which can decide how to continue (e.g., reduce the batch size, or reform the process group with a smaller world size, respectively).
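To illustrate the fault-tolerance pattern described above, here is a minimal, self-contained sketch of the driver-side control loop (no kubetorch required). The `train_ddp` function below is a hypothetical stand-in for a remote training call that raises when the batch does not fit in GPU memory; the real API may differ.

# Illustrative sketch of driver-side fault handling.
# `train_ddp` is a hypothetical stand-in for a remote training call.

def train_ddp(epochs, batch_size):
    if batch_size > 32:  # simulate a CUDA OOM above some memory limit
        raise RuntimeError("CUDA out of memory")
    return {"epochs": epochs, "batch_size": batch_size, "status": "ok"}

def train_with_retries(epochs, batch_size, min_batch_size=1):
    # The driver lives outside the training process group, so a failure
    # returns control here and we can decide how to continue.
    while batch_size >= min_batch_size:
        try:
            return train_ddp(epochs, batch_size)
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise
            batch_size //= 2  # reduce batch size and retry
    raise RuntimeError("Batch size fell below minimum without success")

result = train_with_retries(epochs=10, batch_size=128)

The same shape applies to pod preemption: the exception surfaces in the driver, which can reform the process group with a smaller world size instead of shrinking the batch.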

Kubetorch is partially named as an analogy to how Torch makes GPUs usable in Python with the .to() API. In the same way, Kubetorch uses .to() to take regular Python classes (a Trainer) and functions (your training entrypoint) and launch them on Kubernetes, returning a callable that can be invoked locally (e.g., from within CI or production orchestration).

A brief code snippet showing the APIs to launch a distributed training

import torch
import kubetorch as kt
from torch.nn.parallel import DistributedDataParallel as DDP

def train(epochs, batch_size=32):
    torch.distributed.init_process_group(backend="nccl")
    ...
    model = DDP(model, device_ids=[device_id])
    for epoch in range(epochs):
        ...

if __name__ == "__main__":
    # Request 4 workers with 8 GPUs each, running an NVIDIA PyTorch image
    gpus = kt.Compute(
        gpus=8,
        image=kt.Image(image_id="nvcr.io/nvidia/pytorch:23.10-py3"),
        launch_timeout=600,
    ).distribute("pytorch", workers=4)

    # Dispatch the local `train` function onto the cluster; the returned
    # callable runs remotely when invoked
    train_ddp = kt.fn(train).to(gpus)

    results = train_ddp(epochs=10, batch_size=32)

Are there any other projects in the PyTorch Ecosystem similar to yours? If yes, what are they?

The closest analogue is Ray Core; you could think of Kubetorch as Kubernetes-native actor deployments, where the controller is the driver process doing the launching and making calls, and the actors are one or more pods.

Project repo URL

https://github.com/run-house/kubetorch

Additional repos in scope of the application

https://github.com/run-house/kubetorch-examples

Project license

Apache 2.0

GitHub handles of the project maintainer(s)

@CarolineChen @jlewitt1 @dongreenberg @BelSasha @mkandler @py-rh

Is there a corporate or academic entity backing this project? If so, please provide the name and URL of the entity.

Runhouse (run.house)

Website URL

run.house

Documentation

https://www.run.house/kubetorch/introduction
https://www.youtube.com/@runhouse_

How do you build and test the project today (continuous integration)? Please describe.

Tests run in GitHub:
https://github.com/run-house/kubetorch/tree/main/python_client/tests

Version of PyTorch

PyTorch 1.0+ all work; the library is not opinionated about the PyTorch version.

Components of PyTorch

We don't use PyTorch as a dependency, but our primary goal is to launch and run PyTorch-based workloads.

How long do you expect to maintain the project?

Forever

Additional information

No response
