Contact emails
paul@run.house, donny@run.house
Project summary
A Fast, Pythonic Interface for Running ML Workloads on Kubernetes
Project description
Kubernetes has emerged as the default compute foundation for ML (outside of pre-training on HPC-style Slurm clusters). However, the development experience is terrible. Making a minor code change to your distributed training workload can mean a 10-30 minute wait before the first training batch runs. And a single OOM or node/pod pre-emption causes the entire workload to fall over, with no recourse for faults.
Kubetorch's first benefit is extremely fast iteration on distributed Torch training: all compute, artifacts, and state are retained between iterations, so a code change can be re-executed on the cluster in under 2 seconds. Second, it offers Pythonic APIs that are friendly and familiar to practitioners, with no need to delve into YAML or Dockerfiles too early; yet it builds on purely Kubernetes-native primitives, so it is easy to manage as part of the regular software platform and plugs into the rest of the cloud-native ecosystem. Finally, it offers strong programmatic fault tolerance: the driver/controller Python process making the call lives outside the process group doing the training, so if you hit a CUDA OOM or lose a pod to pre-emption, control simply returns to the driver, which can decide how to continue (reduce the batch size, or reform the process group with a smaller world size, respectively); a sketch of this pattern follows the code snippet below.
Kubetorch is named partly as an analogy to how Torch makes GPUs usable in Python with the .to() API: in the same way, Kubetorch uses .to() to take regular Python classes (e.g., a Trainer) and functions (your training entrypoint), launch them on Kubernetes, and return a callable that can be invoked locally (from within CI or production orchestration).
A brief code snippet showing the APIs to launch a distributed training run:
import kubetorch as kt
import torch
from torch.nn.parallel import DistributedDataParallel as DDP


def train(epochs, batch_size=32):
    # Standard PyTorch DDP training loop; runs inside each worker pod
    torch.distributed.init_process_group(backend="nccl")
    ...
    model = DDP(model, device_ids=[device_id])
    for epoch in range(epochs):
        ...


if __name__ == "__main__":
    # Define GPU compute with an NVIDIA PyTorch image, distributed over 4 PyTorch workers
    gpus = kt.Compute(
        gpus=8,
        image=kt.Image(image_id="nvcr.io/nvidia/pytorch:23.10-py3"),
        launch_timeout=600,
    ).distribute("pytorch", workers=4)
    # Send the local function to the cluster; the returned callable runs remotely
    train_ddp = kt.fn(train).to(gpus)
    results = train_ddp(epochs=args.epochs, batch_size=args.batch_size)  # args parsed earlier (elided)
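To illustrate the fault-tolerance point above, here is a minimal driver-side sketch, reusing train_ddp and args from the snippet; it assumes a worker CUDA OOM surfaces in the driver as a Python exception (the exact exception type shown is an assumption, not documented Kubetorch behavior).
# Minimal sketch of driver-side fault handling (exception behavior assumed)
batch_size = args.batch_size
while True:
    try:
        results = train_ddp(epochs=args.epochs, batch_size=batch_size)
        break
    except RuntimeError as err:
        if "CUDA out of memory" in str(err) and batch_size > 1:
            batch_size //= 2  # worker OOM returned control to the driver; retry smaller
        else:
            raise  # e.g. a pod pre-emption could instead trigger re-forming the process group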
Are there any other projects in the PyTorch Ecosystem similar to yours? If, yes, what are they?
The closest analogue is Ray Core; you can think of Kubetorch as Kubernetes-native actor deployments, where the controller is the driver process itself (launching compute and making calls) and each actor is one or more pods.
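For illustration, a hedged sketch of this actor-style usage, deploying a regular Python class as described in the project description; the kt.cls entrypoint is assumed here by analogy with kt.fn and is not confirmed API.
import kubetorch as kt


class Trainer:
    # Regular Python class; deployed onto pods like an actor
    def __init__(self, lr=1e-3):
        self.lr = lr

    def step(self, batch):
        # One training step, executed inside the remote pod(s)
        ...


compute = kt.Compute(
    gpus=1,
    image=kt.Image(image_id="nvcr.io/nvidia/pytorch:23.10-py3"),
)
# kt.cls is an assumed class counterpart to kt.fn; the local object is a proxy,
# and each method call executes remotely, much like calling a Ray actor.
trainer = kt.cls(Trainer).to(compute)
trainer.step(batch={"inputs": [1, 2, 3]})  # placeholder batch for illustration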
Project repo URL
https://github.com/run-house/kubetorch
Additional repos in scope of the application
https://github.com/run-house/kubetorch-examples
Project license
Apache 2.0
GitHub handles of the project maintainer(s)
@CarolineChen @jlewitt1 @dongreenberg @BelSasha @mkandler @py-rh
Is there a corporate or academic entity backing this project? If so, please provide the name and URL of the entity.
Runhouse (https://run.house)
Website URL
https://run.house
Documentation
https://www.run.house/kubetorch/introduction
https://www.youtube.com/@runhouse_
How do you build and test the project today (continuous integration)? Please describe.
Tests run in GitHub:
https://github.com/run-house/kubetorch/tree/main/python_client/tests
Version of PyTorch
PyTorch 1.0+ all work; the library is not opinionated about the PyTorch version.
Components of PyTorch
We don't use PyTorch as a dependency, but our primary goal is to launch and run PyTorch-based workloads.
How long do you expect to maintain the project?
Forever
Additional information
No response