Magireer/distributed-training-utils

Distributed Training Utils

Simplify the deployment of distributed training jobs using PyTorch Elastic and Horovod on Kubernetes clusters.

Features

  • Helm charts for training operators
  • Automated data sharding scripts
  • Monitoring dashboards for GPU utilization
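The repository's sharding scripts are not shown on this page, so the following is only a minimal sketch of one common way to shard data across workers: each worker takes a strided slice of a sorted file list. The function name `shard_for_rank` and the file names are hypothetical. The sketch assumes the launcher exports `RANK` and `WORLD_SIZE`, as `torchrun` does for PyTorch Elastic jobs; under Horovod the same values would come from `hvd.rank()` and `hvd.size()` instead.

```python
import os
from typing import List, Sequence


def shard_for_rank(items: Sequence[str], rank: int, world_size: int) -> List[str]:
    """Return the strided slice of `items` owned by worker `rank`.

    Hypothetical helper for illustration; not taken from this repository.
    """
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    # Sort first so every worker derives its shard from the same ordering,
    # regardless of how the file listing was produced.
    return sorted(items)[rank::world_size]


if __name__ == "__main__":
    # torchrun sets RANK and WORLD_SIZE; fall back to a single-worker setup.
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))

    # Hypothetical input files standing in for a real dataset listing.
    files = [f"data/part-{i:04d}.bin" for i in range(10)]
    print(f"worker {rank}/{world_size}: {shard_for_rank(files, rank, world_size)}")
```

Strided slicing keeps shard sizes within one item of each other and needs no coordination between workers. It does not reshuffle between epochs or handle a changing worker count mid-run, which elastic jobs may need to address separately.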

About

Utility scripts and configurations for distributed deep learning training on Kubernetes.
