kube-transform is a lightweight open-source framework for writing and deploying distributed, batch-oriented data pipelines on Kubernetes.
It is intentionally thin — only ~1,000 lines of code — and designed to be an easy-to-understand layer between your code and Kubernetes. Unlike heavyweight orchestration platforms like Apache Airflow (which has hundreds of thousands of lines), kube-transform aims to be simple to reason about, debug, and extend.
It focuses on simplicity and flexibility:
- Minimal required configuration
- No persistent control plane components
- Vendor-agnostic: compatible with any Kubernetes cluster, image, and file store that meet a few basic requirements
If you're looking for a quick way to get started, check out kube-transform-starter-kit for reusable setup resources like Dockerfiles, Terraform, and RBAC templates. But using the starter kit is entirely optional.
Your setup must meet a few basic requirements, described below.
To run a pipeline, you must provide:
- `pipeline_spec`: A Python dictionary that conforms to the `KTPipeline` schema (see the sketch after this list). You can optionally write your spec using the `KTPipeline` Pydantic model directly, and then call `.model_dump()` to convert it to a dict.
- `image_path`: A string path to your Docker image
- `data_dir`: A string path to your file store (must be valid from the perspective of pods running in the cluster)
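For illustration, here is a minimal sketch of what a spec might look like. The field names (`name`, `jobs`, `function`, `args`, `dependencies`) are assumptions for the example, not the authoritative schema; consult the `KTPipeline` Pydantic model for the real structure.

```python
# Hypothetical spec -- the field names are illustrative only; the
# authoritative structure is defined by the KTPipeline Pydantic model.
pipeline_spec = {
    "name": "example-pipeline",
    "jobs": [
        {
            "name": "extract",
            "function": "extract_records",  # must be importable from /app/kt_functions/
            "args": {"source": "raw/", "dest": "extracted/"},
        },
        {
            "name": "transform",
            "function": "transform_records",
            "args": {"source": "extracted/", "dest": "transformed/"},
            "dependencies": ["extract"],  # submitted only after "extract" succeeds
        },
    ],
}
```

If you build the spec with the Pydantic model instead, call `.model_dump()` on it before passing it to `run_pipeline()`.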
Your Docker image:

- Must include Python 3.11+
- Must have `kube-transform` installed (e.g. via pip)
- Must include your code in `/app/kt_functions/`, which should be an importable module containing the functions referenced in your pipeline (example below)
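For example, `/app/kt_functions/` could be as small as a single `__init__.py` that exposes the functions your spec names. The function below is purely illustrative; its name, parameters, and use of full `fsspec` paths are assumptions, and your own functions just need to match what your pipeline spec references.

```python
# /app/kt_functions/__init__.py  (illustrative layout)
import fsspec


def transform_records(source: str, dest: str) -> None:
    """Hypothetical pipeline function: upper-case every line of a text file."""
    with fsspec.open(source, "r") as infile:
        lines = [line.upper() for line in infile]
    with fsspec.open(dest, "w") as outfile:
        outfile.writelines(lines)
```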
Your file store (`data_dir`):

- This is the directory that all pipeline jobs and the controller will read from and write to. It will be passed as the `data_dir` argument to `run_pipeline()`.
- Can be a local folder (e.g. `/mnt/data`) or a cloud object store (e.g. `s3://some-bucket/`)
- Must be readable and writable by all pods in your cluster
- Compatible with anything `fsspec` supports
The KT controller internally uses `fsspec.open()` to write pipeline metadata to `DATA_DIR`. You must ensure that `DATA_DIR` is transparently accessible to all pods (including the controller). This typically means one of the following:

- You're using a mounted volume that is accessible at `/mnt/data`
- You've configured access via IRSA (for S3) or Workload Identity (for GCS), so the `kt-pod` service account has permissions to access your object store

In a single-node cluster, simply mounting a local folder to `/mnt/data` will work. All KT pods will have access to `/mnt/data`.
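A quick way to check that a candidate `data_dir` is usable is to round-trip a small file through `fsspec` yourself, since that is the same interface the controller relies on. A minimal sketch (the paths are placeholders; object-store URLs additionally require the matching `fsspec` backend, e.g. `s3fs`, plus credentials):

```python
import fsspec

data_dir = "/mnt/data"  # or e.g. "s3://some-bucket"

# Write and read back a marker file the same way the controller would.
with fsspec.open(f"{data_dir}/kt-access-check.txt", "w") as f:
    f.write("ok")

with fsspec.open(f"{data_dir}/kt-access-check.txt", "r") as f:
    assert f.read() == "ok"
```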
Your Kubernetes cluster:

- Must be able to pull your Docker image (e.g. via ECR, DockerHub, etc.)
- Must be able to access your file store
- Must include a service account named `kt-pod` in the default namespace, with permission to create Kubernetes Jobs (see the sketch below)
- Your deployment machine must be able to connect to the cluster (e.g. via `kubectl`)
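If you are not using the starter kit's RBAC templates, the service account can be created by hand. A rough sketch with plain `kubectl` (the role name and verb list here are assumptions; grant whatever your cluster policy requires):

```bash
kubectl create serviceaccount kt-pod --namespace default

# Role and binding names are illustrative.
kubectl create role kt-job-manager --namespace default \
  --verb=create,get,list,watch,delete --resource=jobs

kubectl create rolebinding kt-job-manager-binding --namespace default \
  --role=kt-job-manager --serviceaccount=default:kt-pod
```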
For a working example setup with autoscaling, IAM roles, and RBAC configuration, see kube-transform-starter-kit.
Once your inputs are ready:
```python
from kube_transform import run_pipeline

run_pipeline(pipeline_spec, image_path, data_dir)
```

This will:
- Launch a temporary `kt-controller` Job in your Kubernetes cluster
- Submit all pipeline jobs in the correct dependency order
- Shut down automatically when the pipeline completes (or fails)
You can view progress using:
```bash
kubectl get pods
```

Make sure you have the same version of kube-transform running locally (for the `run_pipeline` function) as you have in your image.
kube-transform is compatible with both fixed-size and autoscaling clusters:
- If your cluster supports autoscaling, KT will take advantage of it automatically.
- If your cluster is fixed-size, jobs will remain in `Pending` state until resources are available (see the check below).
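If jobs seem stuck on a fixed-size cluster, standard `kubectl` inspection shows what they are waiting for (the pod name is a placeholder):

```bash
# Pods still waiting to be scheduled
kubectl get pods --field-selector=status.phase=Pending

# The Events section explains the delay (e.g. "Insufficient cpu")
kubectl describe pod <pod-name>
```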
For help setting up either configuration, see kube-transform-starter-kit.
Questions? Feature requests? Open an issue on GitHub.