Merged
43 changes: 43 additions & 0 deletions INSTALL.md
@@ -0,0 +1,43 @@
# SkyRL: Installation

## Pre-requisites

> [!TIP]
> For an easy-to-use Dockerfile, see [Dockerfile.skyrl](./docker/Dockerfile.skyrl)


The main prerequisites are:
- [CUDA Toolkit 12.4](https://developer.nvidia.com/cuda-12-4-0-download-archive) (newer versions may also work)
- `build-essential`: This is needed for `torch-memory-saver`
- [`uv`](https://docs.astral.sh/uv/getting-started/installation): We use the `uv` + `ray` integration to easily manage dependencies in multi-node training.
- `python` 3.12
- `ray` 2.43.0
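As a convenience, you can check which of these tools are already on your `PATH` before proceeding. This is an informal sketch, not part of the official setup (`nvcc` ships with the CUDA Toolkit):

```bash
# Report which prerequisite binaries are visible on PATH.
for tool in nvcc python3 uv ray; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found ($(command -v "$tool"))"
  else
    echo "$tool: missing"
  fi
done
```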


Once these are installed, configure Ray to use `uv` with

```bash
export RAY_RUNTIME_ENV_HOOK=ray._private.runtime_env.uv_runtime_env_hook.hook
```
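To make the hook persist across sessions, you can append the same line to your shell profile (assuming `bash`; adapt the file for your shell):

```bash
# Append the hook to ~/.bashrc so new shells pick it up automatically.
echo 'export RAY_RUNTIME_ENV_HOOK=ray._private.runtime_env.uv_runtime_env_hook.hook' >> ~/.bashrc
```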


## Installation dry run

Execute the following command from the root project directory:

```bash
uv run --isolated --frozen python -c 'import ray; ray.init(); print("Success!")'
```

This will trigger a fresh environment build on your system.

## Common installation issues

1. "Failed to build `torch-memory-saver==0.0.5` ..... cannot find -lcuda: No such file or directory"

With a CPU-only head node, you might hit build failures for `torch-memory-saver`: the build requires the CUDA driver library to be discoverable under `/usr/lib/`. To fix this, install CUDA and link the compat libraries into `/usr/lib`. For example,

```bash
sudo ln -s /usr/local/cuda-12.4/compat/libcuda.so /usr/lib/libcuda.so
sudo ln -s /usr/local/cuda-12.4/compat/libcuda.so.1 /usr/lib/libcuda.so.1
```
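The same fix can be scripted defensively. The helper below is a hypothetical sketch (the function name and the `compat/` directory layout are assumptions): it only creates links for libraries that exist in the CUDA compat directory and are not already present in the target directory.

```bash
# Link libcuda from a CUDA compat directory into a target lib directory,
# skipping anything that already exists.
link_compat_libs() {
  cuda_dir="$1"
  target_dir="$2"
  for lib in libcuda.so libcuda.so.1; do
    if [ -e "$cuda_dir/compat/$lib" ] && [ ! -e "$target_dir/$lib" ]; then
      ln -s "$cuda_dir/compat/$lib" "$target_dir/$lib"
    fi
  done
}

# Example (run as root, since /usr/lib is system-owned):
#   link_compat_libs /usr/local/cuda-12.4 /usr/lib
```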
20 changes: 3 additions & 17 deletions README.md
@@ -30,31 +30,17 @@


# Getting Started
This repository contains training code for the `SkyRL-v0` release. Our implementation is a fork of [VeRL](https://github.com/volcengine/verl).

## Installation

The only pre-requisite is having `uv` [installed](https://docs.astral.sh/uv/getting-started/installation) on your system. We use the `uv` + `ray` integration to easily manage dependencies in multi-node training.
The first step is to clone our repository:

### Clone SkyRL
```bash
git clone --recurse-submodules https://github.com/NovaSky-AI/SkyRL
```

### Installation dry run

You can dry run your installation with the following command:

```bash
uv run --isolated --frozen pip show torch
```

NOTE: With a CPU head node, you might encounter installation issues with `torch-memory-saver`. To fix this, you need to install CUDA and make sure your CUDA libraries are linked in `/usr/lib`. For example,

```bash
sudo ln -s /usr/local/cuda-12.4/compat/libcuda.so /usr/lib/libcuda.so
sudo ln -s /usr/local/cuda-12.4/compat/libcuda.so.1 /usr/lib/libcuda.so.1
```
For detailed installation instructions, please refer to [INSTALL.md](./INSTALL.md).

## Scripts for reproduction

24 changes: 24 additions & 0 deletions docker/Dockerfile.skyrl
@@ -0,0 +1,24 @@
# We start from Anyscale's ray image. The image from `ray-project` should also work.
FROM anyscale/ray:2.43.0-slim-py312-cu124


RUN sudo apt-get update -y && sudo apt-get install -y wget kmod libxml2 build-essential
RUN wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run \
&& sudo sh cuda_12.4.0_550.54.14_linux.run --silent --toolkit

RUN curl -LsSf https://astral.sh/uv/install.sh | sh
RUN echo "export RAY_RUNTIME_ENV_HOOK=ray._private.runtime_env.uv_runtime_env_hook.hook" >> /home/ray/.bashrc

RUN sudo apt-get update && sudo apt-get install -y --no-install-recommends --allow-change-held-packages \
vim \
iputils-ping \
iproute2 \
openmpi-bin \
openmpi-common \
libopenmpi-dev \
libnccl2 \
libnccl-dev \
openssh-server \
ca-certificates \
infiniband-diags \
ibverbs-utils
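To use this image, you can build and launch it roughly as follows. The tag `skyrl:latest` and the `--gpus all` flag are assumptions (the `command -v` guard simply skips the commands on machines without Docker):

```bash
IMAGE_TAG="skyrl:latest"
if command -v docker >/dev/null 2>&1; then
  # Build from the repository root, then start an interactive GPU container.
  docker build -f docker/Dockerfile.skyrl -t "$IMAGE_TAG" .
  docker run --gpus all -it --rm "$IMAGE_TAG" bash
fi
```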
13 changes: 11 additions & 2 deletions examples/sky/README.md
@@ -2,15 +2,24 @@

We provide the exact scripts to reproduce our results for SkyRL-Agent-7B-v0, SkyRL-Agent-8B-v0, and SkyRL-Agent-14B-v0.

## Pre-requisite: Data preparation
## Pre-requisite

### Installation

Make sure you have followed the installation steps in [INSTALL.md](../../INSTALL.md).

### Start Ray
Start Ray on your cluster by following the official guide: https://docs.ray.io/en/latest/ray-core/starting-ray.html
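As a minimal sketch, starting a two-node cluster looks like the following. The IP address is a placeholder, 6379 is Ray's default port, and the `command -v` guard just skips the command on machines where Ray is not installed:

```bash
# Replace with your head node's IP address.
HEAD_IP="10.0.0.1"

# On the head node:
command -v ray >/dev/null 2>&1 && ray start --head --port=6379

# On each worker node, join the cluster:
command -v ray >/dev/null 2>&1 && ray start --address="${HEAD_IP}:6379"
```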

### Data preparation

We provide the datasets we used on HuggingFace: https://huggingface.co/novasky-ai

We used [NovaSky-AI/SkyRL-v0-293-data](https://huggingface.co/datasets/NovaSky-AI/SkyRL-v0-293-data) for training both SkyRL-Agent-8B-v0 and SkyRL-Agent-14B-v0.
We used [NovaSky-AI/SkyRL-v0-80-data](https://huggingface.co/datasets/NovaSky-AI/SkyRL-v0-80-data) (first stage) and [NovaSky-AI/SkyRL-v0-220-data](https://huggingface.co/datasets/NovaSky-AI/SkyRL-v0-220-data) (second stage) to train SkyRL-Agent-7B-v0.
Make sure to download the dataset and update `DATA_PATH` in the script accordingly.

## Setup Environment variables
### Set up environment variables

We use a [`.env`](../../.env) file to pass environment variables to all the processes created by Ray. Make sure to set `WANDB_API_KEY`, `ALLHANDS_API_KEY`, and `SANDBOX_REMOTE_RUNTIME_API_URL`.
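A minimal `.env` might look like this (all values below are placeholders):

```bash
# .env — consumed by Ray workers; replace each placeholder with a real value.
WANDB_API_KEY="your-wandb-api-key"
ALLHANDS_API_KEY="your-allhands-api-key"
SANDBOX_REMOTE_RUNTIME_API_URL="https://your-sandbox-runtime-host"
```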
