Merged
43 changes: 43 additions & 0 deletions INSTALL.md
@@ -0,0 +1,43 @@
# SkyRL: Installation

## Pre-requisites

> [!TIP]
> For an easy-to-use Dockerfile, see [Dockerfile.skyrl](./docker/Dockerfile.skyrl)


The main prerequisites are:
- [CUDA Toolkit 12.4](https://developer.nvidia.com/cuda-12-4-0-download-archive) (newer versions may also work)
- `build-essential`: This is needed for `torch-memory-saver`
- [`uv`](https://docs.astral.sh/uv/getting-started/installation): We use the `uv` + `ray` integration to easily manage dependencies in multi-node training.
- `python` 3.12
- `ray` 2.43.0
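As a convenience, you can check which of these tools are already on your `PATH` before proceeding. This is an informal sketch, not part of the official setup (`nvcc` ships with the CUDA Toolkit):

```bash
# Report which prerequisite binaries are visible on PATH.
for tool in nvcc python3 uv ray; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found ($(command -v "$tool"))"
  else
    echo "$tool: missing"
  fi
done
```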


Once these are installed, configure Ray to use `uv` with

```bash
export RAY_RUNTIME_ENV_HOOK=ray._private.runtime_env.uv_runtime_env_hook.hook
```
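To make the hook persist across sessions, you can append the same line to your shell profile (assuming `bash`; adapt the file for your shell):

```bash
# Append the hook to ~/.bashrc so new shells pick it up automatically.
echo 'export RAY_RUNTIME_ENV_HOOK=ray._private.runtime_env.uv_runtime_env_hook.hook' >> ~/.bashrc
```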


## Installation dry run

Execute the following command from the root project directory:

```bash
uv run --isolated --frozen python -c 'import ray; ray.init(); print("Success!")'
```

This will trigger a fresh environment build on your system.

## Common installation issues

1. "Failed to build `torch-memory-saver==0.0.5` ..... cannot find -lcuda: No such file or directory"

With a CPU-only head node, you might hit build failures for `torch-memory-saver`: the build requires the CUDA driver library to be discoverable under `/usr/lib/`. To fix this, install CUDA and link the compat libraries into `/usr/lib`. For example,

```bash
sudo ln -s /usr/local/cuda-12.4/compat/libcuda.so /usr/lib/libcuda.so
sudo ln -s /usr/local/cuda-12.4/compat/libcuda.so.1 /usr/lib/libcuda.so.1
```
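The same fix can be scripted defensively. The helper below is a hypothetical sketch (the function name and the `compat/` directory layout are assumptions): it only creates links for libraries that exist in the CUDA compat directory and are not already present in the target directory.

```bash
# Link libcuda from a CUDA compat directory into a target lib directory,
# skipping anything that already exists.
link_compat_libs() {
  cuda_dir="$1"
  target_dir="$2"
  for lib in libcuda.so libcuda.so.1; do
    if [ -e "$cuda_dir/compat/$lib" ] && [ ! -e "$target_dir/$lib" ]; then
      ln -s "$cuda_dir/compat/$lib" "$target_dir/$lib"
    fi
  done
}

# Example (run as root, since /usr/lib is system-owned):
#   link_compat_libs /usr/local/cuda-12.4 /usr/lib
```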
20 changes: 3 additions & 17 deletions README.md
@@ -30,31 +30,17 @@


# Getting Started
This repository contains training code for the `SkyRL-v0` release. Our implementation is a fork of [VeRL](https://github.com/volcengine/verl).

## Installation

The only pre-requisite is having `uv` [installed](https://docs.astral.sh/uv/getting-started/installation) on your system. We use the `uv` + `ray` integration to easily manage dependencies in multi-node training.
The first step is to clone our repository:

### Clone SkyRL
```bash
git clone --recurse-submodules https://github.com/NovaSky-AI/SkyRL
```

### Installation dry run

You can dry run your installation with the following command:

```bash
uv run --isolated --frozen pip show torch
```

NOTE: With a CPU head node, you might encounter installation issues with `torch-memory-saver`. To fix this, you need to install CUDA and make sure your CUDA libraries are linked in `/usr/lib`. For example,

```bash
sudo ln -s /usr/local/cuda-12.4/compat/libcuda.so /usr/lib/libcuda.so
sudo ln -s /usr/local/cuda-12.4/compat/libcuda.so.1 /usr/lib/libcuda.so.1
```
For detailed installation instructions, please refer to [INSTALL.md](./INSTALL.md).

## Scripts for reproduction

24 changes: 24 additions & 0 deletions docker/Dockerfile.skyrl
@@ -0,0 +1,24 @@
# We start from Anyscale's ray image. The image from `ray-project` should also work.
FROM anyscale/ray:2.43.0-slim-py312-cu124


RUN sudo apt-get update -y && sudo apt-get install -y wget kmod libxml2 build-essential
RUN wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run \
&& sudo sh cuda_12.4.0_550.54.14_linux.run --silent --toolkit

RUN curl -LsSf https://astral.sh/uv/install.sh | sh
RUN echo "export RAY_RUNTIME_ENV_HOOK=ray._private.runtime_env.uv_runtime_env_hook.hook" >> /home/ray/.bashrc

RUN sudo apt-get update && sudo apt-get install -y --no-install-recommends --allow-change-held-packages \
vim \
iputils-ping \
iproute2 \
openmpi-bin \
openmpi-common \
libopenmpi-dev \
libnccl2 \
libnccl-dev \
openssh-server \
ca-certificates \
infiniband-diags \
ibverbs-utils
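To use this image, you can build and launch it roughly as follows. The tag `skyrl:latest` and the `--gpus all` flag are assumptions (the `command -v` guard simply skips the commands on machines without Docker):

```bash
IMAGE_TAG="skyrl:latest"
if command -v docker >/dev/null 2>&1; then
  # Build from the repository root, then start an interactive GPU container.
  docker build -f docker/Dockerfile.skyrl -t "$IMAGE_TAG" .
  docker run --gpus all -it --rm "$IMAGE_TAG" bash
fi
```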
13 changes: 11 additions & 2 deletions examples/sky/README.md
@@ -2,15 +2,24 @@

We provide the exact scripts to reproduce our results for SkyRL-Agent-7B-v0, SkyRL-Agent-8B-v0, and SkyRL-Agent-14B-v0.

## Pre-requisite: Data preparation
## Pre-requisite

### Installation

Make sure you have followed the installation steps in [INSTALL.md](../../INSTALL.md).

### Start Ray
Start Ray on your cluster by following the official guide: https://docs.ray.io/en/latest/ray-core/starting-ray.html
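As a minimal sketch, starting a two-node cluster looks like the following. The IP address is a placeholder, 6379 is Ray's default port, and the `command -v` guard just skips the command on machines where Ray is not installed:

```bash
# Replace with your head node's IP address.
HEAD_IP="10.0.0.1"

# On the head node:
command -v ray >/dev/null 2>&1 && ray start --head --port=6379

# On each worker node, join the cluster:
command -v ray >/dev/null 2>&1 && ray start --address="${HEAD_IP}:6379"
```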

### Data preparation

We provide the datasets we used on HuggingFace: https://huggingface.co/novasky-ai

We used [NovaSky-AI/SkyRL-v0-293-data](https://huggingface.co/datasets/NovaSky-AI/SkyRL-v0-293-data) for training both SkyRL-Agent-8B-v0 and SkyRL-Agent-14B-v0.
We used [NovaSky-AI/SkyRL-v0-80-data](https://huggingface.co/datasets/NovaSky-AI/SkyRL-v0-80-data) (first stage) and [NovaSky-AI/SkyRL-v0-220-data](https://huggingface.co/datasets/NovaSky-AI/SkyRL-v0-220-data) (second stage) to train SkyRL-Agent-7B-v0.
Make sure to download the dataset and update `DATA_PATH` in the script accordingly.

## Setup Environment variables
### Set up environment variables

We use a [`.env`](../../.env) file to pass environment variables to all the processes created by Ray. Make sure to set `WANDB_API_KEY`, `ALLHANDS_API_KEY`, and `SANDBOX_REMOTE_RUNTIME_API_URL`.
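A minimal `.env` might look like this (all values below are placeholders):

```bash
# .env — consumed by Ray workers; replace each placeholder with a real value.
WANDB_API_KEY="your-wandb-api-key"
ALLHANDS_API_KEY="your-allhands-api-key"
SANDBOX_REMOTE_RUNTIME_API_URL="https://your-sandbox-runtime-host"
```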
