
Improve installation instructions#3

Merged
caoshiyi merged 6 commits into main from sumanthrh/update-instructions
May 10, 2025

Conversation

@SumanthRH
Member

What does this PR do?

Tiny PR to improve our installation instructions. Adds a Dockerfile for a quick start experience.

SumanthRH added 5 commits May 10, 2025 05:33
@SumanthRH SumanthRH requested a review from caoshiyi May 10, 2025 05:49
Member

@caoshiyi caoshiyi left a comment

LGTM!

@caoshiyi caoshiyi merged commit a70d18e into main May 10, 2025
@caoshiyi caoshiyi deleted the sumanthrh/update-instructions branch May 10, 2025 06:13
@bhks

bhks commented May 15, 2025

Hey @SumanthRH, @caoshiyi thanks for adding this.

I am wondering if you could also add instructions on how to use this Dockerfile.

A couple of questions:

  • My understanding is that we will be using a Ray cluster to execute the training; if that is the case, how is this Dockerfile helpful, given that Ray itself does not manage containers?
  • Are you planning to run Ray and everything else on a single node within the container to replicate the training, with the Ray head and Ray workers all running inside the same container?
  • I am trying to understand how you ran your training if it is a multi-node setup on a Ray cluster. It would be awesome if you could add some more details about the training side of things.
  • I am also assuming you are running the OpenHands server on the k8s nodes. Are you also running Ray on k8s via KubeRay? If yes, could you help us with what setup you had to do?
    Thanks

@SumanthRH
Member Author

SumanthRH commented May 15, 2025

Hi @bhks. Happy to help.

My understanding is that we will be using a Ray cluster to execute the training; if that is the case, how is this Dockerfile helpful, given that Ray itself does not manage containers?

I think it would be helpful to go over our architecture in the blog post:
https://novasky-ai.notion.site/skyrl-v0

TL;DR: LLM-generated code is run on a remote server (separate from the training cluster). For each trajectory of the LLM, we run the code in a separate Docker container on the remote server.
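
To make that concrete, here is a minimal sketch (not SkyRL's actual code) of the per-trajectory isolation pattern, using the Docker Python SDK; the image name, resource limits, and helper function are illustrative assumptions:

```python
# Hypothetical sketch: each trajectory's generated code runs in its own
# short-lived container on the remote server. Image and limits are placeholders.
import docker

client = docker.from_env()

def run_trajectory_code(code: str) -> str:
    """Execute one trajectory's generated code in a fresh, isolated container."""
    output = client.containers.run(
        image="python:3.11-slim",        # placeholder image, not SkyRL's
        command=["python", "-c", code],  # run the LLM-generated snippet
        remove=True,                     # discard the container afterwards
        network_disabled=True,           # sandbox: no network access
        mem_limit="512m",                # sandbox: bounded memory
    )
    return output.decode("utf-8")

print(run_trajectory_code("print(1 + 1)"))
```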

I am trying to understand how you ran your training if it is a multi-node setup on a Ray cluster. It would be awesome if you could add some more details about the training side of things.

The training cluster is completely separate from the remote server running our OpenHands server. So you can set this up like a regular Ray cluster, clone and install SkyRL, and run training. For installation, we have setup instructions here: https://github.com/NovaSky-AI/SkyRL/blob/main/INSTALL.md. In terms of our setup, we ran single/multi-node training on Anyscale.
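
For a rough picture of the training side, here is a minimal sketch assuming an already-running Ray cluster; `train_step` is a hypothetical stand-in for the actual SkyRL training entrypoint (see INSTALL.md for the real launch commands):

```python
# Minimal sketch, assuming a Ray cluster that is already up (self-managed,
# KubeRay, Anyscale, ...). The task below is illustrative, not SkyRL's code.
import ray

# Attach to the existing cluster (e.g. started with `ray start --head` on the
# head node and `ray start --address=<head-ip>:6379` on each worker node).
ray.init(address="auto")

@ray.remote(num_gpus=1)
def train_step(shard_id: int) -> str:
    # Placeholder for per-worker training work scheduled across the cluster.
    return f"shard {shard_id} done"

print(ray.get([train_step.remote(i) for i in range(4)]))
```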

I am also assuming you are running the OpenHands server on the k8s nodes. Are you also running Ray on k8s via KubeRay? If yes, could you help us with what setup you had to do?

I think this is the same question as before, but basically the training cluster can be managed with the infra of your choice (self-managed Ray cluster, k8s, a proprietary platform, etc.).

Hope that helps! If you have more questions, I would recommend we move this discussion to a separate GitHub issue for clarity!

@bhks

bhks commented May 16, 2025

I think I understand now, thank you so much man.

@bhks

bhks commented May 16, 2025

I think it would be nice to call out the training cluster pieces in the architecture as well. I did read the blog post you guys have written, and thank you for that.

I may create a pull request and let you review it.

@SumanthRH
Member Author

I think it would be nice to call out the training cluster pieces in the architecture as well. I did read the blog post you guys have written, and thank you for that.

I may create a pull request and let you review it.

@bhks yes, agreed. I think what is missing is a full system diagram, or just a description of what is running where. Let me see if we can add that. And contributions welcome, thank you!

@bhks

bhks commented May 16, 2025

Exactly, I had a hard time reverse-engineering things like:

  1. Where is the training running?
  2. Why are we building a Docker image, and what do we do within the container?
  3. If I have a k8s environment, does the container we are building need to run on CPU or GPU?
  4. The data needed for execution also needs to be copied into the Docker container if the training is running within the container, so that it can be pulled into the k8s pod and executed.
  5. But if you are pulling everything onto a node to run in a virtual env, then there is no need for containers.

These things were confusing to me when I was trying to understand the setup.

So yeah, a step-by-step guide and a system-level architecture diagram would be helpful.

pcmoritz added a commit to pcmoritz/SkyRL that referenced this pull request Oct 4, 2025
This PR adds a dtype parameter to the model, so it can e.g. be trained in bfloat16. The SFT script will by default use the model's native dtype. Also added a test to make sure the SFT script runs for one step.
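
As a rough illustration (not the actual commit), a dtype parameter like this typically threads through the model load as follows, assuming a Hugging Face-style `from_pretrained`; the model name and helper are just examples:

```python
# Sketch of a dtype override that defaults to the checkpoint's native dtype.
from typing import Optional

import torch
from transformers import AutoModelForCausalLM

def load_model(model_name: str, dtype: Optional[str] = None):
    # dtype=None keeps the model's native type; "bfloat16"/"float16" override it.
    torch_dtype = getattr(torch, dtype) if dtype is not None else "auto"
    return AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch_dtype)

model = load_model("Qwen/Qwen2.5-0.5B", dtype="bfloat16")  # example model name
```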
li-boxuan pushed a commit to li-boxuan/SkyRL that referenced this pull request Nov 23, 2025
Before this PR, `session_id` is always None because Terminus 2 by default does not pass it in. So we do `engine_idx = random.randint(0, len(self.engines) - 1)`, which is really bad for the prefix cache hit rate.

We can actually pass a session ID to `AgentConfig` and it will be passed to all requests.

Verified that the following will print out logs like `CHARLIE: session_id: 954320202c254bd8bbca083d34457b94` (multiple times too, meaning the session_id is consistent across a trial, i.e. a trajectory):

```python
    async def chat_completion(self, request_payload: Dict[str, Any]) -> Dict[str, Any]:
        session_id = request_payload["json"].pop("session_id", None)
        print(f"CHARLIE: session_id: {session_id}")
        ...
```
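
For intuition, here is a hypothetical sketch of why a stable `session_id` matters: hashing it to pick an engine keeps every request of a trajectory on the same engine, so its prefix cache stays warm. The engine list and hashing scheme below are illustrative, not the actual implementation:

```python
# Sketch: stable session-to-engine routing vs. the old random choice.
import hashlib
import random

engines = ["engine-0", "engine-1", "engine-2", "engine-3"]  # placeholder names

def pick_engine(session_id: str | None) -> str:
    if session_id is None:
        # Old behavior: a random engine per request, so a trajectory's prefix
        # is unlikely to be cached on the engine that serves the next request.
        return engines[random.randint(0, len(engines) - 1)]
    # New behavior: a stable hash sends every request in a trajectory to the
    # same engine, maximizing prefix cache hits.
    digest = hashlib.sha256(session_id.encode()).hexdigest()
    return engines[int(digest, 16) % len(engines)]

sid = "954320202c254bd8bbca083d34457b94"
assert pick_engine(sid) == pick_engine(sid)  # consistent within a trajectory
```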