Add docs for Model.create, update default values and fix per_worker concurrency #332
Conversation
clients/python/llmengine/model.py
Outdated
```diff
  checkpoint_path (`Optional[str]`):
-     Path to the checkpoint for the LLM. For now we only support loading a tar file from AWS S3.
+     Path to the checkpoint for the LLM. Can be either a folder (preferred since there's no untar) or a tar file.
```
> Can be either a folder

I think we want to be precise and consistent around our guidance relating to remote files/directories. Would suggest cross-checking with the Files API and making sure that the explanation around this makes sense; e.g. we may want to clarify that this is meant to be some remote path that's accessible by the LLM Engine deployment, etc.

> Can be either a folder (preferred since there's no untar)

Might be good to explain why skipping the untar is desirable - is it cold start time?
Checking the code, I don't think we only support creating endpoints from files created with the Files API.
Yes, skipping the untar is good for cold start time.
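For concreteness, here's a minimal usage sketch of creating an endpoint from a checkpoint folder. All values are illustrative placeholders; the exact signature should be checked against the `Model.create` docs this PR adds:

```python
from llmengine import Model

response = Model.create(
    name="llama-2-7b.2023-07-18",
    model="llama-2-7b",
    inference_framework_image_tag="0.9.4",
    # A folder is preferred over a tar file: the checkpoint loads directly
    # with no untar step, which helps cold start time. The path must be
    # remote storage reachable by the LLM Engine deployment.
    checkpoint_path="s3://my-bucket/checkpoints/llama-2-7b/",
    num_shards=4,
    cpus=32,
    memory="160Gi",
    storage="96Gi",
    gpus=4,
    gpu_type="nvidia-ampere-a10",
    min_workers=1,
    max_workers=12,
    per_worker=10,
    endpoint_type="streaming",
    labels={"team": "my-team", "product": "my-product"},
)
print(response.json())
```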
clients/python/llmengine/model.py
Outdated
```diff
-    throughput requirements. 2. Determine a value for the maximum number of
-    concurrent requests in the workload. Divide this number by ``max_workers``. Doing
-    this ensures that the number of workers will "climb" to ``max_workers``.
+    Number of Uvicorn workers per pod. Recommendation is set to 2.
```
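For reference, here is the arithmetic the removed guidance prescribed (the numbers are made up for illustration):

```python
# Old (removed) guidance for choosing per_worker: divide the workload's
# peak concurrent requests by max_workers, so that under peak load the
# worker count "climbs" to max_workers.
max_concurrent_requests = 120  # illustrative peak concurrency
max_workers = 12               # illustrative endpoint setting

per_worker = max_concurrent_requests // max_workers
print(per_worker)  # 10
```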
Might want to hide details of Uvicorn.
Also I don't think this is correct? IIRC per_worker is basically a scaling sensitivity parameter. cc @seanshi-scale
Hmm, this actually decides both the number of workers and the HPA target concurrency: https://github.com/scaleapi/llm-engine/blob/main/model-engine/model_engine_server/infra/gateways/k8s_resource_parser.py#L65
I'll make some updates.
I didn't understand why we want to set some ratio between the HPA target and per_worker, so I'm removing the ratio.
```diff
 import re
 from typing import Union

-MAX_CONCURRENCY_TO_TARGET_CONCURRENCY_RATIO = 2.0
```
Wouldn't this technically change our autoscaling sensitivity? cc @seanshi-scale
Which might be fine; just want us to be aware of production behavior changes.
Yes, it would, but I don't think people are aware of this ratio at all (e.g. a user specifies a per-worker concurrency target of 10, and the HPA actually uses 5).
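A minimal sketch of the behavior change under discussion (the function names here are hypothetical; the real logic lives in the k8s_resource_parser.py linked above):

```python
MAX_CONCURRENCY_TO_TARGET_CONCURRENCY_RATIO = 2.0  # the constant this PR removes

def hpa_target_before(per_worker: int) -> float:
    # Before this PR: the HPA scaled on per_worker divided by the ratio,
    # so per_worker=10 silently autoscaled at a target concurrency of 5.
    return per_worker / MAX_CONCURRENCY_TO_TARGET_CONCURRENCY_RATIO

def hpa_target_after(per_worker: int) -> float:
    # After this PR: the HPA target matches what the user specified.
    return float(per_worker)

print(hpa_target_before(10))  # 5.0
print(hpa_target_after(10))   # 10.0
```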
yixu34
left a comment
One more high-level comment: after
yixu34
left a comment
The only things I had left were around the examples:
- framework version tag
- labels