Enforce model checkpoints existing for endpoint/bundle creation #503
Conversation
    if not checkpoint_path:
        raise InvalidRequestException(f"No checkpoint path found for model {model_name}")

    if checkpoint_path.startswith("s3://"):
nit: invert the check to `if not ...` and raise the error there, so you can avoid a level of nesting. then honestly you could create a helper fn that validates checkpoint_path with both checks in one place
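A minimal sketch of what such a helper could look like. The name `validate_checkpoint_path`, the second error message, and the locally defined `InvalidRequestException` stand-in are assumptions made for illustration, not code from this PR:

```python
from typing import Optional


class InvalidRequestException(Exception):
    """Stand-in for the server's exception type, defined here so the sketch runs alone."""


def validate_checkpoint_path(model_name: str, checkpoint_path: Optional[str]) -> str:
    """Hypothetical helper: run both checks as guard clauses, then return the path."""
    if not checkpoint_path:
        raise InvalidRequestException(f"No checkpoint path found for model {model_name}")
    if not checkpoint_path.startswith("s3://"):
        # The wording of this message is an assumption, not taken from the PR.
        raise InvalidRequestException(
            f"Checkpoint path {checkpoint_path} for model {model_name} is not an s3:// URL"
        )
    return checkpoint_path
```

With both checks written as early exits, the calling code stays at a single indentation level and can use the returned path directly.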
thinking more about this, I guess we're assuming that the URLs are S3, which is not true in the multi-tenant world. since it'd be a lot more code change, no need to do it here, but it would be good to make a note to check that infra_config().cloud_provider is "aws" later on
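One possible shape for that later check, as a rough sketch only. The mapping `CHECKPOINT_PREFIXES` and the function `is_supported_checkpoint_url` are hypothetical names; the only pairing taken from this discussion is "aws" with `s3://`, and the caller is assumed to pass in the value of `infra_config().cloud_provider`:

```python
from typing import Dict, Optional, Tuple

# Hypothetical mapping from cloud provider to accepted URL prefixes.
# Only the "aws" -> "s3://" pairing comes from the discussion; others would be added later.
CHECKPOINT_PREFIXES: Dict[str, Tuple[str, ...]] = {"aws": ("s3://",)}


def is_supported_checkpoint_url(cloud_provider: str, checkpoint_path: Optional[str]) -> bool:
    """Return True only if the path uses a prefix registered for the given provider."""
    prefixes = CHECKPOINT_PREFIXES.get(cloud_provider, ())
    # str.startswith accepts a tuple of prefixes; an empty tuple never matches.
    return bool(checkpoint_path) and checkpoint_path.startswith(prefixes)
```

Keeping the provider-to-prefix table in one place means a new cloud provider would need a new table entry rather than another hardcoded `startswith` call.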
+1 about not being cloud-specific (anymore). @squeakymouse is in the process of refactoring this, would be good for her to review.
yeah, this check triggered me a bit as well, but i wanted to hold off on any big refactors around "cloud agnosticism" until I had a fuller picture of what's needed to achieve it (e.g. we'd need to refactor ModelInfo so it doesn't hardcode s3_repo, and i'm sure there are other things we'd need to change)
model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py
56eedc4 to 1d200f2
    )
    else:

        if not checkpoint_path:
should use get_checkpoint_path here too imo. it's technically net-new functionality but should be welcome for tensorrt as well
According to @seanshi-scale, the trt-llm provider is different in that we won't need model_name on creation, so we won't necessarily have it in the control flow
Pull Request Summary
What is this PR changing? Why is this change being made? Any caveats you'd like to highlight? Link any relevant documents, links, or screenshots here if applicable.
Enforce that model checkpoints are used when creating bundles, to prevent outages/issues in prod when a Hugging Face (hf) repo becomes unavailable.
Test Plan and Usage Guide
How did you validate that your PR works correctly? How do you run or demo the code? Provide enough detail so a reviewer can reasonably reproduce the testing procedure. Paste example command line invocations if applicable.