LLM batch completions API #418
Conversation
model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py (outdated review thread, resolved)
model-engine/model_engine_server/inference/batch_inference/vllm_batch.py (outdated review thread, resolved)
yixu34 left a comment:
Not in scope for this PR per se, but given that we're looking at Azure support, we may want to start thinking about how to support Azure in PRs going forward.
Need to take another pass at the actual batch job script.
| """ | ||
| Path to the checkpoint to load the model from. | ||
| """ | ||
| labels: Dict[str, str] |
Should we make labels required for external users?
Since this would mostly be used internally, I think it's okay to sacrifice some ergonomics; plus, external users can simply use {}.
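For reference, a minimal sketch of how the labels field could sit in the request schema, assuming a pydantic model; the class name CreateBatchCompletionsRequest and any fields other than those visible in the diff above are illustrative assumptions:

```python
from typing import Dict, Optional

from pydantic import BaseModel


class CreateBatchCompletionsRequest(BaseModel):  # class name assumed for illustration
    checkpoint_path: Optional[str] = None
    """
    Path to the checkpoint to load the model from.
    """
    labels: Dict[str, str]
    """
    Labels attached to the batch job. Required, but callers with nothing to
    attach can pass an empty dict ({}), per the discussion above.
    """


# Usage: an external caller with no labels simply passes {}.
request = CreateBatchCompletionsRequest(checkpoint_path=None, labels={})
```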
model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py (outdated review thread, resolved)
model-engine/model_engine_server/inference/batch_inference/Dockerfile_vllm (review thread resolved)
yixu34 left a comment:
Looks pretty neat!
```python
def infer_hardware_from_model_name(
    self, model_name: str
) -> CreateDockerImageBatchJobResourceRequests:
```
Gah I just realized this should be CreateDockerImageBatchJobResourceRequest, oh well.
Possible to rename, but I don't want to put that in this PR.
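Purely as an illustration of what a function with this signature might do, a sketch that parses a rough parameter count out of the model name (simplified to a free function with a stand-in dataclass); the thresholds, field names, and resource values are assumptions, not the PR's actual mapping:

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class CreateDockerImageBatchJobResourceRequests:
    # Stand-in for the real domain entity; field names here are assumptions.
    cpus: Optional[int] = None
    memory: Optional[str] = None
    gpus: Optional[int] = None
    gpu_type: Optional[str] = None


def infer_hardware_from_model_name(model_name: str) -> CreateDockerImageBatchJobResourceRequests:
    # Illustrative only: pull a rough parameter count from names like
    # "llama-2-70b" and pick GPU resources from it.
    match = re.search(r"(\d+)b", model_name.lower())
    num_params_b = int(match.group(1)) if match else 7
    gpus = 4 if num_params_b >= 40 else 2 if num_params_b >= 13 else 1
    return CreateDockerImageBatchJobResourceRequests(
        cpus=gpus * 10,
        memory=f"{gpus * 80}Gi",
        gpus=gpus,
        gpu_type="nvidia-ampere-a100",
    )


print(infer_hardware_from_model_name("llama-2-70b"))  # gpus=4 under these assumed thresholds
```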
```python
if job_index == 0:
    wait_for_all_chunks(request)
    combine_all_chunks(request)
    if request.output_data_path.startswith("s3://"):
```
Should we have this in a finally?
I think this `finally` should be at the end of the job instead of at the end of worker 0; otherwise, if worker 0 fails early, it may not be able to remove the other workers' output chunks.
For now I think it's fine not to remove them and to leave some traces for debugging.
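A rough sketch of the structure being discussed, reusing the helper names from the diff above; run_inference and delete_s3_chunks are hypothetical names used only to show where a `finally` could sit:

```python
def batch_inference(request, job_index):
    try:
        run_inference(request, job_index)  # hypothetical per-worker inference step
        if job_index == 0:
            wait_for_all_chunks(request)
            combine_all_chunks(request)
    finally:
        # A `finally` here runs even if worker 0 fails early, but as noted
        # above, a failed worker 0 may still be unable to clean up chunks
        # from other workers, and leaving them around can help debugging.
        if job_index == 0 and request.output_data_path.startswith("s3://"):
            delete_s3_chunks(request)  # hypothetical cleanup helper
```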
ian-scale left a comment:
Looks good to me! Just a few nits / questions so I can learn more.
model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py (review thread resolved)
model-engine/model_engine_server/inference/batch_inference/requirements.txt (review thread resolved)
model-engine/model_engine_server/inference/batch_inference/vllm_batch.py (review thread resolved)
saiatmakuri left a comment:
some nits on unit test style but lgtm otherwise
```python
    mock_process,
):
    # Mock the necessary objects and data
    mock_popen.return_value = mock_process
```
nit: you can create a new fixture that mocks the return value of subprocess.Popen using the mock_process fixture, so that you don't need to make this declaration at the start of each unit test. It'll clean up the logic here:

```python
def test(mock1, mock2):
    mock2.return_value = mock1
```

becomes

```python
def test(mock2):
    ...
```

You can do that for several of the mocks done at the start of each unit test.
I think you're proposing to use `@patch("subprocess.Popen", mock_process)`. I think there's some ordering problem with patch; not sure, but I can't get it to work quickly. Will skip this for now.
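For reference, one way the fixture-based approach could be wired up with pytest's monkeypatch rather than the @patch decorator (which sidesteps decorator-ordering issues); fixture and test names here are illustrative:

```python
import subprocess
from unittest.mock import MagicMock

import pytest


@pytest.fixture
def mock_process():
    # Stand-in for the existing mock_process fixture.
    process = MagicMock()
    process.returncode = 0
    return process


@pytest.fixture
def mock_popen(mock_process, monkeypatch):
    # Patch subprocess.Popen once here so individual tests no longer need
    # `mock_popen.return_value = mock_process` at the top of each test.
    popen = MagicMock(return_value=mock_process)
    monkeypatch.setattr(subprocess, "Popen", popen)
    return popen


def test_runs_subprocess(mock_popen, mock_process):
    # Example usage: the patched Popen already returns mock_process.
    proc = subprocess.Popen(["echo", "hello"])
    assert proc is mock_process
    mock_popen.assert_called_once()
```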
Pull Request Summary
Batch completions create API, which currently only uses vLLM to run batch inference with Kubernetes jobs. Feature highlights:
Test Plan and Usage Guide
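To make the shape of the new API concrete, a hypothetical usage sketch; the endpoint path, payload fields, and authentication scheme shown here are assumptions for illustration, not the actual API contract:

```python
import requests

# Hypothetical payload; field names are illustrative only.
payload = {
    "model_config": {
        "model": "llama-2-7b",
        "checkpoint_path": "s3://my-bucket/checkpoints/llama-2-7b",
    },
    "input_data_path": "s3://my-bucket/batch-inference/input.json",
    "output_data_path": "s3://my-bucket/batch-inference/output.json",
    "labels": {},            # external users can pass an empty dict, per the review discussion
    "data_parallelism": 2,   # number of workers / chunks (assumed parameter name)
}

# Endpoint path and auth scheme are assumptions, not the documented API.
resp = requests.post(
    "https://<model-engine-host>/v1/llm/batch-completions",
    json=payload,
    auth=("<api-key>", ""),
)
resp.raise_for_status()
print(resp.json())  # e.g. a job id to poll for results
```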
Note: during testing with ARC Challenge, we found that 1 worker and 2 workers do not always return the same results.
A single difference across 1k requests:
1 worker:
2 workers: