Infer hardware from model name #515
Conversation
yixu34
left a comment
Do we have a use case-level test that verifies this behavior? If there was a prior test that exercised infer_hardware_from_model_name, we can probably just route to that one.
model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py
    LLMInferenceFramework.TENSORRT_LLM: [],
}

# We need a dict where if we need to override we can
what's the reason we're getting rid of the max_model_len/max_num_batched_tokens args to vllm?
I forgot why we added this in the first place, but I'm fairly certain it's not needed in recent versions of vLLM. I also checked most of these models; config.json has the same max_position_embeddings as the values here.
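For what it's worth, one quick way to sanity-check that claim outside this PR is to read max_position_embeddings straight from each model's config on the Hub. A hedged sketch, assuming transformers is installed and the model names (chosen here only as examples) are accessible:

```python
# Hedged illustration (not part of this PR): read max_position_embeddings
# directly from a model's config.json via transformers. Some Hub models are
# gated and may require authentication.
from transformers import AutoConfig

for model_name in ["codellama/CodeLlama-7b-hf", "mistralai/Mixtral-8x7B-v0.1"]:
    config = AutoConfig.from_pretrained(model_name)
    # For decoder-only models this is the context window vLLM can infer on its own.
    print(model_name, getattr(config, "max_position_embeddings", None))
```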
…-engine into yunfeng-easy-model-creation
model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py
model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py
yixu34
left a comment
👀
yixu34
left a comment
- Cleanup prints
- Unrelated changes?
model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py
model-engine/model_engine_server/infra/gateways/abs_llm_artifact_gateway.py
model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py
| f"Memory calculation result: {min_memory_gb=} for {model_name}, min_kv_cache_size: {min_kv_cache_size}, model_weights_size: {model_weights_size}" | ||
| ) | ||
|
|
||
| if min_memory_gb <= 24: |
probably not really an issue, but how well do these map to Azure instance types?
In my limited experience with Azure, they do provide the same set of GPUs. @squeakymouse wdyt?
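For readers skimming the diff, the `if min_memory_gb <= 24:` line above suggests a threshold ladder from estimated memory to GPU type. A hedged sketch of that idea follows; the thresholds and GPU identifiers are illustrative assumptions based on common per-GPU memory sizes (24 GB A10, 80 GB A100), not the PR's actual table.

```python
import math

# Hedged sketch: pick a GPU type and count from the estimated minimum memory.
# The identifiers and thresholds below are hypothetical, not the PR's exact values.
def infer_hardware_from_memory(min_memory_gb: float) -> tuple[str, int]:
    if min_memory_gb <= 24:
        return ("a10-24gb", 1)    # a single 24 GB GPU is enough
    if min_memory_gb <= 48:
        return ("a10-24gb", 2)    # shard across two 24 GB GPUs
    if min_memory_gb <= 80:
        return ("a100-80gb", 1)
    # Otherwise scale out 80 GB GPUs; real code would also cap the count.
    return ("a100-80gb", math.ceil(min_memory_gb / 80))
```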
) -> CreateDockerImageBatchJobResourceRequests:
    config = llm_artifact_gateway.get_model_config(checkpoint_path)

    dtype_size = 2
guess we're gonna handle quantization later?
Quantization can already be handled here, but I chose not to update dtype_size since, in my experience, quantization usually makes things slower, not faster (at least for bitsandbytes and AWQ), so to achieve the same speed we still need the same number of GPUs.
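For reference, the dtype_size constant feeds the weight-size half of the estimate. A minimal hedged sketch of that calculation, assuming a known parameter count (function and variable names here are illustrative, not the PR's exact code):

```python
# Hedged sketch of the model-weights half of the memory estimate.
# Assumes fp16/bf16 weights (2 bytes per parameter); quantized checkpoints
# (bitsandbytes, AWQ) would shrink this, but as noted above the PR keeps
# dtype_size = 2 so GPU counts stay sized for fp16-level throughput.
def estimate_model_weights_size_bytes(param_count: int, dtype_size: int = 2) -> int:
    return param_count * dtype_size

# e.g. an ~8B-parameter model at fp16 needs roughly 16 GB just for weights:
# 8_000_000_000 params * 2 bytes ~= 16e9 bytes ~= 14.9 GiB
```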
    dtype_size = 2

    min_kv_cache_size = (
IIUC we're implicitly setting this to "batch size = 1 and filling up the context window". This feels reasonable, but would it make sense to add a bit of room for a larger batch size? (I guess for something with a shorter context window, e.g. Llama 2, it makes more sense to add some room; maybe less so for Mixtral.)
For all the existing models I tested, batch size = 1 is a good enough default to land on a reasonable number of GPUs. Model builders have presumably already thought about the tradeoff between model size and GPU size.
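For readers following along, the batch-size-1 assumption pins the KV-cache term to one full context window. A hedged sketch of that term using standard Llama-style config.json fields (this mirrors the usual KV-cache formula; the PR's exact field handling may differ):

```python
# Hedged sketch of the minimum KV-cache size at batch size 1, filling the
# full context window. Assumes a Llama-style config dict.
def estimate_min_kv_cache_size_bytes(config: dict, dtype_size: int = 2) -> int:
    num_layers = config["num_hidden_layers"]
    # GQA models expose num_key_value_heads; fall back to MHA otherwise.
    num_kv_heads = config.get("num_key_value_heads", config["num_attention_heads"])
    head_dim = config["hidden_size"] // config["num_attention_heads"]
    context_len = config["max_position_embeddings"]
    # 2x for keys and values, batch size fixed at 1.
    return 2 * num_layers * num_kv_heads * head_dim * context_len * dtype_size
```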
| f"Num shard {num_shards} must be the same as number of GPUs {gpus} for DeepSpeed." | ||
| ) | ||
| if num_shards > gpus: | ||
| if num_shards != gpus: |
should we just deprecate the num_shards field at this point?
Yes, we should, but I'd prefer not to do that in this PR.
    and request.storage is None
):
    raise ObjectHasInvalidValueException(
        "All hardware spec fields (gpus, gpu_type, cpus, memory, storage) must be provided if any hardware spec field is missing."
nit: "... if any hardware spec field is provided"?
#515 (comment)
I guess it works both ways, since I'm only allowing two states here: either all fields are provided, or none of them are.
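As a side note, an all-or-none check like the one discussed here can be written compactly. A hedged sketch, with field names taken from the error message above; the actual request and exception classes in model-engine may differ:

```python
# Hedged sketch of the all-or-none hardware-spec validation discussed above.
HARDWARE_FIELDS = ("gpus", "gpu_type", "cpus", "memory", "storage")

def validate_hardware_spec(request) -> None:
    provided = [getattr(request, field, None) is not None for field in HARDWARE_FIELDS]
    if any(provided) and not all(provided):
        raise ValueError(
            "All hardware spec fields (gpus, gpu_type, cpus, memory, storage) "
            "must be provided if any hardware spec field is provided."
        )
    # If none are provided, hardware is inferred from the model name instead.
```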
seanshi-scale
left a comment
had a few questions/nits, but lgtm!
Pull Request Summary
Infer hardware specs from the model name, so these fields are now optional.
Formula used:
- We hard-code the param count for MoE models right now.
- We omit some other small weights, like the embedding layer and the MLP layer after the transformer blocks.
- This estimate is mostly correct, with some issues on long-context-window models (investigation TBD); a worked example follows below.
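To make the estimate concrete, here is a hedged worked example for an ~8B Llama-style model at fp16 and batch size 1. The config numbers are the public Llama 3 8B values, used only as an illustration, and the GB conversion assumes 1e9 bytes; none of this is taken from the PR itself.

```python
# Hedged worked example of the memory estimate for an ~8B Llama-style model at fp16.
param_count = 8_030_000_000      # ~8B parameters
dtype_size = 2                   # fp16/bf16 bytes per element
num_layers, num_kv_heads, head_dim, context_len = 32, 8, 128, 8192

model_weights_size = param_count * dtype_size                 # ~16.1e9 bytes
min_kv_cache_size = (
    2 * num_layers * num_kv_heads * head_dim * context_len * dtype_size
)                                                             # ~1.07e9 bytes

min_memory_gb = (model_weights_size + min_kv_cache_size) / 1e9
print(f"{min_memory_gb=:.1f}")   # ~17.1 GB, under the 24 GB threshold in the diff
```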
Test Plan and Usage Guide
Added unit tests and created Llama 3 8B and CodeLlama endpoints.