Conversation

@yunfeng-scale (Contributor) commented on May 7, 2024:

Pull Request Summary

Infer hardware specs from the model name, so these fields are now optional.

Formula used:

dtype_size = 2 (bytes per value, for 16-bit floats)

min_kv_cache_size =
        2
        * dtype_size
        * config["num_hidden_layers"]
        * config["hidden_size"]
        * config["max_position_embeddings"]
        // (config["num_attention_heads"] // config["num_key_value_heads"])

model_weights_size = dtype_size * model_param_count_b * 1_000_000_000

min_memory_gb = math.ceil((min_kv_cache_size + model_weights_size) / 1_000_000_000 / 0.9)

We hard-code the parameter count for MoE models right now.
We omit some other small weights, such as the embedding layer and the post-transformer MLP layer.
This estimate is mostly correct, with some issues on long-context-window models (investigation TBD).
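
To make the arithmetic concrete, here is a minimal worked example of the same estimate for Llama 3 8B. The config values and the ~8B parameter count below are assumptions taken from the public Llama 3 8B config.json, not values hard-coded in this PR.

import math

# Assumed Llama 3 8B config values (from the public config.json)
config = {
    "num_hidden_layers": 32,
    "hidden_size": 4096,
    "max_position_embeddings": 8192,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
}
model_param_count_b = 8  # approximate parameter count, in billions
dtype_size = 2  # bytes per value for 16-bit floats

min_kv_cache_size = (
    2
    * dtype_size
    * config["num_hidden_layers"]
    * config["hidden_size"]
    * config["max_position_embeddings"]
    // (config["num_attention_heads"] // config["num_key_value_heads"])
)  # 1,073,741,824 bytes (~1.07 GB)

model_weights_size = dtype_size * model_param_count_b * 1_000_000_000  # 16 GB

min_memory_gb = math.ceil((min_kv_cache_size + model_weights_size) / 1_000_000_000 / 0.9)
print(min_memory_gb)  # 19, which lands in the <= 24 GB bucket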

Test Plan and Usage Guide

Added unit tests and created Llama 3 8B and CodeLlama endpoints.

@yixu34 (Member) left a comment:

Do we have a use case-level test that verifies this behavior? If there was a prior test that exercised infer_hardware_from_model_name, we can probably just route to that one.

@yixu34 yixu34 requested a review from seanshi-scale May 7, 2024 22:50
LLMInferenceFramework.TENSORRT_LLM: [],
}

# We need a dict so that we can override values when needed
Reviewer (Contributor) commented:

what's the reason we're getting rid of the max_model_len/max_num_batched_tokens args to vllm?

@yunfeng-scale (Contributor, author) replied:

I forgot why we were doing this in the first place, but I'm pretty certain it's not needed in recent versions of vLLM. I also checked most of these models; config.json has the same max_position_embeddings as the values here.

@yunfeng-scale yunfeng-scale enabled auto-merge (squash) May 10, 2024 17:08
@yunfeng-scale yunfeng-scale disabled auto-merge May 10, 2024 17:08
@yixu34 (Member) left a comment:

👀

@yixu34 (Member) left a comment:

  1. Clean up prints
  2. Unrelated changes?

f"Memory calculation result: {min_memory_gb=} for {model_name}, min_kv_cache_size: {min_kv_cache_size}, model_weights_size: {model_weights_size}"
)

if min_memory_gb <= 24:
Reviewer (Contributor) commented:

probably not really an issue, but how well do these map to Azure instance types?

@yunfeng-scale (Contributor, author) replied:

With my limited experience on Azure, they do provide the same set of GPUs. @squeakymouse wdyt?
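
For illustration only, a threshold-based mapping like the one the `if min_memory_gb <= 24:` branch above begins could look like the sketch below; the cutoffs and GPU labels are hypothetical placeholders, not the service's real table.

import math

def choose_gpu(min_memory_gb: int) -> tuple[str, int]:
    # Hypothetical thresholds/labels; A10 = 24 GB card, H100 = 80 GB card.
    if min_memory_gb <= 24:
        return ("nvidia-ampere-a10", 1)
    if min_memory_gb <= 48:
        return ("nvidia-ampere-a10", 2)
    if min_memory_gb <= 80:
        return ("nvidia-hopper-h100", 1)
    return ("nvidia-hopper-h100", math.ceil(min_memory_gb / 80))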

) -> CreateDockerImageBatchJobResourceRequests:
config = llm_artifact_gateway.get_model_config(checkpoint_path)

dtype_size = 2
Reviewer (Contributor) commented:

guess we're gonna handle quantization later?

@yunfeng-scale (Contributor, author) replied:

Quantization can already be handled here, but I chose not to update dtype_size since, in my experience, quantization usually makes things slower rather than faster (at least for bitsandbytes and AWQ), so to achieve the same speed we still need the same number of GPUs.


dtype_size = 2

min_kv_cache_size = (
Reviewer (Contributor) commented:

IIUC we're implicitly setting this to "batch size = 1 and filling up the context window". This feels reasonable, but would it make sense to add a bit of room for a larger batch size? (For something with a shorter context window, e.g. Llama 2, it makes more sense to add some room; maybe less so for Mixtral.)

@yunfeng-scale (Contributor, author) replied:

For all the existing models I tested, batch size = 1 is a good enough default to arrive at a reasonable number of GPUs. Model builders would have thought about the tradeoffs between model size and GPU size.

f"Num shard {num_shards} must be the same as number of GPUs {gpus} for DeepSpeed."
)
if num_shards > gpus:
if num_shards != gpus:
Reviewer (Contributor) commented:

should we just deprecate the num_shards field at this point?

@yunfeng-scale (Contributor, author) replied:

Yes, we should, but I'd prefer not to do that in this PR.

and request.storage is None
):
raise ObjectHasInvalidValueException(
"All hardware spec fields (gpus, gpu_type, cpus, memory, storage) must be provided if any hardware spec field is missing."
Reviewer (Contributor) commented:

nit: "... if any hardware spec field is provided"?

@yunfeng-scale (Contributor, author) replied:

#515 (comment)
I guess it works both ways, since here I'm only allowing two states: either all fields are provided, or none of them are provided.
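
A minimal sketch of that all-or-nothing check, assuming a request object with the five fields named in the error message; the class and helper names are illustrative, not the PR's actual code.

from dataclasses import dataclass
from typing import Optional

@dataclass
class HardwareSpec:  # illustrative stand-in for the request's hardware fields
    gpus: Optional[int] = None
    gpu_type: Optional[str] = None
    cpus: Optional[float] = None
    memory: Optional[str] = None
    storage: Optional[str] = None

def validate_hardware_spec(spec: HardwareSpec) -> None:
    # Only two states are allowed: all fields provided, or none provided.
    provided = [
        f is not None
        for f in (spec.gpus, spec.gpu_type, spec.cpus, spec.memory, spec.storage)
    ]
    if any(provided) and not all(provided):
        raise ValueError(
            "All hardware spec fields (gpus, gpu_type, cpus, memory, storage) "
            "must be provided if any hardware spec field is provided."
        )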

@seanshi-scale (Contributor) left a comment:

had a few questions/nits, but lgtm!

@yunfeng-scale yunfeng-scale merged commit ba68b8d into main May 15, 2024
@yunfeng-scale yunfeng-scale deleted the yunfeng-easy-model-creation branch May 15, 2024 03:45