Proposal
This proposal requests a refactor of the current `InferenceEngine` design.
Related PR: the referenced PR abstracts the attention-layer ops in inference modeling into `AttentionBackend`. However, the `AttentionBackend` should be selected during module initialization, not in the attention forward pass.

Code touched:
- `colossalai/inference/modeling/models/nopadding_llama.py` (line 448 at commit 677cbfa)
- `colossalai/inference/core/engine.py` (line 167 at commit 1b76564): `self.model = self._shardformer(`
I’ve noticed a limitation in our current inference engine related to how parameters from external sources (e.g., the inference engine config) are integrated. As it stands, external parameters can only be passed in during the model sharding process via the `from_native_module` interface. However, `from_native_module` is primarily designed for replacing model layers, so using it to carry configuration violates the Single-Responsibility Principle.
This restricts the flexibility of introducing or adjusting modeling parameters after initialization, since every additional parameter has to be passed as `**kwargs` through `from_native_module`. That is not ideal for several reasons, particularly for predefined configurations that should be fixed early in model setup (e.g., `use_cuda_kernel`, `use_spec_dec`, `use_flash_attn`). These options configure how the `InferenceEngine` selects generation strategies and ops, yet they are currently resolved during the inference modeling `forward` pass.
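To make the limitation concrete, here is a minimal, hypothetical sketch of the current flow; the class name, arguments, and method bodies are illustrative placeholders rather than the actual ColossalAI code. Engine-level flags can only reach a replaced module through the `**kwargs` of `from_native_module`, and the op/strategy choice is then repeated on every forward call:

```python
import torch
from torch import nn


class NoPaddingAttention(nn.Module):
    """Illustrative stand-in for an inference modeling layer (not the real ColossalAI class)."""

    def __init__(self, attn: nn.Module, use_cuda_kernel: bool = False, use_flash_attn: bool = False):
        super().__init__()
        self.attn = attn
        self.use_cuda_kernel = use_cuda_kernel
        self.use_flash_attn = use_flash_attn

    @classmethod
    def from_native_module(cls, module: nn.Module, *args, **kwargs) -> "NoPaddingAttention":
        # Layer replacement and config injection are mixed into one interface:
        # the only channel for engine-level config is these **kwargs.
        return cls(module, **kwargs)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The op/strategy is re-selected on every forward call instead of once at init.
        if self.use_cuda_kernel:
            pass  # would dispatch to a fused CUDA kernel here
        elif self.use_flash_attn:
            pass  # would dispatch to a flash-attention op here
        return self.attn(hidden_states)
```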
Given the above, this proposal suggests two possible solutions:
- Global context object: introduce a global context object whose lifecycle mirrors that of the inference engine. It would allow member properties to be retrieved at any point during inference, providing centralized and consistent configuration management (see the first sketch below).
- InferenceShardformer wrapper: implement a wrapper around the existing shardformer, named `InferenceShardformer`. This class would expose a new interface for parameter passing and could maintain various inference states, improving scalability and flexibility (see the second sketch below).
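A minimal sketch of the first option, assuming a simple module-level holder; the `InferenceContext` name and its fields are hypothetical, and a production version would need to handle multiple engines and teardown:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InferenceContext:
    """Hypothetical engine-lifetime configuration holder (field names are illustrative)."""
    use_cuda_kernel: bool = False
    use_spec_dec: bool = False
    use_flash_attn: bool = False


_CURRENT_CONTEXT: Optional[InferenceContext] = None


def set_inference_context(ctx: InferenceContext) -> None:
    # Called once by the InferenceEngine during its own initialization.
    global _CURRENT_CONTEXT
    _CURRENT_CONTEXT = ctx


def get_inference_context() -> InferenceContext:
    # Called by modules/policies at init time, so ops are chosen once, not per forward.
    if _CURRENT_CONTEXT is None:
        raise RuntimeError("Inference context is not set; initialize the engine first.")
    return _CURRENT_CONTEXT
```

With this in place, a module's `__init__` could read, say, `get_inference_context().use_cuda_kernel` and bind the appropriate attention backend once, instead of branching in `forward`.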
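And a minimal sketch of the second option, assuming the existing `ShardFormer.optimize(model, policy)` entry point; the `InferenceShardformer` class and the `set_inference_config` hook on policies are hypothetical names used for illustration only:

```python
from colossalai.shardformer import ShardConfig, ShardFormer


class InferenceShardformer:
    """Hypothetical wrapper that owns inference-specific config and state (illustrative)."""

    def __init__(self, shard_config: ShardConfig, inference_config) -> None:
        self.shardformer = ShardFormer(shard_config=shard_config)
        self.inference_config = inference_config  # e.g. use_cuda_kernel, use_spec_dec, use_flash_attn
        self.inference_states = {}                # room for additional inference-time state

    def shard_model(self, model, policy):
        # Hand the inference config to the policy once, at shard/init time,
        # instead of threading it through from_native_module(**kwargs).
        if hasattr(policy, "set_inference_config"):
            policy.set_inference_config(self.inference_config)  # hypothetical hook
        sharded_model, shared_params = self.shardformer.optimize(model, policy)
        return sharded_model, shared_params
```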
A PR will follow soon, after discussion with the maintainers.