Proposal
This proposal requests a refactor of the current `InferenceEngine` design.
Related PR: the referenced PR abstracts the attention-layer ops in inference modeling into `AttentionBackend`. However, the `AttentionBackend` should be selected during module initialization, not in the attention forward pass.

Code touched:
- `colossalai/inference/modeling/models/nopadding_llama.py` (line 448 at commit 677cbfa)
- `colossalai/inference/core/engine.py` (line 167 at commit 1b76564): `self.model = self._shardformer(`
I’ve noticed a limitation in our current inference engine related to how parameters from external sources (e.g., the inference engine config) are integrated. As it stands, external parameters can only be passed in during the model sharding process via the `from_native_module` interface. However, `from_native_module` is primarily designed for replacing model layers, so using it to carry configuration violates the Single-Responsibility Principle.
This restricts the flexibility of introducing or adjusting modeling parameters after initialization, since every additional parameter has to be passed as `**kwargs` through `from_native_module`. That is not ideal for several reasons, particularly for predefined configurations that should be fixed early in model setup (e.g., `use_cuda_kernel`, `use_spec_dec`, `use_flash_attn`). These options configure how the `InferenceEngine` selects generation strategies and ops, yet they are currently resolved during the inference modeling `forward` pass.
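To make the limitation concrete, here is a minimal, hypothetical sketch of the current flow; the class name, arguments, and method bodies are illustrative placeholders rather than the actual ColossalAI code. Engine-level flags can only reach a replaced module through the `**kwargs` of `from_native_module`, and the op/strategy choice is then repeated on every forward call:

```python
import torch
from torch import nn


class NoPaddingAttention(nn.Module):
    """Illustrative stand-in for an inference modeling layer (not the real ColossalAI class)."""

    def __init__(self, attn: nn.Module, use_cuda_kernel: bool = False, use_flash_attn: bool = False):
        super().__init__()
        self.attn = attn
        self.use_cuda_kernel = use_cuda_kernel
        self.use_flash_attn = use_flash_attn

    @classmethod
    def from_native_module(cls, module: nn.Module, *args, **kwargs) -> "NoPaddingAttention":
        # Layer replacement and config injection are mixed into one interface:
        # the only channel for engine-level config is these **kwargs.
        return cls(module, **kwargs)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The op/strategy is re-selected on every forward call instead of once at init.
        if self.use_cuda_kernel:
            pass  # would dispatch to a fused CUDA kernel here
        elif self.use_flash_attn:
            pass  # would dispatch to a flash-attention op here
        return self.attn(hidden_states)
```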
Given the above, this proposal suggests two possible solutions:
- Global context object: introduce a global context object whose lifecycle mirrors that of the inference engine. It would allow member properties to be retrieved at any point during inference, providing centralized and consistent configuration management (see the first sketch below).
- InferenceShardformer wrapper: implement a wrapper around the existing shardformer, named `InferenceShardformer`. This class would expose a new interface for parameter passing and could maintain various inference states, improving scalability and flexibility (see the second sketch below).
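A minimal sketch of the first option, assuming a simple module-level holder; the `InferenceContext` name and its fields are hypothetical, and a production version would need to handle multiple engines and teardown:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InferenceContext:
    """Hypothetical engine-lifetime configuration holder (field names are illustrative)."""
    use_cuda_kernel: bool = False
    use_spec_dec: bool = False
    use_flash_attn: bool = False


_CURRENT_CONTEXT: Optional[InferenceContext] = None


def set_inference_context(ctx: InferenceContext) -> None:
    # Called once by the InferenceEngine during its own initialization.
    global _CURRENT_CONTEXT
    _CURRENT_CONTEXT = ctx


def get_inference_context() -> InferenceContext:
    # Called by modules/policies at init time, so ops are chosen once, not per forward.
    if _CURRENT_CONTEXT is None:
        raise RuntimeError("Inference context is not set; initialize the engine first.")
    return _CURRENT_CONTEXT
```

With this in place, a module's `__init__` could read, say, `get_inference_context().use_cuda_kernel` and bind the appropriate attention backend once, instead of branching in `forward`.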
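And a minimal sketch of the second option, assuming the existing `ShardFormer.optimize(model, policy)` entry point; the `InferenceShardformer` class and the `set_inference_config` hook on policies are hypothetical names used for illustration only:

```python
from colossalai.shardformer import ShardConfig, ShardFormer


class InferenceShardformer:
    """Hypothetical wrapper that owns inference-specific config and state (illustrative)."""

    def __init__(self, shard_config: ShardConfig, inference_config) -> None:
        self.shardformer = ShardFormer(shard_config=shard_config)
        self.inference_config = inference_config  # e.g. use_cuda_kernel, use_spec_dec, use_flash_attn
        self.inference_states = {}                # room for additional inference-time state

    def shard_model(self, model, policy):
        # Hand the inference config to the policy once, at shard/init time,
        # instead of threading it through from_native_module(**kwargs).
        if hasattr(policy, "set_inference_config"):
            policy.set_inference_config(self.inference_config)  # hypothetical hook
        sharded_model, shared_params = self.shardformer.optimize(model, policy)
        return sharded_model, shared_params
```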
A PR will follow soon, after discussion with the maintainers.