[Feature Request] Add support for n_cpu_moe parameter in Llama class #120

@KLL535

Description

This parameter allows users to offload the expert layers of MoE models to CPU/RAM while keeping the attention layers on the GPU.

This is incredibly useful for running large MoE models on systems with limited VRAM. Currently, this parameter cannot be passed directly when initializing the Llama class.
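For reference, upstream llama.cpp already exposes this behavior through the `--n-cpu-moe N` CLI flag, which keeps the MoE expert tensors of the first N layers in CPU/RAM. Below is a sketch of how the requested parameter might look in llama-cpp-python; the parameter name `n_cpu_moe` and its semantics are assumed to mirror the CLI flag and are not part of the current Llama API:

```python
from llama_cpp import Llama

# Proposed usage (n_cpu_moe is NOT yet supported; the name and semantics
# are assumed to mirror llama.cpp's --n-cpu-moe flag, which keeps the MoE
# expert tensors of the first N layers in CPU/RAM):
llm = Llama(
    model_path="./models/mixtral-8x7b-instruct.Q4_K_M.gguf",  # any MoE GGUF
    n_gpu_layers=-1,  # offload all layers to the GPU ...
    n_cpu_moe=16,     # ... except the expert tensors of the first 16 layers
)

out = llm("Q: What is a mixture-of-experts model? A:", max_tokens=64)
print(out["choices"][0]["text"])
```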

Thank you!
