Description
This parameter allows users to offload the expert layers of MoE models directly to CPU RAM while keeping the attention layers on the GPU.
This is incredibly useful for running large MoE models on systems with limited VRAM. Currently, this parameter cannot be passed directly when initializing the `Llama` class.
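For reference, upstream llama.cpp already exposes this behavior through CLI flags such as `--cpu-moe` / `--n-cpu-moe` (built on the more general `--override-tensor` mechanism). A sketch of what the requested API might look like in llama-cpp-python, where `n_cpu_moe` is a hypothetical parameter name (it does not exist in the current `Llama` constructor) chosen to mirror the upstream flag:

```python
from llama_cpp import Llama

# Hypothetical API sketch: "n_cpu_moe" is NOT a current Llama parameter;
# it mirrors llama.cpp's --n-cpu-moe flag, which keeps the expert (FFN)
# tensors of the first N layers in system RAM.
llm = Llama(
    model_path="mixtral-8x7b-instruct.Q4_K_M.gguf",  # example model path
    n_gpu_layers=-1,  # attention and dense tensors fully on the GPU
    n_cpu_moe=32,     # hypothetical: expert tensors of 32 layers stay on CPU
)
```

This is only a proposed shape for the feature, not working code; the actual parameter name and semantics would be up to the maintainers.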
Thank you!