From 5cc8615403c790038f3dd2fbaa9b6b17c4abebe5 Mon Sep 17 00:00:00 2001
From: Toluwani Oyewusi
Date: Sun, 7 Sep 2025 19:52:58 -0500
Subject: [PATCH] Update JetMoe model card and placeholder image

---
 docs/source/en/model_doc/jetmoe.md | 123 +++++++++++++++++++++++++----
 1 file changed, 109 insertions(+), 14 deletions(-)

diff --git a/docs/source/en/model_doc/jetmoe.md b/docs/source/en/model_doc/jetmoe.md
index 059fb956ce23..bdbac687e5a8 100644
--- a/docs/source/en/model_doc/jetmoe.md
+++ b/docs/source/en/model_doc/jetmoe.md
@@ -15,25 +15,120 @@ rendered properly in your Markdown viewer.
 -->
 *This model was released on 2023-06-07 and added to Hugging Face Transformers on 2024-05-14.*
 
-# JetMoe
-
-<div class="flex flex-wrap space-x-1">
-<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
-<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
-<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
-</div>
+<div style="float: right;">
+    <div class="flex flex-wrap space-x-1">
+        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
+        <img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
+        <img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
+        <img alt="MoE" src="https://img.shields.io/badge/MoE-eae0c8?style=flat">
+    </div>
+</div>
+
-## Overview
-
-**JetMoe-8B** is an 8B Mixture-of-Experts (MoE) language model developed by [Yikang Shen](https://scholar.google.com.hk/citations?user=qff5rRYAAAAJ) and [MyShell](https://myshell.ai/).
-JetMoe project aims to provide a LLaMA2-level performance and efficient language model with a limited budget.
-To achieve this goal, JetMoe uses a sparsely activated architecture inspired by the [ModuleFormer](https://huggingface.co/papers/2306.04640).
-Each JetMoe block consists of two MoE layers: Mixture of Attention Heads and Mixture of MLP Experts.
-Given the input tokens, it activates a subset of its experts to process them.
-This sparse activation schema enables JetMoe to achieve much better training throughput than similar size dense models.
-The training throughput of JetMoe-8B is around 100B tokens per day on a cluster of 96 H100 GPUs with a straightforward 3-way pipeline parallelism strategy.
+# JetMoe
+
+[JetMoe-8B](https://huggingface.co/papers/2404.07413) is an 8B-parameter Mixture-of-Experts (MoE) language model for efficient text generation. The model activates only a subset of specialized “experts” for each input, which improves computational efficiency while keeping performance comparable to dense models of similar size. Each JetMoe block consists of two MoE layers: Mixture of Attention Heads and Mixture of MLP Experts. This sparse activation scheme enables faster training and inference with fewer resources than standard dense models.
+
+You can find all the original JetMoe checkpoints under the [JetMoe](https://huggingface.co/jetmoe) collection.
+
+> [!TIP]
+> This model was contributed by [Yikang Shen](https://huggingface.co/YikangS).
+>
+> Click on the JetMoe models in the right sidebar for more examples of how to apply JetMoe to different text generation tasks.
+
+The example below demonstrates how to generate text with [`Pipeline`], [`AutoModel`], and from the command line.
+
+<hfoptions id="usage">
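+<hfoption id="Pipeline">
+
+A minimal [`Pipeline`] sketch. It assumes the same checkpoint and generation settings as the [`AutoModel`] example; adjust `dtype` and `device_map` for your hardware.
+
+```py
+import torch
+from transformers import pipeline
+
+# build a text-generation pipeline around the JetMoe checkpoint
+pipeline = pipeline(
+    task="text-generation",
+    model="jetmoe/jetmoe-8b",
+    dtype=torch.float16,
+    device_map="auto"
+)
+pipeline("The stock market rallied today after positive economic news")
+```
+
+</hfoption>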
+<hfoption id="AutoModel">
+
+```py
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("jetmoe/jetmoe-8b")
+model = AutoModelForCausalLM.from_pretrained(
+    "jetmoe/jetmoe-8b",
+    dtype=torch.float16,
+    device_map="auto",
+    attn_implementation="sdpa"
+)
+input_ids = tokenizer("The stock market rallied today after positive economic news", return_tensors="pt").to(model.device)
+
+output = model.generate(**input_ids, cache_implementation="static")
+print(tokenizer.decode(output[0], skip_special_tokens=True))
+```
+
+</hfoption>
+<hfoption id="transformers CLI">
+
+```bash
+echo -e "The stock market rallied today after positive economic news" | transformers run --task text-generation --model jetmoe/jetmoe-8b --device 0
+```
+
+</hfoption>
+</hfoptions>
+
+Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
+
+The example below uses [torchao](../quantization/torchao) to quantize only the weights to int4.
+
+```py
+# pip install torchao
+import torch
+from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
+
+# int4 weight-only quantization with a group size of 128
+quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
+model = AutoModelForCausalLM.from_pretrained(
+    "jetmoe/jetmoe-8b",
+    dtype=torch.bfloat16,
+    device_map="auto",
+    quantization_config=quantization_config
+)
+
+tokenizer = AutoTokenizer.from_pretrained("jetmoe/jetmoe-8b")
+input_ids = tokenizer("The stock market rallied today after positive economic news", return_tensors="pt").to(model.device)
+
+output = model.generate(**input_ids, cache_implementation="static")
+print(tokenizer.decode(output[0], skip_special_tokens=True))
+```
+
+Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/blob/beb9b5b02246b9b7ee81ddf938f93f44cfeaad19/src/transformers/utils/attention_visualizer.py#L139) to better understand what tokens the model can and cannot attend to.
+
+```py
+from transformers.utils.attention_visualizer import AttentionMaskVisualizer
+
+visualizer = AttentionMaskVisualizer("jetmoe/jetmoe-8b")
+visualizer("The stock market rallied today after positive economic news")
+```
+
+<!-- Placeholder image below; upload the final image to https://huggingface.co/datasets/huggingface/documentation-images/tree/main/transformers/model_doc and ping me to merge. -->
+<div class="flex justify-center">
+    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/jetmoe-attn-mask.png"/>
+</div>
+
+## Notes
+
-This model was contributed by [Yikang Shen](https://huggingface.co/YikangS).
+- JetMoe uses sparse expert routing; only a subset of experts is activated per input. The routing setup can be read off the model config, as shown in the sketch after this list.
+- The training throughput of JetMoe-8B is around 100B tokens per day on a cluster of 96 H100 GPUs with a straightforward 3-way pipeline parallelism strategy.
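+
+A minimal sketch for inspecting the sparse-routing setup. The attribute names are assumptions based on [`JetMoeConfig`]; verify them against your installed version.
+
+```py
+from transformers import AutoConfig
+
+config = AutoConfig.from_pretrained("jetmoe/jetmoe-8b")
+print(config.num_local_experts)    # experts available in each MoE layer (assumed attribute name)
+print(config.num_experts_per_tok)  # experts activated per token (assumed attribute name)
+```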
 
 ## JetMoeConfig