diff --git a/README.md b/README.md
index 004fdcfd39..f4584a9e8d 100644
--- a/README.md
+++ b/README.md
@@ -21,8 +21,9 @@
 ## 📣 News and Discussions
-- [04/25/2026][**DeepSeek V4 Flash**](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) We now support finetuning `deepseek-ai/DeepSeek-V4-Flash`, thanks to [@Khazic](https://github.com/khazic). Check out our [recipe](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/llm_finetune/deepseek_v4/deepseek_v4_flash_hellaswag.yaml) and [guide](https://github.com/NVIDIA-NeMo/Automodel/blob/main/docs/guides/llm/dsv4-flash.md).
+- [04/29/2026][**Mistral Medium 3.5**](https://huggingface.co/mistralai/Mistral-Medium-3.5-128B) We now support finetuning `mistralai/Mistral-Medium-3.5-128B`, Mistral AI's FP8-native 128B dense VLM. Check out our [recipe](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/vlm_finetune/mistral3p5/mistral3p5_128b_medpix.yaml) and [guide](https://github.com/NVIDIA-NeMo/Automodel/blob/main/docs/guides/vlm/mistral-medium-3-5.md).
 - [04/28/2026][**Hy3-preview**](https://huggingface.co/tencent/Hy3-preview) We now support finetuning `tencent/Hy3-preview`, thanks to [@Khazic](https://github.com/khazic). Check out our [recipe](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/llm_finetune/hy_v3/hy3_preview_deepep.yaml).
+- [04/25/2026][**DeepSeek V4 Flash**](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) We now support finetuning `deepseek-ai/DeepSeek-V4-Flash`, thanks to [@Khazic](https://github.com/khazic). Check out our [recipe](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/llm_finetune/deepseek_v4/deepseek_v4_flash_hellaswag.yaml) and [guide](https://github.com/NVIDIA-NeMo/Automodel/blob/main/docs/guides/llm/dsv4-flash.md).
 - [04/22/2026][**Qwen3.6-27B**](https://huggingface.co/Qwen/Qwen3.6-27B) We now support finetuning `Qwen/Qwen3.6-27B`. Check out our [recipe](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/vlm_finetune/qwen3_5/qwen3_6_27b.yaml).
 - [04/20/2026][**Qwen-Image**](https://huggingface.co/Qwen/Qwen-Image) We now support finetuning `Qwen/Qwen-Image`, thanks to [@harshareddy832](https://github.com/harshareddy832). Check out our [recipe](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/diffusion/finetune/qwen_image_t2i_flow.yaml).
 - [04/16/2026][**Qwen3.6 MoE**](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) We now support finetuning `Qwen/Qwen3.6-35B-A3B`. Check out our [recipe](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/vlm_finetune/qwen3_5_moe/qwen3_6_35b.yaml).
diff --git a/docs/guides/vlm/mistral-medium-3-5.md b/docs/guides/vlm/mistral-medium-3-5.md
new file mode 100644
index 0000000000..7f93d45b19
--- /dev/null
+++ b/docs/guides/vlm/mistral-medium-3-5.md
@@ -0,0 +1,97 @@
+# Fine-Tune Mistral Medium 3.5 VLM
+
+## Introduction
+
+[Mistral Medium 3.5](https://huggingface.co/mistralai) is Mistral AI's
+new flagship model. It is a **128B dense** transformer that merges
+*Mistral Medium 3.1*, *Magistral Medium*, and *Devstral 2* into a single
+checkpoint with a configurable reasoning mode, supports a **256k-token
+context window**, and serves as the default model in Mistral Vibe and
+Le Chat.
+
+The model ships natively in FP8, which, combined with its dense
+(non-MoE) layout, makes it materially smaller to deploy than comparably
+capable MoE systems: full inference fits in a single H200 node or
+2 × H100 nodes, and the recipe in this guide fine-tunes the full VLM
+end-to-end on 8 × H100 nodes (64 GPUs).
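+
+As a rough sense check on those footprint claims, weights alone come to
+~128 GB in FP8 versus ~256 GB in BF16. The sketch below works through
+that arithmetic; it is an estimate only and ignores KV cache,
+activations, and runtime overhead.
+
+```python
+# Back-of-the-envelope weight-memory estimate for a 128B-parameter dense model.
+# Illustrative only: real deployments also budget KV cache, activations,
+# and framework overhead on top of the raw weights.
+params = 128e9
+
+fp8_weights_gb = params * 1 / 1e9   # FP8  = 1 byte/param  -> ~128 GB
+bf16_weights_gb = params * 2 / 1e9  # BF16 = 2 bytes/param -> ~256 GB
+
+h200_node_gb = 8 * 141  # one H200 node: 1128 GB of HBM
+h100_node_gb = 8 * 80   # one H100-80GB node: 640 GB
+
+print(f"FP8 weights:    ~{fp8_weights_gb:.0f} GB")
+print(f"BF16 weights:   ~{bf16_weights_gb:.0f} GB")
+print(f"one H200 node:   {h200_node_gb} GB HBM")
+print(f"two H100 nodes:  {2 * h100_node_gb} GB HBM")
+```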
+
+**Architecture at a glance**
+
+- 88 Ministral-3 decoder layers (hidden 12288, 96 attention heads, 8 KV
+  heads, GQA), llama-style RoPE + RMSNorm + SwiGLU MLP.
+- Dense — no MoE routing. Compactness vs. MoE peers translates directly
+  into smaller per-GPU memory and easier multi-node sharding.
+- Pixtral vision tower + multi-modal projector for image inputs.
+- FP8 on disk; dequantized to BF16 per local TP shard inside the
+  standard DCP load path.
+
+This guide walks you through fine-tuning Mistral Medium 3.5 on a medical
+Visual Question Answering task using NVIDIA NeMo AutoModel. You will
+learn how to prepare the dataset, launch training on a Slurm cluster,
+and inspect the results.
+
+To set up your environment to run NeMo AutoModel, follow the
+[installation guide](https://github.com/NVIDIA-NeMo/Automodel#-install-nemo-automodel).
+
+## Data
+
+### MedPix-VQA Dataset
+
+We use the [MedPix-VQA](https://huggingface.co/datasets/mmoukouba/MedPix-VQA)
+dataset, a comprehensive medical Visual Question Answering dataset of
+radiological images paired with questions and answers for medical image
+interpretation.
+
+- **20,500 total examples** (85% train / 15% validation)
+- **Columns**: `image_id`, `mode`, `case_id`, `question`, `answer`
+
+For a full walkthrough of how MedPix-VQA is preprocessed and integrated
+into NeMo AutoModel — including the chat-template conversion and collate
+functions — see the
+[Multi-Modal Dataset Guide](https://github.com/NVIDIA-NeMo/Automodel/blob/main/docs/guides/vlm/dataset.md#multi-modal-datasets).
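+
+As a quick orientation, the sketch below loads the dataset with the
+Hugging Face `datasets` library and reproduces an 85/15 split. It is
+illustrative only; the recipe's actual preprocessing (chat-template
+conversion, collate functions) follows the dataset guide linked above.
+
+```python
+# Illustrative only: inspect MedPix-VQA and reproduce an 85/15 split.
+from datasets import load_dataset
+
+ds = load_dataset("mmoukouba/MedPix-VQA", split="train")
+splits = ds.train_test_split(test_size=0.15, seed=42)  # 85% train / 15% validation
+
+sample = splits["train"][0]
+# Columns: image_id, mode, case_id, question, answer
+print(sample["question"], "->", sample["answer"])
+```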
+
+## Launch Training
+
+We provide a ready-to-use recipe at
+[`examples/vlm_finetune/mistral3p5/mistral3p5_128b_medpix.yaml`](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/vlm_finetune/mistral3p5/mistral3p5_128b_medpix.yaml).
+This recipe is configured for **8 nodes × 8 H100-80GB GPUs (64 GPUs total)**
+with TP=8, PP=8, DP=1. The vision tower and multi-modal projector are
+frozen by default, so only the Ministral-3 language model is trained;
+set `freeze_config.freeze_vision_tower: false` to train the vision
+side as well.
+
+NeMo AutoModel supports several ways to launch training — via the
+AutoModel CLI with Slurm, interactive sessions, `torchrun`, and more.
+For full details on all launch options (Slurm batch jobs, multi-node
+configuration, environment variables, etc.), see the
+[Run on a Cluster](https://github.com/NVIDIA-NeMo/Automodel/blob/main/docs/launcher/slurm.md)
+guide.
+
+**Before you start**:
+
+- Hugging Face applies rate limits on downloads. We recommend cloning
+  the model repository to your local filesystem beforehand.
+- Ensure your Hugging Face cache (`HF_HOME`) is configured and that the
+  dataset is already cached locally.
+- To enable Weights & Biases logging, set your `WANDB_API_KEY` and
+  configure the `wandb` section in the YAML file.
+
+
+## Training Results
+
+The recipe produces a healthy initial loss aligned with the HF
+reference forward on matched samples. On MedPix-VQA the first
+optimizer step lands around per-token loss **3.2** and grad-norm
+**~930** (clipped to `max_grad_norm=1.0`), falling below 1.8 within
+a handful of steps. The HF reference forward (single-sample, FP8
+dequantize on-load) on the same first batch produces per-token loss
+**3.47**, confirming the distributed forward is numerically
+equivalent within bf16 + TP-reduction tolerance.
+
+The training loss curves for Mistral Medium 3.5 fine-tuned on
+MedPix-VQA are shown below.
+
+<div align="center">
+  <img src="mistralm35.png" alt="Mistral Medium 3.5 Training Loss Curve" width="600">
+</div>
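+
+If you want to reproduce the reference number yourself, the sketch
+below shows the shape of that single-sample HF reference forward. The
+model ID comes from this guide, but the sample wiring is illustrative:
+a faithful check runs the recipe's own first batch and masks prompt and
+image tokens out of the labels before comparing.
+
+```python
+# Illustrative sanity check: per-token loss from a plain HF forward pass.
+# Not the recipe's exact reference script; sample and prompt are placeholders.
+import torch
+from transformers import AutoModelForImageTextToText, AutoProcessor
+
+model_id = "mistralai/Mistral-Medium-3.5-128B"
+processor = AutoProcessor.from_pretrained(model_id)
+model = AutoModelForImageTextToText.from_pretrained(
+    model_id, torch_dtype=torch.bfloat16, device_map="auto"  # 128B spans several GPUs
+)
+
+messages = [{"role": "user", "content": [
+    {"type": "image", "path": "medpix_sample.png"},  # placeholder sample image
+    {"type": "text", "text": "What imaging modality is shown?"},
+]}]
+inputs = processor.apply_chat_template(
+    messages, tokenize=True, return_dict=True, return_tensors="pt"
+).to(model.device)
+
+# HF averages cross-entropy over label positions, giving a per-token loss;
+# the recipe's real check also masks non-answer tokens first.
+out = model(**inputs, labels=inputs["input_ids"])
+print(f"per-token loss: {out.loss.item():.2f}")
+```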
diff --git a/docs/guides/vlm/mistralm35.png b/docs/guides/vlm/mistralm35.png new file mode 100644 index 0000000000..132b75c287 Binary files /dev/null and b/docs/guides/vlm/mistralm35.png differ diff --git a/docs/index.md b/docs/index.md index d5ca57adb1..f5788e148e 100644 --- a/docs/index.md +++ b/docs/index.md @@ -255,6 +255,7 @@ Gemma 3 / 3n Gemma 4 Qwen3.5-VL Nemotron-Omni +Mistral Medium 3.5 VL Diffusion Fine-Tuning dLLM Fine-Tuning QAT diff --git a/docs/model-coverage/latest-models.md b/docs/model-coverage/latest-models.md index dcd27080f8..8cc67287bc 100644 --- a/docs/model-coverage/latest-models.md +++ b/docs/model-coverage/latest-models.md @@ -6,6 +6,7 @@ See the [Model Coverage Overview](overview.md) for release summaries, and the [L | Date | Model | HF Model ID | Modality | Recipe | Try on Brev | |------|-------|-------------|----------|--------|------| +| 2026-04-29 | Mistral Medium 3.5 | [`mistralai/Mistral-Medium-3.5-128B`](https://huggingface.co/mistralai/Mistral-Medium-3.5-128B) | VLM | [mistral3p5_128b_medpix.yaml](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/vlm_finetune/mistral3p5/mistral3p5_128b_medpix.yaml) | 🚧 | | 2026-04-28 | Hy3-preview | [`tencent/Hy3-preview`](https://huggingface.co/tencent/Hy3-preview) | LLM | [hy3_preview_deepep.yaml](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/llm_finetune/hy_v3/hy3_preview_deepep.yaml) | 🚧 | | 2026-04-25 | DeepSeek V4 Flash | [`deepseek-ai/DeepSeek-V4-Flash`](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) | LLM | [deepseek_v4_flash_hellaswag.yaml](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/llm_finetune/deepseek_v4/deepseek_v4_flash_hellaswag.yaml) | 🚧 | | 2026-04-22 | Qwen3.6-27B | [`Qwen/Qwen3.6-27B`](https://huggingface.co/Qwen/Qwen3.6-27B) | VLM | [qwen3_6_27b.yaml](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/vlm_finetune/qwen3_5/qwen3_6_27b.yaml) | 🚧 | diff --git a/docs/model-coverage/vlm/index.md b/docs/model-coverage/vlm/index.md index fa99e0d40f..54157cf8a8 100644 --- a/docs/model-coverage/vlm/index.md +++ b/docs/model-coverage/vlm/index.md @@ -31,6 +31,7 @@ NeMo AutoModel supports [AutoModelForImageTextToText](https://huggingface.co/doc | NVIDIA | [Nemotron-Parse](nvidia/nemotron-parse.md) | `NemotronParseForConditionalGeneration` | | Mistral AI | [Ministral3 VL](mistralai/ministral3-vl.md) | `Mistral3ForConditionalGeneration` | | Mistral AI | [Mistral-Small-4](mistralai/mistral-small-4.md) | `MistralForConditionalGeneration` | +| Mistral AI | [Mistral Medium 3.5](mistralai/mistral-medium-3-5.md) | `Mistral3ForConditionalGeneration` (FP8) | | InternLM / Shanghai AI Lab | [InternVL](internlm/internvl.md) | `InternVLForConditionalGeneration` | | Meta | [Llama 4](meta/llama4.md) | `Llama4ForConditionalGeneration` | | HuggingFace | [SmolVLM](huggingface/smolvlm.md) | `SmolVLMForConditionalGeneration` | @@ -57,6 +58,7 @@ qwen/qwen3-5-vl nvidia/nemotron-parse mistralai/ministral3-vl mistralai/mistral-small-4 +mistralai/mistral-medium-3-5 internlm/internvl meta/llama4 huggingface/smolvlm diff --git a/docs/model-coverage/vlm/mistralai/mistral-medium-3-5.md b/docs/model-coverage/vlm/mistralai/mistral-medium-3-5.md new file mode 100644 index 0000000000..1650d2a4a0 --- /dev/null +++ b/docs/model-coverage/vlm/mistralai/mistral-medium-3-5.md @@ -0,0 +1,158 @@ +# Mistral Medium 3.5 + +[Mistral Medium 3.5](https://huggingface.co/mistralai) is Mistral AI's +flagship **128B dense** model that merges instruction-following, reasoning, +and coding into a single 
checkpoint with a configurable reasoning mode.
+It unifies the lineage of *Mistral Medium 3.1*, *Magistral Medium*, and
+*Devstral 2* into one model, and ships natively in FP8 (per-tensor
+`weight_scale_inv`) so the full model fits inside an H200 node or 2 ×
+H100 nodes — a notable footprint advantage over comparably capable
+Mixture-of-Experts (MoE) systems.
+
+:::{card}
+| | |
+|---|---|
+| **Task** | Image-Text-to-Text |
+| **Architecture** | `Mistral3ForConditionalGeneration` (Pixtral vision tower + dense Ministral-3 text decoder) |
+| **Parameters** | 128B (dense, FP8 on disk) |
+| **Context Window** | 256k tokens |
+| **Languages** | 40+ (English, French, Spanish, German, Russian, Chinese, Japanese, Italian, Portuguese, Arabic, Hindi, Korean, plus Indic / Nordic / Eastern European tail) |
+| **License** | Modified MIT (open-weights, ≤ $20M annual revenue threshold) |
+| **HF Org** | [mistralai](https://huggingface.co/mistralai) |
+:::
+
+## Architecture
+
+Mistral Medium 3.5 is a **dense** transformer — no MoE routing — built on
+the same text backbone as
+[`mistralai/Devstral-2-123B-Instruct-2512`](https://huggingface.co/mistralai/Devstral-2-123B-Instruct-2512):
+88 Ministral-3 decoder layers (hidden 12288, 96 attention heads,
+8 KV heads, GQA) with the standard llama-style RoPE + RMSNorm + SwiGLU
+MLP layout. The multimodal variant adds a Pixtral vision tower and
+multi-modal projector on top, making it an
+`AutoModelForImageTextToText` checkpoint.
+
+Compared with MoE models of similar capability, the dense layout
+trades sparse-activation throughput for a substantially smaller
+deployment footprint — relevant when you want to fine-tune or serve
+the model on a single node.
+
+## Key Strengths
+
+- **Compactness.** Dense 128B fits in fewer GPUs than the comparable
+  MoE class — a single H200 node or 2 × H100 nodes for inference.
+- **Configurable reasoning mode.** One checkpoint covers chat,
+  agentic, and reasoning workloads; the reasoning mode is toggled at
+  inference time.
+- **Strong agentic performance.** Competitive on tool-use and
+  decision-making benchmarks; suitable as a base for connector-driven
+  agent workflows.
+- **Long context.** 256k-token window for document parsing and
+  research-assistant use cases.
+
+Trade-offs disclosed in the model card: weaker non-agentic benchmark
+performance and more verbose outputs than some closed-source
+competitors.
+
+## Use Cases
+
+- Agentic workflows with connectors
+- Cloud and local async coding
+- Document parsing (multimodal — text + image)
+- Research assistants
+- General chat
+- Base model for downstream fine-tuning
+
+## Available Models
+
+- **Mistral-Medium-3.5 128B**
+
+## Class
+
+- HF: `Mistral3ForConditionalGeneration`
+- NeMo AutoModel custom: `Mistral3FP8VLMForConditionalGeneration`
+  ([source](https://github.com/NVIDIA-NeMo/Automodel/blob/main/nemo_automodel/components/models/mistral3_vlm/model.py))
+
+The custom class extends HF's `Mistral3ForConditionalGeneration` and
+attaches a `Mistral3FP8StateDictAdapter.for_vlm_full()` so the FP8
+checkpoint dequantizes per-shard inside the standard DCP load — the
+full BF16 model is never materialized on a single rank, allowing TP+PP
+training to fit on H100-80GB.
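+
+Conceptually, the per-shard dequantize looks like the sketch below. This
+is a simplified illustration, not the actual adapter code; it assumes
+the common FP8-checkpoint convention in which each FP8 weight tensor is
+stored alongside a per-tensor scale (here `weight_scale_inv`) that is
+multiplied back in on load.
+
+```python
+# Simplified sketch of per-shard FP8 -> BF16 dequantization on load.
+# Function name and scale convention are illustrative, not the real
+# Mistral3FP8StateDictAdapter implementation.
+import torch
+
+def dequantize_shard(weight_fp8: torch.Tensor, scale_inv: torch.Tensor) -> torch.Tensor:
+    """Dequantize one local TP shard; the full BF16 model is never assembled."""
+    return (weight_fp8.to(torch.float32) * scale_inv.to(torch.float32)).to(torch.bfloat16)
+
+# Example: one local shard of an FP8 weight plus its scalar scale.
+shard = torch.randn(1024, 1536).to(torch.float8_e4m3fn)
+scale = torch.tensor(0.02)
+bf16_shard = dequantize_shard(shard, scale)
+print(bf16_shard.dtype, bf16_shard.shape)  # torch.bfloat16, (1024, 1536)
+```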
+## Example HF Models
+
+| Model | HF ID |
+|---|---|
+| Mistral Medium 3.5 128B | [`mistralai/Mistral-Medium-3.5-128B`](https://huggingface.co/mistralai/Mistral-Medium-3.5-128B) |
+
+## Example Recipes
+
+| Recipe | Dataset | Description |
+|---|---|---|
+| {download}`mistral3p5_128b_medpix.yaml <../../../../examples/vlm_finetune/mistral3p5/mistral3p5_128b_medpix.yaml>` | MedPix-VQA | SFT — Mistral Medium 3.5 128B on MedPix, 8 nodes × 8 GPUs (TP=8 PP=8) |
+
+
+## Try with NeMo AutoModel
+
+**1. Install** ([full instructions](../../../guides/installation.md)):
+
+```bash
+pip install nemo-automodel
+```
+
+**2. Clone the repo** to get the example recipes:
+
+```bash
+git clone https://github.com/NVIDIA-NeMo/Automodel.git
+cd Automodel
+```
+
+:::{note}
+This recipe was validated on **8 nodes × 8 GPUs (64 H100s)** with
+TP=8 PP=8 DP=1. See the [Launcher Guide](../../../launcher/slurm.md)
+for multi-node setup. Inference or smaller-scale fine-tuning fits in
+**1 × H200 node** or **2 × H100 nodes** thanks to the dense + FP8 layout.
+:::
+
+**3. Run the recipe** via Slurm (see the
+[fine-tuning guide](../../../guides/vlm/mistral-medium-3-5.md) for a
+complete launch script):
+
+```bash
+sbatch your_slurm_script.sub
+```
+
+:::{dropdown} Run with Docker
+**1. Pull the container** and mount a checkpoint directory:
+
+```bash
+docker run --gpus all -it --rm \
+  --shm-size=8g \
+  -v $(pwd)/checkpoints:/opt/Automodel/checkpoints \
+  nvcr.io/nvidia/nemo-automodel:26.02.00
+```
+
+**2. Navigate to the AutoModel directory** (where the recipes are):
+
+```bash
+cd /opt/Automodel
+```
+
+**3. Run the recipe**:
+
+```bash
+automodel --nproc-per-node=8 examples/vlm_finetune/mistral3p5/mistral3p5_128b_medpix.yaml
+```
+:::
+
+See the [Installation Guide](../../../guides/installation.md) and the
+[Mistral Medium 3.5 Fine-Tuning Guide](../../../guides/vlm/mistral-medium-3-5.md).
+
+## Fine-Tuning
+
+See the [Mistral Medium 3.5 Fine-Tuning Guide](../../../guides/vlm/mistral-medium-3-5.md).
+
+## Hugging Face Model Cards
+
+- [mistralai](https://huggingface.co/mistralai)
+- Related architecture: [`mistralai/Devstral-2-123B-Instruct-2512`](https://huggingface.co/mistralai/Devstral-2-123B-Instruct-2512)