3 changes: 2 additions & 1 deletion README.md
@@ -21,8 +21,9 @@
</div>

## 📣 News and Discussions
- [04/29/2026][**Mistral Medium 3.5**](https://huggingface.co/mistralai/Mistral-Medium-3.5-128B) We now support finetuning Mistral AI's 128B FP8-native VLM Mistral Medium 3.5. Check out our [recipe](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/vlm_finetune/mistral3p5/mistral3p5_128b_medpix.yaml) and [guide](https://github.com/NVIDIA-NeMo/Automodel/blob/main/docs/guides/vlm/mistral-medium-3-5.md).
- [04/28/2026][**Hy3-preview**](https://huggingface.co/tencent/Hy3-preview) We now support finetuning `tencent/Hy3-preview`, thanks to [@Khazic](https://github.com/khazic). Check out our [recipe](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/llm_finetune/hy_v3/hy3_preview_deepep.yaml).
- [04/25/2026][**DeepSeek V4 Flash**](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) We now support finetuning `deepseek-ai/DeepSeek-V4-Flash`, thanks to [@Khazic](https://github.com/khazic). Check out our [recipe](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/llm_finetune/deepseek_v4/deepseek_v4_flash_hellaswag.yaml) and [guide](https://github.com/NVIDIA-NeMo/Automodel/blob/main/docs/guides/llm/dsv4-flash.md).
- [04/22/2026][**Qwen3.6-27B**](https://huggingface.co/Qwen/Qwen3.6-27B) We now support finetuning `Qwen/Qwen3.6-27B`. Check out our [recipe](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/vlm_finetune/qwen3_5/qwen3_6_27b.yaml).
- [04/20/2026][**Qwen-Image**](https://huggingface.co/Qwen/Qwen-Image) We now support finetuning `Qwen/Qwen-Image`, thanks to [@harshareddy832](https://github.com/harshareddy832). Check out our [recipe](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/diffusion/finetune/qwen_image_t2i_flow.yaml).
- [04/16/2026][**Qwen3.6 MoE**](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) We now support finetuning `Qwen/Qwen3.6-35B-A3B`. Check out our [recipe](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/vlm_finetune/qwen3_5_moe/qwen3_6_35b.yaml).
97 changes: 97 additions & 0 deletions docs/guides/vlm/mistral-medium-3-5.md
@@ -0,0 +1,97 @@
# Fine-Tune Mistral Medium 3.5 VLM

## Introduction

[Mistral Medium 3.5](https://huggingface.co/mistralai) is Mistral AI's
new flagship model. It is a **128B dense** transformer that merges
*Mistral Medium 3.1*, *Magistral Medium*, and *Devstral 2* into a single
checkpoint with a configurable reasoning mode, supports a **256k-token
context window**, and serves as the default model in Mistral Vibe and
Le Chat.

The model ships natively in FP8, which, combined with its dense
(non-MoE) layout, makes it materially smaller to deploy than comparably
capable MoE systems: full inference fits in a single H200 node or
2 × H100 nodes, and the recipe in this guide fine-tunes the full VLM
end-to-end on 8 × H100 nodes (64 GPUs).

**Architecture at a glance**

- 88 Ministral-3 decoder layers (hidden 12288, 96 attention heads, 8 KV
heads, GQA), Llama-style RoPE + RMSNorm + SwiGLU MLP.
- Dense, with no MoE routing; compared with MoE peers, this translates
directly into smaller per-GPU memory and easier multi-node sharding.
- Pixtral vision tower + multi-modal projector for image inputs.
- FP8 on disk; dequantized to BF16 per local TP shard inside the
standard DCP load path.

This guide walks you through fine-tuning Mistral Medium 3.5 on a medical
Visual Question Answering task using NVIDIA NeMo AutoModel. You will
learn how to prepare the dataset, launch training on a Slurm cluster,
and inspect the results.

To set up your environment to run NeMo AutoModel, follow the
[installation guide](https://github.com/NVIDIA-NeMo/Automodel#-install-nemo-automodel).

## Data

### MedPix-VQA Dataset

We use the [MedPix-VQA](https://huggingface.co/datasets/mmoukouba/MedPix-VQA)
dataset, a comprehensive medical Visual Question Answering dataset
containing radiological images paired with question-answer pairs for
medical image interpretation.

- **20,500 total examples** (85% train / 15% validation)
- **Columns**: `image_id`, `mode`, `case_id`, `question`, `answer`

For a full walkthrough of how MedPix-VQA is preprocessed and integrated
into NeMo AutoModel — including the chat-template conversion and collate
functions — see the
[Multi-Modal Dataset Guide](https://github.com/NVIDIA-NeMo/Automodel/blob/main/docs/guides/vlm/dataset.md#multi-modal-datasets).
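
To make the shape of that pipeline concrete, here is a minimal sketch
of the conversion, assuming the standard Hugging Face `datasets` API
and a generic chat-message layout. The `question`/`answer` columns are
from the dataset card above; the exact message schema the collate
functions expect lives in the dataset guide.

```python
from datasets import load_dataset

# Pull MedPix-VQA from the Hugging Face Hub (served from the local cache
# when HF_HOME is configured, as recommended below).
ds = load_dataset("mmoukouba/MedPix-VQA", split="train")

def to_chat_example(sample):
    # Map one row (question/answer columns) to a chat-style VQA example.
    # The message layout here is illustrative; see the dataset guide for
    # the exact schema the collate functions expect.
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image"},  # image tensor is attached by the collate fn
                    {"type": "text", "text": sample["question"]},
                ],
            },
            {
                "role": "assistant",
                "content": [{"type": "text", "text": sample["answer"]}],
            },
        ]
    }

chat_ds = ds.map(to_chat_example)
```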

## Launch Training

We provide a ready-to-use recipe at
[`examples/vlm_finetune/mistral3p5/mistral3p5_128b_medpix.yaml`](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/vlm_finetune/mistral3p5/mistral3p5_128b_medpix.yaml).
This recipe is configured for **8 nodes × 8 H100-80GB GPUs (64 GPUs total)**
with TP=8, PP=8, DP=1. The vision tower and multi-modal projector are
frozen by default, so only the Ministral-3 language model is trained;
set `freeze_config.freeze_vision_tower: false` to train the vision
side as well (see the excerpt below).
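
For orientation, the parallelism and freezing knobs described above
would appear in the recipe roughly as follows. This is an illustrative
excerpt, not a copy of the recipe file; apart from
`freeze_config.freeze_vision_tower`, the key names are assumptions.

```yaml
# Illustrative excerpt (assumed key names) -- consult the recipe file
# for the authoritative layout.
freeze_config:
  freeze_vision_tower: true   # set to false to also train the Pixtral vision tower
distributed:                  # assumed section name
  tensor_parallel_size: 8     # TP=8
  pipeline_parallel_size: 8   # PP=8, with DP=1 across 64 GPUs
```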

NeMo AutoModel supports several ways to launch training — via the
AutoModel CLI with Slurm, interactive sessions, `torchrun`, and more.
For full details on all launch options (Slurm batch jobs, multi-node
configuration, environment variables, etc.), see the
[Run on a Cluster](https://github.com/NVIDIA-NeMo/Automodel/blob/main/docs/launcher/slurm.md)
guide.
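
As a concrete reference, a single-node interactive launch through the
AutoModel CLI follows the same pattern used elsewhere in these docs;
the full 64-GPU run needs the Slurm setup from the guide linked above.

```bash
# Interactive single-node launch (8 GPUs) via the AutoModel CLI.
# Note: the recipe as shipped targets 8 nodes (TP=8, PP=8); for a
# single-node smoke test you would first shrink those settings in the YAML.
automodel --nproc-per-node=8 \
  examples/vlm_finetune/mistral3p5/mistral3p5_128b_medpix.yaml
```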


**Before you start**:

- Hugging Face applies rate limits on downloads. We recommend cloning
the model repository to your local filesystem beforehand.
- Ensure your Hugging Face cache (`HF_HOME`) is configured and that the
dataset is already cached locally.
- To enable Weights & Biases logging, set your `WANDB_API_KEY` and
configure the `wandb` section in the YAML file. A minimal environment
setup is sketched below.
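
Putting those three items together, a minimal environment setup might
look like this (paths are illustrative; adjust to your cluster):

```bash
# Point the HF cache at a filesystem visible to all nodes.
export HF_HOME=/shared/hf-cache

# Pre-download the model and dataset to sidestep rate limits at job start.
huggingface-cli download mistralai/Mistral-Medium-3.5-128B
huggingface-cli download mmoukouba/MedPix-VQA --repo-type dataset

# Optional: enable Weights & Biases logging (also configure `wandb` in the YAML).
export WANDB_API_KEY=<your-key>
```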


## Training Results

The recipe produces a healthy initial loss, aligned with the HF
reference forward on matched samples. On MedPix-VQA, the first
optimizer step lands around a per-token loss of **3.2** with grad-norm
**~930** (clipped to `max_grad_norm=1.0`), and the loss descends below
1.8 within a handful of steps. The HF reference forward (single-sample,
FP8 dequantize on-load) on the same first batch produces a per-token
loss of **3.47**, confirming the distributed forward is numerically
equivalent within bf16 + TP-reduction tolerance.
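
If you want to reproduce that parity check, a per-token loss in the
sense used above can be computed from a reference forward roughly as
follows (a sketch; the model forward and batch construction are
omitted, and `-100` label masking is assumed):

```python
import torch
import torch.nn.functional as F

def per_token_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Mean cross-entropy over unmasked tokens (labels use -100 for masked positions).

    logits: [batch, seq, vocab]; labels: [batch, seq].
    """
    # Shift so position t predicts token t+1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,  # prompt/padding tokens do not contribute
    )
```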

The training loss curves for Mistral Medium 3.5 fine-tuned on
MedPix-VQA are shown below.

<p align="center">
<img src="mistralm35.png" alt="Mistral Medium 3.5 Training Loss Curve" width="500">
</p>
Binary file added docs/guides/vlm/mistralm35.png
1 change: 1 addition & 0 deletions docs/index.md
@@ -255,6 +255,7 @@ Gemma 3 / 3n <guides/omni/gemma3-3n.md>
Gemma 4 <guides/vlm/gemma4.md>
Qwen3.5-VL <guides/vlm/qwen3-5.md>
Nemotron-Omni <guides/vlm/nemotron-omni.md>
Mistral Medium 3.5 VL <guides/vlm/mistral-medium-3-5.md>
Diffusion Fine-Tuning <guides/diffusion/finetune.md>
dLLM Fine-Tuning <guides/dllm/finetune.md>
QAT <guides/quantization-aware-training.md>
1 change: 1 addition & 0 deletions docs/model-coverage/latest-models.md
@@ -6,6 +6,7 @@ See the [Model Coverage Overview](overview.md) for release summaries, and the [L

| Date | Model | HF Model ID | Modality | Recipe | Try on Brev |
|------|-------|-------------|----------|--------|------|
| 2026-04-29 | Mistral Medium 3.5 | [`mistralai/Mistral-Medium-3.5-128B`](https://huggingface.co/mistralai/Mistral-Medium-3.5-128B) | VLM | [mistral3p5_128b_medpix.yaml](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/vlm_finetune/mistral3p5/mistral3p5_128b_medpix.yaml) | 🚧 |
| 2026-04-28 | Hy3-preview | [`tencent/Hy3-preview`](https://huggingface.co/tencent/Hy3-preview) | LLM | [hy3_preview_deepep.yaml](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/llm_finetune/hy_v3/hy3_preview_deepep.yaml) | 🚧 |
| 2026-04-25 | DeepSeek V4 Flash | [`deepseek-ai/DeepSeek-V4-Flash`](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) | LLM | [deepseek_v4_flash_hellaswag.yaml](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/llm_finetune/deepseek_v4/deepseek_v4_flash_hellaswag.yaml) | 🚧 |
| 2026-04-22 | Qwen3.6-27B | [`Qwen/Qwen3.6-27B`](https://huggingface.co/Qwen/Qwen3.6-27B) | VLM | [qwen3_6_27b.yaml](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/vlm_finetune/qwen3_5/qwen3_6_27b.yaml) | 🚧 |
2 changes: 2 additions & 0 deletions docs/model-coverage/vlm/index.md
@@ -31,6 +31,7 @@ NeMo AutoModel supports [AutoModelForImageTextToText](https://huggingface.co/doc
| NVIDIA | [Nemotron-Parse](nvidia/nemotron-parse.md) | `NemotronParseForConditionalGeneration` |
| Mistral AI | [Ministral3 VL](mistralai/ministral3-vl.md) | `Mistral3ForConditionalGeneration` |
| Mistral AI | [Mistral-Small-4](mistralai/mistral-small-4.md) | `MistralForConditionalGeneration` |
| Mistral AI | [Mistral Medium 3.5](mistralai/mistral-medium-3-5.md) | `Mistral3ForConditionalGeneration` (FP8) |
| InternLM / Shanghai AI Lab | [InternVL](internlm/internvl.md) | `InternVLForConditionalGeneration` |
| Meta | [Llama 4](meta/llama4.md) | `Llama4ForConditionalGeneration` |
| HuggingFace | [SmolVLM](huggingface/smolvlm.md) | `SmolVLMForConditionalGeneration` |
@@ -57,6 +58,7 @@ qwen/qwen3-5-vl
nvidia/nemotron-parse
mistralai/ministral3-vl
mistralai/mistral-small-4
mistralai/mistral-medium-3-5
internlm/internvl
meta/llama4
huggingface/smolvlm
158 changes: 158 additions & 0 deletions docs/model-coverage/vlm/mistralai/mistral-medium-3-5.md
@@ -0,0 +1,158 @@
# Mistral Medium 3.5

[Mistral Medium 3.5](https://huggingface.co/mistralai) is Mistral AI's
flagship **128B dense** model that merges instruction-following, reasoning,
and coding into a single checkpoint with a configurable reasoning mode.
It unifies the lineage of *Mistral Medium 3.1*, *Magistral Medium*, and
*Devstral 2* into one model, and ships natively in FP8 (per-tensor
`weight_scale_inv`), so the full model fits inside a single H200 node or
2 × H100 nodes, a notable footprint advantage over comparably capable
Mixture-of-Experts (MoE) systems.

:::{card}
| | |
|---|---|
| **Task** | Image-Text-to-Text |
| **Architecture** | `Mistral3ForConditionalGeneration` (Pixtral vision tower + dense Ministral-3 text decoder) |
| **Parameters** | 128B (dense, FP8 on disk) |
| **Context Window** | 256k tokens |
| **Languages** | 40+ (English, French, Spanish, German, Russian, Chinese, Japanese, Italian, Portuguese, Arabic, Hindi, Korean, plus Indic / Nordic / Eastern European tail) |
| **License** | Modified MIT (open-weights, ≤ $20M annual revenue threshold) |
| **HF Org** | [mistralai](https://huggingface.co/mistralai) |
:::

## Architecture

Mistral Medium 3.5 is a **dense** transformer (no MoE routing) built on
the same text backbone as
[`mistralai/Devstral-2-123B-Instruct-2512`](https://huggingface.co/mistralai/Devstral-2-123B-Instruct-2512):
88 Ministral-3 decoder layers (hidden 12288, 96 attention heads,
8 KV heads, GQA) with the standard Llama-style RoPE + RMSNorm + SwiGLU
MLP layout. The multimodal variant adds a Pixtral vision tower and a
multi-modal projector on top, making it an
`AutoModelForImageTextToText` checkpoint.
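
For a quick local inspection of the architecture, the checkpoint can be
loaded through the standard image-text-to-text auto class. A minimal
sketch, assuming the `mistralai/Mistral-Medium-3.5-128B` repo ID used
elsewhere on this page and enough GPU memory to shard the model:

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "mistralai/Mistral-Medium-3.5-128B"

processor = AutoProcessor.from_pretrained(model_id)
# A 128B checkpoint does not fit on one GPU; device_map shards it across
# all visible devices. BF16 is the working precision after FP8 dequantize.
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
print(model.config)  # hidden_size, num_hidden_layers, etc.
```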

Compared with MoE models of similar capability, the dense layout
trades sparse-activation throughput for a substantially smaller
deployment footprint — relevant when you want to fine-tune or serve
the model on a single node.

## Key Strengths

- **Compactness.** The dense 128B model fits on fewer GPUs than the
comparable MoE class: a single H200 node or 2 × H100 nodes for inference.
- **Configurable reasoning mode.** One checkpoint covers chat,
agentic, and reasoning workloads; the reasoning mode is toggled at
inference time.
- **Strong agentic performance.** Competitive on tool-use and
decision-making benchmarks; suitable as a base for connector-driven
agent workflows.
- **Long context.** 256k-token window for document parsing and
research-assistant use cases.

Trade-offs disclosed in the model card: weaker non-agentic benchmark
performance and more verbose outputs than some closed-source
competitors.

## Use Cases

- Agentic workflows with connectors
- Cloud and local async coding
- Document parsing (multimodal — text + image)
- Research assistants
- General chat
- Base model for downstream fine-tuning

## Available Models

- **Mistral-Medium-3.5 128B**

## Class

- HF: `Mistral3ForConditionalGeneration`
- NeMo AutoModel custom: `Mistral3FP8VLMForConditionalGeneration`
([source](https://github.com/NVIDIA-NeMo/Automodel/blob/main/nemo_automodel/components/models/mistral3_vlm/model.py))

The custom class extends HF's `Mistral3ForConditionalGeneration` and
attaches a `Mistral3FP8StateDictAdapter.for_vlm_full()` so the FP8
checkpoint dequantizes per shard inside the standard DCP load; the
full BF16 model is never materialized on a single rank, which allows
TP+PP training to fit on H100-80GB. A conceptual sketch of the
dequantization step follows below.
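
Conceptually, the per-tensor dequantization the adapter applies to each
local shard reduces to the following. This is a simplified sketch that
assumes the convention where the stored `weight_scale_inv` multiplies
the FP8 payload; the real adapter additionally rewrites sharded
state-dict keys inside the DCP load path.

```python
import torch

def dequantize_fp8_shard(weight_fp8: torch.Tensor,
                         weight_scale_inv: torch.Tensor) -> torch.Tensor:
    # Upcast the FP8 payload, apply the per-tensor scale, then settle on BF16.
    # Applied per local TP shard, so the full BF16 model never exists on one rank.
    return (weight_fp8.to(torch.float32) * weight_scale_inv).to(torch.bfloat16)
```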

## Example HF Models

| Model | HF ID |
|---|---|
| Mistral Medium 3.5 128B | [`mistralai/Mistral-Medium-3.5-128B`](https://huggingface.co/mistralai/Mistral-Medium-3.5-128B) |

## Example Recipes

| Recipe | Dataset | Description |
|---|---|---|
| {download}`mistral3p5_128b_medpix.yaml <../../../../examples/vlm_finetune/mistral3p5/mistral3p5_128b_medpix.yaml>` | MedPix-VQA | SFT — Mistral Medium 3.5 128B on MedPix, 8 nodes × 8 GPUs (TP=8 PP=8) |


## Try with NeMo AutoModel

**1. Install** ([full instructions](../../../guides/installation.md)):

```bash
pip install nemo-automodel
```

**2. Clone the repo** to get the example recipes:

```bash
git clone https://github.com/NVIDIA-NeMo/Automodel.git
cd Automodel
```

:::{note}
This recipe was validated on **8 nodes × 8 GPUs (64 H100s)** with
TP=8, PP=8, DP=1. See the [Launcher Guide](../../../launcher/slurm.md)
for multi-node setup. Inference and small-scale fine-tuning fit in
**1 × H200** node or **2 × H100** nodes thanks to the dense FP8 layout.
:::

**3. Run the recipe** via Slurm (see the
[fine-tuning guide](../../../guides/vlm/mistral-medium-3-5.md) for a
complete launch script):

```bash
sbatch your_slurm_script.sub
```

:::{dropdown} Run with Docker
**1. Pull the container** and mount a checkpoint directory:

```bash
docker run --gpus all -it --rm \
--shm-size=8g \
-v $(pwd)/checkpoints:/opt/Automodel/checkpoints \
nvcr.io/nvidia/nemo-automodel:26.02.00
```

**2. Navigate** to the AutoModel directory (where the recipes are):

```bash
cd /opt/Automodel
```

**3. Run the recipe**:

```bash
automodel --nproc-per-node=8 examples/vlm_finetune/mistral3p5/mistral3p5_128b_medpix.yaml
```
:::

See the [Installation Guide](../../../guides/installation.md) and the
[Mistral Medium 3.5 Fine-Tuning Guide](../../../guides/vlm/mistral-medium-3-5.md).

## Fine-Tuning

See the [Mistral Medium 3.5 Fine-Tuning Guide](../../../guides/vlm/mistral-medium-3-5.md).

## Hugging Face Model Cards

- [mistralai](https://huggingface.co/mistralai)
- Related architecture: [`mistralai/Devstral-2-123B-Instruct-2512`](https://huggingface.co/mistralai/Devstral-2-123B-Instruct-2512)