3 changes: 2 additions & 1 deletion README.md
@@ -21,8 +21,9 @@
</div>

## 📣 News and Discussions
- [04/29/2026][**Mistral Medium 3.5**](https://huggingface.co/mistralai/Mistral-Medium-3.5-128B) We now support finetuning Mistral AI's 128B FP8-native VLM Mistral Medium 3.5. Check out our [recipe](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/vlm_finetune/mistral3p5/mistral3p5_128b_medpix.yaml) and [guide](https://github.com/NVIDIA-NeMo/Automodel/blob/main/docs/guides/vlm/mistral-medium-3-5.md).
- [04/28/2026][**Hy3-preview**](https://huggingface.co/tencent/Hy3-preview) We now support finetuning `tencent/Hy3-preview`, thanks to [@Khazic](https://github.com/khazic). Check out our [recipe](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/llm_finetune/hy_v3/hy3_preview_deepep.yaml).
- [04/25/2026][**DeepSeek V4 Flash**](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) We now support finetuning `deepseek-ai/DeepSeek-V4-Flash`, thanks to [@Khazic](https://github.com/khazic). Check out our [recipe](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/llm_finetune/deepseek_v4/deepseek_v4_flash_hellaswag.yaml) and [guide](https://github.com/NVIDIA-NeMo/Automodel/blob/main/docs/guides/llm/dsv4-flash.md).
- [04/22/2026][**Qwen3.6-27B**](https://huggingface.co/Qwen/Qwen3.6-27B) We now support finetuning `Qwen/Qwen3.6-27B`. Check out our [recipe](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/vlm_finetune/qwen3_5/qwen3_6_27b.yaml).
- [04/20/2026][**Qwen-Image**](https://huggingface.co/Qwen/Qwen-Image) We now support finetuning `Qwen/Qwen-Image`, thanks to [@harshareddy832](https://github.com/harshareddy832). Check out our [recipe](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/diffusion/finetune/qwen_image_t2i_flow.yaml).
- [04/16/2026][**Qwen3.6 MoE**](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) We now support finetuning `Qwen/Qwen3.6-35B-A3B`. Check out our [recipe](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/vlm_finetune/qwen3_5_moe/qwen3_6_35b.yaml).
97 changes: 97 additions & 0 deletions docs/guides/vlm/mistral-medium-3-5.md
@@ -0,0 +1,97 @@
# Fine-Tune Mistral Medium 3.5 VLM

## Introduction

[Mistral Medium 3.5](https://huggingface.co/mistralai) is Mistral AI's
new flagship model. It is a **128B dense** transformer that merges
*Mistral Medium 3.1*, *Magistral Medium*, and *Devstral 2* into a single
checkpoint with a configurable reasoning mode, supports a **256k-token
context window**, and serves as the default model in Mistral Vibe and
Le Chat.

The model ships natively in FP8, which, combined with its dense
(non-MoE) layout, makes it materially smaller to deploy than comparably
capable MoE systems: full inference fits in a single H200 node or
2 × H100 nodes, and the recipe in this guide fine-tunes the full VLM
end-to-end on 8 × H100 nodes (64 GPUs).

**Architecture at a glance**

- 88 Ministral-3 decoder layers (hidden 12288, 96 attention heads, 8 KV
heads, GQA), Llama-style RoPE + RMSNorm + SwiGLU MLP.
- Dense, with no MoE routing; compared with MoE peers, this translates
directly into smaller per-GPU memory and easier multi-node sharding.
- Pixtral vision tower + multi-modal projector for image inputs.
- FP8 on disk; dequantized to BF16 per local TP shard inside the
standard DCP load path.

This guide walks you through fine-tuning Mistral Medium 3.5 on a medical
Visual Question Answering task using NVIDIA NeMo AutoModel. You will
learn how to prepare the dataset, launch training on a Slurm cluster,
and inspect the results.

To set up your environment to run NeMo AutoModel, follow the
[installation guide](https://github.com/NVIDIA-NeMo/Automodel#-install-nemo-automodel).

## Data

### MedPix-VQA Dataset

We use the [MedPix-VQA](https://huggingface.co/datasets/mmoukouba/MedPix-VQA)
dataset, a comprehensive medical Visual Question Answering dataset
containing radiological images paired with question-answer pairs for
medical image interpretation.

- **20,500 total examples** (85% train / 15% validation)
- **Columns**: `image_id`, `mode`, `case_id`, `question`, `answer`

For a full walkthrough of how MedPix-VQA is preprocessed and integrated
into NeMo AutoModel — including the chat-template conversion and collate
functions — see the
[Multi-Modal Dataset Guide](https://github.com/NVIDIA-NeMo/Automodel/blob/main/docs/guides/vlm/dataset.md#multi-modal-datasets).
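
To make the shape of that pipeline concrete, here is a minimal sketch
of the conversion, assuming the standard Hugging Face `datasets` API
and a generic chat-message layout. The `question`/`answer` columns are
from the dataset card above; the exact message schema the collate
functions expect lives in the dataset guide.

```python
from datasets import load_dataset

# Pull MedPix-VQA from the Hugging Face Hub (served from the local cache
# when HF_HOME is configured, as recommended below).
ds = load_dataset("mmoukouba/MedPix-VQA", split="train")

def to_chat_example(sample):
    # Map one row (question/answer columns) to a chat-style VQA example.
    # The message layout here is illustrative; see the dataset guide for
    # the exact schema the collate functions expect.
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image"},  # image tensor is attached by the collate fn
                    {"type": "text", "text": sample["question"]},
                ],
            },
            {
                "role": "assistant",
                "content": [{"type": "text", "text": sample["answer"]}],
            },
        ]
    }

chat_ds = ds.map(to_chat_example)
```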

## Launch Training

We provide a ready-to-use recipe at
[`examples/vlm_finetune/mistral3p5/mistral3p5_128b_medpix.yaml`](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/vlm_finetune/mistral3p5/mistral3p5_128b_medpix.yaml).
This recipe is configured for **8 nodes × 8 H100-80GB GPUs (64 GPUs total)**
with TP=8, PP=8, DP=1. The vision tower and multi-modal projector are
frozen by default, so only the Ministral-3 language model is trained;
set `freeze_config.freeze_vision_tower: false` to train the vision
side as well (see the excerpt below).
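
For orientation, the parallelism and freezing knobs described above
would appear in the recipe roughly as follows. This is an illustrative
excerpt, not a copy of the recipe file; apart from
`freeze_config.freeze_vision_tower`, the key names are assumptions.

```yaml
# Illustrative excerpt (assumed key names) -- consult the recipe file
# for the authoritative layout.
freeze_config:
  freeze_vision_tower: true   # set to false to also train the Pixtral vision tower
distributed:                  # assumed section name
  tensor_parallel_size: 8     # TP=8
  pipeline_parallel_size: 8   # PP=8, with DP=1 across 64 GPUs
```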

NeMo AutoModel supports several ways to launch training — via the
AutoModel CLI with Slurm, interactive sessions, `torchrun`, and more.
For full details on all launch options (Slurm batch jobs, multi-node
configuration, environment variables, etc.), see the
[Run on a Cluster](https://github.com/NVIDIA-NeMo/Automodel/blob/main/docs/launcher/slurm.md)
guide.
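
As a concrete reference, a single-node interactive launch through the
AutoModel CLI follows the same pattern used elsewhere in these docs;
the full 64-GPU run needs the Slurm setup from the guide linked above.

```bash
# Interactive single-node launch (8 GPUs) via the AutoModel CLI.
# Note: the recipe as shipped targets 8 nodes (TP=8, PP=8); for a
# single-node smoke test you would first shrink those settings in the YAML.
automodel --nproc-per-node=8 \
  examples/vlm_finetune/mistral3p5/mistral3p5_128b_medpix.yaml
```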


**Before you start**:

- Hugging Face applies rate limits on downloads. We recommend cloning
the model repository to your local filesystem beforehand.
- Ensure your Hugging Face cache (`HF_HOME`) is configured and that the
dataset is already cached locally.
- To enable Weights & Biases logging, set your `WANDB_API_KEY` and
configure the `wandb` section in the YAML file. A minimal environment
setup is sketched below.
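
Putting those three items together, a minimal environment setup might
look like this (paths are illustrative; adjust to your cluster):

```bash
# Point the HF cache at a filesystem visible to all nodes.
export HF_HOME=/shared/hf-cache

# Pre-download the model and dataset to sidestep rate limits at job start.
huggingface-cli download mistralai/Mistral-Medium-3.5-128B
huggingface-cli download mmoukouba/MedPix-VQA --repo-type dataset

# Optional: enable Weights & Biases logging (also configure `wandb` in the YAML).
export WANDB_API_KEY=<your-key>
```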


## Training Results

The recipe produces a healthy initial loss, aligned with the HF
reference forward on matched samples. On MedPix-VQA, the first
optimizer step lands around a per-token loss of **3.2** with grad-norm
**~930** (clipped to `max_grad_norm=1.0`), and the loss descends below
1.8 within a handful of steps. The HF reference forward (single-sample,
FP8 dequantize on-load) on the same first batch produces a per-token
loss of **3.47**, confirming the distributed forward is numerically
equivalent within bf16 + TP-reduction tolerance.
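
If you want to reproduce that parity check, a per-token loss in the
sense used above can be computed from a reference forward roughly as
follows (a sketch; the model forward and batch construction are
omitted, and `-100` label masking is assumed):

```python
import torch
import torch.nn.functional as F

def per_token_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Mean cross-entropy over unmasked tokens (labels use -100 for masked positions).

    logits: [batch, seq, vocab]; labels: [batch, seq].
    """
    # Shift so position t predicts token t+1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,  # prompt/padding tokens do not contribute
    )
```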

The training loss curves for Mistral Medium 3.5 fine-tuned on
MedPix-VQA are shown below.

<p align="center">
<img src="mistralm35.png" alt="Mistral Medium 3.5 Training Loss Curve" width="500">
</p>
Binary file added docs/guides/vlm/mistralm35.png
1 change: 1 addition & 0 deletions docs/index.md
@@ -255,6 +255,7 @@ Gemma 3 / 3n <guides/omni/gemma3-3n.md>
Gemma 4 <guides/vlm/gemma4.md>
Qwen3.5-VL <guides/vlm/qwen3-5.md>
Nemotron-Omni <guides/vlm/nemotron-omni.md>
Mistral Medium 3.5 VL <guides/vlm/mistral-medium-3-5.md>
Diffusion Fine-Tuning <guides/diffusion/finetune.md>
dLLM Fine-Tuning <guides/dllm/finetune.md>
QAT <guides/quantization-aware-training.md>
1 change: 1 addition & 0 deletions docs/model-coverage/latest-models.md
@@ -6,6 +6,7 @@ See the [Model Coverage Overview](overview.md) for release summaries, and the [L

| Date | Model | HF Model ID | Modality | Recipe | Try on Brev |
|------|-------|-------------|----------|--------|------|
| 2026-04-29 | Mistral Medium 3.5 | [`mistralai/Mistral-Medium-3.5-128B`](https://huggingface.co/mistralai/Mistral-Medium-3.5-128B) | VLM | [mistral3p5_128b_medpix.yaml](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/vlm_finetune/mistral3p5/mistral3p5_128b_medpix.yaml) | 🚧 |
| 2026-04-28 | Hy3-preview | [`tencent/Hy3-preview`](https://huggingface.co/tencent/Hy3-preview) | LLM | [hy3_preview_deepep.yaml](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/llm_finetune/hy_v3/hy3_preview_deepep.yaml) | 🚧 |
| 2026-04-25 | DeepSeek V4 Flash | [`deepseek-ai/DeepSeek-V4-Flash`](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) | LLM | [deepseek_v4_flash_hellaswag.yaml](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/llm_finetune/deepseek_v4/deepseek_v4_flash_hellaswag.yaml) | 🚧 |
| 2026-04-22 | Qwen3.6-27B | [`Qwen/Qwen3.6-27B`](https://huggingface.co/Qwen/Qwen3.6-27B) | VLM | [qwen3_6_27b.yaml](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/vlm_finetune/qwen3_5/qwen3_6_27b.yaml) | 🚧 |
2 changes: 2 additions & 0 deletions docs/model-coverage/vlm/index.md
@@ -31,6 +31,7 @@ NeMo AutoModel supports [AutoModelForImageTextToText](https://huggingface.co/doc
| NVIDIA | [Nemotron-Parse](nvidia/nemotron-parse.md) | `NemotronParseForConditionalGeneration` |
| Mistral AI | [Ministral3 VL](mistralai/ministral3-vl.md) | `Mistral3ForConditionalGeneration` |
| Mistral AI | [Mistral-Small-4](mistralai/mistral-small-4.md) | `MistralForConditionalGeneration` |
| Mistral AI | [Mistral Medium 3.5](mistralai/mistral-medium-3-5.md) | `Mistral3ForConditionalGeneration` (FP8) |
| InternLM / Shanghai AI Lab | [InternVL](internlm/internvl.md) | `InternVLForConditionalGeneration` |
| Meta | [Llama 4](meta/llama4.md) | `Llama4ForConditionalGeneration` |
| HuggingFace | [SmolVLM](huggingface/smolvlm.md) | `SmolVLMForConditionalGeneration` |
@@ -57,6 +58,7 @@ qwen/qwen3-5-vl
nvidia/nemotron-parse
mistralai/ministral3-vl
mistralai/mistral-small-4
mistralai/mistral-medium-3-5
internlm/internvl
meta/llama4
huggingface/smolvlm
158 changes: 158 additions & 0 deletions docs/model-coverage/vlm/mistralai/mistral-medium-3-5.md
@@ -0,0 +1,158 @@
# Mistral Medium 3.5

[Mistral Medium 3.5](https://huggingface.co/mistralai) is Mistral AI's
flagship **128B dense** model that merges instruction-following, reasoning,
and coding into a single checkpoint with a configurable reasoning mode.
It unifies the lineage of *Mistral Medium 3.1*, *Magistral Medium*, and
*Devstral 2* into one model, and ships natively in FP8 (per-tensor
`weight_scale_inv`), so the full model fits inside a single H200 node or
2 × H100 nodes, a notable footprint advantage over comparably capable
Mixture-of-Experts (MoE) systems.

:::{card}
| | |
|---|---|
| **Task** | Image-Text-to-Text |
| **Architecture** | `Mistral3ForConditionalGeneration` (Pixtral vision tower + dense Ministral-3 text decoder) |
| **Parameters** | 128B (dense, FP8 on disk) |
| **Context Window** | 256k tokens |
| **Languages** | 40+ (English, French, Spanish, German, Russian, Chinese, Japanese, Italian, Portuguese, Arabic, Hindi, Korean, plus Indic / Nordic / Eastern European tail) |
| **License** | Modified MIT (open-weights, ≤ $20M annual revenue threshold) |
| **HF Org** | [mistralai](https://huggingface.co/mistralai) |
:::

## Architecture

Mistral Medium 3.5 is a **dense** transformer (no MoE routing) built on
the same text backbone as
[`mistralai/Devstral-2-123B-Instruct-2512`](https://huggingface.co/mistralai/Devstral-2-123B-Instruct-2512):
88 Ministral-3 decoder layers (hidden 12288, 96 attention heads,
8 KV heads, GQA) with the standard Llama-style RoPE + RMSNorm + SwiGLU
MLP layout. The multimodal variant adds a Pixtral vision tower and a
multi-modal projector on top, making it an
`AutoModelForImageTextToText` checkpoint.
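
For a quick local inspection of the architecture, the checkpoint can be
loaded through the standard image-text-to-text auto class. A minimal
sketch, assuming the `mistralai/Mistral-Medium-3.5-128B` repo ID used
elsewhere on this page and enough GPU memory to shard the model:

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "mistralai/Mistral-Medium-3.5-128B"

processor = AutoProcessor.from_pretrained(model_id)
# A 128B checkpoint does not fit on one GPU; device_map shards it across
# all visible devices. BF16 is the working precision after FP8 dequantize.
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
print(model.config)  # hidden_size, num_hidden_layers, etc.
```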

Compared with MoE models of similar capability, the dense layout
trades sparse-activation throughput for a substantially smaller
deployment footprint — relevant when you want to fine-tune or serve
the model on a single node.

## Key Strengths

- **Compactness.** The dense 128B model fits on fewer GPUs than the
comparable MoE class: a single H200 node or 2 × H100 nodes for inference.
- **Configurable reasoning mode.** One checkpoint covers chat,
agentic, and reasoning workloads; the reasoning mode is toggled at
inference time.
- **Strong agentic performance.** Competitive on tool-use and
decision-making benchmarks; suitable as a base for connector-driven
agent workflows.
- **Long context.** 256k-token window for document parsing and
research-assistant use cases.

Trade-offs disclosed in the model card: weaker non-agentic benchmark
performance and more verbose outputs than some closed-source
competitors.

## Use Cases

- Agentic workflows with connectors
- Cloud and local async coding
- Document parsing (multimodal — text + image)
- Research assistants
- General chat
- Base model for downstream fine-tuning

## Available Models

- **Mistral-Medium-3.5 128B**

## Class

- HF: `Mistral3ForConditionalGeneration`
- NeMo AutoModel custom: `Mistral3FP8VLMForConditionalGeneration`
([source](https://github.com/NVIDIA-NeMo/Automodel/blob/main/nemo_automodel/components/models/mistral3_vlm/model.py))

The custom class extends HF's `Mistral3ForConditionalGeneration` and
attaches a `Mistral3FP8StateDictAdapter.for_vlm_full()` so the FP8
checkpoint dequantizes per shard inside the standard DCP load; the
full BF16 model is never materialized on a single rank, which allows
TP+PP training to fit on H100-80GB. A conceptual sketch of the
dequantization step follows below.
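
Conceptually, the per-tensor dequantization the adapter applies to each
local shard reduces to the following. This is a simplified sketch that
assumes the convention where the stored `weight_scale_inv` multiplies
the FP8 payload; the real adapter additionally rewrites sharded
state-dict keys inside the DCP load path.

```python
import torch

def dequantize_fp8_shard(weight_fp8: torch.Tensor,
                         weight_scale_inv: torch.Tensor) -> torch.Tensor:
    # Upcast the FP8 payload, apply the per-tensor scale, then settle on BF16.
    # Applied per local TP shard, so the full BF16 model never exists on one rank.
    return (weight_fp8.to(torch.float32) * weight_scale_inv).to(torch.bfloat16)
```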

## Example HF Models

| Model | HF ID |
|---|---|
| Mistral Medium 3.5 128B | [`mistralai/Mistral-Medium-3.5-128B`](https://huggingface.co/mistralai/Mistral-Medium-3.5-128B) |

## Example Recipes

| Recipe | Dataset | Description |
|---|---|---|
| {download}`mistral3p5_128b_medpix.yaml <../../../../examples/vlm_finetune/mistral3p5/mistral3p5_128b_medpix.yaml>` | MedPix-VQA | SFT — Mistral Medium 3.5 128B on MedPix, 8 nodes × 8 GPUs (TP=8 PP=8) |


## Try with NeMo AutoModel

**1. Install** ([full instructions](../../../guides/installation.md)):

```bash
pip install nemo-automodel
```

**2. Clone the repo** to get the example recipes:

```bash
git clone https://github.com/NVIDIA-NeMo/Automodel.git
cd Automodel
```

:::{note}
This recipe was validated on **8 nodes × 8 GPUs (64 H100s)** with
TP=8, PP=8, DP=1. See the [Launcher Guide](../../../launcher/slurm.md)
for multi-node setup. Inference and small-scale fine-tuning fit in
**1 × H200** node or **2 × H100** nodes thanks to the dense FP8 layout.
:::

**3. Run the recipe** via Slurm (see the
[fine-tuning guide](../../../guides/vlm/mistral-medium-3-5.md) for a
complete launch script):

```bash
sbatch your_slurm_script.sub
```

:::{dropdown} Run with Docker
**1. Pull the container** and mount a checkpoint directory:

```bash
docker run --gpus all -it --rm \
--shm-size=8g \
-v $(pwd)/checkpoints:/opt/Automodel/checkpoints \
nvcr.io/nvidia/nemo-automodel:26.02.00
```

**2. Navigate** to the AutoModel directory (where the recipes are):

```bash
cd /opt/Automodel
```

**3. Run the recipe**:

```bash
automodel --nproc-per-node=8 examples/vlm_finetune/mistral3p5/mistral3p5_128b_medpix.yaml
```
:::

See the [Installation Guide](../../../guides/installation.md) and the
[Mistral Medium 3.5 Fine-Tuning Guide](../../../guides/vlm/mistral-medium-3-5.md).

## Fine-Tuning

See the [Mistral Medium 3.5 Fine-Tuning Guide](../../../guides/vlm/mistral-medium-3-5.md).

## Hugging Face Model Cards

- [mistralai](https://huggingface.co/mistralai)
- Related architecture: [`mistralai/Devstral-2-123B-Instruct-2512`](https://huggingface.co/mistralai/Devstral-2-123B-Instruct-2512)