69 changes: 68 additions & 1 deletion examples/models/core/exaone/README.md
@@ -10,7 +10,11 @@ See the LLaMA example [`examples/models/core/llama`](../llama) for details.
- [Supported Models](#supported-models)
- [EXAONE-3.0](#exaone-30)
- [EXAONE-Deep](#exaone-deep)
- [EXAONE-4.0](#exaone-40)
- [Usage](#usage)
- [PyTorch flow](#pytorch-flow)
  - [PyTorch flow Quantization](#pytorch-flow-quantization)
- [TRT Flow](#trt-flow)
- [Convert checkpoint and build TensorRT engine(s)](#convert-checkpoint-and-build-tensorrt-engines)
- [FP8 Post-Training Quantization](#fp8-post-training-quantization)
- [SmoothQuant](#smoothquant)
@@ -39,16 +43,79 @@ git clone https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct $HF_MODEL_DIR

### EXAONE-Deep

Download the HuggingFace BF16 checkpoints of EXAONE-Deep model. Here, we only use the `EXAONE-Deep-2.4B` model for the example. We can use the same procedure as EXAONE-3.0 to convert the weights and build the TensorRT engine.
Download the HuggingFace checkpoints of the EXAONE-Deep model. Here, we only use the `EXAONE-Deep-2.4B` model for the example. We can use the same procedure as EXAONE-3.0 to convert the weights and build the TensorRT engine.

```bash
export HF_MODEL_DIR=hf_models/exaone_deep
git clone https://huggingface.co/LGAI-EXAONE/EXAONE-Deep-2.4B $HF_MODEL_DIR
```

### EXAONE-4.0

Download the HuggingFace checkpoints of the EXAONE-4.0 model. Here, we only use the `TODO: replace with REAL name, EXAONE-4.0` model for the example. Starting with EXAONE-4.0, we support EXAONE models only in the PyTorch flow.

```bash
export HF_MODEL_DIR=hf_models/exaone4
git clone ... $HF_MODEL_DIR (TODO Change ... to real HF directory)
```

## Usage
The following sections describe how to convert the weights from the [HuggingFace (HF) Transformers](https://github.com/huggingface/transformers) format to the TensorRT-LLM format. We use LLaMA's [convert_checkpoint.py](../llama/convert_checkpoint.py) for the EXAONE model and then build the model with `trtllm-build`.
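
As a rough sketch (not the exact EXAONE invocation, which is listed in the [TRT flow](#trt-flow) section below), the conversion and build typically look like this; the dtype and output paths are illustrative placeholders:

```bash
# Sketch only: see the TRT flow section for the exact EXAONE flags.
python ../llama/convert_checkpoint.py --model_dir $HF_MODEL_DIR \
    --output_dir ./trt_ckpt/exaone \
    --dtype float16

trtllm-build --checkpoint_dir ./trt_ckpt/exaone \
    --output_dir ./trt_engines/exaone \
    --gemm_plugin float16
```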

### PyTorch flow

To quickly run EXAONE-4.0 models, you can use [examples/llm-api/quickstart_advanced.py](../../../llm-api/quickstart_advanced.py):

```bash
python ../../../llm-api/quickstart_advanced.py --model_dir hf_models/$MODEL_NAME --disable_kv_cache_reuse
```

Sliding window attention (SWA) currently does not support KV cache reuse, so make sure to pass `--disable_kv_cache_reuse` when running models that use SWA.

The output will look similar to the following:
```bash
TODO: Fill this with real HF checkpoints output
```

#### PyTorch flow Quantization

For the PyTorch flow, TRT-LLM supports quantized checkpoints generated by [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer).

You can either use pre-quantized models from the HF model hub, or generate a quantized model yourself and then run it with the commands below:

```bash
git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git
cd TensorRT-Model-Optimizer/examples/llm_ptq
# After the `cd`, make sure --model points at the downloaded checkpoint
# (an absolute path to hf_models/$MODEL_NAME is safest).
scripts/huggingface_example.sh --model hf_models/$MODEL_NAME --quant fp8 --export_fmt hf
```
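
Once the export finishes, the quantized HF checkpoint can be run through the same PyTorch-flow entry point. The export directory below is a placeholder; substitute the path produced by the export step:

```bash
# Placeholder path; use the export directory produced by the ModelOpt script above.
export QUANT_CKPT_DIR=<path_to_exported_fp8_checkpoint>
python ../../../llm-api/quickstart_advanced.py --model_dir $QUANT_CKPT_DIR --disable_kv_cache_reuse
```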

For more information, please refer to the official [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) documentation.

#### Troubleshooting

The following error may occur during quantization:
```bash
torch._dynamo.exc.Unsupported: Graph break under GenericContextWrappingVariable
Explanation: Attempted to graph break in an active context manager(s) that doesn't support graph breaking.
Hint: Move the offending context manager(s) to outside the compiled region.
Hint: This graph break may have been caused by an earlier graph break. Resolving the earlier graph break may resolve this one.
```

This error may indicate an incompatibility between `torch.compile()` and the `HybridCache` module of the transformers library. As a result, [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) (ModelOpt) cannot perform PTQ with HybridCache.

Temporarily switching to `DynamicCache` when creating PTQ models could help address the issue. This can be done by updating the `cache_implementation` field in the `generation_config.json` file located in the model checkpoint directory, for example:
```json
// generation_config.json
{
  // Change "hybrid" to "dynamic" to run PTQ.
  // Revert this to "hybrid" after quantization is complete.
  "cache_implementation": "hybrid",
  ...
}
```
For models with sliding window attention, DynamicCache is less memory-efficient than HybridCache because it retains the entire key-value cache. However, this does not break the model's attention logic, as the cache implementation is separated from the attention computation itself. This trade-off is acceptable for the PTQ process, which is a one-time procedure. Our tests confirm that this workaround does not degrade accuracy on MMLU or GSM8K benchmarks with the default ModelOpt settings.
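
The same edit can also be scripted before running PTQ; the snippet below is only a sketch, and the checkpoint path is an assumption that should be adjusted to your model directory:

```bash
# One-off toggle; the checkpoint path below is an assumption.
python -c "
import json, pathlib
p = pathlib.Path('hf_models/exaone4/generation_config.json')
cfg = json.loads(p.read_text())
cfg['cache_implementation'] = 'dynamic'  # switch back to 'hybrid' after quantization
p.write_text(json.dumps(cfg, indent=2))
"
```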

### TRT flow

### Convert checkpoint and build TensorRT engine(s)

2 changes: 2 additions & 0 deletions tensorrt_llm/_torch/models/__init__.py
@@ -4,6 +4,7 @@
from .modeling_bert import BertForSequenceClassification
from .modeling_clip import CLIPVisionModel
from .modeling_deepseekv3 import DeepseekV3ForCausalLM
from .modeling_exaone4 import Exaone4ForCausalLM
from .modeling_gemma3 import Gemma3ForCausalLM
from .modeling_gemma3vl import Gemma3Model
from .modeling_hyperclovax import HCXVisionForCausalLM
@@ -30,6 +31,7 @@
"BertForSequenceClassification",
"CLIPVisionModel",
"DeepseekV3ForCausalLM",
"Exaone4ForCausalLM",
"Gemma3ForCausalLM",
"HCXVisionForCausalLM",
"Gemma3Model",