diff --git a/docs/source/en/model_doc/minicpm_o_2_6.md b/docs/source/en/model_doc/minicpm_o_2_6.md index df31c58f280c..feea3263e8fb 100644 --- a/docs/source/en/model_doc/minicpm_o_2_6.md +++ b/docs/source/en/model_doc/minicpm_o_2_6.md @@ -11,977 +11,70 @@ specific language governing permissions and limitations under the License. โš ๏ธ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer. +---> -

A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone

+# MiniCPM-o 2.6 + +

A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone

[GitHub](https://github.com/OpenBMB/MiniCPM-o) | [Online Demo](https://minicpm-omni-webdemo-us.modelbest.cn) | [Technical Blog](https://openbmb.notion.site/MiniCPM-o-2-6-A-GPT-4o-Level-MLLM-for-Vision-Speech-and-Multimodal-Live-Streaming-on-Your-Phone-185ede1b7a558042b5d5e45e6b237da9) +## Overview + +The [MiniCPM-o 2.6](https://github.com/OpenBMB/MiniCPM-o) model is an end-to-end omni-modal large multimodal model proposed by the OpenBMB Team. MiniCPM-o 2.6 is built based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters. -### News - -* [2025.03.01] ๐Ÿš€๐Ÿš€๐Ÿš€ RLAIF-V, which is the alignment technique of MiniCPM-o, is accepted by CVPR 2025๏ผThe [code](https://github.com/RLHF-V/RLAIF-V), [dataset](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset), [paper](https://arxiv.org/abs/2405.17220) are open-sourced! - -* [2025.01.24] ๐Ÿ“ข๐Ÿ“ข๐Ÿ“ข MiniCPM-o 2.6 technical report is released! [See Here](https://openbmb.notion.site/MiniCPM-o-2-6-A-GPT-4o-Level-MLLM-for-Vision-Speech-and-Multimodal-Live-Streaming-on-Your-Phone-185ede1b7a558042b5d5e45e6b237da9). - -* [2025.01.19] โญ๏ธโญ๏ธโญ๏ธ MiniCPM-o tops GitHub Trending and reaches top-2 on Hugging Face Trending! - -## MiniCPM-o 2.6 - - -**MiniCPM-o 2.6** is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.6, and introduces new features for real-time speech conversation and multimodal live streaming. Notable features of MiniCPM-o 2.6 include: - -- ๐Ÿ”ฅ **Leading Visual Capability.** - MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding. It also **outperforms GPT-4V and Claude 3.5 Sonnet** in mutli-image and video understanding, and shows promising in-context learning capability. - -- ๐ŸŽ™ **State-of-the-art Speech Capability.** MiniCPM-o 2.6 supports **bilingual real-time speech conversation with configurable voices** in English and Chinese. It **outperforms GPT-4o-realtime on audio understanding tasks** such as ASR and STT translation, and shows **state-of-the-art performance on speech conversation in both semantic and acoustic evaluations in the open-source community**. It also allows for fun features such as emotion/speed/style control, end-to-end voice cloning, role play, etc. - -- ๐ŸŽฌ **Strong Multimodal Live Streaming Capability.** As a new feature, MiniCPM-o 2.6 can **accept continous video and audio streams independent of user queries, and support real-time speech interaction**. It **outperforms GPT-4o-202408 and Claude 3.5 Sonnet and shows state-of-art performance in open-source community on StreamingBench**, a comprehensive benchmark for real-time video understanding, omni-source (video & audio) understanding, and multimodal contextual understanding. - -- ๐Ÿ’ช **Strong OCR Capability and Others.** -Advancing popular visual capabilites from MiniCPM-V series, MiniCPM-o 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench for models under 25B, surpassing proprietary models such as GPT-4o-202405**. 
- Based on the the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, outperforming GPT-4o and Claude 3.5 Sonnet on MMHal-Bench, and supports **multilingual capabilities** on more than 30 languages. - - -- ๐Ÿš€ **Superior Efficiency.** - In addition to its friendly size, MiniCPM-o 2.6 also shows **state-of-the-art token density** (i.e., number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support **multimodal live streaming** on end-side devices such as iPad. - -- ๐Ÿ’ซ **Easy Usage.** -MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [LLaMA-Factory](./docs/llamafactory_train.md), (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) online web demo on [server](https://minicpm-omni-webdemo-us.modelbest.cn/). - - - -**Model Architecture.** - -- **End-to-end Omni-modal Architecture.** Different modality encoder/decoders are connected and trained in an **end-to-end** fashion to fully exploit rich multimodal knowledge. -- **Omni-modal Live Streaming Mechanism.** (1) We change the offline modality encoder/decoders into online ones for **streaminig inputs/outputs.** (2) We devise a **time-division multiplexing (TDM) mechanism** for omni-modality streaminig processing in the LLM backbone. It divides parallel omni-modality streams into sequential info within small periodic time slices. -- **Configurable Speech Modeling Design.** We devise a multimodal system prompt, including traditional text system prompt, and **a new audio system prompt to determine the assistant voice**. This enables flexible voice configurations in inference time, and also facilitates end-to-end voice cloning and description-based voice creation. - -
- -
- - -### Evaluation - -
- -
-#### Visual understanding results - -**Image Understanding:** - -
| Model | Size | Token Density+ | OpenCompass | OCRBench | MathVista mini | ChartQA | MMVet | MMStar | MME | MMB1.1 test | AI2D | MMMU val | HallusionBench | TextVQA val | DocVQA test | MathVerse mini | MathVision | MMHal Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary** | | | | | | | | | | | | | | | | | | |
| GPT-4o-20240513 | - | 1088 | 69.9 | 736 | 61.3 | 85.7 | 69.1 | 63.9 | 2328.7 | 82.2 | 84.6 | 69.2 | 55.0 | - | 92.8 | 50.2 | 30.4 | 3.6 |
| Claude3.5-Sonnet | - | 750 | 67.9 | 788 | 61.6 | 90.8 | 66.0 | 62.2 | 1920.0 | 78.5 | 80.2 | 65.9 | 49.9 | - | 95.2 | - | - | 3.4 |
| Gemini 1.5 Pro | - | - | 64.4 | 754 | 57.7 | 81.3 | 64.0 | 59.1 | 2110.6 | 73.9 | 79.1 | 60.6 | 45.6 | 73.5 | 86.5 | - | 19.2 | - |
| GPT-4o-mini-20240718 | - | 1088 | 64.1 | 785 | 52.4 | - | 66.9 | 54.8 | 2003.4 | 76.0 | 77.8 | 60.0 | 46.1 | - | - | - | - | 3.3 |
| **Open Source** | | | | | | | | | | | | | | | | | | |
| Cambrian-34B | 34B | 1820 | 58.3 | 591 | 50.3 | 75.6 | 53.2 | 54.2 | 2049.9 | 77.8 | 79.5 | 50.4 | 41.6 | 76.7 | 75.5 | - | - | - |
| GLM-4V-9B | 13B | 784 | 59.1 | 776 | 51.1 | - | 58.0 | 54.8 | 2018.8 | 67.9 | 71.2 | 46.9 | 45.0 | - | - | - | - | - |
| Pixtral-12B | 12B | 256 | 61.0 | 685 | 56.9 | 81.8 | 58.5 | 54.5 | - | 72.7 | 79.0 | 51.1 | 47.0 | 75.7 | 90.7 | - | - | - |
| DeepSeek-VL2-27B (4B) | 27B | 672 | 66.4 | 809 | 63.9 | 86.0 | 60.0 | 61.9 | 2253.0 | 81.2 | 83.8 | 54.0 | 45.3 | 84.2 | 93.3 | - | - | 3.0 |
| Qwen2-VL-7B | 8B | 784 | 67.1 | 866 | 58.2 | 83.0 | 62.0 | 60.7 | 2326.0 | 81.8 | 83.0 | 54.1 | 50.6 | 84.3 | 94.5 | 31.9 | 16.3 | 3.2 |
| LLaVA-OneVision-72B | 72B | 182 | 68.1 | 741 | 67.5 | 83.7 | 60.6 | 65.8 | 2261.0 | 85.0 | 85.6 | 56.8 | 49.0 | 80.5 | 91.3 | 39.1 | - | 3.5 |
| InternVL2.5-8B | 8B | 706 | 68.3 | 822 | 64.4 | 84.8 | 62.8 | 62.8 | 2344.0 | 83.6 | 84.5 | 56.0 | 50.1 | 79.1 | 93.0 | 39.5 | 19.7 | 3.4 |
| MiniCPM-V 2.6 | 8B | 2822 | 65.2 | 852* | 60.6 | 79.4 | 60.0 | 57.5 | 2348.4* | 78.0 | 82.1 | 49.8* | 48.1* | 80.1 | 90.8 | 25.7 | 18.3 | 3.6 |
| MiniCPM-o 2.6 | 8B | 2822 | 70.2 | 897* | 71.9* | 86.9* | 67.5 | 64.0 | 2372.0* | 80.5 | 85.8 | 50.4* | 51.9 | 82.0 | 93.5 | 41.4* | 23.1* | 3.8 |
-
-* We evaluate this benchmark using chain-of-thought prompting. Specifically, for MME, we used this technique only for the Cognition set. - -+ Token Density: number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens. - -Note: For proprietary models, we calculate token density based on the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimation. - - -**Multi-image and Video Understanding:** - -
| Model | Size | BLINK val | Mantis Eval | MIRB | Video-MME (wo / w subs) |
|---|---|---|---|---|---|
| **Proprietary** | | | | | |
| GPT-4o-20240513 | - | 68.0 | - | - | 71.9/77.2 |
| GPT4V | - | 54.6 | 62.7 | 53.1 | 59.9/63.3 |
| **Open-source** | | | | | |
| LLaVA-NeXT-Interleave 14B | 14B | 52.6 | 66.4 | 30.2 | - |
| LLaVA-OneVision-72B | 72B | 55.4 | 77.6 | - | 66.2/69.5 |
| MANTIS 8B | 8B | 49.1 | 59.5 | 34.8 | - |
| Qwen2-VL-7B | 8B | 53.2 | 69.6* | 67.6* | 63.3/69.0 |
| InternVL2.5-8B | 8B | 54.8 | 67.7 | 52.5 | 64.2/66.9 |
| MiniCPM-V 2.6 | 8B | 53.0 | 69.1 | 53.8 | 60.9/63.6 |
| MiniCPM-o 2.6 | 8B | 56.7 | 71.9 | 58.6 | 63.9/67.9 |
-
-* We evaluate officially released checkpoints by ourselves. - -
- - -#### Audio understanding and speech conversation results. - -**Audio Understanding:** - -
ASR (zh) columns report CER↓, ASR (en) columns report WER↓, AST columns report BLEU↑, and the Emotion column reports ACC↑.

| Model | Size | AISHELL-1 | Fleurs zh | WenetSpeech test-net | LibriSpeech test-clean | GigaSpeech | TED-LIUM | CoVoST en2zh | CoVoST zh2en | MELD emotion |
|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary** | | | | | | | | | | |
| GPT-4o-Realtime | - | 7.3* | 5.4* | 28.9* | 2.6* | 12.9* | 4.8* | 37.1* | 15.7* | 33.2* |
| Gemini 1.5 Pro | - | 4.5* | 5.9* | 14.3* | 2.9* | 10.6* | 3.0* | 47.3* | 22.6* | 48.4* |
| **Open-Source** | | | | | | | | | | |
| Qwen2-Audio-7B | 8B | - | 7.5 | - | 1.6 | - | - | 45.2 | 24.4 | 55.3 |
| Qwen2-Audio-7B-Instruct | 8B | 2.6* | 6.9* | 10.3* | 3.1* | 9.7* | 5.9* | 39.5* | 22.9* | 17.4* |
| GLM-4-Voice-Base | 9B | 2.5 | - | - | 2.8 | - | - | - | - | - |
| MiniCPM-o 2.6 | 8B | 1.6 | 4.4 | 6.9 | 1.7 | 8.7 | 3.0 | 48.2 | 27.2 | 52.4 |
-
-* We evaluate officially released checkpoints by ourselves.

-**Speech Generation:** - -
All tasks are SpeechQA. Speech Llama Q., Speech Web Q., and Speech Trivia QA report ACC↑; Speech AlpacaEval reports G-Eval (10 point)↑; the AudioArena columns report Semantic/Acoustic/Overall ELO score↑, UTMOS↑, and ASR-WER↓.

| Model | Size | Speech Llama Q. | Speech Web Q. | Speech Trivia QA | Speech AlpacaEval | Semantic ELO↑ | Acoustic ELO↑ | Overall ELO↑ | UTMOS↑ | ASR-WER↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary** | | | | | | | | | | |
| GPT-4o-Realtime | - | 71.7 | 51.6 | 69.7 | 7.4 | 1157 | 1203 | 1200 | 4.2 | 2.3 |
| **Open-Source** | | | | | | | | | | |
| GLM-4-Voice | 9B | 50.0 | 32.0 | 36.4 | 5.1 | 999 | 1147 | 1035 | 4.1 | 11.7 |
| Llama-Omni | 8B | 45.3 | 22.9 | 10.7 | 3.9 | 960 | 878 | 897 | 3.2 | 24.3 |
| Moshi | 7B | 43.7 | 23.8 | 16.7 | 2.4 | 871 | 808 | 875 | 2.8 | 8.2 |
| Mini-Omni | 1B | 22.0 | 12.8 | 6.9 | 2.5 | 926 | 803 | 865 | 3.4 | 10.0 |
| MiniCPM-o 2.6 | 8B | 61.0 | 40.0 | 40.2 | 5.1 | 1088 | 1163 | 1131 | 4.2 | 9.8 |
-
-All results are from AudioEvals, and the evaluation methods along with further details can be found in UltraEval-Audio.

-**End-to-end Voice Cloning** - -
| Model | Seed-TTS test-zh (SIMO↑) | Seed-TTS test-en (SIMO↑) |
|---|---|---|
| F5-TTS | 76 | 67 |
| CosyVoice | 75 | 64 |
| FireRedTTS | 63 | 46 |
| MiniCPM-o 2.6 | 57 | 47 |
-
- -#### Multimodal live streaming results. - -**Multimodal Live Streaming:** results on StreamingBench - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| Model | Size | Real-Time Video Understanding | Omni-Source Understanding | Contextual Understanding | Overall |
|---|---|---|---|---|---|
| **Proprietary** | | | | | |
| Gemini 1.5 Pro | - | 77.4 | 67.8 | 51.1 | 70.3 |
| GPT-4o-202408 | - | 74.5 | 51.0 | 48.0 | 64.1 |
| Claude-3.5-Sonnet | - | 74.0 | 41.4 | 37.8 | 59.7 |
| **Open-source** | | | | | |
| VILA-1.5 | 8B | 61.5 | 37.5 | 26.7 | 49.5 |
| LongVA | 7B | 63.1 | 35.9 | 30.2 | 50.7 |
| LLaVA-Next-Video-34B | 34B | 69.8 | 41.7 | 34.3 | 56.7 |
| Qwen2-VL-7B | 8B | 71.2 | 40.7 | 33.1 | 57.0 |
| InternVL2-8B | 8B | 70.1 | 42.7 | 34.1 | 57.0 |
| VITA-1.5 | 8B | 70.9 | 40.8 | 35.8 | 57.4 |
| LLaVA-OneVision-7B | 8B | 74.3 | 40.8 | 31.0 | 58.4 |
| InternLM-XC2.5-OL-7B | 8B | 75.4 | 46.2 | 33.6 | 60.8 |
| MiniCPM-V 2.6 | 8B | 72.4 | 40.2 | 33.4 | 57.7 |
| MiniCPM-o 2.6 | 8B | 79.9 | 53.4 | 38.5 | 66.0 |
- - -### Examples - -We deploy MiniCPM-o 2.6 on end devices. The demo video is the raw-speed recording on an iPad Pro and a Web demo. - -
- -
- -
- - -
- math - diagram - bike -
- - - - -## Online Demo -Click here to try the online demo of [MiniCPM-o 2.6](https://minicpm-omni-webdemo-us.modelbest.cn). +The model features: +_MiniCPM-o 2.6 is the latest and most capable model in the MiniCPM-o series, featuring leading visual capability with an average score of 70.2 on OpenCompass. With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet for single image understanding. It supports state-of-the-art speech capability with bilingual real-time speech conversation and configurable voices in English and Chinese, outperforming GPT-4o-realtime on audio understanding tasks. The model introduces strong multimodal live streaming capability, accepting continuous video and audio streams independent of user queries with real-time speech interaction. It features superior efficiency with state-of-the-art token density, producing only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models. The architecture employs an end-to-end omni-modal design with time-division multiplexing (TDM) mechanism for omni-modality streaming processing and configurable speech modeling design with multimodal system prompts._ ## Usage -Inference using Huggingface transformers on NVIDIA GPUs. Please ensure that `transformers==4.44.2` is installed, as other versions may have compatibility issues. We are investigating this issue. Requirements tested on python 3.10๏ผš + +Inference using Huggingface transformers on NVIDIA GPUs. Requirements tested on python 3.10๏ผš + ``` -Pillow==10.1.0 -torch==2.3.1 -torchaudio==2.3.1 -torchvision==0.18.1 -transformers==4.44.2 -librosa==0.9.0 -soundfile==0.12.1 -vector-quantize-pytorch==1.18.5 -vocos==0.1.0 +transformers +Pillow +torch +torchaudio +torchvision +librosa +soundfile +vector-quantize-pytorch +vocos decord moviepy ``` - ### Model initialization + ```python import torch from PIL import Image from transformers import AutoModel, AutoTokenizer -# load omni model default, the default init_vision/init_audio/init_tts is True -# if load vision-only model, please set init_audio=False and init_tts=False -# if load audio-only model, please set init_vision=False + model = AutoModel.from_pretrained( 'openbmb/MiniCPM-o-2_6', - trust_remote_code=True, - attn_implementation='sdpa', # sdpa or flash_attention_2 - torch_dtype=torch.bfloat16, - init_vision=True, - init_audio=True, - init_tts=True + attn_implementation='sdpa', # sdpa or flash_attention_2, no eager + dtype=torch.bfloat16 ) model = model.eval().cuda() -tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True) -# In addition to vision-only mode, tts processor and vocos also needs to be initialized model.init_tts() + +processor = AutoProcessor.from_pretrained('openbmb/MiniCPM-o-2_6') ``` If you are using an older version of PyTorch, you might encounter this issue `"weight_norm_fwd_first_dim_kernel" not implemented for 'BFloat16'`, Please convert the TTS to float32 type. 
+ ```python model.tts.float() ``` ### Omni mode -We provide two inference modes: chat and streaming -#### Chat inference +We provide two inference modes: normal generate and streaming + +#### Normal generate inference + ```python import math import numpy as np @@ -990,16 +83,17 @@ from moviepy.editor import VideoFileClip import tempfile import librosa import soundfile as sf + def get_video_chunk_content(video_path, flatten=True): video = VideoFileClip(video_path) print('video_duration:', video.duration) - + with tempfile.NamedTemporaryFile(suffix=".wav", delete=True) as temp_audio_file: temp_audio_file_path = temp_audio_file.name video.audio.write_audiofile(temp_audio_file_path, codec="pcm_s16le", fps=16000) audio_np, sr = librosa.load(temp_audio_file_path, sr=16000, mono=True) num_units = math.ceil(video.duration) - + # 1 frame + 1s audio chunk contents= [] for i in range(num_units): @@ -1010,78 +104,79 @@ def get_video_chunk_content(video_path, flatten=True): contents.extend(["", image, audio]) else: contents.append(["", image, audio]) - + return contents + video_path="assets/Skiing.mp4" # if use voice clone prompt, please set ref_audio ref_audio_path = 'assets/demo.wav' ref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True) -sys_msg = model.get_sys_prompt(ref_audio=ref_audio, mode='omni', language='en') +sys_msg = processor.get_sys_prompt(ref_audio=ref_audio, mode='omni', language='en') # or use default prompt # sys_msg = model.get_sys_prompt(mode='omni', language='en') contents = get_video_chunk_content(video_path) msg = {"role":"user", "content": contents} msgs = [sys_msg, msg] +inputs = processor.apply_chat_template(msgs=msgs).to(model.device) + # please set generate_audio=True and output_audio_path to save the tts result generate_audio = True output_audio_path = 'output.wav' -res = model.chat( - msgs=msgs, - tokenizer=tokenizer, +res = model.generate( + **inputs, + processor=processor, sampling=True, temperature=0.5, max_new_tokens=4096, - omni_input=True, # please set omni_input=True when omni inference use_tts_template=True, generate_audio=generate_audio, output_audio_path=output_audio_path, - max_slice_nums=1, - use_image_id=False, - return_dict=True + repetition_penalty=1.2, ) print(res) -## You will get the answer: The person in the picture is skiing down a snowy slope. -# import IPython -# IPython.display.Audio('output.wav') ``` + #### Streaming inference + ```python # a new conversation need reset session first, it will reset the kv-cache model.reset_session() contents = get_video_chunk_content(video_path, flatten=False) session_id = '123' -generate_audio = True +use_tts = True + # 1. prefill system prompt res = model.streaming_prefill( session_id=session_id, - msgs=[sys_msg], - tokenizer=tokenizer + msgs=[sys_msg], + processor=processor ) + # 2. prefill video/audio chunks for content in contents: msgs = [{"role":"user", "content": content}] res = model.streaming_prefill( session_id=session_id, - msgs=msgs, - tokenizer=tokenizer + msgs=msgs, + processor=processor ) # 3. 
generate res = model.streaming_generate( session_id=session_id, - tokenizer=tokenizer, - temperature=0.5, - generate_audio=generate_audio + processor=processor, + use_tts=use_tts, + tts_output_chunk_size=25 ) audios = [] text = "" -if generate_audio: +if use_tts: for r in res: audio_wav = r.audio_wav sampling_rate = r.sampling_rate txt = r.text audios.append(audio_wav) text += txt - + res = np.concatenate(audios) sf.write("output.wav", res, samplerate=sampling_rate) print("text:", text) @@ -1092,143 +187,99 @@ else: print("text:", text) ``` - -### Speech and Audio Mode - -Model initialization - -```python -import torch -import librosa -from transformers import AutoModel, AutoTokenizer -model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True, - attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager -model = model.eval().cuda() -tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True) -model.init_tts() -model.tts.float() -``` -
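Returning to the streaming interface shown above: the same session can be continued by prefilling more user content under the same `session_id` and calling `streaming_generate` again. The following is a minimal sketch, not part of the official example; it assumes a hypothetical follow-up recording `followup.wav` and reuses the `session_id`, `processor`, and `use_tts` variables defined earlier.

```python
# Prefill an audio-only follow-up turn into the existing streaming session (sketch).
followup_audio, _ = librosa.load('followup.wav', sr=16000, mono=True)  # hypothetical file
model.streaming_prefill(
    session_id=session_id,
    msgs=[{"role": "user", "content": [followup_audio]}],
    processor=processor,
)

# Generate the next response for the same session, as in the loop above.
res = model.streaming_generate(
    session_id=session_id,
    processor=processor,
    use_tts=use_tts,
    tts_output_chunk_size=25,
)
for r in res:
    print(r.text if use_tts else r["text"], end="")
```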
#### Mimick -`Mimick` task reflects a model's end-to-end speech modeling capability. The model takes audio input, and outputs an ASR transcription and subsequently reconstructs the original audio with high similarity. The higher the similarity between the reconstructed audio and the original audio, the stronger the model's foundational capability in end-to-end speech modeling. +`Mimick` task reflects a model's end-to-end speech modeling capability. The model takes audio input, and outputs an ASR transcription and subsequently reconstructs the original audio with high similarity. ```python mimick_prompt = "Please repeat each user's speech, including voice style and speech content." -audio_input, _ = librosa.load('./assets/input_examples/Trump_WEF_2018_10s.mp3', sr=16000, mono=True) # load the audio to be mimicked -# can also try `./assets/input_examples/cxk_original.wav`, -# `./assets/input_examples/fast-pace.wav`, -# `./assets/input_examples/chi-english-1.wav` -# `./assets/input_examples/exciting-emotion.wav` -# for different aspects of speech-centric features. +audio_input, _ = librosa.load('./assets/input_examples/Trump_WEF_2018_10s.mp3', sr=16000, mono=True) msgs = [{'role': 'user', 'content': [mimick_prompt, audio_input]}] -res = model.chat( - msgs=msgs, - tokenizer=tokenizer, +inputs = processor.apply_chat_template(msgs=msgs).to(model.device) + +res = model.generate( + **inputs, + processor=processor, sampling=True, max_new_tokens=128, use_tts_template=True, temperature=0.3, generate_audio=True, - output_audio_path='output_mimick.wav', # save the tts result to output_audio_path + output_audio_path='output_mimick.wav', ) +print(res) ```
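To sanity-check the reconstruction, the file written to `output_audio_path` can be loaded back and compared with the 10-second input clip. A small sketch using the `soundfile` package from the requirements above (the filename matches the call shown here):

```python
import soundfile as sf

# Inspect the reconstructed audio written by the mimick example above.
wav, sr = sf.read('output_mimick.wav')
print(f"reconstructed audio: {len(wav) / sr:.2f} s at {sr} Hz")
```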
-#### General Speech Conversation with Configurable Voices - -A general usage scenario of `MiniCPM-o-2.6` is role-playing a specific character based on the audio prompt. It will mimic the voice of the character to some extent and act like the character in text, including language style. In this mode, `MiniCPM-o-2.6` sounds **more natural and human-like**. Self-defined audio prompts can be used to customize the voice of the character in an end-to-end manner. +#### Speech Conversation with Configurable Voices +`MiniCPM-o-2.6` can role-play specific characters based on audio prompts, mimicking their voice and language style. ```python -ref_audio, _ = librosa.load('./assets/input_examples/icl_20.wav', sr=16000, mono=True) # load the reference audio -sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_roleplay', language='en') -# round one -user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]} +ref_audio, _ = librosa.load('./assets/input_examples/icl_20.wav', sr=16000, mono=True) +sys_prompt = processor.get_sys_prompt(ref_audio=ref_audio, mode='audio_roleplay', language='en') +user_audio, _ = librosa.load('user_question.wav', sr=16000, mono=True) +user_question = {'role': 'user', 'content': [user_audio]} msgs = [sys_prompt, user_question] -res = model.chat( - msgs=msgs, - tokenizer=tokenizer, - sampling=True, - max_new_tokens=128, - use_tts_template=True, - generate_audio=True, - temperature=0.3, - output_audio_path='result_roleplay_round_1.wav', -) -# round two -history = msgs.append({'role': 'assistant', 'content': res}) -user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]} -msgs = history.append(user_question) -res = model.chat( - msgs=msgs, - tokenizer=tokenizer, +inputs = processor.apply_chat_template(msgs=msgs).to(model.device) + +res = model.generate( + **inputs, + processor=processor, sampling=True, max_new_tokens=128, use_tts_template=True, generate_audio=True, temperature=0.3, - output_audio_path='result_roleplay_round_2.wav', + output_audio_path='result_roleplay.wav', ) print(res) ```
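The snippet above shows a single round. A second round can be sketched by appending the assistant reply to the message list before the next user turn; this is an illustrative sketch that assumes `res` holds the assistant's text reply and uses a hypothetical `followup_question.wav` recording.

```python
# Hypothetical second round, reusing the msgs list from the first round (sketch).
msgs.append({'role': 'assistant', 'content': [res]})

followup_audio, _ = librosa.load('followup_question.wav', sr=16000, mono=True)  # hypothetical file
msgs.append({'role': 'user', 'content': [followup_audio]})

inputs = processor.apply_chat_template(msgs=msgs).to(model.device)
res = model.generate(
    **inputs,
    processor=processor,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_roleplay_round_2.wav',
)
print(res)
```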
-#### Speech Conversation as an AI Assistant +#### AI Assistant Mode -An enhanced feature of `MiniCPM-o-2.6` is to act as an AI assistant, but only with limited choice of voices. In this mode, `MiniCPM-o-2.6` is **less human-like and more like a voice assistant**. In this mode, the model is more instruction-following. For demo, you are suggested to use `assistant_female_voice`, `assistant_male_voice`, and `assistant_default_female_voice`. Other voices may work but not as stable as the default voices. - -*Please note that, `assistant_female_voice` and `assistant_male_voice` are more stable but sounds like robots, while `assistant_default_female_voice` is more human-alike but not stable, its voice often changes in multiple turns. We suggest you to try stable voices `assistant_female_voice` and `assistant_male_voice`.* +`MiniCPM-o-2.6` can act as an AI assistant with predefined stable voices. Recommended voices: `assistant_female_voice`, `assistant_male_voice`. ```python -ref_audio, _ = librosa.load('./assets/input_examples/assistant_female_voice.wav', sr=16000, mono=True) # or use `./assets/input_examples/assistant_male_voice.wav` -sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_assistant', language='en') -user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]} # load the user's audio question -# round one +ref_audio, _ = librosa.load('./assets/input_examples/assistant_female_voice.wav', sr=16000, mono=True) +sys_prompt = processor.get_sys_prompt(ref_audio=ref_audio, mode='audio_assistant', language='en') +user_audio, _ = librosa.load('user_question.wav', sr=16000, mono=True) +user_question = {'role': 'user', 'content': [user_audio]} msgs = [sys_prompt, user_question] -res = model.chat( - msgs=msgs, - tokenizer=tokenizer, - sampling=True, - max_new_tokens=128, - use_tts_template=True, - generate_audio=True, - temperature=0.3, - output_audio_path='result_assistant_round_1.wav', -) -# round two -history = msgs.append({'role': 'assistant', 'content': res}) -user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]} -msgs = history.append(user_question) -res = model.chat( - msgs=msgs, - tokenizer=tokenizer, +inputs = processor.apply_chat_template(msgs=msgs).to(model.device) + +res = model.generate( + **inputs, + processor=processor, sampling=True, max_new_tokens=128, use_tts_template=True, generate_audio=True, temperature=0.3, - output_audio_path='result_assistant_round_2.wav', + output_audio_path='result_assistant.wav', ) print(res) ```
-#### Instruction-to-Speech +#### Instruction-to-Speech (Voice Creation) -`MiniCPM-o-2.6` can also do Instruction-to-Speech, aka **Voice Creation**. You can describe a voice in detail, and the model will generate a voice that matches the description. For more Instruction-to-Speech sample instructions, you can refer to https://voxinstruct.github.io/VoxInstruct/. +You can describe a voice in detail, and the model will generate a voice that matches the description. ```python instruction = 'Speak like a male charming superstar, radiating confidence and style in every word.' msgs = [{'role': 'user', 'content': [instruction]}] -res = model.chat( - msgs=msgs, - tokenizer=tokenizer, +inputs = processor.apply_chat_template(msgs=msgs).to(model.device) + +res = model.generate( + **inputs, + processor=processor, sampling=True, max_new_tokens=128, use_tts_template=True, @@ -1236,24 +287,26 @@ res = model.chat( temperature=0.3, output_audio_path='result_voice_creation.wav', ) +print(res) ```
#### Voice Cloning -`MiniCPM-o-2.6` can also do zero-shot text-to-speech, aka **Voice Cloning**. With this mode, model will act like a TTS model. - +Zero-shot text-to-speech functionality using reference audio. ```python -ref_audio, _ = librosa.load('./assets/input_examples/icl_20.wav', sr=16000, mono=True) # load the reference audio -sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='voice_cloning', language='en') -text_prompt = f"Please read the text below." +ref_audio, _ = librosa.load('./assets/input_examples/icl_20.wav', sr=16000, mono=True) +sys_prompt = processor.get_sys_prompt(ref_audio=ref_audio, mode='voice_cloning', language='en') +text_prompt = "Please read the text below." user_question = {'role': 'user', 'content': [text_prompt, "content that you want to read"]} msgs = [sys_prompt, user_question] -res = model.chat( - msgs=msgs, - tokenizer=tokenizer, +inputs = processor.apply_chat_template(msgs=msgs).to(model.device) + +res = model.generate( + **inputs, + processor=processor, sampling=True, max_new_tokens=128, use_tts_template=True, @@ -1261,29 +314,32 @@ res = model.chat( temperature=0.3, output_audio_path='result_voice_cloning.wav', ) +print(res) ```
-#### Addressing Various Audio Understanding Tasks +#### Audio Understanding Tasks -`MiniCPM-o-2.6` can also be used to address various audio understanding tasks, such as ASR, speaker analysis, general audio captioning, and sound scene tagging. +Various audio understanding tasks such as ASR, speaker analysis, audio captioning, and sound scene tagging. -For audio-to-text tasks, you can use the following prompts: +Available prompts: -- ASR with ZH(same as AST en2zh): `่ฏทไป”็ป†ๅฌ่ฟ™ๆฎต้Ÿณ้ข‘็‰‡ๆฎต๏ผŒๅนถๅฐ†ๅ…ถๅ†…ๅฎน้€ๅญ—่ฎฐๅฝ•ใ€‚` -- ASR with EN(same as AST zh2en): `Please listen to the audio snippet carefully and transcribe the content.` +- ASR (Chinese): `่ฏทไป”็ป†ๅฌ่ฟ™ๆฎต้Ÿณ้ข‘็‰‡ๆฎต๏ผŒๅนถๅฐ†ๅ…ถๅ†…ๅฎน้€ๅญ—่ฎฐๅฝ•ใ€‚` +- ASR (English): `Please listen to the audio snippet carefully and transcribe the content.` - Speaker Analysis: `Based on the speaker's content, speculate on their gender, condition, age range, and health status.` -- General Audio Caption: `Summarize the main content of the audio.` -- General Sound Scene Tagging: `Utilize one keyword to convey the audio's content or the associated scene.` +- Audio Caption: `Summarize the main content of the audio.` +- Scene Tagging: `Utilize one keyword to convey the audio's content or the associated scene.` ```python -task_prompt = "Please listen to the audio snippet carefully and transcribe the content." + "\n" # can change to other prompts. -audio_input, _ = librosa.load('./assets/input_examples/audio_understanding.mp3', sr=16000, mono=True) # load the audio to be captioned +task_prompt = "Please listen to the audio snippet carefully and transcribe the content.\n" +audio_input, _ = librosa.load('./assets/input_examples/audio_understanding.mp3', sr=16000, mono=True) msgs = [{'role': 'user', 'content': [task_prompt, audio_input]}] -res = model.chat( - msgs=msgs, - tokenizer=tokenizer, +inputs = processor.apply_chat_template(msgs=msgs).to(model.device) + +res = model.generate( + **inputs, + processor=processor, sampling=True, max_new_tokens=128, use_tts_template=True, @@ -1294,30 +350,33 @@ res = model.chat( print(res) ``` - ### Vision-Only mode `MiniCPM-o-2_6` has the same inference methods as `MiniCPM-V-2_6` #### Chat with single image + ```python -# test.py image = Image.open('xx.jpg').convert('RGB') question = 'What is in the image?' msgs = [{'role': 'user', 'content': [image, question]}] -res = model.chat( - image=None, - msgs=msgs, - tokenizer=tokenizer +inputs = processor.apply_chat_template(msgs=msgs).to(model.device) + +res = model.generate( + **inputs, + processor=processor, + sampling=True, + max_new_tokens=1024, ) print(res) -## if you want to use streaming, please make sure sampling=True and stream=True -## the model.chat will return a generator -res = model.chat( - msgs=msgs, - tokenizer=tokenizer, + +## for streaming generation +res = model.generate( + **inputs, + processor=processor, sampling=True, - stream=True + stream=True, + max_new_tokens=1024, ) generated_text = "" for new_text in res: @@ -1326,28 +385,27 @@ for new_text in res: ``` #### Chat with multiple images -
- Click to show Python code running MiniCPM-o 2.6 with multiple images input. - + ```python image1 = Image.open('image1.jpg').convert('RGB') image2 = Image.open('image2.jpg').convert('RGB') question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.' msgs = [{'role': 'user', 'content': [image1, image2, question]}] -answer = model.chat( - msgs=msgs, - tokenizer=tokenizer +inputs = processor.apply_chat_template(msgs=msgs).to(model.device) + +res = model.generate( + **inputs, + processor=processor, + sampling=True, + max_new_tokens=1024, ) -print(answer) +print(res) ``` -
#### In-context few-shot learning -
- Click to view Python code running MiniCPM-o 2.6 with few-shot input. ```python -question = "production date" +question = "production date" image1 = Image.open('example1.jpg').convert('RGB') answer1 = "2023.08.04" image2 = Image.open('example2.jpg').convert('RGB') @@ -1358,19 +416,23 @@ msgs = [ {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]}, {'role': 'user', 'content': [image_test, question]} ] -answer = model.chat( - msgs=msgs, - tokenizer=tokenizer +inputs = processor.apply_chat_template(msgs=msgs).to(model.device) + +res = model.generate( + **inputs, + processor=processor, + sampling=True, + max_new_tokens=1024, ) -print(answer) +print(res) ``` -
#### Chat with video -
- Click to view Python code running MiniCPM-o 2.6 with video input. ```python +from decord import VideoReader, cpu +import numpy as np + MAX_NUM_FRAMES=64 # if cuda OOM set a smaller number def encode_video(video_path): def uniform_sample(l, n): @@ -1386,52 +448,76 @@ def encode_video(video_path): frames = [Image.fromarray(v.astype('uint8')) for v in frames] print('num frames:', len(frames)) return frames + video_path ="video_test.mp4" frames = encode_video(video_path) question = "Describe the video" -msgs = [ - {'role': 'user', 'content': frames + [question]}, -] +msgs = [{'role': 'user', 'content': frames + [question]}] +inputs = processor.apply_chat_template(msgs=msgs).to(model.device) + # Set decode params for video -params={} -params["use_image_id"] = False -params["max_slice_nums"] = 2 # use 1 if cuda OOM and video resolution > 448*448 -answer = model.chat( - msgs=msgs, - tokenizer=tokenizer, - **params +res = model.generate( + **inputs, + processor=processor, + sampling=True, + max_new_tokens=1024, + use_image_id=False, + max_slice_nums=2, # use 1 if cuda OOM and video resolution > 448*448 ) -print(answer) +print(res) ``` -
Please look at [GitHub](https://github.com/OpenBMB/MiniCPM-o) for more detail about usage. +## Usage Tips -## Inference with llama.cpp -MiniCPM-o 2.6 (vision-only mode) can run with llama.cpp. See our fork of [llama.cpp](https://github.com/OpenBMB/llama.cpp/tree/minicpm-omni) and [readme](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) for more detail. +### Inference with llama.cpp +MiniCPM-o 2.6 (vision-only mode) can run with llama.cpp. See our fork of [llama.cpp](https://github.com/OpenBMB/llama.cpp/tree/minicpm-omni) and [readme](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) for more detail. -## Int4 quantized version -Download the int4 quantized version for lower GPU memory (7GB) usage: [MiniCPM-o-2_6-int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4). +### Int4 quantized version +Download the int4 quantized version for lower GPU memory (7GB) usage: [MiniCPM-o-2_6-int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4). ## License + #### Model License -* The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License. -* The usage of MiniCPM-o and MiniCPM-V series model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md). -* The models and weights of MiniCPM are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, MiniCPM-o 2.6 weights are also available for free commercial use. +- The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License. +- The usage of MiniCPM-o and MiniCPM-V series model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md). +- The models and weights of MiniCPM are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, MiniCPM-o 2.6 weights are also available for free commercial use. #### Statement -* As an LMM, MiniCPM-o 2.6 generates contents by learning a large mount of multimodal corpora, but it cannot comprehend, express personal opinions or make value judgement. Anything generated by MiniCPM-o 2.6 does not represent the views and positions of the model developers -* We will not be liable for any problems arising from the use of the MinCPM-V models, including but not limited to data security issues, risk of public opinion, or any risks and problems arising from the misdirection, misuse, dissemination or misuse of the model. + +- As an LMM, MiniCPM-o 2.6 generates contents by learning a large mount of multimodal corpora, but it cannot comprehend, express personal opinions or make value judgement. Anything generated by MiniCPM-o 2.6 does not represent the views and positions of the model developers +- We will not be liable for any problems arising from the use of the MinCPM-V models, including but not limited to data security issues, risk of public opinion, or any risks and problems arising from the misdirection, misuse, dissemination or misuse of the model. 
+ +## Key Techniques and Other Multimodal Projects + +๐Ÿ‘ Welcome to explore key techniques of MiniCPM-o 2.6 and other multimodal projects of our team: + +[VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V) + +## Citation + +If you find our work helpful, please consider citing our papers ๐Ÿ“ and liking this project โค๏ธ๏ผ + +```bib +@article{yao2024minicpm, + title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone}, + author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others}, + journal={arXiv preprint arXiv:2408.01800}, + year={2024} +} +``` + +- We will not be liable for any problems arising from the use of the MinCPM-V models, including but not limited to data security issues, risk of public opinion, or any risks and problems arising from the misdirection, misuse, dissemination or misuse of the model. ## Key Techniques and Other Multimodal Projects ๐Ÿ‘ Welcome to explore key techniques of MiniCPM-o 2.6 and other multimodal projects of our team: -[VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V) +[VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V) ## Citation @@ -1444,4 +530,4 @@ If you find our work helpful, please consider citing our papers ๐Ÿ“ and liking journal={arXiv preprint arXiv:2408.01800}, year={2024} } -``` \ No newline at end of file +``` diff --git a/src/transformers/models/auto/modeling_auto.py b/src/transformers/models/auto/modeling_auto.py index e495e7193220..5626b0ea3106 100644 --- a/src/transformers/models/auto/modeling_auto.py +++ b/src/transformers/models/auto/modeling_auto.py @@ -248,7 +248,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin): ("metaclip_2", "MetaClip2Model"), ("mgp-str", "MgpstrForSceneTextRecognition"), ("mimi", "MimiModel"), - ("minicpm_o_2_6", "MiniCPM_o_2_6Model"), + ("minicpm_o_2_6", "MiniCPM_o_2_6ForConditionalGeneration"), ("minimax", "MiniMaxModel"), ("mistral", "MistralModel"), ("mistral3", "Mistral3Model"), diff --git a/src/transformers/models/auto/tokenization_auto.py b/src/transformers/models/auto/tokenization_auto.py index 7b0d7433f403..f6bf74765e85 100644 --- a/src/transformers/models/auto/tokenization_auto.py +++ b/src/transformers/models/auto/tokenization_auto.py @@ -415,7 +415,7 @@ ("mgp-str", ("MgpstrTokenizer", None)), ( "minicpm_o_2_6", - ("MiniCPM_o_2_6Tokenizer", "MiniCPM_o_2_6TokenizerFast" if is_tokenizers_available() else None), + ("Qwen2Tokenizer", "MiniCPM_o_2_6TokenizerFast" if is_tokenizers_available() else None), ), ( "minimax", diff --git a/src/transformers/models/minicpm_o_2_6/__init__.py b/src/transformers/models/minicpm_o_2_6/__init__.py index d7c289dfc944..1f4fbd5164d3 100644 --- a/src/transformers/models/minicpm_o_2_6/__init__.py +++ b/src/transformers/models/minicpm_o_2_6/__init__.py @@ -21,7 +21,7 @@ if TYPE_CHECKING: from .configuration_minicpm_o_2_6 import * - from .image_processing_minicpm import * + from .image_processing_minicpm_fast import * from .modeling_minicpm_o_2_6 import * from .processing_minicpm_o_2_6 import * from 
.tokenization_minicpm_o_2_6_fast import * diff --git a/src/transformers/models/minicpm_o_2_6/configuration_minicpm_o_2_6.py b/src/transformers/models/minicpm_o_2_6/configuration_minicpm_o_2_6.py index 33d494c665bf..d50f3f23cf90 100644 --- a/src/transformers/models/minicpm_o_2_6/configuration_minicpm_o_2_6.py +++ b/src/transformers/models/minicpm_o_2_6/configuration_minicpm_o_2_6.py @@ -1,4 +1,9 @@ -# coding=utf-8 +# ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ +# This file was automatically generated from src/transformers/models/minicpm_o_2_6/modular_minicpm_o_2_6.py. +# Do NOT edit this file manually as any edits will be overwritten by the generation of +# the file from the modular. If any change should be done, please apply the change to the +# modular_minicpm_o_2_6.py file directly. One of our CI enforces this. +# ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ # Copyright 2025 The OpenBMB Team. All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); @@ -13,263 +18,13 @@ # See the License for the specific language governing permissions and # limitations under the License. -import os -from typing import Union from ...configuration_utils import PretrainedConfig, layer_type_validation from ...modeling_rope_utils import rope_config_validation -from transformers.models.siglip.configuration_siglip import SiglipVisionConfig -from transformers import Qwen2Config, WhisperConfig from ...utils import logging -logger = logging.get_logger(__name__) - - -class MiniCPMVSliceConfig(PretrainedConfig): - model_type = "minicpmv" - - def __init__( - self, - patch_size=14, - max_slice_nums=9, - scale_resolution=448, - **kwargs, - ): - super().__init__(**kwargs) - self.patch_size = patch_size - self.max_slice_nums = max_slice_nums - self.scale_resolution = scale_resolution - - @classmethod - def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": - config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) - - if config_dict.get("model_type") == "minicpmv": - config_dict = config_dict["slice_config"] - - if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type: - logger.warning( - f"You are using a model of type {config_dict['model_type']} to instantiate a model of type " - f"{cls.model_type}. This is not supported for all configurations of models and can yield errors." 
- ) - - return cls.from_dict(config_dict, **kwargs) - - -class MiniCPMConditionalTTSConfig(PretrainedConfig): - model_type = "conditional_chattts" - - def __init__( - self, - llm_dim: int = 2560, - hidden_size: int = 768, - intermediate_size: int = 3072, - num_attention_heads: int = 12, - num_hidden_layers: int = 20, - max_position_embeddings: int = 4096, - num_audio_tokens: int = 626, - num_text_tokens: int = 21178, - num_mel_bins: int = 100, - num_vq: int = 4, - use_speaker_embedding: bool = True, - use_llm_hidden_state: bool = False, - spk_emb_token_id: int = 21143, - num_spk_embs: int = 1, - audio_bos_token_id: int = 21132, - text_eos_token_id: int = 21133, - use_text: bool = True, - streaming: bool = True, - streaming_text_chunk_size: int = 10, - streaming_text_reserved_len: int = 300, - streaming_audio_chunk_size: int = 50, - attn_implementation: str = "sdpa", - use_mlp: bool = True, - aug_loss_weight: bool = True, - **kwargs, - ): - super().__init__(**kwargs) - - self.llm_dim = llm_dim - self.hidden_size = hidden_size - self.intermediate_size = intermediate_size - self.num_attention_heads = num_attention_heads - self.num_hidden_layers = num_hidden_layers - self.max_position_embeddings = max_position_embeddings - self.num_audio_tokens = num_audio_tokens - self.num_text_tokens = num_text_tokens - self.num_mel_bins = num_mel_bins - self.num_vq = num_vq - self.use_speaker_embedding = use_speaker_embedding - self.use_llm_hidden_state = use_llm_hidden_state - self.spk_emb_token_id = spk_emb_token_id - self.num_spk_embs = num_spk_embs - self.audio_bos_token_id = audio_bos_token_id - self.text_eos_token_id = text_eos_token_id - self.use_text = use_text - self.streaming = streaming - self.streaming_text_chunk_size = streaming_text_chunk_size - self.streaming_text_reserved_len = streaming_text_reserved_len - self.streaming_audio_chunk_size = streaming_audio_chunk_size - self.attn_implementation = attn_implementation - self.use_mlp = use_mlp - self.aug_loss_weight = aug_loss_weight - - -class MiniCPM_o_2_6Config(PretrainedConfig): - model_type = "minicpmo" - keys_to_ignore_at_inference = ["past_key_values"] - - default_vision_config = { - "hidden_size": 1152, - "image_size": 980, - "intermediate_size": 4304, - "model_type": "siglip", - "num_attention_heads": 16, - "num_hidden_layers": 27, - "patch_size": 14, - } - - base_model_tp_plan = { - "layers.*.self_attn.q_proj": "colwise", - "layers.*.self_attn.k_proj": "colwise", - "layers.*.self_attn.v_proj": "colwise", - "layers.*.self_attn.o_proj": "rowwise", - "layers.*.mlp.gate_proj": "colwise", - "layers.*.mlp.up_proj": "colwise", - "layers.*.mlp.down_proj": "rowwise", - } - base_model_pp_plan = { - "embed_tokens": (["input_ids"], ["inputs_embeds"]), - "layers": (["hidden_states", "attention_mask"], ["hidden_states"]), - "norm": (["hidden_states"], ["hidden_states"]), - } - - def __init__( - self, - use_cache=True, - query_num=64, - image_size=448, - drop_vision_last_layer=True, - batch_vision_input=True, - slice_config=None, - vision_config=None, - audio_config=None, - tts_config=None, - use_image_id=True, - vision_batch_size=16, - audio_pool_step=2, - audio_chunk_length=1.0, - stream_input=False, - init_vision=True, - init_audio=True, - init_tts=True, - vocab_size=151936, - hidden_size=4096, - intermediate_size=22016, - num_hidden_layers=32, - num_attention_heads=32, - num_key_value_heads=32, - hidden_act="silu", - max_position_embeddings=32768, - initializer_range=0.02, - rms_norm_eps=1e-6, - tie_word_embeddings=False, - rope_theta=10000.0, - 
rope_scaling=None, - use_sliding_window=False, - sliding_window=4096, - max_window_layers=28, - layer_types=None, - attention_dropout=0.0, - **kwargs, - ): - self.use_cache = use_cache - self.query_num = query_num - self.image_size = image_size - self.drop_vision_last_layer = drop_vision_last_layer - self.batch_vision_input = batch_vision_input - self.use_image_id = use_image_id - self.vision_batch_size = vision_batch_size - self.audio_pool_step = audio_pool_step - self.audio_chunk_length = audio_chunk_length - self.stream_input = stream_input - self.init_vision = init_vision - self.init_audio = init_audio - self.init_tts = init_tts - - if slice_config is None: - self.slice_config = MiniCPMVSliceConfig(max_slice_nums=1) - else: - self.slice_config = MiniCPMVSliceConfig(**slice_config) - self.slice_mode = True - - # same as HuggingFaceM4/siglip-so400m-14-980-flash-attn2-navit add tgt_sizes - if vision_config is None: - self.vision_config = SiglipVisionConfig(**self.default_vision_config) - logger.info("vision_config is None, using default vision config") - elif isinstance(vision_config, dict): - self.vision_config = SiglipVisionConfig(**vision_config) - elif isinstance(vision_config, SiglipVisionConfig): - self.vision_config = vision_config - - # same as openai/whisper-medium add use_cache - if audio_config is None: - self.audio_config = WhisperConfig() - elif isinstance(audio_config, dict): - self.audio_config = WhisperConfig(**audio_config) - elif isinstance(audio_config, WhisperConfig): - self.audio_config = audio_config - - if tts_config is None: - self.tts_config = MiniCPMConditionalTTSConfig() - elif isinstance(tts_config, dict): - self.tts_config = MiniCPMConditionalTTSConfig(**tts_config) - elif isinstance(tts_config, MiniCPMConditionalTTSConfig): - self.tts_config = tts_config - - self.patch_size = self.vision_config.patch_size - - self.vocab_size = vocab_size - self.max_position_embeddings = max_position_embeddings - self.hidden_size = hidden_size - self.intermediate_size = intermediate_size - self.num_hidden_layers = num_hidden_layers - self.num_attention_heads = num_attention_heads - self.use_sliding_window = use_sliding_window - self.sliding_window = sliding_window if self.use_sliding_window else None - self.max_window_layers = max_window_layers - - # for backward compatibility - if num_key_value_heads is None: - num_key_value_heads = num_attention_heads - - self.num_key_value_heads = num_key_value_heads - self.hidden_act = hidden_act - self.initializer_range = initializer_range - self.rms_norm_eps = rms_norm_eps - self.rope_theta = rope_theta - self.rope_scaling = rope_scaling - self.attention_dropout = attention_dropout - # Validate the correctness of rotary position embeddings parameters - # BC: if there is a 'type' field, move it to 'rope_type'. 
- if self.rope_scaling is not None and "type" in self.rope_scaling: - self.rope_scaling["rope_type"] = self.rope_scaling["type"] - rope_config_validation(self) - - self.layer_types = layer_types - if self.layer_types is None: - self.layer_types = [ - "sliding_attention" - if self.sliding_window is not None and i >= self.max_window_layers - else "full_attention" - for i in range(self.num_hidden_layers) - ] - layer_type_validation(self.layer_types) - super().__init__( - tie_word_embeddings=tie_word_embeddings, - **kwargs, - ) +logger = logging.get_logger(__name__) class MiniCPMConditionalTTSTextConfig(PretrainedConfig): @@ -471,4 +226,646 @@ def __init__( ) +class MiniCPMConditionalTTSConfig(PretrainedConfig): + model_type = "conditional_chattts" + + # sub_configs = { + # "text_config": MiniCPMConditionalTTSTextConfig, + # } + + def __init__( + self, + llm_dim: int = 2560, + hidden_size: int = 768, + intermediate_size: int = 3072, + num_attention_heads: int = 12, + num_hidden_layers: int = 20, + max_position_embeddings: int = 4096, + num_audio_tokens: int = 626, + num_text_tokens: int = 21178, + num_mel_bins: int = 100, + num_vq: int = 4, + use_speaker_embedding: bool = True, + use_llm_hidden_state: bool = False, + spk_emb_token_id: int = 21143, + num_spk_embs: int = 1, + audio_bos_token_id: int = 21132, + text_eos_token_id: int = 21133, + use_text: bool = True, + streaming: bool = True, + streaming_text_chunk_size: int = 10, + streaming_text_reserved_len: int = 300, + streaming_audio_chunk_size: int = 50, + attn_implementation: str = "sdpa", + use_mlp: bool = True, + aug_loss_weight: bool = True, + **kwargs, + ): + super().__init__(**kwargs) + + self.llm_dim = llm_dim + self.hidden_size = hidden_size + self.intermediate_size = intermediate_size + self.num_attention_heads = num_attention_heads + self.num_hidden_layers = num_hidden_layers + self.max_position_embeddings = max_position_embeddings + self.num_audio_tokens = num_audio_tokens + self.num_text_tokens = num_text_tokens + self.num_mel_bins = num_mel_bins + self.num_vq = num_vq + self.use_speaker_embedding = use_speaker_embedding + self.use_llm_hidden_state = use_llm_hidden_state + self.spk_emb_token_id = spk_emb_token_id + self.num_spk_embs = num_spk_embs + self.audio_bos_token_id = audio_bos_token_id + self.text_eos_token_id = text_eos_token_id + self.use_text = use_text + self.streaming = streaming + self.streaming_text_chunk_size = streaming_text_chunk_size + self.streaming_text_reserved_len = streaming_text_reserved_len + self.streaming_audio_chunk_size = streaming_audio_chunk_size + self.attn_implementation = attn_implementation + self.use_mlp = use_mlp + self.aug_loss_weight = aug_loss_weight + + self.tts_text_config = MiniCPMConditionalTTSTextConfig( + hidden_size=self.hidden_size, + intermediate_size=self.intermediate_size, + num_attention_heads=self.num_attention_heads, + num_hidden_layers=self.num_hidden_layers, + max_position_embeddings=self.max_position_embeddings, + attn_implementation=self.attn_implementation, + ) + + +class MiniCPM_o_2_6TextConfig(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a [`MiniCPMO26TextModel`]. It is used to instantiate a + MiniCPMO26Text model according to the specified arguments, defining the model architecture. Instantiating a configuration + with the defaults will yield a similar configuration to that of + MiniCPMO26Text-7B-beta [Qwen/MiniCPMO26Text-7B-beta](https://huggingface.co/Qwen/MiniCPMO26Text-7B-beta). 
+ + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information. + + + Args: + vocab_size (`int`, *optional*, defaults to 151936): + Vocabulary size of the MiniCPMO26Text model. Defines the number of different tokens that can be represented by the + `inputs_ids` passed when calling [`MiniCPMO26TextModel`] + hidden_size (`int`, *optional*, defaults to 4096): + Dimension of the hidden representations. + intermediate_size (`int`, *optional*, defaults to 22016): + Dimension of the MLP representations. + num_hidden_layers (`int`, *optional*, defaults to 32): + Number of hidden layers in the Transformer encoder. + num_attention_heads (`int`, *optional*, defaults to 32): + Number of attention heads for each attention layer in the Transformer encoder. + num_key_value_heads (`int`, *optional*, defaults to 32): + This is the number of key_value heads that should be used to implement Grouped Query Attention. If + `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if + `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When + converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed + by meanpooling all the original heads within that group. For more details, check out [this + paper](https://huggingface.co/papers/2305.13245). If it is not specified, will default to `32`. + hidden_act (`str` or `function`, *optional*, defaults to `"silu"`): + The non-linear activation function (function or string) in the decoder. + max_position_embeddings (`int`, *optional*, defaults to 32768): + The maximum sequence length that this model might ever be used with. + initializer_range (`float`, *optional*, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + rms_norm_eps (`float`, *optional*, defaults to 1e-06): + The epsilon used by the rms normalization layers. + use_cache (`bool`, *optional*, defaults to `True`): + Whether or not the model should return the last key/values attentions (not used by all models). Only + relevant if `config.is_decoder=True`. + tie_word_embeddings (`bool`, *optional*, defaults to `False`): + Whether the model's input and output word embeddings should be tied. + rope_theta (`float`, *optional*, defaults to 10000.0): + The base period of the RoPE embeddings. + rope_scaling (`Dict`, *optional*): + Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply new rope type + and you expect the model to work on longer `max_position_embeddings`, we recommend you to update this value + accordingly. + Expected contents: + `rope_type` (`str`): + The sub-variant of RoPE to use. Can be one of ['default', 'linear', 'dynamic', 'yarn', 'longrope', + 'llama3'], with 'default' being the original RoPE implementation. + `factor` (`float`, *optional*): + Used with all rope types except 'default'. The scaling factor to apply to the RoPE embeddings. In + most scaling types, a `factor` of x will enable the model to handle sequences of length x * + original maximum pre-trained length. + `original_max_position_embeddings` (`int`, *optional*): + Used with 'dynamic', 'longrope' and 'llama3'. The original max position embeddings used during + pretraining. + `attention_factor` (`float`, *optional*): + Used with 'yarn' and 'longrope'. 
The scaling factor to be applied on the attention + computation. If unspecified, it defaults to value recommended by the implementation, using the + `factor` field to infer the suggested value. + `beta_fast` (`float`, *optional*): + Only used with 'yarn'. Parameter to set the boundary for extrapolation (only) in the linear + ramp function. If unspecified, it defaults to 32. + `beta_slow` (`float`, *optional*): + Only used with 'yarn'. Parameter to set the boundary for interpolation (only) in the linear + ramp function. If unspecified, it defaults to 1. + `short_factor` (`list[float]`, *optional*): + Only used with 'longrope'. The scaling factor to be applied to short contexts (< + `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden + size divided by the number of attention heads divided by 2 + `long_factor` (`list[float]`, *optional*): + Only used with 'longrope'. The scaling factor to be applied to long contexts (< + `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden + size divided by the number of attention heads divided by 2 + `low_freq_factor` (`float`, *optional*): + Only used with 'llama3'. Scaling factor applied to low frequency components of the RoPE + `high_freq_factor` (`float`, *optional*): + Only used with 'llama3'. Scaling factor applied to high frequency components of the RoPE + use_sliding_window (`bool`, *optional*, defaults to `False`): + Whether to use sliding window attention. + sliding_window (`int`, *optional*, defaults to 4096): + Sliding window attention (SWA) window size. If not specified, will default to `4096`. + max_window_layers (`int`, *optional*, defaults to 28): + The number of layers using full attention. The first `max_window_layers` layers will use full attention, while any + additional layer afterwards will use SWA (Sliding Window Attention). + layer_types (`list`, *optional*): + Attention pattern for each layer. + attention_dropout (`float`, *optional*, defaults to 0.0): + The dropout ratio for the attention probabilities. 
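As a concrete illustration of the `rope_scaling` dictionary documented in the arguments above, here is a hedged sketch of passing a YaRN-style scaling configuration; the import path and the specific factor values are assumptions chosen for demonstration, not recommended settings:

```python
from transformers.models.minicpm_o_2_6.configuration_minicpm_o_2_6 import MiniCPM_o_2_6TextConfig

# Hypothetical values: extend a 32k pre-trained context by a factor of 2 with YaRN.
config = MiniCPM_o_2_6TextConfig(
    max_position_embeddings=65536,
    rope_scaling={
        "rope_type": "yarn",
        "factor": 2.0,
        "original_max_position_embeddings": 32768,
    },
)
print(config.rope_scaling["rope_type"])  # "yarn"
```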
+ + ```python + >>> from transformers import MiniCPMO26TextModel, MiniCPMO26TextConfig + + >>> # Initializing a MiniCPMO26Text style configuration + >>> configuration = MiniCPMO26TextConfig() + + >>> # Initializing a model from the MiniCPMO26Text-7B style configuration + >>> model = MiniCPMO26TextModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ```""" + + model_type = "minicpmo" + keys_to_ignore_at_inference = ["past_key_values"] + + # Default tensor parallel plan for base model `MiniCPMO26Text` + base_model_tp_plan = { + "layers.*.self_attn.q_proj": "colwise", + "layers.*.self_attn.k_proj": "colwise", + "layers.*.self_attn.v_proj": "colwise", + "layers.*.self_attn.o_proj": "rowwise", + "layers.*.mlp.gate_proj": "colwise", + "layers.*.mlp.up_proj": "colwise", + "layers.*.mlp.down_proj": "rowwise", + } + base_model_pp_plan = { + "embed_tokens": (["input_ids"], ["inputs_embeds"]), + "layers": (["hidden_states", "attention_mask"], ["hidden_states"]), + "norm": (["hidden_states"], ["hidden_states"]), + } + + def __init__( + self, + vocab_size=151936, + hidden_size=4096, + intermediate_size=22016, + num_hidden_layers=32, + num_attention_heads=32, + num_key_value_heads=32, + hidden_act="silu", + max_position_embeddings=32768, + initializer_range=0.02, + rms_norm_eps=1e-6, + use_cache=True, + tie_word_embeddings=False, + rope_theta=10000.0, + rope_scaling=None, + use_sliding_window=False, + sliding_window=4096, + max_window_layers=28, + layer_types=None, + attention_dropout=0.0, + **kwargs, + ): + self.vocab_size = vocab_size + self.max_position_embeddings = max_position_embeddings + self.hidden_size = hidden_size + self.intermediate_size = intermediate_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.use_sliding_window = use_sliding_window + self.sliding_window = sliding_window if self.use_sliding_window else None + self.max_window_layers = max_window_layers + + # for backward compatibility + if num_key_value_heads is None: + num_key_value_heads = num_attention_heads + + self.num_key_value_heads = num_key_value_heads + self.hidden_act = hidden_act + self.initializer_range = initializer_range + self.rms_norm_eps = rms_norm_eps + self.use_cache = use_cache + self.rope_theta = rope_theta + self.rope_scaling = rope_scaling + self.attention_dropout = attention_dropout + # Validate the correctness of rotary position embeddings parameters + # BC: if there is a 'type' field, move it to 'rope_type'. + if self.rope_scaling is not None and "type" in self.rope_scaling: + self.rope_scaling["rope_type"] = self.rope_scaling["type"] + rope_config_validation(self) + + self.layer_types = layer_types + if self.layer_types is None: + self.layer_types = [ + "sliding_attention" + if self.sliding_window is not None and i >= self.max_window_layers + else "full_attention" + for i in range(self.num_hidden_layers) + ] + layer_type_validation(self.layer_types) + + super().__init__( + tie_word_embeddings=tie_word_embeddings, + **kwargs, + ) + + +class MiniCPMVisionConfig(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a [`MiniCPMVisionModel`]. It is used to instantiate a + MiniCPM vision encoder according to the specified arguments, defining the model architecture. 
Instantiating a + configuration with the defaults will yield a similar configuration to that of the vision encoder of the MiniCPM + [google/mini_c_p_m-base-patch16-224](https://huggingface.co/google/mini_c_p_m-base-patch16-224) architecture. + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information. + + Args: + hidden_size (`int`, *optional*, defaults to 768): + Dimensionality of the encoder layers and the pooler layer. + intermediate_size (`int`, *optional*, defaults to 3072): + Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. + num_hidden_layers (`int`, *optional*, defaults to 12): + Number of hidden layers in the Transformer encoder. + num_attention_heads (`int`, *optional*, defaults to 12): + Number of attention heads for each attention layer in the Transformer encoder. + num_channels (`int`, *optional*, defaults to 3): + Number of channels in the input images. + image_size (`int`, *optional*, defaults to 224): + The size (resolution) of each image. + patch_size (`int`, *optional*, defaults to 16): + The size (resolution) of each patch. + hidden_act (`str` or `function`, *optional*, defaults to `"gelu_pytorch_tanh"`): + The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, + `"relu"`, `"selu"` and `"gelu_new"` `"quick_gelu"` are supported. + layer_norm_eps (`float`, *optional*, defaults to 1e-06): + The epsilon used by the layer normalization layers. + attention_dropout (`float`, *optional*, defaults to 0.0): + The dropout ratio for the attention probabilities. + + Example: + + ```python + >>> from transformers import MiniCPMVisionConfig, MiniCPMVisionModel + + >>> # Initializing a MiniCPMVisionConfig with google/mini_c_p_m-base-patch16-224 style configuration + >>> configuration = MiniCPMVisionConfig() + + >>> # Initializing a MiniCPMVisionModel (with random weights) from the google/mini_c_p_m-base-patch16-224 style configuration + >>> model = MiniCPMVisionModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ```""" + + model_type = "mini_c_p_m_vision_model" + base_config_key = "vision_config" + + def __init__( + self, + hidden_size=768, + intermediate_size=3072, + num_hidden_layers=12, + num_attention_heads=12, + num_channels=3, + image_size=224, + patch_size=16, + hidden_act="gelu_pytorch_tanh", + layer_norm_eps=1e-6, + attention_dropout=0.0, + **kwargs, + ): + super().__init__(**kwargs) + + self.hidden_size = hidden_size + self.intermediate_size = intermediate_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.num_channels = num_channels + self.patch_size = patch_size + self.image_size = image_size + self.attention_dropout = attention_dropout + self.layer_norm_eps = layer_norm_eps + self.hidden_act = hidden_act + + +# fmt: on + + +class MiniCPMWhisperConfig(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a [`MiniCPMWhisperModel`]. It is used to instantiate a + MiniCPMWhisper model according to the specified arguments, defining the model architecture. Instantiating a configuration + with the defaults will yield a similar configuration to that of the MiniCPMWhisper + [openai/mini_c_p_m_whisper-tiny](https://huggingface.co/openai/mini_c_p_m_whisper-tiny) architecture. 
+ + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information. + + + Args: + vocab_size (`int`, *optional*, defaults to 51865): + Vocabulary size of the MiniCPMWhisper model. Defines the number of different tokens that can be represented by the + `decoder_input_ids` passed when calling [`MiniCPMWhisperModel`] + num_mel_bins (`int`, *optional*, defaults to 80): + Number of mel features used per input features. Should correspond to the value used in the + `MiniCPMWhisperProcessor` class. + encoder_layers (`int`, *optional*, defaults to 4): + Number of encoder layers. + decoder_layers (`int`, *optional*, defaults to 4): + Number of decoder layers. + encoder_attention_heads (`int`, *optional*, defaults to 6): + Number of attention heads for each attention layer in the Transformer encoder. + decoder_attention_heads (`int`, *optional*, defaults to 6): + Number of attention heads for each attention layer in the Transformer decoder. + encoder_ffn_dim (`int`, *optional*, defaults to 1536): + Dimensionality of the "intermediate" (often named feed-forward) layer in encoder. + decoder_ffn_dim (`int`, *optional*, defaults to 1536): + Dimensionality of the "intermediate" (often named feed-forward) layer in decoder. + encoder_layerdrop (`float`, *optional*, defaults to 0.0): + The LayerDrop probability for the encoder. See the [LayerDrop paper](see https://huggingface.co/papers/1909.11556) + for more details. + decoder_layerdrop (`float`, *optional*, defaults to 0.0): + The LayerDrop probability for the decoder. See the [LayerDrop paper](see https://huggingface.co/papers/1909.11556) + for more details. + decoder_start_token_id (`int`, *optional*, defaults to 50257): + Corresponds to the "<|startoftranscript|>" token, which is automatically used when no `decoder_input_ids` + are provided to the `generate` function. It is used to guide the model`s generation process depending on + the task. + use_cache (`bool`, *optional*, defaults to `True`): + Whether or not the model should return the last key/values attentions (not used by all models). + is_encoder_decoder (`bool`, *optional*, defaults to `True`): + Whether the model is used as an encoder/decoder or not. + activation_function (`str`, *optional*, defaults to `"gelu"`): + The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, + `"relu"`, `"silu"` and `"gelu_new"` are supported. + d_model (`int`, *optional*, defaults to 384): + Dimensionality of the layers. + dropout (`float`, *optional*, defaults to 0.1): + The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. + attention_dropout (`float`, *optional*, defaults to 0.0): + The dropout ratio for the attention probabilities. + activation_dropout (`float`, *optional*, defaults to 0.0): + The dropout ratio for activations inside the fully connected layer. + init_std (`float`, *optional*, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + scale_embedding (`bool`, *optional*, defaults to False): + Scale embeddings by diving by sqrt(d_model). + max_source_positions (`int`, *optional*, defaults to 1500): + The maximum sequence length of log-mel filter-bank features that this model might ever be used with. + max_target_positions (`int`, *optional*, defaults to 448): + The maximum sequence length that this model might ever be used with. 
Typically set this to something large + just in case (e.g., 512 or 1024 or 2048). + pad_token_id (`int`, *optional*, defaults to 50256): + Padding token id. + bos_token_id (`int`, *optional*, defaults to 50256): + Begin of stream token id. + eos_token_id (`int`, *optional*, defaults to 50256): + End of stream token id. + suppress_tokens (`list[int]`, *optional*): + A list containing the non-speech tokens that will be used by the logit processor in the `generate` + function. NON_SPEECH_TOKENS and NON_SPEECH_TOKENS_MULTI each correspond to the `english-only` and the + `multilingual` model. + begin_suppress_tokens (`list[int]`, *optional*, defaults to `[220,50256]`): + A list containing tokens that will be suppressed at the beginning of the sampling process. Initialized as + the token for `" "` (`blank_token_id`) and the `eos_token_id` + use_weighted_layer_sum (`bool`, *optional*, defaults to `False`): + Whether to use a weighted average of layer outputs with learned weights. Only relevant when using an + instance of [`MiniCPMWhisperForAudioClassification`]. + classifier_proj_size (`int`, *optional*, defaults to 256): + Dimensionality of the projection before token mean-pooling for classification. Only relevant when using an + instance of [`MiniCPMWhisperForAudioClassification`]. + apply_spec_augment (`bool`, *optional*, defaults to `False`): + Whether to apply *SpecAugment* data augmentation to the outputs of the feature encoder. For reference see + [SpecAugment: A Simple Data Augmentation Method for Automatic Speech + Recognition](https://huggingface.co/papers/1904.08779). + mask_time_prob (`float`, *optional*, defaults to 0.05): + Percentage (between 0 and 1) of all feature vectors along the time axis which will be masked. The masking + procedure generates `mask_time_prob*len(time_axis)/mask_time_length` independent masks over the axis. If + reasoning from the probability of each feature vector to be chosen as the start of the vector span to be + masked, *mask_time_prob* should be `prob_vector_start*mask_time_length`. Note that overlap may decrease the + actual percentage of masked vectors. This is only relevant if `apply_spec_augment == True`. + mask_time_length (`int`, *optional*, defaults to 10): + Length of vector span along the time axis. + mask_time_min_masks (`int`, *optional*, defaults to 2),: + The minimum number of masks of length `mask_feature_length` generated along the time axis, each time step, + irrespectively of `mask_feature_prob`. Only relevant if ''mask_time_prob*len(time_axis)/mask_time_length < + mask_time_min_masks'' + mask_feature_prob (`float`, *optional*, defaults to 0.0): + Percentage (between 0 and 1) of all feature vectors along the feature axis which will be masked. The + masking procedure generates `mask_feature_prob*len(feature_axis)/mask_time_length` independent masks over + the axis. If reasoning from the probability of each feature vector to be chosen as the start of the vector + span to be masked, *mask_feature_prob* should be `prob_vector_start*mask_feature_length`. Note that overlap + may decrease the actual percentage of masked vectors. This is only relevant if `apply_spec_augment is + True`. + mask_feature_length (`int`, *optional*, defaults to 10): + Length of vector span along the feature axis. + mask_feature_min_masks (`int`, *optional*, defaults to 0),: + The minimum number of masks of length `mask_feature_length` generated along the feature axis, each time + step, irrespectively of `mask_feature_prob`. 
Only relevant if + `mask_feature_prob*len(feature_axis)/mask_feature_length < mask_feature_min_masks`. + median_filter_width (`int`, *optional*, defaults to 7): + Width of the median filter used to smoothen to cross-attention outputs when computing token timestamps. + Should be an odd number. + + Example: + + ```python + >>> from transformers import MiniCPMWhisperConfig, MiniCPMWhisperModel + + >>> # Initializing a MiniCPMWhisper tiny style configuration + >>> configuration = MiniCPMWhisperConfig() + + >>> # Initializing a model (with random weights) from the tiny style configuration + >>> model = MiniCPMWhisperModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ```""" + + model_type = "mini_c_p_m_whisper" + keys_to_ignore_at_inference = ["past_key_values"] + attribute_map = { + "num_key_value_heads": "encoder_attention_heads", + "num_attention_heads": "encoder_attention_heads", + "hidden_size": "d_model", + } + + def __init__( + self, + vocab_size=51865, + num_mel_bins=80, + encoder_layers=4, + encoder_attention_heads=6, + decoder_layers=4, + decoder_attention_heads=6, + decoder_ffn_dim=1536, + encoder_ffn_dim=1536, + encoder_layerdrop=0.0, + decoder_layerdrop=0.0, + decoder_start_token_id=50257, + use_cache=True, + is_encoder_decoder=True, + activation_function="gelu", + d_model=384, + dropout=0.0, + attention_dropout=0.0, + activation_dropout=0.0, + init_std=0.02, + scale_embedding=False, + max_source_positions=1500, + max_target_positions=448, + pad_token_id=50256, + bos_token_id=50256, + eos_token_id=50256, + suppress_tokens=None, + begin_suppress_tokens=[220, 50256], + use_weighted_layer_sum=False, + classifier_proj_size=256, + apply_spec_augment=False, + mask_time_prob=0.05, + mask_time_length=10, + mask_time_min_masks=2, + mask_feature_prob=0.0, + mask_feature_length=10, + mask_feature_min_masks=0, + median_filter_width=7, + **kwargs, + ): + self.vocab_size = vocab_size + self.num_mel_bins = num_mel_bins + self.d_model = d_model + self.encoder_layers = encoder_layers + self.encoder_attention_heads = encoder_attention_heads + self.decoder_layers = decoder_layers + self.decoder_attention_heads = decoder_attention_heads + self.decoder_ffn_dim = decoder_ffn_dim + self.encoder_ffn_dim = encoder_ffn_dim + self.dropout = dropout + self.attention_dropout = attention_dropout + self.activation_dropout = activation_dropout + self.activation_function = activation_function + self.init_std = init_std + self.encoder_layerdrop = encoder_layerdrop + self.decoder_layerdrop = decoder_layerdrop + self.use_cache = use_cache + self.num_hidden_layers = encoder_layers + self.scale_embedding = scale_embedding # scale factor will be sqrt(d_model) if True + self.max_source_positions = max_source_positions + self.max_target_positions = max_target_positions + + # Audio Classification-specific parameters. Feel free to ignore for other classes. 
+ self.classifier_proj_size = classifier_proj_size + self.use_weighted_layer_sum = use_weighted_layer_sum + + # fine-tuning config parameters for SpecAugment: https://huggingface.co/papers/1904.08779 + self.apply_spec_augment = apply_spec_augment + self.mask_time_prob = mask_time_prob + self.mask_time_length = mask_time_length + self.mask_time_min_masks = mask_time_min_masks + self.mask_feature_prob = mask_feature_prob + self.mask_feature_length = mask_feature_length + self.mask_feature_min_masks = mask_feature_min_masks + + self.median_filter_width = median_filter_width + + super().__init__( + pad_token_id=pad_token_id, + bos_token_id=bos_token_id, + eos_token_id=eos_token_id, + is_encoder_decoder=is_encoder_decoder, + decoder_start_token_id=decoder_start_token_id, + suppress_tokens=suppress_tokens, + begin_suppress_tokens=begin_suppress_tokens, + **kwargs, + ) + + +class MiniCPM_o_2_6Config(PretrainedConfig): + default_vision_config = { + "hidden_size": 1152, + "image_size": 980, + "intermediate_size": 4304, + "model_type": "siglip", + "num_attention_heads": 16, + "num_hidden_layers": 27, + "patch_size": 14, + } + + def __init__( + self, + text_config=None, + vision_config=None, + audio_config=None, + tts_config=None, + use_cache=True, + query_num=64, + drop_vision_last_layer=True, + vision_batch_size=16, + audio_pool_step=2, + audio_chunk_length=1.0, + **kwargs, + ): + self.use_cache = use_cache + self.query_num = query_num + self.drop_vision_last_layer = drop_vision_last_layer + self.vision_batch_size = vision_batch_size + self.audio_pool_step = audio_pool_step + self.audio_chunk_length = audio_chunk_length + + if text_config is None: + self.text_config = MiniCPM_o_2_6TextConfig() + elif isinstance(text_config, dict): + self.text_config = MiniCPM_o_2_6TextConfig(**text_config) + elif isinstance(text_config, MiniCPM_o_2_6TextConfig): + self.text_config = text_config + + if vision_config is None: + self.vision_config = MiniCPMVisionConfig(**self.default_vision_config) + logger.info("vision_config is None, using default vision config") + elif isinstance(vision_config, dict): + self.vision_config = MiniCPMVisionConfig(**vision_config) + elif isinstance(vision_config, MiniCPMVisionConfig): + self.vision_config = vision_config + + # same as openai/whisper-medium add use_cache + if audio_config is None: + self.audio_config = MiniCPMWhisperConfig() + elif isinstance(audio_config, dict): + self.audio_config = MiniCPMWhisperConfig(**audio_config) + elif isinstance(audio_config, MiniCPMWhisperConfig): + self.audio_config = audio_config + + if tts_config is None: + self.tts_config = MiniCPMConditionalTTSConfig() + elif isinstance(tts_config, dict): + self.tts_config = MiniCPMConditionalTTSConfig(**tts_config) + elif isinstance(tts_config, MiniCPMConditionalTTSConfig): + self.tts_config = tts_config + + # self.patch_size = self.vision_config.patch_size + super().__init__(**kwargs) + + __all__ = ["MiniCPM_o_2_6Config"] diff --git a/src/transformers/models/minicpm_o_2_6/feature_extractor_minicpm_o_2_6.py b/src/transformers/models/minicpm_o_2_6/feature_extractor_minicpm_o_2_6.py index 2cb53022d19a..c39f60d1af82 100644 --- a/src/transformers/models/minicpm_o_2_6/feature_extractor_minicpm_o_2_6.py +++ b/src/transformers/models/minicpm_o_2_6/feature_extractor_minicpm_o_2_6.py @@ -14,34 +14,44 @@ # limitations under the License. 
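Before the feature extractor file continues, a brief hedged sketch of how the composite `MiniCPM_o_2_6Config` defined above assembles its text, vision, audio, and TTS sub-configurations. The import path and the override values are illustrative assumptions based only on the constructor shown in this patch:

```python
from transformers.models.minicpm_o_2_6.configuration_minicpm_o_2_6 import MiniCPM_o_2_6Config

# Default construction: the text, vision (SigLIP-style defaults), audio (Whisper-style)
# and TTS sub-configs are all filled in by the constructor when left as None.
config = MiniCPM_o_2_6Config()
print(type(config.vision_config).__name__)  # MiniCPMVisionConfig
print(config.audio_config.num_mel_bins)     # 80 with the defaults used in this sketch

# Sub-configs may also be passed as plain dicts; unspecified fields keep their defaults.
config = MiniCPM_o_2_6Config(
    vision_config={"image_size": 980, "patch_size": 14},  # illustrative overrides
    audio_pool_step=2,
)
```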
import math -from typing import List, Optional, Union +from typing import Optional, Union -from transformers import WhisperFeatureExtractor, AutoFeatureExtractor, AutoTokenizer import numpy as np import torch +from ..whisper.feature_extraction_whisper import WhisperFeatureExtractor + class MiniCPM_o_2_6FeatureExtractor(WhisperFeatureExtractor): def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) + def format_audios(self, audios): + """ + Normalize audios format to list of list of numpy arrays. + + Args: + audios: Union[np.ndarray, List[np.ndarray], List[List[np.ndarray]]] + + Returns: + List[List[np.ndarray]]: Normalized audio format + """ + # in batch inference, it may be [[]] + if isinstance(audios, np.ndarray): + return [[audios]] + elif isinstance(audios[0], np.ndarray): + return [audios] + else: + return audios + def __call__( self, - tokenizer: None, - audios: Union[np.ndarray, List[np.ndarray], List[List[np.ndarray]]], + audios: Union[np.ndarray, list[np.ndarray], list[list[np.ndarray]]], audio_parts: Optional[list] = None, - chunk_input: Optional[bool] = False, sampling_rate: Optional[int] = None, - chunk_length: Optional[int] = 1, **kwargs, ): - # in batch inference, it may be [[]] - if isinstance(audios, np.ndarray): - audios_list = [[audios]] - elif isinstance(audios[0], np.ndarray): - audios_list = [audios] - else: - audios_list = audios + audios_list = self.format_audios(audios) if audio_parts is not None: assert len(audio_parts) == len(audios_list) @@ -49,19 +59,8 @@ def __call__( assert len(parts) == len(audios) audio_feature_lens_list = [] - audio_ph_list = [] - audio_features_all = [] - # audio placeholder not dependent on audio_parts - for audios in audios_list: - if audios: - audio_ph_list.append( - [self.get_audio_placeholder(tokenizer, len(a), chunk_input, chunk_length) for a in audios] - ) - else: - audio_ph_list.append([]) - for idx, audios in enumerate(audios_list): if audio_parts is not None: # same audio part merge @@ -90,7 +89,7 @@ def __call__( final_merge_audio.append(audio) else: for i in range(math.ceil(len(audio) / max_audio_inp_len)): - final_merge_audio.append(audio[i * max_audio_inp_len : (i + 1) * max_audio_inp_len]) + final_merge_audio.append(audio[i * max_audio_inp_len: (i + 1) * max_audio_inp_len]) if audios: audio_inputs = super().__call__( @@ -121,34 +120,7 @@ def __call__( else: audio_features = [] - return audio_features, audio_feature_lens_list, audio_ph_list - - def get_audio_placeholder(self, tokenizer, audio_lens, chunk_input, chunk_length): - pool_step = 2 - feature_lens = math.ceil(audio_lens / self.hop_length) - - feature_lens = (feature_lens - 1) // 2 + 1 - output_lens = (feature_lens - pool_step) // pool_step + 1 - - if chunk_input: - fbank_feat_in_chunk = int(chunk_length * 100) - cnn_feat_in_chunk = (fbank_feat_in_chunk - 1) // 2 + 1 - audio_embeds_in_chunk = (cnn_feat_in_chunk - pool_step) // pool_step + 1 - num_audio_chunks = (output_lens + audio_embeds_in_chunk - 1) // audio_embeds_in_chunk - - place_holders = "" - total_unk_len = 0 - for _ in range(num_audio_chunks): - unk_len = min(audio_embeds_in_chunk, output_lens - total_unk_len) - place_holders += tokenizer.audio_start + tokenizer.unk_token * unk_len + tokenizer.audio_end - total_unk_len += unk_len - audio_placeholder = place_holders - else: - audio_placeholder = tokenizer.audio_start + tokenizer.unk_token * output_lens + tokenizer.audio_end - - return audio_placeholder - + return audio_features, audio_feature_lens_list 
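The `__call__` above returns the Whisper-style audio features together with the per-segment feature lengths, with a single `np.ndarray` first normalized to the nested `[[audio]]` layout by `format_audios`. A minimal hedged sketch of invoking it, assuming the module path from this patch, default constructor arguments, and 16 kHz input as expected by the underlying Whisper feature extractor:

```python
import numpy as np

from transformers.models.minicpm_o_2_6.feature_extractor_minicpm_o_2_6 import (
    MiniCPM_o_2_6FeatureExtractor,
)

# Two seconds of silent 16 kHz audio as a stand-in for real speech input.
audio = np.zeros(2 * 16000, dtype=np.float32)

feature_extractor = MiniCPM_o_2_6FeatureExtractor()  # Whisper feature-extractor defaults

# Outputs are nested per sample and per audio segment, mirroring format_audios.
audio_features, audio_feature_lens_list = feature_extractor(audio, sampling_rate=16000)
print(audio_feature_lens_list[0])
```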
-AutoFeatureExtractor.register("MiniCPM_o_2_6FeatureExtractor", MiniCPM_o_2_6FeatureExtractor) __all__ = ["MiniCPM_o_2_6FeatureExtractor"] diff --git a/src/transformers/models/minicpm_o_2_6/image_processing_minicpm.py b/src/transformers/models/minicpm_o_2_6/image_processing_minicpm_fast.py similarity index 70% rename from src/transformers/models/minicpm_o_2_6/image_processing_minicpm.py rename to src/transformers/models/minicpm_o_2_6/image_processing_minicpm_fast.py index 544ad4da61af..7ca533aea8e8 100755 --- a/src/transformers/models/minicpm_o_2_6/image_processing_minicpm.py +++ b/src/transformers/models/minicpm_o_2_6/image_processing_minicpm_fast.py @@ -14,21 +14,24 @@ # limitations under the License. import math -from typing import List -from typing import Optional -from typing import Union +from typing import Optional, Union import numpy as np from numpy.lib.stride_tricks import as_strided -import torchvision.transforms as transforms from PIL import Image -from transformers import AutoImageProcessor -from transformers.image_processing_utils import BaseImageProcessor -from transformers.image_transforms import to_pil_image -from transformers.image_utils import valid_images, make_nested_list_of_images -from transformers.utils import TensorType, IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD +from ...image_processing_utils_fast import BaseImageProcessorFast +from ...image_transforms import to_pil_image +from ...image_utils import valid_images, make_nested_list_of_images +from ...utils import TensorType, IMAGENET_STANDARD_MEAN, IMAGENET_STANDARD_STD +from ...utils.import_utils import is_torchvision_available, is_torchvision_v2_available from .processing_minicpm_o_2_6 import MiniCPMOBatchFeature +if is_torchvision_available(): + if is_torchvision_v2_available(): + from torchvision.transforms.v2 import functional as F + else: + from torchvision.transforms import functional as F + def recursive_converter(converter, value): if isinstance(value, list): @@ -40,7 +43,7 @@ def recursive_converter(converter, value): return converter(value) -class MiniCPMVImageProcessor(BaseImageProcessor): +class MiniCPMVImageProcessorFast(BaseImageProcessorFast): model_input_names = ["pixel_values"] def __init__( @@ -62,9 +65,10 @@ def __init__( self.slice_mode = kwargs.pop("slice_mode", True) - self.image_mean = np.array(image_mean if image_mean is not None else IMAGENET_DEFAULT_MEAN) - self.image_std = np.array(image_std if image_std is not None else IMAGENET_DEFAULT_STD) - self.version = kwargs.pop("version", 2.0) + self.image_mean = np.array( + image_mean if image_mean is not None else IMAGENET_STANDARD_MEAN) + self.image_std = np.array( + image_std if image_std is not None else IMAGENET_STANDARD_STD) def ensure_divide(self, length, patch_size): return max(round(length / patch_size) * patch_size, patch_size) @@ -112,54 +116,38 @@ def split_to_patches(self, image, grid): def slice_image(self, image, max_slice_nums=9, scale_resolution=448, patch_size=14, never_split=False): original_size = image.size source_image = None - best_grid = self.get_sliced_grid(original_size, max_slice_nums, never_split) + best_grid = self.get_sliced_grid( + original_size, max_slice_nums, never_split) patches = [] if best_grid is None: # dont need to slice, upsample - best_size = self.find_best_resize(original_size, scale_resolution, patch_size, allow_upscale=True) - source_image = image.resize(best_size, resample=Image.Resampling.BICUBIC) + best_size = self.find_best_resize( + original_size, scale_resolution, patch_size, 
allow_upscale=True) + source_image = image.resize( + best_size, resample=Image.Resampling.BICUBIC) else: # source image, down-sampling and ensure divided by patch_size - best_resize = self.find_best_resize(original_size, scale_resolution, patch_size) + best_resize = self.find_best_resize( + original_size, scale_resolution, patch_size) source_image = image.copy().resize(best_resize, resample=Image.Resampling.BICUBIC) refine_size = self.get_refine_size( original_size, best_grid, scale_resolution, patch_size, allow_upscale=True ) - refine_image = image.resize(refine_size, resample=Image.Resampling.BICUBIC) + refine_image = image.resize( + refine_size, resample=Image.Resampling.BICUBIC) patches = self.split_to_patches(refine_image, best_grid) return source_image, patches, best_grid - def get_grid_placeholder(self, tokenizer, grid): - if grid is None: - return "" - slice_image_placeholder = ( - tokenizer.slice_start + tokenizer.unk_token * self.image_feature_size + tokenizer.slice_end - ) - - cols = grid[0] - rows = grid[1] - slices = [] - for i in range(rows): - lines = [] - for j in range(cols): - lines.append(slice_image_placeholder) - slices.append("".join(lines)) - - slice_placeholder = "\n".join(slices) - return slice_placeholder - - # def get_image_id_placeholder(self, idx=0): - # return f"{self.tokenizer.im_id_start}{idx}{self.tokenizer.im_id_end}" - def get_sliced_images(self, image, max_slice_nums=None): slice_images = [] if not self.slice_mode: return [image] - max_slice_nums = self.max_slice_nums if max_slice_nums is None else int(max_slice_nums) + max_slice_nums = self.max_slice_nums if max_slice_nums is None else int( + max_slice_nums) assert max_slice_nums > 0 source_image, patches, sliced_grid = self.slice_image( # default: 9 # default: 448 # default: 14 @@ -179,7 +167,8 @@ def get_sliced_images(self, image, max_slice_nums=None): def get_sliced_grid(self, image_size, max_slice_nums, nerver_split=False): original_width, original_height = image_size log_ratio = math.log(original_width / original_height) - ratio = original_width * original_height / (self.scale_resolution * self.scale_resolution) + ratio = original_width * original_height / \ + (self.scale_resolution * self.scale_resolution) multiple = min(math.ceil(ratio), max_slice_nums) if multiple <= 1 or nerver_split: return None @@ -207,22 +196,6 @@ def get_sliced_grid(self, image_size, max_slice_nums, nerver_split=False): return best_grid - def get_slice_image_placeholder(self, tokenizer, image_size, image_idx=0, max_slice_nums=None, use_image_id=None): - max_slice_nums = self.max_slice_nums if max_slice_nums is None else int(max_slice_nums) - assert max_slice_nums > 0 - grid = self.get_sliced_grid(image_size=image_size, max_slice_nums=max_slice_nums) - - image_placeholder = tokenizer.im_start + tokenizer.unk_token * self.image_feature_size + tokenizer.im_end - use_image_id = self.use_image_id if use_image_id is None else bool(use_image_id) - if use_image_id: - final_placeholder = f"{tokenizer.im_id_start}{image_idx}{tokenizer.im_id_end}" + image_placeholder - else: - final_placeholder = image_placeholder - - if self.slice_mode: - final_placeholder = final_placeholder + self.get_grid_placeholder(tokenizer, grid=grid) - return final_placeholder - def reshape_by_patch(self, image): """ :param image: shape [3, H, W] @@ -244,10 +217,10 @@ def reshape_by_patch(self, image): def preprocess( self, - images: Union[Image.Image, List[Image.Image], List[List[Image.Image]]], - do_pad: Optional[bool] = True, + images: Union[Image.Image, 
list[Image.Image], list[list[Image.Image]]], max_slice_nums: int = None, return_tensors: Optional[Union[str, TensorType]] = None, + do_normalize: bool = True, **kwargs, ) -> MiniCPMOBatchFeature: # in batch inference, it may be [[]], so we can't use `make_nested_list_of_images` @@ -258,9 +231,6 @@ def preprocess( else: images_list = images - to_tensor = transforms.ToTensor() - normalize_transform = transforms.Normalize(mean=self.image_mean.tolist(), std=self.image_std.tolist()) - new_images_list = [] image_sizes_list = [] tgt_sizes_list = [] @@ -286,17 +256,21 @@ def preprocess( for patch in image_patches: # Convert PIL to tensor (0-1 range) and normalize # Shape: [C, H, W], range [0, 1] - tensor_patch = to_tensor(patch) - normalized_patch = normalize_transform(tensor_patch) # Apply normalization + tensor_patch = F.to_tensor(patch) + if do_normalize: + normalized_patch = F.normalize(tensor_patch, mean=self.image_mean.tolist(), + std=self.image_std.tolist()) # Apply normalization image_patches_tensors.append(normalized_patch) # Convert back to numpy for compatibility with existing code - image_patches = [patch.numpy() for patch in image_patches_tensors] + image_patches = [patch.numpy() + for patch in image_patches_tensors] for slice_image in image_patches: new_images.append(self.reshape_by_patch(slice_image)) tgt_sizes.append( - np.array((slice_image.shape[1] // self.patch_size, slice_image.shape[2] // self.patch_size)) + np.array( + (slice_image.shape[1] // self.patch_size, slice_image.shape[2] // self.patch_size)) ) # in batch inference, it may be [] @@ -306,13 +280,12 @@ def preprocess( new_images_list.append(new_images) image_sizes_list.append(image_sizes) tgt_sizes_list.append(tgt_sizes) + return MiniCPMOBatchFeature( - data={"pixel_values": new_images_list, "image_sizes": image_sizes_list, "tgt_sizes": tgt_sizes_list}, + data={"pixel_values": new_images_list, + "image_sizes": image_sizes_list, "tgt_sizes": tgt_sizes_list}, tensor_type=return_tensors, ) -AutoImageProcessor.register("MiniCPMVImageProcessor", MiniCPMVImageProcessor) - - -__all__ = ["MiniCPMVImageProcessor"] +__all__ = ["MiniCPMVImageProcessorFast"] diff --git a/src/transformers/models/minicpm_o_2_6/modeling_minicpm_o_2_6.py b/src/transformers/models/minicpm_o_2_6/modeling_minicpm_o_2_6.py index e8ab46e8667f..12e002ad79e3 100644 --- a/src/transformers/models/minicpm_o_2_6/modeling_minicpm_o_2_6.py +++ b/src/transformers/models/minicpm_o_2_6/modeling_minicpm_o_2_6.py @@ -67,41 +67,32 @@ add_start_docstrings_to_model_forward, auto_docstring, can_return_tuple, - is_flash_attn_2_available, logging, replace_return_docstrings, ) from ...utils.deprecation import deprecate_kwarg -from ..whisper.configuration_whisper import WhisperConfig -from ..siglip.configuration_siglip import SiglipVisionConfig from ..bert.tokenization_bert_fast import BertTokenizerFast from ...utils.generic import check_model_inputs +from ...utils.import_utils import _is_package_available, is_flash_attn_2_available from .configuration_minicpm_o_2_6 import ( MiniCPM_o_2_6Config, MiniCPMConditionalTTSConfig, MiniCPMConditionalTTSTextConfig, + MiniCPMVisionConfig, + MiniCPMWhisperConfig, ) -from .processing_minicpm_o_2_6 import NumberToTextConverter, VoiceChecker, sentence_end +from .tts_processing_minicpm_o_2_6 import ChatTTSProcessor, NumberToTextConverter, VoiceChecker, sentence_end if is_flash_attn_2_available(): from flash_attn import flash_attn_func, flash_attn_varlen_func - from flash_attn.bert_padding import ( - index_first_axis, # noqa - pad_input, - 
unpad_input, - ) + from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input -try: +if _is_package_available("vector_quantize_pytorch") and _is_package_available("vocos"): from vector_quantize_pytorch import GroupedResidualFSQ from vocos import Vocos from vocos.pretrained import instantiate_class - _tts_deps = True -except: - _tts_deps = False - - logger = logging.get_logger(__name__) @@ -339,7 +330,7 @@ class MiniCPM_o_2_6PreTrainedModel(PreTrainedModel): config: MiniCPM_o_2_6Config base_model_prefix = "model" supports_gradient_checkpointing = True - _no_split_modules = ["MiniCPM_o_2_6TextDecoderLayer"] + _no_split_modules = ["MiniCPM_o_2_6DecoderLayer"] _skip_keys_device_placement = ["past_key_values"] _supports_flash_attn = True _supports_sdpa = True @@ -351,24 +342,6 @@ class MiniCPM_o_2_6PreTrainedModel(PreTrainedModel): "hidden_states": MiniCPM_o_2_6DecoderLayer, "attentions": MiniCPM_o_2_6Attention, } - config_class = MiniCPM_o_2_6Config - _supports_flash_attn_2 = True - _supports_cache_class = True - _supports_quantized_cache = True - _supports_static_cache = True - - def _init_weights(self, module): - std = self.config.initializer_range - if isinstance(module, nn.Linear): - module.weight.data.normal_(mean=0.0, std=std) - if module.bias is not None: - module.bias.data.zero_() - elif isinstance(module, nn.Embedding): - module.weight.data.normal_(mean=0.0, std=std) - if module.padding_idx is not None: - module.weight.data[module.padding_idx].zero_() - elif isinstance(module, MiniCPM_o_2_6TextRMSNorm): - module.weight.data.fill_(1.0) class MiniCPM_o_2_6RotaryEmbedding(nn.Module): @@ -408,7 +381,7 @@ def forward(self, x, position_ids): @auto_docstring -class MiniCPMTextModel(MiniCPM_o_2_6PreTrainedModel): +class MiniCPM_o_2_6TextModel(MiniCPM_o_2_6PreTrainedModel): def __init__(self, config: MiniCPM_o_2_6Config): super().__init__(config) self.padding_idx = config.pad_token_id @@ -499,6 +472,9 @@ def forward( ) +_tts_deps = _is_package_available("vector_quantize_pytorch") and _is_package_available("vocos") + + def _prepare_4d_causal_attention_mask_with_cache_position( attention_mask: torch.Tensor, sequence_length: int, @@ -572,16 +548,20 @@ def gen_logits( return logits_warpers, logits_processors -class MiniCPM_o_2_6Model(MiniCPM_o_2_6PreTrainedModel, GenerationMixin): +class MiniCPM_o_2_6ForConditionalGeneration(MiniCPM_o_2_6PreTrainedModel, GenerationMixin): _tied_weights_keys = ["lm_head.weight"] _tp_plan = {"lm_head": "colwise_rep"} _pp_plan = {"lm_head": (["hidden_states"], ["logits"])} - def __init__(self, config): - super().__init__(config) - self.language_model = MiniCPMTextModel(config) - self.vocab_size = config.vocab_size - self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False) + def __init__(self, config: MiniCPM_o_2_6Config): + super().__init__(config.text_config) + + text_config = config.text_config + self.language_model = MiniCPM_o_2_6TextModel(text_config) + self.vocab_size = text_config.vocab_size + self.lm_head = nn.Linear(text_config.hidden_size, text_config.vocab_size, bias=False) + + self.omni_config = config # Initialize weights and apply final processing self.post_init() @@ -592,12 +572,12 @@ def __init__(self, config): # init vision module self.vpm = self.init_vision_module() self.vision_dim = self.vpm.embed_dim - self.resampler = self.init_resampler(self.embed_dim, self.vision_dim) + self.resampler = self.init_resampler(config.query_num, self.embed_dim, self.vision_dim) # init audio module self.apm = 
self.init_audio_module() audio_output_dim = int(self.apm.config.encoder_ffn_dim // 4) - self.audio_avg_pooler = nn.AvgPool1d(self.config.audio_pool_step, stride=self.config.audio_pool_step) + self.audio_avg_pooler = nn.AvgPool1d(self.omni_config.audio_pool_step, stride=self.omni_config.audio_pool_step) self.audio_projection_layer = MultiModalProjector(in_dim=audio_output_dim, out_dim=self.embed_dim) self.audio_encoder_layer = -1 @@ -627,10 +607,8 @@ def init_tts( load tts tokenizer and vocos 1. try load form local 2. try load from huggingface """ - from .processing_minicpm_o_2_6 import ChatTTSProcessor - if tts_text_tokenizer_path is None: - tts_text_tokenizer_path = os.path.join(self.config._name_or_path, "assets/chattts_tokenizer") + tts_text_tokenizer_path = os.path.join(self.omni_config._name_or_path, "assets/chattts_tokenizer") if not os.path.exists(tts_text_tokenizer_path): # try from hf model_id tts_text_tokenizer_path = "openbmb/chattts_tokenizer" @@ -639,7 +617,7 @@ def init_tts( self.tts_processor = ChatTTSProcessor(text_tokenizer=tts_text_tokenizer) if vocos_ckpt_path is None: - vocos_ckpt_path = os.path.join(self.config._name_or_path, "assets/Vocos.pt") + vocos_ckpt_path = os.path.join(self.omni_config._name_or_path, "assets/Vocos.pt") if not os.path.exists(vocos_ckpt_path): vocos_ckpt_path = hf_hub_download(repo_id="openbmb/MiniCPM-o-2_6", subfolder="assets", filename="Vocos.pt") @@ -670,12 +648,12 @@ def initialize_vocos(self, ckpt_path): return vocos def init_vision_module(self): - if self.config._attn_implementation == "flash_attention_2": - self.config.vision_config._attn_implementation = "flash_attention_2" + if self.omni_config._attn_implementation == "flash_attention_2": + self.omni_config.vision_config._attn_implementation = "flash_attention_2" else: - self.config.vision_config._attn_implementation = "eager" - model = MiniCPMVisionTransformer(self.config.vision_config) - if self.config.drop_vision_last_layer: + self.omni_config.vision_config._attn_implementation = "eager" + model = MiniCPMVisionTransformer(self.omni_config.vision_config) + if self.omni_config.drop_vision_last_layer: model.encoder.layers = model.encoder.layers[:-1] setattr(model, "embed_dim", model.embeddings.embed_dim) @@ -683,9 +661,9 @@ def init_vision_module(self): return model - def init_resampler(self, embed_dim, vision_dim): + def init_resampler(self, query_num, embed_dim, vision_dim): return Resampler( - num_queries=self.config.query_num, + num_queries=query_num, embed_dim=embed_dim, num_heads=embed_dim // 128, kv_dim=vision_dim, @@ -693,11 +671,11 @@ def init_resampler(self, embed_dim, vision_dim): ) def init_audio_module(self): - model = MiniCPMWhisperEncoder(self.config.audio_config) + model = MiniCPMWhisperEncoder(self.omni_config.audio_config) return model def init_tts_module(self): - model = ConditionalChatTTS(self.config.tts_config) + model = ConditionalChatTTS(self.omni_config.tts_config) return model def get_input_embeddings(self): @@ -769,8 +747,8 @@ def _get_feat_extract_output_lengths(self, input_lengths: torch.LongTensor): """ input_lengths_after_cnn = (input_lengths - 1) // 2 + 1 input_lengths_after_pooling = ( - input_lengths_after_cnn - self.config.audio_pool_step - ) // self.config.audio_pool_step + 1 + input_lengths_after_cnn - self.omni_config.audio_pool_step + ) // self.omni_config.audio_pool_step + 1 input_lengths_after_pooling = input_lengths_after_pooling.to(dtype=torch.int32) return input_lengths_after_cnn, input_lengths_after_pooling @@ -798,7 +776,7 @@ def 
get_image_features(self, pixel_values_list, tgt_sizes, dtype, device): for i in range(B): patch_attn_mask[i, 0, : tgt_sizes[i][0] * tgt_sizes[i][1]] = True - vision_batch_size = self.config.vision_batch_size + vision_batch_size = self.omni_config.vision_batch_size all_pixel_values = all_pixel_values.type(dtype) if B > vision_batch_size: hs = [] @@ -1047,7 +1025,7 @@ def get_omni_embedding(self, data, input_embeddings, chunk_length=-1, stream_inp assert len(audio_embeddings) == len(input_embeddings) audio_bounds = data["audio_bounds"] - if self.config.chunk_input: + if self.omni_config.chunk_input: for i in range(bs): audio_embs = torch.cat(audio_embeddings[i], dim=0).to( device=input_embeddings.device, dtype=input_embeddings.dtype @@ -1116,9 +1094,9 @@ def forward( >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." ```""" - output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_attentions = output_attentions if output_attentions is not None else self.omni_config.output_attentions output_hidden_states = ( - output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + output_hidden_states if output_hidden_states is not None else self.omni_config.output_hidden_states ) # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn) @@ -1142,7 +1120,7 @@ def forward( loss = None if labels is not None: - loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs) + loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.vocab_size, **kwargs) return CausalLMOutputWithPast( loss=loss, @@ -1237,7 +1215,7 @@ def generate( model_inputs["inputs_embeds"] = self.get_omni_embedding( model_inputs, input_embeddings=model_inputs["inputs_embeds"], - chunk_length=self.config.audio_chunk_length, + chunk_length=self.omni_config.audio_chunk_length, ) if stream: @@ -1270,7 +1248,7 @@ def stream_gen(): spk_embeds = wav_numpy = sr = None if not batched and use_tts_template and generate_audio: - result = processor.decode_text(outputs.sequences, processor.tokenizer) + result = processor.decode(outputs.sequences) mel_spec = self._generate_mel_spec( model_inputs, outputs, @@ -1612,7 +1590,7 @@ def check_uncompleted_token(ids): end = check_uncompleted_token(cur_ids[0]) left_ids = cur_ids[:, end:] cur_ids = cur_ids[:, :end] - text = processor.decode_text(cur_ids, tokenizer)[0] if end > 0 else "" + text = processor.decode(cur_ids)[0] if end > 0 else "" self.llm_past_key_values = outputs.past_key_values input_ids = outputs.sequences[:, -1:] @@ -2247,6 +2225,37 @@ def decode_mel_to_audio(self, mel_spec, output_path=""): logger.info(f"Audio saved to {output_path}") return wav_numpy, sr + +def whisper_eager_attention_forward( + module: nn.Module, + query: torch.Tensor, + key: torch.Tensor, + value: torch.Tensor, + attention_mask: Optional[torch.Tensor], + scaling: Optional[float] = None, + dropout: float = 0.0, + head_mask: Optional[torch.Tensor] = None, + **kwargs, +): + if scaling is None: + scaling = query.size(-1) ** -0.5 + + attn_weights = torch.matmul(query, key.transpose(2, 3)) * scaling + if attention_mask is not None and attention_mask.ndim == 4: + attn_weights = attn_weights + attention_mask[:, :, :, : key.shape[-2]] + + attn_weights = nn.functional.softmax(attn_weights, dim=-1) + + if head_mask is not None: 
+ attn_weights = attn_weights * head_mask.view(1, -1, 1, 1) + + attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training) + attn_output = torch.matmul(attn_weights, value) + attn_output = attn_output.transpose(1, 2).contiguous() + + return attn_output, attn_weights + + class MiniCPMWhisperAttention(nn.Module): """Multi-headed attention from 'Attention Is All You Need' paper""" @@ -2295,7 +2304,7 @@ def forward( self, hidden_states: torch.Tensor, key_value_states: Optional[torch.Tensor] = None, - past_key_value: Optional[Cache] = None, + past_key_values: Optional[Cache] = None, attention_mask: Optional[torch.Tensor] = None, layer_head_mask: Optional[torch.Tensor] = None, output_attentions: bool = False, @@ -2323,34 +2332,34 @@ def forward( query_states = query_states.view(*q_input_shape) query_states = query_states.transpose(1, 2).contiguous() - if past_key_value is not None: - is_updated = past_key_value.is_updated.get(self.layer_idx) + if past_key_values is not None: + is_updated = past_key_values.is_updated.get(self.layer_idx) if is_cross_attention: # after the first generated id, we can subsequently re-use all key/value_states from cache - past_key_value.is_updated[self.layer_idx] = True - past_key_value = past_key_value.cross_attention_cache + past_key_values.is_updated[self.layer_idx] = True + past_key_values = past_key_values.cross_attention_cache else: - past_key_value = past_key_value.self_attention_cache + past_key_values = past_key_values.self_attention_cache # use key_value_states if cross attention current_states = key_value_states if key_value_states is not None else hidden_states - if is_cross_attention and past_key_value and is_updated: + if is_cross_attention and past_key_values and is_updated: # reuse k,v, cross_attentions - key_states = past_key_value.key_cache[self.layer_idx] - value_states = past_key_value.value_cache[self.layer_idx] + key_states = past_key_values.key_cache[self.layer_idx] + value_states = past_key_values.value_cache[self.layer_idx] else: key_states = self.k_proj(current_states).view(bsz, -1, self.num_heads, self.head_dim) value_states = self.v_proj(current_states).view(bsz, -1, self.num_heads, self.head_dim) key_states = key_states.transpose(1, 2).contiguous() value_states = value_states.transpose(1, 2).contiguous() - if past_key_value is not None: + if past_key_values is not None: # save all key/value_states to cache to be re-used for fast auto-regressive generation cache_position = cache_position if not is_cross_attention else None - key_states, value_states = past_key_value.update( + key_states, value_states = past_key_values.update( key_states, value_states, self.layer_idx, {"cache_position": cache_position} ) - attention_interface: Callable = eager_attention_forward + attention_interface: Callable = whisper_eager_attention_forward if self.config._attn_implementation != "eager": attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation] @@ -2370,11 +2379,11 @@ def forward( attn_output = attn_output.reshape(bsz, tgt_len, -1).contiguous() attn_output = self.out_proj(attn_output) - return attn_output, attn_weights, past_key_value + return attn_output, attn_weights, past_key_values class MiniCPMWhisperEncoderLayer(GradientCheckpointingLayer): - def __init__(self, config: WhisperConfig, layer_idx: int = None): + def __init__(self, config: MiniCPMWhisperConfig, layer_idx: int = None): super().__init__() self.embed_dim = config.d_model self.self_attn = MiniCPMWhisperAttention( @@ -2426,7 +2435,7 @@ def 
forward( attention_mask=attention_mask, layer_head_mask=layer_head_mask, output_attentions=output_attentions, - past_key_value=past_key_values, + past_key_values=past_key_values, ) hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training) hidden_states = residual + hidden_states @@ -2477,7 +2486,7 @@ class MiniCPMWhisperEncoder(MiniCPM_o_2_6PreTrainedModel): config: MiniCPMWhisperConfig """ - def __init__(self, config: WhisperConfig): + def __init__(self, config: MiniCPMWhisperConfig): super().__init__(config) self.dropout = config.dropout self.layerdrop = config.encoder_layerdrop @@ -2592,7 +2601,7 @@ def forward( only present if their respective `output_*` arguments are set to `True`. Example: - >>> from transformers import AutoFeatureExtractor, WhisperConfig, WhisperForConditionalGeneration + >>> from transformers import AutoFeatureExtractor, MiniCPMWhisperConfig, WhisperForConditionalGeneration >>> import torch >>> # Load a feature extractor and a Whisper model @@ -2994,7 +3003,7 @@ class ConditionalChatTTSGenerationOutput(ModelOutput): Args: new_ids (torch.LongTensor): Newly generated audio code sequence, shape (batch_size, sequence_length, num_vq). audio_input_ids (torch.LongTensor): Updated input IDs including condition and generated audio codes, shape (batch_size, full_sequence_length, num_vq). - past_key_values (Tuple[Tuple[torch.FloatTensor]]): Tuple containing pre-computed keys and values used for attention mechanism. Each element has shape (batch_size, num_heads, sequence_length, embed_size_per_head). + past_key_values (tuple[tuple[torch.FloatTensor]]): tuple containing pre-computed keys and values used for attention mechanism. Each element has shape (batch_size, num_heads, sequence_length, embed_size_per_head). finished (bool): Boolean indicating whether generation is complete. 
""" @@ -3196,23 +3205,6 @@ class MiniCPMConditionalTTSTextPreTrainedModel(PreTrainedModel): "attentions": MiniCPMConditionalTTSTextAttention, } config_class = MiniCPMConditionalTTSTextConfig - _supports_flash_attn_2 = True - _supports_cache_class = True - _supports_quantized_cache = True - _supports_static_cache = True - - def _init_weights(self, module): - std = self.config.initializer_range - if isinstance(module, nn.Linear): - module.weight.data.normal_(mean=0.0, std=std) - if module.bias is not None: - module.bias.data.zero_() - elif isinstance(module, nn.Embedding): - module.weight.data.normal_(mean=0.0, std=std) - if module.padding_idx is not None: - module.weight.data[module.padding_idx].zero_() - elif isinstance(module, MiniCPMConditionalTTSTextRMSNorm): - module.weight.data.fill_(1.0) class MiniCPMConditionalTTSTextRotaryEmbedding(nn.Module): @@ -3340,7 +3332,7 @@ def forward( hidden_states, attention_mask=causal_mask, position_ids=position_ids, - past_key_value=past_key_values, + past_key_values=past_key_values, output_attentions=output_attentions, use_cache=use_cache, cache_position=cache_position, @@ -3681,16 +3673,7 @@ def __init__(self, config: MiniCPMConditionalTTSConfig): dvae = DVAE() self.dvae = dvae - model_config = MiniCPMConditionalTTSTextConfig( - hidden_size=config.hidden_size, - intermediate_size=config.intermediate_size, - num_attention_heads=config.num_attention_heads, - num_hidden_layers=config.num_hidden_layers, - max_position_embeddings=config.max_position_embeddings, - attn_implementation=config.attn_implementation, - ) - - model = MiniCPMConditionalTTSTextModel(model_config) + model = MiniCPMConditionalTTSTextModel(config.tts_text_config) self.model = model @torch.inference_mode() @@ -3751,7 +3734,7 @@ def prefill_text( Args: input_ids (Tensor): Tensor of shape [batch_size, seq_len] position_ids (LongTensor): Tensor of shape [batch_size, seq_len] - past_key_values (List[Tuple[Tensor]]): KV Cache of all layers, each layer is a tuple (Tensor, Tensor) denoting keys and values. Each tensor is of seq_len = `self.streaming_text_reserved_len`. `past_key_values` will be updated. + past_key_values (List[tuple[Tensor]]): KV Cache of all layers, each layer is a tuple (Tensor, Tensor) denoting keys and values. Each tensor is of seq_len = `self.streaming_text_reserved_len`. `past_key_values` will be updated. lm_spk_emb_last_hidden_states (Tensor, optional): Tensor of shape [batch_size, num_spk_emb, llm_dim]. Defaults to None. lm_last_hidden_states (Tensor, optional): _description_. Defaults to None. @@ -3825,7 +3808,7 @@ def prefill_audio_ids( Args: input_ids (torch.Tensor): (1, seq_len, num_vq) Audio input token ids. - past_key_values (List[Tuple[torch.Tensor, torch.Tensor]]): Past key values for attention mechanism. + past_key_values (List[tuple[torch.Tensor, torch.Tensor]]): Past key values for attention mechanism. """ assert input_ids.shape[0] == 1 assert past_key_values is not None @@ -3891,7 +3874,7 @@ def generate( Args: input_ids (torch.Tensor): Input token ids. - past_key_values (List[Tuple[torch.Tensor, torch.Tensor]]): Past key values for attention mechanism. + past_key_values (List[tuple[torch.Tensor, torch.Tensor]]): Past key values for attention mechanism. temperature (torch.Tensor): Temperature for sampling. eos_token (Union[int, torch.Tensor]): End of sequence token. streaming_tts_text_mask (Optional[torch.Tensor], optional): Mask for streaming TTS text. Defaults to None. 
@@ -4433,11 +4416,11 @@ class MiniCPMVisionModelOutput(ModelOutput): last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): Sequence of hidden-states at the output of the last layer of the model. hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): - Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + + tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): - Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. @@ -4450,7 +4433,7 @@ class MiniCPMVisionModelOutput(ModelOutput): class MiniCPMVisionEmbedding(nn.Module): - def __init__(self, config: SiglipVisionConfig): + def __init__(self, config: MiniCPMVisionConfig): super().__init__() self.config = config self.embed_dim = config.hidden_size @@ -4807,7 +4790,7 @@ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor: class MiniCPMVisionEncoderLayer(GradientCheckpointingLayer): - def __init__(self, config: SiglipVisionConfig): + def __init__(self, config: MiniCPMVisionConfig): super().__init__() self.embed_dim = config.hidden_size self.layer_norm1 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps) @@ -4818,6 +4801,7 @@ def __init__(self, config: SiglipVisionConfig): self.layer_norm2 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps) self.mlp = MiniCPMVisionMLP(config) + def forward( self, hidden_states: torch.Tensor, @@ -4969,7 +4953,7 @@ class MiniCPMVisionPreTrainedModel(PreTrainedModel): models. """ - config_class = SiglipVisionConfig + config_class = MiniCPMVisionConfig base_model_prefix = "siglip" supports_gradient_checkpointing = True @@ -5009,10 +4993,10 @@ class MiniCPMVisionEncoder(nn.Module): Transformer encoder consisting of `config.num_hidden_layers` self attention layers. Each layer is a [`SiglipEncoderLayer`]. Args: - config: SiglipConfig + config: MiniCPMVisionConfig """ - def __init__(self, config: SiglipVisionConfig): + def __init__(self, config: MiniCPMVisionConfig): super().__init__() self.config = config self.layers = nn.ModuleList([MiniCPMVisionEncoderLayer(config) for _ in range(config.num_hidden_layers)]) @@ -5098,7 +5082,7 @@ def forward( Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior. Parameters: - config ([`SiglipVisionConfig`]): Model configuration class with all the parameters of the model. + config ([`MiniCPMVisionConfig`]): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights. 
""" @@ -5124,12 +5108,12 @@ def forward( """The vision model from SigLIP without any head or projection on top.""", SIGLIP_START_DOCSTRING ) class MiniCPMVisionTransformer(MiniCPMVisionPreTrainedModel): - config_class = SiglipVisionConfig + config_class = MiniCPMVisionConfig main_input_name = "pixel_values" _supports_flash_attn_2 = True _no_split_modules = [] - def __init__(self, config: SiglipVisionConfig): + def __init__(self, config: MiniCPMVisionConfig): super().__init__(config) self.config = config embed_dim = config.hidden_size @@ -5146,7 +5130,7 @@ def get_input_embeddings(self) -> nn.Module: return self.embeddings.patch_embedding @add_start_docstrings_to_model_forward(SIGLIP_VISION_INPUTS_DOCSTRING) - @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=SiglipVisionConfig) + @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=MiniCPMVisionConfig) def forward( self, pixel_values, @@ -5216,4 +5200,4 @@ def forward( ) -__all__ = ["MiniCPM_o_2_6ForConditionalGeneration", "MiniCPM_o_2_6Model", "MiniCPM_o_2_6PreTrainedModel"] +__all__ = ["MiniCPM_o_2_6ForConditionalGeneration", "MiniCPM_o_2_6TextModel", "MiniCPM_o_2_6PreTrainedModel"] diff --git a/src/transformers/models/minicpm_o_2_6/modular_minicpm_o_2_6.py b/src/transformers/models/minicpm_o_2_6/modular_minicpm_o_2_6.py index 92e5b55c4582..2807b7f3336f 100644 --- a/src/transformers/models/minicpm_o_2_6/modular_minicpm_o_2_6.py +++ b/src/transformers/models/minicpm_o_2_6/modular_minicpm_o_2_6.py @@ -20,7 +20,7 @@ from dataclasses import dataclass from functools import partial from threading import Thread -from typing import List, Optional, Tuple, Union, Callable +from typing import Optional, Union, Callable import numpy as np from PIL import Image @@ -43,27 +43,26 @@ CausalLMOutputWithPast, ) from ...utils import ( - ModelOutput, + logging, add_start_docstrings, add_start_docstrings_to_model_forward, - is_flash_attn_2_available, - logging, replace_return_docstrings, can_return_tuple, auto_docstring, + ModelOutput, TransformersKwargs, ) +from ...utils.import_utils import _is_package_available, is_flash_attn_2_available from ...cache_utils import Cache, DynamicCache, EncoderDecoderCache, StaticCache +from ...configuration_utils import PretrainedConfig from ...generation import GenerationMixin from ...generation.streamers import TextIteratorStreamer from ...generation.utils import GenerateOutput from ...generation.logits_process import LogitsProcessor, TopKLogitsWarper, TopPLogitsWarper -from ...modeling_layers import GradientCheckpointingLayer -from ...modeling_rope_utils import ROPE_INIT_FUNCTIONS, dynamic_rope_update from ...modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel from ...activations import ACT2FN from ...modeling_attn_mask_utils import _prepare_4d_attention_mask, AttentionMaskConverter -from ...integrations import is_deepspeed_zero3_enabled, use_kernel_forward_from_hub +from ...integrations import is_deepspeed_zero3_enabled from ...modeling_flash_attention_utils import FlashAttentionKwargs from ...processing_utils import Unpack @@ -72,31 +71,182 @@ from ..siglip.modeling_siglip import SiglipEncoderLayer, SiglipEncoder, SiglipMLP, SiglipVisionModelOutput from ..whisper.configuration_whisper import WhisperConfig from ..whisper.modeling_whisper import WhisperEncoder, WhisperAttention, WhisperEncoderLayer +from ..qwen2.configuration_qwen2 import Qwen2Config from ..qwen2.modeling_qwen2 import Qwen2Model, Qwen2PreTrainedModel +from ..llama.configuration_llama 
import LlamaConfig from ..llama.modeling_llama import LlamaModel, LlamaDecoderLayer, LlamaPreTrainedModel -try: +from .tts_processing_minicpm_o_2_6 import NumberToTextConverter, sentence_end, VoiceChecker, ChatTTSProcessor + +if is_flash_attn_2_available(): + from flash_attn import flash_attn_func, flash_attn_varlen_func + from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input + +if _is_package_available('vector_quantize_pytorch') and _is_package_available('vocos'): from vector_quantize_pytorch import GroupedResidualFSQ from vocos import Vocos from vocos.pretrained import instantiate_class - _tts_deps = True -except: - _tts_deps = False - -from .configuration_minicpm_o_2_6 import ( - MiniCPMConditionalTTSConfig, - MiniCPM_o_2_6Config, - MiniCPMConditionalTTSTextConfig, -) -from .processing_minicpm_o_2_6 import NumberToTextConverter, sentence_end, VoiceChecker +_tts_deps = _is_package_available('vector_quantize_pytorch') and _is_package_available('vocos') logger = logging.get_logger(__name__) +class MiniCPMConditionalTTSTextConfig(LlamaConfig): + pass + + +class MiniCPMConditionalTTSConfig(PretrainedConfig): + model_type = "conditional_chattts" + + # sub_configs = { + # "text_config": MiniCPMConditionalTTSTextConfig, + # } + + def __init__( + self, + llm_dim: int = 2560, + hidden_size: int = 768, + intermediate_size: int = 3072, + num_attention_heads: int = 12, + num_hidden_layers: int = 20, + max_position_embeddings: int = 4096, + num_audio_tokens: int = 626, + num_text_tokens: int = 21178, + num_mel_bins: int = 100, + num_vq: int = 4, + use_speaker_embedding: bool = True, + use_llm_hidden_state: bool = False, + spk_emb_token_id: int = 21143, + num_spk_embs: int = 1, + audio_bos_token_id: int = 21132, + text_eos_token_id: int = 21133, + use_text: bool = True, + streaming: bool = True, + streaming_text_chunk_size: int = 10, + streaming_text_reserved_len: int = 300, + streaming_audio_chunk_size: int = 50, + attn_implementation: str = "sdpa", + use_mlp: bool = True, + aug_loss_weight: bool = True, + **kwargs, + ): + super().__init__(**kwargs) + + self.llm_dim = llm_dim + self.hidden_size = hidden_size + self.intermediate_size = intermediate_size + self.num_attention_heads = num_attention_heads + self.num_hidden_layers = num_hidden_layers + self.max_position_embeddings = max_position_embeddings + self.num_audio_tokens = num_audio_tokens + self.num_text_tokens = num_text_tokens + self.num_mel_bins = num_mel_bins + self.num_vq = num_vq + self.use_speaker_embedding = use_speaker_embedding + self.use_llm_hidden_state = use_llm_hidden_state + self.spk_emb_token_id = spk_emb_token_id + self.num_spk_embs = num_spk_embs + self.audio_bos_token_id = audio_bos_token_id + self.text_eos_token_id = text_eos_token_id + self.use_text = use_text + self.streaming = streaming + self.streaming_text_chunk_size = streaming_text_chunk_size + self.streaming_text_reserved_len = streaming_text_reserved_len + self.streaming_audio_chunk_size = streaming_audio_chunk_size + self.attn_implementation = attn_implementation + self.use_mlp = use_mlp + self.aug_loss_weight = aug_loss_weight + + self.tts_text_config = MiniCPMConditionalTTSTextConfig( + hidden_size=self.hidden_size, + intermediate_size=self.intermediate_size, + num_attention_heads=self.num_attention_heads, + num_hidden_layers=self.num_hidden_layers, + max_position_embeddings=self.max_position_embeddings, + attn_implementation=self.attn_implementation, + ) + + +class MiniCPM_o_2_6TextConfig(Qwen2Config): + model_type = "minicpmo" + +class 
MiniCPMVisionConfig(SiglipVisionConfig): + pass + +class MiniCPMWhisperConfig(WhisperConfig): + pass + +class MiniCPM_o_2_6Config(PretrainedConfig): + + default_vision_config = { + "hidden_size": 1152, + "image_size": 980, + "intermediate_size": 4304, + "model_type": "siglip", + "num_attention_heads": 16, + "num_hidden_layers": 27, + "patch_size": 14, + } + + def __init__( + self, + text_config=None, + vision_config=None, + audio_config=None, + tts_config=None, + use_cache=True, + query_num=64, + drop_vision_last_layer=True, + vision_batch_size=16, + audio_pool_step=2, + audio_chunk_length=1.0, + **kwargs, + ): + self.use_cache = use_cache + self.query_num = query_num + self.drop_vision_last_layer = drop_vision_last_layer + self.vision_batch_size = vision_batch_size + self.audio_pool_step = audio_pool_step + self.audio_chunk_length = audio_chunk_length + + if text_config is None: + self.text_config = MiniCPM_o_2_6TextConfig() + elif isinstance(text_config, dict): + self.text_config = MiniCPM_o_2_6TextConfig(**text_config) + elif isinstance(text_config, MiniCPM_o_2_6TextConfig): + self.text_config = text_config + + if vision_config is None: + self.vision_config = MiniCPMVisionConfig( + **self.default_vision_config) + logger.info("vision_config is None, using default vision config") + elif isinstance(vision_config, dict): + self.vision_config = MiniCPMVisionConfig(**vision_config) + elif isinstance(vision_config, MiniCPMVisionConfig): + self.vision_config = vision_config + + # same as openai/whisper-medium add use_cache + if audio_config is None: + self.audio_config = MiniCPMWhisperConfig() + elif isinstance(audio_config, dict): + self.audio_config = MiniCPMWhisperConfig(**audio_config) + elif isinstance(audio_config, MiniCPMWhisperConfig): + self.audio_config = audio_config + + if tts_config is None: + self.tts_config = MiniCPMConditionalTTSConfig() + elif isinstance(tts_config, dict): + self.tts_config = MiniCPMConditionalTTSConfig(**tts_config) + elif isinstance(tts_config, MiniCPMConditionalTTSConfig): + self.tts_config = tts_config + + # self.patch_size = self.vision_config.patch_size + super().__init__(**kwargs) + @dataclass class OmniOutput(ModelOutput): - text: Optional[Union[str, List[str], Iterator]] = None + text: Optional[Union[str, list[str], Iterator]] = None outputs: GenerateOutput | torch.LongTensor = None spk_embeds: Optional[torch.FloatTensor] = None audio_wav: Optional[np.ndarray] = None @@ -105,47 +255,27 @@ class OmniOutput(ModelOutput): @auto_docstring class MiniCPM_o_2_6PreTrainedModel(Qwen2PreTrainedModel): - config_class = MiniCPM_o_2_6Config - base_model_prefix = "model" - supports_gradient_checkpointing = True - _no_split_modules = ["MiniCPM_o_2_6TextDecoderLayer"] - _skip_keys_device_placement = ["past_key_values"] - _supports_flash_attn_2 = True - _supports_sdpa = True - _supports_flex_attn = True - _supports_cache_class = True - _supports_quantized_cache = True - _supports_static_cache = True - _supports_attention_backend = True - - def _init_weights(self, module): - std = self.config.initializer_range - if isinstance(module, nn.Linear): - module.weight.data.normal_(mean=0.0, std=std) - if module.bias is not None: - module.bias.data.zero_() - elif isinstance(module, nn.Embedding): - module.weight.data.normal_(mean=0.0, std=std) - if module.padding_idx is not None: - module.weight.data[module.padding_idx].zero_() - elif isinstance(module, MiniCPM_o_2_6TextRMSNorm): - module.weight.data.fill_(1.0) + config: MiniCPM_o_2_6Config -class MiniCPMTextModel(Qwen2Model): 
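
# Illustrative sketch, not part of the diff above: how the composite `MiniCPM_o_2_6Config`
# defined in this file promotes plain dicts to sub-config objects and fills in defaults for
# sub-configs that are omitted. Class names come from this diff; the import path is an
# assumption about where the generated configuration module ends up.
from transformers.models.minicpm_o_2_6.configuration_minicpm_o_2_6 import (  # assumed path
    MiniCPM_o_2_6Config,
    MiniCPM_o_2_6TextConfig,
    MiniCPMConditionalTTSConfig,
    MiniCPMVisionConfig,
)

config = MiniCPM_o_2_6Config(
    vision_config={"hidden_size": 1152, "image_size": 980, "patch_size": 14},  # dict form
    query_num=64,
    audio_pool_step=2,
)
assert isinstance(config.vision_config, MiniCPMVisionConfig)    # dict promoted to a config object
assert isinstance(config.text_config, MiniCPM_o_2_6TextConfig)  # omitted, built with defaults
assert isinstance(config.tts_config, MiniCPMConditionalTTSConfig)
# the derived TTS text sub-config mirrors the TTS trunk dimensions (see `tts_text_config` above)
print(config.tts_config.hidden_size, config.tts_config.tts_text_config.hidden_size)  # 768 768
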
+class MiniCPM_o_2_6TextModel(Qwen2Model): pass -class MiniCPM_o_2_6Model(MiniCPM_o_2_6PreTrainedModel, GenerationMixin): +class MiniCPM_o_2_6ForConditionalGeneration(MiniCPM_o_2_6PreTrainedModel, GenerationMixin): _tied_weights_keys = ["lm_head.weight"] _tp_plan = {"lm_head": "colwise_rep"} _pp_plan = {"lm_head": (["hidden_states"], ["logits"])} - def __init__(self, config): - super().__init__(config) - self.language_model = MiniCPMTextModel(config) - self.vocab_size = config.vocab_size - self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False) + def __init__(self, config: MiniCPM_o_2_6Config): + super().__init__(config.text_config) + + text_config = config.text_config + self.language_model = MiniCPM_o_2_6TextModel(text_config) + self.vocab_size = text_config.vocab_size + self.lm_head = nn.Linear(text_config.hidden_size, text_config.vocab_size, bias=False) + + self.omni_config = config # Initialize weights and apply final processing self.post_init() @@ -156,12 +286,12 @@ def __init__(self, config): # init vision module self.vpm = self.init_vision_module() self.vision_dim = self.vpm.embed_dim - self.resampler = self.init_resampler(self.embed_dim, self.vision_dim) + self.resampler = self.init_resampler(config.query_num, self.embed_dim, self.vision_dim) # init audio module self.apm = self.init_audio_module() audio_output_dim = int(self.apm.config.encoder_ffn_dim // 4) - self.audio_avg_pooler = nn.AvgPool1d(self.config.audio_pool_step, stride=self.config.audio_pool_step) + self.audio_avg_pooler = nn.AvgPool1d(self.omni_config.audio_pool_step, stride=self.omni_config.audio_pool_step) self.audio_projection_layer = MultiModalProjector(in_dim=audio_output_dim, out_dim=self.embed_dim) self.audio_encoder_layer = -1 @@ -191,10 +321,8 @@ def init_tts( load tts tokenizer and vocos 1. try load form local 2. 
try load from huggingface """ - from .processing_minicpm_o_2_6 import ChatTTSProcessor - if tts_text_tokenizer_path is None: - tts_text_tokenizer_path = os.path.join(self.config._name_or_path, "assets/chattts_tokenizer") + tts_text_tokenizer_path = os.path.join(self.omni_config._name_or_path, "assets/chattts_tokenizer") if not os.path.exists(tts_text_tokenizer_path): # try from hf model_id tts_text_tokenizer_path = "openbmb/chattts_tokenizer" @@ -203,7 +331,7 @@ def init_tts( self.tts_processor = ChatTTSProcessor(text_tokenizer=tts_text_tokenizer) if vocos_ckpt_path is None: - vocos_ckpt_path = os.path.join(self.config._name_or_path, "assets/Vocos.pt") + vocos_ckpt_path = os.path.join(self.omni_config._name_or_path, "assets/Vocos.pt") if not os.path.exists(vocos_ckpt_path): vocos_ckpt_path = hf_hub_download(repo_id="openbmb/MiniCPM-o-2_6", subfolder="assets", filename="Vocos.pt") @@ -234,12 +362,12 @@ def initialize_vocos(self, ckpt_path): return vocos def init_vision_module(self): - if self.config._attn_implementation == "flash_attention_2": - self.config.vision_config._attn_implementation = "flash_attention_2" + if self.omni_config._attn_implementation == "flash_attention_2": + self.omni_config.vision_config._attn_implementation = "flash_attention_2" else: - self.config.vision_config._attn_implementation = "eager" - model = MiniCPMVisionTransformer(self.config.vision_config) - if self.config.drop_vision_last_layer: + self.omni_config.vision_config._attn_implementation = "eager" + model = MiniCPMVisionTransformer(self.omni_config.vision_config) + if self.omni_config.drop_vision_last_layer: model.encoder.layers = model.encoder.layers[:-1] setattr(model, "embed_dim", model.embeddings.embed_dim) @@ -247,9 +375,9 @@ def init_vision_module(self): return model - def init_resampler(self, embed_dim, vision_dim): + def init_resampler(self, query_num, embed_dim, vision_dim): return Resampler( - num_queries=self.config.query_num, + num_queries=query_num, embed_dim=embed_dim, num_heads=embed_dim // 128, kv_dim=vision_dim, @@ -257,11 +385,11 @@ def init_resampler(self, embed_dim, vision_dim): ) def init_audio_module(self): - model = MiniCPMWhisperEncoder(self.config.audio_config) + model = MiniCPMWhisperEncoder(self.omni_config.audio_config) return model def init_tts_module(self): - model = ConditionalChatTTS(self.config.tts_config) + model = ConditionalChatTTS(self.omni_config.tts_config) return model def get_input_embeddings(self): @@ -333,8 +461,8 @@ def _get_feat_extract_output_lengths(self, input_lengths: torch.LongTensor): """ input_lengths_after_cnn = (input_lengths - 1) // 2 + 1 input_lengths_after_pooling = ( - input_lengths_after_cnn - self.config.audio_pool_step - ) // self.config.audio_pool_step + 1 + input_lengths_after_cnn - self.omni_config.audio_pool_step + ) // self.omni_config.audio_pool_step + 1 input_lengths_after_pooling = input_lengths_after_pooling.to(dtype=torch.int32) return input_lengths_after_cnn, input_lengths_after_pooling @@ -362,7 +490,7 @@ def get_image_features(self, pixel_values_list, tgt_sizes, dtype, device): for i in range(B): patch_attn_mask[i, 0, : tgt_sizes[i][0] * tgt_sizes[i][1]] = True - vision_batch_size = self.config.vision_batch_size + vision_batch_size = self.omni_config.vision_batch_size all_pixel_values = all_pixel_values.type(dtype) if B > vision_batch_size: hs = [] @@ -447,7 +575,7 @@ def get_vllm_embedding(self, data): return new_vllm_embedding, vision_hidden_states def get_audio_embedding_streaming( - self, audio_features: torch.FloatTensor = [], 
audio_feature_lens_raw: List[List[int]] = [] + self, audio_features: torch.FloatTensor = [], audio_feature_lens_raw: list[list[int]] = [] ): r""" Extract audio embeddings in a streaming manner using cached key-value pairs. @@ -508,7 +636,7 @@ def get_audio_embedding_streaming( def get_audio_embedding( self, audio_features: torch.FloatTensor = [], - audio_feature_lens_raw: List[List[int]] = [], + audio_feature_lens_raw: list[list[int]] = [], chunk_length=-1, dummy=True, ): @@ -611,7 +739,7 @@ def get_omni_embedding(self, data, input_embeddings, chunk_length=-1, stream_inp assert len(audio_embeddings) == len(input_embeddings) audio_bounds = data["audio_bounds"] - if self.config.chunk_input: + if self.omni_config.chunk_input: for i in range(bs): audio_embs = torch.cat(audio_embeddings[i], dim=0).to( device=input_embeddings.device, dtype=input_embeddings.dtype @@ -680,9 +808,9 @@ def forward( >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." ```""" - output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_attentions = output_attentions if output_attentions is not None else self.omni_config.output_attentions output_hidden_states = ( - output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + output_hidden_states if output_hidden_states is not None else self.omni_config.output_hidden_states ) # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn) @@ -706,7 +834,7 @@ def forward( loss = None if labels is not None: - loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs) + loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.vocab_size, **kwargs) return CausalLMOutputWithPast( loss=loss, @@ -801,7 +929,7 @@ def generate( model_inputs["inputs_embeds"] = self.get_omni_embedding( model_inputs, input_embeddings=model_inputs["inputs_embeds"], - chunk_length=self.config.audio_chunk_length, + chunk_length=self.omni_config.audio_chunk_length, ) if stream: @@ -834,7 +962,7 @@ def stream_gen(): spk_embeds = wav_numpy = sr = None if not batched and use_tts_template and generate_audio: - result = processor.decode_text(outputs.sequences, processor.tokenizer) + result = processor.decode(outputs.sequences) mel_spec = self._generate_mel_spec( model_inputs, outputs, @@ -1176,7 +1304,7 @@ def check_uncompleted_token(ids): end = check_uncompleted_token(cur_ids[0]) left_ids = cur_ids[:, end:] cur_ids = cur_ids[:, :end] - text = processor.decode_text(cur_ids, tokenizer)[0] if end > 0 else "" + text = processor.decode(cur_ids)[0] if end > 0 else "" self.llm_past_key_values = outputs.past_key_values input_ids = outputs.sequences[:, -1:] @@ -1382,7 +1510,7 @@ def _generate_mel_spec( mel_spec = self.tts.decode_to_mel_specs(outputs.new_ids) return mel_spec - def _linear_overlap_add2_wav(self, frames: List[torch.Tensor], overlap: int): + def _linear_overlap_add2_wav(self, frames: list[torch.Tensor], overlap: int): """ Merge two audio waveforms with smooth in streaming audio generation. 
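
# Illustrative sketch of the linear overlap-add idea that the docstring above describes:
# two adjacent waveform chunks are cross-faded over `overlap` samples so consecutive
# streaming chunks join smoothly. This is a minimal re-implementation under assumed 1-D
# float tensors, not the body of `_linear_overlap_add2_wav` from this diff.
import torch

def linear_overlap_add2(prev: torch.Tensor, nxt: torch.Tensor, overlap: int) -> torch.Tensor:
    fade_out = torch.linspace(1.0, 0.0, overlap)  # tail of the previous chunk ramps down
    fade_in = 1.0 - fade_out                      # head of the next chunk ramps up
    seam = prev[-overlap:] * fade_out + nxt[:overlap] * fade_in
    return torch.cat([prev[:-overlap], seam, nxt[overlap:]])

prev_chunk, next_chunk = torch.ones(1000), torch.zeros(1000)
merged = linear_overlap_add2(prev_chunk, next_chunk, overlap=200)
print(merged.shape)  # torch.Size([1800]), i.e. 1000 + 1000 - 200 samples
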
Borrowed some codes from `https://github.com/huggingface/transformers/blob/main/src/transformers/models/encodec/modeling_encodec.py` @@ -1824,6 +1952,35 @@ def get_cache_usable_length(past_key_value: Cache, new_seq_length: int, layer_id return previous_seq_length +def whisper_eager_attention_forward( + module: nn.Module, + query: torch.Tensor, + key: torch.Tensor, + value: torch.Tensor, + attention_mask: Optional[torch.Tensor], + scaling: Optional[float] = None, + dropout: float = 0.0, + head_mask: Optional[torch.Tensor] = None, + **kwargs, +): + if scaling is None: + scaling = query.size(-1) ** -0.5 + + attn_weights = torch.matmul(query, key.transpose(2, 3)) * scaling + if attention_mask is not None and attention_mask.ndim == 4: + attn_weights = attn_weights + attention_mask[:, :, :, : key.shape[-2]] + + attn_weights = nn.functional.softmax(attn_weights, dim=-1) + + if head_mask is not None: + attn_weights = attn_weights * head_mask.view(1, -1, 1, 1) + + attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training) + attn_output = torch.matmul(attn_weights, value) + attn_output = attn_output.transpose(1, 2).contiguous() + + return attn_output, attn_weights + # Copied from transformers.models.whisper.modeling_whisper.WhisperAttention and support past_key_value class MiniCPMWhisperAttention(WhisperAttention): """Multi-headed attention from 'Attention Is All You Need' paper""" @@ -1832,7 +1989,7 @@ def forward( self, hidden_states: torch.Tensor, key_value_states: Optional[torch.Tensor] = None, - past_key_value: Optional[Cache] = None, + past_key_values: Optional[Cache] = None, attention_mask: Optional[torch.Tensor] = None, layer_head_mask: Optional[torch.Tensor] = None, output_attentions: bool = False, @@ -1860,34 +2017,34 @@ def forward( query_states = query_states.view(*q_input_shape) query_states = query_states.transpose(1, 2).contiguous() - if past_key_value is not None: - is_updated = past_key_value.is_updated.get(self.layer_idx) + if past_key_values is not None: + is_updated = past_key_values.is_updated.get(self.layer_idx) if is_cross_attention: # after the first generated id, we can subsequently re-use all key/value_states from cache - past_key_value.is_updated[self.layer_idx] = True - past_key_value = past_key_value.cross_attention_cache + past_key_values.is_updated[self.layer_idx] = True + past_key_values = past_key_values.cross_attention_cache else: - past_key_value = past_key_value.self_attention_cache + past_key_values = past_key_values.self_attention_cache # use key_value_states if cross attention current_states = key_value_states if key_value_states is not None else hidden_states - if is_cross_attention and past_key_value and is_updated: + if is_cross_attention and past_key_values and is_updated: # reuse k,v, cross_attentions - key_states = past_key_value.key_cache[self.layer_idx] - value_states = past_key_value.value_cache[self.layer_idx] + key_states = past_key_values.key_cache[self.layer_idx] + value_states = past_key_values.value_cache[self.layer_idx] else: key_states = self.k_proj(current_states).view(bsz, -1, self.num_heads, self.head_dim) value_states = self.v_proj(current_states).view(bsz, -1, self.num_heads, self.head_dim) key_states = key_states.transpose(1, 2).contiguous() value_states = value_states.transpose(1, 2).contiguous() - if past_key_value is not None: + if past_key_values is not None: # save all key/value_states to cache to be re-used for fast auto-regressive generation cache_position = cache_position if not is_cross_attention else 
None - key_states, value_states = past_key_value.update( + key_states, value_states = past_key_values.update( key_states, value_states, self.layer_idx, {"cache_position": cache_position} ) - attention_interface: Callable = eager_attention_forward + attention_interface: Callable = whisper_eager_attention_forward if self.config._attn_implementation != "eager": attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation] @@ -1907,12 +2064,12 @@ def forward( attn_output = attn_output.reshape(bsz, tgt_len, -1).contiguous() attn_output = self.out_proj(attn_output) - return attn_output, attn_weights, past_key_value + return attn_output, attn_weights, past_key_values # Copied from transformers.models.whisper.modeling_whisper.WhisperEncoderLayer and add use_cache for streaming inference class MiniCPMWhisperEncoderLayer(WhisperEncoderLayer): - def __init__(self, config: WhisperConfig, layer_idx: int = None): + def __init__(self, config: MiniCPMWhisperConfig, layer_idx: int = None): super().__init__() self.embed_dim = config.d_model self.self_attn = MiniCPMWhisperAttention( @@ -1964,7 +2121,7 @@ def forward( attention_mask=attention_mask, layer_head_mask=layer_head_mask, output_attentions=output_attentions, - past_key_value=past_key_values, + past_key_values=past_key_values, ) hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training) hidden_states = residual + hidden_states @@ -1996,7 +2153,7 @@ def forward( # Copied from from transformers.models.whisper.modeling_whisper.WhisperEncoder and add use_cache for streaming inference class MiniCPMWhisperEncoder(WhisperEncoder): - def __init__(self, config: WhisperConfig): + def __init__(self, config: MiniCPMWhisperConfig): super().__init__(config) self.layers = nn.ModuleList( [MiniCPMWhisperEncoderLayer(config, layer_idx=i) for i in range(config.encoder_layers)] @@ -2081,7 +2238,7 @@ def forward( only present if their respective `output_*` arguments are set to `True`. Example: - >>> from transformers import AutoFeatureExtractor, WhisperConfig, WhisperForConditionalGeneration + >>> from transformers import AutoFeatureExtractor, MiniCPMWhisperConfig, WhisperForConditionalGeneration >>> import torch >>> # Load a feature extractor and a Whisper model @@ -2289,7 +2446,7 @@ class GFSQ(nn.Module): def __init__( self, dim: int, - levels: List[int], + levels: list[int], G: int, R: int, eps=1e-5, @@ -2587,14 +2744,14 @@ class ConditionalChatTTSGenerationOutput(ModelOutput): Args: new_ids (torch.LongTensor): Newly generated audio code sequence, shape (batch_size, sequence_length, num_vq). audio_input_ids (torch.LongTensor): Updated input IDs including condition and generated audio codes, shape (batch_size, full_sequence_length, num_vq). - past_key_values (Tuple[Tuple[torch.FloatTensor]]): Tuple containing pre-computed keys and values used for attention mechanism. Each element has shape (batch_size, num_heads, sequence_length, embed_size_per_head). + past_key_values (tuple[tuple[torch.FloatTensor]]): tuple containing pre-computed keys and values used for attention mechanism. Each element has shape (batch_size, num_heads, sequence_length, embed_size_per_head). finished (bool): Boolean indicating whether generation is complete. 
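
# Illustrative shape check, not part of the diff above: exercising the
# `whisper_eager_attention_forward` helper added a few hunks earlier in this file.
# Tensor sizes are arbitrary, and the import path is an assumption about the generated module.
import torch
from torch import nn
from transformers.models.minicpm_o_2_6.modeling_minicpm_o_2_6 import whisper_eager_attention_forward  # assumed path

dummy = nn.Module()            # the helper only reads `.training` to gate dropout
q = torch.randn(2, 8, 16, 64)  # (batch, num_heads, tgt_len, head_dim)
k = torch.randn(2, 8, 16, 64)
v = torch.randn(2, 8, 16, 64)
attn_output, attn_weights = whisper_eager_attention_forward(dummy, q, k, v, attention_mask=None)
print(attn_output.shape)   # torch.Size([2, 16, 8, 64]): transposed back to (batch, tgt_len, heads, head_dim)
print(attn_weights.shape)  # torch.Size([2, 8, 16, 16])
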
""" new_ids: torch.LongTensor = None audio_input_ids: torch.LongTensor = None - past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None + past_key_values: Optional[tuple[tuple[torch.FloatTensor]]] = None finished: bool = None @@ -2708,31 +2865,6 @@ def forward( @auto_docstring class MiniCPMConditionalTTSTextPreTrainedModel(LlamaPreTrainedModel): config_class = MiniCPMConditionalTTSTextConfig - base_model_prefix = "model" - supports_gradient_checkpointing = True - _no_split_modules = ["MiniCPMConditionalTTSTextDecoderLayer"] - _skip_keys_device_placement = ["past_key_values"] - _supports_flash_attn_2 = True - _supports_sdpa = True - _supports_flex_attn = True - _supports_cache_class = True - _supports_quantized_cache = True - _supports_static_cache = True - _supports_attention_backend = True - - def _init_weights(self, module): - std = self.config.initializer_range - if isinstance(module, nn.Linear): - module.weight.data.normal_(mean=0.0, std=std) - if module.bias is not None: - module.bias.data.zero_() - elif isinstance(module, nn.Embedding): - module.weight.data.normal_(mean=0.0, std=std) - if module.padding_idx is not None: - module.weight.data[module.padding_idx].zero_() - elif isinstance(module, MiniCPMConditionalTTSTextRMSNorm): - module.weight.data.fill_(1.0) - @auto_docstring class MiniCPMConditionalTTSTextModel(LlamaModel): @@ -2813,7 +2945,7 @@ def forward( hidden_states, attention_mask=causal_mask, position_ids=position_ids, - past_key_value=past_key_values, + past_key_values=past_key_values, output_attentions=output_attentions, use_cache=use_cache, cache_position=cache_position, @@ -3044,16 +3176,7 @@ def __init__(self, config: MiniCPMConditionalTTSConfig): dvae = DVAE() self.dvae = dvae - model_config = MiniCPMConditionalTTSTextConfig( - hidden_size=config.hidden_size, - intermediate_size=config.intermediate_size, - num_attention_heads=config.num_attention_heads, - num_hidden_layers=config.num_hidden_layers, - max_position_embeddings=config.max_position_embeddings, - attn_implementation=config.attn_implementation, - ) - - model = MiniCPMConditionalTTSTextModel(model_config) + model = MiniCPMConditionalTTSTextModel(config.tts_text_config) self.model = model @torch.inference_mode() @@ -3105,7 +3228,7 @@ def prefill_text( self, input_ids: torch.Tensor, position_ids: torch.LongTensor, - past_key_values: List[Tuple[torch.Tensor, torch.Tensor]], + past_key_values: list[tuple[torch.Tensor, torch.Tensor]], lm_spk_emb_last_hidden_states: Optional[torch.Tensor] = None, ): """Prefill a chunk of new text tokens in streaming setting. @@ -3114,7 +3237,7 @@ def prefill_text( Args: input_ids (Tensor): Tensor of shape [batch_size, seq_len] position_ids (LongTensor): Tensor of shape [batch_size, seq_len] - past_key_values (List[Tuple[Tensor]]): KV Cache of all layers, each layer is a tuple (Tensor, Tensor) denoting keys and values. Each tensor is of seq_len = `self.streaming_text_reserved_len`. `past_key_values` will be updated. + past_key_values (List[tuple[Tensor]]): KV Cache of all layers, each layer is a tuple (Tensor, Tensor) denoting keys and values. Each tensor is of seq_len = `self.streaming_text_reserved_len`. `past_key_values` will be updated. lm_spk_emb_last_hidden_states (Tensor, optional): Tensor of shape [batch_size, num_spk_emb, llm_dim]. Defaults to None. lm_last_hidden_states (Tensor, optional): _description_. Defaults to None. 
@@ -3179,7 +3302,7 @@ def prefill_text( def prefill_audio_ids( self, input_ids: torch.Tensor, - past_key_values: List[Tuple[torch.Tensor, torch.Tensor]], + past_key_values: list[tuple[torch.Tensor, torch.Tensor]], streaming_tts_text_mask=None, add_audio_bos: bool = True, ): @@ -3188,7 +3311,7 @@ def prefill_audio_ids( Args: input_ids (torch.Tensor): (1, seq_len, num_vq) Audio input token ids. - past_key_values (List[Tuple[torch.Tensor, torch.Tensor]]): Past key values for attention mechanism. + past_key_values (List[tuple[torch.Tensor, torch.Tensor]]): Past key values for attention mechanism. """ assert input_ids.shape[0] == 1 assert past_key_values is not None @@ -3234,15 +3357,15 @@ def prefill_audio_ids( def generate( self, input_ids: torch.Tensor, - past_key_values: List[Tuple[torch.Tensor, torch.Tensor]], + past_key_values: list[tuple[torch.Tensor, torch.Tensor]], temperature: torch.Tensor, eos_token: Union[int, torch.Tensor], streaming_tts_text_mask=None, force_no_stop=False, min_new_token=10, max_new_token=50, - logits_warpers: List[LogitsProcessor] = [], - logits_processors: List[CustomRepetitionPenaltyLogitsProcessorRepeat] = [], + logits_warpers: list[LogitsProcessor] = [], + logits_processors: list[CustomRepetitionPenaltyLogitsProcessorRepeat] = [], show_tqdm=False, ): """Generate audio codes in streaming setting or non-streaming setting. @@ -3254,7 +3377,7 @@ def generate( Args: input_ids (torch.Tensor): Input token ids. - past_key_values (List[Tuple[torch.Tensor, torch.Tensor]]): Past key values for attention mechanism. + past_key_values (List[tuple[torch.Tensor, torch.Tensor]]): Past key values for attention mechanism. temperature (torch.Tensor): Temperature for sampling. eos_token (Union[int, torch.Tensor]): End of sequence token. streaming_tts_text_mask (Optional[torch.Tensor], optional): Mask for streaming TTS text. Defaults to None. @@ -3470,7 +3593,7 @@ def generate( @torch.inference_mode() def decode_to_mel_specs( self, - result_list: List[torch.Tensor], + result_list: list[torch.Tensor], ): """Decode discrete audio codes to mel spectrograms. @@ -3813,13 +3936,6 @@ def forward( # See all SigLIP models at https://huggingface.co/models?filter=siglip ] -if is_flash_attn_2_available(): - from flash_attn import flash_attn_func - from flash_attn import flash_attn_varlen_func - from flash_attn.bert_padding import index_first_axis # noqa - from flash_attn.bert_padding import pad_input - from flash_attn.bert_padding import unpad_input - # Copied from transformers.models.llama.modeling_llama._get_unpad_data def _get_unpad_data(attention_mask): @@ -3950,11 +4066,11 @@ class MiniCPMVisionModelOutput(SiglipVisionModelOutput): last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): Sequence of hidden-states at the output of the last layer of the model. hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): - Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + + tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. 
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): - Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. @@ -3964,7 +4080,7 @@ class MiniCPMVisionModelOutput(SiglipVisionModelOutput): class MiniCPMVisionEmbedding(nn.Module): - def __init__(self, config: SiglipVisionConfig): + def __init__(self, config: MiniCPMVisionConfig): super().__init__() self.config = config self.embed_dim = config.hidden_size @@ -4057,7 +4173,7 @@ def forward( hidden_states: torch.Tensor, attention_mask: Optional[torch.Tensor] = None, output_attentions: Optional[bool] = False, - ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]: + ) -> tuple[torch.Tensor, Optional[torch.Tensor], Optional[tuple[torch.Tensor]]]: """Input shape: Batch x Time x Channel""" batch_size, q_len, _ = hidden_states.size() @@ -4121,11 +4237,11 @@ def forward( hidden_states: torch.Tensor, attention_mask: Optional[torch.LongTensor] = None, position_ids: Optional[torch.LongTensor] = None, - past_key_value: Optional[Tuple[torch.Tensor]] = None, + past_key_value: Optional[tuple[torch.Tensor]] = None, output_attentions: bool = False, use_cache: bool = False, **kwargs, - ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]: + ) -> tuple[torch.Tensor, Optional[torch.Tensor], Optional[tuple[torch.Tensor]]]: output_attentions = False bsz, q_len, _ = hidden_states.size() @@ -4297,7 +4413,7 @@ class MiniCPMVisionMLP(SiglipMLP): class MiniCPMVisionEncoderLayer(SiglipEncoderLayer): - def __init__(self, config: SiglipVisionConfig): + def __init__(self, config: MiniCPMVisionConfig): super().__init__() self.embed_dim = config.hidden_size self._use_flash_attention_2 = config._attn_implementation == "flash_attention_2" @@ -4315,7 +4431,7 @@ class MiniCPMVisionPreTrainedModel(PreTrainedModel): models. """ - config_class = SiglipVisionConfig + config_class = MiniCPMVisionConfig base_model_prefix = "siglip" supports_gradient_checkpointing = True @@ -4358,7 +4474,7 @@ def _initialize_weights(self, module): Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior. Parameters: - config ([`SiglipVisionConfig`]): Model configuration class with all the parameters of the model. + config ([`MiniCPMVisionConfig`]): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights. """ @@ -4385,10 +4501,10 @@ class MiniCPMVisionEncoder(SiglipEncoder): Transformer encoder consisting of `config.num_hidden_layers` self attention layers. Each layer is a [`SiglipEncoderLayer`]. 
Args: - config: SiglipConfig + config: MiniCPMVisionConfig """ - def __init__(self, config: SiglipVisionConfig): + def __init__(self, config: MiniCPMVisionConfig): super().__init__() self.config = config self.layers = nn.ModuleList([MiniCPMVisionEncoderLayer(config) for _ in range(config.num_hidden_layers)]) @@ -4402,7 +4518,7 @@ def forward( output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, - ) -> Union[Tuple, BaseModelOutput]: + ) -> Union[tuple, BaseModelOutput]: r""" Args: inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): @@ -4469,12 +4585,12 @@ def forward( """The vision model from SigLIP without any head or projection on top.""", SIGLIP_START_DOCSTRING ) class MiniCPMVisionTransformer(MiniCPMVisionPreTrainedModel): - config_class = SiglipVisionConfig + config_class = MiniCPMVisionConfig main_input_name = "pixel_values" _supports_flash_attn_2 = True _no_split_modules = [] - def __init__(self, config: SiglipVisionConfig): + def __init__(self, config: MiniCPMVisionConfig): super().__init__(config) self.config = config embed_dim = config.hidden_size @@ -4491,7 +4607,7 @@ def get_input_embeddings(self) -> nn.Module: return self.embeddings.patch_embedding @add_start_docstrings_to_model_forward(SIGLIP_VISION_INPUTS_DOCSTRING) - @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=SiglipVisionConfig) + @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=MiniCPMVisionConfig) def forward( self, pixel_values, @@ -4500,7 +4616,7 @@ def forward( output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, - ) -> Union[Tuple, BaseModelOutputWithPooling]: + ) -> Union[tuple, BaseModelOutputWithPooling]: r""" Returns: """ @@ -4561,4 +4677,4 @@ def forward( ) -__all__ = ["MiniCPM_o_2_6ForConditionalGeneration", "MiniCPM_o_2_6Model", "MiniCPM_o_2_6PreTrainedModel"] +__all__ = ["MiniCPM_o_2_6ForConditionalGeneration", "MiniCPM_o_2_6TextModel", "MiniCPM_o_2_6PreTrainedModel", "MiniCPM_o_2_6Config"] diff --git a/src/transformers/models/minicpm_o_2_6/processing_minicpm_o_2_6.py b/src/transformers/models/minicpm_o_2_6/processing_minicpm_o_2_6.py index 5a6c5dc9c65f..0b10e2ea50cd 100644 --- a/src/transformers/models/minicpm_o_2_6/processing_minicpm_o_2_6.py +++ b/src/transformers/models/minicpm_o_2_6/processing_minicpm_o_2_6.py @@ -19,27 +19,34 @@ import math import re -from typing import Any, Dict, List, Literal, Optional, Union +from typing import Any, Dict, Optional, Union -import librosa import numpy as np import torch -import torchaudio import json from copy import deepcopy from PIL import Image -from transformers.image_utils import ImageInput -from transformers.processing_utils import ProcessorMixin, ProcessingKwargs, Unpack, ImagesKwargs, AudioKwargs -from transformers.tokenization_utils_base import PreTokenizedInput, TextInput -from transformers.utils import logging, TensorType +from ...image_utils import ImageInput +from ...processing_utils import ProcessorMixin, ProcessingKwargs, Unpack, ImagesKwargs, AudioKwargs +from ...tokenization_utils_base import PreTokenizedInput, TextInput from ...feature_extraction_utils import BatchFeature -from ...utils import is_torch_device, is_torch_dtype, requires_backends, TensorType +from ...utils import is_torch_device, is_torch_dtype, requires_backends, TensorType, logging logger = logging.get_logger(__name__) +def 
recursive_converter(converter, value): + if isinstance(value, list): + new_value = [] + for v in value: + new_value += [recursive_converter(converter, v)] + return new_value + else: + return converter(value) + + class MiniCPMOBatchFeature(BatchFeature): r""" Extend from BatchFeature for supporting various image size @@ -153,19 +160,18 @@ class MiniCPM_o_2_6Processor(ProcessorMixin): attributes = ["tokenizer", "image_processor", "feature_extractor"] tokenizer_class = "AutoTokenizer" - image_processor_class = "AutoImageProcessor" + image_processor_class = "MiniCPMVImageProcessorFast" feature_extractor_class = "MiniCPM_o_2_6FeatureExtractor" def __init__(self, tokenizer=None, image_processor=None, feature_extractor=None, chat_template=None): super().__init__(tokenizer, image_processor, feature_extractor, chat_template=chat_template) - self.version = image_processor.version self.default_tts_chat_template = "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n<|spk_bos|><|spk|><|spk_eos|><|tts_bos|>' }}{% endif %}" def __call__( self, - text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]], + text: Union[TextInput, PreTokenizedInput, list[TextInput], list[PreTokenizedInput]], images: ImageInput = None, - audios: Union[np.ndarray, List[np.ndarray], List[List[np.ndarray]]] = None, + audios: Union[np.ndarray, list[np.ndarray], list[list[np.ndarray]]] = None, **kwargs: Unpack[MiniCPM_o_2_6ProcessorKwargs], ) -> MiniCPMOBatchFeature: output_kwargs = self._merge_kwargs(MiniCPM_o_2_6ProcessorKwargs, self.tokenizer.init_kwargs, **kwargs) @@ -179,13 +185,12 @@ def __call__( image_inputs = None if audios: - audio_features, audio_feature_lens, audio_phs = self.feature_extractor( - self.tokenizer, + audio_features, audio_feature_lens = self.feature_extractor( audios, audio_parts=audio_kwargs["audio_parts"], - chunk_input=audio_kwargs["chunk_input"], sampling_rate=audio_kwargs["sampling_rate"], ) + audio_phs = self.get_audios_placeholder(audios=audios, chunk_input=audio_kwargs["chunk_input"]) else: audio_features, audio_feature_lens, audio_phs = [], [], [] @@ -300,33 +305,23 @@ def apply_chat_template( ) return inputs - def decode(self, outputs, batched=False): - result = self.decode_text(outputs.sequences, self.tokenizer) - if not batched: - result = result[0] - if isinstance(result, list): - result = [i.replace(self.tokenizer.tts_end, "") for i in result] - else: - result = result.replace(self.tokenizer.tts_end, "") - return result - - def decode_text(self, result_ids, tokenizer): + def decode(self, result_ids, skeip_special_tokens: bool = False): result_text = [] for result in result_ids: result = result[result != 0] start, end = 0, len(result) for i, tok in enumerate(result): - if tok == tokenizer.bos_id: + if tok == self.tokenizer.bos_id: start = i + 1 else: break for i in range(len(result) - 1, -1, -1): - if result[i] in tokenizer.terminator_ids: + if result[i] in self.tokenizer.terminator_ids: end = i else: break result = result[start:end] - result_text.append(tokenizer.decode(result)) + result_text.append(self.tokenizer.decode(result, skip_special_tokens=skeip_special_tokens)) return result_text def get_sys_prompt(self, ref_audio=None, mode="default", language="zh"): @@ -456,7 +451,7 @@ def _convert_omni_to_inputs( self, images, audio_phs, - texts: Union[str, List[str]], + texts: Union[str, list[str]], truncation=None, max_length=None, 
max_slice_nums=None, @@ -502,8 +497,8 @@ def _convert_omni_to_inputs( audio_id = 0 for i, chunk in enumerate(text_chunks): if chunk == self.tokenizer.image_tag: - image_placeholder = self.image_processor.get_slice_image_placeholder( - self.tokenizer, image_sizes[index][image_id], image_id, max_slice_nums, use_image_id + image_placeholder = self.get_slice_image_placeholder( + image_sizes[index][image_id], image_id, max_slice_nums, use_image_id ) image_id += 1 text_chunks[i] = image_placeholder @@ -553,273 +548,99 @@ def _convert_omni_to_inputs( return data - @property - # Copied from transformers.models.clip.processing_clip.CLIPProcessor.model_input_names - def model_input_names(self): - tokenizer_input_names = self.tokenizer.model_input_names - image_processor_input_names = self.image_processor.model_input_names - feature_extractor_input_names = self.feature_extractor.model_input_names - return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names + feature_extractor_input_names)) - + def get_slice_image_placeholder(self, image_size, image_idx=0, max_slice_nums=None, use_image_id=None): + max_slice_nums = self.image_processor.max_slice_nums if max_slice_nums is None else int(max_slice_nums) + assert max_slice_nums > 0 + grid = self.image_processor.get_sliced_grid(image_size=image_size, max_slice_nums=max_slice_nums) -class MelSpectrogramFeatures(torch.nn.Module): - def __init__( - self, - sample_rate=24000, - n_fft=1024, - hop_length=256, - n_mels=100, - padding: Literal["center", "same"] = "center", - ): - super().__init__() - if padding not in ["center", "same"]: - raise ValueError("Padding must be 'center' or 'same'.") - self.padding = padding - self.mel_spec = torchaudio.transforms.MelSpectrogram( - sample_rate=sample_rate, - n_fft=n_fft, - hop_length=hop_length, - n_mels=n_mels, - center=padding == "center", - power=1, + image_placeholder = ( + self.tokenizer.im_start + + self.tokenizer.unk_token * self.image_processor.image_feature_size + + self.tokenizer.im_end + ) + use_image_id = self.image_processor.use_image_id if use_image_id is None else bool(use_image_id) + if use_image_id: + final_placeholder = ( + f"{self.tokenizer.im_id_start}{image_idx}{self.tokenizer.im_id_end}" + image_placeholder + ) + else: + final_placeholder = image_placeholder + + if self.image_processor.slice_mode: + final_placeholder = final_placeholder + self.get_grid_placeholder(grid=grid) + return final_placeholder + + def get_grid_placeholder(self, grid): + if grid is None: + return "" + slice_image_placeholder = ( + self.tokenizer.slice_start + + self.tokenizer.unk_token * self.image_processor.image_feature_size + + self.tokenizer.slice_end ) - def __call__(self, audio: torch.Tensor) -> torch.Tensor: - """ - audio: Tensor([num_channels, num_samples]) - """ - return super().__call__(audio) - - def forward(self, audio: torch.Tensor) -> torch.Tensor: - """ - audio: Tensor([num_channels, num_samples]) - """ - mel: torch.Tensor = self.mel_spec(audio) - features = torch.log(torch.clip(mel, min=1e-5)) - return features - - -class ChatTTSProcessor: - def __init__(self, text_tokenizer): - self.audio_processor = MelSpectrogramFeatures() - self.text_tokenizer = text_tokenizer - - def __call__(self, text_list, audio_list): - assert len(text_list) == len(audio_list) - input_ids_varlen = [] - for text in text_list: - input_ids_ = self.text_tokenizer.encode( - text, return_tensors="pt", add_special_tokens=False - ) # [1, seq_len] - input_ids_ = input_ids_.squeeze(0) # [seq_len] - 
input_ids_varlen.append(input_ids_) - - audio_features_varlen = [] - for audio in audio_list: - assert audio.shape.__len__() == 1 # [seq_len] - try: - # [100(num_mel_bins), seq_len_mel] - mel = self.audio_processor(audio) - except Exception as e: - raise e - audio_features_varlen.append(mel) - - return { - "tts_input_ids_varlen": input_ids_varlen, # return List[Tensor] - # return List[Tensor] - "tts_input_features_varlen": audio_features_varlen, - } - - -def is_silent(data): - if np.abs(data).max() < 3e-3: - return True - else: - return False - - -def sentence_end(txt): - for c in [".", "ใ€‚", "!", "?", "๏ผ", "๏ผŸ"]: - if c in txt: - if c == ".": # check not number before it like 1. - idx = txt.find(c) - if idx > 0: - if txt[idx - 1].isdigit(): - continue - return c - return "" - - -class NumberToTextConverter: - r""" - A helper class to ensure text-to-speech (TTS) systems read numeric digits - in the desired language (Chinese or English) digit-by-digit. It forcibly - replaces all numeric substrings in text with their language-specific - textual representations, thereby reducing the likelihood of TTS mistakes - on numbers. - Note: MiniCPM-o 2.6 only use this in streaming mode. - - Attributes: - num_to_chinese (dict): - Mapping from digit (str) to its Chinese textual form (str). - num_to_english (dict): - Mapping from digit (str) to its English textual form (str). - - Example: - >>> converter = NumberToTextConverter() - >>> converter.replace_numbers_with_text("ๆˆ‘ๆœ‰2ไธช่‹นๆžœ", language="chinese") - 'ๆˆ‘ๆœ‰ไธคไธช่‹นๆžœ' - >>> converter.replace_numbers_with_text("I have 23 books", language="english") - 'I have two three books' - """ - - def __init__(self): - self.num_to_chinese = { - "0": "้›ถ", - "1": "ไธ€", - "2": "ไบŒ", - "3": "ไธ‰", - "4": "ๅ››", - "5": "ไบ”", - "6": "ๅ…ญ", - "7": "ไธƒ", - "8": "ๅ…ซ", - "9": "ไน", - } - self.num_to_english = { - "0": "zero", - "1": "one", - "2": "two", - "3": "three", - "4": "four", - "5": "five", - "6": "six", - "7": "seven", - "8": "eight", - "9": "nine", - } - - def number_to_chinese_digit_by_digit(self, num_str): - result = "" - for char in num_str: - if char in self.num_to_chinese: - result += self.num_to_chinese[char] - return result - - def number_to_english_digit_by_digit(self, num_str): - result = [] - for char in num_str: - if char in self.num_to_english: - result.append(self.num_to_english[char]) - return " ".join(result) - - def detect_language(self, text): - chinese_count = len(re.findall(r"[\u4e00-\u9fff]", text)) - english_count = len(re.findall(r"[a-zA-Z]", text)) - return "chinese" if chinese_count >= english_count else "english" - - def replace_numbers_with_text(self, text, language=None): - if language is None: - language = self.detect_language(text) - numbers = re.findall(r"\d+", text) - - for num in numbers: - if language == "chinese": - replacement = self.number_to_chinese_digit_by_digit(num) + cols = grid[0] + rows = grid[1] + slices = [] + for i in range(rows): + lines = [] + for j in range(cols): + lines.append(slice_image_placeholder) + slices.append("".join(lines)) + + slice_placeholder = "\n".join(slices) + return slice_placeholder + + def get_audios_placeholder(self, audios, + chunk_input: Optional[bool] = False, + chunk_length: Optional[int] = 1): + audios_list = self.feature_extractor.format_audios(audios) + audio_ph_list = [] + for audios in audios_list: + if audios: + audio_ph_list.append( + [self.get_single_audio_placeholder(len(a), chunk_input, chunk_length) for a in audios] + ) else: - replacement = 
self.number_to_english_digit_by_digit(num) - text = text.replace(num, replacement, 1) - - return text - - -class VoiceChecker: - r""" - A simple utility class to detect silence or low variation in consecutive audio chunks by comparing - the mel-spectrogram distances. It keeps track of consecutive zero-distance and low-distance chunks - to decide if the audio is considered "bad" (e.g., overly silent or not changing enough). - - Attributes: - previous_mel (`np.ndarray` or `None`): - Holds the previously observed mel-spectrogram in decibel scale. Used to compute - the next distance; reset via :meth:`reset`. - consecutive_zeros (`int`): - The number of consecutive chunks that were detected as silent (distance = 0). - consecutive_low_distance (`int`): - The number of consecutive chunks whose distance was below the threshold. - - Example: - >>> checker = VoiceChecker() - >>> # Suppose we have audio_wav (list or np.ndarray) and mel_spec (np.ndarray) - >>> # We split them into chunks and call checker.is_bad(...) - >>> is_audio_bad = checker.is_bad(audio_wav, mel_spec, chunk_size=2560, thresh=100.0) - >>> if is_audio_bad: - ... print("Audio deemed bad!") - >>> # Reset states if needed - >>> checker.reset() - """ - - def __init__(self): - self.previous_mel = None - self.consecutive_zeros = 0 - self.consecutive_low_distance = 0 - - def compute_distance(self, audio_chunk, mel_spec): - if is_silent(audio_chunk): - return 0.0 # ๆฃ€ๆŸฅๆ˜ฏๅฆไธบ็ฉบ็™ฝ็‰‡ๆฎต - - mel_db = librosa.power_to_db(mel_spec) - if self.previous_mel is None: - self.previous_mel = mel_db - return -1.0 - - distance = np.linalg.norm(np.mean(mel_db, axis=1) - np.mean(self.previous_mel, axis=1)) - self.previous_mel = mel_db - return distance - - def is_bad(self, audio_wav, mel_spec, chunk_size=2560, thresh=100.0): - num_chunks = len(audio_wav) // chunk_size - mel_chunk_size = mel_spec.shape[-1] // num_chunks - for i in range(num_chunks): - audio_chunk = audio_wav[i * chunk_size : (i + 1) * chunk_size] - mel_spec_chunk = mel_spec[:, i * mel_chunk_size : (i + 1) * mel_chunk_size] - - distance = self.compute_distance(audio_chunk, mel_spec_chunk) - logger.warning( - f"mel dist: {distance:.1f}, zero: {self.consecutive_zeros}, low: {self.consecutive_low_distance}" + audio_ph_list.append([]) + return audio_ph_list + + def get_single_audio_placeholder(self, audio_lens, chunk_input, chunk_length): + pool_step = 2 + feature_lens = math.ceil(audio_lens / self.feature_extractor.hop_length) + + feature_lens = (feature_lens - 1) // 2 + 1 + output_lens = (feature_lens - pool_step) // pool_step + 1 + + if chunk_input: + fbank_feat_in_chunk = int(chunk_length * 100) + cnn_feat_in_chunk = (fbank_feat_in_chunk - 1) // 2 + 1 + audio_embeds_in_chunk = (cnn_feat_in_chunk - pool_step) // pool_step + 1 + num_audio_chunks = (output_lens + audio_embeds_in_chunk - 1) // audio_embeds_in_chunk + + place_holders = "" + total_unk_len = 0 + for _ in range(num_audio_chunks): + unk_len = min(audio_embeds_in_chunk, output_lens - total_unk_len) + place_holders += ( + self.tokenizer.audio_start + self.tokenizer.unk_token * unk_len + self.tokenizer.audio_end + ) + total_unk_len += unk_len + audio_placeholder = place_holders + else: + audio_placeholder = ( + self.tokenizer.audio_start + self.tokenizer.unk_token * output_lens + self.tokenizer.audio_end ) - if distance == 0: - self.consecutive_low_distance = 0 # reset - self.consecutive_zeros += 1 - if self.consecutive_zeros >= 12: - logger.warning("VoiceChecker detected 1.2 s silent. 
Marking as failed.") - return True - elif distance < thresh: - self.consecutive_zeros = 0 - self.consecutive_low_distance += 1 - if self.consecutive_low_distance >= 5: - logger.warning("VoiceChecker detected 5 consecutive low distance chunks. Marking as failed.") - return True - else: - self.consecutive_low_distance = 0 - self.consecutive_zeros = 0 - - return False - - def reset(self): - self.previous_mel = None - self.consecutive_zeros = 0 - self.consecutive_low_distance = 0 + return audio_placeholder -def recursive_converter(converter, value): - if isinstance(value, list): - new_value = [] - for v in value: - new_value += [recursive_converter(converter, v)] - return new_value - else: - return converter(value) + @property + # Copied from transformers.models.clip.processing_clip.CLIPProcessor.model_input_names + def model_input_names(self): + tokenizer_input_names = self.tokenizer.model_input_names + image_processor_input_names = self.image_processor.model_input_names + feature_extractor_input_names = self.feature_extractor.model_input_names + return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names + feature_extractor_input_names)) __all__ = ["MiniCPM_o_2_6Processor"] diff --git a/src/transformers/models/minicpm_o_2_6/tokenization_minicpm_o_2_6.py b/src/transformers/models/minicpm_o_2_6/tokenization_minicpm_o_2_6.py deleted file mode 100644 index b2c910ab14f5..000000000000 --- a/src/transformers/models/minicpm_o_2_6/tokenization_minicpm_o_2_6.py +++ /dev/null @@ -1,24 +0,0 @@ -# coding=utf-8 -# Copyright 2025 The OpenBMB Team. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from transformers import Qwen2Tokenizer - - -class MiniCPM_o_2_6Tokenizer(Qwen2Tokenizer): - def __init__(self, **kwargs): - super().__init__(**kwargs) - - -__all__ = ["MiniCPM_o_2_6Tokenizer"] diff --git a/src/transformers/models/minicpm_o_2_6/tokenization_minicpm_o_2_6_fast.py b/src/transformers/models/minicpm_o_2_6/tokenization_minicpm_o_2_6_fast.py index 8d943508c40e..5fcee76500e0 100644 --- a/src/transformers/models/minicpm_o_2_6/tokenization_minicpm_o_2_6_fast.py +++ b/src/transformers/models/minicpm_o_2_6/tokenization_minicpm_o_2_6_fast.py @@ -13,7 +13,7 @@ # See the License for the specific language governing permissions and # limitations under the License. -from transformers import Qwen2TokenizerFast +from ..qwen2.tokenization_qwen2_fast import Qwen2TokenizerFast class MiniCPM_o_2_6TokenizerFast(Qwen2TokenizerFast): diff --git a/src/transformers/models/minicpm_o_2_6/tts_processing_minicpm_o_2_6.py b/src/transformers/models/minicpm_o_2_6/tts_processing_minicpm_o_2_6.py new file mode 100644 index 000000000000..24808aa34f4e --- /dev/null +++ b/src/transformers/models/minicpm_o_2_6/tts_processing_minicpm_o_2_6.py @@ -0,0 +1,277 @@ +# coding=utf-8 +# Copyright 2025 The OpenBMB Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import re
+from typing import Literal
+
+import librosa
+import numpy as np
+import torch
+import torchaudio
+
+from ...utils import logging
+
+
+logger = logging.get_logger(__name__)
+
+
+class MelSpectrogramFeatures(torch.nn.Module):
+    def __init__(
+        self,
+        sample_rate=24000,
+        n_fft=1024,
+        hop_length=256,
+        n_mels=100,
+        padding: Literal["center", "same"] = "center",
+    ):
+        super().__init__()
+        if padding not in ["center", "same"]:
+            raise ValueError("Padding must be 'center' or 'same'.")
+        self.padding = padding
+        self.mel_spec = torchaudio.transforms.MelSpectrogram(
+            sample_rate=sample_rate,
+            n_fft=n_fft,
+            hop_length=hop_length,
+            n_mels=n_mels,
+            center=padding == "center",
+            power=1,
+        )
+
+    def __call__(self, audio: torch.Tensor) -> torch.Tensor:
+        """
+        audio: Tensor([num_channels, num_samples])
+        """
+        return super().__call__(audio)
+
+    def forward(self, audio: torch.Tensor) -> torch.Tensor:
+        """
+        audio: Tensor([num_channels, num_samples])
+        """
+        mel: torch.Tensor = self.mel_spec(audio)
+        features = torch.log(torch.clip(mel, min=1e-5))
+        return features
+
+
+class ChatTTSProcessor:
+    def __init__(self, text_tokenizer):
+        self.audio_processor = MelSpectrogramFeatures()
+        self.text_tokenizer = text_tokenizer
+
+    def __call__(self, text_list, audio_list):
+        assert len(text_list) == len(audio_list)
+        input_ids_varlen = []
+        for text in text_list:
+            input_ids_ = self.text_tokenizer.encode(
+                text, return_tensors="pt", add_special_tokens=False
+            )  # [1, seq_len]
+            input_ids_ = input_ids_.squeeze(0)  # [seq_len]
+            input_ids_varlen.append(input_ids_)
+
+        audio_features_varlen = []
+        for audio in audio_list:
+            assert audio.ndim == 1  # [seq_len]
+            mel = self.audio_processor(audio)  # [100 (num_mel_bins), seq_len_mel]
+            audio_features_varlen.append(mel)
+
+        return {
+            "tts_input_ids_varlen": input_ids_varlen,  # List[Tensor]
+            "tts_input_features_varlen": audio_features_varlen,  # List[Tensor]
+        }
+
+
+def is_silent(data):
+    if np.abs(data).max() < 3e-3:
+        return True
+    else:
+        return False
+
+
+def sentence_end(txt):
+    for c in [".", "。", "!", "?", "！", "？"]:
+        if c in txt:
+            if c == ".":  # skip "." when it directly follows a digit (e.g. "1.")
+                idx = txt.find(c)
+                if idx > 0:
+                    if txt[idx - 1].isdigit():
+                        continue
+            return c
+    return ""
+
+
+class NumberToTextConverter:
+    r"""
+    A helper class to ensure text-to-speech (TTS) systems read numeric digits
+    in the desired language (Chinese or English) digit-by-digit. It forcibly
+    replaces all numeric substrings in text with their language-specific
+    textual representations, thereby reducing the likelihood of TTS mistakes
+    on numbers.
+    Note: MiniCPM-o 2.6 only uses this in streaming mode.
+
+    Attributes:
+        num_to_chinese (dict):
+            Mapping from digit (str) to its Chinese textual form (str).
+        num_to_english (dict):
+            Mapping from digit (str) to its English textual form (str).
+
+    Example:
+        >>> converter = NumberToTextConverter()
+        >>> converter.replace_numbers_with_text("我有2个苹果", language="chinese")
+        '我有二个苹果'
+        >>> converter.replace_numbers_with_text("I have 23 books", language="english")
+        'I have two three books'
+    """
+
+    def __init__(self):
+        self.num_to_chinese = {
+            "0": "零",
+            "1": "一",
+            "2": "二",
+            "3": "三",
+            "4": "四",
+            "5": "五",
+            "6": "六",
+            "7": "七",
+            "8": "八",
+            "9": "九",
+        }
+        self.num_to_english = {
+            "0": "zero",
+            "1": "one",
+            "2": "two",
+            "3": "three",
+            "4": "four",
+            "5": "five",
+            "6": "six",
+            "7": "seven",
+            "8": "eight",
+            "9": "nine",
+        }
+
+    def number_to_chinese_digit_by_digit(self, num_str):
+        result = ""
+        for char in num_str:
+            if char in self.num_to_chinese:
+                result += self.num_to_chinese[char]
+        return result
+
+    def number_to_english_digit_by_digit(self, num_str):
+        result = []
+        for char in num_str:
+            if char in self.num_to_english:
+                result.append(self.num_to_english[char])
+        return " ".join(result)
+
+    def detect_language(self, text):
+        chinese_count = len(re.findall(r"[\u4e00-\u9fff]", text))
+        english_count = len(re.findall(r"[a-zA-Z]", text))
+        return "chinese" if chinese_count >= english_count else "english"
+
+    def replace_numbers_with_text(self, text, language=None):
+        if language is None:
+            language = self.detect_language(text)
+        numbers = re.findall(r"\d+", text)
+
+        for num in numbers:
+            if language == "chinese":
+                replacement = self.number_to_chinese_digit_by_digit(num)
+            else:
+                replacement = self.number_to_english_digit_by_digit(num)
+            text = text.replace(num, replacement, 1)
+
+        return text
+
+
+class VoiceChecker:
+    r"""
+    A simple utility class to detect silence or low variation in consecutive audio chunks by comparing
+    the mel-spectrogram distances. It keeps track of consecutive zero-distance and low-distance chunks
+    to decide if the audio is considered "bad" (e.g., overly silent or not changing enough).
+
+    Attributes:
+        previous_mel (`np.ndarray` or `None`):
+            Holds the previously observed mel-spectrogram in decibel scale. Used to compute
+            the next distance; reset via :meth:`reset`.
+        consecutive_zeros (`int`):
+            The number of consecutive chunks that were detected as silent (distance = 0).
+        consecutive_low_distance (`int`):
+            The number of consecutive chunks whose distance was below the threshold.
+
+    Example:
+        >>> checker = VoiceChecker()
+        >>> # Suppose we have audio_wav (list or np.ndarray) and mel_spec (np.ndarray)
+        >>> # We split them into chunks and call checker.is_bad(...)
+        >>> is_audio_bad = checker.is_bad(audio_wav, mel_spec, chunk_size=2560, thresh=100.0)
+        >>> if is_audio_bad:
+        ...     print("Audio deemed bad!")
print("Audio deemed bad!") + >>> # Reset states if needed + >>> checker.reset() + """ + + def __init__(self): + self.previous_mel = None + self.consecutive_zeros = 0 + self.consecutive_low_distance = 0 + + def compute_distance(self, audio_chunk, mel_spec): + if is_silent(audio_chunk): + return 0.0 # ๆฃ€ๆŸฅๆ˜ฏๅฆไธบ็ฉบ็™ฝ็‰‡ๆฎต + + mel_db = librosa.power_to_db(mel_spec) + if self.previous_mel is None: + self.previous_mel = mel_db + return -1.0 + + distance = np.linalg.norm(np.mean(mel_db, axis=1) - np.mean(self.previous_mel, axis=1)) + self.previous_mel = mel_db + return distance + + def is_bad(self, audio_wav, mel_spec, chunk_size=2560, thresh=100.0): + num_chunks = len(audio_wav) // chunk_size + mel_chunk_size = mel_spec.shape[-1] // num_chunks + for i in range(num_chunks): + audio_chunk = audio_wav[i * chunk_size: (i + 1) * chunk_size] + mel_spec_chunk = mel_spec[:, i * mel_chunk_size: (i + 1) * mel_chunk_size] + + distance = self.compute_distance(audio_chunk, mel_spec_chunk) + logger.warning( + f"mel dist: {distance:.1f}, zero: {self.consecutive_zeros}, low: {self.consecutive_low_distance}" + ) + if distance == 0: + self.consecutive_low_distance = 0 # reset + self.consecutive_zeros += 1 + if self.consecutive_zeros >= 12: + logger.warning("VoiceChecker detected 1.2 s silent. Marking as failed.") + return True + elif distance < thresh: + self.consecutive_zeros = 0 + self.consecutive_low_distance += 1 + if self.consecutive_low_distance >= 5: + logger.warning("VoiceChecker detected 5 consecutive low distance chunks. Marking as failed.") + return True + else: + self.consecutive_low_distance = 0 + self.consecutive_zeros = 0 + + return False + + def reset(self): + self.previous_mel = None + self.consecutive_zeros = 0 + self.consecutive_low_distance = 0