diff --git a/docs/source/en/model_doc/minicpm_o_2_6.md b/docs/source/en/model_doc/minicpm_o_2_6.md index df31c58f280c..feea3263e8fb 100644 --- a/docs/source/en/model_doc/minicpm_o_2_6.md +++ b/docs/source/en/model_doc/minicpm_o_2_6.md @@ -11,977 +11,70 @@ specific language governing permissions and limitations under the License. โš ๏ธ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer. +---> -

A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone

+# MiniCPM-o 2.6 + +

A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone

[GitHub](https://github.com/OpenBMB/MiniCPM-o) | [Online Demo](https://minicpm-omni-webdemo-us.modelbest.cn) | [Technical Blog](https://openbmb.notion.site/MiniCPM-o-2-6-A-GPT-4o-Level-MLLM-for-Vision-Speech-and-Multimodal-Live-Streaming-on-Your-Phone-185ede1b7a558042b5d5e45e6b237da9) +## Overview + +The [MiniCPM-o 2.6](https://github.com/OpenBMB/MiniCPM-o) model is an end-to-end omni-modal large multimodal model proposed by the OpenBMB Team. MiniCPM-o 2.6 is built based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters. -### News - -* [2025.03.01] ๐Ÿš€๐Ÿš€๐Ÿš€ RLAIF-V, which is the alignment technique of MiniCPM-o, is accepted by CVPR 2025๏ผThe [code](https://github.com/RLHF-V/RLAIF-V), [dataset](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset), [paper](https://arxiv.org/abs/2405.17220) are open-sourced! - -* [2025.01.24] ๐Ÿ“ข๐Ÿ“ข๐Ÿ“ข MiniCPM-o 2.6 technical report is released! [See Here](https://openbmb.notion.site/MiniCPM-o-2-6-A-GPT-4o-Level-MLLM-for-Vision-Speech-and-Multimodal-Live-Streaming-on-Your-Phone-185ede1b7a558042b5d5e45e6b237da9). - -* [2025.01.19] โญ๏ธโญ๏ธโญ๏ธ MiniCPM-o tops GitHub Trending and reaches top-2 on Hugging Face Trending! - -## MiniCPM-o 2.6 - - -**MiniCPM-o 2.6** is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.6, and introduces new features for real-time speech conversation and multimodal live streaming. Notable features of MiniCPM-o 2.6 include: - -- ๐Ÿ”ฅ **Leading Visual Capability.** - MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding. It also **outperforms GPT-4V and Claude 3.5 Sonnet** in mutli-image and video understanding, and shows promising in-context learning capability. - -- ๐ŸŽ™ **State-of-the-art Speech Capability.** MiniCPM-o 2.6 supports **bilingual real-time speech conversation with configurable voices** in English and Chinese. It **outperforms GPT-4o-realtime on audio understanding tasks** such as ASR and STT translation, and shows **state-of-the-art performance on speech conversation in both semantic and acoustic evaluations in the open-source community**. It also allows for fun features such as emotion/speed/style control, end-to-end voice cloning, role play, etc. - -- ๐ŸŽฌ **Strong Multimodal Live Streaming Capability.** As a new feature, MiniCPM-o 2.6 can **accept continous video and audio streams independent of user queries, and support real-time speech interaction**. It **outperforms GPT-4o-202408 and Claude 3.5 Sonnet and shows state-of-art performance in open-source community on StreamingBench**, a comprehensive benchmark for real-time video understanding, omni-source (video & audio) understanding, and multimodal contextual understanding. - -- ๐Ÿ’ช **Strong OCR Capability and Others.** -Advancing popular visual capabilites from MiniCPM-V series, MiniCPM-o 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench for models under 25B, surpassing proprietary models such as GPT-4o-202405**. 
- Based on the the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, outperforming GPT-4o and Claude 3.5 Sonnet on MMHal-Bench, and supports **multilingual capabilities** on more than 30 languages. - - -- ๐Ÿš€ **Superior Efficiency.** - In addition to its friendly size, MiniCPM-o 2.6 also shows **state-of-the-art token density** (i.e., number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support **multimodal live streaming** on end-side devices such as iPad. - -- ๐Ÿ’ซ **Easy Usage.** -MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [LLaMA-Factory](./docs/llamafactory_train.md), (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) online web demo on [server](https://minicpm-omni-webdemo-us.modelbest.cn/). - - - -**Model Architecture.** - -- **End-to-end Omni-modal Architecture.** Different modality encoder/decoders are connected and trained in an **end-to-end** fashion to fully exploit rich multimodal knowledge. -- **Omni-modal Live Streaming Mechanism.** (1) We change the offline modality encoder/decoders into online ones for **streaminig inputs/outputs.** (2) We devise a **time-division multiplexing (TDM) mechanism** for omni-modality streaminig processing in the LLM backbone. It divides parallel omni-modality streams into sequential info within small periodic time slices. -- **Configurable Speech Modeling Design.** We devise a multimodal system prompt, including traditional text system prompt, and **a new audio system prompt to determine the assistant voice**. This enables flexible voice configurations in inference time, and also facilitates end-to-end voice cloning and description-based voice creation. - -
- -
- - -### Evaluation - -
- -
-#### Visual understanding results - -**Image Understanding:** - -
| Model | Size | Token Density+ | OpenCompass | OCRBench | MathVista mini | ChartQA | MMVet | MMStar | MME | MMB1.1 test | AI2D | MMMU val | HallusionBench | TextVQA val | DocVQA test | MathVerse mini | MathVision | MMHal Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary** | | | | | | | | | | | | | | | | | | |
| GPT-4o-20240513 | - | 1088 | 69.9 | 736 | 61.3 | 85.7 | 69.1 | 63.9 | 2328.7 | 82.2 | 84.6 | 69.2 | 55.0 | - | 92.8 | 50.2 | 30.4 | 3.6 |
| Claude3.5-Sonnet | - | 750 | 67.9 | 788 | 61.6 | 90.8 | 66.0 | 62.2 | 1920.0 | 78.5 | 80.2 | 65.9 | 49.9 | - | 95.2 | - | - | 3.4 |
| Gemini 1.5 Pro | - | - | 64.4 | 754 | 57.7 | 81.3 | 64.0 | 59.1 | 2110.6 | 73.9 | 79.1 | 60.6 | 45.6 | 73.5 | 86.5 | - | 19.2 | - |
| GPT-4o-mini-20240718 | - | 1088 | 64.1 | 785 | 52.4 | - | 66.9 | 54.8 | 2003.4 | 76.0 | 77.8 | 60.0 | 46.1 | - | - | - | - | 3.3 |
| **Open Source** | | | | | | | | | | | | | | | | | | |
| Cambrian-34B | 34B | 1820 | 58.3 | 591 | 50.3 | 75.6 | 53.2 | 54.2 | 2049.9 | 77.8 | 79.5 | 50.4 | 41.6 | 76.7 | 75.5 | - | - | - |
| GLM-4V-9B | 13B | 784 | 59.1 | 776 | 51.1 | - | 58.0 | 54.8 | 2018.8 | 67.9 | 71.2 | 46.9 | 45.0 | - | - | - | - | - |
| Pixtral-12B | 12B | 256 | 61.0 | 685 | 56.9 | 81.8 | 58.5 | 54.5 | - | 72.7 | 79.0 | 51.1 | 47.0 | 75.7 | 90.7 | - | - | - |
| DeepSeek-VL2-27B (4B) | 27B | 672 | 66.4 | 809 | 63.9 | 86.0 | 60.0 | 61.9 | 2253.0 | 81.2 | 83.8 | 54.0 | 45.3 | 84.2 | 93.3 | - | - | 3.0 |
| Qwen2-VL-7B | 8B | 784 | 67.1 | 866 | 58.2 | 83.0 | 62.0 | 60.7 | 2326.0 | 81.8 | 83.0 | 54.1 | 50.6 | 84.3 | 94.5 | 31.9 | 16.3 | 3.2 |
| LLaVA-OneVision-72B | 72B | 182 | 68.1 | 741 | 67.5 | 83.7 | 60.6 | 65.8 | 2261.0 | 85.0 | 85.6 | 56.8 | 49.0 | 80.5 | 91.3 | 39.1 | - | 3.5 |
| InternVL2.5-8B | 8B | 706 | 68.3 | 822 | 64.4 | 84.8 | 62.8 | 62.8 | 2344.0 | 83.6 | 84.5 | 56.0 | 50.1 | 79.1 | 93.0 | 39.5 | 19.7 | 3.4 |
| MiniCPM-V 2.6 | 8B | 2822 | 65.2 | 852* | 60.6 | 79.4 | 60.0 | 57.5 | 2348.4* | 78.0 | 82.1 | 49.8* | 48.1* | 80.1 | 90.8 | 25.7 | 18.3 | 3.6 |
| MiniCPM-o 2.6 | 8B | 2822 | 70.2 | 897* | 71.9* | 86.9* | 67.5 | 64.0 | 2372.0* | 80.5 | 85.8 | 50.4* | 51.9 | 82.0 | 93.5 | 41.4* | 23.1* | 3.8 |
-
-* We evaluate this benchmark using chain-of-thought prompting. Specifically, for MME, we used this technique only for the Cognition set. - -+ Token Density: number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens. - -Note: For proprietary models, we calculate token density based on the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimation. - - -**Multi-image and Video Understanding:** - -
| Model | Size | BLINK val | Mantis Eval | MIRB | Video-MME (wo / w subs) |
|---|---|---|---|---|---|
| **Proprietary** | | | | | |
| GPT-4o-20240513 | - | 68.0 | - | - | 71.9/77.2 |
| GPT4V | - | 54.6 | 62.7 | 53.1 | 59.9/63.3 |
| **Open-source** | | | | | |
| LLaVA-NeXT-Interleave 14B | 14B | 52.6 | 66.4 | 30.2 | - |
| LLaVA-OneVision-72B | 72B | 55.4 | 77.6 | - | 66.2/69.5 |
| MANTIS 8B | 8B | 49.1 | 59.5 | 34.8 | - |
| Qwen2-VL-7B | 8B | 53.2 | 69.6* | 67.6* | 63.3/69.0 |
| InternVL2.5-8B | 8B | 54.8 | 67.7 | 52.5 | 64.2/66.9 |
| MiniCPM-V 2.6 | 8B | 53.0 | 69.1 | 53.8 | 60.9/63.6 |
| MiniCPM-o 2.6 | 8B | 56.7 | 71.9 | 58.6 | 63.9/67.9 |
-
-* We evaluate officially released checkpoints by ourselves. - -
- - -#### Audio understanding and speech conversation results. - -**Audio Understanding:** - -
ASR (zh) columns report CER↓, ASR (en) columns report WER↓, AST columns report BLEU↑, and the Emotion column reports ACC↑.

| Model | Size | AISHELL-1 | Fleurs zh | WenetSpeech test-net | LibriSpeech test-clean | GigaSpeech | TED-LIUM | CoVoST en2zh | CoVoST zh2en | MELD emotion |
|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary** | | | | | | | | | | |
| GPT-4o-Realtime | - | 7.3* | 5.4* | 28.9* | 2.6* | 12.9* | 4.8* | 37.1* | 15.7* | 33.2* |
| Gemini 1.5 Pro | - | 4.5* | 5.9* | 14.3* | 2.9* | 10.6* | 3.0* | 47.3* | 22.6* | 48.4* |
| **Open-Source** | | | | | | | | | | |
| Qwen2-Audio-7B | 8B | - | 7.5 | - | 1.6 | - | - | 45.2 | 24.4 | 55.3 |
| Qwen2-Audio-7B-Instruct | 8B | 2.6* | 6.9* | 10.3* | 3.1* | 9.7* | 5.9* | 39.5* | 22.9* | 17.4* |
| GLM-4-Voice-Base | 9B | 2.5 | - | - | 2.8 | - | - | - | - | - |
| MiniCPM-o 2.6 | 8B | 1.6 | 4.4 | 6.9 | 1.7 | 8.7 | 3.0 | 48.2 | 27.2 | 52.4 |
-
-* We evaluate officially released checkpoints by ourselves.

-**Speech Generation:** - -
All tasks are SpeechQA. Speech Llama Q., Speech Web Q., and Speech Trivia QA report ACC↑; Speech AlpacaEval reports G-Eval (10 point)↑; the AudioArena columns report Semantic/Acoustic/Overall ELO score↑, UTMOS↑, and ASR-WER↓.

| Model | Size | Speech Llama Q. | Speech Web Q. | Speech Trivia QA | Speech AlpacaEval | Semantic ELO↑ | Acoustic ELO↑ | Overall ELO↑ | UTMOS↑ | ASR-WER↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary** | | | | | | | | | | |
| GPT-4o-Realtime | - | 71.7 | 51.6 | 69.7 | 7.4 | 1157 | 1203 | 1200 | 4.2 | 2.3 |
| **Open-Source** | | | | | | | | | | |
| GLM-4-Voice | 9B | 50.0 | 32.0 | 36.4 | 5.1 | 999 | 1147 | 1035 | 4.1 | 11.7 |
| Llama-Omni | 8B | 45.3 | 22.9 | 10.7 | 3.9 | 960 | 878 | 897 | 3.2 | 24.3 |
| Moshi | 7B | 43.7 | 23.8 | 16.7 | 2.4 | 871 | 808 | 875 | 2.8 | 8.2 |
| Mini-Omni | 1B | 22.0 | 12.8 | 6.9 | 2.5 | 926 | 803 | 865 | 3.4 | 10.0 |
| MiniCPM-o 2.6 | 8B | 61.0 | 40.0 | 40.2 | 5.1 | 1088 | 1163 | 1131 | 4.2 | 9.8 |
-
-All results are from AudioEvals, and the evaluation methods along with further details can be found in UltraEval-Audio.

-**End-to-end Voice Cloning** - -
| Model | Seed-TTS test-zh (SIMO↑) | Seed-TTS test-en (SIMO↑) |
|---|---|---|
| F5-TTS | 76 | 67 |
| CosyVoice | 75 | 64 |
| FireRedTTS | 63 | 46 |
| MiniCPM-o 2.6 | 57 | 47 |
-
- -#### Multimodal live streaming results. - -**Multimodal Live Streaming:** results on StreamingBench - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| Model | Size | Real-Time Video Understanding | Omni-Source Understanding | Contextual Understanding | Overall |
|---|---|---|---|---|---|
| **Proprietary** | | | | | |
| Gemini 1.5 Pro | - | 77.4 | 67.8 | 51.1 | 70.3 |
| GPT-4o-202408 | - | 74.5 | 51.0 | 48.0 | 64.1 |
| Claude-3.5-Sonnet | - | 74.0 | 41.4 | 37.8 | 59.7 |
| **Open-source** | | | | | |
| VILA-1.5 | 8B | 61.5 | 37.5 | 26.7 | 49.5 |
| LongVA | 7B | 63.1 | 35.9 | 30.2 | 50.7 |
| LLaVA-Next-Video-34B | 34B | 69.8 | 41.7 | 34.3 | 56.7 |
| Qwen2-VL-7B | 8B | 71.2 | 40.7 | 33.1 | 57.0 |
| InternVL2-8B | 8B | 70.1 | 42.7 | 34.1 | 57.0 |
| VITA-1.5 | 8B | 70.9 | 40.8 | 35.8 | 57.4 |
| LLaVA-OneVision-7B | 8B | 74.3 | 40.8 | 31.0 | 58.4 |
| InternLM-XC2.5-OL-7B | 8B | 75.4 | 46.2 | 33.6 | 60.8 |
| MiniCPM-V 2.6 | 8B | 72.4 | 40.2 | 33.4 | 57.7 |
| MiniCPM-o 2.6 | 8B | 79.9 | 53.4 | 38.5 | 66.0 |
- - -### Examples - -We deploy MiniCPM-o 2.6 on end devices. The demo video is the raw-speed recording on an iPad Pro and a Web demo. - -
- -
- -
- - -
- math - diagram - bike -
- - - - -## Online Demo -Click here to try the online demo of [MiniCPM-o 2.6](https://minicpm-omni-webdemo-us.modelbest.cn). +The model features: +_MiniCPM-o 2.6 is the latest and most capable model in the MiniCPM-o series, featuring leading visual capability with an average score of 70.2 on OpenCompass. With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet for single image understanding. It supports state-of-the-art speech capability with bilingual real-time speech conversation and configurable voices in English and Chinese, outperforming GPT-4o-realtime on audio understanding tasks. The model introduces strong multimodal live streaming capability, accepting continuous video and audio streams independent of user queries with real-time speech interaction. It features superior efficiency with state-of-the-art token density, producing only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models. The architecture employs an end-to-end omni-modal design with time-division multiplexing (TDM) mechanism for omni-modality streaming processing and configurable speech modeling design with multimodal system prompts._ ## Usage -Inference using Huggingface transformers on NVIDIA GPUs. Please ensure that `transformers==4.44.2` is installed, as other versions may have compatibility issues. We are investigating this issue. Requirements tested on python 3.10๏ผš + +Inference using Huggingface transformers on NVIDIA GPUs. Requirements tested on python 3.10๏ผš + ``` -Pillow==10.1.0 -torch==2.3.1 -torchaudio==2.3.1 -torchvision==0.18.1 -transformers==4.44.2 -librosa==0.9.0 -soundfile==0.12.1 -vector-quantize-pytorch==1.18.5 -vocos==0.1.0 +transformers +Pillow +torch +torchaudio +torchvision +librosa +soundfile +vector-quantize-pytorch +vocos decord moviepy ``` - ### Model initialization + ```python import torch from PIL import Image from transformers import AutoModel, AutoTokenizer -# load omni model default, the default init_vision/init_audio/init_tts is True -# if load vision-only model, please set init_audio=False and init_tts=False -# if load audio-only model, please set init_vision=False + model = AutoModel.from_pretrained( 'openbmb/MiniCPM-o-2_6', - trust_remote_code=True, - attn_implementation='sdpa', # sdpa or flash_attention_2 - torch_dtype=torch.bfloat16, - init_vision=True, - init_audio=True, - init_tts=True + attn_implementation='sdpa', # sdpa or flash_attention_2, no eager + dtype=torch.bfloat16 ) model = model.eval().cuda() -tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True) -# In addition to vision-only mode, tts processor and vocos also needs to be initialized model.init_tts() + +processor = AutoProcessor.from_pretrained('openbmb/MiniCPM-o-2_6') ``` If you are using an older version of PyTorch, you might encounter this issue `"weight_norm_fwd_first_dim_kernel" not implemented for 'BFloat16'`, Please convert the TTS to float32 type. 
+ ```python model.tts.float() ``` ### Omni mode -We provide two inference modes: chat and streaming -#### Chat inference +We provide two inference modes: normal generate and streaming + +#### Normal generate inference + ```python import math import numpy as np @@ -990,16 +83,17 @@ from moviepy.editor import VideoFileClip import tempfile import librosa import soundfile as sf + def get_video_chunk_content(video_path, flatten=True): video = VideoFileClip(video_path) print('video_duration:', video.duration) - + with tempfile.NamedTemporaryFile(suffix=".wav", delete=True) as temp_audio_file: temp_audio_file_path = temp_audio_file.name video.audio.write_audiofile(temp_audio_file_path, codec="pcm_s16le", fps=16000) audio_np, sr = librosa.load(temp_audio_file_path, sr=16000, mono=True) num_units = math.ceil(video.duration) - + # 1 frame + 1s audio chunk contents= [] for i in range(num_units): @@ -1010,78 +104,79 @@ def get_video_chunk_content(video_path, flatten=True): contents.extend(["", image, audio]) else: contents.append(["", image, audio]) - + return contents + video_path="assets/Skiing.mp4" # if use voice clone prompt, please set ref_audio ref_audio_path = 'assets/demo.wav' ref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True) -sys_msg = model.get_sys_prompt(ref_audio=ref_audio, mode='omni', language='en') +sys_msg = processor.get_sys_prompt(ref_audio=ref_audio, mode='omni', language='en') # or use default prompt # sys_msg = model.get_sys_prompt(mode='omni', language='en') contents = get_video_chunk_content(video_path) msg = {"role":"user", "content": contents} msgs = [sys_msg, msg] +inputs = processor.apply_chat_template(msgs=msgs).to(model.device) + # please set generate_audio=True and output_audio_path to save the tts result generate_audio = True output_audio_path = 'output.wav' -res = model.chat( - msgs=msgs, - tokenizer=tokenizer, +res = model.generate( + **inputs, + processor=processor, sampling=True, temperature=0.5, max_new_tokens=4096, - omni_input=True, # please set omni_input=True when omni inference use_tts_template=True, generate_audio=generate_audio, output_audio_path=output_audio_path, - max_slice_nums=1, - use_image_id=False, - return_dict=True + repetition_penalty=1.2, ) print(res) -## You will get the answer: The person in the picture is skiing down a snowy slope. -# import IPython -# IPython.display.Audio('output.wav') ``` + #### Streaming inference + ```python # a new conversation need reset session first, it will reset the kv-cache model.reset_session() contents = get_video_chunk_content(video_path, flatten=False) session_id = '123' -generate_audio = True +use_tts = True + # 1. prefill system prompt res = model.streaming_prefill( session_id=session_id, - msgs=[sys_msg], - tokenizer=tokenizer + msgs=[sys_msg], + processor=processor ) + # 2. prefill video/audio chunks for content in contents: msgs = [{"role":"user", "content": content}] res = model.streaming_prefill( session_id=session_id, - msgs=msgs, - tokenizer=tokenizer + msgs=msgs, + processor=processor ) # 3. 
generate res = model.streaming_generate( session_id=session_id, - tokenizer=tokenizer, - temperature=0.5, - generate_audio=generate_audio + processor=processor, + use_tts=use_tts, + tts_output_chunk_size=25 ) audios = [] text = "" -if generate_audio: +if use_tts: for r in res: audio_wav = r.audio_wav sampling_rate = r.sampling_rate txt = r.text audios.append(audio_wav) text += txt - + res = np.concatenate(audios) sf.write("output.wav", res, samplerate=sampling_rate) print("text:", text) @@ -1092,143 +187,99 @@ else: print("text:", text) ``` - -### Speech and Audio Mode - -Model initialization - -```python -import torch -import librosa -from transformers import AutoModel, AutoTokenizer -model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True, - attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager -model = model.eval().cuda() -tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True) -model.init_tts() -model.tts.float() -``` -
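Returning to the streaming interface shown above: the same session can be continued by prefilling more user content under the same `session_id` and calling `streaming_generate` again. The following is a minimal sketch, not part of the official example; it assumes a hypothetical follow-up recording `followup.wav` and reuses the `session_id`, `processor`, and `use_tts` variables defined earlier.

```python
# Prefill an audio-only follow-up turn into the existing streaming session (sketch).
followup_audio, _ = librosa.load('followup.wav', sr=16000, mono=True)  # hypothetical file
model.streaming_prefill(
    session_id=session_id,
    msgs=[{"role": "user", "content": [followup_audio]}],
    processor=processor,
)

# Generate the next response for the same session, as in the loop above.
res = model.streaming_generate(
    session_id=session_id,
    processor=processor,
    use_tts=use_tts,
    tts_output_chunk_size=25,
)
for r in res:
    print(r.text if use_tts else r["text"], end="")
```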
#### Mimick -`Mimick` task reflects a model's end-to-end speech modeling capability. The model takes audio input, and outputs an ASR transcription and subsequently reconstructs the original audio with high similarity. The higher the similarity between the reconstructed audio and the original audio, the stronger the model's foundational capability in end-to-end speech modeling. +`Mimick` task reflects a model's end-to-end speech modeling capability. The model takes audio input, and outputs an ASR transcription and subsequently reconstructs the original audio with high similarity. ```python mimick_prompt = "Please repeat each user's speech, including voice style and speech content." -audio_input, _ = librosa.load('./assets/input_examples/Trump_WEF_2018_10s.mp3', sr=16000, mono=True) # load the audio to be mimicked -# can also try `./assets/input_examples/cxk_original.wav`, -# `./assets/input_examples/fast-pace.wav`, -# `./assets/input_examples/chi-english-1.wav` -# `./assets/input_examples/exciting-emotion.wav` -# for different aspects of speech-centric features. +audio_input, _ = librosa.load('./assets/input_examples/Trump_WEF_2018_10s.mp3', sr=16000, mono=True) msgs = [{'role': 'user', 'content': [mimick_prompt, audio_input]}] -res = model.chat( - msgs=msgs, - tokenizer=tokenizer, +inputs = processor.apply_chat_template(msgs=msgs).to(model.device) + +res = model.generate( + **inputs, + processor=processor, sampling=True, max_new_tokens=128, use_tts_template=True, temperature=0.3, generate_audio=True, - output_audio_path='output_mimick.wav', # save the tts result to output_audio_path + output_audio_path='output_mimick.wav', ) +print(res) ```
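To sanity-check the reconstruction, the file written to `output_audio_path` can be loaded back and compared with the 10-second input clip. A small sketch using the `soundfile` package from the requirements above (the filename matches the call shown here):

```python
import soundfile as sf

# Inspect the reconstructed audio written by the mimick example above.
wav, sr = sf.read('output_mimick.wav')
print(f"reconstructed audio: {len(wav) / sr:.2f} s at {sr} Hz")
```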
-#### General Speech Conversation with Configurable Voices - -A general usage scenario of `MiniCPM-o-2.6` is role-playing a specific character based on the audio prompt. It will mimic the voice of the character to some extent and act like the character in text, including language style. In this mode, `MiniCPM-o-2.6` sounds **more natural and human-like**. Self-defined audio prompts can be used to customize the voice of the character in an end-to-end manner. +#### Speech Conversation with Configurable Voices +`MiniCPM-o-2.6` can role-play specific characters based on audio prompts, mimicking their voice and language style. ```python -ref_audio, _ = librosa.load('./assets/input_examples/icl_20.wav', sr=16000, mono=True) # load the reference audio -sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_roleplay', language='en') -# round one -user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]} +ref_audio, _ = librosa.load('./assets/input_examples/icl_20.wav', sr=16000, mono=True) +sys_prompt = processor.get_sys_prompt(ref_audio=ref_audio, mode='audio_roleplay', language='en') +user_audio, _ = librosa.load('user_question.wav', sr=16000, mono=True) +user_question = {'role': 'user', 'content': [user_audio]} msgs = [sys_prompt, user_question] -res = model.chat( - msgs=msgs, - tokenizer=tokenizer, - sampling=True, - max_new_tokens=128, - use_tts_template=True, - generate_audio=True, - temperature=0.3, - output_audio_path='result_roleplay_round_1.wav', -) -# round two -history = msgs.append({'role': 'assistant', 'content': res}) -user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]} -msgs = history.append(user_question) -res = model.chat( - msgs=msgs, - tokenizer=tokenizer, +inputs = processor.apply_chat_template(msgs=msgs).to(model.device) + +res = model.generate( + **inputs, + processor=processor, sampling=True, max_new_tokens=128, use_tts_template=True, generate_audio=True, temperature=0.3, - output_audio_path='result_roleplay_round_2.wav', + output_audio_path='result_roleplay.wav', ) print(res) ```
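The snippet above shows a single round. A second round can be sketched by appending the assistant reply to the message list before the next user turn; this is an illustrative sketch that assumes `res` holds the assistant's text reply and uses a hypothetical `followup_question.wav` recording.

```python
# Hypothetical second round, reusing the msgs list from the first round (sketch).
msgs.append({'role': 'assistant', 'content': [res]})

followup_audio, _ = librosa.load('followup_question.wav', sr=16000, mono=True)  # hypothetical file
msgs.append({'role': 'user', 'content': [followup_audio]})

inputs = processor.apply_chat_template(msgs=msgs).to(model.device)
res = model.generate(
    **inputs,
    processor=processor,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_roleplay_round_2.wav',
)
print(res)
```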
-#### Speech Conversation as an AI Assistant +#### AI Assistant Mode -An enhanced feature of `MiniCPM-o-2.6` is to act as an AI assistant, but only with limited choice of voices. In this mode, `MiniCPM-o-2.6` is **less human-like and more like a voice assistant**. In this mode, the model is more instruction-following. For demo, you are suggested to use `assistant_female_voice`, `assistant_male_voice`, and `assistant_default_female_voice`. Other voices may work but not as stable as the default voices. - -*Please note that, `assistant_female_voice` and `assistant_male_voice` are more stable but sounds like robots, while `assistant_default_female_voice` is more human-alike but not stable, its voice often changes in multiple turns. We suggest you to try stable voices `assistant_female_voice` and `assistant_male_voice`.* +`MiniCPM-o-2.6` can act as an AI assistant with predefined stable voices. Recommended voices: `assistant_female_voice`, `assistant_male_voice`. ```python -ref_audio, _ = librosa.load('./assets/input_examples/assistant_female_voice.wav', sr=16000, mono=True) # or use `./assets/input_examples/assistant_male_voice.wav` -sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_assistant', language='en') -user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]} # load the user's audio question -# round one +ref_audio, _ = librosa.load('./assets/input_examples/assistant_female_voice.wav', sr=16000, mono=True) +sys_prompt = processor.get_sys_prompt(ref_audio=ref_audio, mode='audio_assistant', language='en') +user_audio, _ = librosa.load('user_question.wav', sr=16000, mono=True) +user_question = {'role': 'user', 'content': [user_audio]} msgs = [sys_prompt, user_question] -res = model.chat( - msgs=msgs, - tokenizer=tokenizer, - sampling=True, - max_new_tokens=128, - use_tts_template=True, - generate_audio=True, - temperature=0.3, - output_audio_path='result_assistant_round_1.wav', -) -# round two -history = msgs.append({'role': 'assistant', 'content': res}) -user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]} -msgs = history.append(user_question) -res = model.chat( - msgs=msgs, - tokenizer=tokenizer, +inputs = processor.apply_chat_template(msgs=msgs).to(model.device) + +res = model.generate( + **inputs, + processor=processor, sampling=True, max_new_tokens=128, use_tts_template=True, generate_audio=True, temperature=0.3, - output_audio_path='result_assistant_round_2.wav', + output_audio_path='result_assistant.wav', ) print(res) ```
-#### Instruction-to-Speech +#### Instruction-to-Speech (Voice Creation) -`MiniCPM-o-2.6` can also do Instruction-to-Speech, aka **Voice Creation**. You can describe a voice in detail, and the model will generate a voice that matches the description. For more Instruction-to-Speech sample instructions, you can refer to https://voxinstruct.github.io/VoxInstruct/. +You can describe a voice in detail, and the model will generate a voice that matches the description. ```python instruction = 'Speak like a male charming superstar, radiating confidence and style in every word.' msgs = [{'role': 'user', 'content': [instruction]}] -res = model.chat( - msgs=msgs, - tokenizer=tokenizer, +inputs = processor.apply_chat_template(msgs=msgs).to(model.device) + +res = model.generate( + **inputs, + processor=processor, sampling=True, max_new_tokens=128, use_tts_template=True, @@ -1236,24 +287,26 @@ res = model.chat( temperature=0.3, output_audio_path='result_voice_creation.wav', ) +print(res) ```
#### Voice Cloning -`MiniCPM-o-2.6` can also do zero-shot text-to-speech, aka **Voice Cloning**. With this mode, model will act like a TTS model. - +Zero-shot text-to-speech functionality using reference audio. ```python -ref_audio, _ = librosa.load('./assets/input_examples/icl_20.wav', sr=16000, mono=True) # load the reference audio -sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='voice_cloning', language='en') -text_prompt = f"Please read the text below." +ref_audio, _ = librosa.load('./assets/input_examples/icl_20.wav', sr=16000, mono=True) +sys_prompt = processor.get_sys_prompt(ref_audio=ref_audio, mode='voice_cloning', language='en') +text_prompt = "Please read the text below." user_question = {'role': 'user', 'content': [text_prompt, "content that you want to read"]} msgs = [sys_prompt, user_question] -res = model.chat( - msgs=msgs, - tokenizer=tokenizer, +inputs = processor.apply_chat_template(msgs=msgs).to(model.device) + +res = model.generate( + **inputs, + processor=processor, sampling=True, max_new_tokens=128, use_tts_template=True, @@ -1261,29 +314,32 @@ res = model.chat( temperature=0.3, output_audio_path='result_voice_cloning.wav', ) +print(res) ```
-#### Addressing Various Audio Understanding Tasks +#### Audio Understanding Tasks -`MiniCPM-o-2.6` can also be used to address various audio understanding tasks, such as ASR, speaker analysis, general audio captioning, and sound scene tagging. +Various audio understanding tasks such as ASR, speaker analysis, audio captioning, and sound scene tagging. -For audio-to-text tasks, you can use the following prompts: +Available prompts: -- ASR with ZH(same as AST en2zh): `่ฏทไป”็ป†ๅฌ่ฟ™ๆฎต้Ÿณ้ข‘็‰‡ๆฎต๏ผŒๅนถๅฐ†ๅ…ถๅ†…ๅฎน้€ๅญ—่ฎฐๅฝ•ใ€‚` -- ASR with EN(same as AST zh2en): `Please listen to the audio snippet carefully and transcribe the content.` +- ASR (Chinese): `่ฏทไป”็ป†ๅฌ่ฟ™ๆฎต้Ÿณ้ข‘็‰‡ๆฎต๏ผŒๅนถๅฐ†ๅ…ถๅ†…ๅฎน้€ๅญ—่ฎฐๅฝ•ใ€‚` +- ASR (English): `Please listen to the audio snippet carefully and transcribe the content.` - Speaker Analysis: `Based on the speaker's content, speculate on their gender, condition, age range, and health status.` -- General Audio Caption: `Summarize the main content of the audio.` -- General Sound Scene Tagging: `Utilize one keyword to convey the audio's content or the associated scene.` +- Audio Caption: `Summarize the main content of the audio.` +- Scene Tagging: `Utilize one keyword to convey the audio's content or the associated scene.` ```python -task_prompt = "Please listen to the audio snippet carefully and transcribe the content." + "\n" # can change to other prompts. -audio_input, _ = librosa.load('./assets/input_examples/audio_understanding.mp3', sr=16000, mono=True) # load the audio to be captioned +task_prompt = "Please listen to the audio snippet carefully and transcribe the content.\n" +audio_input, _ = librosa.load('./assets/input_examples/audio_understanding.mp3', sr=16000, mono=True) msgs = [{'role': 'user', 'content': [task_prompt, audio_input]}] -res = model.chat( - msgs=msgs, - tokenizer=tokenizer, +inputs = processor.apply_chat_template(msgs=msgs).to(model.device) + +res = model.generate( + **inputs, + processor=processor, sampling=True, max_new_tokens=128, use_tts_template=True, @@ -1294,30 +350,33 @@ res = model.chat( print(res) ``` - ### Vision-Only mode `MiniCPM-o-2_6` has the same inference methods as `MiniCPM-V-2_6` #### Chat with single image + ```python -# test.py image = Image.open('xx.jpg').convert('RGB') question = 'What is in the image?' msgs = [{'role': 'user', 'content': [image, question]}] -res = model.chat( - image=None, - msgs=msgs, - tokenizer=tokenizer +inputs = processor.apply_chat_template(msgs=msgs).to(model.device) + +res = model.generate( + **inputs, + processor=processor, + sampling=True, + max_new_tokens=1024, ) print(res) -## if you want to use streaming, please make sure sampling=True and stream=True -## the model.chat will return a generator -res = model.chat( - msgs=msgs, - tokenizer=tokenizer, + +## for streaming generation +res = model.generate( + **inputs, + processor=processor, sampling=True, - stream=True + stream=True, + max_new_tokens=1024, ) generated_text = "" for new_text in res: @@ -1326,28 +385,27 @@ for new_text in res: ``` #### Chat with multiple images -
- Click to show Python code running MiniCPM-o 2.6 with multiple images input. - + ```python image1 = Image.open('image1.jpg').convert('RGB') image2 = Image.open('image2.jpg').convert('RGB') question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.' msgs = [{'role': 'user', 'content': [image1, image2, question]}] -answer = model.chat( - msgs=msgs, - tokenizer=tokenizer +inputs = processor.apply_chat_template(msgs=msgs).to(model.device) + +res = model.generate( + **inputs, + processor=processor, + sampling=True, + max_new_tokens=1024, ) -print(answer) +print(res) ``` -
#### In-context few-shot learning -
- Click to view Python code running MiniCPM-o 2.6 with few-shot input. ```python -question = "production date" +question = "production date" image1 = Image.open('example1.jpg').convert('RGB') answer1 = "2023.08.04" image2 = Image.open('example2.jpg').convert('RGB') @@ -1358,19 +416,23 @@ msgs = [ {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]}, {'role': 'user', 'content': [image_test, question]} ] -answer = model.chat( - msgs=msgs, - tokenizer=tokenizer +inputs = processor.apply_chat_template(msgs=msgs).to(model.device) + +res = model.generate( + **inputs, + processor=processor, + sampling=True, + max_new_tokens=1024, ) -print(answer) +print(res) ``` -
#### Chat with video -
- Click to view Python code running MiniCPM-o 2.6 with video input. ```python +from decord import VideoReader, cpu +import numpy as np + MAX_NUM_FRAMES=64 # if cuda OOM set a smaller number def encode_video(video_path): def uniform_sample(l, n): @@ -1386,52 +448,76 @@ def encode_video(video_path): frames = [Image.fromarray(v.astype('uint8')) for v in frames] print('num frames:', len(frames)) return frames + video_path ="video_test.mp4" frames = encode_video(video_path) question = "Describe the video" -msgs = [ - {'role': 'user', 'content': frames + [question]}, -] +msgs = [{'role': 'user', 'content': frames + [question]}] +inputs = processor.apply_chat_template(msgs=msgs).to(model.device) + # Set decode params for video -params={} -params["use_image_id"] = False -params["max_slice_nums"] = 2 # use 1 if cuda OOM and video resolution > 448*448 -answer = model.chat( - msgs=msgs, - tokenizer=tokenizer, - **params +res = model.generate( + **inputs, + processor=processor, + sampling=True, + max_new_tokens=1024, + use_image_id=False, + max_slice_nums=2, # use 1 if cuda OOM and video resolution > 448*448 ) -print(answer) +print(res) ``` -
Please look at [GitHub](https://github.com/OpenBMB/MiniCPM-o) for more detail about usage. +## Usage Tips -## Inference with llama.cpp -MiniCPM-o 2.6 (vision-only mode) can run with llama.cpp. See our fork of [llama.cpp](https://github.com/OpenBMB/llama.cpp/tree/minicpm-omni) and [readme](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) for more detail. +### Inference with llama.cpp +MiniCPM-o 2.6 (vision-only mode) can run with llama.cpp. See our fork of [llama.cpp](https://github.com/OpenBMB/llama.cpp/tree/minicpm-omni) and [readme](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) for more detail. -## Int4 quantized version -Download the int4 quantized version for lower GPU memory (7GB) usage: [MiniCPM-o-2_6-int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4). +### Int4 quantized version +Download the int4 quantized version for lower GPU memory (7GB) usage: [MiniCPM-o-2_6-int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4). ## License + #### Model License -* The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License. -* The usage of MiniCPM-o and MiniCPM-V series model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md). -* The models and weights of MiniCPM are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, MiniCPM-o 2.6 weights are also available for free commercial use. +- The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License. +- The usage of MiniCPM-o and MiniCPM-V series model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md). +- The models and weights of MiniCPM are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, MiniCPM-o 2.6 weights are also available for free commercial use. #### Statement -* As an LMM, MiniCPM-o 2.6 generates contents by learning a large mount of multimodal corpora, but it cannot comprehend, express personal opinions or make value judgement. Anything generated by MiniCPM-o 2.6 does not represent the views and positions of the model developers -* We will not be liable for any problems arising from the use of the MinCPM-V models, including but not limited to data security issues, risk of public opinion, or any risks and problems arising from the misdirection, misuse, dissemination or misuse of the model. + +- As an LMM, MiniCPM-o 2.6 generates contents by learning a large mount of multimodal corpora, but it cannot comprehend, express personal opinions or make value judgement. Anything generated by MiniCPM-o 2.6 does not represent the views and positions of the model developers +- We will not be liable for any problems arising from the use of the MinCPM-V models, including but not limited to data security issues, risk of public opinion, or any risks and problems arising from the misdirection, misuse, dissemination or misuse of the model. 
+ +## Key Techniques and Other Multimodal Projects + +๐Ÿ‘ Welcome to explore key techniques of MiniCPM-o 2.6 and other multimodal projects of our team: + +[VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V) + +## Citation + +If you find our work helpful, please consider citing our papers ๐Ÿ“ and liking this project โค๏ธ๏ผ + +```bib +@article{yao2024minicpm, + title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone}, + author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others}, + journal={arXiv preprint arXiv:2408.01800}, + year={2024} +} +``` + +- We will not be liable for any problems arising from the use of the MinCPM-V models, including but not limited to data security issues, risk of public opinion, or any risks and problems arising from the misdirection, misuse, dissemination or misuse of the model. ## Key Techniques and Other Multimodal Projects ๐Ÿ‘ Welcome to explore key techniques of MiniCPM-o 2.6 and other multimodal projects of our team: -[VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V) +[VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V) ## Citation @@ -1444,4 +530,4 @@ If you find our work helpful, please consider citing our papers ๐Ÿ“ and liking journal={arXiv preprint arXiv:2408.01800}, year={2024} } -``` \ No newline at end of file +``` diff --git a/src/transformers/models/auto/modeling_auto.py b/src/transformers/models/auto/modeling_auto.py index e495e7193220..5626b0ea3106 100644 --- a/src/transformers/models/auto/modeling_auto.py +++ b/src/transformers/models/auto/modeling_auto.py @@ -248,7 +248,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin): ("metaclip_2", "MetaClip2Model"), ("mgp-str", "MgpstrForSceneTextRecognition"), ("mimi", "MimiModel"), - ("minicpm_o_2_6", "MiniCPM_o_2_6Model"), + ("minicpm_o_2_6", "MiniCPM_o_2_6ForConditionalGeneration"), ("minimax", "MiniMaxModel"), ("mistral", "MistralModel"), ("mistral3", "Mistral3Model"), diff --git a/src/transformers/models/auto/tokenization_auto.py b/src/transformers/models/auto/tokenization_auto.py index 7b0d7433f403..f6bf74765e85 100644 --- a/src/transformers/models/auto/tokenization_auto.py +++ b/src/transformers/models/auto/tokenization_auto.py @@ -415,7 +415,7 @@ ("mgp-str", ("MgpstrTokenizer", None)), ( "minicpm_o_2_6", - ("MiniCPM_o_2_6Tokenizer", "MiniCPM_o_2_6TokenizerFast" if is_tokenizers_available() else None), + ("Qwen2Tokenizer", "MiniCPM_o_2_6TokenizerFast" if is_tokenizers_available() else None), ), ( "minimax", diff --git a/src/transformers/models/minicpm_o_2_6/__init__.py b/src/transformers/models/minicpm_o_2_6/__init__.py index d7c289dfc944..1f4fbd5164d3 100644 --- a/src/transformers/models/minicpm_o_2_6/__init__.py +++ b/src/transformers/models/minicpm_o_2_6/__init__.py @@ -21,7 +21,7 @@ if TYPE_CHECKING: from .configuration_minicpm_o_2_6 import * - from .image_processing_minicpm import * + from .image_processing_minicpm_fast import * from .modeling_minicpm_o_2_6 import * from .processing_minicpm_o_2_6 import * from 
.tokenization_minicpm_o_2_6_fast import * diff --git a/src/transformers/models/minicpm_o_2_6/configuration_minicpm_o_2_6.py b/src/transformers/models/minicpm_o_2_6/configuration_minicpm_o_2_6.py index 33d494c665bf..d50f3f23cf90 100644 --- a/src/transformers/models/minicpm_o_2_6/configuration_minicpm_o_2_6.py +++ b/src/transformers/models/minicpm_o_2_6/configuration_minicpm_o_2_6.py @@ -1,4 +1,9 @@ -# coding=utf-8 +# ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ +# This file was automatically generated from src/transformers/models/minicpm_o_2_6/modular_minicpm_o_2_6.py. +# Do NOT edit this file manually as any edits will be overwritten by the generation of +# the file from the modular. If any change should be done, please apply the change to the +# modular_minicpm_o_2_6.py file directly. One of our CI enforces this. +# ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ๐Ÿšจ # Copyright 2025 The OpenBMB Team. All rights reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); @@ -13,263 +18,13 @@ # See the License for the specific language governing permissions and # limitations under the License. -import os -from typing import Union from ...configuration_utils import PretrainedConfig, layer_type_validation from ...modeling_rope_utils import rope_config_validation -from transformers.models.siglip.configuration_siglip import SiglipVisionConfig -from transformers import Qwen2Config, WhisperConfig from ...utils import logging -logger = logging.get_logger(__name__) - - -class MiniCPMVSliceConfig(PretrainedConfig): - model_type = "minicpmv" - - def __init__( - self, - patch_size=14, - max_slice_nums=9, - scale_resolution=448, - **kwargs, - ): - super().__init__(**kwargs) - self.patch_size = patch_size - self.max_slice_nums = max_slice_nums - self.scale_resolution = scale_resolution - - @classmethod - def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": - config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) - - if config_dict.get("model_type") == "minicpmv": - config_dict = config_dict["slice_config"] - - if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type: - logger.warning( - f"You are using a model of type {config_dict['model_type']} to instantiate a model of type " - f"{cls.model_type}. This is not supported for all configurations of models and can yield errors." 
- ) - - return cls.from_dict(config_dict, **kwargs) - - -class MiniCPMConditionalTTSConfig(PretrainedConfig): - model_type = "conditional_chattts" - - def __init__( - self, - llm_dim: int = 2560, - hidden_size: int = 768, - intermediate_size: int = 3072, - num_attention_heads: int = 12, - num_hidden_layers: int = 20, - max_position_embeddings: int = 4096, - num_audio_tokens: int = 626, - num_text_tokens: int = 21178, - num_mel_bins: int = 100, - num_vq: int = 4, - use_speaker_embedding: bool = True, - use_llm_hidden_state: bool = False, - spk_emb_token_id: int = 21143, - num_spk_embs: int = 1, - audio_bos_token_id: int = 21132, - text_eos_token_id: int = 21133, - use_text: bool = True, - streaming: bool = True, - streaming_text_chunk_size: int = 10, - streaming_text_reserved_len: int = 300, - streaming_audio_chunk_size: int = 50, - attn_implementation: str = "sdpa", - use_mlp: bool = True, - aug_loss_weight: bool = True, - **kwargs, - ): - super().__init__(**kwargs) - - self.llm_dim = llm_dim - self.hidden_size = hidden_size - self.intermediate_size = intermediate_size - self.num_attention_heads = num_attention_heads - self.num_hidden_layers = num_hidden_layers - self.max_position_embeddings = max_position_embeddings - self.num_audio_tokens = num_audio_tokens - self.num_text_tokens = num_text_tokens - self.num_mel_bins = num_mel_bins - self.num_vq = num_vq - self.use_speaker_embedding = use_speaker_embedding - self.use_llm_hidden_state = use_llm_hidden_state - self.spk_emb_token_id = spk_emb_token_id - self.num_spk_embs = num_spk_embs - self.audio_bos_token_id = audio_bos_token_id - self.text_eos_token_id = text_eos_token_id - self.use_text = use_text - self.streaming = streaming - self.streaming_text_chunk_size = streaming_text_chunk_size - self.streaming_text_reserved_len = streaming_text_reserved_len - self.streaming_audio_chunk_size = streaming_audio_chunk_size - self.attn_implementation = attn_implementation - self.use_mlp = use_mlp - self.aug_loss_weight = aug_loss_weight - - -class MiniCPM_o_2_6Config(PretrainedConfig): - model_type = "minicpmo" - keys_to_ignore_at_inference = ["past_key_values"] - - default_vision_config = { - "hidden_size": 1152, - "image_size": 980, - "intermediate_size": 4304, - "model_type": "siglip", - "num_attention_heads": 16, - "num_hidden_layers": 27, - "patch_size": 14, - } - - base_model_tp_plan = { - "layers.*.self_attn.q_proj": "colwise", - "layers.*.self_attn.k_proj": "colwise", - "layers.*.self_attn.v_proj": "colwise", - "layers.*.self_attn.o_proj": "rowwise", - "layers.*.mlp.gate_proj": "colwise", - "layers.*.mlp.up_proj": "colwise", - "layers.*.mlp.down_proj": "rowwise", - } - base_model_pp_plan = { - "embed_tokens": (["input_ids"], ["inputs_embeds"]), - "layers": (["hidden_states", "attention_mask"], ["hidden_states"]), - "norm": (["hidden_states"], ["hidden_states"]), - } - - def __init__( - self, - use_cache=True, - query_num=64, - image_size=448, - drop_vision_last_layer=True, - batch_vision_input=True, - slice_config=None, - vision_config=None, - audio_config=None, - tts_config=None, - use_image_id=True, - vision_batch_size=16, - audio_pool_step=2, - audio_chunk_length=1.0, - stream_input=False, - init_vision=True, - init_audio=True, - init_tts=True, - vocab_size=151936, - hidden_size=4096, - intermediate_size=22016, - num_hidden_layers=32, - num_attention_heads=32, - num_key_value_heads=32, - hidden_act="silu", - max_position_embeddings=32768, - initializer_range=0.02, - rms_norm_eps=1e-6, - tie_word_embeddings=False, - rope_theta=10000.0, - 
rope_scaling=None, - use_sliding_window=False, - sliding_window=4096, - max_window_layers=28, - layer_types=None, - attention_dropout=0.0, - **kwargs, - ): - self.use_cache = use_cache - self.query_num = query_num - self.image_size = image_size - self.drop_vision_last_layer = drop_vision_last_layer - self.batch_vision_input = batch_vision_input - self.use_image_id = use_image_id - self.vision_batch_size = vision_batch_size - self.audio_pool_step = audio_pool_step - self.audio_chunk_length = audio_chunk_length - self.stream_input = stream_input - self.init_vision = init_vision - self.init_audio = init_audio - self.init_tts = init_tts - - if slice_config is None: - self.slice_config = MiniCPMVSliceConfig(max_slice_nums=1) - else: - self.slice_config = MiniCPMVSliceConfig(**slice_config) - self.slice_mode = True - - # same as HuggingFaceM4/siglip-so400m-14-980-flash-attn2-navit add tgt_sizes - if vision_config is None: - self.vision_config = SiglipVisionConfig(**self.default_vision_config) - logger.info("vision_config is None, using default vision config") - elif isinstance(vision_config, dict): - self.vision_config = SiglipVisionConfig(**vision_config) - elif isinstance(vision_config, SiglipVisionConfig): - self.vision_config = vision_config - - # same as openai/whisper-medium add use_cache - if audio_config is None: - self.audio_config = WhisperConfig() - elif isinstance(audio_config, dict): - self.audio_config = WhisperConfig(**audio_config) - elif isinstance(audio_config, WhisperConfig): - self.audio_config = audio_config - - if tts_config is None: - self.tts_config = MiniCPMConditionalTTSConfig() - elif isinstance(tts_config, dict): - self.tts_config = MiniCPMConditionalTTSConfig(**tts_config) - elif isinstance(tts_config, MiniCPMConditionalTTSConfig): - self.tts_config = tts_config - - self.patch_size = self.vision_config.patch_size - - self.vocab_size = vocab_size - self.max_position_embeddings = max_position_embeddings - self.hidden_size = hidden_size - self.intermediate_size = intermediate_size - self.num_hidden_layers = num_hidden_layers - self.num_attention_heads = num_attention_heads - self.use_sliding_window = use_sliding_window - self.sliding_window = sliding_window if self.use_sliding_window else None - self.max_window_layers = max_window_layers - - # for backward compatibility - if num_key_value_heads is None: - num_key_value_heads = num_attention_heads - - self.num_key_value_heads = num_key_value_heads - self.hidden_act = hidden_act - self.initializer_range = initializer_range - self.rms_norm_eps = rms_norm_eps - self.rope_theta = rope_theta - self.rope_scaling = rope_scaling - self.attention_dropout = attention_dropout - # Validate the correctness of rotary position embeddings parameters - # BC: if there is a 'type' field, move it to 'rope_type'. 
- if self.rope_scaling is not None and "type" in self.rope_scaling: - self.rope_scaling["rope_type"] = self.rope_scaling["type"] - rope_config_validation(self) - - self.layer_types = layer_types - if self.layer_types is None: - self.layer_types = [ - "sliding_attention" - if self.sliding_window is not None and i >= self.max_window_layers - else "full_attention" - for i in range(self.num_hidden_layers) - ] - layer_type_validation(self.layer_types) - super().__init__( - tie_word_embeddings=tie_word_embeddings, - **kwargs, - ) +logger = logging.get_logger(__name__) class MiniCPMConditionalTTSTextConfig(PretrainedConfig): @@ -471,4 +226,646 @@ def __init__( ) +class MiniCPMConditionalTTSConfig(PretrainedConfig): + model_type = "conditional_chattts" + + # sub_configs = { + # "text_config": MiniCPMConditionalTTSTextConfig, + # } + + def __init__( + self, + llm_dim: int = 2560, + hidden_size: int = 768, + intermediate_size: int = 3072, + num_attention_heads: int = 12, + num_hidden_layers: int = 20, + max_position_embeddings: int = 4096, + num_audio_tokens: int = 626, + num_text_tokens: int = 21178, + num_mel_bins: int = 100, + num_vq: int = 4, + use_speaker_embedding: bool = True, + use_llm_hidden_state: bool = False, + spk_emb_token_id: int = 21143, + num_spk_embs: int = 1, + audio_bos_token_id: int = 21132, + text_eos_token_id: int = 21133, + use_text: bool = True, + streaming: bool = True, + streaming_text_chunk_size: int = 10, + streaming_text_reserved_len: int = 300, + streaming_audio_chunk_size: int = 50, + attn_implementation: str = "sdpa", + use_mlp: bool = True, + aug_loss_weight: bool = True, + **kwargs, + ): + super().__init__(**kwargs) + + self.llm_dim = llm_dim + self.hidden_size = hidden_size + self.intermediate_size = intermediate_size + self.num_attention_heads = num_attention_heads + self.num_hidden_layers = num_hidden_layers + self.max_position_embeddings = max_position_embeddings + self.num_audio_tokens = num_audio_tokens + self.num_text_tokens = num_text_tokens + self.num_mel_bins = num_mel_bins + self.num_vq = num_vq + self.use_speaker_embedding = use_speaker_embedding + self.use_llm_hidden_state = use_llm_hidden_state + self.spk_emb_token_id = spk_emb_token_id + self.num_spk_embs = num_spk_embs + self.audio_bos_token_id = audio_bos_token_id + self.text_eos_token_id = text_eos_token_id + self.use_text = use_text + self.streaming = streaming + self.streaming_text_chunk_size = streaming_text_chunk_size + self.streaming_text_reserved_len = streaming_text_reserved_len + self.streaming_audio_chunk_size = streaming_audio_chunk_size + self.attn_implementation = attn_implementation + self.use_mlp = use_mlp + self.aug_loss_weight = aug_loss_weight + + self.tts_text_config = MiniCPMConditionalTTSTextConfig( + hidden_size=self.hidden_size, + intermediate_size=self.intermediate_size, + num_attention_heads=self.num_attention_heads, + num_hidden_layers=self.num_hidden_layers, + max_position_embeddings=self.max_position_embeddings, + attn_implementation=self.attn_implementation, + ) + + +class MiniCPM_o_2_6TextConfig(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a [`MiniCPMO26TextModel`]. It is used to instantiate a + MiniCPMO26Text model according to the specified arguments, defining the model architecture. Instantiating a configuration + with the defaults will yield a similar configuration to that of + MiniCPMO26Text-7B-beta [Qwen/MiniCPMO26Text-7B-beta](https://huggingface.co/Qwen/MiniCPMO26Text-7B-beta). 
+ + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information. + + + Args: + vocab_size (`int`, *optional*, defaults to 151936): + Vocabulary size of the MiniCPMO26Text model. Defines the number of different tokens that can be represented by the + `inputs_ids` passed when calling [`MiniCPMO26TextModel`] + hidden_size (`int`, *optional*, defaults to 4096): + Dimension of the hidden representations. + intermediate_size (`int`, *optional*, defaults to 22016): + Dimension of the MLP representations. + num_hidden_layers (`int`, *optional*, defaults to 32): + Number of hidden layers in the Transformer encoder. + num_attention_heads (`int`, *optional*, defaults to 32): + Number of attention heads for each attention layer in the Transformer encoder. + num_key_value_heads (`int`, *optional*, defaults to 32): + This is the number of key_value heads that should be used to implement Grouped Query Attention. If + `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if + `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When + converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed + by meanpooling all the original heads within that group. For more details, check out [this + paper](https://huggingface.co/papers/2305.13245). If it is not specified, will default to `32`. + hidden_act (`str` or `function`, *optional*, defaults to `"silu"`): + The non-linear activation function (function or string) in the decoder. + max_position_embeddings (`int`, *optional*, defaults to 32768): + The maximum sequence length that this model might ever be used with. + initializer_range (`float`, *optional*, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + rms_norm_eps (`float`, *optional*, defaults to 1e-06): + The epsilon used by the rms normalization layers. + use_cache (`bool`, *optional*, defaults to `True`): + Whether or not the model should return the last key/values attentions (not used by all models). Only + relevant if `config.is_decoder=True`. + tie_word_embeddings (`bool`, *optional*, defaults to `False`): + Whether the model's input and output word embeddings should be tied. + rope_theta (`float`, *optional*, defaults to 10000.0): + The base period of the RoPE embeddings. + rope_scaling (`Dict`, *optional*): + Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply new rope type + and you expect the model to work on longer `max_position_embeddings`, we recommend you to update this value + accordingly. + Expected contents: + `rope_type` (`str`): + The sub-variant of RoPE to use. Can be one of ['default', 'linear', 'dynamic', 'yarn', 'longrope', + 'llama3'], with 'default' being the original RoPE implementation. + `factor` (`float`, *optional*): + Used with all rope types except 'default'. The scaling factor to apply to the RoPE embeddings. In + most scaling types, a `factor` of x will enable the model to handle sequences of length x * + original maximum pre-trained length. + `original_max_position_embeddings` (`int`, *optional*): + Used with 'dynamic', 'longrope' and 'llama3'. The original max position embeddings used during + pretraining. + `attention_factor` (`float`, *optional*): + Used with 'yarn' and 'longrope'. 
The scaling factor to be applied on the attention + computation. If unspecified, it defaults to value recommended by the implementation, using the + `factor` field to infer the suggested value. + `beta_fast` (`float`, *optional*): + Only used with 'yarn'. Parameter to set the boundary for extrapolation (only) in the linear + ramp function. If unspecified, it defaults to 32. + `beta_slow` (`float`, *optional*): + Only used with 'yarn'. Parameter to set the boundary for interpolation (only) in the linear + ramp function. If unspecified, it defaults to 1. + `short_factor` (`list[float]`, *optional*): + Only used with 'longrope'. The scaling factor to be applied to short contexts (< + `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden + size divided by the number of attention heads divided by 2 + `long_factor` (`list[float]`, *optional*): + Only used with 'longrope'. The scaling factor to be applied to long contexts (< + `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden + size divided by the number of attention heads divided by 2 + `low_freq_factor` (`float`, *optional*): + Only used with 'llama3'. Scaling factor applied to low frequency components of the RoPE + `high_freq_factor` (`float`, *optional*): + Only used with 'llama3'. Scaling factor applied to high frequency components of the RoPE + use_sliding_window (`bool`, *optional*, defaults to `False`): + Whether to use sliding window attention. + sliding_window (`int`, *optional*, defaults to 4096): + Sliding window attention (SWA) window size. If not specified, will default to `4096`. + max_window_layers (`int`, *optional*, defaults to 28): + The number of layers using full attention. The first `max_window_layers` layers will use full attention, while any + additional layer afterwards will use SWA (Sliding Window Attention). + layer_types (`list`, *optional*): + Attention pattern for each layer. + attention_dropout (`float`, *optional*, defaults to 0.0): + The dropout ratio for the attention probabilities. 
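As a concrete illustration of the `rope_scaling` dictionary documented in the arguments above, here is a hedged sketch of passing a YaRN-style scaling configuration; the import path and the specific factor values are assumptions chosen for demonstration, not recommended settings:

```python
from transformers.models.minicpm_o_2_6.configuration_minicpm_o_2_6 import MiniCPM_o_2_6TextConfig

# Hypothetical values: extend a 32k pre-trained context by a factor of 2 with YaRN.
config = MiniCPM_o_2_6TextConfig(
    max_position_embeddings=65536,
    rope_scaling={
        "rope_type": "yarn",
        "factor": 2.0,
        "original_max_position_embeddings": 32768,
    },
)
print(config.rope_scaling["rope_type"])  # "yarn"
```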
+ + ```python + >>> from transformers import MiniCPMO26TextModel, MiniCPMO26TextConfig + + >>> # Initializing a MiniCPMO26Text style configuration + >>> configuration = MiniCPMO26TextConfig() + + >>> # Initializing a model from the MiniCPMO26Text-7B style configuration + >>> model = MiniCPMO26TextModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ```""" + + model_type = "minicpmo" + keys_to_ignore_at_inference = ["past_key_values"] + + # Default tensor parallel plan for base model `MiniCPMO26Text` + base_model_tp_plan = { + "layers.*.self_attn.q_proj": "colwise", + "layers.*.self_attn.k_proj": "colwise", + "layers.*.self_attn.v_proj": "colwise", + "layers.*.self_attn.o_proj": "rowwise", + "layers.*.mlp.gate_proj": "colwise", + "layers.*.mlp.up_proj": "colwise", + "layers.*.mlp.down_proj": "rowwise", + } + base_model_pp_plan = { + "embed_tokens": (["input_ids"], ["inputs_embeds"]), + "layers": (["hidden_states", "attention_mask"], ["hidden_states"]), + "norm": (["hidden_states"], ["hidden_states"]), + } + + def __init__( + self, + vocab_size=151936, + hidden_size=4096, + intermediate_size=22016, + num_hidden_layers=32, + num_attention_heads=32, + num_key_value_heads=32, + hidden_act="silu", + max_position_embeddings=32768, + initializer_range=0.02, + rms_norm_eps=1e-6, + use_cache=True, + tie_word_embeddings=False, + rope_theta=10000.0, + rope_scaling=None, + use_sliding_window=False, + sliding_window=4096, + max_window_layers=28, + layer_types=None, + attention_dropout=0.0, + **kwargs, + ): + self.vocab_size = vocab_size + self.max_position_embeddings = max_position_embeddings + self.hidden_size = hidden_size + self.intermediate_size = intermediate_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.use_sliding_window = use_sliding_window + self.sliding_window = sliding_window if self.use_sliding_window else None + self.max_window_layers = max_window_layers + + # for backward compatibility + if num_key_value_heads is None: + num_key_value_heads = num_attention_heads + + self.num_key_value_heads = num_key_value_heads + self.hidden_act = hidden_act + self.initializer_range = initializer_range + self.rms_norm_eps = rms_norm_eps + self.use_cache = use_cache + self.rope_theta = rope_theta + self.rope_scaling = rope_scaling + self.attention_dropout = attention_dropout + # Validate the correctness of rotary position embeddings parameters + # BC: if there is a 'type' field, move it to 'rope_type'. + if self.rope_scaling is not None and "type" in self.rope_scaling: + self.rope_scaling["rope_type"] = self.rope_scaling["type"] + rope_config_validation(self) + + self.layer_types = layer_types + if self.layer_types is None: + self.layer_types = [ + "sliding_attention" + if self.sliding_window is not None and i >= self.max_window_layers + else "full_attention" + for i in range(self.num_hidden_layers) + ] + layer_type_validation(self.layer_types) + + super().__init__( + tie_word_embeddings=tie_word_embeddings, + **kwargs, + ) + + +class MiniCPMVisionConfig(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a [`MiniCPMVisionModel`]. It is used to instantiate a + MiniCPM vision encoder according to the specified arguments, defining the model architecture. 
Instantiating a + configuration with the defaults will yield a similar configuration to that of the vision encoder of the MiniCPM + [google/mini_c_p_m-base-patch16-224](https://huggingface.co/google/mini_c_p_m-base-patch16-224) architecture. + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information. + + Args: + hidden_size (`int`, *optional*, defaults to 768): + Dimensionality of the encoder layers and the pooler layer. + intermediate_size (`int`, *optional*, defaults to 3072): + Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. + num_hidden_layers (`int`, *optional*, defaults to 12): + Number of hidden layers in the Transformer encoder. + num_attention_heads (`int`, *optional*, defaults to 12): + Number of attention heads for each attention layer in the Transformer encoder. + num_channels (`int`, *optional*, defaults to 3): + Number of channels in the input images. + image_size (`int`, *optional*, defaults to 224): + The size (resolution) of each image. + patch_size (`int`, *optional*, defaults to 16): + The size (resolution) of each patch. + hidden_act (`str` or `function`, *optional*, defaults to `"gelu_pytorch_tanh"`): + The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, + `"relu"`, `"selu"` and `"gelu_new"` `"quick_gelu"` are supported. + layer_norm_eps (`float`, *optional*, defaults to 1e-06): + The epsilon used by the layer normalization layers. + attention_dropout (`float`, *optional*, defaults to 0.0): + The dropout ratio for the attention probabilities. + + Example: + + ```python + >>> from transformers import MiniCPMVisionConfig, MiniCPMVisionModel + + >>> # Initializing a MiniCPMVisionConfig with google/mini_c_p_m-base-patch16-224 style configuration + >>> configuration = MiniCPMVisionConfig() + + >>> # Initializing a MiniCPMVisionModel (with random weights) from the google/mini_c_p_m-base-patch16-224 style configuration + >>> model = MiniCPMVisionModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ```""" + + model_type = "mini_c_p_m_vision_model" + base_config_key = "vision_config" + + def __init__( + self, + hidden_size=768, + intermediate_size=3072, + num_hidden_layers=12, + num_attention_heads=12, + num_channels=3, + image_size=224, + patch_size=16, + hidden_act="gelu_pytorch_tanh", + layer_norm_eps=1e-6, + attention_dropout=0.0, + **kwargs, + ): + super().__init__(**kwargs) + + self.hidden_size = hidden_size + self.intermediate_size = intermediate_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.num_channels = num_channels + self.patch_size = patch_size + self.image_size = image_size + self.attention_dropout = attention_dropout + self.layer_norm_eps = layer_norm_eps + self.hidden_act = hidden_act + + +# fmt: on + + +class MiniCPMWhisperConfig(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a [`MiniCPMWhisperModel`]. It is used to instantiate a + MiniCPMWhisper model according to the specified arguments, defining the model architecture. Instantiating a configuration + with the defaults will yield a similar configuration to that of the MiniCPMWhisper + [openai/mini_c_p_m_whisper-tiny](https://huggingface.co/openai/mini_c_p_m_whisper-tiny) architecture. 
+ + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information. + + + Args: + vocab_size (`int`, *optional*, defaults to 51865): + Vocabulary size of the MiniCPMWhisper model. Defines the number of different tokens that can be represented by the + `decoder_input_ids` passed when calling [`MiniCPMWhisperModel`] + num_mel_bins (`int`, *optional*, defaults to 80): + Number of mel features used per input features. Should correspond to the value used in the + `MiniCPMWhisperProcessor` class. + encoder_layers (`int`, *optional*, defaults to 4): + Number of encoder layers. + decoder_layers (`int`, *optional*, defaults to 4): + Number of decoder layers. + encoder_attention_heads (`int`, *optional*, defaults to 6): + Number of attention heads for each attention layer in the Transformer encoder. + decoder_attention_heads (`int`, *optional*, defaults to 6): + Number of attention heads for each attention layer in the Transformer decoder. + encoder_ffn_dim (`int`, *optional*, defaults to 1536): + Dimensionality of the "intermediate" (often named feed-forward) layer in encoder. + decoder_ffn_dim (`int`, *optional*, defaults to 1536): + Dimensionality of the "intermediate" (often named feed-forward) layer in decoder. + encoder_layerdrop (`float`, *optional*, defaults to 0.0): + The LayerDrop probability for the encoder. See the [LayerDrop paper](see https://huggingface.co/papers/1909.11556) + for more details. + decoder_layerdrop (`float`, *optional*, defaults to 0.0): + The LayerDrop probability for the decoder. See the [LayerDrop paper](see https://huggingface.co/papers/1909.11556) + for more details. + decoder_start_token_id (`int`, *optional*, defaults to 50257): + Corresponds to the "<|startoftranscript|>" token, which is automatically used when no `decoder_input_ids` + are provided to the `generate` function. It is used to guide the model`s generation process depending on + the task. + use_cache (`bool`, *optional*, defaults to `True`): + Whether or not the model should return the last key/values attentions (not used by all models). + is_encoder_decoder (`bool`, *optional*, defaults to `True`): + Whether the model is used as an encoder/decoder or not. + activation_function (`str`, *optional*, defaults to `"gelu"`): + The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, + `"relu"`, `"silu"` and `"gelu_new"` are supported. + d_model (`int`, *optional*, defaults to 384): + Dimensionality of the layers. + dropout (`float`, *optional*, defaults to 0.1): + The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. + attention_dropout (`float`, *optional*, defaults to 0.0): + The dropout ratio for the attention probabilities. + activation_dropout (`float`, *optional*, defaults to 0.0): + The dropout ratio for activations inside the fully connected layer. + init_std (`float`, *optional*, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + scale_embedding (`bool`, *optional*, defaults to False): + Scale embeddings by diving by sqrt(d_model). + max_source_positions (`int`, *optional*, defaults to 1500): + The maximum sequence length of log-mel filter-bank features that this model might ever be used with. + max_target_positions (`int`, *optional*, defaults to 448): + The maximum sequence length that this model might ever be used with. 
Typically set this to something large + just in case (e.g., 512 or 1024 or 2048). + pad_token_id (`int`, *optional*, defaults to 50256): + Padding token id. + bos_token_id (`int`, *optional*, defaults to 50256): + Begin of stream token id. + eos_token_id (`int`, *optional*, defaults to 50256): + End of stream token id. + suppress_tokens (`list[int]`, *optional*): + A list containing the non-speech tokens that will be used by the logit processor in the `generate` + function. NON_SPEECH_TOKENS and NON_SPEECH_TOKENS_MULTI each correspond to the `english-only` and the + `multilingual` model. + begin_suppress_tokens (`list[int]`, *optional*, defaults to `[220,50256]`): + A list containing tokens that will be suppressed at the beginning of the sampling process. Initialized as + the token for `" "` (`blank_token_id`) and the `eos_token_id` + use_weighted_layer_sum (`bool`, *optional*, defaults to `False`): + Whether to use a weighted average of layer outputs with learned weights. Only relevant when using an + instance of [`MiniCPMWhisperForAudioClassification`]. + classifier_proj_size (`int`, *optional*, defaults to 256): + Dimensionality of the projection before token mean-pooling for classification. Only relevant when using an + instance of [`MiniCPMWhisperForAudioClassification`]. + apply_spec_augment (`bool`, *optional*, defaults to `False`): + Whether to apply *SpecAugment* data augmentation to the outputs of the feature encoder. For reference see + [SpecAugment: A Simple Data Augmentation Method for Automatic Speech + Recognition](https://huggingface.co/papers/1904.08779). + mask_time_prob (`float`, *optional*, defaults to 0.05): + Percentage (between 0 and 1) of all feature vectors along the time axis which will be masked. The masking + procedure generates `mask_time_prob*len(time_axis)/mask_time_length` independent masks over the axis. If + reasoning from the probability of each feature vector to be chosen as the start of the vector span to be + masked, *mask_time_prob* should be `prob_vector_start*mask_time_length`. Note that overlap may decrease the + actual percentage of masked vectors. This is only relevant if `apply_spec_augment == True`. + mask_time_length (`int`, *optional*, defaults to 10): + Length of vector span along the time axis. + mask_time_min_masks (`int`, *optional*, defaults to 2),: + The minimum number of masks of length `mask_feature_length` generated along the time axis, each time step, + irrespectively of `mask_feature_prob`. Only relevant if ''mask_time_prob*len(time_axis)/mask_time_length < + mask_time_min_masks'' + mask_feature_prob (`float`, *optional*, defaults to 0.0): + Percentage (between 0 and 1) of all feature vectors along the feature axis which will be masked. The + masking procedure generates `mask_feature_prob*len(feature_axis)/mask_time_length` independent masks over + the axis. If reasoning from the probability of each feature vector to be chosen as the start of the vector + span to be masked, *mask_feature_prob* should be `prob_vector_start*mask_feature_length`. Note that overlap + may decrease the actual percentage of masked vectors. This is only relevant if `apply_spec_augment is + True`. + mask_feature_length (`int`, *optional*, defaults to 10): + Length of vector span along the feature axis. + mask_feature_min_masks (`int`, *optional*, defaults to 0),: + The minimum number of masks of length `mask_feature_length` generated along the feature axis, each time + step, irrespectively of `mask_feature_prob`. 
Only relevant if + `mask_feature_prob*len(feature_axis)/mask_feature_length < mask_feature_min_masks`. + median_filter_width (`int`, *optional*, defaults to 7): + Width of the median filter used to smoothen to cross-attention outputs when computing token timestamps. + Should be an odd number. + + Example: + + ```python + >>> from transformers import MiniCPMWhisperConfig, MiniCPMWhisperModel + + >>> # Initializing a MiniCPMWhisper tiny style configuration + >>> configuration = MiniCPMWhisperConfig() + + >>> # Initializing a model (with random weights) from the tiny style configuration + >>> model = MiniCPMWhisperModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ```""" + + model_type = "mini_c_p_m_whisper" + keys_to_ignore_at_inference = ["past_key_values"] + attribute_map = { + "num_key_value_heads": "encoder_attention_heads", + "num_attention_heads": "encoder_attention_heads", + "hidden_size": "d_model", + } + + def __init__( + self, + vocab_size=51865, + num_mel_bins=80, + encoder_layers=4, + encoder_attention_heads=6, + decoder_layers=4, + decoder_attention_heads=6, + decoder_ffn_dim=1536, + encoder_ffn_dim=1536, + encoder_layerdrop=0.0, + decoder_layerdrop=0.0, + decoder_start_token_id=50257, + use_cache=True, + is_encoder_decoder=True, + activation_function="gelu", + d_model=384, + dropout=0.0, + attention_dropout=0.0, + activation_dropout=0.0, + init_std=0.02, + scale_embedding=False, + max_source_positions=1500, + max_target_positions=448, + pad_token_id=50256, + bos_token_id=50256, + eos_token_id=50256, + suppress_tokens=None, + begin_suppress_tokens=[220, 50256], + use_weighted_layer_sum=False, + classifier_proj_size=256, + apply_spec_augment=False, + mask_time_prob=0.05, + mask_time_length=10, + mask_time_min_masks=2, + mask_feature_prob=0.0, + mask_feature_length=10, + mask_feature_min_masks=0, + median_filter_width=7, + **kwargs, + ): + self.vocab_size = vocab_size + self.num_mel_bins = num_mel_bins + self.d_model = d_model + self.encoder_layers = encoder_layers + self.encoder_attention_heads = encoder_attention_heads + self.decoder_layers = decoder_layers + self.decoder_attention_heads = decoder_attention_heads + self.decoder_ffn_dim = decoder_ffn_dim + self.encoder_ffn_dim = encoder_ffn_dim + self.dropout = dropout + self.attention_dropout = attention_dropout + self.activation_dropout = activation_dropout + self.activation_function = activation_function + self.init_std = init_std + self.encoder_layerdrop = encoder_layerdrop + self.decoder_layerdrop = decoder_layerdrop + self.use_cache = use_cache + self.num_hidden_layers = encoder_layers + self.scale_embedding = scale_embedding # scale factor will be sqrt(d_model) if True + self.max_source_positions = max_source_positions + self.max_target_positions = max_target_positions + + # Audio Classification-specific parameters. Feel free to ignore for other classes. 
+ self.classifier_proj_size = classifier_proj_size + self.use_weighted_layer_sum = use_weighted_layer_sum + + # fine-tuning config parameters for SpecAugment: https://huggingface.co/papers/1904.08779 + self.apply_spec_augment = apply_spec_augment + self.mask_time_prob = mask_time_prob + self.mask_time_length = mask_time_length + self.mask_time_min_masks = mask_time_min_masks + self.mask_feature_prob = mask_feature_prob + self.mask_feature_length = mask_feature_length + self.mask_feature_min_masks = mask_feature_min_masks + + self.median_filter_width = median_filter_width + + super().__init__( + pad_token_id=pad_token_id, + bos_token_id=bos_token_id, + eos_token_id=eos_token_id, + is_encoder_decoder=is_encoder_decoder, + decoder_start_token_id=decoder_start_token_id, + suppress_tokens=suppress_tokens, + begin_suppress_tokens=begin_suppress_tokens, + **kwargs, + ) + + +class MiniCPM_o_2_6Config(PretrainedConfig): + default_vision_config = { + "hidden_size": 1152, + "image_size": 980, + "intermediate_size": 4304, + "model_type": "siglip", + "num_attention_heads": 16, + "num_hidden_layers": 27, + "patch_size": 14, + } + + def __init__( + self, + text_config=None, + vision_config=None, + audio_config=None, + tts_config=None, + use_cache=True, + query_num=64, + drop_vision_last_layer=True, + vision_batch_size=16, + audio_pool_step=2, + audio_chunk_length=1.0, + **kwargs, + ): + self.use_cache = use_cache + self.query_num = query_num + self.drop_vision_last_layer = drop_vision_last_layer + self.vision_batch_size = vision_batch_size + self.audio_pool_step = audio_pool_step + self.audio_chunk_length = audio_chunk_length + + if text_config is None: + self.text_config = MiniCPM_o_2_6TextConfig() + elif isinstance(text_config, dict): + self.text_config = MiniCPM_o_2_6TextConfig(**text_config) + elif isinstance(text_config, MiniCPM_o_2_6TextConfig): + self.text_config = text_config + + if vision_config is None: + self.vision_config = MiniCPMVisionConfig(**self.default_vision_config) + logger.info("vision_config is None, using default vision config") + elif isinstance(vision_config, dict): + self.vision_config = MiniCPMVisionConfig(**vision_config) + elif isinstance(vision_config, MiniCPMVisionConfig): + self.vision_config = vision_config + + # same as openai/whisper-medium add use_cache + if audio_config is None: + self.audio_config = MiniCPMWhisperConfig() + elif isinstance(audio_config, dict): + self.audio_config = MiniCPMWhisperConfig(**audio_config) + elif isinstance(audio_config, MiniCPMWhisperConfig): + self.audio_config = audio_config + + if tts_config is None: + self.tts_config = MiniCPMConditionalTTSConfig() + elif isinstance(tts_config, dict): + self.tts_config = MiniCPMConditionalTTSConfig(**tts_config) + elif isinstance(tts_config, MiniCPMConditionalTTSConfig): + self.tts_config = tts_config + + # self.patch_size = self.vision_config.patch_size + super().__init__(**kwargs) + + __all__ = ["MiniCPM_o_2_6Config"] diff --git a/src/transformers/models/minicpm_o_2_6/feature_extractor_minicpm_o_2_6.py b/src/transformers/models/minicpm_o_2_6/feature_extractor_minicpm_o_2_6.py index 2cb53022d19a..c39f60d1af82 100644 --- a/src/transformers/models/minicpm_o_2_6/feature_extractor_minicpm_o_2_6.py +++ b/src/transformers/models/minicpm_o_2_6/feature_extractor_minicpm_o_2_6.py @@ -14,34 +14,44 @@ # limitations under the License. 
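Before the feature extractor file continues, a brief hedged sketch of how the composite `MiniCPM_o_2_6Config` defined above assembles its text, vision, audio, and TTS sub-configurations. The import path and the override values are illustrative assumptions based only on the constructor shown in this patch:

```python
from transformers.models.minicpm_o_2_6.configuration_minicpm_o_2_6 import MiniCPM_o_2_6Config

# Default construction: the text, vision (SigLIP-style defaults), audio (Whisper-style)
# and TTS sub-configs are all filled in by the constructor when left as None.
config = MiniCPM_o_2_6Config()
print(type(config.vision_config).__name__)  # MiniCPMVisionConfig
print(config.audio_config.num_mel_bins)     # 80 with the defaults used in this sketch

# Sub-configs may also be passed as plain dicts; unspecified fields keep their defaults.
config = MiniCPM_o_2_6Config(
    vision_config={"image_size": 980, "patch_size": 14},  # illustrative overrides
    audio_pool_step=2,
)
```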
import math -from typing import List, Optional, Union +from typing import Optional, Union -from transformers import WhisperFeatureExtractor, AutoFeatureExtractor, AutoTokenizer import numpy as np import torch +from ..whisper.feature_extraction_whisper import WhisperFeatureExtractor + class MiniCPM_o_2_6FeatureExtractor(WhisperFeatureExtractor): def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) + def format_audios(self, audios): + """ + Normalize audios format to list of list of numpy arrays. + + Args: + audios: Union[np.ndarray, List[np.ndarray], List[List[np.ndarray]]] + + Returns: + List[List[np.ndarray]]: Normalized audio format + """ + # in batch inference, it may be [[]] + if isinstance(audios, np.ndarray): + return [[audios]] + elif isinstance(audios[0], np.ndarray): + return [audios] + else: + return audios + def __call__( self, - tokenizer: None, - audios: Union[np.ndarray, List[np.ndarray], List[List[np.ndarray]]], + audios: Union[np.ndarray, list[np.ndarray], list[list[np.ndarray]]], audio_parts: Optional[list] = None, - chunk_input: Optional[bool] = False, sampling_rate: Optional[int] = None, - chunk_length: Optional[int] = 1, **kwargs, ): - # in batch inference, it may be [[]] - if isinstance(audios, np.ndarray): - audios_list = [[audios]] - elif isinstance(audios[0], np.ndarray): - audios_list = [audios] - else: - audios_list = audios + audios_list = self.format_audios(audios) if audio_parts is not None: assert len(audio_parts) == len(audios_list) @@ -49,19 +59,8 @@ def __call__( assert len(parts) == len(audios) audio_feature_lens_list = [] - audio_ph_list = [] - audio_features_all = [] - # audio placeholder not dependent on audio_parts - for audios in audios_list: - if audios: - audio_ph_list.append( - [self.get_audio_placeholder(tokenizer, len(a), chunk_input, chunk_length) for a in audios] - ) - else: - audio_ph_list.append([]) - for idx, audios in enumerate(audios_list): if audio_parts is not None: # same audio part merge @@ -90,7 +89,7 @@ def __call__( final_merge_audio.append(audio) else: for i in range(math.ceil(len(audio) / max_audio_inp_len)): - final_merge_audio.append(audio[i * max_audio_inp_len : (i + 1) * max_audio_inp_len]) + final_merge_audio.append(audio[i * max_audio_inp_len: (i + 1) * max_audio_inp_len]) if audios: audio_inputs = super().__call__( @@ -121,34 +120,7 @@ def __call__( else: audio_features = [] - return audio_features, audio_feature_lens_list, audio_ph_list - - def get_audio_placeholder(self, tokenizer, audio_lens, chunk_input, chunk_length): - pool_step = 2 - feature_lens = math.ceil(audio_lens / self.hop_length) - - feature_lens = (feature_lens - 1) // 2 + 1 - output_lens = (feature_lens - pool_step) // pool_step + 1 - - if chunk_input: - fbank_feat_in_chunk = int(chunk_length * 100) - cnn_feat_in_chunk = (fbank_feat_in_chunk - 1) // 2 + 1 - audio_embeds_in_chunk = (cnn_feat_in_chunk - pool_step) // pool_step + 1 - num_audio_chunks = (output_lens + audio_embeds_in_chunk - 1) // audio_embeds_in_chunk - - place_holders = "" - total_unk_len = 0 - for _ in range(num_audio_chunks): - unk_len = min(audio_embeds_in_chunk, output_lens - total_unk_len) - place_holders += tokenizer.audio_start + tokenizer.unk_token * unk_len + tokenizer.audio_end - total_unk_len += unk_len - audio_placeholder = place_holders - else: - audio_placeholder = tokenizer.audio_start + tokenizer.unk_token * output_lens + tokenizer.audio_end - - return audio_placeholder - + return audio_features, audio_feature_lens_list 
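The `__call__` above returns the Whisper-style audio features together with the per-segment feature lengths, with a single `np.ndarray` first normalized to the nested `[[audio]]` layout by `format_audios`. A minimal hedged sketch of invoking it, assuming the module path from this patch, default constructor arguments, and 16 kHz input as expected by the underlying Whisper feature extractor:

```python
import numpy as np

from transformers.models.minicpm_o_2_6.feature_extractor_minicpm_o_2_6 import (
    MiniCPM_o_2_6FeatureExtractor,
)

# Two seconds of silent 16 kHz audio as a stand-in for real speech input.
audio = np.zeros(2 * 16000, dtype=np.float32)

feature_extractor = MiniCPM_o_2_6FeatureExtractor()  # Whisper feature-extractor defaults

# Outputs are nested per sample and per audio segment, mirroring format_audios.
audio_features, audio_feature_lens_list = feature_extractor(audio, sampling_rate=16000)
print(audio_feature_lens_list[0])
```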
-AutoFeatureExtractor.register("MiniCPM_o_2_6FeatureExtractor", MiniCPM_o_2_6FeatureExtractor) __all__ = ["MiniCPM_o_2_6FeatureExtractor"] diff --git a/src/transformers/models/minicpm_o_2_6/image_processing_minicpm.py b/src/transformers/models/minicpm_o_2_6/image_processing_minicpm_fast.py similarity index 70% rename from src/transformers/models/minicpm_o_2_6/image_processing_minicpm.py rename to src/transformers/models/minicpm_o_2_6/image_processing_minicpm_fast.py index 544ad4da61af..7ca533aea8e8 100755 --- a/src/transformers/models/minicpm_o_2_6/image_processing_minicpm.py +++ b/src/transformers/models/minicpm_o_2_6/image_processing_minicpm_fast.py @@ -14,21 +14,24 @@ # limitations under the License. import math -from typing import List -from typing import Optional -from typing import Union +from typing import Optional, Union import numpy as np from numpy.lib.stride_tricks import as_strided -import torchvision.transforms as transforms from PIL import Image -from transformers import AutoImageProcessor -from transformers.image_processing_utils import BaseImageProcessor -from transformers.image_transforms import to_pil_image -from transformers.image_utils import valid_images, make_nested_list_of_images -from transformers.utils import TensorType, IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD +from ...image_processing_utils_fast import BaseImageProcessorFast +from ...image_transforms import to_pil_image +from ...image_utils import valid_images, make_nested_list_of_images +from ...utils import TensorType, IMAGENET_STANDARD_MEAN, IMAGENET_STANDARD_STD +from ...utils.import_utils import is_torchvision_available, is_torchvision_v2_available from .processing_minicpm_o_2_6 import MiniCPMOBatchFeature +if is_torchvision_available(): + if is_torchvision_v2_available(): + from torchvision.transforms.v2 import functional as F + else: + from torchvision.transforms import functional as F + def recursive_converter(converter, value): if isinstance(value, list): @@ -40,7 +43,7 @@ def recursive_converter(converter, value): return converter(value) -class MiniCPMVImageProcessor(BaseImageProcessor): +class MiniCPMVImageProcessorFast(BaseImageProcessorFast): model_input_names = ["pixel_values"] def __init__( @@ -62,9 +65,10 @@ def __init__( self.slice_mode = kwargs.pop("slice_mode", True) - self.image_mean = np.array(image_mean if image_mean is not None else IMAGENET_DEFAULT_MEAN) - self.image_std = np.array(image_std if image_std is not None else IMAGENET_DEFAULT_STD) - self.version = kwargs.pop("version", 2.0) + self.image_mean = np.array( + image_mean if image_mean is not None else IMAGENET_STANDARD_MEAN) + self.image_std = np.array( + image_std if image_std is not None else IMAGENET_STANDARD_STD) def ensure_divide(self, length, patch_size): return max(round(length / patch_size) * patch_size, patch_size) @@ -112,54 +116,38 @@ def split_to_patches(self, image, grid): def slice_image(self, image, max_slice_nums=9, scale_resolution=448, patch_size=14, never_split=False): original_size = image.size source_image = None - best_grid = self.get_sliced_grid(original_size, max_slice_nums, never_split) + best_grid = self.get_sliced_grid( + original_size, max_slice_nums, never_split) patches = [] if best_grid is None: # dont need to slice, upsample - best_size = self.find_best_resize(original_size, scale_resolution, patch_size, allow_upscale=True) - source_image = image.resize(best_size, resample=Image.Resampling.BICUBIC) + best_size = self.find_best_resize( + original_size, scale_resolution, patch_size, 
allow_upscale=True) + source_image = image.resize( + best_size, resample=Image.Resampling.BICUBIC) else: # source image, down-sampling and ensure divided by patch_size - best_resize = self.find_best_resize(original_size, scale_resolution, patch_size) + best_resize = self.find_best_resize( + original_size, scale_resolution, patch_size) source_image = image.copy().resize(best_resize, resample=Image.Resampling.BICUBIC) refine_size = self.get_refine_size( original_size, best_grid, scale_resolution, patch_size, allow_upscale=True ) - refine_image = image.resize(refine_size, resample=Image.Resampling.BICUBIC) + refine_image = image.resize( + refine_size, resample=Image.Resampling.BICUBIC) patches = self.split_to_patches(refine_image, best_grid) return source_image, patches, best_grid - def get_grid_placeholder(self, tokenizer, grid): - if grid is None: - return "" - slice_image_placeholder = ( - tokenizer.slice_start + tokenizer.unk_token * self.image_feature_size + tokenizer.slice_end - ) - - cols = grid[0] - rows = grid[1] - slices = [] - for i in range(rows): - lines = [] - for j in range(cols): - lines.append(slice_image_placeholder) - slices.append("".join(lines)) - - slice_placeholder = "\n".join(slices) - return slice_placeholder - - # def get_image_id_placeholder(self, idx=0): - # return f"{self.tokenizer.im_id_start}{idx}{self.tokenizer.im_id_end}" - def get_sliced_images(self, image, max_slice_nums=None): slice_images = [] if not self.slice_mode: return [image] - max_slice_nums = self.max_slice_nums if max_slice_nums is None else int(max_slice_nums) + max_slice_nums = self.max_slice_nums if max_slice_nums is None else int( + max_slice_nums) assert max_slice_nums > 0 source_image, patches, sliced_grid = self.slice_image( # default: 9 # default: 448 # default: 14 @@ -179,7 +167,8 @@ def get_sliced_images(self, image, max_slice_nums=None): def get_sliced_grid(self, image_size, max_slice_nums, nerver_split=False): original_width, original_height = image_size log_ratio = math.log(original_width / original_height) - ratio = original_width * original_height / (self.scale_resolution * self.scale_resolution) + ratio = original_width * original_height / \ + (self.scale_resolution * self.scale_resolution) multiple = min(math.ceil(ratio), max_slice_nums) if multiple <= 1 or nerver_split: return None @@ -207,22 +196,6 @@ def get_sliced_grid(self, image_size, max_slice_nums, nerver_split=False): return best_grid - def get_slice_image_placeholder(self, tokenizer, image_size, image_idx=0, max_slice_nums=None, use_image_id=None): - max_slice_nums = self.max_slice_nums if max_slice_nums is None else int(max_slice_nums) - assert max_slice_nums > 0 - grid = self.get_sliced_grid(image_size=image_size, max_slice_nums=max_slice_nums) - - image_placeholder = tokenizer.im_start + tokenizer.unk_token * self.image_feature_size + tokenizer.im_end - use_image_id = self.use_image_id if use_image_id is None else bool(use_image_id) - if use_image_id: - final_placeholder = f"{tokenizer.im_id_start}{image_idx}{tokenizer.im_id_end}" + image_placeholder - else: - final_placeholder = image_placeholder - - if self.slice_mode: - final_placeholder = final_placeholder + self.get_grid_placeholder(tokenizer, grid=grid) - return final_placeholder - def reshape_by_patch(self, image): """ :param image: shape [3, H, W] @@ -244,10 +217,10 @@ def reshape_by_patch(self, image): def preprocess( self, - images: Union[Image.Image, List[Image.Image], List[List[Image.Image]]], - do_pad: Optional[bool] = True, + images: Union[Image.Image, 
list[Image.Image], list[list[Image.Image]]], max_slice_nums: int = None, return_tensors: Optional[Union[str, TensorType]] = None, + do_normalize: bool = True, **kwargs, ) -> MiniCPMOBatchFeature: # in batch inference, it may be [[]], so we can't use `make_nested_list_of_images` @@ -258,9 +231,6 @@ def preprocess( else: images_list = images - to_tensor = transforms.ToTensor() - normalize_transform = transforms.Normalize(mean=self.image_mean.tolist(), std=self.image_std.tolist()) - new_images_list = [] image_sizes_list = [] tgt_sizes_list = [] @@ -286,17 +256,21 @@ def preprocess( for patch in image_patches: # Convert PIL to tensor (0-1 range) and normalize # Shape: [C, H, W], range [0, 1] - tensor_patch = to_tensor(patch) - normalized_patch = normalize_transform(tensor_patch) # Apply normalization + tensor_patch = F.to_tensor(patch) + if do_normalize: + normalized_patch = F.normalize(tensor_patch, mean=self.image_mean.tolist(), + std=self.image_std.tolist()) # Apply normalization image_patches_tensors.append(normalized_patch) # Convert back to numpy for compatibility with existing code - image_patches = [patch.numpy() for patch in image_patches_tensors] + image_patches = [patch.numpy() + for patch in image_patches_tensors] for slice_image in image_patches: new_images.append(self.reshape_by_patch(slice_image)) tgt_sizes.append( - np.array((slice_image.shape[1] // self.patch_size, slice_image.shape[2] // self.patch_size)) + np.array( + (slice_image.shape[1] // self.patch_size, slice_image.shape[2] // self.patch_size)) ) # in batch inference, it may be [] @@ -306,13 +280,12 @@ def preprocess( new_images_list.append(new_images) image_sizes_list.append(image_sizes) tgt_sizes_list.append(tgt_sizes) + return MiniCPMOBatchFeature( - data={"pixel_values": new_images_list, "image_sizes": image_sizes_list, "tgt_sizes": tgt_sizes_list}, + data={"pixel_values": new_images_list, + "image_sizes": image_sizes_list, "tgt_sizes": tgt_sizes_list}, tensor_type=return_tensors, ) -AutoImageProcessor.register("MiniCPMVImageProcessor", MiniCPMVImageProcessor) - - -__all__ = ["MiniCPMVImageProcessor"] +__all__ = ["MiniCPMVImageProcessorFast"] diff --git a/src/transformers/models/minicpm_o_2_6/modeling_minicpm_o_2_6.py b/src/transformers/models/minicpm_o_2_6/modeling_minicpm_o_2_6.py index e8ab46e8667f..12e002ad79e3 100644 --- a/src/transformers/models/minicpm_o_2_6/modeling_minicpm_o_2_6.py +++ b/src/transformers/models/minicpm_o_2_6/modeling_minicpm_o_2_6.py @@ -67,41 +67,32 @@ add_start_docstrings_to_model_forward, auto_docstring, can_return_tuple, - is_flash_attn_2_available, logging, replace_return_docstrings, ) from ...utils.deprecation import deprecate_kwarg -from ..whisper.configuration_whisper import WhisperConfig -from ..siglip.configuration_siglip import SiglipVisionConfig from ..bert.tokenization_bert_fast import BertTokenizerFast from ...utils.generic import check_model_inputs +from ...utils.import_utils import _is_package_available, is_flash_attn_2_available from .configuration_minicpm_o_2_6 import ( MiniCPM_o_2_6Config, MiniCPMConditionalTTSConfig, MiniCPMConditionalTTSTextConfig, + MiniCPMVisionConfig, + MiniCPMWhisperConfig, ) -from .processing_minicpm_o_2_6 import NumberToTextConverter, VoiceChecker, sentence_end +from .tts_processing_minicpm_o_2_6 import ChatTTSProcessor, NumberToTextConverter, VoiceChecker, sentence_end if is_flash_attn_2_available(): from flash_attn import flash_attn_func, flash_attn_varlen_func - from flash_attn.bert_padding import ( - index_first_axis, # noqa - pad_input, - 
unpad_input, - ) + from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input -try: +if _is_package_available("vector_quantize_pytorch") and _is_package_available("vocos"): from vector_quantize_pytorch import GroupedResidualFSQ from vocos import Vocos from vocos.pretrained import instantiate_class - _tts_deps = True -except: - _tts_deps = False - - logger = logging.get_logger(__name__) @@ -339,7 +330,7 @@ class MiniCPM_o_2_6PreTrainedModel(PreTrainedModel): config: MiniCPM_o_2_6Config base_model_prefix = "model" supports_gradient_checkpointing = True - _no_split_modules = ["MiniCPM_o_2_6TextDecoderLayer"] + _no_split_modules = ["MiniCPM_o_2_6DecoderLayer"] _skip_keys_device_placement = ["past_key_values"] _supports_flash_attn = True _supports_sdpa = True @@ -351,24 +342,6 @@ class MiniCPM_o_2_6PreTrainedModel(PreTrainedModel): "hidden_states": MiniCPM_o_2_6DecoderLayer, "attentions": MiniCPM_o_2_6Attention, } - config_class = MiniCPM_o_2_6Config - _supports_flash_attn_2 = True - _supports_cache_class = True - _supports_quantized_cache = True - _supports_static_cache = True - - def _init_weights(self, module): - std = self.config.initializer_range - if isinstance(module, nn.Linear): - module.weight.data.normal_(mean=0.0, std=std) - if module.bias is not None: - module.bias.data.zero_() - elif isinstance(module, nn.Embedding): - module.weight.data.normal_(mean=0.0, std=std) - if module.padding_idx is not None: - module.weight.data[module.padding_idx].zero_() - elif isinstance(module, MiniCPM_o_2_6TextRMSNorm): - module.weight.data.fill_(1.0) class MiniCPM_o_2_6RotaryEmbedding(nn.Module): @@ -408,7 +381,7 @@ def forward(self, x, position_ids): @auto_docstring -class MiniCPMTextModel(MiniCPM_o_2_6PreTrainedModel): +class MiniCPM_o_2_6TextModel(MiniCPM_o_2_6PreTrainedModel): def __init__(self, config: MiniCPM_o_2_6Config): super().__init__(config) self.padding_idx = config.pad_token_id @@ -499,6 +472,9 @@ def forward( ) +_tts_deps = _is_package_available("vector_quantize_pytorch") and _is_package_available("vocos") + + def _prepare_4d_causal_attention_mask_with_cache_position( attention_mask: torch.Tensor, sequence_length: int, @@ -572,16 +548,20 @@ def gen_logits( return logits_warpers, logits_processors -class MiniCPM_o_2_6Model(MiniCPM_o_2_6PreTrainedModel, GenerationMixin): +class MiniCPM_o_2_6ForConditionalGeneration(MiniCPM_o_2_6PreTrainedModel, GenerationMixin): _tied_weights_keys = ["lm_head.weight"] _tp_plan = {"lm_head": "colwise_rep"} _pp_plan = {"lm_head": (["hidden_states"], ["logits"])} - def __init__(self, config): - super().__init__(config) - self.language_model = MiniCPMTextModel(config) - self.vocab_size = config.vocab_size - self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False) + def __init__(self, config: MiniCPM_o_2_6Config): + super().__init__(config.text_config) + + text_config = config.text_config + self.language_model = MiniCPM_o_2_6TextModel(text_config) + self.vocab_size = text_config.vocab_size + self.lm_head = nn.Linear(text_config.hidden_size, text_config.vocab_size, bias=False) + + self.omni_config = config # Initialize weights and apply final processing self.post_init() @@ -592,12 +572,12 @@ def __init__(self, config): # init vision module self.vpm = self.init_vision_module() self.vision_dim = self.vpm.embed_dim - self.resampler = self.init_resampler(self.embed_dim, self.vision_dim) + self.resampler = self.init_resampler(config.query_num, self.embed_dim, self.vision_dim) # init audio module self.apm = 
self.init_audio_module() audio_output_dim = int(self.apm.config.encoder_ffn_dim // 4) - self.audio_avg_pooler = nn.AvgPool1d(self.config.audio_pool_step, stride=self.config.audio_pool_step) + self.audio_avg_pooler = nn.AvgPool1d(self.omni_config.audio_pool_step, stride=self.omni_config.audio_pool_step) self.audio_projection_layer = MultiModalProjector(in_dim=audio_output_dim, out_dim=self.embed_dim) self.audio_encoder_layer = -1 @@ -627,10 +607,8 @@ def init_tts( load tts tokenizer and vocos 1. try load form local 2. try load from huggingface """ - from .processing_minicpm_o_2_6 import ChatTTSProcessor - if tts_text_tokenizer_path is None: - tts_text_tokenizer_path = os.path.join(self.config._name_or_path, "assets/chattts_tokenizer") + tts_text_tokenizer_path = os.path.join(self.omni_config._name_or_path, "assets/chattts_tokenizer") if not os.path.exists(tts_text_tokenizer_path): # try from hf model_id tts_text_tokenizer_path = "openbmb/chattts_tokenizer" @@ -639,7 +617,7 @@ def init_tts( self.tts_processor = ChatTTSProcessor(text_tokenizer=tts_text_tokenizer) if vocos_ckpt_path is None: - vocos_ckpt_path = os.path.join(self.config._name_or_path, "assets/Vocos.pt") + vocos_ckpt_path = os.path.join(self.omni_config._name_or_path, "assets/Vocos.pt") if not os.path.exists(vocos_ckpt_path): vocos_ckpt_path = hf_hub_download(repo_id="openbmb/MiniCPM-o-2_6", subfolder="assets", filename="Vocos.pt") @@ -670,12 +648,12 @@ def initialize_vocos(self, ckpt_path): return vocos def init_vision_module(self): - if self.config._attn_implementation == "flash_attention_2": - self.config.vision_config._attn_implementation = "flash_attention_2" + if self.omni_config._attn_implementation == "flash_attention_2": + self.omni_config.vision_config._attn_implementation = "flash_attention_2" else: - self.config.vision_config._attn_implementation = "eager" - model = MiniCPMVisionTransformer(self.config.vision_config) - if self.config.drop_vision_last_layer: + self.omni_config.vision_config._attn_implementation = "eager" + model = MiniCPMVisionTransformer(self.omni_config.vision_config) + if self.omni_config.drop_vision_last_layer: model.encoder.layers = model.encoder.layers[:-1] setattr(model, "embed_dim", model.embeddings.embed_dim) @@ -683,9 +661,9 @@ def init_vision_module(self): return model - def init_resampler(self, embed_dim, vision_dim): + def init_resampler(self, query_num, embed_dim, vision_dim): return Resampler( - num_queries=self.config.query_num, + num_queries=query_num, embed_dim=embed_dim, num_heads=embed_dim // 128, kv_dim=vision_dim, @@ -693,11 +671,11 @@ def init_resampler(self, embed_dim, vision_dim): ) def init_audio_module(self): - model = MiniCPMWhisperEncoder(self.config.audio_config) + model = MiniCPMWhisperEncoder(self.omni_config.audio_config) return model def init_tts_module(self): - model = ConditionalChatTTS(self.config.tts_config) + model = ConditionalChatTTS(self.omni_config.tts_config) return model def get_input_embeddings(self): @@ -769,8 +747,8 @@ def _get_feat_extract_output_lengths(self, input_lengths: torch.LongTensor): """ input_lengths_after_cnn = (input_lengths - 1) // 2 + 1 input_lengths_after_pooling = ( - input_lengths_after_cnn - self.config.audio_pool_step - ) // self.config.audio_pool_step + 1 + input_lengths_after_cnn - self.omni_config.audio_pool_step + ) // self.omni_config.audio_pool_step + 1 input_lengths_after_pooling = input_lengths_after_pooling.to(dtype=torch.int32) return input_lengths_after_cnn, input_lengths_after_pooling @@ -798,7 +776,7 @@ def 
get_image_features(self, pixel_values_list, tgt_sizes, dtype, device): for i in range(B): patch_attn_mask[i, 0, : tgt_sizes[i][0] * tgt_sizes[i][1]] = True - vision_batch_size = self.config.vision_batch_size + vision_batch_size = self.omni_config.vision_batch_size all_pixel_values = all_pixel_values.type(dtype) if B > vision_batch_size: hs = [] @@ -1047,7 +1025,7 @@ def get_omni_embedding(self, data, input_embeddings, chunk_length=-1, stream_inp assert len(audio_embeddings) == len(input_embeddings) audio_bounds = data["audio_bounds"] - if self.config.chunk_input: + if self.omni_config.chunk_input: for i in range(bs): audio_embs = torch.cat(audio_embeddings[i], dim=0).to( device=input_embeddings.device, dtype=input_embeddings.dtype @@ -1116,9 +1094,9 @@ def forward( >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." ```""" - output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_attentions = output_attentions if output_attentions is not None else self.omni_config.output_attentions output_hidden_states = ( - output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + output_hidden_states if output_hidden_states is not None else self.omni_config.output_hidden_states ) # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn) @@ -1142,7 +1120,7 @@ def forward( loss = None if labels is not None: - loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs) + loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.vocab_size, **kwargs) return CausalLMOutputWithPast( loss=loss, @@ -1237,7 +1215,7 @@ def generate( model_inputs["inputs_embeds"] = self.get_omni_embedding( model_inputs, input_embeddings=model_inputs["inputs_embeds"], - chunk_length=self.config.audio_chunk_length, + chunk_length=self.omni_config.audio_chunk_length, ) if stream: @@ -1270,7 +1248,7 @@ def stream_gen(): spk_embeds = wav_numpy = sr = None if not batched and use_tts_template and generate_audio: - result = processor.decode_text(outputs.sequences, processor.tokenizer) + result = processor.decode(outputs.sequences) mel_spec = self._generate_mel_spec( model_inputs, outputs, @@ -1612,7 +1590,7 @@ def check_uncompleted_token(ids): end = check_uncompleted_token(cur_ids[0]) left_ids = cur_ids[:, end:] cur_ids = cur_ids[:, :end] - text = processor.decode_text(cur_ids, tokenizer)[0] if end > 0 else "" + text = processor.decode(cur_ids)[0] if end > 0 else "" self.llm_past_key_values = outputs.past_key_values input_ids = outputs.sequences[:, -1:] @@ -2247,6 +2225,37 @@ def decode_mel_to_audio(self, mel_spec, output_path=""): logger.info(f"Audio saved to {output_path}") return wav_numpy, sr + +def whisper_eager_attention_forward( + module: nn.Module, + query: torch.Tensor, + key: torch.Tensor, + value: torch.Tensor, + attention_mask: Optional[torch.Tensor], + scaling: Optional[float] = None, + dropout: float = 0.0, + head_mask: Optional[torch.Tensor] = None, + **kwargs, +): + if scaling is None: + scaling = query.size(-1) ** -0.5 + + attn_weights = torch.matmul(query, key.transpose(2, 3)) * scaling + if attention_mask is not None and attention_mask.ndim == 4: + attn_weights = attn_weights + attention_mask[:, :, :, : key.shape[-2]] + + attn_weights = nn.functional.softmax(attn_weights, dim=-1) + + if head_mask is not None: 
+ attn_weights = attn_weights * head_mask.view(1, -1, 1, 1) + + attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training) + attn_output = torch.matmul(attn_weights, value) + attn_output = attn_output.transpose(1, 2).contiguous() + + return attn_output, attn_weights + + class MiniCPMWhisperAttention(nn.Module): """Multi-headed attention from 'Attention Is All You Need' paper""" @@ -2295,7 +2304,7 @@ def forward( self, hidden_states: torch.Tensor, key_value_states: Optional[torch.Tensor] = None, - past_key_value: Optional[Cache] = None, + past_key_values: Optional[Cache] = None, attention_mask: Optional[torch.Tensor] = None, layer_head_mask: Optional[torch.Tensor] = None, output_attentions: bool = False, @@ -2323,34 +2332,34 @@ def forward( query_states = query_states.view(*q_input_shape) query_states = query_states.transpose(1, 2).contiguous() - if past_key_value is not None: - is_updated = past_key_value.is_updated.get(self.layer_idx) + if past_key_values is not None: + is_updated = past_key_values.is_updated.get(self.layer_idx) if is_cross_attention: # after the first generated id, we can subsequently re-use all key/value_states from cache - past_key_value.is_updated[self.layer_idx] = True - past_key_value = past_key_value.cross_attention_cache + past_key_values.is_updated[self.layer_idx] = True + past_key_values = past_key_values.cross_attention_cache else: - past_key_value = past_key_value.self_attention_cache + past_key_values = past_key_values.self_attention_cache # use key_value_states if cross attention current_states = key_value_states if key_value_states is not None else hidden_states - if is_cross_attention and past_key_value and is_updated: + if is_cross_attention and past_key_values and is_updated: # reuse k,v, cross_attentions - key_states = past_key_value.key_cache[self.layer_idx] - value_states = past_key_value.value_cache[self.layer_idx] + key_states = past_key_values.key_cache[self.layer_idx] + value_states = past_key_values.value_cache[self.layer_idx] else: key_states = self.k_proj(current_states).view(bsz, -1, self.num_heads, self.head_dim) value_states = self.v_proj(current_states).view(bsz, -1, self.num_heads, self.head_dim) key_states = key_states.transpose(1, 2).contiguous() value_states = value_states.transpose(1, 2).contiguous() - if past_key_value is not None: + if past_key_values is not None: # save all key/value_states to cache to be re-used for fast auto-regressive generation cache_position = cache_position if not is_cross_attention else None - key_states, value_states = past_key_value.update( + key_states, value_states = past_key_values.update( key_states, value_states, self.layer_idx, {"cache_position": cache_position} ) - attention_interface: Callable = eager_attention_forward + attention_interface: Callable = whisper_eager_attention_forward if self.config._attn_implementation != "eager": attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation] @@ -2370,11 +2379,11 @@ def forward( attn_output = attn_output.reshape(bsz, tgt_len, -1).contiguous() attn_output = self.out_proj(attn_output) - return attn_output, attn_weights, past_key_value + return attn_output, attn_weights, past_key_values class MiniCPMWhisperEncoderLayer(GradientCheckpointingLayer): - def __init__(self, config: WhisperConfig, layer_idx: int = None): + def __init__(self, config: MiniCPMWhisperConfig, layer_idx: int = None): super().__init__() self.embed_dim = config.d_model self.self_attn = MiniCPMWhisperAttention( @@ -2426,7 +2435,7 @@ def 
forward( attention_mask=attention_mask, layer_head_mask=layer_head_mask, output_attentions=output_attentions, - past_key_value=past_key_values, + past_key_values=past_key_values, ) hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training) hidden_states = residual + hidden_states @@ -2477,7 +2486,7 @@ class MiniCPMWhisperEncoder(MiniCPM_o_2_6PreTrainedModel): config: MiniCPMWhisperConfig """ - def __init__(self, config: WhisperConfig): + def __init__(self, config: MiniCPMWhisperConfig): super().__init__(config) self.dropout = config.dropout self.layerdrop = config.encoder_layerdrop @@ -2592,7 +2601,7 @@ def forward( only present if their respective `output_*` arguments are set to `True`. Example: - >>> from transformers import AutoFeatureExtractor, WhisperConfig, WhisperForConditionalGeneration + >>> from transformers import AutoFeatureExtractor, MiniCPMWhisperConfig, WhisperForConditionalGeneration >>> import torch >>> # Load a feature extractor and a Whisper model @@ -2994,7 +3003,7 @@ class ConditionalChatTTSGenerationOutput(ModelOutput): Args: new_ids (torch.LongTensor): Newly generated audio code sequence, shape (batch_size, sequence_length, num_vq). audio_input_ids (torch.LongTensor): Updated input IDs including condition and generated audio codes, shape (batch_size, full_sequence_length, num_vq). - past_key_values (Tuple[Tuple[torch.FloatTensor]]): Tuple containing pre-computed keys and values used for attention mechanism. Each element has shape (batch_size, num_heads, sequence_length, embed_size_per_head). + past_key_values (tuple[tuple[torch.FloatTensor]]): tuple containing pre-computed keys and values used for attention mechanism. Each element has shape (batch_size, num_heads, sequence_length, embed_size_per_head). finished (bool): Boolean indicating whether generation is complete. 
""" @@ -3196,23 +3205,6 @@ class MiniCPMConditionalTTSTextPreTrainedModel(PreTrainedModel): "attentions": MiniCPMConditionalTTSTextAttention, } config_class = MiniCPMConditionalTTSTextConfig - _supports_flash_attn_2 = True - _supports_cache_class = True - _supports_quantized_cache = True - _supports_static_cache = True - - def _init_weights(self, module): - std = self.config.initializer_range - if isinstance(module, nn.Linear): - module.weight.data.normal_(mean=0.0, std=std) - if module.bias is not None: - module.bias.data.zero_() - elif isinstance(module, nn.Embedding): - module.weight.data.normal_(mean=0.0, std=std) - if module.padding_idx is not None: - module.weight.data[module.padding_idx].zero_() - elif isinstance(module, MiniCPMConditionalTTSTextRMSNorm): - module.weight.data.fill_(1.0) class MiniCPMConditionalTTSTextRotaryEmbedding(nn.Module): @@ -3340,7 +3332,7 @@ def forward( hidden_states, attention_mask=causal_mask, position_ids=position_ids, - past_key_value=past_key_values, + past_key_values=past_key_values, output_attentions=output_attentions, use_cache=use_cache, cache_position=cache_position, @@ -3681,16 +3673,7 @@ def __init__(self, config: MiniCPMConditionalTTSConfig): dvae = DVAE() self.dvae = dvae - model_config = MiniCPMConditionalTTSTextConfig( - hidden_size=config.hidden_size, - intermediate_size=config.intermediate_size, - num_attention_heads=config.num_attention_heads, - num_hidden_layers=config.num_hidden_layers, - max_position_embeddings=config.max_position_embeddings, - attn_implementation=config.attn_implementation, - ) - - model = MiniCPMConditionalTTSTextModel(model_config) + model = MiniCPMConditionalTTSTextModel(config.tts_text_config) self.model = model @torch.inference_mode() @@ -3751,7 +3734,7 @@ def prefill_text( Args: input_ids (Tensor): Tensor of shape [batch_size, seq_len] position_ids (LongTensor): Tensor of shape [batch_size, seq_len] - past_key_values (List[Tuple[Tensor]]): KV Cache of all layers, each layer is a tuple (Tensor, Tensor) denoting keys and values. Each tensor is of seq_len = `self.streaming_text_reserved_len`. `past_key_values` will be updated. + past_key_values (List[tuple[Tensor]]): KV Cache of all layers, each layer is a tuple (Tensor, Tensor) denoting keys and values. Each tensor is of seq_len = `self.streaming_text_reserved_len`. `past_key_values` will be updated. lm_spk_emb_last_hidden_states (Tensor, optional): Tensor of shape [batch_size, num_spk_emb, llm_dim]. Defaults to None. lm_last_hidden_states (Tensor, optional): _description_. Defaults to None. @@ -3825,7 +3808,7 @@ def prefill_audio_ids( Args: input_ids (torch.Tensor): (1, seq_len, num_vq) Audio input token ids. - past_key_values (List[Tuple[torch.Tensor, torch.Tensor]]): Past key values for attention mechanism. + past_key_values (List[tuple[torch.Tensor, torch.Tensor]]): Past key values for attention mechanism. """ assert input_ids.shape[0] == 1 assert past_key_values is not None @@ -3891,7 +3874,7 @@ def generate( Args: input_ids (torch.Tensor): Input token ids. - past_key_values (List[Tuple[torch.Tensor, torch.Tensor]]): Past key values for attention mechanism. + past_key_values (List[tuple[torch.Tensor, torch.Tensor]]): Past key values for attention mechanism. temperature (torch.Tensor): Temperature for sampling. eos_token (Union[int, torch.Tensor]): End of sequence token. streaming_tts_text_mask (Optional[torch.Tensor], optional): Mask for streaming TTS text. Defaults to None. 
@@ -4433,11 +4416,11 @@ class MiniCPMVisionModelOutput(ModelOutput): last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): Sequence of hidden-states at the output of the last layer of the model. hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): - Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + + tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): - Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. @@ -4450,7 +4433,7 @@ class MiniCPMVisionModelOutput(ModelOutput): class MiniCPMVisionEmbedding(nn.Module): - def __init__(self, config: SiglipVisionConfig): + def __init__(self, config: MiniCPMVisionConfig): super().__init__() self.config = config self.embed_dim = config.hidden_size @@ -4807,7 +4790,7 @@ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor: class MiniCPMVisionEncoderLayer(GradientCheckpointingLayer): - def __init__(self, config: SiglipVisionConfig): + def __init__(self, config: MiniCPMVisionConfig): super().__init__() self.embed_dim = config.hidden_size self.layer_norm1 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps) @@ -4818,6 +4801,7 @@ def __init__(self, config: SiglipVisionConfig): self.layer_norm2 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps) self.mlp = MiniCPMVisionMLP(config) + def forward( self, hidden_states: torch.Tensor, @@ -4969,7 +4953,7 @@ class MiniCPMVisionPreTrainedModel(PreTrainedModel): models. """ - config_class = SiglipVisionConfig + config_class = MiniCPMVisionConfig base_model_prefix = "siglip" supports_gradient_checkpointing = True @@ -5009,10 +4993,10 @@ class MiniCPMVisionEncoder(nn.Module): Transformer encoder consisting of `config.num_hidden_layers` self attention layers. Each layer is a [`SiglipEncoderLayer`]. Args: - config: SiglipConfig + config: MiniCPMVisionConfig """ - def __init__(self, config: SiglipVisionConfig): + def __init__(self, config: MiniCPMVisionConfig): super().__init__() self.config = config self.layers = nn.ModuleList([MiniCPMVisionEncoderLayer(config) for _ in range(config.num_hidden_layers)]) @@ -5098,7 +5082,7 @@ def forward( Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior. Parameters: - config ([`SiglipVisionConfig`]): Model configuration class with all the parameters of the model. + config ([`MiniCPMVisionConfig`]): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights. 
""" @@ -5124,12 +5108,12 @@ def forward( """The vision model from SigLIP without any head or projection on top.""", SIGLIP_START_DOCSTRING ) class MiniCPMVisionTransformer(MiniCPMVisionPreTrainedModel): - config_class = SiglipVisionConfig + config_class = MiniCPMVisionConfig main_input_name = "pixel_values" _supports_flash_attn_2 = True _no_split_modules = [] - def __init__(self, config: SiglipVisionConfig): + def __init__(self, config: MiniCPMVisionConfig): super().__init__(config) self.config = config embed_dim = config.hidden_size @@ -5146,7 +5130,7 @@ def get_input_embeddings(self) -> nn.Module: return self.embeddings.patch_embedding @add_start_docstrings_to_model_forward(SIGLIP_VISION_INPUTS_DOCSTRING) - @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=SiglipVisionConfig) + @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=MiniCPMVisionConfig) def forward( self, pixel_values, @@ -5216,4 +5200,4 @@ def forward( ) -__all__ = ["MiniCPM_o_2_6ForConditionalGeneration", "MiniCPM_o_2_6Model", "MiniCPM_o_2_6PreTrainedModel"] +__all__ = ["MiniCPM_o_2_6ForConditionalGeneration", "MiniCPM_o_2_6TextModel", "MiniCPM_o_2_6PreTrainedModel"] diff --git a/src/transformers/models/minicpm_o_2_6/modular_minicpm_o_2_6.py b/src/transformers/models/minicpm_o_2_6/modular_minicpm_o_2_6.py index 92e5b55c4582..2807b7f3336f 100644 --- a/src/transformers/models/minicpm_o_2_6/modular_minicpm_o_2_6.py +++ b/src/transformers/models/minicpm_o_2_6/modular_minicpm_o_2_6.py @@ -20,7 +20,7 @@ from dataclasses import dataclass from functools import partial from threading import Thread -from typing import List, Optional, Tuple, Union, Callable +from typing import Optional, Union, Callable import numpy as np from PIL import Image @@ -43,27 +43,26 @@ CausalLMOutputWithPast, ) from ...utils import ( - ModelOutput, + logging, add_start_docstrings, add_start_docstrings_to_model_forward, - is_flash_attn_2_available, - logging, replace_return_docstrings, can_return_tuple, auto_docstring, + ModelOutput, TransformersKwargs, ) +from ...utils.import_utils import _is_package_available, is_flash_attn_2_available from ...cache_utils import Cache, DynamicCache, EncoderDecoderCache, StaticCache +from ...configuration_utils import PretrainedConfig from ...generation import GenerationMixin from ...generation.streamers import TextIteratorStreamer from ...generation.utils import GenerateOutput from ...generation.logits_process import LogitsProcessor, TopKLogitsWarper, TopPLogitsWarper -from ...modeling_layers import GradientCheckpointingLayer -from ...modeling_rope_utils import ROPE_INIT_FUNCTIONS, dynamic_rope_update from ...modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel from ...activations import ACT2FN from ...modeling_attn_mask_utils import _prepare_4d_attention_mask, AttentionMaskConverter -from ...integrations import is_deepspeed_zero3_enabled, use_kernel_forward_from_hub +from ...integrations import is_deepspeed_zero3_enabled from ...modeling_flash_attention_utils import FlashAttentionKwargs from ...processing_utils import Unpack @@ -72,31 +71,182 @@ from ..siglip.modeling_siglip import SiglipEncoderLayer, SiglipEncoder, SiglipMLP, SiglipVisionModelOutput from ..whisper.configuration_whisper import WhisperConfig from ..whisper.modeling_whisper import WhisperEncoder, WhisperAttention, WhisperEncoderLayer +from ..qwen2.configuration_qwen2 import Qwen2Config from ..qwen2.modeling_qwen2 import Qwen2Model, Qwen2PreTrainedModel +from ..llama.configuration_llama 
import LlamaConfig from ..llama.modeling_llama import LlamaModel, LlamaDecoderLayer, LlamaPreTrainedModel -try: +from .tts_processing_minicpm_o_2_6 import NumberToTextConverter, sentence_end, VoiceChecker, ChatTTSProcessor + +if is_flash_attn_2_available(): + from flash_attn import flash_attn_func, flash_attn_varlen_func + from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input + +if _is_package_available('vector_quantize_pytorch') and _is_package_available('vocos'): from vector_quantize_pytorch import GroupedResidualFSQ from vocos import Vocos from vocos.pretrained import instantiate_class - _tts_deps = True -except: - _tts_deps = False - -from .configuration_minicpm_o_2_6 import ( - MiniCPMConditionalTTSConfig, - MiniCPM_o_2_6Config, - MiniCPMConditionalTTSTextConfig, -) -from .processing_minicpm_o_2_6 import NumberToTextConverter, sentence_end, VoiceChecker +_tts_deps = _is_package_available('vector_quantize_pytorch') and _is_package_available('vocos') logger = logging.get_logger(__name__) +class MiniCPMConditionalTTSTextConfig(LlamaConfig): + pass + + +class MiniCPMConditionalTTSConfig(PretrainedConfig): + model_type = "conditional_chattts" + + # sub_configs = { + # "text_config": MiniCPMConditionalTTSTextConfig, + # } + + def __init__( + self, + llm_dim: int = 2560, + hidden_size: int = 768, + intermediate_size: int = 3072, + num_attention_heads: int = 12, + num_hidden_layers: int = 20, + max_position_embeddings: int = 4096, + num_audio_tokens: int = 626, + num_text_tokens: int = 21178, + num_mel_bins: int = 100, + num_vq: int = 4, + use_speaker_embedding: bool = True, + use_llm_hidden_state: bool = False, + spk_emb_token_id: int = 21143, + num_spk_embs: int = 1, + audio_bos_token_id: int = 21132, + text_eos_token_id: int = 21133, + use_text: bool = True, + streaming: bool = True, + streaming_text_chunk_size: int = 10, + streaming_text_reserved_len: int = 300, + streaming_audio_chunk_size: int = 50, + attn_implementation: str = "sdpa", + use_mlp: bool = True, + aug_loss_weight: bool = True, + **kwargs, + ): + super().__init__(**kwargs) + + self.llm_dim = llm_dim + self.hidden_size = hidden_size + self.intermediate_size = intermediate_size + self.num_attention_heads = num_attention_heads + self.num_hidden_layers = num_hidden_layers + self.max_position_embeddings = max_position_embeddings + self.num_audio_tokens = num_audio_tokens + self.num_text_tokens = num_text_tokens + self.num_mel_bins = num_mel_bins + self.num_vq = num_vq + self.use_speaker_embedding = use_speaker_embedding + self.use_llm_hidden_state = use_llm_hidden_state + self.spk_emb_token_id = spk_emb_token_id + self.num_spk_embs = num_spk_embs + self.audio_bos_token_id = audio_bos_token_id + self.text_eos_token_id = text_eos_token_id + self.use_text = use_text + self.streaming = streaming + self.streaming_text_chunk_size = streaming_text_chunk_size + self.streaming_text_reserved_len = streaming_text_reserved_len + self.streaming_audio_chunk_size = streaming_audio_chunk_size + self.attn_implementation = attn_implementation + self.use_mlp = use_mlp + self.aug_loss_weight = aug_loss_weight + + self.tts_text_config = MiniCPMConditionalTTSTextConfig( + hidden_size=self.hidden_size, + intermediate_size=self.intermediate_size, + num_attention_heads=self.num_attention_heads, + num_hidden_layers=self.num_hidden_layers, + max_position_embeddings=self.max_position_embeddings, + attn_implementation=self.attn_implementation, + ) + + +class MiniCPM_o_2_6TextConfig(Qwen2Config): + model_type = "minicpmo" + +class 
MiniCPMVisionConfig(SiglipVisionConfig): + pass + +class MiniCPMWhisperConfig(WhisperConfig): + pass + +class MiniCPM_o_2_6Config(PretrainedConfig): + + default_vision_config = { + "hidden_size": 1152, + "image_size": 980, + "intermediate_size": 4304, + "model_type": "siglip", + "num_attention_heads": 16, + "num_hidden_layers": 27, + "patch_size": 14, + } + + def __init__( + self, + text_config=None, + vision_config=None, + audio_config=None, + tts_config=None, + use_cache=True, + query_num=64, + drop_vision_last_layer=True, + vision_batch_size=16, + audio_pool_step=2, + audio_chunk_length=1.0, + **kwargs, + ): + self.use_cache = use_cache + self.query_num = query_num + self.drop_vision_last_layer = drop_vision_last_layer + self.vision_batch_size = vision_batch_size + self.audio_pool_step = audio_pool_step + self.audio_chunk_length = audio_chunk_length + + if text_config is None: + self.text_config = MiniCPM_o_2_6TextConfig() + elif isinstance(text_config, dict): + self.text_config = MiniCPM_o_2_6TextConfig(**text_config) + elif isinstance(text_config, MiniCPM_o_2_6TextConfig): + self.text_config = text_config + + if vision_config is None: + self.vision_config = MiniCPMVisionConfig( + **self.default_vision_config) + logger.info("vision_config is None, using default vision config") + elif isinstance(vision_config, dict): + self.vision_config = MiniCPMVisionConfig(**vision_config) + elif isinstance(vision_config, MiniCPMVisionConfig): + self.vision_config = vision_config + + # same as openai/whisper-medium add use_cache + if audio_config is None: + self.audio_config = MiniCPMWhisperConfig() + elif isinstance(audio_config, dict): + self.audio_config = MiniCPMWhisperConfig(**audio_config) + elif isinstance(audio_config, MiniCPMWhisperConfig): + self.audio_config = audio_config + + if tts_config is None: + self.tts_config = MiniCPMConditionalTTSConfig() + elif isinstance(tts_config, dict): + self.tts_config = MiniCPMConditionalTTSConfig(**tts_config) + elif isinstance(tts_config, MiniCPMConditionalTTSConfig): + self.tts_config = tts_config + + # self.patch_size = self.vision_config.patch_size + super().__init__(**kwargs) + @dataclass class OmniOutput(ModelOutput): - text: Optional[Union[str, List[str], Iterator]] = None + text: Optional[Union[str, list[str], Iterator]] = None outputs: GenerateOutput | torch.LongTensor = None spk_embeds: Optional[torch.FloatTensor] = None audio_wav: Optional[np.ndarray] = None @@ -105,47 +255,27 @@ class OmniOutput(ModelOutput): @auto_docstring class MiniCPM_o_2_6PreTrainedModel(Qwen2PreTrainedModel): - config_class = MiniCPM_o_2_6Config - base_model_prefix = "model" - supports_gradient_checkpointing = True - _no_split_modules = ["MiniCPM_o_2_6TextDecoderLayer"] - _skip_keys_device_placement = ["past_key_values"] - _supports_flash_attn_2 = True - _supports_sdpa = True - _supports_flex_attn = True - _supports_cache_class = True - _supports_quantized_cache = True - _supports_static_cache = True - _supports_attention_backend = True - - def _init_weights(self, module): - std = self.config.initializer_range - if isinstance(module, nn.Linear): - module.weight.data.normal_(mean=0.0, std=std) - if module.bias is not None: - module.bias.data.zero_() - elif isinstance(module, nn.Embedding): - module.weight.data.normal_(mean=0.0, std=std) - if module.padding_idx is not None: - module.weight.data[module.padding_idx].zero_() - elif isinstance(module, MiniCPM_o_2_6TextRMSNorm): - module.weight.data.fill_(1.0) + config: MiniCPM_o_2_6Config -class MiniCPMTextModel(Qwen2Model): 
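
# Illustrative sketch, not part of the diff above: how the composite `MiniCPM_o_2_6Config`
# defined in this file promotes plain dicts to sub-config objects and fills in defaults for
# sub-configs that are omitted. Class names come from this diff; the import path is an
# assumption about where the generated configuration module ends up.
from transformers.models.minicpm_o_2_6.configuration_minicpm_o_2_6 import (  # assumed path
    MiniCPM_o_2_6Config,
    MiniCPM_o_2_6TextConfig,
    MiniCPMConditionalTTSConfig,
    MiniCPMVisionConfig,
)

config = MiniCPM_o_2_6Config(
    vision_config={"hidden_size": 1152, "image_size": 980, "patch_size": 14},  # dict form
    query_num=64,
    audio_pool_step=2,
)
assert isinstance(config.vision_config, MiniCPMVisionConfig)    # dict promoted to a config object
assert isinstance(config.text_config, MiniCPM_o_2_6TextConfig)  # omitted, built with defaults
assert isinstance(config.tts_config, MiniCPMConditionalTTSConfig)
# the derived TTS text sub-config mirrors the TTS trunk dimensions (see `tts_text_config` above)
print(config.tts_config.hidden_size, config.tts_config.tts_text_config.hidden_size)  # 768 768
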
+class MiniCPM_o_2_6TextModel(Qwen2Model): pass -class MiniCPM_o_2_6Model(MiniCPM_o_2_6PreTrainedModel, GenerationMixin): +class MiniCPM_o_2_6ForConditionalGeneration(MiniCPM_o_2_6PreTrainedModel, GenerationMixin): _tied_weights_keys = ["lm_head.weight"] _tp_plan = {"lm_head": "colwise_rep"} _pp_plan = {"lm_head": (["hidden_states"], ["logits"])} - def __init__(self, config): - super().__init__(config) - self.language_model = MiniCPMTextModel(config) - self.vocab_size = config.vocab_size - self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False) + def __init__(self, config: MiniCPM_o_2_6Config): + super().__init__(config.text_config) + + text_config = config.text_config + self.language_model = MiniCPM_o_2_6TextModel(text_config) + self.vocab_size = text_config.vocab_size + self.lm_head = nn.Linear(text_config.hidden_size, text_config.vocab_size, bias=False) + + self.omni_config = config # Initialize weights and apply final processing self.post_init() @@ -156,12 +286,12 @@ def __init__(self, config): # init vision module self.vpm = self.init_vision_module() self.vision_dim = self.vpm.embed_dim - self.resampler = self.init_resampler(self.embed_dim, self.vision_dim) + self.resampler = self.init_resampler(config.query_num, self.embed_dim, self.vision_dim) # init audio module self.apm = self.init_audio_module() audio_output_dim = int(self.apm.config.encoder_ffn_dim // 4) - self.audio_avg_pooler = nn.AvgPool1d(self.config.audio_pool_step, stride=self.config.audio_pool_step) + self.audio_avg_pooler = nn.AvgPool1d(self.omni_config.audio_pool_step, stride=self.omni_config.audio_pool_step) self.audio_projection_layer = MultiModalProjector(in_dim=audio_output_dim, out_dim=self.embed_dim) self.audio_encoder_layer = -1 @@ -191,10 +321,8 @@ def init_tts( load tts tokenizer and vocos 1. try load form local 2. 
try load from huggingface """ - from .processing_minicpm_o_2_6 import ChatTTSProcessor - if tts_text_tokenizer_path is None: - tts_text_tokenizer_path = os.path.join(self.config._name_or_path, "assets/chattts_tokenizer") + tts_text_tokenizer_path = os.path.join(self.omni_config._name_or_path, "assets/chattts_tokenizer") if not os.path.exists(tts_text_tokenizer_path): # try from hf model_id tts_text_tokenizer_path = "openbmb/chattts_tokenizer" @@ -203,7 +331,7 @@ def init_tts( self.tts_processor = ChatTTSProcessor(text_tokenizer=tts_text_tokenizer) if vocos_ckpt_path is None: - vocos_ckpt_path = os.path.join(self.config._name_or_path, "assets/Vocos.pt") + vocos_ckpt_path = os.path.join(self.omni_config._name_or_path, "assets/Vocos.pt") if not os.path.exists(vocos_ckpt_path): vocos_ckpt_path = hf_hub_download(repo_id="openbmb/MiniCPM-o-2_6", subfolder="assets", filename="Vocos.pt") @@ -234,12 +362,12 @@ def initialize_vocos(self, ckpt_path): return vocos def init_vision_module(self): - if self.config._attn_implementation == "flash_attention_2": - self.config.vision_config._attn_implementation = "flash_attention_2" + if self.omni_config._attn_implementation == "flash_attention_2": + self.omni_config.vision_config._attn_implementation = "flash_attention_2" else: - self.config.vision_config._attn_implementation = "eager" - model = MiniCPMVisionTransformer(self.config.vision_config) - if self.config.drop_vision_last_layer: + self.omni_config.vision_config._attn_implementation = "eager" + model = MiniCPMVisionTransformer(self.omni_config.vision_config) + if self.omni_config.drop_vision_last_layer: model.encoder.layers = model.encoder.layers[:-1] setattr(model, "embed_dim", model.embeddings.embed_dim) @@ -247,9 +375,9 @@ def init_vision_module(self): return model - def init_resampler(self, embed_dim, vision_dim): + def init_resampler(self, query_num, embed_dim, vision_dim): return Resampler( - num_queries=self.config.query_num, + num_queries=query_num, embed_dim=embed_dim, num_heads=embed_dim // 128, kv_dim=vision_dim, @@ -257,11 +385,11 @@ def init_resampler(self, embed_dim, vision_dim): ) def init_audio_module(self): - model = MiniCPMWhisperEncoder(self.config.audio_config) + model = MiniCPMWhisperEncoder(self.omni_config.audio_config) return model def init_tts_module(self): - model = ConditionalChatTTS(self.config.tts_config) + model = ConditionalChatTTS(self.omni_config.tts_config) return model def get_input_embeddings(self): @@ -333,8 +461,8 @@ def _get_feat_extract_output_lengths(self, input_lengths: torch.LongTensor): """ input_lengths_after_cnn = (input_lengths - 1) // 2 + 1 input_lengths_after_pooling = ( - input_lengths_after_cnn - self.config.audio_pool_step - ) // self.config.audio_pool_step + 1 + input_lengths_after_cnn - self.omni_config.audio_pool_step + ) // self.omni_config.audio_pool_step + 1 input_lengths_after_pooling = input_lengths_after_pooling.to(dtype=torch.int32) return input_lengths_after_cnn, input_lengths_after_pooling @@ -362,7 +490,7 @@ def get_image_features(self, pixel_values_list, tgt_sizes, dtype, device): for i in range(B): patch_attn_mask[i, 0, : tgt_sizes[i][0] * tgt_sizes[i][1]] = True - vision_batch_size = self.config.vision_batch_size + vision_batch_size = self.omni_config.vision_batch_size all_pixel_values = all_pixel_values.type(dtype) if B > vision_batch_size: hs = [] @@ -447,7 +575,7 @@ def get_vllm_embedding(self, data): return new_vllm_embedding, vision_hidden_states def get_audio_embedding_streaming( - self, audio_features: torch.FloatTensor = [], 
audio_feature_lens_raw: List[List[int]] = [] + self, audio_features: torch.FloatTensor = [], audio_feature_lens_raw: list[list[int]] = [] ): r""" Extract audio embeddings in a streaming manner using cached key-value pairs. @@ -508,7 +636,7 @@ def get_audio_embedding_streaming( def get_audio_embedding( self, audio_features: torch.FloatTensor = [], - audio_feature_lens_raw: List[List[int]] = [], + audio_feature_lens_raw: list[list[int]] = [], chunk_length=-1, dummy=True, ): @@ -611,7 +739,7 @@ def get_omni_embedding(self, data, input_embeddings, chunk_length=-1, stream_inp assert len(audio_embeddings) == len(input_embeddings) audio_bounds = data["audio_bounds"] - if self.config.chunk_input: + if self.omni_config.chunk_input: for i in range(bs): audio_embs = torch.cat(audio_embeddings[i], dim=0).to( device=input_embeddings.device, dtype=input_embeddings.dtype @@ -680,9 +808,9 @@ def forward( >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." ```""" - output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_attentions = output_attentions if output_attentions is not None else self.omni_config.output_attentions output_hidden_states = ( - output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + output_hidden_states if output_hidden_states is not None else self.omni_config.output_hidden_states ) # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn) @@ -706,7 +834,7 @@ def forward( loss = None if labels is not None: - loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs) + loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.vocab_size, **kwargs) return CausalLMOutputWithPast( loss=loss, @@ -801,7 +929,7 @@ def generate( model_inputs["inputs_embeds"] = self.get_omni_embedding( model_inputs, input_embeddings=model_inputs["inputs_embeds"], - chunk_length=self.config.audio_chunk_length, + chunk_length=self.omni_config.audio_chunk_length, ) if stream: @@ -834,7 +962,7 @@ def stream_gen(): spk_embeds = wav_numpy = sr = None if not batched and use_tts_template and generate_audio: - result = processor.decode_text(outputs.sequences, processor.tokenizer) + result = processor.decode(outputs.sequences) mel_spec = self._generate_mel_spec( model_inputs, outputs, @@ -1176,7 +1304,7 @@ def check_uncompleted_token(ids): end = check_uncompleted_token(cur_ids[0]) left_ids = cur_ids[:, end:] cur_ids = cur_ids[:, :end] - text = processor.decode_text(cur_ids, tokenizer)[0] if end > 0 else "" + text = processor.decode(cur_ids)[0] if end > 0 else "" self.llm_past_key_values = outputs.past_key_values input_ids = outputs.sequences[:, -1:] @@ -1382,7 +1510,7 @@ def _generate_mel_spec( mel_spec = self.tts.decode_to_mel_specs(outputs.new_ids) return mel_spec - def _linear_overlap_add2_wav(self, frames: List[torch.Tensor], overlap: int): + def _linear_overlap_add2_wav(self, frames: list[torch.Tensor], overlap: int): """ Merge two audio waveforms with smooth in streaming audio generation. 
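
# Illustrative sketch of the linear overlap-add idea that the docstring above describes:
# two adjacent waveform chunks are cross-faded over `overlap` samples so consecutive
# streaming chunks join smoothly. This is a minimal re-implementation under assumed 1-D
# float tensors, not the body of `_linear_overlap_add2_wav` from this diff.
import torch

def linear_overlap_add2(prev: torch.Tensor, nxt: torch.Tensor, overlap: int) -> torch.Tensor:
    fade_out = torch.linspace(1.0, 0.0, overlap)  # tail of the previous chunk ramps down
    fade_in = 1.0 - fade_out                      # head of the next chunk ramps up
    seam = prev[-overlap:] * fade_out + nxt[:overlap] * fade_in
    return torch.cat([prev[:-overlap], seam, nxt[overlap:]])

prev_chunk, next_chunk = torch.ones(1000), torch.zeros(1000)
merged = linear_overlap_add2(prev_chunk, next_chunk, overlap=200)
print(merged.shape)  # torch.Size([1800]), i.e. 1000 + 1000 - 200 samples
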
Borrowed some codes from `https://github.com/huggingface/transformers/blob/main/src/transformers/models/encodec/modeling_encodec.py` @@ -1824,6 +1952,35 @@ def get_cache_usable_length(past_key_value: Cache, new_seq_length: int, layer_id return previous_seq_length +def whisper_eager_attention_forward( + module: nn.Module, + query: torch.Tensor, + key: torch.Tensor, + value: torch.Tensor, + attention_mask: Optional[torch.Tensor], + scaling: Optional[float] = None, + dropout: float = 0.0, + head_mask: Optional[torch.Tensor] = None, + **kwargs, +): + if scaling is None: + scaling = query.size(-1) ** -0.5 + + attn_weights = torch.matmul(query, key.transpose(2, 3)) * scaling + if attention_mask is not None and attention_mask.ndim == 4: + attn_weights = attn_weights + attention_mask[:, :, :, : key.shape[-2]] + + attn_weights = nn.functional.softmax(attn_weights, dim=-1) + + if head_mask is not None: + attn_weights = attn_weights * head_mask.view(1, -1, 1, 1) + + attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training) + attn_output = torch.matmul(attn_weights, value) + attn_output = attn_output.transpose(1, 2).contiguous() + + return attn_output, attn_weights + # Copied from transformers.models.whisper.modeling_whisper.WhisperAttention and support past_key_value class MiniCPMWhisperAttention(WhisperAttention): """Multi-headed attention from 'Attention Is All You Need' paper""" @@ -1832,7 +1989,7 @@ def forward( self, hidden_states: torch.Tensor, key_value_states: Optional[torch.Tensor] = None, - past_key_value: Optional[Cache] = None, + past_key_values: Optional[Cache] = None, attention_mask: Optional[torch.Tensor] = None, layer_head_mask: Optional[torch.Tensor] = None, output_attentions: bool = False, @@ -1860,34 +2017,34 @@ def forward( query_states = query_states.view(*q_input_shape) query_states = query_states.transpose(1, 2).contiguous() - if past_key_value is not None: - is_updated = past_key_value.is_updated.get(self.layer_idx) + if past_key_values is not None: + is_updated = past_key_values.is_updated.get(self.layer_idx) if is_cross_attention: # after the first generated id, we can subsequently re-use all key/value_states from cache - past_key_value.is_updated[self.layer_idx] = True - past_key_value = past_key_value.cross_attention_cache + past_key_values.is_updated[self.layer_idx] = True + past_key_values = past_key_values.cross_attention_cache else: - past_key_value = past_key_value.self_attention_cache + past_key_values = past_key_values.self_attention_cache # use key_value_states if cross attention current_states = key_value_states if key_value_states is not None else hidden_states - if is_cross_attention and past_key_value and is_updated: + if is_cross_attention and past_key_values and is_updated: # reuse k,v, cross_attentions - key_states = past_key_value.key_cache[self.layer_idx] - value_states = past_key_value.value_cache[self.layer_idx] + key_states = past_key_values.key_cache[self.layer_idx] + value_states = past_key_values.value_cache[self.layer_idx] else: key_states = self.k_proj(current_states).view(bsz, -1, self.num_heads, self.head_dim) value_states = self.v_proj(current_states).view(bsz, -1, self.num_heads, self.head_dim) key_states = key_states.transpose(1, 2).contiguous() value_states = value_states.transpose(1, 2).contiguous() - if past_key_value is not None: + if past_key_values is not None: # save all key/value_states to cache to be re-used for fast auto-regressive generation cache_position = cache_position if not is_cross_attention else 
None - key_states, value_states = past_key_value.update( + key_states, value_states = past_key_values.update( key_states, value_states, self.layer_idx, {"cache_position": cache_position} ) - attention_interface: Callable = eager_attention_forward + attention_interface: Callable = whisper_eager_attention_forward if self.config._attn_implementation != "eager": attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation] @@ -1907,12 +2064,12 @@ def forward( attn_output = attn_output.reshape(bsz, tgt_len, -1).contiguous() attn_output = self.out_proj(attn_output) - return attn_output, attn_weights, past_key_value + return attn_output, attn_weights, past_key_values # Copied from transformers.models.whisper.modeling_whisper.WhisperEncoderLayer and add use_cache for streaming inference class MiniCPMWhisperEncoderLayer(WhisperEncoderLayer): - def __init__(self, config: WhisperConfig, layer_idx: int = None): + def __init__(self, config: MiniCPMWhisperConfig, layer_idx: int = None): super().__init__() self.embed_dim = config.d_model self.self_attn = MiniCPMWhisperAttention( @@ -1964,7 +2121,7 @@ def forward( attention_mask=attention_mask, layer_head_mask=layer_head_mask, output_attentions=output_attentions, - past_key_value=past_key_values, + past_key_values=past_key_values, ) hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training) hidden_states = residual + hidden_states @@ -1996,7 +2153,7 @@ def forward( # Copied from from transformers.models.whisper.modeling_whisper.WhisperEncoder and add use_cache for streaming inference class MiniCPMWhisperEncoder(WhisperEncoder): - def __init__(self, config: WhisperConfig): + def __init__(self, config: MiniCPMWhisperConfig): super().__init__(config) self.layers = nn.ModuleList( [MiniCPMWhisperEncoderLayer(config, layer_idx=i) for i in range(config.encoder_layers)] @@ -2081,7 +2238,7 @@ def forward( only present if their respective `output_*` arguments are set to `True`. Example: - >>> from transformers import AutoFeatureExtractor, WhisperConfig, WhisperForConditionalGeneration + >>> from transformers import AutoFeatureExtractor, MiniCPMWhisperConfig, WhisperForConditionalGeneration >>> import torch >>> # Load a feature extractor and a Whisper model @@ -2289,7 +2446,7 @@ class GFSQ(nn.Module): def __init__( self, dim: int, - levels: List[int], + levels: list[int], G: int, R: int, eps=1e-5, @@ -2587,14 +2744,14 @@ class ConditionalChatTTSGenerationOutput(ModelOutput): Args: new_ids (torch.LongTensor): Newly generated audio code sequence, shape (batch_size, sequence_length, num_vq). audio_input_ids (torch.LongTensor): Updated input IDs including condition and generated audio codes, shape (batch_size, full_sequence_length, num_vq). - past_key_values (Tuple[Tuple[torch.FloatTensor]]): Tuple containing pre-computed keys and values used for attention mechanism. Each element has shape (batch_size, num_heads, sequence_length, embed_size_per_head). + past_key_values (tuple[tuple[torch.FloatTensor]]): tuple containing pre-computed keys and values used for attention mechanism. Each element has shape (batch_size, num_heads, sequence_length, embed_size_per_head). finished (bool): Boolean indicating whether generation is complete. 
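
# Illustrative shape check, not part of the diff above: exercising the
# `whisper_eager_attention_forward` helper added a few hunks earlier in this file.
# Tensor sizes are arbitrary, and the import path is an assumption about the generated module.
import torch
from torch import nn
from transformers.models.minicpm_o_2_6.modeling_minicpm_o_2_6 import whisper_eager_attention_forward  # assumed path

dummy = nn.Module()            # the helper only reads `.training` to gate dropout
q = torch.randn(2, 8, 16, 64)  # (batch, num_heads, tgt_len, head_dim)
k = torch.randn(2, 8, 16, 64)
v = torch.randn(2, 8, 16, 64)
attn_output, attn_weights = whisper_eager_attention_forward(dummy, q, k, v, attention_mask=None)
print(attn_output.shape)   # torch.Size([2, 16, 8, 64]): transposed back to (batch, tgt_len, heads, head_dim)
print(attn_weights.shape)  # torch.Size([2, 8, 16, 16])
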
""" new_ids: torch.LongTensor = None audio_input_ids: torch.LongTensor = None - past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None + past_key_values: Optional[tuple[tuple[torch.FloatTensor]]] = None finished: bool = None @@ -2708,31 +2865,6 @@ def forward( @auto_docstring class MiniCPMConditionalTTSTextPreTrainedModel(LlamaPreTrainedModel): config_class = MiniCPMConditionalTTSTextConfig - base_model_prefix = "model" - supports_gradient_checkpointing = True - _no_split_modules = ["MiniCPMConditionalTTSTextDecoderLayer"] - _skip_keys_device_placement = ["past_key_values"] - _supports_flash_attn_2 = True - _supports_sdpa = True - _supports_flex_attn = True - _supports_cache_class = True - _supports_quantized_cache = True - _supports_static_cache = True - _supports_attention_backend = True - - def _init_weights(self, module): - std = self.config.initializer_range - if isinstance(module, nn.Linear): - module.weight.data.normal_(mean=0.0, std=std) - if module.bias is not None: - module.bias.data.zero_() - elif isinstance(module, nn.Embedding): - module.weight.data.normal_(mean=0.0, std=std) - if module.padding_idx is not None: - module.weight.data[module.padding_idx].zero_() - elif isinstance(module, MiniCPMConditionalTTSTextRMSNorm): - module.weight.data.fill_(1.0) - @auto_docstring class MiniCPMConditionalTTSTextModel(LlamaModel): @@ -2813,7 +2945,7 @@ def forward( hidden_states, attention_mask=causal_mask, position_ids=position_ids, - past_key_value=past_key_values, + past_key_values=past_key_values, output_attentions=output_attentions, use_cache=use_cache, cache_position=cache_position, @@ -3044,16 +3176,7 @@ def __init__(self, config: MiniCPMConditionalTTSConfig): dvae = DVAE() self.dvae = dvae - model_config = MiniCPMConditionalTTSTextConfig( - hidden_size=config.hidden_size, - intermediate_size=config.intermediate_size, - num_attention_heads=config.num_attention_heads, - num_hidden_layers=config.num_hidden_layers, - max_position_embeddings=config.max_position_embeddings, - attn_implementation=config.attn_implementation, - ) - - model = MiniCPMConditionalTTSTextModel(model_config) + model = MiniCPMConditionalTTSTextModel(config.tts_text_config) self.model = model @torch.inference_mode() @@ -3105,7 +3228,7 @@ def prefill_text( self, input_ids: torch.Tensor, position_ids: torch.LongTensor, - past_key_values: List[Tuple[torch.Tensor, torch.Tensor]], + past_key_values: list[tuple[torch.Tensor, torch.Tensor]], lm_spk_emb_last_hidden_states: Optional[torch.Tensor] = None, ): """Prefill a chunk of new text tokens in streaming setting. @@ -3114,7 +3237,7 @@ def prefill_text( Args: input_ids (Tensor): Tensor of shape [batch_size, seq_len] position_ids (LongTensor): Tensor of shape [batch_size, seq_len] - past_key_values (List[Tuple[Tensor]]): KV Cache of all layers, each layer is a tuple (Tensor, Tensor) denoting keys and values. Each tensor is of seq_len = `self.streaming_text_reserved_len`. `past_key_values` will be updated. + past_key_values (List[tuple[Tensor]]): KV Cache of all layers, each layer is a tuple (Tensor, Tensor) denoting keys and values. Each tensor is of seq_len = `self.streaming_text_reserved_len`. `past_key_values` will be updated. lm_spk_emb_last_hidden_states (Tensor, optional): Tensor of shape [batch_size, num_spk_emb, llm_dim]. Defaults to None. lm_last_hidden_states (Tensor, optional): _description_. Defaults to None. 
@@ -3179,7 +3302,7 @@ def prefill_text( def prefill_audio_ids( self, input_ids: torch.Tensor, - past_key_values: List[Tuple[torch.Tensor, torch.Tensor]], + past_key_values: list[tuple[torch.Tensor, torch.Tensor]], streaming_tts_text_mask=None, add_audio_bos: bool = True, ): @@ -3188,7 +3311,7 @@ def prefill_audio_ids( Args: input_ids (torch.Tensor): (1, seq_len, num_vq) Audio input token ids. - past_key_values (List[Tuple[torch.Tensor, torch.Tensor]]): Past key values for attention mechanism. + past_key_values (List[tuple[torch.Tensor, torch.Tensor]]): Past key values for attention mechanism. """ assert input_ids.shape[0] == 1 assert past_key_values is not None @@ -3234,15 +3357,15 @@ def prefill_audio_ids( def generate( self, input_ids: torch.Tensor, - past_key_values: List[Tuple[torch.Tensor, torch.Tensor]], + past_key_values: list[tuple[torch.Tensor, torch.Tensor]], temperature: torch.Tensor, eos_token: Union[int, torch.Tensor], streaming_tts_text_mask=None, force_no_stop=False, min_new_token=10, max_new_token=50, - logits_warpers: List[LogitsProcessor] = [], - logits_processors: List[CustomRepetitionPenaltyLogitsProcessorRepeat] = [], + logits_warpers: list[LogitsProcessor] = [], + logits_processors: list[CustomRepetitionPenaltyLogitsProcessorRepeat] = [], show_tqdm=False, ): """Generate audio codes in streaming setting or non-streaming setting. @@ -3254,7 +3377,7 @@ def generate( Args: input_ids (torch.Tensor): Input token ids. - past_key_values (List[Tuple[torch.Tensor, torch.Tensor]]): Past key values for attention mechanism. + past_key_values (List[tuple[torch.Tensor, torch.Tensor]]): Past key values for attention mechanism. temperature (torch.Tensor): Temperature for sampling. eos_token (Union[int, torch.Tensor]): End of sequence token. streaming_tts_text_mask (Optional[torch.Tensor], optional): Mask for streaming TTS text. Defaults to None. @@ -3470,7 +3593,7 @@ def generate( @torch.inference_mode() def decode_to_mel_specs( self, - result_list: List[torch.Tensor], + result_list: list[torch.Tensor], ): """Decode discrete audio codes to mel spectrograms. @@ -3813,13 +3936,6 @@ def forward( # See all SigLIP models at https://huggingface.co/models?filter=siglip ] -if is_flash_attn_2_available(): - from flash_attn import flash_attn_func - from flash_attn import flash_attn_varlen_func - from flash_attn.bert_padding import index_first_axis # noqa - from flash_attn.bert_padding import pad_input - from flash_attn.bert_padding import unpad_input - # Copied from transformers.models.llama.modeling_llama._get_unpad_data def _get_unpad_data(attention_mask): @@ -3950,11 +4066,11 @@ class MiniCPMVisionModelOutput(SiglipVisionModelOutput): last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): Sequence of hidden-states at the output of the last layer of the model. hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): - Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + + tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. 
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): - Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. @@ -3964,7 +4080,7 @@ class MiniCPMVisionModelOutput(SiglipVisionModelOutput): class MiniCPMVisionEmbedding(nn.Module): - def __init__(self, config: SiglipVisionConfig): + def __init__(self, config: MiniCPMVisionConfig): super().__init__() self.config = config self.embed_dim = config.hidden_size @@ -4057,7 +4173,7 @@ def forward( hidden_states: torch.Tensor, attention_mask: Optional[torch.Tensor] = None, output_attentions: Optional[bool] = False, - ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]: + ) -> tuple[torch.Tensor, Optional[torch.Tensor], Optional[tuple[torch.Tensor]]]: """Input shape: Batch x Time x Channel""" batch_size, q_len, _ = hidden_states.size() @@ -4121,11 +4237,11 @@ def forward( hidden_states: torch.Tensor, attention_mask: Optional[torch.LongTensor] = None, position_ids: Optional[torch.LongTensor] = None, - past_key_value: Optional[Tuple[torch.Tensor]] = None, + past_key_value: Optional[tuple[torch.Tensor]] = None, output_attentions: bool = False, use_cache: bool = False, **kwargs, - ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]: + ) -> tuple[torch.Tensor, Optional[torch.Tensor], Optional[tuple[torch.Tensor]]]: output_attentions = False bsz, q_len, _ = hidden_states.size() @@ -4297,7 +4413,7 @@ class MiniCPMVisionMLP(SiglipMLP): class MiniCPMVisionEncoderLayer(SiglipEncoderLayer): - def __init__(self, config: SiglipVisionConfig): + def __init__(self, config: MiniCPMVisionConfig): super().__init__() self.embed_dim = config.hidden_size self._use_flash_attention_2 = config._attn_implementation == "flash_attention_2" @@ -4315,7 +4431,7 @@ class MiniCPMVisionPreTrainedModel(PreTrainedModel): models. """ - config_class = SiglipVisionConfig + config_class = MiniCPMVisionConfig base_model_prefix = "siglip" supports_gradient_checkpointing = True @@ -4358,7 +4474,7 @@ def _initialize_weights(self, module): Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior. Parameters: - config ([`SiglipVisionConfig`]): Model configuration class with all the parameters of the model. + config ([`MiniCPMVisionConfig`]): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights. """ @@ -4385,10 +4501,10 @@ class MiniCPMVisionEncoder(SiglipEncoder): Transformer encoder consisting of `config.num_hidden_layers` self attention layers. Each layer is a [`SiglipEncoderLayer`]. 
Args: - config: SiglipConfig + config: MiniCPMVisionConfig """ - def __init__(self, config: SiglipVisionConfig): + def __init__(self, config: MiniCPMVisionConfig): super().__init__() self.config = config self.layers = nn.ModuleList([MiniCPMVisionEncoderLayer(config) for _ in range(config.num_hidden_layers)]) @@ -4402,7 +4518,7 @@ def forward( output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, - ) -> Union[Tuple, BaseModelOutput]: + ) -> Union[tuple, BaseModelOutput]: r""" Args: inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): @@ -4469,12 +4585,12 @@ def forward( """The vision model from SigLIP without any head or projection on top.""", SIGLIP_START_DOCSTRING ) class MiniCPMVisionTransformer(MiniCPMVisionPreTrainedModel): - config_class = SiglipVisionConfig + config_class = MiniCPMVisionConfig main_input_name = "pixel_values" _supports_flash_attn_2 = True _no_split_modules = [] - def __init__(self, config: SiglipVisionConfig): + def __init__(self, config: MiniCPMVisionConfig): super().__init__(config) self.config = config embed_dim = config.hidden_size @@ -4491,7 +4607,7 @@ def get_input_embeddings(self) -> nn.Module: return self.embeddings.patch_embedding @add_start_docstrings_to_model_forward(SIGLIP_VISION_INPUTS_DOCSTRING) - @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=SiglipVisionConfig) + @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=MiniCPMVisionConfig) def forward( self, pixel_values, @@ -4500,7 +4616,7 @@ def forward( output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, - ) -> Union[Tuple, BaseModelOutputWithPooling]: + ) -> Union[tuple, BaseModelOutputWithPooling]: r""" Returns: """ @@ -4561,4 +4677,4 @@ def forward( ) -__all__ = ["MiniCPM_o_2_6ForConditionalGeneration", "MiniCPM_o_2_6Model", "MiniCPM_o_2_6PreTrainedModel"] +__all__ = ["MiniCPM_o_2_6ForConditionalGeneration", "MiniCPM_o_2_6TextModel", "MiniCPM_o_2_6PreTrainedModel", "MiniCPM_o_2_6Config"] diff --git a/src/transformers/models/minicpm_o_2_6/processing_minicpm_o_2_6.py b/src/transformers/models/minicpm_o_2_6/processing_minicpm_o_2_6.py index 5a6c5dc9c65f..0b10e2ea50cd 100644 --- a/src/transformers/models/minicpm_o_2_6/processing_minicpm_o_2_6.py +++ b/src/transformers/models/minicpm_o_2_6/processing_minicpm_o_2_6.py @@ -19,27 +19,34 @@ import math import re -from typing import Any, Dict, List, Literal, Optional, Union +from typing import Any, Dict, Optional, Union -import librosa import numpy as np import torch -import torchaudio import json from copy import deepcopy from PIL import Image -from transformers.image_utils import ImageInput -from transformers.processing_utils import ProcessorMixin, ProcessingKwargs, Unpack, ImagesKwargs, AudioKwargs -from transformers.tokenization_utils_base import PreTokenizedInput, TextInput -from transformers.utils import logging, TensorType +from ...image_utils import ImageInput +from ...processing_utils import ProcessorMixin, ProcessingKwargs, Unpack, ImagesKwargs, AudioKwargs +from ...tokenization_utils_base import PreTokenizedInput, TextInput from ...feature_extraction_utils import BatchFeature -from ...utils import is_torch_device, is_torch_dtype, requires_backends, TensorType +from ...utils import is_torch_device, is_torch_dtype, requires_backends, TensorType, logging logger = logging.get_logger(__name__) +def 
recursive_converter(converter, value): + if isinstance(value, list): + new_value = [] + for v in value: + new_value += [recursive_converter(converter, v)] + return new_value + else: + return converter(value) + + class MiniCPMOBatchFeature(BatchFeature): r""" Extend from BatchFeature for supporting various image size @@ -153,19 +160,18 @@ class MiniCPM_o_2_6Processor(ProcessorMixin): attributes = ["tokenizer", "image_processor", "feature_extractor"] tokenizer_class = "AutoTokenizer" - image_processor_class = "AutoImageProcessor" + image_processor_class = "MiniCPMVImageProcessorFast" feature_extractor_class = "MiniCPM_o_2_6FeatureExtractor" def __init__(self, tokenizer=None, image_processor=None, feature_extractor=None, chat_template=None): super().__init__(tokenizer, image_processor, feature_extractor, chat_template=chat_template) - self.version = image_processor.version self.default_tts_chat_template = "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n<|spk_bos|><|spk|><|spk_eos|><|tts_bos|>' }}{% endif %}" def __call__( self, - text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]], + text: Union[TextInput, PreTokenizedInput, list[TextInput], list[PreTokenizedInput]], images: ImageInput = None, - audios: Union[np.ndarray, List[np.ndarray], List[List[np.ndarray]]] = None, + audios: Union[np.ndarray, list[np.ndarray], list[list[np.ndarray]]] = None, **kwargs: Unpack[MiniCPM_o_2_6ProcessorKwargs], ) -> MiniCPMOBatchFeature: output_kwargs = self._merge_kwargs(MiniCPM_o_2_6ProcessorKwargs, self.tokenizer.init_kwargs, **kwargs) @@ -179,13 +185,12 @@ def __call__( image_inputs = None if audios: - audio_features, audio_feature_lens, audio_phs = self.feature_extractor( - self.tokenizer, + audio_features, audio_feature_lens = self.feature_extractor( audios, audio_parts=audio_kwargs["audio_parts"], - chunk_input=audio_kwargs["chunk_input"], sampling_rate=audio_kwargs["sampling_rate"], ) + audio_phs = self.get_audios_placeholder(audios=audios, chunk_input=audio_kwargs["chunk_input"]) else: audio_features, audio_feature_lens, audio_phs = [], [], [] @@ -300,33 +305,23 @@ def apply_chat_template( ) return inputs - def decode(self, outputs, batched=False): - result = self.decode_text(outputs.sequences, self.tokenizer) - if not batched: - result = result[0] - if isinstance(result, list): - result = [i.replace(self.tokenizer.tts_end, "") for i in result] - else: - result = result.replace(self.tokenizer.tts_end, "") - return result - - def decode_text(self, result_ids, tokenizer): + def decode(self, result_ids, skeip_special_tokens: bool = False): result_text = [] for result in result_ids: result = result[result != 0] start, end = 0, len(result) for i, tok in enumerate(result): - if tok == tokenizer.bos_id: + if tok == self.tokenizer.bos_id: start = i + 1 else: break for i in range(len(result) - 1, -1, -1): - if result[i] in tokenizer.terminator_ids: + if result[i] in self.tokenizer.terminator_ids: end = i else: break result = result[start:end] - result_text.append(tokenizer.decode(result)) + result_text.append(self.tokenizer.decode(result, skip_special_tokens=skeip_special_tokens)) return result_text def get_sys_prompt(self, ref_audio=None, mode="default", language="zh"): @@ -456,7 +451,7 @@ def _convert_omni_to_inputs( self, images, audio_phs, - texts: Union[str, List[str]], + texts: Union[str, list[str]], truncation=None, max_length=None, 
max_slice_nums=None, @@ -502,8 +497,8 @@ def _convert_omni_to_inputs( audio_id = 0 for i, chunk in enumerate(text_chunks): if chunk == self.tokenizer.image_tag: - image_placeholder = self.image_processor.get_slice_image_placeholder( - self.tokenizer, image_sizes[index][image_id], image_id, max_slice_nums, use_image_id + image_placeholder = self.get_slice_image_placeholder( + image_sizes[index][image_id], image_id, max_slice_nums, use_image_id ) image_id += 1 text_chunks[i] = image_placeholder @@ -553,273 +548,99 @@ def _convert_omni_to_inputs( return data - @property - # Copied from transformers.models.clip.processing_clip.CLIPProcessor.model_input_names - def model_input_names(self): - tokenizer_input_names = self.tokenizer.model_input_names - image_processor_input_names = self.image_processor.model_input_names - feature_extractor_input_names = self.feature_extractor.model_input_names - return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names + feature_extractor_input_names)) - + def get_slice_image_placeholder(self, image_size, image_idx=0, max_slice_nums=None, use_image_id=None): + max_slice_nums = self.image_processor.max_slice_nums if max_slice_nums is None else int(max_slice_nums) + assert max_slice_nums > 0 + grid = self.image_processor.get_sliced_grid(image_size=image_size, max_slice_nums=max_slice_nums) -class MelSpectrogramFeatures(torch.nn.Module): - def __init__( - self, - sample_rate=24000, - n_fft=1024, - hop_length=256, - n_mels=100, - padding: Literal["center", "same"] = "center", - ): - super().__init__() - if padding not in ["center", "same"]: - raise ValueError("Padding must be 'center' or 'same'.") - self.padding = padding - self.mel_spec = torchaudio.transforms.MelSpectrogram( - sample_rate=sample_rate, - n_fft=n_fft, - hop_length=hop_length, - n_mels=n_mels, - center=padding == "center", - power=1, + image_placeholder = ( + self.tokenizer.im_start + + self.tokenizer.unk_token * self.image_processor.image_feature_size + + self.tokenizer.im_end + ) + use_image_id = self.image_processor.use_image_id if use_image_id is None else bool(use_image_id) + if use_image_id: + final_placeholder = ( + f"{self.tokenizer.im_id_start}{image_idx}{self.tokenizer.im_id_end}" + image_placeholder + ) + else: + final_placeholder = image_placeholder + + if self.image_processor.slice_mode: + final_placeholder = final_placeholder + self.get_grid_placeholder(grid=grid) + return final_placeholder + + def get_grid_placeholder(self, grid): + if grid is None: + return "" + slice_image_placeholder = ( + self.tokenizer.slice_start + + self.tokenizer.unk_token * self.image_processor.image_feature_size + + self.tokenizer.slice_end ) - def __call__(self, audio: torch.Tensor) -> torch.Tensor: - """ - audio: Tensor([num_channels, num_samples]) - """ - return super().__call__(audio) - - def forward(self, audio: torch.Tensor) -> torch.Tensor: - """ - audio: Tensor([num_channels, num_samples]) - """ - mel: torch.Tensor = self.mel_spec(audio) - features = torch.log(torch.clip(mel, min=1e-5)) - return features - - -class ChatTTSProcessor: - def __init__(self, text_tokenizer): - self.audio_processor = MelSpectrogramFeatures() - self.text_tokenizer = text_tokenizer - - def __call__(self, text_list, audio_list): - assert len(text_list) == len(audio_list) - input_ids_varlen = [] - for text in text_list: - input_ids_ = self.text_tokenizer.encode( - text, return_tensors="pt", add_special_tokens=False - ) # [1, seq_len] - input_ids_ = input_ids_.squeeze(0) # [seq_len] - 
input_ids_varlen.append(input_ids_) - - audio_features_varlen = [] - for audio in audio_list: - assert audio.shape.__len__() == 1 # [seq_len] - try: - # [100(num_mel_bins), seq_len_mel] - mel = self.audio_processor(audio) - except Exception as e: - raise e - audio_features_varlen.append(mel) - - return { - "tts_input_ids_varlen": input_ids_varlen, # return List[Tensor] - # return List[Tensor] - "tts_input_features_varlen": audio_features_varlen, - } - - -def is_silent(data): - if np.abs(data).max() < 3e-3: - return True - else: - return False - - -def sentence_end(txt): - for c in [".", "ใ€‚", "!", "?", "๏ผ", "๏ผŸ"]: - if c in txt: - if c == ".": # check not number before it like 1. - idx = txt.find(c) - if idx > 0: - if txt[idx - 1].isdigit(): - continue - return c - return "" - - -class NumberToTextConverter: - r""" - A helper class to ensure text-to-speech (TTS) systems read numeric digits - in the desired language (Chinese or English) digit-by-digit. It forcibly - replaces all numeric substrings in text with their language-specific - textual representations, thereby reducing the likelihood of TTS mistakes - on numbers. - Note: MiniCPM-o 2.6 only use this in streaming mode. - - Attributes: - num_to_chinese (dict): - Mapping from digit (str) to its Chinese textual form (str). - num_to_english (dict): - Mapping from digit (str) to its English textual form (str). - - Example: - >>> converter = NumberToTextConverter() - >>> converter.replace_numbers_with_text("ๆˆ‘ๆœ‰2ไธช่‹นๆžœ", language="chinese") - 'ๆˆ‘ๆœ‰ไธคไธช่‹นๆžœ' - >>> converter.replace_numbers_with_text("I have 23 books", language="english") - 'I have two three books' - """ - - def __init__(self): - self.num_to_chinese = { - "0": "้›ถ", - "1": "ไธ€", - "2": "ไบŒ", - "3": "ไธ‰", - "4": "ๅ››", - "5": "ไบ”", - "6": "ๅ…ญ", - "7": "ไธƒ", - "8": "ๅ…ซ", - "9": "ไน", - } - self.num_to_english = { - "0": "zero", - "1": "one", - "2": "two", - "3": "three", - "4": "four", - "5": "five", - "6": "six", - "7": "seven", - "8": "eight", - "9": "nine", - } - - def number_to_chinese_digit_by_digit(self, num_str): - result = "" - for char in num_str: - if char in self.num_to_chinese: - result += self.num_to_chinese[char] - return result - - def number_to_english_digit_by_digit(self, num_str): - result = [] - for char in num_str: - if char in self.num_to_english: - result.append(self.num_to_english[char]) - return " ".join(result) - - def detect_language(self, text): - chinese_count = len(re.findall(r"[\u4e00-\u9fff]", text)) - english_count = len(re.findall(r"[a-zA-Z]", text)) - return "chinese" if chinese_count >= english_count else "english" - - def replace_numbers_with_text(self, text, language=None): - if language is None: - language = self.detect_language(text) - numbers = re.findall(r"\d+", text) - - for num in numbers: - if language == "chinese": - replacement = self.number_to_chinese_digit_by_digit(num) + cols = grid[0] + rows = grid[1] + slices = [] + for i in range(rows): + lines = [] + for j in range(cols): + lines.append(slice_image_placeholder) + slices.append("".join(lines)) + + slice_placeholder = "\n".join(slices) + return slice_placeholder + + def get_audios_placeholder(self, audios, + chunk_input: Optional[bool] = False, + chunk_length: Optional[int] = 1): + audios_list = self.feature_extractor.format_audios(audios) + audio_ph_list = [] + for audios in audios_list: + if audios: + audio_ph_list.append( + [self.get_single_audio_placeholder(len(a), chunk_input, chunk_length) for a in audios] + ) else: - replacement = 
self.number_to_english_digit_by_digit(num) - text = text.replace(num, replacement, 1) - - return text - - -class VoiceChecker: - r""" - A simple utility class to detect silence or low variation in consecutive audio chunks by comparing - the mel-spectrogram distances. It keeps track of consecutive zero-distance and low-distance chunks - to decide if the audio is considered "bad" (e.g., overly silent or not changing enough). - - Attributes: - previous_mel (`np.ndarray` or `None`): - Holds the previously observed mel-spectrogram in decibel scale. Used to compute - the next distance; reset via :meth:`reset`. - consecutive_zeros (`int`): - The number of consecutive chunks that were detected as silent (distance = 0). - consecutive_low_distance (`int`): - The number of consecutive chunks whose distance was below the threshold. - - Example: - >>> checker = VoiceChecker() - >>> # Suppose we have audio_wav (list or np.ndarray) and mel_spec (np.ndarray) - >>> # We split them into chunks and call checker.is_bad(...) - >>> is_audio_bad = checker.is_bad(audio_wav, mel_spec, chunk_size=2560, thresh=100.0) - >>> if is_audio_bad: - ... print("Audio deemed bad!") - >>> # Reset states if needed - >>> checker.reset() - """ - - def __init__(self): - self.previous_mel = None - self.consecutive_zeros = 0 - self.consecutive_low_distance = 0 - - def compute_distance(self, audio_chunk, mel_spec): - if is_silent(audio_chunk): - return 0.0 # ๆฃ€ๆŸฅๆ˜ฏๅฆไธบ็ฉบ็™ฝ็‰‡ๆฎต - - mel_db = librosa.power_to_db(mel_spec) - if self.previous_mel is None: - self.previous_mel = mel_db - return -1.0 - - distance = np.linalg.norm(np.mean(mel_db, axis=1) - np.mean(self.previous_mel, axis=1)) - self.previous_mel = mel_db - return distance - - def is_bad(self, audio_wav, mel_spec, chunk_size=2560, thresh=100.0): - num_chunks = len(audio_wav) // chunk_size - mel_chunk_size = mel_spec.shape[-1] // num_chunks - for i in range(num_chunks): - audio_chunk = audio_wav[i * chunk_size : (i + 1) * chunk_size] - mel_spec_chunk = mel_spec[:, i * mel_chunk_size : (i + 1) * mel_chunk_size] - - distance = self.compute_distance(audio_chunk, mel_spec_chunk) - logger.warning( - f"mel dist: {distance:.1f}, zero: {self.consecutive_zeros}, low: {self.consecutive_low_distance}" + audio_ph_list.append([]) + return audio_ph_list + + def get_single_audio_placeholder(self, audio_lens, chunk_input, chunk_length): + pool_step = 2 + feature_lens = math.ceil(audio_lens / self.feature_extractor.hop_length) + + feature_lens = (feature_lens - 1) // 2 + 1 + output_lens = (feature_lens - pool_step) // pool_step + 1 + + if chunk_input: + fbank_feat_in_chunk = int(chunk_length * 100) + cnn_feat_in_chunk = (fbank_feat_in_chunk - 1) // 2 + 1 + audio_embeds_in_chunk = (cnn_feat_in_chunk - pool_step) // pool_step + 1 + num_audio_chunks = (output_lens + audio_embeds_in_chunk - 1) // audio_embeds_in_chunk + + place_holders = "" + total_unk_len = 0 + for _ in range(num_audio_chunks): + unk_len = min(audio_embeds_in_chunk, output_lens - total_unk_len) + place_holders += ( + self.tokenizer.audio_start + self.tokenizer.unk_token * unk_len + self.tokenizer.audio_end + ) + total_unk_len += unk_len + audio_placeholder = place_holders + else: + audio_placeholder = ( + self.tokenizer.audio_start + self.tokenizer.unk_token * output_lens + self.tokenizer.audio_end ) - if distance == 0: - self.consecutive_low_distance = 0 # reset - self.consecutive_zeros += 1 - if self.consecutive_zeros >= 12: - logger.warning("VoiceChecker detected 1.2 s silent. 
Marking as failed.") - return True - elif distance < thresh: - self.consecutive_zeros = 0 - self.consecutive_low_distance += 1 - if self.consecutive_low_distance >= 5: - logger.warning("VoiceChecker detected 5 consecutive low distance chunks. Marking as failed.") - return True - else: - self.consecutive_low_distance = 0 - self.consecutive_zeros = 0 - - return False - - def reset(self): - self.previous_mel = None - self.consecutive_zeros = 0 - self.consecutive_low_distance = 0 + return audio_placeholder -def recursive_converter(converter, value): - if isinstance(value, list): - new_value = [] - for v in value: - new_value += [recursive_converter(converter, v)] - return new_value - else: - return converter(value) + @property + # Copied from transformers.models.clip.processing_clip.CLIPProcessor.model_input_names + def model_input_names(self): + tokenizer_input_names = self.tokenizer.model_input_names + image_processor_input_names = self.image_processor.model_input_names + feature_extractor_input_names = self.feature_extractor.model_input_names + return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names + feature_extractor_input_names)) __all__ = ["MiniCPM_o_2_6Processor"] diff --git a/src/transformers/models/minicpm_o_2_6/tokenization_minicpm_o_2_6.py b/src/transformers/models/minicpm_o_2_6/tokenization_minicpm_o_2_6.py deleted file mode 100644 index b2c910ab14f5..000000000000 --- a/src/transformers/models/minicpm_o_2_6/tokenization_minicpm_o_2_6.py +++ /dev/null @@ -1,24 +0,0 @@ -# coding=utf-8 -# Copyright 2025 The OpenBMB Team. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from transformers import Qwen2Tokenizer - - -class MiniCPM_o_2_6Tokenizer(Qwen2Tokenizer): - def __init__(self, **kwargs): - super().__init__(**kwargs) - - -__all__ = ["MiniCPM_o_2_6Tokenizer"] diff --git a/src/transformers/models/minicpm_o_2_6/tokenization_minicpm_o_2_6_fast.py b/src/transformers/models/minicpm_o_2_6/tokenization_minicpm_o_2_6_fast.py index 8d943508c40e..5fcee76500e0 100644 --- a/src/transformers/models/minicpm_o_2_6/tokenization_minicpm_o_2_6_fast.py +++ b/src/transformers/models/minicpm_o_2_6/tokenization_minicpm_o_2_6_fast.py @@ -13,7 +13,7 @@ # See the License for the specific language governing permissions and # limitations under the License. -from transformers import Qwen2TokenizerFast +from ..qwen2.tokenization_qwen2_fast import Qwen2TokenizerFast class MiniCPM_o_2_6TokenizerFast(Qwen2TokenizerFast): diff --git a/src/transformers/models/minicpm_o_2_6/tts_processing_minicpm_o_2_6.py b/src/transformers/models/minicpm_o_2_6/tts_processing_minicpm_o_2_6.py new file mode 100644 index 000000000000..24808aa34f4e --- /dev/null +++ b/src/transformers/models/minicpm_o_2_6/tts_processing_minicpm_o_2_6.py @@ -0,0 +1,277 @@ +# coding=utf-8 +# Copyright 2025 The OpenBMB Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import re
+from typing import Literal
+
+import librosa
+import numpy as np
+import torch
+import torchaudio
+
+from ...utils import logging
+
+
+logger = logging.get_logger(__name__)
+
+
+class MelSpectrogramFeatures(torch.nn.Module):
+    def __init__(
+        self,
+        sample_rate=24000,
+        n_fft=1024,
+        hop_length=256,
+        n_mels=100,
+        padding: Literal["center", "same"] = "center",
+    ):
+        super().__init__()
+        if padding not in ["center", "same"]:
+            raise ValueError("Padding must be 'center' or 'same'.")
+        self.padding = padding
+        self.mel_spec = torchaudio.transforms.MelSpectrogram(
+            sample_rate=sample_rate,
+            n_fft=n_fft,
+            hop_length=hop_length,
+            n_mels=n_mels,
+            center=padding == "center",
+            power=1,
+        )
+
+    def __call__(self, audio: torch.Tensor) -> torch.Tensor:
+        """
+        audio: Tensor([num_channels, num_samples])
+        """
+        return super().__call__(audio)
+
+    def forward(self, audio: torch.Tensor) -> torch.Tensor:
+        """
+        audio: Tensor([num_channels, num_samples])
+        """
+        mel: torch.Tensor = self.mel_spec(audio)
+        features = torch.log(torch.clip(mel, min=1e-5))
+        return features
+
+
+class ChatTTSProcessor:
+    def __init__(self, text_tokenizer):
+        self.audio_processor = MelSpectrogramFeatures()
+        self.text_tokenizer = text_tokenizer
+
+    def __call__(self, text_list, audio_list):
+        assert len(text_list) == len(audio_list)
+        input_ids_varlen = []
+        for text in text_list:
+            input_ids_ = self.text_tokenizer.encode(
+                text, return_tensors="pt", add_special_tokens=False
+            )  # [1, seq_len]
+            input_ids_ = input_ids_.squeeze(0)  # [seq_len]
+            input_ids_varlen.append(input_ids_)
+
+        audio_features_varlen = []
+        for audio in audio_list:
+            assert audio.ndim == 1  # [seq_len]
+            mel = self.audio_processor(audio)  # [100 (num_mel_bins), seq_len_mel]
+            audio_features_varlen.append(mel)
+
+        return {
+            "tts_input_ids_varlen": input_ids_varlen,  # List[Tensor]
+            "tts_input_features_varlen": audio_features_varlen,  # List[Tensor]
+        }
+
+
+def is_silent(data):
+    if np.abs(data).max() < 3e-3:
+        return True
+    else:
+        return False
+
+
+def sentence_end(txt):
+    for c in [".", "。", "!", "?", "！", "？"]:
+        if c in txt:
+            if c == ".":  # skip "." when it directly follows a digit (e.g. "1.")
+                idx = txt.find(c)
+                if idx > 0:
+                    if txt[idx - 1].isdigit():
+                        continue
+            return c
+    return ""
+
+
+class NumberToTextConverter:
+    r"""
+    A helper class to ensure text-to-speech (TTS) systems read numeric digits
+    in the desired language (Chinese or English) digit-by-digit. It forcibly
+    replaces all numeric substrings in text with their language-specific
+    textual representations, thereby reducing the likelihood of TTS mistakes
+    on numbers.
+    Note: MiniCPM-o 2.6 only uses this in streaming mode.
+
+    Attributes:
+        num_to_chinese (dict):
+            Mapping from digit (str) to its Chinese textual form (str).
+        num_to_english (dict):
+            Mapping from digit (str) to its English textual form (str).
+
+    Example:
+        >>> converter = NumberToTextConverter()
+        >>> converter.replace_numbers_with_text("我有2个苹果", language="chinese")
+        '我有二个苹果'
+        >>> converter.replace_numbers_with_text("I have 23 books", language="english")
+        'I have two three books'
+    """
+
+    def __init__(self):
+        self.num_to_chinese = {
+            "0": "零",
+            "1": "一",
+            "2": "二",
+            "3": "三",
+            "4": "四",
+            "5": "五",
+            "6": "六",
+            "7": "七",
+            "8": "八",
+            "9": "九",
+        }
+        self.num_to_english = {
+            "0": "zero",
+            "1": "one",
+            "2": "two",
+            "3": "three",
+            "4": "four",
+            "5": "five",
+            "6": "six",
+            "7": "seven",
+            "8": "eight",
+            "9": "nine",
+        }
+
+    def number_to_chinese_digit_by_digit(self, num_str):
+        result = ""
+        for char in num_str:
+            if char in self.num_to_chinese:
+                result += self.num_to_chinese[char]
+        return result
+
+    def number_to_english_digit_by_digit(self, num_str):
+        result = []
+        for char in num_str:
+            if char in self.num_to_english:
+                result.append(self.num_to_english[char])
+        return " ".join(result)
+
+    def detect_language(self, text):
+        chinese_count = len(re.findall(r"[\u4e00-\u9fff]", text))
+        english_count = len(re.findall(r"[a-zA-Z]", text))
+        return "chinese" if chinese_count >= english_count else "english"
+
+    def replace_numbers_with_text(self, text, language=None):
+        if language is None:
+            language = self.detect_language(text)
+        numbers = re.findall(r"\d+", text)
+
+        for num in numbers:
+            if language == "chinese":
+                replacement = self.number_to_chinese_digit_by_digit(num)
+            else:
+                replacement = self.number_to_english_digit_by_digit(num)
+            text = text.replace(num, replacement, 1)
+
+        return text
+
+
+class VoiceChecker:
+    r"""
+    A simple utility class to detect silence or low variation in consecutive audio chunks by comparing
+    the mel-spectrogram distances. It keeps track of consecutive zero-distance and low-distance chunks
+    to decide if the audio is considered "bad" (e.g., overly silent or not changing enough).
+
+    Attributes:
+        previous_mel (`np.ndarray` or `None`):
+            Holds the previously observed mel-spectrogram in decibel scale. Used to compute
+            the next distance; reset via :meth:`reset`.
+        consecutive_zeros (`int`):
+            The number of consecutive chunks that were detected as silent (distance = 0).
+        consecutive_low_distance (`int`):
+            The number of consecutive chunks whose distance was below the threshold.
+
+    Example:
+        >>> checker = VoiceChecker()
+        >>> # Suppose we have audio_wav (list or np.ndarray) and mel_spec (np.ndarray)
+        >>> # We split them into chunks and call checker.is_bad(...)
+        >>> is_audio_bad = checker.is_bad(audio_wav, mel_spec, chunk_size=2560, thresh=100.0)
+        >>> if is_audio_bad:
+        ...     print("Audio deemed bad!")
print("Audio deemed bad!") + >>> # Reset states if needed + >>> checker.reset() + """ + + def __init__(self): + self.previous_mel = None + self.consecutive_zeros = 0 + self.consecutive_low_distance = 0 + + def compute_distance(self, audio_chunk, mel_spec): + if is_silent(audio_chunk): + return 0.0 # ๆฃ€ๆŸฅๆ˜ฏๅฆไธบ็ฉบ็™ฝ็‰‡ๆฎต + + mel_db = librosa.power_to_db(mel_spec) + if self.previous_mel is None: + self.previous_mel = mel_db + return -1.0 + + distance = np.linalg.norm(np.mean(mel_db, axis=1) - np.mean(self.previous_mel, axis=1)) + self.previous_mel = mel_db + return distance + + def is_bad(self, audio_wav, mel_spec, chunk_size=2560, thresh=100.0): + num_chunks = len(audio_wav) // chunk_size + mel_chunk_size = mel_spec.shape[-1] // num_chunks + for i in range(num_chunks): + audio_chunk = audio_wav[i * chunk_size: (i + 1) * chunk_size] + mel_spec_chunk = mel_spec[:, i * mel_chunk_size: (i + 1) * mel_chunk_size] + + distance = self.compute_distance(audio_chunk, mel_spec_chunk) + logger.warning( + f"mel dist: {distance:.1f}, zero: {self.consecutive_zeros}, low: {self.consecutive_low_distance}" + ) + if distance == 0: + self.consecutive_low_distance = 0 # reset + self.consecutive_zeros += 1 + if self.consecutive_zeros >= 12: + logger.warning("VoiceChecker detected 1.2 s silent. Marking as failed.") + return True + elif distance < thresh: + self.consecutive_zeros = 0 + self.consecutive_low_distance += 1 + if self.consecutive_low_distance >= 5: + logger.warning("VoiceChecker detected 5 consecutive low distance chunks. Marking as failed.") + return True + else: + self.consecutive_low_distance = 0 + self.consecutive_zeros = 0 + + return False + + def reset(self): + self.previous_mel = None + self.consecutive_zeros = 0 + self.consecutive_low_distance = 0