Hi, I'm hitting an error when using the local Whisper model to process audio.
It happens with a uv install:
C:\Users\myusername\.local\bin>uv tool install batchalign
Resolved 86 packages in 1.64s
Built openai-whisper==20240930
Built batchalign==0.7.19.post9
Built docopt==0.6.2
Prepared 85 packages in 7.31s
░░░░░░░░░░░░░░░░░░░░ [0/86] Installing wheels... warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance.
Installed 86 packages in 4.55s
+ accelerate==1.8.1
+ annotated-types==0.7.0
+ anyio==4.9.0
+ batchalign==0.7.19.post9
+ blobfile==3.0.0
+ certifi==2025.6.15
+ cffi==1.17.1
+ charset-normalizer==3.4.2
+ click==8.2.1
+ colorama==0.4.6
+ contourpy==1.3.2
+ cycler==0.12.1
+ docopt==0.6.2
+ emoji==2.14.1
+ filelock==3.18.0
+ fonttools==4.58.4
+ fsspec==2025.5.1
+ googletrans==4.0.2
+ h11==0.16.0
+ h2==4.2.0
+ hpack==4.1.0
+ httpcore==1.0.9
+ httpx==0.28.1
+ huggingface-hub==0.33.0
+ hyperframe==6.1.0
+ idna==3.10
+ jinja2==3.1.6
+ joblib==1.5.1
+ kiwisolver==1.4.8
+ llvmlite==0.44.0
+ lxml==5.4.0
+ markdown-it-py==3.0.0
+ markupsafe==3.0.2
+ matplotlib==3.10.3
+ mdurl==0.1.2
+ more-itertools==10.7.0
+ mpmath==1.3.0
+ narwhals==1.44.0
+ networkx==3.5
+ nltk==3.9.1
+ num2words==0.5.14
+ numba==0.61.2
+ numpy==2.2.6
+ openai-whisper==20240930
+ packaging==25.0
+ peft==0.15.2
+ pillow==11.2.1
+ plotly==6.1.2
+ praatio==6.0.1
+ protobuf==6.31.1
+ psutil==7.0.0
+ pycountry==24.6.1
+ pycparser==2.22
+ pycryptodomex==3.23.0
+ pydantic==2.11.7
+ pydantic-core==2.33.2
+ pydub==0.25.1
+ pyfiglet==1.0.2
+ pygments==2.19.2
+ pyparsing==3.2.3
+ python-dateutil==2.9.0.post0
+ pyyaml==6.0.2
+ regex==2024.11.6
+ requests==2.32.4
+ rev-ai==2.21.0
+ rich==13.9.4
+ rich-click==1.8.9
+ safetensors==0.5.3
+ scipy==1.16.0
+ sentencepiece==0.2.0
+ setuptools==80.9.0
+ six==1.17.0
+ sniffio==1.3.1
+ soundfile==0.12.1
+ stanza==1.10.1
+ sympy==1.14.0
+ tiktoken==0.9.0
+ tokenizers==0.21.2
+ torch==2.7.1
+ torchaudio==2.7.1
+ tqdm==4.67.1
+ transformers==4.52.4
+ typing-extensions==4.14.0
+ typing-inspection==0.4.1
+ urllib3==2.5.0
+ websocket-client==0.59.0
Installed 1 executable: batchalign.exe
C:\Users\myusername\.local\bin>batchalign transcribe --whisper --num_speakers 1 D:\ai\batchalign2\inputfiles D:\ai\batchalign2\outputfiles
C:\Users\myusername\AppData\Roaming\uv\tools\batchalign\Lib\site-packages\praatio\utilities\utils.py:9: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
from pkg_resources import resource_filename
Mode: transcribe; got 1 transcript to process from D:\ai\batchalign2\inputfiles:
Device set to use cpu
You have passed task=transcribe, but also have set `forced_decoder_ids` to [[1, None], [2, 50359]] which creates a conflict. `forced_decoder_ids` will be ignored in favor of task=transcribe.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask`
to obtain reliable results.
WhisperModel is using WhisperSdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to the manual attention implementation, but
specifying the manual implementation will be required from Transformers version v5.0.0 onwards. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.
Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation.
male.wav ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0:00:15 FAIL
ERROR on file male.wav: '<=' not supported between instances of 'NoneType' and 'float'
Installing the package from Git with pip produces the same error; the stack trace points into an included library:
TypeError in pipeline call (likely due to None in token timestamps):
Traceback (most recent call last):
File "D:\ai\batchalign2\venv\Lib\site-packages\batchalign\models\whisper\infer_asr.py", line 198, in __call__
words = self.pipe(data.cpu().numpy(),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\ai\batchalign2\venv\Lib\site-packages\transformers\pipelines\automatic_speech_recognition.py", line 283, in __call__
return super().__call__(inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\ai\batchalign2\venv\Lib\site-packages\transformers\pipelines\base.py", line 1371, in __call__
return next(
^^^^^
File "D:\ai\batchalign2\venv\Lib\site-packages\transformers\pipelines\pt_utils.py", line 125, in __next__
processed = self.infer(item, **self.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\ai\batchalign2\venv\Lib\site-packages\transformers\pipelines\automatic_speech_recognition.py", line 607, in postprocess
text, optional = self.tokenizer._decode_asr(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\ai\batchalign2\venv\Lib\site-packages\transformers\models\whisper\tokenization_whisper.py", line 857, in _decode_asr
^^^^^^^^^^^^
File "D:\ai\batchalign2\venv\Lib\site-packages\transformers\models\whisper\tokenization_whisper.py", line 1108, in _decode_asr
resolved_tokens, resolved_token_timestamps = _find_longest_common_sequence(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\ai\batchalign2\venv\Lib\site-packages\transformers\models\whisper\tokenization_whisper.py", line 1215, in _find_longest_common_sequence
matches = sum(
^^^^
File "D:\ai\batchalign2\venv\Lib\site-packages\transformers\models\whisper\tokenization_whisper.py", line 1220, in <genexpr>
and left_token_timestamp_sequence[left_start + idx]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: '<=' not supported between instances of 'NoneType' and 'float'
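For reference, the failure boils down to one chunk's token timestamps containing `None`, which Python 3 refuses to order against a float. A minimal reproduction (the variable names below are illustrative, mirroring the names in the traceback, not copied from batchalign):

```python
# Minimal sketch of the comparison that fails inside transformers'
# _find_longest_common_sequence: a None token timestamp is compared
# against a float with <=, which Python 3 rejects.
left_token_timestamp_sequence = [0.0, 0.48, None, 1.12]  # None = missing timestamp
right_timestamp = 0.5

try:
    left_token_timestamp_sequence[2] <= right_timestamp
except TypeError as e:
    print(e)  # '<=' not supported between instances of 'NoneType' and 'float'
```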
I tested with both CUDA and CPU processing: same error in the same place each time. I also tried several different versions of transformers and got nowhere.
I wrapped the call in a try/except for error handling, but that just produced a blank output .cha file with no transcript:
@UTF8
@Begin
@Languages: eng
@Participants:
@Media: male, audio
@Comment: Batchalign 0.7.19-post.9, ASR Engine whisper. Unchecked output of ASR model; do not use.
@End
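As a stopgap, something like forward-filling the `None` entries in a token-timestamp list before they reach the comparison might avoid the crash (`fill_none_timestamps` below is my own hypothetical helper, not a batchalign or transformers API, and it papers over the missing timestamps rather than fixing the underlying bug):

```python
def fill_none_timestamps(timestamps, default=0.0):
    """Forward-fill None entries in a token-timestamp list so that
    ordering comparisons (<=) never see None. Sketch of a possible
    workaround only; not a proper fix for the root cause."""
    filled = []
    last = default
    for t in timestamps:
        if t is None:
            t = last  # reuse the previous valid timestamp
        last = t
        filled.append(t)
    return filled

print(fill_none_timestamps([0.0, 0.48, None, 1.12]))  # [0.0, 0.48, 0.48, 1.12]
```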
I attached a sample .wav file but had to rename it because of GitHub's attachment restrictions.
https://github.com/user-attachments/assets/49e2332a-4f6b-48b5-861f-af55985b1a8f