Hi, I'm hitting an error when using the local Whisper model to process audio.
It happens with a uv install:
C:\Users\myusername\.local\bin>uv tool install batchalign
Resolved 86 packages in 1.64s
Built openai-whisper==20240930
Built batchalign==0.7.19.post9
Built docopt==0.6.2
Prepared 85 packages in 7.31s
░░░░░░░░░░░░░░░░░░░░ [0/86] Installing wheels... warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance.
Installed 86 packages in 4.55s
+ accelerate==1.8.1
+ annotated-types==0.7.0
+ anyio==4.9.0
+ batchalign==0.7.19.post9
+ blobfile==3.0.0
+ certifi==2025.6.15
+ cffi==1.17.1
+ charset-normalizer==3.4.2
+ click==8.2.1
+ colorama==0.4.6
+ contourpy==1.3.2
+ cycler==0.12.1
+ docopt==0.6.2
+ emoji==2.14.1
+ filelock==3.18.0
+ fonttools==4.58.4
+ fsspec==2025.5.1
+ googletrans==4.0.2
+ h11==0.16.0
+ h2==4.2.0
+ hpack==4.1.0
+ httpcore==1.0.9
+ httpx==0.28.1
+ huggingface-hub==0.33.0
+ hyperframe==6.1.0
+ idna==3.10
+ jinja2==3.1.6
+ joblib==1.5.1
+ kiwisolver==1.4.8
+ llvmlite==0.44.0
+ lxml==5.4.0
+ markdown-it-py==3.0.0
+ markupsafe==3.0.2
+ matplotlib==3.10.3
+ mdurl==0.1.2
+ more-itertools==10.7.0
+ mpmath==1.3.0
+ narwhals==1.44.0
+ networkx==3.5
+ nltk==3.9.1
+ num2words==0.5.14
+ numba==0.61.2
+ numpy==2.2.6
+ openai-whisper==20240930
+ packaging==25.0
+ peft==0.15.2
+ pillow==11.2.1
+ plotly==6.1.2
+ praatio==6.0.1
+ protobuf==6.31.1
+ psutil==7.0.0
+ pycountry==24.6.1
+ pycparser==2.22
+ pycryptodomex==3.23.0
+ pydantic==2.11.7
+ pydantic-core==2.33.2
+ pydub==0.25.1
+ pyfiglet==1.0.2
+ pygments==2.19.2
+ pyparsing==3.2.3
+ python-dateutil==2.9.0.post0
+ pyyaml==6.0.2
+ regex==2024.11.6
+ requests==2.32.4
+ rev-ai==2.21.0
+ rich==13.9.4
+ rich-click==1.8.9
+ safetensors==0.5.3
+ scipy==1.16.0
+ sentencepiece==0.2.0
+ setuptools==80.9.0
+ six==1.17.0
+ sniffio==1.3.1
+ soundfile==0.12.1
+ stanza==1.10.1
+ sympy==1.14.0
+ tiktoken==0.9.0
+ tokenizers==0.21.2
+ torch==2.7.1
+ torchaudio==2.7.1
+ tqdm==4.67.1
+ transformers==4.52.4
+ typing-extensions==4.14.0
+ typing-inspection==0.4.1
+ urllib3==2.5.0
+ websocket-client==0.59.0
Installed 1 executable: batchalign.exe
C:\Users\myusername\.local\bin>batchalign transcribe --whisper --num_speakers 1 D:\ai\batchalign2\inputfiles D:\ai\batchalign2\outputfiles
C:\Users\myusername\AppData\Roaming\uv\tools\batchalign\Lib\site-packages\praatio\utilities\utils.py:9: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
from pkg_resources import resource_filename
Mode: transcribe; got 1 transcript to process from D:\ai\batchalign2\inputfiles:
Device set to use cpu
You have passed task=transcribe, but also have set `forced_decoder_ids` to [[1, None], [2, 50359]] which creates a conflict. `forced_decoder_ids` will be ignored in favor of task=transcribe.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask`
to obtain reliable results.
WhisperModel is using WhisperSdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to the manual attention implementation, but
specifying the manual implementation will be required from Transformers version v5.0.0 onwards. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.
Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation.
male.wav ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0:00:15 FAIL
ERROR on file male.wav: '<=' not supported between instances of 'NoneType' and 'float'
Installing the package from Git with pip produces the same error; the stack trace points into an included library:
TypeError in pipeline call (likely due to None in token timestamps):
Traceback (most recent call last):
File "D:\ai\batchalign2\venv\Lib\site-packages\batchalign\models\whisper\infer_asr.py", line 198, in __call__
words = self.pipe(data.cpu().numpy(),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\ai\batchalign2\venv\Lib\site-packages\transformers\pipelines\automatic_speech_recognition.py", line 283, in __call__
return super().__call__(inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\ai\batchalign2\venv\Lib\site-packages\transformers\pipelines\base.py", line 1371, in __call__
return next(
^^^^^
File "D:\ai\batchalign2\venv\Lib\site-packages\transformers\pipelines\pt_utils.py", line 125, in __next__
processed = self.infer(item, **self.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\ai\batchalign2\venv\Lib\site-packages\transformers\pipelines\automatic_speech_recognition.py", line 607, in postprocess
text, optional = self.tokenizer._decode_asr(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\ai\batchalign2\venv\Lib\site-packages\transformers\models\whisper\tokenization_whisper.py", line 857, in _decode_asr
^^^^^^^^^^^^
File "D:\ai\batchalign2\venv\Lib\site-packages\transformers\models\whisper\tokenization_whisper.py", line 1108, in _decode_asr
resolved_tokens, resolved_token_timestamps = _find_longest_common_sequence(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\ai\batchalign2\venv\Lib\site-packages\transformers\models\whisper\tokenization_whisper.py", line 1215, in _find_longest_common_sequence
matches = sum(
^^^^
File "D:\ai\batchalign2\venv\Lib\site-packages\transformers\models\whisper\tokenization_whisper.py", line 1220, in <genexpr>
and left_token_timestamp_sequence[left_start + idx]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: '<=' not supported between instances of 'NoneType' and 'float'
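For reference, the failure boils down to one chunk's token timestamps containing `None`, which Python 3 refuses to order against a float. A minimal reproduction (the variable names below are illustrative, mirroring the names in the traceback, not copied from batchalign):

```python
# Minimal sketch of the comparison that fails inside transformers'
# _find_longest_common_sequence: a None token timestamp is compared
# against a float with <=, which Python 3 rejects.
left_token_timestamp_sequence = [0.0, 0.48, None, 1.12]  # None = missing timestamp
right_timestamp = 0.5

try:
    left_token_timestamp_sequence[2] <= right_timestamp
except TypeError as e:
    print(e)  # '<=' not supported between instances of 'NoneType' and 'float'
```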
I tested with both CUDA and CPU processing: same error in the same place each time. I also tried several different versions of transformers and got nowhere.
I wrapped the call in a try/except for error handling, but that just produced a blank output .cha file with no transcript:
@UTF8
@Begin
@Languages: eng
@Participants:
@Media: male, audio
@Comment: Batchalign 0.7.19-post.9, ASR Engine whisper. Unchecked output of ASR model; do not use.
@End
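As a stopgap, something like forward-filling the `None` entries in a token-timestamp list before they reach the comparison might avoid the crash (`fill_none_timestamps` below is my own hypothetical helper, not a batchalign or transformers API, and it papers over the missing timestamps rather than fixing the underlying bug):

```python
def fill_none_timestamps(timestamps, default=0.0):
    """Forward-fill None entries in a token-timestamp list so that
    ordering comparisons (<=) never see None. Sketch of a possible
    workaround only; not a proper fix for the root cause."""
    filled = []
    last = default
    for t in timestamps:
        if t is None:
            t = last  # reuse the previous valid timestamp
        last = t
        filled.append(t)
    return filled

print(fill_none_timestamps([0.0, 0.48, None, 1.12]))  # [0.0, 0.48, 0.48, 1.12]
```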
I attached a sample .wav file but had to rename it because of GitHub's attachment restrictions.
https://github.com/user-attachments/assets/49e2332a-4f6b-48b5-861f-af55985b1a8f