There is a change in timestamp processing between transformers versions 4.46.3 and 4.47.0. With version 4.47.0, an empty segment is returned for each processed audio chunk when return_timestamps is enabled.
To reproduce the issue, please run reproducer.py with transformers versions 4.46.3 and 4.47.0.
from transformers import pipeline
import datasets
import typing


def get_sample_from_dataset():
    ds = datasets.load_dataset(
        "distil-whisper/meanwhile",
        split="test",
        streaming=True,
        trust_remote_code=True,
    )
    ds = typing.cast(datasets.IterableDataset, ds)
    ds = ds.cast_column("audio", datasets.Audio(sampling_rate=16000))
    ds = ds.take(1)
    return next(iter(ds))["audio"]


sample = get_sample_from_dataset()
whisper = pipeline("automatic-speech-recognition", "openai/whisper-tiny")
transcription = whisper(
    sample.copy(),
    return_timestamps=True,
)

print(transcription["text"])
for chunk in transcription["chunks"]:
    print(chunk)

# transformers version 4.46.3
# {'timestamp': (0.0, 3.2), 'text': ' Folks, if you watch the show, you know, I spent a lot of time'}
# {'timestamp': (3.2, 4.64), 'text': ' right over there.'}
# {'timestamp': (4.64, 7.04), 'text': ' Patiently and astutely scrutinizing the boxwood and'}
# {'timestamp': (7.04, 9.28), 'text': ' mahogany chest set of the days, big stories,'}
# {'timestamp': (9.28, 11.84), 'text': ' developing the central headline pawns,'}
# {'timestamp': (11.84, 15.08), 'text': ' definitely maneuvering an OSO topical night to F6,'}
# {'timestamp': (15.08, 16.8), 'text': ' faming of classic Sicilian,'}
# {'timestamp': (16.8, 18.96), 'text': ' named or variation on the news,'}
# {'timestamp': (18.96, 21.0), 'text': ' all the while seeing eight moves deep and'}
# {'timestamp': (21.0, 24.0), 'text': ' patiently marshalling the latest press releases into a'}
# {'timestamp': (24.0, 27.52), 'text': ' Fisher shows in lip nitsky attack that culminates in the'}
# {'timestamp': (0.0, 3.24), 'text': ' The elegant lethal slow played all-pass on checkmate'}
# {'timestamp': (3.24, 5.18), 'text': ' that is my nightly monologue, but sometimes sometimes'}
# {'timestamp': (5.18, 6.0), 'text': ' folks I'}
# {'timestamp': (6.0, 9.0), 'text': ' sometimes I'}
# {'timestamp': (9.0, 13.0), 'text': ' start a little wake upside down in the monkey bars'}
# {'timestamp': (13.0, 15.48), 'text': ' of a condemned playground on a super fun site.'}
# {'timestamp': (15.48, 17.52), 'text': ' Get all hepped up on goofballs, rummage that were'}
# {'timestamp': (17.52, 20.32), 'text': ' discarded tag bag of defective toys.'}
# {'timestamp': (20.32, 23.4), 'text': ' Yank out a fistball of disembodied doll limbs,'}
# {'timestamp': (23.4, 24.96), 'text': " toss them on a stained kid's place,"}
# {'timestamp': (24.96, 27.98), 'text': ' mad from a defunct denies, set up a table inside a rusty'}
# {'timestamp': (27.98, 29.72), 'text': ' cargo container down by the warf,'}
# {'timestamp': (0.0, 2.28), 'text': ' and challenged toothless drifters to the godless,'}
# {'timestamp': (2.28, 5.76), 'text': ' bug house blitz of tournament that is my segment.'}
# {'timestamp': (5.76, 9.56), 'text': ' Me and Wild.'}
# transformers version 4.47.0
# {'timestamp': (0.0, 3.2), 'text': ' Folks, if you watch the show, you know, I spent a lot of time'}
# {'timestamp': (3.2, 4.64), 'text': ' right over there.'}
# {'timestamp': (4.64, 7.04), 'text': ' Patiently and astutely scrutinizing the boxwood and'}
# {'timestamp': (7.04, 9.28), 'text': ' mahogany chest set of the days, big stories,'}
# {'timestamp': (9.28, 11.84), 'text': ' developing the central headline pawns,'}
# {'timestamp': (11.84, 15.08), 'text': ' definitely maneuvering an OSO topical night to F6,'}
# {'timestamp': (15.08, 16.8), 'text': ' faming of classic Sicilian,'}
# {'timestamp': (16.8, 18.96), 'text': ' named or variation on the news,'}
# {'timestamp': (18.96, 21.0), 'text': ' all the while seeing eight moves deep and'}
# {'timestamp': (21.0, 24.0), 'text': ' patiently marshalling the latest press releases into a'}
# {'timestamp': (24.0, 27.52), 'text': ' Fisher shows in lip nitsky attack that culminates in the'}
# {'timestamp': (27.52, 0.0), 'text': ''}
# {'timestamp': (3.24, 5.18), 'text': ' The elegant lethal slow played all-pass on checkmate that is my nightly monologue, but sometimes sometimes'}
# {'timestamp': (5.18, 6.0), 'text': ' folks I'}
# {'timestamp': (6.0, 9.0), 'text': ' sometimes I'}
# {'timestamp': (9.0, 13.0), 'text': ' start a little wake upside down in the monkey bars'}
# {'timestamp': (13.0, 15.48), 'text': ' of a condemned playground on a super fun site.'}
# {'timestamp': (15.48, 17.52), 'text': ' Get all hepped up on goofballs, rummage that were'}
# {'timestamp': (17.52, 20.32), 'text': ' discarded tag bag of defective toys.'}
# {'timestamp': (20.32, 23.4), 'text': ' Yank out a fistball of disembodied doll limbs,'}
# {'timestamp': (23.4, 24.96), 'text': " toss them on a stained kid's place,"}
# {'timestamp': (24.96, 27.98), 'text': ' mad from a defunct denies, set up a table inside a rusty'}
# {'timestamp': (27.98, 29.72), 'text': ' cargo container down by the warf,'}
# {'timestamp': (29.72, 0.0), 'text': ''}
# {'timestamp': (2.28, 5.76), 'text': ' and challenged toothless drifters to the godless, bug house blitz of tournament that is my segment.'}
# {'timestamp': (5.76, 9.56), 'text': ' Me and Wild.'}
It looks like the empty segment is unnecessary and should not be returned.
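Until the behavior is fixed, one possible workaround is to filter out the empty segments after the pipeline call. The following is only a minimal sketch: the helper name drop_empty_chunks is hypothetical and not part of transformers, and it assumes each chunk is a dict with a "text" key, as in the output above. It removes the empty entries only; it does not change the timestamps or text of the neighboring segments.

```python
from typing import Any


def drop_empty_chunks(chunks: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return only the chunks whose text is non-empty after stripping whitespace.

    Hypothetical helper for illustration; not part of the transformers API.
    """
    return [chunk for chunk in chunks if chunk.get("text", "").strip()]


# Three chunks copied from the 4.47.0 output above, including one empty segment.
chunks_4_47_0 = [
    {"timestamp": (24.0, 27.52), "text": " Fisher shows in lip nitsky attack that culminates in the"},
    {"timestamp": (27.52, 0.0), "text": ""},
    {"timestamp": (5.18, 6.0), "text": " folks I"},
]

cleaned = drop_empty_chunks(chunks_4_47_0)
for chunk in cleaned:
    print(chunk)
```

In the reproducer, the same helper could be applied as drop_empty_chunks(transcription["chunks"]) before printing.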
System Info
transformers version: 4.46.3
Who can help?
@Rocketknight1 @eustlb