Fix tokenization issue of CJK languages for evaluation by owaski · Pull Request #20 · hlt-mt/simulstream

owaski · 2026-02-24T22:14:06Z

For CJK languages, we need to tokenize them with CJSegmenter before sending them to mweralign.align_texts.

This PR makes the following modifications:

Apply CJSegmenter before calling mweralign.align_texts. This is done for both latency scorer and quality scorer.
Add latency_unit argument to the quality scorer and use this argument to trigger CJSegmenter in the quality scorer.

mgaido91

Thank you @owaski for your contribution. I know we are not doing great on UTs at the moment, but it would be great if you could add one. If I manage to, I will try to suggest one myself, but you know better than I do what the expected output for character-level segmentation is.

simulstream/metrics/scorers/latency/mwersegmenter.py

mgaido91 · 2026-02-25T10:33:10Z

simulstream/metrics/scorers/quality/mwersegmenter.py

+        if self.args.latency_unit == "char":
+            segmenter = CJSegmenter()
+        else:
+            segmenter = None


nit: shall we move it into the init and keep it as self.segmenter? This way, if score is called multiple times, only one instance is created.

That's a good idea. I'll do it.
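A minimal sketch of what this suggestion could look like, assuming a scorer that receives its arguments as a Namespace. Both `CharSegmenter` and `QualityScorerSketch` are hypothetical names used only for illustration; the real classes in simulstream may be structured differently.

```python
from argparse import Namespace


class CharSegmenter:
    """Hypothetical placeholder standing in for the real segmenter class."""

    def __call__(self, text: str) -> str:
        # Drop existing spaces, then separate every character with a space.
        return " ".join(text.replace(" ", ""))


class QualityScorerSketch:
    """Hypothetical scorer showing the segmenter created once in __init__."""

    def __init__(self, args: Namespace):
        self.args = args
        # Built once here and reused by every later call.
        self.segmenter = CharSegmenter() if args.latency_unit == "char" else None

    def _tokenize(self, text: str) -> str:
        # Return a new string; fall back to the input when no segmenter is set.
        return self.segmenter(text) if self.segmenter is not None else text
```

Keeping the instance on `self` also makes it easy to check in a test whether segmentation is enabled for a given configuration.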

simulstream/metrics/scorers/quality/mwersegmenter.py

simulstream/metrics/score_quality.py

simulstream/metrics/scorers/latency/mwersegmenter.py

mgaido91 · 2026-02-25T14:16:18Z

The CI has been fixed in #22. Please pull from the main branch the next time you push so that the CI gets fixed here as well, thanks.

Co-authored-by: Marco Gaido <marcogaido91@gmail.com>

mgaido91 · 2026-02-25T15:14:07Z

Regarding the UTs, I think it would be great to add a file like this:

# Copyright 2026 FBK

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

#     http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License

import unittest
from argparse import Namespace

from simulstream.metrics.readers import OutputWithDelays, ReferenceSentenceDefinition
from simulstream.metrics.scorers.latency import LatencyScoringSample
from simulstream.metrics.scorers.latency.stream_laal import StreamLaal


class StreamLaalTestCase(unittest.TestCase):
    def test_basic(self):
        reference = [
            ReferenceSentenceDefinition(
                "A New York, sono a capo di un'associazione no profit, chiamata Robin Hood.",
                12.61,
                4.07,
            ),
            ReferenceSentenceDefinition(
                "Quando non combatto la povertà, combatto gli incendi come assistente capitano di "
                "una brigata di pompieri volontari.",
                16.9,
                5.14,
            )
        ]
        hypothesis = OutputWithDelays(
            "Tornando a New York, sono il capo dello sviluppo per un non-profit chiamato Robin "
            "Hood. Quando non sto combattendo la povertà, sto combattendo i fuochi.",
            [14.0, 14.0, 14.0, 14.0, 16.0, 16.0, 16.0, 16.0, 16.0, 16.0, 16.0, 16.0, 16.0, 18.0,
             18.0, 18.0, 18.0, 18.0, 18.0, 18.0, 18.0, 20.0, 20.0, 20.0, 20.0],
            [18.22, 18.22, 18.22, 18.22, 19.93, 19.93, 19.93, 19.93, 19.93, 19.93, 19.93, 19.93,
             19.93, 23.01, 23.01, 23.01, 23.01, 23.01, 23.01, 23.01, 23.01, 27.30, 27.30, 27.30,
             27.30]
        )
        scorer = StreamLaal(Namespace(latency_unit="word"))
        score = scorer.score([LatencyScoringSample("a", hypothesis, reference)])
        self.assertAlmostEqual(score.ideal_latency, 0.868587, 4)
        self.assertAlmostEqual(score.computational_aware_latency, 5.86, 4)
    
    def test_with_characters(self):
        ... # TODO


if __name__ == '__main__':
    unittest.main()

In place of the TODO, we would put a simple example, e.g. one taken from past experiments for which we know the expected output latency. Ideally, we should also add a test case to uts/metrics/log_reader.py to enforce that it works as expected there as well. I can help with these things if you give me a meaningful example, but it is probably faster if you add it yourself, since I have no CJK example at hand. Thanks.
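One thing worth checking when writing such an example: in the word-level test above, the hypothesis has 25 words and each delay list has 25 entries. The sketch below assumes that the same one-delay-per-unit convention holds for latency_unit="char", with one entry per non-space character. That is an assumption, not something confirmed in this thread, and `count_units` is a hypothetical helper.

```python
def count_units(text: str, latency_unit: str) -> int:
    """Count latency units: whitespace-separated words or non-space characters."""
    if latency_unit == "word":
        return len(text.split())
    if latency_unit == "char":
        return sum(1 for ch in text if not ch.isspace())
    raise ValueError(f"Unsupported latency unit: {latency_unit}")


hypothesis = (
    "Tornando a New York, sono il capo dello sviluppo per un non-profit chiamato Robin "
    "Hood. Quando non sto combattendo la povertà, sto combattendo i fuochi."
)
# Matches the 25 delays given in the word-level test case.
print(count_units(hypothesis, "word"))
```

A helper like this can be used in a test to assert that the delay lists of a character-level example have the expected length before computing any latency.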

owaski · 2026-02-25T15:28:04Z

Sure, I can add a unit test for Chinese.

mgaido91

That would be awesome, thanks! Once we have those UTs, this LGTM, thanks!

simulstream/metrics/scorers/latency/mwersegmenter.py

simulstream/metrics/scorers/quality/mwersegmenter.py

owaski · 2026-02-25T16:12:38Z

unit test added.

owaski · 2026-02-25T17:48:24Z

Wait, I found an issue with the quality scorer: the self._tokenize function modifies its input in place, which changes sample.reference. Let me work on a quick fix.
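The kind of problem described here can be reproduced with a small sketch. `Sample`, `tokenize_in_place`, and `tokenize_copy` are hypothetical names for illustration; the actual `_tokenize` code in simulstream is not shown in this thread.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """Hypothetical stand-in for a scoring sample holding reference lines."""
    reference: List[str]


def tokenize_in_place(lines: List[str]) -> List[str]:
    # Problematic pattern: overwrites the caller's list element by element.
    for i, line in enumerate(lines):
        lines[i] = " ".join(line)
    return lines


def tokenize_copy(lines: List[str]) -> List[str]:
    # Safer pattern: builds a new list and leaves the input untouched.
    return [" ".join(line) for line in lines]


sample = Sample(reference=["你好"])
tokenize_copy(sample.reference)
print(sample.reference)  # still the original text
tokenize_in_place(sample.reference)
print(sample.reference)  # now contains the segmented text
```

With the in-place variant, any later scorer that reads the same sample would see already-segmented references, which is why returning a new list is the safer choice.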

mgaido91

Can we add a UT that covers the problem you mentioned with the quality segmenter, so that we are sure future changes will not (re-)introduce it? Something similar to the latency scorer UT, but with the sacrebleu scorer. It would be ideal to check that the UT fails before your last fix and passes after it. Then we can merge this, thanks!
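A regression test of this kind could follow the pattern below: snapshot the input, call the function, and assert that the input still equals the snapshot. This is only a sketch; `tokenize` is a hypothetical function under test, not the scorer method from this PR, and a real test would go through the sacrebleu scorer as suggested.

```python
import copy
import unittest
from typing import List


def tokenize(lines: List[str]) -> List[str]:
    """Hypothetical function under test: character-level split, no mutation."""
    return [" ".join(line) for line in lines]


class TokenizeNoInplaceSketch(unittest.TestCase):
    def test_input_is_not_modified(self):
        reference = ["你好", "世界"]
        snapshot = copy.deepcopy(reference)
        tokenized = tokenize(reference)
        # The caller's data must be unchanged after tokenization.
        self.assertEqual(reference, snapshot)
        # The returned value carries the segmented text.
        self.assertEqual(tokenized, ["你 好", "世 界"])
```

Saved to a file, this can be run with `python -m unittest`. Swapping `tokenize` for a version that writes back into `lines` makes the first assertion fail, which is the fail-before, pass-after check asked for above.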

simulstream/metrics/scorers/latency/mwersegmenter.py

owaski · 2026-02-26T16:40:04Z

Just added unit tests for the tokenize functions of both the quality and latency scorers.

mgaido91

Just a few minor style comments; LGTM otherwise, thanks. Once these three small things are fixed I will merge it.

simulstream/metrics/scorers/latency/mwersegmenter.py

simulstream/metrics/scorers/quality/mwersegmenter.py

uts/metrics/test_tokenize_no_inplace.py

owaski · 2026-02-26T18:37:33Z

style fixed!

simulstream/metrics/scorers/quality/mwersegmenter.py

simulstream/metrics/scorers/latency/mwersegmenter.py

owaski added 3 commits February 24, 2026 17:04

fix tokenization issue of CJK for mwersegmenter

7daaedc

bug fix

973e878

fix linting error

def15cc

mgaido91 reviewed Feb 25, 2026

owaski and others added 4 commits February 25, 2026 10:05

default latency unit to word

e90ff77

Co-authored-by: Marco Gaido <marcogaido91@gmail.com>

Add empty line after importing segmenter

afedd38

Co-authored-by: Marco Gaido <marcogaido91@gmail.com>

Merge remote-tracking branch 'upstream/main'

23f26f1

add citation to code

60ae290

owaski added 2 commits February 25, 2026 10:18

move tokenize as a class function

f8a8dc3

fix linting

15a76f0

mgaido91 reviewed Feb 25, 2026

simulstream/metrics/scorers/latency/mwersegmenter.py

simulstream/metrics/scorers/quality/mwersegmenter.py

owaski added 2 commits February 25, 2026 10:44

remove added empty line

f24d6f9

add unit test

1b320a7

fix inplace modification of _tokenize

ea7b688

mgaido91 reviewed Feb 25, 2026

simulstream/metrics/scorers/latency/mwersegmenter.py

owaski added 2 commits February 26, 2026 11:22

fix linting

58f6755

add unit test to prevent in-place modification in tokenize function

673c9ab

mgaido91 reviewed Feb 26, 2026

simulstream/metrics/scorers/latency/mwersegmenter.py

simulstream/metrics/scorers/quality/mwersegmenter.py

uts/metrics/test_tokenize_no_inplace.py

style fix

4e1ded7

mgaido91 approved these changes Feb 27, 2026

simulstream/metrics/scorers/quality/mwersegmenter.py

simulstream/metrics/scorers/latency/mwersegmenter.py

Apply suggestions from code review

9c4f95b

mgaido91 merged commit 93c51b4 into hlt-mt:main Feb 27, 2026
1 check passed
